Can we train a model that is able to detect good or bad content quality?
Create a new virtual environment and install all required packages:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
We also need to install the custom version of MTEB, which is defined as a submodule:
pip install mteb/
For evaluation run the rerank.py
script from the top directory:
python3 code/eval/rerank.py
This code is still being generalized, but in general, scripts to extract and save data will be found in dataset/scraping/scrape_*.py
:
python3 code/dataset/scraping/scrape_gb_wiki.py
python3 code/dataset/extract_warcs.py
To experiment with training binary classifiers, run the train_classifier.py
script from the top directory:
python3 code/models/train_classifier.py