Embedding Preference Training

Can we train a model to distinguish high-quality content from low-quality content?

Setup

Create a new virtual environment and install all required packages:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

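Next, install the custom version of MTEB, which is included as a Git submodule in mteb/ (if that directory is empty after cloning, initialize it first with git submodule update --init):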

pip install mteb/

To run the evaluation, execute the rerank.py script from the top-level directory:

python3 code/eval/rerank.py

The data-collection code is still being generalized. For now, the scripts that extract and save data live under code/dataset/, mostly as code/dataset/scraping/scrape_*.py:

python3 code/dataset/scraping/scrape_gb_wiki.py
python3 code/dataset/extract_warcs.py

To experiment with training binary classifiers, run the train_classifier.py script from the top-level directory:

python3 code/models/train_classifier.py
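
For readers unfamiliar with the idea, here is a minimal sketch of a binary quality classifier over embedding vectors. It is not the code in train_classifier.py: the toy 4-dimensional "embeddings", the function names (train_logreg, predict), and the hyperparameters are all assumptions made for illustration, and plain logistic regression stands in for whatever model the script actually trains.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(w, b, x):
    # Probability that x belongs to the positive ("good quality") class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi  # derivative of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # 1 = "good", 0 = "bad" (hypothetical label convention).
    return 1 if score(w, b, x) >= 0.5 else 0


# Toy stand-ins for document embeddings: two well-separated Gaussian clusters.
rng = random.Random(0)
good = [[rng.gauss(1.0, 0.3) for _ in range(4)] for _ in range(50)]
bad = [[rng.gauss(-1.0, 0.3) for _ in range(4)] for _ in range(50)]
X, y = good + bad, [1] * 50 + [0] * 50

w, b = train_logreg(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

In a real experiment the toy vectors would be replaced by embeddings of the scraped documents, and accuracy would be measured on a held-out split rather than on the training data.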