Latest commit: 1a9d3bc · Dec 5, 2024

Embedding Preference Training

Can we train a model to distinguish good-quality content from bad?

Set-Up

Create a new virtual environment and install all required packages:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

We also need to install the custom version of MTEB, which is defined as a submodule:

pip install mteb/
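If the mteb/ directory is empty after cloning, the submodule has probably not been fetched yet, and the install above will fail. A minimal sketch of fetching it first, assuming the repository was cloned without --recurse-submodules and that the command is run from the top directory (where .gitmodules is expected to live):

```shell
# Assumption: the current directory is the repository root.
if [ -f .gitmodules ]; then
    # Check out the pinned commit of every registered submodule,
    # including the custom MTEB version in mteb/.
    git submodule update --init --recursive
else
    echo "No .gitmodules found; run this from the repository root." >&2
fi
```

Once the submodule is populated, pip install mteb/ can be run as described.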

To Run

Evaluating Embedding Models

To evaluate embedding models, run the rerank.py script from the top directory:

python3 code/eval/rerank.py

Scraping Data

This code is still being generalized. For now, scripts that extract and save data live under code/dataset/, with the scrapers in code/dataset/scraping/scrape_*.py:

python3 code/dataset/scraping/scrape_gb_wiki.py
python3 code/dataset/extract_warcs.py

Train Binary Classifier

To experiment with training a binary classifier, run the train_classifier.py script from the top directory:

python3 code/models/train_classifier.py