ml_netflix.csv: join on ml-latest-small and Netflix data
- code for generating the dataset in utils/gen_data.ipynb
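As a rough illustration of what such a join can look like, here is a minimal pandas sketch. The column names (`movieId`, `title`, `description`), the toy rows, and the choice to join on a normalized title are all assumptions made for this example; utils/gen_data.ipynb remains the authoritative version.

```python
import pandas as pd

# Tiny stand-ins for the two sources; the real files and column names may differ.
movielens = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
})
netflix = pd.DataFrame({
    "title": ["toy story ", "Heat", "Roma"],
    "description": ["Toys come to life.", "A heist crew.", "A year in Mexico City."],
})


def normalize_title(series: pd.Series) -> pd.Series:
    """Lowercase and strip whitespace so titles match across sources."""
    return series.str.strip().str.lower()


movielens["key"] = normalize_title(movielens["title"])
netflix["key"] = normalize_title(netflix["title"])

# Inner join keeps only movies present in both sources.
joined = movielens.merge(netflix[["key", "description"]], on="key", how="inner")
joined = joined.drop(columns="key")
print(joined)
```

An inner join is used here so that every remaining movie has both a rating-side ID and a description; a different join type may be appropriate depending on how the real data is keyed.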
- LDA
  - pure Python
    - can run the notebook straight through
  - Stan
    - warning: takes ~1.67 hours to finish sampling on all 1056 movies
    - run python3 stan-lda.py in terminal with the following available flags:
      - regen_words_df (bool): True if you'd like to regenerate the dataframe mapping each word (ID) to a document/movie (ID)
        - saved to cache/words_df.csv
      - regen_data_lemmatized (bool): True if you'd like to regenerate the lemmatized movie descriptions
        - saved to cached/data_lemmatized.txt
      - num_movies (int): the first num_movies movies from the data set that you'd like to train on
        - by default, it's the number of movies in the data set (1056)
      - just_eval (bool): True if you'd like to just calculate the evaluation metrics. Assumes you already have the trained posterior values in results/theta.npy.
  - Pyro
    - can run the notebook straight through
    - modify number of topics, number of epochs run, etc. in cell 4
  - Turing
    - can run the notebook using Julia runtime
    - results are output to CSV (cache/julia_out.csv) for evaluation in Python using eval_julia.ipynb
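The exact command-line syntax for the stan-lda.py flags is not spelled out above. The sketch below shows one way flags with these names and types could be parsed with `argparse`; the `--name value` form, the `False` defaults for the boolean flags, and the helper names `str_to_bool` and `build_parser` are assumptions for illustration and may not match the actual script.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' style strings; plain bool() would treat 'False' as True."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser(default_num_movies: int = 1056) -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flag names listed above."""
    parser = argparse.ArgumentParser(description="Hypothetical flag parser")
    parser.add_argument("--regen_words_df", type=str_to_bool, default=False)
    parser.add_argument("--regen_data_lemmatized", type=str_to_bool, default=False)
    parser.add_argument("--num_movies", type=int, default=default_num_movies)
    parser.add_argument("--just_eval", type=str_to_bool, default=False)
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--num_movies", "100", "--just_eval", "True"])
print(args)
```

Under these assumptions, a run on a 100-movie subset that only evaluates saved results would pass `--num_movies 100 --just_eval True`; check the script itself for the form it really expects.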
- PMF
  - pure Python
    - can run the notebook straight through
  - Stan
    - run python3 stan-pmf.py in terminal with the following available flags:
      - just_eval (bool): True if you'd like to just calculate the evaluation metrics. Assumes you already have the trained posterior values in results/Z.npy and results/W.npy.
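To make the just_eval step more concrete, here is a minimal sketch of evaluating saved PMF factors. It assumes Z holds one latent row per user and W one latent row per movie, so that predicted ratings are `Z @ W.T`, and it uses RMSE over observed ratings as the metric. Both the shapes and the metric are assumptions, since neither is stated above, and `pmf_rmse` is a hypothetical helper name.

```python
import numpy as np


def pmf_rmse(Z: np.ndarray, W: np.ndarray, ratings: np.ndarray) -> float:
    """RMSE over observed ratings; NaN marks a missing rating."""
    predicted = Z @ W.T                      # shape: (num_users, num_movies)
    observed = ~np.isnan(ratings)            # mask of ratings that actually exist
    errors = predicted[observed] - ratings[observed]
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy factors standing in for np.load("results/Z.npy") and np.load("results/W.npy").
Z = np.array([[1.0, 0.0],
              [0.0, 2.0]])
W = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])
ratings = np.array([[3.0, np.nan, 1.0],
                    [2.0, 4.0, np.nan]])

print(pmf_rmse(Z, W, ratings))
```

Masking with NaN keeps unobserved user/movie pairs out of the error, which matters because a ratings matrix is mostly empty.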