The corpus file used for evaluation #10
Same issue here. I'm using Jey Han Lau's wiki corpus and his script here: https://github.com/jhlau/topic_interpretability/blob/master/run-oc.sh Note that he uses a 20-word sliding window in his script. I did a run on the 2nd group of topics here and got results that are substantially different from yours, @akashgit:
[0.07] ( 0.07; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
BTW, I don't really question the validity of the results from the paper; I really enjoyed reading it. I just want to make sure we are all using the same evaluation method ;)
If you run the topic model on 20News, the reference corpus for computing topic coherence should accordingly be 20News. You'll get confusing results if you use a different corpus to compute topic coherence.
Hi,
[0.01] ( 0.01; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
@dingran How did you run the topic coherence script, and did you get the same result as reported in the paper when using 20news as the reference corpus? What I got when using 20news as the reference corpus:
[0.03] ( 0.03; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
The corpus files are here: autoencoding_vi_for_topic_models/data/20news_clean/
UPDATE: we recently found that the TC numbers in the paper are slightly under-reported due to the way the TC script works. Please make sure that you set the window size to -1 (whole document) if you are using the same script as me.
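To see why the window size matters so much for the reported numbers, here is a toy illustration of the two counting modes. This is a sketch, not code from Lau's script: the function name `cooccurrence_counts` and the sample sentence are made up for demonstration.

```python
def cooccurrence_counts(doc, pair, window_size):
    """Count co-occurrences of a word pair in one tokenized document.

    window_size = -1 treats the whole document as a single window
    (the setting recommended above); a positive value slides a
    fixed-width window across the document instead.
    """
    w1, w2 = pair
    if window_size == -1:
        # Whole-document mode: at most one co-occurrence per document.
        return 1 if w1 in doc and w2 in doc else 0
    count = 0
    # Sliding-window mode: count every window containing both words.
    for start in range(max(1, len(doc) - window_size + 1)):
        window = doc[start:start + window_size]
        if w1 in window and w2 in window:
            count += 1
    return count

doc = "the cat sat on the mat near the cat".split()
print(cooccurrence_counts(doc, ("cat", "mat"), -1))  # whole document: 1
print(cooccurrence_counts(doc, ("cat", "mat"), 5))   # 5-word windows: 2
```

The two modes produce different counts for the same document, which is why switching Lau's 20-word sliding window to -1 shifts the TC numbers.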
I am also curious which reference corpus they used. Based on your comment, you mean for evaluating ... I thought it's not valid to do it like this. Am I missing something here?
Hi, could you share your ...? Thanks~
Could you share the link to the coherence script that you used?
window_size = -1 means that if two words appear anywhere in the same document, they count as one co-occurrence; those co-occurrence counts are what goes into the NPMI computation.
Thanks, @YongfeiYan, for the quick reply. You mean I have to set window_size = 0? Also, I'm using the same code as you. That source code already shared the ... Thanks~
The .npy files at https://github.com/akashgit/autoencoding_vi_for_topic_models/tree/master/data/20news_clean are in BOW format of shape D x V, where D is the total number of documents and V is the vocab size. To recover the document corresponding to a row of the .npy file:

from itertools import chain
list(chain(*[[vocab[i]] * v for i, v in enumerate(row)]))

I uploaded the code I wrote at https://github.com/YongfeiYan/Neural-Document-Modeling , with 20NG in the data dir and modified topic evaluation scripts.
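A self-contained version of that reconstruction, using a made-up vocabulary and BOW matrix in place of the real vocab and train/test .npy files in data/20news_clean/:

```python
from itertools import chain

import numpy as np

# Hypothetical stand-ins for the repo's vocab and D x V count matrix.
vocab = ["apartment", "woman", "neighbor", "soldier"]
bow = np.array([[2, 0, 1, 0],
                [0, 1, 0, 3]])

def bow_to_tokens(row, vocab):
    """Expand one BOW row (word counts) back into a token list.

    Word order is lost in BOW form, so each vocab word is simply
    repeated by its count, in vocabulary order.
    """
    return list(chain(*[[vocab[i]] * int(c) for i, c in enumerate(row)]))

for row in bow:
    print(" ".join(bow_to_tokens(row, vocab)))
# apartment apartment neighbor
# woman soldier soldier soldier
```

Writing these expanded token lists out, one document per line, yields a reference corpus in the plain-text format that coherence scripts such as Lau's expect.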
Sorry, actually I have not worked with ... Should it be like this:
Thank you again for taking the time. |
Hi,
When I use the scripts to evaluate the performance, I find there is a corpus file needed by the code. I downloaded the files from the 20news homepage, but the result is not the same as in the result file. Could you share the corpus file?
Best