error happened when new token appears in the valid/test data set #8

Open
carter54 opened this issue Dec 6, 2019 · 1 comment

carter54 commented Dec 6, 2019

First, thanks for such a nice paper and codebase!
I'm trying to train a text generation model on my own dataset. The tokenize function in data.py

def tokenize(self, path, construct_dictionary=False):

uses split() to tokenize each sentence in the training set and adds an ID for every token to the dictionary. For the valid/test sets, however, new tokens are neither added to the dictionary nor mapped to an unknown token, so the following error pops up:

Producing dataset...
Traceback (most recent call last):
  File "main.py", line 121, in <module>
    corpus = data.Corpus(args.data)
  File "/home/haha/sha-rnn/data.py", line 31, in __init__
    self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
  File "/home/haha/sha-rnn/data.py", line 60, in tokenize
    ids[token] = self.dictionary.word2idx[word]
KeyError: 'bower_components'
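
A quick way to see how many tokens are affected before changing the tokenizer is to list everything in valid/test that never occurs in train. This is only a minimal sketch: `find_oov_tokens` and the toy sentences are hypothetical names and data for illustration, not part of data.py, and it assumes the same whitespace `split()` tokenization described above.

```python
from collections import Counter


def find_oov_tokens(train_lines, eval_lines):
    """Count whitespace-split tokens in eval_lines that never occur in train_lines."""
    # Vocabulary as seen by a split()-based tokenizer run on the training data only.
    vocab = {tok for line in train_lines for tok in line.split()}
    # Every evaluation token missing from that vocabulary would raise a KeyError on lookup.
    return Counter(
        tok for line in eval_lines for tok in line.split() if tok not in vocab
    )


# Toy data (hypothetical), standing in for train.txt and valid.txt.
train = ["npm install the package", "run the build"]
valid = ["delete bower_components and run the build"]

oov = find_oov_tokens(train, valid)
print(sorted(oov.items()))
```

If the resulting list is short, patching the data may be enough; if it is long, a subword tokenizer is probably the better fix.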

Would you recommend using a different tokenization method (such as WordPiece) here?

Thanks again~

Owner

Smerity commented Dec 10, 2019

The main issue is that if a token occurs in validation or test without ever appearing in training, it's bad news for the model. The weights for that token will be uninitialized at best.

Using wordpieces would likely be the best solution. You could also do what the Penn Treebank (PTB) did and add each of the words found in validation/test at the start or end of the training file. It's not an optimal solution, but it is a solution at least. Alternatively, you could add an unknown token (<unk>) to the dataset.
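
The `<unk>` option could look roughly like the sketch below. It is not the repository's data.py: `build_vocab`, `encode`, and the `min_count` threshold are assumptions made for illustration. Rare training tokens are also mapped to `<unk>` here so that the unknown token actually occurs in training and its embedding receives updates.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown token


def build_vocab(train_lines, min_count=2):
    """Build word -> id from training text only; rare words are left to <unk>."""
    counts = Counter(tok for line in train_lines for tok in line.split())
    word2idx = {UNK: 0}
    for tok, n in counts.items():
        if n >= min_count:
            word2idx.setdefault(tok, len(word2idx))
    return word2idx


def encode(lines, word2idx):
    """Map tokens to ids, falling back to the <unk> id instead of raising KeyError."""
    unk_id = word2idx[UNK]
    return [word2idx.get(tok, unk_id) for line in lines for tok in line.split()]


# Toy data (hypothetical), standing in for train.txt and valid.txt.
train = ["run the build", "run the tests", "clean the build"]
valid = ["delete bower_components and run the build"]

word2idx = build_vocab(train)
train_ids = encode(train, word2idx)
valid_ids = encode(valid, word2idx)
print(valid_ids)
```

With `min_count=1` nothing in training would map to `<unk>`, so the lookup would stop crashing but the `<unk>` embedding would never be trained, which is the same problem described above in a different form.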
