Thanks first for such a nice paper and work!
I'm trying to train a text generation model on my own dataset. The tokenize function in data.py (sha-rnn/data.py, line 34 at commit 218d748) uses split() to tokenize each sentence in the train dataset and adds the token ids to the dictionary. But for the valid/test dataset, new tokens are neither added to the dictionary nor tagged as an unknown token, so the following error pops up:
Producing dataset...
Traceback (most recent call last):
File "main.py", line 121, in <module>
corpus = data.Corpus(args.data)
File "/home/haha/sha-rnn/data.py", line 31, in __init__
self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
File "/home/haha/sha-rnn/data.py", line 60, in tokenize
ids[token] = self.dictionary.word2idx[word]
KeyError: 'bower_components'
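For context, the failing lookup roughly follows the pattern below. This is a minimal sketch, not the repo's exact code, and the file names are only illustrative: the vocabulary is built while reading train.txt only, so a plain dictionary lookup on a token that first appears in valid.txt raises KeyError.

```python
# Minimal sketch of split()-based tokenization with a train-only vocabulary.
# Not the repo's exact code; file names are illustrative.
word2idx = {}

def build_vocab(path):
    # Every whitespace-separated token in the training file gets an id.
    with open(path, encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                word2idx.setdefault(word, len(word2idx))

def to_ids(path):
    # Plain dict lookup: a token unseen during training raises KeyError.
    ids = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                ids.append(word2idx[word])  # KeyError for e.g. 'bower_components'
    return ids

build_vocab('train.txt')
valid_ids = to_ids('valid.txt')  # fails if valid.txt has tokens absent from train.txt
```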
Do you recommend using another tokenization method (like word-piece) here?
Thanks again~
The main issue is that if a token occurs in validation or test without appearing in training, it's bad news for the model: the weights for that token will be uninitialized at best.
Using wordpieces would likely be the best solution. You could also do what the Penn Treebank (PTB) did and add each of the words found in validation/test at the start or end of the training file. Not an optimal solution, but it is a solution at least. You could also add an unknown token (<unk>) to the dataset.
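A minimal sketch of the <unk> route, assuming the same split()-based tokenization as in the sketch above (names are illustrative, not the repo's actual API): reserve an <unk> id when building the vocabulary and fall back to it for any token not seen during training.

```python
# Sketch of an <unk> fallback; not the repo's code, names are illustrative.
UNK = '<unk>'
word2idx = {UNK: 0}

def build_vocab(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                word2idx.setdefault(word, len(word2idx))

def to_ids(path):
    unk_id = word2idx[UNK]
    ids = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                ids.append(word2idx.get(word, unk_id))  # unseen tokens map to <unk>
    return ids
```

Note that <unk> only gets sensibly trained weights if it actually occurs in the training data, which is why some preprocessing pipelines also replace rare training tokens with <unk>.

For the wordpiece route, one option (an assumption on my part, not something this repo ships) is the HuggingFace tokenizers package, which handles unseen words by breaking them into known sub-word pieces; check its docs for the exact API.

```python
# Rough wordpiece example using the HuggingFace `tokenizers` package (not part of this repo).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=['train.txt'], vocab_size=10000)  # learn sub-word pieces from training text
ids = tokenizer.encode('ls bower_components').ids       # unseen words split into known pieces
```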