error happened when new token appears in the valid/test data set #8

carter54 · 2019-12-06T09:44:36Z

First of all, thanks for such a nice paper and codebase!
I'm trying to train a text generation model with my own dataset. The tokenize function in data.py

sha-rnn/data.py

Line 34 in 218d748

def tokenize(self, path, construct_dictionary=False):

uses split() to tokenize sentences in the training set and adds each token's id to the dictionary. But in the valid/test sets, new tokens are neither added to the dictionary nor mapped to an unknown token, so the following error pops up:

Producing dataset...
Traceback (most recent call last):
  File "main.py", line 121, in <module>
    corpus = data.Corpus(args.data)
  File "/home/haha/sha-rnn/data.py", line 31, in __init__
    self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
  File "/home/haha/sha-rnn/data.py", line 60, in tokenize
    ids[token] = self.dictionary.word2idx[word]
KeyError: 'bower_components'

Would you recommend using a different tokenization method here, such as word-piece?

Thanks again~

Smerity · 2019-12-10T00:43:47Z

The main issue is that if a token occurs in validation or test without appearing in training, it's bad news for the model: the weights for that token will be uninitialized at best.

Using wordpieces would likely be the best solution. You could also do what the Penn Treebank (PTB) did and add each of the words found in validation/test at the start or end of the training file. Not an optimal solution, but it is a solution at least. Another option is to add an unknown token (<unk>) to the dataset.
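
As a rough illustration of the <unk> option, here is a minimal sketch in plain Python. It is not the repository's code: the Dictionary class, its lookup method, the list-of-lines tokenize signature and the unseen_tokens helper are all hypothetical names made up for this example. It assumes whitespace tokenization via split() and an <eos> marker appended to each line.

```python
class Dictionary:
    """Hypothetical word-to-index map with an <unk> fallback for unseen words."""

    def __init__(self, unk_token="<unk>"):
        self.word2idx = {}
        self.idx2word = []
        self.unk_token = unk_token
        self.add_word(unk_token)  # reserve an id for unknown words up front

    def add_word(self, word):
        # Assign the next free id the first time a word is seen.
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def lookup(self, word):
        # Unseen words map to the <unk> id instead of raising KeyError.
        return self.word2idx.get(word, self.word2idx[self.unk_token])


def tokenize(lines, dictionary, construct_dictionary=False):
    """Turn an iterable of text lines into a flat list of token ids."""
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            if construct_dictionary:
                ids.append(dictionary.add_word(word))  # training split
            else:
                ids.append(dictionary.lookup(word))  # valid/test splits
    return ids


def unseen_tokens(train_lines, other_lines):
    """List the words in other_lines that never occur in train_lines."""
    train_vocab = {w for line in train_lines for w in line.split()}
    other_vocab = {w for line in other_lines for w in line.split()}
    return sorted(other_vocab - train_vocab)


train = ["a b c"]
valid = ["a bower_components"]

dictionary = Dictionary()
train_ids = tokenize(train, dictionary, construct_dictionary=True)
valid_ids = tokenize(valid, dictionary)  # no KeyError for bower_components
missing = unseen_tokens(train, valid)
```

The unseen_tokens helper can also be used to list which words would have to be appended to the training file for the PTB-style workaround. Note that the fallback only removes the KeyError: unless <unk> also occurs in the training text (for example by replacing rare training words with it), its embedding is never trained, which is the problem described above.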
