How do I add the extra_ids to the tokenizer up front, so that I can keep extra_ids=0? #740
StephennFernandes started this conversation in General
I've seen that in the mT5 tokenizer, the 100 extra IDs needed for the sentinel tokens come already included in the tokenizer:
'gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model'
How do I train my own SentencePiece tokenizer so that the 100 extra IDs are already included, just like in the original one above?
I am training a variant of mT5 that covers languages the standard mT5 does not cover, so I went ahead and trained a unigram SentencePiece tokenizer with a vocab_size of 250000, but I got confused about how to include the 100 extra_ids the way the official mT5 tokenizer does.
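One possible way to do this (a sketch, not necessarily the exact recipe used to build the official mc4.250000.100extra vocab) is to pass the 100 T5-style sentinel tokens to the SentencePiece trainer as user_defined_symbols, so they are baked into the .model file and the vocabulary can later be loaded with extra_ids=0. The corpus path and model prefix below are placeholders. One caveat: user_defined_symbols get IDs near the start of the vocabulary, whereas the official mT5 vocab places the sentinels at the end, so check that the resulting IDs match what your T5/seqio setup expects.

```python
import sentencepiece as spm

# The 100 T5-style sentinel tokens: <extra_id_0> ... <extra_id_99>.
sentinels = [f"<extra_id_{i}>" for i in range(100)]

spm.SentencePieceTrainer.train(
    input="my_multilingual_corpus.txt",            # placeholder: your training text
    model_prefix="mt5_variant",                    # placeholder: output model name
    vocab_size=250_000,                            # the sentinels count toward this total
    model_type="unigram",
    user_defined_symbols=",".join(sentinels),      # bake the sentinels into the .model file
)

# Sanity check: each sentinel should now map to a single token ID.
sp = spm.SentencePieceProcessor(model_file="mt5_variant.model")
print(sp.piece_to_id("<extra_id_0>"), sp.piece_to_id("<extra_id_99>"))
```

With the sentinels already in the .model file, you should then be able to construct the vocabulary without appending anything extra, e.g. seqio.SentencePieceVocabulary("mt5_variant.model", extra_ids=0).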