How do I add the extra_ids to the tokenizer up front, so that I can keep extra_ids=0? #740
StephennFernandes started this conversation in General
I've seen that in the mT5 tokenizer, the 100 extra IDs needed for the sentinel tokens come already included in the tokenizer:
'gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model'
How do I train my own SentencePiece tokenizer so that the 100 extra IDs are already included, just like in the original one above?
I am training a variant of mT5 that covers languages the standard mT5 does not cover, so I went ahead and trained a unigram SentencePiece tokenizer with a vocab_size of 250000, but I got confused about how to include the 100 extra_ids the way the official mT5 tokenizer does.
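One possible way to do this (a sketch, not necessarily the exact recipe used to build the official mc4.250000.100extra vocab) is to pass the 100 T5-style sentinel tokens to the SentencePiece trainer as user_defined_symbols, so they are baked into the .model file and the vocabulary can later be loaded with extra_ids=0. The corpus path and model prefix below are placeholders. One caveat: user_defined_symbols get IDs near the start of the vocabulary, whereas the official mT5 vocab places the sentinels at the end, so check that the resulting IDs match what your T5/seqio setup expects.

```python
import sentencepiece as spm

# The 100 T5-style sentinel tokens: <extra_id_0> ... <extra_id_99>.
sentinels = [f"<extra_id_{i}>" for i in range(100)]

spm.SentencePieceTrainer.train(
    input="my_multilingual_corpus.txt",            # placeholder: your training text
    model_prefix="mt5_variant",                    # placeholder: output model name
    vocab_size=250_000,                            # the sentinels count toward this total
    model_type="unigram",
    user_defined_symbols=",".join(sentinels),      # bake the sentinels into the .model file
)

# Sanity check: each sentinel should now map to a single token ID.
sp = spm.SentencePieceProcessor(model_file="mt5_variant.model")
print(sp.piece_to_id("<extra_id_0>"), sp.piece_to_id("<extra_id_99>"))
```

With the sentinels already in the .model file, you should then be able to construct the vocabulary without appending anything extra, e.g. seqio.SentencePieceVocabulary("mt5_variant.model", extra_ids=0).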