Hi @Smerity,
thanks for open-sourcing the code for this great project ❤️
I trained a character-based model for German on ~1GB of text (mainly from OPUS). It worked well for two epochs, but then the following error was thrown:
| epoch 1 | 121090/129094 batches | lr 0.00200 | ms/batch 216.22 | loss 0.83 | ppl 2.28 | bpc 1.191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
| epoch 1 | 121100/129094 batches | lr 0.00200 | ms/batch 190.10 | loss nan | ppl nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
  File "main.py", line 379, in <module>
    train(epoch - 1)
  File "main.py", line 302, in train
    scaled_loss.backward()
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 127, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,  # 1./scale,
ZeroDivisionError: float division by zero
Then I resumed training with a lower learning rate (pretty much the same parameters as stated in the main readme), and the same error was thrown after one epoch.
Do you know how this can be prevented 🤔
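In case it helps anyone else debugging this: judging from the log, the dynamic loss scaler halves the scale on every overflow until it bottoms out at 0.0, and the unscale step in apex/amp/scaler.py then divides by that zero scale. One thing I plan to try is putting a floor under the loss scale and clipping the unscaled gradients. Below is a rough sketch of what I mean, not code from this repo; I'm assuming apex's amp.initialize accepts a min_loss_scale argument, and the toy model only stands in for the actual character LM:

```python
import torch
import torch.nn as nn
from apex import amp

# Toy stand-in for the character LM, just to show the AMP wiring.
model = nn.Linear(128, 128).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
criterion = nn.MSELoss()

# Floor the dynamic loss scale so repeated overflows can never halve it down to 0.0.
# (min_loss_scale is an amp.initialize option; the value 1.0 here is just a guess.)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1', min_loss_scale=1.0)

for step in range(100):
    data = torch.randn(32, 128, device='cuda')
    targets = torch.randn(32, 128, device='cuda')
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Clip the unscaled master gradients to damp the spikes that trigger the overflows.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 0.25)
    optimizer.step()
```

The exact clip value and floor are guesses; the point is just that the scale can never collapse to zero.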
However, the generated text is very interesting 😅
||||D i e _ P r o t o k o l l e , _ d i e _ a u f _ B e r e c h n u n g e n _ d e r _ F a m i l i e _ P a a r u n g _ u n d _ d e r _ E n t s c h e i d u n g e n _ f ü r _ d i e _ R e g u l i e r u n g _ d e r _ U n i v e r s i t ä t _ v o r g e b r a c h t _ w e r d e n , _ w i r d _ d e m n a
c h _ m i t _ n i e d r i g e n _ G e w i n n n i v e a u s _ d i s k u t i e r t _ w e r d e n .
S i e _ k ö n n e n _ a u c h _ z w i s c h e n _ d e n _ v e r s c h i e d e n e n _ K o n z e p t i o n e n _ v o n _ F a m i l i e n _ u n d _ K i n d e r n _ i n t e r e s s i e r e n : _ b e i s p i e l s w e i s e : _ B i o g r a p h i e , _ M a g i e , _ G e s c h i c h t e , _ C a p t a
i n _ S l a v i a - S t i l , _ A n s i c h t e n _ u n d _ V i d e o s .
D i e s e r _ S c h a l l p e g e l _ l ä u f t _ i n _ e i n _ H ö h e n v e r s t e l l u n g s g e f ä ß _ d e s _ a l l g e m e i n e n _ G e r ä t e s _ b e i _ d e r _ D i c h t h e i t .||||A u f _ d e r _ W e s t s e i t e _ d e r _ A u t o b a h n _ A 1 , _ n a h e _ L a _ G o m e r a _ b e f i n d e n _ s i c h _ z w e i _ S t r a ß e n v e r b i n d u n g e n _ z w i s c h e n _ d e n _ B e r g e n _ u n d _ d e r _ S e h e n s w ü r d i g k e i t .
Z u _ d e n _ f o l g e n d e n _ D i e n s t l e i s t u n g e n _ g e h ö r e n _ T e l e f o n , _ k o s t e n l o s e _ P a r k p l ä t z e , _ e i n _ B ü g e l e i s e n / - b r e t t _ ( 2 4 - S t u n d e n - R e z e p t i o n ) .
Thanks for running this on German @stefan-it! I haven't done experiments on different languages yet and it's great to see it at least hold!
Unfortunately I have run into similar issues in the past re: NaNs, at around 15 or 20 epochs on the enwik8 data, which works out to around 2 epochs on your larger German dataset.
I still haven't tracked down exactly what it might be, but I do know the random seed can impact it. My guess is that there might be an issue with dropout over the attention window or something similar.
I'll be making a new, cleaner codebase, and ensuring that issues like this don't occur will be a top priority. As a temporary fix, if you're curious to continue investigating, you could save the model once every N iterations (as long as the loss hasn't NaN'ed out) and restart, as you've done, with a different random seed. That's admittedly not a great solution, however.
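For reference, the kind of thing I have in mind looks roughly like the sketch below. It's not what's in main.py, just an illustration: save a checkpoint every N batches only while the loss is still finite, then after a blow-up reload the last good checkpoint and reseed (the file name and interval are made up):

```python
import math
import torch

def maybe_checkpoint(model, optimizer, loss, batch, interval=1000,
                     path='latest_good.pt'):
    """Save a checkpoint every `interval` batches, but only while the loss is finite,
    so training can resume from the last good state after a NaN blow-up."""
    if batch % interval == 0 and math.isfinite(loss):
        torch.save({'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'batch': batch}, path)

# On restart: load the last good checkpoint and reseed before continuing.
# checkpoint = torch.load('latest_good.pt')
# model.load_state_dict(checkpoint['model'])
# optimizer.load_state_dict(checkpoint['optimizer'])
# torch.manual_seed(new_seed)  # different seed, as suggested above
```

Depending on your setup you may also want to checkpoint the AMP state for an exact resume, but the above is enough to avoid losing the whole run.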
I'm glad you're enjoying the generated text! I ran it through Google Translate and it at least produces something I can read lol. I'll note that for this model the more context you can seed it with the better!