
Losses are Nan and Infinite #1869

Open
SSwethaSel0609 opened this issue Jan 21, 2025 · 3 comments

Comments

@SSwethaSel0609

I'm fine-tuning the zipformer model. When I fine-tuned with 100 hours of data there was no issue, but when I fine-tune with 3000 hours of data I get infinite or NaN losses. What could be the cause of this issue?
[1,mpirank:5,algo-1]:2025-01-19 09:00:32,064 INFO [finetune.py:1142] (5/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3310.00 frames. ], tot_loss[over 792510.56 frames. ], batch size: 14, lr: 4.28e-03,
[1,mpirank:0,algo-1]:2025-01-19 09:00:32,065 INFO [finetune.py:1142] (0/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4819.00 frames. ], tot_loss[over 814690.14 frames. ], batch size: 58, lr: 4.28e-03,
[1,mpirank:6,algo-1]:2025-01-19 09:00:32,068 INFO [finetune.py:1142] (6/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3370.00 frames. ], tot_loss[over 799670.96 frames. ], batch size: 13, lr: 4.28e-03,
[1,mpirank:3,algo-1]:2025-01-19 09:00:32,070 INFO [finetune.py:1142] (3/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4945.00 frames. ], tot_loss[over 807011.63 frames. ], batch size: 33, lr: 4.28e-03,
[1,mpirank:2,algo-1]:2025-01-19 09:00:32,071 INFO [finetune.py:1142] (2/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4949.00 frames. ], tot_loss[over 812248.61 frames. ], batch size: 66, lr: 4.28e-03,
[1,mpirank:1,algo-1]:2025-01-19 09:00:32,073 INFO [finetune.py:1142] (1/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4903.00 frames. ], tot_loss[over 823203.24 frames. ], batch size: 49, lr: 4.28e-03,
[1,mpirank:4,algo-1]:2025-01-19 09:00:32,075 INFO [finetune.py:1142] (4/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4743.00 frames. ], tot_loss[over 806376.22 frames. ], batch size: 27, lr: 4.28e-03,

@danpovey
Collaborator

Which directory did you get the code from? The later version in zipformer/ is more stable; there are earlier versions that eventually become unstable like that.
If you rerun from the epoch that failed, i.e. epoch 7, with --inf-check=True, it should produce some output that indicates what the problem is.
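
[Editor's note: for context, icefall's --inf-check option works by registering hooks on the model that flag non-finite values as they appear. The sketch below is a generic PyTorch illustration of that idea, not icefall's exact implementation; the function name and printout are made up for the example.]

```python
# Generic PyTorch sketch of inf/nan checking via forward hooks.
# Illustrates the idea behind icefall's --inf-check; not icefall's
# actual code, and the function name here is illustrative.
import torch
import torch.nn as nn


def register_inf_check_hooks(model: nn.Module) -> None:
    for name, module in model.named_modules():
        # Bind `name` as a default argument so each hook reports
        # the module it is attached to.
        def hook(mod, inputs, output, name=name):
            outs = output if isinstance(output, tuple) else (output,)
            for o in outs:
                if isinstance(o, torch.Tensor) and o.is_floating_point():
                    if not torch.isfinite(o).all():
                        print(f"non-finite output in module: {name}")

        module.register_forward_hook(hook)


# Usage: call once on the model before the training loop starts.
# register_inf_check_hooks(model)
```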

@SSwethaSel0609
Author

How can I find which version I'm using?
I have started from epoch 1.

[1,mpirank:5,algo-1]:2025-01-18 18:21:26,565 INFO [finetune.py:1142] (5/8) Epoch 2, batch 5850, loss[loss=0.2428, simple_loss=0.2706, pruned_loss=0.0785, ctc_loss=0.1449, over 2699.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.2835, pruned_loss=0.1019, ctc_loss=0.1808, over 643152.39 frames. ], batch size: 10, lr: 4.48e-03,
--
[1,mpirank:2,algo-1]:2025-01-18 18:21:26,572 INFO [finetune.py:1142] (2/8) Epoch 2, batch 5850, loss[loss=0.1814, simple_loss=0.2199, pruned_loss=0.05261, ctc_loss=0.09411, over 2742.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.2847, pruned_loss=0.1039, ctc_loss=0.1847, over 636660.49 frames. ], batch size: 10, lr: 4.48e-03,
[1,mpirank:4,algo-1]:2025-01-18 18:21:26,573 INFO [finetune.py:1142] (4/8) Epoch 2, batch 5850, loss[loss=0.3142, simple_loss=0.3101, pruned_loss=0.1158, ctc_loss=0.2165, over 3081.00 frames. ], tot_loss[loss=0.282, simple_loss=0.2847, pruned_loss=0.1029, ctc_loss=0.1838, over 642994.73 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:1,algo-1]:2025-01-18 18:21:26,574 INFO [finetune.py:1142] (1/8) Epoch 2, batch 5850, loss[loss=0.3755, simple_loss=0.3408, pruned_loss=0.1521, ctc_loss=0.2651, over 3001.00 frames. ], tot_loss[loss=0.286, simple_loss=0.2879, pruned_loss=0.105, ctc_loss=0.1852, over 635433.97 frames. ], batch size: 11, lr: 4.48e-03,
[1,mpirank:3,algo-1]:2025-01-18 18:23:08,504 INFO [finetune.py:1142] (3/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3194.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.2546, pruned_loss=0.09303, ctc_loss=0.1655, over 636192.78 frames. ], batch size: 11, lr: 4.48e-03,
[1,mpirank:6,algo-1]:2025-01-18 18:23:08,505 INFO [finetune.py:1142] (6/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3187.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.2572, pruned_loss=0.09711, ctc_loss=0.171, over 635349.10 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:7,algo-1]:2025-01-18 18:23:08,505 INFO [finetune.py:1142] (7/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3130.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.2522, pruned_loss=0.09163, ctc_loss=0.1638, over 637478.71 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:0,algo-1]:2025-01-18 18:23:08,506 INFO [finetune.py:1142] (0/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 2829.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2517, pruned_loss=0.09079, ctc_loss=0.1618, over 636327.01 frames. ], batch size: 10, lr: 4.48e-03,
[1,mpirank:5,algo-1]:2025-01-18 18:23:08,508 INFO [finetune.py:1142] (5/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 2825.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.2527, pruned_loss=0.09098, ctc_loss=0.1624, over 638286.87 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:2,algo-1]:2025-01-18 18:23:08,511 INFO [finetune.py:1142] (2/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3732.00 frames. ], tot_loss[loss=0.251, simple_loss=0.2528, pruned_loss=0.09189, ctc_loss=0.1643, over 635974.59 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:1,algo-1]:2025-01-18 18:23:08,512 INFO [finetune.py:1142] (1/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3308.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.2542, pruned_loss=0.09245, ctc_loss=0.164, over 631021.03 frames. ], batch size: 14, lr: 4.48e-03,
[1,mpirank:4,algo-1]:2025-01-18 18:23:08,514 INFO [finetune.py:1142] (4/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3168.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.2521, pruned_loss=0.09174, ctc_loss=0.1649, over 639371.58 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:3,algo-1]:2025-01-18 18:24:45,694 INFO [finetune.py:1142] (3/8) Epoch 2, batch 5950, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3513.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.1978, pruned_loss=0.07229, ctc_loss=0.1286, over 637232.71 frames. ], batch size: 13, lr: 4.48e-03,

@danpovey
Collaborator

Well, if it's a git repo, "git log -1" might tell you; if you are using a pip package, then "pip show icefall".
But which directory did you find the scripts in? That also matters.
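
[Editor's note: if icefall was installed as a pip package, the same check can be done from Python. A small sketch; it only applies to a pip install, not to code run directly from a git checkout.]

```python
# Check the installed icefall version, equivalent to `pip show icefall`.
# Only works if icefall was installed via pip; for a git checkout,
# run `git log -1` in the repository instead.
from importlib.metadata import PackageNotFoundError, version

try:
    print(version("icefall"))
except PackageNotFoundError:
    print("icefall is not pip-installed; check the git checkout instead.")
```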
