Trying to run a conformer-transformer architecture for speech transcription on Indic languages #9383
Unanswered
DeveshS1209 asked this question in Q&A
Replies: 1 comment 1 reply
-
This error occurs when you attempt to index an out-of-range id in an embedding layer on the GPU. Check whether your model's vocab size (the output dimension of the decoder) matches the vocab size of your tokenizer. Tagging @AlexGrinch for further advice.
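A quick way to verify the two sizes line up (a minimal sketch, not NeMo's official API: the `EncDecTransfModelBPE` class and the `transf_decoder.embedding.token_embedding` attribute path are inferred from the traceback below and may differ across NeMo versions):

```python
# Sanity check: compare the tokenizer's vocabulary size with the number
# of rows in the decoder's token-embedding table. The attribute path
# (transf_decoder -> embedding -> token_embedding) is inferred from the
# traceback and may need adjusting for your NeMo version.
import nemo.collections.asr as nemo_asr

# Hypothetical checkpoint path; substitute your own .nemo file.
model = nemo_asr.models.EncDecTransfModelBPE.restore_from("asr_model.nemo")

tok_vocab = model.tokenizer.vocab_size
emb_rows = model.transf_decoder.embedding.token_embedding.num_embeddings

print(f"tokenizer vocab size  : {tok_vocab}")
print(f"decoder embedding rows: {emb_rows}")

# Every token id the tokenizer can emit (including any special tokens
# added on top of the base vocab) must be < emb_rows; otherwise the GPU
# embedding lookup trips the `srcIndex < srcSelectDimSize` assert.
if emb_rows < tok_vocab:
    print("Mismatch: decoder embedding table is smaller than the tokenizer vocab.")
```

Running the failing step once with `CUDA_LAUNCH_BLOCKING=1` makes the Python stack trace point at the real failing call, and running it briefly on CPU turns the opaque device-side assert into a plain `IndexError` that reports the offending token id.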
-
We are trying to train a conformer-encoder + transformer-decoder model for a transcription task, using the config file `fast-conformer_transformer.yaml` from `examples/asr/conf/speech_translation/`. When we run the validation step, we get this error:
```
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [3,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
self.fit_loop.run()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
self.on_advance_end(data_fetcher)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
self.val_loop.run()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 411, in validation_step
return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 642, in call
wrapper_output = wrapper_module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 635, in wrapped_forward
out = method(*_args, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sankeerth/transcription/NeMo/nemo/collections/asr/models/transformer_bpe_models.py", line 469, in validation_step
beam_hypotheses = self.beam_search(
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sankeerth/transcription/NeMo/nemo/collections/asr/modules/transformer/transformer_generators.py", line 177, in call
results = self._forward(
^^^^^^^^^^^^^^
File "/home/ubuntu/sankeerth/transcription/NeMo/nemo/collections/asr/modules/transformer/transformer_generators.py", line 308, in _forward
log_probs, decoder_mems_list = self._one_step_forward(tgt, encoder_hidden_states, encoder_input_mask, None, 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sankeerth/transcription/NeMo/nemo/collections/asr/modules/transformer/transformer_generators.py", line 94, in _one_step_forward
decoder_hidden_states = self.embedding.forward(decoder_input_ids, start_pos=pos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sankeerth/transcription/NeMo/nemo/collections/asr/modules/transformer/transformer_modules.py", line 136, in forward
token_embeddings = self.token_embedding(input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 2237, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/sankeerth/transcription/conformer/trainer_CT.py", line 32, in main
trainer.fit(asr_model)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 68, in _call_and_handle_interrupt
trainer._teardown()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1010, in _teardown
self.strategy.teardown()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 419, in teardown
super().teardown()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/parallel.py", line 133, in teardown
super().teardown()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 533, in teardown
_optimizers_to_device(self.optimizers, torch.device("cpu"))
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_fabric/utilities/optimizer.py", line 28, in _optimizers_to_device
_optimizer_to_device(opt, device)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_fabric/utilities/optimizer.py", line 34, in _optimizer_to_device
optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device, allow_frozen=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 52, in apply_to_collection
return _apply_to_collection_slow(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
v = _apply_to_collection_slow(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
return function(data, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 103, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
return function(data, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 97, in batch_to
data_output = data.to(device, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f771481ad87 in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f77147cb75f in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f77148eb8a8 in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f77159be3ac in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f77159c24c8 in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f77159c5bfa in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f77159c6839 in /home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f775f6e1bf4 in /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f7761b39609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f7761904353 in /lib/x86_64-linux-gnu/libc.so.6)
```
We are seeking guidance to resolve this error. Our training entry point is sketched below for context.
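This is a minimal sketch of `trainer_CT.py`, modeled on NeMo's speech-translation example scripts; the model class and hydra wiring are reconstructed from the traceback above, not copied from our actual file:

```python
# Rough reconstruction of trainer_CT.py. EncDecTransfModelBPE and the
# hydra_runner/exp_manager wiring follow NeMo's examples/asr scripts;
# paths and override names here are illustrative.
import pytorch_lightning as pl
from omegaconf import DictConfig

import nemo.collections.asr as nemo_asr
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager


@hydra_runner(
    config_path="../NeMo/examples/asr/conf/speech_translation",  # illustrative path
    config_name="fast-conformer_transformer",
)
def main(cfg: DictConfig):
    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = nemo_asr.models.EncDecTransfModelBPE(cfg=cfg.model, trainer=trainer)
    trainer.fit(asr_model)  # the validation step inside fit() triggers the assert


if __name__ == "__main__":
    main()
```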