Conformer case study #6350
Replies: 2 comments
-
Our team has discussed this paper extensively over the week, and our conclusion is that the supposed issues raised by the "bugs" have minimal impact on either training or inference. We were already aware of all three issues, and we either addressed them two years ago or experimentally validated that they do not affect the results significantly.
Bug 1: This issue is common to all models that use convolutions with kernel size larger than one when inputs have variable length and padding is added. We ran experiments at the time, and the effect of this issue was insignificant. In theory, when the kernel size is large, as in Conformer (kernel size 31), it can affect the results, and that is the motivation behind the partial convolutions proposed in this paper. However, in ASR we use padding (value zero) for time masking in SpecAugment. A zero in the input corresponds to average speech when a normalized log Mel-spectrogram is used, so the ASR model observes these zeros everywhere and becomes accustomed to them over the course of training. Still, since our kernels are large, @VahidooX decided to fix this issue in our code by masking the inputs before the convolutions, out of an abundance of caution. Note: the authors of the paper observe that NeMo does not exhibit this bug, which is in large part due to this input-masking step.
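The mask-before-convolution fix can be illustrated with a toy 1-D convolution (a minimal NumPy sketch, not the actual NeMo code; `conv1d_same` and `masked_conv1d` are hypothetical helper names). Without masking, whatever sits in the padded frames bleeds into the last `kernel_size // 2` valid outputs; masking first makes the valid outputs independent of the padding content.

```python
import numpy as np

def conv1d_same(x, w):
    """Naive 'same'-padded 1-D cross-correlation over a single channel."""
    k = len(w)
    pad = k // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(xp[t:t + k], w) for t in range(len(x))])

def masked_conv1d(x, length, w):
    """Sketch of the mask-before-conv idea: force padded frames to zero
    so their content cannot leak into the valid frames, then re-mask."""
    x = x.copy()
    x[length:] = 0.0          # zero out the padded region before the conv
    y = conv1d_same(x, w)
    y[length:] = 0.0          # re-mask the output as well
    return y

rng = np.random.default_rng(0)
w = rng.standard_normal(31)                            # Conformer-sized kernel
valid = rng.standard_normal(40)
a = np.concatenate([valid, rng.standard_normal(20)])   # padding holds junk
b = np.concatenate([valid, np.zeros(20)])              # padding holds zeros

# without masking, junk padding changes valid outputs near the boundary
leak = not np.allclose(conv1d_same(a, w)[:40], conv1d_same(b, w)[:40])
# with masking, valid outputs are identical regardless of padding content
safe = np.allclose(masked_conv1d(a, 40, w)[:40], masked_conv1d(b, 40, w)[:40])
```

With kernel size 31, the last 15 valid frames are within reach of the padded region, which is why a large-kernel model benefits from the explicit mask even though a small-kernel one barely notices.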
Bug 2: This is in the same category as the first one, but we have just two layers (for 4x downsampling) with kernel size 3, so at most four to five padding steps can leak into the valid steps. That means having more than five padding frames does not affect the results at all, and the leakage only pollutes one or two steps at the end of the speech. At the time, our team ran some experiments and decided to leave it as is. We could easily fix it, but we doubt it affects the results significantly. Note that the authors show this in Table 3: TF32 alone already moves WER from 10.52 to 10.73; NeMo does not exhibit bug 1, so that row can be skipped; and bug 2 has barely any effect on WER with respect to IBS — even at 100, it only moves from 10.73 to 10.74. This is in line with our expectations and our experiments. The other reason we decided to ignore it, unlike bug 1, is that zeros appear anywhere in the input anyway (due to SpecAugment), and some extra padding at the end did not seem to make a difference. ASR is an alignment problem: the model should learn to emit BLANK tokens when it sees zero vectors, treating them as non-token-emitting steps. That said, we could easily fix it with simple masking after running some experiments, if we want to be academically correct, as long as the fix does not diminish model quality or significantly impact training or inference speed.
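The leakage bound above can be checked numerically with a toy 4x-downsampling stack (a minimal NumPy sketch using linear, bias-free convolutions; `conv_k3_s2` and `subsample` are hypothetical names, not NeMo code). Tracing the receptive field of two kernel-3, stride-2 layers shows that only the first few padded frames can reach the last valid output, and padding deeper than that is invisible.

```python
import numpy as np

def conv_k3_s2(x, w):
    """Kernel-3, stride-2 convolution with symmetric zero padding of 1.
    Output frame t reads input frames 2t-1 .. 2t+1."""
    xp = np.pad(x, (1, 1))
    return np.array([np.dot(xp[2 * t:2 * t + 3], w)
                     for t in range((len(x) + 1) // 2)])

def subsample(x, w1, w2):
    """Toy 4x-downsampling front-end: two stacked kernel-3, stride-2 convs."""
    return conv_k3_s2(conv_k3_s2(x, w1), w2)

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
valid = rng.standard_normal(62)                 # 62 valid frames -> 16 valid outputs
x_zero = np.concatenate([valid, np.zeros(18)])  # clean zero padding
x_junk = np.concatenate([valid, rng.standard_normal(18)])  # junk in every padded frame
x_far = x_zero.copy()
x_far[67:] = rng.standard_normal(13)            # junk only beyond the first 5 padded frames

y_zero = subsample(x_zero, w1, w2)[:16]
y_junk = subsample(x_junk, w1, w2)[:16]
y_far = subsample(x_far, w1, w2)[:16]

only_last_frame_polluted = (np.allclose(y_zero[:15], y_junk[:15])
                            and not np.allclose(y_zero[15], y_junk[15]))
deep_padding_invisible = np.allclose(y_zero, y_far)
```

Output frame `j` of the stack depends on input frames `4j-3 .. 4j+3`, so the last valid output can only see a handful of padded frames: that is the "at most four to five steps" bound, polluting just the final step or two.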
Bug 3: To be perfectly clear, this one has quite a backstory, and we are not entirely sure the authors' implementation of the fix is correct. Let's start with the history of this bug. The relative positional encoding approach introduced in the Transformer-XL paper was defined and formulated for GPT-style LMs with causal decoders. That differs from ASR models, where self-attention has both left and right context. The same causal definition was implemented in the original PyTorch code on GitHub: kimiyoung/transformer-xl. When Conformer came out in mid-2020, everybody, including us and ESPnet, used that repo for the self-attention module. We all made the same mistake and did not check the details of the T-XL code, because we trusted the original implementation by its authors! When we all struggled to reproduce the results of the original Conformer paper, @VahidooX read through the whole code, found the bug, and fixed it in NeMo. We needed to change the positional encodings and also the matrix rotation to handle both left and right context in the relative positional encoding (see PR #1527); the original causal definition could not work for the Conformer model. After fixing the bug, accuracy improved, though not as much as we expected. A couple of months later, somebody notified the ESPnet team about this bug, and they fixed it as well; here are the issue and the PR with the fix: espnet/espnet#2684. Later, @VahidooX found out that the people who implemented the relative positional encoding MHA in Lingvo (the TensorFlow codebase where the Conformer encoder was later released) already knew about this and had done it correctly from the beginning.
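For illustration, the left-plus-right-context rotation can be sketched as follows (a minimal NumPy sketch, not the NeMo or ESPnet implementation; this toy uses 2L-1 relative offsets from -(L-1) to L-1, and the exact length convention for the positional matrix varies across implementations). The point is that a full-context shift starts from a rectangular (L x 2L-1) score matrix, not a square (L x L) one:

```python
import numpy as np

def rel_shift_full_context(x):
    """Transformer-XL-style shift adapted to bidirectional context.
    x: (L, 2L-1) scores, where column k holds the relative offset k-(L-1).
    Returns (L, L) with out[i, j] = x[i, (j - i) + (L - 1)]."""
    L, P = x.shape
    assert P == 2 * L - 1
    x = np.pad(x, ((0, 0), (1, 0)))   # prepend a zero column -> (L, 2L)
    x = x.reshape(2 * L, L)[1:]       # re-view and drop the first row -> (2L-1, L)
    return x.reshape(L, 2 * L - 1)[:, :L]

rng = np.random.default_rng(1)
L, d = 5, 4
q = rng.standard_normal((L, d))           # query-side content vectors
p = rng.standard_normal((2 * L - 1, d))   # p[k] encodes relative offset k-(L-1)
shifted = rel_shift_full_context(q @ p.T)

# brute-force reference: directly index the embedding for offset j - i
expected = np.array([[q[i] @ p[j - i + L - 1] for j in range(L)]
                     for i in range(L)])
matches = np.allclose(shifted, expected)
```

The original T-XL shift only handles non-negative (causal) offsets; for Conformer the positional table must cover negative offsets too, which is exactly the change NeMo and ESPnet had to make.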
See https://github.com/tensorflow/lingvo/blob/master/lingvo/core/conformer_layer.py#L781-L784 and https://github.com/tensorflow/lingvo/blob/master/lingvo/core/batch_major_attention.py#L1419-L1569. From our review of the authors' codebase, it looks like they used a buggy implementation (not from ESPnet or NeMo), likely this one: https://github.com/sooftware/attentions/blob/master/attentions.py. In our review, the fix itself also looks incorrect. The matrix BD, which needs the rotation, should have a size of (seq_len x 2*seq_len+1), but as you can see in the examples in the paper, the matrix is (seq_len x seq_len). The positional encoding matrix passed to relative positional encoding for full context should have length 2*seq_len+1. So either the authors based the fix on the less-used absolute positional encoding, or they chose absolute positional encoding as the basis for their Conformer of their own volition. Note that most modern Conformers no longer bother with absolute PE: it cannot be used for inference on audio longer than what was seen during training (~20 s), and its WER is also significantly worse than relative PE's. So in our opinion, the third bug either is not significant for relative PE, or the authors' description of it applies only to absolute PE, which we don't use (all NeMo Conformer models use relative PE by default). Finally, the paper's results focus on speech translation, with reduced focus on ASR. While this is fine for academic studies, Conformer is far more commonly used for the ASR task. Since NeMo already avoids (1) via input masking, bug (2) has negligible impact, and (3) is not applicable to relative PE models, I personally think these so-called bugs have no impact on NeMo ASR models. Of course, we are more than willing to correct any actual bugs that have a significant effect on our results.
-
Hey @titu1994, thanks for the detailed response! 😊
-
What's your take on "Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP"?