On Kernel layouts for CitriNet #3881
-
Good question! We also use the kernel size factor to scale down Citrinet for streaming mode (there are some checkpoints on NGC marked with gamma_0_25, indicating a kernel scaling factor of 0.25x). There's actually a minor deviation from the original paper, in the sense that we scale down every single kernel after the first layer - even the final kernel of size 41 gets multiplied by gamma. We found that this further improves the streaming / buffered mode inference of this model. It's not just supported in Riva, btw - we have buffered CTC and RNNT support in NeMo too, but Riva is a lot more efficient and supports true streaming inference with low latency. Riva will do much better with a gamma-scaled Citrinet (including the final kernel * gamma) than with the original offline model (gamma = 1). Since you are doing kernel scaling, you must set the gamma value before you create the model - it cannot be changed once the model has been built or trained - and yes, weights are incompatible between different gammas (as the conv kernel shapes change).
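For illustration, here is roughly how that looks in code - a minimal sketch assuming a Citrinet training YAML similar to the ones shipped with NeMo (e.g. under examples/asr/conf/citrinet/), where the scaling factor lives under `model.model_defaults.kernel_size_factor` and is interpolated into each encoder block; the file name and config path are assumptions:

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecCTCModelBPE

# Load a Citrinet training config (file name here is hypothetical).
cfg = OmegaConf.load("citrinet_512.yaml")

# Gamma must be set *before* the model is instantiated; it changes the conv
# kernel shapes, so it cannot be edited on an already built/trained model.
cfg.model.model_defaults.kernel_size_factor = 0.25

# Building the model bakes the scaled kernel sizes into the encoder.
# (A valid tokenizer path in the config is still required to construct it.)
model = EncDecCTCModelBPE(cfg=cfg.model)
```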
-
For reference, you can see the config of the `encoder.jasper` part of the stt_en_citrinet_1024_gamma_0_25 model.
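If it helps, one way to dump that section locally (a small sketch; assumes the checkpoint name resolves via `from_pretrained`):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecCTCModelBPE

# Download the NGC checkpoint and print its encoder block layout,
# including the gamma-scaled kernel size of each block.
model = EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024_gamma_0_25")
print(OmegaConf.to_yaml(model.cfg.encoder.jasper))
```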
-
Nice, this is good to know, thanks!
Will do better in what way? Improved inference latency, WER accuracy, or both?
-
In the CitriNet paper, section 4.1, it's stated that:
Does this apply only to the PyTorch model generated using NeMo, or is it applicable to Riva as well?
Does Riva already optimize for streaming models somehow, so that this is not needed?
Also, if I fine-tune a pre-trained model but adjust the `kernel_size_factor` in my cfg, does that actually change the kernel layout? It doesn't seem to do so unless I train from scratch, since when I tried to `restore_from()` my finetuned model, I got an error similar to #3167, but with the encoder shapes corresponding to a different kernel layout configuration than the ones specified in Table 2 of the paper.
If this question is more suitable for the Riva forum, lmk. I thought to ask here since the question relates to the model's architecture and training w/ NeMo.
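To be concrete, the failing pattern is roughly the following (the checkpoint file name is hypothetical):

```python
from nemo.collections.asr.models import EncDecCTCModelBPE

# Restoring the fine-tuned .nemo checkpoint (hypothetical name) raises a
# size-mismatch error on the encoder conv weights, as if the checkpoint's
# kernel layout does not match the kernel_size_factor set in the cfg.
model = EncDecCTCModelBPE.restore_from("finetuned_citrinet.nemo")
```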