The reason creating the masks on the fly in the forward is avoided is that some operations like torch.triu were not supported by ONNX at that time. We can take another look and, if it is supported now, move it into the forward. Another simple fix is to call a method that recreates the masks in setup_streaming_params.
The left_chunks should be limited: it is a streaming model, and without a limit on the left context the memory consumption would keep growing over the streaming audio, so it is better to have a bounded left context. Also note that left contexts longer than about 6 seconds usually do not improve accuracy significantly. This is a layer-wise left context, so the effective left context is already much larger. Being able to change the left context after training would be a useful feature, but it is not critical, as users usually train their models with the left context they have in mind for streaming.
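For a rough sense of the scale of that layer-wise effect, here is a back-of-the-envelope calculation (all numbers below are illustrative assumptions, not taken from any particular config):

```python
# Illustration of why a layer-wise left context compounds across layers.
# All numbers are illustrative assumptions, not from an actual NeMo config.
num_layers = 17          # conformer layers
left_chunks = 4          # left chunks attended to per layer
chunk_frames = 40        # encoder frames per chunk
frame_shift_sec = 0.04   # seconds per encoder frame (e.g. 4x subsampling of 10 ms)

per_layer_left_sec = left_chunks * chunk_frames * frame_shift_sec
# Each layer can reach another `per_layer_left_sec` further back through the layer below.
effective_left_sec = num_layers * per_layer_left_sec

print(f"per-layer left context:  {per_layer_left_sec:.1f} s")
print(f"effective left context: ~{effective_left_sec:.1f} s")
```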
We have developed a multi-lookahead cache-aware model which supports multiple left and right contexts in a single model. We have not added it to NeMo yet. That one creates the masking in the forward on the fly.
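For reference, a minimal sketch of how such a chunked-limited mask could be built on the fly inside forward using only comparison ops (so it avoids torch.triu, which was the ONNX concern above). This is not the actual NeMo or multi-lookahead code, just an illustration:

```python
import torch

def chunked_limited_mask(seq_len: int, chunk_size: int, left_chunks: int,
                         device=None) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = key position may be attended to.

    Built with arange/comparison ops only, so it can be recomputed in forward()
    for the actual sequence length of each batch.
    """
    idx = torch.arange(seq_len, device=device)
    chunk_idx = idx // chunk_size              # chunk id of every frame
    q_chunk = chunk_idx.unsqueeze(1)           # (seq_len, 1) query chunk ids
    k_chunk = chunk_idx.unsqueeze(0)           # (1, seq_len) key chunk ids
    # A query may look at keys in its own chunk and up to `left_chunks` chunks back.
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_chunks)

# Example: 3-frame chunks, 2 left chunks
print(chunked_limited_mask(seq_len=9, chunk_size=3, left_chunks=2).int())
```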
TL;DR: The current implementation of the cache-aware Conformer doesn't look like it can attend beyond the fixed-length context it was trained with, while the implementations in k2 icefall and WeNet have the correct behavior.
Please let me know if this reasoning sounds good, or if I'm missing something.
In the cache-aware Conformer implementation, `left_chunks_num` is initialized here (see the linked code). Isn't one of the use cases of the cache to enable attending to history beyond what the model was trained on?
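The pattern being described, in a simplified sketch (names and logic here are illustrative, not the actual NeMo code): the mask is baked in once from `left_chunks_num` and `max_audio_length`, so changing `left_chunks` later has no effect unless the mask is rebuilt.

```python
import torch

class EncoderSketch:
    """Illustrative only: mimics a chunked-limited mask pre-computed at setup time."""

    def __init__(self, chunk_size: int, left_chunks_num: int, max_audio_length: int):
        self.chunk_size = chunk_size
        self.left_chunks_num = left_chunks_num           # fixed at init/training time
        self.set_max_audio_length(max_audio_length)      # bakes the mask in

    def set_max_audio_length(self, max_audio_length: int):
        chunk_idx = torch.arange(max_audio_length) // self.chunk_size
        diff = chunk_idx.unsqueeze(1) - chunk_idx.unsqueeze(0)   # query chunk - key chunk
        # Frozen until set_max_audio_length() is called again.
        self.chunked_limited_mask = (diff >= 0) & (diff <= self.left_chunks_num)

enc = EncoderSketch(chunk_size=40, left_chunks_num=4, max_audio_length=1000)
enc.left_chunks_num = 10      # changing the attribute alone does nothing...
enc.set_max_audio_length(1000)  # ...until the mask is rebuilt
```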
Looks like the `chunked_limited_mask` will not change once it's initialized, unless `left_chunks_num` also changes and a call to `set_max_audio_length` is made with a `max_audio_length` greater than what the model was trained on. Here's an example:
Consider a cache-aware Conformer trained with the following config, which uses 4 left chunks.
Now, at inference, if I want the model to use a larger context and set `left_chunks = 10` by calling `setup_streaming_params()`, it still attends to the context (cache + current chunk) set for 4 left chunks (in this case 170 frames), as shown in the resulting `att_mask`.
However, if I explicitly set `left_chunks_num = 10` and subsequently call `set_max_audio_length()` to update the `chunked_limited_mask`, I get the correct attention mask, where the latest chunk can attend to itself and everything in the history up to the specified number of left chunks, as the resulting `att_mask` shows.
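For reference, the two call paths compared above look roughly like this (the method and attribute names are the ones mentioned in this post; the exact arguments and the checkpoint name are assumptions, not verified against the current NeMo API):

```python
import nemo.collections.asr as nemo_asr

# Hypothetical cache-aware checkpoint; substitute your own model.
asr_model = nemo_asr.models.ASRModel.from_pretrained("<your_cache_aware_conformer>")

# Path 1: only request a larger left context via setup_streaming_params().
# The pre-built chunked_limited_mask (4 left chunks) is still used.
asr_model.encoder.setup_streaming_params(left_chunks=10)

# Path 2: manual workaround -- set the attribute and force the mask to be rebuilt.
asr_model.encoder.left_chunks_num = 10
asr_model.encoder.set_max_audio_length(5000)  # assumed max length in encoder frames
```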
This is mostly because the `chunked_limited_mask` is not updated at inference for a new `left_chunks`.

Currently, this is what `chunked_limited_mask` looks like for `left_chunks=10` (when it was set to 4 in training):

This is what `chunked_limited_mask` should ideally look like for `left_chunks=10` (irrespective of what it was set to in training):

The implementations in WeNet and k2 icefall (borrowed from WeNet) seem to have the correct behavior here.
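For comparison, WeNet's chunk mask is generated per call for whatever `num_left_chunks` is requested, so a larger left context at inference just works. A paraphrased sketch of that approach (not the verbatim WeNet code):

```python
import torch

def subsequent_chunk_mask(size: int, chunk_size: int, num_left_chunks: int = -1) -> torch.Tensor:
    """WeNet-style chunk mask: True = attendable; num_left_chunks < 0 means unlimited."""
    ret = torch.zeros(size, size, dtype=torch.bool)
    for i in range(size):
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((i // chunk_size - num_left_chunks) * chunk_size, 0)
        ending = min((i // chunk_size + 1) * chunk_size, size)
        ret[i, start:ending] = True
    return ret

# Because the mask is rebuilt every call, left_chunks=10 at inference is honored.
mask = subsequent_chunk_mask(size=12, chunk_size=3, num_left_chunks=10)
print(mask.int())
```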
This is what `chunked_limited_mask` from icefall and WeNet looks like for `left_chunks=10`:

Not sure if I'm missing something here, please let me know. Happy to hear any thoughts or explanations on this.