Hi, I am confused: there is a layer normalization between the down-projection and up-projection of Q, but this layer normalization is not shown in the DeepSeek-V2 paper. Here is the code in sglang: [not preserved]. Here is the formula in the paper: [not preserved].
Replies: 2 comments 1 reply
@ispobock Hey Ke, could you help with this?
It's added in the original implementation.
Ref: https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/c85b5ede86f2a598af339624cac5723861e557ed/modeling_deepseek.py#L825
It's also mentioned in the paper.
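
For anyone reading along, here is a minimal sketch of the query path being discussed, in plain Python with toy sizes. It assumes the extra normalization is an RMSNorm applied to the compressed query latent, between the down-projection and the up-projection. The function names, dimensions, and random weights are made up for illustration and are not taken from sglang or from the linked reference code.

```python
import math
import random


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-element weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def matvec(matrix, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights, only for illustration."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def query_path(h, w_dq, norm_weight, w_uq):
    """Down-project, normalize the compressed latent, then up-project."""
    c_q = matvec(w_dq, h)              # down-projection: c_q = W_DQ h
    c_q = rms_norm(c_q, norm_weight)   # assumed extra norm on the latent
    return matvec(w_uq, c_q)           # up-projection: q = W_UQ c_q


# Toy sizes (hypothetical, not the real model configuration).
rng = random.Random(0)
hidden_size, q_latent_size, q_size = 16, 4, 8
w_dq = random_matrix(q_latent_size, hidden_size, rng)
w_uq = random_matrix(q_size, q_latent_size, rng)
norm_weight = [1.0] * q_latent_size

h = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
q = query_path(h, w_dq, norm_weight, w_uq)
```

The formula in the paper corresponds to the first and last lines of `query_path`. The middle line is the step in question: with it, the scale of the compressed latent no longer affects the up-projected query (up to the small `eps` term).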