Hi, BPT and RingAttention are awesome works! Thanks a lot for open-sourcing the code.
I have a question about the second equation in the following snapshot taken from the paper: I am having difficulty deriving that the LHS and RHS are equal.

- Should the scaling factor be $\exp(\max(Q_i K_j^T) - \max_i)$ instead of $\exp(Q_i K_j^T - \max_i)$, i.e., is the maximum symbol missing?
- Even with the above fixed, and following *Online Normalizer Calculation for Softmax*, should the scaling factor be applied to both the numerator and the denominator, as is done in the pseudo-code in the paper (L43-45) and in the implementation below?
`ringattention/ringattention/ringattention_jax.py`, lines 142 to 144 at `aef108a`:

```python
correction = rearrange(jnp.exp(prev_max_score_chunk - max_score_chunk), 'b h q -> b q h')[..., None]
numerator_chunk = numerator_chunk * correction + exp_values
denominator_chunk = denominator_chunk * jnp.exp(prev_max_score_chunk - max_score_chunk) + exp_weights.sum(axis=-1)
```
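To make the second point concrete, here is a minimal NumPy sketch of the recurrence I have in mind (toy shapes and variable names of my own, not the repo's code): the correction factor $\exp(\text{prev\\_max} - \text{new\\_max})$ rescales both the running numerator and the running denominator, and the final ratio matches the full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))    # one query, head_dim = 8
k = rng.normal(size=(32, 8))   # 32 keys, processed in 4 chunks of 8
v = rng.normal(size=(32, 8))

# Reference: full (non-blockwise) softmax attention for the single query.
scores = q @ k.T                                     # (1, 32)
weights = np.exp(scores - scores.max())
ref = (weights @ v) / weights.sum()

# Online normalizer calculation, chunk by chunk.
max_score = -np.inf
numerator = np.zeros((1, 8))
denominator = 0.0
for k_chunk, v_chunk in zip(np.split(k, 4), np.split(v, 4)):
    s = q @ k_chunk.T                                # (1, chunk) scores
    new_max = max(max_score, s.max())                # running max over all seen scores
    correction = np.exp(max_score - new_max)         # rescales BOTH partial sums below
    exp_s = np.exp(s - new_max)
    numerator = numerator * correction + exp_s @ v_chunk
    denominator = denominator * correction + exp_s.sum()
    max_score = new_max

out = numerator / denominator
assert np.allclose(out, ref)                         # matches the full softmax result
```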
But I may be missing something in the paper. Any guidance would be much appreciated.
Thanks a lot in advance.