Currently the implementation splits the input sequence into n blocks, e.g. 4 GPUs will split into:
b0 | b1 | b2 | b3
However, this results in uneven computation: the GPU holding b3 does roughly 4 times more work than the GPU holding b0, due to the causal attention mask.
If we instead split the input sequence into 2n blocks, e.g. 4 GPUs will split into:
b0,b7 | b1,b6 | b2,b5 | b3,b4
then all GPUs do the same amount of computation, and theoretically the latency should be cut roughly in half.
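To illustrate the load balancing, here is a small sketch under a hypothetical cost model: the attention cost of block i is approximated as the number of blocks whose keys it must attend to under a causal mask, i.e. cost(i) = i + 1 (counting its own block as a full visit for simplicity). The function names are illustrative, not from the actual implementation.

```python
# Hypothetical cost model: block i attends to blocks 0..i under a causal mask.
def block_cost(i):
    return i + 1

def naive_split(num_gpus):
    # GPU g gets block g:  b0 | b1 | b2 | b3
    return [[g] for g in range(num_gpus)]

def zigzag_split(num_gpus):
    # GPU g gets blocks g and 2n-1-g:  b0,b7 | b1,b6 | b2,b5 | b3,b4
    n_blocks = 2 * num_gpus
    return [[g, n_blocks - 1 - g] for g in range(num_gpus)]

def per_gpu_cost(assignment):
    return [sum(block_cost(i) for i in blocks) for blocks in assignment]

print(per_gpu_cost(naive_split(4)))   # [1, 2, 3, 4]  -> ~4x imbalance
print(per_gpu_cost(zigzag_split(4)))  # [9, 9, 9, 9]  -> balanced
```

Pairing the g-th block with the (2n-1-g)-th makes each GPU's total cost (g+1) + (2n-g) = 2n+1, a constant independent of g, which is why the work evens out.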