@tohtana tohtana commented Jun 17, 2025

This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage 2 support.

DeepCompile's ZeRO1 currently performs an allreduce at every iteration, even when the iteration is not a gradient accumulation boundary. This significantly degrades performance when gradient accumulation is enabled. This PR fixes the issue by performing the allreduce only at gradient accumulation boundaries.
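
To illustrate why deferring the allreduce is safe, here is a toy pure-Python model. It is not DeepCompile's implementation: scalars stand in for gradient tensors, a plain average stands in for the allreduce, and the function name `simulate` and its parameters are made up for this sketch. It compares reducing at every micro-batch with reducing only at the accumulation boundary.

```python
def simulate(per_rank_grads, grad_accum_steps, boundary_only):
    """Toy model of gradient accumulation across ranks.

    per_rank_grads[r][i] is the scalar gradient of rank r at micro-batch i.
    Returns (gradients seen by the optimizer, number of allreduce calls).
    """
    world_size = len(per_rank_grads)
    num_micro_batches = len(per_rank_grads[0])
    local = [0.0] * world_size  # per-rank accumulation buffers
    reduced = 0.0               # averaged gradient handed to the optimizer
    calls = 0
    updates = []
    for i in range(num_micro_batches):
        # The boundary is the last micro-batch of each accumulation window.
        is_boundary = (i + 1) % grad_accum_steps == 0
        if boundary_only:
            # Accumulate locally; communicate once per window.
            for r in range(world_size):
                local[r] += per_rank_grads[r][i]
            if is_boundary:
                reduced = sum(local) / world_size
                calls += 1
        else:
            # Communicate at every micro-batch, then accumulate the result.
            reduced += sum(g[i] for g in per_rank_grads) / world_size
            calls += 1
        if is_boundary:
            updates.append(reduced)
            local = [0.0] * world_size
            reduced = 0.0
    return updates, calls


grads = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]  # 2 ranks, 4 micro-batches
every_step = simulate(grads, grad_accum_steps=2, boundary_only=False)
at_boundary = simulate(grads, grad_accum_steps=2, boundary_only=True)
print(every_step)
print(at_boundary)
```

Both strategies hand the same averaged gradients to the optimizer because averaging is linear, but the boundary-only variant issues one collective per accumulation window instead of one per micro-batch.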

Because the previous behavior is similar to ZeRO2, this PR also adds ZeRO2 support to DeepCompile. The ZeRO stage can now be set to 2 when DeepCompile is enabled.
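
A minimal configuration sketch for selecting stage 2 together with DeepCompile. The `compile.deepcompile` key and the `engine.compile()` call mentioned in the comments are assumptions based on how DeepCompile is commonly enabled, not something this PR description states, and the batch-size values are placeholders; check the DeepSpeed documentation for your version.

```python
import json

# Assumed DeepSpeed config; only the ZeRO stage comes from this PR's description.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,        # placeholder value
    "gradient_accumulation_steps": 4,           # placeholder value
    "zero_optimization": {"stage": 2},          # stage 2 is what this PR enables
    "compile": {"deepcompile": True},           # assumed key for enabling DeepCompile
}

# Typical usage (assumed, requires DeepSpeed and a model, so left as comments):
#   engine, *_ = deepspeed.initialize(model=model,
#                                     model_parameters=model.parameters(),
#                                     config=ds_config)
#   engine.compile()

print(json.dumps(ds_config, indent=2))
```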

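The loss values, performance, and memory usage were verified using this [verification tool](https://github.com/tohtana/ds_verify_loss) ([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)).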

@loadams loadams enabled auto-merge (squash) June 27, 2025 22:03
@loadams loadams merged commit be8124c into master Jun 27, 2025
11 checks passed
@loadams loadams deleted the tohtana/dc_z1_no_sync branch June 27, 2025 22:31
tohtana added a commit that referenced this pull request Jun 27, 2025
This PR improves the coverage of DeepCompile.

- Use real parameters when recompilation happens
- Handle overflow errors during profiling

This PR should be merged after #7366.

ZeRO1 and ZeRO3 both worked with OpenRLHF. See [Wiki
page](https://github.com/tohtana/DeepCompile_docs/wiki/Debug-with-OpenRLHF-(%237243))
for more details.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
…i#7366)

This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage
2 support.

DeepCompile's ZeRO1 currently performs an allreduce at every iteration,
even when the iteration is not a gradient accumulation boundary. This
significantly degrades performance when gradient accumulation is
enabled. This PR fixes the issue by performing the allreduce only at
gradient accumulation boundaries.

As the current behavior is similar to ZeRO2, this PR also adds
DeepCompile's ZeRO2 support. We can now set zero stage to 2 with
DeepCompile.

The loss values, performance, and memory usage were verified using this
[verification tool](https://github.com/tohtana/ds_verify_loss)
([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)).

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
This PR improves the coverage of DeepCompile.
