
Conversation

inkcherry
Contributor

Fix the CI hang.
Improve the unit test.

inkcherry added 4 commits May 30, 2025 02:08
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
@sfc-gh-truwase
Collaborator

@inkcherry, thanks for the quick PR. I have a few questions:

  1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
  2. Do you know why world_size affects the hang?
  3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

@inkcherry
Contributor Author

inkcherry commented May 30, 2025

> @inkcherry, thanks for the quick PR. I have a few questions:
>
>   1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
>   2. Do you know why world_size affects the hang?
>   3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

In my environment (torch==2.7.0), the issue seems to be related to the DistributedTest class, not AutoTP.

We can create a new test file to reproduce it:

```python
import pytest

from unit.common import DistributedTest


@pytest.mark.parametrize("tp_size", [2, 4])
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 2
    reuse_dist_env = False  # world_size=4 with reuse_dist_env=True will hang

    def test(self, tp_size: int):
        print("finished test")
```

With world_size=4 and reuse_dist_env=True, a random subset of processes hangs during teardown, in tests/unit/common.py -> _dist_destroy(self) -> dist.destroy_process_group().
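For context, the teardown path looks roughly like this (paraphrased, not the verbatim source of tests/unit/common.py):

```python
import torch.distributed as dist

def _dist_destroy(self):
    # Runs on every rank after the test body returns. With
    # reuse_dist_env=True the worker processes stay alive across tests,
    # so if one rank exits early or desynchronizes, the remaining ranks
    # can block forever at the barrier or inside destroy_process_group().
    if dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()
```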

Notice that this is the only unit test using world_size > 2 together with reuse_dist_env=True, so we can temporarily work around the issue by avoiding this combination.
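Until the teardown issue is root-caused, a minimal sketch of the safe pattern (the class name here is hypothetical, not the actual test changed by this PR):

```python
from unit.common import DistributedTest

class TestWithLargeWorldSize(DistributedTest):  # hypothetical example
    world_size = 4
    # Workaround: avoid combining world_size > 2 with reuse_dist_env=True,
    # since that combination can hang in dist.destroy_process_group().
    reuse_dist_env = False

    def test(self):
        pass
```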

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) May 31, 2025 15:39
@sfc-gh-truwase sfc-gh-truwase disabled auto-merge June 2, 2025 19:40
@sfc-gh-truwase
Collaborator

nv-torch-latest-v100 is currently broken.

@sfc-gh-truwase sfc-gh-truwase merged commit 8b03a35 into deepspeedai:master Jun 2, 2025
9 of 10 checks passed
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025
fix ci hang.
improve the ut.

---------

Signed-off-by: inkcherry <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Max Kovalenko <[email protected]>
Antlera pushed a commit to Antlera/DeepSpeed that referenced this pull request Jun 27, 2025
fix ci hang.
improve the ut.

---------

Signed-off-by: inkcherry <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>