Fix CI hang in torch 2.7 & improve UT #7321
Signed-off-by: inkcherry <[email protected]>
@inkcherry, thanks for the quick PR. I have a few questions
In my environment (torch 2.7.0), the issue seems to be related to the DistributedTest class rather than AutoTP; we could create a new test file to reproduce it.
Some processes (at random) hang during teardown. Notice that this is the only unit test that uses world_size > 2 together with reuse_dist_env=True, so we can temporarily work around the issue by avoiding this combination.
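To make the workaround concrete, here is a minimal sketch of how one might flag test classes that combine world_size > 2 with reuse_dist_env=True. It is only an illustration: `DistributedTestStub`, `AutoTPLikeTest`, `SmallTest`, and the helper functions are hypothetical stand-ins, not code from this PR or from the real DistributedTest class; only the attribute names `world_size` and `reuse_dist_env` come from the discussion above.

```python
from typing import Iterable, List, Union


class DistributedTestStub:
    """Hypothetical stand-in for the DistributedTest base class (assumption)."""
    world_size: Union[int, List[int]] = 2
    reuse_dist_env: bool = False


class AutoTPLikeTest(DistributedTestStub):
    # Mirrors the combination described in the comment: more than 2 ranks
    # together with a reused distributed environment.
    world_size = 4
    reuse_dist_env = True


class SmallTest(DistributedTestStub):
    # Reuses the environment but stays at 2 ranks, so it is not flagged.
    world_size = 2
    reuse_dist_env = True


def max_world_size(cls: type) -> int:
    """Return the largest world size a test class requests (int or list)."""
    ws = getattr(cls, "world_size", 1)
    return max(ws) if isinstance(ws, (list, tuple)) else int(ws)


def uses_risky_combination(cls: type) -> bool:
    """True if the class uses world_size > 2 together with reuse_dist_env=True."""
    return max_world_size(cls) > 2 and bool(getattr(cls, "reuse_dist_env", False))


def find_risky_tests(classes: Iterable[type]) -> List[str]:
    """Collect the names of classes that use the risky combination."""
    return [c.__name__ for c in classes if uses_risky_combination(c)]
```

A check like this could be run over collected test classes to confirm that no other test hits the same combination; the temporary workaround itself would be setting `reuse_dist_env = False` (or lowering `world_size`) on the affected test.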
nv-torch-latest-v100 is currently broken.
fix ci hang. improve the ut. --------- Signed-off-by: inkcherry <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: Max Kovalenko <[email protected]>
fix ci hang. improve the ut. --------- Signed-off-by: inkcherry <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
fix ci hang.
improve the ut.