Fix rank_zero_only rank not set in ddp-spawn based strategies #19030
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19030      +/-   ##
==========================================
- Coverage      84%      54%     -30%
==========================================
  Files         443      438       -5
  Lines       36154    36060      -94
==========================================
- Hits        30260    19461   -10799
- Misses       5894    16599   +10705
This is why I always suggest that only the fabric/pl imports are used (e.g. #16178).
Unfortunately, PyCharm seems to prefer importing utilities when using auto-import, so it's easy to miss.
(cherry picked from commit f652e6c)
What does this PR do?
This PR temporarily fixes the issue that the DDP strategies that call `set_world_ranks` only set the `rank_zero_only.rank` attribute for the fabric/pytorch utilities, but NOT for the `lightning_utilities` package. This leads to logs appearing on rank > 0 on spawn-based strategies.

The decision to outsource these utilities into the separate package made all of this very brittle. Now we need to maintain 2 globals instead of one. With this fix, I will open an issue requesting that these utilities get moved back into the Lightning package so we only need to maintain one global variable (`rank_zero_only.rank`).
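The two-globals problem can be sketched without importing either package. The snippet below is only an illustration under stated assumptions: `make_rank_zero_only`, `framework_rank_zero_only`, `utilities_rank_zero_only`, and the two `set_world_ranks_*` helpers are hypothetical stand-ins, not the real Lightning or `lightning_utilities` code, and the default rank of 0 is assumed for the sake of the example.

```python
import functools
from typing import Any, Callable, Optional


def make_rank_zero_only(default_rank: int = 0) -> Callable:
    """Build an independent decorator that carries its own `rank` attribute."""

    def rank_zero_only(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapped(*args: Any, **kwargs: Any) -> Optional[Any]:
            # Only run on the process this particular decorator believes is rank 0.
            if rank_zero_only.rank == 0:
                return fn(*args, **kwargs)
            return None

        return wrapped

    # Assumption for this sketch: every copy starts out believing it is rank 0.
    rank_zero_only.rank = default_rank
    return rank_zero_only


# Hypothetical stand-ins for the two separately maintained copies of the utility.
framework_rank_zero_only = make_rank_zero_only()
utilities_rank_zero_only = make_rank_zero_only()


def set_world_ranks_buggy(rank: int) -> None:
    # Updates only one of the two globals, as described above.
    framework_rank_zero_only.rank = rank


def set_world_ranks_fixed(rank: int) -> None:
    # Keeps both globals in sync.
    framework_rank_zero_only.rank = rank
    utilities_rank_zero_only.rank = rank


@framework_rank_zero_only
def framework_info(msg: str) -> str:
    return msg


@utilities_rank_zero_only
def utilities_info(msg: str) -> str:
    return msg


# Simulate a spawned worker that is assigned rank 1.
set_world_ranks_buggy(1)
buggy_framework = framework_info("hello")  # suppressed: returns None
buggy_utilities = utilities_info("hello")  # leaks through: returns "hello"

set_world_ranks_fixed(1)
fixed_utilities = utilities_info("hello")  # suppressed: returns None
```

With the buggy helper, the second decorator still believes it is on rank 0 and lets the message through on rank 1, which mirrors the duplicated logs described above. Syncing both attributes in one place removes the leak, at the cost of having to remember every copy.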
Minimal repro (just observe info and warn outputs duplicated):
Discovered while debugging CI flakiness. This fix should automatically resolve the timeouts for some of the ddp-spawn tests like `test_result_reduce_ddp` etc.

📚 Documentation preview 📚: https://pytorch-lightning--19030.org.readthedocs.build/en/19030/
cc @Borda @tchaton @carmocca @justusschock @awaelchli