
jjh42
Contributor

@jjh42 jjh42 commented Jul 1, 2025

What does this PR do?

Currently, when using async checkpointing, calling fit or validate more than once crashes, because the thread pool is shut down after the first call and never re-created.
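For readers unfamiliar with the failure mode, here is a minimal sketch of it and of one way to avoid it. It does not reproduce the actual diff in this PR: `AsyncSaver`, `_ensure_executor`, `save`, and `teardown` are hypothetical names used only for illustration, and lazily re-creating the pool is just one possible approach. The only library behavior relied on is that `ThreadPoolExecutor.submit` raises `RuntimeError` once the pool has been shut down.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class AsyncSaver:
    """Hypothetical stand-in for an async checkpoint plugin (not Lightning's class)."""

    def __init__(self) -> None:
        self._executor: Optional[ThreadPoolExecutor] = None

    def _ensure_executor(self) -> ThreadPoolExecutor:
        # Create the pool on first use, and again after teardown() has cleared it.
        if self._executor is None:
            self._executor = ThreadPoolExecutor(max_workers=1)
        return self._executor

    def save(self, fn: Callable[[], None]) -> None:
        # Queue the save on the background thread.
        self._ensure_executor().submit(fn)

    def teardown(self) -> None:
        # Wait for pending saves, then drop the pool so the next run gets a fresh one.
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None


# The failure mode: submitting to a pool after shutdown raises RuntimeError.
pool = ThreadPoolExecutor(max_workers=1)
pool.shutdown(wait=True)
try:
    pool.submit(lambda: None)
    crashed = False
except RuntimeError:
    crashed = True

# With re-creation, two consecutive save/teardown cycles both succeed.
saved = []
saver = AsyncSaver()
for run in range(2):
    saver.save(lambda run=run: saved.append(run))
    saver.teardown()
```

The two loop iterations stand in for two consecutive fit or validate calls on the same trainer; without the `None` reset in `teardown`, the second `save` would hit the same `RuntimeError` as the bare pool above.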

  • This PR modifies the test to induce the crash and fixes it.

    No.

  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)

    No, this is just a bugfix, not a behavior change. Should I create an issue?

  • Did you read the contributor guideline, Pull Request section?

    Yes

  • Did you make sure your PR does only one thing, instead of bundling different changes together?

    Yes

  • Did you make sure to update the documentation with your changes? (if necessary)

    na

  • Did you write any new necessary tests? (not for typos and docs)

    yes

  • Did you verify new and existing tests pass locally with your changes?

    As best I could. I'm not clear on the recommended setup for testing PyTorch Lightning locally, so I was only able to run the test I modified.

  • Did you list all the breaking changes introduced by this pull request?

    na

  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

    Yes


📚 Documentation preview 📚: https://pytorch-lightning--20952.org.readthedocs.build/en/20952/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jul 1, 2025
@Borda Borda changed the title Make asyncio checkpointing work if validate/fit is called more than o… Make asyncio checkpointing work if validate/fit is called more than once Jul 14, 2025
@jjh42
Contributor Author

jjh42 commented Jul 15, 2025

Let me know if this looks OK, and then I'll fix the mypy errors.

@jjh42
Contributor Author

jjh42 commented Aug 8, 2025

Any feedback? Is it worth spending time fixing the mypy errors and the merge conflict, or is there no interest in this? (I would say the more serious problem is the linked issue, which seems like it could result in degraded checkpoints.)

@Borda Borda requested a review from Isalia20 August 18, 2025 16:57
@Borda Borda requested review from bhimrazy and Isalia20 August 19, 2025 13:37
@Borda Borda merged commit ff64a92 into Lightning-AI:master Aug 19, 2025
84 checks passed
Borda added a commit that referenced this pull request Aug 28, 2025
…nce (#20952)

* Make asyncio checkpointing work if validate/fit is called more than once.
* Apply suggestions from code review
* Add assertion to ensure executor is initialized before saving checkpoint
* update

---------

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: bhimrazy <[email protected]>
(cherry picked from commit ff64a92)
lantiga pushed a commit that referenced this pull request Aug 29, 2025
…nce (#20952)