Skip to content

Conversation

JeevanChevula
Copy link
Contributor

@JeevanChevula JeevanChevula commented Sep 2, 2025

Fixes #934

This PR adds a "saved_checkpoint" event that fires after successful checkpoint saves.

Usage:

# Users need to register the event first
engine.register_events("saved_checkpoint")

@trainer.on("saved_checkpoint")
def after_saving(engine):
    print("Checkpoint saved!")

@github-actions github-actions bot added the module: handlers Core Handlers module label Sep 2, 2025
@JeevanChevula JeevanChevula marked this pull request as ready for review September 2, 2025 06:16
@JeevanChevula JeevanChevula marked this pull request as draft September 3, 2025 06:51
@github-actions github-actions bot added the docs label Sep 3, 2025
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Sep 3, 2025

@JeevanChevula thanks for the PR. However, let's rework the API of the new feature you are working on:

  1. Let's avoid using string type as event name and do not ask users to register the event. This should be done internally by ignite.

  2. Let's define the event in the Checkpoint as following:

# checkpoint.py

class CheckpointEvents(EventEnum):
    SAVED_CHECKPOINT = "saved_checkpoint"


class Checkpoint(...):
    SAVED_CHECKPOINT = CheckpointEvents.SAVED_CHECKPOINT
    ...
  1. finally the usage can be something like this:
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, global_step_from_engine

trainer = ...
evaluator = ...
# Setup Accuracy metric computation on evaluator.
# evaluator.state.metrics contain 'accuracy',
# which will be used to define ``score_function`` automatically.
# Run evaluation on epoch completed event
# ...

to_save = {'model': model}
handler = Checkpoint(
    to_save, '/tmp/models',
    n_saved=2, filename_prefix='best',
    score_name="accuracy",
    global_step_transform=global_step_from_engine(trainer)
)

evaluator.add_event_handler(Events.COMPLETED, handler)

# ---- New API with Checkpoint.SAVED_CHECKPOINT event: -----

@evaluator.on(Checkpoint.SAVED_CHECKPOINT)
def notify_when_saved(eval_engine, chkpt_handler):   # we should pass to the attached handlers the engine and the checkpoint instance.
    assert eval_engine is engine
    assert chkpt_handler is handler
    print("Saved checkpoint:", chkpt_handler.last_checkpoint)
    
# ---- End of New API with Checkpoint.SAVED_CHECKPOINT event: -----

trainer.run(data_loader, max_epochs=10)
> ["best_model_9_accuracy=0.77.pt", "best_model_10_accuracy=0.78.pt", ]

Let me know what do you think?

@JeevanChevula
Copy link
Contributor Author

Thanks for the suggestion . I’ll try to work on updating the PR to follow the API approach you mentioned with

@JeevanChevula JeevanChevula marked this pull request as ready for review September 5, 2025 09:58
@JeevanChevula
Copy link
Contributor Author

Implementation Note:

Implemented EventEnum-based SAVED_CHECKPOINT event as requested. However, Ignite's event system only supports single-parameter handlers - the originally requested two-parameter signature (handler(engine, checkpoint_handler)) failed during event firing and registration.

Current implementation uses single parameter with checkpoint access via engine._current_checkpoint_handler. All 61 core tests pass, confirming functionality works without breaking existing features. The 3 distributed test errors are pre-existing infrastructure issues unrelated to this change.

Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR @JeevanChevula
I left few more comments to improve the PR

@@ -460,11 +470,15 @@ def __call__(self, engine: Engine) -> None:
if self.include_self:
# Now that we've updated _saved, we can add our own state_dict.
checkpoint["checkpointer"] = self.state_dict()

# Store reference to self in engine for event handlers to access
engine._current_checkpoint_handler = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this code? I do not understand why we need this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workaround for Ignite's event system limitation. You originally requested a two-parameter handler signature handler(engine, checkpoint_handler), but Ignite's fire_event() only supports single parameters and rejects handlers expecting additional arguments. This line stores the checkpoint reference in the engine so handlers can access it via engine._current_checkpoint_handler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check details about this limitation. Unfortunately engine._current_checkpoint_handler is not good enough for a public API usage.

Alternatively, we can pass the instance of the checkpointer as an arg when attaching:

checkpoint = Checkpoint(...)

@trainer.on(Checkpoint. SAVED_CHECKPOINT, checkpoint)
def handler(engine, chkpt_handler):
    assert engine is trainer
    assert chkpt_handler is checkpoint

Maybe, we can skip automatic chkpt_handler arg injection:

checkpoint = Checkpoint(...)

@trainer.on(Checkpoint. SAVED_CHECKPOINT)
def handler(engine):
    assert engine is trainer
```

@JeevanChevula
Copy link
Contributor Author

Pushing current implementation with working SAVED_CHECKPOINT event functionality. Will add proper Google-style docstrings with version directives by Monday per contributing guidelines

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Sep 12, 2025

@JeevanChevula please rebase your PR branch, you have now some extra commits

@JeevanChevula JeevanChevula force-pushed the add-saved-checkpoint-event branch from d500fc8 to d81faa9 Compare September 18, 2025 11:03
@JeevanChevula JeevanChevula marked this pull request as draft September 18, 2025 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci CI docs module: engine Engine module module: handlers Core Handlers module module: metrics Metrics module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add new events
2 participants