Fix scheduler heartbeat timeout failures with DetachedInstanceError
#53838
Merged
Resolves `DetachedInstanceError` when the scheduler processes task instances that have timed out during heartbeat detection. The error occurred when Pydantic validation of `TIRunContext` attempted to access the `consumed_asset_events` relationship on `DagRun` objects that had been detached from the SQLAlchemy session.

Root cause: The main scheduler loop calls `session.expunge_all()`, which detaches all objects from the session. Later, when processing heartbeat timeouts, the scheduler creates `TIRunContext` objects that trigger Pydantic validation of `dag_run.consumed_asset_events`, causing `DetachedInstanceError` on the lazy-loaded relationship.

Solution: Add `selectinload(DagRun.consumed_asset_events)` to the heartbeat timeout query to eagerly load the relationship before objects are detached. This minimal fix loads only the required relationship without over-eager loading of nested fields that aren't accessed during heartbeat processing.

The fix affects all DAG types since `consumed_asset_events` is initialized as an empty list on all `DagRun` objects, not just asset-triggered DAGs.

Longer term, using `back_populates` (with `lazy="selectin"`) might be better so we don't need to remember this:
https://docs.sqlalchemy.org/en/20/orm/queryguide/relationships.html
https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.back_populates
uranusjr
approved these changes
Jul 28, 2025
ashb
approved these changes
Jul 29, 2025
github-actions bot
pushed a commit
that referenced
this pull request
Jul 29, 2025
…nstanceError`` (#53838) (cherry picked from commit 9458053) Co-authored-by: Kaxil Naik <[email protected]>
RoyLee1224
pushed a commit
to RoyLee1224/airflow
that referenced
this pull request
Jul 31, 2025
jedcunningham
pushed a commit
that referenced
this pull request
Aug 1, 2025
…nstanceError`` (#53838) (#53858) (cherry picked from commit 9458053) Co-authored-by: Kaxil Naik <[email protected]>
ferruzzi
pushed a commit
to aws-mwaa/upstream-to-airflow
that referenced
this pull request
Aug 7, 2025
kaxil
added a commit
to astronomer/airflow
that referenced
this pull request
Aug 11, 2025
Similar to apache#53838 but prevents it for all queries needing `consumed_asset_events`. Instead of adding `.selectinload(DR.consumed_asset_events)` wherever needed, I am eagerly loading them now.

Changes:
- Add `lazy='selectin'` to the `DagRun.consumed_asset_events` relationship for always-eager loading
- Changed `backref` to `back_populates` in `AssetEvent.created_dagruns` to enable explicit control

Why This Fix Works:
- Eliminates lazy loading entirely by pre-loading the relationship at the model level
- Prevents dependency on consistent session state in concurrent scheduler operations

Closes apache#54306
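As a rough illustration of the model-level change described in this commit message, here is a minimal sketch. `Run` and `Event` are hypothetical stand-ins for Airflow's `DagRun` and `AssetEvent`, the relationship is simplified to one-to-many, and an in-memory SQLite database is assumed; it is not the actual Airflow model code.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    # lazy="selectin" loads the collection whenever a Run is loaded.
    events = relationship("Event", back_populates="run", lazy="selectin")


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))
    # back_populates names the other side explicitly instead of using backref.
    run = relationship("Run", back_populates="events")


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Run(id=1, events=[Event(id=1)]))
    session.commit()

with Session(engine) as session:
    # No loader option on the query: the relationship default applies.
    run = session.execute(select(Run)).scalar_one()
    session.expunge_all()
    event_ids = [e.id for e in run.events]  # already loaded, no lazy load

print(event_ids)
```

With this configuration every query that loads the parent pays for the extra `SELECT ... IN` query, which is the trade-off against adding the loader option only where it is needed.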
kaxil
added a commit
to astronomer/airflow
that referenced
this pull request
Aug 11, 2025
Similar to apache#53838 and an alternative to apache#54331. This is a more localized change that only eagerly loads the relationship for this specific instance.

Closes apache#54306
fweilun
pushed a commit
to fweilun/airflow
that referenced
this pull request
Aug 11, 2025
Resolves intermittent `DetachedInstanceError` when the scheduler processes task instances that have timed out during heartbeat detection. The error occurred when Pydantic validation of `TIRunContext` attempted to access the `consumed_asset_events` relationship on `DagRun` objects that had been detached from the SQLAlchemy session.

The Problem:
- `selectinload(TI.dag_run)` (missing `consumed_asset_events`)
- `session.expunge_all()` is called, detaching all objects from the session
- `TIRunContext` creation triggers Pydantic validation that accesses `dag_run.consumed_asset_events`
- `DetachedInstanceError`

Key Evidence:
- Accessing `consumed_asset_events` on detached `DagRun` objects reliably reproduces the error

Why It's Intermittent:
- `session.expunge_all()` and subsequent object access

Solution
Add minimal eager loading with `selectinload(DagRun.consumed_asset_events)` to the heartbeat timeout query. This ensures the relationship is loaded before objects can be detached, eliminating the need for lazy loading.

Why This Fix Works:
- `TIRunContext`
Verification Steps for Reviewers
To verify the root cause and validate the fix, run these tests in an IPython shell:
Test 1: Verify DetachedInstanceError on expunged objects
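A minimal stand-alone sketch of this check, assuming plain SQLAlchemy with an in-memory SQLite database. `Run` and `Event` are hypothetical stand-ins for `DagRun` and `AssetEvent`, simplified to a one-to-many relationship; this is not a snippet from the Airflow codebase.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    events = relationship("Event")  # default lazy="select"


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Run(id=1))
    session.commit()

with Session(engine) as session:
    run = session.execute(select(Run)).scalar_one()
    session.expunge_all()  # detach everything from the session
    try:
        run.events  # not loaded yet, so this needs a lazy load
        outcome = "loaded"
    except DetachedInstanceError:
        outcome = "DetachedInstanceError"

print(outcome)
```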
Expected Result: Should show `DetachedInstanceError` when accessing `consumed_asset_events` on the detached object.

Test 2: Verify the fix prevents the error
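A minimal stand-alone sketch of this check, under the same assumptions: plain SQLAlchemy, in-memory SQLite, and hypothetical `Run` and `Event` models standing in for `DagRun` and `AssetEvent`. The only difference from the failing case is the `selectinload` option on the query.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship, selectinload

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    events = relationship("Event")  # default lazy="select"


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Run(id=1, events=[Event(id=1)]))
    session.commit()

with Session(engine) as session:
    # Eagerly load the relationship while the session is still attached.
    stmt = select(Run).options(selectinload(Run.events))
    run = session.execute(stmt).scalar_one()
    session.expunge_all()  # detach everything from the session
    events = run.events  # already populated, so no lazy load is attempted

print("SUCCESS", len(events))
```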
Expected Result: Should show `SUCCESS` because the relationship was eagerly loaded before detachment.

Test 3: Verify scoped session reuse (explains contamination mechanism)
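A minimal sketch of this check, assuming the scheduler's session factory behaves like a plain SQLAlchemy `scoped_session` with the default thread-local scope. The `ScopedSession` name and the in-memory SQLite engine are illustrative only.

```python
import threading

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite://")
ScopedSession = scoped_session(sessionmaker(bind=engine))

# Two calls in the same thread return the same Session object.
first = ScopedSession()
second = ScopedSession()
same_thread_reuse = first is second

# A different thread gets its own Session from the registry.
other_thread_sessions = []
worker = threading.Thread(
    target=lambda: other_thread_sessions.append(ScopedSession())
)
worker.start()
worker.join()
cross_thread_reuse = other_thread_sessions[0] is first

print(same_thread_reuse)
print(cross_thread_reuse)

ScopedSession.remove()  # discard this thread's session
```

Because every caller in one thread shares the same session, an `expunge_all()` issued by one code path also detaches objects that another code path in that thread still holds.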
Expected Result: Should show `True` for session reuse, confirming thread-local scoping that enables object contamination.

Testing Strategy
Why no new automated test added:
- `test_scheduler_passes_context_from_server_on_heartbeat_timeout`
- `session.expunge_all()` and object access across concurrent scheduler operations

Future Considerations
Long-term architectural improvement: Migrate to `back_populates` with `lazy="selectin"` to eliminate this class of issues entirely. This would prevent similar `DetachedInstanceError` issues across the codebase by making the relationship always eagerly loaded.

References:
Additional Context
This affects all DAG types (not just asset-triggered) since `consumed_asset_events` is initialized as an empty list on all `DagRun` objects during creation in `_create_orm_dagrun()`.

The fix uses `selectinload` (vs `joinedload`) because the heartbeat query can return multiple TaskInstances, making `selectinload` more efficient for bulk operations.
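To make the difference between the two strategies concrete, the sketch below records the SQL each one emits for the same query. It assumes plain SQLAlchemy with in-memory SQLite; `Run`, `Event`, and the `emitted_sql` helper are hypothetical names for illustration, not Airflow code.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, event, select
from sqlalchemy.orm import (
    Session,
    declarative_base,
    joinedload,
    relationship,
    selectinload,
)

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    events = relationship("Event")


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Run(id=i, events=[Event(), Event()]) for i in range(1, 4)])
    session.commit()

statements = []


@event.listens_for(engine, "before_cursor_execute")
def record(conn, cursor, statement, parameters, context, executemany):
    # Collect every SQL string sent to the database.
    statements.append(statement)


def emitted_sql(option):
    """Return the SQL statements emitted when loading all runs with `option`."""
    statements.clear()
    with Session(engine) as session:
        session.execute(select(Run).options(option)).unique().scalars().all()
    return list(statements)


joined_sql = emitted_sql(joinedload(Run.events))
selectin_sql = emitted_sql(selectinload(Run.events))

for sql in joined_sql + selectin_sql:
    print(sql, end="\n\n")
```

`joinedload` issues a single statement with a `LEFT OUTER JOIN`, so each parent row is repeated once per related row. `selectinload` issues the parent query plus one `IN` query for the related rows of all parents at once.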