Skip to content

Conversation

dstandish
Copy link
Contributor

Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1. To get "latest" DagRun I sort by the DagRun.id column. This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.

@dstandish dstandish self-assigned this May 22, 2025
@boring-cyborg boring-cyborg bot added the area:API Airflow's REST/HTTP API label May 22, 2025
dstandish added 5 commits May 22, 2025 19:27
Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
@dstandish dstandish force-pushed the fix-dags-get-endpoint branch from 9823d49 to 77b15d8 Compare May 23, 2025 02:27
@dstandish dstandish marked this pull request as ready for review May 23, 2025 02:34
@dstandish dstandish added the backport-to-v3-0-test Mark PR with this label to backport to v3-0-test branch label May 23, 2025
@dstandish dstandish added this to the Airflow 3.0.2 milestone May 23, 2025
Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks.

Just a few comments

Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't dive too deep, but generate_dag_with_latest_run_query is also using a latest_dag_run_per_dag_id_cte that seems worong, most likely also has the problem (joining over dag_run latest date, but actually multiple run with the same start date can happen, resulting in multiple rows per Dag which is not handled).

    latest_dag_run_per_dag_id_cte = (
        select(DagRun.dag_id, func.max(DagRun.start_date).label("start_date"))
        .where()
        .group_by(DagRun.dag_id)
        .cte()
    )

@dstandish
Copy link
Contributor Author

I didn't dive too deep, but generate_dag_with_latest_run_query is also using a latest_dag_run_per_dag_id_cte that seems worong, most likely also has the problem (joining over dag_run latest date, but actually multiple run with the same start date can happen, resulting in multiple rows per Dag which is not handled).

    latest_dag_run_per_dag_id_cte = (
        select(DagRun.dag_id, func.max(DagRun.start_date).label("start_date"))
        .where()
        .group_by(DagRun.dag_id)
        .cte()
    )

Right this function is wrong -- but I am not touching that one here -- one at a time.

@dstandish dstandish merged commit b994bb2 into apache:main May 28, 2025
109 of 176 checks passed
@dstandish dstandish deleted the fix-dags-get-endpoint branch May 28, 2025 17:54
github-actions bot pushed a commit that referenced this pull request May 28, 2025
Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
(cherry picked from commit b994bb2)

Co-authored-by: Daniel Standish <[email protected]>
Copy link

Backport successfully created: v3-0-test

Status Branch Result
v3-0-test PR Link

dstandish added a commit that referenced this pull request May 28, 2025
…51172)

Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
(cherry picked from commit b994bb2)

Co-authored-by: Daniel Standish <[email protected]>
kaxil pushed a commit that referenced this pull request Jun 3, 2025
…51172)

Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
(cherry picked from commit b994bb2)

Co-authored-by: Daniel Standish <[email protected]>
sanederchik pushed a commit to sanederchik/airflow that referenced this pull request Jun 7, 2025
Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
jose-lehmkuhl pushed a commit to jose-lehmkuhl/airflow that referenced this pull request Jul 11, 2025
Previously it was missing dag_id filter, but joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1.  To get "latest" DagRun I sort by the DagRun.id column.  This is a simplifying assumption that would be more performant than sorting by start_date, since there could be more than 1 dag run with a given start date.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API backport-to-v3-0-test Mark PR with this label to backport to v3-0-test branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants