Fix get dags query to not have join explosion #50984
Previously the query was missing a dag_id filter, but even with that filter, joining on start date would still be problematic. In this PR I refactor the query a bit so that all joins are guaranteed 1-1. To get the "latest" DagRun I sort by the DagRun.id column. This is a simplifying assumption: it should be more performant than sorting by start_date, and it avoids ties, since more than one dag run can share a given start date.
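To illustrate why picking the latest run by id keeps every join 1-1, here is a minimal sketch using the standard-library sqlite3 module rather than the project's SQLAlchemy models. The `dag` and `dag_run` tables, their columns, and the sample rows are simplified assumptions for illustration, and `MAX(id)` per dag stands in for "sort by DagRun.id"; this is not the query from the PR.

```python
import sqlite3

# Hypothetical, simplified schema: not the real Airflow tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag (dag_id TEXT PRIMARY KEY);
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        dag_id TEXT NOT NULL,
        start_date TEXT
    );
    INSERT INTO dag VALUES ('a'), ('b'), ('c');
    -- Runs 2 and 3 of dag 'a' share the same start_date; dag 'c' has no runs.
    INSERT INTO dag_run (id, dag_id, start_date) VALUES
        (1, 'a', '2025-01-01'),
        (2, 'a', '2025-01-02'),
        (3, 'a', '2025-01-02'),
        (4, 'b', '2025-01-01');
""")

# id is unique, so MAX(id) identifies exactly one run per dag,
# and joining back on that id can never multiply rows.
latest_by_id = conn.execute("""
    WITH latest AS (
        SELECT dag_id, MAX(id) AS max_id
        FROM dag_run
        GROUP BY dag_id
    )
    SELECT dag.dag_id, dag_run.id
    FROM dag
    LEFT JOIN latest ON latest.dag_id = dag.dag_id
    LEFT JOIN dag_run ON dag_run.id = latest.max_id
    ORDER BY dag.dag_id
""").fetchall()

print(latest_by_id)  # one row per dag, even though two runs of 'a' tie on start_date
```

Because the join key is a primary key, the result has exactly one row per dag regardless of how many runs share a start date. The trade-off is the assumption that a higher id means a later run.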
Nice, thanks.
Just a few comments.
I didn't dive too deep, but generate_dag_with_latest_run_query is also using a latest_dag_run_per_dag_id_cte that seems wrong and most likely has the same problem: it joins on the latest dag_run start date, but multiple runs can have the same start date, resulting in multiple rows per Dag, which is not handled.
latest_dag_run_per_dag_id_cte = (
    select(DagRun.dag_id, func.max(DagRun.start_date).label("start_date"))
    .where()
    .group_by(DagRun.dag_id)
    .cte()
)
Right, this function is wrong, but I am not touching that one here. One at a time.
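The tie problem described in this thread can be reproduced with a small sketch. It uses the standard-library sqlite3 module and a hypothetical, simplified `dag_run` table with made-up rows, not the actual Airflow models or the real generate_dag_with_latest_run_query code; it only shows how joining on MAX(start_date) returns more than one row per Dag when runs tie.

```python
import sqlite3

# Hypothetical, simplified table: not the real Airflow schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        dag_id TEXT NOT NULL,
        start_date TEXT
    );
    -- Runs 2 and 3 of dag 'a' share the same (latest) start_date.
    INSERT INTO dag_run (id, dag_id, start_date) VALUES
        (1, 'a', '2025-01-01'),
        (2, 'a', '2025-01-02'),
        (3, 'a', '2025-01-02'),
        (4, 'b', '2025-01-01');
""")

# Same shape as the CTE quoted above: latest start_date per dag,
# then a join back to dag_run on (dag_id, start_date).
rows_by_start_date = conn.execute("""
    WITH latest AS (
        SELECT dag_id, MAX(start_date) AS start_date
        FROM dag_run
        GROUP BY dag_id
    )
    SELECT dag_run.dag_id, dag_run.id
    FROM latest
    JOIN dag_run
      ON dag_run.dag_id = latest.dag_id
     AND dag_run.start_date = latest.start_date
    ORDER BY dag_run.dag_id, dag_run.id
""").fetchall()

# start_date is not unique, so dag 'a' matches both tied runs.
print(rows_by_start_date)
```

In this sample, dag 'a' comes back twice (once for run 2 and once for run 3), which is the "multiple rows per Dag" case the comment points out.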
(cherry picked from commit b994bb2) Co-authored-by: Daniel Standish <[email protected]>