Description
Apache Airflow version
3.0.6
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After upgrading to Airflow 3, the system started experiencing random DAG disappearances.
Parsing intervals are set up to be quite long, because we don't update DAGs between deploys.
The config for the parsing intervals has this setup:
dag_processor:
  dag_file_processor_timeout: 300
  min_file_process_interval: 7200
  parsing_processes: 1
  print_stats_interval: 300
  refresh_interval: 1800
  stale_dag_threshold: 1800
Log analysis showed that once the DAG processor receives a callback for any DAG, that DAG is soon marked as stale and disappears.
It may come back later, once min_file_process_interval kicks in, but that is not always the case.
Full log:
dag_processor.log.zip
Points of interest in the log:
The last stats line with no error for the affected DAG:
2025-09-04T20:02:57.426Z | {"log":"2025-09-04T20:02:57.426093587Z stdout F dags-folder process_etl_app_data.py 1 0 0.96s 2025-09-04T19:58:39"}
Then the first callback for it comes in:
2025-09-04T20:05:08.722Z | {"log":"2025-09-04T20:05:08.722840445Z stdout F [2025-09-04T20:05:08.722+0000] {manager.py:464} DEBUG - Queuing TaskCallbackRequest CallbackRequest: filepath='process_etl_app_data.py' bundle_name='dags-folder' bundle_version=None msg=\"{'DAG Id': 'ds_etl', 'Task Id': 'etl_app_data', 'Run Id': 'manual__2025-09-04T20:00:00+00:00', 'Hostname': '10.4.142.168', 'External Executor Id': '5547a318-f6cc-4c02-92f5-90cbbb629e22'}\" ti=TaskInstance(id=UUID('01991650-8c36-70c5-a85b-44f6b572fe0f'), task_id='etl_app_data', dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', try_number=1, map_index=-1, hostname='10.4.142.168', context_carrier=None) task_callback_type=None context_from_server=TIRunContext(dag_run=DagRun(dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', logical_date=datetime.datetime(2025, 9, 4, 20, 0, tzinfo=Timezone('UTC')), data_interval_start=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), data_interval_end=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), run_after=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), start_date=datetime.datetime(2025, 9, 4, 20, 0, 1, 176556, tzinfo=Timezone('UTC')), end_date=None, clear_number=0, run_type=<DagRunType.MANUAL: 'manual'>, state=<DagRunState.RUNNING: 'running'>, conf={}, consumed_asset_events=[]), task_reschedule_count=0, max_tries=7, variables=[], connections=[], upstream_map_indexes=None, next_method=None, next_kwargs=None, xcom_keys_to_clear=[], should_retry=False) type='TaskCallbackRequest'"}
Then, at the next stats print, there is an error for this file (even though the file has not changed at all):
2025-09-04T20:12:58.040Z | {"log":"2025-09-04T20:12:58.040610948Z stdout F dags-folder process_etl_app_data.py 0 1 1.01s 2025-09-04T20:12:50"}
Eventually the DAG from that file disappears:
2025-09-04T20:57:53.765Z | {"log":"2025-09-04T20:57:53.765305682Z stdout F [2025-09-04T20:57:53.764+0000] {manager.py:310} INFO - DAG ds_etl is missing and will be deactivated."}
Further analysis showed that the DAG processor seems to reuse the same parsing mechanism for callback execution: it updates the file's parsing time but does not update the DAG's parsing time, so the DAG eventually becomes stale.
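Based on the timestamps above, the staleness check appears to compare the file-level processing time (which a callback run refreshes) against the DAG-level parse time (which it does not). A rough, illustrative Python sketch of that comparison, using the hypothetical names file_last_finish_time and dag_last_parsed_time rather than the actual dag-processor internals:

from datetime import datetime, timedelta, timezone

# stale_dag_threshold from the config above
STALE_DAG_THRESHOLD = timedelta(seconds=1800)


def looks_stale(file_last_finish_time: datetime, dag_last_parsed_time: datetime) -> bool:
    # Illustrative check only: a DAG whose own parse timestamp lags the latest
    # processing time of its file by more than stale_dag_threshold would be
    # treated as missing and deactivated.
    return dag_last_parsed_time + STALE_DAG_THRESHOLD < file_last_finish_time


# Timeline from the log: the DAG itself was last parsed at 19:58:39, while callback
# executions refresh only the file's processing time.
dag_parsed = datetime(2025, 9, 4, 19, 58, 39, tzinfo=timezone.utc)
print(looks_stale(datetime(2025, 9, 4, 20, 12, 50, tzinfo=timezone.utc), dag_parsed))  # False, gap ~14 min
print(looks_stale(datetime(2025, 9, 4, 20, 40, 0, tzinfo=timezone.utc), dag_parsed))   # True, gap > 30 min (hypothetical later callback run)

A gap that keeps growing past stale_dag_threshold would explain the "missing and will be deactivated" message at 20:57:53.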
What you think should happen instead?
Processing callbacks should not affect DAG state.
And I think we should still be able to configure long re-parsing intervals for DAGs that are rarely updated.
How to reproduce
- Have a DAG with callbacks (a minimal sketch follows this list)
- Set min_file_process_interval higher than stale_dag_threshold and deploy Airflow
- Execute the DAG, so callbacks are executed
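A minimal sketch of such a DAG, assuming the standard provider's PythonOperator; the DAG id, task id, and callback body are placeholders, not taken from the actual deployment:

from datetime import datetime

from airflow import DAG
from airflow.providers.standard.operators.python import PythonOperator


def notify(context):
    # Placeholder callback; in the reported setup, task callbacks arrive at the
    # DAG processor as TaskCallbackRequest (see the log excerpt above).
    print(f"callback for {context['ti'].task_id}")


def do_work():
    print("doing work")


with DAG(
    dag_id="callback_stale_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # trigger manually, like the manual__... run in the log
    catchup=False,
):
    PythonOperator(
        task_id="work",
        python_callable=do_work,
        on_success_callback=notify,
        on_failure_callback=notify,
    )

With the config above (min_file_process_interval=7200, stale_dag_threshold=1800), triggering this DAG manually and letting its callbacks run should be enough to reproduce the "missing and will be deactivated" behaviour.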
Operating System
Debian Bookworm
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-messaging==1.0.5
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-fab==2.4.1
apache-airflow-providers-http==5.3.3
apache-airflow-providers-postgres==6.2.3
apache-airflow-providers-redis==4.2.0
apache-airflow-providers-slack==9.1.4
apache-airflow-providers-smtp==2.2.0
apache-airflow-providers-standard==1.6.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Helm chart deployed on AWS EKS cluster
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct