DAGs disappear sometime after callback execution #55315

@kirillsights

Description

Apache Airflow version

3.0.6

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After upgrading to Airflow 3, our system started experiencing random DAG disappearances.
Parsing intervals are set up to be quite long, because we don't update DAGs between deploys.
The interval configuration looks like this:

  dag_processor:
    dag_file_processor_timeout: 300
    min_file_process_interval: 7200
    parsing_processes: 1
    print_stats_interval: 300
    refresh_interval: 1800
    stale_dag_threshold: 1800

Log analysis showed that once the DAG processor receives a callback for any DAG, that DAG will soon be marked as stale and disappear.
It may come back later, once min_file_process_interval kicks in, but that's not always the case.
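
For context, here is a minimal sketch of the timing mismatch as I understand it; the threshold values come from the config above, but the function itself is illustrative and not Airflow's actual code:

  # Illustrative sketch of the staleness window; not Airflow's real code.
  from datetime import datetime, timedelta, timezone

  STALE_DAG_THRESHOLD = timedelta(seconds=1800)        # stale_dag_threshold
  MIN_FILE_PROCESS_INTERVAL = timedelta(seconds=7200)  # min_file_process_interval

  def is_stale(dag_last_parsed_time: datetime) -> bool:
      # A DAG whose last_parsed_time is older than the threshold gets
      # deactivated ("DAG ... is missing and will be deactivated").
      return datetime.now(timezone.utc) - dag_last_parsed_time > STALE_DAG_THRESHOLD

  # Because min_file_process_interval (2h) is far larger than
  # stale_dag_threshold (30min), anything that delays the next real parse
  # without refreshing last_parsed_time opens a window in which the DAG
  # is wrongly treated as stale.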

Full log:
dag_processor.log.zip

Points of interest in the log:

The last stats line with no errors for the affected file:

2025-09-04T20:02:57.426Z | {"log":"2025-09-04T20:02:57.426093587Z stdout F dags-folder process_etl_app_data.py 1 0 0.96s 2025-09-04T19:58:39"}

Then the first callback for it comes in:

2025-09-04T20:05:08.722Z | {"log":"2025-09-04T20:05:08.722840445Z stdout F [2025-09-04T20:05:08.722+0000] {manager.py:464} DEBUG - Queuing TaskCallbackRequest CallbackRequest: filepath='process_etl_app_data.py' bundle_name='dags-folder' bundle_version=None msg=\"{'DAG Id': 'ds_etl', 'Task Id': 'etl_app_data', 'Run Id': 'manual__2025-09-04T20:00:00+00:00', 'Hostname': '10.4.142.168', 'External Executor Id': '5547a318-f6cc-4c02-92f5-90cbbb629e22'}\" ti=TaskInstance(id=UUID('01991650-8c36-70c5-a85b-44f6b572fe0f'), task_id='etl_app_data', dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', try_number=1, map_index=-1, hostname='10.4.142.168', context_carrier=None) task_callback_type=None context_from_server=TIRunContext(dag_run=DagRun(dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', logical_date=datetime.datetime(2025, 9, 4, 20, 0, tzinfo=Timezone('UTC')), data_interval_start=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), data_interval_end=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), run_after=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), start_date=datetime.datetime(2025, 9, 4, 20, 0, 1, 176556, tzinfo=Timezone('UTC')), end_date=None, clear_number=0, run_type=<DagRunType.MANUAL: 'manual'>, state=<DagRunState.RUNNING: 'running'>, conf={}, consumed_asset_events=[]), task_reschedule_count=0, max_tries=7, variables=[], connections=[], upstream_map_indexes=None, next_method=None, next_kwargs=None, xcom_keys_to_clear=[], should_retry=False) type='TaskCallbackRequest'"}

Then, at the next stats print, the same file shows an error (even though it has not changed at all):

2025-09-04T20:12:58.040Z | {"log":"2025-09-04T20:12:58.040610948Z stdout F dags-folder process_etl_app_data.py 0 1 1.01s 2025-09-04T20:12:50"}

Eventually the DAG from that file disappears:

2025-09-04T20:57:53.765Z | {"log":"2025-09-04T20:57:53.765305682Z stdout F [2025-09-04T20:57:53.764+0000] {manager.py:310} INFO - DAG ds_etl is missing and will be deactivated."}

Further analysis showed that the DAG processor seems to reuse the same parsing mechanism for callback execution: it updates the file's parsing time but not the DAG's parsing time, so the DAG eventually becomes stale.
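
A hypothetical sketch of that interaction; every name below is an illustrative stand-in, not the actual Airflow internals:

  import time

  # Made-up bookkeeping to illustrate the suspected flow.
  file_last_finish: dict[str, float] = {}  # per-file parse time (resets reparse timer)
  dag_last_parsed: dict[str, float] = {}   # per-DAG freshness (drives staleness check)

  def process_file(file_path: str, callback_requests: list | None = None) -> None:
      if callback_requests:
          # Callbacks reuse the parsing machinery: the *file* bookkeeping
          # is refreshed, so the next real parse is pushed back...
          for cb in callback_requests:
              cb()
          file_last_finish[file_path] = time.time()
          # ...but dag_last_parsed is NOT touched, so the staleness check
          # eventually deactivates the DAGs defined in this file.
      else:
          dag_ids = ["ds_etl"]  # pretend the file was actually parsed
          file_last_finish[file_path] = time.time()
          for dag_id in dag_ids:
              dag_last_parsed[dag_id] = time.time()  # only a real parse does this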

What you think should happen instead?

Processing callbacks should not affect DAG state.
I also think it should remain possible to configure long reparse intervals for rarely changing DAGs. One possible direction is sketched below.
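
Purely for illustration, reusing the made-up names from the sketch above (not the actual Airflow code), a callback-only invocation could simply skip the parse bookkeeping:

  # Illustrative fix direction; names are made up, not Airflow internals.
  def process_file(file_path: str, callback_requests: list | None = None) -> None:
      if callback_requests:
          for cb in callback_requests:
              cb()
          return  # callbacks are not a parse: leave all bookkeeping alone
      ...  # normal parse path: refresh both file and DAG timestamps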

How to reproduce

  • Have a DAG with callbacks (a minimal sketch follows this list)
  • Set min_file_process_interval higher than stale_dag_threshold and deploy Airflow
  • Trigger the DAG so that its callbacks execute
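
A minimal DAG for the first step might look like this; the imports follow Airflow 3 with the standard provider listed below, but the DAG itself is illustrative:

  # Minimal illustrative reproducer; any task- or DAG-level callback
  # should do. Ids mirror the ones from the log above.
  from datetime import datetime

  from airflow.sdk import DAG
  from airflow.providers.standard.operators.python import PythonOperator

  def _notify(context):
      # Runs on the DAG processor as a TaskCallbackRequest.
      print(f"callback for {context['dag'].dag_id}")

  with DAG(dag_id="ds_etl", start_date=datetime(2025, 1, 1), schedule=None):
      PythonOperator(
          task_id="etl_app_data",
          python_callable=lambda: None,
          on_success_callback=_notify,
          on_failure_callback=_notify,
      )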

Operating System

Debian Bookworm

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-messaging==1.0.5
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-fab==2.4.1
apache-airflow-providers-http==5.3.3
apache-airflow-providers-postgres==6.2.3
apache-airflow-providers-redis==4.2.0
apache-airflow-providers-slack==9.1.4
apache-airflow-providers-smtp==2.2.0
apache-airflow-providers-standard==1.6.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Helm chart deployed on an AWS EKS cluster

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

affected_version:3.0, area:core, kind:bug, priority:high
