Tip: set depends_on_past=True in Airflow when creating a forecast model pipeline


💡 When creating a forecasting model pipeline in Airflow, set depends_on_past=True.


Why?

Forecasting models predict future data based on patterns and trends observed in historical data. Their output inherently depends on past observations, so any gap or deviation in the expected sequence can hurt their accuracy.

One way to protect the integrity of that sequence in Apache Airflow is to set the depends_on_past argument to True. With this setting, a task instance runs only if the instance of the same task in the previous DAG run succeeded or was skipped (the very first run is allowed to start without a predecessor). If one day's run fails, the same task in the following runs waits until the failure is fixed and cleared, so the forecasting model is not trained or scored on a history with a gap in it.
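
To make the rule concrete, here is a toy simulation in plain Python. It is not Airflow code and does not call any Airflow API: the function names (can_run, simulate) and the "blocked" label are made up for illustration, and the only assumption taken from Airflow is the documented rule that, with depends_on_past=True, a task instance runs only if the previous instance of the same task succeeded or was skipped.

```python
from typing import List, Optional


def can_run(previous_state: Optional[str], depends_on_past: bool) -> bool:
    """Simplified model of the depends_on_past rule for a single task."""
    if not depends_on_past:
        return True
    if previous_state is None:
        # The first run has no previous instance and is allowed to start.
        return True
    return previous_state in {"success", "skipped"}


def simulate(partition_ok: List[bool], depends_on_past: bool) -> List[str]:
    """Walk through consecutive daily runs of one task.

    partition_ok[i] says whether the data partition for day i is healthy.
    A run that is allowed to start fails when its partition is broken;
    a run that is not allowed to start is labeled "blocked" (hypothetical
    label, used here only to make the output readable).
    """
    states: List[str] = []
    for ok in partition_ok:
        previous = states[-1] if states else None
        if not can_run(previous, depends_on_past):
            states.append("blocked")
        else:
            states.append("success" if ok else "failed")
    return states


# Day 2 has a broken partition.
days = [True, False, True, True]

print(simulate(days, depends_on_past=True))
# ['success', 'failed', 'blocked', 'blocked']

print(simulate(days, depends_on_past=False))
# ['success', 'failed', 'success', 'success']
```

In the second output, days 3 and 4 report success even though the history they rely on contains a failed day, which is exactly the silent case to worry about.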

If you don't set depends_on_past=True and you have a problem with one of the partitions of the data used by your model, one of two things will happen in the following days:

  • Worst-case scenario: your workflow will run without any issues being reported. This is a sign that you forgot to add sensors that check the availability of historical data and, chances are, your features are being computed from incomplete data, to say the least.

  • Alternatively, subsequent executions of your workflow will also fail, bombarding your on-call colleague's PagerDuty with alerts. And when the problem is finally fixed, you still have to remember to clear and restart all the previously failed executions.
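
The availability check mentioned in the first bullet can be sketched as a small pure-Python helper. This is only an illustration under assumptions: daily partitions keyed by date, a fixed lookback window, and hypothetical names (missing_partitions, history_is_complete). How you list the available partitions depends on your storage, and a function like history_is_complete could serve as the condition a sensor polls.

```python
from datetime import date, timedelta
from typing import List, Set


def missing_partitions(available: Set[date], run_date: date, lookback_days: int) -> List[date]:
    """Return the dates in the lookback window that have no partition.

    The window covers lookback_days days ending at run_date (inclusive).
    """
    window = [run_date - timedelta(days=offset) for offset in range(lookback_days)]
    return sorted(day for day in window if day not in available)


def history_is_complete(available: Set[date], run_date: date, lookback_days: int) -> bool:
    """True only when every partition in the lookback window exists."""
    return not missing_partitions(available, run_date, lookback_days)


# Hypothetical example: the partition for 2024-01-03 never landed.
available = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}

print(missing_partitions(available, date(2024, 1, 4), lookback_days=4))
# [datetime.date(2024, 1, 3)]

print(history_is_complete(available, date(2024, 1, 4), lookback_days=4))
# False
```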

How?

For a specific task, most probably a sensor:

task = SomeSensor(
    task_id='task',
    depends_on_past=True,
    # ...
)

Or set it in your DAG's default_args to enable depends_on_past for all of its tasks:

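from airflow import DAG

# default_args are applied to every task in the DAG unless a task overrides them
default_args = {
    'depends_on_past': True,
    # ...
}

dag = DAG(
    dag_id="time_series_pipeline",
    default_args=default_args,
    # ...
)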