
Apache Airflow: ETL and ELT Best Practices

A practical guide to building reliable data pipelines with Airflow—from ETL vs ELT to atomic tasks and data quality.

Data engineering has evolved quickly: more tools, more use cases, and higher expectations. No matter the end goal—analytics, dashboards, or AI—data must first be extracted from sources, transformed into the right shape, and loaded into a warehouse or lake. ETL and ELT pipelines are the foundation of that flow. Apache Airflow, originally built for these pipelines, has become the open-source standard for orchestrating data workflows: in the 2024 State of Apache Airflow Report, 95% of surveyed users reported using Airflow for ETL and ELT. This post summarizes key concepts and best practices from Astronomer’s guide, Best Practices for ETL and ELT Pipelines with Apache Airflow.

ETL vs ELT (and ETLT)

ETL (Extract, Transform, Load) means you transform data before loading it into the target. Transformation happens “in-flight”—often in the orchestration layer (e.g. Airflow workers) or an external service. You might do this when you only need a subset of the data, want to combine multiple sources before loading, or need to change format (e.g. creating embeddings before loading into a vector store). The target receives only what you choose to persist.

ELT (Extract, Load, Transform) means you load the full raw data into the target first, then transform it inside the warehouse or lake. Modern cloud warehouses and lakes support this well: storage is cheap, and they offer strong “pushdown” compute (SQL, Spark, etc.). ELT is a natural fit when you want to keep all raw data and use the target’s engine for heavy transformations.

In practice, many teams use a mix—ETLT: light transformation in-flight, load, then further transformation in the target. Airflow can orchestrate ETL, ELT, and ETLT patterns together, so you can pick the right approach per source or use case.
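The extract → transform → load flow can be sketched with plain functions. This is a minimal illustration with in-memory stand-ins for a source and a target, not a real DAG—in Airflow, each step would typically be its own @task-decorated function:

```python
# Minimal ETL sketch. The source rows and the target dict are stand-ins;
# in an Airflow DAG each function would be a separate task.

def extract(source_rows):
    """E: pull raw records from the source system."""
    return list(source_rows)

def transform(rows):
    """T: keep only active users and reshape to the target schema (in-flight)."""
    return [
        {"id": r["id"], "email": r["email"].lower()}
        for r in rows
        if r.get("active")
    ]

def load(target, rows):
    """L: persist only the transformed subset into the target."""
    for r in rows:
        target[r["id"]] = r
    return len(rows)

source = [
    {"id": 1, "email": "A@EXAMPLE.COM", "active": True},
    {"id": 2, "email": "b@example.com", "active": False},
]
warehouse = {}
loaded = load(warehouse, transform(extract(source)))  # only 1 row persists
```

In an ELT version, `transform` would run inside the target after loading the full raw data; in ETLT, a light version runs in-flight and a heavier one runs in the warehouse.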

Passing data between tasks: XCom vs external storage

In pipelines, one task often needs output from another. Airflow’s built-in mechanism for this is XCom (cross-communication): small pieces of data (e.g. file paths, config) are stored in the Airflow metadata database. Standard XCom is meant for small, serializable data (e.g. JSON). For larger payloads, use a custom XCom backend (e.g. Object Storage backend in Airflow 2.9+) so data is written to S3, GCS, or Azure Blob and only a reference is kept in the metadata DB. The alternative is to explicitly write and read from external storage inside your tasks—more control, but you handle paths and serialization yourself. For production pipelines that pass non-trivial data between tasks, a custom XCom backend (or explicit external storage) is recommended.
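As of Airflow 2.9 with the common.io provider, the Object Storage XCom backend can be enabled through configuration alone—roughly like the following environment variables (the bucket name and connection id here are placeholders):

```shell
# Enable the Object Storage XCom backend (Airflow 2.9+, common.io provider).
export AIRFLOW__CORE__XCOM_BACKEND="airflow.providers.common.io.xcom.backend.XComObjectStorageBackend"
# Where large XCom values are written; "aws_default" and "my-bucket" are placeholders.
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH="s3://aws_default@my-bucket/xcom"
# Payloads above this size (bytes) go to object storage; smaller ones stay in the metadata DB.
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD="1048576"
```

With this in place, tasks keep using plain returns and XCom pulls—only the storage location changes under the hood.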

Where to run transformations

The “T” in ETL/ELT can run in several places. Airflow workers can run Python (or other languages via KubernetesPodOperator) for moderate-sized data. If data or compute needs exceed worker capacity, use external compute: Spark, Ray, Databricks, or the target system itself. In ELT, transformation usually runs in the target (e.g. Snowflake, BigQuery) via SQL or tools like dbt; Astronomer’s Cosmos lets you run dbt projects as Airflow DAGs with each model as a task. Choosing ETL vs ELT shapes where transformation runs—ETL often uses workers or external clusters; ELT uses the warehouse or lake.
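For ELT, "running the transform in the target" often just means generating SQL and handing it to the warehouse. A sketch, with illustrative table names—in a real DAG the string would go to an operator such as SQLExecuteQueryOperator, or be expressed as a dbt model:

```python
# ELT pushdown sketch: express the transform as SQL so it executes inside
# the warehouse engine rather than on Airflow workers. Table names are
# illustrative placeholders.

def build_transform_sql(raw_table: str, target_table: str) -> str:
    return (
        f"CREATE OR REPLACE TABLE {target_table} AS\n"
        f"SELECT id, LOWER(email) AS email\n"
        f"FROM {raw_table}\n"
        f"WHERE active = TRUE"
    )

sql = build_transform_sql("raw.users", "analytics.users_clean")
```

The worker only renders and submits the statement; the heavy lifting stays in the target's compute.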

Data quality checks

Relying on users to report bad data is a last resort. Build in data quality checks as tasks. Two useful types: stopping checks (fail the pipeline so bad data doesn’t propagate—e.g. duplicate primary keys) and warning checks (alert but allow the run to continue—e.g. high null rate in a column). Common placements: right after extract (validate source data) and after load (SQL-based checks in the target). You can implement checks with Airflow’s SQL check operators, custom Python, or tools like Great Expectations or Soda. Design for the worst case: what could go wrong, and what would be most costly downstream? Avoid alert fatigue by focusing on checks that truly require action.
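The two check types can be sketched as plain functions—inside Airflow each would be its own task (or an SQL check operator), and the row data here is a stand-in:

```python
# Stopping check vs warning check, as plain functions for illustration.

def stopping_check_unique_keys(rows, key="id"):
    """Stopping check: raise so the run fails and bad data can't propagate."""
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in primary key column {key!r}")
    return True

def warning_check_null_rate(rows, column, max_rate=0.1):
    """Warning check: report the problem but let the run continue."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    rate = nulls / len(rows) if rows else 0.0
    return {"column": column, "null_rate": rate, "ok": rate <= max_rate}

rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
unique_ok = stopping_check_unique_keys(rows)          # passes
null_report = warning_check_null_rate(rows, "email")  # flags 50% nulls
```

A stopping check raising an exception is exactly how an Airflow task fails; a warning check would instead emit an alert (e.g. via a notifier) and return normally.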

Three core DAG best practices

  • Atomicity: Each task should do one clear thing (e.g. one extract, one transform, one load). That gives you visibility and lets you retry only the failed step.
  • Idempotency: Same inputs should produce the same outputs. Use run identifiers (e.g. logical_date) to partition data so re-runs don’t duplicate or overwrite incorrectly.
  • Modularity: Reuse logic across DAGs—shared functions, custom operators, and config (e.g. SQL in separate files). Treat DAGs like config; keep heavy logic out of the top level.
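Idempotency in particular is easy to make concrete: derive the output location from the run's logical date, so re-running an interval overwrites its own partition instead of duplicating data. The bucket and prefix below are placeholders; in a task you would read the logical date from the Airflow context:

```python
# Idempotency sketch: the same run interval always maps to the same path.
from datetime import datetime

def partition_path(prefix: str, logical_date: datetime) -> str:
    return f"{prefix}/ds={logical_date:%Y-%m-%d}/data.parquet"

path = partition_path("s3://my-bucket/orders", datetime(2024, 5, 1))
# Re-running the 2024-05-01 interval targets the same partition every time.
```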

Avoid top-level code and set retries

Airflow parses DAG files frequently (e.g. every 30 seconds). Top-level code runs on every parse—so no database calls or heavy work at import time. Use dynamic task mapping when you need a variable number of tasks (e.g. one task per file or table) so task generation happens at runtime without top-level I/O. For resilience, set retries and retry_delay at the DAG or task level so transient failures (e.g. database concurrency) don’t fail the run immediately.
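Retry settings are usually set once in a DAG's default_args; the values below are illustrative. Note the dict itself is cheap to evaluate at parse time—the expensive work belongs inside task functions, not at import:

```python
# Retry configuration as it would appear in a DAG's default_args
# (values are illustrative, tune them to your failure modes).
from datetime import timedelta

default_args = {
    "retries": 3,                         # retry transient failures up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
    "retry_exponential_backoff": True,    # grow the delay with each retry
}
```

These keys map to standard Airflow task parameters, so they can also be overridden per task when one step needs different behavior.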

Treat DAGs like production code

Store DAGs in version control, use at least dev and production environments, and run tests. Unit tests for custom operators and Python logic; DAG validation tests (e.g. via DagBag) to enforce standards (allowed operators, required tags). Use CI/CD to deploy from a main branch after tests pass. Scaling also matters: tune DAG parse timeouts, max_active_runs, and pools so your Airflow instance keeps up with pipeline growth.
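A DAG validation test can be reduced to a simple rule over DAG metadata. In a real suite you would load DAGs with airflow.models.DagBag and assert on import_errors and each dag's tags; here plain dicts stand in for DAG objects so the rule itself is visible, and the required tag prefixes are a hypothetical org standard:

```python
# Simplified sketch of a DAG validation test (plain dicts instead of DagBag).

REQUIRED_TAG_PREFIXES = {"team", "tier"}  # hypothetical org-wide standard

def validate_dags(dags):
    """Return ids of DAGs missing any required tag prefix."""
    failures = []
    for dag in dags:
        prefixes = {t.split(":")[0] for t in dag["tags"]}
        if not REQUIRED_TAG_PREFIXES <= prefixes:
            failures.append(dag["dag_id"])
    return failures

dags = [
    {"dag_id": "orders_etl", "tags": ["team:data", "tier:prod"]},
    {"dag_id": "adhoc_backfill", "tags": ["team:data"]},  # missing "tier"
]
bad = validate_dags(dags)  # flags adhoc_backfill
```

Wired into CI, a check like this rejects non-conforming DAGs before they reach production.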

Features that help ETL and ELT

Dynamic task mapping (.expand() / .partial()) creates a variable number of task instances per run (e.g. one task per file or partition), improving observability and retries. Task groups modularize repeated patterns (e.g. extract → transform → load) into reusable blocks. Dataset-based scheduling (Airflow 2.4+) triggers DAGs when upstream datasets are updated, so downstream pipelines run when data is ready instead of on a fixed schedule. For large data between tasks, use a custom XCom backend (e.g. Object Storage) so the metadata database isn’t overloaded.
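Conceptually, dynamic task mapping is a runtime fan-out: in Airflow you would write process.expand(path=list_files()), and each file becomes its own task instance with its own log and retries. Plain functions can model the shape (paths here are placeholders):

```python
# Conceptual model of dynamic task mapping: one unit of work per file.
# In Airflow: process.expand(path=list_files()) — a failure in one file
# retries only that mapped task instance, not the whole batch.

def list_files():
    """Stand-in for an upstream task that discovers inputs at runtime."""
    return ["s3://bucket/2024-05-01.csv", "s3://bucket/2024-05-02.csv"]

def process(path: str) -> dict:
    """Stand-in for the mapped per-file task."""
    return {"path": path, "status": "done"}

results = [process(p) for p in list_files()]  # one result per discovered file
```

The observability win is exactly this decomposition: per-item status instead of one opaque loop inside a single task.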

Pre-flight checklist

Before coding, clarify: who consumes the data and how fresh it must be; what quality and governance rules apply; where the source and target live and their formats; whether you need external storage between steps; and what tools are already in use. Answering these (even partially) helps you choose ETL vs ELT, where to run transforms, and how to integrate with existing systems like Snowflake, dbt, or Fivetran.

Need help orchestrating data pipelines?

Kundul helps teams build dependable orchestration, Snowflake data engineering workflows, dbt delivery pipelines, and production-ready data operations.

Book a call

Source and further reading

This post is based on Apache Airflow® Best Practices: ETL and ELT Pipelines (Astronomer). The guide covers definitions, pre-flight checklists, XCom vs external storage, operators and decorators, data quality, DAG design, testing, dynamic task mapping, programmatic DAGs, task groups, dataset scheduling, and key provider packages for clouds and warehouses.