dlt (data load tool, always lowercase) is a Python library you install with pip install dlt. It is not a platform, not a service, not a container you deploy. It is a library that turns Python generators into production-grade data loading pipelines, handling the parts you’d otherwise hand-roll: pagination state, schema creation and migration, nested data flattening, deduplication, and crash-safe resumption.
You write a function that yields Python dicts. dlt figures out the schema, creates the tables, tracks what it already loaded, and writes the data into a destination — a warehouse, a database, or a set of files on cloud storage. If it crashes mid-load, the next run picks up where it left off without duplicating data.
Don't confuse dlt (this library, by dlthub) with DLT (Delta Live Tables, the Databricks transformation framework). They solve different problems at different layers. This note is entirely about the Python library.
A Minimal Working Example
import dlt
@dlt.resource(write_disposition="replace")
def pokemon():
import requests
resp = requests.get("https://pokeapi.co/api/v2/pokemon?limit=50")
yield resp.json()["results"] # a list of dicts like {"name": "bulbasaur", "url": "..."}
pipeline = dlt.pipeline(
pipeline_name="pokemon_pipeline",
destination="duckdb",
dataset_name="raw",
)
load_info = pipeline.run(pokemon())
print(load_info)pip install "dlt[duckdb]" and run it. dlt will create a local DuckDB file, create a table called pokemon in a schema called raw, infer that the table has two text columns (name, url), and load 50 rows. Every subsequent section explains what happened between pipeline.run() and the data appearing in DuckDB.
What Happens When You Call pipeline.run()
The Working Directory
When you create a pipeline with pipeline_name="pokemon_pipeline", dlt creates a working directory at ~/.dlt/pipelines/pokemon_pipeline/. This directory holds everything dlt needs across runs:
~/.dlt/pipelines/pokemon_pipeline/
├── extract/ # raw data pulled from your generators
├── normalize/ # data rewritten in the destination's preferred format
├── load/ # data packaged and ready to send to the destination
├── completed_jobs/ # records of finished loads
└── state/ # pipeline state (incremental cursors, schema versions)
This is local, ephemeral storage — a scratch area. The authoritative state also gets written into the destination itself (more on that below), so you can blow away this directory and dlt will recover from the destination on the next run.
The Three Stages
pipeline.run() executes three stages in strict sequence. Each stage reads the output of the previous one from disk.
Extract runs your Python generator(s) and serialises their output into JSONL files in the extract/ folder. JSONL (JSON Lines) is a simple format: one JSON object per line, no enclosing array, no commas between records. dlt uses it as its internal interchange format because it’s streamable — you don’t need to hold the entire dataset in memory, and you can append to it without rewriting the file. Each extraction run creates a load package — a folder identified by a unique timestamp-based load_id that groups all the files from that run.
Normalize reads the JSONL files, inspects every record, and does three things. First, it infers (or updates) the schema — the set of table names, column names, and column types. Second, it flattens nested structures (explained in detail in the schema section below). Third, it rewrites the data into the destination’s preferred format. For DuckDB that’s still JSONL; for Snowflake with staging it’s Parquet . The output goes into normalize/.
Load reads the normalized files, applies any necessary DDL to the destination (CREATE TABLE, ALTER TABLE ADD COLUMN), and uploads the data. The load stage splits work into load jobs — one per file — that can execute in parallel. Once every job succeeds, the load package is marked complete.
What Gets Written to the Destination
After that minimal example, DuckDB contains not just your pokemon table but also three internal tables that dlt manages:
| Table | Purpose |
|---|---|
_dlt_version | One row per schema version. Stores the full schema as JSON so dlt can detect drift on the next run. |
_dlt_loads | One row per load package. Records the load_id, status, and timestamp. Useful for auditing and for downstream incremental reads. |
_dlt_pipeline_state | A JSON blob containing all incremental cursors and any user-defined state for this pipeline. This is how a fresh machine can resume — dlt reads this table on startup if local state is missing. |
Every data table also gets a _dlt_load_id column (links to _dlt_loads) and a _dlt_id column (a hash-based unique row identifier).
Resources
A resource is a decorated Python function that yields data. It is the atomic unit of a dlt pipeline — one resource produces one destination table. The @dlt.resource decorator attaches metadata that controls how dlt handles the data:
@dlt.resource(
name="pull_requests", # destination table name (defaults to function name)
primary_key="id", # used for merge deduplication
write_disposition="merge", # how to handle existing data (see next section)
)
def pull_requests():
for page in paginate("https://api.github.com/repos/org/repo/pulls"):
yield page # page is a list[dict]A resource can be a sync generator (yield), an async generator (async for ... yield), or a plain function returning a list. Yielding lists of dicts (pages) rather than individual dicts is significantly faster because it amortises Python overhead and dlt’s internal buffer management per yield.
Sources
A source groups related resources under a shared configuration namespace. It’s how you bundle “all the GitHub resources” or “all the Stripe resources” into one logical unit:
@dlt.source
def github(token: str = dlt.secrets.value):
return [pull_requests(), issues(), stargazers()]
pipeline.run(github())The dlt.secrets.value sentinel tells dlt to resolve token from its configuration system (explained later). The source name (github) becomes the config section name, so credentials are scoped under [sources.github].
Write Dispositions
Every resource has a write disposition that controls what happens when data is loaded into a table that already exists.
append inserts all new rows unconditionally. No deduplication. Good for immutable event streams where each run pulls a fresh, non-overlapping batch.
replace drops (or truncates) the target table and recreates it from scratch on every run. Simple and correct for small reference tables or snapshots where you always want the full current state.
merge deduplicates incoming rows against existing rows using the primary_key (or merge_key). This is the most complex disposition and supports three strategies:
delete+insert (default) — for each incoming row, DELETE any existing row with the same key, then INSERT the new row. Effectively an overwrite-by-key.
scd2 (Slowly Changing Dimension Type 2) — never deletes existing rows. Instead, dlt adds two columns (_dlt_valid_from, _dlt_valid_to) and “closes” the old row by setting its _dlt_valid_to to the current load timestamp, then inserts the new version with _dlt_valid_to = NULL. This preserves full history.
upsert — a true UPDATE-or-INSERT using the destination’s native MERGE statement (where supported).
@dlt.resource(
primary_key="id",
write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers():
yield from fetch_all_customers()Incremental Loading
Without incremental loading, every run extracts the full dataset. For large or append-heavy sources, you need to track where the last run left off and only extract new or updated records. dlt handles this with dlt.sources.incremental, a cursor-tracking object you bind to a field in your data:
@dlt.resource(primary_key="id", write_disposition="merge")
def events(
updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
for page in paginate("/events", params={"since": updated_at.last_value}):
yield pageHere’s what happens across runs:
First run — updated_at.last_value returns "2024-01-01T00:00:00Z" (the initial_value). Your code passes it to the API as a filter. After all pages are yielded, dlt scans the updated_at field across all records, applies a last_value_func (defaults to max), and stores the result — say "2024-06-15T12:34:00Z" — as the cursor.
Second run — updated_at.last_value returns "2024-06-15T12:34:00Z". Your code passes it to the API, which returns only records modified after that timestamp.
The cursor is persisted in two places: locally in ~/.dlt/pipelines/<name>/state/ and remotely in the destination’s _dlt_pipeline_state table. If you redeploy to a new machine or blow away local state, dlt reads the cursor from the destination on the next run. This is what makes dlt crash-safe and portable.
initial_valueis only used on the very first run (or if state is reset). For bounded backfills, you can also passend_valueto cap the window, then shift the window across parameterised runs (e.g. Airflow dynamic tasks per month).
For APIs that don’t support server-side filtering, dlt can filter client-side — it inspects each yielded record and drops records below the cursor — but you still pay the cost of fetching full pages. Check updated_at.start_out_of_range to detect when a page is entirely below the cursor and stop pagination early.
User-Defined State
Beyond automatic cursors, you can read and write arbitrary key-value state per resource:
@dlt.resource
def my_resource():
state = dlt.current.resource_state()
last_page = state.get("last_page", 0)
for page_num in range(last_page, 100):
yield fetch_page(page_num)
state["last_page"] = page_numThis state is serialised into the same _dlt_pipeline_state table alongside incremental cursors.
Schema Inference, Evolution, and Nested Data
Automatic Inference
On the first run, dlt examines every record yielded by your resources and infers a schema: table names, column names, column data types. It stores this schema as a versioned JSON document in the destination’s _dlt_version table. You never need to write CREATE TABLE statements.
Schema Evolution
On subsequent runs, if the data contains columns that don’t exist in the stored schema, dlt evolves the schema: it issues ALTER TABLE ADD COLUMN on the destination and bumps the schema version. This is the default behaviour (called evolve mode).
If a new record has a value for an existing column but with an incompatible type (e.g. a field previously seen as bigint now arrives as text), dlt does not coerce or fail. Instead it creates a variant column — e.g. amount__v_text — alongside the original amount column. The new row gets amount = NULL and amount__v_text = "not a number". This guarantees no data loss and no silent corruption.
Nested Data Normalisation
When dlt encounters nested structures in your data, it flattens them:
Nested dicts become columns on the parent table with double-underscore naming. {"user": {"address": {"city": "SF"}}} becomes a column user__address__city on the parent table.
Nested lists become child tables. Given:
{"id": 1, "tags": ["python", "data"]}dlt creates a parent table with id and a child table (e.g. my_resource__tags) with columns value, _dlt_parent_id (foreign key to the parent row’s _dlt_id), and _dlt_list_idx (0-based position in the original array). This preserves the relational structure of the nested data without requiring the destination to support array types.
You control how deep this flattening goes with max_table_nesting:
@dlt.resource(max_table_nesting=0) # no child tables at all; lists become JSON strings
@dlt.resource(max_table_nesting=1) # one level of child tables; deeper nesting becomes JSONSchema Contracts
In production, you often don’t want the schema to evolve freely — a malformed API response could silently add garbage columns to your warehouse. Schema contracts let you lock down evolution. You set a policy for four dimensions — tables (new tables appearing), columns (new columns in existing tables), data_type (type changes on existing columns), and extra values — each to one of four modes:
- evolve — accept the change (default)
- freeze — raise an exception and fail the pipeline
- discard_row — silently drop the offending row
- discard_value — null out the offending field, keep the rest of the row
@dlt.resource(schema_contract={
"tables": "evolve",
"columns": "freeze",
"data_type": "discard_value",
})
def transactions():
yield from fetch_transactions()Contracts can be set at the resource, source, or pipeline.run() level; the most specific scope wins.
freezeontablesmeans that if the API starts returning a new nested list (which would create a child table), the pipeline fails. This is usually what you want in production — it surfaces upstream schema changes as hard errors rather than silent table proliferation.
Destinations and Staging
Supported Destinations
dlt loads into:
- Warehouses: Snowflake, BigQuery, Redshift, Databricks, Azure Synapse, ClickHouse, MotherDuck
- Databases: PostgreSQL, MS SQL Server, DuckDB
- Filesystems: S3, GCS, Azure Blob Storage, local disk
The filesystem destination writes files (JSONL, Parquet, or CSV) with a configurable layout — a path template controlling partitioning:
# produces paths like: events/year=2024/month=06/{load_id}.{file_id}.parquetStaging
For warehouse destinations, dlt supports a two-hop loading strategy: first write Parquet files to a staging filesystem (S3, GCS, etc.), then tell the warehouse to COPY them in. This is dramatically faster than INSERT-based loading because warehouses are optimised for bulk file ingestion.
pipeline = dlt.pipeline(
pipeline_name="events",
destination="snowflake",
staging="filesystem", # stage via S3
dataset_name="raw",
)You configure the staging filesystem credentials in secrets.toml or environment variables (see next section). During the load stage, dlt uploads Parquet files to S3, then issues Snowflake’s COPY INTO command pointing at those files.
Filesystem as Final Destination
When the filesystem is the final destination (not staging), dlt stores pipeline state and schema metadata in special _dlt_pipeline_state and _dlt_version folders alongside your data. This makes the filesystem self-contained — another dlt pipeline or a query engine like Trino can read the state.
Configuration and Secrets
dlt resolves every configurable value through a layered provider chain, checked in order (first match wins):
- Explicit keyword arguments in code
- Environment variables (pattern:
SOURCES__GITHUB__TOKEN, double underscores for nesting) .dlt/secrets.toml(for credentials).dlt/config.toml(for non-secret settings)- Default values in function signatures
# .dlt/secrets.toml
[sources.github]
token = "ghp_abc123"
[destination.snowflake.credentials]
database = "analytics"
username = "loader"
password = "s3cret"
host = "acct.snowflakecomputing.com"
warehouse = "LOAD_WH"
role = "LOADER"In CI/production, environment variables replace TOML files entirely, keeping secrets out of repos:
export SOURCES__GITHUB__TOKEN=ghp_abc123
export DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD=s3cretTransformers and Resource Chaining
A transformer is a resource that consumes data from a parent resource instead of producing it independently. The typical use case is per-record enrichment: a parent yields a list of IDs, the transformer fetches details for each.
@dlt.resource
def repos():
yield from [{"name": "dlt", "full_name": "dlt-hub/dlt"}, ...]
@dlt.transformer(data_from=repos)
def repo_contributors(repo_item):
resp = requests.get(f"https://api.github.com/repos/{repo_item['full_name']}/contributors")
yield resp.json()The pipe operator (|) chains them when running:
pipeline.run(repos() | repo_contributors)The transformer runs once per record yielded by the parent. It inherits the parent’s parallelism settings — if the parent is parallelized=True or async, the transformer calls fan out concurrently.
The Declarative REST API Source
For the common case of “paginated JSON API with auth”, dlt ships a generic REST API source so you don’t have to write pagination logic at all:
from dlt.sources.rest_api import rest_api_source
source = rest_api_source({
"client": {
"base_url": "https://api.hubspot.com/crm/v3/",
"auth": {"type": "bearer", "token": dlt.secrets.value},
"paginator": "auto",
},
"resources": [
{
"name": "contacts",
"endpoint": {"path": "objects/contacts", "params": {"limit": 100}},
"primary_key": "id",
"write_disposition": "merge",
"incremental": {
"cursor_path": "properties.lastmodifieddate",
"initial_value": "2024-01-01T00:00:00Z",
},
},
{"name": "deals", "endpoint": {"path": "objects/deals"}},
],
})
pipeline = dlt.pipeline(destination="duckdb", dataset_name="hubspot")
pipeline.run(source)The paginator field supports "auto" (inspects the response to detect the style), "json_link" (follows a next URL in the JSON body), "header_link" (RFC 5988 Link header), "offset", and "cursor". For child endpoints (e.g. /contacts/{contact_id}/notes), you can configure resolved references that inject parent record fields into child endpoint URL templates.
Performance Tuning
Parallelism
By default, resources execute sequentially. You opt into parallelism per resource:
@dlt.resource(parallelized=True)
def my_resource():
...Async generators are parallel by default — dlt runs them on an event loop without requiring the flag.
The degree of parallelism is controlled by two settings in config.toml:
[extract]
workers = 5 # thread pool size for sync generators
max_parallel_items = 20 # max concurrent items across all resourcesBuffer and File Sizing
dlt buffers records in memory before flushing to disk:
[data_writer]
buffer_max_items = 5000 # flush threshold (rows in memory)
file_max_items = 100000 # max rows per output file
file_max_bytes = 134217728 # 128 MB per fileLarger files mean fewer load jobs but coarser parallelism at load time. For Snowflake COPY, 100–250 MB compressed Parquet files hit the sweet spot between per-file overhead and warehouse-side parallelism.
Parquet Output
For any warehouse destination with staging, switching the normalizer output to Parquet is a major win — columnar compression plus native COPY ingestion:
[normalize]
loader_file_format = "parquet"The Parquet writer uses pyarrow internally.
Orchestration
dlt is orchestrator-agnostic. It has no scheduler, no DAG concept, no UI. You bring your own.
Airflow
Call pipeline.run() inside a @task-decorated function. dlt’s crash-safe state means Airflow retries work correctly — a re-executed task resumes from the last completed load package, not from scratch.
from airflow.decorators import task
@task
def load_github():
import dlt
pipeline = dlt.pipeline(
pipeline_name="github",
destination="bigquery",
dataset_name="raw_github",
)
load_info = pipeline.run(github())
load_info.raise_on_failed_jobs()dlt also ships dlt deploy airflow-composer which auto-generates a DAG file from an existing pipeline script.
Prefect
Same pattern — wrap in a @flow or @task. No adapter library needed.
Lightweight Runtimes
Because dlt is just a pip dependency, it runs in AWS Lambda, Google Cloud Functions, or a cron job. When running in ephemeral environments, pipeline state is recovered from the destination’s _dlt_pipeline_state table on each cold start, so there is no dependency on persistent local storage.
Comparison with Alternatives
vs. Custom Python Scripts
This is the real comparison. You already know how to call requests.get(), handle pagination, manage a cursor in a database row, and issue INSERT statements. dlt’s value over that:
Schema inference and ALTER TABLE migrations per destination. Nested JSON flattening into child tables with foreign keys. Crash-safe incremental state with dual persistence (local + destination). Destination abstraction — switch from DuckDB (dev) to Snowflake (prod) by changing a string. SCD2 and merge strategies without writing MERGE SQL. For a one-off CSV-to-Postgres load, dlt is overkill. For anything you’ll run more than once against a changing API, it saves weeks.
vs. Airbyte
Airbyte is a platform: web UI, Temporal-based orchestrator, Docker/Kubernetes connector runtime, 600+ pre-built connectors, and its own state management. dlt is a library: zero infrastructure, state stored in your destination, you own orchestration. If you already run Airflow and want to own the full stack, dlt avoids the Airbyte operational tax (Postgres, Temporal, Redis, K8s). If you want a turnkey SaaS with a UI for non-engineers, Airbyte Cloud fits. Airbyte isolates connectors in containers, so a buggy connector can’t crash your scheduler; dlt runs in-process, so a segfault in a C extension takes down your worker.
vs. Singer / Meltano
Singer taps are standalone executables that emit JSON to stdout; targets consume it. Meltano wraps them with config management and a CLI. The key difference: Singer connectors are opaque binaries you compose but can’t modify without forking. dlt sources are Python functions you import and edit. Singer/Meltano have a far larger catalogue (~300+ connectors); dlt has ~40 verified sources but writing a new one is a single decorated function.
vs. Fivetran
Fully managed SaaS. Richest connector catalogue, least engineering effort, highest cost (per-MAR pricing). No customisation of extraction logic. dlt is the opposite end of the spectrum: free, fully customisable, you build and maintain everything.
vs. Databricks Auto Loader / Delta Live Tables
Auto Loader ingests files from cloud storage into Delta tables. It assumes data is already landed as files. dlt operates upstream — it pulls from APIs and databases and produces those files. They are complementary: dlt extracts from APIs → writes Parquet to S3 → Auto Loader ingests into Bronze Delta tables → Delta Live Tables refines into Silver/Gold.
Limitations
dlt is a single-machine, single-process library. There is no distributed execution — no Spark, no Ray. If a source produces billions of rows, you partition externally (e.g. Airflow dynamic task mapping by date range) and run separate pipeline instances. Each instance must use a distinct pipeline_name or dataset to avoid state collisions.
The normalize stage deserialises and re-serialises all data in Python. For very high throughput this becomes CPU-bound. The Parquet writer uses pyarrow which helps, but you won’t match a native Spark or Rust ingestion tool.
Schema inference defaults to evolve everywhere. In production, failing to set schema contracts means a single malformed API response can add garbage columns to your warehouse. In BigQuery, column deletion requires table recreation — so a rogue column is expensive to clean up.
Async support is solid, but sync resources use a thread pool and are subject to the GIL. Mixing CPU-heavy transformers with I/O-bound extractors in the same pipeline can bottleneck. Keep CPU-heavy work downstream (dbt, Spark) rather than inside dlt transformers.