Designing Reproducible Data Pipelines for Community Research

By Kailas Venkitasubramanian in Research Analytics, Data Engineering, Reproducible Research

March 9, 2024

In the first post of this series, I argued that reproducibility is not a technical luxury for community research institutions—it is an ethical and operational obligation. In this post, I want to move from philosophy to plumbing—because this is where reproducibility becomes real.

Specifically: what does it mean to design reproducible data pipelines in a community research environment?

At the UNC Charlotte Urban Institute, this question became concrete as we built the Quality of Life Explorer, developed deposit and extraction pipelines for the Charlotte Regional Data Trust, and began orchestrating workflows using Apache Airflow in an AWS environment.

This has been less a “one-time build” and more a cultural and technical transition—and we’re still on that journey.

Why pipelines matter in community research

Community research rarely involves a single static dataset. Instead, we work with:

  • Federal sources (ACS, BLS, CDC, HUD)
  • State administrative data
  • County-level programmatic data
  • Secure individual-level records under data use agreements

These datasets evolve independently. Releases change structures. Definitions shift. Vendors update systems. If each analyst downloads files manually and writes ad-hoc cleaning scripts, reproducibility erodes quickly.

A reproducible data pipeline ensures that:

  1. Data ingestion is scripted and repeatable
  2. Transformations are documented and versioned
  3. Outputs are traceable to raw inputs
  4. Updates can be rerun with minimal friction

In short: pipelines convert data chaos into structured infrastructure.
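
As a minimal sketch of what "scripted and repeatable" ingestion with traceable outputs can look like (the source name and payload here are hypothetical, not from a real feed):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(source_name: str, payload: bytes, raw_dir: Path) -> Path:
    """Write a raw file plus a metadata sidecar so the output is traceable."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    checksum = hashlib.sha256(payload).hexdigest()
    data_path = raw_dir / f"{source_name}.bin"
    data_path.write_bytes(payload)
    # Sidecar metadata answers: where did this come from, and when?
    meta = {
        "source": source_name,
        "sha256": checksum,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    (raw_dir / f"{source_name}.meta.json").write_text(json.dumps(meta, indent=2))
    return data_path

# Rerunning with the same payload yields the same checksum, so a
# changed upstream file is immediately visible in the metadata.
path = ingest("acs_example", b"tract,value\n37119001,0.42\n", Path("raw"))
```

The sidecar is deliberately tiny; the point is that every raw artifact carries its own provenance.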

A simple mental model: the pipeline as a contract

Before we get into tooling, here’s the mental model that helped our team: a pipeline is a contract between raw inputs and trusted outputs.

  • It defines where data comes from
  • It defines how it changes
  • It defines what gets published
  • It makes those steps repeatable

Here’s a simplified view of the kind of pipeline we’re aiming for (public indicators + secure administrative data), drawn as a Mermaid diagram:

flowchart LR
    subgraph PublicData[Public data sources]
        ACS[ACS / Census] --> Ingest
        BLS[BLS] --> Ingest
        CDC[CDC] --> Ingest
        HUD[HUD] --> Ingest
    end
    subgraph TrustData[Administrative data via Data Trust]
        Deposit[Secure deposit] --> Validate[Schema & integrity checks]
        Validate --> SecureRaw[Encrypted raw storage]
    end
    Ingest[Ingestion jobs] --> Raw[Raw layer]
    Raw --> Stage[Staging / standardization]
    SecureRaw --> SecureStage[Secure staging]
    Stage --> Prod[Production indicators]
    SecureStage --> Extract[Reproducible analytic extracts]
    Prod --> Publish[Explorer / dashboards]
    Extract --> Analysis[Research + evaluation]
    Publish --> Community[Community / partners]
    Analysis --> Community

Lessons from building the Quality of Life Explorer pipeline

The Quality of Life Explorer required integrating multiple public datasets at the census tract level across 14 counties. Early versions relied heavily on analyst-driven updates. That worked—until it didn’t.

Here are some lessons we learned:

1. Separate raw, staging, and production layers

Borrowing from industry ETL best practices, we now explicitly separate:

  • Raw layer – untouched source files
  • Staging layer – cleaned and standardized
  • Production layer – analysis-ready indicators

This separation prevents subtle drift. If an indicator changes, we can trace whether it was a source change or a transformation decision.
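
A minimal sketch of the raw-to-staging boundary (the column names `GEOID` and `estimate` are hypothetical stand-ins for a real source schema): the raw file is read but never modified, and source-specific quirks are normalized in exactly one place.

```python
import csv
from pathlib import Path

RAW, STAGE = Path("raw"), Path("stage")

def stage_tract_data(name: str) -> Path:
    """Raw -> staging: standardize column names; never touch the raw file."""
    STAGE.mkdir(exist_ok=True)
    out = STAGE / f"{name}.csv"
    with open(RAW / f"{name}.csv", newline="") as src, open(out, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["tract_id", "value"])
        writer.writeheader()
        for row in reader:
            # Source-specific column names are normalized here, in one place.
            writer.writerow({"tract_id": row["GEOID"], "value": row["estimate"]})
    return out

# Demo with a tiny raw file standing in for a real download.
RAW.mkdir(exist_ok=True)
(RAW / "demo.csv").write_text("GEOID,estimate\n37119001,0.42\n")
staged = stage_tract_data("demo")
```

Because the raw file survives untouched, a surprising staging value can always be traced back to either the source or the transformation.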

2. Treat indicator definitions as code

In community research, indicators are often described in prose (“percentage of households cost-burdened”). But prose is ambiguous.

We began embedding indicator definitions directly in transformation scripts and storing metadata alongside code. Indicator logic now lives in version-controlled repositories—not just in PDF documentation.
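
One way to keep prose and logic from drifting apart is to store them in the same object. This is a simplified sketch, not our production metadata schema; the indicator name and threshold are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Indicator:
    """An indicator definition that lives in code, not just in a PDF."""
    name: str
    definition: str                   # the prose definition, kept next to the logic
    compute: Callable[[dict], float]  # the logic that makes the prose precise

cost_burdened = Indicator(
    name="pct_cost_burdened",
    definition="Share of households spending more than 30% of income on housing.",
    compute=lambda row: row["cost_burdened_households"] / row["total_households"],
)

share = cost_burdened.compute({"cost_burdened_households": 300, "total_households": 1200})
# share == 0.25
```

When the definition changes, the prose and the formula change in the same commit, which is exactly the review surface version control gives us.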

3. Automate refresh cycles

Originally, updates were manual and calendar-driven. Now, pipelines can be triggered programmatically when new data releases are detected. Even partial automation dramatically reduces inconsistency.

The insight: reproducibility improves not when we automate everything, but when we eliminate undocumented manual steps.
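
The release-detection pattern itself is small. A sketch, assuming we can obtain some release identifier from the source (the state-file path and source names are hypothetical):

```python
import json
from pathlib import Path

STATE = Path("state/last_seen.json")

def refresh_if_new(source: str, latest_release: str, run_pipeline) -> bool:
    """Run the pipeline only when the source has published a new release."""
    STATE.parent.mkdir(exist_ok=True)
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    if seen.get(source) == latest_release:
        return False  # nothing new; the scheduled check is a no-op
    run_pipeline()
    seen[source] = latest_release
    STATE.write_text(json.dumps(seen))  # record what was processed
    return True

# Demo: the second check is a documented no-op, not a silent re-run.
STATE.unlink(missing_ok=True)  # start the demo from a clean slate
runs = []
first = refresh_if_new("acs", "2022-5yr", lambda: runs.append("run"))
second = refresh_if_new("acs", "2022-5yr", lambda: runs.append("run"))
```

Even this much replaces an undocumented manual step ("did anyone pull the new ACS yet?") with a logged, inspectable decision.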

Building deposit and extraction pipelines for the Data Trust

The Charlotte Regional Data Trust introduces additional complexity: secure administrative data under strict governance constraints.

Here, reproducibility must coexist with:

  • Encryption requirements
  • Role-based access controls
  • Audit logging
  • Disclosure risk mitigation

For deposit pipelines, we designed standardized intake processes:

  • Schema validation at upload
  • File integrity checks
  • Metadata capture at ingestion
  • Controlled movement into secure storage
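
The first two checks can be sketched as a single gate that runs before anything moves into secure storage. The required columns below are hypothetical; a real intake contract would come from the data use agreement:

```python
import csv
import hashlib
import io

REQUIRED_COLUMNS = {"person_id", "service_date", "program_code"}  # hypothetical contract

def validate_deposit(raw_bytes: bytes, expected_sha256: str) -> list[str]:
    """Return a list of problems; an empty list means the deposit is accepted."""
    problems = []
    # Integrity: the file we received is the file the depositor sent.
    if hashlib.sha256(raw_bytes).hexdigest() != expected_sha256:
        problems.append("checksum mismatch")
    # Schema: required columns must be present before anything moves onward.
    header = next(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))), [])
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    return problems

payload = b"person_id,service_date,program_code\n"
ok = validate_deposit(payload, hashlib.sha256(payload).hexdigest())
bad = validate_deposit(payload, "0" * 64)
```

Returning a list of problems, rather than raising on the first one, lets the intake log report everything wrong with a deposit at once.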

For extraction pipelines, we structured workflows that:

  • Log queries and transformations
  • Maintain data lineage
  • Store derived datasets separately from raw records
  • Enable reproducible analytic extracts without exposing unnecessary fields
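
The extraction side can be sketched as a function that returns both the minimized extract and its lineage record together, so one cannot exist without the other (the field names and purpose string are illustrative):

```python
from datetime import datetime, timezone

def extract(records: list[dict], fields: list[str], purpose: str) -> tuple[list[dict], dict]:
    """Produce an analytic extract with only the requested fields, plus lineage."""
    subset = [{f: r[f] for f in fields} for r in records]
    lineage = {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "fields": fields,       # what was exposed...
        "row_count": len(subset),
        "purpose": purpose,     # ...and why
    }
    return subset, lineage

# Fields not requested (here, date of birth) never leave the secure layer.
rows = [{"person_id": 1, "dob": "1990-01-01", "program_code": "A"}]
subset, lineage = extract(rows, ["person_id", "program_code"], "program A evaluation")
```

The lineage record is what makes the extract reproducible: the same fields, filters, and purpose can be replayed and audited later.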

Industry practices in data engineering—particularly around data lineage and governance—have been instructive. Tools and ideas from modern data stack thinking (e.g., layered architectures, DAG-based orchestration) translate surprisingly well to community research.

But they must be adapted to privacy-sensitive contexts.

Orchestrating workflows with Airflow in AWS

As pipelines grew in complexity, ad-hoc scheduling became unsustainable. That is where orchestration became essential.

We began using Apache Airflow within an AWS environment to:

  • Schedule recurring ingestion tasks
  • Manage dependencies across datasets
  • Monitor pipeline health
  • Log execution histories

A tiny Airflow example (a pattern we actually use)

Below is a deliberately small DAG that captures a common pattern in our environment:

  1. ingest a file (or pull from an API)
  2. run a transformation step
  3. publish an artifact (a table, extract, or indicator layer)

Even if your real DAG is more complex (ours adds a validation step between transform and publish), this “three-step spine” tends to repeat across projects.

# airflow>=2.4 (uses the unified "schedule" parameter)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_and_log(**context):
    # Example placeholder:
    # - validate schema / row counts
    # - write a small metadata JSON (run_id, source version, checksum)
    # - push key stats to logs for observability
    run_id = context.get("run_id", "manual__unknown")
    print(f"Validation complete for run_id={run_id}")

with DAG(
    dag_id="ui_example_qol_refresh",
    start_date=datetime(2026, 1, 1),
    schedule="0 6 * * 1",  # Mondays at 6am (example)
    catchup=False,
    default_args={"retries": 1},
    tags=["urban-institute", "reproducible", "qol"],
) as dag:

    ingest = BashOperator(
        task_id="ingest_public_data",
        bash_command="python /opt/pipelines/ingest/acs_pull.py --output s3://ui-raw/acs/",
    )

    transform = BashOperator(
        task_id="transform_to_indicators",
        bash_command="python /opt/pipelines/transform/qol_indicators.py --input s3://ui-raw/ --output s3://ui-stage/",
    )

    validate = PythonOperator(
        task_id="validate_and_log_metadata",
        python_callable=validate_and_log,
    )

    publish = BashOperator(
        task_id="publish_production_layer",
        bash_command="python /opt/pipelines/publish/push_to_prod.py --input s3://ui-stage/ --output s3://ui-prod/",
    )

    ingest >> transform >> validate >> publish

A few notes on why this matters for reproducibility:

  • The schedule is explicit (no mystery cron jobs living on someone’s laptop).
  • Each step is named, logged, and rerunnable.
  • The DAG becomes living documentation: “this is how the number gets made.”

Airflow’s Directed Acyclic Graph (DAG) structure forces clarity. Each step must be explicitly defined. Dependencies must be declared. Failures are visible.

Instead of wondering whether “the script ran,” we have logs, timestamps, and traceable runs. When something breaks, we diagnose systematically—not by email archaeology.

But orchestration also introduced new learning curves:

  • Infrastructure configuration in AWS
  • IAM role management
  • Monitoring and cost awareness
  • Documentation discipline

This is not plug-and-play. It requires investment in staff skills and institutional commitment.

What industry practices taught us

Several principles from industry data engineering have shaped our approach. The most durable one is treating environments and workflows as reproducible artifacts rather than personal setups. Even when we’re not running fully containerized infrastructure, the goal is that another person — or a future version of us — could reconstruct the environment and get the same result.

Idempotency matters too. Pipelines should be safe to rerun. A failed job should not corrupt downstream layers, and a rerun should produce the same output as the original. This sounds obvious until you’re debugging a pipeline that partially succeeded and left the staging database in an uncertain state.
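
A common implementation of safe-to-rerun publishing is the write-then-rename pattern: a minimal sketch, with a hypothetical output path standing in for a production layer.

```python
import os
import tempfile
from pathlib import Path

def publish_atomically(path: Path, content: str) -> None:
    """Write to a temp file, then rename. Readers never see a half-written
    output, and a rerun simply replaces the file with identical content."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)  # atomic replace on the same filesystem

target = Path("prod/indicators.csv")
publish_atomically(target, "tract_id,value\n37119001,0.42\n")
publish_atomically(target, "tract_id,value\n37119001,0.42\n")  # safe to rerun
```

The same idea generalizes beyond files: stage into a temporary table or prefix, validate, then swap, so a failed job leaves the previous good output in place.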

Observability is what makes the rest of it practical. Logging, alerts, and run histories are part of reproducibility, not extras. And keeping transformation modules reusable across workflows reduces duplication and makes errors easier to find.
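
Even minimal observability can be structured. A sketch of one-JSON-line-per-step logging (the step name and counts are illustrative, not real pipeline output):

```python
import json
import logging
import sys

# Every pipeline step emits one JSON line, so run histories can be
# grepped, parsed, and compared across runs.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

def log_step(step: str, **fields) -> str:
    line = json.dumps({"step": step, **fields})
    logging.info(line)
    return line

line = log_step("transform_to_indicators", rows_in=1200, rows_out=1187, status="ok")
```

Row counts in and out of each step are a cheap, high-value signal: a sudden drop is often the first visible symptom of an upstream schema change.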

These practices are common in technology companies. Bringing them into a university-based community research institute takes patience and iteration.

This is a journey

We are not “finished.”

Some pipelines remain semi-manual. Some documentation needs improvement. Some processes are evolving as staff capacity grows.

Reproducible data pipelines are not built in a single sprint. They emerge through cycles of:

  • Build
  • Break
  • Document
  • Refactor
  • Standardize

What matters most is institutional direction.

At the Urban Institute, we are shifting from project-specific scripts toward shared infrastructure. From analyst-specific workflows toward team-owned pipelines. From reactive updates toward orchestrated systems.

In community research—where policy decisions, funding allocations, and public trust are at stake—the shift is worth making carefully and deliberately.