Designing Reproducible Data Pipelines for Community Research

By Kailas Venkitasubramanian in Research Analytics, Data Engineering, Reproducible Research

March 9, 2024

In the first post of this series, I argued that reproducibility is not a technical luxury for community research institutions—it is an ethical and operational obligation. In this post, I want to move from philosophy to plumbing—because this is where reproducibility becomes real.

Specifically: what does it mean to design reproducible data pipelines in a community research environment?

At the UNC Charlotte Urban Institute, this question became concrete as we built the Quality of Life Explorer, developed deposit and extraction pipelines for the Charlotte Regional Data Trust, and began orchestrating workflows using Apache Airflow in an AWS environment.

This has been less a “one-time build” and more a cultural and technical transition—and we’re still on that journey.

Why pipelines matter in community research

Community research rarely involves a single static dataset. Instead, we work with:

  • Federal sources (ACS, BLS, CDC, HUD)
  • State administrative data
  • County-level programmatic data
  • Secure individual-level records under data use agreements

These datasets evolve independently. Releases change structures. Definitions shift. Vendors update systems. If each analyst downloads files manually and writes ad-hoc cleaning scripts, reproducibility erodes quickly.

A reproducible data pipeline ensures that:

  1. Data ingestion is scripted and repeatable
  2. Transformations are documented and versioned
  3. Outputs are traceable to raw inputs
  4. Updates can be rerun with minimal friction

In short: pipelines convert data chaos into structured infrastructure.
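
As a minimal sketch of what "scripted and repeatable" ingestion with traceable outputs can look like (the source name and payload here are hypothetical, not from a real feed):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(source_name: str, payload: bytes, raw_dir: Path) -> Path:
    """Write a raw file plus a metadata sidecar so the output is traceable."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    checksum = hashlib.sha256(payload).hexdigest()
    data_path = raw_dir / f"{source_name}.bin"
    data_path.write_bytes(payload)
    # Sidecar metadata answers: where did this come from, and when?
    meta = {
        "source": source_name,
        "sha256": checksum,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    (raw_dir / f"{source_name}.meta.json").write_text(json.dumps(meta, indent=2))
    return data_path

# Rerunning with the same payload yields the same checksum, so a
# changed upstream file is immediately visible in the metadata.
path = ingest("acs_example", b"tract,value\n37119001,0.42\n", Path("raw"))
```

The sidecar is deliberately tiny; the point is that every raw artifact carries its own provenance.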

A simple mental model: the pipeline as a contract

Before we get into tooling, here’s the mental model that helped our team: a pipeline is a contract between raw inputs and trusted outputs.

  • It defines where data comes from
  • It defines how it changes
  • It defines what gets published
  • It makes those steps repeatable

Here’s a simplified view of the kind of pipeline we’re aiming for (public indicators + secure administrative data), drawn as a Mermaid diagram:

flowchart LR
    subgraph PublicData[Public data sources]
        ACS[ACS / Census] --> Ingest
        BLS[BLS] --> Ingest
        CDC[CDC] --> Ingest
        HUD[HUD] --> Ingest
    end
    subgraph TrustData[Administrative data via Data Trust]
        Deposit[Secure deposit] --> Validate[Schema & integrity checks]
        Validate --> SecureRaw[Encrypted raw storage]
    end
    Ingest[Ingestion jobs] --> Raw[Raw layer]
    Raw --> Stage[Staging / standardization]
    SecureRaw --> SecureStage[Secure staging]
    Stage --> Prod[Production indicators]
    SecureStage --> Extract[Reproducible analytic extracts]
    Prod --> Publish[Explorer / dashboards]
    Extract --> Analysis[Research + evaluation]
    Publish --> Community[Community / partners]
    Analysis --> Community

Lessons from building the Quality of Life Explorer pipeline

The Quality of Life Explorer required integrating multiple public datasets at the census tract level across 14 counties. Early versions relied heavily on analyst-driven updates. That worked—until it didn’t.

Here are some lessons we learned:

1. Separate raw, staging, and production layers

Borrowing from industry ETL best practices, we now explicitly separate:

  • Raw layer – untouched source files
  • Staging layer – cleaned and standardized
  • Production layer – analysis-ready indicators

This separation prevents subtle drift. If an indicator changes, we can trace whether it was a source change or a transformation decision.
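
A minimal sketch of the raw-to-staging boundary (the column names `GEOID` and `estimate` are hypothetical stand-ins for a real source schema): the raw file is read but never modified, and source-specific quirks are normalized in exactly one place.

```python
import csv
from pathlib import Path

RAW, STAGE = Path("raw"), Path("stage")

def stage_tract_data(name: str) -> Path:
    """Raw -> staging: standardize column names; never touch the raw file."""
    STAGE.mkdir(exist_ok=True)
    out = STAGE / f"{name}.csv"
    with open(RAW / f"{name}.csv", newline="") as src, open(out, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["tract_id", "value"])
        writer.writeheader()
        for row in reader:
            # Source-specific column names are normalized here, in one place.
            writer.writerow({"tract_id": row["GEOID"], "value": row["estimate"]})
    return out

# Demo with a tiny raw file standing in for a real download.
RAW.mkdir(exist_ok=True)
(RAW / "demo.csv").write_text("GEOID,estimate\n37119001,0.42\n")
staged = stage_tract_data("demo")
```

Because the raw file survives untouched, a surprising staging value can always be traced back to either the source or the transformation.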

2. Treat indicator definitions as code

In community research, indicators are often described in prose (“percentage of households cost-burdened”). But prose is ambiguous.

We began embedding indicator definitions directly in transformation scripts and storing metadata alongside code. Indicator logic now lives in version-controlled repositories—not just in PDF documentation.
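
One way to keep prose and logic from drifting apart is to store them in the same object. This is a simplified sketch, not our production metadata schema; the indicator name and threshold are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Indicator:
    """An indicator definition that lives in code, not just in a PDF."""
    name: str
    definition: str                   # the prose definition, kept next to the logic
    compute: Callable[[dict], float]  # the logic that makes the prose precise

cost_burdened = Indicator(
    name="pct_cost_burdened",
    definition="Share of households spending more than 30% of income on housing.",
    compute=lambda row: row["cost_burdened_households"] / row["total_households"],
)

share = cost_burdened.compute({"cost_burdened_households": 300, "total_households": 1200})
# share == 0.25
```

When the definition changes, the prose and the formula change in the same commit, which is exactly the review surface version control gives us.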

3. Automate refresh cycles

Originally, updates were manual and calendar-driven. Now, pipelines can be triggered programmatically when new data releases are detected. Even partial automation dramatically reduces inconsistency.

The insight: reproducibility improves not when we automate everything, but when we eliminate undocumented manual steps.
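
The release-detection pattern itself is small. A sketch, assuming we can obtain some release identifier from the source (the state-file path and source names are hypothetical):

```python
import json
from pathlib import Path

STATE = Path("state/last_seen.json")

def refresh_if_new(source: str, latest_release: str, run_pipeline) -> bool:
    """Run the pipeline only when the source has published a new release."""
    STATE.parent.mkdir(exist_ok=True)
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    if seen.get(source) == latest_release:
        return False  # nothing new; the scheduled check is a no-op
    run_pipeline()
    seen[source] = latest_release
    STATE.write_text(json.dumps(seen))  # record what was processed
    return True

# Demo: the second check is a documented no-op, not a silent re-run.
STATE.unlink(missing_ok=True)  # start the demo from a clean slate
runs = []
first = refresh_if_new("acs", "2022-5yr", lambda: runs.append("run"))
second = refresh_if_new("acs", "2022-5yr", lambda: runs.append("run"))
```

Even this much replaces an undocumented manual step ("did anyone pull the new ACS yet?") with a logged, inspectable decision.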

Building deposit and extraction pipelines for the Data Trust

The Charlotte Regional Data Trust introduces additional complexity: secure administrative data under strict governance constraints.

Here, reproducibility must coexist with:

  • Encryption requirements
  • Role-based access controls
  • Audit logging
  • Disclosure risk mitigation

For deposit pipelines, we designed standardized intake processes:

  • Schema validation at upload
  • File integrity checks
  • Metadata capture at ingestion
  • Controlled movement into secure storage
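
The first two checks can be sketched as a single gate that runs before anything moves into secure storage. The required columns below are hypothetical; a real intake contract would come from the data use agreement:

```python
import csv
import hashlib
import io

REQUIRED_COLUMNS = {"person_id", "service_date", "program_code"}  # hypothetical contract

def validate_deposit(raw_bytes: bytes, expected_sha256: str) -> list[str]:
    """Return a list of problems; an empty list means the deposit is accepted."""
    problems = []
    # Integrity: the file we received is the file the depositor sent.
    if hashlib.sha256(raw_bytes).hexdigest() != expected_sha256:
        problems.append("checksum mismatch")
    # Schema: required columns must be present before anything moves onward.
    header = next(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))), [])
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    return problems

payload = b"person_id,service_date,program_code\n"
ok = validate_deposit(payload, hashlib.sha256(payload).hexdigest())
bad = validate_deposit(payload, "0" * 64)
```

Returning a list of problems, rather than raising on the first one, lets the intake log report everything wrong with a deposit at once.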

For extraction pipelines, we structured workflows that:

  • Log queries and transformations
  • Maintain data lineage
  • Store derived datasets separately from raw records
  • Enable reproducible analytic extracts without exposing unnecessary fields
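
The extraction side can be sketched as a function that returns both the minimized extract and its lineage record together, so one cannot exist without the other (the field names and purpose string are illustrative):

```python
from datetime import datetime, timezone

def extract(records: list[dict], fields: list[str], purpose: str) -> tuple[list[dict], dict]:
    """Produce an analytic extract with only the requested fields, plus lineage."""
    subset = [{f: r[f] for f in fields} for r in records]
    lineage = {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "fields": fields,       # what was exposed...
        "row_count": len(subset),
        "purpose": purpose,     # ...and why
    }
    return subset, lineage

# Fields not requested (here, date of birth) never leave the secure layer.
rows = [{"person_id": 1, "dob": "1990-01-01", "program_code": "A"}]
subset, lineage = extract(rows, ["person_id", "program_code"], "program A evaluation")
```

The lineage record is what makes the extract reproducible: the same fields, filters, and purpose can be replayed and audited later.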

Industry practices in data engineering—particularly around data lineage and governance—have been instructive. Tools and ideas from modern data stack thinking (e.g., layered architectures, DAG-based orchestration) translate surprisingly well to community research.

But they must be adapted to privacy-sensitive contexts.

Orchestrating workflows with Airflow in AWS

As pipelines grew in complexity, ad-hoc scheduling became unsustainable. That is where orchestration became essential.

We began using Apache Airflow within an AWS environment to:

  • Schedule recurring ingestion tasks
  • Manage dependencies across datasets
  • Monitor pipeline health
  • Log execution histories

A tiny Airflow example (a pattern we actually use)

Below is a deliberately small DAG that captures a common pattern in our environment:

  1. ingest a file (or pull from an API)
  2. run a transformation step
  3. publish an artifact (a table, extract, or indicator layer)

Even if your real DAG is more complex (ours adds a validation step between transform and publish), this “three-step spine” tends to repeat across projects.

# airflow>=2.4 (uses the unified "schedule" parameter)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_and_log(**context):
    # Example placeholder:
    # - validate schema / row counts
    # - write a small metadata JSON (run_id, source version, checksum)
    # - push key stats to logs for observability
    run_id = context.get("run_id", "manual__unknown")
    print(f"Validation complete for run_id={run_id}")

with DAG(
    dag_id="ui_example_qol_refresh",
    start_date=datetime(2026, 1, 1),
    schedule="0 6 * * 1",  # Mondays at 6am (example)
    catchup=False,
    default_args={"retries": 1},
    tags=["urban-institute", "reproducible", "qol"],
) as dag:

    ingest = BashOperator(
        task_id="ingest_public_data",
        bash_command="python /opt/pipelines/ingest/acs_pull.py --output s3://ui-raw/acs/",
    )

    transform = BashOperator(
        task_id="transform_to_indicators",
        bash_command="python /opt/pipelines/transform/qol_indicators.py --input s3://ui-raw/ --output s3://ui-stage/",
    )

    validate = PythonOperator(
        task_id="validate_and_log_metadata",
        python_callable=validate_and_log,
    )

    publish = BashOperator(
        task_id="publish_production_layer",
        bash_command="python /opt/pipelines/publish/push_to_prod.py --input s3://ui-stage/ --output s3://ui-prod/",
    )

    ingest >> transform >> validate >> publish

A few notes on why this matters for reproducibility:

  • The schedule is explicit (no mystery cron jobs living on someone’s laptop).
  • Each step is named, logged, and rerunnable.
  • The DAG becomes living documentation: “this is how the number gets made.”

Airflow’s Directed Acyclic Graph (DAG) structure forces clarity. Each step must be explicitly defined. Dependencies must be declared. Failures are visible.

Instead of wondering whether “the script ran,” we have logs, timestamps, and traceable runs. When something breaks, we diagnose systematically—not by email archaeology.

But orchestration also introduced new learning curves:

  • Infrastructure configuration in AWS
  • IAM role management
  • Monitoring and cost awareness
  • Documentation discipline

This is not plug-and-play. It requires investment in staff skills and institutional commitment.

What industry practices taught us

Several principles from industry data engineering have shaped our approach. The most durable one is treating environments and workflows as reproducible artifacts rather than personal setups. Even when we’re not running fully containerized infrastructure, the goal is that another person — or a future version of us — could reconstruct the environment and get the same result.

Idempotency matters too. Pipelines should be safe to rerun. A failed job should not corrupt downstream layers, and a rerun should produce the same output as the original. This sounds obvious until you’re debugging a pipeline that partially succeeded and left the staging database in an uncertain state.
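
A common implementation of safe-to-rerun publishing is the write-then-rename pattern: a minimal sketch, with a hypothetical output path standing in for a production layer.

```python
import os
import tempfile
from pathlib import Path

def publish_atomically(path: Path, content: str) -> None:
    """Write to a temp file, then rename. Readers never see a half-written
    output, and a rerun simply replaces the file with identical content."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)  # atomic replace on the same filesystem

target = Path("prod/indicators.csv")
publish_atomically(target, "tract_id,value\n37119001,0.42\n")
publish_atomically(target, "tract_id,value\n37119001,0.42\n")  # safe to rerun
```

The same idea generalizes beyond files: stage into a temporary table or prefix, validate, then swap, so a failed job leaves the previous good output in place.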

Observability is what makes the rest of it practical. Logging, alerts, and run histories are part of reproducibility, not extras. And keeping transformation modules reusable across workflows reduces duplication and makes errors easier to find.
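
Even minimal observability can be structured. A sketch of one-JSON-line-per-step logging (the step name and counts are illustrative, not real pipeline output):

```python
import json
import logging
import sys

# Every pipeline step emits one JSON line, so run histories can be
# grepped, parsed, and compared across runs.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

def log_step(step: str, **fields) -> str:
    line = json.dumps({"step": step, **fields})
    logging.info(line)
    return line

line = log_step("transform_to_indicators", rows_in=1200, rows_out=1187, status="ok")
```

Row counts in and out of each step are a cheap, high-value signal: a sudden drop is often the first visible symptom of an upstream schema change.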

These practices are common in technology companies. Bringing them into a university-based community research institute takes patience and iteration.

This is a journey

We are not “finished.”

Some pipelines remain semi-manual. Some documentation needs improvement. Some processes are evolving as staff capacity grows.

Reproducible data pipelines are not built in a single sprint. They emerge through cycles of:

  • Build
  • Break
  • Document
  • Refactor
  • Standardize

What matters most is institutional direction.

At the Urban Institute, we are shifting from project-specific scripts toward shared infrastructure. From analyst-specific workflows toward team-owned pipelines. From reactive updates toward orchestrated systems.

In community research—where policy decisions, funding allocations, and public trust are at stake—the shift is worth making carefully and deliberately.