The idea of building a full-fledged data and computational infrastructure to support the Charlotte-Mecklenburg Quality of Life (QoL) Explorer didn’t start with an explicit demand for it. Rather, we were grappling with a deliverables backlog that had accumulated during the Covid-19 pandemic and needed to make up ground to keep the explorer up-to-date and meet deadlines.
As I dove in to explore this problem, I discovered that at least part of the challenge lay in dysfunctional elements of how the institute was managing its data systems across projects and programs in general. A new project management strategy was certainly necessary, but not sufficient to transform this project in a sustainable manner. Thus began this data pipeline project. In the past few months, we have made several moves to restructure how we gather data and conduct core analyses, but it is now time to formalize these activities and plan a project to redevelop the QoL explorer's data infrastructure.
The project is a longstanding collaboration between the institute, the City of Charlotte, and Mecklenburg County, so, to our advantage, a significant majority of the processes were already outlined and documented when we started work last fall. But most of the 80-odd indicators in the QoL explorer were computed using manual processes that were both time-consuming and error-prone.
Two things were instinctively clear: 1) we needed to script the computational processes, and 2) we needed to modernize the data warehouse that served the indicators to the application. But we were far from these goals. We gathered data from more than 50 sources, including several public ones, but no unified system for organizing the raw data existed. Identifying process bottlenecks, or mapping out how the scripting could be planned, was impossible without a system for organizing the raw data and a careful review of the source data and their intricacies.
Creating guidelines for a common vocabulary and structure for file system management, and organizing raw files so they were ready for computation, were among the first, rudimentary steps we took. While we could have initiated work on a data lake or a structured data warehouse, we hardly had the staff or resources to jumpstart formal data modeling and ETL. More critically, we had to make process enhancements while simultaneously working to clear the backlog.
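To give a flavor of what such a guideline looks like in practice, here is a minimal sketch of a path-building helper that enforces a common vocabulary for raw files. The convention shown (and every name in it) is hypothetical, not our actual standard:

```python
from pathlib import Path

def raw_data_path(source: str, dataset: str, year: int, ext: str = "csv") -> Path:
    """Build a standardized path for a raw data file.

    Hypothetical convention: raw/<source>/<year>/<source>_<dataset>_<year>.<ext>
    with lowercase, underscore-separated names, so every script can find
    (and write) raw files the same way.
    """
    source = source.strip().lower().replace(" ", "_")
    dataset = dataset.strip().lower().replace(" ", "_")
    return Path("raw") / source / str(year) / f"{source}_{dataset}_{year}.{ext}"

# Example: a county tax-parcel extract for 2022
print(raw_data_path("Mecklenburg County", "tax parcels", 2022).as_posix())
# raw/mecklenburg_county/2022/mecklenburg_county_tax_parcels_2022.csv
```

The point is less the specific layout than that the layout is produced by code rather than remembered by people, so it cannot drift.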
So it came down to: let's get this raw data organized, start scripting the ETL in Python, and get through computing 40 indicators for two or three years in a few weeks. Ideal? No. But we mobilized what we had to tackle all parts of the project and deployed an action plan.
This brings us to the present.
Much has happened since then. We not only cleared the entire backlog but also developed scripts for computing several indicators in batches, along with scripts to review results and generate CSVs for feeding the application. As we formally roll out the QoL automation project, a roadmap is emerging, as shown below.
The figure above outlines how we plan to move data through the pipeline. Data sources converge into a raw data database, which then feeds computational scripts dedicated to batches of indicators that share common data assets and/or workflows. These scripts produce sets of indicators that move into a staging database, where review and QA scripts run. Once indicators pass QA, they move to the production database, from which data can be queried or ported to the QoL explorer application.
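The staged flow can be sketched in a few lines of Python, with plain dictionaries standing in for the raw, staging, and production databases. Everything here is illustrative: the sample indicator, the QA rule, and all names are our own inventions, not the project's actual code:

```python
# In-memory stand-ins for the three databases in the pipeline.
raw_db = {"households_no_vehicle": 9_600, "households": 48_000}
staging_db = {}
production_db = {}

def compute_batch(raw: dict) -> dict:
    """One computational script: derive a batch of related indicators
    from shared raw data assets."""
    return {
        "pct_households_no_vehicle": 100 * raw["households_no_vehicle"] / raw["households"],
    }

def qa(indicators: dict) -> bool:
    """Review/QA script: sanity-check values before promotion.
    Here, a trivial rule that percentages stay within 0-100."""
    return all(0 <= v <= 100 for v in indicators.values())

# Move one batch of indicators through the stages.
staging_db.update(compute_batch(raw_db))
if qa(staging_db):
    production_db.update(staging_db)

print(production_db)  # {'pct_households_no_vehicle': 20.0}
```

The value of the staging step is that nothing reaches production, and hence the application, without passing an explicit, scripted check.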
At this time, we are reviewing the scripts developed so far and reorganizing the code into functions that enable batch processing of indicators that share a workflow or data. Geoprocessing is an integral part of the process, so we are rebuilding the geocoding processes to map onto the relevant batches of variables. We have also created a rudimentary SQL Server database as the final destination. Census and ACS data are no longer downloaded as CSV tables but sourced directly from the Census Bureau's API.
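Sourcing from the API means a script can pull a table on demand rather than depending on a hand-downloaded CSV. As a sketch, a helper like the one below builds an ACS 5-year query URL for every tract in a county; the endpoint format is the Census Bureau's documented one, but the helper name, the chosen variable, and the county defaults are our own illustration:

```python
def acs5_tract_url(year: int, variables: list[str],
                   state_fips: str = "37", county_fips: str = "119") -> str:
    """Build a Census ACS 5-year API query URL for all tracts in a county.

    Defaults are illustrative and target Mecklenburg County, NC
    (state FIPS 37, county FIPS 119).
    """
    get_clause = ",".join(variables)
    return (
        f"https://api.census.gov/data/{year}/acs/acs5"
        f"?get={get_clause}&for=tract:*&in=state:{state_fips}%20county:{county_fips}"
    )

# B01001_001E is the ACS total-population estimate; other variable
# codes work the same way. The URL can then be fetched with any
# HTTP client and parsed as JSON.
print(acs5_tract_url(2022, ["NAME", "B01001_001E"]))
```

Because the query is just a parameterized string, rerunning the pipeline for a new year or a new batch of variables is a function call rather than a fresh manual download.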
The raw database is yet to be developed. Post-processing scripts and several computational scripts are still in the works, and workflows need to be organized as envisioned above. It feels like we are only halfway there, but we have already made great strides in efficiency.
This has been an exciting process for several reasons. For one, the challenge of racing to meet deadlines while transforming the process in parallel was built into the project by circumstance. More importantly, the path we have taken to enhance the QoL pipeline paves a new path for the UI reproducibility project and for building a unified data ecosystem at the institute. Whether it is file system standardization, naming conventions, project workflows, version control, or developing reusable code, this automation project has touched several important dimensions of the reproducibility project, both by demonstrating its value and by developing the capacity and structure to take on similar projects.