Releasing v1.0 of the Charlotte-Mecklenburg Quality of Life Explorer Data Pipeline
By Kailas Venkitasubramanian in Reproducible Research Data Science
March 16, 2023
A successful wrap
It’s a great day today.
We completed version 1.0 of the Charlotte-Mecklenburg Quality of Life Explorer automation project. It’s hard to put into words how exciting and rewarding this feels, but I’ll try anyway.
We managed to automate most of the data and computational processes needed to generate the 80-odd quality of life indicators featured in the explorer and create functional data pipelines to serve the application. Through this work, we’ve significantly reduced our project workload and fundamentally transformed the nature of our engagement in this project. The completion of this work also revitalizes our team’s vision of building a reproducible data science framework and a unified data ecosystem at the Urban Institute. Let me tell you how all this worked out.
QoL Explorer (click on the ‘Quality of Life Explorer’ header in the figure to randomly cycle through the indicators)
Rebooting the QoL project in Fall ’21
The QoL explorer, as many know, is a longstanding collaboration between the City of Charlotte, Mecklenburg County and the Charlotte Urban Institute.
The automation project informally started in the Fall of 2021 with somewhat different goals. The task at hand was to recoup lost ground on the timeline and get deliverables back on track. Dozens of variables across multiple years required updating. Much had to be done, and quickly. You can read more on the first part of this story here.
Fast forward a few months of frenzied work and development.
By early Spring 2022, we had brought the project back on track and cleared the backlog. More importantly, the work during this time also shaped our understanding of how our processes could be more efficient and how reproducible analytical processes could transform the maintenance of the QoL explorer.
Developing Scripts and Workflows
Much of the early work went into developing Python scripts (thanks, Providence) that could replace manual geoprocessing and aggregation operations to compute the indicators for each NPA. While they set the foundation for coding in the project, they were not ready to be aligned with a persistent data pipeline, as several key processes remained manual.
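To give a flavor of the aggregation step, here is a minimal sketch and not our production code. It assumes the input records have already been assigned to an NPA (the geoprocessing part) and simply rolls them up into a percentage per NPA. The function name, field names and sample values are all hypothetical.

```python
from collections import defaultdict


def aggregate_rate_by_npa(records, numerator_key, denominator_key):
    """Sum numerator and denominator per NPA, then return a percentage per NPA."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for rec in records:
        totals[rec["npa"]][0] += rec[numerator_key]
        totals[rec["npa"]][1] += rec[denominator_key]

    rates = {}
    for npa, (num, den) in sorted(totals.items()):
        # With a zero denominator the indicator is undefined for that NPA.
        rates[npa] = round(100.0 * num / den, 1) if den else None
    return rates


# Hypothetical block-group records that were already assigned to an NPA.
block_groups = [
    {"npa": 2, "households": 100, "with_internet": 80},
    {"npa": 2, "households": 50, "with_internet": 45},
    {"npa": 3, "households": 60, "with_internet": 30},
    {"npa": 4, "households": 0, "with_internet": 0},
]

internet_access = aggregate_rate_by_npa(block_groups, "with_internet", "households")
print(internet_access)
```

Summing the numerator and denominator before dividing, rather than averaging block-group percentages, keeps larger block groups from being under-weighted within an NPA.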
Scripts to review the computed indicators and convert them to appropriate formats to serve the application were also worked out during this time.
We also designed and developed a SQL Server database (thanks, Pratik) that can ingest the computed QoL indicators to replace our dinosaur MS Access database. With this, we had integrated the QoL indicator data with the Charlotte Regional Data Trust’s integrated data system by Spring 2022.
While this initial work was not ‘production’ ready, it provided important building blocks for formalizing the automation project and showed how the pipelines needed to be structured.
Refining the code and reshaping workflows
The next logical step was to reorganize and optimize the Python scripts into common categories of workflows based on shared data assets and/or computational processes.
For example, indicators developed using the census API had a set of common routines, and those in which proximity was being computed had similar processes for data preparation and calculation. Then there were lateral processes such as geocoding that affected more than one category of workflows. Development of compute functions and scripts continued, and more indicators were added to the automation framework during this time.
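As an illustration of what a shared routine for the proximity category might look like, here is a rough sketch under simplifying assumptions. It uses straight-line (haversine) distance between coordinates as a stand-in for the actual geoprocessing, and every name, coordinate and threshold below is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def percent_within(units, facilities, max_miles):
    """Percent of units in each NPA within max_miles of at least one facility."""
    counts = {}
    for unit in units:
        near = any(
            haversine_miles(unit["lat"], unit["lon"], f_lat, f_lon) <= max_miles
            for f_lat, f_lon in facilities
        )
        total, hits = counts.get(unit["npa"], (0, 0))
        counts[unit["npa"]] = (total + 1, hits + int(near))
    return {
        npa: round(100.0 * hits / total, 1)
        for npa, (total, hits) in sorted(counts.items())
    }


# Hypothetical housing units and one hypothetical park location.
housing_units = [
    {"npa": 1, "lat": 35.2271, "lon": -80.8431},
    {"npa": 1, "lat": 35.3271, "lon": -80.8431},
    {"npa": 2, "lat": 35.2281, "lon": -80.8431},
]
parks = [(35.2271, -80.8431)]

near_parks = percent_within(housing_units, parks, max_miles=0.5)
print(near_parks)
```

The point of the pattern is that one function can serve several proximity indicators: only the facility list and the distance threshold change from one indicator to the next.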
Moving to the Cloud and developing final-mile processes
Until the start of Fall 2022, the automation project ran largely on exceptional work done by graduate student assistants. Bringing Nick on board was a shot in the arm, as we could now start aligning the work on this project with the larger vision of a unified data analytic ecosystem and build critical components for the reproducibility framework at the Institute. To this end, we developed a test instance of the AWS Relational Database Service (RDS) that could house the Urban Institute’s data assets, including the ones consumed by the QoL project.
As work progressed, we began drawing up plans this Spring to tackle the residual manual work that still lurked in the post-processing parts of the project. We developed scripts (thanks, Nick) to automate the ETL components that run after the indicators are computed. We also reconfigured the review process and enabled streamlined ingestion into the AWS database that now houses the QoL data. This critical work was a fitting finale to this phase of the project.
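For readers curious about the shape of this final-mile step, here is a simplified sketch of a review-then-load routine. It is not our actual ETL code: an in-memory SQLite database stands in for the AWS database, the review rule assumes a percentage indicator bounded between 0 and 100, and the table, column and function names are hypothetical.

```python
import sqlite3


def review(rows):
    """Split rows into (clean, rejected) using a simple range check."""
    clean, rejected = [], []
    for row in rows:
        value = row["value"]
        # Assumed rule: a missing value is allowed, otherwise it must be a valid percent.
        ok = row["npa"] is not None and (value is None or 0 <= value <= 100)
        (clean if ok else rejected).append(row)
    return clean, rejected


def load(conn, rows):
    """Insert reviewed rows; re-running replaces rows instead of duplicating them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicator_values ("
        "indicator TEXT, npa INTEGER, year INTEGER, value REAL, "
        "PRIMARY KEY (indicator, npa, year))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO indicator_values "
        "VALUES (:indicator, :npa, :year, :value)",
        rows,
    )
    conn.commit()


# Hypothetical computed indicator values in long format.
computed = [
    {"indicator": "internet_access", "npa": 2, "year": 2021, "value": 83.3},
    {"indicator": "internet_access", "npa": 3, "year": 2021, "value": 50.0},
    {"indicator": "internet_access", "npa": 4, "year": 2021, "value": 140.0},
]

clean, rejected = review(computed)
conn = sqlite3.connect(":memory:")  # stand-in for the real database connection
load(conn, clean)
loaded_count = conn.execute("SELECT COUNT(*) FROM indicator_values").fetchone()[0]
print(loaded_count, "rows loaded;", len(rejected), "held back for review")
```

Keying the table on indicator, NPA and year means a rerun of the pipeline overwrites earlier values rather than piling up duplicates, which is one way to keep repeated runs safe.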
Project Chronology
The figure below shows the general chronology of events in the project.
How this work transforms the project
A lot of water has flowed down the river since we started this work. I can safely say that all of the major workflows in this project are now encoded in the pipeline. This has resulted in a reduction of around 60% in the project workload, and we have the unprecedented luxury of reimagining our engagement in the QoL project in future work cycles without straining the budget.
Maybe we can finally set data operations and maintenance aside and start developing research insights using the rich neighborhood-level dataset of the QoL explorer? The possibilities are exciting. More to come on this.
We still have a few indicators to absorb into the pipeline, but that is relatively trivial work. A key process still to be completed is building the raw databases and connecting them to the compute code. We’ll be taking this up in the upcoming wave.
Final note
I see the success of this project as a testament to the dedication and capacity of the UI data science team to transform how we support community research initiatives. More importantly, we are making all the right moves to bring greater rigor and transparency to our data analytic work, which augurs well for the ethos of our organization and for science in general.