Plunging into the Data Trust black box, and Deep Cleaning the System
By Kailas Venkitasubramanian, in community research, Data Science, Management
May 20, 2023
Diving into the world of administrative data and CRDT
“Administrative data is messy” is not so much an adage as it is a reality. When I took the reins of managing the data infrastructure and analytical operations of the Institute for Social Capital, or ISC (now called the Charlotte Regional Data Trust), in the middle of 2021, the messiness extended beyond data. The dysfunction ran deep in how data was collected and organized, the way data operations and analyses were conducted, how information was gathered from stakeholders, and how data was disseminated.
My job description at the institute extended beyond straightening things out at CRDT, but this presented itself as an elephant in the room. And in strange ways, it still continues to be. More on that later, but what is CRDT anyway?
The Charlotte Regional Data Trust, or CRDT, is an integrated data system housed in the UNC Charlotte Urban Institute that collects, curates, and integrates data from several administrative entities in the Charlotte region and makes them accessible for research that benefits our community. For example, CRDT could facilitate integrating data on students in the Charlotte-Mecklenburg Schools system with data on households that receive food and nutrition services in Mecklenburg County, to better understand gaps in meeting nutrition needs and evaluate the program’s effectiveness. Or data from Charlotte-Mecklenburg Schools could help assess a tutoring program’s effectiveness in improving student success. A researcher attempting to make use of administrative data is typically met with brick walls: undeveloped data governance, legal issues, arbitrary data release policies, and so on. Let’s say they overcome these and are able to access data. Integrating them with other data sources will invariably be a dead end due to security and privacy issues. CRDT, under well-developed governance, enables this kind of integrated social science research and allows researchers to understand our community through a unique data lens.
The promise was immense but we had a job to do to make this happen, consistently. And we were struggling.
Beginnings, and Non-Starters
As a first step, I started assessing the datasets we had. Two things were apparent: we lacked a functioning data warehouse and operational databases. We had data deposits from dozens of partners spanning more than a decade. Some of them were stored in a SQL Server database as data dumps. At the time, we were in talks with a vendor who had offered to build a cloud-based data warehouse. But we needed one now to conduct our work properly.
We also lacked deeper information about our data holdings. Both gaps contributed to almost zero curation and documentation of the data. This snowballed into a chaotic data request fulfillment process in which both researchers and ISC staff had scant information to identify appropriate data and integrate them with each other. The result was delays for researchers.
Part of the delay equation was the governance process, where we hadn’t defined the information our staff needed to competently process requests. This had an upstream dependency: our services were not defined at an appropriate resolution to streamline the fulfillment process.
Another part was the fact that the data was not research ready. What do I mean by research ready? No consistent metadata or data documentation existed, so no meaningful data transformations, recoding, or other curation could be done. Most discoveries about the nature of CRDT data were ad hoc, made only when required by a particular project. No quality investigation happened as part of ingestion; data was deposited as a dump into the file system. Usually, quirks and data quality issues were uncovered during a data fulfillment process, leading to enormous delays and laborious coordination.
Another big part was inefficiency in the analytical processes for ingesting, storing, extracting, and integrating data. That’s putting it mildly. Our code sucked and needed a major revamp. We used R and bits of Python. In my initial days I tried to read a script written by our then data scientist that did a few data manipulations. I was expecting a few lines of code using dplyr functions. Instead, I was greeted by vintage Fortran-esque base R spanning hundreds of lines, punctuated occasionally by mild cuss words as comments. It was an impregnable wall of verbose incoherence that should be declared illegal in the year 2021. How was this even allowed to happen? Well, that was the way then, but that way had to end. Immediately.
It was time to clean up and get everything straight.
Creating functioning databases
There were databases on our secure server, but ISC no longer used them. Instead, there were copies of raw files and deposits from partners stored in a file system in their native formats. When a new request came in, new custom scripts had to be written to identify, query, and process data lying disjointed across dozens of CSV and Excel files. This contributed to inordinate processing times.
The data was not modeled into structured relational databases. We identified the most popular datasets in CRDT and started analyzing their structure and collecting the additional information we’d need to model the data into structured databases. We created ER diagrams and developed scripts to load the data into SQL Server. We also developed basic routines to query and extract data from these new databases into our data processing pipelines.
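To make the shift concrete, here is a minimal sketch of moving a flat-file deposit into a structured, queryable table. It uses Python’s built-in sqlite3 as a stand-in for SQL Server, and the table name, columns, and sample rows are entirely hypothetical, not CRDT’s actual schema:

```python
import csv
import io
import sqlite3

# Hypothetical schema for illustration; real CRDT tables and columns differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        school     TEXT
    )
""")

# Stand-in for a raw partner deposit that previously sat in the file system.
raw_deposit = io.StringIO(
    "student_id,first_name,last_name,school\n"
    "101,Ada,Lovelace,East High\n"
    "102,Alan,Turing,West High\n"
)

rows = [(int(r["student_id"]), r["first_name"], r["last_name"], r["school"])
        for r in csv.DictReader(raw_deposit)]
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Once loaded, a request becomes a query instead of a one-off parsing script.
count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```

The payoff is that each new data request reuses the same loaded tables rather than re-parsing dozens of files from scratch.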
Clearing and archiving
Creating a system to organize projects and files is one thing. Following it is another. We did neither very well. Final_final_golden_1_1.txt was real. There were files strewn around the server that were neither human readable nor machine readable. A basic cleanup was necessary anyway. Old files and folders had to be properly archived. While creating and loading databases, we committed to not using flat files for our core data operations, and archived them. We made a system to manage working directories and software libraries. We knew all of this would change as other aspects of our institute’s operations improved, but we made a system anyway to keep things organized.
Central warehouse
As we began rebuilding functional databases, we started drawing up a plan for a central data warehouse that could house all CRDT data and any auxiliary data used for research and data analysis. The goal was to create a central warehouse that linked the databases together because, well, our bread and butter was linked data. The warehouse had to be nimble enough to accommodate changes to the data structures in the short and medium term, as we were still discovering better information about our datasets and developing improved curation routines. But more importantly, we had to develop ways to link every database with every other database.
Record linkage - First Pass
We needed better record linkage methods. Our databases come from different organizations and do not share standard keys; links between databases were made through people who appeared in multiple databases. So personally identifiable data such as names and dates of birth were used for linking. This was hardly ideal or accurate, as data quality in administrative databases was all over the place. But such was the nature of this beast, and we had to do our best and build better methods as we understood the data or improved its quality.
We had historically used a mix of exact and fuzzy algorithms to match data. We needed to formally assess the method and test alternatives based on best practice. We started testing Levenshtein distance, cosine similarity, and other algorithms that complemented the existing methodology, and wrote scripts that applied them in a cascading manner so that the best possible linkage was achieved. We saw that some methods were more accurate but inconsistent, so any thresholds we set to automate the linkages were unreliable. But we had to establish benchmarks for automating the linkage process and, more importantly, develop a way to easily store and query the linked data. Thus the idea of a master link database was born.
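As an illustration of cascading matching, here is a minimal Python sketch with a two-tier cascade: try an exact match first, then fall back to a fuzzy name comparison gated on an exact date of birth. The field names, tiers, and edit-distance threshold are hypothetical, not CRDT’s actual methodology:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match(rec_a: dict, rec_b: dict, max_edits: int = 2):
    """Cascading match: return the tier at which two records link, or None.

    Hypothetical tiers: (1) all fields identical; (2) dates of birth agree
    exactly and the combined name edit distance is within max_edits.
    """
    if rec_a == rec_b:
        return "exact"
    if rec_a["dob"] == rec_b["dob"]:
        d = (levenshtein(rec_a["first"], rec_b["first"])
             + levenshtein(rec_a["last"], rec_b["last"]))
        if d <= max_edits:
            return "fuzzy"
    return None

# Example: a typo in the first name still links at the fuzzy tier.
a = {"first": "Jon", "last": "Smith", "dob": "1990-01-01"}
b = {"first": "John", "last": "Smith", "dob": "1990-01-01"}
tier = match(a, b)  # "fuzzy"
```

A cascade like this records which tier produced each link, which is exactly the kind of information needed later to benchmark thresholds and audit linkage quality.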
Master link database
The idea of a link database was simple. Based on our gold-standard record linkage method, we had to perform record linkage across all pairs of databases and identify a unique index of the people represented in CRDT’s repository. The database thus becomes a single source from which all overlaps between the databases can be identified and queried. So we created the master link database. As new data deposits came in, we needed to update it. We also recognized that we had to do a complete refresh of the database whenever our entity resolution methodology was updated, and that we then had to preserve the links made by older methods for reproducibility. But we weren’t that far yet. We needed a place where all data were linked and could be used. This had been a goal and a need of the CRDT board and other stakeholders for a long time. We completed it by the end of 2021.
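One common way to turn pairwise links into a unique person index is to treat linked records as connected components, for example with a union-find structure. This is a hypothetical sketch of that idea, not CRDT’s implementation; the database names and IDs are made up:

```python
class UnionFind:
    """Minimal union-find for grouping linked records into one entity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairwise links produced by record linkage; each record is identified
# by a (database, local_id) pair. Names here are purely illustrative.
links = [(("schools", 101), ("nutrition", "A7")),
         (("nutrition", "A7"), ("housing", 55))]

uf = UnionFind()
for a, b in links:
    uf.union(a, b)

# Assign one master ID per connected component of linked records.
roots = {}
master_id = {}
for record in {r for pair in links for r in pair}:
    root = uf.find(record)
    roots.setdefault(root, len(roots) + 1)
    master_id[record] = roots[root]
```

Here all three records resolve to the same master ID, so a query against any one database can be joined to the others through that shared index.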
What was next?
A big contract to revamp the data infrastructure was brewing alongside these moves. Such a large investment would be a welcome addition to our work, but there were lingering questions and uncertain details surrounding it. We also had to start documenting our operations to establish continuity and reproducibility and to minimize black boxes. Did I tell you that all of this work was getting done with no full-time staff? More about all this in the next post…