Using tidycensus to Analyze ACS PUMS Data
A short tutorial
By Kailas Venkitasubramanian in R Data Science Research Methods
May 12, 2024
If you’ve spent any time working with Census data, you know the drill: pull a pre-aggregated table, get median household income by county, move on. It works, and for a lot of questions, it’s exactly what you need. But sometimes the published tables just don’t cut it. What if you want to look at wage distributions for workers with specific educational credentials? Or model individual-level outcomes rather than tract-level averages? That’s where PUMS comes in — and once you start using it, it’s hard to go back.
What is ACS PUMS?
The American Community Survey Public Use Microdata Sample (PUMS) is the individual-level response data that underlies all those summary tables the Census Bureau publishes. Instead of one row per county showing you an aggregate statistic, you get one row per respondent. For the 1-year ACS, that’s roughly 1% of the U.S. population — about 3 million records — with hundreds of variables covering everything from employment and income to commute mode, housing costs, and English proficiency.
The tradeoff is geography. To protect respondent privacy, the Census Bureau doesn’t give you counties or tracts in the microdata. Instead, you get Public Use Microdata Areas (PUMAs), geographic units that each contain at least 100,000 people. PUMAs nest within states and are built from census tracts and counties. In dense urban areas like Mecklenburg County, North Carolina (home to Charlotte), a single county might span several PUMAs. In rural areas, a PUMA might cover multiple counties. This matters for how you define your analysis geography, and I’ll come back to it.
The power of PUMS is the flexibility. You can build custom cross-tabulations that Census doesn’t publish. You can apply your own groupings, filter to subpopulations, and fit statistical models on real individual-level data rather than ecological proxies. If your research questions involve understanding the lived experiences of communities — who is working in which industries, which households are cost-burdened, how educational attainment varies across demographic groups — PUMS gives you the raw material to answer those questions properly.
Why tidycensus Makes This Much Easier
Before the Census Bureau made PUMS available via API, you had to download large flat files from the FTP site, merge housing and person records manually, figure out the data dictionary, and pray your joins were correct. It was doable but tedious. Kyle Walker’s tidycensus package changed that.
With tidycensus, you pass get_pums() the variables you want, the state, and the survey year. The function handles the API call, returns a clean tibble, and can even recode those cryptic numeric codes into readable labels, all in a few lines of R. The package also ships with a built-in data dictionary (pums_variables) so you can browse available variables without leaving your R session.
A Few Things to Keep in Mind Before You Start
Weights are not optional. PUMS is a sample. Each record comes with two weight variables: PWGTP (person weight) and WGTP (housing unit weight). The person weight tells you roughly how many people in the actual population that one respondent represents. If you ignore weights, your estimates will be wrong. For simple counts and proportions, you can apply weights directly using dplyr. For anything involving standard errors or confidence intervals, you’ll want the survey and srvyr packages, which handle the replicate weights the Census provides for variance estimation.
Know whether your variable is person-level or housing-level. Age, race, education, and income are person variables. Number of bedrooms, housing costs, and property value are housing unit variables. Mixing them up without understanding the record structure will give you nonsense results. In the tidycensus data dictionary, the level column tells you which is which.
PUMA geography isn’t the same as county geography. This is especially relevant for Mecklenburg County analysis. You’ll need to identify which PUMAs correspond to Mecklenburg, and in some cases a PUMA may overlap county boundaries. The Census publishes PUMA-to-county relationship files if you need to be precise about geographic coverage.
Sample size matters. For small subpopulations or small geographic areas, your estimates can be noisy. The 1-year PUMS works well for large populations and large subgroups. For smaller subgroups, you may need the 5-year PUMS (survey = "acs5"), which pools five years of responses into a sample of roughly 5% of the population but reflects a multi-year period rather than a single point in time.
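To make the first two points concrete, here is a minimal sketch on a made-up five-person extract. The column names (SERIALNO, SPORDER, PWGTP, WGTP) mirror real PUMS fields, but the values and the has_ba flag are invented for illustration. It shows how far an unweighted share can drift from the weighted one, and why a housing-level estimate needs one row per household before you sum WGTP.

```r
library(dplyr)

# Toy stand-in for a PUMS extract: two households, five people.
# All values are invented; has_ba is a hypothetical derived flag.
toy_pums <- data.frame(
  SERIALNO = c("A", "A", "A", "B", "B"),
  SPORDER  = c(1, 2, 3, 1, 2),
  PWGTP    = c(50, 60, 40, 120, 130),  # person weights
  WGTP     = c(50, 50, 50, 120, 120),  # housing weight, repeated on every person row
  has_ba   = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Person-level estimate: unweighted share vs. share weighted by PWGTP
person_est <- toy_pums %>%
  summarize(
    unweighted_share = mean(has_ba),
    weighted_share   = sum(PWGTP[has_ba]) / sum(PWGTP)
  )

# Housing-level estimate: keep one row per household, then sum WGTP.
# Summing WGTP over all five person rows would count each household
# once per resident.
households_est <- toy_pums %>%
  distinct(SERIALNO, .keep_all = TRUE) %>%
  summarize(households = sum(WGTP))

person_est
households_est
```

In this toy data the unweighted share is 0.4 while the weighted share is 0.275, because the two degree holders carry small weights. The household total is 170, not the 390 you would get by summing WGTP across every person row.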
Walking Through an Example: Mecklenburg County
Let’s get into the code. I’ll walk through pulling PUMS data for North Carolina and doing some basic analysis focused on Mecklenburg County. I’m using the 2022 1-year ACS here.
Step 1: Set Up and Browse Variables
library(tidyverse)
library(tidycensus)

# One-time setup: get_pums() needs a Census API key
# census_api_key("YOUR_KEY_HERE", install = TRUE)

# Browse the PUMS variable dictionary for 2022
pums_vars_2022 <- pums_variables %>%
  filter(year == 2022, survey == "acs1")

# Look at person-level variables
pums_vars_2022 %>%
  distinct(var_code, var_label, data_type, level) %>%
  filter(level == "person") %>%
  print(n = 20)
This gives you a searchable inventory of what’s available. Some of the variables I find most useful for community analysis: AGEP (age), SCHL (educational attainment), ESR (employment status), WAGP (wages), RAC1P (race), HISP (Hispanic origin), and PUMA (the geographic unit).
Step 2: Pull Data for North Carolina
# Download PUMS data for NC
nc_pums <- get_pums(
  variables = c("PUMA", "AGEP", "SCHL", "ESR", "WAGP", "SEX", "RAC1P", "HISP"),
  state = "NC",
  survey = "acs1",
  year = 2022,
  recode = TRUE,          # returns human-readable labels alongside codes
  rep_weights = "person"  # also returns PWGTP1-PWGTP80, needed for standard errors
)
The recode = TRUE argument is worth using: it adds _label columns with decoded values, so SEX is accompanied by SEX_label (Male/Female), SCHL by a readable education category, and so on. The raw codes are still there if you need them for custom recoding.
In addition to the variables you requested, get_pums() always returns SERIALNO (housing unit ID), SPORDER (person order within unit), WGTP (housing weight), PWGTP (person weight), and ST (state code).
Step 3: Filter to Mecklenburg County PUMAs
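Mecklenburg County corresponds to several PUMAs in North Carolina. The code below assumes the relevant codes are 01101 through 01108. Verify them against the Census PUMA reference maps or the tigris package before relying on them, because the 2022 PUMS uses PUMA boundaries redrawn after the 2020 Census and the codes differ from earlier years. Here I’ll filter to those PUMAs: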
# Identify Mecklenburg PUMAs
meck_pumas <- c("01101", "01102", "01103", "01104", "01105", "01106", "01107", "01108")

meck_pums <- nc_pums %>%
  filter(PUMA %in% meck_pumas)
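Because a mistyped or outdated PUMA code silently drops out of an %in% filter, a quick sanity check is worth running before any analysis. The sketch below uses an invented toy table in place of nc_pums; the codes, weights, and object names are all hypothetical. It flags requested codes that never appear in the data and totals the weighted population for each PUMA that does.

```r
library(dplyr)

# Toy stand-in for nc_pums; PUMA codes and weights are invented
toy_nc <- data.frame(
  PUMA  = c("00101", "00101", "00102", "00103"),
  PWGTP = c(60000, 55000, 130000, 101000)
)

# "00199" is deliberately absent from the data
wanted_pumas <- c("00101", "00102", "00199")

# Codes requested but not present in the download (typos, wrong vintage)
missing_codes <- setdiff(wanted_pumas, unique(toy_nc$PUMA))

# Weighted population per retained PUMA
puma_pop <- toy_nc %>%
  filter(PUMA %in% wanted_pumas) %>%
  count(PUMA, wt = PWGTP, name = "pop_est")

missing_codes
puma_pop
```

With real data, each PUMA's weighted total should come out at roughly 100,000 or more, and the sum across your PUMAs should land near the published county population. A non-empty missing_codes means the code list needs another look.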
Step 4: Apply Weights for Basic Estimates
A common mistake is treating each row as one person. It isn’t. Let’s use person weights to get an estimate of the working-age population (25–64) by educational attainment:
# Educational attainment for working-age adults in Mecklenburg
meck_pums %>%
  filter(AGEP >= 25, AGEP <= 64) %>%
  mutate(
    ba_or_above = SCHL %in% c("21", "22", "23", "24") # BA, MA, professional, doctorate
  ) %>%
  group_by(ba_or_above) %>%
  summarize(
    weighted_count = sum(PWGTP),
    .groups = "drop"
  ) %>%
  mutate(pct = weighted_count / sum(weighted_count))
This gives you a weighted estimate — not a sample count — of how many working-age Mecklenburg residents hold a bachelor’s degree or higher.
Step 5: Weighted Mean Wages by Education
Now let’s look at mean wages among employed workers, broken down by educational attainment. I’ll use weighted.mean() here for simplicity. A few very high earners can pull a mean upward, so if you need medians instead, you’d want survey package methods:
meck_pums %>%
  filter(
    AGEP >= 25,
    ESR %in% c("1", "2"), # civilian employed (at work or with job not at work)
    WAGP > 0              # has wage income
  ) %>%
  mutate(
    edu_group = case_when(
      SCHL %in% c("21", "22", "23", "24") ~ "Bachelor's or above",
      # 18 = some college under 1 year, 19 = 1+ years no degree, 20 = associate's
      SCHL %in% c("18", "19", "20") ~ "Some college / Associate's",
      # 16 = regular high school diploma, 17 = GED or alternative credential
      SCHL %in% c("16", "17") ~ "High school diploma / GED",
      TRUE ~ "Less than high school"
    )
  ) %>%
  group_by(edu_group) %>%
  summarize(
    mean_wage = weighted.mean(WAGP, PWGTP),
    pop_est = sum(PWGTP),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_wage))
You’ll see exactly what you’d expect — and also some patterns that are worth digging into. The gap between “Bachelor’s or above” and “some college” in a county like Mecklenburg, which has seen significant economic growth, is a real story about labor market structure that no pre-aggregated table will tell you as cleanly.
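If you want a quick look at medians before setting up a full survey design, a simple weighted median can be sketched in base R. The weighted_median() helper below is hypothetical (it is not part of tidycensus or the survey package), and it uses the plainest definition: the smallest value at which the cumulative weight share reaches 50%. Published Census medians use interpolation, so treat this as a rough check rather than a replication.

```r
# Hypothetical helper: smallest value whose cumulative weight share reaches 50%
weighted_median <- function(x, w) {
  keep <- !is.na(x) & !is.na(w)
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)          # sort values, carrying weights along
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

# Invented wages and person weights, with one very high earner
wages   <- c(20000, 35000, 50000, 90000, 250000)
weights <- c(100, 150, 120, 80, 20)

wt_mean   <- weighted.mean(wages, weights)
wt_median <- weighted_median(wages, weights)

wt_mean
wt_median
```

In this toy example the weighted median is 35,000 while the weighted mean is about 54,149, which is the kind of gap a handful of high earners produces. Inside a dplyr pipeline the helper could be called the same way as weighted.mean(), for example weighted_median(WAGP, PWGTP) within summarize().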
A Note on Replicate Weights
Everything above uses basic weighting for point estimates. If you’re publishing results or comparing estimates across groups, you should account for the sampling variance using replicate weights. The Census Bureau provides 80 replicate weights for each PUMS record (PWGTP1 through PWGTP80 for persons), but get_pums() only returns them when you ask, so the download needs rep_weights = "person". With those columns in place, the survey and srvyr packages handle the variance calculation:
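library(srvyr)

# Needs PWGTP1 through PWGTP80 in the data, which get_pums() only
# returns when called with rep_weights = "person"
meck_svy <- meck_pums %>%
  filter(AGEP >= 25) %>%
  as_survey_rep(
    weights = PWGTP,
    repweights = matches("PWGTP[0-9]+"),
    type = "JK1",
    scale = 4 / 80, # ACS replicate-weight variance factor
    mse = TRUE
  )

# Mean wage and its standard error by education label
meck_svy %>%
  group_by(SCHL_label) %>%
  summarize(mean_wage = survey_mean(WAGP, na.rm = TRUE))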
This gives you proper standard errors, not just point estimates — and that matters when you’re making claims about communities.
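It helps to see what scale = 4 / 80 is doing. With ACS replicate weights, you compute the estimate once with the full-sample weight and once with each of the 80 replicate weights, and the variance is

$$\widehat{\mathrm{Var}}(\hat\theta) = \frac{4}{80} \sum_{r=1}^{80} \left(\hat\theta_r - \hat\theta\right)^2$$

The sketch below works through that arithmetic by hand in base R. The data are simulated, and the replicate weights are random perturbations I made up so the example runs on its own; real PUMS replicate weights are constructed differently, so only the formula carries over.

```r
set.seed(42)

# Simulated person records: wages and full-sample weights (all invented)
n     <- 200
wage  <- round(runif(n, 15000, 120000))
pwgtp <- round(runif(n, 20, 200))

# Invented stand-ins for PWGTP1..PWGTP80: one column per replicate
rep_wts <- sapply(1:80, function(r) pwgtp * runif(n, 0.5, 1.5))

# Estimate with the full-sample weight, then once per replicate weight
theta_full <- weighted.mean(wage, pwgtp)
theta_reps <- apply(rep_wts, 2, function(w) weighted.mean(wage, w))

# Replicate-weight variance: 4/80 times the sum of squared deviations
variance <- 4 / 80 * sum((theta_reps - theta_full)^2)
se       <- sqrt(variance)

# 90% margin of error, the convention used in published ACS tables
moe_90 <- 1.645 * se

c(estimate = theta_full, se = se, moe_90 = moe_90)
```

This is the calculation as_survey_rep() sets up with type = "JK1", scale = 4 / 80, and mse = TRUE, which is why those arguments should be left as they are when working with PUMS.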
Give It a Try
PUMS data rewards curiosity. Once you get comfortable with the weighting and the variable structure, it opens up a different kind of analysis than what the summary tables allow — one that’s closer to the actual complexity of people’s lives. If you’re doing community research in Mecklenburg or anywhere else, it’s worth adding to your toolkit.
The tidycensus documentation at walker-data.com is the best place to start, and Kyle Walker’s book Analyzing US Census Data covers PUMS methods in much more depth. Try pulling a dataset today and see what questions it surfaces that you weren’t asking before.