Bayesian Improved Surname Geocoding: How It Works and Where We Use It
By Kailas Venkitasubramanian in R, Research Methods, Data Science
September 15, 2024
If you work with administrative data long enough, you run into the same wall eventually: the dataset has everything you need except race and ethnicity. Hospital discharge records, voter files, tax records, benefits enrollment data — these are often rich with information about where people live, what services they use, and what outcomes they experience. But ask whether they capture race or ethnicity, and the answer is usually no, or inconsistently, or only in ways that aren’t usable.
That gap matters. When we partner with health systems, county offices, or community organizations to understand disparities in program access or health outcomes, race and ethnicity are often central to the research question. Without them, we’re limited in what we can say about equity.
Bayesian Improved Surname Geocoding (BISG) doesn’t solve this problem perfectly, but it gets you surprisingly far. It’s a probabilistic method that estimates the likelihood someone belongs to a particular racial or ethnic group using two things almost every administrative record has: their last name and where they live. This post walks through how it works, where we use it in our work, and how you can get started with it in R.
Why this comes up in our work
Our team works with a lot of partner data where race and ethnicity are either missing entirely or collected in ways that don’t translate across datasets. Voter registration files are a good example. North Carolina collects race and ethnicity on voter registration forms, but people often decline to answer, and in other states the field doesn’t exist at all. Insurance claims data, program eligibility records, and health system administrative data frequently have similar gaps.
When a community partner asks us to examine whether a program is reaching residents equitably across racial groups, we need some way to estimate that distribution even when it’s not directly recorded. BISG gives us a defensible, well-documented method for doing that, with a real methodological literature behind it and a track record in academic research and voting rights litigation.
It’s not a substitute for collecting race and ethnicity data properly at the source. If you’re working with a partner on a new data system, advocate for collecting that information from participants directly. But for existing historical data? BISG is often the most practical path forward.
How BISG works (without the math)
The core idea is straightforward. Two pieces of information are useful for guessing someone’s race or ethnicity.
Last names carry signal. The Census Bureau publishes tables linking surnames to racial/ethnic distributions. A surname like “Kim” is associated with a high probability of being Asian. “Garcia” points strongly toward Hispanic. “Williams” is much more common among Black Americans than the general population. Last names are obviously imperfect proxies, but at the population level they carry real information.
So does location. Census data tells us the racial composition of every census block, tract, and county in the country. If someone lives in a census block that is 85% Black, that’s additional evidence beyond their surname.
BISG combines these two pieces of evidence using Bayes’ rule. Start with the surname-based probability as your initial estimate, then update it based on the racial composition of the neighborhood. If your surname suggests you’re 60% likely to be white and 20% likely to be Hispanic, and you live in a neighborhood that’s predominantly Hispanic, the neighborhood information pushes those probabilities in the Hispanic direction.
The output is not a single assigned race. It’s a vector of probabilities, one per racial/ethnic category, summing to 1. A row in your dataset might look like: pred.whi = 0.14, pred.bla = 0.02, pred.his = 0.78, pred.asi = 0.04, pred.oth = 0.02. You then use those probabilities in your analysis rather than a hard classification, which is methodologically more honest about the uncertainty involved.
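To make the update concrete, here is a toy version of the Bayes'-rule step in base R. All the numbers are invented for illustration; a real run pulls the surname probabilities from the Census surname tables and the block composition from decennial Census counts.

```r
# Toy BISG update with invented numbers.
# P(race | surname): from the Census surname tables in a real run
surname_prob <- c(white = 0.60, black = 0.15, hispanic = 0.20, asian = 0.05)
# P(race | block): the block's composition, from Census counts in a real run
block_comp   <- c(white = 0.10, black = 0.05, hispanic = 0.80, asian = 0.05)
# P(race): marginal base rates, needed to turn P(race | block) into P(block | race)
base_rate    <- c(white = 0.58, black = 0.12, hispanic = 0.19, asian = 0.06)

# Bayes' rule: posterior is proportional to P(race | surname) * P(block | race),
# where P(block | race) is proportional to P(race | block) / P(race)
unnorm    <- surname_prob * block_comp / base_rate
posterior <- unnorm / sum(unnorm)
round(posterior, 3)  # hispanic rises from 0.20 (surname alone) to roughly 0.80
```

The surname alone put the Hispanic probability at 0.20; living in a heavily Hispanic block pushes the posterior to about 0.80, which is exactly the neighborhood update described above.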
Where you’d use this in practice
A few scenarios where BISG has come up or could come up in work like ours.
Voter file analysis is probably the most common application. Studying turnout patterns, polling place access, or the effects of registration policy changes often requires estimating how different communities are affected. The wru package in R was built by political scientists studying voting rights, so this use case is baked into the tooling.
Program reach assessments are another common one. A county health department wants to know whether their new outreach program is enrolling residents equitably. Their enrollment records have addresses and names but no race. BISG can produce an estimated racial distribution of enrollees to compare against the county population.
We also use it for examining disparities in administrative outcomes. Whether someone receives a benefit, gets flagged for audit, or is connected to a service, BISG lets you examine whether those processes differ by estimated race even when race isn’t recorded. And for historical datasets that predate any collection of race and ethnicity data, it’s often the only practical option.
The method has real limitations worth being upfront about. It performs better for some groups than others: surname matching is particularly strong for Hispanic and Asian last names, and less reliable for distinguishing Black and white individuals in populations with high surname overlap. It's an estimate, not ground truth, and it should be treated as such in any analysis and any communication of findings. Document the method and acknowledge the uncertainty.
Getting started in R: annotated example with NC voter data
The wru package in R implements BISG. The eiCompare package wraps it in a few helper functions that make the workflow cleaner. Below is an annotated walkthrough adapted from the approach documented by the RPVote project at the University of Washington, applied to North Carolina voter registration data for Cabarrus County.
Cabarrus County is a useful example: a mid-size county northeast of Charlotte with demographic diversity and a voter registration file that has a meaningful "not designated" rate on the race field, which is exactly the problem BISG helps address.
Step 0: Install and load packages
# Install if needed
install.packages(c("wru", "eiCompare", "tidyverse", "tigris", "sf"))
# Load libraries
library(wru) # Core BISG implementation
library(eiCompare) # Wrapper functions for cleaner BISG workflow
library(tidyverse) # Data manipulation
library(tigris) # Pull Census geographic data (blocks, tracts)
library(sf) # Spatial data handling
Step 1: Load the voter registration data
North Carolina voter registration data is public and downloadable from the NC State Board of Elections. For this example we filter to Cabarrus County (FIPS county code 025).
# Load NC voter registration file
# Download from: https://www.ncsbe.gov/results-data/voter-registration-data
nc_voters <- read_delim("VR_Snapshot_20241101.txt", delim = "\t")
# Filter to Cabarrus County and keep relevant fields
cabarrus <- nc_voters %>%
  filter(county_desc == "CABARRUS") %>%
  select(
    voter_reg_num,      # Unique voter ID
    last_name,
    first_name,
    res_street_address, # Residential address
    res_city_desc,
    state_cd,
    zip_code,
    race_code           # We'll use this later to validate BISG estimates
  )
# Quick look
glimpse(cabarrus)
# Note: race_code will include "U" (undesignated) and "M" (multiracial)
# alongside the standard categories
Step 2: Geocode addresses to get Census block information
BISG needs Census geography — specifically tract and block codes — to pull the neighborhood racial composition. If your data doesn’t already have these, you’ll need to geocode the addresses first. The Census Geocoder API works well for this and doesn’t require an API key.
# The eiCompare package has a geocoding function that hits the Census API
# For large files, run in batches or use parallel processing
# For demonstration, assume you have already geocoded and have a file
# called cabarrus_geo.csv with columns: voter_reg_num, lat, lon,
# STATEFP, COUNTYFP, TRACTCE, BLOCKCE
cabarrus_geo <- read_csv("cabarrus_geo.csv")
# Join geocoded data back to voter file
cabarrus_full <- cabarrus %>%
  left_join(cabarrus_geo, by = "voter_reg_num")
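For small files you can hit the Census Geocoder directly rather than relying on a pre-built file. Here is a minimal single-address sketch using the httr and jsonlite packages; the JSON field names ("Census Blocks", STATE, COUNTY, TRACT, BLOCK) reflect my reading of the Geocoder's current response format, so verify them against the API documentation before relying on this.

```r
library(httr)
library(jsonlite)

# Geocode one address with the Census Geocoder (no API key required)
# and return the state/county/tract/block FIPS codes that BISG needs.
geocode_one <- function(address) {
  resp <- GET(
    "https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress",
    query = list(
      address   = address,
      benchmark = "Public_AR_Current",
      vintage   = "Current_Current",
      format    = "json"
    )
  )
  matches <- content(resp, as = "parsed")$result$addressMatches
  if (length(matches) == 0) return(NULL)  # no match for this address
  blk <- matches[[1]]$geographies[["Census Blocks"]][[1]]
  list(STATEFP = blk$STATE, COUNTYFP = blk$COUNTY,
       TRACTCE = blk$TRACT, BLOCKCE = blk$BLOCK)
}
```

For a full voter file, batch the requests (the Census Geocoder also offers a batch endpoint that accepts CSV uploads) or use the geocoding helpers in eiCompare.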
Step 3: Pull Census block demographic data
The wru package needs Census data describing the racial composition of each block or tract. You can provide this yourself or let wru pull it directly.
# Pull Census data for Cabarrus County, NC
# State FIPS: 37, County FIPS: 025
nc_census <- get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),  # Set your Census API key in .Renviron
  states = "NC",
  county.list = list("NC" = c("025")), # Cabarrus County
  census.geo = "block",
  census.year = 2020
)
# This pulls block-level P1 race/ethnicity counts from decennial Census
Note: You'll need a free Census API key from api.census.gov. Store it in your .Renviron file as CENSUS_API_KEY rather than hardcoding it in scripts.
Step 4: De-duplicate the voter file
Voter files sometimes have duplicate entries. Remove them before running BISG.
# Remove duplicate voter IDs
# The dedupe_voter_file function from eiCompare handles this
cabarrus_dedupe <- dedupe_voter_file(
  voter_file = cabarrus_full,
  voter_id = "voter_reg_num"
)
cat("Records before deduplication:", nrow(cabarrus_full), "\n")
cat("Records after deduplication:", nrow(cabarrus_dedupe), "\n")
Step 5: Run BISG
This is the core step. The wru_predict_race_wrapper function from eiCompare calls wru under the hood and handles some edge cases (for example, voters whose addresses don't match at the block level are automatically elevated to tract-level matching).
# Convert to data frame (required by wru)
cabarrus_df <- as.data.frame(cabarrus_dedupe)
# Run BISG
bisg_results <- eiCompare::wru_predict_race_wrapper(
  voter_file = cabarrus_df,
  census_data = nc_census,
  voter_id = "voter_reg_num",
  surname = "last_name",
  state = "NC",
  county = "COUNTYFP",
  tract = "TRACTCE",
  block = "BLOCKCE",
  census_geo = "block",        # Use block-level geography for best precision
  use_surname = TRUE,          # Use surname probabilities
  surname_only = FALSE,        # Combine surname + geography (recommended)
  surname_year = 2020,         # Match to 2020 Census surname tables
  use_age = FALSE,
  use_sex = FALSE,
  return_surname_flag = TRUE,  # Flag whether surname was matched
  return_geocode_flag = TRUE,  # Flag whether block was matched
  verbose = TRUE
)
# Inspect first few rows
head(bisg_results %>% select(voter_reg_num, last_name, starts_with("pred.")))
# voter_reg_num last_name pred.whi pred.bla pred.his pred.asi pred.oth
# 1 ... JOHNSON 0.512 0.403 0.047 0.012 0.026
# 2 ... NGUYEN 0.031 0.009 0.018 0.927 0.015
# 3 ... GARCIA 0.089 0.014 0.851 0.029 0.017
Step 6: Summarize results
The probability columns (pred.whi, pred.bla, pred.his, pred.asi, pred.oth) can be summed across all voters to estimate the racial/ethnic composition of the file.
# Estimated racial composition of Cabarrus County registered voters
bisg_summary <- bisg_results %>%
  summarise(
    est_white = sum(pred.whi, na.rm = TRUE),
    est_black = sum(pred.bla, na.rm = TRUE),
    est_hispanic = sum(pred.his, na.rm = TRUE),
    est_asian = sum(pred.asi, na.rm = TRUE),
    est_other = sum(pred.oth, na.rm = TRUE),
    total = n()
  ) %>%
  mutate(across(starts_with("est_"), ~ . / total, .names = "pct_{.col}"))
print(bisg_summary)
Step 7: Visualize
# Reshape for plotting
bisg_long <- bisg_results %>%
  summarise(across(starts_with("pred."), sum)) %>%
  pivot_longer(everything(), names_to = "group", values_to = "count") %>%
  mutate(
    group = recode(group,
      "pred.whi" = "White",
      "pred.bla" = "Black",
      "pred.his" = "Hispanic",
      "pred.asi" = "Asian",
      "pred.oth" = "Other"
    ),
    proportion = count / sum(count)
  )
ggplot(bisg_long, aes(x = group, y = proportion, fill = group)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "BISG-Estimated Racial/Ethnic Composition",
    subtitle = "Cabarrus County, NC Voter Registration File",
    x = NULL,
    y = "Estimated Proportion"
  ) +
  theme_minimal()
Step 8 (Optional): Validate against self-reported race
If your dataset has self-reported race for some subset of records, you can assess how well BISG performs in your specific population.
# Compare BISG predictions against self-reported race
# for voters who did provide a race code
validation <- bisg_results %>%
  filter(race_code %in% c("W", "B", "A", "I", "O", "M")) %>%
  mutate(
    # Assign predicted race as the category with the highest probability.
    # Note: NC records Hispanic ethnicity in a separate ethnic_code field,
    # so an "H" prediction here has no direct race_code counterpart.
    bisg_predicted = case_when(
      pred.whi == pmax(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "W",
      pred.bla == pmax(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "B",
      pred.his == pmax(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "H",
      pred.asi == pmax(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "A",
      TRUE ~ "O"
    )
  )
# Accuracy for voters with reported race
table(validation$race_code, validation$bisg_predicted)
Note that hard-assigning the maximum-probability category discards uncertainty; in any real analysis, use the probability columns directly rather than assigned categories.
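As a concrete illustration of working with the probabilities directly, here is a toy example in base R. The three records, two groups, and the 0/1 enrolled outcome are all invented; the point is that each person contributes to every group's rate in proportion to their BISG probability instead of being hard-assigned to one group.

```r
# Toy example: probability-weighted outcome rates by estimated group.
probs <- cbind(pred.whi = c(0.9, 0.2, 0.5),
               pred.bla = c(0.1, 0.8, 0.5))
enrolled <- c(1, 0, 1)  # hypothetical 0/1 outcome per record

# Weighted rate: each group's numerator and denominator accumulate
# fractional people according to the BISG probabilities.
rates <- colSums(probs * enrolled) / colSums(probs)
round(rates, 3)  # pred.whi = 0.875, pred.bla = 0.429
```

The same pattern generalizes to regression: use the probability columns as weights or covariates rather than collapsing them to a single label.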
A few things worth keeping in mind
Running BISG in R takes maybe an afternoon to set up properly. The harder parts are the geocoding step (which can be slow for large files and requires address standardization) and getting Census API access configured. Once those are in place, the actual BISG computation runs quickly even on large voter files.
When you report results from BISG analysis, be clear that racial/ethnic estimates are probabilistic, not measured directly from individuals. That’s honest methodology. Reviewers who know this literature will expect it. And if your partner organization is hoping to use these results for anything high-stakes, it’s worth discussing whether additional data collection might be feasible to supplement or validate the estimates.
The wru package has been updated to use 2020 Census surname and demographic data, which matters. Demographic compositions shift over time and the 2010 tables were noticeably dated by the mid-2010s. Make sure you’re using surname_year = 2020 to match the current data.
Give it a try. If you run into issues or find yourself adapting this for a different data context, I’d genuinely like to hear how it goes — drop me a note through the contact page.