Embracing multilingualism in data science
By Kailas Venkitasubramanian · Research Analytics · Data Science · Reproducible Research
April 10, 2025
In the previous posts of this series, I covered why reproducibility matters and how we are designing reproducible data pipelines at the UNC Charlotte Urban Institute. Both of those efforts rest on a more basic question: which programming languages should a small research team actually use? This post is about the layer underneath both.
Specifically, I want to argue that embracing multilingualism—fluency in both R and Python, rather than loyalty to one—has quietly done more for our team’s output than almost any other choice we’ve made.
The old debate, and why it is the wrong question
If you have spent any time in data science circles, you have encountered the “R versus Python” conversation. It is a decades-old debate with passionate advocates on both sides. R was built by statisticians for statisticians. Python was built as a general-purpose language that data science gradually adopted. Each community developed its own ecosystem, idioms, and culture.
But for applied community research—where we work with federal surveys, administrative records, geospatial layers, and policy-sensitive indicators—the question was never “which language is better?” It was: what combination of tools lets a small team do the widest range of work, reliably and reproducibly?
Once we stopped asking which language is better, the question became: what does each one do well?
What bilingual competence actually looks like
At the Urban Institute, our analytical work spans a wide spectrum: statistical modeling, geospatial analysis, data pipeline engineering, automated reporting, interactive dashboards, and secure data linkage. No single language covers all of that optimally.
Here is how the division has emerged in practice:
Where R excels for us:
- Exploratory data analysis and statistical modeling (`tidyverse`, `survey`, `lme4`)
- Publication-quality visualization (`ggplot2`, `tmap`)
- Literate programming and automated report generation (R Markdown, Quarto)
- Spatial statistics and census data access (`tidycensus`, `sf`, `spdep`)
Where Python excels for us:
- Data pipeline orchestration and cloud infrastructure (`Airflow`, `boto3`)
- API development and web service integration (`FastAPI`, `requests`)
- Machine learning at scale (`scikit-learn`, `XGBoost`)
- Text processing and NLP tasks (`spaCy`, `transformers`)
- Database operations and ETL workflows (`SQLAlchemy`, `pandas`)
Where they overlap and complement:
- Geospatial analysis: R's `sf` and Python's `geopandas` each have strengths. We often do exploratory spatial work in R and production geocoding in Python.
- Data wrangling: `dplyr` and `pandas` are both excellent, but our team tends to reach for whichever one a given workflow already lives in.
- Visualization: `ggplot2` for static publication figures; `plotly` (available in both) for interactive dashboards.
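To make the `dplyr`/`pandas` overlap concrete, here is a sketch of the same group-and-summarize step in pandas, with the `dplyr` equivalent shown in a comment. The data frame and column names are invented for illustration, not drawn from one of our projects.

```python
import pandas as pd

# Hypothetical tract-level table; values are invented for illustration
tracts = pd.DataFrame({
    "county": ["Mecklenburg", "Mecklenburg", "Cabarrus", "Cabarrus"],
    "median_income": [62000, 48000, 71000, 55000],
})

# dplyr equivalent:
#   tracts |> group_by(county) |> summarize(avg_income = mean(median_income))
avg_income = (
    tracts.groupby("county", as_index=False)
          .agg(avg_income=("median_income", "mean"))
)
```

The two idioms map almost one-to-one, which is part of why switching between them is a smaller cost than the "R versus Python" framing suggests.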
The insight is not that one language is “better.” It is that research products improve when you use the right tool for each component (Boeing, 2020; Lovelace et al., 2019). A census tract analysis might start with tidycensus in R, get its indicators computed in a Python pipeline, and have its results rendered in a Quarto report that calls both.
The interoperability revolution
What makes multilingualism practical is the tooling that now exists to bridge the two ecosystems.
Reticulate allows R users to call Python functions, import Python modules, and pass data between R and Python objects within a single R session. The `py` and `r` objects let you cross the boundary without leaving your environment. Recent releases (reticulate 1.41+, 2025) have made this even smoother by integrating uv for Python environment management directly from R (Posit PBC, 2025).
Quarto, the publishing system from Posit, is designed from the ground up as a polyglot platform. A single Quarto document can contain R chunks, Python chunks, and Observable JS—all sharing data (Allaire, 2024). This is not a gimmick. For our team, it means a research report can pull census data in R, run a classification model in Python, and render interactive maps—all in one reproducible document.
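As a sketch of what such a polyglot document can look like, here is a minimal Quarto file with an R chunk and a Python chunk sharing data through reticulate. The title and variable names are illustrative, not taken from one of our actual reports.

````markdown
---
title: "Polyglot example"
engine: knitr
---

```{r}
library(reticulate)
income <- c(48000, 62000, 71000)  # illustrative values
```

```{python}
# With the knitr engine, reticulate exposes R objects to Python as `r`
print(sum(r.income) / len(r.income))
```
````

Rendering this one file runs both chunks and produces a single output document, which is exactly the "one reproducible document" workflow described above.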
Positron, Posit’s new IDE (2024), takes this further by providing first-class support for both R and Python in a single environment built on VS Code’s architecture. Unlike RStudio, which was R-first, Positron was designed as a polyglot IDE from the start.
Here is a minimal example of what cross-language work looks like in practice using reticulate:
```r
library(reticulate)
library(tidycensus)

# Pull census data in R (where tidycensus shines);
# requires a Census API key (see ?tidycensus::census_api_key)
charlotte_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NC", county = "Mecklenburg",
  year = 2022, geometry = TRUE
)

# Pass the attribute table to Python via reticulate's `py` object
# (dropping the geometry column, which scikit-learn does not need)
py$charlotte_df <- sf::st_drop_geometry(charlotte_income)

# In Python: leverage scikit-learn to cluster tracts on the income estimate
py_run_string("
from sklearn.cluster import KMeans

df = r.charlotte_df.dropna(subset=['estimate'])
df['cluster'] = KMeans(n_clusters=5, n_init=10).fit_predict(df[['estimate']])
")
```
The boundaries between languages become seams, not walls.
The learning curve is real—and manageable
I would be dishonest if I suggested that adopting a second language is painless. It is not. The learning curve is real, and it shows up differently depending on where a researcher starts.
Researchers from R backgrounds (common in our evaluation and survey research teams) tend to find Python verbose and unfamiliar. The object-oriented paradigm, zero-based indexing, and environment management (conda, venv, uv) all require adjustment. Researchers from Python backgrounds (more common among our data engineers and recent computer science graduates) find R's formula syntax, tidyverse piping, and statistical modeling conventions idiosyncratic.
Our approach has been pragmatic rather than prescriptive:
1. **Start with the second language in context.** We do not send people to generic "Intro to Python" workshops. Instead, we introduce Python within the workflows people already use: "here is how your ACS pipeline step would work in Python" or "here is how to call that geocoding API."
2. **Create shared templates.** We maintain project templates with both R and Python scaffolding, so new staff and students see bilingual workflows as the norm rather than the exception.
3. **Pair experienced users across languages.** Our R experts review Python code, and vice versa. This builds cross-fluency and catches idiom mistakes that linters miss.
4. **Use Quarto as the bridge.** Because Quarto supports both languages natively, it becomes a natural meeting ground. A researcher comfortable in R can contribute their section, while a data engineer writes theirs in Python, and the output is a single, unified document.
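As an example of the "in context" approach, here is the kind of sketch we might show an R user for an ACS pipeline step: building the Census API request that a package like tidycensus issues under the hood. The URL follows the public ACS 5-year API pattern, but the helper function itself is hypothetical, and a real pipeline would add an API key and error handling.

```python
from urllib.parse import urlencode

def build_acs_url(variable: str, state_fips: str, county_fips: str,
                  year: int = 2022) -> str:
    """Build an ACS 5-year API request URL for all tracts in one county.

    Hypothetical helper for illustration only; production code would
    attach an API key and handle errors (or just use a client library).
    """
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": f"NAME,{variable}",
        "for": "tract:*",
        "in": f"state:{state_fips} county:{county_fips}",
    }
    return f"{base}?{urlencode(params)}"

url = build_acs_url("B19013_001E", "37", "119")  # NC, Mecklenburg County
```

Framing the lesson around a request the researcher already makes from R keeps the new language anchored to a familiar task rather than abstract syntax drills.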
The JetBrains State of Data Science report (2024) confirms what we have observed: researchers and data scientists increasingly work across languages, and the most productive teams treat multilingualism as a team capability rather than an individual requirement. The goal is not equal fluency in both; it is being able to read, understand, and contribute to workflows in either.
What this means for community research institutions
The hiring implication is straightforward: prioritize analytical fluency over language loyalty. A researcher who deeply understands survey methodology or spatial econometrics is valuable regardless of whether they code in R or Python. The interoperability tools now exist to let teams compose work across languages without forcing everyone into the same stack.
Reproducibility also improves when you’re not artificially constrained. Some of our most robust pipelines exist precisely because we picked the best tool for each step. Capacity building becomes more inclusive as a side effect: graduate students arriving with Python training can contribute immediately to data engineering tasks, while staff with R backgrounds lead the statistical analysis. Instead of retraining everyone, we leverage the diversity of skills already present.
Wilson et al. (2017) advocated for “good enough practices” in scientific computing — pragmatic habits that any research team can adopt. Multilingualism is one of those practices. It is not about perfection. It is about removing unnecessary constraints on how your team does its best work.
References
- Allaire, J.J. (2024). Quarto: Open-source scientific and technical publishing system. quarto.org
- Boeing, G. (2020). The right tools for the job: The case for spatial science tool-building. Transactions in GIS, 24(5), 1299–1314.
- JetBrains. (2024). The State of Data Science 2024. blog.jetbrains.com
- Lovelace, R., Nowosad, J., & Muenchow, J. (2019). Geocomputation with R. CRC Press. r.geocompx.org
- Posit PBC. (2025). Reticulate: R interface to Python. rstudio.github.io/reticulate
- Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. doi.org/10.1371/journal.pcbi.1005510