Embracing multilingualism in data science
By Kailas Venkitasubramanian · Research Analytics, Data Science, Reproducible Research
April 10, 2025
In the previous posts of this series, I discussed why reproducibility matters for community research institutions and how we are designing reproducible data pipelines at the UNC Charlotte Urban Institute. In this post, I want to turn to something that sits underneath both of those efforts: the programming languages we use to do the work.
Specifically, I want to argue that embracing multilingualism—fluency in both R and Python, rather than loyalty to one—has been one of the most consequential decisions we have made as a research analytics team.
The old debate, and why it is the wrong question
If you have spent any time in data science circles, you have encountered the “R versus Python” conversation. It is a decades-old debate with passionate advocates on both sides. R was built by statisticians for statisticians. Python was built as a general-purpose language that data science gradually adopted. Each community developed its own ecosystem, idioms, and culture.
But for applied community research—where we work with federal surveys, administrative records, geospatial layers, and policy-sensitive indicators—the question was never “which language is better?” It was: what combination of tools lets a small team do the widest range of work, reliably and reproducibly?
That reframing changed everything for us.
What bilingual competence actually looks like
At the Urban Institute, our analytical work spans a wide spectrum: statistical modeling, geospatial analysis, data pipeline engineering, automated reporting, interactive dashboards, and secure data linkage. No single language covers all of that optimally.
Here is how the division has emerged in practice:
Where R excels for us:
- Exploratory data analysis and statistical modeling (tidyverse, survey, lme4)
- Publication-quality visualization (ggplot2, tmap)
- Literate programming and automated report generation (R Markdown, Quarto)
- Spatial statistics and census data access (tidycensus, sf, spdep)
Where Python excels for us:
- Data pipeline orchestration and cloud infrastructure (Airflow, boto3)
- API development and web service integration (FastAPI, requests)
- Machine learning at scale (scikit-learn, XGBoost)
- Text processing and NLP tasks (spaCy, transformers)
- Database operations and ETL workflows (SQLAlchemy, pandas)
Where they overlap and complement:
- Geospatial analysis: R's sf and Python's geopandas each have strengths. We often do exploratory spatial work in R and production geocoding in Python.
- Data wrangling: dplyr and pandas are both excellent, but our team tends to reach for whichever ecosystem they are already working in.
- Visualization: ggplot2 for static publication figures; plotly (available in both languages) for interactive dashboards.
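To make the overlap concrete, here is the pandas side of a group-and-summarize that a dplyr user would write with group_by() and summarise(). The tract table is made up purely for illustration:

```python
import pandas as pd

# A made-up tract-level table, for illustration only
tracts = pd.DataFrame({
    "county": ["Mecklenburg", "Mecklenburg", "Union"],
    "median_income": [52000, 61000, 70000],
})

# pandas equivalent of dplyr's group_by(county) |> summarise(mean_income = mean(median_income))
summary = (
    tracts.groupby("county", as_index=False)
          .agg(mean_income=("median_income", "mean"))
)
```

The dplyr version reads almost line for line the same, which is exactly why switching languages mid-task costs more than it gains: whichever one the analyst is already in usually wins.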
The insight is not that one language is “better.” It is that research products improve when you use the right tool for each component (Boeing, 2020; Lovelace et al., 2019). A census tract analysis might start with tidycensus in R, get its indicators computed in a Python pipeline, and have its results rendered in a Quarto report that calls both.
The interoperability revolution
What makes multilingualism practical—not just aspirational—is the tooling that now exists to bridge the two ecosystems seamlessly.
Reticulate allows R users to call Python functions, import Python modules, and pass data between R and Python within a single R session. The py and r objects let you cross the language boundary in either direction. Recent releases (reticulate 1.41+, 2025) have made this even smoother by integrating uv for Python environment management directly from R (Posit PBC, 2025).
Quarto, the next-generation publishing system from Posit, is designed from the ground up as a polyglot platform. A single Quarto document can contain R chunks, Python chunks, and Observable JS—all sharing data (Allaire, 2024). This is not a gimmick. For our team, it means a research report can pull census data in R, run a classification model in Python, and render interactive maps—all in one reproducible document.
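As a sketch of what this looks like, a minimal polyglot .qmd might contain something like the following (the title and chunk contents are illustrative, not our production code):

````markdown
---
title: "Tract-level indicators"
format: html
---

```{r}
library(tidycensus)
# pull ACS estimates in R ...
```

```{python}
from sklearn.cluster import KMeans
# cluster the indicators in Python ...
```
````

Quarto routes the {r} and {python} chunks to their respective engines (when both appear in one document, via knitr and reticulate), so the rendered report is a single artifact.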
Positron, Posit’s new IDE (2024), takes this further by providing first-class support for both R and Python in a single environment built on VS Code’s architecture. Unlike RStudio, which was R-first, Positron was designed as a polyglot IDE from the start.
Here is a minimal example of what cross-language work looks like in practice using reticulate:
```r
library(reticulate)
library(tidycensus)

# Pull census data in R (where tidycensus shines)
charlotte_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NC", county = "Mecklenburg",
  year = 2022, geometry = TRUE
)

# Pass to Python via reticulate's `py` object
# (drop the sf geometry column first; it does not convert cleanly to pandas)
py$charlotte_df <- sf::st_drop_geometry(charlotte_income)
```

```python
# In Python: leverage scikit-learn for spatial clustering
from sklearn.cluster import KMeans

df = charlotte_df  # assigned from R above; r.charlotte_income would
                   # also reach the R object directly
kmeans = KMeans(n_clusters=5)
df["cluster"] = kmeans.fit_predict(df[["estimate"]])
```
The boundaries between languages become seams, not walls.
The learning curve is real—and manageable
I would be dishonest if I suggested that adopting a second language is painless. It is not. The learning curve is real, and it shows up differently depending on where a researcher starts.
For R-trained researchers (common in our evaluation and survey research teams), Python can feel verbose and unfamiliar. The object-oriented paradigm, zero-based indexing, and environment management (conda, venv, uv) all require adjustment.
For Python-trained researchers (more common among data engineers and computer science backgrounds), R’s formula syntax, tidyverse piping, and statistical modeling conventions can feel idiosyncratic.
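The indexing difference alone trips people up in both directions. A tiny Python sketch of the adjustment:

```python
values = [10, 20, 30]

first = values[0]   # Python's first element is index 0 (R's is 1)
last = values[-1]   # negative indices count from the end in Python;
                    # in R, negative indices drop elements instead
```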
Our approach has been pragmatic rather than prescriptive:
- Start with the second language in context. We do not send people to generic "Intro to Python" workshops. Instead, we introduce Python within the workflows people already do: "here is how your ACS pipeline step would work in Python" or "here is how to call that geocoding API."
- Create shared templates. We maintain project templates with both R and Python scaffolding, so new staff and students see bilingual workflows as the norm rather than the exception.
- Pair experienced users across languages. Our R experts review Python code, and vice versa. This builds cross-fluency and catches idiom mistakes that linters miss.
- Use Quarto as the bridge. Because Quarto supports both languages natively, it becomes a natural meeting ground: a researcher comfortable in R can contribute their section, a data engineer can write theirs in Python, and the output is a single, unified document.
The JetBrains Developer Ecosystem Survey (2024) confirms what we have observed: researchers and data scientists increasingly work across languages, and the most productive teams are those that treat multilingualism as a team capability rather than an individual requirement. Not everyone needs to be equally fluent. But everyone should be able to read, understand, and contribute to workflows in both ecosystems.
What this means for community research institutions
For community research institutions working in the Charlotte region or beyond, this has practical implications.
First, hiring and training strategies should prioritize analytical fluency over language loyalty. A researcher who deeply understands survey methodology or spatial econometrics is valuable regardless of whether they code in R or Python. The interoperability tools now exist to let teams compose work across languages.
Second, reproducibility improves when you are not artificially constrained. Some of our most robust pipelines exist precisely because we picked the best tool for each step rather than forcing everything into one ecosystem.
Third, capacity building becomes more inclusive. Graduate students arriving with Python training from computer science programs can contribute immediately to data engineering tasks. Staff with R backgrounds from social science programs can lead the statistical analysis. Instead of retraining everyone into a single language, we leverage the diversity of skills already present.
Wilson et al. (2017) advocated for “good enough practices” in scientific computing—pragmatic habits that any research team can adopt. Multilingualism, supported by modern interoperability tools, is one of those practices. It is not about perfection. It is about removing unnecessary constraints on how your team does its best work.
References
- Allaire, J.J. (2024). Quarto: Open-source scientific and technical publishing system. quarto.org
- Boeing, G. (2020). The right tools for the job: The case for spatial science tool-building. Transactions in GIS, 24(5), 1299–1314.
- JetBrains. (2024). The State of Data Science 2024. blog.jetbrains.com
- Lovelace, R., Nowosad, J., & Muenchow, J. (2019). Geocomputation with R. CRC Press. r.geocompx.org
- Posit PBC. (2025). Reticulate: R interface to Python. rstudio.github.io/reticulate
- Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. doi.org/10.1371/journal.pcbi.1005510