Make your code purr - A short intro to iterating using PURRR in R
By Kailas Venkitasubramanian in R Data Science programming
May 10, 2022
Introducing ‘Purrr’
What’s programming without iterating? Purrr is a package that provides a set of tools for working with functions and vectors in R. Modeled on base R's apply-style functions (e.g., lapply(), sapply(), etc.), Purrr provides a more consistent and modern syntax. Purrr is particularly useful for performing repeated operations on lists and data frames. It provides a collection of functions that make it easier to apply functions to data, iterate over lists and vectors, and manipulate data in a flexible way.
Purrr’s key function is map(), which takes a list or vector and applies a function to each element, returning a new list (or, with typed variants such as map_dbl(), a new vector). This can be useful for tasks such as cleaning and transforming data, running statistical models on multiple subsets of data, and creating new variables. Purrr also includes many other functions for working with data, such as reduce(), pluck(), and keep(). With purrr, you can write more concise code that can be easier to read and maintain.
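As a quick illustration of reduce() and pluck(), here is a minimal sketch. The `scores` list and its contents are made up for this example.

```r
library(purrr)

# Hypothetical nested list, invented for illustration
scores <- list(
  alice = list(age = 28, marks = c(90, 85)),
  bob   = list(age = 35, marks = c(70, 95))
)

# pluck() reaches into a nested structure by name or position
alice_second_mark <- pluck(scores, "alice", "marks", 2)

# reduce() collapses a vector to a single value by repeatedly
# applying a two-argument function (here, addition)
total <- reduce(c(1, 2, 3, 4), `+`)
```

Here `alice_second_mark` is 85 and `total` is 10.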
Let’s walk through a few examples of how you can use purrr to replace the familiar for loops.
A basic iteration - Looping over a vector (or variables)
Suppose you have a vector x and you want to square each element of the vector. You can do this with a for loop like this:
x <- 1:5
y <- numeric(length(x))
for (i in seq_along(x)) {
  y[i] <- x[i]^2
}
This will give you a new vector y with each element squared.
Using purrr, you can do the same thing with the map_dbl() function:
library(purrr)
x <- 1:5
y <- map_dbl(x, ~ .x^2)
This will give you the same result as the for loop.
The map() function applies a function to each element of a vector and returns the results in a list. The map_dbl() function is a variant of map() that returns a numeric vector instead of a list.
Note the use of the formula shorthand (~ .x^2) inside the map_dbl() call. This is purrr's compact way to write a simple anonymous function, with .x standing for the current element.
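To show what the formula shorthand stands for, here is a small sketch with three equivalent ways of passing the same squaring function. The variable names are chosen just for this example, and the last form assumes R 4.1 or later.

```r
library(purrr)

x <- 1:5

# Three equivalent ways to pass "square the input" to map_dbl()
y_formula <- map_dbl(x, ~ .x^2)           # purrr formula shorthand
y_named   <- map_dbl(x, function(v) v^2)  # ordinary anonymous function
y_lambda  <- map_dbl(x, \(v) v^2)         # base R lambda syntax (R >= 4.1)
```

All three produce the numeric vector 1, 4, 9, 16, 25.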
Looping over a list
Let’s say we have a list of data frames, each containing information about a different group of people. We want to calculate the average age of each group, and store the results in a list. Here’s how we would do it using a for loop:
## Create a list of data frames
df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)
## Using a for loop to loop over the list and calculate the mean age for each group
mean_age_list <- list()
for (i in seq_along(df_list)) {
  mean_age_list[[i]] <- mean(df_list[[i]]$age)
}
And here’s how we could do it using purrr:
## Using map() to perform the same loop
mean_age_list_purrr <- df_list %>%
  map(~mean(.$age))
## Print the result
mean_age_list_purrr
## [[1]]
## [1] 25
##
## [[2]]
## [1] 22
##
## [[3]]
## [1] 27
This will give you the same result as the for loop with a much more concise and less error-prone syntax.
The map() function works similarly to lapply(), applying a function to each element of a list and returning the results in a list. The map_dbl() function is again a variant that returns a numeric vector instead of a list.
Note the use of the dot (.) to refer to the current element of the list, and the use of the $ operator to access a specific column of the data frame.
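If a plain vector or a small summary table is more convenient than a list, the typed variants can help. The following is a minimal sketch that reuses the example data; `summary_df` and its column names are hypothetical choices for illustration.

```r
library(purrr)

df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)

# map_dbl() returns a plain numeric vector rather than a list
mean_ages <- map_dbl(df_list, ~ mean(.x$age))

# map_chr() returns a character vector; here, each group's label
group_names <- map_chr(df_list, ~ .x$name[1])

# Combine both vectors into one summary data frame
summary_df <- data.frame(group = group_names, mean_age = mean_ages)
```

`mean_ages` holds 25, 22, and 27, the same values as the list version above.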
Looping over multiple vectors/lists
Suppose you have three vectors vec1, vec2 and vec3:
vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)
We want to find the maximum value for each corresponding element of the three input vectors. We can use pmap() to loop over the vectors in parallel and apply a function that returns the maximum value:
max_values <- pmap(list(vec1, vec2, vec3), max)
max_values
## [[1]]
## [1] 7
##
## [[2]]
## [1] 8
##
## [[3]]
## [1] 9
In this case, pmap() takes a list of vectors (list(vec1, vec2, vec3)) as its first argument and the max function as its second argument. The max function is applied to each set of corresponding elements of the input vectors and returns the maximum value. The resulting list contains the maximum value for each position of the input vectors.
Note that pmap() generalizes map2(): map2() iterates over exactly two inputs in parallel, while pmap() accepts a list holding any number of them. purrr has no map3(), so with three vectors pmap() is the function to reach for.
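To make the relationship between pmap() and map2() concrete, here is a short sketch using the typed variants, which return a numeric vector rather than a list. The result names are chosen for this example.

```r
library(purrr)

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)

# pmap_dbl() iterates over any number of inputs and returns a numeric vector
max_of_three <- pmap_dbl(list(vec1, vec2, vec3), max)

# map2_dbl() does the same for exactly two inputs
max_of_two <- map2_dbl(vec1, vec2, max)
```

For element-wise maxima specifically, base R's pmax(vec1, vec2, vec3) gives the same values without any iteration; pmap() earns its keep when the function is not already vectorized.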
Using Purrr to quickly generate a set of model estimates
Suppose we have the mtcars dataset, which ships with base R, and we want to fit a linear regression model for each pair of the variables mpg, disp, and hp as predictors, with wt as the response variable. We can use purrr to loop over the combinations and fit the models.
First, we’ll load the necessary packages and create a tibble with the combinations of variables we want to use:
library(tidyverse)
vars <- expand_grid(
  x = c("mpg", "disp", "hp"),
  y = c("mpg", "disp", "hp")
) %>%
  filter(x != y) ## exclude same-variable pairs
vars
## # A tibble: 6 x 2
## x y
## <chr> <chr>
## 1 mpg disp
## 2 mpg hp
## 3 disp mpg
## 4 disp hp
## 5 hp mpg
## 6 hp disp
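This creates a tibble with all ordered pairs of the variables mpg, disp, and hp, excluding pairs with the same variable. Because order matters here, each pair of predictors appears twice (for example, mpg/disp and disp/mpg), so the corresponding models have identical fits.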
Next, we’ll define a function that takes two variable names as input, subsets the mtcars dataset to those variables plus wt, and fits a linear regression model with wt as the response variable:
fit_model <- function(x, y) {
  mtcars %>%
    ## all_of() selects the columns named by the strings held in x and y
    select(all_of(c(x, y)), wt) %>%
    lm(wt ~ ., data = .)
}
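Now, we can use the pmap() function from purrr to apply this function to each row of the vars tibble and return a list of fitted models. Because the columns are named x and y, pmap() matches them to the arguments of fit_model() by name:
models <- vars %>%
  pmap(fit_model)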
models
## [[1]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) mpg disp
## 3.562739 -0.066315 0.004277
##
##
## [[2]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) mpg hp
## 6.2182863 -0.1455218 -0.0005277
##
##
## [[3]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) disp mpg
## 3.562739 0.004277 -0.066315
##
##
## [[4]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) disp hp
## 1.675818 0.007737 -0.001662
##
##
## [[5]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) hp mpg
## 6.2182863 -0.0005277 -0.1455218
##
##
## [[6]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) hp disp
## 1.675818 -0.001662 0.007737
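Once the models sit in a list, the same map functions can pull one statistic out of each. The following is a minimal sketch that needs only purrr and base R; the `pairs` list and the use of reformulate() to build each formula are assumptions made for this example, not the approach used above.

```r
library(purrr)

# Hypothetical predictor pairs, one entry per model
pairs <- list(c("mpg", "disp"), c("mpg", "hp"), c("disp", "hp"))

# Fit wt ~ x + y for each pair; reformulate() builds the formula from strings
models <- map(pairs, ~ lm(reformulate(.x, response = "wt"), data = mtcars))

# Extract one number per model: the R-squared from summary()
r_squared <- map_dbl(models, ~ summary(.x)$r.squared)

# Label each value with the predictors that produced it
names(r_squared) <- map_chr(pairs, paste, collapse = " + ")
```

Building the formula explicitly also keeps the predictor names visible in each model's call, which printing with `wt ~ .` does not.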
Use purrr to clean datasets
Let’s first create an example dataset:
## Generate fake data
name <- c("John", "Alice", "Bob")
age <- c(32, 28, 35)
city <- c("New York ", "Los Angeles", "San Francisco")
job <- c("Engineer", "Manager", "Scientist")
my_data <- data.frame(name, age, city, job)
Now let’s write a function to conduct a set of cleaning operations on this dataset.
## Load the required packages
library(dplyr)
library(purrr)
library(stringr)
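## Define a list of cleaning functions; each takes a column and returns a column
clean_functions <- list(
  ## Convert factor to character
  function(x) if (is.factor(x)) as.character(x) else x,
  ## Remove leading/trailing white space (character columns only)
  function(x) if (is.character(x)) str_trim(x) else x
)
Now, using reduce() to chain the cleaning functions and across() to visit every column, let’s apply the whole list to all the columns in this dataset.
## Apply every cleaning function, in order, to a single column
clean_column <- function(col) {
  reduce(clean_functions, function(acc, f) f(acc), .init = col)
}
## Apply the list of cleaning functions to all columns in the dataset
clean_data <- my_data %>%
  mutate(across(everything(), clean_column))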
## Print the datasets
print(my_data)
## name age city job
## 1 John 32 New York Engineer
## 2 Alice 28 Los Angeles Manager
## 3 Bob 35 San Francisco Scientist
print(clean_data)
## name age city job
## 1 John 32 New York Engineer
## 2 Alice 28 Los Angeles Manager
## 3 Bob 35 San Francisco Scientist
Pretty cool, huh?
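The printed tables look identical because trailing white space is invisible in print(). One way to confirm that trimming happened is to compare string lengths, as in this small sketch, which uses only the city column from the example data.

```r
library(purrr)
library(stringr)

city <- c("New York ", "Los Angeles", "San Francisco")

# Trim each value; map_chr() returns a character vector
city_clean <- map_chr(city, str_trim)

# Character counts before and after trimming
len_before <- nchar(city)
len_after  <- nchar(city_clean)
```

Only the first entry changes length, from 9 to 8 characters. Note that str_trim() is already vectorized, so str_trim(city) alone gives the same result; map_chr() is used here only to mirror the iteration pattern.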
Other useful/powerful functions available in Purrr
purrr provides a wide range of functions for working with iterative operations on lists, data frames, and other data structures. Here are some other useful functions available in the purrr package:
map2(): applies a function to elements from two lists in parallel.
imap() and imap_dfr(): iterate over a list or vector, apply a function to each element along with its name or index, and return the results as a list or data frame, respectively.
pmap() and pmap_dfr(): iterate over a list of inputs in parallel, apply a function to each set of corresponding elements, and return the results as a list or data frame, respectively.
reduce(): combines the elements of a list into a single value by repeatedly applying a two-argument function.
walk(): applies a function to each element of a list for its side effects, discarding the output and invisibly returning the original list.
transpose(): turns a list of lists (or a data frame) inside out, so that, for example, a list of records becomes a list of fields.
flatten(): removes one level of nesting from a list.
safely(): wraps a function so that it returns a list with result and error components instead of throwing an error.
These functions, among others available in the purrr package, provide efficient and functional alternatives to traditional for loops, making it easier to work with iterative operations on lists and other data structures.
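A few of these helpers can be sketched together in one short example. The `inputs` list is invented for illustration, with one deliberately bad value so that safely() has something to catch.

```r
library(purrr)

# Hypothetical inputs; "oops" will make log10() fail
inputs <- list(a = 10, b = "oops", c = 1000)

# safely() wraps log10() so a failure is captured instead of stopping the loop
safe_log <- safely(log10)
results <- map(inputs, safe_log)

# Each element is a list with `result` and `error`; flag the successful ones
succeeded <- map_lgl(results, ~ is.null(.x$error))

# imap_chr() passes each element (.x) together with its name (.y)
labels <- imap_chr(inputs, ~ paste0(.y, "=", .x))

# walk() is for side effects only; here it prints each label as a message
walk(labels, message)
```

In this sketch `succeeded` is TRUE for a and c and FALSE for b, and the error object for b is kept in results$b$error for later inspection.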
Why choose Purrr over writing for loops in R?
Cleaner, more readable code:
purrr functions can make your code more concise and easier to read, compared to traditional for loops which can be verbose and hard to follow, especially for complex operations.
Functional programming paradigm:
purrr is designed around the functional programming paradigm, which allows you to write more modular, reusable, and composable code. This can make it easier to write complex code that is less prone to errors.
Performance:
purrr's map functions run their loop in compiled code and allocate the output up front, so they sidestep the common mistake of growing a vector inside a for loop. They still call your R function once per element, though, so expect speed comparable to a well-written for loop rather than a dramatic gain; truly vectorized operations remain faster than either.
Better integration with tidyverse:
purrr is part of the tidyverse ecosystem of packages, which means it integrates well with other tidyverse packages such as dplyr, tidyr, and ggplot2. This can make it easier to write code that is consistent, modular, and easy to maintain.
Overall, purrr offers a more elegant and efficient way of working with lists, vectors, and other data structures in R, especially when you are working with large datasets, complex operations, or when you want to write code that is easy to read and maintain. While for loops still have their place in R programming, purrr offers a more powerful and flexible alternative that can help you write better code faster.