Make your code purr - A short intro to iterating using PURRR in R

By Kailas Venkitasubramanian in R Data Science programming

May 10, 2022

Introducing ‘Purrr’

What’s programming without iterating? Purrr is a package that provides a set of tools for working with functions and vectors in R.

Purrr is a modern alternative to the apply family of functions in base R (e.g., lapply(), sapply(), etc.), with a more consistent syntax and more predictable return types. It is particularly useful for performing repeated operations on lists and data frames. It provides a collection of functions that make it easier to apply functions to data, iterate over lists and vectors, and manipulate data in a flexible way.

Purrr’s key function is map(), which takes a list or vector, applies a function to each element, and returns the results as a list. Typed variants such as map_dbl() and map_chr() return an atomic vector instead. This can be useful for tasks such as cleaning and transforming data, running statistical models on multiple subsets of data, and creating new variables. Purrr also includes many other functions for working with data, such as reduce() and pluck(), and it pairs well with nest() from tidyr. With purrr, you can write more concise code that can be easier to read and maintain.

Let’s walk through a few examples of how you can use purrr to replace the familiar for loops.

A basic iteration - Looping over a vector (or variables)

Suppose you have a vector x and you want to square each element of the vector. You can do this with a for loop like this:

x <- 1:5
y <- numeric(length(x))

for (i in seq_along(x)) {
  y[i] <- x[i]^2
}

This will give you a new vector y with each element squared.

Using purrr, you can do the same thing with map_dbl(), a variant of the map() function:

library(purrr)
x <- 1:5
y <- map_dbl(x, ~ .x^2)

This will give you the same result as the for loop.

The map() function applies a function to each element of a vector and returns the results in a list. The map_dbl() function is a variant of map() that returns a numeric vector instead of a list.
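To make the difference in return types concrete, here is a small sketch comparing map() with two of its typed variants. The variable names (squares_list and so on) are only for illustration.

```r
library(purrr)

x <- 1:5

# map() always returns a list, one element per input element
squares_list <- map(x, ~ .x^2)

# map_dbl() returns a plain numeric vector of the same length
squares_dbl <- map_dbl(x, ~ .x^2)

# map_chr() returns a character vector
squares_chr <- map_chr(x, ~ paste0("value_", .x))

is.list(squares_list)    # TRUE
is.numeric(squares_dbl)  # TRUE
```

The typed variants throw an error if the function returns something of the wrong type or length, which is often a helpful early warning.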

Note the use of the formula shorthand (~ .x^2) inside the map_dbl() call. This is purrr’s convenient way to write a short anonymous function, where .x stands for the current element.
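If the shorthand feels unfamiliar, it may help to see it next to the equivalent spelled-out versions. The following sketch assumes a helper called square(), which is made up for this example.

```r
library(purrr)

x <- 1:5

# 1. purrr's formula shorthand: .x is the current element
y_formula <- map_dbl(x, ~ .x^2)

# 2. An ordinary anonymous function
y_function <- map_dbl(x, function(value) value^2)

# 3. A named function (square() is a hypothetical helper for this example)
square <- function(value) value^2
y_named <- map_dbl(x, square)

# All three produce the same numeric vector
identical(y_formula, y_function)
identical(y_formula, y_named)
```

Which form to use is mostly a matter of taste. The shorthand is compact, while a named function is easier to reuse and test.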

Looping over a list

Let’s say we have a list of data frames, each containing information about a different group of people. We want to calculate the average age of each group, and store the results in a new list. Here’s how we would do it using a for loop:

## Create a list of data frames
df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)

## Using a for loop to loop over the list and calculate the mean age for each group
mean_age_list <- list()
for (i in seq_along(df_list)) {
  mean_age_list[[i]] <- mean(df_list[[i]]$age)
}

And here’s how we could do it using purrr:

## Using map() to perform the same loop
mean_age_list_purrr <- df_list %>% 
  map(~mean(.$age))

## Print the result
mean_age_list_purrr
## [[1]]
## [1] 25
## 
## [[2]]
## [1] 22
## 
## [[3]]
## [1] 27

This will give you the same result as the for loop with a much more concise and less error-prone syntax.

The map() function works similarly to lapply(), applying a function to each element of a list and returning the results in a list. The map_dbl() function is again a variant that returns a numeric vector instead of a list.
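As a sketch of the map_dbl() variant on the same data, the code below returns the group means as a numeric vector and then collects them into a small summary data frame. The names mean_ages and summary_df are only for illustration.

```r
library(purrr)

df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)

# map_dbl() returns a numeric vector instead of a list
mean_ages <- map_dbl(df_list, ~ mean(.x$age))
mean_ages
## [1] 25 22 27

# Collect the group labels and means into one data frame
summary_df <- data.frame(
  name = map_chr(df_list, ~ as.character(.x$name[1])),
  mean_age = mean_ages
)
```

A plain vector is usually easier to work with downstream than a list of length-one elements.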

Note the use of the dot (.) to refer to the current element of the list, and the use of the $ operator to access a specific column of the data frame.

Looping over multiple vectors/lists

Suppose you have three vectors, vec1, vec2, and vec3:

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)

We want to create a new vector that contains the maximum value for each corresponding element of the three input vectors. We can use pmap() to loop over the vectors and apply a function that returns the maximum value:

max_values <- pmap(list(vec1, vec2, vec3), max)

max_values
## [[1]]
## [1] 7
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 9

In this case, pmap() takes a list of vectors (list(vec1, vec2, vec3)) as its first argument and the max function as its second argument. The max function is applied to each set of corresponding elements of the input vectors and returns the maximum value. Because pmap() always returns a list, the result is a list with one maximum value per position.

Note that pmap() generalizes map2(): map2() iterates over exactly two vectors in parallel, while pmap() takes a list containing any number of vectors. There is no map3() in purrr, so pmap() is the function to reach for with three or more inputs.
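Here is a short sketch of map2() and pmap() side by side, using the typed _dbl variants so that the results come back as numeric vectors. The weights in the last call are arbitrary and only there to show how named list elements are matched to function arguments.

```r
library(purrr)

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)

# map2_dbl(): two inputs, referred to as .x and .y
products <- map2_dbl(vec1, vec2, ~ .x * .y)
products
## [1]  4 10 18

# pmap_dbl(): any number of inputs, returned as a numeric vector
max_values <- pmap_dbl(list(vec1, vec2, vec3), max)
max_values
## [1] 7 8 9

# With a named list, elements are matched to arguments by name
weighted <- pmap_dbl(
  list(a = vec1, b = vec2, c = vec3),
  function(a, b, c) a + 2 * b + 3 * c
)
weighted
## [1] 30 36 42
```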

Using Purrr to quickly generate a set of model estimates

Suppose we have the mtcars dataset, which ships with base R in the datasets package, and we want to fit a linear regression model for each ordered pair of the variables mpg, disp, and hp as predictors, with wt as the response variable. We can use purrr to loop over the pairs and fit the models.

First, we’ll load the necessary packages and create a tibble with the combinations of variables we want to use:

library(tidyverse)

vars <- expand_grid(
  x = c("mpg", "disp", "hp"),
  y = c("mpg", "disp", "hp")
) %>%
  filter(x != y)  ## exclude same-variable pairs

vars
## # A tibble: 6 x 2
##   x     y    
##   <chr> <chr>
## 1 mpg   disp 
## 2 mpg   hp   
## 3 disp  mpg  
## 4 disp  hp   
## 5 hp    mpg  
## 6 hp    disp

This creates a tibble with all possible combinations of the variables mpg, disp, and hp, excluding pairs with the same variable.

Next, we’ll define a function that takes two variable names as input, subsets the mtcars dataset to those variables, and fits a linear regression model with wt as the response variable:

fit_model <- function(x, y) {
  mtcars %>%
    ## all_of() selects the columns named by the strings stored in x and y
    select(all_of(c(x, y)), wt) %>%
    lm(wt ~ ., data = .)
}

Now, we can use the pmap() function from purrr to apply this function to each row of the vars tibble and return a list of fitted models:

models <- vars %>%
  pmap(fit_model)

models
## [[1]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)          mpg         disp  
##    3.562739    -0.066315     0.004277  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)          mpg           hp  
##   6.2182863   -0.1455218   -0.0005277  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)         disp          mpg  
##    3.562739     0.004277    -0.066315  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)         disp           hp  
##    1.675818     0.007737    -0.001662  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)           hp          mpg  
##   6.2182863   -0.0005277   -0.1455218  
## 
## 
## [[6]]
## 
## Call:
## lm(formula = wt ~ ., data = .)
## 
## Coefficients:
## (Intercept)           hp         disp  
##    1.675818    -0.001662     0.007737
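A list of models is most useful once you pull numbers out of it. The sketch below is one possible follow-up, not part of the original workflow: it refits the three unordered predictor pairs with reformulate() from base R, so each model records its own formula, and then uses map_dbl() to extract R-squared. The names predictor_pairs and r_squared are made up for this example.

```r
library(purrr)

# The three unordered predictor pairs (order does not change the fit)
predictor_pairs <- list(
  c("mpg", "disp"),
  c("mpg", "hp"),
  c("disp", "hp")
)

# reformulate() builds formulas such as wt ~ mpg + disp from strings
models <- map(
  predictor_pairs,
  ~ lm(reformulate(.x, response = "wt"), data = mtcars)
)
names(models) <- map_chr(predictor_pairs, ~ paste(.x, collapse = " + "))

# Extract one number per model into a named numeric vector
r_squared <- map_dbl(models, ~ summary(.x)$r.squared)
r_squared
```

Naming the list first means the names carry through to r_squared, so each value stays labelled with the predictors that produced it.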

Use purrr to clean datasets

Let’s first create an example dataset

## Generate fake data
name <- c("John", "Alice", "Bob")
age <- c(32, 28, 35)
city <- c("New York   ", "Los Angeles", "San Francisco")
job <- c("Engineer", "Manager", "Scientist")
my_data <- data.frame(name, age, city, job)

Now let’s write a function to conduct a set of cleaning operations on this dataset.

## Load the required packages
library(dplyr)
library(purrr)
library(stringr)


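## Define a list of cleaning functions; each takes a whole column and returns it cleaned
clean_functions <- list(
  ## Convert factor to character (if/else, because ifelse() would keep only the first element)
  function(x) if (is.factor(x)) as.character(x) else x,
  ## Remove leading/trailing white space from character columns only
  function(x) if (is.character(x)) str_trim(x) else x
)

Now, using purrr’s reduce() function to chain the cleaning functions together, let’s apply them in order to all the columns in this dataset.

## Apply each cleaning function in turn to every column in the dataset
clean_column <- function(col) reduce(clean_functions, function(acc, f) f(acc), .init = col)
clean_data <- my_data %>%
  mutate(across(everything(), clean_column))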

## Print the datasets
print(my_data)
##    name age          city       job
## 1  John  32   New York     Engineer
## 2 Alice  28   Los Angeles   Manager
## 3   Bob  35 San Francisco Scientist
print(clean_data)
##    name age          city       job
## 1  John  32      New York  Engineer
## 2 Alice  28   Los Angeles   Manager
## 3   Bob  35 San Francisco Scientist

Pretty cool, huh?
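For the common case of applying a single cleaning step only to columns of a certain type, purrr’s modify_if() is a compact alternative. This is a sketch on a reduced version of the example data, not a replacement for the approach above.

```r
library(purrr)
library(stringr)

my_data <- data.frame(
  name = c("John", "Alice", "Bob"),
  age = c(32, 28, 35),
  city = c("New York   ", "Los Angeles", "San Francisco"),
  stringsAsFactors = FALSE
)

# Trim white space in character columns only; other columns are left untouched
clean_data <- modify_if(my_data, is.character, str_trim)

clean_data$city
## [1] "New York"      "Los Angeles"   "San Francisco"
```

Unlike map(), the modify family returns an object of the same type as its input, so the result here is still a data frame and age stays numeric.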

Other useful/powerful functions available in Purrr

purrr provides a wide range of functions for working with iterative operations on lists, data frames, and other data structures. Here are some other useful functions available in the purrr package:

map2(): applies a function to elements from two lists in parallel.

imap() and imap_dfr(): iterate over a list or vector, apply a function to each element along with its name (or its position, if the input is unnamed), and return the results as a list or data frame, respectively.

pmap() and pmap_dfr(): iterates over a list of lists, applies a function to each element in parallel, and returns the results as a list or data frame, respectively.

reduce(): applies a function cumulatively to a list, returning a single output.

walk(): applies a function to each element of a list for its side effects (such as printing or writing files), discarding the output and invisibly returning the original list.

transpose(): transposes a list of lists or a data frame, interchanging rows and columns.

flatten(): removes one level of nesting from a list.

safely(): wraps a function so that, instead of throwing an error, it returns a list with two elements, result and error.

These functions, among others available in the purrr package, provide efficient and functional alternatives to traditional for loops, making it easier to work with iterative operations on lists and other data structures.
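A few of the functions listed above, sketched on toy inputs. The values and names (safe_log, labels, and so on) are made up for illustration.

```r
library(purrr)

# reduce(): combine all elements into a single value
total <- reduce(1:5, `+`)
total
## [1] 15

# imap_chr(): .x is the element, .y is its name
labels <- imap_chr(c(a = 10, b = 20), ~ paste0(.y, " = ", .x))
labels
##        a        b 
## "a = 10" "b = 20"

# safely(): capture errors instead of stopping
safe_log <- safely(log)
ok <- safe_log(10)      # ok$result holds log(10), ok$error is NULL
bad <- safe_log("ten")  # bad$result is NULL, bad$error holds the error

# walk(): call a function for its side effect only
walk(c("first", "second"), ~ message("Processing ", .x))
```

safely() is particularly handy inside map(): one failing element no longer aborts the whole iteration, and the errors can be inspected afterwards.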

Why choose Purrr over writing for loops in R?

Cleaner, more readable code:

purrr functions can make your code more concise and easier to read, compared to traditional for loops which can be verbose and hard to follow, especially for complex operations.

Functional programming paradigm:

purrr is designed around the functional programming paradigm, which allows you to write more modular, reusable, and composable code. This can make it easier to write complex code that is less prone to errors.

Comparable performance with fewer pitfalls:

purrr functions are not inherently faster than a well-written for loop, since both call your R function once per element. What they do is allocate the output for you, which avoids the common and slow mistake of growing a vector or list inside a loop. For very large inputs, truly vectorized operations (such as x^2 applied to a whole vector) remain the fastest option.

Better integration with tidyverse:

purrr is part of the tidyverse ecosystem of packages, which means it integrates well with other tidyverse packages such as dplyr, tidyr, and ggplot2. This can make it easier to write code that is consistent, modular, and easy to maintain.

Overall, purrr offers a more elegant and efficient way of working with lists, vectors, and other data structures in R, especially when you are working with large datasets, complex operations, or when you want to write code that is easy to read and maintain. While for loops still have their place in R programming, purrr offers a more powerful and flexible alternative that can help you write better code faster.

Resources