# Make your code purr - A short intro to iterating using purrr in R

By Kailas Venkitasubramanian in R Data Science programming

May 10, 2022

## Introducing ‘Purrr’

What’s programming without iterating? `purrr` is a package that provides a set of tools for working with functions and vectors in R.

It plays the same role as the apply family of functions in base R (e.g., `lapply()`, `sapply()`, etc.), but `purrr` provides a more consistent and modern syntax. `purrr` is particularly useful for performing repeated operations on lists and data frames. It provides a collection of functions that make it easier to apply functions to data, iterate over lists and vectors, and manipulate data in a flexible way.

The key function in `purrr` is `map()`, which takes a list or vector, applies a function to each element, and returns the results as a list. Typed variants such as `map_dbl()` and `map_chr()` return an atomic vector instead. This can be useful for tasks such as cleaning and transforming data, running statistical models on multiple subsets of data, and creating new variables. `purrr` also includes many other functions for working with data, such as `reduce()`, `pluck()`, and `keep()`. With `purrr`, you can write more concise code that is easier to read and maintain.
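To make those three functions concrete, here is a minimal sketch. The `scores` list and its values are made up for illustration; only `map()`, `reduce()` and `pluck()` come from `purrr`.

```r
library(purrr)

# Hypothetical example data: two named numeric vectors in a list
scores <- list(math = c(90, 80), art = c(70, 60, 50))

# map() returns a list with one result per element
n_scores <- map(scores, length)

# reduce() combines the elements pairwise into a single value:
# here it adds up the per-subject totals (170 + 180)
grand_total <- reduce(map(scores, sum), `+`)

# pluck() reaches into a nested structure by name and/or position
second_art <- pluck(scores, "art", 2)
```

Here `n_scores` is a named list, while `grand_total` and `second_art` are single numbers.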

Let’s walk through a few examples of how you can use `purrr` to replace the familiar `for` loops.

## A basic iteration - Looping over a vector (or variables)

Suppose you have a vector x and you want to square each element of the vector. You can do this with a for loop like this:

```
x <- 1:5
y <- numeric(length(x))
for (i in seq_along(x)) {
  y[i] <- x[i]^2
}
```

This will give you a new vector y with each element squared.

Using `purrr`, you can do the same thing with the `map_dbl()` function:

```
library(purrr)
x <- 1:5
y <- map_dbl(x, ~ .x^2)
```

This will give you the same result as the `for` loop.

The `map()` function applies a function to each element of a vector and returns the results in a list. The `map_dbl()` function is a variant of `map()` that returns a numeric vector instead of a list.
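The difference in return type is easy to check for yourself. The sketch below reuses `x` from the example and adds `map_chr()`, another typed variant; the `"item_"` labels are just an illustration.

```r
library(purrr)

x <- 1:5

squares_list <- map(x, ~ .x^2)       # a list of five single numbers
squares_dbl  <- map_dbl(x, ~ .x^2)   # a numeric vector of length five
labels_chr   <- map_chr(x, ~ paste0("item_", .x))  # a character vector

is.list(squares_list)    # TRUE
is.numeric(squares_dbl)  # TRUE
```

The typed variants also act as a check: `map_dbl()` raises an error if the function returns something that is not a single number.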

Note the use of the formula shorthand (`~ .x^2`) inside the `map_dbl()` call. This is a convenient way to write simple anonymous functions in `purrr`, where `.x` stands for the current element.
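The shorthand is only one of several ways to pass a function. As a rough sketch, the calls below should all give the same result; `square` is a helper name made up for this example, and the `\(v)` form assumes R 4.1 or later.

```r
library(purrr)

x <- 1:5

# A named helper function (hypothetical name)
square <- function(v) v^2

y_named   <- map_dbl(x, square)              # pass a function by name
y_anon    <- map_dbl(x, function(v) v^2)     # classic anonymous function
y_formula <- map_dbl(x, ~ .x^2)              # purrr formula shorthand
y_lambda  <- map_dbl(x, \(v) v^2)            # base R lambda, R >= 4.1
```

For something as simple as squaring, plain `x^2` is already vectorized and needs no iteration at all; the forms above pay off once the function does more work per element.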

## Looping over a list

Let’s say we have a list of data frames, each containing information about a different group of people. We want to calculate the average age of each group, and store the results in a new data frame. Here’s how we would do it using a `for` loop:

```
## Create a list of data frames
df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)
## Using a for loop to loop over the list and calculate the mean age for each group
mean_age_list <- list()
for (i in seq_along(df_list)) {
  mean_age_list[[i]] <- mean(df_list[[i]]$age)
}
```

And here’s how we could do it using `purrr`:

```
## Using map() to perform the same loop
mean_age_list_purrr <- df_list %>%
  map(~mean(.$age))
## Print the result
mean_age_list_purrr
```

```
## [[1]]
## [1] 25
##
## [[2]]
## [1] 22
##
## [[3]]
## [1] 27
```

This will give you the same result as the for loop with a much more concise and less error-prone syntax.

The `map()` function works similarly to `lapply()`, applying a function to each element of a list and returning the results in a list. The `map_dbl()` function is again a variant that returns a numeric vector instead of a list, which is often more convenient for a one-number summary like a mean.

Note the use of the dot (`.`) to refer to the current element of the list (inside a formula, `.` and `.x` mean the same thing), and the use of the `$` operator to access a specific column of the data frame.
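The goal stated at the start of this section was a new data frame of results. One possible way to get there, sketched with `map_chr()` and `map_dbl()` (the name `mean_age_df` and its column names are made up for this example):

```r
library(purrr)

df_list <- list(
  data.frame(name = "Group A", age = c(20, 25, 30)),
  data.frame(name = "Group B", age = c(18, 22, 26)),
  data.frame(name = "Group C", age = c(21, 27, 33))
)

# One row per group: the group label and its mean age
mean_age_df <- data.frame(
  group    = map_chr(df_list, ~ .x$name[1]),   # first label in each data frame
  mean_age = map_dbl(df_list, ~ mean(.x$age)), # 25, 22, 27
  stringsAsFactors = FALSE
)

mean_age_df
```

Because both calls iterate over `df_list` in the same order, the labels and the means line up row by row.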

## Looping over multiple vectors/lists

Suppose you have three vectors `vec1`, `vec2` and `vec3`:

```
vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)
```

We want to create a new vector that contains the maximum value for each corresponding element of the three input vectors. We can use `pmap()` to loop over the vectors and apply a function that returns the maximum value:

```
max_values <- pmap(list(vec1, vec2, vec3), max)
max_values
```

```
## [[1]]
## [1] 7
##
## [[2]]
## [1] 8
##
## [[3]]
## [1] 9
```

In this case, `pmap()` takes a list of vectors (`list(vec1, vec2, vec3)`) as its first argument and the `max` function as its second argument. The `max` function is applied to each set of corresponding elements of the input vectors and returns the maximum value. The result is a list containing the maximum value for each position; use `pmap_dbl()` if you want a numeric vector instead.

Note that `pmap()` is a generalization of `purrr::map2()`: `map2()` iterates over exactly two inputs, while `pmap()` accepts a list holding any number of them. There is no `map3()` in `purrr`, so `pmap()` is the function to reach for whenever you are iterating over three or more vectors.
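Here is a short sketch of the typed variants on the same three vectors. The arithmetic in the last call is arbitrary and only there to show that a named input list is matched to the function's arguments by name.

```r
library(purrr)

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
vec3 <- c(7, 8, 9)

# pmap_dbl() returns a numeric vector rather than a list
max_values <- pmap_dbl(list(vec1, vec2, vec3), max)     # 7 8 9

# map2_dbl() is the two-input case; .x and .y are the paired elements
pair_sums <- map2_dbl(vec1, vec2, ~ .x + .y)            # 5 7 9

# With a named list, elements are matched to arguments by name
combined <- pmap_dbl(
  list(a = vec1, b = vec2, c = vec3),
  function(a, b, c) a * b - c                           # -3 2 9
)
```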

## Using `purrr` to quickly generate a set of model estimates

Suppose we have the `mtcars` dataset that ships with base R, and we want to fit a linear regression model for each pair of the variables `mpg`, `disp`, and `hp` as predictors, with `wt` as the response variable. We can use `purrr` to loop over the combinations and fit the models.

First, we’ll load the necessary packages and create a tibble with the combinations of variables we want to use:

```
library(tidyverse)
vars <- expand_grid(
  x = c("mpg", "disp", "hp"),
  y = c("mpg", "disp", "hp")
) %>%
  filter(x != y) ## exclude same-variable pairs
vars
```

```
## # A tibble: 6 x 2
##   x     y
##   <chr> <chr>
## 1 mpg   disp
## 2 mpg   hp
## 3 disp  mpg
## 4 disp  hp
## 5 hp    mpg
## 6 hp    disp

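This creates a tibble with all ordered pairs of the variables `mpg`, `disp`, and `hp`, excluding pairs with the same variable. Because order matters here, each pair of predictors appears twice (for example `mpg`/`disp` and `disp`/`mpg`), so the six rows describe three distinct models.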

Next, we’ll define a function that takes two variable names as input, subsets the `mtcars` dataset to those variables, and fits a linear regression model with `wt` as the response variable:

```
fit_model <- function(x, y) {
  mtcars %>%
    ## all_of() selects the columns named by the strings held in x and y
    select(all_of(c(x, y)), wt) %>%
    lm(wt ~ ., data = .)
}
```

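Now, we can use the `pmap()` function from `purrr` to apply this function to each row of the `vars` tibble and return a list of fitted models. Because the columns of `vars` are named `x` and `y`, `pmap()` matches them to the arguments of `fit_model()` by name: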

```
models <- vars %>%
  pmap(fit_model)
models
```

```
## [[1]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) mpg disp
## 3.562739 -0.066315 0.004277
##
##
## [[2]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) mpg hp
## 6.2182863 -0.1455218 -0.0005277
##
##
## [[3]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) disp mpg
## 3.562739 0.004277 -0.066315
##
##
## [[4]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) disp hp
## 1.675818 0.007737 -0.001662
##
##
## [[5]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) hp mpg
## 6.2182863 -0.0005277 -0.1455218
##
##
## [[6]]
##
## Call:
## lm(formula = wt ~ ., data = .)
##
## Coefficients:
## (Intercept) hp disp
## 1.675818 -0.001662 0.007737
```
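A list of fitted models is most useful once you pull numbers out of it. The sketch below is one way to do that with `map_dbl()`. It makes a few assumptions: `fit_model()` is rewritten with base R's `reformulate()` so the block only needs `purrr`, `vars` is cut down to the three distinct predictor pairs, and `model_summary` is a name invented for this example.

```r
library(purrr)

# Assumption: a base-R stand-in for fit_model(); reformulate() builds
# the formula wt ~ x + y from the two variable names
fit_model <- function(x, y) {
  lm(reformulate(c(x, y), response = "wt"), data = mtcars)
}

# The three distinct predictor pairs (a hypothetical subset of the grid)
vars <- data.frame(
  x = c("mpg", "mpg", "disp"),
  y = c("disp", "hp", "hp"),
  stringsAsFactors = FALSE
)

# One fitted model per row of vars
models <- pmap(vars, fit_model)

# Extract a single number from each model
r_squared <- map_dbl(models, ~ summary(.x)$r.squared)

# Keep the results next to the variable names that produced them
model_summary <- data.frame(vars, r_squared = r_squared)
model_summary
```

The same pattern works for any per-model quantity, for example `map_dbl(models, AIC)`.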

## Use `purrr` to clean datasets

Let’s first create an example dataset. Note the trailing space in `"New York "`, which the cleaning step should remove:

```
## Generate fake data
name <- c("John", "Alice", "Bob")
age <- c(32, 28, 35)
city <- c("New York ", "Los Angeles", "San Francisco")
job <- c("Engineer", "Manager", "Scientist")
my_data <- data.frame(name, age, city, job)
```

Now let’s define a list of cleaning functions to apply to this dataset.

```
## Load the required packages
library(dplyr)
library(purrr)
library(stringr)
## Define a list of cleaning functions
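clean_functions <- list(
  ## Convert factor to character (plain if/else keeps the whole column;
  ## ifelse() with a single TRUE/FALSE test would return one value only)
  function(x) if (is.factor(x)) as.character(x) else x,
  ## Remove leading/trailing white space
  str_trim
)
```

Now, using `across()` together with `reduce()`, let’s apply the list of cleaning functions, one after another, to all the text columns in this dataset. The numeric `age` column is left alone.

```
## Apply each cleaning function in turn to every character or factor column
clean_data <- my_data %>%
  mutate(across(
    where(~ is.character(.x) || is.factor(.x)),
    ~ reduce(clean_functions, function(col, f) f(col), .init = .x)
  ))
## Print the datasets
print(my_data)
```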

```
## name age city job
## 1 John 32 New York Engineer
## 2 Alice 28 Los Angeles Manager
## 3 Bob 35 San Francisco Scientist
```

```
print(clean_data)
```

```
## name age city job
## 1 John 32 New York Engineer
## 2 Alice 28 Los Angeles Manager
## 3 Bob 35 San Francisco Scientist
```

Pretty cool, huh?
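If you would rather stay inside `purrr` for the whole job, one alternative is to glue the cleaning steps together with `compose()` and apply the result with `modify_if()`. This is only a sketch under a few assumptions: base R's `trimws()` stands in for `str_trim()` so that no other package is needed, the data frame is built with factor columns on purpose, and `clean_text` is a made-up name.

```r
library(purrr)

# Example data with factor columns and a trailing space in "New York "
my_data <- data.frame(
  name = c("John", "Alice", "Bob"),
  age  = c(32, 28, 35),
  city = c("New York ", "Los Angeles", "San Francisco"),
  stringsAsFactors = TRUE
)

# compose() chains functions; .dir = "forward" runs them left to right
clean_text <- compose(
  function(x) if (is.factor(x)) as.character(x) else x,
  trimws,
  .dir = "forward"
)

# modify_if() changes only the columns that pass the test and
# returns a data frame again
clean_data <- modify_if(
  my_data,
  ~ is.character(.x) || is.factor(.x),
  clean_text
)

clean_data$city
```

The `age` column fails the test, so it stays numeric.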

## Other useful/powerful functions available in `Purrr`

`purrr` provides a wide range of functions for working with iterative operations on lists, data frames, and other data structures. Here are some other useful functions available in the `purrr` package:

`map2()`: applies a function to pairs of elements from two lists or vectors in parallel.

`imap()` and `imap_dfr()`: iterate over a list or vector, apply a function to each element along with its name (or its position, if the input is unnamed), and return the results as a list or data frame, respectively.

`pmap()` and `pmap_dfr()`: iterate over a list of vectors or lists in parallel, apply a function to each set of corresponding elements, and return the results as a list or data frame, respectively.

`reduce()`: combines the elements of a list one pair at a time using a two-argument function, returning a single output.

`walk()`: applies a function to each element of a list for its side effects, discarding the output and invisibly returning the original list.

`transpose()`: transposes a list of lists or a data frame, interchanging rows and columns.

`flatten()`: removes one level of nesting from a list.

`safely()`: wraps a function so that, instead of throwing an error, it returns a list with a `result` and an `error` component.

These functions, among others available in the purrr package, provide efficient and functional alternatives to traditional for loops, making it easier to work with iterative operations on lists and other data structures.
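A few of these in action, as a quick sketch with made-up inputs:

```r
library(purrr)

# map2(): walk over two vectors in parallel
products <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x * .y)   # 10 40 90

# imap(): .x is the element, .y is its name
labelled <- imap_chr(c(a = 1, b = 2), ~ paste0(.y, "=", .x)) # "a=1" "b=2"

# reduce(): fold a vector down to one value
total <- reduce(1:4, `+`)                                    # 10

# walk(): run a function for its side effect (here, a message)
walk(c("first", "second"), ~ message("processing ", .x))

# safely(): capture an error rather than stopping
safe_log <- safely(log)
ok  <- safe_log(10)      # ok$result holds the value, ok$error is NULL
bad <- safe_log("ten")   # bad$result is NULL, bad$error holds the error
```

`safely()` is especially handy inside `map()`, because one bad element no longer aborts the whole iteration.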

## Why choose `purrr` over writing `for` loops in R?

### Cleaner, more readable code:

`purrr` functions can make your code more concise and easier to read, compared to traditional `for` loops which can be verbose and hard to follow, especially for complex operations.

### Functional programming paradigm:

`purrr` is designed around the functional programming paradigm, which allows you to write more modular, reusable, and composable code. This can make it easier to write complex code that is less prone to errors.

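### Performance without the usual pitfalls:

`purrr` functions allocate their output up front, so they avoid a common cause of slow loops in R: growing a vector or list one element at a time. A carefully written `for` loop that pre-allocates its result will usually run at a similar speed, and a truly vectorized operation (such as `x^2`) will be faster than either, so the main gain from `purrr` is that the efficient pattern is the default one.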
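To see what is being compared, here is a sketch of four ways to compute the same square roots. No timings are shown because they depend on your machine and R version; you could wrap each approach in `system.time()` to compare them yourself. The variable names are made up for this example.

```r
library(purrr)

n <- 1000
x <- seq_len(n)

# 1. A loop that grows its result: every c() call copies the vector so far
grown <- c()
for (i in x) {
  grown <- c(grown, sqrt(i))
}

# 2. A loop that pre-allocates its result, as in the first example
prealloc <- numeric(n)
for (i in seq_along(x)) {
  prealloc[i] <- sqrt(x[i])
}

# 3. map_dbl(): the output vector is allocated for you
mapped <- map_dbl(x, sqrt)

# 4. A vectorized call: no explicit iteration in R code at all
vectorized <- sqrt(x)

# All four hold the same values
all.equal(grown, vectorized)
```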

### Better integration with tidyverse:

`purrr` is part of the tidyverse ecosystem of packages, which means it integrates well with other tidyverse packages such as `dplyr`, `tidyr`, and `ggplot2`. This can make it easier to write code that is consistent, modular, and easy to maintain.

Overall, `purrr` offers a more elegant and efficient way of working with lists, vectors, and other data structures in R, especially when you are working with large datasets, complex operations, or when you want to write code that is easy to read and maintain. While `for` loops still have their place in R programming, `purrr` offers a more powerful and flexible alternative that can help you write better code faster.