Looping Through Columns (Without Looping)

One of the most common wrangling problems I have to deal with is looping through columns of a dataframe/tibble. For instance, it is common that I want to create new columns that are a function of previous columns. To demonstrate this, I’ll use the Drought data set from the TidyTuesdayR project. A description of the data is not too important for this post, as I will be strictly covering a mechanical tool. Let’s first load in the data. Note that I will be taking a random 10 rows of the data since it is quite large. There is no need to have the same set of random rows as me- this is simply for presentation purposes.

## loading a couple packages I'll be using for cleaning
library(tidyverse)
library(lubridate)

## link
link <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-20/drought.csv"

## load in the data
## get a random sample of the data and get rid of unnnecessary columns
drought <- read_csv(link) %>% 
  slice_sample(n = 10) %>% 
  select(-map_date, -stat_fmt, -ends_with('total')) 

drought
## # A tibble: 10 × 6
##    state_abb valid_start valid_end  drought_lvl area_pct pop_pct
##    <chr>     <date>      <date>     <chr>          <dbl>   <dbl>
##  1 IN        2006-01-03  2006-01-09 None            80.5    74.9
##  2 KY        2018-05-29  2018-06-04 D1               0       0  
##  3 CA        2015-04-21  2015-04-27 D4              46.8    54.2
##  4 MN        2020-03-10  2020-03-16 D1               0       0  
##  5 IN        2005-08-30  2005-09-05 D4               0       0  
##  6 ND        2002-04-02  2002-04-08 None             0       0  
##  7 NC        2010-06-22  2010-06-28 D3               0       0  
##  8 NJ        2015-10-13  2015-10-19 None            63.7    37.2
##  9 PA        2017-02-21  2017-02-27 D0              14.3     6.6
## 10 ND        2008-05-27  2008-06-02 D4               0       0

Looping with dplyr::across

For the purpose of this post, let’s assume I want to do the following:

  1. Change the area_pct and pop_pct columns to fractions and give them new names area_frac and pop_frac.

  2. Make six new columns: columns named valid_start_year, valid_start_month, valid_start_day which is the year, month, and day of the column valid_start respectively, and columns named valid_end_year, valid_end_month, valid_end_day which is the year, month, and day of the column valid_end_year respectively.

Each of these tasks, while generally tedious and requiring a lot of typing, can be done very quickly with minimal typing using the dplyr::across function.

Part 1: Changing from percents to fractions.

Here’s how to complete task one. For demonstration purposes, I will display the code first, and explain afterwards.

drought %>% 
  mutate(across(.cols = c(area_pct, pop_pct), .fns = ~ ./100)) %>% 
  rename("area_frac" = "area_pct", "pop_frac" = "pop_pct") 
## # A tibble: 10 × 6
##    state_abb valid_start valid_end  drought_lvl area_frac pop_frac
##    <chr>     <date>      <date>     <chr>           <dbl>    <dbl>
##  1 IN        2006-01-03  2006-01-09 None            0.805    0.749
##  2 KY        2018-05-29  2018-06-04 D1              0        0    
##  3 CA        2015-04-21  2015-04-27 D4              0.468    0.542
##  4 MN        2020-03-10  2020-03-16 D1              0        0    
##  5 IN        2005-08-30  2005-09-05 D4              0        0    
##  6 ND        2002-04-02  2002-04-08 None            0        0    
##  7 NC        2010-06-22  2010-06-28 D3              0        0    
##  8 NJ        2015-10-13  2015-10-19 None            0.637    0.372
##  9 PA        2017-02-21  2017-02-27 D0              0.143    0.066
## 10 ND        2008-05-27  2008-06-02 D4              0        0

Let’s break down the dplyr::across function. The .cols argument takes the columns you want to perform operations on, and the .fns takes the function you want to perform on the columns. In this case, I created a lambda function using the ~ character, and passed in each column to my function using the . character. Hence, this small lambda function is telling R to take each of the columns, and divide by 100.

While there is little coding involved in this solution, this can be made even more compact since the .cols argument can take the tidy selecting functions starts_with and ends_with. Watch as I trim down this code a little:

drought %>% 
  mutate(across(ends_with("pct"), ~ ./100)) %>% 
  rename("area_frac" = "area_pct", "pop_frac" = "pop_pct") 
## # A tibble: 10 × 6
##    state_abb valid_start valid_end  drought_lvl area_frac pop_frac
##    <chr>     <date>      <date>     <chr>           <dbl>    <dbl>
##  1 IN        2006-01-03  2006-01-09 None            0.805    0.749
##  2 KY        2018-05-29  2018-06-04 D1              0        0    
##  3 CA        2015-04-21  2015-04-27 D4              0.468    0.542
##  4 MN        2020-03-10  2020-03-16 D1              0        0    
##  5 IN        2005-08-30  2005-09-05 D4              0        0    
##  6 ND        2002-04-02  2002-04-08 None            0        0    
##  7 NC        2010-06-22  2010-06-28 D3              0        0    
##  8 NJ        2015-10-13  2015-10-19 None            0.637    0.372
##  9 PA        2017-02-21  2017-02-27 D0              0.143    0.066
## 10 ND        2008-05-27  2008-06-02 D4              0        0

Here I operated my lambda function on all columns that ends with the phrase “pct”. You can imagine that this can become extremely powerful when scaled up.

Part 2: Getting month and year columns

Now I want to achieve my second task. Here, I will showcase one of the most powerful arguments of dplyr::across: the .names argument. As before, let me first code a solution for demonstration:

drought %>% 
  mutate(across(starts_with("valid"), ~ year(.), .names = "{.col}_year")) %>% 
  select(-starts_with("area")) ## only doing this so that we can see the output easier
## # A tibble: 10 × 7
##    state_abb valid_start valid_end  drought_lvl pop_pct valid_start_year valid…¹
##    <chr>     <date>      <date>     <chr>         <dbl>            <dbl>   <dbl>
##  1 IN        2006-01-03  2006-01-09 None           74.9             2006    2006
##  2 KY        2018-05-29  2018-06-04 D1              0               2018    2018
##  3 CA        2015-04-21  2015-04-27 D4             54.2             2015    2015
##  4 MN        2020-03-10  2020-03-16 D1              0               2020    2020
##  5 IN        2005-08-30  2005-09-05 D4              0               2005    2005
##  6 ND        2002-04-02  2002-04-08 None            0               2002    2002
##  7 NC        2010-06-22  2010-06-28 D3              0               2010    2010
##  8 NJ        2015-10-13  2015-10-19 None           37.2             2015    2015
##  9 PA        2017-02-21  2017-02-27 D0              6.6             2017    2017
## 10 ND        2008-05-27  2008-06-02 D4              0               2008    2008
## # … with abbreviated variable name ¹​valid_end_year

As before, I am using a tidy select function starts_with to target the columns I want to operate on and perform a lambda function on each. However, you can see that I skipped a renaming step by using the .names argument. The .names argument has the special glue syntax (read more here). In particular, this means we can use {.col} to input in the original column name.

However, recall that I wanted 6 new columns, which would require me writing three separate mutate functions (or 3 big arguments passed into mutate) right? Nope! The across function’s .fns argument (which we passed our lambda function to) can actually take a list of functions. Observe:

drought %>% 
  mutate(across(starts_with("valid"), list(year = ~year(.), month = ~month(.), day = ~day(.)), .names = "{.col}_{.fn}")) %>% 
  select(starts_with("valid"))
## # A tibble: 10 × 8
##    valid_start valid_end  valid_start_…¹ valid…² valid…³ valid…⁴ valid…⁵ valid…⁶
##    <date>      <date>              <dbl>   <dbl>   <int>   <dbl>   <dbl>   <int>
##  1 2006-01-03  2006-01-09           2006       1       3    2006       1       9
##  2 2018-05-29  2018-06-04           2018       5      29    2018       6       4
##  3 2015-04-21  2015-04-27           2015       4      21    2015       4      27
##  4 2020-03-10  2020-03-16           2020       3      10    2020       3      16
##  5 2005-08-30  2005-09-05           2005       8      30    2005       9       5
##  6 2002-04-02  2002-04-08           2002       4       2    2002       4       8
##  7 2010-06-22  2010-06-28           2010       6      22    2010       6      28
##  8 2015-10-13  2015-10-19           2015      10      13    2015      10      19
##  9 2017-02-21  2017-02-27           2017       2      21    2017       2      27
## 10 2008-05-27  2008-06-02           2008       5      27    2008       6       2
## # … with abbreviated variable names ¹​valid_start_year, ²​valid_start_month,
## #   ³​valid_start_day, ⁴​valid_end_year, ⁵​valid_end_month, ⁶​valid_end_day

Notice what happened here: I passed in a named list of lambda functions to be performed on each of the columns. In effect, I performed the lubridate::year, lubridate::month, and lubridate::day functions on the columns of interest and achieved the goal of creating 6 different columns all with 1 line of code. Additionally, I used the special{.fn} character in the .names argument to pass in the name of each of the functions to the new column names. Pretty incredible isn’t it?

Michael Topper
Michael Topper
PhD Candidate in Economics

Economics Student UCSB

Related