Looping Through Columns (Without Looping)
One of the most common wrangling problems I have to deal with is looping through columns of a dataframe/tibble. For instance, it is common that I want to create new columns that are a function of previous columns. To demonstrate this, I’ll use the Drought data set from the TidyTuesdayR project. A description of the data is not too important for this post, as I will be strictly covering a mechanical tool. Let’s first load in the data. Note that I will be taking a random 10 rows of the data since it is quite large. There is no need to have the same set of random rows as me- this is simply for presentation purposes.
## loading a couple packages I'll be using for cleaning
library(tidyverse)
library(lubridate)
## link
link <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-20/drought.csv"
## load in the data
## get a random sample of the data and get rid of unnnecessary columns
drought <- read_csv(link) %>%
slice_sample(n = 10) %>%
select(-map_date, -stat_fmt, -ends_with('total'))
drought
## # A tibble: 10 × 6
## state_abb valid_start valid_end drought_lvl area_pct pop_pct
## <chr> <date> <date> <chr> <dbl> <dbl>
## 1 IN 2006-01-03 2006-01-09 None 80.5 74.9
## 2 KY 2018-05-29 2018-06-04 D1 0 0
## 3 CA 2015-04-21 2015-04-27 D4 46.8 54.2
## 4 MN 2020-03-10 2020-03-16 D1 0 0
## 5 IN 2005-08-30 2005-09-05 D4 0 0
## 6 ND 2002-04-02 2002-04-08 None 0 0
## 7 NC 2010-06-22 2010-06-28 D3 0 0
## 8 NJ 2015-10-13 2015-10-19 None 63.7 37.2
## 9 PA 2017-02-21 2017-02-27 D0 14.3 6.6
## 10 ND 2008-05-27 2008-06-02 D4 0 0
Looping with dplyr::across
For the purpose of this post, let’s assume I want to do the following:
Change the
area_pct
andpop_pct
columns to fractions and give them new namesarea_frac
andpop_frac
.Make six new columns: columns named
valid_start_year
,valid_start_month
,valid_start_day
which is the year, month, and day of the columnvalid_start
respectively, and columns namedvalid_end_year
,valid_end_month
,valid_end_day
which is the year, month, and day of the columnvalid_end_year
respectively.
Each of these tasks, while generally tedious and requiring a lot of typing, can be done very quickly with minimal typing using the dplyr::across
function.
Part 1: Changing from percents to fractions.
Here’s how to complete task one. For demonstration purposes, I will display the code first, and explain afterwards.
drought %>%
mutate(across(.cols = c(area_pct, pop_pct), .fns = ~ ./100)) %>%
rename("area_frac" = "area_pct", "pop_frac" = "pop_pct")
## # A tibble: 10 × 6
## state_abb valid_start valid_end drought_lvl area_frac pop_frac
## <chr> <date> <date> <chr> <dbl> <dbl>
## 1 IN 2006-01-03 2006-01-09 None 0.805 0.749
## 2 KY 2018-05-29 2018-06-04 D1 0 0
## 3 CA 2015-04-21 2015-04-27 D4 0.468 0.542
## 4 MN 2020-03-10 2020-03-16 D1 0 0
## 5 IN 2005-08-30 2005-09-05 D4 0 0
## 6 ND 2002-04-02 2002-04-08 None 0 0
## 7 NC 2010-06-22 2010-06-28 D3 0 0
## 8 NJ 2015-10-13 2015-10-19 None 0.637 0.372
## 9 PA 2017-02-21 2017-02-27 D0 0.143 0.066
## 10 ND 2008-05-27 2008-06-02 D4 0 0
Let’s break down the dplyr::across
function. The .cols
argument takes the columns you want to perform operations on, and the .fns
takes the function you want to perform on the columns. In this case, I created a lambda function using the ~
character, and passed in each column to my function using the .
character. Hence, this small lambda function is telling R
to take each of the columns, and divide by 100.
While there is little coding involved in this solution, this can be made even more compact since the .cols
argument can take the tidy selecting functions starts_with
and ends_with
. Watch as I trim down this code a little:
drought %>%
mutate(across(ends_with("pct"), ~ ./100)) %>%
rename("area_frac" = "area_pct", "pop_frac" = "pop_pct")
## # A tibble: 10 × 6
## state_abb valid_start valid_end drought_lvl area_frac pop_frac
## <chr> <date> <date> <chr> <dbl> <dbl>
## 1 IN 2006-01-03 2006-01-09 None 0.805 0.749
## 2 KY 2018-05-29 2018-06-04 D1 0 0
## 3 CA 2015-04-21 2015-04-27 D4 0.468 0.542
## 4 MN 2020-03-10 2020-03-16 D1 0 0
## 5 IN 2005-08-30 2005-09-05 D4 0 0
## 6 ND 2002-04-02 2002-04-08 None 0 0
## 7 NC 2010-06-22 2010-06-28 D3 0 0
## 8 NJ 2015-10-13 2015-10-19 None 0.637 0.372
## 9 PA 2017-02-21 2017-02-27 D0 0.143 0.066
## 10 ND 2008-05-27 2008-06-02 D4 0 0
Here I operated my lambda function on all columns that ends with the phrase “pct”. You can imagine that this can become extremely powerful when scaled up.
Part 2: Getting month and year columns
Now I want to achieve my second task. Here, I will showcase one of the most powerful arguments of dplyr::across
: the .names
argument. As before, let me first code a solution for demonstration:
drought %>%
mutate(across(starts_with("valid"), ~ year(.), .names = "{.col}_year")) %>%
select(-starts_with("area")) ## only doing this so that we can see the output easier
## # A tibble: 10 × 7
## state_abb valid_start valid_end drought_lvl pop_pct valid_start_year valid…¹
## <chr> <date> <date> <chr> <dbl> <dbl> <dbl>
## 1 IN 2006-01-03 2006-01-09 None 74.9 2006 2006
## 2 KY 2018-05-29 2018-06-04 D1 0 2018 2018
## 3 CA 2015-04-21 2015-04-27 D4 54.2 2015 2015
## 4 MN 2020-03-10 2020-03-16 D1 0 2020 2020
## 5 IN 2005-08-30 2005-09-05 D4 0 2005 2005
## 6 ND 2002-04-02 2002-04-08 None 0 2002 2002
## 7 NC 2010-06-22 2010-06-28 D3 0 2010 2010
## 8 NJ 2015-10-13 2015-10-19 None 37.2 2015 2015
## 9 PA 2017-02-21 2017-02-27 D0 6.6 2017 2017
## 10 ND 2008-05-27 2008-06-02 D4 0 2008 2008
## # … with abbreviated variable name ¹valid_end_year
As before, I am using a tidy select function starts_with
to target the columns I want to operate on and perform a lambda function on each. However, you can see that I skipped a renaming step by using the .names
argument. The .names
argument has the special glue
syntax (read more here). In particular, this means we can use {.col}
to input in the original column name.
However, recall that I wanted 6 new columns, which would require me writing three separate mutate
functions (or 3 big arguments passed into mutate
) right? Nope! The across
function’s .fns
argument (which we passed our lambda function to) can actually take a list of functions. Observe:
drought %>%
mutate(across(starts_with("valid"), list(year = ~year(.), month = ~month(.), day = ~day(.)), .names = "{.col}_{.fn}")) %>%
select(starts_with("valid"))
## # A tibble: 10 × 8
## valid_start valid_end valid_start_…¹ valid…² valid…³ valid…⁴ valid…⁵ valid…⁶
## <date> <date> <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 2006-01-03 2006-01-09 2006 1 3 2006 1 9
## 2 2018-05-29 2018-06-04 2018 5 29 2018 6 4
## 3 2015-04-21 2015-04-27 2015 4 21 2015 4 27
## 4 2020-03-10 2020-03-16 2020 3 10 2020 3 16
## 5 2005-08-30 2005-09-05 2005 8 30 2005 9 5
## 6 2002-04-02 2002-04-08 2002 4 2 2002 4 8
## 7 2010-06-22 2010-06-28 2010 6 22 2010 6 28
## 8 2015-10-13 2015-10-19 2015 10 13 2015 10 19
## 9 2017-02-21 2017-02-27 2017 2 21 2017 2 27
## 10 2008-05-27 2008-06-02 2008 5 27 2008 6 2
## # … with abbreviated variable names ¹valid_start_year, ²valid_start_month,
## # ³valid_start_day, ⁴valid_end_year, ⁵valid_end_month, ⁶valid_end_day
Notice what happened here: I passed in a named list of lambda functions to be performed on each of the columns. In effect, I performed the lubridate::year
, lubridate::month
, and lubridate::day
functions on the columns of interest and achieved the goal of creating 6 different columns all with 1 line of code. Additionally, I used the special{.fn}
character in the .names
argument to pass in the name of each of the functions to the new column names. Pretty incredible isn’t it?