6  Data Handling

One of the fundamental building blocks of the tidyverse is the dplyr package. The dplyr package is now the primary package for data manipulation in R. Each of the functions that we look at in this chapter take a data frame as the first input argument, process it in a very simple way and return a data frame as the output. The dplyr package is loaded by default when we load the tidyverse package but we can also load it independently.

library(dplyr)

# Alternatively
# library(tidyverse)

In accordance with the tidyverse spirit of functions doing one thing and doing it well, there are separate functions for subsetting rows and for subsetting columns.

6.1 Subsetting Rows: filter

The filter function allows us to apply a logical subset to the rows of our data. As with all the dplyr functions that we will use in this course, the filter function takes a data frame as the first argument and a logical statement as the second argument. Typically, we base the logical statement on columns within our data frame.

library(ggplot2)
dm <- haven::read_sas("data/dm.sas7bdat")
act <- haven::read_sas("data/act.sas7bdat")
vs <- haven::read_sas("data/vs.sas7bdat")
# Find subjects taking GSK drug.  Note the double-equals
gsk_subj <- filter(dm, ARM == "GSK")
gsk_subj
# A tibble: 12 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 2 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
 3 STD123456:000011    66 F     USA     WHITE                     NOT HIS… GSK  
 4 STD123456:000012    26 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… GSK  
 5 STD123456:000013    51 F     IRE     WHITE                     HISPANI… GSK  
 6 STD123456:000015    59 F     FRA     WHITE                     HISPANI… GSK  
 7 STD123456:000017    28 F     FRA     WHITE                     NOT HIS… GSK  
 8 STD123456:000019    48 F     USA     WHITE                     NOT HIS… GSK  
 9 STD123456:000022    46 M     GER     WHITE                     NOT HIS… GSK  
10 STD123456:000025    48 F     USA     WHITE                     NOT HIS… GSK  
11 STD123456:000026    71 F     FRA     WHITE                     NOT HIS… GSK  
12 STD123456:000028    67 M     IRE     WHITE                     HISPANI… GSK  

6.1.1 Logic

In the example above we used a double equals in order to ask the question, “is the ARM variable equal to ‘GSK’ (for each record)?” In R a single equals assigns the value on the right to the parameter on the left. We must therefore always use a double equals when asking the logical question, is x equal to y!!!

The table below lists some essential logical operations.

Code Meaning
== Equal to
!= Not equal to
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
& And - for joining logical statements
| Or - for joining locical statements
! Not - for switching TRUE and FALSE statements around
%in% One of - Can any of the values on the LHS of the %in% be found on the RHS?
any Summarises multiple logical statements into a single value - Are ANY of these values equal to TRUE?
all Summarises multiple logical statements into a single value - Are ALL of these values equal to TRUE?

6.1.2 Applying Multiple filter Operations

The filter function let’s us chain multiple logical questions together using commas. The commas are equivalent to using an “and” operation.

# Find subjects taking GSK drug in USA
gsk_subj_USA <- filter(dm, ARM == "GSK", COUNTRY == "USA")
gsk_subj_USA
# A tibble: 4 × 7
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC                 ARM  
  <chr>            <dbl> <chr> <chr>   <chr> <chr>                  <chr>
1 STD123456:000009    60 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
2 STD123456:000011    66 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
3 STD123456:000019    48 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
4 STD123456:000025    48 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
# Or equivalently
gsk_subj_USA <- filter(dm, ARM == "GSK" & COUNTRY == "USA")

The following example uses an “or” link.

# Find females over 50 in France or Germany
filter(dm, SEX=="F", AGE > 50, COUNTRY == "FRA" | COUNTRY == "GER")
# A tibble: 2 × 7
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC                 ARM  
  <chr>            <dbl> <chr> <chr>   <chr> <chr>                  <chr>
1 STD123456:000015    59 F     FRA     WHITE HISPANIC OR LATINO     GSK  
2 STD123456:000026    71 F     FRA     WHITE NOT HISPANIC OR LATINO GSK  

Finally, here is an example using the %in% operator. Although it looks a little ugly, this function is the same as an “in” in SQL or SAS. In other words “are any of the values on the left hand side contained within the values on the right hand side”?

# Find females over 50 in France or Germany
filter(dm, SEX=="F", AGE > 50, COUNTRY %in% c("FRA", "GER"))
# A tibble: 2 × 7
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC                 ARM  
  <chr>            <dbl> <chr> <chr>   <chr> <chr>                  <chr>
1 STD123456:000015    59 F     FRA     WHITE HISPANIC OR LATINO     GSK  
2 STD123456:000026    71 F     FRA     WHITE NOT HISPANIC OR LATINO GSK  

6.1.3 Any and All

The any and all functions condense any number of logical values into a single TRUE or FALSE value. At this stage of the course they are not much use but we will look at them a little closer later on.

# Some logical values
some_logic <- c(T, T, F)
some_logic
[1]  TRUE  TRUE FALSE
# Are any of these values TRUE?
any(some_logic)
[1] TRUE
# Excellent.  Are they all TRUE?
all(some_logic)
[1] FALSE

6.2 Extract Rows by Number: slice

The slice function lets us choose records by row number. It can be useful for looking at the first or last few records in a data frame. In the second example below the utility function, n, is used to pick out the last few rows. This utility is not a generic utility and can only be used within a handful of dplyr functions.

# First 3 rows of dm
slice(dm, 1:3)
# A tibble: 3 × 7
  USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC    ARM  
  <chr>            <dbl> <chr> <chr>   <chr>                     <chr>     <chr>
1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HISP… Comp…
2 STD123456:000002    28 M     FRA     WHITE                     NOT HISP… Comp…
3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HISP… Comp…
# Last 3 rows of dm (using the utility function, n())
slice(dm, (n() - 2):n())
# A tibble: 3 × 7
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC                 ARM       
  <chr>            <dbl> <chr> <chr>   <chr> <chr>                  <chr>     
1 STD123456:000028    67 M     IRE     WHITE HISPANIC OR LATINO     GSK       
2 STD123456:000029    68 F     UK      WHITE NOT HISPANIC OR LATINO Comparator
3 STD123456:000030    71 M     USA     WHITE NOT HISPANIC OR LATINO Comparator

6.3 Removing Duplicates: distinct

We can often find ourselves in a situation where we end up with duplicate or partial duplicates, eg following a merge. The distinct function is a useful utility that can help us to identify unique values and/or remove duplicates.

# Duplicate records
dup_data <- tibble(USUBJID = rep(c("STD123456:000001", "STD123456:000002"), c(3, 5)),
                   AGE = rep(c(32, 28), c(3, 5)))
dup_data
# A tibble: 8 × 2
  USUBJID            AGE
  <chr>            <dbl>
1 STD123456:000001    32
2 STD123456:000001    32
3 STD123456:000001    32
4 STD123456:000002    28
5 STD123456:000002    28
6 STD123456:000002    28
7 STD123456:000002    28
8 STD123456:000002    28
# Remove duplicate records
distinct(dup_data)
# A tibble: 2 × 2
  USUBJID            AGE
  <chr>            <dbl>
1 STD123456:000001    32
2 STD123456:000002    28

We can also use distinct to find all the unique combinations of specific variables. For example the unique combinations of ‘COUNTRY’ and ‘ARM’. By default, the function drops all of the other variables that we’re not interested. But we can also choose to keep them. In the example below, this retains the first record for each unique combination of ‘COUNTRY’ and ‘ARM’, which is not particularly useful. But it could be used to identify baseline records for each unique subject.

# Find unique combinations
distinct(dm, COUNTRY, ARM)
# A tibble: 9 × 2
  COUNTRY ARM       
  <chr>   <chr>     
1 UK      Comparator
2 FRA     Comparator
3 USA     Comparator
4 GER     Comparator
5 IRE     Comparator
6 GER     GSK       
7 USA     GSK       
8 FRA     GSK       
9 IRE     GSK       
# Find unique combinations and keep the first instance of each
distinct(dm, COUNTRY, ARM, .keep_all = TRUE)
# A tibble: 9 × 7
  USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC    ARM  
  <chr>            <dbl> <chr> <chr>   <chr>                     <chr>     <chr>
1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HISP… Comp…
2 STD123456:000002    28 M     FRA     WHITE                     NOT HISP… Comp…
3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HISP… Comp…
4 STD123456:000004    35 F     GER     WHITE                     HISPANIC… Comp…
5 STD123456:000005    30 F     IRE     WHITE                     NOT HISP… Comp…
6 STD123456:000008    53 M     GER     WHITE                     NOT HISP… GSK  
7 STD123456:000009    60 F     USA     WHITE                     NOT HISP… GSK  
8 STD123456:000012    26 M     FRA     BLACK OR AFRICAN AMERICAN NOT HISP… GSK  
9 STD123456:000013    51 F     IRE     WHITE                     HISPANIC… GSK  

6.4 Subsetting Columns: select

For column-wise operations we use the select function. Much like when writing SQL, we simply provide the function a data frame to select columns from and then a bunch of columns that we wish to select (or “keep”).

# Our GSK, USA Subjects from earlier
gsk_subj_USA
# A tibble: 4 × 7
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC                 ARM  
  <chr>            <dbl> <chr> <chr>   <chr> <chr>                  <chr>
1 STD123456:000009    60 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
2 STD123456:000011    66 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
3 STD123456:000019    48 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
4 STD123456:000025    48 F     USA     WHITE NOT HISPANIC OR LATINO GSK  
# Now that we have GSK subjects in USA we don't need ARM or COUNTRY
select(gsk_subj_USA, USUBJID, AGE, SEX)
# A tibble: 4 × 3
  USUBJID            AGE SEX  
  <chr>            <dbl> <chr>
1 STD123456:000009    60 F    
2 STD123456:000011    66 F    
3 STD123456:000019    48 F    
4 STD123456:000025    48 F    

We can also use the select function to drop columns by putting a minus sign in front of any columns that we provide.

# Now that we have GSK subjects in USA we don't need ARM or COUNTRY
select(gsk_subj_USA, -ARM, -COUNTRY)
# A tibble: 4 × 5
  USUBJID            AGE SEX   RACE  ETHNIC                
  <chr>            <dbl> <chr> <chr> <chr>                 
1 STD123456:000009    60 F     WHITE NOT HISPANIC OR LATINO
2 STD123456:000011    66 F     WHITE NOT HISPANIC OR LATINO
3 STD123456:000019    48 F     WHITE NOT HISPANIC OR LATINO
4 STD123456:000025    48 F     WHITE NOT HISPANIC OR LATINO

Obviously we cannot mix and match dropping and keeping variables.

6.4.1 Utility Functions in Select

The select function comes with a number of built in utility functions (the tidy-select functions, ?dplyr_tidy_select). The functions are there to enable us to select columns more quickly, for example all of the columns that contain a particular word. These functions can only be called from within select and a handful of other tidyverse functions. They cannot be called directly.

  • starts_with
  • ends_with
  • contains
  • matches
  • one_of
  • everything

To use these utility functions we simply call them in place of a named column(s). For example, here we select all columns that begin with the letters “ACT”:

select(act, starts_with("ACT"))
# A tibble: 165 × 5
   ACT0101 ACT0102 ACT0103 ACT0104 ACT0105
     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1       5       2       3       3       3
 2       5       2       3       3       3
 3       3       4       3       2       1
 4       3       3       4       3       2
 5       3       5       1       2       3
 6       3       5       3       3       3
 7       3       3       3       3       3
 8       3       3       3       3       3
 9       3       3       3       4       4
10       4       4       4       4       4
# ℹ 155 more rows

We can also use a : to select all columns between the column on the left of the : and the column on the right of the :.

select(dm, AGE:COUNTRY)
# A tibble: 30 × 3
     AGE SEX   COUNTRY
   <dbl> <chr> <chr>  
 1    32 F     UK     
 2    28 M     FRA    
 3    55 M     USA    
 4    35 F     GER    
 5    30 F     IRE    
 6    22 F     GER    
 7    59 F     USA    
 8    53 M     GER    
 9    60 F     USA    
10    48 M     USA    
# ℹ 20 more rows

It is again important to note that although : can be used outside of the select function, the way we have used it to specify columns is unique to this use case. In other words we can’t write “A”:“Z” or similar.

6.5 Column Order

As well as subsetting columns, we can use select to move columns around (e.g. after a merge). Here we use the everything helper to move the ‘ARM’ column to the left and then keep ‘everything else’ as it was.

select(dm, ARM, everything())
# A tibble: 30 × 7
   ARM        USUBJID            AGE SEX   COUNTRY RACE                   ETHNIC
   <chr>      <chr>            <dbl> <chr> <chr>   <chr>                  <chr> 
 1 Comparator STD123456:000001    32 F     UK      BLACK OR AFRICAN AMER… NOT H…
 2 Comparator STD123456:000002    28 M     FRA     WHITE                  NOT H…
 3 Comparator STD123456:000003    55 M     USA     BLACK OR AFRICAN AMER… NOT H…
 4 Comparator STD123456:000004    35 F     GER     WHITE                  HISPA…
 5 Comparator STD123456:000005    30 F     IRE     WHITE                  NOT H…
 6 Comparator STD123456:000006    22 F     GER     WHITE                  NOT H…
 7 Comparator STD123456:000007    59 F     USA     WHITE                  NOT H…
 8 GSK        STD123456:000008    53 M     GER     WHITE                  NOT H…
 9 GSK        STD123456:000009    60 F     USA     WHITE                  NOT H…
10 Comparator STD123456:000010    48 M     USA     WHITE                  NOT H…
# ℹ 20 more rows

The relocate function is another relatively recent addition to dplyr that provides finer control when moving columns. Arguments .before and .after let us specify exactly where one or more columns are to be located.

relocate(dm, ARM, .after = USUBJID)
# A tibble: 30 × 7
   USUBJID          ARM          AGE SEX   COUNTRY RACE                   ETHNIC
   <chr>            <chr>      <dbl> <chr> <chr>   <chr>                  <chr> 
 1 STD123456:000001 Comparator    32 F     UK      BLACK OR AFRICAN AMER… NOT H…
 2 STD123456:000002 Comparator    28 M     FRA     WHITE                  NOT H…
 3 STD123456:000003 Comparator    55 M     USA     BLACK OR AFRICAN AMER… NOT H…
 4 STD123456:000004 Comparator    35 F     GER     WHITE                  HISPA…
 5 STD123456:000005 Comparator    30 F     IRE     WHITE                  NOT H…
 6 STD123456:000006 Comparator    22 F     GER     WHITE                  NOT H…
 7 STD123456:000007 Comparator    59 F     USA     WHITE                  NOT H…
 8 STD123456:000008 GSK           53 M     GER     WHITE                  NOT H…
 9 STD123456:000009 GSK           60 F     USA     WHITE                  NOT H…
10 STD123456:000010 Comparator    48 M     USA     WHITE                  NOT H…
# ℹ 20 more rows
# Alternatively
#relocate(dm, AGE:COUNTRY, .after = ARM)

6.6 Sorting

In general, tidyverse functions are named well and hence easy to remember. Unfortunately, sort and order already exist as R functions and so in the tidyverse we work with arrange! As with most of the tidyverse functions it is very easy to use. In this first example we use the arrange function to sort the demography data by age.

# Sort young to old
arrange(dm, AGE)
# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 2 STD123456:000012    26 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… GSK  
 3 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 4 STD123456:000017    28 F     FRA     WHITE                     NOT HIS… GSK  
 5 STD123456:000023    28 F     IRE     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 6 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 7 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 8 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 9 STD123456:000022    46 M     GER     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

For a descending sort we call the utility function, desc, around the variable that we’re sorting by.

# Sort old to young
arrange(dm, desc(AGE))
# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000026    71 F     FRA     WHITE                     NOT HIS… GSK  
 2 STD123456:000030    71 M     USA     WHITE                     NOT HIS… Comp…
 3 STD123456:000018    69 F     IRE     WHITE                     NOT HIS… Comp…
 4 STD123456:000029    68 F     UK      WHITE                     NOT HIS… Comp…
 5 STD123456:000027    67 F     IRE     ASIAN                     NOT HIS… Comp…
 6 STD123456:000028    67 M     IRE     WHITE                     HISPANI… GSK  
 7 STD123456:000011    66 F     USA     WHITE                     NOT HIS… GSK  
 8 STD123456:000020    64 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

We can sort by as many variables as we have in our data. We can also sort by categorical variables, which are ordered alphabetically in the sort.

# Sort by multiple variables
arrange(dm, desc(ARM), COUNTRY, desc(AGE))
# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000026    71 F     FRA     WHITE                     NOT HIS… GSK  
 2 STD123456:000015    59 F     FRA     WHITE                     HISPANI… GSK  
 3 STD123456:000017    28 F     FRA     WHITE                     NOT HIS… GSK  
 4 STD123456:000012    26 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… GSK  
 5 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 6 STD123456:000022    46 M     GER     WHITE                     NOT HIS… GSK  
 7 STD123456:000028    67 M     IRE     WHITE                     HISPANI… GSK  
 8 STD123456:000013    51 F     IRE     WHITE                     HISPANI… GSK  
 9 STD123456:000011    66 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
# ℹ 20 more rows

Later we will see how we can manipulate the sort order for categorical variables using factors.

6.7 EXERCISE

  1. Using the act data, extract all data for subject 5
  2. Extract all screening and randomisation visits from the act data
  3. Select the USUBJID, VISITNUM and VISIT columns
  4. Remove the date (QSDTC) column
  5. Sort the data by USUBJID and descending VISITNUM

6.8 Creating New Columns: mutate

The mutate function allows us to create new columns within our data. We typically do so by manipulating other columns in the data but we can define our columns in any way we like.

# Create a new column that is the square of the height column
mutate(vs, Height_sq = HEIGHT ^2)
# A tibble: 30 × 4
   USUBJID          HEIGHT WEIGHT Height_sq
   <chr>             <dbl>  <dbl>     <dbl>
 1 STD123456:000001    165   88       27225
 2 STD123456:000002    179  100       32041
 3 STD123456:000003    182   81       33124
 4 STD123456:000004    166   91       27556
 5 STD123456:000005    169   73.6     28561
 6 STD123456:000006    164   66       26896
 7 STD123456:000007    157  107       24649
 8 STD123456:000008    189  101       35721
 9 STD123456:000009    155   54.1     24025
10 STD123456:000010    171   88       29241
# ℹ 20 more rows
# Create a new flag variable with the value 1
mutate(vs, flag = 1)
# A tibble: 30 × 4
   USUBJID          HEIGHT WEIGHT  flag
   <chr>             <dbl>  <dbl> <dbl>
 1 STD123456:000001    165   88       1
 2 STD123456:000002    179  100       1
 3 STD123456:000003    182   81       1
 4 STD123456:000004    166   91       1
 5 STD123456:000005    169   73.6     1
 6 STD123456:000006    164   66       1
 7 STD123456:000007    157  107       1
 8 STD123456:000008    189  101       1
 9 STD123456:000009    155   54.1     1
10 STD123456:000010    171   88       1
# ℹ 20 more rows

One nice property of the mutate function is that it allows us to immediately use columns that we create, much like SAS.

mutate(vs, 
       Height_m2 = (HEIGHT/100) ^2, 
       BMI = WEIGHT / Height_m2)
# A tibble: 30 × 5
   USUBJID          HEIGHT WEIGHT Height_m2   BMI
   <chr>             <dbl>  <dbl>     <dbl> <dbl>
 1 STD123456:000001    165   88        2.72  32.3
 2 STD123456:000002    179  100        3.20  31.2
 3 STD123456:000003    182   81        3.31  24.5
 4 STD123456:000004    166   91        2.76  33.0
 5 STD123456:000005    169   73.6      2.86  25.8
 6 STD123456:000006    164   66        2.69  24.5
 7 STD123456:000007    157  107        2.46  43.4
 8 STD123456:000008    189  101        3.57  28.3
 9 STD123456:000009    155   54.1      2.40  22.5
10 STD123456:000010    171   88        2.92  30.1
# ℹ 20 more rows

In the above example we have used some very simple numerical transformations. In the following sections we will look at some other useful functions for working with data. Each of the functions that we see are designed to work with vectors (columns of data). We can therefore call them all via the mutate function.

6.9 Renaming Columns

In the example below we rename the “COUNTRY” column, “LOCATION”. As we shall see later when creating new columns, the new name of the new column is placed to the left of the equals and the old name on the right.

# Rename the COUNTRY column to LOCATION - New name on the left!!!
rename(dm, LOCATION = COUNTRY)
# A tibble: 30 × 7
   USUBJID            AGE SEX   LOCATION RACE                      ETHNIC  ARM  
   <chr>            <dbl> <chr> <chr>    <chr>                     <chr>   <chr>
 1 STD123456:000001    32 F     UK       BLACK OR AFRICAN AMERICAN NOT HI… Comp…
 2 STD123456:000002    28 M     FRA      WHITE                     NOT HI… Comp…
 3 STD123456:000003    55 M     USA      BLACK OR AFRICAN AMERICAN NOT HI… Comp…
 4 STD123456:000004    35 F     GER      WHITE                     HISPAN… Comp…
 5 STD123456:000005    30 F     IRE      WHITE                     NOT HI… Comp…
 6 STD123456:000006    22 F     GER      WHITE                     NOT HI… Comp…
 7 STD123456:000007    59 F     USA      WHITE                     NOT HI… Comp…
 8 STD123456:000008    53 M     GER      WHITE                     NOT HI… GSK  
 9 STD123456:000009    60 F     USA      WHITE                     NOT HI… GSK  
10 STD123456:000010    48 M     USA      WHITE                     NOT HI… Comp…
# ℹ 20 more rows

Further columns can be renamed by using a comma to separate each pair of new and old names.

# Rename two columns - New name on the left!!!
rename(dm, LOCATION = COUNTRY, TREATMENT = ARM)
# A tibble: 30 × 7
   USUBJID            AGE SEX   LOCATION RACE                   ETHNIC TREATMENT
   <chr>            <dbl> <chr> <chr>    <chr>                  <chr>  <chr>    
 1 STD123456:000001    32 F     UK       BLACK OR AFRICAN AMER… NOT H… Comparat…
 2 STD123456:000002    28 M     FRA      WHITE                  NOT H… Comparat…
 3 STD123456:000003    55 M     USA      BLACK OR AFRICAN AMER… NOT H… Comparat…
 4 STD123456:000004    35 F     GER      WHITE                  HISPA… Comparat…
 5 STD123456:000005    30 F     IRE      WHITE                  NOT H… Comparat…
 6 STD123456:000006    22 F     GER      WHITE                  NOT H… Comparat…
 7 STD123456:000007    59 F     USA      WHITE                  NOT H… Comparat…
 8 STD123456:000008    53 M     GER      WHITE                  NOT H… GSK      
 9 STD123456:000009    60 F     USA      WHITE                  NOT H… GSK      
10 STD123456:000010    48 M     USA      WHITE                  NOT H… Comparat…
# ℹ 20 more rows

6.10 Creating Categories: cut

The cut function allows us to create a discrete variable by partitioning a continuous one. The first argument to the function is the continuous vector/variable that we wish to partition. We must then either specify a number of cut points (and let R choose the cuts for us) or provide a vector of specific cut points (including minimum and maximum values). For example, here we divide the AGE variable into two groups, 18-50 and 50+.

# Extract the AGE column as a vector (not required if using mutate!)
AGE <- pull(dm, AGE)
# Partition into two groups
cut(AGE, c(18, 50, Inf))
 [1] (18,50]  (18,50]  (50,Inf] (18,50]  (18,50]  (18,50]  (50,Inf] (50,Inf]
 [9] (50,Inf] (18,50]  (50,Inf] (18,50]  (50,Inf] (18,50]  (50,Inf] (50,Inf]
[17] (18,50]  (50,Inf] (18,50]  (50,Inf] (50,Inf] (18,50]  (18,50]  (18,50] 
[25] (18,50]  (50,Inf] (50,Inf] (50,Inf] (50,Inf] (50,Inf]
Levels: (18,50] (50,Inf]

By default, any data points lying on the boundary are included in the lower grouping. This is controlled by an argument to the cut function, right. If we want to include boundary values to be included in the higher grouping we set right = FALSE.

# Include 50 years olds in the older group
age_cat <- cut(AGE, c(18, 50, Inf), right = FALSE)
age_cat
 [1] [18,50)  [18,50)  [50,Inf) [18,50)  [18,50)  [18,50)  [50,Inf) [50,Inf)
 [9] [50,Inf) [18,50)  [50,Inf) [18,50)  [50,Inf) [18,50)  [50,Inf) [50,Inf)
[17] [18,50)  [50,Inf) [18,50)  [50,Inf) [50,Inf) [18,50)  [18,50)  [50,Inf)
[25] [18,50)  [50,Inf) [50,Inf) [50,Inf) [50,Inf) [50,Inf)
Levels: [18,50) [50,Inf)

We can add nicer labels to our cut by using the labels argument.

around50 <- filter(dm, 48 <= AGE, AGE <= 52)
mutate(around50, 
       AgeCat1 = cut(AGE, c(18, 50, Inf), labels = c("<=50", ">50")),
       AgeCat2 = cut(AGE, c(18, 50, Inf), labels = c("<50", ">=50"), 
                     right = FALSE)) 
# A tibble: 6 × 9
  USUBJID            AGE SEX   COUNTRY RACE  ETHNIC        ARM   AgeCat1 AgeCat2
  <chr>            <dbl> <chr> <chr>   <chr> <chr>         <chr> <fct>   <fct>  
1 STD123456:000010    48 M     USA     WHITE NOT HISPANIC… Comp… <=50    <50    
2 STD123456:000013    51 F     IRE     WHITE HISPANIC OR … GSK   >50     >=50   
3 STD123456:000014    48 F     USA     WHITE HISPANIC OR … Comp… <=50    <50    
4 STD123456:000019    48 F     USA     WHITE NOT HISPANIC… GSK   <=50    <50    
5 STD123456:000024    50 F     FRA     WHITE NOT HISPANIC… Comp… <=50    >=50   
6 STD123456:000025    48 F     USA     WHITE NOT HISPANIC… GSK   <=50    <50    

6.11 Factors and the forcats Package

The cut function turns a continuous variable into a categorical variable by creating a “factor”. Factors are essentially a code-decode paring where the code is an integer value from 1 to the number of levels, and the decode is a format that controls how these decodes are displayed. We can turn any variable into a factor by using the factor function. Factor variables are displayed as <fctr> when printing a data frame.

# Extract the SEX column and convert to a factor
SEX <- pull(dm, SEX)
SEX <- factor(SEX)
SEX
 [1] F M M F F F F M F M F M F F F F F F F M M M F F F F F M F M
Levels: F M

6.11.1 The forcats Package

Creating a factors is easy, but the real benefit comes from being able to manipulate the factor “levels” in order to either control the sort order or to combine levels. All of the required functionality is contained within the forcats package. We start by loading the package, which is installed with the tidyverse but which must always be loaded separately.

# Relabel F and M
library(forcats)
# What is in the package?
objects("package:forcats")
 [1] "%>%"                   "as_factor"             "fct"                  
 [4] "fct_anon"              "fct_c"                 "fct_collapse"         
 [7] "fct_count"             "fct_cross"             "fct_drop"             
[10] "fct_expand"            "fct_explicit_na"       "fct_infreq"           
[13] "fct_inorder"           "fct_inseq"             "fct_lump"             
[16] "fct_lump_lowfreq"      "fct_lump_min"          "fct_lump_n"           
[19] "fct_lump_prop"         "fct_match"             "fct_na_level_to_value"
[22] "fct_na_value_to_level" "fct_other"             "fct_recode"           
[25] "fct_relabel"           "fct_relevel"           "fct_reorder"          
[28] "fct_reorder2"          "fct_rev"               "fct_shift"            
[31] "fct_shuffle"           "fct_unify"             "fct_unique"           
[34] "first2"                "gss_cat"               "last2"                
[37] "lvls_expand"           "lvls_reorder"          "lvls_revalue"         
[40] "lvls_union"           

6.11.2 Reordering Levels

Having created a factor we can use the lvls_reorder function to change the order of the factor levels. In our earlier example we created a factor from the SEX column in dm. By default, factors are ordered alphabetically and so females have the value 1 underneath and males have the value 2. To use the lvls_reorder function we specify the name of the factor that we wish to re-order, then the order of the levels.

# What is the current order?
SEX
 [1] F M M F F F F M F M F M F F F F F F F M M M F F F F F M F M
Levels: F M
# Re-order females and males
SEX <- lvls_reorder(SEX, c(2, 1))
SEX
 [1] F M M F F F F M F M F M F F F F F F F M M M F F F F F M F M
Levels: M F
# Now in the data, (where we can't see the order unless we sort)
dm_new <- mutate(dm, SEX = lvls_reorder(SEX, c(2, 1)))
arrange(dm_new, SEX)
# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <fct> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 2 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 3 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 4 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
 5 STD123456:000012    26 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… GSK  
 6 STD123456:000020    64 M     FRA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 7 STD123456:000021    55 M     GER     ASIAN                     NOT HIS… Comp…
 8 STD123456:000022    46 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000028    67 M     IRE     WHITE                     HISPANI… GSK  
10 STD123456:000030    71 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

Note that in order to use the lvls_reorder successfully we need to be certain of the current order of the levels. As a more sophisticated alternative we can also reorder the levels of one variable based on the values of another using the fct_reorder function. This is particularly useful for plotting.

# Generate some fake count data of favourite colours
colour_data <- tibble(Colour = c("red", "green", "blue"),
                     Count = c(8, 28, 19))

# Create a factor of the Colour column and order by the (descending) Count
# I.e. highest count first
new_colour_data <- mutate(
  colour_data,
  Colour_order = fct_reorder(Colour, Count, .desc = TRUE))

# Plot the data using ggplot2
ggplot(data = new_colour_data) +
  geom_bar(aes(x = Colour_order, y = Count), stat = "identity")

6.11.3 Recoding Factors (Changing the Labels)

We can change the labels in a simple way using the lvls_revalue function. To use the lvls_revalue function we provide the factor that we wish to re-label and a vector of new labels. It is once again important that we know the current order of levels when using the lvls_revalue function. To be more specific about the recoding we can use the fct_recode function.

# Extract the data
SEX <- pull(dm, SEX)
SEX
 [1] "F" "M" "M" "F" "F" "F" "F" "M" "F" "M" "F" "M" "F" "F" "F" "F" "F" "F" "F"
[20] "M" "M" "M" "F" "F" "F" "F" "F" "M" "F" "M"
attr(,"label")
[1] "Sex"
# Relabel F and M to Female and Male
SEX_lab <- lvls_revalue(SEX, c("Female", "Male"))
SEX_lab
 [1] Female Male   Male   Female Female Female Female Male   Female Male  
[11] Female Male   Female Female Female Female Female Female Female Male  
[21] Male   Male   Female Female Female Female Female Male   Female Male  
Levels: Female Male
# Alternatively we may take this more robust approach
SEX_lab <- fct_recode(SEX, Female = "F", Male = "M")
SEX_lab
 [1] Female Male   Male   Female Female Female Female Male   Female Male  
[11] Female Male   Female Female Female Female Female Female Female Male  
[21] Male   Male   Female Female Female Female Female Male   Female Male  
Levels: Female Male

6.11.4 Collapsing/Combining Levels

The fct_recode can also be used to combine levels and re-classify a variable. In the example below we combine the European nations in the dm data by recoding each to "Europe".

# Extratc COUNTRY coloumn
COUNTRY <- pull(dm, COUNTRY)
COUNTRY
 [1] "UK"  "FRA" "USA" "GER" "IRE" "GER" "USA" "GER" "USA" "USA" "USA" "FRA"
[13] "IRE" "USA" "FRA" "USA" "FRA" "IRE" "USA" "FRA" "GER" "GER" "IRE" "FRA"
[25] "USA" "FRA" "IRE" "IRE" "UK"  "USA"
attr(,"label")
[1] "Country"
# Create Continent by combining countries
Continent <- fct_recode(COUNTRY, 
                        Europe = "UK", Europe = "FRA", 
                        Europe = "GER", Europe = "IRE")
Continent
 [1] Europe Europe USA    Europe Europe Europe USA    Europe USA    USA   
[11] USA    Europe Europe USA    Europe USA    Europe Europe USA    Europe
[21] Europe Europe Europe Europe USA    Europe Europe Europe Europe USA   
Levels: Europe USA

Once again we have other options available to us when re-classifying a variable. Here is the same example again but using the fct_collapse function. This function allows us to provide a vector of labels to collapse down onto a new label.

# Alternatively
Continent <- fct_collapse(COUNTRY, Europe = c("UK", "FRA", "GER", "IRE"))

6.12 String Manipulation with the stringr Package

The stringr package, which is part of the tidyverse, provides a whole heap of functions for working with text string. Almost all of the functions in the stringr package (and all of the functions that we look at in this section) begin with the prefix “str_”. Each function takes the text that we’re interested in doing something with as the first argument.

# Load the Package
library(stringr)
# What's inside?
objects("package:stringr")
 [1] "%>%"               "boundary"          "coll"             
 [4] "fixed"             "fruit"             "invert_match"     
 [7] "regex"             "sentences"         "str_c"            
[10] "str_conv"          "str_count"         "str_detect"       
[13] "str_dup"           "str_ends"          "str_equal"        
[16] "str_escape"        "str_extract"       "str_extract_all"  
[19] "str_flatten"       "str_flatten_comma" "str_glue"         
[22] "str_glue_data"     "str_ilike"         "str_interp"       
[25] "str_length"        "str_like"          "str_locate"       
[28] "str_locate_all"    "str_match"         "str_match_all"    
[31] "str_order"         "str_pad"           "str_rank"         
[34] "str_remove"        "str_remove_all"    "str_replace"      
[37] "str_replace_all"   "str_replace_na"    "str_sort"         
[40] "str_split"         "str_split_1"       "str_split_fixed"  
[43] "str_split_i"       "str_squish"        "str_starts"       
[46] "str_sub"           "str_sub_all"       "str_sub<-"        
[49] "str_subset"        "str_to_camel"      "str_to_kebab"     
[52] "str_to_lower"      "str_to_sentence"   "str_to_snake"     
[55] "str_to_title"      "str_to_upper"      "str_trim"         
[58] "str_trunc"         "str_unique"        "str_view"         
[61] "str_view_all"      "str_which"         "str_width"        
[64] "str_wrap"          "word"              "words"            

6.12.1 String Length

Let’s look at a few of the most useful functions, starting with str_length which simply tells us how long a string is. For the examples we work with vectors directly but in practice these functions will most likely be called via the mutate function.

str_length("GSK")  # equivalenetly `nchar("GSK")`
[1] 3

6.12.2 Adding White Space: Formatting

When producing production outputs we often have to work with white space. The str_pad function adds white space to the left, right or both sides of the given text string depending on what we pass to the side argument. By default the space is added to the left. We add space providing a number, which represents the desired length of each string in the vector.

# Some values that we wish to right align
aes <- c(15, 10, 7, 2, 1, 1)

# Add some padding to the left in order to right align the text
ae_pad <- str_pad(aes, 2)
ae_pad
[1] "15" "10" " 7" " 2" " 1" " 1"
# Or left align for some reason
str_pad(aes, 3, side="right")
[1] "15 " "10 " "7  " "2  " "1  " "1  "

6.12.3 Removal of White Space

White space can be useful for formatting but often it just gets in the way, for example of concatenation (see below). As easily as we can add it, we can take it away. We use str_trim (the opposite of str_pad) to remove unwanted space at either end of the string.

# Get rid of the white space that we just added
str_trim(ae_pad)
[1] "15" "10" "7"  "2"  "1"  "1" 

6.12.4 Concatenation

The str_c functions allows us to combine/concatenate string. We simply provide the function with vectors or single pieces of string that we wish to use for the concatenation.

# Calculate percentages and right-align
ae_perc <- round(100*aes / 30)
ae_perc_pad <- str_pad(ae_perc, 2)

# Joing the padded data together for nicely formatted results
str_c(ae_pad, " (", ae_perc_pad, "%)")
[1] "15 (50%)" "10 (33%)" " 7 (23%)" " 2 ( 7%)" " 1 ( 3%)" " 1 ( 3%)"

6.12.5 Paragraphs

When trying to fit lots of text into a confined space it can be useful to wrap when it exceeds some predefined length. The str_wrap function achieves this by strategically placing the return escape, "\n", within the given text.

# Create a paragraph of text
some_text <- "When working in a late phase pharmaceutical environment, it is common to have long titles.  But alas, we must fit these long titles within the limited space offered by the L10 format that we use.  Sad face."

# Insert returns in so that it fits within some pre-specified width.
str_wrap(some_text, 108) 
[1] "When working in a late phase pharmaceutical environment, it is common to have long titles. But alas, we must\nfit these long titles within the limited space offered by the L10 format that we use. Sad face."

6.12.6 Finding Patterns

The stringr package also contains a number of functions that build upon R’s regular expression capabilities. The string functions are similar to others that exist in R but are much more consistent in terms of the order of variables (the string that we’re querying is always the first argument).

In the example below we simulate some asthma treatments and find those containing a “LABA” component using the str_detect function.

# Create some asthma treatments
treats <- sample(c("ICS", "ICS/LABA"), 10, replace = TRUE)
treats
 [1] "ICS/LABA" "ICS"      "ICS"      "ICS"      "ICS"      "ICS/LABA"
 [7] "ICS"      "ICS"      "ICS/LABA" "ICS"     
# Find subjects with LABA component
str_detect(treats, "LABA")
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

6.12.7 Replacing Values

Once we’ve found a string pattern we may wish to replace it with something else. The str_replace function allows us to replace one piece of string with another. In order to do so we provide three arguments: the string; the text that we want to find; the text that we want to replace it with when we find it.

study_conclusion <- "The p-value was less than 0.05"
str_replace(study_conclusion, "less than", "greater than")
[1] "The p-value was greater than 0.05"

Once we know what we’re doing we can use the str_replace function to do clever things like replace white space in the middle of a text string.

messy_text <- c("How did    I   end  up with  so   much white           space?")
# "\\s+" searches for cases with any length of white space
str_replace_all(messy_text, pattern="\\s+", " ")
[1] "How did I end up with so much white space?"

6.13 EXERCISE

  1. Create a new column, ACTTOT, which is the sum of the 5 ACT items in the act data

  2. Create a column, ACTRESP, which partitions ACTTOT into “controlled” if ACTTOT >= 20 or “uncontrolled” otherwise.

  3. Create a new variable, VISITNew, by formatting the VISITNUM column so that it displays “Screening”, “Randomisation”, “Early Withdrawal” or the appropriate week of the trial, e.g. “Week 6”. HINT: Use the fct_recode function from the forcats package.

  4. Create a flag that identifies whether or not it’s a phone visit. HINT: Use the str_detect or str_starts function from the stringr package within if_else.

  5. Filter the act data to find QSDT dates starting with “2012”. HINT: Use the str_detect, or str_starts function from the stringr package.

  6. Create a vector of subject IDs (SUBJID) from 1 to 20. Turn this into a suitable USUBJID format for study “GSK456789” (i.e. “GSK456789:000001” etc.) HINT: Use the appropriate function from the stringr package.

6.14 Working with Dates and the lubridate Package

The lubridate package is another of the tidyverse packages and provides a number of useful functions for working with dates and times, supplementing other functions that were already built into R before the tidyverse came along. As with forcats and stringr the lubridate package is installed as part of the tidyverse but we have to load it separately. Unlike forcats and stringr, however, functions in lubridate don’t have a prefix.

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
objects("package:lubridate")
  [1] "%--%"              "%m-%"              "%m+%"             
  [4] "%within%"          "add_with_rollback" "am"               
  [7] "Arith"             "as_date"           "as_datetime"      
 [10] "as.difftime"       "as.duration"       "as.interval"      
 [13] "as.period"         "ceiling_date"      "Compare"          
 [16] "cyclic_encoding"   "date"              "Date"             
 [19] "date_decimal"      "date<-"            "day"              
 [22] "day<-"             "days"              "days_in_month"    
 [25] "ddays"             "decimal_date"      "dhours"           
 [28] "dmicroseconds"     "dmilliseconds"     "dminutes"         
 [31] "dmonths"           "dmy"               "dmy_h"            
 [34] "dmy_hm"            "dmy_hms"           "dnanoseconds"     
 [37] "dpicoseconds"      "dseconds"          "dst"              
 [40] "duration"          "dweeks"            "dyears"           
 [43] "dym"               "epiweek"           "epiyear"          
 [46] "fast_strptime"     "fit_to_timeline"   "floor_date"       
 [49] "force_tz"          "force_tzs"         "format_ISO8601"   
 [52] "guess_formats"     "hm"                "hms"              
 [55] "hour"              "hour<-"            "hours"            
 [58] "int_aligns"        "int_diff"          "int_end"          
 [61] "int_end<-"         "int_flip"          "int_length"       
 [64] "int_overlaps"      "int_shift"         "int_standardize"  
 [67] "int_start"         "int_start<-"       "intersect"        
 [70] "interval"          "is.Date"           "is.difftime"      
 [73] "is.duration"       "is.instant"        "is.interval"      
 [76] "is.period"         "is.POSIXct"        "is.POSIXlt"       
 [79] "is.POSIXt"         "is.timepoint"      "is.timespan"      
 [82] "isoweek"           "isoyear"           "lakers"           
 [85] "leap_year"         "local_time"        "make_date"        
 [88] "make_datetime"     "make_difftime"     "mday"             
 [91] "mday<-"            "mdy"               "mdy_h"            
 [94] "mdy_hm"            "mdy_hms"           "microseconds"     
 [97] "milliseconds"      "minute"            "minute<-"         
[100] "minutes"           "month"             "month<-"          
[103] "ms"                "my"                "myd"              
[106] "NA_Date_"          "NA_POSIXct_"       "nanoseconds"      
[109] "now"               "origin"            "parse_date_time"  
[112] "parse_date_time2"  "period"            "period_to_seconds"
[115] "picoseconds"       "pm"                "POSIXct"          
[118] "pretty_dates"      "qday"              "qday<-"           
[121] "quarter"           "reclass_date"      "reclass_timespan" 
[124] "rollback"          "rollbackward"      "rollforward"      
[127] "round_date"        "second"            "second<-"         
[130] "seconds"           "seconds_to_period" "semester"         
[133] "setdiff"           "show"              "stamp"            
[136] "stamp_date"        "stamp_time"        "time_length"      
[139] "today"             "tz"                "tz<-"             
[142] "union"             "wday"              "wday<-"           
[145] "week"              "week<-"            "weeks"            
[148] "with_tz"           "yday"              "yday<-"           
[151] "ydm"               "ydm_h"             "ydm_hm"           
[154] "ydm_hms"           "year"              "year<-"           
[157] "years"             "ym"                "ymd"              
[160] "ymd_h"             "ymd_hm"            "ymd_hms"          
[163] "yq"               

6.14.1 Date Conversion

As with other software, R stores date data as numeric data and applies a format over the top so that the dates are human-readable. For the underlying numbers R follows the unix convention whereby time starts on 1st January 1970 (day 0). The character format is always the international date standard format, “yyyy-mm-dd”. To make use of the date functionality we often need to convert our dates from character to date format. It is worth noting, however, that functions such as read_sas can recognise date types and convert automatically.

In the spirit of functions having a single, simple purpose a significant proportion of the functions in the lubridate package simply convert various character date and/or time formats into a date/time format. For dates we must choose from a number of functions of the form, dmy (day-month-year). The choice of which depends on what order our day, month and year are provided in. The functions are all reasonably intelligent and can handle a variety of separators (e.g. “/” or “-”); numeric and character specification of month; and both 2 and 4 digit years. Some examples are provided below.

# Some different ways of displaying the same date
a_date_europe <- "15-11-17"
a_date_US <- "11-15-17"
a_date_GSK <- "15-Nov-17"
a_date_programming_standard <- "2017-11-15"

# Some lubridate functions for converting each of thes dates
dmy(a_date_europe)
[1] "2017-11-15"
mdy(a_date_US)
[1] "2017-11-15"
dmy(a_date_GSK)
[1] "2017-11-15"
ymd(a_date_programming_standard)
[1] "2017-11-15"

The lubridate package has a similar set of functions for working with times and date-time. Some further examples are provided below.

# A time
lunch_time <- "12:00:00"
hms(lunch_time)
[1] "12H 0M 0S"
# A date-time
new_year_party_start <- "31-12-2017 21:30:00"
dmy_hms(new_year_party_start, tz = "UTC")
[1] "2017-12-31 21:30:00 UTC"

6.14.2 Date/time Utilities

Once we have converted our date/time into the appropriate format we can make use of a number of utility functions in order to process the dates. A few examples are provided below.

# Today is a
weekdays(today())
[1] "Friday"
# When can I leave?
today() + hours(3)
[1] "2026-01-09 03:00:00 UTC"
# DBF is today, how long until SAC?
DBF <- today()
SAC <- DBF + weeks(5)
SAC
[1] "2026-02-13"
# How many days until Christmas?
difftime(ymd("2017-12-25"), today()) 
Time difference of -2937 days

6.15 Transmute

All of the functions that we’ve seen in recent sections can be used with the mutate function. They can also be used with a counterpart of mutate, transmute. The difference between mutate and transmute is that when we use transmute, the newly created columns are kept but the old ones are dropped. Occasionally this can be useful if summarising a large number of variables into a total score or similar.

transmute(vs, 
          Height_m2 = (HEIGHT/100) ^2, 
          BMI = WEIGHT / Height_m2)
# A tibble: 30 × 2
   Height_m2   BMI
       <dbl> <dbl>
 1      2.72  32.3
 2      3.20  31.2
 3      3.31  24.5
 4      2.76  33.0
 5      2.86  25.8
 6      2.69  24.5
 7      2.46  43.4
 8      3.57  28.3
 9      2.40  22.5
10      2.92  30.1
# ℹ 20 more rows
If you notice an issue, have suggestions for improvements, or want to view the source code, you can find it on GitHub.