9 Setting and Merging – Introduction to R and the tidyverse

9.1 Setting: `bind_rows`

We can combine similarly structured data frames using the bind_rows function. This is equivalent to a “set” operation in SAS.

# Generate two data frames
interim <- crossing(SUBJID = 1:3, VISITNUM = 1:2)
post_interim <- crossing(SUBJID = 4:5, VISITNUM = 1:2)

# What do they look like?
interim

# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2

post_interim

# A tibble: 4 × 2
  SUBJID VISITNUM
   <int>    <int>
1      4        1
2      4        2
3      5        1
4      5        2

# Combine (set) the data frames
final_data <- bind_rows(interim, post_interim)
final_data

# A tibble: 10 × 2
   SUBJID VISITNUM
    <int>    <int>
 1      1        1
 2      1        2
 3      2        1
 4      2        2
 5      3        1
 6      3        2
 7      4        1
 8      4        2
 9      5        1
10      5        2

The bind_rows is relatively intelligent in that the columns do not need to match between the two data frames that we wish to combine. If columns appear in a different order, the order of the first data frame is used for the final output.

In the following example a baseline record is combined with a post-baseline record. The post-baseline record does not contain a “BASE” column but this appears in the final data with a missing value.

# Generate two data frames
baseline <- tibble(SUBJID = 1, VISITNUM = 1, BASE= 10, AVAL = 10)
post_bl <- tibble(SUBJID = 1, VISITNUM = 2, AVAL= 11)

bind_rows(baseline, post_bl)

# A tibble: 2 × 4
  SUBJID VISITNUM  BASE  AVAL
   <dbl>    <dbl> <dbl> <dbl>
1      1        1    10    10
2      1        2    NA    11

9.2 Binding Columns bind_cols

For columns an equivalent function, bind_cols exists. So long as two data frames have the same number of rows we can use bind_cols to join them together. In the following example we should merge by USUBJID but let’s see what happens if we just stick the data together.

# Simply stick the demography and vitals data together
# Scary stuff! But it works since both have the same subjects in the same order
bind_cols(dm, vs)

New names:
• `USUBJID` -> `USUBJID...1`
• `USUBJID` -> `USUBJID...8`

# A tibble: 30 × 10
   USUBJID...1    AGE SEX   COUNTRY RACE  ETHNIC ARM   USUBJID...8 HEIGHT WEIGHT
   <chr>        <dbl> <chr> <chr>   <chr> <chr>  <chr> <chr>        <dbl>  <dbl>
 1 STD123456:0…    32 F     UK      BLAC… NOT H… Comp… STD123456:…    165   88  
 2 STD123456:0…    28 M     FRA     WHITE NOT H… Comp… STD123456:…    179  100  
 3 STD123456:0…    55 M     USA     BLAC… NOT H… Comp… STD123456:…    182   81  
 4 STD123456:0…    35 F     GER     WHITE HISPA… Comp… STD123456:…    166   91  
 5 STD123456:0…    30 F     IRE     WHITE NOT H… Comp… STD123456:…    169   73.6
 6 STD123456:0…    22 F     GER     WHITE NOT H… Comp… STD123456:…    164   66  
 7 STD123456:0…    59 F     USA     WHITE NOT H… Comp… STD123456:…    157  107  
 8 STD123456:0…    53 M     GER     WHITE NOT H… GSK   STD123456:…    189  101  
 9 STD123456:0…    60 F     USA     WHITE NOT H… GSK   STD123456:…    155   54.1
10 STD123456:0…    48 M     USA     WHITE NOT H… Comp… STD123456:…    171   88  
# ℹ 20 more rows

In this case it worked, although we might wish to remove the USUBJID1 column after checking that all values are equal to those in the USUBJID column. The bind_cols function is useful for simulation but with real data a merge is usually more appropriate.

9.3 Merging (Joining)

The tidyverse terminology generally follows SQL where possible. We therefore talk about “joining” as opposed to “merging”. As is often the case in the tidyverse, different types of join are achieved by using different join functions. Here is a list of the current join functions:

anti_join
cross_join
full_join
inner_join
left_join
nest_join
right_join
semi_join
sql_join
sql_semi_join

The *_join functions optimise the join for us so as long as we’re not performing a many-to-many merge no prior sorting is required. We need only specify which data frames we wish to join and which variables we wish to join by. In the following example we join together the demographics and vital signs data. Each of the join functions follows the same format of arguments: the first two arguments are the two data frames to join, the third argument is the by argument. This third argument requires a character vector of columns that we wish to join by.

full_join(dm, vs, by = "USUBJID")

# A tibble: 30 × 9
   USUBJID            AGE SEX   COUNTRY RACE          ETHNIC ARM   HEIGHT WEIGHT
   <chr>            <dbl> <chr> <chr>   <chr>         <chr>  <chr>  <dbl>  <dbl>
 1 STD123456:000001    32 F     UK      BLACK OR AFR… NOT H… Comp…    165   88  
 2 STD123456:000002    28 M     FRA     WHITE         NOT H… Comp…    179  100  
 3 STD123456:000003    55 M     USA     BLACK OR AFR… NOT H… Comp…    182   81  
 4 STD123456:000004    35 F     GER     WHITE         HISPA… Comp…    166   91  
 5 STD123456:000005    30 F     IRE     WHITE         NOT H… Comp…    169   73.6
 6 STD123456:000006    22 F     GER     WHITE         NOT H… Comp…    164   66  
 7 STD123456:000007    59 F     USA     WHITE         NOT H… Comp…    157  107  
 8 STD123456:000008    53 M     GER     WHITE         NOT H… GSK      189  101  
 9 STD123456:000009    60 F     USA     WHITE         NOT H… GSK      155   54.1
10 STD123456:000010    48 M     USA     WHITE         NOT H… Comp…    171   88  
# ℹ 20 more rows

In the previous example we performed a full join. Observe what happens if we call this function on datasets with mismatching subject IDs.

dm_head <- dm %>% head(10) %>% select(-RACE, -ETHNIC) %>% slice(-1)
vs_missing <- vs %>% slice(1, 2, 4, 6, 9, 10)
dm_head

# A tibble: 9 × 5
  USUBJID            AGE SEX   COUNTRY ARM       
  <chr>            <dbl> <chr> <chr>   <chr>     
1 STD123456:000002    28 M     FRA     Comparator
2 STD123456:000003    55 M     USA     Comparator
3 STD123456:000004    35 F     GER     Comparator
4 STD123456:000005    30 F     IRE     Comparator
5 STD123456:000006    22 F     GER     Comparator
6 STD123456:000007    59 F     USA     Comparator
7 STD123456:000008    53 M     GER     GSK       
8 STD123456:000009    60 F     USA     GSK       
9 STD123456:000010    48 M     USA     Comparator

vs_missing

# A tibble: 6 × 3
  USUBJID          HEIGHT WEIGHT
  <chr>             <dbl>  <dbl>
1 STD123456:000001    165   88  
2 STD123456:000002    179  100  
3 STD123456:000004    166   91  
4 STD123456:000006    164   66  
5 STD123456:000009    155   54.1
6 STD123456:000010    171   88

# Join together
full_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 10 × 7
   USUBJID            AGE SEX   COUNTRY ARM        HEIGHT WEIGHT
   <chr>            <dbl> <chr> <chr>   <chr>       <dbl>  <dbl>
 1 STD123456:000002    28 M     FRA     Comparator    179  100  
 2 STD123456:000003    55 M     USA     Comparator     NA   NA  
 3 STD123456:000004    35 F     GER     Comparator    166   91  
 4 STD123456:000005    30 F     IRE     Comparator     NA   NA  
 5 STD123456:000006    22 F     GER     Comparator    164   66  
 6 STD123456:000007    59 F     USA     Comparator     NA   NA  
 7 STD123456:000008    53 M     GER     GSK            NA   NA  
 8 STD123456:000009    60 F     USA     GSK           155   54.1
 9 STD123456:000010    48 M     USA     Comparator    171   88  
10 STD123456:000001    NA <NA>  <NA>    <NA>          165   88

All subjects are included but for the subjects with missing vitals the HEIGHT and WEIGHT values are shown as missing, as well as the AGE, SEX, COUNTRY and ARM values for the first subject in dm data.

9.4 Left, Right and Inner Joins

If we perform a left join, only subjects present in dm_head are displayed with accompanying data joined from vs_missing.

left_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 9 × 7
  USUBJID            AGE SEX   COUNTRY ARM        HEIGHT WEIGHT
  <chr>            <dbl> <chr> <chr>   <chr>       <dbl>  <dbl>
1 STD123456:000002    28 M     FRA     Comparator    179  100  
2 STD123456:000003    55 M     USA     Comparator     NA   NA  
3 STD123456:000004    35 F     GER     Comparator    166   91  
4 STD123456:000005    30 F     IRE     Comparator     NA   NA  
5 STD123456:000006    22 F     GER     Comparator    164   66  
6 STD123456:000007    59 F     USA     Comparator     NA   NA  
7 STD123456:000008    53 M     GER     GSK            NA   NA  
8 STD123456:000009    60 F     USA     GSK           155   54.1
9 STD123456:000010    48 M     USA     Comparator    171   88

If we specify a right join, then only subjects that appear in vs_missing are joined on from dm_head, as seen below.

right_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 6 × 7
  USUBJID            AGE SEX   COUNTRY ARM        HEIGHT WEIGHT
  <chr>            <dbl> <chr> <chr>   <chr>       <dbl>  <dbl>
1 STD123456:000002    28 M     FRA     Comparator    179  100  
2 STD123456:000004    35 F     GER     Comparator    166   91  
3 STD123456:000006    22 F     GER     Comparator    164   66  
4 STD123456:000009    60 F     USA     GSK           155   54.1
5 STD123456:000010    48 M     USA     Comparator    171   88  
6 STD123456:000001    NA <NA>  <NA>    <NA>          165   88

An inner join only returns complete records where data are present in both data frames for a particular subject.

inner_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 5 × 7
  USUBJID            AGE SEX   COUNTRY ARM        HEIGHT WEIGHT
  <chr>            <dbl> <chr> <chr>   <chr>       <dbl>  <dbl>
1 STD123456:000002    28 M     FRA     Comparator    179  100  
2 STD123456:000004    35 F     GER     Comparator    166   91  
3 STD123456:000006    22 F     GER     Comparator    164   66  
4 STD123456:000009    60 F     USA     GSK           155   54.1
5 STD123456:000010    48 M     USA     Comparator    171   88

9.5 Semi Joins and Anti Joins

In cases such as the one above, where subjects appear in one dataset but not the other the semi_join and anti_join functions can be useful. The semi_join function returns all records in the first data frame that would be merged if a left/inner join were performed.

The anti_join function is effectively the converse to the semi_join and returns all the records that would not be merged.

# Show the records in dm that would be merged if performing an inner/left join
semi_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 5 × 5
  USUBJID            AGE SEX   COUNTRY ARM       
  <chr>            <dbl> <chr> <chr>   <chr>     
1 STD123456:000002    28 M     FRA     Comparator
2 STD123456:000004    35 F     GER     Comparator
3 STD123456:000006    22 F     GER     Comparator
4 STD123456:000009    60 F     USA     GSK       
5 STD123456:000010    48 M     USA     Comparator

# Show the records in dm that would NOT be merged if performing an inner/left join
anti_join(dm_head, vs_missing, by = "USUBJID")

# A tibble: 4 × 5
  USUBJID            AGE SEX   COUNTRY ARM       
  <chr>            <dbl> <chr> <chr>   <chr>     
1 STD123456:000003    55 M     USA     Comparator
2 STD123456:000005    30 F     IRE     Comparator
3 STD123456:000007    59 F     USA     Comparator
4 STD123456:000008    53 M     GER     GSK

9.6 EXERCISE

Merge the randomisation dates (contained within the sl data) onto the act data
Now merge on the demography and vitals information and save this as act_main

Extra

As in an earlier exercise, find subjects that have a missing Week 24 record and calculate the Worst Observation Carried Forward for these subjects. The resulting data should have a single row for each subject where VISITNUM and VISIT have been set to 60 and "VISIT 6 (WEEK 24)" respectively. An additional column should be labelled WOCF and be set to TRUE for these records. Bind these records back on to the main data.

9.1 Setting: bind_rows