We can combine similarly structured data frames using the bind_rows function. This is equivalent to a “set” operation in SAS.
# Generate two data framesinterim <-crossing(SUBJID =1:3, VISITNUM =1:2)post_interim <-crossing(SUBJID =4:5, VISITNUM =1:2)# What do they look like?interim
The bind_rows is relatively intelligent in that the columns do not need to match between the two data frames that we wish to combine. If columns appear in a different order, the order of the first data frame is used for the final output.
In the following example a baseline record is combined with a post-baseline record. The post-baseline record does not contain a “BASE” column but this appears in the final data with a missing value.
# Generate two data framesbaseline <-tibble(SUBJID =1, VISITNUM =1, BASE=10, AVAL =10)post_bl <-tibble(SUBJID =1, VISITNUM =2, AVAL=11)bind_rows(baseline, post_bl)
# A tibble: 2 × 4
SUBJID VISITNUM BASE AVAL
<dbl> <dbl> <dbl> <dbl>
1 1 1 10 10
2 1 2 NA 11
9.2 Binding Columns bind_cols
For columns an equivalent function, bind_cols exists. So long as two data frames have the same number of rows we can use bind_cols to join them together. In the following example we should merge by USUBJID but let’s see what happens if we just stick the data together.
# Simply stick the demography and vitals data together# Scary stuff! But it works since both have the same subjects in the same orderbind_cols(dm, vs)
New names:
• `USUBJID` -> `USUBJID...1`
• `USUBJID` -> `USUBJID...8`
# A tibble: 30 × 10
USUBJID...1 AGE SEX COUNTRY RACE ETHNIC ARM USUBJID...8 HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:0… 32 F UK BLAC… NOT H… Comp… STD123456:… 165 88
2 STD123456:0… 28 M FRA WHITE NOT H… Comp… STD123456:… 179 100
3 STD123456:0… 55 M USA BLAC… NOT H… Comp… STD123456:… 182 81
4 STD123456:0… 35 F GER WHITE HISPA… Comp… STD123456:… 166 91
5 STD123456:0… 30 F IRE WHITE NOT H… Comp… STD123456:… 169 73.6
6 STD123456:0… 22 F GER WHITE NOT H… Comp… STD123456:… 164 66
7 STD123456:0… 59 F USA WHITE NOT H… Comp… STD123456:… 157 107
8 STD123456:0… 53 M GER WHITE NOT H… GSK STD123456:… 189 101
9 STD123456:0… 60 F USA WHITE NOT H… GSK STD123456:… 155 54.1
10 STD123456:0… 48 M USA WHITE NOT H… Comp… STD123456:… 171 88
# ℹ 20 more rows
In this case it worked, although we might wish to remove the USUBJID1 column after checking that all values are equal to those in the USUBJID column. The bind_cols function is useful for simulation but with real data a merge is usually more appropriate.
9.3 Merging (Joining)
The tidyverse terminology generally follows SQL where possible. We therefore talk about “joining” as opposed to “merging”. As is often the case in the tidyverse, different types of join are achieved by using different join functions. Here is a list of the current join functions:
The *_join functions optimise the join for us so as long as we’re not performing a many-to-many merge no prior sorting is required. We need only specify which data frames we wish to join and which variables we wish to join by. In the following example we join together the demographics and vital signs data. Each of the join functions follows the same format of arguments: the first two arguments are the two data frames to join, the third argument is the by argument. This third argument requires a character vector of columns that we wish to join by.
full_join(dm, vs, by ="USUBJID")
# A tibble: 30 × 9
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000001 32 F UK BLACK OR AFR… NOT H… Comp… 165 88
2 STD123456:000002 28 M FRA WHITE NOT H… Comp… 179 100
3 STD123456:000003 55 M USA BLACK OR AFR… NOT H… Comp… 182 81
4 STD123456:000004 35 F GER WHITE HISPA… Comp… 166 91
5 STD123456:000005 30 F IRE WHITE NOT H… Comp… 169 73.6
6 STD123456:000006 22 F GER WHITE NOT H… Comp… 164 66
7 STD123456:000007 59 F USA WHITE NOT H… Comp… 157 107
8 STD123456:000008 53 M GER WHITE NOT H… GSK 189 101
9 STD123456:000009 60 F USA WHITE NOT H… GSK 155 54.1
10 STD123456:000010 48 M USA WHITE NOT H… Comp… 171 88
# ℹ 20 more rows
In the previous example we performed a full join. Observe what happens if we call this function on datasets with mismatching subject IDs.
# A tibble: 9 × 5
USUBJID AGE SEX COUNTRY ARM
<chr> <dbl> <chr> <chr> <chr>
1 STD123456:000002 28 M FRA Comparator
2 STD123456:000003 55 M USA Comparator
3 STD123456:000004 35 F GER Comparator
4 STD123456:000005 30 F IRE Comparator
5 STD123456:000006 22 F GER Comparator
6 STD123456:000007 59 F USA Comparator
7 STD123456:000008 53 M GER GSK
8 STD123456:000009 60 F USA GSK
9 STD123456:000010 48 M USA Comparator
# Join togetherfull_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 10 × 7
USUBJID AGE SEX COUNTRY ARM HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000002 28 M FRA Comparator 179 100
2 STD123456:000003 55 M USA Comparator NA NA
3 STD123456:000004 35 F GER Comparator 166 91
4 STD123456:000005 30 F IRE Comparator NA NA
5 STD123456:000006 22 F GER Comparator 164 66
6 STD123456:000007 59 F USA Comparator NA NA
7 STD123456:000008 53 M GER GSK NA NA
8 STD123456:000009 60 F USA GSK 155 54.1
9 STD123456:000010 48 M USA Comparator 171 88
10 STD123456:000001 NA <NA> <NA> <NA> 165 88
All subjects are included but for the subjects with missing vitals the HEIGHT and WEIGHT values are shown as missing, as well as the AGE, SEX, COUNTRY and ARM values for the first subject in dm data.
9.4 Left, Right and Inner Joins
If we perform a left join, only subjects present in dm_head are displayed with accompanying data joined from vs_missing.
left_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 9 × 7
USUBJID AGE SEX COUNTRY ARM HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000002 28 M FRA Comparator 179 100
2 STD123456:000003 55 M USA Comparator NA NA
3 STD123456:000004 35 F GER Comparator 166 91
4 STD123456:000005 30 F IRE Comparator NA NA
5 STD123456:000006 22 F GER Comparator 164 66
6 STD123456:000007 59 F USA Comparator NA NA
7 STD123456:000008 53 M GER GSK NA NA
8 STD123456:000009 60 F USA GSK 155 54.1
9 STD123456:000010 48 M USA Comparator 171 88
If we specify a right join, then only subjects that appear in vs_missing are joined on from dm_head, as seen below.
right_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 6 × 7
USUBJID AGE SEX COUNTRY ARM HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000002 28 M FRA Comparator 179 100
2 STD123456:000004 35 F GER Comparator 166 91
3 STD123456:000006 22 F GER Comparator 164 66
4 STD123456:000009 60 F USA GSK 155 54.1
5 STD123456:000010 48 M USA Comparator 171 88
6 STD123456:000001 NA <NA> <NA> <NA> 165 88
An inner join only returns complete records where data are present in both data frames for a particular subject.
inner_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 5 × 7
USUBJID AGE SEX COUNTRY ARM HEIGHT WEIGHT
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000002 28 M FRA Comparator 179 100
2 STD123456:000004 35 F GER Comparator 166 91
3 STD123456:000006 22 F GER Comparator 164 66
4 STD123456:000009 60 F USA GSK 155 54.1
5 STD123456:000010 48 M USA Comparator 171 88
9.5 Semi Joins and Anti Joins
In cases such as the one above, where subjects appear in one dataset but not the other the semi_join and anti_join functions can be useful. The semi_join function returns all records in the first data frame that would be merged if a left/inner join were performed.
The anti_join function is effectively the converse to the semi_join and returns all the records that would not be merged.
# Show the records in dm that would be merged if performing an inner/left joinsemi_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 5 × 5
USUBJID AGE SEX COUNTRY ARM
<chr> <dbl> <chr> <chr> <chr>
1 STD123456:000002 28 M FRA Comparator
2 STD123456:000004 35 F GER Comparator
3 STD123456:000006 22 F GER Comparator
4 STD123456:000009 60 F USA GSK
5 STD123456:000010 48 M USA Comparator
# Show the records in dm that would NOT be merged if performing an inner/left joinanti_join(dm_head, vs_missing, by ="USUBJID")
# A tibble: 4 × 5
USUBJID AGE SEX COUNTRY ARM
<chr> <dbl> <chr> <chr> <chr>
1 STD123456:000003 55 M USA Comparator
2 STD123456:000005 30 F IRE Comparator
3 STD123456:000007 59 F USA Comparator
4 STD123456:000008 53 M GER GSK
9.6 EXERCISE
Merge the randomisation dates (contained within the sl data) onto the act data
Now merge on the demography and vitals information and save this as act_main
Extra
As in an earlier exercise, find subjects that have a missing Week 24 record and calculate the Worst Observation Carried Forward for these subjects. The resulting data should have a single row for each subject where VISITNUM and VISIT have been set to 60 and "VISIT 6 (WEEK 24)" respectively. An additional column should be labelled WOCF and be set to TRUE for these records. Bind these records back on to the main data.
If you notice an issue, have suggestions for improvements, or want to view the source code, you can find it on GitHub.