dm %>%
summarise(Mean_Age = mean(AGE), SD_Age = sd(AGE))# A tibble: 1 × 2
Mean_Age SD_Age
<dbl> <dbl>
1 50.4 15.3
summarisegroup_bygroup_bysummariseAs we might expect, R has a number of in-built functions that can be used to summarise data. These summary functions tend to be very granular and work directly with vectors (columns) of data. Typically we call these functions via the summarise function. The summarise function works just like mutate except that rather than create a new column, it summarises over all values in the column. We must therefore always provide summarise with functions that return a single value. Like transmute, the summarise function only returns the summarised columns, all other columns are ignored. As with mutate and transmute it helps to name the summarised data.
dm %>%
summarise(Mean_Age = mean(AGE), SD_Age = sd(AGE))# A tibble: 1 × 2
Mean_Age SD_Age
<dbl> <dbl>
1 50.4 15.3
There are many statistical functions in R, mostly contained within the “base” and “stats” packages. Generally the function purposes are self-evident. A common set of statistical summary functions is listed below:
R is written in a way that forces us to think about missing values, which is a nice way of saying that they can be a pain to deal with:
some_vals <- c(1, NA, 3)
mean(some_vals)[1] NA
median(some_vals)[1] NA
min(some_vals)[1] NA
Thankfully, each of the statistical summary functions has an argument, na.rm, which when set to TRUE allows us to remove missing values.
some_vals <- c(1, NA, 3)
mean(some_vals, na.rm = TRUE)[1] 2
median(some_vals, na.rm = TRUE)[1] 2
min(some_vals, na.rm = TRUE)[1] 1
mutateNote that we can use functions such as mean and sd with mutate The effect is to summarise and then merge the result back on to the original data.
dm %>%
mutate(Mean_Age = mean(AGE), SD_Age = sd(AGE))# A tibble: 30 × 9
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM Mean_Age SD_Age
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 STD123456:000001 32 F UK BLACK OR A… NOT H… Comp… 50.4 15.3
2 STD123456:000002 28 M FRA WHITE NOT H… Comp… 50.4 15.3
3 STD123456:000003 55 M USA BLACK OR A… NOT H… Comp… 50.4 15.3
4 STD123456:000004 35 F GER WHITE HISPA… Comp… 50.4 15.3
5 STD123456:000005 30 F IRE WHITE NOT H… Comp… 50.4 15.3
6 STD123456:000006 22 F GER WHITE NOT H… Comp… 50.4 15.3
7 STD123456:000007 59 F USA WHITE NOT H… Comp… 50.4 15.3
8 STD123456:000008 53 M GER WHITE NOT H… GSK 50.4 15.3
9 STD123456:000009 60 F USA WHITE NOT H… GSK 50.4 15.3
10 STD123456:000010 48 M USA WHITE NOT H… Comp… 50.4 15.3
# ℹ 20 more rows
group_byR has always been able to calculate summary statistics for one variable by another but the creation of the dplyr package has enabled these “by” operations to take place within a consistent framework. We achieve the desired effect via the group_by function. But before we do any calculations, let’s look at what this function does.
dm %>%
group_by(ARM)# A tibble: 30 × 7
# Groups: ARM [2]
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 STD123456:000001 32 F UK BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
2 STD123456:000002 28 M FRA WHITE NOT HIS… Comp…
3 STD123456:000003 55 M USA BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
4 STD123456:000004 35 F GER WHITE HISPANI… Comp…
5 STD123456:000005 30 F IRE WHITE NOT HIS… Comp…
6 STD123456:000006 22 F GER WHITE NOT HIS… Comp…
7 STD123456:000007 59 F USA WHITE NOT HIS… Comp…
8 STD123456:000008 53 M GER WHITE NOT HIS… GSK
9 STD123456:000009 60 F USA WHITE NOT HIS… GSK
10 STD123456:000010 48 M USA WHITE NOT HIS… Comp…
# ℹ 20 more rows
A quick glance at the result above might suggest that nothing has happened. The data are not even sorted by the ARM variable. What has happened is that an attribute has been added to the data. Notice that in the printed result we see “Groups: ARM [2]”. This tells us that the data are grouped by ARM and that the arm variable has 2 distinct (unique) values.
We can group by as many variables as we like. It doesn’t matter if they are continuous or discrete although clearly the latter makes more sense.
dm %>%
group_by(COUNTRY, SEX)# A tibble: 30 × 7
# Groups: COUNTRY, SEX [9]
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 STD123456:000001 32 F UK BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
2 STD123456:000002 28 M FRA WHITE NOT HIS… Comp…
3 STD123456:000003 55 M USA BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
4 STD123456:000004 35 F GER WHITE HISPANI… Comp…
5 STD123456:000005 30 F IRE WHITE NOT HIS… Comp…
6 STD123456:000006 22 F GER WHITE NOT HIS… Comp…
7 STD123456:000007 59 F USA WHITE NOT HIS… Comp…
8 STD123456:000008 53 M GER WHITE NOT HIS… GSK
9 STD123456:000009 60 F USA WHITE NOT HIS… GSK
10 STD123456:000010 48 M USA WHITE NOT HIS… Comp…
# ℹ 20 more rows
The main benefit of grouping by a variable(s) is so that we can summarise other variables by our grouping variable(s). Once our data are grouped we simply call the summarise function and the behaviour adapts accordingly.
dm %>%
group_by(ARM) %>%
summarise(N = n())# A tibble: 2 × 2
ARM N
<chr> <int>
1 Comparator 18
2 GSK 12
The result is always a data frame. Adding more “by” variables simply results in more columns.
dm %>%
group_by(ARM, COUNTRY) %>%
summarise(N = n())`summarise()` has grouped output by 'ARM'. You can override using the `.groups`
argument.
# A tibble: 9 × 3
# Groups: ARM [2]
ARM COUNTRY N
<chr> <chr> <int>
1 Comparator FRA 3
2 Comparator GER 3
3 Comparator IRE 4
4 Comparator UK 2
5 Comparator USA 6
6 GSK FRA 4
7 GSK GER 2
8 GSK IRE 2
9 GSK USA 4
Grouping our data has no effect on column operations such as select and rename. It can, however, change the behaviour of functions such as filter and slice, which operate on rows. For example, when we call slice on grouped data the count resets for each “by” variable.
# Find the first PK measurement for each subject
theoph %>%
group_by(SUBJID) %>%
slice(1)# A tibble: 12 × 5
# Groups: SUBJID [12]
SUBJID WT DOSE TIME CONC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 79.6 4.02 0 0.74
2 2 72.4 4.4 0 0
3 3 70.5 4.53 0 0
4 4 72.7 4.4 0 0
5 5 54.6 5.86 0 0
6 6 80 4 0 0
7 7 64.6 4.95 0 0.15
8 8 70.5 4.53 0 0
9 9 86.4 3.1 0 0
10 10 58.2 5.5 0 0.24
11 11 65 4.92 0 0
12 12 60.5 5.3 0 0
# And the last
theoph %>%
group_by(SUBJID) %>%
slice(n())# A tibble: 12 × 5
# Groups: SUBJID [12]
SUBJID WT DOSE TIME CONC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 79.6 4.02 24.4 3.28
2 2 72.4 4.4 24.3 0.9
3 3 70.5 4.53 24.2 1.05
4 4 72.7 4.4 24.6 1.15
5 5 54.6 5.86 24.4 1.57
6 6 80 4 23.8 0.92
7 7 64.6 4.95 24.2 1.15
8 8 70.5 4.53 24.1 1.25
9 9 86.4 3.1 24.4 1.12
10 10 58.2 5.5 23.7 2.42
11 11 65 4.92 24.1 0.86
12 12 60.5 5.3 24.2 1.17
In the exercises we will practice using a variety of functions with grouped data.
theoph datamutate and the act_long data to calculate the ACT Total score (ACTTOT) for each subject-visit combination by summing over QSSTRESN and save the output as act_basic.
QSTEST, QSORRES and QSSTRESN variablessliceact_basic data from the previous exercise
ACTTOT to ACTBLUSUBJID and ACTBL columnsact_full data, for each treatment-visit combination calculate
Extra
act_basic data such that it contains only subjects with a missing Week 24 Total Score and impute week 24 scores using the WOCF (Worst Observation Carried Forward) method by carrying forward their worst (minimum) Total Score from all visits up to that point. HINT: Use the any function to see if a subject has a Week 24 record, and then find the minimum Total Score to impute missing values.group_byIt is important to note that by adding a grouping to our data we are making a permanent change to the attributes of our data. This could cause unexpected behaviour further down the line if we “forgot” that our data were grouped. It is easy to undo a grouping using the ungroup function.
# Overwrite dm with a grouped version of dm
dm <- dm %>%
group_by(ARM, COUNTRY)
dm# A tibble: 30 × 7
# Groups: ARM, COUNTRY [9]
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 STD123456:000001 32 F UK BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
2 STD123456:000002 28 M FRA WHITE NOT HIS… Comp…
3 STD123456:000003 55 M USA BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
4 STD123456:000004 35 F GER WHITE HISPANI… Comp…
5 STD123456:000005 30 F IRE WHITE NOT HIS… Comp…
6 STD123456:000006 22 F GER WHITE NOT HIS… Comp…
7 STD123456:000007 59 F USA WHITE NOT HIS… Comp…
8 STD123456:000008 53 M GER WHITE NOT HIS… GSK
9 STD123456:000009 60 F USA WHITE NOT HIS… GSK
10 STD123456:000010 48 M USA WHITE NOT HIS… Comp…
# ℹ 20 more rows
# Remove the grouping
dm <- dm %>%
ungroup()
dm# A tibble: 30 × 7
USUBJID AGE SEX COUNTRY RACE ETHNIC ARM
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 STD123456:000001 32 F UK BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
2 STD123456:000002 28 M FRA WHITE NOT HIS… Comp…
3 STD123456:000003 55 M USA BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
4 STD123456:000004 35 F GER WHITE HISPANI… Comp…
5 STD123456:000005 30 F IRE WHITE NOT HIS… Comp…
6 STD123456:000006 22 F GER WHITE NOT HIS… Comp…
7 STD123456:000007 59 F USA WHITE NOT HIS… Comp…
8 STD123456:000008 53 M GER WHITE NOT HIS… GSK
9 STD123456:000009 60 F USA WHITE NOT HIS… GSK
10 STD123456:000010 48 M USA WHITE NOT HIS… Comp…
# ℹ 20 more rows
In addition, each time we call the group_by function it overwrites the previous grouping. So rather than always ungrouping our data we can just re-group in order to slice and dice our data in different ways.