8  Summaries and Aggregation

8.1 Summary Statistics: summarise

As we might expect, R has a number of in-built functions that can be used to summarise data. These summary functions tend to be very granular and work directly with vectors (columns) of data. Typically we call these functions via the summarise function. The summarise function works just like mutate except that rather than create a new column, it summarises over all values in the column. We must therefore always provide summarise with functions that return a single value. Like transmute, the summarise function only returns the summarised columns, all other columns are ignored. As with mutate and transmute it helps to name the summarised data.

dm %>% 
  summarise(Mean_Age = mean(AGE), SD_Age = sd(AGE))
# A tibble: 1 × 2
  Mean_Age SD_Age
     <dbl>  <dbl>
1     50.4   15.3

There are many statistical functions in R, mostly contained within the “base” and “stats” packages. Generally the function purposes are self-evident. A common set of statistical summary functions is listed below:

  • mean
  • median
  • sd
  • var
  • min
  • max
  • sum

8.2 Missing Values

R is written in a way that forces us to think about missing values, which is a nice way of saying that they can be a pain to deal with:

some_vals <- c(1, NA, 3)
mean(some_vals)
[1] NA
median(some_vals)
[1] NA
min(some_vals)
[1] NA

Thankfully, each of the statistical summary functions has an argument, na.rm, which when set to TRUE allows us to remove missing values.

some_vals <- c(1, NA, 3)
mean(some_vals, na.rm = TRUE)
[1] 2
median(some_vals, na.rm = TRUE)
[1] 2
min(some_vals, na.rm = TRUE)
[1] 1

8.2.1 Summarising with mutate

Note that we can use functions such as mean and sd with mutate The effect is to summarise and then merge the result back on to the original data.

dm %>% 
  mutate(Mean_Age = mean(AGE), SD_Age = sd(AGE))
# A tibble: 30 × 9
   USUBJID            AGE SEX   COUNTRY RACE        ETHNIC ARM   Mean_Age SD_Age
   <chr>            <dbl> <chr> <chr>   <chr>       <chr>  <chr>    <dbl>  <dbl>
 1 STD123456:000001    32 F     UK      BLACK OR A… NOT H… Comp…     50.4   15.3
 2 STD123456:000002    28 M     FRA     WHITE       NOT H… Comp…     50.4   15.3
 3 STD123456:000003    55 M     USA     BLACK OR A… NOT H… Comp…     50.4   15.3
 4 STD123456:000004    35 F     GER     WHITE       HISPA… Comp…     50.4   15.3
 5 STD123456:000005    30 F     IRE     WHITE       NOT H… Comp…     50.4   15.3
 6 STD123456:000006    22 F     GER     WHITE       NOT H… Comp…     50.4   15.3
 7 STD123456:000007    59 F     USA     WHITE       NOT H… Comp…     50.4   15.3
 8 STD123456:000008    53 M     GER     WHITE       NOT H… GSK       50.4   15.3
 9 STD123456:000009    60 F     USA     WHITE       NOT H… GSK       50.4   15.3
10 STD123456:000010    48 M     USA     WHITE       NOT H… Comp…     50.4   15.3
# ℹ 20 more rows

8.3 By Operations: group_by

R has always been able to calculate summary statistics for one variable by another but the creation of the dplyr package has enabled these “by” operations to take place within a consistent framework. We achieve the desired effect via the group_by function. But before we do any calculations, let’s look at what this function does.

dm %>% 
  group_by(ARM)
# A tibble: 30 × 7
# Groups:   ARM [2]
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 2 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 4 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 5 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 6 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 7 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
 8 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

A quick glance at the result above might suggest that nothing has happened. The data are not even sorted by the ARM variable. What has happened is that an attribute has been added to the data. Notice that in the printed result we see “Groups: ARM [2]”. This tells us that the data are grouped by ARM and that the arm variable has 2 distinct (unique) values.

We can group by as many variables as we like. It doesn’t matter if they are continuous or discrete although clearly the latter makes more sense.

dm %>% 
  group_by(COUNTRY, SEX)
# A tibble: 30 × 7
# Groups:   COUNTRY, SEX [9]
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 2 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 4 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 5 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 6 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 7 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
 8 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

8.4 Summary Statistics Continued

The main benefit of grouping by a variable(s) is so that we can summarise other variables by our grouping variable(s). Once our data are grouped we simply call the summarise function and the behaviour adapts accordingly.

dm %>% 
  group_by(ARM) %>%
  summarise(N = n())
# A tibble: 2 × 2
  ARM            N
  <chr>      <int>
1 Comparator    18
2 GSK           12

The result is always a data frame. Adding more “by” variables simply results in more columns.

dm %>% 
  group_by(ARM, COUNTRY) %>%
  summarise(N = n())
`summarise()` has grouped output by 'ARM'. You can override using the `.groups`
argument.
# A tibble: 9 × 3
# Groups:   ARM [2]
  ARM        COUNTRY     N
  <chr>      <chr>   <int>
1 Comparator FRA         3
2 Comparator GER         3
3 Comparator IRE         4
4 Comparator UK          2
5 Comparator USA         6
6 GSK        FRA         4
7 GSK        GER         2
8 GSK        IRE         2
9 GSK        USA         4

8.5 Other Uses of Grouping

Grouping our data has no effect on column operations such as select and rename. It can, however, change the behaviour of functions such as filter and slice, which operate on rows. For example, when we call slice on grouped data the count resets for each “by” variable.

# Find the first PK measurement for each subject
theoph %>% 
  group_by(SUBJID) %>%
  slice(1)
# A tibble: 12 × 5
# Groups:   SUBJID [12]
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02     0  0.74
 2      2  72.4  4.4      0  0   
 3      3  70.5  4.53     0  0   
 4      4  72.7  4.4      0  0   
 5      5  54.6  5.86     0  0   
 6      6  80    4        0  0   
 7      7  64.6  4.95     0  0.15
 8      8  70.5  4.53     0  0   
 9      9  86.4  3.1      0  0   
10     10  58.2  5.5      0  0.24
11     11  65    4.92     0  0   
12     12  60.5  5.3      0  0   
# And the last
theoph %>% 
  group_by(SUBJID) %>%
  slice(n())
# A tibble: 12 × 5
# Groups:   SUBJID [12]
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02  24.4  3.28
 2      2  72.4  4.4   24.3  0.9 
 3      3  70.5  4.53  24.2  1.05
 4      4  72.7  4.4   24.6  1.15
 5      5  54.6  5.86  24.4  1.57
 6      6  80    4     23.8  0.92
 7      7  64.6  4.95  24.2  1.15
 8      8  70.5  4.53  24.1  1.25
 9      9  86.4  3.1   24.4  1.12
10     10  58.2  5.5   23.7  2.42
11     11  65    4.92  24.1  0.86
12     12  60.5  5.3   24.2  1.17

In the exercises we will practice using a variety of functions with grouped data.

8.6 EXERCISE

  1. Calculate the maximum concentration for each subject in the theoph data
  2. Use mutate and the act_long data to calculate the ACT Total score (ACTTOT) for each subject-visit combination by summing over QSSTRESN and save the output as act_basic.
    1. Remove the QSTEST, QSORRES and QSSTRESN variables
    2. Ensure that each record contains a unique ACT Total score for the subject-visit combination. HINT: Use slice
  3. Derive baseline (Visit 2) ACT Total scores for each subject using the act_basic data from the previous exercise
    1. Rename ACTTOT to ACTBL
    2. Keep only the USUBJID and ACTBL columns
  4. Using the act_full data, for each treatment-visit combination calculate
    1. The number of subjects
    2. The mean and standard deviation of the ACT Total Score

Extra

  1. Filter the act_basic data such that it contains only subjects with a missing Week 24 Total Score and impute week 24 scores using the WOCF (Worst Observation Carried Forward) method by carrying forward their worst (minimum) Total Score from all visits up to that point. HINT: Use the any function to see if a subject has a Week 24 record, and then find the minimum Total Score to impute missing values.

8.7 Updating / Undoing a group_by

It is important to note that by adding a grouping to our data we are making a permanent change to the attributes of our data. This could cause unexpected behaviour further down the line if we “forgot” that our data were grouped. It is easy to undo a grouping using the ungroup function.

# Overwrite dm with a grouped version of dm
dm <- dm %>% 
  group_by(ARM, COUNTRY) 
dm
# A tibble: 30 × 7
# Groups:   ARM, COUNTRY [9]
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 2 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 4 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 5 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 6 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 7 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
 8 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows
# Remove the grouping
dm <- dm %>%
  ungroup()
dm
# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 2 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 4 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 5 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 6 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 7 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
 8 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows

In addition, each time we call the group_by function it overwrites the previous grouping. So rather than always ungrouping our data we can just re-group in order to slice and dice our data in different ways.

If you notice an issue, have suggestions for improvements, or want to view the source code, you can find it on GitHub.