10  Transposing data

Another common task when working with data is the transposition from/to long/wide formats. In the world of CDISC everything is in a long format. However we often need our data in a wide format in order to model or plot it. In the tidyverse, the tidyr package is required for restructuring our data. The tidyr package has many functions but we will focus on just two, pivot_longer and pivot_wider.

10.1 The pivot_longer Function

The pivot_longer function is use when we wish to go from a wide format to a long format. Once again the first argument is the data, which we will pipe in for the examples. We usually need to think long and hard about which variables we want to collapse, and which need to stay as they are…

[THINKING]

Once we’re done thinking we must specify which variables we wish to stack together. Selecting variables can be done using the c function to name specif columns, or by using the same tidy_select group of functions (help(select_helpers)) that are available to us when using the select function (e.g. starts_with, contains, etc.). We can also specify the variables that we don’t want to stack together using a minus sign.

vs %>% 
  pivot_longer(cols = c(HEIGHT,WEIGHT))
# A tibble: 60 × 3
   USUBJID          name   value
   <chr>            <chr>  <dbl>
 1 STD123456:000001 HEIGHT 165  
 2 STD123456:000001 WEIGHT  88  
 3 STD123456:000002 HEIGHT 179  
 4 STD123456:000002 WEIGHT 100  
 5 STD123456:000003 HEIGHT 182  
 6 STD123456:000003 WEIGHT  81  
 7 STD123456:000004 HEIGHT 166  
 8 STD123456:000004 WEIGHT  91  
 9 STD123456:000005 HEIGHT 169  
10 STD123456:000005 WEIGHT  73.6
# ℹ 50 more rows
# Alternatively
# vs %>% 
#   pivot_longer(cols = HEIGHT:WEIGHT)
# Or else specify columns not to collapse using -
# vs %>% 
#   pivot_longer(cols = -USUBJID)

As we can see, the stacking operation has created two new columns to replace the ones that we stacked:

  • The names column contains the original column names
  • The values column stores the values

We can control the naming of these new columns using names_to and values_to. The argument names pretty much speak for themselves. But here is an example for which we label using CDISC terminology.

# Make my vitals data CDISCy
vs %>% 
  pivot_longer(-USUBJID, names_to = "VSTESTCD", values_to = "VSORRES")
# A tibble: 60 × 3
   USUBJID          VSTESTCD VSORRES
   <chr>            <chr>      <dbl>
 1 STD123456:000001 HEIGHT     165  
 2 STD123456:000001 WEIGHT      88  
 3 STD123456:000002 HEIGHT     179  
 4 STD123456:000002 WEIGHT     100  
 5 STD123456:000003 HEIGHT     182  
 6 STD123456:000003 WEIGHT      81  
 7 STD123456:000004 HEIGHT     166  
 8 STD123456:000004 WEIGHT      91  
 9 STD123456:000005 HEIGHT     169  
10 STD123456:000005 WEIGHT      73.6
# ℹ 50 more rows

10.2 The pivot_wider Function

The pivot_wider function is used when we wish to go from a long format to a wide format. Once again we pass in the data as the first argument. As the converse of the pivot_longer function we specify a names_*from* and a values_*from* argument to identify the column that currently stores the names for our new columns (names_from), and the one that currently stores the values (values_from). The remaining columns are unaffected.

# Count the number of subjects in each treatment arm
big_n <- dm %>%
  group_by(ARM) %>%
  summarise(N = n())
big_n
# A tibble: 2 × 2
  ARM            N
  <chr>      <int>
1 Comparator    18
2 GSK           12
# Now transpose such that each treatment is in a separate column
big_n %>%
  pivot_wider(names_from = ARM, values_from = N)
# A tibble: 1 × 2
  Comparator   GSK
       <int> <int>
1         18    12

10.3 EXERCISE

  1. Create a data frame, act_sum, by transposing the individual ACT question scores in the act data frame in order to calculate the ACT Total Score (sum of the questions).
    1. First stack the scores together using pivot_longer().
    2. Then calculate total ACTTOT for each subject and visit combination. The resulting act_sum data frame should look like this:
# A tibble: 165 × 4
# Groups:   USUBJID, VISITNUM [165]
  USUBJID          VISITNUM VISIT                          ACTTOT
  <chr>               <dbl> <chr>                           <dbl>
1 STD123456:000001       10 VISIT 1 (SCREENING)                16
2 STD123456:000001       20 VISIT 2 (RANDOMISATION)            16
3 STD123456:000001       30 VISIT 3 (PHONE CALL 1 WEEK 6)      13
4 STD123456:000001       40 VISIT 4 (WEEK 12)                  15
5 STD123456:000001       50 VISIT 5 (PHONE CALL 2 WEEK 18)     14
6 STD123456:000001       60 VISIT 6 (WEEK 24)                  17
7 STD123456:000002       10 VISIT 1 (SCREENING)                15
# ℹ 158 more rows
  1. As in an earlier exercise, calculate summary statistics (n, Mean, Median, SD, Min and Max) by treatment and visit using the act_full data frame.
    1. Create a data frame, act_stats, by transposing the data such that the rows contain the summary statistics and the values for each treatment appear in a separate column. Hint: first use pivot_longer() to create a stats and values column and then pivot_wider() to create a column for each treatment. The resulting act_stats data frame should look like this:
# A tibble: 42 × 5
# Groups:   VISITNUM [7]
  VISITNUM VISIT                   Stat   Comparator   GSK
     <dbl> <chr>                   <chr>       <dbl> <dbl>
1       10 VISIT 1 (SCREENING)     n           14    10   
2       10 VISIT 1 (SCREENING)     Mean        17.1  16.6 
3       10 VISIT 1 (SCREENING)     Median      17    15.5 
4       10 VISIT 1 (SCREENING)     SD           1.41  2.01
5       10 VISIT 1 (SCREENING)     Min         15    15   
6       10 VISIT 1 (SCREENING)     Max         19    20   
7       20 VISIT 2 (RANDOMISATION) n           18    12   
# ℹ 35 more rows

Extra

  1. Re-do question 2 but format the summary statistics such that the mean and median are given to 1 decimal place, the standard deviation to 2 decimal places and n, min and max are given as integers. HINT: Use the format function
If you notice an issue, have suggestions for improvements, or want to view the source code, you can find it on GitHub.