Another common task when working with data is the transposition from/to long/wide formats. In the world of CDISC everything is in a long format. However we often need our data in a wide format in order to model or plot it. In the tidyverse, the tidyr package is required for restructuring our data. The tidyr package has many functions but we will focus on just two, pivot_longer and pivot_wider.
10.1 The pivot_longer Function
The pivot_longer function is use when we wish to go from a wide format to a long format. Once again the first argument is the data, which we will pipe in for the examples. We usually need to think long and hard about which variables we want to collapse, and which need to stay as they are…
[THINKING]
Once we’re done thinking we must specify which variables we wish to stack together. Selecting variables can be done using the c function to name specif columns, or by using the same tidy_select group of functions (help(select_helpers)) that are available to us when using the select function (e.g. starts_with, contains, etc.). We can also specify the variables that we don’t want to stack together using a minus sign.
# Alternatively# vs %>% # pivot_longer(cols = HEIGHT:WEIGHT)# Or else specify columns not to collapse using -# vs %>% # pivot_longer(cols = -USUBJID)
As we can see, the stacking operation has created two new columns to replace the ones that we stacked:
The names column contains the original column names
The values column stores the values
We can control the naming of these new columns using names_to and values_to. The argument names pretty much speak for themselves. But here is an example for which we label using CDISC terminology.
# Make my vitals data CDISCyvs %>%pivot_longer(-USUBJID, names_to ="VSTESTCD", values_to ="VSORRES")
The pivot_wider function is used when we wish to go from a long format to a wide format. Once again we pass in the data as the first argument. As the converse of the pivot_longer function we specify a names_*from* and a values_*from* argument to identify the column that currently stores the names for our new columns (names_from), and the one that currently stores the values (values_from). The remaining columns are unaffected.
# Count the number of subjects in each treatment armbig_n <- dm %>%group_by(ARM) %>%summarise(N =n())big_n
# A tibble: 2 × 2
ARM N
<chr> <int>
1 Comparator 18
2 GSK 12
# Now transpose such that each treatment is in a separate columnbig_n %>%pivot_wider(names_from = ARM, values_from = N)
Create a data frame, act_sum, by transposing the individual ACT question scores in the act data frame in order to calculate the ACT Total Score (sum of the questions).
First stack the scores together using pivot_longer().
Then calculate total ACTTOT for each subject and visit combination. The resulting act_sum data frame should look like this:
As in an earlier exercise, calculate summary statistics (n, Mean, Median, SD, Min and Max) by treatment and visit using the act_full data frame.
Create a data frame, act_stats, by transposing the data such that the rows contain the summary statistics and the values for each treatment appear in a separate column. Hint: first use pivot_longer() to create a stats and values column and then pivot_wider() to create a column for each treatment. The resulting act_stats data frame should look like this:
# A tibble: 42 × 5
# Groups: VISITNUM [7]
VISITNUM VISIT Stat Comparator GSK
<dbl> <chr> <chr> <dbl> <dbl>
1 10 VISIT 1 (SCREENING) n 14 10
2 10 VISIT 1 (SCREENING) Mean 17.1 16.6
3 10 VISIT 1 (SCREENING) Median 17 15.5
4 10 VISIT 1 (SCREENING) SD 1.41 2.01
5 10 VISIT 1 (SCREENING) Min 15 15
6 10 VISIT 1 (SCREENING) Max 19 20
7 20 VISIT 2 (RANDOMISATION) n 18 12
# ℹ 35 more rows
Extra
Re-do question 2 but format the summary statistics such that the mean and median are given to 1 decimal place, the standard deviation to 2 decimal places and n, min and max are given as integers. HINT: Use the format function
If you notice an issue, have suggestions for improvements, or want to view the source code, you can find it on GitHub.