4  Welcome to the tidyverse!

If you have used R before, most of what you have seen so far should be familiar. Much of the rest of this course focuses on newer structures and concepts, developed predominantly by Hadley Wickham, an R Foundation member and RStudio employee, and his team.

One of the key themes that you will notice is that functions in the tidyverse have a single purpose! Functions do just one thing and do it well. Further, they process the inputs in a consistent way so that you always know what to expect on the other side.

The first tidyverse concept that we shall look at is a tibble.

4.1 Tibbles (Special Data Frames)

Tibbles are not part of base R and so to see what they look like we need to load the tibble package.

library(tibble)  # NB: loading the dplyr/tidyverse package also loads tibble

To the user, there is very little difference between a tibble and a data frame. The main difference is the way they print. A tibble prints only the first few rows and only as many columns as fit on the screen. The dimensions are shown at the top, and the names of any columns that do not fit are simply listed beneath the output. The underlying data type of each column is also displayed below its name. For numeric data this is dbl (double) or int (integer), although we rarely have to worry about this distinction in practice.

air_tib <- as_tibble(airquality)  # convert an existing data frame to a tibble
air_tib
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows

A tibble is actually just an extension to a data frame. As such we will generally use the term “data frame” throughout this course.
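
One way to check this relationship is to inspect the class of a tibble. The following is a minimal sketch, assuming the tibble package is installed; the object name air_tib simply follows the example above.

```r
library(tibble)

# Convert the built-in airquality data frame to a tibble
air_tib <- as_tibble(airquality)

# A tibble carries the data.frame class alongside its own classes
class(air_tib)
# [1] "tbl_df"     "tbl"        "data.frame"

# Base R therefore still treats it as a data frame
is.data.frame(air_tib)
# [1] TRUE
```

Because "data.frame" is among the classes, functions written for data frames will generally accept a tibble as well.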

4.2 Creating Data Frames

Typically, we import data from other sources (see next chapter). However, the flexibility of R and the tidyverse allows us to easily generate our own datasets from scratch. This can be particularly useful for simulation.

A data frame is simply a structured collection of vectors where each vector is stored as a ‘column’ (technically, a data frame is a ‘list’ of vectors). To create a data frame we use the tidyverse function tibble. We separate columns by a comma. Each new column is given a name to the left of an equals sign, with the values that it will hold entered to the right of the equals.

my_df <- tibble(SUBJID = rep(1:3, each = 2),
               VISITNUM = rep(1:2, 3))
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
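
The description of a data frame as a list of vectors can be verified directly. The sketch below rebuilds the same my_df so that it runs on its own, and uses only base R functions to inspect it.

```r
library(tibble)

my_df <- tibble(SUBJID = rep(1:3, each = 2),
                VISITNUM = rep(1:2, 3))

# A data frame is a list with one element per column
is.list(my_df)
# [1] TRUE
length(my_df)   # the number of columns, not rows
# [1] 2
names(my_df)
# [1] "SUBJID"   "VISITNUM"

# Each element is an ordinary vector
my_df[["VISITNUM"]]
# [1] 1 2 1 2 1 2
```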

The tibble function adds columns sequentially, meaning that we can add a column and then use it to generate another column, all in one function call. For example,

tibble(HEIGHT = c(182, 164),             # height in cm
       WEIGHT = c(74, 67),               # weight in kg
       BMI = WEIGHT / (HEIGHT / 100)^2)  # BMI = weight (kg) / height (m)^2
# A tibble: 2 × 3
  HEIGHT WEIGHT   BMI
   <dbl>  <dbl> <dbl>
1    182     74  22.3
2    164     67  24.9
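
To see why sequential evaluation matters, it can help to compare tibble with the base R function data.frame. The following is a small sketch, assuming a fresh R session in which no object called HEIGHT exists; the column name HEIGHT_M is made up for illustration.

```r
library(tibble)

# tibble() evaluates columns in order, so HEIGHT_M can refer to HEIGHT
tib <- tibble(HEIGHT = c(182, 164),
              HEIGHT_M = HEIGHT / 100)
tib$HEIGHT_M
# [1] 1.82 1.64

# data.frame() evaluates every argument before the data frame exists,
# so the same call fails unless an object called HEIGHT already exists
res <- try(data.frame(HEIGHT = c(182, 164),
                      HEIGHT_M = HEIGHT / 100),
           silent = TRUE)
inherits(res, "try-error")
# [1] TRUE
```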

4.3 The crossing Function

The crossing function lives in the tidyr package (which is loaded automatically by library(tidyverse)). It generates a data frame containing every combination of the values that we provide. This is particularly useful for simulation. For example, we could use the function to create visits for each subject in a study:

library(tidyr) 
my_df <- crossing(SUBJID = 1:3, VISITNUM = 1:2)
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
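
The number of rows returned is the product of the number of distinct values supplied for each variable. The sketch below illustrates this and also shows that crossing sorts and de-duplicates its inputs, in contrast to the related tidyr function expand_grid; the duplicated subject number is a made-up input for illustration.

```r
library(tidyr)

# Three subjects by two visits gives 3 * 2 = 6 rows
visits <- crossing(SUBJID = 1:3, VISITNUM = 1:2)
nrow(visits)
# [1] 6

# crossing() sorts and de-duplicates its inputs ...
nrow(crossing(SUBJID = c(2, 1, 1), VISITNUM = 1:2))
# [1] 4

# ... whereas expand_grid() keeps them exactly as supplied
nrow(expand_grid(SUBJID = c(2, 1, 1), VISITNUM = 1:2))
# [1] 6
```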

4.4 Extracting Columns from Data Frames

The ‘modern’ way to extract a column from a data frame is to use the pull function (from the dplyr package). In its simplest form, pull takes two arguments: the data frame and the name of the column that we wish to extract. The extracted column is returned as a vector.

library(dplyr)
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
# Extract the SUBJID column
pull(my_df, SUBJID)
[1] 1 1 2 2 3 3
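
For comparison, the sketch below checks that pull returns the same vector as the base R extraction operators $ and [[. It rebuilds my_df with tibble so that it runs on its own; the name subj is chosen only for illustration.

```r
library(tibble)
library(dplyr)

my_df <- tibble(SUBJID = rep(1:3, each = 2),
                VISITNUM = rep(1:2, 3))

subj <- pull(my_df, SUBJID)

# The base R extraction operators return the same vector
identical(subj, my_df$SUBJID)
# [1] TRUE
identical(subj, my_df[["SUBJID"]])
# [1] TRUE

# Single-bracket indexing keeps the result as a one-column tibble
class(my_df["SUBJID"])
# [1] "tbl_df"     "tbl"        "data.frame"
```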

We will look at subsetting and other operations later in the course.

4.5 EXERCISE

  1. Create a data frame, my_df, containing subject numbers 1 to 20, random ages between 18 and 65, and a trial status of “Ongoing” or “Completed”.
  2. Create a data frame containing all possible combinations of COUNTRY (“UK”, “USA”, “FRA”), STATUS (“Ongoing”, “Completed”) and TRT (“GSK”, “OTHER”).