If you have used R previously, most of what you have seen so far should be relatively familiar. Much of the rest of this course focuses on relatively new structures and concepts, developed predominantly by Hadley Wickham, an R Foundation member and RStudio employee, and his team.
One of the key themes that you will notice is that each function in the tidyverse has a single purpose: it does just one thing and does it well. Further, tidyverse functions process their inputs in a consistent way, so that you always know what to expect on the other side.
The first tidyverse concept that we shall look at is the tibble.
4.1 Tibbles (Special Data Frames)
Tibbles are not part of base R, so to see what they look like we need to load the tibble package.
library(tibble) # NB: loading the dplyr/tidyverse package also loads tibble
To the user, there is very little difference between a tibble and a data frame. The main difference is the way they print. A tibble prints only the first ten rows and only as many columns as fit on the screen. The dimensions are printed at the top, and any columns that do not fit are simply listed by name underneath. Finally, the underlying data type of each column is displayed beneath its name. For numeric data this is either dbl (double) or int (integer), although we rarely have to worry about this distinction in practice.
air_tib <- tibble(airquality)
air_tib
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows
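The amount that a tibble prints can be adjusted when the default is not what we want. The following is a minimal sketch, assuming the tibble package is installed; the n and width arguments of print and the glimpse function come from tibble itself and are not otherwise covered in this section.

```r
library(tibble)

air_tib <- tibble(airquality)

# Show only the first three rows
print(air_tib, n = 3)

# Show every column, however narrow the console is
print(air_tib, n = 3, width = Inf)

# Transposed view: one line per column, with its type and first few values
glimpse(air_tib)
```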
A tibble is actually just an extension to a data frame. As such we will generally use the term “data frame” throughout this course.
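One way to see that a tibble is an extension of a data frame is to inspect its class. This is a small sketch, assuming the tibble package is installed; the object names air_df and air_tib2 are made up for illustration.

```r
library(tibble)

air_tib <- tibble(airquality)

# A tibble carries the data.frame class plus two tibble-specific classes
class(air_tib)
is.data.frame(air_tib)

# Converting in either direction
air_df   <- as.data.frame(air_tib)  # tibble back to a plain data frame
air_tib2 <- as_tibble(airquality)   # plain data frame to a tibble
class(air_df)
```

Because a tibble is still a data frame, functions written for data frames will generally accept one.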
4.2 Creating Data Frames
Typically, we import data from other sources (see next chapter). However, the flexibility of R and the tidyverse allows us to easily generate our own datasets from scratch. This can be particularly useful for simulation.
A data frame is simply a structured collection of vectors where each vector is stored as a ‘column’ (technically, a data frame is a ‘list’ of vectors). To create a data frame we use the tidyverse function tibble. We separate columns by a comma. Each new column is given a name to the left of an equals sign, with the values that it will hold entered to the right of the equals.
my_df <- tibble(SUBJID = rep(1:3, each = 2),
                VISITNUM = rep(1:2, 3))
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
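The statement that a data frame is a list of vectors can be checked directly. The sketch below assumes the tibble package is installed and rebuilds the same my_df; it also spells out the difference between the each and times arguments of rep.

```r
library(tibble)

my_df <- tibble(SUBJID = rep(1:3, each = 2),
                VISITNUM = rep(1:2, times = 3))

is.list(my_df)     # a data frame is a list...
length(my_df)      # ...with one element per column
names(my_df)       # the column names are the list names
my_df[["SUBJID"]]  # each element is an ordinary vector

# each = repeats every value in turn; times = repeats the whole vector
rep(1:3, each = 2)
rep(1:2, times = 3)
```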
The tibble function adds columns sequentially, meaning that we can add a column and then use it to generate another column, all in one function call. For example,
# HEIGHT is assumed to be in cm and WEIGHT in kg; BMI = weight (kg) / height (m)^2
tibble(HEIGHT = c(182, 164),
       WEIGHT = c(74, 67),
       BMI = WEIGHT / (HEIGHT / 100)^2)
# A tibble: 2 × 3
  HEIGHT WEIGHT   BMI
   <dbl>  <dbl> <dbl>
1    182     74  22.3
2    164     67  24.9
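Sequential column creation is what makes tibble convenient for simulation. The following is an illustrative sketch only: the column names BASELINE, CHANGE and FINAL, the number of subjects and the distribution parameters are all made up, and the seed is arbitrary.

```r
library(tibble)

set.seed(123)  # arbitrary seed, only so that the random draws are reproducible

n_subj <- 5    # hypothetical number of subjects

sim_df <- tibble(SUBJID   = 1:n_subj,
                 BASELINE = rnorm(n_subj, mean = 100, sd = 10),
                 CHANGE   = rnorm(n_subj, mean = -5, sd = 2),
                 FINAL    = BASELINE + CHANGE)  # built from the two columns above
sim_df
```

The same call made with the base R function data.frame would fail, because data.frame cannot see the BASELINE and CHANGE columns while it is still being constructed.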
4.3 The crossing Function
The crossing function lives in the tidyr package (which is loaded automatically by library(tidyverse)). It generates a data frame containing every combination of the values that we provide. This is particularly useful for simulation. For example, we could use the function to create visits for each subject in a study:
library(tidyr)
my_df <- crossing(SUBJID = 1:3, VISITNUM = 1:2)
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
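The number of rows returned by crossing is the product of the number of distinct values in each input. The sketch below assumes the tidyr package is installed; the PERIOD column and the single-column X examples are hypothetical and serve only to show the behaviour.

```r
library(tidyr)

# Three inputs: 3 subjects x 2 visits x 2 periods = 12 rows
visits <- crossing(SUBJID = 1:3, VISITNUM = 1:2, PERIOD = c("A", "B"))
nrow(visits)

# crossing() sorts and de-duplicates its inputs
crossing(X = c(2, 1, 1))

# expand_grid() is the tidyr variant that keeps the inputs exactly as given
expand_grid(X = c(2, 1, 1))
```

If repeated values or a particular ordering matter for a simulation, expand_grid may be the better fit; otherwise crossing gives a tidy, sorted design.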
4.4 Extracting Columns from Data Frames
The ‘modern’ way to extract a column from a data frame is to use the pull function (from the dplyr package). In its simplest form, pull takes two arguments: the data frame, and the name of the column that we wish to extract. The extracted column is returned as a vector.
library(dplyr)
my_df
# A tibble: 6 × 2
  SUBJID VISITNUM
   <int>    <int>
1      1        1
2      1        2
3      2        1
4      2        2
5      3        1
6      3        2
# Extract the SUBJID column
pull(my_df, SUBJID)
[1] 1 1 2 2 3 3
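For readers who know the base R extraction operators, the following sketch compares them with pull. It assumes the dplyr and tidyr packages are installed and rebuilds my_df so that it runs on its own; the subj_* object names are made up for illustration.

```r
library(dplyr)
library(tidyr)

my_df <- crossing(SUBJID = 1:3, VISITNUM = 1:2)

subj_pull    <- pull(my_df, SUBJID)
subj_dollar  <- my_df$SUBJID
subj_bracket <- my_df[["SUBJID"]]

# All three approaches return the same vector
identical(subj_pull, subj_dollar)
identical(subj_pull, subj_bracket)

# Single square brackets are different: on a tibble they return a one-column tibble
my_df["SUBJID"]
```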
We will look at subsetting and other operations later in the course.
4.5 EXERCISE
- Create a data frame, my_df, containing subject numbers 1 to 20, some random ages between 18 and 65, and a trial status of “Ongoing” or “Completed”
- Create a data frame containing all possible combinations of COUNTRY (“UK”, “USA”, “FRA”), STATUS (“Ongoing”, “Completed”) and TRT (“GSK”, “OTHER”)