3 Data Types and Vectors – Introduction to R and the tidyverse

3.1 Data Types

R has four basic data types (modes) from which almost everything else is built. These are:

numeric
character
logical
complex (which we will ignore in this course)

# Numeric
3 + 7

[1] 10

6 - 14

[1] -8

5*9

[1] 45

100/3

[1] 33.33333

# Character
"Hello! I work at GSK"

[1] "Hello! I work at GSK"

# Logical
6 > 7

[1] FALSE

is.numeric(12)

[1] TRUE

Note that the logical values TRUE and FALSE are reserved words, meaning that we cannot overwrite (mask) them. In the examples above the logicals were produced by asking a question. It is also possible to create them directly (x <- TRUE). When doing so, they must be written exactly using capital letters and without quotation marks. We will look at the usage of logicals more closely later on in the course.

The terms used above are formal definitions in R. The mode function lets us query the type of any element.

mode(5)                 # numeric

[1] "numeric"

mode("have a guess")    # character

[1] "character"
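The same query works for the remaining modes. The short sketch below also shows why quotation marks matter for logicals; the object name flag is hypothetical and used only for illustration.

```r
# Logical and complex values have their own modes
mode(TRUE)        # "logical"
mode(2 + 3i)      # "complex"

# With quotation marks the value is text, not a logical
mode("TRUE")      # "character"

# Creating a logical directly and checking it
flag <- TRUE      # 'flag' is a hypothetical name
is.logical(flag)  # TRUE
```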

3.2 Combining Elements to Make a Vector

A vector is a collection of elements/values that have the same type or “mode”. In R, vectors technically have no dimension but we will usually treat them as columns of data in a dataset (data frame).

We can easily create vectors using the c function to “combine” one or more elements together.

# Numeric vectors
c(5, 42, 99)

[1]  5 42 99

c(6.423, 7.324, 9.225)

[1] 6.423 7.324 9.225

# Character vectors
c("I", "have", "four", "elements")

[1] "I"        "have"     "four"     "elements"
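Because every element of a vector must share one mode, R converts (coerces) the elements to a common mode when we mix types inside c. A minimal sketch of this behaviour, with object names chosen only for illustration:

```r
# Mixing numbers, text and logicals: everything becomes character
mixed_chr <- c(1, "a", TRUE)
mixed_chr         # "1" "a" "TRUE"
mode(mixed_chr)   # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed_num <- c(10, TRUE, FALSE)
mixed_num         # 10 1 0
mode(mixed_num)   # "numeric"
```

R performs this conversion without a warning, so it is worth checking the mode of a vector when its contents look unexpected.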

We can also combine vectors using c.

# Create a couple of vectors
vec1 <- c(5,9,2)
vec1

[1] 5 9 2

vec2 <- c(7,8,1)
vec2

[1] 7 8 1

# Combine the vectors
vec3 <- c(vec1, vec2)
vec3

[1] 5 9 2 7 8 1

NB: A scalar (single value) is technically just a vector with 1 element.
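We can check this for ourselves with length and is.vector. The object name single below is just an illustration.

```r
single <- 42
length(single)             # 1
is.vector(single)          # TRUE
identical(single, c(42))   # TRUE: wrapping a single value in c() changes nothing

length(c(5, 9, 2))         # 3
```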

3.3 Sequences

In addition to the c function, R contains a number of useful operators and utility functions for creating vectors. First we look at the seq function for generating sequences.

The seq function takes the self-explanatory arguments “from” and “to”. By default, the increment is 1. We can optionally provide either “by” (the increment) or “length.out” (the number of values to generate, which R also accepts abbreviated to “length”) to change this.

# Simple sequence
seq(from = 1, to = 10)

 [1]  1  2  3  4  5  6  7  8  9 10

# The same, but without specifying the arguments:
seq(1, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

# Counting down
seq(80, 60, by = -5)

[1] 80 75 70 65 60

# Using 'by' and 'length'
seq(from = 1, to = 10, by = 0.5)

 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0

seq(from = 10, to = 50, length = 5)

[1] 10 20 30 40 50
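When we ask for a number of values rather than an increment, seq works out the increment for us as (to - from) / (number of values - 1). The sketch below illustrates this, and also shows that the abbreviated argument name length gives the same result as the full name length.out.

```r
# Five evenly spaced values from 0 to 1; the increment is (1 - 0) / (5 - 1) = 0.25
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# 'length' is matched to 'length.out' by partial argument matching
identical(seq(0, 1, length = 5), seq(0, 1, length.out = 5))   # TRUE
```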

If we simply need a sequence of integers then the : operator provides a simple shortcut.

1:50

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

10:1

 [1] 10  9  8  7  6  5  4  3  2  1
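One thing to watch out for is that : is evaluated before arithmetic operators such as + and -. A small sketch, using a hypothetical value n, shows why brackets are sometimes needed:

```r
n <- 5

# Evaluated as (1:n) - 1, which is probably not what was intended
1:n - 1      # 0 1 2 3 4

# Brackets give the sequence from 1 to n - 1
1:(n - 1)    # 1 2 3 4
```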

3.3.1 Repeating Values

The rep function allows us to repeat one or more values. The arguments times and each give us some control over how the repetition is performed.

# Single values
rep(1, times = 6)

[1] 1 1 1 1 1 1

rep("z", times = 3)

[1] "z" "z" "z"

# Vectors
vec1  # Reminder of what this is

[1] 5 9 2

rep(vec1, times = 4)

 [1] 5 9 2 5 9 2 5 9 2 5 9 2

rep(vec1, each = 4)

 [1] 5 5 5 5 9 9 9 9 2 2 2 2

rep(vec1, times = c(5, 4, 6))

 [1] 5 5 5 5 5 9 9 9 9 2 2 2 2 2 2

# More advanced example
n_subj <- 5
n_visits <- 4
SUBJID <- rep(1:n_subj, each = n_visits)
SUBJID

 [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5

VISITNUM <- rep(1:n_visits, n_subj)
VISITNUM

 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
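The each and times arguments can also be used together, in which case each is applied first and the result is then repeated times times. The sketch below uses made-up values "A" and "B", and then checks that the subject vector from the example above has one entry per subject per visit.

```r
# 'each' is applied first, then the whole result is repeated 'times' times
rep(c("A", "B"), each = 2, times = 3)
# "A" "A" "B" "B" "A" "A" "B" "B" "A" "A" "B" "B"

# One entry per subject per visit
n_subj <- 5
n_visits <- 4
SUBJID <- rep(1:n_subj, each = n_visits)
length(SUBJID) == n_subj * n_visits   # TRUE
```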

3.3.2 Missing data

In R, missing data is represented by NA. The term NA is a reserved word in R. Like the reserved logical terms TRUE and FALSE, NA must be written exactly as it is written here, i.e. in capitals and without quotation marks, regardless of whether the data in question is numeric, character, or logical.

# Missing numeric
mis_num <- c(1,2,NA,4)
mis_num

[1]  1  2 NA  4

# Missing character
mis_let <- c("G", NA, "K")
mis_let

[1] "G" NA  "K"

# Missing logical
mis_log <- mis_num > 5
mis_log

[1] FALSE FALSE    NA FALSE
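To find out where the missing values are we use the is.na function, because comparing against NA with == simply returns NA. Many summary functions also accept an na.rm argument to drop missing values before calculating. A short sketch, redefining mis_num so that it runs on its own:

```r
mis_num <- c(1, 2, NA, 4)

# Locate and count the missing values
is.na(mis_num)                 # FALSE FALSE  TRUE FALSE
sum(is.na(mis_num))            # 1

# Comparing with == does not work: the result is itself missing
mis_num == NA                  # NA NA NA NA

# Summaries return NA unless missing values are removed first
mean(mis_num)                  # NA
mean(mis_num, na.rm = TRUE)    # 2.333333
```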

3.4 EXERCISE

Load the tidyverse package
Create a vector that stores the values 4, 9, 2, 1, 6, and 3
1. Add 1 to every element in your vector (easier than you may think)
2. Add a missing value to the end (i.e. the 7th position)
3. Add 1 to every element again and observe what happens with the missing value
Create a vector containing the numbers 1 to 20, incremented by 1
Create a sequence of 3 numbers between 18 and 65
Repeat the word “GSK” 50 times

Extra

Create a vector with subject numbers 1, 2 and 3 where 1 is repeated 6 times, 2 is repeated 4 times, and 3 is repeated 5 times.

3.5 Simulating Data

In order to populate vectors and data frames (see next chapter) we can also simulate data. This is not a simulation course but we will briefly look at some of the easy ways in which we can simulate data, either via known distributions or by sampling from existing data.

3.5.1 Distributions

R has a number of statistical distributions “built in”, with a huge number more available through add-on packages. The built-in distributions all follow a consistent naming convention and structure. First of all, each distribution has been given a short form. For example, the normal distribution has the short form “norm”. A handful of common short forms are listed below:

Short Form	Distribution
norm	Normal
pois	Poisson
binom	Binomial
f	F
t	Student’s t
unif	Uniform
weibull	Weibull
gamma	Gamma
chisq	Chi-squared

See help("Distributions") for more information on available distributions.

To each of the short forms we may apply one of four prefixes: d, p, q and r. When placed in front of a short-form distribution name, the prefixes respectively provide:

d - the probability density function, PDF, for given values
p - the cumulative distribution function, CDF, for given quantiles
q - the quantile function (the inverse of the CDF) for given probabilities
r - randomly generated numbers from the distribution for a given n
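As a quick illustration of the naming convention, the sketch below applies each prefix to the standard normal distribution, and one prefix to the binomial distribution. The input values are arbitrary choices.

```r
# Standard normal distribution (mean 0, standard deviation 1)
dnorm(0)         # density at 0: 0.3989423
pnorm(1.96)      # P(X <= 1.96): 0.9750021
qnorm(0.975)     # value below which 97.5% of the distribution lies: 1.959964
rnorm(3)         # three random draws (different every time)

# q is the inverse of p
qnorm(pnorm(1.5))                   # 1.5

# The same convention applies to other distributions,
# e.g. the probability of 2 successes in 4 trials with success probability 0.5
dbinom(2, size = 4, prob = 0.5)     # 0.375
```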

For example, if we wanted to know the PDF of a normal distribution between the values of -3 and 3, we would generate a sequence of values and then use the dnorm function to find the density at each one.

# Values of interest
x <- seq(-3, 3, by = 0.01)
PDF <- dnorm(x)

# Now let's plot using ggplot2 (loaded as part of the tidyverse)
library(ggplot2)
ggplot() +
  geom_line(aes(x = x, y = PDF))

3.5.2 Sampling From a Distribution

From a simulation perspective we are normally interested in sampling values from a known distribution. This is very straightforward. For example, if we want 200 numbers randomly sampled from a standard normal distribution we would write rnorm(200), which we might then plot as a histogram.

# Sample 200 values from a standard normal distribution
x <- rnorm(200)

# Now let's plot using ggplot2
ggplot() +
  geom_histogram(aes(x), bins = 20,
                 colour = "grey", fill = "lightblue")
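Random samples differ each time the code is run. If we need the same values again, for example to reproduce a result, we can fix the random number seed with set.seed beforehand. The sketch below also changes the mean and standard deviation of the distribution we sample from. The seed 123 and the parameter values are arbitrary choices.

```r
# Fix the seed, then draw 5 values from a normal distribution
# with mean 10 and standard deviation 2
set.seed(123)
first_draw <- rnorm(5, mean = 10, sd = 2)

# Resetting the same seed reproduces exactly the same values
set.seed(123)
second_draw <- rnorm(5, mean = 10, sd = 2)

identical(first_draw, second_draw)   # TRUE
```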

3.5.3 Sampling from a Set of Values

As well as sampling from distributions, we may also sample from a discrete set of values using the sample function. The sample function requires us to provide a vector to sample from and the number of samples to take. It has additional options for whether we wish to sample with replacement (replace) and whether the values should have unequal chances of being sampled (prob); by default every value is equally likely.

Here are a few examples:

# Define the set of treatments to sample from
treatment <- c("GSK", "Other")

# Sample values from this set, with replacement
sample(treatment, 50, replace = TRUE)

 [1] "GSK"   "GSK"   "Other" "Other" "GSK"   "Other" "Other" "Other" "Other"
[10] "GSK"   "Other" "GSK"   "GSK"   "GSK"   "GSK"   "Other" "GSK"   "GSK"  
[19] "Other" "Other" "Other" "Other" "GSK"   "Other" "Other" "Other" "GSK"  
[28] "GSK"   "GSK"   "GSK"   "GSK"   "GSK"   "GSK"   "Other" "GSK"   "Other"
[37] "GSK"   "GSK"   "Other" "Other" "GSK"   "Other" "GSK"   "GSK"   "GSK"  
[46] "GSK"   "Other" "GSK"   "GSK"   "Other"

# Sample again but this time we want more 'Other' in our population,
# roughly 4 'Other' for every 'GSK' entry
sample(treatment, 50, replace = TRUE, prob = c(1, 4))

 [1] "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other"
[10] "Other" "Other" "GSK"   "Other" "Other" "Other" "Other" "GSK"   "GSK"  
[19] "Other" "GSK"   "Other" "Other" "Other" "Other" "Other" "Other" "Other"
[28] "GSK"   "Other" "Other" "Other" "Other" "Other" "GSK"   "Other" "Other"
[37] "Other" "GSK"   "Other" "Other" "GSK"   "Other" "Other" "Other" "GSK"  
[46] "GSK"   "Other" "Other" "GSK"   "Other"
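To check that the weights behave as expected we can tabulate a larger sample. The values given to prob are relative weights and need not sum to 1, so c(1, 4) corresponds to probabilities of 0.2 and 0.8. The sketch below uses an arbitrary seed and sample size; the exact proportions will vary from run to run without a seed.

```r
treatment <- c("GSK", "Other")

set.seed(2024)   # arbitrary seed so the result can be reproduced
draws <- sample(treatment, 1000, replace = TRUE, prob = c(1, 4))

# Proportion of each value; we expect roughly 0.2 and 0.8
table(draws) / length(draws)

# Without replacement, sampling a whole vector returns it in a random order
shuffled <- sample(1:10)
shuffled
```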

3.6 An Introduction to Data Frames

There are actually several structures that can be used to store data in R. The most common structure, and the one we will work with on this course, is called a data.frame. Like a SAS dataset, a data frame is a collection of columns (vectors), each of which may have a different type (mode).
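As a minimal sketch of this idea, we can build a small data frame from vectors ourselves using the data.frame function. The object name visits and the SITE column are hypothetical and used only for illustration.

```r
# Each argument becomes a column; all columns must have the same length
visits <- data.frame(
  SUBJID   = rep(1:2, each = 3),
  VISITNUM = rep(1:3, times = 2),
  SITE     = rep("A", 6),          # hypothetical character column
  stringsAsFactors = FALSE         # keep text as character in older versions of R
)
visits

nrow(visits)           # 6
ncol(visits)           # 3

# Each column keeps its own mode
sapply(visits, mode)   # SUBJID and VISITNUM are "numeric", SITE is "character"
```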

There are several built-in data frames, contained within the “datasets” package:

objects("package:datasets")

We can use the head and tail functions to view the first or last few rows of a data frame.

# In-built airquality data
head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

tail(airquality)

    Ozone Solar.R Wind Temp Month Day
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

# Specifying a custom number of rows to display
head(airquality, 1)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1