5 Importing/Exporting Data – Introduction to R and the tidyverse

Natively, R has very good support for many file formats. For example, functions such as read.csv or the more generic read.table can be used to read in CSV files and other delimited text files. The foreign package can also be used to read in SAS transport files.

In this section we look at tidyverse functions for importing/exporting data. Although the tidyverse functions don’t always offer a great deal more in terms of functionality, they are generally faster and more consistent than their base R counterparts.

5.1 Importing Text Files

The tidyverse functions that we will use for importing and exporting text files are contained in the readr package. Note that readr is loaded by default when loading tidyverse.

library(readr)

# Alternatively
library(tidyverse)

The readr package has many functions, but the import functions all begin with read_ and the export functions all begin with write_.
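
The naming convention can be checked from the console. The following is a minimal sketch that lists the functions readr exports and filters them by prefix; the object names exported, readers and writers are ours, chosen for illustration.

```r
library(readr)

# List everything readr exports, then filter by prefix
exported <- ls("package:readr")
readers  <- grep("^read_",  exported, value = TRUE)
writers  <- grep("^write_", exported, value = TRUE)

# Show the first few of each
head(readers)
head(writers)
```

The exact lists depend on the installed version of readr, so no output is shown here.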

5.2 Reading in a CSV

The read_csv function enables us to read in a CSV file. It is a special case of the read_delim function, which allows us to read in text files with a variety of different delimiters and structures. As always, we must assign the imported data to a name, otherwise it is simply printed to the screen. The data is imported as a tibble (a data frame with class tbl_df).

# Read in and save as `theoph`
# The "." represents the current working directory so this is a relative path
theoph <- read_csv("./data/theoph.csv")

# Now print
theoph

# A tibble: 132 × 5
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02  0     0.74
 2      1  79.6  4.02  0.25  2.84
 3      1  79.6  4.02  0.57  6.57
 4      1  79.6  4.02  1.12 10.5 
 5      1  79.6  4.02  2.02  9.66
 6      1  79.6  4.02  3.82  8.58
 7      1  79.6  4.02  5.1   8.36
 8      1  79.6  4.02  7.03  7.47
 9      1  79.6  4.02  9.05  6.89
10      1  79.6  4.02 12.1   5.94
# ℹ 122 more rows
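
To illustrate that read_csv is read_delim with the delimiter fixed to a comma, here is a small self-contained sketch. It assumes readr 2.0 or later (for the show_col_types argument) and writes a temporary file containing a few values copied from the output above, so it does not depend on the course data. The final call shows one way to declare column types up front rather than letting readr guess them.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("SUBJID,TIME,CONC", "1,0,0.74", "1,0.25,2.84"), tmp)

# read_csv() is read_delim() with the delimiter fixed to a comma
via_csv   <- read_csv(tmp, show_col_types = FALSE)
via_delim <- read_delim(tmp, delim = ",", show_col_types = FALSE)

# Compare the two imports column by column
same_columns <- all(mapply(identical, via_csv, via_delim))
same_columns

# Optionally state the column types instead of letting readr guess them
typed <- read_csv(
  tmp,
  col_types = cols(SUBJID = col_integer(), .default = col_double())
)
class(typed$SUBJID)
```

Declaring col_types is worth considering for production scripts, since a guessed type can change when the data changes.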

5.2.1 File Paths

Note that we specified the file path using forward slashes, i.e. "/". The backslash, "\", is an escape character in R strings: it gives special meaning to the character that follows it. For example, "\n" means ‘new line’, "\t" means ‘tab’ and, confusingly, "\\" means a single ‘backslash’! When copying a path from Windows, you must therefore either replace all of the backslashes with forward slashes or add a second backslash at each location.
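
The sketch below shows the options side by side using a hypothetical Windows path (the user name "me" is made up). It also uses a raw string, a base R feature available from R 4.0.0, as a third way of writing a path without doubling the backslashes.

```r
# Three ways to write the same (hypothetical) Windows path
p_forward <- "C:/Users/me/data/theoph.csv"
p_escaped <- "C:\\Users\\me\\data\\theoph.csv"
p_raw     <- r"(C:\Users\me\data\theoph.csv)"  # raw string, needs R >= 4.0.0

# The doubled backslashes are stored as single characters
writeLines(p_escaped)
identical(p_escaped, p_raw)

# Convert backslashes to forward slashes; fixed = TRUE switches off regular
# expression handling, so the pattern is one literal backslash
p_converted <- gsub("\\", "/", p_escaped, fixed = TRUE)
identical(p_converted, p_forward)
```

None of these paths needs to exist for the code to run; only the strings themselves are compared.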

5.3 Reading in Data from SAS

We can read both the “.XPT” (a.k.a. “SAS transport file”) and “.SAS7BDAT” formats into R by making use of the haven package. The SAS transport file (version 5) is an open format and can be read in by making use of the haven read_xpt function. The “.SAS7BDAT” files can be read in by making use of the haven read_sas function.

haven keeps SAS variable labels as a "label" attribute on each column. RStudio understands these labels and displays them in its data viewer, which is arguably better than the one in PC SAS!

There are other packages, e.g. foreign and SASxport, which have similar functionality when it comes to importing SAS transport files. However, we use haven for reasons of consistency and efficiency.

# Load the package
library(haven)

# Read in the data (remembering file extension)
dm <- read_sas("./data/dm.sas7bdat")

# View the data
dm    # or try View(dm)

# A tibble: 30 × 7
   USUBJID            AGE SEX   COUNTRY RACE                      ETHNIC   ARM  
   <chr>            <dbl> <chr> <chr>   <chr>                     <chr>    <chr>
 1 STD123456:000001    32 F     UK      BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 2 STD123456:000002    28 M     FRA     WHITE                     NOT HIS… Comp…
 3 STD123456:000003    55 M     USA     BLACK OR AFRICAN AMERICAN NOT HIS… Comp…
 4 STD123456:000004    35 F     GER     WHITE                     HISPANI… Comp…
 5 STD123456:000005    30 F     IRE     WHITE                     NOT HIS… Comp…
 6 STD123456:000006    22 F     GER     WHITE                     NOT HIS… Comp…
 7 STD123456:000007    59 F     USA     WHITE                     NOT HIS… Comp…
 8 STD123456:000008    53 M     GER     WHITE                     NOT HIS… GSK  
 9 STD123456:000009    60 F     USA     WHITE                     NOT HIS… GSK  
10 STD123456:000010    48 M     USA     WHITE                     NOT HIS… Comp…
# ℹ 20 more rows
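
Reading a transport file works in the same way. The following is a minimal sketch rather than part of the course data: it builds a small made-up dataset, writes it to a temporary “.XPT” file with write_xpt and reads it back with read_xpt, so that it runs without any SAS files to hand. The dataset dm_demo, the label text and the member name "DM" are assumptions made for illustration.

```r
library(haven)

# A small made-up dataset with a variable label attached
dm_demo <- data.frame(
  USUBJID = c("STD123456:000001", "STD123456:000002"),
  AGE     = c(32, 28)
)
attr(dm_demo$AGE, "label") <- "Age (years)"

# Write a version 5 transport file to a temporary location, then read it back
tmp <- tempfile(fileext = ".xpt")
write_xpt(dm_demo, tmp, version = 5, name = "DM")
dm_xpt <- read_xpt(tmp)

# The data comes back as a tibble and the variable label survives the trip
dm_xpt
attr(dm_xpt$AGE, "label")
```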

5.4 The Working Directory

All R sessions have a working directory. On Windows this is usually something like "C:/Users/[mudid]/Documents". This is the directory that R looks in by default when we attempt to import data. Similarly, it is where R writes to by default, and where R looks by default when sourcing other R scripts. We can find out what our working directory is via the getwd function and list the files within it using the list.files function.

# What is my current working directory?
getwd()

[1] "/home/runner/work/intro_to_r_and_the_tidyverse_training/intro_to_r_and_the_tidyverse_training/data"

# What files are in the working directory?
list.files()

 [1] "act.sas7bdat"     "act.xpt"          "actFull.sas7bdat" "actFull.xpt"     
 [5] "actLong.sas7bdat" "actLong.xpt"      "dataset.sas7bdat" "dm.sas7bdat"     
 [9] "dm.xpt"           "pft.sas7bdat"     "pft.xpt"          "sl.sas7bdat"     
[13] "sl.xpt"           "theoph.csv"       "vs.sas7bdat"      "vs.xpt"

We can change/set the working directory using the setwd function. The advantage of setting up a working directory is that we needn’t specify full file paths every time we import/export data. This also makes our code more transferable as our username isn’t hard-coded into our scripts!

# Set my working directory to where some data are stored
setwd("/mnt/code/gsk_R_training/data")

In the example below we import data that is located in our current working directory. We can therefore simply specify the name of the file (including the extension) and ignore the path.

dm <- read_sas("dm.sas7bdat")
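
The effect of the working directory can be tried out safely with a temporary directory. This is a minimal sketch using base R only; the directory and the file name tiny.csv are made up for illustration. It relies on the fact that setwd invisibly returns the previous working directory, which makes it easy to restore afterwards.

```r
# Create a throwaway directory containing one small file
demo_dir <- tempfile("wd_demo_")
dir.create(demo_dir)
writeLines(c("x,y", "1,2"), file.path(demo_dir, "tiny.csv"))

# setwd() invisibly returns the previous working directory, so keep it
old_wd <- setwd(demo_dir)

# Inside the new working directory the bare file name is enough
found <- file.exists("tiny.csv")
found
list.files()

# Restore the original working directory
setwd(old_wd)
```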

5.5 Projects

We started this course by creating an RStudio project. One of the benefits of creating a project within a directory is that opening the project sets the working directory to that directory. This means that we can immediately use relative file paths for any local data.

For more information on RStudio projects see RStudio Projects.

5.6 Data on Shared Drives

It is generally considered good practice to maintain a single source of truth for our data. Where possible we should avoid making local copies. Instead, we can create an alias for a remote location, similar to the SAS ‘libname’ approach. The simplest way to do this is to save the path to the data directory as an object and use the file.path function to specify the specific datasets we want to read in. The file.path function concatenates text using "/".

# I have SAS files here:
sdtm <- "/mnt/code/gsk_R_training/data"

# Now I want to read in data
dm <- read_sas(file.path(sdtm, "dm.sas7bdat"))
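
Because file.path is vectorised, one directory alias can be combined with several file names at once. The sketch below reuses the directory from the example above, which will not exist on most machines, so it only builds the paths and checks whether the files are present before any import is attempted. The name sas_files is ours.

```r
# Path copied from the example above; it will not exist on most machines
sdtm <- "/mnt/code/gsk_R_training/data"

# file.path() is vectorised: one directory, several file names
sas_files <- file.path(sdtm, c("dm.sas7bdat", "vs.sas7bdat"))
sas_files

# Check which files are actually present before trying to import them
file.exists(sas_files)
```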

Note: For details on connecting to common data repositories such as the HARP file share, LSAF, RDIP, GDrive and Denodo, see the WARP Data Backends user guide.

5.7 EXERCISE

1. Import the theoph data into R using a relative file path (i.e. one that starts with “data/”).
   1. Check that it has imported correctly: how many rows and columns does it have?
2. Import the act data into R.
   1. Check that it has imported correctly: how many rows and columns does it have?

5.8 Exporting Data

There is a write_sas function within haven, but it is experimental (and deprecated in recent haven releases), and it is not currently possible to export data to the “.SAS7BDAT” format with any consistency. However, we can use the haven function write_xpt to export data to the “.XPT” (a.k.a. “SAS transport file”) format. To follow the SAS V5 standard (acceptable for submission to regulatory agencies), set version = 5, since recent haven releases default to version 8.

In addition, we may export data to various delimited file formats … using readr functions such as write_delim or write_csv. The format of such functions is extremely consistent: the first argument is the name of the data (i.e. the R object name) and the second argument is the name of the file that we wish to write to. Here is an example using write_csv.

write_csv(dm, "dm.csv")

Other useful arguments to write_csv include na, which controls the way missing values are written to the output file (defaults to "NA"), and append, which, when set to TRUE, allows us to append to an existing file rather than create a new one or overwrite it.

write_csv(dm, "dm_saslike.csv", na = ".")
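
The na and append arguments can be seen working together in a short round trip. This is a minimal sketch on a small made-up data frame written to a temporary file, assuming readr 2.0 or later for show_col_types; the object names conc, extra, out and back are ours. When append = TRUE, readr does not write the header row again by default.

```r
library(readr)

# A small made-up dataset with one missing concentration
conc <- data.frame(ID = 1:3, CONC = c(0.74, NA, 2.84))

# Write missing values SAS-style, as "."
out <- tempfile(fileext = ".csv")
write_csv(conc, out, na = ".")

# Append one more row to the same file
extra <- data.frame(ID = 4L, CONC = 6.57)
write_csv(extra, out, na = ".", append = TRUE)

# Inspect the raw text, then read it back, telling readr that "." means missing
readLines(out)
back <- read_csv(out, na = ".", show_col_types = FALSE)
back
```

Passing the same na value when reading keeps CONC numeric; without it, the "." entries would typically cause the column to be read as character.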