Reading in data

Last updated on 2023-01-24 | Edit this page

Overview

Questions

How can we get data into R?

Objectives

Understand the advantages of tidy data
Use readxl to import data from Microsoft Excel
Use readr to import data from text files

Do I need to do this lesson?

If you’ve already used readr and readxl and here, you’ll probably already know the material covered in this section. If you can complete the following excercises, you can skip this lesson.

Codon table

Load in the data in human_codon_table.xlsx (the column names are codon, frequency and count).

Make sure to specify the column types and names. Make use of here::here() to generate the file path relative to the project root.

Weather data

Load the data in the two files weather_brisbane.csv and weather_sydney.csv into one data frame, creating a column called ‘file’ that contains the file from which each row originated.

Make sure to specify the column types and names. Make use of here::here() to generate the file path relative to the project root.

Show me the solution

Loading the file human_codon_table.xlsx:

R

# this might be different depending on where you saved the data
cdn_path <- here::here("episodes", "data", "human_codon_table.xlsx")

df <- readxl::read_xlsx(cdn_path, 
                        col_names=c("codon", "frequency", "count"),
                        col_types = c('text', "numeric", "numeric"))

glimpse(df)

ERROR

Error in glimpse(df): could not find function "glimpse"

And the weather data:

R

library(readr)

wthr_path <- here::here("episodes", "data", c("weather_brisbane.csv", "weather_sydney.csv"))

# column types
col_types <- list(
  date = col_date(format="%Y-%m-%d"),
  min_temp_c = col_double(),
  max_temp_c = col_double(),
  rainfall_mm = col_double(),
  evaporation_mm = col_double(),
  sunshine_hours = col_double(),
  dir_max_wind_gust = col_character(),
  speed_max_wind_gust_kph = col_double(),
  time_max_wind_gust = col_time(),
  temp_9am_c = col_double(),
  rel_humid_9am_pc = col_integer(),
  cloud_amount_9am_oktas = col_double(),
  wind_direction_9am = col_character(),
  wind_speed_9am_kph = col_double(),
  MSL_pressure_9am_hPa = col_double(),
  temp_3pm_c = col_double(),
  rel_humid_3pm_pc = col_double(),
  cloud_amount_3pm_oktas = col_double(),
  wind_direction_3pm = col_character(),
  wind_speed_3pm_kph = col_double(),
  MSL_pressure_3pm_hPa = col_double()
)

# read in data
weather <- read_csv(wthr_path, skip=10, 
                           col_types=col_types, col_names = names(col_types))

glimpse(weather)

ERROR

Error in glimpse(weather): could not find function "glimpse"

Reading in data

So you want to make some awesome plots using R. The first step you’ll need to take is to import your data.

When I first learnt R in a first-year statistics course, the way we did this was:

Open an Excel spreadsheet with the data
Select the data to be imported
Copy to the clipboard (Cmd + c)
Open an R terminal
Enter the command read.table(pipe("pbpaste"), header = TRUE)

This is a pretty convoluted way to do things! This weird process was part of the reason that, after I finished the course, I didn’t touch R again for almost ten years.

Luckily, in the meantime people worked out better ways of doing things. This lesson will cover two awesome packages which can help you import your data: readxl for Excel files, and readr for text files.

Reading in data from Excel: readxl

If you’ve entered data into a digital format, chances are that you’ve used spreadsheet software like Microsoft Excel. Excel makes data entry easy because it takes care of all of the formatting for you. But at some point, you’ll probably want to get your data out of Excel and into R. That’s where the readxl package comes in!

The workhorse of this package is the read_xlsx() function (or read_xls() for older excel files). It takes as input a path to the file you want to read.

R

# define path to excel file to read - this will probably be different for you
my_excel_sheet <- here::here("episodes", "data", "readxl_example_1.xlsx")

# read in data 
my_excel_data <- readxl::read_xlsx(my_excel_sheet)
my_excel_data

OUTPUT

# A tibble: 3 × 3
  word           count number
  <chr>          <chr>  <dbl>
1 prematurely    one        4
2 airconditioned two        5
3 supermarket    three      6

File paths with `here::here()`

Note that I use a function called here to create the file path to the excel file. You can read more about this function here.

Using this function helps with cross-platform compatibility (so it works on Windows as well as MacOSX and Linux). This is because directories on Windows are usually separated with a backslash \, whereas on Unix-based distributions (like MacOS and Linux) directories are separated by a slash /.

You can check what the file path looks like on your platform by printing out the variable we defined with the file path:

R

my_excel_sheet

OUTPUT

[1] "/home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/data/readxl_example_1.xlsx"

Using here::here() always generates paths relative to the project root (the file created by RStudio when you make a new project, which has the file extension .Rproj). If you move the project around on your filesystem (or send it to someone else), the import will still work. This wouldn’t be the case if you manually specified the absolute path to the file.

Note that this example file path is a little convoluted, because of the way the lessons are created. Yours should be more straightforward and look more like this:

R

my_excel_sheet <- here::here("data", "readxl_example_1.xlsx")

Tibbles

Coming back to the output of the call to readxl::read_xlsx() - calls to this function return a tibble object with the data contained in the excel spreadsheet.

The tibble object is a convenient way of storing rectangular data, where column has a type, (which is shown when you print the table). Ideally, in a tibble each column should contain data from one variable, and each row consists of observations of each variables - this is known as ‘tidy data’ (more on this in the next lesson).

If you don’t specify the types (like character, integer, double) of each column, readxl will try to guess them based on the types in the excel spreadsheet. This can lead to unexpected behaviour, especially if you mix numbers and strings in the same columns.

Check the documentation to read more about how to specify column types in readxl.

Challenge 1: Column types

Open the file readxl_example_2.xlsx in Excel. What do you think the types will be when you import this file?

Show me the solution

R

# read in data 
data <- readxl::read_xlsx(here::here("episodes", "data", "readxl_example_2.xlsx"))
head(data)

OUTPUT

# A tibble: 3 × 5
  replicate `drug concentration` `assay 1` `assay 2` `assay 3`
      <dbl> <chr>                <chr>     <chr>     <lgl>    
1         1 10 uM                ++        1         TRUE     
2         2 20 uM                +++       medium    FALSE    
3         2 20 uM                +         high      FALSE

Note that:

The ‘replicate’ column has a type of dbl even though it contains only integers.
The ‘drug concentration’ column is of type chr because R doesn’t know that ‘uM’ is a unit.
Mixing numbers and characters in the ‘assay 2’ column results in a column of characters (why do you think this is?)

The type of the columns is important to be aware of because some functions in R expect that you feed them columns of a particular type. For example, we can extract the ‘drug concentration’ column from the second file we imported using the $ operator.

R

conc <- data$`drug concentration`
conc

OUTPUT

[1] "10 uM" "20 uM" "20 uM"

We can then extract the first two characters of each string in this column using stringr::str_sub():

R

nums <- stringr::str_sub(conc, 1, 2)

But we can’t add one to this column:

R

conc + 1

ERROR

Error in conc + 1: non-numeric argument to binary operator

Specifying column types

Since the guessing of types can result in unexpected behaviour, it’s best to always specify the types of data when you import it. You can do this in a verbose way:

R

data <- readxl::read_xlsx(here::here("episodes", "data", "readxl_example_2.xlsx"),
                           col_types = c("numeric", "text", "text", "text", "numeric"))

WARNING

Warning: Coercing boolean to numeric in E2 / R2C5

WARNING

Warning: Coercing boolean to numeric in E3 / R3C5

WARNING

Warning: Coercing boolean to numeric in E4 / R4C5

R

data

OUTPUT

# A tibble: 3 × 5
  replicate `drug concentration` `assay 1` `assay 2` `assay 3`
      <dbl> <chr>                <chr>     <chr>         <dbl>
1         1 10 uM                ++        1                 1
2         2 20 uM                +++       medium            0
3         2 20 uM                +         high              0

Note that the last column (‘assay 3’) contains Boolean (TRUE/FALSE) values, but we coerced it to a numeric type when we imported it. R prints a warning to let us know.

Other options

readxl::read_xlsx() also has lots of other options to deal with other aspects of data imports - for example, you can:

Use the sheet and range named arguments to specify which parts of the sheet to import
Skip a number of header rows using the skip option,
Provide a vector of your own column names using the col_names argument
Pass col_names=FALSE to avoid using the first line of the file as your column names if your data doesn’t have column names

Challenge 2: readxl

Open the file human_codon_table.xlsx in Excel. The columns of this table contain codons, frequencies of observations of each codon, and counts of each codon.

How do you think it should be imported?

Import this data, including the col_names, col_types, sheet and range named arguments.

Show me the solution

R

readxl::read_xlsx(here::here("episodes", "data", "human_codon_table.xlsx"),
                  col_names = c("codon", "frequency", "count"),
                  col_types = c('text', 'numeric', 'numeric'),
                  sheet = 'human_codon_table',
                  range = 'A2:C65' )

OUTPUT

# A tibble: 64 × 3
   codon frequency  count
   <chr>     <dbl>  <dbl>
 1 UCU        15.2 618711
 2 UAU        12.2 495699
 3 UGU        10.6 430311
 4 UUC        20.3 824692
 5 UCC        17.7 718892
 6 UAC        15.3 622407
 7 UGC        12.6 513028
 8 UUA         7.7 311881
 9 UCA        12.2 496448
10 UAA         1    40285
# … with 54 more rows

Reading in data from text files: readr

Alternatively, you might have tabular data in a text file. In these kinds of files, each row of the file is a row of the table, and usually either tabs or commas separate the values in each column.

If you have comma-separated data (like a .csv file) the readr::read_csv() function will be most convenient, and if your data is tab-separated (like a .tsv file), you’ll want readr::read_tsv(). If you have a different delimiter, you can use readr::read_delim(file, delim=delimiter), where delimiter is a string containing your delimiter.

The syntax is very similar to read_xlsx(), and you should specify the column types here as well. Here I use the short form of the column specification - ‘c’ for character and ‘i’ for integer.

R

data <- readr::read_csv(here::here("episodes", "data", "readr_example_1.csv"),
                           col_types = c("cci"))
data

OUTPUT

# A tibble: 3 × 3
  word           count number
  <chr>          <chr>  <int>
1 prematurely    one        4
2 airconditioned two        5
3 supermarket    three      6

You can read more about the column specification for readr functions here.

The other options for read_csv() are also similar to those for read_xslx, including col_names and skip.

Reading multiple files

If you have multiple files with the same columns and column types, you can pass a vector of file paths rather than an atomic vector. To add an extra column with the file name (so you know which rows came from which files), use the id parameter.

R

files <- c(here::here("episodes", "data", "sim_0_counts.txt"),
           here::here("episodes", "data", "sim_1_counts.txt"))

data <- readr::read_tsv(files, col_types = 'cci', id="file")

head(data)

OUTPUT

# A tibble: 6 × 4
  file                                                         test0 test1 count
  <chr>                                                        <chr> <chr> <int>
1 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… PRR       2
2 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… AVSM      1
3 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… IGLF      1
4 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… no_i…  1039
5 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… R        82
6 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… DI        3

Resources

Keypoints

Use readxl::read_xlsx() to import data from excel
Use readr::read_tsv(), readr::read_csv(), readr::read_delim() to import data from delimited text files

Reading in data

Overview

Questions

Objectives

Do I need to do this lesson?

Codon table

Weather data

Show me the solution

R

ERROR

R

ERROR

Reading in data

Reading in data from Excel: readxl

R

OUTPUT

File paths with here::here()

R

OUTPUT

R

Tibbles

Challenge 1: Column types

Show me the solution

R

OUTPUT

R

OUTPUT

R

R

ERROR

Specifying column types

R

WARNING

WARNING

WARNING

R

OUTPUT

Other options

Challenge 2: readxl

Show me the solution

R

OUTPUT

Reading in data from text files: readr

R

OUTPUT

Reading multiple files

R

OUTPUT

Resources

Keypoints

File paths with `here::here()`