Reading in data
Last updated on 2023-01-24 | Edit this page
Overview
Questions
- How can we get data into
R?
Objectives
- Understand the advantages of tidy data
- Use
readxlto import data from Microsoft Excel - Use
readrto import data from text files
Do I need to do this lesson?
If you’ve already used readr and readxl and here, you’ll probably already know the material covered in this section. If you can complete the following excercises, you can skip this lesson.
Codon table
Load in the data in human_codon_table.xlsx (the column names are codon, frequency and count).
Make sure to specify the column types and names. Make use of here::here() to generate the file path relative to the project root.
Weather data
Load the data in the two files weather_brisbane.csv and weather_sydney.csv into one data frame, creating a column called ‘file’ that contains the file from which each row originated.
Make sure to specify the column types and names. Make use of here::here() to generate the file path relative to the project root.
Loading the file human_codon_table.xlsx:
R
# this might be different depending on where you saved the data
cdn_path <- here::here("episodes", "data", "human_codon_table.xlsx")
df <- readxl::read_xlsx(cdn_path,
col_names=c("codon", "frequency", "count"),
col_types = c('text', "numeric", "numeric"))
glimpse(df)
ERROR
Error in glimpse(df): could not find function "glimpse"
And the weather data:
R
library(readr)
wthr_path <- here::here("episodes", "data", c("weather_brisbane.csv", "weather_sydney.csv"))
# column types
col_types <- list(
date = col_date(format="%Y-%m-%d"),
min_temp_c = col_double(),
max_temp_c = col_double(),
rainfall_mm = col_double(),
evaporation_mm = col_double(),
sunshine_hours = col_double(),
dir_max_wind_gust = col_character(),
speed_max_wind_gust_kph = col_double(),
time_max_wind_gust = col_time(),
temp_9am_c = col_double(),
rel_humid_9am_pc = col_integer(),
cloud_amount_9am_oktas = col_double(),
wind_direction_9am = col_character(),
wind_speed_9am_kph = col_double(),
MSL_pressure_9am_hPa = col_double(),
temp_3pm_c = col_double(),
rel_humid_3pm_pc = col_double(),
cloud_amount_3pm_oktas = col_double(),
wind_direction_3pm = col_character(),
wind_speed_3pm_kph = col_double(),
MSL_pressure_3pm_hPa = col_double()
)
# read in data
weather <- read_csv(wthr_path, skip=10,
col_types=col_types, col_names = names(col_types))
glimpse(weather)
ERROR
Error in glimpse(weather): could not find function "glimpse"
Reading in data
So you want to make some awesome plots using R. The first step you’ll need to take is to import your data.
When I first learnt R in a first-year statistics course, the way we did this was:
- Open an Excel spreadsheet with the data
- Select the data to be imported
- Copy to the clipboard (Cmd + c)
- Open an R terminal
- Enter the command
read.table(pipe("pbpaste"), header = TRUE)
This is a pretty convoluted way to do things! This weird process was part of the reason that, after I finished the course, I didn’t touch R again for almost ten years.
Luckily, in the meantime people worked out better ways of doing things. This lesson will cover two awesome packages which can help you import your data: readxl for Excel files, and readr for text files.
Reading in data from Excel: readxl
If you’ve entered data into a digital format, chances are that you’ve used spreadsheet software like Microsoft Excel. Excel makes data entry easy because it takes care of all of the formatting for you. But at some point, you’ll probably want to get your data out of Excel and into R. That’s where the readxl package comes in!
The workhorse of this package is the read_xlsx() function (or read_xls() for older excel files). It takes as input a path to the file you want to read.
R
# define path to excel file to read - this will probably be different for you
my_excel_sheet <- here::here("episodes", "data", "readxl_example_1.xlsx")
# read in data
my_excel_data <- readxl::read_xlsx(my_excel_sheet)
my_excel_data
OUTPUT
# A tibble: 3 × 3
word count number
<chr> <chr> <dbl>
1 prematurely one 4
2 airconditioned two 5
3 supermarket three 6
File paths with here::here()
Note that I use a function called here to create the file path to the excel file. You can read more about this function here.
Using this function helps with cross-platform compatibility (so it works on Windows as well as MacOSX and Linux). This is because directories on Windows are usually separated with a backslash \, whereas on Unix-based distributions (like MacOS and Linux) directories are separated by a slash /.
You can check what the file path looks like on your platform by printing out the variable we defined with the file path:
R
my_excel_sheet
OUTPUT
[1] "/home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/data/readxl_example_1.xlsx"
Using here::here() always generates paths relative to the project root (the file created by RStudio when you make a new project, which has the file extension .Rproj). If you move the project around on your filesystem (or send it to someone else), the import will still work. This wouldn’t be the case if you manually specified the absolute path to the file.
Note that this example file path is a little convoluted, because of the way the lessons are created. Yours should be more straightforward and look more like this:
R
my_excel_sheet <- here::here("data", "readxl_example_1.xlsx")
Tibbles
Coming back to the output of the call to readxl::read_xlsx() - calls to this function return a tibble object with the data contained in the excel spreadsheet.
The tibble object is a convenient way of storing rectangular data, where column has a type, (which is shown when you print the table). Ideally, in a tibble each column should contain data from one variable, and each row consists of observations of each variables - this is known as ‘tidy data’ (more on this in the next lesson).
If you don’t specify the types (like character, integer, double) of each column, readxl will try to guess them based on the types in the excel spreadsheet. This can lead to unexpected behaviour, especially if you mix numbers and strings in the same columns.
Check the documentation to read more about how to specify column types in readxl.
R
# read in data
data <- readxl::read_xlsx(here::here("episodes", "data", "readxl_example_2.xlsx"))
head(data)
OUTPUT
# A tibble: 3 × 5
replicate `drug concentration` `assay 1` `assay 2` `assay 3`
<dbl> <chr> <chr> <chr> <lgl>
1 1 10 uM ++ 1 TRUE
2 2 20 uM +++ medium FALSE
3 2 20 uM + high FALSE
Note that:
- The ‘replicate’ column has a type of
dbleven though it contains only integers. - The ‘drug concentration’ column is of type
chrbecause R doesn’t know that ‘uM’ is a unit. - Mixing numbers and characters in the ‘assay 2’ column results in a column of characters (why do you think this is?)
The type of the columns is important to be aware of because some functions in R expect that you feed them columns of a particular type. For example, we can extract the ‘drug concentration’ column from the second file we imported using the $ operator.
R
conc <- data$`drug concentration`
conc
OUTPUT
[1] "10 uM" "20 uM" "20 uM"
We can then extract the first two characters of each string in this column using stringr::str_sub():
R
nums <- stringr::str_sub(conc, 1, 2)
But we can’t add one to this column:
R
conc + 1
ERROR
Error in conc + 1: non-numeric argument to binary operator
Specifying column types
Since the guessing of types can result in unexpected behaviour, it’s best to always specify the types of data when you import it. You can do this in a verbose way:
R
data <- readxl::read_xlsx(here::here("episodes", "data", "readxl_example_2.xlsx"),
col_types = c("numeric", "text", "text", "text", "numeric"))
WARNING
Warning: Coercing boolean to numeric in E2 / R2C5
WARNING
Warning: Coercing boolean to numeric in E3 / R3C5
WARNING
Warning: Coercing boolean to numeric in E4 / R4C5
R
data
OUTPUT
# A tibble: 3 × 5
replicate `drug concentration` `assay 1` `assay 2` `assay 3`
<dbl> <chr> <chr> <chr> <dbl>
1 1 10 uM ++ 1 1
2 2 20 uM +++ medium 0
3 2 20 uM + high 0
Note that the last column (‘assay 3’) contains Boolean (TRUE/FALSE) values, but we coerced it to a numeric type when we imported it. R prints a warning to let us know.
Other options
readxl::read_xlsx() also has lots of other options to deal with other aspects of data imports - for example, you can:
- Use the
sheetandrangenamed arguments to specify which parts of the sheet to import - Skip a number of header rows using the
skipoption, - Provide a vector of your own column names using the
col_namesargument - Pass
col_names=FALSEto avoid using the first line of the file as your column names if your data doesn’t have column names
Challenge 2: readxl
Open the file human_codon_table.xlsx in Excel. The columns of this table contain codons, frequencies of observations of each codon, and counts of each codon.
How do you think it should be imported?
Import this data, including the col_names, col_types, sheet and range named arguments.
R
readxl::read_xlsx(here::here("episodes", "data", "human_codon_table.xlsx"),
col_names = c("codon", "frequency", "count"),
col_types = c('text', 'numeric', 'numeric'),
sheet = 'human_codon_table',
range = 'A2:C65' )
OUTPUT
# A tibble: 64 × 3
codon frequency count
<chr> <dbl> <dbl>
1 UCU 15.2 618711
2 UAU 12.2 495699
3 UGU 10.6 430311
4 UUC 20.3 824692
5 UCC 17.7 718892
6 UAC 15.3 622407
7 UGC 12.6 513028
8 UUA 7.7 311881
9 UCA 12.2 496448
10 UAA 1 40285
# … with 54 more rows
Reading in data from text files: readr
Alternatively, you might have tabular data in a text file. In these kinds of files, each row of the file is a row of the table, and usually either tabs or commas separate the values in each column.
If you have comma-separated data (like a .csv file) the readr::read_csv() function will be most convenient, and if your data is tab-separated (like a .tsv file), you’ll want readr::read_tsv(). If you have a different delimiter, you can use readr::read_delim(file, delim=delimiter), where delimiter is a string containing your delimiter.
The syntax is very similar to read_xlsx(), and you should specify the column types here as well. Here I use the short form of the column specification - ‘c’ for character and ‘i’ for integer.
R
data <- readr::read_csv(here::here("episodes", "data", "readr_example_1.csv"),
col_types = c("cci"))
data
OUTPUT
# A tibble: 3 × 3
word count number
<chr> <chr> <int>
1 prematurely one 4
2 airconditioned two 5
3 supermarket three 6
You can read more about the column specification for readr functions here.
The other options for read_csv() are also similar to those for read_xslx, including col_names and skip.
Reading multiple files
If you have multiple files with the same columns and column types, you can pass a vector of file paths rather than an atomic vector. To add an extra column with the file name (so you know which rows came from which files), use the id parameter.
R
files <- c(here::here("episodes", "data", "sim_0_counts.txt"),
here::here("episodes", "data", "sim_1_counts.txt"))
data <- readr::read_tsv(files, col_types = 'cci', id="file")
head(data)
OUTPUT
# A tibble: 6 × 4
file test0 test1 count
<chr> <chr> <chr> <int>
1 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… PRR 2
2 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… AVSM 1
3 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… IGLF 1
4 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… no_i… 1039
5 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… R 82
6 /home/runner/work/cmri_R_workshop/cmri_R_workshop/episodes/… test… DI 3