The most time-consuming step in many analyses is the cleaning and preparation of the data. This course will show you how you can use R to speed up that process and make it more transparent and repeatable. This means it will also be easier to check for and correct errors in your data processing.

This course is part of a new initiative called Conservation Hackers. Conservation Hackers aims to connect conservation practitioners who need help with programming, data and data analysis with data scientists, and to educate conservation scientists in programming skills.

We’re pleased to be partnering with Reef Life Survey to offer this course. Reef Life Survey is a not-for-profit organisation that trains volunteer citizen scientist divers in monitoring reef life. The Reef Life Survey data is freely available from their webpage and we’ll be using it as an example of conservation monitoring data in this course.

The course should take about 3 hours to complete. We’ll focus on a particular set of packages for data analysis from the ‘tidyverse’. We’ll cover filtering, error checking, joining data frames and some simple plots. The course aims to explain some key concepts in detail so that students have the confidence to learn more.

Setup for this course

So that the course runs efficiently, and to save plenty of time for trying fun things in R, we’d ask that you come to the course prepared.

Please download these data files, unzip them and put them in a folder where you will keep your R course notes.

This is an intermediate level course, so we’ll assume you know how to install R and R packages. As a general guide to what we expect in terms of prior knowledge, we’ll assume you can run R, load data and write basic calculations that you save in a script. We’ll assume you know nothing about how to structure data, create plots or make data summaries. Even if you know about these topics, you may still find the course helpful if you are self-taught, because we’ll cover the conceptual foundation of these topics.

Please have R and RStudio installed on your computer before starting. Allow plenty of time for doing this: it can be tricky on some computers, especially if you do not have ‘admin’ permission on a work computer. You may need to call IT to get help, as we can only offer limited help with such installation issues. The latest version of R is version 4, and you are welcome to try it. We are currently using 3.6.2 for writing this course, so there may be some minor differences. The reason we haven’t updated yet is that a lot of the packages we use haven’t been updated for R 4 yet.

You’ll also need to install a few R packages. We’re using dplyr, readr, lubridate, ggplot2, tidyr, and patchwork in this course. You can install them with this R code:

install.packages(c("dplyr", "readr", "lubridate", "ggplot2", "tidyr", "patchwork"))

To install the first five packages in one go (plus more handy data wrangling packages we won’t cover here) use:

install.packages("tidyverse")

If that doesn’t work, email us with the error and we’ll try to help. Otherwise, see your IT department for help. We’re using dplyr version 1.0, which was released earlier this year. If you have an older version of dplyr the course should still work fine, but there may be some minor differences in the code.

Data and the case-study

We will use data from Reef Life Survey’s Rottnest Island surveys (accessible here). Rottnest Island is in Western Australia and its reefs host a diverse fish assemblage. This includes both temperate and tropical species. In 2011 a heatwave caused the fish community to undergo a massive change. We’ll base our analysis today on a study by Day and colleagues, which showed an increase in tropical fish during and after the heatwave.

We’ll aim to organize this data so we can study the effects of the heatwave on temperate and tropical species composition.

We’ve modified the data slightly so it contains some errors to practice with, so if you want to use this data for real science, please contact Reef Life Survey directly.

Overview of R in RStudio

We’ll assume you are using R in RStudio. RStudio is an integrated development environment, or IDE in hacker speak. The RStudio IDE is an interface to R that adds many human-friendly features. There are a few terms we’ll assume you are familiar with in this course. First, RStudio’s window panes:

Some other important terms you should know:

To find out what a function does and the arguments it uses, use the help() function, or ?.

help(library)
#?library

If you’re having trouble using functions with your own data, the ‘Examples’ section in the help documentation can be a good place to start.
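You can also run a help page’s examples directly with base R’s example() function:

```r
# example() runs the code from the 'Examples' section of a help page,
# here the examples from ?mean
example(mean)
```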

The importance of being organized

So now that you know RStudio and have downloaded the data, let’s talk about being organized. Organization is key, because even simple statistics can produce a lot of complex code and many intermediate data outputs.

We recommend you create a folder on your computer for each new project, like today’s course. You can do this as an RStudio project if you like (see File > New Project) or just as a folder. Within that, create a data-raw folder where you will keep your raw data. If data files are really big, you might instead keep them in a folder somewhere else (e.g. on an external or network drive) that you link to later. You can store intermediate data steps in data-raw or another folder; it’s up to you. Also make a folder for your R code and a folder to save your figures in.
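As a quick sketch, you can create those sub-folders from R itself (data-raw matches the text above; ‘scripts’ and ‘figures’ are just example names):

```r
# Create the suggested project sub-folders
# (showWarnings = FALSE means no complaint if they already exist)
dir.create("data-raw", showWarnings = FALSE)
dir.create("scripts", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)
```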

We also recommend giving your files human- and machine-readable names. Jenny Bryan’s advice is a good guide to ways of naming your files.

New script

Now create a new script (File > New File > R Script) and save it with a sensible name in your project folder. Leave yourself some ‘breadcrumbs’ at the top of the script: it’s common to look at some R code after a long break and forget what you were doing. So start your scripts like this, with a title, name of creator, date and perhaps a description:

# RLS data wrangling course
# CJ Brown 
# 2020-07-23

The # symbol marks a ‘comment’, which tells R to ignore the rest of the line.

Now if your analysis gets complex you should split it among different scripts. We’ll just use one today, but if you have a more complex project it’s a good idea to chunk it. Then you might like to name your scripts ‘1_data-error-checking.R’, ‘2_exploratory-plots.R’ etc., so you know what order to run them in.
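If you adopt that naming scheme, you can run the scripts in order with source(). This is just a sketch: it assumes the numbered scripts sit in your working directory, and note that a plain name sort puts ‘10_’ before ‘2_’, so use ‘01_’, ‘02_’ and so on if you have more than nine scripts.

```r
# Find scripts named like '1_something.R', '2_something.R'
# and run them in name order
scripts <- sort(list.files(pattern = "^[0-9]+_.*\\.R$"))
for (s in scripts) {
  source(s)
}
```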

Packages

Now we are organized, let’s load a package into this session:

library(readr)

Much of R’s power comes from its user-contributed packages, which do a huge variety of tasks. Many of these packages live on CRAN, where they undergo some verification. There are many more on GitHub and other repositories; these don’t undergo the same level of checking. R’s license requires that the user knows what they are doing (there is no warranty!), so it’s on you to read and understand what a package does.

Today we’ll use readr to read in data and then see some other packages for data wrangling and plotting.

Read in data

dat <- read_csv("data-raw/Rottnest-Island-UVC.csv")

There is also the similar read.csv() in base R. We like to use readr’s read_csv because it does some extra checks on the data, converts data to sensible classes where it can (e.g. dates), and allows for more flexibility in naming variables (columns).
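As a small illustration of the class conversion, here is a sketch with a made-up two-column CSV (the file and column names are hypothetical, not from the course data):

```r
library(readr)

# Write a tiny CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("SurveyID,Date", "1,2020-07-23"), tmp)

d1 <- read.csv(tmp)   # base R: Date stays character (or factor before R 4.0)
d2 <- read_csv(tmp)   # readr: an ISO date like 2020-07-23 is parsed to Date class

class(d1$Date)
class(d2$Date)
```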

We’ve read in our data as a data frame; check your global environment and you should see it is listed as an object in memory called ‘dat’ with the number of observations and variables listed.

Check the data

Note that, if you haven’t collected the data yourself, it is a good idea to contact the data provider to ensure that you understand how the data was collected and that you make appropriate acknowledgments. An in-depth understanding of how the data was collected will help you to make good decisions when analysing the data.

Once you’re ready to start working with the data, you’ll want to do some initial checks. There are a few key functions in R that let us look at the data in different ways. These functions are likely to become standard routine in your scripts. They will help you determine if your data is formatted correctly for further wrangling and analysis.

First we’ll check the data type of each of the variables in our data frame.

The data types we’ll discuss in this course are:

head(dat)
tail(dat)
#View(dat) #also try hover over the name and press F2
names(dat)
nrow(dat)
ncol(dat)
length(unique(dat$SurveyID))
length(unique(dat$SpeciesID))
unique(dat$Sizeclass)
table(dat$Sizeclass)
summary(dat)

We can change data types easily using a group of functions that start with ‘as.’. For example, let’s say we wanted to turn the ‘Sizeclass’ variable into ordered categories that we could group the rest of the data by. This might be useful if we want to plot abundance in each size class from smallest to largest.

dat$Sizeclass <- as.factor(dat$Sizeclass)
head(dat)
## # A tibble: 6 x 9
##   SurveyID SpeciesID Sizeclass Method Abundance Family Class
##      <dbl>     <dbl> <fct>      <dbl>     <dbl> <chr>  <chr>
## 1  6001969        35 15             1         1 Pemph… Acti…
## 2  6001969        38 40             1         2 Kypho… Acti…
## 3  6001969        43 20             1         2 Enopl… Acti…
## 4  6001969        57 20             1         1 Labri… Acti…
## 5  6001969        64 7.5            1        20 Labri… Acti…
## 6  6001969        64 10             1        20 Labri… Acti…
## # … with 2 more variables: CURRENT_TAXONOMIC_NAME <chr>,
## #   TempTrop_23cutoff <chr>
levels(dat$Sizeclass)
##  [1] "2.5"   "5"     "7.5"   "10"    "12.5"  "15"    "20"    "25"   
##  [9] "30"    "35"    "40"    "50"    "62.5"  "75"    "87.5"  "100"  
## [17] "112.5" "125"   "137.5" "150"   "162.5" "175"   "187.5" "200"  
## [25] "250"   "300"

Watch out though! If you change a factor to a numeric you’ll get the integer codes 1 through to the number of factor levels, even if the factor’s labels are different numbers.

dat$Sizeclass <- as.numeric(dat$Sizeclass)
head(dat)
## # A tibble: 6 x 9
##   SurveyID SpeciesID Sizeclass Method Abundance Family Class
##      <dbl>     <dbl>     <dbl>  <dbl>     <dbl> <chr>  <chr>
## 1  6001969        35         6      1         1 Pemph… Acti…
## 2  6001969        38        11      1         2 Kypho… Acti…
## 3  6001969        43         7      1         2 Enopl… Acti…
## 4  6001969        57         7      1         1 Labri… Acti…
## 5  6001969        64         3      1        20 Labri… Acti…
## 6  6001969        64         4      1        20 Labri… Acti…
## # … with 2 more variables: CURRENT_TAXONOMIC_NAME <chr>,
## #   TempTrop_23cutoff <chr>
unique(dat$Sizeclass)
##  [1]  6 11  7  3  4  5  2  8  9 10 12 13 16  1 21 15 14 24 18 20 19 26 22
## [24] 25 NA 17 23
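If you do need the original numbers back from a factor like Sizeclass, the usual trick is to convert to character first. Here’s a sketch with a small made-up factor:

```r
# A factor whose labels are numbers
sizes <- factor(c("7.5", "15", "2.5"), levels = c("2.5", "7.5", "15"))

as.numeric(sizes)                # 2 3 1 -- the level positions, not the sizes
as.numeric(as.character(sizes))  # 7.5 15.0 2.5 -- the original values
```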

Data wrangling

Wrangling your data means getting it into shape for exploration and analysis. The ‘dplyr’ package provides an array of useful functions for wrangling.

library(dplyr)

The conceptual framework underpinning dplyr is called the ‘grammar of data manipulation’. We’ll start with two key functions here, and we’ll continue to reveal more as the course progresses. As bonus material at the end of the course we’ll show a special tool that allows us to string dplyr functions together for efficient and elegant data wrangling.

Filtering and selecting

Let’s start by exploring data for one species, Scarus ghobban.

datscarus <- filter(dat, CURRENT_TAXONOMIC_NAME == "Scarus ghobban")
head(datscarus)
## # A tibble: 6 x 9
##   SurveyID SpeciesID Sizeclass Method Abundance Family Class
##      <dbl>     <dbl>     <dbl>  <dbl>     <dbl> <chr>  <chr>
## 1  6001983       145         5      1         1 Labri… Acti…
## 2  6001987       145         4      1         1 Labri… Acti…
## 3  6001987       145         5      1         1 Labri… Acti…
## 4  6002000       145         7      1         1 Labri… Acti…
## 5  6002011       145         7      1         1 Labri… Acti…
## 6  6002025       145         6      1         1 Labri… Acti…
## # … with 2 more variables: CURRENT_TAXONOMIC_NAME <chr>,
## #   TempTrop_23cutoff <chr>
length(unique(datscarus$SpeciesID))
## [1] 1

We might also want to remove some columns in our data set. We can do this easily with dplyr’s select function. For example, we know that all the data was collected with the same method, so we’ll remove this column.

unique(dat$Method)
## [1] 1
datsub <- select(dat, -Method)
head(datsub)
## # A tibble: 6 x 8
##   SurveyID SpeciesID Sizeclass Abundance Family Class CURRENT_TAXONOM…
##      <dbl>     <dbl>     <dbl>     <dbl> <chr>  <chr> <chr>           
## 1  6001969        35         6         1 Pemph… Acti… Pempheris multi…
## 2  6001969        38        11         2 Kypho… Acti… Kyphosus sydney…
## 3  6001969        43         7         2 Enopl… Acti… Enoplosus armat…
## 4  6001969        57         7         1 Labri… Acti… Pictilabrus lat…
## 5  6001969        64         3        20 Labri… Acti… Siphonognathus …
## 6  6001969        64         4        20 Labri… Acti… Siphonognathus …
## # … with 1 more variable: TempTrop_23cutoff <chr>
datsub2 <- select(dat, SurveyID:Sizeclass, Abundance:TempTrop_23cutoff)
head(datsub2)
## # A tibble: 6 x 8
##   SurveyID SpeciesID Sizeclass Abundance Family Class CURRENT_TAXONOM…
##      <dbl>     <dbl>     <dbl>     <dbl> <chr>  <chr> <chr>           
## 1  6001969        35         6         1 Pemph… Acti… Pempheris multi…
## 2  6001969        38        11         2 Kypho… Acti… Kyphosus sydney…
## 3  6001969        43         7         2 Enopl… Acti… Enoplosus armat…
## 4  6001969        57         7         1 Labri… Acti… Pictilabrus lat…
## 5  6001969        64         3        20 Labri… Acti… Siphonognathus …
## 6  6001969        64         4        20 Labri… Acti… Siphonognathus …
## # … with 1 more variable: TempTrop_23cutoff <chr>

First plots

It’s often easier to explore data with graphics, and R is really great for graphics. R has good base packages for graphics, but today we’ll use ggplot2, which implements a ‘grammar of graphics’.

library(ggplot2)

The theory of ggplot is that you build your plot up in a series of layers. Each layer is created by a specific function, and we string them together with a +. Here’s an example:

ggplot(datscarus) + 
  aes(x = Abundance) + 
  geom_histogram()

This makes a histogram. ggplot(datscarus) declares the data frame from which we’ll draw variables and also creates the page for the plot. aes stands for the plot’s ‘aesthetics’; here we use it to declare an x axis, which is the variable Abundance. Then geom_histogram declares a geometry object, which decides how the aesthetics are drawn on the page.

Try it again, but just run the first one or two lines: you’ll see the first line makes an empty plot, the second line adds an x-axis and the third line adds the data.

The different elements are layered, so whatever comes first goes underneath.

ggplot also likes data as data frames, so as long as your data is a data frame and all the variables/classes/groups are columns in it, you are good to go. This works well most of the time, but can be clunky sometimes, as we’ll explain below.
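For example, if your data starts out as a matrix you’ll need to convert it to a data frame before plotting (m here is just a toy example, not the course data):

```r
library(ggplot2)

m <- matrix(1:6, ncol = 2)  # a matrix: ggplot won't take this directly
df <- as.data.frame(m)      # columns get default names V1 and V2
ggplot(df) +
  aes(x = V1, y = V2) +
  geom_point()
```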

Here’s another example:

ggplot(datscarus) + 
  aes(x = Sizeclass, y = Abundance) +
  geom_point()