seascape models

Data Wrangling Using R

Data Wrangling Using R

A short course


What is data wrangling?
Wikipedia defines it as “Data wrangling or data wrangling is loosely the process of manually converting or mapping data from one ‘raw’ form into another format that allows for more convenient consumption of the data with the help of semi-automated tools”. Have you ever been faced with a dataset that was in the wrong format for the statistical package you wanted to use? Well data-wrangling is about solving that problem.
If you have ever had to deal with large data-sets you may realise that data wrangling can take a considerable amount of time and skill with spreasheets programs like excel. Data wrangling is also dangerous for your analysis- if you stuff something up, like accidentally deleting some rows of data, it can affect all your results and the problem can be hard to detect.

There are better ways of wrangling data than with spreadsheets, and one way is with R.

R is fast becoming the most powerful and flexible programming package for environmental data analysis. It is also totally free. R is a powerful language for data wrangling because
1. It is relatively fast to process commands
2. You can create repeatable scripts
3. You can trace errors back to their source
3. You can share your wrangling scripts with other people
4. You can conveniently search large databases for errors
5. Having your data in R opens up a huge array of cutting edge analysis tools.

A core principle of science is repeatability. Creating and sharing your wrangling scripts helps you to fulfil the important principle of repeatability. It makes you a better scientist and can also make you a more popular scientist: a well thought out script for a common wrangling problem may be widely used by others. In fact, these days it is best practice to include your scripts with publications.
Most statistical methods in R require your data is input in a certain ‘tidy’ format. This course will cover how to use R to easily convert your data into a ‘tidy’ format. Steps include restructuring existing datasets and combining different data-sets. We will also create some data summaries and plots, which are easy once you have ‘tidy’ data. The course will be useful for people who want to explore, analyse and visualise their own field and experimental data in R. The skills covered are also a useful precusor for management of very large datasets in R.

The main principles I hope you will learn are

  • Data wrangling in R is safe, fast, reliable and repeatable
  • Coding aesthetics for readable and repeatable code

To download the data you need click here

The download also includes a portable html of this course.

Let’s begin!

Basic calculations

I am going to assume you are using RStudio. However, you can also use the R app and a text editor if you prefer.
Open up RStudio. R is a programming language, which means there are very few commands available in the menus, rather you have to type in commands into the ‘Console’ (notice there is a window called ‘Console’). The hard part about learning to use R is knowning what the commands are that you want and how to structure the commands in a way R understands. RStudio does however, ease the transition to programming by providing you with some drop down menus.
Let’s get started with R, doing some basic calculations in the console. In the console type

10 - 1

And the console returns an answer (9). You have just asked the R program to compute this (easy) sum. Look carefully at the console and note that each line starts with a > this means the console is ready to take input. If you just type 10 - and hit enter, you will see the next line begins with a +, meaning the console is waiting for you to finish this equation. Type any number and hit return and you will get the answer followed by a >.
You can also ask the console for help, for instance try:


Notice a new ‘Help’ window opens up, with details of how to use arithmetic operators in R. These help files are very useful, but also tend to contain a lot of technical jargon. In learning to use R, you will begin to understand this jargon, making it much easier for you to learn more in the future. But to start off, if things don’t make sense at first reading, try web searches, referring to a beginners book or just experiment with code in different ways to see what happens.
R contains most functions and constants for maths you can think of, for instance try:


Storing numbers as variables

It is well and good to be able to do calculations in R, but what if we want to save our number and reuse it later?
That is what variables are for. Try typing this into your console:

x <- 3^2

Note that the console returns absolutely nothing, but it is ready for more input (>). Try typing x and the console reminds us that three squared in fact equals 9. To run through it point by point, the command we are giving says: compute 3^2 and assign (the assign command is this <-) the answer to x. The spaces around the assign don’t matter, you can leave them out. However, I prefer to have the spaces because it makes the code more readable. You may also see some people use = instead of <-. Both are ok, but for technical reasons that will be explained later, it is better to use <- for assigning values to variables.
If we want R to return the answer when the calculation is computed then enclose the sum in brackets like this (x <- 3^2).
Now, what if we try to recall a variable that does not exist? For instance type y. R reminds us, in red font, that we never created a variable called y.
Now we have created x we can use it in algebraic equations. For instance try z <- x*3 and you will see we get variable z that equals 27.
We can also assign words to variables:

v <- 'today'

Note that we enclosed our word in inverted commas. If we didn’t do this, R would try to assign the value of the variable today to v.
A few tips on naming variables. You can name variables whatever you want (e.g. you could use nine instead of x), but there are a few rules: (1) you can’t use special characters in variable names (other than . and _) and (2) you must start a variable name with a letter. It is also a good idea to avoid naming variables as constants or functions that already exist in R, like pi, e, help.

Vectors: variables with more than one number

So far our variables only have one number. Note that when we recall a value, e.g. x the console prints [1] 9. The [1] means that the first number stored in x is 9. We can store more than one number or word in a variable using the c() function:

 x <- c(22, 23, 24)
v <- c('yesterday', 'today', 'tomorrow')

Which creates variables each with three values. Note again I use spaces purely for clarity, they are not essential.
We can index variables to access certain numbers using []


returns the first value of x and the second value of v. We can also use indices to recall multiple values at once. For instance try


Does the output make sense?

Logical indexing

We can also use logicals (TRUE/FALSE) for indexing. To see how this works, let’s pose R some logical questions

42 == 54
42 == 7*6
x[1] > 10
x[1] != 22

The relational operators used here ask questions such as 42 == 52 means ‘does 42 equal 54?’. To see more of them type help('=='). You will see they return TRUE or FALSE depending on whether the relation is true or false.
Logicals are very useful because we can use them for indexing. For instance


returns a vector of T/F, telling us that the first two values of x are not greater than 23 but the last value is. We can store this logical vector and use it to return values of x that are greater than 23:

ix <- x>23

We can also apply this idea to words

iv <- v == 'today'

There are other specialised functions if you want to do partial searches for words (like grep()), but we won’t cover those in this basic course.
One final thing, if you don’t like the logicals, you can also use the which() function, which turns logicals into indices

ix <- which(x>23)
ix # the third value of x is greater than 23
x[ix] #the third value of x is 24. 


So far we have just been typing everything into the console. However, we might want to save our commands for later. To do this go to the menu and select file -> New file -> R Script. It will open a text editor. From now on, type all your code into this editor. To send the code to the console, click on a line, or select a section and hit ctrl-R on windows or cmnd-enter on mac.
We can put comments in the script, to remind us what it does. R will not evaluate comments, which are preceeded by an #. For instance, start your script by typing:

# R data wrangling course
# Your name
# 23 Jun 2015

I recommend using comments frequently throughout your code. It will make life a lot easier when you come back to a piece of code after a few days, because the comments will remind you what the code does. It will also make it easier to share your code with collaborators and make your code more repeatable, which as I said, is a core principle of science. Another way to make your code more readable is to break up sections of code within your script, for instance here is how I often delineate sections in my code


# Minor heading

Now we have a script with something worth saving, you should go to the menu and select ‘file’ -> ‘save as’ to save the script for later. Next time you come back to your analysis, you can just open this file, select everything and hit run. Then you will be back to where you left off last time. Further, you can share your script with other people or publish it with your paper.
Don’t forget what I said about repeatability and readability! It can take time to organise your script carefully, but it is worth it. I encourage you to include ample comments and section headers as we progress with the rest of the course.

Reading in data

R really becomes useful when we can read in our own datasets. To start, we are going to read in a dataset of fish surveys, dat.csv. This is a practice data set and is very small (only 21 rows). You might think wrangling this data would be easier in a spreadsheet, like excel. We will develop our wrangling skills on this small dataset then test them on a much larger dataset, which would be tricky to handle in a spreadsheet.
The first step is to tell R where on your computer the dataset is stored. In your script type this (but replace the path with that for your own computer)

setwd('/Users/Documents/R data wrangling')

setwd() stands for ‘set working directory’. You will need to replace the folder path in the '' with the path on your computer. On windows you can right click on the file and select ‘properties’, on mac use ‘get info’, then cut and paste the path. Note that if you cut and paste on a windows computer you will have to replace the \ with /. I also have put a code header in here as an example. I won’t do this below, but I suggest you put in code headers for each new section we cover, it will make it easier for you to come back to this script later.
We can see the files in this folder by typing list.files(). The first data we will use is called dat.csv and can be read in as a dataframe like this

fishdat <- read.csv('dat.csv', header = TRUE, stringsAsFactors = FALSE)

Notice we used the assign operator again, but this time assigned the input data to the name fishdat.
read.csv is function that reads a csv file and stores it in R’s local memory. Functions are simply algorithms that R has stored, which perform a task for you, in this case reading a csv file. Functions always begin with a name followed by (), which contain some arguments. In the case of the read.csv() function, the arguments are the name of the data file (‘dat.csv’), header = TRUE and stringsAsFactors = FALSE. The header argument simply says the first row is a row of variable names (so don’t treat them as measurements). The strings as factors arguments tells R to treat strings (ie words) as strings and not to group them by names. You will see why we do this later. Also notice we use = to specify the function’s arguments. As I said above, it is best to use <- for assigning variable names and save = for function arguments.
If you need more help on any function in R type its name prefixed with a question mark, for instance ?read.csv.
Now lets look at fishdat. Either type fishdat into the console, or on RStudio, go to the Environment window and double click on ‘fishdat’ under ‘Data’. You will see a table of data like this (I have appended a truncated version here)

fishdat is what we call ‘tidy data’. Each row is an observation. Here we have multiple sites, each with five transects. We have five variables site, the names of sites, transect the transect number, fish_abund the abundance of fish at that transect and dN15_SG the ratio of nitrogen stable isotopes of a seagrass sample from that transect (high nitrogen isotope values indicate the influence of pollution).

A note about Excel files

In short, don’t use .xlsx or .xls files for saving data. The problem with .xls and .xlsx files are that they store extra info with the data that makes files larger than neccessary and Excel formats can also unwittingly alter your data!
This kind of thing can really mess you up. A stable way to save your data is as a .csv which stands for comma seperated values. These are simply values seperated by commas and rows defined by returns. If you select ‘save as’ in excel, you can choose .csv as one of the options. Try opening the .csv file I have given you using a text editor and you will see it is just words, numbers and commas. Obviously using the csv format means you need to avoid commas in your variable names and values, because all commas will be interpreted as spaces between cells in the table.
As an example of the problems with excel files, type some numbers into an excel spreadsheet then click ‘format’ -> ‘cells’ and change the category to ‘time’. Now save it as a .csv and open it with a text editor. Your numbers have changed from what you typed in. Because excel automatically detects number types, if you enter something that looks like dates it might decide to change them all to dates and in doing so alter your data.

Accessing variables in the dataframe

The rules we learned above for accessing values of a variable also apply to dataframes. Except now we have both rows and columns. Rows and columns are indexed by using a comma as a separator. Try

fishdat[ ,c(1,3)]

We can also access variables by their names:


Basic data summaries

We can get a summary of the data frame by typing:

head(fishdat) #first 6 rows by default
head(fishdat, 10) #first ten rows

Useful if the dataframe is very large!
We can also do some statistical summaries. This is handy to check the data is ready in correctly and even to start some analysis. For instance:

sd(fishdat$fish_abund) #standard deviation of abundances
table(fishdat$site) #number of transects per site
table(fishdat$transect) #number of sites per transect

Did you notice a mistake in fishdat?
Yes, R distinguishes between upper and low case, so it is treating sites b and B as different sites. They are meant to be the same. We will find out how to correct this typo in a moment. But first, we need to locate its position.
We can locate the indices for certain variables using logical relations, for instance

fishdat$site == 'A'

Returns a vector of TRUE and FALSE, with the same number of rows as the dataframe. TRUE occures where the site is A and FALSE where the site isn’t A. The double equals above is used to ask a question (ie is site A?). If you prefer row indices to T/F, you can enclose the logicals in the which() function:

which(fishdat$site == 'A')

So the first five rows are site A. There are other sorts of logical quesions for instance:

which(fishdat$site !='A')
which(fishdat$dN15_SG > 11.99)
which(fishdat$dN15_SG >= 11.99)

Correcting mistakes in data frames

Earlier we noticed a typo where one transect at site ‘B’ was mistakenly typed as site ‘b’. We should identify which ‘b’ is wrong and correct it.

ierror <- which(fishdat$site == 'b') #Find out which row number is wrong
fishdat$site[ierror] <- 'B' #Change it to the correct upper case. 

We could also have used the toupper() function to change the case of ‘b’ (or infact the case of every value in site. You might be thinking, ‘I could have easily changed that error in a spreasheet’. Well you could have for a dataframe with 20 observations, but what if the data had 1 million observations?
## Saving our corrected data frame Now we have corrected our dataframe, we may want to save it for later. We can do this with write.csv() which is basically the opposite of `read.csv()

write.csv(fishdat, 'dat_corrected.csv', row.names = FALSE)

Which saves a new csv file to our current working directory (have a look in your folder and you should see it there). If you want to know what your current working directory is, type getwd().

Basic plots

Lets do a few graphical summaries. Try these

plot(fishdat$fish_abund, fishdat$dN15_SG)

Which plots a histogram and an xy plot of fish abundances against the nitrogen isotope ratios for seagrass at each transect. Looks like there is a relationship. We can improve our plot slightly by using some extra arguments to plot()

plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)

The pch command changes the symbol type. In fact, a whole host of graphical parameters are available with the par() function (try ?par to see them all). For instance

par(las=1) #plot text the right way up
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)

par(mfrow = c(1,2)) #plot window with two columns 
hist(fishdat$fish_abund, xlab ='Fish abundance', main ='Histogram')
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15, col ='steelblue')

Loading packages

Many users of R also develop add on packages, which they then share with the world. Today we are going to use some packages that make data wrangling easy. First you need to install them to your computer (you will need a web connection to do this). I will run through package installs on the projector. Today we need the tidyr (for tidying dataframes) and dplyr (‘data pliers’) packages. Once you have installed these, you need to load them for each new R session like this:


The red warning just means some functions in R’s base packages have been replaced with those in dplyr, nothing for us to worry about here.
All the operations we will perform in tidyr and dplyr can be done using R’s base packages. However, we are using these special packages because they make data wrangling easy, intuitive and aesthetic. Remember I said that a good script can make you a popular scientist? Well dplyr and tidyr are good examples of that. They do some tasks we could do already in R but makes them easier and faster.

Filtering a dataframe using dplyr

We can filter a dataframe with dplyr to say, just get the rows for one site

fish_A <- filter(fishdat, site == 'A')

Which says, take the fishdat dataframe and filter it for rows that have site=='A'. One way this could be useful would be to help us make a plot where the sites are different colours, for instance

plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abunance', pch = 15, col ='grey')
points(fish_A$dN15_SG, fish_A$fish_abund, pch = 15, col = 'red')

Here we have plotted all the points (an easy way to get the axes to scale correctly) and then used points() to add points for site A, but changed their colours. Can you work out how to add the other three sites as different colours?

Summaries using dplyr

Summarising data is straightforward with dplyr. What follows is somewhat similar to making a pivot table in excel, but without all the clicking. First we need to specify what we are summarising over, using the group_by() function

grps <- group_by(fishdat, site)

grps is similar to the orginal data frame, but an extra property has been added ‘Groups:site’. Now we have specified a grouping variable, we can summarise() by sites, for instance the mean values of the nitrogen isotopes and fish abundances

summarise(grps, mean_dN = mean(dN15_SG)) #just nitrogen
summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund)) #means for nitrogen and fish

Here is what our data summary looks like

## Source: local data frame [4 x 3]
##   site mean_dN mean_abund
## 1    A   9.666        6.4
## 2    B  15.260       18.6
## 3    C  10.878        4.2
## 4    D  14.752       19.6

You could summarise using all sorts of other functions, like sd() or min(). Try it for yourself.
Let’s store our new data summary then, use its variables in some plots

datsum <- summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund))
#Make a plot with all variables
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
#Add the means summaries as red points
points(datsum$mean_dN, datsum$mean_abund, pch = 16, col ='red')

Aesthetics in multi-line coding

Notice that we just did a series of data-wrangling steps. First we identified groups then we created some summaries. We could write this code all on one line like this

summarise(group_by(fishdat, site), mean_dN = mean(dN15_SG))

Which will return the same summary as above. But, notice we have nested a function inside a function! Not very aesthetic and difficult for others to follow. One solution would be to do the code over multiple lines, like we did above

grps <- group_by(fishdat, site)
summarise(grps, mean_dN = mean(dN15_SG))

One clever thing that the creaters of dplyr did was to add data ‘pipes’ to improve the aesthetics of multi-function commands. Pipes look like this %>% and work like this

fishdat %>% group_by(site) %>% summarise(mean_dN = mean(dN15_SG))

Pipes take the output of the first function and apply it as the first argument of the second function and so on. Notice that we skip specifying the dataframe in group_by() and summarise() because the data was specified at the start of the pipe. The pipe is like saying, take fishdat pipe it into group_by(), then pipe the result into summarise(). Now our code reads like a sentence ‘take fishdat, do this, then this, then this’. Rather than before which was more like ‘apply this function to the output of another function that was applied to fishdat’.

Joining dataframes

Our fish abundances and seagrass isotopes aren’t the end point of our data analysis. We also want to know how they relate to some variables we measured at each site. To see these load

sitedat <- read.csv('Sitesdat.csv', header = TRUE, stringsAsFactors = FALSE)

dplyr has a range of functions for joining dataframes. All are prefixed with join. Here we use the inner_join() to find out about the others look at the help file (?inner_join)

datnew <- inner_join(fishdat, sitedat, by ='site')

We have joined fishdat and sitedat by their common variable site. Notice the join automatically replicates distance and depth across all the transects, keeping out dataframe ‘tidy’. One analysis we could do is plot the seagrass isotope ratios by distance (which represents distance from a sewage plant).

plot(datnew$distance, datnew$dN15_SG, xlab ='Distance from sewage', ylab ='Nitrogen isotope ratio of seagrass')

Looks like the pollution source is enriching the nitrogen in the seagrass.
As a challenge, can you think of a way of plotting the mean values of dN15_SG at each distance?

Tidying messy data

We have yet more data. Our collaborator has helped us out and taken three fish of different sizes from each site (small, medium and large) and run isotope analyses on those. This is great, because it means we can see if fish isotopes follow the same pattern as the seagrass.
Let’s look at the data

fishiso <- read.csv('FishIsodat.csv', header = TRUE, stringsAsFactors = FALSE)

It seems our collaborator was not familiar with the ‘tidy data’ format. Notice that values for the isotope ratios at each site are stored as separate columns. We will need to tidy this data if we are to proceed. Luckily we have the tidyr() package handy and one function gather() will be particularly handy here. Check out the vignettes other useful functions to tidy data)

(isotidy <- gather(fishiso, site, dN_fish, A:D))
##      size site dN_fish
## 1   small    A   10.04
## 2  medium    A   11.21
## 3   large    A   10.77
## 4   small    B   16.10
## 5  medium    B      NA
## 6   large    B   16.50
## 7   small    C   12.23
## 8  medium    C   13.18
## 9   large    C   13.20
## 10  small    D   14.93
## 11 medium    D   14.44
## 12  large    D   13.98

The tidy data looks much better. What gather() has done is take the fishiso dataframe, creating new variables site and dN_fish and made them from the values in columns A to D.
Notice the NA for the medium fish at site B. Unfortunately, we couldn’t catch a fish of that size at site B. We have encoded the missing value as NA when we entered the data. NA is the correct way to indicate a missing value to R.

Summaries with missing values

Let’s create a summary of our new fish isotope ratio data, using our new piping skills.

(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish)))
## Source: local data frame [3 x 2]
##     size mean_dN
## 1  large 13.6125
## 2 medium      NA
## 3  small 13.3250

Which says, take isotidy, group it by size then make a summary of mean values for dN_fish. Well we get the means, but not for the size category with the missing value. Why? R’s default setting for taking a mean across a vector with a missing value is the return NA. R does this to make sure we are aware there are missing samples (and therefore, our design is unbalanced). We can change this default behaviour however, and ask R to ignore the NA by adding the argument na.rm = TRUE

(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish, na.rm = TRUE)))
## Source: local data frame [3 x 2]
##     size  mean_dN
## 1  large 13.61250
## 2 medium 12.94333
## 3  small 13.32500

We have our mean for the medium fish, but be aware, it was taken from only three sites. Let’s also do some means by sites (across sizes)

sites_iso_mn <- isotidy %>% group_by(site) %>% summarise(mean_dNfish = mean(dN_fish, na.rm = TRUE))

Bringing it all together

Now we have combined all our data frames, let’s see if we can plot nitrogen isotope ratios for both fish and seagrass on the same plot, as a function of distance from the sewage plant. First, join the mean values for fish isotope ratios to the site data

sites_means <- inner_join(sitedat, sites_iso_mn, by = 'site')

Then join our new sites dataframe with the summary of fishdat.

(sites_means2 <- inner_join(sites_means, datsum, by = 'site'))
##   site distance depth mean_dNfish mean_dN mean_abund
## 1    A        5    10    10.67333   9.666        6.4
## 2    B       12    11    16.30000  15.260       18.6
## 3    C        6     4    12.87000  10.878        4.2
## 4    D       11     3    14.45000  14.752       19.6

Note, you could try doing this all in one line, using pipes.
Now lets put it all together in a plot

par(las = 1)
plot(sites_means2$distance, sites_means2$mean_dN, xlab = 'Distance from sewage plant', ylab = 'N isotope ratio', pch = 16, ylim = c(5, 20), col = 'springgreen3', cex = 1.5, cex.lab = 1.3)
points(sites_means2$distance, sites_means2$mean_dNfish, xlab = 'Distance from sewage plant',  ylab = 'N isotope ratio', pch = 16, col ='steelblue1', cex = 1.5)
legend('topleft', legend = c('Seagrass', 'Fish'), pch = 16, col =c('springgreen3','steelblue1'), cex = 1.5)

To explain some new arguments above, ylim sets the y-limits, cex sets the scaling of points and cex.lab resizes the axes labels. The legend() function lets us add a legend to the plot. Note, if you want to try different colours, type colors() to see a full list. The RColorBrewer package is also extremely handy for picking colours. Finally, you should also plot the correlation of mean fish versus mean seagrass isotope ratios. I will leave that to you to work out.

Advanced tasks

If you have cruised through the first part of the course, have a go with this advanced challenge question. I have given you some extra dataframes to practice your data wrangling skills with. These are much larger, so you will need to use R to find errors in them. The dataframes are: Fish_wghts.csv which records individual weights and sexes for fish caught in 100 different trawls; and Trawl_sites.csv which records for each trawl: its distance from shore and its depth. We want to know how the mean weight of individual fish varies with distance from shore, how the total population biomass varies with distance from shore and if the the different sexes have different weights. Beware: there are two errors in the fish_wghts file that you will need to correct!


I hope I have convinced you of how useful R can be for data wrangling. R might be hard at first, but the more you use R the easier it will become. If you can use R you will find data management much more straightforward and repeatable. Aim to write aesthetic scripts and you can avoid many head-aches when you realise there are errors in your database- with a neat script you can trace back to the problem and correct it.
Data wrangling is a skill that take times to learn. But I believe it is worthwhile. Being good at data wrangling will make you a sought after collaborator and a better scientist.

Contact: Chris Brown

Email Tweets YouTube Code on Github


Designed by Chris Brown. Source on Github