What is data wrangling?
Wikipedia defines it as “Data wrangling or data wrangling is loosely the process of manually converting or mapping data from one ‘raw’ form into another format that allows for more convenient consumption of the data with the help of semi-automated tools”. Have you ever been faced with a dataset that was in the wrong format for the statistical package you wanted to use? Well data-wrangling is about solving that problem.
If you have ever had to deal with large data-sets you may realise that data wrangling can take a considerable amount of time and skill with spreasheets programs like excel. Data wrangling is also dangerous for your analysis- if you stuff something up, like accidentally deleting some rows of data, it can affect all your results and the problem can be hard to detect.
R is fast becoming the most powerful and flexible programming package for environmental data analysis. It is also totally free. R is a powerful language for data wrangling because
1. It is relatively fast to process commands
2. You can create repeatable scripts
3. You can trace errors back to their source
3. You can share your wrangling scripts with other people
4. You can conveniently search large databases for errors
5. Having your data in R opens up a huge array of cutting edge analysis tools.
A core principle of science is repeatability. Creating and sharing
your wrangling scripts helps you to fulfil the important principle of
repeatability. It makes you a better scientist and can also make you a
more popular scientist: a well thought out script for a common wrangling
problem may be widely used by others. In fact, these days it is best
practice to include your scripts with publications.
Most statistical methods in R require your data is input in a certain ‘tidy’ format. This course will cover how to use R to easily convert your data into a ‘tidy’ format. Steps include restructuring existing datasets and combining different data-sets. We will also create some data summaries and plots, which are easy once you have ‘tidy’ data. The course will be useful for people who want to explore, analyse and visualise their own field and experimental data in R. The skills covered are also a useful precusor for management of very large datasets in R.
To download the data you need click here
The download also includes a portable html of this course.
I am going to assume you are using RStudio. However, you can also use the R app and a text editor if you prefer.
Open up RStudio. R is a programming language, which means there are very few commands available in the menus, rather you have to type in commands into the ‘Console’ (notice there is a window called ‘Console’). The hard part about learning to use R is knowning what the commands are that you want and how to structure the commands in a way R understands. RStudio does however, ease the transition to programming by providing you with some drop down menus.
Let’s get started with R, doing some basic calculations in the console. In the console type
10 - 1
And the console returns an answer (
9). You have just asked the R program to compute this (easy) sum. Look carefully at the console and note that each line starts with a
> this means the console is ready to take input. If you just type
10 - and hit enter, you will see the next line begins with a
meaning the console is waiting for you to finish this equation. Type
any number and hit return and you will get the answer followed by a
You can also ask the console for help, for instance try:
Notice a new ‘Help’ window opens up, with details of how to use arithmetic operators in R. These help files are very useful, but also tend to contain a lot of technical jargon. In learning to use R,
you will begin to understand this jargon, making it much easier for you
to learn more in the future. But to start off, if things don’t make
sense at first reading, try web searches, referring to a beginners book
or just experiment with code in different ways to see what happens.
R contains most functions and constants for maths you can think of, for instance try:
sqrt(9) 3^2 exp(1) pi
It is well and good to be able to do calculations in R, but what if we want to save our number and reuse it later?
That is what variables are for. Try typing this into your console:
x <- 3^2
Note that the console returns absolutely nothing, but it is ready for more input (
>). Try typing
x and the console reminds us that three squared in fact equals
9. To run through it point by point, the command we are giving says: compute
3^2 and assign (the assign command is this
<-) the answer to
The spaces around the assign don’t matter, you can leave them out.
However, I prefer to have the spaces because it makes the code more
readable. You may also see some people use
= instead of
<-. Both are ok, but for technical reasons that will be explained later, it is better to use
<- for assigning values to variables.
If we want R to return the answer when the calculation is computed then enclose the sum in brackets like this
(x <- 3^2).
Now, what if we try to recall a variable that does not exist? For instance type
y. R reminds us, in red font, that we never created a variable called
Now we have created
x we can use it in algebraic equations. For instance try
z <- x*3 and you will see we get variable
z that equals
We can also assign words to variables:
v <- 'today'
Note that we enclosed our word in inverted commas. If we didn’t do this, R would try to assign the value of the variable
A few tips on naming variables. You can name variables whatever you want (e.g. you could use
nine instead of
x), but there are a few rules: (1) you can’t use special characters in variable names (other than
and (2) you must start a variable name with a letter. It is also a good
idea to avoid naming variables as constants or functions that already
exist in R, like
So far our variables only have one number. Note that when we recall a value, e.g.
x the console prints
 9. The
 means that the first number stored in
x is 9. We can store more than one number or word in a variable using the
x <- c(22, 23, 24) x v <- c('yesterday', 'today', 'tomorrow')
Which creates variables each with three values. Note again I use spaces purely for clarity, they are not essential.
We can index variables to access certain numbers using
returns the first value of
x and the second value of
v. We can also use indices to recall multiple values at once. For instance try
x[1:3] x x[c(1,3)]
Does the output make sense?
We can also use logicals (
FALSE) for indexing. To see how this works, let’s pose R some logical questions
42 == 54 42 == 7*6 x > 10 x != 22
The relational operators used here ask questions such as
42 == 52 means ‘does 42 equal 54?’. To see more of them type
help('=='). You will see they return
FALSE depending on whether the relation is true or false.
Logicals are very useful because we can use them for indexing. For instance
returns a vector of T/F, telling us that the first two values of
are not greater than 23 but the last value is. We can store this
logical vector and use it to return values of x that are greater than
ix <- x>23 x[ix]
We can also apply this idea to words
iv <- v == 'today' v[iv]
There are other specialised functions if you want to do partial searches for words (like
grep()), but we won’t cover those in this basic course.
One final thing, if you don’t like the logicals, you can also use the
which() function, which turns logicals into indices
ix <- which(x>23) ix # the third value of x is greater than 23 x[ix] #the third value of x is 24.
So far we have just been typing everything into the console. However,
we might want to save our commands for later. To do this go to the menu
and select file -> New file -> R Script. It will
open a text editor. From now on, type all your code into this editor.
To send the code to the console, click on a line, or select a section
and hit ctrl-R on windows or cmnd-enter on mac.
We can put comments in the script, to remind us what it does. R will not evaluate comments, which are preceeded by an
#. For instance, start your script by typing:
# R data wrangling course # Your name # 23 Jun 2015
I recommend using comments frequently throughout your code. It will make life a lot easier when you come back to a piece of code after a few days, because the comments will remind you what the code does. It will also make it easier to share your code with collaborators and make your code more repeatable, which as I said, is a core principle of science. Another way to make your code more readable is to break up sections of code within your script, for instance here is how I often delineate sections in my code
#-----------------------# # MAJOR HEADING # #-----------------------# # # Minor heading #
Now we have a script with something worth saving, you should go to
the menu and select ‘file’ -> ‘save as’ to save the script for later.
Next time you come back to your analysis, you can just open this file,
select everything and hit run. Then you will be back to where you left
off last time. Further, you can share your script with other people or
publish it with your paper.
Don’t forget what I said about repeatability and readability! It can take time to organise your script carefully, but it is worth it. I encourage you to include ample comments and section headers as we progress with the rest of the course.
R really becomes useful when we can read in our own datasets. To start, we are going to read in a dataset of fish surveys,
This is a practice data set and is very small (only 21 rows). You might
think wrangling this data would be easier in a spreadsheet, like excel.
We will develop our wrangling skills on this small dataset then test them
on a much larger dataset, which would be tricky to handle in a
The first step is to tell R where on your computer the dataset is stored. In your script type this (but replace the path with that for your own computer)
# #DATA IMPORT # setwd('/Users/Documents/R data wrangling')
setwd() stands for ‘set working directory’. You will need to replace the folder path in the
with the path on your computer. On windows you can right click on the
file and select ‘properties’, on mac use ‘get info’, then cut and paste
the path. Note that if you cut and paste on a windows computer you will
have to replace the
/. I also have put a
code header in here as an example. I won’t do this below, but I suggest
you put in code headers for each new section we cover, it will make it
easier for you to come back to this script later.
We can see the files in this folder by typing
list.files(). The first data we will use is called
dat.csv and can be read in as a
dataframe like this
fishdat <- read.csv('dat.csv', header = TRUE, stringsAsFactors = FALSE)
Notice we used the assign operator again, but this time assigned the input data to the name
read.csv is function that reads a csv file and stores it in R’s local memory. Functions are simply algorithms that R has stored, which perform a task for you, in this case reading a csv file. Functions always begin with a name followed by
(), which contain some arguments. In the case of the
read.csv() function, the arguments are the name of the data file (‘dat.csv’),
header = TRUE and
stringsAsFactors = FALSE. The
argument simply says the first row is a row of variable names (so don’t
treat them as measurements). The strings as factors arguments tells R to treat strings (ie words) as strings and not to group them by names. You will see why we do this later. Also notice we use
= to specify the function’s arguments. As I said above, it is best to use
<- for assigning variable names and save
= for function arguments.
If you need more help on any function in R type its name prefixed with a question mark, for instance
Now lets look at
fishdat. Either type
into the console, or on RStudio, go to the Environment window and
double click on ‘fishdat’ under ‘Data’. You will see a table of data
like this (I have appended a truncated version here)
fishdat is what we call ‘tidy data’. Each row is an
observation. Here we have multiple sites, each with five transects. We
have five variables
site, the names of sites,
transect the transect number,
fish_abund the abundance of fish at that transect and
the ratio of nitrogen stable isotopes of a seagrass sample from that
transect (high nitrogen isotope values indicate the influence of
In short, don’t use .xlsx or .xls files for saving data. The
problem with .xls and .xlsx files are that they store extra info with
the data that makes files larger than neccessary and Excel formats can also unwittingly alter your data!
This kind of thing can really mess you up. A stable way to save your data is as a
which stands for comma seperated values. These are simply values
seperated by commas and rows defined by returns. If you select ‘save as’
in excel, you can choose .csv as one of the options. Try opening the
.csv file I have given you using a text editor and you will see it is
just words, numbers and commas. Obviously using the csv format means you
need to avoid commas in your variable names and values, because all
commas will be interpreted as spaces between cells in the table.
As an example of the problems with excel files, type some numbers into an excel spreadsheet then click ‘format’ -> ‘cells’ and change the category to ‘time’. Now save it as a .csv and open it with a text editor. Your numbers have changed from what you typed in. Because excel automatically detects number types, if you enter something that looks like dates it might decide to change them all to dates and in doing so alter your data.
The rules we learned above for accessing values of a variable also apply to dataframes. Except now we have both rows and columns. Rows and columns are indexed by using a comma as a separator. Try
fishdat[1,3] fishdat[,2] fishdat[1,] fishdat[ ,c(1,3)]
We can also access variables by their names:
fishdat$site fishdat$fish_abund[1:5] fishdat[,'transect']
We can get a summary of the data frame by typing:
head(fishdat) #first 6 rows by default head(fishdat, 10) #first ten rows
Useful if the dataframe is very large!
We can also do some statistical summaries. This is handy to check the data is ready in correctly and even to start some analysis. For instance:
nrow(fishdat) mean(fishdat$fish_abund) sd(fishdat$fish_abund) #standard deviation of abundances table(fishdat$site) #number of transects per site table(fishdat$transect) #number of sites per transect
Did you notice a mistake in fishdat?
Yes, R distinguishes between upper and low case, so it is treating sites
as different sites. They are meant to be the same. We will find out how
to correct this typo in a moment. But first, we need to locate its
We can locate the indices for certain variables using logical relations, for instance
fishdat$site == 'A'
Returns a vector of
FALSE, with the same number of rows as the dataframe.
TRUE occures where the site is
FALSE where the site isn’t
The double equals above is used to ask a question (ie is site A?). If
you prefer row indices to T/F, you can enclose the logicals in the
which(fishdat$site == 'A')
So the first five rows are site
A. There are other sorts of logical quesions for instance:
which(fishdat$site !='A') which(fishdat$dN15_SG > 11.99) which(fishdat$dN15_SG >= 11.99)
Earlier we noticed a typo where one transect at site ‘B’ was mistakenly typed as site ‘b’. We should identify which ‘b’ is wrong and correct it.
ierror <- which(fishdat$site == 'b') #Find out which row number is wrong fishdat$site[ierror] fishdat$site[ierror] <- 'B' #Change it to the correct upper case.
We could also have used the
toupper() function to change the case of ‘b’ (or infact the case of every value in
You might be thinking, ‘I could have easily changed that error in a
spreasheet’. Well you could have for a dataframe with 20 observations,
but what if the data had 1 million observations?
## Saving our corrected data frame Now we have corrected our dataframe, we may want to save it for later. We can do this with
write.csv() which is basically the opposite of `read.csv()
write.csv(fishdat, 'dat_corrected.csv', row.names = FALSE)
Which saves a new csv file to our current working directory (have a
look in your folder and you should see it there). If you want to know
what your current working directory is, type
Lets do a few graphical summaries. Try these
hist(fishdat$fish_abund) plot(fishdat$fish_abund, fishdat$dN15_SG)
Which plots a histogram and an xy plot of fish abundances against the
nitrogen isotope ratios for seagrass at each transect. Looks like there
is a relationship. We can improve our plot slightly by using some extra
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
pch command changes the symbol type. In fact, a whole host of graphical parameters are available with the
par() function (try
?par to see them all). For instance
par(las=1) #plot text the right way up plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15) par(mfrow = c(1,2)) #plot window with two columns hist(fishdat$fish_abund, xlab ='Fish abundance', main ='Histogram') plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15, col ='steelblue')
Many users of R also develop add on packages, which
they then share with the world. Today we are going to use some packages
that make data wrangling easy. First you need to install them to your
computer (you will need a web connection to do this). I will run through
package installs on the projector. Today we need the
tidyr (for tidying dataframes) and
dplyr (‘data pliers’) packages. Once you have installed these, you need to load them for each new R session like this:
The red warning just means some functions in R’s base packages have been replaced with those in
dplyr, nothing for us to worry about here.
All the operations we will perform in tidyr and dplyr can be done using R’s base packages. However, we are using these special packages because they make data wrangling easy, intuitive and aesthetic. Remember I said that a good script can make you a popular scientist? Well dplyr and tidyr are good examples of that. They do some tasks we could do already in R but makes them easier and faster.
We can filter a dataframe with dplyr to say, just get the rows for one site
fish_A <- filter(fishdat, site == 'A') fish_A
Which says, take the
fishdat dataframe and filter it for rows that have
site=='A'. One way this could be useful would be to help us make a plot where the sites are different colours, for instance
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abunance', pch = 15, col ='grey') points(fish_A$dN15_SG, fish_A$fish_abund, pch = 15, col = 'red')
Here we have plotted all the points (an easy way to get the axes to scale correctly) and then used
points() to add points for site A, but changed their colours. Can you work out how to add the other three sites as different colours?
Summarising data is straightforward with
follows is somewhat similar to making a pivot table in excel, but
without all the clicking. First we need to specify what we are
summarising over, using the
grps <- group_by(fishdat, site) grps
grps is similar to the orginal data frame, but an extra
property has been added ‘Groups:site’. Now we have specified a grouping
variable, we can
summarise() by sites, for instance the mean values of the nitrogen isotopes and fish abundances
summarise(grps, mean_dN = mean(dN15_SG)) #just nitrogen summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund)) #means for nitrogen and fish
Here is what our data summary looks like
## Source: local data frame [4 x 3] ## ## site mean_dN mean_abund ## 1 A 9.666 6.4 ## 2 B 15.260 18.6 ## 3 C 10.878 4.2 ## 4 D 14.752 19.6
You could summarise using all sorts of other functions, like
min(). Try it for yourself.
Let’s store our new data summary then, use its variables in some plots
datsum <- summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund)) #Make a plot with all variables plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15) #Add the means summaries as red points points(datsum$mean_dN, datsum$mean_abund, pch = 16, col ='red')
Notice that we just did a series of data-wrangling steps. First we identified groups then we created some summaries. We could write this code all on one line like this
summarise(group_by(fishdat, site), mean_dN = mean(dN15_SG))
Which will return the same summary as above. But, notice we have nested a function inside a function! Not very aesthetic and difficult for others to follow. One solution would be to do the code over multiple lines, like we did above
grps <- group_by(fishdat, site) summarise(grps, mean_dN = mean(dN15_SG))
One clever thing that the creaters of
dplyr did was to add data ‘pipes’ to improve the aesthetics of multi-function commands. Pipes look like this
%>% and work like this
fishdat %>% group_by(site) %>% summarise(mean_dN = mean(dN15_SG))
Pipes take the output of the first function and apply it as the first
argument of the second function and so on. Notice that we skip
specifying the dataframe in
summarise() because the data was specified at the start of the pipe. The pipe is like saying, take
fishdat pipe it into
group_by(), then pipe the result into
Now our code reads like a sentence ‘take fishdat, do this, then this,
then this’. Rather than before which was more like ‘apply this function
to the output of another function that was applied to fishdat’.
Our fish abundances and seagrass isotopes aren’t the end point of our data analysis. We also want to know how they relate to some variables we measured at each site. To see these load
sitedat <- read.csv('Sitesdat.csv', header = TRUE, stringsAsFactors = FALSE) sitedat
dplyr has a range of functions for joining dataframes. All are prefixed with join. Here we use the
inner_join() to find out about the others look at the help file (
datnew <- inner_join(fishdat, sitedat, by ='site') datnew
We have joined fishdat and sitedat by their common variable site. Notice the join automatically replicates
depth across all the transects, keeping out dataframe ‘tidy’. One analysis we could do is plot the seagrass isotope ratios by
distance (which represents distance from a sewage plant).
plot(datnew$distance, datnew$dN15_SG, xlab ='Distance from sewage', ylab ='Nitrogen isotope ratio of seagrass')
Looks like the pollution source is enriching the nitrogen in the seagrass.
As a challenge, can you think of a way of plotting the mean values of
dN15_SG at each distance?
We have yet more data. Our collaborator has helped us out and taken
three fish of different sizes from each site (small, medium and large)
and run isotope analyses on those. This is great, because it means we
can see if fish isotopes follow the same pattern as the seagrass.
Let’s look at the data
fishiso <- read.csv('FishIsodat.csv', header = TRUE, stringsAsFactors = FALSE) fishiso
It seems our collaborator was not familiar with the ‘tidy data’
format. Notice that values for the isotope ratios at each site are
stored as separate columns. We will need to tidy this data if we are to
proceed. Luckily we have the
tidyr() package handy and one function
gather() will be particularly handy here. Check out the vignettes other useful functions to tidy data)
(isotidy <- gather(fishiso, site, dN_fish, A:D))
## size site dN_fish ## 1 small A 10.04 ## 2 medium A 11.21 ## 3 large A 10.77 ## 4 small B 16.10 ## 5 medium B NA ## 6 large B 16.50 ## 7 small C 12.23 ## 8 medium C 13.18 ## 9 large C 13.20 ## 10 small D 14.93 ## 11 medium D 14.44 ## 12 large D 13.98
The tidy data looks much better. What
gather() has done is take the
fishiso dataframe, creating new variables
dN_fish and made them from the values in columns A to D.
NA for the medium fish at site
B. Unfortunately, we couldn’t catch a fish of that size at site B. We have encoded the missing value as
NA when we entered the data.
NA is the correct way to indicate a missing value to R.
Let’s create a summary of our new fish isotope ratio data, using our new piping skills.
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish)))
## Source: local data frame [3 x 2] ## ## size mean_dN ## 1 large 13.6125 ## 2 medium NA ## 3 small 13.3250
Which says, take isotidy, group it by size then make a summary of
mean values for dN_fish. Well we get the means, but not for the size
category with the missing value. Why? R’s default setting for taking a mean across a vector with a missing value is the return
does this to make sure we are aware there are missing samples (and
therefore, our design is unbalanced). We can change this default
behaviour however, and ask R to ignore the NA by adding the argument
na.rm = TRUE
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish, na.rm = TRUE)))
## Source: local data frame [3 x 2] ## ## size mean_dN ## 1 large 13.61250 ## 2 medium 12.94333 ## 3 small 13.32500
We have our mean for the medium fish, but be aware, it was taken from only three sites. Let’s also do some means by sites (across sizes)
sites_iso_mn <- isotidy %>% group_by(site) %>% summarise(mean_dNfish = mean(dN_fish, na.rm = TRUE))
Now we have combined all our data frames, let’s see if we can plot nitrogen isotope ratios for both fish and seagrass on the same plot, as a function of distance from the sewage plant. First, join the mean values for fish isotope ratios to the site data
sites_means <- inner_join(sitedat, sites_iso_mn, by = 'site')
Then join our new sites dataframe with the summary of fishdat.
(sites_means2 <- inner_join(sites_means, datsum, by = 'site'))
## site distance depth mean_dNfish mean_dN mean_abund ## 1 A 5 10 10.67333 9.666 6.4 ## 2 B 12 11 16.30000 15.260 18.6 ## 3 C 6 4 12.87000 10.878 4.2 ## 4 D 11 3 14.45000 14.752 19.6
Note, you could try doing this all in one line, using pipes.
Now lets put it all together in a plot
par(las = 1) plot(sites_means2$distance, sites_means2$mean_dN, xlab = 'Distance from sewage plant', ylab = 'N isotope ratio', pch = 16, ylim = c(5, 20), col = 'springgreen3', cex = 1.5, cex.lab = 1.3) points(sites_means2$distance, sites_means2$mean_dNfish, xlab = 'Distance from sewage plant', ylab = 'N isotope ratio', pch = 16, col ='steelblue1', cex = 1.5) legend('topleft', legend = c('Seagrass', 'Fish'), pch = 16, col =c('springgreen3','steelblue1'), cex = 1.5)
To explain some new arguments above,
ylim sets the y-limits,
cex sets the scaling of points and
cex.lab resizes the axes labels. The
legend() function lets us add a legend to the plot. Note, if you want to try different colours, type
colors() to see a full list. The
package is also extremely handy for picking colours. Finally, you
should also plot the correlation of mean fish versus mean seagrass
isotope ratios. I will leave that to you to work out.
If you have cruised through the first part of the course, have a go
with this advanced challenge question. I have given you some extra
dataframes to practice your data wrangling skills with. These are much
larger, so you will need to use R to find errors in them. The dataframes are:
Fish_wghts.csv which records individual weights and sexes for fish caught in 100 different trawls; and
which records for each trawl: its distance from shore and its depth. We
want to know how the mean weight of individual fish varies with
distance from shore, how the total population biomass varies with
distance from shore and if the the different sexes have different
weights. Beware: there are two errors in the fish_wghts file that you will need to correct!
I hope I have convinced you of how useful R can be for data wrangling. R might be hard at first, but the more you use R the easier it will become. If you can use R
you will find data management much more straightforward and repeatable.
Aim to write aesthetic scripts and you can avoid many head-aches when
you realise there are errors in your database- with a neat script you
can trace back to the problem and correct it.
Data wrangling is a skill that take times to learn. But I believe it is worthwhile. Being good at data wrangling will make you a sought after collaborator and a better scientist.