Notice we used the assignment operator again, but this time we assigned the input data to the name fishdat.
read.csv() is a function that reads a csv file and stores it in R's local memory. Functions are simply algorithms that R has stored, which perform a task for you, in this case reading a csv file. Functions always begin with a name followed by (), which contains the arguments. In the case of read.csv(), the arguments are the name of the data file ('dat.csv'), header = TRUE and stringsAsFactors = FALSE. The header argument says the first row contains variable names (so R doesn't treat them as measurements). The stringsAsFactors argument tells R to keep strings (i.e. words) as strings, rather than converting them to factors (grouped categories). You will see why we do this later. Also notice we use = to specify the function's arguments. As I said above, it is best to use <- for assigning variable names and to save = for function arguments.
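To recap, the full call looks like this (assuming your file is named dat.csv and sits in your working directory):
fishdat <- read.csv('dat.csv', header = TRUE, stringsAsFactors = FALSE)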
If you need more help with any function in R, type its name prefixed with a question mark, for instance ?read.csv.
Now let's look at fishdat. Either type fishdat into the console, or in RStudio, go to the Environment window and double click on 'fishdat' under 'Data'. You will see a table of data like this (I have appended a truncated version here).
A note about Excel files
In short, don't use .xlsx or .xls files for saving data. The problem with .xls and .xlsx files is that they store extra information alongside the data, which makes files larger than necessary, and Excel's formats can also unwittingly alter your data!
This kind of thing can really mess you up. A stable way to save your data is as a .csv, which stands for comma separated values. These are simply values separated by commas, with rows defined by line breaks. If you select 'Save As' in Excel, you can choose .csv as one of the options. Try opening the .csv file I have given you using a text editor and you will see it is just words, numbers and commas. Obviously, using the csv format means you need to avoid commas in your variable names and values, because every comma will be interpreted as a break between cells in the table.
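For instance, the first few rows of a csv might look something like this (the values here are made up, just to illustrate the format):
site,transect,fish_abund
A,1,5
A,2,7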
As an example of the problems with Excel files, type some numbers into an Excel spreadsheet, then click 'Format' -> 'Cells' and change the category to 'Time'. Now save it as a .csv and open it with a text editor. Your numbers have changed from what you typed in. Because Excel automatically detects number types, if you enter something that looks like dates it might decide to change them all to dates, and in doing so alter your data.
Accessing variables in the dataframe
The rules we learned above for accessing values of a variable also apply to dataframes, except now we have both rows and columns. Rows and columns are indexed using a comma as a separator. Try:
fishdat[1, 3]       # the value in row 1, column 3
fishdat[, 2]        # all rows of column 2
fishdat[1, ]        # all columns of row 1
fishdat[, c(1, 3)]  # all rows of columns 1 and 3
We can also access variables by their names:
fishdat$site             # one column, accessed by name
fishdat$fish_abund[1:5]  # the first five values of fish_abund
fishdat[, 'transect']    # a named column, using bracket notation
Basic data summaries
We can get a summary of the data frame by typing:
head(fishdat)      # the first six rows
head(fishdat, 10)  # the first ten rows
Useful if the dataframe is very large!
We can also do some statistical summaries. These are handy to check the data has been read in correctly, and even to start some analysis. For instance:
nrow(fishdat)             # number of rows
mean(fishdat$fish_abund)  # mean fish abundance
sd(fishdat$fish_abund)    # standard deviation of fish abundance
table(fishdat$site)       # counts of observations at each site
table(fishdat$transect)   # counts of observations on each transect
Did you notice a mistake in fishdat?
Yes, R distinguishes between upper and lower case, so it is treating sites b and B as different sites. They are meant to be the same. We will find out how to correct this typo in a moment. But first, we need to locate its position.
We can locate the indices of certain values using logical comparisons, for instance:
fishdat$site == 'A'
This returns a vector of TRUE and FALSE values, one for each row of the dataframe. TRUE occurs where the site is A and FALSE where the site isn't A. The double equals is used to ask a question (i.e. is the site A?). If you prefer row indices to TRUE/FALSE values, you can wrap the logical comparison in the which() function:
which(fishdat$site == 'A')
So the first five rows are site A. There are other sorts of logical questions, for instance:
which(fishdat$site != 'A')       # rows where the site is not A
which(fishdat$dN15_SG > 11.99)   # rows where dN15_SG is greater than 11.99
which(fishdat$dN15_SG >= 11.99)  # rows where dN15_SG is greater than or equal to 11.99
Correcting mistakes in data frames
Earlier we noticed a typo where one transect at site ‘B’ was mistakenly typed as site ‘b’. We should identify which ‘b’ is wrong and correct it.
ierror <- which(fishdat$site == 'b')  # find the row with the typo
fishdat$site[ierror]                  # check we have the right value
fishdat$site[ierror] <- 'B'           # overwrite it with the correct site code
We could also have used the toupper() function to change the case of 'b' (or in fact the case of every value in site). You might be thinking, 'I could have easily changed that error in a spreadsheet'. Well, you could have for a dataframe with 20 observations, but what if the data had 1 million observations?
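For instance, a one-line sketch of the toupper() approach would be:
fishdat$site <- toupper(fishdat$site)  # capitalise every site code in one go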
Saving our corrected data frame
Now we have corrected our dataframe, we may want to save it for later. We can do this with write.csv(), which is basically the opposite of read.csv().
write.csv(fishdat, 'dat_corrected.csv', row.names = FALSE)
Which saves a new csv file to our current working directory (have a look in your folder and you should see it there). If you want to know what your current working directory is, type getwd().
Basic plots
Let's do a few graphical summaries. Try these:
hist(fishdat$fish_abund)
plot(fishdat$fish_abund, fishdat$dN15_SG)
Which plots a histogram and an xy plot of fish abundances against the nitrogen isotope ratios for seagrass at each transect. Looks like there is a relationship. We can improve our plot slightly by using some extra arguments to plot()
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
The pch argument changes the symbol type. In fact, a whole host of graphical parameters can be set with the par() function, which affects all subsequent plots (try ?par to see them all). For instance:
par(las = 1)  # las = 1 rotates the axis labels to be horizontal
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
par(mfrow = c(1, 2))  # split the plotting window into 1 row and 2 columns
hist(fishdat$fish_abund, xlab ='Fish abundance', main ='Histogram')
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15, col ='steelblue')
Loading packages
Many users of R also develop add-on packages, which they then share with the world. Today we are going to use some packages that make data wrangling easy. First you need to install them on your computer (you will need a web connection to do this). I will run through package installs on the projector. Today we need the tidyr (for tidying dataframes) and dplyr ('data pliers') packages.
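In case you miss the demonstration, installing looks like this (it only needs doing once per computer):
install.packages(c('tidyr', 'dplyr'))
Once you have installed them, you need to load them for each new R session, like this: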
library(tidyr)
library(dplyr)
The red message just means some functions from R's base packages have been masked by functions of the same name in dplyr; nothing for us to worry about here.
All the operations we will perform with tidyr and dplyr can be done using R's base packages. However, we are using these packages because they make data wrangling easy, intuitive and aesthetic. Remember I said that a good script can make you a popular scientist? Well, dplyr and tidyr are good examples of that. They do some tasks we could already do in R, but make them easier and faster.
Filtering a dataframe using dplyr
We can filter a dataframe with dplyr to, say, just get the rows for one site:
fish_A <- filter(fishdat, site == 'A')
fish_A
Which says, take the fishdat dataframe and filter it for rows that have site == 'A'. One way this could be useful would be to help us make a plot where the sites are different colours, for instance:
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab = 'Fish abundance', pch = 15, col = 'grey')
points(fish_A$dN15_SG, fish_A$fish_abund, pch = 15, col = 'red')
Here we have plotted all the points (an easy way to get the axes to scale correctly) and then used points() to add points for site A, but changed their colours. Can you work out how to add the other three sites as different colours?
Summaries using dplyr
Summarising data is straightforward with dplyr. What follows is somewhat similar to making a pivot table in Excel, but without all the clicking. First we need to specify what we are summarising over, using the group_by() function:
grps <- group_by(fishdat, site)
grps
grps is similar to the original data frame, but an extra property has been added: 'Groups: site'. Now we have specified a grouping variable, we can summarise() by sites, for instance the mean values of the nitrogen isotopes and fish abundances:
summarise(grps, mean_dN = mean(dN15_SG))
summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund))
Here is what our data summary looks like
## Source: local data frame [4 x 3]
##
## site mean_dN mean_abund
## 1 A 9.666 6.4
## 2 B 15.260 18.6
## 3 C 10.878 4.2
## 4 D 14.752 19.6
You could summarise using all sorts of other functions, like sd() or min(). Try it for yourself.
Let's store our new data summary, then use its variables in some plots:
datsum <- summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund))
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
points(datsum$mean_dN, datsum$mean_abund, pch = 16, col ='red')
Aesthetics in multi-line coding
Notice that we just did a series of data-wrangling steps: first we identified groups, then we created some summaries. We could write this code all on one line, like this:
summarise(group_by(fishdat, site), mean_dN = mean(dN15_SG))
Which will return the same summary as above. But notice we have nested a function inside a function! Not very aesthetic, and difficult for others to follow. One solution would be to write the code over multiple lines, as we did above:
grps <- group_by(fishdat, site)
summarise(grps, mean_dN = mean(dN15_SG))
One clever thing the creators of dplyr did was to add data 'pipes' to improve the aesthetics of multi-function commands. Pipes look like this: %>%, and work like this:
fishdat %>% group_by(site) %>% summarise(mean_dN = mean(dN15_SG))
Pipes take the output of the first function and apply it as the first argument of the second function and so on. Notice that we skip specifying the dataframe in group_by() and summarise() because the data was specified at the start of the pipe. The pipe is like saying, take fishdat pipe it into group_by(), then pipe the result into summarise(). Now our code reads like a sentence ‘take fishdat, do this, then this, then this’. Rather than before which was more like ‘apply this function to the output of another function that was applied to fishdat’.
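Pipes also make it easy to spread a chain of commands over several lines, one step per line, which many R users find the most readable style of all:
fishdat %>%
  group_by(site) %>%
  summarise(mean_dN = mean(dN15_SG))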
Joining dataframes
Our fish abundances and seagrass isotopes aren't the end point of our data analysis. We also want to know how they relate to some variables we measured at each site. To see these, load:
sitedat <- read.csv('Sitesdat.csv', header = TRUE, stringsAsFactors = FALSE)
sitedat
dplyr has a range of functions for joining dataframes; all have join in their names. Here we use inner_join(). To find out about the others, look at the help file (?inner_join):
datnew <- inner_join(fishdat, sitedat, by ='site')
datnew
We have joined fishdat and sitedat by their common variable, site. Notice the join automatically replicates distance and depth across all the transects, keeping our dataframe 'tidy'. One analysis we could do is plot the seagrass isotope ratios by distance (which represents distance from a sewage plant):
plot(datnew$distance, datnew$dN15_SG, xlab ='Distance from sewage', ylab ='Nitrogen isotope ratio of seagrass')
Looks like the pollution source is enriching the nitrogen in the seagrass.
As a challenge, can you think of a way of plotting the mean values of dN15_SG at each distance?
Tidying messy data
We have yet more data. Our collaborator has helped us out and taken three fish of different sizes from each site (small, medium and large) and run isotope analyses on those. This is great, because it means we can see if fish isotopes follow the same pattern as the seagrass.
Let’s look at the data
fishiso <- read.csv('FishIsodat.csv', header = TRUE, stringsAsFactors = FALSE)
fishiso
It seems our collaborator was not familiar with the 'tidy data' format. Notice that the values of the isotope ratios at each site are stored as separate columns. We will need to tidy this data if we are to proceed. Luckily we have the tidyr package handy, and one function, gather(), will be particularly useful here. (Check out the package vignettes for other useful functions to tidy data.)
(isotidy <- gather(fishiso, site, dN_fish, A:D))  # the outer brackets print the result
## size site dN_fish
## 1 small A 10.04
## 2 medium A 11.21
## 3 large A 10.77
## 4 small B 16.10
## 5 medium B NA
## 6 large B 16.50
## 7 small C 12.23
## 8 medium C 13.18
## 9 large C 13.20
## 10 small D 14.93
## 11 medium D 14.44
## 12 large D 13.98
The tidy data looks much better. What gather() has done is take the fishiso dataframe and create two new variables, site and dN_fish, from the names and values of columns A to D.
Notice the NA for the medium fish at site B. Unfortunately, we couldn't catch a fish of that size at site B. We encoded the missing value as NA when we entered the data; NA is the correct way to indicate a missing value to R.
Summaries with missing values
Let’s create a summary of our new fish isotope ratio data, using our new piping skills.
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish)))
## Source: local data frame [3 x 2]
##
## size mean_dN
## 1 large 13.6125
## 2 medium NA
## 3 small 13.3250
Which says: take isotidy, group it by size, then make a summary of the mean values of dN_fish. Well, we get the means, but not for the size category with the missing value. Why? R's default behaviour when taking the mean of a vector containing a missing value is to return NA. R does this to make sure we are aware there are missing samples (and therefore that our design is unbalanced). We can change this default behaviour, however, and ask R to ignore the NA by adding the argument na.rm = TRUE.
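You can see this behaviour with a tiny example vector:
mean(c(1, 2, NA))                # returns NA
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5
Applying the same fix to our summary: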
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish, na.rm = TRUE)))
## Source: local data frame [3 x 2]
##
## size mean_dN
## 1 large 13.61250
## 2 medium 12.94333
## 3 small 13.32500
We have our mean for the medium fish, but be aware it was calculated from only three sites. Let's also do some means by site (across sizes):
sites_iso_mn <- isotidy %>% group_by(site) %>% summarise(mean_dNfish = mean(dN_fish, na.rm = TRUE))
Bringing it all together
Now let's bring all our data frames together and see if we can plot the nitrogen isotope ratios for both fish and seagrass on the same plot, as a function of distance from the sewage plant. First, join the mean values for fish isotope ratios to the site data:
sites_means <- inner_join(sitedat, sites_iso_mn, by = 'site')
Then join our new sites dataframe with the summary of fishdat.
(sites_means2 <- inner_join(sites_means, datsum, by = 'site'))
## site distance depth mean_dNfish mean_dN mean_abund
## 1 A 5 10 10.67333 9.666 6.4
## 2 B 12 11 16.30000 15.260 18.6
## 3 C 6 4 12.87000 10.878 4.2
## 4 D 11 3 14.45000 14.752 19.6
Note, you could try doing this all in one line, using pipes.
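For instance, something like this (an untested sketch, but it should match the stepwise version above):
sites_means2 <- sitedat %>%
  inner_join(sites_iso_mn, by = 'site') %>%
  inner_join(datsum, by = 'site')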
Now let's put it all together in a plot:
par(las = 1)
plot(sites_means2$distance, sites_means2$mean_dN, xlab = 'Distance from sewage plant', ylab = 'N isotope ratio', pch = 16, ylim = c(5, 20), col = 'springgreen3', cex = 1.5, cex.lab = 1.3)
points(sites_means2$distance, sites_means2$mean_dNfish, pch = 16, col = 'steelblue1', cex = 1.5)
legend('topleft', legend = c('Seagrass', 'Fish'), pch = 16, col =c('springgreen3','steelblue1'), cex = 1.5)
To explain some new arguments above: ylim sets the y-axis limits, cex sets the scaling of the points and cex.lab resizes the axis labels. The legend() function lets us add a legend to the plot. Note, if you want to try different colours, type colors() to see a full list. The RColorBrewer package is also extremely handy for picking colours. Finally, you should also plot the correlation of mean fish versus mean seagrass isotope ratios. I will leave that to you to work out.
Advanced tasks
If you have cruised through the first part of the course, have a go at this advanced challenge. I have given you some extra dataframes to practise your data-wrangling skills on. These are much larger, so you will need to use R to find the errors in them. The dataframes are: Fish_wghts.csv, which records individual weights and sexes for fish caught in 100 different trawls; and Trawl_sites.csv, which records each trawl's distance from shore and its depth. We want to know how the mean weight of individual fish varies with distance from shore, how the total population biomass varies with distance from shore, and whether the different sexes have different weights. Beware: there are two errors in the Fish_wghts file that you will need to correct!
Conclusion
I hope I have convinced you of how useful R can be for data wrangling. R might be hard at first, but the more you use it, the easier it will become. If you can use R, you will find data management much more straightforward and repeatable. Aim to write aesthetic scripts and you can avoid many headaches when you realise there are errors in your database: with a neat script you can trace back to the problem and correct it.
Data wrangling is a skill that takes time to learn, but I believe it is worthwhile. Being good at data wrangling will make you a sought-after collaborator and a better scientist.