22 Jun 2016
I’m sitting here at the International Coral Reef Symposium in Hawaii. Looking at all the exciting graphs, data and analysis going up got me thinking: what are the key Rstats tools coral reef scientists should learn?
Starting out with Rstats can be overwhelming because there are so many tools and packages available (but check out some of the free courses on my page). So here is a quick list of the key tools (packages and R functions) you should learn.
A data visualisation tool tops the list. Data visualisation is the first step of analysis - you can quickly look for interesting patterns and check for errors. The ggplot2 (‘grammar of graphics 2’) package makes dataviz quick and easy. You can rapidly slice and dice your data in different ways for viz and also overplot simple analyses, like regressions. Check out the guide here.
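As a rough sketch of what that looks like in practice, the snippet below plots coral cover against temperature and overplots a regression line for each reef zone. The data are simulated and the column names (temperature, zone, coral_cover) are made up for illustration.

```r
library(ggplot2)

# Simulated (hypothetical) data: coral cover at 40 sites in two reef zones
set.seed(42)
reef <- data.frame(
  temperature = runif(40, 26, 31),
  zone = rep(c("crest", "slope"), each = 20)
)
reef$coral_cover <- 80 - 2 * reef$temperature + rnorm(40, sd = 3)

# Scatter plot coloured by zone, with a fitted regression line per zone
p <- ggplot(reef, aes(x = temperature, y = coral_cover, colour = zone)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Temperature (deg C)", y = "Coral cover (%)")

print(p)
```

Swapping colour = zone for facet_wrap(~ zone) would give one panel per zone instead, which is the kind of quick re-slicing ggplot2 is good at.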
Linear models are becoming the dominant statistical tool for univariate data analysis in ecology and have largely replaced ANOVA methods. For instance, if you want to know how coral growth relates to temperature, how fish biomass relates to fishing pressure and so on, you would probably use a linear model. Linear models are implemented in R with the function lm().
A few quick examples of why lm() is so great.
Just using a single continuous predictor - that is linear regression. Multiple continuous predictor variables: multiple regression. If you combine categorical and continuous predictor variables you have your classic ANCOVA analysis.
Use the predict() function to see what your trend lines look like.
You can get confidence intervals using the confint() function. Do away with p-values, they are dated 20th century statistics. Well ok, p-values have their uses, but confidence intervals let you check how strong your effect is, as well as its significance.
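Here is a minimal sketch that pulls lm(), predict() and confint() together. The data are simulated, and the variable names (growth, temperature, habitat) are just placeholders for whatever you measured.

```r
# Simulated (hypothetical) data: coral growth vs temperature in two habitats
set.seed(1)
n <- 60
corals <- data.frame(
  temperature = runif(n, 26, 31),
  habitat = factor(rep(c("lagoon", "slope"), each = n / 2))
)
corals$growth <- 20 - 0.5 * corals$temperature +
  2 * (corals$habitat == "slope") + rnorm(n, sd = 0.5)

# Continuous + categorical predictor: the classic ANCOVA layout
fit <- lm(growth ~ temperature + habitat, data = corals)
print(summary(fit))

# Trend lines: predicted growth over a temperature grid for each habitat
newdat <- expand.grid(
  temperature = seq(26, 31, by = 0.5),
  habitat = levels(corals$habitat)
)
newdat$fit <- predict(fit, newdata = newdat)

# 95% confidence intervals for each coefficient
ci <- confint(fit)
print(ci)
```

Dropping habitat from the formula gives a plain linear regression, and adding more continuous predictors gives a multiple regression. The function call stays the same.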
Once you have mastered lm(), glm() (generalized linear model) is a must for ecologists. We often deal with non-normal data (i.e. data that don’t fit a normal distribution), like presence/absence data and counts. lm() is inappropriate for this type of data. Way back in the 70s some clever people invented GLMs as a generalization of linear regression that also covers logistic regression. GLMs can handle many types of data.
Check out Zuur et al.’s book for a handy guide.
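To make that concrete, here is a small sketch fitting a Poisson GLM to counts and a binomial GLM to presence/absence. The data are simulated and the names (pressure, count, present) are invented for the example.

```r
# Simulated (hypothetical) data: fish counts and presence/absence
# along a gradient of fishing pressure
set.seed(2)
fish <- data.frame(pressure = runif(100, 0, 5))
fish$count <- rpois(100, lambda = exp(3 - 0.4 * fish$pressure))
fish$present <- rbinom(100, size = 1, prob = plogis(2 - 1 * fish$pressure))

# Counts: Poisson GLM (log link by default)
fit_count <- glm(count ~ pressure, family = poisson, data = fish)

# Presence/absence: binomial GLM (logit link by default), i.e. logistic regression
fit_pres <- glm(present ~ pressure, family = binomial, data = fish)

# Predictions on the response scale: expected count and probability of presence
newdat <- data.frame(pressure = 0:5)
newdat$expected_count <- predict(fit_count, newdata = newdat, type = "response")
newdat$prob_present <- predict(fit_pres, newdata = newdat, type = "response")
print(newdat)
```

Note the type = "response" argument: without it, predict() returns values on the link scale (log counts and log-odds here).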
Ecological data is never simple and our data are often grouped. For instance, we often have blocks in experiments or transects within sites from field surveys. The package lme4 lets you implement random effects in linear models. Random effects are a flexible way to account for nuisance sources of variation, which may be important but for which we don’t have a specific hypothesis about which way the effect will go.
Two key functions in lme4 are lmer(), which is the random effects version of lm(), and glmer(), the random effects version of glm(). Again, check out Zuur et al.’s book for a handy guide.
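A sketch of the transects-within-sites case might look like the following. Everything here is simulated, and the names (site, depth, cover, count) are hypothetical. The (1 | site) term gives each site its own random intercept.

```r
library(lme4)

# Simulated (hypothetical) survey: 10 sites, 6 transects per site
set.seed(3)
n_sites <- 10
n_transects <- 6
site_id <- rep(1:n_sites, each = n_transects)
site_effect <- rnorm(n_sites, sd = 2)  # site-to-site variation

survey <- data.frame(
  site = factor(site_id),
  depth = runif(n_sites * n_transects, 2, 20)
)
survey$cover <- 50 - 1.2 * survey$depth +
  site_effect[site_id] + rnorm(nrow(survey), sd = 3)
survey$count <- rpois(
  nrow(survey),
  lambda = exp(2 - 0.05 * survey$depth + 0.15 * site_effect[site_id])
)

# Random intercept for site: the mixed-model version of lm()
fit_lmm <- lmer(cover ~ depth + (1 | site), data = survey)

# Same grouping structure for count data: the mixed-model version of glm()
fit_glmm <- glmer(count ~ depth + (1 | site), family = poisson, data = survey)

print(fixef(fit_lmm))    # fixed-effect estimates (intercept and depth slope)
print(VarCorr(fit_lmm))  # between-site and residual standard deviations
```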
Got a large data-set? Need to merge data-sets by a common ID (e.g. site names)? Need to generate new variables from existing variables? Need to create data summaries? The package dplyr (‘data pliers’) is a must for all these tasks. Check out the vignette.
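Here is a small sketch that does all three jobs in one pipeline: merge by site name, create a new variable, then summarise by site. The two data frames and their column names are invented for illustration.

```r
library(dplyr)

# Hypothetical transect data and a site lookup table sharing the key 'site'
transects <- data.frame(
  site = c("A", "A", "B", "B", "C", "C"),
  hard_coral = c(30, 40, 10, 20, 55, 45),
  soft_coral = c(5, 10, 2, 4, 10, 20)
)
sites <- data.frame(
  site = c("A", "B", "C"),
  status = c("protected", "fished", "protected")
)

site_summary <- transects %>%
  left_join(sites, by = "site") %>%                   # merge by common ID
  mutate(total_coral = hard_coral + soft_coral) %>%   # new variable
  group_by(site, status) %>%                          # one group per site
  summarise(
    mean_total = mean(total_coral),
    n_transects = n(),
    .groups = "drop"
  )

print(site_summary)
```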
Ok, so you have entered your data (or got it from someone else), but wait a minute, the way you entered it doesn’t let you use lm() (if you followed my guide you shouldn’t have this problem). Do you need to reenter it?
tidyr (for tidy data) to the rescue. The package tidyr lets you reformat data (e.g. from wide format to long format) into the layout that most stats functions prefer. Check out this blog.
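For example, suppose coral cover was entered with one column per survey year. The sketch below reshapes it to long format so that year becomes a variable you can hand to lm(). The data and column names are made up, and it assumes a recent tidyr (1.0.0 or later) for pivot_longer(); older versions did the same job with gather().

```r
library(tidyr)

# Hypothetical wide-format data: one row per site, one column per year
wide <- data.frame(
  site = c("A", "B", "C"),
  cover_2014 = c(35, 20, 50),
  cover_2015 = c(30, 22, 41),
  cover_2016 = c(12, 18, 25)
)

# Wide to long: one row per site-year combination
long <- pivot_longer(
  wide,
  cols = starts_with("cover_"),
  names_to = "year",
  names_prefix = "cover_",   # strip the prefix so only the year remains
  values_to = "cover"
)
long$year <- as.integer(long$year)
print(long)

# The long layout is what lm() expects: one column per variable
fit <- lm(cover ~ year, data = long)
print(coef(fit))
```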
vegan provides good old multivariate analysis, without the cost of Primer. Admittedly this package is slightly harder to use than Primer, but hey, if you are already using R, the jump to vegan is easy. Check out the vignettes to get started.
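To give a flavour, here is a minimal sketch of a Bray-Curtis dissimilarity matrix and an nMDS ordination, two staples of multivariate community analysis. The community matrix is simulated random counts, so don't expect any ecological pattern in the plot.

```r
library(vegan)

# Simulated (hypothetical) community matrix: rows are sites, columns are taxa
set.seed(4)
comm <- matrix(
  rpois(12 * 6, lambda = 5),
  nrow = 12,
  dimnames = list(paste0("site", 1:12), paste0("taxon", 1:6))
)

# Bray-Curtis dissimilarities between all pairs of sites
bray <- vegdist(comm, method = "bray")

# Non-metric multidimensional scaling in two dimensions
ord <- metaMDS(comm, distance = "bray", k = 2, trace = FALSE)

# Site scores on the two nMDS axes, plus a quick ordination plot
site_scores <- scores(ord, display = "sites")
print(head(site_scores))
plot(ord, type = "t")
```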
That’s my top 7. Good luck getting started.