Associate Professor Chris J. Brown (Griffith University, chris.brown@griffith.edu.au, Twitter: @bluecology)
Professor David Schoeman (University of the Sunshine Coast, dschoema@usc.edu.au)
Professor Anthony J. Richardson (The University of Queensland and CSIRO, a.richardson@maths.uq.edu.au)
Dr Bill Venables (The University of Queensland and CSIRO, Honorary Research Fellow, Bill.Venables@gmail.com)
Dr Christina Buelow (Griffith University, c.buelow@griffith.edu.au)
These notes are provided to students at the R Workshop, held at The University of Queensland in February 2023.
2023 Chris J. Brown, David Schoeman, Anthony J. Richardson, Bill Venables, Christina A. Buelow
The aim of today’s course is to train you in the basic skills you need to be proficient at spatial analysis and mapping. We’re going to look at data wrangling, with a focus on its application to spatial data. Then we will do some spatial statistics and mapping.
We’re going to focus on some popular packages for the data wrangling tasks, many of which are drawn from a group of packages known as the ‘tidyverse’.
For the spatial analysis we will use tools based around the ‘sf’ (simple features) and ‘terra’ packages.
The main principles I hope you will learn are:
By the end of this course you should know:
We’re aiming to give you a realistic experience, so today’s course will be based around a particular project that requires wrangling data, building up analysis and creating maps.
It makes sense to do your spatial analysis in R, because today R is the leading platform for environmental data analysis. R also has add-on packages that can make it a fully fledged Geographic Information System, from reprojecting spatial data to creating maps.
A real advantage of R over other GIS programs is that it covers everything from data wrangling and statistics to spatial analysis. So having your data in R opens up a huge array of cutting-edge analysis tools, which you can apply straight to your spatial datasets.
And R is totally free.
A core principle of science is repeatability. Creating and sharing your data scripts helps you to fulfill the important principle of repeatability. It makes you a better scientist and can also make you a more popular scientist: a well thought out script for a common wrangling or analysis problem may be widely used by others. In fact, these days it is best practice to include your scripts with publications.
Most statistical methods in R require that your data are input in a certain ‘tidy’ format. This course will cover how to use R to easily convert your spatial data into a ‘tidy’ format, which makes error checking and analysis easy.
Steps include restructuring existing spatial datasets and combining different spatial and non-spatial data-sets. We will also create some data summaries, plots and maps. Then we will do some spatial statistical analysis. Finally, just for fun, we will finish by creating a web based map of our data.
The course will be useful for people who want to explore, analyse and visualise their own spatial data in R.
We’re going to assume you’re using a recent version of R (>3.6.0) and in the course we will use the RStudio editor.
To work through these notes you will need to install the add-on packages readr, tidyr, ggplot2 and dplyr. Or you can just get the package tidyverse, which has these and more. You will also need sf, terra, leaflet and tmap. We will also use mgcv, which comes with R, so you won’t need to install that one.
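If you are missing any of these, something like this will install them (a one-off step; tidyverse bundles readr, tidyr, ggplot2 and dplyr):

install.packages(c("tidyverse", "sf", "terra", "leaflet", "tmap"))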
This isn’t an absolute beginner course, so we are going to assume you have some knowledge of R already. If you are an absolute beginner, then you should take our other course.
We will assume you have at least a basic understanding of how R works (e.g. scripts, console, how to access data and what a ‘package’ is). As a guide, you should already be able to read data into R using R code (i.e. not using the menus in RStudio) and create some basic plots.
The code in these notes is however complete, so you can run this entire course successfully without having to ‘know’ anything. Though it will be better for your own learning if you type the code out yourself.
If you get an error, well done! That is a chance to learn for the real world. So ask one of us for help. And if you don’t understand something, also don’t be afraid to just ask one of us for help.
You’re close to finishing your PhD on plankton ecology and just need to do one final chapter. Your supervisor isn’t being much help (he’s off on a global trek promoting his new book).
You’re at the International Plankton Symposium (IPS2020) and you gather the courage to talk to Professor Calanoid, your academic hero. After enduring a long rant about Professor Salp’s plenary talk (“she’s just a backboneless filter feeder who doesn’t do any real research herself”), Prof Calanoid mentions that she’s read your first PhD paper on zooplankton biogeography.
Prof Calanoid was impressed with the extent of R analysis in your biogeography paper and goes on to suggest you collaborate on a new database she is ‘working with’.
The database has extensive samples of copepod richness throughout Australia’s oceans and the Southern Ocean too. Prof Calanoid has a hypothesis - that like many organisms, copepod species richness (just the number of unique species) will be higher in warmer waters than cooler waters. But she needs help sorting out the data and running some stats.
It will be a super easy paper for you, just do some of your R stuff and you will be a coauthor. It will likely be published in the top journal, The Nature of Plankton.
Of course, if you do this job she will support your fellowship application to the International Plankton Research Institute.
All you have to do is sort out the copepod data, match it to ocean temperature data and run some stats to test this hypothesis.
Prof Calanoid also wants you to make some flashy graphs to put in the paper, and make an interactive map of the data that they can share with their funders.
Oh and time is short, this is an open-access database, so Prof Calanoid needs this done in the next 3 weeks so that she can submit a paper on it before Professor Salp does. So better drop all your other commitments.
Professor Calanoid sends you the data files. The spreadsheet copepods_raw.csv has measurements of copepod species richness from around Australia. Copepods are a type of zooplankton, perhaps the most abundant complex animals on the planet and an important part of ocean food-webs. Prof Calanoid has also sent you some other data, but has not explained what that is for yet. You’ll have to figure that out.
Copepod species were counted using samples taken from a Continuous Plankton Recorder. The CPR was towed behind ‘ships of opportunity’ (including commercial and research vessels). ‘Silks’ run continuously through the CPR and the plankton are trapped onto the silks, kind of like a printer that runs all day and night to record plankton in the ocean.
(The data we’ve given you are in fact modified from real data, provided by Professor Ant Richardson. Ant runs a plankton lab that is collecting and processing these data from a program called AusCPR. Find out more here.)
So Prof Calanoid’s data is what we’ll work with today. We’ve tried to make this as realistic a learning experience as possible. So be ready to face some errors in the data from Prof Calanoid!
So now we’re almost ready to start the course. But before we get started, there are a few technical things you need to know about how we will use R today.
We’ve provided all the data from Professor Calanoid in a sub-folder data-for-course.
If RStudio is already open when you open a script to get started, then don’t forget to set the working directory with setwd() or under the ‘Session’ menu.
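For example (the path below is just a placeholder; use the folder where you saved your script):

setwd("path/to/your/course-folder") # a placeholder path: the folder that contains data-for-course/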
If you make your own scripts, you should save them with the data folder as a subfolder and everything should work fine.
The modern quantitative scientist has to know a lot more about working with databases and data analysis than in the past. Scientists are increasingly integrating a large variety of spatial data-sets into their work. These analyses require matching data-sets that may have been entered in different ways, or cover different temporal and spatial scales.
All of these procedures can be termed data wrangling. In this course we are going to learn how R can be used for wrangling of spatial data. We’re going to work through a ‘realistic’ case-study.
As expert R users we have often been faced with situations where a collaborator has asked us to ‘just run some numbers’ on a dataset, and be rewarded with an ‘easy’ paper.
‘Easy’ is often far from the truth. And the time-consuming part isn’t the stats. It’s the cleaning of the data that takes a lot of time. And then often we need to match up the new data to existing data-sets, such as when we want to know whether the spatial distribution of plankton correlates with ocean temperature.
If you have to deal with large data-sets you may realise that data wrangling can take a considerable amount of time and skill with spreadsheet programs like Excel. Data wrangling is also dangerous for your analysis: if you stuff something up, like accidentally deleting some rows of data, it can affect all your results and the problem can be hard to detect.
Remember Prof Calanoid? Well let’s get started with that copepod richness data. In this part of the course we are going to clean it up and run some basic analyses.
Fire up RStudio, start a new R script (just click the symbol with the little green plus) and save the script in the same folder as where you have ‘data-for-course/’ as a sub-folder. You might like to call your script copepod-wrangling.R.
We always start our scripts with some comments that include a description of goals, our name and date. So do that too.
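Something like this works at the top of your script (the details here are just an example):

# Wrangling and mapping of Prof Calanoid's copepod richness data
# Your Name
# February 2023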
Now let’s look at that first spreadsheet Prof Calanoid sent us. We don’t know the data well, and Prof Calanoid hasn’t told us much about it (or sent us any meta-data on what it all means), so we will want to do some thorough checks and visuals before we run any analyses.
This mirrors situations that all of us (Dave, Ant, Bill and Chris) have often come across. We are given data by collaborators, so we need to check and do some visuals on it before we do the analysis, to make sure we understand it well and avoid errors.
It is common to see people hired to do an analysis of a ‘complete’ data-set, but it ends up taking them the entire contract just to sort out the data, which weren’t really complete after all. R can help speed up this process, so the analysis (what you’re ultimately paid to do) gets done.
We will load in the data using a package from the tidyverse called readr. readr is handy because it does extra checks on data consistency over and above what the base R functions do. Data frames imported by readr also print in summary form by default. Let’s see how:
library(readr)
dat <- read_csv("data-for-course/copepods_raw.csv")
dat
# A tibble: 5,313 × 11
silk_id segment_no latitude longitude sample_time_utc project route vessel
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 -28.3 154. 26/06/2009 22:08 AusCPR BRSY ANL Win…
2 1 5 -28.7 154. 26/06/2009 23:12 AusCPR BRSY ANL Win…
3 1 9 -29.0 154. 27/06/2009 0:17 AusCPR BRSY ANL Win…
4 1 13 -29.3 154. 27/06/2009 1:22 AusCPR BRSY ANL Win…
5 1 17 -29.7 154. 27/06/2009 2:26 AusCPR BRSY ANL Win…
6 1 18 -29.8 154. 27/06/2009 2:43 AusCPR BRSY ANL Win…
7 1 26 -30.4 153. 27/06/2009 4:52 AusCPR BRSY ANL Win…
8 1 30 -30.7 153. 27/06/2009 5:57 AusCPR BRSY ANL Win…
9 1 33 -31.0 153. 27/06/2009 6:45 AusCPR BRSY ANL Win…
10 1 37 -31.3 153. 27/06/2009 7:50 AusCPR BRSY ANL Win…
# … with 5,303 more rows, and 3 more variables: meanlong <dbl>, region <chr>,
# richness_raw <dbl>
(If you want to print the entire data frame, then use this code: data.frame(dat) to turn it back into a base R data frame.)
In this data you will see a silk_id column, which is just the ID for each of the silks, onto which plankton are recorded. For processing, silks are divided into segments, so you will also see a segment_no column. The other columns are pretty self explanatory.
It’s a good idea to do some visuals with new data to check they are in good shape.
We will be using ggplot2 for graphs and tmap for maps in this course. Both work well within the tidyverse paradigm, and both have some pretty powerful tools for quickly creating graphics.
You might like to download RStudio’s ggplot cheatsheet for reference.
At its heart a map is just a plot of spatial coordinates. So let’s make our first map with ggplot, plotting the coordinates for the samples (segments of the CPR silks):
library(ggplot2)
ggplot(dat) +
aes(x = longitude, y = latitude, color = richness_raw) +
geom_point()
Which just shows the location of every segment. We also coloured points by the species richness. You can kind of see the CPR surveys wrapping around the coast of Australia.
So far so good. Now let’s look at the richness data, our main variable for analysis.
In this course we’ll be blending a lot of spatial and non-spatial analysis and visuals. This is common in a spatial analysis workflow. Sometimes the idea of learning spatial analysis sounds daunting, but much of it is just a continuation of non-spatial data analysis techniques. So if you can already do those in R you can transfer those skills to spatial analysis.
Let’s try another plot of latitude versus richness:
ggplot(dat, aes(x = latitude, y = richness_raw)) +
stat_smooth() +
geom_point()
It’s handy to do lots of plots when getting to know your data, maps or not. Something clearly looks odd with this graph, like there is an unnatural change in the data pattern at about -40. Well, at least we have some results.
We’ll send this graph to Prof Calanoid and see what she thinks the issue is. Hopefully we’ll hear back soon.
(Hint: if you actually want to save the above graph, use ggsave.)
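For instance, to save the most recently displayed ggplot (the file name and size here are just suggestions):

ggsave("richness-by-latitude.png", width = 6, height = 4)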
Let’s repeat the above map of richness, but this time using some of R’s specialist packages for GIS and mapping.
The first step is to turn our point data into a spatially referenced data frame. We will use the sf package to do this. sf stands for ‘simple features’, an open standard for geospatial databases. Interfaces using the simple features standard are available in many GIS programs, and for R we have the package sf.
A good introduction to sf can be found in Geocomputation with R, which is free online.
In the past, spatial analysis in R was usually done with the sp package, which uses a different (and less convenient) paradigm for databases. You may still have to use it sometimes, when sf doesn’t play nicely with other packages. Eventually sf will replace sp, so we will be teaching sf today.
So first install sf if you don’t have it. This may prove to be a bit tricky depending on your operating system. sf depends on lots of other packages, including rgdal, and rgdal is famous for issues with installation. So see how you go. If sf won’t install, then just follow along on our screen for now and figure that out later (with lots of googling).
Now, let’s turn our data into a ‘simple features collection’.
library(sf)
sdat <- st_as_sf(dat, coords = c("longitude", "latitude"),
crs = 4326)
There’s a bit to explain in this simple step. st_as_sf is the function that converts different data types to simple features. Try ?st_as_sf to see what else it can convert. dat is our original data. The coords argument gives the names of the columns that relate to the spatial coordinates (in order of X coordinate followed by Y coordinate).
What does the part starting with crs do? crs stands for Coordinate Reference System.
What is a coordinate reference system? Well the earth is spherical. So at the very least, we’d need a 3 dimensional coordinate system and a reference point to accurately map locations on the Earth’s surface.
Also, the earth is not a perfect sphere. It is lumpy and fatter at the equator because of its spin. So we also need a model that describes the spherical surface of the earth.
In mapping, we refer to the reference point as the datum and the earth model as an ellipsoid. Together, these make a geographic coordinate reference system (GCS), which tells us where the coordinates of our copepod data are located on the earth.
GCSs are represented by angular units (i.e. longitude and latitude), usually in decimal degrees. Our copepod coordinates are long-lat, so we chose a common ‘one-size-fits-all’ GCS called WGS84 to define the crs, using the EPSG code 4326.
What is an EPSG code? It’s a unique, short-hand code for a specific crs.
Common practice for defining crs’s in R has traditionally been with either an EPSG code or a proj4string. You can still use either of these in R, but note that proj4strings are being phased out, and best practice is to use either an EPSG code or Well-known text (WKT) to define a crs. A WKT string contains all of the detailed information we need to define a crs, but is cumbersome if you don’t need all of the detail. Read this for a more complete overview.
It’s easy to find out all of the above for a chosen crs in R. For example, for the EPSG code 4326 we can find out: 1) the name of this crs, 2) the corresponding proj4string, and 3) the WKT:
crs4326 <- st_crs(4326)
crs4326$Name # name of the crs
[1] "WGS 84"
crs4326$wkt # crs in well-known text format
[1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
crs4326$proj4string # crs as a proj4string
[1] "+proj=longlat +datum=WGS84 +no_defs"
(Note: it’s good that we can still extract the proj4string because, although it’s being deprecated, it’s still used in other spatial data packages like raster and terra.)
As mentioned above, EPSG:4326 is short-hand code for WGS84, a common long-lat geographic coordinate reference system (GCS).
When we make a 2-dimensional map in WGS84 GCS, we assume that a degree is a linear unit of measure (when in reality it’s angular).
To more accurately map our data in 2 dimensions, we need to decide
how to ‘project’ 3 dimensions into 2. There are many ways to do the
projection depending on where we are in the world and what we’re most
interested in preserving (e.g., angles vs. distances vs. area).
Projections are defined by a projected coordinate reference system (PCS), and spatial packages in R use the software PROJ to do this.
If you want to learn more about projections, try this blog.
To find the most appropriate projected crs for your data, try the R package crsuggest.
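For example, a quick sketch with crsuggest (assuming you have installed that package; suggest_crs() returns a table of candidate projected crs’s for your data):

library(crsuggest)
suggest_crs(sdat) # returns candidate projected crs's, with EPSG codes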
Let’s have a look at what we’ve created in sdat:
sdat
Simple feature collection with 5313 features and 9 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 89.6107 ymin: -65.2428 xmax: 174.335 ymax: -16.80253
Geodetic CRS: WGS 84
# A tibble: 5,313 × 10
silk_id segment_no sample_time_utc project route vessel meanlong region
* <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1 1 26/06/2009 22:08 AusCPR BRSY ANL Windar… 153. East
2 1 5 26/06/2009 23:12 AusCPR BRSY ANL Windar… 153. East
3 1 9 27/06/2009 0:17 AusCPR BRSY ANL Windar… 153. East
4 1 13 27/06/2009 1:22 AusCPR BRSY ANL Windar… 153. East
5 1 17 27/06/2009 2:26 AusCPR BRSY ANL Windar… 153. East
6 1 18 27/06/2009 2:43 AusCPR BRSY ANL Windar… 153. East
7 1 26 27/06/2009 4:52 AusCPR BRSY ANL Windar… 153. East
8 1 30 27/06/2009 5:57 AusCPR BRSY ANL Windar… 153. East
9 1 33 27/06/2009 6:45 AusCPR BRSY ANL Windar… 153. East
10 1 37 27/06/2009 7:50 AusCPR BRSY ANL Windar… 153. East
# … with 5,303 more rows, and 2 more variables: richness_raw <dbl>,
# geometry <POINT [°]>
Note too that when you print sdat to screen you get some meta-data, including the geometry type, its dimensions (just XY here, but we could also have XYZ), its ‘bbox’ (bounding box) and the CRS (coordinate reference system).
The data table in sdat looks much like dat did, but note it now has a geometry column. This is where the coordinates (just one point for each data row) are stored. More complex simple features could have a series of points, lines, polygons or other types of shapes nested in each row of the geometry column.
The nice thing about sf (that wasn’t true of sp) is that, because the data is basically a dataframe with a geometry, we can use all the operations that work on dataframes on sf simple features collections.
These include data wrangling operations like inner_join, plotting operations from ggplot2 and model fitting tools too (like glm).
sf also adds geometric operations, like st_join, which does joins based on the coordinates. More on this later.
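For example, ordinary dplyr verbs work directly on our points (the richness threshold here is arbitrary, just for illustration):

library(dplyr)
filter(sdat, richness_raw > 50) # returns an sf object, geometry retained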
Now let’s get into the mapping so we can make something that looks good to send to Prof Calanoid.
sf has simple plotting features, like this:
plot(sdat["richness_raw"])
We’ve chosen just to plot the richness column. If we had just plot(sdat) it would do one panel for each variable in the dataframe. The ["richness_raw"] just selects that variable.
The package tmap is one of many packages for making more sophisticated maps. Let’s try it:
library(tmap)
tm_shape(sdat) +
tm_dots(col = "richness_raw")
tmap works much like ggplot2 in that we build up and add on layers. In this case we have just one layer, from sdat. We declare the layer with tm_shape() (in this case sdat), then give the plot type with the following command.
Here we are using tm_dots to plot dots at the coordinates. Other options are tm_polygons, tm_symbols and many others we’ll see later.
We’ve picked the column "richness_raw" for the colour scale.
We can customize the plot, for instance:
tm1 <- tm_shape(sdat) +
tm_dots(col = "richness_raw",
palette = "Blues",
title = "Species #")
tm1
Now we’ve stored the plot object as tm1. If we want to see it we just type tm1. This means we can do more to it later on, like add in more layers.
We’ve set a colour palette, 'Blues'. Then we’ve told it to title the colour scale (if we didn’t give a title, it would title the colour scale ‘richness_raw’).
The name ‘Blues’ comes from the RColorBrewer package, which provides a great catalogue of colour palettes. Check out the interactive web app for other palettes: colorbrewer.org. You can use these names directly in the tmap palette argument.
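You can also preview all of the RColorBrewer palettes, with their names, directly in R:

library(RColorBrewer)
display.brewer.all() # plots every palette with its name alongside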
Let’s save this figure and email it to Prof Calanoid to get her opinion. At least she will be happy to hear we are making progress:
tmap_save(tm1, filename = "Richness-map.png",
width = 600, height = 600)
tmap_save is just a function for saving tmaps. Try ?tmap_save for other options, like changing the figure size, resolution or file type. The only other trick we used above was to save the tmap to a variable name, tm1. This meant we could drop that variable name into tmap_save, which identifies which map we want to save.
We use tmap because it is more intuitive and less ‘buggy’ than other map plotting packages. It has the advantage of allowing you to layer different maps, to plot both shapefiles and rasters, to easily change colour scales, and to easily add compasses and scale bars. It is also pretty intuitive (the old plot method, in comparison, would offset multiple layers by a small amount and you had to adjust the coordinates to account for it!).
If you get competent at the basics then getting decent maps quickly will become easy for you. Your mapping life will seem good until one day, you decide you want to change the colours, or the legend, or the order of panels or some other detail. Then very quickly you will find yourself bashing your head against your monitor in frustration (at least I do).
That’s the thing about tmap: it’s really convenient, until suddenly it isn’t. It’s also pretty new, so there isn’t a lot of help out there, though help online is expanding every time I look.
So press on, use web searches if you get stuck and don’t be afraid to ask a question on stack overflow if you can’t find your problem answered online already.
As we said earlier, sf can handle lots of different types of spatial data, including shapes like polygons. To practice with polygons, let’s load in a map of Australia and a map of Australia’s continental shelf. We’ll add these as layers using tmap.
The polygons are stored as shapefiles in the course data. We can read them in with the st_read command (which is like read_csv, but for spatial files!):
aus <- st_read("data-for-course/spatial-data/Aussie/Aussie.shp")
Reading layer `Aussie' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/Aussie/Aussie.shp'
using driver `ESRI Shapefile'
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
shelf <- st_read("data-for-course/spatial-data/aus_shelf/aus_shelf.shp")
Reading layer `aus_shelf' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/aus_shelf/aus_shelf.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.2242 ymin: -44.1284 xmax: 153.8942 ymax: -8.8798
Geodetic CRS: GRS 1980(IUGG, 1980)
And they are loaded. You can check out the data quite easily by typing the object names:
aus
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
adm geometry
1 NSW MULTIPOLYGON (((141.0029 -3...
2 Vic MULTIPOLYGON (((140.9659 -3...
3 QLD MULTIPOLYGON (((153.5522 -2...
4 SA MULTIPOLYGON (((129.0013 -2...
5 WA MULTIPOLYGON (((129.0012 -1...
6 Tas MULTIPOLYGON (((148.3096 -4...
7 NT MULTIPOLYGON (((137.9962 -1...
8 ACT MULTIPOLYGON (((149.1253 -3...
It is widely known that .shp files are a terrible format. They are inefficient at storing data and to save one shapefile you actually create multiple files. This means bits of the file can get lost if you transfer the data somewhere.
A better format is the Geopackage. Geopackages can save multiple different data types all in a single file. And they compress the data more. Read more about different file formats here.
We’ve used .shp files in this course because we expect you are most likely to encounter that format. But we encourage you to save your own data in the .gpkg format.
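For instance, saving (and re-reading) the Australia polygons as a geopackage might look like this (the file name is just a suggestion):

st_write(aus, "data-for-course/spatial-data/aussie.gpkg") # one self-contained file
aus2 <- st_read("data-for-course/spatial-data/aussie.gpkg")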
People often ask us where we get our spatial data from. The answer is all sorts of places, sometimes from collaborators, sometimes we make it from scratch and other times we download it from online data providers (try the AODN for Australian ocean data).
There is an instructive bonus section on how we created the map of Australia’s shelf, if you’d like to know more about sourcing spatial data for maps.
It is easy to make a map of a polygon with tmap:
tm_shape(shelf) +
tm_polygons()
A thematic map can be built up as a series of layers, so we can just keep adding to our map if we want:
Note how each new layer starts with a tm_shape line; then we say what type of shape we want. We could keep going and add the land and the copepod points:
tm_shape(shelf) +
tm_polygons(col = 'lightblue') +
tm_shape(aus) +
tm_polygons() +
tm_shape(sdat) +
tm_dots()
We’ve made the shelf ‘lightblue’ to differentiate it from the land.
But we are missing the samples in the Southern Ocean. This is because the extent for a tmap is set by the first tm_shape. We can fix this by setting the bbox (bounding box):
tm_shape(shelf, bbox = sdat) +
tm_polygons(col = 'lightblue') +
tm_shape(aus) +
tm_polygons() +
tm_shape(sdat) +
tm_dots()
Now our map extent matches the samples.
Let’s stop here and give you some time to play around with tmap and its different features. I really want you to explore the package and try different things. If you get errors, good, they are a learning opportunity.
To get started you might want to type ?tmap to peruse tmap’s different features.
To learn about a quick way to change the style, type tmap_style("beaver") then run your map code again.
Or try the tmap vignette. It can be accessed with vignette('tmap-getstarted') or a web search for ‘r tmap’.
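For example, here is a sketch combining a few of these features (all the specific choices are just suggestions to get you started):

tm_shape(aus, bbox = sdat) +
  tm_polygons() +
  tm_shape(sdat) +
  tm_dots(col = "richness_raw", palette = "Blues") +
  tm_compass() +
  tm_scale_bar() +
  tm_layout(title = "Copepod richness")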
Here’s a map we made, to help give you some ideas:
Prof Calanoid gets back to us about the figure we sent of richness and latitude. As we had suspected, she says the results are junk. Prof Calanoid has now explained that we need to standardize richness estimates, because silks from different routes have different sizes.
Prof Calanoid had already provided the silk sizes in the file Route-data.csv, but had neglected to tell us we needed to use this for the standardisation (typical!). No worries though: we just need to learn how to join the routes data to our richness data, and then the standardisation will be easy.
dplyr stands for ‘data pliers’. Along with tidyr, it is one of the most useful packages for getting your data into shape for analysis. It works very well with spatial data stored as sf objects. dplyr is part of the tidyverse, so it plays very nicely with ggplot2 and readr.
One thing dplyr is good at is joining data frames by matching columns. Try typing ?inner_join in your console and you will get a list of all the join types it supports. Remember dplyr can only do joins on the data table, not the geometry. We will look at spatial joins later on.
Today we will use inner_join. Below, the code will join dat to the routes data, using columns with the same names to match by. It will keep all rows from dat where there are matching rows in routes, so if rows don’t match, they will be chucked out (use left_join if you want to keep rows in dat that don’t match too).
inner_join will also duplicate rows if there are multiple matches.
routes <- read_csv("data-for-course/Route-data.csv")
Have a quick look at routes now to make sure you are happy with it. Then we will just use inner_join to join it with our spatial data (making sure we check that the number of rows stays the same):
sdat_std <- inner_join(sdat, routes, by = "route")
nrow(sdat)
[1] 5313
nrow(sdat_std)
[1] 5313
nrow(routes)
[1] 25
length(unique(routes$route))
[1] 25
We checked the number of rows and the number of unique route names to make sure we didn’t inadvertently remove or duplicate any data.
It is worth taking a look through the resulting data (e.g. hover your mouse over the data frame and press F2) to make sure the join worked as intended.
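A quick way to check for unmatched rows is dplyr’s anti_join, which returns the rows of its first argument that have no match in the second (run here on the non-spatial dat, just as a check):

anti_join(dat, routes, by = "route") # zero rows means every sample matched a route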
Joins are a very important but very dangerous data wrangling operation! You must always choose your join type carefully. For instance, inner_join vs left_join vs full_join will all give the same result for some datasets, but not others.
Even after you think you know what you are doing, you still need to check the outcome. As we explained above, you can lose samples that don’t match, or duplicate samples that match multiple times. I (CB) have made (and thankfully corrected) this mistake many times, often because of small inconsistencies in the data I was provided, like when some site names have all lower case, and a handful have title case.
We don’t say this to put you off joins; they are one of the most useful data wrangling operations you can do in R. Just be careful.
Once we have a matching silk_area value for each sample, it is easy to add a new variable that is standardised richness. To do this we use mutate, which just takes existing variables and calculates a new variable (or overwrites an existing one if we give it the same name):
sdat_std <- mutate(sdat_std,
richness = richness_raw/silk_area)
Ok, let’s plot standardized richness against latitude so we can send a new graph to Prof Calanoid. To do that we first need to extract the latitude, since it is now stored in the geometry:
sdat_std$Lat <- st_coordinates(sdat_std)[,2]
Now some straightforward ggplot:
ggplot(sdat_std) +
aes(x = Lat, y = richness, color = richness) +
geom_point() +
stat_smooth() +
theme_bw()
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
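This warning is harmless here: stat_smooth drops the colour aesthetic because it fits a single smooth across all points. If you want to avoid it, one option is to map colour inside geom_point() only, so the smooth never receives it:

ggplot(sdat_std, aes(x = Lat, y = richness)) +
  geom_point(aes(color = richness)) +
  stat_smooth() +
  theme_bw()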
Do you see a different pattern now?
We should also save the standardised data for use later:
save(sdat_std,file = "data-for-course/spatial-data/copepods_standardised.rda")
Prof Calanoid didn’t give us the spatial layers for Australia or the continental shelf. We had to derive them ourselves.
We often get asked where we get our spatial data from. The answer is, lots of sources! But often from web repositories, like the excellent Australian Ocean Data Network.
It will be instructive to go through how the shelf data was made. Its derivation is a bit convoluted, but that is pretty common when you are making customised maps.
First, we downloaded a shapefile of bathymetry from the AODN. It is line data, with lines for contours at 5 m resolution.
Then we read it in with sf and filtered for the 200 m contour as our definition of the continental shelf.
library(sf)
shelfin <- st_read("data-raw/contour_5m/contour_5m.shp")
shelfin
range(shelfin$CONTOUR)
shelf200 <- dplyr::filter(shelfin, CONTOUR == 199.9)
plot(shelf200["LENGTH"])
Now we have lots of little line segments (the data is not one continuous line around Australia unfortunately). We want to make a polygon from these bits of lines.
So we did some web searching, and on Stack Overflow we read about the package concaveman, which fills in the gaps to make a polygon. So we installed that package and used it to make a continuous polygon:
library(concaveman)
ushelf200 <- concaveman(shelf200)
plot(ushelf200)
Then we just check it has made a valid polygon (see ?st_is_valid for more explanation):
st_is_valid(ushelf200)
Turns out it’s not valid, which may cause us issues with spatial joins later on. We can make it valid with the st_make_valid function:
ushelf200 <- st_make_valid(ushelf200)
st_is_valid(ushelf200)
That worked.
The data also has a Z dimension that we don’t need. So let’s drop that:
ushelf200 <- st_zm(ushelf200)
Now save it, so it’s all ready for the students (you!):
st_write(ushelf200, "data-for-course/spatial-data/aus_shelf.shp")
If you find yourself in this kind of convoluted workflow trying to get spatial data to work, don’t despair. It’s common; keep web searching and you will find answers to get you to something that works. Just don’t under-estimate how long this kind of thing can take, next time someone asks you to help with an ‘easy’ project.
In the first session we tidied and analysed Prof Calanoid’s data so that we could run some analysis of the relationship between latitude and species richness. But the job isn’t done yet.
Prof Calanoid’s actual hypothesis was about sea surface temperature. She also wanted to see some maps and create an interactive map for her funders.
So in this session we are going to look at how we combine our copepod data with spatial layers for temperature so we can do some spatial analysis on the relationship between temperature and richness.
Then we will generate some predictions for species richness, and map them, like a species distribution model.
Finally we will look at creating interactive maps.
In part 2 of this course we will look at some more advanced spatial analysis that R can do. We are going to look at spatial joins, where we join data based on their geometries. Then we will look at a new type of data, ‘raster’ (pixel) data, and see how we can extract raster data for a series of points.
We will use this extracted data to build a spatial statistical model to test for a relationship between SST and richness. Then, just for fun, we will build an interactive web map.
Let’s start with a clean slate for session 2. So I encourage you to start a new script and a new R session. Then we’ll just load in key packages and data from this morning:
library(tidyverse)
library(sf)
library(tmap)
load("data-for-course/spatial-data/copepods_standardised.rda")
aus <- st_read("data-for-course/spatial-data/Aussie/Aussie.shp")
Reading layer `Aussie' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/Aussie/Aussie.shp'
using driver `ESRI Shapefile'
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
Prof Calanoid is ever curious. On seeing the map we sent her with the points crossing the continental shelf, she asks if shelf/off-shelf has an effect on richness. So we’ll need to do some more GIS to find out.
To look at the position of the samples with respect to the continental shelf, we’ll have to intersect the shelf and points data. To do that we need to do a spatial join. Unlike dplyr joins, which work on the data table, a spatial join works on the geometries (i.e. the coordinates). We’ll use st_intersects, which is going to tell us where points overlap the shelf polygon. Let’s have a go.
First, we will add a new variable shelf, which is just the same everywhere (because everywhere inside the polygon is on the shelf!). This will be handy later. Then we will use st_join to try the join, telling it to do an intersection (there are other types of join; see ?st_join for details):
shelf <- st_read("data-for-course/spatial-data/aus_shelf/aus_shelf.shp")
Reading layer `aus_shelf' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/aus_shelf/aus_shelf.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.2242 ymin: -44.1284 xmax: 153.8942 ymax: -8.8798
Geodetic CRS: GRS 1980(IUGG, 1980)
shelf$shelf <- "Shelf"
sdat_shelf <- st_join(sdat_std, shelf, join = st_intersects)
Ah an error! Do we throw up our hands in despair and tell Prof Calanoid we can’t do it? No. Let’s look at this obscure message a bit more closely and try and figure out what it means.
Hmmm, so st_crs looks like a function; let’s find out what it does: ?st_crs.
Ah, so it tells us what the CRS is. So st_crs(x) == st_crs(y) is not TRUE is saying our two datasets have different coordinate reference systems (note the ==, which asks a question: does st_crs(x) equal st_crs(y)?). Let’s look at that:
st_crs(shelf)
Coordinate Reference System:
User input: GRS 1980(IUGG, 1980)
wkt:
GEOGCRS["GRS 1980(IUGG, 1980)",
DATUM["D_unknown",
ELLIPSOID["GRS80",6378137,298.257222101,
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["longitude",east,
ORDER[1],
ANGLEUNIT["Degree",0.0174532925199433]],
AXIS["latitude",north,
ORDER[2],
ANGLEUNIT["Degree",0.0174532925199433]]]
st_crs(sdat_std)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
One is using GRS80 as a coordinate reference system, and the other is using WGS84. (Note: in reality these crs’s are nearly identical, but we need to make them the same if we want to intersect them.)
We can fix this easily with a call to st_transform. Let’s put shelf on the same datum as sdat_std:
shelf <- st_transform(shelf, crs = st_crs(sdat_std))
Now try the join again:
sdat_shelf <- st_join(sdat_std, shelf, join = st_intersects)
names(sdat_shelf)
[1] "silk_id" "segment_no" "sample_time_utc" "project.x"
[5] "route" "vessel" "meanlong.x" "region"
[9] "richness_raw" "geometry" "project.y" "number_segments"
[13] "meanlat" "meanlong.y" "silk_area" "richness"
[17] "Lat" "FID" "shelf"
unique(sdat_shelf$shelf)
[1] "Shelf" NA
Note it has combined the data fields from sdat_std and shelf when it did the join. This includes the variable we created, shelf, which has the value ‘Shelf’ where the point is inside the polygon (i.e. on the shelf) and NA where the point isn’t in the polygon.
Let’s rename those NA values to ‘Offshore’ so that is clear:
library(tidyr)
sdat_shelf <- mutate(sdat_shelf,
shelf = replace_na(shelf, "Offshore"))
table(sdat_shelf$shelf)
Offshore Shelf
4110 1203
So we have many more samples offshore than on the shelf.
Now to check the intersection worked, we can map it out, colouring points by whether they were on or off the shelf:
tm_shape(shelf, bbox = sdat_shelf) +
tm_polygons(col = "grey10") +
tm_shape(sdat_shelf) +
tm_dots(col = "shelf", palette = "RdBu") +
tm_graticules()
Intersections and related operations can be slow for complex polygons, or when intersecting lots of polygons. sf speeds up these computations by implementing spatial indexing, which means it creates bounding boxes around the polygons first, so it can narrow the search for intersections to those places (thus saving computation time).
sf functions like st_intersection(x, y) (and st_join) compute the spatial index on the first shape given to the function (i.e. x). So swapping them around could speed up computations in some instances. Give it a try if speed is an issue for you.
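For instance, you could time the two orderings on our data with base R’s system.time (note the two calls return differently structured results, so this is just a speed check):

system.time(st_join(sdat_std, shelf, join = st_intersects))
system.time(st_join(shelf, sdat_std, join = st_intersects))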
This feature means that sf operations can be significantly faster than similar operations from other packages (like sp).
Now we have all the data together, we can do another ggplot, with smooths, to visually check whether there is an effect of the continental shelf on the richness gradient.
ggplot(sdat_shelf) +
aes(x = Lat, y = richness, color = shelf) +
geom_point(alpha = 0.5, size = 0.2) +
stat_smooth() +
theme_bw()
It looks pretty consistent, though there is an interesting deviation of higher richness on the shelf at around -32. Looking at the map above, this may reflect the high number of on-shelf observations in Western Australia at -32 and the total absence of off-shelf observations there.
We could figure this out with some more exploration (for instance, dividing by east and west coasts), but we’ll leave that for you to do in your own time.
Our aim was to uncover the relationship between temperature and copepod richness. To do that we need some spatial data on temperature, so we can extract temperature at the sampling sites.
We have provided you with two files, MeanAVHRRSST.gri and MeanAVHRRSST.grd, which contain gridded maps of annual mean sea surface temperature from the Hadley dataset. Gridded data, also known as raster data, can be read and manipulated with the terra package. Once you have installed this package, load it in:
library(terra)
A note about raster data: which package should I use?
There used to be only one package for raster data, raster. Now there are three main ones (raster, terra and stars). In this course we have decided to use terra. terra is similar to, but more adaptable and faster than, raster, and is made by the same people, so it supersedes raster. terra is easy to learn if you have already used raster.
There are several things to consider when deciding if you shift to using a newer package. Chris discusses them in his blog comparing raster and terra.
There still may be cases where you need to use raster, such as to make sure data are compatible with the format required by other packages. For instance, terra is sometimes, but not always, compatible with sf objects. So it is common, when packages change or update, that we have to change object types a lot. This is something we’ll see below.
In such cases, you can easily convert a terra SpatRaster object to a raster object with the command raster::raster(). If you need an sf vector object (e.g. point data) to be compatible with terra’s formats, convert it with terra::vect().
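For example (sdat_std is our copepod points from earlier):

copepods_vect <- terra::vect(sdat_std) # sf points -> terra SpatVector
# and raster::raster(a_SpatRaster) goes the other way, if a package demands it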
We can then load and view the SST raster like this:
rsst <- rast('data-for-course/spatial-data/MeanAVHRRSST/MeanAVHRRSST.grd')
plot(rsst)
rast will read the raster data in as a SpatRaster object.
This creates a pretty decent first plot of the raster. However, note the colour scale isn’t that appropriate for temperatures: green where temperatures are hot and red where they are cold. Further, these default colours wouldn’t be that great if our audience was red-green colour blind (and we suspect that Prof Calanoid is colour blind).
A colour scale with multiple hues is also inappropriate for continuous data where there are no natural breaks.
Notice how the change from 12-14 degrees looks much bigger than the change from say 18-20 degrees (the yellow band south of Tasmania and across New Zealand), but both are 2 degrees of change. This effect is created by the shift across 3 hues (red-yellow-green) at around 13 degrees, but only a change in the intensity of green at 19 degrees.
We can use tmap for SpatRasters too.
tm_shape(rsst) +
tm_raster(palette = "-RdBu", title = "SST")
Note the “-” in front of the palette name, which puts red at warmer temperatures by reversing the colour palette.
Sequential palettes like Reds are generally a good choice for temperatures, and are most appropriate when our data has a linear scale. You may also see people use diverging palettes like RdBu (red-blue). We’ve used one here because we have temperatures <0.
Unlike the terra::plot() function, tmap puts the hue change at a sensible place (zero). When there is a natural break-point in the data, like zero degrees, diverging palettes make sense.
It is straightforward to add our samples as points over the raster with tmap:
tm_shape(rsst) +
tm_raster(palette = "-RdBu", title = "SST") +
tm_shape(sdat_std) +
tm_dots(col = "richness_raw",
palette = "Greys",
title = "Species #") +
tm_compass()
We have overlaid our copepod sampling points on the map of temperature, now let’s extract the temperature values at those sampling sites, so we can test Prof Calanoid’s hypothesis about SST.
This is easy to do with terra’s extract function.
sdat_std$sst <- terra::extract(rsst, vect(sdat_std))[,2]
Note that terra::extract() isn’t compatible with sf objects (yet), unlike raster::extract(). So we need to wrap our points data in vect() to convert it to terra’s SpatVector format for storing point data.
This is one of the object conversions I mentioned that we just have to deal with when using the latest packages. They usually don’t cost much time if you know what you are doing.
Now we can plot the correlation between richness and SST. We can also run a test to calculate the Pearson correlation coefficient:
ggplot(sdat_std, aes(sst, richness)) +
geom_point() +
theme_minimal()
Warning: Removed 3 rows containing missing values (`geom_point()`).
with(sdat_std, cor.test(sst, richness))
Pearson's product-moment correlation
data: sst and richness
t = 46.698, df = 5308, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5202827 0.5584213
sample estimates:
cor
0.5396288
The test indicates a significant positive correlation with SST. However, if you examine the plot you may notice that variance in richness tends to increase with SST.
Can you think of a more appropriate way to model this data that will be more satisfactory to Prof Calanoid?
Note that ggplot warned us that it ‘removed 3 rows because of missing data’. That is because some of the sst values are missing:
filter(sdat_std, is.na(sst))
Simple feature collection with 3 features and 17 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 152.0625 ymin: -32.6829 xmax: 152.4542 ymax: -32.10338
Geodetic CRS: WGS 84
# A tibble: 3 × 18
silk_id segment_no sample_time_utc project.x route vessel meanlong.x region
* <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 362 93 7/12/2015 16:09 AusCPR BRNC Island C… 153. East
2 362 97 7/12/2015 17:51 AusCPR BRNC Island C… 153. East
3 362 101 7/12/2015 19:33 AusCPR BRNC Island C… 153. East
# … with 10 more variables: richness_raw <dbl>, geometry <POINT [°]>,
# project.y <chr>, number_segments <dbl>, meanlat <dbl>, meanlong.y <dbl>,
# silk_area <dbl>, richness <dbl>, Lat <dbl>, sst <dbl>
Let’s get rid of those rows, because the missing data will cause us issues later:
sdat_sst <- filter(sdat_std, !is.na(sst))
The ! just means ‘NOT’, so we are asking for the rows that are not NA (missing).
We could use a GAM to model richness on SST with a Poisson distribution. The Poisson is appropriate for count data.
library(mgcv)
m1 <- gam(richness ~ s(sst, k=5), data = sdat_sst, family = 'poisson')
plot(m1)
So we’ve limited the degrees of freedom of the spline, so it can’t be too ‘bendy’. We’ve also asked gam to use a Poisson distribution (the default is Gaussian), because our data are discrete counts of the number of species.
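It’s also worth looking at the model summary and mgcv’s built-in residual diagnostics (both are standard mgcv functions):

summary(m1)   # smooth terms and deviance explained
gam.check(m1) # residual plots and basis dimension checks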
Our data are spatial, and we’d not expect the sst-richness relationship to be the same in all places. For example, the types of species that make up our data are different in the Indian and Pacific oceans.
We can check for different patterns across the different oceans the data covers around Australia. So let’s include a fixed effect for region as well:
sdat_sst$Region <- factor(sdat_sst$region)
m1 <- gam(richness ~ s(sst, k=5, by = Region) + Region, data = sdat_sst, family = 'poisson')
Let’s look at the result (this is easiest with ggplot2). We’ll add the GAM predictions onto the end of our dataframe so we can plot those.
sdat_sst$pred_m1 <- predict(m1, type = "response")
ggplot(sdat_sst) +
aes(x = sst, y = richness, color = Region)+
geom_point(size = 0.2, alpha = 0.3) +
geom_line(aes(y = pred_m1), size = 1) +
facet_grid(.~Region) +
theme_bw()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
The type = "response" argument ensures we get predictions of numbers of species, not log(species). Note we used aes again inside geom_line() to ensure we got the right y values for the lines.
So we send the results of our GAM to Prof Calanoid and we get the response:
“The poisson model you use for your GAMs is clearly pretty bad and potentially misleading. It is very overdispersed, as such models often are. I would suggest changing to a negative binomial. The catch is you have to supply theta (the dispersion parameter). A good way to pick one is just to start with a trial value and adjust it until your deviance matches your df.residual fairly closely.”
To see these numbers type:
deviance(m1)
[1] 14746.02
m1$df.residual
[1] 5296.811
The deviance is the sampling variation we have; the residual df is the sampling variation our model expects. So clearly our model is under-estimating the sampling variation (the data are ‘overdispersed’).
It does make a difference. Let’s have a look:
m2 <- gam(richness ~ s(sst, by = Region) + Region, data = sdat_sst, family = mgcv::negbin(theta = 1.99))
deviance(m2)
[1] 5294.37
m2$df.residual
[1] 5291.904
So we tried the negative binomial, and in this case negbin(theta = 1.99) does a pretty good job, and is much more restrained than the Poisson.
Here we just tried different values of theta (the dispersion parameter) until the deviance and residual df were reasonably close. You can also see the script ‘find-theta.R’ in your data folder for a way to automate this.
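We haven’t reproduced find-theta.R here, but the idea can be sketched in a few lines: fit the model over a grid of theta values and keep the one where the deviance is closest to the residual df (the script provided with the course may differ in detail):

thetas <- seq(1, 4, by = 0.1)
gap <- sapply(thetas, function(th) {
  m <- gam(richness ~ s(sst, by = Region) + Region,
           data = sdat_sst, family = mgcv::negbin(theta = th))
  abs(deviance(m) - m$df.residual)
})
thetas[which.min(gap)] # theta giving deviance closest to residual df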
A really neat way to check the ‘fit’ of count models is with rootograms. Here’s one for our model:
(The DataGLMRepeat package is my own package of handy functions on GitHub; if you want it, get it here: https://github.com/cbrown5/DataGLMRepeat)
A perfect fit would have all the grey rectangles sitting exactly on the horizontal line (at zero). The red line shows expected (modelled) counts; the bars show observed counts. So where a bar floats above the line, the model is over-predicting observed values (i.e. around richness values of 10). Where a bar hangs below the line, the model is under-predicting observations (i.e. at counts > 15).
So this model could still use a bit of work, but we’ll leave that for another day.
OK, let’s check out the results:
sdat_sst$pred_m2 <- predict(m2, type = "response")
ggplot(sdat_sst) +
aes(x = sst, y = richness, color = Region)+
geom_point(size = 0.2, alpha = 0.3) +
geom_line(aes(y = pred_m2), size = 1) +
facet_grid(.~Region) +
theme_bw()
Looks much the same as before. So our result was robust.
Note we could add standard errors to these plots by setting se = TRUE in our predict function.
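For example, a sketch of getting approximate confidence bands (predict with se.fit = TRUE returns both the fit and its standard error; here roughly plus or minus 2 SE on the response scale):

pred <- predict(m2, type = "response", se.fit = TRUE)
sdat_sst$pred_lo <- pred$fit - 2 * pred$se.fit
sdat_sst$pred_hi <- pred$fit + 2 * pred$se.fit

These could then be drawn with geom_ribbon() underneath the prediction lines.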
Now we might like to get a bit more sophisticated with our spatial model. We haven’t accounted for dependencies between the locations. Sampling sites that are close together are often more similar to each other than those that are further apart, so each sample is not a true ‘independent replicate’.
This phenomenon is often called spatial auto-correlation. It is a concern because if we over-estimate the number of truly independent replicates, our p-values will be too small and we will be overconfident in our effect size estimates (though sometimes we can underestimate effect sizes too). So a ‘significant’ finding is more likely to be spurious.
We won’t have time in the workshop to cover all the intricacies of spatial models. However, we’ll try one type and there is some more code below for you to try in your own time.
But first, some words of advice from Bill Venables.
In this context we really need to regard GAMs (and GLMs) as exploratory tools. They are powerful, but they rely on the (usually naive, but often harmless) assumption that, given the predictors, the observations can be regarded as independent.
Even if this is plainly not so, it doesn’t rule out the approach entirely as an exploratory tool. GAMMs and GLMMs have the capacity to allow for non-independence of various forms, but taking this to a realistic level would also be extremely intricate and usually dependent on the specifics of any example.
One approach to account for spatial dependencies is to use latitude and longitude as predictors. However, these usually confound with other predictors of a more general kind, such as SST, and we forego the chance to build a more generally applicable, and interpretable, model.
Another approach is to choose a spatial scale larger than the distance between samples, within which we will ‘clump’ samples. Then we can apply a random effect to those clumps. Of course, this assumes the ‘clumps’ are independent of each other.
While I can agree with these approaches, I favour a two-pronged attack, namely try with and without lat and lon as predictors. The purpose of the model that has lat/lon is merely to assess just how much you might be losing out on by omitting location. If this is “not much” then you can feel comfortable with your primary model, (i.e. omitting Lat & Long from the picture). If this is “wow, that’s a lot!” then firstly, you need to be aware of it and secondly, you might want to look around for other suitable predictors, of a non-location kind, to fill in the void.
Best practice would be to model the spatial dependencies directly, but we won’t cover those more complicated models here.
Say we wanted to allow the effect of SST to vary continuously with longitude. First extract longitude coordinates from the geometry:
sdat_sst$x <- st_coordinates(sdat_sst)[,1]
Now we just include the x variable in our call to the smoother and do the usual checks. I’ve included three different ways to fit this model below, for you to explore.
# m_int <- gam(richness ~s(sst) + s(x, y, bs = "gp", m = c(1,1)), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
# m_int <- gam(richness ~s(sst, by = x), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
m_int <- gam(richness ~s(sst, x), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
m_int$df.residual
[1] 5280.651
deviance(m_int)
[1] 5277.21
The deviance is close to the residual degrees of freedom, which suggests the negative binomial is doing a reasonable job of capturing the dispersion in richness.
Note that when we plot this model we now get a contour plot (or surface):
plot(m_int)
But when we send that to Prof Calanoid she says it looks like the mess of pasta she had for dinner last night.
Below we’ll learn how to plot the model predictions so they look less like pasta and more like a map.
One approach Bill mentioned was to clump the data into relatively 'independent' groups and then model those clumps as random effects. We won't do this code in the workshop, but it's here in case you want to try it.
For simplicity we will try that with just the west coast data:
sdat_sst$x <- st_coordinates(sdat_sst)[,1]
sdat_sst$y <- st_coordinates(sdat_sst)[,2]
sdat_west <- filter(sdat_sst, (x < 120) & (y > -40))
Now create the clumps on the west coast data:
sdat_west$group <- cut(sdat_west$y, breaks = seq(-37, -16, by = 1.5))
Now we can use group as a random effect:
m4 <- gamm(richness ~ s(sst), data = sdat_west,
           random = list(group = ~1),
           family = mgcv::negbin(theta = 3.8))
plot(m4$gam)
summary(m4$lme)
Here's another model Bill mentioned; the code is below if you want to try it in your own time.
We should first transform the data so the distances are in metres:
sdat_west2 <- st_transform(sdat_west, crs = 3577)
sdat_west2$x <- st_coordinates(sdat_west2)[,1]
sdat_west2$y <- st_coordinates(sdat_west2)[,2]
Then we can fit the model. I’m using a Gaussian Process smooth on the x,y coordinates, because the GP is a model of correlations:
m5 <- gam(richness ~ s(sst) + s(x, y, bs = "gp"),
data = sdat_west2,
family = mgcv::negbin(theta = 3.8))
plot(m5)
Modelling the spatial auto-correlation directly could be considered best practice. Here's the code if you want to try it in your own time.
This is possible with the mgcv package, though I would prefer to use a Bayesian method (like INLA) for this more complicated type of modelling. Anyway, if you were motivated to try to fit a model with spatial AC, you might like to do it like this:
m6 <- gamm(richness ~ s(sst),
           # this next step includes the spatial AC,
           # with an exponential correlation structure
           correlation = corExp(form = ~x + y),
           data = sdat_west2,
           family = mgcv::negbin(theta = 3.8))
plot(m6$gam)
summary(m6$lme)
This models correlations between samples using an exponential function of distance. Interestingly, the fitted range parameter is about 5 km, meaning the correlation between points that are 5 km apart is about 0.37, and by about 15 km it has dropped to 0.05. See ?corExp for more details about why that is.
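You can check those numbers yourself: without a nugget, corExp models the correlation as exp(-distance/range), so with a range of about 5 km:

range_km <- 5
d_km <- c(5, 15, 30)
exp(-d_km / range_km)
# [1] 0.368 0.050 0.002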
To make a more interpretable map of the SST x longitude model, we will use rasters.
So we want to map our predictions for richness back onto an SST raster.
If we wanted to generate predictions at the original sample sites, we could just say:
sdat_sst$richness_pred <- predict(m_int, type = "response")
And then plot the predictions with tmap:
tm_shape(sdat_sst) +
tm_dots(col = "richness_pred")
This looks OK, but we might like to extend the edges of the points a bit to fill out the map. There are a few ways to do this; we will use rasters.
First let’s aggregate our SST raster so the cell size is bigger (resolution is grainier):
rsst2 <- aggregate(rsst, 2)
par(mfrow = c(1,2))
plot(rsst)
plot(rsst2)
Now we do some trickery to get predictions for the raster grid.
We need to set up a dataframe that has the SST values and x values (longitudes) from the raster and their cell numbers. Cells are numbered 1 to the total number of cells, starting at the top left cell.
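If the cell numbering is unfamiliar, here's a tiny toy raster to make it concrete; cells run left to right along each row, starting at the top-left:

r_toy <- rast(nrows = 2, ncols = 3)
values(r_toy) <- 1:ncell(r_toy)
as.matrix(r_toy, wide = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6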
I’ve given two options below for choosing cells. The first is more conservative and just chooses cells that have samples (we commented it out). The second uses all ocean cells (we’ll do this one now):
# icell <- cellFromXY(rsst2, st_coordinates(sdat_sst)) #alt option
icell <- 1:ncell(rsst2)
pred <- data.frame(sst = rsst2[icell][,1],
cells = icell,
x = xFromCell(rsst2, icell),
y = yFromCell(rsst2, icell))
pred <- na.omit(pred)
head(pred)
sst cells x y
1 27.88307 1 82.75 -10
2 27.87724 2 83.25 -10
3 27.87886 3 83.75 -10
4 27.88036 4 84.25 -10
5 27.87260 5 84.75 -10
6 27.87042 6 85.25 -10
We used na.omit to get rid of NA SST values (land, basically).
Now we can use predict to predict richness values from m_int, but with our new SST and x values, via the newdata argument.
pred$richness_pred <- predict(m_int, newdata = pred, type = "response")
head(pred)
sst cells x y richness_pred
1 27.88307 1 82.75 -10 25.40934
2 27.87724 2 83.25 -10 25.67132
3 27.87886 3 83.75 -10 25.95014
4 27.88036 4 84.25 -10 26.22997
5 27.87260 5 84.75 -10 26.49140
6 27.87042 6 85.25 -10 26.76509
We chose the response type so that the predictions are in units of species richness, not log species richness (because of the log link in the negative binomial).
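You can confirm that relationship directly; the two prediction types should differ only by exponentiation:

p_link <- predict(m_int, newdata = pred, type = "link")
# should return TRUE: response predictions are exp(link predictions)
all.equal(as.numeric(exp(p_link)), as.numeric(pred$richness_pred))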
Now we just assign the predictions back to an empty raster. The empty raster is just a copy of rsst2 with no data.
rpred <- rast(rsst2)
rpred[pred$cells] <- matrix(pred$richness_pred, ncol = 1)
We specify rpred[pred$cells] so values are only added in the cells we predicted to.
Finally use your tmap skills to make a map Prof Calanoid will love:
tm_shape(aus, bbox = st_bbox(rpred)) +
tm_polygons(col = "white") +
tm_shape(rpred) +
tm_raster(palette = "RdPu",
title= "Richness", alpha = 0.8, n=10) +
tm_layout(bg.color = "grey20",
legend.position = c("left", "top"),
legend.text.color = "white",
legend.title.color = "white")
It looks blocky; that's because we predicted richness to the underlying raster. Arguably the 'blockiness' is actually good in this case: these are model predictions, not real data, and the blockiness serves to emphasise that.
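If you did want a smoother look despite that, terra can interpolate the prediction raster to a finer grid, though keep in mind this is purely cosmetic, not extra information:

rpred_smooth <- disagg(rpred, fact = 4, method = "bilinear")
plot(rpred_smooth)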
Prof Calanoid would probably also like to see a map of SST, so have a go at that yourself.
Finally, Prof Calanoid would probably also like to see the model fit. One way to do this is to plot the predictions we made in the original data frame as points, coloured by longitude:
ggplot(sdat_sst) +
  aes(x = sst, y = richness_pred, color = x) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  ylab("Richness (predicted)") +
  xlab(expression('SST ('*~degree*C*')'))
There are some interesting deviations from the trend around 20-21 degrees that depend on the longitude. What is going on there? Maybe one of our local plankton experts can answer that question?
You are almost there with meeting all of Prof Calanoid’s requests. The analysis hasn’t been quite as quick, or gone quite as smoothly (pardon the pun) as she led us to believe, but there’s only one step to go - the interactive map to impress her funders (though we secretly suspect she just wants to post it on the web to taunt Prof Salp).
You want to deliver 100% of what she asked, so you can get that fellowship.
We will create the interactive map using the leaflet package. You will need to be connected to the internet for this to work properly. You can access leaflet directly via tmap: just change the tmap mode to "view":
tmap_mode("view")
Leaflet makes use of a JavaScript package for mapping (JavaScript is the language that dynamic web pages tend to use). It builds maps of your data on top of a range of freely available map layers. Check out this guide to leaflet in R for more advanced examples. But you can do the basics with the same tmap code we used before.
Because Leaflet uses JavaScript, the code runs in the user's browser. This means anyone looking at the map on the web has to download all the data before they can render the map. So you should keep your spatial datasets small if you want to use leaflet - imagine your colleagues trying to download your 100 MB spatial data layer on their mobile data plan.
Many sophisticated web mapping applications, like Google Maps, use server-side code. These can render much larger data-sets because they are only transferring the data that is needed for a particular view. Creating these kinds of applications requires specialised expertise that we won’t cover in this course.
So what we need to do now is check the size of our dataset:
print(object.size(sdat_sst), units = "Kb")
5177.4 Kb
Not too bad. If it were much bigger, though, you might want to simplify the data, for instance by aggregating points.
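Here's one hypothetical way to thin the points: snap them to a 0.5-degree grid and keep the mean richness per grid cell (this assumes lon/lat coordinates in EPSG 4326, and the x and y columns we created earlier):

library(dplyr)
library(sf)
sdat_small <- sdat_sst %>%
  st_drop_geometry() %>%
  # snap coordinates to the nearest 0.5 degrees
  mutate(xbin = round(x * 2) / 2,
         ybin = round(y * 2) / 2) %>%
  group_by(xbin, ybin) %>%
  summarise(richness = mean(richness), .groups = "drop") %>%
  st_as_sf(coords = c("xbin", "ybin"), crs = 4326)
print(object.size(sdat_small), units = "Kb")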
Here’s a first map:
tm_shape(sdat_sst) +
tm_dots()