Associate Professor Chris J. Brown (Griffith University, chris.brown@griffith.edu.au, Twitter: @bluecology)
Professor David Schoeman (University of the Sunshine Coast, dschoema@usc.edu.au)
Professor Anthony J. Richardson (The University of Queensland and CSIRO, a.richardson@maths.uq.edu.au)
Dr Bill Venables (The University of Queensland and CSIRO, Honorary Research Fellow, Bill.Venables@gmail.com)
Dr Christina Buelow (Griffith University, c.buelow@griffith.edu.au)
These notes are provided to students at the R Workshop, held at The University of Queensland in February 2023.
2023 Chris J. Brown, David Schoeman, Anthony J. Richardson, Bill Venables, Christina A. Buelow
The aim of today’s course is to train you in the basic skills you need to be proficient at spatial analysis and mapping. We’re going to look at data wrangling, with a focus on its application to spatial data. Then we will do some spatial statistics and mapping.
We’re going to focus on some popular packages for the data wrangling tasks, many of which are drawn from a group of packages known as the ‘tidyverse’.
For the spatial analysis we will use tools based around the ‘sf’ (simple features) and ‘terra’ packages.
The main principles I hope you will learn are:
By the end of this course you should know:
We’re aiming to give you a realistic experience, so today’s course will be based around a particular project that requires wrangling data, building up analysis and creating maps.
It makes sense to do your spatial analysis in R, because today R is the leading platform for environmental data analysis. R also has add-on packages that can make it a fully fledged Geographic Information System, from reprojecting spatial data to creating maps.
A real advantage of R over other GIS programs is that it covers everything from data wrangling and statistics to spatial analysis. So having your data in R opens up a huge array of cutting-edge analysis tools, which you can apply straight to your spatial datasets.
And R is totally free.
A core principle of science is repeatability. Creating and sharing your data scripts helps you to fulfill the important principle of repeatability. It makes you a better scientist and can also make you a more popular scientist: a well thought out script for a common wrangling or analysis problem may be widely used by others. In fact, these days it is best practice to include your scripts with publications.
Most statistical methods in R require that your data are input in a certain ‘tidy’ format. This course will cover how to use R to easily convert your spatial data into a ‘tidy’ format, which makes error checking and analysis easy.
Steps include restructuring existing spatial datasets and combining different spatial and non-spatial data-sets. We will also create some data summaries, plots and maps. Then we will do some spatial statistical analysis. Finally, just for fun, we will finish by creating a web based map of our data.
The course will be useful for people who want to explore, analyse and visualise their own spatial data in R.
We’re going to assume you’re using a recent version of R (>3.6.0) and in the course we will use the RStudio editor.
To work through these notes you will need to install the add-on packages readr, tidyr, ggplot2 and dplyr. Or you can just get the package tidyverse, which has these and more. You will also need sf, terra, leaflet and tmap. We will also use mgcv, which comes with R, so you won’t need to install that one.
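If you are missing any of these, something like this will install them (a one-off step; tidyverse bundles readr, tidyr, ggplot2 and dplyr):

install.packages(c("tidyverse", "sf", "terra", "leaflet", "tmap"))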
This isn’t an absolute beginner course, so we are going to assume you have some knowledge of R already. If you are an absolute beginner, then you should take our other course.
We will assume you have at least a basic understanding of how R works (e.g. scripts, console, how to access data and what a ‘package’ is). As a guide, you should already be able to read data into R using R code (i.e. not using the menus in RStudio) and create some basic plots.
The code in these notes is however complete, so you can run this entire course successfully without having to ‘know’ anything. Though it will be better for your own learning if you type the code out yourself.
If you get an error, well done! That is a chance to learn for the real world. So ask one of us for help. And if you don’t understand something, also don’t be afraid to just ask one of us for help.
You’re close to finishing your PhD on plankton ecology and just need to do one final chapter. Your supervisor isn’t being much help (he’s off on a global trek promoting his new book).
You’re at the International Plankton Symposium (IPS2020) and you gather the courage to talk to Professor Calanoid, your academic hero. After enduring a long rant about Professor Salp’s plenary talk (“she’s just a backboneless filter feeder who doesn’t do any real research herself”), Prof Calanoid mentions that she’s read your first PhD paper on zooplankton biogeography.
Prof Calanoid was impressed with the extent of R analysis in your biogeography paper and goes on to suggest you collaborate on a new database she is ‘working with’.
The database has extensive samples of copepod richness throughout Australia’s oceans and the Southern Ocean too. Prof Calanoid has a hypothesis - that like many organisms, copepod species richness (just the number of unique species) will be higher in warmer waters than cooler waters. But she needs help sorting out the data and running some stats.
It will be a super easy paper for you, just do some of your R stuff and you will be a coauthor. It will likely be published in the top journal, The Nature of Plankton.
Of course, if you do this job she will support your fellowship application to the International Plankton Research Institute.
All you have to do is sort out the copepod data, match it to ocean temperature data and run some stats to test this hypothesis.
Prof Calanoid also wants you to make some flashy graphs to put in the paper, and make an interactive map of the data that they can share with their funders.
Oh and time is short, this is an open-access database, so Prof Calanoid needs this done in the next 3 weeks so that she can submit a paper on it before Professor Salp does. So better drop all your other commitments.
Professor Calanoid sends you the data files. The spreadsheet copepods_raw.csv has measurements of copepod species richness from around Australia. Copepods are a type of zooplankton, perhaps the most abundant complex animals on the planet and an important part of ocean food-webs. Prof Calanoid has also sent you some other data, but has not explained what that is for yet. You’ll have to figure that out.
Copepod species were counted using samples taken from a Continuous Plankton Recorder. The CPR was towed behind ‘ships of opportunity’ (including commercial and research vessels). ‘Silks’ run continuously through the CPR and the plankton are trapped onto the silks, kind of like a printer that runs all day and night to record plankton in the ocean.
(The data we’ve given you are in fact modified from real data, provided by Professor Ant Richardson. Ant runs a plankton lab that is collecting and processing these data from a program called AusCPR. Find out more here.)
So Prof Calanoid’s data is what we’ll work with today. We’ve tried to make this as realistic a learning experience as possible. So be ready to face some errors in the data from Prof Calanoid!
So now we’re almost ready to start the course. But before we get started, there are a few technical things you need to know about how we will use R today.
We’ve provided all the data from Professor Calanoid in a sub-folder data-for-course.
If RStudio is already open when you open a script to get started, then don’t forget to set the working directory with setwd() or under the ‘Session’ menu.
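For example (the path below is just a placeholder; use the folder where you saved your script):

setwd("path/to/your/course-folder") # a placeholder path: the folder that contains data-for-course/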
If you make your own scripts, you should save them with the data folder as a subfolder and everything should work fine.
The modern quantitative scientist has to know a lot more about working with databases and data analysis than in the past. Scientists are increasingly integrating a large variety of spatial data-sets into their work. These analyses require matching data-sets that may have been entered in different ways, or cover different temporal and spatial scales.
All of these procedures can be termed data wrangling. In this course we are going to learn how R can be used for wrangling of spatial data. We’re going to work through a ‘realistic’ case-study.
As expert R users we have often been faced with situations where a collaborator has asked us to ‘just run some numbers’ on a dataset, and be rewarded with an ‘easy’ paper.
‘Easy’ is often far from the truth. And the time-consuming part isn’t the stats. It’s the cleaning of the data that takes a lot of time. And then often we need to match up the new data to existing data-sets, such as when we want to know whether the spatial distribution of plankton correlates with ocean temperature.
If you have to deal with large data-sets you may realise that data wrangling can take a considerable amount of time and skill with spreadsheet programs like Excel. Data wrangling is also dangerous for your analysis: if you stuff something up, like accidentally deleting some rows of data, it can affect all your results and the problem can be hard to detect.
Remember Prof Calanoid? Well let’s get started with that copepod richness data. In this part of the course we are going to clean it up and run some basic analyses.
Fire up RStudio, start a new R script (just click the symbol with the little green plus) and save the script in the same folder as where you have ‘data-for-course/’ as a sub-folder. You might like to call your script copepod-wrangling.R.
We always start our scripts with some comments that include a description of goals, our name and date. So do that too.
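Something like this works at the top of your script (the details here are just an example):

# Wrangling and mapping of Prof Calanoid's copepod richness data
# Your Name
# February 2023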
Now let’s look at that first spreadsheet Prof Calanoid sent us. We don’t know the data well, and Prof Calanoid hasn’t told us much about it (or sent us any meta-data on what it all means), so we will want to do some thorough checks and visuals before we run any analyses.
This mirrors situations that all of us (Dave, Ant, Bill and Chris) have often come across. We are given data by collaborators, so we need to check and do some visuals on it before we do the analysis, to make sure we understand it well and avoid errors.
It is common to see people hired to do an analysis of a ‘complete’ data-set, but it ends up taking them the entire contract just to sort out the data, which weren’t really complete after all. R can help speed up this process, so the analysis (what you’re ultimately paid to do) gets done.
We will load in the data using a package from the tidyverse called readr. readr is handy because it does extra checks on data consistency over and above what the base R functions do. Data frames imported by readr also print in summary form by default. Let’s see how:
library(readr)
dat <- read_csv("data-for-course/copepods_raw.csv")
dat
# A tibble: 5,313 × 11
silk_id segment_no latitude longitude sample_time_utc project route vessel
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 -28.3 154. 26/06/2009 22:08 AusCPR BRSY ANL Win…
2 1 5 -28.7 154. 26/06/2009 23:12 AusCPR BRSY ANL Win…
3 1 9 -29.0 154. 27/06/2009 0:17 AusCPR BRSY ANL Win…
4 1 13 -29.3 154. 27/06/2009 1:22 AusCPR BRSY ANL Win…
5 1 17 -29.7 154. 27/06/2009 2:26 AusCPR BRSY ANL Win…
6 1 18 -29.8 154. 27/06/2009 2:43 AusCPR BRSY ANL Win…
7 1 26 -30.4 153. 27/06/2009 4:52 AusCPR BRSY ANL Win…
8 1 30 -30.7 153. 27/06/2009 5:57 AusCPR BRSY ANL Win…
9 1 33 -31.0 153. 27/06/2009 6:45 AusCPR BRSY ANL Win…
10 1 37 -31.3 153. 27/06/2009 7:50 AusCPR BRSY ANL Win…
# … with 5,303 more rows, and 3 more variables: meanlong <dbl>, region <chr>,
# richness_raw <dbl>
(If you want to print the entire data frame, then use this code: data.frame(dat) to turn it back into a base R data frame.)
In this data you will see a silk_id column, which is just the ID for each of the silks, onto which plankton are recorded. For processing, silks are divided into segments, so you will also see a segment_no column. The other columns are pretty self explanatory.
It’s a good idea to do some visuals with new data to check they are in good shape.
We will be using ggplot2 for graphs and tmap for maps in this course. Both work well within the tidyverse paradigm, and both have some pretty powerful tools for quickly creating graphics.
You might like to download RStudio’s ggplot cheatsheet for reference.
At its heart a map is just a plot of spatial coordinates. So let’s make our first map with ggplot, plotting the coordinates for the samples (segments of the CPR silks):
library(ggplot2)
ggplot(dat) +
aes(x = longitude, y = latitude, color = richness_raw) +
geom_point()
Which just shows the location of every segment. We also coloured points by the species richness. You can kind of see the CPR surveys wrapping around the coast of Australia.
So far so good. Now let’s look at the richness data, our main variable for analysis.
In this course we’ll be blending a lot of spatial and non-spatial analysis and visuals. This is common in a spatial analysis workflow. Sometimes the idea of learning spatial analysis sounds daunting, but much of it is just a continuation of non-spatial data analysis techniques. So if you can already do those in R you can transfer those skills to spatial analysis.
Let’s try another plot of latitude versus richness:
ggplot(dat, aes(x = latitude, y = richness_raw)) +
stat_smooth() +
geom_point()
It’s handy to do lots of plots when getting to know your data, maps or not. Something clearly looks odd with this graph, like there is an unnatural change in the data pattern at about -40. Well, at least we have some results.
We’ll send this graph to Prof Calanoid and see what she thinks the issue is. Hopefully we’ll hear back soon.
(Hint: if you actually want to save the above graph, use ggsave.)
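For instance, to save the most recently displayed ggplot (the file name and size here are just suggestions):

ggsave("richness-by-latitude.png", width = 6, height = 4)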
Let’s repeat the above map of richness, but this time using some of R’s specialist packages for GIS and mapping.
The first step is to turn our point data into a spatially referenced data frame. We will use the sf package to do this. sf stands for ‘simple features’, an open standard for geospatial databases. Interfaces using the simple features standard are available in many GIS programs, and for R we have the package sf.
A good introduction to sf can be found in Geocomputation with R, which is free online.
In the past, spatial analysis in R was usually done with the sp package, which uses a different (and less convenient) paradigm for databases. You may still have to use it sometimes, when sf doesn’t play nicely with other packages. Eventually sf will replace sp, so we will be teaching sf today.
So first install sf if you don’t have it. This may prove to be a bit tricky depending on your operating system. sf depends on lots of other packages, including rgdal, and rgdal is famous for issues with installation. So see how you go. If sf won’t install, then just follow along on our screen for now and figure that out later (with lots of googling).
Now, let’s turn our data into a ‘simple features collection’.
library(sf)
sdat <- st_as_sf(dat, coords = c("longitude", "latitude"),
crs = 4326)
There’s a bit to explain in this simple step. st_as_sf is the function that converts different data types to simple features. Try ?st_as_sf to see what else it can convert. dat is our original data. The coords argument gives the names of the columns that relate to the spatial coordinates (in order of X coordinate followed by Y coordinate).
What does the part starting with crs do? crs stands for Coordinate Reference System.
What is a coordinate reference system? Well the earth is spherical. So at the very least, we’d need a 3 dimensional coordinate system and a reference point to accurately map locations on the Earth’s surface.
Also, the earth is not a perfect sphere. It is lumpy and fatter at the equator because of its spin. So we also need a model that describes the spherical surface of the earth.
In mapping, we refer to the reference point as the datum and the earth model as an ellipsoid. Together, these make a geographic coordinate reference system (GCS), which tells us where the coordinates of our copepod data are located on the earth.
GCSs are represented by angular units (i.e. longitude and latitude), usually in decimal degrees. Our copepod coordinates are long-lat, so we chose a common ‘one-size-fits-all’ GCS called WGS84 to define the crs, using the EPSG code 4326.
What is an EPSG code? It’s a unique, short-hand code for a specific crs.
Common practice for defining crs’s in R has traditionally been with either an EPSG code or a proj4string. You can still use either of these in R, but note that proj4strings are being phased out, and best practice is to use either an EPSG code or Well-known text (WKT) to define a crs. A WKT string contains all of the detailed information we need to define a crs, but is cumbersome if you don’t need all of the detail. Read this for a more complete overview.
It’s easy to find out all of the above for a chosen crs in R. For example, for the EPSG code 4326 we can find out: 1) the name of this crs, 2) the corresponding proj4string, and 3) the WKT:
crs4326 <- st_crs(4326)
crs4326$Name # name of the crs
[1] "WGS 84"
crs4326$wkt # crs in well-known text format
[1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
crs4326$proj4string # crs as a proj4string
[1] "+proj=longlat +datum=WGS84 +no_defs"
(Note: it’s good that we can still extract the proj4string because, although it’s being deprecated, it’s still used in other spatial data packages like raster and terra.)
As mentioned above, EPSG:4326 is short-hand code for WGS84, a common long-lat geographic coordinate reference system (GCS).
When we make a 2-dimensional map in WGS84 GCS, we assume that a degree is a linear unit of measure (when in reality it’s angular).
To more accurately map our data in 2 dimensions, we need to decide
how to ‘project’ 3 dimensions into 2. There are many ways to do the
projection depending on where we are in the world and what we’re most
interested in preserving (e.g., angles vs. distances vs. area).
Projections are defined by a projected coordinate reference system (PCS), and spatial packages in R use the software PROJ to do this.
If you want to learn more about projections, try this blog.
To find the most appropriate projected crs for your data, try the R package crsuggest.
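For example, a quick sketch with crsuggest (assuming you have installed that package; suggest_crs() returns a table of candidate projected crs’s for your data):

library(crsuggest)
suggest_crs(sdat) # returns candidate projected crs's, with EPSG codes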
Let’s have a look at what we’ve created in sdat:
sdat
Simple feature collection with 5313 features and 9 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 89.6107 ymin: -65.2428 xmax: 174.335 ymax: -16.80253
Geodetic CRS: WGS 84
# A tibble: 5,313 × 10
silk_id segment_no sample_time_utc project route vessel meanlong region
* <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1 1 26/06/2009 22:08 AusCPR BRSY ANL Windar… 153. East
2 1 5 26/06/2009 23:12 AusCPR BRSY ANL Windar… 153. East
3 1 9 27/06/2009 0:17 AusCPR BRSY ANL Windar… 153. East
4 1 13 27/06/2009 1:22 AusCPR BRSY ANL Windar… 153. East
5 1 17 27/06/2009 2:26 AusCPR BRSY ANL Windar… 153. East
6 1 18 27/06/2009 2:43 AusCPR BRSY ANL Windar… 153. East
7 1 26 27/06/2009 4:52 AusCPR BRSY ANL Windar… 153. East
8 1 30 27/06/2009 5:57 AusCPR BRSY ANL Windar… 153. East
9 1 33 27/06/2009 6:45 AusCPR BRSY ANL Windar… 153. East
10 1 37 27/06/2009 7:50 AusCPR BRSY ANL Windar… 153. East
# … with 5,303 more rows, and 2 more variables: richness_raw <dbl>,
# geometry <POINT [°]>
Note too that when you print sdat to screen you get some meta-data, including the geometry type, its dimensions (just XY here, but we could also have XYZ), its ‘bbox’ (bounding box) and the CRS (coordinate reference system).
The data table in sdat looks much like dat did, but note it now has a geometry column. This is where the coordinates (just one point for each data row) are stored. More complex simple features could have a series of points, lines, polygons or other types of shapes nested in each row of the geometry column.
The nice thing about sf (that wasn’t true of sp) is that, because the data is basically a dataframe with a geometry, we can use all the operations that work on dataframes on sf simple features collections.
These include data wrangling operations like inner_join, plotting operations from ggplot2 and model fitting tools too (like glm).
sf also adds geometric operations, like st_join, which does joins based on the coordinates. More on this later.
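For example, ordinary dplyr verbs work directly on our points (the richness threshold here is arbitrary, just for illustration):

library(dplyr)
filter(sdat, richness_raw > 50) # returns an sf object, geometry retained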
Now let’s get into the mapping so we can make something that looks good to send to Prof Calanoid.
sf has simple plotting features, like this:
plot(sdat["richness_raw"])
We’ve chosen just to plot the richness column. If we had just plot(sdat) it would do one panel for each variable in the dataframe. The ["richness_raw"] just selects that variable.
The package tmap is one of many packages for making more sophisticated maps. Let’s try it:
library(tmap)
tm_shape(sdat) +
tm_dots(col = "richness_raw")
tmap works much like ggplot2 in that we build up and add on layers. In this case we have just one layer, from sdat. We declare the layer with tm_shape() (in this case sdat), then give the plot type with the following command.
Here we are using tm_dots to plot dots at the coordinates. Other options are tm_polygons, tm_symbols and many others we’ll see later.
We’ve picked the column "richness_raw" for the colour scale.
We can customize the plot, for instance:
tm1 <- tm_shape(sdat) +
tm_dots(col = "richness_raw",
palette = "Blues",
title = "Species #")
tm1
Now we’ve stored the plot object as tm1. If we want to see it we just type tm1. This means we can do more to it later on, like add in more layers.
We’ve set a colour palette, 'Blues'. Then we’ve told it to title the colour scale (if we didn’t give a title, it would title the colour scale ‘richness_raw’).
The name ‘Blues’ comes from the RColorBrewer package, which provides a great catalogue of colour palettes. Check out the interactive web app for other palettes: colorbrewer.org. You can use these names directly in the tmap palette argument.
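You can also preview all of the RColorBrewer palettes, with their names, directly in R:

library(RColorBrewer)
display.brewer.all() # plots every palette with its name alongside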
Let’s save this figure and email it to Prof Calanoid to get her opinion. At least she will be happy to hear we are making progress:
tmap_save(tm1, filename = "Richness-map.png",
width = 600, height = 600)
tmap_save is just a function for saving tmaps. Try ?tmap_save for other options, like changing the figure size, resolution or file type. The only other trick we used above was to save the tmap to a variable name, tm1. This meant we could drop that variable name into tmap_save, which identifies which map we want to save.
We use tmap because it is more intuitive and less ‘buggy’ than other map plotting packages. It has the advantage of allowing you to layer different maps, to plot both shapefiles and rasters, to easily change colour scales, and to easily add compasses and scale bars. It is also pretty intuitive (the old plot method, in comparison, would offset multiple layers by a small amount and you had to adjust the coordinates to account for it!).
If you get competent at the basics then getting decent maps quickly will become easy for you. Your mapping life will seem good until one day, you decide you want to change the colours, or the legend, or the order of panels or some other detail. Then very quickly you will find yourself bashing your head against your monitor in frustration (at least I do).
That’s the thing about tmap: it’s really convenient, until suddenly it isn’t. It’s also pretty new, so there isn’t a lot of help out there, though help online is expanding every time I look.
So press on, use web searches if you get stuck and don’t be afraid to ask a question on stack overflow if you can’t find your problem answered online already.
As we said earlier, sf can handle lots of different types of spatial data, including shapes like polygons. To practice with polygons, let’s load in a map of Australia and a map of Australia’s continental shelf. We’ll add these as layers using tmap.
The polygons are stored as shapefiles in the course data. We can read them in with the st_read command (which is like read_csv, but for spatial files!):
aus <- st_read("data-for-course/spatial-data/Aussie/Aussie.shp")
Reading layer `Aussie' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/Aussie/Aussie.shp'
using driver `ESRI Shapefile'
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
shelf <- st_read("data-for-course/spatial-data/aus_shelf/aus_shelf.shp")
Reading layer `aus_shelf' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/aus_shelf/aus_shelf.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.2242 ymin: -44.1284 xmax: 153.8942 ymax: -8.8798
Geodetic CRS: GRS 1980(IUGG, 1980)
And they are loaded. You can check out the data quite easily by typing the object names:
aus
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
adm geometry
1 NSW MULTIPOLYGON (((141.0029 -3...
2 Vic MULTIPOLYGON (((140.9659 -3...
3 QLD MULTIPOLYGON (((153.5522 -2...
4 SA MULTIPOLYGON (((129.0013 -2...
5 WA MULTIPOLYGON (((129.0012 -1...
6 Tas MULTIPOLYGON (((148.3096 -4...
7 NT MULTIPOLYGON (((137.9962 -1...
8 ACT MULTIPOLYGON (((149.1253 -3...
It is widely known that .shp files are a terrible format. They are inefficient at storing data and to save one shapefile you actually create multiple files. This means bits of the file can get lost if you transfer the data somewhere.
A better format is the Geopackage. Geopackages can save multiple different data types all in a single file. And they compress the data more. Read more about different file formats here.
We’ve used .shp files in this course because we expect you are most likely to encounter that format. But we encourage you to save your own data in the .gpkg format.
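For instance, saving (and re-reading) the Australia polygons as a geopackage might look like this (the file name is just a suggestion):

st_write(aus, "data-for-course/spatial-data/aussie.gpkg") # one self-contained file
aus2 <- st_read("data-for-course/spatial-data/aussie.gpkg")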
People often ask us where we get our spatial data from. The answer is all sorts of places, sometimes from collaborators, sometimes we make it from scratch and other times we download it from online data providers (try the AODN for Australian ocean data).
There is an instructive bonus section on how we created the map of Australia’s shelf, if you’d like to know more about sourcing spatial data for maps.
It is easy to make a map of a polygon with tmap:
tm_shape(shelf) +
tm_polygons()
A thematic map can be built up as a series of layers, so we can just keep adding to our map if we want:
Note how each new layer starts with a tm_shape line; then we say what type of shape we want. We could keep going and add the land and the copepod points:
tm_shape(shelf) +
tm_polygons(col = 'lightblue') +
tm_shape(aus) +
tm_polygons() +
tm_shape(sdat) +
tm_dots()
We’ve made the shelf ‘lightblue’ to differentiate it from the land.
But we are missing the samples in the Southern Ocean. This is because the extent for a tmap is set by the first tm_shape. We can fix this by setting the bbox (bounding box):
tm_shape(shelf, bbox = sdat) +
tm_polygons(col = 'lightblue') +
tm_shape(aus) +
tm_polygons() +
tm_shape(sdat) +
tm_dots()
Now our map extent matches the samples.
Let’s stop here and give you some time to play around with tmap and its different features. I really want you to explore the package and try different things. If you get errors, good, they are a learning opportunity.
To get started you might want to type ?tmap to peruse tmap’s different features.
To learn about a quick way to change the style, type tmap_style("beaver") then run your map code again.
Or try the tmap vignette. It can be accessed with vignette('tmap-getstarted') or a web search for ‘r tmap’.
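For example, here is a sketch combining a few of these features (all the specific choices are just suggestions to get you started):

tm_shape(aus, bbox = sdat) +
  tm_polygons() +
  tm_shape(sdat) +
  tm_dots(col = "richness_raw", palette = "Blues") +
  tm_compass() +
  tm_scale_bar() +
  tm_layout(title = "Copepod richness")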
Here’s a map we made, to help give you some ideas:
Prof Calanoid gets back to us about the figure we sent of richness and latitude. As we had suspected, she says the results are junk. Prof Calanoid has now explained that we need to standardize richness estimates, because silks from different routes have different sizes.
Prof Calanoid had already provided the silk sizes in the file Route-data.csv, but had neglected to tell us we needed to use this for the standardisation (typical!). No worries though: we just need to learn how to join the routes data to our richness data, and then the standardisation will be easy.
dplyr stands for ‘data pliers’. Along with tidyr, it is one of the most useful packages for getting your data into shape for analysis. It works very well with spatial data stored as sf objects. dplyr is part of the tidyverse, so it plays very nicely with ggplot2 and readr.
One thing dplyr is good at is joining data frames by matching columns. Try typing ?inner_join in your console and you will get a list of all the join types it supports. Remember dplyr can only do joins on the data table, not the geometry. We will look at spatial joins later on.
Today we will use inner_join. Below, the code will join dat to the routes data, using columns with the same names to match by. It will keep all rows from dat where there are matching rows in routes, so if rows don’t match, they will be chucked out (use left_join if you want to keep rows in dat that don’t match too).
inner_join will also duplicate rows if there are multiple matches.
routes <- read_csv("data-for-course/Route-data.csv")
Have a quick look at routes now to make sure you are happy with it. Then we will just use inner_join to join it with our spatial data (making sure we check that the number of rows stays the same):
sdat_std <- inner_join(sdat, routes, by = "route")
nrow(sdat)
[1] 5313
nrow(sdat_std)
[1] 5313
nrow(routes)
[1] 25
length(unique(routes$route))
[1] 25
We checked the number of rows and the number of unique route names to make sure we didn’t inadvertently remove or duplicate any data.
It is worth taking a look through the resulting data (e.g. hover your mouse over the data frame and press F2) to make sure the join worked as intended.
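A quick way to check for unmatched rows is dplyr’s anti_join, which returns the rows of its first argument that have no match in the second (run here on the non-spatial dat, just as a check):

anti_join(dat, routes, by = "route") # zero rows means every sample matched a route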
Joins are a very important but very dangerous data wrangling operation! You must always choose your join type carefully. For instance, inner_join vs left_join vs full_join will all give the same result for some datasets, but not others.
Even after you think you know what you are doing, you still need to check the outcome. As we explained above, you can lose samples that don’t match, or duplicate samples that match multiple times. I (CB) have made (and thankfully corrected) this mistake many times, often because of small inconsistencies in the data I was provided, like when some site names have all lower case, and a handful have title case.
We don’t say this to put you off joins; they are one of the most useful data wrangling operations you can do in R. Just be careful.
Once we have a matching silk_area value for each sample, it is easy to add a new variable that is standardised richness. To do this we use mutate, which just takes existing variables and calculates a new variable (or overwrites an existing one if we give it the same name):
sdat_std <- mutate(sdat_std,
richness = richness_raw/silk_area)
Ok, let’s plot standardized richness against latitude so we can send a new graph to Prof Calanoid. To do that we first need to extract the latitude, since it is now stored in the geometry:
sdat_std$Lat <- st_coordinates(sdat_std)[,2]
Now some straightforward ggplot:
ggplot(sdat_std) +
aes(x = Lat, y = richness, color = richness) +
geom_point() +
stat_smooth() +
theme_bw()
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
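This warning is harmless here: stat_smooth drops the colour aesthetic because it fits a single smooth across all points. If you want to avoid it, one option is to map colour inside geom_point() only, so the smooth never receives it:

ggplot(sdat_std, aes(x = Lat, y = richness)) +
  geom_point(aes(color = richness)) +
  stat_smooth() +
  theme_bw()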
Do you see a different pattern now?
We should also save the standardised data for use later:
save(sdat_std,file = "data-for-course/spatial-data/copepods_standardised.rda")
Prof Calanoid didn’t give us the spatial layers for Australia or the continental shelf. We had to derive them ourselves.
We often get asked where we get our spatial data from. The answer is, lots of sources! But often from web repositories, like the excellent Australian Ocean Data Network.
It will be instructive to go through how the shelf data was made. Its derivation is a bit convoluted, but that is pretty common when you are making customised maps.
First, we downloaded a shapefile of bathymetry from the AODN. It is line data, with lines for contours at 5 m resolution.
Then we read it in with sf and filtered for the 200 m contour as our definition of the continental shelf.
library(sf)
shelfin <- st_read("data-raw/contour_5m/contour_5m.shp")
shelfin
range(shelfin$CONTOUR)
shelf200 <- dplyr::filter(shelfin, CONTOUR == 199.9)
plot(shelf200["LENGTH"])
Now we have lots of little line segments (the data is not one continuous line around Australia unfortunately). We want to make a polygon from these bits of lines.
So we did some web searching, and on Stack Overflow we read about the package concaveman, which fills in the gaps to make a polygon. So we installed that package and used it to make a continuous polygon:
library(concaveman)
ushelf200 <- concaveman(shelf200)
plot(ushelf200)
Then we just check it has made a valid polygon (see ?st_is_valid for more explanation):
st_is_valid(ushelf200)
Turns out it’s not valid, which may cause us issues with spatial joins later on. We can make it valid with the st_make_valid function:
ushelf200 <- st_make_valid(ushelf200)
st_is_valid(ushelf200)
That worked.
The data also has a Z dimension that we don’t need. So let’s drop that:
ushelf200 <- st_zm(ushelf200)
Now save it, so it’s all ready for the students (you!):
st_write(ushelf200, "data-for-course/spatial-data/aus_shelf.shp")
If you find yourself in this kind of convoluted workflow trying to get spatial data to work, don’t despair. It’s common; keep web searching and you will find answers to get you to something that works. Just don’t under-estimate how long this kind of thing can take, next time someone asks you to help with an ‘easy’ project.
In the first session we tidied and analysed Prof Calanoid’s data so that we could run some analysis of the relationship between latitude and species richness. But the job isn’t done yet.
Prof Calanoid’s actual hypothesis was about sea surface temperature. She also wanted to see some maps and create an interactive map for her funders.
So in this session we are going to look at how we combine our copepod data with spatial layers for temperature so we can do some spatial analysis on the relationship between temperature and richness.
Then we will generate some predictions for species richness, and map them, like a species distribution model.
Finally we will look at creating interactive maps.
In part 2 of this course we will look at some more advanced spatial analysis that R can do. We are going to look at spatial joins, where we join data based on their geometries. Then we will look at a new type of data, ‘raster’ (pixel) data, and see how we can extract raster data for a series of points.
We will use this extracted data to build a spatial statistical model to test for a relationship between SST and richness. Then, just for fun, we will build an interactive web map.
Let’s start with a clean slate for session 2. So I encourage you to start a new script and a new R session. Then we’ll just load in key packages and data from this morning:
library(tidyverse)
library(sf)
library(tmap)
load("data-for-course/spatial-data/copepods_standardised.rda")
aus <- st_read("data-for-course/spatial-data/Aussie/Aussie.shp")
Reading layer `Aussie' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/Aussie/Aussie.shp'
using driver `ESRI Shapefile'
Simple feature collection with 8 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.9211 ymin: -43.63192 xmax: 153.6389 ymax: -9.229614
Geodetic CRS: WGS 84
Prof Calanoid is ever curious. On seeing the map we sent her with the points crossing the continental shelf, she asks if shelf/off-shelf has an effect on richness. So we’ll need to do some more GIS to find out.
To look at the position of the samples with respect to the continental shelf, we’ll have to intersect the shelf and points data. To do that we need to do a spatial join. Unlike dplyr joins, which work on the data table, a spatial join works on the geometries (i.e. the coordinates). We’ll use st_intersects, which is going to tell us where points overlap the shelf polygon. Let’s have a go.
First, we will add a new variable shelf, which is just the same everywhere (because everywhere inside the polygon is on the shelf!). This will be handy later. Then we will use st_join to try the join, telling it to do an intersection (there are other types of join; see ?st_join for details):
shelf <- st_read("data-for-course/spatial-data/aus_shelf/aus_shelf.shp")
Reading layer `aus_shelf' from data source
`/Users/s2973410/Documents/Code/teaching/spatial-analysis-in-r/data-for-course/spatial-data/aus_shelf/aus_shelf.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 112.2242 ymin: -44.1284 xmax: 153.8942 ymax: -8.8798
Geodetic CRS: GRS 1980(IUGG, 1980)
shelf$shelf <- "Shelf"
sdat_shelf <- st_join(sdat_std, shelf, join = st_intersects)
Ah an error! Do we throw up our hands in despair and tell Prof Calanoid we can’t do it? No. Let’s look at this obscure message a bit more closely and try and figure out what it means.
Hmmm, so st_crs looks like a function; let’s find out what it does: ?st_crs.
Ah, so it tells us what the CRS is. So st_crs(x) == st_crs(y) is not TRUE is saying our two datasets have different coordinate reference systems (note the ==, which asks a question: does st_crs(x) equal st_crs(y)?). Let’s look at that:
st_crs(shelf)
Coordinate Reference System:
User input: GRS 1980(IUGG, 1980)
wkt:
GEOGCRS["GRS 1980(IUGG, 1980)",
DATUM["D_unknown",
ELLIPSOID["GRS80",6378137,298.257222101,
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["longitude",east,
ORDER[1],
ANGLEUNIT["Degree",0.0174532925199433]],
AXIS["latitude",north,
ORDER[2],
ANGLEUNIT["Degree",0.0174532925199433]]]
st_crs(sdat_std)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
One is using GRS80 as a coordinate reference system, and the other is using WGS84. (Note: in reality these crs’s are nearly identical, but we need to make them the same if we want to intersect them.)
We can fix this easily with a call to st_transform. Let’s put shelf on the same datum as sdat_std:
shelf <- st_transform(shelf, crs = st_crs(sdat_std))
Now try the join again:
sdat_shelf <- st_join(sdat_std, shelf, join = st_intersects)
names(sdat_shelf)
[1] "silk_id" "segment_no" "sample_time_utc" "project.x"
[5] "route" "vessel" "meanlong.x" "region"
[9] "richness_raw" "geometry" "project.y" "number_segments"
[13] "meanlat" "meanlong.y" "silk_area" "richness"
[17] "Lat" "FID" "shelf"
unique(sdat_shelf$shelf)
[1] "Shelf" NA
Note it has combined the data fields from sdat_std and shelf when it did the join. This includes the variable we created, shelf, which has the value ‘Shelf’ where the point is inside the polygon (i.e. on the shelf) and NA where the point isn’t in the polygon.
Let’s rename those NA values to ‘Offshore’ so that is clear:
library(tidyr)
sdat_shelf <- mutate(sdat_shelf,
shelf = replace_na(shelf, "Offshore"))
table(sdat_shelf$shelf)
Offshore Shelf
4110 1203
So we have many more samples offshore than on the shelf.
Now to check the intersection worked, we can map it out, colouring points by whether they were on or off the shelf:
tm_shape(shelf, bbox = sdat_shelf) +
tm_polygons(col = "grey10") +
tm_shape(sdat_shelf) +
tm_dots(col = "shelf", palette = "RdBu") +
tm_graticules()
Intersections and related operations can be slow for complex polygons, or when intersecting lots of polygons. sf speeds up these computations by implementing spatial indexing, which means it creates bounding boxes around the polygons first, so it can narrow the search for intersections to those places (thus saving computation time).
sf functions like st_intersection(x, y) (and st_join) compute the spatial index on the first shape given to the function (i.e. x). So swapping them around could speed up computations in some instances. Give it a try if speed is an issue for you.
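For instance, you could time the two orderings on our data with base R’s system.time (note the two calls return differently structured results, so this is just a speed check):

system.time(st_join(sdat_std, shelf, join = st_intersects))
system.time(st_join(shelf, sdat_std, join = st_intersects))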
This feature means that sf operations can be significantly faster than similar operations from other packages (like sp).
Now we have all the data together, we can do another ggplot, with smooths, to visually check whether there is an effect of the continental shelf on the richness gradient.
ggplot(sdat_shelf) +
aes(x = Lat, y = richness, color = shelf) +
geom_point(alpha = 0.5, size = 0.2) +
stat_smooth() +
theme_bw()
It looks pretty consistent, though there is an interesting deviation of higher richness on the shelf at around -32. Looking at the map above, this may reflect the high number of on-shelf observations in Western Australia at -32 and the total absence of off-shelf observations there.
We could figure this out with some more exploration (for instance, dividing by east and west coasts), but we’ll leave that for you to do in your own time.
Our aim was to uncover the relationship between temperature and copepod richness. To do that we need some spatial data on temperature, so we can extract temperature at the sampling sites.
We have provided you with two files, MeanAVHRRSST.gri and MeanAVHRRSST.grd, which contain gridded maps of annual mean sea surface temperature from the Hadley dataset. Gridded data, also known as raster data, can be read and manipulated with the terra package. Once you have installed this package, load it in:
library(terra)
A note about raster data: which package should I use?
There used to be only one package for raster data, raster. Now there are three main ones (raster, terra and stars). In this course we have decided to use terra. terra is similar to, but more adaptable and faster than, raster, and is made by the same people, so it supersedes raster. terra is easy to learn if you have already used raster.
There are several things to consider when deciding if you shift to using a newer package. Chris discusses them in his blog comparing raster and terra.
There still may be cases where you need to use raster, such as to make sure data are compatible with the format required by other packages. For instance, terra is sometimes, but not always, compatible with sf objects. So it is common, when packages change or update, that we have to change object types a lot. This is something we’ll see below.
In such cases, you can easily convert a terra SpatRaster object to a raster object with the command raster::raster(). If you need an sf vector object (e.g. point data) to be compatible with terra’s formats, convert it with terra::vect().
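For example (sdat_std is our copepod points from earlier):

copepods_vect <- terra::vect(sdat_std) # sf points -> terra SpatVector
# and raster::raster(a_SpatRaster) goes the other way, if a package demands it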
We can then load and view the SST raster like this:
rsst <- rast('data-for-course/spatial-data/MeanAVHRRSST/MeanAVHRRSST.grd')
plot(rsst)
rast will read the raster data in as a SpatRaster object.
This creates a pretty decent first plot of the raster. However, note the colour scale isn’t that appropriate for temperatures: green where temperatures are hot and red where they are cold. Further, these default colours wouldn’t be that great if our audience was red-green colour blind (and we suspect that Prof Calanoid is colour blind).
A colour scale with multiple hues is also inappropriate for continuous data where there are no natural breaks.
Notice how the change from 12-14 degrees looks much bigger than the change from say 18-20 degrees (the yellow band south of Tasmania and across New Zealand), but both are 2 degrees of change. This effect is created by the shift across 3 hues (red-yellow-green) at around 13 degrees, but only a change in the intensity of green at 19 degrees.
We can use tmap for SpatRasters too.
tm_shape(rsst) +
tm_raster(palette = "-RdBu", title = "SST")
Note the “-” in front of the palette name, which puts red at warmer temperatures by reversing the colour palette.
Sequential palettes like Reds are generally a good choice for temperatures, and are most appropriate when our data has a linear scale. You may also see people use diverging palettes like RdBu (red-blue). We’ve used one here because we have temperatures <0.
Unlike the terra::plot() function, tmap puts the hue change at a sensible place (zero). When there is a natural break-point in the data, like zero degrees, diverging palettes make sense.
It is straightforward to add our samples as points over the raster with tmap:
tm_shape(rsst) +
tm_raster(palette = "-RdBu", title = "SST") +
tm_shape(sdat_std) +
tm_dots(col = "richness_raw",
palette = "Greys",
title = "Species #") +
tm_compass()
We have overlaid our copepod sampling points on the map of temperature, now let’s extract the temperature values at those sampling sites, so we can test Prof Calanoid’s hypothesis about SST.
This is easy to do with terra’s extract function.
sdat_std$sst <- terra::extract(rsst, vect(sdat_std))[,2]
Note that terra::extract() isn’t compatible with sf objects (yet), unlike raster::extract(). So we need to wrap our points data in vect() to convert it to terra’s SpatVector format for storing point data.
This is one of the object conversions I mentioned that we just have to deal with when using the latest packages. They usually don’t cost much time if you know what you are doing.
Now we can plot the correlation between richness and SST. We can also run a test to calculate the Pearson correlation coefficient:
ggplot(sdat_std, aes(sst, richness)) +
geom_point() +
theme_minimal()
Warning: Removed 3 rows containing missing values (`geom_point()`).
with(sdat_std, cor.test(sst, richness))
Pearson's product-moment correlation
data: sst and richness
t = 46.698, df = 5308, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5202827 0.5584213
sample estimates:
cor
0.5396288
The test indicates a significant positive correlation with SST. However, if you examine the plot you may notice that variance in richness tends to increase with SST.
Can you think of a more appropriate way to model this data that will be more satisfactory to Prof Calanoid?
Note that ggplot warned us that it ‘removed 3 rows because of missing data’. That is because some of the sst values are missing:
filter(sdat_std, is.na(sst))
Simple feature collection with 3 features and 17 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 152.0625 ymin: -32.6829 xmax: 152.4542 ymax: -32.10338
Geodetic CRS: WGS 84
# A tibble: 3 × 18
silk_id segment_no sample_time_utc project.x route vessel meanlong.x region
* <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 362 93 7/12/2015 16:09 AusCPR BRNC Island C… 153. East
2 362 97 7/12/2015 17:51 AusCPR BRNC Island C… 153. East
3 362 101 7/12/2015 19:33 AusCPR BRNC Island C… 153. East
# … with 10 more variables: richness_raw <dbl>, geometry <POINT [°]>,
# project.y <chr>, number_segments <dbl>, meanlat <dbl>, meanlong.y <dbl>,
# silk_area <dbl>, richness <dbl>, Lat <dbl>, sst <dbl>
Let’s get rid of those rows, because the missing data will cause us issues later:
sdat_sst <- filter(sdat_std, !is.na(sst))
The ! just means ‘NOT’, so we are asking for the rows that are not NA (missing).
We could use a GAM to model richness on SST with a Poisson distribution. The Poisson is appropriate for count data.
library(mgcv)
m1 <- gam(richness ~ s(sst, k=5), data = sdat_sst, family = 'poisson')
plot(m1)
So we’ve limited the degrees of freedom of the spline, so it can’t be too ‘bendy’. We’ve also asked gam to use a Poisson distribution (the default is Gaussian), because our data are discrete counts of the number of species.
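It’s also worth looking at the model summary and mgcv’s built-in residual diagnostics (both are standard mgcv functions):

summary(m1)   # smooth terms and deviance explained
gam.check(m1) # residual plots and basis dimension checks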
Our data are spatial, and we’d not expect the sst-richness relationship to be the same in all places. For example, the types of species that make up our data are different in the Indian and Pacific oceans.
We can check for different patterns across the different oceans the data covers around Australia. So let’s include a fixed effect for region as well:
sdat_sst$Region <- factor(sdat_sst$region)
m1 <- gam(richness ~ s(sst, k=5, by = Region) + Region, data = sdat_sst, family = 'poisson')
Let’s look at the result (this is easiest with ggplot2). We’ll add the GAM predictions onto the end of our dataframe so we can plot those.
sdat_sst$pred_m1 <- predict(m1, type = "response")
ggplot(sdat_sst) +
aes(x = sst, y = richness, color = Region)+
geom_point(size = 0.2, alpha = 0.3) +
geom_line(aes(y = pred_m1), size = 1) +
facet_grid(.~Region) +
theme_bw()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
The type = "response" argument ensures we get predictions of numbers of species, not log(species). Note we used aes again inside geom_line() to ensure we got the right y values for the lines.
So we send the results of our GAM to Prof Calanoid and we get the response:
“The poisson model you use for your GAMs is clearly pretty bad and potentially misleading. It is very overdispersed, as such models often are. I would suggest changing to a negative binomial. The catch is you have to supply theta (the dispersion parameter). A good way to pick one is just to start with a trial value and adjust it until your deviance matches your df.residual fairly closely.”
To see these numbers type:
deviance(m1)
[1] 14746.02
m1$df.residual
[1] 5296.811
The deviance is the sampling variation we have; the residual df is the sampling variation our model expects. So clearly our model is under-estimating the sampling variation (the data are ‘overdispersed’).
It does make a difference. Let’s have a look:
m2 <- gam(richness ~ s(sst, by = Region) + Region, data = sdat_sst, family = mgcv::negbin(theta = 1.99))
deviance(m2)
[1] 5294.37
m2$df.residual
[1] 5291.904
So we tried the negative binomial, and in this case negbin(theta = 1.99) does a pretty good job, and is much more restrained than the Poisson.
Here we just tried different values of theta (the dispersion parameter) until the deviance and residual df were reasonably close. You can also see the script ‘find-theta.R’ in your data folder for a way to automate this.
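We haven’t reproduced find-theta.R here, but the idea can be sketched in a few lines: fit the model over a grid of theta values and keep the one where the deviance is closest to the residual df (the script provided with the course may differ in detail):

thetas <- seq(1, 4, by = 0.1)
gap <- sapply(thetas, function(th) {
  m <- gam(richness ~ s(sst, by = Region) + Region,
           data = sdat_sst, family = mgcv::negbin(theta = th))
  abs(deviance(m) - m$df.residual)
})
thetas[which.min(gap)] # theta giving deviance closest to residual df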
A really neat way to check the ‘fit’ of count models is with rootograms. Here’s one for our model:
(The DataGLMRepeat package is my own package of handy functions on GitHub; if you want it, get it here: https://github.com/cbrown5/DataGLMRepeat)
A perfect fit would have all the grey rectangles sitting exactly on the horizontal line (at zero). The red line shows expected (modelled) counts; the bars show observed counts. So where a bar floats above the line, the model is over-predicting observed values (i.e. around richness values of 10). Where a bar hangs below the line, the model is under-predicting observations (i.e. at counts > 15).
So this model could still use a bit of work, but we’ll leave that for another day.
OK, let’s check out the results:
sdat_sst$pred_m2 <- predict(m2, type = "response")
ggplot(sdat_sst) +
aes(x = sst, y = richness, color = Region)+
geom_point(size = 0.2, alpha = 0.3) +
geom_line(aes(y = pred_m2), size = 1) +
facet_grid(.~Region) +
theme_bw()
Looks much the same as before. So our result was robust.
Note we could add standard errors to these plots by setting se = TRUE in our predict function.
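For example, a sketch of getting approximate confidence bands (predict with se.fit = TRUE returns both the fit and its standard error; here roughly plus or minus 2 SE on the response scale):

pred <- predict(m2, type = "response", se.fit = TRUE)
sdat_sst$pred_lo <- pred$fit - 2 * pred$se.fit
sdat_sst$pred_hi <- pred$fit + 2 * pred$se.fit

These could then be drawn with geom_ribbon() underneath the prediction lines.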
Now we might like to get a bit more sophisticated with our spatial model. We haven’t accounted for dependencies between the locations. Sampling sites that are close together are often more similar to each other than those that are further apart, so each sample is not a true ‘independent replicate’.
This phenomenon is often called spatial auto-correlation. It is a concern because if we over-estimate the number of truly independent replicates, our p-values will be too small and we will be overconfident in our effect size estimates (though sometimes we can underestimate effect sizes too). So a ‘significant’ finding is more likely to be spurious.
We won’t have time in the workshop to cover all the intricacies of spatial models. However, we’ll try one type and there is some more code below for you to try in your own time.
But first, some words of advice from Bill Venables.
In this context we really need to regard GAMs (and GLMs) as exploratory tools. They are powerful, but they rely on the (usually naive, but often harmless) assumption that, given the predictors, the observations can be regarded as independent.
Even if this is plainly not so, it doesn’t rule out the approach entirely as an exploratory tool. GAMMs and GLMMs have the capacity to allow for non-independence of various forms, but taking this to a realistic level would also be extremely intricate and usually dependent on the specifics of any example.
One approach to account for spatial dependencies is to use latitude and longitude as predictors. However, these usually confound with other predictors of a more general kind, such as SST, and we forego the chance to build a more generally applicable, and interpretable, model.
Another approach is to choose a spatial scale larger than the distance between samples, within which we will ‘clump’ samples. Then we can apply a random effect to those clumps. Of course, this assumes the ‘clumps’ are independent of each other.
While I can agree with these approaches, I favour a two-pronged attack, namely try with and without lat and lon as predictors. The purpose of the model that has lat/lon is merely to assess just how much you might be losing out on by omitting location. If this is “not much” then you can feel comfortable with your primary model, (i.e. omitting Lat & Long from the picture). If this is “wow, that’s a lot!” then firstly, you need to be aware of it and secondly, you might want to look around for other suitable predictors, of a non-location kind, to fill in the void.
Best practice would be to model the spatial dependencies directly, but we won’t cover those more complicated models here.
Say we wanted to allow the effect of SST to vary continuously with longitude. First extract longitude coordinates from the geometry:
sdat_sst$x <- st_coordinates(sdat_sst)[,1]
Now we just include the x variable in our call to the smoother and do the usual checks. I’ve included three different ways to fit this model below, for you to explore.
# m_int <- gam(richness ~s(sst) + s(x, y, bs = "gp", m = c(1,1)), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
# m_int <- gam(richness ~s(sst, by = x), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
m_int <- gam(richness ~s(sst, x), data = sdat_sst, family = mgcv::negbin(theta = 2.03))
m_int$df.residual
[1] 5280.651
deviance(m_int)
[1] 5277.21
The deviance is close to the residual degrees of freedom, which suggests the negative binomial is doing a reasonable job of capturing the dispersion in richness.
Note that when we plot this model we now get a contour plot (or surface):
plot(m_int)
But when we send that to Prof Calanoid she says it looks like the mess of pasta she had for dinner last night.
Below we’ll learn how to plot the model predictions so they look less like pasta and more like a map.
One approach Bill mentioned was to clump the data into relatively 'independent' groups and then model those clumps as random effects. We won't do this code in the workshop, but it's here in case you want to try it.
For simplicity we will try that with just the west coast data:
sdat_sst$x <- st_coordinates(sdat_sst)[,1]
sdat_sst$y <- st_coordinates(sdat_sst)[,2]
sdat_west <- filter(sdat_sst, (x < 120) & (y > -40))
Now create the clumps on the west coast data:
sdat_west$group <- cut(sdat_west$y, breaks = seq(-37, -16, by = 1.5))
Now we can use group as a random effect:
m4 <- gamm(richness ~ s(sst), data = sdat_west,
           random = list(group = ~1),
           family = mgcv::negbin(theta = 3.8))
plot(m4$gam)
summary(m4$lme)
Here's another model Bill mentioned; the code is below if you want to try it in your own time.
We should first transform the data so the distances are in metres:
sdat_west2 <- st_transform(sdat_west, crs = 3577)
sdat_west2$x <- st_coordinates(sdat_west2)[,1]
sdat_west2$y <- st_coordinates(sdat_west2)[,2]
Then we can fit the model. I’m using a Gaussian Process smooth on the x,y coordinates, because the GP is a model of correlations:
m5 <- gam(richness ~ s(sst) + s(x, y, bs = "gp"),
data = sdat_west2,
family = mgcv::negbin(theta = 3.8))
plot(m5)
Modelling the spatial auto-correlation directly could be considered best practice. Here's the code if you want to try it in your own time.
This is possible with the mgcv package, though I would prefer to use a Bayesian method (like INLA) for this more complicated type of modelling. Anyway, if you were motivated to try to fit a model with spatial AC, you might like to do it like this:
m6 <- gamm(richness ~ s(sst),
           # this next step includes the spatial AC,
           # with an exponential correlation structure
           correlation = corExp(form = ~x + y),
           data = sdat_west2,
           family = mgcv::negbin(theta = 3.8))
plot(m6$gam)
summary(m6$lme)
This models correlations between samples using an exponential function of distance. Interestingly, the fitted range parameter is about 5 km, meaning the correlation between points that are 5 km apart is about 0.37, and by about 15 km it has dropped to 0.05. See ?corExp for more details about why that is.
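You can check those numbers yourself: without a nugget, corExp models the correlation as exp(-distance/range), so with a range of about 5 km:

range_km <- 5
d_km <- c(5, 15, 30)
exp(-d_km / range_km)
# [1] 0.368 0.050 0.002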
To make a more interpretable map of the SST x longitude model, we will use rasters.
So we want to map our predictions for richness back onto an SST raster.
If we wanted to generate predictions at the original sample sites, we could just say:
sdat_sst$richness_pred <- predict(m_int, type = "response")
And then plot the predictions with tmap:
tm_shape(sdat_sst) +
tm_dots(col = "richness_pred")
This looks OK, but we might like to extend the edges of the points a bit to fill out the map. There are a few ways to do this; we will use rasters.
First let’s aggregate our SST raster so the cell size is bigger (resolution is grainier):
rsst2 <- aggregate(rsst, 2)
par(mfrow = c(1,2))
plot(rsst)
plot(rsst2)
Now we do some trickery to get predictions for the raster grid.
We need to set up a dataframe that has the SST values and x values (longitudes) from the raster and their cell numbers. Cells are numbered 1 to the total number of cells, starting at the top left cell.
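If the cell numbering is unfamiliar, here's a tiny toy raster to make it concrete; cells run left to right along each row, starting at the top-left:

r_toy <- rast(nrows = 2, ncols = 3)
values(r_toy) <- 1:ncell(r_toy)
as.matrix(r_toy, wide = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6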
I’ve given two options below for choosing cells. The first is more conservative and just chooses cells that have samples (we commented it out). The second uses all ocean cells (we’ll do this one now):
# icell <- cellFromXY(rsst2, st_coordinates(sdat_sst)) #alt option
icell <- 1:ncell(rsst2)
pred <- data.frame(sst = rsst2[icell][,1],
cells = icell,
x = xFromCell(rsst2, icell),
y = yFromCell(rsst2, icell))
pred <- na.omit(pred)
head(pred)
sst cells x y
1 27.88307 1 82.75 -10
2 27.87724 2 83.25 -10
3 27.87886 3 83.75 -10
4 27.88036 4 84.25 -10
5 27.87260 5 84.75 -10
6 27.87042 6 85.25 -10
We used na.omit to get rid of NA SST values (land, basically).
Now we can use predict to predict richness values from m_int, but with our new SST and x values, via the newdata argument.
pred$richness_pred <- predict(m_int, newdata = pred, type = "response")
head(pred)
sst cells x y richness_pred
1 27.88307 1 82.75 -10 25.40934
2 27.87724 2 83.25 -10 25.67132
3 27.87886 3 83.75 -10 25.95014
4 27.88036 4 84.25 -10 26.22997
5 27.87260 5 84.75 -10 26.49140
6 27.87042 6 85.25 -10 26.76509
We chose the response type so that the predictions are in units of species richness, not log species richness (because of the log link in the negative binomial).
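You can confirm that relationship directly; the two prediction types should differ only by exponentiation:

p_link <- predict(m_int, newdata = pred, type = "link")
# should return TRUE: response predictions are exp(link predictions)
all.equal(as.numeric(exp(p_link)), as.numeric(pred$richness_pred))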
Now we just assign the predictions back to an empty raster. The empty raster is just a copy of rsst2 with no data.
rpred <- rast(rsst2)
rpred[pred$cells] <- matrix(pred$richness_pred, ncol = 1)
We specify rpred[pred$cells] so values are only added in the cells we predicted to.
Finally use your tmap skills to make a map Prof Calanoid will love:
tm_shape(aus, bbox = st_bbox(rpred)) +
tm_polygons(col = "white") +
tm_shape(rpred) +
tm_raster(palette = "RdPu",
title= "Richness", alpha = 0.8, n=10) +
tm_layout(bg.color = "grey20",
legend.position = c("left", "top"),
legend.text.color = "white",
legend.title.color = "white")
It looks blocky; that's because we predicted richness to the underlying raster. Arguably the 'blockiness' is actually good in this case: these are model predictions, not real data, and the blockiness serves to emphasise that.
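If you did want a smoother look despite that, terra can interpolate the prediction raster to a finer grid, though keep in mind this is purely cosmetic, not extra information:

rpred_smooth <- disagg(rpred, fact = 4, method = "bilinear")
plot(rpred_smooth)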
Prof Calanoid would probably also like to see a map of SST, so have a go at that yourself.
Finally, Prof Calanoid would probably also like to see the model fit. One way to do this is to plot the predictions we made in the original data frame as points, coloured by longitude:
ggplot(sdat_sst) +
  aes(x = sst, y = richness_pred, color = x) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  ylab("Richness (predicted)") +
  xlab(expression('SST ('*~degree*C*')'))
There are some interesting deviations from the trend around 20-21 degrees that depend on the longitude. What is going on there? Maybe one of our local plankton experts can answer that question?
You are almost there with meeting all of Prof Calanoid’s requests. The analysis hasn’t been quite as quick, or gone quite as smoothly (pardon the pun) as she led us to believe, but there’s only one step to go - the interactive map to impress her funders (though we secretly suspect she just wants to post it on the web to taunt Prof Salp).
You want to deliver 100% of what she asked, so you can get that fellowship.
We will create the interactive map using the leaflet package. You will need to be connected to the internet for this to work properly. You can access leaflet directly via tmap: just change the tmap mode to "view":
tmap_mode("view")
Leaflet makes use of a JavaScript package for mapping (JavaScript is the language that dynamic web pages tend to use). It builds maps of your data on top of a range of freely available map layers. Check out this guide to leaflet in R for more advanced examples. But you can do the basics with the same tmap code we used before.
Because Leaflet uses JavaScript, the code runs in the user's browser. This means anyone looking at the map on the web has to download all the data before they can render the map. So you should keep your spatial datasets small if you want to use leaflet - imagine your colleagues trying to download your 100 MB spatial data layer on their mobile data plan.
Many sophisticated web mapping applications, like Google Maps, use server-side code. These can render much larger data-sets because they are only transferring the data that is needed for a particular view. Creating these kinds of applications requires specialised expertise that we won’t cover in this course.
So what we need to do now is check the size of our dataset:
print(object.size(sdat_sst), units = "Kb")
5177.4 Kb
Not too bad. If it were much bigger, though, you might want to simplify the data, for instance by aggregating points.
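Here's one hypothetical way to thin the points: snap them to a 0.5-degree grid and keep the mean richness per grid cell (this assumes lon/lat coordinates in EPSG 4326, and the x and y columns we created earlier):

library(dplyr)
library(sf)
sdat_small <- sdat_sst %>%
  st_drop_geometry() %>%
  # snap coordinates to the nearest 0.5 degrees
  mutate(xbin = round(x * 2) / 2,
         ybin = round(y * 2) / 2) %>%
  group_by(xbin, ybin) %>%
  summarise(richness = mean(richness), .groups = "drop") %>%
  st_as_sf(coords = c("xbin", "ybin"), crs = 4326)
print(object.size(sdat_small), units = "Kb")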
Here’s a first map:
tm_shape(sdat_sst) +
tm_dots()