Chapter 1 Quality R analysis with large language models

1.1 Summary

If you are using R, you are probably using language model assistants (e.g. ChatGPT) to help you write code, but are you using them in the most effective way? Language models have different biases from humans and so make different types of errors. This 1-day workshop will cover how to use language models to learn R and conduct reliable environmental analyses. We will cover:

  • Pros and cons of different assistants, from chat interfaces like ChatGPT to advanced coding assistants that can run and test code by themselves, iterating until the analysis is complete (and even written up).
  • Best practice prompting techniques that can dramatically improve LLM performance for complex statistical applications
  • Applying language models to common environmental applications such as GLMs, multivariate statistics and Bayesian statistics
  • Copyright and ethical issues

We’ll finish up with a discussion of what large language models mean for analysis and the scientific process.

Requirements for interactive workshop: Laptop with R, RStudio and VS Code installed. Please see software instructions below.

1.1.0.1 Who should take this workshop?

The workshop is for anyone who currently uses R, from intermittent users to experienced professionals. The workshop is not suitable for those who need an introduction to R; I’ll assume students know at least what R does and can do tasks like reading in data and creating plots.

1.2 About Chris

I’m an Associate Professor of Fisheries Science at the University of Tasmania and an Australian Research Council Future Fellow. I specialise in data analysis and modelling, skills I use to better inform environmental decision makers. R takes me many places and I’ve worked with marine ecosystems from tuna fisheries to mangrove forests. I’m an experienced teacher of R: I have taught R to hundreds of people over the years, from the basics to sophisticated modelling, and to everyone from undergraduates to my own supervisors.

1.3 Citation for course

If you follow the prompting advice, please consider citing my accompanying article, currently in pre-print form:

Citation: Brown & Spillias (2025). Prompting large language models for quality ecological statistics. Pre-print. https://doi.org/10.32942/X2CS80

1.4 Course outline

Introduction to LLMs for R

9-10am In this presentation I’ll cover how LLMs work, best-practice prompt engineering, software, applications for R users and ethics.

Part 1 LLM fundamentals

10-10:30am

Tea break: 10:30-10:45

Continue Part 1 LLM fundamentals

10:45-11:30am

Part 2 GitHub Copilot for R

11:30-12:00pm

Lunch break: 12:00-1:00pm

Continue Part 2 GitHub Copilot for R

1:00-2:15pm

Ethics and copyright

2:15-2:45pm

Tea break 2:45-3:00pm

Part 3 Advanced coding assistants

3:00-3:30pm

Conclusion and discussion of what it means for science

3:30-4:00pm

1.5 Software you’ll need for this workshop

Set aside some time for set-up; there is a bit to it. You may also need IT’s help if your computer is locked down.

R, of course (4.2.2 or later)

VS Code. It takes a bit of time and some IT knowledge to connect VS Code to R, so don’t leave that to the last minute. See my installation instructions.

Once you have VS Code, make sure you get these extensions (which can be found by clicking the four squares in the left-hand menu bar):

GitHub Copilot, GitHub Copilot Chat

If you can’t get VS Code to work with R you are still welcome to join. Most of the morning session can be done in RStudio. In the afternoon you’ll need VS Code if you want to try what I am teaching.

1.5.1 Optional software

RStudio. You can do some of this workshop in RStudio, but you’ll get more out of it if you use VS Code. If you are using RStudio, make sure you get a GitHub Copilot account and connect it to RStudio.

1.5.1.1 Optional VS Code extensions

Web Search for Copilot (once installed, follow the instructions to set up your API key for Tavily; I use Tavily because it has a free tier).

Roo Code extension for VS Code.

Markdown Preview Enhanced. Lets you view markdown files in a pane with Cmd(Ctrl)-Shift-V.

Radian terminal. I also recommend installing the radian terminal. It makes your R terminal much cleaner, has autocomplete, and seems to help with some common issues.

1.6 Software licenses

GitHub Copilot. Go to their page to sign up. The free tier is fine. You can also get free Pro access as a student or professor (requires application).

API key and account with LLM provider

You will need API (application programming interface) access to an LLM to do all the examples in this workshop. This will allow us to interact with LLMs directly via R code. API access is on a pay-per-token basis. You will need to create an account with one of the providers below and then buy some credits (USD $10 should be sufficient).

Make sure you save your API key somewhere safe! You can usually only view it once, on creation. You’ll need it for the workshop.

Here are some popular choices:

  • OpenRouter (recommended as it gives you flexible access to lots of models, but some capabilities won’t work; e.g. Claude models are on OpenRouter, but their vision capabilities are only available directly from Anthropic)
  • Anthropic: use this instead of OpenRouter if you want vision, web browser and computer use abilities.
  • OpenAI

Once you have your API key, keep it secret. It’s a password. Be careful not to push it to a GitHub repo accidentally.
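One safe way to handle your key in R is to keep it out of your scripts entirely and store it in your user-level `.Renviron` file instead, where packages like ellmer can pick it up as an environment variable. A minimal sketch, assuming an OpenAI-style key (swap the variable name, e.g. `ANTHROPIC_API_KEY` or `OPENROUTER_API_KEY`, for other providers; `usethis` may need installing first):

```r
# Open your user-level .Renviron in an editor (restart R after saving):
# usethis::edit_r_environ()
#
# Then add a line like this to .Renviron (no quotes, no spaces):
# OPENAI_API_KEY=sk-your-key-here

# Back in R, check the key is visible WITHOUT printing it to the console:
nzchar(Sys.getenv("OPENAI_API_KEY"))  # TRUE if the key is set
```

Because the key lives outside your project folder, it can’t be committed to a GitHub repo by accident.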

1.7 R packages you’ll need for this workshop

install.packages(c("vegan", "ellmer", "tidyverse"))

INLA for Bayesian computation. Use the link; it’s not on CRAN.
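Once the packages are installed and your API key is set, you can check that everything is wired up with a one-line chat via ellmer. A minimal sketch, assuming an OpenAI key in your environment (the model name here is illustrative; ellmer also provides constructors for other providers such as Anthropic):

```r
library(ellmer)

# Create a chat object; chat_openai() reads OPENAI_API_KEY from the environment.
# The model argument is illustrative - use any model your provider offers.
chat <- chat_openai(model = "gpt-4o-mini")

# Send a prompt; each call spends a small number of paid tokens.
chat$chat("In one sentence, what is the R language?")
```

If this returns a sentence rather than an authentication error, you’re ready for the workshop exercises.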

1.8 Data

We’ll load all data files directly via URL in the workshop notes. So no need to download any data now. Details on data attribution are below.
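Loading directly from a URL works with the usual read functions. A minimal sketch with a hypothetical URL (the real links are in the workshop notes):

```r
library(readr)

# Hypothetical URL for illustration only - use the links given in the notes.
benthic <- read_csv("https://example.com/benthic-cover.csv")
head(benthic)
```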

1.8.1 Benthic cover surveys and fish habitat

In this course we’ll be analyzing benthic cover and fish survey data. These data were collected by divers doing standardized surveys on the reefs of Kia, Solomon Islands. They were first published by Hamilton et al. 2017, who showed that logging of forests is causing sedimentation and impacting the habitat of an important fishery species.

In a follow-up study, Brown and Hamilton 2018 developed a Bayesian model that estimates the size of the footprint of pollution from logging on reefs.

1.8.2 Rock lobster time-series

We’ll analyse time-series data from the Australian Temperate Reef Collaboration and the Reef Life Survey. Original data are available here, and you should access the originals for any research. These data are collected annually by divers who swim standardized transects and count reef fauna.

We’ll use a simplified version I created to go with this course. The link is in the notes when we need it.

Citation to the dataset.

The analysis in the notes is inspired by this study on forecasting change in reef species.