AI assistants for Scientific Coding
0.1 Summary
If you are doing data analysis you are probably using language models (e.g. ChatGPT) to help you write code, but are you using them in the most effective way? Language models have different biases to humans and so make different types of errors. This book will cover how to use language models to learn scientific computing and conduct reliable environmental analyses. The book is reference material for a 1-day workshop I teach.
I will cover:
- How to use different software tools, from simple interfaces like ChatGPT to advanced tools that can run and test code by themselves and keep going until the analysis is complete (and even written up)
- Vibe coding and how future analysis workflows will change dramatically from today
- Best-practice prompting techniques that can dramatically improve model performance for complex data analysis
- Applying language models to common environmental applications such as GLMs and multivariate statistics
- Issues including environmental impacts, copyright and ethics
I’ll also make space for an interactive discussion of both people’s concerns about AI and the opportunities it offers.
The content I’ll teach is suitable for anyone who writes code (e.g. in R or Python) to do data analysis.
Examples will be in marine conservation science using the R language, but the methods are general to any field. The AI software also works with any programming language, and we won’t be doing much actual coding (the AI does that!), so participants can follow along in other languages if they prefer. To follow the practical applications you will need some experience in scientific computing (e.g. R or Python).
0.2 Citation for book
Please consider citing my accompanying article if you are using the advice in this book, currently in pre-print form:
Citation: Brown & Spillias (2025). Prompting large language models for quality ecological statistics. Pre-print. https://doi.org/10.32942/X2CS80
Generative AI software for coding is changing fast, so this book will keep changing as I update it to reflect new developments. It does not make sense to publish it as a fixed book. The general guidelines for prompting, which have a longer shelf life, are captured in the article. So please cite that to support my work.
0.2.0.1 Who should read this book?
The book is for anyone who currently uses R, from intermittent users to experienced professionals. The workshop is not suitable for those who need an introduction to R, and I’ll assume students know at least what R does and are able to do tasks like read in data and create plots.
Important: This book isn’t for people who need an introduction to R or Python. I’ll assume students know at least how to do tasks like read in data and create plots in Python or R. To use these AI tools effectively you absolutely have to understand how scientific computing works first. If you need an introduction to R then I recommend you learn it without AI first.
0.3 About Chris
I’m an Associate Professor of Fisheries Science at the University of Tasmania and an Australian Research Council Future Fellow. I specialise in data analysis and modelling, skills I use to better inform environmental decision makers. R takes me many places and I’ve worked with marine ecosystems from tuna fisheries to mangrove forests. I’m an experienced teacher of R. I have taught R to hundreds of people over the years, from the basics to sophisticated modelling and for everyone from undergraduates to my own supervisors.
0.4 Software you’ll need for this workshop
Save some time for setting up the software, as there is a bit to it. You may also need IT help if your computer is locked down. See Chapter 3 for more detailed instructions.
0.5 Book overview
Note that the book is still under construction. Most sections are complete, but a few sections just have bullet points that I refer to when teaching the workshop. It’s also a fast-moving space, so the content needs constant updating.
- Introduction — Introduction to LLMs for coding: Strengths, weaknesses and common failure modes of language models when used for programming and data analysis.
- Set-up and tools: Step-by-step software installation, recommended environments (R, Python, editors), and brief troubleshooting tips.
- Calling LLMs via an API: Accessing LLMs programmatically from R or Python, prompt structure and examples.
- GitHub Copilot for R and coding assistants: How to use Copilot and other assistant features for planning, coding and editing.
- General prompting advice: Best practices for effective prompts, problem decomposition and chain-of-thought techniques.
- AI-powered analysis workflows: Organising workflows, stages of analysis and worked examples using an example dataset.
- Advanced LLM agents: Agents, automation, safety considerations and example agent workflows.
- Project set-up for AI agents (Specification sheets): Templates and guidance for organising projects to work well with agents and assistants.
- Research applications of LLMs: Programmatic literature review, PDF processing, web searching and other research-focused examples.
- Writing documents with Quarto and AI assistants: Workflows for writing papers, reproducible figures and integrating AI into Quarto.
- Cost and security: Practical considerations on cost, API security and agent safety.
- Ethics and copyright: Environmental impacts, IP concerns and ethical responsibilities when using LLMs.
- Appendix — Code and data: Code snippets, data access links and supporting scripts used in the book.
0.6 Data
We’ll load all data files directly via URL in the workshop notes, so there is no need to download any data now. Details on data attribution are below.
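As a rough illustration of what loading data via URL looks like in R, here is a minimal sketch. The helper name `read_workshop_csv`, the placeholder URL in the final comment, and the toy data frame are all assumptions made for illustration; the real links are given in the workshop notes. It relies only on the fact that base R's `read.csv()` accepts either a local file path or a URL.

```r
# Hypothetical helper: read a CSV from a local path or a URL.
# read.csv() handles both, so the same call works in either case.
read_workshop_csv <- function(location) {
  read.csv(location, stringsAsFactors = FALSE)
}

# Self-contained demonstration with toy data (not the workshop data):
# write a small CSV to a temporary file and read it back.
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(site = c("A", "B"), cover = c(10, 25))
write.csv(toy, tmp, row.names = FALSE)

dat <- read_workshop_csv(tmp)
print(dat)

# In the workshop the argument would be a URL instead, for example
# (placeholder address, replace with the link from the workshop notes):
# dat <- read_workshop_csv("https://example.org/path/to/data.csv")
```

Because the data are read straight from the web each time, scripts stay reproducible across participants' machines without anyone needing to manage local copies.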
0.6.1 Benthic cover surveys and fish habitat
In this course we’ll be analyzing benthic cover and fish survey data. These data were collected by divers doing standardized surveys on the reefs of Kia, Solomon Islands. These data were first published by Hamilton et al. (2017), who showed that logging of forests is causing sedimentation and impacting the habitats of an important fishery species.
In a follow-up study, Brown and Hamilton (2018) developed a Bayesian model that estimates the size of the footprint of pollution from logging on reefs.