9 Project set-up for AI agents
This chapter focuses on establishing a project structure and workflow that can get the most out of using AI agents.
9.1 Project organization
It’s helpful to set up your projects in an organized and modular way. In my experience, most R users write most of their analysis in one long script. Don’t do this. It will be hard for ‘future you’ to navigate, and if it’s hard for a human to navigate, it will also be hard for the assistant. Here’s how I set up my projects.
9.1.1 General guidance
- Create a new folder for each new project.
- Optional but recommended: Initialize a git repo in that folder (I use the ‘Source Control’ extension to VS Code).
- Set up folders and files in an organized way.
- Ideally put the data in this folder also. However, large datasets or sensitive data can be kept in other folders.
- Keep scripts short and modularized (e.g. one for data analysis, one for modelling).
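As a minimal sketch of what ‘short and modularized’ can look like, the two steps below stand in for two separate script files that only communicate through a saved intermediate file. The data frame, column names and file names are made up for illustration, and a temporary directory stands in for the project folder so the example runs anywhere.

```r
# Stand-in for a project folder (hypothetical; use your real project path).
out_dir <- file.path(tempdir(), "my-project", "Shared", "Outputs", "data-prep")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# --- Step 1: what might live in Scripts/01_data-prep.R ---
# Made-up raw data, purely for illustration.
raw <- data.frame(
  site = c("a", "b", "c", "d"),
  count = c(3, NA, 0, 7)
)
# Drop rows with missing counts, then save the cleaned data.
clean <- raw[!is.na(raw$count), ]
clean_path <- file.path(out_dir, "clean-data.rds")
saveRDS(clean, clean_path)

# --- Step 2: what might live in Scripts/02_data-analysis.R ---
# Start from the saved file, not from objects left over in memory.
dat <- readRDS(clean_path)
mean_count <- mean(dat$count)
print(mean_count)
```

Because the second step only depends on the saved file, you (or the assistant) can rerun or debug it without rerunning the data preparation.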
Once you have your folder you can make it an RStudio project (if using RStudio) or just use ‘open folder’ in VS Code. If you want to link multiple folders, use VS Code workspaces.
If you are not using git (version control), I recommend you learn it. LLM code-editing tools can overwrite your files and cause you to lose older versions, so it’s best to back them up with proper use of git.
9.1.2 Project directory structure example
Here’s an example of a project directory structure. You don’t have to use this structure. There are no hard rules about prompting LLMs, and ultimately the whole project acts as one very long prompt. The important thing is to be organized.
my-project/
├── README.md
├── .gitignore
├── Scripts/ # R code
│ ├── 01_data-prep.R
│ ├── 02_data-analysis.R
│ └── 03_plots.R
├── Shared/
│ ├── Outputs/
│ │ ├── Figures/
│ │ ├── data-prep/
│ │ └── model-objects/
│ ├── Data/
│ └── Manuscripts/
└── Private/
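If you would rather create this layout from R than by hand, here is one possible sketch using only base R. It builds the folders inside a temporary directory (an assumption, so nothing on your machine is touched) and creates empty placeholder files. Swap `root` for a real path when you are happy with it.

```r
# Build the example folder structure inside a temporary directory.
# `root` is a placeholder location; point it at your real project folder.
root <- file.path(tempdir(), "my-project")

dirs <- c(
  "Scripts",
  "Shared/Outputs/Figures",
  "Shared/Outputs/data-prep",
  "Shared/Outputs/model-objects",
  "Shared/Data",
  "Shared/Manuscripts",
  "Private"
)
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Create empty placeholder files for the README, .gitignore and scripts.
files <- c(
  "README.md",
  ".gitignore",
  "Scripts/01_data-prep.R",
  "Scripts/02_data-analysis.R",
  "Scripts/03_plots.R"
)
created <- file.create(file.path(root, files))

# List everything that now exists, relative to the project root.
print(list.files(root, recursive = TRUE, all.files = TRUE, include.dirs = TRUE))
```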
9.2 The README.md file
The README.md is the memory for the project. If you use GitHub it will also be the landing page for your repo, which is handy.
Remember you are writing this for you and the LLMs. So think of it like a prompt.
Here’s an example of some of the information you might want to include in your readme.
# PROJECT TITLE
## Summary
## Aims
## Data methodology
## Analysis methodology
## Tech context
- We will use the R program
- tidyverse packages for data manipulation
- ggplot2 for data visualization
Keep your scripts short and modular to facilitate debugging. Don't complete all of the steps below in one script. Finish scripts where it makes sense and save intermediate datasets.
## Steps
As you go, tick off the steps below.
[ ] Wrangle data
[ ] Fit regression
[ ] Plot verification
[ ] ...
## Data
Include meta-data here and file paths.
## Directory structure
my-project/
├── README.md
├── .gitignore
├── Scripts/ # R code
│ ├── 01_data-prep.R
│ ├── 02_data-analysis.R
│ └── 03_plots.R
├── Shared/
│ ├── Outputs/
│ │ ├── Figures/
│ │ ├── data-prep/
│ │ └── model-objects/
│ ├── Data/
│ └── Manuscripts/
└── Private/
9.3 Example data
For this chapter we’ll work with some ecological data on benthic marine habitats and fish.
9.3.1 Case-study: Bumphead parrotfish, ‘Topa’ in Solomon Islands
Bumphead parrotfish (Bolbometopon muricatum) are an enigmatic tropical fish species. Adults of this species are characterized by a large bump on their forehead that males use to display and fight during breeding. Sex determination for this species is unknown, but it is likely that an individual has the potential to develop into either a male or female at maturity.
Adults travel in schools and consume algae by biting off chunks of coral and in the process they literally poo out clean sand. Because of their large size, schooling habit and late age at maturity they are susceptible to overfishing, and many populations are in decline.
Their lifecycle is characterized by migration from lagoonal reef as juveniles (see image below) to reef flat and exposed reef habitats as adults. Early stage juveniles are carnivorous and feed on zooplankton, and then transform into herbivores at a young age.
Image: Lifecycle of bumphead parrotfish. Image by E. Stump and sourced from Hamilton et al. 2017.
Until the mid 2010s the habitat for settling postlarvae and juveniles was a mystery. However, the pattern of migrating from inshore to offshore over their known lifecycle suggests that the earliest benthic lifestages (‘recruits’) may occur on nearshore reef habitats.
Nearshore reef habitats are susceptible to degradation from poor water quality, raising concerns that this species may also be in decline because of pollution. But the gap in data from the earliest lifestages hinders further exploration of this issue.
In this course we’ll be analyzing the first survey that revealed the habitat preferences of early juvenile stages of bumphead parrotfish. These data were analyzed by Hamilton et al. 2017 and Brown and Hamilton 2018.
In the 2010s Rick Hamilton (The Nature Conservancy) led a series of surveys in the nearshore reef habitats of Kia province, Solomon Islands. The aim was to look for the recruitment habitat for juvenile bumphead parrotfish. These surveys were motivated by concern from local communities in Kia that topa (the local name for bumpheads) are in decline.
In the surveys, divers swam standardized transects and searched for juvenile bumpheads in nearshore habitats, often along the edge of mangroves. Altogether they surveyed 49 sites across Kia.
These surveys were made all the more challenging by the occurrence of crocodiles in mangrove habitat in the region. So these data are incredibly valuable.
Logging in the Kia region has caused water quality issues that may impact nearshore coral habitats. During logging, logs are transported from the land onto barges at ‘log ponds’. A log pond is an area of mangroves that is bulldozed to enable transfer of logs to barges. As you can imagine, log ponds are very muddy. This damage creates significant sediment runoff which can smother and kill coral habitats.
Rick and the team surveyed reefs near log ponds and in areas that had no logging. They only ever found bumphead recruits hiding in branching coral species.
In this course we will first ask if the occurrence of bumphead recruits is related to the cover of branching coral species. We will then develop a statistical model to analyse the relationship between pollution from log ponds and bumphead recruits, and use this model to predict pollution impacts to bumpheads across the Kia region.
The data and code for the original analyses are available at my GitHub site. In this course we will use simplified versions of the original data. We’re grateful to Rick Hamilton for providing the data for this course.
9.4 Example spec sheet
# Analysis of fish dependence on coral habitat
## Introduction
This project will ask how the abundance of juvenile fish depends on coral cover. The fish we are interested in is *Bolbometopon muricatum*, the bumphead parrotfish, also known as 'topa' in the local language of our study region. We are studying its juvenile habitat. We will analyze survey data from 49 sites that include benthic cover surveys and surveys of fish abundance at the same locations.
## Aims of the analysis
1. Does fish abundance depend on branching coral cover?
2. What is the direction and strength of the relationship between fish abundance and branching coral cover?
3. Does fish abundance depend on soft coral cover?
4. What is the direction and strength of the relationship between fish abundance and soft coral cover?
## Data methodology
The data was collected with the point intersect transect method. Divers swam along transects. There were several transects per site. Along each transect they dropped points and recorded the type of benthic organism (in categories) on that point. Percentage cover for one organism type can then be calculated as the number of points with that organism divided by the total number of points on that transect. In our data we have percent cover of branching corals and percent cover of soft corals.
Transects were averaged to give a single value for each site.
At each site divers also counted the number of juvenile 'topa' along dive transects of the same length.
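As a small worked example of the percent cover calculation described above, assuming the column names given in the meta data section (`CB_cover`, `soft_cover`, `n_pts`) and made-up point counts for three sites:

```r
# Made-up point counts for three sites (illustration only, not survey data).
sites <- data.frame(
  site = c(1, 2, 3),
  CB_cover = c(30, 0, 75),    # points that landed on branching coral
  soft_cover = c(15, 60, 0),  # points that landed on soft coral
  n_pts = c(150, 150, 150)    # total points surveyed at the site
)

# Percent cover = points on that organism / total points * 100
sites$CB_percent <- 100 * sites$CB_cover / sites$n_pts
sites$soft_percent <- 100 * sites$soft_cover / sites$n_pts
print(sites)
```

For these made-up counts the branching coral cover works out to 20%, 0% and 50%, and the soft coral cover to 10%, 40% and 0%.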
## Analysis methodology
We will use generalized linear models to analyze the relationship between topa and the two coral cover types.
Topa abundance is probably over-dispersed, so we will need to use a negative binomial family. We will use R and the MASS package:
MASS::glm.nb(pres.topa ~ CB_cover*soft_cover, data = fish_coral_cover_sites)
To obtain a final model we should do model selection, starting with a full model then working towards simpler models. We will use likelihood ratio tests to compare models. For example the first test would be:
m1 <- MASS::glm.nb(pres.topa ~ CB_cover*soft_cover, data = fish_coral_cover_sites)
m2 <- MASS::glm.nb(pres.topa ~ CB_cover + soft_cover, data = fish_coral_cover_sites)
anova(m1, m2, test = "Chisq")
Then proceed with m2 if the interaction term is not significant. Otherwise proceed with m1.
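The comparison above can be tried end-to-end on simulated data. This is only a sketch under stated assumptions: the data are simulated (they are not the survey data), the coefficients used to simulate them are arbitrary, and the 0.05 threshold is an assumption because this spec does not fix one. The likelihood ratio test is also computed by hand to show what `anova()` is doing.

```r
set.seed(42)

# Simulated stand-in for the site-level data (NOT the real survey data).
n_sites <- 49
sim <- data.frame(
  CB_cover = round(runif(n_sites, 0, 60)),
  soft_cover = round(runif(n_sites, 0, 60))
)
# Arbitrary coefficients, chosen only to generate over-dispersed counts.
mu <- exp(0.2 + 0.03 * sim$CB_cover - 0.01 * sim$soft_cover)
sim$pres.topa <- rnbinom(n_sites, mu = mu, size = 2)

# Full model (with interaction) and simpler additive model.
m1 <- MASS::glm.nb(pres.topa ~ CB_cover * soft_cover, data = sim)
m2 <- MASS::glm.nb(pres.topa ~ CB_cover + soft_cover, data = sim)

# Likelihood ratio test as reported by anova().
print(anova(m2, m1, test = "Chisq"))

# The same test by hand: twice the difference in log-likelihoods,
# compared to a chi-squared distribution.
lr_stat <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m2)))
df_diff <- attr(logLik(m1), "df") - attr(logLik(m2), "df")
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# Assumed significance threshold (not specified in this spec sheet).
alpha <- 0.05
final_model <- if (p_value < alpha) m1 else m2
print(formula(final_model))
```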
On completion of the model selection, we will do model diagnostics. This will include checking residuals and the dispersion parameter. Save the diagnostic plots as PNG files.
Write a diagnostics report in an Rmarkdown file.
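One possible shape for the diagnostics step is sketched below. It assumes a Pearson-based dispersion statistic and a residuals-versus-fitted plot are acceptable checks, which is one common choice and not the only one. The data are simulated and the output file name is hypothetical.

```r
set.seed(1)

# Simulated stand-in data (NOT the real survey data).
sim <- data.frame(CB_cover = round(runif(49, 0, 60)))
sim$pres.topa <- rnbinom(49, mu = exp(0.2 + 0.03 * sim$CB_cover), size = 2)
m <- MASS::glm.nb(pres.topa ~ CB_cover, data = sim)

# Pearson dispersion statistic: sum of squared Pearson residuals divided by
# the residual degrees of freedom. Values far above 1 suggest the model is
# not capturing the variance in the counts.
pearson_res <- residuals(m, type = "pearson")
dispersion <- sum(pearson_res^2) / df.residual(m)
print(dispersion)

# Save a residuals-vs-fitted plot as a PNG (hypothetical file name,
# written to a temporary directory so the example runs anywhere).
plot_path <- file.path(tempdir(), "residuals-vs-fitted.png")
png(plot_path, width = 600, height = 450)
plot(fitted(m), pearson_res,
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
invisible(dev.off())
```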
## Instructions for the agent
The agent will produce a report that answers the above questions. The report will include a description of the data, the methods used for analysis, and the results of the analysis. The code will be written as R scripts.
Each script should be modular and save intermediate results as datafiles and figures. The final report must be written in Rmarkdown format. The figures will be imported using markdown syntax, e.g. `![](path/to/figure.png)`. Don't use R code for figures in the markdown report.
Summary tables should be imported from .csv files and created using the `knitr::kable()` function in Rmarkdown.
The report must include the following sections:
- Study aims
- Data methodology
- Analysis methodology
- Results
- Model selection and verification
- Model fit statistics
- Plots of predicted fish abundance (log-link scale) based on the final model, with confidence intervals
- Relevant statistics (r2, p-values, etc.)
The agent must produce diagnostic plots and a separate report on the model diagnostics.
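For the 'predicted fish abundance (log-link scale) with confidence intervals' item in the list above, here is a minimal sketch of the underlying calculation. Note the swap: it uses base R's `predict()` rather than the `visreg` package named under Tech context, so it runs without extra packages. The data are simulated and the 95% interval uses a normal approximation on the log scale.

```r
set.seed(7)

# Simulated stand-in data (NOT the real survey data).
sim <- data.frame(CB_cover = round(runif(49, 0, 60)))
sim$pres.topa <- rnbinom(49, mu = exp(0.2 + 0.03 * sim$CB_cover), size = 2)
m <- MASS::glm.nb(pres.topa ~ CB_cover, data = sim)

# Predict on the link (log) scale over a grid of coral cover values.
newdat <- data.frame(CB_cover = seq(0, 60, by = 10))
pred <- predict(m, newdata = newdat, type = "link", se.fit = TRUE)

# Approximate 95% interval: estimate +/- 1.96 standard errors (log scale).
newdat$fit_log <- as.numeric(pred$fit)
newdat$lower_log <- newdat$fit_log - 1.96 * as.numeric(pred$se.fit)
newdat$upper_log <- newdat$fit_log + 1.96 * as.numeric(pred$se.fit)

# exp() back-transforms the log-scale prediction to an expected count.
newdat$fit_count <- exp(newdat$fit_log)
print(newdat)
```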
### Tech context
- We will use the R program
- tidyverse packages for data manipulation
- ggplot2 for data visualization
- use `theme_set(theme_classic())` for plots
- Use the `MASS` package for the negative binomial model, however don't load it globally with `library(MASS)`, instead use `MASS::glm.nb()` to avoid namespace conflicts.
- Use `visreg` package for plotting model effects and confidence intervals, e.g. `visreg::visreg(m2, "CB_cover", "soft_cover", gg=TRUE, scale = 'linear')`
Keep your scripts short and modular to facilitate debugging. Don't complete all of the steps below in one script. Finish scripts where it makes sense and save intermediate datasets.
When using Rscript to run R scripts in terminal put quotes around the file, e.g. `Rscript "1_model.R"`
### Workflow
1. Create a todo list and keep track of progress
2. Data processing including standardizing coral variables by number of points
3. Model selection and verification, produce diagnostic plots
4. Model diagnostic plots markdown report
5. Create plots of predictions from the final model
6. Write report in markdown format
### Directory structure
```
glm-test-case/
├── data/ # Processed data files and intermediate datasets
├── fish-coral.csv # Raw data file with fish and coral cover measurements
├── glm-readme.md # This readme file with project documentation
├── initial-prompt.md # Initial project prompt and requirements
├── outputs/ # Generated output files
│ └── plots/ # Diagnostic plots and visualization outputs (PNG files)
└── scripts/ # R scripts for data analysis and modeling
```
Put the .rmd reports in the top-level directory.
## Meta data
### fish-coral.csv
Location: `data/fish-coral.csv`
Variables
- site: Unique site IDs, use to join to benthic_cover.csv
- reef.ID: Unique reef ID
- pres.topa: number of Topa counted (local name for Bolbometopon)
- pres.habili: number of Habili counted (local name for Cheilinus)
- secchi: Horizontal secchi depth (m), higher values mean the water is less turbid
- flow: Factor indicating if tidal flow was "Strong" or "Mild" at the site
- logged: Factor indicating if the site was in a region with logging "Logged" or without logging "Not logged"
- coordx: X coordinate in UTM zone 57S
- coordy: Y coordinate in UTM zone 57S
- CB_cover: Number of PIT points for branching coral cover
- soft_cover: Number of PIT points for soft coral cover
- n_pts: Number of PIT points at this site (for normalizing cover to get per cent cover)
- dist_to_logging_km: Linear distance to nearest log pond (site where logging occurs) in kilometres.