library(tidyverse)
library(readr)
# Read the site-level benthic cover and fish count data
dat <- read_csv(url("https://raw.githubusercontent.com/cbrown5/example-ecological-data/refs/heads/main/data/benthic-reefs-and-fish/fish-coral-cover-sites.csv"))
head(dat)
summary(dat)
5 GitHub Copilot for R
I’ll show you how you can most effectively use GitHub Copilot to plan, code and write up your data analysis and modelling.
Software requirements: VScode with R or Python, plus a GitHub Copilot license and the Copilot extension for VScode (which comes built in to VScode).
Many of the tools described below are also implemented in a similar way in the Positron Assistant, so you can follow along with that software if you prefer.
Github Copilot calls itself an ‘AI programming assistant’ or an ‘AI pair programmer’. I’ll refer to it as an ‘LLM coding assistant’ or just ‘Assistant’.
Assistants add a layer of software between you and the LLM. The software is doing some hidden interpretation of what you want to do, as well as trying to save costs. For instance, for most assistants we often don’t get to control (or even see) the system message, the temperature or the number of output tokens. The assistant is also guessing context to include in the prompt, so it can automatically give the LLM more context. At the same time it is managing the LLM’s context window and trying to save on costs.
There is no generic name for this type of software (the field is moving too fast to have standardized names), so I’ll refer to them as Assistants. In this bucket I’ll also put ChatGPT, Claude, Roo Code, Cline and others. Note that GitHub Copilot (which I’ll call copilot for short) is different to the ‘Copilot’ assistant that is on the web and in the Teams app.
This software is also called ‘chatbots’. However, I prefer ‘assistants’, as the tasks they can do are much broader than just chatting.
Tip: You’ll get the most out of GitHub Copilot if you use Visual Studio Code as your development environment (rather than RStudio). Setting up VScode with R can be a bit fiddly, so check out my installation instructions (Chapter 3) if you have trouble. Web searching for advice is also a good idea if you are stuck. It’s worth the effort.
Copilot is developing rapidly, so it is quite likely that when you read this there will be changes and new features.
In this section I’ll focus on showing the main ways you can use copilot. Just be aware the implementation may change in future.
We’ll look at:
- Overview of VScode for those who are new to this software
- Best practices for setting up your project directory
- Inline code editing
- Ask mode
- Edit mode
- Agent mode
5.1 What to do if you are not using R with VScode
If you are using Python in VScode, then just use the same dataset I have below, but write in a .py script. Copilot can help you if you get stuck with anything.
If you are using R in RStudio, you can enable copilot inline completions from within RStudio (if you have a GitHub Copilot account). See Chapter 3, Option 2 for how to do that.
Once we get to the Inline Code Generation section you can use gander as described in Chapter 3.
5.2 Inline code editing
This chapter explores techniques for using GitHub Copilot’s inline code editing capabilities to enhance your R programming workflow.
5.2.1 1. Code completion
This is the only option supported in Rstudio (last time I checked).
Assuming you have GitHub Copilot set up, you just need to start a new R script (remember to keep it organized and give it a useful name) and start typing. You’ll see suggested code completions appear in grey. Hit tab to complete them.
Let’s read in the benthic site data and fish counts:
More details on this data are available here.
Now try creating a ggplot of secchi (a measure of water clarity, higher values mean clearer water) and pres.topa (count of topa, the bumphead parrotfish). Start typing gg and see what happens.
You should get a recommendation for a ggplot. But it won’t know the variable names.
Tip: Sometimes copilot gets stuck in a loop and keeps recommending the same line. To break it out of the loop try typing something new.
5.2.2 2. Using comments
The code completion is using your script and all open scripts in VScode to predict your next line of code. It won’t know the variable names unless you’ve provided them. One way is to include them in the readme.md file and have that open; another is to use comments in the active script (which tends to work more reliably), e.g.
# Make a point plot of secchi against pres.topa
gg...
This should get you the right ggplot. Using variable names in your prompts is more precise and will help the LLM guess the right names.
You could also try putting key variable names in comments at the top of your script.
Another way to use autocomplete is not to write R at all, just to write comments and fix the R code. Try templating a series of plots like:
# Make a point plot of secchi against pres.topa with a stat_smooth
# Plot logged (two categories) and pres.topa as a boxplot
# Plot CB_cover (branching coral cover) against secchi
Tip: Remember one way to improve prompts is to be more specific. So using the correct variable names helps.
Now go back through and click under each line to get the suggestions.
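Here is a sketch of the kind of script you might end up with after accepting the suggestions for those three comments. The toy data frame and its values are made up for illustration (including the ‘Logged’ and ‘Not logged’ labels), and copilot’s actual suggestions will differ, including which variable it puts on which axis. Here I’ve assumed the response goes on the y-axis.

```r
library(ggplot2)

# Hypothetical toy data standing in for `dat`; all values are made up
toy <- data.frame(
  secchi    = c(3, 5, 8, 10, 12, 15),
  pres.topa = c(0, 1, 0, 2, 3, 5),
  logged    = c("Logged", "Logged", "Logged",
                "Not logged", "Not logged", "Not logged"),
  CB_cover  = c(5, 10, 12, 20, 25, 30)
)

# Make a point plot of secchi against pres.topa with a stat_smooth
p1 <- ggplot(toy, aes(x = secchi, y = pres.topa)) +
  geom_point() +
  stat_smooth(method = "lm")

# Plot logged (two categories) and pres.topa as a boxplot
p2 <- ggplot(toy, aes(x = logged, y = pres.topa)) +
  geom_boxplot()

# Plot CB_cover (branching coral cover) against secchi
p3 <- ggplot(toy, aes(x = secchi, y = CB_cover)) +
  geom_point()
```

Print p1, p2 or p3 to draw each plot. With the real data you would swap toy for dat.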
This strategy is great in data wrangling workflows. As a simple example try make this grouped summary using comments only:
dat %>%
  group_by(logged) %>%
  summarize(mean_topa = mean(pres.topa),
            mean_CB = mean(CB_cover))
To make this I might write this series of comments:
#group dat by logged
#summarize pres.topa and CB_cover
If the variable names are documented above you can often be lazier and less precise with variable names here.
5.2.3 3. Code completion settings
Click the octocat in the bottom right corner of VScode to fine-tune the settings. You can enable/disable code completions (sometimes they are annoying e.g. when writing a workshop!).
You can also enable ‘next edit suggestions’. These are useful if editing an existing file, e.g. if you misspelt ‘secchi’ as ‘sechi’ and then corrected it in one place, it will suggest the same correction through the rest of the script. Hit tab to move through these.
The box will also tell you if indexing is available. Indexing allows AI to search your code faster.
5.2.4 4. Inline code generation
In VScode you can also access an inline chat box with cmd/ctrl-i. This box can chat as well as edit code.
You can click anywhere and activate this. I find it most useful though to select a section of code and then hit cmd/ctrl-i.
This is most useful to:
- Add new code
- Explain code
- Fix bugs
- Add tests
Try select some of your code (e.g. a ggplot) and ask it to explain what the code does.
Now try select one of your plots and ask for some style changes (e.g. theme, colours, axes label sizes etc…).
Now add a bug into one of your plots. See if the inline chatbox can fix the bug.
5.2.4.1 Prompt shortcuts
Use the / to bring up a list of prompt shortcuts. The most useful in R are /explain, /fix and /tests. Try selecting some code then use these to see what happens.
5.3 Ask mode
Ask mode is a chatbot that has access to context from your active project. It is great to help you plan analysis and implementation, using context from your project. It is also handy to help explain code.
In VScode click the ‘octocat’ symbol that should be at the top towards the right. This will open the chat window.
The chat panel will appear down the bottom of this new sidebar. Confirm that the chatbot is currently set to ‘Ask’ mode.
Your current file will automatically be included as context for the prompt. You can drag and drop any other files here as well.
Start by asking the chatbot for guidance on a statistical analysis. We are interested in how the abundance of Topa relates to coral cover. For instance you could ask:
How can I test the relationship between pres.topa and CB_cover?
Evaluate the quality of its response and we will discuss.
5.3.1 Improving our initial prompt by attaching data
Recall our initial prompt was:
How can I statistically test the relationship between pres.topa and CB_cover?
Try some of the strategies above (make a new prompt by clicking the + button) and compare the quality of advice.
For instance, you can save a data summary like this:
dir.create("resources", showWarnings = FALSE) # make sure the folder exists first
write_csv(head(dat), "resources/head-site-level-data.csv")
Then drag and drop it into the ask window and add something like:
How can I statistically test the relationship between pres.topa and CB_cover? Here are the first 6 rows of data
5.3.2 Improving our initial prompt by attaching domain knowledge
You can further improve the response by attaching a trusted resource. e.g. save this webpage on count models for ecology to your computer. Then you can attach the html file.
However, html files are full of code as well as the text we want. It can be more efficient for the LLM if you convert the html to markdown. Here’s a handy browser operated tool that will convert an arbitrary webpage to markdown:
https://tools.simonwillison.net/jina-reader
Just paste in the url above (https://environmentalcomputing.net/statistics/glms/glm-2/) to get a markdown version. Copy that to a local file in your project and attach it to ask mode.
Be sure to check the markdown to make sure it’s safe and doesn’t contain malicious instructions.
Another option is to use a web search tool. You can install one with GitHub Copilot. However, I’m not recommending this at this point due to cyber-security concerns. We’ll discuss this further later on.
That worked well for me. I then followed up with:
Great. Evaluate the robustness of each suggestion on a 1-10 scale
And it gave me a nice summary suggesting to try overdispersion models first (which is a good suggestion).
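To make the overdispersion suggestion a bit more concrete, here is a minimal sketch of what checking for overdispersion can look like. It uses simulated counts (not the real data), and the negative binomial model from the MASS package is just one of several options for overdispersed counts, so treat it as an illustration rather than the analysis the assistant recommended.

```r
set.seed(42)

# Simulated stand-in for the site data; all values are made up
n <- 50
sim <- data.frame(CB_cover = runif(n, 0, 40))
# Negative binomial counts, so the variance is larger than the mean
sim$pres.topa <- rnbinom(n, mu = exp(-0.5 + 0.05 * sim$CB_cover), size = 0.8)

# Poisson GLM as the starting point
m_pois <- glm(pres.topa ~ CB_cover, family = poisson, data = sim)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest the Poisson variance assumption is too strict.
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
dispersion

# One alternative that allows for overdispersion: a negative binomial GLM
m_nb <- MASS::glm.nb(pres.topa ~ CB_cover, data = sim)
summary(m_nb)$coefficients
```

I’ve called MASS::glm.nb directly rather than loading MASS with library(), because loading MASS masks select() from dplyr.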
The absolute best practice would be to give the assistant all the context for your study and observational design. Let’s see how doing that can work in our favour when planning implementation.
5.3.3 Planning implementation
Another way to use Ask mode is for help in implementing an analysis. Many of our workflows are complex and involve multiple data wrangling steps.
To get the best out of copilot I recommend creating a detailed README.md file with project context. Let’s try that and use it to plan our project.
Save the README.md that is here to a local file. (Remember that we are going to be using this as a prompt, so read it first).
Now you can attach it (or open it then click new chat). Given all the context you’ve provided you can just write something simple like:
Help me plan R code to implement this analysis.
Or
Help me plan the workflow and scripts to implement this analysis
I did this. It suggested both code (that looked approximately correct) and the directory structure, sticking to my guideline in the readme about being modular.
You should iterate with Ask mode if there are any refinements you want.
Let’s move on to edit mode to see how to put this plan into action.
5.4 Edit mode: Creating code from text
Edit mode will edit files for you. The best way to learn how is to just see it in action.
Open the Chat panel and click the ‘Ask’ button, then select ‘Edit’.
Now go back to your R file with the plots in it. In edit mode suggest a new plot such as:
Create a pairs plot for all continuous variables
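For reference, a hand-written version of what that prompt is asking for might look like the sketch below. It uses base R’s pairs() on a made-up toy data frame (the values are hypothetical). Copilot may well reach for a different approach, such as a plotting package, so use this only as a yardstick for checking its edits.

```r
# Hypothetical toy data standing in for `dat`; all values are made up
toy <- data.frame(
  secchi    = c(3, 5, 8, 10, 12, 15),
  pres.topa = c(0, 1, 0, 2, 3, 5),
  CB_cover  = c(5, 10, 12, 20, 25, 30),
  logged    = c("Logged", "Logged", "Logged",
                "Not logged", "Not logged", "Not logged")
)

# Keep only the numeric (continuous) columns
num_cols <- toy[sapply(toy, is.numeric)]

# Scatterplot matrix of every pair of continuous variables
pairs(num_cols)
```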
Tip: Sometimes you can’t go back once copilot has made edits to a file. So it’s good practice to use git and commit changes before and after editing.
Try creating some other plots or analyses using edit mode.
5.4.1 Adding documentation with edit mode
Edit mode can write text as well as code. It can be handy to use for documentation for instance.
Open the README.md. Then type this prompt:
Document what this script does in the readme.
Drag and drop your script into the chat window as well.
Click ‘Keep’ if you like what it did. Or you can suggest improvements. Alternatively, accept it for now and then edit it afterwards.
Another way to use edit mode is for translation. Try getting it to translate part of your code into another programming language, or part of your text into another language (Spanish and French are the best options at the moment).
5.4.1.1 Why so much code?
Copilot is designed as a programming assistant. We don’t know its system message, but given the main market for this software is professional programmers, we can guess it has a strong emphasis on programming robust code.
You might notice that copilot tends to ‘over-engineer’ your R scripts. For instance, it has a tendency to make an if statement to check if each new package needs installing, before loading it.
If you don’t like this style you can add a statement to the readme asking it to keep implementation simple.
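As a rough illustration of the two styles, the sketch below contrasts the defensive pattern with the simpler one. I’ve used the stats package (which ships with R) purely so the example runs without installing anything; in practice copilot would write this for packages like ggplot2, and its exact code will vary.

```r
# The defensive pattern copilot often writes:
# check whether the package is installed, install it if not, then load it
if (!requireNamespace("stats", quietly = TRUE)) {
  install.packages("stats")
}
library(stats)

# The simpler style you might ask for in the readme: just load the package
library(stats)
```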
5.4.2 Workflows and tips for edit mode
Remember it’s an assistant; it’s not doing the project for you. So you need to make sure it stays on track. Left unattended (if you just accept, accept, accept without reading) it can go down rabbit holes. Sometimes it creates superfluous analyses or even incorrect statistics.
So here’s how I recommend you use it:
- Use git for version control so you can go back to older versions.
- Read the suggested edits before accepting
- NEVER edit the file while copilot is working! To edit files it uses string matching to locate the position to insert the edits. If you change the file it may not find the correct place to insert the new code.
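To see why editing mid-way can break things, here is a toy illustration of an edit that is located by string matching. This is not copilot’s actual implementation; apply_edit() is a hypothetical helper I’ve written only to show the idea.

```r
# Toy illustration only; this is NOT how copilot is actually implemented
script <- c(
  "dat <- read_csv(\"data.csv\")",
  "plot(dat$secchi, dat$pres.topa)"
)

# Hypothetical helper: an edit is "find this exact line, replace it with that"
apply_edit <- function(lines, old, new) {
  hit <- which(lines == old)
  if (length(hit) != 1) {
    stop("Could not find a unique match for the text to replace")
  }
  lines[hit] <- new
  lines
}

old_line <- "plot(dat$secchi, dat$pres.topa)"
new_line <- "plot(dat$secchi, dat$pres.topa, pch = 16)"

# Works when the file is unchanged since the edit was planned
edited <- apply_edit(script, old_line, new_line)

# Fails if you changed that line by hand in the meantime
script_changed <- script
script_changed[2] <- "plot(dat$secchi, dat$pres.topa, col = \"blue\")"
result <- tryCatch(
  apply_edit(script_changed, old_line, new_line),
  error = function(e) conditionMessage(e)
)
result
```

In this toy version the failed match raises an error. A real assistant may instead insert code in the wrong place, which is harder to spot.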
5.5 Agent mode: Automated workflows
Agents are LLMs that have tools that allow them to work autonomously. In effect they review the results of tool use (such as writing code and running code), then respond to those results.
In Copilot’s chat window you can set it to ‘Agent’ mode to enable these features.
After each tool use copilot will ask you to confirm the changes and the next action. At that point you can review its changes, make edits, or continue chatting to suggest refinements.
Image: Agent mode from https://code.visualstudio.com
Agent mode has access to the terminal, so it will be using the terminal application to run scripts it creates. We’ll demonstrate in class so you can understand what it’s doing.
5.5.1 Exploring agent mode
Let’s explore Agent mode’s features through some analysis.
Switch the Modes to agent mode. Make sure you have the readme file in your project directory. This has all the context our agent will need.
It will help if you have the readme file open when you run the agent, so open that file now (then it will refer to that file first).
Now prompt agent mode with something like:
I want to do a regression of pres.topa on cb_cover. Create all the scripts I need to do this, as well as run the scripts to make plots of the predicted relationship between the variables
Then follow along with the conversation, adjusting or correcting it as need be.
See how it goes and we can discuss in class. It will be interesting to compare results as they will all be different.
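If you want something to compare the agent’s output against, here is a minimal hand-written sketch of a regression with a plot of the predicted relationship. It assumes a Poisson GLM and uses simulated data (all values are made up), so it is one reasonable reading of the prompt rather than the answer the agent should give.

```r
set.seed(1)

# Simulated stand-in for the site data; all values are made up
n <- 40
sim <- data.frame(CB_cover = runif(n, 0, 40))
sim$pres.topa <- rpois(n, lambda = exp(-0.5 + 0.05 * sim$CB_cover))

# Fit the regression (a Poisson GLM is assumed here because the response is a count)
m <- glm(pres.topa ~ CB_cover, family = poisson, data = sim)

# Predict over a grid of coral cover values, on the link scale,
# then back-transform so the interval stays positive
newdat <- data.frame(CB_cover = seq(0, 40, length.out = 100))
pred <- predict(m, newdata = newdat, type = "link", se.fit = TRUE)
newdat$fit <- exp(pred$fit)
newdat$lwr <- exp(pred$fit - 1.96 * pred$se.fit)
newdat$upr <- exp(pred$fit + 1.96 * pred$se.fit)

# Plot the data with the predicted mean and approximate 95% interval
plot(sim$CB_cover, sim$pres.topa,
     xlab = "Branching coral cover", ylab = "Topa count")
lines(newdat$CB_cover, newdat$fit)
lines(newdat$CB_cover, newdat$lwr, lty = 2)
lines(newdat$CB_cover, newdat$upr, lty = 2)
```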
Tip: There’s no ‘optimal’ prompt, only better prompts. Sometimes the best way to write is the way you are most comfortable writing. You’ll get more out of your brain that way and copilot will end up performing the same.
Here’s an alternative way to phrase that prompt:
Arrr matey, I be wantin' to run a regression of pres.topa on cb_cover. Hoist all the scripts I need for this voyage, and chart the plots of the predicted relationship between these variables!
(Ok so that last prompt definitely doesn’t follow the guidelines of being super clear, but I was bored and it seemed to work ok)
5.5.2 Writing up the project?
You can keep going from here if you like and get agent mode to write up the results it found as an Rmd file. It will use the tables it generates to (hopefully) make accurate interpretations. Copilot Pro also has vision capabilities. This means it will be able to interpret the figures it creates as well. We’ll see that in action when we look at Roo Code in a bit.
If you do that, as always, don’t take anything for granted. Make sure you check everything and understand the results yourself.
5.5.3 Agent mode errors
Agent mode will try to debug its errors, but can get stuck in a loop sometimes.
One common error is that it will try to run Rscript from the terminal in order to run full R scripts. But not everyone’s computer is set up with an Rscript command. To fix this you will need to web search how to put Rscript on your PATH (which means it can be found from the terminal). How you do this depends on your computer.
This is an advanced IT set-up feature that I won’t cover here, but for one tip see the custom instructions section below.
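As a starting point, you can check from inside R whether Rscript is on your PATH and where the executable lives. This is a small sketch using base R only; my_script.R is a placeholder file name, and the path printed will depend on your computer.

```r
# Full path to the Rscript executable that belongs to the running R session
rscript_name <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
rscript_path <- file.path(R.home("bin"), rscript_name)
rscript_path

# An empty string here means Rscript is not on your PATH
Sys.which("Rscript")

# The kind of command you could give the agent (my_script.R is a placeholder)
cat(sprintf('"%s" "my_script.R"\n', rscript_path))
```

Run this in the R console rather than the terminal, since the whole point is that the terminal may not be able to find Rscript.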
5.5.4 Summary
Agent mode can really accelerate your workflow development. But there are some risks. It can also go off track or write excessive amounts of code (over-engineering). Best practices for using Agent mode include:
- Separate science questions (what stats) from implementation questions (what code)
- Understand the stats you want to do, don’t just rely on copilot to get it right
- Check what it does as it does it, so you can keep it on track
- Give strong guidelines, e.g. through a project readme file
- Keep the readme updated to guide copilot
- Report AI use and how it was used in your publications
5.6 LLM choice
GitHub Copilot lets you choose different LLMs. What you can see will depend on what type of plan you have. Some LLMs have a limited number of ‘Premium Requests’ per month, again depending on your account tier (e.g. at time of writing Pro accounts got 300 premium requests per month).
Click the Model choice menu to see what’s available (bottom of chat window, probably says ‘GPT-4.1’ right now).
The 0x means you get unlimited requests, whereas a 1x means it will use one full premium request.
Different models have different performances for different tasks.
In general there aren’t many formal evaluations of R coding or statistics yet, whereas there are many available for Python.
The two most commonly used LLMs for data science are the Claude Sonnet series (currently 4.0 available) and the GPT 4.X series (currently 4.1 available).
In general I’ve found the Claude Sonnet models are the best for R code and most reliable for tool use (e.g. Edit or Agent mode).
GPT 4.1 is also pretty good, but in my experience there is a higher rate of errors. For instance, in Edit and Agent mode reasonably often it will delete text or code you wanted to keep (so keep an eye on what it’s doing). GPT 4.1 also tends to be a bit ‘lazier’ than Sonnet 4.0 in that it will tend to explain what to do, rather than explaining and implementing the suggestion.
Play around with the same prompt, but different models, so you can compare performance.
5.6.1 Bring Your Own API Key (BYOK)
You can add your own API key to github copilot. This lets you access other models on a pay per use basis. To add an API key click the model menu then select ‘Manage models’ and follow instructions.
You will also need an OpenAI API key if you want to use vision, that is, attach images as context.
5.7 Tools and MCP
Advanced users may want to install Model Context Protocol Servers or tools. These are accessible via the cog in the chat window.
I won’t cover tools and MCP in detail in this course. But these give GitHub Copilot additional abilities, such as web searching, or access to specific online databases or reference manuals (e.g. for code styles).
There are some security concerns to consider if using tools, see later chapters on this topic.
5.8 Other options
GitHub Copilot is complex software and I recommend browsing the docs for a comprehensive guide. But there are a few other key features.
All of these are found by clicking the octocat on the bottom right of your VScode window.
5.8.1 Indexing
First, it will automatically index the scripts and documents in your active directory. This means a version is created that the AI agent can quickly search. This speeds up recommendations and makes them more accurate.
However, indexing may also be a data privacy concern. While your index shouldn’t be shared with anyone else, it does mean that as soon as you open VSCode github copilot is reading some of the files (hard to tell exactly what) to create its index in the cloud.
To turn this off click the octocat and find the option for indexing to disable it.
5.8.2 Configure code completions
You can enable/disable code completions for a certain file type or all file types. Sometimes they get in the way or are distracting, like when you are writing qmd documents.
5.8.3 Next edit suggestions
This option is turned off by default. If turned on it will suggest changes anywhere in your document, rather than just at the cursor. For instance, if you update a variable name at the top, it will find and suggest updating all instances of that variable name. You use tab to scroll through the different ‘next edit’ suggestions when it makes those.
5.9 Customizing Github Copilot mode and instructions
There are a lot of customization options in copilot now. To see some of these, open the chat window and click the cog symbol. Then pick one of the options below.
5.9.1 Custom modes
After the cog click ‘Modes’, then a pop-up will appear at the top of your VScode window. Click the plus to create a new mode. Follow the prompts. Try paste in our text from the stats bot from before.
Once that is saved in the .github directory (so only local to this project) you can click the Modes selector (bottom left of chat window, might say ‘Ask’, ‘Edit’ or ‘Agent’) and choose your new mode. Try having a conversation with it.
You can also give custom modes access to tools, so they can edit or act like agents. If you’re an advanced user you might want to play around with these options to make modes for different tasks (like data exploration).
5.9.2 Custom instructions
Heavy Agent users may want to set up custom instructions. These are just instructions attached to every prompt you run in this project. To create these press the cog button again, click ‘Instructions’ and follow the steps to create a new instruction as we did above for the modes.
Some ideas: you could set preference for ggplot2 or a specific coding style.
You can also add instructions to tell the agent how to use Rscript to avoid terminal errors. See here for instructions. One way around Rscript related errors is to find the path to the Rscript executable on your computer (e.g. Rscript.exe on Windows) and then create a custom instruction that says something like:
To use Rscript to run R scripts give the full path like this:
"C:/Program Files/R/bin/Rscript.exe" "my_script.R"