Chapter 10 Ethics and copyright

There are several fundamental ethical issues we should discuss related to the development of LLMs and AI in general:

  • That LLMs use considerable energy and water resources
  • That many LLMs have probably been trained on copyrighted data, which amounts to IP theft
  • That the concentration of LLM capabilities in a few big companies may contribute to rising inequality
  • That LLMs have inherent biases and will change the way science is done, possibly for the worse

I’ve developed a little quiz to help you think about your personal ethics.

10.1 Sustainability

Training LLMs costs millions of dollars, and much of that cost is energy use. Further, the data centres used for training and running LLMs need water for cooling. Asking a finished LLM questions uses much less energy, but cumulatively, across the globe, it adds up to a lot. Here are a few informative statistics I found online:

From Forbes:

  • “ChatGPT’s daily power usage is nearly equal to 180,000 U.S. households, each using about twenty-nine kilowatts.”
  • Microsoft’s emissions have risen 30% since 2020 due to data centres
  • An AI prompt uses about 10x more energy than a traditional Google search

To put this in context, I did some calculations on my personal usage (detailed in the supplement at the end of this chapter). I estimate the prompting I do through Copilot each year costs about 2.32 kg of CO2 and about 1000 litres of water. (This is a lower bound, as I also use LLMs for other tasks.)

To put that in context, flying the 1.5 hours from Halifax to Montreal emits about 172 kg of CO2, and driving for 15 minutes emits about 3 kg. So my coding usage is roughly 1/75th of a short flight, or the same as driving to work once. 1000 L is equivalent to taking about 22 five-minute showers.

Of course, the carbon cost is global, whereas the water cost is localised (probably to US data centres, so by using this resource I’m really just making the water problem worse for Americans).

So it’s not a huge increase in my personal energy use. But cumulatively, across the globe, it is a lot.

More generally, humanity’s energy use is growing exponentially. Despite renewables and so on, ultimately our planet won’t be able to sustain this energy drawdown. LLMs are part of that trend of growing energy use. At some point we need to start using less energy, or the biosphere will become depleted and return to a ‘moon-like rock’, in one study’s words.

Here’s my personal belief.

If we’re smart, humanity will use this technology to find ways to make our use of the planet more sustainable and ultimately save water and energy, just as we should have been using fossil fuels to develop a transition to lasting sustainable energy use. So you can guess how likely that is to happen…

It’s the reason I’m teaching this course. I don’t personally think that LLMs make our lives better, or humanity more sustainable. They just raise the bar on the rate of progress.

You can bet industries are using this technology to improve their productivity (which means greater environmental impacts). I believe that as environmental scientists we need to try to keep up. Ultimately we need progress on local-to-planetary sustainability (that’s us, the environmental scientists) to outpace the development of the industries that are environmentally unsustainable.

10.2 Model biases

This is a big one. I recommend everyone read this perspective on the ‘Illusion of Understanding’.

It’s important that we don’t become too reliant on AI for our work. That’s why I’m teaching and promoting thoughtful use.

Some key points:

  • We need to maintain and grow research fields that aren’t convenient to do with AI, not just the areas that are easy with AI
  • We need to push ourselves as individuals not to ‘be lazy’ and rely on AI too much. There is still great value in human learning, and learning requires mental energy: for instance, you will know something better if you write it yourself rather than writing it with AI
  • We need to be aware of biases in the content AI generates

For statistics, these biases are likely to be a preference for well-known methods developed by Western science. So you should still read the literature broadly, and avoid using AI (or prompt it in different ways) if you truly want to create novel statistics, as opposed to using it to do statistics on a study whose novelty lies elsewhere, such as in the data.

10.3 Rising inequality

AI development is currently concentrated in the USA, and the profits from LLM use go to American companies (the USA is itself a country with massive inequality issues!). So to the extent that LLMs replace labour, they will redirect income and taxes from jobs in other countries to American companies.

It is likely that the current low cost of LLM use will not continue; companies are running at a loss in order to gain market share. So be careful how dependent you become on LLMs, and think about what that spending is replacing in your research budget.

I personally believe that our own countries should be developing their own LLM products and resources. Even if they are not ‘industry leading’, they can still be highly effective for specific tasks. There are open-source models available that can fill this role.

10.4 Copyright

Many LLMs have been trained on pirated books. The extent to which the law recognises this as infringement is still being tested in court.

For me personally, it’s frustrating that I spent years developing a statistics blog (which was open access, though I appreciated attribution), and now that information has been mined by LLMs. AI companies are thus profiting from our collective knowledge.

It is an even worse situation for authors whose livelihoods and careers depend on their copyrighted works.

Copilot does, in theory, block itself from writing code that might be copyrighted. However, the efficacy of this system is unclear (it seems to just be a command in the system prompt), so be careful. Here are some recommendations for individuals:

  • In general, you own works you create with an LLM.
  • This also means you hold the liability for any works you create (not normally an issue in the environmental sciences); e.g. you couldn’t blame the LLM if you had to retract a paper due to incorrect statistics.
  • You should acknowledge LLM use in academic publications, and state what you used it for.
  • Always look for original source references; e.g. don’t ‘cite’ the LLM for use of a GLM, cite a textbook or other reputable source (Zuur’s books are good for this!).

10.5 Managing data privacy

Any prompt you send to an LLM provider goes to the server of an AI company (e.g. Google). So it’s important to be mindful of what information you include in your prompts.

The data you send (including text data) will be covered by the privacy policy of the LLM provider. Some services claim to keep your data private (e.g. the Copilot subscription my university has). Public services will tend to retain the right to use any data you enter as prompts.

This means that if you put your best research ideas into ChatGPT, it’s possible it will repeat them later to another user who asks similar questions. So be mindful of what you are writing.

Before using an LLM to help with data analysis, be sure you understand the IP and ethical considerations involved with that data. For instance, if you have human survey data, you may not be allowed to send it to a foreign server, or reveal any of the information to an LLM.

In that case you have three options.

10.5.0.1 Option 1: Locally hosted LLM

Use a locally hosted LLM. We won’t cover setting these up in this workshop. Locally hosted LLMs run on your own computer. They can be suitable for simpler tasks, provided you have a reasonably powerful GPU. The downsides are that they do not match the performance of the industry-leading LLMs, and response times can be slower.

10.5.0.2 Option 2: Keep data separate from code development

Use the LLM to help generate code to analyse the data, but do not give the LLM the data or the results. I recommend keeping the data in a different directory altogether (i.e. not your project directory), so that LLM agents don’t inadvertently access the raw data. You also want to be sure the LLM isn’t feeding the results of the data analysis back to itself (thereby revealing private information to the LLM).

It can be helpful to generate some simulated data to use for code development, so there is no risk of violating privacy.
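
For example, here is a minimal sketch in R (the column names and distributions are hypothetical; match them to the structure of your real data):

    # Simulate a stand-in dataset with the same structure as the sensitive data,
    # so the LLM never sees real records (column names here are hypothetical)
    set.seed(42)
    n <- 100
    sim_data <- data.frame(
      site       = sample(c("A", "B", "C"), n, replace = TRUE),
      depth_m    = runif(n, min = 1, max = 30),
      fish_count = rpois(n, lambda = 5)
    )
    head(sim_data)

Develop and test the analysis code against sim_data with the LLM’s help, then run the final script on the real data without the LLM involved.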

10.5.0.3 Option 3: Ignore sensitive folders

Some LLM agents can be directed to ignore specific folders. For example, you could add an instruction to ignore a folder to Copilot’s custom instructions, or use Roo Code’s .rooignore file.
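
For instance, a .rooignore file at the project root uses .gitignore-style patterns. A minimal sketch (the folder names are placeholders for wherever your sensitive files live):

    # .rooignore: keep the agent out of folders holding sensitive data
    data-private/
    participant-records/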

However, remember that prompts are not 100% precise (unlike real code), so there’s still a chance the LLM will go into those folders. Be careful: if the data is really sensitive, keep it elsewhere on your computer, and always check the agent’s actions before you approve them.

10.6 Supplement: Calculations of personal environmental impact from using LLMs

A ChatGPT request uses about 2.9 watt-hours. Assume a similar cost for coding applications (probably more, given the additional context we load with every prompt). Looking at my chat history, I had 14 conversations in the last week (not counting in-line editing), with an average of 3 requests per conversation. Over a year that equals:

2.9 Wh × 14 × 3 × 52 = 6.33 kWh

US energy production emits on average 367 grams of CO2 per kWh (https://www.eia.gov/tools/faqs/faq.php?id=74&t=11), so my conservatively estimated yearly emissions for coding are:

6.33 kWh × 367 g/kWh = 2.32 kg CO2

For comparison, flying the 1.5 hours from Halifax to Montreal emits about 172 kg, so my personal annual emissions for coding are perhaps 1/75th of those of a short plane flight.

Water is used for cooling in data centres: “A single ChatGPT conversation uses about fifty centilitres of water, equivalent to one plastic bottle.” Applied per request, the usage above equates to about 1000 L per year (2184 requests × 0.5 L ≈ 1100 L). That’s equivalent to about 22 five-minute showers.
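
If you want to repeat these back-of-envelope calculations with your own numbers, here is a small R sketch (the constants are those cited above; the usage figures are my own and should be replaced with yours):

    # Back-of-envelope estimate of annual CO2 and water cost of LLM-assisted coding
    wh_per_request     <- 2.9  # watt-hours per ChatGPT request (cited above)
    conversations_week <- 14   # conversations per week, from my chat history
    requests_per_conv  <- 3    # average requests per conversation
    g_co2_per_kwh      <- 367  # US average emissions intensity (EIA)
    litres_per_request <- 0.5  # ~50 cL of cooling water per request

    requests_per_year <- conversations_week * requests_per_conv * 52
    kwh_per_year      <- requests_per_year * wh_per_request / 1000
    kg_co2_per_year   <- kwh_per_year * g_co2_per_kwh / 1000
    litres_per_year   <- requests_per_year * litres_per_request

    round(kwh_per_year, 2)     # ~6.33 kWh
    round(kg_co2_per_year, 2)  # ~2.32 kg CO2
    round(litres_per_year)     # ~1092 L, about 22 five-minute showers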