Chapter 10 Ethics and copyright

There are several fundamental ethical issues we should discuss related to the development of LLMs and AI in general:

  • That LLMs use considerable energy and water resources
  • That many LLMs have probably been trained on copyrighted data, which amounts to IP theft
  • That the concentration of LLM capabilities in a few big companies may contribute to rising inequality
  • That LLMs have inherent biases and will change the way science is done, possibly for the worse

I’ve developed a little quiz to help you think about your personal ethics.

10.1 Sustainability

Training LLMs costs millions of dollars, and much of that cost is energy use. Further, the data centres used for training and running LLMs need water for cooling. Asking a finished LLM questions uses much less energy, but cumulatively, across the globe, it adds up to a lot. Here are a few informative statistics I found online:

From Forbes:

  • “ChatGPT’s daily power usage is nearly equal to 180,000 U.S. households, each using about twenty-nine kilowatts.”
  • Microsoft’s emissions have risen 30% since 2020 due to data centres
  • An AI prompt uses about 10x more energy than a traditional Google search

To put this in context, I did some calculations on my personal usage (detailed in the supplement at the end of this chapter). I estimate the prompting I do through Copilot each year costs about 2.32 kg of CO2 and about 1000 litres of water. (This is a lower bound, as I also use LLMs for other tasks.)

To put that in context, flying the 1.5 hours from Halifax to Montreal emits about 172 kg of CO2, and driving for 15 minutes emits about 3 kg. So my coding usage is roughly 1/75th of a short flight, or the same as driving to work once. 1000 L is equivalent to taking about 22 five-minute showers.

Of course, the carbon cost is global, whereas the water cost is localised (probably to US data centres, so by using this resource I’m really just making the water problem worse for Americans).

So it’s not a huge increase in my personal energy use. But cumulatively, across the globe, it is a lot.

More generally, humanity’s energy use is growing exponentially. Despite renewables and so on, ultimately our planet won’t be able to sustain this energy drawdown. LLMs are part of that trend of growing energy use. At some point we need to start using less energy, or the biosphere will become depleted and return to a ‘moon-like rock’, in one study’s words.

Here’s my personal belief.

If we’re smart, humanity will use this technology to find ways to make our use of the planet more sustainable and ultimately save water and energy, just as we should have been using fossil fuels to develop a transition to lasting sustainable energy use. So you can guess how likely that is to happen…

It’s the reason I’m teaching this course. I don’t personally think that LLMs make our lives better, or humanity more sustainable. They just raise the bar on the rate of progress.

You can bet industries are using this technology to improve their productivity (which means greater environmental impacts). I believe that as environmental scientists we need to try to keep up. Ultimately we need progress on local-to-planetary sustainability (that’s us, the environmental scientists) to outpace the development of the industries that are environmentally unsustainable.

10.2 Model biases

This is a big one. I recommend everyone read this perspective on the ‘Illusion of Understanding’.

It’s important that we don’t become too reliant on AI for our work. That’s why I’m teaching and promoting thoughtful use.

Some key points:

  • We need to maintain and grow research fields that aren’t convenient to do with AI, not just the areas that are easy with AI
  • We need to push ourselves as individuals not to ‘be lazy’ and rely on AI too much. There is still great value in human learning, and learning requires mental energy: for instance, you will know something better if you write it yourself rather than writing it with AI
  • We need to be aware of biases in the content AI generates

For statistics, these biases are likely to be a preference for well-known methods developed by Western science. So you should still read the literature broadly, and avoid using AI (or prompt it in different ways) if you truly want to create novel statistics, as opposed to using it to do statistics on a study whose novelty lies elsewhere, such as in the data.

10.3 Rising inequality

AI development is currently concentrated in the USA, and the profits from LLM use go to American companies (the USA is itself a country with massive inequality issues!). So to the extent that LLMs replace labour, they will redirect income and taxes from jobs in other countries to American companies.

It is likely that the current low cost of LLM use will not continue; companies are running at a loss in order to gain market share. So be careful how dependent you become on LLMs, and think about what that spending is replacing in your research budget.

I personally believe that our own countries should be developing their own LLM products and resources. Even if they are not ‘industry leading’, they can still be highly effective for specific tasks. There are open-source models available that can fill this role.

10.4 Copyright

Many LLMs have been trained on pirated books. The extent to which the law recognises this as infringement is still being tested in court.

For me personally, it’s frustrating that I spent years developing a statistics blog (which was open access, though I appreciated attribution), and now that information has been mined by LLMs. AI companies are thus profiting from our collective knowledge.

It is an even worse situation for authors whose livelihoods and careers depend on their copyrighted works.

Copilot does, in theory, block itself from writing code that might be copyrighted. However, the efficacy of this system is unclear (it seems to just be a command in the system prompt), so be careful. Here are some recommendations for individuals:

  • In general, you own works you create with an LLM.
  • This also means you hold the liability for any works you create (not normally an issue in the environmental sciences); e.g. you couldn’t blame the LLM if you had to retract a paper due to incorrect statistics.
  • You should acknowledge LLM use in academic publications, and state what you used it for.
  • Always look for original source references; e.g. don’t ‘cite’ the LLM for use of a GLM, cite a textbook or other reputable source (Zuur’s books are good for this!).

10.5 Managing data privacy

Any prompt you send to an LLM provider goes to the server of an AI company (e.g. Google). So it’s important to be mindful of what information you include in your prompts.

The data you send (including text data) will be covered by the privacy policy of the LLM provider. Some services claim to keep your data private (e.g. the Copilot subscription my university has). Public services will tend to retain the right to use any data you enter as prompts.

This means that if you put your best research ideas into ChatGPT, it’s possible it will repeat them later to another user who asks similar questions. So be mindful of what you are writing.

Before using an LLM to help with data analysis, be sure you understand the IP and ethical considerations involved with that data. For instance, if you have human survey data, you may not be allowed to send it to a foreign server, or reveal any of the information to an LLM.

In that case you have three options.

10.5.0.1 Option 1: Locally hosted LLM

Use a locally hosted LLM. We won’t cover setting these up in this workshop. Locally hosted LLMs run on your own computer. They can be suitable for simpler tasks, provided you have a reasonably powerful GPU. The downsides are that they do not match the performance of the industry-leading LLMs, and response times can be slower.

10.5.0.2 Option 2: Keep data separate from code development

Use the LLM to help generate code to analyse the data, but do not give the LLM the data or the results. I recommend keeping the data in a different directory altogether (i.e. not your project directory), so that LLM agents don’t inadvertently access the raw data. You also want to be sure the LLM isn’t feeding the results of the data analysis back to itself (thereby revealing private information to the LLM).

It can be helpful to generate some simulated data to use for code development, so there is no risk of violating privacy.
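
For example, here is a minimal sketch in R (the column names and distributions are hypothetical; match them to the structure of your real data):

    # Simulate a stand-in dataset with the same structure as the sensitive data,
    # so the LLM never sees real records (column names here are hypothetical)
    set.seed(42)
    n <- 100
    sim_data <- data.frame(
      site       = sample(c("A", "B", "C"), n, replace = TRUE),
      depth_m    = runif(n, min = 1, max = 30),
      fish_count = rpois(n, lambda = 5)
    )
    head(sim_data)

Develop and test the analysis code against sim_data with the LLM’s help, then run the final script on the real data without the LLM involved.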

10.5.0.3 Option 3: Ignore sensitive folders

Some LLM agents can be directed to ignore specific folders. For example, you could add an instruction to ignore a folder to Copilot’s custom instructions, or use Roo Code’s .rooignore file.
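
For instance, a .rooignore file at the project root uses .gitignore-style patterns. A minimal sketch (the folder names are placeholders for wherever your sensitive files live):

    # .rooignore: keep the agent out of folders holding sensitive data
    data-private/
    participant-records/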

However, remember that prompts are not 100% precise (unlike real code), so there’s still a chance the LLM will go into those folders. Be careful: if the data is really sensitive, keep it elsewhere on your computer, and always check the agent’s actions before you approve them.

10.6 Supplement: Calculations of personal environmental impact from using LLMs

A ChatGPT request uses about 2.9 watt-hours. Assume a similar cost for coding applications (probably more, given the additional context we load with every prompt). Looking at my chat history, I had 14 conversations in the last week (not counting in-line editing), with an average of 3 requests per conversation. Over a year that equals:

2.9 Wh × 14 × 3 × 52 = 6.33 kWh

US energy production emits on average 367 grams of CO2 per kWh (https://www.eia.gov/tools/faqs/faq.php?id=74&t=11), so my conservatively estimated yearly emissions for coding are:

6.33 kWh × 367 g/kWh = 2.32 kg CO2

For comparison, flying the 1.5 hours from Halifax to Montreal emits about 172 kg, so my personal annual emissions for coding are perhaps 1/75th of those of a short plane flight.

Water is used for cooling in data centres: “A single ChatGPT conversation uses about fifty centilitres of water, equivalent to one plastic bottle.” Applied per request, the usage above equates to about 1000 L per year (2184 requests × 0.5 L ≈ 1100 L). That’s equivalent to about 22 five-minute showers.
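
If you want to repeat these back-of-envelope calculations with your own numbers, here is a small R sketch (the constants are those cited above; the usage figures are my own and should be replaced with yours):

    # Back-of-envelope estimate of annual CO2 and water cost of LLM-assisted coding
    wh_per_request     <- 2.9  # watt-hours per ChatGPT request (cited above)
    conversations_week <- 14   # conversations per week, from my chat history
    requests_per_conv  <- 3    # average requests per conversation
    g_co2_per_kwh      <- 367  # US average emissions intensity (EIA)
    litres_per_request <- 0.5  # ~50 cL of cooling water per request

    requests_per_year <- conversations_week * requests_per_conv * 52
    kwh_per_year      <- requests_per_year * wh_per_request / 1000
    kg_co2_per_year   <- kwh_per_year * g_co2_per_kwh / 1000
    litres_per_year   <- requests_per_year * litres_per_request

    round(kwh_per_year, 2)     # ~6.33 kWh
    round(kg_co2_per_year, 2)  # ~2.32 kg CO2
    round(litres_per_year)     # ~1092 L, about 22 five-minute showers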