2  Introduction to LLMs for coding

Slides from the workshop presentation are available on Google Drive.

2.1 The jagged frontier of LLM progress

LLMs were created to write text, but it soon became apparent that they also excel at writing programming code in many different languages.

Since then, AI companies have been optimising their training and development for coding and logic.

A series of standardized tests is used to compare the quality of LLMs. A common evaluation is the SWE-bench benchmark, which measures the ability of LLMs to autonomously fix real software bugs. Current models resolve about 50% of the issues in this benchmark.

Their progress on maths and logic is a bit more controversial. Some of the maths benchmarks (like the AIME, an annual test taken by roughly the top 5% of high-school students) appear to be saturated, with LLMs scoring close to 100%. So newer tests based on unsolved maths problems are being developed.

However, others find that the abilities of LLMs on maths and logic are overstated, perhaps because the LLMs have been trained on the questions and the answers. It’s also clear that AI companies have a strong financial incentive to find ways (real and otherwise) of improving on the benchmarks. At the moment there is tough competition to be ‘industry leaders’ and grab market share with impressive benchmark results.

Either way, it does seem that the current areas of progress are programming, maths and logic.

Evaluations of LLMs on statistics and on the R language are less common.

The limited evaluations of LLMs on their ability to identify the correct statistical procedure are less impressive than the other benchmarks. One evaluation (published in 2025) of several models, with GPT-4 as the most up-to-date, found accuracy at suggesting the correct statistical test ranging from 8% to 90%.

In general, the LLMs were good at choosing descriptive statistics (up to 90% accuracy for GPT-4), whereas they were much less reliable at choosing inferential tests: GPT-4 scored between 20% and 43% accuracy on questions for which a contingency-table test was the correct answer.
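To make that concrete, a contingency-table question is one where the data are counts cross-classified by two categorical variables and the task is to pick an appropriate test of association. A minimal sketch in base R, with made-up counts purely for illustration, might look like this:

```r
# Hypothetical counts: treatment group vs. improvement (invented data for illustration)
counts <- matrix(c(30, 10,
                   18, 22),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("treatment", "control"),
                                 improved = c("yes", "no")))

counts

# Pearson's chi-squared test of independence on the contingency table
chisq.test(counts)

# With small expected counts, Fisher's exact test is often preferred
fisher.test(counts)
```

Deciding between options like these (and whether a contingency-table test is even the right framing for the question) is exactly the kind of judgement where the evaluation found LLMs less reliable.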

The results also show the gains that can be made with better prompts: in this evaluation, accuracy for GPT-4 roughly doubled with an improved prompt.

The lesson is twofold. First, just because LLMs excel at some tasks doesn’t mean they will excel at others. Second, good prompting strategies pay off.

For those of us in the niche world of R there is another lesson. LLMs should be good at helping us implement analyses (i.e. write the R code). However, they are less reliable as statisticians who can guide us on the scientific question of which type of analysis to do.
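As a hypothetical illustration of that division of labour: the snippet below is the kind of implementation step an LLM can usually write correctly, while deciding whether a two-sample t-test is actually the right analysis for your study design remains the statistical judgement you need to make (or at least verify) yourself. The data frame and variable names here are invented for the sketch.

```r
# Invented example data: two groups with a continuous outcome
set.seed(1)
dat <- data.frame(
  group   = rep(c("A", "B"), each = 20),
  outcome = c(rnorm(20, mean = 10), rnorm(20, mean = 12))
)

# The 'implementation' step an LLM handles well: a two-sample t-test in R
t.test(outcome ~ group, data = dat)

# The 'statistician' step is deciding whether a t-test is appropriate at all,
# e.g. checking the design, the distribution of the outcome, and independence.
```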

In this book we’ll walk through some of the ways you can use LLMs to help you do statistics, data analysis and R coding (and a bit of Python too). There is a multitude of software and R package options, so many that it’s bewildering. In the next chapter I’ll give you a few of the easiest set-up options available.