A collaborator once asked me: “can you do that thing where you take a little bit of data and model it up to make lots more data?”

Does it sound suspicious to fake a lot of data from a little bit of data? It depends on the context. Making up data is totally appropriate if you want to do a power analysis.

And ‘making-up’ data (let’s call it simulation from now on) for power analysis is a handy tactic for designing effective surveys and experiments. This post will look at why and how.

In the narrow-sense, power analysis it about type II error, or the chance your data and analysis won’t be able to detect an effect that is really there.

Type II errors are the quiet sibling to type I errors, which tend to get more focus. The whole “<0.05 is significant”” mentality is about type I errors.

Having a low type I is important if you want to say a new drug, which may have side-effects, works.

But in many fields, type II errors are more consequential. In environmental science in particular, we don’t want significant (in the non-statistical sense of the word) environmental change to go unnoticed. Poor power would mean we are unlikely to detect environmental change in our acidification experiments, or from say an oil spill decimating a seabird colony.

Pragmatically, if you are doing a PhD you have a limited amount of time to get publishable results. So you’d want enough power to detect an effect, should it be there.

Power analysis is also much more useful type II errors, when used in the broader sense of the term.

Broad-sense power analysis is
about how well data and a statistical model work together to measure an
effect. So it could be about whether we measure the *right* effect (not
just whether we measure it at all).

In the broad and narrow sense, power analysis is a really helpful tool when you are designing experiments, or a field survey.

For more on traditional power analysis in environmental stats I recommend the textbook Quinn and Keough (or google it to find a pdf) or in the broad-sense Bolker’s excellent book

Let’s say you want to know whether there are more fish inside a marine reserve (that have no fishing) than outside the reserve. You are going to do a number of standardized transects inside and outside the reserve and count the numbers of fish.

Your fish species of interest are a very abundant sweetlips and a rather rare humphead wrasse. What’s the chance that you would be able to detect a two times difference in abundance for each fish species between reserves and fished areas?

We can address this question with power analysis by simulating ‘fake’ data for the surveys where there is a doubling of abundance, then fitting a statistical model to the fake data, then deciding whether or not the difference is ‘significant’ (e.g. p<0.05). Then we repeat that a 1000 times and count up the % of times we said there was a difference. That % is the power.

So we need to decide on a few things ahead of time, the sample size of surveys, the expected (mean) abundance values and the variance in abundance. This is where you could draw on earlier literature to make estimated guesses. The sample size is up for grabs and trying different sample sizes could be part of your power analysis.

Let’s assume there are normally 10 sweetlips per transect and 1 humphead wrasse per transect.

As the data are counts we’ll assume they are Poisson distributed. This amounts to assuming mean = variance, so the variance of sweetlips across transects is 10 and wrasse is 1.

To answer this question with R we are going to use quite a few handy packages:

```
library(purrr)
library(ggplot2)
library(broom)
library(dplyr)
library(tidyr)
```

`purrr`

is handing for creating 1000s of randomised datasets, `ggplot2`

is for plots, `broom`

is for cleaning the 1000s of models we’ll fit,
`dplyr`

and `tidyr`

are for data wrangling.

Now let’s create a function that simulates data and fits a model. This
may look overwhelming, but don’t worry about the R details if you’re not
that into R. All we are doing is creating a function that simultions
some data from two groups (reserve or not) for `n`

transects, and then
fits a
GLM
and finally it spits out a p-value for whether there was a significant
difference in the simulated data.

```
sim <- function(n, x1, x2){
x <- rep(c(x1, x2), each = n/2)
y <- rpois(n, lambda = x)
m1 <- glm(y ~ x, family = "poisson") %>% tidy()
m1
}
```

Now we can use our simulation function to simulate counting wrasse on 20 transects (10 inside and 10 outside the reserve), and then fitting the GLM to that data:

```
set.seed(2001) #just do this to get the same result as me
sim(100, 1, 2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.522 0.287 -1.82 0.0690
## 2 x 0.618 0.167 3.69 0.000222
```

So we get a table with mean estimated difference (on log scale), standard errors and p-values.

Now we use `purrr`

to do this 1000 times:

```
mout <- map(1:1000, ~sim(20, 1, 2))
```

Which results in 1000 lists, yuck. Let’s do some data wrangling on the output:

```
mout2 <- mout %>%
bind_rows(.id = "rep") %>%
filter(term != "(Intercept)") %>%
mutate(Signif = p.value < 0.05,
rep = as.numeric(rep))
head(data.frame(mout2))
## rep term estimate std.error statistic p.value Signif
## 1 1 x 6.931472e-01 0.4082471 1.697862e+00 0.089533876 FALSE
## 2 2 x 8.266786e-01 0.4531632 1.824240e+00 0.068115744 FALSE
## 3 3 x 5.877867e-01 0.3220304 1.825252e+00 0.067963001 FALSE
## 4 4 x 1.145132e+00 0.4339488 2.638865e+00 0.008318414 TRUE
## 5 5 x 1.823216e-01 0.3496022 5.215114e-01 0.602010565 FALSE
## 6 6 x -2.823275e-13 0.3429971 -8.231194e-13 1.000000000 FALSE
```

Now we get a dataframe of the 1000 simulations, indicating whether p for the difference between reserve vs unreserved was <0.05 (column ‘Signif’).

To get the power, we just sum `Signif`

and divide by the 1000 trials:

```
sum(mout2$Signif)/1000
## [1] 0.408
```

So an ~40% chance we’d detect a 2x difference in wrasse abundance with 20 transects. This is the 2-sided probability, arguably for this question we could also use a one-sided test.

Try it again for the sweetlips (expected abundance doubling from 10 to 20). You’ll see you get much more power with this more abundance species (almost 100%).

You could try this with different sample sizes to get an idea of how much effort you need to invest in doing transects in order to see a difference (if the difference is really there of course).

How close does our approach get us to the 2x difference? We can also answer that by looking at the estimates from the GLM:

```
ggplot(mout2, aes(x = exp(estimate))) +
geom_density(fill = "tomato") +
theme_bw() +
geom_vline(xintercept = 2) +
xlab("Estimated times difference")
```

This distribution shows the expected outcomes we’d estimate over 1000 repeats of the surveys. So the solid vertical line is the ‘real’ difference. Note the long tail to the left of drastic overestimates. It is common with small sample sizes that we might overestimate the true effect size. More on this later.

I took the exponent of the `estimate`

(estimated mean difference),
because the Poisson GLM has a log link, so the estimate is on the log
scale. Taking its exponent means it is now interpreted as a times
difference (as per the x-axis label).

It is reasonably well known that over-use of p-values can contribute to publication bias, where scientists tend to publish papers about significant and possibly overestimated effect sizes, but never publish the non-significant results. This bias can be particularly bad with small sample sizes, because there’s a reasonable chance we’ll see a big difference and therefore, make a big deal about it.

We can look at this phenomena in our simulations. First, let’s take the mean of our estimated effect sizes for those trials that were significant and those that were not:

```
signif_mean <- mean(exp(filter(mout2, Signif)$estimate))
nonsignif_mean <- mean(exp(filter(mout2, !Signif)$estimate))
all_mean <- mean(exp(mout2$estimate))
c(all_mean, signif_mean, nonsignif_mean)
## [1] 2.210280 3.062810 1.622725
```

So average effect size for the significant trials is >3x (remember the real difference is 2x). If we take the average across all trials it is closer to the truth (2.3x).

Clearly if we only publish the significant results, over many studies this will add up to a much bigger difference than is really there. This can be a problem in some fields. I don’t think publication bias particularly affects studies of marine reserves, because typically there are multiple research questions, so the researchers will publish anyway.

Let’s look at this as a plot. We’ll do the same distribution as above, but with different colours for significant versus non-significant.

```
ggplot(mout2, aes(x = exp(estimate), fill = Signif)) +
geom_density(alpha = 0.5) +
theme_bw() +
geom_vline(xintercept = 2) +
xlab("Estimated times difference") +
xlim(0,5)
```

You can clearly see the significant trials almost always overestimate the true difference (vertical line).

What’s the solution? Make sure you report on non-significant results. And try to aim for larger sample sizes.

Designed by Chris Brown. Source on Github