
Large n makes any difference significant

Stopping early can be deceptive

Choosing a reasonable sample size

  • Even the tiniest effect can be statistically significant with a large enough sample size.

  • This is why it is important to decide on the smallest effect you care about before you start collecting data, and use this choice to determine your sample size.

  • This is called a power analysis (see the review of power slide deck; a code sketch follows this list).
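
A minimal sketch of such a power analysis in Python, using statsmodels. The numbers are assumptions for illustration: an effect size of 0.2 (a "small" effect in Cohen's d terms) stands in for the smallest effect you care about, with the conventional \(\alpha = 0.05\) and a target power of 0.8.

```python
# Power-analysis sketch: given the smallest effect size we care about
# (Cohen's d), a significance level, and a target power, solve for the
# per-group sample size of a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.2,  # smallest effect of interest (assumed for illustration)
    alpha=0.05,       # Type I error rate
    power=0.8,        # desired probability of detecting the effect
)
print(f"Required sample size per group: {n_per_group:.0f}")  # about 394
```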

Multiple Comparisons

  • Each NHST carries a probability \(\alpha\) of a false positive (Type I error).

  • If you perform many of these tests, the chance of a false positive increases.

  • With \(\alpha = 0.05\), after just 3 tests the probability of at least one false positive is \(1 - 0.95^3 \approx 14.3\%\).

Familywise error rate (FWER)

  1. Single Test Type I Error Probability: \(\alpha\)
  2. Probability of No Type I Error in a Single Test: \(1 - \alpha\)
  3. Probability of No Type I Errors in \(n\) Independent Tests: \((1 - \alpha)^n\)
  4. Probability of At Least One Type I Error in \(n\) Independent Tests: \(1 - (1 - \alpha)^n\)

FWER example
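
A minimal numeric sketch of formula 4 above, assuming the conventional \(\alpha = 0.05\):

```python
# Familywise error rate for n independent tests at level alpha:
#   FWER = 1 - (1 - alpha)^n
alpha = 0.05
for n in (1, 3, 10, 20):
    fwer = 1 - (1 - alpha) ** n
    print(f"{n:2d} test(s): P(at least one false positive) = {fwer:.1%}")

# Output:
#  1 test(s): P(at least one false positive) = 5.0%
#  3 test(s): P(at least one false positive) = 14.3%  (the figure quoted earlier)
# 10 test(s): P(at least one false positive) = 40.1%
# 20 test(s): P(at least one false positive) = 64.2%
```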

Famous Multiple Comparisons Example

Correcting for multiple comparisons

  • There are many methods that correct for an inflated FWER (the Bonferroni correction is the best known), but all are out of scope for this unit.

  • The main objective of including FWER here is simply to build a conceptual understanding of the problem.

P-hacking flavours

  • Multiple Comparisons / Testing: Conducting many statistical tests on the same data set and only reporting those results that are statistically significant. The more tests performed, the higher the chance of finding at least one statistically significant result purely by chance (see the first simulation sketch after this list).

  • Selective Reporting: Only reporting outcomes or analyses that show significant results while ignoring others that don’t. This can also involve selectively reporting subgroups or conditions that yield favorable outcomes.

  • Data Dredging: Searching through large amounts of data for patterns without a prior hypothesis. This often involves testing numerous variables against each other until something statistically significant is found.

  • Fishing for P-Values: Continuously adding or removing data or trying different statistical methods until a significant p-value (usually less than 0.05) is obtained.

  • Stopping Data Collection Early: Halting the data collection process as soon as significant results are observed, instead of collecting the amount of data originally planned. This can inflate the apparent significance of the results (see the second simulation sketch after this list).

  • Outcome Switching: Changing the primary outcome of a study after looking at the data. For example, if the original outcome did not yield significant results, a different outcome that does is reported as the main result.

  • Cherry-Picking Data: Selectively choosing specific subsets of data that support a particular conclusion, while ignoring other data that might contradict it.

  • Manipulating Analysis Methods: Altering the analytical methods, such as changing the inclusion criteria for a study, switching between parametric and non-parametric tests, or transforming data until significant results are produced.

  • Overfitting Models: Creating complex models that fit the sample data very closely but do not generalize well to other data sets. This can create seemingly significant results that actually have little to no predictive power.
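
To make the first flavour concrete, here is a small simulation sketch (the choices of 20 tests, 30 observations per group, and 1,000 repetitions are arbitrary): every outcome is pure noise, yet testing many outcomes and keeping only the significant ones reliably produces "findings".

```python
# Simulate multiple comparisons on pure noise: run many t-tests on data
# with no real effect and count how often at least one comes out
# "significant". Any effect found this way is a false positive.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha, n_tests, n_sims, n = 0.05, 20, 1000, 30

hits = 0
for _ in range(n_sims):
    # Two groups drawn from the SAME distribution, compared on n_tests
    # unrelated "outcome variables" -- the null is true for every test.
    p_values = [
        ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(n_tests)
    ]
    hits += min(p_values) < alpha

print(f"P(at least one 'significant' result): {hits / n_sims:.1%}")
# Close to the analytic FWER: 1 - 0.95**20 = 64.2%
```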
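
And a similar sketch for stopping data collection early (again with arbitrary choices: a one-sample t-test against zero, peeking after every batch of 10 observations up to 100): the null is true throughout, yet stopping at the first significant peek pushes the false positive rate well above the nominal 5%.

```python
# Simulate optional stopping: keep collecting data, test after every
# batch, and stop as soon as p < 0.05. A single fixed-n test would be
# "significant" only 5% of the time under the null; peeking inflates this.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha, batch, max_n, n_sims = 0.05, 10, 100, 1000

early_hits = 0
for _ in range(n_sims):
    data = []
    for _ in range(max_n // batch):
        data.extend(rng.normal(size=batch))  # null: true mean is 0
        if ttest_1samp(data, popmean=0).pvalue < alpha:
            early_hits += 1  # "significant" -- stop and report
            break

print(f"False positive rate with peeking: {early_hits / n_sims:.1%}")
# Well above the nominal 5%
```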

What do we do with all of this?

  • We don’t have time to cover each of these in detail, and frankly not much of your performance in this unit hangs on your ability to understand these concepts, but some of it does.

  • The key is that to understand these concepts and cope with them intelligently, you need to understand the fundamentals of statistical inference and coding.

  • That is what this unit is really about.

Calling bullshit