IA data processing and p-hacking
Saturday 1 May 2021
I have recently learned the term p-hacking: the incorrect use of significance testing. It leads to studies making striking claims which go against generally accepted ideas. We have all read headlines claiming that if we just eat a special type of food regularly it will have a miraculous effect on our health. Are these claims the result of p-hacking?
Scientists are accused of lowering scientific standards by p-hacking. Cherry picking, that is, excluding inconclusive data sets and reporting only those showing significance, can make a partly inconclusive study look innovative and groundbreaking. Searching the data for correlations can be another form of p-hacking. If the hypothesis is written first and the data tested afterwards, the statistics for testing significance and rejecting the null hypothesis are reliable. Reversing this method, looking for a correlation before writing the hypothesis, can lead to incorrect conclusions.
There is positive feedback too. The more surprising the findings of a study, the more attention the research gets in the media. News headlines attract clicks and views, which is good for publishers. Researchers, for their part, desire conclusive results after years of arduous data collection; conclusive findings help secure the next round of funding, so the stakes are high.
DP Biology students sometimes inadvertently indulge in p-hacking. Desperate to find something significant in an IA study, they might delete anomalous data, collect extra data to lower their p-value, or even test a completely new hypothesis if it looks more significant. We should be able to teach DP Biology students how to recognise p-hacking; it is an important skill in critical thinking. More importantly, there are no marks for statistical significance in the IA, only for correct data processing and methodology. A conclusion correctly stating that the data are inconclusive scores more highly than one which has errors in the statistics.
In the last five minutes of this short video, Zedstatistics explains why p-hacking is a problem in scientific research. It's problematic in IA investigations too, especially those which take data from a database.
Imagine an IB Diploma student who chooses to study the effect of light intensity on the morphology of oak leaves in their IA. Do oak leaves have a different shape in brighter light?
The student is hard working and measures many different leaf characteristics in the hope of finding the one which is most affected by the light. These dependent variables include: mass, length, width, surface area, leaf thickness, colour of leaves, etc. In the data processing a statistical test is carried out separately on each variable to see if the null hypothesis should be rejected (H0: light intensity has no effect on this feature). The student finds that leaf thickness correlates with light intensity, then writes the research question, "What is the effect of light intensity on the thickness of Q. robur leaves?" and only includes the data for leaf thickness in the report. This is a classic case of p-hacking. Why is it wrong?
Using a standard test of significance, if p < 0.05 we reject the null hypothesis. It is reasonable to say that the null hypothesis is unlikely to be true if we collect results this different from those we would expect if the null hypothesis were valid. At the same time we should understand that, on average, one in twenty tests will give p < 0.05 even when light intensity has no effect on the feature. This is, by definition, the probability level we are using as a test. When p = 0.05, the probability of getting results at least this extreme when the null hypothesis is true is five percent. So if you test twenty different features of leaves, the chance that at least one shows a 'significant' difference, even when none of them is actually affected by light intensity, is 1 − 0.95^20, or about 64%. This is p-hacking.
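This multiple-testing problem can be checked with a short sketch in Python (an illustrative simulation, assuming twenty independent tests on features with no real effect; the numbers are mine, not the student's data). Under the null hypothesis a p-value is equally likely to land anywhere between 0 and 1, so the chance that at least one of twenty tests falls below 0.05 is 1 − 0.95^20:

```python
import random

alpha = 0.05
n_tests = 20  # e.g. twenty leaf features, none truly affected by light

# Analytic familywise error rate: P(at least one false positive)
familywise = 1 - (1 - alpha) ** n_tests
print(round(familywise, 2))  # 0.64

# Monte Carlo check: under the null hypothesis, p-values are
# uniformly distributed on [0, 1]
random.seed(1)
trials = 100_000
hits = sum(
    any(random.random() < alpha for _ in range(n_tests))
    for _ in range(trials)
)
print(round(hits / trials, 2))  # close to 0.64
```

So roughly two studies in three would "discover" an effect somewhere, purely by chance.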
To avoid p-hacking:
- Write the hypothesis before analysing the data.
- Avoid doing multiple tests of significance in a study, e.g. repeated t-tests.
- Include all the results in the study; don't be tempted to exclude results which look wrong.
There is a simple correction, the Bonferroni correction, which avoids p-hacking when there is no alternative to multiple t-tests. Count how many t-tests are done in the analysis and divide the 'alpha', the p < 0.05 significance level, by the number of t-tests. If there are ten tests, then the p-value needed to show a significant difference between the observed data and the null hypothesis would be p < (0.05/10), i.e. p < 0.005.
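The correction itself is just a division; each test's p-value is then compared against the smaller threshold. A minimal sketch in Python (the function name and the example p-values are my own, for illustration only):

```python
def bonferroni_alpha(alpha, n_tests):
    """Significance threshold for each individual test."""
    return alpha / n_tests

# Ten t-tests at an overall alpha of 0.05
threshold = bonferroni_alpha(0.05, 10)
print(threshold)  # 0.005

# Each test now rejects its null hypothesis only if p < 0.005
p_values = [0.04, 0.20, 0.003]  # illustrative p-values from ten tests
significant = [p for p in p_values if p < threshold]
print(significant)  # [0.003]
```

Note that p = 0.04, which would pass a single unadjusted test, no longer counts as significant once the correction is applied.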
If you want to know more, I recommend this article by Geoff Cumming in The Conversation, "Why so many science studies are wrong." If you prefer something more mathematical, read this clear explanation of significance testing and how to avoid the familywise error, from Charles Zaiontz. It includes the Bonferroni correction to a p-value that can avoid p-hacking (the familywise error). Zaiontz also mentions the specialised statistics to use in place of multiple t-tests: ANOVA, or the Mann-Whitney and Wilcoxon tests, are better suited to this type of data.