Souto-Maior, Caetano (2022) Extraordinarily corrupt or statistically commonplace? Reproducibility crises may stem from a lack of understanding of outcome probabilities.
Abstract
Failure to consistently reproduce experimental results, i.e. failure to reliably identify or quantify an effect (often dubbed a ‘reproducibility crisis’ when it affects a large number of studies in a given field), has become a serious concern in many communities and is widely believed to be caused by (i) a lack of systematic methodological description, poor experimental practice, or outright fraud. On the other hand, it is common knowledge in scientific practice that (ii) replicate experiments, even when performed in the same lab by the same experimenter, rarely show complete quantitative agreement with one another. The widely believed explanation (i) and the commonplace observation (ii) are not mutually exclusive, but they are incompatible as justifications for irreproducibility: invoking the former implies an anomaly, a crisis, while the latter is statistically expected and therefore amenable to quantification.
Interpreting two or more studies as conflicting often reduces to a mechanistic view in which a ground truth exists that must be observed in every properly performed experiment; a slightly less naive view (at best) is a frequentist one, in which statistical tests must confidently identify a true effect (i.e. a single parameter value) as significant almost always (i.e. some arbitrary proportion, say 95%, of the time). A broader view, however, considers that the effect can only be observed as a probability distribution; individual experiments are therefore expected to differ not only through sampling and the power to identify a significant effect, but through variation in the parameter value itself. That is, it is accepted that some sources of variation cannot be controlled with infinite precision, for instance in the environment or from the experimenter, and it is acknowledged that unknown, uncontrolled factors may introduce biases. Quantitatively, this perspective is consistent with a Bayesian hierarchical formulation, where the effect (commonly called group-level) parameters sit under a hyperprior and above the individual-experiment parameters.
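A minimal simulation sketch of the hierarchical structure described above may help fix ideas. All numerical values (mu, tau, sigma, the numbers of experiments and observations) are illustrative assumptions, not quantities from the paper: each experiment draws its own effect from a group-level distribution before producing noisy measurements, so replicate experiments disagree even when every one of them is well performed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed, illustrative values (not from the paper):
mu, tau = 1.0, 0.3      # group-level effect and between-experiment spread
sigma = 0.5             # within-experiment measurement noise
n_experiments, n_obs = 8, 20

# Hierarchy: each experiment i draws its own effect theta_i from the
# group-level distribution, then produces n_obs noisy measurements.
theta = rng.normal(mu, tau, size=n_experiments)
data = rng.normal(theta[:, None], sigma, size=(n_experiments, n_obs))

# Per-experiment point estimates differ even though every experiment is
# "well performed": variation enters at the parameter level, not only
# through sampling noise.
estimates = data.mean(axis=1)
for i, est in enumerate(estimates):
    print(f"experiment {i}: estimated effect = {est:.2f}")
print(f"spread of estimates: {estimates.std(ddof=1):.2f} "
      f"(cf. tau = {tau}, sampling SE = {sigma / np.sqrt(n_obs):.2f})")
```

Note that the spread of the per-experiment estimates exceeds what sampling error alone would predict; under this view that excess is expected, not evidence of anomaly.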
Put another way, the Bayesian hierarchical view reconciles seemingly discordant results by interpreting each experiment as itself a sample from a (group- or system-level) distribution, which in turn sets the range and probability of expected outcomes for new individual experiments. As a corollary, a large number of replicates increases confidence not only in the expected value but also in the deviation around it. Thus, “validating” an experiment does not mean obtaining the same number every time, but establishing the range and likelihood of outcomes from well-performed experiments. Conversely, once an experiment has been extensively replicated, the effect distribution is informative of how much each repetition deviates from expectation: whether it is truly extreme, and potentially reflects anomalies or misconduct, or is probabilistically unsurprising. This formulation has profound consequences for assessments of, and claims about, reproducibility.
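The corollary above also admits a simple sketch: once many replicates exist, a new result can be scored against the empirical effect distribution rather than against any single previous number. The replicate values below and the assumption that effects are normally distributed are both hypothetical, chosen only to illustrate the logic.

```python
import numpy as np
from scipy import stats

# Hypothetical effect estimates from many replicates of one experiment
# (illustrative numbers, not from the paper).
replicates = np.array([0.9, 1.2, 1.1, 0.7, 1.3, 1.0, 0.8, 1.1, 1.2, 0.9])

# Empirical group-level distribution of the effect across replicates.
mu_hat = replicates.mean()
tau_hat = replicates.std(ddof=1)

def surprise(new_estimate):
    """Two-sided tail probability of a new result under the (assumed
    normal) effect distribution: small values flag genuinely extreme
    results; moderate values mean the discrepancy is expected."""
    z = (new_estimate - mu_hat) / tau_hat
    return 2 * stats.norm.sf(abs(z))

for x in (1.15, 0.2):
    print(f"new estimate {x}: tail probability = {surprise(x):.3f}")
# Here 1.15 is probabilistically unsurprising, while 0.2 sits far in the
# tail and may warrant scrutiny for anomalies or misconduct.
```

In this reading, a single discrepant replicate is neither vindicated nor condemned by itself; its standing depends on where it falls in the distribution established by the body of replicates.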