Notes on Mayo's Notion of Severity

Soshichi Uchii, Kyoto University


Deborah Mayo propounded an epistemology of experiment in her Error and the Growth of Experimental Knowledge (1996), and the notion of severity plays an essential role in her epistemology. In the following two notes, I wish to point out a defect of her definition of severity, and to argue that she must revise this definition in conformity with what she actually does in her book (Note 1). The revision has an important consequence: in order to apply Mayo's severity consideration to experimental tests, we have to know all the alternative hypotheses in a given experimental situation in advance. Mayo does not seem to recognize this, and her analysis of Perrin's experiment seems to be affected by this defect. I will present what I regard as the correct way to reconstruct Perrin's argument (Note 2).

1. Mayo's definition of severity

Mayo's error statistical approach to the epistemology of experiment, in her Error and the Growth of Experimental Knowledge (1996), crucially depends on her definition of severity, but the reader may be rather puzzled by the manner in which she introduces this definition. The definition of severity appears rather abruptly (in chapter 6), with no explicit reference to the arguments or preparations in the preceding five chapters. What she says immediately before she introduces the definition of severity is essentially this much:

The cornerstone of an experiment is to do something to make the data say something beyond what they would say if one passively came across them. The goal of this active intervention is to ensure that, with high probability, erroneous attributions of experimental results are avoided. The error of concern in passing H is that one will do so while H is not true. Passing a severe test, in the sense I have been advocating, counts for hypothesis h because it corresponds to having good reasons for ruling out specific versions and degrees of this mistake. (Mayo 1996,178)

Then comes her first statement of a severe test; that is,

(S): a passing result is a severe test of hypothesis H just to the extent that it is very improbable for such a passing result to occur, were H false. (Mayo 1996, 178)

But our immediate response is: how is this notion of severity related to the statistical tests explained so far? We know what "significance level" is, or what "experimental distribution" is (see pp. 158-9), but these are essentially defined in terms of the probability of an outcome given that hypothesis H is true, not given that H is false! And so far, Mayo has not given any hint as to how we should compute probabilities given that hypothesis H is false (her discussion of type II error--accepting the null hypothesis when it is in fact false--comes long after, on p. 367). Thus a strong uneasiness is produced in the reader's mind; Mayo does not begin to discuss this until page 195, and the discussion there is frustrating, as we will see shortly.

In order to make her arguments more intelligible (and, consequently, more susceptible to criticism), Mayo should have made use of examples from her repertoire immediately. Later (192 ff.), she indeed uses the Binomial Experiment (a lady judging whether the tea or the milk was first added to the cup, who claims she can do better than chance) for illustration, but the reader is told a somewhat different story from what he or she expected. In this example we have two hypotheses, H0 and H' (p is the probability of success):

H0: the lady is guessing, p = 0.5

H': the lady does better than guessing, p > 0.5

And given the null hypothesis H0, we can obtain the experimental distribution for 100 trials. All right then, let us take H' as our test (null) hypothesis; Mayo's definition of severity can then be easily illustrated. Let f signify the observed relative frequency of "success" (i.e., the lady's judgment is correct) in 100 trials; then we already know that

P(f ≥ 0.6 | H0) = 0.03.

Since in this context not-H' = H0, we have an indication of the severity of the test for our hypothesis H', according to Mayo's definition. Suppose we obtained the result that f ≥ 0.6, which may be abbreviated as e. Then the test of H' by means of this result e is as severe as 97 percent.
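To make the numbers concrete, the 0.03 figure, and hence the 97 percent severity for H', can be checked directly from the binomial tail. The following sketch is my own illustration, not Mayo's code; the function name tail_prob is mine:

```python
from math import comb

def tail_prob(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p): the chance of at least
    k_min successes in n independent trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# P(f >= 0.6 | H0): at least 60 correct judgments in 100 trials at p = 0.5
p_pass = tail_prob(100, 0.5, 60)   # approximately 0.028, Mayo's "0.03"

# Severity of the test of H' by the result e (f >= 0.6), on definition (S):
severity = 1 - p_pass              # approximately 0.97
```

The exact tail probability is about 0.0284, which rounds to the 0.03 cited in the text.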

However, it must be noticed that if we take H0 as our test hypothesis, the situation is not as easy as this, since we have to be able to calculate the probability of some result conditional on the falsity of H0. For instance,

P(f ≥ 0.6 | not-H0) = P(f ≥ 0.6 | H') .

Since H' does not specify the value of p, how can we obtain this probability? Notice that H' in this case becomes quite similar to the "Bayesian catchall," as Mayo puts it (a disjunction of an indefinite number of hypotheses). Literally, the negation of H0 is nothing but a disjunction of all hypotheses assigning to p any value between 0 and 1 except for 0.5 (which amounts to saying, informally, that the lady does better or worse than mere guessing); so how should Mayo obtain the probability of e on such a disjunction? I do not see any difference between the Bayesian difficulty and Mayo's difficulty in this regard. Suppose (although I am sure Mayo does not like this supposition), imitating Laplace (the principle of indifference), we assume that any value for p is equally probable; then it should be the case that

P(f ≥ 0.6 | not-H0) = P(f ≥ 0.6 | H') = P(f ≥ 0.6 | H0) = 0.03.

This means that the test by e is as severe for H0 as for not-H0. This seems quite disastrous for Mayo's definition, for the same result e counts as good evidence for both H0 and not-H0! Notice that this suggests a stronger worry than Earman's worry, treated and answered by Mayo in 6.3. Earman doubted whether we can obtain a low probability in case the test hypothesis is false, and he presented a case in terms of higher-level alternatives. But my example suggests a far stronger worry: the null hypothesis and its rival may give the same probability to the same evidence, the two hypotheses being low-level alternatives to each other! If she wants to stick to her definition (S), she has to show that, on her account of error statistics, this sort of counterexample never appears.

Beginners may have some difficulty in understanding the preceding result, so let me give a simpler version (a classical example from the history of probability theory), in terms of finitely many alternative hypotheses. Let H0 be the same as before, and suppose there are four other alternatives (given the background information, say), with values of p placed symmetrically around 0.5--for instance:

H1: p = 0.3

H2: p = 0.4

H3: p = 0.6

H4: p = 0.7

On our assumption, not-H0 is equivalent to the disjunction of these four. Then, given that each alternative hypothesis is equally probable (according to the Laplacean principle of indifference), the probability of any result e on H0 is the same as the probability of e on not-H0 (i.e., on the disjunction of the four). It suffices to show this for

e = the lady's judgment is correct.

Then, clearly, the probability of e on each alternative Hi is just the value that Hi assigns to p.

Since each hypothesis is equiprobable (an experimental situation where this holds, even for the frequentist, will be shown soon), the probability of e on the disjunction of the four is clearly the mean of these predictions, i.e., P(e | not-H0) = 0.50, which is exactly the same as P(e | H0). Given this, it is easy to see that the same result holds for any evidence statement e. And the case where there are infinitely many hypotheses (as regards the value of p) is not, in principle, any different from this simpler version.
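The arithmetic of this simpler version is easy to verify. In the sketch below (my own illustration; the particular values 0.3, 0.4, 0.6, 0.7 are just one symmetric choice of four alternatives), e is a single correct judgment, so the probability of e on each alternative is simply the value of p that it assigns:

```python
p_null = 0.5                          # H0: the lady is merely guessing
alternatives = [0.3, 0.4, 0.6, 0.7]   # four alternatives, symmetric around 0.5

# e = a single correct judgment, so P(e | H0) is just the null value of p:
p_e_given_H0 = p_null

# Each alternative is equally probable, so P(e | not-H0) is the mean prediction:
p_e_given_not_H0 = sum(alternatives) / len(alternatives)

# Both come out at 0.5: e is no better evidence for H0 than for not-H0.
```

Any set of alternatives placed symmetrically around 0.5 gives the same mean, which is the point made in the following paragraph.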

Moreover, this example is not as unrealistic as it may seem at first sight. For, what is crucial is that the alternative hypotheses are distributed symmetrically around the null hypothesis; we can find many such cases, when we wish to ascertain the correct value of a parameter.

Finally, it may be pointed out that Mayo's defence of the error statistical approach against this sort of counterexample, in terms of the piecemeal character of experimental learning, does not help much. She says,

Within an experimental testing model, the falsity of a primary hypothesis H takes on a specific meaning. If H states that a parameter is greater than some value c, not-H states that it is less than c; if H states that factor x is responsible for at least p percent of an effect, not-H states that it is responsible for less than p percent; if H states that an effect is caused by factor f, for example, neutral currents, not-H may say that it is caused by some other factor possibly operative in the experimental context ...; if H states that the effect is systematic--of the sort brought about more often than by chance--then not-H states that it is due to chance. How specific the question is depends upon what is required to ensure a good chance of learning something of interest ... (190-1)

I agree with this; and most other (empiricist) Bayesians will join me. But this does not help in the least to solve the difficulty posed by my counterexample. Thus, as it turns out, all Mayo suggests later (on p. 195) is that:

(1) the probability of an outcome conditional on a disjunction of alternatives is not generally a legitimate quantity for a frequentist; and

(2) the severity criterion (SC) requires that the severity be high against each single alternative (not against a disjunction of them).

(1) amounts to saying that her definition of severity is not appropriate for many of her canonical models of experimental inquiry, and (2) is nothing but a substantial revision of her definition. If (2) is what she really wants (and indeed this seems to be the case, judging from her subsequent discussion, long after, in chapter 11, p. 397), she should have changed the definition in the first place. In short, it seems to me that she has chosen the wrong way to state her crucial definition, and this may easily give the impression that she simply wishes to evade the whole question by the manoeuvre of (1) and (2). It looks quite strange to demand that one not ask the severity of the test for H0 (though it is all right for H') when one is trying to test H0 against H' (not-H0), in one of her canonical models.
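What the relativized reading (2) amounts to can be made concrete in the Binomial Experiment. In the sketch below (my own illustration; the cutoff f ≥ 0.6 and the sample alternative values of p are assumptions, not Mayo's numbers), H0 passes when f < 0.6, and the severity of that passing result is computed separately against each single alternative:

```python
from math import comb

def tail_prob(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

def severity_against(p_alt: float, n: int = 100, cutoff: int = 60) -> float:
    """Severity of H0's passing result (fewer than `cutoff` successes in n
    trials) against the single alternative p = p_alt: the probability that
    such a passing result would NOT have occurred were the alternative true."""
    return tail_prob(n, p_alt, cutoff)

# Severity differs from alternative to alternative: against p = 0.6 it is
# modest (~0.54), against p = 0.7 high (~0.99), against p = 0.8 near 1.0 --
# so no single number attaches to the disjunction of the alternatives.
sev = {p_alt: severity_against(p_alt) for p_alt in (0.6, 0.7, 0.8)}
```

This is why, on reading (2), the severity assessment must be carried out alternative by alternative, and why all the alternatives must be in view.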

Moreover, we can easily construct a test situation in which it is legitimate for the frequentist to obtain P(e | not-H0). Suppose there is a small number of coins (say, five), biased or unbiased (as our five hypotheses say, respectively), and one is chosen at random; you are asked to determine by experiment which coin was in fact chosen. In this case it is legitimate, even for the frequentist, to ask for the probability P(e | not-H0), and our counterexample is fully alive. And our intuition is that, if the observed frequency of heads is close to 0.5 and the number of trials is large enough, this is good evidence for H0 but not for not-H0. But, unfortunately, Mayo's definition of severity does not work for this case.
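The two-stage chance setup is what makes P(e | not-H0) frequentistically respectable, and it can be simulated directly. In the sketch below (my own illustration; the biases 0.3, 0.4, 0.6, 0.7 are assumed values for the four unfair coins, chosen symmetrically around the fair one), e is "heads on a single toss," and we estimate its long-run frequency given that a biased coin was chosen:

```python
import random

random.seed(42)                  # fixed seed, so the sketch is reproducible

biased = [0.3, 0.4, 0.6, 0.7]    # assumed biases of the four unfair coins

def freq_heads_given_not_H0(trials: int) -> float:
    """Long-run relative frequency of heads on one toss, given not-H0,
    i.e., given that one of the biased coins was chosen at random."""
    heads = 0
    for _ in range(trials):
        p = random.choice(biased)      # stage 1: pick a biased coin at random
        heads += random.random() < p   # stage 2: toss that coin once
    return heads / trials

f = freq_heads_given_not_H0(200_000)
# f comes out very close to 0.5, the same long-run frequency the fair coin
# (H0) gives -- so the counterexample survives the frequency interpretation.
```

Here the probability conditional on not-H0 is itself a perfectly ordinary relative frequency, generated by the physical randomization of the coin choice.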

Thus she has to abandon her official definition of severity, (S) above, and reformulate it along the lines of (2); but notice that in this case all the alternative hypotheses must be known in advance. This point becomes relevant to my next point.

2. How should we reconstruct Perrin's argument?

Since it is a great virtue of Mayo's book that she applied her epistemology to a number of specific examples from the history of science, it would be a bit unfair to ignore such concrete applications of severity considerations. So let us examine her analysis of Perrin's experiment (pp. 217-242), one of the highlights of Mayo's book. Perrin conducted a series of careful experiments on Brownian motion, and he greatly contributed to establishing the kinetic theory (in the Einstein-Smoluchowski version) on an experimental basis. Mayo divides Perrin's experimental inquiry into two steps: Step 1 consists of checking, for each experiment E, whether the results of the experiment actually performed follow the given statistical distribution; Step 2 involves using estimates of the coefficient of diffusion, which are crucial for obtaining Avogadro's number.

Having examined Perrin's original exposition in his Atoms (Perrin 1990), I came to the conclusion that Mayo's reconstruction of Step 1 (establishing the complete irregularity of the Brownian motion) does not follow the order of the original, and hence is misleading in some crucial respects. So let me present my own rendering in the following.

Mayo rightly points out:

Only by keeping in mind that a great many causal factors were ruled out experimentally before Perrin's tests (around 1910) can his experiments be properly understood. (Mayo 1996, 218)

And as I understand Perrin's argument, Perrin was almost convinced, before his experiments, that the Brownian motion is not caused by any external influence, such as air currents, because there was already an abundance of evidence for this conviction. Thus Perrin writes:

But in this case neighbouring particles move in approximately the same direction as the air currents and roughly indicate the conformation of the latter. The Brownian movement, on the other hand, cannot be watched for any length of time without it becoming apparent that the movements of any two particles are completely independent, even when they approach one another to within a distance less than their diameter (Brown, Wiener, Gouy). (Perrin 1990, 84)

The agitation cannot, moreover, be due to vibration of the object glass carrying the drop under observation, for such vibration, when produced expressly, produces general currents which can be recognized without hesitation and which can be seen superimposed upon the irregular agitation of the grains. (Perrin, 84)

In fact--and this is perhaps its strangest and most truly novel feature--the Brownian movement never ceases. Inside a small closed cell (so that evaporation may be avoided) it may be observed over periods of days, months, and years. It is seen in the liquid inclusions that have remained shut up in quartz for thousands of years. It is eternal and spontaneous. (Perrin, 85)

All these characteristics force us to conclude, with Wiener (1863), that "the agitation does not originate either in the particles themselves or in any cause external to the liquid, but must be attributed to internal movements, characteristic of the fluid state", movements which the grains follow more faithfully the smaller they are. We are thus brought face to face with an essential property of what is called a fluid in equilibrium; its apparent repose is merely an illusion due to the imperfection of our senses and corresponds in reality to a permanent condition of unco-ordinated agitation. (Perrin, 85-86)

However, despite this conviction and its general conformity to what the kinetic theory says, the kinetic theory "is nevertheless a hypothesis only" (88). That is why Perrin attempted to subject the question to a definite experimental test. But for this purpose he had to extend the gas laws to dilute emulsions, and to particles larger than molecules in such emulsions (89-94). Having prepared suitable materials (such as gamboge) for this inquiry, he obtained the affirmative answer, so that he could then go on to a quantitative inquiry into the Brownian motion. Then comes a series of experiments in which Perrin subjected the Einstein-Smoluchowski theory to test.

I have no objection to Mayo's reconstruction in terms of Step 1 and Step 2; but I do object to Mayo's rendering of the alternative hypotheses at Step 1. According to Mayo, Perrin was to decide between the following two hypotheses (Mayo 1996, 223; j is the null hypothesis):

j: The data from E approximates a random sample from the hypothesized Normal process M.

j': The sample displacements of data from E are characteristic of systematic (nonchance) effects.

And, for the sake of fairness, let us see what Mayo says on these hypotheses:

So ruling out hypothesis j' was the centerpiece of Perrin's work. Asking about j' came down to asking whether factors outside the liquid medium might be responsible for the observed motion of Brownian particles. The general argument in ruling out possible external factors--even without being able to list them all--was this: if Brownian motion were the effect of such a factor, then neighboring particles would be expected to move in approximately the same direction. In fact, however, a particle's movement was found to be independent of that of its neighbors. To sustain this argument, Perrin called up experimental knowledge gleaned from several canonical cases of ("real") chance phenomena. (Mayo 1996, 224; bold letters mine.)

Aside from other complaints, the contrast of key words in j and j' seems quite misleading in view of the context of Perrin's experiments: random sample from the Normal process versus systematic effects. Since Mayo puts two key words (random and Normal) into the null hypothesis, we have to consider four combinations:

  1. random sample from the Normal process,
  2. random sample from a non-Normal process,
  3. non-random sample from the Normal process, and
  4. non-random sample from a non-Normal process.

But notice that Mayo's interpretation is that Perrin was going to reject j' in order to sustain his conviction (bold letters in the previous quotation) that there are no coordinated movements (due to external factors) among neighboring particles. This interpretation underlies the formulation of j', and it causes a lot of trouble. For there may be many candidates for the cause of systematic effects, and we cannot know the range of such candidates in advance. As we concluded in the preceding Note 1, it is essential for Mayo's requirement of severity that we know all the alternative hypotheses in advance; otherwise she (as a frequentist) cannot talk about the severity of a test. Thus the problem with j' is not merely verbal but crucial for Mayo's severity condition.

My interpretation is different: given Perrin's conviction (supported by his predecessors' results) that no external influence is conceivable, he does not have to consider any systematic effect due to external influence; or even if he still takes this possibility into consideration, he can exclude it by separate experiments (indeed, a collaborator was checking the influence of temperature, for instance; see Perrin, 104, 122). So the remaining possibilities are deviations from irregularity due either to (a) bias of sampling or to (b) a non-Normal process (the distribution in the whole population). But (a) can easily be avoided by the manner of the experiment, as Perrin's remark in another context shows (Mayo also quotes this):

In order not to be tempted to choose grains which happened to be slightly more visible than the rest (those, that is to say, which were slightly above the average size), which would raise the value of N a little, I followed the first grain that showed itself in the centre of the field of vision. (Perrin, 124)

That is, by following the rule of random sampling, you can exclude the possibility of a non-random sample, i.e., (a). The manner in which Perrin paid attention to such considerations is also well illustrated by the following incident (also quoted by Mayo), which occurred while he was trying to determine Avogadro's number:

In this way I obtained the value 69. A source of error has, however, been pointed out to me by M. Constantin. This young physicist noticed, during the course of some measurements on some preparations only a few microns thick, that the proximity of a boundary checked the Brownian movement. ... Working at a sufficient distance from the walls with the grains that I had used, he obtained the value [N = 64]; unfortunately the number of observations (about 100) was too small. These measurements will be repeated. (Perrin 1990, 124)

Thus, because of this practice of random sampling and the avoidance of systematic deviations, only possibility (b) remains (that is, combinations 3 and 4 disappear from the preceding list). Then it is clear that Mayo's formulation of j' is misleading (notice that it contains the possibilities of non-statistical hypotheses, including those unknown to Perrin!), to say the least. It should be replaced by:

j': The data from E approximates a random sample from a non-Normal process.

Here, "non-Normal" means a substantive deviation from a Normal distribution. Thus only in this form (I would venture to say) can Mayo claim, on an empirical basis, that "It is very probable that a few [experiments] would have shown differences statistically significant from what is expected under j" (Mayo 1996, 231), because the falsity of j now corresponds to a non-Normal distribution; we may assume that Perrin was considering only statistical hypotheses, comparable to j, in his experiments. This is the crux Mayo should have stated clearly, but she failed to do so. Finally, although this looks to be the most promising course for Mayo to adopt, the defect of her official definition of severity (S) is not alleviated in the least; she still has to use the revised version (relativized to a single alternative).
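On this reading, Step 1 becomes an ordinary goodness-of-fit problem: Normal against substantively non-Normal. The following schematic sketch is entirely my own illustration, not Perrin's or Mayo's procedure; the quartile binning, the sample size, and the uniform process standing in for a "substantive deviation" are all assumptions made for the example:

```python
import random

random.seed(7)  # fixed seed, so the sketch is reproducible

def chi_square_stat(displacements, sigma):
    """Chi-square goodness-of-fit statistic against Normal(0, sigma),
    using the four quartile bins of the Normal (so df = 3).
    The quartile cut points of Normal(0, sigma) are about +/- 0.6745 * sigma."""
    c = 0.6745 * sigma
    bins = [0, 0, 0, 0]
    for x in displacements:
        if x < -c:
            bins[0] += 1
        elif x < 0:
            bins[1] += 1
        elif x < c:
            bins[2] += 1
        else:
            bins[3] += 1
    expected = len(displacements) / 4   # a Normal sample puts ~25% in each bin
    return sum((o - expected) ** 2 / expected for o in bins)

n, sigma = 1000, 1.0
normal_data = [random.gauss(0, sigma) for _ in range(n)]          # process j
uniform_data = [random.uniform(-3 * sigma, 3 * sigma) for _ in range(n)]  # a non-Normal process

chi_normal = chi_square_stat(normal_data, sigma)
chi_uniform = chi_square_stat(uniform_data, sigma)
# The 5% critical value for df = 3 is about 7.81: genuinely Normal data
# rarely exceed it, while a substantively non-Normal process exceeds it
# by a wide margin.
```

The point of the sketch is only that, once both hypotheses are statistical, a significance test can discriminate between them; which particular test Perrin's data call for is a further question.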


Mayo, Deborah (1996) Error and the Growth of Experimental Knowledge, The University of Chicago Press.

Perrin, J. (1990) Atoms, Ox Bow Press (reprint).

The preceding notes come from my seminar on Error Statistics at Kyoto University; for more of my comments on Mayo, see Error Statistics.

Last modified July 3, 2001. (c) Soshichi Uchii