Fusion Confusion? Comments on Nancy Reid: "BFF Four: Are We Converging?" Deborah G. Mayo. The Fourth Bayesian, Fiducial and Frequentist Workshop (BFF4), Harvard University, May 2, 2017 <1>

I'm delighted to be part of a workshop linking statistics and philosophy of statistics! I thank the organizers for inviting me. Nancy Reid's "BFF Four: Are We Converging?" gives numerous avenues for discussion. She zeroes in on obstacles to fusion: confusion or disagreement on the nature of probability and its use in statistical inference. <2>

From Nancy Reid: Nature of probability.
Probability to describe physical haphazard variability:
- probabilities represent features of the real world, in idealized form
- subject to empirical test and improvement
- conclusions of statistical analysis expressed in terms of interpretable parameters
- enhanced understanding of the data generating process
Probability to describe the uncertainty of knowledge:
- measures a rational, supposedly impersonal, degree of belief, given relevant information (Jeffreys)
- measures a particular person's degree of belief, subject typically to some constraints of self-consistency
- often linked with personal decision-making <3>

As is common, she labels the second "epistemological". But a key question for me is: what's relevant for a normative epistemology, for an account of what's warranted/unwarranted to infer? <4>

Reid quite rightly asks: in what sense are confidence distribution functions, significance functions, structural or fiducial probabilities to be interpreted? Empirically? As degree of belief? The literature is not very clear. <5>

Reid: "We may avoid the need for a different version of probability by appeal to a notion of calibration" (Cox 2006, Reid & Cox 2015). This is my central focus. I approach it indirectly, with an analogy between philosophy of statistics and statistics. <6>

Carnap is to Bayesians as Popper is to frequentists (N-P/Fisher). Carnap: we can't solve induction, but we can build logics of induction or confirmation theories (e.g., Carnap 1962). Define a confirmation relation C(H, e): "logical probabilities" deduced from first-order languages to measure the degree of implication or confirmation that e affords H (syntactical). <7>

Problems:
- Languages too restricted.
- There was a continuum of inductive logics (Carnap tried to restrict it via "inductive intuition").
- How can a priori assignments of probability be relevant to reliability? (the "guide to life")
Few philosophers of science are logical positivists, but the hankering for a logic of induction remains in some quarters. <8>

Popper: "In opposition to [the] inductivist attitude, I assert that C(H, e) must not be interpreted as the degree of corroboration of H by e, unless e reports the results of our sincere efforts to overthrow H." (Popper 1959, 418) "The requirement of sincerity cannot be formalized." (ibid.) "Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory, or in other words, only if they result from serious attempts to refute the theory." (Popper 1994, 89) But he never successfully formulated the notion. <9>

Ian Hacking (1965) gives a logic of induction that does not require priors, based on the Law of Likelihood (Barnard, Royall, Edwards): data x support hypothesis H1 more than H0 if Pr(x; H1) > Pr(x; H0), i.e., if the likelihood ratio LR > 1. George Barnard: there always is such a rival hypothesis, "viz., that things just had to turn out the way they actually did" (1972, 129). So Pr(LR in favor of H1 over H0; H0) = high. <10>
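Barnard's point can be seen in a few lines. The following is a minimal simulation (my sketch, not from the talk; Python with numpy and scipy assumed): under the Law of Likelihood, the data-dependent rival "the mean is whatever the sample mean happened to be" is always better supported than a true H0.

```python
# Sketch (illustrative, not from the talk): Barnard's trivially favored rival.
# Data x ~ N(0, 1) under H0: mu = 0. The maximally likely rival H1: mu = xbar
# always yields LR > 1, so Pr(LR favors some rival over H0; H0) is maximal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, trials = 10, 10_000
favors_rival = 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n)                        # generated under H0
    ll_h0 = stats.norm.logpdf(x, loc=0.0, scale=1.0).sum()
    ll_h1 = stats.norm.logpdf(x, loc=x.mean(), scale=1.0).sum()
    favors_rival += ll_h1 > ll_h0
print(favors_rival / trials)   # ~1.0: the tailor-made rival is always "supported"
```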

Neyman-Pearson: "In order to fix a limit between small and large values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis." (Pearson and Neyman 1967, 106) The sampling distribution of the LR enters: a crucial criticism in statistical foundations. <11>
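To make the Pearson-Neyman remedy concrete, here is a small sketch (my illustration, same Normal setup as above): the limit between "small" and "large" LR values is fixed by simulating how often such values appear when the null hypothesis is true.

```python
# Sketch (illustrative): calibrating the LR cutoff by its sampling distribution
# under H0. Simple hypotheses H0: mu = 0 vs H1: mu = 1, x ~ N(mu, 1), n = 10.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 10, 20_000
lr = np.empty(reps)
for i in range(reps):
    x = rng.normal(0.0, 1.0, n)                        # H0 is true
    lr[i] = np.exp(stats.norm.logpdf(x, 1.0, 1.0).sum()
                   - stats.norm.logpdf(x, 0.0, 1.0).sum())
print((lr > 1).mean())         # how often the data "favor" H1 though H0 is true
print(np.quantile(lr, 0.95))   # a cutoff fixed by the sampling distribution
```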

In statistics: "Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space." (Lindley 1971, 436) Once the data are in hand, inference should follow the Likelihood Principle (LP). In philosophy (R. Rosenkrantz, defending the LP): "The LP implies the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects." (Rosenkrantz 1977, 122) (Don't mix discovery with justification.) <12>

Probabilism vs Performance. Are you looking for a way to assign degree of belief, confirmation, or support to a hypothesis (considered epistemological)? Or to ensure the long-run reliability of methods, coverage probabilities (via the sampling distribution), considered relevant only for long-run behavior, acceptance sampling? <13>

We require a third role: probativism (severe testing): to assess and control erroneous interpretations of data, post-data. The problems with selective reporting (Fisher) and non-novel data (Popper) are not problems about long runs. It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpretation. <14>

Ian Hacking: "there is no such thing as a logic of statistical inference" (1980, 145). Though responsible for much of the earlier criticism, he came to believe that Neyman, Peirce, and Braithwaite "were on the right lines to follow in the analysis of inductive arguments". Probability enters to qualify a claim inferred: it reports the method's capabilities to control, and alert us to, erroneous interpretations (error probabilities). Assigning probability to the conclusion rather than the method is founded on a false analogy with deductive logic (Hacking, 141). He's convinced by Peirce. <15>

The only two who are clear on the false analogy: Fisher (1935, 54): "In deductive reasoning all knowledge obtainable is already latent in the postulates... the conclusions are never more accurate than the data. In inductive reasoning... [t]he conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based." Peirce ("The Probability of Induction", 1878): "In the case of analytic [deductive] inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic [inductive] inferences we only know the degree of trustworthiness of our proceeding." <16>

Neyman and His Performance. You could say Neyman gets his performance idea from trying to clarify Fisher's fiducial intervals. Neyman thought his confidence intervals were the same as Fisher's fiducial intervals. In his (1934) paper (intended to generalize fiducial limits), Neyman said a confidence coefficient refers to "the probability of our being right when applying a certain rule" for making statements set out in advance (623). Fisher was highly complimentary: Neyman "had every reason to be proud of the line of argument he had developed for its perfect clarity" (Fisher's comment in Neyman 1934, 618). <17>

Neyman thinks he's clarifying Fisher's (1936, 253) equivocal reference to "the aggregate of all such statements".[1] "This then is a definite probability statement about the unknown parameter" (Fisher 1930, 533). <18>

It's interesting, too, to hear Neyman's response to Carnap's criticism of Neyman's frequentism. Neyman: "I am concerned with the term degree of confirmation introduced by Carnap... [if] the application of the locally best one-sided test failed to reject the [test] hypothesis" (Neyman 1955, 40). The question is: does a failure to reject the hypothesis confirm it? A sample X = (X1, ..., Xn), each Xi Normal, N(μ, σ²) (NIID), σ assumed known; H0: μ ≤ μ0 against H1: μ > μ0. The test fails to reject H0: d(x0) ≤ cα. <19>

Carnap says yes. Neyman: "...the attitude described is dangerous... the chance of detecting the presence [of discrepancy δ from H0], when only [this number of] observations are available, is extremely slim, even if [δ is present]" (Neyman 1955, 41). "The situation would have been radically different if the power function... [were] greater than 0.95." (ibid.) Merely surviving the statistical test is too easy, occurs too frequently, even when H0 is false. <20>
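Neyman's worry is, at bottom, a power computation. A sketch (my numbers, not Neyman's) for the one-sided Normal test of H0: μ ≤ μ0 vs H1: μ > μ0, which rejects when the sample mean exceeds μ0 + z_α·σ/√n:

```python
# Sketch (illustrative): power of the one-sided Normal test at mu = mu0 + delta.
import numpy as np
from scipy import stats

def power(delta, n, sigma=1.0, alpha=0.05):
    z_alpha = stats.norm.ppf(1 - alpha)
    # Pr(Xbar > mu0 + z_alpha*sigma/sqrt(n); mu = mu0 + delta)
    return 1 - stats.norm.cdf(z_alpha - delta * np.sqrt(n) / sigma)

# With few observations the chance of detecting delta is slim, so a failure
# to reject is poor grounds for "confirming" H0:
print(power(delta=0.5, n=5))    # ~0.30: non-rejection says little
print(power(delta=0.5, n=100))  # ~0.9996: now non-rejection is informative
```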

A post-data analysis is even better*: Mayo and Cox 2006 ("frequentist principle of evidence"). FEV for an insignificant result: a moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability (1 − c) that the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist (83-4). That is: if Pr(d(X) > d(x0); μ = μ0 + δ) is high and d(x) ≤ d(x0), infer: any discrepancy from μ0 is less than δ [infer: μ < CIu]. (*"Severity for acceptance": Mayo & Spanos 2006/2011.) <21>
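A minimal computation of the FEV quantity (my sketch, assuming the standardized statistic d(X) = √n(X̄ − μ0)/σ, so that under μ = μ0 + δ, d(X) ~ N(√n·δ/σ, 1)):

```python
# Sketch (illustrative): severity for a nonsignificant result,
# SEV = Pr(d(X) > d(x0); mu = mu0 + delta). If high and d(x) <= d(x0),
# infer: any discrepancy from mu0 is less than delta.
import numpy as np
from scipy import stats

def severity_nonsignificant(d_x0, delta, n, sigma=1.0):
    shift = np.sqrt(n) * delta / sigma      # mean of d(X) under mu0 + delta
    return 1 - stats.norm.cdf(d_x0 - shift)

# Observed d(x0) = 0.5 with n = 25, sigma = 1:
print(severity_nonsignificant(0.5, delta=0.2, n=25))  # ~0.69: weak grounds
print(severity_nonsignificant(0.5, delta=0.6, n=25))  # ~0.99: "mu < mu0 + 0.6" passes severely
```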

How to justify detaching the inference? Rubbing off: the procedure is rarely wrong; therefore, the probability it is wrong in this case is low. What's rubbed off? (It could be a probabilism or a performance.) Bayesian epistemologists (having no other relevant information): a rational degree of belief or epistemic probability rubs off. Attaching the probability to the claim differs from a report of the well-testedness of the claim. <22>

Severe Probing Reasoning. The reasoning of the severe testing theorist is counterfactual: H: μ ≤ x̄0 + 1.96σx̄ (i.e., μ ≤ CIu). H passes severely because, were this inference false, and the true mean μ > CIu, then, very probably, we would have observed a larger sample mean. (I don't saddle Cox with my take, nor Popper.) <23>
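The counterfactual can be checked numerically. A sketch (my illustration, σ known): were the true mean at the upper confidence bound CIu, a sample mean as small as the one observed would occur only about 2.5% of the time.

```python
# Sketch (illustrative): severity behind inferring mu <= CI_u = xbar0 + 1.96*SE.
import numpy as np
from scipy import stats

xbar0, sigma, n = 0.2, 1.0, 100
se = sigma / np.sqrt(n)
ci_u = xbar0 + 1.96 * se

# Pr(Xbar <= xbar0; mu = ci_u): chance of data this small were the claim false
print(stats.norm.cdf(xbar0, loc=ci_u, scale=se))       # 0.025
# So Pr(Xbar > xbar0; mu > ci_u) >= 0.975: H passes with high severity
print(1 - stats.norm.cdf(xbar0, loc=ci_u, scale=se))   # 0.975
```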

How Well Tested (Corroborated, Probed) ≠ How Probable. We can build a logic for severity (it won't be probability): both C and ~C can be poorly tested; low severity is not just "a little bit of evidence", but bad or no evidence. Formal error probabilities may serve to quantify the probativeness or severity of tests (for a given inference), but they do not automatically give this: they must be relevant. <24>

What Nancy Reid's paper got me thinking about is the calibration point. Here's the longer quote: "We may avoid the need for a different version of probability by appeal to a notion of calibration, as measured by the behaviour of a procedure under hypothetical repetition. That is, we study assessing uncertainty, as with other measuring devices, by assessing the performance of proposed methods under hypothetical repetition. Within this scheme of repetition, probability is defined as a hypothetical frequency." (Reid and Cox 2015, 295) <25>
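"Probability as hypothetical frequency" can be put in one loop: the method's stated uncertainty should match its actual performance under hypothetical repetitions of the data-generating process. A sketch (my illustration, not Reid and Cox's code):

```python
# Sketch (illustrative): calibration of a nominal 95% confidence interval
# assessed by its coverage under hypothetical repetition.
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma, n, reps = 3.0, 2.0, 50, 20_000
covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, n)
    se = sigma / np.sqrt(n)
    covered += abs(x.mean() - mu_true) <= 1.96 * se
print(covered / reps)   # ~0.95: stated probability matches hypothetical frequency
```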

Notions of calibration also vary! (1) If we calibrate p-values by a Bayes factor or other probabilism, p-values "exaggerate evidence". (2) If we calibrate Bayes factors by performance or severity, they exaggerate what's warranted to infer. Which you say depends on one's philosophy of statistics (Greenland, Senn, Rothman, Carlin, Poole, Goodman, and Altman 2016, 342). Reid: it is unacceptable if a procedure yielding high-probability regions in some non-frequency sense is poorly calibrated. I agree. I take this as calling for the second (2), frequentist, calibration. <26-27>

This takes me to my last point: an irony about today's replication crisis. In some cases it's thought Big Data foisted statistics on fields unfamiliar with its dangers, and Reid discusses some foibles. A lot of consciousness-raising is going on; more hand-wringing than ever regarding cherry-picking and selection effects (p-hacking, significance seeking). R. A. Fisher: it's easy to lie with statistics by selective reporting (1955, 75). New names, same problem. <28>

This returns us to a question from back when the possibility of a logic of induction still seemed viable: can't data speak for themselves? Calls for preregistration are everywhere: "Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article." (Simmons, Nelson, and Simonsohn 2011, 1362) At the same time: "Use of the Bayes factor gives experimenters the freedom to employ optional stopping without penalty." (In fact, Bayes factors "can be used in the complete absence of a sampling plan".) (Bayarri, Benjamin, Berger, and Sellke 2016, 100) <29>
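A sketch (mine) of why these two quotes clash: under H0, "peeking" after every observation and stopping at the first nominally significant z-score inflates the actual type I error rate far beyond 0.05. That inflation is exactly what error probabilities register, and exactly what the final likelihood (and hence the Bayes factor) ignores.

```python
# Sketch (illustrative): optional stopping without a penalty. Test H0: mu = 0
# with known sigma = 1, peeking after each observation up to n_max.
import numpy as np

rng = np.random.default_rng(42)
reps, n_max = 5_000, 100
rejected = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n_max)       # all data generated under H0
    for n in range(1, n_max + 1):
        z = x[:n].mean() * np.sqrt(n)     # z-statistic at interim look n
        if abs(z) > 1.96:                 # stop at first nominal 5% "rejection"
            rejected += 1
            break
print(rejected / reps)   # ~0.4, not 0.05: nominal significance is far too easy
```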

What I take away from Nancy Reid's talk is this: if we don't know what we mean by saying an account "works", we can't tell how to calibrate. <30>

In the severe testing view, for a calibration to be relevant to normative epistemology, that is, to what is warranted to infer (what's well and poorly tested): 1. it must be directly affected by selection effects (cherry picking, multiple testing, stopping rules); 2. it must enable testing assumptions; 3. it must enable statistical falsification. This points to the need for further philosophical-statistical interaction. <31>

Philosophy of Inductive/Statistical Inference:
- Inductive logics: Carnap's C(H, e), Hacking (1965)
- Falsification, testing accounts: Popper

Parallels in Formal Statistics (goes much further):
- Bayesian and likelihoodist accounts. Probability: to assign degree of confirmation, support, belief (posterior or comparative). Probabilisms. Fiducial?
- Fisherian, Neyman-Pearson frequentist methods. Probability: (a) to ensure reliable performance; (b) severity of tests (probativeness). Fiducial? <32>

[1] (endnote) <33>

REFERENCES

Barnard, G. (1972). The Logic of Statistical Inference (review of The Logic of Statistical Inference by Ian Hacking). British Journal for the Philosophy of Science 23(2): 123-132.
Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Journal of Mathematical Psychology 72: 90-103.
Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle. 2nd ed. Vol. 6, Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
Carnap, R. (1962). Logical Foundations of Probability. 2nd ed. Chicago: University of Chicago Press.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge: Cambridge University Press.
Fisher, R. A. (1930). Inverse Probability. Mathematical Proceedings of the Cambridge Philosophical Society 26(4): 528-535.
Fisher, R. A. (1935). The Logic of Inductive Inference. Journal of the Royal Statistical Society 98(1): 39-82.
Fisher, R. A. (1936). Uncertain Inference. Proceedings of the American Academy of Arts and Sciences 71: 248-258.
Fisher, R. A. (1955). Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society, Series B (Methodological) 17(1): 69-78.
Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
Hacking, I. (1972). Review: Likelihood. British Journal for the Philosophy of Science 23(2): 132-137.
Hacking, I. (1980). The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, pp. 141-160. Cambridge: Cambridge University Press.
Jeffreys, H. (1939). Theory of Probability. Oxford: Oxford University Press.
Lindley, D. (1971). The Estimation of Many Parameters. In Godambe, V. P. and Sprott, D. A. (eds.), Foundations of Statistical Inference, pp. 435-455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
Mayo, D. G. (2014). On the Birnbaum Argument for the Strong Likelihood Principle (with discussion). Statistical Science 29(2): 227-239, 261-266.
Mayo, D. G. (2016). Don't Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R. L. and Lazar, N. A. (2016), The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2), supplemental materials.
Mayo, D. G. and Cox, D. R. (2006). Frequentist Statistics as a Theory of Inductive Inference. In Rojo, J. (ed.), Optimality: The Second Erich L. Lehmann Symposium, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS) 49: 77-97.
Mayo, D. G. and Spanos, A. (2006). Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction. British Journal for the Philosophy of Science 57: 323-357.
Mayo, D. G. and Spanos, A. (2011). Error Statistics. In Bandyopadhyay, P. and Forster, M. (eds.), Philosophy of Statistics, Vol. 7, Handbook of the Philosophy of Science, pp. 152-198. The Netherlands: Elsevier.
Neyman, J. (1930). Methodes nouvelles de verification des hypotheses. Compt Rend Premier Congr Math Pays Slaves: 355-366.
Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. In Early Statistical Papers of J. Neyman: 98-141. [Originally published (1934) in Journal of the Royal Statistical Society 97(4): 558-625.]
Neyman, J. (1955). The Problem of Inductive Inference. Communications on Pure and Applied Mathematics 8(1): 13-46.
Pearson, E. and Neyman, J. (1967). On the Problem of Two Samples. In Neyman, J. and Pearson, E. S., Joint Statistical Papers, pp. 99-115. Berkeley: University of California Press. [First published in Bull. Acad. Pol. Sci. (1930): 73-96.]
Peirce, C. S. (1931). Collected Papers of Charles Sanders Peirce. Hartshorne, C. and Weiss, P. (eds.), 6 vols. Cambridge: Harvard University Press.
Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.
Popper, K. (1994). The Myth of the Framework: In Defense of Science and Rationality. Notturno, M. A. (ed.). London and New York: Routledge.
Reid, C. (1997). Neyman. New York: Springer Science & Business Media.
Reid, N. and Cox, D. R. (2015). On Some Principles of Statistical Inference. International Statistical Review 83(2): 293-308.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht: D. Reidel.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall/CRC Press.
Sellke, T., Bayarri, M., and Berger, J. O. (2001). Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician 55(1): 62-71.
Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359-1366.