
Richard Carrier: Proving History or Idiocy?

In preparation for the release of the sequel to Dr. Richard Carrier's Proving History, I thought it would be good to address the foundations of the sequel as presented in Proving History. Namely, Dr. Carrier's arguments about "Bayes' Theorem" or BT (as presented in his book, anyway). There are several reasons for this. First, Dr. Carrier argues that BT should be THE method for all historical research. Second, the entire point of Proving History was to argue that the approach we will find in the sequel is the best. The third and most important reason, however, is that Dr. Carrier doesn't understand BT, how it differs from the method he actually advocates (Bayesian inference/statistics), or how fundamentally inaccurate basically every claim he makes about his methodology is.

We can begin our journey through Dr. Carrier's failure to understand his own method by looking at his flawed "proof". At one point, Dr. Carrier states we can "conclude here and now that Bayes' Theorem models and describes all valid historical methods. No other method is needed…" (p. 106). His "proof", though, stands or falls with the first proposition in it: "BT is a logically proven theorem" (ibid.). This is true. However, Dr. Carrier doesn't seem to have read the sources he cites. For example, on p. 50 Dr. Carrier refers the reader via an endnote (no. 9) to "several highly commendable texts" on BT. The one he states gives "a complete proof of formal validity of BT" is Papoulis, A. (1986). Probability, Random Variables, and Stochastic Processes (2nd Ed.). I don't have the 2nd edition, but I do have the 3rd, and as this proof is trivial I really could use any intro probability textbook. Papoulis begins his "complete proof of formal validity" (as opposed to a proof of informal validity? or an incomplete proof?) by defining a set and a probability function for which the axioms of probability hold. A key axiom is that the probabilities over any set of possible outcomes must sum or integrate to 1 (simplistically, for those who haven't taken any calculus, integration is a kind of summation). For example, imagine an individual named "Anna" is drawing cards from the pack. Let's imagine that

"It wasn't the Jack of Diamonds
Nor the Joker she drew at first
It wasn't the King or the Queen of Hearts
But the Ace of Spades reversed"

The probability of drawing the card she did is 1/52. This is true for the other 51 cards as well. The probability that she would draw a card from the deck that was in the deck is 52/52, or 1. This is intuitive and obvious, but the important point is that it also follows from the fact that there are 52 cards and the probability of drawing any one of them is 1/52; hence the probability of drawing a card is given by the sum of the probabilities of drawing each individual card, or 1/52 summed 52 times. In Dr. Carrier's appendix (p. 284) he notes that probability functions must sum to 1. What he apparently doesn't understand is what this entails: in order to use BT to evaluate how probable some outcome, result, historical event, etc., is, one must consider every single one.
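To make the sum-to-1 axiom concrete, here is a minimal sketch in Python; the deck and the uniform 1/52 assignment simply encode the example above:

```python
from fractions import Fraction

# A standard 52-card deck: every card is a distinct, mutually exclusive outcome.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Assign each outcome the same probability, 1/52.
p = {card: Fraction(1, 52) for card in deck}

# The axiom: the probabilities of all possible outcomes must sum to 1.
assert sum(p.values()) == 1

# P(drawing the Ace of Spades) = 1/52
print(p[("A", "spades")])
```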

Let me make this concrete with a simple example. I can calculate the probability that, given a full deck of cards, a random draw will yield an ace, because I know in advance every possible outcome. If, however, someone mixed together 10 cards drawn at random from each of 300 different decks and asked me to pick a card, I can't calculate the probability anymore. Even if I were told that the new deck contained 3,000 cards, I have no idea how many are aces. Incidentally, this is the perfect situation for Bayesian statistical inference, which works (simplistically) by assuming, e.g., a certain distribution of aces and then updating my model of how likely the next draw is to yield an ace as I learn more about the composition of the entire 3,000-card deck.
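A minimal sketch of that kind of inference, under assumptions I am supplying purely for illustration (a uniform prior over the number of aces in the mixed deck, and draws with replacement to keep the arithmetic simple):

```python
# Bayesian inference over the unknown number of aces in a 3,000-card mixed deck.
# Illustrative assumptions: a uniform prior over 0..3000 aces, and draws
# with replacement, so each draw is an ace with probability k/3000.

N = 3000  # total cards in the mixed deck

# Prior: every possible ace count k is initially equally plausible.
prior = {k: 1 / (N + 1) for k in range(N + 1)}

def update(beliefs, drew_ace):
    """One Bayesian update: reweight each hypothesis k by its likelihood."""
    posterior = {}
    for k, p_k in beliefs.items():
        likelihood = k / N if drew_ace else 1 - k / N
        posterior[k] = p_k * likelihood
    total = sum(posterior.values())  # normalize so the posterior sums to 1
    return {k: v / total for k, v in posterior.items()}

# Observe ten draws: two aces, eight non-aces.
beliefs = prior
for drew_ace in [True, False, False, True] + [False] * 6:
    beliefs = update(beliefs, drew_ace)

# Posterior expectation of the number of aces given the data so far.
print(sum(k * p for k, p in beliefs.items()))
```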

Dr. Carrier wishes to use what he thinks BT is to evaluate the probability that particular events occurred ~2,000 years ago. For example, on pp. 40-42 he considers the possibility that Jesus was a legendary rabbi in terms of the class of legendary rabbis and the information we have on such a class. We are in a far worse position than in the mixed-deck example above, because we don't even know the number of legendary rabbis, still less who they might be (if we did, we'd have the answer: Jesus either would be one or wouldn't).

There is another basic property of BT Dr. Carrier seems to have missed. As Papoulis clearly states, BT is only valid for events/outcomes that are mutually exclusive. Often, both of these requirements (summing to 1 and mutual exclusivity) are given together: the set of outcomes must be collectively exhaustive and mutually exclusive. In other words, BT is only valid if (1) all possible outcomes are known and (2) one and only one outcome can occur.
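Stated in standard notation (a restatement of the textbook form, not a verbatim quote from Papoulis), BT for a set of hypotheses requires exactly these two conditions:

```latex
% Bayes' Theorem over a partition H_1, ..., H_n of the outcome space:
% the H_j must be mutually exclusive and collectively exhaustive, so that
% the law of total probability applies in the denominator.
P(H_i \mid E) = \frac{P(E \mid H_i)\, P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j)\, P(H_j)},
\qquad \sum_{j=1}^{n} P(H_j) = 1,
\quad H_j \cap H_k = \emptyset \ \text{for}\ j \neq k.
```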

This makes BT useless for most purposes, including historiography. However, Dr. Carrier isn't really using BT. As his references (as well as his formulation of the theorem) show, he is actually using something called Bayesian inference/Bayesian analysis. However, this negates his entire proof: it doesn't matter that BT is a logically proven theorem, because there is no "complete proof of formal validity" for some Bayesian inference/analysis theorem Dr. Carrier could use in place of his first proposition.

Ok, so we can't use BT, but that doesn't mean we can't use Bayesian methods. However, in order to use Bayesian methods Dr. Carrier would have to understand Bayesian statistics (and statistics in general). He doesn't. We can see this clearly when Dr. Carrier, whose expertise is ancient history, addresses the frequentist vs. Bayesian debate. To keep things simple, let's just say that this is an ongoing debate, arguably going back to Bayes himself, and definitely over a century old. Dr. Carrier is apparently so confident in his mathematical acuity that he resolves the dispute, with almost no reference to math or the literature, in a few pages: "The whole debate between frequentists and Bayesians, therefore, has merely been about what a probability is a frequency of, the rules are the same for either" (p. 266). Hm. Amazing that generations of the best statistical minds missed this. Oh wait. They didn't.

Let's look at how Carrier describes the dispute: "The debate between the so-called frequentists and Bayesians can be summarized thus: frequentists describe probabilities as a measure of the frequency of occurrence of particular kinds of event within a given set of events, while Bayesians often describe probabilities as measuring degrees of belief or uncertainty" (p. 265). This is laughably wrong:

"Frequentist statistical procedures are mainly distinguished by two related features; (i) they regard the information provided by the data x as the sole quantifiable form of relevant probabilistic information and (ii) they use, as a basis for both the construction and the assessment of statistical procedures, long-run frequency behaviour under hypothetical repetition of similar circumstances." (Bernardo, J. M. & Smith, A. F. (1994). Bayesian Theory. Wiley.)

"Undoubtedly, the most critical and most criticized point of Bayesian analysis deals with the choice of the prior distribution, since, once this prior distribution is known, inference can be led in an almost mechanic way by minimizing posterior losses, computing higher posterior density regions, or integrating out parameters to find the predictive distribution. The prior distribution is the key to Bayesian inference and its determination is therefore the most important step in drawing this inference. To some extent, it is also the most difficult. Indeed, in practice, it seldom occurs that the available prior information is precise enough to lead to an exact determination of the prior distribution, in the sense that many probability distributions are compatible with this information… Most often, it is then necessary to make a (partly) arbitrary choice of the prior distribution, which can drastically alter the subsequent inference." (Robert, C. P. (2001). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (2nd Ed.). Springer.)

The "frequency" part of "frequentist" does have to do with kinds of events, but frequencies are the measure of probability, not the reverse. To illustrate, consider the bell curve (the graph of the normal distribution). It's a probability distribution. Now imagine a standardized test like the SAT, which is designed such that scores will be normally distributed and have this bell-curve graph. What does the graph tell us? It tells us that most people who take the test get very similar scores, while very infrequently some test-takers will get high scores and others will get low ones. In other words, the bell curve is the graph of a probability function (technically, a probability density function or pdf), and it is formed by the frequency of particular scores. We know that it is very improbable for a person's score to fall in either of the ends/tails of the bell curve because these are very infrequent outcomes.
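To see this in code (assuming, purely for illustration, an SAT-like scale with mean 500 and standard deviation 100; these are not the College Board's actual figures):

```python
from statistics import NormalDist

# Hypothetical test scored so that results are roughly normal with
# mean 500 and standard deviation 100 (illustrative numbers only).
scores = NormalDist(mu=500, sigma=100)

# Frequent outcomes sit near the middle of the bell curve...
print(scores.pdf(500))  # density is highest at the mean
print(scores.pdf(750))  # ...and tiny out in the tail

# ...so extreme scores are improbable precisely because they are infrequent:
# the chance of scoring 700 or above is the upper-tail area.
print(1 - scores.cdf(700))  # ~0.0228, about 2.3% of test-takers
```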

What does this mean for frequentist methods? Well, Kaplan, The Princeton Review, and other test-prep companies try to show their methods work by using this normal distribution. They claim that people who take their classes aren't distributed the way the population is, because students taking their classes too frequently obtain scores above average (i.e., those who take the classes have test scores that aren't distributed the way the population's are). They use the frequency of higher-than-average scores to argue that their class must improve scores.
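As a sketch of that frequentist logic (with invented numbers: the same illustrative 500/100 scale as above and a hypothetical sample of 64 prep-class students), the classic tool is a one-sample z-test:

```python
from math import sqrt
from statistics import NormalDist

# Null hypothesis: prep-class students score like the general population
# (mean 500, sd 100 under the illustrative scaling used above).
mu0, sigma = 500, 100

# Invented data: 64 prep-class students averaging 530.
n, sample_mean = 64, 530

# Frequentist reasoning: under hypothetical repetition of similar samples,
# how often would a sample mean this extreme occur if the null were true?
z = (sample_mean - mu0) / (sigma / sqrt(n))  # z = 2.4
p_value = 1 - NormalDist().cdf(z)            # one-sided tail area
print(z, p_value)  # ~2.4, ~0.0082: rare under the null, hence "significant"
```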

What's key is that the data are obtained and analyzed, but the distribution is only used to determine whether the values the analysis yielded are statistically significant. Bayesian inference reverses this, creating fundamental differences. The process starts with a probability distribution. The prior distributions represent uncertainty and make predictions about the data that will be obtained. Once the new data are obtained, the model is adjusted to better fit them. This is usually done many, many times, as more and more information is tested against an increasingly accurate model. The key differences are (1) the iterative process, (2) the use of models which make predictions, and (3) the use of distributions to represent unknowns and (in part) the way the model will learn or adapt given new input.
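Here is a minimal sketch of that loop; the Beta-Binomial model is my choice for brevity (the textbook conjugate example), not anything from Carrier's book. The point is that each posterior literally becomes the prior for the next round:

```python
# Iterative Bayesian updating with a Beta-Binomial model: the posterior
# after each batch of data becomes the prior for the next batch.
# We estimate an unknown success probability theta (e.g., "draw is an ace").

alpha, beta = 1.0, 1.0  # Beta(1, 1) prior: uniform uncertainty about theta

batches = [(2, 8), (1, 9), (4, 6)]  # (successes, failures) per batch

for successes, failures in batches:
    # Conjugate update: Beta(a, b) prior + Binomial data -> Beta posterior.
    alpha += successes
    beta += failures
    # The posterior mean doubles as the model's prediction for the next draw.
    print(f"posterior mean of theta: {alpha / (alpha + beta):.3f}")
```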

So why don't we find any of this in Dr. Carrier's description of Bayesian methods? Why do we always find ad hoc descriptions of priors? Because Dr. Carrier wants to use Bayesian analysis but apparently doesn't understand what priors actually are or how complicated they can be in even simple models: "In many situations, however, the selection of the prior distribution is quite delicate in the absence of reliable prior information, and generic solutions must be chosen instead. Since the choice of the prior distribution has a considerable influence on the resulting inference, this choice must be conducted with the utmost care." (Marin, J. M., & Robert, C. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.)

"While the axiomatic development of Bayesian inference may appear to provide a solid foundation on which to build a theory of inference, it is not without its problems. Suppose, for example, a stubborn and ill-informed Bayesian puts a prior on a population proportion p that is clearly terrible (to all but the Bayesian himself). The Bayesian will be acting perfectly logically (under squared error loss) by proposing his posterior mean, based on a modest size sample, as the appropriate estimate of p. This is no doubt the greatest worry that the frequentist (as well as the world at large) would have about Bayesian inference: that the use of a bad prior will lead to poor posterior inference. This concern is perfectly justifiable and is a fact of life with which Bayesians must contend… We have discussed other issues, such as the occasional inadmissibility of the traditional or favored frequentist method and the fact that frequentist methods don't have any real, compelling logical foundation. We have noted that the specification of a prior distribution, be it through introspection or elicitation, is a difficult and imprecise process, especially in multiparameter problems, and in any statistical problem, suffers from the potential of yielding poor inferences as a result of poor prior modeling." (Samaniego, F. J. (2010). A Comparison of the Bayesian and Frequentist Approaches to Estimation. Springer.)
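To make the "stubborn Bayesian" worry concrete, here is a toy calculation with invented numbers: the same data under two priors, giving very different posterior means:

```python
# How a terrible prior drives the posterior with a modest sample.
# Invented data: 10 successes in 10 + 10 = 20 trials (sample proportion 0.5).
successes, failures = 10, 10

def posterior_mean(alpha, beta):
    """Posterior mean of p under a Beta(alpha, beta) prior (squared-error loss)."""
    return (alpha + successes) / (alpha + beta + successes + failures)

print(posterior_mean(1, 1))    # mild uniform prior      -> 0.500
print(posterior_mean(100, 2))  # stubborn prior near 1.0 -> ~0.902
```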

The "stubborn and ill-informed Bayesian" is in a much better position than Dr. Carrier. Dr. Carrier has confused BT with Bayesian analysis, and the Bayesian approach with the frequentist, all because he apparently hasn't understood any of these. Instead of prior distributions, his priors are best guesses. Instead of real belief functions, we find "here's what I believe." No consideration is given to the nature of the data (categorical, nominal, and in general non-numerical data require specific models and tests, Bayesian or not).

So instead of the universally valid historical method Dr. Carrier argues BT provides, all that he's actually done is butcher mathematics in order to plug values into a formula that is as mathematical as numerology but apparently seems impressive if you have no clue what you are talking about. Perhaps that's why Dr. Carrier's CV indicates he's been lecturing on Bayes' Theorem since 2003, but his 2008 dissertation contains no reference to Bayes' Theorem.