
Statistical Inference and Data Mining



COMMUNICATIONS OF THE ACM November 1996/Vol. 39, No. 11 35


Data mining aims to discover something new from the facts recorded in a database. For many reasons—encoding errors, measurement errors, unrecorded causes of recorded features—the information in a database is almost always noisy; therefore, inference from databases invites applications of the theory of probability. From a statistical point of view, databases are usually uncontrolled convenience samples; therefore data mining poses a collection of interesting, difficult—sometimes impossible—inference problems, raising many issues, some well studied and others unexplored or at least unsettled.

Data mining almost always involves a search architecture requiring evaluation of hypotheses at the stages of the search, evaluation of the search output, and appropriate use of the results. Statistics has little to offer in understanding search architectures but a great deal to offer in evaluation of hypotheses in the course of a search, in evaluating the results of a search, and in understanding the appropriate uses of the results.

Clark Glymour, David Madigan, Daryl Pregibon, and Padhraic Smyth

Statistical Inference and Data Mining

Statistics may have little to offer the search architectures in a data mining search, but a great deal to offer in evaluating hypotheses in the search, in evaluating the results of the search, and in applying the results.


Here we describe some of the central statistical ideas relevant to data mining, along with a number of recent techniques that may sometimes be applied. Our topics include features of probability distributions, estimation, hypothesis testing, model scoring, Gibbs sampling, rational decision making, causal inference, prediction, and model averaging. For a rigorous survey of statistics, the mathematically inclined reader should see [7]. Due to space limitations, we must also ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions

The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables—functions defined on the "events" to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming a conditional probability measure from a probability measure on a sample space and some event of positive probability). Essential relations among random variables include independence, conditional independence, and various measures of dependence—of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference (e.g., conjugate families closed under conditionalization and the multinormal family closed under linear combination). Knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.

Inference involves the following features:

• Estimation
• Consistency
• Uncertainty
• Assumptions
• Robustness
• Model averaging

Many procedures of inference can be thought of as estimators, or functions from data to some object to be estimated, whether the object is the values of a parameter, intervals of values, structures, decision trees, or something else. Where the data are a sample from a larger (actual or potential) collection described by a probability distribution for any given sample size, the array of values of an estimator over samples of that size has a probability distribution. Statistics investigates such distributions of estimates to identify features of an estimator related to the information, reliability, and uncertainty it provides.

An important feature of an estimator is consistency; in the limit, as the sample size increases without bound, estimates should almost certainly converge to the correct value of whatever is being estimated. Heuristic procedures, which abound in machine learning (and in statistics), have no guarantee of ever converging on the right answer. An equally important feature is the uncertainty of an estimate made from a finite sample. That uncertainty can be thought of as the probability distribution of estimates made from hypothetical samples of the same size obtained in the same way. Statistical theory provides measures of uncertainty (e.g., standard errors) and methods of calculating them for various families of


estimators. A variety of resampling and simulation techniques have also been developed for assessing uncertainties of estimates [1]. Other things (e.g., consistency) being equal, estimators that minimize uncertainty are preferred.
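The notion of uncertainty as the spread of an estimator over hypothetical same-size samples can be sketched directly in code. The following is a toy illustration (the choice of the sample mean, the standard normal population, and the sample size are all arbitrary assumptions), where theory gives a standard error of 1/√n:

```python
import random
import statistics

def sampling_distribution(estimator, population_sampler, n, trials=2000, seed=0):
    """Approximate the distribution of an estimator over repeated
    samples of size n, each drawn the same way."""
    rng = random.Random(seed)
    return [estimator([population_sampler(rng) for _ in range(n)])
            for _ in range(trials)]

# Example: the sample mean of a standard normal, n = 25.
estimates = sampling_distribution(statistics.mean,
                                  lambda rng: rng.gauss(0.0, 1.0), n=25)

# The spread of these estimates approximates the standard error,
# which theory says is 1/sqrt(25) = 0.2 for this case.
se = statistics.stdev(estimates)
```

Resampling methods such as the bootstrap [1] apply the same idea when only one sample is available, by resampling from the data itself.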

The importance of uncertainty assessments can be illustrated in many ways. For example, in recent research aimed at predicting the mortality of hospitalized pneumonia patients, a large medical database was divided into a training set and a test set. (Search procedures used the training set to form a model, and the test set helped assess the predictions of the model.) A neural net using a large number of variables outperformed several other methods. However, the neural net's performance turned out to be an accident of the particular train/test division. When a random selection of other train/test divisions (with the same proportions) was made and the neural net and competing methods were trained and tested according to each, the average neural net performance was comparable to that of logistic regression.
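The train/test accident described above is easy to reproduce in miniature. This sketch uses synthetic data and a deliberately simple nearest-mean classifier (both invented for illustration, not the methods of the pneumonia study) to compare accuracy on one random split with the average over many splits:

```python
import random
import statistics

def split_accuracy(data, labels, rng):
    """Random 50/50 train/test split; classify each test point
    by the nearer of the two training-class means."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    means = {c: statistics.mean(data[i] for i in train if labels[i] == c)
             for c in (0, 1)}
    correct = sum(1 for i in test
                  if min(means, key=lambda c: abs(data[i] - means[c])) == labels[i])
    return correct / len(test)

rng = random.Random(1)
labels = [rng.randrange(2) for _ in range(200)]
data = [lab + rng.gauss(0.0, 1.0) for lab in labels]   # weak signal, heavy noise

accs = [split_accuracy(data, labels, rng) for _ in range(200)]
single, average = accs[0], statistics.mean(accs)
# A single split can be lucky or unlucky; the spread of accs shows how
# much, and the average over splits is the honest figure.
```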

Estimation is almost always made on the basis of a set of assumptions, or model, but for a variety of reasons the assumptions may not be strictly met. If the model is incorrect, estimates based on it are also expected to be incorrect, although that is not always the case. One aim of statistical research is to find ways to weaken the assumptions necessary for good estimation. Robust statistics looks for estimators that work satisfactorily for larger families of distributions; resilient statistics [3] concerns estimators—often order statistics—that typically have small errors when assumptions are violated.

A more Bayesian approach to the problem of estimation under assumptions emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered, each with an appropriate probability, and when each of the competing models yields an estimate of the quantity of interest, an estimate can be obtained as the weighted average of the estimates given by the individual models [5]. When the probability weights are well calibrated to the frequencies with which the various models obtain, model averaging is bound to improve estimation on average. Since the models obtained in data mining are usually the result of some automated search procedure, the advantages of model averaging are best obtained if the error frequencies of the search procedure are known—something usually obtainable only through extensive Monte Carlo exploration. Our impression is that the error rates of search procedures proposed and used in the data mining and statistical literatures are rarely estimated in this way. (See [10] and [11] for Monte Carlo test design for search procedures.) When the probabilities of various models are entirely subjective, model averaging gives at least coherent estimates.
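The averaging step itself is simple; as noted, the hard part is getting honest weights. A minimal sketch (the estimates and weights below are invented for illustration):

```python
def model_average(estimates, weights):
    """Weighted average of estimates of the same quantity from
    competing models; weights are model probabilities summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * e for w, e in zip(weights, estimates))

# Three hypothetical models estimate the same quantity; their
# (invented) posterior probabilities serve as weights.
estimate = model_average([2.0, 2.4, 3.0], [0.5, 0.3, 0.2])
```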

Hypothesis Testing

Hypothesis testing can be viewed as one-sided estimation in which, for a specific hypothesis and any sample of an appropriate kind, a testing rule either conjectures that the hypothesis is false or makes no conjecture. The testing rule is based on the conditional sampling distribution (conditional on the truth of the hypothesis to be tested) of some statistic or other. The significance level α of a statistical test specifies the probability of erroneously conjecturing that the hypothesis is false (often called rejecting the hypothesis) when the hypothesis is in fact true. Given an appropriate alternative hypothesis, the probability of rejecting the hypothesis under test when the alternative is true can be calculated; that probability is called the power of the test against the alternative. The power of a test is obviously a function of the alternative hypothesis being considered.

Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, a level α test of one hypothesis and a level α test of another hypothesis do not jointly provide a level α test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses [4]. An important corollary for data mining is that the alpha level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at α = 0.05, then 0.05 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. That relation would hold (approximately) only when the sample size is much larger than the number of variables considered. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.
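The corollary can be checked numerically. A minimal simulation (assuming, for simplicity, independent tests of true null hypotheses, so each p-value is uniform on [0, 1]): with 20 tests at level 0.05, the chance of at least one false rejection is about 1 - 0.95**20, roughly 0.64, far above the per-test level:

```python
import random

def any_false_rejection(m, alpha, rng):
    """Test m true null hypotheses at level alpha; under each null
    the p-value is uniform, so each test falsely rejects with
    probability alpha."""
    return any(rng.random() < alpha for _ in range(m))

rng = random.Random(0)
m, alpha, trials = 20, 0.05, 5000
rate = sum(any_false_rejection(m, alpha, rng) for _ in range(trials)) / trials
# rate approximates 1 - (1 - alpha)**m, not alpha.
```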

In many, perhaps most, realistic hypothesis spaces, hypothesis testing is comparatively uninformative. If a hypothesis is not rejected by a test rule and a sample, the same test rule and the same sample may very well


also not reject many other hypotheses. And in the absence of knowledge of the entire power function of the test, the testing procedure provides no information about such alternatives. Further, the error probabilities of tests have to do with the truth of hypotheses, not with approximate truth; hypotheses that are excellent approximations may be rejected in large samples. Tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.

Model Scoring

The evidence provided by data should lead us to prefer some models or hypotheses to others and to be indifferent about still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For such reasons, scoring rules are often an attractive alternative to tests. Indeed, the values of test statistics are sometimes themselves used as scores, especially in the structural-equation literature. Typical rules assign to a model a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (AIC), Bayes Information Criterion (BIC), and Minimum Description Length. Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The BIC approximates posterior probabilities in large samples.
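For concreteness, the two penalized-likelihood scores mentioned above can be written down directly (the log-likelihoods and model dimensions below are invented for illustration; in this sign convention lower scores are better):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: a fixed penalty per parameter."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """Bayes Information Criterion: the penalty grows with sample
    size n, which is tied to BIC's consistency as a scoring rule."""
    return -2.0 * log_likelihood + k * math.log(n)

# Two hypothetical models of the same data (n = 1000): the larger
# model fits slightly better but uses five more parameters.
score_small = bic(-500.0, k=3, n=1000)
score_big = bic(-498.0, k=8, n=1000)
# Here BIC prefers the smaller model: the two extra units of
# log-likelihood do not pay for five extra parameters.
```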

There is a notion of consistency appropriate to scoring rules; in the large sample limit, the true model should almost surely be among those receiving maximal scores. AIC scores are generally not consistent [8]. The probability (p) values assigned to statistics in hypothesis tests of models are scores, but it does not seem to be known whether and under what conditions they form a consistent set of scores. There are also uncertainties associated with scores, since two different samples of the same size from the same distribution can yield not only different numerical values for the same model but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; however, it is often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores.

In some contexts, inferences made using Bayes scores and posteriors can differ a great deal from inferences made with hypothesis tests. (See [5] for examples of models that account for almost all of the variance of an outcome of interest and that have very high posterior or Bayes scores but are overwhelmingly rejected by statistical tests.)

Of the various scoring rules, perhaps the most interesting is the posterior probability, because, unlike many other consistent scores, posterior probability has a central role in the theory of rational choice. Unfortunately, posteriors can be difficult to compute.

Gibbs Sampling

Statistical theory typically gives asymptotic results that can be used to describe posteriors or likelihoods in large samples. Unfortunately, even in very large databases, the number of cases relevant to a particular question can be quite small. For example, in studying the effects of hospitalization on survival of pneumonia patients, mortality comparisons between those treated at home and those treated in a hospital might be wanted. But even in a very large database, the number of pneumonia patients treated at


home and who die of pneumonia complications is very small. And statistical theory typically provides few or no ways to calculate distributions in small samples in which the application of asymptotic formulas can be wildly misleading. Recently, a family of simulation methods—often described as Gibbs sampling after the great American physicist Josiah Willard Gibbs (1839–1903)—has been adapted from statistical mechanics, permitting the approximate calculation of many distributions. A review of these procedures is in [9].
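A minimal Gibbs sampler illustrates the idea. The target here is a bivariate standard normal with correlation rho, chosen only because its full conditionals are known exactly, so the result can be checked; in practice the method is applied precisely where the joint distribution cannot be computed directly:

```python
import random
import statistics

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=500, seed=0):
    """Gibbs sampling for a bivariate standard normal with correlation
    rho: each coordinate is resampled in turn from its full conditional,
    which is N(rho * other, 1 - rho**2)."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    xs, ys = [], []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, sd)
        y = rng.gauss(rho * x, sd)
        if i >= burn_in:               # discard the initial transient
            xs.append(x)
            ys.append(y)
    return xs, ys

xs, ys = gibbs_bivariate_normal(rho=0.8)
mx, my = statistics.mean(xs), statistics.mean(ys)
corr = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        / (len(xs) * statistics.pstdev(xs) * statistics.pstdev(ys)))
# The sampled correlation should be close to the target rho = 0.8.
```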

Rational Decision Making and Planning

The theory of rational choice assumes the decision maker has a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, and knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, as well as knowledge of the probabilities of various possible states of the world. Given all this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules—maximizing expected utility, minimizing maximum possible loss, and more. Rational decision making and planning are typically the goals of data mining, but rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities for alternative states of affairs and knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause-and-effect relations. Extracting such causal information is often one of the principal goals of data mining and more generally of statistical inference.

Inference to Causes

Understanding causation is the hidden motivation behind the historical development of statistics. From the beginning of the field, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence [12]; the same idea is fundamental in the theory of experimental design. In 1934, Sewall Wright, a biologist, introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences); these graphs have become common representations of causal hypotheses in the social sciences, biology, computer science, and engineering.

In 1982, statisticians Harry Kiiveri and T. P. Speed combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition—if Y is not an effect of X, then X and Y are conditionally independent, given the direct causes of X. Kiiveri and Speed showed that much of the linear modeling literature tacitly assumed the Markov condition; the Markov condition is also satisfied by most causal models of categorical data and by virtually all causal models of systems without feedback. Under additional assumptions, conditional independence provides information about causal dependence. The most common—and most thoroughly investigated—additional assumption is that all conditional independencies are due to the Markov condition's being applied to the directed graph describing the actual causal processes generating the data, a requirement with many names (e.g., faithfulness). Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures (e.g., Bayes nets, belief nets, structural equation models, and path models).
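The Markov condition can be made concrete with a small sketch (a hypothetical DAG encoded as a child-to-parents map; the function names and encoding are invented for illustration). For each variable it lists the non-effects that the condition makes independent of that variable, given its direct causes:

```python
def markov_independencies(parents):
    """For each variable X in a DAG (given as a child -> parents map),
    the Markov condition says X is independent of its non-effects
    (non-descendants) given its direct causes (parents)."""
    nodes = set(parents)

    def descendants(v):
        out, stack = set(), [v]
        while stack:
            u = stack.pop()
            for w in nodes:
                if u in parents[w] and w not in out:
                    out.add(w)
                    stack.append(w)
        return out

    result = {}
    for v in nodes:
        # Parents are dropped from the list since they appear in the
        # conditioning set anyway.
        nondesc = nodes - descendants(v) - {v} - set(parents[v])
        result[v] = (sorted(nondesc), sorted(parents[v]))
    return result

# A chain X -> Z -> Y: the Markov condition gives Y independent of X
# given Z.
indeps = markov_independencies({"X": [], "Z": ["X"], "Y": ["Z"]})
```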

Causal inference from uncontrolled convenience samples is liable to many sources of error. Three of the most important are latent variables (or confounders), sample selection bias, and model equivalence. A latent variable is any unrecorded feature that varies among recorded units and whose variation influences recorded features. The result is an association among recorded features not in fact due to any causal influence of the recorded features themselves. The possibility of latent variables can seldom, if ever, be ignored in data mining. Sample selection bias occurs when the values of any two of the variables under study, say X and Y, themselves influence whether a feature is recorded in a database. That influence produces a statistical association between X and Y (and other variables) that has no causal significance. Datasets with missing values pose sample selection bias problems. Models with quite different graphs may generate the same constraints on probability distributions through the Markov condition and may therefore be indistinguishable without experimental intervention. Any procedure that arbitrarily selects one or a few of the equivalents may badly mislead users when the models are given a causal significance. If model search is viewed as a form of estimation, all of these difficulties are sources of inconsistency.

Standard data mining methods run afoul of these difficulties. The search algorithms in such commercial linear model analysis programs as LISREL select one from an unknown number of statistically indistinguishable models. Regression methods are inconsistent for all of the reasons listed earlier. For example, consider the structure: Y = aT + e_Y; X1 = bT + cQ + e_1; X2 = dQ + e_2, where T and Q are unrecorded. Neither X1 nor X2 has any influence on Y. For all nonzero values of a, b, c, d, however, in sufficiently large samples, regression of Y on X1 and X2 yields significant regression coefficients for X1 and X2. With the causal interpretation often given it, regression says that X1 and X2 are causes of Y. Assuming the Markov and faithfulness conditions, all that can be inferred correctly (in large samples) from data on X1, X2, and Y is that X1 is not a cause of X2 or of Y; X2 is not a cause of Y; Y is not a cause of X2; and there is no common cause of Y and X2. Nonregression algorithms implemented in the TETRAD II program [6, 10] give the correct result asymptotically in this case and in all cases in which the Markov and faithfulness conditions hold. The results are also robust against the three problems with causal inference noted in the previous paragraph [11]. However, the statistical decisions made by the algorithms are not really optimal, and the implementations are limited to the multinomial and multinormal families of probability distributions. A review of Bayesian search procedures for causal models is given in [2].
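The failure of regression in this structure can be checked by simulation. The sketch below (pure-Python least squares; the parameter values a = b = c = d = 1, the noise distributions, and the sample size are arbitrary choices for illustration) regresses Y on X1 and X2 and finds clearly nonzero coefficients even though neither variable influences Y:

```python
import random

def ols2(x1, x2, y):
    """Least squares for y ~ b1*x1 + b2*x2 (zero-mean data, no
    intercept), solving the 2x2 normal equations directly."""
    s11 = sum(v * v for v in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * v for u, v in zip(x1, y))
    s2y = sum(u * v for u, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
n = 20000
a = b = c = d = 1.0                        # illustrative parameter values
t = [rng.gauss(0, 1) for _ in range(n)]    # latent variable T
q = [rng.gauss(0, 1) for _ in range(n)]    # latent variable Q
y = [a * ti + rng.gauss(0, 1) for ti in t]
x1 = [b * ti + c * qi + rng.gauss(0, 1) for ti, qi in zip(t, q)]
x2 = [d * qi + rng.gauss(0, 1) for qi in q]

b1, b2 = ols2(x1, x2, y)
# Neither X1 nor X2 influences Y, yet both coefficients come out far
# from zero (for these parameter values the large-sample coefficients
# work out to b1 = 0.4 and b2 = -0.2 from the implied covariances).
```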

Prediction

Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed the two samples are obtained from the same probability distribution. As with estimation, prediction involves accuracy and uncertainty, the latter often measured by the variance of the predictor.

Prediction methods for this sort of prediction problem always assume some regularities—constraints—in the probability distribution. In data mining contexts, the constraints are typically either supplied by human experts or automatically inferred from the database. For example, regression assumes a particular functional form for relating variables or, in the case of logistic regression, relating the values of some variables to the probabilities of other variables; but constraints are implicit in any prediction method that uses a database to adjust or estimate the parameters used in prediction. Other forms of constraint may include independence, conditional independence, and higher-order conditions on correlations (e.g., tetrad constraints). On average, a prediction method guaranteeing satisfaction of the constraints realized in the probability distribution is more accurate and has a smaller variance than a prediction method that does not. Finding the appropriate constraints to be satisfied is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the probabilities of the alternative assumptions imposed by the model are available.

Another sort of prediction involves interventions that alter the probability distribution—as in predicting the values (or probabilities) of variables under a change in manufacturing procedures or changes in economic or medical treatment policies. Making accurate predictions of this kind requires some knowledge of the relevant causal structure and is generally quite different from prediction without intervention, although the same caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in [10]. In 1995, some of these procedures were extended and made into a more convenient calculus by Judea Pearl, a computer scientist. A related theory without graphical models was developed in 1974 by Donald Rubin, a statistician, and


others, and in 1986 by James Robins.

Well-known studies by Herbert Needleman, a physician and statistician, of the correlation of lead deposits in children's teeth and the children's IQs resulted, eventually, in removal of tetraethyl lead from gasoline in the U.S. One dataset Needleman examined included more than 200 subjects and measured a large number of covariates. In 1985, Needleman and his colleagues reanalyzed the data using backward stepwise regression of verbal IQ on these variables and obtained six significant regressors, including lead. In 1988, Steven Klepper, an economist, and his collaborators reanalyzed the data assuming that all the variables were measured with error. Klepper's model assumes that each measured number is a linear combination of the true value and an error and that the parameters of interest are not the regression coefficients but the coefficients relating the unmeasured true-value variables to the unmeasured true value of verbal IQ.

These coefficients are in fact indeterminate—or, in econometric terminology, unidentifiable. However, an interval estimate of the coefficients that is strictly positive or negative for each coefficient can be made if the amount of measurement error can be bounded with prior knowledge by an amount that varies from case to case. For example, Klepper found that the bound required to ensure the existence of a strictly negative interval estimate for the lead-to-IQ coefficient was much too strict to be credible; thus he concluded that the case against lead was not nearly as strong as Needleman's analysis suggested.

Allowing the possibility of latent variables, Richard Scheines in 1996 reanalyzed the correlations with the TETRAD II program and concluded that three of the six regressors could have no influence on IQ. The regression included the three extra variables only because the partial regression coefficient is estimated by conditioning on all other regressors—just the right thing to do for linear prediction, but the wrong thing to do for causal inference using the Markov condition (see the example at the end of the earlier section "Inference to Causes"). Using the Klepper model—but without the three irrelevant variables—and assigning to all of the parameters a normal prior probability with mean zero and a substantial variance, Scheines used Gibbs sampling to compute a posterior probability distribution for the lead-to-IQ parameter. The probability is very high that lead exposure reduces verbal IQ.

Conclusion

The statistical literature has a wealth of technical procedures and results to offer data mining, but it also offers several methodological morals:

• Prove that estimation and search procedures used in data mining are consistent under conditions reasonably thought to apply in applications;
• Use and reveal uncertainty—don't hide it;
• Calibrate the errors of search—for honesty and to take advantage of model averaging;
• Don't confuse conditioning with intervening; that is, don't take the error probabilities of hypothesis tests to be the error probabilities of search procedures.

Otherwise, good luck. You’ll need it.

References
1. Efron, B. The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics (SIAM), Number 38, Philadelphia, 1982.
2. Heckerman, D. Bayesian networks for data mining. Data Mining and Knowledge Discovery, submitted.
3. Hoaglin, D., Mosteller, F., and Tukey, J. Understanding Robust and Exploratory Data Analysis. Wiley, New York, 1983.
4. Miller, R. Simultaneous Statistical Inference. Springer-Verlag, New York, 1981.
5. Raftery, A.E. Bayesian model selection in social research. Working Paper 94-12, Center for Studies in Demography and Ecology, Univ. of Washington, Seattle, 1994.
6. Scheines, R., Spirtes, P., Glymour, C., and Meek, C. TETRAD II: Tools for Causal Modeling. Users Manual. Erlbaum, Hillsdale, N.J., 1994.
7. Schervish, M. Theory of Statistics. Springer-Verlag, New York, 1995.
8. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6 (1978), 461–464.
9. Smith, A.F.M., and Roberts, G.O. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc., Series B, 55 (1993), 3–23.
10. Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. Springer-Verlag, New York, 1993.
11. Spirtes, P., Meek, C., and Richardson, T. Causal inference in the presence of latent variables and selection bias. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, P. Besnard and S. Hanks, Eds. Morgan Kaufmann Publishers, San Mateo, Calif., 1995, pp. 499–506.
12. Stigler, S. The History of Statistics. Harvard University Press, Cambridge, Mass., 1986.

Additional references for this article can be found at http://www.research.microsoft.com/research/datamine/CACM-DM-refs/.

CLARK GLYMOUR is Alumni University Professor at Carnegie Mellon University and Valtz Family Professor of Philosophy at the University of California, San Diego. He can be reached at [email protected].

DAVID MADIGAN is an associate professor of statistics at the University of Washington in Seattle. He can be reached at [email protected].

DARYL PREGIBON is the head of statistics research at AT&T Laboratories. He can be reached at [email protected].

PADHRAIC SMYTH is an assistant professor of information and computer science at the University of California, Irvine. He can be reached at [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© ACM 0002-0782/96/1100 $3.50
