2
Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithms

Hamparsum Bozdogan
University of Tennessee, Knoxville, USA

CONTENTS
2.1 Introduction
2.2 What is Information Complexity: ICOMP?
2.3 Information Criteria for Multiple Regression Models
2.4 A GA for the Regression Modeling
2.5 Numerical Examples
2.6 Conclusion and Discussion
Acknowledgments
References

This paper develops a computationally feasible intelligent data mining and knowledge discovery technique that addresses the potentially daunting statistical and combinatorial problems presented by subset regression models. Our approach integrates novel statistical modelling procedures based on an information-theoretic measure of complexity. We form a three-way hybrid between: information measures of complexity, multiple regression models, and genetic algorithms (GAs). We demonstrate our new approach on a simulated example and on a real data set to illustrate the versatility and the utility of the new approach.

2.1 Introduction

In regression type problems, whether in multiple regression analysis, in logistic, or in ordinal logistic regression, model building and the evaluation and selection of a relevant subset of predictor variables on which to base inferences is a central problem in data mining to reduce the “curse of dimensionality,” a term coined by Richard Bellman (see Bellman, 1961) almost 42 years ago. Also see, e.g., Sakamoto et al., 1986, Miller 1990, Boyce et al., 1974. Often a quantitative, binary, or ordinal level


response variable is studied given a set of predictor variables. In such cases it is often desirable to determine which subsets of the predictors are most useful for forecasting the response variable, since interpreting a large number of regression coefficients can become unwieldy even for moderately sized data, and to achieve parsimony of unknown parameters, allowing both better estimation and clearer interpretation of the parameters included in these models.

The problem of selecting the best regression models is a non-trivial exercise, particularly when a large number of predictor variables exist and the researcher lacks precise information about the exact relationships among the variables.

In many cases the total number of possible models reaches into the millions (e.g., more than 20 predictor variables give over 2^20 ≈ 10^6 subsets) or into the billions (e.g., 30 predictor variables give 2^30 ≈ 10^9 subsets), and evaluation of all possible combinations of subsets is unrealistic in terms of time and cost.

Therefore, numerical optimization techniques and strategies for model selection are needed to explore the vast solution space. In general, the problem of subset selection using numerical techniques requires two components:

(1) an algorithm for the efficient searching of the solution space, and
(2) a criterion or measure for the comparison of competing models to help guide the search.

Most statistical packages for statistical analysis provide a Backward and Forward stepwise selection strategy for choosing the best subset model. However, it is well known that both Backward and Forward stepwise selection in regression analysis do not always find the best subset of predictor variables from the set of k variables. Major criticisms levelled on Backward and Forward stepwise selection are that little or no theoretical justification exists for the order in which variables enter or exit the algorithm (Boyce et al., 1974, p. 19, Wilkinson, 1989, p. 177-178), and that the probabilities specified a priori to enter and remove the variables in the analysis are chosen arbitrarily. Another criticism is that stepwise searching rarely finds the overall best model or even the best subset of a particular size (Mantel, 1970, Hocking, 1976, 1983, Moses, 1986).

Lastly, and most importantly, because only local searching is employed, stepwise selection provides extremely limited sampling from a small area of the vast solution space. Stepwise selection, at the very best, can only produce an “adequate” model (Sokal and Rohlf, 1981, p. 668).

Based on the above shortcomings of existing problems in regression analysis, the purpose of this paper is to introduce and develop a computationally feasible intelligent data mining and knowledge discovery technique based on the genetic algorithm (GA) and information-based model selection criteria for subset selection in multiple regression models. Our approach has also been extended to logistic regression and ordinal logistic regression models as a three-way hybrid. For space considerations, we will report and publish the results of these elsewhere. However, for more on subset selection of best predictors in ordinal logistic regression models, we refer the reader to Lanning and Bozdogan (2003) in this volume.


A GA is a stochastic search algorithm based on concepts of biological evolution and natural selection that can be applied to solving problems where vast numbers of possible solutions exist. GAs have been used in a wide variety of fields such as engineering, economics, game theory (Holland, 1992), computational sciences (Forrest, 1993), marketing (Bauer, 1994) and biology (Sumida et al., 1990). Unlike conventional optimization approaches, the GA requires no calculation of the gradient of the objective function and is not likely to be restricted to a local optimum (Goldberg, 1989). A GA treats information as a series of codes on a binary string, where each string represents a different solution to a given problem. These strings are analogous to the genetic information coded by genes on a chromosome. A string can be evaluated, according to some “fitness” value, for its particular ability to solve the problem. On the basis of the fitness values, strings are either retained or removed from the analysis after each run so that, after many runs, the best solutions have been identified. One important difficulty with any GA is in choosing an appropriate fitness function as the basis for evaluating each solution.

With respect to multiple regression analysis, the fitness value is a subset selection criterion for comparing subset models in a search of the best subset. This can be easily determined by using informational model selection criteria.

The format of this paper is as follows. In Section 2.2, we discuss what information complexity is, and present its general form in model selection. In Section 2.3, we give the derived closed form analytical expressions of the information complexity ICOMP of Bozdogan (1988, 1990, 1994, 2000), Akaike's (1973, 1974) information criterion AIC, Rissanen's (1978, 1986) MDL, Schwarz's (1978) SBC or BIC, and Bozdogan's (1987) Consistent AIC with Fisher information CAICF as decision rules for model selection and evaluation in multiple regression models. In Section 2.4, we develop the GA for the general regression modelling and discuss the new statistical software we developed with graphical user interface (GUI) in a flexible Matlab computational environment. Section 2.5 is devoted to simulated and real data examples in multiple regression models to demonstrate the versatility and utility of our new approach. In Section 2.6, we draw our conclusions.

2.2 What is Information Complexity: ICOMP?

In general statistical modeling and model evaluation problems, the concept of model complexity plays an important role. At the philosophical level, complexity involves notions such as connectivity patterns and the interactions of model components. Without a measure of “overall” model complexity, prediction of model behavior and assessing model quality is difficult. This requires detailed statistical analysis and computation to choose the best fitting model among a portfolio of competing models for a given finite sample. In this section, we develop and present information-theoretic ideas of a measure of “overall” model complexity in statistical modelling to help provide new approaches relevant to statistical inference.

Recently, based on Akaike's (1973) original AIC, many model-selection procedures that take the form of a penalized likelihood (a negative log likelihood plus a penalty term) have been proposed (Sclove, 1987). For example, for AIC this form is given by

AIC(k) = -2\log L(\hat{\theta}_k) + 2m(k)   (2.1)

where L(θ̂_k) is the maximized likelihood function, θ̂_k is the maximum likelihood estimate of the parameter vector θ_k under the model M_k, and m(k) is the number of independent parameters when M_k is the model.

In AIC, the compromise takes place between the maximized log likelihood, i.e., -2 log L(θ̂_k) (the lack of fit component), and m(k), the number of free parameters estimated within the model (the penalty component), which is a measure of complexity that compensates for the bias in the lack of fit when the maximum likelihood estimators are used. In using AIC, according to Akaike (1987, p. 319), the accuracy of parameter estimates is measured by a universal criterion, namely

Accuracy Measure = E[\text{log likelihood of the fitted model}]   (2.2)

where E denotes the expectation, since AIC is an unbiased estimator of minus twice the expected log likelihood.

We are motivated by considerations similar to those in AIC. However, we base the new procedure ICOMP on the structural complexity of an element or set of random vectors via a generalization of the information-based covariance complexity index of van Emden (1971).

For a general multivariate linear or nonlinear model defined by

Statistical model = Signal + Noise   (2.3)

ICOMP is designed to estimate a loss function:

Loss = Lack of Fit + Lack of Parsimony + Profusion of Complexity   (2.4)

in several ways using the additivity properties of information theory. We further base our developments on considerations similar to Rissanen (1976) in his final estimation criterion (FEC) in estimation and model identification problems, as well as Akaike's (1973) AIC, and its analytical extensions in Bozdogan (1987).

The development and construction of ICOMP is based on a generalization of the covariance complexity index originally introduced by van Emden (1971). Instead of penalizing the number of free parameters directly, ICOMP penalizes the covariance complexity of the model. It is defined by

ICOMP = -2\log L(\hat{\theta}) + 2C(\hat{\Sigma}_{Model}),   (2.5)

where L(θ̂_k) is the maximized likelihood function, θ̂_k is the maximum likelihood estimate of the parameter vector θ_k under the model M_k, C represents a real-valued complexity measure, and Cov(θ̂) = Σ̂_Model represents the estimated covariance matrix of the parameter vector of the model.

Since there are several forms and justifications of ICOMP based on (2.5), in this paper, for brevity, we will present the most general form of ICOMP, referred to as ICOMP(IFIM). ICOMP(IFIM) exploits the well-known asymptotic optimality properties of the MLE's, and uses the information-based complexity of the inverse-Fisher information matrix (IFIM) of a model. This is known as the celebrated Cramer-Rao lower bound (CRLB) matrix. See, e.g., Cramer (1946) and Rao (1945, 1947, 1948).

Before we derive ICOMP(IFIM), we first introduce some background material to understand the concept of complexity and give the definition of the complexity of a system next.

2.2.1 The Concept of Complexity and Complexity of a System

Complexity is a general property of statistical models that is largely independent of the specific content, structure, or probabilistic specification of the models. In the literature, the concept of complexity has been used in many different contexts. In general, there is not a unique definition of complexity in statistics, since the notion is “elusive” according to van Emden (1971, p. 8). Complexity has many faces, and it is defined under many different names such as those of “Kolmogorov Complexity” (Cover, Gacs, and Gray, 1989), “Shannon Complexity” (Rissanen, 1989), and “Stochastic Complexity” (Rissanen, 1987, 1989) in information theoretic coding theory, to mention a few. For example, Rissanen (1986, 1987, 1989), similar to Kolmogorov (1983), defines complexity in terms of the shortest code length for the data that can be achieved by the class of models, and calls it Stochastic Complexity (SC). The Monash School (e.g., Wallace and Freeman, 1987, Wallace and Dowe, 1993, Baxter, 1996) defines complexity in terms of Minimum Message Length (MML), which is based on evaluating models according to their ability to compress a message containing the data.

An understanding of complexity is necessary in general model building theory and inductive inference to study uncertainty in light of the data. Statistical models and methods are not exactly deductive since human beings often reason on the basis of uncertainties. Instead, they generally fall under the category of inductive inference. Inductive inference is the problem of choosing a parameter, or model, from a hypothesis, or model space, which best ‘explains’ the data under study (Baxter, 1996, p. 1). As discussed in Akaike (1994, p. 27), reasoning under uncertainty was studied by the philosopher C. S. Pierce (see, e.g., Pierce, 1955), who called it the logic of abduction, or in short, abduction. Abduction is a way of reasoning that uses general principles and observed facts to obtain new facts, but all with a degree of uncertainty. Abduction takes place using numerical functions and measures such as the information theoretic model selection criteria. Pierce insisted that the most original part of scientific work was related to the abductive phase, or the phase of selection of proper hypotheses. Therefore, developing a systematic procedure for abductive inference with the aid of the notion of complexity is “a prerequisite to the understanding of learning and evolutionary processes” (von Neumann, 1966). In this context, statistical modelling and model building is a science of abduction which forms the philosophical foundation of data mining and knowledge discovery. Hence, the study of complexity is of considerable practical importance for model selection of proper hypotheses or models within the data mining enterprise.

We give the following simple system theoretic definition of complexity to motivate a statistically defined measure.

Definition 2.1. Complexity of a system (of any type) is a measure of the degree of interdependency between the whole system and a simple enumerative composition of its subsystems or parts.

We note that this definition of complexity is different from the way it is frequently now used in the literature to mean the number of estimated parameters in a model. For our purposes, the complexity of a model is most naturally described in terms of interactions of the components of the model, and the information required to construct the model in the way it is actually defined. Therefore, the notion of complexity can be best explained if we consider the statistical model arising within the context of a real world system. For example, the system can be physical, biological, social, behavioral, economic, etc., to the extent that the system responses are considered to be random.

As complexity is defined in Definition 2.1, we are interested in the amount by which the whole system, say, S, is different from the composition of its components. If we let C denote any real-valued measure of complexity of a system S, then C(S) will measure the amount of the difference between the whole system and its decomposed components. Using the information theoretic interpretation, we define this amount to be the discrimination information of the joint distribution of the probability model at hand against the product of its marginal distributions. Discrimination information is equal to zero if the distributions are identical and is positive otherwise (van Emden, 1971, p. 25).

Thus, to quantify the concept of complexity in terms of a scalar index, we only have to express the interactions in a mathematical definition. We shall accomplish this by appealing to information theory since it possesses several important analytical advantages over the conventional procedures, such as those of additivity and constraining properties, and allowance to measure dependencies.

For more details on the system theoretic definition of complexity as background material, we refer the reader to van Emden (1971, p. 7 and 8), and Bozdogan (1990).

2.2.2 Information Theoretic Measure of Complexity of a Multivariate Distribution

For a random vector, we define the complexity as follows.

Definition 2.2. The complexity of a random vector is a measure of the interaction or the dependency between its components.


We consider a continuous p-variate distribution with joint density function f(x) = f(x_1, x_2, ..., x_p) and marginal density functions f_j(x_j), j = 1, 2, ..., p. Following Kullback (1968), Harris (1978), Theil and Fiebig (1984), and others, we define the informational measure of dependence between the random variables x_1, x_2, ..., x_p by

I(x) = I(x_1, x_2, \ldots, x_p) = E_f\left[\log \frac{f(x_1, x_2, \ldots, x_p)}{f_1(x_1) f_2(x_2) \cdots f_p(x_p)}\right].

Or, it is equivalently defined by

I(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_p)\, \log \frac{f(x_1, x_2, \ldots, x_p)}{f_1(x_1) f_2(x_2) \cdots f_p(x_p)}\, dx_1 \cdots dx_p,   (2.6)

where I is the Kullback-Leibler (KL) (1951) information divergence against independence. I(x) in (2.6) is a measure of expected dependency among the component variables, which is also known as the expected mutual information or the information proper.

• Property 1. I(x) ≡ I(x_1, x_2, ..., x_p) ≥ 0, i.e., the expected mutual information is nonnegative.

• Property 2. f(x_1, x_2, ..., x_p) = f_1(x_1) f_2(x_2) ··· f_p(x_p) for every p-tuple (x_1, x_2, ..., x_p) if and only if the random variables x_1, x_2, ..., x_p are mutually statistically independent. In this case the quotient in (2.6) is equal to unity, and its logarithm is then zero. Hence, I(x) ≡ I(x_1, x_2, ..., x_p) = 0. If it is not zero, this implies a dependency.

We relate the KL divergence in (2.6) to Shannon's (1948) entropy by the important identity

I(x) \equiv I(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} H(x_j) - H(x_1, x_2, \ldots, x_p),   (2.7)

where H(x_j) is the marginal entropy, and H(x_1, x_2, ..., x_p) is the global or joint entropy. Watanabe (1985) calls (2.7) the strength of structure and a measure of interdependence. We note that (2.7) is the sum of the interactions in a system with x_1, x_2, ..., x_p as components, which we define to be the entropy complexity of that system. This is also called the Shannon Complexity (see Rissanen, 1989). The more interdependency there is in the structure, the larger the sum of the marginal entropies will be relative to the joint entropy. If we wish to extract fewer and more important variables, it will be desirable that they be statistically independent, because the presence of interdependence means redundancy and mutual duplication of information contained in these variables (Watanabe, 1985).

To define the information-theoretic measure of complexity of a multivariate distribution, we let f(x) be a multivariate normal density function given by

f(x) = f(x_1, x_2, \ldots, x_p) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right\}   (2.8)

where µ = (µ_1, µ_2, ..., µ_p)', −∞ < µ_j < ∞, j = 1, 2, ..., p, and Σ > 0 (p.d.). We write x ∼ N_p(µ, Σ). Then the joint entropy H(x) = H(x_1, x_2, ..., x_p) from (2.8) for the case in which µ = 0 is given by

H(x) = H(x_1, x_2, \ldots, x_p) = -\int f(x)\log f(x)\, dx
     = \int f(x)\left[\frac{p}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right]dx
     = \frac{p}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\,\mathrm{tr}\left[\int f(x)\,\Sigma^{-1}(x - \mu)(x - \mu)'\,dx\right].   (2.9)

Then, since E[(x − µ)(x − µ)'] = Σ, we have

H(x) = H(x_1, x_2, \ldots, x_p) = \frac{p}{2}\log(2\pi) + \frac{p}{2} + \frac{1}{2}\log|\Sigma|
     = \frac{p}{2}\left[\log(2\pi) + 1\right] + \frac{1}{2}\log|\Sigma|.   (2.10)

See, e.g., Rao (1965, p. 450), and Blahut (1987, p. 250). From (2.10), the marginal entropy H(x_j) is

H(x_j) = -\int_{-\infty}^{\infty} f(x_j)\log f(x_j)\,dx_j = \frac{1}{2}\log(2\pi) + \frac{1}{2} + \frac{1}{2}\log(\sigma_j^2), \quad j = 1, 2, \ldots, p.   (2.11)

2.2.3 Initial Definition of Covariance Complexity

Van Emden (1971, p. 61) provides a reasonable initial definition of informational complexity of a covariance matrix Σ for the multivariate normal distribution. This measure is given by:

I(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} H(x_j) - H(x_1, x_2, \ldots, x_p)
= \sum_{j=1}^{p}\left[\frac{1}{2}\log(2\pi) + \frac{1}{2}\log(\sigma_{jj}) + \frac{1}{2}\right] - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{p}{2}.   (2.12)


This reduces to

C_0(\Sigma) = \frac{1}{2}\sum_{j=1}^{p}\log(\sigma_{jj}) - \frac{1}{2}\log|\Sigma|,   (2.13)

where σ_jj ≡ σ_j² is the j-th diagonal element of Σ and p is the dimension of Σ. Note that C_0(Σ) = 0 when Σ is a diagonal matrix (i.e., if the variates are linearly independent). C_0(Σ) is infinite if any one of the variables may be expressed as a linear function of the others (|Σ| = 0). If θ = (θ_1, θ_2, ..., θ_k) is a normal random vector with covariance matrix equal to Σ(θ), then C_0(Σ(θ)) is simply the KL distance between the multivariate normal density of θ and the product of the marginal densities of the components of θ. As pointed out by van Emden (1971), the result in (2.13) is not an effective measure of the amount of complexity in the covariance matrix Σ, since:

• C_0(Σ) depends on the marginal and common distributions of the random variables x_1, ..., x_p, and

• The first term of C_0(Σ) in (2.13) would change under orthonormal transformations.
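Although the chapter does not give code for (2.13), a minimal MATLAB sketch of this initial complexity measure could look as follows; the function name c0_complexity is ours, not part of the chapter's software, and a symmetric positive definite Σ is assumed.

function c0 = c0_complexity(Sigma)
% C0_COMPLEXITY  Initial covariance complexity C0(Sigma) of van Emden (1971),
% as in Eq. (2.13): half the sum of the log variances minus half the log determinant.
% A sketch; assumes Sigma is symmetric positive definite.
  c0 = 0.5*sum(log(diag(Sigma))) - 0.5*log(det(Sigma));
end

For a diagonal Σ the sketch returns zero (up to rounding), and it grows without bound as Σ approaches singularity, in agreement with the two properties just listed.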

2.2.4 Definition of Maximal Covariance Complexity

Since we defined the complexity as a general property of statistical models, we consider that the general definition of complexity of a covariance matrix Σ should be independent of the coordinates of the original random variables (x_1, x_2, ..., x_p) associated with the variances σ_j², j = 1, 2, ..., p. As it is, C_0(Σ) in (2.13) is coordinate dependent. However, to characterize the maximal amount of complexity of Σ, we can relate the general definition of complexity of Σ to the total amount of interaction, or C_0(Σ) in (2.13). We do this by recognizing the fact that the maximum of (2.13) under orthonormal transformations of the coordinate system may reasonably serve as the measure of complexity of Σ. This corresponds to observing the interaction between the variables under the coordinate system that most clearly represents it in terms of the measure I(x_1, x_2, ..., x_p) ≡ C_0(Σ). So, to improve on (2.13), we have the following proposition.

Proposition 2.1. A maximal information theoretic measure of complexity of a covariance matrix Σ of a multivariate normal distribution is

C_1(\Sigma) = \max_T C_0(\Sigma) = \max_T\left\{H(x_1) + \cdots + H(x_p) - H(x_1, x_2, \ldots, x_p)\right\}
            = \frac{p}{2}\log\left[\frac{\mathrm{tr}(\Sigma)}{p}\right] - \frac{1}{2}\log|\Sigma|,   (2.14)

where the maximum is taken over the orthonormal transformation T of the overall coordinate systems x_1, x_2, ..., x_p.

Proof: Following van Emden (1971, p. 61), Ljung and Rissanen (1978, p. 1421), and filling the gap in Maklad and Nichols (1980, p. 82), to find

C_1(\Sigma) = \max_T\left\{H(x_1) + \cdots + H(x_p) - H(x_1, x_2, \ldots, x_p)\right\}   (2.15)

we must find the orthonormal transformation, say T, of Σ that maximizes

\sum_{j=1}^{p}\log(\sigma^*_{jj}) = \log(\sigma^*_{11}) + \cdots + \log(\sigma^*_{pp}),   (2.16)

where the σ*_jj ≡ σ*_j² are the diagonal elements of the covariance matrix of Tx = T(x_1, x_2, ..., x_p), i.e., Cov(Tx) = Σ*. Since orthonormal transformations leave tr(Σ) = σ_11 + ... + σ_pp invariant, we

maximize \sum_{j=1}^{p}\log(\sigma_{jj})   (2.17)

subject to tr(Σ) = c, c = constant.

To carry out this maximization, we use the geometric and arithmetic means of σ_11, ..., σ_pp, given by

\left(\prod_{j=1}^{p}\sigma_{jj}\right)^{1/p} \leq \frac{1}{p}\sum_{j=1}^{p}\sigma_{jj}   (2.18)

with equality if and only if σ_11 = σ_22 = ... = σ_pp. The equality condition in (2.18) is always achieved by orthonormal transformations T that equalize all variances to within a certain error. This is shown by van Emden (1971, p. 66). Hence, from van Emden (1971, p. 61), and Maklad and Nichols (1980, p. 82), (2.18) implies that

\max \sum_{j=1}^{p}\log(\sigma_{jj}) = \max \log \prod_{j=1}^{p}\sigma_{jj} = p\log \mathrm{tr}(\Sigma) - p\log p = p\log\left[\frac{\mathrm{tr}(\Sigma)}{p}\right].   (2.19)

Now, replacing the first component of C_0(Σ) in (2.13), we find

C_1(\Sigma) = \max_T\left\{H(x_1) + \cdots + H(x_p) - H(x_1, x_2, \ldots, x_p)\right\}
            = \frac{p}{2}\log\left[\frac{\mathrm{tr}(\Sigma)}{p}\right] - \frac{1}{2}\log|\Sigma|   (2.20)

as a maximal information theoretic measure of the complexity of a covariance matrix Σ of a multivariate normal distribution.

C_1(Σ) in (2.20) is an upper bound to C_0(Σ) in (2.13), and it measures both the inequality among the variances and the contribution of the covariances in Σ (van Emden, 1971, p. 63). Such a measure is very important in model selection and evaluation problems to determine the strength of model structures, similarity, dissimilarity, and high-order correlations within the model. C_1(Σ) is independent of the coordinate system associated with the variances σ_j² ≡ σ_jj², j = 1, 2, ..., p. Furthermore, if, for example, one of the σ_j²'s is equal to zero, then C_0(Σ) in (2.13) takes the value “∞ − ∞”, which is “indeterminate,” whereas C_1(Σ) in (2.20) has the value “∞” (infinity), which has a mathematical meaning. Also, C_1(Σ) in (2.20) has rather attractive properties. Namely, C_1(Σ) is invariant with respect to scalar multiplication and orthonormal transformation. Further, C_1(Σ) is a monotonically increasing function of the dimension p of Σ; see Magnus and Neudecker (1999, p. 26). These properties are given and established in Bozdogan (1990).

The contribution of the complexity of the model covariance structure is that it provides a numerical measure to assess parameter redundancy and stability uniquely, all in one measure. When the parameters are stable, this implies that the covariance matrix should be approximately a diagonal matrix. This concept of a stable parameter is equivalent to the simplicity of the model covariance structure defined in Bozdogan (1990). Indeed, C_1(Σ) penalizes the scaling of the ellipsoidal dispersion, and the importance of circular distribution has been taken into account. It is because of these reasons that we use C_1(Σ) without using any transformations of Σ, and that we do not discard the use of C_0(Σ). If we write (2.14) as

C_1(\Sigma) = \frac{1}{2}\log\frac{\left(\mathrm{tr}(\Sigma)/p\right)^{p}}{|\Sigma|},   (2.21)

we interpret the complexity as the log ratio between the geometric mean of the average total variation and the generalized variance, since tr(Σ)/p is equal to the average total variation, and |Σ| is the generalized variance.

Further, if we let λ_1, λ_2, ..., λ_p be the eigenvalues of Σ, then tr(Σ)/p = λ̄_a = (1/p)∑_{j=1}^{p} λ_j is the arithmetic mean of the eigenvalues of Σ, and |Σ|^{1/p} = λ̄_g = (∏_{j=1}^{p} λ_j)^{1/p} is the geometric mean of the eigenvalues of Σ. Then the complexity of Σ can be written as

C_1(\Sigma) = \frac{p}{2}\log\left(\bar{\lambda}_a / \bar{\lambda}_g\right).   (2.22)


Hence, we interpret the complexity as the log ratio between the arithmetic mean and the geometric mean of the eigenvalues of Σ. It measures how unequal the eigenvalues of Σ are, and it incorporates the two simplest scalar measures of multivariate scatter, namely the trace and the determinant, into one single function. Indeed, Mustonen (1997) in a recent paper studies the fact that the trace (sum of variances) and the determinant of the covariance matrix Σ (generalized variance) alone do not meet certain essential requirements of variability in the multivariate normal distribution.
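The agreement between the trace/determinant form (2.14) and the eigenvalue form (2.22) is easy to verify numerically. The MATLAB sketch below is ours (the function name c1_complexity is not from the chapter's software); it computes C_1(Σ) both ways for any positive definite Σ, and the two outputs coincide up to rounding error.

function [c1_tr, c1_eig] = c1_complexity(Sigma)
% C1_COMPLEXITY  Maximal covariance complexity C1(Sigma).
% c1_tr  : trace/determinant form of Eq. (2.14)
% c1_eig : arithmetic-to-geometric mean eigenvalue form of Eq. (2.22)
% A sketch; assumes Sigma is symmetric positive definite.
  p      = size(Sigma, 1);
  c1_tr  = (p/2)*log(trace(Sigma)/p) - 0.5*log(det(Sigma));
  lam    = eig(Sigma);                              % eigenvalues of Sigma
  c1_eig = (p/2)*log(mean(lam) / prod(lam)^(1/p));  % lambda_bar_a / lambda_bar_g
end

For Sigma = eye(3) both outputs are zero, and they grow as the eigenvalues become more unequal.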

In general, large values of complexity indicate a high interaction between the variables, and a low degree of complexity represents less interaction between the variables. The minimum of C_1(Σ) corresponds to the “least complex” structure. In other words, C_1(Σ) → 0 as Σ → I, the identity matrix. This establishes a plausible relation between information-theoretic complexity and computational effort. Further, what this means is that the identity matrix is the least complex matrix. To put it in statistical terms, orthogonal designs or linear models with no collinearity are the least complex, or most informative, and the identity matrix is the only matrix for which the complexity vanishes. Otherwise, C_1(Σ) > 0, necessarily.

Geometrically, C_1(Σ) preserves all inner products, angles, and lengths under orthogonal transformations of Σ. An orthogonal transformation T indeed exists which corresponds to a sequence of plane rotations of the coordinate axes to equalize the variances. This can be achieved using Jacobi's iterative method or the Gauss-Seidel method (see Graham, 1987).

We note that the system correlation matrix can also be used to describe complexity. If we wish to show the interdependencies (i.e., correlations) among the parameter estimates, then we can transform the covariances to correlation matrices and describe yet another useful measure of complexity. Let R be the correlation matrix obtained from Σ by the relationship

R = \Lambda_\sigma \Sigma \Lambda_\sigma,   (2.23)

where \Lambda_\sigma = \mathrm{diag}(1/\sigma_1, \ldots, 1/\sigma_p) is a diagonal matrix whose diagonal elements equal 1/σ_j, j = 1, 2, ..., p. From (2.14), we have

C_1(R) = -\frac{1}{2}\log|R| \equiv C_0(R).   (2.24)

Diagonal operation of a covariance matrix Σ always reduces the complexity of Σ, and C_1(R) ≡ C_0(R) takes into account the interdependencies (correlations) among the variables. For simplicity, the C_0 measure based on the correlation matrix R will be denoted by C_R, and C_0(R) is written as C_R(Σ) for notational convenience, since R is obtained from Σ. Obviously, C_R is invariant with respect to scaling and orthonormal transformations and subsequently can be used as a complexity measure to evaluate the interdependencies among parameter estimates. Note that if |R| = 1, then I(x_1, x_2, ..., x_p) = 0, which implies the mutual independence of the variables x_1, x_2, ..., x_p. If the variables are not mutually independent, then 0 < |R| < 1 and I(x_1, x_2, ..., x_p) > 0. In this sense I(x) in (2.6) or (2.7) can also be viewed as a measure of dimensionality of model manifolds.

Next, we develop the informational complexity ICOMP(IFIM) approach to model evaluation based on the maximal covariance complexity C_1(·) and C_R(·).

2.2.5 ICOMP as an Approximation to the Sum of Two Kullback-Leibler Distances

In this section, we introduce a new model-selection criterion called ICOMP(IFIM) to measure the fit between multivariate normal linear and/or nonlinear structural models and observed data as an example of the application of the covariance complexity measure defined in the previous section. ICOMP(IFIM) resembles a penalized likelihood method similar to AIC and AIC-type criteria, except that the penalty depends on the curvature of the log likelihood function via the scalar C_1(·) complexity value of the estimated IFIM.

Proposition 2.2. For a multivariate normal linear or nonlinear structural model we define the general form of ICOMP(IFIM) as

ICOMP(IFIM) = -2\log L(\hat{\theta}) + 2C_1\left(\hat{F}^{-1}(\hat{\theta})\right),   (2.25)

where C_1 denotes the maximal informational complexity of F̂⁻¹, the estimated IFIM. To show this, suppose we consider a general statistical model of the form given by

y = m(\theta) + \varepsilon,   (2.26)

where:

• y = (y_1, y_2, ..., y_n) is an (n × 1) random vector of response values in ℜ^n;

• θ is a parameter vector in ℜ^k;

• m(θ) is a systematic component of the model in ℜ^n, which depends on the parameter vector θ, and its deterministic structure depends on the specific model considered; e.g., in the usual linear multiple regression model m(θ) = Xθ, where X is an (n × (k+1)) nonstochastic or constant design or model matrix with k explanatory variables, so that rank(X) = k + 1 = q; and ε is an (n × 1) random error vector with

E(\varepsilon) = 0, \quad E(\varepsilon\varepsilon') = \Sigma_\varepsilon.   (2.27)

Following Bozdogan and Haughton (1998), we denote by θ* the vector of parameters of the operating true model, and by θ any other value of the vector of parameters. Let f(y; θ) denote the joint density function of y given θ. Let I(θ*; θ) denote the KL distance between the densities f(y; θ*) and f(y; θ). Then, since the y_i are independent, i = 1, 2, ..., n, we have

I(\theta^*;\theta) = \int_{\Re^n} f(y;\theta^*)\log\left[\frac{f(y;\theta^*)}{f(y;\theta)}\right]dy
= \sum_{i=1}^{n}\int f_i(y_i;\theta^*)\log[f_i(y_i;\theta^*)]\,dy_i - \sum_{i=1}^{n}\int f_i(y_i;\theta^*)\log[f_i(y_i;\theta)]\,dy_i,   (2.28)

where the f_i, i = 1, 2, ..., n, are the marginal densities of the y_i. Note that the first term in (2.28) is the usual negative entropy H(θ*; θ*) ≡ H(θ*), which is constant for a given f_i(y_i; θ*). The second term is equal to:

-\sum_{i=1}^{n} E\left[\log f_i(y_i;\theta)\right],   (2.29)

which can be unbiasedly estimated by

-\sum_{i=1}^{n}\log f_i(y_i;\theta) = -\log L(\theta|y),   (2.30)

where log L(θ|y) is the log likelihood function of the observations evaluated at θ. Given a model M where the parameter vector is restricted, a maximum likelihood estimator θ̂_M can be obtained for θ, and the quantity

-2\sum_{i=1}^{n}\log f_i(y_i;\hat{\theta}_M) = -2\log L(\hat{\theta}_M)   (2.31)

evaluated. This will give us the estimate of the first KL distance, which is reminiscent of the derivation of AIC. On the other hand, a model M gives rise to an asymptotic covariance matrix Cov(θ̂_M) = Σ̂(θ̂_M) for the MLE θ̂_M. That is,

\hat{\theta}_M \sim N\left(\theta^*,\ \hat{\Sigma}(\hat{\theta}_M) \equiv \hat{F}^{-1}(\hat{\theta}_M)\right).   (2.32)

Now, invoking the C_1(·) complexity on Σ̂(θ̂_M) from the previous section, C_1(Σ̂(θ̂_M)) can be seen as the KL distance between the joint density and the product of marginal densities for a normal random vector with covariance matrix Σ̂(θ̂_M) via (2.14), maximized over all orthonormal transformations of that normal random vector (see Bozdogan, 1990). Hence, using the estimated covariance matrix, we define ICOMP as the sum of two KL distances given by:

ICOMP(IFIM) = -2\sum_{i=1}^{n}\log f_i(y_i;\hat{\theta}_M) + 2C_1\left(\hat{\Sigma}(\hat{\theta}_M)\right)
            = -2\log L(\hat{\theta}_M) + 2C_1\left(\hat{F}^{-1}(\hat{\theta}_M)\right).   (2.33)

The first component of ICOMP(IFIM) in (2.33) measures the lack of fit of the model, and the second component measures the complexity of the estimated inverse-Fisher information matrix (IFIM), which gives a scalar measure of the celebrated Cramer-Rao lower bound matrix and takes into account the accuracy of the estimated parameters and implicitly adjusts for the number of free parameters included in the model. See, e.g., Cramer (1946) and Rao (1945, 1947, 1948).

This approach has several rather attractive features. If F^{-1}_{jj}(θ̂_K) is the j-th diagonal element of the inverse-Fisher information matrix (IFIM), from Chernoff (1956) we know that F^{-1}_{jj}(·) represents the variance of the asymptotic distribution of √n(θ̂_j − θ_j), for j = 1, ..., K. Considering a subset of the K parameters of size k, we have that

F^{-1}_{jj}(\hat{\theta}_K) \geq F^{-1}_{jj}(\hat{\theta}_k).   (2.34)

Behboodian (1964) explains that the inequality (2.34) means that the variance of the asymptotic distribution of √n(θ̂_j − θ_j) can only increase as the number of unknown parameters is increased. This is an important result that impacts parameter redundancy. The proof of (2.34) is shown in Chen (1996, p. 6) in his doctoral dissertation, “Model Selection in Nonlinear Regression Analysis,” under my supervision.

The use of C_1(F̂⁻¹(θ̂_M)) in the information-theoretic model evaluation criteria takes into account the fact that as we increase the number of free parameters in a model, the accuracy of the parameter estimates decreases. As preferred according to the principle of parsimony, ICOMP(IFIM) chooses simpler models that provide more accurate and efficient parameter estimates over more complex, overspecified models.

We note that the trace of IFIM in the complexity measure involves only the diagonal elements, analogous to variances, while the determinant involves also the off-diagonal elements, analogous to covariances. Therefore, ICOMP(IFIM) contrasts the trace and the determinant of IFIM, and this amounts to a comparison of the geometric and arithmetic means of the eigenvalues of IFIM given by

ICOMP(IFIM) = -2\log L(\hat{\theta}_M) + s\log\left(\frac{\bar{\lambda}_a}{\bar{\lambda}_g}\right),   (2.35)

where s = dim(F̂⁻¹(θ̂_M)) = rank(F̂⁻¹(θ̂_M)). We note that ICOMP(IFIM) now looks in appearance like the CAIC of Bozdogan (1987), Rissanen's (1978) MDL, and Schwarz's (1978) Bayesian criterion SBC, except for using log(λ̄_a/λ̄_g) instead of log(n), where log(n) denotes the natural logarithm of the sample size n. A model with minimum ICOMP is chosen to be the best among all possible competing alternative models.

The greatest simplicity, that is zero complexity, is achieved when IFIM is proportional to the identity matrix, implying that the parameters are orthogonal and can be estimated with equal precision. In this sense, parameter orthogonality, several forms of parameter redundancy, and parameter stability are all taken into account.

We note that ICOMP(IFIM) in (2.33) penalizes the “bad scaling” of the parameters. It is important to note that well conditioning of the information matrix needs a simple structure, but the latter does not necessarily imply the former. For example, consider an information matrix that is diagonal with some diagonal elements close to zero. In this case, the corresponding correlation matrix is an identity matrix, which is the simplest. But the information matrix is poorly conditioned. Therefore, the analysis based on the correlation matrix often ignores an important characteristic, namely, the ratios of the diagonal elements in the information matrix, or the “scale” of these components.

Similar to AIC, to make ICOMP(IFIM) scale invariant with respect to scaling and orthonormal transformations in the model selection enterprise, we suggest the use of the correlational form of IFIM given by

\hat{F}^{-1}_R(\hat{\theta}) = D_{\hat{F}^{-1}}^{-1/2}\,\hat{F}^{-1}\,D_{\hat{F}^{-1}}^{-1/2}.   (2.36)

Then, ICOMP(IFIM)_R is defined by

ICOMP(IFIM)_R = -2\log L(\hat{\theta}_M) + 2C_1\left(\hat{F}^{-1}_R(\hat{\theta}_M)\right).   (2.37)

In this way ICOMP becomes invariant to one-to-one transformations of the parameter estimates. In the literature, several authors such as McQuarie and Tsai (1998, p. 367), and Burnham and Anderson (1998, p. 69), without reviewing the impact and the applications of ICOMP to many complex modelling problems, have erroneously interpreted the contribution of this novel approach over AIC and AIC-type criteria.

With ICOMP(IFIM), complexity is viewed not as the number of parameters in the model, but as the degree of interdependence (i.e., the correlational structure among the parameter estimates). By defining complexity in this way, ICOMP(IFIM) provides a more judicious penalty term than AIC, Rissanen's (1978, 1986) MDL, Schwarz's (1978) SBC (or BIC), and Bozdogan's (1987) Consistent AIC (CAIC). The lack of parsimony is automatically adjusted by C_1(F̂⁻¹(θ̂_M)) or C_1(F̂⁻¹_R(θ̂_M)) across the competing alternative portfolio of models as the parameter spaces of these models are constrained in the model selection process.

Following Morgera (1985, p. 612), we define the relative reduction of complexity (RRC) in terms of the estimated IFIM as

RRC = \frac{C_1(\hat{F}^{-1}(\hat{\theta}_M)) - C_1(\hat{F}^{-1}_R(\hat{\theta}_M))}{C_1(\hat{F}^{-1}(\hat{\theta}_M))},   (2.38)

and the percent relative reduction of complexity by

PRRC = \frac{C_1(\hat{F}^{-1}(\hat{\theta}_M)) - C_1(\hat{F}^{-1}_R(\hat{\theta}_M))}{C_1(\hat{F}^{-1}(\hat{\theta}_M))} \times 100\%.   (2.39)

The interpretation of RRC and PRRC is that they both measure the heteroscedastic complexity plus the correlational complexity of the model. In the general statistical modelling framework, what this means is that, when the parameter estimates are highly correlated, in nonlinear and in many other statistical modelling problems, one can remove the correlation by considering parameter transformations of the model. The difference between the complexities C_1(F̂⁻¹(θ̂_M)) and C_1(F̂⁻¹_R(θ̂_M)) can be used to show how well the parameters are scaled. A parameter transformation can reduce the complexity measure based on the correlation structure, but it can increase the complexity measure based on the maximal complexity. This occurs because a reduction in the correlation does not imply a reduction of the scaling effect. Indeed, the reduction in the correlation may even make the scaling worse. In this sense, ICOMP(IFIM) may be better than ICOMP(IFIM)_R, especially in nonlinear models, since it considers both of these effects in one criterion. For more on these, see, e.g., Chen (1996), Bozdogan (2000), and Chen and Bozdogan (2004).

There are other formulations of ICOMP, which are based on the covariance matrix properties of the parameter estimates of a model starting from their finite sampling distributions, and on a Bayesian justification of ICOMP. These versions of ICOMP are useful in linear and nonlinear models. For more details on this and other approaches, we refer the readers to Bozdogan (2000), and Bozdogan and Haughton (1998), where consistency properties of ICOMP have been studied in the case of the usual multiple regression models. The probabilities of underfitting and overfitting for ICOMP as the sample size n tends to infinity have been established. Through a large scale Monte Carlo “misspecification environment,” when the true model is not in the model set, the performance of ICOMP has been studied under different configurations of the experiment with varying sample sizes and error variances. The results obtained show that ICOMP class criteria overwhelmingly agree the most often with the KL decision, which goes to the heart of the consistency arguments about information criteria not studied before, since most of the studies are based on the fact that the true model considered is in the model set.

In concluding this section, we note that the difference between ICOMP class criteria and AIC, SBC/MDL, and CAIC is that with ICOMP we have the advantage of working with both biased and unbiased estimates of the parameters. Further, we have the advantage of using smoothed (or improved) covariance estimators of the models and measuring their complexities to study the robustness properties of different methods of parameter estimation and improved covariance estimators. AIC and AIC-type criteria are based on MLE's, which often are biased, and they do not fully take into account the concepts of parameter redundancy, accuracy, and parameter interdependencies in the model fitting and selection process. Also, ICOMP class criteria legitimize the role of the Fisher information matrix (FIM) (a tribute to Rao, 1945, 1947, 1948) as the natural metric on the parameter manifold of the model, which has remained academic in the statistical literature.

2.3 Information Criteria for Multiple Regression Models

We consider the multiple linear regression model in matrix form given by

y = Xβ + ε (2.40)

where y is an (n × 1) vector of observations on a dependent variable, X is a full rank (n × q) matrix of nonstochastic predetermined variables in standardized form, β is a (q × 1) coefficient vector, and ε is an (n × 1) vector of unknown disturbance terms, such that

\varepsilon \sim N(0, \sigma^2 I), \ \text{or equivalently} \ \varepsilon_i \sim N(0, \sigma^2) \ \text{for} \ i = 1, 2, \ldots, n.   (2.41)

Given the model in (2.40) under the assumption of normality, we can analytically express the density function of the regression model for a particular sample observation as

f(y_i | x_i, \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left[-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right].   (2.42)

That is, the random observation vector y is distributed as a multivariate normal with mean vector Xβ and covariance matrix σ²I_n. The likelihood function of the sample is:

L(\beta, \sigma^2 | y, X) = (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right],   (2.43)

and the log likelihood function is:

l(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.   (2.44)

Using the matrix differential calculus of Magnus and Neudecker (1999), the maximum likelihood estimates (MLE's) (β̂, σ̂²) of (β, σ²) are given by:

\hat{\beta} = (X'X)^{-1}X'y,   (2.45)

and

\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n} = \frac{RSS}{n}.   (2.46)
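These two estimates, together with the corresponding lack-of-fit term -2 log L(θ̂) = n log(2π) + n log(σ̂²) + n used by all of the criteria below, can be computed in a few lines of MATLAB. The sketch below is ours (the helper name reg_mle is not part of the chapter's software) and assumes the model (2.40)-(2.41):

function [beta_hat, sigma2_hat, minus2logL] = reg_mle(y, X)
% REG_MLE  Maximum likelihood estimates for y = X*beta + eps, Eqs. (2.45)-(2.46),
% together with the lack-of-fit term -2*logL = n*log(2*pi) + n*log(sigma2_hat) + n.
  n          = length(y);
  beta_hat   = (X'*X) \ (X'*y);        % Eq. (2.45)
  resid      = y - X*beta_hat;
  sigma2_hat = (resid'*resid) / n;     % Eq. (2.46), RSS/n
  minus2logL = n*log(2*pi) + n*log(sigma2_hat) + n;
end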

The maximum likelihood (ML) covariance matrix of the estimated regression coefficients is given by

\widehat{Cov}(\hat{\beta})_{MLE} = \hat{\sigma}^2(X'X)^{-1}   (2.47)

without centering and scaling the model matrix X. Also, the inverse Fisher information matrix (IFIM) is given by

\widehat{Cov}(\hat{\beta}, \hat{\sigma}^2) = \hat{F}^{-1} = \begin{bmatrix} \hat{\sigma}^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\hat{\sigma}^4}{n} \end{bmatrix}.   (2.48)

Now, we can define derived forms of ICOMP in multiple regression for Cov(β̂), F̂⁻¹, and the correlational form of IFIM, F̂⁻¹_R. These are as follows.


2.3.1 ICOMP Based on Complexity Measures

ICOMP(Reg)_{C_0} = -2\log L(\hat{\theta}) + 2C_0(\widehat{Cov}(\hat{\beta}))
= n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\left[\frac{1}{2}\sum_{j=1}^{q}\log(\sigma_{jj}(\hat{\beta})) - \frac{1}{2}\log\left|\widehat{Cov}(\hat{\beta})\right|\right]
= n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\left[\frac{1}{2}\sum_{j=1}^{q}\log(\sigma_{jj}(\hat{\beta})) - \frac{1}{2}\sum_{j=1}^{q}\log(\lambda_j)\right],   (2.49)

based on the original definition of the complexity C_0(·) in (2.13).

ICOMP(Reg)_{C_1} = -2\log L(\hat{\theta}) + 2C_1(\widehat{Cov}(\hat{\beta}))
= n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\left[\frac{q}{2}\log\left(\frac{\mathrm{tr}(\widehat{Cov}(\hat{\beta}))}{q}\right) - \frac{1}{2}\log\left|\widehat{Cov}(\hat{\beta})\right|\right]
= n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\left[\frac{q}{2}\log\left(\frac{\bar{\lambda}_a}{\bar{\lambda}_g}\right)\right],   (2.50)

based on C_1(·) in (2.14). If we use the estimated inverse Fisher information matrix (IFIM) in (2.48), then we define ICOMP(IFIM) as

ICOMP(IFIM)_{Regression} = -2\log L(\hat{\theta}_M) + 2C_1\left(\hat{F}^{-1}(\hat{\theta}_M)\right)
= n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2C_1\left(\hat{F}^{-1}(\hat{\theta}_M)\right),   (2.51)

where

2C_1\left(\hat{F}^{-1}(\hat{\theta}_M)\right) = (q+1)\log\left[\frac{\mathrm{tr}\left(\hat{\sigma}^2(X'X)^{-1}\right) + \frac{2\hat{\sigma}^4}{n}}{q+1}\right] - \log\left|\hat{\sigma}^2(X'X)^{-1}\right| - \log\left(\frac{2\hat{\sigma}^4}{n}\right).   (2.52)

In (2.52), as the number of parameters increases (i.e., as the size of X increases), the error variance σ̂² gets smaller even though the complexity gets larger. Also, as σ̂² increases, (X'X)⁻¹ decreases. Therefore, C_1(F̂⁻¹) achieves a trade-off between these two extremes and guards against multicollinearity.
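To make (2.48) and (2.51)-(2.52) concrete, here is a minimal MATLAB sketch of ICOMP(IFIM) for a candidate subset regression model. The helper names icomp_ifim_reg and reg_mle are ours (reg_mle is the earlier sketch), not the chapter's GUI software:

function icomp = icomp_ifim_reg(y, X)
% ICOMP_IFIM_REG  ICOMP(IFIM) for a subset regression model, Eqs. (2.48) and (2.51)-(2.52).
% A sketch: builds the estimated inverse Fisher information matrix and adds twice
% its C1 complexity to the lack-of-fit term.
  [~, sigma2_hat, minus2logL] = reg_mle(y, X);   % assumed helper from the earlier sketch
  n     = length(y);
  Finv  = blkdiag(sigma2_hat * inv(X'*X), 2*sigma2_hat^2/n);  % Eq. (2.48)
  s     = size(Finv, 1);                                      % s = q + 1
  C1    = (s/2)*log(trace(Finv)/s) - 0.5*log(det(Finv));      % Eq. (2.14) applied to IFIM
  icomp = minus2logL + 2*C1;                                  % Eq. (2.51)
end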

To preserve scale invariance, we use the correlational form of IFIM; that is, we use F̂⁻¹_R and define the correlational form of ICOMP(IFIM)_{Regression} given by

ICOMP(IFIM)_{Regression} = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2C_1\left(\hat{F}^{-1}_R(\hat{\theta}_M)\right),   (2.53)

where

\hat{F}^{-1}_R(\hat{\theta}_M) = D_{\hat{F}^{-1}}^{-1/2}\,\hat{F}^{-1}\,D_{\hat{F}^{-1}}^{-1/2} = USV'   (2.54)

using the singular value decomposition (svd) on F̂⁻¹_R. Svd(·) produces a diagonal matrix S, of the same dimension as F̂⁻¹_R and with nonnegative diagonal elements in decreasing order, which are the singular values of F̂⁻¹_R, and unitary matrices U and V, which satisfy UU' = VV' = I, so that F̂⁻¹_R = U·S·V'.
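A sketch of this scale-invariant version, again with our own helper names, only changes the complexity term: the estimated IFIM is rescaled by its diagonal before C_1 is applied (for this correlational form, tr(F̂⁻¹_R) = q + 1, so C_1 reduces to minus half the log determinant, as in (2.24)).

function icompR = icomp_ifim_reg_R(y, X)
% ICOMP_IFIM_REG_R  Scale-invariant ICOMP using the correlational form of IFIM,
% Eqs. (2.53)-(2.54).  A sketch built on the reg_mle helper from the earlier sketch.
  [~, sigma2_hat, minus2logL] = reg_mle(y, X);
  n      = length(y);
  Finv   = blkdiag(sigma2_hat * inv(X'*X), 2*sigma2_hat^2/n);   % Eq. (2.48)
  Dhalf  = diag(1 ./ sqrt(diag(Finv)));
  FinvR  = Dhalf * Finv * Dhalf;                                % Eq. (2.36)/(2.54)
  s      = size(FinvR, 1);
  C1R    = (s/2)*log(trace(FinvR)/s) - 0.5*log(det(FinvR));     % equals -0.5*log(det(FinvR))
  icompR = minus2logL + 2*C1R;                                  % Eq. (2.53)
end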

2.3.2 ICOMP Under Misspecification

Although we will not use this form of ICOMP in this paper, to be complete and to inform the readers, when the model is misspecified, we define ICOMP under misspecification as

ICOMP(IFIM)_{Misspec} = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2C_1(\widehat{Cov}(\hat{\theta})_{Misspec}),   (2.55)

where

\widehat{Cov}(\hat{\theta})_{Misspec} = \hat{F}^{-1}\hat{R}\hat{F}^{-1}   (2.56)

is a consistent estimator of the covariance matrix Cov(θ*_k). This is often called the “sandwiched covariance” or “robust covariance” estimator, since it is a correct covariance regardless of whether the assumed model is correct or not. It is called a sandwiched covariance because R̂ is the meat and the two F̂⁻¹'s are slices of the bread. When the model is correct we get F = R, and the formula reduces to the usual inverse Fisher information matrix F⁻¹ (White, 1982).

In the regression case, the Fisher information in inner-product form is given, as in (2.48), by

\hat{F}^{-1} = \begin{bmatrix} \hat{\sigma}^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\hat{\sigma}^4}{n} \end{bmatrix}   (2.57)

and the estimated outer-product form of the Fisher information matrix is given by

\hat{R} = \begin{bmatrix} \dfrac{1}{\hat{\sigma}^4}X'D^2X & X'\mathbf{1}\dfrac{\widehat{Sk}}{2\hat{\sigma}^3} \\ \left(X'\mathbf{1}\dfrac{\widehat{Sk}}{2\hat{\sigma}^3}\right)' & \dfrac{(n-q)(\widehat{Kt}-1)}{4\hat{\sigma}^4} \end{bmatrix},   (2.58)

where D² = diag(ε̂_1², ..., ε̂_n²), X is the (n × q) matrix of regressors or model matrix, Sk is the estimated residual skewness, Kt the kurtosis, and 1 is an (n × 1) vector of ones. That is,

\widehat{Sk} = \text{Coefficient of skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^3}{\hat{\sigma}^3}   (2.59)

and

\widehat{Kt} = \text{Coefficient of kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^4}{\hat{\sigma}^4}.   (2.60)

Hence, the “sandwiched covariance” or “robust covariance” estimator is given by

\widehat{Cov}(\hat{\theta})_{Misspec} = \begin{bmatrix} \hat{\sigma}^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\hat{\sigma}^4}{n} \end{bmatrix} \begin{bmatrix} \dfrac{1}{\hat{\sigma}^4}X'D^2X & X'\mathbf{1}\dfrac{\widehat{Sk}}{2\hat{\sigma}^3} \\ \left(X'\mathbf{1}\dfrac{\widehat{Sk}}{2\hat{\sigma}^3}\right)' & \dfrac{(n-q)(\widehat{Kt}-1)}{4\hat{\sigma}^4} \end{bmatrix} \begin{bmatrix} \hat{\sigma}^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\hat{\sigma}^4}{n} \end{bmatrix}.   (2.61)

Note that this covariance matrix in (2.61) should impose greater complexity than the inverse Fisher information matrix (IFIM). It also takes into account the presence of skewness and kurtosis, which is not possible with AIC and MDL/SBC. For more on model selection under misspecification, see Bozdogan and Magnus (2003).
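The construction in (2.56)-(2.61) is mechanical once the residuals are in hand; a MATLAB sketch (our own helper name, assuming the regression model of (2.40)) is:

function CovMis = sandwich_cov(y, X)
% SANDWICH_COV  "Sandwiched" (robust) covariance F^-1 * R * F^-1, Eqs. (2.56)-(2.61).
% A sketch; residual-based skewness and kurtosis follow Eqs. (2.59)-(2.60).
  [n, q] = size(X);
  beta   = (X'*X) \ (X'*y);
  e      = y - X*beta;                                   % residuals
  s2     = (e'*e)/n;   s1 = sqrt(s2);                    % sigma^2 and sigma
  Sk     = mean(e.^3) / s1^3;                            % Eq. (2.59)
  Kt     = mean(e.^4) / s2^2;                            % Eq. (2.60)
  Finv   = blkdiag(s2 * inv(X'*X), 2*s2^2/n);            % Eq. (2.57)
  D2     = diag(e.^2);
  ones_n = ones(n, 1);
  R      = [ (X'*D2*X)/s2^2,            X'*ones_n*Sk/(2*s1^3);
             (X'*ones_n*Sk/(2*s1^3))',  (n-q)*(Kt-1)/(4*s2^2) ];   % Eq. (2.58)
  CovMis = Finv * R * Finv;                              % Eq. (2.61)
end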

Another form of ICOMP is defined by

ICOMP(IFIM)_{Misspec} = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\left[\mathrm{tr}(\hat{F}^{-1}\hat{R}) + C_1(\hat{F}^{-1})\right].   (2.62)

Similarly, Generalized Akaike's (1973) information criterion (GAIC) is defined by

GAIC = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2\,\mathrm{tr}(\hat{F}^{-1}\hat{R}).   (2.63)

For this, see Bozdogan (2000). Bozdogan and Ueno (2000) modified Bozdogan's (1987) CAICF, giving

CAICF_E = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + k[\log(n) + 2] + \log\left|\hat{F}\right| + \mathrm{tr}(\hat{F}^{-1}\hat{R}),   (2.64)

which includes Akaike's approach and CAICF as special cases. For the application and performance of CAICF_E, we refer the reader to Irizarry (2001) for model selection in local likelihood estimation.

We note that the term tr(F̂⁻¹R̂) in (2.62)-(2.64) is important because it provides information on the correctness of the assumed class of potential models, as discussed in White (1982). A fundamental assumption underlying classical model selection criteria is that the true model is considered to lie within a specified class of potential models. In general, this is not always the case, and often the true model may not be within the model set considered. Therefore, our approach guards us against the misspecification of the probability model as we actually fit and evaluate the models. In this sense, this result is very important in practice, and it is often ignored.


2.3.3 AIC and AIC-Type Criteria

AIC for the regression model, to be used as a fitness value in the GA, is given by

AIC(Regression) = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + 2(k+1).   (2.65)

Similarly, the Rissanen (1978) and Schwarz (1978) (MDL/SBC) criterion is defined by

MDL/SBC(Regression) = n\log(2\pi) + n\log(\hat{\sigma}^2) + n + k\log(n).   (2.66)
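Both criteria in (2.65)-(2.66) share the same lack-of-fit term as ICOMP; a MATLAB sketch (our helper name, with k the number of predictor variables in the candidate subset, excluding the constant term) is:

function [aic, sbc] = aic_sbc_reg(y, X, k)
% AIC_SBC_REG  AIC and MDL/SBC for a fitted subset regression model, Eqs. (2.65)-(2.66).
% k is the number of predictor variables in the subset (constant term excluded).
  n          = length(y);
  beta       = (X'*X) \ (X'*y);
  sigma2     = sum((y - X*beta).^2) / n;
  minus2logL = n*log(2*pi) + n*log(sigma2) + n;
  aic        = minus2logL + 2*(k + 1);     % Eq. (2.65)
  sbc        = minus2logL + k*log(n);      % Eq. (2.66)
end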

We note that ICOMP and ICOMP(IFIM) are much more general than AIC. They incorporate the assumption of dependence and independence of the residuals and help the analyst to consider the risks of both under- and overparameterized models. ICOMP and ICOMP(IFIM) relieve the researcher of any need to consider the parameter dimension of a model explicitly (see Bozdogan and Haughton, 1998, for a more detailed comparison).

2.4 A GA for the Regression Modeling

Genetic algorithms(GAs)are a part of evolutionary computing. This is a very fastgrowing area of artificial intelligence (AI). As it is well known,GAsare inspired byDarwin’s theory about evolution. Simply said, solution to aproblem solved byGAsis evolved. Genetic Algorithms(GAs)were invented by John Holland and developedby him and his students and colleagues. This led to Holland’sbook Adaption inNatural and Artificial Systemspublished in 1975. In 1992 John Koza used geneticalgorithm to evolve programs to perform certain tasks. He called his method“geneticprogramming”(GP). LISPprograms were used, because programs in this languagecan be expressed in the form of a“parse tree,” which is the object theGAworks on.

A GA is started with a set of solutions (represented by chromosomes) called a population. Solutions from one population are taken and used to form a new population. This is motivated by the hope that the new population will be better than the old one. Solutions that are selected to form new solutions (offspring) are selected according to their fitness value: the more suitable they are, the more chances they have to reproduce.

Our implementation of the GA for the problem of model selection in multiple linear regression basically follows Goldberg (1989). Recall that the general regression model can be represented as:

y = Xβ + ε. (2.67)

A GA for the problem of model selection in subset regression models can be implemented using the following steps. For a comprehensive background of GAs and



related topics, we refer readers to Goldberg (1989), Michalewicz (1992), and others. Goldberg's GA (also called the simple genetic algorithm, SGA) contains the following components.

• A genetic coding scheme for the possible regression models

Each regression model is encoded as a string, where each locus in the string is a binary code indicating the presence (1) or absence (0) of a given predictor variable. Every string has the same length, but each contains a different binary coding representing a different combination of predictor variables. For example, in a k = 5 variable regression with a constant term, the string 1 0 1 0 1 1 represents a model where the constant term is included in the model, variable 1 is excluded from the model, variable 2 is included in the model, and so on.
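As a small illustration of this coding scheme (Python/NumPy; the helper name `decode_model` is hypothetical and not taken from the chapter's software):

```python
import numpy as np

def decode_model(chromosome, X_full):
    """Return the design matrix containing only the columns switched on in
    `chromosome`; X_full is assumed to hold the constant as its first column."""
    mask = np.asarray(chromosome, dtype=bool)
    return X_full[:, mask]

# The string 1 0 1 0 1 1 from the text: constant in, x1 out, x2 in, x3 out, x4 in, x5 in.
chromosome = np.array([1, 0, 1, 0, 1, 1])
```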

• Generating an initial population of models

Population size (i.e., the number of models fitted) N is an important parameter of the GA. The population size says how many chromosomes are in a population (in one generation). If there are too few chromosomes, the GA has few opportunities to perform crossover and only a small part of the search space is explored. On the other hand, if there are too many chromosomes, the GA slows down. Research shows that beyond some limit (which depends mainly on the encoding and the problem) it is not useful to increase the population size, because it does not make solving the problem faster. We first initialize the first population of N random breeding models. Note that the population size N, representing the number of models to begin the first run, is chosen by the investigator and not at random. Our algorithm is flexible in that it allows one to choose any population size.

• A fitness function to evaluate the performance of any model

In general we can use any one of the model selection criteria described in Section 2.3 as the fitness function in our GA for the regression analysis. However, for the purpose of illustrating the GA and for brevity in this paper, we restrict our attention to the ICOMP criterion. Analysts can choose any appropriate model selection criterion based on their needs and preferences.

• A mechanism to select the fitter models

This step involves selecting models based on their ICOMP(IFIM) values for the composition of the mating pool. After calculating ICOMP(IFIM) for each of the possible subset models in the population, we subtract the criterion value for each model from the highest criterion value in the population. In other words, we calculate

$$\Delta ICOMP_{(i)}(IFIM) = ICOMP(IFIM)_{Max} - ICOMP(IFIM)_{(i)} \qquad (2.68)$$

for $i = 1, \ldots, N$, where N is the population size. Next, we average these differences; that is, we compute

$$\overline{\Delta ICOMP}(IFIM) = \frac{1}{N}\sum_{i=1}^{N}\Delta ICOMP_{(i)}(IFIM). \qquad (2.69)$$



Then the ratio of each model's difference value to the mean difference value is calculated; that is, we compute

$$\Delta ICOMP_{(i)}(IFIM)\,/\,\overline{\Delta ICOMP}(IFIM). \qquad (2.70)$$

This ratio is used to determine which models will be included in the mating pool. The chance of a model being mated is proportional to this ratio. In other words, a model with a ratio of two is twice as likely to mate as a model with a ratio of one. The process of selecting mates to produce offspring models continues until the number of offspring equals the initial population size. This is called proportional selection or fitting. There is also rank selection or fitting with ICOMP; for this, see Bearse and Bozdogan (2002).
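A minimal sketch of this proportional selection step is given below (Python/NumPy; the function name `select_mating_pool` is an illustrative choice, and lower ICOMP(IFIM) is taken to be better, as in the text).

```python
import numpy as np

def select_mating_pool(icomp_values, rng=np.random.default_rng(0)):
    """Proportional selection based on (2.68)-(2.70): mating chances are
    proportional to each model's distance below the worst criterion value,
    scaled by the mean distance."""
    icomp = np.asarray(icomp_values, dtype=float)
    delta = icomp.max() - icomp                            # equation (2.68)
    mean_delta = delta.mean()                              # equation (2.69)
    ratios = delta / mean_delta if mean_delta > 0 else np.ones_like(delta)  # (2.70)
    probs = ratios / ratios.sum()
    N = icomp.size
    return rng.choice(N, size=N, replace=True, p=probs)    # indices of the mating pool
```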

• A reproductive operation to perform mating of parent models to produce offspring models

Mating is performed as a crossover process. Whether a model chosen for the mating pool undergoes crossover is controlled by the crossover probability (pc), also called the crossover rate. The crossover probability (pc) is often determined by the investigator. A crossover probability of zero simply means that the members of the mating pool are carried over into the next generation and no offspring are produced. A crossover probability (pc) = 1 indicates that mating (crossover) always occurs between any two parent models chosen from the mating pool; thus the next generation will consist only of offspring models (not of any model from the previous generation).

During the crossover process, we randomly pick a position along each pair of parent models (strings) as the crossover point. For any pair of parents, the strings are broken into two pieces at the crossover point and the portions of the two strings to the right of this point are interchanged between the parents to form two offspring strings, as shown in Figure 2.1.

In this case each parent has ten loci. A randomly chosen point along the length of each parent model becomes the crossover point where the models are broken and then reattached to another parent to form two new models. We have a choice of several types of crossover operations. Here, we give just three choices, which will suffice for all practical purposes.

Single point crossover - one crossover point is selected; the binary string from the beginning of the chromosome to the crossover point is copied from one parent, and the rest is copied from the second parent:

Parent A     Parent B     Offspring
11001011  +  11011111  =  11001111

Two point crossover - two crossover points are selected; the binary string from the beginning of the chromosome to the first crossover point is copied from one parent, the part from the first to the second crossover point is copied from the second parent, and the rest is copied from the first parent:

Parent A     Parent B     Offspring
11001011  +  11011111  =  11011111



FIGURE 2.1 An example of the mating process, by means of crossing-over, for a given pair of models (strings).

Uniform crossover - bits are randomly copied from the first or from the second parent:

Parent A     Parent B     Offspring
11001011  +  11011101  =  11011111

In our algorithm, the user has the option of choosing any one of the above crossover options. The user also has the option of choosing what is called the elitism rule. This means that at least one best solution is copied without changes to the new population, so the best solution found can survive to the end of the run.
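For concreteness, minimal sketches of the three crossover operators are given below (Python/NumPy with illustrative function names, not the chapter's Matlab GUI code).

```python
import numpy as np

rng = np.random.default_rng()   # shared random source for these sketches

def single_point(a, b, rng=rng):
    point = rng.integers(1, len(a))                 # cut somewhere inside the string
    return np.concatenate([a[:point], b[point:]])

def two_point(a, b, rng=rng):
    p1, p2 = np.sort(rng.choice(np.arange(1, len(a)), size=2, replace=False))
    return np.concatenate([a[:p1], b[p1:p2], a[p2:]])

def uniform(a, b, rng=rng):
    take_from_a = rng.random(len(a)) < 0.5          # each locus copied from a or from b
    return np.where(take_from_a, np.asarray(a), np.asarray(b))
```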

• A random effect of mutation to change the composition of new offspring models

Mutation of models is used in a GA as another means of creating new combinations of variables so that the searching process can jump to another area of the fitness landscape instead of searching in a limited area. We permit mutation by specifying a mutation rate or probability at which a randomly selected locus can change from 0 to 1 or from 1 to 0. Thus, a randomly selected predictor variable is either added to or removed from the model.
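A matching sketch of the mutation step (same caveats as the sketches above):

```python
import numpy as np

def mutate(chromosome, p_mutation=0.01, rng=np.random.default_rng()):
    """Each locus flips with probability p_mutation, adding or removing the
    corresponding predictor variable from the model."""
    child = np.asarray(chromosome).copy()
    flip = rng.random(child.size) < p_mutation
    child[flip] = 1 - child[flip]
    return child
```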



Depending on the particular crossover and mutation rates, the second generation will be composed entirely of offspring models or of a mixture of offspring and parent models. Models in the second generation then go on to produce the third generation; the process continues one generation after another for a specified number of generations controlled by the analyst.

In summary, the outline of the steps of the GA is as follows (a compact code sketch of this loop is given after the list):

1. [Start] Generate a random population of N chromosomes (suitable solutions for the problem).

2. [Fitness] Evaluate the fitness of each chromosome in the population using one of the model selection criteria.

3. [New population] Create a new population by repeating the following steps until the new population is complete.

(a) [Selection] Select two parent chromosomes from the population according to their fitness (e.g., ICOMP value); the better the fitness, the bigger the chance of being selected.

(b) [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents. There are three choices of crossover.

(c) [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).

(d) [Accepting] Place the new offspring in the new population.

4. [Replace] Use the newly generated population for a further run of the algorithm and look for the minimum of the model selection criterion used.

5. [Test] If the end condition is satisfied based on the model selection criterion, stop and return the best solution in the current population.

6. [Loop] Go to step 2.
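Below is a compact sketch of this loop (Python/NumPy, for illustration only). It assumes the operator sketches given earlier in this section, or any user-supplied replacements, and is not the chapter's Matlab GUI code. For example, `ga_subset_selection(X_full, y, regression_fitness, select_mating_pool, uniform, mutate)` would run the GA with the earlier sketches as operators.

```python
import numpy as np

def ga_subset_selection(X_full, y, fitness, select, crossover, mutate,
                        pop_size=30, n_generations=15,
                        p_crossover=0.7, p_mutation=0.01, elitism=True,
                        rng=np.random.default_rng()):
    """Sketch of the GA outlined above for subset selection in regression."""
    n_loci = X_full.shape[1]                       # constant + k candidate predictors
    pop = rng.integers(0, 2, size=(pop_size, n_loci))
    pop[:, 0] = 1                                  # keep the constant term switched on

    def score(chrom):                              # step 2: fitness of one chromosome
        return fitness(X_full[:, chrom.astype(bool)], y)

    for _ in range(n_generations):                 # steps 3-6 of the outline
        scores = np.array([score(c) for c in pop])
        pool = select(scores, rng=rng)             # indices of the mating pool
        new_pop = [pop[scores.argmin()].copy()] if elitism else []
        while len(new_pop) < pop_size:
            a, b = pop[rng.choice(pool, size=2)]
            child = crossover(a, b, rng=rng) if rng.random() < p_crossover else a.copy()
            child = mutate(child, p_mutation, rng=rng)
            child[0] = 1                           # never drop the constant column
            new_pop.append(child)
        pop = np.array(new_pop)

    scores = np.array([score(c) for c in pop])
    return pop[scores.argmin()], scores.min()      # best chromosome and its criterion value
```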

In the next section, all our computations are carried out using the newly developed graphical user interface (GUI) software in Matlab for the GA subset selection of best predictors. The GUI for solving the subset selection model problem is easy to use and very user friendly. The following is a summary of the inputs, outputs, and usage of the GUI.



List of inputs and usage:
No. of generations: Key in an integer value.
Population size: Key in an integer value.
Fitness Values: Check an option in the block: AIC, ICOMP, ICOMP(IFIM).
Probability of Crossover: Key in a real number value from 0 to 1. Check an option in the block: Single Point, Two-point, Uniform.
Probability of Mutation: Key in a real number value from 0 to 1.
Elitism Rule: Check/uncheck the option in the checkbox.
Input Data files: Button Y: for the response variable Y; Button X: for the set of predictor variables X.
Go: To solve the problem.
Reset: Reset all keyed-in inputs and outputs.
Exit: Exit the program.

List of outputs and usage:

1. View 2-D/View 3-D buttons: Show the 2D/3D plots of criterion values versus generations.

2. Save Figure: Pops up the current figure showing in the GUI plot window; the user can then save the pop-up figure.

3. Output in Matlab Command Window: Shows the table of generations of the GA, the fitted chromosomes (variables) and binary strings, and the criterion score values.

4. Output File: Same as the output in the Matlab command window.

5. Save Figures: Simply click on View 2-D/View 3-D after the run.

2.5 Numerical Examples

We consider two examples to illustrate the implementation of the GA in subset selection of best predictors.

2.5.1 Subset Selection of Best Predictors in Multiple Regression: A Simulation Example

In this example, we generated the values for y and x1, x2, ..., x10 using the following simulation protocol.



We simulate the first five predictors using the following:

$$\begin{aligned}
x_1 &= 10 + \varepsilon_1,\\
x_2 &= 10 + 0.3\varepsilon_1 + \alpha\varepsilon_2, \quad \text{where } \alpha = \sqrt{1-0.3^2} = \sqrt{0.91} = 0.9539,\\
x_3 &= 10 + 0.3\varepsilon_1 + 0.5604\,\alpha\varepsilon_2 + 0.8282\,\alpha\varepsilon_3,\\
x_4 &= -8 + x_1 + 0.5x_2 + 0.3x_3 + 0.5\varepsilon_4,\\
x_5 &= -5 + 0.5x_1 + x_2 + 0.5\varepsilon_5,
\end{aligned}$$

where $\varepsilon_i$ is independent and identically distributed (i.i.d.) according to $N(0,\sigma^2 = 1)$ for $i = 1, \ldots, n$ observations, and also $\varepsilon_1,\varepsilon_2,\varepsilon_3,\varepsilon_4,\varepsilon_5 \sim N(0,\sigma^2 = 1)$. The parameter $\alpha$ controls the degree of collinearity in the predictors. Then, we generate the response variable y from:

$$y_i = -8 + x_1 + 0.5x_2 + 0.3x_3 + 0.5\varepsilon_i, \quad \text{for } i = 1, \ldots, n = 100 \text{ observations.}$$

Further, we generate five redundant variables x6, ..., x10 using the uniform random numbers given by

$$x_6 = 6\times rand(0,1), \;\ldots,\; x_{10} = 10\times rand(0,1)$$

and fit a multiple regression model of y on X = [x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10] for n = 100 observations, where x0 = 1 is the constant column, an (n×1) vector of ones.
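A sketch of this simulation protocol in Python/NumPy is shown below (the random seed and the fresh error draw used for y are assumptions not stated in the text).

```python
import numpy as np

rng = np.random.default_rng(123)                    # arbitrary seed, for illustration

n = 100
eps = rng.standard_normal((5, n))                   # eps_1, ..., eps_5 ~ N(0, 1)
alpha = np.sqrt(1 - 0.3**2)                         # = 0.9539, controls collinearity

x1 = 10 + eps[0]
x2 = 10 + 0.3 * eps[0] + alpha * eps[1]
x3 = 10 + 0.3 * eps[0] + 0.5604 * alpha * eps[1] + 0.8282 * alpha * eps[2]
x4 = -8 + x1 + 0.5 * x2 + 0.3 * x3 + 0.5 * eps[3]
x5 = -5 + 0.5 * x1 + x2 + 0.5 * eps[4]
y  = -8 + x1 + 0.5 * x2 + 0.3 * x3 + 0.5 * rng.standard_normal(n)

redundant = np.column_stack([c * rng.random(n) for c in range(6, 11)])   # x6, ..., x10
X_full = np.column_stack([np.ones(n), x1, x2, x3, x4, x5, redundant])    # x0 = ones
```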

We expect that the GA run should pick the subset {x0, x1, x2, x3} as the best subset selected using the minimum ICOMP value.

The following output from Matlab shows that the GA, with ICOMP(IFIM) as the fitness function, can detect the relationship and pick the predictors {x0, x1, x2, x3} as the best subset chosen in most of the generations of the GA.

Simulation of Collinear Data: n = 100
Number of generations = 15
Population Size = 30
Fitness Value = ICOMP(IFIM)
Probability of crossover = 0.7 (two-point crossover is used)
Elitism = Yes
Probability of mutation = 0.01

TABLE 2.1 Parameters of the GA for the simulated model.

In Table 2.2, we show the results of just one run of the GA with the parameters given above in Table 2.1.



Generation   Chromosome (Variables)   Binary String            ICOMP(IFIM)
 1           0 1 2 3 9 10             1 1 1 1 0 0 0 0 0 1 1    160
 2           0 1 2 3                  1 1 1 1 0 0 0 0 0 0 0    155
 3           0 1 2 3 8 10             1 1 1 1 0 0 0 0 1 0 1    160.11
 4           0 1 2 3                  1 1 1 1 0 0 0 0 0 0 0    155
 5           0 1 2 3                  1 1 1 1 0 0 0 0 0 0 0    155
 6           0 1 2 3 10               1 1 1 1 0 0 0 0 0 0 1    156.33
 7           0 1 2 3 10               1 1 1 1 0 0 0 0 0 0 1    156.33
 8           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
 9           0 1 2 3 6 7              1 1 1 1 0 0 1 1 0 0 0    157.77
10           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
11           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
12           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
13           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
14           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8
15           0 1 2 3 7 10             1 1 1 1 0 0 0 1 0 0 1    157.8

TABLE 2.2 The results from one run of the GA for the simulated model.

Looking at Table 2.2, we note that the GA picks the best subset {x0, x1, x2, x3} very quickly in the second generation, and again in the fourth and fifth generations. Indeed, we also note that in each of the other generations the subset {x0, x1, x2, x3} shows up along with one or two redundant variables.

We can repeat this experiment by simulating new X–y data sets and running the GA many times in order to further illustrate the accuracy and efficiency of the GA.

2.5.2 Subset Selection of Best Predictors in Multiple Regression: A Real Example

In this example we determine the best subset of predictors of y = Percent body fat from Siri's (1956) equation, using k = 13 predictors: x1 = Age (years), x2 = Weight (lbs), x3 = Height (inches), x4 = Neck circumference (cm), x5 = Chest circumference (cm), x6 = Abdomen 2 circumference (cm), x7 = Hip circumference (cm), x8 = Thigh circumference (cm), x9 = Knee circumference (cm), x10 = Ankle circumference (cm), x11 = Biceps (extended) circumference (cm), x12 = Forearm circumference (cm), and x13 = Wrist circumference (cm), using the GA with ICOMP as the fitness function.

The data contain the estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for n = 252 men. This is a good example to illustrate the versatility and utility of our approach using multiple regression analysis with the GA. This data set is maintained by Dr. Roger W. Johnson of the Department of Mathematics & Computer Science at South Dakota School of Mines & Technology (email address: [email protected], and web address: http://silver.sdsmt.edu/~rwjohnso).

Accurate measurement of body fat is inconvenient/costly and it is desirable to have easy methods of estimating body fat that are not inconvenient/costly.



FIGURE 2.2 2D Plot: Summary of a GA Run for the Simulated Data (mean criterion value = “o”, minimum criterion value = “x”; criterion value plotted against generations).

A variety of popular health books suggest that readers assess their health, at least in part, by estimating their percentage of body fat. In Bailey (1994), for instance, the reader can estimate body fat from tables using their age and various skin-fold measurements obtained by using a caliper. Other texts give predictive equations for body fat using body circumference measurements (e.g., abdominal circumference) and/or skin-fold measurements. See, for instance, Behnke and Wilmore (1974, pp. 66-67); Wilmore (1976, p. 247); or Katch and McArdle (1977, pp. 120-132). Percentage of body fat for an individual can be estimated once body density has been determined. One (e.g., Siri (1956)) assumes that the body consists of two components: lean body tissue and fat tissue. Letting

D = Body Density (gm/cm3)
A = proportion of lean body tissue
B = proportion of fat tissue (A + B = 1)
a = density of lean body tissue (gm/cm3)
b = density of fat tissue (gm/cm3)

we have

D = 1/[(A/a)+ (B/b)].

Solving for B we find



FIGURE 2.3 3D Plot: Model Landscape of all Models Evaluated by ICOMP (score surface of the models plotted over generation and population).

B = (1/D)[ab/(a − b)] − [b/(a − b)].

Using the estimates a = 1.10 gm/cm3 and b = 0.90 gm/cm3 (see Katch and McArdle (1977), p. 111, or Wilmore (1976), p. 123) we come up with “Siri's equation”:

Percentage of Body Fat (i.e., 100 × B) = 495/D − 450.
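As a quick check of the arithmetic behind Siri's equation (a worked step added here, using the values a = 1.10 and b = 0.90 from above):

$$100B = \frac{100\,ab}{a-b}\cdot\frac{1}{D} - \frac{100\,b}{a-b}
       = \frac{100(1.10)(0.90)}{0.20}\cdot\frac{1}{D} - \frac{100(0.90)}{0.20}
       = \frac{495}{D} - 450.$$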

Volume, and hence body density, can be accurately measured a variety of ways. The technique of underwater weighing “computes body volume as the difference between body weight measured in air and weight measured during water submersion. In other words, body volume is equal to the loss of weight in water with the appropriate temperature correction for the water's density” (Katch and McArdle (1977), p. 113). Using this technique,

Body Density = WA/[(WA − WW)/c.f. − LV]

where WA = Weight in air (kg), WW = Weight in water (kg), c.f. = Water correction factor (= 1 at 39.2 deg F, as one gram of water occupies exactly one cm3 at this temperature; = 0.997 at 76-78 deg F), and LV = Residual Lung Volume (liters) (Katch and McArdle, 1977, p. 115). Other methods of determining body volume are given in Behnke and Wilmore (1974, p. 22).

For this example, we first evaluated all possible subset regression models. Then we picked the top 15 best subset models according to the rankings of the minimum ICOMP(IFIM) values. Then, we ran the GA for 100 runs with the parameters given in Table 2.4 and picked the top 10 ranking best subset models according to the minimum value of ICOMP(IFIM) to determine whether the GA did indeed find the lowest ICOMP(IFIM) model in comparison to all possible subset selection.

The best top fifteen regression models found by the all possible subset selection procedure are given in Table 2.3.



Rank Number   Variables                          ICOMP(IFIM)
 1            1 - - 4 - 6 7 8 - -  -  12 13      1473.9065
 2            1 - - 4 - 6 7 8 9 -  -  12 13      1474.5525
 3            1 - 3 4 - 6 7 8 - -  -  12 13      1474.6751
 4            1 - - 4 - 6 7 8 - 10 -  12 13      1475.1721
 5            1 - - 4 - 6 7 8 - -  11 12 13      1475.2089
 6            1 - - 4 - 6 7 8 9 10 -  12 13      1475.5406
 7            1 - 3 4 - 6 7 8 9 -  -  12 13      1475.6024
 8            1 - 3 4 - 6 7 8 - 10 -  12 13      1475.7067
 9            1 - - 4 - 6 7 8 9 -  11 12 13      1475.8208
10            - - 3 4 - 6 7 - - -  -  12 13      1475.9539
11            1 - 3 4 - 6 7 8 - -  11 12 13      1476.0138
12            1 - - 4 5 6 7 8 - -  -  12 13      1476.0362
13            - - - 4 - 6 7 - - -  -  12 13      1476.116
14            1 - - 4 - 6 7 8 - 10 11 12 13      1476.3913
15            1 - 3 4 - 6 7 8 9 10 -  12 13      1476.443

TABLE 2.3 The best models chosen by the lowest fifteen ICOMP(IFIM) values among all possible models for the body fat data.

Number of runs = 100
Number of generations = 30
Population Size = 20
Fitness Value = ICOMP(IFIM)
Probability of crossover = 0.5 (uniform crossover is used)
Elitism = Yes
Probability of mutation = 0.01

TABLE 2.4 Parameters of the GA run for the body fat data.



GA Ranking   Chromosome (Variables)            Binary String                  ICOMP(IFIM)
 1 (1)       1 - - 4 - 6 7 8 - -  -  12 13     0 1 0 0 1 0 1 1 1 0 0 0 1 1    1473.9
 2 (2)       1 - - 4 - 6 7 8 9 -  -  12 13     0 1 0 0 1 0 1 1 1 1 0 0 1 1    1474.6
 3 (3)       1 - 3 4 - 6 7 8 - -  -  12 13     0 1 0 1 1 0 1 1 1 0 0 0 1 1    1474.7
 4 (4)       1 - - 4 - 6 7 8 - 10 -  12 13     0 1 0 0 1 0 1 1 1 0 1 0 1 1    1475.2
 5 (7)       1 - 3 4 - 6 7 8 9 -  -  12 13     0 1 0 1 1 0 1 1 1 1 0 0 1 1    1475.6
 6 (8)       1 - 3 4 - 6 7 8 - 10 -  12 13     0 1 0 1 1 0 1 1 1 0 1 0 1 1    1475.7
 7 (9)       1 - - 4 - 6 7 8 9 -  11 12 13     0 1 0 0 1 0 1 1 1 1 0 1 1 1    1475.8
 8 (11)      1 - 3 4 - 6 7 8 - -  11 12 13     0 1 0 1 1 0 1 1 1 0 0 1 1 1    1476
 9 (13)      - - - 4 - 6 7 - - -  -  12 13     0 0 0 0 1 0 1 1 0 0 0 0 1 1    1476.1
10 (15)      1 - 3 4 - 6 7 8 9 10 -  12 13     0 1 0 1 1 0 1 1 1 1 1 0 1 1    1476.4

TABLE 2.5 Top 10 ranking subsets of the best predictors for the body fat data set from 100 runs of the GA.

RSquare = 0.741266
RSquare Adj = 0.733844
Root Mean Square Error = 4.317462
Mean of Response = 19.15079
Observations (or Sum Wgts) = 252

TABLE 2.6 Summary of fit of the best subset model.

Term          Estimated Coeff.   Std Error   t-Ratio   Prob > |t|
Intercept     -0.63164           6.498054    -0.10     0.9226
Age            0.0838616         0.029956     2.80     0.0055
NeckCirc      -0.634546          0.213624    -2.97     0.0033
Abdo2Circ      0.8808665         0.066639    13.22     <.0001
HipCirc       -0.359215          0.118802    -3.02     0.0028
TighCirc       0.2826235         0.129812     2.18     0.0304
ForearmCirc    0.4529919         0.185745     2.44     0.0155
WristCirc     -1.935856          0.481505    -4.02     <.0001

TABLE 2.7 Parameter estimates of the best subset GA model.



FIGURE 2.4 2D Plot: Summary of 100 Runs of the GA for the Body Fat Data (minimum criterion value plotted against run number).

If we had to choose one model from this set, the best subset is the first-ranking model according to ICOMP(IFIM) = 1473.9, with the subset {x1 = Age (years), x4 = Neck circumference (cm), x6 = Abdomen 2 circumference (cm), x7 = Hip circumference (cm), x8 = Thigh circumference (cm), x12 = Forearm circumference (cm), x13 = Wrist circumference (cm)}. Indeed, this corresponds to the best subset chosen from all possible subset selection. We further note that the GA's selections correspond to the top seven best subsets from the results of the all possible subsets. This is quite interesting and shows that the GA is a highly intelligent statistical model selection device capable of pruning combinatorially large numbers of submodels to obtain optimal or near-optimal subset regression models.

Based on our results, the summary of fit and the parameter estimates of the best predictive model are given in Tables 2.6 and 2.7. Figure 2.4 shows the summary of 100 runs of the GA for the body fat data, and Figure 2.5 shows the 3D plot of the model landscape of all models evaluated by the information complexity criterion in 100 runs of the GA for the body fat data.

When we carry out forward stepwise regression analysis on the body fat data set, this approach gives us the full saturated model as the best fitting model, which is not surprising. In other words, the stepwise procedure is not able to distinguish the importance of the predictors in the model, since the P-values used in stepwise selection are arbitrary and the F-ratio stopping rule does not have the provision of compensating between lack of fit and increased model complexity. It does not attempt to find the best model in the model search space.



FIGURE 2.5 3D Plot: Model landscape of all models evaluated by ICOMP(IFIM) in 100 runs of the GA for the body fat data (score surface of the models plotted over generation and population).

Therefore, it is time that researchers start critically thinking of abandoning the use of such procedures, which are less than optimal.

2.6 Conclusion and Discussion

In this paper we have demonstrated that the GA, an algorithm which mimics Darwinian evolution, is a useful optimization tool for statistical modelling. We view each regression model as a chromosome or a genotype within a population of such entities. The value of an information-based model selection criterion is calculated for each model as a measure of the model's fitness. The collection of all possible fitness values forms a fitness landscape of this “artificial life.” Using the biological concepts of natural selection, crossover, and mutation, the GA searches over this landscape to find the “best” model.

Our GA application to the problem of optimal statistical model selection on the simulated and the body fat data sets indicates that the GA can indeed find the best model without having to evaluate all possible regression models. Compared to all possible subset selection, we evaluated only a small proportion of the total model space in each run.



A question can be asked: “Will the GA always be guaranteed to find the best model for all data sets?” In some situations the GA will find “good” models, but may miss finding the overall best one. This is not a failure specific to the GA, but rather a limitation faced by any type of optimization algorithm not using an exhaustive search. Researchers often analyze a large number of variables wherein a very large number of possible models exist. In many data sets, we should bear in mind that a single best model may not exist, but rather a number of equally optimal models are present. A GA is capable, in such a case, of finding at least some of these best models. For example, in the body fat data example, the GA found the best top seven models in the all possible subsets. This is a remarkable achievement.

Another question is: “What are the optimal numbers of both population size and number of generations for use with the GA?” In the literature on GAs, there are no clear-cut answers to this question. The answer will probably depend on the number of predictor variables and the structure of the data set at hand. Understanding the relationships among these factors requires further investigation (Mahfoud 1994). We are currently investigating these problems by trying different combinations of population sizes and generations, and examining their results carefully to see if they have consistent patterns and results.

The GA approach in combination with information-based criteria is much more likely to uncover the “best” model as well as “better” models than stepwise selection for several reasons. First, our GA approach utilizes an information-based criterion as the fitness value to evaluate models instead of treating regression model selection as a problem involving statistical hypothesis testing, such as the P-values used for stepwise selection. Second, the GA approach is not limited to simply adding or removing a single variable at a time. Rather, the GA evaluates models with new combinations of entire sets of variables obtained by evolutionary mechanisms (such as natural selection and crossover) at each generation. Third, the GA is a very flexible optimization algorithm. Several evolutionary mechanisms can be modified by the investigators at different stages of the GA. For example, different proportional selection schemes can also be used to determine the combination of genotypes in the population in the next generation (Srinivas and Patnaik, 1994). Finally, and most importantly, a GA is not a “hill-climbing” search algorithm like the forward, backward, and stepwise procedures. A given GA can simultaneously search over multiple areas in the fitness landscape of the solution space.

It is obvious that more research is needed on the application of GAs to statistical model selection in general, not just for linear regression modelling. We are working on other applications of the GA in other statistical modelling problems such as logistic and ordinal logistic regression, vector autoregressive models (Bearse and Bozdogan, 1998), and multivariate regression (Bearse and Bozdogan, 2002). We encourage researchers to develop GAs for other statistical modelling problems.



Acknowledgments

I was introduced to genetic algorithms (GAs) by Dr. Hang-Kwang (Hans) Luh in 1996 when he was a post doc in Mathematical Ecology here at the University of Tennessee and worked with me for two years taking my advanced statistical modelling and multivariate courses. I am grateful to Hans for pounding into me the importance of this area in which I was doing statistical modelling with information complexity ICOMP, and marrying it with the GA.

I acknowledge two of my graduate students, Vuttichai Chatpattananan and Xinli Bao, and thank them for creating the graphical user interface (GUI) based on my existing Matlab programs of the GA and for modifying them. This made the computations much easier.

Finally, I dedicate this paper to Dean Warren Neel for his unending support of my work and area of research since I came to the University of Tennessee in 1990, and for creating a conducive atmosphere to carry out high level research.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (Eds.), Second international symposium on information theory, Academiai Kiado, Budapest, 267-281.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Akaike, H. (1994). Implications of informational point of view on the development of statistical science. In H. Bozdogan (Ed.), Engineering & scientific applications of informational modeling, Volume 3, pp. 27-38. Proceedings of the first US/Japan conference on the frontiers of statistical modeling: An informational approach. Kluwer Academic Publishers, the Netherlands, Dordrecht.

Bailey, Covert (1994). Smart Exercise: Burning Fat, Getting Fit, Houghton-Mifflin Co., Boston, pp. 179-186.

Bauer, R. J. Jr. (1994). Genetic Algorithm and Investment Strategies. John Wiley & Sons, New York.

Baxter, R. A. (1996). Minimum Message Length Inference: Theory and Applications. Unpublished Ph.D. Thesis, Department of Computer Science, Monash University, Clayton, Victoria, Australia.

Bearse, P. M. and Bozdogan, H. (1998). Subset selection in vector autoregressive (VAR) models using the genetic algorithm with informational complexity as the fitness function. Systems Analysis, Modeling, Simulation (SAMS), 31, 61-91.

Bearse, P.M. and Bozdogan, H. (2002). Multivariate regressions, Genetic Algorithms, and Information Complexity: A Three Way Hybrid. In Measurement and Multivariate Analysis, S. Nishisato, Y. Baba, H. Bozdogan, and K. Kanefuji (Eds.), Springer, Tokyo, 2002, 269-278.

Behboodian, J. (1964). Information for Estimating the Parameters in Mixtures of Exponential and Normal Distributions. Ph.D. Thesis, Department of Mathematics, University of Michigan, Ann Arbor, MI.

Behnke, A.R. and Wilmore, J.H. (1974). Evaluation and Regulation of Body Build and Composition, Prentice-Hall, Englewood Cliffs, NJ.

Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.

Blahut, R. E. (1987). Principles and Practice of Information Theory. Addison-Wesley Publishing Company, Reading, MA.

Bock, H. H. (1994). Information and entropy in cluster analysis. In Multivariate Statistical Modeling, H. Bozdogan (Ed.), Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands, Dordrecht, 115-147.

Bozdogan, H. (1987). Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3), 345-370.

Bozdogan, H. (1988). ICOMP: a new model-selection criterion. In Classification and Related Methods of Data Analysis, H. H. Bock (Ed.), Elsevier Science Publishers, Amsterdam, 599-608.

Bozdogan, H. (1990). On the information-based measure of covariance complexity and its application to the evaluation of multivariate linear models. Communications in Statistics, Theory and Methods, 19, 221-278.

Bozdogan, H. (1994). Mixture-model cluster analysis using a new informational complexity and model selection criteria. In Multivariate Statistical Modeling, H. Bozdogan (Ed.), Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands, Dordrecht, 69-113.



Bozdogan, H. (2000). Akaike's information criterion and recent developments in information complexity. Journal of Mathematical Psychology, 44, 62-91.

Bozdogan, H. (2004). Statistical Modeling and Model Evaluation: A New Informational Approach. To appear.

Bozdogan, H. and Haughton, D.M.A. (1998). Informational complexity criteria for regression models. Computational Statistics and Data Analysis, 28, 51-76.

Bozdogan, H. and Ueno, M. (2000). A unified approach to information-theoretic and Bayesian model selection criteria. Invited paper presented in the Technical Session Track C on: Information Theoretic Methods and Bayesian Modeling at the 6th World Meeting of the International Society for Bayesian Analysis (ISBA), May 28-June 1, 2000, Hersonissos-Heraklion, Crete.

Bozdogan, H. and Magnus, J. R. (2003). Misspecification resistant model selection using information complexity. Working paper.

Burnham, K. P. and Anderson, D. R. (1998). Model Selection and Inference: A Practical Information-Theoretic Approach. Springer, New York.

Boyce, D. E., Farhi, A., and Weischedel, R. (1974). Optimal Subset Selection: Multiple Regression, Interdependence, and Optimal Network Algorithms. Springer-Verlag, New York.

Chen, X. (1996). Model Selection in Nonlinear Regression Analysis. Unpublished Ph.D. Thesis, The University of Tennessee, Knoxville, TN.

Chen, X. and Bozdogan, H. (2004). Model Selection in Nonlinear Regression Analysis: A New Information Complexity Approach. Working research monograph.

Chernoff, H. (1956). Large sample theory: parametric case. Annals of Mathematical Statistics, 27, 1-22.

Cover, T. M., Gacs, P., and Gray, R. M. (1989). Kolmogorov's contributions to information theory and algorithmic complexity. Ann. Prob., 17, 840-865.

Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.

Forrest, S. (1993). Genetic algorithms: principles of natural selection applied to computation. Science, 261, 872-878.

Graham, A. (1987). Nonnegative Matrices and Applicable Topics in Linear Algebra. Halsted Press, a Division of John Wiley and Sons, New York.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, New York.

Gosh, J. K. (1988) (Ed.). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu. Springer-Verlag, New York.

Harris, C. J. (1978). An information theoretic approach to estimation. In M. J. Gregson (Ed.), Recent Theoretical Developments in Control, Academic Press, London, 563-590.

Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics, 32, 1044.

Hocking, R. R. (1983). Developments in linear regression methodology: 1959-1982. Technometrics, 25, 219-230.

Holland, J. (1992). Genetic algorithms. Scientific American, 66-72.

Irizarry, R. A. (2001). Information and posterior probability criteria for model selection in local likelihood estimation. Journal of the American Statistical Association, March 2001, Vol. 96, No. 453, 303-315.

Katch, F. and McArdle, W. (1977). Nutrition, Weight Control, and Exercise, Houghton Mifflin Co., Boston.

Kauffman, S. A. (1993). The Origins of Order: Self-organization and Selection in Evolution, Oxford University Press, Oxford.

Kolmogorov, A. N. (1983). Combinatorial foundations of information theory and the calculus of probabilities. Russian Math Surveys, 38, 29-40.

Kullback, S. (1968). Information Theory and Statistics. Dover, New York.

Kullback, S. and Leibler, R. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79-86.

Lanning, M. J. and Bozdogan, H. (2003). Ordinal Logistic Modeling Using ICOMP as a Goodness-of-Fit Criteria. In Statistical Data Mining and Knowledge Discovery, H. Bozdogan (Ed.), Chapman & Hall/CRC, Boca Raton, FL.

Ljung, L. and Rissanen, J. (1978). On canonical forms, parameter identifiability and the concept of complexity. In Identification and System Parameter Estimation, N. S. Rajbman (Ed.), North-Holland, Amsterdam, 1415-1426.

Magnus, J. R. and Neudecker, H. (1999). Matrix Differential Calculus, 2nd Edition, John Wiley & Sons, New York.

Mahfoud, S. W. (1994). Population Sizing for Sharing Methods, Illinois Genetic Algorithms Laboratory Report No. 94005. University of Illinois, Champaign-Urbana, IL.

Maklad, M. S. and Nichols, T. (1980). A new approach to model structure discrimination. IEEE Trans. on Syst., Man, and Cybernetics, SMC 10, 78-84.

Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics, 12, 591-612.

Peirce, C. S. (1955). Abduction and Induction. In Philosophical Writings of Peirce, J. Buchler (Ed.), Dover, New York, 150-156.

McQuarie, A. D. R. and Tsai, C-L. (1998). Regression and Time Series Model Selection. World Scientific Publishing Company, Singapore.



Michalewicz, Z. (1992). Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York.

Miller, A. J. (1990). Subset selection in regression, Chapman and Hall, London.

Morgera, S. D. (1985). Information theoretic covariance complexity and its relation to pattern recognition. IEEE Trans. on Syst., Man, and Cybernetics, SMC 15, 608-619.

Mustonen, S. (1997). A measure of total variability in multivariate normal distribution. Comp. Statist. and Data Ana., 23, 321-334.

Moses, L. E. (1986). Think and Explain with Statistics, Addison-Wesley, Reading, MA.

Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math Soc., 37, 81.

Rao, C. R. (1947). Minimum variance and the estimation of several parameters. Proc. Cam. Phil. Soc., 43, 280.

Rao, C. R. (1948). Sufficient statistics and minimum variance estimates. Proc. Cam. Phil. Soc., 45, 213.

Rao, C. R. (1965). Linear Statistical Inference and Its Applications. John Wiley & Sons, New York.

Rissanen, J. (1976). Minmax entropy estimation of models for vector processes. In System Identification, R. K. Mehra and D. G. Lainiotis (Eds.), Academic Press, New York, 97-119.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465-471.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080-1100.

Rissanen, J. (1987). Stochastic complexity. (With discussion). J. of the Royal Statist. Soc., Series B, 49, 223-239.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Company, Teaneck, NJ.

Roughgarden, J. (1979). Theory of Population Genetics and Evolutionary Ecology: An Introduction, Macmillan Publishing, New York.

Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike Information Criterion Statistics, KTK Scientific Publishers, Tokyo.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461-464.

Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333-343.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.

Siri, W.E. (1956). Gross composition of the body. In Advances in Biological and Medical Physics, Vol. IV, J.H. Lawrence and C.A. Tobias (Eds.), Academic Press, New York.

Sokal, R. R. and Rohlf, F. J. (1981). Biometry, 2nd ed., W. H. Freeman and Company, New York.

Srinivas, M. and Patnaik, L. M. (1994). Genetic algorithms: a survey. IEEE Transactions on Signal Processing, 42(4), 927-935.

Sumida, B. H., Houston, A. I., McNamara, J. M. and Hamilton, W. D. (1990). Genetic algorithms and evolution. J. Theoretical Biology, 147, 59-84.

Theil, H. and Fiebig, D. G. (1984). Exploiting Continuity: Maximum Entropy Estimation of Continuous Distributions. Ballinger Publishing Company, Cambridge, MA.

van Emden, M. H. (1971). An Analysis of Complexity. Mathematical Centre Tracts, Amsterdam, 35.

von Neumann, J. (1966). Theory of Self-Reproducing Automata. In A. W. Burks (Ed.), University of Illinois Press, Urbana.

Wallace, C. S. and Freeman, P. R. (1987). Estimation and inference by compact coding. (With discussion). J. Royal Statist. Soc., Series B, 49, 240-265.

Wallace, C. S. and Dowe, D. L. (1993). MML estimation of the von Mises concentration parameter. Technical Report 93/193, Department of Computer Science, Monash University, Clayton 3168, Australia.

Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. John Wiley and Sons, New York.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-26.

Wilkinson, L. (1989). SYSTAT: The System for Statistics, SYSTAT, Evanston, IL.

Wilmore, Jack (1976). Athletic Training and Physical Fitness: Physiological Principles of the Conditioning Process, Allyn and Bacon, Inc., Boston.