
Machine Learning and Web Search
Part 1: Basics of Machine Learning

Hongyuan Zha

College of Computing
Georgia Institute of Technology


Outline

1 Classification Problems
    Bayes Error and Risk
    Naive Bayes Classifier
    Logistic Regression
    Decision Trees

2 Regression Problems
    Least Squares Problem
    Regularization
    Bias-Variance Decomposition

3 Cross-Validation and Comparison
    Cross-Validation
    p-Value and Test of Significance


Supervised Learning

We predict a target variable based on a set of predictor variables, using a training set of examples.

Classification: predict a discrete target variable
– spam filtering based on message contents
Regression: predict a continuous target variable
– predict income based on other demographic information


Probabilistic Setting for Classification

X is the predictor space, and C = {1, . . . , k} is the set of class labels
P(x, j) is a probability distribution on X × C
A classifier is a function h : X → C
We want to learn h from a training sample

D = {(x1, j1), . . . , (xN, jN)}

How do we measure the performance of a classifier? Misclassification error,

errorh = P({(x, j) | h(x) ≠ j}),  where (x, j) ∼ P(x, j)


Bayes Classifier

P(x, j) = P(x|j)P(j) = P(j|x)P(x)

Assume X is continuous, and

P(x ∈ A, j) = ∫_A p(j|x) p(x) dx

errorh = P(h(x) ≠ j) = ∫ P(h(x) ≠ j | x) p(x) dx = ∫ (1 − p(h(x)|x)) p(x) dx

Bayes error = min_h errorh, and the Bayes classifier is

h*(x) = argmax_j p(j|x)


Risk Minimization

Loss function L(i, j) = C(i|j): the cost when an example of class j is predicted to be class i
Risk of h = expected loss of h,

Rh = ∫ Σ_{j=1}^{k} L(h(x), j) p(j|x) p(x) dx

Minimizing the risk ⇒ Bayes classifier

h*(x) = argmin_j Σ_{ℓ=1}^{k} C(j|ℓ) p(ℓ|x)

0/1 loss ⇒ the risk reduces to errorh
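
As a concrete illustration (not from the slides), here is a minimal Python sketch of this cost-sensitive Bayes decision rule; the posterior and cost values below are made up for the example.

import numpy as np

# cost[i, j] = C(i|j): cost of predicting class i when the true class is j.
# posterior[j] = p(j|x) for a single example x.
def bayes_decision(posterior, cost):
    """Return argmin_i sum_j C(i|j) p(j|x); with 0/1 cost this is argmax_j p(j|x)."""
    expected_cost = cost @ posterior          # entry i = sum_j C(i|j) p(j|x)
    return int(np.argmin(expected_cost))

posterior = np.array([0.7, 0.3])              # hypothetical p(j|x)
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])             # 0/1 loss: picks the MAP class (0)
asymmetric = np.array([[0.0, 10.0],
                       [1.0, 0.0]])           # missing class 1 is 10x as costly: picks 1
print(bayes_decision(posterior, zero_one), bayes_decision(posterior, asymmetric))   # 0 1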


Naive Bayes Classifier

Use Bayes rule: p(j|x) ∝ p(x|j) p(j)

Feature vector x = [t1, . . . , tn]. Conditional independence assumption:

p(x|j) = p(t1|j) p(t2|j) · · · p(tn|j)

MLE for p(ti|j); smoothed version (m-estimate),

p(ti|j) = (nc + mp) / (n + m)

n: the number of training examples in class j
nc: the number of examples in class j with attribute value ti
p: a priori estimate of p(ti|j)
m: the equivalent sample size


Naive Bayes Classifier: Example

Data

(Example from Eric Meisner's Naive Bayes classifier handout, November 22, 2003; it selects the most likely class Vnb = argmax_j P(j) ∏_i P(ai|j) and uses the m-estimate P(ai|j) = (nc + mp)/(n + m) from the previous slide.)

Car theft example: the attributes are Color, Type, and Origin; the target, Stolen, can be either Yes or No.

Example No.  Color   Type    Origin    Stolen?
1            Red     Sports  Domestic  Yes
2            Red     Sports  Domestic  No
3            Red     Sports  Domestic  Yes
4            Yellow  Sports  Domestic  No
5            Yellow  Sports  Imported  Yes
6            Yellow  SUV     Imported  No
7            Yellow  SUV     Imported  Yes
8            Yellow  SUV     Domestic  No
9            Red     SUV     Imported  No
10           Red     Sports  Imported  Yes

Training example: we want to classify a Red Domestic SUV. Note there is no example of a Red Domestic SUV in the data set. Using the m-estimate, we need to calculate
P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No)
and multiply them by P(Yes) and P(No) respectively.

Estimates, with m = 3 and p = 0.5 for every attribute (each attribute has two possible values and we assume no other information, so p = 1/(number of attribute values); m is arbitrary but kept the same for all attributes):

             Yes (n = 5)   No (n = 5)
Red          nc = 3        nc = 2
SUV          nc = 1        nc = 3
Domestic     nc = 2        nc = 3

For example, for P(Red|Yes) there are 5 cases where the class is Yes, and in 3 of those the color is Red, so n = 5 and nc = 3. Applying the m-estimate:

P(Red|Yes)      = (3 + 3·0.5)/(5 + 3) = 0.56     P(Red|No)      = (2 + 3·0.5)/(5 + 3) = 0.43
P(SUV|Yes)      = (1 + 3·0.5)/(5 + 3) = 0.31     P(SUV|No)      = (3 + 3·0.5)/(5 + 3) = 0.56
P(Domestic|Yes) = (2 + 3·0.5)/(5 + 3) = 0.43     P(Domestic|No) = (3 + 3·0.5)/(5 + 3) = 0.56

We have P(Yes) = P(No) = 0.5. For Yes:
P(Yes) · P(Red|Yes) · P(SUV|Yes) · P(Domestic|Yes) = 0.5 · 0.56 · 0.31 · 0.43 = 0.037
and for No:
P(No) · P(Red|No) · P(SUV|No) · P(Domestic|No) = 0.5 · 0.43 · 0.56 · 0.56 = 0.069
Since 0.069 > 0.037, the example is classified as No.



Naive Bayes Classifier: Example

To classify a Red Domestic SUV:
For Yes: P(Yes) · P(Red|Yes) · P(SUV|Yes) · P(Domestic|Yes) = 0.5 · 0.56 · 0.31 · 0.43 = 0.037
For No:  P(No) · P(Red|No) · P(SUV|No) · P(Domestic|No) = 0.5 · 0.43 · 0.56 · 0.56 = 0.069
Since 0.069 > 0.037, classify as No.
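
The calculation can be reproduced directly from the data table; the following Python sketch (mine, not part of the slides) implements the m-estimate and the class scores.

m, p = 3, 0.5     # equivalent sample size and prior estimate, as in the example
data = [("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
        ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
        ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
        ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
        ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes")]
classes = ["Yes", "No"]
prior = {c: sum(r[-1] == c for r in data) / len(data) for c in classes}

def m_estimate(value, attr_idx, c):
    n = sum(r[-1] == c for r in data)                              # examples in class c
    n_c = sum(r[-1] == c and r[attr_idx] == value for r in data)   # ... with this attribute value
    return (n_c + m * p) / (n + m)

def score(x, c):                                                   # P(c) * prod_i P(x_i|c)
    s = prior[c]
    for i, v in enumerate(x):
        s *= m_estimate(v, i, c)
    return s

x = ("Red", "SUV", "Domestic")
print({c: round(score(x, c), 3) for c in classes})
# -> {'Yes': 0.038, 'No': 0.069}; the slides get 0.037 for Yes by rounding intermediate values.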


Learning to Classify Text

Target concept Interesting: Document → {+, −}
1. Example: document classification using a bag-of-words (BOW) representation
   — Multiple Bernoulli model
   — Multinomial model
   one attribute per word position in the document
2. Learning: use training examples to estimate

P(+), P(−), P(doc|+), P(doc|−)

Naive Bayes conditional independence assumption:

P(doc|j) = ∏_{i=1}^{length(doc)} P(ti|j)

where P(ti|j) is the probability that word ti appears in class j


Learn_naive_Bayes_text(Examples, k)

1. Collect all words and other tokens that occur in Examples
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(j) and P(ti|j) probability terms
   For each target value j in {1, . . . , k} do
     docsj ← the subset of Examples for which the target value is j
     P(j) ← |docsj| / |Examples|
     Textj ← a single document created by concatenating all members of docsj
     n ← total number of words in Textj (counting duplicate words multiple times)
     for each word ti in Vocabulary
       ni ← number of times word ti occurs in Textj
       P(ti|j) ← (ni + 1) / (n + |Vocabulary|)


Classify_naive_Bayes_text(Doc)

positions ← all word positions in Doc that contain tokens found in Vocabulary
Return jNB, where

jNB = argmax_j P(j) ∏_{i∈positions} P(ti|j)

When k ≫ 1, a special smoothing method is needed:
Congle Zhang et al., Web-scale classification with Naive Bayes, WWW 2009
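
A compact Python sketch (my own, not the slides' pseudocode verbatim) of the two procedures above, with Laplace (+1) smoothing and log probabilities to avoid underflow; the toy documents are hypothetical.

import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (tokens, label). Returns (vocabulary, priors, cond_probs)."""
    vocabulary = {t for tokens, _ in examples for t in tokens}
    labels = {label for _, label in examples}
    priors, cond = {}, {}
    for j in labels:
        docs_j = [tokens for tokens, label in examples if label == j]
        priors[j] = len(docs_j) / len(examples)
        counts = Counter(t for tokens in docs_j for t in tokens)
        n = sum(counts.values())                     # total words in Text_j
        cond[j] = {t: (counts[t] + 1) / (n + len(vocabulary)) for t in vocabulary}
    return vocabulary, priors, cond

def classify_naive_bayes_text(doc, vocabulary, priors, cond):
    positions = [t for t in doc if t in vocabulary]
    scores = {j: math.log(priors[j]) + sum(math.log(cond[j][t]) for t in positions)
              for j in priors}                       # log-space product
    return max(scores, key=scores.get)

train = [(["cheap", "pills", "buy"], "+"), (["meeting", "notes", "today"], "-")]
model = learn_naive_bayes_text(train)
print(classify_naive_bayes_text(["buy", "cheap", "meeting"], *model))    # "+"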


Logistic Regression

In Naive Bayes, the discriminant function is

P(j) ∏_i P(ti|j)

Let ni be the frequency of ti, and take the log of the above:

log P(j) + Σ_i ni log P(ti|j)

which is a linear function of the frequency vector x = [n1, . . . , nV]T


Logistic Regression

More generally,

P(j|x) = (1/Z(x)) exp(wjT x) ≡ (1/Zw(x)) exp(wT f(x, j))

Given a training sample L = {(x1, j1), . . . , (xN, jN)}, we minimize the regularized negative conditional log-likelihood,

min_w (1/2)‖w‖² + C Σ_i [ log Zw(xi) − wT f(xi, ji) ]

A convex function in w
C.-J. Lin et al., Trust region Newton methods for large-scale logistic regression, ICML 2007 (several million features, N = 10^5)
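
For the two-class case the objective above reduces to the standard regularized logistic loss (1/2)‖w‖² + C Σ_i log(1 + exp(−yi wT xi)); the rough numpy sketch below (mine, with synthetic data) minimizes it by plain gradient descent rather than a trust-region Newton method.

import numpy as np

def fit_logistic(X, y, C=1.0, lr=0.01, iters=2000):
    """X: (N, d) features, y: (N,) labels in {-1, +1}. Returns the weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient: w - C * sum_i sigma(-margin_i) * y_i * x_i
        grad = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # linearly separable toy labels
w = fit_logistic(X, y)
print(np.mean(np.sign(X @ w) == y))               # training accuracy, close to 1.0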


Generative vs. Discriminative Classifiers

Naive Bayes estimates parameters for P(j) and P(x|j), while logistic regression estimates parameters for P(j|x)

Naive Bayes: generative classifier
Logistic regression: discriminative classifier
Logistic regression is more general and gives a better asymptotic error. The convergence rates differ: GNB needs O(log n) examples, logistic regression O(n) examples, where n is the dimension of X

Ng and Jordan, On generative vs. discriminative classifiers: a comparison of Naive Bayes and logistic regression, NIPS 2002


Tree-Based Models

[Figures 14.5 and 14.6 from Bishop, PRML: a two-dimensional input space partitioned into five regions A, B, C, D, E by axis-aligned boundaries, and the corresponding binary tree of threshold tests on x1 and x2.]
Partition the input space into cuboid regions, with edges aligned with the axes
Classifier

h(x) = Σ_i ji I(x ∈ Ri)

CART (classification and regression trees)


Decision Trees

Decision tree representation:
Each internal node tests an attribute (predictor variable)
Each branch corresponds to an attribute value
– branching factor > 2 (discrete case)
– binary trees are more common (split on a threshold)
Each leaf node assigns a class label


Top-Down Induction of Decision Trees

Main loop:
1. A ← the "best" decision attribute for the next node
2. For each value of A, create a new descendant of the node
3. Sort the training examples to the leaf nodes
4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which attribute is best? Binary classification with two candidate attributes, both splitting the same [29+, 35−] sample:

A1 = ? splits [29+, 35−] into [21+, 5−] and [8+, 30−]
A2 = ? splits [29+, 35−] into [18+, 33−] and [11+, 2−]


Information Gain

S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) ≡ Entropy(S) − Σ_{v∈Values(A)} (|Sv|/|S|) Entropy(Sv) ≥ 0
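
A small Python sketch (not from the slides) computing the information gain for the two candidate splits of the [29+, 35−] sample shown on the "Top-Down Induction" slide.

import math

def entropy(pos, neg):
    total = pos + neg
    out = 0.0
    for p in (pos / total, neg / total):
        if p > 0:
            out -= p * math.log2(p)
    return out

def gain(parent, children):
    """parent and children are (pos, neg) counts; the children partition the parent."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

parent = (29, 35)
print(round(gain(parent, [(21, 5), (8, 30)]), 3))    # split on A1: ~0.27
print(round(gain(parent, [(18, 33), (11, 2)]), 3))   # split on A2: ~0.12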


Extension

General predictor variables x ∈ X
A set of binary splits s at each node, based on a question: is x ∈ A? where A ⊂ X
The split s sends all (xi, ji) with a "yes" answer to the left child and those with a "no" answer to the right child
Standard set of questions:

Predictor variable x^i continuous: is x^i ≤ c?
Predictor variable x^i categorical: is x^i ∈ T′, where T′ ⊂ T and T is the set of values of x^i?


Goodness of Split

(Based on Jia Li's slides, http://www.stat.psu.edu/~jiali)

The goodness of a split is measured by an impurity function defined for each node
Intuitively, we want each leaf node to be "pure", that is, one class dominates
Given class probabilities in a node S: p(1|S), . . . , p(J|S), the impurity function for S (more generally, for the samples in a node) is

i(S) = φ(p(1|S), . . . , p(J|S))


Goodness of Split

Examples:

Entropy
φ(p1, . . . , pJ) = −Σ_j pj log pj

Gini index
φ(p1, . . . , pJ) = Σ_{i≠j} pi pj = 1 − Σ_j pj²

Goodness of a split s for node t:

Φ(s, t) = Δi(s, t) = i(t) − pR i(tR) − pL i(tL)

where pR and pL are the proportions of the samples in node t that go to the right child tR and the left child tL, respectively.


Stopping Criteria

A simple criterion: stop splitting a node t when

max_{s∈S} p(t) Δi(s, t) < β

This stopping criterion is unsatisfactory
— A node with a small decrease in impurity after one split may have a large decrease after multiple levels of splits.


CART: Classification and Regression Trees

Two phases: growing and pruning
Growing: the input space is recursively partitioned into cells, each cell corresponding to a leaf node
– the training data are fitted well
– but performance on test data is poor (overfitting)
Pruning: the objective function consists of the empirical risk and a penalty term

C(T) = LN(T) + α|T|

where T ranges over all possible subtrees obtained by pruning the original tree.
CART selects T to minimize C(T), with α chosen by cross-validation


Probabilistic Setting for Regression

X is the predictor space, and T = R is the set of reals
P(x, t) is a probability distribution on X × T
A regression function is a function h : X → T
We want to learn h from a training sample

D = {(x1, t1), . . . , (xN, tN)}

How do we measure the performance of a regression function? Mean squared error,

errorh = ∫ (t − h(x))² dP(x, t)


Conditional Mean as Optimal Regression Function

Mean squared error,

errorh = ∫ (t − h(x))² dP(x, t) = ∫∫ (t − h(x))² p(t|x) dt dP(x)

Optimal regression function:

h*(x) = ∫ t p(t|x) dt


Least Squares Problem

[Figure 1.3 from Bishop, PRML: the sum-of-squares error corresponds to (one half of) the sum of the squares of the vertical displacements of each data point tn from the function y(xn, w).]

For a training set D = {(x1, t1), . . . , (xN, tN)}, the empirical risk, or training error, of y(x, w) is

E(w) = (1/2) Σ_{i=1}^{N} (ti − y(xi, w))²


Linear Least Squares Problem

y(x, w) = wT f(x), where f(x) = [f1(x), . . . , fn(x)]T is a set of basis functions,

E(w) = (1/2) Σ_{i=1}^{N} (ti − wT f(xi))²

Let t = [ti] and A = [f(xi)T] (the matrix whose i-th row is f(xi)T); then

min_w ‖t − Aw‖₂²

Normal equations:

AT A w = AT t
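
A short numpy sketch (mine, with synthetic data in the spirit of Bishop's curve-fitting example) of the normal-equations solution for a polynomial basis.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy samples of sin(2*pi*x)

M = 3                                                       # polynomial order
A = np.vander(x, M + 1, increasing=True)                    # columns 1, x, ..., x^M
w = np.linalg.solve(A.T @ A, A.T @ t)                       # normal equations A^T A w = A^T t
# np.linalg.lstsq(A, t, rcond=None) is preferred numerically; this mirrors the slide.
print(np.round(w, 2))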


Overfitting and Regularization

Example taken from C. Bishop, PRML

[Figure 1.2 from Bishop, PRML: a training set of N = 10 points (blue circles), each an observation of the input x and the target t, generated from sin(2πx) (green curve) with added random noise; the goal is to predict t for new values of x.]

Polynomial curve fitting

y(x, w) = w0 + w1 x + · · · + wM x^M


[Figure 1.4 from Bishop, PRML: polynomials of order M = 0, 1, 3, and 9 (red curves) fitted to the data set of Figure 1.2. M = 0 and M = 1 fit poorly, M = 3 gives the best representation of sin(2πx), and M = 9 passes exactly through every training point (E(w*) = 0) but oscillates wildly: over-fitting. The root-mean-square error E_RMS = √(2 E(w*)/N) puts training and test errors for different data set sizes on the same scale and in the same units as t.]


Training Error and Test Error

[Figure 1.5 from Bishop, PRML: training and test RMS error versus M. Small M gives large test error (the polynomial is too inflexible to capture sin(2πx)); 3 ≤ M ≤ 8 gives small test error; for M = 9 the training error goes to zero (the 10 coefficients can fit the 10 data points exactly) while the test error becomes very large. Table 1.1: the typical magnitude of the coefficients w* increases dramatically with the order, reaching the order of 10^5 to 10^6 for M = 9.]


Increasing Training Set Size

[Figure 1.6 from Bishop, PRML: M = 9 polynomial fits for N = 15 and N = 100 data points. For a given model complexity, over-fitting becomes less severe as the size of the data set increases; a rough heuristic is that the number of data points should be at least some multiple (say 5 or 10) of the number of adaptive parameters.]


Regularization

[Figure 1.7 from Bishop, PRML: M = 9 polynomials fitted with the regularized error function for ln λ = −18 (over-fitting suppressed, close to sin(2πx)) and ln λ = 0 (too much regularization, poor fit). A quadratic penalty on the coefficients is known as ridge regression in statistics and as weight decay for neural networks.]

For a training set D = {(x1, t1), . . . , (xN, tN)}, the regularized training error of y(x, w) is

E(w) = (1/2) Σ_{i=1}^{N} (ti − y(xi, w))² + (λ/2) ‖w‖²
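
For the linear least squares case the penalty simply shifts the normal equations to (AT A + λI) w = AT t; the brief sketch below (mine, with synthetic data) shows how the coefficient magnitudes shrink as λ grows.

import numpy as np

def ridge_fit(A, t, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
A = np.vander(x, 10, increasing=True)            # M = 9 polynomial basis
for lam in (0.0, np.exp(-18), 1.0):              # lambda = 0, e^-18, 1, as in Bishop's Figure 1.7
    w = ridge_fit(A, t, lam)
    print(f"lambda={lam:.0e}  max|w|={np.max(np.abs(w)):.1f}")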


Training Error and Test Error: Regularization

[Table 1.2 and Figure 1.8 from Bishop, PRML: as ln λ increases, the typical magnitude of the M = 9 coefficients shrinks; the training and test RMS error curves versus ln λ show that λ controls the effective complexity of the model and hence the degree of over-fitting. A simple way to set M or λ is to hold out a validation set, at the cost of valuable training data.]


Bias-Variance Decomposition

Let h*(x) be the conditional mean, h*(x) = ∫ t p(t|x) dt
For any regression function y(x),

∫ (y(x) − t)² dP(x, t) = ∫ (y(x) − h*(x))² dP(x) + ∫ (h*(x) − t)² dP(x, t)

Given training data D, the algorithm outputs y(x, D)
— average behavior over all D:

E_D (y(x, D) − h*(x))² = (E_D y(x, D) − h*(x))²  +  E_D (y(x, D) − E_D y(x, D))²
                               [(bias)²]                    [variance]

Expected loss = (bias)² + variance + noise
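
The decomposition can be checked numerically; the sketch below (mine, with synthetic sin(2πx) data) estimates (bias)² and variance for polynomial fits of increasing order by averaging over many training sets.

import numpy as np

rng = np.random.default_rng(3)
x_grid = np.linspace(0, 1, 50)
h_star = np.sin(2 * np.pi * x_grid)               # conditional mean (the noise has mean 0)

def fit_predict(M, n=25, sigma=0.3):
    """Draw one training set D and return the fitted polynomial's predictions on x_grid."""
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)
    return np.polyval(np.polyfit(x, t, M), x_grid)

for M in (1, 3, 9):
    preds = np.array([fit_predict(M) for _ in range(200)])   # 200 data sets D
    bias2 = np.mean((preds.mean(axis=0) - h_star) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"M={M}:  bias^2={bias2:.3f}  variance={variance:.3f}")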


Bias-Variance and Model Complexity

[Figure 3.5 from Bishop, PRML: L = 100 data sets, each with N = 25 points from the sinusoidal function, fitted with 24 Gaussian basis functions for several values of ln λ. Large λ gives low variance (the individual fits look similar) but high bias (their average is far from sin(2πx)); small λ gives high variance but low bias, and the average of the fits is a very good approximation of the regression function.]


Bias-Variance Decomposition

[Figure 3.6 from Bishop, PRML: squared bias, variance, their sum, and the average test error plotted against ln λ for the experiment of Figure 3.5. The minimum of (bias)² + variance occurs around ln λ = −0.31, close to the value that minimizes the test error. The average prediction is estimated by ȳ(x) = (1/L) Σ_l y^(l)(x), with (bias)² = (1/N) Σ_n (ȳ(xn) − h(xn))² and variance = (1/N) Σ_n (1/L) Σ_l (y^(l)(xn) − ȳ(xn))².]


Overfitting

[Plot: accuracy on training data and on test data versus the size of the tree (number of nodes).]

Common problem with most learning algorithms
Given a function space H, a function h ∈ H is said to overfit the training data if there exists some alternative function h′ ∈ H such that h has smaller error than h′ over the training examples but h′ has smaller error than h over the entire distribution of instances


Cross-Validation

Performance on the training set is not a good indicator of predictive performance on unseen data
If data is plentiful: split into training data, validation data, and test data
— models (e.g., polynomials of different degree) are compared on the validation data
— high variance if the validation data is small
S-fold cross-validation:

[Figure 1.18 from Bishop, PRML: S-fold cross-validation, illustrated for S = 4. The data are partitioned into S groups; S − 1 groups are used to train a set of models, which are then evaluated on the remaining group. The procedure is repeated for all S choices of the held-out group and the performance scores are averaged. Taking S = N gives the leave-one-out technique. The main drawback is that the number of training runs grows by a factor of S, and exploring several complexity parameters jointly can be expensive; information criteria such as AIC instead penalize the training-data likelihood with a complexity term.]
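
A bare-bones Python sketch (mine) of S-fold cross-validation used to compare polynomial orders on synthetic data; fitting and scoring here are ordinary least squares and mean squared error.

import numpy as np

def s_fold_cv_error(x, t, M, S=4, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, S)
    errors = []
    for k in range(S):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(S) if i != k])
        w = np.polyfit(x[train], t[train], M)
        errors.append(np.mean((np.polyval(w, x[test]) - t[test]) ** 2))
    return np.mean(errors)                        # average score over the S runs

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)
for M in (0, 1, 3, 9):
    print(f"M={M}:  CV error = {s_fold_cv_error(x, t, M):.3f}")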


Cross-Validation

Which kind of cross-validation? (from Andrew Moore's slides)

               Downside                                              Upside
Test-set       Variance: unreliable estimate of future performance   Cheap
Leave-one-out  Expensive. Has some weird behavior                    Doesn't waste data
10-fold        Wastes 10% of the data. 10 times more expensive       Only wastes 10%. Only 10 times more
               than test set                                         expensive instead of R times.
3-fold         Wastier than 10-fold. Expensivier than test set       Slightly better than test-set
R-fold         Identical to Leave-one-out


CV-based Model Choice

Example: choosing which model to use
Step 1: compute the 10-fold CV error for six different model classes

[From Andrew Moore's slides: choosing the number of hidden units (0, 1, 2, 3, 4, or 5) in a one-hidden-layer neural net; a table lists each model class (Algorithm) with its training error (TRAINERR), its 10-fold CV error (10-FOLD-CV-ERR), and the resulting choice.]

Step 2: whichever model class gave the best CV score, train it with all the data, and that is the predictive model you will use.


Two Definitions of Error

The true error of classifier h with respect to P(x, y) is the probability that h will misclassify an instance drawn at random according to P (the population),

errorP(h) ≡ P[h(x) ≠ y]

The sample error of h with respect to a data sample S = {(xi, yi)}, i = 1, . . . , N, is the proportion of examples h misclassifies,

errorS(h) ≡ (1/N) Σ_{i=1}^{N} δ(yi ≠ h(xi))

where δ(y ≠ h(x)) is 1 if y ≠ h(x), and 0 otherwise.
How well does errorS(h) estimate errorP(h)?


Example

Hypothesis h misclassifies 12 of the 40 examples in S

errorS(h) = 12/40 = 0.30

What is errorP(h)?


Confidence Intervals

If S contains n examples, drawn independently of h and of each other, and n ≥ 30,
then with approximately N% probability, errorP(h) lies in the interval

errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )

where
N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
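
Applied to the earlier example (12 mistakes out of 40), the interval works out as in this small sketch (mine):

import math

def error_confidence_interval(error_s, n, z=1.96):        # z = 1.96 for ~95%
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

lo, hi = error_confidence_interval(12 / 40, 40)
print(f"95% interval: ({lo:.2f}, {hi:.2f})")              # roughly (0.16, 0.44)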


k-fold Cross-Validated Paired t-test

Comparing two algorithms A and B
— L(S) returns the classifier produced by algorithm L trained on training data S

1. Randomly partition the training data D into k disjoint test sets T1, T2, . . . , Tk of equal size.
2. For i from 1 to k, do:
   use Ti as the test set and the remaining data as the training set Si
   hA ← LA(Si), hB ← LB(Si)
   δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄ ≡ (1/k) Σ_{i=1}^{k} δi
4. Let sδ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² ).


t-Distribution

Then t ≡ δ̄/sδ has an approximate t-distribution with k − 1 degrees of freedom under the null hypothesis that there is no difference in the true errors
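
Putting the last two slides together, here is a short sketch (mine, with hypothetical per-fold differences δi); SciPy is used only for the tail probability of the t-distribution.

import math
import numpy as np
from scipy import stats

delta = np.array([0.03, 0.05, 0.01, 0.04, 0.02, 0.06, 0.03, 0.00, 0.04, 0.02])   # hypothetical delta_i
k = len(delta)
delta_bar = delta.mean()
s_delta = math.sqrt(np.sum((delta - delta_bar) ** 2) / (k * (k - 1)))
t = delta_bar / s_delta
p_value = stats.t.sf(t, df=k - 1)       # one-sided: P(T_{k-1} >= t) under the null hypothesis
print(f"t = {t:.2f}, one-sided p-value = {p_value:.4f}")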


p-Value and Test of Significance

Null hypothesis: no difference in true errors; plus an alternative hypothesis
We may be able to demonstrate that the alternative is much more plausible than the null hypothesis given the data
This is done in terms of a probability (a p-value)
— quantifying the strength of the evidence against the null hypothesis in favor of the alternative
Are the data consistent with the null hypothesis?
— use a test statistic, like t
— need to know the null distribution of the test statistic (Student's t with k − 1 degrees of freedom)


p-Value and Test of Significance

For a given data set, we can compute the value of t and see whether it is
— in the middle of the distribution (consistent with the null hypothesis)
— out in a tail of the distribution (making the alternative hypothesis seem more plausible)
Alternative hypothesis ⇒ large positive t
— a measure of how far out t is in the right-hand tail of the null distribution
The p-value is the probability to the right of our test statistic (t), calculated using the null distribution
The smaller the p-value, the stronger the evidence against the null hypothesis in favor of the alternative


p-Value and Test of Significance

Descriptive language for the strength of evidence (the cut-offs are somewhat arbitrary):

P > 0.10           No evidence against the null hypothesis; the data appear consistent with it
0.05 < P < 0.10    Weak evidence against the null hypothesis in favor of the alternative
0.01 < P < 0.05    Moderate evidence against the null hypothesis in favor of the alternative
0.001 < P < 0.01   Strong evidence against the null hypothesis in favor of the alternative
P < 0.001          Very strong evidence against the null hypothesis in favor of the alternative

Keep in mind the difference between statistical and practical significance: in a large study a small P-value may correspond to an effect too small to matter, so it is a good idea to support a P-value with a confidence interval for the parameter being tested.

Level α (usually 0.05 or 0.01) test: we reject the null hypothesis at level α if the P-value is smaller than α


The Power of Tests

When comparing algorithms
— null hypothesis: no difference
— alternative hypothesis: my new algorithm is better
We want a good chance of reporting a small p-value assuming the alternative hypothesis is true
The power of a level α test: the probability that the null hypothesis will be rejected at level α (i.e., the p-value will be less than α) assuming the alternative hypothesis is true
— variability of the data: lower variance, higher power
— sample size: higher N, higher power
— the magnitude of the difference: larger difference, higher power
