
Chapter 11

Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan


Outline

• Bayesian Methods

– Basics of Bayesian Methods

– Bayesian Classification – General Case

– Classification that Minimizes Risk

– Decision Regions and Probability of Errors

– Discriminant Functions

– Estimation of Probability Densities

– Probabilistic Neural Network

– Constraints in Classifier Design


• Regression

– Data Models

– Simple Linear Regression

– Multiple Regression

– General Least Squares and Multiple Regression

– Assessing Quality of the Multiple Regression Model


Bayesian Methods

Statistical processing based on Bayes decision theory is a fundamental technique for pattern recognition and classification.

Bayes decision theory provides a framework for statistical methods that classify patterns into classes based on the probabilities of patterns and their features.


Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature: C = { "an eagle", "a hawk" }

Values of C = { c1, c2 } = { "an eagle", "a hawk" }

We may assume that, among a large number N of prior observations, n_eagle of them belonged to class c1 ("an eagle") and n_hawk belonged to class c2 ("a hawk"), with n_eagle + n_hawk = N.


Basics of Bayesian Methods

A priori (prior) probability P(ci):

Estimation of a prior P(ci):

P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.
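A standard relative-frequency estimate, consistent with the eagle/hawk counts introduced above, is:

$$P(c_1) \approx \frac{n_{eagle}}{N}, \qquad P(c_2) \approx \frac{n_{hawk}}{N}, \qquad P(c_1) + P(c_2) = 1$$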


Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) of how likely it is that an eagle or a hawk will appear, even before a bird is actually observed.

– Natural and best decision:

"Assign a bird to class c1 if P(c1) > P(c2); otherwise, assign the bird to class c2."

– The probability of classification error:

P(classification error) = P(c2) if we decide C = c1
P(classification error) = P(c1) if we decide C = c2


Involving Object Features in Classification

• Feature variable / feature x

– It characterizes an object and allows for better discrimination of one class from another

– We assume it to be a continuous random variable taking values from a given range

– The variability of a random variable x can be expressed in probabilistic terms

– We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function):


Involving Object Features in Classification

Examples of probability densities


Involving Object Features in Classification

• Probability density function p(x|ci)

– also called the likelihood of a class ci with respect to the value x of a feature variable

– the likelihood that an object belongs to class ci is bigger if p(x|ci) is larger

– joint probability density function p(ci , x)

A probability density that an object is in a class ci and has a feature variable value x.

– A posteriori (posterior) probability P(ci|x)

The conditional probability P(ci|x) (i = 1, 2) specifies the probability that the object class is ci given that the measured value of the feature variable is x.


Involving Object Features in Classification

• Bayes’ rule / Bayes’ theorem

– From probability theory (see Appendix B)

– An unconditional probability density function
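For reference, the standard two-class form of Bayes' theorem and of the unconditional density mentioned above is:

$$P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{p(x)}, \qquad p(x) = \sum_{j=1}^{2} p(x \mid c_j)\,P(c_j)$$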


Involving Object Features in Classification

• Bayes' rule

"The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci) together with the class conditional probability density function p(x|ci)."


Involving Object Features in Classification

• Bayes’ decision rule

P(classification error | x) = P(c2|x) if we decide C = c1
P(classification error | x) = P(c1|x) if we decide C = c2

"This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)"

– Bayes' classification rule guarantees minimization of the average probability of classification error


Involving Object Features in Classification

Example: Let us consider a bird classification problem with P(c1) = P("an eagle") = 0.8 and P(c2) = P("a hawk") = 0.2 and known probability density functions p(x|c1) and p(x|c2). Assume that, for a new bird, we have measured its size x = 45 cm and that for this value we computed p(45|c1) = 2.2828 × 10^-2 and p(45|c2) = 1.1053 × 10^-2. The classification rule then predicts class c1 ("an eagle") because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 × 10^-2 × 0.8 > 1.1053 × 10^-2 × 0.2). Assume further that the unconditional density value is known to be p(45) = 0.3. The probability of classification error is computed below.
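Substituting the values given above into Bayes' theorem (a direct computation), the error probability equals the posterior probability of the rejected class:

$$P(\mathrm{error} \mid x = 45) = P(c_2 \mid 45) = \frac{p(45 \mid c_2)\,P(c_2)}{p(45)} = \frac{1.1053 \times 10^{-2} \cdot 0.2}{0.3} \approx 0.0074$$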


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Real-valued features of an object form an n-dimensional column vector x ∈ R^n:

– The object may belong to l distinct classes (l distinct states of nature): c1, c2, …, cl


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Bayes’ theorem

A priori probability: P(ci), i = 1, 2, …, l

Class conditional probability density function: p(x|ci)

A posteriori (posterior) probability: P(ci|x)

Unconditional probability density function: p(x) = Σj p(x|cj) P(cj)


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Bayes classification rule: Assign an object with a given value x of a feature vector to class cj when p(x|cj)P(cj) ≥ p(x|ci)P(ci) for all i = 1, 2, …, l (a sketch of this rule follows below).
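A minimal sketch of this rule in Python/NumPy. The class-conditional densities are assumed to be available as callables; the names densities, priors, and bayes_classify are illustrative and not from the text:

```python
import numpy as np

def bayes_classify(x, densities, priors):
    """Assign x to the class j with the largest p(x|c_j) * P(c_j).

    densities: list of callables, densities[j](x) returns p(x | c_j)
    priors:    list of a priori probabilities P(c_j)
    """
    scores = [densities[j](x) * priors[j] for j in range(len(priors))]
    return int(np.argmax(scores))  # index of the predicted class
```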


Classification that Minimizes Risk

• Basic Idea
To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature.

• A loss function
– The cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci


Classification that Minimizes Risk

• A loss matrix
– We collect the loss values Lij into an l × l matrix for l-class classification problems

• Expected (average) conditional loss

In short,
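Written out with the loss matrix Lij defined above (a standard formulation), the expected conditional loss of deciding class cj for a pattern with feature value x is:

$$R(c_j \mid x) = \sum_{i=1}^{l} L_{ij}\, P(c_i \mid x)$$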


Classification that Minimizes Risk

• Overall Risk

The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.

• Bayes risk
The minimal overall risk R is called the Bayes risk; minimizing it generalizes Bayes' rule for minimization of the probability of classification error.


Classification that Minimizes Risk

• Bayes’ classification rule with Bayes risk

Choose a decision (a class) ci for which the conditional risk R(ci|x) is the smallest:


Classification that Minimizes Risk

• Bayesian Classification Minimizing the Probability of Error

– Symmetrical zero-one conditional loss function: Lij = 0 for i = j and Lij = 1 for i ≠ j

– With this loss, the conditional risk R(cj|x) criterion is the same as the average probability of classification error: R(cj|x) = 1 - P(cj|x)

– An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision


Classification that Minimizes Risk

• Generalization of the Maximum Likelihood Classification

– Generalized likelihood ratio for classes ci and cj

– Generalized threshold value

– The maximum likelihood classification rule

“Decide a class cj if


Decision Regions and Probability of Errors

• Decision regions

– A classifier divides the feature space into l disjoint decision subspaces R1, R2, …, Rl

– The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to class ci


Decision Regions and Probability of Errors

• Decision boundaries (decision surfaces)

– Decision boundaries are the boundaries between adjacent decision regions

"The task of classifier design is to find classification rules that will guarantee division of the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that minimize a selected classification performance criterion"


Decision Regions and Probability of Errors

• Decision boundaries


Decision Regions and Probability of Errors

• Optimal classification with decision regions

– Average probability of correct classification

"Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion"


Discriminant Functions

• Discriminant functions:

• Discriminant type classifier
– It assigns an object with a given value x of a feature vector to class cj if dj(x) > di(x) for all i ≠ j

• Classification rule for a discriminant function-based classifier

1) Compute numerical values of all discriminant functions for x

2) Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:

Select a class cj for which dj(x) = max(di(x) ); i = 1, 2, …, l


Discriminant Functions

• Discriminant classifier


Discriminant Functions

• Discriminant type classifier for Bayesian classification

– The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x):

– Practical versions using Bayes’ theorem

– Bayesian discriminant in a natural logarithmic form
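The three choices mentioned above can be summarized in their standard forms; all are monotonic transformations of one another and therefore yield the same classification decisions:

$$d_i(x) = P(c_i \mid x), \qquad d_i(x) = p(x \mid c_i)\,P(c_i), \qquad d_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$$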


Discriminant Functions

• Characteristics of discriminant function

– Discriminant functions define the decision boundaries that separate the decision regions

– Generally, a decision boundary between neighboring decision regions is defined where the corresponding discriminant function values are equal

– The decision boundaries are unaffected by monotonically increasing transformations of the discriminant functions


Discriminant Functions

• Bayesian Discriminant Functions for Two Classes

– General case
Two discriminant functions: d1(x) and d2(x).

Two decision regions: R1 and R2.

The decision boundary: d1(x) = d2(x).

– Using a dichotomizer
Single discriminant function: d(x) = d1(x) - d2(x).


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Quadratic Discriminant

– Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class

– The Bayesian discriminant (from the previous section):


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Quadratic Discriminant

– Gaussian distribution of the probability density function p(x|ci)

– Quadratic discriminant function (see the formula below)

– Decision boundaries: hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
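For a Gaussian class-conditional density with mean μi and covariance Σi, the quadratic discriminant takes the standard log-posterior form (dropping the class-independent constant):

$$d_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$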


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

Given: A pattern x, values of the class conditional probability densities p(x|ci), and the a priori probabilities P(ci)

1) Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set

2) Compute values of the discriminant function for all classes

3) Choose a class cj as the prediction of the true class, the one for which the value of the associated discriminant function dj(x) is largest:

Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l
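A compact sketch of this procedure in Python/NumPy, estimating the class means, covariances, and priors from a labeled training set and then classifying by the largest quadratic discriminant value. Function and variable names are illustrative:

```python
import numpy as np

def fit_quadratic_discriminant(X, y):
    """Estimate mean vector, covariance matrix, and prior for each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def classify_quadratic(x, params):
    """Assign x to the class with the largest quadratic discriminant value."""
    x = np.asarray(x, dtype=float)
    best_class, best_score = None, -np.inf
    for c, (mu, cov, prior) in params.items():
        diff = x - mu
        score = (-0.5 * diff @ np.linalg.inv(cov) @ diff
                 - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```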


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Linear Discriminant:

– Assumption: equal covariance matrices for all classes, Σi = Σ

– The Quadratic Discriminant:

– A linear form of discriminant functions:
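Under the equal-covariance assumption the quadratic term becomes class independent, and the discriminant reduces to the standard linear form:

$$d_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\,\mu_i^T \Sigma^{-1} \mu_i + \ln P(c_i)$$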


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– The classification process using linear discriminants

1) Compute, for a given x, numerical values of discriminant functions for all classes:

2) Choose a class cj for which the value of the discriminant function dj(x) is largest:

Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l


Discriminant Functions

• Quadratic and Linear Discriminants

Example: Let us assume that the following two-feature patterns x ∈ R^2 from two classes c1 = 0 and c2 = 1 have been drawn according to a Gaussian (normal) density distribution:


Discriminant Functions

• Quadratic and Linear Discriminants

Example (continued)

– The estimates of the symmetric covariance matrices for both classes

– The linear discriminant functions for both classes


Discriminant Functions

• Quadratic and Linear Discriminants

Example – Two-class two-feature pattern dichotomizer.


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier

– Assumption

– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Equal a priori probabilities for all classes P(ci) = P

– Discriminant function


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier

– The classifier selects the class cj for which the value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.

– Linear version of the minimum Mahalanobis distance classifier


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier
Given: The mean vectors μi for all classes (i = 1, 2, …, l), the common covariance matrix Σ, and a given value x of a feature vector

1) Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.

2) Choose a class cj as a prediction of true class, for which the value of the associated Mahalanobis distance attains the minimum:
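A minimal sketch of this classifier in Python/NumPy, assuming a shared covariance matrix Σ as stated in the assumptions above; the function name is illustrative:

```python
import numpy as np

def mahalanobis_classify(x, means, cov):
    """Pick the class whose mean vector is closest to x in Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    cov_inv = np.linalg.inv(cov)
    dists = []
    for mu in means:
        diff = x - np.asarray(mu, dtype=float)
        dists.append(float(diff @ cov_inv @ diff))   # squared Mahalanobis distance
    return int(np.argmin(dists))                     # index of the predicted class
```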


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– Assumption

– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Features are statistically independent

– Discriminant function

where


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– Discriminants

– Quadratic discriminant formula

– Linear discriminant formula


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– “Neural network” style as a linear threshold machine

where

– The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) - dj(x) = 0.


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Euclidean Distance Classifier

– Assumption
– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Features are statistically independent

– Equal a priori probabilities for all classes P(ci) = P

– Discriminants

or


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Euclidean Distance Classifier

– The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which the value x is nearest to the corresponding mean vector μj.

– Linear version of the minimum distance classifier


Discriminant Functions

• Quadratic and Linear Discriminants – Minimum Euclidean Distance Classifier
Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

1) Compute numerical values of the Euclidean distances between x and the means μi for all classes:

2) Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest:


Discriminant Functions

• Quadratic and Linear Discriminants

– Characteristics of the Bayesian Normal Discriminant

– Assumptions
– multivariate normality within classes
– equal covariance matrices across classes

– Under these assumptions, the linear discriminant is equivalent to the optimal classifier

– In practice, these assumptions are satisfied only approximately

– Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases


Estimation of Probability Densities

• Basic Idea

In Bayesian classifier design, the a priori probabilities and the class conditional probability densities are not known exactly and must be estimated from the limited number of a priori observed objects. This estimation should be optimal according to a well-defined estimation criterion.

• Estimates of a priori probabilities


Estimation of Probability Densities

• Estimation of the class conditional probability densities p(x|ci)

– Parametric methods: with the assumption of a specific functional form of the probability density function

– Nonparametric methods: without the assumption of a specific functional form of the probability density function

– Semiparametric methods: a combination of parametric and nonparametric methods


Estimation of Probability Densities

• Parametric Methods
– A priori observations of objects and corresponding patterns:

– Split set of all patterns X according to a class into l disjoint sets:

– Assume that the parametric form of the class conditional probability density is given as a function:

where


Estimation of Probability Densities

• Parametric Methods

– If the probability density has a normal (Gaussian) form:

where


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Assumption

– we are given a limited-size set of N patterns xi:

– we know a parametric form p(x|θ) of the conditional probability density function

– Goal
– The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Likelihood
– The joint probability density L(θ) is a function of the parameter vector θ for a given set of patterns X.
– It is called the likelihood of θ for the given set of patterns X.

– Maximum Likelihood Estimation

The function L(θ) can be chosen as a criterion for finding the optimal estimate of θ. This is called the maximum likelihood estimation of parameters.
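For independent samples the likelihood takes the standard product form, and the maximum likelihood estimate is its maximizer:

$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta), \qquad \hat{\theta} = \arg\max_{\theta} L(\theta)$$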


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Minimizing the negative natural logarithm of the likelihood L(θ):

– For the differentiable function p(xi|θ):


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– For the normal form of a probability density function N(μ, Σ) with unknown parameters μ and Σ constituting the vector θ:


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Example of Maximum Likelihood Estimation

– for

– The maximum likelihood estimation criterion

– The maximum likelihood estimates for the parameters:
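For the normal density N(μ, Σ) these estimates take the standard sample-mean and sample-covariance form:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$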


Estimation of Probability Densities

• Nonparametric Methods
"Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function."

– Nonparametric techniques:

– Histogram

– Kernel-based method

– k-nearest neighbors

– Nearest neighbors


Estimation of Probability Densities

• Nonparametric Methods

General Idea: Determine an estimate of the true probability density p(x) based on the available limited-size samples

– The probability that a new pattern x will fall inside a region R

– Approximation of the probability for a small region and for continuous p(x), with almost the same values within the region R


Estimation of Probability Densities

• Nonparametric Methods

General Idea (continued)
– The probability that, for a set of N sample patterns, k of them will fall in the region R

– Estimate of the probability P

– Approximation for a probability density function for a given pattern x
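Combining the two approximations above (P ≈ k/N from the samples, and P ≈ p(x)·V for a small region R of volume V) gives the basic nonparametric estimate:

$$\hat{p}(x) \approx \frac{k}{N\,V}$$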


Estimation of Probability Densities

• Nonparametric Methods

– Kernel-based Method and Parzen Window

– The kernel-based method is based on fixing around a pattern vector x a region R (and thus a region volume V) and counting the number k of given training patterns falling in this region by using a special kernel function associated with the region.

– Such a kernel function is also called a Parzen window


Estimation of Probability Densities

• Nonparametric Methods

– Hypercube-type Parzen window

Volume of the hypercube:

Kernel (window) function:

Total number of patterns falling within the hypercube

The estimate of the probability density function
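In the standard Parzen-window formulation, for a hypercube of edge length h centered at x (h is the smoothing parameter; this is the common textbook form):

$$V = h^n, \qquad \varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2} \text{ for } j = 1,\dots,n \\ 0 & \text{otherwise} \end{cases}$$

$$k = \sum_{i=1}^{N} \varphi\!\left(\frac{x - x_i}{h}\right), \qquad \hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{h^n}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$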


Estimation of Probability Densities

• Nonparametric Methods

– Smooth estimate of the probability density function

– A kernel function must satisfy two conditions:

and

– For example, the radially symmetric multivariate Gaussian (normal) kernel:

– The estimate of the probability density function:
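A minimal sketch of the resulting estimator with the radial Gaussian kernel (h is the smoothing parameter; names are illustrative):

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """Kernel density estimate at x using an isotropic Gaussian kernel of width h."""
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)       # shape (N, n)
    N, n = samples.shape
    norm = (2.0 * np.pi * h**2) ** (n / 2.0)         # Gaussian normalizing constant
    sq_dists = np.sum((samples - x) ** 2, axis=1)    # squared distances to x
    return float(np.mean(np.exp(-sq_dists / (2.0 * h**2)) / norm))
```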


Estimation of Probability Densities

• Nonparametric Methods

– Smooth estimate of the probability density function

– The estimate of the class-dependent probability density p(x|ck):

– The estimate of the class-dependent probability density p(x|ck) for the Gaussian kernel:


Estimation of Probability Densities

• Nonparametric Methods

– Design issues

– The selection of a kernel function: Parzen window, Gaussian kernel, etc.

– The selection of a smoothing parameter

– The generalization ability of kernel-based density estimation depends on the training set and on the smoothing parameters


Estimation of Probability Densities

• Nonparametric Methods

– K-nearest Neighbors
"A method of probability density estimation with variable-size regions"

– First, a small n-dimensional sphere centered at the point x is located in the pattern space.
– Second, the radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from the given training set.
– Then the estimate of the probability density at x is computed as p(x) ≈ k / (N V), where V is the volume of the resulting sphere.


Estimation of Probability Densities

• Nonparametric Methods

– K-nearest Neighbors Classification Rule
– First, for a given x, the k nearest neighbors from the training set are found (regardless of class label) based on a defined pattern distance measure.

– Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.

– Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.
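A minimal sketch of this rule with the Euclidean distance measure (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Predict the class of x by majority vote among its k nearest training patterns."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training patterns
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)  # class counts n_i among the neighbors
    return votes.most_common(1)[0][0]             # class with the largest n_j
```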


Estimation of Probability Densities

• Nonparametric Methods
– Nearest Neighbor Classification Rule
"The simplest version of the k-nearest neighbors classification uses a number of neighbors k equal to one"

– Algorithm
Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.

• Compute, for the given x, the nearest neighbor xj from the whole training set based on the defined pattern distance measure distance(x, xi).

• Assign to x the class cj of its nearest neighbor xj.


Estimation of Probability Densities

• Semiparametric Methods

“Combination of parametric and nonparametric methods”

– Two semiparametric methods
– Functional approximation
– Mixture models (mixtures of probability densities)

– Major advantage
The ability to precisely fit component functions locally to specific regions of a feature space, based on what is discovered about the probability distributions and their modalities from the existing data


Estimation of Probability Densities

• Semiparametric Methods

– Functional Approximation

– Approximation of the density by a linear combination of m basis functions φi(x):

– Using a symmetric radial basis function


Estimation of Probability Densities

• Semiparametric Methods

– Functional Approximation
– Gaussian radial basis function: "the most commonly used basis function"

– Optimization criterion for the functional approximation of density

– Optimal estimates for parameters:


Estimation of Probability Densities

• Semiparametric Methods
– The algorithm for functional approximation
Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.

1) Compute the estimates of unknown parameters

2) Form the model of the probability density as a functional approximation


Estimation of Probability Densities

• Semiparametric Methods

– Mixture Models (Mixtures of Probability Densities)
"These models are based on a linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of the data"

– The linear mixture distribution

Simplified version:
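In its standard form the linear mixture combines m component densities p(x|j) with mixing coefficients P(j) that sum to one:

$$p(x) = \sum_{j=1}^{m} P(j)\,p(x \mid j), \qquad \sum_{j=1}^{m} P(j) = 1$$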


Estimation of Probability Densities

• Distance Between Probability Densities and the Kullback-Leibler Distance

– Distance
"We can define a distance between two densities, the true density p(x) and its approximate estimate p̂(x)"

– Kullback-Leibler distance
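In its standard form, with true density p(x) and its estimate p̂(x):

$$D_{KL}(p \,\|\, \hat{p}) = \int p(x)\,\ln\frac{p(x)}{\hat{p}(x)}\,dx$$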


Probabilistic Neural Network

• Probabilistic Neural Network
"The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of classification error)"

– Optimal Bayes’ classification rule

– Kernel-based estimation of a probability density function


Probabilistic Neural Network

• Topology


Probabilistic Neural Network

• Details

– An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1, 2, …, n) of the n-dimensional input pattern vector x.

– A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra.

– The transfer function of a pattern-layer neuron implements a kernel function (a Parzen window)


Probabilistic Neural Network

• Details
– The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.

– The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.

– The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x
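A minimal sketch of this feedforward computation in Python/NumPy: Gaussian kernels in the pattern layer, per-class averages in the summation layer, and a Bayes decision in the output layer. The smoothing parameter sigma and all names are illustrative:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=1.0):
    """PNN forward pass: kernel values -> per-class sums -> largest prior-weighted score."""
    x = np.asarray(x, dtype=float)
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]                             # reference patterns of class c
        prior = len(Xc) / len(X_train)                         # P(c) from class frequencies
        sq = np.sum((Xc - x) ** 2, axis=1)                     # pattern layer: squared distances
        kernel_mean = np.mean(np.exp(-sq / (2.0 * sigma**2)))  # summation layer (common constant omitted)
        scores.append(prior * kernel_mean)
    return classes[int(np.argmax(scores))]                     # output layer: Bayes decision
```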


Probabilistic Neural Network

• Pattern Processing
"Processing of patterns by the already-designed PNN network is performed in a feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class"

• PNN with the Radial Gaussian Kernel
– Kernel function:

– Transfer function:

– Output activation function:


Probabilistic Neural Network

• PNN with the Radial Gaussian Normal Kernel and Normalized Patterns
– Transfer function:

– Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function.

– The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function

– The pattern-neuron output activation function:


Constraints in Classifier Design

• Problems

– Will a classifier guarantee minimization of the average probability of the classification error?

– Does the training set represent well the patterns generated by the physical phenomenon?

– Are patterns drawn according to the probability density characteristic of the underlying phenomenon?

– Is the average probability of a classification error difficult to calculate?


Constraints in Classifier Design

• Suboptimal solutions of Bayesian classifier design

– The estimation of class conditional probabilities is based on a limited sample

– The samples are frequently collected randomly, and not by use of a well-planned experimental procedure


REGRESSION

• Data Models

• Simple Linear Regression Analysis

• Multiple Regression

• General Least Squares and Multiple Regression

• Assessing the Quality of the Multiple Regression Model


Data Models

• Mathematical models
"They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design."

• Black-box models
– Mathematical models obtained by processing existing data without using the laws of physics governing the data-generating phenomena

• Regression analysis
– Data analysis and model design are based on a sample from a given population


Data Models

• Categories of regression models
– Simple linear regression
– Multiple linear regression
– Neural network-based linear regression
– Polynomial regression
– Logistic regression
– Log-linear regression
– Local piecewise linear regression
– Nonlinear regression (with a nonlinear model)
– Neural network-based nonlinear regression


Data Models

• Static and dynamic models

– A static model produces outcomes based only on the current input (no internal memory).

– A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory).


Data Models

• Data gathering

– Random sample from a certain population

– N pairs of the experimental data set named Torig


Data Models

• Regression analysis
"A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables"


Data Models

• Regression analysis
– Simple linear regression
– To find the linear relationship between two variables, x and y, and to discover a linear model, i.e., a line equation y = b + ax, which is the best fit to the given data in order to predict data values

– This modeling line is called the regression line of y on x
– The equation of that line is called a regression equation (regression model)

– Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x


Data Models

• Visualization of Regression

– Scatter plot for height versus weight data


Data Models

• Visualization of Regression

– Scatter plot for height versus weight data


Simple Linear Regression Analysis

• Sample data and regression model


Simple Linear Regression Analysis

• Assumptions
– The observations yi (i = 1, …, N) are random samples and are mutually independent.
– The regression error terms (the difference between the predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variance.
– The distribution of the error term is independent of the joint distribution of the explanatory variables. It is also assumed that the unknown parameters of regression models are constants.


Simple Linear Regression Analysis

• Simple Linear Regression Analysis
– Evaluation of basic statistical characteristics of the data

– An estimation of the optimal parameters of a linear model

– Assessment of model quality and generalization ability to predict the outcome for new data


Simple Linear Regression Analysis

• Model Structure
– Nonlinear data:

– Generally, a function f(x) could be nonlinear in x:

– Linear form :


Simple Linear Regression Analysis

• Regression Error (residual error)

– Difference between real-value yi and predicted-value yi,est


Simple Linear Regression Analysis

• Performance Criterion – Sum of Squared Errors.

– The sum of squared errors performance criterion for multiple regression

– The minimization technique in regression uses the sum of squared errors as its criterion: the method of least squares of errors (LSE) or, in short, the method of least squares


Simple Linear Regression Analysis

• Basic Statistical Characteristics of Data

– The mean of N samples

– The variance

– The covariance


Simple Linear Regression Analysis

• Sum of Squared Variations in y Caused by the Regression Model

– The total sum of squared variations in y

– These formulas are used to define important regression measures (for example, the correlation coefficient)


Simple Linear Regression Analysis

• Computing Optimal Values of the Regression Model Parameters
– The optimal model parameter values have to be computed based on the given data set and the defined performance criterion

– Methods for estimation of optimal model parameter values
– The analytical offline method
– The analytical recursive offline method
– Iterative search for the optimal model parameters
– Neural network-based regression


Simple Linear Regression Analysis

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

– The general linear model structure

– The performance criterion and performance curve for the model y = ax (a model with b = 0)


Simple Linear Regression Analysis

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

– The optimal parameters
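In the usual closed form, the least-squares estimates of the model y = b + ax written in terms of the sample means are:

$$\hat{a} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\,\bar{x}$$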


Simple Linear Regression Analysis

• Procedure for simple linear regression
Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }

1) Compute the statistical characteristics of the data

2) Compute the estimates of the model optimal parameters using Equations

3) Assess the regression model quality, indicating how well the model fits the data. Compute:

a) Standard error of estimate

b) Correlation coefficient r

c) Coefficient of determination r2
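A sketch of the whole procedure in Python/NumPy, including the three quality measures listed above. Names are illustrative, and the standard error uses the common N - 2 denominator of simple regression:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Fit y = b + a*x by least squares and report basic quality measures."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - a * x.mean()
    y_est = b + a * x
    se = np.sqrt(np.sum((y - y_est) ** 2) / (N - 2))   # standard error of estimate
    r = np.corrcoef(x, y)[0, 1]                        # correlation coefficient
    return a, b, se, r, r ** 2                         # r**2 is the coefficient of determination

# Hypothetical usage:
# a, b, se, r, r2 = simple_linear_regression([1, 2, 3, 4], [1.5, 2.0, 2.5, 3.2])
```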


Simple Linear Regression Analysis

Example

– Sample of four data points

– Resulting regression line

y = 0.9 + 0.56x


Simple Linear Regression Analysis

• Optimal Parameter Values in the Minimum Least Squares Sense

– Required conditions for a valid linear regression
– The error term e = y - (b + ax) is normally distributed
– The error variance is the same for all values of x
– Errors are independent of each other.


Simple Linear Regression Analysis

• Quality of the Linear Regression Model and Linear Correlation Analysis

– Assessment of model quality

– The resulting correlation coefficient can be used as a measure of how well the trends in the predicted values follow the trends in the training data

– The coefficient of determination can be used to measure how well the regression line fits the data points


Simple Linear Regression Analysis

• Correlation coefficient

• Coefficient of determination
– The percent of variation in the dependent variable y that can be explained by the regression equation,
– the explained variation in y divided by the total variation, or
– the square of r (the correlation coefficient)


Simple Linear Regression Analysis

• Coefficient of determination

– Explained and unexplained variation in y


Simple Linear Regression Analysis

• Coefficient of determination
– Example
If the coefficient of correlation has the value r = 0.9327, then the value of the coefficient of determination is r² = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data. The remaining 13% of the total variation in y remains unexplained.

– The calculation of coefficient of determination


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method

– The matrix form of the model description (the estimation of y) for all N experimental data points

– The regression error


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method
– The performance criterion:

– Optimal parameters:

– The value of the criterion for the optimal parameter vector:

– The regression error for the model with the optimal parameter vector:
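In matrix notation, with X the design matrix (a column of ones followed by the x values) and θ = [b, a]ᵀ, these quantities take the standard least-squares form:

$$J(\theta) = (y - X\theta)^T(y - X\theta), \qquad \hat{\theta} = (X^T X)^{-1} X^T y, \qquad e = y - X\hat{\theta}$$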


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method

– Example: let us consider again the dataset from the previous example

y = 0.56x + 0.9


Multiple Regression

• Definition
Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables that are used to explain the variability of one (or generally more) dependent variable y

– Linear multiple regression model

– Linear multiple regression model using vector notation

– This regression model is represented by a hyperplane in (n + 1)-dimensional space.


Multiple Regression

• Geometrical Interpretation: Regression Errors
The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data

– The performance criterion

– The error variance and standard error of the estimate
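With e_i denoting the residuals of the fitted model, the error variance and the standard error of the estimate take the standard form (the denominator is discussed immediately below):

$$s_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N - n - 1}, \qquad s_e = \sqrt{s_e^2}$$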


Multiple Regression

• Degrees of Freedom

– The denominator N - n - 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N - n - 1 degrees of freedom

– The degrees of freedom have been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data


General Least Squares and Multiple Regression

• General model description in function form

– Data model

– Performance criterion

– Regression error


General Least Squares and Multiple Regression

• General model description in matrix form

– Data model

– Performance criterion

– Optimal parameters


General Least Squares and Multiple Regression

• Practical, Numerically Stable Computation of the Optimal Model Parameters

– Problem
"The solution for the optimal least-squares parameters is almost never computed directly from the normal equations, due to their poor numerical performance in cases when the matrix XᵀX (the covariance matrix) is ill conditioned"

– Solution: various matrix decomposition methods (see the sketch below)
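A minimal sketch of the numerically preferable route in NumPy: np.linalg.lstsq solves the least-squares problem through an SVD-based factorization instead of explicitly forming and inverting XᵀX. Names are illustrative:

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of y = a0 + a1*x1 + ... + an*xn without forming (X^T X)^-1."""
    X = np.asarray(X, dtype=float)
    X_design = np.column_stack([np.ones(len(X)), X])       # prepend the intercept column
    theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # SVD-based, numerically stable
    return theta                                           # [a0, a1, ..., an]
```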


Assessing the Quality of the Multiple Regression Model

• The Coefficient of Multiple Determination, R²

"The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together."

– Adjusted R²

Adjusted R² uses the number of design parameters (plus a constant) used in the model, together with the number of data points N, in order to correct this statistic in situations when unnecessary parameters are used in the model structure
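A common form of the adjustment, with n independent variables and N data points:

$$R_{adj}^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1}$$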


Assessing the Quality of the Multiple Regression Model

• Cp Statistic

– It is used to compare multiple regression models

– When comparing alternative regression models, the designer aims to choose models whose values of Cp are close to or below (n + 1)


Assessing the Quality of the Multiple Regression Model

• Multiple Correlation
– The value of R can be found as the positive square root of R² (the coefficient of multiple determination)

– It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.

– A value of R close to 1 indicates that the fit is very good

– A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction


Assessing the Quality of the Multiple Regression Model

Example
"Let us consider a multiple linear regression analysis for a data set containing N = 4 cases, composed of one dependent variable y and two independent variables x1 and x2"

– Three-dimensional data


Assessing the Quality of the Multiple Regression Model

Example
– The scatter plot of the data points in three-dimensional space (x1, x2, y)


Assessing the Quality of the Multiple Regression Model

Example

– The data matrix

– The optimal model parameters


Assessing the Quality of the Multiple Regression Model

Example

– The optimal model: y = 3.1 + 0.9x1 + 0.56x2

– The optimal regression model in (x1, x2, y) space:


Assessing the Quality of the Multiple Regression Model

Example

– Multiple regression: regression plane model and scatter plot


Assessing the Quality of the Multiple Regression Model

Example

– The residuals (errors)

– The criterion value for the optimal parameters: 0.016

