
Chapter 11

Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan


Outline

• Bayesian Methods

– Basics of Bayesian Methods

– Bayesian Classification – General Case

– Classification that Minimizes Risk

– Decision Regions and Probability of Errors

– Discriminant Functions

– Estimation of Probability Densities

– Probabilistic Neural Network

– Constraints in Classifier Design


• Regression

– Data Models

– Simple Linear Regression

– Multiple Regression

– General Least Squares and Multiple Regression

– Assessing Quality of the Multiple Regression Model


Bayesian Methods

Statistical processing based on Bayes decision theory is a fundamental technique for pattern recognition and classification.

Bayes decision theory provides a framework for statistical methods that classify patterns into classes based on the probabilities of patterns and their features.


Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature: C = { "an eagle", "a hawk" }

Values of C = { c1, c2 } = { "an eagle", "a hawk" }

We may assume that, among a large number N of prior observations, n_eagle of them belonged to class c1 ("an eagle") and n_hawk belonged to class c2 ("a hawk"), with n_eagle + n_hawk = N.


Basics of Bayesian Methods

A priori (prior) probability P(ci):

Estimation of a prior P(ci):

P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.
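A standard relative-frequency estimate, consistent with the eagle/hawk counts introduced above, is:

$$P(c_1) \approx \frac{n_{eagle}}{N}, \qquad P(c_2) \approx \frac{n_{hawk}}{N}, \qquad P(c_1) + P(c_2) = 1$$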


Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) of how likely it is that an eagle or a hawk will appear, even before a bird is actually observed.

– Natural and best decision:

"Assign a bird to class c1 if P(c1) > P(c2); otherwise, assign the bird to class c2."

– The probability of classification error:

P(classification error) = P(c2) if we decide C = c1
P(classification error) = P(c1) if we decide C = c2


Involving Object Features in Classification

• Feature variable / feature x

– It characterizes an object and allows for better discrimination of one class from another

– We assume it to be a continuous random variable taking values from a given range

– The variability of a random variable x can be expressed in probabilistic terms

– We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function):


Involving Object Features in Classification

Examples of probability densities


Involving Object Features in Classification

• Probability density function p(x|ci)

– also called the likelihood of a class ci with respect to the value x of a feature variable

– the likelihood that an object belongs to class ci is bigger if p(x|ci) is larger

– joint probability density function p(ci , x)

A probability density that an object is in a class ci and has a feature variable value x.

– A posteriori (posterior) probability P(ci|x)

The conditional probability P(ci|x) (i = 1, 2) specifies the probability that the object class is ci given that the measured value of the feature variable is x.


Involving Object Features in Classification

• Bayes’ rule / Bayes’ theorem

– From probability theory (see Appendix B)

– An unconditional probability density function
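For reference, the standard two-class form of Bayes' theorem and of the unconditional density mentioned above is:

$$P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{p(x)}, \qquad p(x) = \sum_{j=1}^{2} p(x \mid c_j)\,P(c_j)$$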


Involving Object Features in Classification

• Bayes' rule

"The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci) together with the class conditional probability density function p(x|ci)."


Involving Object Features in Classification

• Bayes’ decision rule

P(classification error | x) = P(c2|x) if we decide C = c1
P(classification error | x) = P(c1|x) if we decide C = c2

"This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)"

– Bayes' classification rule guarantees minimization of the average probability of classification error


Involving Object Features in Classification

Example: Let us consider a bird classification problem with P(c1) = P("an eagle") = 0.8 and P(c2) = P("a hawk") = 0.2 and known probability density functions p(x|c1) and p(x|c2). Assume that, for a new bird, we have measured its size x = 45 cm and that for this value we computed p(45|c1) = 2.2828 × 10^-2 and p(45|c2) = 1.1053 × 10^-2. The classification rule then predicts class c1 ("an eagle") because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 × 10^-2 × 0.8 > 1.1053 × 10^-2 × 0.2). Assume further that the unconditional density value is known to be p(45) = 0.3. The probability of classification error is computed below.
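Substituting the values given above into Bayes' theorem (a direct computation), the error probability equals the posterior probability of the rejected class:

$$P(\mathrm{error} \mid x = 45) = P(c_2 \mid 45) = \frac{p(45 \mid c_2)\,P(c_2)}{p(45)} = \frac{1.1053 \times 10^{-2} \cdot 0.2}{0.3} \approx 0.0074$$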


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Real-valued features of an object form an n-dimensional column vector x ∈ R^n:

– The object may belong to l distinct classes (l distinct states of nature): c1, c2, …, cl


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Bayes’ theorem

A priori probability: P(ci), i = 1, 2, …, l

Class conditional probability density function: p(x|ci)

A posteriori (posterior) probability: P(ci|x)

Unconditional probability density function: p(x) = Σj p(x|cj) P(cj)


Bayesian Classification – General Case

• Bayes’ Classification Rule for Multiclass Multifeature Objects

– Bayes classification rule: Assign an object with a given value x of a feature vector to class cj when p(x|cj)P(cj) ≥ p(x|ci)P(ci) for all i = 1, 2, …, l (a sketch of this rule follows below).
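A minimal sketch of this rule in Python/NumPy. The class-conditional densities are assumed to be available as callables; the names densities, priors, and bayes_classify are illustrative and not from the text:

```python
import numpy as np

def bayes_classify(x, densities, priors):
    """Assign x to the class j with the largest p(x|c_j) * P(c_j).

    densities: list of callables, densities[j](x) returns p(x | c_j)
    priors:    list of a priori probabilities P(c_j)
    """
    scores = [densities[j](x) * priors[j] for j in range(len(priors))]
    return int(np.argmax(scores))  # index of the predicted class
```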


Classification that Minimizes Risk

• Basic Idea
To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature.

• A loss function
– The cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci


Classification that Minimizes Risk

• A loss matrix
– We collect the loss values Lij into an l × l matrix for l-class classification problems

• Expected (average) conditional loss

In short,
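Written out with the loss matrix Lij defined above (a standard formulation), the expected conditional loss of deciding class cj for a pattern with feature value x is:

$$R(c_j \mid x) = \sum_{i=1}^{l} L_{ij}\, P(c_i \mid x)$$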


Classification that Minimizes Risk

• Overall Risk

The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.

• Bayes risk
The minimal overall risk R is called the Bayes risk; minimizing it generalizes Bayes' rule for minimization of the probability of classification error.


Classification that Minimizes Risk

• Bayes’ classification rule with Bayes risk

Choose a decision (a class) ci for which the conditional risk R(ci|x) is the smallest:


Classification that Minimizes Risk

• Bayesian Classification Minimizing the Probability of Error

– Symmetrical zero-one conditional loss function: Lij = 0 for i = j and Lij = 1 for i ≠ j

– With this loss, the conditional risk R(cj|x) criterion is the same as the average probability of classification error: R(cj|x) = 1 - P(cj|x)

– An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision


Classification that Minimizes Risk

• Generalization of the Maximum Likelihood Classification

– Generalized likelihood ratio for classes ci and cj

– Generalized threshold value

– The maximum likelihood classification rule

“Decide a class cj if


Decision Regions and Probability of Errors

• Decision regions

– A classifier divides the feature space into l disjoint decision subspaces R1, R2, …, Rl

– The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to class ci


Decision Regions and Probability of Errors

• Decision boundaries (decision surfaces)

– Decision boundaries are the boundaries between adjacent decision regions

"The task of classifier design is to find classification rules that will guarantee division of the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that minimize a selected classification performance criterion"


Decision Regions and Probability of Errors

• Decision boundaries


Decision Regions and Probability of Errors

• Optimal classification with decision regions

– Average probability of correct classification

"Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion"


Discriminant Functions

• Discriminant functions:

• Discriminant type classifier
– It assigns an object with a given value x of a feature vector to class cj if dj(x) > di(x) for all i ≠ j

• Classification rule for a discriminant function-based classifier

1) Compute numerical values of all discriminant functions for x

2) Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:

Select a class cj for which dj(x) = max(di(x) ); i = 1, 2, …, l


Discriminant Functions

• Discriminant classifier


Discriminant Functions

• Discriminant type classifier for Bayesian classification

– The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x):

– Practical versions using Bayes’ theorem

– Bayesian discriminant in a natural logarithmic form
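The three choices mentioned above can be summarized in their standard forms; all are monotonic transformations of one another and therefore yield the same classification decisions:

$$d_i(x) = P(c_i \mid x), \qquad d_i(x) = p(x \mid c_i)\,P(c_i), \qquad d_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$$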


Discriminant Functions

• Characteristics of discriminant function

– Discriminant functions define the decision boundaries that separate the decision regions

– Generally, a decision boundary between neighboring decision regions is defined where the corresponding discriminant function values are equal

– The decision boundaries are unaffected by monotonically increasing transformations of the discriminant functions


Discriminant Functions

• Bayesian Discriminant Functions for Two Classes

– General case
Two discriminant functions: d1(x) and d2(x).

Two decision regions: R1 and R2.

The decision boundary: d1(x) = d2(x).

– Using a dichotomizer
Single discriminant function: d(x) = d1(x) - d2(x).


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Quadratic Discriminant

– Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class

– The Bayesian discriminant (from the previous section):


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Quadratic Discriminant

– Gaussian distribution of the probability density function p(x|ci)

– Quadratic discriminant function (see the formula below)

– Decision boundaries: hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
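For a Gaussian class-conditional density with mean μi and covariance Σi, the quadratic discriminant takes the standard log-posterior form (dropping the class-independent constant):

$$d_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$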


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

Given: A pattern x, values of the class conditional probability densities p(x|ci), and the a priori probabilities P(ci)

1) Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set

2) Compute values of the discriminant function for all classes

3) Choose a class cj as the prediction of the true class, the one for which the value of the associated discriminant function dj(x) is largest:

Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l
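A compact sketch of this procedure in Python/NumPy, estimating the class means, covariances, and priors from a labeled training set and then classifying by the largest quadratic discriminant value. Function and variable names are illustrative:

```python
import numpy as np

def fit_quadratic_discriminant(X, y):
    """Estimate mean vector, covariance matrix, and prior for each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def classify_quadratic(x, params):
    """Assign x to the class with the largest quadratic discriminant value."""
    x = np.asarray(x, dtype=float)
    best_class, best_score = None, -np.inf
    for c, (mu, cov, prior) in params.items():
        diff = x - mu
        score = (-0.5 * diff @ np.linalg.inv(cov) @ diff
                 - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```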


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Linear Discriminant:

– Assumption: equal covariance matrices for all classes, Σi = Σ

– The Quadratic Discriminant:

– A linear form of discriminant functions:
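Under the equal-covariance assumption the quadratic term becomes class independent, and the discriminant reduces to the standard linear form:

$$d_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\,\mu_i^T \Sigma^{-1} \mu_i + \ln P(c_i)$$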


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space


Discriminant Functions

• Quadratic and Linear Discriminants Derived from the Bayes Rule

– The classification process using linear discriminants

1) Compute, for a given x, numerical values of discriminant functions for all classes:

2) Choose a class cj for which the value of the discriminant function dj(x) is largest:

Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l


Discriminant Functions

• Quadratic and Linear Discriminants

Example: Let us assume that the following two-feature patterns x ∈ R^2 from two classes c1 = 0 and c2 = 1 have been drawn according to a Gaussian (normal) density distribution:


Discriminant Functions

• Quadratic and Linear Discriminants

Example (continued)

– The estimates of the symmetric covariance matrices for both classes

– The linear discriminant functions for both classes


Discriminant Functions

• Quadratic and Linear Discriminants

Example – Two-class two-feature pattern dichotomizer.


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier

– Assumption

– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Equal a priori probabilities for all classes P(ci) = P

– Discriminant function


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier

– The classifier selects the class cj for which the value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.

– Linear version of the minimum Mahalanobis distance classifier


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Mahalanobis Distance Classifier
Given: The mean vectors μi for all classes (i = 1, 2, …, l), the common covariance matrix Σ, and a given value x of a feature vector

1) Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.

2) Choose a class cj as a prediction of true class, for which the value of the associated Mahalanobis distance attains the minimum:
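A minimal sketch of this classifier in Python/NumPy, assuming a shared covariance matrix Σ as stated in the assumptions above; the function name is illustrative:

```python
import numpy as np

def mahalanobis_classify(x, means, cov):
    """Pick the class whose mean vector is closest to x in Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    cov_inv = np.linalg.inv(cov)
    dists = []
    for mu in means:
        diff = x - np.asarray(mu, dtype=float)
        dists.append(float(diff @ cov_inv @ diff))   # squared Mahalanobis distance
    return int(np.argmin(dists))                     # index of the predicted class
```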


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– Assumption

– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Features are statistically independent

– Discriminant function

where


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– Discriminants

– Quadratic discriminant formula

– Linear discriminant formula


Discriminant Functions

• Quadratic and Linear Discriminants

– Linear Discriminant for Statistically Independent Features

– “Neural network” style as a linear threshold machine

where

– The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) - dj(x) = 0.


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Euclidean Distance Classifier

– Assumption
– Equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)

– Features are statistically independent

– Equal a priori probabilities for all classes P(ci) = P

– Discriminants

or


Discriminant Functions

• Quadratic and Linear Discriminants

– Minimum Euclidean Distance Classifier

– The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which the value x is nearest to the corresponding mean vector μj.

– Linear version of the minimum distance classifier


Discriminant Functions

• Quadratic and Linear Discriminants – Minimum Euclidean Distance Classifier
Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

1) Compute numerical values of the Euclidean distances between x and the means μi for all classes:

2) Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest:


Discriminant Functions

• Quadratic and Linear Discriminants

– Characteristics of the Bayesian Normal Discriminant

– Assumptions
– multivariate normality within classes
– equal covariance matrices across classes

– Under these assumptions, the linear discriminant is equivalent to the optimal classifier

– In practice, these assumptions are satisfied only approximately

– Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases


Estimation of Probability Densities

• Basic Idea

In Bayesian classifier design, the a priori probabilities and the class conditional probability densities are not known exactly and must be estimated from the limited number of a priori observed objects. This estimation should be optimal according to a well-defined estimation criterion.

• Estimates of a priori probabilities


Estimation of Probability Densities

• Estimation of the class conditional probability densities p(x|ci)

– Parametric methods: with the assumption of a specific functional form of the probability density function

– Nonparametric methods: without the assumption of a specific functional form of the probability density function

– Semiparametric methods: a combination of parametric and nonparametric methods


Estimation of Probability Densities

• Parametric Methods
– A priori observations of objects and corresponding patterns:

– Split set of all patterns X according to a class into l disjoint sets:

– Assume that the parametric form of the class conditional probability density is given as a function:

where


Estimation of Probability Densities

• Parametric Methods

– If the probability density has a normal (Gaussian) form:

where


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Assumption

– we are given a limited-size set of N patterns xi:

– we know a parametric form p(x|θ) of the conditional probability density function

– Goal
– The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Likelihood
– The joint probability density L(θ) is a function of the parameter vector θ for a given set of patterns X.
– It is called the likelihood of θ for the given set of patterns X.

– Maximum Likelihood Estimation

The function L(θ) can be chosen as a criterion for finding the optimal estimate of θ. This is called the maximum likelihood estimation of parameters.
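For independent samples the likelihood takes the standard product form, and the maximum likelihood estimate is its maximizer:

$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta), \qquad \hat{\theta} = \arg\max_{\theta} L(\theta)$$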


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Minimizing the negative natural logarithm of the likelihood L(θ):

– For the differentiable function p(xi|θ):


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– For the normal form of a probability density function N(μ, Σ) with unknown parameters μ and Σ constituting the vector θ:


Estimation of Probability Densities

• The Maximum Likelihood Estimation of Parameters

– Example of Maximum Likelihood Estimation

– for

– The maximum likelihood estimation criterion

– The maximum likelihood estimates for the parameters:
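For the normal density N(μ, Σ) these estimates take the standard sample-mean and sample-covariance form:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$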


Estimation of Probability Densities

• Nonparametric Methods
"Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function."

– Nonparametric techniques:

– Histogram

– Kernel-based method

– k-nearest neighbors

– Nearest neighbors


Estimation of Probability Densities

• Nonparametric Methods

General Idea: Determine an estimate of the true probability density p(x) based on the available limited-size samples

– The probability that a new pattern x will fall inside a region R

– Approximation of the probability for a small region and for continuous p(x), with almost the same values within the region R


Estimation of Probability Densities

• Nonparametric Methods

General Idea (continued)
– The probability that, for a set of N sample patterns, k of them will fall in the region R

– Estimate of the probability P

– Approximation for a probability density function for a given pattern x
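Combining the two approximations above (P ≈ k/N from the samples, and P ≈ p(x)·V for a small region R of volume V) gives the basic nonparametric estimate:

$$\hat{p}(x) \approx \frac{k}{N\,V}$$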


Estimation of Probability Densities

• Nonparametric Methods

– Kernel-based Method and Parzen Window

– The kernel-based method is based on fixing around a pattern vector x a region R (and thus a region volume V) and counting the number k of given training patterns falling in this region by using a special kernel function associated with the region.

– Such a kernel function is also called a Parzen window


Estimation of Probability Densities

• Nonparametric Methods

– Hypercube-type Parzen window

Volume of the hypercube:

Kernel (window) function:

Total number of patterns falling within the hypercube

The estimate of the probability density function
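In the standard Parzen-window formulation, for a hypercube of edge length h centered at x (h is the smoothing parameter; this is the common textbook form):

$$V = h^n, \qquad \varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2} \text{ for } j = 1,\dots,n \\ 0 & \text{otherwise} \end{cases}$$

$$k = \sum_{i=1}^{N} \varphi\!\left(\frac{x - x_i}{h}\right), \qquad \hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{h^n}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$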


Estimation of Probability Densities

• Nonparametric Methods

– Smooth estimate of the probability density function

– A kernel function must satisfy two conditions:

and

– For example, the radially symmetric multivariate Gaussian (normal) kernel:

– The estimate of the probability density function:
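A minimal sketch of the resulting estimator with the radial Gaussian kernel (h is the smoothing parameter; names are illustrative):

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """Kernel density estimate at x using an isotropic Gaussian kernel of width h."""
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)       # shape (N, n)
    N, n = samples.shape
    norm = (2.0 * np.pi * h**2) ** (n / 2.0)         # Gaussian normalizing constant
    sq_dists = np.sum((samples - x) ** 2, axis=1)    # squared distances to x
    return float(np.mean(np.exp(-sq_dists / (2.0 * h**2)) / norm))
```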


Estimation of Probability Densities

• Nonparametric Methods

– Smooth estimate of the probability density function

– The estimate of the class-dependent probability density p(x|ck):

– The estimate of the class-dependent probability density p(x|ck) for the Gaussian kernel:


Estimation of Probability Densities

• Nonparametric Methods

– Design issues

– The selection of a kernel function: Parzen window, Gaussian kernel, etc.

– The selection of a smoothing parameter

– The generalization ability of kernel-based density estimation depends on the training set and on the smoothing parameters


Estimation of Probability Densities

• Nonparametric Methods

– K-nearest Neighbors
"A method of probability density estimation with variable-size regions"

– First, a small n-dimensional sphere centered at the point x is located in the pattern space.
– Second, the radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from the given training set.
– Then the estimate of the probability density at x is computed as p(x) ≈ k / (N V), where V is the volume of the resulting sphere.


Estimation of Probability Densities

• Nonparametric Methods

– K-nearest Neighbors Classification Rule
– First, for a given x, the k nearest neighbors from the training set are found (regardless of class label) based on a defined pattern distance measure.

– Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.

– Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.
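A minimal sketch of this rule with the Euclidean distance measure (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Predict the class of x by majority vote among its k nearest training patterns."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training patterns
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)  # class counts n_i among the neighbors
    return votes.most_common(1)[0][0]             # class with the largest n_j
```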


Estimation of Probability Densities

• Nonparametric Methods
– Nearest Neighbor Classification Rule
"The simplest version of the k-nearest neighbors classification uses a number of neighbors k equal to one"

– Algorithm
Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.

• Compute, for the given x, the nearest neighbor xj from the whole training set based on the defined pattern distance measure distance(x, xi).

• Assign to x the class cj of its nearest neighbor xj.


Estimation of Probability Densities

• Semiparametric Methods

“Combination of parametric and nonparametric methods”

– Two semiparametric methods
– Functional approximation
– Mixture models (mixtures of probability densities)

– Major advantage
The ability to precisely fit component functions locally to specific regions of a feature space, based on what is discovered about the probability distributions and their modalities from the existing data


Estimation of Probability Densities

• Semiparametric Methods

– Functional Approximation

– Approximation of the density by a linear combination of m basis functions φi(x):

– Using a symmetric radial basis function


Estimation of Probability Densities

• Semiparametric Methods

– Functional Approximation
– Gaussian radial basis function: "the most commonly used basis function"

– Optimization criterion for the functional approximation of density

– Optimal estimates for parameters:


Estimation of Probability Densities

• Semiparametric Methods
– The algorithm for functional approximation
Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.

1) Compute the estimates of unknown parameters

2) Form the model of the probability density as a functional approximation


Estimation of Probability Densities

• Semiparametric Methods

– Mixture Models (Mixtures of Probability Densities)
"These models are based on a linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of the data"

– The linear mixture distribution

Simplified version:
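In its standard form the linear mixture combines m component densities p(x|j) with mixing coefficients P(j) that sum to one:

$$p(x) = \sum_{j=1}^{m} P(j)\,p(x \mid j), \qquad \sum_{j=1}^{m} P(j) = 1$$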


Estimation of Probability Densities

• Distance Between Probability Densities and the Kullback-Leibler Distance

– Distance
"We can define a distance between two densities, the true density p(x) and its approximate estimate p̂(x)"

– Kullback-Leibler distance
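In its standard form, with true density p(x) and its estimate p̂(x):

$$D_{KL}(p \,\|\, \hat{p}) = \int p(x)\,\ln\frac{p(x)}{\hat{p}(x)}\,dx$$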


Probabilistic Neural Network

• Probabilistic Neural Network
"The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of classification error)"

– Optimal Bayes’ classification rule

– Kernel-based estimation of a probability density function


Probabilistic Neural Network

• Topology


Probabilistic Neural Network

• Details

– An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1, 2, …, n) of the n-dimensional input pattern vector x.

– A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra.

– The transfer function of a pattern-layer neuron implements a kernel function (a Parzen window)


Probabilistic Neural Network

• Details
– The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.

– The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.

– The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x
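A minimal sketch of this feedforward computation in Python/NumPy: Gaussian kernels in the pattern layer, per-class averages in the summation layer, and a Bayes decision in the output layer. The smoothing parameter sigma and all names are illustrative:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=1.0):
    """PNN forward pass: kernel values -> per-class sums -> largest prior-weighted score."""
    x = np.asarray(x, dtype=float)
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]                             # reference patterns of class c
        prior = len(Xc) / len(X_train)                         # P(c) from class frequencies
        sq = np.sum((Xc - x) ** 2, axis=1)                     # pattern layer: squared distances
        kernel_mean = np.mean(np.exp(-sq / (2.0 * sigma**2)))  # summation layer (common constant omitted)
        scores.append(prior * kernel_mean)
    return classes[int(np.argmax(scores))]                     # output layer: Bayes decision
```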


Probabilistic Neural Network

• Pattern Processing
"Processing of patterns by the already-designed PNN network is performed in a feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class"

• PNN with the Radial Gaussian Kernel
– Kernel function:

– Transfer function:

– Output activation function:


Probabilistic Neural Network

• PNN with the Radial Gaussian Normal Kernel and Normalized Patterns
– Transfer function:

– Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function.

– The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function

– The pattern-neuron output activation function:


Constraints in Classifier Design

• Problems

– Will a classifier guarantee minimization of the average probability of the classification error?

– Does the training set represent well the patterns generated by the physical phenomenon?

– Are patterns drawn according to the probability density characteristic of the underlying phenomenon?

– Is the average probability of a classification error difficult to calculate?


Constraints in Classifier Design

• Suboptimal solutions of Bayesian classifier design

– The estimation of class conditional probabilities is based on a limited sample

– The samples are frequently collected randomly, and not by use of a well-planned experimental procedure


REGRESSION

• Data Models

• Simple Linear Regression Analysis

• Multiple Regression

• General Least Squares and Multiple Regression

• Assessing the Quality of the Multiple Regression Model


Data Models

• Mathematical models
"They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design."

• Black-box models
– Mathematical models obtained by processing existing data without using the laws of physics governing the data-generating phenomena

• Regression analysis
– Data analysis and model design are based on a sample from a given population


Data Models

• Categories of regression models
– Simple linear regression
– Multiple linear regression
– Neural network-based linear regression
– Polynomial regression
– Logistic regression
– Log-linear regression
– Local piecewise linear regression
– Nonlinear regression (with a nonlinear model)
– Neural network-based nonlinear regression


Data Models

• Static and dynamic models

– A static model produces outcomes based only on the current input (no internal memory).

– A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory).


Data Models

• Data gathering

– Random sample from a certain population

– N pairs of the experimental data set named Torig


Data Models

• Regression analysis
"A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables"


Data Models

• Regression analysis
– Simple linear regression
– To find the linear relationship between two variables, x and y, and to discover a linear model, i.e., a line equation y = b + ax, which is the best fit to the given data in order to predict data values

– This modeling line is called the regression line of y on x
– The equation of that line is called a regression equation (regression model)

– Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x


Data Models

• Visualization of Regression

– Scatter plot for height versus weight data


Data Models

• Visualization of Regression

– Scatter plot for height versus weight data


Simple Linear Regression Analysis

• Sample data and regression model


Simple Linear Regression Analysis

• Assumptions
– The observations yi (i = 1, …, N) are random samples and are mutually independent.
– The regression error terms (the difference between the predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variance.
– The distribution of the error term is independent of the joint distribution of the explanatory variables. It is also assumed that the unknown parameters of regression models are constants.


Simple Linear Regression Analysis

• Simple Linear Regression Analysis
– Evaluation of basic statistical characteristics of the data

– An estimation of the optimal parameters of a linear model

– Assessment of model quality and generalization ability to predict the outcome for new data


Simple Linear Regression Analysis

• Model Structure
– Nonlinear data:

– Generally, a function f(x) could be nonlinear in x:

– Linear form :


Simple Linear Regression Analysis

• Regression Error (residual error)

– Difference between real-value yi and predicted-value yi,est


Simple Linear Regression Analysis

• Performance Criterion – Sum of Squared Errors.

– The sum of squared errors performance criterion for multiple regression

– The minimization technique in regression uses the sum of squared errors as its criterion: the method of least squares of errors (LSE) or, in short, the method of least squares


Simple Linear Regression Analysis

• Basic Statistical Characteristics of Data

– The mean of N samples

– The variance

– The covariance


Simple Linear Regression Analysis

• Sum of Squared Variations in y Caused by the Regression Model

– The total sum of squared variations in y

– These formulas are used to define important regression measures (for example, the correlation coefficient)


Simple Linear Regression Analysis

• Computing Optimal Values of the Regression Model Parameters
– The optimal model parameter values have to be computed based on the given data set and the defined performance criterion

– Methods for estimation of optimal model parameter values
– The analytical offline method
– The analytical recursive offline method
– Iterative search for the optimal model parameters
– Neural network-based regression


Simple Linear Regression Analysis

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

– The general linear model structure

– The performance criterion and performance curve for the model y = ax (a model with b = 0)


Simple Linear Regression Analysis

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

– The optimal parameters
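In the usual closed form, the least-squares estimates of the model y = b + ax written in terms of the sample means are:

$$\hat{a} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\,\bar{x}$$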


Simple Linear Regression Analysis

• Procedure for simple linear regression
Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }

1) Compute the statistical characteristics of the data

2) Compute the estimates of the model optimal parameters using Equations

3) Assess the regression model quality, indicating how well the model fits the data. Compute:

a) Standard error of estimate

b) Correlation coefficient r

c) Coefficient of determination r2
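A sketch of the whole procedure in Python/NumPy, including the three quality measures listed above. Names are illustrative, and the standard error uses the common N - 2 denominator of simple regression:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Fit y = b + a*x by least squares and report basic quality measures."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - a * x.mean()
    y_est = b + a * x
    se = np.sqrt(np.sum((y - y_est) ** 2) / (N - 2))   # standard error of estimate
    r = np.corrcoef(x, y)[0, 1]                        # correlation coefficient
    return a, b, se, r, r ** 2                         # r**2 is the coefficient of determination

# Hypothetical usage:
# a, b, se, r, r2 = simple_linear_regression([1, 2, 3, 4], [1.5, 2.0, 2.5, 3.2])
```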


Simple Linear Regression Analysis

Example

– Sample of four data points

– Resulting regression line

y = 0.9 + 0.56x


Simple Linear Regression Analysis

• Optimal Parameter Values in the Minimum Least Squares Sense

– Required conditions for a valid linear regression
– The error term e = y - (b + ax) is normally distributed
– The error variance is the same for all values of x
– Errors are independent of each other.


Simple Linear Regression Analysis

• Quality of the Linear Regression Model and Linear Correlation Analysis

– Assessment of model quality

– The resulting correlation coefficient can be used as a measure of how well the trends in the predicted values follow the trends in the training data

– The coefficient of determination can be used to measure how well the regression line fits the data points


Simple Linear Regression Analysis

• Correlation coefficient

• Coefficient of determination
– The percent of variation in the dependent variable y that can be explained by the regression equation,
– the explained variation in y divided by the total variation, or
– the square of r (the correlation coefficient)


Simple Linear Regression Analysis

• Coefficient of determination

– Explained and unexplained variation in y


Simple Linear Regression Analysis

• Coefficient of determination
– Example
If the coefficient of correlation has the value r = 0.9327, then the value of the coefficient of determination is r² = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data. The remaining 13% of the total variation in y remains unexplained.

– The calculation of coefficient of determination


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method

– The matrix form of the model description (the estimation of y) for all N experimental data points

– The regression error


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method
– The performance criterion:

– Optimal parameters:

– The value of the criterion for the optimal parameter vector:

– The regression error for the model with the optimal parameter vector:
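In matrix notation, with X the design matrix (a column of ones followed by the x values) and θ = [b, a]ᵀ, these quantities take the standard least-squares form:

$$J(\theta) = (y - X\theta)^T(y - X\theta), \qquad \hat{\theta} = (X^T X)^{-1} X^T y, \qquad e = y - X\hat{\theta}$$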


Simple Linear Regression Analysis

• Matrix Version of Simple Linear Regression Based on the Least Squares Method

– Example: let us consider again the dataset from the previous example

y = 0.56x + 0.9


Multiple Regression

• Definition
Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables that are used to explain the variability of one (or generally more) dependent variable y

– Linear multiple regression model

– Linear multiple regression model using vector notation

– This regression model is represented by a hyperplane in (n + 1)-dimensional space.


Multiple Regression

• Geometrical Interpretation: Regression Errors
The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data

– The performance criterion

– The error variance and standard error of the estimate
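With e_i denoting the residuals of the fitted model, the error variance and the standard error of the estimate take the standard form (the denominator is discussed immediately below):

$$s_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N - n - 1}, \qquad s_e = \sqrt{s_e^2}$$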


Multiple Regression

• Degrees of Freedom

– The denominator N - n - 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N - n - 1 degrees of freedom

– The degrees of freedom have been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data


General Least Squares and Multiple Regression

• General model description in function form

– Data model

– Performance criterion

– Regression error


General Least Squares and Multiple Regression

• General model description in matrix form

– Data model

– Performance criterion

– Optimal parameters


General Least Squares and Multiple Regression

• Practical, Numerically Stable Computation of the Optimal Model Parameters

– Problem
"The solution for the optimal least-squares parameters is almost never computed directly from the normal equations, due to their poor numerical performance in cases when the matrix XᵀX (the covariance matrix) is ill conditioned"

– Solution: various matrix decomposition methods (see the sketch below)
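A minimal sketch of the numerically preferable route in NumPy: np.linalg.lstsq solves the least-squares problem through an SVD-based factorization instead of explicitly forming and inverting XᵀX. Names are illustrative:

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of y = a0 + a1*x1 + ... + an*xn without forming (X^T X)^-1."""
    X = np.asarray(X, dtype=float)
    X_design = np.column_stack([np.ones(len(X)), X])       # prepend the intercept column
    theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # SVD-based, numerically stable
    return theta                                           # [a0, a1, ..., an]
```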


Assessing the Quality of the Multiple Regression Model

• The Coefficient of Multiple Determination, R²

"The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together."

– Adjusted R²

Adjusted R² uses the number of design parameters (plus a constant) used in the model, together with the number of data points N, in order to correct this statistic in situations when unnecessary parameters are used in the model structure
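A common form of the adjustment, with n independent variables and N data points:

$$R_{adj}^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1}$$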


Assessing the Quality of the Multiple Regression Model

• Cp Statistic

– It is used to compare multiple regression models

– When comparing alternative regression models, the designer aims to choose models whose values of Cp are close to or below (n + 1)


Assessing the Quality of the Multiple Regression Model

• Multiple Correlation
– The value of R can be found as the positive square root of R² (the coefficient of multiple determination)

– It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.

– A value of R close to 1 indicates that the fit is very good

– A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction


Assessing the Quality of the Multiple Regression Model

Example
"Let us consider a multiple linear regression analysis for a data set containing N = 4 cases, composed of one dependent variable y and two independent variables x1 and x2"

– Three-dimensional data


Assessing the Quality of the Multiple Regression Model

Example
– The scatter plot of the data points in three-dimensional space (x1, x2, y)


Assessing the Quality of the Multiple Regression Model

Example

– The data matrix

– The optimal model parameters


Assessing the Quality of the Multiple Regression Model

Example

– The optimal model: y = 3.1 + 0.9x1 + 0.56x2

– The optimal regression model in (x1, x2, y) space:


Assessing the Quality of the Multiple Regression Model

Example

– Multiple regression: regression plane model and scatter plot


Assessing the Quality of the Multiple Regression Model

Example

– The residuals (errors)

– The criterion value for the optimal parameters: 0.016

