
Page 1: Computational  BioMedical  Informatics

1

Computational BioMedical Informatics

SCE 5095: Special Topics Course

Instructor: Jinbo BiComputer Science and Engineering Dept.

Page 2: Computational  BioMedical  Informatics

2

Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30–4:30pm

HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration

Page 3: Computational  BioMedical  Informatics

3

Review of last chapter

General introduction to the topics in medical informatics, and the data mining techniques involved

Review some basics of probability and statistics; more slides on probability and linear algebra have been uploaded to HuskyCT

In this class, we start to discuss supervised learning: classification and regression

Page 4: Computational  BioMedical  Informatics

4

Regression and classification

Both regression and classification problems are typically supervised learning problems

The main property of supervised learning
– Each training example contains the input variables and the corresponding target label
– The goal is to find a good mapping from the input variables to the target variable

Page 5: Computational  BioMedical  Informatics

5

Classification: Definition

Given a collection of examples (training set)
– Each example contains a set of variables (features) and the target variable class

Find a model for the class attribute as a function of the values of the other variables

Goal: previously unseen examples should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

Page 6: Computational  BioMedical  Informatics

6

Classification Application 1

Fraud detection – goal: predict fraudulent cases in credit card transactions.

Training set (past transaction records, already labeled). Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

Test set (current data; we want to use the model to predict the class):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.

Page 7: Computational  BioMedical  Informatics

7

Classification: Application 2

Handwritten Digit Recognition
Goal: identify the digit of a handwritten number
– Approach:
  Align all images to derive the features
  Model the class (identity) based on these features

Page 8: Computational  BioMedical  Informatics

8

Illustrating Classification Task

Training Set → Learning algorithm (Induction) → Learn Model → Model; Model (Deduction) → Apply Model → Test Set

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
 1   Yes      Large    125K     No
 2   No       Medium   100K     No
 3   No       Small    70K      No
 4   Yes      Medium   120K     No
 5   No       Large    95K      Yes
 6   No       Medium   60K      No
 7   Yes      Large    220K     No
 8   No       Small    85K      Yes
 9   No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Page 9: Computational  BioMedical  Informatics

9

Classification algorithms

K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Tree
Logistic Regression
Graphical models

Page 10: Computational  BioMedical  Informatics

10

Regression: Definition

Goal: predict the value of one or more continuous target attributes given the values of the input attributes

The difference between classification and regression lies only in the target attribute
– Classification: discrete or categorical target
– Regression: continuous target

Extensively studied in statistics and in the neural network field.

Page 11: Computational  BioMedical  Informatics

11

Regression application 1

Goal: predict the possible loss from a customer.

Training set (past transaction records, already labeled). Refund and Marital Status are categorical, Taxable Income is continuous, and Loss is the continuous target:

Tid  Refund  Marital Status  Taxable Income  Loss
 1   Yes     Single          125K            100
 2   No      Married         100K            120
 3   No      Single          70K             -200
 4   Yes     Married         120K            -300
 5   No      Divorced        95K             -400
 6   No      Married         60K             -500
 7   Yes     Divorced        220K            -190
 8   No      Single          85K             300
 9   No      Married         75K             -240
10   No      Single          90K             90

Test set (current data; we want to use the model to predict the loss):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Regressor → Model; the Model is then applied to the Test Set.

Page 12: Computational  BioMedical  Informatics

12

Regression applications

Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices

Page 13: Computational  BioMedical  Informatics

13

Regression algorithms

Least squares methods
Regularized linear regression (ridge regression)
Neural networks
Support vector machines (SVM)
Bayesian linear regression

Page 14: Computational  BioMedical  Informatics

14

Practical issues in the training

Underfitting

Overfitting

Before introducing these important concepts, let us study a simple regression algorithm – linear regression

Page 15: Computational  BioMedical  Informatics

15

Least squares

We wish to use some real-valued input variables x to predict the value of a target y

We collect training data of pairs (x_i, y_i), i = 1, …, N

Suppose we have a model f that maps each example x to a predicted value y'

Sum-of-squares function:
– the sum of the squares of the deviations between the observed target values y and the predicted values y'

Σ_{i=1}^{N} (y_i - y'_i)² = Σ_{i=1}^{N} (y_i - f(x_i))²

Page 16: Computational  BioMedical  Informatics

16

Least squares

Find a function f such that the sum of squares is minimized:

min_f Σ_{i=1}^{N} (y_i - f(x_i))²

For example, suppose the function is a linear function f(x) = w^T x:

min_w Σ_{i=1}^{N} (y_i - w^T x_i)²

Least squares with a linear function of the parameters w is called "linear regression"

Page 17: Computational  BioMedical  Informatics

17

Linear regression

Linear regression has a closed-form solution for w

The minimum is attained at the zero derivative:

min_w E(w) = Σ_{i=1}^{N} (y_i - w^T x_i)² = (y - Xw)^T (y - Xw)

∂E(w)/∂w = -2 X^T (y - Xw) = 0

w = (X^T X)^{-1} X^T y
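A minimal Matlab sketch of this closed-form solution on assumed synthetic data (the variable names X, y, w follow the formulas above):

N = 50; d = 3;
X = randn(N, d);                      % synthetic inputs
w_true = [1; -2; 0.5];
y = X * w_true + 0.1 * randn(N, 1);   % targets with noise

w = (X' * X) \ (X' * y);              % w = (X^T X)^{-1} X^T y, solved stably with backslash
y_hat = X * w;                        % fitted values
sse = sum((y - y_hat).^2);            % sum-of-squares error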

Page 18: Computational  BioMedical  Informatics

18

Polynomial Curve Fitting

x is evenly distributed on [0,1]
y = f(x) + random error
y = sin(2πx) + ε,  ε ~ N(0, σ)
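A minimal Matlab sketch of this setup, assuming N = 10 noisy samples and fitting the polynomial orders shown on the following slides:

N = 10;
x = linspace(0, 1, N)';                 % evenly spaced inputs on [0,1]
y = sin(2*pi*x) + 0.3 * randn(N, 1);    % targets with Gaussian noise

for M = [0 1 3 9]                        % polynomial orders used on the slides
    p = polyfit(x, y, M);                % least-squares polynomial coefficients
    y_fit = polyval(p, x);
    rms = sqrt(mean((y - y_fit).^2));    % root-mean-square error on training data
    fprintf('order %d: training RMS error %.3f\n', M, rms);
end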

Page 19: Computational  BioMedical  Informatics

19

Polynomial Curve Fitting

Page 20: Computational  BioMedical  Informatics

20

Sum-of-Squares Error Function

Page 21: Computational  BioMedical  Informatics

21

0th Order Polynomial

Page 22: Computational  BioMedical  Informatics

22

1st Order Polynomial

Page 23: Computational  BioMedical  Informatics

23

3rd Order Polynomial

Page 24: Computational  BioMedical  Informatics

24

9th Order Polynomial

Page 25: Computational  BioMedical  Informatics

25

Over-fitting

Root-Mean-Square (RMS) Error:  E_RMS = sqrt( (1/N) Σ_{i=1}^{N} (y_i - f(x_i))² )

Page 26: Computational  BioMedical  Informatics

26

Polynomial Coefficients

Page 27: Computational  BioMedical  Informatics

27

Data Set Size:

9th Order Polynomial

Page 28: Computational  BioMedical  Informatics

28

Data Set Size:

9th Order Polynomial

Page 29: Computational  BioMedical  Informatics

29

Regularization

Penalize large coefficient values

Ridge regression:

min_w Σ_{i=1}^{N} (y_i - w^T x_i)² + λ ||w||²
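A minimal Matlab sketch of the ridge closed form, w = (X'X + λI)^{-1} X'y, on assumed synthetic data:

N = 20; d = 5; lambda = 0.1;
X = randn(N, d);
y = X(:,1) - 2*X(:,2) + 0.1*randn(N, 1);

w_ridge = (X' * X + lambda * eye(d)) \ (X' * y);   % regularized solution (shrinks coefficients)
w_ls    = (X' * X) \ (X' * y);                     % ordinary least squares, for comparison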

Page 30: Computational  BioMedical  Informatics

30

Regularization:

Page 31: Computational  BioMedical  Informatics

31

Regularization:

Page 32: Computational  BioMedical  Informatics

32

Regularization: vs.

Page 33: Computational  BioMedical  Informatics

33

Polynomial Coefficients

Page 34: Computational  BioMedical  Informatics

34

Classification

Underfitting or overfitting can also happen in classification approaches

We will illustrate these practical issues on a classification problem

Before the illustration, we introduce a simple classification technique – the K-nearest neighbor method

Page 35: Computational  BioMedical  Informatics

35

K-nearest neighbor (K-NN)

K-NN is one of the simplest machine learning algorithms

K-NN is a method for classifying test examples based on closest training examples in the feature space

An example is classified by a majority vote of its neighbors

k is a positive integer, typically small. If k = 1, then the example is simply assigned to the class of its nearest neighbor.
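A minimal Matlab sketch of this rule on a tiny assumed data set (Euclidean distance, majority vote):

Xtrain = [0.8 1.3; 2.0 1.1; 1.9 2.3; 0.5 2.4; 0.4 0.6];   % training examples (rows), assumed values
ytrain = [1; 2; 2; 1; 1];                                   % their class labels
xtest  = [1.0 1.5];                                         % one test example
k = 3;

d2 = sum(bsxfun(@minus, Xtrain, xtest).^2, 2);   % squared Euclidean distances to all training points
[~, idx] = sort(d2);                              % nearest first
yhat = mode(ytrain(idx(1:k)));                    % majority vote among the k nearest neighbors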

Page 36: Computational  BioMedical  Informatics

36

K-NN

(Figures: K-NN decisions for K = 1 and K = 3.)

Page 37: Computational  BioMedical  Informatics

37

K-NN on real problem data

• Oil data set
• K acts as a smoother; choosing K is model selection
• For N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

Page 38: Computational  BioMedical  Informatics

38

Limitation of K-NN

K-NN is a nonparametric model (no particular functional form is fitted)

Nonparametric models require storing and computing with the entire data set.

Parametric models, once fitted, are much more efficient in terms of storage and computation.

Page 39: Computational  BioMedical  Informatics

39

Probabilistic interpretation of K-NN

Given a data set with N_k data points from class C_k, so that Σ_k N_k = N, draw a sphere of volume V around the test point x that contains exactly K points, K_k of them from class C_k. We have

p(x | C_k) = K_k / (N_k V)   and correspondingly   p(x) = K / (N V)

Since p(C_k) = N_k / N, Bayes' theorem gives

p(C_k | x) = p(x | C_k) p(C_k) / p(x) = K_k / K

Page 40: Computational  BioMedical  Informatics

40

Underfit and Overfit (Classification)

500 circular and 500 triangular data points.

Circular points:   0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points: sqrt(x1² + x2²) > 1  or  sqrt(x1² + x2²) < 0.5

Page 41: Computational  BioMedical  Informatics

41

Underfit and Overfit (Classification)

500 circular and 500 triangular data points.

Circular points:   0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points: sqrt(x1² + x2²) > 1  or  sqrt(x1² + x2²) < 0.5

Page 42: Computational  BioMedical  Informatics

42

Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large

(Figure: training and test error vs. number of iterations, with the overfitting region marked.)

Page 43: Computational  BioMedical  Informatics

43

Overfitting due to Noise

Decision boundary is distorted by noise point

Page 44: Computational  BioMedical  Informatics

44

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region
– An insufficient number of training records in the region causes the neural nets to predict the test examples using other training records that are irrelevant to the classification task

Page 45: Computational  BioMedical  Informatics

45

Notes on Overfitting

Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary

Training error no longer provides a good estimate of how well the classifier will perform on previously unseen records

Need new ways for estimating errors

Page 46: Computational  BioMedical  Informatics

46

Occam’s Razor

Given two models of similar generalization errors, one should prefer the simpler model over the more complex model

For complex models, there is a greater chance that they were fitted accidentally to errors in the data

Therefore, one should include model complexity when evaluating a model

Page 47: Computational  BioMedical  Informatics

47

How to Address Overfitting

Minimizing training error no longer guarantees a good model (a classifier or a regressor)

Need a better estimate of the error on the true population – the generalization error  P_population( f(x) ≠ y )

In practice, design a procedure that gives a better estimate of the error than the training error

In theoretical analysis, find an analytical bound on the generalization error, or use a Bayesian formulation

Page 48: Computational  BioMedical  Informatics

48

Model Evaluation (pp. 295–304 of the data mining textbook)

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 49: Computational  BioMedical  Informatics

49

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 50: Computational  BioMedical  Informatics

50

Metrics for Performance Evaluation

Regression
– Sum of squares
– Sum of deviation
– Exponential function of the deviation

Page 51: Computational  BioMedical  Informatics

51

Metrics for Performance Evaluation

Focus on the predictive capability of a model
– Rather than how fast it classifies or builds models, scalability, etc.

Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Page 52: Computational  BioMedical  Informatics

52

Metrics for Performance Evaluation…

Most widely-used metric:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Page 53: Computational  BioMedical  Informatics

53

Limitation of Accuracy

Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example

Page 54: Computational  BioMedical  Informatics

54

Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i

Page 55: Computational  BioMedical  Informatics

55

Computing Cost of Classification

Cost Matrix:
                    PREDICTED CLASS
C(i|j)              +      -
ACTUAL    +         -1     100
CLASS     -          1     0

Model M1:
                    PREDICTED CLASS
                    +      -
ACTUAL    +         150    40
CLASS     -         60     250
Accuracy = 80%, Cost = 3910

Model M2:
                    PREDICTED CLASS
                    +      -
ACTUAL    +         250    45
CLASS     -         5      200
Accuracy = 90%, Cost = 4255
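A minimal Matlab sketch that reproduces the accuracy and cost numbers above from the confusion counts and the cost matrix:

C  = [-1 100; 1 0];          % cost matrix: rows = actual (+,-), cols = predicted (+,-)
M1 = [150 40; 60 250];       % confusion counts for model M1
M2 = [250 45; 5 200];        % confusion counts for model M2

acc  = @(M) (M(1,1) + M(2,2)) / sum(M(:));    % accuracy = (a+d)/N
cost = @(M) sum(sum(M .* C));                  % total cost = sum of count * cost

fprintf('M1: accuracy %g%%, cost %g\n', 100*acc(M1), cost(M1));  % 80%, 3910
fprintf('M2: accuracy %g%%, cost %g\n', 100*acc(M2), cost(M2));  % 90%, 4255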

Page 56: Computational  BioMedical  Informatics

56

Cost vs Accuracy

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

Cost:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    p           q
CLASS     Class=No     q           p

N = a + b + c + d

Accuracy = (a + d) / N

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [ q – (q – p) Accuracy ]

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Page 57: Computational  BioMedical  Informatics

57

Cost-Sensitive Measures

Precision (p) = a / (a + c)
Recall (r)    = a / (a + b)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

A model that declares every record to be the positive class has b = d = 0, so recall is high
A model that assigns the positive class only to the (sure) test records has small c, so precision is high

Page 58: Computational  BioMedical  Informatics

58

Cost-Sensitive Measures (Cont’d)

Precision (p) = a / (a + c)
Recall (r)    = a / (a + b)

F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)

F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d
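A minimal Matlab sketch computing these measures from assumed confusion counts a, b, c, d:

a = 150; b = 40; c = 60; d = 250;    % example counts (assumed values)

precision = a / (a + c);
recall    = a / (a + b);
F = 2 * recall * precision / (recall + precision);   % = 2a / (2a + b + c)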

Page 59: Computational  BioMedical  Informatics

59

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 60: Computational  BioMedical  Informatics

60

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Page 61: Computational  BioMedical  Informatics

61

Learning Curve

A learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley, et al.)
– Geometric sampling (Provost et al.)

Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

Page 62: Computational  BioMedical  Informatics

62

Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing

Random subsampling
– Repeated holdout

Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k = n

Stratified sampling
– oversampling vs undersampling

Bootstrap
– Sampling with replacement

Page 63: Computational  BioMedical  Informatics

63

Methods of Estimation (Cont'd)

Holdout method
– The given data is randomly partitioned into two independent sets:
  Training set (e.g., 2/3) for model construction
  Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
  Repeat holdout k times; accuracy = avg. of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
– At the i-th iteration, use D_i as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small-sized data
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
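A minimal Matlab sketch of k-fold cross-validation; it assumes data X (N-by-d) and labels y exist, and trainAndPredict is a hypothetical placeholder for any classifier's train-then-predict step:

N = size(X, 1);
k = 10;
fold = mod(randperm(N), k) + 1;   % random assignment of each example to one of k folds
acc = zeros(k, 1);

for i = 1:k
    test  = (fold == i);
    train = ~test;
    yhat = trainAndPredict(X(train,:), y(train), X(test,:));  % hypothetical helper
    acc(i) = mean(yhat == y(test));
end
cvAccuracy = mean(acc);            % cross-validated accuracy estimate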

Page 64: Computational  BioMedical  Informatics

64

Methods of Estimation (Cont'd)

Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set

There are several bootstrap methods; a common one is the .632 bootstrap
– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 – 1/d)^d ≈ e^(-1) = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is

acc(M) = Σ_{i=1}^{k} [ 0.632 · acc(M_i)_test_set + 0.368 · acc(M_i)_train_set ]
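A minimal Matlab sketch of one .632 bootstrap round, reusing the hypothetical trainAndPredict helper from the cross-validation sketch above:

N = size(X, 1);
idx = randi(N, N, 1);                    % sample N indices with replacement
inBag  = unique(idx);                    % ~63.2% of distinct examples on average
outBag = setdiff((1:N)', inBag);         % held-out examples form the test set

yhatTest  = trainAndPredict(X(idx,:), y(idx), X(outBag,:));   % hypothetical helper
yhatTrain = trainAndPredict(X(idx,:), y(idx), X(idx,:));
accRound = 0.632 * mean(yhatTest == y(outBag)) + 0.368 * mean(yhatTrain == y(idx));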

Page 65: Computational  BioMedical  Informatics

65

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 66: Computational  BioMedical  Informatics

66

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms

The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)

The performance of each classifier is represented as a point on the ROC curve

If the classifier returns a real-valued prediction,
– changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

Page 67: Computational  BioMedical  Informatics

67

ROC Curve

At threshold t:  TP = 50, FN = 50, FP = 12, TN = 88

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

Page 68: Computational  BioMedical  Informatics

68

ROC Curve

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

(TPR, FPR):
(0,0): declare everything to be the negative class – TP = 0, FP = 0
(1,1): declare everything to be the positive class – FN = 0, TN = 0
(1,0): ideal – FN = 0, FP = 0

Page 69: Computational  BioMedical  Informatics

69

ROC Curve

(TPR, FPR):
(0,0): declare everything to be the negative class
(1,1): declare everything to be the positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below the diagonal line: the prediction is opposite of the true class

Page 70: Computational  BioMedical  Informatics

70

How to Construct an ROC curve

Instance   P(+|A)   True Class
 1         0.95     +
 2         0.93     +
 3         0.87     -
 4         0.85     -
 5         0.85     -
 6         0.85     +
 7         0.76     -
 8         0.53     +
 9         0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)

Page 71: Computational  BioMedical  Informatics

71

How to Construct an ROC curve

(Same instance table as on the previous slide.)

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Pick a threshold, e.g. 0.85
• p >= 0.85: predicted to be P
• p < 0.85: predicted to be N
• TP = 3, FP = 3, TN = 2, FN = 2
• TP rate, TPR = 3/5 = 60%
• FP rate, FPR = 3/5 = 60%

Page 72: Computational  BioMedical  Informatics

72

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot TPR (y-axis) against FPR (x-axis) at each threshold.
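A minimal Matlab sketch that builds the same ROC curve from the ten scores and labels above:

scores = [0.95 0.93 0.87 0.85 0.85 0.85 0.76 0.53 0.43 0.25]';
labels = [ 1    1   -1   -1   -1    1   -1    1   -1    1 ]';   % + = 1, - = -1
P = sum(labels == 1);  N = sum(labels == -1);

thresholds = [unique(scores); 1.00];          % apply a threshold at each unique score
TPR = zeros(size(thresholds));  FPR = TPR;
for t = 1:numel(thresholds)
    pred = scores >= thresholds(t);           % predicted positive at this threshold
    TPR(t) = sum(pred & labels == 1) / P;     % TP / (TP + FN)
    FPR(t) = sum(pred & labels == -1) / N;    % FP / (FP + TN)
end
plot(FPR, TPR, '-o'); xlabel('FPR'); ylabel('TPR');   % the ROC curve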

Page 73: Computational  BioMedical  Informatics

73

Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5

Page 74: Computational  BioMedical  Informatics

74

Revisit K-Nearest Neighbor

K-NN:
– Instance-based algorithm: uses the k "closest" points (nearest neighbors) to perform classification
– k-NN classifiers are lazy learners (they do not build models explicitly)
– Classifying unknown examples is relatively more expensive than for model-learning algorithms (or parametric approaches)

Page 75: Computational  BioMedical  Informatics

75

Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck and quacks like a duck, then it's probably a duck

(Figure: compute the distance from the test record to the training records, then choose k of the "nearest" records.)

Page 76: Computational  BioMedical  Informatics

76

Nearest-Neighbor Classifiers

Requires three things
– The set of stored examples
– A distance metric to compute the distance between examples
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Page 77: Computational  BioMedical  Informatics

77

Definition of Nearest Neighbor

(Figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Page 78: Computational  BioMedical  Informatics

78

1 nearest-neighbor

Voronoi Diagram

Page 79: Computational  BioMedical  Informatics

79

Nearest Neighbor Classification

Compute the distance between two points:
– Euclidean distance:  d(p, q) = sqrt( Σ_i (p_i – q_i)² )

Determine the class from the nearest neighbor list
– Take the majority vote of class labels among the k nearest neighbors
– Or weigh each vote according to distance, with weight factor w = 1/d²

Page 80: Computational  BioMedical  Informatics

80

Nearest Neighbor Classification…

Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

Page 81: Computational  BioMedical  Informatics

81

Nearest Neighbor Classification…

Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
  height of a person may vary from 1.5m to 1.8m
  weight of a person may vary from 90lb to 300lb
  income of a person may vary from $10K to $1M

Page 82: Computational  BioMedical  Informatics

82

Nearest Neighbor Classification…

Problem with the Euclidean measure:
– High-dimensional data: curse of dimensionality; a solution is to do dimension reduction first
– Can produce counter-intuitive results, e.g.

  1 1 1 1 1 1 1 1 1 1 1 0          1 0 0 0 0 0 0 0 0 0 0 0
          vs                               vs
  0 1 1 1 1 1 1 1 1 1 1 1          0 0 0 0 0 0 0 0 0 0 0 1

  d = 1.4142                        d = 1.4142

Solution: Normalize the data

Page 83: Computational  BioMedical  Informatics

83

Data normalization

Example-wise normalization
– Each example is normalized and mapped to the unit sphere

Feature-wise normalization
– [0,1]-normalization: normalize each feature into the unit interval
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1
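A minimal Matlab sketch of the three normalizations on an assumed toy data matrix (rows are examples, columns are features):

X = [1.5 250 50000; 1.8 90 1000000; 1.6 150 10000];   % assumed toy data

% Example-wise: map each row onto the unit sphere
Xrow = X ./ repmat(sqrt(sum(X.^2, 2)), 1, size(X, 2));

% Feature-wise [0,1]-normalization
mn = min(X, [], 1);  mx = max(X, [], 1);
X01 = (X - repmat(mn, size(X,1), 1)) ./ repmat(mx - mn, size(X,1), 1);

% Feature-wise standard normalization: mean 0, standard deviation 1
Xstd = (X - repmat(mean(X,1), size(X,1), 1)) ./ repmat(std(X,0,1), size(X,1), 1);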

Page 84: Computational  BioMedical  Informatics

84

Classification

Training data is given
– Each object is associated with a class label Y ∈ {1, 2, …, K} and a feature vector of d measurements: X = (X1, …, Xd)

Build a model from the training data

Unseen objects are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}

Linear Discriminant Analysis / Fisher's linear discriminant

Page 85: Computational  BioMedical  Informatics

85

Two classes

(Figure: two classes plotted against Variable 1 and Variable 2, with the best projection axis.)

Page 86: Computational  BioMedical  Informatics

86

Three classes

Page 87: Computational  BioMedical  Informatics

87

Classifiers

Classifiers are built from a training set (TS)  L = (X_1, Y_1), ..., (X_n, Y_n)

Classifier C built from a learning set L:   C: X → {1, 2, ..., K}

The Bayes classifier is based on the conditional densities p(C_k | X):   C(X) = arg max_k p(C_k | X)

This is a maximum a posteriori rule, and p(C_k | X) is a posterior density

Page 88: Computational  BioMedical  Informatics

88

The Rules of Probability

Sum Rule:      p(X) = Σ_Y p(X, Y)

Product Rule:  p(X, Y) = p(Y | X) p(X)

Bayes' Rule:   p(Y | X) = p(X | Y) p(Y) / p(X)

posterior ∝ likelihood × prior:   p(Y = C | X = data) ∝ p(X | Y) p(Y)

p(X) is irrelevant to Y = C

Page 89: Computational  BioMedical  Informatics

89

Maximum a posteriori

p(C_k | X) = p(X | C_k) p(C_k) / p(X)

Find a class label C(X) so that   max_k p(C_k | X) = max_k p(X | C_k) p(C_k)

Naïve Bayes assumes independence among all features (last class):
– p(X | C_k) = p(x_1 | C_k) p(x_2 | C_k) ... p(x_d | C_k)
– A very strong assumption

Page 90: Computational  BioMedical  Informatics

90

Multivariate normal distribution for each class

Assume multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k):

p(X | C_k) = 1 / ( (2π)^(d/2) det(Σ_k)^(1/2) ) · exp( -1/2 (X - μ_k)^T Σ_k^{-1} (X - μ_k) )

Maximizing the posterior is equivalent to maximizing p(X | C_k) p(C_k), which is equivalent to maximizing the logarithm of p(X | C_k) p(C_k):

log [ p(X | C_k) p(C_k) ] = -1/2 (X - μ_k)^T Σ_k^{-1} (X - μ_k) - 1/2 log( (2π)^d det(Σ_k) ) + log p(C_k)

C(X) = arg min_k { (X - μ_k)^T Σ_k^{-1} (X - μ_k) + log|Σ_k| - 2 log p(C_k) }

Page 91: Computational  BioMedical  Informatics

91

Two-class case

If  p(X | C1) p(C1) ≥ p(X | C2) p(C2),  then C(X) = C1;  otherwise C(X) = C2

Equivalently,  p(X | C1) / p(X | C2) ≥ p(C2) / p(C1),  or

log [ p(X | C1) / p(X | C2) ] ≥ log [ p(C2) / p(C1) ]

With the Gaussian class densities above, this becomes

(X - μ_2)^T Σ_2^{-1} (X - μ_2) - (X - μ_1)^T Σ_1^{-1} (X - μ_1) + log|Σ_2| - log|Σ_1| ≥ 2 log [ p(C2) / p(C1) ]

Page 92: Computational  BioMedical  Informatics

92

Gaussian discriminant rule

For multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k), the classification rule is

C(X) = arg min_k { (X - μ_k)^T Σ_k^{-1} (X - μ_k) + log|Σ_k| }

In general, this is a quadratic rule (Quadratic Discriminant Analysis, or QDA); for two classes it compares

(X - μ_2)^T Σ_2^{-1} (X - μ_2) - (X - μ_1)^T Σ_1^{-1} (X - μ_1) + log|Σ_2| - log|Σ_1|

with a threshold.

In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities

Page 93: Computational  BioMedical  Informatics

93

Sample mean and variance

Class mean:        μ_i = (1/|C_i|) Σ_{x ∈ C_i} x

Class covariance:  Σ_i = (1/|C_i|) Σ_{x ∈ C_i} (x - μ_i)(x - μ_i)^T

Page 94: Computational  BioMedical  Informatics

94

Example

X_1 = (1, 0, 1)^T,   X_2 = (2, 1, 1)^T,   X_3 = (0, 2, 1)^T

Mean:   μ = (1/3)(X_1 + X_2 + X_3) = (1, 1, 1)^T

Covariance:

Σ = (1/3) [ (X_1 - μ)(X_1 - μ)^T + (X_2 - μ)(X_2 - μ)^T + (X_3 - μ)(X_3 - μ)^T ]

  = (1/3) ( [0 0 0; 0 1 0; 0 0 0] + [1 0 0; 0 0 0; 0 0 0] + [1 -1 0; -1 1 0; 0 0 0] )

  = (1/3) [2 -1 0; -1 2 0; 0 0 0]

Page 95: Computational  BioMedical  Informatics

95

Two-class case

If the two classes have the same covariance matrix, Σ_1 = Σ_2 = Σ, the discriminant rule is linear (Linear Discriminant Analysis, or LDA; FLDA for K = 2):

The quadratic rule

(X - μ_2)^T Σ_2^{-1} (X - μ_2) - (X - μ_1)^T Σ_1^{-1} (X - μ_1) + log|Σ_2| - log|Σ_1|

becomes the linear rule

(μ_1 - μ_2)^T Σ^{-1} X ≥ c,   i.e.,   w^T X ≥ c   where   w = Σ^{-1} (μ_1 - μ_2)

Usually Σ is estimated by the pooled covariance   Σ = (n_1 Σ_1 + n_2 Σ_2) / n
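A minimal Matlab sketch of this two-class LDA rule on assumed synthetic data; the threshold c used here (midpoint between the projected means, i.e. equal priors) is an assumption, not taken from the slide:

X1 = randn(30, 2) + 2;      % class 1 samples (rows), assumed synthetic data
X2 = randn(40, 2) - 1;      % class 2 samples
n1 = size(X1, 1);  n2 = size(X2, 1);  n = n1 + n2;

mu1 = mean(X1, 1)';  mu2 = mean(X2, 1)';
S1 = cov(X1, 1);  S2 = cov(X2, 1);          % class covariances (normalized by n_i)
Sigma = (n1 * S1 + n2 * S2) / n;             % pooled covariance

w = Sigma \ (mu1 - mu2);                     % LDA direction w = Sigma^{-1}(mu1 - mu2)
c = 0.5 * w' * (mu1 + mu2);                  % threshold assuming equal class priors
x = [0.5; 0.3];                              % a test point
label = 1 + (w' * x < c);                    % class 1 if w'x >= c, else class 2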

Page 96: Computational  BioMedical  Informatics

96

Illustration

(Figure: the two classes with means μ1 and μ2.)

Page 97: Computational  BioMedical  Informatics

97

Two-class case

Maximize the signal-to-noise ratio

max_w  ( w^T Σ_between w ) / ( w^T Σ_within w )

where

Σ_between = (μ_1 - μ_2)(μ_1 - μ_2)^T      (between-class separation)
Σ_within  = (n_1 Σ_1 + n_2 Σ_2) / n        (within-class cohesion)

The solution is   w ∝ Σ_within^{-1} (μ_1 - μ_2)

Page 98: Computational  BioMedical  Informatics

98

Two-class case (illustration)

LDA gives the yellow direction

Two classes overlap

Two classes are separated

Page 99: Computational  BioMedical  Informatics

99

Two-class case (illustration)

(Figure: the two classes projected onto the LDA axis along μ_2 - μ_1, with the best threshold on that axis.)

Page 100: Computational  BioMedical  Informatics

100

Multi-class case

Two approaches
– Apply Fisher LDA to each "one-versus-rest" class

Page 101: Computational  BioMedical  Informatics

101

Multi-class case

Second approach: similarly, find multiple directions that together form a low-dimensional space

Between-class matrix:   S_b = (1/n) Σ_{k=1}^{K} n_k (μ_k - μ)(μ_k - μ)^T

Within-class matrix:    S_w = (1/n) Σ_{i=1}^{K} Σ_{x ∈ C_i} (x - μ_i)(x - μ_i)^T

The transformation matrix W that projects the data so that it is most separable is the matrix that maximizes

max_W  (W^T S_b W) / (W^T S_w W)

The correct way to write this is

max_W  trace( (W^T S_w W)^{-1} (W^T S_b W) )

Page 102: Computational  BioMedical  Informatics

102

Intuition

The goal is to simultaneously maximize the between-class separation and minimize the within-class cohesion

The solution to  max_W trace( (W^T S_w W)^{-1} (W^T S_b W) )  is a generalized eigenvalue problem

S_b g = λ S_w g

The generalized eigenvectors are the eigenvectors obtained by solving  S_w^{-1} S_b g = λ g

Page 103: Computational  BioMedical  Informatics

103

Graphic view of the transformation (projection)

A ∈ R^{n×d}:  training data matrix (n examples, d features)

W ∈ R^{d×(K-1)}:  transformation matrix

L_A = A W ∈ R^{n×(K-1)}:  reduced training data

Page 104: Computational  BioMedical  Informatics

104

Graphical view of classification

A ∈ R^{n×d}: training data;   G ∈ R^{d×(K-1)}: transformation matrix;   L_A = A G ∈ R^{n×(K-1)}: reduced training data

A test data point h ∈ R^{1×d} is projected to  L_h = h G ∈ R^{1×(K-1)}

Then find the nearest neighbor or nearest centroid in the reduced space

Page 105: Computational  BioMedical  Informatics

105

Summary

First applied by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA):

Dimension reduction
– Finds linear combinations of the features X = X1, ..., Xd with large ratios of between-groups to within-groups sums of squares – the discriminant variables

Classification
– Predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables

Page 106: Computational  BioMedical  Informatics

106

We just introduced Fisher discriminant analysis, particularly linear discriminant analysis

Now let us discuss Support Vector Machine

Page 107: Computational  BioMedical  Informatics

107

History of SVM

SVM is inspired by statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition [2]
SVM is now regarded as an important example of "kernel methods", arguably the hottest area in machine learning. http://www.kernel-machines.org/

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 1st edition, Springer, 1996.

Page 108: Computational  BioMedical  Informatics

108

Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data

Page 109: Computational  BioMedical  Informatics

109

Support Vector Machines

One Possible Solution

B1

Page 110: Computational  BioMedical  Informatics

110

Support Vector Machines

Another possible solution

B2

Page 111: Computational  BioMedical  Informatics

111

Support Vector Machines

Other possible solutions

B2

Page 112: Computational  BioMedical  Informatics

112

Support Vector Machines

Which one is better? B1 or B2? How do you define better?

B1

B2

Page 113: Computational  BioMedical  Informatics

113

Support Vector Machines

Find the hyperplane that maximizes the margin => B1 is better than B2

B1

B2

b11

b12

b21b22

margin

Page 114: Computational  BioMedical  Informatics

114

Support Vector Machines

B1: decision boundary   w · x + b = 0
b11, b12: margin boundaries   w · x + b = 1   and   w · x + b = -1

f(x) = +1  if  w · x + b ≥ 1
f(x) = -1  if  w · x + b ≤ -1

Margin = 2 / ||w||

Page 115: Computational  BioMedical  Informatics

115

Support Vector Machines

What if the problem is not linearly separable?

Page 116: Computational  BioMedical  Informatics

116

Nonlinear Support Vector Machines

What if decision boundary is not linear?

Page 117: Computational  BioMedical  Informatics

117

Nonlinear Support Vector Machines

Transform data into higher dimensional space

Page 118: Computational  BioMedical  Informatics

118

Outline of SVM lecture

Linear classifier

Maximum margin classifier– Estimate the margin

SVM for separable data

SVM for non-separable data

Page 119: Computational  BioMedical  Informatics

119

Linear classifiers

f(x, w, b) = sign(w · x + b)

(Figure: data points labeled +1 and -1.)

How would you classify this data?

Page 120: Computational  BioMedical  Informatics

120

Linear classifiers

f(x, w, b) = sign(w · x + b)

(Figure: data points labeled +1 and -1.)

How would you classify this data?

Page 121: Computational  BioMedical  Informatics

121

Classifier Margin

f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Page 122: Computational  BioMedical  Informatics

122

Maximum Margin

f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a linear SVM, or LSVM).

Page 123: Computational  BioMedical  Informatics

123

Maximum Margin

f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a linear SVM, or LSVM).

Support vectors are those data points that the margin pushes up against.

Page 124: Computational  BioMedical  Informatics

124

Why Maximum Margin?

f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the simplest kind of SVM (an LSVM); support vectors are the data points that the margin pushes up against.

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. The model is immune to removal of any non-support-vector data points.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

Page 125: Computational  BioMedical  Informatics

125

Estimate the Margin

What is the distance expression for a point x to the line w·x + b = 0?

d(x) = |w · x + b| / ||w||₂ = |w · x + b| / sqrt( Σ_{i=1}^{d} w_i² )

Page 126: Computational  BioMedical  Informatics

126

Estimate the Margin

For a point y and the hyperplane w·x + b = 0, take any point x on the hyperplane, so that w·x + b = 0. The distance from y to the hyperplane is the length of the projection of (y - x) onto the normal direction w:

distance = |w · (y - x)| / ||w||₂ = |w · y + b| / ||w||₂ = |w · y + b| / sqrt( Σ_{i=1}^{d} w_i² )

Page 127: Computational  BioMedical  Informatics

127

Estimate the Margin

What is the expression for the margin?

margin = min_{x ∈ D} d(x) = min_{x ∈ D} |w · x + b| / sqrt( Σ_{i=1}^{d} w_i² )

Page 128: Computational  BioMedical  Informatics

128

Maximize Margin

argmax_{w,b} margin(w, b, D)
= argmax_{w,b} min_{x_i ∈ D} d(x_i)
= argmax_{w,b} min_{x_i ∈ D} |w · x_i + b| / sqrt( Σ_{k=1}^{d} w_k² )

Page 129: Computational  BioMedical  Informatics

129

Maximize Margin

argmax_{w,b} min_{x_i ∈ D} |w · x_i + b| / sqrt( Σ_{k=1}^{d} w_k² )

subject to:  ∀ x_i ∈ D,  y_i (w · x_i + b) ≥ 0

This is a min-max problem

Page 130: Computational  BioMedical  Informatics

130

Maximize Margin

argmax_{w,b} min_{x_i ∈ D} |w · x_i + b| / sqrt( Σ_{k=1}^{d} w_k² )
subject to:  ∀ x_i ∈ D,  y_i (w · x_i + b) ≥ 0

Strategy: rescale w and b so that  min_{x_i ∈ D} |w · x_i + b| = 1

Then the problem becomes

argmin_{w,b} Σ_{k=1}^{d} w_k²
subject to:  ∀ x_i ∈ D,  y_i (w · x_i + b) ≥ 1

Page 131: Computational  BioMedical  Informatics

131

Maximum Margin Linear Classifier

{w*, b*} = argmin_{w,b} Σ_{k=1}^{d} w_k²

subject to
y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
....
y_N (w · x_N + b) ≥ 1

How to solve it?

Page 132: Computational  BioMedical  Informatics

132

Learning via Quadratic Programming

QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Available open-source solvers
– SVMLight http://svmlight.joachims.org/
– LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
– Matlab optimization toolbox

Page 133: Computational  BioMedical  Informatics

133

Quadratic Programming

Find   argmax_u   c + d^T u + (1/2) u^T R u     (quadratic criterion)

subject to n additional linear inequality constraints

a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
   :
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

and e additional linear equality constraints

a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
   :
a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

Page 134: Computational  BioMedical  Informatics

134

Quadratic Programming of SVM

{w*, b*} = argmin_{w,b} Σ_i w_i² = argmin_{w,b} w^T w

subject to  y_i (w · x_i + b) ≥ 1  for all training data (x_i, y_i),

i.e., N inequality constraints:

y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
....
y_N (w · x_N + b) ≥ 1

Page 135: Computational  BioMedical  Informatics

135

Non-separable

(Figure: data points labeled +1 and -1 that are not linearly separable.)

This is going to be a problem! What should we do?

Page 136: Computational  BioMedical  Informatics

136

Non-separable

This is going to be a problem! What should we do?

Idea 1: Find the minimum ||w||², while minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization

Separable case, for reference:

argmin_{w,b} Σ_{k=1}^{d} w_k²   subject to  ∀ x_i ∈ D:  y_i (w · x_i + b) ≥ 1

Page 137: Computational  BioMedical  Informatics

137

Non-separable

This is going to be a problem! What should we do?

Idea 1.1: Minimize  ||w||² + C · (# train errors),  where C is a tradeoff parameter

Some points will violate  y_i (w · x_i + b) ≥ 1

We allow errors to occur:  y_i (w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0   (hinge loss)

Page 138: Computational  BioMedical  Informatics

138

Non-separable

This is going to be a problem! What should we do?

Idea 2.0: Minimize  ||w||² + C · (distance of error points to their correct place)

y_i (w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, ..., N,   and penalize  Σ_{i=1}^{N} ξ_i

Page 139: Computational  BioMedical  Informatics

139

Linear inseparable case

Balance the trade-off between the margin and the classification errors:

{w*, b*} = argmin_{w,b} Σ_j w_j² + c Σ_{i=1}^{N} ξ_i

subject to
y_1 (w · x_1 + b) ≥ 1 - ξ_1,  ξ_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 - ξ_2,  ξ_2 ≥ 0
....
y_N (w · x_N + b) ≥ 1 - ξ_N,  ξ_N ≥ 0

Page 140: Computational  BioMedical  Informatics

140

Determining the value of c

How do we determine the appropriate value for c?

Cross-validation on training data:
– Take possible choices for c
– For each choice, run a cross-validation procedure and calculate the error metric (chosen properly)
– Find the choice that achieves the best metric
– Use the best choice on all training data
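A minimal Matlab sketch of this procedure; trainSoftMarginSVM and predictSVM are hypothetical stand-ins for the quadprog-based training and the sign(w·x + b) prediction shown later in these slides:

cGrid = [0.01 0.1 1 10 100];
k = 5;
N = size(X, 1);
fold = mod(randperm(N), k) + 1;
cvErr = zeros(size(cGrid));

for j = 1:numel(cGrid)
    err = zeros(k, 1);
    for i = 1:k
        test = (fold == i);  train = ~test;
        model = trainSoftMarginSVM(X(train,:), y(train), cGrid(j));  % hypothetical helper
        err(i) = mean(predictSVM(model, X(test,:)) ~= y(test));      % hypothetical helper
    end
    cvErr(j) = mean(err);
end
[~, best] = min(cvErr);
bestC = cGrid(best);    % retrain on all training data with bestC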

Page 141: Computational  BioMedical  Informatics

141

A toy example on SVM (assignment 2)

Training data (X has two features x1, x2; y is the class label):

  x1       x2      y
0.8281   1.3162    1
2.0391   1.1447    2
1.9653   2.2966    2
0.4878   2.3856    1
0.3570   0.5606    1
1.4951   1.4693    2
2.8792   1.3368    2
1.0212   1.9389    1
1.7558   2.1281    2
0.6714   2.2641    1

(Figure: the ten points plotted in the (x1, x2) plane.)

Page 142: Computational  BioMedical  Informatics

142

Separable case

argmin_{w,b} Σ_{k=1}^{d} w_k²   subject to  ∀ x_i ∈ D:  y_i (w · x_i + b) ≥ 1

Matlab script:

[N,d] = size(X);
% constraints: y_i * ([x_i 1] * [w; b]) >= 1   (assumes labels y are +1/-1)
A = diag(y) * [X ones(N,1)];
Rhs = ones(N,1);
% objective: (1/2) * [w; b]' * H * [w; b], quadratic only in w
H = [eye(d) zeros(d,1)];
H = [H; zeros(1,d+1)];
f = zeros(d+1, 1);
% quadprog solves min 0.5*u'*H*u + f'*u subject to A*u <= b, so negate the constraints
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs);
w = sol(1:d);  b = sol(d+1);

Page 143: Computational  BioMedical  Informatics

143

Inseparable case

min_{w,b,ξ} Σ_{i=1}^{d} w_i² + c Σ_{i=1}^{N} ξ_i   subject to  y_i (w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0

Matlab script:

[N,d] = size(X);
% constraints: y_i * ([x_i 1] * [w; b]) + xi_i >= 1, with variables u = [w; b; xi]
A = [diag(y) * [X ones(N,1)] eye(N)];
Rhs = ones(N,1);
% objective: quadratic in w, linear (c * sum) in the slacks xi
H = [eye(d) zeros(d,1+N)];
H = [H; zeros(1+N, d+1+N)];
f = [zeros(d+1, 1); c*ones(N,1)];
% bound constraints: w and b free, slacks xi >= 0
Lb = [-Inf * ones(d+1,1); zeros(N,1)];
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs,[],[],Lb);
w = sol(1:d);  b = sol(d+1);  xi = sol(d+2:end);

Page 144: Computational  BioMedical  Informatics

144

The next couple of slides are backup slides (not required in this class)

Page 145: Computational  BioMedical  Informatics

145

Support Vector Machine for Noisy Data

Slack variables ξ_i:

ξ_i > 1:  y_i (w · x_i + b) < 0, i.e., x_i is misclassified
0 < ξ_i ≤ 1:  x_i is correctly classified, but lies inside the margin
ξ_i = 0:  x_i is correctly classified and lies outside the margin

Σ_{i=1}^{k} ξ_i is an upper bound on the number of training errors.

(Figure: Class 1 and Class 2 with slack variables marked.)

Page 146: Computational  BioMedical  Informatics

146

Support Vector Machine for Noisy Data

{w*, b*} = argmin_{w,b} Σ_j w_j² + c Σ_{i=1}^{N} ξ_i

subject to (N inequality constraints)
y_1 (w · x_1 + b) ≥ 1 - ξ_1,  ξ_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 - ξ_2,  ξ_2 ≥ 0
....
y_N (w · x_N + b) ≥ 1 - ξ_N,  ξ_N ≥ 0

How do we determine the appropriate value for c?
• Cross-validation

Page 147: Computational  BioMedical  Informatics

147

Support Vector Machine for Noisy Data

General optimization problem:  minimize f(w)  subject to  g_i(w) ≤ 0,  i = 1, ..., k

Define the Lagrangian:

L_p(w, α) = f(w) + Σ_{i=1}^{k} α_i g_i(w) = f(w) + α^T g(w)

Lagrangian dual problem:  maximize θ(α) = inf_w L_p(w, α)  subject to  α_i ≥ 0

Weak duality theorem:  θ(α) ≤ f(w)  for any feasible w and α ≥ 0

Duality gap:  f(w*) - θ(α*), where w* is the minimum of the Lagrangian with respect to w and α* is the maximum of the Lagrangian dual with respect to α

If the constraints g are linear functions of w, then the duality gap is 0 and  α_i* g_i(w*) = 0  for all i.

Page 148: Computational  BioMedical  Informatics

148

Support Vector Machine for Noisy Data

Karush-Kuhn-Tucker conditions:

∂L_p(w*, α*)/∂w = 0

α_i* g_i(w*) = 0,   i = 1, ..., k     (complementarity condition)

g_i(w*) ≤ 0,   i = 1, ..., k          (feasibility condition)

α_i* ≥ 0,      i = 1, ..., k

Page 149: Computational  BioMedical  Informatics

149

Support Vector Machine for Noisy Data

Use the Lagrangian formulation for the optimization problem.
Introduce a positive Lagrange multiplier α_i for each inequality constraint

y_i (w · x_i + b) - 1 + ξ_i ≥ 0   for all i,   with   α_i ≥ 0 for all i,

and a multiplier μ_i for each constraint ξ_i ≥ 0. We get the following Lagrangian:

L_p = ||w||² + c Σ_i ξ_i - Σ_i α_i [ y_i (w · x_i + b) - 1 + ξ_i ] - Σ_i μ_i ξ_i

Page 150: Computational  BioMedical  Informatics

150

Support Vector Machine for Noisy Data

L_p = ||w||² + c Σ_i ξ_i - Σ_i α_i [ y_i (w · x_i + b) - 1 + ξ_i ] - Σ_i μ_i ξ_i

Take the derivatives of L_p with respect to w, b, and ξ_i:

∂L_p/∂w = 2w - Σ_i α_i y_i x_i = 0   ⇒   w = (1/2) Σ_i α_i y_i x_i

∂L_p/∂b = - Σ_i α_i y_i = 0   ⇒   Σ_i α_i y_i = 0

∂L_p/∂ξ_i = c - α_i - μ_i = 0   ⇒   0 ≤ α_i ≤ c

Both ξ_i and its multiplier μ_i drop out of the resulting dual function:

L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)

Page 151: Computational  BioMedical  Informatics

151

The Dual Form of QP

Maximize   Σ_{k=1}^{R} α_k - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where  Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ c  for all k,   and   Σ_{k=1}^{R} α_k y_k = 0

Then define:   w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

Page 152: Computational  BioMedical  Informatics

152

The Dual Form of QP

Maximize   Σ_{k=1}^{R} α_k - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where  Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ c  for all k,   and   Σ_{k=1}^{R} α_k y_k = 0

Then define   w = (1/2) Σ_{k=1}^{R} α_k y_k x_k   and classify with   f(x, w, b) = sign(w · x + b)

Page 153: Computational  BioMedical  Informatics

153

An Equivalent QP

Maximize   Σ_{k=1}^{R} α_k - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where  Q_kl = y_k y_l (x_k · x_l)

Subject to:   0 ≤ α_k ≤ c  for all k,   and   Σ_{k=1}^{R} α_k y_k = 0

Then define:   w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

Data points with α_k > 0 are the support vectors, so this sum only needs to be over the support vectors.

Page 154: Computational  BioMedical  Informatics

154

Support Vectors

Margin hyperplanes:  w · x + b = +1  and  w · x + b = -1

w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

α_i = 0 for non-support vectors
α_i > 0 for support vectors, i.e. those i with  y_i (w · x_i + b) - 1 = 0

The decision boundary is determined only by the support vectors!

Page 155: Computational  BioMedical  Informatics

155

The Dual Form of QP

Maximize   Σ_{k=1}^{R} α_k - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where  Q_kl = y_k y_l (x_k · x_l)

Subject to:   0 ≤ α_k ≤ c  for all k,   and   Σ_{k=1}^{R} α_k y_k = 0

Then define   w = (1/2) Σ_{k=1}^{R} α_k y_k x_k   and classify with   f(x, w, b) = sign(w · x + b)

How do we determine b?

Page 156: Computational  BioMedical  Informatics

156

An Equivalent QP: Determine b

One approach: fix w at the value obtained from the dual, and solve the remaining problem for b (and the slacks):

b* = argmin_{b, ξ} Σ_{i=1}^{N} ξ_i

subject to
y_1 (w · x_1 + b) ≥ 1 - ξ_1,  ξ_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 - ξ_2,  ξ_2 ≥ 0
....
y_N (w · x_N + b) ≥ 1 - ξ_N,  ξ_N ≥ 0

With w fixed, this is a linear programming problem!

Another approach, based on support vectors: pick any i with 0 < α_i < c; then ξ_i = 0 and

y_i (w · x_i + b) - 1 = 0,   so   b = 1/y_i - w · x_i = y_i - w · x_i
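A minimal Matlab sketch of the support-vector approach, assuming the dual solution alpha, the data X, labels y in {+1, -1}, the tradeoff c, and w are available:

sv = find(alpha > 1e-6 & alpha < c - 1e-6);   % margin support vectors: 0 < alpha_i < c
b  = mean(y(sv) - X(sv,:) * w);               % average of b = y_i - w'x_i over them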