WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 17


Page 1

WEB BAR 2004 Advanced Retrieval and Web Mining

Lecture 17

Page 2

Today’s Topics

Text classification
  Logistic regression
  Support vector machines

Page 3

Intro to Logistic Regression

Naïve Bayes and Logistic Regression are probabilistic models.

Naïve Bayes is a generative model. Logistic regression is a discriminative model: it more directly optimizes classification accuracy.

Page 4

Intro to Logistic Regression

A generative model predicts the probability that a document d will be generated by a source c.

Naïve Bayes model:

P(d|c) = ∏_{w in d} P(w|c)

The parameters, i.e. the P(w|c)'s, are fit to optimally predict the generation of d.

Page 5

Classify Text w/ Gen. Model

One source model for each class c. Choose the class c with the largest value of:

P(c|d) = P(d|c) P(c) / k

log P(c|d) = log P(c) + Σ_{w in d} log P(w|c) - log k

Generative model: our criterion is "how likely is it that this model generated the document?"
Classification accuracy is optimized indirectly and imperfectly.
Accuracy and MAP are different goals, and will in general lead to different predictions.
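To make the criterion concrete, here is a minimal sketch (not from the slides) of generative classification, assuming the priors P(c) and smoothed conditionals P(w|c) have already been estimated; the names priors, cond_prob, and vocab_size are illustrative.

```python
import math

def classify_generative(doc_words, priors, cond_prob, vocab_size):
    """Choose the class c maximizing log P(c) + sum over w in d of log P(w|c).

    priors:    dict class -> P(c)
    cond_prob: dict class -> dict word -> smoothed P(w|c)
    Words unseen for a class fall back to a crude 1/vocab_size probability.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            score += math.log(cond_prob[c].get(w, 1.0 / vocab_size))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```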

Page 6

Naïve Bayes

P(c|d) = P(d|c) P(c) / k

log P(c|d) = log P(c) + Σ_{w in d} log P(w|c) - log k

For binary classification, we have:

log [P(C|d) / P(¬C|d)] = log [P(C) / P(¬C)] + Σ_{w in d} log [P(w|C) / P(w|¬C)]

Page 7

The discriminative alternative: Logistic Regression

Naïve Bayes:

log [P(C|d) / P(¬C|d)] = log [P(C) / P(¬C)] + Σ_{w in d} log [P(w|C) / P(w|¬C)]

Logistic regression: directly model the probability of the class conditional on the words w:

log [P(C|d) / P(¬C|d)] = Σ_{w in d} βw

Page 8

The discriminative alternative: Logistic Regression

Logistic regression:

log [P(C|d) / P(¬C|d)] = Σ_{w in d} βw

Tune the parameters βw to optimize the conditional likelihood (class probability predictions).

This is what a statistician would probably tell you to use if you said you had a categorical decision problem (like text categorization).

Page 9

The Logit-Based Model

The simplest model for optimizing predictive accuracy is linear regression. Why don't we use linear regression? p = α + βX + e. The normality assumption does not work for probabilities.

We need to transform the input variables and the predicted variable to be able to apply regression. The transformation is the logit:

Logit(p) = ln[p/(1-p)] = α + βX + e

Apply it also to the input variables: logit(p) = a + b1*logit(F1) + b2*logit(F2) + ... + bn*logit(Fn) (where p ≈ class probability and Fi ≈ word i)

Model: The logit of a predicted probability is the (weighted) sum of the logits of the probabilities associated with each of the features

An instance of a generalized linear model where one response is conditioned on all features

Page 10

Logit and Logistic

Logit: ln[p/(1-p)] = α + βX

The slope coefficient (β) is interpreted as the rate of change in the log odds as X changes.

A more intuitive interpretation of the logit is the "odds ratio".

Since [p/(1-p)] = exp(α + βX), exp(β) is the effect of the independent variable on the odds of having a certain classification.

Logistic(X) = 1/(1 + exp(-α - βX))

Page 11

Logit and logistic transforms

logit(p) = ln(p/[1-p])
logistic(x) = 1/(1 + e^(-x))

[Plot: the logit and logistic curves]
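A small illustrative snippet of the two transforms (plain Python, not from the original lecture); it also checks that the logistic inverts the logit.

```python
import math

def logit(p):
    """logit(p) = ln(p / (1 - p)): maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """logistic(x) = 1 / (1 + e^(-x)): the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-x))

# The two transforms invert each other: this prints 0.8 (up to rounding).
print(round(logistic(logit(0.8)), 10))
```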

Page 12

Classification

• Compute the vector representation X of the document
• Compute z = α + βX
  • Dot product of the weight vector with the vector representation X
  • Beta defines a hyperplane, as before
• P(C) = logistic(z) = 1/(1 + e^(-z))
• P(C) is the probability that the document is in the class
• If we use a good method to estimate beta, then this will be a "good" probability (as opposed to Naïve Bayes)
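A minimal sketch of the scoring step just described, assuming the parameters α (alpha) and β (beta) have already been estimated; the feature values and weights below are made up.

```python
import math

def predict_prob(x, beta, alpha):
    """P(C|d) = logistic(alpha + beta . x) for a document vector x."""
    z = alpha + sum(b_j * x_j for b_j, x_j in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 3-feature document representation and weights:
x = [1.0, 0.0, 2.0]
beta = [0.7, -1.2, 0.3]
p = predict_prob(x, beta, alpha=-0.5)
label = "C" if p >= 0.5 else "not C"   # threshold the estimated probability
```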

Page 13

Training a Logistic Regression Model

Training consists of computing the parameters α and β.

MLE: maximum likelihood estimation. MLE is a statistical method for estimating the coefficients of a model so that they maximize some likelihood function.

Here the likelihood function (L) measures the probability of observing the particular set of class labels (C, not C) that occur in the training data.

LogR MLE is normally done by some form of iterative fitting algorithm, or a gradient procedure such as CG (conjugate gradient).

This is expensive for large models with many features.
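A minimal sketch of MLE by plain batch gradient ascent on the conditional log-likelihood (conjugate gradient or another iterative fitter would be used in practice, as noted above); the data layout, learning rate, and epoch count are assumptions, and no regularization is included.

```python
import math

def train_logreg(X, y, epochs=200, lr=0.1):
    """Maximize sum_i log P(y_i | x_i) by batch gradient ascent.
    X: list of feature vectors, y: labels in {0, 1}."""
    n_feat = len(X[0])
    beta = [0.0] * n_feat
    alpha = 0.0
    for _ in range(epochs):
        grad_b = [0.0] * n_feat
        grad_a = 0.0
        for x, yi in zip(X, y):
            z = alpha + sum(b * xj for b, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p                 # gradient of the log-likelihood w.r.t. z
            grad_a += err
            for j, xj in enumerate(x):
                grad_b[j] += err * xj
        alpha += lr * grad_a
        beta = [b + lr * g for b, g in zip(beta, grad_b)]
    return alpha, beta
```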

Page 14

LR & NB: Same Parameters!

LR:  log [P(C|d) / P(¬C|d)] = Σ_{w in d} βw

NB:  log [P(C|d) / P(¬C|d)] = log [P(C) / P(¬C)] + Σ_{w in d} log [P(w|C) / P(w|¬C)]

- Binary or raw TF weighting
- Optimized differently

Page 15

Performance

Early results with LogR were disappointing, because people didn’t understand the means to regularize (smooth) LogR to cope with sparse data

Done right, LogR outperforms NB in text classification.

NB optimizes its parameters to predict words; LogR optimizes them to predict the class.

LogR seems as good as SVMs (or any known text categorization method; Zhang & Oles 2001), though it is less studied and less trendy than SVMs.

Page 16

Support Vector Machines

Page 17

Recall: Which Hyperplane?

In general, lots of possible solutions

Support Vector Machine (SVM) finds an optimal solution.

Page 18

Support Vector Machine (SVM)

[Figure: support vectors and the maximized margin around the separating hyperplane]

SVMs maximize the margin around the separating hyperplane.

The decision function is fully specified by a subset of training samples, the support vectors.

Quadratic programming problem

Seen by many as the most successful current text classification method.

Page 19

Maximum Margin: Formalization

w: hyperplane normal
xi: data point i
yi: class of data point i (+1 or -1)

Constrained optimization formalization:

(1) correct classification: yi (w·xi + b) ≥ 1 for all i
(2) maximize the margin: 2/||w||
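A small sketch, under the scaling convention yi (w·xi + b) ≥ 1 used above, of checking the constraints and computing the margin for a candidate separator; the function name is illustrative.

```python
import math

def margin_and_feasibility(w, b, X, y):
    """Check y_i * (w . x_i + b) >= 1 for all points and return the
    geometric margin 2 / ||w|| of the separator (w, b)."""
    feasible = all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1.0
        for xi, yi in zip(X, y)
    )
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    return feasible, 2.0 / norm_w
```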

Page 20

Key differentiator of SVMs is reliance on support vectors

Conceptually: only what is close to the decision boundary should matter.

Why is margin determined by support vectors only?


Page 21

Quadratic Programming

Quadratic programming setup explains two properties of SVMs

1. Classification is determined by the support vectors only
2. The notion of a kernel

Page 22

Quadratic Programming

One can show that the hyperplane normal w with maximum margin is:

w = Σ_i αi yi xi

αi: Lagrange multipliers
xi: data point i
yi: class of data point i (+1 or -1)

where the αi are the solution to maximizing:

L(α) = Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj (xi·xj),  subject to αi ≥ 0 and Σ_i αi yi = 0

Most αi will be zero. The points with non-zero αi are the support vectors.
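A sketch of recovering the hyperplane normal from the multipliers; the αi themselves would come from a QP solver, which is not shown here.

```python
def hyperplane_normal(alphas, X, y):
    """w = sum_i alpha_i * y_i * x_i; points with alpha_i = 0 contribute nothing,
    so only the support vectors determine w."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    for a, xi, yi in zip(alphas, X, y):
        if a == 0.0:
            continue  # not a support vector
        for j in range(n_feat):
            w[j] += a * yi * xi[j]
    return w
```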

Page 23

Non-Separable Case

Now we know how to build a separator for two linearly separable classes

What about classes whose exemplary docs are not linearly separable?

Page 24

Not Linearly Separable

Find a line that penalizes points on "the wrong side".

Page 25

Penalizing Bad Points

Define a distance for each point with respect to the separator ax + by = c:
(ax + by) - c for red points
c - (ax + by) for green points
Negative for bad points.
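A tiny sketch of the signed quantity just defined (sign convention as on the slide: a negative value means the point is on the wrong side).

```python
def signed_margin(point, a, b, c, positive_class):
    """For the separator a*x + b*y = c:
    positive-class ('red') points score (a*x + b*y) - c,
    negative-class ('green') points score c - (a*x + b*y);
    a negative score marks a point on the wrong side."""
    x, y = point
    value = a * x + b * y
    return (value - c) if positive_class else (c - value)
```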

Page 26

Classification with SVMs

Given a new point (x1, x2), we can score its projection onto the hyperplane normal:
Compute the score: w·x + b
In 2 dimensions: score = w1x1 + w2x2 + b
Set a confidence threshold t.

Score > t: yes

Score < -t: no

Else: don’t know
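A sketch of the thresholded decision rule above; w, b, and t are assumed to be given.

```python
def svm_decision(x, w, b, t):
    """Score w . x + b and compare against a confidence threshold t."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"
```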

Page 27

SVMs: Predicting Generalization

We want the classifier with the best generalization (best accuracy on new data).

What are clues for good generalization?
  Large training set
  Low error on the training set
  Capacity/variance (number of parameters in the model, expressive power of the model)

SVMs give you an explicit bound on the error on new data based on these.

Page 28

Capacity/Variance: VC Dimension

Theoretical risk bound:

Risk(α) ≤ Remp(α) + sqrt( [h(ln(2l/h) + 1) - ln(η/4)] / l )

Risk: mean error rate; α: the model (defined by its parameters); Remp: empirical risk; l: #observations; h: VC dimension. The bound holds with probability (1 - η).

VC dimension/capacity: the maximum number of points that can be shattered.
A set can be shattered if the classifier can learn every possible labeling of it.
VC = Vapnik-Chervonenkis dimension

Page 29

Capacity of Hyperplanes?

Page 30

Exercise

Suppose you have n points in d dimensions, labeled red or green. How big need n be (as a function of d) in order to create an example with the red and green points not linearly separable?

E.g., for d=2, n ≥ 4.

Page 31

Capacity/Variance: VC Dimension

Theoretical risk bound (as on the earlier slide):

Risk(α) ≤ Remp(α) + sqrt( [h(ln(2l/h) + 1) - ln(η/4)] / l )

Remp: empirical risk; l: #observations; h: VC dimension. The bound holds with probability (1 - η).

An important theoretical property, but not very often used in practice.

Page 32

SVM Kernels

Recall: we're maximizing

L(α) = Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj (xi·xj)

Observation: the data only occur in dot products.
We can therefore map the data into a very high-dimensional space (even infinite!), as long as the kernel is computable.
For a mapping function Ф, compute the kernel K(i, j) = Ф(xi)·Ф(xj).
Example:
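The slide's own example is not reproduced here; as a stand-in, a sketch of the standard quadratic-kernel illustration in two dimensions: K(u, v) = (u·v)² equals Ф(u)·Ф(v) for the explicit map Ф(x1, x2) = (x1², √2·x1·x2, x2²), so the dot product in the mapped space never has to be formed explicitly.

```python
import math

def phi(x):
    """Explicit quadratic feature map for 2-d input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k_quadratic(u, v):
    """Kernel K(u, v) = (u . v)^2, computed without mapping."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot * dot

u, v = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
implicit = k_quadratic(u, v)
# explicit and implicit agree (both are 16.0 here, up to floating-point error)
```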

Page 33

Kernels

Why use kernels?
  Make a non-separable problem separable.
  Map data into a better representational space.

Common kernels:
  Linear
  Polynomial
  Radial basis function (infinite-dimensional space)
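Illustrative implementations of the three kernels listed; the degree, coef0, and gamma defaults are arbitrary choices, not values from the lecture.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    return (linear_kernel(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    # exp(-gamma * ||u - v||^2): corresponds to an infinite-dimensional feature space
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```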

Page 34

Results for Kernels (Joachims)

Page 35

Performance of SVM

SVMs are seen by many as the best-performing method.

The statistical significance of most results is not clear.

There are many methods that perform about as well as SVM.

Example: regularized logistic regression (Zhang&Oles)

Example of a comparison study: Yang&Liu

Page 36

Yang&Liu: SVM vs Other Methods

Page 37

Yang&Liu: Statistical Significance

Page 38

Summary

Support vector machines (SVM)
  Choose the hyperplane based on support vectors
  Support vector = "critical" point close to the decision boundary
  Kernels: a powerful and elegant way to define a similarity metric
  Bound on "risk" (expected error on the test set)
  Best-performing text classifier?
  Partly popular due to the availability of svmlight
    svmlight is accurate and fast - and free (for research)

Logistic regression (LR)
  Traditional statistical technique for classification
  Does not work "out of the box" due to the high dimensionality of text
  Robust/regularized versions perform as well as SVM?
  No equivalent to svmlight available

(Degree-1) SVMs and LR are linear classifiers.

Page 39

Resources

Manning and Schuetze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.

Trevor Hastie, Robert Tibshirani and Jerome Friedman. "The Elements of Statistical Learning: Data Mining, Inference and Prediction." Springer-Verlag, New York.

Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition (1998).

R. M. Tong, L. A. Appelbaum, V. N. Askman, J. F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR, 247-253 (1987).

S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.

S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. Proceedings of CIKM '98, pp. 148-155.

Yiming Yang and Xin Liu. A re-examination of text categorization methods (1999). 22nd Annual International SIGIR.

Tong Zhang and Frank J. Oles. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31 (2001).

'Classic' Reuters data set: http://www.daviddlewis.com/resources/testcollections/reuters21578/