CSA3180: Natural Language Processing
Text Processing 3 – Double Lecture (October 2005)

• Discovering Word Associations
• Text Classification
• TF.IDF
• Clustering/Data Mining
• Linear and Non-Linear Classification
• Binary Classification
• Multi-Class Classification



Introduction

• Slides partly based on lectures by Barbara Rosario and Preslav Nakov
• Classification – text categorization (and other applications)
• Various issues regarding classification
  – Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification…
• The steps necessary for a classification task
  – Define classes
  – Label text
  – Features
  – Training and evaluation of a classifier


Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Problem                     Object     Categories
Tagging                     Word       POS
Sense disambiguation        Word       The word's senses
Information retrieval       Document   Relevant / not relevant
Sentiment classification    Document   Positive / negative
Author identification       Document   Authors
Language identification     Document   Languages
Text classification         Document   Topics


Author Identification

• They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together.

• Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar.


Author Identification

• Jane Austen (1775-1817), Pride and Prejudice

• Charles Dickens (1812-70), Bleak House


Author Identification

• Federalist Papers
  – 77 short essays written in 1787–1788 by Hamilton, Jay and Madison to persuade New York to ratify the US Constitution; published under a pseudonym
  – The authorship of 12 papers was in dispute (the "disputed papers")
  – In 1964 Mosteller and Wallace* solved the problem
  – They identified 70 function words as good candidates for authorship analysis
  – Using statistical inference they concluded the author was Madison


Author Identification: Function Words (figure)



Language Identification

• Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.

• Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

Universal Declaration of Human Rights, UN, in 363 languages


Language Identification

• égaux – French
• eguali – Italian
• iguales – Spanish

• edistämään - Finnish

• għ - Maltese
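A toy sketch (not from the slides) of how character n-gram profiles can separate languages; the one-sentence "training" profiles and the overlap measure below are just illustrations, not a real language identifier.

from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, a common feature for language identification."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(profile_a, profile_b):
    """Shared n-gram mass between two profiles (a crude stand-in for cosine similarity)."""
    shared = set(profile_a) & set(profile_b)
    return sum(min(profile_a[g], profile_b[g]) for g in shared)

profiles = {
    "italian": char_ngrams("tutti gli esseri umani nascono liberi ed eguali in dignità e diritti"),
    "german": char_ngrams("alle menschen sind frei und gleich an würde und rechten geboren"),
}

unknown = char_ngrams("essi sono dotati di ragione e di coscienza")
print(max(profiles, key=lambda lang: overlap(profiles[lang], unknown)))  # -> 'italian'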


Text Classification

• http://news.google.com/

• Reuters
  – Collection of 21,578 newswire documents
  – For research purposes: a standard text collection to compare systems and algorithms
  – 135 valid topic categories


Reuters Newswire Corpus


Reuters Sample

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>


Classification vs. Clustering

• Classification assumes labeled data: we know how many classes there are and we have labeled examples for each class
• Classification is supervised
• In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
• Clustering is unsupervised

Classification

(A sequence of figure slides showing data points labelled Class1 and Class2.)

Clustering

(A sequence of figure slides showing data points grouped into clusters.)


Categories (Labels, Classes)

• Labeling data – two problems:
  1. Decide the possible classes (which ones, how many)
     – Domain and application dependent
     – e.g. http://news.google.com
  2. Label the text
     – Difficult, time consuming, inconsistency between annotators


Reuters

• The same sample document as before is tagged <TOPICS><D>livestock</D><D>hog</D></TOPICS>, even though its body discusses "the future direction of farm policy and the tax law as it applies to the agriculture sector".

Why not topic = policy ?


Binary vs. Multi-Class Classification

• Binary classification: two classes

• Multi-way classification: more than two classes

• Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class (a one-vs-rest sketch follows below)
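As a quick illustration (not from the slides), a one-versus-rest wrapper might look like this in Python; train_binary is a hypothetical stand-in for any binary learner that returns a scoring function.

def train_one_vs_rest(train_binary, documents, labels):
    """Train one binary classifier per class: 'this class' vs. 'all the others'."""
    classifiers = {}
    for c in set(labels):
        binary_labels = [+1 if y == c else -1 for y in labels]
        classifiers[c] = train_binary(documents, binary_labels)
    return classifiers

def classify(classifiers, document):
    # Pick the class whose one-vs-rest classifier scores the document highest.
    return max(classifiers, key=lambda c: classifiers[c](document))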


Flat vs. Hierarchical Classification

• Flat classification: relations between the classes undetermined

• Hierarchical classification: a hierarchy where each node is a sub-class of its parent node


Single vs. Multi-Category Classification

• In single-category text classification each text belongs to exactly one category

• In multi-category text classification, each text can have zero or more categories


LabeledText class in NLTK

• The LabeledText class:

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
>>> labeled_text.text()
"Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> labeled_text.label()
"sport"


NLTK Classifier Interface

• classify determines which label is most appropriate for a given text token, and returns a labeled text token with that label.
• labels returns the list of category labels that are used by the classifier.

>>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")
>>> my_classifier.classify(token)
"The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment."/health
>>> my_classifier.labels()
("sport", "health", "world", …)


Features

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)

• Here the classification takes as input the whole string
• What is the problem with that?
• What are the features that could be useful for this example?


Features

• Feature: An aspect of the text that is relevant to the task

• Some typical features
  – Words present in the text
  – Frequency of words
  – Capitalization
  – Are there named entities (NEs)?
  – WordNet
  – Others?


Features

• Feature: An aspect of the text that is relevant to the task

• Feature value: the realization of the feature in the text
  – Words present in the text: Kerry, Schumacher, China…
  – Frequency of a word: Kerry (10), Schumacher (1)…
  – Are there dates? Yes/no
  – Are there PERSONs? Yes/no
  – Are there ORGANIZATIONs? Yes/no
  – WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)


Features

• Boolean (or binary) features
• Features that generate boolean (binary) values
• Boolean features are the simplest and the most common type of feature

  – f1(text) = 1 if text contains "Kerry", 0 otherwise
  – f2(text) = 1 if text contains a PERSON, 0 otherwise


Features

• Integer features
• Features that generate integer values
• Integer features can give classifiers access to more precise information about the text

  – f1(text) = number of times text contains "Kerry"
  – f2(text) = number of times text contains a PERSON
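For instance, the "Kerry" features above could be written as plain Python functions (a sketch; the word "Kerry" and the example sentence are only illustrations, and the PERSON features would additionally need a named-entity recognizer):

def kerry_boolean(text):
    # Boolean feature: 1 if the text contains "Kerry", 0 otherwise.
    return 1 if "Kerry" in text.split() else 0

def kerry_count(text):
    # Integer feature: number of times the text contains "Kerry".
    return text.split().count("Kerry")

doc = "Kerry met Kerry supporters in Ohio"
print(kerry_boolean(doc), kerry_count(doc))  # 1 2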


Features in NLTK

• Feature detectors
  – Features can be defined using feature detector functions, which map LabeledTexts to values
  – Method: detect, which takes a labeled text and returns a feature value

>>> def ball(ltext): return ("ball" in ltext.text())
>>> fdetector = FunctionFeatureDetector(ball)
>>> document1 = "John threw the ball over the fence".split()
>>> fdetector.detect(LabeledText(document1))
1
>>> document2 = "Mary solved the equation".split()
>>> fdetector.detect(LabeledText(document2))
0


Features

• Linguistic features
  – Words
    • lowercase? (should we convert to lowercase?)
    • normalized? (e.g. "texts" → "text")
  – Phrases
  – Word-level n-grams
  – Character-level n-grams
  – Punctuation
  – Part of speech

• Non-linguistic features
  – document formatting
  – informative character sequences (e.g. &lt;)


When do we need Feature Selection?

• If the algorithm cannot handle all possible features
  – e.g. language identification for 100 languages using all words
  – e.g. text classification using n-grams
• Good features can result in higher accuracy
• But why feature selection? What if we just keep all features?
  – Even the unreliable features can be helpful
  – But then we need to weight them
  – In the extreme case, the bad features can get a weight of 0 (or very close), which is… a form of feature selection!


Why do we need Feature Selection?

• Not all features are equally good!
• Bad features: best to remove
  – Infrequent
    • unlikely to be met again
    • co-occurrence with a class can be due to chance
  – Too frequent
    • mostly function words
  – Uniform across all categories
• Good features: should be kept
  – Co-occur with a particular category
  – Do not co-occur with other categories
• The rest: good to keep


What types of Feature Selection?

• Feature selection reduces the number of features
• Usually by:
  – Eliminating features
  – Weighting features
  – Normalizing features
• Sometimes by transforming parameters
  – e.g. Latent Semantic Indexing using Singular Value Decomposition
• The method may depend on the problem type
  – For classification and filtering, we may use information from example documents to guide selection


What types of Feature Selection?

• Task-independent methods
  – Document Frequency (DF)
  – Term Strength (TS)
• Task-dependent methods
  – Information Gain (IG)
  – Mutual Information (MI)
  – χ² statistic (CHI)

Empirically compared by Yang & Pedersen (1997)


Document Frequency (DF)

DF: number of documents a term appears in

• Based on Zipf's Law
• Remove the rare terms (met 1–2 times):
  – Non-informative
  – Unreliable – can be just noise
  – Not influential in the final decision
  – Unlikely to appear in new documents
• Plus
  – Easy to compute
  – Task independent: do not need to know the classes
• Minus
  – Ad hoc criterion
  – Rare terms can be good discriminators (e.g., in IR)

What about the frequent terms? What is a "rare" term?
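A minimal sketch (not from the slides) of DF-based pruning: count the document frequency of each term and drop terms below a threshold; min_df=2 and the toy documents are arbitrary illustrations.

from collections import Counter

def document_frequency(docs):
    """DF: in how many documents does each term appear?"""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def remove_rare_terms(docs, min_df=2):
    """Drop terms met in fewer than min_df documents."""
    df = document_frequency(docs)
    return [[w for w in doc.lower().split() if df[w] >= min_df] for doc in docs]

docs = ["the jaguar is a big cat", "the jaguar car", "a big red car"]
print(remove_rare_terms(docs))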


Stop Word Removal

• Common words from a predefined list
  – Mostly from closed-class categories:
    • unlikely to have a new word added
    • include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles
  – But also some open-class words like numerals
• Bad discriminators
  – uniformly spread across all classes
  – can be safely removed from the vocabulary
• Is this always a good idea? (e.g. author identification)
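For example, with NLTK's predefined English stop word list (assuming NLTK is installed and its stopwords corpus has been downloaded):

from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = "the american pork congress kicks off tomorrow in indianapolis".split()
print([t for t in tokens if t not in stop])  # function words like 'the' and 'in' are dropped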


χ² statistic (CHI)

• χ² statistic (pronounced "kai square")
  – The most commonly used method of comparing proportions
  – Checks whether there is a relationship between being in one of two groups and a characteristic under study
• Example: let us measure the dependency between a term t and a category c
  – the groups would be:
    1) the documents from a category ci
    2) all other documents
  – the characteristic would be: "document contains term t"


χ² statistic (CHI)

Is "jaguar" a good predictor for the "auto" class?

                  Term = jaguar   Term ≠ jaguar
Class = auto            2              500
Class ≠ auto            3             9500

We want to compare:
• the observed distribution above; and
• the null hypothesis: that jaguar and auto are independent


χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
– We would have: Pr(j,a) = Pr(j) · Pr(a)
– So there would be: N · Pr(j,a), i.e. N · Pr(j) · Pr(a)
– Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500
– Which is: N · (5/N) · (502/N) = 2510/N = 2510/10005 ≈ 0.25

                  Term = jaguar   Term ≠ jaguar
Class = auto            2              500
Class ≠ auto            3             9500


χ² statistic (CHI)

Adding the expected counts under independence to the table:

observed: fo (expected fe in parentheses)

                  Term = jaguar   Term ≠ jaguar
Class = auto         2 (0.25)       500 (502)
Class ≠ auto         3 (4.75)      9500 (9498)


χ² statistic (CHI)

χ² is interested in (fo − fe)²/fe summed over all table entries:

χ²(j,a) = Σ (fo − fe)²/fe
        = (2 − 0.25)²/0.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
        ≈ 12.9

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the χ² value for .999 confidence), i.e. p < 0.001.

observed: fo (expected fe in parentheses)

                  Term = jaguar   Term ≠ jaguar
Class = auto         2 (0.25)       500 (502)
Class ≠ auto         3 (4.75)      9500 (9498)
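A small sketch (not from the slides) of the same computation in Python; the four cell counts are the jaguar/auto table above, and the expected counts are computed from the row and column totals rather than rounded.

def chi_square(n11, n10, n01, n00):
    """Chi-square for a 2x2 term/class contingency table.

    n11: docs in the class containing the term, n10: docs in the class without it,
    n01: docs outside the class containing the term, n00: the rest.
    """
    table = [[n11, n10], [n01, n00]]
    n = n11 + n10 + n01 + n00
    row = [n11 + n10, n01 + n00]
    col = [n11 + n01, n10 + n00]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

print(chi_square(2, 500, 3, 9500))  # roughly 12.9, matching the jaguar/auto example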


χ² statistic (CHI)

How to use χ² for multiple categories?

Compute χ² for each category and then combine:
– if we want the term to discriminate well across all categories, we take the expected value of χ²:
    χ²avg(t) = Σi Pr(ci) · χ²(t, ci)
– if we want it to discriminate well for a single category, we take the maximum:
    χ²max(t) = maxi χ²(t, ci)


χ² statistic (CHI)

• Pros
  – normalized and thus comparable across terms
  – χ²(t,c) is 0 when t and c are independent
  – can be compared to the χ² distribution, 1 degree of freedom
• Cons
  – unreliable for low-frequency terms
  – computationally expensive


Term Weighting

• In the study just shown, terms were (mainly) treated as binary features
  – If a term occurred in a document, it was assigned 1
  – Else 0
• Often it is useful to weight the selected features
• Standard technique: tf.idf


TF.IDF Term Weighting

• TF: term frequency
  – definition: TF = tij, the frequency of term i in document j
  – purpose: makes the words that are frequent within a document more important
• IDF: inverse document frequency
  – definition: IDF = log(N/ni)
    • ni: number of documents containing term i
    • N: total number of documents
  – purpose: makes words that are rare across documents more important
• TF.IDF
  – definition: tij · log(N/ni) (see the sketch below)
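A small sketch of these definitions in Python (not from the slides; documents are simply lists of tokens, and the toy corpus is made up):

import math
from collections import Counter

def tf_idf(docs):
    """tf.idf weights: tf(i,j) * log(N / n_i), following the definitions above."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)             # term frequency within this document
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights

docs = [["pork", "congress", "pork"], ["congress", "policy"], ["jaguar", "car"]]
print(tf_idf(docs)[0])  # 'pork' gets a high weight; 'congress' is down-weighted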


Term Normalization

• Combine different words into a single representation
  – Stemming/morphological analysis
    • bought, buy, buys → buy
  – General word categories
    • $23.45, 5.30 Yen → MONEY
    • 1984, 10,000 → DATE, NUM
    • PERSON
    • ORGANIZATION
    • (covered in the Information Extraction segment)
  – Generalize with lexical hierarchies
    • WordNet, MeSH
    • (covered later in the semester)


Stemming and Lemmatization

• Purpose: conflate morphological variants of a word to a single index term
  – Stemming: normalize to a pseudoword
    • e.g. "more" and "morals" become "mor" (Porter stemmer)
  – Lemmatization: convert to the root form
    • e.g. "more" and "morals" become "more" and "moral"
• Plus:
  – vocabulary size reduction
  – data sparseness reduction
• Minus:
  – loses important features (even to_lowercase() can be bad!)
  – questionable utility (maybe just "-s", "-ing" and "-ed"?)
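For a quick comparison with NLTK (assuming NLTK and its WordNet data are installed; the exact outputs depend on the NLTK version, so none are asserted here):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["buys", "bought", "more", "morals"]:
    # Print the raw word, its stem, and its (noun) lemma side by side.
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))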


Practical Approach

1. Feature selection
   • infrequent term removal
     – infrequent across the whole collection (i.e. DF)
     – met in only a single document
   • most frequent term removal (i.e. stop words)
2. Normalization
   • Stemming (often)
   • Word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (occasionally)


Classification

• Linear versus non-linear classification
• Binary classification
  – Perceptron
  – Winnow
  – Support Vector Machines (SVM)
  – Kernel methods (covered in the statistics lectures)
• Multi-class classification (covered in the statistics lectures)
  – Decision trees
  – Naïve Bayes
  – k nearest neighbour


Binary Classification

• Spam filtering (spam, not spam)
• Customer service message classification (urgent vs. not urgent)
• Information retrieval (relevant, not relevant)
• Sentiment classification (positive, negative)
• Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class


Binary Classification

• Given: some data items that belong to a positive (+1) or a negative (−1) class
• Task: train the classifier and predict the class for a new data item
• Geometrically: find a separator


Linear vs. Non-Linear

• Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary


Linear vs. Non-Linear

Linear Decision boundary


Linear vs. Non-Linear

Non-linearly separable data (figure: Class1, Class2)


Linear vs. Non-Linear

Non-linear classifier (figure: Class1, Class2)


Linear vs. Non-Linear

• Linearly or non-linearly separable data?
  – We can find out only empirically
• Linear algorithms (algorithms that find a linear decision boundary)
  – Use when we think the data is linearly separable
  – Advantages
    • Simpler, fewer parameters
  – Disadvantages
    • High-dimensional data (as in NLP) is usually not linearly separable
  – Examples: Perceptron, Winnow, SVM
  – Note: we can also use linear algorithms for non-linear problems (see kernel methods)


Linear vs. Non-Linear

• Non-linear algorithms
  – Use when the data is not linearly separable
  – Advantages
    • More accurate
  – Disadvantages
    • More complicated, more parameters
  – Example: kernel methods
• Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)


Simple Linear Algorithms

• Perceptron and Winnow algorithms
  – Linear
  – Binary classification
  – Online (process data sequentially, one data point at a time)
  – Mistake-driven
  – Simple single-layer neural networks


Simple Linear Algorithms

• Data: {(xi, yi)}, i = 1…n
  – xi in Rd (a vector in d-dimensional space): the feature vector
  – yi in {−1, +1}: the label (class, category)
• Question:
  – Design a linear decision boundary wx + b (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
  – Classification rule:
    • y = sign(wx + b), which means:
    • if wx + b > 0 then y = +1
    • if wx + b < 0 then y = −1


Simple Linear Algorithms

• Find a good hyperplane (w, b) in Rd+1 that correctly classifies as many data points as possible
• In online fashion: one data point at a time, update the weights as necessary

Figure: separating hyperplane wx + b = 0; classification rule y = sign(wx + b)


Perceptron Algorithm

• Initialize: w1 = 0
• Updating rule, for each data point x:
  – If class(x) != decision(x, w)
    then  wk+1 ← wk + yi·xi
          k ← k + 1
    else  wk+1 ← wk
• Function decision(x, w)
  – If wx + b > 0 return +1
  – Else return −1

(A Python sketch of this update rule follows below.)

Figure: a misclassified point (label +1 or −1) moves the boundary from wk·x + b = 0 to wk+1·x + b = 0.
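A minimal Python sketch of the perceptron rule above; the toy data set and the fixed number of epochs are illustrative choices, not part of the slide.

def perceptron_train(data, epochs=10):
    """Perceptron: additive, mistake-driven updates of (w, b)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:                        # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            prediction = 1 if activation > 0 else -1
            if prediction != y:                  # mistake: w <- w + y*x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Tiny linearly separable toy set: the class is the sign of the first coordinate.
data = [([1.0, 0.5], 1), ([2.0, -1.0], 1), ([-1.5, 0.3], -1), ([-0.5, -2.0], -1)]
print(perceptron_train(data))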


Perceptron Algorithm

• Online: can adjust to a changing target over time
• Advantages
  – Simple and computationally efficient
  – Guaranteed to learn a linearly separable problem (convergence, global optimum)
• Limitations
  – Only linear separations
  – Only converges for linearly separable data
  – Not really "efficient with many features"


Winnow Algorithm

• Another online algorithm for learning perceptron weights:

f(x) = sign(wx + b)

• Linear, binary classification

• Update-rule: again error-driven, but multiplicative (instead of additive)


Winnow Algorithm

• Initialize: w1 = 0
• Updating rule, for each data point x:
  – If class(x) != decision(x, w)
    then  wk+1 ← wk + yi·xi         (Perceptron)
          wk+1 ← wk · exp(yi·xi)    (Winnow)
          k ← k + 1
    else  wk+1 ← wk
• Function decision(x, w)
  – If wx + b > 0 return +1
  – Else return −1

Figure: a misclassified point moves the boundary from wk·x + b = 0 to wk+1·x + b = 0.
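A corresponding sketch of the multiplicative update. The learning rate eta, the fixed threshold, the non-zero initial weights, and the boolean toy features are standard Winnow-style choices that the slide does not spell out.

import math

def winnow_train(data, eta=0.5, epochs=10):
    """Winnow-style learner: the same mistake-driven loop as the perceptron,
    but with multiplicative updates w_i <- w_i * exp(eta * y * x_i)."""
    dim = len(data[0][0])
    w = [1.0] * dim                  # multiplicative updates need non-zero initial weights
    theta = dim / 2.0                # fixed threshold instead of a learned bias
    for _ in range(epochs):
        for x, y in data:            # y is +1 or -1; x_i are (typically boolean) features
            score = sum(wi * xi for wi, xi in zip(w, x))
            prediction = 1 if score > theta else -1
            if prediction != y:
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
    return w

data = [([1, 1, 0, 0], 1), ([1, 0, 1, 0], 1), ([0, 0, 1, 1], -1), ([0, 1, 0, 1], -1)]
print(winnow_train(data))  # the weight of the single relevant feature grows fastest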


Perceptron vs. Winnow

• Assume
  – N available features
  – only K relevant ones, with K << N
• Perceptron: number of mistakes is O(K·N)
• Winnow: number of mistakes is O(K·log N)
• Winnow is more robust to high-dimensional feature spaces


Perceptron vs. Winnow

Perceptron
• Online: can adjust to a changing target over time
• Advantages
  – Simple and computationally efficient
  – Guaranteed to learn a linearly separable problem
• Limitations
  – only linear separations
  – only converges for linearly separable data
  – not really "efficient with many features"

Winnow
• Online: can adjust to a changing target over time
• Advantages
  – Simple and computationally efficient
  – Guaranteed to learn a linearly separable problem
  – Suitable for problems with many irrelevant attributes
• Limitations
  – only linear separations
  – only converges for linearly separable data
  – not really "efficient with many features"
• Used in NLP


Support Vector Machine (SVM)

• Large Margin Classifier

• Linearly separable case

• Goal: find the hyperplane that maximizes the margin

Figure: separating hyperplane wT·x + b = 0 with margin M; the margin boundaries wT·xa + b = 1 and wT·xb + b = −1 pass through the support vectors.
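As a present-day illustration (not from the slides), a linear SVM text classifier can be put together with scikit-learn, assuming that library is installed; the tiny training set below is made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["pork congress farm policy", "grand prix qualifying circuit",
               "tax law agriculture sector", "formula one champion race"]
train_labels = ["agriculture", "sport", "agriculture", "sport"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)       # documents -> tf.idf feature vectors
classifier = LinearSVC().fit(X, train_labels)   # large-margin linear separator

print(classifier.predict(vectorizer.transform(["pork producers council"])))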


Support Vector Machine (SVM)

• Text classification

• Hand-writing recognition

• Computational biology (e.g., micro-array data)

• Face detection

• Face expression recognition

• Time series prediction


Classification II

• Non-linear algorithms

• Kernel methods

• Multi-class classification

• Decision trees

• Naïve Bayes

• Last topic for today: k Nearest Neighbour


k Nearest Neighbour

• Nearest neighbour classification rule: to classify a new object, find the object in the training set that is most similar to it, then assign the category of this nearest neighbour
• k Nearest Neighbours (kNN): consult the k nearest neighbours; the decision is based on the majority category of these neighbours. More robust than k = 1
  – An example of a similarity measure often used in NLP is cosine similarity (see the sketch below)
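A minimal sketch of kNN with cosine similarity over bags of words (not from the slides; the toy training set is made up for illustration):

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(train, text, k=3):
    """Find the k nearest neighbours by cosine similarity; majority vote over their labels."""
    query = Counter(text.lower().split())
    neighbours = sorted(train, key=lambda item: cosine(Counter(item[0].lower().split()), query),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [("formula one grand prix", "sport"), ("pork congress farm", "agriculture"),
         ("champion race circuit", "sport")]
print(knn_classify(train, "grand prix champion", k=3))  # -> 'sport'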

1 Nearest Neighbour (figure slides)

3 Nearest Neighbour (figure slides)

Assign the category of the majority of the neighbors

But this one is closer… We can weight neighbours according to their similarity


k Nearest Neighbour

• Strengths
  – Robust
  – Conceptually simple
  – Often works well
  – Powerful (arbitrary decision boundaries)
• Weaknesses
  – Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbours k)
  – Finding a good similarity measure can be difficult
  – Computationally expensive