
Page 1

Natural Language Processing SoSe 2015

Machine Learning for NLP

Dr. Mariana Neves May 4th, 2015

(based on the slides of Dr. Saeedeh Momtazi)

Page 2

Introduction

● „Field of study that gives computers the ability to learn without being explicitly programmed“

– Arthur Samuel, 1959

(http://www.cs.toronto.edu/~urtasun/courses/CSC2515/CSC2515_Winter15.html)

Page 3

Introduction

● Learning Methods

– Supervised learning

● Active learning

– Unsupervised learning

– Semi-supervised learning

– Reinforcement learning

Page 4

Outline

● Supervised Learning

● Semi-supervised learning

● Unsupervised learning

Page 5

Supervised Learning

● Example: mortgage credit decision

– Age

– Income

http://nationalmortgageprofessional.com/news24271/regulatory-compliance-outlook-new-risk-based-pricing-rules

Page 6

Supervised Learning

http://nationalmortgageprofessional.com/news24271/regulatory-compliance-outlook-new-risk-based-pricing-rules

[Figure: data points plotted by age and income; a new, unlabelled item is marked "?"]

Page 7

Classification

Training: items T1 … Tn with known categories C1 … Cn; features F1 … Fn are extracted from the items and a Model(F, C) is learned.

Testing: for a new item Tn+1, its features Fn+1 are extracted and the model predicts its category Cn+1.

Page 8

Applications

Problem                    | Item     | Category
POS tagging                | Word     | POS
Named entity recognition   | Word     | Named entity
Word sense disambiguation  | Word     | Word's sense
Spam mail detection        | Document | Spam / not spam
Language identification    | Document | Language
Text categorization        | Document | Topic
Information retrieval      | Document | Relevant / not relevant

Page 9

Part-of-speech tagging

http://weaver.nlplab.org/~brat/demo/latest/#/not-editable/CoNLL-00-Chunking/train.txt-doc-1

Page 10

Named entity recognition

http://corpora.informatik.hu-berlin.de/index.xhtml#/cellfinder/version1_sections/16316465_03_results

Page 11

Word sense disambiguation

Page 12

Spam mail detection

Page 13

Language identification

Page 14

Text categorization

Page 15

Classification

(The training/testing classification scheme is shown again: features F1 … Fn and categories C1 … Cn of the training items yield Model(F, C), which predicts the category Cn+1 of a new item Tn+1 from its features Fn+1.)

Page 16

Classification algorithms

● K Nearest Neighbor

● Support Vector Machines

● Naïve Bayes

● Maximum Entropy

● Linear Regression

● Logistic Regression

● Neural Networks

● Decision Trees

● Boosting

● ...

Page 17

Classification algorithms

● K Nearest Neighbor

● Support Vector Machines

● Naïve Bayes

● Maximum Entropy

● Linear Regression

● Logistic Regression

● Neural Networks

● Decision Trees

● Boosting

● ...

Page 18

K Nearest Neighbor

[Figure: items from two classes in feature space, with a new unlabelled item marked "?"]

Page 19

K Nearest Neighbor

Page 20

K Nearest Neighbor

● 1-nearest neighbor

Page 21

K Nearest Neighbor


● 3-nearest neighbors

Page 22

K Nearest Neighbor

● 3-nearest neighbors

Page 23

K Nearest Neighbor

● Simple approach

● No black-box

● Choose

– Features

– Distance metrics

– Value of k for majority voting (see the sketch below)

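A minimal Python sketch of these choices is shown below; the toy (age, income) data, the Euclidean distance and k = 3 are assumptions made for illustration, not part of the slides.

# Minimal k-nearest-neighbour classifier with majority voting
from collections import Counter
import math

def euclidean(a, b):
    # distance metric: one of the choices listed above
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs; query: a feature vector
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]              # majority voting

# toy items described by (age, income) with invented labels
train = [((25, 20), "deny"), ((40, 60), "approve"),
         ((35, 55), "approve"), ((22, 18), "deny")]
print(knn_predict(train, (30, 50), k=3))           # -> "approve"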

Page 24

K Nearest Neighbor

● A skewed class distribution can bias the majority vote towards the frequent class

Page 25

Classification algorithms

● K Nearest Neighbor

● Support Vector Machines

● Naïve Bayes

● Maximum Entropy

● Linear Regression

● Logistic Regression

● Neural Networks

● Decision Trees

● Boosting

● ...

Page 26

Support vector machines

Page 27

Support vector machines

● Find a hyperplane in the vector space that separates the items of the two categories

Page 28

Support vector machines

● There might be more than one possible separating hyperplane

Page 29

Support vector machines

● Find the hyperplane with maximum margin

● Vectors at the margins are called support vectors

(http://en.wikipedia.org/wiki/Support_vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)

Page 30

Support vector machines

● Usually provides good results

● There are libraries available for many programming languages (see the sketch below)

● Linear and non-linear classification

(http://cs.stanford.edu/people/karpathy/svmjs/demo/)
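A minimal sketch with scikit-learn's SVC, one of the libraries alluded to above; the toy data and the kernel choices are assumptions for illustration.

# Linear and non-linear SVM classification with scikit-learn
from sklearn.svm import SVC

X = [[25, 20], [22, 18], [40, 60], [35, 55]]   # e.g. (age, income) features (invented)
y = ["deny", "deny", "approve", "approve"]

clf = SVC(kernel="linear")                     # linear separating hyperplane
clf.fit(X, y)
print(clf.predict([[30, 50]]))                 # -> ['approve']

rbf = SVC(kernel="rbf").fit(X, y)              # non-linear decision boundary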

Page 31

Support vector machines

● Black-box

● Multi-class problems usually modelled as many binary classifications

(http://www.cs.ubc.ca/~schmidtm/Software/minFunc/examples.html)

Page 32

Classification algorithms

● K Nearest Neighbor

● Support Vector Machines

● Naïve Bayes

● Maximum Entropy

● Linear Regression

● Logistic Regression

● Neural Networks

● Decision Trees

● Boosting

● ...

Page 33

Naïve Bayes

● Selecting the class with highest probability

– Minimizing the number of items with wrong labels

– The probability should depend on the data d to be classified

c = argmax_{c_i} P(c_i)

c = argmax_{c_i} P(c_i | d)

Page 34

Naïve Bayes

c = argmax_{c_i} P(c_i)

c = argmax_{c_i} P(c_i | d)

c = argmax_{c_i} P(d | c_i) · P(c_i) / P(d)

c = argmax_{c_i} P(d | c_i) · P(c_i)

Page 35

Naïve Bayes

c = argmax_{c_i} P(d | c_i) · P(c_i)

P(c_i): prior probability
P(d | c_i): likelihood probability
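A minimal sketch of this decision rule, assuming the priors P(c_i) and the per-word likelihoods P(w | c_i) have already been estimated; all numbers are invented for illustration.

# c = argmax_ci P(d | ci) * P(ci), with d a bag of words and the naive
# independence assumption: P(d | ci) = product of P(w | ci) over the words in d
import math

priors = {"spam": 0.4, "ham": 0.6}                       # P(ci), assumed values
likelihood = {                                           # P(w | ci), assumed values
    "spam": {"free": 0.05, "money": 0.04, "meeting": 0.001},
    "ham":  {"free": 0.005, "money": 0.005, "meeting": 0.03},
}

def classify(words):
    scores = {}
    for c in priors:
        # work in log space: log P(ci) + sum of log P(w | ci)
        scores[c] = math.log(priors[c]) + sum(math.log(likelihood[c][w]) for w in words)
    return max(scores, key=scores.get)

print(classify(["free", "money"]))                       # -> "spam"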

Page 36

Classification

(The training/testing classification scheme is shown again: the model learned from the features and categories of the training items predicts the category Cn+1 of a new item Tn+1 from its features Fn+1.)

Page 37

Spam mail detection

Features:
- words
- sender's email
- contains links
- contains attachments
- contains money amounts
- ...

Page 38

Feature selection

● Bag-of-words:

– Each document is represented by the words it contains (see the sketch below)

– High dimensional feature space

– Computationally expensive

(http://www.python-course.eu/text_classification_python.php)
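A minimal bag-of-words sketch using scikit-learn's CountVectorizer; the example documents are invented for illustration.

# Represent documents by word counts (the document-term matrix)
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win free money now", "meeting agenda for monday", "free tickets win now"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document, one column per word

print(vectorizer.vocabulary_)        # word -> column index (the high-dimensional feature space)
print(X.toarray())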

Page 39

Feature selection

● Solution: feature selection to keep only informative words (see the sketch below)


(http://scikit-learn.org/0.11/modules/feature_selection.html)
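A minimal sketch of such a selection with scikit-learn's SelectKBest; the χ² scoring function and k = 2 are arbitrary choices, and the labeled documents are invented.

# Keep only the most informative words as features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["win free money now", "meeting agenda for monday",
          "free money win now", "agenda and minutes of the meeting"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=2)            # keep the 2 highest-scoring words
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)                       # -> (4, 2)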

Page 40

Feature selection methods

● Information gain

● Mutual information

● χ-Square

Page 41

Information gain

● Measures how much the presence or absence of a term in a document tells about the document's category

● Removing words whose information gain is less than a predefined threshold

(http://seqam.rutgers.edu/site/index.php?option=com_jresearch&view=project&task=show&id=35)

Page 42

Information gain

IG(w) = − Σ_{i=1}^{K} P(c_i) · log P(c_i)
        + P(w) · Σ_{i=1}^{K} P(c_i | w) · log P(c_i | w)
        + P(w̄) · Σ_{i=1}^{K} P(c_i | w̄) · log P(c_i | w̄)

Page 43

Information gain

– N    = # docs
– N_i  = # docs in category c_i
– N_w  = # docs containing w
– N_w̄  = # docs not containing w
– N_iw = # docs in category c_i containing w
– N_iw̄ = # docs in category c_i not containing w

P(c_i) = N_i / N
P(w) = N_w / N
P(w̄) = N_w̄ / N
P(c_i | w) = N_iw / N_w
P(c_i | w̄) = N_iw̄ / N_w̄
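A minimal sketch that computes IG(w) from such document counts; the counts below are invented and the natural logarithm is assumed.

# Information gain of a word w from document counts
import math

N = 100                                   # total number of docs
N_i  = {"spam": 40, "ham": 60}            # docs per category
N_w  = 30                                 # docs containing w
N_iw = {"spam": 25, "ham": 5}             # docs per category that contain w

def info_gain():
    ig = -sum((N_i[c] / N) * math.log(N_i[c] / N) for c in N_i)
    # term for documents containing w
    ig += (N_w / N) * sum((N_iw[c] / N_w) * math.log(N_iw[c] / N_w) for c in N_i)
    # term for documents not containing w
    N_wbar = N - N_w
    ig += (N_wbar / N) * sum(((N_i[c] - N_iw[c]) / N_wbar)
                             * math.log((N_i[c] - N_iw[c]) / N_wbar) for c in N_i)
    return ig

print(round(info_gain(), 3))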

Page 44

Mutual information

● Measuring the effect of each word in predicting the category

MI(w, c_i) = log [ P(w, c_i) / ( P(w) · P(c_i) ) ]

Page 45

Mutual information

● Removing words whose mutual information is less than a predefined threshold

MI(w) = max_i MI(w, c_i)

MI(w) = Σ_i P(c_i) · MI(w, c_i)
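A minimal sketch computing MI(w, c_i) from document counts and combining the per-category values by maximum and by weighted average; the counts are invented for illustration.

# Mutual information between a word w and the categories
import math

N, N_w = 100, 30                          # total docs, docs containing w
N_i  = {"spam": 40, "ham": 60}            # docs per category
N_iw = {"spam": 25, "ham": 5}             # docs per category that contain w

def mi(c):
    p_w_c = N_iw[c] / N                   # P(w, ci)
    return math.log(p_w_c / ((N_w / N) * (N_i[c] / N)))

mi_max = max(mi(c) for c in N_i)                    # MI(w) as the maximum
mi_avg = sum((N_i[c] / N) * mi(c) for c in N_i)     # MI(w) as the P(ci)-weighted average
print(round(mi_max, 3), round(mi_avg, 3))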

Page 46

χ-Square

● Measuring the dependencies between words and categories

● Ranking words based on their χ-square measure

● Selecting the top words as features

– N_īw = # docs not in category c_i containing w
– N_īw̄ = # docs not in category c_i not containing w

χ²(w, c_i) = N · (N_iw · N_īw̄ − N_iw̄ · N_īw)² / [ (N_iw + N_iw̄) · (N_īw + N_īw̄) · (N_iw + N_īw) · (N_iw̄ + N_īw̄) ]

χ²(w) = Σ_{i=1}^{K} P(c_i) · χ²(w, c_i)
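A minimal sketch of χ²(w, c_i) computed from the four contingency counts; the counts are invented for illustration.

# Chi-squared statistic for word w and category ci from a 2x2 contingency table
def chi_square(n_iw, n_iw_bar, n_not_i_w, n_not_i_w_bar):
    # rows: in category ci / not in ci; columns: contains w / does not contain w
    n = n_iw + n_iw_bar + n_not_i_w + n_not_i_w_bar
    num = n * (n_iw * n_not_i_w_bar - n_iw_bar * n_not_i_w) ** 2
    den = ((n_iw + n_iw_bar) * (n_not_i_w + n_not_i_w_bar)
           * (n_iw + n_not_i_w) * (n_iw_bar + n_not_i_w_bar))
    return num / den

# e.g. 25 spam docs contain w, 15 do not; 5 non-spam docs contain w, 55 do not
print(round(chi_square(25, 15, 5, 55), 2))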

Page 47

Feature selection

● These models perform well for document-level classification

– Spam Mail Detection

– Language Identification

– Text Categorization

● Word-level classification might need other types of features

– Part-of-speech tagging

– Named Entity Recognition

Page 48

Supervised learning

● Shortcoming

– Relies heavily on annotated data

– Annotation is a time-consuming and expensive task

Page 49

Supervised learning

● Active learning

– Using a minimum amount of annotated data

– Further data are annotated by a human only if they are very informative

http://en.wikipedia.org/wiki/Transduction_(machine_learning)

Page 50

Active learning

Page 51

Active learning

- Annotating a small amount of data

Page 52

Active learning

[Figure: unlabeled points annotated with the classifier's confidence: low (L), medium (M), high (H)]

- Calculating the confidence score of the classifier on unlabeled data

Page 53

Active learning

- Finding the informative unlabeled data (the data with the lowest confidence)

- Manually annotating the informative data
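A minimal uncertainty-sampling sketch of this loop; the logistic-regression model, the query budget and the oracle callback standing in for the human annotator are all assumptions for illustration.

# Active learning: query labels for the least confident unlabeled items
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_labeled, y_labeled, X_pool, oracle, n_queries=5):
    # oracle(x) stands in for the human annotator returning the true label of x
    X_labeled, y_labeled, X_pool = list(X_labeled), list(y_labeled), list(X_pool)
    for _ in range(n_queries):
        if not X_pool:
            break
        model = LogisticRegression().fit(X_labeled, y_labeled)
        # confidence = probability of the most likely class for each unlabeled item
        confidence = model.predict_proba(X_pool).max(axis=1)
        i = int(np.argmin(confidence))           # most informative = least confident
        X_labeled.append(X_pool.pop(i))
        y_labeled.append(oracle(X_labeled[-1]))  # ask for a manual annotation
    return LogisticRegression().fit(X_labeled, y_labeled)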

Page 54

Outline

● Supervised Learning

● Semi-supervised learning

● Unsupervised learning

Page 55

Semi-supervised learning

● Annotating data is a time-consuming and expensive task

● Solution

– Using a minimum amount of annotated data

– Annotating further data automatically

Page 56

Semi-supervised learning

- A small amount of labeled data

Page 57

Semi-supervised learning

- A large amount of unlabeled data

Page 58

Semi-supervised learning

- Similarity between the labeled and unlabeled data

- Predicting the labels of the unlabeled data

Page 59

Semi-supervised learning

- Training the classifier using labeled data and predicted labels of unlabeled data

Page 60

Semi-supervised learning

- This may introduce noisy data into the system

- Add only predicted labels that have high confidence (a self-training sketch follows below)
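A minimal self-training sketch of this procedure, as referenced above; the Naive Bayes model and the 0.9 confidence threshold are assumptions for illustration.

# Self-training: iteratively add confidently predicted labels to the training set
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_rounds=10):
    X_labeled, y_labeled = list(X_labeled), list(y_labeled)
    X_unlabeled = list(X_unlabeled)
    for _ in range(max_rounds):
        if not X_unlabeled:
            break
        model = MultinomialNB().fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_unlabeled)
        confident = [i for i, p in enumerate(proba) if p.max() >= threshold]
        if not confident:
            break                                 # avoid adding noisy, low-confidence labels
        for i in sorted(confident, reverse=True): # pop from the end to keep indices valid
            y_labeled.append(model.classes_[int(np.argmax(proba[i]))])
            X_labeled.append(X_unlabeled.pop(i))
    return MultinomialNB().fit(X_labeled, y_labeled)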

Page 61

Outline

● Supervised Learning

● Semi-supervised learning

● Unsupervised learning

Page 62

Supervised Learning

http://nationalmortgageprofessional.com/news24271/regulatory-compliance-outlook-new-risk-based-pricing-rules

[Figure: labeled data points plotted by age and income, with a new item marked "?"]

Page 63

Unsupervised Learning

http://nationalmortgageprofessional.com/news24271/regulatory-compliance-outlook-new-risk-based-pricing-rules

[Figure: the same data points by age and income, now without labels]

Page 64

Unsupervised Learning

http://nationalmortgageprofessional.com/news24271/regulatory-compliance-outlook-new-risk-based-pricing-rules

[Figure: the age/income data points grouped into clusters]

Page 65

Clustering

● Calculating similarities between the data items

● Grouping similar data items to the same cluster

Page 66

Applications

● Word clustering

– Speech recognition

– Machine translation

– Named entity recognition

– Information retrieval

– ...

● Document clustering

– Text classification

– Information retrieval

– ...

Page 67

Speech recognition

● „Computers can recognize a speech.“

● „Computers can wreck a nice peach.“

[Figure: word clusters, e.g. {named-entity, recognition, hand-writing, speech} and {ball, wreck, ship}]

Page 68

Machine translation

● „The cat eats...“

– „Die Katze frisst...“

– „Die Katze isst...“

[Figure: clusters of German words: Hund, Katze, laufen, fressen, Mann, essen, Junge]

Page 69

Clustering algorithms

● Flat

– K-means

● Hierarchical

– Top-Down (Divisive)

– Bottom-Up (Agglomerative)

● Single-link
● Complete-link
● Average-link

Page 70

K-means

● The best known clustering algorithm (default/baseline), works well for many cases

● The cluster center is the mean or centroid of the items in the cluster

● Minimizing the average squared Euclidean distance of the items from their cluster centers

μ = (1 / |c|) · Σ_{x ∈ c} x

(http://mnemstudio.org/clustering-k-means-introduction.htm)

Page 71

K-means

Initialization: Randomly choose k items as initial centroids

while stopping criterion has not been met do

for each item do

Find the nearest centroid

Assign the item to the cluster associated with the nearest centroid

end for

for each cluster do

Update the centroid of the cluster based on the average of all items in the cluster

end for

end while
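A minimal Python sketch of this loop; random initialization and a fixed number of iterations as the stopping criterion are simplifications, and the toy points are invented.

# K-means: alternate re-assignment and re-computation of centroids
import random

def kmeans(items, k, iterations=100):
    centroids = random.sample(items, k)                    # initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in items:                                    # re-assignment step
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(x, centroids[c])))
            clusters[j].append(x)
        for j, cluster in enumerate(clusters):             # re-computation step
            if cluster:
                centroids[j] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
print(kmeans(points, k=2)[0])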

Page 72

K-means

● Iterating two steps:

– Re-assignment

● Assigning each vector to its closest centroid

– Re-computation

● Computing each centroid as the average of the vectors that were assigned to it in re-assignment

(http://mnemstudio.org/clustering-k-means-introduction.htm)

Page 73

K-means

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 74

Hierarchical Agglomerative Clustering (HAC)

● Creating a hierarchy in the form of a binary tree

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

Page 75

Hierarchical Agglomerative Clustering (HAC)

● Creating a hierarchy in the form of a binary tree

Page 76

Hierarchical Agglomerative Clustering (HAC)

Initial Mapping: Put a single item in each cluster

while the predefined number of clusters has not been reached do

for each pair of clusters do

Measure the similarity of two clusters

end for

Merge the two clusters that are most similar

end while
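A minimal Python sketch of this loop; the linkage argument selects how cluster similarity is measured (min over pairwise distances for single-link, max for complete-link, and e.g. statistics.mean for average-link). The toy points are invented for illustration.

# Agglomerative clustering: repeatedly merge the two most similar clusters
import itertools, math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hac(items, n_clusters, linkage=min):         # min -> single-link, max -> complete-link
    clusters = [[x] for x in items]              # start with one item per cluster
    while len(clusters) > n_clusters:
        # find the closest pair of clusters under the chosen linkage criterion
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(euclidean(a, b)
                                         for a in clusters[p[0]] for b in clusters[p[1]]))
        clusters[i].extend(clusters.pop(j))      # merge the two clusters
    return clusters

points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1)]
print(hac(points, n_clusters=2))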

Page 77

Hierarchical Agglomerative Clustering (HAC)

● Measuring the similarity in three ways:

– Single-link

– Complete-link

– Average-link

Page 78

Hierarchical Agglomerative Clustering (HAC)

● Single-link / single-linkage clustering

– Based on the similarity of the most similar members

Page 79

Hierarchical Agglomerative Clustering (HAC)

● Complete-link / complete-linkage clustering

– Based on the similarity of the most dissimilar members

Page 80

Hierarchical Agglomerative Clustering (HAC)

● Average-link / average-linkage clustering

– Based on the average of all similarities between the members

Page 81

Hierarchical Agglomerative Clustering (HAC)

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

Page 82

Further reading
