
Machine Learning for NLP
Lecture 1: Introduction

UNIVERSITY OF

GOTHENBURG

Richard Johansson

August 29, 2016

overview of the lecture

- some information about the course
- machine learning basics and overview
- overview of the assignments
- introduction to the scikit-learn library

teaching

- 7 lectures
- lab sessions where you work on an assignment or project
- seminars

your work

- 3 mandatory assignments plus one optional
- present a research paper at a seminar
- mini-project
- for VG grade: written exam

request

- if you have a personal interest in some topic, please let me know!

assignment 1: feature design for function tagging

[figure: dependency analysis of "She lives in a house of brick", showing (root) and an nsubj arc]

- the purpose of this assignment is to practice the typical steps in building a machine learning-based system
  - designing features
  - analyzing the performance of the system (and its errors)
  - trying out different learning methods

assignment 2: classifier implementation

- read a paper about a simple algorithm for training the support vector machine classifier
- write code to implement the algorithm, similar to the algorithms in scikit-learn

assignment 3: learning for sequence tagging

- implement a sequential tagging model
- use it to build a named-entity recognizer

    [ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].

assignment 4: using TensorFlow (optional)

- explore Google's new TensorFlow library for neural network training
- try out some classification and structure prediction tasks (e.g. translation)

independent work

- select a topic of interest (or ask me for ideas)
- define a small project
- write code, carry out experiments
- write a short paper, present it at a seminar at the end of the course

written exam (required for VG)

- the questions will test your understanding of central ideas in machine learning
- and let you sketch and discuss (not code) ML-based solutions to some real-world problems in NLP
- ML is a mathy subject, but the questions will not require much math; you might, however, need to understand the idea behind a few formulas


basic ideas

- given some object, make a prediction
  - is this patient diabetic?
  - is the sentiment of this movie review positive?
  - does this image contain a cat?
  - what is the grammatical function of this noun phrase?
  - what will be tomorrow's share value of this stock?
  - what are the part-of-speech tags of the words in this sentence?
- the goal of machine learning is to build the prediction functions by observing data


some types of machine learning problems

- classification: learning to output a category label
  - spam/non-spam; positive/negative; subject/object, ...
- structure prediction: learning to build some structure
  - POS tagging; dependency parsing; translation; ...
- (numerical regression: learning to guess a number)
  - value of a share; number of stars in a review; ...
- (reinforcement learning: learning to act in an environment)
  - dialogue systems; playing games; ...

machine learning in NLP research

- ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers

why machine learning?

why would we want to build the function from data instead of just implementing it?

- usually because we don't really know how to write down the function by hand
  - speech recognition
  - image classification
  - syntactic parsing
  - translation
  - ...
- might not be necessary for limited tasks where we know how to write the function:
  - morphology?
  - sentence splitting and tokenization?
  - identification of limited classes of names, dates and times?
- what is more expensive in your case? knowledge or data?

don't forget your linguistic intuitions!

machine learning automates some tasks, but we still need our brains:

- defining the tasks and terminology
- annotating training and testing data
- having an intuition about which features may be useful can be crucial
  - in general, features are more important than the choice of learning algorithm
- error analysis
- defining constraints to guide the learner
  - valency lexicons can be used in parsers
  - grammar-based parsers with ML-trained disambiguators

learning from data

example: is the patient diabetic?

- in order to predict, we make some measurements of properties we believe will be useful
- these are called the features

attributes/values or bag of words

- we often represent the features as attributes with values
  - in practice, as a Python dict

    features = { "gender": "male",
                 "age": 37,
                 "blood_pressure": 130,
                 .... }

- sometimes, it's easier just to see the features as a list of e.g. words (bag of words)

    features = [ "here", "are", "some", "words",
                 "in", "a", "document" ]

examples of ML in NLP: document classification

- in a previous course, you have implemented a classifier of documents
- many document classifiers use the words of the documents as their features (bag of words)
- ... but we could also add other features such as the presence of smileys or negations
- Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013
  http://tobias.io/semevaltweet/sentiment_analysis_of_microblogs.pdf
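As a rough illustration of the last two points, bag-of-words counts and a couple of extra indicator features can be collected into a single attribute-value dict. This is only a sketch: the whitespace tokenization, the smiley and negation lists, and the feature names `has_smiley` and `has_negation` are assumptions made for the example and are not taken from the thesis cited above.

```python
from collections import Counter

# Hypothetical word lists, chosen only for illustration.
SMILEYS = {":)", ":-)", ":D", ":(", ":-("}
NEGATIONS = {"not", "no", "never", "n't"}

def extract_features(text):
    """Return a feature dict: lowercased word counts plus two indicators."""
    raw_tokens = text.split()                      # naive whitespace tokenization
    tokens = [t.lower() for t in raw_tokens]
    features = dict(Counter(tokens))               # bag of words: token -> count
    # Smileys are matched on the raw tokens, since ":D" is case-sensitive.
    features["has_smiley"] = any(t in SMILEYS for t in raw_tokens)
    features["has_negation"] = any(t in NEGATIONS for t in tokens)
    return features

example = extract_features("This movie was not bad , not bad at all :)")
print(example["not"], example["has_smiley"], example["has_negation"])
```

A dict of this shape is the attribute-value representation introduced earlier, so it could be fed to the same kind of classifier as any other feature dict.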

examples of ML in NLP: difficulty level classification

- what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
  - Flickan sover. ("The girl is sleeping.") → A1
  - Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). ("During the preparation period, a baseline study was carried out to map, among other things, diabetes heredity, oral glucose tolerance, eating and exercise habits, and socioeconomic factors among people aged 35-54 in several municipalities within the Northwestern and Southeastern healthcare districts (NVSO and SÖSO, respectively).") → C2
- Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013
  https://www.academia.edu/6845845/NLP-based_Approaches_to_Sentence_Readability_for_Second_Language_Learning_Purposes

examples of ML in NLP: coreference resolution

- do two given noun phrases refer to the same real-world entity?
- Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Comp. Ling. 2001
  http://aclweb.org/anthology/J/J01/J01-4004.pdf

examples of ML in NLP: named entity recognition

[figure: sentence annotated with entity spans labeled ORG, PER and LOC]

- Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003
  http://aclweb.org/anthology/W/W03/W03-0434.pdf

what goes on when we "learn"?

- the learning algorithm observes the examples in the training set
- it tries to find common patterns that explain the data: it generalizes so that we can make predictions for new examples
- how this is done depends on what algorithm we are using

"knowledge" from experience?

- given some experience, when can we be certain that we can draw any conclusion?

a fundamental tradeoff in machine learning

- goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
- regularization: the classifier should be simple
- this tradeoff is called the bias-variance tradeoff
  - see e.g. Wikipedia
    https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
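One common way to make this tradeoff concrete, given here as an illustration rather than as a formula from the lecture, is a regularized training objective. All symbols are assumptions for the sketch: $w$ denotes the model parameters, $f_w$ the prediction function, $L$ a loss measuring how badly training example $i$ is fitted, $R(w)$ a measure of how complex the model is, and $\lambda \ge 0$ the weight given to simplicity.

```latex
\hat{w} = \arg\min_{w} \;
  \underbrace{\sum_{i=1}^{N} L\bigl(y_i, f_w(x_i)\bigr)}_{\text{goodness of fit}}
  \; + \; \lambda \,
  \underbrace{R(w)}_{\text{regularization}}
```

With $\lambda = 0$ only the fit to the training data counts; a larger $\lambda$ prefers simpler classifiers even at the cost of some training errors.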

example: guessing the gender, based on height and weight

[figure: scatter plot of height (x-axis, 150 to 195) against weight (y-axis, 40 to 120)]

learning algorithms that we have seen so far

- we have already seen a number of learning algorithms in previous courses:
  - Naive Bayes
  - perceptron
  - hidden Markov models
  - decision trees
  - transformation rules (Brill tagger)

representation of the prediction function

we may represent our prediction function in different ways:

- numerical models:
  - weight or probability tables
  - networked models
- rules
  - decision trees
  - transformation rules

example: the prediction function as numbers

    def sentiment_is_positive(features):
        score = 0.0
        score += 2.1 * features["wonderful"]
        score += 0.6 * features["good"]
        ...
        score -= 0.9 * features["bad"]
        score -= 3.1 * features["awful"]
        ...
        if score > 0:
            return True
        else:
            return False

example: the prediction function as rules

    def patient_is_sick(features):
        if features["systolic_blood_pressure"] > 140:
            return True
        if features["gender"] == "m":
            if features["psa"] > 4.0:
                return True
        ...
        return False

perceptron revisited

- the perceptron learning algorithm creates a weight table
- each weight in the table corresponds to a feature
  - e.g. "fine" probably has a high positive weight in sentiment analysis
  - "boring" a negative weight
  - "and" near zero
- classification is carried out by summing the weights for each feature

    def perceptron_classify(features, weights):
        score = 0
        for f in features:
            score += weights.get(f, 0)
        if score >= 0:
            return "pos"
        else:
            return "neg"
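To see the summation at work, the classifier can be tried with a small hand-written weight table. The weights and the two example documents below are made up for illustration; in practice the table would come from the learning algorithm.

```python
def perceptron_classify(features, weights):
    """Sum the weights of the features; unseen features count as 0."""
    score = 0
    for f in features:
        score += weights.get(f, 0)
    return "pos" if score >= 0 else "neg"

# Hypothetical hand-set weights, not learned from data.
weights = {"fine": 2, "boring": -3, "and": 0}

first = perceptron_classify(["a", "fine", "and", "funny", "film"], weights)
second = perceptron_classify(["long", "and", "boring"], weights)
print(first, second)
```

Note that a document with no known features gets the score 0 and is therefore labeled "pos", because the comparison is `score >= 0`.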

the perceptron learning algorithm

- start with an empty weight table
- classify according to the current weight table
- each time we misclassify, change the weight table a bit
  - if a positive instance was misclassified, add 1 to the weight of each feature in the document
  - and conversely, if a negative instance was misclassified, subtract 1

    def perceptron_learn(examples, number_iterations):
        weights = {}
        for iteration in range(number_iterations):
            for label, features in examples:
                guess = perceptron_classify(features, weights)
                if label == "pos" and guess == "neg":
                    for f in features:
                        weights[f] = weights.get(f, 0) + 1
                elif label == "neg" and guess == "pos":
                    for f in features:
                        weights[f] = weights.get(f, 0) - 1
        return weights
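The following self-contained sketch runs the learning procedure on a tiny training set to show how the weight table evolves. The four training documents and the choice of five iterations are assumptions made for the example; the two functions follow the slides.

```python
def perceptron_classify(features, weights):
    score = sum(weights.get(f, 0) for f in features)
    return "pos" if score >= 0 else "neg"

def perceptron_learn(examples, number_iterations):
    weights = {}
    for _ in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:                 # push the score up
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:                 # push the score down
                    weights[f] = weights.get(f, 0) - 1
    return weights

# Hypothetical toy training set: (label, bag of words).
examples = [
    ("pos", ["a", "fine", "film"]),
    ("neg", ["a", "boring", "film"]),
    ("pos", ["fine", "acting"]),
    ("neg", ["boring", "plot"]),
]

weights = perceptron_learn(examples, 5)
print(weights)
print(perceptron_classify(["boring", "acting"], weights))
```

On this data the updates stop after the second pass: "fine" ends up with a positive weight, "boring" with a negative one, and the words shared by both classes return to zero.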

estimation in Naive Bayes, revisited

- Naive Bayes:

  $$P(\text{document}, \text{label}) = P(f_1, \ldots, f_n, \text{label}) = P(\text{label}) \cdot P(f_1, \ldots, f_n \mid \text{label})$$

  $$= P(\text{label}) \cdot P(f_1 \mid \text{label}) \cdot \ldots \cdot P(f_n \mid \text{label})$$

- how do we estimate the probabilities?
  - maximum likelihood: set the probabilities so that the probability of the data is maximized

estimation in Naive Bayes: supervised case

- how do we estimate P(positive)?

  $$P_{\text{MLE}}(\text{positive}) = \frac{\text{count}(\text{positive})}{\text{count}(\text{all})} = \frac{2}{4}$$

- how do we estimate P("nice" | positive)?

  $$P_{\text{MLE}}(\text{"nice"} \mid \text{positive}) = \frac{\text{count}(\text{"nice"}, \text{positive})}{\text{count}(\text{any word}, \text{positive})} = \frac{2}{7}$$
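The two estimates above are plain relative frequencies, which a few lines of code can make explicit. The corpus below is hypothetical: it was constructed so that the counts reproduce the fractions 2/4 and 2/7, and it is not the data set shown in the lecture. The function names are likewise chosen for the example.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus: 4 documents, 2 of them positive; the positive
# documents contain 7 words in total, 2 of which are "nice".
corpus = [
    ("positive", ["nice", "and", "funny"]),
    ("positive", ["a", "nice", "film", "indeed"]),
    ("negative", ["boring", "plot"]),
    ("negative", ["not", "nice"]),
]

label_counts = Counter(label for label, _ in corpus)
word_counts = {label: Counter() for label in label_counts}
for label, words in corpus:
    word_counts[label].update(words)

def p_label(label):
    """MLE of P(label): documents with this label / all documents."""
    return Fraction(label_counts[label], len(corpus))

def p_word_given_label(word, label):
    """MLE of P(word | label): count of word / count of all words in label."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word], total)

print(p_label("positive"))                      # 1/2, i.e. 2/4 reduced
print(p_word_given_label("nice", "positive"))   # 2/7
```

A word never seen with a label gets probability 0 under this estimate, which is why smoothing is normally added in practice.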

machine learning software

- general-purpose software, large collections of algorithms:
  - scikit-learn: http://scikit-learn.org
    - Python library; will be used in this course
  - Weka: http://www.cs.waikato.ac.nz/ml/weka
    - Java library with nice user interface
  - NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
- special-purpose software, small collections of algorithms:
  - LibSVM/LibLinear for support vector machines
  - CRF++, CRFSGD for conditional random fields
  - TensorFlow, Theano, Caffe, Keras for neural networks
  - ...

scikit-learn toy example: a simple training set

    # training set: the features
    X = [{'city':'Gothenburg', 'month':'July'},
         {'city':'Gothenburg', 'month':'December'},
         {'city':'Paris', 'month':'July'},
         {'city':'Paris', 'month':'December'}]

    # training set: the gold-standard outputs
    Y = ['rain', 'rain', 'sun', 'rain']

scikit-learn toy example: training a classifier

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    import pickle

    classifier = Pipeline([('v', DictVectorizer()),
                           ('c', LinearSVC())])

    # train the classifier
    classifier.fit(X, Y)

    # optionally: save the classifier to a file...
    with open('weather.classifier', 'wb') as f:
        pickle.dump(classifier, f)

explanation of the code: DictVectorizer

- internally, the features used by scikit-learn's classifiers are numbers, not strings
- a Vectorizer converts the strings into numbers; more about this in the next lecture!
- rule of thumb:
  - use a DictVectorizer for attribute-value features
  - use a CountVectorizer or TfidfVectorizer for bag-of-words features

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
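To give an idea of what "converting strings into numbers" can mean, here is a conceptual sketch of one-hot encoding for string-valued attribute-value features. It is plain Python and not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform` are invented for the example, and the real DictVectorizer differs in details such as its handling of numeric values and its use of sparse matrices.

```python
def fit_vocabulary(dicts):
    """Assign a column index to every 'attribute=value' pair seen in the data."""
    vocab = {}
    for d in dicts:
        for attr, value in d.items():
            name = f"{attr}={value}"
            if name not in vocab:
                vocab[name] = len(vocab)
    return vocab

def transform(dicts, vocab):
    """Turn each feature dict into a list of 0/1 values, one per column."""
    rows = []
    for d in dicts:
        row = [0] * len(vocab)
        for attr, value in d.items():
            col = vocab.get(f"{attr}={value}")
            if col is not None:          # pairs unseen during fitting are ignored
                row[col] = 1
        rows.append(row)
    return rows

X = [{'city': 'Gothenburg', 'month': 'July'},
     {'city': 'Paris', 'month': 'December'}]
vocab = fit_vocabulary(X)
print(vocab)
print(transform(X, vocab))
```

Each distinct attribute-value pair becomes its own column, so a classifier that only understands numbers can still make use of categorical features such as the city name.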

explanation of the code: LinearSVC

- LinearSVC is the actual classifier we're using
  - this is called a linear support vector machine
  - more about this in lecture 3
- use Naive Bayes instead:

    from sklearn.naive_bayes import MultinomialNB
    ...
    classifier = Pipeline([('v', DictVectorizer()),
                           ('c', MultinomialNB())])

- perceptron:

    from sklearn.linear_model import Perceptron
    ...
    classifier = Pipeline([('v', DictVectorizer()),
                           ('c', Perceptron())])

http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

explanation of the code: Pipeline and fit

- in scikit-learn, preprocessing steps and classifiers are often combined into a Pipeline
  - in our case, a DictVectorizer and a LinearSVC
- the whole Pipeline is trained by calling the method fit
  - which will in turn call fit on all the parts of the Pipeline

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

toy example: making new predictions and evaluating

    from sklearn.metrics import accuracy_score

    Xtest = [{'city':'Gothenburg', 'month':'June'},
             {'city':'Gothenburg', 'month':'November'},
             {'city':'Paris', 'month':'June'},
             {'city':'Paris', 'month':'November'}]

    Ytest = ['rain', 'rain', 'sun', 'rain']

    # classify all the test instances
    guesses = classifier.predict(Xtest)

    # compute the classification accuracy
    print(accuracy_score(Ytest, guesses))

a note on efficiency

- Python is a nice language for programmers but not always the most efficient
- in scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries
  - so in many cases, it is much faster to call the library once than many times:

    import time

    t0 = time.time()
    # one call for the whole test set
    guesses1 = classifier.predict(Xtest)
    t1 = time.time()
    guesses2 = []
    for x in Xtest:
        # one call per instance; predict expects a list of instances
        guess = classifier.predict([x])[0]
        guesses2.append(guess)
    t2 = time.time()
    print(t1-t0)
    print(t2-t1)

- result: 0.29 sec and 45 sec

some other practical functions

- splitting the data:

    # in current scikit-learn, train_test_split is in sklearn.model_selection
    # (the older sklearn.cross_validation module has been removed)
    from sklearn.model_selection import train_test_split

    train_files, dev_files = train_test_split(td_files,
                                              train_size=0.8,
                                              random_state=0)

- evaluation, e.g. accuracy, precision, recall, F-score:

    from sklearn.metrics import f1_score

    # with string labels, also pass pos_label=... (binary) or average=... (multiclass)
    print(f1_score(Y_eval, Y_out))

- note that we're using our own evaluation in the first assignment, since we need more details

aside: evaluation methodology

- how do we evaluate our systems?
  - intrinsic evaluation: test the performance in isolation
  - extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser?
  - how much more money do I make?
- common measures in intrinsic evaluation
  - classification accuracy
  - precision and recall (for needle in haystack)
- also several other task-dependent measures.
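The intrinsic measures mentioned above can be written out by hand in a few lines, which may help when a custom evaluation is needed. This is a sketch under the usual definitions of accuracy, precision, recall and F1 for one designated "needle" label; the function name `evaluate` and the gold and predicted label lists are made up for the example.

```python
def evaluate(gold, predicted, positive_label):
    """Return (accuracy, precision, recall, F1) for one label of interest."""
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    true_pos = sum(g == positive_label and p == positive_label for g, p in pairs)
    n_predicted = sum(p == positive_label for p in predicted)
    n_gold = sum(g == positive_label for g in gold)

    accuracy = correct / len(pairs)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Hypothetical gold-standard labels and system output.
gold = ["rain", "rain", "sun", "rain", "sun", "rain"]
pred = ["rain", "sun", "sun", "rain", "rain", "rain"]

accuracy, precision, recall, f1 = evaluate(gold, pred, positive_label="sun")
print(accuracy, precision, recall, f1)
```

Here the system is right on 4 of 6 instances, but for the rarer label "sun" only half of its guesses are correct and only half of the true cases are found, which is the kind of detail accuracy alone hides.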

extended example 1: named entity classification

- we are given a name (a single word) in a sentence
- determine if it is a person, location, or an organization

    My aunt Gözde lives in Ashgabat.

- the information our classifier can use:
  - the words in the sentence
  - the part-of-speech tags
  - the position of the name that we are classifying
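One possible way to turn these three pieces of information into an attribute-value dict is sketched below. The feature names, the choice of a one-word context window, the `<START>`/`<END>` padding symbols and the part-of-speech tags assigned to the example sentence are all assumptions for illustration, not the feature set used in the course.

```python
def name_features(words, tags, position):
    """Build a feature dict for the name at index `position`."""
    features = {"word": words[position]}

    # Left context, or a padding symbol at the start of the sentence.
    if position > 0:
        features["prev_word"] = words[position - 1].lower()
        features["prev_tag"] = tags[position - 1]
    else:
        features["prev_word"] = "<START>"
        features["prev_tag"] = "<START>"

    # Right context, or a padding symbol at the end of the sentence.
    if position < len(words) - 1:
        features["next_word"] = words[position + 1].lower()
        features["next_tag"] = tags[position + 1]
    else:
        features["next_word"] = "<END>"
        features["next_tag"] = "<END>"
    return features

words = ["My", "aunt", "Gözde", "lives", "in", "Ashgabat", "."]
tags = ["PRP$", "NN", "NNP", "VBZ", "IN", "NNP", "."]   # assumed tags

person = name_features(words, tags, 2)
place = name_features(words, tags, 5)
print(person)
print(place)
```

The context words already carry a useful signal: "aunt" before the name suggests a person, while the preposition "in" suggests a location.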

extended example 2: document classification

- we are given a document
- determine the category of the document (select from a small set of predefined categories)
- we reuse the review dataset that we had in the previous course
  - this dataset has polarity and topic labels for each document
