Machine Learning for NLP
Lecture 1: Introduction
UNIVERSITY OF
GOTHENBURG
Richard Johansson
August 29, 2016
overview of the lecture
- some information about the course
- machine learning basics and overview
- overview of the assignments
- introduction to the scikit-learn library
overview
information about the course
machine learning basics
introduction to the scikit-learn library
teaching
- 7 lectures
- lab sessions where you work on an assignment or on your project
- seminars
your work
- 3 mandatory assignments plus one optional
- present a research paper at a seminar
- mini-project
- for VG grade: written exam
request
- if you have a personal interest in some topic, please let me know!
assignment 1: feature design for function tagging
[figure: dependency tree for "She lives in a house of brick", showing the function labels (root) and nsubj]
- the purpose of this assignment is to practice the typical steps in building a machine learning-based system
  - designing features
  - analyzing the performance of the system (and its errors)
  - trying out different learning methods
assignment 2: classifier implementation
- read a paper about a simple algorithm for training the support vector machine classifier
- write code to implement the algorithm, similar to the algorithms in scikit-learn
assignment 3: learning for sequence tagging
- implement a sequential tagging model
- use it to build a named-entity recognizer

    [ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
assignment 4: using TensorFlow (optional)
- explore Google's new TensorFlow library for neural network training
- try out some classification and structure prediction tasks (e.g. translation)
independent work
- select a topic of interest (or ask me for ideas)
- define a small project
- write code, carry out experiments
- write a short paper, present it at a seminar at the end of the course
written exam (required for VG)
- the questions will test your understanding of central ideas in machine learning
- they will also let you sketch and discuss (not code) ML-based solutions to some real-world problems in NLP
- ML is a mathy subject, but the questions will not require much math; you might, however, need to understand the idea behind a few formulas
basic ideas
- given some object, make a prediction
  - is this patient diabetic?
  - is the sentiment of this movie review positive?
  - does this image contain a cat?
  - what is the grammatical function of this noun phrase?
  - what will be tomorrow's share value of this stock?
  - what are the part-of-speech tags of the words in this sentence?
- the goal of machine learning is to build the prediction functions by observing data
some types of machine learning problems
- classification: learning to output a category label
  - spam/non-spam; positive/negative; subject/object; ...
- structure prediction: learning to build some structure
  - POS tagging; dependency parsing; translation; ...
- (numerical regression: learning to guess a number)
  - value of a share; number of stars in a review; ...
- (reinforcement learning: learning to act in an environment)
  - dialogue systems; playing games; ...
machine learning in NLP research
- ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers
why machine learning?
why would we want to build the function from data instead of just
implementing it?
- usually because we don't really know how to write down the function by hand
  - speech recognition
  - image classification
  - syntactic parsing
  - translation
  - ...
- machine learning might not be necessary for limited tasks where we do know how:
  - morphology?
  - sentence splitting and tokenization?
  - identification of limited classes of names, dates and times?
- what is more expensive in your case: knowledge or data?
don't forget your linguistic intuitions!
machine learning automates some tasks, but we still need our brains:
- defining the tasks and terminology
- annotating training and testing data
- having an intuition about which features may be useful can be crucial
- in general, features are more important than the choice of learning algorithm
- error analysis
- defining constraints to guide the learner
  - valency lexicons can be used in parsers
  - grammar-based parsers with ML-trained disambiguators
learning from data
example: is the patient diabetic?
- in order to predict, we make some measurements of properties we believe will be useful
- these are called the features
attributes/values or bag of words
- we often represent the features as attributes with values
  - in practice, as a Python dict

    features = { "gender": "male",
                 "age": 37,
                 "blood_pressure": 130,
                 ... }

- sometimes, it's easier just to see the features as a list of e.g. words (bag of words)

    features = [ "here", "are", "some", "words",
                 "in", "a", "document" ]
examples of ML in NLP: document classification
- in a previous course, you have implemented a classifier of documents
- many document classifiers use the words of the documents as their features (bag of words)
- ... but we could also add other features, such as the presence of smileys or negations
- Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013
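To make the idea concrete, here is a minimal sketch of a feature extractor that combines bag-of-words counts with two hand-designed indicator features. The smiley and negation lists, the feature names `HAS_SMILEY` and `HAS_NEGATION`, and the function name are assumptions made up for illustration; they are not taken from the thesis cited above.

```python
# illustrative word lists only; a real system would need larger ones
SMILEYS = {":)", ":-)", ":(", ":-(", ";)"}
NEGATIONS = {"not", "no", "never", "n't"}

def extract_features(tokens):
    """Return bag-of-words counts plus two indicator features."""
    features = {}
    # bag of words: count each lowercased token
    for token in tokens:
        word = token.lower()
        features[word] = features.get(word, 0) + 1
    # hand-designed features: 1 if present, 0 otherwise
    features["HAS_SMILEY"] = int(any(t in SMILEYS for t in tokens))
    features["HAS_NEGATION"] = int(any(t.lower() in NEGATIONS for t in tokens))
    return features

print(extract_features(["Not", "a", "bad", "movie", ":)"]))
```

The result is an attribute-value dict, so it can be used wherever a dict of features is expected.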
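examples of ML in NLP: difficulty level classification
- what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
  - Flickan sover. ("The girl is sleeping.") → A1
  - Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). ("During the preparation period, a baseline study was carried out to map, among other things, diabetes heredity, oral glucose tolerance, diet and exercise habits, and socioeconomic factors in the 35-54 age range in several municipalities within the Northwestern and Southeastern health care districts (NVSO and SÖSO, respectively).") → C2
- Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013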
examples of ML in NLP: coreference resolution
- do two given noun phrases refer to the same real-world entity?
- Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Comp. Ling. 2001
examples of ML in NLP: named entity recognition
    [ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
- Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003
what goes on when we "learn"?
- the learning algorithm observes the examples in the training set
- it tries to find common patterns that explain the data: it generalizes so that we can make predictions for new examples
- how this is done depends on what algorithm we are using
"knowledge" from experience?
- given some experience, when can we be certain that we can draw any conclusion?
a fundamental tradeoff in machine learning
- goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
- regularization: the classifier should be simple
- this tradeoff is called the bias-variance tradeoff
  - see e.g. Wikipedia
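One common way to write this tradeoff down is as a training objective with two terms. This is given as a general sketch, not as the formulation used later in the course, and all symbols are assumptions introduced here for illustration: w for the model parameters, f_w for the prediction function, L for a loss function measuring errors on the N training examples (x_i, y_i), R for a penalty on complex models, and λ for the weight of that penalty.

```latex
\hat{w} = \arg\min_{w} \;
  \underbrace{\sum_{i=1}^{N} L\bigl(y_i, f_w(x_i)\bigr)}_{\text{goodness of fit}}
  \; + \; \lambda \, \underbrace{R(w)}_{\text{simplicity}}
```

A large λ favors simple classifiers, while a small λ favors fitting the training data closely.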
example: guessing the gender, based on height and weight
[figure: scatter plot of the examples; horizontal axis ticks from 150 to 195, vertical axis ticks from 40 to 120]
learning algorithms that we have seen so far
- we have already seen a number of learning algorithms in previous courses:
  - Naive Bayes
  - perceptron
  - hidden Markov models
  - decision trees
  - transformation rules (Brill tagger)
representation of the prediction function
we may represent our prediction function in different ways:
- numerical models:
  - weight or probability tables
  - networked models
- rules
  - decision trees
  - transformation rules
example: the prediction function as numbers
def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False
example: the prediction function as rules
def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False
perceptron revisited
- the perceptron learning algorithm creates a weight table
- each weight in the table corresponds to a feature
  - e.g. "fine" probably has a high positive weight in sentiment analysis
  - "boring" a negative weight
  - "and" near zero
- classification is carried out by summing the weights for each feature

def perceptron_classify(features, weights):
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"
the perceptron learning algorithm
- start with an empty weight table
- classify according to the current weight table
- each time we misclassify, change the weight table a bit
  - if a positive instance was misclassified, add 1 to the weight of each feature in the document
  - and conversely, if a negative instance was misclassified, subtract 1

def perceptron_learn(examples, number_iterations):
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
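As a usage sketch, the two perceptron functions can be run end to end on a tiny training set. The functions are restated compactly so that the block runs on its own; the four training examples are invented for illustration and are not course data.

```python
def perceptron_classify(features, weights):
    # sum the weights of the features that are present
    score = sum(weights.get(f, 0) for f in features)
    return "pos" if score >= 0 else "neg"

def perceptron_learn(examples, number_iterations):
    weights = {}
    for _ in range(number_iterations):
        for label, features in examples:
            if perceptron_classify(features, weights) == label:
                continue
            # on a mistake, move the weights towards the correct label
            step = 1 if label == "pos" else -1
            for f in features:
                weights[f] = weights.get(f, 0) + step
    return weights

# hypothetical toy training set of (label, bag of words) pairs
examples = [
    ("pos", ["good", "fine", "movie"]),
    ("neg", ["boring", "movie"]),
    ("pos", ["fine", "acting"]),
    ("neg", ["bad", "boring", "acting"]),
]

weights = perceptron_learn(examples, number_iterations=5)
print(weights)
print(perceptron_classify(["fine", "movie"], weights))
```

On this toy set the weight table stops changing after a few passes, because every training example is then classified correctly and no further updates are triggered.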
estimation in Naive Bayes, revisited
- Naive Bayes:

$$P(\text{document}, \text{label}) = P(f_1, \ldots, f_n, \text{label}) = P(\text{label}) \cdot P(f_1, \ldots, f_n \mid \text{label}) = P(\text{label}) \cdot P(f_1 \mid \text{label}) \cdot \ldots \cdot P(f_n \mid \text{label})$$

- how do we estimate the probabilities?
  - maximum likelihood: set the probabilities so that the probability of the data is maximized
estimation in Naive Bayes: supervised case
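- how do we estimate P(positive)?

$$P_{\text{MLE}}(\text{positive}) = \frac{\text{count}(\text{positive})}{\text{count}(\text{all})} = \frac{2}{4}$$

- how do we estimate P("nice" | positive)?

$$P_{\text{MLE}}(\text{"nice"} \mid \text{positive}) = \frac{\text{count}(\text{"nice"}, \text{positive})}{\text{count}(\text{any word}, \text{positive})} = \frac{2}{7}$$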
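The same counting can be sketched in a few lines of Python. The training set below is hypothetical: it was made up so that its counts match the two estimates above (2 of 4 documents are positive, and 2 of the 7 word tokens in positive documents are "nice"). The function names are also illustrative.

```python
from collections import Counter
from fractions import Fraction

# hypothetical training set of (label, words) pairs
training = [
    ("positive", ["nice", "movie", "very", "nice"]),
    ("positive", ["great", "acting", "indeed"]),
    ("negative", ["boring", "movie"]),
    ("negative", ["bad", "acting"]),
]

# count documents per label and word tokens per label
label_counts = Counter(label for label, _ in training)
word_counts = {label: Counter() for label in label_counts}
for label, words in training:
    word_counts[label].update(words)

def p_label(label):
    # count(label) / count(all documents)
    return Fraction(label_counts[label], sum(label_counts.values()))

def p_word_given_label(word, label):
    # count(word, label) / count(any word, label)
    return Fraction(word_counts[label][word], sum(word_counts[label].values()))

print(p_label("positive"))
print(p_word_given_label("nice", "positive"))
```

`Fraction` is used only to keep the ratios exact; note that it reduces 2/4 to 1/2 when printing.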
machine learning software
- general-purpose software, large collections of algorithms:
  - scikit-learn: http://scikit-learn.org
    - Python library; will be used in this course
  - Weka: http://www.cs.waikato.ac.nz/ml/weka
    - Java library with nice user interface
  - NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
- special-purpose software, small collections of algorithms:
  - LibSVM/LibLinear for support vector machines
  - CRF++, CRFSGD for conditional random fields
  - TensorFlow, Theano, Caffe, Keras for neural networks
  - ...
scikit-learn toy example: a simple training set
# training set: the features
X = [{'city':'Gothenburg', 'month':'July'},
     {'city':'Gothenburg', 'month':'December'},
     {'city':'Paris', 'month':'July'},
     {'city':'Paris', 'month':'December'}]

# training set: the gold-standard outputs
Y = ['rain', 'rain', 'sun', 'rain']
scikit-learn toy example: training a classifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import pickle

classifier = Pipeline([('v', DictVectorizer()),
                       ('c', LinearSVC())])

# train the classifier
classifier.fit(X, Y)

# optionally: save the classifier to a file...
with open('weather.classifier', 'wb') as f:
    pickle.dump(classifier, f)
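A saved classifier can later be read back with `pickle.load`. The sketch below repeats the toy training step so that it runs on its own, writes to a temporary directory rather than the working directory, and fixes `random_state` so that repeated runs behave the same; those three choices are assumptions of this example, not part of the code above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X = [{'city': 'Gothenburg', 'month': 'July'},
     {'city': 'Gothenburg', 'month': 'December'},
     {'city': 'Paris', 'month': 'July'},
     {'city': 'Paris', 'month': 'December'}]
Y = ['rain', 'rain', 'sun', 'rain']

classifier = Pipeline([('v', DictVectorizer()),
                       ('c', LinearSVC(random_state=0))])
classifier.fit(X, Y)

# save the trained pipeline (temporary location chosen for this sketch)
path = os.path.join(tempfile.mkdtemp(), 'weather.classifier')
with open(path, 'wb') as f:
    pickle.dump(classifier, f)

# later, possibly in another program: load it and use it directly
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict([{'city': 'Paris', 'month': 'July'}]))
```

Because the whole Pipeline is pickled, the loaded object includes the vectorizer, so it accepts feature dicts just like the original. Only unpickle files that you trust.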
explanation of the code: DictVectorizer
- internally, the features used by scikit-learn's classifiers are numbers, not strings
- a Vectorizer converts the strings into numbers; more about this in the next lecture!
- rule of thumb:
  - use a DictVectorizer for attribute-value features
  - use a CountVectorizer or TfidfVectorizer for bag-of-words features
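To see what the conversion looks like, the sketch below applies a DictVectorizer to two toy instances and prints the result. Passing `sparse=False` is a choice made here only to make the output easy to read; the default is a sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer

X = [{'city': 'Gothenburg', 'month': 'July'},
     {'city': 'Paris', 'month': 'December'}]

# dense output, easier to print than the default sparse matrix
vectorizer = DictVectorizer(sparse=False)
X_numeric = vectorizer.fit_transform(X)

# one column per attribute=value combination seen during fitting
print(vectorizer.feature_names_)
# one row per instance, with 1.0 where the combination is present
print(X_numeric)
```

Each string-valued attribute is expanded into one column per observed value, while numeric attribute values are kept as they are.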
explanation of the code: LinearSVC
- LinearSVC is the actual classifier we're using
  - this is called a linear support vector machine
  - more about this in lecture 3
- to use Naive Bayes instead:

from sklearn.naive_bayes import MultinomialNB
...
classifier = Pipeline([('v', DictVectorizer()),
                       ('c', MultinomialNB())])

- to use a perceptron:

from sklearn.linear_model import Perceptron
...
classifier = Pipeline([('v', DictVectorizer()),
                       ('c', Perceptron())])
explanation of the code: Pipeline and fit
- in scikit-learn, preprocessing steps and classifiers are often combined into a Pipeline
  - in our case, a DictVectorizer and a LinearSVC
- the whole Pipeline is trained by calling the method fit
  - which will in turn call fit on all the parts of the Pipeline
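The following sketch spells out roughly what the Pipeline does by performing the same steps by hand and comparing the predictions. `random_state=0` is set on both classifiers only so that the two training runs are comparable; the variable names are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X = [{'city': 'Gothenburg', 'month': 'July'},
     {'city': 'Gothenburg', 'month': 'December'},
     {'city': 'Paris', 'month': 'July'},
     {'city': 'Paris', 'month': 'December'}]
Y = ['rain', 'rain', 'sun', 'rain']

# with a Pipeline: a single call trains every step
pipeline = Pipeline([('v', DictVectorizer()),
                     ('c', LinearSVC(random_state=0))])
pipeline.fit(X, Y)

# by hand: fit the vectorizer, convert the data, then fit the classifier
vectorizer = DictVectorizer()
classifier = LinearSVC(random_state=0)
X_numeric = vectorizer.fit_transform(X)
classifier.fit(X_numeric, Y)

# prediction works the same way: transform first, then predict
pipeline_guesses = list(pipeline.predict(X))
manual_guesses = list(classifier.predict(vectorizer.transform(X)))
print(pipeline_guesses == manual_guesses)
```

The Pipeline version is shorter and makes it harder to forget the transform step at prediction time.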
toy example: making new predictions and evaluating
from sklearn.metrics import accuracy_score

Xtest = [{'city':'Gothenburg', 'month':'June'},
         {'city':'Gothenburg', 'month':'November'},
         {'city':'Paris', 'month':'June'},
         {'city':'Paris', 'month':'November'}]
Ytest = ['rain', 'rain', 'sun', 'rain']

# classify all the test instances
guesses = classifier.predict(Xtest)

# compute the classification accuracy
print(accuracy_score(Ytest, guesses))
a note on efficiency
- Python is a nice language for programmers but not always the most efficient
- in scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries
  - so in many cases, it is much faster to call the library once than many times:

import time

t0 = time.time()
# one call for the whole test set
guesses1 = classifier.predict(Xtest)
t1 = time.time()

# one call per instance; predict expects a list of instances
guesses2 = []
for x in Xtest:
    guess = classifier.predict([x])[0]
    guesses2.append(guess)
t2 = time.time()

print(t1-t0)
print(t2-t1)

- result: 0.29 sec and 45 sec
some other practical functions
- splitting the data:

# in current scikit-learn versions, train_test_split lives in
# sklearn.model_selection (the older sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

train_files, dev_files = train_test_split(td_files,
                                          train_size=0.8,
                                          random_state=0)

- evaluation, e.g. accuracy, precision, recall, F-score:

from sklearn.metrics import f1_score
print(f1_score(Y_eval, Y_out))

- note that we're using our own evaluation in the first assignment, since we need more details
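Here is a self-contained sketch of both functions on made-up data. The file names and label lists are hypothetical, and the import path assumes a scikit-learn version that provides `sklearn.model_selection`. With string labels, `f1_score` needs to be told which label counts as the positive class, which is done here with `pos_label`.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# hypothetical list of ten training files
files = ['doc%d.txt' % i for i in range(10)]

# 80% for training, the rest for development; random_state makes the
# shuffle reproducible
train_files, dev_files = train_test_split(files,
                                          train_size=0.8,
                                          random_state=0)
print(len(train_files), len(dev_files))

# hypothetical gold-standard labels and system outputs
Y_eval = ['rain', 'rain', 'sun', 'rain']
Y_out = ['rain', 'sun', 'sun', 'rain']

# F-score for the 'rain' class
print(f1_score(Y_eval, Y_out, pos_label='rain'))
```

For tasks with more than two labels, `f1_score` also takes an `average` argument that controls how per-class scores are combined.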
aside: evaluation methodology
- how do we evaluate our systems?
  - intrinsic evaluation: test the performance in isolation
  - extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser?
    - how much more money do I make?
- common measures in intrinsic evaluation
  - classification accuracy
  - precision and recall (for needle in haystack)
- also several other task-dependent measures
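The intrinsic measures listed above are simple enough to compute by hand. The sketch below uses plain Python; the function names and the toy label sequences are made up for illustration, with "PER" playing the role of the rare needle-in-a-haystack class.

```python
def accuracy(gold, guesses):
    # proportion of instances where the guess equals the gold label
    correct = sum(1 for g, p in zip(gold, guesses) if g == p)
    return correct / len(gold)

def precision_recall(gold, guesses, target):
    # true positives: guessed target and the gold label is target
    true_pos = sum(1 for g, p in zip(gold, guesses)
                   if g == target and p == target)
    guessed = sum(1 for p in guesses if p == target)
    actual = sum(1 for g in gold if g == target)
    # precision: how many of our target guesses were right
    precision = true_pos / guessed if guessed else 0.0
    # recall: how many of the real targets we found
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# hypothetical gold labels and system output
gold = ['O', 'O', 'PER', 'O', 'PER', 'O']
guesses = ['O', 'PER', 'PER', 'O', 'O', 'O']

print(accuracy(gold, guesses))
print(precision_recall(gold, guesses, 'PER'))
```

Accuracy can look good simply because the common class is easy to guess, which is why precision and recall for the rare class are often reported as well.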
extended example 1: named entity classification
- we are given a name (a single word) in a sentence
- determine if it is a person, location, or an organization

    My aunt Gözde lives in Ashgabat.

- the information our classifier can use:
  - the words in the sentence
  - the part-of-speech tags
  - the position of the name that we are classifying
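A feature extractor for this task could look roughly like the sketch below. The particular features, the feature names, and the part-of-speech tags given for the example sentence are assumptions chosen for illustration; designing better ones is part of the exercise.

```python
def name_features(words, pos_tags, position):
    """Attribute-value features for the name at the given position."""
    name = words[position]
    features = {
        "word": name.lower(),
        "is_capitalized": name[0].isupper(),
        "position": position,
        "pos_tag": pos_tags[position],
    }
    # context to the left, if any
    if position > 0:
        features["prev_word"] = words[position - 1].lower()
        features["prev_pos"] = pos_tags[position - 1]
    # context to the right, if any
    if position + 1 < len(words):
        features["next_word"] = words[position + 1].lower()
        features["next_pos"] = pos_tags[position + 1]
    return features

words = ["My", "aunt", "Gözde", "lives", "in", "Ashgabat", "."]
# hypothetical tags, written by hand for this example
pos_tags = ["PRP$", "NN", "NNP", "VBZ", "IN", "NNP", "."]

print(name_features(words, pos_tags, 2))
print(name_features(words, pos_tags, 5))
```

The surrounding words carry most of the signal here: "aunt" before the name suggests a person, while "in" before the name suggests a location.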
extended example 2: document classification
- we are given a document
- determine the category of the document (select from a small set of predefined categories)
- we reuse the review dataset that we had in the previous course
  - this dataset has polarity and topic labels for each document
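As a rough sketch of how such a document classifier could be set up in scikit-learn, the block below combines a CountVectorizer (bag-of-words features) with Naive Bayes. The four one-line "reviews" are invented for illustration and stand in for the course dataset, which is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical toy documents with polarity labels
docs = ["a wonderful and moving film",
        "good acting and a good story",
        "an awful and boring film",
        "bad acting and a bad story"]
labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer splits each raw string into words and counts them
classifier = Pipeline([('v', CountVectorizer()),
                       ('c', MultinomialNB())])
classifier.fit(docs, labels)

guesses = classifier.predict(["a good film", "a bad film"])
print(guesses)
```

To predict topic instead of polarity, only the label list would need to change; the pipeline stays the same.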