NLP Demystified
Noah Smith, Language Technologies Institute, Machine Learning Department, School of Computer Science, Carnegie Mellon University
NLP?
Outline
1. Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words
Categorizing Documents: Examples
• Mosteller and Wallace (1964): authorship of the Federalist papers
• News categories: U.S., world, sports, religion, business, technology, entertainment, ...
• How positive or negative is a review of a film or restaurant?
• Is a given email message spam?
• What is the reading level of a piece of text?
• How influential will a research paper be?
• Will a congressional bill pass committee?
The Vision
• Human experts label some data.
• Feed the data to a learning algorithm that constructs an automatic labeling function.
• Apply that function to as much data as you want!
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
1. Obtain Categorized Documents
Spinoza, 17th-century rationalist
2. Define the Feature Vector Function
• Simplest choice: one dimension per word, and let [f(d)]_j be the count of word w_j in d.
• Twists:
  – Monotonic transforms, like dividing by the length of d or taking a log.
  – Increase the weights of words that occur in fewer documents ("inverse document frequency").
  – n-grams
  – Count specially defined groupings of words
  – Statistical tests to select words likely to be informative
(A sketch of such a feature function follows below.)
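A minimal sketch of this step in Python, assuming whitespace tokenization. The log transform and inverse-document-frequency weighting correspond to the optional twists above; the function names (build_vocab, doc_frequencies, f) are illustrative, not from the slides.

```python
import math
from collections import Counter

def build_vocab(documents):
    """One feature dimension per word type."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def doc_frequencies(documents, vocab):
    """How many documents each word appears in (for inverse document frequency)."""
    df = Counter()
    for doc in documents:
        for word in set(doc.lower().split()):
            if word in vocab:
                df[word] += 1
    return df

def f(doc, vocab, df, n_docs, use_log=True, use_idf=True):
    """[f(d)]_j is the (optionally transformed and reweighted) count of word w_j in d."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word not in vocab:
            continue
        value = math.log(1 + count) if use_log else float(count)
        if use_idf:
            value *= math.log(n_docs / (1 + df[word]))  # downweight words in many documents
        vec[vocab[word]] = value
    return vec
```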
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
3. Define a Function from Feature Vectors to Categories
• Simplest choice: a linear model (see the sketch below):
  $h_{\mathbf{w}}(d) = \arg\max_c \; \mathbf{w}_c^\top f(d) + w_c^{\text{bias}}$
  where $\mathbf{w}_c$ is the vector of coefficients associating each feature with class c (coefficients can be positive or negative).
  – Advantage: interpretability
  – Advantage: computational efficiency
• Some alternatives: k-nearest neighbors, decision trees, neural networks, ...
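A small sketch of the linear decision rule above in plain Python; representing the weights as a dictionary of per-class coefficient lists is an assumption made here for illustration.

```python
def predict(feature_vec, weights, biases):
    """h_w(d) = argmax_c  w_c . f(d) + bias_c.

    weights: class -> list of coefficients (same length as feature_vec)
    biases:  class -> bias term for that class
    """
    def score(c):
        return sum(w_j * x_j for w_j, x_j in zip(weights[c], feature_vec)) + biases[c]
    return max(weights, key=score)
```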
4. Select Parameters using Data
• Also known as "machine learning."
• Many learning options for linear classifiers!
[Diagram: linear-classifier learners arranged by whether they are discriminative and whether they have a probabilistic interpretation: logistic regression (LR), support vector machine (SVM), perceptron, naïve Bayes (NB).]
4. Select Parameters using Data
Optimization view of learning:
Typical loss functions for linear models are convex and can be efficiently optimized using online or batch iterative algorithms with convergence guarantees.
$\mathbf{w} = \arg\min_{\mathbf{w}} \; R(\mathbf{w}) + \frac{1}{|D_{\text{train}}|} \sum_{d \in D_{\text{train}}} L(d; \mathbf{w})$
where $R(\mathbf{w})$ is "regularization" to avoid overfitting and the second term is the "empirical risk," the average loss over the training data.
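A sketch of one instance of this objective: L2 regularization with the binary logistic loss, minimized by batch gradient descent. The slide leaves the loss, regularizer, and optimizer open; this particular combination is only an example.

```python
import math

def train(data, n_features, reg=0.1, lr=0.1, epochs=100):
    """Minimize R(w) + (1/|D_train|) * sum_d L(d; w), with
    R(w) = (reg/2) * ||w||^2 and L(d; w) = log(1 + exp(-y * w.x)), y in {-1, +1}.

    data: list of (feature_vector, label) pairs with label in {-1, +1}.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [reg * w_j for w_j in w]                      # gradient of the regularizer
        for x, y in data:
            margin = y * sum(w_j * x_j for w_j, x_j in zip(w, x))
            coef = -y / (1.0 + math.exp(min(margin, 50.0)))  # logistic-loss gradient factor, clamped
            for j, x_j in enumerate(x):
                grad[j] += coef * x_j / len(data)            # averaged empirical-risk gradient
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w
```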
4. Select Parameters using Data
Considerations:
• Do you want posterior probabilities, or just labels?
• What methods do you understand well enough to explain in your paper?
• What methods will your readers understand?
• What implementations are available?
  – Cost, scalability, programming language, compatibility with your workflow, ...
• How well does it work (on held-out data)?
5. Estimate Performance
• Always, always, always use held-out data.
  – Multiple rounds of tests? Fresh testing data!
• Consider the "most frequent class" baseline.
• Consider inter-annotator agreement.
• What to measure?
  – Accuracy
  – When one class is special: precision/recall
5. Estimate Performance
• precision: of the items predicted to be in the special class, the fraction that truly are
• recall: of the items truly in the special class, the fraction that are predicted as such
$h_{\mathbf{w}}(d) = \arg\max_c \; \mathbf{w}_c^\top f(d) + w_c^{\text{bias}}$
(A small evaluation sketch follows below.)
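A minimal sketch of these measurements on held-out predictions; the single "special" class follows the bullet above, and the function name is illustrative.

```python
def evaluate(gold, predicted, special_class):
    """Accuracy overall, plus precision/recall with respect to one class of interest."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == special_class)
    pred_pos = sum(1 for p in predicted if p == special_class)
    gold_pos = sum(1 for g in gold if g == special_class)
    return {
        "accuracy": correct / len(gold),
        "precision": tp / pred_pos if pred_pos else 0.0,  # predicted 'special' that are correct
        "recall": tp / gold_pos if gold_pos else 0.0,     # true 'special' that were found
    }
```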
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
Outline
✓ Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words
Decoding Word Sequences: Examples
• Categorizing each word by its part-of-speech or semantic class
• Recognizing mentions of named entities
• Segmenting a document into parts
• Parsing a sentence into a grammatical or semantic structure
High-Level View
[Diagram: classification maps a document d to a single category c; structured prediction maps an input x to a whole sequence of outputs y1, y2, y3, ..., yN.]
Possible Lines of Attack
1. Transform into a sequence of classification problems (see part 1).
2. Transform into a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your particular problem.
Shameless Self-Promotion
Linguistic Structure Prediction, Noah A. Smith (Carnegie Mellon University).
Synthesis Lectures on Human Language Technologies (Graeme Hirst, series editor), Morgan & Claypool Publishers, www.morganclaypool.com. ISBN: 978-1-60845-405-1; series ISSN: 1947-4040.
From the back cover: A major part of natural language processing now depends on the use of text data to build linguistic analyzers. We consider statistical, computational approaches to modeling linguistic structure. We seek to unify across many approaches and many kinds of linguistic structures. Assuming a basic understanding of natural language processing and/or machine learning, we seek to bridge the gap between the two fields. Approaches to decoding (i.e., carrying out linguistic structure prediction) and supervised and unsupervised learning of models that predict discrete structures as outputs are the focus. We also survey natural language processing problems to which these methods are being applied, and we address related topics in probabilistic inference, optimization, and experimental methodology.
$56.43 on amazon.com; possibly free in electronic form through your university's library
Lines of Attack
1. Reduce to a sequence of classification problems (see part 1).
2. Reduce to a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your problem.
Sequence Labeling
• Input: sequence of symbols x1 x2 ... xL
• Output: sequence of labels y1 y2 ... yL, each ∈ Λ
Prediction rule:
$h_{\mathbf{w}}(x) = \arg\max_{y} \; \mathbf{w}^\top f(x_1 \ldots x_L, y_1 \ldots y_L)$
Problem: there are $O(|\Lambda|^L)$ choices for y1 y2 ... yL!
Sequence Labeling with Local Features
A key assumption about f allows us to solve the problem exactly, in $O(|\Lambda|^2 L)$ time and $O(|\Lambda| L)$ space:
$h_{\mathbf{w}}(x) = \arg\max_{y} \; \mathbf{w}^\top f(x_1 \ldots x_L, y_1 \ldots y_L) = \arg\max_{y} \; \mathbf{w}^\top \left( \sum_{\ell=1}^{L-1} f_{\text{local}}(x_1 \ldots x_L, y_\ell y_{\ell+1}) \right)$
If I knew the best label sequence for x1 ... xL – 1, then yL would be easy; that decision would depend only on the state at L – 1. I don't know that best sequence, but there are only |Λ| options at L – 1. So I only need the score of the best sequence up to L – 1, for each possible label at L – 1. Call this V[L – 1, y] for y ∈ Λ. From this, I can score each label at L, for each hypothetical label at L – 1. The score of the best sequences up to L – 1 relies similarly on the score of the best sequences up to L – 2. Ditto at every other timestep: L – 2, L – 3, ..., 1.
$y^*_L = \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top \left( \sum_{\ell=1}^{L-2} f_{\text{local}}(x_1 \ldots x_L, y^*_\ell y^*_{\ell+1}) \right) + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L)$
Since the first term does not depend on $y_L$, this reduces to
$y^*_L = \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L)$
(Featurized) Viterbi Algorithm
• Precompute V[·, ·] from left to right. V[1, ·] = 0. For ℓ = 2 to L, for each y in Λ:
  $V[\ell, y] = \max_{y' \in \Lambda} V[\ell-1, y'] + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y'y)$
  $B[\ell, y] = \arg\max_{y' \in \Lambda} V[\ell-1, y'] + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y'y)$
• Backtrack and select the labels from right to left. $y^*_L = \arg\max_{y} V[L, y]$. For ℓ = L – 1 to 1:
  $y^*_\ell = B[\ell+1, y^*_{\ell+1}]$
(A runnable sketch follows below.)
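A runnable sketch of the algorithm above. The callback score_local stands in for the slide's $\mathbf{w}^\top f_{\text{local}}$ term and is an assumption of this sketch; positions are 0-indexed here rather than 1-indexed.

```python
def viterbi(x, labels, score_local):
    """Exact decoding with local (label-bigram) features.

    score_local(x, l, y_prev, y): score of labeling position l with y given y_prev at l-1
    (standing in for w . f_local). O(|labels|^2 * L) time, O(|labels| * L) space.
    """
    L = len(x)
    V = [{y: 0.0 for y in labels}]                      # V[0][y] = 0, as on the slide
    B = [{}]
    for l in range(1, L):                               # fill V and B left to right
        V.append({})
        B.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: V[l - 1][yp] + score_local(x, l, yp, y))
            V[l][y] = V[l - 1][best_prev] + score_local(x, l, best_prev, y)
            B[l][y] = best_prev
    best_last = max(labels, key=lambda y: V[L - 1][y])  # backtrack right to left
    seq = [best_last]
    for l in range(L - 1, 0, -1):
        seq.append(B[l][seq[-1]])
    return list(reversed(seq))
```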
Part of Speech Tagging
After paying the medical bills , Frances was nearly broke .
RB VBG DT JJ NNS , NNP VBZ RB JJ .
• Adverb (RB)
• Verb (VBG, VBZ, and others)
• Determiner (DT)
• Adjective (JJ)
• Noun (NN, NNS, NNP, and others)
• Punctuation (., ,, and others)
Named Entity Recognition
With Commander Chris Ferguson at the helm ,
Atlantis touched down at Kennedy Space Center .
Named Entity Recognition
With/O Commander/B-person Chris/I-person Ferguson/I-person at/O the/O helm/O ,/O
Atlantis/B-space-shuttle touched/O down/O at/O Kennedy/B-place Space/I-place Center/I-place ./O
(A small span-decoding sketch follows below.)
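A small helper, not from the slides, that turns BIO labels like the ones above into entity spans; the function name and the exclusive end index are assumptions of this sketch.

```python
def bio_to_spans(tags):
    """Convert BIO tags (B-person, I-person, O, ...) into (type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel "O" closes a trailing entity
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))  # end index is exclusive
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        elif etype is None:                      # tolerate an I- tag with no preceding B-
            start, etype = i, tag[2:]
    return spans

tokens = "Atlantis touched down at Kennedy Space Center .".split()
tags = ["B-space-shuttle", "O", "O", "O", "B-place", "I-place", "I-place", "O"]
for etype, s, e in bio_to_spans(tags):
    print(etype, tokens[s:e])  # space-shuttle ['Atlantis'], then place ['Kennedy', 'Space', 'Center']
```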
Word Alignment
Mr. President , Noah's ark was filled not with production factors , but with living creatures.
NULL Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .
Basic Recipe for Sequence Labeling
1. Obtain a pool of correctly labeled sequences D.
2. Define a locally factored function f from sequences and labelings to feature vectors.
3. Define a parameterized function hw from feature vectors to labelings.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
Sequence Labeling
Structured Learners Generalize Linear Classification Learners!
• hidden Markov models ⟵ naïve Bayes
• conditional random fields ⟵ logistic regression
• structured perceptron ⟵ perceptron
• structured SVM ⟵ support vector machine
Additional Notes
• Outputs that are trees, graphs, logical forms, other strings ...
  – parse trees (phrase structure, dependencies)
  – coreference relationships among entity mentions (and pronouns)
  – a huge range of semantic analyses
• Evaluation?
Dependency Parse
Frame-Semantic Parse
Run our Parsers!
http://demo.ark.cs.cmu.edu/parse
Outline
✓ Automatically categorizing documents
✓ Decoding sequences of words
3. Clustering documents and/or words
Clustering Real Data
K-Means
Given: points {x1, …, xN}, K (number of clusters)
1. Arbitrarily select μ1, …, μK.
2. Assign each xi to the nearest μj.
3. Select each μj to be the mean of all xi assigned to it.
4. If all μj have converged, stop; else go to 2.
(A sketch follows below.)
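A sketch of the four steps above in plain Python, for points given as tuples or lists of numbers; squared Euclidean distance and random initialization are choices made here, not dictated by the slide.

```python
import random

def kmeans(points, k, max_iters=100):
    """1. pick K means arbitrarily, 2. assign points to the nearest mean,
    3. recompute the means, 4. repeat until the means stop changing."""
    means = [list(p) for p in random.sample(points, k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in points:                                  # assignment step
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
            clusters[j].append(x)
        new_means = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else means[j]
            for j, c in enumerate(clusters)               # recompute each mean
        ]
        if new_means == means:                            # converged
            break
        means = new_means
    return means, clusters
```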
K-Means, Visualized
K-Means for Text?
• Documents
  – Use the same f we might use for classification.
• Words
  – Use "context" vectors ...
Where’s the beef?
chicken
Hypothetical Counts Based on Syntactic Dependencies
(columns: modified-by-ferocious (adj), subject-of-devour (v), object-of-pet (v), modified-by-African (adj), modified-by-big (adj))
Lion      15   5   0    6   15
Dog        7   3   8    0   12
Cat        1   1   6    1    9
Elephant   0   0   0   10   15
…
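A sketch of "context vectors" built from the hypothetical counts in the table above; cosine similarity is one common way to compare such vectors, and the same vectors could be fed directly to the K-means sketch earlier.

```python
import math

# Each word is represented by its counts over syntactic-dependency contexts (from the table).
contexts = ["mod-by-ferocious", "subj-of-devour", "obj-of-pet", "mod-by-African", "mod-by-big"]
word_vectors = {
    "lion":     [15, 5, 0, 6, 15],
    "dog":      [7, 3, 8, 0, 12],
    "cat":      [1, 1, 6, 1, 9],
    "elephant": [0, 0, 0, 10, 15],
}

def cosine(u, v):
    """Cosine similarity: words with similar contexts get similar vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(word_vectors["lion"], word_vectors["dog"]))
```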
Brown Clustering
Given: corpus of length N, and K (number of clusters)
1. Assign each word to its own cluster (V clusters).
2. Repeat V – K times:
   • Find the single merge (cj, ck) that results in a new clustering with the highest Quality score.
   • Prepend cj's bitstring with 0 and ck's with 1 (and the same for all their descendants).
(A greedy-merge sketch appears after the mini-example below.)
Mini-Example
Bitstrings that share a prefix are in the same cluster, at some level of granularity.
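A sketch of the greedy merge loop and bitstring bookkeeping described above. The Quality function (e.g., the average mutual information of the induced class-bigram model) is passed in as an assumed callback and is not implemented here; a real implementation would also update Quality incrementally rather than rescoring every candidate merge.

```python
def brown_cluster(vocab, k, quality):
    """Greedy agglomerative clustering, as outlined on the slide.

    vocab:   list of word types (each starts in its own cluster)
    k:       desired number of clusters
    quality: callback scoring a candidate clustering (assumed, not implemented here)
    Returns the clusters plus a bitstring per word recording the merge history.
    """
    clusters = [[w] for w in vocab]
    bits = {w: "" for w in vocab}
    while len(clusters) > k:                    # V - K merges in total
        # find the single merge (c_j, c_l) giving the highest Quality score
        j, l = max(
            ((j, l) for j in range(len(clusters)) for l in range(j + 1, len(clusters))),
            key=lambda jl: quality(
                [c for i, c in enumerate(clusters) if i not in jl]
                + [clusters[jl[0]] + clusters[jl[1]]]
            ),
        )
        for w in clusters[j]:
            bits[w] = "0" + bits[w]             # prepend 0 to one side's bitstrings...
        for w in clusters[l]:
            bits[w] = "1" + bits[w]             # ...and 1 to the other's
        clusters = [c for i, c in enumerate(clusters) if i not in (j, l)] + [clusters[j] + clusters[l]]
    return clusters, bits
```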
Clusters from Brown et al. (1992)
Clusters from Owoputi et al. (2013) (56M Tweets)
acronyms for laughter: lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
onomatopoeic laughter: haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
affirmative: yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
negative: yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
metacomment: smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
second-person pronoun: u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
prepositions: w fo fa fr fro ov fer fir whit abou ar serie fore fah fuh w/her w/that fron isn agains
"contractions": tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
going to: gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
so+: soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
Clusters from Owoputi et al. (2013) (56M Tweets)
mischievous: ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
happy: :) (: =) :)) :] :') =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
sad: :( :/ -_- -.- :-( :'( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
love: <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333
F-word + ing: fucking fuckin freaking bloody freakin friggin effin effing fuckn fucken frickin fukin f'n fckn flippin motherfucking fckin f*cking fricken fukn fuccin fcking fukkin
Clusters from Owoputi et al. (2013) (56M Tweets)
Browse our Twitter Clusters!
http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html
Addi*onal Notes
• Sor clustering allows items to have mixed membership in different clusters. – Typically accomplished with probabilis*c models – Latent Dirichlet alloca*on is a popular and Bayesian model
• Evalua*on? • One view of clusters: feature crea*on!
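A minimal soft-clustering sketch using latent Dirichlet allocation, assuming scikit-learn is installed; the toy documents and the choice of two topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the shuttle touched down at the space center",
    "the committee passed the bill after the vote",
    "the commander guided the shuttle to the space station",
]

# Bag-of-words counts, then LDA: each document gets a distribution over topics
# (mixed membership) instead of a single hard cluster label.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document; rows sum to 1
print(doc_topics)
```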
Summary
• supervised classification (5 steps: data, features, prediction function, learning, evaluation)
• structured prediction: local factoring + dynamic programming
• unsupervised clustering: alternating or greedy optimization