NLP Demystified
Noah Smith, Language Technologies Institute, Machine Learning Department, School of Computer Science, Carnegie Mellon University
NLP?
Outline
1. Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words
Categorizing Documents: Examples
• Mosteller and Wallace (1964): authorship of the Federalist papers
• News categories: U.S., world, sports, religion, business, technology, entertainment, ...
• How positive or negative is a review of a film or restaurant?
• Is a given email message spam?
• What is the reading level of a piece of text?
• How influential will a research paper be?
• Will a congressional bill pass committee?
The Vision
• Human experts label some data.
• Feed the data to a learning algorithm that constructs an automatic labeling function.
• Apply that function to as much data as you want!
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
1. Obtain Categorized Documents
Spinoza, 17th-century rationalist
2. Define the Feature Vector Function
• Simplest choice: one dimension per word, and let [f(d)]_j be the count of word w_j in d.
• Twists:
  – Monotonic transforms, like dividing by the length of d or taking a log.
  – Increase the weights of words that occur in fewer documents ("inverse document frequency").
  – n-grams
  – Count specially defined groupings of words
  – Statistical tests to select words likely to be informative
(A sketch of such a feature function follows below.)
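A minimal sketch of this step in Python, assuming whitespace tokenization. The log transform and inverse-document-frequency weighting correspond to the optional twists above; the function names (build_vocab, doc_frequencies, f) are illustrative, not from the slides.

```python
import math
from collections import Counter

def build_vocab(documents):
    """One feature dimension per word type."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def doc_frequencies(documents, vocab):
    """How many documents each word appears in (for inverse document frequency)."""
    df = Counter()
    for doc in documents:
        for word in set(doc.lower().split()):
            if word in vocab:
                df[word] += 1
    return df

def f(doc, vocab, df, n_docs, use_log=True, use_idf=True):
    """[f(d)]_j is the (optionally transformed and reweighted) count of word w_j in d."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word not in vocab:
            continue
        value = math.log(1 + count) if use_log else float(count)
        if use_idf:
            value *= math.log(n_docs / (1 + df[word]))  # downweight words in many documents
        vec[vocab[word]] = value
    return vec
```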
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
3. Define a Function from Feature Vectors to Categories
• Simplest choice: a linear model (see the sketch below):
  $h_{\mathbf{w}}(d) = \arg\max_c \; \mathbf{w}_c^\top f(d) + w_c^{\text{bias}}$
  where $\mathbf{w}_c$ is the vector of coefficients associating each feature with class c (coefficients can be positive or negative).
  – Advantage: interpretability
  – Advantage: computational efficiency
• Some alternatives: k-nearest neighbors, decision trees, neural networks, ...
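A small sketch of the linear decision rule above in plain Python; representing the weights as a dictionary of per-class coefficient lists is an assumption made here for illustration.

```python
def predict(feature_vec, weights, biases):
    """h_w(d) = argmax_c  w_c . f(d) + bias_c.

    weights: class -> list of coefficients (same length as feature_vec)
    biases:  class -> bias term for that class
    """
    def score(c):
        return sum(w_j * x_j for w_j, x_j in zip(weights[c], feature_vec)) + biases[c]
    return max(weights, key=score)
```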
4. Select Parameters using Data
• Also known as "machine learning."
• Many learning options for linear classifiers!
[Diagram: linear-classifier learners arranged by whether they are discriminative and whether they have a probabilistic interpretation: logistic regression (LR), support vector machine (SVM), perceptron, naïve Bayes (NB).]
4. Select Parameters using Data
Optimization view of learning:
Typical loss functions for linear models are convex and can be efficiently optimized using online or batch iterative algorithms with convergence guarantees.
$\mathbf{w} = \arg\min_{\mathbf{w}} \; R(\mathbf{w}) + \frac{1}{|D_{\text{train}}|} \sum_{d \in D_{\text{train}}} L(d; \mathbf{w})$
where $R(\mathbf{w})$ is "regularization" to avoid overfitting and the second term is the "empirical risk," the average loss over the training data.
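A sketch of one instance of this objective: L2 regularization with the binary logistic loss, minimized by batch gradient descent. The slide leaves the loss, regularizer, and optimizer open; this particular combination is only an example.

```python
import math

def train(data, n_features, reg=0.1, lr=0.1, epochs=100):
    """Minimize R(w) + (1/|D_train|) * sum_d L(d; w), with
    R(w) = (reg/2) * ||w||^2 and L(d; w) = log(1 + exp(-y * w.x)), y in {-1, +1}.

    data: list of (feature_vector, label) pairs with label in {-1, +1}.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [reg * w_j for w_j in w]                      # gradient of the regularizer
        for x, y in data:
            margin = y * sum(w_j * x_j for w_j, x_j in zip(w, x))
            coef = -y / (1.0 + math.exp(min(margin, 50.0)))  # logistic-loss gradient factor, clamped
            for j, x_j in enumerate(x):
                grad[j] += coef * x_j / len(data)            # averaged empirical-risk gradient
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w
```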
4. Select Parameters using Data
Considerations:
• Do you want posterior probabilities, or just labels?
• What methods do you understand well enough to explain in your paper?
• What methods will your readers understand?
• What implementations are available?
  – Cost, scalability, programming language, compatibility with your workflow, ...
• How well does it work (on held-out data)?
5. Estimate Performance
• Always, always, always use held-out data.
  – Multiple rounds of tests? Fresh testing data!
• Consider the "most frequent class" baseline.
• Consider inter-annotator agreement.
• What to measure?
  – Accuracy
  – When one class is special: precision/recall
5. Estimate Performance
• precision: of the items predicted to be in the special class, the fraction that truly are
• recall: of the items truly in the special class, the fraction that are predicted as such
$h_{\mathbf{w}}(d) = \arg\max_c \; \mathbf{w}_c^\top f(d) + w_c^{\text{bias}}$
(A small evaluation sketch follows below.)
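A minimal sketch of these measurements on held-out predictions; the single "special" class follows the bullet above, and the function name is illustrative.

```python
def evaluate(gold, predicted, special_class):
    """Accuracy overall, plus precision/recall with respect to one class of interest."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == special_class)
    pred_pos = sum(1 for p in predicted if p == special_class)
    gold_pos = sum(1 for g in gold if g == special_class)
    return {
        "accuracy": correct / len(gold),
        "precision": tp / pred_pos if pred_pos else 0.0,  # predicted 'special' that are correct
        "recall": tp / gold_pos if gold_pos else 0.0,     # true 'special' that were found
    }
```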
Basic Recipe for Document Categorization
1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function hw from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
Outline
✓ Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words
Decoding Word Sequences: Examples
• Categorizing each word by its part-of-speech or semantic class
• Recognizing mentions of named entities
• Segmenting a document into parts
• Parsing a sentence into a grammatical or semantic structure
High-Level View
[Diagram: classification maps a document d to a single category c; structured prediction maps an input x to a whole sequence of outputs y1, y2, y3, ..., yN.]
Possible Lines of Attack
1. Transform into a sequence of classification problems (see part 1).
2. Transform into a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your particular problem.
Shameless Self-Promotion
Linguistic Structure Prediction, Noah A. Smith (Carnegie Mellon University).
Synthesis Lectures on Human Language Technologies (Graeme Hirst, series editor), Morgan & Claypool Publishers, www.morganclaypool.com. ISBN: 978-1-60845-405-1; series ISSN: 1947-4040.
From the back cover: A major part of natural language processing now depends on the use of text data to build linguistic analyzers. We consider statistical, computational approaches to modeling linguistic structure. We seek to unify across many approaches and many kinds of linguistic structures. Assuming a basic understanding of natural language processing and/or machine learning, we seek to bridge the gap between the two fields. Approaches to decoding (i.e., carrying out linguistic structure prediction) and supervised and unsupervised learning of models that predict discrete structures as outputs are the focus. We also survey natural language processing problems to which these methods are being applied, and we address related topics in probabilistic inference, optimization, and experimental methodology.
$56.43 on amazon.com; possibly free in electronic form through your university's library
Lines of Attack
1. Reduce to a sequence of classification problems (see part 1).
2. Reduce to a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your problem.
Sequence Labeling
• Input: sequence of symbols x1 x2 ... xL
• Output: sequence of labels y1 y2 ... yL, each ∈ Λ
Prediction rule:
$h_{\mathbf{w}}(x) = \arg\max_{y} \; \mathbf{w}^\top f(x_1 \ldots x_L, y_1 \ldots y_L)$
Problem: there are $O(|\Lambda|^L)$ choices for y1 y2 ... yL!
Sequence Labeling with Local Features
A key assumption about f allows us to solve the problem exactly, in $O(|\Lambda|^2 L)$ time and $O(|\Lambda| L)$ space:
$h_{\mathbf{w}}(x) = \arg\max_{y} \; \mathbf{w}^\top f(x_1 \ldots x_L, y_1 \ldots y_L) = \arg\max_{y} \; \mathbf{w}^\top \left( \sum_{\ell=1}^{L-1} f_{\text{local}}(x_1 \ldots x_L, y_\ell y_{\ell+1}) \right)$
If I knew the best label sequence for x1 ... xL – 1, then yL would be easy; that decision would depend only on the state at L – 1. I don't know that best sequence, but there are only |Λ| options at L – 1. So I only need the score of the best sequence up to L – 1, for each possible label at L – 1. Call this V[L – 1, y] for y ∈ Λ. From this, I can score each label at L, for each hypothetical label at L – 1. The score of the best sequences up to L – 1 relies similarly on the score of the best sequences up to L – 2. Ditto at every other timestep: L – 2, L – 3, ..., 1.
$y^*_L = \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top \left( \sum_{\ell=1}^{L-2} f_{\text{local}}(x_1 \ldots x_L, y^*_\ell y^*_{\ell+1}) \right) + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L)$
Since the first term does not depend on $y_L$, this reduces to
$y^*_L = \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L)$
(Featurized) Viterbi Algorithm
• Precompute V[·, ·] from left to right. V[1, ·] = 0. For ℓ = 2 to L, for each y in Λ:
  $V[\ell, y] = \max_{y' \in \Lambda} V[\ell-1, y'] + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y'y)$
  $B[\ell, y] = \arg\max_{y' \in \Lambda} V[\ell-1, y'] + \mathbf{w}^\top f_{\text{local}}(x_1 \ldots x_L, y'y)$
• Backtrack and select the labels from right to left. $y^*_L = \arg\max_{y} V[L, y]$. For ℓ = L – 1 to 1:
  $y^*_\ell = B[\ell+1, y^*_{\ell+1}]$
(A runnable sketch follows below.)
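A runnable sketch of the algorithm above. The callback score_local stands in for the slide's $\mathbf{w}^\top f_{\text{local}}$ term and is an assumption of this sketch; positions are 0-indexed here rather than 1-indexed.

```python
def viterbi(x, labels, score_local):
    """Exact decoding with local (label-bigram) features.

    score_local(x, l, y_prev, y): score of labeling position l with y given y_prev at l-1
    (standing in for w . f_local). O(|labels|^2 * L) time, O(|labels| * L) space.
    """
    L = len(x)
    V = [{y: 0.0 for y in labels}]                      # V[0][y] = 0, as on the slide
    B = [{}]
    for l in range(1, L):                               # fill V and B left to right
        V.append({})
        B.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: V[l - 1][yp] + score_local(x, l, yp, y))
            V[l][y] = V[l - 1][best_prev] + score_local(x, l, best_prev, y)
            B[l][y] = best_prev
    best_last = max(labels, key=lambda y: V[L - 1][y])  # backtrack right to left
    seq = [best_last]
    for l in range(L - 1, 0, -1):
        seq.append(B[l][seq[-1]])
    return list(reversed(seq))
```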
Part of Speech Tagging
After paying the medical bills , Frances was nearly broke .
RB VBG DT JJ NNS , NNP VBZ RB JJ .
• Adverb (RB)
• Verb (VBG, VBZ, and others)
• Determiner (DT)
• Adjective (JJ)
• Noun (NN, NNS, NNP, and others)
• Punctuation (., ,, and others)
Named Entity Recognition
With Commander Chris Ferguson at the helm ,
Atlantis touched down at Kennedy Space Center .
Named Entity Recognition
With/O Commander/B-person Chris/I-person Ferguson/I-person at/O the/O helm/O ,/O
Atlantis/B-space-shuttle touched/O down/O at/O Kennedy/B-place Space/I-place Center/I-place ./O
(A small span-decoding sketch follows below.)
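A small helper, not from the slides, that turns BIO labels like the ones above into entity spans; the function name and the exclusive end index are assumptions of this sketch.

```python
def bio_to_spans(tags):
    """Convert BIO tags (B-person, I-person, O, ...) into (type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel "O" closes a trailing entity
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))  # end index is exclusive
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        elif etype is None:                      # tolerate an I- tag with no preceding B-
            start, etype = i, tag[2:]
    return spans

tokens = "Atlantis touched down at Kennedy Space Center .".split()
tags = ["B-space-shuttle", "O", "O", "O", "B-place", "I-place", "I-place", "O"]
for etype, s, e in bio_to_spans(tags):
    print(etype, tokens[s:e])  # space-shuttle ['Atlantis'], then place ['Kennedy', 'Space', 'Center']
```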
Word Alignment
Mr. President , Noah's ark was filled not with production factors , but with living creatures.
NULL Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .
Basic Recipe for Sequence Labeling
1. Obtain a pool of correctly labeled sequences D.
2. Define a locally factored function f from sequences and labelings to feature vectors.
3. Define a parameterized function hw from feature vectors to labelings.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.
Sequence Labeling
Structured Learners Generalize Linear Classification Learners!
• hidden Markov models ⟵ naïve Bayes
• conditional random fields ⟵ logistic regression
• structured perceptron ⟵ perceptron
• structured SVM ⟵ support vector machine
Additional Notes
• Outputs that are trees, graphs, logical forms, other strings ...
  – parse trees (phrase structure, dependencies)
  – coreference relationships among entity mentions (and pronouns)
  – a huge range of semantic analyses
• Evaluation?
Dependency Parse
Frame-Semantic Parse
Run our Parsers!
http://demo.ark.cs.cmu.edu/parse
Outline
✓ Automatically categorizing documents
✓ Decoding sequences of words
3. Clustering documents and/or words
Clustering Real Data
K-Means
Given: points {x1, …, xN}, K (number of clusters)
1. Arbitrarily select μ1, …, μK.
2. Assign each xi to the nearest μj.
3. Select each μj to be the mean of all xi assigned to it.
4. If all μj have converged, stop; else go to 2.
(A sketch follows below.)
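A sketch of the four steps above in plain Python, for points given as tuples or lists of numbers; squared Euclidean distance and random initialization are choices made here, not dictated by the slide.

```python
import random

def kmeans(points, k, max_iters=100):
    """1. pick K means arbitrarily, 2. assign points to the nearest mean,
    3. recompute the means, 4. repeat until the means stop changing."""
    means = [list(p) for p in random.sample(points, k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in points:                                  # assignment step
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
            clusters[j].append(x)
        new_means = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else means[j]
            for j, c in enumerate(clusters)               # recompute each mean
        ]
        if new_means == means:                            # converged
            break
        means = new_means
    return means, clusters
```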
K-Means, Visualized
K-Means for Text?
• Documents
  – Use the same f we might use for classification.
• Words
  – Use "context" vectors ...
Where’s the beef?
chicken
Hypothetical Counts Based on Syntactic Dependencies
(columns: modified-by-ferocious (adj), subject-of-devour (v), object-of-pet (v), modified-by-African (adj), modified-by-big (adj))
Lion      15   5   0    6   15
Dog        7   3   8    0   12
Cat        1   1   6    1    9
Elephant   0   0   0   10   15
…
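A sketch of "context vectors" built from the hypothetical counts in the table above; cosine similarity is one common way to compare such vectors, and the same vectors could be fed directly to the K-means sketch earlier.

```python
import math

# Each word is represented by its counts over syntactic-dependency contexts (from the table).
contexts = ["mod-by-ferocious", "subj-of-devour", "obj-of-pet", "mod-by-African", "mod-by-big"]
word_vectors = {
    "lion":     [15, 5, 0, 6, 15],
    "dog":      [7, 3, 8, 0, 12],
    "cat":      [1, 1, 6, 1, 9],
    "elephant": [0, 0, 0, 10, 15],
}

def cosine(u, v):
    """Cosine similarity: words with similar contexts get similar vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(word_vectors["lion"], word_vectors["dog"]))
```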
Brown Clustering
Given: corpus of length N, and K (number of clusters)
1. Assign each word to its own cluster (V clusters).
2. Repeat V – K times:
   • Find the single merge (cj, ck) that results in a new clustering with the highest Quality score.
   • Prepend cj's bitstring with 0 and ck's with 1 (and the same for all their descendants).
(A greedy-merge sketch appears after the mini-example below.)
Mini-Example
Bitstrings that share a prefix are in the same cluster, at some level of granularity.
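A sketch of the greedy merge loop and bitstring bookkeeping described above. The Quality function (e.g., the average mutual information of the induced class-bigram model) is passed in as an assumed callback and is not implemented here; a real implementation would also update Quality incrementally rather than rescoring every candidate merge.

```python
def brown_cluster(vocab, k, quality):
    """Greedy agglomerative clustering, as outlined on the slide.

    vocab:   list of word types (each starts in its own cluster)
    k:       desired number of clusters
    quality: callback scoring a candidate clustering (assumed, not implemented here)
    Returns the clusters plus a bitstring per word recording the merge history.
    """
    clusters = [[w] for w in vocab]
    bits = {w: "" for w in vocab}
    while len(clusters) > k:                    # V - K merges in total
        # find the single merge (c_j, c_l) giving the highest Quality score
        j, l = max(
            ((j, l) for j in range(len(clusters)) for l in range(j + 1, len(clusters))),
            key=lambda jl: quality(
                [c for i, c in enumerate(clusters) if i not in jl]
                + [clusters[jl[0]] + clusters[jl[1]]]
            ),
        )
        for w in clusters[j]:
            bits[w] = "0" + bits[w]             # prepend 0 to one side's bitstrings...
        for w in clusters[l]:
            bits[w] = "1" + bits[w]             # ...and 1 to the other's
        clusters = [c for i, c in enumerate(clusters) if i not in (j, l)] + [clusters[j] + clusters[l]]
    return clusters, bits
```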
Clusters from Brown et al. (1992)
Clusters from Owoputi et al. (2013) (56M Tweets)
acronyms for laughter: lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
onomatopoeic laughter: haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
affirmative: yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
negative: yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
metacomment: smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
second-person pronoun: u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
prepositions: w fo fa fr fro ov fer fir whit abou ar serie fore fah fuh w/her w/that fron isn agains
"contractions": tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
going to: gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
so+: soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
Clusters from Owoputi et al. (2013) (56M Tweets)
mischievous: ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
happy: :) (: =) :)) :] :') =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
sad: :( :/ -_- -.- :-( :'( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
love: <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333
F-word + ing: fucking fuckin freaking bloody freakin friggin effin effing fuckn fucken frickin fukin f'n fckn flippin motherfucking fckin f*cking fricken fukn fuccin fcking fukkin
Clusters from Owoputi et al. (2013) (56M Tweets)
Browse our Twitter Clusters!
http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html
Addi*onal Notes
• Sor clustering allows items to have mixed membership in different clusters. – Typically accomplished with probabilis*c models – Latent Dirichlet alloca*on is a popular and Bayesian model
• Evalua*on? • One view of clusters: feature crea*on!
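A minimal soft-clustering sketch using latent Dirichlet allocation, assuming scikit-learn is installed; the toy documents and the choice of two topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the shuttle touched down at the space center",
    "the committee passed the bill after the vote",
    "the commander guided the shuttle to the space station",
]

# Bag-of-words counts, then LDA: each document gets a distribution over topics
# (mixed membership) instead of a single hard cluster label.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document; rows sum to 1
print(doc_topics)
```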
Summary
• supervised classification (5 steps: data, features, prediction function, learning, evaluation)
• structured prediction: local factoring + dynamic programming
• unsupervised clustering: alternating or greedy optimization