Practical Natural Language Processing


Catherine Havasi
Luminoso / MIT Media Lab

[email protected]

[Figure: a collage of overlapping, truncated text snippets, including free-text comments and a news story about Christine C. Quinn's campaign launch]

There are notes! luminoso.com/blog

Too much text?

Wouldn’t it be cool if we could talk to a computer?

This is hard.

It takes a lot of knowledge to understand language

I made her duck.

• I cooked waterfowl for her benefit (to eat)
• I cooked waterfowl belonging to her
• I created the (plaster?) duck she owns
• I made sure she got her head down
• I waved my magic wand and turned her into undifferentiated waterfowl

Language is Recursive

• You can build new concepts out of old ones indefinitely


Language is Creative

It was really stuffy.

Smelled really musty.

Reminds me of a dusty closet.

Was like a wet dog.

It was like it had been shut away for a long time.

Smells like an old house.

Really stale.

It smelled terrible.

A multi-lingual world

Linguistics to the rescue?


--Randall Munroe, xkcd.org/114

“Much Debate”

We just want to get things done.

So, what is the state of the art?

The NLP process

• Take in a string of language
• Where are the words?
• What are the root forms of these words?
• How do the words fit together?
• Which words look important?
• What decisions should we make based on these words?

The NLP process (simplified)

• Fake understanding
• Until you make understanding

Example: Detecting bad words

• You want to flag content with certain bad words in it

• Don’t just match sequences of characters
• That would lead to this classic mistake

Many forms of fowl language

• Suppose we want people to not say the word “duck”


“What the duck’s wrong with this”

“It’s all ducked up”

“Un-ducking-believable”

Step 1: break text into tokens

it   ’s   all   ducked   up   un   ducking   believable

Step 2: replace tokens with their root forms

it → it
’s → is
all → all
ducked → duck
up → up
un → un
ducking → duck
believable → believe

In a few lines of Python:

>>> import nltk
>>> text = "It's all ducked up. Un-ducking-believable."
>>> tokens = nltk.wordpunct_tokenize(text.lower())
>>> tokens
['it', "'", 's', 'all', 'ducked', 'up', '.', 'un', '-', 'ducking', '-', 'believable', '.']
>>> stemmer = nltk.stem.PorterStemmer()
>>> [stemmer.stem(token) for token in tokens]
['it', "'", 's', 'all', 'duck', 'up', '.', 'un', '-', 'duck', '-', 'believ', '.']
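The tokenize-then-stem idea can be turned into a small filter. The following is a minimal sketch using only the standard library: the regex tokenizer and the `crude_stem` suffix stripper are hypothetical stand-ins for a real tokenizer and stemmer such as NLTK's, and `BANNED` is an illustrative word list.

```python
import re

BANNED = {"duck"}  # illustrative word list (an assumption for this sketch)


def tokenize(text):
    """Split lowercased text into runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def is_flagged(text, banned=BANNED):
    """Flag text if the stem of any token is a banned word."""
    return any(crude_stem(token) in banned for token in tokenize(text))


def naive_flagged(text, banned=BANNED):
    """Plain substring matching, shown only for contrast."""
    return any(word in text.lower() for word in banned)


for sample in ["It's all ducked up", "Un-ducking-believable", "Mr. Duckworth arrived"]:
    print(sample, is_flagged(sample), naive_flagged(sample))
```

Both approaches flag the first two samples, but only substring matching also flags the surname in the third, which is the kind of false positive that matching on tokens avoids.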

Stemmers can spell things oddly

• duck → duck
• ducking → duck
• believe → believ
• believable → believ
• happy → happi
• happiness → happi

Stemmers can mix up some words

• sincere → sincer
• sincerity → sincer
• universe → univers
• university → univers

The NLP tool chain

• Some source of text (a database, a labeled corpus, Web scraping, Twitter...)

• Tokenizer: breaks text into word-like things
• Stemmer: finds words with the same root
• Tagger: identifies parts of speech
• Chunker: identifies key phrases
• Something that makes decisions based on these results

Useful toolkits

• NLTK (Python)
• LingPipe (Java)
• Stanford Core NLP (Java; many wrappers)
• FreeLing (C++)

The statistics of text

• Often we want to understand the differences between different categories of text
  – Different genres
  – Different writers
  – Different forms of writing

Collecting word counts

• Start with a corpus of text
• Brown corpus (1961)
• British National Corpus (1993)
• Google Books (2009, 2012)


>>> import nltk
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> counts = Counter(brown.words())
>>> counts.most_common()[:20]
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915),
 ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011),
 ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258),
 ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]


>>> for category in brown.categories():
...     frequency = Counter(brown.words(categories=category))
...     for word in frequency:
...         # share of this word's occurrences that fall in this category;
...         # the +100 damps words that are rare overall
...         frequency[word] /= counts[word] + 100.
...     # format the results nicely
...     print("%20s -> %s" % (category,
...         ', '.join(word for word, prop
...                   in frequency.most_common()[:10])))

Prominent words by category

editorial -> Berlin, Khrushchev, East, editor, nuclear, West, Soviet, Podger, Kennedy, budget
fiction -> Kate, Winston, Scotty, Rector, Hans, Watson, Alex, Eileen, doctor, !
government -> fiscal, Rhode, Act, Government, shelter, States, tax, Island, property, shall
hobbies -> feed, clay, Hanover, site, your, design, mold, Class, Junior, Juniors
news -> Mrs., Monday, Mantle, yesterday, Dallas, Texas, Kennedy, Tuesday, jury, Palmer
religion -> God, Christ, Him, Christian, Jesus, membership, faith, sin, Church, Catholic
reviews -> music, musical, Sept., jazz, Keys, audience, singing, Newport, cholesterol
science_fiction -> Ekstrohm, Helva, Hal, B'dikkat, Mercer, Ryan, Earth, ship, Mike, Hesperus
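The score behind this ranking is a word's count within a category divided by its overall count plus a constant. A minimal Python 3 sketch of that scoring on a made-up two-category corpus follows; the function name `prominent_words`, its parameters, and the toy data are all assumptions for illustration, and the constant is lowered from 100 to 1 because the toy corpus is tiny.

```python
from collections import Counter


def prominent_words(tokens_by_category, smoothing=100.0, top=3):
    """Rank each category's words by count_in_category / (count_overall + smoothing)."""
    overall = Counter()
    for tokens in tokens_by_category.values():
        overall.update(tokens)

    result = {}
    for category, tokens in tokens_by_category.items():
        in_category = Counter(tokens)
        scores = {
            word: count / (overall[word] + smoothing)
            for word, count in in_category.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[category] = ranked[:top]
    return result


toy_corpus = {
    "news": "the jury said the jury".split(),
    "religion": "the faith of the church".split(),
}
print(prominent_words(toy_corpus, smoothing=1.0))
```

Without the added constant, any word that occurs only once, and only in one category, would get the maximum score of 1.0; the constant pushes such rare words down the ranking.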

Classifying text

• We can take text that’s categorized and figure out its word frequencies

• Wouldn’t it be more useful to look at word frequencies and figure out the category?

Example: Spam filtering

• Paul Graham’s “A Plan for Spam” (2002), the essay that inspired SpamBayes
• Remember what e-mail was like before 2002?
• A simple classifier (Naive Bayes) changed everything

Supervised classification

• Distinguish things from other things based on examples

Applications

• Spam filtering
• Detecting important e-mails
• Topic detection
• Language detection
• Sentiment analysis

Naive Bayes

• We know the probability of various data given a category

• Estimate the probability of the category given the data

• Assume all features of the data are independent (that’s the naive part)

• It’s simple
• It’s fast
• Sometimes it even works
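The points above can be sketched in a few dozen lines of Python 3 with no libraries. This is an illustrative multinomial Naive Bayes, not the implementation of any particular toolkit; the class name, the add-one smoothing, and the tiny training set are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word tokens with add-one smoothing."""

    def fit(self, labeled_docs):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in labeled_docs:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            # log P(category)
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # ignore words never seen in training
                    # log P(word | category), treating words as independent
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches now".split(), "spam"),
    ("meeting notes for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = NaiveBayes().fit(train)
print(model.classify("buy cheap pills".split()))
print(model.classify("monday meeting".split()))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long documents, and the smoothing keeps a single unseen word from driving a category's probability to zero.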

A quick Naive Bayes experiment

• nltk.corpus.movie_reviews: movie reviews labeled as ‘pos’ or ‘neg’

• Define document_features(doc) to describe a document by the words it contains
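One possible shape for such a feature function is sketched below. The extra `vocabulary` parameter and the `contains(...)` key format are assumptions for illustration; the slide only names `document_features(doc)`.

```python
def document_features(document_words, vocabulary):
    """Describe a document by which vocabulary words it contains."""
    present = set(document_words)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}


features = document_features("a great fun movie".split(), ["great", "boring"])
print(features)
```

Dictionaries of this kind, paired with a 'pos' or 'neg' label, are the form of training data that `nltk.NaiveBayesClassifier.train` accepts.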

Statistics beyond single words

• Many interesting things about text are longer than one word

• bigram: a sequence of two tokens
• collocation: a bigram that seems to be more than the sum of its parts

When is a bigram interesting?

#(vice president) / #(president)   compared with   #(vice) / total words
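One way to make this comparison concrete is to divide how often a bigram actually occurs by how often the individual word counts alone would predict. The Python 3 sketch below does that on a made-up sentence; the function name `bigram_scores` and the `min_count` cutoff are assumptions for illustration, and this ratio is only one of several measures used to find collocations.

```python
from collections import Counter


def bigram_scores(tokens, min_count=2):
    """Score each bigram by how much more often it occurs than chance predicts.

    score = (#(w1 w2) / #(w2)) / (#(w1) / total words)
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:  # rare pairs give unreliable ratios
            scores[(w1, w2)] = (count / unigrams[w2]) / (unigrams[w1] / total)
    return scores


text = "the vice president met the president and the vice president left the room"
scores = bigram_scores(text.split())
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A score well above 1 means the two words appear together far more often than their separate frequencies would suggest, which is what makes a bigram interesting.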

Guess the text


>>> from nltk.book import text4
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal Government; General Government; American people; Vice President; Old World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen; Indian tribes; public debt; one another; foreign nations; political parties

Guess the text

>>> from nltk.book import text3
>>> text3.collocations()
said unto; pray thee; thou shalt; thou hast; thy seed; years old; spake unto; thou art; LORD God; every living; God hath; begat sons; seven years; shalt thou; little ones; living creature; creeping thing; savoury meat; thirty years; every beast

Guess the text

>>> from nltk.book import text6
>>> text6.collocations()
BLACK KNIGHT; HEAD KNIGHT; Holy Grail; FRENCH GUARD; Sir Robin; Run away; CARTOON CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round Table; OLD MAN; dramatic chord; dona eis; eis requiem; LEFT HEAD; FRENCH GUARDS; music stops; Sir Launcelot

What about grammar?

• Eh
• Too hard

What about word meanings?

• “I liked the movie.”
• “I enjoyed the film.”
• These have a lot more in common than “I” and “the”.

WordNet

• A dictionary for computers
• Contains links between definitions
• Words form (roughly) a tree

Synset: good, right, ripe
Definition: most suitable or right for a particular purpose
Glosses: "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes"

Measuring word similarity

• Various methods of measuring word similarity using paths in WordNet

>>> from nltk.corpus import wordnet as wn
>>> wn.wup_similarity(wn.synset('movie.n.1'), wn.synset('film.n.1'))
1.0
>>> wn.wup_similarity(wn.synset('cat.n.1'), wn.synset('dog.n.1'))
0.8571
>>> wn.wup_similarity(wn.synset('cat.n.1'), wn.synset('movie.n.1'))
0.3636
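To see where numbers like these come from, here is a minimal sketch of the Wu-Palmer measure on a hypothetical six-node taxonomy. The `PARENT` table is invented for illustration and is far shallower than WordNet, so the values differ from the ones above; real WordNet also allows a concept to have several parents, which this sketch ignores.

```python
# Hypothetical toy taxonomy: each concept maps to its parent; "entity" is the root.
PARENT = {
    "animal": "entity",
    "carnivore": "animal",
    "cat": "carnivore",
    "dog": "carnivore",
    "artifact": "entity",
    "movie": "artifact",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of nodes from the root down to the concept (the root has depth 1)."""
    return len(path_to_root(concept))


def wup_similarity(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Deepest shared ancestor: the first node on a's path that is also on b's path.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("cat", "dog"))    # 0.75
print(wup_similarity("cat", "movie"))  # 2/7, about 0.2857
```

Two concepts score high when their deepest shared ancestor sits close to both of them, and low when the only thing they share is the root.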

The black hats have WordNet too

• This is why content farms might try to tell you “What to Anticipate When You’re Anticipating”

Limitations of WordNet

>>> print(wn.wup_similarity(wn.synset('taxi.n.1'), wn.synset('driver.n.1')))
0.235294117647
>>> print(wn.wup_similarity(wn.synset('kitten.n.1'), wn.synset('adorable.a.1')))
None

ConceptNet

• More types of word relationships
• More languages
• Less precise definitions
• conceptnet5.media.mit.edu

[Figure: a ConceptNet graph around "buy groceries", connecting concepts such as money, wallet, bank, pocket, supermarket, building, company, cashier, person, groceries, produce, food, cook, buy, spend, and expense through relations such as "requires", "location", "is for", "type of", "part of", "has", "has verb", "has object", "takes object", "wants", "doesn't want", and "related to"]

Take-away points

• NLP in general is hard
• Specific things are easy
• Find tools that work well and chain them together
• Try experimenting with NLTK
• If you need to classify things, try Naive Bayes first

Catherine Havasi
Luminoso & MIT Media Lab

[email protected]
