Natural Language Processing
17 June 2017
Approaches to NLP
• Rule-based approach – from circa 1950
• Machine learning – from circa 1980
Approaches to NLP – Rule based
• Chomsky (1957): Syntactic structures
• Machine translation from IBM & Georgetown
Time flies like an arrow.
Ženu holí stroj. (Czech; multiply ambiguous – e.g. "A machine drives the woman with a stick" or "I chase the machine with a stick")
Approaches to NLP – Machine learning
• Statistical methods, machine learning, including neural networks
• Importance of exact evaluation
• Data, data, data
• Annotation
Machine learning
• Unsupervised
• Finding hidden structure in data
• For example clustering
• Supervised
• Requires training data with correct answers
Data – Corpora
● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European and Canadian parliament proceedings, movie subtitles
● Specialised (e.g. for sentiment)
Sentiment analysis
Sentiment Analysis
● Česká ekonomika zažívá nebývalý boom. (The Czech economy is experiencing an unprecedented boom.)
● Obchod s lidmi zažívá nebývalý boom. (Human trafficking is experiencing an unprecedented boom.)
● Burger King má lepší hranolky než McDonald's. (Burger King has better fries than McDonald's.)
● Baterka je dobrá, ale displej je hroznej. (The battery is good, but the display is terrible.)
● It’s not good, but I still love it.
Sentiment Analysis
● I was happy.
● I was sad.
● I was not happy.
● I have never been happy in my life.
● I have never been so happy in my life.
Sentiment Analysis
● She is pretty.
● She is pretty annoying.
● Bylo to strašně dobrý. (It was terribly good. – a negative intensifier flips to positive)
Sentiment Analysis
● To se vám teda povedlo. (Well, you really nailed that. – sincere or ironic)
● Go read the book.
Sentiment Analysis
● Předchozí verze byla úžasná, výborně se s ní pracovalo a opravdu mi ušetřila práci. Teď si nejsem jistý. (The previous version was amazing, it was great to work with and really saved me work. Now I'm not sure.)
Sentiment Analysis
● Jsme veselí. (We are cheerful.)
● Ve Veselí mě okradli. (I got robbed in Veselí – a town whose name means "cheerful".)
● Krásná u Aše je naprosto děsná. (Krásná u Aše – a village whose name means "beautiful" – is absolutely dreadful.)
● Pan Šťastný mi leze na nervy. (Mr Šťastný – "Mr Happy" – gets on my nerves.)
● That was a bad ass burger!
● Bob’s Bad Breath Burger is delicious.
Sentiment Analysis
● Aš je příšerná prdel. (Aš is a horrible dump. – vulgar)
● Aš je pěkná prdel. (Aš is quite a dump.)
● Krásná u Aše je pěkná prdel. (Krásná u Aše is quite a dump.)
● To si ze mě děláš prdel. (You must be kidding me.)
● S Honzou je prdel. (Honza is great fun. – the same vulgar word, now positive)
Sentiment classification in python
Go to Jupyter Sandbox in Keboola Connection
Naive Bayes
Features considered independently
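The naive Bayes idea – every word treated as an independent feature – can be sketched in a few lines of pure Python. The training sentences below are invented for illustration; a real classifier would be trained on an annotated corpus:

```python
import math
from collections import Counter

# Toy labelled training data (invented for illustration)
train = [
    ("i was happy", "pos"), ("i love this phone", "pos"),
    ("great battery great screen", "pos"),
    ("i was sad", "neg"), ("terrible screen", "neg"),
    ("i hate this awful phone", "neg"),
]

def fit(data):
    """Count words per class; each word is an independent feature."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """argmax over classes of log P(c) + sum of log P(w|c), Laplace-smoothed."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

wc, cc, vocab = fit(train)
print(predict("i was so happy", wc, cc, vocab))   # pos
```

Note that the model sums log-probabilities per word, so "I was not happy" confuses it – exactly the independence limitation discussed above.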
Evaluation
Evaluation: Precision/Recall
False positive → bad precision
False negative → bad recall
Evaluation: Confusion matrix

                     Predicted positive   Predicted negative
Real positive        true positive        false negative
Real negative        false positive       true negative
Evaluation: Confusion matrix

                     Predicted positive   Predicted negative
Real positive        10                   5
Real negative        3                    16
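Plugging the numbers from this matrix into the standard definitions:

```python
# Values from the confusion matrix above
tp, fn = 10, 5   # real positives: found / missed
fp, tn = 3, 16   # real negatives: wrongly flagged / correctly rejected

precision = tp / (tp + fp)   # of all predicted positives, how many are real
recall    = tp / (tp + fn)   # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.769 0.667 0.714
```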
Evaluation
● Multiple possible answers
● Not all errors are equally important
● Inter-annotator agreement (very low for tagging with an open tag set)
Machine learning – Overfitting
Naive Bayes
Features considered independently
Neural network
● fully connected layers
● other architectures possible
Discovery Analysis
Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags are not a big deal (as opposed to media tags)
Yelp Sample – 160k Restaurant Reviews
Simple tagging
1. Tokenization: split text into words
2. Drop unimportant words: stop words
3. Find important words: tf-idf
tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!
tf(pizza) = 5
tf – term frequency (same review)

tf(pizza) = 5, tf(the) = 15
idf – inverse document frequency
idf(the) = log(160,000 / 152,000) = log(1.05)
idf(pizza) = log(160,000 / 11,800) = log(13.56)
idf(Tokyo) = log(160,000 / 194) = log(824.74)
tf-idf
w(t, doc) = log(1 + tf(t)) * log(N / df(t))    (base-2 logarithms)

w(the, doc) = log(1+15) * log(1.05) = 0.28
w(pizza, doc) = log(1+5) * log(13.56) = 9.72
w(Tokyo, doc) = log(1+0) * log(824.74) = 0
Simple, yet works well
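The weights above can be reproduced directly from the Yelp counts (base-2 logs; the slide rounds 160,000/152,000 to 1.05, so the exact value for "the" comes out slightly higher):

```python
import math

N = 160_000                                             # total documents (Yelp reviews)
df = {"the": 152_000, "pizza": 11_800, "tokyo": 194}    # document frequencies
tf = {"the": 15, "pizza": 5, "tokyo": 0}                # counts in the example review

def tfidf(term):
    # w(t, doc) = log2(1 + tf) * log2(N / df)
    return math.log2(1 + tf[term]) * math.log2(N / df[term])

for t in ("the", "pizza", "tokyo"):
    print(t, round(tfidf(t), 2))
# the 0.3 / pizza 9.72 / tokyo 0.0
```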
Go to python notebooks
Simple tags in python
Tokenization

● Vignátová-Mimochodská, Merriam-Webster's, Tumu-M'Pongo
● 10-year, one-liners, self-proclaimed, je-li
● United Kingdom-United States relations
● 5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00
● 10000, 10 000, 10,000
● 3.14159, 10.12., 10. prosince, U.S.A., H2O
● km/h, A/C, s/he, byl(a), °C
● N40° 44.9064', W073° 59.0735'
● www.noviny.cz/clanek-o-necem [email protected]
● cos, tys, polívkus; proň, ses, abys; křížem krážem
Tokenization
● Arbitrary decisions have to be made.
● Stick to them consistently.
● Pre-trained models may work poorly when fed differently tokenized data.
eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!
tf(eat) + tf(ate) + tf(eaten) = 3, but tf(eat) alone = 1
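Conflating the forms so the counts add up can be done with a lemma lookup. A toy sketch – the tiny hard-coded dictionary stands in for a real morphological lexicon:

```python
from collections import Counter

# Toy lemma dictionary; real lemmatizers use full morphological dictionaries
LEMMAS = {"ate": "eat", "eaten": "eat", "eats": "eat", "eating": "eat"}

def lemma(word):
    return LEMMAS.get(word.lower(), word.lower())

tokens = ["eaten", "here", "twice", "what", "I", "ate", "a", "place", "to", "eat"]
tf = Counter(lemma(t) for t in tokens)
print(tf["eat"])   # 3  (eat + ate + eaten)
```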
Lemmatization &
Morphology
Processing Morphology
Lemmatization: word → lemma (dictionary form)
  Peter saw her. → Peter see she.
  státu → stát-1_^(státní_útvar)

POS tagging: word → tag
  Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
  saw → {[see, verb], [saw, noun]}
  ženou → {[hnát, verb], [žena, noun]}

Morpheme segmentation: nej-ne-ob-hospod-ař-ova-teln-ějš-ího

Generation: hnát + verb+3+pl+present → ženou
Morphology – not so easy
matk-a – mat-e-k – matc-e – matč-in
city – citi-es, goose – geese, sheep – sheep, go – went, jít – šla
Stuhl – Stühl-e, Vater – Väter
po-trub-í (*potrub, *trubí)
Povltaví, Pobaltí, potrubí, pobřeží
Morphology – not so easy
Tagalog (Philippines):
basa ‘read’ → b-um-asa ‘read (past)’
sulat ‘write’ → s-um-ulat ‘wrote’
rare in English: abso-bloody-lutely
Arabic, Hebrew – templates
Choice of lemma depth
inflection: sedaček → sedačka, debates → debate; brought, brings, bringing → bring
negation: nezdravý → zdravý, unreasonable → reasonable
gradation: nejvyšší → vysoký (but Nejvyšší soud, "the Supreme Court", is a name)
Morphology: not so easy - derivation
hubnutí – hubnout
sportovec – sport – sportovní
řez – řezbář – řezník
dát – vzdát, jít – pojít
unloosen = loosen; unnerve, unearth
Zipf’s law

A word’s frequency is inversely proportional to its frequency rank.
Consequences
● Pareto’s rule (80 : 20)
○ One can achieve “reasonable” quality fast
○ Costs of additional improvements rise “exponentially” (long tail)
● Ambiguity and fuzziness on every layer of language
Part-of-speech tagging
Part-of-speech tagging
I love hiking through the woods on weekends .
PRP VBP VBG IN DT NNS IN NNS .
● Penn Treebank tagset: ca. 40 tags; VBD – verb in past tense
● Czech positional tagset: 4000+ tags; VpNS---XR-AA---
Petrov et al. – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
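Universal tags are usually produced by mapping a fine-grained tagset onto them. A partial, hand-written mapping for a few Penn Treebank tags (illustrative subset only, not the full published mapping), applied to the tag sequence from the hiking example:

```python
# Partial Penn Treebank -> universal tag mapping (illustrative subset)
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "PRP": "PRON", "DT": "DET", "IN": "ADP",
    "CD": "NUM", "CC": "CONJ", "RP": "PRT", "FW": "X", ".": ".",
}

# "I love hiking through the woods on weekends ."
penn = ["PRP", "VBP", "VBG", "IN", "DT", "NNS", "IN", "NNS", "."]
universal = [PENN_TO_UNIVERSAL[t] for t in penn]
print(universal)
# ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'NOUN', '.']
```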
Penn Treebank tagset
Entities
Example – Švejk – characters
Entities – named and non-named
● Named entities: personal names, organizations, geographical names
● Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time
● Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain
Entities – some basic challenges
● Types – fuzziness, hierarchy
○ Facebook – product or company?
○ European Union – organization or place?
● Embedded entities
○ [Dr.] Martin Luther King [Jr.]
○ [The [New England] Journal of Medicine]
○ [Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
○ [Univerzita Karlova v Praze] vs [Univerzita Karlova] v Praze
● List look-up is not enough
○ Washington, The police, ANO, Šanca na Skok z Mosta do Siete
Entities – ML

● annotation – tag tokens with labels like PERSON_START, PERSON_CONT
● popular classifier – CRF
● features
○ word shape (case, is alphanumeric, etc.)
○ morphological features
○ gazetteers
○ distsim, word2vec
○ labels already assigned to previous word(s)
○ add features of surrounding tokens, previous instances of the same word, use n-grams …
● can use two passes
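A token-level feature extractor of the kind fed to a CRF might look like the following hand-rolled sketch (the gazetteer and label values are invented; a real system would use large gazetteers and the features listed above):

```python
GAZETTEER = {"london", "paris", "prague"}   # toy place-name list

def word_shape(w):
    """Xxxx for capitalized words, dddd for digits, xxxx otherwise."""
    if w.isdigit():
        return "d" * len(w)
    if w[0].isupper():
        return "X" + "x" * (len(w) - 1)
    return "x" * len(w)

def features(tokens, i, prev_label):
    """Features for token i, including context and the previous label."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "shape": word_shape(w),
        "is_alnum": w.isalnum(),
        "in_gazetteer": w.lower() in GAZETTEER,
        "prev_label": prev_label,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(features(["He", "visited", "Prague", "."], 2, "O"))
```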
Entities/Tags – remaining issues

● Coreference – increases tf (phrase importance)
○ Miloš Zeman = pan prezident ("the president") = on ("he") = [dropped pronoun] přišel
● Standardization
○ iPads > iPad, Windows != Window
○ Spojené státy != Spojený stát, česká dráha
○ The first stage has landed on Of Course I Still Love You.
○ Zazpíval Bratříčku zavírej vrátka. (He sang "Bratříčku, zavírej vrátka" – a song title)
● Normalization
○ USA = United States of America = United States ~ America
○ Hillary Rodham = Hillary Clinton
Tales from production ...
● Škoda Octavius
● Livius Klausová
● Miroslava Kalousek
● CPI
● Yoko onen
● jihoamerické tanky
● kat Perry
● laso Vegas
● Světový ekonomický fór
● Ústí nad Labem x Brisbane nad Austrálií
● Homosexuální pára
● Křišťálový lup
Syntax & Parsing
Slunečníky nahradily deštníky. (Parasols replaced umbrellas – or umbrellas replaced parasols; subject and object look alike.)
Pět nemocnic zrušilo ministerstvo. (Word order suggests "Five hospitals abolished the ministry", but the intended reading is "The ministry closed five hospitals.")
Návrh putuje k Ústavnímu soudu v době, kdy je neúplný. (The bill goes to the Constitutional Court at a time when "it" is incomplete – the bill, or the court?)
Ambiguity
Old men and women are hard to live with.
I saw her duck.
The chicken is too hot to eat.
The mayor is a dirty street fighter.
Happily they left.
Terry loves his wife and so do I.
Ambiguity
Parser
Try Stanford Online Dependency Parser
http://nlp.stanford.edu:8080/corenlp/process
End-to-end example:Keboola Connection
Vectors
Vector methods
One way to bridge natural language and classical ML
After transforming to vectors, integration with ML systems is easy
Applications: Search
Text classification
Preprocessing / feature extraction for any ML task, e.g., neural networks: image -> vector -> text
Vector methods: bag of words
Preprocessing – tokenization, stemming/lemmatization, cleaning
Create a vector d with dimension V (size of vocabulary)
d_i = tf_i (term frequency of the i-th word)
A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
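The whole pipeline – tokenize, drop stop words, lemmatize, count – fits in a few lines. The stop-word list and lemma dictionary below are toy stand-ins for real resources:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "on", "in"}   # toy stop-word list
LEMMAS = {"slept": "sleep", "cats": "cat"}           # toy lemma dictionary

def bag_of_words(text):
    """Lower-case, drop stop words, lemmatize, count the rest."""
    tokens = [w.lower() for w in text.split()]
    tokens = [LEMMAS.get(w, w) for w in tokens if w not in STOP_WORDS]
    return Counter(tokens)

bag = bag_of_words("A black cat and a white cat slept on a mat")
print(dict(bag))   # {'black': 1, 'cat': 2, 'white': 1, 'sleep': 1, 'mat': 1}
```

Fixing a vocabulary order then turns the Counter into the dense vector d shown above.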
Vector methods: bag of words improved
Fancier values instead of tf. (e.g., tf-idf)
Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
Vector methods: dimensionality reduction
Each term is a feature → very high dimensionality
Dimensionality reduction:
LSI (LSA) – term-document matrix decomposition
LDA – topic inference using probabilistic graphical model
Word2vec – transform words to vectors of given size, capture their context
gensim Python library
Vector methods: Latent Semantic Indexing (LSI)
goal: map semantically similar documents to similar vectors
{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}
reduce dimensionality by singular value decomposition (SVD) of the term-document matrix
addresses synonymy to some extent, homonymy to a lesser extent
From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...
Source: Jialu Liu: Topic Model
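The SVD step itself is a one-liner in numpy. A minimal sketch on a made-up 4-term × 3-document count matrix: after truncating to 2 latent dimensions, the two car/truck documents land close together and the flower document far away:

```python
import numpy as np

# Made-up term-document count matrix: rows = terms, columns = documents
terms = ["car", "truck", "flower", "petal"]
X = np.array([
    [2, 1, 0],   # car
    [1, 2, 0],   # truck
    [0, 0, 3],   # flower
    [0, 0, 2],   # petal
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                     # keep the 2 strongest latent "topics"
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a 2-d vector

print(np.round(docs_2d, 2))
```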
Vector methods: Latent Dirichlet Allocation (LDA)
topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten
topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy
Finds probability distributions of topics for documents and words for topics
Source: Jialu Liu: Topic Model
Vector methods: Latent Dirichlet Allocation (LDA)
From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...
0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +
Vector methods: Word2vec
Doesn’t ignore word order, uses either skip-grams or continuous bag of words (CBOW)
vector arithmetic king - man + woman = queen
uses neural networks
research shows analogy to matrix factorization
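The king − man + woman ≈ queen arithmetic can be illustrated with hand-made toy vectors (real word2vec embeddings have hundreds of learned dimensions; these 3-d values are invented, with dimensions loosely meaning royalty/male/female):

```python
import numpy as np

# Invented 3-d "embeddings"; dims roughly: (royalty, male, female)
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Cosine-nearest vocabulary word to vector v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], v))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))   # queen
```

Excluding the query words mirrors what gensim's most_similar does; with these toy vectors, queen wins even without the exclusion.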
Vector methods: Word2vec
model.most_similar(positive=['nuclear'])
[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]