Natural Language Processing 17 June 2017



Approaches to NLP


• Rule-based approach – from circa 1950

• Machine learning – from circa 1980


Approaches to NLP – Rule-based

• Chomsky (1957): Syntactic Structures

• Machine translation: the Georgetown–IBM experiment (1954)


Time flies like an arrow.


Ženu holí stroj. (Czech, ambiguous: "A machine shaves the woman." or "I drive the machine with a stick.")


Approaches to NLP – Machine learning

• Statistical methods, machine learning, including neural networks

• Importance of exact evaluation

• Data, data, data

• Annotation


Machine learning

• Unsupervised
  ○ Finding hidden structure in data
  ○ For example, clustering

• Supervised
  ○ Requires training data with correct answers


Data – Corpora

● Morphology, tagging: Penn Treebank, PDT, ...

● Parallel corpora: European and Canadian parliament, movie subtitles

● Specialised (e.g. for sentiment)


Sentiment analysis


Sentiment Analysis

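● Česká ekonomika zažívá nebývalý boom. ("The Czech economy is experiencing an unprecedented boom.")

● Obchod s lidmi zažívá nebývalý boom. ("Human trafficking is experiencing an unprecedented boom.")

● Burger King má lepší hranolky než McDonald's. ("Burger King has better fries than McDonald's.")

● Baterka je dobrá, ale displej je hroznej. ("The battery is good, but the display is terrible.")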

● It’s not good, but I still love it.


Sentiment Analysis

● I was happy.

● I was sad.

● I was not happy.

● I have never been happy in my life.

● I have never been so happy in my life.


Sentiment Analysis

● She is pretty.

● She is pretty annoying.

● Bylo to strašně dobrý. ("It was terribly good." The negative word strašně works as an intensifier here.)


Sentiment Analysis

● To se vám teda povedlo. ("Well, you really pulled that off." Sincere or sarcastic, depending on context.)

● Go read the book.


Sentiment Analysis

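● Předchozí verze byla úžasná, výborně se s ní pracovalo a opravdu mi ušetřila práci. Teď si nejsem jistý. ("The previous version was amazing, it was great to work with and it really saved me work. Now I am not sure.")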


Sentiment Analysis

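● Jsme veselí. ("We are cheerful.")

● Ve Veselí mě okradli. ("I was robbed in Veselí." Veselí is a town name that literally means "cheerful".)

● Krásná u Aše je naprosto děsná. ("Krásná near Aš is absolutely dreadful." Krásná is a village name that literally means "beautiful".)

● Pan Šťastný mi leze na nervy. ("Mr. Šťastný gets on my nerves." The surname literally means "happy".)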

● That was a bad ass burger!

● Bob’s Bad Breath Burger is delicious.


Sentiment Analysis

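● Aš je příšerná prdel. ("Aš is a dreadful hole.")

● Aš je pěkná prdel. ("Aš is a real hole." Here pěkná, literally "nice", works as an intensifier.)

● Krásná u Aše je pěkná prdel. ("Krásná near Aš is a real hole.")

● To si ze mě děláš prdel. ("You've got to be kidding me.")

● S Honzou je prdel. ("Honza is great fun.")

(The same vulgar word, prdel, literally "ass", carries a different sentiment in each sentence.)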


Sentiment classification in Python

Go to Jupyter Sandbox in Keboola Connection


Naive Bayes

Features considered independently
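As a rough illustration of what "features considered independently" means, here is a minimal sketch of a word-based Naive Bayes sentiment classifier in plain Python. The training sentences, the labels "pos" and "neg", the add-one smoothing and the function names are all assumptions made for this example; they are not taken from the notebook used in the talk.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Pick the label with the highest log P(label) + sum of log P(word | label)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Each word contributes on its own (independence), with add-one smoothing.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
train_data = [
    ("i was happy", "pos"),
    ("i love it", "pos"),
    ("i was sad", "neg"),
    ("the display is terrible", "neg"),
]
model = train(train_data)

print(predict("i was happy", *model))
print(predict("i was sad", *model))
print(predict("i was not happy", *model))
```

With this toy data the last sentence is still classified as positive: "not" was never seen in training, and even if it had been, it would be scored separately from "happy". That is the weakness the negation examples in the sentiment slides point at.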


Evaluation


Evaluation: Precision/Recall

False positives → bad precision

False negatives → bad recall


Evaluation: Confusion matrix

(rows: reality; columns: prediction)

                 Predicted positive   Predicted negative
Real positive    True positive        False negative
Real negative    False positive       True negative


Evaluation: Confusion matrix

(rows: reality; columns: prediction)

                 Predicted positive   Predicted negative
Real positive    10                   5
Real negative    3                    16
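The counts in this confusion matrix can be turned into precision and recall directly. The sketch below uses the four numbers from the slide; accuracy is added only for comparison and is not mentioned in the deck.

```python
# Counts from the slide: rows are reality, columns are prediction.
tp, fn = 10, 5   # real positive: predicted positive / predicted negative
fp, tn = 3, 16   # real negative: predicted positive / predicted negative

precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of real positives that were found
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are right

print(f"precision = {precision:.3f}")  # 10 / 13
print(f"recall    = {recall:.3f}")     # 10 / 15
print(f"accuracy  = {accuracy:.3f}")   # 26 / 34
```

The 3 false positives are what pulls precision down, and the 5 false negatives are what pulls recall down.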


Evaluation

● Multiple possible answers

● Not all errors are equally important

● Inter-annotator agreement (very low for tagging with an open tag set)


Machine learning – Overfitting


Naive Bayes

Features considered independently


Neural network

● fully connected layers

● other architectures possible


Discovery Analysis


Discovery analysis

● explore the data

● prefer recall over precision

● malformed or irrelevant tags not a big deal (as opposed to media tags)


Yelp Sample – 160k Restaurant Reviews


Simple tagging

1. Tokenization: split text into words

2. Drop unimportant words: stop words

3. Find important words: tf-idf

tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!

tf(pizza) = 5

tf(the) = 15


idf – inverse document frequency

idf(the) = log(160,000 / 152,000) = log(1.05)

idf(pizza) = log(160,000 / 11,800) = log(13.56)

idf(Tokyo) = log(160,000 / 194) = log(824.74)


tf-idf

w(t, doc) = log(1 + tf(t)) * log(N / df(t))

(The values below match base-2 logarithms.)

w(the, doc) = log(1+15) * log(1.05) = 0.28

w(pizza, doc) = log(1+5) * log(13.56) = 9.72

w(Tokyo, doc) = log(1+0) * log(824.74) = 0

Simple, yet works well
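The weights above can be reproduced with a few lines of Python. This is only a sketch: the counts come from the slides, while the use of base-2 logarithms is an inference from the slide's numbers rather than something the deck states. The dictionary and function names are made up for this example.

```python
import math

N = 160_000                                             # reviews in the Yelp sample
df = {"the": 152_000, "pizza": 11_800, "Tokyo": 194}    # document frequencies
tf = {"the": 15, "pizza": 5, "Tokyo": 0}                # term frequencies in the review

def tfidf(term):
    """w(t, doc) = log2(1 + tf(t)) * log2(N / df(t)); base 2 is an assumption."""
    return math.log2(1 + tf[term]) * math.log2(N / df[term])

for term in tf:
    print(term, round(tfidf(term), 2))
```

For "pizza" this gives 9.72, as on the slide. For "the" the result is slightly above the slide's 0.28, because the slide rounds N / df to 1.05 before taking the logarithm. "Tokyo" gets 0 because it does not occur in the review at all.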

Simple tags in Python

Go to Python notebooks

Tokenization

● Vignátová-Mimochodská, Merriam-Webster's, Tumu-M'Pongo

● 10-year, one-liners, self-proclaimed, je-li

● United Kingdom-United States relations

● 5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00

● 10000, 10 000, 10,000

● 3.14159, 10.12., 10. prosince, U.S.A., H2O

● km/h, A/C, s/he, byl(a), °C

● N40° 44.9064', W073° 59.0735'

● www.noviny.cz/clanek-o-necem [email protected]

● cos, tys, polívkus; proň, ses, abys; křížem krážem


Tokenization

● Arbitrary decisions have to be made.

● Stick to them consistently.

● Pre-trained models might work poorly if fed with differently tokenized data
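To make the point about arbitrary decisions concrete, the sketch below tokenizes one sentence in two ways. The sentence is invented, reusing a few items from the examples earlier ("self-proclaimed", "10-year", "10,000"), and both regular expressions are just two possible choices, not a recommendation.

```python
import re

text = "A self-proclaimed expert paid 10,000 for a 10-year plan."

# Choice A: every run of letters or digits is a token.
tokens_a = re.findall(r"[A-Za-z0-9]+", text)

# Choice B: hyphens and commas inside a token keep the pieces together.
tokens_b = re.findall(r"[A-Za-z0-9]+(?:[-,][A-Za-z0-9]+)*", text)

print(tokens_a)
print(tokens_b)
```

Choice A splits "10,000" into "10" and "000" and "10-year" into "10" and "year"; choice B keeps each as one token. A model trained on one kind of output will see unfamiliar tokens if it is fed the other.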

eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!

Without lemmatization: tf(eat) = 1

With lemmatization: tf(eat) + tf(ate) + tf(eaten) = 3
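The effect on term frequency can be sketched as follows. The text is an abridged excerpt of the review above, and the two-entry lemma table is a hypothetical stand-in for a real lemmatizer.

```python
import re
from collections import Counter

# Hypothetical mini lemma table; a real lemmatizer covers the whole vocabulary.
LEMMAS = {"ate": "eat", "eaten": "eat"}

# Abridged excerpt of the review, for illustration.
text = "I've eaten here twice. Damned if I can remember what I ate. Enjoy a great place to eat!"

tokens = re.findall(r"[a-z']+", text.lower())
raw_counts = Counter(tokens)
lemma_counts = Counter(LEMMAS.get(token, token) for token in tokens)

print(raw_counts["eat"])    # counts only the surface form "eat"
print(lemma_counts["eat"])  # counts "eat", "ate" and "eaten" together
```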

Lemmatization & Morphology


Processing Morphology

Lemmatization: word → lemma (dictionary form)
  Peter saw her. → Peter see she.
  státu → stát-1_^(státní_útvar) (Czech "state", in the sense of a political entity)

POS tagging: word → tag
  Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
  saw → {[see, verb], [saw, noun]}
  ženou → {[hnát, verb], [žena, noun]} (hnát "to drive, to chase"; žena "woman")

Morpheme segmentation: nej-ne-ob-hospod-ař-ova-teln-ějš-ího

Generation: hnát + verb+3+pl+present → ženou


Morphology – not so easy

matk-a – mat-e-k – matc-e – matč-in

city – citi-es, goose – geese, sheep – sheep, go – went, jít – šla

Stuhl – Stühl-e, Vater – Väter

po-trub-í (*potrub, *trubí)

Povltaví, Pobaltí, potrubí, pobřeží


Morphology – not so easy

Tagalog (Philippines):

basa ‘read’ → b-um-asa ‘read (past)’

sulat ‘write’ → s-um-ulat ‘wrote’

rare in English: abso-bloody-lutely

Arabic, Hebrew – templates


Choice of lemma depth

inflection: sedaček → sedačka, debates → debate; brought, brings, bringing → bring

negation: nezdravý → zdravý, unreasonable → reasonable

gradation: nejvyšší → vysoký (Nejvyšší soud)

Morphology – not so easy: derivation

hubnutí – hubnout

sportovec – sport – sportovní

řez – řezbář – řezník

dát – vzdát, jít – pojít

unloosen = loosen; unnerve, unearth

Zipf’s law

A word's frequency is inversely proportional to its frequency rank.


Consequences

● Pareto’s rule (80 : 20)
  ○ One can achieve “reasonable” quality fast
  ○ Costs of additional improvements rise “exponentially” (long tail)

● Ambiguity and fuzziness on every layer of language


Part-of-speech tagging


Part-of-speech tagging

I    love  hiking  through  the  woods  on  weekends  .
PRP  VBP   NN      IN       DT   NNS    IN  NNS       .

● Penn Treebank tagset: approx. 40 tags; e.g. VBD – verb in past tense

● Czech positional tagset: 4000+ tags; e.g. VpNS---XR-AA---

Petrov et al. – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)

NOUN - nouns (common and proper)

PRON - pronouns

ADJ - adjectives

ADV - adverbs

ADP - adpositions (prepositions and postpositions)

CONJ - conjunctions

DET - determiners

NUM - cardinal numbers

PRT - particles or other function words

X - other: foreign words, typos, abbreviations

. - punctuation


Penn Treebank tagset


Entities


Example – Svejk – characters


Entities – named and non-named

● Named entities: personal names, organizations, geographical names

● Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time

● Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain


Entities – some basic challenges

● Types – fuzziness, hierarchy
  ○ Facebook – product or company?
  ○ European Union – organization or place?

● Embedded entities
  ○ [Dr.] Martin Luther King [Jr.]
  ○ [The [New England] Journal of Medicine]
  ○ [Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
  ○ [Univerzita Karlova v Praze] vs [Univerzita Karlova] v Praze

● List look-up not enough
  ○ Washington, The police, ANO, Šanca na Skok z Mosta do Siete

Entities – ML

● annotation – tag tokens with labels like PERSON_START, PERSON_CONT

● popular classifier – CRF

● features
  ○ word shape (case, is alphanumeric etc.)
  ○ morphological features
  ○ gazetteers
  ○ distsim, word2vec
  ○ labels already assigned to previous word(s)
  ○ add features of surrounding tokens, previous instances of the same word, use n-grams …

● can use two passes
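The kind of per-token features listed above could be computed roughly as in the sketch below. The function names, the feature names, the label "O" for tokens outside any entity and the example sentence are all assumptions made for illustration. The resulting dictionaries are the sort of input a CRF toolkit typically expects, but no specific library is implied.

```python
def word_shape(token):
    """Collapse a token to its shape: X = upper case, x = lower case, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        if not shape or shape[-1] != symbol:  # merge runs of the same symbol
            shape.append(symbol)
    return "".join(shape)

def token_features(tokens, i, prev_label):
    """Features for token i: the word itself, its shape, its neighbours, previous label."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_alnum": token.isalnum(),
        "prev_label": prev_label,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Invented example sentence with token-level labels in the style named on the slide.
tokens = ["Yesterday", "Martin", "Luther", "King", "spoke", "."]
labels = ["O", "PERSON_START", "PERSON_CONT", "PERSON_CONT", "O", "O"]

features = token_features(tokens, 1, prev_label=labels[0])
print(features)
```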

Entities/Tags – remaining issues

● Coreference – resolving it increases tf (phrase importance)
  ○ Miloš Zeman = pan prezident ("Mr. President") = on ("he") = 0 přišel (dropped subject: "(he) came")

● Standardization
  ○ iPads > iPad, Windows != Window
  ○ Spojené státy ("United States") != Spojený stát ("united state"), česká dráha (wrongly lemmatized České dráhy, "Czech Railways")
  ○ The first stage has landed on Of Course I Still Love You.
  ○ Zazpíval Bratříčku zavírej vrátka. ("He sang Bratříčku, zavírej vrátka." The object is a song title.)

● Normalization
  ○ USA = United States of America = United States ~ America
  ○ Hillary Rodham = Hillary Clinton

Bloopers ...

● Škoda Octavius (intended: Škoda Octavia)
● Livius Klausová (intended: Livia Klausová)
● Miroslava Kalousek (intended: Miroslav Kalousek)
● CPI
● Yoko onen (intended: Yoko Ono)
● jihoamerické tanky
● kat Perry (intended: Katy Perry)
● laso Vegas (intended: Las Vegas)
● Světový ekonomický fór (intended: Světové ekonomické fórum, the World Economic Forum)
● Ústí nad Labem x Brisbane nad Austrálií
● Homosexuální pára
● Křišťálový lup


Syntax & Parsing


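Ambiguity

Slunečníky nahradily deštníky. ("Parasols replaced umbrellas." or "Umbrellas replaced parasols.")

Pět nemocnic zrušilo ministerstvo. ("The ministry closed five hospitals." or "Five hospitals abolished the ministry.")

Návrh putuje k Ústavnímu soudu v době, kdy je neúplný. ("The proposal goes to the Constitutional Court at a time when it is incomplete." It is unclear whether the proposal or the court is incomplete.)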

Ambiguity

Old men and women are hard to live with.

I saw her duck.

The chicken is too hot to eat.

The mayor is a dirty street fighter.

Happily they left.

Terry loves his wife and so do I.


Parser

Try Stanford Online Dependency Parser

http://nlp.stanford.edu:8080/corenlp/process

End-to-end example: Keboola Connection


Vectors


Vector methods

One way to bridge natural language and classical ML

After transforming to vectors, integration with ML systems is easy

Applications:

● Search

● Text classification

● Preprocessing / feature extraction for any ML task, e.g. neural networks: image -> vector -> text


Vector methods: bag of words

Preprocessing – tokenization, stemming/lemmatization, cleaning

Create a vector d with dimension V (size of vocabulary)

d_i = tf_i (term frequency of the i-th word)

A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
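The example above can be reproduced with a short sketch. The stop-word list and the one-entry lemma table are assumptions chosen so that the result matches the slide; the vocabulary order is the one used in the slide's vector.

```python
from collections import Counter

STOP_WORDS = {"a", "and", "on"}   # assumed stop-word list for this sentence
LEMMAS = {"slept": "sleep"}       # hypothetical stand-in for a lemmatizer

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, map words to lemmas, count."""
    tokens = text.lower().split()
    return Counter(LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS)

bag = bag_of_words("A black cat and a white cat slept on a mat")

# A fixed vocabulary order turns the bag into a vector.
vocabulary = ["black", "white", "cat", "sleep", "mat"]
vector = [bag[word] for word in vocabulary]

print(dict(bag))
print(vector)
```

In a real setting the vocabulary covers the whole corpus, so the vector has V entries and most of them are zero.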


Vector methods: bag of words improved

Fancier values instead of tf (e.g., tf-idf)

Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
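Adding n-grams to the bag can be sketched as below, reusing the cat sentence from the bag-of-words example. The helper name `ngrams` is made up for this illustration, and no stop-word filtering is applied here, so bigrams such as "a black" also appear.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a black cat and a white cat slept on a mat".split()
bigram_counts = Counter(ngrams(tokens, 2))

print(bigram_counts["black cat"], bigram_counts["white cat"])
```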


Vector methods: dimensionality reduction

Each term is a feature - very big dimension

Dimensionality reduction

LSI (LSA) – term-document matrix decomposition

LDA – topic inference using probabilistic graphical model

Word2vec – transform words to vectors of given size, capture their context

gensim Python library


Vector methods: Latent Semantic Indexing (LSI)

goal: map semantically similar documents to similar vectors

{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}

reduce dimensionality by singular value decomposition (SVD) of the term-document matrix

addresses synonymy to some degree and, to a lesser extent, homonymy

From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...

Source: Jialu Liu: Topic Model


Vector methods: Latent Dirichlet Allocation (LDA)

topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten

topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy

Finds probability distributions of topics for documents and words for topics

Source: Jialu Liu: Topic Model


Vector methods: Latent Dirichlet Allocation (LDA)

From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...

0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +


Vector methods: Word2vec

Takes a word's local context into account (plain bag of words ignores it); uses either the skip-gram or the continuous bag of words (CBOW) architecture

vector arithmetic: king - man + woman ≈ queen

uses a shallow neural network

research shows an analogy to matrix factorization
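The vector arithmetic can be illustrated without a trained model. The three-dimensional vectors below are hand-made toy values, chosen so that the analogy works; real word2vec vectors are learned from a corpus and typically have a few hundred dimensions.

```python
import math

# Hand-made toy vectors, NOT output of a trained model.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.2, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity, excluding the three input words.
candidates = {word: vec for word, vec in vectors.items() if word not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda word: cosine(target, candidates[word]))

print(nearest)
```

Excluding the input words from the candidates is a common convention in analogy queries; otherwise one of the inputs can end up being the closest vector.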


Vector methods: Word2vec

model.most_similar(positive=['nuclear'])

[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]