Natural Language Processing
17 June 2017
Approaches to NLP
• Rule-based approach – from circa 1950
• Machine learning – from circa 1980
Approaches to NLP – Rule based
• Chomsky (1957): Syntactic structures
• Machine translation from IBM & Georgetown
Time flies like an arrow.
Ženu holí stroj. (Czech; multiply ambiguous – e.g. "A machine drives the woman with a stick" or "I chase the machine with a stick")
Approaches to NLP – Machine learning
• Statistical methods, machine learning, including neural networks
• Importance of exact evaluation
• Data, data, data
• Annotation
Machine learning
• Unsupervised
• Finding hidden structure in data
• For example clustering
• Supervised
• Requires training data with correct answers
Data – Corpora
● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European and Canadian parliament proceedings, movie subtitles
● Specialised (e.g. for sentiment)
Sentiment analysis
Sentiment Analysis
● Česká ekonomika zažívá nebývalý boom. (The Czech economy is experiencing an unprecedented boom.)
● Obchod s lidmi zažívá nebývalý boom. (Human trafficking is experiencing an unprecedented boom.)
● Burger King má lepší hranolky než McDonald's. (Burger King has better fries than McDonald's.)
● Baterka je dobrá, ale displej je hroznej. (The battery is good, but the display is terrible.)
● It’s not good, but I still love it.
Sentiment Analysis
● I was happy.
● I was sad.
● I was not happy.
● I have never been happy in my life.
● I have never been so happy in my life.
Sentiment Analysis
● She is pretty.
● She is pretty annoying.
● Bylo to strašně dobrý. (It was terribly good. – a negative intensifier flips to positive)
Sentiment Analysis
● To se vám teda povedlo. (Well, you really nailed that. – sincere or ironic)
● Go read the book.
Sentiment Analysis
● Předchozí verze byla úžasná, výborně se s ní pracovalo a opravdu mi ušetřila práci. Teď si nejsem jistý. (The previous version was amazing, it was great to work with and really saved me work. Now I'm not sure.)
Sentiment Analysis
● Jsme veselí. (We are cheerful.)
● Ve Veselí mě okradli. (I got robbed in Veselí – a town whose name means "cheerful".)
● Krásná u Aše je naprosto děsná. (Krásná u Aše – a village whose name means "beautiful" – is absolutely dreadful.)
● Pan Šťastný mi leze na nervy. (Mr Šťastný – "Mr Happy" – gets on my nerves.)
● That was a bad ass burger!
● Bob’s Bad Breath Burger is delicious.
Sentiment Analysis
● Aš je příšerná prdel. (Aš is a horrible dump. – vulgar)
● Aš je pěkná prdel. (Aš is quite a dump.)
● Krásná u Aše je pěkná prdel. (Krásná u Aše is quite a dump.)
● To si ze mě děláš prdel. (You must be kidding me.)
● S Honzou je prdel. (Honza is great fun. – the same vulgar word, now positive)
Sentiment classification in python
Go to Jupyter Sandbox in Keboola Connection
Naive Bayes
Features considered independently
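The naive Bayes idea – every word treated as an independent feature – can be sketched in a few lines of pure Python. The training sentences below are invented for illustration; a real classifier would be trained on an annotated corpus:

```python
import math
from collections import Counter

# Toy labelled training data (invented for illustration)
train = [
    ("i was happy", "pos"), ("i love this phone", "pos"),
    ("great battery great screen", "pos"),
    ("i was sad", "neg"), ("terrible screen", "neg"),
    ("i hate this awful phone", "neg"),
]

def fit(data):
    """Count words per class; each word is an independent feature."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """argmax over classes of log P(c) + sum of log P(w|c), Laplace-smoothed."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

wc, cc, vocab = fit(train)
print(predict("i was so happy", wc, cc, vocab))   # pos
```

Note that the model sums log-probabilities per word, so "I was not happy" confuses it – exactly the independence limitation discussed above.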
Evaluation
Evaluation: Precision/Recall
False positive → bad precision
False negative → bad recall
Evaluation: Confusion matrix

                     Predicted positive   Predicted negative
Real positive        true positive        false negative
Real negative        false positive       true negative
Evaluation: Confusion matrix

                     Predicted positive   Predicted negative
Real positive        10                   5
Real negative        3                    16
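Plugging the numbers from this matrix into the standard definitions:

```python
# Values from the confusion matrix above
tp, fn = 10, 5   # real positives: found / missed
fp, tn = 3, 16   # real negatives: wrongly flagged / correctly rejected

precision = tp / (tp + fp)   # of all predicted positives, how many are real
recall    = tp / (tp + fn)   # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.769 0.667 0.714
```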
Evaluation
● Multiple possible answers
● Not all errors are equally important
● Inter-annotator agreement (very low for tagging with an open tag set)
Machine learning – Overfitting
Naive Bayes
Features considered independently
Neural network
● fully connected layers
● other architectures possible
Discovery Analysis
Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags are not a big deal (as opposed to media tags)
Yelp Sample – 160k Restaurant Reviews
Simple tagging
1. Tokenization: split text into words
2. Drop unimportant words: stop words
3. Find important words: tf-idf
tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!
tf(pizza) = 5
tf – term frequency (same review)

tf(pizza) = 5, tf(the) = 15
idf – inverse document frequency
idf(the) = log(160,000 / 152,000) = log(1.05)
idf(pizza) = log(160,000 / 11,800) = log(13.56)
idf(Tokyo) = log(160,000 / 194) = log(824.74)
tf-idf
w(t, doc) = log(1 + tf(t)) * log(N / df(t))    (base-2 logarithms)

w(the, doc) = log(1+15) * log(1.05) = 0.28
w(pizza, doc) = log(1+5) * log(13.56) = 9.72
w(Tokyo, doc) = log(1+0) * log(824.74) = 0
Simple, yet works well
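The weights above can be reproduced directly from the Yelp counts (base-2 logs; the slide rounds 160,000/152,000 to 1.05, so the exact value for "the" comes out slightly higher):

```python
import math

N = 160_000                                             # total documents (Yelp reviews)
df = {"the": 152_000, "pizza": 11_800, "tokyo": 194}    # document frequencies
tf = {"the": 15, "pizza": 5, "tokyo": 0}                # counts in the example review

def tfidf(term):
    # w(t, doc) = log2(1 + tf) * log2(N / df)
    return math.log2(1 + tf[term]) * math.log2(N / df[term])

for t in ("the", "pizza", "tokyo"):
    print(t, round(tfidf(t), 2))
# the 0.3 / pizza 9.72 / tokyo 0.0
```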
Go to python notebooks
Simple tags in python
Tokenization

● Vignátová-Mimochodská, Merriam-Webster's, Tumu-M'Pongo
● 10-year, one-liners, self-proclaimed, je-li
● United Kingdom-United States relations
● 5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00
● 10000, 10 000, 10,000
● 3.14159, 10.12., 10. prosince, U.S.A., H2O
● km/h, A/C, s/he, byl(a), °C
● N40° 44.9064', W073° 59.0735'
● www.noviny.cz/clanek-o-necem [email protected]
● cos, tys, polívkus; proň, ses, abys; křížem krážem
Tokenization
● Arbitrary decisions have to be made.
● Stick to them consistently.
● Pre-trained models may work poorly when fed differently tokenized data.
eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!
tf(eat) + tf(ate) + tf(eaten) = 3, but tf(eat) alone = 1
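Conflating the forms so the counts add up can be done with a lemma lookup. A toy sketch – the tiny hard-coded dictionary stands in for a real morphological lexicon:

```python
from collections import Counter

# Toy lemma dictionary; real lemmatizers use full morphological dictionaries
LEMMAS = {"ate": "eat", "eaten": "eat", "eats": "eat", "eating": "eat"}

def lemma(word):
    return LEMMAS.get(word.lower(), word.lower())

tokens = ["eaten", "here", "twice", "what", "I", "ate", "a", "place", "to", "eat"]
tf = Counter(lemma(t) for t in tokens)
print(tf["eat"])   # 3  (eat + ate + eaten)
```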
Lemmatization &
Morphology
Processing Morphology
Lemmatization: word → lemma (dictionary form)
  Peter saw her. → Peter see she.
  státu → stát-1_^(státní_útvar)

POS tagging: word → tag
  Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
  saw → {[see, verb], [saw, noun]}
  ženou → {[hnát, verb], [žena, noun]}

Morpheme segmentation: nej-ne-ob-hospod-ař-ova-teln-ějš-ího

Generation: hnát + verb+3+pl+present → ženou
Morphology – not so easy
matk-a – mat-e-k – matc-e – matč-in
city – citi-es, goose – geese, sheep – sheep, go – went, jít – šla
Stuhl – Stühl-e, Vater – Väter
po-trub-í (*potrub, *trubí)
Povltaví, Pobaltí, potrubí, pobřeží
Morphology – not so easy
Tagalog (Philippines):
basa ‘read’ → b-um-asa ‘read (past)’
sulat ‘write’ → s-um-ulat ‘wrote’
rare in English: abso-bloody-lutely
Arabic, Hebrew – templates
Choice of lemma depth
inflection: sedaček → sedačka, debates → debate; brought, brings, bringing → bring
negation: nezdravý → zdravý, unreasonable → reasonable
gradation: nejvyšší → vysoký (but Nejvyšší soud, "the Supreme Court", is a name)
Morphology: not so easy - derivation
hubnutí – hubnout
sportovec – sport – sportovní
řez – řezbář – řezník
dát – vzdát, jít – pojít
unloosen = loosen; unnerve, unearth
Zipf’s law

A word’s frequency is inversely proportional to its frequency rank.
Consequences
● Pareto’s rule (80 : 20)
○ One can achieve “reasonable” quality fast
○ Costs of additional improvements rise “exponentially” (long tail)
● Ambiguity and fuzziness on every layer of language
Part-of-speech tagging
Part-of-speech tagging
I love hiking through the woods on weekends .
PRP VBP VBG IN DT NNS IN NNS .
● Penn Treebank tagset: ca. 40 tags; VBD – verb in past tense
● Czech positional tagset: 4000+ tags; VpNS---XR-AA---
Petrov et al. – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
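Universal tags are usually produced by mapping a fine-grained tagset onto them. A partial, hand-written mapping for a few Penn Treebank tags (illustrative subset only, not the full published mapping), applied to the tag sequence from the hiking example:

```python
# Partial Penn Treebank -> universal tag mapping (illustrative subset)
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "PRP": "PRON", "DT": "DET", "IN": "ADP",
    "CD": "NUM", "CC": "CONJ", "RP": "PRT", "FW": "X", ".": ".",
}

# "I love hiking through the woods on weekends ."
penn = ["PRP", "VBP", "VBG", "IN", "DT", "NNS", "IN", "NNS", "."]
universal = [PENN_TO_UNIVERSAL[t] for t in penn]
print(universal)
# ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'NOUN', '.']
```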
Penn Treebank tagset
Entities
Example – Švejk – characters
Entities – named and non-named
● Named entities: personal names, organizations, geographical names
● Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time
● Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain
Entities – some basic challenges
● Types – fuzziness, hierarchy
○ Facebook – product or company?
○ European Union – organization or place?
● Embedded entities
○ [Dr.] Martin Luther King [Jr.]
○ [The [New England] Journal of Medicine]
○ [Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
○ [Univerzita Karlova v Praze] vs [Univerzita Karlova] v Praze
● List look-up is not enough
○ Washington, The police, ANO, Šanca na Skok z Mosta do Siete
Entities – ML

● annotation – tag tokens with labels like PERSON_START, PERSON_CONT
● popular classifier – CRF
● features
○ word shape (case, is alphanumeric, etc.)
○ morphological features
○ gazetteers
○ distsim, word2vec
○ labels already assigned to previous word(s)
○ add features of surrounding tokens, previous instances of the same word, use n-grams …
● can use two passes
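A token-level feature extractor of the kind fed to a CRF might look like the following hand-rolled sketch (the gazetteer and label values are invented; a real system would use large gazetteers and the features listed above):

```python
GAZETTEER = {"london", "paris", "prague"}   # toy place-name list

def word_shape(w):
    """Xxxx for capitalized words, dddd for digits, xxxx otherwise."""
    if w.isdigit():
        return "d" * len(w)
    if w[0].isupper():
        return "X" + "x" * (len(w) - 1)
    return "x" * len(w)

def features(tokens, i, prev_label):
    """Features for token i, including context and the previous label."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "shape": word_shape(w),
        "is_alnum": w.isalnum(),
        "in_gazetteer": w.lower() in GAZETTEER,
        "prev_label": prev_label,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(features(["He", "visited", "Prague", "."], 2, "O"))
```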
Entities/Tags – remaining issues

● Coreference – increases tf (phrase importance)
○ Miloš Zeman = pan prezident ("the president") = on ("he") = [dropped pronoun] přišel
● Standardization
○ iPads > iPad, Windows != Window
○ Spojené státy != Spojený stát, česká dráha
○ The first stage has landed on Of Course I Still Love You.
○ Zazpíval Bratříčku zavírej vrátka. (He sang "Bratříčku, zavírej vrátka" – a song title)
● Normalization
○ USA = United States of America = United States ~ America
○ Hillary Rodham = Hillary Clinton
Tales from production ...
● Škoda Octavius
● Livius Klausová
● Miroslava Kalousek
● CPI
● Yoko onen
● jihoamerické tanky
● kat Perry
● laso Vegas
● Světový ekonomický fór
● Ústí nad Labem x Brisbane nad Austrálií
● Homosexuální pára
● Křišťálový lup
Syntax & Parsing
Slunečníky nahradily deštníky. (Parasols replaced umbrellas – or umbrellas replaced parasols; subject and object look alike.)
Pět nemocnic zrušilo ministerstvo. (Word order suggests "Five hospitals abolished the ministry", but the intended reading is "The ministry closed five hospitals.")
Návrh putuje k Ústavnímu soudu v době, kdy je neúplný. (The bill goes to the Constitutional Court at a time when "it" is incomplete – the bill, or the court?)
Ambiguity
Old men and women are hard to live with.
I saw her duck.
The chicken is too hot to eat.
The mayor is a dirty street fighter.
Happily they left.
Terry loves his wife and so do I.
Ambiguity
Parser
Try Stanford Online Dependency Parser
http://nlp.stanford.edu:8080/corenlp/process
End-to-end example:Keboola Connection
Vectors
Vector methods
One way to bridge natural language and classical ML
After transforming to vectors, integration with ML systems is easy
Applications: Search
Text classification
Preprocessing / feature extraction for any ML task, e.g., neural networks: image -> vector -> text
Vector methods: bag of words
Preprocessing – tokenization, stemming/lemmatization, cleaning
Create a vector d with dimension V (size of vocabulary)
d_i = tf_i (term frequency of the i-th word)
A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
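The whole pipeline – tokenize, drop stop words, lemmatize, count – fits in a few lines. The stop-word list and lemma dictionary below are toy stand-ins for real resources:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "on", "in"}   # toy stop-word list
LEMMAS = {"slept": "sleep", "cats": "cat"}           # toy lemma dictionary

def bag_of_words(text):
    """Lower-case, drop stop words, lemmatize, count the rest."""
    tokens = [w.lower() for w in text.split()]
    tokens = [LEMMAS.get(w, w) for w in tokens if w not in STOP_WORDS]
    return Counter(tokens)

bag = bag_of_words("A black cat and a white cat slept on a mat")
print(dict(bag))   # {'black': 1, 'cat': 2, 'white': 1, 'sleep': 1, 'mat': 1}
```

Fixing a vocabulary order then turns the Counter into the dense vector d shown above.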
Vector methods: bag of words improved
Fancier values instead of tf. (e.g., tf-idf)
Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
Vector methods: dimensionality reduction
Each term is a feature → very high dimensionality
Dimensionality reduction:
LSI (LSA) – term-document matrix decomposition
LDA – topic inference using probabilistic graphical model
Word2vec – transform words to vectors of given size, capture their context
gensim Python library
Vector methods: Latent Semantic Indexing (LSI)
goal: map semantically similar documents to similar vectors
{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}
reduce dimensionality by singular value decomposition (SVD) of the term-document matrix
addresses synonymy to some extent, homonymy to a lesser extent
From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...
Source: Jialu Liu: Topic Model
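The SVD step itself is a one-liner in numpy. A minimal sketch on a made-up 4-term × 3-document count matrix: after truncating to 2 latent dimensions, the two car/truck documents land close together and the flower document far away:

```python
import numpy as np

# Made-up term-document count matrix: rows = terms, columns = documents
terms = ["car", "truck", "flower", "petal"]
X = np.array([
    [2, 1, 0],   # car
    [1, 2, 0],   # truck
    [0, 0, 3],   # flower
    [0, 0, 2],   # petal
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                     # keep the 2 strongest latent "topics"
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a 2-d vector

print(np.round(docs_2d, 2))
```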
Vector methods: Latent Dirichlet Allocation (LDA)
topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten
topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy
Finds probability distributions of topics for documents and words for topics
Source: Jialu Liu: Topic Model
Vector methods: Latent Dirichlet Allocation (LDA)
From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...
0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +
Vector methods: Word2vec
Doesn’t ignore word order, uses either skip-grams or continuous bag of words (CBOW)
vector arithmetic king - man + woman = queen
uses neural networks
research shows analogy to matrix factorization
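The king − man + woman ≈ queen arithmetic can be illustrated with hand-made toy vectors (real word2vec embeddings have hundreds of learned dimensions; these 3-d values are invented, with dimensions loosely meaning royalty/male/female):

```python
import numpy as np

# Invented 3-d "embeddings"; dims roughly: (royalty, male, female)
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Cosine-nearest vocabulary word to vector v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], v))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))   # queen
```

Excluding the query words mirrors what gensim's most_similar does; with these toy vectors, queen wins even without the exclusion.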
Vector methods: Word2vec
model.most_similar(positive=['nuclear'])
[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]