Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
IAAA / PSTALNIntroduction to natural language processing
Benoit Favre <[email protected]>
Aix-Marseille Université, LIF/CNRS
last generated on December 1, 2019
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 1 / 29
PSTALN: class outline
2019/12/02▶ Class: intro on tasks and
problems▶ Practical: intro to pytorch
2019/12/09▶ Word representations▶ Word embedding evaluation
2019/12/16▶ Named entities and other
tagging tasks▶ Noun phrase tagging
2020/01/06▶ Language modeling▶ Generation
2020/01/13▶ Machine translation▶ Grapheme to phoneme
conversion2020/01/20
▶ Dialog systems▶ Document classification
2020/01/27▶ Adaptation▶ Cross-lingual classification
2020/01/31▶ Exam
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 2 / 29
What is Natural Language Processing?
What is Natural Language Processing (NLP)?Allow computer to communicate with humans using everydaylanguageTeach computers to reproduce human behavior regarding languagemanipulationLinked to the study of human language through computers(Computational Linguistics)
Why is it difficult?People do not follow rules strictly when they talk or write: “r uready?”Language is ambiguous: “time flies like an arrow”Input can be noisy: speech recognition in the subway
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 3 / 29
Example: generate annotations
My kids have lost a card, the TV doesn't work
If I understood, you have lost your decoder card
Yes, error code S03
We will send you another one by mail
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 4 / 29
Example: generate annotations
Customer: My kids have lost a card, the TV doesn't work
Acknowledge
Losing
Possession (card)Owner (kids) Entity (TV)
Prop(Function)
Awareness
Cognizer (Agent)
Content (Losing)
Owner (Customer) Possession (decoder card)
Means(error code S03)
Sending
Sender(Agent) Recipent (Customer) Theme(decoder card)
NegationResult
Pro
ble
m s
tate
men
t
Restatement
Complete
Result
Agent: If I understood, you have lost your decoder card
Customer: Yes, error code S03
Agent: I will send you another one by mail
Res
olu
tio
n
Agree
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 5 / 29
NLP is everywhere
Spell checker / grammar correction (Word)Information retrieval / search (Google)Machine translation (Google)Information extraction (Ask.com)Question answering (Jeopardy)Automatic summarization (Google news)Call routing (Telcos)Sentiment analysis (Amazon)Spam filtering (Email)Writing recognition (Cheque processing)Voice dictation (Dragon, Nuance)Speech synthesis (In-car GPS)Dialog systems (Siri/OK Google/Alexa...)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 6 / 29
Domains related to NLP
Artificial intelligenceFormal language theoryMachine learningLinguisticsPsycholinguisticsCognitives sciencesPhilosophy of language
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 7 / 29
Communication channel
From the point of view of the source (the speaker)1 Intent: the message we want to communicate2 Generation: the message in linguistic form3 Production: the muscular action which leads to sound production
From a receiver point of view (listener)1 Perception: how the sound is transmitted to neurons2 Analysis: interpretation of the linguistic message (syntactic,
semantic...)3 Integration: believe or not the information, reply...
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 8 / 29
Processing levels
“John loves Mary”1 Lexical : segment character stream in words, identify linguistic units
John/firstname-male loves/verb-love Mary/firstname-female2 Syntax : identify grammatical structures
(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))3 Semantic : represent meaning
love(person(John), person(Mary))4 Pragmatic : what is the function of that sentence in context?
Is it reciprocal ?Since when ?What does it entail ? know(John, Mary)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 9 / 29
Modular approach
question
answer
Automatictranscription
Syntacticanalysis
Semanticanalysis
Dialogmanager
Syntacticgeneration
Lexicalgeneration
Speechsynthesis
wordssyntactic
tree
concepts,relations
words,prosody
primitivesyntax
logicalrepresentation
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 10 / 29
Language ambiguityPhonetic
▶ I don’t know! – I don’t - no!GraphicalPhonetic and graphical
▶ I live by the bank (river bank or financial institution)Etymology
▶ I met an Indian (from India or native American)▶ I love American wine (from USA or from the Americas)
Syntactic▶ He looks at the man with a telescope▶ He gave her cat food
Referential▶ She is gone. Who?
Notational conventions▶ Birth date: 08/01/05
(wikipedia)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 11 / 29
Basic NLP tasksMeta
▶ Author/speaker traits recognition▶ Genre classification
Syntax▶ Word / sentence segmentation▶ Morphological analysis▶ Part-of-speech tagging▶ Syntactic chunking▶ Syntactic parsing
Semantic▶ Word sense disambiguation▶ Semantic role labeling▶ Logical form creation
Pragmatic▶ Coreference resolution▶ Discourse parsing
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 12 / 29
Syntax: Word segmentation
Character sequence → word sequence (tokenization)Split according to delimiters [ :,.!?’]What about compounds? Multiword expressions?URLs (http://www.google.com), variable names(theMaximumInTheTable)In Chinese, no spaces between words:
▶ 男孩喜歡冰淇淋。→ 男孩 (the boy) 喜歡 (likes) 冰淇淋 (ice cream) 。
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 13 / 29
Syntax: Morphological analysis
Split words in relevant factorsGender and number
▶ flower, flower+s, floppy, flopp+iesVerb tense
▶ parse, pars+ing, pars+edPrefixes, roots and suffixes
▶ geo+caching▶ re+do, un+do, over+do▶ pre+fix, suf+fix▶ geo+local+ization
Agglutinative languages▶ pronouns are glued to the verb (Arabic, spanish...)
Rich morphology▶ Turkish, Finish
→ Lemmatization task: find canonical word form
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 14 / 29
Syntax: Part-of-speech tagging
Syntactic categories
Noun Adverb Discourse markerProper name Determiner Foreign wordsVerb Preposition PunctuationAdjective Conjunctions Pronouns
Each word can have multiple categoriesExample : time flies like an arrow
▶ flies: verb or noun?▶ like: preposition or verb?
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 15 / 29
Syntax: Syntactic analysis
Constituency parsing
Dependency parsing
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 16 / 29
Semantics: Word sense disambiguation (WSD)
What is the sense of each word in its context?red: color? wine? communist?fly: what birds do? insect?bank: river? financial institution?book: made of paper? make a reservation?
Word meaning highly depends on domainapple: fruit? company?to pitch: a ball? a product? a note?
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 17 / 29
Semantics: Semantic parsing
Syntax is ambiguousThe man opens the doorThe door opensThe key opens the door
Semantic rolesWho performed the action? the agentWho receives the action? the patientWho helps making the action? the instrumentWhen, where, why?
John sold his car to his brother this morningagent predicate instrument patient time
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 18 / 29
Pragmatics: Reference resolution
Link all references to the same entity▶ “Alexander Graham Bell (March 3, 1847 –August 2, 1922)[4] was a
Scottish-born[N 3] scientist, inventor, engineer, and innovator who iscredited with patenting the first practical telephone.” (Wikipedia)
AmbiguityPronouns (it, she, he, we, you, who, whose, both...)Noun phrases (the young man, the former president, the company...)Proper names (”Victoria”: South-African city, Canadian region,Queen, model...)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 19 / 29
Pragmatics: Discourse analysisRelationship between sentences of a text, argument structure.
“Fully Automated Generation of Question-Answer Pairs for Scripted Virtual Instruction”,Kuyten et al, 2012Relation type (Rhetorical Structure Theory)
BackgroundElaborationPreparationContrastObjective
CauseCircumstancesInterpretationJustificationReformulation
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 20 / 29
Pragmatics: Create a logical form
Predicate representation▶ Can be used to infer new
John loves Mary but it is not reciprocal.
∃x, y,name(x, ‘‘John”) ∧ name(y, ‘‘Mary”) ∧ loves(x, y) ∧ not(loves(y, x))
John sold his car this morning to his brother.
∃x, y, z,name(x, ‘‘John”) ∧ brother(x, y) ∧ car(z)∧owns(x, z) ∧ sell(x, y, z) ∧ time(‘‘morning”)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 21 / 29
History of natural language processing1950: Theory (Turing test, Chomsky grammars)
▶ Automatic translation during the cold war1960: Toy systems
▶ SHRDLU “place the red box next to the blue circle”, ELIZA “thetherapist”
1970:▶ Prolog (logic-base language for NLP), Dictionaries of semantic frames
1980: Dictation, Development of grammars1990
▶ Transition “introspection” → “corpus”▶ Evaluation campaigns▶ Neural networks are “forgotten”
2000▶ Machine learning▶ Applications: speech recognition, machine translation
2010...▶ Deep learning
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 22 / 29
Notion of corpus
Language in the wild▶ Email▶ Forums▶ Chats▶ Speech recordings▶ Video
Manual Annotation of all elements we want to predict▶ Text → topic▶ Sentence → parse tree▶ Review → sentiment
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 23 / 29
General approach
How to approach natural language processing?▶ Introspective approach
⋆ Study what you know about language⋆ Originated from linguistics⋆ Formal models, grammar engineering
▶ Corpus-based approach⋆ Study how people produce language in the wild⋆ The de facto modern approach
An empirical approach1 Task2 Corpus3 Guide4 Annotation5 System6 Evaluation
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 24 / 29
NLP Systems
Input▶ Raw text, audio...▶ Sentences, contextualized words▶ Output of another system
Output▶ n classes (ex: topics)▶ Structure (ex: syntactic parse)▶ Novel text (ex: translation, summary)▶ Commands for a system (ex: chatbots)
Process▶ output = f(input)▶ Deterministic vs random (evaluations need to be repeatable)▶ Parametrisable: output = f(input,parameters)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 25 / 29
NLP as machine learningHow to design a system from examples of annotated data?
▶ For some problems, it’s difficult to engineer an algorithm▶ Rely on statistical regularities for performing the task
Machine learning (ML)▶ find parameters of a function
to minimize errors on atraining corpus
▶ loss/cost functionminimization
NLP as ML tasks▶ Labeling (i.e. word senses)▶ Segmentation (i.e. words in Chinese)▶ Regression (i.e. sentiment intensity)▶ Linking (i.e. dependency parsing)▶ Generation (i.e. machine translation)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 26 / 29
Deep learningNon-linear compositions of differentiable functions
▶ Also called neural networks for historical reasons
Minimize loss by gradient descent▶ Chain rule: derivative of composition as
product of local derivatives▶ Iteratively improve parameters by taking small
steps towards gradient
fx1fx1(y)
x2
fx2(y)f(x1, x2)
y
→ Build complex functions like lego▶ Much more expressive than before
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 27 / 29
New playground for NLP
Fundamental paradigms emerging with deep learning▶ Pre-training on very large corpora (word, sentence, text embeddings)▶ End-to-end learning (propagate supervision to all stages)▶ Sequence-to-sequence learning (complex structures as input and
output)
Advanced approaches interesting in NLP▶ Multitask learning (reuse models for multiple tasks)▶ Zero-shot learning (process never-seen conditions)▶ Adversarial learning (a game between a learner and an anti-learner)▶ Reinforcement learning (supervision in non-derivable settings)
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 28 / 29
ML taking over NLP
1965-2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
mid-2019
0
5
10
15
20
25
30
35
40
Percent of articles in ACL Anthology with titles matching ML words
Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 29 / 29