Natural Language Processing
Chapter Two: Syntax
Binyam Tekalign, Debre Birhan University, 28 October 2014
Natural Language Processing
• NLP is the branch of computer science focused on developing systems
that allow computers to communicate with people using everyday
language.
• Also called Computational Linguistics
• Also concerns how computational methods can aid the understanding of
human language
Word Classes and POS Tagging
Background
• Part of speech:
• Noun, verb, pronoun, preposition, adverb, conjunction, particle, and article
• Recent lists of POS (also known as word classes, morphological classes, or lexical tags) have
much larger numbers of word classes.
• 45 for Penn Treebank
• http://www.cs.colorado.edu/~martin/SLP/Figures/
• 87 for the Brown corpus,
• http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
• 146 for the C7 tagset
• http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
Why Do We Care about Parts of Speech?
•Predicting what words can be expected next
Personal pronoun (e.g., I, she) ____________
•Stemming
-s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction
I will lead the group into the lead smelter.
•Machine translation
•Text Summarization
English Word Classes
• Two broad subcategories of POS:
1. Closed class
2. Open class
Cont..
1. Closed class
– Having relatively fixed membership, e.g., prepositions
– Function words:
• Grammatical words like of, and, or you,
• very short, occur frequently, and play an important role in grammar.
2. Open class
• Four major open classes occurring in the languages of the world:
• nouns, verbs, adjectives, and adverbs.
Open Class: Noun
• The class in which the words for most people, places, or things occur
• Thus, nouns include
• Concrete terms, like ship, and chair,
• Abstractions like bandwidth and relationship, and
• Verb-like terms like pacing
• Nouns in English have the ability
• To occur with determiners (a goat),
• To take possessives (IBM’s annual revenue), and
• To occur in the plural form (goats)
Open Class: Noun
• Nouns are traditionally grouped into proper nouns and common nouns.
• Proper nouns:
• IBM, Abebe
• Common nouns
• Count nouns:
• both singular and plural (goat/goats),
• Mass nouns:
• snow, salt,
Open Class: Verb
• Most of the words referring to actions and processes including main verbs like
• draw, provide, differ, and go.
• A number of morphological forms:
• non-3rd-person-sg (eat),
• 3rd-person-sg (eats),
• progressive (eating),
• past participle (eaten)
Open Class: Adjectives
• Terms describing properties or qualities
• Most languages have adjectives for the concepts of color (white, black), age (old,
young), and value (good, bad), but
• There are languages without adjectives, e.g., Chinese.
Open Class: Adverbs
• Words viewed as modifying something (often verbs)
• Directional (or locative) adverbs: specify the direction or location of some action,
• here, downhill
• Degree adverbs: specify the extent of some action, process, or property,
• extremely, very, somewhat
• Manner adverbs: describe the manner of some action or process,
• slowly, slinkily, delicately
• Temporal adverbs: describe the time that some action or event took place,
• yesterday, Monday
Closed Classes
• Some important closed classes in English
• Prepositions: on, under, over, near, by, at, from, to, with
• Determiners: a, an, the
• Pronouns: she, who, I, others
• Conjunctions: and, but, or, as, if, when
• Auxiliary verbs: can, may, should, are
• Particles: up, down, on, off, in, out, at, by
• Numerals: one, two, three, first, second, third
Prepositions
• Prepositions occur before noun phrases; semantically, they are relational
Prepositions (and particles) of English from CELEX
Particles
• A particle is a word that resembles a preposition or an adverb, and that often
combines with a verb to form a larger unit called a phrasal verb
English single-word particles from Quirk et al. (1985)
Conjunctions
• Conjunctions are used to join two phrases, clauses, or sentences.
• and, or, but
Coordinating and subordinating conjunctions of English, from the CELEX on-line dictionary.
Closed Classes: Pronouns
• Pronouns act as a kind of shorthand for referring to some noun
phrase or entity or event.
• Personal pronouns: persons or entities (you, she, I, it, me, etc)
• Possessive pronouns: forms of personal pronouns indicating actual
possession or just an abstract relation between the person and some
objects.
• Wh-pronouns: used in certain question forms, or may act as
complementizer.
Closed Classes: Auxiliary Verbs
• Auxiliary verbs: mark certain semantic features of a main verb
English modal verbs from the CELEX on-line dictionary.
Closed Classes: Others
• Interjections: oh, ah, hey, man,
• Negatives: no, not
• Politeness markers: please, thank you
• Greetings: hello, goodbye
Tagsets for English
• There are a small number of popular tagsets for English, many of which evolved
from the 87-tag tagset used for the Brown corpus.
• Three are commonly used:
• The small 45-tag Penn Treebank tagset
• The medium-sized 61-tag C5 tagset used by the Lancaster UCREL project’s CLAWS tagger to tag
the British National Corpus, and
• The larger 146-tag C7 tagset
Part-of-Speech Tagging
• POS tagging (tagging)
• The process of assigning a POS or other lexical marker to each word in a corpus.
• Also applied to punctuation marks
• Compared with tokens in programming languages, tags for NL are much more ambiguous.
• Taggers play an increasingly important role in speech recognition, NL parsing and IR
An Example
WORD    the   girl   kissed   the   boy    on     the   cheek
LEMMA   the   girl   kiss     the   boy    on     the   cheek
TAG     +DET  +NOUN  +VPAST   +DET  +NOUN  +PREP  +DET  +NOUN
Labelling words for POS can be done by
• dictionary lookup
• morphological analysis
• “tagging”
Part-of-Speech Tagging
• The input to a tagging algorithm is a string of words and a specified tagset of the
kind described previously.
VB   DT   NN     .
Book that flight .
VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?
• Automatically assigning a tag to a word is not trivial
– For example, book is ambiguous: it can be a verb or a noun
– Similarly, that can be a determiner, or a complementizer.
Part-of-Speech Tagging
• Many tagging algorithms fall into two classes:
• Rule-based taggers
• Involve a large database of hand-written disambiguation rules
• Typically more than 1000 hand-written rules
• Stochastic taggers
• Resolve tagging ambiguities by using a training corpus to compute the probability of a
given word having a given tag in a given context.
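The counting idea behind stochastic taggers can be sketched with the simplest possible model: ignore context and estimate, for each word, how often it carried each tag in training. The tiny hand-tagged corpus and all function names below are assumptions made for illustration; they are not taken from any particular tagger.

```python
from collections import Counter, defaultdict

# Toy hand-tagged training corpus (an assumption for illustration only).
TRAIN = [
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("does", "VBZ"), ("that", "DT"), ("flight", "NN"), ("serve", "VB"), ("dinner", "NN")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("here", "RB")],
    [("read", "VB"), ("the", "DT"), ("book", "NN")],
]

def train_counts(corpus):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts

def tag_probability(counts, word, tag):
    """Relative-frequency estimate of P(tag | word); 0.0 for unseen words."""
    word_counts = counts.get(word, Counter())
    total = sum(word_counts.values())
    return word_counts[tag] / total if total else 0.0

def most_frequent_tag(counts, word, default="NN"):
    """Pick the tag seen most often with this word; fall back to a default tag."""
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]

counts = train_counts(TRAIN)
print(tag_probability(counts, "book", "VB"))   # share of training tokens of "book" tagged VB
print(most_frequent_tag(counts, "book"))
```

Because this sketch ignores context, it assigns book the same tag in every sentence, including Book that flight. That is exactly the ambiguity that stochastic taggers address by also conditioning on the surrounding tags.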
Syntax
• Syntax: from Greek syntaxis, “setting out together”
• Refers to the way words are arranged together, and the relationships between
them.
• Goal of syntax is
• to model the knowledge that people unconsciously have about the grammar of their
native language
Main ideas of syntax:
• Constituency
• Groups of words may behave as a single unit or phrase
• e.g., NP
• Grammatical relations
• A formalization of ideas from traditional grammar about SUBJECT and OBJECT
• E.g., in She ate her breakfast, She is the SUBJECT and her breakfast is the OBJECT
• Subcategorization and dependencies
• Referring to certain kinds of relations between words and phrases,
• e.g., the verb want can be followed by an infinitival phrase, as in I want to fly to Detroit.
Constituency
• NP:
• A sequence of words surrounding at least one noun, e.g.,
• three parties from Brooklyn arrive …
• they sit
• the reason he comes into the Hot Box
• Preposed or postposed constructions,
• e.g., the PP, on September seventeenth, can be placed in a number of different locations
• On September seventeenth, I’d like to fly from Atlanta to Denver.
• I’d like to fly on September seventeenth from Atlanta to Denver.
• I’d like to fly from Atlanta to Denver on September seventeenth.
NPs
• NP -> Pronoun
• I came, you saw it, they conquered
• NP -> Proper-Noun
• Los Angeles is west of Texas
• John Hennessy is the president of Stanford
• NP -> Det Noun
• The president
• NP -> Nominal
• Nominal -> Noun Noun
• A morning flight to Denver
Syntax
• Why should we care?
• Grammar checkers
• Question answering
• Information extraction
• Machine translation
Cont..
• A context-free grammar is a notation for describing languages.
• It is more powerful than finite automata or regular expressions,
• but it still cannot define all possible languages.
04/21/2023 Speech and Language Processing - Jurafsky and Martin
Context-Free Grammars
• Terminals
• We’ll take these to be words
• Non-Terminals
• The constituents in a language
• Like Noun, noun phrase NP, verb phrase VP and sentence S
• A start symbol S, which is a member of the non-terminals
• Rules
• Rules are equations that consist of a single non-terminal on the left and any number of
terminals and non-terminals on the right.
Cont..
• A → B C
• Means that A can be rewritten as B followed by C, regardless of the context in
which A is found
• e.g., S → NP VP
• A language that is defined by some CFG is called a context-free language.
Cont..
• Noun → flight | breeze | trip | morning | …
• Verb → is | prefer | like | need | want | fly …
• Adjective → cheapest | non-stop | first | latest
• Pronoun → me | I | you | it | …
• Proper-Noun → Alaska | Baltimore | Chicago | …
• Determiner → the | a | an | this | these | that
• Preposition → from | to | on | near | …
• Conjunction → and | or | but | …
Rules
• S → NP VP                  I + want a morning flight
• NP → Pronoun               I
•     | Proper-Noun          Los Angeles
•     | Det Nominal          a + flight
• Nominal → Noun Nominal     morning + flight
•     | Noun                 flights
• VP → Verb                  do
•     | Verb NP              want + a flight
•     | Verb NP PP           leave + Boston + in the morning
•     | Verb PP              leaving + on Thursday
• PP → Preposition NP        from + Los Angeles
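To see how these rules rewrite S step by step, the following sketch performs a leftmost derivation. It is a minimal illustration under stated assumptions: the dictionary encoding, the function names `derive` and `matches`, and the use of Det as the category name for determiners in both rules and lexicon are choices made here, not part of the slides.

```python
# Phrase-structure rules from the slide; each non-terminal maps to its alternatives.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Proper-Noun"], ["Det", "Nominal"]],
    "Nominal": [["Noun", "Nominal"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Preposition", "NP"]],
}

# Part of the lexicon from the slides ("Det" stands for Determiner here).
LEXICON = {
    "Noun": {"flight", "breeze", "trip", "morning"},
    "Verb": {"is", "prefer", "like", "need", "want", "fly"},
    "Pronoun": {"me", "I", "you", "it"},
    "Det": {"the", "a", "an", "this", "these", "that"},
}

def derive(start, choices):
    """Rewrite the leftmost phrasal non-terminal once per entry in `choices`.

    Each entry is the index of the alternative to apply. Returns every
    sentential form of the derivation, starting with (start,).
    """
    form = [start]
    steps = [tuple(form)]
    for choice in choices:
        idx = next(i for i, sym in enumerate(form) if sym in RULES)
        form[idx:idx + 1] = RULES[form[idx]][choice]
        steps.append(tuple(form))
    return steps

def matches(categories, words):
    """Check that each word belongs to the word class at the same position."""
    return len(categories) == len(words) and all(
        word in LEXICON.get(cat, set()) for cat, word in zip(categories, words)
    )

# S → NP VP, NP → Pronoun, VP → Verb NP, NP → Det Nominal,
# Nominal → Noun Nominal, Nominal → Noun
steps = derive("S", [0, 0, 1, 2, 0, 1])
for form in steps:
    print(" ".join(form))
print(matches(steps[-1], "I want a morning flight".split()))
```

The last sentential form consists only of word classes, and the final check confirms that I want a morning flight fits that sequence.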
Sentence-Level Constructions
• There are a great number of possible overall sentence structures, but
• four are particularly common and important:
• Declarative structure,
• imperative structure,
• yes-no-question structure,
• wh-question structure.
Sentence-Types
• Declaratives: A plane left
S -> NP VP
• Imperatives: Leave!
S -> VP
• Yes-No Questions: Did the plane leave?
S -> Aux NP VP
• WH Questions: When did the plane leave?
S -> WH Aux NP VP
Sentences with declarative structure
– A subject NP followed by a VP
• The flight should be eleven a.m. tomorrow.
• I need a flight to Seattle leaving from Baltimore making a stop in Minneapolis.
• The return flight should leave at around seven p.m.
• I want a flight from Atlanta to Chicago.
• I plan to leave on July first around six thirty in the evening.
• S → NP VP
Sentence with imperative structure
– Begin with a VP and have no subject.
– Often used for commands and suggestions
• Show the lowest fare.
• Show me the cheapest fare that has lunch.
• List all flights between five and seven p.m.
• Show me all the flights leaving Baltimore.
• Show me flights arriving within thirty minutes of each other.
• Show me the last flight to leave.
– S → VP
Sentences with yes-no-question structure
– Begin with auxiliary, followed by a subject NP, followed by a VP.
• Do any of these flights have stops?
• Does American’s flight eighteen twenty five serve dinner?
• Can you give me the same information for United?
– S → Aux NP VP
The wh-subject-question structure
– Identical to the declarative structure, except that the first NP contains
some wh-word.
• What airlines fly from Burbank to Denver?
• Which flights serve breakfast?
• Which of these flights have the longest layover in Nashville?
– S → Wh-NP VP
• The wh-non-subject-question structure
• What flights do you have from Atlanta to Washington?
– S → Wh-NP Aux NP VP
Auxiliaries
• Auxiliaries or helping verbs
– A subclass of verbs
– Including the modal verbs can, could, may, might, must, will, would, shall, and should
– The perfect auxiliary have,
– The progressive auxiliary be, and
– The passive auxiliary be.
Parsing
• derive the syntactic structure of a sentence based on a language model (grammar)
• construct a parse tree, i.e. the derivation of the sentence based on the grammar (rewrite
system)
Outline
• Language, Syntax, Parsing
• Problems in Parsing: Ambiguity
• Bottom-up vs. Top-down Parsing
• Chart Parsing
• Earley Algorithm
04/21/2023 COSC 709: Natural Language Processing
Sample Grammar
Grammar (S, NT, T, P)
• Sentence symbol S ∈ NT
• Parts of speech ⊂ NT, constituents ⊂ NT
• Terminals (words) ∈ T
• Grammar rules P ⊂ NT × (NT ∪ T)*

S → NP VP (statement)
S → Aux NP VP (question)
S → VP (command)
NP → Det Nominal
NP → Proper-Noun
Nominal → Noun | Noun Nominal | Nominal PP
VP → Verb | Verb NP | Verb PP | Verb NP PP
PP → Prep NP

Det → that | this | a
Noun → book | flight | meal | money
Proper-Noun → Houston | American Airlines | TWA
Verb → book | include | prefer
Aux → does
Prep → from | to | on
Parsing Task
Parse "Does this flight include a meal?"
Sample Parse Tree
[S [Aux does]
   [NP [Det this] [Nominal [Noun flight]]]
   [VP [Verb include]
       [NP [Det a] [Nominal [Noun meal]]]]]
Problems in Parsing
Ambiguity
• “Peter saw Mary with the telescope”
• Syntactic/structural ambiguity: several parse trees are possible, e.g. for the sentence above
• Semantic/lexical ambiguity: several word meanings, e.g. bank (where you get
money) and (river) bank
• Even different word categories are possible (interim), e.g. “He books the flight.” vs.
“The books are here.”
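The structural ambiguity of the telescope sentence can be made concrete by counting parse trees. The sketch below uses a CKY-style table over a small grammar in Chomsky normal form. The grammar, the lexical categories, and the name `count_parses` are assumptions chosen for this illustration; they do not come from the slides.

```python
from collections import defaultdict

# Toy binary grammar (an assumption for illustration): (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the seeing event
    ("NP", "NP", "PP"),   # the PP modifies the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Toy lexicon: proper names are treated directly as NPs.
LEXICAL = {
    "Peter": ["NP"], "Mary": ["NP"], "saw": ["V"],
    "with": ["P"], "the": ["Det"], "telescope": ["N"],
}

def count_parses(words, start="S"):
    """Return the number of distinct parse trees of `words` rooted in `start`."""
    n = len(words)
    # table[(i, j)][X] = number of trees deriving words[i:j] from category X
    table = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICAL.get(word, []):
            table[(i, i + 1)][cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point between the two children
                for parent, left, right in BINARY:
                    table[(i, j)][parent] += table[(i, k)][left] * table[(k, j)][right]
    return table[(0, n)][start]

print(count_parses("Peter saw Mary with the telescope".split()))
print(count_parses("Peter saw Mary".split()))
```

Under this toy grammar the first sentence has two trees, one for each attachment of the PP, while the shorter sentence has one.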
Bottom-up and Top-down Parsing
• Bottom-up Parsing – from word-nodes to sentence-symbol
• Top-down Parsing – from sentence-symbol to words
[Figure: parse tree for "does this flight include a meal", with S at the root and the words at the leaves]
Top Down Parsing
[Figure sequence: top-down parse of "book that flight", one expansion per slide; X marks a failed attempt]
• S → VP, then VP → Verb NP, with book recognized as the Verb
• NP → Pronoun fails (X): that is not a Pronoun
• NP → ProperNoun fails (X): that is not a ProperNoun
• NP → Det Nominal: that is recognized as the Det
• Nominal → Noun: flight is recognized as the Noun
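The top-down search in these slides can be sketched as a small backtracking recognizer that expands the leftmost symbol and only consults the input when it reaches a word class. This is a minimal sketch, not the lecture's own implementation. The names `expand` and `recognize` are hypothetical, the Pronoun entries are an assumption, and the left-recursive rule Nominal → Nominal PP is deliberately left out because a naive top-down parser would loop on it forever.

```python
# Sample grammar without the left-recursive rule Nominal → Nominal PP.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Pronoun"], ["Proper-Noun"], ["Det", "Nominal"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "PP"], ["Verb", "NP", "PP"]],
    "PP": [["Prep", "NP"]],
}

LEXICON = {
    "Det": {"that", "this", "a"},
    "Noun": {"book", "flight", "meal", "money"},
    "Verb": {"book", "include", "prefer"},
    "Aux": {"does"},
    "Prep": {"from", "to", "on"},
    "Pronoun": {"I", "you", "it"},        # assumed entries for illustration
    "Proper-Noun": {"Houston", "TWA"},
}

def expand(symbols, words, pos):
    """Yield every input position reachable after deriving `symbols` from words[pos:]."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Top-down step: try each alternative for the leftmost non-terminal.
        for rhs in GRAMMAR[first]:
            yield from expand(rhs + rest, words, pos)
    elif pos < len(words) and words[pos] in LEXICON.get(first, ()):
        # Word class matches the next input word: consume it.
        yield from expand(rest, words, pos + 1)
    # Otherwise this branch fails (an X in the slides) and the search backtracks.

def recognize(sentence):
    """True if some derivation from S covers the whole sentence."""
    words = sentence.split()
    return any(end == len(words) for end in expand(["S"], words, 0))

print(recognize("book that flight"))
print(recognize("does this flight include a meal"))
print(recognize("book that"))
```

For book that flight, the search first tries the alternatives that start with NP and Aux, fails on each, and succeeds through S → VP, VP → Verb NP, NP → Det Nominal.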
Bottom Up Parsing
[Figure sequence: bottom-up parse of "book that flight", one step per slide; X marks a failed attempt]
• book is first tagged Noun and reduced to Nominal
• Extending that Nominal with a following Noun fails (X), and so does extending it with a following PP (X)
• that is tagged Det; flight is tagged Noun and reduced to Nominal; Det Nominal is reduced to NP
• Building an S that uses this NP together with a VP fails (X), so the Noun reading of book is abandoned (X)
• book is retagged Verb and reduced to VP
• S → VP fails (X) because that flight is left uncovered
• Extending the VP with a following PP fails (X)
• VP → Verb NP covers book that flight
• S → VP completes the parse
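The bottom-up search shown in the slides can be approximated by a naive recognizer that keeps replacing any right-hand side found in the current string of symbols with its left-hand side until only S remains. This is a sketch under assumptions: the rule list encodes the sample grammar, the proper nouns are omitted, and the names `reductions` and `recognize` are hypothetical. It explores all reduction orders, which also illustrates why pure bottom-up parsing is wasteful.

```python
# Sample grammar as (LHS, RHS) pairs; lexical rules have a word as RHS.
RULES = [
    ("S", ("NP", "VP")), ("S", ("Aux", "NP", "VP")), ("S", ("VP",)),
    ("NP", ("Det", "Nominal")),
    ("Nominal", ("Noun",)), ("Nominal", ("Noun", "Nominal")), ("Nominal", ("Nominal", "PP")),
    ("VP", ("Verb",)), ("VP", ("Verb", "NP")), ("VP", ("Verb", "PP")), ("VP", ("Verb", "NP", "PP")),
    ("PP", ("Prep", "NP")),
    ("Det", ("that",)), ("Det", ("this",)), ("Det", ("a",)),
    ("Noun", ("book",)), ("Noun", ("flight",)), ("Noun", ("meal",)), ("Noun", ("money",)),
    ("Verb", ("book",)), ("Verb", ("include",)), ("Verb", ("prefer",)),
    ("Aux", ("does",)),
    ("Prep", ("from",)), ("Prep", ("to",)), ("Prep", ("on",)),
]

def reductions(form):
    """Yield every form obtained by replacing one RHS occurrence with its LHS."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == rhs:
                yield form[:i] + (lhs,) + form[i + n:]

def recognize(sentence):
    """Depth-first search over sentential forms, from the words up to S."""
    start = tuple(sentence.split())
    stack, seen = [start], {start}
    while stack:
        form = stack.pop()
        if form == ("S",):
            return True
        for nxt in reductions(form):
            if nxt not in seen:       # never explore the same form twice
                seen.add(nxt)
                stack.append(nxt)
    return False                      # every path ended in a dead end (X)

print(recognize("book that flight"))
print(recognize("book that"))
```

Forms never grow longer than the input and each form is visited at most once, so the search terminates even with the left-recursive rule Nominal → Nominal PP included.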
Problems with Bottom-up and Top-down Parsing
• Problems with left-recursive rules like NP → NP PP:
• don’t know how many times recursion is needed
• Pure Bottom-up or Top-down Parsing is inefficient because
• it generates and explores too many structures which in the end turn out to be invalid.
• Combine top-down and bottom-up approach:
• Start with sentence; use rules top-down (look-ahead); read input; try to find shortest path
from input to highest unparsed constituent (from left to right).
• Chart-Parsing / Earley-Parser
Chart Parsing / Earley Algorithm
• Earley-Parser based on Chart-Parsing
• Essence: Integrate top-down and bottom-up parsing.
• Top-down:
• Start with S-symbol.
• Generate all applicable rules for S.
• Go further down with left-most constituent in rules and add rules for these constituents until you
encounter a left-most node on the RHS which is a word category (POS).
• Bottom-up:
• Read input word and compare.
• If word matches, mark as recognized and move parsing on to the next category in the rule(s).
Chart
A chart is a graph with n+1 nodes marked 0 to n for a sequence of n input words.
Arcs indicate the recognized part of the RHS of a rule.
The • indicates recognized constituents in rules.
A directed acyclic graph representation of the three dotted rules above.
Chart Parsing / Earley Parser 1
Chart
– Sequence of n input words; n+1 nodes marked 0 to n.
– States in chart represent possible rules and recognized constituents.
– RHS of recognized rule is covered by arc.
Interim state
S → • VP, [0,0]
– top-down look at rule S → VP
– nothing of RHS of rule yet recognized (• is far left)
– arc at beginning, no coverage (covers no input word; beginning of arc at node 0 and end of arc at node 0)
Chart Parsing / Earley Parser 2
Interim states
NP → Det • Nominal, [1,2]
– top-down look with rule NP → Det • Nominal
– Det recognized (• after Det)
– arc covers one input word which is between node 1 and node 2
– look next for Nominal, top-down
NP → Det Nominal •, [1,3]
– Nominal was recognized, move • after Nominal
– move end of arc to cover Nominal; change 2 to 3
– structure is completely recognized; arc is inactive;
– mark NP as recognized in other rules (move •), bottom up
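A dotted rule with its arc can be held in a small record. The sketch below is one possible representation, with hypothetical names (`State`, `advance`, `next_category`); it reproduces the two interim states discussed above.

```python
from typing import NamedTuple, Optional, Tuple

class State(NamedTuple):
    """A dotted rule plus the chart nodes its arc spans (illustrative layout)."""
    lhs: str
    rhs: Tuple[str, ...]
    dot: int      # number of RHS symbols already recognized
    start: int    # chart node where the arc begins
    end: int      # chart node where the arc currently ends

    def is_complete(self) -> bool:
        # The dot is at the far right: the whole RHS has been recognized.
        return self.dot == len(self.rhs)

    def next_category(self) -> Optional[str]:
        # The symbol immediately to the right of the dot, if any.
        return None if self.is_complete() else self.rhs[self.dot]

    def advance(self, new_end: int) -> "State":
        # Move the dot over one recognized constituent and extend the arc.
        return self._replace(dot=self.dot + 1, end=new_end)

    def __str__(self) -> str:
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"{self.lhs} → {' '.join(symbols)}, [{self.start},{self.end}]"

partial = State("NP", ("Det", "Nominal"), dot=1, start=1, end=2)
done = partial.advance(3)
print(partial, "| next:", partial.next_category())
print(done, "| complete:", done.is_complete())
```

Using an immutable tuple means a state can be compared for equality, which is what a chart needs in order to avoid adding the same state twice.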
Input for all charts: Book this flight
Chart - 1
V (Book)
VP → V • NP
S → • VP
NP → • Det Nom
Chart - 2
V (Book), Det (this)
VP → V • NP
S → • VP
NP → Det • Nom
Chart - 3a
V (Book), Det (this), Noun (flight)
VP → V • NP
S → • VP
NP → Det • Nom
Nom → Noun •
Chart - 3b
V (Book), Det (this), Noun (flight)
VP → V • NP
S → • VP
NP → Det Nom •
Nom → Noun •
Chart - 3c
V (Book), Det (this), Noun (flight)
VP → V NP •
S → • VP
NP → Det Nom •
Nom → Noun •
Chart - 3d
V (Book), Det (this), Noun (flight)
VP → V NP •
S → VP •
NP → Det Nom •
Nom → Noun •
Earley Algorithm - Functions
predictor
– generates new rules for partly recognized RHS with constituent right of • (top-down
generation)
scanner
– if word category (POS) is found right of the •, the Scanner reads the next input word and
adds a rule for it to the chart (bottom-up mode)
completer
– if rule is completely recognized (the • is far right), the recognition state of earlier rules in the
chart advances: the • is moved over the recognized constituent (bottom-up recognition).
Earley – Chart for “book that flight” (from 2nd edition)
Earley-Algorithm
function EARLEY-PARSE(words, grammar) returns chart
  ENQUEUE((γ → • S, [0,0]), chart[0])
  for i from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return(chart)

procedure PREDICTOR((A → α • B β, [i,j]))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j]), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j]))
  if B ⊂ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j] •, [j,j+1]), chart[j+1])
  end

procedure COMPLETER((B → γ •, [j,k]))
  for each (A → α • B β, [i,j]) in chart[j] do
    ENQUEUE((A → α B • β, [i,k]), chart[k])
  end

procedure ENQUEUE(state, chart-entry)
  if state is not already in chart-entry then
    PUSH(state, chart-entry)
  end
• I have a car
• I have an expensive car
• What are you doing
• I am running
• Get out of jail
• Do not do that
• Could you please give me the coffee
Sentence-Types
• Declaratives: A plane left
S -> NP VP
• Imperatives: Leave!
S -> VP
• Yes-No Questions: Did the plane leave? Does he know the case?
S -> Aux NP VP
• WH Questions: When did the plane leave?
S -> WH Aux NP VP
An Exercise: The city hall parking lot in town
• NP → NP NP PP
• NP → Det Nom
• NP → Adj Nom
• NP → Nom Nom
• Nom → NP Nom
• Nom → N
• PP → Prep NP
• N → city | hall | lot | town
• Adj → parking
• Prep → to | for | in
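Earley Chart for “I have a car”.
State                 Operation
S → • NP VP           Predictor
S → • Aux NP VP       Predictor
S → • VP              Predictor
VP → • V NP           Predictor
VP → • V NP PP        Predictor
VP → • V PP           Predictor
NP → • Det Nom        Predictor
NP → • Pro            Predictor
Nom → • N             Predictor
Pro → I •             Scanner
NP → Pro •            Completer
S → NP • VP           Completer
VP → • V NP           Predictor
VP → • V NP PP        Predictor
VP → • V PP           Predictor
V → have •            Scanner
VP → V • NP           Completer
VP → V • PP           Completer
NP → • Det Nom        Predictor
NP → • Pro            Predictor
Det → a •             Scanner
NP → Det • Nom        Completer
Nom → • N Nom         Predictor
Nom → • N PP          Predictor
Nom → • N             Predictor
N → car •             Scanner
Nom → N •             Completer
NP → Det Nom •        Completer
VP → V NP •           Completer
S → NP VP •           Completer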