NLP research at UBC
TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
• Subjectivity Detection, Domain Adaptation, Rhetorical Parsing
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP)
COLLABORATIONS: MSResearch
04/18/23 CPSC503 Winter 2009 2
http://people.cs.ubc.ca/~rjoty/Webpage/
Linguistic Knowledge Formalisms and associated Algorithms
• (English) Morphology: State Machines (no prob.), namely Finite State Automata (and Regular Expressions) and Finite State Transducers
• Syntax: Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Semantics: Logical formalisms (First-Order Logics)
• Pragmatics, Discourse and Dialogue: AI planners
Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  e.g., bought <-> buy +V +PAST-PART
                   buy +V +PAST
  ….
• Stemming: word -> stem
  ….
Today Sept 16
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
FST definition (Recap.)
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
• Q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• δ: a transition relation that maps Q x Σ to 2^Q
E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 <= |δ| <= ?
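One way to work through the example is to enumerate the sets directly. The sketch below is only an illustration: it assumes Σ contains every pair in I x O, and it counts δ as a set of (source state, i:o, target state) triples. The names EPS, SIGMA and max_delta are introduced here and are not part of the definition.

```python
from itertools import product

EPS = ""  # assumption of this sketch: the empty string stands in for ε

Q = {0, 1, 2}                 # |Q| = 3
I = {"a", "b", "c", EPS}      # input alphabet, including ε
O = {"a", "b"}                # output alphabet

# Σ: every complex symbol i:o with i in I and o in O (assumes all pairs are allowed)
SIGMA = set(product(I, O))

# δ viewed as a set of (source, i:o, target) triples: the largest possible
# relation has one triple per combination of source, symbol and target.
max_delta = len(Q) * len(SIGMA) * len(Q)

print(len(SIGMA))   # 8
print(max_delta)    # 72
```

Under these assumptions |Σ| = 4 x 2 = 8 and 0 <= |δ| <= 3 x 8 x 3 = 72. A different way of counting δ (e.g., as a function from Q x Σ) would give a different bound.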
FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from IxO
• Generators: output a string from IxO
Terminology warning!
E.g., if I = {a, b}; O = {a, b, ε}; ……
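The translator and recognizer roles can be illustrated with a toy machine over I = {a, b} and O = {a, b, ε}. This is a minimal sketch, not a full FST implementation: the arcs are invented for illustration, ε is allowed only on the output side, and the names ARCS, translate and recognize are hypothetical.

```python
EPS = ""  # assumption: the empty string stands in for ε

# Hypothetical toy FST: each arc is (source, input, output, target).
ARCS = [
    (0, "a", "b", 0),   # a:b
    (0, "b", "a", 0),   # b:a
    (0, "b", EPS, 1),   # b:ε (consume b, write nothing)
]
START, FINALS = 0, {0, 1}

def translate(s):
    """Translator: return every output string the machine pairs with input s."""
    configs = {(START, "")}                    # (state, output written so far)
    for ch in s:
        configs = {(t, out + o)
                   for (q, out) in configs
                   for (p, i, o, t) in ARCS
                   if p == q and i == ch}
    return {out for (q, out) in configs if q in FINALS}

def recognize(inp, out):
    """Recognizer: accept the pair (inp, out) if out is a translation of inp."""
    return out in translate(inp)

print(translate("ab"))         # the set {'ba', 'b'} (print order may vary)
print(recognize("ab", "b"))    # True
```

Because the machine is non-deterministic, translate returns a set of outputs instead of a single string.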
FST: inflectional morphology of plural
[Figure: FST with arcs for some regular nouns and some irregular nouns (e.g., o:i)]
Notes: arc labels are lexical:surface pairs; a single symbol X abbreviates X:X
Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Ambiguity: more complex example
• What’s the right parse for Unionizable?
– Union-ize-able
– Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj……
Deal with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose…… look at the neighboring words
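Returning all paths can be sketched as an exhaustive search over segmentations. The sketch below is a simplification: it uses a plain recursive search in place of an FST, and a hypothetical lexicon MORPHEMES that lists surface chunks directly (so "iz" stands for -ize with its final e already dropped).

```python
# Hypothetical lexicon for this sketch: surface chunk -> morpheme label.
MORPHEMES = {
    "un": "un",
    "ion": "ion",
    "union": "union",
    "iz": "ize",      # surface form of -ize before -able
    "able": "able",
}

def all_parses(word):
    """Return every way to split word into known morphemes (no choosing)."""
    if not word:
        return [[]]
    parses = []
    for end in range(1, len(word) + 1):
        chunk = word[:end]
        if chunk in MORPHEMES:
            for rest in all_parses(word[end:]):
                parses.append([MORPHEMES[chunk]] + rest)
    return parses

for parse in all_parses("unionizable"):
    print("-".join(parse))
# un-ion-ize-able
# union-ize-able
```

Both analyses come back, and a later component such as a part-of-speech tagger would pick between them.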
(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change. Examples:
• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed is added to a word ending in -y, -y changes to -ie or -i respectively (e.g., butterfly, try)
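The two rules can be sketched as a small suffixing function. This is an illustration with plain string operations, not a transducer, and add_suffix is a hypothetical name. One assumption goes beyond the rule as stated: Y-replacement is restricted to consonant + y, so that boy + -s stays boys.

```python
import re

def add_suffix(stem, suffix):
    """Attach -s or -ed to stem, applying E-insertion and Y-replacement."""
    # E-insertion: stem ends in s, z, sh, ch or x and the suffix is -s
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    # Y-replacement (assumed here to need a consonant before the y)
    if re.search(r"[^aeiou]y$", stem):
        if suffix == "s":
            return stem[:-1] + "ies"   # y -> ie, then -s
        if suffix == "ed":
            return stem[:-1] + "ied"   # y -> i, then -ed
    return stem + suffix

for stem in ["kiss", "waltz", "bush", "watch", "box", "butterfly", "cat"]:
    print(stem, "->", add_suffix(stem, "s"))
print("try ->", add_suffix("try", "ed"))
```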
Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
– ^ morpheme boundary
– # word boundary
Multi-Level Tape Machines
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
[Figure: FST-1 and FST-2 applied in sequence]
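FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST with arcs for some regular nouns and some irregular nouns (e.g., o:i); word-final arcs are labeled #]
Note: the arc +PL:^s# abbreviates the arc sequence +PL:^ ε:s ε:#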
Example
lexical:      f o x +N +PL
intermediate: …
lexical:      m o u s e +N +PL
intermediate: …
FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2 for the E-insertion rule; includes an arc labeled #:ε]
Examples
intermediate: f o x ^ s #
surface:      …
intermediate: b o x ^ i n g #
surface:      …
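The two-level cascade can be sketched with two ordinary functions standing in for FST-1 and FST-2. This is an illustration in the generation direction only, not a real transducer pair. The mini-lexicon, the function names, and the intermediate form chosen for irregular plurals (mice#) are assumptions of the sketch.

```python
import re

# Hypothetical mini-lexicon for this sketch.
REGULAR_NOUNS = {"fox", "box", "cat"}
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}

def lexical_to_intermediate(lexical):
    """Stand-in for FST-1: 'fox +N +PL' -> 'fox^s#'."""
    stem, *features = lexical.split()
    if features == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"   # assumed: no ^ for irregulars
        if stem in REGULAR_NOUNS:
            return stem + "^s#"
    raise ValueError(f"no path for {lexical!r}")

def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: apply E-insertion, then erase ^ and #."""
    s = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    return s.replace("^", "").replace("#", "")

for lexical in ["fox +N +PL", "cat +N +PL", "mouse +N +PL"]:
    mid = lexical_to_intermediate(lexical)
    print(lexical, "->", mid, "->", intermediate_to_surface(mid))
# fox +N +PL -> fox^s# -> foxes
# cat +N +PL -> cat^s# -> cats
# mouse +N +PL -> mice# -> mice
```

When the rule's context is absent, as in box^ing#, the second stage only erases the boundary symbols and yields boxing.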
Intersection (FST1, FST2) = FST3
For all i,j,n,m,a,b δ3((q1i,q2j), a:b) = (q1n,q2m) iff
– δ1(q1i, a:b) = q1n AND
– δ2(q2j, a:b) = q2m
• States of FST1 and FST2 : Q1 and Q2
• States of intersection: (Q1 x Q2)
• Transitions of FST1 and FST2 : δ1, δ2
• Transitions of intersection : δ3
[Figure: an arc a:b from q1i to q1n in FST1 and an arc a:b from q2j to q2m in FST2 yield an arc a:b from (q1i,q2j) to (q1n,q2m) in FST3]
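The transition rule for intersection can be sketched as a product construction. The representation is an assumption of this sketch: each machine is a set of (source, (a, b), target) triples, start and final states are left out, and ε is treated like any other symbol.

```python
def intersect(delta1, delta2):
    """Keep an a:b arc between paired states only if both machines have it."""
    return {
        ((q1, q2), label1, (n1, n2))
        for (q1, label1, n1) in delta1
        for (q2, label2, n2) in delta2
        if label1 == label2
    }

# Hypothetical toy machines.
delta1 = {(0, ("a", "b"), 1), (1, ("c", "c"), 0)}
delta2 = {("x", ("a", "b"), "y"), ("y", ("a", "a"), "x")}

print(intersect(delta1, delta2))
# {((0, 'x'), ('a', 'b'), (1, 'y'))}
```

Only the a:b arc survives, because it is the only label the two machines share.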
Composition(FST1, FST2) = FST3
• States of FST1 and FST2 : Q1 and Q2
• States of composition : Q1 x Q2
• Transitions of FST1 and FST2 : δ1, δ2
• Transitions of composition : δ3
For all i,j,n,m,a,b δ3((q1i,q2j), a:b) = (q1n,q2m) iff there exists c such that
– δ1(q1i, a:c) = q1n AND
– δ2(q2j, c:b) = q2m
[Figure: an arc a:c from q1i to q1n in FST1 and an arc c:b from q2j to q2m in FST2 yield an arc a:b from (q1i,q2j) to (q1n,q2m) in FST3]
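The composition rule can be sketched in the same style. The representation is again an assumption: each machine is a set of (source, (input, output), target) triples, start and final states are omitted, and ε gets no special handling, which a full implementation would need.

```python
def compose(delta1, delta2):
    """Build an a:b arc wherever FST1 has a:c and FST2 has c:b for some c."""
    return {
        ((q1, q2), (a, b), (n1, n2))
        for (q1, (a, c1), n1) in delta1
        for (q2, (c2, b), n2) in delta2
        if c1 == c2
    }

# Hypothetical toy machines: FST1 rewrites a as c, FST2 rewrites c (or d) as b.
delta1 = {(0, ("a", "c"), 1)}
delta2 = {("x", ("c", "b"), "y"), ("x", ("d", "b"), "y")}

print(compose(delta1, delta2))
# {((0, 'x'), ('a', 'b'), (1, 'y'))}
```

The intermediate symbol c disappears from the result, which is what lets a cascade of tapes collapse into one machine.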
FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996): 46 x 10^3 stems; 3.4 x 10^6 word forms
• Arabic (2002?): 131 x 10^3 stems; 7.7 x 10^6 word forms
Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
– Finding word boundaries in text (?!) …maxmatch
– Finding sentence boundaries: punctuation… but "." is ambiguous (see the example in Fig. 3.22)
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11)
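The maxmatch idea mentioned under tokenization can be sketched as a greedy longest-match loop. This is plain Python, not an FST. The word list and the one-character fallback for unknown material are assumptions of the sketch.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # nothing matches: emit one character
            i += 1
    return tokens

WORDS = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", WORDS))
# ['theta', 'bled', 'own', 'there']
```

The example shows why the "(?!)" is warranted: the greedy choice of theta blocks the intended reading "the table down there".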
Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  e.g., bought <-> buy +V +PAST-PART
                   buy +V +PAST
  ….
• Stemming: word -> stem
  ….
Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules of the form (condition) S1 -> S2
– ATIONAL -> ATE (relational -> relate)
– (*v*) ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
– ization -> -ize: computerize
– ize -> ε: computer
• Errors occur:
– organization -> organ, university -> universe
Code freely available in most languages: Python, Java,…
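The cascade of rewrite rules can be sketched with a few suffix functions applied in order. This is a toy illustration built only from the rules listed above. It is not the Porter algorithm, which has many more rules and conditions, and all function names are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_ational(word):
    # ATIONAL -> ATE
    return word[:-7] + "ate" if word.endswith("ational") else word

def step_ization(word):
    # IZATION -> IZE
    return word[:-7] + "ize" if word.endswith("ization") else word

def step_ize(word):
    # IZE -> ε
    return word[:-3] if word.endswith("ize") else word

def step_ing(word):
    # (*v*) ING -> ε: strip only if the remaining stem contains a vowel
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word

STAGES = [step_ational, step_ization, step_ize, step_ing]

def toy_stem(word):
    """Feed the output of each stage into the next one."""
    for stage in STAGES:
        word = stage(word)
    return word

for w in ["relational", "motoring", "computerization", "sing", "organization"]:
    print(w, "->", toy_stem(w))
# relational -> relate
# motoring -> motor
# computerization -> computer
# sing -> sing
# organization -> organ
```

Even this toy version reproduces the organization -> organ error, since the rules look only at suffixes and never at meaning.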
Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on stems they contain)
Seems to work especially well with smaller documents
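The three steps can be sketched end to end with a placeholder stemmer. Everything here is an assumption made for illustration: crude_stem is not Porter, the documents are invented, and Jaccard overlap is just one possible similarity measure, since the steps above do not name one.

```python
def crude_stem(word):
    """Placeholder stemmer (NOT Porter): strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(text):
    """Lowercase, split on whitespace, and stem every token."""
    return {crude_stem(w) for w in text.lower().split()}

def similarity(query, document):
    """Jaccard overlap between the two stem sets (an assumed measure)."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0

docs = ["foxes hunting mice", "stock markets falling"]
index = {doc: stems(doc) for doc in docs}          # step 1: stem the documents

query = "fox hunts"                                # step 2: stem the query
ranked = sorted(docs, key=lambda d: similarity(query, d), reverse=True)
print(ranked[0])                                   # step 3: best match first
# foxes hunting mice
```

Without stemming, the query would share no token with either document, so neither would match.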
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer