Empirical Approaches to
Multilingual Lexical Acquisition
Lecturer: Timothy Baldwin
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Lecture 4
Unsupervised Approaches to Lexical
Acquisition: Word Segmentation and MWE
Extraction
Word Segmentation
Basic Task Description
• Take an unsegmented string and dynamically insert the word
boundaries:
spotthebreaksinthisstring : spot the breaks in this string
• Applications in:
⋆ the segmentation of non-segmenting languages (e.g. Japanese,
Chinese, Thai)
⋆ segmenting up a speech signal into words
⋆ modelling child word acquisition
Difficulties in Word Detection
• Interdependence between word and boundary detection
• Segmentation of frequently-occurring strings
ad hoc, 東京都 tou·kyou·to “Tokyo prefecture”
• Inherent boundary ambiguities
大学長 dai·gaku·chou “(university) president”
• Data sparseness
Unsupervised Discovery of Morphemes
(Creutz and Lagus 2002)
Types of morphological linkage
• Inflectional (table : tables, bus : buses/busses, phenomenon
: phenomena)
• Derivational (central : centralise)
• Compositional (house + boat : houseboat)
Semantic linkage
• employ : employer , employee
• Some of these are lost in the mists of time, e.g. ploy :
employ/deploy?
Basic tasks in morphology
• Discovering the morphemes
• Determining the interaction between the morphemes in a given
word
unfathomable
→ un·fathom·able
→ un(NEG)·fathom(V)·able(V→N)
Unsupervised Discovery of Morphemes
• Segment each word token into morphemes without using prior
knowledge of morphological processes or morphemes:
misinformed : mis·inform·ed
reconstruction : re·construct·ion
kahvinjuojallekin : kahvi·n·juo·ja·lle·kin [FI]
• Assume agglutinative or simple inflectional morphology
• Based on orthographic regularities
Creutz and Lagus (2002)
Applications
• Language modelling (speech recognition)
• Language-independent lexicographic tool for determining
stems/word boundaries
• N.B. assumes a corpus of raw text up-front
Creutz and Lagus (2002)
Minimum Description Length
• Minimum Description Length (MDL, a.k.a. Minimum Message
Length or MML) is a method for simultaneously evaluating the
compactness of a particular encoding and the compactness of the
data using that encoding
Description length = model DL + data DL
‘Simplicity vs. Accuracy’
• Too much ‘accuracy’ : overfitting
• How can we compress / generalise the data model?
Creutz and Lagus (2002)
MDL in word segmentation
• In morpheme discovery terms, minimise C:

C = Cost(Source text) + Cost(Codebook)
  = Σ_{tokens} −log p(m_i) + Σ_{types} k × l(m_j)
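The cost C above can be sketched in a few lines of Python (an illustrative reconstruction, not the authors' code: `description_length` and the unigram model over morpheme tokens are assumptions; probabilities are maximum-likelihood estimates over the segmented corpus):

```python
from collections import Counter
from math import log2

def description_length(segmented_corpus, k=5):
    """MDL cost of a morpheme segmentation:
    C = sum over tokens of -log p(m_i)   (source text cost)
      + sum over types  of k * l(m_j)    (codebook cost)."""
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    n = len(tokens)
    text_cost = sum(-log2(counts[m] / n) for m in tokens)
    codebook_cost = sum(k * len(m) for m in counts)
    return text_cost + codebook_cost
```

With k = 5, the word-level segmentation of the woodchuck example reproduces the C = 203.11 figure from the worked slides, and the unsegmented string gives C = 285.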
Creutz and Lagus (2002)
Example: Word splitting
howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood
Creutz and Lagus (2002)
m_i                   −log p(m_i)   freq(m_i)   l(m_i)
howmuchwoodcoulda...         0.00           1       57
Total                        0.00                   57

C = 0.00 + 5 × 57 = 285.00 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   2.70           2        1
chuck               2.70           2        5
could               2.70           2        5
how                 3.70           1        3
if                  3.70           1        2
much                3.70           1        4
wood                2.70           2        4
woodchuck           2.70           2        9
Total              38.11                   33

C = 38.11 + 5 × 33 = 203.11 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   2.91           2        1
chuck               1.91           4        5
could               2.91           2        5
how                 3.91           1        3
if                  3.91           1        2
much                3.91           1        4
wood                1.91           4        4
Total              38.60                   24

C = 38.60 + 5 × 24 = 158.60 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   4.83           2        1
c                   2.37          11        1
d                   3.25           6        1
f                   5.83           1        1
h                   3.25           6        1
i                   5.83           1        1
k                   3.83           4        1
l                   4.83           2        1
m                   5.83           1        1
o                   2.37          11        1
u                   3.03           7        1
w                   3.51           5        1
Total             182.09                   12

C = 182.09 + 5 × 12 = 242.09 (k = 5)
Creutz and Lagus (2002)
Greedy search algorithm
• Read in one word at a time, and recursively select the binary
segmentation of the word which minimises C
• Continue until no split further reduces C
• Dynamically segment even if the word has been observed previously
in the data (removing any previous segmentations from the model)
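The greedy search can be sketched as follows (an illustrative reconstruction under stated assumptions: `counts` stands in for the morpheme model built so far, the cost is the C defined earlier, and the re-segmentation of previously seen words is simplified away):

```python
from collections import Counter
from math import log2

def corpus_cost(counts, k=5):
    """C = source-text cost + codebook cost (counts must be positive)."""
    n = sum(counts.values())
    text = sum(-c * log2(c / n) for c in counts.values())
    codebook = sum(k * len(m) for m in counts)
    return text + codebook

def segment(word, counts, k=5):
    """Try every binary split of `word` (and no split at all); keep
    whichever minimises C, then recurse into both halves."""
    def cost_with(morphs):
        trial = counts.copy()
        trial.update(morphs)          # add candidate morphemes to the model
        return corpus_cost(trial, k)
    best, best_cost = [word], cost_with([word])
    for i in range(1, len(word)):
        c = cost_with([word[:i], word[i:]])
        if c < best_cost:
            best, best_cost = [word[:i], word[i:]], c
    if best == [word]:
        counts[word] += 1             # commit the unsplit morpheme
        return best
    left, right = best
    return segment(left, counts, k) + segment(right, counts, k)
```

For instance, once `wood` is frequent in the model, the split of `woodwood` into `wood·wood` is cheaper than keeping it whole or splitting anywhere else.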
Creutz and Lagus (2002)
Dreaming
• At regular intervals, randomly iterate over words already processed
to remove any globally sub-optimal segmentations
• Dream for a fixed time (over a fixed number of words) OR until no
significant change in C is observed
• Demo:
http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi
Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences
(Kubota Ando and Lee 2003)
Outline
• Unsupervised method for segmenting Japanese kanji sequences
社長兼業務部長
→ 社長|兼|業務|部長 sha·chou|ken|gyou·mu|bu·chou “president
and general business manager”
• Based on character n-grams (sequences of n characters)
Kubota Ando and Lee (2003)
Why the Interest in Kanji?
• Japanese consists of three basic orthographies:
⋆ hiragana: 46 elements (function words, onomatopoeia,
verb/adjective suffixes)
⋆ katakana: 46 elements (words of foreign origin, scientific names)
⋆ kanji: 1,945 “official” elements, more in practice (content words)
• Kanji is the most productive and data-sparse of the three
Kubota Ando and Lee (2003)
Basic Intuition
• N -gram “windows” over a kanji string can be used as predictors of
word boundaries in that:
1. frequent n-grams suggest the lack of a boundary
2. infrequent n-grams suggest the presence of a boundary
spotthebreaks with a 4-gram window:
s p o t t h e b r e a k s
s p o t
  p o t t
    o t t h
...
Kubota Ando and Lee (2003)
Example: spotthebreaks
Word N-gram Freq
spot 119
pott 24
otth 120
tthe 6075
theb 2118
hebr 218
ebre 49
brea 328
reak 246
eaks 57
Kubota Ando and Lee (2003)
Method
• For each potential boundary point:
⋆ For each n-gram order (e.g. 4):
∗ For each of the n-grams to the left and right of the boundary:
◦ What is the proportion of n-grams straddling the boundary
that are less frequent than the n-gram adjoining it?
• v_n = (1 / (2(n−1))) Σ_{d∈{L,R}} Σ_{j=1..n−1} I_>(#(s_d^n), #(t_j^n))
Kubota Ando and Lee (2003)
Returning to our Example
• For the boundary spot|theb:
v_4 = (1 / (2(4−1))) × (1 + 2) = 0.5
• For the boundary pott|hebr:
v_4 = (1 / (2(4−1))) × (0 + 1) ≈ 0.17
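The score can be sketched directly from the definition (a minimal reconstruction; the function name `v_n` and the dict-based frequency table are illustrative):

```python
def v_n(s, k, n, freq):
    """Fraction of the n-grams straddling the boundary at position k of s
    that are rarer than the two non-straddling n-grams adjoining it."""
    adjoining = [s[k - n:k], s[k:k + n]]                    # s_L, s_R
    straddling = [s[k - n + j:k + j] for j in range(1, n)]  # t_1 .. t_{n-1}
    votes = sum(freq[d] > freq[t] for d in adjoining for t in straddling)
    return votes / (2 * (n - 1))
```

Plugging in the n-gram frequencies from the spotthebreaks table reproduces the two values above.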
Kubota Ando and Lee (2003)
Combining n-gram Orders
• To balance out the effects of data sparseness, the method advocates
“voting” across different n-gram orders:

v_N(k) = (1 / |N|) Σ_{n∈N} v_n(k)
• Place boundaries at locations l where EITHER:
1. v_N(l) > v_N(l − 1) and v_N(l) > v_N(l + 1) OR
2. v_N(l) > t
(t = threshold constant)
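The voting and boundary-placement rules combine as follows (a sketch, not the paper's implementation: the default order set and threshold, and scoring out-of-range neighbours as 0, are assumptions):

```python
def v_n(s, k, n, freq):
    # fraction of straddling n-grams rarer than the adjoining n-grams
    adjoining = [s[k - n:k], s[k:k + n]]
    straddling = [s[k - n + j:k + j] for j in range(1, n)]
    return sum(freq[d] > freq[t] for d in adjoining
               for t in straddling) / (2 * (n - 1))

def place_boundaries(s, freq, orders=(4,), t=0.75):
    """Average v_n over the chosen n-gram orders, then place a boundary
    at every local maximum of the averaged score, or wherever it
    exceeds the threshold t."""
    m = max(orders)
    scores = {k: sum(v_n(s, k, n, freq) for n in orders) / len(orders)
              for k in range(m, len(s) - m + 1)}
    return [k for k in scores
            if (scores[k] > scores.get(k - 1, 0.0)
                and scores[k] > scores.get(k + 1, 0.0))
            or scores[k] > t]
```

On the toy spotthebreaks frequencies this recovers the true boundaries after spot and the, plus one spurious high-scoring position; in the paper, tuning N and t on held-out data is what suppresses such false positives.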
Kubota Ando and Lee (2003)
Evaluation
• Extract n-gram statistics from 150MB of newswire data (N.B.
unannotated)
• Hand-annotated 5 held-out datasets containing 500 kanji
sequences each
• For each held-out dataset, use 450 sequences for parameter training
(N, t) and the remaining 50 sequences for evaluation (hence
mostly-unsupervised)
Kubota Ando and Lee (2003)
Words vs. Morphemes
• Two separate evaluations were carried out, based on:
⋆ words: a stem and its affixes are considered as a single unit
[電話器] [deN·wa·ki] “telephone (device)”
⋆ morphemes: stems and affixes are each individual units
[電話][器] [deN·wa][ki] “[telephone][device]”
Kubota Ando and Lee (2003)
Parameter Optimisation
• Run the algorithm over each set of 450 sequences using the different
values of N and t
• Optimise according to one of:
⋆ word precision = correct proposed brackets / proposed brackets
⋆ word recall = correct proposed brackets / actual brackets
⋆ word F(-measure/score) = (2 × precision × recall) / (precision + recall)
• Evaluate according to the optimal parameter setting
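In code, with a bracketing represented as a set of (start, end) spans (an illustrative sketch, not the paper's scorer):

```python
def word_scores(proposed, actual):
    """Precision, recall and F-score over proposed vs. gold brackets."""
    correct = len(proposed & actual)
    precision = correct / len(proposed)
    recall = correct / len(actual)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```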
Kubota Ando and Lee (2003)
Results (Briefly)
• Word performance better than established Japanese morphological
analyzers (JUMAN, ChaSen)
• Word performance considerably better than morpheme performance
• Single-character morphemes greatest source of errors
Kubota Ando and Lee (2003)
Summary
• Method proposed for segmenting up Japanese kanji sequences
based on simple n-gram statistics and voting
• Basic intuition that n-grams straddling word boundaries will be
less-frequent than those adjacent to word boundaries
• Impressive results achieved for such a conceptually simple algorithm
Multiword Expression Discovery
Basic Task Description
• Identify the multiword expression (MWE) types in tagged (or raw)
text from observation of the token distribution
• Recall: a MWE is made up of multiple words and is lexically,
syntactically, semantically, pragmatically and/or statistically
idiosyncratic
• Considerable body of literature on collocation extraction
Lexico-syntactic Idiomaticity

• by and large: ??? → P conj Adj (by and large)
• ad hoc: Adj → ? ? (ad hoc)
• wine and dine: V[trans] → V[intrans] conj V[intrans] (wine and dine)
Semantic Idiomaticity
• kick the bucket = die′
• spill the beans = reveal′(secret′)
• kindle excitement = kindle′(excitement′)
Pragmatic Idiomaticity
• The Wheel of Fortune factor — how to represent the jumble of
phrases stored in the mental lexicon?
• The Monty Python factor — mish-mash of language fragments
which evoke particular events/individuals/memories
Statistical Idiomaticity

            unblemished  spotless  flawless  immaculate  impeccable
eye                   –         –         –           –           +
gentleman             –         –         ?           –           +
home                  ?         +         –           +           ?
lawn                  –         –         ?           +           –
memory                –         –         +           –           ?
quality               –         –         –           –           +
record                +         +         +           +           +
reputation            +         –         –           +           +
taste                 –         –         –           –           +
Adapted from Cruse (1986)
Complications in MWE Discovery
• Working out the extent of the collocation (phrase boundary
detection)
trip the light X
trip the light fantastic V
trip the light fantastic at X
• Fine line between collocations and simple default lexical
combinations
buy a car/purchase power
Collocations vs. MWEs
• A collocation (≈ institutionalised phrase) is an arbitrary and
recurrent word combination
• Collocations tend to stand in opposition to anti-collocations =
lexically-marked synonym-substitutes
• Collocations can be semantically-marked (e.g. dark horse) but tend
to be compositional (e.g. strong coffee)
• Predicative collocations vs. rigid NPs vs. phrasal templates
Retrieving Collocations from Text: Xtract
(Smadja 1993)
Outline
• Automatic method for extracting collocations from raw text based
on n-gram statistics
• Basic intuition: collocations are more rigid syntactically and more
frequent than other word combinations
• Method: attempt to capture this intuition using the basic statistics
of word combinations
Smadja (1993)
Stage 1: Extract Significant Bigrams
• w and wi co-occur (wi is a collocate of w) if they are found in a
single sentence separated by fewer than 5 words
• a bigram (w,wi) is significant iff:
⋆ w and wi co-occur more frequently than chance
⋆ w and wi appear in a relatively rigid configuration
• Divide up the set of collocates according to POS
Smadja (1993)
Example Corpus
... multiword(−1) expressions ...
... multiword(−1) expressions ...
... dialect(−1) expressions ...
... dialect(−2) and expressions ...
... expressions of interest(2) ...
... multiword(−1) expressions ...
... collocation extraction ...
... expressions dialect(1) ...
... multiword(−1) expressions ...
... multiword(−1) expressions ...
Smadja (1993)
Noun Co-occurrence Table
w = expressions, D = {−2, −1, 1, 2}

collocate   multiword   dialect   collocation   interest
                  w_1       w_2           w_3        w_4
p_i^{−2}            0         1             0          0
p_i^{−1}            5         1             0          0
p_i^{1}             0         1             0          0
p_i^{2}             0         0             0          1
freq_i              5         3             0          1
w_i ∈ C?          yes       yes            no        yes
Smadja (1993)
Statistics of Expectation
• f̄ = Σ_{w_i∈C} freq_i / |C| = (5+3+1)/3 = 3 (frequency average)
• σ = √( Σ_{w_i∈C} (freq_i − f̄)² / |C| )
    = √( ((5−3)² + (3−3)² + (1−3)²) / 3 ) ≈ 1.63
• k_i = (freq_i − f̄) / σ (strength)
• p̄_i = Σ_{j∈D} p_i^j / |D| (pair count average)
• U_i = Σ_{j∈D} (p_i^j − p̄_i)² / |D| (pair count variance)
Smadja (1993)
Collocation Filters
• Strength: k_i > k_α (= 1)
: select frequent collocates
• Spread: U_i > U_0 (= 1)
: select spiky distributions
• Peakiness: p_i^j ≥ p̄_i + (k_β × √U_i), k_β = 0.5
(N.B. a value of 10 is suggested for k_β in Smadja (1993))
: identify interesting spikes
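The statistics and the three filters combine into a single pass over the co-occurrence table (a sketch under stated assumptions: the function name, the `{collocate: {offset: count}}` table layout, and returning the surviving peak offsets are all illustrative choices, not Smadja's code):

```python
from math import sqrt

def xtract_stage1(table, k_alpha=1.0, u_0=1.0, k_beta=0.5):
    """Apply the strength, spread and peakiness filters to a
    co-occurrence table for one headword w.
    Returns {collocate: (strength, spread, surviving offsets)}."""
    # C = collocates that actually co-occur with w
    active = {wi: pos for wi, pos in table.items() if sum(pos.values()) > 0}
    freqs = {wi: sum(pos.values()) for wi, pos in active.items()}
    f_bar = sum(freqs.values()) / len(freqs)                # frequency average
    sigma = sqrt(sum((f - f_bar) ** 2 for f in freqs.values()) / len(freqs))
    out = {}
    for wi, pos in active.items():
        k_i = (freqs[wi] - f_bar) / sigma                   # strength
        p_bar = sum(pos.values()) / len(pos)                # pair-count average
        u_i = sum((c - p_bar) ** 2 for c in pos.values()) / len(pos)  # spread
        if k_i > k_alpha and u_i > u_0:
            peaks = [j for j, c in pos.items()
                     if c >= p_bar + k_beta * sqrt(u_i)]    # peakiness
            out[wi] = (k_i, u_i, peaks)
    return out
```

On the expressions example only multiword survives, with the single peak at offset −1 (multiword expressions).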
Smadja (1993)
Back to our Example: Strength
• w_1 (multiword)
⋆ k_1 = (5−3)/1.63 ≈ 1.22 > 1 V
• w_2 (dialect)
⋆ k_2 = (3−3)/1.63 = 0 < 1 X
• w_4 (interest)
⋆ k_4 = (1−3)/1.63 ≈ −1.22 < 1 X
Smadja (1993)
Spread and Peakiness (w1 = multiword)
• U_1 = ((0−1.25)² + (5−1.25)² + (0−1.25)² + (0−1.25)²) / 4 ≈ 4.69 > 1 V
⋆ p̄_1 = (0+5+0+0)/4 = 1.25
⋆ p̄_1 + (k_β × √U_1) = 1.25 + (0.5 × √4.69) ≈ 2.33

j     p_1^j   vs. p̄_1 + (k_β × √U_1)
−2        0   < 2.33 X
−1        5   ≥ 2.33 V
+1        0   < 2.33 X
+2        0   < 2.33 X
Smadja (1993)
Stage 2: Bigrams to n-grams
• Independent filter to detect larger n-grams
• Method: for each fixed-distance collocate pair (w, w_i^j), extract out
contiguous word sequences where max(p(word[i])) > T (= 0.75)
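A sketch of the Stage-2 idea, over concordance windows already aligned on the collocate pair (the function name, the wildcard convention, and passing windows as token lists are illustrative assumptions):

```python
from collections import Counter

def extend_collocation(windows, T=0.75):
    """For each aligned position, keep the dominant word if it accounts
    for more than T of the concordance lines; otherwise emit a wildcard."""
    pattern = []
    for column in zip(*windows):
        word, count = Counter(column).most_common(1)[0]
        pattern.append(word if count / len(column) > T else "*")
    return pattern
```

Run over the BNC concordances on the next slide, the stable middle positions survive and the variable edges are wildcarded, yielding the rigid core the path of least resistance.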
Smadja (1993)
Example
w = resistance, w_i^{−3} = path
Concordances from the BNC:
... trod the path(−3) of least resistance , ...
... finding the path(−3) of least resistance will ...
... along the path(−3) of least resistance .
... the safest path(−3) of least resistance through ...
... took the path(−3) of least resistance and ...
: the path of least resistance is a rigid noun phrase
Smadja (1993)
Stage 3: Add Syntax
• Use cass dependency parser to assign syntactic labels to bigrams
• Same basic thresholding filter as used in Stage 2
Smadja (1993)
Summary
• Two methods proposed for extracting collocations based on
distributional statistics
⋆ 1st method based on the intuition that collocations occur
frequently in relatively fixed configurations
⋆ 2nd method based on the predictability of contiguous lexical
contexts for given concordances
• Basic method illustrated for determining the syntax of extracted
collocations
Smadja (1993)
Reflections
• (At the time) groundbreaking research on collocation extraction
• Not effective at extracting collocations involving low-frequency words
• Difficulties in evaluating the results of collocation extraction
(applies to this day)
• Difficulties in extracting non-contiguous (predicative) collocations
such as verb particles
Smadja (1993)
References

Creutz, Mathias, and Krista Lagus. 2002. Unsupervised discovery of morphemes. In
Proc. of the 6th Meeting of the ACL Special Interest Group in Computational Phonology,
Philadelphia, USA.
Cruse, D. Alan. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.
Kubota Ando, Rie, and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of
Japanese kanji sequences. Natural Language Engineering 9.127–49.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics
19.143–77.