Empirical Approaches to
Multilingual Lexical Acquisition
Lecturer: Timothy Baldwin
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Lecture 4
Unsupervised Approaches to Lexical
Acquisition: Word Segmentation and MWE
Extraction
Word Segmentation
Basic Task Description
• Take an unsegmented string and dynamically insert the word
boundaries:
spotthebreaksinthisstring : spot the breaks in this string
• Applications in:
⋆ the segmentation of non-segmenting languages (e.g. Japanese,
Chinese, Thai)
⋆ segmenting up a speech signal into words
⋆ modelling child word acquisition
Difficulties in Word Detection
• Interdependence between word and boundary detection
• Segmentation of frequently-occurring strings
ad hoc, 東京都 tou·kyou·to “Tokyo prefecture”
• Inherent boundary ambiguities
大学長 dai·gaku·chou “(university) president”
• Data sparseness
Unsupervised Discovery of Morphemes
(Creutz and Lagus 2002)
Types of morphological linkage
• Inflectional (table : tables, bus : buses/busses, phenomenon
: phenomena)
• Derivational (central : centralise)
• Compositional (house + boat : houseboat)
Semantic linkage
• employ : employer , employee
• Some of these are lost in the mists of time, e.g. ploy :
employ/deploy?
Basic tasks in morphology
• Discovering the morphemes
• Determining the interaction between the morphemes in a given
word
unfathomable
→ un·fathom·able
→ un(NEG)·fathom(V)·able(V→N)
Unsupervised Discovery of Morphemes
• Segment each word token into morphemes without using prior
knowledge of morphological processes or morphemes:
misinformed : mis·inform·ed
reconstruction : re·construct·ion
kahvinjuojallekin : kahvi·n·juo·ja·lle·kin [FI]
• Assume agglutinative or simple inflectional morphology
• Based on orthographic regularities
Creutz and Lagus (2002)
Applications
• Language modelling (speech recognition)
• Language-independent lexicographic tool for determining
stems/word boundaries
• N.B. assumes a corpus of raw text up-front
Creutz and Lagus (2002)
Minimum Description Length
• Minimum Description Length (MDL, a.k.a. Minimum Message
Length or MML) is a method for simultaneously evaluating the
compactness of a particular encoding and the compactness of the
data using that encoding
Description length = model DL + data DL
‘Simplicity vs. Accuracy’
• Too much ‘accuracy’ : overfitting
• How can we compress / generalise the data model?
Creutz and Lagus (2002)
MDL in word segmentation
• In morpheme discovery terms, minimise C:

C = Cost(Source text) + Cost(Codebook)
  = Σ_{tokens} −log p(m_i) + Σ_{types} k × l(m_j)
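The cost C above can be sketched in a few lines of Python (an illustrative reconstruction, not the authors' code: `description_length` and the unigram model over morpheme tokens are assumptions; probabilities are maximum-likelihood estimates over the segmented corpus):

```python
from collections import Counter
from math import log2

def description_length(segmented_corpus, k=5):
    """MDL cost of a morpheme segmentation:
    C = sum over tokens of -log p(m_i)   (source text cost)
      + sum over types  of k * l(m_j)    (codebook cost)."""
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    n = len(tokens)
    text_cost = sum(-log2(counts[m] / n) for m in tokens)
    codebook_cost = sum(k * len(m) for m in counts)
    return text_cost + codebook_cost
```

With k = 5, the word-level segmentation of the woodchuck example reproduces the C = 203.11 figure from the worked slides, and the unsegmented string gives C = 285.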
Creutz and Lagus (2002)
Example: Word splitting
howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood
Creutz and Lagus (2002)
m_i                   −log p(m_i)   freq(m_i)   l(m_i)
howmuchwoodcoulda...         0.00           1       57
Total                        0.00                   57

C = 0.00 + 5 × 57 = 285.00 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   2.70           2        1
chuck               2.70           2        5
could               2.70           2        5
how                 3.70           1        3
if                  3.70           1        2
much                3.70           1        4
wood                2.70           2        4
woodchuck           2.70           2        9
Total              38.11                   33

C = 38.11 + 5 × 33 = 203.11 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   2.91           2        1
chuck               1.91           4        5
could               2.91           2        5
how                 3.91           1        3
if                  3.91           1        2
much                3.91           1        4
wood                1.91           4        4
Total              38.60                   24

C = 38.60 + 5 × 24 = 158.60 (k = 5)
Creutz and Lagus (2002)
m_i          −log p(m_i)   freq(m_i)   l(m_i)
a                   4.83           2        1
c                   2.37          11        1
d                   3.25           6        1
f                   5.83           1        1
h                   3.25           6        1
i                   5.83           1        1
k                   3.83           4        1
l                   4.83           2        1
m                   5.83           1        1
o                   2.37          11        1
u                   3.03           7        1
w                   3.51           5        1
Total             182.09                   12

C = 182.09 + 5 × 12 = 242.09 (k = 5)
Creutz and Lagus (2002)
Greedy search algorithm
• Read in one word at a time, and recursively select the binary
segmentation of the word which minimises C
• Continue until no split further reduces C
• Dynamically segment even if the word has been observed previously
in the data (removing any previous segmentations from the model)
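The greedy search can be sketched as follows (an illustrative reconstruction under stated assumptions: `counts` stands in for the morpheme model built so far, the cost is the C defined earlier, and the re-segmentation of previously seen words is simplified away):

```python
from collections import Counter
from math import log2

def corpus_cost(counts, k=5):
    """C = source-text cost + codebook cost (counts must be positive)."""
    n = sum(counts.values())
    text = sum(-c * log2(c / n) for c in counts.values())
    codebook = sum(k * len(m) for m in counts)
    return text + codebook

def segment(word, counts, k=5):
    """Try every binary split of `word` (and no split at all); keep
    whichever minimises C, then recurse into both halves."""
    def cost_with(morphs):
        trial = counts.copy()
        trial.update(morphs)          # add candidate morphemes to the model
        return corpus_cost(trial, k)
    best, best_cost = [word], cost_with([word])
    for i in range(1, len(word)):
        c = cost_with([word[:i], word[i:]])
        if c < best_cost:
            best, best_cost = [word[:i], word[i:]], c
    if best == [word]:
        counts[word] += 1             # commit the unsplit morpheme
        return best
    left, right = best
    return segment(left, counts, k) + segment(right, counts, k)
```

For instance, once `wood` is frequent in the model, the split of `woodwood` into `wood·wood` is cheaper than keeping it whole or splitting anywhere else.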
Creutz and Lagus (2002)
Dreaming
• At regular intervals, randomly iterate over words already processed
to remove any globally sub-optimal segmentations
• Dream for a fixed time (over a fixed number of words) OR until no
significant change in C is observed
• Demo:
http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi
Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences
(Kubota Ando and Lee 2003)
Outline
• Unsupervised method for segmenting Japanese kanji sequences
社長兼業務部長
→ 社長|兼|業務|部長 sha·chou|ken|gyou·mu|bu·chou “president
and general business manager”
• Based on character n-grams (sequences of n characters)
Kubota Ando and Lee (2003)
Why the Interest in Kanji?
• Japanese consists of three basic orthographies:
⋆ hiragana: 46 elements (function words, onomatopoeia,
verb/adjective suffixes)
⋆ katakana: 46 elements (words of foreign origin, scientific names)
⋆ kanji: 1,945 “official” elements, more in practice (content words)
• Kanji is the most productive and data-sparse of the three
Kubota Ando and Lee (2003)
Basic Intuition
• N -gram “windows” over a kanji string can be used as predictors of
word boundaries in that:
1. frequent n-grams suggest the lack of a boundary
2. infrequent n-grams suggest the presence of a boundary
spotthebreaks with a 4-gram window:
s p o t t h e b r e a k s
s p o t
  p o t t
    o t t h
...
Kubota Ando and Lee (2003)
Example: spotthebreaks
Word N-gram Freq
spot 119
pott 24
otth 120
tthe 6075
theb 2118
hebr 218
ebre 49
brea 328
reak 246
eaks 57
Kubota Ando and Lee (2003)
Method
• For each potential boundary point:
⋆ For each n-gram order (e.g. 4):
∗ For each of the n-grams to the left and right of the boundary:
◦ What is the proportion of n-grams straddling the boundary
that are less frequent than the n-gram adjoining it?
• v_n = (1 / (2(n−1))) Σ_{d∈{L,R}} Σ_{j=1..n−1} I_>(#(s_d^n), #(t_j^n))
Kubota Ando and Lee (2003)
Returning to our Example
• For the boundary spot|theb:
v_4 = (1 / (2(4−1))) × (1 + 2) = 0.5
• For the boundary pott|hebr:
v_4 = (1 / (2(4−1))) × (0 + 1) ≈ 0.17
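The score can be sketched directly from the definition (a minimal reconstruction; the function name `v_n` and the dict-based frequency table are illustrative):

```python
def v_n(s, k, n, freq):
    """Fraction of the n-grams straddling the boundary at position k of s
    that are rarer than the two non-straddling n-grams adjoining it."""
    adjoining = [s[k - n:k], s[k:k + n]]                    # s_L, s_R
    straddling = [s[k - n + j:k + j] for j in range(1, n)]  # t_1 .. t_{n-1}
    votes = sum(freq[d] > freq[t] for d in adjoining for t in straddling)
    return votes / (2 * (n - 1))
```

Plugging in the n-gram frequencies from the spotthebreaks table reproduces the two values above.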
Kubota Ando and Lee (2003)
Combining n-gram Orders
• To balance out the effects of data sparseness, the method advocates
“voting” across different n-gram orders:

v_N(k) = (1 / |N|) Σ_{n∈N} v_n(k)
• Place boundaries at locations l where EITHER:
1. v_N(l) > v_N(l − 1) and v_N(l) > v_N(l + 1) OR
2. v_N(l) > t
(t = threshold constant)
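The voting and boundary-placement rules combine as follows (a sketch, not the paper's implementation: the default order set and threshold, and scoring out-of-range neighbours as 0, are assumptions):

```python
def v_n(s, k, n, freq):
    # fraction of straddling n-grams rarer than the adjoining n-grams
    adjoining = [s[k - n:k], s[k:k + n]]
    straddling = [s[k - n + j:k + j] for j in range(1, n)]
    return sum(freq[d] > freq[t] for d in adjoining
               for t in straddling) / (2 * (n - 1))

def place_boundaries(s, freq, orders=(4,), t=0.75):
    """Average v_n over the chosen n-gram orders, then place a boundary
    at every local maximum of the averaged score, or wherever it
    exceeds the threshold t."""
    m = max(orders)
    scores = {k: sum(v_n(s, k, n, freq) for n in orders) / len(orders)
              for k in range(m, len(s) - m + 1)}
    return [k for k in scores
            if (scores[k] > scores.get(k - 1, 0.0)
                and scores[k] > scores.get(k + 1, 0.0))
            or scores[k] > t]
```

On the toy spotthebreaks frequencies this recovers the true boundaries after spot and the, plus one spurious high-scoring position; in the paper, tuning N and t on held-out data is what suppresses such false positives.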
Kubota Ando and Lee (2003)
Evaluation
• Extract n-gram statistics from 150MB of newswire data (N.B.
unannotated)
• Hand-annotated 5 held-out datasets containing 500 kanji
sequences each
• For each held-out dataset, use 450 sequences for parameter training
(N, t) and the remaining 50 sequences for evaluation (hence
mostly-unsupervised)
Kubota Ando and Lee (2003)
Words vs. Morphemes
• Two separate evaluations were carried out, based on:
⋆ words: a stem and its affixes are considered as a single unit
[電話器] [deN·wa·ki] “telephone (device)”
⋆ morphemes: stems and affixes are each individual units
[電話][器] [deN·wa][ki] “[telephone][device]”
Kubota Ando and Lee (2003)
Parameter Optimisation
• Run the algorithm over each set of 450 sequences using the different
values of N and t
• Optimise according to one of:
⋆ word precision = correct proposed brackets / proposed brackets
⋆ word recall = correct proposed brackets / actual brackets
⋆ word F(-measure/score) = (2 × precision × recall) / (precision + recall)
• Evaluate according to the optimal parameter setting
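In code, with a bracketing represented as a set of (start, end) spans (an illustrative sketch, not the paper's scorer):

```python
def word_scores(proposed, actual):
    """Precision, recall and F-score over proposed vs. gold brackets."""
    correct = len(proposed & actual)
    precision = correct / len(proposed)
    recall = correct / len(actual)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```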
Kubota Ando and Lee (2003)
Results (Briefly)
• Word performance better than established Japanese morphological
analyzers (JUMAN, ChaSen)
• Word performance considerably better than morpheme performance
• Single-character morphemes greatest source of errors
Kubota Ando and Lee (2003)
Summary
• Method proposed for segmenting up Japanese kanji sequences
based on simple n-gram statistics and voting
• Basic intuition that n-grams straddling word boundaries will be
less-frequent than those adjacent to word boundaries
• Impressive results achieved for such a conceptually simple algorithm
Multiword Expression Discovery
Basic Task Description
• Identify the multiword expression (MWE) types in tagged (or raw)
text from observation of the token distribution
• Recall: a MWE is made up of multiple words and is lexically,
syntactically, semantically, pragmatically and/or statistically
idiosyncratic
• Considerable body of literature on collocation extraction
Lexico-syntactic Idiomaticity

• by and large: ??? → P conj Adj (by and large)
• ad hoc: Adj → ? ? (ad hoc)
• wine and dine: V[trans] → V[intrans] conj V[intrans] (wine and dine)
Semantic Idiomaticity
• kick the bucket = die′
• spill the beans = reveal′(secret′)
• kindle excitement = kindle′(excitement′)
Pragmatic Idiomaticity
• The Wheel of Fortune factor — how to represent the jumble of
phrases stored in the mental lexicon?
• The Monty Python factor — mish-mash of language fragments
which evoke particular events/individuals/memories
Statistical Idiomaticity

            unblemished  spotless  flawless  immaculate  impeccable
eye                   –         –         –           –           +
gentleman             –         –         ?           –           +
home                  ?         +         –           +           ?
lawn                  –         –         ?           +           –
memory                –         –         +           –           ?
quality               –         –         –           –           +
record                +         +         +           +           +
reputation            +         –         –           +           +
taste                 –         –         –           –           +
Adapted from Cruse (1986)
Complications in MWE Discovery
• Working out the extent of the collocation (phrase boundary
detection)
trip the light X
trip the light fantastic V
trip the light fantastic at X
• Fine line between collocations and simple default lexical
combinations
buy a car/purchase power
Collocations vs. MWEs
• A collocation (≈ institutionalised phrase) is an arbitrary and
recurrent word combination
• Collocations tend to stand in opposition to anti-collocations =
lexically-marked synonym-substitutes
• Collocations can be semantically-marked (e.g. dark horse) but tend
to be compositional (e.g. strong coffee)
• Predicative collocations vs. rigid NPs vs. phrasal templates
Retrieving Collocations from Text: Xtract
(Smadja 1993)
Outline
• Automatic method for extracting collocations from raw text based
on n-gram statistics
• Basic intuition: collocations are more rigid syntactically and more
frequent than other word combinations
• Method: attempt to capture this intuition using the basic statistics
of word combinations
Smadja (1993)
Stage 1: Extract Significant Bigrams
• w and wi co-occur (wi is a collocate of w) if they are found in a
single sentence separated by fewer than 5 words
• a bigram (w,wi) is significant iff:
⋆ w and wi co-occur more frequently than chance
⋆ w and wi appear in a relatively rigid configuration
• Divide up the set of collocates according to POS
Smadja (1993)
Example Corpus
... multiword(−1) expressions ...
... multiword(−1) expressions ...
... dialect(−1) expressions ...
... dialect(−2) and expressions ...
... expressions of interest(2) ...
... multiword(−1) expressions ...
... collocation extraction ...
... expressions dialect(1) ...
... multiword(−1) expressions ...
... multiword(−1) expressions ...
Smadja (1993)
Noun Co-occurrence Table
w = expressions, D = {−2, −1, 1, 2}

collocate   multiword   dialect   collocation   interest
                  w_1       w_2           w_3        w_4
p_i^{−2}            0         1             0          0
p_i^{−1}            5         1             0          0
p_i^{1}             0         1             0          0
p_i^{2}             0         0             0          1
freq_i              5         3             0          1
w_i ∈ C?          yes       yes            no        yes
Smadja (1993)
Statistics of Expectation
• f̄ = Σ_{w_i∈C} freq_i / |C| = (5+3+1)/3 = 3 (frequency average)
• σ = √( Σ_{w_i∈C} (freq_i − f̄)² / |C| )
    = √( ((5−3)² + (3−3)² + (1−3)²) / 3 ) ≈ 1.63
• k_i = (freq_i − f̄) / σ (strength)
• p̄_i = Σ_{j∈D} p_i^j / |D| (pair count average)
• U_i = Σ_{j∈D} (p_i^j − p̄_i)² / |D| (pair count variance)
Smadja (1993)
Collocation Filters
• Strength: k_i > k_α (= 1)
: select frequent collocates
• Spread: U_i > U_0 (= 1)
: select spiky distributions
• Peakiness: p_i^j ≥ p̄_i + (k_β × √U_i), k_β = 0.5
(N.B. a value of 10 is suggested for k_β in Smadja (1993))
: identify interesting spikes
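The statistics and the three filters combine into a single pass over the co-occurrence table (a sketch under stated assumptions: the function name, the `{collocate: {offset: count}}` table layout, and returning the surviving peak offsets are all illustrative choices, not Smadja's code):

```python
from math import sqrt

def xtract_stage1(table, k_alpha=1.0, u_0=1.0, k_beta=0.5):
    """Apply the strength, spread and peakiness filters to a
    co-occurrence table for one headword w.
    Returns {collocate: (strength, spread, surviving offsets)}."""
    # C = collocates that actually co-occur with w
    active = {wi: pos for wi, pos in table.items() if sum(pos.values()) > 0}
    freqs = {wi: sum(pos.values()) for wi, pos in active.items()}
    f_bar = sum(freqs.values()) / len(freqs)                # frequency average
    sigma = sqrt(sum((f - f_bar) ** 2 for f in freqs.values()) / len(freqs))
    out = {}
    for wi, pos in active.items():
        k_i = (freqs[wi] - f_bar) / sigma                   # strength
        p_bar = sum(pos.values()) / len(pos)                # pair-count average
        u_i = sum((c - p_bar) ** 2 for c in pos.values()) / len(pos)  # spread
        if k_i > k_alpha and u_i > u_0:
            peaks = [j for j, c in pos.items()
                     if c >= p_bar + k_beta * sqrt(u_i)]    # peakiness
            out[wi] = (k_i, u_i, peaks)
    return out
```

On the expressions example only multiword survives, with the single peak at offset −1 (multiword expressions).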
Smadja (1993)
Back to our Example: Strength
• w_1 (multiword)
⋆ k_1 = (5−3)/1.63 ≈ 1.22 > 1 V
• w_2 (dialect)
⋆ k_2 = (3−3)/1.63 = 0 < 1 X
• w_4 (interest)
⋆ k_4 = (1−3)/1.63 ≈ −1.22 < 1 X
Smadja (1993)
Spread and Peakiness (w1 = multiword)
• U_1 = ((0−1.25)² + (5−1.25)² + (0−1.25)² + (0−1.25)²) / 4 ≈ 4.69 > 1 V
⋆ p̄_1 = (0+5+0+0)/4 = 1.25
⋆ p̄_1 + (k_β × √U_1) = 1.25 + (0.5 × √4.69) ≈ 2.33

j     p_1^j   vs. p̄_1 + (k_β × √U_1)
−2        0   < 2.33 X
−1        5   ≥ 2.33 V
+1        0   < 2.33 X
+2        0   < 2.33 X
Smadja (1993)
Stage 2: Bigrams to n-grams
• Independent filter to detect larger n-grams
• Method: for each fixed-distance collocate pair (w, w_i^j), extract out
contiguous word sequences where max(p(word[i])) > T (= 0.75)
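A sketch of the Stage-2 idea, over concordance windows already aligned on the collocate pair (the function name, the wildcard convention, and passing windows as token lists are illustrative assumptions):

```python
from collections import Counter

def extend_collocation(windows, T=0.75):
    """For each aligned position, keep the dominant word if it accounts
    for more than T of the concordance lines; otherwise emit a wildcard."""
    pattern = []
    for column in zip(*windows):
        word, count = Counter(column).most_common(1)[0]
        pattern.append(word if count / len(column) > T else "*")
    return pattern
```

Run over the BNC concordances on the next slide, the stable middle positions survive and the variable edges are wildcarded, yielding the rigid core the path of least resistance.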
Smadja (1993)
Example
w = resistance, w_i^{−3} = path
Concordances from the BNC:
... trod the path(−3) of least resistance , ...
... finding the path(−3) of least resistance will ...
... along the path(−3) of least resistance .
... the safest path(−3) of least resistance through ...
... took the path(−3) of least resistance and ...
: the path of least resistance is a rigid noun phrase
Smadja (1993)
Stage 3: Add Syntax
• Use cass dependency parser to assign syntactic labels to bigrams
• Same basic thresholding filter as used in Stage 2
Smadja (1993)
Summary
• Two methods proposed for extracting collocations based on
distributional statistics
⋆ 1st method based on the intuition that collocations occur
frequently in relatively fixed configurations
⋆ 2nd method based on the predictability of contiguous lexical
contexts for given concordances
• Basic method illustrated for determining the syntax of extracted
collocations
Smadja (1993)
Reflections
• (At the time) groundbreaking research on collocation extraction
• Not effective at extracting collocations involving low-frequency words
• Difficulties in evaluating the results of collocation extraction
(applies to this day)
• Difficulties in extracting non-contiguous (predicative) collocations
such as verb particles
Smadja (1993)
References

Creutz, Mathias, and Krista Lagus. 2002. Unsupervised discovery of morphemes. In
Proc. of the 6th Meeting of the ACL Special Interest Group in Computational Phonology,
Philadelphia, USA.
Cruse, D. Alan. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.
Kubota Ando, Rie, and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of
Japanese kanji sequences. Natural Language Engineering 9.127–49.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics
19.143–77.