LREC 2008 AWN 1
Arabic WordNet: Semi-automatic Extensions using
Bayesian Inference
H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2)
(1) TALP Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain
Index of the talk
• The AWN project
• Semi-automatic Extensions of AWN
  - Intuitive basis
  - Previous work using heuristics
  - Using Bayesian Networks
• Empirical evaluation
• Conclusions
The AWN project
• Funded by the USA REFLEX program (2005-2007)
• Partners:
  - Universities: Princeton, Manchester, UPC, UB
  - Companies: Articulate Software, Irion
• Description: Black et al., 2006; Elkateb et al., 2006; Rodríguez et al., 2008
The AWN project
• Objectives
  - 10,000 synsets, including some amount of domain-specific data
  - linked to PWN 2.0, finally to PWN 3.0
  - linked to SUMO
  - + 1,000 NE
  - manually built (or revised)
  - vowelized entries
  - including the root of each entry
The AWN project
• Current figures

Arabic synsets   11270
Arabic words     23496

POS     DB content
adj     661
nouns   7961
adv     110
verbs   2538

Named entities:
Synsets that are named entities            1142
Synsets that are not named entities        10028
Words in synsets that are named entities   1656
Semi-automatic Extensions of AWN
• Intuitive basis
  - In Arabic (and other Semitic languages), many words having a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules
Semi-automatic Extensions of AWN
• Lexical rules
  - regular verbal derivative forms
  - regular nominal and adjectival derivative forms
    - masdar (verbal noun)
    - masculine and feminine active and passive participles
  - inflected verbal forms
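As a rough illustration of how such rules can generate candidate word forms from a root, the sketch below fills the three root consonants into pattern templates. The template set, the simplified Latin transliteration and the function names are assumptions made for this example; they are not the rule set used in the project, and the masdar pattern shown is only one of several shapes a masdar can take.

```python
# Hypothetical pattern templates: the digits 1, 2, 3 stand for the three
# root consonants; every other character is copied as-is.
# Transliteration is simplified and the rule set is illustrative only.
TEMPLATES = {
    "form_I_perfect": "1a2a3a",
    "form_II_perfect": "1a22a3a",      # doubled middle consonant (shadda)
    "active_participle": "1A2i3",
    "passive_participle": "ma12uu3",
    "masdar_example": "1a23",          # one possible masdar shape
}


def apply_template(root, template):
    """Fill the consonants of a triliteral root into one template."""
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    return "".join(slots.get(ch, ch) for ch in template)


def derive_candidates(root):
    """Return a dict mapping each template name to the generated form."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    return {name: apply_template(root, t) for name, t in TEMPLATES.items()}


forms = derive_candidates("drs")
for name, form in forms.items():
    print(name, form)
```

For the root d-r-s this produces darasa, darrasa, dAris, madruus and dars. Not every generated form is an attested word, which is why a later filtering step is needed.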
Semi-automatic Extensions of AWN
• Procedure for generating a set of likely <Arabic word, English synset, score> triples:
  - produce an initial list of candidate word forms
  - filter out the less likely candidates from this list
  - generate an initial list of attachments
  - score the reliability of these candidates
  - manually review the best-scored candidates and include the valid associations in AWN
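The filtering and attachment steps can be sketched as follows. This is only an illustration under assumptions the slide does not state: that filtering relies on corpus frequency, and that attachments are obtained through a bilingual lexicon plus an index from English words to synsets. The function name, the synset identifiers and the toy data are all hypothetical.

```python
def initial_attachments(candidates, corpus_counts, lexicon, synsets_of, min_count=1):
    """Return unscored (arabic_word, synset) attachments."""
    # Filtering (assumed criterion): keep only forms attested in the corpus.
    kept = [w for w in candidates if corpus_counts.get(w, 0) >= min_count]
    # Attachment: link each kept form to the synsets of its English translations.
    attachments = []
    for word in kept:
        for english in lexicon.get(word, []):
            for synset in synsets_of.get(english, []):
                if (word, synset) not in attachments:
                    attachments.append((word, synset))
    return attachments


# Toy data; "dirsa" stands for a generated form that is not attested.
candidates = ["darasa", "darrasa", "dirsa"]
corpus_counts = {"darasa": 120, "darrasa": 45}
lexicon = {"darasa": ["study", "learn"], "darrasa": ["teach"]}
synsets_of = {
    "study": ["syn_study_1"],
    "learn": ["syn_study_1", "syn_learn_1"],
    "teach": ["syn_teach_1"],
}

attachments = initial_attachments(candidates, corpus_counts, lexicon, synsets_of)
print(attachments)
```

Scoring and manual review then operate on these attachments.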
Semi-automatic Extensions of AWN
• Resources
  - PWN
  - AWN
  - LOGOS database of conjugated Arabic verbs
  - NMSU bilingual Arabic-English lexicon
  - Arabic Gigaword Corpus
  - UN (2000-2002) bilingual Arabic-English Corpus
Semi-automatic Extensions of AWN
• Score the reliability of the candidates
  - build a graph representing the words, synsets and their associations
    - associations synset-synset: explicit in WN2.0, path-based
  - apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
  - apply Bayesian inference (LREC 2008)
Using Bayesian Inference
[Figure: graph linking the Arabic base form (Abase) and the Arabic words A1 ... An to the English words E1 ... Em and the synsets S1 ... Sp]
Using Bayesian Inference
[Figure: Bayesian network arranged in four layers (1 to 4), with nodes A1 ... An, E1 ... Em, S11 ... S1p and S21 ... S2r]
Using Bayesian Inference
• Building the CPT for each node in the BN
  - edges EW-AW
    - probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities), filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
    - all the probability mass is distributed between pairs occurring in the BN
  - other edges (EW-S, S-S)
    - linear distribution on priors
    - noisy-or model
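Two of these ingredients can be illustrated with a short sketch: redistributing the translation probability mass over the pairs that occur in the network, and the standard noisy-or combination. The function names, the choice to normalise per source word and the toy figures are assumptions for illustration; the slide does not specify these details.

```python
from math import prod


def renormalise(table, allowed_targets):
    """Keep only targets present in the network and rescale each row to sum to 1.

    table maps a source word to a dict {target word: translation probability}.
    Normalising per source word is an assumption of this sketch.
    """
    result = {}
    for source, row in table.items():
        kept = {t: p for t, p in row.items() if t in allowed_targets}
        total = sum(kept.values())
        if total > 0:
            result[source] = {t: p / total for t, p in kept.items()}
    return result


def noisy_or(link_probs, parent_states):
    """Noisy-or: the child is false only if every active parent fails to trigger it."""
    return 1.0 - prod(1.0 - p for p, on in zip(link_probs, parent_states) if on)


# Toy translation table: target "C" is not a node of the network.
table = {"teach": {"A": 0.2, "B": 0.2, "C": 0.6}}
cpt_rows = renormalise(table, allowed_targets={"A", "B"})
both_on = noisy_or([0.5, 0.5], [True, True])
one_on = noisy_or([0.5, 0.5], [True, False])
print(cpt_rows, both_on, one_on)
```

With two active parents of strength 0.5 each, the noisy-or gives 1 - 0.5 * 0.5 = 0.75; with a single active parent it gives 0.5.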
Using Bayesian Inference
• Performing Bayesian Inference in the BN
  - Assign probability 1 to nodes in layer 1
  - Infer the probabilities of nodes in layer 3
  - For each word in layer 1, select as candidates the synsets in layer 3 connected to it and with probability over a threshold
  - Score the candidate pair with this probability
  - Select the candidates scored over a threshold
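The steps above can be approximated with a simple forward pass. The sketch below is not exact Bayesian inference: it assumes layer 1 holds the Arabic words, layer 2 the English words and layer 3 the synsets, treats the parents of each node as independent, combines them with a noisy-or, and ignores layer 4. All names, weights and the threshold are hypothetical.

```python
from math import prod


def propagate(parent_probs, parents_of):
    """Noisy-or forward pass; parents_of maps child -> [(parent, weight), ...]."""
    return {
        child: 1.0 - prod(1.0 - w * parent_probs.get(p, 0.0) for p, w in links)
        for child, links in parents_of.items()
    }


def select_candidates(arabic_words, ew_parents, syn_parents, threshold):
    """Return (arabic_word, synset, score) triples scored over the threshold."""
    layer1 = {a: 1.0 for a in arabic_words}   # evidence: probability 1
    layer2 = propagate(layer1, ew_parents)    # English words
    layer3 = propagate(layer2, syn_parents)   # synsets
    results = []
    for a in arabic_words:
        # Synsets connected to this Arabic word through some English word.
        reachable = {
            s
            for s, links in syn_parents.items()
            for e, _ in links
            if any(p == a for p, _ in ew_parents.get(e, []))
        }
        for s in sorted(reachable):
            if layer3[s] > threshold:
                results.append((a, s, layer3[s]))
    return results


# Toy network with made-up weights.
ew_parents = {"E1": [("A1", 0.5)], "E2": [("A1", 0.5), ("A2", 0.5)]}
syn_parents = {"S1": [("E1", 0.5), ("E2", 0.5)], "S2": [("E2", 0.2)]}
results = select_candidates(["A1", "A2"], ew_parents, syn_parents, threshold=0.2)
print(results)
```

In this toy network S1 is kept for both Arabic words, while S2 falls below the threshold and is discarded.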
Empirical Evaluation
• 10 verbs randomly selected from AWN + درس

Arabic verb   # English words   # Synsets (S1 S2)
عامل          190               107
أعقب          71                77
صقل           21                31
رتّب           102               62
أخّر           19                9
أخبر          80                105
رشّح           40                22
غامر          56                49
أشبع          34                38
أخرج          85                140
درس           51                57
Empirical Evaluation
• Results

Selection   Threshold        candidates   accept   reject   precision   recall   F1
HEU         all heuristics   272          135      137      0.50        0.61     0.55
HEU         heuristics 1,2   61           40       21       0.65        0.18     0.28
BN          0                554          223      331      0.40        1        0.57
BN          0.01             243          125      118      0.51        0.56     0.53
BN          0.02             214          116      98       0.54        0.52     0.53
BN          0.07             112          65       47       0.58        0.29     0.39
BN          0.1              100          60       40       0.60        0.27     0.37
BN + HEU    0                272          154      118      0.56        0.69     0.62
BN + HEU    0.01             212          121      91       0.57        0.54     0.55
BN + HEU    0.02             201          115      86       0.57        0.41     0.48
BN + HEU    0.07             92           65       27       0.71        0.38     0.5
BN + HEU    0.1              83           59       24       0.71        0.12     0.21
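For reference, the three measures can be recomputed from the counts with the usual definitions. The sketch below assumes that the 223 associations accepted for BN at threshold 0 (the row with recall 1) form the reference set for recall; that reading is an inference from the table rather than something the slide states.

```python
def precision_recall_f1(accepted, candidates, total_valid):
    """Precision, recall and F1 (harmonic mean of the two) from raw counts."""
    precision = accepted / candidates
    recall = accepted / total_valid
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Row "HEU, all heuristics": 272 candidates, 135 accepted.
# total_valid=223 is the assumed reference set (BN at threshold 0).
p, r, f1 = precision_recall_f1(accepted=135, candidates=272, total_valid=223)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # prints "0.50 0.61 0.55"
```

This reproduces the precision, recall and F1 reported for that row.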
Conclusions
• The BN approach doubles the number of candidates of the previous HEU approach (554 vs 272).
• The sample is clearly insufficient.
• The overlap of HEU + BN seems to improve the results.
• An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or the feminine ending form (ta marbuta, ة).
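The effect of a missing shadda can be seen in a small example: once diacritics are dropped, the form I verb darasa and the form II verb darrasa become the same string. The helper name is hypothetical, and the sketch simply removes every Unicode combining mark.

```python
import unicodedata


def strip_diacritics(word):
    """Remove all combining marks (fatha, shadda, sukun, ...) from a word."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


darasa = "\u062F\u064E\u0631\u064E\u0633\u064E"          # form I, fully vowelized
darrasa = "\u062F\u064E\u0631\u0651\u064E\u0633\u064E"   # form II, shadda on the middle consonant
bare = "\u062F\u0631\u0633"                              # undiacritized d-r-s

print(strip_diacritics(darasa) == strip_diacritics(darrasa) == bare)  # prints True
```

Without the shadda, a lexicon lookup cannot tell the two verbs apart, so candidates for one can be attached to synsets of the other.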
Further work
• Repeat the entire procedure relying when possible on dictionaries containing diacritics
• Refine the scoring procedure by assigning different weights to the different relations.
• Include additional relations (e.g. path-based)
• Use additional Knowledge Sources for weighting the relations:
  - related entries already included in AWN
  - SUMO
  - Magnini's domain codes
Thank you for your attention