LREC 2008 AWN 1

Arabic WordNet: Semi-automatic Extensions using Bayesian Inference

H. Rodríguez1, D. Farwell1, J. Farreres1, M. Bertran1, M. Alkhalifa2, M.A. Martí2

1 TALP Research Center, UPC, Barcelona, Spain
2 UB, Barcelona, Spain


Index of the talk

• The AWN project
• Semi-automatic Extensions of AWN
    Intuitive basis
    Previous work using heuristics
    Using Bayesian Networks
• Empirical evaluation
• Conclusions


The AWN project

• Funded by the USA REFLEX program (2005-2007)
• Partners:
    Universities: Princeton, Manchester, UPC, UB
    Companies: Articulate Software, Irion
• Description:
    Black et al., 2006
    Elkateb et al., 2006
    Rodríguez et al., 2008


The AWN project

• Objectives
    10,000 synsets, including some amount of domain-specific data
    linked to PWN 2.0, finally to PWN 3.0
    linked to SUMO
    + 1,000 NE
    manually built (or revised)
    vowelized entries
    including the root of each entry


The AWN project

• Current figures
    Arabic synsets    11270
    Arabic words      23496

    POS      DB content
    adj      661
    nouns    7961
    adv      110
    verbs    2538

    Named entities:
    Synsets that are named entities             1142
    Synsets that are not named entities         10028
    Words in synsets that are named entities    1656


Semi-automatic Extensions of AWN

• Intuitive basis
    In Arabic (and other Semitic languages), many words sharing a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules.



• Lexical rules
    regular verbal derivative forms
    regular nominal and adjectival derivative forms
        masdar (verbal noun)
        masculine and feminine active and passive participles
    inflected verbal forms
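As a rough illustration of how lexical rules of this kind operate, the sketch below fills a triliteral root into a few templates, using a simple Latin transliteration. The template inventory, the names `TEMPLATES` and `derive`, and the transliteration are illustrative assumptions; they are not the rule set used in the AWN project.

```python
# Toy root-and-pattern derivation in Latin transliteration.
# TEMPLATES is a small, assumed sample of Form I patterns; the digits 1-3
# stand for the first, second and third consonant of the root.
TEMPLATES = {
    "perfect_3ms": "1a2a3a",          # e.g. k-t-b -> kataba
    "active_participle": "1aa2i3",    # e.g. k-t-b -> kaatib
    "passive_participle": "ma12uu3",  # e.g. k-t-b -> maktuub
}


def derive(root: str) -> dict[str, str]:
    """Fill every template with the consonants of a triliteral root."""
    consonants = root.split("-")
    if len(consonants) != 3:
        raise ValueError("expected a triliteral root such as 'k-t-b'")
    forms = {}
    for name, template in TEMPLATES.items():
        forms[name] = "".join(
            consonants[int(ch) - 1] if ch.isdigit() else ch
            for ch in template
        )
    return forms


forms = derive("d-r-s")
print(forms)
```

A real system would need many more patterns, and would have to handle weak and geminated roots, which this toy version ignores.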



• Procedure for generating a set of likely <Arabic word, English synset, score> triples:
    produce an initial list of candidate word forms
    filter out the less likely candidates from this list
    generate an initial list of attachments
    score the reliability of these candidates
    manually review the best scored candidates and include the valid associations in AWN
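The output of this procedure can be pictured as a ranked review queue. The sketch below uses hypothetical names (`Candidate`, `review_queue`) and invented words, synset identifiers and scores; it only shows how scored triples might be ordered before manual review.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    arabic_word: str     # candidate Arabic word form
    english_synset: str  # identifier of the proposed English synset
    score: float         # reliability score of the attachment


def review_queue(candidates: list[Candidate], top_k: int) -> list[Candidate]:
    """Return the top_k best scored candidates, highest score first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]


# Placeholder data: words, synset ids and scores are invented for illustration.
candidates = [
    Candidate("word_a", "synset_1", 0.12),
    Candidate("word_a", "synset_2", 0.71),
    Candidate("word_b", "synset_3", 0.40),
]
queue = review_queue(candidates, top_k=2)
print(queue)
```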



• Resources
    PWN
    AWN
    LOGOS database of conjugated Arabic verbs
    NMSU bilingual Arabic-English lexicon
    Arabic Gigaword Corpus
    UN (2000-2002) bilingual Arabic-English Corpus



• Score the reliability of the candidates
    build a graph representing the words, synsets and their associations
        synset-synset associations: explicit in WN 2.0, path-based
    apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
    apply Bayesian inference (LREC 2008)


Using Bayesian Inference

[Figure: network with nodes Abase; A1 ... An; E1 ... Ei ... Ej ... Em; S1 ... Sp]

[Figure: four-layer network. Layer 1: A1 ... An; layer 2: E1 ... Ei ... Ej ... Em; layer 3: S11 ... S1p; layer 4: S21 ... S2r]



• Building the CPT (conditional probability table) for each node in the BN
    edges EW-AW
        probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities)
        filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
        all the probability mass is distributed among the pairs occurring in the BN
    other edges (EW-S, S-S)
        linear distribution on priors
        noisy-or model
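Two of the steps above lend themselves to a short sketch: redistributing the translation probability mass over the pairs that actually occur in the network, and combining several parents with a noisy-or gate. The function names, the per-English-word normalization and the example values are assumptions made for illustration; the slides do not specify these details.

```python
from math import prod


def renormalize(pair_probs, pairs_in_bn):
    """Keep only (English, Arabic) pairs present in the network and rescale
    them so that, for each English word, the kept probabilities sum to 1.
    Normalizing per English word is an assumption of this sketch."""
    kept = {pair: p for pair, p in pair_probs.items() if pair in pairs_in_bn}
    totals = {}
    for (english, _arabic), p in kept.items():
        totals[english] = totals.get(english, 0.0) + p
    return {
        pair: p / totals[pair[0]]
        for pair, p in kept.items()
        if totals[pair[0]] > 0
    }


def noisy_or(parent_probs, link_strengths):
    """Probability that a child node is true under a noisy-or gate, given
    independent parents with the stated probabilities of being true."""
    return 1.0 - prod(1.0 - q * w for q, w in zip(parent_probs, link_strengths))


# Invented example values.
probs = {("study", "w1"): 0.2, ("study", "w2"): 0.6, ("study", "w3"): 0.2}
scaled = renormalize(probs, {("study", "w1"), ("study", "w3")})
combined = noisy_or([1.0, 1.0], [0.5, 0.5])
```

With a noisy-or gate each parent only needs one strength parameter, so the table for a node grows linearly rather than exponentially with its number of parents.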



• Performing Bayesian inference in the BN
    Assign probability 1 to the nodes in layer 1
    Infer the probabilities of the nodes in layer 3
    For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have a probability over a threshold
    Score each candidate pair with this probability
    Select the candidates scored over a threshold
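The selection steps can be sketched on a toy three-layer network. This is a simplified forward pass through noisy-or gates, not the inference engine used by the authors; the edge direction (layer 1 to layer 2 to layer 3), the node names and all weights are assumptions, and layer 4 is left out.

```python
from math import prod

# Hypothetical toy network; every name and weight is invented.
# word_to_english[a][e]: strength of the edge from layer-1 word a to layer-2 word e
word_to_english = {
    "A1": {"E1": 0.6, "E2": 0.3},
    "A2": {"E2": 0.5},
}
# english_to_synset[e][s]: strength of the edge from layer-2 word e to layer-3 synset s
english_to_synset = {
    "E1": {"S1": 0.9},
    "E2": {"S1": 0.5, "S2": 0.1},
}


def propagate(parent_probs, edges):
    """One forward step: noisy-or over the parents of every child node."""
    children = {c for targets in edges.values() for c in targets}
    return {
        c: 1.0 - prod(
            1.0 - parent_probs.get(p, 0.0) * targets[c]
            for p, targets in edges.items()
            if c in targets
        )
        for c in children
    }


def select_candidates(threshold):
    """Clamp layer 1 to probability 1, infer layer 3, then keep for each
    layer-1 word the connected synsets whose probability exceeds threshold."""
    layer1 = {a: 1.0 for a in word_to_english}
    layer2 = propagate(layer1, word_to_english)
    layer3 = propagate(layer2, english_to_synset)
    result = {}
    for a, english in word_to_english.items():
        reachable = {s for e in english for s in english_to_synset.get(e, {})}
        result[a] = {s: layer3[s] for s in reachable if layer3[s] > threshold}
    return result


result = select_candidates(threshold=0.1)
print(result)
```

Raising the threshold trades recall for precision, which is the effect explored in the evaluation.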


Empirical Evaluation

• 10 verbs randomly selected from AWN + درس

    Arabic verb    # English words    # Synsets (S1 S2)
    [illegible]    190                107
    [illegible]    71                 77
    [illegible]    21                 31
    [illegible]    102                62
    [illegible]    19                 9
    [illegible]    80                 105
    [illegible]    40                 22
    [illegible]    56                 49
    [illegible]    34                 38
    [illegible]    85                 140
    [illegible]    51                 57


Empirical Evaluation

• Results

    Selection   Threshold        candidates  accept  reject  precision  recall  F1
    HEU         all heuristics   272         135     137     0.50       0.61    0.55
    HEU         heuristics 1,2   61          40      21      0.65       0.18    0.28
    BN          0                554         223     331     0.40       1       0.57
    BN          0.01             243         125     118     0.51       0.56    0.53
    BN          0.02             214         116     98      0.54       0.52    0.53
    BN          0.07             112         65      47      0.58       0.29    0.39
    BN          0.1              100         60      40      0.60       0.27    0.37
    BN + HEU    0                272         154     118     0.56       0.69    0.62
    BN + HEU    0.01             212         121     91      0.57       0.54    0.55
    BN + HEU    0.02             201         115     86      0.57       0.41    0.48
    BN + HEU    0.07             92          65      27      0.71       0.38    0.5
    BN + HEU    0.1              83          59      24      0.71       0.12    0.21
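For readers who want to check the figures, the sketch below computes precision, recall and F1 from the accept and candidate counts. It assumes that recall is measured against the 223 pairs accepted at BN threshold 0 (the row with recall 1); the slides do not state this, and the function name `prf` is hypothetical. Under that assumption the "HEU, all heuristics" row is reproduced; other rows may not follow this reading exactly.

```python
def prf(accepted: int, candidates: int, gold: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts."""
    precision = accepted / candidates
    recall = accepted / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


GOLD = 223  # assumption: number of valid pairs, taken from the BN threshold-0 row

p, r, f1 = prf(accepted=135, candidates=272, gold=GOLD)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # prints "0.50 0.61 0.55"
```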


Conclusions

• The BN approach roughly doubles the number of candidates of the previous HEU approach (554 vs 272).
• The sample is clearly insufficient.
• The overlap of HEU + BN seems to improve the results.
• An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or of the feminine ending form (ta marbuta, ة).


Further work

• Repeat the entire procedure, relying when possible on dictionaries containing diacritics.
• Refine the scoring procedure by assigning different weights to the different relations.
• Include additional relations (e.g. path-based).
• Use additional knowledge sources for weighting the relations:
    related entries already included in AWN
    SUMO
    Magnini's domain codes


Thank you for your attention
