LREC 2008
Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2)
(1) Talp Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain




Index of the talk

• The AWN project
• Semi-automatic Extensions of AWN
  - Intuitive basis
  - Previous work using heuristics
  - Using Bayesian Networks
• Empirical evaluation
• Conclusions


The AWN project

• Funded by the USA REFLEX program (2005-2007)
• Partners:
  - Universities: Princeton, Manchester, UPC, UB
  - Companies: Articulate Software, Irion
• Description:
  - Black et al., 2006
  - Elkateb et al., 2006
  - Rodríguez et al., 2008


The AWN project

• Objectives
  - 10,000 synsets, including some amount of domain-specific data
  - linked to PWN 2.0, finally to PWN 3.0
  - linked to SUMO
  - + 1,000 NE
  - manually built (or revised)
  - vowelized entries
  - including root of each entry


The AWN project

• Current figures

  Arabic synsets    11270
  Arabic words      23496

  pos       DB content
  adj       661
  nouns     7961
  adv       110
  verbs     2538

  Named entities:
  Synsets that are named entities             1142
  Synsets that are not named entities         10028
  Words in synsets that are named entities    1656


Semi-automatic Extensions of AWN

• Intuitive basis
  In Arabic (and other Semitic languages), many words having a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules.




Semi-automatic Extensions of AWN

• Lexical rules
  - regular verbal derivative forms
  - regular nominal and adjectival derivative forms
  - masdar (nominal verb)
  - masculine and feminine active and passive participles
  - inflected verbal forms
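As a toy illustration of how lexical rules can generate candidate forms from a root, the sketch below fills the three consonant slots of a few hand-written templates. The templates, the simplified Latin transliteration (loosely Buckwalter-like, with no sukun marks) and the function names are illustrative assumptions, not the rule set used in the project.

```python
# Hypothetical templates: the digits 1-3 stand for the three root consonants.
TEMPLATES = {
    "base verb": "1a2a3a",
    "active participle": "1A2i3",
    "passive participle": "ma12uw3",
}


def apply_template(root, template):
    """Replace the slot digits 1..3 in the template with the root consonants."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in template)


def derive(root):
    """Return one candidate form per template for the given root."""
    return {name: apply_template(root, tpl) for name, tpl in TEMPLATES.items()}


if __name__ == "__main__":
    for name, form in derive("drs").items():
        print(f"{name}: {form}")
```

Real rule sets also have to handle weak roots, gemination and vowel alternations, which this sketch ignores.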


Semi-automatic Extensions of AWN

• Procedure for generating a set of likely <Arabic word, English synset, score>:
  - produce an initial list of candidate word forms
  - filter out the less likely candidates from this list
  - generate an initial list of attachments
  - score the reliability of these candidates
  - manually review the best scored candidates and include the valid associations in AWN.


Semi-automatic Extensions of AWN

• Resources
  - PWN
  - AWN
  - LOGOS database of conjugated Arabic verbs
  - NMSU bilingual Arabic-English lexicon
  - Arabic Gigaword Corpus
  - UN (2000-2002) bilingual Arabic-English Corpus


Semi-automatic Extensions of AWN

• Score the reliability of the candidates
  - build a graph representing the words, synsets and their associations
      associations synset-synset: explicit in WN2.0, path-based
  - apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
  - apply Bayesian inference (LREC 2008)
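The graph of words, synsets and associations can be sketched as a small adjacency structure. This is a minimal illustration, assuming two hypothetical inputs: a bilingual lexicon mapping each Arabic word to its English translations, and a map from English words to synset identifiers. The names and the synset identifiers in the usage note are made up for the example.

```python
from collections import defaultdict


def build_graph(arabic_to_english, english_to_synsets):
    """Build directed edges AW -> EW and EW -> S as a dict of sets."""
    edges = defaultdict(set)
    for aw, english_words in arabic_to_english.items():
        for ew in english_words:
            edges[("AW", aw)].add(("EW", ew))
            for synset in english_to_synsets.get(ew, ()):
                edges[("EW", ew)].add(("S", synset))
    return edges


def candidate_attachments(edges):
    """Collect every <Arabic word, synset> pair linked through an English word."""
    pairs = set()
    for node, english_nodes in edges.items():
        if node[0] != "AW":
            continue
        for ew_node in english_nodes:
            for synset_node in edges.get(ew_node, ()):
                pairs.add((node[1], synset_node[1]))
    return pairs
```

For example, with `{"drs": ["study", "learn"]}` as lexicon and `{"study": ["s1"], "learn": ["s1", "s2"]}` as synset map, the candidate attachments are `("drs", "s1")` and `("drs", "s2")`. Synset-synset edges are left out of this sketch.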


Using Bayesian Inference

[Figure: graph with nodes Abase, A1 ... An, E1 ... Ei ... Ej ... Em, S1 ... Sp]


Using Bayesian Inference

[Figure: Bayesian network drawn in four layers (1 to 4), with nodes A1 ... An, E1 ... Ei ... Ej ... Em, S11 ... S1p and S21 ... S2r]


Using Bayesian Inference

• Building the CPT for each node in the BN
  - edges EW-AW
      probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities)
      filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
      all the probability mass is distributed between pairs occurring in the BN
  - other edges (EW-S, S-S)
      linear distribution on priors
      noisy-or model
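Two of the ingredients above can be sketched in a few lines: redistributing translation probability mass over the pairs that occur in the network, and combining parent influences with a noisy-or. This is a minimal sketch under stated assumptions: the renormalization is done per source word, the noisy-or has no leak term and treats parents as independent, and all names and weights are hypothetical.

```python
from math import prod


def renormalize(trans_probs, bn_pairs):
    """Keep only (source, target) pairs present in the network and rescale
    so that the kept probabilities of each source word sum to 1."""
    kept = {pair: p for pair, p in trans_probs.items() if pair in bn_pairs}
    totals = {}
    for (src, _), p in kept.items():
        totals[src] = totals.get(src, 0.0) + p
    return {(src, tgt): p / totals[src] for (src, tgt), p in kept.items()}


def noisy_or(parent_probs, weights):
    """Probability that the child is true: 1 - prod(1 - w_i * p_i), where p_i
    is the probability that parent i is true and w_i the strength of its edge."""
    return 1.0 - prod(1.0 - w * p for w, p in zip(weights, parent_probs))
```

With the noisy-or, a child becomes more probable as more of its parents are likely, without needing a full table over all parent combinations.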


Using Bayesian Inference

• Performing Bayesian Inference in the BN
  - Assign probability 1 to nodes in layer 1
  - Infer the probabilities of nodes in layer 3
  - For each word in layer 1, select as candidates the synsets in layer 3 connected to it and with probability over a threshold
  - Score the candidate pair with this probability
  - Select the candidates scored over a threshold
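The selection step above can be sketched as a simple filter over inferred probabilities. The sketch assumes the inference has already been run and that two hypothetical inputs are available: `reachable`, mapping each layer-1 word to the layer-3 synsets connected to it, and `synset_prob`, mapping each synset to its inferred probability.

```python
def select_candidates(reachable, synset_prob, threshold):
    """Return <word, synset, score> triples whose score exceeds the threshold,
    best scored first."""
    candidates = []
    for word, synsets in reachable.items():
        for synset in synsets:
            score = synset_prob[synset]
            if score > threshold:
                candidates.append((word, synset, score))
    return sorted(candidates, key=lambda c: c[2], reverse=True)
```

Raising the threshold shortens the list handed to the manual reviewers, at the cost of discarding some valid associations.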


Empirical Evaluation

• 10 verbs randomly selected from AWN + درس

  Arabic verb    # English Words    # Synsets (S1 S2)
  [illegible]    190                107
  [illegible]    71                 77
  [illegible]    21                 31
  [illegible]    102                62
  [illegible]    19                 9
  [illegible]    80                 105
  [illegible]    40                 22
  [illegible]    56                 49
  [illegible]    34                 38
  [illegible]    85                 140
  درس            51                 57


Empirical Evaluation

• Results

  Selection   Threshold        candidates  accept  reject  precision  recall  F1
  HEU         all heuristics   272         135     137     0.50       0.61    0.55
  HEU         heuristics 1,2   61          40      21      0.65       0.18    0.28
  BN          0                554         223     331     0.40       1       0.57
  BN          0.01             243         125     118     0.51       0.56    0.53
  BN          0.02             214         116     98      0.54       0.52    0.53
  BN          0.07             112         65      47      0.58       0.29    0.39
  BN          0.1              100         60      40      0.60       0.27    0.37
  BN + HEU    0                272         154     118     0.56       0.69    0.62
  BN + HEU    0.01             212         121     91      0.57       0.54    0.55
  BN + HEU    0.02             201         115     86      0.57       0.41    0.48
  BN + HEU    0.07             92          65      27      0.71       0.38    0.5
  BN + HEU    0.1              83          59      24      0.71       0.12    0.21
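The three measures in the table can be recomputed from the counts. The sketch below assumes that recall is taken relative to the 223 candidates accepted in the BN row with threshold 0, which is what the recall of 1 in that row suggests. The function name and the rounding are illustrative.

```python
def precision_recall_f1(accepted, candidates, total_valid):
    """Compute precision, recall and F1 from raw counts."""
    precision = accepted / candidates if candidates else 0.0
    recall = accepted / total_valid if total_valid else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Row "HEU, all heuristics": 272 candidates, 135 accepted.
    p, r, f1 = precision_recall_f1(accepted=135, candidates=272, total_valid=223)
    print(round(p, 2), round(r, 2), round(f1, 2))
```

For that row the rounded values agree with the table (0.50, 0.61, 0.55).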


Conclusions

• The BN approach doubles the number of candidates of the previous HEU approach (554 vs 272).
• The sample is clearly insufficient.
• The overlapping of HEU + BN seems to improve the results.
• An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or the feminine ending form (ta marbuta, ة).


Further work

• Repeat the entire procedure relying when possible on dictionaries containing diacritics

• Refine the scoring procedure by assigning different weights to the different relations.

• Include additional relations (e.g. path-based)
• Use additional Knowledge Sources for weighting the relations:
  - related entries already included in AWN
  - SUMO
  - Magnini's domain codes


Thank you for your attention