
Automatic Assignment of

Domain Labels to WordNet

Mauro Castillo V.

Francis Real V.

German Rigau C.

GWC 2004

Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

Outline

• Introduction

• WordNet

• WN Domains

• Experimentation

• Evaluation and results

• Discussion

• Conclusions

Introduction

• To semantically enrich any WN version with the semantic domain labels of MultiWordNet Domains

• WN is a standard resource for semantic processing

• Effectiveness of Word Domain Disambiguation

• This work explores the automatic and systematic assignment of domain labels to glosses

• The proposed method can be used to verify and correct the suggested labeling

WordNet

• WN1.6 was used because WN Domains is available for this version

WN Domains

TOP
  pure_science
    biology
      botany
      zoology
      entomology
      anatomy
    mathematics
      geometry
      statistics
    ... ... ...

Fragment of the WordNet Domains hierarchy, developed at IRST (Magnini and Cavaglià, 2000)

WN Domains

• The synsets have been annotated semi-automatically with one or more labels

• Most synsets have a single label

Distribution of domain labels per synset:

labels   noun    verb    adj     adv    %
1        56458   11287   16681   3460   88.2020
2         8104     743    1113    109   10.1050
3         1251      88     113      6    1.4632
4          210       8       8      0    0.2268
5            2       1       0      0    0.0030

Average number of labels per synset: noun = 1.170, verb = 1.078, adj = 1.076, adv = 1.033

WN Domains

• A domain may include synsets of different syntactic categories, e.g. MEDICINE:
  doctor#1 (n), operate#7 (v), medical#1 (a), clinically#1 (r)

• A domain label may also contain senses from different WN subhierarchies, e.g. SPORT:
  athlete#1 (life_form#1), game_equipment#1 (physical_object#1), sport#1 (act#2), playing_field#1 (location#1)

WN Domains

• Synsets that have more than one label do not seem to follow any pattern:

sultana#n#1 (pale yellow seedless grape used for raisins and wine)
Domains: BOTANY, GASTRONOMY

morocco#n#2 (a soft pebble-grained leather made from goatskin; used for shoes and book bindings etc.)
Domains: ANATOMY, ZOOLOGY

canicola_fever#n#1 (an acute feverish disease in people and in dogs marked by gastroenteritis and mild jaundice)
Domains: MEDICINE, PHYSIOLOGY, ZOOLOGY

blue#n#1, blueness#n#1 (the color of the clear sky in the daytime; "he had eyes of bright blue")
Domains: COLOR, QUALITY

WN Domains

• FACTOTUM: used to mark the senses of WN that do not have a specific domain

• STOP senses: synsets that appear frequently in different contexts, for instance numbers, colours, etc.

Applications of WN Domains:
• Word Sense Disambiguation
• Word Domain Disambiguation
• Text Categorization, etc.

Experimentation

• Process to automatically assign domain labels to WN1.6 glosses

• Validation procedures for the consistency of the domain assignments in WN1.6 and, especially, for the automatic assignment of the factotum labels

Distribution of synsets with and without the factotum domain label in WN1.6:

POS     all synsets   no FAC   %FAC
noun    66025         58252    11.77
verb    12127          4425    63.51
adj     17915          6910    61.42
adv      3575          1039    70.93

Experimentation

Test corpus for nouns and verbs:

POS     no FAC   all    %FAC
noun      572    647   11.90
verb       43    121   60.33

The test set was randomly selected (around 1% of the synsets); the remaining synsets were used as the training set.
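A minimal sketch of such a split (Python; the 1% fraction, the fixed seed and the function name are illustrative assumptions, not the authors' implementation):

import random

def split_test_train(synsets, fraction=0.01, seed=0):
    # Randomly hold out about 1% of the synsets as a test set;
    # the remaining synsets form the training set.
    rng = random.Random(seed)
    k = max(1, int(len(synsets) * fraction))
    test_idx = set(rng.sample(range(len(synsets)), k))
    test = [s for i, s in enumerate(synsets) if i in test_idx]
    train = [s for i, s in enumerate(synsets) if i not in test_idx]
    return test, train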

Experimentation

castle#n#4, castling#n#1 (Domains: CHESS, SPORT)

castle, castling | interchanging the positions of the king and a rook

Each word from the variants and the gloss is paired with each of the synset's domain labels:

castle chess          castle sport
castling chess        castling sport
interchanging chess   interchanging sport
king chess            king sport
rook chess            rook sport

Calculation of frequencies over all training synsets, e.g. for the word "castle":

castle chess          68
castle sport          27
castle history        18
castle architecture   57
castle law            12
castle tourism        24
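A minimal sketch of this pairing and counting step (Python; the synset format, the naive tokenization and the function names are illustrative assumptions, not the authors' implementation):

from collections import Counter

def collect_pairs(synsets):
    # Pair every word from a synset's variants and gloss with every
    # domain label of the synset, as in the castle/castling example.
    # A real system would lemmatize and filter stop words.
    pairs = []
    for variants, gloss, domains in synsets:
        words = [v.lower() for v in variants] + gloss.lower().split()
        for w in words:
            for d in domains:
                pairs.append((w, d))
    return pairs

# Toy training data: (variants, gloss, domain labels)
training_synsets = [
    (["castle", "castling"],
     "interchanging the positions of the king and a rook",
     ["chess", "sport"]),
]

freq = Counter(collect_pairs(training_synsets))   # c(w,D)
print(freq[("castle", "chess")])   # 1 in this toy example (68 over the full training set)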

Experimentation

Measures

M1: Square root formula

sr(w,D) = (c(w,D) - (1/N) c(w) c(D)) / sqrt(c(w,D))

M2: Association Ratio

Ar(w,D) = Pr(w|D) log2(Pr(w|D) / Pr(w))

M3: Logarithm formula

lg(w,D) = log2(N c(w,D) / (c(w) c(D)))

where c(w,D) is the number of times word w co-occurs with domain D, c(w) and c(D) are the total counts of w and of D, and N is the total number of (word, domain) pairs.
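A minimal Python sketch of the three measures, computed from the pair counts collected above (the Counter-based counts and the function names are assumptions for illustration):

import math
from collections import Counter

def build_measures(pairs):
    # pairs: list of (word, domain) tuples collected from variants and glosses
    c_wd = Counter(pairs)                    # c(w,D)
    c_w  = Counter(w for w, _ in pairs)      # c(w)
    c_d  = Counter(d for _, d in pairs)      # c(D)
    n    = len(pairs)                        # N

    def m1_sqrt(w, d):
        # M1: (c(w,D) - (1/N) c(w) c(D)) / sqrt(c(w,D))
        if c_wd[(w, d)] == 0:
            return 0.0
        expected = c_w[w] * c_d[d] / n
        return (c_wd[(w, d)] - expected) / math.sqrt(c_wd[(w, d)])

    def m2_assoc_ratio(w, d):
        # M2: Pr(w|D) * log2(Pr(w|D) / Pr(w))
        if c_wd[(w, d)] == 0:
            return 0.0
        p_w_given_d = c_wd[(w, d)] / c_d[d]
        p_w = c_w[w] / n
        return p_w_given_d * math.log2(p_w_given_d / p_w)

    def m3_log(w, d):
        # M3: log2(N c(w,D) / (c(w) c(D)))
        if c_wd[(w, d)] == 0:
            return 0.0
        return math.log2(n * c_wd[(w, d)] / (c_w[w] * c_d[d]))

    return m1_sqrt, m2_assoc_ratio, m3_log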

Experimentation

CALCULATION OF THE MATRIX OF WEIGHTS

TRAINING → matrix of weights → VALIDATION

Example row of the matrix, for the word "orange":

orange  botany        10.1739
orange  gastronomy     4.9823
orange  color          3.2823
orange  jewellery      1.4937
orange  entomology     1.2324
orange  quality        1.1782
orange  hunting        0.4125
orange  geology        0.2937
orange  chemistry      0.1662
orange  biology        0.1105

Experimentation

For each synset, the weights of all the words in its gloss and variants are combined to rank the candidate domains:

VD(dj) = Σi weight(wi, dj), normalized as a percentage over all candidate domains

Example: 06950891 leader#n#1, labelled PERSON in WN Domains

leader | a person who rules or guides or inspires others

Domain weights of the individual gloss and variant words (top domains per word):

politics 4.30, history 3.33, religion 2.19, person 1.78, mythology 1.17, commerce 1.11
person 19.94, law 8.01, economy 4.74, religion 4.24, anthropology 3.74, sexuality 3.53, politics 3.49
law 2.70, factotum 2.09, computer_science 2.05, mathematics 1.83, grammar 1.68, play 1.57, linguistics 1.54, politics 1.35
tourism 1.64, industry 1.54, person 1.46, mechanics 1.26, factotum 1.24, occultism 0.98, pedagogy 0.93
psychology 0.96, factotum 0.82

Combined ranking for the gloss:

POSITION 1: person = 30.23
POSITION 2: politics = 13.40
POSITION 3: law = 11.08
...
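One plausible reading of this combination step, as a minimal Python sketch (the weight-table layout and the normalization to percentages are assumptions based on the example above, not necessarily the authors' exact procedure):

from collections import defaultdict

def rank_domains(gloss_words, weights):
    # weights: dict mapping word -> {domain: weight}, built in the
    # training step with one of the measures M1/M2/M3.
    totals = defaultdict(float)
    for w in gloss_words:
        for domain, value in weights.get(w, {}).items():
            if value > 0:
                totals[domain] += value
    norm = sum(totals.values()) or 1.0
    # Rank domains by their share of the total weight, e.g.
    # [('person', 30.23), ('politics', 13.40), ('law', 11.08), ...]
    return sorted(((d, 100.0 * v / norm) for d, v in totals.items()),
                  key=lambda kv: kv[1], reverse=True)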

Evaluation and Results: nouns

Results for nouns with factotum (CF):

N      AP     AT     P      R      F1
M1A    70.94  79.75  64.74  68.25  66.45
M1D    74.50  84.85  68.88  72.62  70.70
M2A    45.75  50.39  42.73  43.12  42.92
M2D    52.09  57.50  48.75  49.21  48.98
M3A    66.77  74.50  60.86  63.76  62.27
M3D    71.56  81.45  66.54  69.71  68.09

Results for nouns without factotum (SF):

N      AP     AT     P      R      F1
M1A    73.95  81.82  66.81  68.68  67.73
M1D    78.50  87.24  71.24  73.24  72.23
M2A    52.45  57.52  49.32  48.24  48.77
M2D    59.44  65.21  55.94  54.71  55.32
M3A    74.48  82.69  68.41  69.41  68.91
M3D    78.85  88.64  73.33  74.41  73.87

AP: Accuracy of the first label
AT: Accuracy over all labels
P: Precision
R: Recall
F1: 2PR/(P+R)

MiA: results of formula Mi (M1, M2 or M3) counting a proposed label as correct only when it exactly matches the gold label
MiD: results of formula Mi also counting a proposed label as correct when it is subsumed by the gold label in the domain hierarchy
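A small sketch of how the MiD criterion could be checked (the hierarchy fragment and the helper functions are hypothetical; the real WordNet Domains hierarchy is much larger):

# Hypothetical child -> parent fragment of the WordNet Domains hierarchy.
DOMAIN_PARENT = {
    "banking": "economy",
    "economy": "social_science",
    "commerce": "social_science",
    "school": "pedagogy",
    "university": "pedagogy",
}

def ancestors(domain, parent=DOMAIN_PARENT):
    chain = {domain}
    while domain in parent:
        domain = parent[domain]
        chain.add(domain)
    return chain

def correct_mid(proposed, gold):
    # MiD: the proposal also counts as correct when it is related to the
    # gold label through the domain hierarchy (e.g. BANKING vs ECONOMY).
    return proposed == gold or proposed in ancestors(gold) or gold in ancestors(proposed)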

Evaluation and Results: verbs

Results for verbs with factotum (CF):

V      AP     AT     P      R      F1
M1A    51.24  57.02  47.26  50.74  48.94
M1D    51.24  57.02  47.26  50.74  48.94
M2A    13.22  14.88  12.68  13.24  12.95
M2D    16.53  19.83  16.90  17.65  17.27
M3A    23.14  28.10  21.94  25.00  23.37
M3D    24.79  29.75  23.23  26.47  24.74

Results for verbs without factotum (SF):

V      AP     AT     P      R      F1
M1A    69.77  76.74  64.71  55.93  60.00
M1D    74.72  83.72  69.23  61.02  64.86
M2A    20.93  25.58  19.64  18.64  19.13
M2D    41.86  51.16  38.60  37.29  37.93
M3A    41.86  55.81  39.34  40.68  40.00
M3D    53.49  67.44  46.77  49.15  47.93

AP: Accuracy of the first label
AT: Accuracy over all labels
P: Precision
R: Recall
F1: 2PR/(P+R)

MiA: results of formula Mi (M1, M2 or M3) counting a proposed label as correct only when it exactly matches the gold label
MiD: results of formula Mi also counting a proposed label as correct when it is subsumed by the gold label in the domain hierarchy

Evaluation and Results

• On average, the method assigns 1.23 domain labels per noun synset and 1.20 per verb synset (compared with 1.170 and 1.078 in WN Domains)

• We obtain better results for nouns

• The best average results were obtained with the M1 measure

• For nouns, the first proposed label reaches about 70% accuracy

• The results for verbs are worse than for nouns; one reason may be the high number of verbal synsets labelled with the factotum domain

Discussion

Monosemic words:

credit_application#n#1 (an application for a line of credit)

Domains: SCHOOL
Proposal 1: BANKING
Proposal 2: ECONOMY

(domain hierarchy: ECONOMY > BANKING)

Discussion

Relation between labels:

academic_program#n#1 (a program of education in liberal arts and sciences (usually in preparation for higher education))

Domains: PEDAGOGY
Proposal 1: SCHOOL
Proposal 2: UNIVERSITY

(domain hierarchy: PEDAGOGY > SCHOOL, UNIVERSITY)

Discussion

Relation between labels:

shopping#n#1 (searching for or buying goods or services: "went shopping for a reliable plumber"; "does her shopping at the mall rather than down town")

Domains: ECONOMY
Proposal 1: COMMERCE

(domain hierarchy: SOCIAL_SCIENCE > COMMERCE, ECONOMY)

Discussion

Relation between labels:

fire_control_radar#n#1 (radar that controls the delivery of fire on a military target)

Domains: MERCHANT_NAVY
Proposal 1: MILITARY

(domain hierarchy: SOCIAL_SCIENCE > TRANSPORT > MERCHANT_NAVY; SOCIAL_SCIENCE > MILITARY)

Discussion

Uncertain cases:

birthmark#n#1 (a blemish on the skin formed before birth)

Domains: QUALITY
Proposal 1: MEDICINE

bardolatry#n#1 (idolization of William Shakespeare)

Domains: RELIGION
Proposal 1: HISTORY
Proposal 2: LITERATURE

Conclusions

• Automatically assigning domain labels to WN glosses appears to be a difficult task

• The proposed process is very reliable for the first proposed label

• The proposed labels are ranked by priority

• It is possible to add new correct labels or to validate the existing ones

Mauro Castillo V.

Francis Real V.

German Rigau C.

Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

Automatic Assignment of

Domain Labels to WordNet

GWC 2004