Extraction of Ontological Information from Corpora (and Lexicon)

Dimitrios Kokkinakis [email protected]
Maria Toporowska Gronostaj [email protected]

Oslo, 14-16 Sep 2003


Outline

• Goals & Observations, Resources
• Related Research
• Extending the Coverage of Semantic Resources (S-SIMPLE: Quality but not Quantity)
  – Why and How?
• Key Issues Investigated for the Acquisition
  – Compounding vs. Syntactic Parsing & Large Corpora vs. Defining Lexicons
  – Pilot study regarding lexico-syntactic patterns
• Enhancement – What has been achieved?
• Error Analysis – For parts of the studies…
• Conclusions & Future Plans

Goals

Extend & enrich the coverage of the Swedish semantic lexicon:
– as automatically as possible
– as inexpensively as possible (using whatever support was available)
– re-using lexical resources (not necessarily semantic)

Test ideas regarding:
– context similarity
– similarity in NPs of enumerative type (+ evaluation) - breadth
– the power of compounds - breadth
– bootstrapping the SIMPLE content
– using lexico-syntactic patterns for hyper/hypo relations - depth
– (statistical means)

research conducted 00-01

Observations & Hypotheses

Observation-1: Take into account the compounding characteristic of Swedish
– + easier to identify (compared to English, at least in raw text)
– - harder to segment/analyse (compared to English)
– + a lot of disambiguated compounds in our lexical DB

Observation-2: Yet another view of context similarity (see Related Research)
Members of a semantic group are often surrounded by other members of the same group in text; in other words, words entering into the same syntagmatic relation with other words can be perceived as semantically similar

Observation-3: Apply lexico-syntactic patterns à la Hearst for more complex relations (pilot…) – why? Because during the previous 2 steps (see later discussion) we mainly extract synonymic/co-hyponymic entries

Resources

Core SIMPLE lexicon
– 10,000 semantic units (6,000 words)
– a vital part of each entry's semantic unit is the notion of semantic class, whose value is an element in a hierarchically structured list of semantic classes (95 classes) (LexiQuest)
– content: high quality; manually compiled and verified, but…
– limited vocabulary - quantitatively insufficient for HLT

Gothenburg Lexical DataBase (GLDB)
– ca 70,000 lexical entries
– monolingual defining lexicon – for human readers (but + RDB-format)
– advantage (particularly for this study): a number of synonymic compounds

Corpora
– ca 40 mil. tokens (syntactically analysed)

Related Research (1)

Context similarity plays an important role in word acquisition

… so, a common characteristic of most approaches is the computation of the semantic similarity between two words on the basis of the extent to which the words' average contexts of use overlap

Usual assumption: members of the same semantic group co-occur in discourse [cf. Riloff & Shepherd, 97]

Use of syntax for generating semantic knowledge based on distributional evidence & syntagmatic relations is found in most previous research

Related Research (2)

Approaches in general – steps:
– Extract word co-occurrences (most crucial part)
  usually gathered based on certain relations, e.g. predicate-argument, modifier-modified, adjacency, …
– Define similarities between words on the basis of co-occurrences (+ linguistic knowledge)
  combine existing linguistic knowledge (seed lex.) & co-occur. data
– Cluster words on the basis of similarities
  e.g. by using the contexts of the words as features and grouping together the words that tend to appear in similar contexts
  for compensating the sparseness of the co-occ. data

Related Research (3a)

Hearst (1992): lexico-syntactic patterns – discovered by observation – for extracting hyponymy relations from corpora
– e.g. NP {, NP}* {,} and other NP
  temples, treasuries and other important civic buildings

Grefenstette (1994): extract corpus-specific semantics in parsed text using (weighted) Jaccard (the Jaccard measure between two objects m and n is the number of shared attributes divided by the number of attributes in the unique union of the sets of attributes for each object), e.g. comparing ‘dog’ & ‘cat’ via textually derived attributes and the binary Jaccard measure
– dog/pet-DOBJ dog/eat-SBJ dog/brown dog/shaggy dog/leash
– cat/pet-DOBJ cat/pet-DOBJ cat/hairy cat/leash

count({attribs shared by cat and dog}) / count({unique attribs possessed by cat or dog})
= count({leash, pet-DOBJ}) / count({brown, eat, hairy, leash, pet-DOBJ, shaggy}) = 2/6 = 0,333
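The dog/cat calculation can be reproduced with a few lines of Python. This is only a minimal sketch of the binary Jaccard measure as described on the slide; the function name `binary_jaccard` is illustrative and the weighted variant used by Grefenstette is not covered.

```python
def binary_jaccard(attrs_a, attrs_b):
    """Shared attributes divided by the attributes in the union (duplicates ignored)."""
    a, b = set(attrs_a), set(attrs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Attribute lists taken from the slide; the repeated cat/pet-DOBJ counts once.
dog = ["pet-DOBJ", "eat-SBJ", "brown", "shaggy", "leash"]
cat = ["pet-DOBJ", "pet-DOBJ", "hairy", "leash"]

print(round(binary_jaccard(dog, cat), 3))  # 2 shared / 6 unique = 0.333
```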

Related Research (3b)

Lin (1998): constructing a thesaurus using syntactically parsed corpora containing dependency triples: ||word1, relation, word2|| = frequency; the word similarity measure is defined based on the distributional pattern of words (“the similarity between 2 objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects”)

e.g.: ||cell, pobj-of, inside|| = 16 (dependency triple = 2 words + gram. relation)

I(w,r,w’) is the amount of info in ||w,r,w’||:

$$ I(w,r,w') = \log \frac{\|w,r,w'\| \times \|*,r,*\|}{\|w,r,*\| \times \|*,r,w'\|} $$

similarity between 2 words (w1, w2) is based on:

$$ \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \bigl( I(w_1,r,w) + I(w_2,r,w) \bigr)}{\sum_{(r,w) \in T(w_1)} I(w_1,r,w) + \sum_{(r,w) \in T(w_2)} I(w_2,r,w)} $$

Roark & Charniak (1998): noun-phrase co-occurrence statistics (actually bigrams ranked by log-likelihood) for semi-automatic semantic lexicon construction; input is a parsed corpus and initial seed words (= the most frequent head nouns in a corpus [top 200-500]) – based on conjunctions (cars and trucks), lists (planes, trains and automobiles), appositives and noun compounds (pickup truck)
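The two formulas for Lin's measure can be tried out on toy data. The sketch below is an illustration under assumptions, not Lin's implementation: T(w) is taken to be the set of (r, w') pairs with positive I(w,r,w'), which follows Lin's paper rather than the slide, and the triple counts are invented.

```python
import math
from collections import defaultdict

def marginals(triples):
    """Sum frequencies for ||w,r,*||, ||*,r,w'|| and ||*,r,*||."""
    w_r, r_w, r_all = defaultdict(int), defaultdict(int), defaultdict(int)
    for (w, r, w2), n in triples.items():
        w_r[(w, r)] += n
        r_w[(r, w2)] += n
        r_all[r] += n
    return w_r, r_w, r_all

def info(triples, margins, w, r, w2):
    """I(w,r,w'): information carried by one dependency triple."""
    n = triples.get((w, r, w2), 0)
    if n == 0:
        return 0.0
    w_r, r_w, r_all = margins
    return math.log(n * r_all[r] / (w_r[(w, r)] * r_w[(r, w2)]))

def features(triples, margins, w):
    """T(w): (r, w') pairs with positive information (assumed definition)."""
    return {(r, w2) for (head, r, w2) in triples
            if head == w and info(triples, margins, w, r, w2) > 0}

def lin_similarity(triples, w1, w2):
    margins = marginals(triples)
    t1, t2 = features(triples, margins, w1), features(triples, margins, w2)
    shared = sum(info(triples, margins, w1, r, w) + info(triples, margins, w2, r, w)
                 for r, w in t1 & t2)
    total = (sum(info(triples, margins, w1, r, w) for r, w in t1)
             + sum(info(triples, margins, w2, r, w) for r, w in t2))
    return shared / total if total else 0.0

# Invented toy counts, for illustration only.
toy = {
    ("dog", "subj-of", "bark"): 4, ("dog", "obj-of", "feed"): 2,
    ("cat", "subj-of", "purr"): 3, ("cat", "obj-of", "feed"): 2,
    ("car", "subj-of", "stall"): 1, ("car", "obj-of", "drive"): 5,
}
print(lin_similarity(toy, "dog", "cat"), lin_similarity(toy, "dog", "car"))
```

In the toy data, dog and cat share the feature (obj-of, feed) and so receive a non-zero score, while dog and car share nothing and score zero.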

Related Research (3c)

Tokunaga et al. (1997): new words (nouns) are classified on the basis of the relative probabilities of a word belonging to a given word class, with the probabilities calculated using noun-verb co-occurrence pairs (Japanese + BGH thesaurus) – algo. originally developed for document categorization – each noun is represented by a set of co-occurring verbs

Lin & Pantel (2002): each word is represented by a feature vector; each feature corresponds to a context in which the word occurs (threaten with _ is a context, and if handgun occurred in that context, the context is a feature of handgun); the value of a feature is the MI between the feature and the word; similarity between 2 words is calculated using the cosine coef. of their MI vectors – clustering is then based on these results
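The Lin & Pantel representation can be sketched as follows. This is a simplified illustration, assuming that "MI" means pointwise mutual information estimated from raw counts; the observations and function names are invented and no clustering step is shown.

```python
import math
from collections import Counter

def mi_vectors(observations):
    """Map each word to a {context: pointwise MI} vector from (word, context) pairs."""
    pair_counts = Counter(observations)
    word_counts, ctx_counts = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_counts[word] += n
        ctx_counts[ctx] += n
    total = sum(pair_counts.values())
    vectors = {}
    for (word, ctx), n in pair_counts.items():
        mi = math.log(n * total / (word_counts[word] * ctx_counts[ctx]))
        vectors.setdefault(word, {})[ctx] = mi
    return vectors

def cosine(u, v):
    """Cosine coefficient of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented observations; "threaten with _" is the context named on the slide.
obs = [
    ("handgun", "threaten with _"), ("knife", "threaten with _"),
    ("handgun", "fire _"), ("knife", "sharpen _"),
    ("apple", "eat _"), ("pear", "eat _"),
]
vecs = mi_vectors(obs)
print(cosine(vecs["handgun"], vecs["knife"]), cosine(vecs["handgun"], vecs["apple"]))
```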

So… enhancing SIMPLE by…

… Analyzing Compounds
a large number of compounds can inherit relevant parts of semantic info, provided that the heads of the lexemes occur in SIMPLE; testing for lexicalisation in GLDB in order to avoid incorporation of idiomatic or metonymic meanings; applying compound segmentation

… Semantic similarity in NPs of enumerative type
use of partial parsing on large corpora; words entering into the same syntagmatic relation with other words are perceived as semantically similar; however, certain conditions must be satisfied in order to avoid incorporation of erroneous entries

… Lexico-syntactic patterns
for acquiring concepts higher in the hierarchy (see examples)

Extending SIMPLE… illustration

Compounding example:
färja:?, kryssningsfartyg:?, tankers:? och ro-ro-fartyg:? >> No matches
(ferries, cruise-ships, tankers and ro-ro-vessels)
färja:? kryssnings||fartyg:VEH tankers:? ro-ro-||fartyg:VEH
>> färja:VEH kryssningsfartyg:VEH tankers:VEH ro-ro-fartyg:VEH

Enumerative NP example:
jurister:OCC-AG, läkare:OCC-AG, optiker:OCC-AG, psykologer:? och sjukgymnaster:? >> 3 Matches
(lawyers, doctors, opticians, psychologists and physiotherapists)
>> condition: if ≥2 have the same tag & the rest have none ==> add to lexicon!
>> psykolog:OCC-AG sjukgymnast:OCC-AG

Lexico-syntactic pattern example:
älgar, sorkar, fåglar, kor, hästar och andra djur
(elks, voles, birds, cows, horses and other animals)

Compounding

Take advantage of the fact that Swedish is a compounding language (e.g. >70% of SAOL are compounds)
– single orthographic units
– many compound words are not lexically represented
– generally having predictable meanings - relatively transparent
– most compounds are essentially binary & in most cases both elements are represented in GLDB
– given a sizeable number of analysed compounds, it is possible to automatically establish a ”semantic compounding profile” for all lexemes in predictable compounds
– meaning as a function of the meaning of the components related to each other by an implied predicative functor
– e.g. brödkniv = bröd(X) + kniv(Y) ‘bread knife’ implies ‘Y for (cutting) X’

Used compounds from the GLDB's synonym slot … and corpora … but they have to be segmented & analysed

see Järborg, Kokkinakis & Toporowska-Gronostaj, ’02

Semantic Compound Definitions

Semantic Definition           Example
Y that is located in/at…      klassrumsdörr   classroom+door
Y that is made up of X        kanalsystem     canal+system
Y that originates from X      smutsfläck      dirt+stain
Y that is aimed at X          kaninjakt       rabbit+hunt
Y that is about X             partikelfysik   particle+physics
Y that produces X             batterifabrik   battery+factory
Y that prevails in X          partiideologi   party+ideology
Y that contains X             kaffetermos     coffee+thermos
Y that consists of X          kaffepulver     coffee+powder
Y that has to do with X       klädbesvär      clothes+trouble
.......                       .......

An Example Profile for ‘område’

område.1.1.0 <geogr.>: avrinning.1.1.0, bangård.1.1.0, barrskog.1.1.0, kust.1.1.0, katastrof.1.1.0, land.1.1.b, Luleå.PM, myr.1.1.0

område.1.1.b <abstr.>: Medelhavs.PM, marknadsföra.1.1.0, affär.1.2.b, avtal.1.1.0, kommunikation.1.2.0, kompetens.1.1.0, kultur.1.2.0, kostnad.1.1.0, kärna.1.1.c, kunskap.1.1.a, läkemedel.1.1.0, marknad.1.2.0, motiv.1.2.0, mark.1.2.0

Compounds fr. GLDB

Already disambiguated... GLDB & S-SIMPLE entries are linked to the sub-senses in GLDB, e.g. S-SIMPLE encodes the non-compound lemma ämne (as having 4 senses, marked 1/1-1/4), which are disambiguated here by means of their assignment to the following semantic types and semantic classes:
– Material: Matter ‘material’
– Substance: Substance ‘stoff’
– Part: Abstract ‘topic’
– Domain: Notion ‘subject, discipline’

Each of the senses is exemplified in GLDB with a number of compounds, comprising 26 in total with ämne as the head

SIMPLE (5)             GLDB (26)
ämne:1/1:Matter        grundämne:1/1:Matter, färgämne:1/1, hornämne:1/1
ämne:1/2:Substance     yxämne:1/2, fruktämne:1/2
ämne:1/3:Abstract      predikoämne:1/3, uppsatsämne:1/3
ämne:1/4:Notion        läroämne:1/4, skolämne:1/4

Compounds fr. Corpora

Heuristic compound decomposition/segmentation and matching of the SIMPLE content with the heads of the segmented compounds

• Try to distinguish the modifier’s characteristics (pos & semantic category - if any)
  • is modifier = adjective or proper noun? OK
  • e.g. klocka: digital||klocka; stor||klocka
  • e.g. anhängare: Hitler||anhängare; Likud||anhängare
• S-SIMPLE as a means of bootstrapping the process
  • e.g. glas ‘glass’, extended with compounds having SUBSTANCE as a modifier: [vatten,vin,öl,likör]glas: ‘water, wine, beer’ and ‘liqueur’
• Check against lists of lexicalized ones to eliminate incorrect data => GLDB allows the exclusion of such compounds from the derived sets
  • e.g. feber - 40 compounds from corpora, e.g. scharlakansfeber - but not all are ILLNESS: ‘resfeber’, ‘diamantfeber’

Heuristic Compound Segmentation

Previous attempts to segment Swedish compounds without the help of a “real” lexicon are described in Brodda (1979)

Based on the distributional properties of graphemes, trying to identify grapheme combinations indicating possible boundaries (promising for Germanic languages)

Mostly automatic with some manual work

Cluster   Segmentation
sd        is||dans (ice-dance)
sg        bidrags||givare (contributor)
tk        bröst||kirurgi (breast surgery)
tp        vit||peppar (white pepper)

dsb       lands||bygd (countryside)
psr       bröllops||resa (honeymoon trip)
psd       kropps||delen (body part)
ftv       luft||värme (air warmth)
rnk       kärn||kraft (nuclear power)

ngss      honungs||sött (honey sweet)
tsfa      besluts||fattare (decision-maker)
gssp      vardags||språket (colloquial language)
spla      femårs||plan (five year plan)
spap      bakplåts||papper (baking-plate paper)
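A lexicon-free segmenter of this kind can be sketched as a lookup of boundary clusters. The code below is a hypothetical illustration, not the authors' implementation: the table `BOUNDARY_CLUSTERS` and the split offsets (how many characters of each cluster belong to the modifier) are read off the examples on the slide, and longer clusters are tried first.

```python
# Hypothetical table: cluster -> number of its characters that stay with the modifier.
BOUNDARY_CLUSTERS = {
    "ngss": 3, "tsfa": 2, "gssp": 2, "spla": 1, "spap": 1,
    "dsb": 2, "psr": 2, "psd": 2, "ftv": 2, "rnk": 2,
    "sd": 1, "sg": 1, "tk": 1, "tp": 1,
}

def segment(word, clusters=BOUNDARY_CLUSTERS, min_part=2):
    """Return (modifier, head) at the first matching cluster, or None if none matches."""
    for cluster in sorted(clusters, key=len, reverse=True):  # prefer longer clusters
        start = word.find(cluster, 1)
        while start != -1:
            split = start + clusters[cluster]
            # Both parts must be at least min_part characters long.
            if split >= min_part and len(word) - split >= min_part:
                return word[:split], word[split:]
            start = word.find(cluster, start + 1)
    return None

for w in ["isdans", "landsbygd", "honungssött", "femårsplan", "bok"]:
    print(w, segment(w))
```

A table this small over-segments ordinary words that happen to contain a cluster, which is presumably one reason the slide mentions manual work and why the later steps check candidates against GLDB.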

Compound Processing cont’d

• Estimation: >20-25 compounds per S-SIMPLE entry (for NOUNS)
• Based on: 1,000 nouns in SIMPLE; increased the vocabulary to >22,000
• The top-5 non-compound entries from corpora, richest in compound variants (some very ambiguous!)
  • program ‘programme, program’ (469 diff. comp.)
  • arbete ‘work, employment’ (402 diff. comp.)
  • chef ‘chief’ (390 diff. comp.)
  • bok ‘book’ (357 diff. comp.)
  • verksamhet ‘activity, operation’ (299 diff. comp.)

Modifier’s Characteristics

Segmented compound#SIMPLE class                          Cluster(s)
bad||toffla#garment                                      dt
barn||vårds||lärare#occupation_agent                     rnv, dsl
bas||bolag#agency                                        sb
bläck||fisk#fish                                         kf
bolags||plundrare#occupation_agent                       gspl
brud||bergs||skola#abstract#agency#functional_space      gss, db
bygg||bolag#agency                                       gb
bygg||företag#agency                                     gf
centralbanks||chef#occupation_agent                      ksch
doping||brott#change                                     ngb

Syntactic Parsing (1)

Compounds are a valuable resource; but how can we cope with the rest of the vocabulary?

Corpus-driven approach to acquire semantic lexicons, cf. Kokkinakis, 2001

Investigate how, and to what extent, the flexibility and robustness of a partial parser can be utilized to extend existing semantic lexicons fully automatically - cascaded finite-state syntactic parser;

– Observation: members of a semantic group are often surrounded by other members of the same group in text; in other words, words entering into the same syntagmatic relation with other words are perceived as semantically similar

Syntactic Parsing (2)

Corpus: 40 mil. tokens (Swedish Language Bank) tagged with Brill's tagger

Parsing using CASS-SWE, in which levels or bundles of rules of very special characteristics & content can be rapidly created & tested, e.g. specific types of NPs (takes pos-tagged texts as input)

Example - simplified:
– Rule => ‘DETERMINER? COM-NOUN (COM-NOUN F)* COM-NOUN CONJ COM-NOUN’ (färger, penslar, papper och matsäckar)
– Rule => ‘APPOSITION-NOUN? PROP-NOUN+ (F PROP-NOUN)+ CONJ PROP-NOUN+’ (Venezuela, Trinidad och Island)

The number of unique retrieved phrases was ca 36,000 (phrases without proper names) and ca 72,000 (phrases with proper names)
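To make the idea of an enumerative-NP rule concrete, the sketch below scans a POS-tagged token list for the sequence noun, (delimiter noun)+, conjunction, noun. It is not CASS-SWE and not the rule printed above: the tag names are borrowed from the slide, F is assumed to denote a delimiter such as a comma, and the encoding of tags as single letters is just a convenience for using a regular expression.

```python
import re

# Assumed tag inventory: COM-NOUN, F (delimiter), CONJ; everything else maps to "x".
TAG_CODES = {"COM-NOUN": "N", "F": "F", "CONJ": "C"}
ENUMERATION = re.compile(r"N(?:FN)+CN")

def find_enumerations(tagged):
    """Return the nouns of every 'N, N, ... CONJ N' run in a list of (token, tag) pairs."""
    tag_string = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    results = []
    for match in ENUMERATION.finditer(tag_string):
        span = tagged[match.start():match.end()]
        results.append([token for token, tag in span if tag == "COM-NOUN"])
    return results

sentence = [
    ("packa", "VERB"),
    ("färger", "COM-NOUN"), (",", "F"), ("penslar", "COM-NOUN"), (",", "F"),
    ("papper", "COM-NOUN"), ("och", "CONJ"), ("matsäckar", "COM-NOUN"),
]
print(find_enumerations(sentence))
```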

Syntactic Parsing (3)

1. Gather, pos-annotate & parse large corpora
2. Filter out long NPs & filter out knowledge-poor elements
3. 1st Pass: Measure the overlap between the members of the phrases extracted and the entries in the semantic lexicon;
   3a. If conditions apply, add new categorised entries in the database;
   3b. Repeat the previous 2 steps, until very few or nothing is matched;
4. 2nd Pass: Compound segment members of the phrases left;
   4a. Check whether they are lexicalised, do not use them if they are;
   4b. Repeat the process from step (3) by matching this time the heads with the content of the database

Syntactic Parsing (4)

Large quantities of partially parsed corpora are an important ingredient for the enrichment and further development of the semantic resources – cf. all previous attempts: use syntax for generating semantic knowledge

From the forest of chunks produced, filter out long NPs (≥3 common nouns), lemmatise, normalise, filter out knowledge-poor elements (determiners, punctuation) & measure the overlap between the nouns in the NPs and the entries in S-SIMPLE

If at least 2 of the nouns in the NPs are entries in SIMPLE, with the same semantic class, then there is a strong indication that the rest of the nouns are co-hyponyms, thus semantically similar to the two already encoded in S-SIMPLE – iterate

Apply compound segmentation on the members of the phrases left – check for lexicalization in a def. dictionary (GLDB), don’t use them if they are lexicalized – repeat previous step & iterate BUT match the heads!

First Pass Overlap

Matching a db with the content of the resources against the content of the phrases

Assume: if at least 2 of the members of a phrase are also entries in the lexicon, with the same semantic class, and the rest of the phrase members have not received a semantic annotation, then there is a strong indication that the rest of the members are co-hyponyms, and thus semantically similar to the two already encoded in the lexicon. Accordingly, we annotate them with the same semantic class

e.g. lawyers, doctors, opticians, psychologists and physiotherapists
jurister:OCC-AG, läkare:OCC-AG, optiker:OCC-AG, psykologer:? och sjukgymnaster:? ===> 3 Matches
==> condition: if ≥2 have the same tag & the rest have none ==> add to lexicon!
psykolog:OCC-AG sjukgymnast:OCC-AG
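The first-pass condition, including the iteration until nothing new is matched, can be sketched as below. This is an assumption-laden illustration rather than the authors' code: phrases are taken to be lists of lemmas, the lexicon a plain dictionary, and the third phrase (with `logoped`) is invented to show that entries added in one round can license further additions in the next.

```python
def first_pass(phrases, lexicon, min_matches=2):
    """Propagate a semantic class to unknown phrase members; return the new entries."""
    lexicon = dict(lexicon)  # work on a copy
    added = {}
    changed = True
    while changed:  # iterate until a full round adds nothing
        changed = False
        for phrase in phrases:
            known = [lexicon[w] for w in phrase if w in lexicon]
            unknown = [w for w in phrase if w not in lexicon]
            if not unknown or len(known) < min_matches:
                continue
            if len(set(known)) != 1:  # conflicting classes: treat the phrase as noise
                continue
            for w in unknown:
                lexicon[w] = added[w] = known[0]
                changed = True
    return added

seed = {"jurist": "OCC-AG", "läkare": "OCC-AG", "optiker": "OCC-AG",
        "kvinna": "BIO", "barn": "BIO", "möbel": "FURNITURE"}
phrases = [
    ["psykolog", "sjukgymnast", "logoped"],          # invented; matched only in round 2
    ["jurist", "läkare", "optiker", "psykolog", "sjukgymnast"],
    ["kvinna", "barn", "husdjur", "möbel"],          # mixed classes: nothing is added
]
print(first_pass(phrases, seed))
```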

Second Pass Overlap

A large number of phrases not used; none or only one of the members of the phrases was covered by SIMPLE, either the original or the enriched version

Take into account the compounding characteristic of Swedish (> 70% or 80,000 in SAOL are compounds); heuristic decomposition of compounds & matching the SIMPLE content with the heads of the segmented compounds

Assume: a considerable number of casual or on-the-fly created compounds can inherit relevant parts of the semantic info. provided on their heads by SIMPLE

e.g.: färjor:?, kryssningsfartyg:?, tankers:? och ro-ro-fartyg:? ===> No matches
(ferries, cruise-ships, tankers and ro-ro-vessels)
färja:? kryssnings||fartyg:VEH tankers:? ro-ro-||fartyg:VEH
===> färja:VEH kryssningsfartyg:VEH tankers:VEH ro-ro-fartyg:VEH
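Head inheritance with the lexicalisation check can be sketched in a few lines. The function name, the input shapes and the toy `lexicalised` set are assumptions for illustration; in the study the check is made against GLDB.

```python
def second_pass(segmented, lexicon, lexicalised):
    """Let unlisted, non-lexicalised compounds inherit the class of their head."""
    new_entries = {}
    for compound, (modifier, head) in segmented.items():
        if compound in lexicon or compound in lexicalised:
            continue  # already covered, or possibly idiomatic/metonymic
        if head in lexicon:
            new_entries[compound] = lexicon[head]
    return new_entries

lexicon = {"fartyg": "VEH", "feber": "ILLNESS"}
segmented = {
    "kryssningsfartyg": ("kryssnings", "fartyg"),
    "ro-ro-fartyg": ("ro-ro-", "fartyg"),
    "resfeber": ("res", "feber"),
}
lexicalised = {"resfeber"}  # toy stand-in for a GLDB lookup
print(second_pass(segmented, lexicon, lexicalised))
```

Once the two compounds carry VEH, the phrase from the slide has two members of the same class, so rerunning the first-pass rule would also tag färja and tankers.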

Syntactic Parsing (5)

• Errors/noise can be eliminated if the semantic tags of all the words in a phrase are compared
  kvinnor:BIO, barn:BIO, husdjur:??? och möbler:FURNITURE
• Ambiguities are propagated
  flaskor:CONTAINER-AMOUNT, tallrikar:CONTAINER-AMOUNT, vinglas:???

Result: Approx. 3,300 new noun entries to the Swe-S could be identified without any further processing (i.e. bootstrapping the compound analysis) – and only during the ‘first pass’

Loooong NPs (1)

har jag ätit ko, gris, lamm, häst, hare, kanin, ren, älg, känguru, orre, tjäder, duva, kyckling, anka, gås, struts, krokodil, haj, lax, torsk, abborre, gädda, bläckfisk och en massa firrar till …

ekonom sociolog litteraturvetare stadsplanerare mediaexpert filosof reklamfolk företrädare formgivare ingenjör författare diktare filmare popmusiker leksaksfabrikant klädskapare arkitekt journalist vetenskapsman... (press98)

inflationsutveckling framtidstro orderingång arbetsmarknadspolitik företagsbeskattning ränteläge handelshinder investeringstakt råvarupris produktionsutveckling…

slangnipplar slangpumpar flödesmätare gummihandskar röntgenapparater proteser testcyklar diskmaskiner journalsystem bensågar kuvöser blodmixrar urintestremsor centrifuger... (press95)

bokstav måttband klocka miniräknare plastbestick barnbild nyckel batterier filmrulle (SUC)

Loooong NPs (2)

Belgien Danmark Frankrike Grekland Island Italien Kanada Luxemburg Nederländerna Norge Portugal Spanien Storbritannien Turkiet Tyskland USA… (p97)

all världens ortnamn : Lahti , Kalundborg , Oslo , Motala , Luleå , Moskva , Tromsö , Vasa , Åbo , Rom , Hilversum , Vigra , Bryssel , London , Prag , Athlone , Köpenhamn , Stuttgart , München , Riga , Stavanger , Paris , Warszawa , Bodö och Wien… (romii)

Birte Heribertson Bodil Mårtensson Anette Norberg Bror Tommy Borgström Karin Bergqvist Mats Ågren Mattias Renehed Tobias Ekstrand… (p96)

Robert Hedman , Kjell Jönsson , Ingemar Eriksson , Jonas Runesson , Miguel Exposito , Micke Berg , Lars Oscarsson , Fredrik Aliris , Jimmy Anjevall , Putte Johansson , Petter Jokobsson , Daniel Edfalk , Mattias Larsson , Daniel , Westerlund , Daniel Johansson , Peter ...

Evaluation (1)

Quantity Evaluation of the Syntactic Parsing approach (see Kokkinakis, 01)

Results after six iterations:

          Original   Pass-1   Pass-2   Total
SIMPLE    2,921      5,110    1,100    9,131
NAMES     10,550     25,700   ---      36,250

Evaluation (2)

Quality Evaluation: Manually, for a number of groups, based on common sense and judgement

Class            Original   New   Wrong/Spurious   Precision
OrganisationNE   1300       395   22               94,4%
Phenomenon       36         29    9                69%
Bio              46         107   12               88,8%
Ideo             17         74    9                87,8%
Vehicle          33         118   17               85,6%
Apparatus        22         27    2                92,6%
Garment          25         184   19               89,7%
Illness          38         66    8                87,9%
Flower           19         26    3                88,5%
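The precision column appears to be computed as the share of new entries that were not judged wrong or spurious, i.e. (New − Wrong) / New. That reading is an inference from the figures, not something the slide states; the sketch below checks it against three of the rows.

```python
def precision(new, wrong):
    """Percentage of newly acquired entries not judged wrong or spurious."""
    return 100.0 * (new - wrong) / new

# (New, Wrong/Spurious) pairs copied from the table.
rows = {"OrganisationNE": (395, 22), "Bio": (107, 12), "Apparatus": (27, 2)}
for name, (new, wrong) in rows.items():
    print(f"{name}: {precision(new, wrong):.1f}%")
```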

Examples of Acquired Entries (1)

BIO: any classification of human beings (groups or individuals) according to a biological characteristic like age, sex, etc.; e.g. adult, twin, brother, bastard, husband, miss…

ORIGINAL (46): bror, fru, hustru, son, tjej, gudbarn, ...

NEW (107): barn, barnbarnsbarnbarn, children!!, dotter, dotterdotter, fader, far, farbror, farfader, farfarsfar, farförälder, farmoder, faster, flickvän, fosterförälder, fästmö, huskarl, hustru, jungfru, kusin, …

SPURIOUS/WRONG (12): orientarmé, regnskog, sjukhuspersonal, skilsmässa, sopa, studieförbund, svågra, totalisatorspel, trapetsartist, tutsier, älder, äppelträd

PRECISION: 88,8%

Examples of Acquired Entries (2)

APPARATUS: tools or devices used together to provide a particular functionality for a particular task; e.g. dishwasher, camera, computer, recorder…

ORIGINAL (22): video, kamera, frys, kopiator, mixer, ...

NEW (27): bandspelare, cd-rom-läsare, cd-spelare, dator, dvd-spelare, faxapparat, filmkamera, frysbox, handdator, nätverksdator, radio, skrivare, symaskin, televisionsapparat, teve-apparat, tv-apparat, videoapparat, ...

SPURIOUS/WRONG (2): fonduegryta??, skafferi

PRECISION: 92,6%

Examples of Acquired Entries (3)

VEHICLE: artifacts (or their parts) made for the transport of goods, livestock or people; e.g. truck, sedan, bicycle, license plate!!!, submarine…

ORIGINAL (33): kajak, bil, jeep, båt, flotte, …

NEW (118): ambulans, brandbil, buss, charter, direktbuss, distributionsbil, elbil, flakmoped, flakmoppa, flodbåt, flyg, flygplan, fordon, fregatt, färja, helikopter, husvagn, hästfordon, hästkärra, korvett, krigsfartyg, lastvagn, …

SPURIOUS/WRONG (17): anläggningsmaskin, arbetsmaskin, artilleri, artilleripjäs, entreprenadmaskin, förband, förvaltningsmyndighet, gräsklippare, skida

PRECISION: 85,6%

Evaluation (3)

Quality Evaluation nr 2: Comparison with 2 Synonym Dictionaries, STRÖMBERGS & BONNIERS

Entry                        SIMPLE Label   STR+BON (x+x=unique)   Missing in SIMPLE
bil - car                    VEHICLE        7+8=11                 3 – vagn, kärra, åk
regn - rain                  PHENOM.        17+14=21               15 – väta, ström, flod, dusch, kaskad, våtväder etc.
rederi - shipping company    AGENCY         3+4=6                  5 – skeppsägare, linje, båtbolag, fartygsbolag, sjöfartsbolag

(Missing in STR+BON: ösregn, spöregn, hällregn!)

Error Analysis

Source of Errors:

• Part-of-speech and lemmatisation errors
  tröja:GARMENT halsduk strumpa:GARMENT underkläder skiva album => GARMENT ... assigned to the rest...

• A number of long, enumerative NPs with many entries unknown to the lexicon, where 2 or 3 (happened to) correctly get the same semantic label but some the wrong one

• … and of course polysemy
  depression:EMOTION ångest:EMOTION spänning:? => EMOTION
  ... but
  tryck:ATTRIBUTE spänning:EMOTION? vibration tyngdkraft:ATTRIBUTE

Lexico-syntactic Patterns

Compounding and enumerative NPs are a good starting point for acquiring synonyms & co-hyponyms

Pattern-based lexico-syntactic recognition is suitable for acquiring hyperonyms-hyponyms (and partly meronyms)

Language-specific patterns

Discovery by observation

A good parser is necessary – good coverage of NPs

Requires more research on the effects of the various modifiers that can alter the semantic relation

Lexico-syntactic Patterns (1)

NP av (typ/en|märke/t|model/len|…) ("|'|:)? (NP|(NP,)+) (och NP|eller NP)?

… en bil av märket Ford Granada …

… okänd soldat som bar gymnastikskor av märket Nike …

… sys bland annat kalsonger och undertröjor av märket Börje Salming …

… tusen personbilar av modellen S70/V70 i Masas fabrik .

… planen är av typen F117A ( stealth ) …

… fartygen har jaktplan av typen F14 som anpassats att bära laserstyrda …

hyperonym-hyponym
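A rough approximation of this pattern can be written as a plain regular expression over raw text. This is a sketch under strong assumptions: without an NP chunker, the hyperonym is taken to be the single word before "av", and the hyponym is taken to be the run of capitalised or alphanumeric tokens after "märket", "typen" or "modellen". A sentence such as "planen är av typen F117A" would therefore yield the wrong hyperonym ("är"), which illustrates why good NP coverage from a parser is needed.

```python
import re

# Hypothetical surface approximation of 'NP av (typen|märket|modellen) NP'.
AV_PATTERN = re.compile(
    r"(?P<hyper>\w+) av (?:typen|märket|modellen) "
    r"(?P<hypo>[A-ZÅÄÖ0-9][\w/]*(?: [A-ZÅÄÖ0-9][\w/]*)*)"
)

def extract_av(sentence):
    """Return (hyperonym, hyponym) candidates found in a sentence."""
    return [(m.group("hyper"), m.group("hypo")) for m in AV_PATTERN.finditer(sentence)]

print(extract_av("en bil av märket Ford Granada"))
print(extract_av("fartygen har jaktplan av typen F14 som anpassats"))
print(extract_av("tusen personbilar av modellen S70/V70 i Masas fabrik ."))
```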

Page 39: Extraction of Ontological Information from Corpora  (and Lexicon)

39 Oslo, 14-16 Sep 2003

Lexico-syntactic Patterns (2)

NP ,? (såsom|liksom|som) (NP|(NP,)+|:NP|:(NP,)+) (eller|och) (andra|annat|annan) NP

NP ,? (eller|och) (andra|annat|annan) NP

NP ,? (såsom|liksom|som) (andra|annat|annan) NP

… explorer plockar poäng på automatlåda , farthållare , luftkonditionering , radio och annan utrustning

… fastighetsägaren ville ha en total renovering med ny spis , kyl , frys , spiskåpa och annan köksinredning

NP : NP (NP ,)+ (m fl|med flera|mm|osv)?

… årets dansband : Arvingarna , Barbados , Joyride , Sound Express .

… riksdagsmännens alla bidrag : barnbidrag , bostadsbidrag , socialbidrag , studiebidrag osv .

… kroniskt sjuka : epileptiker , hjärtsjuka , njursjuka m fl

… bästa webbplatserna : Spray , Gula Sidorna , Dagens_Nyheter , Passagen , Arbetsförmedlingen , Resfeber , Pricerunner , Bidlet , SEB och Bluemarx .

Hyperonym?-hyponym?

hyperonym-hyponym
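The "NP , NP … och annan NP" pattern can be sketched in the same spirit. This is a simplification under assumptions: each NP is a single token, items are separated by " , " as in the tokenized examples, and the function name `hyponyms_of` is hypothetical.

```python
import re

# Assumption: single-word NPs; a real system would match parser-produced NP chunks.
ENUM_OTHER = re.compile(
    r"(?P<items>\w+(?: , \w+)*) (?:och|eller) (?:andra|annat|annan) (?P<hyper>\w+)"
)

def hyponyms_of(sentence):
    """Return (hyperonym, [hyponyms]) for the first enumeration found, else None."""
    m = ENUM_OTHER.search(sentence)
    if m is None:
        return None
    return m.group("hyper"), m.group("items").split(" , ")

result = hyponyms_of(
    "… explorer plockar poäng på automatlåda , farthållare , "
    "luftkonditionering , radio och annan utrustning"
)
```

Here the word after "annan" is taken as the hyperonym candidate and the enumerated items as its hyponym candidates.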

Lexico-syntactic Patterns (3)

NP ,?|(? inklusive (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? särskilt (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? speciellt (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? mestadels (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? däribland (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

… en rad företag , däribland Ica , Dagab och Ikea

… Natoländer , inklusive Frankrike , Tyskland , Spanien och Grekland

NP som (till exempel|t ex|t.ex.) NP (, NP)*

… stora båtar som till exempel segelfartyg

… storhelger som t ex nyårsdagen , juldagen har vi …

… finns det specialavdelningar att se på mässan? som t ex Classic boat show , surfexpo , sjösäkerhet och dykexpo .

hyperonym-hyponym

hyperonym-hyponym

Lexico-syntactic Patterns (4)

(sån/a/t|sådan/a/t)? NP ,? (som|såsom) (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)?

… välkända biorullar såsom Carrie , Eldfödd , Stalker , Den onda cirkeln , Shining och Matilda

… flera färger såsom lichtgult , svart , vitt , rött , blått , grönt ,

… en rad underspecialiteter , såsom kardiologi , gastro-enterologi , endokrinologi , hematologi , njurmedicin och reumatologi .

NP : NP (, NP)+ (och NP|eller NP)?

… leverantörerna av affärssystem : SAP , Intentia , IFS och IBS

… folksjukdomarna : alkoholism , ätstörningar , medicinmissbruk och panikångest

… krafter av olika slag : tyngdkraft , muskelkraft , friktionskraft , magnetisk kraft

hyperonym-hyponym

hyperonym-hyponym

Lexico-syntactic Patterns (5)

NP (, NP)+ är några av NP

…" Nilens dotter " , " Sorgens stad " och " Marionettmästaren " är några av de filmer …

… La-Seyne-sur-Mer , Orléans , Brest och Dijon är några av de städer…

… språk , internationell rätt , utrikes- och säkerhetspolitik , press- och informationsfrågor , administration samt muntlig och skriftlig framställning är några av de ämnen som studeras …

… El Salvador , Kazakstan och Jamaica är några av de länder som nu …

• NP? som? består?SENSE? av NP (, NP)+ (och NP)?

… instrumentalensemblen? som består av flöjt , klarinett , trombon, gitarr , violin ,…

…” De ensamma öarna?” som består av Koufonissi , Iraklia , Donousa och Schinousa

… av företagsamhet som består av produktutveckling , produktion , distribution och försäljning

hyperonym-hyponym

holonym-meronym
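For the holonym-meronym case, a comparable sketch of the "NP som består av NP , NP … och NP" pattern is given here. It assumes single-token NPs and ignores the sense restriction on "består" marked in the slide pattern; the name `extract_meronyms` is hypothetical.

```python
import re

# Assumption: single-word NPs; the sense of "består" is not checked.
CONSISTS_OF = re.compile(
    r"(?P<whole>\w+) som består av "
    r"(?P<parts>\w+(?: ?, \w+)*(?: och \w+)?)"
)

def extract_meronyms(sentence):
    """Return (holonym, [meronyms]) for the first match, else None."""
    m = CONSISTS_OF.search(sentence)
    if m is None:
        return None
    parts = re.split(r" ?, | och ", m.group("parts"))
    return m.group("whole"), parts

meronyms = extract_meronyms(
    "… av företagsamhet som består av produktutveckling , produktion , "
    "distribution och försäljning"
)
```

Quoted or multi-word wholes such as "De ensamma öarna" would not be captured by this approximation and would again require proper NP chunking.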

Conclusion & Outlook

simple, surprisingly efficient methods to acquire/enhance general purpose semantic knowledge from large corpora

use of partially parsed corpora for extending semantic lexicons, a unified way to process compounds

both parsing & compounding are of equal importance: through parsing we allow the incorporation of new, mainly non-compound words; through compounding we allow new compounds of existing entries (Kokkinakis et al. '00)

needed: better means of evaluation and a decrease in the number of spuriously generated entries (many due to POS errors)

profiting from the productive compounding characteristic of Swedish

Conclusion & Outlook (cont'd)

We believe that S-SIMPLE can be extended to a large semantic resource appropriate for a large number of (intermediate) NLP tasks

Its compatibility with the manually developed S-SIMPLE lexicon can be guaranteed and its high quality maintained

Near future (Nov '03): we expect the evaluation from VR of whether our application will be funded; it has passed the first step, but that doesn't guarantee success

==> goal: larger corpus; more comprehensive study; combine compounding, parsing, patterns and statistics

References

Brodda B. (1979). Något om de svenska ordens fonotax och morfotax: Iakttagelse med utgångspunkt från experiment med automatisk morfologisk analys. In: "I huvet på Benny Brodda". Festskrift till densammes 65-årsdag.

Grefenstette G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.

Hearst M. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the 14th International Conference on Computational Linguistics. Nantes, France.

Järborg J., Kokkinakis D. & Toporowska-Gronostaj M. (2002). Lexical and Textual Resources for Sense Recognition and Description. Proceedings of the 3rd LREC, Las Palmas.

Kokkinakis D., Toporowska Gronostaj M. & Warmenius K. (2000). Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon. Proceedings of the 2nd Languages Resources and Evaluation Conference (LREC), vol. III:1397-1404. Athens, Hellas.

Kokkinakis D. (2001). Syntactic Parsing as a Step for Automatically Augmenting Semantic Lexicons. Proceedings of the 39th Association of Computational Linguistics (ACL) and 10th European Chapter of the Association of Computational Linguistics (EACL), 13-18. Miltsakaki E., Monz C. and Ribeiro A. (eds). (Companion Volume). CNRS, Toulouse, France.

Lin D. (1998). Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada.

Lin D. & Pantel P. (2002). Concept Discovery from Text. Proceedings of the International Conference on Computational Linguistics, pp. 577-583. Taipei, Taiwan.

Riloff E. & Shepherd J. (1997). A Corpus-Based Approach for Building Semantic Lexicons. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 117-124.

Roark B. & Charniak E. (1998). Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1110-1116.

Tokunaga et al. (1997). Extending a thesaurus by classifying words. Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications.