Automatic term extraction from domain corpora

Piek Vossen

Irion Technologies / Vrije Universiteit Amsterdam

Guest lecture (Gastcollege), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007

Overview

- Corpus versus Domain-based text collections
- Customer-case
- Term-extraction
- Demo

Corpus versus Domain-based text collections

Corpus to study linguistic phenomena:
- INL corpus: NRC-handelsblad
- Corpus geschreven Nederlands
- British National Corpus
- Brown corpus -> SemCor

Domain corpora:
- portals
- Wikipedia

Customer corpora:
- web sites
- manuals

Customer-case

Connect suppliers and buyers and create traffic and advertisement

B2B: companies with specialized products and services
- terminology driven
- branch driven

C2B: consumers looking for products and services
- general language terminology: -> folksonomy
- bottom-up

Subscription for product names

Companies in database: 1.5 million websites

User query searching for products or services:
"vlinderkleppen voor een hoge drukpomp" (butterfly valves for high pressure pumps)

Product name in ontology of 150,000 products:
"kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)

Product name on company website:
"Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and pump components such as valves)

Term-extraction

- morpho-syntactic analysis
- statistical analysis
- conceptual analysis
- contextual analysis

Term-extraction: morpho-syntactic analysis

Tokenization, tagging and NP-chunking:
“een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)

Term candidates:
- Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
- Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
- Head of compounds: aanvaller (attacker-player).

Term is a concept:
- Normalized form (plural-singular variants, synonyms)
- Hypernym based on the syntactic head
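
The candidate types listed above can be illustrated with a minimal Python sketch. It assumes the text has already been tokenized and POS-tagged; the tag names (DET, ADJ, ADP, NOUN), the helper functions and the small noun lexicon are hypothetical and are not taken from the lecture. Compound heads are approximated by the longest proper suffix found in the lexicon, which is a simplification of real compound splitting.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); the tag names are assumptions


def compound_head(word: str, noun_lexicon: Set[str]) -> Optional[str]:
    """Return the longest proper suffix of `word` that is a known noun, if any."""
    for start in range(1, len(word)):
        if word[start:] in noun_lexicon:
            return word[start:]
    return None


def term_candidates(tokens: List[Token], noun_lexicon: Set[str]) -> List[str]:
    """Collect NP heads, head-containing word combinations and compound heads."""
    candidates: List[str] = []
    modifiers: List[str] = []          # adjectives seen since the last noun
    previous_head: Optional[str] = None
    preposition: Optional[str] = None
    for word, tag in tokens:
        if tag == "ADJ":
            modifiers.append(word)
        elif tag == "ADP":
            preposition = word
        elif tag == "NOUN":
            candidates.append(word)                               # syntactic head
            if modifiers:
                candidates.append(" ".join(modifiers + [word]))   # modifier + head
            if previous_head and preposition:
                candidates.append(f"{previous_head} {preposition} {word}")  # head + PP
            head = compound_head(word, noun_lexicon)
            if head:
                candidates.append(head)                           # head of compound
            previous_head, preposition, modifiers = word, None, []
        elif tag != "DET":
            # Any other tag ends the current noun-phrase context.
            previous_head, preposition, modifiers = None, None, []
    return candidates


# The example phrase from the slide, tagged by hand for illustration.
tagged = [("een", "DET"), ("gele", "ADJ"), ("kaart", "NOUN"),
          ("voor", "ADP"), ("de", "DET"), ("vleugelaanvaller", "NOUN")]
lexicon = {"kaart", "vleugel", "aanvaller"}  # toy lexicon, an assumption
candidates = term_candidates(tagged, lexicon)
```

A production system would use a real tagger and chunker; the point here is only how one noun phrase yields several candidate terms.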

Term-extraction: statistical analysis

Reference corpus based on 500 websites from a diverse range of companies

Salience = normFreq * normRef

normFreq = normalized frequency of terms on the website
normFreq = nTermFrequency nWords / nPages

normRef = normalized number of websites on which the term occurs in the reference corpus
multiwords: normRef = 1-((nWebsites nWords) / (referenceCorpusSize))
singlewords: normRef = 1-((nWebsites) / (referenceCorpusSize))
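
A small Python sketch of the salience computation for single-word terms follows. The slide text does not show the operator between nTermFrequency and nWords, so the normFreq used here (term frequency divided by the average number of words per page) is only one possible reading and should be treated as an assumption, not as the lecture's confirmed formula. Only the single-word normRef is implemented, since that is the variant the slide gives unambiguously. All function and parameter names are illustrative.

```python
def norm_freq(term_frequency: int, n_words: int, n_pages: int) -> float:
    """ASSUMED reading: term frequency relative to the average page length."""
    return term_frequency / (n_words / n_pages)


def norm_ref_single(n_websites: int, reference_corpus_size: int) -> float:
    """Single-word normRef from the slide: 1 - nWebsites / referenceCorpusSize."""
    return 1.0 - n_websites / reference_corpus_size


def salience(term_frequency: int, n_words: int, n_pages: int,
             n_websites: int, reference_corpus_size: int = 500) -> float:
    """Salience = normFreq * normRef, as stated on the slide."""
    return (norm_freq(term_frequency, n_words, n_pages)
            * norm_ref_single(n_websites, reference_corpus_size))


# Made-up numbers: a term seen 4 times on a 16-page site with 320 words,
# and present on 50 of the 500 reference websites.
example = salience(term_frequency=4, n_words=320, n_pages=16, n_websites=50)
```

Whatever the exact normalization, the second factor has the intended effect: terms that occur on many reference websites, such as generic navigation words, are pushed towards zero.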

Preferred term          nTokens  nPages  Salience
Klicken (click)               9       9    0.0010
Weitere (further)             5       2    0.0011
Wahl (choose)                 1       1    0.0011
Verkauf (sell)                1       1    0.0011
Service (service)            37       3    0.0011
Radio (radio)                 1       1    0.0011
Promotionen (promote)         1       1    0.0011
Optionen (options)            6       2    0.0011
Netzwerk (network)            4       2    0.0011
Medias (media)                1       1    0.0011
Kauf (buy)                    1       1    0.0011
Html                          1       1    0.0011
Gewerbe (commercial)          1       1    0.0011
Fax                          16      16    0.0011
Büro (office)                 1       1    0.0011

Term-extraction: conceptual analysis

Structural properties of the term hierarchy

Poor hierarchies:
- many tops
- few levels
- diverse branches

Each branch is a concept:
- number of descendants and levels
- cumulated frequency of descendants

Branch profiling:
- Domain classification of the hierarchy
- Domain classification of each branch
- Minimal overlap in domain
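
The structural properties of a branch (number of descendants, number of levels, cumulated frequency of descendants) can be computed with a simple traversal. The sketch below assumes the hierarchy is a tree stored as a parent-to-children mapping; the function name, the data layout and the toy hierarchy are all illustrative and not taken from the lecture.

```python
from typing import Dict, List


def branch_stats(root: str,
                 children: Dict[str, List[str]],
                 frequency: Dict[str, int]) -> Dict[str, int]:
    """Count descendants, levels and cumulated descendant frequency below `root`.

    Assumes the hierarchy is a tree (no cycles, one parent per term).
    """
    descendants = 0
    levels = 0
    cum_frequency = 0
    stack = [(child, 1) for child in children.get(root, [])]
    while stack:
        term, depth = stack.pop()
        descendants += 1
        cum_frequency += frequency.get(term, 0)
        levels = max(levels, depth)
        stack.extend((child, depth + 1) for child in children.get(term, []))
    return {"descendants": descendants, "levels": levels,
            "cum_frequency": cum_frequency}


# Toy hierarchy with invented frequencies, for illustration only.
children = {"pump": ["valve", "seal"], "valve": ["butterfly valve"]}
frequency = {"pump": 6, "valve": 3, "seal": 1, "butterfly valve": 4}
stats = branch_stats("pump", children, frequency)
```

Comparing these numbers across the top nodes gives a quick signal of a poor hierarchy: many roots whose branches have few descendants and a single level.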

Wordnet: Domain information

[Diagram linking vocabulary words via numbered senses (1, 2) to concept records, relations and domains]

Vocabularies of languages: bank, violin, violist, string

Concepts:
rec: 12345 - financial institute
rec: 54321 - river side
rec: 9876 - small string instrument
rec: 65438 - musician playing a violin
rec: 42654 - musician
rec: 25876 - string instrument
rec: 35576 - string of an instrument
rec: 29551 - underwear

Relations: type-of, part-of

Domains: Music, Culture, Finance, Clothing, Sport, Ball sports, Winter sports

Term-extraction: contextual analysis

Anything can be a product or service: there are no intrinsic properties to define products

Contextual features:
- context patterns for products
- product pages
- special marking in HTML

Term-extraction: contextual analysis

Context patterns for products: 144 patterns in English and 288 patterns in German
- [we supply]
- [we deliver]
- [we provide]
- [our products are]
- [we are one of the leading, producers on the market for]
- [we are, leading, producers on the market for]
- [is one of the leading, producers on the market for]
- [is, leading, producer on the market for]
- [we develop, products for]
- [we design, products for]
- [we produce, products for]
- [Our most common products]

Each term is scored for a product context in terms of the strength of the pattern and the distance
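
As an illustration of scoring a term by pattern strength and distance, here is a minimal Python sketch. The slides do not give the pattern weights or the scoring function, so both are assumptions: the weights in PATTERNS are invented, and the score is taken to be strength / (1 + distance), where distance is the number of tokens between the end of the pattern and the start of the term. Only contiguous patterns that precede the term are handled.

```python
import re
from typing import Dict

# Hypothetical strengths; the lecture lists patterns but not their weights.
PATTERNS: Dict[str, float] = {
    "we supply": 1.0,
    "we deliver": 1.0,
    "our products are": 0.8,
}


def product_context_score(text: str, term: str,
                          patterns: Dict[str, float] = PATTERNS) -> float:
    """Best strength / (1 + token distance) over patterns preceding the term."""
    tokens = re.findall(r"\w+", text.lower())
    term_tokens = term.lower().split()
    if not term_tokens:
        return 0.0
    # Token positions where the term starts.
    term_starts = [i for i in range(len(tokens) - len(term_tokens) + 1)
                   if tokens[i:i + len(term_tokens)] == term_tokens]
    best = 0.0
    for pattern, strength in patterns.items():
        pattern_tokens = pattern.split()
        for i in range(len(tokens) - len(pattern_tokens) + 1):
            if tokens[i:i + len(pattern_tokens)] != pattern_tokens:
                continue
            pattern_end = i + len(pattern_tokens)
            for start in term_starts:
                if start >= pattern_end:
                    best = max(best, strength / (1 + start - pattern_end))
    return best


sentence = "We supply butterfly valves and pumps."
score_adjacent = product_context_score(sentence, "butterfly valves")
score_distant = product_context_score(sentence, "pumps")
```

With this choice of function, a term directly after a strong pattern gets the full pattern strength, and the score decays as the term moves further away.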

Term-extraction: contextual analysis

Product pages:
- landing page: index.html
- html files with product names: product, service, solution
- html files referred to by these pages
- html files referred to by menus with such names

Special marking in HTML:
- meta keywords
- headings and titles
- menus
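
The special HTML marking can be collected with Python's standard html.parser module. The sketch below gathers meta keywords, titles and headings; the class name and the choice of heading tags are assumptions, and menu detection is left out because it depends on site-specific markup.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MarkedTermParser(HTMLParser):
    """Collect text from meta keywords, <title> and heading tags."""

    HEADING_TAGS = {"title", "h1", "h2", "h3"}  # assumed set of relevant tags

    def __init__(self) -> None:
        super().__init__()
        self.marked: List[Tuple[str, str]] = []  # (source, text) pairs
        self._open: Optional[str] = None         # heading/title tag being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            for keyword in (attributes.get("content") or "").split(","):
                if keyword.strip():
                    self.marked.append(("meta", keyword.strip()))
        elif tag in self.HEADING_TAGS:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.marked.append((self._open, data.strip()))


# Invented sample page, for illustration only.
SAMPLE_HTML = (
    "<html><head><title>Coffee shop</title>"
    '<meta name="keywords" content="Kaffee, Espresso"></head>'
    "<body><h1>Products</h1><p>Body text is ignored.</p></body></html>"
)
parser = MarkedTermParser()
parser.feed(SAMPLE_HTML)
```

Each collected pair keeps its source, so a later step can weight a term found in the meta keywords differently from one found in a heading.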

Product terms with feature bundles

URL: http://www.illycafe.ch/, nPages: 16 (identical for all rows)

Preferred term | Head | nTokens | nDocs | Salience | nSiblings | nTokensConcept | nDocsConcept | Term Source | Feature | FeatureScore | Connectivity | Relations | ProfileMatch
Mischung | - | 2 | 2 | 0.0111 | 28 | 56 | 43 | #product | geniess | 1.0 | 28 | - | 0.5773
Vanillearomen | aromen | 1 | 1 | 0.0523 | 1 | 1 | 1 | #product | pulver | 1.0 | 2 | #vanille | 0.5773
Köstlichsten Getränke | Getränke | 1 | 1 | 0.0523 | 1 | 1 | 1 | #product | produkt | 0.28 | 2 | #kostlich | 0.0
Gedanke | - | 1 | 1 | 0.0427 | 28 | 56 | 43 | #product | anhauen | 1.0 | 28 | - | 0.5773
2-Kilogrammvakuumpackung | Kilogrammvakuumpackung | 2 | 1 | 0.1046 | 1 | 2 | 1 | #product | produkt | 0.28 | 2 | #2 | -1.0
Einzelportionen | portionen | 4 | 1 | 0.2093 | 1 | 4 | 1 | #product | espresso | 1.0 | 2 | #einzel | -1.0
Anhieb | - | 1 | 1 | 0.0523 | 28 | 56 | 43 | #product | gedanke | 1.0 | 28 | - | 0.5773
Arabica-Kaffee Gemahlener | Gemahlener | 1 | 1 | 0.0523 | 1 | 1 | 1 | #product | kaffee | 1.0 | 10 | #arabica#kaffee#arabica-kaffee | -1.0
Trinkschokoladen-Liebhaber | Liebhaber | 1 | 1 | 0.0523 | 2 | 3 | 2 | #product | punkt | 1.0 | 4 | #trinkschokolad | -1.0
Koffeinfreier | freier | 1 | 1 | 0.0523 | 1 | 1 | 1 | #product | kaffee | 1.0 | 2 | #koffein | 0.5773
Kaffee | - | 6 | 5 | 0.0615 | 28 | 56 | 43 | #product#index | standard | 1.0 | 28 | - | -1.0
Gourmet | - | 1 | 1 | 0.0102 | 28 | 56 | 43 | #product | geheimtip | 1.0 | 28 | - | -1.0
Kleinpackungs-Palette | Palette | 2 | 1 | 0.1046 | 1 | 2 | 1 | #product | illymisch | 1.0 | 2 | #kleinpack | 0.5773

<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>
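
A record in this format can be read with Python's standard xml.etree.ElementTree module. The sketch below parses a shortened copy of the record above into a dictionary; the function name and the selection of fields are illustrative, and the element names are taken from the example record only, not from a published schema.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Shortened copy of the record shown above.
RECORD = """<class>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <termSource><![CDATA[#product]]></termSource>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>"""


def parse_term_record(xml_text: str) -> Dict[str, Any]:
    """Turn one <class> record into a plain dictionary (feature bundle)."""
    node = ET.fromstring(xml_text)
    return {
        "preferred_form": node.findtext("preferred_form"),
        "parent_form": node.findtext("parent_form"),
        "frequency": int(node.findtext("frequency")),
        "salience": float(node.findtext("salience")),
        "term_source": node.findtext("termSource"),
        "modifiers": [m.text for m in node.iter("modifier")],
        "features": [
            (f.findtext("featureName"), f.findtext("featureValue"),
             float(f.findtext("featureScore")))
            for f in node.iter("feature")
        ],
    }


bundle = parse_term_record(RECORD)
```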

Evaluation of French product extraction

Nr. of URLs                                       29
Nr. of evaluated URLs                             27
Total good terms                                  95
Total bad terms                                   53
Total new terms                                   54
Total terms                                       202
Average nTerms per URL                            6
Total precision (good/(good+bad)), %              64
Average precision (nPrecision/Evaluated URLs), %  68
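
The two precision figures can be reproduced with a few lines of Python. The total precision uses the counts from the table. The average precision is read here as the mean of the per-URL precision values, which is an interpretation of "nPrecision/Evaluated URLs"; the per-URL counts in the example are invented, since the slides do not list them.

```python
from typing import List, Tuple


def precision_pct(good: int, bad: int) -> float:
    """Precision as a percentage: good / (good + bad); 0.0 if nothing was judged."""
    judged = good + bad
    return 100.0 * good / judged if judged else 0.0


def average_precision_pct(per_url: List[Tuple[int, int]]) -> float:
    """Mean of per-URL precision values (assumed reading of the slide)."""
    scores = [precision_pct(good, bad) for good, bad in per_url]
    return sum(scores) / len(scores) if scores else 0.0


# Totals from the table: 95 good terms and 53 bad terms.
total_precision = precision_pct(95, 53)

# Invented per-URL (good, bad) counts, for illustration only.
toy_average = average_precision_pct([(3, 1), (1, 1)])
```

The two measures can differ, as they do in the table (64 versus 68), because the average gives each URL the same weight regardless of how many terms were extracted from it.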

Evaluation of French product extraction

Feature         Bin        Good  Bad  New  Precision (%)
nTokens         low(2)       67   35   18             65
                med(5)       12    7   19             63
                high(>5)     16   11   17             59
getNrDocs       low(2)       71   38   31             65
                med(5)       13    5   11             72
                high(>5)     11   10   12             52
nSalience       low(0.05)     0    0    0              0
                med(0.1)      0    0    0              0
                high(0.5)     4    3    3             57
                top(>0.5)    91   50   51             64
nSiblings       low(2)       84   51   48             62
                med(5)       11    2    6             84
                high(>5)      0    0    0              0
nCumFreqParent  low(2)       68   43   21             61
                med(5)       19    5   11             79
                high(>5)      8    5   22             61

Evaluation of French product extraction

Feature         Bin        Good  Bad  New  Precision (%)
Term Source     meta         64   29   11             68
                product      26   21   38             55
                service       5    3    5             62
                solution      0    0    0              0
                index         0    0    0              0
                other         0    0    0              0
ProfileMatch    -1           12    6   15             66
                0            67   40   30             62
                low(0.1)      0    0    0              0
                med(0.5)      2    7    9             22
                high(>0.5)    5    0    0            100
                top(>0.7)     9    0    0            100
