Automatic term extraction from domain corpora

Piek Vossen
Irion Technologies/Vrije Universiteit Amsterdam
Guest lecture (Gastcollege), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007

Overview

  Corpus versus Domain-based text collections
  Customer-case
  Term-extraction
  Demo

Corpus versus Domain-based text collections

Corpus to study linguistic phenomena:
  INL corpus: NRC-handelsblad
  Corpus geschreven Nederlands (Corpus of Written Dutch)
  British National Corpus
  Brown corpus -> SemCor
Domain corpora:
  portals
  Wikipedia
Customer corpora:
  web sites
  manuals

Customer-case

Connect suppliers and buyers and create traffic and advertisement

B2B: companies with specialized products and services
  terminology driven
  branch driven
C2B: consumers looking for products and services
  general language
  terminology: -> folksonomy
  bottom-up

Subscription for product names
Companies in database
1.5 million websites

User query, searching for products or services:
  "vlinderkleppen voor een hoge drukpomp" (butterfly valves for high pressure pumps)

Product name in ontology of 150,000 products:
  "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)

Product name on company website:
  "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and components such as valves)

Term-extraction

  morpho-syntactic analysis
  statistical analysis
  conceptual analysis
  contextual analysis

Term-extraction: morpho-syntactic analysis

Tokenization, tagging and NP-chunking:
  “een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)
Term candidates:
  Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
  Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
  Head of compounds: aanvaller (attacker).
Term is a concept:
  Normalized form (plural-singular variants, synonyms)
  Hypernym based on the syntactic head
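
The candidate generation on this slide can be sketched in a few lines. This is only an illustration, not the system's actual implementation: it assumes the NP chunks are already tagged, takes the last noun of a chunk as its head, and uses a hypothetical lookup table `COMPOUND_HEADS` in place of a real compound splitter. The tag names (`DET`, `ADJ`, `N`) are likewise assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); the tag set is an assumption

# Hypothetical compound lexicon standing in for a real compound splitter.
COMPOUND_HEADS = {"vleugelaanvaller": "aanvaller"}


def np_head(chunk: List[Token]) -> str:
    """Return the head of an NP chunk, taken here as its last noun."""
    nouns = [word for word, tag in chunk if tag == "N"]
    return nouns[-1]  # raises IndexError if the chunk contains no noun


def term_candidates(chunks: List[List[Token]], preps: List[str]) -> List[str]:
    """Collect candidates from NP chunks; preps[i] links chunk i and chunk i+1."""
    heads = [np_head(chunk) for chunk in chunks]
    out: List[str] = []
    for chunk, head in zip(chunks, heads):
        out.append(head)  # syntactic head of the NP
        content = [word for word, tag in chunk if tag in ("ADJ", "N")]
        if len(content) > 1:
            out.append(" ".join(content))  # modifier + head, determiner dropped
        if head in COMPOUND_HEADS:
            out.append(COMPOUND_HEADS[head])  # head of the compound
    for i, prep in enumerate(preps):
        out.append(f"{heads[i]} {prep} {heads[i + 1]}")  # head + prep + head
    return out


chunks = [
    [("een", "DET"), ("gele", "ADJ"), ("kaart", "N")],
    [("de", "DET"), ("vleugelaanvaller", "N")],
]
print(term_candidates(chunks, ["voor"]))
# ['kaart', 'gele kaart', 'vleugelaanvaller', 'aanvaller', 'kaart voor vleugelaanvaller']
```

Normalization (plural to singular, synonyms) and hypernym assignment are left out of this sketch.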

Term-extraction: statistical analysis

Reference corpus based on 500 websites of a diverse range of companies
Salience = normFreq * normRef
  normFreq = normalized frequency of terms on the website
    normFreq = nTermFrequency nWords / nPages
  normRef = normalized number of websites on which the term occurs in the reference corpus
    multiwords: normRef = 1-((nWebsites nWords) / (referenceCorpusSize))
    singlewords: normRef = 1-((nWebsites) / (referenceCorpusSize))
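
A minimal sketch of the salience product, restricted to the single-word case. The function names are illustrative, and `norm_freq` is passed in as a precomputed value rather than derived, because the slide text does not make its exact formula unambiguous. The default reference corpus size of 500 follows the slide.

```python
def norm_ref_singleword(n_websites: int, reference_corpus_size: int = 500) -> float:
    """normRef for single words: 1 - (nWebsites / referenceCorpusSize)."""
    if reference_corpus_size <= 0:
        raise ValueError("reference_corpus_size must be positive")
    return 1.0 - n_websites / reference_corpus_size


def salience(norm_freq: float, n_websites: int, reference_corpus_size: int = 500) -> float:
    """Salience = normFreq * normRef, with normFreq supplied by the caller."""
    return norm_freq * norm_ref_singleword(n_websites, reference_corpus_size)


# Same website frequency, different spread over the reference corpus:
common = salience(0.2, n_websites=450)  # term found on 450 of 500 reference sites
rare = salience(0.2, n_websites=5)      # term found on 5 of 500 reference sites
print(common < rare)  # True
```

A term that occurs on nearly every reference website gets a normRef close to 0, so it scores low however frequent it is on the customer site.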

Preferred term          nTokens  nPages  Salience
Klicken (click)               9       9    0.0010
Weitere (further)             5       2    0.0011
Wahl (choose)                 1       1    0.0011
Verkauf (sell)                1       1    0.0011
Service (service)            37       3    0.0011
Radio (radio)                 1       1    0.0011
Promotionen (promote)         1       1    0.0011
Optionen (options)            6       2    0.0011
Netzwerk (network)            4       2    0.0011
Medias (media)                1       1    0.0011
Kauf (buy)                    1       1    0.0011
Html                          1       1    0.0011
Gewerbe (commercial)          1       1    0.0011
Fax                          16      16    0.0011
Büro (office)                 1       1    0.0011

Term-extraction: conceptual analysis

Structural properties of the term hierarchy
Poor hierarchies:
  many tops
  few levels
  diverse branches
Each branch is a concept:
  number of descendants and levels
  cumulated frequency of descendants
Branch profiling:
  Domain classification of the hierarchy
  Domain classification of each branch
  Minimal overlap in domain
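
The structural measures named above (tops, descendants, levels, cumulated frequency) can be computed with a short recursive walk. This is a sketch under stated assumptions: the hierarchy is a plain parent-to-children dictionary, the cumulated frequency includes the branch root's own frequency, and the toy hierarchy and counts are invented for illustration.

```python
from typing import Dict, List, Tuple


def find_tops(children: Dict[str, List[str]]) -> List[str]:
    """Return the nodes that are never listed as a child (the hierarchy tops)."""
    all_children = {c for kids in children.values() for c in kids}
    return sorted(node for node in children if node not in all_children)


def branch_stats(children: Dict[str, List[str]], freq: Dict[str, int],
                 root: str) -> Tuple[int, int, int]:
    """Return (descendants, levels below root, cumulated frequency) for a branch."""
    n_desc, levels, cum = 0, 0, freq.get(root, 0)
    for child in children.get(root, []):
        d, l, f = branch_stats(children, freq, child)
        n_desc += 1 + d            # the child itself plus its descendants
        levels = max(levels, 1 + l)
        cum += f
    return n_desc, levels, cum


# Invented toy hierarchy and frequencies
children = {
    "kaffee": ["arabica-kaffee", "koffeinfreier kaffee"],
    "arabica-kaffee": ["arabica-kaffee gemahlener"],
}
freq = {"kaffee": 6, "arabica-kaffee": 2,
        "koffeinfreier kaffee": 1, "arabica-kaffee gemahlener": 1}

print(find_tops(children))                       # ['kaffee']
print(branch_stats(children, freq, "kaffee"))    # (3, 2, 10)
```

A hierarchy with many tops and branches that all report few levels would count as poor in the sense of this slide.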

Wordnet: Domain information

[Diagram: vocabularies of languages, concepts, relations and domains]
Vocabulary: bank, violin, violist, string (sense numbers shown: 1, 2)
Relations: type-of, part-of
Concepts:
  rec: 12345 - financial institute
  rec: 54321 - river side
  rec: 9876 - small string instrument
  rec: 65438 - musician playing a violin
  rec: 42654 - musician
  rec: 25876 - string instrument
  rec: 35576 - string of an instrument
  rec: 29551 - underwear
Domains: Music, Culture, Finance, Clothing, Sport, Ball sports, Winter sports

Term-extraction: contextual analysis

Anything can be a product or service: there are no intrinsic properties to define products
Contextual features:
  context patterns for products
  product pages
  special marking in HTML

Term-extraction: contextual analysis

Context patterns for products:
  144 patterns in English and 288 patterns in German
  [we supply]
  [we deliver]
  [we provide]
  [our products are]
  [we are one of the leading, producers on the market for]
  [we are, leading, producers on the market for]
  [is one of the leading, producers on the market for]
  [is, leading, producer on the market for]
  [we develop, products for]
  [we design, products for]
  [we produce, products for]
  [Our most common products]
Each term is scored for a product context in terms of the strength of the pattern and the distance
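
One way to combine pattern strength and distance is sketched below. The slide gives neither the strength values nor the combination formula, so everything here is an assumption: the strengths are invented, distance is taken as the number of tokens between the end of the pattern and the start of the term, and the score is `strength / (1 + distance)`.

```python
from typing import Dict

# Invented strengths; the slides do not list actual values.
PATTERNS: Dict[str, float] = {"we supply": 1.0, "our products are": 0.8}


def context_score(text: str, term: str, patterns: Dict[str, float] = PATTERNS) -> float:
    """Best score for `term` over all patterns that occur before it in `text`."""
    tokens = text.lower().split()  # naive whitespace tokenization, no punctuation handling
    term_toks = term.lower().split()
    term_starts = [i for i in range(len(tokens) - len(term_toks) + 1)
                   if tokens[i:i + len(term_toks)] == term_toks]
    best = 0.0
    for pattern, strength in patterns.items():
        p = pattern.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] != p:
                continue
            end = i + len(p)
            for start in term_starts:
                if start >= end:
                    distance = start - end  # tokens between pattern and term
                    best = max(best, strength / (1 + distance))
    return best


sentence = "We supply butterfly valves and pumps"
print(context_score(sentence, "butterfly valves"))  # 1.0
print(context_score(sentence, "pumps"))             # 0.25
print(context_score(sentence, "gaskets"))           # 0.0
```

Terms directly after a strong pattern score highest; terms further away, or with no pattern before them, score lower or zero.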

Term-extraction: contextual analysis

Product pages:
  landing page: index.html
  html files with product names: product, service, solution
  html files referred to by these pages
  html files referred to by menus with such names
Special marking in HTML:
  meta keywords
  headings and titles
  menus
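
The HTML cues listed above can be collected with Python's standard `html.parser` module. The sketch below is illustrative only: the class name, the `looks_like_product_page` helper and the sample page are hypothetical, menus are not handled, and only `h1` to `h3` are treated as headings.

```python
from html.parser import HTMLParser
from typing import Dict, List

HEADINGS = ("h1", "h2", "h3")
PRODUCT_NAMES = ("product", "service", "solution")


def looks_like_product_page(path: str) -> bool:
    """True if the file name contains one of the product-page names."""
    return any(name in path.lower() for name in PRODUCT_NAMES)


class MarkedTextParser(HTMLParser):
    """Collect meta keywords, titles and headings from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, List[str]] = {"meta": [], "title": [], "heading": []}
        self._current = None  # which bucket the current text belongs to

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.found["meta"] += [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "title":
            self._current = "title"
        elif tag in HEADINGS:
            self._current = "heading"

    def handle_endtag(self, tag):
        if tag == "title" or tag in HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.found[self._current].append(data.strip())


# Hypothetical sample page
page = ('<html><head><title>Example Shop</title>'
        '<meta name="keywords" content="kaffee, espresso"></head>'
        '<body><h1>Produkte</h1><p>Willkommen</p></body></html>')
parser = MarkedTextParser()
parser.feed(page)
print(parser.found)
print(looks_like_product_page("/de/products.html"))
```

Terms found in these positions could then be tagged with their source (for example meta or product), as in the feature bundles shown later.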

Product terms with feature bundles

URL: http://www.illycafe.ch/ (nPages: 16, same for all rows; "-" marks an empty cell)

Preferred term              Head                    nTokens  nDocs  Salience  nSiblings  nTokensConcept  nDocsConcept  Term Source     Feature    FeatureScore  Connectivity  Relations                       ProfileMatch
Mischung                    -                       2        2      0.0111    28         56              43            #product        geniess    1.0           28            -                               0.5773
Vanillearomen               aromen                  1        1      0.0523    1          1               1             #product        pulver     1.0           2             #vanille                        0.5773
Köstlichsten Getränke       Getränke                1        1      0.0523    1          1               1             #product        produkt    0.28          2             #kostlich                       0.0
Gedanke                     -                       1        1      0.0427    28         56              43            #product        anhauen    1.0           28            -                               0.5773
2-Kilogrammvakuumpackung    Kilogrammvakuumpackung  2        1      0.1046    1          2               1             #product        produkt    0.28          2             #2                              -1.0
Einzelportionen             portionen               4        1      0.2093    1          4               1             #product        espresso   1.0           2             #einzel                         -1.0
Anhieb                      -                       1        1      0.0523    28         56              43            #product        gedanke    1.0           28            -                               0.5773
Arabica-Kaffee Gemahlener   Gemahlener              1        1      0.0523    1          1               1             #product        kaffee     1.0           10            #arabica#kaffee#arabica-kaffee  -1.0
Trinkschokoladen-Liebhaber  Liebhaber               1        1      0.0523    2          3               2             #product        punkt      1.0           4             #trinkschokolad                 -1.0
Koffeinfreier               freier                  1        1      0.0523    1          1               1             #product        kaffee     1.0           2             #koffein                        0.5773
Kaffee                      -                       6        5      0.0615    28         56              43            #product#index  standard   1.0           28            -                               -1.0
Gourmet                     -                       1        1      0.0102    28         56              43            #product        geheimtip  1.0           28            -                               -1.0
Kleinpackungs-Palette       Palette                 2        1      0.1046    1          2               1             #product        illymisch  1.0           2             #kleinpack                      0.5773

<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>
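
A record like the one above can be read with Python's standard `xml.etree.ElementTree`. The sketch uses an abbreviated copy of the record; the function name and the choice of fields to extract are illustrative, not part of the original system.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <class> record shown above
CLASS_XML = """<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <salience>0.0523</salience>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <termSource><![CDATA[#product]]></termSource>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>"""


def parse_term_class(xml_text: str) -> dict:
    """Turn one <class> element into a plain dictionary."""
    node = ET.fromstring(xml_text)
    return {
        "name": node.findtext("name"),
        "preferred_form": node.findtext("preferred_form"),
        "salience": float(node.findtext("salience")),
        "term_source": node.findtext("termSource"),
        "modifiers": [m.text for m in node.iter("modifier")],
        "features": [
            (f.findtext("featureName"), f.findtext("featureValue"),
             float(f.findtext("featureScore")))
            for f in node.iter("feature")
        ],
    }


term = parse_term_class(CLASS_XML)
print(term["preferred_form"], term["salience"], term["modifiers"])
```

The CDATA sections are unwrapped by the parser, so `name` and `preferred_form` come back as plain strings.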

Evaluation of French product extraction

Nr. of URLs                                         29
Nr. of evaluated URLs                               27
Total good terms                                    95
Total bad terms                                     53
Total new terms                                     54
Total terms                                        202
Average nTerms per URL                               6
Total precision (good/(good+bad)), %                64
Average precision (nPrecision/Evaluated URLs), %    68
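
The total precision can be checked from the counts above with the formula stated in the table, good/(good+bad). The function name is illustrative, and returning 0 when no terms were judged is an assumption.

```python
def precision(good: int, bad: int) -> float:
    """Precision in percent: good / (good + bad); 0.0 when nothing was judged."""
    total = good + bad
    return 100.0 * good / total if total else 0.0


good, bad, new = 95, 53, 54
print(round(precision(good, bad), 1))  # 64.2
print(good + bad + new)                # 202
```

New terms are not part of the precision denominator; they only count toward the total of 202 terms.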

Evaluation of French product extraction

Feature         Range       Good  Bad  New  Precision (%)
nTokens         low(2)        67   35   18             65
                med(5)        12    7   19             63
                high(>5)      16   11   17             59
getNrDocs       low(2)        71   38   31             65
                med(5)        13    5   11             72
                high(>5)      11   10   12             52
nSalience       low(0.05)      0    0    0              0
                med(0.1)       0    0    0              0
                high(0.5)      4    3    3             57
                top(>0.5)     91   50   51             64
nSiblings       low(2)        84   51   48             62
                med(5)        11    2    6             84
                high(>5)       0    0    0              0
nCumFreqParent  low(2)        68   43   21             61
                med(5)        19    5   11             79
                high(>5)       8    5   22             61

Evaluation of French product extraction

Feature         Range       Good  Bad  New  Precision (%)
Term Source     meta          64   29   11             68
                product       26   21   38             55
                service        5    3    5             62
                solution       0    0    0              0
                index          0    0    0              0
                other          0    0    0              0
ProfileMatch    -1            12    6   15             66
                0             67   40   30             62
                low(0.1)       0    0    0              0
                med(0.5)       2    7    9             22
                high(>0.5)     5    0    0            100
                top(>0.7)      9    0    0            100