Automatic term extraction from domain corpora Piek Vossen Irion Technologies/Vrije Universiteit...

Automatic term extraction from domain corpora

Piek Vossen

Irion Technologies / Vrije Universiteit Amsterdam

Guest lecture, Corpus-based Methods, Universiteit Nijmegen, 26 November 2007


Overview

- Corpus versus Domain-based text collections
- Customer-case
- Term-extraction
- Demo

Corpus versus Domain-based text collections

Corpus to study linguistic phenomena:
- INL corpus: NRC-handelsblad
- Corpus geschreven Nederlands
- British National Corpus
- Brown corpus -> SemCor

Domain corpora:
- portals
- Wikipedia

Customer corpora:
- web sites
- manuals

Customer-case

Connect suppliers and buyers and create traffic and advertisement

B2B: companies with specialized products and services
- terminology driven
- branch driven

C2B: consumers looking for products and services
- general language
- terminology: -> folksonomy
- bottom-up

Subscription for product names

Companies in database: 1.5 million websites

User query, searching for products or services:
"vlinderkleppen voor een hoge drukpomp" (butterfly valves for a high pressure pump)

Product name in ontology of 150,000 products:
"kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)

Product name on company website:
"Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and pump components such as valves)

Term-extraction

- morpho-syntactic analysis
- statistical analysis
- conceptual analysis
- contextual analysis

Term-extraction: morpho-syntactic analysis

Tokenization, tagging and NP-chunking:
“een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)

Term candidates:
- Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
- Word combinations including the syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
- Head of compounds: aanvaller (attacker).

Term is a concept:
- Normalized form (plural-singular variants, synonyms)
- Hypernym based on the syntactic head
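The candidate types listed above can be illustrated with a minimal sketch. It assumes that tokenization, tagging and NP-chunking have already been done upstream; the NP class, the preposition list and the small lexicon used to find compound heads are hypothetical helpers for this illustration, not the system's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class NP:
    """A chunked noun phrase with determiners already removed (hypothetical structure)."""
    modifiers: list  # words before the head, e.g. adjectives
    head: str        # syntactic head of the NP


def compound_head(word, lexicon):
    """Return the longest proper suffix of `word` found in `lexicon`, or None."""
    w = word.lower()
    for i in range(1, len(w)):
        if w[i:] in lexicon:
            return w[i:]
    return None


def term_candidates(nps, preps, lexicon):
    """Collect candidates; preps[i] is the preposition linking nps[i] and nps[i + 1]."""
    out = []
    for np in nps:
        out.append(np.head)                               # syntactic head
        if np.modifiers:
            out.append(" ".join(np.modifiers + [np.head]))  # modifier + head
        head = compound_head(np.head, lexicon)
        if head:
            out.append(head)                              # head of a compound
    for i, prep in enumerate(preps):
        out.append(f"{nps[i].head} {prep} {nps[i + 1].head}")  # NP prep NP
    return out


# "een gele kaart voor de vleugelaanvaller"
candidates = term_candidates(
    [NP(["gele"], "kaart"), NP([], "vleugelaanvaller")],
    ["voor"],
    {"kaart", "aanvaller"},  # toy lexicon, an assumption for this example
)
print(candidates)
```

A suffix lookup is only one simple way to find the head of a Dutch compound; a real system would more likely use a morphological analyzer.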

Term-extraction: statistical analysis

Reference corpus based on 500 websites of a diverse range of companies

Salience = normFreq * normRef

normFreq = normalized frequency of terms on the website
normFreq = nTermFrequency nWords / nPages

normRef = normalized number of websites on which the term occurs in the reference corpus
multiwords: normRef = 1 - ((nWebsites nWords) / (referenceCorpusSize))
singlewords: normRef = 1 - ((nWebsites) / (referenceCorpusSize))
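As a rough illustration of how the salience score behaves, the sketch below implements only the single-word normRef and the product Salience = normFreq * normRef. normFreq is passed in as an already computed value, and the input numbers are invented for the example. The function names are assumptions.

```python
def norm_ref_single(n_websites, reference_corpus_size=500):
    """normRef for a single-word term: 1 - nWebsites / referenceCorpusSize."""
    if not 0 <= n_websites <= reference_corpus_size:
        raise ValueError("n_websites must lie between 0 and the reference corpus size")
    return 1 - n_websites / reference_corpus_size


def salience(norm_freq, norm_ref):
    """Salience = normFreq * normRef."""
    return norm_freq * norm_ref


# Same (invented) normFreq, different spread over the reference corpus.
common = salience(0.02, norm_ref_single(450))  # term found on 450 of 500 reference sites
rare = salience(0.02, norm_ref_single(5))      # term found on 5 of 500 reference sites
print(common < rare)
```

A term that occurs on nearly every reference website gets a normRef close to 0, so general web vocabulary is pushed down even when it is frequent on the customer site.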

Preferred term          nTokens  nPages  Salience
Klicken (click)               9       9    0.0010
Weitere (further)             5       2    0.0011
Wahl (choose)                 1       1    0.0011
Verkauf (sell)                1       1    0.0011
Service (service)            37       3    0.0011
Radio (radio)                 1       1    0.0011
Promotionen (promote)         1       1    0.0011
Optionen (options)            6       2    0.0011
Netzwerk (network)            4       2    0.0011
Medias (media)                1       1    0.0011
Kauf (buy)                    1       1    0.0011
Html                          1       1    0.0011
Gewerbe (commercial)          1       1    0.0011
Fax                          16      16    0.0011
Büro (office)                 1       1    0.0011

Term-extraction: conceptual analysis

Structural properties of the term hierarchy

Poor hierarchies:
- many tops
- few levels
- diverse branches

Each branch is a concept:
- number of descendants and levels
- cumulated frequency of descendants

Branch profiling:
- Domain classification of the hierarchy
- Domain classification of each branch
- Minimal overlap in domain
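The structural properties of a branch (number of descendants, number of levels, cumulated frequency) can be computed with a simple recursive walk. The sketch below is an illustration only: the parent-to-children dictionary, the frequency values and the choice to include the branch top's own frequency in the cumulated total are all assumptions.

```python
def branch_stats(children, freq, top):
    """Return (n_descendants, n_levels, cumulated_frequency) for the branch under `top`.

    children: dict mapping a term to the list of its direct hyponyms
    freq:     dict mapping a term to its own frequency
    The cumulated frequency includes the frequency of `top` itself (an assumption).
    """
    n_desc, n_levels, cum = 0, 0, freq.get(top, 0)
    for child in children.get(top, []):
        d, l, c = branch_stats(children, freq, child)
        n_desc += 1 + d                  # the child plus its own descendants
        n_levels = max(n_levels, 1 + l)  # deepest path below `top`
        cum += c
    return n_desc, n_levels, cum


# Toy hierarchy with invented frequencies.
children = {
    "kaffee": ["arabica-kaffee", "koffeinfreier kaffee"],
    "arabica-kaffee": ["gemahlener arabica-kaffee"],
}
freq = {
    "kaffee": 6,
    "arabica-kaffee": 1,
    "koffeinfreier kaffee": 1,
    "gemahlener arabica-kaffee": 1,
}
print(branch_stats(children, freq, "kaffee"))  # (3, 2, 9)
```

A hierarchy with many tops and few levels would show up here as many branches whose statistics are all close to (0, 0, own frequency).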

Wordnet: Domain information

Vocabularies of languages: violin, violist, string

Relations: type-of, part-of

Concepts:
- rec: 12345 - financial institute
- rec: 54321 - river side
- rec: 9876 - small string instrument
- rec: 65438 - musician playing a violin
- rec: 42654 - musician
- rec: 25876 - string instrument
- rec: 35576 - string of an instrument
- rec: 29551 - underwear

Domains: Culture, Finance, Clothing, Sport, sports, Winter sports

Term-extraction: contextual analysis

Anything can be a product or service: there are no intrinsic properties to define products.

Contextual features:
- context patterns for products
- product pages
- special marking in HTML

Term-extraction: contextual analysis

Context patterns for products: 144 patterns in English and 288 patterns in German
- [we supply]
- [we deliver]
- [we provide]
- [our products are]
- [we are one of the leading, producers on the market for]
- [we are, leading, producers on the market for]
- [is one of the leading, producers on the market for]
- [is, leading, producer on the market for]
- [we develop, products for]
- [we design, products for]
- [we produce, products for]
- [Our most common products]

Each term is scored for a product context in terms of the strength of the pattern and the distance between the pattern and the term.
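The slides do not give the scoring formula, so the following is only a sketch of one way to combine pattern strength and distance: each pattern has a hypothetical strength in (0, 1], and the score decays as strength / (1 + distance), where distance is the number of tokens between the end of the pattern and the term. Only patterns to the left of the term are considered; all names and values are assumptions.

```python
def product_context_score(tokens, term_index, patterns):
    """Best product-context score for the term at tokens[term_index].

    patterns: dict mapping a tuple of lowercased pattern tokens to a strength in (0, 1].
    The decay strength / (1 + distance) is an assumption, not the documented formula.
    """
    lowered = [t.lower() for t in tokens]
    best = 0.0
    for pattern, strength in patterns.items():
        n = len(pattern)
        # Try every position where the pattern ends at or before the term.
        for start in range(0, term_index - n + 1):
            if tuple(lowered[start:start + n]) == pattern:
                distance = term_index - (start + n)  # tokens between pattern and term
                best = max(best, strength / (1 + distance))
    return best


patterns = {("we", "supply"): 1.0, ("our", "products", "are"): 0.8}  # invented strengths
tokens = "We supply high pressure pumps".split()
print(product_context_score(tokens, 2, patterns))  # "high" directly follows the pattern
print(product_context_score(tokens, 4, patterns))  # "pumps" is two tokens further away
```

With this decay, a term immediately after [we supply] receives the full pattern strength, and terms further to the right receive progressively less.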

Term-extraction: contextual analysis

Product pages:
- landing page: index.html
- html files with product names: product, service, solution
- html files referred to by these pages
- html files referred to by menus with such names

Special marking in HTML:
- meta keywords
- headings and titles
- menus
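A minimal sketch of how these page-level cues could be collected, using only Python's standard html.parser. The class name, the set of heading tags and the substring test on the file name are assumptions; menus and link-following are left out.

```python
from html.parser import HTMLParser


class MarkedTextParser(HTMLParser):
    """Collect meta keywords plus title and heading text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.keywords = []  # terms from <meta name="keywords" content="...">
        self.marked = []    # (tag, text) pairs for titles and headings
        self._open = None   # currently open title/heading tag, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]
        elif tag in ("title", "h1", "h2", "h3"):
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.marked.append((self._open, data.strip()))


def page_source(path):
    """Label a page by its file name: product, service, solution, index or other."""
    name = path.lower().rsplit("/", 1)[-1]
    for label in ("product", "service", "solution", "index"):
        if label in name:
            return label
    return "other"


# Invented sample page for illustration.
sample = ('<html><head><title>Illy Kaffee</title>'
          '<meta name="keywords" content="Kaffee, Espresso"></head>'
          '<body><h1>Produkte</h1><p>text</p></body></html>')
parser = MarkedTextParser()
parser.feed(sample)
print(parser.keywords, parser.marked, page_source("/de/products.html"))
```

The file-name test only matches the English words listed on the slide; a German site using "produkte.html" would need its own list.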

Product terms with feature bundles

URL: http://www.illycafe.ch/ (nPages: 16 for all rows)

Preferred term              Head                    nTokens  nDocs  Salience  nSiblings  nTokensConcept  nDocsConcept  TermSource      Feature    FeatureScore  Connectivity  Relations                       ProfileMatch
Mischung                    -                       2        2      0.0111    28         56              43            #product        geniess    1.0           28            -                               0.5773
Vanillearomen               aromen                  1        1      0.0523    1          1               1             #product        pulver     1.0           2             #vanille                        0.5773
Köstlichsten Getränke       Getränke                1        1      0.0523    1          1               1             #product        produkt    0.28          2             #kostlich                       0.0
Gedanke                     -                       1        1      0.0427    28         56              43            #product        anhauen    1.0           28            -                               0.5773
2-Kilogrammvakuumpackung    Kilogrammvakuumpackung  2        1      0.1046    1          2               1             #product        produkt    0.28          2             #2                              -1.0
Einzelportionen             portionen               4        1      0.2093    1          4               1             #product        espresso   1.0           2             #einzel                         -1.0
Anhieb                      -                       1        1      0.0523    28         56              43            #product        gedanke    1.0           28            -                               0.5773
Arabica-Kaffee Gemahlener   Gemahlener              1        1      0.0523    1          1               1             #product        kaffee     1.0           10            #arabica#kaffee#arabica-kaffee  -1.0
Trinkschokoladen-Liebhaber  Liebhaber               1        1      0.0523    2          3               2             #product        punkt      1.0           4             #trinkschokolad                 -1.0
Koffeinfreier               freier                  1        1      0.0523    1          1               1             #product        kaffee     1.0           2             #koffein                        0.5773
Kaffee                      -                       6        5      0.0615    28         56              43            #product#index  standard   1.0           28            -                               -1.0
Gourmet                     -                       1        1      0.0102    28         56              43            #product        geheimtip  1.0           28            -                               -1.0
Kleinpackungs-Palette       Palette                 2        1      0.1046    1          2               1             #product        illymisch  1.0           2             #kleinpack                      0.5773

    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>

Evaluation of French product extraction

Nr. of URLs                                        29
Nr. of evaluated URLs                              27
Total good terms                                   95
Total bad terms                                    53
Total new terms                                    54
Total terms                                       202
Average nTerms per URL                              6
Total precision (good/(good+bad)), %               64
Average precision (nPrecision/evaluated URLs), %   68
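The total precision can be recomputed from the counts above. In the sketch below, the function name and the convention of returning 0 when there are no judged terms are assumptions; truncating to a whole percentage reproduces the figures reported in the evaluation tables.

```python
def precision_pct(good, bad):
    """Precision good / (good + bad) as a whole percentage, truncated; 0 if nothing was judged."""
    judged = good + bad
    return 100 * good // judged if judged else 0


print(precision_pct(95, 53))  # total precision from the counts above: 64
```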

Evaluation of French product extraction

Feature          Bin        Good  Bad  New  Precision
nTokens          low(2)       67   35   18         65
                 med(5)       12    7   19         63
                 high(>5)     16   11   17         59
getNrDocs        low(2)       71   38   31         65
                 med(5)       13    5   11         72
                 high(>5)     11   10   12         52
nSalience        low(0.05)     0    0    0          0
                 med(0.1)      0    0    0          0
                 high(0.5)     4    3    3         57
                 top(>0.5)    91   50   51         64
nSiblings        low(2)       84   51   48         62
                 med(5)       11    2    6         84
                 high(>5)      0    0    0          0
nCumFreqParent   low(2)       68   43   21         61
                 med(5)       19    5   11         79
                 high(>5)      8    5   22         61

Evaluation of French product extraction

Feature          Bin         Good  Bad  New  Precision
Term Source      meta          64   29   11         68
                 product       26   21   38         55
                 service        5    3    5         62
                 solution       0    0    0          0
                 index          0    0    0          0
                 other          0    0    0          0
ProfileMatch     -1            12    6   15         66
                 0             67   40   30         62
                 low(0.1)       0    0    0          0
                 med(0.5)       2    7    9         22
                 high(>0.5)     5    0    0        100
                 top(>0.7)      9    0    0        100
