Automatic term extraction from domain corpora
Piek Vossen
Irion Technologies / Vrije Universiteit Amsterdam
Guest lecture (Gastcollege), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007
Overview
- Corpus versus domain-based text collections
- Customer case
- Term extraction
- Demo
Corpus versus domain-based text collections
- Corpus to study linguistic phenomena:
  - INL corpus: NRC-Handelsblad
  - Corpus geschreven Nederlands
  - British National Corpus
  - Brown corpus -> SemCor
- Domain corpora:
  - portals
  - Wikipedia
- Customer corpora:
  - web sites
  - manuals
Customer-case
- Connect suppliers and buyers and create traffic and advertisement
- B2B: companies with specialized products and services
  - terminology driven
  - branch driven
- C2B: consumers looking for products and services
  - general language
  - terminology: -> folksonomy
  - bottom-up
- Subscription for product names
- Companies in database: 1.5 million websites
- User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for high pressure pumps)
- Product name in ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)
- Product name on company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and components such as valves)
Term-extraction
- morpho-syntactic analysis
- statistical analysis
- conceptual analysis
- contextual analysis
Term-extraction: morpho-syntactic analysis
- Tokenization, tagging and NP-chunking: “een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)
- Term candidates:
  - Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
  - Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
  - Head of compounds: aanvaller (attacker-player).
- Term is a concept:
  - Normalized form (plural-singular variants, synonyms)
  - Hypernym based on the syntactic head
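The candidate types listed on this slide can be illustrated with a small sketch. It assumes the phrase has already been tokenized and tagged; the coarse tag names (DET, ADJ, NOUN, ADP), the function name and the tiny lexicon of known heads are all hypothetical and stand in for a real tagger, chunker and compound splitter.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); the tag set is an assumption


def term_candidates(tokens: List[Token], known_heads: Set[str]) -> List[str]:
    """Collect term candidates from one tagged noun-phrase chunk."""
    # Determiners never belong to a term candidate, so drop them first.
    content = [(w, t) for w, t in tokens if t != "DET"]
    candidates: List[str] = []

    # 1. Syntactic heads: every noun in the chunk.
    nouns = [w for w, t in content if t == "NOUN"]
    candidates.extend(nouns)

    # 2. Word combinations that include a head: adjective + noun ...
    for (w1, t1), (w2, t2) in zip(content, content[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            candidates.append(f"{w1} {w2}")
    # ... and noun + preposition + noun.
    for (w1, t1), (w2, t2), (w3, t3) in zip(content, content[1:], content[2:]):
        if t1 == "NOUN" and t2 == "ADP" and t3 == "NOUN":
            candidates.append(f"{w1} {w2} {w3}")

    # 3. Heads of compounds: longest known noun that is a proper suffix.
    for noun in nouns:
        suffixes = [h for h in known_heads if noun.endswith(h) and noun != h]
        if suffixes:
            candidates.append(max(suffixes, key=len))
    return candidates


phrase = [("een", "DET"), ("gele", "ADJ"), ("kaart", "NOUN"),
          ("voor", "ADP"), ("de", "DET"), ("vleugelaanvaller", "NOUN")]
print(term_candidates(phrase, known_heads={"kaart", "aanvaller"}))
```

The suffix lookup is only a stand-in for compound analysis; a production system would need a proper Dutch or German compound splitter.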
Term-extraction: statistical analysis
- Reference corpus based on 500 websites of a diverse range of companies
- Salience = normFreq * normRef
- normFreq = normalized frequency of terms on the website
  - normFreq = nTermFrequency nWords / nPages
- normRef = normalized number of websites on which the term occurs in the reference corpus
  - multiwords: normRef = 1 - ((nWebsites nWords) / (referenceCorpusSize))
  - singlewords: normRef = 1 - ((nWebsites) / (referenceCorpusSize))
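As a worked illustration of the salience score for single-word terms, the sketch below implements only the two formulas that read unambiguously on this slide: the single-word normRef and the product Salience = normFreq * normRef. It assumes normFreq has been computed elsewhere and is passed in as a plain number; the function and parameter names are illustrative.

```python
def norm_ref_single(n_websites: int, reference_corpus_size: int) -> float:
    """normRef for a single-word term: 1 - nWebsites / referenceCorpusSize."""
    if reference_corpus_size <= 0:
        raise ValueError("reference corpus must contain at least one website")
    return 1.0 - n_websites / reference_corpus_size


def salience(norm_freq: float, norm_ref: float) -> float:
    """Salience = normFreq * normRef."""
    return norm_freq * norm_ref


# Hypothetical numbers: a term found on 50 of the 500 reference websites,
# with an already-normalized frequency of 0.01 on the customer website.
ref = norm_ref_single(n_websites=50, reference_corpus_size=500)
print(ref, salience(0.01, ref))
```

A term that occurs on every reference website gets normRef = 0 and therefore zero salience, which is how general vocabulary is filtered out.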
Preferred term          nTokens  nPages  Salience
Klicken (click)               9       9    0.0010
Weitere (further)             5       2    0.0011
Wahl (choose)                 1       1    0.0011
Verkauf (sell)                1       1    0.0011
Service (service)            37       3    0.0011
Radio (radio)                 1       1    0.0011
Promotionen (promote)         1       1    0.0011
Optionen (options)            6       2    0.0011
Netzwerk (network)            4       2    0.0011
Medias (media)                1       1    0.0011
Kauf (buy)                    1       1    0.0011
Html                          1       1    0.0011
Gewerbe (commercial)          1       1    0.0011
Fax                          16      16    0.0011
Büro (office)                 1       1    0.0011
Term-extraction: conceptual analysis
- Structural properties of the term hierarchy
- Poor hierarchies:
  - many tops
  - few levels
  - diverse branches
- Each branch is a concept:
  - number of descendants and levels
  - cumulated frequency of descendants
- Branch profiling:
  - Domain classification of the hierarchy
  - Domain classification of each branch
  - Minimal overlap in domain
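The structural properties of a branch (number of descendants, number of levels, cumulated frequency of descendants) can be computed with a short recursive walk. This is a minimal sketch assuming the hierarchy is stored as a parent-to-children dictionary; the example terms and frequencies are made up for illustration.

```python
from typing import Dict, List, Tuple


def branch_stats(term: str,
                 children: Dict[str, List[str]],
                 freq: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (descendants, levels below term, cumulated descendant frequency)."""
    n_descendants = 0
    n_levels = 0
    cum_freq = 0
    for child in children.get(term, []):
        d, levels, f = branch_stats(child, children, freq)
        n_descendants += 1 + d                 # the child plus its own descendants
        n_levels = max(n_levels, 1 + levels)   # deepest path below this term
        cum_freq += freq.get(child, 0) + f     # child frequency plus its subtree
    return n_descendants, n_levels, cum_freq


# Hypothetical hierarchy built on syntactic heads.
children = {"pomp": ["drukpomp", "waterpomp"], "drukpomp": ["hoge drukpomp"]}
freq = {"pomp": 6, "drukpomp": 2, "waterpomp": 1, "hoge drukpomp": 1}
print(branch_stats("pomp", children, freq))  # (3, 2, 4)
```

A top term with few descendants and a single level would count as a poor branch in the sense of the slide.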
Wordnet: Domain information
- Vocabularies of languages: bank, violin, violist, string (with sense numbers 1, 2)
- Concepts:
  - rec: 12345 - financial institute
  - rec: 54321 - river side
  - rec: 9876 - small string instrument
  - rec: 65438 - musician playing a violin
  - rec: 42654 - musician
  - rec: 25876 - string instrument
  - rec: 35576 - string of an instrument
  - rec: 29551 - underwear
- Relations: type-of, part-of
- Domains: Music, Culture, Finance, Clothing, Sport, Ball sports, Winter sports
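The idea of attaching domain labels to wordnet concepts can be sketched as three lookup tables. The concept records are the ones named on the slide; the word-to-concept links and the domain assigned to each concept are assumptions read off the glosses, since the arrows of the original diagram are not recoverable here.

```python
from typing import Dict, List

# Concept records as named on the slide.
CONCEPTS: Dict[int, str] = {
    12345: "financial institute", 54321: "river side",
    9876: "small string instrument", 65438: "musician playing a violin",
    42654: "musician", 25876: "string instrument",
    35576: "string of an instrument", 29551: "underwear",
}

# ASSUMPTION: word-to-concept links inferred from the glosses.
WORD_SENSES: Dict[str, List[int]] = {
    "bank": [12345, 54321], "violin": [9876],
    "violist": [65438], "string": [35576, 29551],
}

# ASSUMPTION: domain labels per concept, chosen for illustration only.
CONCEPT_DOMAIN: Dict[int, str] = {
    12345: "Finance", 9876: "Music", 65438: "Music", 42654: "Music",
    25876: "Music", 35576: "Music", 29551: "Clothing",
}


def senses_in_domain(word: str, domain: str) -> List[int]:
    """Return the concept records of `word` that carry the given domain label."""
    return [c for c in WORD_SENSES.get(word, []) if CONCEPT_DOMAIN.get(c) == domain]


print(senses_in_domain("string", "Music"))   # [35576]
print(senses_in_domain("bank", "Finance"))   # [12345]
```

With such labels, an ambiguous word on a music website can be resolved to its music sense, which is the kind of domain evidence the branch profiling relies on.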
Term-extraction: contextual analysis
- Anything can be a product or service: there are no intrinsic properties to define products
- Contextual features:
  - context patterns for products
  - product pages
  - special marking in HTML
Term-extraction: contextual analysis
- Context patterns for products: 144 patterns in English and 288 patterns in German
  - [we supply]
  - [we deliver]
  - [we provide]
  - [our products are]
  - [we are one of the leading, producers on the market for]
  - [we are, leading, producers on the market for]
  - [is one of the leading, producers on the market for]
  - [is, leading, producer on the market for]
  - [we develop, products for]
  - [we design, products for]
  - [we produce, products for]
  - [Our most common products]
- Each term is scored for a product context in terms of the strength of the pattern and the distance
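Scoring a term by pattern strength and distance could look roughly like the sketch below. The slides do not give the actual scoring function, so the formula strength / (1 + distance in tokens) is an assumption, as are the strength values attached to the three sample patterns and the function name.

```python
import re
from typing import Dict, List

# Sample patterns from the slide; the strength values are hypothetical.
PATTERNS: Dict[str, float] = {
    "we supply": 1.0,
    "we deliver": 1.0,
    "our most common products": 0.8,
}


def product_context_score(text: str, term: str) -> float:
    """Best score of `term` over all pattern matches that precede it."""
    tokens: List[str] = re.findall(r"\w+", text.lower())
    term_tokens = term.lower().split()
    width = len(term_tokens)
    term_positions = [i for i in range(len(tokens) - width + 1)
                      if tokens[i:i + width] == term_tokens]
    best = 0.0
    for pattern, strength in PATTERNS.items():
        p = pattern.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] != p:
                continue
            end = i + len(p)  # index of the first token after the pattern
            for pos in term_positions:
                if pos >= end:
                    distance = pos - end
                    # ASSUMED formula: the score decays with token distance.
                    best = max(best, strength / (1 + distance))
    return best


print(product_context_score("We supply valves and pumps.", "valves"))  # 1.0
```

A term directly after a strong pattern gets the full strength; terms further away in the enumeration score lower.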
Term-extraction: contextual analysis
- Product pages:
  - landing page: index.html
  - html files with product names: product, service, solution
  - html files referred to by these pages
  - html files referred to by menus with such names
- Special marking in HTML:
  - meta keywords
  - headings and titles
  - menus
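Collecting specially marked text from a page can be sketched with Python's standard `html.parser` module. The example covers meta keywords, titles and headings only; menu detection is left out because it depends on site-specific markup. The class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MarkedTextParser(HTMLParser):
    """Collect (source, text) pairs for meta keywords, titles and headings."""

    def __init__(self) -> None:
        super().__init__()
        self.marked: List[Tuple[str, str]] = []
        self._open: Optional[str] = None  # title/heading tag currently open

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            # Meta keywords are a comma-separated list in the content attribute.
            for keyword in (attributes.get("content") or "").split(","):
                if keyword.strip():
                    self.marked.append(("meta", keyword.strip()))
        elif tag == "title" or tag in HEADINGS:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.marked.append((self._open, data.strip()))


def marked_text(html: str) -> List[Tuple[str, str]]:
    parser = MarkedTextParser()
    parser.feed(html)
    parser.close()
    return parser.marked


page = ('<html><head><title>Illy Kaffee</title>'
        '<meta name="keywords" content="Kaffee, Espresso"></head>'
        '<body><h1>Produkte</h1><p>Text</p></body></html>')
print(marked_text(page))
```

The source label kept with each string corresponds to the kind of term-source feature (meta, product, index) used later in the feature bundles.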
Product terms with feature bundles
URL: http://www.illycafe.ch/ (nPages: 16, identical for every row; "-" marks an empty cell)
Preferred term              Head                    nTokens  nDocs  Salience  nSiblings  nTokensConcept  nDocsConcept  TermSource      Feature    FeatureScore  Connectivity  Relations                       ProfileMatch
Mischung                    -                       2        2      0.0111    28         56              43            #product        geniess    1.0           28            -                               0.5773
Vanillearomen               aromen                  1        1      0.0523    1          1               1             #product        pulver     1.0           2             #vanille                        0.5773
Köstlichsten Getränke       Getränke                1        1      0.0523    1          1               1             #product        produkt    0.28          2             #kostlich                       0.0
Gedanke                     -                       1        1      0.0427    28         56              43            #product        anhauen    1.0           28            -                               0.5773
2-Kilogrammvakuumpackung    Kilogrammvakuumpackung  2        1      0.1046    1          2               1             #product        produkt    0.28          2             #2                              -1.0
Einzelportionen             portionen               4        1      0.2093    1          4               1             #product        espresso   1.0           2             #einzel                         -1.0
Anhieb                      -                       1        1      0.0523    28         56              43            #product        gedanke    1.0           28            -                               0.5773
Arabica-Kaffee Gemahlener   Gemahlener              1        1      0.0523    1          1               1             #product        kaffee     1.0           10            #arabica#kaffee#arabica-kaffee  -1.0
Trinkschokoladen-Liebhaber  Liebhaber               1        1      0.0523    2          3               2             #product        punkt      1.0           4             #trinkschokolad                 -1.0
Koffeinfreier               freier                  1        1      0.0523    1          1               1             #product        kaffee     1.0           2             #koffein                        0.5773
Kaffee                      -                       6        5      0.0615    28         56              43            #product#index  standard   1.0           28            -                               -1.0
Gourmet                     -                       1        1      0.0102    28         56              43            #product        geheimtip  1.0           28            -                               -1.0
Kleinpackungs-Palette       Palette                 2        1      0.1046    1          2               1             #product        illymisch  1.0           2             #kleinpack                      0.5773
<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>
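A record in this XML layout can be read back into a feature bundle with the standard `xml.etree.ElementTree` module. This is a minimal sketch assuming every numeric field is present, as in the example record; the function name and the keys of the resulting dictionary are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# The example record from the slide, in compact form.
RECORD = """<class><name><![CDATA[arabica-kaffee gemahlene]]></name><id>48</id>
<pos>1</pos><preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
<parent_form><![CDATA[Gemahlener]]></parent_form><documents>1</documents>
<frequency>1</frequency><salience>0.0523</salience><connectivity>10</connectivity>
<modifiers><modifier>arabica</modifier><modifier>kaffee</modifier>
<modifier>arabica-kaffee</modifier></modifiers><profileMatch>-1</profileMatch>
<profile/><termSource><![CDATA[#product]]></termSource>
<cumfrequency_parent>1</cumfrequency_parent>
<cumdocuments_parent>1</cumdocuments_parent><siblings>1</siblings>
<features><feature><featureName>RIGHT</featureName>
<featureValue>kaffee</featureValue><featureScore>1.0</featureScore>
</feature></features></class>"""


def parse_term_record(xml_text: str) -> Dict[str, Any]:
    """Turn one <class> record into a plain dictionary of features."""
    root = ET.fromstring(xml_text)
    return {
        "preferred_form": root.findtext("preferred_form"),
        "parent_form": root.findtext("parent_form"),
        "documents": int(root.findtext("documents")),
        "frequency": int(root.findtext("frequency")),
        "salience": float(root.findtext("salience")),
        "connectivity": int(root.findtext("connectivity")),
        "profile_match": float(root.findtext("profileMatch")),
        "term_source": root.findtext("termSource"),
        "siblings": int(root.findtext("siblings")),
        "modifiers": [m.text for m in root.iter("modifier")],
        "features": [(f.findtext("featureName"),
                      f.findtext("featureValue"),
                      float(f.findtext("featureScore")))
                     for f in root.iter("feature")],
    }


bundle = parse_term_record(RECORD)
print(bundle["preferred_form"], bundle["salience"], bundle["features"])
```

The parser unwraps the CDATA sections automatically, so the term strings come back as ordinary text.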
Evaluation of French product extraction
Nr. of URLs                                          29
Nr. of evaluated URLs                                27
Total good terms                                     95
Total bad terms                                      53
Total new terms                                      54
Total terms                                         202
Average nTerms per URL                                6
Total precision (good/(good+bad)), %                 64
Average precision (nPrecision/Evaluated URLs), %     68
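The two precision figures differ in how they aggregate: total precision pools all judged terms, while average precision is the mean of the per-URL precisions. The sketch below shows both. Truncating the percentage to a whole number is an assumption that happens to reproduce the reported values (95 good and 53 bad terms give 64); the per-URL counts in the usage example are made up.

```python
from typing import List, Tuple


def precision_pct(good: int, bad: int) -> int:
    """Precision good / (good + bad) as a whole percentage, truncated."""
    judged = good + bad
    if judged == 0:
        return 0  # no judged terms: reported as 0 in the tables
    return 100 * good // judged


def average_precision(per_url: List[Tuple[int, int]]) -> float:
    """Mean of per-URL precision over URLs that have judged terms."""
    fractions = [g / (g + b) for g, b in per_url if g + b > 0]
    return sum(fractions) / len(fractions) if fractions else 0.0


print(precision_pct(95, 53))                 # 64
print(average_precision([(3, 1), (1, 1)]))   # 0.625, hypothetical counts
```

New terms are left out of both figures, matching the good/(good+bad) definition on the slide.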
Evaluation of French product extraction
Feature          Range       Good   Bad   New  Precision (%)
nTokens          low(2)        67    35    18             65
                 med(5)        12     7    19             63
                 high(>5)      16    11    17             59
getNrDocs        low(2)        71    38    31             65
                 med(5)        13     5    11             72
                 high(>5)      11    10    12             52
nSalience        low(0.05)      0     0     0              0
                 med(0.1)       0     0     0              0
                 high(0.5)      4     3     3             57
                 top(>0.5)     91    50    51             64
nSiblings        low(2)        84    51    48             62
                 med(5)        11     2     6             84
                 high(>5)       0     0     0              0
nCumFreqParent   low(2)        68    43    21             61
                 med(5)        19     5    11             79
                 high(>5)       8     5    22             61
Evaluation of French product extraction
Feature          Range       Good   Bad   New  Precision (%)
Term Source      meta          64    29    11             68
                 product       26    21    38             55
                 service        5     3     5             62
                 solution       0     0     0              0
                 index          0     0     0              0
                 other          0     0     0              0
ProfileMatch     -1            12     6    15             66
                 0             67    40    30             62
                 low(0.1)       0     0     0              0
                 med(0.5)       2     7     9             22
                 high(>0.5)     5     0     0            100
                 top(>0.7)      9     0     0            100