Automatic term extraction from domain corpora Piek Vossen Irion Technologies/Vrije Universiteit Amsterdam Gastcollege Corpus-based Methods Universiteit.

  • Published on
    27-Mar-2015

Transcript

<ul><li>Slide 1</li></ul> <p>Automatic term extraction from domain corpora. Piek Vossen, Irion Technologies/Vrije Universiteit Amsterdam. Guest lecture, Corpus-based Methods, Universiteit Nijmegen, 26 November 2007.</p>
<ul><li>Slide 2</li></ul> <p>Overview: corpus versus domain-based text collections; customer case; term extraction; demo.</p>
<ul><li>Slide 3</li></ul> <p>Corpus versus domain-based text collections</p>
<ul>
<li>Corpus to study linguistic phenomena: INL corpus: NRC-handelsblad; Corpus geschreven Nederlands; British National Corpus; Brown corpus -&gt; SemCor</li>
<li>Domain corpora: portals; Wikipedia</li>
<li>Customer corpora: web sites; manuals</li>
</ul>
<ul><li>Slide 4</li></ul> <p>Customer case: connect suppliers and buyers and create traffic and advertisement.</p>
<ul>
<li>B2B: companies with specialized products and services; terminology driven; branch driven</li>
<li>C2B: consumers looking for products and services; general language terminology: -&gt; folksonomy; bottom-up</li>
</ul>
<ul><li>Slide 5</li></ul> <p>Subscription for product names</p>
<ul>
<li>Companies in database, 1.5 million websites</li>
<li>User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for high pressure pumps)</li>
<li>Product name in ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)</li>
<li>Product name on company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and components such as valves)</li>
</ul>
<ul><li>Slide 6</li></ul>
<ul><li>Slide 7</li></ul>
<ul><li>Slide 8</li></ul> <p>Term extraction: morpho-syntactic analysis; statistical analysis; conceptual analysis; contextual analysis.</p>
<ul><li>Slide 9</li></ul> <p>Term extraction: morpho-syntactic analysis</p>
<ul>
<li>Tokenization, tagging and NP-chunking: een gele kaart voor de vleugelaanvaller (a yellow card for the wing-player)</li>
<li>Term candidates:
<ul>
<li>Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player)</li>
<li>Word combinations including the syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player)</li>
<li>Head of compounds: aanvaller (attacker-player)</li>
</ul>
</li>
<li>Term is a concept: normalized form (plural-singular variants, synonyms); hypernym based on the syntactic head</li>
</ul>
<ul><li>Slide 10</li></ul>
<ul><li>Slide 11</li></ul> <p>Term extraction: statistical analysis</p>
<ul>
<li>Reference corpus based on 500 websites of a diverse range of companies</li>
<li>Salience = normFreq * normRef</li>
<li>normFreq = normalized frequency of terms on the website: normFreq = nTermFrequency nWords / nPages</li>
<li>normRef = normalized number of websites on which the term occurs in the reference corpus
<ul>
<li>multiwords: normRef = 1 - ((nWebsites nWords) / (referenceCorpusSize))</li>
<li>singlewords: normRef = 1 - ((nWebsites) / (referenceCorpusSize))</li>
</ul>
</li>
</ul>
<ul><li>Slide 12</li></ul>
<table>
<tr><th>Preferred term</th><th>nTokens</th><th>nPages</th><th>Salience</th></tr>
<tr><td>Klicken (click)</td><td>9</td><td>9</td><td>0.0010</td></tr>
<tr><td>Weitere (further)</td><td>5</td><td>2</td><td>0.0011</td></tr>
<tr><td>Wahl (choose)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Verkauf (sell)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Service (service)</td><td>37</td><td>3</td><td>0.0011</td></tr>
<tr><td>Radio (radio)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Promotionen (promote)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Optionen (options)</td><td>6</td><td>2</td><td>0.0011</td></tr>
<tr><td>Netzwerk (network)</td><td>4</td><td>2</td><td>0.0011</td></tr>
<tr><td>Medias (media)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Kauf (buy)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Html</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Gewerbe (commercial)</td><td>1</td><td>1</td><td>0.0011</td></tr>
<tr><td>Fax</td><td colspan="2">16</td><td>0.0011</td></tr>
<tr><td>Büro (office)</td><td>1</td><td>1</td><td>0.0011</td></tr>
</table>
<ul><li>Slide 13</li></ul> <p>Term extraction: conceptual analysis</p>
<ul>
<li>Structural properties of the term hierarchy</li>
<li>Poor hierarchies: many tops; few levels; diverse branches</li>
<li>Each branch is a concept: number of descendants and levels; cumulated frequency of descendants</li>
<li>Branch profiling: domain classification of the hierarchy; domain classification of each branch; minimal overlap in domain</li>
</ul>
<ul><li>Slide 14</li></ul>
<ul><li>Slide 15</li></ul> <p>Wordnet: domain information. Relations: type-of, part-of.</p>
<ul>
<li>Vocabularies of languages: bank, violin, violist, string (senses 1 2, 1 2, 1 2)</li>
<li>Concepts:
<ul>
<li>rec: 12345 - financial institute</li>
<li>rec: 54321 - river side</li>
<li>rec: 9876 - small string instrument</li>
<li>rec: 65438 - musician playing a violin</li>
<li>rec: 42654 - musician</li>
<li>rec: 25876 - string instrument</li>
<li>rec: 35576 - string of an instrument</li>
<li>rec: 29551 - underwear</li>
</ul>
</li>
<li>Domains: Music; Culture; Finance; Clothing; Sport; Ball sports; Winter sports</li>
</ul>
<ul><li>Slide 16</li></ul> <p>Domain based concept selection. Diagram labels: TwentyOne Classify; Text Classifier; Text grouped by Domains; Un-seen Document (Phrase: financial scandal Juventus; Phrase: Players boycott the match); Classify; More Contexts + Domain; Train; Set of concepts; Domain; Synsets; Glosses; Examples; WordNet/Semnet; Concept Selection; Sport - words; Train; Export; Microworld: Sport; Nanoworld: Finance; Nanoworld: Sport.</p>
<ul><li>Slide 17</li></ul> <p>Term extraction: contextual analysis</p>
<ul>
<li>Anything can be a product or service: there are no intrinsic properties to define products</li>
<li>Contextual features: context patterns for products; product pages; special marking in HTML</li>
</ul>
<ul><li>Slide 18</li></ul> <p>Term extraction: contextual analysis. Context patterns for products: 144 patterns in English and 288 patterns in German.</p>
<ul>
<li>[we supply]</li>
<li>[we deliver]</li>
<li>[we provide]</li>
<li>[our products are]</li>
<li>[we are one of the leading, producers on the market for]</li>
<li>[we are, leading, producers on the market for]</li>
<li>[is one of the leading, producers on the market for]</li>
<li>[is, leading, producer on the market for]</li>
<li>[we develop, products for]</li>
<li>[we design, products for]</li>
<li>[we produce, products for]</li>
<li>[Our most common products]</li>
</ul>
<p>Each term is scored for a product context in terms of the strength of the pattern and the distance.</p>
<ul><li>Slide 19</li></ul> <p>Term extraction: contextual analysis</p>
<ul>
<li>Product pages: landing page: index.html; html files with product names: product, service, solution; html files referred to by these pages; html files referred to by menus with such names</li>
<li>Special marking in HTML: meta keywords; headings and titles; menus</li>
</ul>
<ul><li>Slide 20</li></ul> <p>Product terms with feature bundles</p>
<ul><li>Slide 21</li></ul> <p>48 1 1 0.0523 10 arabica kaffee arabica-kaffee 1 RIGHT kaffee 1.0</p>
<ul><li>Slide 22</li></ul> <p>Evaluation of French product extraction</p>
<table>
<tr><th>Measure</th><th>Value</th></tr>
<tr><td>Nr. of URLs</td><td>29</td></tr>
<tr><td>Nr. of evaluated URLs</td><td>27</td></tr>
<tr><td>Total good terms</td><td>95</td></tr>
<tr><td>Total bad terms</td><td>53</td></tr>
<tr><td>Total new terms</td><td>54</td></tr>
<tr><td>Total terms</td><td>202</td></tr>
<tr><td>Average nTerms per URL</td><td>6</td></tr>
<tr><td>Total precision (good/(good+bad))</td><td>64</td></tr>
<tr><td>Average precision (nPrecision/Evaluated Urls)</td><td>68</td></tr>
</table>
<ul><li>Slide 23</li></ul> <p>Evaluation of French product extraction</p>
<table>
<tr><th>Feature</th><th>Range</th><th>Good</th><th>Bad</th><th>New</th><th>Precision</th></tr>
<tr><td>nTokens</td><td>low(2)</td><td>67</td><td>35</td><td>18</td><td>65</td></tr>
<tr><td>nTokens</td><td>med(5)</td><td>12</td><td>7</td><td>19</td><td>63</td></tr>
<tr><td>nTokens</td><td>high(&gt;5)</td><td>16</td><td>11</td><td>17</td><td>59</td></tr>
<tr><td>getNrDocs</td><td>low(2)</td><td>71</td><td>38</td><td>31</td><td>65</td></tr>
<tr><td>getNrDocs</td><td>med(5)</td><td>13</td><td>5</td><td>11</td><td>72</td></tr>
<tr><td>getNrDocs</td><td>high(&gt;5)</td><td>11</td><td>10</td><td>12</td><td>52</td></tr>
<tr><td>nSalience</td><td>low(0.05)</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>nSalience</td><td>med(0.1)</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>nSalience</td><td>high(0.5)</td><td>4</td><td>3</td><td>3</td><td>57</td></tr>
<tr><td>nSalience</td><td>top(&gt;0.5)</td><td>91</td><td>50</td><td>51</td><td>64</td></tr>
<tr><td>nSiblings</td><td>low(2)</td><td>84</td><td>51</td><td>48</td><td>62</td></tr>
<tr><td>nSiblings</td><td>med(5)</td><td>11</td><td>2</td><td>6</td><td>84</td></tr>
<tr><td>nSiblings</td><td>high(&gt;5)</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>nCumFreqParent</td><td>low(2)</td><td>68</td><td>43</td><td>21</td><td>61</td></tr>
<tr><td>nCumFreqParent</td><td>med(5)</td><td>19</td><td>5</td><td>11</td><td>79</td></tr>
<tr><td>nCumFreqParent</td><td>high(&gt;5)</td><td>8</td><td>5</td><td>22</td><td>61</td></tr>
</table>
<ul><li>Slide 24</li></ul> <p>Evaluation of French product extraction</p>
<table>
<tr><th>Feature</th><th>Value</th><th>Good</th><th>Bad</th><th>New</th><th>Precision</th></tr>
<tr><td>Term Source</td><td>meta</td><td>64</td><td>29</td><td>11</td><td>68</td></tr>
<tr><td>Term Source</td><td>product</td><td>26</td><td>21</td><td>38</td><td>55</td></tr>
<tr><td>Term Source</td><td>service</td><td>5</td><td>3</td><td>5</td><td>62</td></tr>
<tr><td>Term Source</td><td>solution</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>Term Source</td><td>index</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>Term Source</td><td>other</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>ProfileMatch</td><td></td><td>12</td><td>6</td><td>15</td><td>66</td></tr>
<tr><td>ProfileMatch</td><td>0</td><td>67</td><td>40</td><td>30</td><td>62</td></tr>
<tr><td>ProfileMatch</td><td>low(0.1)</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>ProfileMatch</td><td>med(0.5)</td><td>2</td><td>7</td><td>9</td><td>22</td></tr>
<tr><td>ProfileMatch</td><td>high(&gt;0.5)</td><td>5</td><td>0</td><td>0</td><td>100</td></tr>
<tr><td>ProfileMatch</td><td>top(&gt;0.7)</td><td>9</td><td>0</td><td>0</td><td>100</td></tr>
</table>
<p></p>
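The single-word salience computation from the statistical-analysis slide (slide 11) can be sketched roughly as follows. This is a minimal illustration, not the system's implementation: the function names are hypothetical, normFreq is taken as an already-computed input because its exact formula did not survive in the transcript, and the multiword variant of normRef is left out for the same reason.

```python
def norm_ref_single_word(n_websites: int, reference_corpus_size: int) -> float:
    """normRef for single words: 1 - (nWebsites / referenceCorpusSize).

    n_websites is the number of reference-corpus websites containing the term;
    reference_corpus_size is the number of websites in the reference corpus
    (500 in the slides).
    """
    if reference_corpus_size <= 0:
        raise ValueError("reference_corpus_size must be positive")
    if not 0 <= n_websites <= reference_corpus_size:
        raise ValueError("n_websites must lie between 0 and reference_corpus_size")
    return 1.0 - n_websites / reference_corpus_size


def salience(norm_freq: float, norm_ref: float) -> float:
    """Salience = normFreq * normRef, as stated on the slide."""
    return norm_freq * norm_ref


if __name__ == "__main__":
    # Illustrative numbers only (not taken from the slides): a term found on
    # 50 of the 500 reference websites, with an assumed normFreq of 0.002.
    ref = norm_ref_single_word(n_websites=50, reference_corpus_size=500)
    print(salience(norm_freq=0.002, norm_ref=ref))
```

A term that occurs on every reference website gets normRef = 0 and therefore zero salience, while a term absent from the reference corpus keeps its full normalized frequency.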
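The precision figures in the French evaluation (slides 22 to 24) follow the stated formula good/(good+bad), with new terms left out of the denominator. The sketch below recomputes a few of them. The function name is hypothetical, and two details are assumptions inferred from the reported numbers rather than stated in the slides: percentages appear to be truncated rather than rounded (91 good and 50 bad is reported as 64), and rows with no judged terms are reported as 0.

```python
def precision_percent(good: int, bad: int) -> int:
    """Precision as a whole percentage: good / (good + bad), truncated.

    Returns 0 when no terms were judged, mirroring the empty rows in the tables.
    """
    if good < 0 or bad < 0:
        raise ValueError("counts must be non-negative")
    judged = good + bad
    if judged == 0:
        return 0
    # Integer floor division truncates, matching the reported figures.
    return (100 * good) // judged


if __name__ == "__main__":
    # (label, good, bad) taken from the evaluation slides.
    rows = [
        ("total", 95, 53),
        ("nTokens low(2)", 67, 35),
        ("nSalience top(>0.5)", 91, 50),
        ("Term Source meta", 64, 29),
    ]
    for label, good, bad in rows:
        print(label, precision_percent(good, bad))
```

Note that the "Average precision" of 68 on slide 22 is a per-URL average and cannot be recomputed from the totals alone.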
