Practical Issues for Automated Categorization of Web Sites
John M. Pierre
jpierre@metacode.com
Metacode Technologies, Inc.
139 Townsend Street
San Francisco, CA 94107
(Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)
Outline
Project overview
Web content•Automated Categorization•Feature Selection•Metadata
Experimental Setup•Data•Targeted Spidering•System Architecture
Results
Conclusions
Project Overview
Specific:•Categorize large number of domain names by industry category•NAICS classification scheme•~30,000 domain names for testing (.com)•Text categorization approach
General:•Domain specific classification•Metadata•Targeted spidering•Feature selection•Classifier training
Web Content: Automated Categorization
Challenges:•Vast (over 1 Billion pages)•Heterogeneous (content, formats, not just HTML)•Dynamic (growing, changing)
Benefits:•Good source of information•Accessible!•Machine readable (vs. machine understandable)•Semi-structured
Tools:•Classification•Automated classification•Text Categorization/Machine Learning•Intelligent agents
Related Work
Manual:•Yahoo!•Open Directory Project•Looksmart
Automatic:•Northern Light•Thunderstone/Texis•Inktomi
Other:•EU Project DESIRE II•Pharos•Attardi, Sebastiani et al•L. Page et al•McCallum et al
Web Content: Feature Selection
Text Features: (D. Lewis)•Relatively few in number•Moderate in frequency of assignment•Low in redundancy•Low in noise•Related in semantic scope to the classes to be assigned•Relatively unambiguous in meaning
Text source        Precision   Recall   micro F1
Body               0.47        0.34     0.39
Body + Metatags    0.55        0.34     0.42
Metatags           0.64        0.39     0.48
Preliminary Experiment:•1125 web domains•SEC+NAICS training set
Use metadata if possible, use body text as last resort!
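The metadata-first rule can be sketched in a few lines of Python. This is only an illustration, not the project's code: the class name `PageText`, the function `feature_text`, and the choice to fall back to title plus body text are assumptions made for the example.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects title text, meta description/keywords, and body text."""

    def __init__(self):
        super().__init__()
        self.title, self.meta, self.body = [], [], []
        self._in_title = False
        self._in_body = False
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            content = (a.get("content") or "").strip()
            if name in ("description", "keywords") and content:
                self.meta.append(content)
        elif tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.title.append(text)
        elif self._in_body:
            self.body.append(text)


def feature_text(html):
    """Return (source, text): metatag content if any, else title + body."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    if parser.meta:
        return "metatags", " ".join(parser.meta)
    # Assumption: with no metatags, title and body text are the last resort.
    return "body", " ".join(parser.title + parser.body)
```

Calling `feature_text(html)` on a fetched page returns which source was used along with the text to feed to the classifier.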
[Figure: Web Page Content. Percentage of pages (0% to 90%) by number of words (0, 1 to 10, 11 to 50, 51 or more) for Title, Meta-Description, Meta-Keywords, and Body.]
Web Content: Metadata
Experimental Setup: Targeted Spidering
[Flowchart: targeted spidering. Steps: HTTP Get on the domain name; Try www.; Send Query; Use <body>. Yes/No decisions: live?; Frames?; Metatags?; <a href=?. Link keywords: prod, service, about, info, press, news. Output: 'Query' pages.]
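One plausible reading of the spidering flowchart is sketched below in Python. The order of the checks, the regular expressions, the `max_pages` limit, and the caller-supplied `fetch` function are all assumptions made for illustration; only the link keywords come from the slide.

```python
import re
from urllib.parse import urljoin

# Link keywords listed on the slide.
LINK_HINTS = ("prod", "service", "about", "info", "press", "news")

# Crude patterns, adequate for a sketch but not for real-world HTML.
HREF_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.I)
FRAME_RE = re.compile(r'<frame\s[^>]*src=["\']([^"\']+)["\']', re.I)
META_RE = re.compile(
    r'<meta\s[^>]*name=["\'](?:description|keywords)["\']', re.I)


def spider(domain, fetch, max_pages=5):
    """Return a list of (url, html, source) pages to build the query from.

    fetch(url) is a hypothetical callable returning HTML text or None.
    """
    url = "http://" + domain + "/"
    html = fetch(url)
    if html is None:  # not live: retry with the www. host
        url = "http://www." + domain + "/"
        html = fetch(url)
        if html is None:
            return []
    for src in FRAME_RE.findall(html):  # frames: use the first that loads
        framed = fetch(urljoin(url, src))
        if framed is not None:
            url, html = urljoin(url, src), framed
            break
    if META_RE.search(html):  # metatags present: query from this page
        return [(url, html, "metatags")]
    pages = []
    for href in HREF_RE.findall(html):  # follow promising links
        if len(pages) >= max_pages:
            break
        if any(hint in href.lower() for hint in LINK_HINTS):
            target = urljoin(url, href)
            page = fetch(target)
            if page is not None:
                pages.append((target, page, "link"))
    # No useful links: fall back to the body of the entry page.
    return pages or [(url, html, "body")]
```

Passing `fetch` in as a parameter keeps the decision logic separate from the HTTP client, so the sketch can be exercised against a dictionary of canned pages.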
Experimental Setup: Data
Classification scheme: NAICS
11 Agriculture, Forestry, Fishing and Hunting
21 Mining
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific and Technical Services
55 Management of Companies and Enterprise
56 Admin. Support, Waste Mgmt and Remediation Srvcs
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment & Recreation
72 Accommodation and Food Services
81 Other services (except 92)
92 Public Administration
99 Unclassified Establishments
Test Data
~30,000 domain names (SIC)
~13,500 pre-classified/content
Training Data
“SEC-NAICS”:•1504 SEC 10-K filings (SIC)•426 NAICS labels/descriptions
“Web pages”:•3618 pre-classified domains
Crosswalk•SIC <-> NAICS
Experimental Setup: System Architecture
[Diagram: system architecture. Components: The Web, Domain Names, Spider, IR Engine, Decision. Training sets: SEC-NAICS, Web pages. Flows: text query, matching documents. Example output: Foo.com 11, 21, 23.]
Results
Training set   micro P   micro R   micro F1   macro P   macro R   macro F1
SEC-NAICS      0.66      0.35      0.45       0.23      0.18      0.09
Web pages      0.71      0.75      0.73       0.70      0.37      0.40
P = Precision = # correctly assigned / # assigned
R = Recall = # correctly assigned / # total correct
F1 = 2 P R / (P + R)
micro-averaged = computed over all categories
macro-averaged = per category, then averaged
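The definitions above can be written out as a short Python sketch. The function names and the input layout (one tuple of counts per category) are assumptions for illustration. Macro F1 is taken here as the mean of per-category F1 values, which fits the reported table, where macro F1 falls below both macro P and macro R.

```python
def prf(correctly_assigned, assigned, total_correct):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = correctly_assigned / assigned if assigned else 0.0
    r = correctly_assigned / total_correct if total_correct else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro_macro(per_category):
    """per_category maps a category code to
    (# correctly assigned, # assigned, # total correct)."""
    if not per_category:
        raise ValueError("need at least one category")
    counts = list(per_category.values())
    # Micro: pool the counts over all categories, then compute P, R, F1.
    micro = prf(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    # Macro: compute P, R, F1 per category, then average each one.
    scores = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    return micro, macro


# Hypothetical counts for two categories, for illustration only.
micro, macro = micro_macro({"21": (8, 10, 16), "52": (1, 4, 2)})
```

Micro-averaging is dominated by the large categories, while macro-averaging weights every category equally, so a gap between the two usually points to weak performance on rare categories.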
Conclusions
Domain Specific Classification:
Knowledge Gathering:•Use of specialized knowledge
Targeted Spidering:•Efficient use of resources•Extract key features, Metadata
Training:•Prior knowledge•Bootstrapping
Classification:•Robust, tolerant of noisy data
Benefits of Semantic Web:•Better Metadata•Semantic linking & intelligent spidering