Practical Issues for Automated Categorization of Web Sites. John M. Pierre, jpierre@metacode.com, Metacode Technologies, Inc., 139 Townsend Street, San Francisco, CA 94107 (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)


Practical Issues for Automated Categorization of Web Sites

John M. Pierre
jpierre@metacode.com

Metacode Technologies, Inc.
139 Townsend Street
San Francisco, CA 94107

(Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)


Outline

Project overview

Web content
• Automated Categorization
• Feature Selection
• Metadata

Experimental Setup
• Data
• Targeted Spidering
• System Architecture

Results

Conclusions


Project Overview

Specific:
• Categorize large number of domain names by industry category
• NAICS classification scheme
• ~30,000 domain names for testing (.com)
• Text categorization approach

General:
• Domain-specific classification
• Metadata
• Targeted spidering
• Feature selection
• Classifier training


Web Content: Automated Categorization

Challenges:
• Vast (over 1 billion pages)
• Heterogeneous (content, formats, not just HTML)
• Dynamic (growing, changing)

Benefits:
• Good source of information
• Accessible!
• Machine readable (vs. machine understandable)
• Semi-structured

Tools:
• Classification
• Automated classification
• Text Categorization/Machine Learning
• Intelligent agents

Related Work

Manual:
• Yahoo!
• Open Directory Project
• Looksmart

Automatic:
• Northern Light
• Thunderstone/Texis
• Inktomi

Other:
• EU Project DESIRE II
• Pharos
• Attardi, Sebastiani et al.
• L. Page et al.
• McCallum et al.



Web Content: Feature Selection

Text features should be (D. Lewis):
• Relatively few in number
• Moderate in frequency of assignment
• Low in redundancy
• Low in noise
• Related in semantic scope to the classes to be assigned
• Relatively unambiguous in meaning

Preliminary Experiment
• 1125 web domains
• SEC+NAICS training set

Text source        Precision  Recall  micro F1
Body               0.47       0.34    0.39
Body + Metatags    0.55       0.34    0.42
Metatags           0.64       0.39    0.48

Use metadata if possible, use body text as last resort!
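The rule of thumb above can be sketched in code. The following is a minimal illustration only, not the system described in these slides: the class `PageTextExtractor` and the function `select_text` are hypothetical names, and the policy (use meta description and keywords when either is present, otherwise fall back to body text) is an assumption drawn from the slide's recommendation.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects title, meta description/keywords, and body text separately."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.meta = {"description": "", "keywords": ""}
        self._in_title = False
        self._in_body = False
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in self.meta:
                self.meta[name] = (a.get("content") or "").strip()
        elif tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.body_parts.append(text)


def select_text(html):
    """Return (source, text): metatag text if any exists, else body text.

    Assumed policy, following "use metadata if possible".
    """
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    meta_text = " ".join(
        t for t in (parser.meta["description"], parser.meta["keywords"]) if t
    )
    if meta_text:
        return "metatags", meta_text
    return "body", " ".join(parser.body_parts)
```

The title is collected but not used in the selection; whether to treat it as metadata is a design choice the slides leave open.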



Web Content: Metadata

[Figure: "Web Page Content". Bar chart showing the percentage of pages (y-axis, 0% to 90%) by number of words (x-axis: 0, 1 to 10, 11 to 50, 51 or more) for four page elements: Title, Meta-Description, Meta-Keywords, Body.]


Experimental Setup: Targeted Spidering

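[Figure: flowchart of the targeted spidering process. Recoverable node labels: "HTTP Get" on the domain name; "live?"; "Try www."; "Frames?"; "<a href=?" with the link keywords prod, service, about, info, press, news; "'Query' Pages"; "Metatags?"; "Send Query"; "Use <body>". Decision nodes have Yes/No branches.]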
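One possible reading of the spidering flow is sketched below. The edges of the original flowchart are not fully recoverable, so the control flow here is an assumption: try the bare domain, then the `www.` host, follow frame sources, and keep only links whose URL or anchor text contains one of the keywords listed on the slide. All names (`LinkCollector`, `candidate_urls`, `target_links`, `spider`, `max_pages`) are hypothetical, and `fetch` is a caller-supplied function so that the sketch makes no network calls itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Link keywords taken from the slide.
TARGET_WORDS = ("prod", "service", "about", "info", "press", "news")


class LinkCollector(HTMLParser):
    """Collects [href, anchor text] pairs and <frame> sources."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.frames = []
        self._open = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self._open = [a["href"], ""]
            self.links.append(self._open)
        elif tag == "frame" and a.get("src"):
            self.frames.append(a["src"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data


def candidate_urls(domain):
    """Bare domain first, then the www. host (assumed order)."""
    return ["http://" + domain + "/", "http://www." + domain + "/"]


def target_links(base_url, html):
    """Frame sources plus links that mention a target keyword."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    picked = [urljoin(base_url, src) for src in collector.frames]
    for href, text in collector.links:
        haystack = (href + " " + text).lower()
        if any(word in haystack for word in TARGET_WORDS):
            picked.append(urljoin(base_url, href))
    return list(dict.fromkeys(picked))  # de-duplicate, keep order


def spider(domain, fetch, max_pages=5):
    """fetch(url) returns an HTML string, or None if the URL is not live."""
    for url in candidate_urls(domain):
        home = fetch(url)
        if home is not None:
            break
    else:
        return {}  # neither host responded
    pages = {url: home}
    for link in target_links(url, home):
        if len(pages) >= max_pages:
            break
        if link not in pages:
            html = fetch(link)
            if html is not None:
                pages[link] = html
    return pages
```

Capping the number of pages per domain reflects the "efficient use of resources" goal; the cap of 5 is an arbitrary illustrative value.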


Experimental Setup: Data

Classification scheme: NAICS

11 Agriculture, Forestry, Fishing and Hunting
21 Mining
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific and Technical Services
55 Management of Companies and Enterprises
56 Admin. Support, Waste Mgmt and Remediation Srvcs
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment & Recreation
72 Accommodation and Food Services
81 Other Services (except 92)
92 Public Administration
99 Unclassified Establishments
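Because the categories above are two-digit NAICS sectors, a longer NAICS code can be reduced to its sector by its first two digits, with the ranges 31-33, 44-45, and 48-49 collapsed to a single label. The sketch below shows one way to do that lookup; the function name `naics_sector` is hypothetical, the table mirrors only the sectors listed on the slide, and the slides do not say how the original system performed this reduction.

```python
# Sector table as listed on the slide (keys are the slide's labels).
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental and Leasing",
    "54": "Professional, Scientific and Technical Services",
    "55": "Management of Companies and Enterprises",
    "56": "Admin. Support, Waste Mgmt and Remediation Srvcs",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment & Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other Services (except 92)",
    "92": "Public Administration",
    "99": "Unclassified Establishments",
}

# Two-digit prefixes that belong to a multi-prefix sector.
RANGE_KEYS = {
    "31": "31-33", "32": "31-33", "33": "31-33",
    "44": "44-45", "45": "44-45",
    "48": "48-49", "49": "48-49",
}


def naics_sector(code):
    """Map a NAICS code of any length to (sector label, sector name).

    Returns None when the prefix is not in the slide's list.
    """
    prefix = str(code).strip()[:2]
    key = RANGE_KEYS.get(prefix, prefix)
    if key not in NAICS_SECTORS:
        return None
    return key, NAICS_SECTORS[key]
```

For example, `naics_sector("511210")` yields the Information sector, and any code starting with 31, 32, or 33 yields Manufacturing.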

Test Data

• ~30,000 domain names (SIC)
• ~13,500 pre-classified/content

Training Data

"SEC-NAICS":
• 1504 SEC 10-K filings (SIC)
• 426 NAICS labels/descriptions

"Web pages":
• 3618 pre-classified domains

Crosswalk
• SIC <-> NAICS


Experimental Setup: System Architecture

[Figure: system architecture diagram. Recoverable components: Domain Names, The Web, Spider, IR Engine (with the "SEC-NAICS" and "Web pages" training sets), Decision. Recoverable edge labels: "Text Query" and "Matching documents". Example output: Foo.com: 11, 21, 23.]
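The query, match, and decide stages in the diagram can be illustrated with a small nearest-neighbour sketch: the spidered text is used as a query against an index of labelled training documents, and the categories of the best matches are voted on. This is an assumption about how the pieces fit together, not the actual IR engine or decision rule from the slides. `TinyIndex`, `decide`, the TF-IDF weighting, `k`, and `threshold` are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class TinyIndex:
    """In-memory TF-IDF index over labelled training documents."""

    def __init__(self, docs):
        # docs: list of (text, [category, ...])
        self.labels = [cats for _, cats in docs]
        counts = [Counter(tokenize(text)) for text, _ in docs]
        df = Counter(term for c in counts for term in c)
        n = len(docs)
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        """Unit-length TF-IDF vector; terms unseen in training are dropped."""
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, k=3):
        """Return up to k (cosine score, doc index) pairs, best first."""
        q = self._weigh(Counter(tokenize(query)))
        scored = []
        for i, doc in enumerate(self.vectors):
            score = sum(w * doc.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, i))
        scored.sort(reverse=True)
        return scored[:k]


def decide(index, query, k=3, threshold=0.5):
    """Sum match scores per category; keep those near the best score."""
    votes = defaultdict(float)
    for score, i in index.search(query, k):
        for cat in index.labels[i]:
            votes[cat] += score
    if not votes:
        return []
    best = max(votes.values())
    return sorted(c for c, v in votes.items() if v >= threshold * best)
```

Returning every category within a fraction of the best score allows multiple assignments per domain, as in the diagram's example output of several codes for one site.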


Results

Training set   micro P  micro R  micro F1  macro P  macro R  macro F1
SEC-NAICS      0.66     0.35     0.45      0.23     0.18     0.09
Web pages      0.71     0.75     0.73      0.7      0.37     0.4

P = Precision = # correctly assigned / # assigned
R = Recall = # correctly assigned / # total correct

F1 = 2 P R / (P + R)

micro-averaged = computed over all categories
macro-averaged = computed per category, then averaged
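The definitions above translate directly into code. The sketch below computes micro- and macro-averaged precision, recall, and F1 for multi-label assignments; the function names `prf` and `evaluate` and the dictionary input format are assumptions made for illustration, and items absent from `truth` are ignored.

```python
from collections import Counter


def prf(tp, assigned, correct):
    """Precision, recall, F1 from raw counts (0.0 when undefined)."""
    p = tp / assigned if assigned else 0.0
    r = tp / correct if correct else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(predicted, truth):
    """predicted, truth: dict mapping item -> set of categories."""
    tp, assigned, correct = Counter(), Counter(), Counter()
    for item, gold in truth.items():
        pred = predicted.get(item, set())
        for c in pred:
            assigned[c] += 1
        for c in gold:
            correct[c] += 1
        for c in pred & gold:
            tp[c] += 1

    # Micro: pool the counts over all categories, then compute once.
    micro = prf(sum(tp.values()), sum(assigned.values()), sum(correct.values()))

    # Macro: compute per category, then take the unweighted mean.
    cats = sorted(set(assigned) | set(correct))
    per_cat = [prf(tp[c], assigned[c], correct[c]) for c in cats]
    if per_cat:
        macro = tuple(sum(m[i] for m in per_cat) / len(per_cat) for i in range(3))
    else:
        macro = (0.0, 0.0, 0.0)
    return {"micro": micro, "macro": macro}
```

Because macro F1 here is the mean of per-category F1 values, it need not equal the F1 of macro P and macro R, which may be why the SEC-NAICS macro F1 in the table is lower than its macro P and R would suggest.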


Conclusions

Domain-Specific Classification
• Knowledge Gathering
    • Use of specialized knowledge
• Targeted Spidering
    • Efficient use of resources
    • Extract key features, Metadata
• Training
    • Prior knowledge
    • Bootstrapping
• Classification
    • Robust, tolerant of noisy data

Benefits of Semantic Web
• Better Metadata
• Semantic linking & intelligent spidering