1
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. W eb Pages Feature Extracti on Fused Sim ilarity ContentFeatures Structure Features Links Similarity Represent ation Similarity Represent ation Similarity Represent ation Content-based Similarities Structure-based Similarities Neighborhood-based Similarities Prediction M odel Zenglin Xu, Irwin King and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong {zlxu, king, lyu}@cse.cuhk.edu.hk The Chinese University of Hong The Chinese University of Hong Kong Kong • For web page classification, there are many a vailable data sources, such as the text, the title, the meta data, the anchor text, etc. • Simply putting them together would not greatl y enhance the classification performance. • Different dimensions and types of data sources can be represented into a common format of kernel matrix. • A kernel learning approach is thus proposed to integrate multiple data sources • A systematic way of integrati ng multiple data sources. • Better classification accurac y. 1 2 Dataset: DMOZ Dataset: DMOZ AT: Anchor Text AT: Anchor Text LT: Link Text LT: Link Text MT: MT: Meta Data Meta Data TI: Title TI: Title PT: Plain Text PT: Plain Text UW: Universally Weighted UW: Universally Weighted sources sources KC: sources by Kernel KC: sources by Kernel Combination Combination Mi -F1: Micro-F1 Mi -F1: Micro-F1 Ma-F1: Macro-F1 Ma-F1: Macro-F1 3 4 The Chinese University of Hong Kong 1. 1. Feature Extraction. Feature Extraction. 2. 2. Similarity Representation Similarity Representation. Each data . Each data source is represented as a kernel matrix source is represented as a kernel matrix (Ki) (Ki) 3. 3. Similarity Combination. Similarity Combination. 4. 4. Classification. Classification. Substitute K into the dual SVM Substitute K into the dual SVM We have the following QCQP problem: We have the following QCQP problem: where where αis the parameter of dual SVMs, is the parameter of dual SVMs,δ is a constant and t is the trace is a constant and t is the trace vector. vector.

WWW 200 7 , May 8 – 12 , 200 7 , Banff , Alberta, Canada

  • Upload
    zoltin

  • View
    39

  • Download
    4

Embed Size (px)

DESCRIPTION

The Chinese University of Hong Kong. Web Page Classification with Heterogeneous Data Fusion. Zenglin Xu, Irwin King and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong { zlxu, king, lyu }@cse.cuhk.edu.hk. 1. Motivations. 2. Contributions. - PowerPoint PPT Presentation

Citation preview

Page 1: WWW 200 7 ,  May  8 – 12 , 200 7 ,  Banff ,  Alberta, Canada

WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.

Web Pages

Feature Extracti

on

Fused Similarity

Content Features Structure Features Links

Similarity Represent

ation

Similarity Represent

ation

Similarity Represent

ation

Content- based Similarities

Structure- based Similarities

Neighborhood- based Similarities

Prediction Model

Zenglin Xu, Irwin King and Michael R. LyuDepartment of Computer Science and Engineering

The Chinese University of Hong Kong {zlxu, king, lyu}@cse.cuhk.edu.hk

The Chinese University of Hong KongThe Chinese University of Hong Kong

• For web page classification, there are many available data sources, such as the text, the title, the meta data, the anchor text, etc. • Simply putting them together would not greatly enhance the classification performance.• Different dimensions and types of data sources can be represented into a common format of kernel matrix.• A kernel learning approach is thus proposed to integrate multiple data sources

• A systematic way of integrating multiple data sources.

• Better classification accuracy.

1 2

•Dataset: DMOZDataset: DMOZ• AT: Anchor TextAT: Anchor Text• LT: Link TextLT: Link Text• MT: MT: Meta DataMeta Data

• TI: TitleTI: Title• PT: Plain TextPT: Plain Text• UW: Universally Weighted sourcesUW: Universally Weighted sources• KC: sources by Kernel CombinationKC: sources by Kernel Combination• Mi -F1: Micro-F1Mi -F1: Micro-F1• Ma-F1: Macro-F1Ma-F1: Macro-F1

3

4

The Chinese University of Hong Kong

• 1. 1. Feature Extraction.Feature Extraction.• 2. 2. Similarity RepresentationSimilarity Representation. Each data source is . Each data source is represented as a kernel matrix (Ki)represented as a kernel matrix (Ki)• 3. 3. Similarity Combination.Similarity Combination.

• 4. 4. Classification.Classification.• Substitute K into the dual SVMSubstitute K into the dual SVM

• We have the following QCQP problem:We have the following QCQP problem:

where where ααis the parameter of dual SVMs,is the parameter of dual SVMs,δδ is a is a constant and t is the trace vector.constant and t is the trace vector.