WWW 200 7 , May 8 – 12 , 200 7 , Banff , Alberta, Canada

Preview:

DESCRIPTION

The Chinese University of Hong Kong. Web Page Classification with Heterogeneous Data Fusion. Zenglin Xu, Irwin King and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong { zlxu, king, lyu }@cse.cuhk.edu.hk. 1. Motivations. 2. Contributions. - PowerPoint PPT Presentation

Citation preview

WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.

Web Pages

Feature Extracti

on

Fused Similarity

Content Features Structure Features Links

Similarity Represent

ation

Similarity Represent

ation

Similarity Represent

ation

Content- based Similarities

Structure- based Similarities

Neighborhood- based Similarities

Prediction Model

Zenglin Xu, Irwin King and Michael R. LyuDepartment of Computer Science and Engineering

The Chinese University of Hong Kong {zlxu, king, lyu}@cse.cuhk.edu.hk

The Chinese University of Hong KongThe Chinese University of Hong Kong

• For web page classification, there are many available data sources, such as the text, the title, the meta data, the anchor text, etc. • Simply putting them together would not greatly enhance the classification performance.• Different dimensions and types of data sources can be represented into a common format of kernel matrix.• A kernel learning approach is thus proposed to integrate multiple data sources

• A systematic way of integrating multiple data sources.

• Better classification accuracy.

1 2

•Dataset: DMOZDataset: DMOZ• AT: Anchor TextAT: Anchor Text• LT: Link TextLT: Link Text• MT: MT: Meta DataMeta Data

• TI: TitleTI: Title• PT: Plain TextPT: Plain Text• UW: Universally Weighted sourcesUW: Universally Weighted sources• KC: sources by Kernel CombinationKC: sources by Kernel Combination• Mi -F1: Micro-F1Mi -F1: Micro-F1• Ma-F1: Macro-F1Ma-F1: Macro-F1

3

4

The Chinese University of Hong Kong

• 1. 1. Feature Extraction.Feature Extraction.• 2. 2. Similarity RepresentationSimilarity Representation. Each data source is . Each data source is represented as a kernel matrix (Ki)represented as a kernel matrix (Ki)• 3. 3. Similarity Combination.Similarity Combination.

• 4. 4. Classification.Classification.• Substitute K into the dual SVMSubstitute K into the dual SVM

• We have the following QCQP problem:We have the following QCQP problem:

where where ααis the parameter of dual SVMs,is the parameter of dual SVMs,δδ is a is a constant and t is the trace vector.constant and t is the trace vector.

Recommended