WWW 2007, May 8–12, 2007, Banff, Alberta, Canada


[Figure: framework overview. Web Pages → Feature Extraction (Content Features, Structure Features, Links) → Similarity Representation (Content-based, Structure-based, and Neighborhood-based Similarities) → Fused Similarity → Prediction Model]

Zenglin Xu, Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
{zlxu, king, lyu}@cse.cuhk.edu.hk

• For web page classification, many data sources are available, such as the text, the title, the meta data, and the anchor text.
• Simply putting them together would not greatly enhance the classification performance.
• Data sources of different dimensions and types can be represented in a common format: a kernel matrix.
• A kernel learning approach is therefore proposed to integrate multiple data sources.

• A systematic way of integrating multiple data sources.

• Better classification accuracy.
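To illustrate how sources of different dimensions end up in a common kernel-matrix format, here is a minimal sketch. It assumes a cosine similarity over term counts as the kernel, which the poster does not specify; the function name `cosine_kernel` and the toy pages are hypothetical.

```python
import math
from collections import Counter

def cosine_kernel(docs):
    """Return the n x n cosine-similarity matrix for a list of token lists.

    Assumption: cosine similarity on raw term counts stands in for
    whatever kernel the authors actually used for each source.
    """
    counts = [Counter(tokens) for tokens in docs]
    norms = [math.sqrt(sum(v * v for v in c.values())) for c in counts]
    n = len(docs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # empty document: leave the similarity at 0
            dot = sum(v * counts[j][t] for t, v in counts[i].items())
            K[i][j] = dot / (norms[i] * norms[j])
    return K

# Hypothetical toy data: three pages, two sources with different vocabularies.
titles = [["python", "tutorial"], ["python", "docs"], ["cooking", "recipes"]]
anchors = [
    ["learn", "python", "programming", "guide"],
    ["python", "reference", "guide"],
    ["easy", "dinner", "recipes"],
]

K_title = cosine_kernel(titles)    # 3 x 3, regardless of the title vocabulary size
K_anchor = cosine_kernel(anchors)  # 3 x 3, regardless of the anchor vocabulary size
```

Both matrices are indexed by page pairs rather than by features, so they can be combined entry by entry even though the underlying vocabularies differ.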


• Dataset: DMOZ
• AT: Anchor Text
• LT: Link Text
• MT: Meta Data
• TI: Title
• PT: Plain Text
• UW: Universally Weighted sources
• KC: sources by Kernel Combination
• Mi-F1: Micro-F1
• Ma-F1: Macro-F1
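The two evaluation measures in the legend can be sketched as follows, using the standard definitions: Micro-F1 pools true positives, false positives, and false negatives over all classes, while Macro-F1 averages the per-class F1 scores. The function names and the toy labels are hypothetical and are not taken from the poster's experiments.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tot_tp = tot_fp = tot_fn = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        per_class.append(f1(tp, fp, fn))
        tot_tp += tp
        tot_fp += fp
        tot_fn += fn
    micro = f1(tot_tp, tot_fp, tot_fn)      # pooled counts
    macro = sum(per_class) / len(per_class)  # unweighted class average
    return micro, macro

# Hypothetical toy labels for two categories.
micro, macro = micro_macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Macro-F1 gives every category equal weight, so it is more sensitive to performance on small categories than Micro-F1.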



• 1. Feature Extraction.
• 2. Similarity Representation. Each data source is represented as a kernel matrix (K_i).
• 3. Similarity Combination.
• 4. Classification.
• Substitute K into the dual SVM.
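Steps 3 and 4 can be sketched as below. The sketch assumes that the fused kernel is a weighted sum of the per-source kernels, K = Σ μ_i K_i, which the poster does not state explicitly, and it only evaluates the standard dual SVM objective for a given α rather than solving for it. The names `combine_kernels`, `dual_svm_objective`, and the toy matrices are hypothetical.

```python
def combine_kernels(kernels, weights):
    """Return K = sum_i weights[i] * kernels[i] (assumed linear fusion)."""
    n = len(kernels[0])
    K = [[0.0] * n for _ in range(n)]
    for mu, Ki in zip(weights, kernels):
        for r in range(n):
            for c in range(n):
                K[r][c] += mu * Ki[r][c]
    return K

def dual_svm_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

# Hypothetical 2 x 2 kernels for two sources over the same two pages.
K1 = [[1.0, 0.5], [0.5, 1.0]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
K = combine_kernels([K1, K2], [0.5, 0.5])  # equal weights as a simple baseline

y = [1, -1]
alpha = [1.0, 1.0]  # satisfies the dual constraint sum_i alpha_i * y_i = 0
W = dual_svm_objective(alpha, y, K)
```

With equal weights every source contributes the same amount; a kernel learning approach instead chooses the weights jointly with α.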

• We have the following QCQP problem:

where α is the parameter of the dual SVM, δ is a constant, and t is the trace vector.
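For orientation only, the QCQP commonly stated in the kernel learning literature for combining m kernels is sketched below. This is not the poster's own formula and may differ from it in detail. The mapping of symbols is an assumption: δ is taken as the bound on the trace of the combined kernel, t_i as the trace of K_i, s as an auxiliary scalar, e as the all-ones vector, y as the label vector, and C as the usual SVM box constraint.

```latex
\begin{aligned}
\max_{\alpha,\, s} \quad & 2\,\alpha^{\top} e - \delta\, s \\
\text{s.t.} \quad & s\, t_i \ge \alpha^{\top} \operatorname{diag}(y)\, K_i \,\operatorname{diag}(y)\, \alpha, \quad i = 1, \dots, m, \\
& \alpha^{\top} y = 0, \qquad 0 \le \alpha \le C .
\end{aligned}
```

The objective is linear and each constraint is at most quadratic in α, which is what makes the problem a QCQP.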

