WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
Web Pages
Feature Extracti
on
Fused Similarity
Content Features Structure Features Links
Similarity Represent
ation
Similarity Represent
ation
Similarity Represent
ation
Content- based Similarities
Structure- based Similarities
Neighborhood- based Similarities
Prediction Model
Zenglin Xu, Irwin King and Michael R. LyuDepartment of Computer Science and Engineering
The Chinese University of Hong Kong {zlxu, king, lyu}@cse.cuhk.edu.hk
The Chinese University of Hong KongThe Chinese University of Hong Kong
• For web page classification, there are many available data sources, such as the text, the title, the meta data, the anchor text, etc. • Simply putting them together would not greatly enhance the classification performance.• Different dimensions and types of data sources can be represented into a common format of kernel matrix.• A kernel learning approach is thus proposed to integrate multiple data sources
• A systematic way of integrating multiple data sources.
• Better classification accuracy.
1 2
•Dataset: DMOZDataset: DMOZ• AT: Anchor TextAT: Anchor Text• LT: Link TextLT: Link Text• MT: MT: Meta DataMeta Data
• TI: TitleTI: Title• PT: Plain TextPT: Plain Text• UW: Universally Weighted sourcesUW: Universally Weighted sources• KC: sources by Kernel CombinationKC: sources by Kernel Combination• Mi -F1: Micro-F1Mi -F1: Micro-F1• Ma-F1: Macro-F1Ma-F1: Macro-F1
3
4
The Chinese University of Hong Kong
• 1. 1. Feature Extraction.Feature Extraction.• 2. 2. Similarity RepresentationSimilarity Representation. Each data source is . Each data source is represented as a kernel matrix (Ki)represented as a kernel matrix (Ki)• 3. 3. Similarity Combination.Similarity Combination.
• 4. 4. Classification.Classification.• Substitute K into the dual SVMSubstitute K into the dual SVM
• We have the following QCQP problem:We have the following QCQP problem:
where where ααis the parameter of dual SVMs,is the parameter of dual SVMs,δδ is a is a constant and t is the trace vector.constant and t is the trace vector.