
Page 1: Final project: Web Page Classification


Final project: Web Page Classification

By: Xiaodong Wang, Yanhua Wang, Haitang Wang

University of Cincinnati

Page 2: Final project: Web Page Classification

Content

Problem formulation
Algorithms
Implementation
Results
Discussion and future work

Page 3: Final project: Web Page Classification

Problem

The World Wide Web can be clustered into different subsets and labeled accordingly; search engine users can then restrict their keyword searches to these specific subsets.

Clustering of web pages can also be used to post-process search results.

Efficient clustering of web pages is important:
Clustering accuracy: feature selection and web exploitation
Fast algorithms

Page 4: Final project: Web Page Classification

Web clustering

Clustering is done based on similarity between web pages

Clustering can be done in supervised and unsupervised mode

In our project, we focus on unsupervised classification (no sample category labels are provided) and compare the efficiency of different algorithms and feature sets for clustering web pages.

Page 5: Final project: Web Page Classification

Project overview

In this project, a platform for unsupervised clustering is implemented:

Vector space model: the TFIDF model (term frequency-inverse document frequency) is used
Features: text, meta information, links, and linked content can be configured as features
Similarity measures: cosine similarity and Euclidean similarity
Clustering algorithms: K-means and HAC (Hierarchical Agglomerative Clustering)

For a given link list, clustering accuracy and algorithm efficiency are compared.

It is implemented in Java and can be extended easily.

Page 6: Final project: Web Page Classification

User interface

Page 7: Final project: Web Page Classification

Major functionalities

Web page preprocessing:
Downloading
Parsing: link, meta, and text extraction
Filtering of meaningless words: stop word removal and stemming (a sketch follows below)
Putting terms into a term pool

Clustering
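A minimal sketch of this preprocessing step on plain page text. The stop-word list and the crude suffix-stripping stand-in for a real stemmer (e.g. Porter) are illustrative placeholders, not the project's actual code.

```java
import java.util.*;

// Hypothetical sketch of the preprocessing step: tokenize page text,
// remove stop words, "stem" the survivors, and collect them into a term pool.
public class Preprocessor {
    // Tiny placeholder stop-word list; the real system would use a full list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in", "is"));

    // Very rough stand-in for a real stemmer (e.g. Porter): strip common suffixes.
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3)   return word.substring(0, word.length() - 1);
        return word;
    }

    // Tokenize, filter stop words, stem, and return the term pool for one page.
    static List<String> extractTerms(String pageText) {
        List<String> terms = new ArrayList<>();
        for (String token : pageText.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extractTerms(
                "Clustering of web pages is done based on the similarity between pages"));
    }
}
```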

Page 8: Final project: Web Page Classification

Feature selection

First, a naïve approach borrowed from the ranking of query results is used:
All the unique terms (after text extraction and filtering) form the feature terms. That is, if there are 1000 terms in total, the vector dimension will be 1000.
This approach works for small sets of links.

Then we use all the unique terms appearing as meta information in web pages as feature terms (a sketch of meta extraction follows below):
The dimension can be reduced dramatically. For 30 links, the dimension is 2384 with the naïve method but is reduced to 408 when using meta information.

Hyperlink exploitation:
Links in a web page can also be used as features.
The content or meta information of linked web pages can be treated as local content.
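An illustrative sketch of pulling meta information out of raw HTML with a regular expression. The tag names (keywords, description) and the regex itself are assumptions made for illustration; a production parser would use a proper HTML parser rather than regular expressions.

```java
import java.util.*;
import java.util.regex.*;

// Sketch: collect the terms found in <meta name="keywords"> and
// <meta name="description"> tags as candidate feature terms.
public class MetaFeatureExtractor {
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=[\"'](keywords|description)[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    static Set<String> metaTerms(String html) {
        Set<String> terms = new LinkedHashSet<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            for (String t : m.group(2).toLowerCase().split("[^a-z]+")) {
                if (!t.isEmpty()) terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"keywords\" content=\"web, clustering, tfidf\">";
        System.out.println(metaTerms(html));   // [web, clustering, tfidf]
    }
}
```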

Page 9: Final project: Web Page Classification

TFIDF-based vector space model

TFIDF(i,j) = TF(i,j) * IDF(i)

TF(i,j): the number of times word i occurs in document j
DF(i): the number of documents in which word i occurs at least once
IDF(i) can be calculated from the document frequency:

IDF(i) = log(|D| / DF(i))

where |D| is the total number of documents.
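A minimal sketch of this TFIDF weighting, assuming each document has already been reduced to a list of terms by the preprocessing step. The helper names are ours, not the project's.

```java
import java.util.*;

// Compute sparse TFIDF vectors: TF(i,j) is the raw count of term i in
// document j, and IDF(i) = log(|D| / DF(i)).
public class Tfidf {
    // documents: one term list per document; returns one term -> weight map per document
    static List<Map<String, Double>> weight(List<List<String>> documents) {
        int n = documents.size();
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : documents) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);            // DF(i): documents containing term i
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : documents) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);   // TF(i,j)
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);               // TFIDF(i,j)
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("web", "page", "clustering"),
                Arrays.asList("web", "search", "engine"));
        System.out.println(weight(docs));
    }
}
```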

Page 10: Final project: Web Page Classification

Similarity measure

Euclidean similarity: given the vector space defined by all terms, compute the Euclidean distance between each pair of documents and then take its reciprocal.

Cosine similarity = numerator / denominator
Numerator: the inner product of the two document vectors
Denominator: the product of the Euclidean lengths of the two document vectors
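A sketch of both measures on the sparse term-to-weight maps produced above. The +1 added before taking the reciprocal of the Euclidean distance is our own guard against division by zero for identical vectors, not something stated in the slides.

```java
import java.util.*;

// Cosine and (reciprocal-distance) Euclidean similarity on sparse TFIDF vectors.
public class Similarity {
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double s = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            s += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return s;
    }

    static double length(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // inner product divided by the product of the two vector lengths
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        return dot(a, b) / (length(a) * length(b));
    }

    // reciprocal of the Euclidean distance (+1 to avoid division by zero)
    static double euclideanSimilarity(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0;
        for (String t : terms) {
            double d = a.getOrDefault(t, 0.0) - b.getOrDefault(t, 0.0);
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    public static void main(String[] args) {
        Map<String, Double> a = Map.of("web", 1.0, "page", 2.0);
        Map<String, Double> b = Map.of("web", 1.0, "search", 1.0);
        System.out.println(cosine(a, b) + " " + euclideanSimilarity(a, b));
    }
}
```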

Page 11: Final project: Web Page Classification

Cluster algorithms: Hierarchical Agglomerative Clustering (HAC)

It starts with the individual documents as clusters and successively combines the most similar ones into groups within which inter-document similarity is high.
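A sketch of that agglomerative loop: every document starts as its own cluster and the two most similar clusters are merged repeatedly until the requested number of clusters remains. Single-link similarity (the best pairwise similarity between members) is an assumption here; the slides do not say which linkage the project used.

```java
import java.util.*;
import java.util.function.BiFunction;

// Hierarchical agglomerative clustering over document indices 0..nDocs-1,
// given a pairwise document similarity function.
public class Hac {
    static List<List<Integer>> cluster(int nDocs, int k,
                                       BiFunction<Integer, Integer, Double> sim) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < nDocs; i++) clusters.add(new ArrayList<>(List.of(i)));

        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = linkSim(clusters.get(a), clusters.get(b), sim);
                    if (s > best) { best = s; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));  // merge the closest pair
        }
        return clusters;
    }

    // single-link: similarity of the most similar pair of members
    static double linkSim(List<Integer> a, List<Integer> b,
                          BiFunction<Integer, Integer, Double> sim) {
        double best = Double.NEGATIVE_INFINITY;
        for (int i : a) for (int j : b) best = Math.max(best, sim.apply(i, j));
        return best;
    }

    public static void main(String[] args) {
        double[][] sim = {{1.0, 0.9, 0.1}, {0.9, 1.0, 0.2}, {0.1, 0.2, 1.0}};
        System.out.println(cluster(3, 2, (i, j) -> sim[i][j]));  // e.g. [[0, 1], [2]]
    }
}
```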

Page 12: Final project: Web Page Classification

Cluster algorithms: K-means

K-means clustering: a nonhierarchical method
The final required number of clusters is chosen in advance
Each component in the population is examined and assigned to one of the clusters depending on the minimum distance
The centroid's position is recalculated every time a component is added to a cluster, and this continues until all the components are grouped into the final required number of clusters
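A sketch of K-means on dense document vectors. The slides describe an incremental update (the centroid is recomputed each time a component is added); for brevity this sketch uses the common batch variant, and the initial centroids are simply taken to be the first k documents, which is an assumption rather than the project's stated initialization.

```java
import java.util.Arrays;

// Batch K-means: assign each document to the nearest centroid, then
// recompute centroids as cluster means, until the assignment stabilizes.
public class KMeans {
    static int[] cluster(double[][] docs, int k, int maxIter) {
        int n = docs.length, dim = docs[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = docs[c].clone();   // naive initialization
        int[] assign = new int[n];

        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // assignment step: nearest centroid by squared Euclidean distance
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int t = 0; t < dim; t++) {
                        double diff = docs[i][t] - centroids[c][t];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;
            // update step: recompute centroids as cluster means
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int t = 0; t < dim; t++) sums[assign[i]][t] += docs[i][t];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;                 // keep old centroid if cluster emptied
                for (int t = 0; t < dim; t++) centroids[c][t] = sums[c][t] / counts[c];
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.toString(cluster(docs, 2, 20)));   // e.g. [0, 0, 1, 1]
    }
}
```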

Page 13: Final project: Web Page Classification

Complexity Analysis

HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²).

In K-means, in each round the n documents have to be compared against the k centroids, which takes O(kn) time and is more efficient than the O(n²) of HAC.

In our experiments, however, we found that the clustering results of HAC make more sense than those of K-means.

Page 14: Final project: Web Page Classification

Conclusion

The unique features of web pages (links, meta information) should be exploited.

HAC is better than K-means in clustering accuracy.

Correct and robust parsing of web pages is important for web page clustering; our parser does not work well on all of the web pages tested.

The overall performance of our implementation is not yet satisfactory:
The dimension is still large, which raises the space requirement
Parsing accuracy is limited, and some pages do not have meta information