ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS

Beibei Li, Shuting Xu, Jun Zhang

Department of Computer Science

University of Kentucky

ACMSE’07

INTRODUCTION

blogs: highly opinionated personal online commentary, including hyperlinks to other resources

Technorati (July 2006)
    tracking more than 50 million blogs
    about 175,000 blogs were created daily
    the size of the blogosphere doubles every six months
    how many blog authors update their blogs regularly -> not clear

INTRODUCTION (CONT.)

analysis of the blogosphere in 2004
    more than two-thirds of public blogs are personal journals
    knowledge blogs (k-blogs) -> a mere 3 percent
due to the diverse backgrounds of the blog authors and readers, the blogosphere has hyper-accelerated the spread of information

BLOGS VS. WEB PAGES

major differences between blogs and standard web pages
    blogs are dated
    most blogs allow readers to place comments on each blog document
        this creates communication channels between the blog authors and the readers
    blog authors can place individual blogs into different categories according to some predefined categories
        the definitions of the categories may differ from author to author

BLOG DOCUMENTS

use the vector-space model to encode the blog web pages
    each blog page can be viewed as a column vector
    each word used can be considered as one row of the matrix
consider a blog page as three parts
    blog title
    blog body: the content of the blog page
    comments of the authors and/or the readers
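The encoding above (one column per blog page, one row per word) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `word_page_matrix` and the sample pages are hypothetical.

```python
from collections import Counter


def word_page_matrix(pages):
    """Build a word-by-page count matrix: one row per word, one column per page."""
    vocabulary = sorted({word for page in pages for word in page})
    counts = [Counter(page) for page in pages]
    # Counter returns 0 for words that do not occur in a page.
    matrix = [[count[word] for count in counts] for word in vocabulary]
    return vocabulary, matrix


# Hypothetical, already-tokenized pages (illustrative data only).
pages = [
    ["gun", "control", "law", "gun"],
    ["church", "sunday", "service"],
]
vocabulary, matrix = word_page_matrix(pages)
```

Each inner list of `matrix` is one word's row; reading the same index across all rows gives one page's column vector.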

A SAMPLE BLOG PAGE

HYPOTHESIS

hypothesis
    the use of title and comment words in the dataset will enhance the discrimination of the blog pages
    and result in more accurate clustering solutions
reason
    the words in the comments reflect the specific views, questions, and answers of the authors and the readers
    they may carry more weight in discriminating individual blog pages

DATA PREPARATION AND CLUSTERING

Data Preprocessing
    selected three categories of blog files
        gun control
        church
        Alzheimer’s disease
    downloaded from Windows Live Spaces by searching with the keywords
    each entry has at least one comment
    each category has 70 files, for a total of 210 blog files
    parsing: convert each file into 3 parts
    stemming
    delete stop words
    count the number of occurrences of each word
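The preprocessing steps listed above (split into three parts, stem, delete stop words, count occurrences) might look roughly like the sketch below. The stop-word list and the suffix-stripping `crude_stem` are simplified stand-ins chosen for illustration; the slides do not say which stemmer or stop-word list was used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(token) for token in tokens if token not in STOP_WORDS]
    return Counter(stems)


def preprocess_page(title, body, comments):
    """Return one word-count table for each of the three parts of a blog page."""
    return {
        "title": preprocess(title),
        "body": preprocess(body),
        "comments": preprocess(" ".join(comments)),
    }
```

Keeping the three count tables separate is what later allows the title, body, and comments to be weighted independently.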

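DATA PREPROCESSING (CONT.)

represent each document by three vectors (title, body, comments)
the vector for the whole document is a weighted sum of all three vectors:

    $v = w_t \, v_{title} + w_b \, v_{body} + w_c \, v_{comment}$

    $w_t$ : title weight
    $w_b$ : body weight
    $w_c$ : comment weight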
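As a small sketch of the weighted sum, the three per-part word-count tables can be combined into one document vector. The function name `combine_parts` and the sparse dictionary representation are assumptions made for this example; the weight values are illustrative and not taken from the experiments.

```python
def combine_parts(title, body, comments, wt, wb, wc):
    """Return wt*title + wb*body + wc*comments as a sparse word -> weight dict."""
    combined = {}
    for weight, part in ((wt, title), (wb, body), (wc, comments)):
        for word, count in part.items():
            combined[word] = combined.get(word, 0.0) + weight * count
    return combined


# Illustrative counts and weights only.
vector = combine_parts(
    title={"gun": 1},
    body={"gun": 2, "law": 1},
    comments={"law": 3},
    wt=2.0, wb=1.0, wc=0.5,
)
```

Setting one weight to zero and the others to one reproduces the single-part and two-part configurations compared in the experiments.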

DATA PREPROCESSING (CONT.)

the word-page matrix A is composed of a set of such document vectors: $A = (v_1, \ldots, v_m)$
    $v_{ij}$ is the weighted number of occurrences of word i in document $v_j$
to balance the influence of small and large documents
    scale each document vector $v_j$ so that its Euclidean norm equals 1
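The scaling step can be illustrated with a short sketch that normalizes each document vector to unit Euclidean norm and then assembles the word-page matrix column by column. The names `normalize` and `build_matrix` are hypothetical, and leaving an all-zero vector unchanged is a choice made here, not something the slides specify.

```python
import math


def normalize(vector):
    """Scale a sparse document vector so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(value * value for value in vector.values()))
    if norm == 0.0:
        return dict(vector)  # an empty or all-zero document is left unchanged
    return {word: value / norm for word, value in vector.items()}


def build_matrix(doc_vectors):
    """Return (vocabulary, A) where column j of A is the normalized document j."""
    vocabulary = sorted({word for vector in doc_vectors for word in vector})
    unit_vectors = [normalize(vector) for vector in doc_vectors]
    matrix = [[unit.get(word, 0.0) for unit in unit_vectors] for word in vocabulary]
    return vocabulary, matrix
```

After this step a long page and a short page with the same word proportions map to the same column, which is the balancing effect described above.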

FEATURE SELECTION

tf-idf
    TI is the mean tf-idf value of a term over all the documents
    use TI to measure the quality of a term
    the higher the TI value, the better the term ranks
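A minimal sketch of ranking terms by their mean tf-idf value follows. The exact tf-idf formula is not reproduced in the slides, so this example assumes one common variant: tf is the weighted count in the document and idf is log(m / df), where m is the number of documents and df is the number of documents containing the term. The function names are hypothetical.

```python
import math


def mean_tfidf(doc_vectors):
    """Return, for each term, the mean of tf * idf over all documents."""
    m = len(doc_vectors)
    doc_freq = {}
    for vector in doc_vectors:
        for word, value in vector.items():
            if value > 0:
                doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = {}
    for word, df in doc_freq.items():
        idf = math.log(m / df)  # assumed idf variant
        total = sum(vector.get(word, 0.0) * idf for vector in doc_vectors)
        scores[word] = total / m
    return scores


def select_features(doc_vectors, fraction):
    """Keep the given fraction of terms with the highest mean tf-idf."""
    scores = mean_tfidf(doc_vectors)
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    keep = max(1, int(fraction * len(ranked)))
    return ranked[:keep]
```

Under this variant a term that appears in every document gets idf 0 and therefore ranks last, which matches the intent of discarding non-discriminating words.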

CLUSTERING

k-means algorithm
    1. compute the Euclidean distance from each document to each cluster center; assign each document to the cluster with the smallest distance
    2. recompute each cluster center as the mean of its constituent documents
    3. repeat steps 1 and 2 until convergence is reached

CLUSTERING (CONT.)

criterion function for the convergence
    r : the iteration step
    Edist($v_i$, $c_j$) : the Euclidean distance from the document $v_i$ to a cluster center $c_j$
given a convergence criterion ε, the k-means algorithm stops when $|f_{r+1} - f_r| < \varepsilon$, where $f_r$ is the value of the criterion function at step r
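The three k-means steps and the stopping rule can be sketched as below. The criterion function itself is not reproduced in the slides, so this example assumes f is the sum of Euclidean distances from each document to its assigned cluster center. The initial centers are passed in by the caller, and an empty cluster keeps its previous center; both are choices made for this sketch.

```python
import math


def edist(v, c):
    """Euclidean distance between a document vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))


def kmeans(docs, centers, eps=1e-6, max_iter=100):
    """Run k-means on dense vectors; return (labels, centers)."""
    centers = [list(center) for center in centers]
    labels = []
    previous_f = None
    for _ in range(max_iter):
        # Step 1: assign each document to the nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: edist(v, centers[j]))
            for v in docs
        ]
        # Step 2: recompute each center as the mean of its documents.
        for j in range(len(centers)):
            members = [v for v, label in zip(docs, labels) if label == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(column) / len(members) for column in zip(*members)]
        # Assumed criterion: total distance of documents to their centers.
        f = sum(edist(v, centers[label]) for v, label in zip(docs, labels))
        # Step 3: stop when the criterion changes by less than eps.
        if previous_f is not None and abs(f - previous_f) < eps:
            break
        previous_f = f
    return labels, centers
```

For the blog data, `docs` would be the unit-norm columns of the word-page matrix and the number of centers would be three, one per category.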

CLUSTERING METRICS

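Entropy
    gauges the distribution of each class of documents within each cluster
    suppose there are q classes and the clustering algorithm returns k clusters
    the entropy E of a cluster $S_r$ of size $n_r$ is computed as

        $E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$

    $n_r^i$ is the number of documents in the ith class that are assigned to the rth cluster
    the entropy of the entire clustering solution, with n the total number of documents, is computed as:

        $Entropy = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$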
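A sketch of the entropy metric follows, assuming the commonly used normalized form in which a cluster's entropy is divided by log q and the overall value is the size-weighted average of the per-cluster entropies. Each cluster is given as the list of true class labels of its documents; the function names are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(class_labels, q):
    """Normalized entropy of one cluster, given the class label of each member."""
    n_r = len(class_labels)
    if q < 2 or n_r == 0:
        return 0.0
    total = 0.0
    for n_ri in Counter(class_labels).values():
        p = n_ri / n_r
        total += p * math.log(p)
    return -total / math.log(q)


def total_entropy(clusters, q):
    """Size-weighted average of the per-cluster entropies (assumes n > 0)."""
    n = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / n * cluster_entropy(cluster, q) for cluster in clusters)
```

With this normalization the value lies between 0 (every cluster contains a single class) and 1 (every cluster mixes the classes evenly), so lower is better.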

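CLUSTERING METRICS (CONT.)

Purity
    the purity of the cluster $S_r$ can be defined as

        $P(S_r) = \frac{1}{n_r} \max_i n_r^i$

    where $n_r^i$ is the number of documents of the ith class in cluster $S_r$
    the purity value of the entire clustering solution, with n the total number of documents, is computed as

        $Purity = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$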
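Purity can be sketched in the same style, assuming the usual definition: the purity of a cluster is the fraction of its documents that belong to its largest class, and the overall purity is the size-weighted average over clusters. The function names and sample labels are hypothetical.

```python
from collections import Counter


def cluster_purity(class_labels):
    """Fraction of a non-empty cluster that belongs to its largest class."""
    return max(Counter(class_labels).values()) / len(class_labels)


def total_purity(clusters):
    """Size-weighted average of the per-cluster purities (skips empty clusters)."""
    non_empty = [cluster for cluster in clusters if cluster]
    n = sum(len(cluster) for cluster in non_empty)
    return sum(len(cluster) / n * cluster_purity(cluster) for cluster in non_empty)


# Illustrative labels only.
example = [["gun", "gun", "church"], ["church"]]
```

Unlike entropy, higher purity is better, and a value of 1 means every cluster contains documents from only one class.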

EXPERIMENTAL RESULTS

influence of weight
    clustering is not very accurate if only one of the title, body, or comments is used
    clustering on the blog body alone is more accurate than on the title or comments alone
    using all three parts improves the accuracy considerably

EXPERIMENTAL RESULTS

Feature Selection
    use only the title and the body for clustering
        reducing the percentage of features used does not change the clustering accuracy
    apply feature selection to all the blog content, including the comments
        with a certain percentage of features selected, the entropy value can be reduced
    making good use of the terms in comments can help increase clustering accuracy

SUMMARY

utilizing a particular feature of the blogs, the comments, to enhance the effectiveness of a clustering algorithm in classifying blog pages

Future work
    consider the timing effect of the blogs
        better clustering of blog documents
        finding blog communities
    the utilization of predefined category information may also improve the classification of blog files
    experimenting with other data mining algorithms on blog datasets
