ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS
Beibei Li, Shuting Xu, Jun Zhang
Department of Computer Science
University of Kentucky
ACMSE’07
INTRODUCTION
Blogs
  highly opinionated personal online commentary, including hyperlinks to other resources
Technorati (July 2006)
  tracking more than 50 million blogs
  about 175,000 blogs were created daily
  the size of the blogosphere doubles every six months
  how many blog authors are updating their blogs regularly -> not clear
INTRODUCTION (CONT.)
Analysis of the blogosphere in 2004
  more than two-thirds of public blogs are personal journals
  knowledge blogs (k-blogs) -> a mere 3 percent
Due to the diverse backgrounds of the blog authors and readers, the blogosphere has hyper-accelerated the spread of information
BLOGS VS. WEB PAGES
Major differences between blogs and standard web pages
  blogs are dated
  most blogs allow readers to place comments on each blog document
    this creates communication channels between the blog authors and the readers
  blog authors can place individual blogs into different categories according to some predefined categories
    the definitions of the categories may differ between authors
BLOG DOCUMENTS
Use the vector-space model to encode the blog web pages
  each blog page can be viewed as a column vector
  each word used can be considered as one row of the matrix
Consider a blog page as three parts
  blog title
  blog body
    the content of the blog page
  comments of the authors and/or the readers
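
The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `word_page_matrix` and the two toy pages are assumptions made up for the example. Each row is a word and each column is a page, as on the slide.

```python
from collections import Counter

def word_page_matrix(pages):
    """Return (vocab, matrix) where matrix[i][j] counts word i in page j.

    `pages` is a list of token lists (a hypothetical input format).
    """
    vocab = sorted({word for page in pages for word in page})
    counts = [Counter(page) for page in pages]
    # One row per word, one column per page.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

# Toy data, invented for illustration only.
pages = [
    ["gun", "control", "law", "gun"],
    ["church", "sunday", "service"],
]
vocab, A = word_page_matrix(pages)
first_page_column = [row[0] for row in A]
```

Reading off `first_page_column` gives the column vector for the first page, with one entry per vocabulary word.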
A SAMPLE BLOG PAGE
HYPOTHESIS
Hypothesis
  the use of title and comment words in the dataset will enhance the discrimination of the blog pages
  and result in more accurate clustering solutions
Reason
  the words in the comments reflect the specific views, questions, and answers of the authors and the readers
  and so may hold more weight in discriminating individual blog pages
DATA PREPARATION AND CLUSTERING
Data preprocessing
  selected three categories of blog files
    gun control
    church
    Alzheimer’s disease
  downloaded from Windows Live Spaces by searching with the key words
  each entry has at least one comment
  each category has 70 files, for a total of 210 blog files
  parsing
    convert each file into 3 parts
    stemming
    delete stop words
    count the number of occurrences of each word
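
The parsing steps above might look roughly like the following sketch. Several pieces are assumptions for illustration: the page is a hypothetical dict with `title`, `body`, and `comments` keys, `STOP_WORDS` is a tiny sample list, and `crude_stem` is a simple suffix stripper standing in for a real stemmer, which the slides do not name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Lowercase, tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def preprocess(page):
    """Return one term-count table per part of the page."""
    return {part: term_counts(page[part]) for part in ("title", "body", "comments")}

# Toy page, invented for illustration only.
page = {
    "title": "Gun control laws",
    "body": "The debate on gun control is growing.",
    "comments": "I agreed with the laws.",
}
parts = preprocess(page)
```

Keeping the three count tables separate at this stage is what allows each part to be weighted differently later.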
DATA PREPROCESSING (CONT.)
Represent each document by three vectors (title, body, and comments)
The vector for the whole document is a weighted sum of all three vectors:
  v = wt * v_title + wb * v_body + wc * v_comment
  wt : title weight
  wb : body weight
  wc : comment weight
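
As a small sketch of the weighted sum, assuming each part is stored as a term-to-count mapping: the function name `combine` and the weight values below are hypothetical and are not the settings used in the experiments.

```python
from collections import Counter

def combine(title, body, comments, wt, wb, wc):
    """Weighted sum of the title, body, and comment count vectors."""
    combined = Counter()
    for weight, counts in ((wt, title), (wb, body), (wc, comments)):
        for term, n in counts.items():
            combined[term] += weight * n
    return dict(combined)

# Toy counts and weights, invented for illustration only.
doc = combine(
    title={"gun": 1, "control": 1},
    body={"gun": 2, "law": 1},
    comments={"law": 3},
    wt=2.0, wb=1.0, wc=0.5,
)
# "gun" gets 2.0 * 1 + 1.0 * 2 = 4.0
```

Setting one weight to zero removes that part entirely, which is how title-only, body-only, or comment-only runs could be expressed with the same function.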
DATA PREPROCESSING(CON.)
the word-page matrix A is composed of a set of such document vectors A = (v1 … vm)
vij is the weighted occurrences of the word i in the document vj
to balance the influence of small size and large size documents scale each document vector vj to have its
Euclidean norm equal to 1
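
The scaling step can be illustrated as follows. This is a minimal sketch; the helper name `normalize` and the two toy vectors are assumptions for the example.

```python
import math

def normalize(vector):
    """Scale a document vector so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]

# A short document and a ten-times-longer one with the same word proportions.
short_doc = normalize([3.0, 4.0])
long_doc = normalize([30.0, 40.0])
```

After scaling, both toy documents map to the same unit vector, so document length alone no longer separates them.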
FEATURE SELECTION
tf-idf
  TI is the mean value of tf-idf over all the documents for each term
  use TI to measure the quality of the term
  the higher the TI value, the better the term is ranked
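
The TI ranking can be sketched as below. The slides do not spell out which tf-idf variant was used, so this example assumes the common form tf * log(N / df); the function names and the toy matrix are likewise hypothetical.

```python
import math

def tfidf_matrix(A):
    """Apply tf * log(N / df) to a word-page matrix (rows = terms).

    The exact tf-idf variant is an assumption; the slides do not give it.
    """
    n_docs = len(A[0])
    result = []
    for row in A:
        df = sum(1 for x in row if x > 0)
        idf = math.log(n_docs / df) if df else 0.0
        result.append([x * idf for x in row])
    return result

def term_quality(A):
    """TI: mean tf-idf of each term over all documents."""
    return [sum(row) / len(row) for row in tfidf_matrix(A)]

def select_features(A, fraction):
    """Return the row indices of the top `fraction` of terms ranked by TI."""
    ti = term_quality(A)
    keep = max(1, round(fraction * len(A)))
    ranked = sorted(range(len(A)), key=lambda i: ti[i], reverse=True)
    return sorted(ranked[:keep])

# Toy matrix: term 0 occurs in every page, term 1 in one page, term 2 in two.
A = [
    [1.0, 1.0, 1.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
kept = select_features(A, 0.67)
```

Under this assumed variant, a term that appears in every page gets a TI of zero and is the first to be dropped.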
CLUSTERING
k-means algorithm
  1. compute the Euclidean distance from each of the documents to each cluster center; assign each document to the cluster with the smallest distance
  2. recompute each cluster center as the mean of its constituent documents
  3. repeat steps 1 and 2 until convergence is reached
CLUSTERING (CONT.)
Criterion function f for the convergence
  r : the step of the iterations
  Edist(vi, cj) : the Euclidean distance from the document vi to a cluster center cj
Given a convergence criterion ε, the k-means algorithm stops when |fr+1 - fr| < ε
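
The k-means loop and its stopping rule can be sketched as follows. The criterion function itself did not survive in this transcript, so the code assumes f is the sum of Euclidean distances from each document to its assigned center. The initial centers are passed in by the caller, and all names here are hypothetical.

```python
import math

def edist(v, c):
    """Euclidean distance between a document vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))

def kmeans(docs, centers, eps=1e-6, max_iter=100):
    """Plain k-means; stops when |f_{r+1} - f_r| < eps.

    Assumption: f is the total distance of documents to their assigned centers.
    """
    centers = [list(c) for c in centers]
    prev_f = None
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each document to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: edist(v, centers[j]))
            for v in docs
        ]
        f = sum(edist(v, centers[j]) for v, j in zip(docs, labels))
        if prev_f is not None and abs(f - prev_f) < eps:
            break
        prev_f = f
        # Step 2: recompute each center as the mean of its members.
        for j in range(len(centers)):
            members = [v for v, lab in zip(docs, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy documents forming two obvious groups; invented for illustration only.
docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = kmeans(docs, centers=[docs[0], docs[2]])
```

Passing the initial centers explicitly keeps the example deterministic; a real run would typically pick them at random and may repeat the clustering several times.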
CLUSTERING METRICS
Entropy
  gauges the distribution of each class of documents within each cluster
  suppose there are q classes and the clustering algorithm returns k clusters
  the entropy E of a cluster Sr of size nr is computed from the number of documents in the ith class that are assigned to the rth cluster
  the entropy of the entire clustering solution is computed from the entropies of the individual clusters
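
The entropy formulas did not survive in this transcript. A widely used definition that matches the symbols above, assumed here rather than taken from the slides, is E(Sr) = -(1 / log q) * sum over i of (nri / nr) * log(nri / nr), with the overall entropy being the sum over r of (nr / n) * E(Sr), where nri is the number of class-i documents in cluster r and n is the total number of documents. A minimal sketch under that assumption, with hypothetical function names:

```python
import math
from collections import Counter

def cluster_entropy(class_labels, q):
    """Entropy of one cluster, normalized by log(q) to lie in [0, 1].

    `class_labels` holds the true class of each document in the cluster.
    """
    n_r = len(class_labels)
    if n_r == 0 or q < 2:
        return 0.0
    total = 0.0
    for n_ri in Counter(class_labels).values():
        p = n_ri / n_r
        total -= p * math.log(p)
    return total / math.log(q)

def clustering_entropy(clusters, q):
    """Size-weighted sum of the per-cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c, q) for c in clusters)

# Toy clusterings over q = 2 classes; invented for illustration only.
pure = [["a", "a"], ["b", "b"]]
mixed = [["a", "b"], ["a", "b"]]
pure_entropy = clustering_entropy(pure, q=2)
mixed_entropy = clustering_entropy(mixed, q=2)
```

Lower is better: a clustering whose clusters each contain a single class scores 0, while evenly mixed clusters score 1.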
CLUSTERING METRICS (CONT.)
Purity
  the purity of a cluster Sr measures the fraction of the cluster made up by its largest class
  the purity of the entire clustering solution is computed from the purities of the individual clusters
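
The purity formulas are also missing from this transcript. The sketch below assumes the usual definition: the purity of a cluster is the share of its documents that belong to its largest class, and the overall purity is the size-weighted sum of the cluster purities. The function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def cluster_purity(class_labels):
    """Fraction of the cluster that belongs to its most frequent class."""
    if not class_labels:
        return 0.0
    return max(Counter(class_labels).values()) / len(class_labels)

def clustering_purity(clusters):
    """Size-weighted sum of the per-cluster purities."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_purity(c) for c in clusters)

# Toy clustering; invented for illustration only.
clusters = [["gun", "gun", "church"], ["church"]]
purity = clustering_purity(clusters)
# (3/4) * (2/3) + (1/4) * 1 = 0.75
```

Higher is better here, the opposite direction from entropy.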
EXPERIMENTAL RESULTS
Influence of weight
  clustering is not very good if only one of the title, body, or comments is used
  the accuracy of clustering the blog body is better than that of the title or the comments
  using all three parts improves the results considerably
EXPERIMENTAL RESULTS
Feature selection
  use only the title and the body for clustering
    reducing the percentage of features used does not change the clustering accuracy
  apply feature selection to all the blog content, including the comments
    with a certain percentage of features selected, the entropy value can be reduced
  making good use of the terms in the comments can help increase clustering accuracy
SUMMARY
Utilized a particular feature of blogs, the comments, to enhance the effectiveness of a clustering algorithm in classifying blog pages
Future work
  consider the timing effect of the blogs
    better clustering of blog documents
    finding blog communities
  the utilization of predefined category information may also improve the classification of blog files
  experimenting with other data mining algorithms on blog datasets