Using Digg to Save Lives
Or, Using Digg to Find Interesting Events
Presented by: Luis Zaman, Amir Khakpour, and John Felix
Explanation
Digg is a social web-media discovery tool based on user-submitted content.
  1 or 2 submissions a minute
  Half-life of interest is about a day
Digg aggregates interesting content.
But how do we find interesting events and know their themes?
Motivation
The collaborative nature of social media can scour the WWW very thoroughly.
But this generates A LOT of data (you'll see).
It would be cool to find emergencies or critical situations based on this collaborative media.
Apple seems like a pretty good starting point.
Preprocessing
Digg API
  REST API
Limitations
  100 results per request
  1 hour of time series data
  Can't go fast, or else.
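A minimal sketch of how the limits above shape the download loop: fetch 100 records at a time and pause between requests. The names `fetch_all`, `fetch_page`, and `fake_fetch_page` are hypothetical and are not part of the Digg API; `fetch_page(offset, count)` stands in for whatever wrapper issues a single REST call, and the one-second delay is an assumed value.

```python
import time
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 100  # per-request cap stated on the slide


def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              delay_seconds: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict]:
    """Yield every record by walking the API one page at a time.

    fetch_page(offset, count) is a hypothetical wrapper around one REST
    call; it returns at most `count` records starting at `offset`.
    """
    offset = 0
    while True:
        page = fetch_page(offset, PAGE_SIZE)
        yield from page
        if len(page) < PAGE_SIZE:
            break  # a short page means there is nothing left to fetch
        offset += PAGE_SIZE
        sleep(delay_seconds)  # throttle between requests (assumed delay)


def fake_fetch_page(offset: int, count: int) -> List[Dict]:
    """Stand-in data source with 250 records, used only for illustration."""
    total = 250
    return [{"id": i} for i in range(offset, min(offset + count, total))]
```

Passing `sleep=lambda s: None` disables the pause, which is convenient when trying the loop against the stand-in source.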
Time Series
  Each digg is the event (only 100 at a time)
  Rows: each story's digg count
  Columns: every hour (2,207 of them, from August 08 to November 08)
Clustering
  Rows: each story that was dugg at any point in the time series
  Columns: the words in the title and description of
Preprocessing - Challenges
  SLOW
  Really dirty data
  Different formats of data
  REALLY SLOW
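The time-series matrix described under Preprocessing (one row per story, one column per hour) could be built roughly as follows. This is a sketch under assumptions: each digg is taken to be a `(story_id, unix_timestamp)` pair, and `hourly_counts` is a hypothetical helper, not the presenters' code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hourly_counts(diggs: Iterable[Tuple[str, int]],
                  start: int,
                  n_hours: int) -> Dict[str, List[int]]:
    """Bin individual diggs into per-story hourly counts.

    diggs   -- (story_id, unix_timestamp) pairs (assumed input shape)
    start   -- unix timestamp of the first hour in the window
    n_hours -- number of hourly columns (2,207 in the slides)
    """
    rows: Dict[str, List[int]] = defaultdict(lambda: [0] * n_hours)
    for story_id, timestamp in diggs:
        hour = (timestamp - start) // 3600
        if 0 <= hour < n_hours:  # ignore diggs outside the window
            rows[story_id][hour] += 1
    return dict(rows)
```

For example, `hourly_counts([("a", 0), ("a", 10), ("a", 3600)], start=0, n_hours=3)` gives story `"a"` the row `[2, 1, 0]`.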
Introduction to Document Clustering
Challenges of clustering text documents, unlike structured data, are:
  Volume
  Dimensionality
  Sparsity
  Complex semantics
In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
Huge sparse matrix; we just store the non-zero values.
Text documents are converted to a matrix $A_{m,n}$ where, for m documents and a total of n words (or phrases), each element $x_{i,j}$ represents the frequency of the jth term in the ith document.
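To make the representation concrete, here is a small sketch of a term-frequency matrix stored sparsely, keeping only the non-zero entries of each row. The tokenizer (lowercase alphanumeric runs) and the names `tokenize` and `build_vsm` are illustrative assumptions, not the preprocessing actually used for the Digg data.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vsm(documents: List[str]) -> Tuple[Dict[str, int], List[Dict[int, int]]]:
    """Return (vocabulary, rows).

    vocabulary maps each term to its column index j.
    rows[i] maps column index j to the frequency x_ij, for non-zero entries only.
    """
    vocabulary: Dict[str, int] = {}
    rows: List[Dict[int, int]] = []
    for doc in documents:
        row: Dict[int, int] = {}
        for term, freq in Counter(tokenize(doc)).items():
            j = vocabulary.setdefault(term, len(vocabulary))
            row[j] = freq
        rows.append(row)
    return vocabulary, rows


docs = ["Apple releases new iPhone", "New Apple store opens", "apple apple pie"]
vocabulary, rows = build_vsm(docs)
nonzero = sum(len(row) for row in rows)
density = nonzero / (len(docs) * len(vocabulary))
```

With these three toy titles the matrix is 3 by 7 with 10 stored entries; on real data the stored fraction is far smaller, which is why only non-zero values are kept.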
Clustering
Dataset
  Number of stories (m): 25470
  Total number of unique words (n): 55557
  Nonzero values: 469323 (0.03214%)
Clustering using Cluto software
  Using K-Means, Bisecting K-Means
Calculating centroids and SSE
  A C++ program is run on black
Document Clustering by Optimizing Criterion Functions
According to Zhao et al., to get a good clustering of documents we can choose a criterion function and use optimization to find the clusters:
  Internal Criterion Functions (I): maximizing the internal
  External Criterion Functions (E): minimizing the external
  Hybrid Criterion Functions (H): maximizing
Experiments
SSE for I (K-Means vs Bisecting K-Means)
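For reference, the centroid and SSE calculation used to compare clusterings can be sketched as below. It assumes dense vectors and squared Euclidean distance; the slides do not say which distance the original C++ program used, and `centroids_and_sse` is a hypothetical name.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def centroids_and_sse(points: Sequence[Sequence[float]],
                      labels: Sequence[Hashable]
                      ) -> Tuple[Dict[Hashable, List[float]], float]:
    """Compute each cluster's centroid and the total sum of squared errors.

    points -- one dense vector per document
    labels -- cluster id assigned to each document (same order as points)
    """
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        for d, value in enumerate(point):
            sums[label][d] += value
        counts[label] += 1

    # Centroid = component-wise mean of the cluster's members.
    centroids = {k: [s / counts[k] for s in sums[k]] for k in sums}

    # SSE = squared Euclidean distance of every point to its own centroid.
    sse = sum(
        sum((value - c) ** 2 for value, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )
    return centroids, sse
```

A lower SSE for the same number of clusters indicates tighter clusters, which is the basis for comparing K-Means against Bisecting K-Means.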
Visualization
What we used
  jQuery
  Database query library for
  Visualization API
  Time Series Graph
    Zoomable
    Timepedia
Conclusions
Success?
  Of course we think so
Future Work
  Save lives?
  Better clustering
  Cleaner data
  More data
  Make it scalable and dynamic
  On-line and on the fly?
What are we trying to do? And how?
Related work, why we think this is possible, what we want to get out of it
Turning it into mineable data
W. Fan, L. Wallace, S. Rich, and Z. Zhang, Tapping into the power of text mining, Communications of the ACM, 2005.
M.M. Gomez, A.L. Lopez, and A.F. Gelbukh, Information retrieval with conceptual graph matching, In DEXA, 312-321, 2000.
A. Hotho, S. Staab, and G. Stumme, Text clustering based on background knowledge, TR425, AIFB, Germany, 2003.
J.M. Ponte and W.B. Croft, A language modeling approach to information retrieval, In Research and Development in Information Retrieval, 275-281, 1998.
S.M. Weiss, N. Indurkhya, T. Zhang, and F.J. Damerau, Text mining, Springer, 2005.
R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval, Addison Wesley, 1999.
W. Yang, J.Z. Huang, and M.K. Ng, A data cube model for prediction-based web prefetching, Journal of Intelligent Information Systems, 20(1),