  • Using Digg to Save Lives

    Or, Using Digg to find interesting events.

    Presented by: Luis Zaman, Amir Khakpour, and John Felix

  • Outline

  • Explanation

    Digg is a social web-media discovery tool based on user-submitted content.

      • 1 or 2 submissions a minute
      • Half-life of interest is about a day

    Digg aggregates interesting content.

    But how do we find interesting Events and know their Themes?

  • Motivation

    The collaborative nature of social media can scour the WWW very thoroughly. But this generates A LOT of data (you'll see).

    It would be cool to find emergencies, or critical situations based on this collaborative media.

    Apple seems like a pretty good starting point.

  • Approach

  • Preprocessing: Digg API

    REST API: http://services.digg.com/stories/topic/apple?count=10

    XML response

    Limitations:

      • 100 results per request
      • 1 hour of time series data
      • Can't go fast, or else.
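A minimal sketch of the request and response handling described on this slide. The URL pattern and the 100-result cap come from the slide; the XML layout in `SAMPLE_XML` (a `stories` root with `story` elements carrying `id` and `diggs` attributes) is an assumption made for illustration, not the documented response format, and the helper names are hypothetical. No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE = "http://services.digg.com/stories/topic/{topic}"
MAX_COUNT = 100  # per-request limit stated on the slide


def build_url(topic, count=10):
    """Build a stories-by-topic URL, capping count at the 100-result limit."""
    return BASE.format(topic=topic) + "?" + urlencode({"count": min(count, MAX_COUNT)})


# Hypothetical response body: the element and attribute names are assumptions.
SAMPLE_XML = """<stories count="2">
  <story id="1" diggs="120">
    <title>Apple ships new MacBook</title>
    <description>Aluminum unibody</description>
  </story>
  <story id="2" diggs="45">
    <title>iPhone update released</title>
    <description>Bug fixes</description>
  </story>
</stories>"""


def parse_stories(xml_text):
    """Extract id, digg count, title and description from each story element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": story.get("id"),
            "diggs": int(story.get("diggs")),
            "title": story.findtext("title"),
            "description": story.findtext("description"),
        }
        for story in root.iter("story")
    ]


stories = parse_stories(SAMPLE_XML)
```

Because of the "can't go fast" limitation, a real crawler would also need to pause between requests; how long is not stated on the slide.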

  • Preprocessing

    Time Series: each digg is the event (only 100 at a time)

      • Rows: each story's digg count
      • Columns: every hour (2,207 of them, from August 08 to November 08)

    Clustering

      • Rows: each story that was dugg at any point in the time series
      • Columns: the words in the title and description of this story
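The time-series layout above (one row per story, one column per hour) could be built along these lines. This is only a sketch: the input format, a list of `(story_id, unix_seconds)` digg events, and the function name `hourly_matrix` are assumptions for illustration.

```python
from collections import defaultdict


def hourly_matrix(diggs, start, n_hours):
    """Count diggs per story per hour.

    diggs: iterable of (story_id, unix_seconds) events.
    Returns (story_ids, rows) where rows[i][h] is the number of diggs
    story_ids[i] received during hour h after `start`.
    """
    counts = defaultdict(lambda: [0] * n_hours)
    for story_id, ts in diggs:
        hour = (ts - start) // 3600
        if 0 <= hour < n_hours:  # ignore events outside the observed window
            counts[story_id][hour] += 1
    story_ids = sorted(counts)
    return story_ids, [counts[s] for s in story_ids]


events = [("a", 0), ("a", 10), ("a", 3600), ("b", 7300), ("b", -5), ("c", 999999)]
story_ids, rows = hourly_matrix(events, start=0, n_hours=3)
```

With 2,207 hourly columns most cells would be zero, so a sparse representation may be a better fit at full scale.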

  • Preprocessing - Challenges

      • SLOW
      • Really dirty data
      • Different formats of data
      • REALLY SLOW

  • Introduction to Document Clustering

    Unlike structured data, clustering text documents poses these challenges:

      • Volume
      • Dimensionality
      • Sparsity
      • Complex semantics

    In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).

    The result is a huge sparse matrix, so we just store the non-zero values.

    Text documents are converted to an m × n matrix A, where m is the number of documents and n is the total number of words (or phrases). Each element x_{i,j} is the frequency of the j-th term in the i-th document.
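As a small illustration of the VSM matrix and of storing only non-zero values, here is a sketch that keeps each document row as a dictionary from column index to term frequency. The tokenizer (lowercase alphanumeric runs) and the function names are assumptions; the slides do not say how the text was tokenized.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_sparse_tf(documents):
    """Return (vocab, rows): vocab maps term -> column j, rows[i] maps j -> frequency.

    Only non-zero entries x_{i,j} are stored.
    """
    vocab = {}
    rows = []
    for doc in documents:
        row = {}
        for term, freq in Counter(tokenize(doc)).items():
            j = vocab.setdefault(term, len(vocab))  # assign the next free column
            row[j] = freq
        rows.append(row)
    return vocab, rows


def density(vocab, rows):
    """Fraction of the m x n matrix that is non-zero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * len(vocab))


docs = ["Apple releases new iPhone", "New Apple MacBook, new price"]
vocab, rows = build_sparse_tf(docs)
```

On a toy corpus the matrix is fairly dense; with tens of thousands of stories and words, almost every entry is zero, which is why the dictionary (or any compressed sparse format) pays off.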

  • Clustering

    Dataset:

      • Number of stories (m): 25470
      • Total number of unique words (n): 55557
      • Nonzero values: 469323 (0.03214%)

    Clustering using Cluto software:

      • Using K-means and bisecting K-means

    Calculating centroids and SSE:

      • A C++ program is run on black
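To make the K-means, bisecting K-means, centroid and SSE steps concrete, here is a small pure-Python sketch. It is not Cluto's implementation and not the C++ program mentioned above: it uses squared Euclidean distance on small dense vectors, and it splits the cluster with the largest SSE at each bisecting step, which is one common choice among several. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sse(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sq_dist(p, c) for p in pts)
    return total


def kmeans(points, k, rng, iters=50):
    """Plain K-means; returns the non-empty clusters as lists of points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return [g for g in groups if g]


def bisecting_kmeans(points, k, rng):
    """Repeatedly split the cluster with the largest SSE using 2-means."""
    clusters = [list(points)]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: sse([clusters[i]]))
        if len(clusters[idx]) < 2:
            break  # nothing left that can be split
        halves = kmeans(clusters[idx], 2, rng)
        if len(halves) < 2:
            break  # all points identical, no useful split
        clusters[idx:idx + 1] = halves
    return clusters


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
flat = kmeans(points, 2, random.Random(0))
bisected = bisecting_kmeans(points, 3, random.Random(0))
```

Comparing `sse(flat)` with `sse(bisected)` for the same number of clusters gives the kind of K-means versus bisecting K-means comparison the experiments report.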

  • Document Clustering by Optimizing Criterion Functions

    According to Zhao et al., a good document clustering can be found by choosing a criterion function and using optimization to find the clusters:

    Internal Criterion Functions (I): maximizing the internal similarity function

    External Criterion Functions (E): minimizing the external similarity function

    Hybrid Criterion Functions (H): maximizing
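The formulas themselves are not reproduced in the text above. As a reference point only, and assuming the slide followed the criterion functions from Zhao and Karypis's work on document clustering, they are commonly written as below. Which variants were actually shown is not known, so treat the choice of I1, I2, E1, H1 and H2 as an assumption.

```latex
% Notation (unit-length document vectors assumed):
%   S_r : the r-th of k clusters,   n_r = |S_r|
%   D_r = \sum_{d \in S_r} d  (composite vector),   C_r = D_r / n_r  (centroid)
%   C   : centroid of the whole collection
\begin{align*}
\mathcal{I}_1 &: \max \sum_{r=1}^{k} \frac{1}{n_r} \sum_{d_i, d_j \in S_r} \cos(d_i, d_j)
              = \sum_{r=1}^{k} \frac{\lVert D_r \rVert^2}{n_r} \\
\mathcal{I}_2 &: \max \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r)
              = \sum_{r=1}^{k} \lVert D_r \rVert \\
\mathcal{E}_1 &: \min \sum_{r=1}^{k} n_r \cos(C_r, C) \\
\mathcal{H}_1 &: \max \frac{\mathcal{I}_1}{\mathcal{E}_1}, \qquad
\mathcal{H}_2 : \max \frac{\mathcal{I}_2}{\mathcal{E}_1}
\end{align*}
```

The internal functions reward tight clusters, the external one rewards clusters whose centroids point away from the collection centroid, and the hybrid ratios trade the two off.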

  • Experiments

    SSE for I (K-Means vs Bisecting K-Means)

  • Visualization

    What we used:

      • jQuery: JavaScript library for DOM manipulation and AJAX requests
      • PHP/MySQL: scripting language and database backend
      • Google Visualization API: time series graph, zoomable
      • Timepedia Chronoscope: clickable

  • Conclusions

    Success? Of course we think so

    Future Work

      • Save lives?
      • Better clustering
      • Cleaner data
      • More data
      • Make it scalable and dynamic
      • On-line and on the fly?

    What are we trying to do? And how?

    Related work, why we think this is possible, what we want to get out of it

    Turning it into mineable data