Using Digg to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix

Using Digg to Save Lives

Or, Using Digg to find interesting events.

Presented by: Luis Zaman, Amir Khakpour, and John Felix

Page 2: Outline

Page 3: Explanation

- Digg is a social web-media discovery tool based on user-submitted content.
  - 1 or 2 submissions a minute
  - Half-life of “interest” is about a day
- Digg aggregates “interesting” content.
- But how do we find interesting Events and know their Themes?

Page 4: Motivation

- The collaborative nature of social media can scour the WWW very thoroughly.
  - But this generates A LOT of data (you’ll see).
- It would be cool to find emergencies or critical situations based on this collaborative media.
- Apple (the Digg topic) seems like a pretty good starting point.

Page 5: Approach

Page 6: Preprocessing

Digg API
- REST API
  - http://services.digg.com/stories/topic/apple?count=10
- XML response
    <?xml version="1.0" encoding="utf-8" ?>
    <users timestamp="1176998598" total="1" offset="0" count="1">
      <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" />
    </users>
- Limitations
  - 100 results per request
  - 1 hour of time series data
  - Can’t go fast, or else.
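
The paging and throttling constraints listed above could be handled roughly as in the Python sketch below. It is a minimal illustration, not the presenters' code: the `offset` query parameter is an assumption inferred from the `offset` attribute in the XML response, `fetch` is a hypothetical caller-supplied function that returns the raw response bytes for a URL, and the one-second delay is an arbitrary placeholder for whatever rate the service tolerated.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://services.digg.com/stories/topic/{topic}"  # endpoint shown on the slide
MAX_COUNT = 100  # per-request cap listed under Limitations


def page_urls(topic, total, page_size=MAX_COUNT):
    """Yield one request URL per page of results.

    Assumption: the API accepts an `offset` query parameter that mirrors
    the `offset` attribute seen in the XML response.
    """
    page_size = min(page_size, MAX_COUNT)
    for offset in range(0, total, page_size):
        count = min(page_size, total - offset)
        query = urlencode({"count": count, "offset": offset})
        yield BASE_URL.format(topic=topic) + "?" + query


def parse_response(xml_bytes):
    """Return (metadata, items) parsed from one XML response body."""
    root = ET.fromstring(xml_bytes)
    meta = {key: int(root.attrib[key])
            for key in ("timestamp", "total", "offset", "count")}
    items = [dict(child.attrib) for child in root]
    return meta, items


def fetch_all(topic, total, fetch, delay_seconds=1.0):
    """Collect items page by page, pausing between requests.

    `fetch` is a hypothetical callable mapping a URL to response bytes;
    `delay_seconds` is an arbitrary throttle value.
    """
    items = []
    for url in page_urls(topic, total):
        _, page_items = parse_response(fetch(url))
        items.extend(page_items)
        time.sleep(delay_seconds)
    return items
```

Passing `fetch` in as a parameter keeps the network call out of the parsing logic, so the paging behaviour can be checked against a saved response.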

Page 7: Preprocessing

Time Series
- Each digg is the event (only 100 at a time)
- Rows: each story’s digg count
- Columns: every hour (2,207 of them from August 08 – November 08)

Clustering
- Rows: each story that was dugg at any point in the time series
- Columns: the words in the title and description of this story
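
One way to build the time-series table described above is to bin each digg event into the hour in which it occurred. The sketch below assumes events arrive as `(story_id, unix_timestamp)` pairs; that tuple layout and the function name are illustrative and do not come from the slides.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_counts(diggs, start_ts, n_hours):
    """Return {story_id: [digg count for each hour]}.

    diggs    : iterable of (story_id, unix_timestamp) pairs (assumed layout)
    start_ts : timestamp of the first hour in the series
    n_hours  : number of hourly columns (2,207 in the presentation)
    """
    rows = defaultdict(lambda: [0] * n_hours)
    for story_id, ts in diggs:
        hour = (ts - start_ts) // SECONDS_PER_HOUR
        if 0 <= hour < n_hours:  # ignore events outside the window
            rows[story_id][hour] += 1
    return dict(rows)
```

Each returned list is one row of the time-series table, with one column per hour.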

Page 8: Preprocessing - Challenges

- SLOW
- Really Dirty Data
- Different Formats of Data
- REALLY SLOW

Page 9: Introduction to Document Clustering

- Unlike structured data, text documents pose several clustering challenges:
  - Volume
  - Dimensionality
  - Sparsity
  - Complex semantics
- In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
  - Huge sparse matrix, so we store only the non-zero values
- Text documents are converted to a matrix $A_{m,n}$: for $m$ documents and a total of $n$ words (or phrases), each element $x_{i,j}$ is the frequency of the $j$th term in the $i$th document.
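
The term-frequency matrix described above can be sketched in plain Python by keeping one dictionary per document, so that only non-zero entries are stored. The tokenizer (lowercase alphanumeric runs) and the function names are assumptions made for illustration; the slides do not say how the text was tokenized.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_term_frequency(docs):
    """Return (rows, vocab) for a list of document strings.

    rows[i] maps column index j to the frequency of term j in document i,
    so zero entries are never stored.
    vocab maps each term to its column index.
    """
    vocab = {}
    rows = []
    for doc in docs:
        row = {}
        for term, freq in Counter(tokenize(doc)).items():
            j = vocab.setdefault(term, len(vocab))
            row[j] = freq
        rows.append(row)
    return rows, vocab


def density(rows, vocab):
    """Fraction of matrix cells that are non-zero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * len(vocab))
```

For example, `build_term_frequency(["Apple releases new iPhone", "New Apple apple rumor"])` yields two rows over a five-word vocabulary, with the second row recording a frequency of 2 for "apple".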

Page 10: Clustering Dataset

- Number of stories (m): 25470
- Total number of unique words (n): 55557
- Nonzero values: 469323 (0.03214%)
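
As a rough check on how sparse this matrix is, the density can be worked out directly from the three counts above. The arithmetic below uses only those counts:

```latex
\[
\text{density} \;=\; \frac{\text{nonzero values}}{m \cdot n}
\;=\; \frac{469323}{25470 \times 55557}
\;=\; \frac{469323}{1415036790}
\;\approx\; 3.3 \times 10^{-4}
\;\approx\; 0.033\%
\]
```

This is in the same range as the percentage quoted on the slide: roughly 3 cells in every 10,000 are non-zero, which is why only the non-zero values are stored.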

- Clustering using the CLUTO software
  - Using K-means and bisecting K-means
- Calculating centroids and SSE
  - A C++ program is run on “black”
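
The centroid and SSE calculation mentioned above was done by the presenters in C++; the Python sketch below only illustrates the same quantities on sparse rows. It assumes each row is a `{column: value}` dictionary and that `labels[i]` gives the cluster of row `i`; both conventions are illustrative.

```python
def centroids_and_sse(rows, labels, k):
    """Return (centroids, sse) for sparse rows assigned to k clusters.

    rows   : list of {column_index: value} dicts (assumed sparse layout)
    labels : labels[i] is the cluster index (0..k-1) of rows[i]
    """
    sums = [dict() for _ in range(k)]
    sizes = [0] * k
    for row, c in zip(rows, labels):
        sizes[c] += 1
        for j, value in row.items():
            sums[c][j] = sums[c].get(j, 0.0) + value

    # A centroid is the per-column mean over the cluster's members.
    centroids = [
        {j: total / sizes[c] for j, total in sums[c].items()} if sizes[c] else {}
        for c in range(k)
    ]

    # SSE: squared Euclidean distance from each row to its own centroid.
    sse = 0.0
    for row, c in zip(rows, labels):
        centroid = centroids[c]
        for j in row.keys() | centroid.keys():
            diff = row.get(j, 0.0) - centroid.get(j, 0.0)
            sse += diff * diff
    return centroids, sse
```

Iterating over the union of keys matters here: a column that is zero in a row but non-zero in the centroid still contributes to the distance.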

Page 11: Document Clustering by Optimizing Criterion Functions

According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it:
- Internal Criterion Functions (I): maximize the internal similarity function
- External Criterion Functions (E): minimize the external similarity function
- Hybrid Criterion Functions (H): maximize $I / E$
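
To make the three families concrete, the sketch below evaluates one commonly cited instance of each: the internal function I2 (sum of cosine similarities between each document and its cluster centroid), the external function E1 (cluster-size-weighted cosine between each cluster centroid and the collection centroid), and the hybrid H2 = I2 / E1. The slides do not state which variants the presenters used, so this choice is an assumption, as are the function names. Documents are dense lists of non-negative term weights and must not be all zeros.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def unit(u):
    """Scale a non-zero vector to unit length."""
    length = norm(u)
    return [a / length for a in u]


def vec_sum(vectors):
    return [sum(column) for column in zip(*vectors)]


def criterion_values(docs, labels, k):
    """Return (I2, E1, H2) for a given assignment of docs to k clusters."""
    docs = [unit(d) for d in docs]
    collection = vec_sum(docs)  # composite vector of the whole collection
    i2 = 0.0
    e1 = 0.0
    for r in range(k):
        members = [d for d, label in zip(docs, labels) if label == r]
        if not members:
            continue
        composite = vec_sum(members)  # composite vector of cluster r
        # For unit-length documents, the summed cosine to the centroid
        # equals the norm of the cluster's composite vector.
        i2 += norm(composite)
        e1 += len(members) * dot(composite, collection) / (
            norm(composite) * norm(collection))
    return i2, e1, i2 / e1
```

With four documents `[1, 0], [1, 0], [0, 1], [0, 1]`, the grouping `[0, 0, 1, 1]` scores higher on I2 and H2 and lower on E1 than the mixed grouping `[0, 1, 0, 1]`, which matches the maximize/minimize directions listed above.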

Page 12: Experiments

SSE for I (K-Means vs Bisecting K-Means)
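
For readers unfamiliar with the second algorithm in this comparison, here is a minimal sketch of bisecting K-means on small dense points. It is not CLUTO's implementation: it uses squared Euclidean distance, always splits the cluster with the largest SSE, and keeps the best of a few random 2-means trials. All of these are illustrative choices, and the parameter names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def sse(points):
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, iters=20):
    """Plain K-means with k=2; returns the two groups of points."""
    centers = rng.sample(points, 2)
    for _ in range(iters):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split; the caller discards it
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if left and right:
                total = sse(left) + sse(right)
                if best is None or total < best[0]:
                    best = (total, left, right)
        if best is None:
            break  # no valid split found
        clusters.remove(target)
        clusters.extend(best[1:])
    return clusters
```

Summing `sse(c)` over the returned clusters gives the total SSE, the quantity this slide compares between the two algorithms.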

Page 13: Visualization

What we used
- jQuery
  - JavaScript library for DOM manipulation and Ajax requests
- PHP/MySQL
  - Scripting language and database backend
- Google Visualization API
  - Time Series Graph
  - Zoomable
- Timepedia Chronoscope
  - Clickable
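
The presenters' backend was PHP/MySQL; purely as an illustration, and in Python for consistency with the other sketches, the function below shapes an hourly digg series into the JSON layout that the Google Visualization `DataTable` constructor accepts (a `cols` list plus `rows` of `c` cells, with dates written as `"Date(year, month, day, hours, minutes, seconds)"` strings and zero-based months). The column ids and labels are made up for the example.

```python
import json
from datetime import datetime, timezone

SECONDS_PER_HOUR = 3600


def to_datatable(start_ts, counts, label="Diggs"):
    """Build a DataTable-style dict with one row per hour.

    start_ts : Unix timestamp (UTC) of the first hour
    counts   : digg count for each consecutive hour
    """
    rows = []
    for hour, count in enumerate(counts):
        t = datetime.fromtimestamp(start_ts + SECONDS_PER_HOUR * hour,
                                   tz=timezone.utc)
        # Months are zero-based in this date string representation.
        date_literal = "Date(%d, %d, %d, %d, 0, 0)" % (
            t.year, t.month - 1, t.day, t.hour)
        rows.append({"c": [{"v": date_literal}, {"v": count}]})
    return {
        "cols": [
            {"id": "hour", "label": "Hour", "type": "datetime"},
            {"id": "diggs", "label": label, "type": "number"},
        ],
        "rows": rows,
    }


def to_json(start_ts, counts):
    """Serialize the table for delivery to the browser."""
    return json.dumps(to_datatable(start_ts, counts))
```

A page script could then fetch this JSON with an Ajax call and hand it to a time-series chart.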

Page 14: Conclusions

- Success?
  - Of course we think so
- Future Work
  - Save lives?
  - Better clustering
  - Cleaner data
  - More data
  - Make it scalable and dynamic
  - On-line and on the fly?