A presentation made for a graduate course on Information Retrieval, based on the following paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.9438&rep=rep1&type=pdf
Streaming First Story Detection with Application to Twitter
Sasa Petrovic, Miles Osborne, Victor Lavrenko
Agenda
1. Awesomeness of Twitter.
2. Understanding the problem presented in this paper.
– Streaming first story detection.
3. State of the art in FSD.
4. Proposed system.
5. Experiments.
– Different datasets.
– Evaluation metrics.
6. Results.
7. Observations.
8. Questions / Discussion.
Social Media Tools
Information explosion
What makes Twitter tick?
• Twitter and a few other social media tools are sometimes ahead of the newswire.
Ex#1 – Protests during the Iranian elections in 2009
– People posted news first on Twitter, which was later picked up by the broadcasting corporations.
Ex#2 – The swine flu outbreak in the US
– The US Centers for Disease Control (CDC) used Twitter to post the latest updates on the pandemic.
#Mumbai26/11
• #mumbaiblasts RT @SamuraiSingh: Anyone from inorbit mall Malad heading towards Dindoshi in Goregaon east? I could do with a lift. #needhelp
• #needhelp RT @OhMyKohli: Need a lift from andheri west to bhandup nagar goregaon.
• RT @NabeelN: RT @splurgestar7: #NeedHelp #Mumbai #blasts I live 5 mins from Kabootar Khana.. Anyone needs help, please let me know!
• #here2help RT @Nakulsud: Stranded in the rain near Gandhi hospital.. No cabs. Anyone around? Call 9920722186..going to mahim #needhelp
• #NeedHelp Find @nikhilwarrier RT @jayblawgs @sukhkarni Demon Stealer Records is where he works.I'm not sure where it is. Trying Google Maps.
• B-ve donors needed tomorow after 10 am, KEM Hospital, Parel, #Mumbai contact the hosp. blood bank at 022-24135189/24107421 #needhelp
Twitter leads to..
• Citizen journalism
• Promotion
• Subjective opinion
Categorizing Twitter data
Problem statement
To detect new events from a stream of Twitter posts.
Topic Detection and Tracking
• An information filtering task.
• Focuses on organizing news documents.
• Subtasks:
– Story Segmentation.
– Topic Tracking.
– Topic Detection.
– First-Story Detection.
– Link Detection.
Definitions
• An event is a unique thing that happens at some specific time and place.
– E.g.: the earthquake in Italy in April 2009.
• A topic is an event or activity, along with all directly related events or activities.
– E.g.: elections, natural disasters, etc.
First-Story Detection
Is this the first story on a topic?
FSD on Twitter data
• Challenges
– Much higher volume of data.
– High level of noise.
• Benefit
– First-hand information on the impact of an event and how people reacted to it.
First Story Detection – traditional approach
Streaming algorithm
[Diagram: a stream of documents ordered in time, from old to new.]
Nearest Neighbor Approach
• Documents are represented as vectors in term space.
• Coordinates represent the frequency of a particular term in a document.
• Each new document is compared to the previous ones:
If (similarity < threshold) → first story detected.
(Allan et al., 2000)
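The nearest-neighbor rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: raw term frequencies stand in for the TF-IDF weighting described later, and the 0.2 threshold is an arbitrary placeholder.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_story(doc, seen, threshold=0.2):
    """A document is a first story if its nearest neighbor among all
    previously seen documents is below the similarity threshold."""
    vec = Counter(doc.split())
    nearest = max((cosine(vec, s) for s in seen), default=0.0)
    seen.append(vec)
    return nearest < threshold

seen = []
print(first_story("earthquake hits italy", seen))        # True: nothing seen yet
print(first_story("italy earthquake hits town", seen))   # False: near-duplicate
```

Both space and time here grow linearly with the number of documents seen, which is exactly the scalability problem the next slides address.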
Disadvantages of the NN approach
• Not scalable to the Twitter streaming setting.
• Space and time requirements increase with incoming data.
Alternative: approximate nearest-neighbor search
• Find any point that lies within distance (1+ε)·r of the query point.
– r is the distance to the true nearest neighbor.
• One way to achieve this:
– Locality-sensitive hashing (LSH).
Hash tables - definitions
• Hash function
– A mapping from an input value to a hash key.
• Hash key
– The value returned by a hash function.
– The identifier of a bucket.
• Collision
– When two or more input values are mapped to the same bucket.
– More buckets → fewer collisions.
Locality Sensitive Hashing
• Hash each point into buckets in such a way that the probability of collision decreases with the distance between the points.
⇒ Nearer points have a higher chance of being hashed into the same bucket.
• Points in the same bucket are inspected to find the nearest one.
Locality Sensitive Hashing ( contd..)
• Number of hyperplanes (k).
• The higher the value of k, the lower the probability of collision of non-similar points.
• For any two points x and y:
P(collision) = (1 − θ(x, y)/π)^k
where θ(x, y) is the angle between x and y.
Hash key values
• Number of hyperplanes = number of bits in the hash key = k.
• If (x · u_i < 0), for i ∈ [1…k]: set the i-th bit to 0.
• Else: set the i-th bit to 1.
• A higher k value leads to more computation.
Example of an 8-bit hash key: 0 1 0 0 1 1 0 1
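The bit-setting rule above can be sketched as follows. This is an illustrative sketch, assuming documents are already dense vectors; the hyperplanes u_i are drawn from a Gaussian, the usual choice for this LSH family.

```python
import random

def make_hyperplanes(k, dim, seed=0):
    """k random hyperplanes, each represented by a dim-dimensional
    Gaussian normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def hash_key(x, hyperplanes):
    """k-bit LSH key: bit i is 1 iff x lies on the non-negative side
    of hyperplane u_i."""
    bits = 0
    for u in hyperplanes:
        dot = sum(xi * ui for xi, ui in zip(x, u))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_hyperplanes(k=13, dim=5)
print(hash_key([1.0, 0.2, -0.3, 0.0, 0.5], planes))
```

Note that the key depends only on the direction of x, not its length: scaling a vector leaves all the signs, and hence the key, unchanged.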
Multiple hash tables
• To increase the chance that the nearest neighbor collides with our point at least once.
• Each hash table has k independently chosen random hyperplanes.
• The number of hash tables L is chosen so that the probability δ of missing a nearest neighbor across all L tables stays small.
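The choice of L follows from the standard LSH argument (this derivation is an assumption here, the usual textbook one, since the slide does not reproduce the paper's exact formula): if a single table catches a near neighbor with probability p, missing it in all L independent tables happens with probability (1 − p)^L, and requiring that to be at most δ gives L ≥ log δ / log(1 − p).

```python
import math

def num_tables(p_coll, delta):
    """Smallest L such that (1 - p_coll)**L <= delta, i.e. the chance of
    missing a near neighbor in all L independent tables is at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p_coll))

# e.g. per-table collision probability 0.1, tolerated miss probability 2.5%
print(num_tables(p_coll=0.1, delta=0.025))
```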
Variance Reduction Strategy
• LSH fails to find the true near neighbor when the query point lies far away from all other points.
• Another level of processing is added.
Variance Reduction Strategy (contd..)
[Diagram: the LSH scheme over the stream of documents, from old to new.]
• Compare the query with a fixed number of most recent documents.
• Update the distance value if necessary.
Streaming First Story detection - Challenges
• Millions of new documents are published each hour.
• The volume limits the amount of space and time we can spend on each document:
– Cannot compare a new document with all documents returned by LSH.
– Cannot store all the previous documents in main memory.
• Additional metadata: time stamps, topic tags, etc.
Desiderata for a streaming FSD system
• For each document, say whether it discusses a previously unseen event, and give the confidence in this decision.
• The decision must be made in bounded time.
• Use bounded space.
• Only one pass over the data is allowed.
• The decision must be made immediately.
Using the LSH system without bounds
• The number of documents in each bucket will grow without bound.
⇒ an unbounded amount of space.
• The number of comparisons also grows without bound.
Constant space approach
• Limit the number of documents in a single bucket to a constant.
– Remove the oldest document if the bucket is full.
• The document is removed only from one single bucket in one of the L hash tables.
Constant number of comparisons
• Limiting the bucket size might still result in a large number of comparisons:
– A new document can collide with all the documents in a bucket, in every table.
• An additional limit is needed to achieve a constant number of comparisons.
Constant number of comparisons (contd..)
• Compare each new document with at most the 3L documents that it most frequently collided with across all L hash tables.
• Let S be the set of all documents that collided with the new document in the L hash tables:
– Order the elements of S by the number of hash tables in which the collision occurred.
– Pick the first 3L elements of that ordered set and compare the new document only with them.
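The two bounds can be combined into a sketch like the following (class and parameter names are hypothetical; as a simplification, the oldest entry is evicted from whichever bucket overflows, whereas the paper removes the oldest document from only one bucket in one table).

```python
from collections import Counter, deque

class StreamingLSH:
    """Constant-space, constant-comparison LSH sketch: each bucket holds at
    most bucket_size documents, and a query is compared with at most 3L of
    its most frequent colliders."""

    def __init__(self, num_tables, bucket_size):
        self.L = num_tables
        self.bucket_size = bucket_size
        self.tables = [dict() for _ in range(num_tables)]  # key -> bounded bucket

    def candidates(self, keys):
        """keys: one hash key per table. Return at most 3L document ids,
        ordered by the number of tables in which they collided."""
        counts = Counter()
        for table, key in zip(self.tables, keys):
            for doc_id in table.get(key, ()):
                counts[doc_id] += 1
        return [doc for doc, _ in counts.most_common(3 * self.L)]

    def add(self, doc_id, keys):
        for table, key in zip(self.tables, keys):
            bucket = table.setdefault(key, deque(maxlen=self.bucket_size))
            bucket.append(doc_id)  # maxlen silently drops the oldest entry

lsh = StreamingLSH(num_tables=2, bucket_size=100)
print(lsh.candidates([3, 7]))  # [] before any documents are added
```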
Detecting Events in Twitter Posts
• Not all tweets are actual stories:
– Updates on personal life.
– Spam.
– Conversations.
– Real stories.
• An important event is one that interests a larger population.
Detecting Events – Threading
• Run the streaming FSD system and assign a novelty score to each tweet.
– The score is based on the cosine distance to the nearest tweet.
– Output: <tweet, its novelty score, its nearest tweet>.
• Tweet a links to tweet b if:
– b is the nearest neighbor of a, and
– 1 − cos(a, b) < t (t ∈ [0.5, 0.6]).
Threading (contd..)
• For each tweet a:
– If its nearest neighbor b is within distance t:
• Assign a to the existing thread to which b belongs.
– Else:
• a is the first tweet in a new thread.
• Once we have threads of tweets, we can identify which threads grow fastest: news of a new event is spreading.
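The threading rule can be sketched directly. This is a toy illustration using bag-of-words cosine over raw term counts and a brute-force scan; the paper instead reuses the FSD system's nearest-neighbor output.

```python
import math
from collections import Counter

def cos_dist(a, b):
    """1 - cosine similarity of two bag-of-words term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb.get(w, 0) for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return 1 - (dot / norm if norm else 0.0)

def build_threads(tweets, t=0.5):
    """Each tweet joins the thread of its nearest earlier tweet if the
    cosine distance is below t; otherwise it starts a new thread."""
    thread_of = []
    for i, tweet in enumerate(tweets):
        if i == 0:
            thread_of.append(0)
            continue
        j = min(range(i), key=lambda k: cos_dist(tweet, tweets[k]))
        if cos_dist(tweet, tweets[j]) < t:
            thread_of.append(thread_of[j])        # join b's thread
        else:
            thread_of.append(max(thread_of) + 1)  # first tweet of a new thread
    return thread_of

print(build_threads([
    "earthquake in italy",
    "big earthquake in italy today",
    "my cat is cute",
]))  # [0, 0, 1]
```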
Analysis of social media - Related Work
1. Luo et al. (2007) worked on new-event detection in a large-scale streaming setting.
– Used the traditional FSD approach and employed various heuristics.
– Not a generalized approach; never showed the utility of their system on a large-scale task.
2. Saha and Getoor (2009) worked on the maximum coverage problem.
– Select k blogs that maximize the coverage of interests specified by a user.
– 20 days of blog data totaling 2 million posts.
• This paper works on Twitter data spanning 6 months, totaling over 160 million posts.
• This paper's FSD approach is more generalized.
Experiments
• Experiments used the English part of the TDT5 corpus, which consists of 221,306 documents from a time period spanning April 2003 to September 2003.
• Experiments were done in two stages:
A) Test and compare the proposed FSD system against the state-of-the-art FSD system on the standard TDT5 dataset.
B) Test different ranking methods on the output of the proposed FSD system applied to Twitter data.
TDT5 Experimental Setup
• Aim
– To test whether the proposed system is on par with the best existing system (the UMass system in particular).
– To accurately measure the speedup obtained over the existing system.
• Same settings as the UMass system:
– 1-NN clustering.
– Cosine as the similarity measure.
– TF-IDF weighted document representation.
– Top 300 features in each document.
• LSH parameters:
– Higher k: more computation. Lower k: more collisions.
– k (number of hyperplanes) = 13.
– The probability of missing a neighbor within a distance of 0.2 is less than 2.5%.
• The official TDT evaluation requires each system to assign a confidence score to each decision.
– In our case, we assign the score as soon as the new story arrives.
TDT5 Experimental Setup (contd..)
1. Detection Error Tradeoff (DET) curves
– A graphical plot of error rates for binary classification systems, plotting the false-reject rate against the false-accept rate.
– DET curves provide tools to select possibly optimal models and to discard suboptimal ones, independently from (and prior to specifying) the cost context or the class distribution.
TDT5 Experiment Evaluation Metrics
Plotting DET curves
1. Sort all stories according to their scores.
2. Perform a threshold sweep. For each threshold value:
– Stories with a score greater than the threshold are considered new.
– Calculate the false-alarm and miss probabilities.
• False alarm: declaring a story new when it is not.
• Miss: declaring a new story old.
3. Plot the values on a graph to show the trade-off between these two quantities.
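The threshold sweep can be sketched as a direct computation over gold-standard labels (variable names here are hypothetical, and a real TDT evaluation additionally weights stories by topic):

```python
def det_points(scores, is_new, thresholds):
    """For each threshold, declare stories with score > threshold 'new'
    and compute (false-alarm rate, miss rate) against the gold labels."""
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    points = []
    for th in thresholds:
        miss = sum(1 for s, new in zip(scores, is_new) if new and s <= th)
        fa = sum(1 for s, new in zip(scores, is_new) if not new and s > th)
        points.append((fa / n_old, miss / n_new))  # (P_FA, P_miss)
    return points

pts = det_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False],
                 thresholds=[0.0, 0.5, 1.0])
print(pts)  # [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
```

Sweeping the threshold trades misses against false alarms: the DET curve is exactly this list of (P_FA, P_miss) pairs.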
DET curves comparing the proposed system with the UMass FSD system
2. Minimal Normalized Cost
C_det = C_miss · P_miss · P_target + C_FA · P_FA · P_non-target

• C_miss and C_FA are the costs of a miss and a false alarm.
• P_miss and P_FA are the probabilities of a miss and a false alarm.
• P_target and P_non-target are the prior target and non-target probabilities.
• C_min is the minimal value of the normalized C_det over all threshold values.
• A lower value of C_min indicates better performance.
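The cost above can be computed directly. The constants below are the usual TDT defaults (C_miss = 1, C_FA = 0.1, P_target = 0.02), which is an assumption here; the normalization makes a trivial system that always answers "old" (or always "new") score 1.0.

```python
def normalized_det_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost: C_det divided by the cost of the best
    trivial system (always 'old' or always 'new')."""
    p_non = 1.0 - p_target
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non
    return c_det / min(c_miss * p_target, c_fa * p_non)

print(normalized_det_cost(p_miss=1.0, p_fa=0.0))  # 1.0: always answers "old"
print(normalized_det_cost(p_miss=0.0, p_fa=0.0))  # 0.0: a perfect system
```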
TDT5 Results – Minimal Normalized Cost
• No limit on the bucket size.
• Processing time per item was made constant.
TDT5 Results (contd..)
• Variance in the case of pure LSH = 0.046.
• Variance in the case of variance-reduced LSH = 0.004.
• The UMass system took 28 hours to complete the processing.
• The proposed system took only 2 hours.
Comparison of processing time for the proposed and the UMass system
TDT5 Results – Minimal Normalized Cost
• Bucket size is limited in terms of the percentage of the expected number of collisions.
E.g., a bucket size of 0.5 means that the number of documents in a bucket cannot exceed 50% of the expected number of collisions.
• Performance declines when the bucket size is limited, but remains reasonable even when the bucket size is reduced to 10% of the expected collisions.
Memory usage on a month of Twitter data.
• X-axis shows how long the system has been running for.
Twitter Experimental Setup
• Dataset
– Twitter data gathered over a period of six months.
– 163.5 million timestamped tweets, totaling over 2 billion tokens.
– Only ASCII characters were kept.
– Words beginning with "@" and "#" were stripped.
Twitter Experimental Setup ( contd..)
• Not evaluating the FSD system itself.
• Evaluating different methods of ranking threads, which are the outputs of an FSD system:
– To detect important events in a stream as noisy and unstructured as Twitter.
Twitter Experimental Setup ( contd..)
• Gold standard
– Human experts manually labeled tweets returned by the system.
– 3 labels:
• Event: a tweet which conveys what exactly happened, without requiring any prior knowledge about the event. The event referenced should be sufficiently important.
• Spam: automatic weather updates, radio station updates, etc.
• Neutral: everything that isn't an event or spam.
Twitter Experimental Setup ( contd..)
• Only the 1000 fastest growing threads were labeled.
• 820 tweets on which both the annotators agreed are considered as the gold standard.
Twitter Evaluation
• Evaluation is performed by computing average precision (AP) on the gold standard sorted according to different criteria.
• Test#1– Relevant documents – Event tweets– Non-relevant documents – Neutral and spam tweets.
• Test#2– Relevant documents – Event + Neutral tweets– Non-relevant documents – Spam tweets.
Ranking the threads
• Different ways of ranking the threads:
– Baseline: random ordering of threads.
– Size of the thread: threads are ranked according to the number of tweets.
– Number of users: threads are ranked according to the number of unique users posting in a thread.
– Entropy + users: if the entropy of a thread is < 3.5, move it to the back of the list; otherwise sort according to the number of unique users.
Ranking the threads ( contd..)
• The entropy of a thread is defined as:

H = − Σ_i (n_i / N) · log(n_i / N)

– n_i is the number of times word i appears in the thread.
– N = Σ_i n_i (the total number of words in the thread).
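The entropy above is straightforward to compute; a sketch (assuming, since the slide does not say, that the 3.5 cutoff refers to natural-log entropy):

```python
import math
from collections import Counter

def thread_entropy(tweets):
    """Word entropy of a thread: H = -sum_i (n_i/N) * log(n_i/N).
    Repetitive wording (e.g. automated updates) gives a low value."""
    counts = Counter(w for tweet in tweets for w in tweet.split())
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

spam = ["weather update sunny"] * 3
varied = ["earthquake strikes italy",
          "rescue teams arrive",
          "aftershocks reported overnight"]
print(thread_entropy(spam) < thread_entropy(varied))  # True
```

This is why low-entropy threads are pushed to the back of the ranking: near-identical repeated posts are a strong spam signal.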
• Not able to compare with the UMass FSD system, since the UMass system would not finish in any reasonable amount of time on the Twitter data.
Average precision chart
• Results of the second experiment are better.
Top ten fastest growing threads in our data.
Observations
• Celebrity deaths are the fastest-spreading news on Twitter.
– Steve Jobs's death broke the Twitter record with 10,000 tweets per second.
– Tweet count spiked soon after Osama bin Laden's death.
Questions..
• The language in the tweets is often misspelled and quite informal.
• Topic tags might provide richer information about a trending topic.
• The time complexity of the algorithm and of the sorting techniques is not mentioned.
• Entropy is usually a measure of disorder or randomness, i.e., the lower the entropy, the less the disorder.
– But the definition here implies that higher entropy values are better (less spam-like).
THANK YOU!