RT @deepti: #presentation Streaming First Story Detection with application to Twitter Sasa Petrovic, Miles Osborne, Victor Lavrenko

Streaming First Story Detection with application to Twitter

DESCRIPTION

A presentation made for a graduate course on Information retrieval based on the following paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.9438&rep=rep1&type=pdf

Page 1: Streaming First Story Detection with application to Twitter

RT @deepti: #presentation Streaming First Story Detection with

application to Twitter

Sasa Petrovic, Miles Osborne, Victor Lavrenko

Page 2: Streaming First Story Detection with application to Twitter

Agenda

1. Awesomeness of Twitter
2. Understanding the problem presented in this paper.
   - Streaming first story detection.
3. State of the art in FSD.
4. Proposed system.
5. Experiments.
   - Different datasets.
   - Evaluation metrics.
6. Results.
7. Observations.
8. Questions/Discussion.

Page 3: Streaming First Story Detection with application to Twitter

Social Media Tools

Page 4: Streaming First Story Detection with application to Twitter

Information explosion

Page 5: Streaming First Story Detection with application to Twitter

What makes twitter tick?

• Twitter and a few other social media tools are sometimes ahead of the newswire.
  – Ex#1: Protests during the Iranian elections in 2009. People posted news first on Twitter, which was later picked up by the broadcasting corporations.
  – Ex#2: The swine flu outbreak in the US. The US Centers for Disease Control and Prevention (CDC) used Twitter to post the latest updates on the pandemic.

Page 6: Streaming First Story Detection with application to Twitter

#Mumbai26/11

Page 7: Streaming First Story Detection with application to Twitter

#Mumbai26/11

• #mumbaiblasts RT @SamuraiSingh: Anyone from inorbit mall Malad heading towards Dindoshi in Goregaon east? I could do with a lift. #needhelp

• #needhelp RT @OhMyKohli: Need a lift from andheri west to bhandup nagar goregaon.

• RT @NabeelN: RT @splurgestar7: #NeedHelp #Mumbai #blasts I live 5 mins from Kabootar Khana.. Anyone needs help, please let me know!

• #here2help RT @Nakulsud: Stranded in the rain near Gandhi hospital.. No cabs. Anyone around? Call 9920722186..going to mahim #needhelp

• #NeedHelp Find @nikhilwarrier RT @jayblawgs @sukhkarni Demon Stealer Records is where he works.I'm not sure where it is. Trying Google Maps.

• B-ve donors needed tomorow after 10 am, KEM Hospital, Parel, #Mumbai contact the hosp. blood bank at 022-24135189/24107421 #needhelp

Page 8: Streaming First Story Detection with application to Twitter

Twitter leads to..

• Citizen Journalism
• Promotion
• Subjective opinion

Page 9: Streaming First Story Detection with application to Twitter

Categorizing twitter data

Page 10: Streaming First Story Detection with application to Twitter

Problem statement

To detect new events from a stream of Twitter posts.

Page 11: Streaming First Story Detection with application to Twitter

Topic Detection and Tracking

• An information filtering task.
• Focuses on organizing news documents.
• Subtasks
  – Story Segmentation.
  – Topic Tracking.
  – Topic Detection.
  – First-Story Detection.
  – Link Detection.

Page 12: Streaming First Story Detection with application to Twitter

Definitions

• An event is a unique thing that happens at some specific time and place.
  – Eg: Earthquake in Italy in April 2009.
• A topic is an event or activity, along with all directly related events or activities.
  – Eg: Elections, natural disasters etc.

Page 13: Streaming First Story Detection with application to Twitter

First-Story Detection

Is this the first story on a topic?

Page 14: Streaming First Story Detection with application to Twitter

FSD on Twitter data

• Challenges
  – Much higher volume of data.
  – High level of noise.
• Benefit
  – First-hand information on the impact of an event and how people reacted to it.

Page 15: Streaming First Story Detection with application to Twitter

First Story Detection – traditional approach

Streaming algorithm

Old

New

Time

Page 16: Streaming First Story Detection with application to Twitter

Nearest Neighborhood Approach

• Documents are represented as vectors in term space.

• Coordinates represent the frequency of a particular term in a document.

• Each new document is compared to the previous ones

• If (similarity to the closest previous document < threshold): first story detected.
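The nearest-neighbor rule above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization, raw term counts and a threshold of 0.5; the function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(documents, threshold=0.5):
    """Return indices of documents whose best match among earlier ones is below threshold."""
    seen = []            # vectors of all previous documents (grows without bound)
    first_stories = []
    for index, text in enumerate(documents):
        vector = Counter(text.lower().split())
        best = max((cosine_similarity(vector, old) for old in seen), default=0.0)
        if best < threshold:
            first_stories.append(index)
        seen.append(vector)
    return first_stories
```

Every new document is compared with all earlier ones, so both memory and time per document grow with the stream.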

Page 17: Streaming First Story Detection with application to Twitter

Allan et al.,2000

Page 18: Streaming First Story Detection with application to Twitter

Disadvantages of NN approach
• Not scalable to the Twitter streaming setting.
• Space and time requirements increase with incoming data.

Alternative - Approximate neighborhood search
• To find any point that lies within (1+ε)r distance of the query point.
  – r here is the distance to the nearest neighbor.
• One way to achieve this.
  – Locality sensitive hashing (LSH)

Page 19: Streaming First Story Detection with application to Twitter

Hash tables - definitions

Page 20: Streaming First Story Detection with application to Twitter

Hash tables - definitions

• Hash function
  – Mapping from the input value to a hash key.
• Hash key
  – Value returned by a hash function.
  – Identifies a bucket.
• Collision
  – When two or more input values are mapped to the same bucket.
  – More buckets -> fewer collisions.

Page 21: Streaming First Story Detection with application to Twitter

Locality Sensitive Hashing

• Hash each query point into buckets in such a way that the probability of collision is higher for points that are close together and decreases as the distance between the points grows.
  ⇒ Nearer points have a higher chance of being hashed into the same bucket.
• Points in the same bucket are inspected to find the nearest one.

Page 22: Streaming First Story Detection with application to Twitter

Locality Sensitive Hashing ( contd..)

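• Number of hyperplanes (k)
• The higher the value of k, the lower the probability of collision of non-similar points.
• For any two points x and y, a single random hyperplane places them on the same side with probability
  P(h(x) = h(y)) = 1 − θ(x, y)/π
  – θ(x, y): angle between x and y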

Page 23: Streaming First Story Detection with application to Twitter

Hash key values

• No of hyperplanes = No of bits in the hash key = k

• For each hyperplane u_i, i ∈ [1…k]:
  – If x · u_i < 0, set the i-th bit to 0.
  – Else set the i-th bit to 1.
• A higher k value leads to more computation.

0 1 0 0 1 1 0 1
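The bit-setting rule can be written as a short sketch. It assumes dense vectors as Python lists and hyperplane normals drawn from a standard Gaussian; `make_hyperplanes` and `hash_key` are hypothetical names used only for illustration.

```python
import random


def make_hyperplanes(k, dim, seed=0):
    """Draw k random hyperplane normals u_1..u_k in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hash_key(x, hyperplanes):
    """Build a k-bit key: bit i is 0 if x . u_i < 0, otherwise 1."""
    bits = []
    for u in hyperplanes:
        dot = sum(xi * ui for xi, ui in zip(x, u))
        bits.append("0" if dot < 0 else "1")
    return "".join(bits)
```

Only the sign of each dot product matters, so scaling a vector by a positive constant leaves its key unchanged, which fits a cosine-based notion of similarity.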

Page 24: Streaming First Story Detection with application to Twitter

Multiple hash tables

• To increase the chance that the nearest neighbor will collide with our point at least once.

• Each hash table has k independently chosen random hyperplanes.

• The number of hash tables L is chosen from δ, the acceptable probability of missing a nearest neighbor.
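One way to relate L to δ, assuming the k hyperplanes and the L tables are chosen independently: a neighbor that falls on the same side of one hyperplane with probability p_coll shares a full k-bit key with probability p_coll^k, so all L tables miss it with probability (1 − p_coll^k)^L. The sketch below picks the smallest L that brings this below δ. The function names and any values passed in are illustrative, not settings reported on the slides.

```python
import math


def collision_probability(angle):
    """Probability that one random hyperplane puts two points at this angle on the same side."""
    return 1.0 - angle / math.pi


def num_hash_tables(k, p_coll, delta):
    """Smallest L such that (1 - p_coll**k)**L <= delta."""
    p_same_key = p_coll ** k          # all k bits agree in one table
    return math.ceil(math.log(delta) / math.log(1.0 - p_same_key))
```

Raising k makes each table more selective but lowers p_coll^k, so more tables are needed to keep the miss probability at δ.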

Page 25: Streaming First Story Detection with application to Twitter

Variance Reduction Strategy

• LSH fails to find the true near neighbor when the query point lies far away from all other points.

• Another level of processing is added.

Page 26: Streaming First Story Detection with application to Twitter

Variance Reduction Strategy (contd..)

LSH scheme

New

Old

• Compare the query with a fixed number of most recent documents.

• Update the distance value if necessary.
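The second pass over recent documents can be sketched as follows. It assumes the caller already has the distance returned by LSH and keeps recent documents in a fixed-size queue; `refine_with_recent`, the `distance` callback and the window size are hypothetical.

```python
from collections import deque


def refine_with_recent(query, lsh_distance, recent, distance):
    """Return the smaller of the LSH distance and the distance to any recent document."""
    best = lsh_distance
    for doc in recent:
        best = min(best, distance(query, doc))
    return best


# A bounded window: appending beyond maxlen silently drops the oldest entry.
recent_window = deque(maxlen=3)
for doc in [1, 2, 3, 10]:
    recent_window.append(doc)
```

Because the window has a fixed size, this extra pass adds only a constant amount of work per document.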

Page 27: Streaming First Story Detection with application to Twitter
Page 28: Streaming First Story Detection with application to Twitter

Streaming First Story detection - Challenges

• Millions of new documents are published each hour

• The volume limits the amount of space and time we can spend on each document.
  – Cannot compare a new document with all documents returned by LSH.
  – Cannot store all the previous documents in main memory.
• Additional metadata – time stamp, topic tags etc.

Page 29: Streaming First Story Detection with application to Twitter

Desiderata for a streaming FSD system

• For each document say whether it discusses a previously unseen event and give confidence in this decision.

• Decision should be made in bounded time.
• Use bounded space.
• Only one pass over the data is allowed.
• Decision should be made immediately.

Page 30: Streaming First Story Detection with application to Twitter

Using the LSH system without bounds

• Number of documents in each bucket will grow without bound.

=> unbounded amount of space

• Number of comparisons also grows without bound.

Page 31: Streaming First Story Detection with application to Twitter

Constant space approach

• Limit the number of documents in a single bucket to a constant.– Remove the oldest document if the bucket is full.

• The document is removed only from one single bucket in one of the L hash tables.
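A bounded bucket can be sketched with a fixed-length queue per hash key. This is one possible data structure under the stated rule (drop the oldest entry when a bucket is full); the class name and capacity are hypothetical.

```python
from collections import defaultdict, deque


class BoundedHashTable:
    """One LSH table whose buckets hold at most bucket_capacity document ids."""

    def __init__(self, bucket_capacity):
        # deque(maxlen=...) evicts the oldest id automatically when full.
        self.buckets = defaultdict(lambda: deque(maxlen=bucket_capacity))

    def insert(self, key, doc_id):
        self.buckets[key].append(doc_id)

    def candidates(self, key):
        """Ids currently stored under this key, oldest first."""
        return list(self.buckets.get(key, ()))
```

Eviction happens per bucket and per table, so a document dropped from one table can still be found through the other tables.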

Page 32: Streaming First Story Detection with application to Twitter

Constant number of comparisons

• Limiting the number of documents might still result in a large number of comparisons.
  – A new document can collide with all the documents in a bucket.
• An additional limit is used to make a constant number of comparisons.

Page 33: Streaming First Story Detection with application to Twitter

Constant number of comparisons (contd..)

• Compare each new document with at most 3L documents that it most frequently collided with in all L hash tables.
• Let S be the set of all documents that collided with a new document in all L hash tables.
  – Order the elements in S according to the number of hash tables where the collision occurred.
  – Pick the first 3L elements of that ordered set and compare the new document only with them.
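The selection of at most 3L candidates can be sketched with a counter over the colliding documents. The input format (one list of colliding document ids per hash table) and the name `select_candidates` are assumptions made for illustration.

```python
from collections import Counter


def select_candidates(collisions_per_table, limit_factor=3):
    """Pick at most limit_factor * L documents, ordered by how many tables they collided in."""
    num_tables = len(collisions_per_table)
    counts = Counter()
    for bucket in collisions_per_table:
        counts.update(set(bucket))    # count each document once per table
    top = counts.most_common(limit_factor * num_tables)
    return [doc_id for doc_id, _ in top]
```

Only the returned documents are then compared with the new document, so the number of comparisons per document stays constant.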

Page 34: Streaming First Story Detection with application to Twitter

Detecting Events in Twitter Posts

• Not all tweets are actual stories.
  – Updates on personal life.
  – Spam.
  – Conversations.
  – Real stories.
• An important event – one that interests a larger population.

Page 35: Streaming First Story Detection with application to Twitter

Detecting Events – Threading

• Run the streaming FSD system and assign a novelty score to each tweet.
  – Score is based on the cosine distance to the nearest tweet.
  – Output <tweet, its novelty score, its nearest tweet>
• Tweet a links to tweet b if
  – b is the nearest neighbor of a
  – 1 − cos(a, b) < t (t ∈ [0.5, 0.6]).

Page 36: Streaming First Story Detection with application to Twitter

Threading (contd..)

• For each tweet a
  – If its nearest neighbor b is within distance t:
    • Assign a to the existing thread to which b belongs.
  – Else:
    • a is the first tweet in a new thread.
• Once we have threads of tweets
  – Can identify which threads grow fastest ⇒ news of a new event is spreading.
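The threading rule can be sketched as a single pass over the FSD output. The input is assumed to be a list of (tweet id, nearest tweet id, distance) triples in arrival order; `assign_threads` and the default t = 0.5 are illustrative choices within the range given above.

```python
def assign_threads(links, t=0.5):
    """Map each tweet id to the id of the first tweet in its thread."""
    thread_of = {}
    for tweet_id, nearest_id, distance in links:
        if nearest_id in thread_of and distance < t:
            # Close enough to an earlier tweet: join that tweet's thread.
            thread_of[tweet_id] = thread_of[nearest_id]
        else:
            # No sufficiently close neighbor: start a new thread.
            thread_of[tweet_id] = tweet_id
    return thread_of
```

Counting how many tweets map to each thread id over a time window then gives the growth rate used to find fast-growing threads.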

Page 37: Streaming First Story Detection with application to Twitter

Analysis of social media - Related Work

1. Luo et al. (2007) worked on new event detection in a large-scale streaming setting.
  – Used the traditional FSD approach and employed various heuristics.
  – Not a generalized approach; never showed the utility of their system on a large-scale task.

2. Saha and Getoor (2009) worked on the maximum coverage problem.
  – Select k blogs that maximize the cover of interests specified by the user.
  – 20 days of blog data totaling 2 million posts.

• This paper works on Twitter data for 6 months totaling over 160 million posts.

• This paper’s FSD approach is more generalized.

Page 38: Streaming First Story Detection with application to Twitter

Experiments

• Experiments used the English part of the TDT5 dataset, which consists of 221,306 documents from a time period spanning April 2003 to Sept 2003.

• Experiments were done in two stages:
  A) Test and compare the proposed FSD system to the state-of-the-art FSD system on the standard TDT5 dataset.
  B) Test different ranking methods on the output of the proposed FSD system applied to Twitter data.

Page 39: Streaming First Story Detection with application to Twitter

TDT5 Experimental Setup

• Aim
  – To test if the proposed system is on par with the best existing system (the UMass system in particular).
  – To accurately measure the speedup obtained over the existing system.
• Same settings as the UMass system
  – 1-NN clustering
  – Cosine as a similarity measure
  – TF-IDF weighted document representation
  – Top 300 features in each document.

Page 40: Streaming First Story Detection with application to Twitter

TDT5 Experimental Setup (contd..)

• LSH parameters
  – Higher k, more computation. Lower k, more collisions.
  – k (no. of hyperplanes) = 13.
  – Probability of missing a neighbor within the distance of 0.2 is less than 2.5%.
• The official TDT evaluation requires each system to assign a confidence score for its decision.
  – In our case, we assign the score as soon as the new story arrives.

Page 41: Streaming First Story Detection with application to Twitter

TDT5 Experiment Evaluation Metrics

1. Detection Error Tradeoff (DET) curves
  – A graphical plot of error rates for binary classification systems, plotting false reject rate vs. false accept rate.
  – DET provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.

Page 42: Streaming First Story Detection with application to Twitter

Plotting DET curves

1. Sort all stories according to their scores.
2. Perform a threshold sweep. For each threshold value:
  – Stories with a score greater than the threshold are considered new.
  – Calculate false alarm and miss probabilities.
    • False alarm - declaring a story new when it is not.
    • Miss - declaring a new story old.
3. Plot the values on a graph to show the trade-off between these two quantities.
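The threshold sweep can be sketched as below. It assumes one novelty score and one ground-truth flag per story, uses every distinct score as a threshold, and expects at least one new and one old story; `det_points` is a hypothetical helper name.

```python
def det_points(scores, is_new):
    """Return (threshold, miss probability, false alarm probability) for each distinct score."""
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    points = []
    for threshold in sorted(set(scores)):
        # Miss: a truly new story whose score does not exceed the threshold.
        misses = sum(1 for s, new in zip(scores, is_new) if new and s <= threshold)
        # False alarm: an old story whose score exceeds the threshold.
        false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s > threshold)
        points.append((threshold, misses / n_new, false_alarms / n_old))
    return points
```

Plotting the second value against the third for all thresholds gives the DET curve.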

Page 43: Streaming First Story Detection with application to Twitter

DET curves comparing the proposed system with the UMass FSD system

Page 44: Streaming First Story Detection with application to Twitter

2. Minimal Normalized Cost

Cdet = Cmiss × Pmiss × Ptarget + CFA × PFA × Pnon-target

Cmiss and CFA are costs of miss and false alarm.

Pmiss and PFA are probabilities of miss and false alarm.

Ptarget and Pnon-target are the prior target and non-target probabilities.

Cmin is the minimal value of Cdet over all threshold values.

Lower value of Cmin indicates better performance.
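A sketch of the cost computation, assuming the detection cost is the weighted sum of miss and false alarm terms, Cdet = Cmiss·Pmiss·Ptarget + CFA·PFA·Pnon-target, and that normalization divides by the cost of the better trivial system (always "new" or always "old"), as is usual in TDT evaluations. The cost and prior values are left as parameters because the slides do not state them.

```python
def normalized_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """Detection cost divided by the cost of the cheaper trivial decision."""
    p_non_target = 1.0 - p_target
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    return c_det / min(c_miss * p_target, c_fa * p_non_target)


def minimal_normalized_cost(points, c_miss, c_fa, p_target):
    """Minimum normalized cost over (p_miss, p_fa) pairs from a threshold sweep."""
    return min(normalized_cost(pm, pf, c_miss, c_fa, p_target) for pm, pf in points)
```

With this normalization a value of 1.0 means the system does no better than a trivial one, which is why lower values indicate better performance.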

Page 45: Streaming First Story Detection with application to Twitter

TDT5 Results – Minimal Normalized Cost

• No limit on the bucket size.
• Processing time per item was made constant.

Page 46: Streaming First Story Detection with application to Twitter

TDT5 Results (contd..)

• Variance in case of pure LSH = 0.046.
• Variance in case of variance-reduced LSH = 0.004.
• UMass system took 28 hours to complete the processing.
• The proposed system took only 2 hours.

Page 47: Streaming First Story Detection with application to Twitter

Comparison of processing time for the proposed and the UMass system

Page 48: Streaming First Story Detection with application to Twitter

TDT5 Results – Minimal Normalized Cost

• Bucket size is limited, expressed as a fraction of the expected number of collisions.
  – Eg: A bucket size of 0.5 means that the number of docs in a bucket cannot be more than 50% of the expected number of collisions.
• Performance declines when the bucket size is limited, but remains reasonable even when the bucket size is reduced to 10% of the expected number of collisions.

Page 49: Streaming First Story Detection with application to Twitter

Memory usage on a month of Twitter data.

• X-axis shows how long the system has been running for.

Page 50: Streaming First Story Detection with application to Twitter

Twitter Experimental Setup

• Dataset
  – Twitter data gathered over a period of six months.
  – 163.5 million timestamped tweets, totaling over 2 billion tokens.
• Only ASCII characters.
• Stripped the words beginning with “@” and “#”.

Page 51: Streaming First Story Detection with application to Twitter

Twitter Experimental Setup ( contd..)

• Not evaluating our FSD system.

• Evaluating different methods of ranking threads, which are the output of an FSD system.
  – To detect important events in a very noisy and unstructured stream such as Twitter.

Page 52: Streaming First Story Detection with application to Twitter

Twitter Experimental Setup ( contd..)

• Gold Standard
  – Human experts manually labeled tweets returned by the system.
  – 3 labels
    • Event – A tweet which conveys what exactly happened, without requiring any prior knowledge about the event. The event referenced should be sufficiently important.
    • Spam – Automatic weather updates, radio station updates etc.
    • Neutral – Everything that isn’t an event or spam.

Page 53: Streaming First Story Detection with application to Twitter

Twitter Experimental Setup ( contd..)

• Only the 1000 fastest growing threads were labeled.

• 820 tweets on which both the annotators agreed are considered as the gold standard.

Page 54: Streaming First Story Detection with application to Twitter

Twitter Evaluation

• Evaluation is performed by computing average precision (AP) on the gold standard sorted according to different criteria.

• Test#1
  – Relevant documents – Event tweets.
  – Non-relevant documents – Neutral and spam tweets.

• Test#2
  – Relevant documents – Event + Neutral tweets.
  – Non-relevant documents – Spam tweets.
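Average precision over a ranked list can be sketched as below, using the common definition (mean of the precision values at the ranks where a relevant item appears). The input is assumed to be a list of booleans in ranked order; the function name is illustrative.

```python
def average_precision(relevance):
    """Mean of precision@rank taken at the rank of every relevant item."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

For Test#1 an entry would be True only for Event tweets; for Test#2 it would be True for Event and Neutral tweets.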

Page 55: Streaming First Story Detection with application to Twitter

Ranking the threads

• Different ways of ranking the threads
  – Baseline: random ordering of threads.
  – Size of the thread: threads are ranked according to the number of tweets.
  – Number of users: threads are ranked according to the number of unique users posting in a thread.
  – Entropy + users: if the entropy of a thread is < 3.5, move it to the back of the list; otherwise sort according to the number of unique users.
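The "Entropy + users" ordering can be sketched with a two-part sort key. Each thread is assumed to be a dict with a precomputed `entropy` value and a set of `users`; these field names are hypothetical.

```python
def rank_threads(threads, min_entropy=3.5):
    """Low-entropy threads go to the back; the rest are sorted by unique users, descending."""
    return sorted(
        threads,
        key=lambda th: (th["entropy"] < min_entropy, -len(th["users"])),
    )
```

The first key element is False for threads that pass the entropy check, so they sort ahead of all low-entropy threads regardless of user count.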

Page 56: Streaming First Story Detection with application to Twitter

Ranking the threads ( contd..)

• Entropy is defined as:
  Entropy = − Σ_i (n_i / N) log(n_i / N)
  – n_i is the number of times word i appears in a thread.
  – N = Σ_i n_i (total number of words in a thread)

• Not able to compare with UMass FSD system since UMass system would not finish in any reasonable amount of time on twitter data.
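The entropy of a thread's word distribution, −Σ (n_i/N) log(n_i/N), can be computed as in the sketch below. Whitespace tokenization and the natural logarithm are assumptions; the slides do not state the log base.

```python
import math
from collections import Counter


def thread_entropy(tweets):
    """Entropy of the word distribution over all tweets in one thread."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A thread that repeats the same few words has low entropy, while a thread with varied wording has higher entropy.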

Page 57: Streaming First Story Detection with application to Twitter

Average precision chart

• Results of the second experiment are better.

Page 58: Streaming First Story Detection with application to Twitter

Top ten fastest growing threads in our data.

Page 59: Streaming First Story Detection with application to Twitter

Observations

• Celebrity deaths are the fastest spreading news on Twitter.
  – Steve Jobs’s death broke the Twitter record with 10,000 tweets per second.
  – Tweet count soon after Osama bin Laden’s death.

Page 60: Streaming First Story Detection with application to Twitter

Questions..

• The language in the tweets – often misspelt, quite informal.

• Topic tags might provide richer information about the trending topic.

• Time complexity of the algorithm and of the sorting techniques not mentioned.

• Entropy is usually a measure of disorder or randomness, i.e., the lower the entropy, the lower the disorder.
  – But the definition here says that higher entropy values are better.

Page 61: Streaming First Story Detection with application to Twitter

THANK YOU!