Detecting Trends Through Twitter Stream v2

Detecting Trends through Twitter Stream

Neil Marion dela Cruz, BSCS

Institute of Computer Science

University of the Philippines Los Banos

Introduction

Twitter and Trending Topics

Twitter is currently the major microblogging service, with more than 11 million active users.

Trending Topics, on the other hand, is a list that Twitter provides on its homepage. A trend is a type of search: a trending term is simply one that appears at a higher frequency than other terms over a set amount of time.

What Makes A Trend?

Twitter tracks the volume of terms mentioned on Twitter on an ongoing basis. A topic breaks into the Trends list when the volume of tweets about it at a given moment increases dramatically.

The equation above implies that a term whose tweet frequency increases dramatically within a relatively short amount of time tends to receive a high trend score.

The Novelty Over Popularity Philosophy

Consider, for example, #justinbieber and #diablo3release, two terms that have each trended at least once. There would certainly be a sudden surge of tweets for #diablo3release compared to #justinbieber. Diablo III is a highly anticipated computer game, so it is almost certain to gain a great deal of attention worldwide right after its release. On the other hand, #justinbieber, associated with the famous young pop artist, will not trend as strongly as #diablo3release, or even come close, because the topic has already been popular for a long time. In other words, #justinbieber has always been popular among tweeters, so it is safe to conclude that the topic will not produce a trend.

Twitter and Trending Topics

Studying Twitter's trending topics mechanism can aid future research that involves extracting trends from different types of media. Extracting trends is becoming a necessity for social networking media. This study can therefore help social networking application enthusiasts and developers implement their own trending topics module.

What Will Be Presented

In this paper, we present a method of extracting trending topics from the Twitter stream, using the Twitter API to obtain the stream, the Z-score as the scoring method, and Lossy-Counting as the streaming algorithm.

Objectives

General

The main objective of this study is to reproduce Twitter's Trending Topics module.

Specific

To acquire tweets through Twitter's streaming API

To apply the appropriate streaming algorithm that will determine the most frequent terms in the Tweet stream

To formulate a novelty measurement for trends that takes into consideration the frequency/time element of the terms

To create an application that will implement and help analyze trending topics.

Materials and Methods

The Twitter Stream

Twitter provides an API that lets anyone download a stream of data. The stream contains actual tweets posted by Twitter users around the world at the time of streaming. It provides public statuses from all users, which can be filtered in several ways: by user ID, by keyword, or by geographic location. This study needs a stream produced by random sampling, and the API provides one.

The Scoring Method

To determine which terms trend, we need a robust scoring method. As discussed earlier, Twitter trends are determined by their degree of novelty. And from this we can conclude that

We propose to use the standard score (also called the Z-score) as our scoring method. The standard score is

$$ z = \frac{x - \mu}{\sigma} $$

where x is a raw score to be standardized, μ is the mean of the population, and σ is the standard deviation of the population.

In the case of scoring tweet trends, x is the currently observed frequency of a term, μ is the mean of all the historical frequencies of the term, and σ is the standard deviation of the historical frequencies of the term.
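As a rough illustration of how the standard score favors novelty over popularity, the sketch below scores two terms against their own histories. The function name `trend_score`, the per-minute counts, and the handling of a zero standard deviation are all assumptions made for this example; they are not taken from the study's implementation.

```python
from statistics import mean, pstdev


def trend_score(history, current):
    """Return the standard score of `current` against a term's past frequencies."""
    mu = mean(history)       # mean of the historical frequencies
    sigma = pstdev(history)  # population standard deviation of the history
    if sigma == 0:
        # Assumption: a perfectly flat history gives no spread to compare
        # against, so we return 0.0 instead of dividing by zero.
        return 0.0
    return (current - mu) / sigma


# Hypothetical per-minute tweet counts, invented for illustration only.
steady_history = [980, 1010, 1000, 990, 1020]  # a term that is always popular
bursty_history = [2, 0, 1, 3, 4]               # a term that was nearly silent

steady = trend_score(steady_history, 1030)  # popular, but only slightly above its norm
bursty = trend_score(bursty_history, 60)    # small in volume, but far above its norm

print(f"steady term: {steady:.2f}")
print(f"bursty term: {bursty:.2f}")
```

Under these made-up numbers, the bursty term scores far higher than the steady one even though its raw frequency is much lower, which is the behavior the scoring method is meant to capture.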

The Streaming Algorithm

The Lossy-Counting streaming algorithm will be used to deal with the very large volume of the stream. The algorithm stores tuples, each comprising an item, a lower bound on its count, and a delta (Δ) value that records the difference between the upper bound and the lower bound. When processing the i-th item in the stream, if information is currently stored about the item, then its lower bound is increased by one; otherwise, a new tuple for the item is created with the lower bound set to one and Δ set to ⌊i/k⌋. Periodically, all tuples whose upper bound is less than ⌊i/k⌋ are deleted. These are correct upper and lower bounds on the count of each item, so at the end of the stream, all items whose count exceeds n/k must be stored.

Processing the Twitter Stream

Results and Discussion

The program was run six times, on different days, with each run lasting roughly 400 to 600 minutes. Over each run, we identified all the terms that occurred around 90% of the time, that is, terms with a nonzero frequency in almost every minute. These terms are LOL, LOVE, PHOTO, THE, and YOU.

From this, it can be concluded that these are the most frequent terms on Twitter. Figure 2 shows that the frequency of these terms is consistently high. In addition, these terms have a high percentage of occurrences, as shown in Table 5.
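One way to read "occurring around 90% of the time" is as the fraction of one-minute windows in which a term appears at least once. The sketch below computes that fraction. The function name `presence_ratio`, the data layout (one dictionary of term counts per minute), and the sample counts are assumptions made for illustration, not the study's data.

```python
def presence_ratio(per_minute_counts, term):
    """Fraction of one-minute windows in which `term` occurred at least once."""
    if not per_minute_counts:
        return 0.0
    hits = sum(1 for counts in per_minute_counts if counts.get(term, 0) > 0)
    return hits / len(per_minute_counts)


# Hypothetical ten-minute run, one dictionary of term counts per minute.
minutes = [{"lol": 5, "love": 2}] * 6 + [{"lol": 3}] * 3 + [{"photo": 1}]

# Terms present in at least 90% of the windows.
always_on = [t for t in ("lol", "love", "photo") if presence_ratio(minutes, t) >= 0.9]
print(always_on)
```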

Conclusions and Future Work

Our experiment showed that our trending topics algorithm behaves in a way that is consistent with Twitter's novelty-over-popularity philosophy. The algorithm can be used on data sets other than tweets, and those who wish to develop their own microblogging sites are free to extend it.