TI: AN EFFICIENT INDEXING MECHANISM FOR REAL-TIME SEARCH ON TWEETS SIGMOD ‘11 C. CHEN ET AL Pete Bohman Adam Kunk



Page 2: What is real-time search?

What do you think as a class?

Page 3: Real-Time Search

Definition: A search mechanism capable of finding information in an online fashion, as it is produced.

Page 4: Real-Time Search

In terms of real-time search, what does “online” mean?

Online means that a constant stream of input data is handled as it enters the system, in contrast to batch processing.

Bing Social Search
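The difference between online and batch handling can be illustrated with a small sketch. It is not taken from the TI paper; the function names `batch_count` and `online_count` and the sample posts are hypothetical, chosen only to show that an online algorithm has an up-to-date answer after every arriving item.

```python
def batch_count(posts, keyword):
    """Batch: wait for the full collection, then answer once."""
    return sum(keyword in text.lower().split() for text in posts)


def online_count(stream, keyword):
    """Online: update the answer after every arriving post."""
    count = 0
    for text in stream:
        if keyword in text.lower().split():
            count += 1
        yield count  # an up-to-date answer is available immediately


# Hypothetical sample data.
posts = ["Earthquake in Tokyo", "Tokyo traffic", "earthquake drill today"]

running = list(online_count(iter(posts), "earthquake"))  # one answer per post
total = batch_count(posts, "earthquake")                  # one answer at the end
```

Both approaches agree once all input has been seen; the online version simply never has to wait for the end of the input.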

Page 5: Real-Time Search Input Data

Example of the kind of input data considered for real-time search systems:

twittervision

Page 6: Real-time content

Microblogging: an entirely new type of data

1. Short temporal life span
2. Little to no context
3. Simple ideas, fast reporting of events
4. Metadata: time, location, social links
5. Less factual, more opinionated
6. Static posts
7. Furious input rate
8. Often no hyperlink structure, few traditional ranking factors

Page 7: Real-time vs. Conventional Search

Conventional search ranking
○ Relevance
○ Authority

Real-time search ranking
○ Relevance
○ Temporal immediacy
○ Popularity

Page 8: Real-time vs. Conventional Search

Conventional search input
○ Crawl the web periodically and update the index
○ Web documents evolve
○ Incapable of crawling and indexing the entire web in real time

Real-time search input
○ Stream of data
○ No need to poll since the posts are static

What can we do with real-time search engines?

Page 9: Query analysis

Collecta real-time search engine
○ Analyzed ~1 million queries

Continuous queries
○ Monitor events by frequently resubmitting the same query

Different query categories

Conventional    Real-Time
Shopping        Commerce
Entertainment   Travel
Adult           Economy

Page 10: Value of real-time search

The estimated value of real-time search is around $33 million
○ Value derived from the types of queries entered in real-time search systems
○ AdWords used to determine the worth of keywords appearing in queries

Page 11: Applications of real-time search

TwitterStand: real-time news reports
○ Crowdsourcing of first-hand reports
○ Example: coverage of MJ’s death

Page 12: Applications of real-time search

Real-time alert systems
○ Leverage tweet metadata (time, location) to raise alerts
○ Earthquake localization based on tweets

Page 13: Twitter Real-Time Alerts

USGS Twitter Earthquake Detector

Page 14: Difficulties of Real-Time Search

Two factors:
○ Efficient indexing in order to provide fast results
○ Effective ranking in order to return relevant results

Page 15: Indexing Background

RDBMS indexing
○ Indexes are built on columns commonly used in queries
○ Improves the speed of retrieval operations

Conventional search (inverted) indexing
○ Crawl the web for documents
○ Map keywords to the documents containing those keywords
○ Unstructured data
○ If a document does not exist in the index, it will not appear in query results
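As a minimal sketch of the inverted-index idea described above, the following maps each keyword to the documents that contain it. The tokenization (lowercase, whitespace split), the AND query semantics, and the names `build_inverted_index` and `search` are assumptions made for illustration; a production search engine would do considerably more.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return ids of documents containing every query keyword (AND semantics)."""
    words = query.lower().split()
    if not words:
        return []
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
documents = {
    1: "real time search on tweets",
    2: "conventional web search",
    3: "indexing tweets in real time",
}
index = build_inverted_index(documents)
hits = search(index, "real time")     # documents 1 and 3
misses = search(index, "facebook")    # keyword never indexed, so no results
```

The empty result for a keyword that was never indexed mirrors the last point on the slide: content that is not in the index cannot appear in query results.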

Page 16: Real-Time Search Indexing

Index a stream of data
○ Map keywords to the tweets containing those keywords

Challenge
○ Processing the stream in a timely manner
○ 5,000 tweets per second

Page 17: TI Indexing

Not feasible to index every incoming tweet immediately

Selective indexing based on the tweets most likely to appear in query results
○ Distinguished tweets are indexed in real time
○ Noisy tweets are indexed by a batch process

Page 18: TI Tweet Classification

Observation
○ Users are only interested in the top-K results for a query

Distinguished tweet
○ A tweet that belongs in the top-K result set of a previous query

Noisy tweet
○ A tweet that does not appear in the top-K results of any of the system’s previous queries
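The classification above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the scoring function is a toy word-overlap count, `K` is set to an arbitrary small value, and the class name `SelectiveIndexer` and its methods are hypothetical. A tweet counts as distinguished if it would enter the current top-K of at least one remembered query; otherwise it is queued for batch indexing.

```python
import heapq
from collections import defaultdict

K = 2  # size of the top-K result set (hypothetical value)


def score(query, tweet_text):
    """Toy relevance score: number of query words present in the tweet."""
    words = set(tweet_text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)


class SelectiveIndexer:
    def __init__(self, past_queries):
        # For each remembered query, a min-heap of the scores in its top-K.
        self.top_k = {q: [] for q in past_queries}
        self.index = defaultdict(list)  # inverted index, updated in real time
        self.pending = []               # noisy tweets awaiting the batch job

    def _is_distinguished(self, text):
        distinguished = False
        for query, heap in self.top_k.items():
            s = score(query, text)
            if s == 0:
                continue
            if len(heap) < K:
                heapq.heappush(heap, s)
                distinguished = True
            elif s > heap[0]:
                heapq.heapreplace(heap, s)  # evict the weakest top-K entry
                distinguished = True
        return distinguished

    def _index(self, tweet_id, text):
        for word in set(text.lower().split()):
            self.index[word].append(tweet_id)

    def receive(self, tweet_id, text):
        """Index distinguished tweets immediately; defer noisy ones."""
        if self._is_distinguished(text):
            self._index(tweet_id, text)
        else:
            self.pending.append((tweet_id, text))

    def run_batch(self):
        """Periodic batch job that indexes the deferred noisy tweets."""
        for tweet_id, text in self.pending:
            self._index(tweet_id, text)
        self.pending.clear()


indexer = SelectiveIndexer(["earthquake tokyo"])
indexer.receive(1, "earthquake hits tokyo")    # enters the top-K
indexer.receive(2, "lunch was great")          # matches no query: noisy
indexer.receive(3, "tokyo weather")            # fills the second top-K slot
indexer.receive(4, "tokyo trains")             # does not beat the top-K: noisy
indexer.receive(5, "earthquake tokyo update")  # displaces a weaker entry
```

Until `run_batch()` is called, tweets 2 and 4 are not searchable; that delay is the price paid for keeping the real-time path cheap.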

Page 19: TI Indexing

Must limit the size of the query set
○ 1.6 billion Twitter queries per day

Page 20: Query set optimization

Observation
○ 20% of queries represent 80% of user requests

Therefore
○ Zipf’s distribution is used to statistically limit the number of queries that tweets are compared against
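One way to picture the query-set reduction is to keep only the most frequent queries until they account for a target share of all requests. The sketch below assumes a simple frequency count over a query log; the function name `select_hot_queries`, the `coverage` parameter, and the sample log are hypothetical, and the paper's actual selection procedure may differ.

```python
from collections import Counter


def select_hot_queries(query_log, coverage=0.8):
    """Return the most frequent queries that together cover `coverage` of requests."""
    counts = Counter(query_log)
    total = sum(counts.values())
    hot, covered = [], 0
    for query, freq in counts.most_common():
        if covered >= coverage * total:
            break
        hot.append(query)
        covered += freq
    return hot


# Hypothetical log with a skewed (Zipf-like) frequency distribution.
query_log = (["news"] * 50 + ["weather"] * 30 + ["sports"] * 10
             + ["music"] * 6 + ["art"] * 4)
hot_queries = select_hot_queries(query_log)
```

With this skewed log, two of the five distinct queries already cover 80% of the requests, so incoming tweets only need to be compared against those two.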

Page 21: Real-Time Search Ranking

How does ranking differ from traditional web ranking?
○ There are no social relationships in traditional web pages
○ Typical web search engines rank based on links to a site and links from a site
○ Website links are not the same as social networking links

Page 22: Real-Time Search Ranking

Ranking is not necessary in RDBMS systems
○ RDBMS systems do not favor certain data over others based on select criteria
○ RDBMS systems essentially treat all data contained in the database the same

Page 23: TI Ranking

Ranking function composed of:
1) User’s PageRank
○ Combination of user weight (defaulted to 1) and how many followers they have (popularity)
2) Timestamp (self-explanatory)
3) Similarity between the tweet and the query

Page 24: TI Ranking

Ranking function also composed of:
4) Popularity of the topic
○ Determined by large tweet trees
○ Popularity of a tree is equal to the sum of the U-PageRank values of all tweets in the tree

Tweet Tree Structure
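The four ranking signals can be put together in a rough sketch. Only the tree popularity rule (sum of the U-PageRank values of all tweets in the tree) comes from the slides. Everything else here is an assumption for illustration: the `Tweet` class, treating child tweets as replies, the toy word-overlap similarity, the weights, and the choice to divide a weighted sum by the tweet's age are hypothetical and are not claimed to be the paper's formula.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    text: str
    user_pagerank: float  # U-PageRank of the author
    timestamp: float      # seconds since some epoch
    replies: list = field(default_factory=list)  # child tweets in the tree


def tree_popularity(root):
    """Sum the U-PageRank values of all tweets in the tree rooted at `root`."""
    total, stack = 0.0, [root]
    while stack:
        node = stack.pop()
        total += node.user_pagerank
        stack.extend(node.replies)
    return total


def similarity(query, text):
    """Toy similarity: fraction of query words that occur in the tweet."""
    q = query.lower().split()
    words = set(text.lower().split())
    return sum(w in words for w in q) / len(q) if q else 0.0


def rank_score(query, tweet, root, now, w_user=1.0, w_sim=1.0, w_pop=1.0):
    """Hypothetical combination: weighted sum of the signals, damped by age."""
    age = max(now - tweet.timestamp, 1.0)  # avoid division by zero
    signal = (w_user * tweet.user_pagerank
              + w_sim * similarity(query, tweet.text)
              + w_pop * tree_popularity(root))
    return signal / age


# Hypothetical tweet tree: one root tweet with two replies.
reply_a = Tweet("so scary", 0.5, 110.0)
reply_b = Tweet("stay safe", 0.25, 120.0)
root = Tweet("earthquake in tokyo", 2.0, 100.0, [reply_a, reply_b])

popularity = tree_popularity(root)
fresh = rank_score("tokyo earthquake", root, root, now=110.0)
stale = rank_score("tokyo earthquake", root, root, now=200.0)
```

Under these assumptions the same tweet scores higher when the query arrives soon after it was posted, which is the "temporal immediacy" factor that separates real-time ranking from conventional ranking.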

Page 25: TI Ranking Comparison

TI Rank vs. Time Rank

Page 26: What are others doing?

Page 27: What are others doing?

Facebook Real-Time Feed

Page 28: Implications/Conclusion

Real-time search engines must provide:
○ “Online” algorithms to handle constant input
○ Relevant search results

Results of a query are no longer static

Page 29: Implications/Conclusion

TI makes use of two concepts in its real-time search of Twitter:

Selective indexing
○ A form of partial indexing; the system can’t afford to index every incoming tweet due to the large volume of input

Ranking
○ Ranking is a known technique, but microblogging applications call for new ranking algorithms

Page 30: References

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
http://www.comp.nus.edu.sg/~ooibc/sigmod11ti.pdf

Real Time Search User Behavior
http://faculty.ist.psu.edu/jjansen/academic/jansen_real_time_search.pdf

TwitterRank: Finding Topic-Sensitive Influential Twitterers
http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1503&context=sis_research

Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
http://ymatsuo.com/papers/www2010.pdf

TwitterStand: News in Tweets
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.1477&rep=rep1&type=pdf

Learning Effective Ranking Functions for Newsgroup Search
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.5556&rep=rep1&type=pdf

TwitterSearch: A Comparison of Microblog Search and Web Search
http://www.stanford.edu/~dramage/papers/twitter-wsdm11.pdf

TwitterVision
http://twittervision.com/

Bing Social
http://www.bing.com/social

Real time search on the web: Queries, topics, and economic value
http://collecta.com/RealTimeSearch.pdf

Page 31: Discussion Questions

1) What do you think is the most innovative technique in the TI approach that led to real-time microblog search results?

Page 32: Discussion Questions

2) Given the partial indexing optimization provided in the paper, how do you think Google could optimize their indexing algorithm in order to capture the newest content on the web?

Page 33: Discussion Questions

3) TI makes use of a ranking function in order to select tweets based on various user characteristics. What would you change about the ranking function, if anything?