TI: AN EFFICIENT INDEXING MECHANISM FOR REAL-TIME SEARCH ON TWEETS
SIGMOD ’11
C. CHEN ET AL.
Pete Bohman, Adam Kunk
What is real-time search?
What do you think as a class?
Real-Time Search
Definition: A search mechanism capable of finding information in an online fashion as it is produced.
Real-Time Search
In terms of real-time search, what does “online” mean?
Online means that a constant stream of input data is handled as it enters the system, in contrast to batch processing.
Bing Social Search
Real-Time Search Input Data
Example of the kind of input data considered for real-time search systems:
twittervision
Real-time content
Microblogging - Entirely new type of data
1. Short temporal life span
2. Little to no context
3. Simple ideas, fast reporting of events
4. Metadata: time, location, social links
5. Less factual, more opinionated
6. Static posts
7. Furious input rate
8. Often no hyperlink structure, few traditional ranking factors
Real-time vs. Conventional Search
Conventional Search Ranking
○ Relevance
○ Authority
Real-time Search Ranking
○ Relevance
○ Temporal immediacy
○ Popularity
Real-time vs. Conventional Search
Conventional search input
Crawl the web periodically and update index
○ Web documents evolve
Incapable of crawling and indexing the entire web in real-time
Real-time search input
Stream of data
No need to poll since the posts are static
What can we do with real-time search engines?
Query analysis
Collecta real-time search engine
Analyzed ~1 million queries
Continuous queries
○ Monitor events by frequently resubmitting the same query
Different query categories
Conventional     Real-Time
Shopping         Commerce
Entertainment    Travel
Adult            Economy
Value of real-time search
The estimated value of real-time search is around $33 million
Value derived from types of queries entered in real-time search systems
Utilized AdWords to determine worth of keywords appearing in queries
Applications of real-time search
TwitterStand: Real-time news reports
Crowdsourcing of first-hand reports
Example: Coverage of MJ’s death
Applications of real-time search
Real-time alert systems
Leverage tweet metadata (time, location) to raise alerts
Earthquake localization based on tweets
Twitter Real-Time Alerts
USGS Twitter Earthquake Detector
Difficulties of Real-Time Search
Two factors:
Efficient indexing in order to provide for fast results
Effective ranking in order to return relevant results
Indexing Background
RDBMS Indexing
Indexes built on columns commonly used in queries
Improves the speed of retrieval operations
Conventional Search (Inverted) Indexing
Crawl the web for documents
Map keywords to documents containing those keywords
Non-structured data
If a document does not exist in the index, it will not appear in query results
Real-Time Search Indexing
Index stream of data
Map keywords to tweets containing those keywords
Challenge
Processing the stream in a timely manner
○ 5,000 tweets per second
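The keyword-to-tweet mapping above can be sketched as a toy in-memory inverted index. This is only an illustration: the class name `TweetIndex`, whitespace tokenization, and AND-only matching are assumptions, not details taken from the TI paper.

```python
from collections import defaultdict


class TweetIndex:
    """Toy inverted index (hypothetical): keyword -> set of tweet ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.tweets = {}

    def add(self, tweet_id, text):
        # Index the tweet as it arrives from the stream.
        self.tweets[tweet_id] = text
        for keyword in set(text.lower().split()):
            self.postings[keyword].add(tweet_id)

    def search(self, query):
        # Return ids of tweets that contain every keyword of the query.
        keywords = query.lower().split()
        if not keywords:
            return set()
        result = set(self.postings.get(keywords[0], set()))
        for keyword in keywords[1:]:
            result &= self.postings.get(keyword, set())
        return result


index = TweetIndex()
index.add(1, "earthquake in tokyo")
index.add(2, "tokyo weather today")
index.add(3, "earthquake alert")
print(index.search("tokyo earthquake"))
```

A tweet that was never added cannot be returned, which mirrors the point that content missing from the index does not appear in query results.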
TI Indexing
Not feasible to index every incoming tweet immediately
Selective indexing based on which tweets are most likely to appear in query results
Distinguished tweets indexed in real time
Noisy tweets indexed by batch process
TI Tweet Classification
Observation
Users are only interested in top-K results for a query
Distinguished tweet
A tweet that belongs in the top-K result set of a previous query
Noisy tweet
A tweet that does not appear in the top-K results for any of the system’s previous queries
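A minimal sketch of this classification, under stated assumptions: `score` is a hypothetical stand-in (keyword overlap only), `K = 2` is arbitrary, and the list and queue stand in for the real-time index and the batch job. TI's actual scoring and data structures are not given on these slides.

```python
import heapq
from collections import deque

K = 2  # hypothetical top-K size


def score(query, tweet_text):
    """Hypothetical stand-in score: number of query keywords in the tweet."""
    words = set(tweet_text.lower().split())
    return sum(1 for kw in query.lower().split() if kw in words)


class SelectiveIndexer:
    def __init__(self, queries, k=K):
        self.k = k
        # Per previous query: min-heap of the current top-K (score, tweet_id).
        self.top_k = {q: [] for q in queries}
        self.indexed_now = []       # distinguished tweets, indexed immediately
        self.batch_queue = deque()  # noisy tweets, left for a later batch job

    def _enters_top_k(self, query, tweet_id, text):
        s = score(query, text)
        if s == 0:
            return False
        heap = self.top_k[query]
        if len(heap) < self.k:
            heapq.heappush(heap, (s, tweet_id))
            return True
        if s > heap[0][0]:
            heapq.heapreplace(heap, (s, tweet_id))
            return True
        return False

    def on_tweet(self, tweet_id, text):
        # Check every query (no short-circuit) so each heap stays current.
        hits = [self._enters_top_k(q, tweet_id, text) for q in self.top_k]
        if any(hits):
            self.indexed_now.append(tweet_id)
            return "distinguished"
        self.batch_queue.append(tweet_id)
        return "noisy"


indexer = SelectiveIndexer(["tokyo earthquake", "world cup"])
labels = [
    indexer.on_tweet(1, "earthquake hits tokyo"),
    indexer.on_tweet(2, "my lunch was great"),
    indexer.on_tweet(3, "world cup final tonight"),
]
print(labels)
```

The cost of classifying each tweet grows with the number of previous queries it is compared against, which is why the size of the query set matters.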
TI Indexing
Must limit the size of the query set
1.6 billion Twitter queries per day
Query set optimization
Observation
20% of queries represent 80% of user requests
Therefore
Zipf’s distribution is used to statistically limit the number of queries that tweets are compared against
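As a rough illustration of the 80/20 observation, the sketch below keeps only the most frequent queries until they cover a target share of all requests. It uses a plain empirical frequency cutoff rather than fitting a Zipf distribution; the function name `frequent_queries` and the sample log are hypothetical.

```python
from collections import Counter


def frequent_queries(query_log, coverage=0.8):
    """Return the most frequent queries, in descending frequency, until
    their combined count reaches `coverage` of all requests."""
    counts = Counter(query_log)
    total = sum(counts.values())
    kept, covered = [], 0
    for query, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.append(query)
        covered += n
    return kept


# Hypothetical log: a few queries account for most of the requests.
log = ["news"] * 7 + ["weather"] * 2 + ["cats"]
print(frequent_queries(log))
```

Incoming tweets would then be compared only against the kept queries instead of the full query history.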
Real-Time Search Ranking
How does ranking differ from traditional web ranking?
There are no social relationships in traditional web pages
Typical web search engines rank based on links to a site, and links from a site
Website links are not the same as social networking links
Real-Time Search Ranking
Ranking is not necessary in RDBMS systems
RDBMS systems do not favor certain data over others based on select criteria
RDBMS systems treat essentially all data contained in the database the same
TI Ranking
Ranking function comprised of:
1) User’s PageRank
○ Combination of user weight (defaulted to 1) and how many followers they have (popularity)
2) Timestamp (self-explanatory)
3) Similarity between tweet and the query
TI Ranking
Ranking function also comprised of:
4) Popularity of the topic
Determined by large tweet trees
Popularity of tree is equal to the sum of the U-PageRank values of all tweets in the tree
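The tree popularity rule and the four ranking factors can be sketched as follows. Several parts are assumptions: each tweet is taken to carry its author's U-PageRank value, the factors are blended linearly with equal weights, and the timestamp is turned into a freshness term. The slides do not give the paper's actual formula, so `rank_score` is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class TweetNode:
    tweet_id: int
    u_pagerank: float  # assumed: the author's U-PageRank value
    replies: list = field(default_factory=list)


def tree_popularity(root):
    """Sum of the U-PageRank values of all tweets in the tree at `root`."""
    total, stack = 0.0, [root]
    while stack:
        node = stack.pop()
        total += node.u_pagerank
        stack.extend(node.replies)
    return total


def rank_score(u_pagerank, age_seconds, similarity, popularity,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical linear blend of the four factors (not the paper's formula)."""
    w_user, w_time, w_sim, w_pop = weights
    freshness = 1.0 / (1.0 + age_seconds)  # newer tweets score higher
    return (w_user * u_pagerank + w_time * freshness
            + w_sim * similarity + w_pop * popularity)


root = TweetNode(1, 2.0, [TweetNode(2, 1.0),
                          TweetNode(3, 0.5, [TweetNode(4, 1.5)])])
popularity = tree_popularity(root)
print(popularity, rank_score(1.0, 0.0, 0.5, popularity))
```

With all other factors equal, an older tweet gets a lower score than a fresh one, which reflects the temporal immediacy requirement of real-time ranking.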
Tweet Tree Structure
TI Ranking Comparison
TI Rank vs. Time Rank
What are others doing?
What are others doing? Facebook
Real-Time Feed
Implications/Conclusion
Real-time search engines must provide:
“Online” algorithms to handle constant input
Relevant search results
Results of a query are no longer static
Implications/Conclusion
TI makes use of two concepts in its real-time search of Twitter:
Selective Indexing
○ Form of partial indexing; can’t afford to index every incoming tweet due to the large volume of input
Ranking
○ Ranking is a known technique, but microblogging applications provide new ranking algorithms
References
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
http://www.comp.nus.edu.sg/~ooibc/sigmod11ti.pdf
Real Time Search User Behavior
http://faculty.ist.psu.edu/jjansen/academic/jansen_real_time_search.pdf
TwitterRank: Finding Topic-Sensitive Influential Twitterers
http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1503&context=sis_research
Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
http://ymatsuo.com/papers/www2010.pdf
TwitterStand: News in Tweets
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.1477&rep=rep1&type=pdf
Learning Effective Ranking Functions for Newsgroup Search
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.5556&rep=rep1&type=pdf
TwitterSearch: A Comparison of Microblog Search and Web Search
http://www.stanford.edu/~dramage/papers/twitter-wsdm11.pdf
TwitterVision
http://twittervision.com/
Bing Social
http://www.bing.com/social
Real time search on the web: Queries, topics, and economic value
http://collecta.com/RealTimeSearch.pdf
Discussion Questions
1) What do you think is the most innovative technique in the TI approach that led to real-time microblog search results?
2) Given the partial indexing optimization provided in the paper, how do you think Google could optimize its indexing algorithm in order to capture the newest content on the web?
3) TI makes use of a ranking function in order to select tweets based on various user characteristics. What would you change about the ranking function, if anything?