Nilesh Bansal and Nick Koudas
WebDB 2007
SEARCHING THE BLOGOSPHERE
Nilesh Bansal, Nick Koudas
University of Toronto
BLOGOSPHERE
67M KNOWN BLOGS
100K NEW EVERY DAY
DOUBLING EVERY 200 DAYS
WHAT ARE THEY WRITING ABOUT??
PERSONAL LIFE
PRODUCT REVIEWS
POLITICS
TECHNOLOGY
TOURISM
SPORTS
ENTERTAINMENT
WHY SHOULD WE CARE?
HUGE DATA REPOSITORY
WILL CONTINUE TO GROW
EXTRACT PUBLIC OPINION
VALUABLE INSIGHTS
KEY INSIGHTS
MARKET RESEARCH
PUBLIC RELATIONS STRATEGIES
CUSTOMER OPINION TRACKING
CHALLENGES AND OPPORTUNITIES
HUGE AMOUNTS OF UNSTRUCTURED TEXT
MACHINE CREATED WEBLOGS
MORE THAN HALF OF BLOGSPOT IS SPAM
33% OF WEBSPAM HOSTED AT BLOGSPOT
TEMPORAL DIMENSION
GEOGRAPHICAL ASSOCIATION
CONVERSATION
Gruhl et al., The Predictive Power of Online Chatter, KDD 2005
Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003
Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006
Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006
Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007
BLOGSCOPE
CRAWLER RUNNING 24x7
TRACKING 9M BLOGS
INDEXING 70M ARTICLES
AGGREGATION AND PREPROCESSING
INTERACTIVE SEARCH AND ANALYSIS
ANY STREAMING TEXT SOURCE
NEWS
MAILING LISTS
FORUMS
SOCIAL MEDIA
www.blogscope.net
Hot Keywords
Related Terms
Popularity Curve
Search Results
Geo Search
Hawaii Earthquake
Taiwan Undersea Earthquake
Sumatra Earthquake
December 15 2006
March 06 2007
IPHONE ON JAN 09 2007
Curves are usually correlated, except at one point
TECHNIQUES
CRAWLS RSS FEEDS
250 THOUSAND NEW POSTS DAILY
PING SERVER: WEBLOGS.COM
[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007
[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004
[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006
LINK BASED ANALYSIS IS NOT EFFECTIVE
SPAMMERS ARE INTELLIGENT
WE USE HEURISTICS
ONGOING BATTLE
INTERACTIVE APPLICATION
TWO SECOND RESPONSE TIME
HUGE AMOUNTS OF DATA
SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY
SCALABILITY
BURST DETECTION
[Kleinberg] Bursty and Hierarchical Structures in Streams, DMKD 2007
[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005
POPULARITY = BASE + ZERO MEAN GAUSSIAN
BURST = STATISTICAL OUTLIER
$$x \sim N(0, \sigma^2)$$
$$x > 2\sigma$$
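The model on this slide (popularity as a base level plus zero-mean Gaussian noise, with a burst being a statistical outlier) can be sketched in a few lines. This is a minimal illustration rather than BlogScope's actual implementation: it assumes the base is estimated as the mean of the daily counts, the noise scale as their standard deviation, and that a day counts as a burst when it exceeds the base by more than two standard deviations. The function name `detect_bursts`, the parameter `k`, and the sample counts are hypothetical.

```python
from statistics import mean, pstdev


def detect_bursts(counts, k=2.0):
    """Return indices of days whose count exceeds base + k * sigma.

    Assumption: base and sigma are estimated from the same series.
    """
    base = mean(counts)      # estimated base popularity
    sigma = pstdev(counts)   # estimated noise standard deviation
    if sigma == 0:
        return []            # a flat series has no outliers
    return [i for i, c in enumerate(counts) if c - base > k * sigma]


# Hypothetical daily post counts for one keyword; day 5 is a spike.
daily_counts = [10, 12, 9, 11, 10, 95, 11, 10, 12, 9]
bursts = detect_bursts(daily_counts)
print(bursts)
```

Estimating the base and the deviation from the same window lets a large spike inflate both, so a real system would more likely estimate them from a trailing window that excludes the day being tested.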
IDENTIFYING RELATED TERMS
COLLOCATIONS
POINTWISE MUTUAL INFORMATION
EXPENSIVE
[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis
[Manning and Schütze] Foundations of Statistical Natural Language Processing
[Church and Hanks] Word Association Norms, Mutual Information, and Lexicography, ACL 1989
$$\mathrm{score}(a, b) = \frac{P(a \in D \mid b \in D)}{P(a \in D)} = \frac{P(b \in D \mid a \in D)}{P(b \in D)}$$
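As a rough illustration of the pointwise-mutual-information style score, the sketch below estimates the ratio of a conditional to a marginal document probability by counting over a small in-memory collection. It is an assumption-laden toy, not the system's code: documents are modelled as sets of terms, and the name `cooccurrence_score` and the sample documents are hypothetical.

```python
def cooccurrence_score(docs, a, b):
    """Estimate P(b in D | a in D) / P(b in D) by counting documents.

    docs is a list of term sets. Returns 0.0 when either term is absent.
    """
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n == 0 or n_a == 0 or n_b == 0:
        return 0.0
    return (n_ab / n_a) / (n_b / n)


# Hypothetical four-document collection.
docs = [
    {"iphone", "apple", "launch"},
    {"iphone", "apple"},
    {"apple", "pie"},
    {"earthquake", "hawaii"},
]
print(cooccurrence_score(docs, "iphone", "apple"))
print(cooccurrence_score(docs, "iphone", "hawaii"))
```

A score above 1 means the two terms co-occur more often than independence would predict. Each call scans the whole collection, which is why the exact computation is expensive at blogosphere scale.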
FAST COMPUTATION OF RELATED TERMS
RANDOM SAMPLE
MUTUAL INFORMATION IN EXPECTATION
USE TF WITH PRECOMPUTED IDF
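$$s(t, q) = \frac{P(q \in d \wedge t \in d)}{P(t \in d)\, P(q \in d)}$$
$$s(t, q) \propto \frac{|\{d \mid d \in D_q,\ t \in d\}| \cdot |D|}{|\{d \mid t \in d\}|}$$
where $D$ is the full collection and $D_q$ is the set of documents matching the query $q$.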
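The three bullets on this slide (random sample, mutual information in expectation, TF with precomputed IDF) suggest roughly the following shape. This is a hedged sketch under stated assumptions, not the authors' algorithm: it draws a random sample from the documents matching the query, counts in how many sampled documents each term occurs, and multiplies by a precomputed inverse document frequency taken here as the plain ratio of collection size to document frequency. The names `related_terms`, `doc_freq`, `total_docs`, `sample_size`, and all data values are hypothetical.

```python
import random
from collections import Counter


def related_terms(result_docs, doc_freq, total_docs, query_terms=(),
                  sample_size=100, top_k=5, seed=0):
    """Rank terms co-occurring with a query using a sample of its results.

    result_docs: list of token lists for documents matching the query.
    doc_freq: precomputed collection-wide document frequency per term.
    """
    rng = random.Random(seed)
    if len(result_docs) > sample_size:
        sample = rng.sample(result_docs, sample_size)
    else:
        sample = list(result_docs)

    # Count each term once per sampled document.
    counts = Counter()
    for doc in sample:
        counts.update(set(doc))

    skip = set(query_terms)
    scores = {
        term: count * total_docs / doc_freq[term]
        for term, count in counts.items()
        if term not in skip and doc_freq.get(term)
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical result set for the query "iphone".
results = [
    ["iphone", "apple", "the"],
    ["iphone", "apple", "launch", "the"],
    ["iphone", "the", "phone"],
]
doc_freq = {"iphone": 3, "apple": 4, "the": 1000, "launch": 20, "phone": 50}
top = related_terms(results, doc_freq, total_docs=1000,
                    query_terms=["iphone"], top_k=3)
print(top)
```

Because only the sample is scanned at query time and the document frequencies are looked up rather than recomputed, the cost no longer depends on the size of the full collection. Note how the very common term "the" is pushed down by its high document frequency.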
COMPUTING HOT KEYWORDS
POPULAR DOES NOT MEAN HOT
INTERESTING = SURPRISING
MIXTURE OF DIFFERENT SCORING FUNCTIONS
DEVIATION FROM EXPECTED
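To make "popular does not mean hot" concrete, here is a small sketch that scores a keyword by how far today's count deviates from what its recent history predicts, combining two such measures into one number. The slide does not say which scoring functions are mixed or how, so everything here is an assumption for illustration: the two components (relative jump and a z-score-like term), the equal weights, the `+ 1.0` smoothing, the name `hotness`, and the counts.

```python
from statistics import mean, pstdev


def hotness(history, today, w_ratio=0.5, w_z=0.5):
    """Weighted mix of two 'deviation from expected' scores (hypothetical)."""
    expected = mean(history)
    sigma = pstdev(history)
    ratio = (today - expected) / (expected + 1.0)  # relative jump
    z = (today - expected) / (sigma + 1.0)         # jump in noise units
    return w_ratio * ratio + w_z * z


# Hypothetical (history, today) counts: "the" is popular, "iphone" is hot.
keywords = {
    "the": ([1000, 1010, 990, 1000], 1005),
    "iphone": ([20, 25, 15, 20], 400),
}
ranked = sorted(keywords, key=lambda k: hotness(*keywords[k]), reverse=True)
print(ranked)
```

The frequent keyword barely moves relative to its own baseline, so it ranks below the rarer keyword whose count jumped far beyond expectation.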
INTELLIGENT ALERT SERVICE
BURST SYNOPSIS
AUTHORITATIVE RANKING
Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.
Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).
JUST THE BEGINNING
Source: xkcd.com
THANK YOU. QUESTIONS?