News and Blog Analysis
with Lydia
October 2nd, 2009
Charles Ward – Stony Brook University
Karthik Balaji, Levon Lloyd – General Sentiment
Outline
• Lydia System Overview
• News Analysis Examples
• Data and Workflow Organization
• Data Access Interface
• Conclusion
Large-Scale News/Blog Analysis
• The Lydia news/blog analysis system performs a daily analysis of more than 1,000 English and foreign-language online newspapers, plus blogs and other text sources.
• We currently track tens of millions of named entities in news and blogs, providing spatial, temporal, relational, and sentiment analysis.
• Customers track entities of interest using reports generated in our user interface.
Lydia Text Analysis Phases
• Lydia performs named entity recognition and analysis over large text corpora.
• Spidering: Lydia spiders and parses thousands of online news sources. We also handle the feed of social media provided by Spinn3r.
• Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.).
• Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques.
• Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics.
• Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
Lydia Architecture
Frequency Time Series
• Michael Vick references (2004-2009)
• Mel Gibson references (2004-2009)
Sentiment Analysis
• Michael Phelps sentiment score (June 2008-Feb 2009)
• David Paterson sentiment score (Jan 2008-Jul 2009)
Comparative Analysis
• Peyton Manning vs. Eli Manning
Heatmaps
• Arnold Schwarzenegger
• Alabama
Ethnic Biases in News Coverage
Frequency of coverage of entities with Hispanic names in the U.S. news, 2004-2008
Percentage of population self-reporting as Hispanic in the 2000 census. Courtesy of Wikipedia.
Ethnic Biases in News Coverage
• (a) African
• (b) Hispanic
• (c) East Asian
• (d) Indian
• (e) Eastern European
• (f) Muslim
Juxtaposition Analysis
• Top Juxtapositions for Barack Obama
• Juxtapositions between Barack Obama and John McCain
Hadoop in Lydia
• The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities which are stored as compressed XML in HDFS.
• To build or update Lydia entity statistics and indexes for a single text corpus, over 80 map-reduce jobs are necessary.
• We have developed a custom workflow management framework in Amazon EC2 to manage our data and processing.
Lydia Workflow Framework
High-level concepts:
• A depository is a statistics dataset derived from a text corpus. It consists of artifacts.
  – Stored as a directory structure in HDFS
• An artifact is a homogeneous dataset of a specific type.
  – Examples:
    – Key-value artifacts, e.g. entity name -> frequency time series
    – Lucene index artifacts (entity and article indexes)
  – Stored as a directory in HDFS containing several map-reduce job output subdirectories named as date ranges (updates are done at daily granularity).
Artifact Dependencies
Most Lydia artifacts are derived from other artifacts:
Artifact Storage
Lydia artifacts are stored in HDFS inside the depository directory:
/dailies (depository name)
    /EXACT_DUP_ARTICLES (artifact name)
        /2004_11_01-2009_03_31 (date range-named MR output)
            /part-00000
            . . .
            /part-00017
        /2009_04_01-2009_04_02
            . . .
        /2009_04_03. . .
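The naming scheme above can be handled with a few small helpers. The following is a minimal sketch, not Lydia's actual code: the function names are hypothetical, and it assumes that date ranges are inclusive and that a single-day output is named with one date only.

```python
from datetime import date, datetime
from typing import Tuple

DATE_FORMAT = "%Y_%m_%d"  # matches names such as 2009_04_01


def parse_date_range(name: str) -> Tuple[date, date]:
    """Parse 'YYYY_MM_DD' or 'YYYY_MM_DD-YYYY_MM_DD' into (start, end)."""
    parts = name.strip("/").split("-")
    if len(parts) not in (1, 2):
        raise ValueError(f"not a date range directory name: {name!r}")
    days = [datetime.strptime(part, DATE_FORMAT).date() for part in parts]
    start, end = days[0], days[-1]  # a single date means a one-day range (assumption)
    if start > end:
        raise ValueError(f"range ends before it starts: {name!r}")
    return start, end


def format_date_range(start: date, end: date) -> str:
    """Inverse of parse_date_range."""
    if start == end:
        return start.strftime(DATE_FORMAT)
    return f"{start.strftime(DATE_FORMAT)}-{end.strftime(DATE_FORMAT)}"


def output_dir(depository: str, artifact: str, start: date, end: date) -> str:
    """Build the path of one map-reduce output directory inside a depository."""
    return f"/{depository}/{artifact}/{format_date_range(start, end)}"
```

For example, `output_dir("dailies", "EXACT_DUP_ARTICLES", date(2009, 4, 1), date(2009, 4, 2))` yields the second output directory shown in the listing.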
Job Input Selection
• Artifact updates are incrementally propagated through the dependency graph:
• Multiple date ranges (sometimes overlapping) typically exist for each artifact.
• Some small artifacts get fully rebuilt on every update.
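The slides do not spell out how a job chooses among the stored date ranges of its input artifact. One plausible approach, sketched here purely as an assumption, is a greedy interval cover: repeatedly pick the stored range that contains the first uncovered day and reaches furthest. The function name is hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

DateRange = Tuple[date, date]  # inclusive (start, end)


def select_covering_ranges(
    available: List[DateRange], start: date, end: date
) -> List[DateRange]:
    """Greedily choose stored ranges that together cover start..end."""
    chosen: List[DateRange] = []
    next_day = start  # first day not yet covered
    while next_day <= end:
        candidates = [r for r in available if r[0] <= next_day <= r[1]]
        if not candidates:
            raise ValueError(f"no stored range covers {next_day}")
        best = max(candidates, key=lambda r: r[1])  # reaches furthest
        chosen.append(best)
        next_day = best[1] + timedelta(days=1)
    return chosen
```

The chosen ranges may still overlap each other, so a real job would need a rule for days that appear twice (for example, preferring the newer output); the slides do not describe that rule.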
Depository Build Scheduling
• The same tool is used for the initial depository build and for updating it with new data.
• Any set of target artifacts to build can be specified, much like targets in a makefile. Prerequisites of the targets are automatically identified.
• Artifacts are built in the correct order according to dependencies.
• The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
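Make-style scheduling of this kind can be sketched as a topological sort restricted to the targets and their prerequisites. This is an illustration, not the Lydia tool itself: `build_order` is a hypothetical name, and it relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set


def build_order(deps: Dict[str, Set[str]], targets: Iterable[str]) -> List[str]:
    """Return the targets plus all their prerequisites, prerequisites first.

    deps maps each artifact to the set of artifacts it is derived from.
    """
    # Collect everything reachable from the targets (the "needed" set).
    needed: Set[str] = set()
    stack = list(targets)
    while stack:
        artifact = stack.pop()
        if artifact in needed:
            continue
        needed.add(artifact)
        stack.extend(deps.get(artifact, ()))

    # Sort only the needed subgraph; graphlib raises CycleError on cycles.
    subgraph = {artifact: deps.get(artifact, set()) for artifact in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

Each name in the returned list would correspond to one or more map-reduce (or serial) jobs run in that order. Artifacts that no target depends on are left out, which is what lets the same routine serve both full builds and partial updates.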
Amazon EC2
• We run Hadoop on Amazon EC2.
  – Quickly scale capacity as requirements change.
• 10 extra-large nodes for weekly data processing.
• Amazon S3 is our persistent data store.
• All our web services are hosted on dedicated Amazon nodes.
• S3 is not meeting our required level of service.
  – Moving to EBS
Depository Server
• Random access to the Lydia depository, e.g.:
  – Monthly frequency time series of Barack Obama in all U.S. sources
  – Top juxtapositions for Continental Airlines in February 2009
  – Sentiment time series for Michael Phelps in all U.S. sources
• Uses the mapfiles generated by map-reduce jobs.
• Currently not distributed (but we can put different depositories on different machines).
• Provides a caching subsystem to reduce the number of HDFS accesses.
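The slides do not say how the caching subsystem works. As one simple possibility, a least-recently-used (LRU) cache can sit in front of the slow lookup; the sketch below assumes that policy, and the `LRUCache` class and its `fetch` callback (standing in for an HDFS mapfile lookup) are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUCache:
    """Keep the most recently used lookup results in memory."""

    def __init__(self, fetch: Callable[[Hashable], Any], capacity: int = 1024):
        self._fetch = fetch          # slow lookup, e.g. a read from HDFS
        self._capacity = capacity
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        value = self._fetch(key)            # cache miss: do the slow lookup
        self.misses += 1
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Repeated requests for the same entity then cost one slow lookup instead of one per request, as long as the entry has not been evicted.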
Artifact Date Range Merging
• The depository server combines results from multiple groups of mapfiles on the fly. (MR output = date range = mapfile group)
• This may cause performance problems and memory shortages (direct memory buffers).
• Solution: limit the number of covering date ranges to O(log N) after N daily updates.
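The slides do not say which merging policy achieves the O(log N) bound. One well-known scheme that does is binary-counter merging: after each daily update, merge the two newest ranges whenever they span the same number of days. The sketch below assumes that scheme and uses integer day indexes for brevity; `add_daily_update` is a hypothetical name.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first_day, last_day) as day indexes


def add_daily_update(ranges: List[Range], day: int) -> List[Range]:
    """Append a one-day range, then merge equal-length neighbours at the tail."""
    merged = list(ranges) + [(day, day)]
    while len(merged) >= 2:
        (s1, e1), (s2, e2) = merged[-2], merged[-1]
        if e1 - s1 != e2 - s2:
            break                      # lengths differ: nothing more to merge
        merged[-2:] = [(s1, e2)]       # in practice, one merge job over both outputs
    return merged


if __name__ == "__main__":
    ranges: List[Range] = []
    for day in range(1, 8):
        ranges = add_daily_update(ranges, day)
    print(ranges)
```

After N updates the range lengths are distinct powers of two, one per set bit in the binary form of N, so at most floor(log2 N) + 1 ranges remain. The trade-off is that some days trigger a cascade of merges, each of which rewrites older data.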
Conclusion
• Using Hadoop greatly improved Lydia's performance and scalability (up to 20x).
• Lydia with Hadoop makes new types of automated analysis of web-scale content possible.