Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets
Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC); Thomas A. Rietz (Univ. of Iowa);
Daniel Diermeier (Northwestern Univ.); Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)
Text Mining for Understanding Time Series
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
What might have caused the stock market crash?
Any clues in the companion news stream?
Sept 11 Attack!
Analysis of Presidential Prediction Markets
[Figure: prediction market price of a candidate over time]
What might have caused the sudden drop of price for this candidate?
What “mattered” in this election?
Any clues in the companion news stream?
Tax cut?
Joint Analysis of Text and Time Series to Discover “Causal Topics”
• Input:
  – Time series
  – Text data produced in a similar time period (text stream)
• Output:
  – Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics)
Tax cut
Gun control
…
Related Work
• Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
  – Extract topics from text data and reveal their patterns
  – No consideration of time series, so the topics extracted may not be correlated with the time series
• Stream data mining (e.g., [Agrawal 02])
  – Clustering & categorization of time series data
  – No topics being generated for text data
• Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
  – Incorporating time factor in retrieval or text-based prediction
  – No topics being generated
New Problem: Discover causal topics from text streams with time series data for supervision
Background: Topic Models
• Topic = multinomial distribution over words (unigram language model)
• Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
• Parameter estimation and Bayesian inference “reveal”
  – All the unknown topics in a text collection
  – The coverage of each topic in each document
  – Prior can be imposed to bias the inference of both topics and topic coverage
Document as a Sample of Mixed Topics
Topic 1: government 0.3, response 0.2, ...
Topic 2: donate 0.1, relief 0.05, help 0.02, ...
…
Topic k: city 0.2, new 0.1, orleans 0.05, ...
Background: is 0.05, the 0.04, a 0.03, ...
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
Generative Topic Model
Inference/Estimation of topics
Prior can be added on them
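The mixture view of a document can be illustrated with a minimal Python sketch. The word probabilities are the truncated values shown on the slide; the coverage weights and the names `word_prob` and `log_likelihood` are hypothetical, chosen only for illustration, and this is not the estimation procedure itself.

```python
import math

# Word probabilities as shown on the slide (truncated there, so each
# distribution is incomplete and does not sum to 1).
topics = {
    "topic_1": {"government": 0.3, "response": 0.2},
    "topic_2": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "topic_k": {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}

# Hypothetical topic coverage for one document (assumption, sums to 1).
coverage = {"topic_1": 0.4, "topic_2": 0.1, "topic_k": 0.2, "background": 0.3}


def word_prob(word, topics, coverage):
    """p(word | doc) = sum over topics of coverage[topic] * p(word | topic)."""
    return sum(coverage[name] * dist.get(word, 0.0) for name, dist in topics.items())


def log_likelihood(words, topics, coverage, floor=1e-12):
    """Log-likelihood of a bag of words; unseen words are floored to avoid log(0)."""
    return sum(math.log(max(word_prob(w, topics, coverage), floor)) for w in words)


if __name__ == "__main__":
    print(word_prob("government", topics, coverage))
    print(log_likelihood(["government", "response", "the"], topics, coverage))
```

Inference runs this in reverse: given only the words, it estimates both the word distributions and the per-document coverage.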
When a topic model is applied to a text stream
[Figure: the same topics (Topic 1, Topic 2, …, Topic k, Background) with their coverage tracked over time]
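One plausible way to turn per-document topic coverage into a topic time series is to sum the coverage of each topic over all documents sharing a time stamp. The sketch below assumes that aggregation rule; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def topic_coverage_series(docs, num_topics):
    """Aggregate per-document topic coverage into one time series per topic.

    docs: iterable of (date, coverage) pairs, where coverage is a list of
          num_topics weights for that document.
    Returns (dates, series), where series[k][i] is the summed coverage of
    topic k over all documents dated dates[i].
    """
    totals = defaultdict(lambda: [0.0] * num_topics)
    for date, cov in docs:
        for k, weight in enumerate(cov):
            totals[date][k] += weight
    dates = sorted(totals)
    series = [[totals[d][k] for d in dates] for k in range(num_topics)]
    return dates, series


if __name__ == "__main__":
    docs = [
        ("2001-09-10", [0.9, 0.1]),
        ("2001-09-11", [0.2, 0.8]),
        ("2001-09-11", [0.1, 0.9]),
    ]
    print(topic_coverage_series(docs, num_topics=2))
```

Each resulting series can then be compared against the non-text time series over the same dates.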
New Text Mining Framework: Iterative Causal Topic Modeling
[Figure: iterative loop. The text stream (Sep 2001, Oct 2001, …) goes through topic modeling to give Topics 1 to 4. Topics related to the non-text time series are kept as causal topics. Each causal topic is zoomed into at the word level to find causal words (Topic 1: W1 +, W2 -, W3 +, W4 -, W5 …). The words are split by sign (Topic 1-1: W1 +, W3 +; Topic 1-2: W2 -, W4 -) and fed back to topic modeling as a prior.]
Iterative Causal Topic Modeling Framework
• General framework for any topic modeling and any causality measure
• Naturally incorporates non-text time series in the process
• Topic level + word level: efficiency + granularity
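The word-level "split" step of the framework can be sketched as follows. This is an illustrative sketch only: the 95% cutoff, the function name, and the input layout are assumptions, and the example words are the Topic 1 rows from the deck's feedback table.

```python
def split_causal_words(word_stats, cutoff=95.0):
    """Split a topic's words into positive-impact and negative-impact groups.

    word_stats: {word: (sign, significance_percent)} with sign "+" or "-".
    Words whose significance is below the (assumed) cutoff are dropped.
    Returns two dicts mapping word -> significance.
    """
    positive = {w: sig for w, (sign, sig) in word_stats.items()
                if sign == "+" and sig >= cutoff}
    negative = {w: sig for w, (sign, sig) in word_stats.items()
                if sign == "-" and sig >= cutoff}
    return positive, negative


if __name__ == "__main__":
    topic_1 = {
        "social": ("+", 99), "security": ("+", 96),
        "gun": ("-", 98), "control": ("-", 96),
    }
    pos, neg = split_causal_words(topic_1)
    print(pos)  # words that would seed sub-topic 1-1
    print(neg)  # words that would seed sub-topic 1-2
```

Splitting by sign keeps each fed-back topic consistent in the direction of its relation to the time series.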
Heuristic Optimization of Causality + Coherence
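Causality Measures
• Pearson correlation
  – Basic correlation
• Granger Test
  – For two time series x (topic), y (stock), time lag p
  – Significance test of whether the lagged x terms should be retained or not
\[ y_t = a_0 + \underbrace{a_1 y_{t-1} + \dots + a_p y_{t-p}}_{\text{auto-regression}} + \underbrace{b_1 x_{t-1} + \dots + b_p x_{t-p}}_{\text{lagged values}} \]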
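Both measures can be sketched in plain Python. The Granger part below fits y on its own lags with and without the lagged x terms by ordinary least squares and returns the F statistic comparing the two fits. It is a minimal sketch, not the deck's implementation: turning F into a p-value requires the F distribution, which is left out, and all function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def _solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def _rss(rows, target):
    """Residual sum of squares of a least-squares fit (via the normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(x, y, p=1):
    """F statistic for the hypothesis that p lagged values of x help predict y."""
    times = range(p, len(y))
    target = [y[t] for t in times]
    # Restricted model: intercept + lagged y only (auto-regression).
    restricted = [[1.0] + [y[t - i] for i in range(1, p + 1)] for t in times]
    # Unrestricted model: additionally includes lagged x.
    full = [row + [x[t - i] for i in range(1, p + 1)]
            for row, t in zip(restricted, times)]
    rss_r, rss_u = _rss(restricted, target), _rss(full, target)
    dof = len(target) - (2 * p + 1)
    return ((rss_r - rss_u) / p) / (rss_u / dof)
```

A large F value indicates that dropping the lagged x terms noticeably worsens the fit, which is the sense in which x "Granger-causes" y.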
Feedback Prior Generation

Causal words per topic, with impact direction and significance:

Topic  Word       Impact  Significance (%)
1      Social     +       99
       Security   +       96
       Gun        -       98
       Control    -       96
5      September  -       99
       Airline    -       99
       Terrorism  -       97
       … (5 more words)
       Attack     -       96
       Good       +       96

Resulting prior word distributions:

Topic  Word       Prob
1      Social     0.8
       security   0.2
2      Gun        0.75
       Control    0.25
3      September  0.1
       Airline    0.1
       Terrorism  0.075
       … (5 more)
       Attack     0.05
       Good       0.0
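The prior probabilities for the first two topics are consistent with weighting each word by how far its significance exceeds a 95% cutoff and then normalizing. The sketch below encodes that rule as an assumption. It reproduces the 0.8/0.2 and 0.75/0.25 rows, but it is not claimed to reproduce the third topic, where five words are elided and a minority-direction word receives probability 0.

```python
def feedback_prior(significance, cutoff=95.0):
    """Turn word significance values (%) into a prior word distribution.

    Assumed rule: weight = significance - cutoff, normalized to sum to 1.
    Words at or below the cutoff get no prior mass and are omitted.
    """
    weights = {w: s - cutoff for w, s in significance.items() if s > cutoff}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


if __name__ == "__main__":
    print(feedback_prior({"social": 99, "security": 96}))
    print(feedback_prior({"gun": 98, "control": 96}))
```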
Experiment Design 1: Stock Market Analysis
• Time: June 2000 – Dec. 2011
• Text data
  – New York Times
• Time series
  – American Airlines stock (AAMRQ)
  – Apple stock (AAPL)
• Question: any “causal topics” to explain fluctuation of the stocks of the two companies?
Experiment Design 2: 2000 Presidential Election Campaign
• Time: May 2000 – Oct. 2000
• Text data
  – New York Times (use text mentioning Bush or Gore)
• Time series
  – Normalized “Gore stock price” in the Iowa Electronic Markets (IEM), an online futures market
• Question: any “causal topics” to explain changes in opinions about Gore?
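The deck does not spell out how the Gore price is normalized. One common choice for a two-candidate market, assumed here purely for illustration, is to divide the Gore contract price by the sum of the Gore and Bush contract prices. The function name and example prices are hypothetical.

```python
def normalized_price(gore_price, bush_price):
    """Share of the two-candidate total held by the Gore contract (assumed rule)."""
    total = gore_price + bush_price
    if total <= 0:
        raise ValueError("prices must sum to a positive value")
    return gore_price / total


if __name__ == "__main__":
    # Hypothetical daily closing prices, not real IEM data.
    daily = [(0.52, 0.47), (0.49, 0.50), (0.45, 0.45)]
    print([round(normalized_price(g, b), 3) for g, b in daily])
```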
Measuring Topic Quality
• Causality confidence of a topic
  – Based on the p-value of the causality test (Granger, Pearson) for the topic
• Topic purity
  – Consistency in the direction of the “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
  – Based on the entropy of the distribution of positive/negative words
Topic Purity

Topic  Word       Impact  Significance (%)
1      Social     +       99
       Security   +       96
       Gun        -       98
       Control    -       96
5      September  -       99
       Airline    -       99
       Terrorism  -       97
       Attack     -       96
       Good       +       96

\[ P(T=\text{``pos''}) = \frac{\#\text{positiveWords}}{\#\text{positiveWords} + \#\text{negativeWords}} \]
\[ \text{Entropy } H(T) = -P(T=\text{``pos''}) \log_2 P(T=\text{``pos''}) - P(T=\text{``neg''}) \log_2 P(T=\text{``neg''}) \]
\[ \text{Purity}(T) = 100 \cdot (1 - H(T)) \]

[Figure: entropy H(T) plotted against P(T=“pos”), both ranging from 0 to 1.0]
• P(T=“pos”) = P(T=“neg”) = 1/2: highest entropy, lowest purity (0)
• P(T=“pos”) = 1/5, P(T=“neg”) = 4/5: lower entropy, higher purity
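The purity measure can be checked numerically with a short sketch. It assumes a base-2 logarithm, which matches the plot's maximum entropy of 1.0 at P(T=“pos”) = 1/2; the function name is hypothetical.

```python
import math


def purity(num_positive, num_negative):
    """Purity = 100 * (1 - H), with H the binary entropy (base 2) of word signs."""
    total = num_positive + num_negative
    entropy = 0.0
    for count in (num_positive, num_negative):
        if count:  # treat 0 * log(0) as 0
            p = count / total
            entropy -= p * math.log2(p)
    return 100.0 * (1.0 - entropy)


if __name__ == "__main__":
    print(purity(2, 2))  # evenly mixed topic: lowest purity
    print(purity(1, 4))  # mostly negative topic: higher purity
    print(purity(0, 5))  # single direction: maximum purity
```

A topic with two positive and two negative words, such as Topic 1 in the table, scores 0, while a topic whose words all point the same way scores 100.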
Sample Result 1: Topics discovered for AAMRQ vs. AAPL
Significant topic lists for two different external time series (top three words per topic):

AAMRQ                      AAPL
russia russian putin       paid notice st
europe european germany    russia russian europe
bush gore presidential     olympic games olympics
police court judge         she her ms
airlines airport air       oil ford prices
united trade terrorism     black fashion blacks
food foods cheese          computer technology software
nets scott basketball      internet com web
tennis williams open       football giants jets
awards gay boy             japan japanese plane
moss minnesota chechnya    …

• AAMRQ: airline, terrorism topics
• AAPL: IT industry topics
Topics discovered depend on the external time series
Effect of Iterations on Causality Confidence & Purity
[Figure: three plots of Number of Significant Topics, Average Purity, and Average Confidence against iteration (1 to 5), one curve per feedback strength µ = 10, 50, 100, 500, 1000]
• Significant improvement in confidence and number of significant topics by feedback
  – Clear benefit of feedback
• Large µ guarantees topic purity improvement
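A common way for a feedback strength µ to enter a topic model is as the weight of a conjugate prior in the word-distribution update, where the prior adds µ pseudo-counts distributed according to the prior word distribution. The sketch below assumes that form; the deck does not state the exact update, and the function name and example counts are hypothetical.

```python
def map_word_distribution(expected_counts, prior, mu):
    """MAP-style update: (count + mu * prior) / (total count + mu).

    expected_counts: {word: expected count of the word under this topic}
    prior:           {word: prior probability}, assumed to sum to 1
    mu:              prior strength; 0 gives the maximum-likelihood estimate
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu
    return {w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
            for w in vocab}


if __name__ == "__main__":
    counts = {"social": 10.0, "gun": 10.0}      # hypothetical counts
    prior = {"social": 0.8, "security": 0.2}    # prior from the feedback table
    for mu in (0, 10, 1000):
        print(mu, map_word_distribution(counts, prior, mu))
```

As µ grows, the estimate moves toward the prior, so words outside the prior lose mass. That is one way to read why a large µ pushes topics toward a single impact direction.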
Sample Result 2: Major Topics in 2000 Presidential Election
• Revealed several important issues
  – E.g., tax cut, abortion, gun control, oil energy
  – Such topics are also cited in political science literature [Pomper '01] and Wikipedia

Top Three Words in Significant Topics
tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
Additional Results: http://sifaka.cs.uiuc.edu/~hkim277/InCaToMi/demo/2000_Presidential_Election/dashboard/Dashboard.html
Conclusions & Future Work
• Meaningful topics can be extracted from a text stream by using time series for supervision
• Such “causal” topics provide potential explanations for changes in the time series data
• Preliminary experiment results on 2000 presidential prediction markets are promising
• Future work (discussion)
  – Issues related to topic models (e.g., local maxima, # of topics, interpretation of topics)
  – Issues related to causality analysis (e.g., “local” causality)
  – Unified analysis model
  – System to support online interactive analysis of causal topics (time series can be derived from text too)
Thank You!
Questions/Comments?