Refactoring Earthquake-Tsunami Causality and Messaging via Big Data Analytics: The
Transformative Potential of Credible Tweets
L. I. Lumb (1,2) & J. R. Freemantle (3)
(1) York University, (2) Univa Corporation & (3) Independent
MCBDA 2016 (First Workshop) PVAMU, May 17, 2016
Agenda
● Motivation
● Traditional Data
● Social-Networking Data
○ Graphs, Semantics & Machine Learning
● Conclusions
Geist, E.L., Titov, V.V., and Synolakis, C.E., 2006, Tsunami: wave of change: Scientific American, v. 294, p. 56-63
Motivation
● Non-deterministic cause
○ Uncertainty inherent in any attempt to predict earthquakes
■ In situ measurements may reduce uncertainty
● Lead times
○ Availability of actionable observations
○ Communication of situation - advisories, warnings, etc.
● Cause-effect relationship
○ Energy transfer - inputs ... coupling ... outputs
■ ‘Geometry’ - bathymetry and topography
○ Other factors - e.g., tides
● Established effect
○ Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (real-time) have proven to be extremely accurate ... requires
■ Distributed array of deep-ocean tsunami detection buoys + forecasting model
http://www.gitews.org/en/concept/
http://www.eas.slu.edu/GGP/images/igrav2.jpg
Lumb & Aldridge, http://dx.doi.org/10.1109/HPCS.2006.26
6Vs: Scientific vs. Social Networking Data
● Volume: small, finite (GGP scientific data) vs. BIG, ‘infinite’ (Twitter SN data)
● Variety: semi-structured, restricted (GGP scientific data) vs. unstructured, unrestricted - except for IDs, hashtags & URLs (pages, images) (Twitter SN data)
● Velocity: slow, sampled (GGP scientific data) vs. fast, streamed (Twitter SN data)
● Veracity: biases, noise & abnormalities
● Validity: accuracy & correctness
● Volatility: low (stationary, irreplaceable) (GGP scientific data) vs. high? (mobile?, disposable?) (Twitter SN data)
http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
Karau et al., Learning Spark, O’Reilly, 2015
Machine Learning Pipeline
Deep Learning from Twitter?
Represent data
● Twitter data manually curated into ‘ham’ and ‘spam’
● In-memory representation via Spark RDDs
Extract features
● Frequency-based usage via Spark MLlib HashingTF ⇒ feature vectors
Develop model object
● Spark MLlib LogisticRegressionWithSGD used for classification
Evaluate model
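The four pipeline steps above can be sketched without a cluster. The block below is a minimal, pure-Python stand-in, not the Spark MLlib API: `hashing_tf` imitates the idea behind HashingTF using CRC32 (an assumption; Spark uses its own hash function), `train_logistic_sgd` imitates logistic regression trained by stochastic gradient descent, and the four tweets, the vector size, the learning rate and the epoch count are all made up for illustration.

```python
import math
import zlib

NUM_FEATURES = 1000  # assumed feature-vector size, chosen for illustration


def hashing_tf(words, num_features=NUM_FEATURES):
    """Hashing trick: map each word to a bucket index and count occurrences.

    Returns a sparse vector as {bucket_index: term_frequency}.
    CRC32 is a stand-in hash, not the one Spark MLlib uses.
    """
    vec = {}
    for word in words:
        idx = zlib.crc32(word.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def predict_proba(weights, vec):
    """Probability that a sparse feature vector belongs to class 1 ('ham')."""
    z = sum(weights[i] * v for i, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(examples, num_features=NUM_FEATURES, epochs=500, lr=0.1):
    """Logistic regression fitted by plain SGD, one example per update."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for vec, label in examples:
            err = predict_proba(weights, vec) - label
            for i, v in vec.items():
                weights[i] -= lr * err * v
    return weights


# Step 1: represent data (hypothetical, manually labeled tweets).
HAM = [
    "earthquake felt strongly in downtown buildings shaking",
    "tsunami warning issued for the coast after earthquake",
]
SPAM = [
    "win a free phone click this earthquake link now",
    "cheap followers click here free offer",
]

# Step 2: extract features.
examples = [(hashing_tf(t.split()), 1.0) for t in HAM]
examples += [(hashing_tf(t.split()), 0.0) for t in SPAM]

# Step 3: develop the model object.
weights = train_logistic_sgd(examples)


def classify(text):
    """Return 1 for 'ham' (credible) and 0 for 'spam'."""
    return 1 if predict_proba(weights, hashing_tf(text.split())) >= 0.5 else 0


# Step 4: evaluate the model (here only on the training tweets).
accuracy = sum(classify(t) == 1 for t in HAM) + sum(classify(t) == 0 for t in SPAM)
print("training accuracy:", accuracy / (len(HAM) + len(SPAM)))
```

Evaluating on the training tweets only confirms that the pieces fit together; a meaningful evaluation would need held-out, curated tweets, which is the ‘larger data sets’ item under Future Work.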
Future Work
● Machine Learning
○ Classification algorithms ... with categories?
○ Training Experiments
■ Larger data sets
■ Degrees of ‘hammyness’
■ Stop-word removal, stemming, ...
○ Real-time streaming - data from Twitter
● Multiparameter credibility - TweetCred + ML + RDF/OWL GA
● Cloud-native platform
○ Containerization, dynamic scheduling and microservices
● Other examples
○ Alberta wildfires
○ Industrial incidents
○ Hurricanes
Conclusions
● Credible tweets could be transformative
○ Mission-critical Big Data complement to existing data sources and approaches
● Current challenges/opportunities
○ Twitter Data
■ Extraction - only 100 tweets at a time (!!!)
■ Curation - manual (read: time-consuming!!!)
○ Emphasizing Machine Learning ... appears encouraging, BUT ...
■ Graph Analytics ... as well ???
■ Semantics ... as well ???
Q&A
L. I. Lumb & J. R. Freemantle
Graph Analytics
Problem
http://www.jma.go.jp/jma/en/2016_Kumamoto_Earthquake/2016_Kumamoto_Earthquake.html
Perl script prototype
● Acquires tweets with the keyword “earthquake”
use strict;
use warnings;
use Net::Twitter::Lite::WithAPIv1_1;

# Authenticate with placeholder credentials (replace with real keys).
my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
    consumer_key        => 'xxxx...xxxxxxx',
    consumer_secret     => 'xxxxxx.....xxxxxxxxxx',
    access_token        => 'xxxxx....xxxxxxxxxxx',
    access_token_secret => 'xxxxx.....xxxxxxxxxxx',
    ssl                 => 1,
);

# Search for tweets containing the keyword and print the text of each one.
my $result = $nt->search("earthquake");
for my $status ( @{ $result->{statuses} } ) {
    print "$status->{text}\n";
}
Resilient Distributed Datasets (RDDs)
● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
○ Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators
Zaharia et al., NSDI 2012
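The properties listed above can be illustrated with a toy, single-process model. `ToyRDD` below is a hypothetical class written for this sketch and is not the Spark API: it only shows partitioned data, operators that are recorded lazily rather than run immediately, optional persistence, and recovery of a lost result by recomputing from the recorded operators, which is the lineage idea described by Zaharia et al. The sample tweets are made up.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions=None, parent=None, op=None):
        self._source = partitions  # raw partitions, only set on a root dataset
        self._parent = parent      # dataset this one was derived from (lineage)
        self._op = op              # per-partition function applied to the parent
        self._persist = False
        self._cache = None

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Split a list into contiguous partitions."""
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, f):
        """Record a map operator; nothing is computed yet."""
        return ToyRDD(parent=self, op=lambda part: [f(x) for x in part])

    def filter(self, pred):
        """Record a filter operator; nothing is computed yet."""
        return ToyRDD(parent=self, op=lambda part: [x for x in part if pred(x)])

    def persist(self):
        """Ask for the computed partitions to be kept in memory."""
        self._persist = True
        return self

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None

    def _compute(self):
        if self._cache is not None:
            return self._cache
        if self._parent is None:
            parts = self._source
        else:
            # Rebuild from the parent by re-applying the recorded operator.
            parts = [self._op(p) for p in self._parent._compute()]
        if self._persist:
            self._cache = parts
        return parts

    def collect(self):
        """Run the recorded operators and gather all partitions into one list."""
        return [x for part in self._compute() for x in part]


tweets = ToyRDD.parallelize(
    ["Earthquake near coast", "free phone", "earthquake drill today"]
)
quakes = tweets.filter(lambda t: "earthquake" in t.lower()).map(str.upper).persist()

first = quakes.collect()   # computes and caches the result
quakes.lose_cache()        # drop the cached partitions
second = quakes.collect()  # recomputed from the recorded operators
print(first == second)
```

In a real cluster the partitions would live on different machines and only the lost ones would be recomputed; the sketch recomputes everything because it has a single process.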