Refactoring Earthquake-Tsunami Causality and Messaging via Big …credit.pvamu.edu/MCBDA2016/Slides/Day2_Lumb_MCBDA1... · 2016-05-17 · Refactoring Earthquake-Tsunami Causality

Refactoring Earthquake-Tsunami Causality and Messaging via Big Data Analytics: The Transformative Potential of Credible Tweets

L. I. Lumb1,2 & J. R. Freemantle3

1York University, 2Univa Corporation & 3Independent

MCBDA 2016 (First Workshop) PVAMU, May 17, 2016

http://credit.pvamu.edu/MCBDA2016/program.html





Agenda

● Motivation
● Traditional Data
● Social-Networking Data
○ Graphs, Semantics & Machine Learning
● Conclusions

Geist, E.L., Titov, V.V., and Synolakis, C.E., 2006, Tsunami: wave of change: Scientific American, v. 294, p. 56-63

http://www.sciam.com/article.cfm?articleID=000CDB86-32E0-13A8-B2E083414B7F0000&ref=sciam&chanID=sa006



Motivation

● Non-deterministic cause
○ Uncertainty inherent in any attempt to predict earthquakes
■ In situ measurements may reduce uncertainty
● Lead times
○ Availability of actionable observations
○ Communication of situation - advisories, warnings, etc.
● Cause-effect relationship
○ Energy transfer - inputs ... coupling ... outputs
■ ‘Geometry’ - bathymetry and topography
○ Other factors - e.g., tides
● Established effect
○ Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (real-time) have proven to be extremely accurate ... requires
■ Distributed array of deep-ocean tsunami detection buoys + forecasting model




http://www.gitews.org/en/concept/



http://www.eas.slu.edu/GGP/images/igrav2.jpg

Lumb & Aldridge, http://dx.doi.org/10.1109/HPCS.2006.26




6Vs: Scientific vs. Social Networking Data

● Volume
○ GGP Scientific Data: small, finite
○ Twitter SN Data: BIG, ‘infinite’
● Variety
○ GGP Scientific Data: semi-structured, restricted
○ Twitter SN Data: unstructured, unrestricted - except for IDs, hashtags & URLs (pages, images)
● Velocity
○ GGP Scientific Data: slow, sampled
○ Twitter SN Data: fast, streamed
● Veracity - biases, noise & abnormalities
● Validity - accuracy & correctness
● Volatility
○ GGP Scientific Data: low (stationary, irreplaceable)
○ Twitter SN Data: high? (mobile?, disposable?)

http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/



Karau et al., Learning Spark, O’Reilly, 2015

Machine Learning Pipeline

Deep Learning from Twitter?

Represent data

● Twitter data manually curated into ‘ham’ and ‘spam’
● In-memory representation via Spark RDDs

Extract features

● Frequency-based usage via Spark MLlib HashingTF ⇒ feature vectors

Develop model object

● Spark MLlib LogisticRegressionWithSGD used for classification

Evaluate model
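The four pipeline stages above can be sketched without a Spark cluster. What follows is a minimal pure-Python illustration, not the authors' code: `hashing_tf` stands in for Spark MLlib's HashingTF and `train_sgd` for LogisticRegressionWithSGD. The feature-space size, step size, epoch count and the six labelled example tweets are all assumptions invented for illustration.

```python
import math
import random
import zlib

NUM_FEATURES = 1000  # assumed size of the hashed feature space


def hashing_tf(text, num_features=NUM_FEATURES):
    """Extract features: hash each token to an index and count occurrences."""
    vec = {}
    for token in text.lower().split():
        # crc32 is used because it is stable across runs (built-in hash() is not)
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def predict_proba(weights, vec):
    """Probability that a sparse feature vector belongs to class 1 ('ham')."""
    z = sum(weights[i] * v for i, v in vec.items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, num_features=NUM_FEATURES, epochs=200, step=0.1, seed=0):
    """Develop model object: logistic regression fitted by plain SGD."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    data = [(hashing_tf(text, num_features), label) for text, label in examples]
    for _ in range(epochs):
        rng.shuffle(data)
        for vec, label in data:
            error = predict_proba(weights, vec) - label
            for i, v in vec.items():
                weights[i] -= step * error * v
    return weights


def accuracy(weights, examples):
    """Evaluate model: fraction of examples classified correctly at 0.5."""
    hits = sum(
        1 for text, label in examples
        if (predict_proba(weights, hashing_tf(text)) >= 0.5) == bool(label)
    )
    return hits / len(examples)


# Represent data: hypothetical hand-curated tweets (1 = 'ham', 0 = 'spam')
TRAINING = [
    ("magnitude 7 earthquake reported offshore tsunami warning issued", 1),
    ("strong earthquake felt downtown buildings evacuated", 1),
    ("tsunami advisory issued for coastal areas after earthquake", 1),
    ("win a free phone click here earthquake deals", 0),
    ("click here for free followers now", 0),
    ("cheap deals click now free shipping", 0),
]

weights = train_sgd(TRAINING)
print("training accuracy:", accuracy(weights, TRAINING))
```

Accuracy on the training set only shows that the model fits the curated examples; a held-out set of labelled tweets would be needed for a meaningful evaluation.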

Future Work

● Machine Learning
○ Classification algorithms ... with categories?
○ Training Experiments
■ Larger data sets
■ Degrees of ‘hammyness’
■ Stop-word removal, stemming, ...
○ Real-time streaming - data from Twitter
● Multiparameter credibility - TweetCred + ML + RDF/OWL GA
● Cloud-native platform
○ Containerization, dynamic scheduling and microservices
● Other examples
○ Alberta wildfires
○ Industrial incidents
○ Hurricanes




Conclusions

● Credible tweets could be transformative
○ Mission-critical Big Data complement to existing data sources and approaches
● Current challenges/opportunities
○ Twitter Data
■ Extraction - only 100 tweets at a time (!!!)
■ Curation - manual (read: time-consuming!!!)
○ Emphasizing Machine Learning ... appears encouraging, BUT ...
■ Graph Analytics ... as well ???
■ Semantics ... as well ???

Q&A

L. I. Lumb1,2 & J. R. Freemantle3


Graph Analytics

Problem

http://www.jma.go.jp/jma/en/2016_Kumamoto_Earthquake/2016_Kumamoto_Earthquake.html

Perl script prototype

● Acquires tweets with the keyword “earthquake”

    use strict;
    use warnings;
    use Net::Twitter::Lite::WithAPIv1_1;

    # Authenticate against the Twitter REST API v1.1 (credentials elided)
    my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
        consumer_key        => 'xxxx...xxxxxxx',
        consumer_secret     => 'xxxxxx.....xxxxxxxxxx',
        access_token        => 'xxxxx....xxxxxxxxxxx',
        access_token_secret => 'xxxxx.....xxxxxxxxxxx',
        ssl                 => 1,
    );

    # Search for the keyword and print the text of each returned tweet
    my $result = $nt->search("earthquake");
    for my $status (@{ $result->{statuses} }) {
        print "$status->{text}\n";
    }

Resilient Distributed Datasets (RDDs)

● Abstraction for in-memory computing

● Fault-tolerant, parallel data structures

○ Cluster-ready

● Optionally persistent

● Can be partitioned for optimal placement

● Manipulated via operators

Zaharia et al., NSDI 2012
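The properties listed above can be illustrated with a toy, single-process model. This is a sketch under stated assumptions, not Spark's API: the class `MiniRDD` and its methods are hypothetical names, and the example tweets are invented. It shows partitioned data, operators recorded lazily as a lineage, recomputation of any partition by replaying that lineage (the basis of fault tolerance in Zaharia et al.), and optional persistence.

```python
class MiniRDD:
    """Toy stand-in for an RDD: partitions plus a lineage of lazy operators."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # source data, split into partitions
        self._lineage = lineage        # recorded operators, applied on demand
        self._persist = False          # whether to keep computed results
        self._cache = None             # filled by the first action if persisted

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        """Distribute items round-robin over a fixed number of partitions."""
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Transformation: record the operator, compute nothing yet."""
        return MiniRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        """Transformation: record the operator, compute nothing yet."""
        return MiniRDD(self._partitions, self._lineage + (("filter", pred),))

    def persist(self):
        """Ask for computed partitions to be kept after the first action."""
        self._persist = True
        return self

    def _compute_partition(self, part):
        """Rebuild one partition from source data by replaying the lineage."""
        for kind, fn in self._lineage:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        """Action: evaluate every partition and return a flat list."""
        if self._cache is not None:
            computed = self._cache
        else:
            computed = [self._compute_partition(p) for p in self._partitions]
            if self._persist:
                self._cache = computed
        return [x for part in computed for x in part]


# Hypothetical example tweets
tweets = MiniRDD.parallelize(
    ["Earthquake felt in Kumamoto", "free phone click here", "earthquake tsunami warning"],
    num_partitions=2,
)
matches = tweets.map(str.lower).filter(lambda t: "earthquake" in t).persist()
print(sorted(matches.collect()))
```

Because each partition is derived only from its source data and the recorded operators, a lost partition can be recomputed independently of the others; in Spark the same idea is applied across the nodes of a cluster.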
