2© Hortonworks Inc. 2012
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks
Formerly Viz, Data Science at Ning, LinkedIn
HBase Dashboards, Career Explorer, InMaps
Agile Analytics Applications on HDP
About me... Bearding.
• I’m going to beat this guy
• Seriously
• Bearding is my #1 natural talent
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
Agile Data - The Book (July, 2013)
Read on Safari Rough Cuts
Early Release Here
Code Here
We go fast... but don’t worry!
• Examples for EVERYTHING on the Hortonworks blog: http://hortonworks.com/blog/authors/russell_jurney
• Download the slides - click the links - read examples!
• If it’s not on the blog, it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read the book Friday on Safari Rough Cuts
HDP Sandbox - Talk Lessons Coming!
Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
Data Warehousing
Scientific Computing / HPC
• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop
Tubes and Mercury (old school) Cores and Spindles (new school)
UNIVAC and Deep Blue both fill a warehouse. We’re back...
Data Science?
Application Development
Data Warehousing
Scientific Computing / HPC
Data Center as Computer
• Warehouse Scale Computers and applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’
Hadoop to the Rescue!
• Easy to use! (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoah!
• An army of mappers and reducers at your command
• OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME!
NOW WHAT?
?
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?
The Wrong Way - Part One
“We made a great design. Your job is to predict the future for it.”
The Wrong Way - Part Two
“What’s taking you so long to reliably predict the future?”
The Wrong Way - Part Three
“The users don’t understand what 86% true means.”
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!
The Wrong Way - Inevitable Conclusion
Plane Mountain
Reminds me of... the waterfall model
:(
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
Data Design
• Not the 1st query that = insight, it’s the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch...
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate
• Opportunity lies at intersection of data & design
How do we get back to Agile?
Statement of Principles
(then tricks, with code)
Set up an environment where...
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day 0
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
Value document > relation
Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
Note: Hive/ArrayQL/NewSQL’s support of document/array types blurs this distinction.
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs! (paper coming)
Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. See: ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
Value dataflow > SELECT
Ex. dataflow: ETL + email sent count
(I can’t read this either. Get a big version here.)
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post.
Localhost vs Petabyte scale: same tools
• Simplicity essential to scalability: highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
Data-Value Pyramid
Climb it. Do not skip steps. See here.
0/1) Display atomic records on the web
0.0) Document-serialize events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.
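A minimal sketch of what "document-serialize with the schema onboard" can look like. It uses JSON from the Python standard library instead of Avro (Avro needs a third-party package), and it imitates the onboard schema with a header line. `EMAIL_SCHEMA`, `serialize_events` and `deserialize_events` are hypothetical names, and the field list is borrowed from the Enron email example later in the deck.

```python
import json

# Hypothetical schema, stored with the records the way Avro keeps its schema onboard.
EMAIL_SCHEMA = {
    "name": "Email",
    "fields": ["message_id", "date", "from", "subject", "body", "tos"],
}

def serialize_events(events, schema=EMAIL_SCHEMA):
    """Return newline-delimited JSON: one schema header line, then one line per event."""
    lines = [json.dumps({"schema": schema}, sort_keys=True)]
    for event in events:
        # Keep only declared fields; missing ones become None (null in JSON).
        record = {field: event.get(field) for field in schema["fields"]}
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)

def deserialize_events(text):
    """Read the header back and return (schema, list of records)."""
    header, *rows = text.splitlines()
    return json.loads(header)["schema"], [json.loads(row) for row in rows]

event = {
    "message_id": "<1@example.com>",
    "date": "2001-01-09T06:38:00.000Z",
    "from": {"address": "a@example.com", "name": None},
    "subject": "hello",
    "body": "hi there",
    "tos": [{"address": "b@example.com", "name": None}],
}
schema, records = deserialize_events(serialize_events([event]))
```

Any reader can recover the field list from the file itself, which is the property the Avro bullet is after.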
0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
message_id:chararray,
sql_date:chararray,
from_address:chararray,
from_name:chararray,
subject:chararray,
body:chararray
);
enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
enron_messages::subject as subject,
enron_messages::body as body,
headers::tos.(address, name) as tos,
headers::ccs.(address, name) as ccs,
headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage(
Example here.
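The same relation-to-document idea, sketched in plain Python for a local sample: group recipient rows by message and type, then join them onto the message row. This is an illustration of the technique, not the Pig script's implementation. `build_email_documents` is a hypothetical helper, the rows are assumed to be dicts keyed by the column names above, and the date conversion is left out.

```python
from collections import defaultdict

def build_email_documents(messages, recipients):
    """Nest recipient rows inside their message: one document per message_id."""
    headers = defaultdict(lambda: {"tos": [], "ccs": [], "bccs": []})
    for rec in recipients:
        key = rec["reciptype"] + "s"  # 'to' -> 'tos', 'cc' -> 'ccs', 'bcc' -> 'bccs'
        if key in ("tos", "ccs", "bccs"):
            headers[rec["message_id"]][key].append(
                {"address": rec["address"], "name": rec["name"]})
    docs = []
    for msg in messages:
        # Inner join, like the Pig script: messages with no recipient rows are dropped.
        if msg["message_id"] not in headers:
            continue
        doc = {
            "message_id": msg["message_id"],
            "date": msg["sql_date"],  # left unconverted in this sketch
            "from": {"address": msg["from_address"], "name": msg["from_name"]},
            "subject": msg["subject"],
            "body": msg["body"],
        }
        doc.update(headers[msg["message_id"]])
        docs.append(doc)
    return docs

messages = [{"message_id": "m1", "sql_date": "2001-01-09 06:38:00",
             "from_address": "a@example.com", "from_name": "A",
             "subject": "s", "body": "b"}]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "b@example.com", "name": None},
    {"message_id": "m1", "reciptype": "cc", "address": "c@example.com", "name": "C"},
]
docs = build_email_documents(messages, recipients)
```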
0.2) Serialize events from streams
import imaplib

class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        # Close any previous connection; ignore errors if there was none
        try:
            self.imap.shutdown()
        except Exception:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if self.imap and self.imap_folder:
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                # Only keep emails that parsed cleanly and carry thread and sender info
                if status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash:
                    print(email_id, charset, email_hash['thread_id'])
                    self.write(email_hash)
Scrape your own gmail in Python and Ruby.
0.3) ETL Logs
log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
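For poking at a log sample locally before running Pig, here is a rough Python sketch that splits a Common Log Format line into the same kind of fields. The regex is my own approximation of the format, not the piggybank loader's code. It also captures a `status` field, which the field list on this slide does not name, and `parse_clf_line` is a hypothetical helper.

```python
import re

# Common Log Format: host ident user [time] "method uri proto" status bytes
CLF_PATTERN = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
```

Lines that fail to parse come back as `None`, so dirty rows can be counted and inspected instead of silently dropped.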
1) Plumb atomic events -> browser
(Example stack that enables high productivity)
Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example
• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.
1.1) cat our Avro serialized events
me$ cat_avro ~/Data/enron.avro
{
  u'bccs': [],
  u'body': u'scamming people, blah blah',
  u'ccs': [],
  u'date': u'2000-08-28T01:50:00.000Z',
  u'from': {u'address': u'[email protected]', u'name': None},
  u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
  u'subject': u'Re: Enron trade for frop futures',
  u'tos': [ {u'address': u'[email protected]', u'name': None} ]
}
Get cat_avro in python, ruby
1.2) Load our events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
    message_id: chararray,
    datetime: chararray,
    from: tuple(address: chararray, name: chararray),
    subject: chararray,
    body: chararray,
    tos: {to: (address: chararray, name: chararray)},
    ccs: {cc: (address: chararray, name: chararray)},
    bccs: {bcc: (address: chararray, name: chararray)}
}
1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails
---------------------------------------------------------------------------
| emails | message_id:chararray                                   |
|        | datetime:chararray                                     |
|        | from:tuple(address:chararray,name:chararray)           |
|        | subject:chararray                                      |
|        | body:chararray                                         |
|        | tos:bag{to:tuple(address:chararray,name:chararray)}    |
|        | ccs:bag{cc:tuple(address:chararray,name:chararray)}    |
|        | bccs:bag{bcc:tuple(address:chararray,name:chararray)}  |
---------------------------------------------------------------------------
|        | <1731.10095812390082.JavaMail.evans@thyme>             |
|        | 2001-01-09T06:38:00.000Z                               |
|        | ([email protected], J.R. Bob Dobbs)                      |
|        | Re: Enron trade for frop futures                       |
|        | scamming people, blah blah                             |
|        | {([email protected],)}                                   |
|        | {}                                                     |
|        | {}                                                     |
Upgrade to Pig 0.10+
1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
1.5) Check events in our ‘database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "[email protected]", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}
1.6) Publish events on the web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
What’s the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
Refine. Add links between documents.
Not the Mona Lisa, but coming along... See: here
1.8) List links to sorted events
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
    "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
    "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
    "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date desc;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
1.8) List links to sorted documents
1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using
    ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
From now on we speed up...
Don’t worry, it’s in the book and on the blog.
http://hortonworks.com/blog/
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
2.1) Top N (of anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as json.
This would make a good Pig Macro.
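To sanity-check the Pig output on a small local sample, the same group / order / limit pattern can be sketched in Python. This is only an illustration: `top_n_per_key` is a hypothetical helper, and the `key` and `arbitrary_rank` field names are taken from the Pig snippet above.

```python
from collections import defaultdict
import heapq

def top_n_per_key(things, n=10):
    """Group things by key and keep the n highest-ranked members of each group."""
    groups = defaultdict(list)
    for thing in things:
        groups[thing["key"]].append(thing)
    return [
        {"key": key,
         "top_things": heapq.nlargest(n, members, key=lambda t: t["arbitrary_rank"])}
        for key, members in groups.items()
    ]

things = [
    {"key": "a", "arbitrary_rank": 3},
    {"key": "a", "arbitrary_rank": 9},
    {"key": "a", "arbitrary_rank": 5},
    {"key": "b", "arbitrary_rank": 1},
]
result = top_n_per_key(things, n=2)
```

Each element of `result` has the same nested shape as a stored record: a key plus a list of its top members.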
2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month),
             COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
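A small Python sketch of the same two steps (count per key and month, then sort each key's months), handy for checking a local sample. It assumes ISO 8601 datetime strings, so the first seven characters are the `YYYY-MM` month. `monthly_timeseries` is a hypothetical name.

```python
from collections import Counter, defaultdict

def monthly_timeseries(things):
    """Count things per (key, month), then emit each key's months in sorted order."""
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)  # 'YYYY-MM'
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)

things = [
    {"key": "a", "datetime": "2001-02-03T00:00:00.000Z"},
    {"key": "a", "datetime": "2001-01-09T06:38:00.000Z"},
    {"key": "a", "datetime": "2001-01-20T12:00:00.000Z"},
    {"key": "b", "datetime": "2001-01-01T00:00:00.000Z"},
]
series = monthly_timeseries(things)
```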
Data processing in our stack
A new feature in our application might begin at any layer... great!
Any team member can add new features, no problemo!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
Data processing in our stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
3) Exploring with Reports
3.0) From charts to reports...
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
3.1) Looks like this...
3.2) Cultivate common keyspaces
3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
4) Predictions and Recommendations
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
4.2) Think in different perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book... See here.
4.3) Networks
4.3.1) Weighted Email Networks in Pig
DEFINE header_pairs(email, col1, col2) RETURNS pairs {
    filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
    flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
    $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
};
/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;
/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
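The edge-weighting idea in a few lines of Python, as a sketch for small local samples: lowercase both ends of every from -> to/cc/bcc pair and count how often each pair occurs. It assumes a simplified email dict where `from` is a plain address string and `tos`/`ccs`/`bccs` are lists of address strings. That shape and the `sent_counts` function name are assumptions, not the deck's schema.

```python
from collections import Counter

def sent_counts(emails):
    """Count emails sent over each (sender, recipient) edge, across to, cc and bcc."""
    counts = Counter()
    for email in emails:
        sender = email.get("from")
        if not sender:
            continue  # mirrors the IS NOT NULL filter in the macro
        for field in ("tos", "ccs", "bccs"):
            for recipient in email.get(field) or []:
                counts[(sender.lower(), recipient.lower())] += 1
    return counts

emails = [
    {"from": "A@x.com", "tos": ["b@x.com"], "ccs": ["c@x.com"], "bccs": []},
    {"from": "a@x.com", "tos": ["B@x.com"], "ccs": [], "bccs": None},
]
edges = sent_counts(emails)
```

The resulting (ego1, ego2, total) triples are the weighted edge list a graph tool would take as input.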
4.3.2) Networks Viz with Gephi
4.3.3) Gephi = Easy
4.3.4) Social Network Analysis
4.4) Time Series & Distributions
pig -l /tmp -x local -v -w
/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
    generate flatten(group) as (key, day),
             COUNT_STAR(things) as total;
/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
    timeseries = order things_per_day by day;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
4.4.1) Smooth Sparse Data
See here.
4.4.2) Regress to find Trends
JRuby Linear Regression UDF
Pig to use the UDF
Trend Line in your Application
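The slide's UDF is JRuby and isn't reproduced here. As a stand-in, this is a minimal ordinary-least-squares sketch in Python that fits a trend line to one time series, using the position in the series as x. `fit_trend` is a hypothetical name.

```python
def fit_trend(values):
    """Least-squares fit of values against their index 0..n-1; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_trend([1, 3, 5, 7])
```

A positive slope means the totals are trending up over the window; the line `intercept + slope * x` is what would be drawn over the chart.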
4.5.1) Natural Language Processing
Example with code here and macro here.
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');
/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group tfidf_all by message_id) {
    sorted = order tfidf_all by value desc;
    top_10_topics = limit sorted 10;
    generate group, top_10_topics.(score, value);
};
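For intuition about what the macro computes, here is a toy TF-IDF in Python. It uses one common variant (term frequency = count / document length, idf = log(N / document frequency)) and whitespace tokenization. The Pig macro may weight and tokenize differently, and `tf_idf_top_terms` is a hypothetical name.

```python
import math
from collections import Counter

def tf_idf_top_terms(docs, top_n=10):
    """docs: {message_id: body}. Returns {message_id: [(term, score), ...]}, best first."""
    tokenized = {doc_id: body.lower().split() for doc_id, body in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    result = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically so output is stable.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:top_n]
    return result

docs = {"m1": "enron trade futures", "m2": "enron lunch plans"}
top = tf_idf_top_terms(docs, top_n=2)
```

Terms that appear in every document (here "enron") score zero, which is why TF-IDF surfaces the words that distinguish one message from the rest.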
4.5.2) NLP: Extract Topics!
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
  • http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
  • http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
4.6) Probability & Bayesian Inference
4.6.1) Gmail Suggested Recipients
4.6.1) Reproducing it with Pig...
4.6.2) Step 1: COUNT(From -> To)
4.6.2) Step 2: COUNT(From, To, Cc)/Total
P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
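The two counting steps can be sketched in Python as follows: count emails per (from, to), count emails per (from, to, cc), and divide. This is one reading of the slides, conditioning per sender. The email dict shape (address strings in `from`, `tos`, `ccs`) and the `cc_given_to` function name are assumptions.

```python
from collections import Counter

def cc_given_to(emails):
    """Estimate P(cc | from, to) = COUNT(from, to, cc) / COUNT(from, to)."""
    to_counts = Counter()  # (from, to) -> number of emails
    cc_counts = Counter()  # (from, to, cc) -> number of those emails that cc'd cc
    for email in emails:
        for to in set(email["tos"]):
            to_counts[(email["from"], to)] += 1
            for cc in set(email["ccs"]):
                cc_counts[(email["from"], to, cc)] += 1
    return {
        triple: count / to_counts[triple[:2]]
        for triple, count in cc_counts.items()
    }

emails = [
    {"from": "me", "tos": ["kate"], "ccs": ["bob"]},
    {"from": "me", "tos": ["kate"], "ccs": ["bob", "mom"]},
    {"from": "me", "tos": ["kate"], "ccs": []},
    {"from": "me", "tos": ["kate"], "ccs": ["bob"]},
]
probs = cc_given_to(emails)
```

Sorting one sender's candidates by this probability gives a suggested-recipients list for a given "to" address.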
4.6.3) Wait - Stop Here! It works!
They match...
4.4) Add predictions to reports
5) Enable new actions
Why doesn’t Kate reply to my emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (contain original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom) ?
Example: LinkedIn InMaps
<------ personalization drives engagement
Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/316288748096695765986412570341480077402
Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap' USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts GENERATE com.packetloop.packetpig.udf.geoip.Country(src) as country, priority;
countries = GROUP countries BY country;
countries = FOREACH countries GENERATE group, AVG(countries.priority) as average_severity;
STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.
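The aggregation at the end of that script (group by country, average the priority) looks like this in plain Python. This is a sketch only: it assumes each alert already carries a `country` field, since the GeoIP lookup is Packetpig's job and is not reproduced here. `average_severity_by_country` is a hypothetical name.

```python
from collections import defaultdict

def average_severity_by_country(alerts):
    """Average alert priority per country, like GROUP BY country + AVG(priority)."""
    sums = defaultdict(lambda: [0, 0])  # country -> [priority total, alert count]
    for alert in alerts:
        bucket = sums[alert["country"]]
        bucket[0] += alert["priority"]
        bucket[1] += 1
    return {country: total / count for country, (total, count) in sums.items()}

alerts = [
    {"country": "AU", "priority": 1},
    {"country": "AU", "priority": 3},
    {"country": "US", "priority": 2},
]
severity = average_severity_by_country(alerts)
```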
Thank You!
Questions & Answers
Slides: http://slidesha.re/T943VU
Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog