Utrecht NL-HUG/Data Science-NL - Agile Data Slides


© Hortonworks Inc. 2012

Agile Data - The Book (March, 2013)

2

Read it now on OFPS

A philosophy, not the only way

But still, it's good! Really!

© Hortonworks Inc. 2012

We go fast... but don’t worry!

• Examples for EVERYTHING on the Hortonworks blog: http://hortonworks.com/blog/authors/russell_jurney

• Download the slides - click the links - read examples!

• If it's not on the blog, it's in the book!

• Order now: http://shop.oreilly.com/product/0636920025054.do

• Read the book NOW on OFPS: http://ofps.oreilly.com/titles/9781449326265/chapter_2.html

3

© Hortonworks Inc. 2012

Agile Application Development: Check

• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility

4

+ NoSQL

© Hortonworks Inc. 2012

Data Warehousing

5

© Hortonworks Inc. 2012

Scientific Computing / HPC

• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop

6

Tubes and Mercury (old school) Cores and Spindles (new school)

UNIVAC and Deep Blue both fill a warehouse. We’re back...

© Hortonworks Inc. 2012

Data Science?

7

[Pie chart: Data Science ≈ 1/3 Application Development, 1/3 Data Warehousing, 1/3 Scientific Computing / HPC]

© Hortonworks Inc. 2012

Data Center as Computer

• Warehouse Scale Computers and applications

8

“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’

© Hortonworks Inc. 2012

Hadoop to the Rescue!

9

[Diagram: the "Big Data Refinery": new data sources (audio, video, images; docs, text, XML; web logs, clicks; social, graph, feeds; sensors, devices, RFID; spatial, GPS; events) flow via ETL into Apache Hadoop / HDFS, which sits between Business Transactions & Interactions (web, mobile, CRM, ERP, SCM, …) and Business Intelligence & Analytics (dashboards, reports, visualization, …), alongside SQL, NoSQL, NewSQL and MPP EDW systems.]

I stole this slide from Eric. Update: He stole it from someone else.

© Hortonworks Inc. 2012

Hadoop to the Rescue!

• Easy to use! (Pig, Hive, Cascading)

• CHEAP: 1% the cost of SAN/NAS

• A department can afford its own Hadoop cluster!

• Dump all your data in one place: Hadoop DFS

• Silos come CRASHING DOWN!

• JOIN like crazy!

• ETL like whoah!

• An army of mappers and reducers at your command

• OMGWTFBBQ IT'S SO GREAT! I FEEL AWESOME!

10

© Hortonworks Inc. 2012

NOW WHAT?

11

?

© Hortonworks Inc. 2012

Analytics Apps: It takes a Team

12

• Broad skill-set to make useful apps

• Basically nobody has them all

• Application development is inherently collaborative

© Hortonworks Inc. 2012

Data Science Team

• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Pick relevant researchers. Leave them alone. They'll spawn new products by accident. Not just CS/Math. Design. Art?
• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most: private, social & quiet space
• Desks/cubes optional

13

© Hortonworks Inc. 2012

How to get insight into product?

• Back-end has gotten t-h-i-c-k-e-r

• Generating $$$ insight can take 10-100x app dev

• Timeline disjoint: analytics vs agile app-dev/design

• How do you ship insights efficiently?

• How do you collaborate on research vs developer timeline?

14

© Hortonworks Inc. 2012

The Wrong Way - Part One

15

“We made a great design. Your job is to predict the future for it.”

© Hortonworks Inc. 2012

The Wrong Way - Part Two

16

“What's taking you so long to reliably predict the future?”

© Hortonworks Inc. 2012

The Wrong Way - Part Three

17

“The users don’t understand what 86% true means.”

© Hortonworks Inc. 2012

The Wrong Way - Part Four

18

GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!

© Hortonworks Inc. 2012

The Wrong Way - Inevitable Conclusion

19

[Image: a plane headed straight into a mountainside]

© Hortonworks Inc. 2012

Reminds me of... the waterfall model

20

:(

© Hortonworks Inc. 2012

Chief Problem

21

You can’t design insight in analytics applications.

You discover it.

You discover by exploring.

© Hortonworks Inc. 2012

-> Strategy

22

So make an app for exploring your data.

Which becomes a palette for what you ship.

Iterate and publish intermediate results.

© Hortonworks Inc. 2012

Data Design

• It's not the 1st query that yields insight, it's the 15th, or the 150th

• Capturing “Ah ha!” moments

• Slow to do those in batch...

• Faster, better context in an interactive web application.

• Pre-designed charts wind up terrible. So bad.

• Easy to invest man-years in the wrong statistical models

• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design

23

© Hortonworks Inc. 2012

How do we get back to Agile?

24

© Hortonworks Inc. 2012

Statement of Principles

25

(then tricks, with code)

© Hortonworks Inc. 2012

Setup an environment where...

• Insights repeatedly produced

• Iterative work shared with entire team

• Interactive from day 0

• Data model is consistent end-to-end

• Minimal impedance between layers

• Scope and depth of insights grow

• Insights form the palette for what you ship

• Until the application pays for itself and more

26

© Hortonworks Inc. 2012

Value document > relation

27

Most data is dirty. Most data is semi-structured or un-structured. Rejoice!

© Hortonworks Inc. 2012

Value document > relation

28

Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.

© Hortonworks Inc. 2012

Relational Data = Legacy?

• Why JOIN? Storage is fundamentally cheap!

• Duplicate that JOIN data in one big record type!

• ETL once to document format on import, NOT every job

• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data’s actual structure

• Column compressed document formats beat JOINs! (paper coming)

29

© Hortonworks Inc. 2012

Value imperative > declarative

• We don’t know what we want to SELECT.

• Data is dirty - check each step, clean iteratively.

• 85% of a data scientist's time is spent munging. See: ETL.

• Imperative is optimized for our process.

• Process = iterative, snowballing insight

• Efficiency matters; self-optimize

30

© Hortonworks Inc. 2012

Value dataflow > SELECT

31

© Hortonworks Inc. 2012

Ex. dataflow: ETL + email sent count

32

(I can't read this either. Get a big version here.)

© Hortonworks Inc. 2012

Value Pig > Hive (for app-dev)

• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.

33

But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post.

© Hortonworks Inc. 2012

Localhost vs Petabyte scale: same tools

• Simplicity essential to scalability: use the highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents (see the sampling sketch below)
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.

• Everything we serve in our app is re-creatable via Hadoop.
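A minimal sketch of the sampling step, assuming newline-delimited JSON documents and illustrative file names (this is not from the book, just reservoir sampling in Python):

import random

def reservoir_sample(lines, k=10000):
    # Keep a uniform random sample of k lines from a stream of any size.
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# File names are illustrative; with self-contained documents, no join is
# needed to keep the sample consistent.
with open('emails.json') as full, open('emails_sample.json', 'w') as out:
    out.writelines(reservoir_sample(full))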

34

© Hortonworks Inc. 2012

Data-Value Pyramid

35

Climb it. Do not skip steps. See here.

© Hortonworks Inc. 2012

0/1) Display atomic records on the web

36

© Hortonworks Inc. 2012

0.0) Document-serialize events

• Protobuf

• Thrift

• JSON

• Avro - I use Avro because the schema is onboard.
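A minimal sketch of what "the schema is onboard" buys you, using the Python avro package (the two-field schema and file name are illustrative, not the book's):

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# avro.schema.parse in older releases of the library; Parse in newer ones.
schema = avro.schema.parse(json.dumps({
    'type': 'record', 'name': 'Email',
    'fields': [
        {'name': 'message_id', 'type': 'string'},
        {'name': 'subject', 'type': ['null', 'string']},
    ],
}))

# Write: the schema is embedded in the file header.
with open('email.avro', 'wb') as out:
    writer = DataFileWriter(out, DatumWriter(), schema)
    writer.append({'message_id': '<1731@thyme>', 'subject': 'Re: frop futures'})
    writer.close()

# Read: no external schema needed -- it travels with the data.
with open('email.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    for record in reader:
        print(record)
    reader.close()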

37

© Hortonworks Inc. 2012

0.1) Documents via Relation ETL

38

enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray
);

enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray, reciptype:chararray, address:chararray, name:chararray
);

split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;

emails = foreach with_headers generate enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here.

© Hortonworks Inc. 2012

0.2) Serialize events from streams

39

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)

Scrape your own gmail in Python and Ruby.

© Hortonworks Inc. 2012

0.3) ETL Logs

40

log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
  AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
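If you need the same fields outside Pig, a rough Python equivalent for the Apache common log format (the regex and field names are illustrative, not the piggybank loader's implementation):

import re

COMMON_LOG = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(COMMON_LOG.match(line).groupdict())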

© Hortonworks Inc. 2012

1) Plumb atomic events -> browser

41

(Example stack that enables high productivity)

© Hortonworks Inc. 2012

Lots of Stack Options with Examples

• Pig with Voldemort, Ruby, Sinatra: example

• Pig with ElasticSearch: example

• Pig with MongoDB, Node.js: example

• Pig with Cassandra, Python Streaming, Flask: example

• Pig with HBase, JRuby, Sinatra: example

• Pig with Hive via HCatalog: example (trivial on HDP)

• Up next: Accumulo, Redis, MySQL, etc.

42

© Hortonworks Inc. 2012

1.1) cat our Avro serialized events

43

me$ cat_avro ~/Data/enron.avro
{
 u'bccs': [],
 u'body': u'scamming people, blah blah',
 u'ccs': [],
 u'date': u'2000-08-28T01:50:00.000Z',
 u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
 u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
 u'subject': u'Re: Enron trade for frop futures',
 u'tos': [ {u'address': u'connie@enron.com', u'name': None} ]
}

Get cat_avro in python, ruby
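The linked cat_avro utilities aren't reproduced here; a rough Python equivalent (assuming the avro package) is just a loop that pretty-prints each record:

import sys
from pprint import pprint
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Usage: python cat_avro.py ~/Data/enron.avro
with open(sys.argv[1], 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    for record in reader:
        pprint(record)
    reader.close()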

© Hortonworks Inc. 2012

1.2) Load our events in Pig

44

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray, name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray, name: chararray)},
  ccs: {cc: (address: chararray, name: chararray)},
  bccs: {bcc: (address: chararray, name: chararray)}
}

 

© Hortonworks Inc. 2012

1.3) ILLUSTRATE our events in Pig

45

grunt> illustrate enron_emails 

--------------------------------------------------------------------------------------------------
| emails                                                 |                                         |
| message_id:chararray                                   | <1731.10095812390082.JavaMail.evans@thyme> |
| datetime:chararray                                     | 2001-01-09T06:38:00.000Z                |
| from:tuple(address:chararray,name:chararray)           | (bob.dobbs@enron.com, J.R. Bob Dobbs)   |
| subject:chararray                                      | Re: Enron trade for frop futures        |
| body:chararray                                         | scamming people, blah blah              |
| tos:bag{to:tuple(address:chararray,name:chararray)}    | {(connie@enron.com,)}                   |
| ccs:bag{cc:tuple(address:chararray,name:chararray)}    | {}                                      |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)}  | {}                                      |
--------------------------------------------------------------------------------------------------

Upgrade to Pig 0.10+

© Hortonworks Inc. 2012

1.4) Publish our events to a ‘database’

46

From Avro to MongoDB in one command:

pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, lets have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here.

© Hortonworks Inc. 2012

1.5) Check events in our ‘database’

47

$ mongo enron

MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails
system.indexes

> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
	"_id" : ObjectId("502b4ae703643a6a49c8d180"),
	"message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
	"date" : "2001-01-09T06:38:00.000Z",
	"from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
	"subject" : "Re: Enron trade for frop futures",
	"body" : "Scamming more people...",
	"tos" : [ { "address" : "connie@enron", "name" : null } ],
	"ccs" : [ ],
	"bccs" : [ ]
}

© Hortonworks Inc. 2012

1.6) Publish events on the web

48

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
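If your web layer is Python rather than Ruby, a comparable sketch with Flask and pymongo (database, collection and route names assumed to match the Sinatra example above):

import json
from flask import Flask
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient()['agile_data']['emails']

@app.route('/email/<path:message_id>')
def email(message_id):
    data = collection.find_one({'message_id': message_id})
    # default=str handles ObjectId and other non-JSON types
    return json.dumps(data, default=str)

if __name__ == '__main__':
    app.run(debug=True)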

© Hortonworks Inc. 2012

1.6) Publish events on the web

49

© Hortonworks Inc. 2012

What's the point?

• A designer can work against real data.

• An application developer can work against real data.

• A product manager can think in terms of real data.

• Entire team is grounded in reality!

• You’ll see how ugly your data really is.

• You’ll see how much work you have yet to do.

• Ship early and often!

• Feels agile, don’t it? Keep it up!

50

© Hortonworks Inc. 2012

1.7) Wrap events with Bootstrap

51

<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">

</head>

<body>

<div class="container" style="margin-top: 100px;">

<table class="table table-striped table-bordered table-condensed">

<thead>

{% for key in data['keys'] %}

<th>{{ key }}</th>

{% endfor %}

</thead>

<tbody>

<tr>

{% for value in data['values'] %}

<td>{{ value }}</td>

{% endfor %}

</tr>

</tbody>

</table>

</div>

</body>

Complete example here with code here.

© Hortonworks Inc. 2012

1.7) Wrap events with Bootstrap

52

© Hortonworks Inc. 2012

Refine. Add links between documents.

53

Not the Mona Lisa, but coming along... See: here

© Hortonworks Inc. 2012

1.8) List links to sorted events

54

mongo enron

> db.emails.ensureIndex({message_id: 1})

> db.emails.find().sort({date:0}).limit(10).pretty()

{
	"_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
	"message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
	"from" : [
...

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {

sorted = order emails by date;

last_1000 = limit sorted 1000;

generate group as from_address, emails as emails;

};

store emails_per_user into '$mongourl' using MongoStorage();

Use Pig, serve/cache a bag/array of email documents:

Use your ‘database’, if it can sort.

© Hortonworks Inc. 2012

1.8) List links to sorted documents

55

© Hortonworks Inc. 2012

1.9) Make it searchable...

56

If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */

register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';

register '/me/elasticsearch-0.18.6/lib/*';

define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();

store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

Test it with curl:

ElasticSearch has no security features. Take note. Isolate.

© Hortonworks Inc. 2012

From now on we speed up...

57

Don't worry, it's in the book and on the blog.

http://hortonworks.com/blog/

© Hortonworks Inc. 2012

2) Create Simple Charts

58

© Hortonworks Inc. 2012

2) Create Simple Tables and Charts

59

© Hortonworks Inc. 2012

2) Create Simple Charts

• Start with an HTML table on general principle.

• Then use nvd3.js - reusable charts for d3.js

• Aggregating by properties & displaying them is the first step in entity resolution

• Start extracting entities. Ex: people, places, topics, time series

• Group documents by entities, rank and count.

• Publish top N, time series, etc.

• Fill a page with charts.

• Add a chart to your event page.

60

© Hortonworks Inc. 2012

2.1) Top N (of anything) in Pig

61

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json.

This would make a good Pig Macro.
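For a quick local sanity check of the same top-N-per-key pattern outside Pig, a plain-Python sketch (the records and field names are illustrative):

import heapq
from itertools import groupby
from operator import itemgetter

things = [
    {'key': 'a', 'arbitrary_rank': 3}, {'key': 'a', 'arbitrary_rank': 9},
    {'key': 'b', 'arbitrary_rank': 5}, {'key': 'a', 'arbitrary_rank': 7},
]

top_n = {}
for key, group in groupby(sorted(things, key=itemgetter('key')), key=itemgetter('key')):
    # nlargest == ORDER ... DESC plus LIMIT 10, per group
    top_n[key] = heapq.nlargest(10, group, key=itemgetter('arbitrary_rank'))

print(top_n)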

© Hortonworks Inc. 2012

2.2) Time Series (of anything) in Pig

62

pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month), COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.

© Hortonworks Inc. 2012

Data processing in our stack

63

A new feature in our application might begin at any layer... great!

Any team member can add new features, no problemo!

I'm creative! I know Pig!

I'm creative too! I <3 Javascript!

omghi2u! where r my legs?

send halp

© Hortonworks Inc. 2012

Data processing in our stack

64

... but we shift the data-processing towards batch, as we are able.

Ex: Overall total emails calculated in each layer

See real example here.

© Hortonworks Inc. 2012

3) Exploring with Reports

65

© Hortonworks Inc. 2012

3) Exploring with Reports

66

© Hortonworks Inc. 2012

3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity gets its own type of web page

• Each unique entity gets its own web page

• Link to entities as they appear in atomic event documents (Step 1)

• Link the most related entities together, within and between types.

• More visualizations!

• Parameterize results via forms.

67

© Hortonworks Inc. 2012

3.1) Looks like this...

68

© Hortonworks Inc. 2012

3.2) Cultivate common keyspaces

69

© Hortonworks Inc. 2012

3.3) Get people clicking. Learn.

• Explore this web of generated pages, charts and links!

• Everyone on the team gets to know your data.

• Keep trying out different charts, metrics, entities, links.

• See what's interesting.

• Figure out what data needs cleaning and clean it.

• Start thinking about predictions & recommendations.

70

‘People’ could be just your team, if data is sensitive.

© Hortonworks Inc. 2012

4) Predictions and Recommendations

71

© Hortonworks Inc. 2012

4.0) Preparation

• We’ve already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We’ve cleaned our data to make it presentable

• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!

72

© Hortonworks Inc. 2012

4.2) Think in different perspectives

• Networks

• Time Series / Distributions

• Natural Language Processing

• Conditional Probabilities / Bayesian Inference

• Check out Chapter 2 of the book...

73

See here.

© Hortonworks Inc. 2012

4.3) Networks

74

© Hortonworks Inc. 2012

4.3.1) Weighted Email Networks in Pig

75

DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
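One hedged way to get those weighted edges into Gephi for the next step, assuming NetworkX is installed and sent_counts was stored as tab-separated text (file names are illustrative):

import csv
import networkx as nx

G = nx.DiGraph()
with open('sent_counts.tsv') as f:
    for ego1, ego2, total in csv.reader(f, delimiter='\t'):
        G.add_edge(ego1, ego2, weight=int(total))

# Gephi opens GEXF files directly.
nx.write_gexf(G, 'email_network.gexf')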

© Hortonworks Inc. 2012

4.3.2) Networks Viz with Gephi

76

© Hortonworks Inc. 2012

4.3.3) Gephi = Easy

77

© Hortonworks Inc. 2012

4.3.4) Social Network Analysis

78

© Hortonworks Inc. 2012

4.4) Time Series & Distributions

79

pig -l /tmp -x local -v -w

/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
  generate flatten(group) as (key, day),
           COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
  timeseries = order things_per_day by day;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

© Hortonworks Inc. 2012

4.4.2) Regress to find Trends

81

JRuby Linear Regression UDF
Pig to use the UDF

Trend Line in your Application
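The linked UDF isn't reproduced here, but the underlying fit is ordinary least squares; a sketch of the slope/intercept math in Python (the sample data is made up):

def linear_regression(xs, ys):
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

days = [1, 2, 3, 4, 5]
emails_per_day = [12, 15, 14, 20, 22]
m, b = linear_regression(days, emails_per_day)
print('trend: y = %.2fx + %.2f' % (m, b))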

© Hortonworks Inc. 2012

4.5.1) Natural Language Processing

82

Example with code here and macro here.

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group tfidf_all by message_id) {
  sorted = order tfidf_all by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
}
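The macro itself is linked above; the underlying score is just term frequency times inverse document frequency, sketched here in plain Python on toy documents (this is not the linked macro):

import math
from collections import Counter

docs = {
    'msg1': 'enron trade frop futures frop',
    'msg2': 'lunch meeting enron',
}

doc_terms = {doc_id: body.split() for doc_id, body in docs.items()}
n_docs = len(doc_terms)
df = Counter(term for terms in doc_terms.values() for term in set(terms))

for doc_id, terms in doc_terms.items():
    tf = Counter(terms)
    scores = {t: (tf[t] / float(len(terms))) * math.log(n_docs / float(df[t])) for t in tf}
    top_10 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(doc_id, top_10)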

© Hortonworks Inc. 2012

4.5.2) NLP: Extract Topics!

83

© Hortonworks Inc. 2012

4.6) Probability & Bayesian Inference

85

© Hortonworks Inc. 2012

4.6.1) Gmail Suggested Recipients

86

© Hortonworks Inc. 2012

4.6.1) Reproducing it with Pig...

87

© Hortonworks Inc. 2012

4.6.2) Step 1: COUNT(From -> To)

88

© Hortonworks Inc. 2012

4.6.2) Step 2: COUNT(From, To, Cc)/Total

89

P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
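A worked sketch of that conditional probability in terms of the two counts from Steps 1 and 2 (the counts below are made up):

# count_to[(frm, to)]        = emails frm sent with 'to' in the To: field (Step 1)
# count_to_cc[(frm, to, cc)] = those emails that also cc'd 'cc' (Step 2)
count_to = {('bob@enron.com', 'connie@enron.com'): 40}
count_to_cc = {('bob@enron.com', 'connie@enron.com', 'alice@enron.com'): 10}

def p_cc_given_to(frm, to, cc):
    return count_to_cc.get((frm, to, cc), 0) / float(count_to.get((frm, to), 1))

# Given Bob addresses Connie, how likely is he to cc Alice?  -> 0.25
print(p_cc_given_to('bob@enron.com', 'connie@enron.com', 'alice@enron.com'))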

© Hortonworks Inc. 2012

4.6.3) Wait - Stop Here! It works!

90

They match...

© Hortonworks Inc. 2012

4.4) Add predictions to reports

91

© Hortonworks Inc. 2012

5) Enable new actions

92

© Hortonworks Inc. 2012

Example: Packetpig and PacketLoop

93

snort_alerts = LOAD '$pcap'  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.

© Hortonworks Inc. 2012

• Amsterdam, March 20, 21st

• Call for papers now open!

• Submit a lightning talk!

• http://hadoopsummit.org/amsterdam/

• Discount coupons - 10% off!

95

© Hortonworks Inc. 2012


• Simplify deployment to get started quickly and easily

• Monitor, manage any size cluster with familiar console and tools

• Only platform to include data integration services to interact with any data

• Metadata services opens the platform for integration with existing applications

• Dependable high availability architecture

• Tested at scale to future proof your cluster growth

Hortonworks Data Platform

96

Reduce risks and cost of adoption · Lower the total cost to administer and provision · Integrate with your existing ecosystem

© Hortonworks Inc. 2012

Hortonworks Training

The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available

Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert

97

© Hortonworks Inc. 2012

Next Steps?

• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options

98

1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more… get support

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible

hortonworks.com/training | hortonworks.com/support

© Hortonworks Inc. 2012

Thank You!

Questions & Answers

Slides: http://slidesha.re/O8kjaF

Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog

99
