
Webinar: What's New with MongoDB Hadoop Integration

DESCRIPTION

MongoDB and Hadoop are often used together to deliver a powerful "Big Data" solution for complex analytics and data processing. With the MongoDB-Hadoop connector, you can easily take your data from MongoDB and process it with Hadoop. In this webinar, we introduce new features in the Hadoop integration, including using MongoDB as the input/output for Hadoop MapReduce jobs, running MapReduce jobs against static backup files (BSON), and using Pig to build data workflows with MongoDB. This webinar covers versions 1.0 through 1.2.


Page 1: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Integration
Mike O'Brien, Software Engineer @ 10gen

Thursday, August 8, 2013

Page 2: Webinar: What's New with MongoDB Hadoop Integration

We will cover:

The Mongo-Hadoop connector:
  • what it is
  • how it works
  • a tour of what it can do

A quick briefing on what Mongo and Hadoop are all about

(Q+A at the end)

Page 3: Webinar: What's New with MongoDB Hadoop Integration

Upcoming Webinar: Choosing the Right Tool for the Task

MongoDB and Hadoop - Essential Tools for Your Big Data Playbook

August 21st, 2013
10am PDT, 1pm EDT, 6pm BST

Register at 10gen.com/events/biz-hadoop


Page 6: Webinar: What's New with MongoDB Hadoop Integration

document-oriented database with dynamic schema

stores data in JSON-like documents:

{
  _id: "mike",
  age: 21,
  location: {
    state: "NY",
    zip: "11222"
  },
  favorite_colors: ["red", "green"]
}

Page 7: Webinar: What's New with MongoDB Hadoop Integration

MongoDB scales horizontally with sharding to handle lots of data and load.

[Diagram: an app connected to a growing number of shards.]

Page 12: Webinar: What's New with MongoDB Hadoop Integration

Hadoop: a Java-based framework for Map/Reduce.

Excels at batch processing on large data sets by taking advantage of parallelism.

Page 13: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector - Why

Lots of people use Hadoop and MongoDB separately, but need them integrated.

Getting data in and out is often done with custom code or slow, fragile import/export scripts.

Need scalability and flexibility as Hadoop or MongoDB configurations change.

Need to process data across multiple sources.

Page 14: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector

Turn MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop.

New Feature: As of v1.1, also works with MongoDB backup files (.bson).

[Diagram: input data comes from MongoDB or .BSON files, flows into the Hadoop cluster, and output results flow back to MongoDB or .BSON.]

Page 15: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector
Benefits + Features

Takes advantage of full multi-core parallelism to process data in Mongo

Full integration with the Hadoop and JVM ecosystems

Can be used with Amazon Elastic MapReduce

Can read and write backup files from the local filesystem, HDFS, or S3

Page 20: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector
Benefits + Features

Vanilla Java MapReduce

or, if you don't want to use Java, support for Hadoop Streaming: write MapReduce code in Ruby or Python

Page 26: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector
Benefits + Features

Support for Pig: a high-level scripting language for data analysis and building map/reduce workflows

Support for Hive: a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems

Page 29: Webinar: What's New with MongoDB Hadoop Integration

Mongo-Hadoop Connector

How it works:

The adapter examines the MongoDB input collection and calculates a set of splits from the data.

Each split gets assigned to a node in the Hadoop cluster.

In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally.

Hadoop merges the results and streams the output back to MongoDB or BSON.
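To make that flow concrete, here is a minimal sketch of a job driver that wires the connector's input and output formats into a standard Hadoop job. The format classes and property names come from the config slides later in this deck; the driver wiring itself, the Hadoop 2-style API, the BSONWritable package location, and the class names EnronMapper/EnronReducer/MailPair (the mapper, reducer, and key shown in the Java example below) are assumptions, not code from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class EnronJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same properties as the config slides: where to read from and write to.
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

        Job job = Job.getInstance(conf, "enron sender-recipient counts");
        job.setJarByClass(EnronJobDriver.class);

        // The connector supplies the InputFormat/OutputFormat; splits are
        // calculated from the input collection as described above.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        job.setMapperClass(EnronMapper.class);      // hypothetical name for the mapper shown later
        job.setReducerClass(EnronReducer.class);    // hypothetical name for the reducer shown later
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}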

Page 34: Webinar: What's New with MongoDB Hadoop Integration

Tour of Mongo-Hadoop, by Example

- Using Java MapReduce with Mongo-Hadoop

- Using Hadoop Streaming

- Pig and Hive with Mongo-Hadoop

- Elastic MapReduce + BSON

Page 39: Webinar: What's New with MongoDB Hadoop Integration

Input Data: Enron e-mail corpus (501k records, 1.75 GB). Each document is one email; the From header is the sender and the To header lists the recipients.

{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n ",
  "filename" : "1.",
  "headers" : {
    "From" : "[email protected]",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "[email protected]",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

Page 43: Webinar: What's New with MongoDB Hadoop Integration

Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair.

[Diagram: nodes bob, alice, eve, and charlie connected by edges labeled with message counts such as 1499, 9, 48, and 20.]

Page 46: Webinar: What's New with MongoDB Hadoop Integration

Example 1 - Java MapReduce

Map phase - each input doc (a MongoDB document passed into Hadoop MapReduce) gets passed through a Mapper function:

@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
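The MailPair key used above is a custom Hadoop key type that is not shown on the slides. A minimal sketch of what such a class could look like follows, assuming a plain two-string WritableComparable; the actual class in the mongo-hadoop examples may differ.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MailPair implements WritableComparable<MailPair> {
    String from;
    String to;

    public MailPair() { }                      // no-arg constructor required by Hadoop serialization

    public MailPair(String from, String to) {
        this.from = from;
        this.to = to;
    }

    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    public int compareTo(MailPair o) {         // ordering groups identical {from, to} pairs in the reducer
        int cmp = from.compareTo(o.from);
        return cmp != 0 ? cmp : to.compareTo(o.to);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }

    @Override
    public int hashCode() {
        return from.hashCode() * 31 + to.hashCode();
    }
}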

Page 48: Webinar: What's New with MongoDB Hadoop Integration

Example 1 - Java MapReduce (cont)

Reduce phase - the outputs of Map are grouped together by key (the {from, to} pair) and passed to the Reducer along with the list of all values collected under that key; the output is written back to MongoDB:

public void reduce(final MailPair pKey,
                   final Iterable<IntWritable> pValues,
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = new BasicDBObjectBuilder().start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}

Page 52: Webinar: What's New with MongoDB Hadoop Integration

Example 1 - Java MapReduce (cont)

Read from MongoDB:

mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON:

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson

(the input dir can also be hdfs:///tmp/messages.bson or s3:///tmp/messages.bson)

Page 55: Webinar: What's New with MongoDB Hadoop Integration

Example 1 - Java MapReduce (cont)

Write output to MongoDB:

mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON:

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson

(the output dir can also be hdfs:///tmp/results.bson or s3:///tmp/results.bson)

Page 59: Webinar: What's New with MongoDB Hadoop Integration

Example 2 - Hadoop Streaming

Let’s do the same Enron Map/Reduce job with Python instead of Java

$ pip install pymongo_hadoop


Page 60: Webinar: What's New with MongoDB Hadoop Integration

Example 2 - Hadoop Streaming (cont)

Hadoop passes data to an external process via STDOUT/STDIN.

[Diagram: the Hadoop JVM sends map(k, v) calls over STDIN to a Python / Ruby / JS interpreter running your script (def mapper(documents): ...), and reads the results back over STDOUT.]

Page 61: Webinar: What's New with MongoDB Hadoop Integration

Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Page 62: Webinar: What's New with MongoDB Hadoop Integration

Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)

Page 63: Webinar: What's New with MongoDB Hadoop Integration

Surviving Hadoop: making MapReduce easier with Pig + Hive

Page 64: Webinar: What's New with MongoDB Hadoop Integration

Example 3 - Mongo-Hadoop and Pig

Let's do the same thing yet again, but this time using Pig.

Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts.

It can perform JOINs and GROUPs, and execute user-defined functions (UDFs).

Page 67: Webinar: What's New with MongoDB Hadoop Integration

Example 3 - Mongo-Hadoop and Pig (cont)

Pig directives for loading data: BSONLoader and MongoLoader

data = LOAD 'mongodb://localhost:27017/db.collection'
    using com.mongodb.hadoop.pig.MongoLoader;

Writing data out: BSONStorage and MongoInsertStorage

STORE records INTO 'file:///output.bson'
    using com.mongodb.hadoop.pig.BSONStorage;

Page 68: Webinar: What's New with MongoDB Hadoop Integration

Example 3 - Mongo-Hadoop and Pig (cont)

Pig has its own special datatypes: Bags, Maps, and Tuples

The Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes.

Page 69: Webinar: What's New with MongoDB Hadoop Integration

Example 3 - Mongo-Hadoop and Pig (cont)

raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');

send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;

send_recip_filtered = FILTER send_recip BY to IS NOT NULL;

send_recip_split = FOREACH send_recip_filtered GENERATE from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;

send_recip_grouped = GROUP send_recip_split BY (from, to);

send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count;

STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;

Page 74: Webinar: What's New with MongoDB Hadoop Integration

Hive with Mongo-Hadoop

Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch, but with SQL as the language of choice.

Page 75: Webinar: What's New with MongoDB Hadoop Integration

Hive with Mongo-Hadoop

First, declare the collection to be accessible in Hive:

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES ( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users" );

Sample data in db.users:

> db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...


Page 77: Webinar: What's New with MongoDB Hadoop Integration

Hive with Mongo-Hadoop

...then you can run SQL on it, like a table:

SELECT name, age FROM mongo_users WHERE id > 100;

You can use GROUP BY:

SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

Or JOIN multiple tables/collections together:

SELECT * FROM mongo_users T1 JOIN user_emails T2 ON (T1.id = T2.id);

Page 80: Webinar: What's New with MongoDB Hadoop Integration

Write the output of queries back into new tables:

INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;

Dropping a table in Hive deletes the underlying collection in MongoDB:

DROP TABLE mongo_users;

Page 83: Webinar: What's New with MongoDB Hadoop Integration

Usage with Amazon Elastic MapReduce

Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.

Page 84: Webinar: What's New with MongoDB Hadoop Integration

Usage with Amazon Elastic MapReduce

First, make a "bootstrap" script that fetches the dependencies (the mongo-hadoop jar and the MongoDB Java driver):

#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar

This will get executed on each node in the cluster that EMR builds for us.

Page 85: Webinar: What's New with MongoDB Hadoop Integration

Example 4 - Usage with Amazon Elastic MapReduce

Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it.

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read

s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read

Page 86: Webinar: What's New with MongoDB Hadoop Integration

Example 4 - Usage with Amazon Elastic MapReduce

. . . then launch the job from the command line, pointing at your S3 locations. --instance-type and --num-instances control the type and number of instances in the cluster.

$ elastic-mapreduce --create --jobflow ENRON000 \
    --instance-type m1.xlarge \
    --num-instances 5 \
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh \
    --log-uri s3://$S3_BUCKET/enron_logs \
    --jar s3://$S3_BUCKET/enron-example.jar \
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat \
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson \
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT \
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)

Page 87: Webinar: What's New with MongoDB Hadoop Integration

Example 4 - Usage with Amazon Elastic MapReduce

Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster

Turn up the "num-instances" knob to make jobs complete faster

Logs get captured into S3 files

(Pig, Hive, and streaming work on EMR, too!)

Page 91: Webinar: What's New with MongoDB Hadoop Integration

Example 5 - new feature: MongoUpdateWritable

In the previous examples, we wrote job output data by inserting into a new collection... but we can also modify an existing output collection.

Works by applying MongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc.

Can be used to do incremental Map/Reduce or to "join" two collections.
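As a hedged illustration of how this looks in code, here is a minimal reducer sketch that emits a MongoUpdateWritable instead of a plain document, anticipating the sensor/log-events example that follows. The constructor arguments (query, modifiers, upsert, multi) match the snippet shown later in this deck; the class name, field names ("sensors", "logs_count"), and package locations are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BasicBSONObject;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.io.MongoUpdateWritable;

public class LogCountReducer extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {
    @Override
    public void reduce(final Text sensorId, final Iterable<IntWritable> values, final Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (final IntWritable v : values) {
            count += v.get();
        }
        // Match result documents whose "sensors" array contains this sensor's id...
        BasicBSONObject query = new BasicBSONObject("sensors", new ObjectId(sensorId.toString()));
        // ...and increment their running log count by this reducer's partial sum.
        BasicBSONObject update = new BasicBSONObject("$inc", new BasicBSONObject("logs_count", count));
        context.write(null, new MongoUpdateWritable(
                query,    // which documents to modify
                update,   // how to modify them ($inc)
                true,     // upsert
                false));  // multi
    }
}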

Page 92: Webinar: What's New with MongoDB Hadoop Integration

Example 5 - MongoUpdateWritable

Let's say we have two collections.

sensors:

{
    "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
    "name": "730LsRkX",
    "type": "pressure",
    "owner": "steve"
}

log events (the sensor_id field refers to which sensor logged the event):

{
    "_id": ObjectId("51b792d381c3e67b0a18d678"),
    "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),
    "value": 3328.5895416489802,
    "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
    "loc": [-175.13, 51.658]
}

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.

Page 99: Webinar: What's New with MongoDB Hadoop Integration

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.

In plain English:

Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...

Page 101: Webinar: What's New with MongoDB Hadoop Integration

Stage 1 - Map/Reduce on the sensors collection

Read the sensors collection from MongoDB, run map/reduce, and insert() new records into a results collection in MongoDB:

map: for each sensor, emit {key: owner+type, value: _id}
reduce: group the data from map() under each key and output {key: owner+type, val: [list of _ids]}
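A rough Java sketch of what Stage 1 could look like with the connector is below. It follows the emit/group logic described above and mirrors the mapper signature from the earlier Java example; the class names, exact key/value types, and storing sensor ids as strings (rather than ObjectIds) are assumptions, not code from the slides.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.io.BSONWritable;

public class SensorStage1 {

    // For each sensor, emit {key: owner + " " + type, value: _id}.
    public static class SensorMapper extends Mapper<NullWritable, BSONObject, Text, Text> {
        @Override
        public void map(NullWritable key, BSONObject val, Context context)
                throws IOException, InterruptedException {
            String owner = (String) val.get("owner");
            String type = (String) val.get("type");
            context.write(new Text(owner + " " + type), new Text(val.get("_id").toString()));
        }
    }

    // Group the sensor ids under each owner+type key and output one result document per group.
    public static class SensorReducer extends Reducer<Text, Text, Text, BSONWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> sensors = new ArrayList<String>();
            for (Text id : values) {
                sensors.add(id.toString());
            }
            BSONObject out = new BasicBSONObject("sensors", sensors);
            context.write(key, new BSONWritable(out));   // key becomes the owner+type result _id
        }
    }
}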

Page 102: Webinar: What's New with MongoDB Hadoop Integration

After stage one, the output docs look like this (the _id is the sensor's owner and type; "sensors" is the list of IDs of sensors with this owner and type):

{
    "_id": "alice pressure",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        ...
    ]
}

Now we just need to count the total number of log events recorded for any sensors that appear in the list for each owner/type group.

Page 106: Webinar: What's New with MongoDB Hadoop Integration

Stage 2 - Map/Reduce on the log events collection

Read the log events collection from MongoDB, run map/reduce, and update() the existing records in the results collection:

map: for each log event, emit {key: sensor_id, value: 1}
reduce: group the data from map() under each key; for each value under that key:
    update({sensors: key}, {$inc: {logs_count: 1}})

context.write(null, new MongoUpdateWritable(
    query,    // which documents to modify
    update,   // how to modify ($inc)
    true,     // upsert
    false));  // multi

Page 108: Webinar: What's New with MongoDB Hadoop Integration

Example 5 - MongoUpdateWritable

Result after stage 2 - the "logs_count" field is now populated with the correct count:

{
    "_id": "1UoTcvnCTz temp",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        ...
    ],
    "logs_count": 1050616
}

Page 109: Webinar: What's New with MongoDB Hadoop Integration

Upcoming Features (v1.2 and beyond)

Full-featured Hive support

Performance Improvements - Lazy BSON

Support for multi-collection input sources

API for adding custom splitter implementations

and more


Page 110: Webinar: What's New with MongoDB Hadoop Integration

Recap

Mongo-Hadoop - use Hadoop to do massive computations on big data sets stored in Mongo/BSON

Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.

MongoDB becomes a Hadoop-enabled filesystem
