MongoDB and Apache Flink / Spark
“How to do Data Processing?”
Marc Schwering
Sr. Solutions Architect – EMEA
@m4rcsch
Agenda For This Session
• Data Processing Architectural Overview
• The Life of an Application
• Separation of Concerns / Real World Architecture
• Apache Spark and Flink Data Processing Projects
• Clustering with Apache Flink
• Next Steps
Data Processing Architectural Overview
1. Profile created
2. Enrich with public data
3. Capture activity
4. Clustering analysis
5. Define Personas
6. Tag with personas
7. Personalize interactions
Batch analytics
Public data
Common technologies
• R
• Hadoop
• Spark
• Python
• Java
• Many other options
Personas change much less often than tagging
Evolution of a Profile (1)
{
  "_id" : ObjectId("553ea57b588ac9ef066428e1"),
  "ipAddress" : "216.58.219.238",
  "referrer" : "kay.com",
  "firstName" : "John",
  "lastName" : "Doe",
  "email" : "[email protected]"
}
Evolution of a Profile (n+1)
{
  "_id" : ObjectId("553e7dca588ac9ef066428e0"),
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : "229 W. 43rd St.",
  "city" : "New York",
  "state" : "NY",
  "zipCode" : "10036",
  "age" : 30,
  "email" : "[email protected]",
  "twitterHandle" : "johndoe",
  "gender" : "male",
  "interests" : [
    "electronics",
    "basketball",
    "weightlifting",
    "ultimate frisbee",
    "traveling",
    "technology"
  ],
  "visitedCounts" : {
    "watches" : 3,
    "shirts" : 1,
    "sunglasses" : 1,
    "bags" : 2
  },
  "purchases" : [
    {
      "id" : 1,
      "desc" : "Power Oxford Dress Shoe",
      "category" : "Mens shoes"
    },
    {
      "id" : 2,
      "desc" : "Striped Sportshirt",
      "category" : "Mens shirts"
    }
  ],
  "persona" : "shoe-fanatic"
}
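The two documents show one profile growing as new fields arrive. A minimal sketch of that growth, assuming plain `java.util.Map` objects as stand-ins for MongoDB documents; the class `ProfileEvolution` and its `enrich` method are hypothetical and not part of any driver:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Maps stand in for MongoDB documents.
class ProfileEvolution {

    // Returns a new document with the fields of `update` merged into `profile`.
    // Existing fields are overwritten and unknown fields are added, which
    // mirrors how a flexible-schema document grows over time.
    static Map<String, Object> enrich(Map<String, Object> profile, Map<String, Object> update) {
        Map<String, Object> merged = new LinkedHashMap<>(profile);
        merged.putAll(update);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> created = new LinkedHashMap<>();
        created.put("firstName", "John");
        created.put("lastName", "Doe");

        Map<String, Object> captured = new LinkedHashMap<>();
        captured.put("city", "New York");
        captured.put("interests", List.of("electronics", "basketball"));

        Map<String, Object> tagged = new LinkedHashMap<>();
        tagged.put("persona", "shoe-fanatic");

        // Profile created, then enriched, then tagged with a persona.
        Map<String, Object> evolved = enrich(enrich(created, captured), tagged);
        System.out.println(evolved.keySet());
    }
}
```

No schema change is needed between the steps; each stage only adds the fields it knows about.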
One size/document fits all?
• Profile Data
– Preferences
– Personal information
• Contact information
• DOB, gender, ZIP...
• Customer Data
– Purchase History
– Marketing History
• "Session Data"
– View History
– Shopping Cart Data
– Information Broker Data
• Personalisation Data
– Persona Vectors
– Product and Category recommendations
Application
Batch analytics
Separation of Concerns
• Profile Data
– Preferences
– Personal information
• Contact information
• DOB, gender, ZIP...
• Customer Data
– Purchase History
– Marketing History
• "Session Data"
– View History
– Shopping Cart Data
– Information Broker Data
• Personalisation Data
– Persona Vectors
– Product and Category recommendations
Batch analytics Layer
Frontend – System
Profile Service
Customer Service
Session Service
Persona Service
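The split into four services can be sketched as one small interface per concern. This is only an illustration: the interface names follow the slide, but every method name and the `Frontend` class are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical service boundaries; method names are illustrative only.
interface ProfileService { Map<String, String> contactInfo(String userId); }
interface CustomerService { List<String> purchaseHistory(String userId); }
interface SessionService { List<String> shoppingCart(String userId); }
interface PersonaService { String personaOf(String userId); }

// The frontend composes the services it needs instead of reading one big document.
class Frontend {
    private final ProfileService profiles;
    private final PersonaService personas;

    Frontend(ProfileService profiles, PersonaService personas) {
        this.profiles = profiles;
        this.personas = personas;
    }

    // Builds a personalized greeting from two independently owned data sets.
    String greeting(String userId) {
        String name = profiles.contactInfo(userId).getOrDefault("firstName", "guest");
        return "Hello " + name + " (" + personas.personaOf(userId) + ")";
    }
}
```

Because each interface has a single method here, a test or prototype can supply lambdas, for example `new Frontend(id -> Map.of("firstName", "John"), id -> "shoe-fanatic")`.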
Benefits
• Code does less; documents and code stay focused
• Ability to split
– Different Teams
– New Languages
– Defined Dependencies
Advice for Developers (1)
• Code does less; documents and code stay focused
• Ability to split
– Different Teams
– New Languages
– Defined Dependencies
KISS
=> Keep it simple and safe!
=> Clean Code <=
• Robert C. Martin: https://cleancoders.com/
• M. Fowler / B. Meyer et al.: Command Query Separation
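Command Query Separation asks that a method either changes state or returns a value, but not both. A minimal sketch on a hypothetical `VisitCounter` class (the class and its method names are invented for this example, loosely modeled on the `visitedCounts` field of the profile document):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Command Query Separation on a hypothetical visit counter.
class VisitCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Command: changes state and returns nothing.
    void recordVisit(String category) {
        counts.merge(category, 1, Integer::sum);
    }

    // Query: returns a value and has no side effects.
    int visitsFor(String category) {
        return counts.getOrDefault(category, 0);
    }
}
```

Calling `visitsFor` any number of times leaves the counter unchanged, so reads can be cached or repeated safely.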
Analytics and Personalization
From Query to Clustering
Architecture revised
Profile Service
Customer Service
Session Service
Persona Service
Frontend – System
Backend – Systems
Data Processing
Advice for Developers (2)
• OWN YOUR DATA! (but only relevant data)
• Say no! (to direct data access, i.e. DB access)
Data Processing Solutions
Hadoop in a Nutshell
• An open source framework for distributed storage and distributed, batch-oriented processing
• Hadoop Distributed File System (HDFS) to store data on commodity hardware
• YARN as the resource management platform
• MapReduce as the programming model, working on top of HDFS
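The MapReduce programming model can be illustrated without a cluster. The sketch below runs the three phases (map, shuffle, reduce) in a single JVM to count purchases per category; it is an assumption-laden toy, not Hadoop's API, and the class `MiniMapReduce` is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-JVM illustration of the MapReduce phases; not the Hadoop API.
class MiniMapReduce {

    // Map phase: emit (category, 1) for every purchase category.
    static List<Map.Entry<String, Integer>> map(List<String> categories) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String c : categories) {
            out.add(new SimpleEntry<>(c, 1));
        }
        return out;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((k, vs) -> totals.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        List<String> categories = List.of("Mens shoes", "Mens shirts", "Mens shoes");
        System.out.println(reduce(shuffle(map(categories))));
    }
}
```

In Hadoop the map and reduce functions run on many machines and the shuffle moves data between them, with intermediate results written to disk between jobs.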
Spark in a Nutshell
• Spark is a top-level Apache project
• Can be run on top of YARN and can read any data exposed through the Hadoop API, including HDFS or MongoDB
• Fast and general engine for large-scale data processing and analytics
• Advanced DAG execution engine with support for data locality and in-memory computing
Flink in a Nutshell
• Flink is a top-level Apache project
• Can be run on top of YARN and can read any data exposed through the Hadoop API, including HDFS or MongoDB
• A distributed streaming dataflow engine
• Streaming and batch
• Iterative in-memory execution and handling
• Cost-based optimizer
Latency of query operations
[Figure: time axis across operation types (Query, Aggregation, MapReduce, Cluster Algorithms); systems shown: MongoDB, Hadoop, Spark/Flink]
Iterative Algorithms / Clustering
K-Means as a Process
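As a reference for the iterative process, here is a minimal single-machine K-Means sketch on 2-D points. It is not the Flink job from this talk; the class `KMeansSketch`, the array-based point representation, and the fixed iteration count are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

// Minimal single-machine K-Means sketch on 2-D points (x, y).
class KMeansSketch {

    // Index of the centroid closest to p, by squared Euclidean distance.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = p[0] - centroids[c][0];
            double dy = p[1] - centroids[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // One iteration: assign every point to its nearest centroid,
    // then move each centroid to the mean of its assigned points.
    static double[][] step(double[][] points, double[][] centroids) {
        double[][] sums = new double[centroids.length][2];
        int[] counts = new int[centroids.length];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            sums[c][0] += p[0];
            sums[c][1] += p[1];
            counts[c]++;
        }
        double[][] next = new double[centroids.length][2];
        for (int c = 0; c < centroids.length; c++) {
            // A centroid with no assigned points keeps its previous position.
            next[c][0] = counts[c] == 0 ? centroids[c][0] : sums[c][0] / counts[c];
            next[c][1] = counts[c] == 0 ? centroids[c][1] : sums[c][1] / counts[c];
        }
        return next;
    }

    // Repeats the assign-and-update step a fixed number of times.
    static double[][] run(double[][] points, double[][] initial, int iterations) {
        double[][] centroids = initial;
        for (int i = 0; i < iterations; i++) {
            centroids = step(points, centroids);
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] initial = {{1, 1}, {9, 9}};
        System.out.println(Arrays.deepToString(run(points, initial, 5)));
    }
}
```

Every iteration reads all points again and feeds the new centroids into the next pass, which is why the cost of starting each iteration matters so much in a distributed engine.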
Iterations in Hadoop and Spark
Iterations in Flink
• Dedicated iteration operators
• Tasks keep running for the iterations, not redeployed for each step
• Caching and optimizations done automatically
Example code
Reader / Writer Config
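// Imports assumed from Flink's Hadoop compatibility layer and the mongo-hadoop
// connector; package names may differ between versions.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.mapred.JobConf;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.mapred.MongoInputFormat;
import com.mongodb.hadoop.mapred.MongoOutputFormat;

// Reader config: wraps the mongo-hadoop input format as a Flink data source
public static DataSet<Tuple2<BSONWritable, BSONWritable>> readFromMongo(ExecutionEnvironment env, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.input.uri", uri);
    MongoInputFormat mongoInputFormat = new MongoInputFormat();
    return env.createHadoopInput(mongoInputFormat, BSONWritable.class, BSONWritable.class, conf);
}

// Writer config: wraps the mongo-hadoop output format as a Flink sink
public static void writeToMongo(DataSet<Tuple2<BSONWritable, BSONWritable>> result, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.output.uri", uri);
    MongoOutputFormat<BSONWritable, BSONWritable> mongoOutputFormat = new MongoOutputFormat<>();
    result.output(new HadoopOutputFormat<>(mongoOutputFormat, conf));
}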
Import data
//points
DataSet<Tuple2<BSONWritable, BSONWritable>> inPoints = readFromMongo(env, mongoInputUri + pointsSource);
//centers
DataSet<Tuple2<BSONWritable, BSONWritable>> inCenters = readFromMongo(env, mongoInputUri + centerSource);
DataSet<Point> points = convertToPointSet(inPoints);
DataSet<Centroid> centroids = convertToCentroidSet(inCenters);
Converting
// Maps an (id, Point) pair to the (key, value) BSON pair written to MongoDB
public Tuple2<BSONWritable, BSONWritable> map(Tuple2<Integer, Point> integerPointTuple2) throws Exception {
    Integer id = integerPointTuple2.f0;
    Point point = integerPointTuple2.f1;

    // Key: a document that holds only the _id
    BasicDBObject idDoc = new BasicDBObject();
    idDoc.put("_id", id);
    BSONWritable bsonId = new BSONWritable();
    bsonId.setDoc(idDoc);

    // Value: the full document with the point's coordinates
    BasicDBObject doc = new BasicDBObject();
    doc.put("_id", id);
    doc.put("x", point.x);
    doc.put("y", point.y);
    BSONWritable bsonDoc = new BSONWritable();
    bsonDoc.setDoc(doc);

    return new Tuple2<>(bsonId, bsonDoc);
}
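The opposite direction, from a stored document back to a point, is not shown on the slides. A rough sketch of what such a conversion might look like, assuming a plain `Map` as a stand-in for the BSON document and a hypothetical `Point` class with `x` and `y` fields; it does not reproduce the `convertToPointSet` helper used earlier:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical conversion helpers; a Map stands in for a BSON document.
class PointConversion {

    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Reads the numeric "x" and "y" fields of a document.
    // Throws IllegalArgumentException if either field is missing or not a number.
    static Point fromDocument(Map<String, Object> doc) {
        Object x = doc.get("x");
        Object y = doc.get("y");
        if (!(x instanceof Number) || !(y instanceof Number)) {
            throw new IllegalArgumentException("document needs numeric x and y: " + doc);
        }
        return new Point(((Number) x).doubleValue(), ((Number) y).doubleValue());
    }

    // Builds the document shape used by the map function: _id, x, y.
    static Map<String, Object> toDocument(int id, Point p) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", id);
        doc.put("x", p.x);
        doc.put("y", p.y);
        return doc;
    }
}
```

Accepting any `Number` keeps the conversion tolerant of documents where coordinates were stored as integers rather than doubles.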
Result
More…?
Takeaways
• Evolution is amazing and exciting!
– Be ready to learn new things, ask questions across silos!
• Stay focused => Start and stay small
– Evaluate with big documents, but do a PoC focused on the topic
• Extending functionality could be challenging
– Evolution is outpacing help channels
– A lot of options (Spark, Flink, Storm, Hadoop…)
– More than just a binary
• Extending functionality is easy
– Aggregation, MapReduce
– Connectors opening a new variety of use cases
Next Steps
• Try out Flink
– http://flink.apache.org/
– https://github.com/mongodb/mongo-hadoop
– https://github.com/m4rcsch/flink-mongodb-example
– http://sparkbigdata.com
• Participate and ask Questions!
– @m4rcsch
Thank you!