Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis
Elif Dede, Madhusudhan Govindaraju (Department of Computer Science, Binghamton University, SUNY)
Lavanya Ramakrishnan, Dan Gunter, Shane Canon (Lawrence Berkeley National Laboratory)
Computation and Data are critical parts of the scientific process

Three Pillars of Science: Experiment, Theory, Computation
Data (the Fourth Paradigm)

Advanced Light Source Data Rates:
• 2009: 65 TB/yr
• 2011: 312 TB/yr
• 2013: 1900 TB/yr
Materials Project
[Figure: Materials Project schemaless database architecture — www.materialsproject.org. Source: Michael Kocher, Daniel Gunter]
Data is “Big”
Processing “Big Data”: MapReduce
• Introduced in OSDI 2004 by Dean and Ghemawat from Google
• Programming model for processing large data sets
• Exploits a large set of commodity machines
• Characteristics of the model:
  • Relaxed synchronization constraints
  • Locality optimization
  • Fault tolerance
  • Load balancing
Map and Reduce
• The map() function is called on every item in the input set and emits a series of intermediate key/value pairs
• All values associated with a given intermediate key are grouped together
• The reduce() function is called once for every unique intermediate key with its grouped value list, and emits a final output value
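The model described above can be sketched in a few lines of Python — a toy in-memory word count, not tied to any particular framework:

```python
from collections import defaultdict

def map_fn(line):
    # Emit an intermediate (key, value) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Called once per unique intermediate key with all of its values.
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    # Map phase: apply map_fn to every input item.
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: one call per unique key and its value list.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce(["the cat", "the dog"], map_fn, reduce_fn)
# counts == {"the": 2, "cat": 1, "dog": 1}
```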
Apache Hadoop
• Open-source MapReduce implementation in Java
• Easy scalability
• Built-in I/O management
  • Hadoop Distributed File System (HDFS)
  • Data distribution, management and replication
• Load balancing
  • Handles stragglers
• Fault tolerance
  • Commodity hardware
  • Heartbeats
  • Speculative execution and data replication
• Hadoop Streaming
  • Create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
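A Hadoop Streaming job talks to its mapper and reducer over stdin/stdout, one tab-separated key/value record per line. A minimal word-count pair might look like this (the script name and invocation in the comments are illustrative, not from the paper):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit one tab-separated "word\t1" record per word.
    return [f"{word}\t1" for line in lines for word in line.split()]

def reducer(lines):
    # Hadoop sorts the mapper output by key before the reduce phase,
    # so all records for one key arrive contiguously; sum per key.
    pairs = [line.rstrip("\n").split("\t") for line in lines]
    return [f"{key}\t{sum(int(v) for _, v in grp)}"
            for key, grp in groupby(pairs, key=lambda kv: kv[0])]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked by Hadoop Streaming, e.g. (hypothetical command line):
    #   hadoop jar hadoop-streaming.jar -mapper "python wc.py map" \
    #       -reducer "python wc.py reduce" -input in/ -output out/
    step = mapper if sys.argv[1] == "map" else reducer
    print("\n".join(step(sys.stdin)))
```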
Scientific Computing and Hadoop
Hadoop provides:
• Data Flow Parallelism
  • Data goes through different steps of processing
• Similar Job Phases
  • Data preparation, transformation and reduction
  • MapReduce: maps (transformation) and reduces (reduction)
• Number of maps >> number of reduces
  • Data transformation is typically more parallel than data reduction
• Fault Tolerance and Data Locality
  • Data-intensive loads
  • Long-running scientific jobs
Scientific Computing and Hadoop (Cont.)
Hadoop does not provide:
• Java implementation
  • Legacy scientific code is mostly not in Java and is hard to rewrite as map and reduce functions
  • Hadoop Streaming allows other modes
• HDFS is a non-POSIX file system
  • HDFS Java library calls are needed to create, read and write files
  • HDFS data locality is good, but does not handle applications that might have multiple data sets
• Scientific data formats do not fit the line/block-oriented inputs of typical Hadoop jobs
  • Scientific applications often work with files where the logical division of work is per file
  • New file formats require additional Java programming to define the format and the appropriate split for a single map task
Scientific Computing and Hadoop (Cont.)
Hadoop does not provide:
• Maps and reduces are considered identical (executables/arguments)
  • Implementing different tasks requires logic in the tasks that differentiates the functionality
  • This can cause worker processing times to vary widely and lead to timeouts and restarted tasks due to the speculative execution in Hadoop
• No built-in dynamic and iterative application support
New Generation Data
• Dynamic Data
  • Size and Content
• Structured?
  • Semi-structured, unstructured
• Relational?
  • Not always
NoSQL
A broad class of data management systems in which the data is partitioned across a set of servers, no one of which plays a privileged role.
• NoSQL has emerged as an alternative model for this new non-relational data.
• Addresses the "Big Data" challenge by providing horizontal scalability.
• Lower maintenance costs and flexibility.
• Various data models are represented under NoSQL, including key-value, column-oriented and document-oriented stores.
• Each of these models has its own interpretation of data storage and makes different tradeoffs among consistency, availability and performance.
What is MongoDB?
• Open-source document-oriented database
• Data is not in tables with rows and columns
• Data is stored as "documents", each of which is an associative array of scalar values or nested associative arrays
• JavaScript Object Notation (JSON) format
  • Stored as BSON
• MongoDB uses sharding to split the data evenly across the cluster to parallelize access
  • This is done through front-end "routing servers" and back-end "data servers"
• Provides a built-in MapReduce
  • Drawbacks:
    • The MapReduce scripts must be written in JavaScript
    • Slow, with poor analytics libraries
    • The JavaScript implementation used by MongoDB is not thread-safe
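As a concrete illustration, a hypothetical materials record in document form; the field names are invented for the example, and plain JSON stands in for the BSON that MongoDB stores internally:

```python
import json

# A MongoDB-style document: an associative array of scalar values
# and nested associative arrays, with no fixed schema.
# All field names and values here are illustrative only.
doc = {
    "formula": "Fe2O3",
    "nelements": 2,
    "properties": {          # nested associative array
        "band_gap": 2.2,
        "density": 5.24,
    },
    "tags": ["oxide", "magnetic"],  # arrays of scalars are allowed too
}

# MongoDB speaks JSON at the API level; on disk the same structure
# is encoded as BSON, a binary serialization of JSON-like documents.
text = json.dumps(doc, sort_keys=True)
round_tripped = json.loads(text)
```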
Why MongoDB?
Materials Project:
• A community-accessible data store of calculated materials data.
• The data store is complex, with hundreds of attributes, and constantly evolving.
• MongoDB provides an appropriate data model and query language.
• The project also needs to perform complex statistical data mining to discover patterns in materials and validate/verify correctness.
• These tasks are difficult with MongoDB but natural for MapReduce.
ALS:
• The Advanced Light Source's tomography beamline uses MongoDB to store metadata from experiments (Summer '12, LBNL)
Hadoop-MongoDB Connector
• Input splits are retrieved from the MongoDB server(s)
• Each mapper can read its splits in parallel
• Results are written back to MongoDB by the Hadoop reducer(s)
• Works with a single MongoDB server or with a sharded setup
• The user determines the split size
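A job configuration for the connector might look roughly like the fragment below; the property names follow mongo-hadoop's conventions but are assumptions that should be checked against the connector version in use:

```xml
<!-- Sketch of a Hadoop job configuration for the mongo-hadoop
     connector. Property names and the split-size unit (MB) are
     assumptions; verify against the connector's documentation. -->
<configuration>
  <property>
    <name>mongo.input.uri</name>
    <value>mongodb://mongo-host:27017/db.input_collection</value>
  </property>
  <property>
    <name>mongo.output.uri</name>
    <value>mongodb://mongo-host:27017/db.output_collection</value>
  </property>
  <property>
    <!-- user-chosen split size; one map task per split -->
    <name>mongo.input.split_size</name>
    <value>128</value>
  </property>
</configuration>
```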
MongoDB: Overhead of Multiple Connections
• Tests the ability to handle a large number of simultaneous connections
• 768 tasks with different checkpoint intervals, compared to no checkpointing
• Overhead: as connections increased from 154 to 768 per second, write volume increased to 768 MB/s
MongoDB: Overhead When Using More Nodes and Tasks
• 10 min per task; all tasks run in parallel; 10 s checkpoint interval
• Overhead observed after 1000 parallel tasks
• The large number of connections is the bottleneck, more so than the data volume
MongoDB MapReduce vs. Hadoop-MongoDB: Read/Write Performance Comparison
• Data is stored on a single MongoDB server
• Hadoop cluster consists of 2 worker nodes
• The mongo-hadoop plug-in provides roughly five times better performance.
Hadoop-MongoDB: Choosing the Split Size
• Processing 9.3 million input records with Hadoop
• Each mapper reads an input split from the MongoDB server, processes it, and sends its intermediate output to the reducer
• Split size varies: 16, 32, 64, 128, 256 MB
• Sweet spot: 128 MB
• With the default split size of 8 MB, Hadoop schedules over 500 mappers; increasing the split size drops this number to around 40
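Since Hadoop schedules roughly one map task per input split, the reported mapper counts can be sanity-checked with simple arithmetic (the ~4.5 GB total input size below is an assumed figure chosen for illustration, not one reported in the experiments):

```python
import math

# Assumed total input size in MB (~4.5 GB); hypothetical value.
total_mb = 4.5 * 1024

# One map task per split: mapper count = ceil(total size / split size).
for split_mb in (8, 16, 32, 64, 128):
    mappers = math.ceil(total_mb / split_mb)
    print(f"{split_mb:3d} MB splits -> {mappers} map tasks")
```

Under this assumption, 8 MB splits give 576 mappers and 128 MB splits give 36, consistent in scale with the "over 500" and "around 40" figures above.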
Hadoop-MongoDB: Increasing Data
• For 4.6 million input records, Hadoop-HDFS is two times faster than MongoDB; at 37.2 million records it is five times faster
• At 37.2 million input records, mongo-hadoop is more than three times slower in reading and more than nine times slower in writing than Hadoop-HDFS
• In a sharded setup, mongo-hadoop reading times improve considerably
• Setup: 2-node Hadoop cluster and 2 MongoDB servers
Hadoop-MongoDB: Sharding and Processing on Local Nodes vs. Different Nodes
• Performance slightly worsened compared to running the servers on different machines
• MongoDB uses mmap to aggressively cache data from disk into memory
• With increasing input size, growing memory and CPU usage is observed on the worker/server nodes
• This affects the performance of the MapReduce job
The performance bottleneck is due to memory contention; locality has minimal effect.
Hadoop-MongoDB: Increasing the Number of Workers
• Performance over increasing cluster sizes, from 16 to 64 cores
• Single vs. two sharded MongoDB servers
• The write time is bound by the reduce phase for this MapReduce job
  • Number of mappers >> number of reducers
• The write performance of MongoDB remains a bottleneck, along with the overhead of routing data to be written between sharding servers
Write performance of MongoDB is a bottleneck.
Hadoop-MongoDB: Different Setups (given that the data is in MongoDB)
• Best performance is achieved reading from MongoDB and writing the output to HDFS
• Downloading the data to HDFS before running the analysis is the slowest
Hadoop-HDFS provides the best performance.
Hadoop-MongoDB: Different Setups
• Increasing cluster size (from 8 cores to 64) for 37.2 million input records
• With an increasing number of worker nodes the concurrency of the map phase increases
• The map times get considerably faster
Hadoop-MongoDB: Fault Tolerance
• 32-node Hadoop cluster processing ~37 million input records
• After eight faulted worker nodes, Hadoop-HDFS loses too many data nodes and fails to complete the MapReduce job
• Mongo-hadoop gets its input splits from the MongoDB server, so losing worker nodes does not lead to loss of input data
Conclusions
• Sharding helps to improve MongoDB's performance, especially for reads.
  • In a sharded setup, mongo-hadoop reading times improve considerably, as there are multiple servers to respond to parallel worker requests.
• In cases where data is stored in MongoDB and needs to be analyzed, the mongo-hadoop connector is a convenient way to use Hadoop.
  • Performance improves when output is written to HDFS.
• MongoDB performance degradation is observed with an increasing number of connections, increasing write requests per second, and increasing total write volume.
• The mongo-hadoop plug-in provides roughly five times better performance compared to using MongoDB's native MapReduce implementation.
• The performance gain from using mongo-hadoop increases linearly with input size.
Contact
Madhu Govindaraju [email protected] Binghamton University State University of New York (SUNY)
Dan Gunter [email protected] Lawrence Berkeley National Laboratory