MapReduce on ZeroVM


Lightweight Virtualization for Big Data Processing

Joy Rahman
Research Assistant
Cloud and Big Data Lab, UTSA

MapReduce and Big Data

● Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.

● MapReduce is a distributed processing framework that supports Big Data Processing.

● A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.

● MapReduce libraries have been written in many programming languages. A popular open-source implementation is Apache Hadoop (http://hadoop.apache.org/).


Let's start with an example

Challenge: Count all the words in a file

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.

Word     Count
-------- --------
Lorem    5
....     1
....     1
....     1
dummy    1
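The single-program approach can be sketched in a few lines of Python. This is only an illustration: the function name count_words and the sample string are assumptions made for the example, and the rule for what counts as a word (runs of letters and apostrophes) is one simple choice among many.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each word in text to its number of occurrences."""
    # Treat runs of letters and apostrophes as words; punctuation is dropped.
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)


# Hypothetical sample input, shortened from the slide's text.
sample = "Lorem Ipsum is simply dummy text. Lorem Ipsum has been the standard dummy text."
counts = count_words(sample)

# Print the three most frequent words with their counts.
for word, count in counts.most_common(3):
    print(word, count)
```

The whole input is read and counted by one process, which is exactly what stops working once the file grows too large.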

Is there any problem with this approach? Yes: the file may be too big for a single program to process.

Let's see an example (cont.)

A better approach: Divide and Conquer

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the

release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at

Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.

Program 1: Lorem, 2; simply, 1; has, 1

Program 2: Lorem, 1; was, 2; has, 5

Program 3: Lorem, 3; from, 2; has, 1

Each result is a (key, value) pair: the word is the key and its count is the value.

Do you see any problem with this approach?

We need to combine the results

- We have divided the big input file into multiple pieces so that parallel processes can work on the file simultaneously, lowering the total processing time.

- But the results from the individual processes need to be combined.

Program 1: Lorem, 2; simply, 1; has, 1

Program 2: Lorem, 1; was, 2; has, 5

Program 3: Lorem, 3; from, 2; has, 1

Combined: Lorem, 6; simply, 1; has, 7; from, 2; ........
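The combining step can be sketched as follows. This is a minimal sketch, not the deck's implementation: the helper names count_chunk and combine are assumptions, and the partial counts are the ones shown for the three programs above.

```python
from collections import Counter


def count_chunk(chunk):
    """Count the words in one piece of the input (the work of one program)."""
    return Counter(chunk.split())


def combine(partial_counts):
    """Merge per-chunk counts by summing the values that share a key."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total


# Partial results copied from the three programs on the slide.
partials = [
    Counter({"Lorem": 2, "simply": 1, "has": 1}),
    Counter({"Lorem": 1, "was": 2, "has": 5}),
    Counter({"Lorem": 3, "from": 2, "has": 1}),
]
total = combine(partials)
print(total["Lorem"], total["has"], total["simply"], total["from"])  # 6 7 1 2

# Counting chunk by chunk gives the same answer as counting the whole text.
chunks = ["Lorem Ipsum is simply", "dummy text Lorem Ipsum"]
assert combine(count_chunk(c) for c in chunks) == count_chunk(" ".join(chunks))
```

Here the chunks are counted one after another; in a real deployment each call to count_chunk would run in its own process or on its own machine, and only the small partial counts would be sent to the combining step.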

MapReduce

● The example we have just seen is a typical MapReduce program for big data processing.

● The first phase (splitting up and processing the input) is called Map.

● The final phase (combining the results) is called Reduce.

Formal Definitions

❏ The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs.

❏ Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1,v1) → list(k2,v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.

❏ The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list (v2)) → list(v3)

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
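The signatures above can be illustrated with a small single-process sketch. The driver function map_reduce and the two word-count functions are hypothetical names introduced for this example; a real framework would run the map and reduce calls in parallel on many machines, whereas this sketch runs them sequentially to show only the data flow.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over (k1, v1) pairs, group by k2, then apply reduce_fn per group."""
    # Map phase: Map(k1, v1) -> list(k2, v2)
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))

    # Grouping phase: collect all values that share the same key k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: Reduce(k2, list(v2)) -> list(v3)
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def word_count_map(chunk_id, text):
    """Emit (word, 1) for every word in one chunk of text."""
    return [(word, 1) for word in text.split()]


def word_count_reduce(word, counts):
    """Sum the counts for one word; returns a one-element list(v3)."""
    return [sum(counts)]


# Hypothetical input: (chunk id, chunk text) pairs.
inputs = [(0, "Lorem Ipsum has"), (1, "Lorem was was")]
result = map_reduce(inputs, word_count_map, word_count_reduce)
print(result)
```

For word count, k1 is a chunk identifier, v1 is the chunk text, k2 is a word, v2 is the number 1, and v3 is the total count for that word.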


[Figure: Split [k1, v1] → sort by k1 → Merge [k1, [v1, v2, v3, ...]]]

Existing Limitations of Big Data Processing on the Cloud

● Current cloud implementations have two distinct clusters:
  ○ 1) Computation cluster (e.g., Amazon EC2)
  ○ 2) Storage cluster (e.g., Amazon S3)

● The computation cluster is used for CPU-intensive processing, whereas the storage cluster is used to store the persistent data.

● Running MapReduce on the cloud is costly because of the considerable overhead of fetching the data from the storage cluster to the computation cluster and putting the results back after processing.

Example: Amazon EMR

Image source & Ref: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html

Costly Data Transfer

Challenges

● How to avoid the data transfer overhead for big data processing?
  ○ Answer: Take the computation to the storage cluster.

[Figure: apps running inside the storage cluster]

But traditional OS-level virtualization is:

● bulky and CPU-intensive to run inside a cluster that is optimized for storage I/O only

● slow to spin up

● expensive to scale horizontally

ZeroVM to the rescue

● ZeroVM is an open-source lightweight virtualization platform based on the Chromium Native Client (NaCl) project. NaCl provides the essential isolation through the software fault isolation technique.

● ZeroVM makes it possible to safely execute arbitrary code (C/C++, Python) from untrusted users in multi-tenant environments.

● The ZeroVM core is only 75 KB in size and can spin up in 5 ms.

● Thus it is an ideal candidate to run on top of storage clusters like OpenStack SWIFT.

● ZeroVM takes computation to the storage, enabling cost-effective MapReduce on the cloud.

ZeroVM Properties

1. ZeroVM is small, light, fast, secure, and hyper-scalable.

2. ZeroVM virtualizes the application, not the operating system.

3. Single-threaded (thus deterministic) execution: the same executable produces the same results each time it is run.

4. Predefined resource constraints before execution:

● Channel-based I/O

● Predefined socket port / network

● Restricted memory access

● Limited read/write (in bytes)

● Short-lived sessions / predefined session_timeout

credit : Ryan McKinney, Senior Software Engineer, Rackspace

ZeroCloud

● ZeroCloud is the cloud module that runs on top of SWIFT and makes it possible to run ZeroVM sessions on different servers of the cluster.

● ZeroCloud makes it easy to create large clusters of instances, aggregating the compute power of many individual physical servers into a single execution environment.

● Users can leverage the power of hundreds of physical servers for a few seconds or even milliseconds at a time.

● Horizontal scalability is a key design goal for ZeroVM.

ZeroCloud (on SWIFT)

[Figure: ZeroCloud on an OpenStack SWIFT cluster. The user supplies the job description with the executables (apps) in a GET/POST request to the SWIFT proxy with ZeroCloud. On an exec request, the Object Servers spawn zerovm sessions that run the apps, and the results are returned in the response.]

MapReduce on ZeroVM

● ZeroVM running on ZeroCloud is inherently targeted at big data processing, particularly in the MapReduce style.

● Users can define multi-stage jobs, and any stage can connect to any other stage.

● Users need to provide only the executables.

● Since the data is already inside the SWIFT cluster, an execution job request through GET/POST is enough to start the big data processing instantly and obtain the result.

● This ensures data locality and eliminates the costly data transfer.

Demonstration???

Would you like to give ZeroVM a try? http://zebra.zerovm.org/




Our Research on ZeroVM

● There is a lot of ongoing research on ZeroVM.

● The UTSA Cloud and Big Data Lab has several ongoing research projects.

● Currently I am working under the supervision of Dr. Lama to improve MapReduce on ZeroVM.

● Our project involves developing a scheduler for ZeroCloud that ensures data locality and is interference-aware, heterogeneity-aware, and skew-aware.

Our Research on ZeroVM (contd.)

● Data locality is of great importance for big data processing.

● The current implementation ensures data locality for the Map phase, since the executables are run where the input data is stored.

● We would like to optimize and ensure data locality for the Reduce phase as well.

● We would like to design a scheduler that intelligently mitigates the data/computational skew problem (which is inherent in every MapReduce environment) and which is currently handled manually by the end user.

Thanks

Credits:

[1] Prosunjit Biswas, UTSA

[2] Carina C. Zona, Rackspace

[3] Ryan McKinney, Rackspace

References:

[1] ZeroVM: http://www.zerovm.org

[2] Apache Hadoop: http://hadoop.apache.org

[3] Amazon EMR: http://aws.amazon.com/elasticmapreduce

[4] MapReduce: http://en.wikipedia.org/wiki/MapReduce

[5] Native Client: A Sandbox for Portable, Untrusted x86 Native Code: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34913.pdf

More about ZeroVM

Website: www.zerovm.org

Github: https://github.com/zerovm/

User Mailing List: [email protected]

IRC: #zerovm on Freenode

Get this ppt from: http://goo.gl/6fJpbn




