Building a distributed processing system from scratch - Part 2


Distributed Systems from Scratch - Part 2
Handling third party libraries

https://github.com/phatak-dev/distributedsystems

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Agenda
● Idea
● Motivation
● Architecture of existing big data system
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example

Idea

“What does it take to build a distributed processing system like Spark?”

Motivation
● The first version of Spark had only 1600 lines of Scala code
● It had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● Recreating the same code with step by step understanding
● Ample time in hand

Distributed systems from 30,000 ft (stack, bottom to top)
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications

Our distributed system
● Mesos
● Scala function based abstraction
● Scala functions to express logic

Function abstraction
● The whole Spark API can be summarized as a Scala function, which can be represented as follows: () => T
● This Scala function can be parallelized and sent over the network to run on multiple systems using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala (a minimal sketch follows below)
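
A minimal sketch of such a function task, assuming a wrapper along the lines of FunctionTask.scala (the class and method names here are assumptions, not the original source):

// A task is just a serializable zero-argument Scala function; the executor
// deserializes the task and calls run() to produce the result remotely.
class FunctionTask[T](body: () => T) extends Serializable {
  def run(): T = body()
}

// Usage: wrap any () => T and hand it to the framework for execution.
val sumTask = new FunctionTask(() => (1 to 100).sum)
println(sumTask.run()) // 5050, whether run locally or on an executor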

Spark API as distributed function
● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map and flatMap is represented as a function task which takes one parameter and returns one value (see the sketch below)
● The distribution of the functions was initially done by Mesos, and later ported to other cluster managers
● This shows how Spark started with functional programming
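
To make the mapping from API calls to function tasks concrete, here is a hedged sketch that builds on the FunctionTask sketch above (the helper name and the per-partition representation are assumptions):

// Each partition of the data becomes one zero-argument task that applies the
// user's one-in/one-out function to every element of that partition.
def mapAsTasks[A, B](partitions: Seq[Seq[A]], f: A => B): Seq[FunctionTask[Seq[B]]] =
  partitions.map(part => new FunctionTask(() => part.map(f)))

// Usage: two partitions of numbers, each doubled by its own task.
val tasks = mapAsTasks(Seq(Seq(1, 2, 3), Seq(4, 5, 6)), (x: Int) => x * 2)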

Till now
● Discussion about Mesos and its abstraction
● Hello world code on Mesos
● Defining the Function interface
● Implementing
○ Scheduler to run Scala code
○ Custom executor for Scala
○ Serializing and deserializing a Scala function

● https://www.youtube.com/watch?v=Oy9ToN4O63c

What can a local function do?
● Access local data. Even in Spark, the function normally accesses HDFS data local to the node
● Access the classes provided by the framework
● Run any logic which can be serialized

What can it not do?
● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)

Need for third party libraries
● The ability to add third party libraries is important in a distributed system framework
● Third party libraries allow us to
○ Connect to third party sources
○ Use libraries to implement custom logic, like matrix manipulation, inside the function abstraction
○ Extend the base framework using a set of libraries, e.g. spark-sql
○ Optimize for specific hardware

Approaches to third party libraries
● There are two different approaches to distributing third party jars
● UberJar - build all the dependencies together with your application code into a single jar
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● UberJar suffers from issues of jar size and versioning
● So we are going to follow the second approach, which is similar to the one followed in Spark

Design for distributing jars

Diagram: the scheduler/driver hosts the scheduler code and a jar-serving HTTP server; Executor 1 and Executor 2 download the jars over HTTP.

Distributing jars
● Third party jars are distributed over the HTTP protocol across the cluster
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes on the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath

Code for implementing
● We need multiple changes to our existing code base to support third party jars
● The following are the different steps
○ Implementation of an embedded HTTP server
○ Change to the scheduler to start the HTTP server
○ Change to the executor to download the jars and add them to the classpath
○ A function which uses a third party library

Http Server
● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and Java servlet container from the Eclipse Foundation
● One of the strengths of Jetty is that it can be embedded inside another program to provide HTTP interfaces to certain functionality
● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty
● https://eclipse.org/jetty/
● HttpServer.scala (a minimal sketch follows below)
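
A minimal sketch of the embedded jar server, assuming Jetty 9's ResourceHandler API (the class name JarHttpServer and the port/URI handling are assumptions, not the original HttpServer.scala):

import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.server.handler.{DefaultHandler, HandlerList, ResourceHandler}

// Serves every file (the user supplied jars) placed in jarDir over HTTP.
class JarHttpServer(jarDir: String) {
  private val server = new Server(0) // port 0: let Jetty pick any free port

  def start(): String = {
    val resources = new ResourceHandler()
    resources.setResourceBase(jarDir) // base directory containing the jars
    val handlers = new HandlerList()
    handlers.setHandlers(Array(resources, new DefaultHandler()))
    server.setHandler(handlers)
    server.start()
    val port = server.getConnectors()(0).asInstanceOf[ServerConnector].getLocalPort
    s"http://${java.net.InetAddress.getLocalHost.getHostAddress}:$port"
  }

  def stop(): Unit = server.stop()
}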

Scheduler change
● Once we have the HTTP server, we need to start it when we start our scheduler
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server
● Once we have the server running, we pass on the server URI to all the executors
● TaskScheduler.scala (a hedged fragment follows below)
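
A hedged fragment showing how the registered callback could start the jar server, reusing the JarHttpServer sketch above (userJars, the field name and the temp directory handling are assumptions; the other Mesos Scheduler callbacks stay unchanged and are omitted):

import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.{FrameworkID, MasterInfo}

// Inside our TaskScheduler, an org.apache.mesos.Scheduler implementation
// constructed with userJars: Seq[String] (paths to the user's local jars):
@volatile var jarServerUri: String = _

override def registered(driver: SchedulerDriver,
                        frameworkId: FrameworkID,
                        masterInfo: MasterInfo): Unit = {
  // copy the user supplied jars into the directory the jar server will serve
  val jarDir = Files.createTempDirectory("jars")
  userJars.foreach { jar =>
    val src = Paths.get(jar)
    Files.copy(src, jarDir.resolve(src.getFileName), StandardCopyOption.REPLACE_EXISTING)
  }
  // start the embedded server and remember its URI to pass on to executors
  jarServerUri = new JarHttpServer(jarDir.toString).start()
}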

Executor side
● In the executor, we download the jars using calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so it runs only once
● TaskExecutor.scala (a minimal sketch follows below)
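
A minimal sketch of the executor-side download and classloader setup (the function name, the jar name list and the temp directory choice are assumptions, not the original TaskExecutor.scala):

import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, Paths, StandardCopyOption}

// Download each jar from the scheduler's jar server, then build a
// URLClassLoader so deserialized functions can see the third party classes.
def classLoaderWithJars(serverUri: String, jarNames: Seq[String]): URLClassLoader = {
  val localUrls = jarNames.map { name =>
    val target = Paths.get(System.getProperty("java.io.tmpdir"), name)
    val in = new URL(s"$serverUri/$name").openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING) finally in.close()
    target.toUri.toURL
  }
  // parent it to the executor's own classloader so framework classes stay visible
  new URLClassLoader(localUrls.toArray, getClass.getClassLoader)
}

The executor can then install this as the thread's context classloader before deserializing and running a function task.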

MySQL function
● This example is a function which accesses the MySQL driver class to run JDBC queries against a MySQL instance
● We ship the MySQL jar using our jar distribution framework, so it is not part of our application jar
● There is no change in our function API, as it is a normal function like the other examples
● MySQLTask.scala (a hedged sketch follows below)
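
A hedged sketch of such a function, assuming the MySQL Connector/J jar is on the downloaded classpath (the connection URL, credentials and table name are placeholders):

import java.sql.DriverManager

// A normal () => T function; only the mysql-connector jar shipped through the
// jar server makes the driver class resolvable on the executor.
val mysqlTask: () => Int = () => {
  Class.forName("com.mysql.jdbc.Driver") // loaded via the URLClassLoader
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/test", "user", "password")
  try {
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
    rs.next()
    rs.getInt(1)
  } finally conn.close()
}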
