Building a distributed processing system from scratch - Part 2


Distributed Systems from Scratch - Part 2
Handling third party libraries

https://github.com/phatak-dev/distributedsystems

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Agenda
● Idea
● Motivation
● Architecture of existing big data system
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example

Idea

“What does it take to build a distributed processing system like Spark?”

Motivation
● The first version of Spark had only 1600 lines of Scala code
● It had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● Recreating the same code with step by step understanding
● Ample time in hand

Distributed systems from 30,000 ft (stack, bottom to top)
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications

Our distributed system
● Mesos
● Scala function based abstraction
● Scala functions to express logic

Function abstraction
● The whole Spark API can be summarized as a Scala function, which can be represented as follows: () => T
● This Scala function can be parallelized and sent over the network to run on multiple systems using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala (a minimal sketch follows below)
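
A minimal sketch of such a function task, assuming a wrapper along the lines of FunctionTask.scala (the class and method names here are assumptions, not the original source):

// A task is just a serializable zero-argument Scala function; the executor
// deserializes the task and calls run() to produce the result remotely.
class FunctionTask[T](body: () => T) extends Serializable {
  def run(): T = body()
}

// Usage: wrap any () => T and hand it to the framework for execution.
val sumTask = new FunctionTask(() => (1 to 100).sum)
println(sumTask.run()) // 5050, whether run locally or on an executor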

Spark API as distributed function
● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map and flatMap is represented as a function task which takes one parameter and returns one value (see the sketch below)
● The distribution of the functions was initially done by Mesos, and later ported to other cluster managers
● This shows how Spark started with functional programming
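
To make the mapping from API calls to function tasks concrete, here is a hedged sketch that builds on the FunctionTask sketch above (the helper name and the per-partition representation are assumptions):

// Each partition of the data becomes one zero-argument task that applies the
// user's one-in/one-out function to every element of that partition.
def mapAsTasks[A, B](partitions: Seq[Seq[A]], f: A => B): Seq[FunctionTask[Seq[B]]] =
  partitions.map(part => new FunctionTask(() => part.map(f)))

// Usage: two partitions of numbers, each doubled by its own task.
val tasks = mapAsTasks(Seq(Seq(1, 2, 3), Seq(4, 5, 6)), (x: Int) => x * 2)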

Till now
● Discussion about Mesos and its abstraction
● Hello world code on Mesos
● Defining the Function interface
● Implementing
○ Scheduler to run Scala code
○ Custom executor for Scala
○ Serializing and deserializing a Scala function

● https://www.youtube.com/watch?v=Oy9ToN4O63c

What can a local function do?
● Access local data. Even in Spark, the function normally accesses HDFS data local to the node
● Access the classes provided by the framework
● Run any logic which can be serialized

What can it not do?
● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)

Need for third party libraries
● The ability to add third party libraries is important in a distributed system framework
● Third party libraries allow us to
○ Connect to third party sources
○ Use libraries to implement custom logic, like matrix manipulation, inside the function abstraction
○ Extend the base framework using a set of libraries, e.g. spark-sql
○ Optimize for specific hardware

Approaches to third party libraries
● There are two different approaches to distributing third party jars
● UberJar - build all the dependencies together with your application code into a single jar
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● UberJar suffers from issues of jar size and versioning
● So we are going to follow the second approach, which is similar to the one followed in Spark

Design for distributing jars

Diagram: the scheduler/driver hosts the scheduler code and a jar-serving HTTP server; Executor 1 and Executor 2 download the jars over HTTP.

Distributing jars
● Third party jars are distributed over the HTTP protocol across the cluster
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes on the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath

Code for implementing
● We need multiple changes to our existing code base to support third party jars
● The following are the different steps
○ Implementation of an embedded HTTP server
○ Change to the scheduler to start the HTTP server
○ Change to the executor to download the jars and add them to the classpath
○ A function which uses a third party library

Http Server
● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and Java servlet container from the Eclipse Foundation
● One of the strengths of Jetty is that it can be embedded inside another program to provide HTTP interfaces to certain functionality
● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty
● https://eclipse.org/jetty/
● HttpServer.scala (a minimal sketch follows below)
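
A minimal sketch of the embedded jar server, assuming Jetty 9's ResourceHandler API (the class name JarHttpServer and the port/URI handling are assumptions, not the original HttpServer.scala):

import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.server.handler.{DefaultHandler, HandlerList, ResourceHandler}

// Serves every file (the user supplied jars) placed in jarDir over HTTP.
class JarHttpServer(jarDir: String) {
  private val server = new Server(0) // port 0: let Jetty pick any free port

  def start(): String = {
    val resources = new ResourceHandler()
    resources.setResourceBase(jarDir) // base directory containing the jars
    val handlers = new HandlerList()
    handlers.setHandlers(Array(resources, new DefaultHandler()))
    server.setHandler(handlers)
    server.start()
    val port = server.getConnectors()(0).asInstanceOf[ServerConnector].getLocalPort
    s"http://${java.net.InetAddress.getLocalHost.getHostAddress}:$port"
  }

  def stop(): Unit = server.stop()
}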

Scheduler change
● Once we have the HTTP server, we need to start it when we start our scheduler
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server
● Once we have the server running, we pass on the server URI to all the executors
● TaskScheduler.scala (a hedged fragment follows below)
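
A hedged fragment showing how the registered callback could start the jar server, reusing the JarHttpServer sketch above (userJars, the field name and the temp directory handling are assumptions; the other Mesos Scheduler callbacks stay unchanged and are omitted):

import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.{FrameworkID, MasterInfo}

// Inside our TaskScheduler, an org.apache.mesos.Scheduler implementation
// constructed with userJars: Seq[String] (paths to the user's local jars):
@volatile var jarServerUri: String = _

override def registered(driver: SchedulerDriver,
                        frameworkId: FrameworkID,
                        masterInfo: MasterInfo): Unit = {
  // copy the user supplied jars into the directory the jar server will serve
  val jarDir = Files.createTempDirectory("jars")
  userJars.foreach { jar =>
    val src = Paths.get(jar)
    Files.copy(src, jarDir.resolve(src.getFileName), StandardCopyOption.REPLACE_EXISTING)
  }
  // start the embedded server and remember its URI to pass on to executors
  jarServerUri = new JarHttpServer(jarDir.toString).start()
}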

Executor side
● In the executor, we download the jars using calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so it runs only once
● TaskExecutor.scala (a minimal sketch follows below)
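
A minimal sketch of the executor-side download and classloader setup (the function name, the jar name list and the temp directory choice are assumptions, not the original TaskExecutor.scala):

import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, Paths, StandardCopyOption}

// Download each jar from the scheduler's jar server, then build a
// URLClassLoader so deserialized functions can see the third party classes.
def classLoaderWithJars(serverUri: String, jarNames: Seq[String]): URLClassLoader = {
  val localUrls = jarNames.map { name =>
    val target = Paths.get(System.getProperty("java.io.tmpdir"), name)
    val in = new URL(s"$serverUri/$name").openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING) finally in.close()
    target.toUri.toURL
  }
  // parent it to the executor's own classloader so framework classes stay visible
  new URLClassLoader(localUrls.toArray, getClass.getClassLoader)
}

The executor can then install this as the thread's context classloader before deserializing and running a function task.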

MySQL function
● This example is a function which accesses the MySQL driver class to run JDBC queries against a MySQL instance
● We ship the MySQL jar using our jar distribution framework, so it is not part of our application jar
● There is no change in our function API, as it is a normal function like the other examples
● MySQLTask.scala (a hedged sketch follows below)
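
A hedged sketch of such a function, assuming the MySQL Connector/J jar is on the downloaded classpath (the connection URL, credentials and table name are placeholders):

import java.sql.DriverManager

// A normal () => T function; only the mysql-connector jar shipped through the
// jar server makes the driver class resolvable on the executor.
val mysqlTask: () => Int = () => {
  Class.forName("com.mysql.jdbc.Driver") // loaded via the URLClassLoader
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/test", "user", "password")
  try {
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
    rs.next()
    rs.getInt(1)
  } finally conn.close()
}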
