
Page 1: Productionalizing a Spark application

Productionalizing a Spark application

Productionalizing an application on a frequently evolving framework like Spark

Page 2: Productionalizing a Spark application

● Shashank L

● Big data consultant and trainer at datamantra.io

● www.shashankgowda.com

Page 3: Productionalizing a Spark application

Agenda

● Financial analytics
● Requirements
● Architecture
● Initial solution
● RDD to Dataframe API
● Code quality and testing
● Architectural changes
● Future improvements
● Lookback

Page 4: Productionalizing a Spark application

Financial Analytics

Financial analytics is used to predict the stock prices for a specific company using its historical price information.

Page 5: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → data preprocessing → data analytics → NoSQL → frontend (dashboard)

Page 6: Productionalizing a Spark application

Our team
● Data scientists
  ○ Coming up with the new magic
● Data engineers
  ○ Productionalizing the magic on large datasets
● Front end developers
  ○ Consume the results to make them presentable to clients

Page 7: Productionalizing a Spark application

Requirements
● Developers spread across geographies
● Variety of developers on the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Less infrastructure maintenance overhead
● Use of the latest available libraries

Page 8: Productionalizing a Spark application

Iteration 1

Initial solution

Page 9: Productionalizing a Spark application

Iteration 1
● Data scientists
  ○ Were well versed with Python or SQL
  ○ Did the analysis using Python pandas dataframe code
  ○ Analyses were tested on only a small set of data
● Data engineers
  ○ Used Spark 0.9
  ○ Ported the Python analysis to the Scala RDD API to scale it to big data (see the sketch below)
  ○ Built a custom framework with the ability to write into and read from multiple sources (file, Hive table, S3, JDBC)
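To make the porting concrete, here is a minimal sketch of the kind of Scala RDD code the pandas analysis was translated into. This is not the team's actual code: the input path, the record layout (symbol, date, close) and the average-close computation are hypothetical, chosen only to show the RDD-API style of that era.

import org.apache.spark.{SparkConf, SparkContext}

object AverageClose {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageClose"))

    // Hypothetical input: CSV lines of the form "symbol,date,close"
    val lines = sc.textFile("hdfs:///data/stocks/daily.csv")

    // Parse each line into a (symbol, close) pair
    val closes = lines.map { line =>
      val fields = line.split(",")
      (fields(0), fields(2).toDouble)
    }

    // Average closing price per symbol: sum and count, then divide
    val avgClose = closes
      .mapValues(close => (close, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    avgClose.saveAsTextFile("hdfs:///data/stocks/avg_close")
    sc.stop()
  }
}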

Page 10: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → data preprocessing → data analytics → NoSQL → frontend (dashboard). The analysis (Python) comes from the data scientists; the rest of the pipeline is owned by the data engineers.

Page 11: Productionalizing a Spark application

Challenges
● Framework challenges
  ○ Porting code from one language to another led to a lot of inaccuracies
  ○ Differences in the language constructs and APIs led to changes in code design
● Architectural challenges
  ○ Clusters used by the team were manually created and maintained
  ○ Intermediate data was saved in a text-based CSV format

Page 12: Productionalizing a Spark application

Iteration 2

RDD API to Dataframe API

Page 13: Productionalizing a Spark application

Iteration 2
● Upgrade to Spark 1.3
● Data scientists
  ○ The newly introduced Dataframe API was a more familiar interface for data scientists
  ○ The SQL API made it easier for data scientists to perform simple operations
  ○ Zeppelin let data scientists prototype the analytical algorithms
● Data engineers
  ○ Moved from the CSV-based intermediate format to Parquet (see the sketch below)
  ○ Moved to an Amazon EMR based Hadoop cluster with Spark on it
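A minimal sketch of what the same computation looks like after the move to the Dataframe API and Parquet. It is shown with the Spark 1.4+ reader/writer API; the paths and column names are hypothetical, carried over from the RDD sketch above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

object AverageCloseDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageCloseDF"))
    val sqlContext = new SQLContext(sc)

    // Intermediate data now lives in Parquet, which stores the schema
    // with the data, unlike the earlier text/CSV format
    val prices = sqlContext.read.parquet("hdfs:///data/stocks/daily.parquet")

    // The same average-close computation, expressed declaratively
    val avgClose = prices.groupBy("symbol").agg(avg("close").as("avgClose"))

    avgClose.write.parquet("hdfs:///data/stocks/avg_close.parquet")
    sc.stop()
  }
}

The declarative form is also close to what data scientists could prototype directly in SQL or in a Zeppelin notebook.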

Page 14: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data → ETL → HDFS. A data science cluster runs Zeppelin with data analytics in PySpark; a data engineering cluster runs data preprocessing and data analytics, writing to NoSQL for the dashboard.

Page 15: Productionalizing a Spark application

Challenges
● Quality challenges
  ○ Productionalizing multiple analyses required expanding the Data engineering team
  ○ Team expansion introduced code quality issues and bugs in the code
  ○ Unit tests for the individual functionalities were not present
  ○ A review process for changes in the code was not present

Page 16: Productionalizing a Spark application

Iteration 3

Code quality and testing

Page 17: Productionalizing a Spark application

Iteration 3
● Creation of unit test cases for all the analyses
● More readable test suite for the code using ScalaTest (http://www.scalatest.org/)
● Test cases for unit testing small functionalities, and flow testing to test the full ETL flow on sampled data
● Review process for code changes through GitHub PRs
● Daily build in Jenkins to test the flow and functionalities every day

Page 18: Productionalizing a Spark application

ScalaTest

import org.scalatest._
import scala.collection.mutable.Stack

class ExampleSpec extends FlatSpec with Matchers {

  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = new Stack[Int]
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  it should "throw NoSuchElementException if an empty stack is popped" in {
    val emptyStack = new Stack[Int]
    a [NoSuchElementException] should be thrownBy {
      emptyStack.pop()
    }
  }
}

Page 19: Productionalizing a Spark application

GitHub PR

Page 20: Productionalizing a Spark application

Challenges
● Architectural challenges
  ○ Cluster resources were a bottleneck for the teams
  ○ Amazon EMR clusters were not throwaway clusters, as the data was stored in HDFS
  ○ Upgrading the Spark version on the cluster was difficult
  ○ Infrastructure to run scheduled jobs was missing, as Jenkins was not the best way to schedule jobs
  ○ Stability issues with Zeppelin

Page 21: Productionalizing a Spark application

Iteration 4

Architectural changes

Page 22: Productionalizing a Spark application

Iteration 4
● Moved the data storage from HDFS to S3 (see the sketch below)
● Moved to the Databricks cloud environment (https://databricks.com/product/databricks)
● Databricks cloud provides a notebook based interface for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Travis for deployment and testing
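With S3 as the primary storage, no state lives on the cluster, which is what makes clusters disposable. A minimal sketch of the change, assuming an s3a:// connector is configured and a hypothetical bucket name, shown with the Spark 2.x SparkSession entry point for brevity:

import org.apache.spark.sql.SparkSession

object S3Pipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3Pipeline").getOrCreate()

    // Primary storage is S3, so any cluster -- new, resized, or on a
    // different Spark version -- can read the same data
    val prices = spark.read.parquet("s3a://stocks-bucket/daily.parquet")

    val avgClose = prices.groupBy("symbol").avg("close")

    // Results land back in S3; the cluster can be terminated afterwards
    avgClose.write.mode("overwrite").parquet("s3a://stocks-bucket/avg_close.parquet")

    spark.stop()
  }
}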

Page 23: Productionalizing a Spark application

Databricks cloud
● Cluster config
  ○ Launch, configure, scale and terminate

Page 24: Productionalizing a Spark application

Databricks cloud
● Jobs
  ○ Schedule complex workflows

Page 25: Productionalizing a Spark application

Databricks cloud
● Notebooks
  ○ Explore, visualize and share

Page 26: Productionalizing a Spark application

Improvements
● Data engineers
  ○ The cluster bottleneck was solved by creating multiple throwaway clusters when needed
  ○ No need to stick to one cluster for a long time, as the primary data storage was S3
  ○ Terminating clusters when not in use is cost efficient
  ○ Multiple clusters with different versions of Spark let users try out the latest Spark features
  ○ Reduced cluster maintenance and tuning overhead

Page 27: Productionalizing a Spark application

Improvements
● Data engineers
  ○ Shorter turnaround time for understanding bottlenecks in the workflows
  ○ Databricks cloud Jobs can be used for scheduling workflows and daily runs
  ○ Travis enabled strict and immediate code testing
● Data scientists
  ○ Data scientists can easily share notebooks and analysis results with the team
  ○ Ability to write in multiple languages

Page 28: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data → ETL → S3. Inside Databricks cloud, Jobs drive a data science cluster (R/Python notebooks) and two data engineering clusters (Scala notebooks); results feed NoSQL and the dashboard.

Page 29: Productionalizing a Spark application

Challenges
● Framework challenges
  ○ The schema is static and doesn't change frequently
  ○ Dataframes don't have a static schema check
  ○ The pipeline fails in the middle of processing if there is any change in the data
  ○ The current window analysis uses Scala constructs to load a specific set of data into memory and run ML on top of it
  ○ Domain object based functions are currently called from inside UDFs (sketched below)
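To make the last point concrete, here is a hypothetical sketch of the pattern being described: business logic defined on a domain object, wrapped in a Spark SQL UDF so it can run over untyped Dataframe rows. All names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical domain object carrying business logic
case class StockQuote(symbol: String, close: Double) {
  def isPenny: Boolean = close < 5.0
}

object DomainUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DomainUdfExample").getOrCreate()

    // The domain object has to be rebuilt from untyped columns inside
    // a UDF -- the pattern the slide calls out
    val isPenny = udf((symbol: String, close: Double) =>
      StockQuote(symbol, close).isPenny)

    val prices = spark.read.parquet("s3a://stocks-bucket/daily.parquet")
    prices.withColumn("isPenny", isPenny(col("symbol"), col("close"))).show()
  }
}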

Page 30: Productionalizing a Spark application

Iteration 5

Road ahead

Page 31: Productionalizing a Spark application

Iteration 5 (future iteration)
● Data engineers
  ○ Port the analyses from the Dataframe API to the Dataset API (in Spark 2.0)
  ○ With the Dataset API, we get a static schema check
  ○ Reuse the existing domain object based functions
● Data scientists
  ○ Move from Scala window based analysis to SparkSQL window analytics (see the sketch below)
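A hedged sketch of where this is heading, assuming Spark 2.0: a Dataset typed with a domain case class, so schema mismatches surface at compile time and domain methods can be called directly, plus a SparkSQL window function replacing the hand-rolled Scala windowing. The case class, paths and the 5-day moving average are illustrative, not the team's actual code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical domain class describing one daily quote
case class DailyQuote(symbol: String, date: String, close: Double)

object DatasetWindowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetWindowExample").getOrCreate()
    import spark.implicits._

    // Dataset API: typed against the domain class, so a schema mismatch
    // is a compile-time error instead of a mid-pipeline runtime failure
    val quotes = spark.read.parquet("s3a://stocks-bucket/daily.parquet").as[DailyQuote]

    // Existing domain object based functions can be used directly
    val liquid = quotes.filter(q => q.close > 5.0)

    // SparkSQL window analytics: a 5-day moving average per symbol,
    // replacing the Scala-side windowing constructs
    val window = Window.partitionBy("symbol").orderBy("date").rowsBetween(-4, 0)
    liquid.withColumn("movingAvg", avg(col("close")).over(window)).show()
  }
}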

Page 32: Productionalizing a Spark application

Lookback
● Spark version
  ○ 0.9 -> 1.6.0
● API
  ○ RDD -> Dataframe -> Dataset
● Deployment
  ○ EC2 -> EMR -> DB cloud
● Scheduling
  ○ Jenkins -> DB cloud Jobs
● Language
  ○ Scala

Page 33: Productionalizing a Spark application

Lookback
● Data format
  ○ Text -> Parquet
● Storage
  ○ HDFS -> S3
● Deployment
  ○ Jenkins -> Travis

Page 35: Productionalizing a Spark application

Thank you