
Page 1: Productionalizing a Spark application

Productionalizing a Spark application

Productionalizing an application on a frequently evolving framework like Spark

Page 2: Productionalizing a Spark application

● Shashank L

● Big data consultant and trainer at datamantra.io

● www.shashankgowda.com

Page 3: Productionalizing a Spark application

Agenda

● Financial analytics
● Requirements
● Architecture
● Initial solution
● RDD to Dataframe API
● Code quality and testing
● Architectural changes
● Future improvements
● Lookback

Page 4: Productionalizing a Spark application

Financial Analytics

Financial analytics is used to predict the stock prices for a specific company using its historical price information.

Page 5: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → data preprocessing → data analytics → NoSQL → frontend (dashboard)

Page 6: Productionalizing a Spark application

Our team
● Data scientists
  ○ Coming up with the new magic
● Data engineers
  ○ Productionalizing the magic on large datasets
● Front end developers
  ○ Consume the results to make them presentable to clients

Page 7: Productionalizing a Spark application

Requirements
● Developers spread across geographies
● Variety of developers on the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Less infrastructure maintenance overhead
● Use of the latest available libraries

Page 8: Productionalizing a Spark application

Iteration 1

Initial solution

Page 9: Productionalizing a Spark application

Iteration 1
● Data scientists
  ○ Were well versed with Python or SQL
  ○ Did the analysis using Python pandas dataframe code
  ○ Analyses were tested on only a small set of data
● Data engineers
  ○ Used Spark 0.9
  ○ Ported the Python analysis to the Scala RDD API to scale it to big data (see the sketch below)
  ○ Built a custom framework with the ability to write into and read from multiple sources (file, Hive table, S3, JDBC)
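To make the porting concrete, here is a minimal sketch of the kind of Scala RDD code the pandas analysis was translated into. This is not the team's actual code: the input path, the record layout (symbol, date, close) and the average-close computation are hypothetical, chosen only to show the RDD-API style of that era.

import org.apache.spark.{SparkConf, SparkContext}

object AverageClose {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageClose"))

    // Hypothetical input: CSV lines of the form "symbol,date,close"
    val lines = sc.textFile("hdfs:///data/stocks/daily.csv")

    // Parse each line into a (symbol, close) pair
    val closes = lines.map { line =>
      val fields = line.split(",")
      (fields(0), fields(2).toDouble)
    }

    // Average closing price per symbol: sum and count, then divide
    val avgClose = closes
      .mapValues(close => (close, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    avgClose.saveAsTextFile("hdfs:///data/stocks/avg_close")
    sc.stop()
  }
}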

Page 10: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → data preprocessing → data analytics → NoSQL → frontend (dashboard). The analysis (Python) comes from the data scientists; the rest of the pipeline is owned by the data engineers.

Page 11: Productionalizing a Spark application

Challenges
● Framework challenges
  ○ Porting code from one language to another led to a lot of inaccuracies
  ○ Differences in the language constructs and APIs led to changes in code design
● Architectural challenges
  ○ Clusters used by the team were manually created and maintained
  ○ Intermediate data was saved in a text-based CSV format

Page 12: Productionalizing a Spark application

Iteration 2

RDD API to Dataframe API

Page 13: Productionalizing a Spark application

Iteration 2
● Upgrade to Spark 1.3
● Data scientists
  ○ The newly introduced Dataframe API was a more familiar interface for data scientists
  ○ The SQL API made it easier for data scientists to perform simple operations
  ○ Zeppelin let data scientists prototype the analytical algorithms
● Data engineers
  ○ Moved from the CSV-based intermediate format to Parquet (see the sketch below)
  ○ Moved to an Amazon EMR based Hadoop cluster with Spark on it
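A minimal sketch of what the same computation looks like after the move to the Dataframe API and Parquet. It is shown with the Spark 1.4+ reader/writer API; the paths and column names are hypothetical, carried over from the RDD sketch above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

object AverageCloseDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageCloseDF"))
    val sqlContext = new SQLContext(sc)

    // Intermediate data now lives in Parquet, which stores the schema
    // with the data, unlike the earlier text/CSV format
    val prices = sqlContext.read.parquet("hdfs:///data/stocks/daily.parquet")

    // The same average-close computation, expressed declaratively
    val avgClose = prices.groupBy("symbol").agg(avg("close").as("avgClose"))

    avgClose.write.parquet("hdfs:///data/stocks/avg_close.parquet")
    sc.stop()
  }
}

The declarative form is also close to what data scientists could prototype directly in SQL or in a Zeppelin notebook.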

Page 14: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data → ETL → HDFS. A data science cluster runs Zeppelin with data analytics in PySpark; a data engineering cluster runs data preprocessing and data analytics, writing to NoSQL for the dashboard.

Page 15: Productionalizing a Spark application

Challenges
● Quality challenges
  ○ Productionalizing multiple analyses required expanding the Data engineering team
  ○ Team expansion introduced code quality issues and bugs in the code
  ○ Unit tests for the individual functionalities were not present
  ○ A review process for changes in the code was not present

Page 16: Productionalizing a Spark application

Iteration 3

Code quality and testing

Page 17: Productionalizing a Spark application

Iteration 3
● Creation of unit test cases for all the analyses
● More readable test suite for the code using ScalaTest (http://www.scalatest.org/)
● Test cases for unit testing small functionalities, and flow testing to test the full ETL flow on sampled data
● Review process for code changes through GitHub PRs
● Daily build in Jenkins to test the flow and functionalities every day

Page 18: Productionalizing a Spark application

ScalaTest

import org.scalatest._
import scala.collection.mutable.Stack

class ExampleSpec extends FlatSpec with Matchers {

  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = new Stack[Int]
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  it should "throw NoSuchElementException if an empty stack is popped" in {
    val emptyStack = new Stack[Int]
    a [NoSuchElementException] should be thrownBy {
      emptyStack.pop()
    }
  }
}

Page 19: Productionalizing a Spark application

GitHub PR

Page 20: Productionalizing a Spark application

Challenges
● Architectural challenges
  ○ Cluster resources were a bottleneck for the teams
  ○ Amazon EMR clusters were not throwaway clusters, as the data was stored in HDFS
  ○ Upgrading the Spark version on the cluster was difficult
  ○ Infrastructure to run scheduled jobs was missing, as Jenkins was not the best way to schedule jobs
  ○ Stability issues with Zeppelin

Page 21: Productionalizing a Spark application

Iteration 4

Architectural changes

Page 22: Productionalizing a Spark application

Iteration 4
● Moved the data storage from HDFS to S3 (see the sketch below)
● Moved to the Databricks cloud environment (https://databricks.com/product/databricks)
● Databricks cloud provides a notebook based interface for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Travis for deployment and testing
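With S3 as the primary storage, no state lives on the cluster, which is what makes clusters disposable. A minimal sketch of the change, assuming an s3a:// connector is configured and a hypothetical bucket name, shown with the Spark 2.x SparkSession entry point for brevity:

import org.apache.spark.sql.SparkSession

object S3Pipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3Pipeline").getOrCreate()

    // Primary storage is S3, so any cluster -- new, resized, or on a
    // different Spark version -- can read the same data
    val prices = spark.read.parquet("s3a://stocks-bucket/daily.parquet")

    val avgClose = prices.groupBy("symbol").avg("close")

    // Results land back in S3; the cluster can be terminated afterwards
    avgClose.write.mode("overwrite").parquet("s3a://stocks-bucket/avg_close.parquet")

    spark.stop()
  }
}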

Page 23: Productionalizing a Spark application

Databricks cloud
● Cluster config
  ○ Launch, configure, scale and terminate

Page 24: Productionalizing a Spark application

Databricks cloud
● Jobs
  ○ Schedule complex workflows

Page 25: Productionalizing a Spark application

Databricks cloud
● Notebooks
  ○ Explore, visualize and share

Page 26: Productionalizing a Spark application

Improvements
● Data engineers
  ○ The cluster bottleneck was solved by creating multiple throwaway clusters when needed
  ○ No need to stick to one cluster for a long time, as the primary data storage was S3
  ○ Terminating clusters when not in use is cost efficient
  ○ Multiple clusters with different versions of Spark let users try out the latest Spark features
  ○ Reduced cluster maintenance and tuning overhead

Page 27: Productionalizing a Spark application

Improvements
● Data engineers
  ○ Shorter turnaround time for understanding bottlenecks in the workflows
  ○ Databricks cloud Jobs can be used for scheduling workflows and daily runs
  ○ Travis enabled strict and immediate code testing
● Data scientists
  ○ Data scientists can easily share notebooks and analysis results with the team
  ○ Ability to write in multiple languages

Page 28: Productionalizing a Spark application

Architecture

[Architecture diagram] Stocks data → ETL → S3. Inside Databricks cloud, Jobs drive a data science cluster (R/Python notebooks) and two data engineering clusters (Scala notebooks); results feed NoSQL and the dashboard.

Page 29: Productionalizing a Spark application

Challenges
● Framework challenges
  ○ The schema is static and doesn't change frequently
  ○ Dataframes don't have a static schema check
  ○ The pipeline fails in the middle of processing if there is any change in the data
  ○ The current window analysis uses Scala constructs to load a specific set of data into memory and run ML on top of it
  ○ Domain object based functions are currently called from inside UDFs (sketched below)
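To make the last point concrete, here is a hypothetical sketch of the pattern being described: business logic defined on a domain object, wrapped in a Spark SQL UDF so it can run over untyped Dataframe rows. All names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical domain object carrying business logic
case class StockQuote(symbol: String, close: Double) {
  def isPenny: Boolean = close < 5.0
}

object DomainUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DomainUdfExample").getOrCreate()

    // The domain object has to be rebuilt from untyped columns inside
    // a UDF -- the pattern the slide calls out
    val isPenny = udf((symbol: String, close: Double) =>
      StockQuote(symbol, close).isPenny)

    val prices = spark.read.parquet("s3a://stocks-bucket/daily.parquet")
    prices.withColumn("isPenny", isPenny(col("symbol"), col("close"))).show()
  }
}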

Page 30: Productionalizing a Spark application

Iteration 5

Road ahead

Page 31: Productionalizing a Spark application

Iteration 5 (future iteration)
● Data engineers
  ○ Port the analyses from the Dataframe API to the Dataset API (in Spark 2.0)
  ○ With the Dataset API, we get a static schema check
  ○ Reuse the existing domain object based functions
● Data scientists
  ○ Move from Scala window based analysis to SparkSQL window analytics (see the sketch below)
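A hedged sketch of where this is heading, assuming Spark 2.0: a Dataset typed with a domain case class, so schema mismatches surface at compile time and domain methods can be called directly, plus a SparkSQL window function replacing the hand-rolled Scala windowing. The case class, paths and the 5-day moving average are illustrative, not the team's actual code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical domain class describing one daily quote
case class DailyQuote(symbol: String, date: String, close: Double)

object DatasetWindowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetWindowExample").getOrCreate()
    import spark.implicits._

    // Dataset API: typed against the domain class, so a schema mismatch
    // is a compile-time error instead of a mid-pipeline runtime failure
    val quotes = spark.read.parquet("s3a://stocks-bucket/daily.parquet").as[DailyQuote]

    // Existing domain object based functions can be used directly
    val liquid = quotes.filter(q => q.close > 5.0)

    // SparkSQL window analytics: a 5-day moving average per symbol,
    // replacing the Scala-side windowing constructs
    val window = Window.partitionBy("symbol").orderBy("date").rowsBetween(-4, 0)
    liquid.withColumn("movingAvg", avg(col("close")).over(window)).show()
  }
}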

Page 32: Productionalizing a Spark application

Lookback
● Spark version
  ○ 0.9 -> 1.6.0
● API
  ○ RDD -> Dataframe -> Dataset
● Deployment
  ○ EC2 -> EMR -> DB cloud
● Scheduling
  ○ Jenkins -> DB cloud Jobs
● Language
  ○ Scala

Page 33: Productionalizing a Spark application

Lookback
● Data format
  ○ Text -> Parquet
● Storage
  ○ HDFS -> S3
● Deployment
  ○ Jenkins -> Travis

Page 35: Productionalizing a Spark application

Thank you