A CD Framework For Data Pipelines

Yaniv Rodenski
@YRodenski [email protected]



Page 1

A CD Framework For Data Pipelines

Yaniv Rodenski

@YRodenski [email protected]

Page 2

Archetypes of Data Pipeline Builders

Data People (Data Scientists/Analysts/BI Devs):

• Exploratory workloads
• Data centric
• Simple deployment

Software Developers:

• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment

“Scientists”

Page 3

Page 4

Data scientist deploying to production

Page 5

Making Big Data Teams Scale

• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentalities between data professionals and engineers
• Mixture of technologies
• Data as the integration point
• Often schema-less
• Lack of tools

Page 6

What Do We Need to Deploy Our Apps?

• A source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app

Page 7

How can we apply these techniques to Big Data applications?

Page 8

Who are we? Software developers with years of Big Data experience

What do we want? A simple and robust way to deploy Big Data pipelines

How will we get it? Write tens of thousands of lines of code in Scala

Page 9

Amaterasu - Simple Continuously Deployed Data Apps

• Big Data apps in multiple frameworks
• Multiple languages:
  • Scala
  • Python
  • SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple environments

Page 10

Amaterasu Repositories

• Jobs are defined in repositories
• Current implementation - git repositories
  • tarball support is planned for a future release
• Repo structure (sketched below):
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
  • deps - dependencies configuration
• Benefits of using git:
  • Tooling
  • Branching
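
Assembled from the items above, a job repository might look like this (a sketch; the repository name is hypothetical, and the action file names are taken from the maki.yml examples later in the deck):

amaterasu-job/
├── maki.yml          # the workflow definition
├── src/
│   ├── file.scala    # a Spark Scala action
│   ├── file2.py      # a PySpark action
│   └── cleanup.scala # an error-handling action
├── env/
│   ├── dev/
│   │   └── job.yml   # per-environment configuration
│   └── production/
│       └── job.yml
└── deps/             # dependencies configuration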

Page 11

Pipeline DSL - maki.yml (Version 0.2.0)

---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
…

• Actions are the components of the pipeline
• exports declares data structures to be used in downstream actions
• error defines error-handling actions

Page 12

Amaterasu is not a workflow engine, it’s a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications.

Page 13

Pipeline != Workflow

Page 14

Pipeline DSL (Version 0.3.0)

---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: 10 * * * *
    runner:
      group: spark
      type: pyspark
    artifact:
      - groupId: io.shonto
        artifactId: mySparkStreaming
        version: 0.1.0

In version 0.3.0, pipelines and actions can be either long-running or scheduled. Scheduling is defined using cron format (10 * * * * runs at minute 10 of every hour). Actions can also be pulled from other applications or git repositories.

Page 15

Actions DSL (Spark)

• Your Scala/Python/SQL Spark code (more languages to come; R is in the works)
• Few changes:
  • Don’t create a new sc/sqlContext - use the ones in scope, or access them via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame is used to access data from previously executed actions
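
A minimal sketch of what those changes look like in an action (the parquet path is illustrative, not from the slides):

import org.apache.amaterasu.runtime._

// Don't build your own context inside an action:
// val sqlContext = new SQLContext(sc)

// Use the ones Amaterasu provides:
val df = AmaContext.sqlContext.read.parquet("file:///tmp/input/some.parquet") // illustrative path
val fromStart = AmaContext.getDataFrame("start", "odd") // data exported by the "start" action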

Page 16

Actions DSL - Spark Scala

Action 1 ("start") - file.scala:

import org.apache.amaterasu.runtime._

val data = Array(1, 2, 3, 4, 5)

val rdd = AmaContext.sc.parallelize(data)
// wrap each value in a tuple so the exported column is named _1;
// toDF() relies on the Spark implicits available in the action's scope
val odd = rdd.filter(n => n % 2 != 0).map(Tuple1(_)).toDF()

Action 2:

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "odd").where("_1 > 3")

highNoDf.write.json("file:///tmp/test1")

The maki.yml entry for the first action:

- name: start
  runner:
    group: spark
    type: scala
  file: file.scala
  exports:
    odd: parquet

Page 17

Actions DSL - PySpark

Action 1 ("start") - file.py:

data = range(1, 1000)

rdd = ama_context.sc.parallelize(data)
# wrap each value in a tuple so the exported column is named _1
odd = rdd.filter(lambda n: n % 2 != 0).map(lambda n: (n,)).toDF()

Action 2:

high_no_df = ama_context.get_dataframe("start", "odd").where("_1 > 100")

high_no_df.write.save("file:///tmp/test1", format="json")

The maki.yml entry for the first action:

- name: start
  runner:
    group: spark
    type: pyspark
  file: file.py
  exports:
    odd: parquet

Page 18

Actions DSL - SparkSQL

file.sql:

select * from ama_context.start_odd where _1 > 100

Here ama_context.start_odd is the odd dataset exported by the start action.

The maki.yml entry:

- name: action2
  runner:
    group: spark
    type: sql
  file: file.sql
  exports:
    high_no: parquet
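
Since this action exports high_no, a downstream action can consume the SQL result like any other export; a sketch in Scala (the show() call is just for illustration):

import org.apache.amaterasu.runtime._

// read the dataset exported by the SQL action above
val highNo = AmaContext.getDataFrame("action2", "high_no")
highNo.show() // inspect a few rows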

Page 19

Environments

• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains:
  • Input/output paths
  • Working directory
  • User-defined key-values

Page 20

env/production/job.yml

name: default
master: mesos://prdmsos:5050
inputRootPath: hdfs://prdhdfs:9000/user/amaterasu/input
outputRootPath: hdfs://prdhdfs:9000/user/amaterasu/output
workingDir: alluxio://prdalluxio:19998/
configuration:
  spark.cassandra.connection.host: cassandraprod
  sourceTable: documents

Page 21

env/dev/job.yml

name: test
master: local[*]
inputRootPath: file:///tmp/input
outputRootPath: file:///tmp/output
workingDir: file:///tmp/work/
configuration:
  spark.cassandra.connection.host: 127.0.0.1
  sourceTable: documents

Page 22

Environments in the Actions DSL

import io.shinto.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "x").where("_1 > 3")

highNoDf.write.json(Env.outputPath)

With the environment files above, the same code writes under file:///tmp/output in dev and under hdfs://prdhdfs:9000/user/amaterasu/output in production; only the environment configuration changes.

Page 23

Demo time

Page 24

Version 2 Main Features

• YARN support
• Spark SQL, PySpark support
• Extended environments to support:
  • Pure YAML support (configuration used to be JSON)
  • Full Spark configuration:
    • spark.yml - supports all Spark configurations
    • spark_exec_env.yml - for configuring Spark executor environments
• SDK preview - for building framework integrations

Page 25

Future Development

• Long-running pipelines and streaming support
• Better tooling:
  • ama-cli
  • Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements

Page 26

Getting started

Website: http://amaterasu.incubator.apache.org
GitHub: https://github.com/apache/incubator-amaterasu
Mailing list: [email protected]
Slack: http://apacheamaterasu.slack.com
Twitter: @ApacheAmaterasu

Page 27

Thank you!

@YRodenski [email protected]