Gerrit and Jenkins for Big Data Continuous Delivery
London, UK, June 2015
www.gerritforge.com
#jenkinsconf
About GerritForge
• Founded in 2009 in London
• Committed to OpenSource
The Team
Luca Milanesio
• Co-founder and Director of GerritForge
• Over 20 years in Agile Development and ALM
• OpenSource contributor to many projects (BigData, Continuous Integration, Git/Gerrit)

Antonios Chalkiopulos
• Author of Programming MapReduce with Scalding
• Open source contributor to many BigData projects
• Working on the "land-of-Hadoop" (landoop.com)
The Team (2)
Tiago Palma
• Data Warehouse & Big Data Development
• Senior Data Modeler
• Big Data infrastructure specialist

Stefano Galarraga
• 20 years of Agile Development
• Middleware, Big Data, Reactive Distributed Systems
• Open Source contributor to many BigData projects
Agenda
• Why continuous deployment on BigData?
• Our Development Lifecycle ingredients
  – Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
• Topics to address in BigData development
  – Type of tests (Unit vs. Integration)
  – Testing the "real thing" (aka the Cluster)
• Our BigData virtualised infrastructure
  – Marathon, Mesos and Docker all around
• Live (minimised) Demo
WHY?
• Early BigData had no process at all, so jobs could fail at any time
• Mature BigData is a mission-critical decision maker
• Need for more stable software-engineering methodologies:
  – Test-Driven Development (Stefano's ScaldingUnit)
  – Continuous Integration with Jenkins
  – Integration & Performance testing
  – Code review and validation
Code-Review BigData Lifecycle (1)
• Git used by distributed teams (UK, Israel, India)
• Topics and Code Review
• Jenkins build on every patch-set
• Commits reviewed / approved via Gerrit Submit
Code-Review BigData Lifecycle (2)
Code-Review BigData Lifecycle (3)
• Submitting a Topic automatically:
  – merges all patch-sets (semi-atomically)
  – triggers a longer chain of CI steps
  – promotes a release candidate (RC) if everything passes
• Jenkins automation via Gerrit Trigger Plugin
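
To make the feedback loop between Jenkins and Gerrit more concrete, here is a minimal sketch of what a build verdict on a patch-set could look like when posted through Gerrit's REST "set review" endpoint. In the setup described here the Gerrit Trigger Plugin takes care of voting, so this is only an illustration; the base URL, the change number, the message wording and the use of a Verified label with values +1/-1 are assumptions.

```python
from urllib.parse import quote


def review_endpoint(base_url, change_id, revision="current"):
    """URL of Gerrit's 'set review' REST endpoint for one patch-set."""
    # The /a/ prefix selects authenticated access in Gerrit's REST API.
    return "{}/a/changes/{}/revisions/{}/review".format(
        base_url.rstrip("/"), quote(change_id, safe=""), revision)


def verified_vote(build_passed, build_url):
    """JSON body carrying a Verified vote and a link back to the build.

    Assumption: the project uses a 'Verified' label with -1/+1 values.
    """
    outcome = "succeeded" if build_passed else "failed"
    return {
        "message": "Build {}: {}".format(outcome, build_url),
        "labels": {"Verified": 1 if build_passed else -1},
    }
```

The returned dictionary would be sent as the JSON body of a POST to the URL from `review_endpoint`, authenticated as the CI user.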
Ingredients: Gerrit
• Git-based Code Review system
• Pre-commit review
• Allows multiple validation steps (pipeline)
• Validation + Integration flags
Ingredients: Jenkins
• Plugins:
  – Gerrit trigger
  – Docker build step
  – Post-build script plugin
Fitting CDH Into this Picture
• Integration Test
  – Running integration tests in a CDH-enabled Docker container
  – Hadoop/local and Spark/standalone are not enough
  – Need to test class serialisation
  – Validate packaged fat-jars (library conflicts with CDH)
  – Performance on a real cluster
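
One way to picture the fat-jar validation step is a check that lists classes bundled in the application jar which the cluster already ships. The sketch below is an illustration only and not the check used in the talk: it assumes that overlapping `.class` entries are a good enough signal of a conflict, and the jar paths passed in are hypothetical.

```python
import zipfile


def class_entries(jar):
    """Return the set of .class entries packaged in a jar (a zip archive)."""
    with zipfile.ZipFile(jar) as archive:
        return {name for name in archive.namelist() if name.endswith(".class")}


def conflicting_classes(fat_jar, provided_jars):
    """Classes bundled in the fat-jar that a cluster-provided jar also ships."""
    provided = set()
    for jar in provided_jars:
        provided |= class_entries(jar)
    return sorted(class_entries(fat_jar) & provided)
```

A CI step could fail the build when `conflicting_classes` returns a non-empty list, or print the list so the offending dependency can be excluded or shaded.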
Fitting CDH Into this Picture
• Acceptance / performance test with short-lived CDHs
• Solution: Mesos, Marathon and Docker:
  – Ephemeral clusters with defined capacity
  – Automatic cluster-config
  – All controlled via Docker/Mesos
Mesos + Marathon
• Apache Mesos
  – Abstracts CPU, memory, storage and other compute resources away from machines
• Marathon Framework
  – Runs on top of Mesos
  – Guarantees that long-running applications never stop
  – REST API for managing and scaling services
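
As a small illustration of the "REST API for managing and scaling services" point, the sketch below builds the HTTP request that would ask Marathon to run a given number of instances of an app, using `PUT /v2/apps/{appId}`. The Marathon URL and the app id are assumptions, and the request is only constructed, not sent.

```python
import json
import urllib.request


def scale_request(marathon_url, app_id, instances):
    """Build (but do not send) a PUT /v2/apps/{app_id} request that scales an app."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url="{}/v2/apps/{}".format(marathon_url.rstrip("/"), app_id.strip("/")),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the result to `urllib.request.urlopen` would submit it to a running Marathon instance.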
CDH Components
• CDH 5.4.1 distribution
  – Apache Spark
  – Hadoop HDFS
  – YARN
Integration Test Flow on CDH Cluster
[Diagram components: Jenkins Master, Mesos Master, Marathon, Private Docker Registry, Slave Host (Mesos Slave, Docker)]
• POST to the Marathon REST API to start 1 Docker container with Cloudera Manager and N Docker containers with Cloudera agents
• Marathon Framework receives resource offers from Mesos Master and submits the tasks
• The task is sent to the Mesos Slave
• Mesos Slave starts the Docker container
• Docker image is fetched from the Docker registry if not present on the Slave host
• Waiting for Dockers / Dockers UP
• Install Cloudera packages via Cloudera Manager API using Python
• Deploy the ETL, run the ETL and the Integration Tests
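
The first steps of this flow can be sketched as plain data: the app definitions that would be POSTed to Marathon's `/v2/apps` endpoint, plus a check for the "Dockers UP" state based on the body returned by `GET /v2/apps/{appId}`. This is a minimal sketch under stated assumptions, not the demo's actual scripts: the app ids, image names, registry address and resource sizes are all hypothetical.

```python
def docker_app(app_id, image, instances, cpus=1.0, mem=2048):
    """Marathon app definition that runs `instances` copies of a Docker image."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,   # hypothetical sizing
        "mem": mem,     # MiB, hypothetical sizing
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
    }


def cluster_apps(registry, agents):
    """One Cloudera Manager container plus `agents` agent containers."""
    return [
        docker_app("/cdh/manager", registry + "/cdh-manager", 1),
        docker_app("/cdh/agent", registry + "/cdh-agent", agents),
    ]


def all_up(app_status):
    """True once every requested task of an app is reported as running.

    `app_status` is the parsed JSON body of GET /v2/apps/{app_id}.
    """
    app = app_status["app"]
    return app["tasksRunning"] >= app["instances"]
```

A Jenkins job could POST each definition, poll until `all_up` holds for both apps, and only then move on to the Cloudera Manager installation and the ETL run.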
Unit and Integration Tests sample
• Test project:
  – Test Spark project
  – ETL from Oracle to HDFS
• Unit-test directly on Spark logic
• Integration tests for every patch-set:
  – VERY small dataset just for this demo
  – CDH and Oracle Docker Images
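
Unit-testing the Spark logic directly is easiest when the per-record transformation lives in a plain function that needs no cluster. The sketch below shows that idea in Python; the demo project's own language and schema are not given in these slides, so the column names, formats and the function itself are hypothetical.

```python
from datetime import datetime


def normalise_order(row):
    """Turn one source row (a dict of strings) into the record written to HDFS.

    The column names and formats are invented for this illustration.
    """
    return {
        "order_id": int(row["ORDER_ID"]),
        "customer": row["CUSTOMER"].strip().upper(),
        "order_date": datetime.strptime(row["ORDER_DATE"], "%Y-%m-%d").date().isoformat(),
        "amount_pence": int(round(float(row["AMOUNT"]) * 100)),
    }
```

A unit test can call `normalise_order` on a handful of hand-written rows, while the job itself would apply the same function to every record, for example through an RDD `map`. The integration tests then only need to cover what a unit test cannot: connectivity to Oracle, serialisation and writing to HDFS.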
Unit and Integration Tests
[Diagram labels: Hadoop pseudo-distributed mode, Spark Standalone, Jenkins, Oracle, CDH; steps: Build, Job init, Submit job, Init/read HDFS]
DEMO
Small-scale BigData Delivery Pipeline
References
• Demo sources: https://github.com/GerritForge
• Blog: https://gitenterprise.me
• Twitter: @GerritReview @GitEnterprise @GerritForge
• Learn Gerrit Code Review book: GerritHub.io/book
• Get in touch with GerritForge: