
Page 1: Hadoop Everywhere & Cloudbreak

Hadoop Everywhere
Hortonworks. We do Hadoop.

Page 2: Hadoop Everywhere & Cloudbreak

$ whoami
Sean Roberts
Partner Solutions Engineer
London, EMEA & everywhere

@seano
linkedin.com/in/seanorama

MacGyver. Data Freak. Cook. Autodidact. Volunteer. Ancestral Health. Fito. Couchsurfer. Nomad

Page 3: Hadoop Everywhere & Cloudbreak

What’s New!
- HDP 2.3: http://hortonworks.com/
- Hadoop Summit recordings:
  - http://2015.hadoopsummit.org/san-jose/
  - http://2015.hadoopsummit.org/brussels/
- Past & future workshops: http://hortonworks.com/partners/learn/

Page 4: Hadoop Everywhere & Cloudbreak

Agenda
● Hadoop Everywhere
● Deployment challenges & requirements
● Cloudbreak & our Docker approach
● Workshop: Your own Cloudbreak
  ○ And auto-scaling with Periscope
● Cloud best practices

Reminder:
● Attendee phone lines are muted
● Please ask questions in the chat

Page 5: Hadoop Everywhere & Cloudbreak

Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.

Page 6: Hadoop Everywhere & Cloudbreak

Hadoop Everywhere

Page 7: Hadoop Everywhere & Cloudbreak

Hadoop Everywhere
● Any application: batch, interactive, and real-time
● Any data: existing and new datasets
● Anywhere: complete range of deployment options (commodity, appliance, cloud)

Diagram: existing applications, new analytics, and partner applications share a common data access layer (batch, interactive, real-time) on top of YARN, the data operating system.

Page 8: Hadoop Everywhere & Cloudbreak

Hadoop Up There, Down Here... Everywhere!
● Hybrid deployment choice: Windows, Linux, on-premises or cloud; data “gravity” guides the choice
● Compatible clusters: run applications and data processing workloads wherever and whenever needed
● Replicated datasets: democratize Hadoop data access via automated sharing of datasets using Apache Falcon

Diagram: on-premises clusters working alongside cloud clusters for dev/test, BI/ML, and IoT apps.

Page 9: Hadoop Everywhere & Cloudbreak

Anywhere? Up There or Down Here?

Use case | Where?
Active Archive / Compliance Reporting | Sensitive data = “down here”; “up there” valid for many scenarios
ETL / Data Warehouse Optimization | Usually has “down here” gravity; DW in the cloud is changing that
Smart Meter Analysis | Data typically flows “up there”
Single View of Customer | May have “down here” gravity, unless you’re using SaaS apps
Supply Chain Optimization | May have heavy “down here” gravity
New Data for Product Management | “Up there” could be considered for many scenarios
Vehicle Data for Transportation/Logistics | Why not “up there”?
Vehicle Data for Insurance | May have “down here” gravity (e.g. join with existing risk data)

Page 10: Hadoop Everywhere & Cloudbreak

Deployment Challenges & Requirements

Page 11: Hadoop Everywhere & Cloudbreak

Deployment challenges
● Infrastructure is different everywhere
  ○ e.g. each cloud provider has its own API
  ○ e.g. each provider has different networking methods
● OS/images are different everywhere
● How to do service discovery?
● How to dynamically scale/manage?

See prior operations workshops.

Page 12: Hadoop Everywhere & Cloudbreak

Deployment requirements
- Infrastructure
- Operating system
- Environment prepared (see docs)
- Ambari agent/server installed & registered
- Deploy HDP cluster
  - Ambari Blueprints or Cluster Wizard
- Ongoing configuration/management

Page 13: Hadoop Everywhere & Cloudbreak

Options for automation
- Many combinations of tools
  - e.g. Foreman, Ansible, Chef, Puppet, docker-ambari, shell scripts, CloudFormation, …
- Provider specific
  - Cisco UCS, Teradata, HP, Google’s bdutil, …
- Docker with Cloudbreak

Using Ambari with all of the above!

Page 14: Hadoop Everywhere & Cloudbreak

Demo: Basic script-based example
https://github.com/seanorama/ambari-bootstrap/

Page 15: Hadoop Everywhere & Cloudbreak

ambari-bootstrap
https://github.com/seanorama/ambari-bootstrap

Requirements:
● Infrastructure prepped (see HDP docs)
● Nodes running RedHat EL or CentOS 6
● HDFS paths mounted (see HDP docs)
● sudo or root access
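A minimal sketch of running the script across a cluster (the install_ambari_server / ambari_server variable names are assumptions taken from the repo README of the time; verify there before use):

  # on the node that will run the Ambari server (variable name assumed from the README)
  export install_ambari_server=true
  curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh

  # on every other node, point the agent at the server host (variable name assumed from the README)
  export ambari_server=my-ambari-server.example.com
  curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh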

Page 16: Hadoop Everywhere & Cloudbreak

After Ambari deployment
● (optional) Configure local YUM/APT repos
● Deploy HDP with the Ambari Wizard or a Blueprint
● Ongoing configuration/management

Page 17: Hadoop Everywhere & Cloudbreak

Using Ansible
https://github.com/rackerlabs/ansible-hadoop

Page 18: Hadoop Everywhere & Cloudbreak

Build once. Deploy anywhere.

Docker

Page 19: Hadoop Everywhere & Cloudbreak


Page 20: Hadoop Everywhere & Cloudbreak

Docker is a “Shipping Container” System for Code

An engine that enables any payload to be encapsulated as a lightweight, portable, self-sufficient container.

Diagram: a multiplicity of stacks (static website, web frontend, user DB, queue, analytics DB) crossed with a multiplicity of hardware environments (development VM, QA server, public cloud, contributor’s laptop, production cluster, customer data center).

Page 21: Hadoop Everywhere & Cloudbreak

Docker
• Container-based virtualization
• Lightweight and portable
• Build once, run anywhere
• Ease of packaging applications
• Automated and scripted
• Isolated
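As a minimal illustration of “build once, run anywhere” (the image and commands below are just an example, not from the original deck):

  # run an isolated, throwaway container from a stock image
  docker run --rm -it centos:6 bash
  # inside the container: its own filesystem, process space and network namespace
  cat /etc/redhat-release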

Page 22: Hadoop Everywhere & Cloudbreak

Why Is Docker So Exciting?

For developers: build once… run anywhere
• A clean, safe, and portable runtime environment for your app
• No missing dependencies, packages, etc.
• Run each app in its own isolated container
• Automate testing, integration, packaging
• Reduce/eliminate concerns about compatibility on different platforms
• Cheap, zero-penalty containers to deploy services

For DevOps: configure once… run anything
• Make the entire lifecycle more efficient, consistent, and repeatable
• Eliminate inconsistencies between SDLC stages
• Support segregation of duties
• Significantly improve the speed and reliability of CI/CD
• Significantly lighter weight than VMs

Page 23: Hadoop Everywhere & Cloudbreak

More technical explanation

Why:
• Run on any Linux
  • Regardless of kernel version (2.6.32+)
  • Regardless of host distro
  • Physical or virtual, cloud or not
  • Container and host architecture must match
• Run anything
  • If it can run on the host, it can run in the container
  • i.e. if it can run on a Linux kernel, it can run

What:
• High level: it’s a lightweight VM
  • Own process space
  • Own network interface
  • Can run stuff as root
• Low level: it’s chroot on steroids
  • Container = isolated processes
  • Shares the kernel with the host
  • No device emulation (neither HVM nor PV)

Page 24: Hadoop Everywhere & Cloudbreak

Docker – how it works

Diagram: with a type-2 hypervisor, each app (App A, App A’, App B) runs on its own guest OS plus bins/libs on top of the host OS and server. With Docker, containers share the host OS kernel and carry only their own bins/libs; containers are isolated from each other.

…result is significantly faster deployment, much less overhead, easier migration, faster restart.

Page 25: Hadoop Everywhere & Cloudbreak

Cloudbreak
A tool for provisioning and managing Hadoop clusters in the cloud

Page 26: Hadoop Everywhere & Cloudbreak

Cloudbreak
• Developed by SequenceIQ
• Open source under the Apache 2.0 license [Apache project soon]
• Cloud- and infrastructure-agnostic, cost-effective Hadoop-as-a-Service platform API
• Elastic: can spin up any number of nodes, add/remove on the fly
• Provides full cloud lifecycle management post-deployment

Page 27: Hadoop Everywhere & Cloudbreak

Key features of Cloudbreak

Elastic
• Provision a cluster with an arbitrary number of nodes
• Commission/decommission nodes from the cluster
• Policy- and time-based scaling of the cluster

Flexible
• Declarative and flexible Hadoop cluster creation using blueprints
• Provision to multiple public cloud providers or OpenStack-based private clouds using the same common API
• Access all of this functionality through a rich UI, secured REST API or automatable shell

Enterprise-ready
• Supports basic, token-based and OAuth2 authentication
• The cluster is provisioned in a logically isolated network
• Tracks usage and cluster metrics

Page 28: Hadoop Everywhere & Cloudbreak

Launch HDP on Any Cloud for Any Application

Cloudbreak:
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!

Example Ambari Blueprints: IoT Apps (Storm, HBase, Hive), BI / Analytics (Hive), Data Science (Spark), Dev / Test (all HDP services)

Page 29: Hadoop Everywhere & Cloudbreak

Cloudbreak approach
• Use Ambari for the heavy lifting
  • Provisioning of Hadoop services
  • Monitoring
• Use Ambari Blueprints
  • Assign host groups to physical instance types
• Public/private cloud provider APIs abstracted
  • Azure / Google / Amazon / OpenStack
• Run the Ambari agent/server in Docker containers
  • Networking: docker run --net=host
  • Service discovery: Consul (previously Serf)
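As a small illustration of host networking (image and command chosen purely for illustration), a container started with --net=host shares the host’s network stack instead of getting its own:

  # prints the host's interfaces rather than a container-private one
  docker run --rm --net=host alpine ip addr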

Page 30: Hadoop Everywhere & Cloudbreak

Workshop: Your own Cloudbreak

Page 31: Hadoop Everywhere & Cloudbreak

Workshop: Your Own Cloudbreak

cloudbreak-deployer
● https://github.com/sequenceiq/cloudbreak-deployer

Requirements:
● A Docker host (laptop, server or cloud infrastructure)
● Resources: very little; tested with 2 GB of RAM

Page 32: Hadoop Everywhere & Cloudbreak

Requirement: a Docker host
● OS X or Windows: http://boot2docker.io/
  ○ boot2docker init
  ○ boot2docker up
  ○ eval "$(boot2docker shellinit)"
  ○ boot2docker ssh
● Linux: install the Docker daemon
● Anywhere: docker-machine “lets you create Docker hosts on your computer, on cloud providers, and inside your own data center”
  ○ Example on Rackspace:
    ■ docker-machine create --driver rackspace \
        --rackspace-api-key $OS_PASSWORD \
        --rackspace-username $OS_USERNAME \
        --rackspace-region DFW docker-rax
    ■ docker-machine ssh docker-rax

Page 33: Hadoop Everywhere & Cloudbreak

Install cloudbreak-deployer
https://github.com/sequenceiq/cloudbreak-deployer

● curl https://raw.githubusercontent.com/sequenceiq/cloudbreak-deployer/master/install | sh && cbd --version
● cbd init
● cbd start

You’ll then have your own Cloudbreak & Periscope server with an API and web UI.

Page 34: Hadoop Everywhere & Cloudbreak

Done: Your own Cloudbreak

Page 35: Hadoop Everywhere & Cloudbreak

Deploy a cluster with your Cloudbreak

Page 37: Hadoop Everywhere & Cloudbreak

2. Create Cluster

Page 38: Hadoop Everywhere & Cloudbreak

3. Use your cluster
Ambari is available as expected.

To reach your Hadoop hosts:
● SSH to the Docker host
  ○ Hosts are listed in the “Cloud stack description”
  ○ ssh cloudbreak@IPofHost
● Shell into the “ambari-agent” container
  ○ sudo docker ps | grep ambari-agent
    ■ note the CONTAINER ID
  ○ sudo docker exec -it CONTAINERID bash
● Use the hosts as usual, e.g.:
  ○ hadoop fs -ls /

Page 39: Hadoop Everywhere & Cloudbreak

Cloudbreak internals

Page 40: Hadoop Everywhere & Cloudbreak

Cloudbreak internals

Components (each running as a Docker container):
● Uluwatu: Cloudbreak UI, accessed from the browser
● Sultans: user management UI
● Cloudbreak shell
● OAuth2 identity server (UAA), backed by uaa-db (PostgreSQL)
● Cloudbreak REST API, backed by cb-db (PostgreSQL)
● Periscope (autoscaling), backed by ps-db (PostgreSQL)
● consul, registrator, ambassador

Page 41: Hadoop Everywhere & Cloudbreak

Docker

Page 42: Hadoop Everywhere & Cloudbreak

Swarm
• Native clustering for Docker
• Distributed container orchestration
• Same API as Docker

Page 43: Hadoop Everywhere & Cloudbreak

Swarm – how it works
• Swarm managers/agents
• Discovery services
• Advanced scheduling

Page 44: Hadoop Everywhere & Cloudbreak

Consul
• Service discovery/registry
• Health checking
• Key/value store
• DNS
• Multi-datacenter aware
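For example, a registered service can be looked up through Consul’s DNS interface (the service name “ambari” below is illustrative, not from the deck):

  # query the local Consul agent's DNS interface (default port 8600)
  dig @127.0.0.1 -p 8600 ambari.service.consul SRV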

Page 45: Hadoop Everywhere & Cloudbreak

Consul – how it works
• Consul servers/agents
• Consistency through a quorum (Raft)
• Scalability due to a gossip-based protocol (SWIM)
• Decentralized and fault tolerant
• Highly available
• Consistency over availability (CP)
• Multiple interfaces: HTTP and DNS
• Support for watches

Page 46: Hadoop Everywhere & Cloudbreak

Apache Ambari
• Easy Hadoop cluster provisioning
• Management and monitoring
• Key feature: Blueprints
• REST API, CLI shell
• Extensible: Stacks, Services, Views

Page 47: Hadoop Everywhere & Cloudbreak

Apache Ambari – how it works
• Ambari server/agents
• Define a blueprint (blueprint.json)
• Define a host mapping (hostmapping.json)
• POST the cluster create
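A minimal sketch of that flow against the Ambari REST API (host name, blueprint/cluster names, file names and credentials are placeholders):

  # register the blueprint
  curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
    -d @blueprint.json http://ambari-server:8080/api/v1/blueprints/my-blueprint

  # create the cluster from the blueprint plus the host mapping
  curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
    -d @hostmapping.json http://ambari-server:8080/api/v1/clusters/my-cluster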

Page 48: Hadoop Everywhere & Cloudbreak

Run Hadoop as Docker containers: HDP as Docker containers via Cloudbreak

• Fully automated Ambari cluster installation
• Avoid the GUI, use the REST API only (ambari-shell)
• Fully automated HDP installation with blueprints
• Quick installation (pre-pulled RPMs)
• Same process/images for dev/qa/prod
• Same process for single-node and multi-node

Diagram: Cloudbreak provisions VMs from the cloud provider (or bare metal), installs Ambari in Docker containers on the VMs, then instructs Ambari to build the HDP cluster.

Page 49: Hadoop Everywhere & Cloudbreak

Provisioning – how it works
1. Start VMs, each with a running Docker daemon
2. Cloudbreak bootstrap:
   • Start the Consul cluster
   • Start the Swarm cluster (Consul for discovery)
3. Start Ambari server/agents via the Swarm API
4. Ambari services are registered in Consul (Registrator)
5. Post the Blueprint

Page 50: Hadoop Everywhere & Cloudbreak

Run Hadoop as Docker containers (diagram sequence spanning slides 50 to 53):
1. Cloudbreak starts Docker hosts on the provider.
2. Ambari server (amb-ser) and agent (amb-agn) containers are started on those hosts.
3. A Blueprint is posted to the Ambari server.
4. Ambari installs the HDP services (NameNode, HDFS, YARN, Hive, HBase, ZooKeeper, …) inside the agent containers.

Page 54: Hadoop Everywhere & Cloudbreak

Workshop: Auto-scale your cluster with Periscope

Page 55: Hadoop Everywhere & Cloudbreak

Optimize Cloud Usage via Elastic HDP Clusters
• Auto-scaling policies based on any Ambari metric
• Dynamically scale to achieve physical elasticity
• Coordinates with YARN to achieve elasticity based on the policies

Diagram: a Dev/Test cluster growing and shrinking under an auto-scaling policy.

Page 56: Hadoop Everywhere & Cloudbreak

Scaling for Static and Dynamic Clusters

Diagram: Ambari Metrics and Ambari Alerts feed Cloudbreak/Periscope, which enforces the auto-scale policies and scales the cluster and YARN apps; Cloudbreak handles provisioning for both static and dynamic clusters.

Page 57: Hadoop Everywhere & Cloudbreak

Scale by Ambari monitoring metric
1. Ambari: review the metric
2. Cloudbreak: set an alert
3. Cloudbreak: set a scaling policy

Page 58: Hadoop Everywhere & Cloudbreak

Scale up/down by time
1. Set a time-based alert
2. Set a scaling policy

Repeat with an alert and policy which scale down.

Page 59: Hadoop Everywhere & Cloudbreak

Roadmap

Page 60: Hadoop Everywhere & Cloudbreak

Release summary

Cloudbreak
● Its own project (separate from Ambari)
● Supported on Linux flavors which support Docker

Periscope
● Feature of Cloudbreak 1.0
● Will be embedded in Ambari later in 2015

Page 61: Hadoop Everywhere & Cloudbreak

Release timeline (milestones, approximate order):
● Cloudbreak 1.0 GA: June/July 2015
● Ambari 2.1.0 / HDP 2.3 “Dal”
● Cloudbreak Incubator proposal: July/August 2015 (est)
● Cloudbreak 1.1: August 2015 (est)
● Ambari 2.1.1 / HDP “Dal-M10”
● Cloudbreak 2.0 GA: 2H 2015
● Ambari 2.2 / HDP 2.4 “Erie”

Page 62: Hadoop Everywhere & Cloudbreak

Supported cloud environments

Provider | Cloudbreak + HDP 2.3
Microsoft Azure | GA
AWS | GA
Google Compute | GA

Provider | Cloudbreak + HDP 2.3 | Cloudbreak + HDP 2.4
OpenStack Community | Tech Preview | Tech Preview
Red Hat OSP | TBD
HP Helion | GA (tentative)
Mirantis OpenStack |

Page 63: Hadoop Everywhere & Cloudbreak

HDP as a Service

Page 64: Hadoop Everywhere & Cloudbreak

Hortonworks Data Platform On Azure

Page 65: Hadoop Everywhere & Cloudbreak

Rackspace

Cloud Big Data Platform
● Rapidly spin up on-demand HDP clusters
● Integrated with Cloud Files (OpenStack Swift)
● Opt in to Managed Services by Rackspace

Managed Big Data Platform
● Fully managed HDP on dedicated and/or cloud infrastructure
● Leverage Fanatical Support and industry-leading SLAs
● Supported by Rackspace with escalation to Hortonworks

Page 66: Hadoop Everywhere & Cloudbreak

CSC

Page 67: Hadoop Everywhere & Cloudbreak

HDP on IaaS - Best Practices

Page 68: Hadoop Everywhere & Cloudbreak

Microsoft Azure
● Deployment
  ○ Deploy using Cloudbreak
  ○ Deploy using the HWX Azure Gallery image
● Integrated with Azure Blob Storage
● Supported directly by Hortonworks
● Other offerings
  ○ Microsoft HDInsight
  ○ HDP Sandbox

Page 69: Hadoop Everywhere & Cloudbreak

Azure deployment guidelines
● Keep everything in the same region
● Instance types
  ○ Typical: A7
  ○ Performance: D14
  ○ 8x 1 TB Standard LRS (x3) virtual hard disks per server
● Multiple storage accounts are recommended
  ○ Recommend no more than 40 virtual hard disks per storage account

Page 70: Hadoop Everywhere & Cloudbreak

Azure Blob Store
Azure Blob Store (object storage):
● wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
● Can be used as a replacement for HDFS
● Thoroughly tested in HDP release test suites
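For instance (the container and account names below are placeholders, and the WASB connector must already be configured with the storage account key), a WASB path can be used wherever an HDFS path is accepted:

  hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/data/
  hadoop fs -put localfile.csv wasb://mycontainer@myaccount.blob.core.windows.net/data/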

Page 71: Hadoop Everywhere & Cloudbreak

Amazon Web Services
● Deploy using Cloudbreak
● Integrated with AWS S3 (object storage)
● Supported directly by Hortonworks

Page 72: Hadoop Everywhere & Cloudbreak

Amazon deployment guidelines
● Keep everything in the same region/AZ
● Use instances with Enhanced Networking

Master nodes:
● Choose EBS-optimized
● Boot: 100 GB on EBS
● Data: 4+ 1 TB EBS volumes

Worker nodes:
● Boot: 100 GB on EBS
● Data: instance storage
  ○ EBS can be used, but local is preferred

Instance types:
● Typical: d2 family
● Performance: i2 family
https://aws.amazon.com/ec2/instance-types/

Page 73: Hadoop Everywhere & Cloudbreak

AWS RDS
● Some services rely on MySQL, Oracle or PostgreSQL:
  ○ Apache Ambari
  ○ Apache Hive
  ○ Apache Oozie
  ○ Apache Ranger
● Use RDS for these instead of managing the databases yourself.

Page 74: Hadoop Everywhere & Cloudbreak

AWS S3 (object storage)
● s3n:// with HDP 2.2 (Hadoop 2.6)
● s3a:// with HDP 2.3 (Hadoop 2.7)
● Not currently a direct replacement for HDFS
● Recommended to configure access with an IAM role/policy
  ○ https://docs.aws.amazon.com/IAM/latest/UserGuide/policies_examples.html#iam-policy-example-s3
  ○ Example: http://git.io/vLoGY


Page 76: Hadoop Everywhere & Cloudbreak

Google Cloud
● Deploy using
  ○ Cloudbreak
  ○ Google bdutil with the Apache Ambari plug-in
● Integrated with Google Cloud Storage
● Supported directly by Hortonworks

Page 77: Hadoop Everywhere & Cloudbreak

Google deployment guidelines
● Instance types
  ○ Typical: n1-standard-4 with a single 1.5 TB persistent disk
  ○ Performance: n1-standard-8 with 1 TB SSD
● Google GCS (object storage)
  ○ gs://<CONFIGBUCKET>/dir/file
  ○ Not currently a replacement for HDFS

Page 78: Hadoop Everywhere & Cloudbreak

S3 & GCS as secondary storage systems
The connectors are currently eventually consistent, so they do not replace HDFS.

Backup
● Falcon, DistCp, hadoop fs, HBase ExportSnapshot
● A Kafka+Storm bolt sends messages to S3/GCS, providing a backup & point-in-time recovery source

Input/Output
● Convenient & broadly used upload/download method
  ○ As middleware to ease integration with Hadoop & limit access
● Publishing static content (optionally with CloudFront)
  ○ Removes the need to manage any web services
● Storage for temporary/ephemeral clusters
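As an example of the backup path (bucket name and paths are placeholders, credentials or an IAM role are assumed to be configured, and s3a requires HDP 2.3 / Hadoop 2.7 as noted above):

  # copy an HDFS directory to S3 as a backup
  hadoop distcp hdfs:///apps/hive/warehouse s3a://my-backup-bucket/warehouse/
  # inspect the result
  hadoop fs -ls s3a://my-backup-bucket/warehouse/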

Page 79: Hadoop Everywhere & Cloudbreak

Questions

Page 80: Hadoop Everywhere & Cloudbreak

$ shutdown -h now

- HDP 2.3: http://hortonworks.com/
- Hadoop Summit recordings:
  - http://2015.hadoopsummit.org/san-jose/
  - http://2015.hadoopsummit.org/brussels/
- Past & future workshops: http://hortonworks.com/partners/learn/