Big Data Computations Using Elastic Data Processing in OpenStack Cloud
Sergey Lukjanov (Mirantis), Alexander Ignatov (Mirantis), Trevor McKay (Red Hat)

Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using Elastic Data Processing in OpenStack Cloud

Page 2

Agenda

• OpenStack Data Processing Overview

• EDP Architecture & Technical Concepts

• Live Demo

Page 4

OpenStack Data Processing: Sahara

Mission: To provide a scalable data processing stack and associated management interfaces.

• provision and operate Hadoop clusters

• schedule and operate Hadoop jobs

Page 5

Hadoop - Big Data Platform

© http://hortonworks.com/hadoop/yarn/

Page 6

Trends

http://www.google.com/trends/

Page 7

Architecture overview

[Diagram labels: Horizon; Savanna Pages; Savanna Python Client; REST API; Auth; Keystone; Data Access Layer; Cluster Configuration Manager; Resources Orchestration Manager; Job Manager; Vendors Plugins; Data Sources; Job Sources; Hadoop VM (x4); Swift; Heat; Nova; Glance; Cinder; Neutron; Trove DB. "Savanna" is Sahara's former project name.]

Page 8

Sahara status

• Official integrated OpenStack project

• Supported Hadoop distros:
  • Vanilla Apache Hadoop
  • Hortonworks Data Platform
  • Intel Distribution
  • Cloudera Distribution (in blueprint)

• Included in OpenStack distros:
  • RDO - openstack.redhat.com
  • Mirantis OpenStack - software.mirantis.com

Page 9

Contributors

Page 10

Agenda

• OpenStack Data Processing Overview

• EDP Architecture & Technical Concepts

• Live Demo

Page 11

Elastic Data Processing

• EDP - API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)

• Supported data sources: Swift, HDFS, Ceph

• Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive

• Oozie for Hadoop jobs workflow management

• Supports both Hadoop 1 & 2

• Job executions on transient clusters

Page 12

EDP Use Cases

• Simplified task execution: you don’t need to know Hadoop!

• Bursty workloads: ad-hoc queries that need significant resources only for a short time

• Use of free IaaS capacity for Hadoop tasks

Page 13

EDP - Data Sources

[Diagram labels: Swift: INPUT, OUTPUT; Sahara EDP; Hadoop VM (x4)]

swift://some_container/INPUT

swift://some_container/OUTPUT
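
As a rough illustration of how a data source URL like the two above breaks down, here is a minimal Python sketch. It assumes the part after `swift://` is the container name and the rest is the object path; `SwiftLocation` and `parse_swift_url` are hypothetical helper names, not part of Sahara.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class SwiftLocation(NamedTuple):
    """Hypothetical holder for the two parts of a swift:// URL."""
    container: str
    object_path: str


def parse_swift_url(url: str) -> SwiftLocation:
    """Split a swift:// data source URL into container and object path.

    Assumption: the URL has the shape swift://<container>/<object path>,
    as on the slide. Real deployments may use a different host format.
    """
    parts = urlparse(url)
    if parts.scheme != "swift":
        raise ValueError(f"not a swift URL: {url!r}")
    if not parts.netloc:
        raise ValueError(f"missing container in {url!r}")
    return SwiftLocation(parts.netloc, parts.path.lstrip("/"))


if __name__ == "__main__":
    for url in ("swift://some_container/INPUT", "swift://some_container/OUTPUT"):
        print(parse_swift_url(url))
```

Rejecting other schemes early keeps a mistyped HDFS or local path from being silently treated as a Swift object.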

Page 14

EDP - Job Binaries

[Diagram labels: Swift; Sahara DB; Sahara EDP]

internal-db://script.pig

swift://some_container/mapreduce.jar

1. Pig, Hive scripts

2. Executable Jar files

3. Pluggable binaries and libraries
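
The two job binary URLs on this slide differ only in their scheme, which tells EDP where the binary lives. A small Python sketch of that dispatch follows; the `BINARY_STORES` table and `binary_store` function are illustrative names assumed here, not Sahara code.

```python
# Hypothetical mapping from URL scheme to the store named on the slide.
BINARY_STORES = {
    "internal-db": "Sahara DB",
    "swift": "Swift",
}


def binary_store(url: str) -> str:
    """Return the name of the store that holds the job binary at `url`."""
    scheme, separator, rest = url.partition("://")
    if not separator or not rest:
        raise ValueError(f"malformed job binary URL: {url!r}")
    try:
        return BINARY_STORES[scheme]
    except KeyError:
        raise ValueError(f"unsupported job binary scheme: {scheme!r}") from None


if __name__ == "__main__":
    for url in ("internal-db://script.pig",
                "swift://some_container/mapreduce.jar"):
        print(url, "->", binary_store(url))
```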

Page 15

EDP - Job Execution. Step 1

[Diagram labels: Sahara; EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig]

Page 16

EDP - Job Execution. Step 2

[Diagram labels: Sahara; EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; JobTracker; Oozie; Hadoop VM (x3)]

Page 17

EDP - Job Execution. Step 3

[Diagram labels: Sahara; EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM (x3); JobTracker; Oozie; Execute a job]

Page 18

EDP - Job Execution. Step 4

[Diagram labels: Sahara; EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM (x3); JobTracker; Oozie]

Page 19

EDP - Job Execution. Step 5

[Diagram labels: Sahara; EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM (x3); JobTracker; Oozie; workflow.xml]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials
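
To make the four items listed for workflow.xml more concrete, here is a hedged Python sketch that assembles an Oozie-style workflow with one MapReduce action. The element layout follows the public Oozie workflow schema, but this is not the file Sahara actually generates. The `build_workflow` function, the node names, the use of `<file>` for the binary URL, and the property names chosen for data sources and credentials are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_workflow(job_name, configs, jar_url, input_url, output_url,
                   swift_user, swift_password):
    """Return an Oozie-style workflow.xml string for one MapReduce action.

    `configs` maps property names to string values. All property names
    used below are illustrative assumptions.
    """
    root = ET.Element("workflow-app",
                      {"xmlns": "uri:oozie:workflow:0.2", "name": job_name})
    ET.SubElement(root, "start", {"to": "job-node"})
    action = ET.SubElement(root, "action", {"name": "job-node"})
    mr = ET.SubElement(action, "map-reduce")
    ET.SubElement(mr, "job-tracker").text = "${jobTracker}"
    ET.SubElement(mr, "name-node").text = "${nameNode}"
    configuration = ET.SubElement(mr, "configuration")

    properties = dict(configs)                       # 1. job-specific configurations
    properties["mapred.input.dir"] = input_url       # 3. URLs for data sources
    properties["mapred.output.dir"] = output_url
    properties["fs.swift.service.sahara.username"] = swift_user      # 4. credentials
    properties["fs.swift.service.sahara.password"] = swift_password
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value

    ET.SubElement(mr, "file").text = jar_url         # 2. URLs to binaries

    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_workflow(
        "demo-job",
        {"mapred.reduce.tasks": "2"},
        "swift://some_container/mapreduce.jar",
        "swift://some_container/INPUT",
        "swift://some_container/OUTPUT",
        "demo-user",
        "demo-password",
    ))
```

Because the credentials end up as plain properties in the document, a file like this would need the same protection as the credentials themselves.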

Page 20

EDP - Job Execution. Step 6

[Diagram labels: Sahara; EDP; Swift: INPUT, OUTPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM (x3); Data Processing; JobTracker; Oozie; workflow.xml]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials

Page 21

EDP - Job Execution. Step 7

[Diagram labels: Sahara; EDP; Swift: INPUT, OUTPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM (x3); Data Processing; JobTracker; Oozie; workflow.xml]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials

Page 22

Agenda

• OpenStack Data Processing Overview

• EDP Architecture & Technical Concepts

• Live Demo

Page 23

EDP BigPetStore Demo

BigPetStore is now part of Apache Bigtop

• Test/demo laboratory for all things Hadoop

• Actively developed with integration testing

• Generates and processes data of arbitrary size

• git clone git://git.apache.org/bigtop.git

• Filed under bigtop/bigtop-bigpetstore

Page 24

EDP BigPetStore Demo

What are we going to do?

• Generate 1M records of pet supply purchases

• Clean the data (“dirty CSV”)

• Extract cumulative counts by state

• Demonstrate Sahara EDP objects
  • Job Binaries
  • Jobs (Java and Pig)
  • Data Sources

Page 25

EDP BigPetStore Sample Data

Generated Data (first job)

$ hadoop fs -cat bigpetstore/gen/part-r-00000 | more

BigPetStore,storeCode_AK,1 deanna,booker,Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food

BigPetStore,storeCode_AK,10 erica,buck,Thu Dec 25 16:29:28 GMT+00:00 1969,10.5,dog-food

Cleaned Data (second job)

$ hadoop fs -cat bigpetstore/clean/part-m-00000 | more

BigPetStore storeCode_AK 1 deanna booker Sun Jan 18 20:50:06 GMT+00:00 1970 7.5 cat-food

BigPetStore storeCode_AK 10 erica buck Thu Dec 25 16:29:28 GMT+00:00 1969 10.5 dog-food
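
The generated records above mix a comma-separated key with a comma-separated value, which is what the cleaning job flattens. A minimal Python sketch of parsing one generated line follows. It is not the demo's actual cleaning code: the `Purchase` type and its field names (`record_id` in particular) are guesses from the sample rows, and it assumes whitespace separates the key from the value.

```python
from typing import NamedTuple


class Purchase(NamedTuple):
    """Hypothetical record type; field names are guessed from the sample."""
    store_code: str
    record_id: int
    first_name: str
    last_name: str
    timestamp: str
    price: float
    product: str


def parse_generated_line(line: str) -> Purchase:
    """Parse one line of the first job's output into a Purchase.

    Assumption: the key ("BigPetStore,storeCode_AK,1") and the value are
    separated by whitespace, and neither names nor products contain commas.
    """
    key, value = line.strip().split(None, 1)
    _, store_code, record_id = key.split(",")
    first, last, timestamp, price, product = value.split(",")
    return Purchase(store_code, int(record_id), first, last,
                    timestamp, float(price), product)


if __name__ == "__main__":
    sample = ("BigPetStore,storeCode_AK,1 deanna,booker,"
              "Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food")
    print(parse_generated_line(sample))
```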

Page 26

EDP BigPetStore Sample Data

Summed Data For Products by State (3rd job)

$ hadoop fs -cat bigpetstore/analyze_rel/part-r-00000 | more

US-AK cat-food 24837

US-AK dog-food 24994

US-AK fuzzy-collar 25145

US-AK antelope-caller 25024

US-AZ cat-food 25106

US-AZ dog-food 25064

US-AZ leather-collar 24870

US-AZ snake-bite ointment 24960
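
The third job's output is a count of purchases per state and product. The demo computes it on Hadoop; the plain-Python sketch below only illustrates the aggregation itself. It assumes the state label is derived from the store code (`storeCode_AK` becomes `US-AK`), and the function names are hypothetical.

```python
from collections import Counter


def state_of(store_code: str) -> str:
    """Map 'storeCode_AK' to 'US-AK' (format assumed from the sample rows)."""
    return "US-" + store_code.rsplit("_", 1)[-1]


def count_by_state_and_product(purchases):
    """Count purchases per (state, product).

    `purchases` is an iterable of (store_code, product) pairs.
    """
    counts = Counter()
    for store_code, product in purchases:
        counts[(state_of(store_code), product)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("storeCode_AK", "cat-food"),
        ("storeCode_AK", "dog-food"),
        ("storeCode_AK", "cat-food"),
        ("storeCode_AZ", "leather-collar"),
    ]
    for (state, product), total in sorted(count_by_state_and_product(sample).items()):
        print(state, product, total)
```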

Page 27

What Next for EDP

Potential Areas for Development within EDP

• Pluggable Job Execution Model
  • Allows Sahara to run jobs with additional execution engines
  • Current Oozie offerings become one of multiple options

• Expand Capabilities via Oozie
  • Support upload of user-written Oozie workflows
  • Support for coordinated jobs

• Enhanced Usability
  • Better Error Reporting
  • User Experience (UI, CLI, API)

Please send us your feedback! Ideas are always welcome.

• #openstack-sahara on freenode

• [email protected] with [openstack-dev][sahara] subject

Page 28

Design Summit Sessions

7 Sessions: Thursday 1:30 - Friday 10:30

http://goo.gl/lQXtUS

Page 29

Q&A

Page 30

Thank you!