Merging the insightful power of Hadoop with the management capabilities of OpenStack via Sahara
Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks
What is OpenStack?
OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara
Sahara is combining use cases
Trends
[Chart: Google Trends search interest for Hadoop, EC2, and OpenStack]
www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
Data analysis is hard
Data analysis is hard...
● Come up w/ a relevant question
  ○ The question you answer won’t be the question you set out to ask
  ○ Mine: Can I predict doctor specialty from what procedures they perform?
● Find the data
  ○ Tons, little consistency, unknown origin, hoarded
  ○ Data w/o a dictionary is worse than code w/o comments. Run away!
Data analysis is hard...
● Data usability
  ○ Acceptable license? (Even for Gov’t sets)
    ■ Mine: Metadata copyrighted by AMA!
  ○ Private is often highly protected, no/narrow DMZ
● Explore and clean
  ○ Two of the oldest people in the medical profession working with Medicare
  ○ Stephen Glasser graduated in 1773
  ○ Cheryl Palma graduated in 1776
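The "explore and clean" step can be sketched as a simple plausibility check. This is only an illustration: the cutoff year, the `Provider` record, and the third row are assumptions made up for the example; the two suspicious rows are the ones quoted on the slide.

```python
from dataclasses import dataclass

# Assumption: a graduation year earlier than this is treated as a data-entry error.
EARLIEST_PLAUSIBLE_YEAR = 1930


@dataclass
class Provider:
    name: str
    graduation_year: int


def implausible(providers, earliest=EARLIEST_PLAUSIBLE_YEAR):
    """Return the providers whose graduation year is earlier than `earliest`."""
    return [p for p in providers if p.graduation_year < earliest]


providers = [
    Provider("Stephen Glasser", 1773),
    Provider("Cheryl Palma", 1776),
    Provider("Example Provider", 1998),  # hypothetical row, added for contrast
]

suspect = implausible(providers)
for p in suspect:
    print(f"{p.name}: graduated in {p.graduation_year}?")
```

Flagging rows for review, rather than silently dropping them, keeps the decision about what the bad value really means with the analyst.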
Data analysis is hard...
● You got some answer to a question you approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage tools and infrastructure!
Sahara’s goal
Make managing Hadoop+ infrastructure and tools so simple that they get out of your way
Sahara provides
● Apache Hadoop cluster and workload management
  ○ Cluster - construct and manage the lifecycle of a Hadoop cluster
  ○ Workload - workflow for big data processing with Hadoop (AWS EMR-like)
● Through a Python library, REST API, Web UI, command line interface
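To give a feel for the REST API entry point, here is a minimal sketch that builds (but does not send) a cluster-creation request. The URL layout, port, and field names follow my reading of the v1.1 API and should be checked against the Sahara docs; the endpoint, tenant, token, IDs, and version string are placeholders, and `build_create_cluster_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_create_cluster_request(endpoint, tenant_id, token, cluster):
    """Build, without sending, a POST request asking Sahara to create a cluster."""
    url = f"{endpoint}/v1.1/{tenant_id}/clusters"  # assumed v1.1 URL layout
    return urllib.request.Request(
        url,
        data=json.dumps(cluster).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # Keystone token, obtained elsewhere
        },
        method="POST",
    )


# Placeholder values; none of these identify a real deployment.
cluster = {
    "name": "demo-cluster",
    "plugin_name": "vanilla",
    "hadoop_version": "2.3.0",
    "cluster_template_id": "TEMPLATE-ID",
    "default_image_id": "IMAGE-ID",
}

req = build_create_cluster_request(
    "http://sahara.example.com:8386", "TENANT-ID", "TOKEN", cluster
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit it; the Python library, Web UI, and command line interface are other routes to the same operation.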
Sahara’s architecture
[Architecture diagram. Boxes shown:
● Clients - Horizon (Sahara Pages), Sahara Python Client, REST API
● Sahara Service - Auth, Cluster Configuration Manager, Data Access Layer, Vendors Plugins, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources
● OpenStack services - Keystone, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB
● Provisioned resources - Hadoop VMs]
Sahara’s features
● Plugin mechanism - distro choice
● Cluster scaling - elasticity
● Swift integration - data storage
● Cinder integration - persistent HDFS
● Network management with Nova and Neutron
● Anti-affinity, separate services on physical hardware
● Data locality with Swift
● Repeatable cluster creation w/ template mechanism
● http://docs.openstack.org/developer/sahara/userdoc/features.html
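The template and scaling features can be pictured with a small data model. This is a sketch of the idea only, not Sahara's own classes: `NodeGroupTemplate`, `ClusterTemplate`, the flavor names, and the instance naming scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroupTemplate:
    """Hypothetical node group: one kind of VM and the processes it runs."""
    name: str
    flavor: str
    processes: tuple


@dataclass
class ClusterTemplate:
    """Hypothetical cluster template: node groups with counts."""
    name: str
    plugin: str
    node_groups: list          # list of (NodeGroupTemplate, count) pairs
    anti_affinity: tuple = ()  # processes to keep on separate physical hosts

    def instances(self):
        """Expand the template into (instance name, processes) pairs."""
        out = []
        for group, count in self.node_groups:
            for i in range(1, count + 1):
                out.append((f"{self.name}-{group.name}-{i:03d}", group.processes))
        return out

    def scaled(self, group_name, count):
        """Return a copy with one node group resized (cluster scaling)."""
        groups = [(g, count if g.name == group_name else c)
                  for g, c in self.node_groups]
        return ClusterTemplate(self.name, self.plugin, groups, self.anti_affinity)


master = NodeGroupTemplate("master", "m1.medium", ("namenode", "resourcemanager"))
worker = NodeGroupTemplate("worker", "m1.large", ("datanode", "nodemanager"))
small = ClusterTemplate("demo", "vanilla", [(master, 1), (worker, 3)],
                        anti_affinity=("datanode",))
big = small.scaled("worker", 5)
print(len(small.instances()), len(big.instances()))
```

Because the template is plain data, creating the same cluster again means reusing the same template, and scaling changes a single count.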
Storage considerations
● Swift
  ○ Input/output through Swift HCFS plugin
  ○ Intermediate data stored in HDFS on cluster
  ○ Locality when co-locating Swift & nova-compute
● HDFS
  ○ Local (long lived cluster) and remote (copy in)
● HDFS backed by ephemeral disk or Cinder
  ○ Ephemeral - /var/lib/nova/instances on compute host
  ○ Cinder - persistent block devices attached to instances
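For the Swift input/output case, a job typically refers to its data by URL. The sketch below assumes the `swift://<container>.sahara/<path>` form and the `fs.swift.service.sahara.*` credential keys as I recall them from the Sahara Swift integration docs; treat both as assumptions to verify. The helper names, container names, and credentials are hypothetical.

```python
def swift_url(container, object_path, provider="sahara"):
    """Build a Swift URL of the assumed form swift://<container>.<provider>/<path>."""
    return f"swift://{container}.{provider}/{object_path.lstrip('/')}"


def swift_credentials(username, password, provider="sahara"):
    """Return the Hadoop configuration entries assumed to carry Swift credentials."""
    prefix = f"fs.swift.service.{provider}"
    return {f"{prefix}.username": username, f"{prefix}.password": password}


# Placeholder containers and credentials, for illustration only.
job_io = {
    "input": swift_url("logs", "/2014/06/access.log"),
    "output": swift_url("results", "wordcount"),
}
job_conf = swift_credentials("demo-user", "demo-password")

print(job_io["input"])
print(sorted(job_conf))
```

Intermediate data would still land in the cluster's HDFS; only the job's input and output cross into Swift.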
Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors integrate their management software
● It’s how users pick different software versions
● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager), and Spark (w/ minimal CDH)
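One way to picture the plugin mechanism is a registry of provisioning plugins, each advertising the versions and processes it supports. This is a hypothetical sketch, not Sahara's plugin interface: the class names, the registry, and the abbreviated process lists are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

PLUGINS = {}  # hypothetical registry: plugin name -> plugin instance


def register(plugin_cls):
    """Class decorator that adds one instance of the plugin to the registry."""
    PLUGINS[plugin_cls.name] = plugin_cls()
    return plugin_cls


class ProvisioningPlugin(ABC):
    """Hypothetical contract a distribution vendor would implement."""
    name = ""

    @abstractmethod
    def versions(self):
        """Return the distribution versions this plugin can deploy."""

    @abstractmethod
    def node_processes(self, version):
        """Return the processes that can be placed on nodes for a version."""


@register
class HDPPlugin(ProvisioningPlugin):
    name = "hdp"
    # Abbreviated process lists, for illustration only.
    _processes = {
        "1.3": ["NameNode", "DataNode", "Job Tracker", "Task Tracker"],
        "2.0": ["NameNode", "DataNode", "Resource Manager", "History Server"],
    }

    def versions(self):
        return sorted(self._processes)

    def node_processes(self, version):
        return list(self._processes[version])


def validate_choice(plugin_name, version):
    """Return the plugin if the user's distro and version choice is supported."""
    plugin = PLUGINS.get(plugin_name)
    if plugin is None:
        raise ValueError(f"unknown plugin: {plugin_name}")
    if version not in plugin.versions():
        raise ValueError(f"{plugin_name} does not support version {version}")
    return plugin


print(validate_choice("hdp", "2.0").node_processes("2.0"))
```

The user-facing choice reduces to a plugin name plus a version string, while everything vendor-specific stays behind the plugin boundary.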
HDP Plugin Overview
● Full support for all Sahara functionality
● Nova and Neutron network
● Cluster Scaling
● Scale Up
● Swift Integration
● Cinder Support
● Data Locality
● EDP
● Apache Ambari REST APIs used for cluster provisioning
● Monitoring/Management of clusters via Ambari
● Full support for multiple HDP stacks
● HDP pre-installed or generic VM images
HDP Plugin Stack Support
HDP 1.3 (Available)
● NameNode
● Secondary NameNode
● DataNode
● HDFS
● ZooKeeper
● Ambari Server/Agent
● HCatalog
● Sqoop
● Job Tracker
● Task Tracker
● MapReduce
● Hive
● MySQL
● Pig
● WebHCat Server
● Oozie
● Ganglia
● Nagios
● HBase
HDP 2.0 (Available)
● History Server
● MapReduce 2 / YARN
● Resource Manager
● YARN Client
HDP 2.1 (Coming Soon!)
● Storm
● Falcon
HDP 2.1+ (Roadmap)
● SOLR
● Cascading
Ambari Blueprints
● Two primary goals of Ambari Blueprints
  ○ Ability to export a complete description of a running cluster
  ○ Provide API based cluster installations based on a self-contained cluster description
● Blueprints contain cluster topology and configuration information
● Enables interesting use cases between physical and virtual, including OpenStack/Sahara
Blueprint API
1. BLUEPRINT: POST /blueprints/my-blueprint
2. CLUSTER INSTANCE: POST /clusters/MyCluster
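The two-step flow can be sketched as two HTTP requests, built here without being sent. The `/api/v1` prefix, the `X-Requested-By` header, basic authentication, and the host, port, and credentials are assumptions about a typical Ambari setup rather than something stated on the slide; `ambari_post` is a hypothetical helper and the request bodies are abridged placeholders.

```python
import base64
import json
import urllib.request


def ambari_post(base_url, path, body, user, password):
    """Build, without sending, an authenticated POST to an Ambari server."""
    auth = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/v1{path}",  # assumed API prefix
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {auth}",
            "X-Requested-By": "ambari",  # assumed to be required for writes
        },
        method="POST",
    )


base = "http://ambari.example.com:8080"  # placeholder server
blueprint = {"Blueprints": {"stack_name": "HDP", "stack_version": "2.0"}}
cluster = {"blueprint": "my-blueprint", "host_groups": []}

steps = [
    ambari_post(base, "/blueprints/my-blueprint", blueprint, "admin", "admin"),  # step 1
    ambari_post(base, "/clusters/MyCluster", cluster, "admin", "admin"),         # step 2
]
for req in steps:
    print(req.get_method(), req.full_url)
```

Order matters: the cluster instance names the blueprint it is built from, so the blueprint has to be registered first.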
Example: Single-Node Definitions

BLUEPRINT

    {
      "configurations" : [
        { "hdfs-site" : { "dfs.namenode.name.dir" : "/hadoop/nn" } }
      ],
      "host_groups" : [
        {
          "name" : "uber-host",
          "components" : [
            { "name" : "NAMENODE" },
            { "name" : "SECONDARY_NAMENODE" },
            { "name" : "DATANODE" },
            { "name" : "HDFS_CLIENT" },
            { "name" : "RESOURCEMANAGER" },
            { "name" : "NODEMANAGER" },
            { "name" : "YARN_CLIENT" },
            { "name" : "HISTORYSERVER" },
            { "name" : "MAPREDUCE2_CLIENT" }
          ],
          "cardinality" : "1"
        }
      ],
      "Blueprints" : {
        "blueprint_name" : "single-node-hdfs-yarn",
        "stack_name" : "HDP",
        "stack_version" : "2.0"
      }
    }

CLUSTER INSTANCE

    {
      "blueprint" : "single-node-hdfs-yarn",
      "host_groups" : [
        {
          "name" : "uber-host",
          "hosts" : [ { "fqdn" : "c6401.ambari.apache.org" } ]
        }
      ]
    }

Description
● Single-node cluster
● Use HDP 2.0 Stack
● HDFS + YARN + MR2
● Everything on c6401
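A quick consistency check between the two documents can catch mistakes before anything is posted. The sketch below uses an abridged copy of the example; `check_cluster` is a hypothetical helper, and treating `cardinality` as an exact host count is an assumption of this sketch, not documented Ambari behavior.

```python
# Abridged from the single-node example (component list shortened).
blueprint = {
    "host_groups": [
        {
            "name": "uber-host",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
            "cardinality": "1",
        }
    ],
    "Blueprints": {
        "blueprint_name": "single-node-hdfs-yarn",
        "stack_name": "HDP",
        "stack_version": "2.0",
    },
}

cluster = {
    "blueprint": "single-node-hdfs-yarn",
    "host_groups": [
        {"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}
    ],
}


def check_cluster(blueprint, cluster):
    """Return a list of problems found when matching a cluster to a blueprint."""
    problems = []
    if cluster["blueprint"] != blueprint["Blueprints"]["blueprint_name"]:
        problems.append("blueprint name mismatch")
    groups = {g["name"]: g for g in blueprint["host_groups"]}
    for group in cluster["host_groups"]:
        spec = groups.get(group["name"])
        if spec is None:
            problems.append(f"unknown host group: {group['name']}")
            continue
        cardinality = spec.get("cardinality", "")
        # Assumption: a purely numeric cardinality means "exactly this many hosts".
        if cardinality.isdigit() and len(group["hosts"]) != int(cardinality):
            problems.append(f"{group['name']}: expected {cardinality} host(s)")
    return problems


print(check_cluster(blueprint, cluster) or "cluster instance matches blueprint")
```

The check only compares names and counts; whether the hosts exist and can run the listed components is still up to Ambari.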
Demo - youtu.be/vmry_kXqn4c
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore
  ○ A full stack Hadoop application
  ○ Uses the main players in the Hadoop ecosystem
  ○ To demonstrate a single domain
  ○ Just accepted into the Bigtop project!
● Come by the Red Hat booth - G18
Q&A
● Status - Integrated for Juno (Oct 2014)
● Distro - RDO (Fedora/RHEL/CentOS), RHEL OSP 5, ...
● Home - https://launchpad.net/sahara
● Docs - http://docs.openstack.org/developer/sahara
● Code - https://github.com/openstack/ *sahara*
● Email - openstack-dev w/ [sahara]
● IRC - #openstack-sahara on freenode