Structor - Automated Building of Virtual Hadoop Clusters


Discusses Vagrant scripts to set up and deploy a working multi-node Hadoop cluster, with or without security. All source code is available at https://github.com/hortonworks/structor .


July 2014


Owen O’Malley, owen@hortonworks.com, @owen_omalley


What’s the Problem?

• Creating a virtual Hadoop cluster is hard
  – Takes time to set up and configure a VM
  – A “learning experience” for new engineers
  – Each engineer has a different setup
  – Experimenting is hazardous!
• Setting up security is even harder
  – Most developers don’t test with security
• Need to test both Ambari and manual installs
• Need to test various operating systems


Solution

• Scripts that create a working Hadoop cluster
  – Secure or non-secure
  – Multiple nodes
• Vagrant
  – Used for creating and managing the VMs
  – Each VM starts as a base box with no Hadoop
• Puppet
  – Used for provisioning the Hadoop packages (see the sketch below)
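To make the division of labor concrete, here is a minimal Vagrantfile sketch of the Vagrant-plus-Puppet pattern. The box name and the manifests/default.pp layout come from later slides, but the snippet is an illustration, not structor’s actual Vagrantfile:

# A sketch only, not structor's real Vagrantfile.
Vagrant.configure("2") do |config|
  config.vm.box = "omalley/centos6_x64"     # base box used throughout this deck
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "manifests"     # top-level manifest directory (see the Puppet slide)
    puppet.manifest_file  = "default.pp"
  end
end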


Simplest Case – Development Box

• We’ve put everything for a development box into the Vagrant base box (CentOS 6)
  – Build tools: ant, git, java, maven, protobuf, thrift
• Downloaded once and cached
• Setup:

% vagrant init omalley/centos6_x64
% vagrant up
% vagrant ssh

• Takes less than a minute


Using the Box

• SSH in with “vagrant ssh”
  – Account: vagrant, password: vagrant
  – Become root with “sudo -i”
• Clone the directory to make copies (see the sketch after this list)
• Other useful Vagrant commands:

% vagrant status  – list virtual machines
% vagrant suspend – suspend virtual machines
% vagrant resume  – resume virtual machines
% vagrant destroy – destroy virtual machines
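A sketch of the cloning workflow, with hypothetical directory names; dropping the copied .vagrant state makes Vagrant create a fresh machine:

% cp -r devbox devbox2        # copy the project directory (names are hypothetical)
% rm -rf devbox2/.vagrant     # discard copied VM state so a new VM is created
% cd devbox2
% vagrant up                  # brings up a second, independent VM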


Setting up Non-Secure Cluster

• Commands to start the cluster:

% git clone git@github.com:hortonworks/structor.git
% cd structor
% vagrant up

• Default profile has 3 machines
  – gw – client gateway machine
  – nn – master (NameNode, ResourceManager)
  – slave1 – slave (DataNode, NodeManager)
• Installs HDFS, YARN, Hive, Pig, and ZooKeeper (smoke-test sketch below)
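Once the VMs are up, a quick smoke test from the gateway might look like this; it is a sketch that assumes the client configuration puts the hdfs and yarn commands on the PATH:

% vagrant ssh gw
$ hdfs dfs -ls /      # list the HDFS root directory
$ yarn node -list     # slave1 should appear as a running NodeManager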


Setting up your Mac

• Add the hostnames to /etc/hosts:

240.0.0.10 gw.example.com
240.0.0.11 nn.example.com
240.0.0.12 slave1.example.com
240.0.0.13 slave2.example.com
240.0.0.14 slave3.example.com

• HDFS – http://nn.example.com:50070/
• YARN – http://nn.example.com:8088/
• For security:
  – Modify /etc/krb5.conf as in README.md (rough sketch below)
  – Use Safari or Firefox (needs a config change)
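The README has the authoritative /etc/krb5.conf changes. As a rough sketch, pointing the realm from the profile at the kdc role on nn would look something like this (an illustration, not the README’s exact contents):

[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = nn.example.com          # the kdc role runs on nn in the 3-node profile
    admin_server = nn.example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM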


Setting up Secure Cluster

• Commands to start the cluster:

% ln -s profiles/3node-secure.profile current.profile
% mkdir generated     (bug workaround)
% vagrant up

• Brings up 3 machines with security
  – Includes a KDC and principals
• YARN Web UI – https://nn.example.com:8090
• “kinit vagrant” on your Mac for the Web UI
• SSH to gw and kinit for the CLI (sketch below)
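A sketch of the CLI workflow on the gateway; the vagrant principal’s password is assumed to match the account password used elsewhere in this deck:

% vagrant ssh gw
$ kinit vagrant       # obtain a Kerberos ticket (password assumed to be "vagrant")
$ klist               # verify the ticket was granted
$ hdfs dfs -ls /      # HDFS commands now authenticate via Kerberos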


Profiles

• JSON files that control the cluster
• The 3-node secure cluster (a minimal variant follows the listing):

{ "domain": "example.com", "realm": "EXAMPLE.COM",
  "security": true,
  "vm_mem": 2048, "server_mem": 300, "client_mem": 200,
  "clients": [ "hdfs", "yarn", "pig", "hive", "zk" ],
  "nodes": [
    { "hostname": "gw", "ip": "240.0.0.10", "roles": [ "client" ] },
    { "hostname": "nn", "ip": "240.0.0.11",
      "roles": [ "kdc", "nn", "yarn", "hive-meta", "hive-db", "zk" ] },
    { "hostname": "slave1", "ip": "240.0.0.12", "roles": [ "slave" ] } ] }
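Following the same schema, a minimal single-node non-secure profile might look like this; it is an illustration of the file’s shape, not a copy of the repo’s 1node-nonsecure profile:

{ "domain": "example.com", "realm": "EXAMPLE.COM",
  "security": false,
  "vm_mem": 2048, "server_mem": 300, "client_mem": 200,
  "clients": [ "hdfs", "yarn" ],
  "nodes": [
    { "hostname": "nn", "ip": "240.0.0.11",
      "roles": [ "client", "nn", "yarn", "slave" ] } ] }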


Additional Profiles

• Various profiles
  – 1node-nonsecure
  – 3node-secure
  – 5node-nonsecure
  – ambari-nonsecure
  – knox-nonsecure
• A great way to set up an Ambari cluster
• Project owners should add their project
  – Help other developers use your project


Choosing HDP versions

• The master branch is Hadoop 2.4
  – There is also a Hadoop 1.1 (hdp-1.3) branch
• All packages are installed via Puppet
  – Uses the built-in OS package tools
• The repo file is files/repos/hdp.repo
  – Can override the source of packages
  – Easy to change to download custom builds (see the sketch below)
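For reference, a yum repo file of that shape looks roughly like this; the baseurl is a placeholder assumption, the real value lives in files/repos/hdp.repo:

[HDP]
name=HDP
baseurl=http://repo.example.com/hdp/centos6/   # placeholder; point at custom builds to override
enabled=1
gpgcheck=0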


Configuration Files (e.g. core-site.xml)

• Each configuration file is templated
• HDFS configuration is in
  – modules/hdfs_client/templates/*.erb
  – Changes there apply to all nodes
• We use Ruby to find the NameNode (a YARN variant follows):

<% @namenode = eval(@nodes).select {|node|
     node[:roles].include? 'nn'}[0][:hostname] + "." + @domain; %>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<%= @namenode %>:8020</value>
</property>
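The same lookup works for any role. For example, a yarn-site.xml template could find the ResourceManager host through the 'yarn' role; this is a hypothetical sketch in the same style, not a template from the repo:

<% @rm = eval(@nodes).select {|node|
     node[:roles].include? 'yarn'}[0][:hostname] + "." + @domain; %>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value><%= @rm %></value>
</property>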


Puppet

• The actual work is done via Puppet
  – Hides the details of each OS
• Modularized
  – Top level is manifests/default.pp
  – Each module is in modules/*
• The top level looks like this (hypothetical extension below):

include selinux
include ntp
if $security == "true" and hasrole($roles, 'kdc') {
  include kerberos_kdc
}
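Other roles presumably follow the same conditional pattern. A hypothetical extension for the slave and client roles might read as follows; the datanode and nodemanager module names are assumptions, though hdfs_client appears on the configuration slide:

if hasrole($roles, 'slave') {
  include hdfs_datanode     # assumed module name
  include yarn_nodemanager  # assumed module name
}
if hasrole($roles, 'client') {
  include hdfs_client
}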


Future Directions

• Add other Hadoop ecosystem tools
  – Tez
  – HBase
• Add other operating systems
  – Ubuntu, SUSE, CentOS 5
• Support other Vagrant providers
  – Amazon EC2
  – Docker
• Support other backing relational databases


Thank You! Questions & Answers
