

800-633-1440 1-800-633-1440                   www.mindshare.com

[email protected]

Comprehensive Hadoop Programming and Administration Training

Let MindShare bring “Hadoop Programming and Administration” to life for you

MindShare’s Hadoop Programming and Administration course is an extensive course on the open-source Apache Hadoop architecture. The course begins by addressing Big Data issues and the role Hadoop plays in solving them. It then moves gradually into the finer details of the Hadoop architecture, its deployment and administration, and the Hadoop ecosystem, with hands-on sections throughout the course to reinforce each topic.

The hands-on labs lead you through Hadoop installation, deployment, and administration, and teach you Hadoop programming in Java and in other languages using the streaming API. The labs demonstrate the use of Hadoop Distributed File System (HDFS) shell commands and show techniques for debugging and for integrating Hadoop with your current workflow.

You will Learn:

• What is Big Data and what challenges does it pose?
• What are the drawbacks of traditional database systems?
• How does Hadoop fit as a solution to Big Data problems?
• The detailed architecture of HDFS, the operations performed, APIs, and shell commands
• The detailed architecture of MapReduce and the MapReduce classes
• What is Hadoop Streaming and how is it used with Python?
• Data redundancy issues in Hadoop and the solutions offered
• Common configuration files for Hadoop deployment
• How to debug using MRUnit?
• How does Hadoop handle failures and what are its fault tolerance metrics?
• The detailed architecture of each component of the Hadoop ecosystem

Who Should Attend?

The course is intended mainly for programmers, web developers, data scientists, system administrators, and statisticians who want to understand the depth and breadth of the Apache Hadoop open-source platform in order to handle large volumes of data in a cost-effective and performance-efficient manner. It is, however, not restricted to this list. In fact, the first day of the course provides an overview of Hadoop and its ecosystem that can directly benefit managers, directors, and vice presidents.

Course Length: 4 Days

Course Outline:

Day 1: Hadoop Concepts and Overview

• Introduction to Big Data
  o What is Big Data
  o Importance of Big Data in modern world
  o Challenges in Big Data

• Traditional vs Big Data Approach
  o Traditional approach
  o Limitations of Traditional approach
  o Big data approach – scaling out
  o Scaling out challenges

• Hadoop
  o Introduction
  o Features
  o Hadoop Use-Cases
  o Building blocks
  o Introduction to Hadoop Ecosystem


• HDFS
  o What is HDFS?
  o Design of HDFS
  o Limitations of HDFS

• MapReduce
  o MapReduce Introduction
    ▪ Split
    ▪ Map Step
    ▪ Shuffle
    ▪ Reduce Step
  o MapReduce Life cycle
  o Dissection of WordCount Example

• How to run Hadoop
  o Download Hadoop
  o Hadoop Distributions

• Hadoop Alternatives

Day 2: Hadoop Architecture

• Hadoop Distributed File System (HDFS)
  o Where HDFS fits in Hadoop?
  o DFS
  o HDFS Design Goals
  o When HDFS is not a good fit?
  o HDFS Key Concepts
  o HDFS Protocol
  o HDFS Usage

• MapReduce
  o Where MapReduce fits in Hadoop?
  o MapReduce Basics
  o Job Launch
  o Tasks
  o MapReduce Data Types
  o MapReduce Hands-On

• Hadoop Streaming
  o What is Hadoop Streaming?
  o Using Python with Hadoop
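
As a rough illustration of the Hadoop Streaming and MapReduce topics listed for Day 2, the sketch below shows a word-count mapper and reducer written in the style Streaming scripts usually follow: each consumes lines of text and emits tab-separated key/value lines. It is not taken from the course material. The names `map_lines`, `reduce_lines`, and `run_local` are hypothetical, and `run_local` only imitates the shuffle by sorting the mapper output so the logic can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Like a Streaming reducer, it assumes its input arrives sorted by key.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Imitate map -> shuffle (sort) -> reduce on a single machine (hypothetical helper)."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for out in run_local(["Hadoop stores data", "Hadoop processes data"]):
        print(out)
```

On a real cluster the mapper and reducer would normally live in two separate scripts that read `sys.stdin` and print to standard output, and would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options together with `-input` and `-output`. The location of the jar depends on the installation.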

Day 3: Hadoop Deployment and Administration

• Hadoop in Real World
  o MapReduce Joins
    ▪ What are joins
    ▪ Map-side joins
    ▪ Reduce-side joins
    ▪ In-memory joins
  o Hadoop debugging
    ▪ Challenges in debugging Hadoop
    ▪ Best practices and pitfalls
    ▪ MRUnit

• Hadoop Administration
  o Hadoop configuration
    ▪ HDFS Configuration
    ▪ MapReduce Configuration
  o Hadoop cluster launch
  o Hadoop cluster management
    ▪ Balancer
    ▪ dfsadmin


    ▪ fsck
  o Hardware selection
  o Fault tolerance and recovery

• Hadoop Cluster Monitoring
  o Metrics
  o Ganglia
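
For the reduce-side joins listed under MapReduce Joins in the Day 3 outline, the general idea can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions and not the course's own lab code: the `users`/`orders` datasets, the tag names, and every function name are hypothetical, and the in-memory `shuffle` helper merely stands in for Hadoop's shuffle phase. Each record is tagged with its source in the map step so that the reduce step can pair records from both sides that share a key (an inner join).

```python
from collections import defaultdict


def map_tagged(records, tag, key_index=0):
    """Map step: emit (join_key, (tag, record)) so the source survives the shuffle."""
    for record in records:
        yield record[key_index], (tag, record)


def shuffle(pairs):
    """Group mapper output by key, standing in for Hadoop's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_join(key, values, left_tag="users", right_tag="orders"):
    """Reduce step: pair every left record with every right record for one key."""
    left = [rec for tag, rec in values if tag == left_tag]
    right = [rec for tag, rec in values if tag == right_tag]
    # Drop the key column from each side; keys present on one side only yield nothing.
    return [(key, lrec[1:], rrec[1:]) for lrec in left for rrec in right]


def reduce_side_join(users, orders):
    """Run the three steps in order over two small in-memory datasets."""
    pairs = list(map_tagged(users, "users")) + list(map_tagged(orders, "orders"))
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values))
    return joined


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "lamp")]
    for row in reduce_side_join(users, orders):
        print(row)
```

A map-side join would instead load the smaller dataset into memory inside each mapper and skip the shuffle for the join itself. The reduce-side form shown here works for inputs of any size, at the cost of moving both datasets through the shuffle.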

Day 4: Hadoop Ecosystem

• Hive
  o Overview
  o Architecture and Components
  o Hive vs RDBMS
  o HQL

• Pig
  o Overview
  o Architecture
  o Pig Latin

• HBase
  o Overview
  o Architecture
  o HBase vs RDBMS
  o Data Model & Operators

• Sqoop
  o Overview
  o Architecture

• Flume
  o Overview
  o Twitter example

• Oozie
  o Overview
  o Workflow
  o Console

• Miscellaneous
  o Cascading
  o ZooKeeper
  o Mahout

Hands-On Labs:

DAY 1: Hadoop Concepts and Overview

• Set up a machine and install Hadoop on a single node
• Run the sample WordCount application

DAY 2: Hadoop Architecture

• Write a Java MapReduce job
• Run a sample Python MapReduce job using the streaming API
• Try basic HDFS shell commands

DAY 3: Hadoop Deployment

• Deploy a multi-node Hadoop cluster
• Develop a simple MapReduce algorithm
• Debug a bug in a mapper/reducer

DAY 4: Integrating Hadoop in your workflow

• Install Hive, Pig, and Sqoop
• Import a data set from MySQL into Hadoop using Sqoop
• Process logs using Pig (to show the convenience)
• Create a report using Hive (to show the resemblance to traditional SQL)


Course Material:

1) Downloadable PDF version of the presentation slides
2) USB stick containing a virtual machine with Hadoop and examples

Recommended Prerequisites: Basic knowledge of databases, SQL, and UNIX commands