Big Data Hadoop and Spark Development - Acadgild

Course Details

BIG DATA HADOOP & SPARK DEVELOPMENT

[email protected] | www.acadgild.com | 90360 10796

Brief About the Course

Hadoop is considered the most effective data platform for companies working with Big Data, and it is integral to storing, handling and retrieving enormous amounts of data in a variety of applications. In this course you will learn Hadoop architecture in depth, along with the key components of the Hadoop ecosystem: Hive, HBase, Sqoop, Flume and Pig.


Who should take this course

Any graduate aiming to build a career around Big Data can take this course. It is beneficial for:

• Software Developers and Architects

• Professionals with an analytics and data management profile

• Business Intelligence Professionals

• Project Managers

• Data Scientists

• Professionals with a Business Intelligence, ETL and data warehousing background

• Professionals from a testing and mainframes background

SYLLABUS

Solving the Big Data Problem & Hadoop Framework

• Why is Data So Important?

• Pre-requisite – Data Scale

• What is Big Data?

• Big Bank: Big Challenge

• Common Problems

• 3 Vs Of Big Data

• Defining Big Data

• Sources Of Data Flood

• Exploding Data Problem

• Redefining The Challenges Of Big Data

• Possible Solutions: Scaling Up Vs. Scaling Out

• Challenges Of Scaling Out

• Solution For Data Explosion - Hadoop

• Hadoop: Introduction

• Hadoop In Layman's Term

• Hadoop Ecosystem

• Evolutionary Features Of Hadoop

• Hadoop Timeline

• Why Learn Big Data Technologies?

• Who Is Using Big Data?

• HDFS: Introduction

• Design Of HDFS

• HDFS Blocks

• Components Of Hadoop 1.X

• NameNode And Hadoop Cluster

• Arrangement Of Racks

• Arrangement Of Machines And Racks

• Local FS And HDFS

Day 1 (2 Hours)

HDFS

• NameNode

• Checkpointing

• Replica Placement

• Benefits Of Replica Placement And Rack Awareness

• URI

• URL And URN

• HDFS Commands

• Problems With HDFS In Hadoop 1.X

• HDFS Federation (Included In Hadoop 2.X)

• HDFS Federation

• High Availability

• Anatomy Of File Read From HDFS

• Data Read Steps

• Important Java Classes To Write Data To HDFS

• Anatomy Of File Write To HDFS

• Writing File To HDFS: Steps

Day 2 (2 Hours)

Exploring MapReduce

• Building Principles

• Introduction To MapReduce

• MR Demo

• Pseudo Code

• Mapper Class

• Reducer Class

• Driver Code

• InputSplit

• InputSplit And Data Blocks: Difference

• Why Is The Block Size 128 MB?

• RecordReader

• InputFormat

• Default InputFormat: TextInputFormat

• InputFormat

• OutputFormat

• Using A Different OutputFormat

• Important Points

• Partitioner

• Using Partitioner

• Map Only Job

• Flow Of Operations In MapReduce

Day 3 (2 Hours)

Schedulers in YARN & Introduction to Pig

• Serialization In MapReduce

• Custom Writable In MapReduce

• Custom WritableComparable In MapReduce

• Schedulers In YARN

• FIFO Scheduler

• Capacity Scheduler

• Fair Scheduler

• Differences Between Hadoop 1.X And Hadoop 2.X

• Introduction to Apache Pig

• Why Pig?

• Apache Pig Architecture

• Simple Data Types

• Complex Data Types

Day 4 (2 Hours)

Exploring Pig

• Sample Execution

• Pig Operators demo

• Parameter Substitution

• Macros

• Anatomy Of Reduce-Side-Join

• Job Optimizations In Pig

• UDFs In Pig

• Execution Of XML And CSV Files In Pig

Day 5 (2 Hours)

Hive Introduction

• Introduction

• Hive DDL

• Demo: Databases.Ddl

• Demo: Tables.Ddl

• Hive Views

• Demo: Views.Ddl

• Architecture

• Primary Data Types

• Data Load

• Demo: ImportExport.Dml

• Demo: HiveQueries.Dml

• Demo: Explain.Hql

• Table Types

• Demo: ExternalTable.Ddl

• Complex Data Types

• Demo: Working With Complex Data Types

• Hive Variables

• Demo: Working With Hive Variables

• Hive Variables And Execution Customisation

Day 6 (2 Hours)

Hive Operations

Day 7 (2 Hours)

• Working With Arrays

• Sort By And Order By

• Distribute By And Cluster By

• Partitioning

• Static And Dynamic Partitioning

• Bucketing Vs Partitioning

• Joins And Types

• Bucket-Map Join

• Sort-Merge-Bucket-Map Join

• Left Semi Join

• Demo: Join Optimisations

• Input Formats In Hive

• Sequence Files In Hive

• RC File In Hive

• File Formats In Hive

• ORC Files In Hive

• Inline Index In ORC Files

• ORC File Configurations In Hive

Advanced Hive

Day 8 (2 Hours)

• SerDe In Hive

• Demo: CSVSerDe

• JSONSerDe

• RegexSerDe

• Analytic And Windowing In Hive

• Demo: Analytics.Hql

• HCatalog In Hive

• Demo: Using_HCatalog

• Accessing Hive With JDBC

• Demo: HiveQueries.Java

• HiveServer2 And Beeline

• Demo: Beeline

• UDF In Hive

• Demo: ToUpper.Java And Working_with_UDF

• Optimizations In Hive

• Demo: Optimizations


HBase

• Challenges With Traditional RDBMS

• Features Of NoSQL Databases

• NoSQL Database Types

• CAP Theorem

• What Is HBase?

• Regions

• HBase HMaster

• ZooKeeper

• HBase First Read

• HBase Meta Table

• Region Split

• Apache HBase Architecture Benefits

• HBase Vs. RDBMS

• Shell Commands

Day 9 (2 Hours)

Oozie and Sqoop

• Introduction To Oozie

• Oozie Architecture

• Oozie Workflow Nodes

• Oozie Server

• Oozie Workflow

• Sqoop Architecture

• Sqoop Features

Day 10 (2 Hours)

Sqoop contd. & Apache Flume

• Sqoop Hands On

• Flume: Introduction

• Flume Architecture

• Example Description

• Transactions

• Batching

• Partitioning

• Exec Source

• Spooling Directory Source

• File Channel

• Memory Channel

• Logger Sink

• HDFS Sink

Day 11 (2 Hours)

Project - 1 & Introduction to Scala - Session I

• Project Discussion

• Introduction to Functional Programming Languages and Scala

• Functional vs OOP

• Variable

• Functions

• Using if and while to define logic

• Loops in Scala

• Collections in Scala

Day 12 (2 Hours)

Scala - Session II

• Object Oriented Programming

• Classes and Objects

• Traits in Scala

• Constructors in Scala

• Method Overloading

• Implicit parameter usage

Day 13 (2 Hours)

Scala - Session III

• Inheritance - OOP

• Override modifier

• Polymorphism

• Invoking superclass methods

• Final members

• Traits in detail

Day 14 (2 Hours)

Scala - Session IV

• Control Structures in detail

• Exception Handling

• Coding without break and continue

• Coding the functional way

• Case classes in Scala

• Implicit conversions

• Parameter in depth

Day 15 (2 Hours)

Introduction to Apache Spark

• Introduction to Apache Spark

• MapReduce Limitations

• RDDs

• SparkContext, SQLContext and HiveContext

Day 16 (2 Hours)

RDDs in Spark

• Programming with RDDs

• Creating RDDs from text files

• Transformations and Actions

• How Spark execution works

• RDD APIs: filter, flatMap, fold, foreach, glom, groupBy, map, reduceByKey, zip, persist, unpersist

• Read/Write from storage

• RDD Examples

Day 17 (2 Hours)

RDDs contd. & Introduction to DataFrames

• RDD APIs: aggregate, cartesian, checkpoint, coalesce, repartition, cogroup, collectAsMap, combineByKey, count and countApprox functions

• More RDD Examples

• Schema: StructType, StructField, DataType

• DataFrame APIs and examples

Day 18 (2 Hours)

Spark SQL

• Create temporary tables

• SparkSQL

• Parquet vs Avro

• Examples and problem solving on real data using RDDs and converting the same to DataFrames

Day 19 (2 Hours)

Spark Streaming

• Demo: Spark Streaming Example

Day 20 (2 Hours)

MLlib and GraphX

• Spark MLlib

• GraphX

Day 21 (2 Hours)

Deploying a Spark Application

• Create a Spark project

• SBT / Maven

• How Maven repositories work

Day 22 (2 Hours)

Project Demo From Hadoop

• Demo: Music data analysis using Hadoop

Day 23 (2 Hours)

Project II

• Final project discussion

Day 24 (2 Hours)
