Big Data Ecosystem Framework, Architecture and Deployment in Enterprise
BRKAPP-2027
Nimish Desai
Technical Leader
Session Objectives
Understand the Big Data landscape & its applications in the enterprise
Understand Hadoop's inner workings
Understand and demystify network/compute architecture with Hadoop
Validation results with Hadoop workloads – network & compute
Building Blocks, Landscape and Ecosystem
Definitions & Nature of Data
Structured Data
– Schema Based
– Defined semantics via row/column-oriented data structures
– Records retrieved through key-value pairs
Data Warehouse & RDBMS
– Oracle, Sybase, and SQL Server are structured relational databases
– ERP and CRM Feeds
– Traditional Transactional (OLTP) & Reporting(OLAP) via ETL, API Access
Unstructured Data & Semi-structured Data
– Text, Logs, Web-Clicks, Sensors Data, Picture, Video
– Each has some structure or semi-structure; e.g., a TXT file is delimited by spaces, tabs, semicolons, etc.
– When various data types are combined in one data set, it becomes unstructured in the sense of semantics
“Every day I wake up and ask, how can I flow data better,
manage data better, analyze data better?”
Rollin Ford
(CIO, Walmart)
[Figure: data challenges – too diverse, too large, timeframe, cost]
“Every two days we create as much information as we did
from the dawn of civilization up until 2003.”
Eric Schmidt
(Chairman of Google)
Why Big Data?
Challenges
Complex Data
– The explosion of unstructured data poorly suited for transaction-oriented, rigid-schema RDBMS’s
Multiple data sources & Lots of it
– Social Media, Sensors, Web-clicks, Pictures
Requires linear scaling – compute/storage - horizontally
CPU horsepower/density outpacing spinning disk performance leaves compute starved of data
Solution
Batch Processing – Data never dies
Move the compute to the data, avoid SAN/NAS bottlenecks
Build scalability into the code - Make it easy for developers to write distributed code
Leverage economy of scale (read commonly available hardware, low power footprint etc.)
Design for frequent partial failure and recovery – simplify development for distributed computing
Infinite Use Cases
Web & E-Commerce
– Faster User Response
– Customer Behaviors & Pricing Models
– Ad Targeting
Retail
– Customer Churn & Integration of brick & mortar with .com business models
– PoS Transactional Analysis
Insurance & Finance
– Risk Management
– User Behavior & Incentive Management
– Trade Surveillance for Financials
Network Analytics – Splunk
– Text Mining
– Fault Prediction
Security & Threat Defense
Big Data Application Realm – Web 2.0 & Social/Community Networks
Data lives/dies within Internet-only entities
Data domain is partially private
Homogeneous Data Life Cycle
– Mostly Unstructured
– Web Centric, User Driven
– Unified workload – few process & owners
– Typically non-virtualized
Scaling & Integration Dynamics
– Purpose Driven Apps
– Thousands of nodes
– Hundreds of PB and growing exponentially
[Diagram: data store – service – UI]
Big Data Application Realm - Enterprise
Data Lives in a confined zone of enterprise repository
– Long Lived, Regulatory and Compliance Driven
Heterogeneous Data Life Cycle
– Many Data Models
– Diverse data – Structured and Unstructured
– Diverse data sources - Subscriber based
– Diverse workloads from many sources/groups/processes/technologies
– Virtualized and non-virtualized with mostly SAN/NAS base
Each app/group/technology is limited in
– data generation
– Consumption
– Servicing confined domains
Scaling & Integration Dynamics are different
– Data Warehousing(structured) with diverse repository + Unstructured Data
– Few hundred to thousand nodes, few PB
– Integration, Policy & Security Challenges
[Diagram: enterprise data sources and consumers – Customer DB (Oracle/SAP), Social Media, ERP Modules A & B, Data Service, Sales Pipeline, Call Center, Product Catalog, Catalog Data, Video Conf, Collab, Office Apps, Records Mgmt, Doc Mgmt A & B, VoIP, Exec Reports]
Data Sources
Enterprise applications – Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customer Profiles
Other sources – Machine logs, Sensor data, Call data records, Web click-stream data, Satellite feeds, GPS data, Sales data, Blogs, Emails, Pictures, Video
Big Data Framework Application Comparison
12
Relational Database
• Structured Data – Row Oriented
• Optimized for OLTP/OLAP
• Rigid schema applied to data on insert/update
• Read and write (insert, update) many times
• Non-linear scaling
• Most transactions and queries involve a small subset of data set
• Transactional – scaling to thousands of queries
• GB to TBs size
Batch-oriented Big Data (Hadoop)
• Unstructured Data – Files, logs, Web-Clicks
• Data format is abstracted to higher-level application programming
• Schema-less, flexible for later re-use
• Write once, read many
• Data never dies
• Linear scaling
• Entire data set at play for a given query
• Multi PB
Real-time Big Data NoSQL
• HBase, Cassandra, Oracle NoSQL
• Structured and Unstructured Data
• Sparse column-family data storage or Key-value pair
• Not a RDBMS, though with some schema
• Random read and write
• Modeled after Google’s BigTable
• High transaction – real time scaling to millions
• Not suited for ad-hoc analysis
• More suited for ~1 PB
Data Center Infrastructure
[Diagram: typical data center infrastructure – WAN edge; core layer (LAN & SAN, Nexus 7000 10 GE); aggregation & services layer (Layer 3 / Layer 2, vPC+, FabricPath, network services); access layer (Nexus 5500 10GE, Nexus 2148TP-E / 2232 / 2248 FEX, Nexus 3000 ToR, Nexus 7000 end-of-row, CBS 31xx blade switch, B22 FEX for HP C-class blades, UCS FCoE); SAN edge (MDS 9500 directors, MDS 9200/9100); 1 GbE server access & 4/8 Gb FC via dual HBA (SAN A // SAN B), or 10 Gb DCB/FCoE server access, or 10 GbE server access & 4/8 Gb FC]
Big Data Building Blocks into the Enterprise
[Diagram: traditional database (RDBMS); storage (SAN and NAS); "Big Data" store and analyze; "Big Data" real-time capture, read and update operations (NoSQL); applications (virtualized, bare metal and cloud); data sources (sensor data, logs, social media, click streams, mobility trends, event data) – all interconnected over the Cisco Unified Fabric]
Sample of Big Data Ecosystem
Hadoop distribution (similar to what Red Hat does for Linux) – services and support model
Spin-out from Yahoo; services and support model for Apache Hadoop
Rewrote Hadoop with many optimizations (rewrote HDFS as a C++ filesystem and distributed the metadata)
EMC Greenplum Hadoop distribution; uses MapR.
Hadoop distribution and NoSQL-like offering announced at Oracle OpenWorld; very similar to HBase/other NoSQL offerings. Based on BerkeleyDB.
Other NoSQL-like offerings
Various Others
Hadoop Basics
Q: What is Hadoop?
A: Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.
Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.
Main Hadoop Building Blocks
Hadoop has many building blocks…At the base is a way to Store and Process unstructured data…
– Hadoop Distributed File System (HDFS) – at the base, a self-healing clustered storage system
– MapReduce – distributed data processing
– Pig, Hive, Sqoop – top-level abstractions
– ETL tools, BI reporting, RDBMS – top-level interfaces
– HBase – database with real-time access
– Apps / API access; Flume for data ingest
Hadoop Components and Operations
Scalable & Fault Tolerant
Types of Functions
– Name Node (master) – manages the cluster
– Data Node (map and reduce) – holds blocks
Data is not centrally located; it is stored across all data nodes in the cluster
Data is divided into multiple large blocks – 64 MB by default, 128 MB typical
Blocks are not related to disk geometry
Data is stored reliably – each block is replicated 3 times by default (a capacity sketch follows the diagram below)
Hadoop Distributed File System
[Diagram: blocks 1–6 distributed and replicated across data nodes 1–15, three groups of five nodes each behind a ToR FEX/switch]
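A back-of-the-envelope sketch of what the defaults above imply for block count and raw capacity; the helper name and the decimal-MB block size are illustrative assumptions, not figures from the session.

import math

def hdfs_footprint(input_bytes, block_bytes=128 * 10**6, replication=3):
    # One HDFS block per 128 MB of input (decimal MB, matching the
    # "1 TB == 7,813 map tasks" figure later in the deck); 3 replicas per block.
    blocks = math.ceil(input_bytes / block_bytes)
    raw_bytes = input_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(10**12)   # a 1 TB data set
print(blocks, raw / 10**12)            # -> 7813 blocks, 3.0 TB of raw disk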
Hadoop Components and Operations
Name Node
– Runs a scheduler – the Job Tracker
– Manages all data nodes; metadata is kept in memory
– Secondary Name Node – snapshots the HDFS cluster metadata
– Typically all three JVMs can run on a single node
Data Node
– Task Tracker receives job info from the Job Tracker (Name Node)
– Map & Reduce tasks are managed by the Task Tracker
– Configurable ratio of Map & Reduce tasks for various workloads, per node/CPU/core
– Data locality – if data is not available where the map task is assigned, the missing block is copied over the network
HDFS Architecture
[Diagram: data nodes 1–15 across three ToR FEX/switches, with replicas of blocks 1–4 distributed among them]
Name Node: file metadata is stored in fsimage and an in-memory-only (!) map of blocks to data nodes, e.g.:
/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4
Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
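As a toy illustration of the in-memory structures described above (plain Python dictionaries invented for illustration – not Hadoop's actual classes; replica placements beyond the slide's example are made up):

# Toy model of the Name Node metadata from the example above.
file_to_blocks = {
    "/usr/sean/foo.txt": ["blk_1", "blk_2"],
    "/usr/jacob/bar.txt": ["blk_3", "blk_4"],
}
block_to_datanodes = {
    "blk_1": ["datanode1", "datanode6", "datanode11"],   # 3 replicas per block
    "blk_2": ["datanode2", "datanode7", "datanode12"],
    "blk_3": ["datanode2", "datanode3", "datanode13"],
    "blk_4": ["datanode4", "datanode8", "datanode14"],
}

def locate(path):
    # Return the data nodes holding each block of a file.
    return {blk: block_to_datanodes[blk] for blk in file_to_blocks[path]}

print(locate("/usr/sean/foo.txt"))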
Hadoop Components and Operations – Rack Awareness: FUD & Clarification
The basic intent is to avoid having ALL copies of a given block in the same rack, to prevent data loss if that rack fails
How does the Hadoop Name Node know which rack holds which blocks?
– It does not
– By default, rack awareness is essentially "off" – all nodes are part of the same Hadoop rack
– If configured (via an admin-supplied topology script – a sketch follows the diagram below), the Name Node places the 2nd copy in a different rack than the first, the third copy in the same rack as the second, and all other copies essentially at random
Agnostic to Layer 2 vs. Layer 3
Replication typically occurs at lower priority, so it does not hugely impact the network
This was a concern when 1G networks could not keep up with replication; with 10G networks it is not a big issue
[Diagram: data nodes 1–15 in three racks, each rack behind a ToR FEX/switch]
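Rack awareness is enabled by pointing the Name Node at that topology script (the net.topology.script.file.name property; topology.script.file.name in older releases). A minimal hypothetical sketch in Python – the subnet-to-rack mapping is invented for illustration:

#!/usr/bin/env python
# Hypothetical rack-topology script: Hadoop invokes it with one or more
# IPs/hostnames as arguments and reads back one rack path per argument.
import sys

SUBNET_TO_RACK = {            # illustrative only; generate from your inventory
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
    "10.1.3": "/dc1/rack3",
}

def rack_for(host):
    prefix = ".".join(host.split(".")[:3])            # first three octets
    return SUBNET_TO_RACK.get(prefix, "/default-rack")

if __name__ == "__main__":
    print("\n".join(rack_for(h) for h in sys.argv[1:]))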
Hadoop Components and Operations
The Data Ingest & Replication
– External Connectivity
– East West Traffic (Replication of data blocks)
Map Phase – raw data is analyzed and converted into key/value pairs
– The workload translates into multiple batches of Map tasks
– Reducers can start the reduce phase ONLY after the entire Map set is complete
Mostly an I/O/compute function
Hadoop Distributed File System
[Diagram: unstructured data → Map tasks → key/value pairs (keys 1–4) → shuffle phase → Reduce tasks → result/output]
Hadoop Components and Operations
Shuffle Phase – all key/value pairs are sorted and grouped by their keys
The Reducer PULLS the data from the Mapper nodes
– High network activity
Reduce Phase – all values associated with a key are processed for results, in three steps: Copy (get intermediate results from each data node's local disk), Merge (to reduce the number of files), and the Reduce method
Output Replication Phase – reducers replicate results to multiple nodes
– Highest network activity
Network activity depends on workload behavior
Hadoop Distributed File System
Hadoop – Anatomy of a MapReduce Job
Example: historic weather data (max temperature per year)
Maps: extract the year and temperature from a huge historical database. Reducers: find the max per year.
Source: O'Reilly, Hadoop: The Definitive Guide
Word Count Execution
Input: "the quick brown fox" / "the fox ate the mouse" / "how now brown cow"
Map: each mapper emits (word, 1) pairs – e.g. (the, 1) (quick, 1) (brown, 1) (fox, 1) …
Shuffle & Sort: pairs are grouped by key across all mappers
Reduce: counts are summed per key – one reducer outputs brown 2, fox 2, how 1, now 1, the 3; the other outputs ate 1, cow 1, mouse 1, quick 1
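For reference, the word count above maps naturally onto a pair of Hadoop Streaming scripts. This is a hedged sketch (not the code used in the session); the jar path and option values in the trailing comment are assumptions that vary by distribution.

#!/usr/bin/env python
# Minimal word-count mapper/reducer for Hadoop Streaming (illustrative).
# Run as mapper:  wordcount.py map      Run as reducer:  wordcount.py reduce
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print("%s\t%d" % (word, 1))                 # emit (word, 1)

def reducer():
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    # Streaming delivers reducer input already sorted by key
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Hypothetical invocation (jar path varies by distribution):
# hadoop jar hadoop-streaming.jar \
#   -D mapred.reduce.slowstart.completed.maps=0.8 \
#   -input /data/text -output /data/wc \
#   -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py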
Hadoop Cluster Design & Validated Compute and Network Results
Characteristics that Affect Hadoop Clusters
Cluster Size
– Number of Data Nodes
Data Model & Mapper/Reducer Ratio
– MapReduce functions
Input Data Size
– Total starting dataset
Data Locality in HDFS
– Ability to process data where it is already located
Background Activity
– Number of Jobs running
– type of jobs
– Importing
– exporting
Characteristics of Data Node
– I/O, CPU, Memory, etc.
Networking Characteristics
– Availability
– Buffering
– Data Node Speed (1G vs. 10G)
– Oversubscription
– Latency
Cluster Size
29
[Chart: time taken vs. number of nodes (24, 48, 82)]
A general characteristic of an
optimally configured cluster is the
ability to decrease job completion
times by scaling out the nodes.
Sizing Depends on
• Workload Size & Type
• Job completion time
• Map/Reduce Ratio
• CPU, IO and local storage
Test results from ETL-like Workload (Yahoo Terasort) using 1TB
data set.
MapReduce Data Model ETL & BI Workload Benchmark
The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.
• Data set size varies by phase, with varying impact on the network – e.g. 1 TB input, 10 MB shuffle, 1 MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data
[Chart: Yahoo TeraSort – ETL workload, most network intensive – map start, reducers start, map finish, job finish]
• Input, shuffle and output data sizes are the same – e.g. a 10 TB data set in all phases
• Yahoo TeraSort has more balanced Map vs. Reduce functions – linear compute and I/O
[Chart: Shakespeare WordCount – BI workload – map start, reducers start, map finish, job finish]
Job Patterns
Three job patterns and their ingress vs. egress data-set ratios:
• Analyze – 1:0.3
• Extract Transform Load (ETL) – 1:1
• Explode – 1:2
The time the reducers start is dependent on mapred.reduce.slowstart.completed.maps. It doesn't change the amount of data sent to the reducers, but it may change the timing of when that data is sent.
Traffic Types
Large incast (Hadoop replication)
Small flows/messaging (admin related, heartbeats, keep-alives, delay-sensitive application messaging)
Small–medium incast (Hadoop shuffle)
Large flows (HDFS ingest)
Map and Reduce Traffic
[Diagram: many-to-many traffic pattern – Map 1…N shuffle into Reducer 1…N, with output replication into HDFS; NameNode, JobTracker and ZooKeeper carry control traffic]
Job Patterns – job patterns have varying impact on network utilization
Analyze – simulated with Shakespeare WordCount
Extract Transform Load (ETL) – simulated with Yahoo TeraSort
Extract Transform Load (ETL) with output replication – simulated with Yahoo TeraSort with output replication
Input Data Size
Given the same MapReduce Job, the
larger the input dataset, the longer
the job will take.
Note:
It is important to note that as dataset sizes
increase completion times may not scale
linearly as many jobs can hit the ceiling of I/O
and/or Compute power.
[Chart: time taken vs. data set size (1 TB, 5 TB, 10 TB) on an 80-node cluster]
Test results from ETL-like Workload (Yahoo Terasort) using varying
data set sizes.
Data Locality in HDFS
Data Locality – The ability to process
data where it is locally stored.
Note:
During the Map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor: more replicas create a higher probability of data locality.
[Chart annotations: maps start, reducers start, maps finish, job complete]
Observations: notice that the initial spike in RX traffic occurs before the reducers kick in. It represents data that each map task needs but that is not local. Looking at the spike, it is mainly data from only a few nodes.
Map Tasks: Initial spike for non-local data. Sometimes a task may be
scheduled on a node that does not have the data available locally.
Network Characteristics
The relative impact of various network characteristics on Hadoop clusters*
* Not a scaled or measured data
Availability
Buffering
Oversubscription
Data Node Speed
Latency
Integration with Enterprise architecture – essential pathway for data flow
– Architecture
– Consistency
– Management
– Risk-assurance
– Enterprise grade features
Consistent Operational Model
– NxOS, CLI, Fault Behavior and Management
High and sustained line-rate east-west BW compared to traditional transactional networks
Over time, Hadoop will take on multi-user, multi-workload behavior
– Need enterprise centric features
– Security, SLA, QoS etc.
Big Data is just another app
Hadoop Network Topologies – Unified Fabric & ToR DC Design
[Diagram: traditional DC design (Nexus 6k/5k/2k) – a pair of Nexus 6001/6004 switches at the L3/L2 boundary with 2248TP-E and 2248PQ FEX at the top of rack; Name Node on a Cisco UCS C200 with a single NIC; Data Nodes 1–40 and 41–80 on Cisco UCS C200 servers with single NICs]
It is important to evaluate the overall availability of the system.
– Hadoop was designed with failure in mind; any single node failure does not represent a huge issue
– Network failures can span many nodes in the system, causing rebalancing and decreased overall resources
– Typically 128 to 256 TB of data transfer occurs for a single ToR or FEX failure
– The tasks on affected nodes need to be rescheduled, and maintenance activities such as data rebalancing must be scheduled, increasing load on the cluster
Redundancy paths and load-sharing schemes
– General redundancy mechanisms can also improve bandwidth, availability and response time
Ease of management and consistent operation
– Main sources of outages include human error; ease of management and consistency are general best practices
Enhanced vPC Server NIC Teaming Topologies
Dual-homed (active-active) network connections from the server:
– Eliminate the replication and data movement otherwise triggered when a node loses its network connection
– Allow optimal load-sharing
Dual-homing the FEX avoids a single point of failure.
Enhanced vPC allows such a topology and is ideally suited for Big Data applications
In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port (shipping Q4 CY11)
Supported with Nexus 55xx/6xxx only
Alternatively, Nexus 3000 vPC allows host-level redundancy with ToR ECMP
[Diagram: servers dual-homed through 1G or 10G FEX to a pair of Nexus 5K or 6K switches; server NIC options – single NIC, dual NIC active/standby, dual NIC 802.3ad]
Availability
Single NIC failure doubles the job completion time.
Dual NIC has no impact on job completion time
Effective load-sharing of traffic flows across the two NICs; NIC bonding is configured in Linux, with LACP-mode bonding
Recommended to change the hashing to src-dst-ip-port (both network and NIC bonding in Linux) for optimal load-sharing
Single Attached vs. Dual Attached Node
[Chart: job completion time – 1161 min vs. 286 min]
100 Jobs, Each with a 10 GB Data Set – Stable vs. Node & Rack Failure
Almost all jobs are impacted with a single node failure
With multiple jobs running concurrently, node failure impact is as significant as rack failure
Why Does the Job Run Longer with a Single Node or Port Failure?
The Map tasks are executed in parallel, so the unit time for each Map task/node remains the same, and the nodes complete their work at roughly the same time.
However, during a failure, a set of Map tasks remains pending (since the other nodes in the cluster are still completing their tasks) until ALL the nodes finish their assigned tasks.
Once all the nodes finish their Map tasks, the leftover Map tasks are reassigned by the Name Node. The unit time to finish those Map tasks remains the same (linear) as the time it took to finish the others – they just happen NOT to run in parallel, which can double the job completion time. This is the worst case with TeraSort; other workloads may have variable completion times.
The type of workload affects the impact of a single port/node failure
– Short-duration batch operations – not much impact – one can always restart and finish them
– Depends on when the failure occurs, map vs. reduce phase
– Long jobs (hours), e.g. large sorts, pricing calculations, normalization and join-only workloads, see a big impact since the job takes a few more hours to run
Availability – Network Failure Results: 1 TB TeraSort (ETL)
Failure of various components
Failures introduced at 33%, 66% and 99% of reducer completion
Job completion is not significantly impacted, except for single-attached NIC servers & rack failure
FEX failure is a rack failure for the 1G (single NIC) topology
Job completion time in minutes with various failures:
Failure Point            | 1G Single Attached | 2G Dual Attached
Peer Link                | 301                | 258
FEX *                    | 1137               | 259
Rack *                   | 1137               | 1017
A port – single attached | see previous slide | see previous slide
A port – dual attached   | see previous slide | see previous slide
[Diagram: 96 nodes across Racks 1–3, with 2 FEX/ToR (A and B) per rack]
*Variance in run time with % reducer completed
Burst Handling and Queue Depth
Several HDFS operations and
phases of MapReduce jobs are
very bursty in nature
Note:
The extent of bursts largely depend on the type
of job (ETL vs. BI). Bursty phases can include
replication of data (either importing into HDFS or
output replication) and the output of the
mappers during the shuffle phase.
A network that cannot handle bursts effectively will drop packets,
so optimal buffering is needed in network devices to absorb bursts.
Optimal buffering:
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior, e.g. DCTCP) have been proposed rather than huge-buffer switches
http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
Nexus 6000 Unicast Traffic and Buffering
25 MB of buffer per three QSFP ports: 16 MB for ingress, 9 MB for egress
In the case of congestion at egress, unicast traffic gets buffered at the ingress.
Takes advantage of ingress buffers from multiple ports or ASICs for unicast burst absorption.
Ensures fairness among multiple ingress ports with many-to-many traffic patterns
Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
Hadoop, NAS, AVID are examples of bursty applications
You can control the queue limit for a specified Fabric Extender for egress (network to the host) or ingress(host to network)
You can use a lower queue limit value on the Fabric Extender to prevent one blocked receiver from affecting traffic that is sent to other non-congested receivers ("head-of-line blocking”)
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
N5548-L3(config)#interface e110/1/1
N5548-L3(config-if)# hardware N2348TP queue-limit 4096000 tx
Nexus 2248TP-E 32MB Shared Buffer
Tune the 2248TP-E to support extremely large bursts (Hadoop, AVID, …)
[Diagram: 10G-attached source (NAS array) serving NFS/iSCSI to 1G-attached servers hosting VMs #2–#4]
Nexus 2248TP-E Buffer Monitoring
Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
Hadoop and NAS are examples of bursty applications
You can control the queue limit for a specified Fabric Extender for egress (network to the host) or ingress(host to network)
Extensive Drop Counters
– Provides drop counters for both directions: Network to host and Host to Network on a per host interface basis
– Drop counters for different reasons: out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop, multicast drop
Buffer Occupancy Counter
– How much buffer is being used. One key indicator of congestion or bursty traffic
switch# attach fex 110
Attaching to FEX 110 ...
To exit type 'exit', to abort type '$.'
fex-110# show platform software qosctrl asic 0 0 number of arguments 4: show asic 0 0
----------------------------------------
QoSCtrl internal info {mod 0x0 asic 0}
mod 0 asic 0:
port type: CIF [0], total: 1, used: 1
port type: BIF [1], total: 1, used: 0
port type: NIF [2], total: 4, used: 4
port type: HIF [3], total: 48, used: 48
bound NIF ports: 2
N2H cells: 14752
H2N cells: 50784
----Programmed Buffers---------
Fixed Cells : 14752
Shared Cells : 50784    <- allocated buffer, in cells of 512 bytes
----Free Buffer Statistics-----
Total Cells : 65374
Fixed Cells : 14590
Shared Cells : 50784    <- number of free cells to be monitored
TeraSort FEX(2248TP-E) Buffer Analysis (10TB)
The buffer utilization is highest during the shuffle and output replication phases.
Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
[Charts: FEX buffer usage during the shuffle phase and during output replication]
Nexus 3000 Shared Buffer Architecture
Nexus 3000 has 9MB of shared buffer in the queuing block on the ASIC.
208 Byte Cells, 9MB means 46080 Cells
Packets larger than 144 Bytes require more cells
Space is divided up among egress per queue per port (20%) and dynamically shared buffer(80%).
When congestion or a burst occurs, the egress port can use more of the shared buffer resources.
Increased Visibility in the Buffers
Nexus 3000 Buffer depth monitoring: interface
Real time command displaying the status of the shared buffer.
XML support will be added in the maintenance release
Counters are displayed in cell count. A cell is approximately 208 bytes
show hardware internal buffer info pkt-stats [brief|clear|detail]
[Screenshot annotations: buffer usage, free buffer, max buffer usage since clear, total buffer space on the platform]
TeraSort (ETL) N3K Buffer Analysis (10 TB) – the buffer utilization is highest during the shuffle and output replication phases.
Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
Note:
The Aggregation switch buffer remained flat as the bursts were absorbed at the Top of Rack layer
[Charts: buffer usage during the shuffle phase and during output replication]
Python Example: Buffer Counters Using Cisco BufferDepthMonitor
# Create BufferDepthMonitor Obj
objBufferDepthMonitor = BufferDepthMonitor()
# Switch Cell Count
objBufferDepthMonitor.get_switch_cell_count()
showBuffer.py
>>> help(BufferDepthMonitor)
Help on class BufferDepthMonitor in module cisco:
class BufferDepthMonitor(CLI)
 |  Method resolution order:
 |      BufferDepthMonitor
 |      CLI
 |      __builtin__.object
 |
 |  Methods defined here:
 |  __init__(self)
 |  dumps(self)
 |  get_max_cell_usage(self)
 |  get_remaining_instant_usage(self)
 |  get_status(self)
 |  get_switch_cell_count(self)
 |  get_total_instant_usage(self)
 |  parse_specific(self)
Excerpt: help on class BufferDepthMonitor
For help on class BufferDepthMonitor
>>> help(BufferDepthMonitor)
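A hedged sketch of how the class above might be polled on the switch to log buffer occupancy while a job runs; only the module and method names come from the help excerpt – the loop and the interpretation of the returned values as cell counts are assumptions.

# Illustrative polling loop around the on-switch Python API shown above
# (assumes it runs in the NX-OS Python environment where 'cisco' is available).
import time
from cisco import BufferDepthMonitor

monitor = BufferDepthMonitor()
for _ in range(60):                             # one sample per second
    used = monitor.get_total_instant_usage()    # cells in use right now
    peak = monitor.get_max_cell_usage()          # high-water mark since clear
    print(time.strftime("%H:%M:%S"), used, peak)
    time.sleep(1)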
Python Example: Buffer monitoring while running Hadoop
12/03/27 08:02:23 INFO mapred.JobClient: map 69% reduce 0%
12/03/27 08:02:24 INFO mapred.JobClient: map 77% reduce 0%
12/03/27 08:02:25 INFO mapred.JobClient: map 87% reduce 0%
12/03/27 08:02:26 INFO mapred.JobClient: map 96% reduce 9%
12/03/27 08:02:27 INFO mapred.JobClient: map 98% reduce 10%
12/03/27 08:02:28 INFO mapred.JobClient: map 100% reduce 10%
12/03/27 08:02:29 INFO mapred.JobClient: map 100% reduce 27%
12/03/27 08:02:30 INFO mapred.JobClient: map 100% reduce 29%
12/03/27 08:02:32 INFO mapred.JobClient: map 100% reduce 32%
12/03/27 08:02:35 INFO mapred.JobClient: map 100% reduce 84%
Hadoop Job Status
Buffer usage statistics from the switch while running Hadoop TeraSort; Hadoop job status output while running a 1 GB TeraSort on 8 nodes
2012/03/27 08:02:23     0 *
2012/03/27 08:02:24  3810 -----*
2012/03/27 08:02:25  1127 -*
2012/03/27 08:02:26     0 *
2012/03/27 08:02:27     0 *
2012/03/27 08:02:28     0 *
2012/03/27 08:02:29     0 *
2012/03/27 08:02:30     0 *
2012/03/27 08:02:31     0 *
2012/03/27 08:02:32  4921 -------*
2012/03/27 08:02:33  4299 ------*
2012/03/27 08:02:34  6929 ----------*
2012/03/27 08:02:35     0 *
Buffer Usage
Oversubscription Design
A primary benefit of Hadoop is to reduce the time required by workloads that would otherwise take too long to meet the SLA, e.g. pricing, log analysis, join-only jobs, etc.
Typically oversubscription is higher with 10 G server access than with 1 Gbps
A non-blocking network is NOT a requirement; however, the degree of oversubscription matters for
– Job completion time and how long the replication of results takes
– Oversubscription during a rack or FEX failure
Static vs. actual oversubscription (a quick calculation follows the table below)
– Hadoop transport is TCP based, and reducers fetch data at the rate of the I/O
– How much data a single node can push is often I/O bound and depends on the number of disks configured
Uplinks | Theoretical oversubscription (16 servers) | Measured
8       | 2:1                                       | next slides
4       | 4:1                                       | next slides
2       | 8:1                                       | next slides
1       | 16:1                                      | next slides
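The theoretical column is simply access bandwidth divided by uplink bandwidth; a trivial sketch using the example values from the table (not a recommendation):

def oversubscription(servers, access_gbps, uplinks, uplink_gbps):
    # Static oversubscription ratio for one ToR switch/FEX
    return (servers * access_gbps) / (uplinks * uplink_gbps)

# 16 servers at 10 Gbps behind 8/4/2/1 x 10 Gbps uplinks
for uplinks in (8, 4, 2, 1):
    print(uplinks, "uplinks ->", oversubscription(16, 10, uplinks, 10), ": 1")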
Network Oversubscriptions
Traffic to the network is limited by IO of the compute node
More spindles more network traffic – but not always linear
Oversubscription in the network is a reasonable trade off
Failure impact:
Normal Job Run – not much impact
Result replication with 1, 2, 4 & 8 uplinks – larger relative impact
Rack failure is immune to oversubscription – in other words, the rack-failure impact hides the oversubscription loss
Map to Reducer Ratio Impact on Job Completion – a 1 TB file with 128 MB blocks == 7,813 Map tasks (see the sketch after the charts below)
The job completion time is directly related to the number of reducers
Average network buffer usage decreases as the number of reducers decreases, and vice versa
[Charts: job completion time in seconds vs. number of reducers (192, 96, 48, 24, 12, 6)]
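The 7,813 figure is just the input size divided by the block size (in decimal units); a small sketch of that arithmetic and of how map tasks turn into scheduling "waves" – the node and slot counts are made-up examples, not measured values.

import math

BLOCK = 128 * 10**6                    # 128 MB block, decimal as in the slide

def map_tasks(input_bytes, block=BLOCK):
    # One map task per HDFS block of input
    return math.ceil(input_bytes / block)

def map_waves(tasks, nodes, map_slots_per_node):
    # How many waves of map tasks the cluster runs before the map phase ends
    return math.ceil(tasks / (nodes * map_slots_per_node))

tasks = map_tasks(10**12)              # 1 TB input -> 7813 map tasks
print(tasks, map_waves(tasks, nodes=80, map_slots_per_node=8))   # -> 7813, 13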
Network Traffic with Variable Reducers – network traffic decreases with fewer reducers available
96 Reducers
48 Reducers
24 Reducers
Data Node Network Speed
Generally 1GE is being used largely due
to the cost/performance trade-offs.
Though 10GE can provide benefits
depending on workload.
Note:
Multiple 1GE links can be bonded together to
not only increase bandwidth, but increase
resiliency.
[Charts: job completion time percentiles (25th to 99.9th) and gridmix2 large/medium/small job completion times for Cisco UCS C200 M2 and C210 M2 nodes with 1 GE vs. 10 GE]
Data Node Speed Differences – generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
Single 1GE 100% Utilized
Dual 1GE 75% Utilized
10GE 40% Utilized
Data Node Speed Differences – 1G vs. 10G TCPDUMP of Reducer TX
Generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
Reduced spike with 10G and smoother job completion time.
Multiple 1G or 10G links can be bonded together to not only increase bandwidth, but increase resiliency.
[Chart: per-second buffer cell usage for 1G vs. 10G alongside map/reduce completion % over the job timeline]
1GE vs. 10GE Buffer Usage
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer. By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, as the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to limits of I/O and compute capabilities.
Network Latency
Generally, network latency (while consistency of latency is important) does not represent a significant factor for Hadoop clusters.
Note:
There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
[Chart: completion time (sec) vs. data set size (1 TB, 5 TB, 10 TB) on an 80-node cluster – N3K topology vs. 5k/2k topology]
Multi-tenant Environments
Multi-use Cluster Characteristics
Hadoop clusters are generally multi-use. Background use can affect any single job's completion time.
Example View of 24 Hour Cluster Use
Large ETL Job Overlaps with medium and small ETL Jobs and many small BI Jobs
(Blue lines are ETL Jobs and purple lines are BI Jobs)
Importing Data into HDFS
Note:
A given Cluster, running many different types of Jobs, Importing into HDFS, Etc.
Various Multitenant Environments
Hadoop + HBase – need to understand traffic patterns
Job based – scheduling dependent
Department based – permissions and scheduling dependent
Hadoop + HBase
[Diagram: MapReduce traffic (Map 1…N shuffling to Reducer 1…N, output replication to HDFS) sharing the cluster with HBase region servers serving client reads/updates and running major compactions]
HBase During Major Compaction – Enabling QoS Improves the Latency
[Chart: read/update average latency (us) over time, comparing non-QoS vs. QoS policy]
Read/update latency comparison of non-QoS vs. QoS policy: ~45% improvement for reads
Switch buffer usage with a network QoS policy prioritizing HBase update/read operations over HBase major compaction
HBase + Hadoop MapReduce
[Charts: read/update average latency (us) over time, non-QoS vs. QoS policy, and switch buffer usage over the job timeline while Hadoop TeraSort and HBase run together]
Read/update latency comparison of non-QoS vs. QoS policy: ~60% improvement for reads
Switch buffer usage with a network QoS policy to prioritize HBase update/read operations
Summary
10G and/or dual-attached servers provide consistent job completion times & better buffer utilization
10G reduces bursts at the access layer
A single-attached node failure has a considerable impact on job completion time
Dual-attached servers are the recommended design – 1G or 10G, with 10G for future-proofing
Rack failure has the biggest impact on job completion time
A non-blocking network is not required
Oversubscription does impact job completion time
Latency does not matter much for Hadoop workloads
Extensive Validation of Hadoop Workload
Reference Architecture
– Make it easy for Enterprise
– Demystify Network for Hadoop Deployment
– Integration with Enterprise with efficient choices
Big Data @ Cisco
Cisco.com Big Data
www.cisco.com/go/bigdata
Certifications and Solutions with UCS C-Series
and Nexus 5500+22xx
• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief
Multi-month network and compute analysis
testing (In conjunction with Cloudera)
• Network/Compute Considerations Whitepaper
• Presented Analysis at Hadoop World
128 Node/1PB test cluster
Don’t forget to activate your Cisco Live Virtual
account for access to all session material,
communities, and on-demand and live
activities throughout the year. Activate your
account at the Cisco booth in the World of
Solutions or visit www.ciscolive.com.
Complete Your Online Session Evaluation
Give us your feedback and you could win fabulous prizes. Winners announced daily.
Receive 20 Passport points for each session evaluation you complete.
Complete your session evaluation online now (open a browser through our wireless network to access our portal) or visit one of the Internet stations throughout the Convention Center.
THANK YOU for Listening & Sharing Your Thoughts