Big Data Ecosystem Framework, Architecture and Deployment in Enterprise
BRKAPP-2027
Nimish Desai
Technical Leader
Session Objectives
Understand the Big Data landscape & its applications in the enterprise
Understand Hadoop's inner workings
Understand and demystify network/compute architecture with Hadoop
Validation results with Hadoop workloads – network & compute
Building Blocks, Landscape and Ecosystem
Definitions & Nature of Data
Structured Data
– Schema Based
– Defined semantics via row/column-oriented data structures
– Records retrieved through key-value pairs
Data Warehouse & RDBMS
– Oracle, Sybase, and SQL Server are structured relational databases
– ERP and CRM Feeds
– Traditional Transactional (OLTP) & Reporting(OLAP) via ETL, API Access
Unstructured Data & Semi-structured Data
– Text, Logs, Web-Clicks, Sensors Data, Picture, Video
– Each has some structure or semi-structure; e.g., a TXT file is delimited by spaces, tabs, semicolons, etc.
– When various data types are combined in one data set, it becomes unstructured in the sense of semantics
“Every day I wake up and ask, how can I flow data better,
manage data better, analyze data better?”
Rollin Ford
(CIO, Walmart)
[Figure: data challenges – too diverse, too large, timeframe, cost]
“Every two days we create as much information as we did
from the dawn of civilization up until 2003.”
Eric Schmidt
(Chairman of Google)
Why Big Data?
Challenges
Complex Data
– The explosion of unstructured data poorly suited for transaction-oriented, rigid-schema RDBMS’s
Multiple data sources & Lots of it
– Social Media, Sensors, Web-clicks, Pictures
Requires linear scaling – compute/storage - horizontally
CPU horsepower/density outpacing spinning disk performance leaves compute starved of data
Solution
Batch Processing – Data never dies
Move the compute to the data, avoid SAN/NAS bottlenecks
Build scalability into the code - Make it easy for developers to write distributed code
Leverage economy of scale (read commonly available hardware, low power footprint etc.)
Design for frequent partial failure and recovery – simplify development for distributed computing
Infinite Use Cases
Web & E-Commerce
– Faster User Response
– Customer Behaviors & Pricing Models
– Ad Targeting
Retail
– Customer Churn & Integration of brick & mortar with .com business models
– PoS Transactional Analysis
Insurance & Finance
– Risk Management
– User Behavior & Incentive Management
– Trade Surveillance for Financials
Network Analytics – Splunk
– Text Mining
– Fault Prediction
Security & Threat Defense
Big Data Application Realm – Web 2.0 & Social/Community Networks
Data lives/dies within Internet-only entities
Data domain is partially private
Homogeneous Data Life Cycle
– Mostly Unstructured
– Web Centric, User Driven
– Unified workload – few process & owners
– Typically non-virtualized
Scaling & Integration Dynamics
– Purpose Driven Apps
– Thousands of nodes
– Hundreds of PB and growing exponentially
[Diagram: data store – service – UI]
Big Data Application Realm - Enterprise
Data Lives in a confined zone of enterprise repository
– Long Lived, Regulatory and Compliance Driven
Heterogeneous Data Life Cycle
– Many Data Models
– Diverse data – Structured and Unstructured
– Diverse data sources - Subscriber based
– Diverse workloads from many sources/groups/processes/technologies
– Virtualized and non-virtualized with mostly SAN/NAS base
Each app/group/technology is limited in
– data generation
– Consumption
– Servicing confined domains
Scaling & Integration Dynamics are different
– Data Warehousing(structured) with diverse repository + Unstructured Data
– Few hundred to thousand nodes, few PB
– Integration, Policy & Security Challenges
[Diagram: enterprise data sources and consumers – Customer DB (Oracle/SAP), Social Media, ERP Modules A & B, Data Service, Sales Pipeline, Call Center, Product Catalog, Catalog Data, Video Conf, Collab, Office Apps, Records Mgmt, Doc Mgmt A & B, VoIP, Exec Reports]
Data Sources
Enterprise applications – Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customer Profiles
Other sources – Machine logs, Sensor data, Call data records, Web click-stream data, Satellite feeds, GPS data, Sales data, Blogs, Emails, Pictures, Video
Big Data Framework Application Comparison
12
Relational Database
• Structured Data – Row Oriented
• Optimized for OLTP/OLAP
• Rigid schema applied to data on insert/update
• Read and write (insert, update) many times
• Non-linear scaling
• Most transactions and queries involve a small subset of data set
• Transactional – scaling to thousands of queries
• GB to TBs size
Batch-oriented Big Data (Hadoop)
• Unstructured Data – Files, logs, Web-Clicks
• Data format is abstracted to higher-level application programming
• Schema-less, flexible for later re-use
• Write once, read many
• Data never dies
• Linear scaling
• Entire data set at play for a given query
• Multi PB
Real-time Big Data NoSQL
• HBase, Cassandra, Oracle NoSQL
• Structured and Unstructured Data
• Sparse column-family data storage or Key-value pair
• Not a RDBMS, though with some schema
• Random read and write
• Modeled after Google’s BigTable
• High transaction – real time scaling to millions
• Not suited for ad-hoc analysis
• More suited for ~1 PB
Data Center Infrastructure
[Diagram: typical data center infrastructure – WAN edge; core layer (LAN & SAN, Nexus 7000 10 GE); aggregation & services layer (Layer 3 / Layer 2, vPC+, FabricPath, network services); access layer (Nexus 5500 10GE, Nexus 2148TP-E / 2232 / 2248 FEX, Nexus 3000 ToR, Nexus 7000 end-of-row, CBS 31xx blade switch, B22 FEX for HP C-class blades, UCS FCoE); SAN edge (MDS 9500 directors, MDS 9200/9100); 1 GbE server access & 4/8 Gb FC via dual HBA (SAN A // SAN B), or 10 Gb DCB/FCoE server access, or 10 GbE server access & 4/8 Gb FC]
Big Data Building Blocks into the Enterprise
[Diagram: traditional database (RDBMS); storage (SAN and NAS); "Big Data" store and analyze; "Big Data" real-time capture, read and update operations (NoSQL); applications (virtualized, bare metal and cloud); data sources (sensor data, logs, social media, click streams, mobility trends, event data) – all interconnected over the Cisco Unified Fabric]
Sample of Big Data Ecosystem
Hadoop distribution (similar to what Red Hat does for Linux) – services and support model
Spin-out from Yahoo; services and support model for Apache Hadoop
Rewrote Hadoop with many optimizations (rewrote HDFS as a C++ filesystem and distributed the metadata)
EMC Greenplum Hadoop distribution; uses MapR.
Hadoop distribution and NoSQL-like offering announced at Oracle OpenWorld; very similar to HBase/other NoSQL offerings. Based on BerkeleyDB.
Other NoSQL-like offerings
Various Others
Hadoop Basics
Q: What is Hadoop?
A: Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.
Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.
Main Hadoop Building Blocks
Hadoop has many building blocks…At the base is a way to Store and Process unstructured data…
– Hadoop Distributed File System (HDFS) – at the base, a self-healing clustered storage system
– MapReduce – distributed data processing
– Pig, Hive, Sqoop – top-level abstractions
– ETL tools, BI reporting, RDBMS – top-level interfaces
– HBase – database with real-time access
– Apps / API access; Flume for data ingest
Hadoop Components and Operations
Scalable & Fault Tolerant
Types of Functions
– Name Node (master) – manages the cluster
– Data Node (map and reduce) – holds blocks
Data is not centrally located; it is stored across all data nodes in the cluster
Data is divided into multiple large blocks – 64 MB by default, 128 MB typical
Blocks are not related to disk geometry
Data is stored reliably – each block is replicated 3 times by default (a capacity sketch follows the diagram below)
Hadoop Distributed File System
[Diagram: blocks 1–6 distributed and replicated across data nodes 1–15, three groups of five nodes each behind a ToR FEX/switch]
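A back-of-the-envelope sketch of what the defaults above imply for block count and raw capacity; the helper name and the decimal-MB block size are illustrative assumptions, not figures from the session.

import math

def hdfs_footprint(input_bytes, block_bytes=128 * 10**6, replication=3):
    # One HDFS block per 128 MB of input (decimal MB, matching the
    # "1 TB == 7,813 map tasks" figure later in the deck); 3 replicas per block.
    blocks = math.ceil(input_bytes / block_bytes)
    raw_bytes = input_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(10**12)   # a 1 TB data set
print(blocks, raw / 10**12)            # -> 7813 blocks, 3.0 TB of raw disk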
Hadoop Components and Operations
Name Node
– Runs a scheduler – the Job Tracker
– Manages all data nodes; metadata is kept in memory
– Secondary Name Node – snapshots the HDFS cluster metadata
– Typically all three JVMs can run on a single node
Data Node
– Task Tracker receives job info from the Job Tracker (Name Node)
– Map & Reduce tasks are managed by the Task Tracker
– Configurable ratio of Map & Reduce tasks for various workloads, per node/CPU/core
– Data locality – if data is not available where the map task is assigned, the missing block is copied over the network
HDFS Architecture
[Diagram: data nodes 1–15 across three ToR FEX/switches, with replicas of blocks 1–4 distributed among them]
Name Node: file metadata is stored in fsimage and an in-memory-only (!) map of blocks to data nodes, e.g.:
/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4
Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
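As a toy illustration of the in-memory structures described above (plain Python dictionaries invented for illustration – not Hadoop's actual classes; replica placements beyond the slide's example are made up):

# Toy model of the Name Node metadata from the example above.
file_to_blocks = {
    "/usr/sean/foo.txt": ["blk_1", "blk_2"],
    "/usr/jacob/bar.txt": ["blk_3", "blk_4"],
}
block_to_datanodes = {
    "blk_1": ["datanode1", "datanode6", "datanode11"],   # 3 replicas per block
    "blk_2": ["datanode2", "datanode7", "datanode12"],
    "blk_3": ["datanode2", "datanode3", "datanode13"],
    "blk_4": ["datanode4", "datanode8", "datanode14"],
}

def locate(path):
    # Return the data nodes holding each block of a file.
    return {blk: block_to_datanodes[blk] for blk in file_to_blocks[path]}

print(locate("/usr/sean/foo.txt"))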
Hadoop Components and Operations – Rack Awareness: FUD & Clarification
The basic intent is to avoid having ALL copies of a given block in the same rack, to prevent data loss if that rack fails
How does the Hadoop Name Node know which rack holds which blocks?
– It does not
– By default, rack awareness is essentially "off" – all nodes are part of the same Hadoop rack
– If configured (via an admin-supplied topology script – a sketch follows the diagram below), the Name Node places the 2nd copy in a different rack than the first, the third copy in the same rack as the second, and all other copies essentially at random
Agnostic to Layer 2 vs. Layer 3
Replication typically occurs at lower priority, so it does not hugely impact the network
This was a concern when 1G networks could not keep up with replication; with 10G networks it is not a big issue
[Diagram: data nodes 1–15 in three racks, each rack behind a ToR FEX/switch]
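Rack awareness is enabled by pointing the Name Node at that topology script (the net.topology.script.file.name property; topology.script.file.name in older releases). A minimal hypothetical sketch in Python – the subnet-to-rack mapping is invented for illustration:

#!/usr/bin/env python
# Hypothetical rack-topology script: Hadoop invokes it with one or more
# IPs/hostnames as arguments and reads back one rack path per argument.
import sys

SUBNET_TO_RACK = {            # illustrative only; generate from your inventory
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
    "10.1.3": "/dc1/rack3",
}

def rack_for(host):
    prefix = ".".join(host.split(".")[:3])            # first three octets
    return SUBNET_TO_RACK.get(prefix, "/default-rack")

if __name__ == "__main__":
    print("\n".join(rack_for(h) for h in sys.argv[1:]))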
Hadoop Components and Operations
The Data Ingest & Replication
– External Connectivity
– East West Traffic (Replication of data blocks)
Map Phase – raw data is analyzed and converted into key/value pairs
– The workload translates into multiple batches of Map tasks
– Reducers can start the reduce phase ONLY after the entire Map set is complete
Mostly an I/O/compute function
Hadoop Distributed File System
[Diagram: unstructured data → Map tasks → key/value pairs (keys 1–4) → shuffle phase → Reduce tasks → result/output]
Hadoop Components and Operations
Shuffle Phase – all key/value pairs are sorted and grouped by their keys
The Reducer PULLS the data from the Mapper nodes
– High network activity
Reduce Phase – all values associated with a key are processed for results, in three steps: Copy (get intermediate results from each data node's local disk), Merge (to reduce the number of files), and the Reduce method
Output Replication Phase – reducers replicate results to multiple nodes
– Highest network activity
Network activity depends on workload behavior
Hadoop Distributed File System
Hadoop – Anatomy of a MapReduce Job
Example: historic weather data (max temperature per year)
Maps: extract the year and temperature from a huge historical database. Reducers: find the max per year.
Source: O'Reilly, Hadoop: The Definitive Guide
Word Count Execution
Input: "the quick brown fox" / "the fox ate the mouse" / "how now brown cow"
Map: each mapper emits (word, 1) pairs – e.g. (the, 1) (quick, 1) (brown, 1) (fox, 1) …
Shuffle & Sort: pairs are grouped by key across all mappers
Reduce: counts are summed per key – one reducer outputs brown 2, fox 2, how 1, now 1, the 3; the other outputs ate 1, cow 1, mouse 1, quick 1
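For reference, the word count above maps naturally onto a pair of Hadoop Streaming scripts. This is a hedged sketch (not the code used in the session); the jar path and option values in the trailing comment are assumptions that vary by distribution.

#!/usr/bin/env python
# Minimal word-count mapper/reducer for Hadoop Streaming (illustrative).
# Run as mapper:  wordcount.py map      Run as reducer:  wordcount.py reduce
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print("%s\t%d" % (word, 1))                 # emit (word, 1)

def reducer():
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    # Streaming delivers reducer input already sorted by key
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Hypothetical invocation (jar path varies by distribution):
# hadoop jar hadoop-streaming.jar \
#   -D mapred.reduce.slowstart.completed.maps=0.8 \
#   -input /data/text -output /data/wc \
#   -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py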
Hadoop Cluster Design & Validated Compute and Network Results
Characteristics that Affect Hadoop Clusters
Cluster Size
– Number of Data Nodes
Data Model & Mapper/Reducer Ratio
– MapReduce functions
Input Data Size
– Total starting dataset
Data Locality in HDFS
– Ability to process data where it is already located
Background Activity
– Number of Jobs running
– type of jobs
– Importing
– exporting
Characteristics of Data Node
– I/O, CPU, Memory, etc.
Networking Characteristics
– Availability
– Buffering
– Data Node Speed (1G vs. 10G)
– Oversubscription
– Latency
Cluster Size
29
[Chart: time taken vs. number of nodes (24, 48, 82)]
A general characteristic of an
optimally configured cluster is the
ability to decrease job completion
times by scaling out the nodes.
Sizing Depends on
• Workload Size & Type
• Job completion time
• Map/Reduce Ratio
• CPU, IO and local storage
Test results from ETL-like Workload (Yahoo Terasort) using 1TB
data set.
MapReduce Data Model ETL & BI Workload Benchmark
The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.
• Data set size varies by phase, with varying impact on the network – e.g. 1 TB input, 10 MB shuffle, 1 MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data
[Chart: Yahoo TeraSort – ETL workload, most network intensive – map start, reducers start, map finish, job finish]
• Input, shuffle and output data sizes are the same – e.g. a 10 TB data set in all phases
• Yahoo TeraSort has more balanced Map vs. Reduce functions – linear compute and I/O
[Chart: Shakespeare WordCount – BI workload – map start, reducers start, map finish, job finish]
Job Patterns
Three job patterns and their ingress vs. egress data-set ratios:
• Analyze – 1:0.3
• Extract Transform Load (ETL) – 1:1
• Explode – 1:2
The time the reducers start is dependent on mapred.reduce.slowstart.completed.maps. It doesn't change the amount of data sent to the reducers, but it may change the timing of when that data is sent.
Traffic Types
Large incast (Hadoop replication)
Small flows/messaging (admin related, heartbeats, keep-alives, delay-sensitive application messaging)
Small–medium incast (Hadoop shuffle)
Large flows (HDFS ingest)
Map and Reduce Traffic
[Diagram: many-to-many traffic pattern – Map 1…N shuffle into Reducer 1…N, with output replication into HDFS; NameNode, JobTracker and ZooKeeper carry control traffic]
Job Patterns – job patterns have varying impact on network utilization
Analyze – simulated with Shakespeare WordCount
Extract Transform Load (ETL) – simulated with Yahoo TeraSort
Extract Transform Load (ETL) with output replication – simulated with Yahoo TeraSort with output replication
Input Data Size
Given the same MapReduce Job, the
larger the input dataset, the longer
the job will take.
Note:
It is important to note that as dataset sizes
increase completion times may not scale
linearly as many jobs can hit the ceiling of I/O
and/or Compute power.
[Chart: time taken vs. data set size (1 TB, 5 TB, 10 TB) on an 80-node cluster]
Test results from ETL-like Workload (Yahoo Terasort) using varying
data set sizes.
Data Locality in HDFS
Data Locality – The ability to process
data where it is locally stored.
Note:
During the Map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor: more replicas create a higher probability of data locality.
[Chart annotations: maps start, reducers start, maps finish, job complete]
Observations: notice that the initial spike in RX traffic occurs before the reducers kick in. It represents data that each map task needs but that is not local. Looking at the spike, it is mainly data from only a few nodes.
Map Tasks: Initial spike for non-local data. Sometimes a task may be
scheduled on a node that does not have the data available locally.
Network Characteristics
The relative impact of various network characteristics on Hadoop clusters*
* Not a scaled or measured data
Availability
Buffering
Oversubscription
Data Node Speed
Latency
Integration with Enterprise architecture – essential pathway for data flow
– Architecture
– Consistency
– Management
– Risk-assurance
– Enterprise grade features
Consistent Operational Model
– NxOS, CLI, Fault Behavior and Management
High and sustained line-rate east-west BW compared to traditional transactional networks
Over time, Hadoop will take on multi-user, multi-workload behavior
– Need enterprise centric features
– Security, SLA, QoS etc.
Big Data is just another app
Hadoop Network Topologies – Unified Fabric & ToR DC Design
[Diagram: traditional DC design (Nexus 6k/5k/2k) – a pair of Nexus 6001/6004 switches at the L3/L2 boundary with 2248TP-E and 2248PQ FEX at the top of rack; Name Node on a Cisco UCS C200 with a single NIC; Data Nodes 1–40 and 41–80 on Cisco UCS C200 servers with single NICs]
It is important to evaluate the overall availability of the system.
– Hadoop was designed with failure in mind; any single node failure does not represent a huge issue
– Network failures can span many nodes in the system, causing rebalancing and decreased overall resources
– Typically 128 to 256 TB of data transfer occurs for a single ToR or FEX failure
– The tasks on affected nodes need to be rescheduled, and maintenance activities such as data rebalancing must be scheduled, increasing load on the cluster
Redundancy paths and load-sharing schemes
– General redundancy mechanisms can also improve bandwidth, availability and response time
Ease of management and consistent operation
– Main sources of outages include human error; ease of management and consistency are general best practices
Enhanced vPC Server NIC Teaming Topologies
Dual-homed (active-active) network connections from the server:
– Eliminate the replication and data movement otherwise triggered when a node loses its network connection
– Allow optimal load-sharing
Dual-homing the FEX avoids a single point of failure.
Enhanced vPC allows such a topology and is ideally suited for Big Data applications
In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port (shipping Q4 CY11)
Supported with Nexus 55xx/6xxx only
Alternatively, Nexus 3000 vPC allows host-level redundancy with ToR ECMP
[Diagram: servers dual-homed through 1G or 10G FEX to a pair of Nexus 5K or 6K switches; server NIC options – single NIC, dual NIC active/standby, dual NIC 802.3ad]
Availability
Single NIC failure doubles the job completion time.
Dual NIC has no impact on job completion time
Effective load-sharing of traffic flows across the two NICs; NIC bonding is configured in Linux, with LACP-mode bonding
Recommended to change the hashing to src-dst-ip-port (both network and NIC bonding in Linux) for optimal load-sharing
Single Attached vs. Dual Attached Node
[Chart: job completion time – 1161 min vs. 286 min]
100 Jobs, Each with a 10 GB Data Set – Stable vs. Node & Rack Failure
Almost all jobs are impacted with a single node failure
With multiple jobs running concurrently, node failure impact is as significant as rack failure
Why Does the Job Run Longer with a Single Node or Port Failure?
The Map tasks are executed in parallel, so the unit time for each Map task/node remains the same, and the nodes complete their work at roughly the same time.
However, during a failure, a set of Map tasks remains pending (since the other nodes in the cluster are still completing their tasks) until ALL the nodes finish their assigned tasks.
Once all the nodes finish their Map tasks, the leftover Map tasks are reassigned by the Name Node. The unit time to finish those Map tasks remains the same (linear) as the time it took to finish the others – they just happen NOT to run in parallel, which can double the job completion time. This is the worst case with TeraSort; other workloads may have variable completion times.
The type of workload affects the impact of a single port/node failure
– Short-duration batch operations – not much impact – one can always restart and finish them
– Depends on when the failure occurs, map vs. reduce phase
– Long jobs (hours), e.g. large sorts, pricing calculations, normalization and join-only workloads, see a big impact since the job takes a few more hours to run
Availability – Network Failure Results: 1 TB TeraSort (ETL)
Failure of various components
Failures introduced at 33%, 66% and 99% of reducer completion
Job completion is not significantly impacted, except for single-attached NIC servers & rack failure
FEX failure is a rack failure for the 1G (single NIC) topology
Job completion time in minutes with various failures:
Failure Point            | 1G Single Attached | 2G Dual Attached
Peer Link                | 301                | 258
FEX *                    | 1137               | 259
Rack *                   | 1137               | 1017
A port – single attached | see previous slide | see previous slide
A port – dual attached   | see previous slide | see previous slide
[Diagram: 96 nodes across Racks 1–3, with 2 FEX/ToR (A and B) per rack]
*Variance in run time with % reducer completed
Burst Handling and Queue Depth
Several HDFS operations and
phases of MapReduce jobs are
very bursty in nature
Note:
The extent of bursts largely depend on the type
of job (ETL vs. BI). Bursty phases can include
replication of data (either importing into HDFS or
output replication) and the output of the
mappers during the shuffle phase.
A network that cannot handle bursts effectively will drop packets,
so optimal buffering is needed in network devices to absorb bursts.
Optimal buffering:
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior, e.g. DCTCP) have been proposed rather than huge-buffer switches
http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
Nexus 6000 Unicast Traffic and Buffering
25 MB of buffer per three QSFP ports: 16 MB for ingress, 9 MB for egress
In the case of congestion at egress, unicast traffic gets buffered at the ingress.
Takes advantage of ingress buffers from multiple ports or ASICs for unicast burst absorption.
Ensures fairness among multiple ingress ports with many-to-many traffic patterns
Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
Hadoop, NAS, AVID are examples of bursty applications
You can control the queue limit for a specified Fabric Extender for egress (network to the host) or ingress(host to network)
You can use a lower queue limit value on the Fabric Extender to prevent one blocked receiver from affecting traffic that is sent to other non-congested receivers ("head-of-line blocking”)
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
N5548-L3(config)#interface e110/1/1
N5548-L3(config-if)# hardware N2348TP queue-limit 4096000 tx
Nexus 2248TP-E 32MB Shared Buffer
Tune the 2248TP-E to support extremely large bursts (Hadoop, AVID, …)
[Diagram: 10G-attached source (NAS array) serving NFS/iSCSI to 1G-attached servers hosting VMs #2–#4]
Nexus 2248TP-E Buffer Monitoring
Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
Hadoop and NAS are examples of bursty applications
You can control the queue limit for a specified Fabric Extender for egress (network to the host) or ingress(host to network)
Extensive Drop Counters
– Provides drop counters for both directions: Network to host and Host to Network on a per host interface basis
– Drop counters for different reasons: out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop, multicast drop
Buffer Occupancy Counter
– How much buffer is being used. One key indicator of congestion or bursty traffic
switch# attach fex 110
Attaching to FEX 110 ...
To exit type 'exit', to abort type '$.'
fex-110# show platform software qosctrl asic 0 0 number of arguments 4: show asic 0 0
----------------------------------------
QoSCtrl internal info {mod 0x0 asic 0}
mod 0 asic 0:
port type: CIF [0], total: 1, used: 1
port type: BIF [1], total: 1, used: 0
port type: NIF [2], total: 4, used: 4
port type: HIF [3], total: 48, used: 48
bound NIF ports: 2
N2H cells: 14752
H2N cells: 50784
----Programmed Buffers---------
Fixed Cells : 14752
Shared Cells : 50784    <- allocated buffer, in cells of 512 bytes
----Free Buffer Statistics-----
Total Cells : 65374
Fixed Cells : 14590
Shared Cells : 50784    <- number of free cells to be monitored
TeraSort FEX(2248TP-E) Buffer Analysis (10TB)
The buffer utilization is highest during the shuffle and output replication phases.
Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
[Charts: FEX buffer usage during the shuffle phase and during output replication]
Nexus 3000 Shared Buffer Architecture
Nexus 3000 has 9MB of shared buffer in the queuing block on the ASIC.
208 Byte Cells, 9MB means 46080 Cells
Packets larger than 144 Bytes require more cells
Space is divided up among egress per queue per port (20%) and dynamically shared buffer(80%).
When congestion or a burst occurs, the egress port can use more of the shared buffer resources.
Increased Visibility in the Buffers
Nexus 3000 Buffer depth monitoring: interface
Real time command displaying the status of the shared buffer.
XML support will be added in the maintenance release
Counters are displayed in cell count. A cell is approximately 208 bytes
show hardware internal buffer info pkt-stats [brief|clear|detail]
[Screenshot annotations: buffer usage, free buffer, max buffer usage since clear, total buffer space on the platform]
TeraSort (ETL) N3K Buffer Analysis (10 TB) – the buffer utilization is highest during the shuffle and output replication phases.
Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
Note:
The Aggregation switch buffer remained flat as the bursts were absorbed at the Top of Rack layer
[Charts: buffer usage during the shuffle phase and during output replication]
Python Example: Buffer Counters Using Cisco BufferDepthMonitor
# Create BufferDepthMonitor Obj
objBufferDepthMonitor = BufferDepthMonitor()
# Switch Cell Count
objBufferDepthMonitor.get_switch_cell_count()
showBuffer.py
>>> help(BufferDepthMonitor)
Help on class BufferDepthMonitor in module cisco:
class BufferDepthMonitor(CLI)
 |  Method resolution order:
 |      BufferDepthMonitor
 |      CLI
 |      __builtin__.object
 |
 |  Methods defined here:
 |  __init__(self)
 |  dumps(self)
 |  get_max_cell_usage(self)
 |  get_remaining_instant_usage(self)
 |  get_status(self)
 |  get_switch_cell_count(self)
 |  get_total_instant_usage(self)
 |  parse_specific(self)
Excerpt: help on class BufferDepthMonitor
For help on class BufferDepthMonitor
>>> help(BufferDepthMonitor)
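A hedged sketch of how the class above might be polled on the switch to log buffer occupancy while a job runs; only the module and method names come from the help excerpt – the loop and the interpretation of the returned values as cell counts are assumptions.

# Illustrative polling loop around the on-switch Python API shown above
# (assumes it runs in the NX-OS Python environment where 'cisco' is available).
import time
from cisco import BufferDepthMonitor

monitor = BufferDepthMonitor()
for _ in range(60):                             # one sample per second
    used = monitor.get_total_instant_usage()    # cells in use right now
    peak = monitor.get_max_cell_usage()          # high-water mark since clear
    print(time.strftime("%H:%M:%S"), used, peak)
    time.sleep(1)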
Python Example: Buffer monitoring while running Hadoop
12/03/27 08:02:23 INFO mapred.JobClient: map 69% reduce 0%
12/03/27 08:02:24 INFO mapred.JobClient: map 77% reduce 0%
12/03/27 08:02:25 INFO mapred.JobClient: map 87% reduce 0%
12/03/27 08:02:26 INFO mapred.JobClient: map 96% reduce 9%
12/03/27 08:02:27 INFO mapred.JobClient: map 98% reduce 10%
12/03/27 08:02:28 INFO mapred.JobClient: map 100% reduce 10%
12/03/27 08:02:29 INFO mapred.JobClient: map 100% reduce 27%
12/03/27 08:02:30 INFO mapred.JobClient: map 100% reduce 29%
12/03/27 08:02:32 INFO mapred.JobClient: map 100% reduce 32%
12/03/27 08:02:35 INFO mapred.JobClient: map 100% reduce 84%
Hadoop Job Status
Buffer usage statistics from the switch while running Hadoop TeraSort; Hadoop job status output while running a 1 GB TeraSort on 8 nodes
2012/03/27 08:02:23     0 *
2012/03/27 08:02:24  3810 -----*
2012/03/27 08:02:25  1127 -*
2012/03/27 08:02:26     0 *
2012/03/27 08:02:27     0 *
2012/03/27 08:02:28     0 *
2012/03/27 08:02:29     0 *
2012/03/27 08:02:30     0 *
2012/03/27 08:02:31     0 *
2012/03/27 08:02:32  4921 -------*
2012/03/27 08:02:33  4299 ------*
2012/03/27 08:02:34  6929 ----------*
2012/03/27 08:02:35     0 *
Buffer Usage
Oversubscription Design
A primary benefit of Hadoop is to reduce the time required by workloads that would otherwise take too long to meet the SLA, e.g. pricing, log analysis, join-only jobs, etc.
Typically oversubscription is higher with 10 G server access than with 1 Gbps
A non-blocking network is NOT a requirement; however, the degree of oversubscription matters for
– Job completion time and how long the replication of results takes
– Oversubscription during a rack or FEX failure
Static vs. actual oversubscription (a quick calculation follows the table below)
– Hadoop transport is TCP based, and reducers fetch data at the rate of the I/O
– How much data a single node can push is often I/O bound and depends on the number of disks configured
Uplinks | Theoretical oversubscription (16 servers) | Measured
8       | 2:1                                       | next slides
4       | 4:1                                       | next slides
2       | 8:1                                       | next slides
1       | 16:1                                      | next slides
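The theoretical column is simply access bandwidth divided by uplink bandwidth; a trivial sketch using the example values from the table (not a recommendation):

def oversubscription(servers, access_gbps, uplinks, uplink_gbps):
    # Static oversubscription ratio for one ToR switch/FEX
    return (servers * access_gbps) / (uplinks * uplink_gbps)

# 16 servers at 10 Gbps behind 8/4/2/1 x 10 Gbps uplinks
for uplinks in (8, 4, 2, 1):
    print(uplinks, "uplinks ->", oversubscription(16, 10, uplinks, 10), ": 1")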
Network Oversubscriptions
Traffic to the network is limited by IO of the compute node
More spindles more network traffic – but not always linear
Oversubscription in the network is a reasonable trade off
Failure impact:
Normal Job Run – not much impact
Result replication with 1, 2, 4 & 8 uplinks – larger relative impact
Rack failure is immune to oversubscription – in other words, the rack-failure impact hides the oversubscription loss
Map to Reducer Ratio Impact on Job Completion – a 1 TB file with 128 MB blocks == 7,813 Map tasks (see the sketch after the charts below)
The job completion time is directly related to the number of reducers
Average network buffer usage decreases as the number of reducers decreases, and vice versa
[Charts: job completion time in seconds vs. number of reducers (192, 96, 48, 24, 12, 6)]
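The 7,813 figure is just the input size divided by the block size (in decimal units); a small sketch of that arithmetic and of how map tasks turn into scheduling "waves" – the node and slot counts are made-up examples, not measured values.

import math

BLOCK = 128 * 10**6                    # 128 MB block, decimal as in the slide

def map_tasks(input_bytes, block=BLOCK):
    # One map task per HDFS block of input
    return math.ceil(input_bytes / block)

def map_waves(tasks, nodes, map_slots_per_node):
    # How many waves of map tasks the cluster runs before the map phase ends
    return math.ceil(tasks / (nodes * map_slots_per_node))

tasks = map_tasks(10**12)              # 1 TB input -> 7813 map tasks
print(tasks, map_waves(tasks, nodes=80, map_slots_per_node=8))   # -> 7813, 13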
Network Traffic with Variable Reducers – network traffic decreases with fewer reducers available
96 Reducers
48 Reducers
24 Reducers
Data Node Network Speed
Generally 1GE is being used largely due
to the cost/performance trade-offs.
Though 10GE can provide benefits
depending on workload.
Note:
Multiple 1GE links can be bonded together to
not only increase bandwidth, but increase
resiliency.
[Charts: job completion time percentiles (25th to 99.9th) and gridmix2 large/medium/small job completion times for Cisco UCS C200 M2 and C210 M2 nodes with 1 GE vs. 10 GE]
Data Node Speed Differences – generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
Single 1GE 100% Utilized
Dual 1GE 75% Utilized
10GE 40% Utilized
Data Node Speed Differences – 1G vs. 10G TCPDUMP of Reducer TX
Generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
Reduced spike with 10G and smoother job completion time.
Multiple 1G or 10G links can be bonded together to not only increase bandwidth, but increase resiliency.
[Chart: per-second buffer cell usage for 1G vs. 10G alongside map/reduce completion % over the job timeline]
1GE vs. 10GE Buffer Usage
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer. By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, as the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to limits of I/O and compute capabilities.
Network Latency
Generally, network latency (while consistency of latency is important) does not represent a significant factor for Hadoop clusters.
Note:
There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
[Chart: completion time (sec) vs. data set size (1 TB, 5 TB, 10 TB) on an 80-node cluster – N3K topology vs. 5k/2k topology]
Multi-tenant Environments
Multi-use Cluster Characteristics
Hadoop clusters are generally multi-use. Background use can affect any single job's completion time.
Example View of 24 Hour Cluster Use
Large ETL Job Overlaps with medium and small ETL Jobs and many small BI Jobs
(Blue lines are ETL Jobs and purple lines are BI Jobs)
Importing Data into HDFS
Note:
A given Cluster, running many different types of Jobs, Importing into HDFS, Etc.
Various Multitenant Environments
Hadoop + HBase – need to understand traffic patterns
Job based – scheduling dependent
Department based – permissions and scheduling dependent
Hadoop + HBase
[Diagram: MapReduce traffic (Map 1…N shuffling to Reducer 1…N, output replication to HDFS) sharing the cluster with HBase region servers serving client reads/updates and running major compactions]
HBase During Major Compaction – Enabling QoS Improves the Latency
[Chart: read/update average latency (us) over time, comparing non-QoS vs. QoS policy]
Read/update latency comparison of non-QoS vs. QoS policy: ~45% improvement for reads
Switch buffer usage with a network QoS policy prioritizing HBase update/read operations over HBase major compaction
HBase + Hadoop MapReduce
[Charts: read/update average latency (us) over time, non-QoS vs. QoS policy, and switch buffer usage over the job timeline while Hadoop TeraSort and HBase run together]
Read/update latency comparison of non-QoS vs. QoS policy: ~60% improvement for reads
Switch buffer usage with a network QoS policy to prioritize HBase update/read operations
Summary
10G and/or dual-attached servers provide consistent job completion times & better buffer utilization
10G reduces bursts at the access layer
A single-attached node failure has a considerable impact on job completion time
Dual-attached servers are the recommended design – 1G or 10G, with 10G for future-proofing
Rack failure has the biggest impact on job completion time
A non-blocking network is not required
Oversubscription does impact job completion time
Latency does not matter much for Hadoop workloads
Extensive Validation of Hadoop Workload
Reference Architecture
– Make it easy for Enterprise
– Demystify Network for Hadoop Deployment
– Integration with Enterprise with efficient choices
Big Data @ Cisco
Cisco.com Big Data
www.cisco.com/go/bigdata
Certifications and Solutions with UCS C-Series
and Nexus 5500+22xx
• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief
Multi-month network and compute analysis
testing (In conjunction with Cloudera)
• Network/Compute Considerations Whitepaper
• Presented Analysis at Hadoop World
128 Node/1PB test cluster
Don’t forget to activate your Cisco Live Virtual
account for access to all session material,
communities, and on-demand and live
activities throughout the year. Activate your
account at the Cisco booth in the World of
Solutions or visit www.ciscolive.com.
Complete Your Online Session Evaluation
Give us your feedback and you could win fabulous prizes. Winners announced daily.
Receive 20 Passport points for each session evaluation you complete.
Complete your session evaluation online now (open a browser through our wireless network to access our portal) or visit one of the Internet stations throughout the Convention Center.
THANK YOU for Listening & Sharing Your Thoughts