Bill Mannel
VP, Product Marketing, SGI
Data Intensive Computing
©2011 SGI
Agenda
• Source Drivers: Massive Data Generators
• What is Data Intensive Computing?
• Big Data Architectures, Hadoop and In-Memory
• Looking at Large Data from Different Sources: LiveArc
• Summary
The Data Explosion
• We are assimilating and generating data at ever-increasing rates. This is due to two main factors:
• Higher resolution instruments/sensors
• Scanners, gene sequencers, microscopes, etc.
• Fast, affordable processing power
• Higher resolution models, ensemble forecasts, stochastic modeling, multi-disciplinary optimization
Higher Fidelity Models
Doubling the resolution of a 3D model means an 8-fold increase in data!

1 x 1 x 1 = 1 cell
2 x 2 x 2 = 8 cells
4 x 4 x 4 = 64 cells
What is Data Intensive Computing?
Computing strategies and implementations to help deal with the data tsunami.
Data intensive computing is collecting, managing, analyzing, and understanding data at volumes and rates
that push the frontier of current technologies.
Example: Weather Forecasting
18 km grid → 6 km grid (27x data volume)
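Both figures follow from cubing the linear refinement factor: doubling resolution gives 2³ = 8x the cells, and refining an 18 km grid to 6 km gives 3³ = 27x. A quick check (the helper name is our own, not from the slides):

```python
# Data volume in a 3D model grows with the cube of linear resolution.
def volume_factor(old_spacing, new_spacing, dims=3):
    """Growth in cell count when grid spacing shrinks from old to new."""
    return (old_spacing / new_spacing) ** dims

print(volume_factor(2, 1))    # doubling resolution → 8.0
print(volume_factor(18, 6))   # 18 km grid → 6 km grid → 27.0
```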
Big Data → Big Problems
• The massive data generation capacity of modern systems poses a number of problems:
• Acquiring the data
• Storing the data
• Processing the data
• Managing the data
• Viewing the data
• Data Intensive Computing is about managing the entire workflow of big data
SGI® Technical Computing

[Portfolio diagram: compute-intensive HPC workloads (CFD, CEM, FEA, CCM, energy) on SGI Rackable™ clusters and SGI ICE bladed clusters; memory- and I/O-intensive workloads (signal analysis, real-time, bio) on the SGI UV SMP; data-intensive workloads split between structured data (RDBMS) and unstructured data (Hadoop)]
Big Data Dream Team

[Diagram: end-to-end big data platforms - Rackable (unstructured data), UV (HPC data), UV packages (structured data, deep query), ArcFiniti (persistent data)]

Deep Query. Big Data. Fast Analytics.
Big Data Architectures
• An advanced approach to storing and analyzing very large data sets from organic unstructured data sources (e-mail, log files, usage patterns, transactional data and more)
• Build data relationships as you go: mostly read, seldom write
• “Share nothing” physical architecture (let nodes fail; add nodes as you scale)
• Open: x86, Linux, Hadoop distributions, tools (Hive, HBase, Pig)
• BIG DATA. BIG SCALE. Single clusters as large as 20,000 cores and 20 PB of local storage.
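The MapReduce model underlying Hadoop can be sketched in a few lines of Python. This is a single-process illustration of the programming model only; a real Hadoop job runs the map, shuffle and reduce phases across “share nothing” nodes, with the framework handling distribution and failure:

```python
# Minimal word-count in the MapReduce style Hadoop popularized.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    shuffled = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in mapper(line):
            shuffled[key].append(value)      # shuffle: group by key
    return dict(reducer(k, v) for k, v in shuffled.items())  # reduce phase

print(mapreduce(["big data big scale", "big clusters"]))
# → {'big': 3, 'data': 1, 'scale': 1, 'clusters': 1}
```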
Hadoop
Yahoo SGI Rackable Cluster
Facebook SGI Rackable Cluster
Large GVNT SGI Rackable Cluster
Hadoop Value Propositions
• Price/performance advantage for unstructured data
• Complex analytical processing (using MapReduce and analytics applications) on very large datasets
• Batch to near real-time analytics
• Streaming data from near real-time sources (web clicks, transactions, POS)
• Post-processing with analytics applications and specialized DB solutions
How Hadoop Differs from an RDBMS

              RDBMS                      Hadoop
Data size     Gigabytes+                 Petabytes+
Access        Interactive and batch      Batch
Latency       Low                        Low to high
Updates       Read and write many times  Write once, read many times
Structure     Static schema              Dynamic schema
Integrity     High                       Low
Scaling       Non-linear                 Linear
SGI Hadoop Background

• SGI has been one of the leading commercial suppliers of Hadoop servers since the technology was introduced
• Leading technology users deploy on SGI Hadoop clusters
• SGI supplies customer-optimized Hadoop clusters to key US government agencies
• SGI has sourced Hadoop installations as large as 40,000 nodes and individual clusters as large as 4,000 nodes
• SGI is one of the leading vendors in Hadoop implementations
• We were there at the beginning:
• We’ve deployed systems in the government and commercial/cloud spaces
• We’ve shipped 10,000-node Hadoop clusters
• One of our Hadoop clusters is on the list of the Top 500 most powerful supercomputers in the world
SGI and Hadoop
Two large SGI Hadoop clusters in integration and test in Chippewa Falls, Wisconsin
• Customer-Optimized Hadoop
  – Flexible and proven Hadoop implementations optimized to the customer’s requirements: performance, power, price, and density
• Management at Scale
  – SGI has the experience and tools to manage high-performance systems at scale. And data growth means Hadoop clusters scale…
• Start Up and GO!
  – Factory-integrated Hadoop solutions, including Cloudera, to minimize time from dock to production
SGI Hadoop Clusters
Terasort Scaling: SGI Hadoop Cluster

Terasort at 100 GB scales super-linearly on a 20-node SGI Rackable C2005-TY6 cluster running the Cloudera distribution of Apache Hadoop (CDH3u0).

[Chart: Terasort scaling vs. linear scaling, SGI Rackable C2005-TY6 Hadoop cluster, 100 GB job size, 1 to 20 nodes]

[Chart: Terasort scaling, 100 GB input data size, SGI Rackable C2005-TY6 cluster vs. Sun X2270 M2 cluster, against linear scaling]

The SGI Rackable C2005-TY6 cluster running Cloudera Hadoop is 81% faster than a Sun X2270 cluster of a similar size.

Sources: http://sun.systemnews.com/articles/152/1/server/23549 and SGI internal measurements
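Scaling factors like those charted are the ratio of single-node to N-node runtime. The timings below are invented purely for illustration (the slide does not publish raw numbers); only the method is meaningful:

```python
# Hypothetical timings (seconds) for a 100 GB Terasort run; these
# values are made up to show how charted scaling factors are derived.
timings = {1: 4000, 5: 760, 10: 370, 20: 175}

def scaling(nodes):
    """Speedup over the 1-node run; linear scaling would equal `nodes`."""
    return timings[1] / timings[nodes]

for n in [5, 10, 20]:
    print(f"{n:2d} nodes: {scaling(n):.2f}x (linear would be {n}x)")
```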
• Highest density
• Power optimized
• Highest data capacity
• Factory integrated
• Cloudera certified
Introducing SGI Hadoop Cluster Reference Implementations
Hadoop Analytical Players
Big Data Content Search and Interactive Analytics
Big Data Import, Export, Rich Analytics
Big Data ETL, Business Intelligence on Hadoop with Hive QL
Data Modeling and Visualization for Interactive Business Intelligence on Hadoop

SGI Analytics Ecosystem
Expanding partnerships with industry-leading analytics, visualization and tool vendors
• Being broadly adopted for a wide variety of big data workloads (high-speed ingest, in-memory databases, real-time decision making, large data sets)
• Real-time flow: Ingest > Store > Process > Decision > Visualize
• High-speed ingest: rates of TB/sec
• Large single systems: 4,096 cores, 64TB of memory
• Open: x86, Linux, Windows, standard out-of-the-box software tools and databases; ease of programming
• Examples: imagery, signals, LIDAR, cyber data, deep query, fast analytics, mass spectrometry, bio data, graph processing, genome alignment
Shared Memory
GVNT SGI UV Cluster
Tohoku SGI UV Cluster
The Bin Laden Network: SGI UV Cluster
Latency: The Performance Killer

Typical latencies for a modern CPU:

Location      Latency (ns)   Latency (cycles)
CPU register  0              0
L1 cache      1.3            4
L2 cache      2.9            9
L3 cache      21             63
Memory        120            360
Disk          4,500,000      13,500,000

(Typical values for a 3 GHz CPU core)
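The cycle column is simply latency multiplied by the clock rate; a 3 GHz core executes three cycles per nanosecond:

```python
# Convert the table's nanosecond latencies to cycle counts at 3 GHz.
CLOCK_GHZ = 3.0  # cycles per nanosecond

def cycles(latency_ns):
    return latency_ns * CLOCK_GHZ

for place, ns in [("L1 cache", 1.3), ("Memory", 120), ("Disk", 4_500_000)]:
    print(f"{place}: {ns} ns ≈ {cycles(ns):,.0f} cycles")  # matches the table
```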
Shared vs. Clustered Memory Architecture

Commodity Clusters
• Each system has its own memory and OS
• Nodes communicate over a commodity interconnect (InfiniBand or Gigabit Ethernet)
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution

[Diagram: many nodes, each with ~64GB of memory and its own system + OS, linked by InfiniBand or Gigabit Ethernet]

SGI UV Platform
• All nodes operate on one large shared memory space
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy

[Diagram: a single system + OS with global shared memory to 16TB over the SGI NUMAlink 5™ interconnect]
Shared vs. Clustered Memory

• UV provides direct “load/store” semantics
• The NUMAlink interconnect provides access times in nanoseconds, even to the most remote memory locations
• Clustered memory requires complex message-passing semantics
• Access times are in microseconds
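The difference can be caricatured in plain Python. This sketch is illustrative only: a real UV system exposes shared memory through the OS and NUMAlink hardware, and a real cluster would use MPI rather than a queue.

```python
import threading
import queue

data = list(range(8))
results = [0] * len(data)

# Shared-memory style: every worker loads/stores the same arrays directly.
def shared_worker(lo, hi):
    for i in range(lo, hi):
        results[i] = data[i] * 2   # plain load/store, no copies

threads = [threading.Thread(target=shared_worker, args=(i * 4, i * 4 + 4))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()

# Message-passing style: each "node" owns a chunk and ships results back.
out = queue.Queue()
def cluster_worker(chunk):
    out.put([x * 2 for x in chunk])  # data crosses the interconnect

cluster_worker(data[:4]); cluster_worker(data[4:])
gathered = out.get() + out.get()
print(results == gathered)  # → True
```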
Customer Examples
The Institute of Cancer Research (UK)

• “The SGI UV supercomputer will allow extremely large, diverse data sets to be processed quickly, enabling our researchers to correlate medical and biological data on an unprecedented scale,” said Dr. Rune Linding, cellular and molecular logic team leader at the ICR. “Eventually, this will lead to network-based cancer models that will be used to streamline the process of drug development.”
• An SSI system with a large, globally addressable memory space has a number of key advantages over a distributed cluster for Data Intensive Computing
• Certain problems are only tractable if resident in-core
• There are no restrictions on the type or layout of the data, supporting all methods of computation
SSI Flexibility
• The relationship between memory and cores (the GB/core ratio) is not fixed or limited in any way
• Problems do not need to be decomposed to fit individual nodes
• All programming paradigms are supported, including large-memory serial, shared-memory parallel and distributed-memory parallel
• I/O is inherently global; no need to distribute data from a source feed or a network storage device
SSI Flexibility
• Mixed or hybrid programming styles may be employed without penalty to assist with incremental parallelism
• Load balancing is simplified as all cores can access all data; simply direct more resources as appropriate
• The entire dataset can be immediately accessed for visualization, computational steering, etc.
SSI Flexibility
• Scalable, SSI-based system
• Up to 2,560 cores of processing power
• Supports up to 16TB of memory
• I/O scales to 100s of channels
SGI UV: Designed for Data Intensive Computing
[Diagram: balanced scaling of CPU, memory and I/O]
SGI LiveArc (MediaFlux)
Collaborative Data Management / Digital Asset Management

• Ties together metadata & content as an ‘asset’
  – Users think about metadata; LiveArc manages the content (files)
  – Metadata and content can be independently versioned
• For ingesting, storing & managing digital data
• Supports cataloguing data, searching for and finding data, and routing data to wherever it's required
• Provides data replication and federation
• Based on a binary XML-encoded object database
• Manages research data in any format, e.g. MRI, oceanographic and climatic data, genomics, video, satellite imagery, etc.
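The asset idea (metadata and content bound together but versioned independently) can be sketched as follows. This is not LiveArc’s actual API; every name here is an illustrative assumption.

```python
# Sketch of an "asset" pairing metadata with content, where each side
# carries its own version counter. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Asset:
    metadata: dict
    content_path: str
    metadata_version: int = 1
    content_version: int = 1

    def update_metadata(self, **fields):
        self.metadata.update(fields)
        self.metadata_version += 1   # content version unchanged

    def replace_content(self, path):
        self.content_path = path
        self.content_version += 1    # metadata version unchanged

scan = Asset({"modality": "MRI", "subject": "anon-042"}, "/data/scan_v1.dcm")
scan.update_metadata(study="oncology")
print(scan.metadata_version, scan.content_version)  # → 2 1
```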
• General application platform for collaborative applications
  – Extensible service-oriented architecture (SOA)
  – General-purpose LiveArc Desktop client is discipline-neutral
  – Components can be used to create a diverse range of applications
• Integrated with and complementary to SGI’s DMF hierarchical storage management platform
• Open data, open format, open platform
• Sold worldwide; designed, developed and maintained locally in Melbourne

more… SGI LiveArc (MediaFlux)
Collaborative Data Management / Digital Asset Management
Introducing ArcFiniti - Active Archiving Made Easy

• File-based archiving solution
  – Virtualized storage tiers for scalability and lower cost
  – TCO advantages with industry-leading power efficiency and density featuring MAID technology
  – High-performance file-based access for easy integration into existing infrastructures
  – Data protection software to ensure very long-term data integrity

An easy-to-deploy, integrated archive solution, bringing together the best of SGI storage technology.
Objective: Providing a true archive for all persistent file data

• Single platform supporting multiple applications simultaneously
• Standard NFS file system access; CIFS via Samba (v2)
• Hugely scalable, reliable and secure storage

[Diagram: applications sharing the archive - document management, media & entertainment archives, bio sciences and pharma, e-mail archive, CCTV & video surveillance - layered over software and the file system]
ArcFiniti™ - Virtualized Tiers

• Virtualized tiers give high-performance access and archive security in one integrated solution:
  – High-performance file access to the primary cache
  – Automated policy-based migration to the archive tier
  – Flexible archive policy engine
  – The benefits are significant

[Diagram: the archive policy migrating data from the primary cache to the archive tier]
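Policy-based migration of this kind can be sketched as a periodic sweep from cache to archive. ArcFiniti’s actual policy engine is proprietary; the directory names, 30-day threshold and function name below are all assumptions for illustration.

```python
# Sketch of a policy engine that migrates cold files from a primary
# cache to an archive tier. Paths and thresholds are hypothetical.
import os
import shutil
import time

CACHE_DIR = "/primary_cache"
ARCHIVE_DIR = "/archive_tier"
AGE_LIMIT = 30 * 24 * 3600  # migrate files untouched for 30 days

def migrate_cold_files(cache=CACHE_DIR, archive=ARCHIVE_DIR, age=AGE_LIMIT):
    """Move files whose last access is older than `age` seconds."""
    now = time.time()
    moved = []
    for name in os.listdir(cache):
        src = os.path.join(cache, name)
        if os.path.isfile(src) and now - os.path.getatime(src) > age:
            shutil.move(src, os.path.join(archive, name))
            moved.append(name)
    return moved
```

In a real system the sweep would run on a schedule and the policy would likely weigh size and file type as well as age.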
• Big Data is pervasive and will continue to grow
• Structured vs. unstructured data can suggest the best analysis method
• SGI has solutions that address many aspects of big data, from creation through analysis to storage
Summary