Bill Mannel
VP, Product Marketing, SGI
Data Intensive Computing
©2011 SGI
Agenda
• Source Drivers: Massive Data Generators
• What is Data Intensive Computing?
• Big Data Architectures, Hadoop and In-Memory
• Looking at Large Data from Different Sources: LiveArc
• Summary
The Data Explosion
• We are assimilating and generating data at ever-increasing rates. This is due to two main factors:
• Higher resolution instruments/sensors
• Scanners, gene sequencers, microscopes, etc.
• Fast, affordable processing power
• Higher resolution models, ensemble forecasts, stochastic modeling, multi-disciplinary optimization
Higher Fidelity Models
Doubling the resolution of a 3D model means an 8-fold increase in data!

1 x 1 x 1 = 1 cell
2 x 2 x 2 = 8 cells
4 x 4 x 4 = 64 cells
What is Data Intensive Computing?
Computing strategies and implementations to help deal with the data tsunami.
Data intensive computing is collecting, managing, analyzing, and understanding data at volumes and rates
that push the frontier of current technologies.
Example: Weather Forecasting
18 km grid → 6 km grid (27x data volume)
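Both figures follow from cubing the linear refinement factor: doubling resolution gives 2³ = 8x the cells, and refining an 18 km grid to 6 km gives 3³ = 27x. A quick check (the helper name is our own, not from the slides):

```python
# Data volume in a 3D model grows with the cube of linear resolution.
def volume_factor(old_spacing, new_spacing, dims=3):
    """Growth in cell count when grid spacing shrinks from old to new."""
    return (old_spacing / new_spacing) ** dims

print(volume_factor(2, 1))    # doubling resolution → 8.0
print(volume_factor(18, 6))   # 18 km grid → 6 km grid → 27.0
```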
Big Data → Big Problems
• The massive data generation capacity of modern systems poses a number of problems:
• Acquiring the data
• Storing the data
• Processing the data
• Managing the data
• Viewing the data
• Data Intensive Computing is about managing the entire workflow of big data
SGI® Technical Computing

[Portfolio diagram: compute-intensive HPC workloads (CFD, CEM, FEA, CCM, energy) on SGI Rackable™ clusters and SGI ICE bladed clusters; memory- and I/O-intensive workloads (signal analysis, real-time, bio) on the SGI UV SMP; data-intensive workloads split between structured data (RDBMS) and unstructured data (Hadoop)]
Big Data Dream Team

[Diagram: end-to-end big data platforms - Rackable (unstructured data), UV (HPC data), UV packages (structured data, deep query), ArcFiniti (persistent data)]

Deep Query. Big Data. Fast Analytics.
Big Data Architectures
• An advanced approach to storing and analyzing very large data sets from organic unstructured data sources (e-mail, log files, usage patterns, transactional data and more)
• Build data relationships as you go: mostly read, seldom write
• “Share nothing” physical architecture (let nodes fail; add nodes as you scale)
• Open: x86, Linux, Hadoop distributions, tools (Hive, HBase, Pig)
• BIG DATA. BIG SCALE. Single clusters as large as 20,000 cores and 20 PB of local storage.
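The MapReduce model underlying Hadoop can be sketched in a few lines of Python. This is a single-process illustration of the programming model only; a real Hadoop job runs the map, shuffle and reduce phases across “share nothing” nodes, with the framework handling distribution and failure:

```python
# Minimal word-count in the MapReduce style Hadoop popularized.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    shuffled = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in mapper(line):
            shuffled[key].append(value)      # shuffle: group by key
    return dict(reducer(k, v) for k, v in shuffled.items())  # reduce phase

print(mapreduce(["big data big scale", "big clusters"]))
# → {'big': 3, 'data': 1, 'scale': 1, 'clusters': 1}
```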
Hadoop
Yahoo SGI Rackable Cluster
Facebook SGI Rackable Cluster
Large GVNT SGI Rackable Cluster
Hadoop Value Propositions
• Price/performance advantage for unstructured data
• Complex analytical processing (using MapReduce and analytics applications) on very large datasets
• Batch to near real-time analytics
• Streaming data from near real-time sources (web clicks, transactions, POS)
• Post-processing with analytics applications and specialized DB solutions
How Hadoop Differs from an RDBMS

              RDBMS                      Hadoop
Data size     Gigabytes+                 Petabytes+
Access        Interactive and batch      Batch
Latency       Low                        Low to high
Updates       Read and write many times  Write once, read many times
Structure     Static schema              Dynamic schema
Integrity     High                       Low
Scaling       Non-linear                 Linear
SGI Hadoop Background

• SGI has been one of the leading commercial suppliers of Hadoop servers since the technology was introduced
• Leading technology users deploy on SGI Hadoop clusters
• SGI supplies customer-optimized Hadoop clusters to key US government agencies
• SGI has sourced Hadoop installations as large as 40,000 nodes and individual clusters as large as 4,000 nodes
• SGI is one of the leading vendors in Hadoop implementations
• We were there at the beginning:
• We’ve deployed systems in the government and commercial/cloud spaces
• We’ve shipped 10,000-node Hadoop clusters
• One of our Hadoop clusters is on the list of the Top 500 most powerful supercomputers in the world
SGI and Hadoop
Two large SGI Hadoop clusters in integration and test in Chippewa Falls, Wisconsin
• Customer-Optimized Hadoop
  – Flexible and proven Hadoop implementations optimized to the customer’s requirements: performance, power, price, and density
• Management at Scale
  – SGI has the experience and tools to manage high-performance systems at scale. And data growth means Hadoop clusters scale…
• Start Up and GO!
  – Factory-integrated Hadoop solutions, including Cloudera, to minimize time from dock to production
SGI Hadoop Clusters
Terasort Scaling: SGI Hadoop Cluster

Terasort at 100 GB scales super-linearly on a 20-node SGI Rackable C2005-TY6 cluster running the Cloudera distribution of Apache Hadoop (CDH3u0).

[Chart: Terasort scaling vs. linear scaling, SGI Rackable C2005-TY6 Hadoop cluster, 100 GB job size, 1 to 20 nodes]

[Chart: Terasort scaling, 100 GB input data size, SGI Rackable C2005-TY6 cluster vs. Sun X2270 M2 cluster, against linear scaling]

The SGI Rackable C2005-TY6 cluster running Cloudera Hadoop is 81% faster than a Sun X2270 cluster of a similar size.

Sources: http://sun.systemnews.com/articles/152/1/server/23549 and SGI internal measurements
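Scaling factors like those charted are the ratio of single-node to N-node runtime. The timings below are invented purely for illustration (the slide does not publish raw numbers); only the method is meaningful:

```python
# Hypothetical timings (seconds) for a 100 GB Terasort run; these
# values are made up to show how charted scaling factors are derived.
timings = {1: 4000, 5: 760, 10: 370, 20: 175}

def scaling(nodes):
    """Speedup over the 1-node run; linear scaling would equal `nodes`."""
    return timings[1] / timings[nodes]

for n in [5, 10, 20]:
    print(f"{n:2d} nodes: {scaling(n):.2f}x (linear would be {n}x)")
```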
• Highest density
• Power optimized
• Highest data capacity
• Factory integrated
• Cloudera certified
Introducing SGI Hadoop Cluster Reference Implementations
Hadoop Analytical Players
Big Data Content Search and Interactive Analytics
Big Data Import, Export, Rich Analytics
Big Data ETL, Business Intelligence on Hadoop with Hive QL
Data Modeling and Visualization for Interactive Business Intelligence on Hadoop

SGI Analytics Ecosystem
Expanding partnerships with industry-leading analytics, visualization and tool vendors
• Being broadly adopted for a wide variety of big data workloads (high-speed ingest, in-memory databases, real-time decision making, large data sets)
• Real-time flow: Ingest > Store > Process > Decision > Visualize
• High-speed ingest: rates of TB/sec
• Large single systems: 4,096 cores, 64TB of memory
• Open: x86, Linux, Windows, standard out-of-the-box software tools and databases; ease of programming
• Examples: imagery, signals, LIDAR, cyber data, deep query, fast analytics, mass spectrometry, bio data, graph processing, genome alignment
Shared Memory
GVNT SGI UV Cluster
Tohoku SGI UV Cluster
The Bin Laden Network: SGI UV Cluster
Latency: The Performance Killer

Typical latencies for a modern CPU:

Location      Latency (ns)   Latency (cycles)
CPU register  0              0
L1 cache      1.3            4
L2 cache      2.9            9
L3 cache      21             63
Memory        120            360
Disk          4,500,000      13,500,000

(Typical values for a 3 GHz CPU core)
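The cycle column is simply latency multiplied by the clock rate; a 3 GHz core executes three cycles per nanosecond:

```python
# Convert the table's nanosecond latencies to cycle counts at 3 GHz.
CLOCK_GHZ = 3.0  # cycles per nanosecond

def cycles(latency_ns):
    return latency_ns * CLOCK_GHZ

for place, ns in [("L1 cache", 1.3), ("Memory", 120), ("Disk", 4_500_000)]:
    print(f"{place}: {ns} ns ≈ {cycles(ns):,.0f} cycles")  # matches the table
```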
Shared vs. Clustered Memory Architecture

Commodity Clusters
• Each system has its own memory and OS
• Nodes communicate over a commodity interconnect (InfiniBand or Gigabit Ethernet)
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution

[Diagram: many nodes, each with ~64GB of memory and its own system + OS, linked by InfiniBand or Gigabit Ethernet]

SGI UV Platform
• All nodes operate on one large shared memory space
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy

[Diagram: a single system + OS with global shared memory to 16TB over the SGI NUMAlink 5™ interconnect]
Shared vs. Clustered Memory

• UV provides direct “load/store” semantics
• The NUMAlink interconnect provides access times in nanoseconds, even to the most remote memory locations
• Clustered memory requires complex message-passing semantics
• Access times are in microseconds
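The difference can be caricatured in plain Python. This sketch is illustrative only: a real UV system exposes shared memory through the OS and NUMAlink hardware, and a real cluster would use MPI rather than a queue.

```python
import threading
import queue

data = list(range(8))
results = [0] * len(data)

# Shared-memory style: every worker loads/stores the same arrays directly.
def shared_worker(lo, hi):
    for i in range(lo, hi):
        results[i] = data[i] * 2   # plain load/store, no copies

threads = [threading.Thread(target=shared_worker, args=(i * 4, i * 4 + 4))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()

# Message-passing style: each "node" owns a chunk and ships results back.
out = queue.Queue()
def cluster_worker(chunk):
    out.put([x * 2 for x in chunk])  # data crosses the interconnect

cluster_worker(data[:4]); cluster_worker(data[4:])
gathered = out.get() + out.get()
print(results == gathered)  # → True
```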
Customer Examples
The Institute of Cancer Research (UK)

• “The SGI UV supercomputer will allow extremely large, diverse data sets to be processed quickly, enabling our researchers to correlate medical and biological data on an unprecedented scale,” said Dr. Rune Linding, cellular and molecular logic team leader at the ICR. “Eventually, this will lead to network-based cancer models that will be used to streamline the process of drug development.”
• An SSI system with a large, globally addressable memory space has a number of key advantages over a distributed cluster for Data Intensive Computing
• Certain problems are only tractable if resident in-core
• There are no restrictions on the type or layout of the data, supporting all methods of computation
SSI Flexibility
• The relationship between memory and cores (the GB/core ratio) is not fixed or limited in any way
• Problems do not need to be decomposed to fit individual nodes
• All programming paradigms are supported, including large-memory serial, shared-memory parallel and distributed-memory parallel
• I/O is inherently global; no need to distribute data from a source feed or a network storage device
SSI Flexibility
• Mixed or hybrid programming styles may be employed without penalty to assist with incremental parallelism
• Load balancing is simplified as all cores can access all data; simply direct more resources as appropriate
• The entire dataset can be immediately accessed for visualization, computational steering, etc.
SSI Flexibility
• Scalable, SSI-based system
• Up to 2,560 cores of processing power
• Supports up to 16TB of memory
• I/O scales to 100s of channels
SGI UV: Designed for Data Intensive Computing
[Diagram: balanced scaling of CPU, memory and I/O]
SGI LiveArc (MediaFlux)
Collaborative Data Management / Digital Asset Management

• Ties together metadata & content as an ‘asset’
  – Users think about metadata; LiveArc manages the content (files)
  – Metadata and content can be independently versioned
• For ingesting, storing & managing digital data
• Supports cataloguing data, searching for and finding data, and routing data to wherever it's required
• Provides data replication and federation
• Based on a binary XML-encoded object database
• Manages research data in any format, e.g. MRI, oceanographic and climatic data, genomics, video, satellite imagery, etc.
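The asset idea (metadata and content bound together but versioned independently) can be sketched as follows. This is not LiveArc’s actual API; every name here is an illustrative assumption.

```python
# Sketch of an "asset" pairing metadata with content, where each side
# carries its own version counter. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Asset:
    metadata: dict
    content_path: str
    metadata_version: int = 1
    content_version: int = 1

    def update_metadata(self, **fields):
        self.metadata.update(fields)
        self.metadata_version += 1   # content version unchanged

    def replace_content(self, path):
        self.content_path = path
        self.content_version += 1    # metadata version unchanged

scan = Asset({"modality": "MRI", "subject": "anon-042"}, "/data/scan_v1.dcm")
scan.update_metadata(study="oncology")
print(scan.metadata_version, scan.content_version)  # → 2 1
```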
• General application platform for collaborative applications
  – Extensible service-oriented architecture (SOA)
  – General-purpose LiveArc Desktop client is discipline-neutral
  – Components can be used to create a diverse range of applications
• Integrated with and complementary to SGI’s DMF hierarchical storage management platform
• Open data, open format, open platform
• Sold worldwide; designed, developed and maintained locally in Melbourne

more… SGI LiveArc (MediaFlux)
Collaborative Data Management / Digital Asset Management
Introducing ArcFiniti - Active Archiving Made Easy

• File-based archiving solution
  – Virtualized storage tiers for scalability and lower cost
  – TCO advantages with industry-leading power efficiency and density featuring MAID technology
  – High-performance file-based access for easy integration into existing infrastructures
  – Data protection software to ensure very long-term data integrity

An easy-to-deploy, integrated archive solution, bringing together the best of SGI storage technology.
Objective: Providing a true archive for all persistent file data

• Single platform supporting multiple applications simultaneously
• Standard NFS file system access; CIFS via Samba (v2)
• Hugely scalable, reliable and secure storage

[Diagram: applications sharing the archive - document management, media & entertainment archives, bio sciences and pharma, e-mail archive, CCTV & video surveillance - layered over software and the file system]
ArcFiniti™ - Virtualized Tiers

• Virtualized tiers give high-performance access and archive security in one integrated solution:
  – High-performance file access to the primary cache
  – Automated policy-based migration to the archive tier
  – Flexible archive policy engine
  – The benefits are significant

[Diagram: the archive policy migrating data from the primary cache to the archive tier]
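Policy-based migration of this kind can be sketched as a periodic sweep from cache to archive. ArcFiniti’s actual policy engine is proprietary; the directory names, 30-day threshold and function name below are all assumptions for illustration.

```python
# Sketch of a policy engine that migrates cold files from a primary
# cache to an archive tier. Paths and thresholds are hypothetical.
import os
import shutil
import time

CACHE_DIR = "/primary_cache"
ARCHIVE_DIR = "/archive_tier"
AGE_LIMIT = 30 * 24 * 3600  # migrate files untouched for 30 days

def migrate_cold_files(cache=CACHE_DIR, archive=ARCHIVE_DIR, age=AGE_LIMIT):
    """Move files whose last access is older than `age` seconds."""
    now = time.time()
    moved = []
    for name in os.listdir(cache):
        src = os.path.join(cache, name)
        if os.path.isfile(src) and now - os.path.getatime(src) > age:
            shutil.move(src, os.path.join(archive, name))
            moved.append(name)
    return moved
```

In a real system the sweep would run on a schedule and the policy would likely weigh size and file type as well as age.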
• Big Data is pervasive and will continue to grow
• Structured vs. unstructured data can suggest the best analysis method
• SGI has solutions that address many aspects of big data, from creation through analysis to storage
Summary