Copyright © 2011 by ScaleOut Software, Inc. Webinar, December 2011. Bill Bain ([email protected]). The Top Six Reasons to Use a Distributed Data Grid

Top 6 Reasons to Use a Distributed Data Grid


DESCRIPTION

Covers the problems of achieving scalability in server farm environments and how distributed data grids provide in-memory storage and boost performance. Includes summary of ScaleOut Software product offerings including ScaleOut State Server and Grid Computing Edition.


Page 1: Top 6 Reasons to Use a Distributed Data Grid

Copyright © 2011 by ScaleOut Software, Inc.

Webinar, December 2011

Bill Bain ([email protected])

The Top Six Reasons to Use a Distributed Data Grid

Page 2: Top 6 Reasons to Use a Distributed Data Grid

ScaleOut Software, Inc.

Agenda

• About ScaleOut Software
• Overview of Products
• What is a Distributed Data Grid (DDG)?
• The Top Six Reasons
• What to Look for in a DDG Product

Page 3: Top 6 Reasons to Use a Distributed Data Grid

Company

• Founded in September 2003, privately funded
• Offices in Bellevue, WA and Beaverton, OR
• Team:
– Dr. William Bain, Founder & CEO
  • Career focused on parallel computing: Bell Labs, Intel, Microsoft
  • 3 prior start-ups; the last was acquired by Microsoft, and its product now ships as Network Load Balancing in Windows Server
– David Brinker, COO
  • 20 years of software business and executive management experience
  • Mentor Graphics, Cadence, Webridge
• Develops and markets Linux & Windows DDG products.
• Seven years of market experience.

Page 4: Top 6 Reasons to Use a Distributed Data Grid

It’s All About Scaling Performance

• Scaling out:
– Has excellent scalability.
– But is challenging to implement.

[Figure: scaling out from a single server (memory, CPU, storage) to several servers, each with its own memory, CPU, and storage.]

Page 5: Top 6 Reasons to Use a Distributed Data Grid

What is a Distributed Data Grid?

(Aka “distributed cache”, “in-memory data grid”)

• A new “vertical” storage tier:
– Adds missing layer to boost performance.
– Uses in-memory, out-of-process storage.
– Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
– Allows data sharing among servers.
– Scales performance & capacity.
– Adds high availability.
– Can be used independently of backing storage.

[Figure: on each server, processor cache, L2 cache, and “in-process” application memory sit above the “out-of-process” distributed data grid, which sits above backing storage.]

Page 6: Top 6 Reasons to Use a Distributed Data Grid

Distributed Data Grids: A Closer Look

• Incorporates a client-side, in-process cache (“near cache”):
– Transparent to the application.
– Holds recently accessed data.
• Boosts performance:
– Eliminates repeated network data transfers & deserialization.
– Reduces access times to near “in-process” latency.
– Is automatically updated if the distributed grid changes.
– Supports various coherency models (coherent, polled, event-driven).

[Figure: “in-process” application memory and client-side cache in front of the “out-of-process” distributed data grid.]
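The near-cache behavior described on this slide can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not ScaleOut's client library: `NearCacheSketch`, `GridStore`, and the version check are hypothetical names and choices made for this example. The sketch approximates a "polled" coherency model by comparing a version number with the grid before serving the in-process copy, so a full object transfer happens only when the grid copy has changed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a client-side "near cache" in front of an
// out-of-process grid. Not ScaleOut's API; all names are illustrative.
final class NearCacheSketch {

    // Stand-in for the remote grid: versioned values plus a transfer counter.
    static final class GridStore {
        private final Map<String, String> values = new HashMap<>();
        private final Map<String, Long> versions = new HashMap<>();
        int fullFetches = 0; // how many times a whole object was transferred

        void put(String key, String value) {
            values.put(key, value);
            versions.merge(key, 1L, Long::sum); // bump the version on every write
        }

        long version(String key) { // cheap metadata check
            return versions.getOrDefault(key, 0L);
        }

        String fetch(String key) { // stands for network transfer + deserialization
            fullFetches++;
            return values.get(key);
        }
    }

    private static final class Entry {
        final String value;
        final long version;

        Entry(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final GridStore grid;
    private final Map<String, Entry> local = new HashMap<>();

    NearCacheSketch(GridStore grid) {
        this.grid = grid;
    }

    String get(String key) {
        long current = grid.version(key);
        Entry cached = local.get(key);
        if (cached != null && cached.version == current) {
            return cached.value; // served from in-process memory
        }
        String value = grid.fetch(key); // stale or missing: refresh from the grid
        local.put(key, new Entry(value, current));
        return value;
    }

    public static void main(String[] args) {
        GridStore grid = new GridStore();
        grid.put("cart:42", "1 item");
        NearCacheSketch cache = new NearCacheSketch(grid);
        cache.get("cart:42");           // miss: full fetch
        cache.get("cart:42");           // hit: no transfer
        grid.put("cart:42", "2 items"); // grid copy changes
        System.out.println(cache.get("cart:42") + ", full fetches: " + grid.fullFetches);
    }
}
```

An event-driven model would instead have the grid push invalidations to the client, trading the per-read version check for notification traffic.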

Page 7: Top 6 Reasons to Use a Distributed Data Grid

The Need for Memory-Based Storage

Example: Web server farm:

• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, line-of-business (LOB) data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.

[Figure: Internet traffic passes through a load-balancer to a Web server farm and an app. server farm, each with a distributed, in-memory data grid; the database servers with RAID disk array are marked as the bottleneck.]

Page 8: Top 6 Reasons to Use a Distributed Data Grid

The Need for Memory-Based Storage

Example: Cloud Application:

• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data from cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage.

[Figure: a cloud application running as app virtual servers (App VS), a distributed data grid running as grid virtual servers (Grid VS), and cloud-based storage.]

Page 9: Top 6 Reasons to Use a Distributed Data Grid

Scalability Challenges for Applications

• “Scaled out” server applications repeatedly access two types of data:
– Repeatedly referenced database data (e.g., stock prices) and
– Fast-changing, business-logic data (e.g., session state, workflow state)
• Database servers are not designed to meet this need:

Characteristics       Typical DBMS data   Application data
Volume                High                Low
Lifetime/turnover     Long/slow           Short/fast
Access patterns       Complex             Simple
Data preservation     Critical            Less critical
Fast access/update    Less important      More important

• Scaled-out applications create additional challenges:
– How to make shared application data quickly accessible by any server
– How to maintain fast access and avoid bottlenecks as the server farm grows
– How to keep application data highly available when a server fails

Page 10: Top 6 Reasons to Use a Distributed Data Grid

Wide Range of Applications for DDGs

Financial Services
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations

E-commerce
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
• News story caching
• Shopping carts
• Social networking
• Service call tracking
• Online surveys

Other Applications
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control

Page 11: Top 6 Reasons to Use a Distributed Data Grid

Product: ScaleOut StateServer®

Fully distributed data grid designed for storing application data on server farms, compute grids, and the cloud:

• Runs in-memory directly on a farm or grid as a distributed service.
• Automatically:
– Distributes and shares data across the farm.
– Reduces access time.
– Scales when the farm grows.
– Survives when a server fails.
• Cost-effective
• Complements & offloads DBMS.
• Portable across Windows and Linux.

[Figure: Web servers connected by Ethernet to the Internet and to a DBMS server marked “DBMS Bottleneck”; each Web server runs a SOSS service.]

Page 12: Top 6 Reasons to Use a Distributed Data Grid

Product: ScaleOut Remote Client Option

• Allows hosting ScaleOut StateServer on a separate server farm.
• Ensures highly available connectivity to SOSS store.
• Automatically load-balances access requests to minimize response times.
• Uses multiple connections to maximize throughput.

[Figure: client applications on a Web or application server farm use Windows and Linux remote clients to reach a ScaleOut StateServer farm of Windows and Linux SOSS hosts over load-balanced connections.]

Page 13: Top 6 Reasons to Use a Distributed Data Grid

Products: Grid Computing Edition

• Extends ScaleOut StateServer for use in high performance computing (HPC) applications.
• Provides advanced capabilities for parallel data analysis.
• Includes optional management tools.
• Complements SSI’s extended support plans.

[Figure: a master and compute servers with the SOSS service, and database servers marked “Data Bottleneck”.]

Page 14: Top 6 Reasons to Use a Distributed Data Grid

Products: ScaleOut GeoServer Option

Global, Multi-Site Data Grids

• Extends SOSS across multiple sites.
• Protects against site-wide failures.
• Replicates data between SOSS farms.
• Employs scalable, hi-av connections.
• Automatically handles membership changes at remote sites.
• Can support both “push” and “pull” access models.

Page 15: Top 6 Reasons to Use a Distributed Data Grid

Reason #1: Faster Access Time

• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.

[Chart: average response time in microseconds (0 to 3,500), DDG vs. DBMS; 10 KB objects, 20:1 read/update ratio.]

Page 16: Top 6 Reasons to Use a Distributed Data Grid

Example of Faster API Read Access

• Example for direct API access:
– 10 KB objects, 20:1 read/update ratio
– 3-host ScaleOut StateServer store with 3 clients
• Results:
– Distributed cache provided >6X faster read time than database server.

Page 17: Top 6 Reasons to Use a Distributed Data Grid

Reason #2: Linearly Scalable Throughput

ScaleOut StateServer automatically scales its performance to match the size and workload of a server farm or HPC compute grid.

[Chart: read/write throughput for 10 KB objects; accesses per second (0 to 80,000) vs. number of nodes (4 to 64) and number of objects (16,000 to 256,000).]

Tests performed in Microsoft Enterprise Engineering Center.

Page 18: Top 6 Reasons to Use a Distributed Data Grid

What is Scalable Throughput?

• What it is (a perfect fit for server farms):
– Workload W takes time T on 1 server (1 W/T).
– Workload 2W takes time T on 2 servers (2 W/T).
– Workload nW takes time T on n servers (n W/T).
– Total completion time (i.e., response time) stays fixed.
• What it is not (common misperception):
– Workload W takes time T/2 on 2 servers (2 W/T).
– Workload W takes time T/n on n servers (n W/T).
• Why increase the workload with more servers?
– Adding servers adds overhead (e.g., networking).
– Increasing workload hides overheads for linear scaling.
– DDG must keep overheads low for linear scaling.
– Must not let network saturate! (Its throughput is fixed.)
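The distinction on this slide can be checked with simple arithmetic. The sketch below uses made-up numbers (a workload W of 1,000 requests that takes T = 2 seconds on one server) and assumes ideal scaling with no overhead; the class and method names are hypothetical. Both interpretations yield the same throughput of n W/T, but only the first keeps the response time fixed at T while the workload grows with the farm.

```java
// Illustrative arithmetic only; the workload and timing numbers are assumptions.
final class ScalableThroughputSketch {

    // Throughput in work units per second.
    static double throughput(double workload, double seconds) {
        return workload / seconds;
    }

    public static void main(String[] args) {
        double w = 1000.0; // workload W handled by one server (assumed)
        double t = 2.0;    // time T to complete W on one server (assumed)

        for (int n = 1; n <= 4; n *= 2) {
            // Scalable throughput: workload grows to n*W and still completes in time T.
            double scaled = throughput(n * w, t);
            // Common misperception (speedup): workload stays at W and completes in T/n.
            double speedup = throughput(w, t / n);
            System.out.printf(
                "n=%d  scaled: %.0f units/s in %.1f s   speedup: %.0f units/s in %.1f s%n",
                n, scaled, t, speedup, t / n);
        }
    }
}
```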

Page 19: Top 6 Reasons to Use a Distributed Data Grid

How SOSS Achieves Scalable Throughput

• Fully peer-to-peer architecture to eliminate bottlenecks.
• Automatically partitioned data storage with dynamic load-balancing.
• Fixed number of replicas per stored object (1 or 2) to avoid order-n overhead (storage and latency).
• Patented technique for scaling quorum updates to stored objects.
• Patented, scalable heart-beating algorithm.

[Figure: Web or application servers, each running a cache service, connected by Ethernet to form the ScaleOut StateServer distributed cache; an object and its replica copy reside on different hosts, and neighboring hosts exchange heartbeats.]
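To see why a fixed replica count avoids order-n overhead, consider the minimal sketch below. It is an assumption-laden illustration, not SOSS's placement algorithm (which this deck does not describe): `ReplicaPlacementSketch` and its hash-based choice of a primary host plus its successors are hypothetical. The point is only that the number of copies per object stays constant as hosts are added, so per-object storage and update cost do not grow with the farm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical placement: primary host chosen by hash, replicas on the next hosts.
final class ReplicaPlacementSketch {

    // Returns the hosts holding an object: the primary plus `replicas` distinct successors.
    static List<Integer> placement(String key, int hostCount, int replicas) {
        if (hostCount < 1 || replicas < 0) {
            throw new IllegalArgumentException("need hostCount >= 1 and replicas >= 0");
        }
        int copies = Math.min(replicas + 1, hostCount); // cannot exceed available hosts
        int primary = Math.floorMod(key.hashCode(), hostCount);
        List<Integer> hosts = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            hosts.add((primary + i) % hostCount);
        }
        return hosts;
    }

    public static void main(String[] args) {
        // The copy count stays fixed while the farm grows from 4 to 64 hosts.
        for (int hostCount : new int[] {4, 16, 64}) {
            List<Integer> hosts = placement("session:1234", hostCount, 1);
            System.out.println(hostCount + " hosts -> " + hosts.size() + " copies on hosts " + hosts);
        }
    }
}
```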

Page 20: Top 6 Reasons to Use a Distributed Data Grid

Integrated, Powerful Platform for Scaling

• All product features benefit from the scalable, hi-av architecture:
– Ex. Parallel object eventing:
  • All hosts handle events.
  • Event delivery is hi-av.
– Ex. Global replication:
  • All hosts replicate objects.
  • Caches automatically handle membership changes.

[Figure: client applications with client libraries connected to the cache services of a local-farm ScaleOut StateServer distributed cache, which replicates to a remote farm.]

Page 21: Top 6 Reasons to Use a Distributed Data Grid

Impact of Scalable TP on Access Latency

• Scalable, distributed data grid scales throughput and thereby maintains low latency:
– DDG scales throughput by adding servers.
– Avoids throughput barrier of a DBMS or file system.
– Maintains low latency as throughput increases.
– Network bandwidth is only throughput limit.
– Also has inherently lower latency due to:
  • Memory-based storage
  • Client-side caching

[Chart: Access Latency vs. Throughput; access latency (msec) vs. throughput (accesses/sec) for SOSS and DBMS.]

Page 22: Top 6 Reasons to Use a Distributed Data Grid

Putting it Together: How SOSS Works

• Creating or updating an object:
– Client connects to a SOSS service instance and makes request.
– Local SOSS service load-balances request to a selected host.
– Selected host creates object and one or two remote replicas.

[Figure: a client connected to one of four servers, each running SOSS.]

Page 23: Top 6 Reasons to Use a Distributed Data Grid

How SOSS Works

• Reading an object:
– Client connects to SOSS service and makes request.
– Local SOSS service forwards to selected host.
– Selected host returns object’s data.
– Requesting host caches object for future reads.

[Figure: a client connected to one of four servers, each running SOSS.]

Page 24: Top 6 Reasons to Use a Distributed Data Grid

How SOSS Works

• Adding a new host:
– Neighboring hosts detect SOSS on new host.
– Hosts automatically establish new membership.
– Neighbor hosts migrate objects to new host to rebalance load.

[Figure: four servers running SOSS, joined by a fifth server running SOSS.]

Page 25: Top 6 Reasons to Use a Distributed Data Grid

Reason #3: High Availability

• Recovering from a host failure:
– Host or NIC fails.
– Neighboring hosts detect heartbeat failure.
– Hosts establish new membership.
– Neighbor host creates new object replica to “self-heal”.

[Figure: four servers running SOSS, one of which has stopped.]
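The "detect heartbeat failure" step above can be illustrated with a simple timeout-based monitor. This is a generic sketch, not SOSS's patented heart-beating algorithm; `HeartbeatMonitorSketch`, its methods, and the timeout value are hypothetical. A host is treated as failed when no heartbeat has arrived within the timeout, at which point a real grid would establish new membership and re-create the missing replicas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical timeout-based failure detector; times are passed in for determinism.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat from a neighboring host arrives.
    void recordHeartbeat(String host, long nowMillis) {
        lastSeen.put(host, nowMillis);
    }

    // Hosts whose most recent heartbeat is older than the timeout.
    List<String> failedHosts(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(1000);
        monitor.recordHeartbeat("hostA", 0);
        monitor.recordHeartbeat("hostB", 0);
        monitor.recordHeartbeat("hostA", 900); // hostB stays silent
        System.out.println("failed at t=1500: " + monitor.failedHosts(1500));
    }
}
```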

Page 26: Top 6 Reasons to Use a Distributed Data Grid

SOSS: Integrated High Availability

• Peer-to-peer architecture for maximum redundancy & scalability
• Fully integrated data replication for data redundancy, scalability, and ease of use:
– Partial replicas ensure scalable storage and throughput.
– Per-server and per-client caches ensure fast access.
• Self-discovery and self-healing for hi-av and ease of use
• Patented quorum algorithm for reliable updating with scalability

[Figure: a client application’s client library retrieves an object from the ScaleOut StateServer distributed cache; the cache services hold the object, its replica copy, and a cached copy.]

Page 27: Top 6 Reasons to Use a Distributed Data Grid

Reason #4: Sharing Data Across the Farm

The first step for server farms (1998): load-balanced, stateless Web applications.

• Without the ability to share data, we need “sticky” sessions (no hi-av!).
• Or we can overload the database server.
• Or we can share data across the farm in a distributed data grid for both scalability & hi-av.

[Figure: Web servers connected by Ethernet to the Internet and to a DBMS server; each Web server runs a SOSS service.]

Page 28: Top 6 Reasons to Use a Distributed Data Grid

The Evolution in DDGs and Data Sharing

Drivers:
• Scaling data access & analysis are critical to competitiveness.
• Server farms & the cloud are now mainstream computing platforms.
• Data access is a key bottleneck.
• Short dev. cycles are mandatory.
• Standard APIs are emerging.

[Chart: market penetration, 2005 to 2011, rising through session-state storage, application caching, platform-wide usage, and grid computing. Annotations: early adoption on Web and app. server farms for speed and hi-av; expansion to new verticals (e.g., financial services) for data & compute grids; cloud computing using industry-standard APIs; data analysis.]

Page 29: Top 6 Reasons to Use a Distributed Data Grid

Data Sharing: a Closer Look

• Advantages of sharing data in a distributed data grid:
– Boosts application performance and offloads the DBMS.
– Advances & simplifies the programming model:
  • Allows “stateful” business objects
  • Keeps object/relational mapping at the data access layer
• Examples: session & profile data, business objects, workflow state
• Requirements of a distributed data grid:
– Coherent storage so all clients see a consistent view
– Easy-to-use APIs
– Integrated object locking to enable coordinated updating
– High availability to avoid data loss if a server fails
– Advanced features to enable effective use of the grid (e.g., parallel query, map/reduce analysis)

Page 30: Top 6 Reasons to Use a Distributed Data Grid

Basic APIs for Data Access

• Are easy to use in C#, Java, or C/C++.
• Store objects in the grid as serialized blobs.
• Primarily use string or numeric keys to identify objects.
• Group objects into name spaces (“named caches”).

// Read and update object:
MyClass retrievedObj;
retrievedObj = cache["myObj"] as MyClass;
retrievedObj.var1 = "Hello, again!";
cache["myObj"] = retrievedObj;

Page 31: Top 6 Reasons to Use a Distributed Data Grid

Example: Named Cache Access (Java)

public static void main(String[] argv)
{
    // Initialize string object to be stored:
    String s = "Test string";

    // Create a cache collection:
    SossCache cache = SossCacheFactory.getCache("MyCache");

    // Store object in ScaleOut StateServer (SOSS):
    CachedObjectId id = new CachedObjectId(UUID.randomUUID());
    cache.put(id, s);

    // Read object stored in SOSS:
    String answerJNC = (String)cache.get(id);

    // Remove object from SOSS:
    cache.remove(id);
}

Page 32: Top 6 Reasons to Use a Distributed Data Grid

Example: Named Cache Access (C#)

static void Main(string[] args)
{
    // Initialize object to be stored:
    SampleClass sampleObj = new SampleClass();
    sampleObj.var1 = "Hello, SOSS!";

    // Create a cache:
    SossCache cache = CacheFactory.GetCache("myCache");

    // Store object in the distributed cache:
    cache["myObj"] = sampleObj;

    // Read and update object stored in cache:
    SampleClass retrievedObj = null;
    retrievedObj = cache["myObj"] as SampleClass;
    retrievedObj.var1 = "Hello, again!";
    cache["myObj"] = retrievedObj;

    // Remove object from the cache:
    cache["myObj"] = null;
}

Page 33: Top 6 Reasons to Use a Distributed Data Grid

Fully Distributed Locking

• Goal: synchronize access to a stored object by multiple client threads.
• Two mechanisms: pessimistic and optimistic locking
• Pessimistic uses read-modify-write semantics:
– Can be set as default for all objects within a named cache.
– Reads to locked objects are automatically retried.
– Locks have timeouts to handle client failures.
– Simple reads and updates can bypass locks.
• Optimistic uses object’s version number to allow or inhibit an update:
– User supplies version number from read to a locking update.
– Benefit: one trip to the server if high probability of success.

string myObj = cache.Retrieve("key", true); // read and lock
...
cache.Update("key", "new value", true); // update and unlock
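The optimistic mechanism described above can be illustrated with a small, self-contained Java sketch. It is not the SOSS API: `OptimisticLockSketch`, `Versioned`, and the method signatures are hypothetical, and an in-process map stands in for the grid. An update succeeds only if the version supplied by the caller still matches the stored version, so a client that read stale data must re-read and retry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical versioned store illustrating optimistic locking; not the SOSS API.
final class OptimisticLockSketch {

    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> store = new HashMap<>();

    synchronized void create(String key, String value) {
        store.put(key, new Versioned(value, 1));
    }

    synchronized Versioned read(String key) {
        return store.get(key);
    }

    // Applies the update only if the caller's version still matches the stored one.
    synchronized boolean update(String key, String newValue, long expectedVersion) {
        Versioned current = store.get(key);
        if (current == null || current.version != expectedVersion) {
            return false; // someone else updated first; caller must re-read and retry
        }
        store.put(key, new Versioned(newValue, expectedVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        OptimisticLockSketch grid = new OptimisticLockSketch();
        grid.create("key", "initial");

        Versioned seenByA = grid.read("key");
        Versioned seenByB = grid.read("key");

        System.out.println("A updates: " + grid.update("key", "from A", seenByA.version));
        System.out.println("B updates: " + grid.update("key", "from B", seenByB.version));

        // B re-reads to pick up the new version, then retries.
        Versioned fresh = grid.read("key");
        System.out.println("B retries: " + grid.update("key", "from B", fresh.version));
    }
}
```

When conflicts are rare, each update costs a single round trip, which is the benefit the slide points out; under heavy contention the retries can make pessimistic locking the better choice.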

Page 34: Top 6 Reasons to Use a Distributed Data Grid

Advanced API Features

• Object timeouts
• Distributed locking for coordinating access
• Object dependency relationships
• Asynchronous events on object changes
• Automatic access to a backing store
• Object eviction on high memory usage
• Object metadata
• Bulk insertion
• Authentication
• Custom serialization for compression & encryption
• Parallel query based on metadata or properties

Page 35: Top 6 Reasons to Use a Distributed Data Grid

Parallel Data Analysis

• The goal:
– Quickly analyze a large set of data for patterns and trends.
– Take advantage of scalable computing to shorten “time to insight.”
• Applications:
– Search
– Financial services
– Business intelligence
– Risk analysis
– Weather simulation
– Structural modeling
– Fluid-flow analysis
– Climate modeling

[Figure: NCAR Community Climate Model, http://www.vets.ucar.edu/vg/IPCC_CCSM3/index.shtml]

Page 36: Top 6 Reasons to Use a Distributed Data Grid

Reason #5: Parallel Data Analysis

• Rapid analysis of large data sets has become a top priority.
• Distributed data grids enable fast parallel analysis:
– Automatically harness the power of many servers and cores.
– Offer a simple, easy-to-use development model.
– Deliver top performance for memory-based datasets.
• Key attributes of DDG-based data analysis:
– Data is memory-based and data motion is minimized.
– Programming model is object-oriented; parallelism is automatic.

[Chart: PMI vs. random access throughput comparison for 2 MB time series objects; objects per second (0 to 600) vs. number of nodes (4 to 32) and number of objects (512 to 4,096); series: SOSS PMI, Random Access.]

Page 37: Top 6 Reasons to Use a Distributed Data Grid

Parallel Query

• Goal: identify a set of objects with selected properties.
• Uses all grid servers to scale query performance.
• Uses fast, optimized lookup on each grid server.

Steps:
– Query the DDG in parallel.
– Merge the keys into a list.
– Sequentially analyze all queried objects.
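The three steps above (query in parallel, merge the keys, analyze sequentially) can be sketched with plain Java collections. This is an illustration under assumptions, not the SOSS query engine: each `Map` stands in for the objects held by one grid server, and `ParallelQuerySketch.queryKeys` is a hypothetical helper that filters every partition in parallel and merges the matching keys into one sorted list.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for a partitioned grid: one Map per grid server.
final class ParallelQuerySketch {

    // Filters every partition in parallel and merges the matching keys into one list.
    static <V> List<String> queryKeys(List<Map<String, V>> partitions, Predicate<V> filter) {
        return partitions.parallelStream()
                .flatMap(partition -> partition.entrySet().stream()
                        .filter(entry -> filter.test(entry.getValue()))
                        .map(Map.Entry::getKey))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two "grid servers" holding ticker -> price (made-up values).
        List<Map<String, Double>> partitions = List.of(
                Map.of("GOOG", 600.0, "IBM", 180.0),
                Map.of("ORCL", 30.0));

        // Steps 1 and 2: query in parallel, merge the keys.
        List<String> keys = queryKeys(partitions, price -> price < 200.0);

        // Step 3: the client analyzes the queried objects sequentially.
        for (String key : keys) {
            System.out.println("analyzing " + key);
        }
    }
}
```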

Page 38: Top 6 Reasons to Use a Distributed Data Grid

Parallel Query Example (Java)

• Mark class properties as indexes for SOSS query:

public class Stock implements Serializable {
    private String ticker;
    private int totalShares;
    private double price;

    @SossIndexAttribute
    public String getTicker() {
        return ticker;
    }
    …
}

• Define a query using these properties:

NamedCache cache = CacheFactory.getCache("Stocks", false);
Set keys = cache.queryKeys(Stock.class,
    or(equal("ticker", "GOOG"),
       equal("ticker", "ORCL")));

Page 39: Top 6 Reasons to Use a Distributed Data Grid

Parallel Query Example (C#)

• Mark class properties as indexes for SOSS query:

class Stock {
    [SossIndex]
    public string Ticker { get; set; }
    public decimal TotalShares { get; set; }
    public decimal Price { get; set; }
}

• Define a query using these properties. Objects are automatically read into memory:

NamedCache cache = CacheFactory.GetCache("Stocks");
var q = from s in cache.QueryObjects<Stock>()
        where s.Ticker == "GOOG" || s.Ticker == "ORCL"
        select s;
Console.WriteLine("{0} Stocks found", q.Count());

Page 40: Top 6 Reasons to Use a Distributed Data Grid

Parallel Method Invocation (“Map/Reduce”)

• Goal: analyze a set of objects with selected properties.
• Executes user’s code in parallel across the grid.
• Uses a parallel query to select objects for analysis.

[Figure: the in-memory distributed data grid runs map/reduce analysis: analyze data (map), then combine results (reduce).]

Page 41: Top 6 Reasons to Use a Distributed Data Grid

Example in Financial Services

Analyze trading strategies across stock histories:

Why?
• Back-testing systems help guard against risks in deploying new trading strategies.
• Performance is critical for “first to market” advantage.
• Uses significant amount of market data and computation time.

How?
• Write method E to analyze trading strategies across a single stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…

Page 42: Top 6 Reasons to Use a Distributed Data Grid


Stage the Data for Analysis

• Step 1: Populate the distributed data grid with objects each of which represents a price history for a ticker symbol:

Page 43: Top 6 Reasons to Use a Distributed Data Grid

Code the Eval and Merge Methods

• Step 2: Write a method to evaluate a stock history based on parameters:

Results EvalStockHistory(StockHistory history, Parameters params)
{
    <analyze trading strategy for this stock history>
    return results;
}

• Step 3: Write a method to merge the results of two evaluations:

Results MergeResults(Results results1, Results results2)
{
    <merge both results>
    return results;
}

• Notes:
– This code can be run as a sequential calculation on in-memory data.
– No explicit accesses to the distributed data grid are used.

Page 44: Top 6 Reasons to Use a Distributed Data Grid

Run the Analysis

• Step 4: Invoke parallel evaluation and merging of results:

Results Invoke(EvalStockHistory, MergeResults, querySpec, params);

[Figure: EvalStockHistory() runs across the grid, followed by MergeResults().]
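The eval/merge pattern from Steps 2 to 4 can be sketched with Java parallel streams. This is a single-process illustration, not the PMI engine or its API: `EvalMergeSketch.invoke`, the `bestDailyGain` evaluation, and the sample price histories are all hypothetical. The eval method sees one stock history at a time, the merge method combines two results, and neither touches the data store explicitly, which mirrors the notes on the previous slide.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical single-process analogue of parallel eval followed by pairwise merge.
final class EvalMergeSketch {

    // Runs eval on every object in parallel, then combines results pairwise with merge.
    static <T, R> R invoke(List<T> objects, Function<T, R> eval, BinaryOperator<R> merge) {
        return objects.parallelStream()
                .map(eval)
                .reduce(merge)
                .orElseThrow(() -> new IllegalArgumentException("no objects selected"));
    }

    // Example eval method: largest single-step price gain in one stock history.
    static double bestDailyGain(double[] history) {
        double best = 0.0;
        for (int i = 1; i < history.length; i++) {
            best = Math.max(best, history[i] - history[i - 1]);
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up price histories, one array per ticker symbol.
        List<double[]> histories = List.of(
                new double[] {10, 12, 11},
                new double[] {5, 9, 8},
                new double[] {20, 21, 25});

        // Merge method: keep the larger of two results.
        Double best = invoke(histories, EvalMergeSketch::bestDailyGain, Double::max);
        System.out.println("best daily gain across all histories: " + best);
    }
}
```

The merge function must be associative for the pairwise combination to give the same answer regardless of how results are grouped; taking a maximum satisfies that.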

Page 45: Top 6 Reasons to Use a Distributed Data Grid

[Figure: parallel analysis starts; .eval() runs on each stock history to produce results; .merge() combines the results in stages; the final results are returned to the client.]

Page 46: Top 6 Reasons to Use a Distributed Data Grid

Advantages of Using PMI

• Fast:
– Automatically scales application performance across grid servers.
– Automatically uses all server cores.
– Minimizes data motion between servers.
– API-based invocation delivers very low latency.
• Easy to Use:
– User writes simple, “in memory” code; all grid accesses are implicit.
– Matches Java/C# model of object-oriented collections.
– Requires no tuning.

[Figure: a grid service with a PMI engine running across four cores.]

Page 47: Top 6 Reasons to Use a Distributed Data Grid

Comparison of DDGs and File-Based M/R

                       DDG                                       File-Based M/R
Data set size          Gigabytes to terabytes                    Terabytes to petabytes
Data repository        In-memory                                 File / database
Data view              Queried object collection                 File-based key/value pairs
Development time       Low                                       High
Automatic scalability  Yes                                       Application dependent
Best use               Quick-turn analysis of memory-based data  Complex analysis of large datasets
I/O overhead           Low                                       High
Cluster mgt.           Simple                                    Complex
High availability      Memory-based                              File-based

Page 48: Top 6 Reasons to Use a Distributed Data Grid

DDG Minimizes Data Motion

• File-based map/reduce must move data to memory for analysis.
• Memory-based DDG analyzes data in place.

[Figure: with file-based map/reduce, data (D) moves from the file system / database into M/R server memory before evaluation (E); with the distributed data grid, evaluation (E) runs on the grid servers that already hold the data (D).]

Page 49: Top 6 Reasons to Use a Distributed Data Grid

[Figure: parallel analysis flow with file I/O: stock histories are loaded via file I/O, .eval() produces results, .merge() combines the results in stages, and the final results are returned to the client.]

Page 50: Top 6 Reasons to Use a Distributed Data Grid

Performance Impact of Data Motion

Measured random access to DDG data to simulate file I/O:

Page 51: Top 6 Reasons to Use a Distributed Data Grid

PMI Delivers 16X Speedup Over Hadoop

[Chart: throughput comparison; throughput (obj/sec, 0 to 800) vs. number of servers (4, 6, 8) for SOSS PMI, Hadoop/SOSS, and Hadoop.]

Page 52: Top 6 Reasons to Use a Distributed Data Grid

Reason #6: Simplify Data Migration

• DDGs enable seamless data migration across on-premise sites and the cloud:
– Automatically access remote data as needed.
– Efficiently manage WAN bandwidth.
– Enable full data synchronization across sites.

[Figure: in-memory distributed data grid.]

Page 53: Top 6 Reasons to Use a Distributed Data Grid

Example: Web Farm Cloud-Bursting

• DDGs bridge on-premise and cloud-based in-memory storage of Web session state.
• DDG automatically migrates session-state objects into the cloud on demand.
• This enables seamless access to data across multiple sites.

[Figure: a Web load balancer spans the user’s on-premise application (server apps, SOSS hosts, backing store) and a cloud application (App VS, SOSS VS) in a cloud of virtual servers; the on-premise and cloud-hosted distributed data grids automatically migrate data and together form a virtual distributed data grid.]

Page 54: Top 6 Reasons to Use a Distributed Data Grid

Example: Global Access to Shared Data

[Figure: a global distributed data grid linking four distributed data grids of SOSS servers, located in mirrored data centers and satellite data centers.]

Page 55: Top 6 Reasons to Use a Distributed Data Grid

What to Look for in a DDG Product

• Performance: In direct comparison tests, SSI demonstrates faster access performance and scalability in key benchmarks.
• Architecture: SSI’s architecture integrates both scalability and high availability and uniformly applies key architectural principles, such as peer-to-peer design.
• Ease of Use: SSI’s products have an unusually high level of integration and focus on automatic operation. This dramatically simplifies deployment and management of a distributed data grid.
• Portability: Seamless interoperability across Windows and Unix (Linux, Solaris, etc.) operating systems was designed into SSI’s architecture from the outset.
• Data Analysis: Advanced capabilities for “map/reduce”-style parallel data analysis open up important new applications for distributed data grids.
• Manageability: SSI’s comprehensive tools for managing distributed data grids, such as its object browser and parallel backup and restore utility, are unique in the industry.

Page 56: Top 6 Reasons to Use a Distributed Data Grid

SOSS Maximizes Ease of Use

Tree list shows:
• Store status
• Host list
• Host status
• Remote stores
• Remote client configuration

Host configuration pane: just need to select the subnet shared by all hosts.

Grid servers self-aggregate, self-heal, and automatically load-balance.

Page 57: Top 6 Reasons to Use a Distributed Data Grid


Real-time Performance Charting

Page 58: Top 6 Reasons to Use a Distributed Data Grid

SOSS Object Browser

• Simplifies development.
• Provides extremely useful visibility into grid usage.
• Allows grid objects to be analyzed and managed.

Page 59: Top 6 Reasons to Use a Distributed Data Grid

SOSS Parallel Backup and Restore

• Enables grid contents (or portions) to be backed up or restored in parallel either to:
– Separate file systems on all caching servers or
– A single network file share
• Creates backups or snapshots for later analysis.
• Makes full use of SOSS’s parallel implementation to deliver highly scalable performance and high availability.

[Figure: two farms of servers, each running SOSS, connected by Ethernet.]

Page 60: Top 6 Reasons to Use a Distributed Data Grid

Recap: Top 6 Reasons to Use a DDG

1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or network link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using scalable, “map/reduce” analysis
6. Transparent data migration across multiple sites and the cloud

[Chart: Access Latency vs. Throughput; access latency (msec) vs. throughput (accesses/sec) for Grid and DBMS.]

Page 61: Top 6 Reasons to Use a Distributed Data Grid

Distributed Data Grids for Server Farms & High Performance Computing

www.scaleoutsoftware.com

Thank you for joining us today!