files.meetup.com/87095/C.Purdy_20080207.pdf · 2/7/2008 · • Cameron is the JCache (JSR107) specification lead and a member of the Work Manager (JSR237)


Creating Grid-Based Data Infrastructures for the Enterprise

Cameron Purdy
Vice President of Development

Speaker

• Cameron Purdy, formerly CEO of Tangosol, is VP of Development at Oracle, and a contributor to Java and XML specifications

• Cameron is the JCache (JSR107) specification lead and a member of the Work Manager (JSR237) expert group.

• Oracle Coherence is the leading Data Grid software for Java and J2EE environments. Coherence enables highly scalable in-memory data management and caching for clustered Java applications.

Agenda

Grid-Based Data Infrastructures

• Concepts
• Capabilities
• Use Cases
• Futures

Data Grid: Introduction

• Provides a reliable data tier with a single, consistent view of data
• Enables dynamic data capacity including fault tolerance and load balancing
• Ensures that data capacity scales with processing capacity

Data Grid: Organic Resiliency

• Cluster of nodes holding % of primary data locally
• Back-up of primary data is distributed across all other nodes
• Logical view of all data from any node
• All nodes verify health of each other
• In the event a node is unhealthy, other nodes diagnose state
• Unhealthy node isolated from cluster
• Remaining nodes redistribute primary and back-up responsibilities to healthy nodes
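The failover behavior described on this slide can be sketched in plain Java. This is a toy model under stated assumptions: the class `ResiliencySketch`, its hash-based placement, and the node names are all hypothetical and are not taken from Coherence. It only shows the idea that every key has a primary and a backup on different nodes, and that a failed node's responsibilities move to the healthy ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of primary/backup ownership; not how any product assigns data. */
final class ResiliencySketch {
    final List<String> live = new ArrayList<>();          // healthy nodes
    final Map<String, String> primary = new HashMap<>();  // key -> primary owner
    final Map<String, String> backup = new HashMap<>();   // key -> backup owner

    ResiliencySketch(List<String> nodes) { live.addAll(nodes); }

    /** Picks a live node other than 'exclude' (needs at least two live nodes). */
    private String another(String key, String exclude) {
        List<String> candidates = new ArrayList<>(live);
        candidates.remove(exclude);
        return candidates.get(Math.floorMod(key.hashCode(), candidates.size()));
    }

    void assign(String key) {
        String p = live.get(Math.floorMod(key.hashCode(), live.size()));
        primary.put(key, p);
        backup.put(key, another(key, p));
    }

    /** Isolates a node, promotes the backups it was covering, and re-creates lost backups. */
    void fail(String node) {
        live.remove(node);
        for (String key : new ArrayList<>(primary.keySet())) {
            if (primary.get(key).equals(node)) {
                primary.put(key, backup.get(key));               // promote the backup
            }
            if (primary.get(key).equals(backup.get(key)) || backup.get(key).equals(node)) {
                backup.put(key, another(key, primary.get(key))); // new backup elsewhere
            }
        }
    }

    public static void main(String[] args) {
        ResiliencySketch grid = new ResiliencySketch(List.of("A", "B", "C"));
        grid.assign("trade-1");
        grid.fail(grid.primary.get("trade-1"));
        System.out.println(grid.primary.get("trade-1") + " / " + grid.backup.get("trade-1"));
    }
}
```

After `fail`, no key lists the failed node as primary or backup, and primary and backup remain on different nodes as long as at least two nodes are healthy.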

Data Grid: Uses

• Caching: Applications request data from the Data Grid rather than backend data sources
• Analytics: Applications ask the Data Grid questions, from simple queries to advanced scenario modeling
• Transactions: Data Grid acts as a transactional System of Record, hosting data and business logic
• Events: Automated processing based on events

• Data: Not Just Algorithms
  • Applications tend to assume data is always available, immediately
• Real-Time: Not Yesterday’s Data
  • Applications make decisions on live data
  • Lots of examples in financial services, telcos, e-commerce
• Transactional: Micro/Nano/Pico “Jobs”
  • Application decisions reflect a one-time reality
  • Durability required
• Virtualization: Share your toys
  • Grid infrastructure viewed as a shared IS resource
  • Information viewed as a shared grid resource

Data Grid: Indications

• Built for continuous operation

• Data Fault Tolerance
• Self-Diagnosis and Healing
• “Once and Only Once” Processing

• Dynamically Expandable

• No data loss at any volume

• No interruption of service

• Leverage Commodity Hardware

• Cost Effective

Scalable Universal Data

Data Grid: Requirements

• Single view of data
• Single management view
• Simple programming model
• Any Application
• Any Data Source
• Data Caching
• Analytics
• Transaction Processing
• Event Processing

Reliable

Data Grid: Market Drivers

• Data growth and use have dramatically outpaced existing methods for feeding and managing data between applications and data sources.

• Ensuring availability, reliability, scalability of mission critical applications has evolved from difficult to near impossible as applications have become public-facing, concurrent user load has increased dramatically, and failures are often public.

• Organizations require predictable cost of scale, so that business success (e.g. profit) corresponds to technical success (e.g. handling additional load).

Data Grid: IT Initiatives driving adoption

• Virtualization
• Increased demand on Data Sources
• Application re-provisioning must occur transparently without interruption of data access

• Must handle multiple load increases at the same time

[Figure: Demand and Supply curves; axes: Data, Time]

Data Grid: IT Initiatives driving adoption

• SOA
  • Implicit trade-off of efficiency for modularization & re-use
  • Service composition statistically decreases reliability
  • Increases common access to shared resources
  • Obvious requirements for scalability & continuous availability
• EDA
  • Data-centric Event Driven Architectures are becoming common
  • Workflow, BPEL, algorithmic trading, fraud detection

• Events drive transactions generate events drive transactions …

• Transaction and event rates exceeding 1 million/second!

[Figure: Demand and Supply curves; axes: Data, Time]

Data Grid: The Solution

• Data as a Service
  • Data in its application form
  • Java, C#, C++ objects
• Horizontal Scale
  • Common Off The Shelf (COTS) & Commodity Hardware
  • Switched network infrastructure

Data Grid: The Solution

• Data Integration occurs in the Data Service
  • Integration uses the domain model
  • The data is both live and shared

• Events provide bi-directional flow

• Applications can respond to events

Overview Summary

• Problem: Extreme increase in Access, Volume and Complexity of Data Use
• Challenge:
  • Meeting User Demands & Expectations
  • Difficulty meeting Service Level Agreements
  • Managing infrastructure growth
  • Cost Containment
• Solution: Provide Reliable, Scalable, Universal Data Access & Management

Data Grid: Defined

• A Data Grid combines data management with data processing in a scale-out environment
  • Some or all of the servers in the grid are responsible for reliably managing live information
  • Any or all of the servers in the grid are able to simultaneously access and manipulate a shared, consistent view of that information

• Processing power far exceeds aggregate network bandwidth, so data access and manipulation must be parallelized and localized within the grid

• Data locality is both dynamic and transparent

Data Grid: Concepts

• There are only two things you can move in a distributed environment: State and Behavior
  • State distribution: replication, distributed caching
  • Behavior distribution: RPC, RMI
  • A Data Grid combines these two, moving the data to where the processing is, or the processing to where the data is (locality)
• State and Behavior distribution are both required
  • Data Access: Distribution of state allows all servers to use common, shared data for their processing
  • Data Processing: Distribution of behavior allows servers to process in parallel, and without moving data

Data Grid: Concepts

• Locality of data
  • Most applications spend most of their time waiting for data
  • This holds true even in a grid that has all of the data in-memory
  • With distribution of behavior, processing can occur on the server within a grid that has the best locality of data
  • With data partitioning, the “owner” of each piece of data is known
• Benefits of Locality
  • Moving the parameters of execution is often much more efficient than moving the data to process
  • Locality simplifies concurrent access management
  • With simple units of work, it can eliminate two-phase commit (XA)
  • Enables transparent parallelization of work in a grid



Universal Access & Management

• All data in Data Grid accessible from any single node

• Optimizes data locality in Grid based on usage or access

• Parallelizes data loading, data queries, processing of data managed in grid

• Can queue transactions and persist later to database

Replicated Topology

• Goal: Extreme Performance.
• Solution: Data is Replicated to all members of the Data Grid.
• Zero Latency Access: Since the data is replicated to each grid node, it is available for use without any waiting. This provides the highest possible speed for data access. Each grid node accesses the data from its own memory.
• Limitations:
  • Cost Per Update: Replication uses a large amount of network bandwidth, and some CPU from all grid nodes.
  • Cost Per Entry: The same data set is located on each grid node, meaning that the data capacity does not scale.
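The two limitations of the replicated topology can be made concrete with a small model. This is a minimal sketch with hypothetical names (`ReplicatedSketch`, `updateMessages`, `storedCopies`); each in-process `HashMap` stands in for one grid node, and no real networking is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy replicated map: every node holds every entry. */
final class ReplicatedSketch {
    private final List<Map<String, String>> nodes = new ArrayList<>();
    int updateMessages = 0; // one simulated message per node per update

    ReplicatedSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    /** Cost per update: the write is applied on every node. */
    void put(String key, String value) {
        for (Map<String, String> node : nodes) {
            node.put(key, value);
            updateMessages++;
        }
    }

    /** Reads are served from the local node's own memory. */
    String get(int node, String key) { return nodes.get(node).get(key); }

    /** Cost per entry: total copies held across the grid. */
    int storedCopies() {
        int total = 0;
        for (Map<String, String> node : nodes) total += node.size();
        return total;
    }

    public static void main(String[] args) {
        ReplicatedSketch grid = new ReplicatedSketch(4);
        grid.put("symbol", "ORCL");
        System.out.println(grid.get(3, "symbol") + ", copies=" + grid.storedCopies());
    }
}
```

In this model one logical entry costs as many copies and update messages as there are nodes, which is why adding nodes does not add data capacity.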

Partitioned Topology

• Goal: Extreme Scalability.
• Solution: Transparently partition the data to distribute the load across all grid nodes.
• Linear Scalability: By partitioning the data evenly, the per-port throughput (the amount of work being performed by each server) remains constant.
• Benefits
  • Partitioned: The size of the data set and the processing power available grow linearly with the size of the data grid.
  • Load-Balanced: The responsibility for managing the data is automatically load-balanced across the data grid.
  • Ownership: Exactly one node in the data grid is responsible for each piece of data.
  • Point-To-Point: The communication is all point-to-point, enabling linear scalability on a switched infrastructure.
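The ownership rule ("exactly one node is responsible for each piece of data") can be illustrated with a hash-based owner lookup. This is a sketch under assumptions: `PartitionedSketch` and its modulo placement are hypothetical, and a real grid would typically map keys to partitions and partitions to members rather than keys directly to nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partitioned map: each key lives on exactly one owning node. */
final class PartitionedSketch {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    /** Any node can compute the owner of a key without asking the others. */
    int ownerOf(String key) { return Math.floorMod(key.hashCode(), nodes.size()); }

    /** Point-to-point: only the owner is contacted. */
    void put(String key, String value) { nodes.get(ownerOf(key)).put(key, value); }

    String get(String key) { return nodes.get(ownerOf(key)).get(key); }

    /** Total entries held across the grid (one copy per key in this model). */
    int storedEntries() {
        int total = 0;
        for (Map<String, String> node : nodes) total += node.size();
        return total;
    }

    public static void main(String[] args) {
        PartitionedSketch grid = new PartitionedSketch(3);
        for (int i = 0; i < 100; i++) grid.put("k" + i, "v" + i);
        System.out.println("entries=" + grid.storedEntries() + ", owner of k7=" + grid.ownerOf("k7"));
    }
}
```

Because each key is stored once, total capacity grows with the number of nodes, in contrast to the replicated model.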

Near Topology

• Goal: Extreme Performance. Extreme Scalability.
• Solution: Local In-Memory cache in front of the entire data set provided by the Data Grid.
• Result: Zero Latency Access to recently-used and frequently-used data. Scalable data capacity and data throughput, with a fixed cost for worst-case.
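A near cache can be approximated as a small local LRU map in front of a larger backing map. This is a minimal sketch: `NearCacheSketch`, `remoteReads`, and the use of a plain `Map` as the "grid" are assumptions for illustration, and invalidation of the local copy when another node changes the data is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy near cache: a bounded local LRU front over a backing map that stands in for the grid. */
final class NearCacheSketch<K, V> {
    private final Map<K, V> backing;
    private final Map<K, V> front;
    int remoteReads = 0; // how often the local front missed

    NearCacheSketch(Map<K, V> backing, final int maxLocal) {
        this.backing = backing;
        // Access-ordered LinkedHashMap evicts the least recently used entry.
        this.front = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxLocal;
            }
        };
    }

    V get(K key) {
        V value = front.get(key);
        if (value == null) {            // worst case: one trip to the backing map
            remoteReads++;
            value = backing.get(key);
            if (value != null) front.put(key, value);
        }
        return value;
    }

    void put(K key, V value) {
        backing.put(key, value);
        front.put(key, value);
    }

    public static void main(String[] args) {
        Map<String, Integer> grid = new java.util.HashMap<>();
        grid.put("a", 1);
        NearCacheSketch<String, Integer> near = new NearCacheSketch<>(grid, 2);
        near.get("a");
        near.get("a");
        System.out.println("remote reads: " + near.remoteReads);
    }
}
```

Repeated reads of the same key are served locally; only misses pay the fixed cost of going to the backing store.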

Events

• Universal: All data sets provide events, regardless of the topology.
• Distributed: The events are always delivered efficiently to the interested listeners.
  • Regardless of originating node
• Flexible:
  • Listen to entire data sets, specific identities, and even to queries!
  • Provides “before” and “after” state
  • Both sync and async event models
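The listener model on this slide (listen to a whole data set or to a specific identity, and receive "before" and "after" state) might look like the following in plain Java. `ObservableMapSketch` and its `Listener` interface are hypothetical names, not the Coherence API, and only synchronous delivery is modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy observable map with synchronous change events carrying before/after state. */
final class ObservableMapSketch<K, V> {
    interface Listener<K, V> {
        void onChange(K key, V before, V after);
    }

    private final Map<K, V> data = new HashMap<>();
    private final List<Listener<K, V>> allKeys = new ArrayList<>();
    private final Map<K, List<Listener<K, V>>> perKey = new HashMap<>();

    /** Listen to the entire data set. */
    void listen(Listener<K, V> listener) { allKeys.add(listener); }

    /** Listen to one specific identity. */
    void listen(K key, Listener<K, V> listener) {
        perKey.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
    }

    void put(K key, V value) { fire(key, data.put(key, value), value); }

    void remove(K key) {
        if (data.containsKey(key)) fire(key, data.remove(key), null);
    }

    private void fire(K key, V before, V after) {
        for (Listener<K, V> l : allKeys) l.onChange(key, before, after);
        for (Listener<K, V> l : perKey.getOrDefault(key, List.of())) l.onChange(key, before, after);
    }

    public static void main(String[] args) {
        ObservableMapSketch<String, Integer> prices = new ObservableMapSketch<>();
        prices.listen("ORCL", (k, before, after) -> System.out.println(k + ": " + before + " -> " + after));
        prices.put("ORCL", 20);
        prices.put("ORCL", 21);
    }
}
```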

Query

• Parallel Query: A query is performed in parallel across the Data Grid, using indexing and an iterative Cost Based Optimizer.
  • Customizable predicates
  • Custom indexes
  • Custom aggregators
• Continuous Query: Combines a query with events to provide a local materialized view.
  • Result is up-to-date in real-time
  • Like the Near Topology, but it always contains the desired data
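A continuous query can be thought of as an initial query plus an event handler that keeps the result current. The sketch below assumes a hypothetical `ContinuousQuerySketch` class whose `onChange` method is called by some event source for every change; it is an illustration of the idea, not the product's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Toy continuous query: a filter plus change events maintain a local materialized view. */
final class ContinuousQuerySketch<K, V> {
    private final Predicate<V> filter;
    private final Map<K, V> view = new HashMap<>();

    /** Runs the query once against the current data to seed the view. */
    ContinuousQuerySketch(Map<K, V> source, Predicate<V> filter) {
        this.filter = filter;
        for (Map.Entry<K, V> e : source.entrySet()) {
            if (filter.test(e.getValue())) view.put(e.getKey(), e.getValue());
        }
    }

    /** Called for every change; 'after' is null when the entry was removed. */
    void onChange(K key, V before, V after) {
        if (after != null && filter.test(after)) {
            view.put(key, after);   // entered, or stayed in, the result set
        } else {
            view.remove(key);       // left the result set (or was never in it)
        }
    }

    Map<K, V> view() { return Collections.unmodifiableMap(view); }

    public static void main(String[] args) {
        Map<String, Integer> lots = new HashMap<>();
        lots.put("t1", 50);
        lots.put("t2", 500);
        ContinuousQuerySketch<String, Integer> big = new ContinuousQuerySketch<>(lots, lot -> lot > 100);
        big.onChange("t1", 50, 250);
        System.out.println(big.view().keySet());
    }
}
```

Unlike a near cache, the view is never missing a matching entry, because membership is driven by the filter rather than by recent access.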

Concurrency

• Implicit: Queueing of operations
  • Virtual queue & thread per entry
• Explicit: Pessimistic locking
  • Grid-Wide Mutex
• Transactions: Unit of work management
  • Both optimistic and pessimistic transactions
  • Isolation levels from read-committed through serializable
  • Integrated with JTA

Read-Through & Write-Through

• Access to the data sources goes through the Data Grid.
• Read and write operations are always managed by the node that owns the data within the Data Grid.
• Concurrent accesses are combined, greatly reducing database load.
• Write-Through keeps the in-memory data and the database in sync.
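The "concurrent accesses are combined" point can be sketched with `ConcurrentHashMap.computeIfAbsent`, which applies the loading function at most once per key even when many threads ask for the same missing entry. The class `ReadWriteThroughSketch` and the use of a `Map` as the database are assumptions for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy read-through / write-through cache over a map that stands in for the database. */
final class ReadWriteThroughSketch<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> database;                        // stand-in for the backend data source
    final AtomicInteger databaseReads = new AtomicInteger();

    ReadWriteThroughSketch(Map<K, V> database) { this.database = database; }

    /** Read-through: a miss loads from the database once, however many callers are waiting. */
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            databaseReads.incrementAndGet();
            return database.get(k);
        });
    }

    /** Write-through: the database is updated as part of the same per-key atomic step. */
    void put(K key, V value) {
        cache.compute(key, (k, old) -> {
            database.put(k, value);
            return value;
        });
    }

    public static void main(String[] args) {
        Map<String, Integer> db = new ConcurrentHashMap<>();
        db.put("a", 1);
        ReadWriteThroughSketch<String, Integer> grid = new ReadWriteThroughSketch<>(db);
        grid.get("a");
        grid.get("a");
        System.out.println("database reads: " + grid.databaseReads.get());
    }
}
```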

Write-Behind

• Write-Behind accepts data modifications directly into the Data Grid
• The modifications are then asynchronously written back to the data source, optionally after a specified delay
• All write-behind data is synchronously and redundantly managed, making it resilient to server failure
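Write-behind can be sketched as a cache plus a queue of pending modifications that is flushed later. In this hypothetical `WriteBehindSketch`, `flush()` would be driven by a timer or background thread in practice, and the redundant (backed-up) storage of the pending queue mentioned on the slide is not modeled.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy write-behind cache: writes land in memory first and reach the database on flush(). */
final class WriteBehindSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> pending = new LinkedHashMap<>(); // latest pending value per key
    private final Map<K, V> database;                        // stand-in for the data source
    int databaseWrites = 0;

    WriteBehindSketch(Map<K, V> database) { this.database = database; }

    /** Accepts the modification immediately; repeated writes to a key coalesce. */
    void put(K key, V value) {
        cache.put(key, value);
        pending.put(key, value);
    }

    V get(K key) { return cache.get(key); }

    /** Writes the queued modifications back to the data source. */
    void flush() {
        for (Map.Entry<K, V> e : pending.entrySet()) {
            database.put(e.getKey(), e.getValue());
            databaseWrites++;
        }
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, Integer> db = new HashMap<>();
        WriteBehindSketch<String, Integer> grid = new WriteBehindSketch<>(db);
        grid.put("a", 1);
        grid.put("a", 2);
        grid.flush();
        System.out.println(db + ", writes=" + grid.databaseWrites);
    }
}
```

Coalescing is one reason write-behind reduces database load: several updates to the same entry within the delay window become a single database write.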

Moving Behavior

• Node-Based: Directed execution to specific node(s)
  • Use case: Clustered management, JMX

    Map map = isvc.query(new Agent(), setMembers);

• Task-Based: CommonJ WorkManager API (BEA/IBM)
  • The grid becomes one giant thread pool for parallel execution
  • Moving to final JSR 236/237 API (BEA, IBM, Oracle, Doug Lea)

    // schedule() returns a WorkItem; waitForAll() waits on the WorkItems
    List items = new ArrayList();
    for (int i = 0, c = work.size(); i < c; ++i) {
        items.add(mgr.schedule((Work) work.get(i)));
    }
    mgr.waitForAll(items, timeout);

• Data-Centric: Localizes processing of data
  • Equivalent of a Stored Procedure for a Data Grid
  • Achieves Once-And-Only-Once Processing
  • Supports Entry Processors and Parallel Aggregators

Data Grid Processing

• Compare the cost of moving state versus behavior in a large-scale Data Grid
  • State: lock(id), v=get(id), process, put(id,v), unlock(id)

    map.lock(id, -1);
    try {
        Integer I = (Integer) map.get(id);
        int c = (I == null ? 0 : I.intValue());
        map.put(id, Integer.valueOf(++c));
        return c;
    } finally {
        map.unlock(id); // unlock the same key that was locked
    }

• Compare the cost of moving state versus behavior in a large-scale Data Grid
  • Behavior: execute(id, process)
  • 72% fewer network hops per process
  • 100% less network critical-section time (implicit concurrency)

    // single entry
    return map.invoke(id, new NumberIncrementor(..));

    // all entries matching a query, or a set of identities
    results = map.invokeAll(query, new NumberIncrementor(..));
    results = map.invokeAll(setIds, new NumberIncrementor(..));
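The difference between the two approaches can be counted in a toy model where every call that would cross the network increments a counter. `MovingBehaviorSketch` and its request counter are hypothetical; the 4-to-1 ratio below is a property of this simplified model (it ignores backups and message hops) and is not meant to reproduce the percentages on the slide.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Toy comparison: increment a counter by moving state versus moving behavior. */
final class MovingBehaviorSketch {
    private final ConcurrentMap<String, Integer> data = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    final AtomicInteger requests = new AtomicInteger(); // simulated client-to-owner requests

    private void lock(String id) {
        requests.incrementAndGet();
        locks.computeIfAbsent(id, k -> new ReentrantLock()).lock();
    }

    private void unlock(String id) {
        requests.incrementAndGet();
        locks.get(id).unlock();
    }

    private Integer get(String id) {
        requests.incrementAndGet();
        return data.get(id);
    }

    private void put(String id, int value) {
        requests.incrementAndGet();
        data.put(id, value);
    }

    /** State approach: lock, get, put, unlock are four separate requests. */
    int incrementByMovingState(String id) {
        lock(id);
        try {
            Integer current = get(id);
            int c = (current == null ? 0 : current) + 1;
            put(id, c);
            return c;
        } finally {
            unlock(id);
        }
    }

    /** Behavior approach: one request; the owner applies the change atomically. */
    int incrementByMovingBehavior(String id) {
        requests.incrementAndGet();
        return data.merge(id, 1, Integer::sum);
    }

    public static void main(String[] args) {
        MovingBehaviorSketch grid = new MovingBehaviorSketch();
        grid.incrementByMovingState("orders");
        int afterState = grid.requests.get();
        grid.incrementByMovingBehavior("orders");
        System.out.println("state: " + afterState + " requests, behavior: "
                + (grid.requests.get() - afterState) + " request");
    }
}
```

In the behavior variant no lock is held across the network, which is the sense in which the slide speaks of implicit concurrency.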

• Parallel Processing and Aggregation of Data using Partitioning
  • Analogue to the Parallel Query capability
  • Simple: Just specify a collection of identities or a query
  • Linear scale for processing and aggregation throughput
  • Operations can be read-only, write-only or read/write



• Parallel Aggregation
  • Used to extract small sets of information from huge sets of data
  • Also min(), max(), count(), avg(), sum()
  • Support for group-by
  • Support for conditional and composite aggregations

    Set setDistinct = (Set) orders.aggregate((Filter) null,
            new DistinctValues("getSymbol"));
    for (Iterator iter = setDistinct.iterator(); iter.hasNext(); ) {
        String sSymbol = (String) iter.next();
        // ...
    }
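Parallel aggregation follows a simple pattern: each partition computes a small partial result next to its data, and only the partial results are combined. The sketch below uses Java streams over in-memory lists as stand-ins for partitions; `ParallelAggregationSketch` and its `Trade` class are hypothetical and only loosely mirror the `DistinctValues` and sum aggregators named in the deck.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Toy parallel aggregation: per-partition partial results, then a small combine step. */
final class ParallelAggregationSketch {
    static final class Trade {
        final String symbol;
        final double price;
        Trade(String symbol, double price) { this.symbol = symbol; this.price = price; }
    }

    /** Sum of prices for one symbol; each partition contributes one partial sum. */
    static double sumPrice(List<List<Trade>> partitions, String symbol) {
        return partitions.parallelStream()
                .mapToDouble(p -> p.stream()
                        .filter(t -> t.symbol.equals(symbol))
                        .mapToDouble(t -> t.price)
                        .sum())
                .sum();
    }

    /** Distinct symbols; each partition contributes its local set, which are then merged. */
    static Set<String> distinctSymbols(List<List<Trade>> partitions) {
        return partitions.parallelStream()
                .map(p -> p.stream().map(t -> t.symbol).collect(Collectors.toSet()))
                .reduce(new TreeSet<String>(), (a, b) -> {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    return union;
                });
    }

    public static void main(String[] args) {
        List<List<Trade>> partitions = List.of(
                List.of(new Trade("ORCL", 10.0), new Trade("IBM", 5.5)),
                List.of(new Trade("ORCL", 20.0)));
        System.out.println(sumPrice(partitions, "ORCL") + " " + distinctSymbols(partitions));
    }
}
```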

Data Grid Clients: Data Access

• Data Access
  • Typical Use Case: Secure, Limited, Read-Only Access
  • Typical Clients: Java, C, C++, C#, VB.NET, Excel
  • Client Topologies: Near Caching, Continuous Query Caching
  • Client issues request, receives results
  • All relevant events are streamed in real time to the client

[Figure labels: Access, Query (Caching & Analytics), Data Events]

Data Grid Clients: Data Processing

• Data Processing
  • For security purposes:
    • Data Grid clients are untrusted, limited in their view of data, and unable to inject code or new agents
    • The Data Grid is secure, audited, and typically contains trade secrets and confidential information
  • Typical Use Case: All operations that could modify data occur within the data grid itself via Secure Agent Invocation
    • Enforces an SOA approach
    • Resulting modifications are streamed back via the event stream

[Figure labels: Transaction Requests, Data Events]

Global Data Grids

• WAN Failover
  • Extends the Finite State Cluster (FSC) across multiple Data Centers
  • Applicable for DR, Metropolitan Area Networking (MAN) clustering
  • Requires primary/DR data center designation
  • Requires manual failover for the DR data center

Global Data Grids

• Asynchronous Fault-Tolerant Replication
  • An architecture that assumes interruptions will occur
  • Utilizes the Write-Behind capability to asynchronously update other data centers, re-queuing if necessary when the link is down
  • Each Data Center is designed to operate independently
  • Some “regional” information is owned by each Data Center
  • All information is available at each Data Center, but network interruption may cause the information to be out-of-date
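The re-queuing behavior can be sketched as a local store plus an outbound queue that is drained only while the link to the other data center is up. `AsyncReplicationSketch`, its `linkUp` flag, and the two in-memory maps are hypothetical stand-ins; conflict handling between data centers is not modeled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy asynchronous replication between two data centers with re-queuing on link failure. */
final class AsyncReplicationSketch {
    final Map<String, String> local = new HashMap<>();   // this data center
    final Map<String, String> remote = new HashMap<>();  // the other data center
    private final Deque<String[]> outbound = new ArrayDeque<>();
    boolean linkUp = true;

    /** Local writes never wait for the WAN; they are queued for later delivery. */
    void put(String key, String value) {
        local.put(key, value);
        outbound.addLast(new String[] {key, value});
    }

    private boolean send(String[] update) {
        if (!linkUp) return false;
        remote.put(update[0], update[1]);
        return true;
    }

    /** Delivers queued updates in order; a failed send is put back and retried later. */
    int drain() {
        int sent = 0;
        while (!outbound.isEmpty()) {
            String[] update = outbound.pollFirst();
            if (!send(update)) {
                outbound.addFirst(update); // re-queue, keep ordering
                break;
            }
            sent++;
        }
        return sent;
    }

    int pending() { return outbound.size(); }

    public static void main(String[] args) {
        AsyncReplicationSketch dc = new AsyncReplicationSketch();
        dc.linkUp = false;
        dc.put("limit", "100");
        dc.drain();                 // link down: nothing delivered, update stays queued
        dc.linkUp = true;
        dc.drain();
        System.out.println(dc.remote);
    }
}
```

While the link is down the remote copy is simply out of date, which matches the trade-off stated in the last bullet of the slide.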

Agenda

Grid-Based Data Infrastructures

• Concepts
• Capabilities
• Use Cases
• Futures

Data Grid: Uses

• Caching: Applications request data from the Data Grid rather than backend data sources
• Analytics: Applications ask the Data Grid questions, from simple queries to advanced scenario modeling
• Transactions: Data Grid acts as a transactional System of Record, hosting data and business logic
• Events: Automated processing based on events

Caching

Applications request data from the Data Grid rather than backend data sources:

• Enable faster access to frequently accessed data
• Reduce load on shared data sources
• Obviate the impact of data source outages

Use Cases
• Reference data
• User preferences
• Market data

Coherence
• Manageable and scalable host for the cache
• Guarantees consistent data and data integrity
• Broad industry support as a plug-in cache

Analytics

Applications ask the Data Grid questions from simple queries to advanced scenario modeling:

• Enables query rates beyond what a database can handle
• Larger data sets available in memory
• Complex analytics through massive parallel processing across grid
• Increased availability

Use Cases
• Risk management
• Portfolio management
• Derivatives pricing

Coherence
• Built-in query support
• User-defined parallel calculations
• Stable results even with server failure

Transactions

The Data Grid acts as a transactional System of Record, hosting data and business logic:

• Higher transaction volumes
• More predictable scalability
• More predictable cost of scale
• Flexible capacity

Use Cases
• Order book, execution, fill and reconciliation
• E-Commerce, real-time inventory
• Large-scale data collection: Buffered transactions

Coherence
• Multiple isolation levels, both optimistic and pessimistic modes
• Reliability is key to transactional integrity

Events

In addition to transactions, the Data Grid reliably manages each resulting event and delivers it to the related business logic:

• Real-time management for both data-driven and event-driven architectures
• Flexible data, event and processing capacity

Use Cases
• Market Data-driven processes such as Algorithmic Trading
• Fraud Detection
• Large-scale work-flow: SEDA / Staged Business Event Transitions

Coherence
• Co-located processing of data and events for low latency and high throughput
• Reliable “Once-And-Only-Once” processing for events
• Scalable to millions of events per second without sacrificing reliability

[Figure: Data Grid nodes; labels: Node, Cache, Primary Data, Logical Data, Backup Data]

    NamedCache cache = CacheFactory.getCache("trades");
    ...
    Filter filter = new EqualsFilter("getSymbol", sAggSymbol);
    DoubleSum aggSum = new DoubleSum("getPrice");
    Double DResult = (Double) cache.aggregate(filter, aggSum);

[Figure: Client; JMX instrumentation; Trade object with fields id : int, symbol : String, price : double, lot : int]



Distributed Cache

[Figure: Trading Application on a Data Grid with three caches; labels: Primary Data, Logical Data, Backup Data]

Failover

[Figure: Data Grid with one cache offline; labels: Cache, Offline Cache, Primary Data, Logical Data, Backup Data]

Data Grid: Case Study

• Customer Background
  • Large regional bank
  • Experience with compute grid infrastructure (e.g. risk)
  • Low average server utilization made virtualization into a priority
• Goals
  • Shared, virtualized infrastructure to increase server utilization and reduce long-term infrastructure costs
  • Eliminate data bottlenecks & cost of data infrastructure
  • Make big compute tasks into near “real time” operations
  • Package the result as an easy to use/deploy service

Data Grid: Case Study

• General Benefits
  • Elimination of all but incremental ETL
  • Information always available, real-time
  • Data capacity is scalable (e.g. in-memory capacity)
  • Automated “Locality of Reference” guaranteed processing scalability (goal: eliminate “IO wait”)
• Business Benefits
  • “Data as a Service” – the information is available to additional tasks, jobs and applications, such as providing internal market data feeds
  • Previously “impossible” processes become realistic, such as a 50-day risk calculation … in under one hour!



Data Grid: Challenges

• Metering, Charge Back
  • Shared infrastructure is by its nature … shared
  • Lots of parallel, thread-level concurrent execution
  • Based on Clock cycles? RAM? Network bandwidth?
• Holistic Function Mindset
  • Dominance of imperative, iteration- and recursion-based programming models
  • Lack of declarative models
• Granularity, Rate and Reliability
  • Application scenarios require different QoS
  • Qualities can “cost” an order of magnitude each

Coherence Futures

• Platforms & Languages
  • Java, C, C++, C# / .NET, VB.NET, Excel
• Clustering Infrastructure
  • Data Center Failover, Infiniband
• Partitioning
  • Support for much larger data sets, deterministic configuration
• Usability
  • Security, Provisioning, Configuration, Deployment, Management, Monitoring
• Technology Integration
  • Composite, DataSynapse, Hibernate, Mule, Platform, Spring
• Projects
  • Ripple: Event Driven Architecture (EDA)

Questions and Answers

For More Information

http://search.oracle.com

or

www.oracle.com/technology/products/coherence/

Coherence
