November 2013 HUG: Real-time analytics with in-memory grid

Enabling Real-Time Analytics Using Hadoop Map/Reduce

Copyright © 2013 by ScaleOut Software, Inc.

Hadoop Users Group, November 20, 2013

Bill Bain, CEO


Agenda

• Quick Review of In-Memory Data Grids
• The Need for Real-Time Analytics: Two Use Cases
• Data-Parallel Computation on an IMDG Using Parallel Method Invocation (PMI)
• Implementing MapReduce Using PMI: ScaleOut hServer™
• Sample Use Cases
• Video Demo
• Comparison to Spark


About ScaleOut Software

• Develops and markets In-Memory Data Grids: software middleware for:
  • Scaling application performance and
  • Performing real-time analytics using
  • In-memory data storage and computing
• Dr. William Bain, Founder & CEO
  • Career focused on parallel computing – Bell Labs, Intel, Microsoft
  • 3 prior start-ups, last acquired by Microsoft and product now ships as Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 9,000 servers
• Sample customers: (logos shown on the slide; image credits: http://about-monster.com/, http://en.wikipedia.org/wiki/Image:HSN.png)


What is an In-Memory Data Grid?

In-memory storage for fast updates and retrieval of live data

• Fits in the business logic layer:
  • Follows object-oriented view of data (vs. relational view).
  • Stores collections of Java/.NET objects shared by multiple clients.
  • Uses create/read/update/delete and query APIs to access data.
• Implemented across a cluster of servers or VMs:
  • Scales storage and throughput by adding servers.
  • Provides high availability in case a server fails.


Our Focus: Real-Time Analytics

Big Data Analytics:

• Real-time (“Operational Intelligence”):
  • Live data sets
  • Gigabytes to terabytes
  • In-memory storage
  • Minutes to seconds
  • Best uses: tracking live data; immediately identifying trends and capturing opportunities
  • Products shown: Analytics Server, hServer
• Batch (“Business Intelligence”):
  • Static data sets
  • Petabytes
  • Disk storage
  • Hours to minutes
  • Best uses: analyzing warehoused data; mining for long-term trends
  • Products shown: Hadoop, IBM, Teradata, SAS, SAP


Online Systems Need Real-Time Analysis

A few examples:

• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues


Integrate MapReduce into IMDG for Real-Time Analytics

Benefits:

• Enables use of widely used Hadoop MapReduce APIs:
  • Accelerates data access by staging data in memory.
  • Eliminates batch scheduling and data shuffling overheads of standard Hadoop distributions.
• Analyzes and updates live data.
• Enables Hadoop deployment in live systems.
• Hadoop MapReduce programs run without change.
• ScaleOut’s implementation is called ScaleOut hServer™.


Data-Parallel Analysis Is Not New

• 1980s: Special Purpose Hardware: “SIMD” (pictured: Thinking Machines Connection Machine 5)
• 1990s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD” (pictured: Intel iPSC/2, IBM SP1)


Data-Parallel Analysis Is Not New

• 1990s – early 2000s: HPC on Clusters: “MPI” (pictured: HP blade servers)
• Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce” (pictured: Amazon EC2, Windows Azure)


Parallel Method Invocation

• Basic, well understood model of data-parallel computation
• Implemented for use on objects hosted in IMDGs:
  • Executes user’s code in parallel across the grid.
  • Uses parallel query to select objects for analysis.

Figure: In-Memory Data Grid Runs Data-Parallel Analysis. (stages shown: Analyze Data (Eval), then Combine Results (Merge))


Running the Analysis

The parallel analysis executes in three steps:

• Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
  • Note: Query spec matches data’s object-oriented properties.


Running the Analysis: Step 2

• Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
  • The analysis runs on all objects selected by the parallel query.
  • Each grid server analyzes its locally stored objects to minimize data motion.
  • Parallel execution ensures fast completion time:
    • IMDG automatically distributes workload across servers/cores.
    • Scaling the IMDG automatically handles larger data sets.


Running the Analysis: Step 3

• Step 3: The IMDG automatically merges all analysis results.
  • The IMDG first merges all results within each grid server in parallel.
  • It then merges results across all grid servers to create one combined result.
  • Efficient parallel merge minimizes the delay in combining all results.
• The IMDG delivers the combined result to the trader’s display as one object.
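The three steps (select, analyze, merge) can be imitated on a single machine. What follows is a minimal sketch in plain Java, not ScaleOut's API: each inner list stands in for the objects stored on one grid server, and the names `PmiSketch`, `Stock`, `invoke`, and `demo` are hypothetical. It assumes the merge function is associative, which is what allows results to be merged first within a server and then across servers.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

final class PmiSketch {
    // Hypothetical stand-in for an object hosted in the grid.
    static final class Stock {
        final String ticker;
        final double price;
        final long shares;

        Stock(String ticker, double price, long shares) {
            this.ticker = ticker;
            this.price = price;
            this.shares = shares;
        }
    }

    // Each inner list models the objects stored locally on one server.
    static <T, R> Optional<R> invoke(List<List<T>> servers,
                                     Predicate<T> query,
                                     Function<T, R> eval,
                                     BinaryOperator<R> merge) {
        return servers.parallelStream()
                .map(local -> local.stream()
                        .filter(query)      // Step 1: select matching local objects
                        .map(eval)          // Step 2: analyze each selected object
                        .reduce(merge))     // Step 3a: merge results within the server
                .filter(Optional::isPresent) // servers with no matches contribute nothing
                .map(Optional::get)
                .reduce(merge);             // Step 3b: merge results across servers
    }

    // Total value of two selected tickers spread over three simulated servers.
    static double demo() {
        List<List<Stock>> servers = List.of(
                List.of(new Stock("GOOG", 10.0, 100), new Stock("IBM", 5.0, 10)),
                List.of(new Stock("ORCL", 2.0, 250)),
                List.of(new Stock("MSFT", 3.0, 30)));
        return PmiSketch.<Stock, Double>invoke(servers,
                s -> s.ticker.equals("GOOG") || s.ticker.equals("ORCL"),
                s -> s.price * s.shares,
                Double::sum).orElse(0.0);
    }

    public static void main(String[] args) {
        System.out.println("Combined value: " + demo());
    }
}
```

In this toy version the "servers" are just lists in one JVM; the point is only the shape of the computation, where the caller receives a single combined result.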


Sample Performance Results for PMI

Optimizing a stock trading platform with real-time analysis:

• IMDG hosted in Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles continuous stream of updates (1.1 GB/s).
• IMDG performs real-time analysis on live data.
• Entire data set analyzed in 4.1 seconds (250 GB/s).
• IMDG scales linearly as workload grows.
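The quoted scan rate follows from the data-set size and the analysis time. As a quick check, here is a small sketch that assumes 1 TB is counted as 1,024 GB and that the work is spread evenly over the 75 servers; the class and method names are illustrative only.

```java
final class ThroughputCheck {
    // Aggregate scan rate in GB/s for a data set of the given size analyzed in the given time.
    static double aggregateGBps(double dataSetGB, double seconds) {
        return dataSetGB / seconds;
    }

    public static void main(String[] args) {
        double dataSetGB = 1024.0; // assumption: 1 TB taken as 1,024 GB
        double seconds = 4.1;      // analysis time from the slide
        int servers = 75;          // server count from the slide

        double total = aggregateGBps(dataSetGB, seconds);
        System.out.printf("aggregate: %.0f GB/s, per server: %.1f GB/s%n",
                total, total / servers);
    }
}
```

Under those assumptions the aggregate rate comes to roughly 250 GB/s, in line with the slide, or about 3.3 GB/s per server.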


Implementing Real-Time MapReduce

• Goal: Run MapReduce applications from a remote workstation.
• The IMDG automatically builds an “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
• The invocation grid can be reused to shorten startup time.
• Use PMI to implement MapReduce.


Accelerating MapReduce Execution

PMI is the foundation of fast execution time:

• Data can be input from either the IMDG or an external data source.
• Works with any input/output format compatible with the Apache distribution.
• ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
• Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
• Minimizes data motion between the mappers and reducers.
• Allows optional sorting.
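To make the idea of running mappers and reducers over in-memory partitions concrete, here is a minimal word-count sketch in plain Java. It is not hServer code: the class `InMemoryWordCount` and its methods are hypothetical, each inner list stands in for the data held on one server, and a `TreeMap` is used only to model the optional sort of the output keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class InMemoryWordCount {
    // Map phase for one partition: emit (word, 1) pairs and combine them locally.
    static Map<String, Integer> mapPartition(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }

    // Reduce phase: merge the intermediate per-partition results key by key, in memory.
    static Map<String, Integer> run(List<List<String>> partitions) {
        return partitions.parallelStream()
                .map(InMemoryWordCount::mapPartition)
                .flatMap(partial -> partial.entrySet().stream())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        Integer::sum, TreeMap::new)); // sorted keys model the optional sort
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
                List.of("the quick brown fox", "the lazy dog"),
                List.of("the dog barks"));
        System.out.println(run(partitions));
    }
}
```

Nothing here touches disk between the two phases, which is the property the slide attributes to keeping intermediate results inside the grid.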


Only One-Line Code Change

ScaleOut hServer subclasses the Hadoop Job class:

// This job will run using the Hadoop job tracker:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

// This job will run using ScaleOut hServer:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount");  // the only changed line
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}


Accessing IMDG Data for M/R

• IMDG adds grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to IMDG with grid output format.
• Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.


Optimized In-Memory Storage

Multiple in-memory storage models:

• Named cache, optimized for rich semantics:
  • Property-based query
  • Distributed locking
  • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis:
  • Highly efficient object storage
  • Pipelined, bulk-access mechanisms


Example: Ecommerce: Inventory Management

Fast map/reduce reconciles inventory and order systems for an online retailer:

• Challenge: Inventory and online order management are handled by different applications.
  • Reconciled once per day.
  • Inaccurate orders reduce margins.
• Solution:
  • Host SKUs in IMDG updated in real time by order & inventory systems.
  • Use PMI to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.


Example in Financial Services

Integrate analysis into a stock trading platform:

• The IMDG holds market data and hedging strategies.
• Updates to market data continuously flow through the IMDG.
• The IMDG performs repeated map/reduce analysis on hedging strategies and alerts traders in real time.
• IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.


Demo

• Video Link


Comparison to Spark

• Spark is intended to accelerate data analysis using in-memory computing.
• ScaleOut’s IMDG provides standard MapReduce for “live” systems.

                          Spark                        ScaleOut IMDG
New MapReduce engine      Yes                          Yes
In-memory data storage    Resilient Distr. Datasets    Distributed Objects
Load/store from HDFS      Yes                          Yes
Avoid disk access         Yes                          Yes
CRUD on live data         No                           Yes
Query on properties       No                           Yes
High availability         Rebuild on failure           Replication and failover
Extensibility             Additional operators         PMI methods
Open source               Yes                          Hybrid


Summary

• Online systems need to analyze “live” data in real-time.
• MapReduce has traditionally focused on analyzing large, static (offline) datasets held in file systems.
• An in-memory data grid (IMDG) can accelerate MapReduce applications, enabling real-time analytics:
  • Enables the application to analyze and update live data.
  • Leverages the IMDG’s load-balanced placement of data.
  • Avoids batch-scheduled startup delays.
  • Avoids data motion from secondary storage.
• MapReduce can be implemented using standard data-parallel computing techniques (“parallel method invocation”):
  • Tightly integrates Map/Reduce engine with the IMDG.
  • Accelerates Map/Reduce execution by >20X in benchmark tests.


Accelerating Start-Up Times

• The invocation grid can be re-used across MapReduce jobs:

public static void main(String argv[]) throws Exception {
    // Configure and load the invocation grid
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
            // Add JAR files as IG dependencies
            .addJar("main-job.jar")
            .addJar("first-library.jar")
            // Add classes as IG dependencies
            .addClass(MyMapper.class)
            .addClass(MyReducer.class)
            // Define custom JVM parameters
            .setJVMParameters("-Xms512M -Xmx1024M")
            .load();

    // Run 10 jobs on the same invocation grid
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as the parameter to the job
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        // ......Configure the job here.........
        // Run the job
        job.waitForCompletion(true);
    }

    // Unload the invocation grid when we are done
    grid.unload();
}


Targeted Use Cases

• Run continuous Hadoop on live data, while it’s being updated.
  “Capture perishable business opportunities and identify issues.”
  • Real-time risk analysis
  • Credit card fraud detection
  • ...
• Accelerate Hadoop on static data with a one line code change.
  “Speed-up Hadoop execution by >10X for faster business insights.”
  • Process simulations
  • Financial modeling
  • ...
• Quickly prototype Hadoop code.
  “Validate your Hadoop code before it goes into batch processing.”
  • Fast-turn debug and tuning
  • No need to install Hadoop stack
  • ...


The Need for Real-Time Analytics

Across Key Industries:

• CPG
• Financial
• Telco
• Retail
• Utilities
• Manufacturing
• Logistics
• IC / DoD
• Life Sciences
• Government
• Health Care
• Law enforcement

Many Use Cases:

• Authorizations / Payment Processing / Mobile Payments
• Service Activation
• Inventory Management
• Sensor Data / SCADA
• Real Time Tracking
• Fraud Detection
• Situational Awareness
• Churn Management
• Market Feed / Event Handlers
• Execution Rules
• Financial: Risk, P&L, Pricing
• Operational Risk Compliance


Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics

• Typically used for very large, static, offline datasets
• Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead.


Hadoop Users Need Real-Time Analytics

• ScaleOut Software conducted an informal survey at the Strata 2013 Conference (Santa Clara).
• Based on 150 responses:
  • 78% of organizations generate fast-changing data.
  • 60% use Hadoop and 78% plan to expand usage of Hadoop within 12 months.
  • Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
  • 93% would benefit from real-time data analytics.
  • 71% consider a 10X improvement in performance meaningful.
• Take-away: Hadoop users need real-time analytics.


Optional Caching of HDFS Data

• ScaleOut hServer adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves data from ScaleOut IMDG on subsequent runs.
• Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
• Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG.
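The caching behaviour described on this slide (read from the slow source on the first run, serve from memory afterwards) can be illustrated with a small wrapper. This is only a sketch of the general read-through caching idea in plain Java, not the hServer Dataset Record Reader: `CachingReader`, `read`, and `sourceReads` are hypothetical names, a `Function` stands in for HDFS access, and a `ConcurrentHashMap` stands in for the grid.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CachingReader {
    private final Function<String, List<String>> slowSource;                  // stands in for HDFS reads
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>(); // stands in for the IMDG
    private int sourceReads = 0; // plain counter; this sketch is meant for single-threaded use

    CachingReader(Function<String, List<String>> slowSource) {
        this.slowSource = slowSource;
    }

    // The first read of a split goes to the source; later reads are served from memory.
    List<String> read(String splitId) {
        return cache.computeIfAbsent(splitId, id -> {
            sourceReads++;
            return List.copyOf(slowSource.apply(id));
        });
    }

    int sourceReads() {
        return sourceReads;
    }

    public static void main(String[] args) {
        CachingReader reader = new CachingReader(id -> List.of(id + "-rec1", id + "-rec2"));
        reader.read("split-0"); // first run: reads from the source
        reader.read("split-0"); // subsequent run: served from the cache
        System.out.println("source reads: " + reader.sourceReads());
    }
}
```

The design choice to key the cache by split is an assumption made for the example; the slide does not say how cached data is keyed.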


Java Example: Parallel Method Invocation

• Create method to analyze each queried stock object and another method to pair-wise merge the results:

public class StockAnalysis implements Invokable<Stock, StockCalcParams, Double> {

    // Called once per selected stock: computes that stock's total value.
    public Double eval(Stock stock, StockCalcParams param) throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }

    // Called pair-wise on partial results: adds two values together.
    public Double merge(Double first, Double second) throws InvokeException {
        return first + second;
    }
}


Java Example: Parallel Method Invocation

• Run a parallel method invocation on the query results:

NamedCache cache = CacheFactory.getCache("Stocks");

InvokeResult valueOfSelectedStocks = cache.invoke(
        StockAnalysis.class,
        Stock.class,
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
        new StockCalcParams());

System.out.println("The value of selected stocks is " + valueOfSelectedStocks.getResult());
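The query argument in the call above is built by composing property filters. How such filters compose can be sketched locally with `java.util.function.Predicate`. The `equal` and `or` helpers below are hypothetical stand-ins written for this example (they only mimic the shape of the calls on the slide and are not the grid's filter API), and stocks are modeled as simple property maps.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class FilterSketch {
    // Hypothetical helper: matches objects whose named property equals the given value.
    static Predicate<Map<String, Object>> equal(String property, Object value) {
        return obj -> value.equals(obj.get(property));
    }

    // Hypothetical helper: matches objects accepted by either filter.
    static Predicate<Map<String, Object>> or(Predicate<Map<String, Object>> a,
                                             Predicate<Map<String, Object>> b) {
        return a.or(b);
    }

    // Selects the tickers that satisfy the same filter expression used on the slide.
    static List<String> selectTickers(List<Map<String, Object>> stocks) {
        return stocks.stream()
                .filter(or(equal("ticker", "GOOG"), equal("ticker", "ORCL")))
                .map(stock -> (String) stock.get("ticker"))
                .collect(Collectors.toList());
    }

    static List<String> demo() {
        Map<String, Object> goog = Map.of("ticker", "GOOG");
        Map<String, Object> ibm = Map.of("ticker", "IBM");
        Map<String, Object> orcl = Map.of("ticker", "ORCL");
        return selectTickers(List.of(goog, ibm, orcl));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the grid, the equivalent selection runs as a parallel query on every server, so only matching objects are handed to `eval`.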
