Enabling Real-Time Analytics Using Hadoop Map/Reduce
Copyright © 2013 by ScaleOut Software, Inc.
Hadoop Users Group, November 20, 2013
Bill Bain, CEO ([email protected])
2 ScaleOut Software, Inc.
Agenda
• Quick Review of In-Memory Data Grids
• The Need for Real-Time Analytics: Two Use Cases
• Data-Parallel Computation on an IMDG Using Parallel Method Invocation (PMI)
• Implementing MapReduce Using PMI: ScaleOut hServer™
• Sample Use Cases
• Video Demo
• Comparison to Spark

About ScaleOut Software
• Develops and markets In-Memory Data Grids, software middleware for:
  • Scaling application performance and
  • Performing real-time analytics using
  • In-memory data storage and computing
• Dr. William Bain, Founder & CEO
  • Career focused on parallel computing: Bell Labs, Intel, Microsoft
  • 3 prior start-ups, last acquired by Microsoft and product now ships as Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 9,000 servers
• Sample customers: (logos not reproduced)

What is an In-Memory Data Grid?
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
  • Follows object-oriented view of data (vs. relational view).
  • Stores collections of Java/.NET objects shared by multiple clients.
  • Uses create/read/update/delete and query APIs to access data.
• Implemented across a cluster of servers or VMs:
  • Scales storage and throughput by adding servers.
  • Provides high availability in case a server fails.

Our Focus: Real-Time Analytics
Big Data Analytics:
• Real-time (“Operational Intelligence”):
  • Live data sets
  • Gigabytes to terabytes
  • In-memory storage
  • Minutes to seconds
  • Best uses:
    • Tracking live data
    • Immediately identifying trends and capturing opportunities
  • Examples: Analytics Server, hServer
• Batch (“Business Intelligence”):
  • Static data sets
  • Petabytes
  • Disk storage
  • Hours to minutes
  • Best uses:
    • Analyzing warehoused data
    • Mining for long-term trends
  • Examples: Hadoop, IBM, Teradata, SAS, SAP

Online Systems Need Real-Time Analysis
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues

Integrate MapReduce into IMDG for Real-Time Analytics
Benefits:
• Enables use of widely used Hadoop MapReduce APIs:
  • Accelerates data access by staging data in memory.
  • Eliminates batch scheduling and data shuffling overheads of standard Hadoop distributions.
• Analyzes and updates live data.
• Enables Hadoop deployment in live systems.
• Hadoop MapReduce programs run without change.
• ScaleOut’s implementation is called ScaleOut hServer™.

Data-Parallel Analysis Is Not New
• 1980s: Special Purpose Hardware: “SIMD”
  (image: Thinking Machines Connection Machine 5)
• 1990s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD”
  (images: Intel iPSC-2, IBM SP1)
• 1990s to early 2000s: HPC on Clusters: “MPI”
  (image: HP blade servers)
• Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce”
  (images: Amazon EC2, Windows Azure)

Parallel Method Invocation
• Basic, well understood model of data-parallel computation
• Implemented for use on objects hosted in IMDGs:
  • Executes user’s code in parallel across the grid.
  • Uses parallel query to select objects for analysis.
(diagram: In-Memory Data Grid runs data-parallel analysis in two phases: Analyze Data (Eval), then Combine Results (Merge))

Running the Analysis
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
  • Note: Query spec matches data’s object-oriented properties.

Running the Analysis: Step 2
• Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
  • The analysis runs on all objects selected by the parallel query.
  • Each grid server analyzes its locally stored objects to minimize data motion.
  • Parallel execution ensures fast completion time:
    • IMDG automatically distributes workload across servers/cores.
    • Scaling the IMDG automatically handles larger data sets.

Running the Analysis: Step 3
• Step 3: The IMDG automatically merges all analysis results.
  • The IMDG first merges all results within each grid server in parallel.
  • It then merges results across all grid servers to create one combined result.
  • Efficient parallel merge minimizes the delay in combining all results.
  • The IMDG delivers the combined result to the trader’s display as one object.
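
The three steps above (select, evaluate locally, merge) can be sketched in plain Java. This is only a simulation under stated assumptions, not the ScaleOut API: each inner list stands in for the objects held by one grid server, and the names `PmiSketch`, `Position`, and `invoke` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java simulation of query / eval / merge; not the ScaleOut API. */
final class PmiSketch {

    /** Hypothetical object stored in the grid. */
    static final class Position {
        final String ticker;
        final double price;
        final long shares;

        Position(String ticker, double price, long shares) {
            this.ticker = ticker;
            this.price = price;
            this.shares = shares;
        }
    }

    /**
     * Runs eval on every selected object and merges the results.
     * Each inner list stands in for the objects held by one grid server.
     * The identity value must be neutral for the merge function.
     */
    static <T, R> R invoke(List<List<T>> servers,
                           Predicate<T> query,
                           Function<T, R> eval,
                           BinaryOperator<R> merge,
                           R identity) {
        return servers.parallelStream()
                // Steps 1 and 2, plus the local half of step 3: each "server"
                // selects, evaluates, and merges only its own objects.
                .map(local -> local.stream()
                        .filter(query)
                        .map(eval)
                        .reduce(identity, merge))
                // Global half of step 3: merge the per-server results.
                .reduce(identity, merge);
    }

    public static void main(String[] args) {
        List<List<Position>> servers = Arrays.asList(
                Arrays.asList(new Position("GOOG", 1000.0, 10),
                              new Position("IBM", 180.0, 5)),
                Arrays.asList(new Position("ORCL", 35.0, 100)));

        Double total = PmiSketch.<Position, Double>invoke(
                servers,
                p -> p.ticker.equals("GOOG") || p.ticker.equals("ORCL"),
                p -> p.price * p.shares,
                Double::sum,
                0.0);

        System.out.println("Combined value of selected positions: " + total);
    }
}
```

Merging inside each partition before merging across partitions mirrors the two-level merge described in step 3: only one partial result per server has to cross the network.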

Sample Performance Results for PMI
Optimizing a stock trading platform with real-time analysis:
• IMDG hosted in Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles continuous stream of updates (1.1 GB/s).
• IMDG performs real-time analysis on live data.
• Entire data set analyzed in 4.1 seconds (250 GB/s).
• IMDG scales linearly as workload grows.
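
The quoted throughput can be checked from the other figures on the slide. The arithmetic below assumes 1 TB = 1024 GB and an even spread of work across the 75 servers; the per-server figure is derived here and is not stated in the talk.

```latex
\[
\frac{1\ \text{TB}}{4.1\ \text{s}}
  = \frac{1024\ \text{GB}}{4.1\ \text{s}}
  \approx 250\ \text{GB/s},
\qquad
\frac{250\ \text{GB/s}}{75\ \text{servers}}
  \approx 3.3\ \text{GB/s per server}
\]
```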

Implementing Real-Time MapReduce
• Goal: Run MapReduce applications from a remote workstation.
  • The IMDG automatically builds an “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
  • The invocation grid can be reused to shorten startup time.
• Use PMI to implement MapReduce.

Accelerating MapReduce Execution
PMI is the foundation of fast execution time:
• Data can be input from either the IMDG or an external data source.
  • Works with any input/output format compatible with the Apache distribution.
• ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
  • Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
  • Minimizes data motion between the mappers and reducers.
  • Allows optional sorting.
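
As a rough illustration of running mappers and reducers on a data-parallel engine with intermediate results kept in memory, the sketch below counts words in plain Java. It is not hServer's implementation: the class `MapReduceOverPmi` and its methods are hypothetical, each inner list stands in for one server's share of the input, and a sorted map stands in for the optional sorting.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Word count expressed as a per-partition map step plus a pairwise merge. */
final class MapReduceOverPmi {

    /** Map step for one partition: emit (word, 1) and combine in memory. */
    static Map<String, Integer> mapPartition(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Reduce step: pairwise merge of two partial results; inputs are not modified. */
    static Map<String, Integer> mergeCounts(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new TreeMap<>(a); // TreeMap keeps keys sorted
        b.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        return merged;
    }

    /** Runs the map step on every partition in parallel, then merges the results. */
    static Map<String, Integer> wordCount(List<List<String>> partitions) {
        return partitions.parallelStream()
                .map(MapReduceOverPmi::mapPartition)
                .reduce(Collections.emptyMap(), MapReduceOverPmi::mergeCounts);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("to be or not", "to be"),
                Arrays.asList("be quick"));
        System.out.println(wordCount(partitions));
    }
}
```

Because each partition is combined locally before the merge, only one small map per partition moves between the map and reduce steps, which is the data-motion saving the slide describes.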

Only One-Line Code Change
ScaleOut hServer subclasses the Hadoop Job class:

// This job will run using the Hadoop job tracker:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

// This job will run using ScaleOut hServer:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount"); // the only changed line
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
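
Why subclassing keeps the change to one line can be shown with a toy example. The classes below (`SketchJob`, `InMemorySketchJob`, `OneLineChangeSketch`) are hypothetical stand-ins and say nothing about how `HServerJob` is actually implemented; they only show that driver code written against the base type is unaffected when the constructor call is swapped.

```java
/** Hypothetical stand-in for a job that is submitted to a batch scheduler. */
class SketchJob {
    protected final String name;

    SketchJob(String name) {
        this.name = name;
    }

    /** Default execution path: pretend to go through batch scheduling. */
    String waitForCompletion() {
        return name + ": executed by batch scheduler";
    }
}

/** Hypothetical subclass that redirects execution to an in-memory engine. */
class InMemorySketchJob extends SketchJob {
    InMemorySketchJob(String name) {
        super(name);
    }

    @Override
    String waitForCompletion() {
        return name + ": executed by in-memory engine";
    }
}

final class OneLineChangeSketch {
    /** Driver code written against the base type; it is the same for both jobs. */
    static String configureAndRun(SketchJob job) {
        // Mapper, reducer, and format settings would be applied here.
        return job.waitForCompletion();
    }

    public static void main(String[] args) {
        System.out.println(configureAndRun(new SketchJob("wordcount")));
        System.out.println(configureAndRun(new InMemorySketchJob("wordcount")));
    }
}
```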

Accessing IMDG Data for M/R
• IMDG adds grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to IMDG with grid output format.
• Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.

Optimized In-Memory Storage
Multiple in-memory storage models:
• Named cache, optimized for rich semantics:
  • Property-based query
  • Distributed locking
  • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis:
  • Highly efficient object storage
  • Pipelined, bulk-access mechanisms

Example: Ecommerce: Inventory Management
Fast map/reduce reconciles inventory and order systems for an online retailer:
• Challenge: Inventory and online order management are handled by different applications.
  • Reconciled once per day.
  • Inaccurate orders reduce margins.
• Solution:
  • Host SKUs in IMDG updated in real time by order & inventory systems.
  • Use PMI to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.

Example in Financial Services
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data continuously flow through the IMDG.
• The IMDG performs repeated map/reduce analysis on hedging strategies and alerts traders in real time.
• IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.

Demo
• Video Link

Comparison to Spark
• Spark is intended to accelerate data analysis using in-memory computing.
• ScaleOut’s IMDG provides standard MapReduce for “live” systems.

Feature                  Spark                      ScaleOut IMDG
New MapReduce engine     Yes                        Yes
In-memory data storage   Resilient Distr. Datasets  Distributed Objects
Load/store from HDFS     Yes                        Yes
Avoid disk access        Yes                        Yes
CRUD on live data        No                         Yes
Query on properties      No                         Yes
High availability        Rebuild on failure         Replication and failover
Extensibility            Additional operators       PMI methods
Open source              Yes                        Hybrid

Summary
• Online systems need to analyze “live” data in real time.
• MapReduce has traditionally focused on analyzing large, static (offline) datasets held in file systems.
• An in-memory data grid (IMDG) can accelerate MapReduce applications, enabling real-time analytics:
  • Enables the application to analyze and update live data.
  • Leverages the IMDG’s load-balanced placement of data.
  • Avoids batch-scheduled startup delays.
  • Avoids data motion from secondary storage.
• MapReduce can be implemented using standard data-parallel computing techniques (“parallel method invocation”):
  • Tightly integrates Map/Reduce engine with the IMDG.
  • Accelerates Map/Reduce execution by >20X in benchmark tests.

Accelerating Start-Up Times
• The invocation grid can be re-used across MapReduce jobs:

public static void main(String argv[]) throws Exception {
    // Configure and load the invocation grid
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
            // Add JAR files as IG dependencies
            .addJar("main-job.jar")
            .addJar("first-library.jar")
            // Add classes as IG dependencies
            .addClass(MyMapper.class)
            .addClass(MyReducer.class)
            // Define custom JVM parameters
            .setJVMParameters("-Xms512M -Xmx1024M")
            .load();

    // Run 10 jobs on the same invocation grid
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as the parameter to the job
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        // ......Configure the job here.........
        // Run the job
        job.waitForCompletion(true);
    }

    // Unload the invocation grid when we are done
    grid.unload();
}

Targeted Use Cases
• Run continuous Hadoop on live data, while it’s being updated.
  • “Capture perishable business opportunities and identify issues.”
  • Real-time risk analysis
  • Credit card fraud detection
  • ...
• Accelerate Hadoop on static data with a one line code change.
  • “Speed-up Hadoop execution by >10X for faster business insights.”
  • Process simulations
  • Financial modeling
  • ...
• Quickly prototype Hadoop code.
  • “Validate your Hadoop code before it goes into batch processing.”
  • Fast-turn debug and tuning
  • No need to install Hadoop stack
  • ...

The Need for Real-Time Analytics
Across Key Industries:
• CPG
• Financial
• Telco
• Retail
• Utilities
• Manufacturing
• Logistics
• IC / DoD
• Life Sciences
• Government
• Health Care
• Law enforcement
Many Use Cases:
• Authorizations / Payment Processing / Mobile Payments
• Service Activation
• Inventory Management
• Sensor Data / SCADA
• Real Time Tracking
• Fraud Detection
• Situational Awareness
• Churn Management
• Market Feed / Event Handlers
• Execution Rules
• Financial: Risk, P&L, Pricing
• Operational Risk Compliance

Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
• Typically used for very large, static, offline datasets
• Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead.

Hadoop Users Need Real-Time Analytics
• ScaleOut Software conducted an informal survey at the Strata 2013 Conference (Santa Clara).
• Based on 150 responses:
  • 78% of organizations generate fast-changing data.
  • 60% use Hadoop and 78% plan to expand usage of Hadoop within 12 months.
  • Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
  • 93% would benefit from real-time data analytics.
  • 71% consider a 10X improvement in performance meaningful.
• Take-away: Hadoop users need real-time analytics.

Optional Caching of HDFS Data
• ScaleOut hServer adds a Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves data from the ScaleOut IMDG on subsequent runs.
• Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
• Tests with the Terasort benchmark have demonstrated 11X faster access than HDFS without the IMDG.
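
The caching idea above, a wrapper that reads from slow storage once and serves later runs from memory, can be sketched in plain Java. This is an illustration only: `CachingReaderSketch` is a hypothetical class, a `HashMap` stands in for the IMDG, a function stands in for the HDFS read, and the sketch is not thread-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Read-through cache around a slow record source; not the hServer reader. */
final class CachingReaderSketch {
    /** In-memory map standing in for the IMDG, keyed by input split name. */
    private final Map<String, List<String>> cache = new HashMap<>();
    /** Function standing in for a read from disk-based storage. */
    private final Function<String, List<String>> slowSource;
    private int sourceReads = 0;

    CachingReaderSketch(Function<String, List<String>> slowSource) {
        this.slowSource = slowSource;
    }

    /** Returns the records for a split, reading the slow source only on a miss. */
    List<String> read(String split) {
        List<String> cached = cache.get(split);
        if (cached != null) {
            return cached; // later runs: served from memory
        }
        sourceReads++; // first run: go to the slow source
        List<String> records = new ArrayList<>(slowSource.apply(split));
        cache.put(split, records);
        return records;
    }

    /** Number of times the slow source was actually read. */
    int sourceReads() {
        return sourceReads;
    }

    public static void main(String[] args) {
        CachingReaderSketch reader = new CachingReaderSketch(
                split -> Arrays.asList(split + "-record-1", split + "-record-2"));
        reader.read("split-0"); // first run populates the cache
        reader.read("split-0"); // second run hits the cache
        System.out.println("Reads from slow source: " + reader.sourceReads());
    }
}
```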

Java Example: Parallel Method Invocation
• Create method to analyze each queried stock object and another method to pair-wise merge the results:

public class StockAnalysis implements Invokable<Stock, StockCalcParams, Double> {

    // Analyze one stock object: compute its total value.
    public Double eval(Stock stock, StockCalcParams param) throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }

    // Merge two partial results pair-wise.
    public Double merge(Double first, Double second) throws InvokeException {
        return first + second;
    }
}

Java Example: Parallel Method Invocation (continued)
• Run a parallel method invocation on the query results:

NamedCache cache = CacheFactory.getCache("Stocks");

InvokeResult valueOfSelectedStocks = cache.invoke(
        StockAnalysis.class,
        Stock.class,
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
        new StockCalcParams());

System.out.println("The value of selected stocks is " + valueOfSelectedStocks.getResult());