Welcome to Today’s DBTA Roundtable Discussion

Real-Time Analytics with Hadoop


Page 1: Realtime analytics with_hadoop

Welcome to Today’s

DBTA Roundtable Discussion

Page 2: Realtime analytics with_hadoop

Moderator

Stephen Faig

Manager

Unisphere Research and DBTA

Page 3: Realtime analytics with_hadoop

Real-Time Analytics with Hadoop

Page 4: Realtime analytics with_hadoop

Speakers

Dale Kim

Director of Industry Solutions

MapR

Paige Roberts

Hadoop & Analytics Evangelist

Actian

Page 5: Realtime analytics with_hadoop

© 2015 MapR Technologies

Page 6: Realtime analytics with_hadoop


Examples of Real-Time

Images licensed under https://creativecommons.org/licenses/by/2.0/

Time image courtesy of Daniel Oldfield: https://www.flickr.com/photos/democlez/4424898002/

Air bag image courtesy of Mike Babcock: https://www.flickr.com/photos/mikebabcock/3098836311/

Tied to clock time

Guaranteed response time

For real-time analytics, let’s use: “no built-in delays”

So what is real-time analytics with Hadoop?

Page 7: Realtime analytics with_hadoop


Requirements for Real-Time Analytics with Hadoop

REAL-TIME DATA

REAL-TIME APPLICATIONS

REAL-TIME QUERIES

Page 8: Realtime analytics with_hadoop


Real-Time Data

Definition: Provide immediate access to live Hadoop data

for analysis

Requirements:

• Analysis uses live real-time data, not batch-copied data

• Business can identify insights immediately (often through an automated process)

– Critical for use cases such as ad targeting, personalization, network security analysis.

• System avoids complexity of separate stream processing

or messaging system for recent data

Page 9: Realtime analytics with_hadoop


Real-Time Data in Hadoop

For real-time:

• Log files should be written directly

into the cluster or synced across

remote data centers

• Operational applications should

run in the same cluster, or in a

separate cluster with real-time

table replication

• Immediate action should be taken

• E.g., difference between fraud

detection and fraud prevention

• Difference between on-demand ad

bid versus missing opportunity

Existing challenges:

• Log files must be batch uploaded

periodically (e.g., every 30

minutes)

• Due to HDFS limitations (not R/W,

file-close semantics, no direct NFS)

• Operational applications run on a

separate cluster/stack

• Data must be batch uploaded

• With batch uploads, the window to

respond is missed

• Fraud, cyber attacks, matches,

anomalies, etc.

Page 10: Realtime analytics with_hadoop


Real-Time Applications

Definition: Run operational applications in the cluster

Requirements:

• Address use cases beyond batch and interactive analysis

– E.g., end-to-end real-time marketing and security applications directly on Hadoop

• Eliminate separate Hadoop and NoSQL

clusters/technology stacks for apps

Page 11: Realtime analytics with_hadoop


Real-Time Applications in Hadoop

For real-time:

• Minimize impact of disruptive “housekeeping” tasks to enable consistent, real-time operations

• E.g., compactions, Java garbage collection, “region splits”

• Process live, operational data in

Hadoop to avoid delays in batch

copies

Existing challenges:

• Other in-Hadoop databases suffer

disruptions, inhibiting real-time

• E.g., Compactions can significantly

slow down the system

• Garbage collection leads to

unpredictable system delays

• Region splits are required to spread load, but impact responsiveness and performance

• Other in-Hadoop databases require

separate clusters
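To make the effect of intermittent housekeeping pauses concrete, the sketch below compares the median with the 99th-percentile latency of a synthetic workload. The latency values, the share of stalled requests, and the `percentile` helper are all illustrative assumptions, not measurements from any database.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (assumed values, not measurements):
# 98 requests complete in 2 ms, while 2 requests stall for 500 ms behind a
# hypothetical compaction or garbage-collection pause.
latencies_ms = [2] * 98 + [500] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"median={p50} ms, p99={p99} ms")
```

In this toy data set the median stays at the fast value while the 99th percentile lands on the stalled requests, which is why rare pauses matter for applications that need a consistent response time.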

Page 12: Realtime analytics with_hadoop


Real-Time Querying

Definition: Query any data as soon as it lands in the

cluster (self-service)

Requirements:

• Analysts can explore data immediately, no waiting

days/weeks for data prep by IT

• IT is not burdened with repeated schema management

and ETL requests

Page 13: Realtime analytics with_hadoop


Real-Time Querying in Hadoop

For real-time:

• Minimize time to get started on

data exploration

• Leverage query engines that can

query data in place

– Eliminate IT dependencies for

schema preparation

Existing challenges:

• New data that lands in the cluster

necessarily requires IT-built

schemas

• Data exploration and analysis is

contingent on IT backlog
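The idea of querying data in place, without a schema prepared in advance, can be illustrated in a few lines of plain Python. This is not a query engine and not how any particular product works; it is a minimal sketch, assuming newline-delimited JSON records whose fields change over time. The record contents and the helper names `scan` and `discovered_fields` are hypothetical.

```python
import json

# Hypothetical raw records as they might land in the cluster. The second
# and third records carry fields the first one lacks (an evolving schema).
raw_lines = [
    '{"user": "a1", "event": "click"}',
    '{"user": "b2", "event": "purchase", "amount": 42.5}',
    '{"user": "c3", "event": "purchase", "amount": 13.0, "coupon": "SPRING"}',
]


def scan(lines):
    """Parse each line on read; no table definition is required up front."""
    for line in lines:
        yield json.loads(line)


def discovered_fields(lines):
    """Union of the field names seen in the data, in first-seen order."""
    seen = []
    for record in scan(lines):
        for name in record:
            if name not in seen:
                seen.append(name)
    return seen


# Roughly equivalent to: SELECT user, amount ... WHERE event = 'purchase'
purchases = [
    (record["user"], record.get("amount"))
    for record in scan(raw_lines)
    if record.get("event") == "purchase"
]

print(discovered_fields(raw_lines))
print(purchases)
```

The structure is discovered while reading, so a new field such as `coupon` becomes queryable as soon as the record that contains it arrives.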

Page 14: Realtime analytics with_hadoop


So How Are These Implemented?

Page 15: Realtime analytics with_hadoop


Case Study: Global Financial Services Firm

Analytics + Operational Applications on one platform

[Diagram: MapR Distribution for Hadoop. Labels: Real-time Operational Applications; Analytics; Online transactions; Fraud detection; Personalized offers; Clickstream analysis; Fraud investigation tool; Fraud model; Recommendations table; Fraud investigator; Interactive marketer]

Page 16: Realtime analytics with_hadoop


REAL-TIME DATA

REAL-TIME APPLICATIONS

REAL-TIME QUERIES

Page 17: Realtime analytics with_hadoop


Faster/Secure NFS Access

MapR data access options:

1. HDFS API – apps written for Hadoop (e.g., “hadoop fs -put”), using hadoop-core-*.jar
2. Standard read/write NFS (POSIX) – existing file system-based apps and utilities (e.g., cp, emacs), no code changes; uses the NFS client included in the OS
3. MapR POSIX Client – advanced read/write NFS requirements, includes:
   • Compression
   • Parallelism
   • Authentication
   • Encryption

[Diagram: client node(s) reach the MapR cluster through the three numbered options; NFS Gateways sit in front of the cluster, with redundant gateways for high availability]
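As an illustration of the "no code changes" point for file-based applications, the sketch below appends log events with ordinary file I/O. It assumes only that the target directory is writable like any other directory; on a real deployment that directory would sit on the cluster's NFS mount, whose location is site-specific and not given in the slides, so a temporary directory stands in for it here. The `append_event` helper and the event fields are hypothetical.

```python
import json
import os
import tempfile
import time


def append_event(log_path, event):
    """Append one JSON line to a log file using ordinary file I/O."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(event) + "\n")
        log_file.flush()
        os.fsync(log_file.fileno())  # ask the OS to push the write out


# Stand-in for a directory on the cluster's NFS mount (assumption: the real
# mount point depends on the installation).
LOG_DIR = tempfile.mkdtemp()
log_path = os.path.join(LOG_DIR, "web-events.log")

append_event(log_path, {"ts": time.time(), "user": "a1", "action": "login"})
append_event(log_path, {"ts": time.time(), "user": "b2", "action": "click"})

with open(log_path, encoding="utf-8") as log_file:
    lines = log_file.read().splitlines()
print(len(lines), "events written")
```

Because the writer only uses standard file calls, the same code runs against a local disk or a mounted file system; what changes is where the events become visible to analysis.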

Page 18: Realtime analytics with_hadoop


MapR-DB and “Other NoSQL” Throughput on YCSB

Throughput performance in operations/second/node (higher is better)

YCSB Benchmark (threads per client*)   MapR-DB 4.X   Other NoSQL    MapR-DB Increase
Load (10, 100)                         27,097        14,753         1.8x
Read (75, 150)                         4,402         1,902          2.3x
50% read / 50% update (75, 100)        8,684         2,012          4.3x
95% read / 5% update (75, 100)         3,776         1,127          3.4x
Scan (32, 32)                          478           Client hangs   N/A

*Numbers in parentheses represent threads per client used in test runs for MapR-DB, other NoSQL, respectively
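The "MapR-DB Increase" figures on this slide are simply the ratio of the two throughput numbers, rounded to one decimal place. The short check below recomputes them from the values shown on the slide; the `increase` helper name is made up for this example, and the Scan row is left out because the slide reports no number for the other system.

```python
# Throughput in operations/second/node, copied from the slide.
results = {
    "Load": (27097, 14753),
    "Read": (4402, 1902),
    "50% read / 50% update": (8684, 2012),
    "95% read / 5% update": (3776, 1127),
}


def increase(mapr_db, other):
    """Ratio of MapR-DB throughput to the other system, to one decimal."""
    return round(mapr_db / other, 1)


ratios = {name: increase(*pair) for name, pair in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```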

Page 19: Realtime analytics with_hadoop


REAL-TIME DATA

REAL-TIME APPLICATIONS

REAL-TIME QUERIES

Page 20: Realtime analytics with_hadoop


YCSB Mixed (50% Read / 50% Put) - Compare Read Latency

[Chart: read latency of MapR-DB versus HBase on other Hadoop distribution; lower is better]

Page 21: Realtime analytics with_hadoop


MapR-DB Table Replication

Multi-master (aka, active/active)

replication

Active Read/Write

End Users

• Faster data access – minimize network

latency on global data with local clusters

• Reduced risk of data loss – real-time,

bi-directional replication for synchronized

data across active clusters

• Application failover – upon any cluster

failure, applications continue via

redirection to another cluster
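The application failover bullet can be pictured with a small client-side sketch. The slides do not describe how redirection is actually performed, so everything here is an assumption made for illustration: the `ClusterUnavailable` exception, the `read_with_failover` helper, and the two stand-in cluster functions are hypothetical and are not part of any MapR-DB API.

```python
class ClusterUnavailable(Exception):
    """Raised by a cluster client when the cluster cannot be reached."""


def read_with_failover(clusters, key):
    """Try each cluster in preference order; return (cluster name, value)."""
    last_error = None
    for name, read in clusters:
        try:
            return name, read(key)
        except ClusterUnavailable as error:
            last_error = error  # remember the failure and try the next cluster
    raise ClusterUnavailable("no cluster reachable") from last_error


# Two stand-in clusters that hold the same replicated row.
replicated_rows = {"user:42": {"name": "Ada"}}


def london_read(key):
    raise ClusterUnavailable("london is down")  # simulated outage


def new_york_read(key):
    return replicated_rows[key]


served_by, row = read_with_failover(
    [("london", london_read), ("new-york", new_york_read)], "user:42"
)
print(served_by, row)
```

The read succeeds from the second cluster only because replication has already placed the same row there, which is the property the slide attributes to real-time, bi-directional replication.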

Page 22: Realtime analytics with_hadoop


MapR-DB Real-Time Analytics

Active clusters close to the end users, with real-time analytics at central cluster

[Diagram: active read/write MapR-DB clusters in London, New York, and Singapore serve end users and are connected to a central MapR-DB/Hadoop cluster that runs Hadoop analytics]

Operational and analytical workloads combined in a single deployment

[Diagram: an operationally efficient, consolidated MapR cluster runs database operations and Hadoop analytics for end users]

Page 23: Realtime analytics with_hadoop


REAL-TIME DATA

REAL-TIME APPLICATIONS

REAL-TIME QUERIES

Page 24: Realtime analytics with_hadoop


One SQL Interface for All Data Formats

ANSI SQL queries on rapidly evolving schemas

Unstructured data will account for more than 80% of the data collected by organizations

[Chart: Total Data Stored, 1990 to 2020, structured data versus unstructured data]

SQL Options for Analytics

[Diagram labels: Existing SQL Engines; Apache Drill; IT-Driven BI; Self-Service BI; Self-Service Data Exploration]

Page 25: Realtime analytics with_hadoop


Agility by Reducing Distance to Data

Short analytic life cycles with no upfront schema creation and management

FROM: Traditional Approach

[Diagram: New Business Questions; Hadoop Data, then Data Preparation (Schema Design, Transformation, Data Movement), then Users]

Total Time to Value: Weeks to Months

TO: New Approach

[Diagram: New Business Questions; Hadoop Data, then Users]

Total Time to Value: Minutes

Drill enables the “As It Happens” business with instant SQL analytics on complex data

Page 26: Realtime analytics with_hadoop


Summary

Page 27: Realtime analytics with_hadoop


Batch Bottlenecks

1. Data streaming – real-time, but…
2. Further analysis is limited by batch loads into HDFS
3. Most databases must run in separate cluster, leading to batch copies
4. Append-only HDFS leads to heavy I/O for database defragmentation (“compactions”)
5. Data exploration requires IT intervention

[Diagram: callouts 1 to 5 mark where each bottleneck occurs]

Page 28: Realtime analytics with_hadoop


Removing the Batch Limitations

1. Data streaming – real-time as before, and now…
2. Further analysis is allowed with real-time loading
3. MapR-DB runs in Hadoop
4. With full read/write file system, defragmentation delays are eliminated
5. Data exploration performed in a self-service manner

[Diagram: callouts 1 to 5 placed across Real-Time Data, Real-Time Applications, and Real-Time Querying]

Page 29: Realtime analytics with_hadoop


And Don’t Forget…

• Real-time analytics doesn’t help you if the other key pieces

aren’t in place

• Include security

– Interoperability with any authentication mechanism

– Fine-grained access controls

– Auditing capabilities beyond simple log files

• Also include enterprise-grade reliability

– An automated high availability configuration

– Incremental mirroring/replication for disaster recovery

– Consistent snapshots

• Talk to us about what else you should consider

Page 30: Realtime analytics with_hadoop


Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies

Page 31: Realtime analytics with_hadoop

Confidential © 2014 Actian Corporation

Real-Time Analytics

Paige Roberts

April, 2015

Hadoop & Analytics Evangelist

Actian Hadoop & Analytics Center of Excellence

Page 32: Realtime analytics with_hadoop


Agenda

About Actian

Advantages of Data-Driven Business

What Do I Mean By Real-Time?

Real-Time Challenge: ATM Fraud

How Actian Does It

Page 33: Realtime analytics with_hadoop


A Few Words About Actian

• $140M Revenues + Profitable
• 10,000+ Customers
• Global Presence: 8 worldwide offices, 7x24 multinational support model

“Fast becoming a big data powerhouse to challenge the market.” Forrester

“Actian is now very powerfully positioned in the big data and analytics markets.” Bloor

Page 34: Realtime analytics with_hadoop


Companies Using Big Data Strategically Outperform …. At the Expense of Those That Don’t

• Predictive
• Real-time
• All Data
• New Insights
• Accuracy

[Chart: Revenue and EBITDA, Big Data companies versus Other Companies, for Grocers, Online Retailers, Big Box Retailers, Casinos, Credit Cards, and Insurance. Bar values in extraction order, Revenue: 8, 9, 5, 5, -1, 6, 9, 14, 11, 9, 24, 12; EBITDA: 5, -1, 1, 2, -15, 3, 14, 9, 12, 10, 22, 11]

Note: Percentage, 10 year CAGR. McKinsey Report on Big Data.

Page 35: Realtime analytics with_hadoop


What Does Real-Time Mean to Us?

Human comfortable interactivity

Streaming data processing

Sub-second response

Page 36: Realtime analytics with_hadoop


Real-Time Analytics – Many Applications

• Solar Power Company: new customer targeting, smart meter data

• Sportswear Company: brand loyalty, wearable data

• Bank: ATM fraud, router data

Page 37: Realtime analytics with_hadoop


Large US Bank Needs Help

• Multi-billion dollar American

bank / financial holding

company

• Provides deposit, credit,

trust, and investment

services to a broad range of

clients

• Operates nearly 1,500 retail

branches and more than

2,000 ATMs

Page 38: Realtime analytics with_hadoop


Fraud Kept This Bank’s Execs Up at Night

[Chart axis label: Number of times faster than Impala]

Page 39: Realtime analytics with_hadoop


What is the Worst Gotcha About ATM Fraud?

In the majority of cases, banks are required to reimburse customer losses.

https://www.tycois.com/insights-and-opinions/articles/atm-skimming-costs-banks

In spite of that, 67% of U.S. adults would switch to another institution after experiencing ATM fraud or a data breach.

http://www.harrisinteractive.com/NewsRoom/HarrisPolls/tabid/447/ctl/ReadCustom%20Default/mid/1508/ArticleId/1515/Default.aspx

Page 40: Realtime analytics with_hadoop


This is What You Call a Delayed Reaction

Page 41: Realtime analytics with_hadoop


Time to Call in the Elephant

Page 42: Realtime analytics with_hadoop


Actian Vortex: The Elephant’s Best Friend

[Architecture diagram, layers as labeled on the slide:]

• Actian Management Console
• ANALYTIC APPS: Financial Services, Health Care, Other Verticals
• APPLICATION DEV: Application Development and Tools (SQL, Java, C/C++, Python)
• DATA PLATFORM (Actian Vortex):
– Elastic Data Preparation: DataFlow
– SQL Analytics: Vector in Hadoop
– Library of Analytic Blueprints
– Graph Analytics: SPARQLverse
– Machine Learning & Predictive Analytics: DataFlow
• INFRASTRUCTURE: Deployment Options (@Customer)
• SOURCE DATA: Databases / Marts, Warehouses, Cloud / SaaS Applications, Structured & Unstructured Data, Enterprise Applications

powered by KNIME

Page 43: Realtime analytics with_hadoop


Actian Vortex: High Performance Analytics at Scale in Hadoop

Powered by KNIME

Page 44: Realtime analytics with_hadoop


Stopping Fraud in Real Time

https://www.youtube.com/watch?v=u1QoHCpOUOU

Page 45: Realtime analytics with_hadoop


Actian Vector in Hadoop: Built for Speed

• Vector Processing: Single Instruction Multiple Data
• Exploiting Chip Cache: process data on chip, not in RAM
• 2nd Gen Column Store: limit I/O; efficient real-time updates
• Smarter Compression: maximize throughput; vectorized decompression
• Multi-core Parallelism: maximize system resource utilization…
• Storage Indexes: quickly identify candidate data blocks; minimize I/O

[Chart: Time / Cycles to Process versus Data Processed]

        Data Processed   Time / Cycles to Process
DISK    40-400MB         Millions
RAM     2-3GB            150-250
CHIP    10GB             2-20
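The storage index idea (quickly identifying candidate data blocks to minimize I/O) is commonly implemented by keeping the minimum and maximum value of each block and skipping blocks whose range cannot match a query. The sketch below shows that generic min/max technique on synthetic data. It is an assumption about the general approach, not a description of Actian's implementation, and the function names are hypothetical.

```python
def build_storage_index(blocks):
    """Record the (min, max) of every block; blocks are lists of values."""
    return [(min(block), max(block)) for block in blocks]


def candidate_blocks(index, low, high):
    """Positions of blocks whose value range overlaps [low, high]."""
    return [
        position
        for position, (block_min, block_max) in enumerate(index)
        if block_max >= low and block_min <= high
    ]


def range_query(blocks, index, low, high):
    """Scan only the candidate blocks and return the matching values."""
    matches = []
    for position in candidate_blocks(index, low, high):
        matches.extend(v for v in blocks[position] if low <= v <= high)
    return matches


# Synthetic column of transaction amounts, split into three blocks.
blocks = [[5, 12, 9], [48, 51, 60], [101, 140, 120]]
index = build_storage_index(blocks)

print(candidate_blocks(index, 50, 110))
print(range_query(blocks, index, 50, 110))
```

For the range 50 to 110 the first block is never read, because its maximum is below the lower bound; on real storage that skipped block is I/O that does not happen.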

Page 46: Realtime analytics with_hadoop


How Fast?

Page 47: Realtime analytics with_hadoop


How Fast?

Page 48: Realtime analytics with_hadoop


What to Look For in SQL in Hadoop

Open: No vendor lock-in

• Collaborative architecture
• Open access to Actian data storage formats
• Support for other formats
• Hadoop distribution and ecosystem application support

Fast: Results you need when you need them

• Fastest data prep and ingestion
• Fastest analytic engines
• Unbridled processing power on data nodes in a Hadoop cluster

Enterprise-Grade: Proven technology advantages

• Full SQL support
• Extreme scalability
• Full security
• High Availability & Disaster Recovery

Page 49: Realtime analytics with_hadoop


Free Actian Vortex Express Edition

Page 50: Realtime analytics with_hadoop


www.actian.com

facebook.com/actiancorp

@actiancorp

Thank You

Download Actian Vortex Express

Free Forever

http://bigdata.actian.com/sql-in-hadoop

Page 51: Realtime analytics with_hadoop

Question and Answer Session

(please submit questions)

Page 52: Realtime analytics with_hadoop

Q & A

Dale Kim

Director of Industry Solutions

MapR

Paige Roberts

Hadoop & Analytics Evangelist

Actian

Page 53: Realtime analytics with_hadoop

Please use the same URL you used to view today’s live event

for the archive event, plus we will be sending you a follow-up

email with that URL once the archive is posted!

Page 54: Realtime analytics with_hadoop

Thank you for participating in

today’s roundtable web event

Just by attending this event the winner of the

$100 AmEx Gift Card is…….