
Delivering Operational Analytics Using Spark and NoSQL Data Stores

Mike Ferguson, Managing Director, Intelligent Business Strategies
Basho Webinar, January 2016

2

About Mike Ferguson

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. He was formerly a principal and co-founder of Codd and Date Europe Limited (the consultancy of the inventors of the relational model), a Chief Architect at Teradata working on the Teradata DBMS, and European Managing Director of DataBase Associates.

[email protected]

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700

3

Topics

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

4

Topics

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

5

The Application Processing Spectrum

Source: BI-Research Copyright © BI-Research, 2013-Present

6

Big Data Processing – There Is A Growing Number of Data Stores Optimized for Operational or Analytical Workloads

[Diagram: a spectrum of data stores, from OLTP RDBMSs through NoSQL DBMSs to analytical RDBMSs, each optimized for different workloads.]

• ACID support is missing in many NoSQL DBMSs
• Can you live with losing a transaction?
• OK for sensor data, for example

7

A Closed Loop Is Still Needed – It Just Now Also Includes NoSQL Technologies

[Diagram: scalable operational applications (relational & NoSQL systems) produce new data that flows into scalable analytical systems (relational & NoSQL systems); new insights flow back into the operational applications, closing the loop.]

8

Topics – Where Are We?

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

9

Demand For Scalable Operational Systems With High Write Processing Is Driving Demand for NoSQL DBMSs

[Diagram: the closed loop again – scalable operational applications feed new data into scalable analytical systems, which feed new insights back, with the emphasis here on the operational side.]

10

Success of Big Data Analytics Depends On Being Able To Scale To Capture High Velocity, High Volume Data

Successful big data analytics requires:

1. The ability to scale operational systems to capture, stream and store the required transactional and non-transactional data
   – Support peak transaction rates
   – Support peak capture of non-transactional data, e.g. shopping cart data
   – Support peak data arrival rates, e.g. sensor data
   – Support peak ingestion rates

2. Scalable big data analytics

3. Closed-loop integration of analytical systems back into core operational transaction processing systems
   – Make prescriptive insights available to all who need them, to continuously optimise operations and maximise effectiveness

11

E-Business And Mobile Means Operational Systems Are Having To Scale To Support Masses Of Concurrent Users

[Diagram: many more users, on the web and on mobile devices, drive operational and transactional applications whose data – including web logs – is partitioned across a cluster.]

12

Example Operational Applications Requiring Scalability That Are Fuelling Demand For NoSQL DBMSs

Web and mobile commerce
• Shopping cart data, session storage

Internet of Things (IoT) and other time series applications
• Need to scale as the number of devices/things increases

Mobile gaming
• Player profile data, session storage, game performance stats

Healthcare
• Storing unstructured healthcare digital imaging and video data

Social network applications

13

Types Of NoSQL Database And Product Examples

NoSQL Database Type – NoSQL Product Examples
• Key value store – Aerospike, Amazon DynamoDB, Basho Riak KV, Redis, MemcacheDB, Voldemort
• Document database – CouchDB, IBM DB2 (XML & JSON), MongoDB, IBM Cloudant, MarkLogic, Terrastore, JackRabbit, RaptorDB
• Column family database – Cassandra, DataStax, Google BigTable, Hadoop HBase, Hypertable, HPCC, Amazon SimpleDB
• Graph database – AllegroGraph, GraphBase, Horton, InfiniteGraph, IBM DB2, Neo4j, Oracle Spatial and Graph, Titan, Cray Research, Teradata Aster
• Multi-model database – ArangoDB, CortexDB, MarkLogic, MongoDB, FoundationDB

• Some NoSQL databases are aimed at write processing (data collection); others are aimed at specific big data analytical workloads
• Issues include lack of standard APIs, weak or no optimizer, and non-immediate (eventual) consistency

14

Global NoSQL Market Size And Forecast 2013 - 2020

Source: https://www.alliedmarketresearch.com/NoSQL-market

15

Key Value Stores Can Store Any Data - Examples

Key     Value
10034   John Smith
82771
93441

{ "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home",   "number": "0161-123-1234" },
    { "type": "mobile", "number": "07779-123234" }
  ]
}

Key value store features:
• Very simple to understand
• Very scalable – hash partitioning
• Data access is via the key
• The application controls what's stored in the value
• Very fast performance
• Acceleration via in-memory processing
• Eventual consistency
• Often no support for data types
• No built-in referential integrity
• No understanding of data relationships
• The application must understand any relationships in the data
• The programmer is in complete control
• The application must navigate complex data

Use for specific operational applications

16

Key Value Stores – The Key Is Hashed To Partition The Data

Source: Microsoft

The value can be anything:
• A single data field
• A JSON document
• An XML document
• Text
• Image…

Easy to partition (hash the key)
Very fast to retrieve and store data
The application needs to know:
• What is stored in the VALUE
• How the value is structured
• How to process the value

The key needs to be unique

Can use HTTP to read and write data, e.g. curl -XPUT, curl -XGET (sketched below)
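As a hedged illustration of the HTTP access pattern above, the sketch below PUTs and GETs a value against a Riak KV bucket using plain Scala and java.net.HttpURLConnection. The host, port, bucket and key are assumptions for illustration; the /buckets/<bucket>/keys/<key> path follows Riak's documented HTTP API shape but should be checked against your cluster.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Minimal sketch of reading/writing a key-value store over HTTP.
// Host, port, bucket and key are placeholders -- adjust to your Riak cluster.
object RiakKvHttpSketch {
  val base = "http://localhost:8098/buckets/carts/keys"   // assumed Riak HTTP endpoint

  def put(key: String, json: String): Int = {
    val conn = new URL(s"$base/$key").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val out = conn.getOutputStream
    out.write(json.getBytes("UTF-8"))
    out.close()
    conn.getResponseCode                                    // e.g. 204 No Content on success
  }

  def get(key: String): String = {
    val conn = new URL(s"$base/$key").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    Source.fromInputStream(conn.getInputStream).mkString    // the stored value, as-is
  }

  def main(args: Array[String]): Unit = {
    put("10034", """{"firstName":"John","lastName":"Smith"}""")
    println(get("10034"))
  }
}
```

Note how the application, not the store, decides what goes in the value and how to interpret it on the way back out.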

17

Key Value Stores – A Basho Riak KV Cluster Has Virtual Nodes Running on Physical Nodes

Source: Basho

• SHA-1 is a hashing function that hashes a key to determine the node it lives on
• Riak hash-partitions and replicates data (3 copies of the data is the default)
• Operations are issued against the key and its value, e.g. PUT, POST, GET…

[Diagram: the key is hashed and the key/value pair is stored on virtual nodes running on physical nodes.]

• Nodes can be added to and removed from a Riak cluster while it is running

18

Key Value Stores - A Basho Riak KV Ring

Riak uses partitions (64 partitions are the default) and also replicates the partitions for high availability

Source: Basho

[Diagram: writing replicas around the Riak ring.]
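To make the placement idea concrete, here is a minimal, illustrative Scala sketch of the scheme described on these slides: hash the key, map it onto a fixed ring of partitions, and pick that partition plus the next two for the replicas. This is not Riak's actual implementation (its ring management, vnode claim and handoff are far more involved); the ring size, replica count and SHA-1 hash simply mirror the defaults mentioned above.

```scala
import java.security.MessageDigest

// Illustrative only: hash a key onto a fixed ring of partitions and pick the
// partitions that will hold its replicas.
object RingSketch {
  val numPartitions = 64                 // Riak's default ring size
  val replicas      = 3                  // default: 3 copies of each object

  def sha1(key: String): BigInt =
    BigInt(1, MessageDigest.getInstance("SHA-1").digest(key.getBytes("UTF-8")))

  // The partition that owns the key, plus the next (replicas - 1) partitions on the ring
  def preferenceList(key: String): Seq[Int] = {
    val first = (sha1(key) mod BigInt(numPartitions)).toInt
    (0 until replicas).map(i => (first + i) % numPartitions)
  }

  def main(args: Array[String]): Unit =
    println(preferenceList("user:10034"))   // e.g. Vector(17, 18, 19)
}
```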

19

Topics – Where Are We?

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

20

Demand For Scalable Analytical Systems Is Also Exploding

[Diagram: the closed loop again – scalable operational applications feed new data into scalable analytical systems, which feed new insights back, with the emphasis here on the analytical side.]

21

A Hadoop System

[Diagram: a Hadoop system – HDFS files accessed directly or via webHDFS (an HTTP interface to HDFS with REST APIs); YARN managing the MapReduce, Tez and Spark execution engines; Storm and HBase alongside; analytic applications in Java, Python and Scala, Pig Latin scripts, SQL and 3rd-party SQL-on-Hadoop engines (executing on MapReduce, Tez and Spark), and BI tools on top, with APIs to HBase and to HDFS.]

22

Faster Execution Engines For Analytic Applications – Apache Spark

[Diagram: the same Hadoop system, now highlighting Spark alongside MapReduce and Tez as a faster execution engine for analytic applications running under YARN.]

23

Spark Is A General Purpose In-Memory Execution Framework That Can Run With Or Without Hadoop

[Diagram: Spark as an execution engine that can run under YARN alongside MapReduce and Tez on a Hadoop cluster, or stand-alone over HDFS, S3 and Tachyon.]

Spark also includes an HDFS-compatible in-memory file system (Tachyon). You can use Spark with or without Tachyon.

The Spark stack is integrated – e.g. you can use Spark Streaming, Spark SQL and MLlib together in the same application.

[Diagram: the Spark stack – Spark Core with Spark Streaming, Spark SQL + DataFrames, GraphX (graph computation) and MLlib (machine learning) on top, accessible from SQL, R, Python, Scala and Java, and used by applications and BI tools.]

24

Apache Spark components:
• Spark Core – provides distributed task dispatching, scheduling and basic I/O
• Spark Streaming – for analysis of real-time streaming data
• MLlib (machine learning) – a library of pre-built analytic algorithms that can run in parallel across a Spark cluster (see the sketch below)
• GraphX (graph computation) – a graph analysis engine running on Spark
• Spark SQL + DataFrames – query structured data in Spark apps using SQL or a DataFrames API
• Accessible from SQL, R, Python, Scala and Java, and used by applications and BI tools
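As a sketch of what "pre-built analytic algorithms running in parallel" looks like in practice, the snippet below trains a K-means model with MLlib using the Spark 1.x-era RDD API. The input path, feature layout and parameter values are placeholders, not part of the original slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Minimal MLlib example: cluster numeric feature vectors read from HDFS.
object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Each line: comma-separated numeric features, e.g. sensor readings
    val points = sc.textFile("hdfs:///data/readings.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Train 5 clusters with up to 20 iterations, in parallel across the cluster
    val model = KMeans.train(points, 5, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```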

25

Spark In-Memory Analytic Applications Can Do A Lot More Than Map Reduce Processing

• Keep only one copy of the data in memory, in a JVM
• Track the lineage of the job operators used to derive the data
• Use the lineage to re-compute the data if there is a failure
• No MapReduce execution needed – just Spark APIs

[Diagram: a Spark application expressed as a DAG of operators (map, join, filter, reduce) reading files from HDFS. Source: AMPLab]

26

Spark Applications Operate On RDDs (Data) – You Can Do A Lot More Than Map and Reduce

RDD = Resilient Distributed Dataset
• An RDD is a read-only, partitioned collection of records
• RDDs can only be created through operators on either:
  1. A dataset in stable storage, or
  2. Other existing RDDs

Spark operators – you can do a lot more than map and reduce (see the sketch below):
map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, leftOuterJoin, cross, pipe, rightOuterJoin, zip, save
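The sketch below chains a few of these operators together on an RDD using the Spark 1.x-era Scala API. The log file path and record layout are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal RDD example: find the users hit hardest by slow responses in a web log.
object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-operators"))

    // One log line per record: "userId,page,responseMillis"
    val lines = sc.textFile("hdfs:///logs/access.log")

    val slowHitsPerUser = lines
      .map(_.split(','))                                 // parse each record
      .filter(fields => fields(2).toDouble > 500.0)      // keep slow responses only
      .map(fields => (fields(0), 1))                     // (userId, 1)
      .reduceByKey(_ + _)                                // count per user
      .sortBy { case (_, count) => -count }              // most affected users first

    slowHitsPerUser.take(10).foreach(println)
    sc.stop()
  }
}
```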

27

Simplifying Access To Data Via Spark SQL and Spark DataFrames

• A DataFrame is a distributed collection of data organized into named columns
• Conceptually equivalent to a relational DBMS table or a data frame in R/Python
• DataFrames can be constructed from a wide array of sources:
  • Structured data files
  • Hive tables
  • External databases
  • Existing RDDs
• Uses schema on read

Image source: Databricks.com

Note that Spark data sources can be relational and NoSQL DBMSs.
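A minimal sketch of the DataFrame pattern described above, using the Spark 1.x SQLContext API: build a DataFrame from a structured JSON file (schema inferred on read), register it as a table and query it with SQL. The file path and column names are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal DataFrame + SQL example (Spark 1.x-era API).
object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataframe-sketch"))
    val sqlContext = new SQLContext(sc)

    // Schema is inferred on read ("schema on read") from the JSON records
    val orders = sqlContext.read.json("hdfs:///data/orders.json")
    orders.printSchema()

    // Query the DataFrame with SQL
    orders.registerTempTable("orders")
    val totals = sqlContext.sql(
      "SELECT customerId, SUM(amount) AS total FROM orders GROUP BY customerId")

    totals.show()
    sc.stop()
  }
}
```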

28

Spark Is Going Over The Top of Multiple Data Stores For Scalable In-Memory Analytics Across The Entire Ecosystem

[Diagram: the Spark stack sitting over multiple data stores – streaming data and streaming analytics, Hadoop data stores for advanced analytics on multi-structured data, the data warehouse RDBMS (EDW, DW & marts), and operational NoSQL data stores such as Cassandra and Basho Riak – providing scalable in-memory analytics across the whole ecosystem.]
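One concrete way Spark "goes over the top" of an existing store is the JDBC data source: the sketch below reads a data warehouse table into a DataFrame for in-memory analysis (Spark 1.x-era API). The connection URL, table and credentials are placeholders; NoSQL stores such as Riak or Cassandra are reached in the same spirit through their own Spark connector packages rather than JDBC.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal example of reading an external relational DBMS as a Spark data source.
object JdbcSourceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-source"))
    val sqlContext = new SQLContext(sc)

    val customers = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dwhost:5432/edw")   // placeholder warehouse
      .option("dbtable", "public.customers")
      .option("user", "analyst")
      .option("password", "secret")
      .load()

    // Once loaded, the warehouse data can be analysed in memory alongside other sources
    customers.filter("country = 'England'").show()
    sc.stop()
  }
}
```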

29

Topics – Where Are We?

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

30

Key Business Drivers And Objectives For Operational Analytics

Combine operational and analytical processing at scale to:
• Improve customer engagement
• Reduce risk
• Avoid unplanned operational cost
• Optimise operational effectiveness

Use BI/analytics to drive and guide business operations to help achieve specific business goals and KPI targets:
• Automated analysis of operational events as they happen
• Automated alerts
• On-demand recommendations

Integrate BI/analytics into every business process to:
• Create an 'insight-driven' employee base
• Enable mass execution of business strategy by facilitating mass contribution towards achieving specific business goals

31

Five Types Of Operational BI/Analytics

1. Simple operational reporting of current position/state e.g. session state

2. Situational awareness via visualisation of live operational data typically on dashboards

3. On-demand analytics of live operational and/or historical data to improve operational decisions and effectiveness

4. On-demand recommendations for guidance

5. Event stream processing to monitor, automatically analyse and act on events in real-time to prevent problems arising and to optimise business operations

32

Operational Analytics – What's The Difference Between On-Demand And Event-Driven Analysis?

[Diagram: on-demand – an application calls a BI/analytics service (query, report, model, recommendation) when it needs it; event-driven – a message, file arrival, pattern or trigger in streaming data automatically invokes the analytical service.]

33

Analytics Need To Be Integrated Into Business Processes To Optimize Business Operations

[Diagram: customers, partners & suppliers and employees interact with business processes – customer relationship management, operations management and supply chain management – spanning marketing, sales, service/support, operations, finance/accounting, procurement, inventory control, shipping/distribution and human resources; integrated on-demand business intelligence underpins integrated intelligent business operations.]

34

High Value Application Use Cases for Streaming Analytics

[Diagram: high-value streaming analytics use cases. Source: adapted from a slide by IBM]

35

Responding To Events And Event Patterns Means Reducing Action Time

The time between an event occurring and action being taken (the action distance, or action time) should be as close to zero as possible.

[Diagram: event-driven data integration, automated analysis, and automated decision and action taking progressively reduce action time.]

Source: Dr Richard Hackathorn

36

With Event Stream Processing The Architecture Has To Change

[Diagram: with the classic use of analytics, data is cleansed and integrated, stored, and then queried/analysed by humans. With event/stream processing, data is queried/analysed automatically as it arrives and acted on (automatically or by a human), with data cleansing & integration and storage happening alongside.]

37

Time Series Analysis – Query Processing Uses a Time Window to Look at Continuously Streaming Data

• A time window (e.g. 5 seconds, 30 seconds or 5 minutes) slides along the stream between times T1 and T2, looking for patterns and correlations
• Continuous time series queries (CQs) operate on the data as it flows by
• A set of continuous queries resides in the stream processing server to process incoming high-frequency data; the data is pushed into the queries
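A minimal sketch of a continuous, windowed query using Spark Streaming (one possible engine for this pattern; the slide itself is engine-neutral): readings arrive continuously, and a count per sensor is recomputed over a sliding 30-second window every 5 seconds. Host, port and record layout are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal windowed continuous query over a stream of sensor readings.
object WindowedQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("windowed-cq"))
    val ssc = new StreamingContext(sc, Seconds(1))          // 1-second micro-batches

    // Each incoming line: "sensorId,reading"
    val readings = ssc.socketTextStream("sensor-gateway", 9999)
      .map(_.split(','))
      .map(fields => (fields(0), fields(1).toDouble))

    // Count readings per sensor over the last 30 seconds, recomputed every 5 seconds
    val countsPerWindow = readings
      .map { case (sensorId, _) => (sensorId, 1) }
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))

    countsPerWindow.print()                                  // the continuous query's result
    ssc.start()
    ssc.awaitTermination()
  }
}
```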

38

Key Requirements For Operational Analytics

• On-demand, event-driven and scheduled invocation of analytics
• Monitor streaming events as they happen via automatic analysis
• Automatic analysis via predictive and statistical models
• Automatic interpretation of predictive/statistical model outcomes
• Rule-driven automatic actions to automate decision making
  – e.g. alerts, recommendations, transaction and process invocation
• Integrate operational analytics into operational applications
• Operational reporting
• Scale to support large numbers of events and concurrent users
• Store relevant data together to speed up analytics execution
• Run predictive and statistical models close to the data
• Run analytics on a 24x365 basis

39

The Importance of In Memory Processing

Massively parallel in-memory processing is mission critical for scalable operational systems and operational analytics.

Why?
• Performance is critical
• Large numbers of concurrent user requests for on-demand analytics
• Large numbers of concurrent application requests for on-demand analytics
• Event-driven operational analytics on very high velocity data needs memory

40

Topics – Where Are We?

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

41

The Basho Data Platform

The Basho Data Platform provides:
• Hash partitioning, cluster scalability, triple replication and multi-datacentre replication
• Co-location of time-series data, high availability and scalability
• Replication and synchronisation of data within and across Riak KV, Redis and Spark clusters
• Automated cluster management that simplifies administration
• Integrated in-memory caching for faster application performance
• Search-based query processing on Riak data using Solr indexes
• Integrated in-memory analytics for Riak KV and Riak TS data

42

Riak TS Is A New Basho Storage Instance Optimised for Time Series Data And Analytics

A distributed NoSQL database optimised for the capture, aggregation and analysis of time-sequenced, unstructured data from the Internet of Things (IoT)

• High availability
• Scalability – add nodes to a cluster without sharding
• Automated and uniform data distribution across the cluster
  – Time- or geohash-based data co-location to ensure related time series data is located on the same node
• Data validation on input
• APIs and client libraries for Java, Ruby, Python, Go, Erlang, Node.js and .NET
• Spark integration for operational analysis of time series data

43

Operational Analytics Using The Basho Data Platform And Apache Spark

[Diagram: operational analytics architecture combining the Basho Data Platform and Apache Spark; recent data is kept readily accessible.]

44

Operational Analytics Using The Basho Data Platform And Apache Spark - 2

• Spark operational analytic applications can be developed on low-latency data stored in Basho Riak KV
• Spark-based analytical web services can be invoked on-demand to analyse data in Riak KV
  • Use on-demand Spark jobs for historical analysis and predictions
• Insights produced from analysing Riak KV data can be written back to Riak KV for use by other applications
  • A form of closed-loop processing
• Spark Streaming can be used to calculate rollups and detect abnormalities in streaming sensor data (see the sketch below)
• Recent data can be kept in Redis for dashboard visualization
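A hedged sketch of the rollup and abnormality-detection pattern described above, using plain Spark Streaming. The threshold rule, host names and record layout are assumptions, and the write-back to Riak KV or Redis is only indicated by a comment; in a real deployment it would go through the Basho Spark connector or a Riak/Redis client.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: per-device rollups and a simple abnormality check on streaming sensor data.
object SensorRollupSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("sensor-rollups"))
    val ssc = new StreamingContext(sc, Seconds(5))

    // Each incoming line: "deviceId,timestampMillis,temperature"
    val readings = ssc.socketTextStream("ingest-host", 9999)
      .map(_.split(','))
      .map(f => (f(0), f(2).toDouble))

    // Rollup per device for each 5-second micro-batch: (reading count, max temperature)
    val rollups = readings
      .mapValues(t => (1, t))
      .reduceByKey { case ((c1, m1), (c2, m2)) => (c1 + c2, math.max(m1, m2)) }

    // Flag devices whose max temperature breaches a placeholder threshold
    val alerts = rollups.filter { case (_, (_, maxTemp)) => maxTemp > 90.0 }
    alerts.print()

    rollups.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Open a Riak KV / Redis connection here and write each rollup back
        partition.foreach(println)            // stand-in for the write-back
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```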

46

Topics – Where Are We?

• The changing landscape of operational and analytical systems
• Scalable operational applications and NoSQL data stores
• Big data analytics – The era of Hadoop and Spark
• The value of operational analytics
• Operational analytics using The Basho Data Platform and Apache Spark
• Conclusions

47

Conclusions

As operational application processing scales, so too does the need to scale operational analytics

Basho is using in-memory processing to accelerate operational applications (via Redis) and to introduce scalable operational analytics (via Spark) into these applications

New scalable ‘smart’ operational applications are therefore becoming possible with careful design in a NoSQL environment

48

[email protected]

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700

Thank You!