Scalable Realtime Analytics with declarative, SQL like, Complex Event Processing Scripts

Srinath Perera, Director, Research, WSO2; Apache Member (@srinath_perera)


(Batch) Analytics

- Scientists have been doing this for 25 years with MPI (1991) on special hardware.
- It took off with Google’s MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem were created.
- It was successful, so we are here!
- But processing takes time.

Value of Some Insights Degrades Fast!

For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time.
- E.g. stock markets and the speed of light

We need technology that can produce outputs fast.
- Static queries that need very fast output (alerts, realtime control)
- Dynamic and interactive queries (data exploration)

History

Realtime analytics is not new either!
- Active databases (2000+)
- Stream processing (Aurora, Borealis (2005+), and later Storm)
- Distributed streaming operators (e.g. database research topic around 2005)
- CEP vendor roadmap (from http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)

Realtime Analytics Tools

I. Stream Processing

Program a set of processors and wire them up; data flows through the graph.

A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza).

Processors may be on the same machine or on multiple machines.

II. Complex Event Processing

III. Micro Batch

Process data in small batches, then combine the partial results into the final result (e.g. Spark).

Works for simple aggregates, but it is tricky for complex operations (e.g. event sequences).

Can be done with MapReduce as well if the deadlines are not too tight.
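As a rough illustration of the micro-batch idea, the Python sketch below computes a mean by reducing each small batch to a partial (sum, count) pair and then combining the partials. The function names and the sample readings are hypothetical and are not taken from Spark or any other engine. It also hints at why event sequences are harder: a sequence can straddle a batch boundary, so it cannot be reduced to an independent per-batch partial this easily.

```python
from typing import List, Sequence, Tuple


def batch_partial(batch: Sequence[float]) -> Tuple[float, int]:
    """Reduce one small batch to a partial result: (sum, count)."""
    return sum(batch), len(batch)


def micro_batch_mean(values: Sequence[float], batch_size: int) -> float:
    """Process values in small batches, then combine the partial results."""
    partials: List[Tuple[float, int]] = [
        batch_partial(values[i:i + batch_size])
        for i in range(0, len(values), batch_size)
    ]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("no events to aggregate")
    return total / count


# Hypothetical sample data: five readings processed in batches of two.
readings = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = micro_batch_mean(readings, batch_size=2)
```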

IV. OLAP Style In Memory Computing

Usually done to support interactive queries.

Index the data to make it readily accessible so you can respond to queries fast (e.g. Apache Drill).

Tools like Druid, VoltDB, and SAP HANA can do this with all data in memory to make things really fast.

Realtime Analytics Patterns

- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
- Preprocessing: filtering, transformations (e.g. data cleanup)
- Alerts, thresholds (e.g. alarm on high temperature)
- Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
- Joining event streams (e.g. detect a hit on a soccer ball)
- Merge with data in a database, collect, update data conditionally

Realtime Analytics Patterns (contd.)

- Detecting event sequence patterns (e.g. small transaction followed by large transaction)
- Tracking: follow some related entity’s state in space, time, etc. (e.g. location of airline baggage, vehicles, tracking wildlife)
- Detecting trends: rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
- Learning a model (e.g. predictive maintenance)
- Predicting the next value and corrective actions (e.g. automated car)

Apache Hive

- A SQL-like data processing language
- Since many understand SQL, Hive made large-scale data processing (Big Data) accessible to many
- Expressive, short, and sweet
- Defines core operations that cover 90% of problems
- Lets experts dig in when they like!

(Batch Processing, Hive) (Realtime Analytics, X)

What is X?

CEP = SQL for Realtime Analytics

- Easy to follow from SQL
- Expressive, short, and sweet
- Defines core operations that cover 90% of problems
- Lets experts dig in when they like!

Let's look at the core operations.

Operators: Filters

Assume a temperature stream. Here weather:convertFtoC() is a user-defined function; such functions are used to extend the language.

define stream TempStream (ts long, roomNo int, temp double);

from TempStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
select roomNo, temp
insert into HotRoomsStream;

Usecases:
- Alerts, thresholds (e.g. alarm on high temperature)
- Preprocessing: filtering, transformations (e.g. data cleanup)
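To make the filter semantics concrete, here is a minimal Python sketch of what such a query does to each event: apply a predicate (including a user-defined function) and forward only the selected attributes. This is not how a CEP engine is implemented. convert_f_to_c is a stand-in for weather:convertFtoC(), and the sample events are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, float]


def convert_f_to_c(temp_f: float) -> float:
    """Stand-in for the weather:convertFtoC() user-defined function."""
    return (temp_f - 32.0) * 5.0 / 9.0


def filter_stream(
    events: Iterable[Event], predicate: Callable[[Event], bool]
) -> Iterator[Event]:
    """Forward only matching events, projecting roomNo and temp."""
    for event in events:
        if predicate(event):
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}


# Hypothetical input events (temperatures in Fahrenheit).
temp_stream = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},   # 35.0 C, passes
    {"ts": 2, "roomNo": 2043, "temp": 99.0},  # excluded room
    {"ts": 3, "roomNo": 102, "temp": 70.0},   # about 21.1 C, too cool
]

hot_rooms = list(
    filter_stream(
        temp_stream,
        lambda e: convert_f_to_c(e["temp"]) > 30.0 and e["roomNo"] != 2043,
    )
)
```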

Operators: Windows and Aggregation

Supports many window types:
- Batch windows, sliding windows, custom windows

Usecases:
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)

from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream;
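The sliding time window can be sketched in Python as a queue of recent events plus a running total. This is a simplified illustration that assumes events arrive in timestamp order and that expiry is checked only when a new event arrives (a real engine would also use timers). sliding_time_avg and the sample data are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def sliding_time_avg(
    events: Iterable[Tuple[int, float]], window_ms: int
) -> Iterator[Tuple[int, float]]:
    """For each (ts, temp) event, yield (ts, average temp over the last window_ms)."""
    window: Deque[Tuple[int, float]] = deque()
    total = 0.0
    for ts, temp in events:
        window.append((ts, temp))
        total += temp
        # Expire events that have fallen out of the time window.
        while window[0][0] <= ts - window_ms:
            _, old_temp = window.popleft()
            total -= old_temp
        yield ts, total / len(window)


# Hypothetical events: (timestamp in ms, temperature), 1 minute window.
averages = list(
    sliding_time_avg([(0, 20.0), (30_000, 30.0), (70_000, 40.0)], window_ms=60_000)
)
```

The third average drops the first reading because it is more than one minute old by then.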

Operators: Patterns

Models a followed-by relation: e.g. event A followed by event B.

A very powerful tool for tracking and detecting patterns.

from every (a1 = TempStream) -> a2 = TempStream[temp > a1.temp + 5] within 1 day
select a2.ts as ts, a2.temp - a1.temp as diff
insert into HotDayAlertStream;

Usecases:
- Detecting event sequence patterns
- Tracking
- Detecting trends
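One way to read the "every A followed by B within a time limit" pattern is as a set of pending partial matches: every event starts a new match, and each pending match completes on the first later event that satisfies the condition. The Python sketch below follows that simplified reading; a real engine's matching semantics may differ in detail. detect_rises, its parameters, and the sample events are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

DAY_MS = 24 * 60 * 60 * 1000


def detect_rises(
    events: Iterable[Tuple[int, float]],
    min_rise: float = 5.0,
    within_ms: int = DAY_MS,
) -> List[Tuple[int, float]]:
    """Return (ts, diff) for each a1 followed by an a2 with temp > a1.temp + min_rise."""
    pending: List[Tuple[int, float]] = []  # a1 candidates still waiting for an a2
    alerts: List[Tuple[int, float]] = []
    for ts, temp in events:
        still_pending: List[Tuple[int, float]] = []
        for a1_ts, a1_temp in pending:
            if ts - a1_ts > within_ms:
                continue  # this partial match timed out
            if temp > a1_temp + min_rise:
                alerts.append((ts, temp - a1_temp))  # matched: emit and retire
            else:
                still_pending.append((a1_ts, a1_temp))
        # "every": each incoming event also starts a new partial match.
        still_pending.append((ts, temp))
        pending = still_pending
    return alerts


# Hypothetical events: (timestamp in ms, temperature).
alerts = detect_rises(
    [(0, 20.0), (1_000, 22.0), (2_000, 28.0), (DAY_MS + 5_000, 40.0)]
)
```

The last event produces no alert because the only pending match is more than one day old by then.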

Operators: Joins

Join two data streams based on a condition and windows.

Usecases:
- Data correlation, detecting missing events, detecting erroneous data
- Joining event streams

from TempStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
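The windowed join can be pictured as two small buffers: a time window of recent hot readings and a length-1 window holding the latest "off" regulator event. Whenever an event arrives on one side, it is matched against the buffer on the other side. The Python sketch below is only an approximation of that behaviour under the assumption of a single, time-ordered input; join_streams and the tuple layout are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def join_streams(events: list, window_ms: int = 60_000) -> List[Tuple[int, str, str]]:
    """events: time-ordered tuples, either
    ("temp", ts, roomNo, temp) or ("reg", ts, roomNo, deviceID, isOn)."""
    hot: Deque[Tuple[int, int]] = deque()      # hot readings in the last minute
    last_off: Optional[Tuple[int, str]] = None  # latest regulator event with isOn == false
    actions: List[Tuple[int, str, str]] = []
    for event in events:
        kind, ts = event[0], event[1]
        # Expire hot readings that are older than the time window.
        while hot and hot[0][0] <= ts - window_ms:
            hot.popleft()
        if kind == "temp":
            _, _, room, temp = event
            if temp > 30.0:
                hot.append((ts, room))
                if last_off is not None and last_off[0] == room:
                    actions.append((room, last_off[1], "start"))
        else:
            _, _, room, device, is_on = event
            if not is_on:
                last_off = (room, device)
                for _, hot_room in hot:
                    if hot_room == room:
                        actions.append((room, device, "start"))
    return actions


# Hypothetical interleaved input from the two streams.
actions = join_streams([
    ("reg", 0, 7, "ac-7", False),
    ("temp", 1_000, 7, 35.0),
    ("temp", 2_000, 8, 40.0),
    ("reg", 3_000, 8, "ac-8", False),
    ("temp", 90_000, 8, 20.0),
])
```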

Operators: Access Data from the Disk

Event tables allow users to map a database to a window and join a data stream with that window.

Usecases:
- Merge with data in a database, collect, update data conditionally

define stream TempStream (ts long, temp double);
define table HistTempTable (day long, avgT double);

from TempStream#window.length(1) join HistTempTable
on getDayOfYear(ts) == HistTempTable.day and temp > avgT
select ts, temp
insert into PurchaseUserStream;

Revisit Patterns

Predictive Analytics

- Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)
- Build models using R, export them as PMML, and use them within WSO2 CEP
- Call R scripts from CEP queries
- Regression and anomaly detection

Operators in CEP

Case Study: Realtime Soccer Analysis

Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM

TFL Traffic Analysis

Built using TFL (Transport for London) open data feeds.

http://goo.gl/04tX6k
http://goo.gl/9xNiCm

Great, Does it Scale?

Idea I: Network of CEP Nodes

For scaling, we arrange CEP processing nodes in a graph, as with stream processing.

The graph can be implemented using a stream processing engine like Apache Storm.

Idea II: Compile SQL like Queries to a Network of CEP Nodes

from TempStream[temp > 33]
insert into HighTempStream;

from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;

How do we partition the data to scale up the analysis?

Let's follow MapReduce

MapReduce does not scale by itself; it asks users to break the problem into many small independent problems.

Idea III: Let the Users specify Parallelism

The language includes parallel constructs: partitions, pipelines, distributed operators.

Assign each partition to a different node, and partition the data accordingly.

define partition on TempStream.region {
    from TempStream[temp > 33]
    insert into HighTempStream;
}

from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
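A small Python sketch of user-specified partitioning: events are routed to a node by a hash of the partition key, each node runs the filter on its own share, and the non-partitioned query aggregates the merged output. The routing function, node count, and sample events are assumptions made for illustration; the slides do not describe how WSO2 CEP assigns partitions to nodes.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Optional


def node_for(region: str, num_nodes: int) -> int:
    """Pick a node deterministically from the partition key."""
    return zlib.crc32(region.encode("utf-8")) % num_nodes


def run_partitioned(
    events: List[dict], num_nodes: int = 3, threshold: float = 33.0
) -> Optional[float]:
    # Step 1: route events so that each node sees only its own regions.
    per_node: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        per_node[node_for(event["region"], num_nodes)].append(event)
    # Step 2: each node runs the filter independently (this part can run in parallel).
    high: List[dict] = []
    for node_events in per_node.values():
        high.extend(e for e in node_events if e["temp"] > threshold)
    # Step 3: the non-partitioned query aggregates the merged HighTempStream.
    return max((e["temp"] for e in high), default=None)


# Hypothetical events with a region attribute used as the partition key.
events = [
    {"region": "north", "temp": 35.0},
    {"region": "south", "temp": 31.0},
    {"region": "north", "temp": 38.5},
    {"region": "east", "temp": 34.0},
]
hourly_max = run_partitioned(events)
```

Because routing depends only on the key, all events for one region land on the same node, which is what lets the partitioned query run without coordination between nodes.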

Handling Ordering

When data is processed in parallel, output might be generated out of order.

Due to the lack of a global time, we cannot trigger windows and other time-sensitive constructs.

Solution: the current time needs to be propagated through the graph.
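One possible way to propagate time, sketched in Python: each downstream node records the latest timestamp reported by every upstream node and treats the minimum of those as its current time, so a window closes only once all upstream nodes have passed its end. The minimum rule is an assumption borrowed from watermark-style designs, not something the slides specify, and TimeTracker is a hypothetical name.

```python
from typing import Dict, Iterable, Optional


class TimeTracker:
    """Tracks the latest timestamp reported by each upstream node."""

    def __init__(self, upstream_ids: Iterable[str]) -> None:
        self.latest: Dict[str, Optional[int]] = {uid: None for uid in upstream_ids}

    def report(self, upstream_id: str, ts: int) -> None:
        """Record a timestamp from an upstream node; time never moves backwards."""
        prev = self.latest[upstream_id]
        self.latest[upstream_id] = ts if prev is None else max(prev, ts)

    def current_time(self) -> Optional[int]:
        """Safe 'now': the minimum over upstream nodes, or None until all have reported."""
        if any(ts is None for ts in self.latest.values()):
            return None
        return min(ts for ts in self.latest.values() if ts is not None)


def window_can_close(tracker: TimeTracker, window_end: int) -> bool:
    """A window may fire only when every upstream node has passed its end."""
    now = tracker.current_time()
    return now is not None and now >= window_end


# Hypothetical run: two upstream nodes feeding an hourly window (times in ms).
tracker = TimeTracker(["a", "b"])
hour_end = 3_600_000
tracker.report("a", 3_600_500)
before_b = window_can_close(tracker, hour_end)      # "b" has not reported yet
tracker.report("b", 3_599_000)
b_behind = window_can_close(tracker, hour_end)      # "b" is still inside the hour
tracker.report("b", 3_600_100)
all_passed = window_can_close(tracker, hour_end)    # both nodes are past the hour
```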

Putting Everything Together

WSO2 CEP & Big Data Platform

CEP = SQL for Realtime Analytics

- Easy to follow from SQL
- Expressive, short, sweet, and fast!
- Defines core operations that cover 90% of problems
- Lets experts dig in when they like!

And it scales!

Questions?

Visit us at Booth 1025
http://wso2.com/landing/strata-hadoop-world-ca-2015/