Introduction to Large Scale Data Analysis and WSO2 Analytics Platform Srinath Perera Director Research WSO2, Apache Member (@srinath_perera) [email protected] At Indiana University Bloomington

Introduction to Large Scale Data Analysis with WSO2 Analytics Platform


Page 1: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Introduction to Large Scale Data Analysis and WSO2 Analytics Platform

Srinath Perera
Director Research WSO2, Apache Member (@srinath_perera) [email protected]
At Indiana University Bloomington

Page 2: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Who Are We?

We are an open source middleware company
- We build systems upon which others build their systems
Venture funded: Intel Capital, Cisco, Toba Capital
400+ people, with offices in Silicon Valley, Sri Lanka, London, and Bloomington
Customers include banks, aircraft manufacturers, governments (state and federal), media companies, telcos, retail, healthcare, ...

Page 3: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Outline

Introduction to Big Data
The problem we are trying to solve
WSO2 Big Data Platform
Next steps

Page 4: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

A Day in Your Life

Think about a day in your life:
- What is the best road to take?
- Will there be any bad weather?
- How should I invest my money?
- How is my health?

There are many decisions that you could make better if only you could access the data and process it.

http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)

Page 5: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Page 6: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Internet of Things

Currently the physical world and the software world are detached
The Internet of Things promises to bridge this
- It is about sensors and actuators everywhere
- In your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
- Umbrellas that light up when there is rain and medicine cups

Page 7: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

What Can We Do with Big Data?

Optimize (the world is inefficient)
- 30% of food is wasted from farm to plate
- GE Save 1% initiative (http://goo.gl/eYC0QE)
  - Trains => 2B/year
  - US healthcare => 20B/year

Save lives
- Weather, disease identification, personalized treatment

Technology advancement
- Most high-tech research is done via simulations

Page 8: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Big Data Architecture

Page 9: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Big data Processing Technologies Landscape

Page 10: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

(Batch) Analytics

Scientists have been doing this for 25 years with MPI (1991) on special hardware
- OpenMPI is being developed at IU!

It took off with Google's MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem followed. It was successful, so we are here!

But processing takes time.

Page 11: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Usecase: Targeted Advertising

Analytics implemented with MapReduce or queries
- Min, max, average, correlation, histograms; might join or group data in many ways
- Heatmaps, temporal trends

Key Performance Indicators (KPIs)
- E.g. profit per square foot for retail
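
As a rough illustration of the aggregates and the KPI listed above, the following Python sketch computes min, max, average, and profit per square foot. The store names and figures are made up for illustration; they are not from the talk.

```python
from statistics import mean

# Hypothetical per-store records: (store, profit in dollars, floor area in square feet)
stores = [
    ("A", 120000.0, 2000.0),
    ("B", 90000.0, 1000.0),
    ("C", 150000.0, 3000.0),
]

def profit_per_sqft(records):
    """Return {store: profit / area}, a simple retail KPI."""
    return {name: profit / area for name, profit, area in records}

kpi = profit_per_sqft(stores)

# Basic aggregates over the same records
profits = [profit for _, profit, _ in stores]
summary = {"min": min(profits), "max": max(profits), "avg": mean(profits)}
```

At scale the same aggregation would run as a MapReduce job or a query rather than in a single process.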

Page 12: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Usecase: Big Data for Development

Done using CDR (call detail record) data
People density, noon vs. midnight (red => increased, blue => decreased)

Urban planning
- People distribution
- Mobility
- Waste management
- E.g. see http://goo.gl/jPujmM

From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/

Page 13: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Value of Some Insights Degrades Fast!

For some usecases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time.
- E.g. stock markets and speed of light

We need technology that can produce outputs fast
- Static queries, but need very fast output (alerts, realtime control)
- Dynamic and interactive queries (data exploration)

Page 14: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Page 15: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Predictive Analytics

If we know how to solve a problem, that is, if we know a finite set of rules, then we can program it.
For some problems (e.g. driving a car, character recognition), we do not know a finite, fixed rule set.
Instead of programming, we give lots of examples and ask the computer to learn (often called Machine Learning).

Lots of tools
- R (statistical language)
- scikit-learn (Python)
- Apache Spark's MLBase and Apache Mahout (Java)

Page 16: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Usecase: Predictive Maintenance

The idea is to fix the problem before it happens, avoiding expensive downtime
- Airplanes, turbines, windmills
- Construction equipment
- Cars, golf carts

How
- Build a model for normal operation and compare deviation
- Match against known error patterns
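
One simple way to read "build a model for normal operation and compare deviation" is a mean and standard deviation fitted on healthy readings, with anything far from the mean flagged. The sketch below assumes that reading; the threshold of three standard deviations and the sample values are illustrative choices, not from the talk.

```python
from statistics import mean, stdev

def fit_normal_model(readings):
    """Summarize normal operation as (mean, sample standard deviation)."""
    return mean(readings), stdev(readings)

def is_anomalous(value, model, threshold=3.0):
    """Flag a reading that deviates more than `threshold` standard deviations."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

# Hypothetical vibration readings from a healthy machine
normal = [10.0, 10.2, 9.8, 10.1, 9.9]
model = fit_normal_model(normal)

small_drift = is_anomalous(10.3, model)  # within normal variation
large_jump = is_anomalous(12.0, model)   # far outside normal variation
```

Real deployments would likely use richer models, but the compare-against-normal structure stays the same.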

Page 17: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Problem We Are Trying to Solve!

Build a platform using which others can build their analytics systems
- Collect, Analyze, Communicate
- End to end: starts from humans and ends with humans

Different audiences
- Technical (developers)
- Non-technical (CXOs, sales, analysts)

"There are two things you need to know about business: make something users love and make more than you spend."
-- Paul Graham (Lisp, Y Combinator)

Page 18: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Page 19: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Running Example

Monitor temperature and hot airflow across multiple buildings (e.g. central AC)
- More people => hot

Analytics
- Historical behavior of temperature by the hour
- Alerts if the temperature falls too low or rises too high
- Modeling and predicting temperature to adjust proactively

    define stream TemperatureStream(ts long, buildingNo long, t double);
    define stream AirflowStream(ts long, buildingNo long, aflow double, aT);

Page 20: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Collect Data

One Sensor API to publish events
- REST, Thrift, Java, JMS, Kafka
- Java clients, JavaScript clients*

First you define streams (think of a stream as an infinite table in a SQL DB)

Then send events via the API*

Challenges (performance, guaranteed delivery, scale)

Can send to the batch pipeline, the realtime pipeline, or both via configuration!

Page 21: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Collecting Data: Example

Java example: create and send events
Events are sent asynchronously
See the client given in http://goo.gl/vIJzqc for more info

    // Initialize stream
    Agent agent = new Agent(agentConfiguration);
    publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );

    // Define stream
    StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
    definition.addPayloadData("sid", STRING);
    ...
    publisher.addStreamDefinition(definition);
    ...

    // Send events
    Event event = new Event();
    event.setPayloadData(eventData);
    publisher.publish(STREAM_NAME, VERSION, event);

Page 22: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Batch Analytics: Spark

Two frameworks: Hadoop (http://hadoop.apache.org) and Spark (https://spark.apache.org)
- Hadoop is a MapReduce implementation

Spark is faster (30X and ) and much more flexible.
It set a record in the Gray Sort benchmark (100TB): 3X faster with 10X fewer machines, http://goo.gl/r5LGvD
For Hadoop and MapReduce resources, Google it.

    file = spark.textFile("hdfs://...")
    file.flatMap(tsToHourFunction) \
        .reduceByKey(lambda a, b: a+b)
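
To make the flatMap / reduceByKey shape of the snippet above concrete without a cluster, here is a plain-Python sketch of the same two steps. It assumes each input line is "ts,buildingNo,t" with ts in seconds since the Unix epoch, and that tsToHourFunction emits (hour, 1) pairs; both are guesses, since the slide does not define them. `ts_to_hour` and `reduce_by_key` are hypothetical helpers, not Spark APIs.

```python
from datetime import datetime, timezone

def ts_to_hour(line):
    """Map step: turn a 'ts,buildingNo,t' line into a list of (hour, 1) pairs."""
    ts, _building, _t = line.split(",")
    hour = datetime.fromtimestamp(int(ts), tz=timezone.utc).hour
    return [(hour, 1)]

def reduce_by_key(pairs, fn):
    """Reduce step: combine all values that share a key using fn."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result

lines = ["0,1,21.5", "1800,1,22.0", "3600,2,23.0"]  # made-up sample input

# flatMap: apply the map function and flatten the resulting lists
pairs = [pair for line in lines for pair in ts_to_hour(line)]
events_per_hour = reduce_by_key(pairs, lambda a, b: a + b)
```

Spark runs the same two steps in parallel across partitions of the input; the single-process version only shows the data flow.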

Page 23: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

SQL-like Queries: Hive

Apache Hive provides a SQL-like data processing language
Since many people understand SQL, Hive made large scale data processing (Big Data) accessible to many
Expressive, short, and sweet
Defines core operations that cover 90% of problems
Lets experts dig in when they like (via User Defined Functions)!

Page 24: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Hourly Temperature Average

Hive compiles the SQL-like query to a set of MapReduce jobs running in Hadoop or Spark (in WSO2 BAM from the 2015 Q2 release)

    insert overwrite table TemperatureHistory
    select getHour(ts) as hour, avg(t) as avgT, buildingNo
    from TemperatureStream
    group by buildingNo, getHour(ts);
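
What the query above computes can be sketched in plain Python as a group-by on (building, hour) followed by an average. The sketch assumes ts is in seconds since the Unix epoch and that getHour returns the UTC hour of day; neither is stated on the slide.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_average(events):
    """Group (ts, building_no, t) events by (building_no, hour) and average t."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for ts, building_no, t in events:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        acc = sums[(building_no, hour)]
        acc[0] += t
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Made-up sample events: (ts, building number, temperature)
events = [(0, 1, 20.0), (1800, 1, 22.0), (3600, 1, 25.0), (60, 2, 18.0)]
history = hourly_average(events)
```

Hive performs the same grouping, but splits it into map and reduce stages so it can run over data that does not fit on one machine.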

Page 25: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Complex Event Processing

Page 26: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Operators: Filters

Assume a temperature stream
Here weather:convertFtoC() is a user defined function. User defined functions are used to extend the language.

    define stream TemperatureStream(ts long, roomNo int, temp double);

    from TemperatureStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
    select roomNo, temp
    insert into HotRoomsStream;

Usecases
- Alerts, thresholds (e.g. alarm on high temperature)
- Preprocessing: filtering, transformations (e.g. data cleanup)
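
The filter above behaves like a predicate applied to each event as it arrives. A minimal Python sketch of the same condition follows; `convert_f_to_c` is a stand-in for the weather:convertFtoC() user defined function, and the sample events (including the roomNo field on each) are made up.

```python
def convert_f_to_c(temp_f):
    """Stand-in for the weather:convertFtoC() user defined function."""
    return (temp_f - 32.0) * 5.0 / 9.0

def hot_rooms(events, limit_c=30.0, excluded_room=2043):
    """Yield (roomNo, temp) for events that pass the filter condition."""
    for event in events:
        if convert_f_to_c(event["temp"]) > limit_c and event["roomNo"] != excluded_room:
            yield event["roomNo"], event["temp"]

stream = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},    # 35 C, passes
    {"ts": 2, "roomNo": 102, "temp": 68.0},    # 20 C, too cool
    {"ts": 3, "roomNo": 2043, "temp": 104.0},  # hot, but excluded room
]
hot = list(hot_rooms(stream))
```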

Page 27: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Operators: Windows and Aggregation

Supports many window types
- Batch windows, sliding windows, custom windows

Usecases
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)

    from TemperatureStream#window.time(1 min)
    select roomNo, avg(temp) as avgTemp
    insert into HotRoomsStream;
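
A sliding time window keeps only the events from the last N seconds and recomputes the aggregate as events arrive and expire. The Python sketch below shows one way to do that for an average; it ignores roomNo for brevity and assumes events arrive in timestamp order. It illustrates the idea only and is not how the CEP engine implements windows.

```python
from collections import deque

class SlidingTimeWindowAvg:
    """Average of the values seen in the last `width` seconds."""

    def __init__(self, width):
        self.width = width
        self.events = deque()  # (ts, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        """Add an event, expire old ones, and return the current average."""
        self.events.append((ts, value))
        self.total += value
        # Drop events that fell out of the window ending at ts
        while self.events and self.events[0][0] <= ts - self.width:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)

window = SlidingTimeWindowAvg(width=60)
averages = [window.add(ts, temp) for ts, temp in [(0, 20.0), (30, 22.0), (70, 24.0)]]
```

Keeping a running total means each event costs constant amortized work, which matters at high event rates.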

Page 28: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Operators: Patterns

Models a "followed by" relation: e.g. event A followed by event B

A very powerful tool for tracking and detecting patterns

    from every (a1 = TemperatureStream) -> a2 = TemperatureStream[temp > a1.temp + 5]
    within 1 day
    select a2.ts as ts, a2.temp - a1.temp as diff
    insert into HotDayAlertStream;

Usecases
- Detecting event sequence patterns
- Tracking
- Detecting trends
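
The pattern query above can be approximated in Python by keeping a list of pending "first" events and completing each one when a later, sufficiently hotter event arrives in time. This is a simplified reading of the "followed by" semantics, with ts assumed to be in seconds; the real engine's matching rules may differ in detail.

```python
def detect_rises(events, delta=5.0, within=86400):
    """Emit (ts, diff) when a reading exceeds an earlier one by more than delta.

    Every event starts a pending match. A pending match completes on the first
    later event that is hot enough, and is dropped once older than `within`.
    """
    pending = []  # earlier (ts, temp) events still waiting for a follow-up
    alerts = []
    for ts, temp in events:
        still_pending = []
        for start_ts, start_temp in pending:
            if ts - start_ts > within:
                continue  # expired without a match
            if temp > start_temp + delta:
                alerts.append((ts, temp - start_temp))
            else:
                still_pending.append((start_ts, start_temp))
        still_pending.append((ts, temp))  # this event may start a new match
        pending = still_pending
    return alerts

# Made-up readings: (ts in seconds, temperature)
readings = [(0, 20.0), (3600, 22.0), (7200, 28.0), (200000, 40.0)]
alerts = detect_rises(readings)
```

The last reading produces no alert because the only pending event is more than one day old by then.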

Page 29: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Operators: Joins

Join two data streams based on a condition and windows

Usecases
- Data correlation, detecting missing events, detecting erroneous data
- Joining event streams

    from TemperatureStream[temp > 30.0]#window.time(1 min) as T
      join RegulatorStream[isOn == false]#window.length(1) as R
      on T.roomNo == R.roomNo
    select T.roomNo, R.deviceID, 'start' as action
    insert into RegulatorActionStream;
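
A simplified version of this join can be sketched in Python: each hot temperature event is matched against the most recent regulator event for the same room, and an action is emitted if that regulator is off. The sketch leaves out the one-minute time window and treats both streams as one merged, ordered list with a hypothetical "type" field, so it shows the join condition rather than the engine's exact window behavior.

```python
def regulator_actions(events, limit=30.0):
    """Join hot temperature events with the latest regulator event per room."""
    last_regulator = {}  # roomNo -> most recent regulator event
    actions = []
    for event in events:
        if event["type"] == "regulator":
            last_regulator[event["roomNo"]] = event
        elif event["type"] == "temperature" and event["temp"] > limit:
            regulator = last_regulator.get(event["roomNo"])
            if regulator is not None and not regulator["isOn"]:
                actions.append((event["roomNo"], regulator["deviceID"], "start"))
    return actions

# Made-up merged input from both streams, in arrival order
events = [
    {"type": "regulator", "roomNo": 1, "deviceID": "r1", "isOn": False},
    {"type": "regulator", "roomNo": 2, "deviceID": "r2", "isOn": True},
    {"type": "temperature", "roomNo": 1, "temp": 32.0},  # hot, regulator off
    {"type": "temperature", "roomNo": 2, "temp": 35.0},  # hot, regulator already on
    {"type": "temperature", "roomNo": 1, "temp": 25.0},  # not hot
]
actions = regulator_actions(events)
```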

Page 30: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Operators: Access Data from the Disk

Event tables allow users to map a database to a window and join a data stream with the window

Usecases
- Merge with data in a database, collect, update data conditionally

    define table HistTempTable(day long, avgT double);

    from TemperatureStream#window.length(1) join HistTempTable
      on getDayOfYear(ts) == HistTempTable.day and temp > avgT
    select ts, temp
    insert into PurchaseUserStream;
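
Joining a stream with an event table amounts to a lookup per event. The Python sketch below uses a dict as a stand-in for HistTempTable and assumes the intent is to keep readings whose temperature is above the historical average for their day of year; ts is assumed to be seconds since the Unix epoch, and the table contents are made up.

```python
from datetime import datetime, timezone

# Stand-in for HistTempTable: day of year -> historical average temperature
hist_temp_table = {1: 20.0, 2: 21.0}

def above_historical(events, table):
    """Yield (ts, temp) for readings above the historical average of their day."""
    for ts, temp in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).timetuple().tm_yday
        avg = table.get(day)
        if avg is not None and temp > avg:
            yield ts, temp

# Made-up readings: (ts, temperature)
events = [(0, 25.0), (3600, 19.0), (86400, 22.0), (172800, 30.0)]
matches = list(above_historical(events, hist_temp_table))
```

The last reading is dropped because its day has no row in the table, which mirrors an inner join.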

Page 31: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Realtime Analytics Patterns

- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
- Preprocessing: filtering, transformations (e.g. data cleanup)
- Alerts, thresholds (e.g. alarm on high temperature)
- Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
- Joining event streams (e.g. detecting a hit on a soccer ball)
- Merge with data in a database, collect, update data conditionally

Page 32: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Realtime Analytics Patterns (contd.)

- Detecting event sequence patterns (e.g. small transaction followed by large transaction)
- Tracking: follow some related entity's state in space, time, etc. (e.g. location of airline baggage, vehicles, tracking wildlife)
- Detecting trends: rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
- Learning a model (e.g. predictive maintenance)
- Predicting the next value and corrective actions (e.g. automated car)

Page 33: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Predictive Analytics

Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)
Build models using R, export them as PMML, and use them within WSO2 CEP
Call R scripts from CEP queries
Regression and anomaly detection operators in CEP

Page 34: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Predictive Analytics

WSO2 Machine Learner provides a wizard to explore data and build models

E.g. build a model to predict the temperature for the next 15 minutes
- Trivial option: (historical mean + last 15m mean)/2
- Better model via ARIMA from time series analysis

To know more, take an ML class
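
The trivial option above is a one-line formula. A direct Python rendering follows, with made-up sample temperatures.

```python
from statistics import mean

def predict_next(historical_temps, last_15m_temps):
    """Trivial option: average of the historical mean and the last-15-minute mean."""
    return (mean(historical_temps) + mean(last_15m_temps)) / 2

# Historical mean is 22.0 and the last-15-minute mean is 26.0
prediction = predict_next([20.0, 22.0, 24.0], [25.0, 27.0])
```

It weights long-term behavior and the recent trend equally, which is why a time series model such as ARIMA usually does better.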

Page 35: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Communicate: Dashboards

The idea is to give the "overall idea" at a glance (e.g. car dashboard)

Support for personalization: you can build your own dashboard.

Also the entry point for drill down

How to build?
- Dashboard via Google Gadget and content via HTML5 + JavaScript
- Use WSO2 User Engagement Server to build a dashboard (or a JSP or PHP)
- Use charting libraries like Vega or D3

Page 36: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform


Page 37: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Communicate: Alerts

Detecting conditions can be done via CEP queries

The key is the "last mile"
- Email
- SMS
- Push notifications to a UI
- Pager
- Trigger a physical alarm

How?
- Select the email sender "Output Adaptor" from CEP, or send from CEP to ESB; ESB has a lot of connectors

Page 38: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Communicate: APIs

With mobile apps, most data is exposed and shared as APIs (REST/JSON) to end users.

Following are some challenges
- Security and permissions
- API discovery
- Billing, throttling, quotas
- SLA enforcement

How?
- Write data to a database from CEP event tables
- Build services via WSO2 Data Service
- Expose them as APIs via API Manager

Page 39: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Smart Home

2015 yearly DEBS (Distributed Event Based Systems) Grand Challenge (http://goo.gl/0htxlj)
Smart home electricity data: 2000 sensors, 40 houses, 4 billion events
We posted 400K events/sec, and close to one million events/sec of distributed throughput with 4 nodes.
The WSO2 CEP based solution is one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London)
The only generic solution to become a finalist

Page 40: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Case Study: Realtime Soccer Analysis

Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM

Page 41: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Case Study: TFL Traffic Analysis

Built using TFL (Transport for London) open data feeds.

http://goo.gl/04tX6k
http://goo.gl/9xNiCm

Page 42: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

WSO2 Big Data Analytics Platform

Page 43: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Conclusion

Goal: build a platform using which others can build their analytics systems
- End to end: starts from humans and ends with humans

The whole platform is open source under the Apache License

What can you do with the platform?
- Solve hard problems and build great apps with the platform
- Add and contribute extensions to the platform (e.g. GSoC http://goo.gl/QNFP6Y)
- Fix problems (patches)

Find us on the [email protected] list or on Stack Overflow (tag wso2)

Page 44: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Questions?