ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from Analytics to Predictions

e.g. Targeted Marketing

• Assume a mass email to 1M people with a reaction rate of 1% and a cost of $2 per email => cost of $2M and a reach of 10k people.

• Say that by looking at demographics (e.g. where people live, using decision tables) you can find 250K people with a reaction rate of 6% => cost of $500K and a reach of 15k people.
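
The arithmetic behind the two campaigns can be checked with a small sketch. The function name `campaign` and its parameters are illustrative only; the numbers are the ones on the slide. It also derives the cost per responder, which works out to $200 for the mass email and roughly $33 for the targeted one.

```python
def campaign(recipients, reaction_pct, cost_per_email):
    """Return (total cost, expected responders, cost per responder)."""
    cost = recipients * cost_per_email
    reach = recipients * reaction_pct / 100
    return cost, reach, cost / reach

# Mass email: 1M people, 1% reaction rate, $2 per email.
mass_cost, mass_reach, mass_cost_per_responder = campaign(1_000_000, 1, 2)

# Targeted email: 250K people, 6% reaction rate, $2 per email.
tgt_cost, tgt_reach, tgt_cost_per_responder = campaign(250_000, 6, 2)
```

Targeting reaches more people for a quarter of the cost, because the spend is concentrated on the group most likely to react.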

A day in your Life

Think about a day in your life:

– What is the best road to take?

– Would there be any bad weather?

– How to invest my money?

– How is my health?

There are many decisions you could make better if only you could access the data and process it.

Photo: http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)

Internet of Things

• Currently the physical world and the software world are detached

• The Internet of Things promises to bridge this

– It is about sensors and actuators everywhere

– In your fridge, in your blanket, in your chair, in your carpet... Yes, even in your socks

– Umbrellas that light up when there is rain, and medicine cups

What can We do with Big Data?

• Optimize (World is inefficient)

– 30% food wasted farm to plate

– GE Save 1% initiative (http://goo.gl/eYC0QE)

• In trains => 2B/year

• US healthcare => 20B/year

• In contrast, Sri Lanka's total exports are 9B/year.

• Save lives

– Weather, disease identification, personalized treatment

• Technology advancement

– Most high-tech research is done via simulations


Big Data Architecture

Big data Processing Technologies Landscape

Hindsight: Batch Processing

• Programming model is MapReduce

– Apache Hadoop

– Spark

• Lots of tools built on top

– Hive, Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)

• Store and process

• Slow (> 5 minutes for results for a reasonable use case)
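
As a rough illustration of the MapReduce programming model, here is a single-process word count in plain Python. It is only a sketch of the map, shuffle and reduce steps; the function names are made up for this example and it does not use the Hadoop or Spark APIs, which distribute the same steps across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values of one key into a single result.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "Big deal"])
```

Because map works on one record at a time and reduce works on one key at a time, both steps can run in parallel on different nodes.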

Usecase: Targeted Advertising

• Analytics implemented with MapReduce or queries

– Min, max, average, correlation, histograms

– Might join or group data in many ways

– Heatmaps, temporal trends

• Key performance indicators (KPIs)

– Average time for a ticket in customer service interactions

– Profit per square foot for retail
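
A KPI such as the average ticket time is a group-and-aggregate query. The sketch below shows the idea in plain Python; the `tickets` records, the field names and the helper `average_by_group` are invented for illustration and are not data from the talk.

```python
from collections import defaultdict
from statistics import mean

def average_by_group(records, key, value):
    """Group records by `key` and average the `value` field per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical customer service tickets with handling time in minutes.
tickets = [
    {"team": "billing", "minutes": 30},
    {"team": "billing", "minutes": 50},
    {"team": "support", "minutes": 20},
]

kpi = average_by_group(tickets, "team", "minutes")
longest = max(ticket["minutes"] for ticket in tickets)
shortest = min(ticket["minutes"] for ticket in tickets)
```

In a batch system the same grouping would be written as a MapReduce job or as a SQL-style query with GROUP BY.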

Real-time Analytics

• Idea is to process data as it is received, in streaming fashion (without storing)

• Used when we need

– Very fast output (milliseconds)

– Lots of events (few 100k to millions)

• Two main technologies

– Stream Processing (e.g. Apache Storm, http://storm-project.net/)

– Complex Event Processing (CEP), e.g. WSO2 CEP

define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
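
What the query above computes (a one-minute sliding average of speed, kept separately per player) can be mimicked in plain Python. This is a sketch only: the class `SlidingAverage` is hypothetical, it assumes events arrive in timestamp order with timestamps in seconds, and it says nothing about how WSO2 CEP implements windows internally.

```python
from collections import defaultdict, deque

class SlidingAverage:
    """Per-key average over a sliding time window (timestamps in seconds)."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)   # pid -> deque of (timestamp, speed)
        self.totals = defaultdict(float)   # pid -> sum of speeds in the window

    def on_event(self, pid, timestamp, speed):
        queue = self.events[pid]
        queue.append((timestamp, speed))
        self.totals[pid] += speed
        # Expire events that have fallen out of the window.
        while queue[0][0] <= timestamp - self.window:
            _, old_speed = queue.popleft()
            self.totals[pid] -= old_speed
        return pid, self.totals[pid] / len(queue)

stream = SlidingAverage(window_seconds=60.0)
first = stream.on_event("p1", 0.0, 10.0)
second = stream.on_event("p1", 30.0, 20.0)
third = stream.on_event("p1", 70.0, 30.0)   # the event at t=0 has expired
other = stream.on_event("p2", 70.0, 5.0)    # separate partition
```

Each event is processed once as it arrives and only the current window is kept in memory, which is what makes millisecond-level output possible.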


Usecase: DEBS 2013, Football Game

Sketch Algorithms

• Data structures that can count millions of entries with a few KBs

– Provide approximate answers

– E.g. Count-Min Sketch, Bloom Filters

• Use Cases

– Counting items

– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements

– Graph Summaries

– Linear algebraic problems such as approximating matrix products, least squares approximation and SVD

See https://sites.google.com/site/algoresearch/datastreamalgorithms
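
A minimal Count-Min Sketch, as one example of the sketch data structures listed above. The class below is an illustrative implementation, not taken from any library; the `width` and `depth` defaults and the use of SHA-256 to derive row hashes are assumptions made for simplicity.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._indexes(item))

sketch = CountMinSketch()
sketch.add("a", 1000)
sketch.add("b", 3)
```

The table size is fixed at `width * depth` counters no matter how many distinct items are added, which is where the memory saving comes from; the price is that an estimate can be too high when items collide.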


Curious Case of Missing Data

http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/

• WW II: data on where returned aircraft were hit

• Where would you add armour?


Challenges: Causality

• Correlation does not imply causality! (see the "send a book home" example [1])

• Causality

– Repeat the experiment with identical tests

– If you can't, do a randomized test (A/B test)

– With big data we cannot do either

• Option 1: We can act on a correlation if we can verify the guess or if correctness is not critical (start an investigation, check for a disease, marketing)

• Option 2: We verify correlations using A/B testing or propensity analysis

[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/

[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
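
One common way to read the result of an A/B test is a two-proportion z-test. The sketch below is a textbook version under the usual normal approximation; the function name and the sample numbers are invented for illustration and do not come from the talk.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: group A converts 100/1000, group B 150/1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

Because users are assigned to A or B at random, a small p-value supports a causal effect of the change, which a correlation found in logged data alone cannot do.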


Insight (Understanding Why?)

• Pattern Mining – find frequent associations (e.g. Market Basket), frequent sequences

• Clustering

• Graph Analysis

• Knowledge Discovery

• Correlations between features and finding principal components

• Simulations, Complex System modeling, matching a statistical distribution
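
For the pattern mining item above, the simplest form of market basket analysis counts how often pairs of items are bought together. This is a brute-force sketch with invented baskets; real systems use algorithms such as Apriori or FP-Growth to avoid enumerating every pair.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
```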

Usecase: Big Data for development in SL?

• Done using CDR (call detail record) data

• People density 1pm vs midnight (red => increased, blue => decreased)

• Urban planning

– People distribution

– Mobility

– Waste management

– E.g. see http://goo.gl/jPujmM

From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/


Foresight (Predict)

• Build a model

– Weather, economic models

• Predict future values

– Electricity load, traffic, demand, sales

• Classification

– Spam detection, grouping users, sentiment analysis

• Find anomalies

– Fraud, predictive maintenance

• Recommendations

– Targeted advertising, product recommendations

Usecase: Predictive Maintenance

• Idea is to fix the problem before it breaks, avoiding expensive downtime

– Airplanes, turbines, windmills

– Construction equipment

– Cars, golf carts

• How

– Build a model for normal operation and compare deviations

– Match against known error patterns
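
A very simple version of "build a model for normal operation and compare deviations": summarize healthy sensor readings by their mean and standard deviation, then flag readings that fall too far away. The readings, the function names and the threshold of 3 standard deviations are assumptions for illustration; real equipment models are usually far richer.

```python
from statistics import mean, stdev

def fit_normal_model(readings):
    """Summarize normal operation as (mean, standard deviation)."""
    return mean(readings), stdev(readings)

def is_anomalous(value, model, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from normal."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

# Hypothetical temperature readings from a healthy machine.
normal_model = fit_normal_model([70.0, 71.0, 69.0, 70.5, 69.5])

ok = is_anomalous(70.8, normal_model)
alarm = is_anomalous(80.0, normal_model)
```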

Challenges: Selecting the best Algorithm for a Problem

• Types of data: categorical (C), numerical (N)

– N -> N: Regression

– C -> C: Decision trees

– N -> C: SVM

• Amount of data

• Required accuracy

• Required interpretability

• Kind of underlying function

See Skytree: Choosing The Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA

Challenges: Feature Engineering

• In ML, feature engineering is the key [1].

• You need features to form a kernel. Then you can solve with less data.

• Deep learning can learn the best features (or feature combinations) via semi-supervised or unsupervised learning [2]

[1] Bekkerman's talk, https://www.youtube.com/watch?v=wjTJVhmu1JM

[2] Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/


Challenges: Taking Decisions (Context)

Challenges: Updating Models

● Incorporate more data

o We get more data over time

o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)

o Trends change

● Track and update the model

o Generate models in batch mode and update

o Streaming (online) ML, which is an active research topic
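
To make the online option concrete, here is a tiny linear model that is updated one example at a time with stochastic gradient descent, so it never needs to retrain on the full history. The class name, the learning rate and the toy data (y = 2x + 1) are all assumptions for illustration.

```python
class OnlineLinearRegression:
    """Fits y = w * x + b incrementally, one observation per update."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # One gradient step on the squared error of this single example.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error
        return error

model = OnlineLinearRegression()
# Simulated stream: the same five points arriving over and over.
for _ in range(2000):
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        model.update(x, 2.0 * x + 1.0)
```

If the trend changes, later examples pull `w` and `b` toward the new relationship without any batch regeneration step.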

Challenges: Scaling ML Algorithms

• With more data we can

– Build more accurate and detailed models [1]

• Scale => distributed systems

• Need to build new algorithms, adapt existing ones, or use other methods

– Sampling

– Scalable versions of algorithms (e.g. Decision Trees, NN)

[1] P Domingos, A Few Useful Things to Know about Machine Learning

Challenges: Lack of Labeled Data

• Most data is not labeled

• Idea of semi-supervised learning

• Provide data + examples + ontology, and the algorithm finds new patterns

– Lots of data

– Few example sentences

• Often uses the Expectation Maximization (EM) algorithm

Watch Tom Mitchell's lecture: https://www.youtube.com/watch?v=psFnHkIjHA0

Ontology: People, Cities

Relationships: like, dislike, live in

Examples: Bob (People) lives in Colombo (City)

Outline

Data & Analytics
