srinath-perera
e.g. Targeted Marketing
• Assume mass emails to 1M people, with a reaction rate of 1% and a cost of $2 per email => cost of $2M and a reach of 10K people.
• Let's say that by looking at demographics (e.g. where people live, using decision tables) you can find 250K people with a reaction rate of 6% => cost of $500K and a reach of 15K people.
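The arithmetic behind the two campaigns can be checked with a short Python sketch. The function name `campaign` and the derived "cost per responder" figure are illustrative additions, not part of the slides; the inputs are the numbers quoted above.

```python
def campaign(recipients, reaction_rate, cost_per_email=2.0):
    """Return (total cost, expected responders, cost per responder)."""
    cost = recipients * cost_per_email
    reach = recipients * reaction_rate
    return cost, reach, cost / reach

# Mass mailing: 1M people, 1% reaction rate, $2 per email.
mass_cost, mass_reach, mass_cpr = campaign(1_000_000, 0.01)

# Targeted mailing: 250K people, 6% reaction rate, $2 per email.
tgt_cost, tgt_reach, tgt_cpr = campaign(250_000, 0.06)

print(f"mass:     cost=${mass_cost:,.0f} reach={mass_reach:,.0f} per responder=${mass_cpr:,.2f}")
print(f"targeted: cost=${tgt_cost:,.0f} reach={tgt_reach:,.0f} per responder=${tgt_cpr:,.2f}")
```

Under these numbers the targeted campaign costs a quarter as much and still reaches more people, so the cost per responder drops by a factor of six.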
A day in your Life
Think about a day in your life:
– What is the best road to take?
– Will there be bad weather?
– How should I invest my money?
– How is my health?
There are many decisions you could make better if only you could access the relevant data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this
– It is about sensors and actuators everywhere
– In your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
– Umbrellas that light up when there is rain, and medicine cups
What can We do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE Save 1% initiative (http://goo.gl/eYC0QE)
• in trains => 2B/year
• US healthcare => 20B/year
• In contrast, Sri Lanka's total exports are 9B/year.
• Save lives
– Weather, disease identification, personalized treatment
• Technology advancement
– Most high-tech research is done via simulations
Big Data Architecture
Big data Processing Technologies Landscape
Hindsight: Batch Processing
• Programming model is MapReduce
– Apache Hadoop
– Spark
• Lots of tools are built on top
– Hive and Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
• Store and process
• Slow (> 5 minutes to get results for a reasonable use case)
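To make the MapReduce programming model concrete, here is a minimal single-process sketch in Python. It only imitates the map, shuffle, and reduce phases in memory; it is not how Hadoop or Spark are invoked, and all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count expressed as a mapper and a reducer.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is where most of the batch latency comes from.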
Usecase: Targeted Advertising
• Analytics implemented with MapReduce or queries
– Min, max, average, correlation, histograms
– Might join or group data in many ways
– Heatmaps, temporal trends
• Key Performance Indicators (KPIs)
– Average time for a ticket in customer service interactions
– Profit per square foot for retail
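As a small illustration of such a KPI, the sketch below groups support tickets by category and computes min, max, and average resolution time. The ticket data and the function name `kpi_by_group` are made up for the example; a real deployment would express the same group-and-aggregate step as a MapReduce job or a SQL-style query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, minutes to resolve) records.
tickets = [
    ("billing", 30), ("billing", 50), ("network", 120),
    ("network", 80), ("billing", 40),
]

def kpi_by_group(rows):
    """Group rows by key and compute min, max and average of the values."""
    groups = defaultdict(list)
    for key, minutes in rows:
        groups[key].append(minutes)
    return {
        key: {"min": min(values), "max": max(values), "avg": mean(values)}
        for key, values in groups.items()
    }

kpis = kpi_by_group(tickets)
for category, stats in kpis.items():
    print(category, stats)
```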
Real-time Analytics
• Idea is to process data in a streaming fashion as it is received (without storing it)
• Used when we need
– Very fast output (milliseconds)
– Lots of events (a few 100K to millions)
• Two main technologies
– Stream Processing (e.g. Apache Storm, http://storm-project.net/)
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
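For readers unfamiliar with CEP query languages, the query above can be read as "for each player, keep the events of the last minute and emit their average speed". A rough Python analogue is sketched below. It assumes events arrive in timestamp order as (timestamp in seconds, pid, speed) values, and the class name `WindowedAverage` is hypothetical; a CEP engine handles this, plus expiry timers and distribution, for you.

```python
from collections import defaultdict, deque

class WindowedAverage:
    """Per-key average over a sliding time window (event-time based)."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)   # pid -> deque of (timestamp, speed)
        self.sums = defaultdict(float)     # pid -> running sum of speeds in window

    def on_event(self, timestamp, pid, speed):
        """Add one event and return (pid, average speed over the window)."""
        queue = self.events[pid]
        queue.append((timestamp, speed))
        self.sums[pid] += speed
        # Evict events that fell out of the window for this player.
        while queue[0][0] <= timestamp - self.window:
            _, old_speed = queue.popleft()
            self.sums[pid] -= old_speed
        return pid, self.sums[pid] / len(queue)

avg = WindowedAverage(window_seconds=60.0)
print(avg.on_event(0.0, "p1", 10.0))
print(avg.on_event(30.0, "p1", 20.0))
print(avg.on_event(70.0, "p1", 30.0))   # the event at t=0 has expired by now
```

Keeping a running sum per player means each event is processed in amortized constant time, which is what makes millisecond-level output feasible.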
Usecase: DEBS 2013, Football Game
Sketch Algorithms
• Data structures that can count millions of entries using a few KBs
– Provide approximate answers
– E.g. Count-Min Sketch, Bloom Filters
• Use cases
– Counting items
– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements
– Graph summaries
– Linear algebraic problems such as approximating matrix products, least squares approximation, and SVD
See https://sites.google.com/site/algoresearch/datastreamalgorithms
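As one example, a Count-Min Sketch can be written in a few lines of Python. This is a minimal sketch for illustration: the width, depth, and the use of salted BLAKE2 hashes are choices made here, not something the slides prescribe. Estimates never undercount; they may overcount when different items collide in every row.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter using depth x width integer counters."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent-looking hash per row, derived by salting BLAKE2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][col] for row, col in self._indexes(item))

sketch = CountMinSketch()
for word in ["apple"] * 5 + ["pear"] * 2:
    sketch.add(word)
print(sketch.estimate("apple"), sketch.estimate("pear"), sketch.estimate("kiwi"))
```

The memory used is fixed at width x depth counters regardless of how many distinct items are added, which is the trade made for approximate answers.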
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WW II: returned aircraft and data on where they were hit
• How would you add armour?
Challenges: Causality
• Correlation does not imply causality! (See the "send a book home" example [1])
• Establishing causality
– Repeat the experiment under identical conditions
– If you can't, do a randomized test (A/B test)
– With Big Data we often cannot do either
• Option 1: We can act on a correlation if we can verify the guess or if correctness is not critical (start an investigation, check for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
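One common way to evaluate an A/B test is a two-proportion z-test, sketched below in Python. The slides do not say which test to use, so treat this as one reasonable choice under the usual large-sample assumptions; the function name and the example counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: group A (control) vs. group B (treatment).
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

Because users are assigned to A or B at random, a small p-value here supports a causal effect of the treatment, which a correlation found in logged data alone cannot do.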
Insight (Understanding Why ?)
• Pattern Mining – find frequent associations (e.g. Market Basket), frequent sequences
• Clustering
• Graph Analysis
• Knowledge Discovery
• Correlations between features, and finding principal components
• Simulations, Complex System modeling, matching a statistical distribution
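The market basket idea from the pattern mining bullet can be sketched as counting how often pairs of items are bought together and keeping the pairs above a support threshold. This brute-force Python version is only for illustration (real miners such as Apriori prune the search space); the baskets and the function name are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```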
Usecase: Big Data for development in SL?
• Done using CDR (call detail record) data
• People density at 1pm vs. midnight (red => increased, blue => decreased)
• Urban planning
– People distribution
– Mobility
– Waste management
– E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Foresight (Predict)
• Build a model
– Weather, economic models
• Predict future values
– Electricity load, traffic, demand, sales
• Classification
– Spam detection, grouping users, sentiment analysis
• Find anomalies
– Fraud, predictive maintenance
• Recommendations
– Targeted advertising, product recommendations
Usecase: Predictive Maintenance
• Idea is to fix the problem before it breaks, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How
– Build a model of normal operation and compare deviations against it
– Match against known error patterns
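The "model of normal operation" approach can be sketched very simply: learn the mean and standard deviation of a sensor during healthy operation, then flag readings that deviate by more than a few standard deviations. Real systems use far richer models; the sensor values, the 3-sigma threshold, and the function names below are assumptions for illustration.

```python
from statistics import mean, stdev

def fit_normal_model(readings):
    """Summarize healthy operation as (mean, standard deviation)."""
    return mean(readings), stdev(readings)

def find_deviations(model, readings, threshold=3.0):
    """Return (index, value) for readings more than threshold sigmas from the mean."""
    mu, sigma = model
    return [
        (i, value) for i, value in enumerate(readings)
        if abs(value - mu) > threshold * sigma
    ]

# Hypothetical bearing temperatures (deg C) recorded while the machine was healthy.
healthy = [70, 71, 69, 70, 72, 68, 70, 71, 69, 70]
model = fit_normal_model(healthy)

# New readings from the field; the third one is suspicious.
alerts = find_deviations(model, [70, 71, 78, 69])
print(alerts)
```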
Challenges: Selecting the best Algorithm for a Problem
• Types of data: categorical (C), numerical (N)
– N -> N: Regression
– C -> C: Decision trees
– N -> C: SVM
• Amount of data
• Required accuracy
• Required interpretability
• Kind of underlying function
See Skytree: Choosing the Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA
Challenges: Feature Engineering
• In ML, feature engineering is the key [1].
• You need features to form a kernel. Then you can solve the problem with less data.
• Deep learning can learn the best features (or feature combinations) via semi-supervised or unsupervised learning [2]
[1] Bekkerman’s talk, https://www.youtube.com/watch?v=wjTJVhmu1JM
[2] Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
Challenges: Taking Decisions (Context)
Challenges: Updating Models
● Incorporate more data
o We get more data over time
o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)
o Trends change
● Track and update the model
o Generate models in batch mode and update
o Streaming (online) ML, which is an active research topic
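To show what "streaming (online) ML" means in the simplest case, the sketch below updates a one-variable linear model with one stochastic gradient step per incoming example, so the model never needs to be retrained from scratch. The class name, learning rate, and synthetic data are assumptions; production systems need more care (feature scaling, decaying learning rates, drift detection).

```python
class OnlineLinearModel:
    """y ~ w * x + b, updated one example at a time with SGD on squared error."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """Take one gradient step on a single (x, y) example; return the error."""
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error
        return error

model_online = OnlineLinearModel()

# Simulated stream drawn from the (noise-free) relation y = 2x + 1.
for _ in range(2000):
    for x in [0.0, 0.5, 1.0, 1.5, 2.0]:
        model_online.update(x, 2.0 * x + 1.0)

print(round(model_online.w, 3), round(model_online.b, 3))
```

If the underlying trend changes, the same `update` call keeps pulling the weights toward the new relation, which is the appeal over periodic batch retraining.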
Challenges: Scaling ML Algorithms
• With more data we can
– Build more accurate and detailed models [1]
• Scale => distributed systems
• Need to build new algorithms, adapt existing ones, or use other methods
– Sampling
– Scalable versions of algorithms (e.g. decision trees, NN)
[1] P Domingos, A Few Useful Things to Know about Machine Learning
Challenges: Lack of Labeled Data
• Most data is not labeled
• Idea of semi-supervised learning
• Provide data + examples + ontology, and the algorithm finds new patterns
– Lots of data
– A few example sentences
• Often uses the Expectation Maximization (EM) algorithm
Watch Tom Mitchell’s lecture: https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like, dislike, live in
Examples: Bob (People) lives in Colombo (City)
Outline