
Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)


DESCRIPTION

Right now, in institutions around the world, some of the greatest minds in computer science and statistics are coming up with amazing new algorithms and mathematically beautiful solutions. However, it's entirely possible that the solutions they conceive will be impracticable in industry. The reason is simple: "the best answer is useless if it arrives too late to do anything with it". The key principle here is the compromise between 'accuracy' and 'latency'. In this talk I will describe examples where this holds true, and how I am using real-time machine learning models to solve challenges in eCommerce, Financial Services and Media companies. http://tumra.com/blog/real-time-machine-learning-at-industrial-scale



Real-Time Machine Learning at Industrial scale

9th October 2012
Michael Cutler @cotdp

tumra.com @tumra

TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA

... the battle of accuracy vs latency


$ whoami

Michael Cutler (@cotdp)

● Previously at British Sky Broadcasting
  ○ Last 7 years in R&D
  ○ Created several patented systems & algorithms
  ○ Kicked off ‘Big Data’ initiative at Sky in 2008

● Co-founder CTO @ TUMRA in March '12
  ○ Real-time big data science platform
  ○ Alpha-testing with selected clients


Agenda

● Background
● Real-Time vs Batch processing
● Accuracy vs Latency
● Use Cases
  ○ eCommerce
  ○ Financial Services
  ○ Media
● Questions


Background

Big Data is "in vogue", but what does it mean?

● Distributed processing
● Massively scalable
● Commodity hardware

Apache Hadoop is the "kernel" of the Big Data OS:

● Distributed Filesystem (HDFS)
● Parallel Processing (Map/Reduce, YARN)


Background (cont'd)

Solving problems with Big Data is hard:

● Tools are all low-level (Pig, Hive etc.)
● Skills are hard to find

What is "Data Science":

● Understanding data & solving problems
● Applies the following skills:
  ○ Statistical Analysis
  ○ Machine Learning
  ○ Communicating Results


Real-Time vs Batch processing


Credit: http://bit.ly/Q71u4W

Batch - Hoppers, Bins, Buckets


Real-Time - Flows & Streams

Credit: http://bit.ly/NOslqf


Real-Time vs Batch processing

Similarities to the Industrial Revolution:

● From handicraft to Batch & Real-Time
● Complexity increases

Need for "Real-Time":

● Wherever the data can change faster than you can retrain models

● When you can't pre-compute everything ahead of time
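
The first point can be illustrated with a model that learns from each event as it arrives instead of waiting for the next batch retrain. This is a minimal sketch, assuming a logistic regression trained by stochastic gradient descent; the class name, learning rate and toy stream are illustrative and do not come from the talk.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (plain SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, tune per use-case

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-35.0, min(35.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # Gradient of the log-loss for a single example is (p - y) * x.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Toy event stream: feature 0 signals a positive outcome, feature 1 a negative one.
model = OnlineLogisticRegression(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200
for features, label in stream:
    model.update(features, label)  # the model is usable after every event
```

Each `update` call costs time proportional to the number of features, so the model can follow a change in the data as soon as the changed events start to arrive.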


Accuracy vs Latency


Accuracy vs Latency

Netflix Prize winning entry :-

● Ensemble of 100s of models
● Massively compute intensive solution
● Marginally better than much simpler models

IBM won the KDD Cup 2009 (Orange) :-

● IBM Watson team won by sheer brute force
● Used a "one of everything" approach generating hundreds of models


Accuracy vs Latency (cont'd)

Mathematical navel-gazing:

● Often the factor we're optimising for isn't the thing we measure improvement in:
  ○ User ratings vs. customer longevity/value
  ○ Overfitting outliers vs. missing clear Fraud

Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.


However, that doesn't mean...


Accuracy vs Latency (cont'd)

It's a trade-off:

● Sometimes "best guess" is good enough,
● Other times we can wait for the accuracy,
● And of course, occasionally we want both!

Key objective:

● Most appropriate solution for the use-case
● Hybrid solutions: part batch, part real-time
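
One way to picture a hybrid is a scorer that blends numbers from a periodic batch job with counts gathered in real time. This is a minimal sketch under stated assumptions: `HybridScorer`, `realtime_weight` and the item names are hypothetical, and the linear blend is just one simple choice, not the method used in the talk.

```python
from collections import Counter


class HybridScorer:
    """Blend slow-moving batch scores with counts observed since the last batch."""

    def __init__(self, batch_scores, realtime_weight=0.5):
        self.batch_scores = dict(batch_scores)  # output of the batch job
        self.realtime_weight = realtime_weight  # assumed blend factor in [0, 1]
        self.recent_counts = Counter()
        self.recent_total = 0

    def observe(self, item):
        # Real-time path: constant-time update per event.
        self.recent_counts[item] += 1
        self.recent_total += 1

    def load_batch(self, batch_scores):
        # Batch path: swap in fresh scores and reset the real-time window.
        self.batch_scores = dict(batch_scores)
        self.recent_counts.clear()
        self.recent_total = 0

    def score(self, item):
        base = self.batch_scores.get(item, 0.0)
        if self.recent_total == 0:
            recent = 0.0
        else:
            recent = self.recent_counts[item] / self.recent_total
        w = self.realtime_weight
        return (1 - w) * base + w * recent


scorer = HybridScorer({"cat_food": 0.7, "dog_food": 0.3})
for event in ["dog_food", "dog_food", "dog_food", "cat_toy"]:
    scorer.observe(event)
ranking = sorted(["cat_food", "dog_food", "cat_toy"], key=scorer.score, reverse=True)
```

The batch scores carry the expensive, accurate view; the counters let the ranking react to what has happened since the last batch run, including items the batch job has never seen.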


Use Case - eCommerce


Use Case - eCommerce

Objective - Increase profits

How:
● Match potential customers to the right products
● Personalise user experience on web & email
● Customer lifecycle management

Method:
● Ensemble of real-time models
● Collect lots of implicit feedback data


Use Case - eCommerce (cont'd)

Detail:
● Clustering - behavior, demographics
● Simple predictors - keywords to products
● Bayesian Bandit - blend the output

Requirements:
● Predictions in < 50 ms
● Online learning models
● Occasional batch updates are OK
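
The "Bayesian Bandit" step can be sketched with Thompson sampling over Beta posteriors. This is one plausible reading of "blend the output": for each request the bandit picks which model's recommendation to show, then learns from the click or non-click. The class, arm names and click rates below are made up for illustration and are not taken from the talk.

```python
import random


class BayesianBandit:
    """Thompson sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Per arm: [alpha, beta] = [clicks + 1, non-clicks + 1].
        self.stats = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw a plausible click rate for each arm and play the best draw.
        samples = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        # Online learning: a single counter increment per feedback event.
        if clicked:
            self.stats[arm][0] += 1
        else:
            self.stats[arm][1] += 1


# Simulated traffic with assumed (made-up) click rates for two models.
bandit = BayesianBandit(["clustering", "keyword_predictor"], seed=42)
true_click_rate = {"clustering": 0.2, "keyword_predictor": 0.6}
simulator = random.Random(7)
pulls = {arm: 0 for arm in true_click_rate}
for _ in range(2000):
    arm = bandit.choose()
    pulls[arm] += 1
    bandit.update(arm, simulator.random() < true_click_rate[arm])
```

Both `choose` and `update` do a constant amount of work per arm, which is the kind of cost that fits a prediction budget of tens of milliseconds, and traffic drifts towards the better-performing model without any batch retraining.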


When eCommerce #FAILs


I've only ever bought Cat food...


... wait there's more, no Cat food


Even Amazon can #FAIL


Use Case - Financial Services


Use Case - Financial Services

Objective - Reduce Fraud

How:
● Compute patterns/predictors for individuals
● Cluster individuals and recompute for clusters
● Compute baselines across all data

Method:
● Hybrid and Hierarchical Clustering models
● Simple predictors for individuals, clusters & baseline


Use Case - Financial Services

Detail:
● CHEAT!!! ... Cluster to nearest centroid
  ○ will degrade over time (Hunchback Clusters)
● Use simple metrics to alert (stddev)

Requirements:
● Ability to alert/intervene near real-time < 1 second
● Adapt to rapid changes (within baseline & clusters)
● Periodic batch processing to recompute clusters
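
The "cheat" and the stddev alert can be sketched as follows. Instead of re-clustering on every transaction, a new point is assigned to the nearest centroid from the last batch run, and a running mean and standard deviation flag values far from the norm. This is a minimal sketch: the function and class names, the feature layout, the sample amounts and the 3-sigma threshold are all assumptions for illustration.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the closest precomputed centroid (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, value, threshold=3.0):
        # Alert when the value is more than `threshold` stddevs from the mean.
        sd = self.stddev()
        return sd > 0 and abs(value - self.mean) > threshold * sd


# Centroids as (typical amount, transactions per day), from a hypothetical batch job.
centroids = [(20.0, 1.0), (500.0, 10.0)]
cluster = nearest_centroid((25.0, 2.0), centroids)

stats = RunningStats()
for amount in [18.0, 22.0, 19.0, 25.0, 21.0, 20.0]:
    stats.add(amount)

alert_large = stats.is_anomalous(400.0)
alert_normal = stats.is_anomalous(23.0)
```

Because the centroids are frozen between batch runs, assignments drift away from the true clusters over time, which is why the periodic recompute appears in the requirements.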


Use Case - Financial Services


Use Case - Media


Use Case - Media

Objective - Generating Metadata

Why:
● Drive second screen applications
● Create new streams of information for resale

How:
● Video / Audio analysis
● Closed-caption or subtitle text processing
● Knowledgebase :- People, Places, Products & Things


Use Case - Media (cont'd)

Method:
● Natural Language Processing
  ○ Named Entity Recognition
  ○ Topic Extraction & Disambiguation
● Graph databases & algorithms

Requirements:
● Responses in < 1 second
● Ability to learn new 'Things'
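
To make the subtitle-to-metadata idea concrete, here is a deliberately simple stand-in for Named Entity Recognition: a longest-match lookup of subtitle text against a small in-memory knowledge base. It is a dictionary (gazetteer) lookup, not a statistical NER model, and the `KNOWLEDGE_BASE` entries, type labels and function name are hypothetical.

```python
import re

# Hypothetical knowledge base: token tuple -> (entity type, canonical name).
KNOWLEDGE_BASE = {
    ("british", "sky", "broadcasting"): ("Organisation", "British Sky Broadcasting"),
    ("oxford",): ("Place", "Oxford"),
    ("university", "of", "oxford"): ("Place", "University of Oxford"),
}
MAX_PHRASE_LEN = max(len(key) for key in KNOWLEDGE_BASE)


def spot_entities(text):
    """Return known entities in reading order, preferring the longest match."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + length])
            if key in KNOWLEDGE_BASE:
                found.append(KNOWLEDGE_BASE[key])
                i += length
                break
        else:
            i += 1  # no entity starts at this token
    return found


def learn_entity(phrase, entity_type):
    """Add a new 'Thing' at runtime so later subtitles can match it."""
    global MAX_PHRASE_LEN
    key = tuple(re.findall(r"[a-z0-9']+", phrase.lower()))
    KNOWLEDGE_BASE[key] = (entity_type, phrase)
    MAX_PHRASE_LEN = max(MAX_PHRASE_LEN, len(key))


entities = spot_entities(
    "Filmed at the University of Oxford for British Sky Broadcasting."
)
```

Preferring the longest match lets "University of Oxford" win over the shorter "Oxford" entry, a crude form of disambiguation. Dictionary lookups like this are fast enough for a sub-second budget, and `learn_entity` shows how a new 'Thing' could be added without retraining.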

Example of 12,000 entities from our Knowledgebase...


Summary


Summary

Key points:
● Clear move towards distributed algorithms
● Low latency is often worth more than extra accuracy
● Trade-offs are dependent on the use-case

Further reading:
● Apache Mahout - http://mahout.apache.org/
● Storm Project - http://storm-project.net/
● Data Science London - http://datasciencelondon.org/
● Machine Learning Meetup - http://bit.ly/w8V8f6


Almost finished!


Introducing TUMRA Labs

API access to some of our real-time models:

● Probabilistic Demographics

Coming Soon:
● Language detection
● Sentiment analysis
● Metadata Generation

Free to signup and easy to get started!

http://labs.tumra.com/


Questions?

Work
tumra.com
@tumra

Personal
cotdp.com
@cotdp