Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)

DESCRIPTION

Right now, in institutions around the world, some of the greatest minds in computer science and statistics are coming up with amazing new algorithms and mathematically beautiful solutions. However, it's entirely possible that the solutions they conceive will be impracticable in industry. The reason is simple: "the best answer is useless if it arrives too late to do anything with it". The key principle here is the compromise between 'accuracy' and 'latency'. In this talk I will describe examples where this holds true, and how I am using real-time machine learning models to solve challenges in eCommerce, Financial Services and Media companies.

http://tumra.com/blog/real-time-machine-learning-at-industrial-scale

Real-Time Machine Learning at Industrial scale

9th October 2012

Michael Cutler @cotdp

tumra.com @tumra

TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA

... the battle of accuracy vs latency

$ whoami

Michael Cutler (@cotdp)

● Previously at British Sky Broadcasting
  ○ Last 7 years in R&D
  ○ Created several patented systems & algorithms
  ○ Kicked off ‘Big Data’ initiative at Sky in 2008

● Co-founder CTO @ TUMRA in March '12
  ○ Real-time big data science platform
  ○ Alpha-testing with selected clients

Agenda

● Background
● Real-Time vs Batch processing
● Accuracy vs Latency
● Use Cases
  ○ eCommerce
  ○ Financial Services
  ○ Media
● Questions

Background

Big Data is "in vogue", but what does it mean:

● Distributed processing
● Massively scalable
● Commodity

Apache Hadoop is the "kernel" of the Big Data OS:

● Distributed Filesystem (HDFS)
● Parallel Processing (MapReduce, YARN)

Background (cont'd)

Solving problems with Big Data is hard:

● Tools are all low-level (Pig, Hive etc.)
● Skills are hard to find

What is "Data Science":

● Understanding data & solving problems
● Applies the following skills:
  ○ Statistical Analysis
  ○ Machine Learning
  ○ Communicating Results

Real-Time vs Batch processing

Credit: http://bit.ly/Q71u4W

Batch - Hoppers, Bins, Buckets

Real-Time - Flows & Streams

Credit: http://bit.ly/NOslqf

Real-Time vs Batch processing

Similarities to the Industrial Revolution:

● From handicraft to Batch & Real-Time
● Complexity increases

Need for "Real-Time":

● Wherever the data changes faster than you can retrain models

● When you can't pre-compute everything ahead of time

Accuracy vs Latency

Netflix Prize winning entry:

● Ensemble of hundreds of models
● Massively compute-intensive solution
● Marginally better than much simpler models

IBM won the KDD Cup 2009 (Orange):

● IBM Watson team won by sheer brute force
● Used a "one of everything" approach, generating hundreds of models

Accuracy vs Latency (cont'd)

Mathematical navel-gazing:

● Often the factor we're optimising for isn't the thing we measure improvement in:
  ○ User ratings vs. customer longevity/value
  ○ Overfitting outliers vs. missing clear fraud

Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.

However, that doesn't mean...

Accuracy vs Latency (cont'd)

It's a trade-off:

● Sometimes "best guess" is good enough,
● Other times we can wait for the accuracy,
● And of course, occasionally we want both!

Key objective:

● Most appropriate solution for the use case
● Hybrid solutions: part batch, part real-time
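
One way to picture the "best guess now vs. marginally better later" trade-off is a latency budget: ask the slower, more accurate model for an answer, but fall back to the fast guess if the deadline passes. The sketch below is only an illustration of that idea, not the implementation described in the talk; `predict_with_budget`, `fast_guess`, `slow_accurate` and the 50 ms default budget are all hypothetical names and values.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool; a real service would size this for its workload.
_POOL = ThreadPoolExecutor(max_workers=4)


def predict_with_budget(features, fast_model, slow_model, budget_s=0.05):
    """Return (prediction, source), preferring slow_model if it meets the budget."""
    start = time.monotonic()
    future = _POOL.submit(slow_model, features)   # start the expensive model first
    best_guess = fast_model(features)             # cheap answer, always available
    remaining = max(0.0, budget_s - (time.monotonic() - start))
    try:
        return future.result(timeout=remaining), "slow"
    except FutureTimeout:
        future.cancel()  # only stops a task that has not started running yet
        return best_guess, "fast"


# Hypothetical stand-in models, for illustration only.
def fast_guess(features):
    return sum(features) / len(features)


def slow_accurate(features):
    time.sleep(0.2)                               # simulates an expensive ensemble
    return sorted(features)[len(features) // 2]
```

The caller always gets an answer within roughly the budget, and the `source` field makes it possible to measure how often the accurate path actually arrives in time.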

Use Case - eCommerce

Objective - Increase profits

How:
● Match potential customers to the right products
● Personalise user experience on web & email
● Customer lifecycle management

Method:
● Ensemble of real-time models
● Collect lots of implicit feedback data

Use Case - eCommerce (cont'd)

Detail:
● Clustering - behaviour, demographics
● Simple predictors - keywords to products
● Bayesian Bandit - blend the output

Requirements:
● Predictions in < 50 ms
● Online learning models
● Occasional batch updates are OK
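
The slides do not say which Bayesian bandit is used to blend the predictors. A common choice is Thompson sampling with Beta posteriors, sketched below under that assumption: each predictor is an "arm", implicit feedback (a click or purchase) is the reward, and the update is a cheap online step. The class name `BetaBandit`, the arm names and the simulated click rates are made up for illustration.

```python
import random


class BetaBandit:
    """Thompson sampling over a fixed set of arms with Beta(1, 1) priors."""

    def __init__(self, arms, seed=None):
        self._rng = random.Random(seed)
        # Each arm stores the Beta parameters [successes + 1, failures + 1].
        self._params = {arm: [1, 1] for arm in arms}

    def choose(self):
        """Sample a plausible success rate per arm and pick the highest."""
        draws = {
            arm: self._rng.betavariate(a, b)
            for arm, (a, b) in self._params.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        """Online update: reward is True when the user clicked or purchased."""
        self._params[arm][0 if reward else 1] += 1

    def mean(self, arm):
        """Posterior mean success rate of an arm."""
        a, b = self._params[arm]
        return a / (a + b)


# Simulation with hypothetical predictors and invented click rates.
true_rates = {"cluster_model": 0.05, "keyword_model": 0.15}
bandit = BetaBandit(true_rates, seed=42)
env = random.Random(7)
pulls = {arm: 0 for arm in true_rates}
for _ in range(5000):
    arm = bandit.choose()
    pulls[arm] += 1
    bandit.update(arm, env.random() < true_rates[arm])
```

Both `choose` and `update` are constant-time per arm, which is what makes this kind of blending compatible with a tight per-request latency target; in the simulation the bandit gradually routes most traffic to the predictor with the higher click rate.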

When eCommerce #FAILs

I've only ever bought Cat food...

... wait there's more, no Cat food

Even Amazon can #FAIL

Use Case - Financial Services

Objective - Reduce Fraud

How:
● Compute patterns/predictors for individuals
● Cluster individuals and recompute for clusters
● Compute baselines across all data

Method:
● Hybrid and Hierarchical Clustering models
● Simple predictors for individuals, clusters & baseline

Use Case - Financial Services (cont'd)

Detail:
● CHEAT!!! ... Cluster to nearest centroid
  ○ will degrade over time (Hunchback Clusters)
● Use simple metrics to alert (stddev)

Requirements:
● Ability to alert/intervene in near real-time (< 1 second)
● Adapt to rapid changes (within baseline & clusters)
● Periodic batch processing to recompute clusters
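
The "cheat" above can be sketched roughly as follows: centroids come from a periodic batch job, each incoming event is assigned to its nearest centroid in real time, and a running mean and standard deviation per cluster drive a simple alert. This is a minimal sketch under stated assumptions, not the system from the talk: the `FraudMonitor` class, the two-dimensional profiles, the centroids and the 3-standard-deviation threshold are all hypothetical, and the running statistics use Welford's online algorithm as one reasonable choice.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, k=3.0):
        """True if x lies more than k standard deviations from the mean."""
        if self.n < 2 or self.stddev == 0.0:
            return False
        return abs(x - self.mean) > k * self.stddev


class FraudMonitor:
    """Scores amounts against a per-cluster baseline (hypothetical example)."""

    def __init__(self, centroids, k=3.0):
        self.centroids = centroids            # assumed to come from a batch job
        self.k = k
        self.stats = [RunningStats() for _ in centroids]

    def observe(self, profile, amount):
        """Score amount against its cluster baseline, then learn from it."""
        cluster = nearest_centroid(profile, self.centroids)
        stats = self.stats[cluster]
        alert = stats.is_outlier(amount, self.k)
        stats.update(amount)                  # baseline adapts to every event
        return cluster, alert


monitor = FraudMonitor(centroids=[(0.0, 0.0), (10.0, 10.0)])
for amount in [20.0, 22.0, 19.0, 21.0, 20.0, 18.0]:
    monitor.observe((1.0, 0.5), amount)       # builds the baseline for cluster 0
print(monitor.observe((1.0, 0.5), 500.0))     # (0, True): far outside the baseline
```

Because the centroids are frozen between batch runs, assignments drift as behaviour changes, which is why the periodic recomputation in the requirements is still needed.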

Use Case - Media

Objective - Generating Metadata

Why:
● Drive second screen applications
● Create new streams of information for resale

How:
● Video / Audio analysis
● Closed caption or subtitle text processing
● Knowledgebase: People, Places, Products & Things

Use Case - Media (cont'd)

Method:
● Natural Language Processing
  ○ Named Entity Recognition
  ○ Topic Extraction & Disambiguation
● Graph databases & algorithms

Requirements:
● Responses in < 1 second
● Ability to learn new 'Things'
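
As a rough illustration of spotting entities in subtitle text against a knowledgebase that can learn new 'Things' at runtime, here is a dictionary (gazetteer) lookup. It is a deliberately simple stand-in for statistical Named Entity Recognition, not the method used in the talk, and it does no disambiguation; the `Knowledgebase` class and the example entities and types are hypothetical.

```python
import re


class Knowledgebase:
    """Maps lower-cased entity names to a type; the longest match wins."""

    def __init__(self):
        self._entities = {}
        self._max_words = 1

    def learn(self, name, kind):
        """Add a new 'Thing' at runtime, with no retraining step."""
        words = name.lower().split()
        self._entities[" ".join(words)] = kind
        self._max_words = max(self._max_words, len(words))

    def extract(self, text):
        """Return (surface form, type) pairs found in text, left to right."""
        tokens = re.findall(r"[\w']+", text)
        found, i = [], 0
        while i < len(tokens):
            # Try the longest candidate phrase starting at token i first.
            for size in range(min(self._max_words, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + size])
                kind = self._entities.get(phrase.lower())
                if kind is not None:
                    found.append((phrase, kind))
                    i += size
                    break
            else:
                i += 1                        # no entity starts at this token
        return found


kb = Knowledgebase()
kb.learn("London", "Place")
kb.learn("Tower of London", "Place")
kb.learn("Hadoop", "Product")
subtitle = "A tour of the Tower of London, then a talk on Hadoop."
print(kb.extract(subtitle))
# [('Tower of London', 'Place'), ('Hadoop', 'Product')]
```

Lookups are dictionary reads over a handful of candidate phrases per token, so responses stay well within a sub-second budget, and `learn` makes a new entity available to the very next request.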

Example of 12,000 entities from our Knowledgebase...

Summary

Key points:
● Clear move towards distributed algorithms
● Low latency is often worth more than extra accuracy
● Trade-offs are dependent on the use case

Further reading:
● Apache Mahout - http://mahout.apache.org/
● Storm Project - http://storm-project.net/
● Data Science London - http://datasciencelondon.org/
● Machine Learning Meetup - http://bit.ly/w8V8f6

Almost finished!

Introducing TUMRA Labs

API access to some of our real-time models:

● Probabilistic Demographics

Coming Soon:
● Language detection
● Sentiment analysis
● Metadata Generation

Free to sign up and easy to get started!

http://labs.tumra.com/

Questions?

Work: tumra.com

@tumra

Personal: cotdp.com

@cotdp
