Right now in institutions around the world, some of the greatest minds in computer science and statistics are coming up with amazing new algorithms and mathematically beautiful solutions. However, it's entirely possible that the solutions they conceive will be impracticable in industry. The reason is simple: "the best answer is useless if it arrives too late to do anything with it". The key principle here is the compromise between 'accuracy' and 'latency'. In this talk I will describe examples where this holds true, and how I am using real-time machine learning models to solve challenges in eCommerce, Financial Services and Media companies. http://tumra.com/blog/real-time-machine-learning-at-industrial-scale
Real-Time Machine Learning at Industrial scale
9th October 2012
Michael Cutler (@cotdp)
tumra.com (@tumra)
TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
... the battle of accuracy vs latency
$ whoami
Michael Cutler (@cotdp)
● Previously at British Sky Broadcasting
  ○ Last 7 years in R&D
  ○ Created several patented systems & algorithms
  ○ Kicked off ‘Big Data’ initiative at Sky in 2008
● Co-founder & CTO @ TUMRA in March '12
  ○ Real-time big data science platform
  ○ Alpha-testing with selected clients
Agenda
● Background
● Real-Time vs Batch processing
● Accuracy vs Latency
● Use Cases
  ○ eCommerce
  ○ Financial Services
  ○ Media
● Questions
Background
Big Data is "in vogue", but what does it mean?
● Distributed processing
● Massively scalable
● Commodity hardware
Apache Hadoop is the "kernel" of the Big Data OS:
● Distributed Filesystem (HDFS)
● Parallel Processing (Map/Reduce, YARN)
Background (cont'd)
Solving problems with Big Data is hard:
● Tools are all low-level (Pig, Hive, etc.)
● Skills are hard to find
What is "Data Science"?
● Understanding data & solving problems
● Applies the following skills:
  ○ Statistical Analysis
  ○ Machine Learning
  ○ Communicating Results
Real-Time vs Batch processing
Credit: http://bit.ly/Q71u4W
Batch - Hoppers, Bins, Buckets
Real-Time - Flows & Streams
Credit: http://bit.ly/NOslqf
Real-Time vs Batch processing
Similarities to the Industrial Revolution:
● From handicraft to Batch & Real-Time
● Complexity increases
Need for "Real-Time":
● Wherever the data can change faster than you can retrain models
● When you can't pre-compute everything ahead of time
Accuracy vs Latency
Netflix Prize winning entry:
● Ensemble of hundreds of models
● Massively compute-intensive solution
● Marginally better than much simpler models
IBM won the KDD Cup 2009 (Orange):
● IBM Watson team won by sheer brute force
● Used a "one of everything" approach, generating hundreds of models
Accuracy vs Latency (cont'd)
Mathematical navel-gazing:
● Often the factor we're optimising for isn't the thing we measure improvement in:
  ○ User ratings vs. customer longevity/value
  ○ Overfitting outliers vs. missing clear Fraud
Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.
However, that doesn't mean...
Accuracy vs Latency (cont'd)
It's a trade-off:
● Sometimes "best guess" is good enough
● Other times we can wait for the accuracy
● And of course, occasionally we want both!
Key objective:
● Most appropriate solution for the use-case
● Hybrid solutions: part batch, part real-time
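One way to picture a "part batch, part real-time" hybrid is a batch-computed baseline that is corrected by a cheap streaming update between batch runs. The sketch below is illustrative only: the class name `HybridScore`, the smoothing factor `alpha`, and the use of an exponentially weighted average are assumptions, not something described in the talk.

```python
class HybridScore:
    """Batch baseline plus a real-time correction (hypothetical design)."""

    def __init__(self, batch_scores, alpha=0.2):
        self.batch_scores = dict(batch_scores)  # recomputed periodically offline
        self.realtime_delta = {}                # updated on every incoming event
        self.alpha = alpha                      # assumed smoothing factor

    def observe(self, key, value):
        """Fold one live observation into the correction term for `key`."""
        base = self.batch_scores.get(key, 0.0)
        old = self.realtime_delta.get(key, 0.0)
        # Exponentially weighted average of the residual from the batch baseline.
        self.realtime_delta[key] = (1 - self.alpha) * old + self.alpha * (value - base)

    def predict(self, key):
        """Low-latency answer: two dictionary lookups and an addition."""
        return self.batch_scores.get(key, 0.0) + self.realtime_delta.get(key, 0.0)

    def load_batch(self, batch_scores):
        """Swap in a fresh batch result and reset the real-time corrections."""
        self.batch_scores = dict(batch_scores)
        self.realtime_delta.clear()


scores = HybridScore({"product-1": 1.0}, alpha=0.5)
scores.observe("product-1", 3.0)
print(scores.predict("product-1"))
```

The "best guess" is always available from `predict`, while the slower and more accurate batch job can replace the baseline whenever it finishes.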
Use Case - eCommerce
Objective - Increase profits
How:
● Match potential customers to the right products
● Personalise user experience on web & email
● Customer lifecycle management
Method:
● Ensemble of real-time models
● Collect lots of implicit feedback data
Use Case - eCommerce (cont'd)
Detail:
● Clustering - behavior, demographics
● Simple predictors - keywords to products
● Bayesian Bandit - blend the output
Requirements:
● Predictions in < 50 ms
● Online learning models
● Occasional batch updates are OK
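As a rough illustration of how a Bayesian bandit could blend the output of several models, the sketch below uses Thompson sampling with a Beta-Bernoulli model per arm. It is a minimal sketch under assumptions: the arm names, the click-based reward, and the uniform Beta(1, 1) prior are hypothetical, and the talk does not say which bandit variant was used.

```python
import random


class BayesianBandit:
    """Thompson sampling over a set of candidate models ("arms")."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every arm: one pseudo-win and one pseudo-loss.
        self.wins = {arm: 1 for arm in arms}
        self.losses = {arm: 1 for arm in arms}

    def choose(self):
        """Sample a plausible click rate per arm and pick the highest."""
        samples = {
            arm: self.rng.betavariate(self.wins[arm], self.losses[arm])
            for arm in self.wins
        }
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        """Online update from implicit feedback (clicked or not)."""
        if clicked:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


# Hypothetical arms: a cluster-based recommender and a keyword predictor.
bandit = BayesianBandit(["clusters", "keywords"], seed=1)
arm = bandit.choose()              # which model serves this request
bandit.update(arm, clicked=True)   # feed the observed outcome back in
```

Both `choose` and `update` are constant-time per arm, which is what makes this kind of online learner compatible with a tight latency budget.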
When eCommerce #FAILs
I've only ever bought Cat food...
... wait there's more, no Cat food
Even Amazon can #FAIL
Use Case - Financial Services
Objective - Reduce Fraud
How:
● Compute patterns/predictors for individuals
● Cluster individuals and recompute for clusters
● Compute baselines across all data
Method:
● Hybrid and Hierarchical Clustering models
● Simple predictors for individuals, clusters & baseline
Use Case - Financial Services (cont'd)
Detail:
● CHEAT!!! ... Cluster to nearest centroid
  ○ will degrade over time (Hunchback Clusters)
● Use simple metrics to alert (stddev)
Requirements:
● Ability to alert/intervene in near real-time (< 1 second)
● Adapt to rapid changes (within baseline & clusters)
● Periodic batch processing to recompute clusters
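The "cheat" above can be sketched roughly as follows: centroids come from the periodic batch job, new points are simply assigned to the nearest one, and a running mean and standard deviation flag outliers. This is a minimal sketch under assumptions: the 3-sigma threshold, the Euclidean distance, and the use of Welford's online algorithm are illustrative choices, not details from the talk.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the closest batch-computed centroid (no re-clustering)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class RunningStats:
    """Online mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, threshold=3.0):
        """True if x lies more than `threshold` standard deviations from the mean."""
        sd = self.stddev()
        return sd > 0 and abs(x - self.mean) > threshold * sd


# Hypothetical data: two centroids and a stream of transaction amounts.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster = nearest_centroid((9.0, 8.0), centroids)

stats = RunningStats()
for amount in [10, 12, 11, 9, 10, 11]:
    stats.update(amount)
alert = stats.is_anomalous(500)
```

One `RunningStats` instance could be kept per individual, per cluster, and for the global baseline. Because assignments never move the centroids, the clusters drift out of date, which is why the periodic batch recompute is still needed.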
Use Case - Media
Objective - Generating Metadata
Why:
● Drive second screen applications
● Create new streams of information for resale
How:
● Video / Audio analysis
● Closed caption or subtitle text processing
● Knowledgebase: People, Places, Products & Things
Use Case - Media (cont'd)
Method:
● Natural Language Processing
  ○ Named Entity Recognition
  ○ Topic Extraction & Disambiguation
● Graph databases & algorithms
Requirements:
● Responses in < 1 second
● Ability to learn new 'Things'
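To make the knowledgebase idea concrete, the sketch below tags entities in subtitle text by longest-match lookup against a dictionary that can learn new 'Things' at runtime. It is a deliberately simplified stand-in: real Named Entity Recognition uses statistical models and disambiguation, and the class name, entity types, and example sentence here are all hypothetical.

```python
import re


class Knowledgebase:
    """Dictionary-based entity tagger (a toy substitute for real NER)."""

    def __init__(self):
        self.entities = {}  # lower-cased surface form -> (canonical name, type)

    def learn(self, name, entity_type):
        """Add a new 'Thing' without retraining anything."""
        self.entities[name.lower()] = (name, entity_type)

    def tag(self, text, max_words=3):
        """Return known entities in order of appearance, preferring longer matches."""
        words = re.findall(r"[\w']+", text)
        found = []
        i = 0
        while i < len(words):
            # Try the longest candidate phrase first, then shrink.
            for n in range(min(max_words, len(words) - i), 0, -1):
                phrase = " ".join(words[i:i + n]).lower()
                if phrase in self.entities:
                    found.append(self.entities[phrase])
                    i += n
                    break
            else:
                i += 1  # no entity starts at this word
        return found


kb = Knowledgebase()
kb.learn("David Cameron", "Person")
kb.learn("London", "Place")
print(kb.tag("David Cameron arrived in London today."))
```

Lookups are hash-table operations, so responses stay well inside a one-second budget, and `learn` shows how new entries can be added on the fly. Linking entities to each other would be the job of the graph database, which this sketch does not cover.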
Example of 12,000 entities from our Knowledgebase...
Summary
Key points:
● Clear move towards distributed algorithms
● Low latency is often more valuable than accuracy
● Trade-offs are dependent on the use-cases
Further reading:
● Apache Mahout - http://mahout.apache.org/
● Storm Project - http://storm-project.net/
● Data Science London - http://datasciencelondon.org/
● Machine Learning Meetup - http://bit.ly/w8V8f6
Almost finished!
Introducing TUMRA Labs
API access to some of our real-time models:
● Probabilistic Demographics
Coming Soon:
● Language detection
● Sentiment analysis
● Metadata Generation
Free to sign up and easy to get started!
http://labs.tumra.com/
Questions?
Work: tumra.com (@tumra)
Personal: cotdp.com (@cotdp)