Boston HUG

A talk about why and how machine learning works with Hadoop, with recent developments for real-time operation.

Machine Learning with Hadoop

Agenda

• Why Big Data? Why now?

• What can you do with big data?

• How does it work?

Slow Motion Explosion

Why Now?

• But Moore’s law has applied for a long time

• Why is Hadoop/Big Data exploding now?

• Why not 10 years ago?

• Why not 20?

Size Matters, but …

• If it were just the availability of data, then existing big companies would have adopted big data technology first

They didn’t

Or Maybe Cost

• If it were just net positive value, then finance companies should have adopted first, because they have a higher opportunity value per byte

They didn’t

Backwards Adoption

• Under almost any threshold argument, startups would not have adopted big data technology first

They did

Everywhere at Once?

• Something very strange is happening
  – Big data is being applied at many different scales
  – At many value scales
  – By large companies and small

Why?

Analytics Scaling Laws

• Analytics scaling is all about the 80-20 rule
  – Big gains for little initial effort
  – Rapidly diminishing returns
• The key to net value is how costs scale
  – Old school: exponential scaling
  – Big data: linear scaling, low constant
• Cost/performance has changed radically
  – IF you can use many commodity boxes

[Chart: net value vs. effort, with insight levels from “We knew that” and “We should have known that” to “We didn’t know that!” and “You’re kidding, people do that?”, and effort levels from “Anybody with eyes” and “Intern with a spreadsheet” through “In-house analytics” and “Industry-wide data consortium” to “NSA, non-proliferation”]

• Net value optimum has a sharp peak well before maximum effort
• But scaling laws are changing both slope and shape
  – More than just a little: they are changing a LOT!
• Initially, linear cost scaling actually makes things worse
• A tipping point is reached and things change radically …

Pre-requisites for Tipping

• To reach the tipping point, algorithms must scale out horizontally
  – On commodity hardware
  – That can and will fail
• Data practice must change
  – Denormalized is the new black
  – Flexible data dictionaries are the rule
  – Structured data becomes rare

So that is why, and why now

What can you do with it? And how?

Agenda

• Mahout outline
  – Recommendations
  – Clustering
  – Classification
    • Supervised on-line learning
    • Feature hashing
• Hybrid Parallel/Sequential Systems
• Real-time learning

Classification in Detail

• Naive Bayes Family
  – Hadoop-based training
• Decision Forests
  – Hadoop-based training
• Logistic Regression (aka SGD)
  – Fast on-line (sequential) training (see the sketch below)
  – Now with MORE topping!
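The on-line (sequential) SGD trainer nudges each weight a little after every example it sees. Below is a minimal, self-contained sketch of that update, assuming a dense feature array and a fixed learning rate; the class and parameter names are illustrative, and Mahout's actual SGD classifier adds regularization and learning-rate annealing that this sketch omits.

public class SgdLogisticSketch {
    private final double[] w;           // one real-valued weight per feature slot
    private final double learningRate;  // fixed step size (illustrative)

    public SgdLogisticSketch(int numFeatures, double learningRate) {
        this.w = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability of the positive class: sigmoid of the weighted sum of features.
    public double predict(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }

    // One sequential training step on a single example; label is 0 or 1.
    public void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < x.length; i++) {
            w[i] += learningRate * error * x[i];
        }
    }
}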

How it Works

• We are given “features”
  – Often binary values in a vector
• The algorithm learns weights
  – The weighted sum of feature × weight is the key quantity (see the sketch below)
• Each weight is a single real value
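With mostly binary features, that weighted sum reduces to adding up the weights of the slots that are switched on. A tiny sketch of the scoring step, assuming the instance is given as the list of active slot indexes (all names are illustrative):

public class BinaryScorer {
    private final double[] weights;  // one real value per feature slot

    public BinaryScorer(double[] weights) {
        this.weights = weights;
    }

    // Score = sum of the weights of the active slots, since each active feature value is 1.
    public double score(int[] activeSlots) {
        double sum = 0.0;
        for (int slot : activeSlots) {
            sum += weights[slot];
        }
        return sum;
    }
}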

An Example

Features

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

But …

• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots

Feature Encoding

Hashed Encoding

Feature Collisions
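The original slides show these steps graphically; as a rough stand-in, here is a minimal sketch of hashed encoding, assuming each word is hashed into a single slot of a fixed-size binary vector (the class name and slot count are illustrative; Mahout's real encoders are more careful, e.g. probing more than one slot per feature). Two different words that land in the same slot are exactly the feature-collision case; with enough slots, such collisions rarely matter.

import java.util.Arrays;

public class HashedEncoderSketch {
    private final int numSlots;  // length of the binary feature vector

    public HashedEncoderSketch(int numSlots) {
        this.numSlots = numSlots;
    }

    // Hash each word into one slot and switch that slot on.
    public double[] encode(String[] words) {
        double[] v = new double[numSlots];
        for (String word : words) {
            int slot = Math.floorMod(word.hashCode(), numSlots);
            v[slot] = 1.0;  // a collision simply re-sets an already-on slot
        }
        return v;
    }

    public static void main(String[] args) {
        HashedEncoderSketch enc = new HashedEncoderSketch(16);
        System.out.println(Arrays.toString(
            enc.encode("pleasure talking at the hadoop user group".split(" "))));
    }
}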

Training Data

Full Scale Training

[Diagram: input and side-data flow through feature extraction, down-sampling, and a data join in the map-reduce stage, then into sequential SGD learning; “now via NFS”]

Hybrid Model Development

[Diagram: on the big-data cluster, logs are grouped by user into user sessions and transaction patterns are counted to produce training data; across a shared filesystem, the legacy modeling environment merges that training data with account info and runs PROC LOGISTIC to produce the model]

Enter the Pig Vector

• Pig UDFs for
  – Vector encoding
  – Model training

-- register the vector-encoding UDF with its size, formula, and typed-field arguments
define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');

-- encode each record as a vector, gather them all, and train a model over the whole bag
vectors = foreach docs generate newsgroup, encodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;

Real-time Developments

• Storm + Hadoop + MapR
  – Real-time with Storm
  – Long-term with Hadoop
  – State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning

Aggregate Splicing

Hadoop handles the past
Storm handles the present

Mobile Network Monitor

[Diagram: geo-dispersed ingest servers feed transaction data through batch aggregation into HBase, which serves a real-time dashboard and alerts and a retro-analysis interface]

A Quick Diversion

• You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?
• I flip the coin and, while it is in the air, ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
  – And did it ever have a single value?

A First Conclusion

• Probability as expressed by humans is subjective and depends on information and experience

A Second Conclusion

• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better

[Plots: probability distributions labeled “I dunno”, “5 and 5”, and “2 and 10”]

Bayesian Bandit

• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
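A minimal sketch of that loop for two bandits with win/lose payoffs, assuming a Beta posterior over each bandit's payoff probability; the Beta draw uses the order-statistic trick so no statistics library is needed, and all names are illustrative.

import java.util.Arrays;
import java.util.Random;

public class BayesianBanditSketch {
    private final int[] wins = new int[2];
    private final int[] plays = new int[2];
    private final Random rand = new Random();

    // Sample Beta(a, b) for integer a, b >= 1: the a-th smallest of (a + b - 1)
    // uniform draws has exactly that distribution.
    private double sampleBeta(int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = rand.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];
    }

    // Sample p1 and p2 from the posteriors; put the coin in the bandit with the larger draw.
    public int choose() {
        double p1 = sampleBeta(wins[0] + 1, plays[0] - wins[0] + 1);
        double p2 = sampleBeta(wins[1] + 1, plays[1] - wins[1] + 1);
        return p1 > p2 ? 0 : 1;
    }

    // Record the outcome, which sharpens that bandit's posterior for future draws.
    public void update(int bandit, boolean paidOff) {
        plays[bandit]++;
        if (paidOff) {
            wins[bandit]++;
        }
    }
}

Sampling from the posterior, rather than always taking its mean, is what lets a single draw balance exploration and exploitation, which is the point made on the next slide.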

The Basic Idea

• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models

Deployment with Storm/MapR

All state managed transactionally in MapR file system

Service Architecture

[Diagram: Storm and Hadoop running alongside MapR Lockless Storage Services and MapR Pluggable Service Management]

Find Out More

• Me: tdunning@mapr.com, ted.dunning@gmail.com, tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning