Ted Dunning
© 2016 MapR Technologies

Ted Dunning - Matroid · • MapR – Produces first converged platform for big and fast data – Includes data platform (files, streams, tables) + open source – Adds major technology


Me, Us
•  Ted Dunning, MapR Chief Application Architect, Apache Member
   –  Committer and PMC member: ZooKeeper, Drill, others
   –  Mentor for Flink, Beam (née Dataflow), Drill, Storm, Zeppelin
   –  VP Incubator
   –  Bought the beer at the first HUG
•  MapR
   –  Produces first converged platform for big and fast data
   –  Includes data platform (files, streams, tables) + open source
   –  Adds major technology for performance, HA, industry-standard APIs
•  Contact
   –  @ted_dunning, [email protected], [email protected]


Agenda
•  Rationale
•  Why cheap isn't the same as simple-minded
•  Some techniques
•  Examples


Outline
•  We have a revolution on our hands
•  This leads to a green-field situation
•  That implies that many important problems are easy to solve
•  The limiting factor is fielding good enough solutions
   –  Quickly
   –  With available workforce
•  Examples


Is this really a revolutionary moment?


Big is the next big thing
•  Data scale is exploding
•  Companies are being funded
•  Books are being written
•  Applications are sprouting up everywhere


Why Now?
•  But Moore’s law has applied for a long time
•  Why is data exploding now?
•  Why not 10 years ago?
•  Why not 20?


Size Matters, but …
•  If it were just availability of data, then existing big companies would adopt big data technology first

They didn’t


Or Maybe Cost
•  If it were just a net positive value, then finance companies should adopt first because they have higher opportunity value per byte

They didn’t


Backwards adoption
•  Under almost any threshold argument, startups would not adopt big data technology first

They did


Everywhere at Once?
•  Something very strange is happening
   –  Big data is being applied at many different scales
   –  At many value scales
   –  By large companies and small

Why?


Analytics Scaling Laws
•  Analytics scaling is all about the 80-20 rule
   –  Big gains for little initial effort
   –  Rapidly diminishing returns
•  The key to net value is how costs scale
   –  Old school – exponential scaling
   –  Big data – linear scaling, low constant
•  Cost/performance has changed radically
   –  IF you can use many commodity boxes
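The net-value argument can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than something taken from the slides: the saturating value curve, the two cost curves, and all constants (`half`, `base`, `e_fold`, `per_unit`) are hypothetical, chosen only to show how the scale that maximizes value minus cost moves when cost scaling changes from exponential to linear.

```python
import math


def value(scale, half=200.0):
    """Hypothetical diminishing-returns curve: early data is worth the most."""
    return 1.0 - math.exp(-scale / half)


def cost_exponential(scale, base=0.02, e_fold=100.0):
    """Assumed 'old school' cost: grows exponentially with scale."""
    return base * (math.exp(scale / e_fold) - 1.0)


def cost_linear(scale, per_unit=0.0001):
    """Assumed 'big data' cost: linear in scale with a low constant."""
    return per_unit * scale


def best_scale(cost, scales):
    """Return the scale with the highest net value (value minus cost)."""
    return max(scales, key=lambda s: value(s) - cost(s))


scales = range(0, 2001, 10)
peak_old = best_scale(cost_exponential, scales)
peak_new = best_scale(cost_linear, scales)
print("net value peaks at scale", peak_old, "with exponential cost")
print("net value peaks at scale", peak_new, "with linear cost")
```

With these made-up constants the optimum under linear cost sits well beyond the optimum under exponential cost, which is the qualitative point of the slide; the exact positions depend entirely on the assumed curves.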


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotations: "Most data isn’t worth much in isolation"; "First data is valuable"; "Later data is dregs".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotations: "First data is valuable"; "Later data is dregs"; "But has high aggregate value"; "Suddenly worth processing".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotations: "If we can handle the scale"; "It’s really big".]


So what makes that possible?


[Figure: two panels of value (0 to 1) versus scale (0 to 2,000). Annotation: "Net value optimum has a sharp peak well before maximum effort".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotation: "But scaling laws are changing both slope and shape".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotation: "More than just a little".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotation: "They are changing a LOT!".]


[Figure: value (0 to 1) versus scale (0 to 2,000). Annotations: "Initially, linear cost scaling actually makes things worse"; "Then a tipping point is reached and things change radically …".]


Pre-requisites for Tipping
To reach the tipping point:
•  Algorithms must scale out horizontally
   –  On commodity hardware
   –  That can and will fail
•  Data practice must change
   –  Denormalized is the new black
   –  Flexible data dictionaries are the rule
   –  Structured data becomes rare


With great scale comes great opportunity
•  Increasing scale by 1000x changes the game
•  We essentially have green fields opening up all around
•  Most of the opportunities don’t require advanced learning


OK.

We have a bona fide revolution


[Figure: Greenfield Problem Landscape, with regions labeled Easy, Hard, and Impossible.]


[Figure: Mature Problem Landscape, with regions labeled Easy, Hard, and Impossible.]


Why is cheap better than deep (sometimes)?
When we have a greenfield, problems can be
   –  Easy (large number of these)
   –  Impossible (large number of these)
   –  Hard but possible (right on the boundary)
In a mature field, problems can be
   –  Easy (these are already done)
   –  Impossible (still a large number of these)
   –  Hard but possible (now the majority of the effort)


Some examples


A simple example - security monitoring
•  “Small” data
   –  Capture IDS logs
   –  Detect what you already know
•  “Big” data
   –  Capture switch, server, firewall logs as well
   –  New patterns emerge immediately


Another example – fraud detection
•  “Small” data
   –  Maintain card profiles
   –  Segment models
   –  Evaluate all transactions
•  “Big” data
   –  Maintain card profiles, full 90 day transaction history
   –  Evaluate all transactions


Another example – indicator-based recommendation
•  “Advanced” approach
   –  Use matrix completion techniques (LDA, NNM, ALS)
   –  Tune meta-parameters
   –  Ensembles galore
•  “Simple” approach
   –  Count cooccurrences and cross-occurrences
   –  Find “interesting” pairs
   –  Use standard search engine to recommend
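The first step of the "simple" approach, counting cooccurrences, can be sketched in a few lines. This is a minimal illustration under assumptions, not the speaker's implementation: the function name `cooccurrence_counts` and the input shape (one list of item ids per user history) are hypothetical, and the later steps (scoring pairs for "interestingness" and indexing them in a search engine) are not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(histories):
    """Count items and unordered item pairs across user histories.

    histories: iterable of item-id lists, one list per user (assumed shape).
    Returns (item_counts, pair_counts); a pair is counted once per history
    in which both items appear, however often each item repeats.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for history in histories:
        items = sorted(set(history))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


item_counts, pair_counts = cooccurrence_counts(
    [["guitar", "flamenco"], ["guitar", "flamenco", "metal"], ["metal"]]
)
print(pair_counts.most_common(3))
```

Raw pair counts alone favor popular items, which is why the counts are normally fed into a statistical test before any pair is treated as an indicator.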


Easy != Stupid
•  You still have to do things reasonably well
   –  Techniques that are not well founded are still problems
•  Heuristic frequency ratios still fail
   –  Coincidences still dominate the data
   –  Accidental 100% correlations abound
•  Related techniques still broken for coincidence
   –  Pearson’s χ²
   –  Simple correlations


Scale does not cure wrong

It just makes easy more common


A core technique
•  Many of these easy problems reduce to finding interesting coincidences
•  This can be summarized as a 2 x 2 table
•  Actually, many of these tables

              A      Other
   B         k11     k12
   Other     k21     k22


How do you do that?
•  This is well handled using the G² test
   –  See Wikipedia
   –  See http://bit.ly/surprise-and-coincidence
•  Original application in linguistics now cited > 2000 times
•  Available in Elasticsearch, in Solr, in Mahout
•  Available in R, C, Java, Python
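For reference, a standard way to write the G² statistic for a 2 x 2 table of counts k11, k12, k21, k22 is given below. This is the textbook form of the log-likelihood ratio test, not a formula shown on the slides; the symbols r_i, c_j, and N (row sums, column sums, and grand total) are introduced here for convenience.

```latex
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \,\ln\frac{k_{ij}\, N}{r_i\, c_j},
\qquad
r_i = k_{i1} + k_{i2}, \quad
c_j = k_{1j} + k_{2j}, \quad
N = \sum_{i,j} k_{ij}
```

Cells with k_ij = 0 contribute zero to the sum, and r_i c_j / N is the count expected in cell (i, j) if A and B were independent.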


Which one is the anomalous co-occurrence?

              A      not A
   B         13      1,000
   not B  1,000    100,000
   Score: 0.90

              A      not A
   B          1          0
   not B      0     10,000
   Score: 4.52

              A      not A
   B         10          0
   not B      0    100,000
   Score: 14.3

              A      not A
   B          1          0
   not B      0          2
   Score: 1.95

Dunning, Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol. 19 no. 1 (1993)
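The four scores on this slide are consistent with the square root of the G² (log-likelihood ratio) statistic of each table. A minimal sketch of that computation follows, using the entropy-style form of G². The function names are hypothetical, and the sign convention (negative when A and B co-occur less often than independence predicts) is an added convenience, not something stated on the slide.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def g2(k11, k12, k21, k22):
    """G^2 statistic for a 2 x 2 table of counts."""
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    total = x_log_x(k11 + k12 + k21 + k22)
    # Clamp at zero so floating-point noise cannot produce a negative value.
    return max(0.0, 2.0 * (cells - rows - cols + total))


def signed_root_llr(k11, k12, k21, k22):
    """Square root of G^2, negative when the pair is rarer than expected."""
    root = math.sqrt(g2(k11, k12, k21, k22))
    return root if k11 * k22 >= k12 * k21 else -root


tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
for table in tables:
    print(table, round(signed_root_llr(*table), 2))
```

The table with ten co-occurrences and nothing off the diagonal scores far higher than the table with the largest raw count of 13, which is the point of the slide: raw counts and perfect-looking correlations are poor guides to what is anomalous.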


So we can find interesting coincidences.

That gets us exactly what?


Operation Ababil – Brobots on Parade
•  Dork attack to find unpatched default Joomla sites
   –  Especially web servers with high bandwidth connections
   –  Basically just Google searches for default strings
   –  Joomla compromised into attack Brobot
•  C&C network checks in occasionally
   –  Note: C&C is an incoming request and looks like a normal web request
•  Later, on command, multiple Brobots direct 50-75 Gb/s of attack
   –  Attacks come from white-listed sites


[Figure: Attack Sequence, built up over four slides. (1) Source, First level C&C, Second level C&C; (2) adds Google; (3) shows three Brobots; (4) shows the Target together with the three Brobots.]


Outline of an Advanced Persistent Threat
•  Advanced
   –  Common use of zero-day for preliminary attacks
   –  Often attributed to state-level actors
   –  Modern privateers blur the line
•  Persistent
   –  Result of first attack is heavily muffled, no immediate exploit
   –  Remote access toolset installed (RAT)
•  Threat
   –  On command, data is exfiltrated covertly or en masse
   –  Or the compromised host is used for other nefarious purpose


APT in Summary
•  Attack, penetrate, pivot, exfiltrate or exploit
•  If you are a high-value target, attack is likely and stealthy
   –  High-value = telecom, banks, utilities, retail targets, web100
   –  … and all their vendors
   –  Conventional multi-factor auth is easily breached
•  Penetration and pivot are critical counter-measure opportunities
   –  In 2010, RAT would contact command and control (C&C)
   –  In 2016, C&C looks like normal traffic
•  Once exfiltration or exploit starts, you may no longer have a business


[Figure: Example 1 - Ababil. Attack sequence diagram (Source, First level C&C, Second level C&C, three Brobots, Target), annotated "Defense has to happen here".]


Spot the Important Difference?

Attacker request:

   GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1
   Host: www.sometarget.com
   User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)
   Accept-Encoding: deflate
   Accept-Charset: UTF-8
   Accept-Language: fr
   Cache-Control: no-cache
   Pragma: no-cache
   Connection: Keep-Alive

Real request:

   GET /photo.jpg HTTP/1.1
   Host: lh4.googleusercontent.com
   User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0
   Accept: image/png,image/*;q=0.8,*/*;q=0.5
   Accept-Language: en-US,en;q=0.5
   Accept-Encoding: gzip, deflate, br
   Referer: https://www.google.com
   Connection: keep-alive
   If-None-Match: "v9"
   Cache-Control: max-age=0


This could only be found at scale


But at scale, it is stupidly simple to find


[Figure: Overall Outline Again. Attack sequence diagram (Source, First level C&C, Second level C&C, three Brobots, Target), annotated "Tradecraft error!".]


Large corpus analysis of source IPs wins big


Example 2 - Common Point of Compromise
•  Scenario:
   –  Merchant 0 is compromised, leaks account data during compromise
   –  Fraud committed elsewhere during exploit
   –  High background level of fraud
   –  Limited detection rate for exploits
•  Goal:
   –  Find merchant 0
•  Meta-goal:
   –  Screen algorithms for this task without leaking sensitive data


[Figure: Example 2 - Common Point of Compromise. A skim at Merchant 0 produces skimmed data, which is then exploited at Merchant n.]

Card data is stolen from Merchant 0

That data is used in frauds at other merchants


[Figure: Simulation Setup. Daily counts of compromises and frauds over days 0 to 100, with the compromise period and the exploit period marked.]


Detection Strategy
•  Select histories that precede non-fraud
•  And histories that precede fraud detection
•  Analyze 2x2 cooccurrence of merchant n versus fraud detection
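The detection strategy can be sketched as one 2 x 2 table per merchant, scored with the root of the G² statistic. This is an illustrative sketch under assumptions, not the code behind the slides: the names `breach_scores` and `root_llr`, the input shapes, and the toy data in the usage lines are all hypothetical.

```python
import math
from collections import Counter


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def root_llr(k11, k12, k21, k22):
    """Signed square root of G^2 for a 2 x 2 table of counts."""
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    total = x_log_x(k11 + k12 + k21 + k22)
    g2 = max(0.0, 2.0 * (cells - rows - cols + total))  # clamp rounding noise
    root = math.sqrt(g2)
    return root if k11 * k22 >= k12 * k21 else -root


def breach_scores(histories, fraud_cards):
    """Score each merchant by how strongly it co-occurs with later fraud.

    histories:   dict of card id -> iterable of merchants in that card's history
    fraud_cards: set of card ids whose history precedes a detected fraud
    """
    n_fraud = sum(1 for card in histories if card in fraud_cards)
    n_clean = len(histories) - n_fraud
    seen_fraud, seen_clean = Counter(), Counter()
    for card, merchants in histories.items():
        counter = seen_fraud if card in fraud_cards else seen_clean
        counter.update(set(merchants))  # count each merchant once per card
    scores = {}
    for merchant in set(seen_fraud) | set(seen_clean):
        k11 = seen_fraud[merchant]   # merchant in history, fraud followed
        k12 = seen_clean[merchant]   # merchant in history, no fraud
        k21 = n_fraud - k11          # merchant absent, fraud followed
        k22 = n_clean - k12          # merchant absent, no fraud
        scores[merchant] = root_llr(k11, k12, k21, k22)
    return scores


# Toy data: cards 0-9 shopped at "merchant_0" and were later defrauded.
toy = {c: {"grocer"} | ({"merchant_0"} if c < 10 else set()) for c in range(100)}
print(breach_scores(toy, fraud_cards=set(range(10))))
```

A merchant that every card visits, like the hypothetical "grocer" here, scores near zero however many fraud histories it appears in, because it is just as common in the non-fraud histories.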


[Figure: LLR score for simulated merchants. X axis: number of merchants (log scale, 10^0 to 10^6); Y axis: breach score (LLR), 0 to 80. One point is labeled "Compromised Merchant".]


What about the real world?


[Figure: LLR score for real data. X axis: number of merchants (log scale, 10^0 to 10^6); Y axis: breach score (LLR), 0 to 80. Annotation: "Really truly bad guys".]


Historical cooccurrence gives high S/N


(we win)


Cooccurrence Analysis


Real-life example
•  Query: “Paco de Lucia”
•  Conventional meta-data search results:
   –  “hombres de paco” times 400
   –  not much else
•  Recommendation based search:
   –  Flamenco guitar and dancers
   –  Spanish and classical guitar
   –  Van Halen doing a classical/flamenco riff


So …
•  There are suddenly lots of these problems
•  Simple techniques have surprising power at scale
   –  Cooccurrence via G² / LLR
   –  Distributional anomaly detection via t-digest
•  These simple techniques are largely unsuitable for academic research
•  But they are highly applicable in resource constrained industrial settings


Summary
•  We live in a golden age of newly achieved scale
•  That scale has lowered the tree
   –  Hard problems are much easier
   –  Lots of low-hanging fruit all around us
•  Cheap learning has huge value
•  Code available at http://github.com/tdunning



Q & A