DECEMBER 8-9, 2016

BSSML16 L3. Clusters and Anomaly Detection


Page 1: BSSML16 L3. Clusters and Anomaly Detection


Page 2: BSSML16 L3. Clusters and Anomaly Detection

BigML, Inc 2

Poul Petersen CIO, BigML, Inc.

Cluster Analysis: Finding Similarities

Page 3: BSSML16 L3. Clusters and Anomaly Detection

Trees vs Clusters

Trees/LR (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label

Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity

Page 4: BSSML16 L3. Clusters and Anomaly Detection

Trees vs Clusters

Supervised: Inputs “X” and Label “Y”

sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …

Learning Task: Find function “f” such that: f(X)≈Y

Unsupervised: Inputs “X” only

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar

Page 5: BSSML16 L3. Clusters and Anomaly Detection

Use Cases

• Customer segmentation
• Item discovery
• Similarity
• Recommender
• Active learning

Page 6: BSSML16 L3. Clusters and Anomaly Detection

Customer Segmentation

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.

• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV based on in-game purchases.
• Assumption: Usage correlates to LTV.

[Figure: clusters annotated with 0%, 3%, and 1%]

Page 7: BSSML16 L3. Clusters and Anomaly Detection

Item Discovery

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.

• Dataset of 86 whiskies
• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

[Figure: flavor characteristics shown include Smoky and Fruity]

Page 8: BSSML16 L3. Clusters and Anomaly Detection


Clustering Demo #1

Page 9: BSSML16 L3. Clusters and Anomaly Detection

Similarity

GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population

• Dataset of Lending Club Loans
• Mark any loan that is currently or has ever been late as “trouble”

[Figure: clusters annotated with 0%, 3%, 7%, and 1%]

Page 10: BSSML16 L3. Clusters and Anomaly Detection

Active Learning

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.

Page 11: BSSML16 L3. Clusters and Anomaly Detection

Active Learning

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.

Page 12: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

Cluster into 3 groups…

Page 13: BSSML16 L3. Clusters and Anomaly Detection


Human Expert

Page 14: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

• Jesa used prior knowledge to select possible features that separated the objects.
    • “round”, “skinny”, “edges”, “hard”, etc.
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
    • met criteria of K=3
    • groups were sufficiently “distant”
    • no crossover

Page 15: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

Create features that capture these object differences:

• Length/Width
    • greater than 1 => “skinny”
    • equal to 1 => “round”
    • less than 1 => invert
• Number of Surfaces
    • distinct surfaces require “edges” which have corners
    • easier to count
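The Length/Width rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the tolerance used to decide "equal to 1" are assumptions, not part of the slides.

```python
def length_width_ratio(length, width):
    """Return a ratio >= 1 no matter which side was measured as 'length'."""
    ratio = length / width
    # "less than 1 => invert": flip the ratio so it is always >= 1.
    return 1 / ratio if ratio < 1 else ratio


def shape_label(ratio, tolerance=1e-9):
    """Map a ratio to the slide's labels; the tolerance is an assumed detail."""
    return "round" if abs(ratio - 1) <= tolerance else "skinny"


# Hypothetical measurements: the same object measured in either orientation.
ratio_a = length_width_ratio(8, 2)
ratio_b = length_width_ratio(2, 8)
```

Inverting ratios below 1 means the feature does not depend on how the object happened to be oriented when it was measured.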

Page 16: BSSML16 L3. Clusters and Anomaly Detection

Clustering Features

Object    Length / Width   Num Surfaces
penny     1                3
dime      1                3
knob      1                4
eraser    2.75             6
box       1                6
block     1.6              6
screw     8                3
battery   5                3
key       4.25             3
bead      1                2

Page 17: BSSML16 L3. Clusters and Anomaly Detection

Plot by Features

[Figure: objects plotted by Num Surfaces vs. Length / Width, K=3. Point labels: box, block, eraser, knob, penny, dime, bead, key, battery, screw]

K-Means Key Insight: We can find clusters using distances in n-dimensional feature space

Page 18: BSSML16 L3. Clusters and Anomaly Detection

Plot by Features

[Figure: the same plot of Num Surfaces vs. Length / Width, with circles drawn around the groups of points]

K-Means: Find “best” (minimum distance) circles that include all points
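As a rough illustration of the key insight, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python, run on the object feature table from the slides. The naive random starting points, the iteration cap, and the choice to leave the features unscaled are simplifying assumptions for this sketch and not a description of BigML's implementation.

```python
import math
import random

# Feature table from the slides: (length / width, number of surfaces).
OBJECTS = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}


def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and moving centers until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive start: k randomly chosen instances
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


names = list(OBJECTS)
centers, labels = kmeans([OBJECTS[n] for n in names], k=3)
clusters = {c: [n for n, a in zip(names, labels) if a == c] for c in range(3)}
```

Different seeds can produce different groupings, which is why the choice of starting points matters.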

Page 19: BSSML16 L3. Clusters and Anomaly Detection


K-Means Algorithm

K=3

Page 20: BSSML16 L3. Clusters and Anomaly Detection


K-Means Algorithm

K=3

Page 21: BSSML16 L3. Clusters and Anomaly Detection

Features Matter

[Figure: objects grouped as Metal, Wood, and Other]

Page 22: BSSML16 L3. Clusters and Anomaly Detection

Convergence

Convergence is guaranteed, but the result is not necessarily unique

Starting points are important (K++)

Page 23: BSSML16 L3. Clusters and Anomaly Detection

Starting Points

• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
    • but this is sensitive to outliers
• k++
    • the first center is chosen randomly from instances
    • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
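The k++ seeding steps above can be sketched as follows. The function name `kpp_init` and the sample points are hypothetical, and the sketch assumes `k` is no larger than the number of distinct instances.

```python
import math
import random


def kpp_init(points, k, seed=0):
    """Pick k starting centers using squared-distance-weighted sampling."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random from the instances.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each instance to its closest existing center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Already-chosen instances have weight 0, so they are never re-picked.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical 2-D instances forming three loose groups.
sample_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
start_centers = kpp_init(sample_points, k=3)
```

Because far-away instances are favored but not forced, a single outlier is less likely to be picked than with a pure "farthest point" rule.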

Page 24: BSSML16 L3. Clusters and Anomaly Detection

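Scaling Matters

[Figure: houses plotted by price vs. number of bedrooms; one pair of points is separated by d = 160,000 along the price axis, another by d = 1 along the bedrooms axis]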
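To see why scaling matters, the sketch below compares the Euclidean distance between two houses before and after standardizing each feature. The house data is hypothetical, and z-score standardization is just one common remedy chosen for illustration; the slides do not say which scaling method is used.

```python
import math
import statistics

# Hypothetical houses: (price, number of bedrooms).
houses = [(240_000, 2), (400_000, 3), (310_000, 4), (520_000, 5)]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Unscaled: the 160,000 price gap swamps the 1-bedroom difference.
raw_distance = math.dist(houses[0], houses[1])

# Scaled: both features contribute on a comparable footing.
scaled_rows = standardize(houses)
scaled_distance = math.dist(scaled_rows[0], scaled_rows[1])
```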

Page 25: BSSML16 L3. Clusters and Anomaly Detection


Other Tricks

• What is the distance to a “missing value”?

• What is the distance between categorical values?

• What is the distance between text features?

• Does it have to be Euclidean distance?

• Unknown “K”?

Page 26: BSSML16 L3. Clusters and Anomaly Detection

Distance to Missing?

• Nonsense! Try replacing missing values with:
    • Maximum
    • Mean
    • Median
    • Minimum
    • Zero
• Ignore instances with missing values
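A minimal sketch of the replacement strategies listed above, using `None` to mark a missing value. The function name `impute` and the strategy keys are assumptions made for this example.

```python
import statistics


def impute(column, strategy="mean"):
    """Replace None entries in a numeric column with a single fill value."""
    present = [v for v in column if v is not None]
    strategies = {
        "maximum": max,
        "mean": statistics.mean,
        "median": statistics.median,
        "minimum": min,
        "zero": lambda values: 0,
    }
    fill = strategies[strategy](present)
    return [fill if v is None else v for v in column]


def drop_missing(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]
```

For example, `impute([1, None, 3], "mean")` fills the gap with the mean of the observed values, while `drop_missing` removes the whole instance instead.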

Page 27: BSSML16 L3. Clusters and Anomaly Detection


Distance to Categorical?

• Special distance function

• if xA == xB then x distance = 0 (or scaling value) else x distance = 1

• Assign centroid the most common category of the member instances

Approach: similar to “k-prototypes”

Page 28: BSSML16 L3. Clusters and Anomaly Detection

Distance to Categorical?

Compute Euclidean distance between discrete vectors

              feature_1   feature_2   feature_3
instance_1    red         cat         ball
instance_2    red         cat         ball
instance_3    red         cat         box
instance_4    blue        dog         fridge

D(instance_1, instance_2) = 0
D(instance_2, instance_3) = 1
D(instance_3, instance_4) = sqrt(3)
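The 0/1 mismatch rule and the "most common category" centroid can be sketched as below, using the four instances from the table. The function names are hypothetical, and the sketch uses a per-field distance of exactly 1 for mismatches (no scaling value).

```python
import math
from collections import Counter

INSTANCES = [
    ("red", "cat", "ball"),     # instance_1
    ("red", "cat", "ball"),     # instance_2
    ("red", "cat", "box"),      # instance_3
    ("blue", "dog", "fridge"),  # instance_4
]


def categorical_distance(a, b):
    """Euclidean distance where each field contributes 0 if equal, else 1."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))


def mode_centroid(members):
    """Centroid whose value in each field is the most common category."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


d_12 = categorical_distance(INSTANCES[0], INSTANCES[1])  # identical rows
d_23 = categorical_distance(INSTANCES[1], INSTANCES[2])  # one field differs
d_34 = categorical_distance(INSTANCES[2], INSTANCES[3])  # all three differ
centroid = mode_centroid(INSTANCES)
```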

Page 29: BSSML16 L3. Clusters and Anomaly Detection

Text Vectors

[Figure: Cosine Similarity scale running from 1 through 0 to -1]

                 "hippo"   "safari"   "zebra"   …
Text Field #1    3         0          1         …
Text Field #2    2         4          0         …
                 0         5          7         …

Features (thousands)

Cosine Distance = 1 - Cosine Similarity
CD(TF1, TF2) = 0.575736
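The cosine distance figure on this slide can be reproduced from the three term counts that are shown. The sketch below assumes Text Field #1 is the row (3, 0, 1) and Text Field #2 is the row (2, 4, 0), and ignores the elided columns.

```python
import math


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)


# Term counts for "hippo", "safari", "zebra" as shown on the slide.
tf1 = (3, 0, 1)
tf2 = (2, 4, 0)
cd = cosine_distance(tf1, tf2)  # approximately 0.575736
```

Here the dot product is 6 and the vector lengths are sqrt(10) and sqrt(20), so the similarity is 6 / sqrt(200), which gives the distance quoted on the slide.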

Page 30: BSSML16 L3. Clusters and Anomaly Detection


Finding K: G-Means

Page 31: BSSML16 L3. Clusters and Anomaly Detection


Finding K: G-Means

Page 32: BSSML16 L3. Clusters and Anomaly Detection

Finding K: G-Means

Let K=2. Keep 1, Split 1. New K=3.

Page 33: BSSML16 L3. Clusters and Anomaly Detection

Finding K: G-Means

Let K=3. Keep 1, Split 2. New K=5.

Page 34: BSSML16 L3. Clusters and Anomaly Detection

Finding K: G-Means

Let K=5. Result: K=5.

Page 35: BSSML16 L3. Clusters and Anomaly Detection


Clustering Demo #2

Page 36: BSSML16 L3. Clusters and Anomaly Detection

Anomaly Detection: Finding the Unusual

Page 37: BSSML16 L3. Clusters and Anomaly Detection

Clusters vs Anomalies

Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
Provide: unlabeled data
Learning Task: rank data by dissimilarity

Page 38: BSSML16 L3. Clusters and Anomaly Detection

Clusters vs Anomalies

Clusters

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar

Anomalies

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.

Page 39: BSSML16 L3. Clusters and Anomaly Detection


Use Cases

• Unusual instance discovery

• Intrusion Detection

• Fraud

• Identify Incorrect Data

• Remove Outliers

• Model Competence / Input Data Drift

Page 40: BSSML16 L3. Clusters and Anomaly Detection

Removing Outliers

• Models need to generalize
• Outliers negatively impact generalization

GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling.

[Diagram: DATASET → ANOMALY DETECTOR → FILTERED DATASET → CLEAN MODEL]

Page 41: BSSML16 L3. Clusters and Anomaly Detection

Diabetes Anomalies

[Workflow diagram nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]

Page 42: BSSML16 L3. Clusters and Anomaly Detection


Anomaly Demo #1

Page 43: BSSML16 L3. Clusters and Anomaly Detection

Intrusion Detection

GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion.

• Dataset of command line history for users
• Data for each user consists of commands, flags, working directories, etc.
• Assumption: Users typically issue the same flag patterns and work in certain directories

[Figure: panels labeled Per User, Per Dir, All User, All Dir]

Page 44: BSSML16 L3. Clusters and Anomaly Detection

Fraud

• Dataset of credit card transactions
• Additional user profile information

GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels.

[Figure: anomaly scores at Card Level, User Level, and Similar User Level]

Page 45: BSSML16 L3. Clusters and Anomaly Detection

Model Competence

• After putting a model into production, data that is being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.

GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted.

At Training Time: [Diagram: DATASET, MODEL, ANOMALY DETECTOR]

At Prediction Time:

Prediction   Confidence   Anomaly Score   Competent?
T            86 %         0.5367          Y
T            84 %         0.7124          N
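The competence check can be sketched as a simple guard around each prediction. The cutoff of 0.6 is an assumption: the slide only shows that a score of 0.5367 is treated as competent and 0.7124 is not, so any threshold between those two values reproduces the table. The class and constant names are hypothetical.

```python
from dataclasses import dataclass

# Assumed cutoff; the slides do not state the actual threshold.
ANOMALY_THRESHOLD = 0.6


@dataclass
class GuardedPrediction:
    prediction: str
    confidence: float
    anomaly_score: float

    @property
    def competent(self) -> bool:
        """Trust the model only when the input looks like the training data."""
        return self.anomaly_score < ANOMALY_THRESHOLD


# The two rows from the slide's table.
first = GuardedPrediction("T", 0.86, 0.5367)
second = GuardedPrediction("T", 0.84, 0.7124)
```

Note that the two predictions have nearly the same confidence, so confidence alone would not reveal that the second input is unlike the training data.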

Page 46: BSSML16 L3. Clusters and Anomaly Detection

Univariate Approach

• Single variable: heights, test scores, etc.
• Assume the value is distributed “normally”
• Compute standard deviation
    • a measure of how “spread out” the numbers are
    • the square root of the variance (the average of the squared differences from the mean)
• Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~3 outliers.
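A minimal sketch of the univariate rule above: flag any value more than a chosen multiple of standard deviations from the mean. The function name and the sample data are hypothetical, and the sketch relies on the same normality assumption as the slide.

```python
import statistics


def stdev_outliers(values, multiple=3.0):
    """Return the values farther than `multiple` standard deviations from the mean."""
    mean = statistics.mean(values)
    # Population standard deviation: the square root of the average
    # squared difference from the mean.
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > multiple * stdev]


# Hypothetical measurements clustered around 10, plus one extreme value.
measurements = [10.0] * 50 + [9.0] * 25 + [11.0] * 25 + [100.0]
flagged = stdev_outliers(measurements)
```

The "~3 outliers per 1000" figure follows from the normal distribution, where roughly 0.27% of values lie more than three standard deviations from the mean.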

Page 47: BSSML16 L3. Clusters and Anomaly Detection

Univariate Approach

[Figure: histogram of frequency vs. measurement, with outliers marked in both tails]

• Available in BigML API

Page 48: BSSML16 L3. Clusters and Anomaly Detection

Benford’s Law

• In real-life numeric sets the small digits occur disproportionately often as leading significant digits.

• Applications include:

• accounting records

• electricity bills

• street addresses

• stock prices

• population numbers

• death rates

• lengths of rivers

• Available in BigML API
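Benford's Law predicts that leading digit d appears with probability log10(1 + 1/d). The sketch below computes that expected distribution and the observed leading-digit frequencies of a list of numbers, so the two can be compared. The function names are hypothetical and this is not the BigML API call mentioned on the slide.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected(digit):
    """Expected share of values whose first significant digit is `digit`."""
    return math.log10(1 + 1 / digit)


def leading_digit(x):
    """First non-zero digit of a non-zero number."""
    digits = Decimal(str(abs(x))).as_tuple().digits
    return next(d for d in digits if d != 0)


def leading_digit_frequencies(values):
    """Observed share of each leading digit 1..9, skipping zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


expected = {d: benford_expected(d) for d in range(1, 10)}
observed = leading_digit_frequencies([1, 10, 100, 2, 30])
```

A large gap between `observed` and `expected` on data that should follow the law, such as accounting records, is a hint that the numbers deserve a closer look.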

Page 49: BSSML16 L3. Clusters and Anomaly Detection


Multivariate Matters

Page 50: BSSML16 L3. Clusters and Anomaly Detection


Multivariate Matters

Page 51: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

Most Unusual?

Page 52: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

[Figure: objects partitioned into “Round”, “Skinny”, and “Corners”. Annotations on the most unusual object: “Skinny” but not “smooth”; no “Corners”; not “Round”]

Key Insight: The “most unusual” object is different in some way from every partition of the features.

Page 53: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

• Human used prior knowledge to select possible features that separated the objects.
    • “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster

Page 54: BSSML16 L3. Clusters and Anomaly Detection

Human Expert

Create features that capture these object differences:

• Length/Width
    • greater than 1 => “skinny”
    • equal to 1 => “round”
    • less than 1 => invert
• Number of Surfaces
    • distinct surfaces require “edges” which have corners
    • easier to count
• Smooth - true or false

Page 55: BSSML16 L3. Clusters and Anomaly Detection

Anomaly Features

Object    Length / Width   Num Surfaces   Smooth
penny     1                3              TRUE
dime      1                3              TRUE
knob      1                4              TRUE
eraser    2.75             6              TRUE
box       1                6              TRUE
block     1.6              6              TRUE
screw     8                3              FALSE
battery   5                3              TRUE
key       4.25             3              FALSE
bead      1                2              TRUE

Page 56: BSSML16 L3. Clusters and Anomaly Detection

Random Splits

[Figure: the objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) separated by splits such as smooth = True, length/width > 5, num surfaces = 6, length/width = 1, and length/width < 2]

Know that “splits” matter - don’t know the order

Page 57: BSSML16 L3. Clusters and Anomaly Detection

Isolation Forest

Grow a random decision tree until each instance is in its own leaf.

[Figure: random tree with a Depth axis; shallow leaves are “easy” to isolate, deep leaves are “hard” to isolate]

Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
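The idea can be sketched in plain Python on the anomaly feature table from the slides. Several details here are assumptions rather than anything stated in the deck: the score uses the normalization 2^(-average depth / c(n)) from the original Isolation Forest paper, Smooth is encoded as 1.0/0.0, no subsampling is done, and instead of materializing whole trees the code follows only the branch that contains the instance being scored. BigML's actual implementation may differ.

```python
import math
import random

# Anomaly feature table: (length / width, num surfaces, smooth as 1.0/0.0).
OBJECTS = {
    "penny": (1.0, 3.0, 1.0), "dime": (1.0, 3.0, 1.0), "knob": (1.0, 4.0, 1.0),
    "eraser": (2.75, 6.0, 1.0), "box": (1.0, 6.0, 1.0), "block": (1.6, 6.0, 1.0),
    "screw": (8.0, 3.0, 0.0), "battery": (5.0, 3.0, 1.0), "key": (4.25, 3.0, 0.0),
    "bead": (1.0, 2.0, 1.0),
}


def c(n):
    """Expected depth needed to isolate one of n instances (normalizer)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def isolation_depth(point, data, rng, depth=0):
    """Depth at which random splits leave `point` alone in its partition."""
    if len(data) <= 1:
        return float(depth)
    ranges = [(min(col), max(col)) for col in zip(*data)]
    varying = [f for f, (lo, hi) in enumerate(ranges) if lo < hi]
    if not varying:
        # Identical instances cannot be split apart; add the expected rest.
        return depth + c(len(data))
    f = rng.choice(varying)            # random feature
    split = rng.uniform(*ranges[f])    # random split value within its range
    same_side = [r for r in data if (r[f] < split) == (point[f] < split)]
    return isolation_depth(point, same_side, rng, depth + 1)


def anomaly_scores(data, trees=200, seed=0):
    """Average the isolation depth over many random trees, then normalize."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for point in data:
        mean_depth = sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees
        scores.append(2 ** (-mean_depth / c(n)))
    return scores


names = list(OBJECTS)
scores = dict(zip(names, anomaly_scores([OBJECTS[n] for n in names])))
ranked = sorted(scores, key=scores.get, reverse=True)  # most anomalous first
```

Instances that are isolated after only a few splits get a small average depth and therefore a score closer to 1.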

Page 58: BSSML16 L3. Clusters and Anomaly Detection

Isolation Forest Scoring

       f_1    f_2   f_3
i_1    red    cat   ball
i_2    red    cat   ball
i_3    red    cat   box
i_4    blue   dog   pen

[Figure: isolation depths D = 3, D = 6, and D = 2 from different trees, combined into a Score]

Page 59: BSSML16 L3. Clusters and Anomaly Detection

Model Competence

• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means you cannot trust the model.

Prediction   Confidence   Anomaly Score   Competent?
T            86 %         0.5367          Y
T            84 %         0.7124          N

[Diagram nodes: OPEN LOANS, CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR, PREDICTION, ANOMALY SCORE]

Page 60: BSSML16 L3. Clusters and Anomaly Detection


Anomaly Demo #2

Page 61: BSSML16 L3. Clusters and Anomaly Detection