DECEMBER 8-9, 2016
BigML, Inc 2
Poul Petersen CIO, BigML, Inc.
Cluster Analysis: Finding Similarities
Trees vs Clusters
Trees/LR (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity
Trees vs Clusters
Trees: Inputs “X” (the four measurements), Label “Y” (species)
sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …
Learning Task: Find function “f” such that: f(X)≈Y
Clusters: same inputs, no label
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
Use Cases
• Customer segmentation
• Item discovery
• Similarity
• Recommender
• Active learning
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a
higher percentage of high LTV users. Since they have similar usage
patterns, the remaining users in these clusters may be good
candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases
• Assumption: Usage correlates to LTV
[Figure: clusters labeled 0%, 3%, 1%]
Item Discovery
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have
similar taste.
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
[Figure labels: Smoky, Fruity]
Clustering Demo #1
Similarity
GOAL: Cluster the loans by application profile to rank loan
quality by percentage of trouble loans in population
• Dataset of Lending Club Loans
• Mark any loan that is currently late or has ever been late as “trouble”
[Figure: clusters labeled 0%, 3%, 7%, 1%]
Active Learning
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
Human Expert
Cluster into 3 groups…
Human Expert
• Jesa used prior knowledge to select possible features that separated the objects.
  • “round”, “skinny”, “edges”, “hard”, etc.
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
  • met criteria of K=3
  • groups were sufficiently “distant”
  • no crossover
Human Expert
Create features that capture these object differences
• Length/Width
  • greater than 1 => “skinny”
  • equal to 1 => “round”
  • less than 1 => invert
• Number of Surfaces
  • distinct surfaces require “edges” which have corners
  • easier to count
Clustering Features
Object   Length / Width  Num Surfaces
penny    1               3
dime     1               3
knob     1               4
eraser   2.75            6
box      1               6
block    1.6             6
screw    8               3
battery  5               3
key      4.25            3
bead     1               2
Plot by Features
[Figure: the objects plotted by Length / Width and Num Surfaces. Point labels: box, block, eraser, knob, penny, dime, bead, key, battery, screw]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
K=3
Plot by Features
[Figure: the objects plotted by Length / Width and Num Surfaces. Point labels: box, block, eraser, knob, penny, dime, bead, key, battery, screw]
K-Means: Find the “best” (minimum distance) circles that include all points
K-Means Algorithm
K=3
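The iteration can be sketched on the ten objects from the Clustering Features slide. This is a minimal sketch of plain k-means (assign each point to its nearest center, then move each center to the mean of its members), not necessarily the implementation used in the demo. The hand-picked starting centers (penny, box, screw) and the lack of feature scaling are assumptions made for illustration.

```python
import math

# Objects from the "Clustering Features" slide: (length/width, num surfaces).
OBJECTS = {
    "penny": (1.0, 3.0), "dime": (1.0, 3.0), "knob": (1.0, 4.0),
    "eraser": (2.75, 6.0), "box": (1.0, 6.0), "block": (1.6, 6.0),
    "screw": (8.0, 3.0), "battery": (5.0, 3.0), "key": (4.25, 3.0),
    "bead": (1.0, 2.0),
}


def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then recompute the means."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


names = list(OBJECTS)
points = [OBJECTS[n] for n in names]
# Assumption: hand-picked starting centers (penny, box, screw) with K=3.
start = [OBJECTS["penny"], OBJECTS["box"], OBJECTS["screw"]]
labels, centers = kmeans(points, start)
clusters = [sorted(n for n, lab in zip(names, labels) if lab == c) for c in range(3)]
print(clusters)
```

With these starting centers the key first lands with the round objects and only moves to the skinny group on the second pass, which shows why the starting points matter.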
Features Matter
[Figure labels: Metal, Wood, Other]
Convergence
Convergence guaranteed but not necessarily unique
Starting points important (k++)
Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
  • but this is sensitive to outliers
• k++
  • the first center is chosen randomly from instances
  • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from its closest existing cluster center
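The k++ selection described above can be sketched as follows. This is an illustrative sketch only: the function name `kmeans_pp_init` and the fixed seed are assumptions, and the points are the ten objects from the Clustering Features slide.

```python
import math
import random

# (length/width, num surfaces) for the ten objects on the "Clustering Features" slide.
POINTS = [(1, 3), (1, 3), (1, 4), (2.75, 6), (1, 6),
          (1.6, 6), (8, 3), (5, 3), (4.25, 3), (1, 2)]


def kmeans_pp_init(points, k, rng):
    """Pick k starting centers by squared-distance weighted sampling.

    Assumes the data contains at least k distinct points; otherwise every
    weight becomes zero and random.choices raises ValueError.
    """
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Squared distance from each instance to its closest chosen center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Already-chosen instances have weight 0, so they cannot be drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


centers = kmeans_pp_init(POINTS, k=3, rng=random.Random(42))
print(centers)
```

Because the weights grow with squared distance, far-away instances are favored without the hard "farthest point" rule that makes outliers certain picks.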
Scaling Matters
[Figure: houses plotted by price and number of bedrooms. Along the price axis d = 160,000; along the number-of-bedrooms axis d = 1]
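A small sketch of why scaling matters: with raw values the price difference swamps the bedroom difference in a Euclidean distance. The four houses below are hypothetical, and standardizing each column to mean 0 and standard deviation 1 is one common choice of scaling, assumed here for illustration.

```python
import math
import statistics

# Hypothetical houses: (price, number of bedrooms).
houses = [(200_000, 3), (360_000, 4), (250_000, 2), (310_000, 5)]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


raw = math.dist(houses[0], houses[1])
scaled_rows = standardize(houses)
scaled = math.dist(scaled_rows[0], scaled_rows[1])
print(raw)     # dominated by the 160,000 price difference
print(scaled)  # both features now contribute on a comparable scale
```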
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing?
• Nonsense! Try replacing missing values with:
  • Maximum
  • Mean
  • Median
  • Minimum
  • Zero
• Ignore instances with missing values
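The replacement options listed above can be sketched in a few lines. The helper names `impute` and `drop_incomplete`, the use of `None` to mark a missing value, and the sample column are all assumptions for illustration.

```python
import statistics

# One fill-value function per strategy named on the slide.
STRATEGIES = {
    "maximum": max,
    "mean": statistics.mean,
    "median": statistics.median,
    "minimum": min,
    "zero": lambda observed: 0,
}


def impute(column, strategy):
    """Replace None entries with a value computed from the observed entries."""
    observed = [v for v in column if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in column]


def drop_incomplete(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]


column = [80, None, 100, 150]  # hypothetical measurements
print(impute(column, "mean"))
print(impute(column, "median"))
print(drop_incomplete([(1, 2), (None, 3)]))
```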
Distance to Categorical?
• Special distance function
• if xA == xB then x distance = 0 (or scaling value) else x distance = 1
• Assign centroid the most common category of the member instances
Approach: similar to “k-prototypes”
Distance to Categorical?
Compute Euclidean distance between discrete vectors
            feature_1  feature_2  feature_3
instance_1  red        cat        ball
instance_2  red        cat        ball
instance_3  red        cat        box
instance_4  blue       dog        fridge
D(instance_1, instance_2) = 0
D(instance_2, instance_3) = 1
D(instance_3, instance_4) = sqrt(3)
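A short sketch that reproduces the three distances for the four instances, using a per-feature distance of 0 for equal categories and 1 otherwise, plus the "most common category" centroid rule. The function names are assumptions, and no scaling value is applied.

```python
import math
from collections import Counter

INSTANCES = {
    "instance_1": ("red", "cat", "ball"),
    "instance_2": ("red", "cat", "ball"),
    "instance_3": ("red", "cat", "box"),
    "instance_4": ("blue", "dog", "fridge"),
}


def categorical_distance(a, b):
    """Euclidean distance where each differing category contributes 1."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))


def categorical_centroid(members):
    """Most common category per feature among the member instances."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


d12 = categorical_distance(INSTANCES["instance_1"], INSTANCES["instance_2"])
d23 = categorical_distance(INSTANCES["instance_2"], INSTANCES["instance_3"])
d34 = categorical_distance(INSTANCES["instance_3"], INSTANCES["instance_4"])
centroid = categorical_centroid(
    [INSTANCES["instance_1"], INSTANCES["instance_2"], INSTANCES["instance_3"]]
)
print(d12, d23, d34, centroid)
```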
Text Vectors
Cosine Similarity (scale: 1, 0, -1)
Features (thousands):
               "hippo"  "safari"  "zebra"  …
Text Field #1  3        0         1        …
Text Field #2  2        4         0        …
…              0        5         7        …
Cosine Distance = 1 - Cosine Similarity
CD(TF1, TF2) = 0.575736
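The cosine distance on the slide can be checked with a few lines. This sketch uses only the three term counts that are visible ("hippo", "safari", "zebra") and treats the elided columns as absent, which is an assumption; the function name is also illustrative.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


tf1 = (3, 0, 1)  # Text Field #1: counts for "hippo", "safari", "zebra"
tf2 = (2, 4, 0)  # Text Field #2
cd = cosine_distance(tf1, tf2)
print(round(cd, 6))  # 0.575736
```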
Finding K: G-Means
• Let K=2: Keep 1, Split 1, New K=3
• Let K=3: Keep 1, Split 2, New K=5
• Let K=5: K=5
Clustering Demo #2
Poul Petersen CIO, BigML, Inc.
Anomaly Detection: Finding the Unusual
Clusters vs Anomalies
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity
Anomalies (Unsupervised Learning)
Provide: unlabeled data
Learning Task: rank data by dissimilarity
Clusters vs Anomalies
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Find “k” clusters such that the data in each cluster is self similar
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.
Use Cases
• Unusual instance discovery
• Intrusion Detection
• Fraud
• Identify Incorrect Data
• Remove Outliers
• Model Competence / Input Data Drift
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling.
[Diagram: DATASET, ANOMALY DETECTOR, FILTERED DATASET, CLEAN MODEL]
Diabetes Anomalies
[Diagram: workflow with nodes DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
Anomaly Demo #1
Intrusion Detection
GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands, flags, working directories, etc.
• Assumption: Users typically issue the same flag patterns and work in certain directories
[Figure labels: Per User, Per Dir, All User, All Dir]
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels.
[Figure labels: Card Level, User Level, Similar User Level]
Model Competence
• After putting a model into production, the data being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted.
Prediction     T       T
Confidence     86 %    84 %
Anomaly Score  0.5367  0.7124
Competent?     Y       N
[Diagram: At Training Time: DATASET, MODEL, ANOMALY DETECTOR; At Prediction Time: the table above]
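The competence check amounts to comparing each prediction's anomaly score with a cut-off. The slide does not state the cut-off; the value 0.6 below is an assumption chosen so that the two example scores come out as shown, and the function name is illustrative.

```python
# Assumption: illustrative cut-off, not taken from the slides.
ANOMALY_THRESHOLD = 0.6


def is_competent(anomaly_score, threshold=ANOMALY_THRESHOLD):
    """Trust the model's prediction only when the input looks like the training data."""
    return anomaly_score < threshold


for score in (0.5367, 0.7124):
    print(score, "Y" if is_competent(score) else "N")
```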
Univariate Approach
• Single variable: heights, test scores, etc.
• Assume the value is distributed “normally”
• Compute standard deviation
  • a measure of how “spread out” the numbers are
  • the square root of the variance (the average of the squared differences from the mean)
• Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~3 outliers.
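A minimal sketch of the univariate rule, assuming a hypothetical list of heights in centimetres. It also checks the "~3 outliers per 1000 instances" figure against the normal distribution: the two tails beyond 3 standard deviations hold about 0.27% of the mass.

```python
import statistics


def univariate_outliers(values, multiple=3.0):
    """Values more than `multiple` standard deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > multiple * stdev]


# Hypothetical heights in cm, with one implausible entry.
heights = [170, 172, 168, 171, 169, 173, 167, 170, 171, 169, 172, 168, 250]
outliers = univariate_outliers(heights)
print(outliers)

# Expected number of points beyond 3 standard deviations among 1000 normal draws.
expected = 1000 * 2 * (1 - statistics.NormalDist().cdf(3))
print(round(expected, 1))
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples a single extreme value can mask itself.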
Univariate Approach
[Figure: frequency plotted against measurement, with outliers marked in both tails]
• Available in BigML API
Benford’s Law
• In real-life numeric sets the small digits occur disproportionately often as leading significant digits.
• Applications include:
  • accounting records
  • electricity bills
  • street addresses
  • stock prices
  • population numbers
  • death rates
  • lengths of rivers
• Available in BigML API
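Benford's law gives the expected share of leading digit d as log10(1 + 1/d). The sketch below compares that expectation with observed leading-digit frequencies; the function names are illustrative, and the first 100 powers of two stand in for a real dataset as an assumption. It does not use the BigML API.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Benford's law: P(d) = log10(1 + 1/d) for leading digit d in 1..9."""
    return math.log10(1 + 1 / digit)


def leading_digit_frequencies(values):
    """Observed share of each leading digit among nonzero integers."""
    counts = Counter(int(str(abs(v))[0]) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


powers_of_two = [2 ** n for n in range(100)]
observed = leading_digit_frequencies(powers_of_two)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

A dataset whose observed frequencies sit far from the expected column is a candidate for closer inspection, for example fabricated accounting figures.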
Multivariate Matters
Human Expert
Most Unusual?
Human Expert
[Figure: objects partitioned into “Round”, “Skinny” and “Corners”. Annotations: “Skinny” but not “smooth”; No “Corners”; Not “Round”. One object is marked “Most unusual”]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
Human Expert
• Human used prior knowledge to select possible features that separated the objects.
  • “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
Human Expert
Create features that capture these object differences
• Length/Width
  • greater than 1 => “skinny”
  • equal to 1 => “round”
  • less than 1 => invert
• Number of Surfaces
  • distinct surfaces require “edges” which have corners
  • easier to count
• Smooth - true or false
Anomaly Features
Object   Length / Width  Num Surfaces  Smooth
penny    1               3             TRUE
dime     1               3             TRUE
knob     1               4             TRUE
eraser   2.75            6             TRUE
box      1               6             TRUE
block    1.6             6             TRUE
screw    8               3             FALSE
battery  5               3             TRUE
key      4.25            3             FALSE
bead     1               2             TRUE
Random Splits
[Figure: the objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) partitioned by the splits smooth = True, length/width > 5, num surfaces = 6, length/width = 1, length/width < 2]
Know that “splits” matter - don’t know the order
Isolation Forest
Grow a random decision tree until each instance is in its own leaf
[Figure: a tree with a Depth axis; instances in shallow leaves are “easy” to isolate, instances in deep leaves are “hard” to isolate]
Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
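The idea can be sketched on the ten objects from the Anomaly Features slide. This is a simplified illustration, not BigML's implementation: instead of building whole trees it follows one instance down a fresh chain of random splits, it only splits on features that still vary, and it encodes Smooth as 1.0/0.0. The mapping from average depth to a 0..1 score uses the normalization 2^(-depth / c(n)) that is common in the isolation forest literature; whether BigML uses exactly this formula is an assumption.

```python
import math
import random

# (length/width, num surfaces, smooth as 1.0/0.0) from the "Anomaly Features" slide.
OBJECTS = {
    "penny": (1.0, 3.0, 1.0), "dime": (1.0, 3.0, 1.0), "knob": (1.0, 4.0, 1.0),
    "eraser": (2.75, 6.0, 1.0), "box": (1.0, 6.0, 1.0), "block": (1.6, 6.0, 1.0),
    "screw": (8.0, 3.0, 0.0), "battery": (5.0, 3.0, 1.0), "key": (4.25, 3.0, 0.0),
    "bead": (1.0, 2.0, 1.0),
}


def isolation_depth(target, rows, rng, depth=0):
    """Number of random splits needed until `target` is alone (or cannot be split)."""
    if len(rows) <= 1:
        return depth
    # Only features that still vary among the remaining rows can separate them.
    candidates = [
        f for f in range(len(target))
        if min(r[f] for r in rows) < max(r[f] for r in rows)
    ]
    if not candidates:  # identical rows (penny and dime) cannot be separated
        return depth
    f = rng.choice(candidates)
    split = rng.uniform(min(r[f] for r in rows), max(r[f] for r in rows))
    # Keep only the rows that fall on the same side of the split as the target.
    same_side = [r for r in rows if (r[f] < split) == (target[f] < split)]
    return isolation_depth(target, same_side, rng, depth + 1)


def average_path_length(n):
    """Normalizing constant c(n): typical depth of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2 * harmonic - 2 * (n - 1) / n


def anomaly_scores(objects, n_trees=200, seed=0):
    """Average isolation depth over many random runs, mapped to a 0..1 score."""
    rng = random.Random(seed)
    rows = list(objects.values())
    c = average_path_length(len(rows))
    scores = {}
    for name, target in objects.items():
        depths = [isolation_depth(target, rows, rng) for _ in range(n_trees)]
        scores[name] = 2 ** (-(sum(depths) / n_trees) / c)
    return scores


scores = anomaly_scores(OBJECTS)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {score:.3f}")
```

Objects that a single split can cut off, such as the screw with its extreme length/width, reach a leaf quickly and score higher than objects buried among near-duplicates, such as the penny and the dime.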
Isolation Forest Scoring
     f_1   f_2  f_3
i_1  red   cat  ball
i_2  red   cat  ball
i_3  red   cat  box
i_4  blue  dog  pen
[Figure: depths D = 3, D = 6 and D = 2 combined into a Score]
Model Competence
• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means the loan is unlike the modeled loans, so you cannot trust the model's prediction for it.
Prediction     T       T
Confidence     86 %    84 %
Anomaly Score  0.5367  0.7124
Competent?     Y       N
[Diagram: OPEN LOANS are passed to the CLOSED LOAN MODEL for a PREDICTION and to the CLOSED LOAN ANOMALY DETECTOR for an ANOMALY SCORE]
Anomaly Demo #2