xAI WORKBENCH TRAINING
Agenda | Part 1
• Introduction to xAI Workbench
• simClassify+
• Data Upload
• Data Type Specifications
• Model Tuning - Hyperparameter
Selection
• Auto Tune
• Exhaustive Grid Search
• Thresholding
• Domain Property
• Weighted Recall
• Classification Analysis Reports
Confidential ∙Copyright ©2021 ∙ InRule Technology, Inc. ∙All rights reserved.
Agenda | Part 2
• Clustering
• simCluster+
• Cluster Visualizations
• Cluster/Segment Statistical
Analysis
• Classification Operational Issues
• Dataset Update & Merge
• Update Instance
• Copy Instance
• Monitoring
• Sample High-Capacity
Production Setup
• Applications with the API
• simClassify
• simCluster
PART 1
Unlike static, traditional ML segmentation, our dynamic solution
can change as often as your customers change their behavior.
Our predictive technology clusters and segments based on
an action, outcome or other meaningful business objective.
These outcomes drive a segmentation schema comprised
of intent and action – not just descriptive statistics.
Our Expertise: Dynamic Predictive Segmentation (DPS)
Traditional machine learning is a black box.
[Figure: a target record goes into an opaque model, which returns only a predicted class (WILL BUY) with 97% confidence.]
We Open the Box – to Deliver Smarter Anything
Our predictions with the WHY® make it easy for analytics and business teams to apply machine learning quickly and
effectively with an easy-to-use workbench, explainable outputs, automation & RESTful APIs.
With user-controlled granularity and feature selection, our single-pass prediction and clustering deliver high precision
models with dynamically weighted attributes by segment for ultimate transparency and explainability.
[Figure: for the same target, the output shows the predicted class (WILL BUY, 97% confidence) together with its predictive segments and differentiating attributes: Outdoor Enthusiast, Married, Household Children, Pet Owner, Sentiment Index.]
Nearest neighbors in the database inform each prediction with rich insights. This contextual intelligence enables a more relevant decision on the offer or recommendation to present.
HIGH LEVEL OVERVIEW
Host Architecture
[Figure: host architecture diagram with Data, Engine, and UI layers.]
• Data: File Management, indexed sM database, fast sM access. Very fast random data retrieval.
• Engine: Similarity Search (simSearch), Classification (simClassify & simClassify+), Clustering (simCluster & simCluster+), Collaborative Recommendation (simRecommend), Fold & Grid Cross Validation. Optimized engine with incredibly fast speeds. Automated cross validation and hyperparameter optimization.
• UI: Specification/Validation Forms, Specification Analyzer, Results, Results Visualization. Easy to use, simplified data identification. Classification and clustering exploration.
SIMCLASSIFY+
simClassify+
• simClassify+ is a learned relevancy function based on proprietary ML
techniques.
• Can get improved classification accuracy (over simClassify) in some
circumstances.
• By using this learning approach, we are able to match or outperform
other ML techniques, while still providing transparency at the local,
prediction level.
Architecture - simClassify+
[Figure: simClassify+ architecture, built from the indexed sM database with fast sM access, Similarity Search (simSearch), Classification (simClassify), and Fold & Grid Validation.]
Parameters - simClassify+
[Figure: simClassify+ parameters, shown in three build steps. Components: indexed sM database, fast sM access, Similarity Search (simSearch), Fold & Grid Validation, Metric Learner, Metric, K Nearest Neighbor, Domain Optimization, Classification (simClassify). Parameters: K, Threshold, Class Weighting, Domain (Feature), Domain Importance, Weighted Recall (Feature), Iterations, Learning Rate, Feature Subsampling, Feature Focus.]
simClassify+ Parameters
• In addition to the typical simClassify settings, simClassify+ has parameters for metric learning. We highly recommend tuning parameters through Auto Tune or Exhaustive Grid Search.
Parameter - Description
• Iterations - Number of iterations of the learning algorithm.
• Learning Rate - Step size of the learning algorithm. Small values can lead to longer runtime. Large values can lead to overfitting.
• Feature Subsampling - Ratio of randomly subsampled features at each iteration of the metric learning algorithm. Randomization provides diversity in the preparation of the similarity criteria.
• Feature Focus - Maximum number of dynamically selected features at any given time. This works like a localized feature selection process.
• Class Weighting - UNIFORM or NORMALIZED. UNIFORM gives the same weight to all classes. NORMALIZED takes into account class imbalance.
Feature Subsampling and Focus Drill Down
For this example, let’s assume the following:
• Feature set of training data: X = {x1, x2, ..., x10}
• Feature Subsampling = 0.5
• Feature Focus = 2
• Iterations = 3
• Learning Rate = r
[Figure: three iterations of the learning algorithm.]
• At iteration 0, dX = 0.
• Iteration 1: subsampled features x1, x2, x6, x8, x9; focus on x1:x2 and x8:x9; result d’1; update dX = dX + (r * d’1).
• Iteration 2: subsampled features x3, x4, x7, x9, x10; focus on x3:x4 and x9:x10; result d’2; update dX = dX + (r * d’2).
• Iteration 3: subsampled features x1, x2, x3, x4, x8; focus on x1:x2 and x4:x8; result d’3; update dX = dX + (r * d’3).
The output is dX, a measure of the weighted relevancy of features for predicting the result.
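The update pattern in this drill down can be sketched in a few lines. This is only an illustration of the loop structure (subsample, focus, accumulate with the learning rate), not the proprietary metric learner: the function name, the `score_fn` stand-in for the learning step, and the choice to keep the top single features rather than the x1:x2 style pairs on the slide are all assumptions.

```python
import random

def accumulate_relevancy(features, subsampling, focus, iterations,
                         learning_rate, score_fn, seed=0):
    """Hypothetical sketch: accumulate per-feature relevancy dX over iterations."""
    rng = random.Random(seed)
    d_x = {f: 0.0 for f in features}                 # iteration 0: dX = 0
    sample_size = round(len(features) * subsampling)
    for _ in range(iterations):
        # Feature Subsampling: draw a random subset of the features.
        candidates = rng.sample(features, sample_size)
        # Feature Focus: keep at most `focus` of the best-scoring candidates.
        focused = sorted(candidates, key=score_fn, reverse=True)[:focus]
        for f in focused:
            # dX = dX + (r * d'), with score_fn standing in for d'.
            d_x[f] += learning_rate * score_fn(f)
    return d_x
```

With ten features, Feature Subsampling = 0.5, Feature Focus = 2, and Iterations = 3, each iteration looks at five features and updates at most two of them, so at most six features end up with a non-zero weight.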
Using simClassify+
• Choose training data folder and spec file
• Next, assign hyper-parameters
• We strongly suggest using the Auto-Tune feature
to find optimal hyper-parameters
• If using an exhaustive grid to tune hyper
parameters, identify the best performing model
in the grid based on selected metrics (or tune
further if necessary) and create the model.
simClassify+ Results
• simClassify+ returns a prediction with its probability score and the weighted factors behind the prediction.
Special Values
• +/- represents whether that feature/value pair is in the query object (+) or not (-).
• <bias> represents how the class distribution in the underlying data set affects the model’s classification. This should be more prevalent in imbalanced data sets.
• SNLL represents a null value
• SRAR represents a rare value in nominal columns
DATA UPLOAD
Data Upload - GUI
The user can upload data either directly from the GUI, or from the command line using a curl command
Data Upload - Command Line
• The curl command to load data from the command line has the following structure:
curl -u <user>:<password> -v -F folderName=<GUI_Folder_Name> -F fileSize=<File_Size> -F
fileName=<File_Name> -F fileData=@<File_Path>
<Cloud_API_Protocol>://<Cloud_API_IP>:<Cloud_API_Port>/cloud/uploadFile
• In order to get fileSize value, run this command:
stat --printf="%s" <file-name>
• Example of data upload using model input data:
sudo curl -u user_1:password -v -F folderName=DataFolderName -F fileSize=123456 -F
fileName=data.tsv -F fileData=@/path/to/data.tsv http://127.0.0.1:9090/cloud/uploadFile
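The same command can be assembled from a script, which avoids looking up fileSize by hand (`stat --printf` is a GNU option and may not exist on every system). The sketch below only builds the argument list from the structure shown above; the helper name and the demo values are assumptions, and passing `-u` without a password relies on curl prompting for it.

```python
import os
import shlex
import tempfile

def build_upload_command(user, folder_name, file_path, base_url):
    """Build the curl argument list for the /cloud/uploadFile endpoint."""
    file_size = os.path.getsize(file_path)    # size in bytes, like stat's %s
    file_name = os.path.basename(file_path)
    return [
        "curl", "-u", user, "-v",              # no password given: curl prompts
        "-F", f"folderName={folder_name}",
        "-F", f"fileSize={file_size}",
        "-F", f"fileName={file_name}",
        "-F", f"fileData=@{file_path}",
        f"{base_url}/cloud/uploadFile",
    ]

# Demo with a throwaway file; the folder name and URL are the example values.
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(b"id\tvalue\n1\t3.5\n")
demo_cmd = build_upload_command("user_1", "DataFolderName", tmp.name,
                                "http://127.0.0.1:9090")
print(shlex.join(demo_cmd))
os.remove(tmp.name)
```

The list can be handed to `subprocess.run` as-is; printing it with `shlex.join` gives a command line that can be pasted into a shell.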
DATA TYPE SPECIFICATIONS
Spec Types
• ID - Required field; unique identifier of each row
• Class - Specifies the target column in supervised learning.
• Real - Numerical values
• Nominal - Values that do not bear a quantitative relationship with each other (i.e. strings and numbers which represent non-numerical information).
• Ignore - This column will be ignored during model development
• Item_Set - A series of values with weights. Formatted as item1:weight1;item2:weight2;...;itemN:weightN.
• Multi_English - Freeform English text. Numbers and symbols can be included as well.
• Multi_Spanish - Freeform Spanish text. Numbers and symbols can be included as well.
• Multi_Japanese - Freeform Japanese text. Numbers and symbols can be included as well.
• Multi_Plain - Freeform non-language specific text. Numbers and symbols can be included as well.
• Null_Indicator - This spec type transforms the column into a binary indicator of whether each row in the selected column has data in it or not.
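As an illustration of the Item_Set format, the sketch below parses one cell into item/weight pairs. It assumes weights are numeric and tolerates a trailing semicolon; the function name is hypothetical and this is not the Workbench's own parser.

```python
def parse_item_set(cell):
    """Parse 'item1:weight1;item2:weight2' into a dict of item -> float weight."""
    items = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:
            continue                      # tolerate a trailing ';'
        # Split on the last ':' so item names may themselves contain ':'.
        name, _, weight = pair.rpartition(":")
        if not name:
            raise ValueError(f"expected item:weight, got {pair!r}")
        items[name] = float(weight)
    return items
```

For example, `parse_item_set("apple:0.5;pear:2")` gives `{"apple": 0.5, "pear": 2.0}`.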
Spec Generation
• Once the data is uploaded, you will have to “spec the data”, or choose the data types for each column in the dataset.
• There are multiple ways to do this within xAI Workbench:
• Manual Selection
• Automated Selection using the Spec Analyzer
• Import previously created spec files
• Navigate to the Specs page by selecting “Edit Specs for Folder” in your desired data folder
• Data file must be uploaded to folder for user to have the ability to go to Specs page.
Spec Analyzer
• The Spec Analyzer will generate recommended data types for each column, along with providing some high level analysis for each column.
[Figure: Spec Analyzer screenshot with callouts: Name spec file, Choose model type, Analyze columns.]
Spec File Upload
• You can upload a previously created spec file from the “Upload Spec File” tab.
• The format of the spec file can be CSV, TSV, or JSON
MODEL TUNING - HYPERPARAMETER SELECTION: AUTO TUNE
Auto Tune: Setup
• Once your data is uploaded and specs are created,
you are ready to begin model development.
• In Auto-Tune mode, the engine intelligently searches
a large grid of experiments and only creates model
experiments when the probability of successfully
increasing the metric of interest is high.
• Current metrics available for Auto-Tune
optimization:
• AUC - Area Under the ROC Curve (Binomial)
• Log-Loss (Binomial)
• MCC - Matthews Correlation Coefficient (Binomial)
• Accuracy (Multinomial)
• Recall (Multinomial)
• Precision (Multinomial)
• F1 Score (Multinomial)
Auto Tune: Experiment Types
• xAI Workbench offers two types of experiments when evaluating a model’s performance during grid
experiments: N-Fold Experiments & Date Split Experiments
• N-Fold Experiments
• N-Fold cross validation is used to measure model performance
• Date Split Experiments
• A date column is used to split the training data into tuning, testing, and validation splits. This is useful for time sensitive data
sets.
Auto Tune: Experiment Types - NFold
• Fold experiments are one way to measure results during grid execution.
• There are three steps.
• Step 1: You specify the number of folds and the fold seed.
Auto Tune: Experiment Types - NFold
• Step 2: For each grid configuration, a fold experiment runs. The training data is split into the number of folds specified. In each iteration, one fold is left out of training for testing. The number of iterations equals the number of folds.
[Figure: 3-Fold Experiment. Fold experiment parameters: # of Folds, Fold Seed. In each of Fold 1, Fold 2, and Fold 3, simClassify+ trains on the Train portion and produces results on the Test portion.]
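A fold experiment of this kind can be sketched as follows. This is generic N-fold cross validation, not the engine's implementation; the function name and the way the seed drives the shuffle are assumptions.

```python
import random

def n_fold_splits(record_ids, n_folds, fold_seed):
    """Yield (train_ids, test_ids) once per fold."""
    ids = list(record_ids)
    random.Random(fold_seed).shuffle(ids)        # the seed makes the split repeatable
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, test_fold in enumerate(folds):
        # Every fold except the current one is used for training.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test_fold
```

Each record lands in the test fold exactly once, which is why the aggregated metrics cover every record in the training data.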
[Figure: date split timeline with segments Tuning - Train, Tuning - Test, Validation - Train, Validation - Test.]
Auto Tune: Experiment Types - Date Split
• Select Column - date column in dataset to use for splitting
• Date Format - format of date field in selected column
• First Split Date - initial split date for testing
• Second Split Date - final split date for testing
Auto Tune: Experiment Types - Date Split
• The training data is split into three groups, in time order.
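A minimal sketch of a three-way split in time order, using the four settings listed for date split experiments. How the Workbench treats records that fall exactly on a split date, and how the three groups map onto tuning, testing, and validation, is not stated in these slides; the choices below are assumptions, as is the function name.

```python
from datetime import datetime

def date_split(rows, date_column, date_format, first_split, second_split):
    """Split rows into three groups in time order, using two split dates."""
    first = datetime.strptime(first_split, date_format)
    second = datetime.strptime(second_split, date_format)
    early, middle, late = [], [], []
    for row in rows:
        when = datetime.strptime(row[date_column], date_format)
        if when < first:
            early.append(row)      # before the first split date
        elif when < second:
            middle.append(row)     # between the two split dates
        else:
            late.append(row)       # on or after the second split date
    return early, middle, late
```

Here `date_format` uses Python's `strptime` codes (for example `%Y-%m-%d`), which may differ from the format strings the Workbench form expects.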
Auto Tune: Experiment Types - NFold
• Step 3: Interpret Results - model metrics represent aggregate measures of each record when included in
the “testing” fold during N-Fold experiment
Auto Tune: Execution
• Once an auto-tune experiment has been executed, you can navigate to the Grid Results page to see current
best model.
• To create this model directly from the results page, select the blue “Create Model” button.
• To view all of the results from the underlying grid experiment, select the “All Results” radio button.
Auto Tune: Validation
• For date split experiments, the user can validate multiple model configurations.
• This allows the user to test for overfitting and to see how well the model generalizes on data it hasn’t seen
before.
• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in
the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the
page.
MODEL TUNING - HYPERPARAMETER SELECTION: EXHAUSTIVE GRID SEARCH
Exhaustive Grid Search: Grid Creation
• If “Auto Tune” is turned off (as illustrated in the
screenshot), an exhaustive grid search will be
performed to test hyper parameter
configurations.
• You can edit the parameters to test by selecting
the “Edit Initial Parameters” slider.
• This particular grid in the screenshot will test
eight distinct hyper parameter configurations
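An exhaustive grid is the Cartesian product of the values chosen for each parameter. The screenshot's actual values are not reproduced in this text, so the parameter names and values below are hypothetical; three parameters with two values each give the eight configurations mentioned.

```python
from itertools import product

# Hypothetical grid: 2 x 2 x 2 = 8 configurations.
grid = {
    "k": [5, 10],
    "learning_rate": [0.1, 0.3],
    "iterations": [50, 100],
}

# One dict per configuration, covering every combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for config in configs:
    print(config)
```

Adding one more value to any parameter multiplies the number of experiments, which is the main cost of exhaustive search compared with Auto Tune.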
Exhaustive Grid Search: Grid Results
• Each combination of parameters used in grid experiments can be created into a model from the Grid Results
table
Exhaustive Grid Search: Validation
• Just like in “Auto-Tune” mode, the user can send model configurations to validation for date split
experiments.
• This allows the user to test for overfitting, and to see how well the model generalizes on data it hasn’t seen
before
• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in
the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the
page.
THRESHOLDING
[Figure: two axes showing Probability of Class 1 from 0 to 1. Balanced: Threshold = 0.5 separates Class 0 from Class 1. Unbalanced: Threshold = 0.2 separates Class 0 from Class 1.]
Thresholding
• Thresholding is a feature used in simClassify and simClassify+
• It acts as a limit to split resulting predictions into a true or false category
• Thresholding only applies to:
• Binary classification
• The positive class of a prediction
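In code, thresholding is a single comparison on the positive-class probability. Whether a probability exactly equal to the threshold counts as positive is an assumption here, as is the function name.

```python
def classify(prob_positive, threshold=0.5):
    """Return 1 (positive class) if the probability reaches the threshold, else 0."""
    return 1 if prob_positive >= threshold else 0
```

A prediction with probability 0.3 is class 0 at the balanced threshold of 0.5, but becomes class 1 when the threshold is lowered to 0.2, as in the unbalanced example.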
[Figure: points labeled N and F, divided by a Threshold line.]
Thresholding
• Perfection
[Figure: points labeled N and F, divided by a Threshold line into True Negatives, False Negatives, False Positives, and True Positives.]
Thresholding
• Reality
Precision
• Quality
Recall
• Quantity
Precision / Recall Tradeoff
• Quantity
F1 Score
• Precision
• Quality of results: How exact were they?
• Recall
• Quantity of results: How complete were they?
• Precision increases at the expense of recall and vice versa.
• F1 Score
• Balance (harmonic mean) of precision and recall
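These three metrics follow directly from the confusion counts. The sketch below uses the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # how exact the positives are
    recall = tp / (tp + fn) if tp + fn else 0.0      # how complete the positives are
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 0.8 and recall is 0.5, so F1 is 8/13 (about 0.62), closer to the weaker of the two than a plain average would be.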
[Figure: ROC Curve. x-axis: False Positive Rate; y-axis: Recall. Curves: Random Model, Good Model, Better Model.]
AUC
• Area Under the Receiver Operating Characteristic (ROC) Curve
• Walks through a range of thresholds and
plots the Recall and FPR for each.
• Line of equality is a random model.
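The walk through thresholds can be sketched with plain Python. This is the textbook construction (one ROC point per distinct score, area by the trapezoid rule), not the Workbench's own code; the function names are hypothetical and both classes must be present in `labels`.

```python
def roc_points(labels, scores):
    """Return (FPR, recall) points, one per threshold, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos                # both pos and neg must be non-zero
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A model that ranks every positive above every negative scores 1.0, while a random model tracks the line of equality and scores about 0.5.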
DOMAIN PROPERTY
Domain Property
• Configuration set in model training
• Only available for binomial classifiers
• User selects a “real” column from training data. Model will place more importance on this column during training. E.g. transaction dollars, person hours, etc.
Domain Column
• “Real” column to place importance on during model training.
Domain Importance Function
• Global - emphasis on both classes
• Conservative - emphasis on positive class, conservative approach
• Aggressive - emphasis on positive class, aggressive approach
WEIGHTED RECALL
Weighted Recall Formula (Positive Class)
$$R_w = \frac{\sum \text{Selected Metric}_{\text{True Positive}}}{\sum \text{Selected Metric}_{\text{Condition Positive}}}$$
Weighted Recall
• Metric used in grid results analysis
• User selects a “real” column from training data to evaluate recall from. The underlying logic is the same as standard recall, but you are evaluating the percentage of the chosen metric caught for each class (e.g. % of fraud dollars caught).
• Weighted recall metric will appear in the grid results table, just like other metrics
• Metric name in grid results table
• weighted_recall_*POS_CLASS*
• weighted_recall_*NEG_CLASS*
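A small sketch of the weighted recall formula, summing a chosen "real" column instead of counting rows. The function and argument names are hypothetical, and returning 0.0 when the class has no rows is an assumption.

```python
def weighted_recall(actual, predicted, amounts, target_class):
    """Share of the selected metric (e.g. dollars) caught for target_class."""
    caught = sum(a for y, p, a in zip(actual, predicted, amounts)
                 if y == target_class and p == target_class)   # true positives
    total = sum(a for y, a in zip(actual, amounts)
                if y == target_class)                           # condition positives
    return caught / total if total else 0.0
```

Suppose three fraud cases are worth 100, 900, and 50 dollars and the model catches the first and third. Standard recall is 2/3, but weighted recall is 150/1050 (about 14%), because the missed case carries most of the dollars.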
CLASSIFICATION ANALYSIS REPORTS
Global Feature Analysis
• This report can be accessed through the “Model Analysis” link on the “Model Actions” page
• This report will show you the feature importance at a global level for classification models that you have
created.
[Figure: numbered screenshot callouts (1 to 3): Model Menu, Batch Query, Make Batch Query; Model Analysis; Batch Query Analytics, + Report.]
Batch Query Analysis
• This report can be accessed through the “Model Analysis” link on the “Model Actions” page
• This report will show you the cumulative results, by classification rate, for the results of a batch query file. Typically, this might be done with a separate hold out file, or periodically with a sample from production.
• In this report, the confidence level is multiplied by 100 to get a score between 0 and 100.
Batch Query Analytics
PART 2
CLUSTERING
[Figure: data clustered by all-feature similarity.]
Unsupervised Clustering
• Unsupervised Clustering identifies groups of similar data objects based on the frequency of shared variables
between objects.
[Figure: Class 1 and Class 2, clustered by differentiating features into Cluster 1 and Cluster 2.]
Supervised Clustering
• Supervised Clustering uses xAI Workbench’s simClassify+ to identify groupings of the differences in features
between classes.
Supervised Clustering (cont.)
• If the classes had been different, the clustering would have been different.
[Figure: Class 1 and Class 2, clustered by differentiating features into Cluster 1 and Cluster 2.]
Cluster Visualization
[Figure: annotated cluster visualization screenshot with the following callouts.]
• Class Variable
• Cluster Label
• Cluster Size
• Cluster Level Details
• Why Factors - most predictive feature-value pairs driving the clusters
• Download Cluster
• View single cluster, or compare two clusters
SIMCLUSTER+
simCluster+: Overview
• simCluster+ is our K-Means clustering algorithm for supervised clustering. It can also be used in unsupervised clustering, where Euclidean, Manhattan, and our One Class distance functions are available.
• simCluster+ uses a technique that can create K-Means clustering or K-Spilling clusters, unlike simCluster which uses agglomerative clusters.
• In K-Spilling clusters, K clusters that are more tightly formed around their “mean” centroid will be returned, but if a datapoint is not close enough to the centroids, it will “spill” into secondary clusters.
• The Range Percentile parameter determines if K-Means or K-Spilling will be used.
• If Range Percentile is 1.0, then all data points will be clustered into K clusters. If the Range Percentile is less than 1.0, then dense clusters will be produced and data points that are not within the specified density will spill into clusters beyond K.
• A visual walkthrough of the K-Spilling methodology is included in your user manual.
simCluster+: Parameters
Both Supervised and Unsupervised
• K - Minimum number of clusters. If Distance Percentile is 1.0, K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.
• Feature Focus - The maximum number of dynamically selected features used in any iteration of the clustering algorithm. This works like a localized feature selection.
• Distance Percentile - The maximum distance between the cluster center and a given element of the cluster. 1.0 will produce a K-Means clustering. Values less than 1.0 will create “tighter” clusters, but elements not in the range will “spill” into new clusters. (Range: greater than 0 and less than or equal to 1.0.)
• Iterations - Number of iterations of the learning algorithm.
Unsupervised Only
• Max Samples - The maximum number of data points used in each iteration.
Supervised Only
• Learning Rate - Step size of metric learning algorithm. Small values can lead to longer run times and large values can lead to overfitting.
• Feature Subsampling - Ratio of randomly subsampled features in each iteration of the metric learning algorithm. Randomization provides diversity in the resulting similarity metric.
• Class Weighting - UNIFORM gives the same weight to all classes. NORMALIZED takes into account class imbalance.
[Figure: two clusterings with K=2, one with Distance Percentile = 1.0 and one with Distance Percentile < 1.0.]
simCluster+: Distance Percentile
• Distance Percentile sets the maximum
distance between an object and the center of
the cluster to which it belongs.
• Distance Percentile is a float value between 0
(exclusive) and 1 (inclusive).
• If some elements are not assigned to any
cluster due to exceeding the Distance
Percentile value, they will be spilled over to the
next iteration of cluster assignment.
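The spill step can be illustrated with a toy function. The slides do not say how the percentile is turned into a distance cutoff, so this sketch assumes it is the fraction of points, ranked by distance to their nearest centroid, that stay in the cluster. The function name and that interpretation are assumptions, not the simCluster+ implementation.

```python
import math

def split_by_distance_percentile(distances, distance_percentile):
    """distances maps point id -> distance to its nearest centroid.

    Returns (kept, spilled): points inside the cutoff, and points that
    would be passed to the next round of cluster assignment.
    """
    if not 0 < distance_percentile <= 1:
        raise ValueError("Distance Percentile must be in (0, 1]")
    ordered = sorted(distances.values())
    # Index of the largest distance still allowed inside the cluster.
    cutoff_index = max(0, math.ceil(distance_percentile * len(ordered)) - 1)
    cutoff = ordered[cutoff_index]
    kept = {p for p, d in distances.items() if d <= cutoff}
    spilled = set(distances) - kept
    return kept, spilled
```

With a Distance Percentile of 1.0 nothing spills, which matches the plain K-Means case; lowering it pushes the farthest points out to form additional clusters.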
[Figure: K=2 with Distance Percentile = 1.0 versus K=2 with Distance Percentile < 1.0.]
simCluster+: K
• K is an integer value that represents the minimum number of clusters to be created.
• If Distance Percentile is 1.0, exactly K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.
simCluster+: Unsupervised
• Three distance functions available for
unsupervised clustering
• Manhattan
• Euclidean
• One Class
• The proprietary One Class distance function is useful when you know the underlying data set has a majority “common” class (e.g. fraud, customer behavior, anomaly detection).
simCluster+: Supervised
• When creating a supervised clustering experiment, you should first find the optimal hyper parameter configuration by tuning the simClassify+ classifier (Auto Tune or Exhaustive Grid Search).
• Once you find your optimal hyper parameter configuration, use it on this page to create your clustering engine.
• If you are using both simClassify+ and simCluster+ for a particular data set/use case, choosing the same hyper parameters will ensure that the learned distance function is consistent between the classification and clustering.
Create Visualization
• Top N: Number of layers (i.e. feature-value pairs) in visualization
• Max Number of Clusters: Maximum number of slices (i.e. clusters)
• Limit: Minimum number of elements for cluster to be visualized
CLUSTER VISUALIZATIONS
View Results
[Figure: annotated results screenshot with the following callouts.]
• Cluster Label
• Save cluster label
• See as main cluster (Compare mode)
• Deselect cluster
• Download cluster elements and features
• Most predictive (or frequent if unsupervised) factors in the selected cluster
View Details
• View the most predictive or frequent factors of each cluster, on either a local or global level.
Cluster Comparison
• Select two clusters from the visualization page, and compare the most predictive or frequent factors between the two.
CLUSTER/SEGMENT STATISTICAL ANALYSIS
Cluster Statistics
• If the cluster was created with statistics turned on, the Statistics tab will take you to the cluster statistics
view.
Cluster Statistics Choice
• In the cluster statistics view you can select a dataset attribute and see the statistical properties of the
attribute values in that cluster.
• You can also compare numerical value distribution to the overall dataset distribution. If there are significant
differences, this may lead to useful insights.
Cluster Statistics Choice
• Selecting the blue + button will
show a list of attributes to choose
from.
• Once an attribute is chosen, the left
side button turns to a red - button
and can be used to remove the
display of that attribute’s details.
Cluster Statistics
• Number of Laboratory
Procedures was one of the
most heavily weighted
attributes for predictions in
this cluster.
• Looking at the distribution of
values for this cluster
compared to the distribution
in the whole dataset, we see
that many of the frequently
occurring values are not even
in the top ten values in the
overall dataset.
Cluster Statistics
• As a comparison, gender is distributed very similarly in both this
cluster and the overall dataset.
DATASET UPDATE & MERGE
Dataset Update & Merge
• The update and merge features give users
the ability to refresh their data for model
re-training
• Files that will be used to update the data
must have the exact same header as the
original dataset
• The user has two choices when updating a
dataset:
• Appending rows to current dataset
• Refreshing values in current dataset (rows in
new dataset must have rows with a unique
ID that match rows in the original dataset)
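The two update modes can be sketched on rows held as dictionaries. This only illustrates the rules stated above (identical header, and matching IDs for a refresh); the function names and the error handling are assumptions and do not describe how the Workbench performs the merge.

```python
def check_header(current, new_rows):
    """Update files must have exactly the same header as the original dataset."""
    header = list(current[0].keys())
    for row in new_rows:
        if list(row.keys()) != header:
            raise ValueError("update file header must match the original dataset")

def append_rows(current, new_rows):
    """Append mode: add the new rows to the end of the current dataset."""
    check_header(current, new_rows)
    return current + new_rows

def refresh_rows(current, new_rows, id_column):
    """Refresh mode: replace rows whose ID matches a row in the original."""
    check_header(current, new_rows)
    updates = {row[id_column]: row for row in new_rows}
    missing = set(updates) - {row[id_column] for row in current}
    if missing:
        raise ValueError(f"IDs not found in original dataset: {sorted(missing)}")
    return [updates.get(row[id_column], row) for row in current]
```

Append grows the dataset, while refresh keeps the row count the same and changes only the values of rows whose ID appears in the update file.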
UPDATE INSTANCE
Update Instance
• To re-train a given model on a new data set, select “Update Instance” from the Model Actions page.
• This will create another version of the existing model, and the user can choose whether that version receives queries.
COPY INSTANCE
Copy Instance
Step 1
• From “Your Models” page, choose “Model Actions” for the new model you would like to copy. The new model will have to be first
created as an entirely separate model.
Copy Instance
Step 2
• Next, choose “Copy Instance” from the “Models Details” tab.
Copy Instance
Step 3
• Next, choose the “Target Instance” you wish to update. This will be the current instance you are using in production. Then, select
“Copy Instance”.
Copy Instance
Step 4
• After choosing “Copy Instance” in the dialog box, you will be directed to the target instance “View Versions” page. The copied
version of the instance will default to “Sleeping” mode.
• To set the copied version to “Current”, which will enable querying from the API, select the “Set as Current” option in its row.
Copy Instance
Step 5
• The updated instance is now set to current, and is ready for querying from the API.
MONITORING
Monitoring
• Model monitoring can be set up to notify users if a certain prediction class is experiencing outlier behavior.
• Parameters to be set:
• Class Name - Name of class to monitor
• Packet Size - Calculations are done over packets. A packet is a set of queries.
• Sample Size - Number of predictions for specified class that must occur before monitor activation
• Percentage - Specifies minimum % of packets that have to be marked as outlier to trigger a warning
• Revision Frequency - How often monitoring report will run (in milliseconds)
• zScore - Number of standard deviations the accepted range is from the mean
• Email Address - email address to be notified when a monitoring warning is triggered
SAMPLE HIGH CAPACITY PRODUCTION SET-UP
Sample High Capacity Production Set-up
• One server is used for ‘Training’, in other words running grids.
• The other four servers are divided into a Master and three query Slaves
• The three slaves are connected to a load balancer, but the master could be added to the load balancer as needed.
• Updated data is merged into the Training server, and grids are run to determine whether the difference is significant enough to update the query models.
• If there is sufficient change, the hyper parameters and files (updated data) for the model are copied over to the master and a new version of the model is created.
• Either a new model is created in the Master and then Copy Model is used to create a new version,
• Or, Update Model is used to create a new version of the model.
• This new model version will be replicated to the slaves.
• The switchover to the new model can then happen by changing the ‘Current Version’ for the model in the Master, which will replicate to the Slaves.
Sample High Capacity Production Set-up
[Diagram: Client Operations send Queries through a Load Balancer to the Query Servers, which run xAI Workbench Software with the Production Fraud Detect model. Updates are merged into the Training Dataset on the Training Server. Hyperparameters and updated files are copied from the Training Server to the Master Server to update models, and the latest version of the models is rsynced to the Query Servers. A Monitor sends an Alert Email.]
Sample High-Capacity Production Throughput
• Server configuration
• 16 cores
• 128 GB RAM
• ~2 milliseconds per query
• ~20 milliseconds per query with the Whys (dynamic weighted factors)
• Throughput per query server: ~30,000 queries per minute
• Sample system throughput (three servers): ~90,000 queries per minute
• Sample system high-load throughput (three servers plus master): ~120,000 queries per minute
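As a quick check, the per-server figure follows from the ~2 ms latency if each server is assumed to answer queries one after another. That is an assumption, since the slide does not state the concurrency model. The helper name below is illustrative, and the figure with the Whys enabled is derived here rather than taken from the slide.

```python
MS_PER_MINUTE = 60_000

def queries_per_minute(ms_per_query, servers=1):
    """Throughput assuming each server answers queries back to back."""
    return servers * MS_PER_MINUTE // ms_per_query

per_server = queries_per_minute(2)         # one query server
three_servers = queries_per_minute(2, 3)   # three query servers
with_master = queries_per_minute(2, 4)     # master added to the load balancer
with_whys = queries_per_minute(20)         # one server, Whys enabled (derived)

print(per_server, three_servers, with_master, with_whys)
```

Under this assumption the first three values match the ~30,000, ~90,000, and ~120,000 queries per minute quoted above.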
APPLICATIONS WITH THE API
API Usage
• Select all of the API text shown in the form and copy it.
• Run the copied text from the command line using curl, as displayed in the form.
• Remove [PASSWORD] from the command and enter the password when prompted.
APPENDIX
Avoid Overfitting
• Overfitting - a model that makes highly accurate predictions, but only for the specific dataset it was trained on. An overfit model does not generalize.
• Three-part approach to avoid overfitting:
• Training Dataset - a set of examples used for learning.
• Validation Dataset - a set of examples used to tune the parameters of a model. Usually these examples are a separate subset of the training dataset. Choose the best model based on the validation dataset metrics.
• Holdout Dataset - a set of examples used only to assess the performance of a fully trained model. These examples never appear in the training or validation datasets. The holdout set is used to test the best model chosen above to see whether its performance holds.
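The three-part approach can be sketched with a plain shuffle-and-slice split. This is a minimal illustration using only the Python standard library. The 60/20/20 proportions and the function name are assumptions chosen for the example, not a recommendation from the Workbench.

```python
import random

def three_way_split(rows, validation_frac=0.2, holdout_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    n_validation = int(len(shuffled) * validation_frac)
    # The three slices do not overlap, so holdout rows are never seen
    # during training or validation.
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    training = shuffled[n_holdout + n_validation:]
    return training, validation, holdout

training, validation, holdout = three_way_split(range(100))
print(len(training), len(validation), len(holdout))
```

Fixing the seed keeps the split reproducible, so the holdout set stays the same across tuning runs.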
SIMCLASSIFY
simClassify
• simClassify is one of xAI Workbench’s classifiers.
• It accepts queries in the form of a data object with an unknown Class column.
• simClassify uses our similarity engine to identify the nearest neighbors to a queried object and uses the
Class field from those objects to predict the Class field for the queried object.
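The nearest-neighbor idea can be illustrated with a small generic sketch. The Workbench similarity engine is not described here, so the mismatch-count distance, the majority vote, and all names below are stand-ins chosen for illustration, not simClassify itself.

```python
from collections import Counter

def mismatch_distance(a, b):
    """Fraction of fields on which two objects differ (stand-in distance)."""
    return sum(a[k] != b[k] for k in a) / len(a)

def predict_class(query, training_rows, k=3):
    """Predict the Class of `query` from its k nearest labelled neighbors."""
    ranked = sorted(training_rows,
                    key=lambda row: mismatch_distance(query, row["fields"]))
    votes = Counter(row["Class"] for row in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k   # predicted label and its share of the k votes

training_rows = [
    {"fields": {"shape": "circle", "size": "medium", "color": "yellow"}, "Class": "Class 1"},
    {"fields": {"shape": "circle", "size": "medium", "color": "red"}, "Class": "Class 1"},
    {"fields": {"shape": "circle", "size": "small", "color": "yellow"}, "Class": "Class 1"},
    {"fields": {"shape": "square", "size": "large", "color": "blue"}, "Class": "Class 2"},
    {"fields": {"shape": "square", "size": "large", "color": "red"}, "Class": "Class 2"},
]

# The query is a data object whose Class column is unknown.
query = {"shape": "circle", "size": "medium", "color": "yellow"}
label, vote_share = predict_class(query, training_rows)
print(label, vote_share)
```

The vote share returned here is a simple proxy and should not be read as the confidence value simClassify reports.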
simClassify Settings
• Bins
• Determines how many “buckets” fields with numbers as values will be split into.
• e.g. if you have values from 1-100, 5 bins would give you splits of 20. Higher values can increase accuracy at the risk of overfitting.
• Top Columns
• The number of columns to consider when making the prediction.
• Fields with strings may be broken into multiple columns.
• Higher values can increase accuracy at the cost of speed.
• Classification K
• The number of nearest neighbors to use when making the classification. We recommend the default, CK, which auto-detects the proper value.
• Energy Weight
• Used if one class is expected to be significantly more frequent than others.
• Dense Mode
• The distance function being used by the engine.
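To make the Bins example concrete, the sketch below assigns a numeric value to one of several equal-width buckets. Equal-width binning over the range 0 to 100 is an assumption made so that 5 bins give splits of exactly 20. The slide does not say how the Workbench chooses bin edges, and the function name is illustrative.

```python
def bin_index(value, low, high, bins):
    """Return the 0-based equal-width bin for a value in [low, high]."""
    width = (high - low) / bins
    index = int((value - low) / width)
    return min(index, bins - 1)   # the top edge belongs to the last bin

# Five bins over 0..100 give splits of width 20.
for value in (1, 19, 20, 55, 100):
    print(value, bin_index(value, 0, 100, 5))
```

Raising `bins` narrows each bucket, which is the mechanism behind the accuracy versus overfitting trade-off noted above.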
simClassify Distance Functions
• simClassify can accept any distance function.
• We typically recommend using the SMART distance function.
• SMART learns the relationships between objects based on their class. This means that the dataset is clustered based on
outcome, resulting in very clean clusters for predictions to be made from.
• Along with all other settings, the accuracy of various distance functions can be tested in Fold Experiments.
Interpreting simClassify Results
• simClassify returns results as a confidence value based on the distance between the queried object and its neighbors.
• The closer an object is to its neighbors, the more confident the algorithm is that it has the correct classification.
[Figure: queried objects plotted in a Class 1 Region and a Class 2 Region. Very close neighbors, high confidence; more distant neighbors, lower confidence.]
Interpreting simClassify Results (cont.)
• Along with the prediction, simClassify provides the weighted factors which support that prediction.
• In this example, the result and factors would be something like:
Class 1, .95 confidence
Factor          Weight
Circle          1.5
Medium Size     0.7
Yellow          0.2
[Figure: a queried object shown against a Class 1 Region and a Class 2 Region.]
SIMCLUSTER
simCluster
• simCluster is the xAI Workbench clustering engine.
• Clusters will be different depending on parameters and the distance function used.
• It can produce either supervised or unsupervised clusters.
• Unsupervised clustering can be used for data analysis and exploration. It can reveal complex patterns and
relationships in data.
• Supervised clustering is clustering based on classifications. It will identify the features that differentiate
classes from each other. This is a very powerful way to visualize what a classification engine is doing and
can be used to identify groups and subgroups in data.
• Application examples: anomaly detection, customer segmentation
simCluster Parameters
• Processing Recipe - The distance function to be used for clustering.
• Sim Cluster Range - The maximum distance between the center of a cluster and an object on its edge.
• Sim Cluster Iterations - The number of passes made by the algorithm to identify cluster centers.
• Sim Cluster Percentage - The percentage of data to use for identifying new cluster centers during each
iteration.
Processing Recipe
• The Processing Recipe is the distance function that is used to determine the relationship between objects.
• simCluster has access to two distance functions by default (on the platform):
• Universal is the unsupervised function. It clusters based on the frequency of shared variables between objects.
• Dense is the supervised function. It detects the variables that are most critical for differentiating classes and clusters based
on those.
• Additional distance functions can be used through the API or can be added by request.
simCluster Range
• simCluster works by identifying data objects near the center of clusters and then measuring the distance from other data objects to those centers.
• simCluster Range sets the maximum distance between an object and the center of the cluster to which it belongs.
• simCluster Range is a float value between 0 (exclusive) and 1 (inclusive).
[Figure: the centroid of a cluster with boundaries drawn at Range = 0.4 and Range = 0.5.]
• At range 0.4 only the green objects will be in the cluster.
• At 0.5 the blue objects will be, too.
• Neither setting will add the red object to the cluster.
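The effect of the range setting can be shown with a simple membership test. The distances assigned to the green, blue, and red objects below are made-up values picked to reproduce the three cases in the figure. The function name is illustrative, and treating the boundary as inclusive is an assumption.

```python
def in_cluster(distance_to_center, sim_cluster_range):
    """True if an object lies within the range of a cluster center."""
    if not 0 < sim_cluster_range <= 1:
        raise ValueError("simCluster Range must be in (0, 1]")
    return distance_to_center <= sim_cluster_range

# Illustrative distances from the centroid (not taken from the slide).
distances = {"green": 0.35, "blue": 0.45, "red": 0.60}

members_04 = [name for name, d in distances.items() if in_cluster(d, 0.4)]
members_05 = [name for name, d in distances.items() if in_cluster(d, 0.5)]
print(members_04, members_05)
```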
simCluster Iterations and Percentage
• To boost performance, simCluster creates and populates clusters in multiple iterations.
• In the first iteration, simCluster takes an amount of data equal to the simCluster Percentage and identifies the center of any clusters in that subset. simCluster then attempts to populate those clusters with all of the data.
• It then takes any data that cannot be placed in those clusters (distance exceeds simCluster Range) and attempts to identify new clusters.
• The number of times this process is repeated is the simCluster Iterations parameter.
• simCluster Percentage is a float value between 0 (exclusive) and 1 (inclusive).
• simCluster Iterations is an integer value greater than or equal to 1.
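The iterate, sample, and populate loop described above can be sketched as follows. How simCluster actually identifies cluster centers is not stated, so the rule used here (a sampled object becomes a center if it is out of range of every existing center) is a stand-in. The function and parameter names are likewise hypothetical.

```python
import random

def sim_cluster_sketch(points, distance, cluster_range, iterations, percentage, seed=0):
    """Find centers in a sample, populate clusters, repeat on the leftovers."""
    if not (0 < cluster_range <= 1 and 0 < percentage <= 1 and iterations >= 1):
        raise ValueError("parameter out of range")
    rng = random.Random(seed)
    clusters = {}                  # center -> list of member points
    unassigned = list(points)
    for _ in range(iterations):
        if not unassigned:
            break
        # Take a subset whose size is set by the percentage parameter.
        sample_size = max(1, int(len(unassigned) * percentage))
        sample = rng.sample(unassigned, sample_size)
        # Stand-in center detection: a sampled point becomes a center
        # if it is out of range of every center found so far.
        for candidate in sample:
            if all(distance(candidate, c) > cluster_range for c in clusters):
                clusters[candidate] = []
        # Try to place all remaining data; keep whatever does not fit.
        leftover = []
        for p in unassigned:
            nearest = min(clusters, key=lambda c: distance(p, c))
            if distance(p, nearest) <= cluster_range:
                clusters[nearest].append(p)
            else:
                leftover.append(p)
        unassigned = leftover
    return clusters, unassigned

def dist(a, b):
    """Absolute difference, used as a toy distance on values in [0, 1]."""
    return abs(a - b)

points = [0.0, 0.05, 0.1, 0.5, 0.55, 0.6, 0.95]
clusters, unassigned = sim_cluster_sketch(points, dist, 0.1, 3, 1.0)
print(len(clusters), unassigned)
```

With a percentage of 1.0 every object is considered as a possible center in the first pass, so nothing is left over. Smaller percentages trade completeness in early passes for speed, which is why further iterations are needed.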