xAI WORKBENCH TRAINING

Agenda | Part 1

• Introduction to xAI Workbench

• simClassify+

• Data Upload

• Data Type Specifications

• Model Tuning - Hyperparameter Selection

• Auto Tune

• Exhaustive Grid Search

• Thresholding

• Domain Property

• Weighted Recall

• Classification Analysis Reports

Confidential ∙Copyright ©2021 ∙ InRule Technology, Inc. ∙All rights reserved.

Agenda | Part 2

• Clustering

• simCluster+

• Cluster Visualizations

• Cluster/Segment Statistical Analysis

• Classification Operational Issues

• Dataset Update & Merge

• Update Instance

• Copy Instance

• Monitoring

• Sample High-Capacity Production Setup

• Applications with the API

• simClassify

• simCluster


PART 1


Our Expertise: Dynamic Predictive Segmentation (DPS)

Unlike static, traditional ML segmentation, our dynamic solution can change as often as your customers change their behavior.

Our predictive technology clusters and segments based on an action, outcome, or other meaningful business objective.

These outcomes drive a segmentation schema composed of intent and action, not just descriptive statistics.

Traditional machine learning is a black box.

[Figure: TARGET → Model → PREDICTED CLASS: WILL BUY, 97% CONFIDENCE]


We Open the Box – to Deliver Smarter Anything

Our predictions with the WHY® make it easy for analytics and business teams to apply machine learning quickly and effectively with an easy-to-use workbench, explainable outputs, automation & RESTful APIs.

With user-controlled granularity and feature selection, our single-pass prediction and clustering deliver high-precision models with dynamically weighted attributes by segment for ultimate transparency and explainability.

[Figure: TARGET. PREDICTED CLASS: WILL BUY, 97% CONFIDENCE. PREDICTIVE SEGMENTS / DIFFERENTIATING ATTRIBUTES: OUTDOOR ENTHUSIAST, MARRIED, HOUSEHOLD CHILDREN, PET OWNER, SENTIMENT INDEX]

Nearest neighbors in the database inform each prediction with rich insights. Contextual intelligence enables a more relevant decision on the offer or recommendation to present.


HIGH LEVEL OVERVIEW

Host Architecture

[Diagram labels: Data; UI; Engine; File Management; Indexed sM database (fast sM access); Similarity Search (simSearch); Classification (simClassify & simClassify+); Clustering (simCluster & simCluster+); Collaborative Recommendation (simRecommend); Fold & Grid Cross Validation; Specification/Validation Forms; Specification Analyzer; Results; Results Visualization]

• Very fast random data retrieval.
• Automated cross validation and hyperparameter optimization.
• Optimized engine with incredibly fast speeds.
• Easy to use, simplified data identification. Classification and clustering exploration.


SIMCLASSIFY+

simClassify+

• simClassify+ is a learned relevancy function based on proprietary ML techniques.

• It can achieve improved classification accuracy (over simClassify) in some circumstances.

• By using this learning approach, we are able to match or outperform other ML techniques, while still providing transparency at the local, prediction level.


Architecture - simClassify+

[Diagram components: Indexed sM database (fast sM access); Metric Learner; Metric; K Nearest Neighbor; Domain Optimization; Fold & Grid Validation]

Parameters - simClassify+

• Shown with Metric and K Nearest Neighbor: K, Threshold, Class Weighting
• Shown with Domain Optimization: Domain (Feature), Domain Importance, Weighted Recall (Feature)
• Shown with Metric Learner: Iterations, Learning Rate, Feature Subsampling, Feature Focus

simClassify+ Parameters

• In addition to the typical simClassify settings, simClassify+ has parameters for metric learning. We highly recommend tuning parameters through Auto Tune or exhaustive grid search.

Parameter - Description

• Iterations - Number of iterations of the learning algorithm.
• Learning Rate - Step size of the learning algorithm. Small values can lead to longer runtime. Large values can lead to overfitting.
• Feature Subsampling - Ratio of randomly subsampled features at each iteration of the metric learning algorithm. Randomization provides diversity in the preparation of the similarity criteria.
• Feature Focus - Maximum number of dynamically selected features at any given time. This works like a localized feature selection process.
• Class Weighting - UNIFORM or NORMALIZED. UNIFORM gives the same weight to all classes. NORMALIZED takes into account class imbalance.


Feature Subsampling and Focus Drill Down

For this example, let’s assume the following:

• Feature set of training data: X = {x1, x2, ..., x10}
• Feature Subsampling = 0.5
• Feature Focus = 2
• Iterations = 3
• Learning Rate = r

The stages in each iteration are Feature Subsampling, then Feature Focus, then Learning Rate:

• At iteration 0, dX = 0
• Iteration 1: subsample x1, x2, x6, x8, x9; focus x1:x2 and x8:x9; result d’1; update dX = dX + (r * d’1)
• Iteration 2: focus x3:x4 and x9:x10; result d’2; update dX = dX + (r * d’2)
• Iteration 3: focus x1:x2 and x4:x8; result d’3; update dX = dX + (r * d’3)

The output is dX, a measure of the weighted relevancy of features for predicting the result.
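
The accumulation pattern in the drill-down above can be illustrated with a toy Python sketch. This is not the proprietary metric learner: `score_fn` is a hypothetical stand-in for whatever produces each d’ term, the function picks single features rather than interpreting the slide's x1:x2 notation, and all names are assumptions made for illustration.

```python
import math
import random


def accumulate_relevancy(features, score_fn, subsampling, focus, iterations, rate, seed=0):
    """Toy loop showing dX = dX + (rate * d') over several iterations.

    score_fn is a hypothetical relevance score per feature; the real
    d' computation is not described in the training material.
    """
    rng = random.Random(seed)
    d_x = {f: 0.0 for f in features}  # iteration 0: dX = 0
    n_sub = max(1, math.ceil(subsampling * len(features)))
    for _ in range(iterations):
        # Feature Subsampling: draw a random subset of the features.
        subsample = rng.sample(features, n_sub)
        # Feature Focus: keep at most `focus` of the subsampled features.
        focused = sorted(subsample, key=score_fn, reverse=True)[:focus]
        # Learning Rate: scale this iteration's contribution before adding it.
        for f in focused:
            d_x[f] += rate * score_fn(f)
    return d_x


features = [f"x{i}" for i in range(1, 11)]
relevancy = accumulate_relevancy(
    features, score_fn=lambda f: int(f[1:]) / 10,
    subsampling=0.5, focus=2, iterations=3, rate=0.1,
)
```

With Feature Subsampling = 0.5 and Feature Focus = 2, each of the 3 iterations touches at most 2 of 5 subsampled features, so at most 6 entries of the result are non-zero.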


Using simClassify+

• Choose the training data folder and spec file.

• Next, assign hyperparameters.

• We strongly suggest using the Auto-Tune feature to find optimal hyperparameters.

• If using an exhaustive grid to tune hyperparameters, identify the best performing model in the grid based on selected metrics (or tune further if necessary) and create the model.


simClassify+ Results

• simClassify+ returns a prediction with its probability score and the weighted factors behind the prediction.

Special Values

• +/- represents whether that feature/value pair is in the query object (+) or not (-).

• <bias> represents how the class distribution in the underlying data set affects the model’s classification. This should be more prevalent in imbalanced data sets.

• SNLL represents a null value

• SRAR represents a rare value in nominal columns


DATA UPLOAD

Data Upload - GUI

The user can upload data either directly from the GUI, or from the command line using a curl command


Data Upload - Command Line

• The curl command to load data from the command line has the following structure:

curl -u <user>:<password> -v \
  -F folderName=<GUI_Folder_Name> \
  -F fileSize=<File_Size> \
  -F fileName=<File_Name> \
  -F fileData=@<File_Path> \
  <Cloud_API_Protocol>://<Cloud_API_IP>:<Cloud_API_Port>/cloud/uploadFile

• To get the fileSize value, run this command:

stat --printf="%s" <file-name>

• Example of data upload using model input data:

sudo curl -u user_1:password -v \
  -F folderName=DataFolderName \
  -F fileSize=123456 \
  -F fileName=data.tsv \
  -F fileData=@/path/to/data.tsv \
  http://127.0.0.1:9090/cloud/uploadFile
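
When uploads are scripted, the fileSize and fileName fields can be derived from the file itself instead of being typed by hand. The following is a minimal Python sketch that only assembles the curl argument list shown above. The function name and its parameters are assumptions for illustration; the form fields and the /cloud/uploadFile path are taken from the command structure above.

```python
import os


def build_upload_command(user, folder_name, file_path, base_url):
    """Return the curl argument list for the upload command shown above."""
    file_size = os.path.getsize(file_path)   # size in bytes, like stat --printf="%s"
    file_name = os.path.basename(file_path)
    return [
        "curl", "-u", user, "-v",             # with only a user name, curl prompts for the password
        "-F", f"folderName={folder_name}",
        "-F", f"fileSize={file_size}",
        "-F", f"fileName={file_name}",
        "-F", f"fileData=@{file_path}",
        f"{base_url}/cloud/uploadFile",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where curl is installed.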


DATA TYPE SPECIFICATIONS

Spec Types

• ID - Required field; unique identifier of each row

• Class - Specifies the target column in supervised learning.

• Real - Numerical values

• Nominal - Values that do not bear a quantitative relationship with each other (i.e. strings and numbers which represent non-numerical information).

• Ignore - This column will be ignored during model development

• Item_Set - A series of values with weights. Formatted as item1:weight1;item2:weight2;...;itemN:weightN.

• Multi_English - Freeform English text. Numbers and symbols can be included as well.

• Multi_Spanish - Freeform Spanish text. Numbers and symbols can be included as well.

• Multi_Japanese - Freeform Japanese text. Numbers and symbols can be included as well.

• Multi_Plain - Freeform non-language specific text. Numbers and symbols can be included as well.

• Null_Indicator - This spec type transforms the column into a binary indicator of whether each row in the selected column has data in it or not.
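
The Item_Set format above (item1:weight1;item2:weight2;...;itemN:weightN) can be checked before upload with a small parser. This Python sketch assumes weights are numeric and that item names may themselves contain a colon, so it splits on the last colon of each pair; both choices are assumptions, not documented behavior.

```python
def parse_item_set(cell):
    """Parse 'item1:weight1;item2:weight2' into a dict of item -> weight."""
    items = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        name, _, weight = pair.rpartition(":")  # split on the last ':'
        if not name:
            raise ValueError(f"missing item or weight in {pair!r}")
        items[name] = float(weight)  # raises ValueError if the weight is not numeric
    return items


basket = parse_item_set("shoes:2;hats:0.5")
```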


Spec Generation

• Once the data is uploaded, you will have to “spec the data”, or choose the data types for each column in the dataset.

• There are multiple ways to do this within xAI Workbench:

• Manual Selection

• Automated Selection using the Spec Analyzer

• Import previously created spec files

• Navigate to the Specs page by selecting “Edit Specs for Folder” in your desired data folder

• Data file must be uploaded to folder for user to have the ability to go to Specs page.


Spec Analyzer

• The Spec Analyzer will generate recommended data types for each column, along with some high-level analysis for each column.

[Screenshot callouts: Name spec file; Choose model type; Analyze columns]


Spec File Upload

• You can upload a previously created spec file from the “Upload Spec File” tab.

• The format of the spec file can be CSV, TSV, or JSON


MODEL TUNING - HYPERPARAMETER SELECTION

AUTO TUNE

Auto Tune: Setup

• Once your data is uploaded and specs are created, you are ready to begin model development.

• In Auto-Tune mode, the engine intelligently searches a large grid of experiments and only creates model experiments when the probability of successfully increasing the metric of interest is high.

• Current metrics available for Auto-Tune optimization:

• AUC - Area Under the ROC Curve (Binomial)

• Log-Loss (Binomial)

• MCC - Matthews Correlation Coefficient (Binomial)

• Accuracy (Multinomial)

• Recall (Multinomial)

• Precision (Multinomial)

• F1 Score (Multinomial)


Auto Tune: Experiment Types

• xAI Workbench offers two types of experiments when evaluating a model’s performance during grid experiments: N-Fold Experiments & Date Split Experiments

• N-Fold Experiments

• N-Fold cross validation is used to measure model performance.

• Date Split Experiments

• A date column is used to split the training data into tuning, testing, and validation splits. This is useful for time-sensitive data sets.


Auto Tune: Experiment Types - N-Fold

• Fold experiments are one way to measure results during grid execution.

• There are three steps.

• Step 1: You specify the number of folds and the fold seed.


[Figure: 3-Fold Experiment. Fold Experiment Parameters: # of Folds, Fold Seed. For each of Fold 1, Fold 2, and Fold 3, simClassify+ runs Train, then Test, then produces Results.]


• Step 2: For each grid configuration, a fold experiment runs. The training data is split into the number of folds specified. In each iteration, one fold is left out of training for testing. The number of iterations equals the number of folds.


Auto Tune: Experiment Types - Date Split

• Select Column - date column in dataset to use for splitting

• Date Format - format of date field in selected column

• First Split Date - initial split date for testing

• Second Split Date - final split date for testing

[Figure: Tuning - Train, Tuning - Test; Validation - Train, Validation - Test]

• The training data is split into three groups, in time order.
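
Splitting by two dates into three time-ordered groups can be sketched in Python as follows. The group names and the boundary handling (a row dated exactly on a split date goes into the later group) are assumptions for illustration; the material does not state how boundaries are treated or how the groups map to the tuning and validation runs.

```python
from datetime import datetime


def date_split(rows, column, date_format, first_split, second_split):
    """Split rows into three time-ordered groups using two split dates."""
    first = datetime.strptime(first_split, date_format)
    second = datetime.strptime(second_split, date_format)
    groups = {"before_first": [], "between": [], "after_second": []}
    for row in rows:
        when = datetime.strptime(row[column], date_format)
        if when < first:
            groups["before_first"].append(row)
        elif when < second:
            groups["between"].append(row)
        else:
            groups["after_second"].append(row)
    return groups


rows = [{"d": "2021-01-15"}, {"d": "2021-03-01"}, {"d": "2021-06-30"}]
groups = date_split(rows, "d", "%Y-%m-%d", "2021-02-01", "2021-05-01")
```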



• Step 3: Interpret Results - model metrics represent aggregate measures of each record when included in the “testing” fold during the N-Fold experiment.
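
The N-Fold procedure (Steps 1 to 3) can be sketched in Python as follows. The shuffling scheme and function names are assumptions; only the two inputs (number of folds and fold seed) and the leave-one-fold-out loop come from the steps described in this section.

```python
import random


def make_folds(row_ids, n_folds, fold_seed):
    """Step 1: shuffle the rows with the fold seed and deal them into folds."""
    ids = list(row_ids)
    random.Random(fold_seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def fold_experiment(row_ids, n_folds, fold_seed):
    """Step 2: yield (train_ids, test_ids) once per fold."""
    folds = make_folds(row_ids, n_folds, fold_seed)
    for i, test_ids in enumerate(folds):
        train_ids = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train_ids, test_ids


# Step 3: every record is tested exactly once, so metrics can be
# aggregated over all records from their "testing" fold.
splits = list(fold_experiment(range(10), n_folds=3, fold_seed=42))
```

Reusing the same fold seed reproduces the same folds, which keeps results comparable across grid configurations.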


Auto Tune: Execution

• Once an auto-tune experiment has been executed, you can navigate to the Grid Results page to see the current best model.

• To create this model directly from the results page, select the blue “Create Model” button.

• To view all of the results from the underlying grid experiment, select the “All Results” radio button.


Auto Tune: Validation

• For date split experiments, the user can validate multiple model configurations.

• This allows the user to test for overfitting and to see how well the model generalizes on data it hasn’t seen before.

• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the page.


MODEL TUNING - HYPERPARAMETER SELECTION

EXHAUSTIVE GRID SEARCH

Exhaustive Grid Search: Grid Creation

• If “Auto Tune” is turned off (as illustrated in the screenshot), an exhaustive grid search will be performed to test hyperparameter configurations.

• You can edit the parameters to test by selecting the “Edit Initial Parameters” slider.

• This particular grid in the screenshot will test eight distinct hyperparameter configurations.


Exhaustive Grid Search: Grid Results

• A model can be created from any combination of parameters used in grid experiments, directly from the Grid Results table.


Exhaustive Grid Search: Validation

• Just like in “Auto-Tune” mode, the user can send model configurations to validation for date split experiments.

• This allows the user to test for overfitting, and to see how well the model generalizes on data it hasn’t seen before.

• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the page.


THRESHOLDING

[Figure: two panels, each with the axis "Probability of Class 1" running from 0 to 1 and split into a Class 0 region and a Class 1 region. Balanced: Threshold = 0.5. Unbalanced: Threshold = 0.2.]

Thresholding

• Thresholding is a feature used in simClassify and simClassify+

• It acts as a limit to split resulting predictions into a true or false category

• Thresholding only applies to:

• Binary classification

• The positive class of a prediction
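
As a minimal Python sketch of the idea, a threshold turns the probability of the positive class into a binary decision. Treating a probability exactly equal to the threshold as positive is an assumption here.

```python
def classify(prob_class_1, threshold=0.5):
    """Return 1 when the positive-class probability reaches the threshold, else 0."""
    return 1 if prob_class_1 >= threshold else 0


# The same score flips class when the threshold is lowered from 0.5 to 0.2.
balanced = classify(0.3, threshold=0.5)
unbalanced = classify(0.3, threshold=0.2)
```

Lowering the threshold labels more records as Class 1, which is why a lower value is shown for the unbalanced case.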


Thresholding

• Perfection

[Figure: scatter of N and F points with a Threshold line]

Thresholding

• Reality

[Figure: scatter of N and F points with a Threshold line; regions labeled True Negatives, False Negatives, False Positives, True Positives]

Precision

• Quality

[Figure: the same scatter and regions]

Recall

• Quantity

[Figure: the same scatter and regions]

Precision / Recall Tradeoff

• Quantity

[Figure: the same scatter and regions]


F1 Score

• Precision

• Quality of results: How exact were they?

• Recall

• Quantity of results: How complete were they?

• Precision increases at the expense of recall and vice versa.

• F1 Score

• Balance (harmonic mean) of precision and recall
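
These three metrics follow directly from the confusion-matrix counts. A short Python sketch, using the standard definitions (returning 0.0 when a denominator is zero is a convention chosen here, not something the material specifies):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # how exact the positive predictions were
    recall = tp / (tp + fn) if tp + fn else 0.0      # how complete the positive predictions were
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1


# 8 true positives, 2 false positives, 8 false negatives:
# precision is 0.8 and recall is 0.5.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```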


AUC

• Area Under the Receiver Operating Characteristic (ROC) Curve

• Walks through a range of thresholds and plots the Recall and False Positive Rate (FPR) for each.

• The line of equality is a random model.

[Figure: ROC Curve. x-axis: False Positive Rate; y-axis: Recall. Curves: Random Model, Good Model, Better Model.]
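
The threshold walk can be sketched in plain Python: compute (FPR, Recall) at every distinct score, then integrate with the trapezoid rule. This is a generic illustration of ROC/AUC, not the Workbench's implementation, and it assumes labels are 0 or 1 with both classes present.

```python
def roc_points(labels, scores):
    """Return (FPR, Recall) pairs, walking thresholds from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoid-rule area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


area = auc(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A random model tracks the line of equality and scores about 0.5; a model that ranks every positive above every negative scores 1.0.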


DOMAIN PROPERTY

Domain Property

• Configuration set in model training

• Only available for binomial classifiers

• User selects a “real” column from training data. Model will place more importance on this column during training. E.g. transaction dollars, person hours, etc.

Domain Column

• “Real” column to place importance on during model training.

Domain Importance Function

• Global - emphasis on both classes

• Conservative - emphasis on positive class, conservative approach

• Aggressive - emphasis on positive class, aggressive approach


WEIGHTED RECALL

Weighted Recall

Weighted Recall Formula (Positive Class):

Rw = (∑ Selected Metric over True Positives) / (∑ Selected Metric over Condition Positives)

• Metric used in grid results analysis

• User selects a “real” column from training data to evaluate recall from. The underlying logic is the same as standard recall, but you are evaluating the percentage of the chosen metric caught for each class (e.g. % of fraud dollars caught).

• Weighted recall metric will appear in the grid results table, just like other metrics

• Metric name in grid results table

• weighted_recall_*POS_CLASS*

• weighted_recall_*NEG_CLASS*
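
The weighted recall formula can be written as a few lines of Python. The function and argument names are illustrative assumptions; the calculation sums the selected "real" column over true positives and divides by its sum over all records that actually belong to the class.

```python
def weighted_recall(actual, predicted, metric, target_class):
    """Share of the selected metric that was caught for target_class."""
    caught = sum(
        m for a, p, m in zip(actual, predicted, metric)
        if a == target_class and p == target_class
    )
    total = sum(m for a, m in zip(actual, metric) if a == target_class)
    return caught / total if total else 0.0


actual = ["fraud", "fraud", "ok", "fraud"]
predicted = ["fraud", "ok", "ok", "fraud"]
dollars = [100, 50, 10, 250]
share = weighted_recall(actual, predicted, dollars, "fraud")
```

In this example standard recall is 2 of 3 fraud records, while weighted recall is 350 of 400 fraud dollars, because the missed record was a small transaction.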


CLASSIFICATION ANALYSIS REPORTS

Global Feature Analysis

• This report can be accessed through the “Model Analysis” link on the “Model Actions” page

• This report will show you the feature importance at a global level for classification models that you have created.


Batch Query Analysis

• This report can be accessed through the “Model Analysis” link on the “Model Actions” page

• This report will show you the cumulative results, by classification rate, for the results of a batch query file. Typically, this might be done with a separate hold out file, or periodically with a sample from production.

• In this report, the confidence level is multiplied by 100 to get a score between 0 and 100.

[Screenshot callouts, steps 1 to 3: Model Menu; Batch Query; Make Batch Query; Model Analysis; Batch Query Analytics + Report]


Batch Query Analytics


PART 2

CLUSTERING

Unsupervised Clustering

• Unsupervised Clustering identifies groups of similar data objects based on the frequency of shared variables between objects.

[Figure: points clustered by all-feature similarity]


Supervised Clustering

• Supervised Clustering uses xAI Workbench’s simClassify+ to identify groupings of the differences in features between classes.

[Figure: Class 1, Class 2, Cluster 1, Cluster 2; clustered by differentiating features]

Supervised Clustering (cont.)

• If the classes had been different, the clustering would have been different.

[Figure: Class 1, Class 2, Cluster 1, Cluster 2; clustered by differentiating features]


Cluster Visualization

[Screenshot callouts:]

• Class Variable
• Why Factors - most predictive feature-value pairs driving the clusters
• Cluster Label
• Cluster Level Details
• Download Cluster
• View single cluster, or compare two clusters
• Cluster Size


SIMCLUSTER+

simCluster+: Overview

• simCluster+ is our K-Means clustering algorithm for supervised clustering. It can also be used in unsupervised clustering, where Euclidean, Manhattan, and our One Class distance functions are available.

• simCluster+ uses a technique that can create K-Means clustering or K-Spilling clusters, unlike simCluster which uses agglomerative clusters.

• In K-Spilling clusters, K clusters that are more tightly formed around their “mean” centroid will be returned, but if a datapoint is not close enough to the centroids, it will “spill” into secondary clusters.

• The Range Percentile parameter determines if K-Means or K-Spilling will be used.

• If Range Percentile is 1.0, then all data points will be clustered into K clusters. If the Range Percentile is less than 1.0, then dense clusters will be produced and data points that are not within the specified density will spill into clusters beyond K.

• A visual walkthrough of the K-Spilling methodology is included in your user manual.


simCluster+: Parameters

Both Supervised and Unsupervised

• K - Minimum number of clusters. If Distance Percentile is 1.0, K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.

• Feature Focus - The maximum number of dynamically selected features used in any iteration of the clustering algorithm. This works like a localized feature selection.

• Distance Percentile - The maximum distance between the cluster center and a given element of the cluster. 1.0 will produce a K-Means clustering. Values less than 1.0 will create “tighter clusters,” but elements not in the range will “spill” into new clusters. (Range greater than 0 and less than or equal to 1.0.)

• Iterations - Number of iterations of the learning algorithm.

Unsupervised Only

• Max Samples - The maximum number of data points used in each iteration.

Supervised Only

• Learning Rate - Step size of metric learning algorithm. Small values can lead to longer run times and large values can lead to overfitting.

• Feature Subsampling - Ratio of randomly subsampled features in each iteration of the metric learning algorithm. Randomization provides diversity in the resulting similarity metric.

• Class Weighting - UNIFORM gives the same weight to all classes. NORMALIZED takes into account class imbalance.


simCluster+: Distance Percentile

• Distance Percentile sets the maximum distance between an object and the center of the cluster to which it belongs.

• Distance Percentile is a float value between 0 (exclusive) and 1 (inclusive).

• If some elements are not assigned to any cluster due to exceeding the Distance Percentile value, they will be spilled over to the next iteration of cluster assignment.

[Figure: two panels, K=2 with Distance Percentile = 1.0 and K=2 with Distance Percentile < 1.0]
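
The spill behavior can be illustrated with a toy Python sketch. It is not the simCluster+ algorithm: it assumes plain Euclidean distance, fixed centroids, and one particular reading of Distance Percentile (the cutoff is the distance below which that fraction of points fall). All names are hypothetical.

```python
import math


def assign_with_spill(points, centroids, distance_percentile):
    """Assign points to their nearest centroid; points beyond the cutoff spill."""
    nearest = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        idx = min(range(len(centroids)), key=dists.__getitem__)
        nearest.append((idx, dists[idx]))
    # Cutoff: the distance reached by the requested fraction of points.
    ordered = sorted(d for _, d in nearest)
    rank = max(1, math.ceil(distance_percentile * len(ordered)))
    cutoff = ordered[rank - 1]
    assigned = {i: [] for i in range(len(centroids))}
    spilled = []  # candidates for clusters beyond K in a later pass
    for p, (idx, d) in zip(points, nearest):
        if d <= cutoff:
            assigned[idx].append(p)
        else:
            spilled.append(p)
    return assigned, spilled


points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 5)]
centroids = [(0.5, 0), (10.5, 0)]
all_in, none_spilled = assign_with_spill(points, centroids, 1.0)
tight, spilled = assign_with_spill(points, centroids, 0.5)
```

With a value of 1.0 every point stays in one of the K clusters; with 0.5 the distant point (5, 5) spills and would seed an additional cluster.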


simCluster+: K

• K is an integer value that represents the minimum number of clusters to be created.

• If Distance Percentile is 1.0, exactly K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.

[Figure: two panels, K=2 with Distance Percentile = 1.0 and K=2 with Distance Percentile < 1.0]


simCluster+: Unsupervised

• Three distance functions are available for unsupervised clustering:

• Manhattan

• Euclidean

• One Class

• The proprietary One Class distance function is useful when you know the underlying data set has a majority “common” class (e.g. fraud, customer behavior, anomaly detection).


simCluster+: Supervised

• When creating a supervised clustering experiment, you should first find the optimal hyperparameter configuration by tuning the simClassify+ classifier (Auto Tune or Exhaustive Grid Search).

• Once you find your optimal hyperparameter configuration, use it on this page to create your clustering engine.

• If you are using both simClassify+ and simCluster+ for a particular data set/use case, choosing the same hyper parameters will ensure that the learned distance function is consistent between the classification and clustering.


Create Visualization

• Top N: Number of layers (i.e. feature-value pairs) in visualization

• Max Number of Clusters: Maximum number of slices (i.e. clusters)

• Limit: Minimum number of elements for cluster to be visualized


CLUSTER VISUALIZATIONS

View Results

[Screenshot callouts:]

• Cluster Label
• Save cluster label
• See as main cluster (Compare mode)
• Deselect cluster
• Download cluster elements and features
• Most predictive (or frequent if unsupervised) factors in the selected cluster


View Details

• View the most predictive or frequent factors of each cluster, on either a local or global level.


Cluster Comparison

• Select two clusters from the visualization page, and compare the most predictive or frequent factors between the two.


CLUSTER/SEGMENT STATISTICAL ANALYSIS

Cluster Statistics

• If the cluster was created with statistics turned on, the Statistics tab will take you to the cluster statistics view.

Cluster Statistics Choice

• In the cluster statistics view you can select a dataset attribute and see the statistical properties of the attribute values in that cluster.

• You can also compare the numerical value distribution to the overall dataset distribution. If there are significant differences, this may lead to useful insights.

• Selecting the blue + button will show a list of attributes to choose from.

• Once an attribute is chosen, the left side button turns to a red - button and can be used to remove the display of that attribute’s details.

Cluster Statistics

• Number of Laboratory Procedures was one of the most heavily weighted attributes for predictions in this cluster.

• Looking at the distribution of values for this cluster compared to the distribution in the whole dataset, we see that many of the frequently occurring values are not even in the top ten values in the overall dataset.

• As a comparison, gender is distributed very similarly in both this cluster and the overall dataset.


DATASET UPDATE & MERGE

Dataset Update & Merge

• The update and merge features give users the ability to refresh their data for model re-training.

• Files that will be used to update the data must have the exact same header as the original dataset.

• The user has two choices when updating a dataset:

• Appending rows to the current dataset

• Refreshing values in the current dataset (rows in the new dataset must have unique IDs that match rows in the original dataset)
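
The two update choices can be pictured with a small Python sketch over rows held as dictionaries. This only illustrates the rules stated above (identical header, IDs that match for a refresh); the function, the mode names, and the "ID" column name are assumptions, and the Workbench performs the actual merge itself.

```python
def update_dataset(original, new_rows, mode, id_column="ID"):
    """Append rows to a dataset, or refresh existing rows matched by ID."""
    if original and new_rows and list(original[0]) != list(new_rows[0]):
        raise ValueError("update file must have the same header as the original dataset")
    if mode == "append":
        return original + new_rows
    if mode == "refresh":
        by_id = {row[id_column]: row for row in new_rows}
        unknown = set(by_id) - {row[id_column] for row in original}
        if unknown:
            raise ValueError(f"IDs not found in original dataset: {sorted(unknown)}")
        return [by_id.get(row[id_column], row) for row in original]
    raise ValueError(f"unknown mode: {mode!r}")


original = [{"ID": 1, "amount": 10}, {"ID": 2, "amount": 20}]
appended = update_dataset(original, [{"ID": 3, "amount": 30}], "append")
refreshed = update_dataset(original, [{"ID": 2, "amount": 25}], "refresh")
```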


UPDATE INSTANCE

Update Instance

• To re-train a given model on a new data set, select “Update Instance” from the Model Actions page.

• This will create another version of the existing model, in which the user can select to receive queries or not.


COPY INSTANCE

Copy Instance

Step 1

• From the “Your Models” page, choose “Model Actions” for the new model you would like to copy. The new model must first be created as an entirely separate model.

Step 2

• Next, choose “Copy Instance” from the “Models Details” tab.

Step 3

• Next, choose the “Target Instance” you wish to update. This will be the current instance you are using in production. Then, select “Copy Instance”.

Step 4

• After choosing “Copy Instance” in the dialog box, you will be directed to the target instance “View Versions” page. The copied version of the instance will default to “Sleeping” mode.

• To set the copied version to “Current”, which will enable querying from the API, select the “Set as Current” option in its row.

Step 5

• The updated instance is now set to current, and is ready for querying from the API.


MONITORING

Monitoring

• Model monitoring can be set up to notify users if a certain prediction class is experiencing outlier behavior.

• Parameters to be set:

• Class Name - Name of class to monitor

• Packet Size - Calculations are done over packets. A packet is a set of queries.

• Sample Size - Number of predictions for specified class that must occur before monitor activation

• Percentage - Specifies minimum % of packets that have to be marked as outlier to trigger a warning

• Revision Frequency - How often monitoring report will run (in milliseconds)

• zScore - Number of standard deviations the accepted range is from the mean

• Email Address - email address to be notified when a monitoring warning is triggered
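
One way to picture how these parameters interact is the following Python sketch. It is an assumed reading, not the Workbench's monitoring logic: each packet's share of the monitored class is compared with a baseline mean and standard deviation (both supplied by the caller here, which is an assumption), and a warning fires when enough packets fall outside the zScore range.

```python
def should_warn(predictions, class_name, packet_size, sample_size,
                percentage, z_score, baseline_mean, baseline_std):
    """Return True when the share of outlier packets reaches `percentage`."""
    seen = sum(1 for p in predictions if p == class_name)
    if seen < sample_size:
        return False  # monitor not active until enough predictions of the class occur
    # Split the prediction stream into full packets; a trailing partial packet is ignored.
    packets = [
        predictions[i:i + packet_size]
        for i in range(0, len(predictions) - packet_size + 1, packet_size)
    ]
    if not packets:
        return False
    outliers = 0
    for packet in packets:
        share = sum(1 for p in packet if p == class_name) / packet_size
        if abs(share - baseline_mean) > z_score * baseline_std:
            outliers += 1
    return 100.0 * outliers / len(packets) >= percentage


stream = ["fraud", "ok", "ok", "ok"] + ["fraud"] * 4
warn = should_warn(stream, "fraud", packet_size=4, sample_size=3,
                   percentage=50, z_score=2, baseline_mean=0.25, baseline_std=0.1)
```

Here the second packet is entirely "fraud", which lies outside the accepted range of 0.25 plus or minus 0.2, so one of two packets is an outlier and the 50% trigger is met.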


Monitoring


SAMPLE HIGH CAPACITY PRODUCTION SET-UP

Sample High Capacity Production Set-up

• One server is used for ‘Training’, in other words running grids.

• The other four servers are divided into a Master and three query Slaves

• The three slaves are connected to a load balancer, but the master could be added to the load balancer as needed.

• Updated data would be merged into the Training server, and grids are run to determine whether the difference is significant enough to update the query models.

• If there is sufficient change, the hyper parameters and files (updated data) for the model are copied over to the master and a new version of the model is created.

• Either a new model is created in the Master and then Copy Model is used to create a new version,

• Or, Update Model is used to create a new version of the model.

• This new model version will be replicated to the slaves.

• The switchover to the new model can then happen by changing the ‘Current Version’ for the model in the Master, which will replicate to the Slaves.


Sample High Capacity Production Set-up

[Diagram: Client Operations send Queries through the Load Balancer to the Query Servers running xAI Workbench Software (Production Fraud Detect). Updates are merged into the Training Dataset on the Training Server, where training runs. Update Models: hyper-parameters and updated files are copied from the Training Server to the Master Server to update models. The latest version of the models is sent to the Query Servers with rsync. Monitor sends an Alert Email.]


Sample High-Capacity Production Throughput

• Server configuration

• 16 cores

• 128 GB RAM

• ~2 milliseconds per query

• ~20 milliseconds per query with the Whys (dynamic weighted factors)

• Throughput per query server: ~30,000 queries per minute

• Sample system throughput (three servers): ~90,000 queries per minute

• Sample system high-load throughput (three servers plus master): ~120,000 queries per minute
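
The throughput figures follow from the per-query latency. A small Python sketch of the arithmetic, assuming each server handles one query at a time at the quoted latency (a simplification that matches the numbers above):

```python
MS_PER_MINUTE = 60_000


def queries_per_minute(ms_per_query, servers=1):
    """Throughput implied by a per-query latency, summed over identical servers."""
    return servers * MS_PER_MINUTE / ms_per_query


per_server = queries_per_minute(2)           # one query server at ~2 ms per query
three_servers = queries_per_minute(2, 3)     # three query servers
with_master = queries_per_minute(2, 4)       # three query servers plus master
```

Under the same assumption, the ~20 millisecond latency with the Whys would imply roughly 3,000 queries per minute per server.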


APPLICATIONS WITH THE API

API Usage

• Select all API text and copy it.

• Run it from the command line using curl, as displayed in the form.

• Remove :[PASSWORD] from the command and enter the password when prompted.



APPENDIX

Avoid Overfitting

• Overfitting - a model that makes very accurate predictions, but only for a specific dataset. An overfit model does not generalize.

• Three-part approach to avoid overfitting:

• Training Dataset - a set of examples used for learning.

• Validation Dataset - a set of examples used to tune the parameters of a model. Usually these examples are a separate subset of the training dataset. Choose the best model based on the validation dataset metrics.

• Holdout Dataset - a set of examples used only to assess the performance of a fully trained model. These examples are never seen in the training and validation datasets. They are used to test the best model from above to see if model performance holds.


SIMCLASSIFY

simClassify

• simClassify is one of xAI Workbench’s classifiers.

• It accepts queries in the form of a data object with an unknown Class column.

• simClassify uses our similarity engine to identify the nearest neighbors to a queried object and uses the Class field from those objects to predict the Class field for the queried object.


simClassify Settings

• Bins

• Determines how many “buckets” fields with numbers as values will be split into.

• e.g. if you have values from 1-100, 5 bins would give you splits of 20. Higher values can increase accuracy at the risk of overfitting.

• Top Columns

• The number of columns to consider when making the prediction.

• Fields with strings may be broken into multiple columns.

• Higher values can increase accuracy at the cost of speed.

• Classification K

• The number of nearest neighbors to use when making the classification. We recommend the default, CK, which auto-detects the proper value.

• Energy Weight

• Used if one class is expected to be significantly more frequent than others.

• Dense Mode

• The distance function being used by the engine.
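
The Bins example above (values from 1-100 split into 5 bins of 20) can be sketched in Python. Equal-width binning over a half-open range is an assumption made for illustration; the engine's actual binning rule is not described here.

```python
def bin_index(value, low, high, bins):
    """Return the 0-based equal-width bin for value in the half-open range [low, high)."""
    width = (high - low) / bins
    idx = int((value - low) // width)
    return min(max(idx, 0), bins - 1)  # clamp values that fall outside the range


# Values 1-100 with 5 bins: 1-20, 21-40, 41-60, 61-80, 81-100.
first = bin_index(1, low=1, high=101, bins=5)
edge = bin_index(20, low=1, high=101, bins=5)
last = bin_index(100, low=1, high=101, bins=5)
```

More bins give finer splits, which is the source of the accuracy versus overfitting trade-off noted above.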


simClassify Distance Functions

• simClassify can accept any distance function.

• We typically recommend using the SMART distance function.

• SMART learns the relationships between objects based on their class. This means that the dataset is clustered based on outcome, resulting in very clean clusters for predictions to be made from.

• Along with all other settings, the accuracy of various distance functions can be tested in Fold Experiments.


Interpreting simClassify Results

• simClassify returns results as a confidence value based on the distance between the queried object and its neighbors.

• The closer an object is to its neighbors, the more confident the algorithm is that it has the correct classification.

[Figure: Queried Objects in the Class 1 Region and Class 2 Region. Very close neighbors, high confidence. More distant neighbors, lower confidence.]


Interpreting simClassify Results (cont.)

• Along with the prediction, simClassify provides the weighted factors which support that prediction.

• In this example, the result and factors would be something like:

Class 1, .95 confidence

Factor        Weight
Circle        1.5
Medium Size   0.7
Yellow        0.2

[Figure: queried object shown in the Class 1 Region, next to the Class 2 Region]


SIMCLUSTER

simCluster

• simCluster is the xAI Workbench clustering engine.

• Clusters will be different depending on parameters and the distance function used.

• It can produce either supervised or unsupervised clusters.

• Unsupervised clustering can be used for data analysis and exploration. It can reveal complex patterns and relationships in data.

• Supervised clustering is clustering based on classifications. It will identify the features that differentiate classes from each other. This is a very powerful way to visualize what a classification engine is doing and can be used to identify groups and subgroups in data.

• Application examples: anomaly detection, customer segmentation


simCluster Parameters

• Processing Recipe - The distance function to be used for clustering.

• Sim Cluster Range - The maximum distance between the center of a cluster and an object on its edge.

• Sim Cluster Iterations - The number of passes made by the algorithm to identify cluster centers.

• Sim Cluster Percentage - The percentage of data to use for identifying new cluster centers during each iteration.


Processing Recipe

• The Processing Recipe is the distance function that is used to determine the relationship between objects.

• simCluster has access to two distance functions by default (on the platform):

• Universal is the unsupervised function. It clusters based on the frequency of shared variables between objects.

• Dense is the supervised function. It detects the variables that are most critical for differentiating classes and clusters based on those.

• Additional distance functions can be used through the API or can be added by request.


simCluster Range

• simCluster works by identifying data objects near the center of clusters and then measuring the distance from other data objects to those centers.

• simCluster Range sets the maximum distance between an object and the center of the cluster to which it belongs.

• simCluster Range is a float value between 0 (exclusive) and 1 (inclusive).

[Figure: Centroid of Cluster with boundaries at Range = 0.4 and Range = 0.5]

● At range 0.4 only the green objects will be in the cluster.

● At 0.5 the blue objects will be, too.

● Neither setting will add the red object to the cluster.


simCluster Iterations and Percentage

• To boost performance, simCluster creates and populates clusters in multiple iterations.

• In the first iteration, simCluster will take an amount of data equal to the simCluster Percentage and identify the center of any clusters in that subset. simCluster will then attempt to populate those clusters with all of the data.

• It will then take any data that cannot be placed in those clusters (distance exceeds simCluster Range) and attempt to identify new clusters.

• The number of times this process is repeated is the simCluster Iterations parameter.

• simCluster Percentage is a float value between 0 (exclusive) and 1 (inclusive).

• simCluster Iterations is an integer value greater than or equal to 1.

