Mateo Álvarez Antonio Soriano - events.static.linuxfound.org · Distributed Logistic Model Trees,...

Preview:

Citation preview

Distributed Logistic Model Trees, Stratio Intelligence

Mateo Álvarez and Antonio Soriano

@StratioBD

Aerospace Engineer, MSc in

Propulsion Systems (UPM), Master

in Data Science (URJC).

Working as data scientist and Big

Data developer at Stratio Big Data in

the data science department

mateo-alvarez

Ph.D. in Telecommunications, MSc in

Electronic Systems Engineering and

Telecommunication Technologies,

Systems and Networks (UPV), and MSc

“Big Data Expert” (UTAD).

Working as data scientist and Big Data

developer at at Stratio Big Data in the

data science department

@Phd_A_Soriano

Why using interpretable algorithms instead of “black boxes”

Logistic Regression

Decision Trees

Variance-Bias tradeoff

Metrics

Demo

Logistic Model Trees

Distributed implementation

Cost function & configuration params

Demo

• Why use interpretable algorithms instead of “black boxes”

• Logistic Regression

• Decision Trees

• Variance-Bias tradeoff

@StratioBD

Accuracy ExplainabilityVS

Medical Studies Power management Financial environment Criminal activity

Threshold

Pro

bab

ility

Feature

Threshold

Pro

bab

ility

Local bad adjust

Feature

Threshold

Pro

bab

ility

Feature

Feature 1

Root node

Leaf node Leaf node

Feature 1

Root node

Feature 2 Feature 2

Leaf node Leaf node Leaf node Leaf node

Feature 1

Root node

Feature 2 Feature 2

Leaf node Leaf node Leaf node Leaf node

Local bad adjust

Mea

n E

rro

r

Model complexity

Test error

Training error

Model complexity

Variance

Total error

Bias2

Erro

r OverfittingUnderfitting

Variance

Bias

Missing important variables for the problem to make the predictions

Variance

Bias

Overfitting to the sample/training data

Variance

Bias

Irreducible error on prediction

Variance

Bias

• Logistic Model Trees

• Distributed implementation

• Cost function & configuration parametres

• Demo

@StratioBD

Root node

Performance metrics

Performance metrics

Feature 1

Root node

Performance metrics

Performance metrics

Feature 1

Root node

Performance metrics

Performance metrics

Performance metrics

Performance metrics

Performance metrics

Performance metrics

Feature 1

Root node

Feature 2 Feature 2

Feature 1

Root node

Feature 2

Feature 1

Root node

Feature 2

DISTRIBUTED IMPLEMENTATION

Spark’s Decision Tree(distributed implementation of random forests)

Spark’s Logistic Regression / weka’s Logistic Regression on the nodes

LMT Cost function to fix the logistic regression threshold

• AccuracyCostFunction• ConfusionMatrix• PrecisionCostFunction• PrecisionRecallCostFunction• RocCostFunction

The same cost function for pruning criteria

Performance metrics

Performance metrics

Performance metrics

Big datasetsPower of spark to distribute building the

tree and logistic regressions

ADVANTAGES OF THIS IMPLEMENTATION

Medium datasetsDistributed tree growth and weka’s

logistic regression

Small datasetsAlthough it can be slow to

distribute the data for the decision tree, cost functions can be still

used and specific optimization for particular cases

Example of DLMT algorithm in a synthetic dataset

• Metrics

• Demo

@StratioBD

PREDICTION

Positive Negative

TRUE CONDITION

Positive True Positives False Negatives

Negative False Positives True Negatives

Precission

True Positive Rate (Recall)

False Positive Rate

TPR = TP/(TP+FN) Insensitive to unbalance

FPR = FP/(FP+TN) Insensitive to unbalance

Precision = TP/(TP+FP) Sensitive to unbalance

Accuracy = (TP+TN)/(TP+TN+FP+FN) Sensitive to unbalance

PREDICTION

Positive Negative

TRUE CONDITION

Positive True Positives False Negatives

Negative False Positives True Negatives

True Positive Rate (Recall)

False Positive Rate

AUROC (AUC): TPR/FPR -> Insensitive to unbalance! TPR

FPR

Best performance

PREDICTION

Positive Negative

TRUE CONDITION

Positive True Positives False Negatives

Negative False Positives True Negatives

True Positive Rate (Recall)

AUPRC: Precision/TPR -> Sensitive to unbalance!

Precission

Precision

Recall

Best performance

f

f

1

n

BenchmarkABF

Data

Algorithms

@StratioBD

Accuracy ExplainabilityVS

Performance Metrics:AUROC, AUPRC, ACCURACY

Automatic Benchmarking Framework

f

f

1

n

BenchmarkABF

1

2

3 4

THANK YOU

UNITED STATES

Tel: (+1) 408 5998830

EUROPE

Tel: (+34) 91 828 64 73

contact@stratio.com

www.stratio.com

@StratioBD

people@stratio.comWE ARE HIRING

@StratioBD