Distributed Logistic Model Trees, Stratio Intelligence Mateo Álvarez and Antonio Soriano @StratioBD

Mateo Álvarez Antonio Soriano - events.static.linuxfound.org · Distributed Logistic Model Trees, Stratio Intelligence Mateo Álvarez and Antonio Soriano @StratioBD. Aerospace Engineer,


Distributed Logistic Model Trees, Stratio Intelligence

Mateo Álvarez and Antonio Soriano

@StratioBD


Aerospace Engineer, MSc in Propulsion Systems (UPM), Master in Data Science (URJC).

Working as a data scientist and Big Data developer at Stratio Big Data, in the data science department.

mateo-alvarez


Ph.D. in Telecommunications, MSc in Electronic Systems Engineering and Telecommunication Technologies, Systems and Networks (UPV), and MSc “Big Data Expert” (UTAD).

Working as a data scientist and Big Data developer at Stratio Big Data, in the data science department.

@Phd_A_Soriano


Why use interpretable algorithms instead of “black boxes”

Logistic Regression

Decision Trees

Variance-Bias tradeoff

Metrics

Demo

Logistic Model Trees

Distributed implementation

Cost function & configuration params

Demo


• Why use interpretable algorithms instead of “black boxes”

• Logistic Regression

• Decision Trees

• Variance-Bias tradeoff


Accuracy vs. Explainability

• Medical studies

• Power management

• Financial environment

• Criminal activity

[Figure: logistic regression curve; x-axis: Feature, y-axis: Probability, with the decision threshold marked]

[Figure: the same logistic regression curve, with a region of poor local fit highlighted]

[Figure: logistic regression curve; x-axis: Feature, y-axis: Probability, with the decision threshold marked]
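The thresholded logistic curve on these slides can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function names and the coefficients `WEIGHT` and `BIAS` are hypothetical, chosen only so the curve crosses 0.5 at a feature value of 2.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: float, weight: float, bias: float) -> float:
    """Probability of the positive class for a single feature value."""
    return sigmoid(weight * x + bias)


def classify(x: float, weight: float, bias: float, threshold: float = 0.5) -> int:
    """Return 1 when the probability reaches the threshold, otherwise 0."""
    return int(predict_proba(x, weight, bias) >= threshold)


# Hypothetical coefficients, for illustration only.
WEIGHT, BIAS = 2.0, -4.0

for x in (0.0, 2.0, 4.0):
    print(x, round(predict_proba(x, WEIGHT, BIAS), 3), classify(x, WEIGHT, BIAS))
```

Moving `threshold` changes which feature values are labelled positive without refitting the curve, which is why the threshold can be tuned separately from the model.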

[Figure: decision tree; the root node splits on Feature 1 into two leaf nodes]

[Figure: decision tree; the root node splits on Feature 1, both children split on Feature 2, giving four leaf nodes]

[Figure: the same two-level decision tree, with a region of poor local fit highlighted]

[Figure: mean error vs. model complexity, showing the training error and test error curves]

[Figure: error vs. model complexity, showing Bias², variance, and total error, with the underfitting and overfitting regions marked]


Variance

Bias

Bias: variables that are important for making the predictions are missing.

Variance: overfitting to the sample/training data.

Irreducible error in the prediction.
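The three terms named on these slides (bias, variance, irreducible error) correspond to the standard decomposition of the expected squared error. The notation is an assumption, not taken from the slides: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training sample.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

Only the first two terms depend on the model, so increasing model complexity trades bias for variance while $\sigma^2$ stays fixed.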


• Logistic Model Trees

• Distributed implementation

• Cost function & configuration parameters

• Demo


[Figure: root node annotated with its performance metrics]

[Figure: root node split on Feature 1; performance metrics shown at three nodes]

[Figure: root node split on Feature 1; performance metrics shown at six positions in the tree]

[Figure: root node split on Feature 1, with both children split on Feature 2]

[Figure: root node split on Feature 1, with only one child split on Feature 2]
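As a rough sketch of how a logistic model tree makes a prediction: a sample is routed down the tree by feature splits, and the logistic regression stored at the leaf it reaches produces the label. The slides do not show the data structures, so every name below (`LogisticModel`, `Node`, `predict`) and every coefficient is hypothetical; tree growth and pruning are not covered here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LogisticModel:
    """Logistic regression attached to a tree node (hypothetical structure)."""
    weights: Sequence[float]
    bias: float
    threshold: float = 0.5

    def predict_proba(self, x: Sequence[float]) -> float:
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x: Sequence[float]) -> int:
        return int(self.predict_proba(x) >= self.threshold)


@dataclass
class Node:
    """Tree node: internal nodes split on one feature, leaves only hold a model."""
    model: LogisticModel
    feature: Optional[int] = None      # None marks a leaf
    split_value: float = 0.0
    left: Optional["Node"] = None      # taken when x[feature] <= split_value
    right: Optional["Node"] = None     # taken otherwise


def predict(node: Node, x: Sequence[float]) -> int:
    """Walk down to a leaf, then apply that leaf's logistic regression."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.split_value else node.right
    return node.model.predict(x)


# Illustrative tree: one split on feature 0, a different model in each leaf.
tree = Node(
    model=LogisticModel(weights=[0.0, 0.0], bias=0.0),
    feature=0,
    split_value=0.5,
    left=Node(model=LogisticModel(weights=[0.0, 1.0], bias=-1.0)),
    right=Node(model=LogisticModel(weights=[0.0, -1.0], bias=1.0)),
)

print(predict(tree, [0.0, 2.0]), predict(tree, [1.0, 2.0]))
```

The two leaves use opposite signs on the second feature, which a single global logistic regression could not represent; that is the kind of poor local fit the tree structure is meant to address.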


DISTRIBUTED IMPLEMENTATION

Spark’s Decision Tree (distributed implementation of random forests)

Spark’s Logistic Regression / Weka’s Logistic Regression at the nodes


LMT cost function to set the logistic regression threshold

• AccuracyCostFunction

• ConfusionMatrix

• PrecisionCostFunction

• PrecisionRecallCostFunction

• RocCostFunction

The same cost function is used as the pruning criterion

[Figure: tree nodes annotated with performance metrics]
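One way to picture "a cost function sets the threshold" is to score a set of candidate thresholds and keep the best one. The sketch below is an assumption about the general idea, not the DLMT API: `accuracy_score` and `precision_score` are hypothetical stand-ins for classes such as AccuracyCostFunction and PrecisionCostFunction, and the labels and probabilities are made up.

```python
from typing import Callable, Sequence, Tuple

Confusion = Tuple[int, int, int, int]  # (tp, fp, tn, fn)


def confusion(labels: Sequence[int], probs: Sequence[float], threshold: float) -> Confusion:
    """Count the confusion-matrix cells for one threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = int(p >= threshold)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_score(c: Confusion) -> float:
    tp, fp, tn, fn = c
    return (tp + tn) / (tp + fp + tn + fn)


def precision_score(c: Confusion) -> float:
    tp, fp, _, _ = c
    return tp / (tp + fp) if tp + fp else 0.0


def best_threshold(labels: Sequence[int], probs: Sequence[float],
                   score: Callable[[Confusion], float],
                   candidates: Sequence[float]) -> float:
    """Return the candidate threshold with the highest score."""
    return max(candidates, key=lambda t: score(confusion(labels, probs, t)))


# Made-up labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1, 1, 1]
probs = [0.2, 0.6, 0.55, 0.58, 0.7, 0.9]
candidates = [0.5, 0.65]

print(best_threshold(labels, probs, accuracy_score, candidates))
print(best_threshold(labels, probs, precision_score, candidates))
```

On this toy data the two scores pick different thresholds, which shows why the choice of cost function is exposed as a configuration parameter.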


ADVANTAGES OF THIS IMPLEMENTATION

• Big datasets: the power of Spark to distribute the building of the tree and the logistic regressions

• Medium datasets: distributed tree growth and Weka’s logistic regression

• Small datasets: although distributing the data for the decision tree can be slow, the cost functions can still be used, along with specific optimizations for particular cases


Example of the DLMT algorithm on a synthetic dataset


• Metrics

• Demo


                              PREDICTION
                              Positive           Negative
TRUE CONDITION   Positive     True Positives     False Negatives
                 Negative     False Positives    True Negatives

• True Positive Rate (Recall): TPR = TP/(TP+FN). Insensitive to class imbalance

• False Positive Rate: FPR = FP/(FP+TN). Insensitive to class imbalance

• Precision = TP/(TP+FP). Sensitive to class imbalance

• Accuracy = (TP+TN)/(TP+TN+FP+FN). Sensitive to class imbalance
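The sensitivity claims can be checked numerically. The sketch below applies the four formulas to two made-up confusion matrices that describe the same classifier behaviour (80% of positives caught, 10% of negatives misflagged), the second with ten times as many negatives. The function name `rates` and all counts are illustrative assumptions.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the four metrics from the confusion-matrix cells."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# 100 positives and 100 negatives.
balanced = rates(tp=80, fn=20, fp=10, tn=90)

# Same per-class behaviour, but 1000 negatives instead of 100.
imbalanced = rates(tp=80, fn=20, fp=100, tn=900)

for name in ("tpr", "fpr", "precision", "accuracy"):
    print(name, round(balanced[name], 3), round(imbalanced[name], 3))
```

TPR and FPR are each computed within a single true class, so they do not move when the class ratio changes; precision and accuracy mix both classes and shift with it.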

AUROC (AUC): area under the curve of True Positive Rate (Recall) against False Positive Rate. Insensitive to class imbalance!

[Figure: ROC curve; x-axis: FPR, y-axis: TPR, with the best-performance point marked]
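A minimal sketch of how AUROC can be computed: sweep the threshold from the highest score downwards, record each (FPR, TPR) point, and integrate with the trapezoidal rule. The function names and the toy labels and scores are assumptions for illustration; the second dataset repeats every negative three times to probe the imbalance claim.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy data, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Same scores, but every negative now appears three times.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
labels_imb = labels + [0] * (2 * len(neg_scores))
scores_imb = scores + neg_scores * 2

print(auroc(labels, scores), auroc(labels_imb, scores_imb))
```

Both calls give the same area, because tripling the negatives leaves every FPR fraction unchanged.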

AUPRC: area under the curve of Precision against Recall (True Positive Rate). Sensitive to class imbalance!

[Figure: precision-recall curve; x-axis: Recall, y-axis: Precision, with the best-performance point marked]
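A matching sketch for the precision-recall side. It uses average precision, a common step-wise estimate of AUPRC (each gain in recall is weighted by the precision at that point), rather than any method confirmed by the slides. The function name and the toy data are assumptions; the second dataset repeats every negative three times.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC estimate: sum of (recall gain) * (precision at that point)."""
    pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Same scores, but every negative now appears three times.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
labels_imb = labels + [0] * (2 * len(neg_scores))
scores_imb = scores + neg_scores * 2

print(average_precision(labels, scores), average_precision(labels_imb, scores_imb))
```

The score drops on the imbalanced copy even though the ranking of the samples is unchanged, because the extra negatives lower precision at every recall level they precede.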

[Figure: Benchmark (ABF) diagram; labels: Data, Algorithms, f1 ... fn]


Accuracy vs. Explainability

Performance metrics: AUROC, AUPRC, accuracy

Automatic Benchmarking Framework

[Figure: Benchmark (ABF) diagram; labels: f1 ... fn]


THANK YOU

UNITED STATES

Tel: (+1) 408 5998830

EUROPE

Tel: (+34) 91 828 64 73

[email protected]

www.stratio.com

@StratioBD


[email protected] ARE HIRING

@StratioBD