23
Cross Validation Improvements in TMVA Mohammad Uzair Supervisors: Dr. Lorenzo MONETA Mr. Hans Kim ALBERTSSON BRANN CERN SUMMER STUDENT FINAL PRESENTATION

in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Cross Validation Improvements in TMVAMohammad Uzair

Supervisors:Dr. Lorenzo MONETAMr. Hans Kim ALBERTSSON BRANN

CERN SUMMER STUDENTFINAL PRESENTATION

Page 2: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

About Me:● Name : Mohammad Uzair

● Country : Pakistan

● Studies : Final Year of bachelor’s degree

in Computer Science

● University : National University of

Sciences and Technology (NUST)

● Interested in : Artificial Intelligence and

Machine Learning in Computer Vision

2

Page 3: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Tasks Assigned: Improve Cross Validation in TMVA

● New tutorial introducing Cross Validation in TMVA and make it

available through SWAN.

● Improve presentation of tutorial by extending the plotting

functionality (Introducing the feature to draw an average ROC

Curve).

● Improve CV fold generation targeting unbalanced datasets

(Introducing the feature of Stratified Splitting)3

Page 4: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Validation in Machine Learning

● Divide the dataset into Train and Test sets.

● Tries to estimate the expected error of the model.

● Allows us to get an honest assessment of the

trained model.

4

Page 5: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Validation in Machine Learning

02

03

Dataset (N=20)

Training Set (70 %)

Test Set (30 %)

5

Page 6: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Drawbacks of Validation

● Test errors can be highly variable depending on

how much data we use for each set.

● Model Developed on only a subset of data. Ideally

we want 100% data for training and 100% for

validation.

6

Page 7: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

K-Fold Cross Validation

Dataset (N=20)

Training Set

Validation Fold 1

7

Page 8: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

K-Fold Cross Validation

02

03

Dataset (N=20)

Training Set

Validation Fold 2

8

Page 9: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

K-Fold Cross Validation

Dataset (N=20)

Training Set

Validation Fold 3

9

And so on….

Page 10: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Tutorial Introducing TMVA Cross Validation

10

Page 11: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Tutorial introducing TMVA Cross Validation

11

● The tutorial was outdated.

● Was not easy to understand for new users.

● A new and updated tutorial in python jupyter notebook

with proper explanation so that it is easy to use and

understand and make it available through SWAN.

Page 12: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Jupyter Notebook for basic Tutorial onTMVA Cross Validation

02

03

12

Page 13: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Jupyter Notebook for basic Tutorial onTMVA Cross Validation

02

03

13

Page 14: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Extending the plotting functionality

14

Page 15: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Improve the presentation of tutorial

15

● Currently, we can only show ROC’s of individual folds.

● Often, we can be interested in the average behaviour.

● An added feature of visualising the average ROC Curve.

● Addition of this feature in the tutorial to improve the

presentation.

Page 16: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Average ROC Curve

02

03

16

With drawFolds = False With drawFolds = True

Page 17: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Improve CV fold generation

17

Page 18: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Random Splitting VS Stratified Splitting

18

● Determines the distribution of input data.

● Random Splitting just randomly splits the data equally.

● Data of a fold can be distributed differently than the whole.

● Problems, in particular, arise with unbalanced classes

(can more easily occur in multi-class classification).

● Stratified Splitting ensures that each fold follows the same

distribution as the whole.

Page 19: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Random Splitting VS Stratified Splitting

19

● Random Splitting just randomly splits the data equally30 samples03 Classes 10 samples 15 samples 05 samples

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

6 random samples in each fold

Page 20: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Random Splitting VS Stratified Splitting

20

● Stratified Splitting randomly splits data ensuring that each fold is good representative of the whole.

30 samples03 Classes 10 samples 15 samples 05 samples

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

6 random samples in each fold

Page 21: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Stratified Splitting in TMVA

21

Page 22: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Future Improvements:

22

● Weighted kFolds Splitting

● Investigate feasibility of integrating TMVA GUI with

tutorials

Page 23: in TMVA Cross Validation Improvements · Name : Mohammad Uzair Country : Pakistan Studies : Final Year of bachelor’s degree in Computer Science University : National University

Special Thanks to my Supervisors:Dr. Lorenzo MONETA

Mr. Hans Kim ALBERTSSON BRANN23