MammoScreen Deep Learning in practicegcharpia/deeppractice/2020/tpx_for_mva_dl_course_2020.pdfconv11, 512 conv12, 512 pool 2x2 pool_final 5x5 16 8 2 [0,1] shortcut_final metadata 14

Special for MVA course

Deep Learning in practice:MammoScreen

by Yaroslav Nikulin,Senior Research Scientist

February, 24 2020

Part I: Introduction

1. Therapixel

2. DL -> radiology

3. Breast cancer

4. DM DREAM Challenge

2013

5th place Kaggle Data Bowl

Founded Visualization SW AI research

Therapixel: Medical Image Understanding

Olivier Clatz, PhD

Pierre Fillard, PhD

Clinical Study Therapixel Cloud

1st place DREAM DM

2015 2016 2017 2018 2019 2020

MammoScreen deployed

1 out of 8

Woman affected during her lifetime

5 cancers for

1000 screening

10 recall for

100 screened

4

Breast Cancer Screening: some key stats● 33M exams/year = 132M images in US alone● $7.8 billion - cost of mammography screening in US (2010)● 120 sec: average interpretation time.

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

Andrew Ng, 2016

Challenge setting:

● Completely in the cloud

● 22 CPU cores + 2 GPUs

● 14 days / per team / round

● Performance measure: AUROC and partial AUROC

The Digital Mammography DREAM Challenge

0

0

000

0

0

00

1

1 1

0

Now look for a needle in them…

● 300k images● Only 1114 (0.35%!) positive examples● High resolution: from 3328x2560 to 5928x4728.

● One single label per image: 0 or 1

Why it is difficult - challenges of the Challenge

Why it is difficult - challenges of the Challenge

● Different kinds of anomalies: ○ calcifications○ masses○ distorsions○ asymmetries

● Different scales of anomalies: from micro-calcifications to big cancerous masses.

Can be malignant OR benign! 7

Part II: Winning solution: dream_net

1. Data specificity

2. Hack: dense annotations

3. Patch model

4. Image model

Zone of Interest

● Resolution: 1200x800 vs 224x224

● Zone of Interest : < 1% vs > 50%

● Number of classes : 2 vs 1000

● Highly imbalanced vs roughly balanced

Why is it very different from ImageNet?

In our approach, limited by several factors. Actually 3-5 times higher

9

Input size: 224x224

Output : 1 out of 1000 ~ 10 bits of information

Input size: ~3500x2500

Output : 1 out of 2 = 1 bit of information0 or 1

Why don’t DL results generalize always well to a new domain?

DDSM – bridge towards solution

DDSM DREAM

Total im 10k 640k

Positives 1807 1548

Info mask&type 0 or 1

It would be great to:

● Make use of local info● Make use of lesion type● Still be able to train on DREAM

11

segmentation mask patch

Patch model: Fully Convolutional Network

massmalignant

224

224

832

1152

12


massmalignant

0

1

3

2

4

224

224

832

1152

healthy

calc ben

mass ben

calc mal

mass mal

13


massmalignant

112x112x32 56x56x64 28x28x128 14x14x256 7x7x256 3x3x512 0

1

3

2

4Who said VGG..?

224

224

conv

1, 3

2

conv

2, 3

2

1024

512 5

po

ol 2

x2

conv

3, 6

4

conv

4, 6

4

po

ol 2

x2

conv

5, 1

28

conv

6, 1

28

po

ol 2

x2

conv

7, 2

56

conv

8, 2

56

po

ol 2

x2

conv

9, 2

56

conv

10,

256

po

ol 2

x2

conv

11,

512

conv

12,

512

po

ol 2

x2

FC as FCN

832

1152

healthy

calc ben

mass ben

calc mal

mass malTotal: 11M of parameters

14


massmalignant

112x112x32 56x56x64 28x28x128 14x14x256 7x7x256 3x3x512 0

1

3

2

4Who said VGG..?

224

224

Long, Shelhamer, Darrell, Fully Convolutional Networks for Semantic Segmentation, 2014

conv

1, 3

2

conv

2, 3

2

1024

512 5

po

ol 2

x2

conv

3, 6

4

conv

4, 6

4

po

ol 2

x2

conv

5, 1

28

conv

6, 1

28

po

ol 2

x2

conv

7, 2

56

conv

8, 2

56

po

ol 2

x2

conv

9, 2

56

conv

10,

256

po

ol 2

x2

conv

11,

512

conv

12,

512

po

ol 2

x2

FC as FCN

832

1152

11

16

healthy

calc ben

mass ben

calc mal

mass mal

5

Information Bottleneck..?

Total: 11M of parameters

15

https://arxiv.org/abs/1411.4038


[0,1

,2,3

,4]

conv

1, 3

2

conv

2, 3

2

1024

512 5

po

ol 2

x2

conv

3, 6

4

conv

4, 6

4

po

ol 2

x2

conv

5, 1

28

conv

6, 1

28

po

ol 2

x2

conv

7, 2

56

conv

8, 2

56

po

ol 2

x2

conv

9, 2

56

conv

10,

256

po

ol 2

x2

conv

11,

512

conv

12,

512

po

ol 2

x2

po

ol_

fin

al 5

x5

16 8 2

[0,1

]

shortcut_final

met

adat

a 14

1152

832

224

224

• Detector Net: pretraining by patches, ~2 hours on 4 Titan X

• End-to-end finetuning by images, ~20 hours on 4 Titan X

576x416x32 288x208x64 144x104x128 72x52x256 36x26x256 18x13x512

11

16

Intermediate labels:0 – healthy1 – calc benign2 – mass benign3 – calc malignant4 – mass malignant

FC as FCN FC

From patch to image model: final pooling and some more layers

Final labels:0 – healthy1 – cancer

16x11x5

All the convolutional kernels have spatial size 3x3

All the pooling layers are max pool

Inspired by ResNet…

16

Important to train on images:● Final pool 5x5● Adjust learning rate● Linear shortcut


Some technical details: training procedure and EMA

● DetectorNet on patches from scratch: Adam, lr 0.001

● Restore DetectorNet weights and Adam variables

● On images (partially restored): Adam, lr 0.0001

● Send it to the cloud and use as a starting point

● Finetuning on DREAM data: Adam, lr 0.0001 and Exponential Moving Averages (0.9)

● Restore EMA (0.9), finetune with SGD, lr 0.0001

AUC per breast (DDSM)

Loss

default

17

Some technical details: data

DataGenerator:Flips: hor & verRotation: ±20°Zoom: ±20%Shear: ±20%

Channel Shift: ±10%

18

From DreamNet to MammoScreenModels overview

1. 50 model instances of 5 model families (all for perf)1.1. Adjusted architectures1.2. Custom training procedures

2. Lesion detection: 2.1. RetinaNet

3. Patch malignancy score: 3.1. CaracNet3.2. CaracNetSymmetry

4. Image malignancy score:4.1. DreamNet4.2. DreamNetSymmetry

5. Ensembled, calibrated

6. Lesions paired

0.00 0.25 0.50 0.75 1.00

0.00

1.00

2.00

3.00

Cancer

Non-cancer

Status of the exam year N

90%

10%

70% 30%

20

Model’s output distribution on exams year N-1

Clinical study summary

21

Clinical study summary

22

MammoScreen in Production

Health Institution

On site server

MammoScreen-Client

Support

Mammobox

Therapixel CloudHospital IT

PACS /Acquisition

device

Therapixel [email protected]

MammoScreen-Cloud

Therapixel Offices

DICOM-mammograms

Doctors’ office

Computer

reports

DICOM Node

MammoScreen-UI

MammoScreen-Server

MammoScreen-AI

DICOM-mammograms

Storage reports

23

Part III: Some specific advices

1. DL is a paradigm, not a method

2. Hyperparameters’ space is too vast

3. Develop your intuition

4. Understand the internal dynamics of NN

● Adapt model to your problem

● good data and gradient flow: “well-wired net”

● Adjust architecture !

● Deep = complex, but cheap

Slide credit: 1)G. Montúfar et al, On the Number of Linear Regions of Deep Neural Networks 2) Marc'Aurelio Ranzato slides 3) Introduction to Deep Learning by Iasonas Kokkinos

Some specific advices and practical moments



https://www.cs.toronto.edu/~ranzato/

http://cvn.ecp.fr/personnel/iasonas/course/DL7.pdf

Observation:

If the networks find a latent variable in your data that correlates well with their optimization problem, they will use it. It might not be what you want!

Message:

Non-obvious biases in training data may be exploited by the network. You need to realize and control that effect. In particular, be suspicious about skyrocketing performances when injecting new source of data.

Initial train set(too few malignant images)

Enriching this train setwith malignant images

⇒ what do you think the network did?

Neural nets are good at NOT learning the right problem

Problem ?- Inhomogeneous set of

validation data

Solution ?- Split the validation set in

homogenous subsets

Look at the right metrics… on the right data

● Prefer smooth, stable learning inside a certain L2 sphere

● Good condition number● Long plateau of low valid metric● (Very) noisy, spiky, non-stable = bad● How to improve:

○ more examples○ data augmentation○ adapt model, regularization, loss○ validate more often○ debug model, data

● And only then “early stopping” - checkpoint with best valid metric

Fight overfitting

Mind the generalization gap

29

A note on overfitting and “advertising” stats:

● Overfitting happens on several levels:○ training data○ validation data○ test data = overfit dataset○ overfit a particular problem○ overfit a particular domain (?)○ overfit human style of thinking (??)

● In particular, performance of DL model on mammographies depends on:

○ Device used for mammography○ Skills of technician○ Screening period (1-1.5-2 years)○ Positive/negative ratio, closely linked to○ Fraction of truly difficult cases○ Population (country)○ …

• All breast lesions in the world form a (very complex) distribution

• Eliminate everything not linked to natural variability

• Example: device1 -> 3600x2400, device2 -> 3600x3600

• When both image sets rescaled to 1200x800 -> 2 modes

f

Standardize the input distribution

Data Science 2020 = Software Engineering 2000

● Visual Studio 1st release: 1997

● Development process and paradigm evolving

● Data becomes 2nd part of your code

● Software 2.0 stack (©Andrej Karpathy)

● IDEs for ML models are yet to come?TensorFlow Extended (TFX)?

Thank you for your attention!

32

Q&A session

Documents

MammoScreen Deep Learning in practicegcharpia/deeppractice/2020/tpx_for_mva_dl_course_2020.pdfconv11, 512 conv12, 512 pool 2x2 pool_final 5x5 16 8 2 [0,1] shortcut_final metadata 14