Multi-centric learning from medical data

Nov 2012, Georgi Nalbantov

Multiple sets of medical data exist in different hospitals

Currently: models are built from data in 1 center only

Currently: external model validation requires standardization

Data privacy is an issue: data cannot leave hospitals (easily)

Multi-centric learning from medical data: why?

General Hypothesis:

The current way of learning from medical data is suboptimal, because modeling techniques do not have access to all available data

Specific Hypothesis:

A distributed learning environment (euroCAT), by giving local access to all data, can be used to produce optimal models, the same as if the data were centralized*

* For some modeling techniques

Learning from medical data: the current state

[Figure: Center 1 to Center 5, each paired with the other four centers (C1 to C5) used for validation.]

We learn a model from one center only and validate at other centers (if possible)

We also check the predictions of the doctors. (The gold standard?)

Problem: suboptimal

Optimal solution: learning from centralized data

NOT FEASIBLE

Learning from medical data: the challenge

[Figure: Centers 2, 3, 4 and 5 (C2 to C5) hold the data for learning; Center 1 holds the data to predict. The data are decentralized, and the goal is to learn as if they were centralized.]

How to achieve that?

Option 1: Centralized learning. Combine data from sites 1, 2, 3 and 4 in one central database. That is, bring the data to the model: NOT FEASIBLE

Option 2: Distributed learning. Apply “distributed learning” to the data from sites 1, 2, 3 and 4. That is, bring the model to the data: FEASIBLE

Centralized Learning

[Figure: data centralization, followed by learning. Nalbantov and Wiessler, Oct 2012]

Distributed Learning

[Figure: Nalbantov and Wiessler, Oct 2012]

Distributed Learning: doing it

- For distributed learning we need a statistical modeling technique that can learn in distributed mode, that is, without being able to “see” all the data at once.

- There exist learning models that find an optimal solution regardless of whether the data are scattered across different centers.

We choose one of them for this study: SVMs

- Have shown excellent results across a wide range of data-analysis problems
- Robust to the inclusion of many features (bye-bye to the “15-1” rule of thumb)
- Can be constructed in distributed-learning mode
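To make "learning without seeing all the data at once" concrete, here is a minimal sketch. It is not the ADMM method used in this study. It is a simpler stand-in, full-batch subgradient descent on a linear SVM, in which each site returns only a summed subgradient vector and the coordinator adds those vectors up. The function names, the toy data and the values of `lam`, `step` and `iters` are all assumptions made for illustration.

```python
# Toy illustration: each "site" computes a hinge-loss subgradient on its own
# patients and shares only that vector, never the patient rows.

def local_subgradient(w, rows):
    """Sum of hinge-loss subgradients over one site's (features, label) rows."""
    grad = [0.0] * len(w)
    for x, y in rows:  # y is +1 or -1
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:
            for k, xk in enumerate(x):
                grad[k] -= y * xk
    return grad


def train(sites, lam=0.1, step=0.01, iters=200):
    """Subgradient descent on (lam/2)*||w||^2 + sum of hinge losses."""
    dim = len(sites[0][0][0])
    w = [0.0] * dim
    for _ in range(iters):
        # The coordinator adds up the vectors returned by the sites.
        total = [0.0] * dim
        for rows in sites:
            g = local_subgradient(w, rows)
            total = [t + gk for t, gk in zip(total, g)]
        w = [wk - step * (lam * wk + tk) for wk, tk in zip(w, total)]
    return w


# Hypothetical data: two features plus a constant 1 that acts as a bias term.
site_a = [([2.0, 1.0, 1.0], 1), ([1.5, 2.0, 1.0], 1)]
site_b = [([-1.0, -1.5, 1.0], -1), ([-2.0, -0.5, 1.0], -1), ([1.0, 1.0, 1.0], 1)]

w_distributed = train([site_a, site_b])    # rows stay at their sites
w_centralized = train([site_a + site_b])   # same rows pooled in one place
```

Because the total subgradient is a sum over patients, splitting that sum by site leaves every update unchanged, so both runs follow the same path. This is the sense in which decentralized data can behave like centralized data for some modeling techniques.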

Learning from medical data: distributed learning

SVMs

Model evaluation: ROC curve
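The area under the ROC curve (AUC) can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way. The function name `roc_auc` and the example labels and scores are hypothetical and are not taken from the trial data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (event) or 0 (no event); tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcome labels and model scores for six patients.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
auc = roc_auc(labels, scores)
```

Note that the AUC is undefined when a site has patients from only one outcome class, which is a practical concern for very small sites.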

The “Trial”: prediction of 2-year survival of lung cancer patients

Patients: 322 (Maastro) lung cancer patients, distributed across 5 sites:

Maastro 186
Liege 52
Hasselt/Genk 45
Aachen 7
Eindhoven 32

Endpoint: 2-year survival

Method: distributed learning SVMs (ref: Boyd, ADMM), euroCAT

Predictive features: gender, WHO performance status, FEV1, number of PLNSs, GTV (volume) and “EQD2,T”
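The method line above cites ADMM (Boyd) but the slides do not show the formulation. For orientation, the following is the standard global-consensus ADMM form for a linear SVM, written under assumed notation. Whether euroCAT used exactly this variant is not stated in the talk, and the symbols $w_i$, $z$, $u_i$, $\rho$ and the placement of $\lambda$ on the regularizer are assumptions.

```latex
% Assumed notation: site i = 1..N holds patients D_i with features x_j and
% labels y_j in {-1,+1} (bias folded into x_j); w_i is the local model,
% z the shared model, u_i the scaled dual variable, \rho > 0 the ADMM
% penalty, \lambda > 0 the regularization weight, and \bar{w}, \bar{u}
% are averages over the N sites.
\begin{aligned}
&\min_{w_1,\dots,w_N,\,z}\ \sum_{i=1}^{N}\sum_{j\in D_i}\max\bigl(0,\,1-y_j\,w_i^{\top}x_j\bigr)+\frac{\lambda}{2}\lVert z\rVert_2^2
\quad\text{s.t. } w_i=z,\ i=1,\dots,N \\[4pt]
&w_i^{k+1}=\arg\min_{w}\ \sum_{j\in D_i}\max\bigl(0,\,1-y_j\,w^{\top}x_j\bigr)+\frac{\rho}{2}\lVert w-z^{k}+u_i^{k}\rVert_2^2 \\
&z^{k+1}=\frac{N\rho}{\lambda+N\rho}\,\bigl(\bar{w}^{k+1}+\bar{u}^{k}\bigr) \\
&u_i^{k+1}=u_i^{k}+w_i^{k+1}-z^{k+1}
\end{aligned}
```

The $w_i$ update uses only the patients at site $i$, and only model-sized vectors ($w_i$, $u_i$, $z$) travel between the sites and the coordinator. That is what "bring the model to the data" means in practice.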

The data for the trial

[Table: one row per patient, with columns 2-year survival, gender, WHO, FEV1, number of PLNSs, GTV (volume), EQD2,T; values not shown]

site        patients
Maastro     186
Liege       52
Hasselt/G   45
Aachen      7
Eindhoven   32

The “gold standard”: doctors’ predictions of 2-year survival of lung cancer patients

“Traditional” solution: prediction of 2-year survival of lung cancer patients

Build the model at center 1

Validate the model at centers 2, 3, 4 and 5

Step 1. Build an SVM model from data in center 1.
- There is no single button to press… SVM turns out to be a “family” of models, and the trained statistician has to choose one family member in much the same way as a surgeon chooses from a variety of “knives”.

Step 2. Model evaluation: how will our model perform outside our hospital?
- Perform cross-validation to find the optimal SVM from the “family”.

Step 3. Build the final model using the best-performing SVM from the SVM family.
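The three steps above can be sketched as a small model-selection loop. This is an illustration under stated assumptions, not the study's code: the linear SVM is trained by plain subgradient descent, `lam` is assumed to be the regularization weight (the talk does not define its lambda), held-out accuracy stands in for the cross-validation ROC to keep the sketch short, and the data and lambda grid are hypothetical.

```python
def train_svm(rows, lam, step=0.01, iters=200):
    """Linear SVM by subgradient descent; rows are (features, label in {-1, +1})."""
    w = [0.0] * len(rows[0][0])
    for _ in range(iters):
        grad = [lam * wk for wk in w]
        for x, y in rows:
            if y * sum(wk * xk for wk, xk in zip(w, x)) < 1.0:
                grad = [g - y * xk for g, xk in zip(grad, x)]
        w = [wk - step * g for wk, g in zip(w, grad)]
    return w


def accuracy(w, rows):
    """Fraction of rows whose label sign matches the sign of the SVM score."""
    hits = sum(1 for x, y in rows if y * sum(wk * xk for wk, xk in zip(w, x)) > 0)
    return hits / len(rows)


def cross_validate(rows, lam, k=3):
    """Mean held-out accuracy over k folds (fold i holds out rows i, i+k, ...)."""
    scores = []
    for fold in range(k):
        held_out = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        scores.append(accuracy(train_svm(train, lam), held_out))
    return sum(scores) / k


def select_lambda(rows, grid, k=3):
    """Step 2: score every family member and return the best one."""
    cv = {lam: cross_validate(rows, lam, k) for lam in grid}
    return max(grid, key=cv.get), cv


# Hypothetical center-1 data: two features plus a constant 1 acting as bias.
rows = [
    ([2.0, 1.0, 1.0], 1), ([-1.0, -1.5, 1.0], -1), ([1.5, 2.0, 1.0], 1),
    ([-2.0, -0.5, 1.0], -1), ([1.0, 1.0, 1.0], 1), ([-1.5, -1.0, 1.0], -1),
    ([0.5, 1.5, 1.0], 1), ([-0.5, -2.0, 1.0], -1), ([2.5, 0.5, 1.0], 1),
]
best_lam, cv_scores = select_lambda(rows, grid=[0.1, 1.0, 5.0])
final_model = train_svm(rows, best_lam)  # Step 3: refit on all center-1 data
```

The final refit uses every row from the training center, so the cross-validation score is the only internal estimate of how the chosen model performs on unseen patients. External validation at other centers remains the real test.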

[Figure: at Center 1, an SVM family with different lambdas (shown: SVM with Lambda=1, SVM with Lambda=5) is compared by cross-validation ROC.]

Build the final SVM model on all data from center 1, that is, “SVM with lambda = 2”

External validation AUC (values as listed on the slide): 0.723, 0.757, 0.671, 0.6

Validation centers: Center 2, Center 3, Center 4, Center 5

Learning from centralized data: optimal solution, NOT FEASIBLE

Distributed learning: optimal solution*, FEASIBLE (e.g., euroCAT)

euroCAT solution: prediction of 2-year survival of lung cancer patients

[Figure: the data for learning stay decentralized at C2, C3, C4 and C5. Instead of bringing the data to the model, euroCAT brings the model to the data, so the decentralized data BEHAVES like centralized data*.]

*Using ADMM to solve SVM

euroCAT: a breakthrough

AUC of euroCAT learning on the PREDICTED site (same result as centralized):

Training site(s)   Predicted site   AUC     Color
2                  1                0.754   red
3                  1                0.678   green
4                  1                0.610   cyan
5                  1                0.723   pink
2,3,4,5            1                0.766   blue
1,2,3,4,5          world            ?       ?

What is the potential benefit from multi-centric batch learning on predicting survival?

The future

How can patients/clinics profit from distributed learning medical environments?

- Use real multi-centric data for modeling

- Use multiple endpoints: survival, dyspnea, dysphagia, fibrosis, etc.

- Include more variables: imaging, DNA, etc.

- Use standardized data (and more data)

- Etc…

Thank you for your attention

Any questions?
