Learning under Budget
Generative Adversarial Nets
Nataliya Sokolovska
Sorbonne University, Paris, France

Master 2 in Statistics
March 29, 2019
Outline
Learning under Budget
- Motivation for Learning under Budget
- Cascade Classifiers
- A deep reinforcement learning model
- MDP and POMDP
- Some Applications in Medicine

Generative Adversarial Learning
- Generative Adversarial Networks
- Example: a Real Materials Science Application
- Real Data and Discussion
Example: the DiaRem (Diabetes Prediction) Score
Variable                        Thresholds   Score
Age                             <40          0
                                40–49        1
                                50–59        2
                                >60          3
Glycated hemoglobin (HbA1c, %)  <6.5         0
                                6.5–6.9      2
                                7–8.9        4
                                >9           6
Insulin                         No           0
                                Yes          10
Other drugs                     No           0
                                Yes          3

Classify as Remission if the sum of scores < 7.
Classify as Non-remission if the sum of scores ≥ 7.
C. D. Still et al., Preoperative prediction of type 2 diabetes remission after Roux-en-Y gastric bypass surgery: a retrospective cohort study, 2013
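As a worked example of how the score is computed, here is a minimal Python sketch of the table above (the function names, and the exact boundary handling at the table's gaps, e.g. age 60, are my own assumptions):

```python
def diarem_score(age, hba1c, insulin, other_drugs):
    """DiaRem score from the four variables of the table above."""
    score = 0
    if age > 60:            # age thresholds as given in the table
        score += 3
    elif age >= 50:
        score += 2
    elif age >= 40:
        score += 1
    if hba1c > 9:           # glycated hemoglobin (HbA1c, %)
        score += 6
    elif hba1c >= 7:
        score += 4
    elif hba1c >= 6.5:
        score += 2
    if insulin:             # insulin use: 10 points
        score += 10
    if other_drugs:         # other antidiabetic drugs: 3 points
        score += 3
    return score

def predict(score):
    """Remission if the sum of scores < 7, non-remission otherwise."""
    return "Remission" if score < 7 else "Non-remission"

# A 45-year-old patient, HbA1c 7.5%, no insulin, on other drugs: 1 + 4 + 3 = 8.
print(predict(diarem_score(45, 7.5, insulin=False, other_drugs=True)))  # Non-remission
```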
DiaRem accuracy on NutriOmics data
[Figure: histogram of DiaRem scores (0–21) on the NutriOmics data, with counts of remission vs. non-remission patients at each score.]
Classification under budget constraints
Problem: predict y from x when the prediction is reliable; abstain otherwise.

The cost:

c(ŷ) =
  0, if ŷ = y,
  δ, if rejection,
  1, if ŷ ≠ y.

For δ < 1/2, ŷ is a classifier with abstention; the rejection cost δ is used to avoid misclassification.

Reference: Boosting with Abstention, C. Cortes, G. DeSalvo and M. Mohri, Advances in Neural Information Processing Systems, 2016.

Many thanks to Matthieu Clertant for sharing his slides.
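To make the abstention rule concrete: under this cost, predicting the most probable class has expected cost 1 − max_y p(y | x) (the probability of an error), so the classifier should reject exactly when that quantity exceeds δ. A minimal NumPy sketch (the function name and the reject sentinel are illustrative):

```python
import numpy as np

REJECT = -1  # sentinel value for abstention

def predict_with_abstention(class_probs, delta):
    """Predict the argmax class unless the expected error
    1 - max_y p(y | x) exceeds the rejection cost delta."""
    best = int(np.argmax(class_probs))
    expected_error = 1.0 - class_probs[best]
    return best if expected_error <= delta else REJECT

print(predict_with_abstention(np.array([0.90, 0.10]), delta=0.2))  # 0 (confident)
print(predict_with_abstention(np.array([0.55, 0.45]), delta=0.2))  # -1 (abstain)
```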
Homogeneous cascade with abstention
[Figure: a homogeneous cascade. Tests are acquired in a fixed order (Test A, then B, then C, ...); at each stage the classifier f(A), f(A,B), f(A,B,C), ... either outputs a label or rejects and moves on to the next test.]
Tradeoff between accuracy and cost: the system incurs a penalty of δ_{k+1} at the k-th stage if it rejects in order to seek more measurements. The order of tests and the classifier with abstention are learned purely from the data.
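Read as pseudocode, the cascade is a single loop; a hedged Python sketch (the data structures and the reject sentinel are assumptions, not the paper's implementation):

```python
REJECT = -1

def run_cascade(patient, tests, classifiers, deltas):
    """Homogeneous cascade: acquire tests in a fixed order until some stage
    classifier accepts; rejecting at stage k costs deltas[k]."""
    acquired, total_cost = {}, 0.0
    for k, (test, clf) in enumerate(zip(tests, classifiers)):
        acquired[test] = patient[test]        # acquire measurement k
        label = clf(acquired)                 # classifier with abstention
        if label != REJECT or k == len(tests) - 1:
            return label, total_cost          # accept (or forced final decision)
        total_cost += deltas[k]               # penalty for seeking more data
```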
The Costs
In real-life applications, the budget is always limited.

Possible costs:
- time, money,
- side effects of medication,
- complexity (interpretability, disease signature), ...

The costs δ_k depend on the tests and on the nature of the acquired data.
Sum of costs:

∆ = ∑_{k=1}^{K} δ_k .

∆ is a tradeoff parameter between accuracy and cost.
Heterogeneous Cascade with abstention
[Figure: a heterogeneous cascade. Starting from Test A, each rejection branches to a different next test (B, C, D, E, ...), so each patient follows an individual path through classifiers such as f(A), f(A,B), f(A,C,D), f(A,C,E), f(A,B,C,D), f(A,B,C,E), ..., until one of them outputs a label.]

Tradeoff between accuracy and cost: dynamic diagnostic protocols are individually tailored acquisitions of patient data, with the aim of providing the most accurate diagnosis at the lowest cost.
Goals
The machine learning system has to provide:
- individualized treatment of patients,
- exploration of the K! homogeneous cascades,
- a tree structure, or heterogeneous cascade,
- 2^K classifiers and 2^K rejecters,
- classifiers and rejecters learned jointly,
- interpretable classifiers.

The algorithm has to learn purely from data.
Strategy Games: Human vs Computer
Deep Blue (1997):
- won against Garry Kasparov,
- almost exhaustive exploration,
- human knowledge (evaluation function, memory of real games).

AlphaGo (2015):
- won against Lee Sedol,
- no exhaustive exploration,
- human knowledge (memory of real games).

AlphaZero (2017): no human knowledge.

These board games are Markov Decision Processes.
Markov Decision Process
A Markov Decision Process is a 5-tuple (S, A, P, C, X × Y):
- S: set of states,
- A: set of actions,
- P: set of conditional transition probabilities,
- C: cost function,
- X × Y: set of observations.
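As a data structure, the 5-tuple translates directly; a minimal sketch (the field types are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List                    # S: set of states
    actions: List                   # A: set of actions
    transitions: Dict[Tuple, Dict]  # P: (s, a) -> distribution over next states
    cost: Callable                  # C: cost function c(s)
    observations: List              # X × Y: set of observations
```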
From POMDP to MDP
A Partially Observable Markov Decision Process deals with:
- uncertainty related to the effects of actions (cost),
- uncertainty about the current state.

A purely epistemic MDP deals with the second uncertainty only.

Historical state s_t:
- a[t], the set of tests done at time t,
- x[t], all the features available at time t.

Purely epistemic MDP + historical state = simple MDP.
Deep Q Learning
Goal: determine a policy of actions π.

Global cost C, and cost from step t, noted C_t:

C(s_K) = ∑_{k ∈ [K+1]} c(s_k)   and   C_t(s_K) = ∑_{k ∈ [t:K+1]} c(s_k) .

The function Q (also called the cost-to-go of a_t) is the expected cumulative cost from time t to the end if we undertake action a_t:

Q^π(s_{t−1}, a_t) = E_π[C_t(s_K) | s_{t−1}, a_t] .

Optimal solution: π* = argmin_π E_π[C(s_K)].

The function Q^{π*} is approximated with a deep neural network.
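A minimal deep Q-learning sketch in PyTorch (the architecture, hyperparameters and discount factor are illustrative, not the paper's): the network maps a historical state to one cost-to-go estimate per action and is regressed on the Bellman target; since we minimize costs rather than maximize rewards, the backup uses a min.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a historical state s to a cost-to-go estimate Q(s, a) per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)

def q_step(qnet, optimizer, s, a, c, s_next, done, gamma=1.0):
    """One Bellman regression step: Q(s, a) should match c + gamma * min_a' Q(s', a')."""
    q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)   # a: LongTensor of actions
    with torch.no_grad():
        target = c + gamma * (1 - done) * qnet(s_next).min(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```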
Optimal policy
Theorem. Let ŷ and r be the two natural candidates for classification and rejection when the distribution D is known:

ŷ_t = argmax_{a_t ∈ [L]} P_D(a_t = Y | s_{t−1})   and   r_t = −argmin_{a_t ∈ [−K:−1]} Q^π(s_{t−1}, a_t) .

For all t ∈ [K] (or up to the end of the cascade), the optimal solution of the MDP system satisfies:

π*(s_{t−1}) = a*_t =
  ŷ_t, if max_{a_t > 0} P_D(a_t = Y | s_{t−1}) > 1 − min_{a_t < 0} Q^π(s_{t−1}, a_t),
  r_t, otherwise.
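In code, the theorem's rule is a single comparison between the best class posterior and one minus the best cost-to-go among the remaining tests; a hedged sketch (plain lists stand in for the MDP's action sets):

```python
def optimal_action(class_probs, test_costs_to_go):
    """Classify with the best label if its posterior exceeds
    1 - min cost-to-go over the remaining tests; otherwise take that test."""
    best_label = max(range(len(class_probs)), key=lambda a: class_probs[a])
    best_test = min(range(len(test_costs_to_go)), key=lambda a: test_costs_to_go[a])
    if class_probs[best_label] > 1.0 - test_costs_to_go[best_test]:
        return ("classify", best_label)
    return ("test", best_test)
```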
Deep Neural Networks
Environment: a dataset of patient features and final diagnoses.

Agent (Part 1): rejecter/selector.

At time t in the cascade:
- inputs are summarized by s_t = (a[t]; x[t]) (historical state: tests + features),
- outputs are the approximated costs-to-go of each available test (the function Q).
Agent (Part 2): classifier.

At time t in the cascade:
- inputs are summarized by s_t = (a[t]; x[t]) (historical state: tests + features),
- final outputs are the probabilities of each class (the function Q).

Family of logistic regressions: f_w(s_t, a) ∝ exp⟨β_a(M[t]), M[t] ⊙ x⟩.
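A hedged reading of this family in code: the mask M[t] zeroes out features that have not been acquired, and the class weights β_a are generated from the mask itself, so a single network yields one softmax regression per acquisition pattern (the shapes and the weight-generating network are my assumptions):

```python
import torch
import torch.nn as nn

class MaskedSoftmaxFamily(nn.Module):
    """f_w(s_t, a) ∝ exp⟨β_a(M), M ⊙ x⟩: one softmax regression per mask M."""
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.n_classes, self.n_features = n_classes, n_features
        # The mask is mapped to a full weight matrix: one β_a per class a.
        self.weight_gen = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_classes * n_features))

    def forward(self, x, mask):
        beta = self.weight_gen(mask).view(-1, self.n_classes, self.n_features)
        logits = torch.einsum('bcf,bf->bc', beta, mask * x)  # ⟨β_a(M), M ⊙ x⟩
        return logits.softmax(dim=1)                         # class probabilities
```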
Experimental setup
Three models:
- ICCA, the Interpretable Cascade Classifier with Abstention; it uses a classifier network generating a family of softmax regressions,
- p-ICCA, the same model after a pretraining run using only the mask for the rejecter network,
- a random forest with feature selection (importance measured by mean decrease in impurity).

Data: Breast Cancer Wisconsin Diagnostic (UCI). We have 30 continuous parameters describing characteristics of the cell nuclei in medical images, for 569 patients. The cost of feature acquisition is the same for all features.
Data of breast cancer diagnostic
[Figure] Accuracy and standard deviation, as a function of the number of features, for ICCA (in blue), pretrained ICCA (in yellow), and the random forest classifier (dotted line). The features share the same cost: δ ∈ [10^{−4}, (1/3)·10^{−2}].
[Figure] Histogram of the number of features explored for ICCA. In blue: all observations (size: 569, mean: 4.8); in orange: the misclassified observations (size: 21, mean: 12.3). Overall accuracy: 96.31%.
Published in the proceedings of AISTATS, 2019.
The State-of-the-art and Open Challenges
More or less achieved:
- methods for learning an (individualized) diagnostic protocol,
- based on theoretical support (MDP, POMDP),
- a deep network able to learn a family of parametric models for missing data.

Still a challenge:
- alternatives for missing data, such as generative adversarial networks and variational autoencoders,
- numerical experiments for special cases (comparison between a cascading system, a multi-stage system, and a classifier with embedded feature selection),
- adaptation to genomic, transcriptomic and metabolomic data,
- features varying in time (POMDP).
Generative Learning
- Discriminative learning tries to classify data (maps features to classes): find a boundary between classes.
- Generative learning tries to provide features given a label: model the data distribution.
- To be precise, generative learning asks: given a class, how likely are these particular features?
Generative Adversarial Networks
Two components:
1. Generator – generates new data
2. Discriminator – estimates the probability that a sample belongs to a class

Steps of a GAN:

1. The generator generates data.
2. The generated and the observed data are fed into the discriminator.
3. The discriminator returns the probabilities of belonging to the classes (fake or real).
GANs
[Figure from https://skymind.ai/wiki/generative-adversarial-network-gan]

- Feedback loop: discriminator and the true classes
- Feedback loop: generator – discriminator
GANs
- p_z – distribution over the noise z (uniform)
- p_g – the generator's distribution over the (observed) data
- p_r – the true data distribution

Play a minimax game:

min_G max_D L(D, G) = E_{x∼p_r}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
                    = E_{x∼p_r}[log D(x)] + E_{x∼p_g}[log(1 − D(x))]

- D is trained to maximize L(D, G), in particular E_{z∼p_z}[log(1 − D(G(z)))],
- G is trained to fool D, i.e. to minimize E_{z∼p_z}[log(1 − D(G(z)))].
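A minimal PyTorch sketch of this game (the architectures, noise dimension and learning rates are placeholders, not from the slides): the discriminator ascends L(D, G), the generator descends its own term.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # z -> fake x
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(x_real):
    batch = x_real.size(0)
    z = torch.rand(batch, 16)                     # z ~ p_z (uniform noise)
    # Discriminator: ascend log D(x) + log(1 - D(G(z)))
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool D; the common non-saturating form maximizes log D(G(z))
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```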
GANs
[Figure from http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html]
GANs: the Learning Algorithm
[Figure: the GAN learning algorithm, from Ian J. Goodfellow et al., 2014]
Autoencoders
[Figure from https://www.jeremyjordan.me/variational-autoencoders/]
Variational Autoencoders
[Figure from https://www.jeremyjordan.me/variational-autoencoders/]
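Since the figure itself is not reproduced here, a minimal sketch of what a variational autoencoder computes (all dimensions are illustrative): the encoder outputs a mean and log-variance, a latent code is sampled via the reparameterization trick, and the loss adds a KL term to the reconstruction error.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=30, z_dim=4, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))  # mu and log-var
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction='sum')   # reconstruction error
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return recon + kl
```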
Some Materials Science Examples
- Synthesis of new organic and inorganic compounds
- A combinatorial problem for human experts
- "Clean" energy: hydrogen storage
Problem Formulation
Generate ternary hydride compounds of the form "A (a metal) – H (hydrogen) – B (a metal)".

- The training algorithm observes stable binary compounds: A+H, a composition of some metal A with hydrogen H, and B+H, a mixture of another metal B with hydrogen.
- A machine learning algorithm has access to observations {x_AH^i}_{i=1}^{N_AH} and {y_BH^i}_{i=1}^{N_BH}. Our goal is to generate novel ternary, i.e. more complex, stable data x_AHB (or y_BHA), based on the properties learned from the observed binary structures.
GANs and Cross Domain GANs
- DiscoGAN and CycleGAN: cross-domain learning
  - Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV, 2017.
  - Kim, T.; Cha, M.; Kim, H.; Lee, J. K.; and Kim, J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. ICML, 2017.
- Discover relations between two different domains from unpaired samples, and find a mapping from one domain to another
- How to generate data with augmented complexity?
GANs and Cross Domain GANs

We consider a function G_ABZ that maps elements from domains A and B to a domain Z which includes the co-domains A and B. In an unsupervised learning scenario, G_ABZ can be arbitrarily defined; to apply it to real-world applications, some conditions on the relation of interest have to be well-defined.

In an idealistic setting, the equality

G_ABZ ∘ G_ZAB(x_A, x_B) = (x_A, x_B)    (1)

is satisfied. However, this is a hard constraint: it is not straightforward to optimize, and a relaxed soft constraint is preferred. As a soft constraint, we can consider the distance

d(G_ABZ ∘ G_ZAB(x_A, x_B), (x_A, x_B)),    (2)

and minimize it using a metric such as L1 or L2. The generator is also trained with the usual adversarial term

−E_{x_A,x_B∼P_{A,B}}[log D_Z(G_ABZ(x_A, x_B))].    (3)
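A sketch of the soft constraint (2) as it is typically implemented, with L1 as the metric (the generator callables, their return conventions, and the use of PyTorch tensors are assumptions):

```python
def cycle_loss(G_ABZ, G_ZAB, x_a, x_b):
    """Relaxation of Eq. (1): L1 distance of the round trip back to (x_A, x_B)."""
    z = G_ABZ(x_a, x_b)            # A × B -> Z
    x_a_rec, x_b_rec = G_ZAB(z)    # Z -> A × B
    return (x_a_rec - x_a).abs().mean() + (x_b_rec - x_b).abs().mean()
```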
CrystalGAN
Our goal: generate x_AHB (or y_BHA) from {x_AH^i}_{i=1}^{N_AH} and {y_BH^i}_{i=1}^{N_BH}.

Three steps of CrystalGAN:

1. A first-step GAN, closely related to the cross-domain GANs, generates pseudo-binary samples where the domains are mixed.
2. A feature-transfer procedure constructs data of higher-order complexity from the samples generated at the previous step, with components from all domains well separated.
3. A second-step GAN synthesizes, under geometric constraints, novel stable ternary chemical structures.
CrystalGAN: First Step
CrystalGAN: Feature Transfer and Second Step
Numerical Experiments: POSCAR files
A materials science application: generate stable data for hydrogen storage.

Dataset: POSCAR files (3 matrices):

1. the abc matrix, corresponding to the three lattice vectors defining the unit cell of the system,
2. the atomic positions of the H atoms, and the coordinates of the metallic atoms A (or B).
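For concreteness, a hedged sketch of reading these pieces from a VASP 5-style POSCAR file (it assumes direct coordinates and no 'Selective dynamics' line):

```python
import numpy as np

def read_poscar(path):
    """Extract the lattice (abc) matrix and per-element atomic positions
    from a VASP 5-style POSCAR file."""
    with open(path) as f:
        lines = [l.split() for l in f]
    scale = float(lines[1][0])
    abc = scale * np.array(lines[2:5], dtype=float)    # three lattice vectors
    elements = lines[5]                                 # e.g. ['Pd', 'H']
    counts = [int(c) for c in lines[6]]
    coords = np.array(lines[8:8 + sum(counts)], dtype=float)[:, :3]
    positions, i = {}, 0
    for elem, n in zip(elements, counts):
        positions[elem] = coords[i:i + n]               # fractional coordinates
        i += n
    return abc, positions
```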
Results on Crystals Dataset
Number of ternary compositions of good quality (stable) generated by the tested methods (learning from about 35 observations from each class):

Composition   GAN (standard)   DiscoGAN   CrystalGAN (without constraints)   CrystalGAN (with geometric constraints)
Pd–Ni–H¹      0                0          4                                  9
Mg–Ti–H²      0                0          2                                  8

¹ Palladium – Hydrogen – Nickel
² Magnesium – Hydrogen – Titanium
Published in the proceedings of AAAI-MAKE, 2019.
Many thanks to Matthieu Clertant (learning under budget) and Asma Nouira (GANs for materials science) for sharing their slides and figures.