
Page 1: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Stefan Arnborg, KTH, SICS

Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet

’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002

http://www.nada.kth.se/~stefan

Data Mining in Schizophrenia Research - preliminary

Page 2: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,
Page 3: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

HUBIN - a project to accelerate research in human brain diseases

• Carefully selected patients, relatives and controls
• Each participant characterized over many domains
• DNA stored in bio-bank
• Each research team collects high-quality information, analyzes it, and stores it in an archive for inter-domain analyses.

Page 4: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Hubin organization

Ethical group
• Göran Sedvall, Chairman

Hubin AB
• Stig Larsson, Chairman
• Håkan Hall, CEO

Project staff; Data domain responsibles

Management group
• Håkan Hall, Assoc. Prof. (project manager)
• Stig Larsson, T.D. hc
• Göran Sedvall, Prof.
• Stefan Arnborg, Prof.
• Tom McNeil, Prof.
• Lars Therenius, Prof.

Scientific advisory board
• Göran Sedvall, Chairman
• Nancy Andreasen, Univ of Iowa
• Paul Greengard, Rockefeller Univ
• Tomas Hökfelt, Karolinska Inst.

Page 5: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Leading causes of disability in the world, WHO (1990)

Cause of disability                           Total (millions)   % of world total
 1. Unipolar major depression                 50.8               10.7
 2. Iron deficiency anemia                    22.0                4.7
 3. Falls                                     22.0                4.6
 4. Alcohol use                               15.8                3.3
 5. Chronic obstructive pulmonary disease     14.7                3.1
 6. Bipolar disorder                          14.1                3.0
 7. Congenital anomalies                      13.5                2.9
 8. Osteoarthritis                            13.3                2.8
 9. Schizophrenia                             12.1                2.6
10. Obsessive compulsive disorder             10.2                2.2

Page 6: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Schizophrenia - Questions and Clues

• Cause(s) of schizophrenia not known.
• Medication effective against some symptoms - discovered by chance 100-2000 years ago.
• Does not appear in animals - no experimental clues.
• Explanation models vary over time.
• Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis)
• Influenced by genotype and/or environment? (clustering in families)

Page 7: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Schizophrenia - Questions and Clues

• Which processes result in disease?
• Traces of disturbed development visible in MRI (anatomy) and blood tests?
• Genetic risk factors?
• Causal pathways?
• MAIN PROBLEM: Connect psychiatry to physiology

Page 8: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Preliminary analysis

Test case: 144 subjects: 61 affected, 83 controls

Variables:
• Diagnosis (DSM-IV)
• Demography (age, gender, ...)
• Blood tests (liver, heart, ...)
• Genetics (20 SNPs, receptor, growth factors, ...)
• Anatomy (MRI)
• Neuropsychology (working memory, reactions)
• Clinical test batteries (type of delusions, history, medication)

Page 9: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Types of images used in HUBIN

In vivo imaging
• Magnetic resonance images (MRI)
• Functional magnetic resonance images (fMRI)
• Positron emission tomography (PET)
• Single photon emission tomography (SPECT)

In vitro imaging (whole hemispheres)
• Autoradiography
• In situ hybridization

[Figure: example images labelled MRI, PET, ISHH and LAR]

Page 10: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Brain boxes

Picture from BRAINS II manual, Magnotta et al, University of Iowa

Page 11: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Manually drawn vermis regions

ROIs drawn by Gaku Okugawa

Page 12: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Single Nucleotide Polymorphism

RNA:          A U G   U U C   C A U   U A U   U G U
Protein A:            Phe     His     Tyr     Cys

RNA:          A U G   U U U   C A U   U A U   U A U
Protein A’:           Phe     His     Tyr     Tyr

• Non-coding SNP: UUC → UUU (Phe in both proteins)
• Coding SNP: UGU → UAU (Cys → Tyr)

Protein A can be slightly different from A’
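The difference between the two SNPs on this slide can be checked mechanically. The following is a minimal Python sketch, assuming the two RNA strings shown on the slide and a codon table restricted to the codons that occur in them (standard genetic code). The names `CODON_TABLE`, `translate` and `classify_snps` are illustrative and not part of the presentation.

```python
# Codon table limited to the codons on the slide (standard genetic code).
CODON_TABLE = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}


def codons(rna):
    """Split an RNA string (spaces ignored) into consecutive codons."""
    rna = rna.replace(" ", "")
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


def translate(rna):
    """Translate an RNA string into a list of amino acid abbreviations."""
    return [CODON_TABLE[c] for c in codons(rna)]


def classify_snps(rna_a, rna_b):
    """List the codons that differ and whether the amino acid changes."""
    result = []
    for ca, cb in zip(codons(rna_a), codons(rna_b)):
        if ca != cb:
            aa, ab = CODON_TABLE[ca], CODON_TABLE[cb]
            kind = "silent" if aa == ab else "amino acid change"
            result.append((ca, cb, aa, ab, kind))
    return result


RNA_A = "A U G U U C C A U U A U U G U"
RNA_A_PRIME = "A U G U U U C A U U A U U A U"

for row in classify_snps(RNA_A, RNA_A_PRIME):
    print(row)
```

The first difference leaves the protein unchanged (what the slide labels a non-coding SNP), while the second replaces Cys with Tyr (the coding SNP).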

Page 13: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Genes studied

• DBH: dopamine beta-hydroxylase
• DRD2: dopamine receptor D2 +
• DRD3: dopamine receptor D3
• HTR5A: serotonin receptor 5A
• NPY: neuropeptide Y
• SLC6A4: serotonin transporter
• BDNF: brain derived neurotrophic factor
• NRG1: neuregulin +

Page 14: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Elementary Visualizations: MRI intracranial volume

[Figure: cumulative distribution of intracranial volume (ml); + = schizo, second marker = controls]

Page 15: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Elementary Visualizations: MRI data

[Figure: cumulative distribution of total CSF volumes (ml); + = schizo, second marker = controls; p < 0.0002]

Page 16: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Blood data: Gamma GT - alcohol marker

[Figure: cumulative distribution of Gamma GT, x-axis 0 to 10; + = schizo, second marker = controls; p < 0.01]

Page 17: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Gender differences (MRI)

[Figure: cumulative distributions of subcortical white in two panels, "Sub White-women" (x-axis 25 to 55) and "Sub White-men" (x-axis 30 to 65); + = schizo, second marker = controls]

Page 18: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Which methods to use?

• Visualizations (cdf and scatter plots) give an intuitive grasp of individual variables, but become problematic with many interrelated variables

• Statistical modelling is required to decide the significance of a visible trend, and to rank effects
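The cumulative distribution plots used on the earlier slides rest on the empirical cdf. The following is a minimal Python sketch of how such a curve, and the largest vertical gap between two groups, could be computed. The function names and the sample values are hypothetical and are not taken from the HUBIN data.

```python
def ecdf(sample):
    """Return the empirical cdf as a list of (value, cumulative fraction)."""
    xs = sorted(sample)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def ecdf_value(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def max_cdf_gap(a, b):
    """Largest vertical distance between the two empirical cdfs."""
    points = list(a) + list(b)
    return max(abs(ecdf_value(a, x) - ecdf_value(b, x)) for x in points)


# Hypothetical volumes (ml) for two small groups.
affected = [310.0, 355.0, 402.0, 428.0]
controls = [250.0, 290.0, 330.0, 365.0]

print(ecdf(affected))
print(max_cdf_gap(affected, controls))
```

A large gap only suggests a difference between the groups; whether it is significant is a separate question for the statistical methods discussed next.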

Page 19: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Statistical methods

• Bayesian methods intuitive and rational - but conventional testing required for publications

• Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project).

• Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost

• Non-parametric randomization tests - most sensitive and accommodate modern multiple testing paradigms
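As an illustration of the last bullet, here is a minimal Python sketch of a two-sample randomization (permutation) test. It assumes the absolute difference in group means as test statistic and a fixed number of random relabellings; the presentation does not say which statistic or how many permutations were used, so both are assumptions, as are the function names.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided randomization test on the difference in means.

    Returns the estimated p-value: the share of random relabellings
    whose statistic is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical measurements, not HUBIN data.
print(permutation_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5]))
```

The test makes no distributional assumption beyond exchangeability of the labels under the no-effect hypothesis, which is why it combines easily with multiple-testing corrections applied to the resulting p-values.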

Page 20: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Bayes’ factor

• Choice between two hypotheses, H1 and H2, given experimental/observational data D:

\[
\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}
\]

Posterior odds = Bayes factor × prior odds

Page 21: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Hypotheses in test matrix

• H1: (no effect) a data column is generated independently of diagnosis (composite model)

• H2: the data for controls are generated by one composite model, and the data for affected by another.
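For a discrete data column, the comparison of H1 and H2 can be sketched as follows. This is a minimal Python example that assumes a multinomial model with a uniform Dirichlet prior as the composite model; the presentation does not specify the prior, so this choice, the function names and the counts are all assumptions made for illustration.

```python
from math import lgamma


def log_marginal(counts):
    """Log probability of an observed sequence with these category counts
    under a multinomial model with a uniform Dirichlet prior."""
    k = len(counts)
    n = sum(counts)
    return lgamma(k) - lgamma(n + k) + sum(lgamma(c + 1) for c in counts)


def log_bayes_factor(controls, affected):
    """log of P(D|H2) / P(D|H1): separate models versus one common model."""
    pooled = [c + a for c, a in zip(controls, affected)]
    return log_marginal(controls) + log_marginal(affected) - log_marginal(pooled)


# Hypothetical counts of a three-valued variable in each group.
controls = [30, 40, 13]
affected = [35, 20, 6]
print(log_bayes_factor(controls, affected))
```

A positive value favours H2 (the column depends on diagnosis) and a negative value favours H1. With two categories the marginal likelihood reduces to s! f!/(n+1)!, the coin formula used later in the deck.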

Page 22: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

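Hierarchical models

• Continuous model: \( H: P(x \in X) = \int_X f(x)\,dx \)
• Parameterized model: \( H_\lambda: P(x \in X) = \int_X f(x \mid \lambda)\,dx \)
• Hierarchical model, with prior distribution \( f(\lambda) \) for \( \lambda \): \( H_1: P(D \mid H_1) = \int_\Lambda f(D \mid \lambda) f(\lambda)\,d\lambda \)
• Inference for the parameter \( \lambda \): \( f(\lambda \mid D) \propto f(D \mid \lambda) f(\lambda) \)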

Page 23: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Model adequate?

• Best tested with classical p-values.
• Determine posterior for parameter: \( f(\lambda \mid D) \propto f(D \mid \lambda) f(\lambda) \)
• Design test function \( t: D \to R \)
• Compute p-value: \( p = P(t(D_r) < t(D)) \), where \( D_r \sim f(\cdot \mid \lambda)\, f(\lambda \mid D) \)
• Reject model if p small, e.g., <1%, <5%
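The recipe on this slide can be simulated directly. The sketch below is a Python illustration under assumptions that are not in the presentation: the data are a 0/1 sequence, the model is independent Bernoulli draws with a uniform prior on the success probability, and the test function t counts the number of switches between neighbouring values. Replicated data sets D_r are drawn by first sampling the parameter from its posterior and then sampling data from the model.

```python
import random


def switches(seq):
    """Test function t: number of positions where the value changes."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def posterior_predictive_p(data, t=switches, n_rep=2000, seed=0):
    """Estimate p = P(t(D_r) < t(D)) by simulation."""
    rng = random.Random(seed)
    s = sum(data)              # number of ones
    f = len(data) - s          # number of zeros
    t_obs = t(data)
    below = 0
    for _ in range(n_rep):
        # Posterior under a uniform prior is Beta(s + 1, f + 1).
        lam = rng.betavariate(s + 1, f + 1)
        replicate = [1 if rng.random() < lam else 0 for _ in data]
        if t(replicate) < t_obs:
            below += 1
    return below / n_rep


blocky = [0] * 10 + [1] * 10   # hypothetical data with a single switch
print(posterior_predictive_p(blocky))
```

For the blocky sequence almost no replicate has fewer switches than the data, so p is close to zero and the independence model would be rejected by the rule on the slide.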

Page 24: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Bayes’ example

• Result D from test: s heads, f tails, n = s + f
• H0: Coin is balanced, \( P(D \mid H_0) = 2^{-n} \)
• H_p: Coin has head probability p, \( P(D \mid H_p) = p^s (1-p)^f \)
• H1: H_p with uniform prior for p (hierarchical): \( P(D \mid H_1) = \int_0^1 P(D \mid H_p)\,dp = \frac{s!\,f!}{(n+1)!} \)
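The coin example can be worked through numerically. The following minimal Python sketch evaluates the two likelihoods, P(D|H0) = 2^(-n) and P(D|H1) = s! f!/(n+1)!, and their ratio using exact fractions. The function name and the example counts are illustrative.

```python
from fractions import Fraction
from math import factorial


def coin_bayes_factor(s, f):
    """Bayes factor P(D|H1) / P(D|H0) for s heads and f tails.

    H0: balanced coin. H1: unknown head probability, uniform prior.
    """
    n = s + f
    p_d_h0 = Fraction(1, 2 ** n)
    p_d_h1 = Fraction(factorial(s) * factorial(f), factorial(n + 1))
    return p_d_h1 / p_d_h0


print(float(coin_bayes_factor(5, 5)))    # balanced outcome: factor below 1
print(float(coin_bayes_factor(10, 0)))   # ten heads in a row: factor above 1
```

With 5 heads and 5 tails the factor is 256/693, so the data favour the balanced coin; with 10 heads and no tails it is 1024/11, which favours H1.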

Page 25: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Bayes factor - ratio between areas

Page 26: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Graphical models

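[Figure: three graphs on the nodes X, Y, Z, each with its factorization]

• No edges: f(x,y,z) = f(x) f(y) f(z)
• Edges X-Y and X-Z: f(x,y,z) = f(x,z) f(x,y) / f(x)
• Fully connected: f(x,y,z)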
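The middle factorization, f(x,y,z) = f(x,z) f(x,y) / f(x), holds exactly when Y and Z are conditionally independent given X. The following minimal Python sketch checks this numerically for a small joint distribution over binary variables. The probability tables are hypothetical values chosen only for illustration.

```python
from itertools import product

# Hypothetical distributions over binary variables.
f_x = {0: 0.6, 1: 0.4}
f_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
f_z_given_x = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}

# Joint distribution in which Y and Z depend on each other only through X.
joint = {
    (x, y, z): f_x[x] * f_y_given_x[x][y] * f_z_given_x[x][z]
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(table, keep):
    """Sum the joint table over all coordinates not listed in keep."""
    out = {}
    for key, p in table.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out


f_xy = marginal(joint, (0, 1))
f_xz = marginal(joint, (0, 2))
f_x_marginal = marginal(joint, (0,))

# Largest deviation between the joint and the factorized form.
max_error = max(
    abs(p - f_xz[(x, z)] * f_xy[(x, y)] / f_x_marginal[(x,)])
    for (x, y, z), p in joint.items()
)
print(max_error)
```

The deviation is zero up to rounding, which confirms that this joint distribution fits the graph with edges X-Y and X-Z only.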

Page 27: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Multivariate characterization by graphical models: MRI volumes, blood, demography

[Figure: graphical model on the nodes Dia, BrsCSF, TemCSF, SubCSF, TotCSF]

Page 28: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Adding Vermis variables

[Figure: graphical model on the nodes Dia, BrsCSF, TemCSF, PSV]

Page 29: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

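V-structures, causality

[Figure: graphs on the nodes A, B, C, annotated with the dependence relations between A and C, both marginally and given B, and two graphs on the nodes X, Y]

• V-structures are detectable from observational data
• The two graphs on X and Y are indistinguishable: f(x,y) = f(y|x) f(x) = f(x|y) f(y)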

Page 30: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Pairs associated to Diagnosis

[Figure: four graphs on the nodes Y, Z, D]

Y and Z co-vary differently for Affected and Controls

Page 31: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Age-dependency of Posterior Superior Vermis

[Figure: post sup vermis plotted against age at MRI; + = schizo, second marker = controls]

Page 32: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

No co-variation between Posterior inferior vermis and parietal white for affected

[Figure: plot "ParWhite" of post inf vermis (0.08 to 0.24) against parietal white (70 to 130); + = schizo, second marker = controls]

Page 33: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

PSV has best explanatory power

[Figure: cumulative distribution of posterior superior vermis ("PS Vermis"), x-axis 0.05 to 0.3, affected vs. healthy; + = schizo, second marker = controls]

Page 34: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Decision tree for Diagnosis: MRI data

[Figure: decision tree; A = schiz, C = controls, () = misclassifications]

Page 35: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Classification explains data! (Can Mert thesis project)

[Figure: two graphs on the variables X, Y, Z, W, one of them with an additional node H]

Page 36: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Autoclass1

[Figure: total gray; A = schiz, C = controls]

Page 37: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Weak signals in genetics data

• Numerous investigations have indicated ‘almost significant’ signals of SNPs to diagnosis

• Typically, these findings cannot be confirmed in other studies - populations are genetically heterogeneous and measurements nonstandardized.

• We try to connect SNPs both to diagnosis and to other phenotypical variables

• Multiple testing and weak signal problems.

Page 38: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Genetics data - weak statistics

Gene     SNP                   Type   Genotype counts
DRD3     SerGly                A/C     49    59    14
DRD2     Ser311Cys             C/G    118     4     0
NPY      Leu7Pro               A/C      1     7   144
DBH      Ala55Ser              G/T     98    24     0
BDNF     Val66Met              A/G      5    37    80
HTR5A    Pro15Ser              C/T    109    11     2
PNOC     Gln172Arg             A/G     11    37    28
SLC6A4   (del(44bp) in pr)     S/L     20    60    42

Page 39: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Empirical distribution by genotype: gene BDNF (schiz + controls)

[Figure: cumulative distribution of frontal CSF (panel "Frontal-CSF-right-", x-axis 25 to 75), one curve per genotype: A/A, A/G, G/G]

Page 40: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Bonferroni-Hochberg-Benjamini methods: MRI and lab data

[Figure: observed p-values (y-axis 0 to 0.02) against number of p-values (x-axis 0 to 140), compared with the ‘no effect’ line; FDRi 71, FDRd 62]

Benjamini & Yekutieli, Annals of Math Stat, (ta)

Page 41: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Multiple comparisons:

What is the significance of min p-values 1, 1, 2, 3% in 20 tests? What is the probability of obtaining a more extreme result?

Page 42: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Compensating multiple comparisons

• Bonferroni 1937: For level α and n tests, use level α/n
• Hochberg 1988: step-up procedure
• Benjamini, Hochberg 1996: False Discovery Rate
• J. Storey, 2002: pFDRi, pFDRd
• Bayesian interpretations being developed (Wasserman & Genovese, 2002)
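Two of the corrections listed above can be sketched in a few lines. The following minimal Python example implements the Bonferroni adjustment and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The four smallest p-values follow the 1, 1, 2, 3% example from the deck; the remaining sixteen values (0.5) are an assumption added to complete a set of 20 tests, and the function names are illustrative.

```python
def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    n = len(pvalues)
    return [min(1.0, n * p) for p in pvalues]


def benjamini_hochberg(pvalues, q):
    """Indices rejected by the Benjamini-Hochberg step-up procedure.

    Finds the largest rank k with p_(k) <= k * q / n and rejects the
    k hypotheses with the smallest p-values.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / n:
            k = rank
    return sorted(order[:k])


# Four small p-values among 20 tests; the other 16 are assumed to be 0.5.
pvalues = [0.03, 0.02, 0.01, 0.01] + [0.5] * 16

print(min(bonferroni_adjust(pvalues)))      # about 0.2, i.e. 20%
print(benjamini_hochberg(pvalues, q=0.05))  # nothing rejected
print(benjamini_hochberg(pvalues, q=0.2))   # the four small p-values
```

The smallest Bonferroni-adjusted value is 20 × 1% = 20%, which is the Bonferroni figure quoted for this example in the deck. The step-up procedure is less conservative because it considers the whole ordered set of p-values, not only the minimum.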

Page 43: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Diagnosis-genotype: 21 tests on three genes

bdnf     drd2     nrg1
0.1136   0.0735   0.8709
0.0801   0.2213   0.7666
0.0316   0.0823   0.6426
0.5499   1.0000   0.0244
0.7314   0.7312   0.0103

bdnf     drd2     nrg1
0.1137   0.0749   0.8744
0.7293   0.7276   0.0096

Page 44: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

p-values 3%, 2%, 1%, 1% in 20 tests

7% - not quite significant! But better than Bonferroni: 20%

Page 45: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: TPH:SNP000002367; x-axis 0 to 400, y-axis 0 to 1]

q-value - FDR rate in prefix

Page 46: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: BDNF:SNP000006430; x-axis 0 to 400, y-axis 0 to 1]

Page 47: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: sex; x-axis 0 to 400, y-axis 0 to 1]

Page 48: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: diagnos; x-axis 0 to 400, y-axis 0 to 1]

Page 49: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: 3 × 3 grid of panels (y-axis 0 to 1) for VermisMiddle-grey-right-; panel titles in the first row: VermisMiddle-grey-right-, men, women. Panel annotations, row by row: 0.085, 0.065, 0.81; 0.005, 0.01, 0.275; 0.81, 0.725, 0.955]

Page 50: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: 3 × 3 grid of panels (y-axis 0 to 1) for VermisUpper-grey-; panel titles in the first row: VermisUpper-grey-, men, women; each panel carries a numeric annotation]

Page 51: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: 3 × 3 grid of panels (y-axis 0 to 1) for Subcortical-grey-left-; panel titles in the first row: Subcortical-grey-left-, men, women; each panel carries a numeric annotation]

Page 52: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

[Figure: four panels (y-axis 0 to 1) for Parietal-CSF-left-; panel titles and annotations: Parietal-CSF-left- 0.99236, men 0.69636, control 0.91245, women 0.99903]

Page 53: Stefan Arnborg,  KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

That’s all, folks!

• High-quality databases for medical research of the HUBIN type open the way for intelligent data analysis methods used in engineering and business
• Already with the limited data presently available, interesting clues emerge
• Multiple testing considerations are important
• Long-term effort: stable economy and engagement are vital.