Pathways as robust biomarkers for cancer classification: the power of big expression data. David Amar, Tom Hait, and Ron Shamir, Blavatnik School of Computer Science, Tel Aviv University

NetBioSIG2014-Talk by David Amar


NetBioSIG2014 at ISMB in Boston, MA, USA on July 11, 2014

Page 1: NetBioSIG2014-Talk by David Amar

Pathways as robust biomarkers for cancer classification: the power of big expression data

David Amar, Tom Hait, and Ron Shamir

Blavatnik School of Computer Science

Tel Aviv University

Page 2: NetBioSIG2014-Talk by David Amar

Motivation and introduction

Page 3: NetBioSIG2014-Talk by David Amar

Comparative genomics
• Standard expression experiments: cases vs. controls -> differential genes -> interpretation
• Problems
  • Small number of samples
  • Non-specific signal
  • Interpretation of a gene set / gene ranking
• Goal: find specific changes for a tested disease
  • E.g., an up-regulated pathway
  • Crucial for clinical studies

Page 4: NetBioSIG2014-Talk by David Amar

Previous integrative classification studies
• Huang et al. 2010 PNAS (9,160 samples); Schmid et al. PNAS 2012 (3,030); Lee et al. Bioinformatics 2013 (~14,000)
  • Multilabel classification
  • Global expression patterns
  • Only 1-3 platforms
  • Many datasets were removed from GEO
  • No "healthy" class (Huang); no diseases (Lee)
• Pathprint (Altschuler et al. 2013)
  • Use pathways
  • Tissue classification (as in Lee et al.)

Page 5: NetBioSIG2014-Talk by David Amar

Integrating pathways and molecular profiles
• Enrichment tests
  • Improve interpretability
  • GSEA / GSA
    • Rank-based
    • Higher statistical power
• Classification
  • Extract pathway features
    • Example: given a pathway, remove non-differential genes
  • Not clear if prediction performance improves compared to using genes (Staiger et al. 2013)

Page 6: NetBioSIG2014-Talk by David Amar

Pathway-based gene expression database

Page 7: NetBioSIG2014-Talk by David Amar

[Figure: overview of the pathway-based database construction]
• Inputs
  • Pathways: KEGG, Reactome, Biocarta, NCI
  • Expression profiles: GSE, GDS, TCGA
  • Sample labels: disease; dataset/sample description
• Single sample analysis (platform data)
  • Ranked genes/transcripts of sample j: g1, g2, g3, ..., gk
  • Weighted ranks: W_i = i e^{-i/k}, giving w1, w2, w3, ..., wk
  • Standardized profile (low expression to high expression)
• Single sample - single pathway analysis
  • For each pathway: mean, SD
• Outputs
  • X_P: samples x pathway features
  • Y: sample labels

Page 8: NetBioSIG2014-Talk by David Amar

Single sample analysis
• Input: an expression profile of a sample
  • A vector of real values for each patient
• Step 1: rank the genes
• Step 2: calculate a score for each gene, as a function of
  • The rank of gene g in sample s
  • The total number of ranked genes (Yang et al. 2012, 2013)
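The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes a weight of the form r * exp(-r / k), where r is the rank of a gene in the sample and k is the total number of ranked genes; that form is a reading of the slide, and the exact definition is in Yang et al. It also assumes that rank 1 is the lowest-expressed gene. The function name `weighted_rank_scores` and the toy values are hypothetical.

```python
import math

def weighted_rank_scores(expression):
    """Map one sample's expression values to rank-based scores.

    Assumptions (not confirmed by the slides): rank 1 is the
    lowest-expressed gene, and a gene with rank r among k ranked
    genes gets the score r * exp(-r / k).
    """
    k = len(expression)
    # Step 1: rank the genes (ties are broken by input order).
    order = sorted(range(k), key=lambda g: expression[g])
    ranks = [0] * k
    for position, gene in enumerate(order, start=1):
        ranks[gene] = position
    # Step 2: compute a score for each gene from its rank only.
    return [r * math.exp(-r / k) for r in ranks]

# Hypothetical sample with four genes.
sample = [5.2, 0.1, 7.9, 3.3]
scores = weighted_rank_scores(sample)
```

Because the score depends only on ranks, any order-preserving rescaling of a sample leaves its scores unchanged.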

Page 9: NetBioSIG2014-Talk by David Amar

Pathway features
• 1723 pathways in total
  • Covering 7842 genes
  • Mean size: 36.35 (median 15)
• Score all genes that are in the pathway databases
• Pathway statistics:
  • Mean score
  • Standard deviation
  • Skewness
  • KS test
• Pathway DBs: KEGG, Reactome, Biocarta, NCI
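The four pathway statistics can be computed per sample and per pathway roughly as follows. This is a sketch under stated assumptions: the slides do not say what the KS test compares, so the code assumes pathway genes against all other scored genes and reports only the KS statistic, not a p-value. The names `ks_statistic` and `pathway_features` and the toy gene scores are hypothetical.

```python
import statistics

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    gap = 0.0
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def pathway_features(gene_scores, pathway_genes):
    """Summarize one pathway in one sample.

    gene_scores: dict gene -> score for a single sample.
    pathway_genes: genes in the pathway; genes missing from the sample are skipped.
    Assumption: the KS comparison is pathway genes vs. all other scored genes.
    """
    members = set(pathway_genes)
    inside = [s for g, s in gene_scores.items() if g in members]
    outside = [s for g, s in gene_scores.items() if g not in members]
    mean = statistics.mean(inside)
    sd = statistics.pstdev(inside)
    # Skewness as the third standardized moment (0.0 when all scores are equal).
    skew = 0.0 if sd == 0 else sum(((s - mean) / sd) ** 3 for s in inside) / len(inside)
    return {"mean": mean, "sd": sd, "skewness": skew,
            "ks": ks_statistic(inside, outside)}

# Hypothetical scores for five genes and a three-gene pathway.
toy_scores = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 10.0, "E": 11.0}
features = pathway_features(toy_scores, ["A", "B", "C"])
```

Stacking these feature dictionaries over all pathways gives one row of the pathway feature matrix per sample.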

Page 10: NetBioSIG2014-Talk by David Amar

Patient labels
• Unite ~180 datasets, >14,000 samples
• Public databases contain 'free text'
• Problem: automatic mapping fails, example:
  • GDS4358: "lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy"
  • MetaMap top score: "HIV infections"
• Solution: manual analysis
  • Read descriptions and papers

Page 11: NetBioSIG2014-Talk by David Amar

Current microarray data
• Data from GEO
  • 13,314 samples
  • 17 platforms
• Sample annotation
  • Ignore terms with fewer than
    • 100 samples
    • 5 datasets
  • 48 disease terms
[Figure: X_P matrix (samples x pathway features) and Y matrix (samples x disease terms, entries in {0,1})]
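Building the binary label matrix Y with the term filter can be sketched as below. The slide lists the two thresholds without saying how they combine, so the code assumes a term is kept only if it reaches both. The function name `build_label_matrix`, the tuple layout of the input, and the toy samples are hypothetical.

```python
from collections import defaultdict

def build_label_matrix(samples, min_samples=100, min_datasets=5):
    """Build a binary samples x terms matrix Y.

    samples: list of (sample_id, dataset_id, set_of_disease_terms).
    Assumption: a term is kept only if it reaches both thresholds.
    """
    n_samples = defaultdict(int)
    datasets = defaultdict(set)
    for _, dataset_id, terms in samples:
        for term in terms:
            n_samples[term] += 1
            datasets[term].add(dataset_id)
    kept = sorted(t for t in n_samples
                  if n_samples[t] >= min_samples and len(datasets[t]) >= min_datasets)
    # One row per sample, one {0,1} column per kept term.
    Y = [[1 if t in terms else 0 for t in kept] for _, _, terms in samples]
    return kept, Y

# Hypothetical toy input; thresholds lowered so the example is small.
toy = [("s1", "d1", {"cancer"}),
       ("s2", "d2", {"cancer", "leukemia"}),
       ("s3", "d2", set())]
terms, Y = build_label_matrix(toy, min_samples=2, min_datasets=2)
```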

Page 12: NetBioSIG2014-Talk by David Amar

Analysis and results

Page 13: NetBioSIG2014-Talk by David Amar

Multi-label classification algorithms
• Learn a single classifier for each disease
  • Ignore class dependencies
• Adaptation: Bayesian Correction
  • Learn single classifiers
  • Correct errors using the DO DAG
• Transformation: use the label power sets and learn a multiclass model
• Using RF: multi-label trees
  • Was better than most approaches in an experimental study (Madjarov et al. 2012)
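As an illustration of the transformation approach, the label power set idea maps every distinct label combination to one class, so that any multiclass learner can be trained on the result. The sketch below shows only that mapping; the function name `label_powerset` and the toy matrix are hypothetical, and no particular learner is implied.

```python
def label_powerset(Y):
    """Turn binary label rows into single multiclass targets.

    Each distinct row of Y (one label combination) becomes one class id.
    """
    class_of = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        targets.append(class_of[key])
    # Inverse map: class id -> label combination, used to decode predictions.
    combos = {cid: list(key) for key, cid in class_of.items()}
    return targets, combos

# Hypothetical label matrix: 4 samples, 2 disease terms.
targets, combos = label_powerset([[1, 0], [1, 1], [1, 0], [0, 0]])
```

The number of classes grows with the number of distinct label combinations seen in training, which is the usual cost of this transformation.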

Page 14: NetBioSIG2014-Talk by David Amar

How to validate a classifier?
• Use leave-dataset-out cross-validation
• Global AUC scores: each prediction Pij vs. the correct label Yij
• Disease-based AUC scores: consider each column separately
[Figure: on the test set, Y (samples x disease terms, entries in {0,1}) is compared with P (samples x disease terms, probabilities in [0,1]), the output of a multi-label learner]
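The validation scheme above can be sketched as follows: one fold per dataset, a global AUC over all (Yij, Pij) pairs, and one AUC per disease column. This is a minimal illustration with hypothetical function names and toy matrices; AUC is computed directly as the fraction of positive-negative pairs ranked correctly, which is fine for small examples but quadratic in the number of samples.

```python
def leave_dataset_out(dataset_ids):
    """Yield (held_out_dataset, train_indices, test_indices), one fold per dataset."""
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        yield held_out, train, test

def auc(labels, scores):
    """ROC AUC: probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def global_auc(Y, P):
    """Pool every (Yij, Pij) pair into a single AUC."""
    return auc([y for row in Y for y in row], [p for row in P for p in row])

def disease_aucs(Y, P):
    """One AUC per column (disease term)."""
    return [auc([row[j] for row in Y], [row[j] for row in P])
            for j in range(len(Y[0]))]

# Hypothetical test-set labels and predicted probabilities.
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]
P = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.7], [0.4, 0.1]]
folds = list(leave_dataset_out(["d1", "d1", "d2", "d3"]))
```

Holding out whole datasets keeps samples from the same study out of both training and test sets, so the scores are not inflated by dataset-specific effects.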

Page 15: NetBioSIG2014-Talk by David Amar

A problem (!)
• What is in the background?
• For a disease D define:
  • Positives: disease samples
  • Negatives: direct controls
  • Background controls (BGCs)
• Example: 500 positives, 500 negatives, 10000 BGCs
[Figure: Y and P matrices]

Page 16: NetBioSIG2014-Talk by David Amar

Multistep validation
• It is recommended to use several scores (Lee et al. 2013)
• Measure global AUPR
• For each disease we calculate three scores

Measure | Used (additional) information
AUPR: check separation between positives and all others | Sick vs. not sick
ROC: test for separation between positives and negatives | Direct use of negatives
Meta analysis p-value: calculate the overall separation significance within the original datasets (a p-value) | Mapping of samples to datasets
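The first two per-disease scores differ only in which samples count as the "other" class. The sketch below makes that explicit for one disease. It estimates AUPR by average precision, which is a common estimator and an assumption here, not something the slides specify. The role labels `'pos'`, `'neg'`, `'bgc'`, the function names, and the toy scores are hypothetical. The meta-analysis p-value is not covered.

```python
def average_precision(labels, scores):
    """AUPR estimated as average precision (ties are ranked in input order)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / i  # precision at this positive
    return total / hits

def roc_auc(labels, scores):
    """ROC AUC: probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def disease_scores(roles, scores):
    """roles[i] is 'pos', 'neg' (direct control) or 'bgc' (background control)."""
    # AUPR: positives against all other samples (negatives and BGCs).
    aupr = average_precision([int(r == "pos") for r in roles], scores)
    # ROC: positives against direct negatives only; BGCs are dropped.
    kept = [(r, s) for r, s in zip(roles, scores) if r != "bgc"]
    roc = roc_auc([int(r == "pos") for r, _ in kept], [s for _, s in kept])
    return aupr, roc

# Hypothetical predictions for one disease.
roles = ["pos", "pos", "neg", "neg", "bgc", "bgc"]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.1]
aupr, roc = disease_scores(roles, scores)
```

In this toy case the positives are perfectly separated from their direct controls, yet one highly scored background sample lowers the AUPR, which is why the two scores are reported together.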

Page 17: NetBioSIG2014-Talk by David Amar

Performance results

[Figure: performance results. Axes: positives vs. negatives ROC, and AUPR. Filled boxes: meta-analysis q-value < 0.001]

Page 18: NetBioSIG2014-Talk by David Amar

Performance results

8.5% improvement in recall, 12% in precision, compared to Huang et al.

Page 19: NetBioSIG2014-Talk by David Amar

Validation on RNA-Seq
• Data from TCGA: 1,699 samples

Page 20: NetBioSIG2014-Talk by David Amar

Pathway-Disease network
Steps (for each of the selected diseases):
1. Disease-pathway edges
   1. RF importance: select the top features
   2. Test for disease relevance
2. Add edges between diseases
   1. Use the DO structure
3. Add edges between pathways
   1. Based on significant overlap in genes
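Step 3 needs a test for whether two pathways share more genes than expected by chance. The slides do not name the test; the sketch below assumes a one-sided hypergeometric test over a fixed gene universe, which is a common choice for gene-set overlap. The function name `overlap_pvalue` and the toy gene sets are hypothetical.

```python
from math import comb

def overlap_pvalue(pathway_a, pathway_b, n_genes):
    """P(overlap >= observed) under a hypergeometric null.

    Assumption: both pathways are drawn from the same universe of
    n_genes genes, and pathway_b is treated as a random draw of its size.
    """
    a, b = set(pathway_a), set(pathway_b)
    observed = len(a & b)
    total = comb(n_genes, len(b))
    # Sum the probabilities of every overlap size at least as large as observed.
    tail = sum(comb(len(a), x) * comb(n_genes - len(a), len(b) - x)
               for x in range(observed, min(len(a), len(b)) + 1))
    return tail / total

# Hypothetical pathways in a universe of four genes.
p_same = overlap_pvalue({"g1", "g2"}, {"g1", "g2"}, n_genes=4)
p_disjoint = overlap_pvalue({"g1", "g2"}, {"g3", "g4"}, n_genes=4)
```

With many pathway pairs tested, the resulting p-values would need a multiple-testing correction before drawing edges.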

Page 21: NetBioSIG2014-Talk by David Amar

Cancer network

[Figure: cancer network. Legend: Down, Up]

Page 22: NetBioSIG2014-Talk by David Amar

Cardiovascular disease

[Figure: cardiovascular disease network. Legend: Down, Up]

Page 23: NetBioSIG2014-Talk by David Amar

Gastric cancers

Page 24: NetBioSIG2014-Talk by David Amar

Summary
• Large scale integration
• Multi-label learning
• Careful validation
• Pathway based features as biomarkers
• Summary of the results in a network
• Currently
  • Add genes: overcome missing values
  • Shows improvement in validation

Page 25: NetBioSIG2014-Talk by David Amar

Acknowledgements
• Ron Shamir
• Tom Hait