NetBioSIG2014-Talk by David Amar

1

Pathways as robust biomarkers for cancer classification: the power of big expression data

David Amar, Tom Hait, and Ron Shamir

Blavatnik School of Computer Science

Tel Aviv University

2

Motivation and introduction

3

Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems
• Small number of samples
• Non-specific signal
• Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease
• E.g., an up-regulated pathway
• Crucial for clinical studies
Previous integrative classification studies
• Huang et al. 2010 PNAS (9,160 samples); Schmid et al. PNAS 2012 (3,030); Lee et al. Bioinformatics 2013 (~14,000)
• Multilabel classification
• Global expression patterns
• Only 1-3 platforms
• Many datasets were removed from GEO
• No “healthy” class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013)
• Use pathways
• Tissue classification (as in Lee et al.)

4

Integrating pathways and molecular profiles
Enrichment tests
• Improve interpretability
• GSEA\GSA
  • Rank based
  • Higher statistical power
Classification
• Extract pathway features
  • Example: given a pathway, remove non-differential genes
• Not clear if prediction performance improves compared to using genes (Staiger et al. 2013)

5

6

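Pathway-based gene expression database
[Figure: database construction workflow. Inputs: pathways (KEGG, Reactome, Biocarta, NCI), expression profiles (GSE, GDS, TCGA) with platform data, and sample labels (disease) taken from the dataset\sample descriptions. A single sample - single pathway analysis computes, for each pathway, the mean and SD, giving the samples x pathway features matrix XP alongside the label matrix Y.]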

Single sample analysis
[Figure: for sample j, the standardized profile is turned into ranked genes\transcripts g1, g2, g3, …, gk, laid out between low expression and high expression. Each rank i receives a weight $W_i = i\,e^{-i/k}$, giving the weighted ranks w1, w2, w3, …, wk.]

7

Single sample analysis
Input: an expression profile of a sample
• A vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene, using
• Rank of gene g in sample s
• Total number of ranked genes
(Yang et al. 2012, 2013)
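The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the score has the exponential form r * exp(-r / k), where r is the rank of the gene in the sample and k is the total number of ranked genes, and it assumes rank 1 goes to the lowest-expressed gene. The function name and the toy profile are hypothetical.

```python
import math

def weighted_rank_scores(profile):
    """Map one sample's expression values to rank-based scores.

    Assumptions (not stated on the slide): rank 1 is the lowest-expressed
    gene, and the score is r * exp(-r / k) with k the number of ranked genes.
    """
    k = len(profile)
    # Step 1: gene indices sorted from lowest to highest expression.
    order = sorted(range(k), key=lambda i: profile[i])
    scores = [0.0] * k
    # Step 2: turn each rank into a weighted score.
    for rank, gene_index in enumerate(order, start=1):
        scores[gene_index] = rank * math.exp(-rank / k)
    return scores

# Hypothetical toy profile for one sample with three genes.
sample = [5.0, 1.0, 3.0]
print(weighted_rank_scores(sample))
```

Because only ranks enter the score, the result does not depend on the measurement scale of the platform, which is what makes profiles from different platforms comparable in this kind of analysis.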

8

Pathway features
1723 pathways in total
• Covering 7842 genes
• Mean size: 36.35 (median 15)
Score all genes that are in the pathway databases
Pathway statistics:
• Mean score
• Standard deviation
• Skewness
• KS test
Pathway DBs: KEGG, Reactome, Biocarta, NCI
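As a rough sketch of how the four pathway statistics could be computed for one sample and one pathway, the code below uses only the Python standard library. The slide does not say what the KS test compares; here it is assumed to compare the scores of the pathway genes with the scores of all other scored genes, and only the KS statistic D is returned, not a p-value. Function names and the input layout are hypothetical.

```python
import statistics

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def pathway_features(gene_scores, pathway_genes):
    """gene_scores: dict gene -> score in one sample; pathway_genes: set of genes.

    Requires at least one scored gene inside and one outside the pathway.
    """
    inside = [s for g, s in gene_scores.items() if g in pathway_genes]
    outside = [s for g, s in gene_scores.items() if g not in pathway_genes]
    mean = statistics.fmean(inside)
    sd = statistics.pstdev(inside)
    # Skewness as the third standardized moment (0 when there is no spread).
    skew = 0.0 if sd == 0 else sum((s - mean) ** 3 for s in inside) / (len(inside) * sd ** 3)
    return {"mean": mean, "sd": sd, "skewness": skew, "ks": ks_statistic(inside, outside)}

# Hypothetical toy scores for four genes, three of them in the pathway.
scores = {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 10.0}
print(pathway_features(scores, {"g1", "g2", "g3"}))
```

Repeating this for every sample and every pathway would fill one row per sample and a few columns per pathway in the feature matrix.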

9

Patient labels
Unite ~180 datasets, >14,000 samples
Public databases contain ‘free text’
Problem: automatic mapping fails, example:
• GDS4358: “lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy”
• MetaMap top score: “HIV infections”
Solution: manual analysis
• Read descriptions and papers

10

Current microarray data
Data from GEO
• 13,314 samples
• 17 platforms
Sample annotation
• Ignore terms with fewer than 100 samples or 5 datasets
• 48 disease terms
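The term filter can be illustrated with a small sketch. It assumes that a disease term is kept only if it has at least 100 samples and appears in at least 5 datasets, which is one reading of the thresholds above. The tuple layout of the input and the function name are hypothetical.

```python
from collections import defaultdict

def build_label_matrix(samples, min_samples=100, min_datasets=5):
    """samples: list of (sample_id, dataset_id, set_of_disease_terms).

    Returns the kept terms (sorted) and a 0/1 matrix Y with one row per
    sample and one column per kept term.
    """
    term_samples = defaultdict(set)
    term_datasets = defaultdict(set)
    for sample_id, dataset_id, terms in samples:
        for term in terms:
            term_samples[term].add(sample_id)
            term_datasets[term].add(dataset_id)
    # Assumed rule: a term must pass both thresholds to be kept.
    kept = sorted(
        t for t in term_samples
        if len(term_samples[t]) >= min_samples and len(term_datasets[t]) >= min_datasets
    )
    Y = [[1 if t in terms else 0 for t in kept] for _, _, terms in samples]
    return kept, Y

# Toy example with lowered thresholds so the filter is visible.
toy = [
    ("s1", "d1", {"cancer"}),
    ("s2", "d2", {"cancer", "leukemia"}),
    ("s3", "d2", set()),
]
print(build_label_matrix(toy, min_samples=2, min_datasets=2))
```

A sample may carry several terms at once, which is why the labels form a matrix rather than a single class column.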

[Figure: XP is the samples x pathway features matrix; Y is the samples x disease terms matrix with entries in {0,1}.]

11

12

Analysis and results

13

Multi-label classification algorithms
Learn a single classifier for each disease
• Ignore class dependencies
Adaptation: Bayesian Correction
• Learn single classifiers
• Correct errors using the DO DAG
Transformation: use the label power sets and learn a multiclass model
• Using RF: multi-label trees
• Was better than most approaches in an experimental study (Madjarov et al. 2012)
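The label power set transformation can be sketched as follows. This only shows the transformation of the labels and the mapping of multiclass probabilities back to per-disease probabilities; the multiclass learner itself (RF in the talk) is not included, and the function names are hypothetical.

```python
def to_label_powerset(Y):
    """Map each distinct label row of Y to one multiclass id.

    Returns (class_ids, classes), where classes[c] is the label tuple of id c.
    """
    class_of = {}
    class_ids = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        class_ids.append(class_of[key])
    classes = sorted(class_of, key=class_of.get)
    return class_ids, classes

def label_probabilities(class_probs, classes):
    """Per-label probability: sum over the classes whose label set includes it."""
    n_labels = len(classes[0])
    return [
        sum(p for p, labels in zip(class_probs, classes) if labels[j])
        for j in range(n_labels)
    ]

# Toy label matrix with two disease terms.
Y = [[1, 0], [1, 1], [1, 0], [0, 0]]
class_ids, classes = to_label_powerset(Y)
print(class_ids, classes)
print(label_probabilities([0.5, 0.3, 0.2], classes))
```

Unlike one classifier per disease, each multiclass target here is a whole combination of labels, so dependencies between disease terms are kept in the training signal.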

14

How to validate a classifier?
Use leave-dataset-out cross-validation
• Global AUC scores: each prediction Pij vs. the correct label Yij
• Disease-based AUC scores: consider each column separately
[Figure: on the test set, the output of a multi-label learner is a samples x disease terms matrix P of probabilities in [0,1], compared against the label matrix Y with entries in {0,1}.]
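A minimal sketch of this validation scheme, assuming every sample carries a dataset identifier: each fold holds out all samples of one dataset, and the AUC is then computed either over all pooled (Yij, Pij) pairs or per disease column. The AUC is computed by pair counting, which is fine for small examples but slow for large matrices. All names are hypothetical.

```python
def leave_dataset_out(dataset_ids):
    """Yield (held_out_dataset, train_indices, test_indices), one fold per dataset."""
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        yield held_out, train, test

def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def global_auc(Y, P):
    """Pool every (Y_ij, P_ij) pair into a single AUC."""
    return roc_auc([y for row in Y for y in row], [p for row in P for p in row])

def disease_aucs(Y, P):
    """One AUC per column, i.e. per disease term."""
    return [
        roc_auc([row[j] for row in Y], [row[j] for row in P])
        for j in range(len(Y[0]))
    ]

# Toy data: four samples from three datasets, two disease terms.
for fold in leave_dataset_out(["A", "A", "B", "C"]):
    print(fold)
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]
P = [[0.9, 0.2], [0.1, 0.8], [0.3, 0.3], [0.4, 0.6]]
print(global_auc(Y, P), disease_aucs(Y, P))
```

Holding out whole datasets keeps samples from the same study out of both training and test sets, so the scores reflect generalization across studies rather than within one.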

15

A problem (!)
What is in the background?
For a disease D define:
• Positives: disease samples
• Negatives: direct controls
• Background controls (BGCs)
Example: 500 positives, 500 negatives, 10000 BGCs
[Figure: the Y and P matrices]
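A short calculation with the numbers from the example shows why the choice of background matters. The share of positives is also the precision expected from a random ranking, so it sets the baseline against which precision-based scores should be read. The function name is hypothetical.

```python
def positive_fraction(n_pos, n_neg, n_background=0):
    """Share of positives; also the expected precision of a random ranking."""
    return n_pos / (n_pos + n_neg + n_background)

# Positives vs. direct controls only.
print(positive_fraction(500, 500))                    # 0.5
# Positives vs. direct controls plus background controls.
print(round(positive_fraction(500, 500, 10000), 4))   # 0.0455
```

The same classifier therefore faces a balanced problem in one setting and a roughly 1-in-22 problem in the other, so a single score cannot describe both.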

16

Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR
For each disease we calculate three scores

Measure | Used (additional) information
AUPR: check separation between positives and all others | Sick vs. not sick
ROC: test for separation between positives and negatives | Direct use of negatives
Meta analysis p-value: calculate the overall separation significance within the original datasets (a p-value) | Mapping of samples to datasets
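The first two per-disease scores can be sketched as below. AUPR is estimated here as average precision, which is one common estimator and an assumption on my part; the meta analysis p-value is not sketched because the talk does not specify the test. Each sample is tagged "pos", "neg" or "bg" (background control); these tags and the function names are hypothetical.

```python
def average_precision(labels, scores):
    """AUPR estimated as the mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def disease_scores(groups, scores):
    """groups: 'pos' / 'neg' / 'bg' per sample; scores: predicted probabilities."""
    # AUPR: positives against everything else, background included.
    aupr_labels = [1 if g == "pos" else 0 for g in groups]
    # ROC: positives against direct negatives only, background dropped.
    direct = [(1 if g == "pos" else 0, s) for g, s in zip(groups, scores) if g != "bg"]
    return {
        "aupr": average_precision(aupr_labels, scores),
        "roc": roc_auc([y for y, _ in direct], [s for _, s in direct]),
    }

groups = ["pos", "neg", "bg", "pos", "bg"]
scores = [0.9, 0.2, 0.95, 0.6, 0.1]
print(disease_scores(groups, scores))
```

In the toy data a background sample outranks both positives, which lowers the AUPR while the ROC against direct negatives stays perfect; this is the kind of disagreement the multistep validation is meant to expose.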

17

Performance results

[Figure: per-disease performance, showing positives vs. negatives ROC and AUPR; filled boxes mark a meta analysis q-value < 0.001.]

18

Performance results

8.5% improvement in recall, 12% in precision, compared to Huang et al.

Validation on RNA-Seq
Data from TCGA: 1,699 samples

19

Pathway-Disease network
Steps (for each of the selected diseases):
1. Disease-pathway edges
   1. RF importance: select the top features
   2. Test for disease relevance
2. Add edges between diseases
   1. Use the DO structure
3. Add edges between pathways
   1. Based on significant overlap in genes
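For step 3, one way to decide whether two pathways overlap significantly is a hypergeometric test on their shared genes. The talk does not name the test, so this is an assumption, as is the choice of gene universe (for example, the 7842 genes covered by the pathway databases). The function name is hypothetical.

```python
from math import comb

def overlap_pvalue(pathway_a, pathway_b, universe_size):
    """Hypergeometric upper-tail p-value for the overlap of two gene sets.

    Models pathway_b as a random draw of len(b) genes from the universe and
    asks how likely it is to hit at least the observed number of pathway_a genes.
    """
    a, b = set(pathway_a), set(pathway_b)
    overlap = len(a & b)
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), x) * comb(universe_size - len(a), len(b) - x)
        for x in range(overlap, min(len(a), len(b)) + 1)
    )
    return tail / total

# Toy universe of 4 genes: identical pathways vs. disjoint pathways.
print(overlap_pvalue({1, 2}, {1, 2}, 4))  # 1/6
print(overlap_pvalue({1, 2}, {3, 4}, 4))  # 1.0
```

An edge between two pathways would then be added when the p-value, after correction for the number of pathway pairs tested, falls below a chosen threshold.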

20

Cancer network

[Figure legend: Down, Up]

23

Cardiovascular disease

[Figure legend: Down, Up]

Gastric cancers

25

Summary
• Large scale integration
• Multi-label learning
• Careful validation
• Pathway based features as biomarkers
• Summary of the results in a network
Currently
• Add genes: overcome missing values
• Shows improvement in validation

Acknowledgements
• Ron Shamir
• Tom Hait