Gene Set Enrichment and Splicing Detection using Spectral Counting


Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center

Outline

• Systems Biology
  • Gene Sets & Functional Enrichment
  • Balls in Urns
• Proteomics
  • MS/MS and Peptide ID
  • Quantitation and Spectrum Counting
• Differential Protein Abundance
• Detecting Splicing and Isoforms

Systems Biology

• Mathematical Models (phenotype ↕ biology)
  • Software
  • Statistics
  • Algorithms
• Knowledge Databases (molecular biology ↕ biology)
  • UniProt
  • OMIM
  • KEGG
• High-Throughput Experiments (molecular biology ↕ phenotype)
  • Sequencing
  • Microarrays
  • Proteomics
  • Metabolomics

Gene Expression Analysis

• Differential expression via:
  • Structured experiments
  • Transcript measurements
  • Statistics
• But now what?

Gene Expression Analysis

Hengel et al. J Immunol. 2003.

• Structured experiment:
  • CD4+/L-selectin- T-cells, vs
  • CD4+/L-selectin+ T-cells
• Affymetrix Human Genome U95A Array
• Processing & Statistics
  • MAS 4.0, t-Tests, FDR filtering, …
• 164 probe identifiers for upregulated genes.

Gene Expression Analysis


34529_AT 38816_AT 679_AT 37105_AT 34623_AT 36378_AT 35648_AT 33979_AT 34529_AT 1372_AT 38646_S_AT 35896_AT 34249_AT 40317_AT 32413_AT 33530_AT 32469_AT 34720_AT 36317_AT 31987_AT 33027_AT 35439_AT 36421_AT 966_AT 967_G_AT 31525_S_AT 38236_AT 34618_AT 34546_AT 31512_AT 40959_AT 38604_AT 33922_AT 40790_AT 35595_AT 33963_AT 33685_AT 35566_F_AT 33684_AT 36436_AT 37166_AT 34453_AT 1645_AT 39469_S_AT 38229_AT 38945_AT 37711_AT 39908_AT 1355_G_AT 38948_AT 1786_AT 39198_S_AT 606_AT 35091_AT 35090_G_AT 37954_AT 822_S_AT 36766_AT 37953_S_AT 38128_AT 40350_AT 37097_AT 33516_AT 38691_S_AT 34702_F_AT 31715_AT 1331_S_AT 34577_AT 33027_AT 38508_S_AT 32680_AT 39187_AT 31506_S_AT 31793_AT 40294_AT 40553_AT 1983_AT 32250_AT 37968_AT 33293_AT 40271_AT 32418_AT 33077_AT 38201_AT 2090_I_AT 34012_AT 34703_F_AT 38482_AT 40058_S_AT 34902_AT 34636_AT 41113_AT 35996_AT 40735_AT 34539_AT 41280_R_AT 37061_AT 34233_I_AT 41703_R_AT 37898_R_AT 35373_AT 37408_AT 35213_AT 31576_AT 39094_AT 32010_AT 919_AT 1855_AT 1391_S_AT 34436_AT 33371_S

Gene Expression Analysis


1112_g_at neural cell adhesion molecule 1

1331_s_at tumor necrosis factor receptor superfamily, member 25

1355_g_at neurotrophic tyrosine kinase, receptor, type 2

1372_at tumor necrosis factor, alpha-induced protein 6

1391_s_at cytochrome P450, family 4, subfamily A, polypeptide 11

1403_s_at chemokine (C-C motif) ligand 5

1419_g_at nitric oxide synthase 2, inducible

1575_at ATP-binding cassette, sub-family B (MDR/TAP), member 1

1645_at KiSS-1 metastasis-suppressor

1786_at c-mer proto-oncogene tyrosine kinase

1855_at fibroblast growth factor 3 (murine mammary tumor virus integration site (v-int-2) oncogene homolog)

1890_at growth differentiation factor 15

… …

Gene Set Enrichment

• Candidate genes are “special” with respect to the experiment structure (phenotype)
• Are they special with respect to general biological knowledge?
  • Are the candidate genes related?
  • Can we filter out the noise?
  • Can we expose associated genes?
  • What genes' changes are linked to the experimental structure / phenotype?

Gene Sets

• Genes may be related in many ways:
  • Same pathway, similar function, cellular location
  • Cytoband, identified in previous study, etc.
• Define gene sets for relatedness
  • GO Biological Process
  • GO Molecular Function
  • GO Cellular Component
  • KEGG Pathway, BioCarta Pathway
  • Biological knowledge databases


Drawing Balls from Urns

1000 Balls, 900 Red, 100 Blue.

100 Balls Drawn at Random? # Red? # Blue?

How surprising is 5, 10, 15, 20, … blue?

How surprising is 30, 50, 70, … blue?

6 of 155 upregulated genes have "oxygen binding" GO annotation!

All human genes ( = 25), blue is oxygen binding.

How surprised should we be?

• Classic problem in probability theory
• How well do the observed counts match the expected counts?
• Various mostly equivalent statistical tests are applied:
  • Fisher exact test
  • Hypergeometric
  • Chi-Squared (χ²)
• p-value measures "surprise".
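
The urn question above (1000 balls, 100 blue, 100 drawn) can be made concrete with a hypergeometric tail probability. The sketch below is only an illustration, not the slides' own code: the function name `hypergeom_upper_tail` is invented here, and it uses exact integer arithmetic from the Python standard library rather than a statistics package.

```python
from math import comb

def hypergeom_upper_tail(total, blue, drawn, observed_blue):
    """P(X >= observed_blue) when `drawn` balls are taken without replacement
    from an urn holding `total` balls, `blue` of which are blue."""
    denom = comb(total, drawn)
    max_blue = min(blue, drawn)
    numer = sum(
        comb(blue, k) * comb(total - blue, drawn - k)
        for k in range(observed_blue, max_blue + 1)
    )
    return numer / denom

# Urn from the slides: 1000 balls, 100 blue, 100 drawn at random.
# On average 100 * 100 / 1000 = 10 blue balls are expected.
for k in (5, 10, 15, 20, 30):
    print(k, hypergeom_upper_tail(1000, 100, 100, k))
```

The same function answers the enrichment question once the urn is relabelled: all genes are the balls, genes in the gene set are blue, and the candidate genes are the draw.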


Proteomics

• Proteins are the machines that drive much of biology
  • Genes are merely the recipe
• The direct characterization of proteins en masse.
  • What proteins are present?
  • How much of each protein is present?
  • Which proteins change in abundance?

Sample Preparation for Tandem Mass Spectrometry

Enzymatic Digest and Fractionation

Single Stage MS


Tandem Mass Spectrometry (MS/MS)

Peptide Fragmentation

[Figure: MS/MS spectrum of peptide SGFLEEDELK; x-axis m/z (0 to 1000), y-axis % Intensity (0 to 100); annotated peaks b3 to b9 and y2 to y9]

Residue   b ion   y ion
S           88    1166
G          145    1080
F          292    1022
L          405     875
E          534     762
E          663     633
D          778     504
E          907     389
L         1020     260
K         1166     147
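
The b- and y-ion ladders in the figure follow from summing residue masses from either end of the peptide. The sketch below illustrates this for the peptide read off the residue labels (S, G, F, L, E, E, D, E, L, K). It assumes nominal integer residue masses and singly charged ions, so a value can differ by about 1 from the figure's labels; the names `RESIDUE_MASS`, `b_ions`, and `y_ions` are illustrative only.

```python
# Nominal (integer) residue masses in Da: an assumption made for readability,
# real search engines use monoisotopic or average masses with decimals.
RESIDUE_MASS = {"G": 57, "S": 87, "L": 113, "D": 115, "K": 128, "E": 129, "F": 147}
PROTON = 1   # charge carrier for a singly charged ion
WATER = 18   # y ions retain the C-terminal OH plus an extra H

def b_ions(peptide):
    """m/z of b1..b(n-1): N-terminal prefixes plus one proton."""
    masses, running = [], 0
    for residue in peptide[:-1]:
        running += RESIDUE_MASS[residue]
        masses.append(running + PROTON)
    return masses

def y_ions(peptide):
    """m/z of y1..y(n-1): C-terminal suffixes plus water plus one proton."""
    masses, running = [], 0
    for residue in reversed(peptide[1:]):
        running += RESIDUE_MASS[residue]
        masses.append(running + WATER + PROTON)
    return masses

print("b:", b_ions("SGFLEEDELK"))
print("y:", y_ions("SGFLEEDELK"))
```

Peptide identification software compares ladders like these, computed for candidate sequences, against the observed peaks.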

LC-MS/MS

• Powerful combination of liquid chromatography (LC) and tandem mass spectrometry (MS/MS)
• Automatically collect 100k MS/MS spectra in an afternoon
  • Tens of thousands of peptide-spectrum assignments
  • Thousands of proteins identified

Spectral Counting

• Abundant proteins are more likely to be identified:
  • Selection (by the instrument) for fragmentation is based on intensity
  • More abundant ions are more likely to fragment in an informative manner
• A protein's peptide identification count (spectra) can be used as a crude abundance measurement.
  • Easy, cheap, (relative) protein quantitation

Differential Spectral Counts

• Spectral counts are too crude for classical (microarray) statistics.
  • Fold change, t-tests, …
• However, we expect "similar" spectral counts when the protein abundance is unchanged.
  • Recast as drawing balls from urns.

HER2/Neu Mouse Model of Breast Cancer

• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total

Drawing Balls from Urns

Urns: All Normal Spectra, All Tumor Spectra

Protein                                   Spectral counts   p-value
Plastin-2 (Lcp1)                          827     102       2.437E-123
Osteopontin (Spp1)                        334      19       2.444E-62
Hypoxia up-regulated protein 1 (Hyou1)    200       7       1.437E-40
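
One way to read these counts as a balls-and-urns problem: if a protein's abundance is unchanged, each of its spectra should land in the tumor urn with probability Nt / (Nn + Nt). The sketch below computes a one-sided binomial tail on that basis. It is an illustration under stated assumptions, not the test used for the slide: the slides do not say which test produced the listed p-values, nor which of the two counts is the tumor count (827 is assumed to be tumor here), so the result need not match the table.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to limit underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return math.exp(peak) * sum(math.exp(x - peak) for x in logs)

# Totals from the HER2/Neu slide.
N_NORMAL = 161_286
N_TUMOR = 163_068
p_tumor = N_TUMOR / (N_NORMAL + N_TUMOR)

# ASSUMPTION: for Plastin-2, 827 spectra are from tumor and 102 from normal.
tumor_count, normal_count = 827, 102
p_value = binom_upper_tail(tumor_count, tumor_count + normal_count, p_tumor)
print(f"Plastin-2 one-sided binomial p-value: {p_value:.3e}")
```

Because the two urns are nearly the same size, an unchanged protein should split its spectra roughly evenly, which is why an 827 to 102 split is so surprising.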

Functional Enrichment

• 374 proteins with "significantly" increased abundance in tumor tissue
  • Use 4270 proteins as background!
• DAVID gene set enrichment:
  • Protein translation
  • RNA binding, splicing

Differential Spectral Counting

• Assumptions of the formal tests (Fisher exact, χ²) are violated, so
  • p-values can be misleading (too small)
  • Use label permutation tests to compute empirical p-values. SLOW!
• Collapse spectral counts to protein sets (GO terms) directly:
  • Potential to observe more subtle spectral count differences
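
A label permutation test can be sketched as follows: shuffle the normal/tumor labels across samples many times and ask how often the shuffled difference in spectral counts is at least as large as the observed one. Everything specific here is hypothetical: the per-sample counts are made up, the function name `permutation_p_value` is illustrative, and the statistic (difference in group means) is one reasonable choice rather than the one used in the study.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided empirical p-value for a difference in mean counts."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a, n, total = len(group_a), len(pooled), sum(pooled)

    def statistic(values_a):
        # Integer quantity proportional to |mean(a) - mean(b)|;
        # staying in integers avoids floating-point ties.
        return abs(n * sum(values_a) - n_a * total)

    observed = statistic(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)              # reassign sample labels at random
        if statistic(pooled[:n_a]) >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# HYPOTHETICAL per-sample spectral counts for one protein.
normal_counts = [3, 5, 4, 6, 2]
tumor_counts = [21, 18, 25, 19, 23]
print(permutation_p_value(normal_counts, tumor_counts))
```

The cost is visible in the loop: every protein (or protein set) needs thousands of reshuffles, which is why the slide calls this approach slow.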


Unannotated Splice Isoform


Halobacterium sp. NRC-1
ORF: GdhA1

• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: horizontal axis 0 to 440]

What if there is no "smoking gun" peptide…


PKM2 in Peptide Atlas

[Figure: axes labelled experiments and peptides]


Nascent polypeptide-associated complex subunit alpha

• Long form is "muscle-specific"
  • Exon 3 is missing from short form
• Peptide identifications provide evidence for long form only
  • 9 peptides are specific to long form
  • 6 peptides are found in both isoforms
• Urn with balls of 15 different colors
  • p-value of observed spectral counts: 7.3E-8
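
An urn with many colors can be evaluated by simulation: treat each peptide as a color, draw as many balls as there are observed spectra, and see how often a random draw deviates from the expected proportions as much as the observed counts do. The slides do not state the null model or the test used, so the sketch below is only one possible reading; the expected proportions, the observed counts, and the names `chi_square` and `monte_carlo_p_value` are all hypothetical.

```python
import random
from collections import Counter

def chi_square(observed, expected):
    """Pearson chi-square distance between observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def monte_carlo_p_value(observed, probs, n_sim=2000, seed=0):
    """Empirical p-value for per-peptide counts under a multinomial urn."""
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    obs_stat = chi_square(observed, expected)
    colors = range(len(probs))
    hits = 0
    for _ in range(n_sim):
        # Draw n balls with replacement, color i having probability probs[i].
        draws = Counter(rng.choices(colors, weights=probs, k=n))
        simulated = [draws[c] for c in colors]
        if chi_square(simulated, expected) >= obs_stat:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# HYPOTHETICAL example: 5 peptides expected in equal proportion.
probs = [0.2] * 5
print(monte_carlo_p_value([60, 30, 5, 3, 2], probs))   # strongly perturbed counts
```

With this setup, counts close to the expected proportions give a large p-value, while counts concentrated on a few peptides give a small one.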


Pyruvate kinase isozymes M1/M2

• Exon "substitution" changes sequence in the middle of the protein
• Peptide identifications provide evidence for both isoforms
  • 3 peptides are specific to isoform 1
  • 5 peptides are specific to isoform 2
• Urn with balls of 63 colors for isoform 1
  • p-value of observed spec. counts: 2.46E-05


Summary

• Systems biology requires:
  • Experiments, Databases, Models
  • Informaticians and Disease Experts
• Functional Enrichment:
  • Quickly navigate knowledge databases using experiment-derived genes
  • Classical probability experiment: Balls & Urns
  • How surprised should you be?
  • Still require domain expert to pick out gems

Summary

• Proteomics:
  • High-throughput protein comparison
  • Proteome "sample" is identified
  • Crude spectral count quantitation
• Differential protein abundance:
  • Use Balls & Urns to find significant changes
  • Apply functional enrichment tools
• Splicing detection:
  • Perturbed peptide spectral counts provide evidence for splicing.
  • Evaluate using Balls & Urns
