Scalable data mining for functional genomics and metagenomics
Curtis Huttenhower, 09-16-10, Harvard School of Public Health, Department of Biostatistics


Page 1: Scalable data mining for functional genomics and metagenomics

Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower

09-16-10, Harvard School of Public Health, Department of Biostatistics

Page 2: Scalable data mining for functional genomics and metagenomics

2

Greatest discoveries in biology?

Our job is to create computational microscopes: to ask and answer specific biological questions using millions of experimental results

Page 3: Scalable data mining for functional genomics and metagenomics

3

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Network models of microbial communities

Page 4: Scalable data mining for functional genomics and metagenomics

4

A computational definition of functional genomics

Genomic data, Prior knowledge

Data → Function
Function → Function
Gene → Gene
Gene → Function

Page 5: Scalable data mining for functional genomics and metagenomics

5

A framework for functional genomics

[Figure: data integration framework]

Legend: High Similarity to Low Similarity; High Correlation to Low Correlation

Gene pairs with known labels: G1-G2 (+), G4-G9 (+), G3-G6 (-), G7-G8 (-); unknown: G2-G5 (?)

Data matrix (100Ms of gene pairs × 1Ks of datasets):
0.9 0.7 … 0.1 0.2 … 0.8
+ - … - - … +
0.8 0.5 … 0.05 0.1 … 0.6

Frequency histograms per dataset: High Correlation vs. Low Correlation; Let. vs. Not let.; Similar vs. Dissim.

P(G2-G5|Data) = 0.85
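
One way to read this framework is as naive Bayesian integration: each dataset contributes a frequency table per class (related vs. unrelated pairs) learned from labeled gene pairs, and the tables are multiplied into a posterior for an unlabeled pair such as G2-G5. The sketch below is only an illustration under that assumption. The function names, the two toy datasets, the "high"/"low" discretization, and the pseudocount are hypothetical and are not taken from the talk.

```python
def train(labels, datasets, pseudocount=1.0):
    """Learn a class prior and one frequency table per dataset.

    labels:   {gene_pair: 1 (related) or 0 (unrelated)}
    datasets: list of {gene_pair: discretized value}
    """
    prior = sum(labels.values()) / len(labels)
    tables = []
    for data in datasets:
        bins = sorted(set(data.values()))
        counts = {cls: {b: pseudocount for b in bins} for cls in (0, 1)}
        for pair, cls in labels.items():
            if pair in data:
                counts[cls][data[pair]] += 1
        tables.append({
            cls: {b: counts[cls][b] / sum(counts[cls].values()) for b in bins}
            for cls in (0, 1)
        })
    return prior, tables


def posterior(pair, prior, tables, datasets):
    """P(related | data) for one pair, treating datasets as independent."""
    p_pos, p_neg = prior, 1.0 - prior
    for table, data in zip(tables, datasets):
        if pair not in data:
            continue  # a dataset without a value for this pair is skipped
        p_pos *= table[1][data[pair]]
        p_neg *= table[0][data[pair]]
    return p_pos / (p_pos + p_neg)


# Toy gold standard and data (hypothetical values).
labels = {("G1", "G2"): 1, ("G4", "G9"): 1, ("G3", "G6"): 0, ("G7", "G8"): 0}
coexpression = {("G1", "G2"): "high", ("G4", "G9"): "high", ("G3", "G6"): "low",
                ("G7", "G8"): "low", ("G2", "G5"): "high"}
similarity = {("G1", "G2"): "high", ("G4", "G9"): "low", ("G3", "G6"): "low",
              ("G7", "G8"): "low", ("G2", "G5"): "high"}
datasets = [coexpression, similarity]

prior, tables = train(labels, datasets)
p = posterior(("G2", "G5"), prior, tables, datasets)
print(round(p, 3))  # 0.857
```

With real compendia the same two steps apply per dataset, which is what makes the computation easy to distribute across thousands of datasets.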

Page 6: Scalable data mining for functional genomics and metagenomics

6

Functional network prediction and analysis

Global interaction network

Carbon metabolism network, Extracellular signaling network, Gut community network

Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

HEFalMp

Page 7: Scalable data mining for functional genomics and metagenomics

7

Functional network prediction from diverse microbial data

486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species

307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions

Integrated functional interaction networks in 15 species

[Figure: E. coli integration]

← Precision ↑, Recall ↓

Page 8: Scalable data mining for functional genomics and metagenomics

8

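Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$z'' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

Simple regression: All datasets are equally accurate

$y_{e,i} = \mu_e + \varepsilon_{e,i}$

Random effects: Variation within and among datasets and interactions

$y_{e,i} = \mu_e + \delta_{e,i} + \varepsilon_{e,i}$

$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$

$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$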
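
As a rough illustration of the two ingredients on this slide, the sketch below applies the Fisher z-transform to correlations and then pools per-dataset scores for one interaction with random-effects weights. It rests on several assumptions that the slide does not state: the between-dataset variance is estimated with the DerSimonian-Laird moment estimator, the weights are normalized to sum to one, and the input correlations and within-dataset variances are made-up numbers. The function names are hypothetical.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation in the open interval (-1, 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects_mean(y, s2):
    """Pool per-dataset scores y given within-dataset variances s2.

    Returns (pooled mean, estimated between-dataset variance).
    The between-dataset variance uses the DerSimonian-Laird moment
    estimator, which is an assumption of this sketch.
    """
    w = [1.0 / v for v in s2]                      # fixed-effects weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in s2]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return pooled, tau2


# Hypothetical correlations for one gene pair in three datasets.
z_scores = [fisher_z(r) for r in (0.8, 0.6, 0.1)]
variances = [0.05, 0.05, 0.20]  # assumed within-dataset variances
pooled, tau2 = random_effects_mean(z_scores, variances)
print(pooled, tau2)
```

When the datasets agree, the estimated between-dataset variance shrinks to zero and the result reduces to the simple (fixed-effects) average; when they disagree, noisy and precise datasets are weighted more evenly.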

Page 9: Scalable data mining for functional genomics and metagenomics

9

Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$z'' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

Page 10: Scalable data mining for functional genomics and metagenomics

10

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

Page 11: Scalable data mining for functional genomics and metagenomics

11

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

X?

Page 12: Scalable data mining for functional genomics and metagenomics

12

Predicting gene function

Cell cycle genes

Predicted relationships between genes (legend: High Confidence to Low Confidence)

Page 13: Scalable data mining for functional genomics and metagenomics

13

Predicting gene function

Cell cycle genes

Predicted relationships between genes (legend: High Confidence to Low Confidence)

Page 14: Scalable data mining for functional genomics and metagenomics

14

Predicting gene function

Cell cycle genes

Predicted relationships between genes (legend: High Confidence to Low Confidence)

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

Page 15: Scalable data mining for functional genomics and metagenomics

15

Comprehensive validation of computational predictions

Genomic data

Prior knowledge

Computational Predictions of Gene Function

MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory Experiments: petite frequency, growth curves, confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Page 16: Scalable data mining for functional genomics and metagenomics

16

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis:

106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Page 17: Scalable data mining for functional genomics and metagenomics

17

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis:

106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Page 18: Scalable data mining for functional genomics and metagenomics

18

Functional mapping: mining integrated networks

Predicted relationships between genes (legend: High Confidence to Low Confidence)

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

Page 19: Scalable data mining for functional genomics and metagenomics

19

Functional mapping: mining integrated networks

Predicted relationships between genes (legend: High Confidence to Low Confidence)

Chemotaxis

Page 20: Scalable data mining for functional genomics and metagenomics

20

Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Predicted relationships between genes (legend: High Confidence to Low Confidence)

Chemotaxis

Page 21: Scalable data mining for functional genomics and metagenomics

21

Functional mapping: Associations among processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Processes shown:
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance

Page 22: Scalable data mining for functional genomics and metagenomics

22

Functional mapping: Associations among processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Processes shown:
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance

Page 23: Scalable data mining for functional genomics and metagenomics

23

Functional mapping: Associations among processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)

Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Processes shown:
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance

Page 24: Scalable data mining for functional genomics and metagenomics

24

Functional mapping: Associations among processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)

Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Page 25: Scalable data mining for functional genomics and metagenomics

25

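Cross-species knowledge transfer using functional data

Pinaki Sarder

$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\,P(FR_s)$

$P(FR_s, FR_t)$

$P(FR_s \mid D) = P(FR_s \mid \{FR_{t \neq s}, D_s\})$

$\propto P(\{FR_{t \neq s}, D_s\} \mid FR_s)\,P(FR_s)$

$= P(FR_s)\,P(D_s \mid FR_s) \prod_{t \neq s} P(FR_t \mid FR_s)$

TaFTan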

Page 26: Scalable data mining for functional genomics and metagenomics

26

TaFTan: Cross-species knowledge transfer using functional data

[Figure: panels for E. coli, B. subtilis, P. aeruginosa, M. tuberculosis; curves: Species-specific data, Species’ data excluded, All species’ data; axes: log(precision/random) vs. log(recall)]

• Important to take advantage of all available data for any one organism
• Important to take advantage of all available data for every organism
• Scalable to dozens of organisms with hundreds of functional datasets
• Currently working on making this more context-specific

Page 27: Scalable data mining for functional genomics and metagenomics

27

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Network models of microbial communities

Page 28: Scalable data mining for functional genomics and metagenomics

28

~2000

AML/ALL, Survival, Mutation

Gene expression

Batch effects

Functional modules

So what does all of this have to do with microbial communities?

Page 29: Scalable data mining for functional genomics and metagenomics

29

~2005

Healthy/Diabetes, BMI, M/F

SNP genotypes

Population structure

LD

Page 30: Scalable data mining for functional genomics and metagenomics

30

2010

Healthy/IBD, Temperature, Location

Taxa & Orthologs

???

Niches & Phylogeny

Test for correlates

Multiple hypothesis correction

Feature selection

p >> n

Confounds/stratification/environment

Cross-validate

Biological story?

Independent sample

Intervention/perturbation

Page 31: Scalable data mining for functional genomics and metagenomics

31

What’s metagenomics?

Total collection of microorganisms within a community

Also microbial community or microbiota

Total genomic potential of a microbial community

Total biomolecular repertoire of a microbial community

Study of uncultured microorganisms from the environment, which can include humans or other living hosts

Page 32: Scalable data mining for functional genomics and metagenomics

32

The Human Microbiome Project

2006 - ongoing

• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
  • Skin: ears, inner elbows
  • Nasal cavity
  • Gut: stool
  • Vagina: introitus, mid, fornix
• Reference genomes (~200-800)

All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection…

Hamady, 2009

Page 33: Scalable data mining for functional genomics and metagenomics

33

What features to test?

16S reads

WGS reads

Taxa

Orthologous clusters

Pathways/modules

Functional roles

Pathway activity

Genomic data (Reference genomes)

Functional data (Experimental models)

Binning

Clustering

Microbiome data

Page 34: Scalable data mining for functional genomics and metagenomics

34

HMP: Data features

16S reads

Orthologous clusters

Pathways/modules

Taxa

Genes (KOs)

Pathways (KEGGs)

Page 35: Scalable data mining for functional genomics and metagenomics

35

HMP: Body sites

Taxa

KOs

KEGGs

Vanilla linear SVM

Page 36: Scalable data mining for functional genomics and metagenomics

36

HMP: Subjects

Taxa

KEGGs

We can tell who you are by the bugs in your mouth!

Page 37: Scalable data mining for functional genomics and metagenomics

37

HMP: Metabolic reconstruction

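WGS reads → Genes (KOs) → Pathways/modules (KEGGs)

Functional seq.: KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS…

300 subjects, 1-3 visits/subject, 15-18 body sites/visit, 10-20M reads/sample, 100bp reads

BLAST → Genes

[Equation: gene count c(g) computed from reads r, their alignments a, and alignment p-values p; not recoverable from the extracted text]

Genes → Pathways: MinPath (Ye 2009)

Smoothing: Witten-Bell

$c'(g) = \begin{cases} \dfrac{T}{V-T} \cdot \dfrac{N}{N+T} & \text{if } c(g) = 0 \\ c(g) \cdot \dfrac{N}{N+T} & \text{otherwise} \end{cases}$

Gap filling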
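
The Witten-Bell smoothing step can be sketched as follows. This is a minimal illustration assuming the textbook meaning of the symbols: N is the total count, T the number of gene families observed at least once, and V the total number of gene families considered. The function name and the toy counts are hypothetical, and the sketch does not claim to reproduce the exact procedure used for the HMP data.

```python
def witten_bell_counts(counts, vocabulary_size):
    """Return a function mapping a gene family to its smoothed count.

    counts:          {gene_family: observed count}
    vocabulary_size: V, the number of gene families that could occur
    """
    n = sum(counts.values())                        # N: total observations
    t = sum(1 for c in counts.values() if c > 0)    # T: observed types
    if n == 0:
        raise ValueError("at least one observation is required")
    unseen = vocabulary_size - t                    # V - T: unobserved types
    scale = n / (n + t)
    # Mass reserved for unseen families is shared equally among them.
    unseen_count = (t / unseen) * scale if unseen > 0 else 0.0

    def smoothed(gene):
        c = counts.get(gene, 0)
        return c * scale if c > 0 else unseen_count

    return smoothed


# Toy example: four possible gene families, two of them observed.
observed = {"K00001": 3, "K00002": 1, "K00003": 0}
smoothed = witten_bell_counts(observed, vocabulary_size=4)
total = sum(smoothed(g) for g in ("K00001", "K00002", "K00003", "K00004"))
print(total)  # equals N = 4 up to floating-point rounding
```

Observed counts are discounted by N/(N+T), and the freed mass is redistributed over the unobserved families, so the smoothed counts still sum to N.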

Page 38: Scalable data mining for functional genomics and metagenomics

38

HMP: Metabolic reconstruction

Pathway coverage Pathway abundance

Page 39: Scalable data mining for functional genomics and metagenomics

39

HMP: Metabolic reconstruction

[Figure: pathway coverage and pathway abundance; axes: samples × pathways; annotated groups: aerobic body sites, gastrointestinal body sites, all body sites (“core”)]

Page 40: Scalable data mining for functional genomics and metagenomics

40

MetaHIT: Data features

Features: WGS reads; Taxa (Phymm, Brady 2009); Genes (KOs); Pathways/modules (KEGGs)

85 healthy, 15 IBD + 12 healthy, 12 IBD

ReBLASTed against KEGG since published data obfuscates read counts

10x bootstrap within training cohort, test on 12+12 as validation

Page 41: Scalable data mining for functional genomics and metagenomics

41

MetaHIT: Taxonomic CD biomarkers

Bacteroidetes
Firmicutes
Methanomicrobia
Enterobacteriaceae
Chromatiales
Desulfobacterales
Oxalobacteraceae
Rhodobacteraceae
Bradyrhizobiaceae

iTOL (Letunic 2007)

Page 42: Scalable data mining for functional genomics and metagenomics

42

MetaHIT: Taxonomic CD biomarkers

Down in CD

Up in CD

Page 43: Scalable data mining for functional genomics and metagenomics

43

MetaHIT: Functional CD biomarkers

Growth/replication Motility Transporters Sugar metabolism

Down in CD

Up in CD

Page 44: Scalable data mining for functional genomics and metagenomics

44

MetaHIT: KO IBD biomarkers

Transporters

Growth/replication

Motility

Sugar metabolism

Down in IBD

Up in IBD

LEfSe

Nicola Segata

Page 45: Scalable data mining for functional genomics and metagenomics

45

Metagenomic differential analysis: LEfSe

1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
   LEfSe: p(ANOVA) < 0.05

2. Is the difference biologically significant? Expert supervision, specific post-hoc tests…
   LEfSe: pairwise post-hoc Wilcoxon OK

3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…
   LEfSe: Log(Score(LDA)) = 3.68

Page 46: Scalable data mining for functional genomics and metagenomics

46

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001
ANOVA: p < 0.05

LEfSe: DIFF!

Hi-level functional category: Carbohydrates
Hi-level functional category: Transporters
Hi-level functional category: Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial, Viral

Dinsdale 2008

Page 47: Scalable data mining for functional genomics and metagenomics

47

Sleipnir: Software for scalable functional genomics

Massive datasets require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3hrs.

Page 48: Scalable data mining for functional genomics and metagenomics

48

Outline

1. Data mining: Integrating very large genomic data compendia
   • Network framework for scalable data integration
   • HEFalMp: human data integration
   • TaFTan: cross-species knowledge transfer from functional data

2. Metagenomics: Network models of microbial communities
   • 16S and WGS community metabolic reconstruction
   • LEfSe: biologically relevant community differences

• Sleipnir: software for scalable genomic data mining

Page 49: Scalable data mining for functional genomics and metagenomics

49

Thanks!

http://huttenhower.sph.harvard.edu/sleipnir

Jacques Izard

Wendy Garrett

Sarah Fortune

Pinaki Sarder, Nicola Segata

Levi Waldron, Larisa Miropolsky

Willythssa Pierre-Louis

Interested? We’re looking for postdocs!

http://huttenhower.sph.harvard.edu

Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong

Hilary Coller, Erin Haley

Page 50: Scalable data mining for functional genomics and metagenomics
Page 51: Scalable data mining for functional genomics and metagenomics

51

HEFalMp: Predicting human gene function

HEFalMp

Page 52: Scalable data mining for functional genomics and metagenomics

52

HEFalMp: Predicting human genetic interactions

HEFalMp

Page 53: Scalable data mining for functional genomics and metagenomics

53

HEFalMp: Analyzing human genomic data

HEFalMp

Page 54: Scalable data mining for functional genomics and metagenomics

54

HEFalMp: Understanding human disease

HEFalMp

Page 55: Scalable data mining for functional genomics and metagenomics

55

Validating Human Predictions

Autophagy

Luciferase (Negative control), ATG5 (Positive control), LAMP2, RAB11A

Not Starved vs. Starved (Autophagic)

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

Page 56: Scalable data mining for functional genomics and metagenomics

56

Functional Mapping: Scoring Functional Associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$FA_{G_1,G_2} = \dfrac{between(G_1,G_2)}{background(G_1,G_2)} \cdot \dfrac{baseline}{within(G_1,G_2)}$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
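
The four measures can be combined into an association score along these lines. The sketch assumes that each measure is the mean edge weight over the corresponding edges of a fully connected weighted network, and that a missing measure (for example, no within-set edges for single-gene sets) falls back to the baseline. Both choices are assumptions of this example, as are the function names and the toy network.

```python
from itertools import combinations


def mean_weight(weights, pairs, default):
    """Mean weight over a collection of gene pairs, or `default` if empty."""
    pairs = list(pairs)
    if not pairs:
        return default
    return sum(weights[p] for p in pairs) / len(pairs)


def functional_association(weights, set1, set2):
    """FA = (between / background) * (baseline / within).

    weights: {frozenset({gene_a, gene_b}): edge weight} for every gene pair
    set1, set2: sets of gene names
    """
    baseline = sum(weights.values()) / len(weights)
    between = {frozenset((a, b)) for a in set1 for b in set2 if a != b}
    within = {frozenset(p) for s in (set1, set2)
              for p in combinations(sorted(s), 2)}
    union = set1 | set2
    background = {p for p in weights if p & union}  # edges touching either set
    return (mean_weight(weights, between, baseline)
            / mean_weight(weights, background, baseline)
            * baseline
            / mean_weight(weights, within, baseline))


# Toy network: six genes, uniform weights, then strengthen edges between sets.
genes = "ABCDEF"
weights = {frozenset(p): 0.5 for p in combinations(genes, 2)}
for a in "AB":
    for b in "CD":
        weights[frozenset((a, b))] = 0.9

score = functional_association(weights, {"A", "B"}, {"C", "D"})
print(score)  # greater than 1: the sets are more connected than expected
```

In a network with uniform weights every ratio is 1, so the score is 1; strengthening only the edges between the two sets pushes it above 1.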

Page 57: Scalable data mining for functional genomics and metagenomics

57

Functional Mapping: Bootstrap p-values

• Scoring functional associations is great… how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?

Empirically!

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1. The standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph.

[Figure: histograms of FAs for random sets, for # genes = 1, 5, 10, 50 in each set]

[Figure: null distribution σs for one graph, as a function of |G1| and |G2|]

$\hat{\mu}_{FA}(G_i, G_j) = 1$

[Equation: $\hat{\sigma}_{FA}(G_i, G_j)$ as a function of $|G_i|$ and $|G_j|$ with graph-specific parameters A(G), B(G), C(G); exact form not recoverable from the extracted text]

$P_{G_1,G_2}(FA > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
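
The empirical procedure can be sketched as: sample many random gene-set pairs of the same sizes, score each, summarize the null scores by a mean and standard deviation, and convert an observed score to a one-sided p-value with the normal approximation. In the sketch below the scoring function is a stand-in, not the FA score itself, and the function names, the choice of disjoint random sets, and the sample size are assumptions made for illustration.

```python
import math
import random
import statistics


def null_parameters(score_fn, genes, size1, size2, n_samples=1000, seed=0):
    """Mean and standard deviation of scores for random, disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(genes, size1 + size2)
        scores.append(score_fn(set(sample[:size1]), set(sample[size1:])))
    return statistics.mean(scores), statistics.stdev(scores)


def p_value(observed, mu, sigma):
    """One-sided p-value 1 - Phi((observed - mu) / sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.erfc((observed - mu) / (sigma * math.sqrt(2.0)))


# Stand-in score: ratio of mean per-gene values (hypothetical, not FA).
genes = [f"G{i}" for i in range(200)]
value_rng = random.Random(1)
value = {g: value_rng.uniform(0.5, 1.5) for g in genes}


def toy_score(set1, set2):
    mean1 = sum(value[g] for g in set1) / len(set1)
    mean2 = sum(value[g] for g in set2) / len(set2)
    return mean1 / mean2


mu, sigma = null_parameters(toy_score, genes, size1=5, size2=10)
p = p_value(1.3, mu, sigma)
print(mu, sigma, p)
```

Fitting the null parameters once per pair of set sizes, rather than per query, is what lets the mapping from scores to p-values be reused for any gene sets on the same graph.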

Page 58: Scalable data mining for functional genomics and metagenomics

58

Functional maps for cross-species knowledge transfer

[Figure: gene network (G1 … G17) mapped onto groups (O1 … O9)]

O1: G1, G2, G3
O2: G4
O3: G6
…

ECG1, ECG2
BSG1
ECG3, BSG2
…

Page 59: Scalable data mining for functional genomics and metagenomics

59

Functional maps for functional metagenomics

GOS 4441599.3, Hypersaline Lagoon, Ecuador

[Figure: KEGG Pathways × Organisms; organism groups: Pathogens, Env.]

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

Mapping organisms into phyla

Page 60: Scalable data mining for functional genomics and metagenomics

60

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Page 61: Scalable data mining for functional genomics and metagenomics

61

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Page 62: Scalable data mining for functional genomics and metagenomics

62

Functional maps for cross-species knowledge transfer

← Precision ↑, Recall ↓

Following up with unsupervised and partially anchored network alignment

Page 63: Scalable data mining for functional genomics and metagenomics

63

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001
ANOVA: p < 0.05

LEfSe: DIFF!

Hi-level functional category: Carbohydrates
Hi-level functional category: Membrane Transport
Hi-level functional category: Nitrogen Metabolism
Hi-level functional category: Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial, Viral