Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families Prianka Patel, Thesis Defense Yale University Molecular Biophysics and Biochemistry 2.17.10


Page 1: Projects

Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families

Prianka Patel, Thesis Defense

Yale University, Molecular Biophysics and Biochemistry

2.17.10

Page 2: Projects


Projects

Bowie, James Nature, 2005

Analysis of Membrane Protein Structures

Metagenomics of Ocean Microbes: Co-variation with Environment

[Figure axes: Sequence Separation (x), # Coevolving pairs (y)]

Photosynthesis

Page 3: Projects


What is Metagenomics?

Traditional Genomics

1. Select organism and culture

2. Extract DNA and sequence

3. Assemble and annotate

Estimated that less than 1% of microbes can be cultured

Metagenomics

1. Collect sample from environment

2. Extract DNA and sequence

3. Assemble and annotate

Lose information about which gene belongs to which microbe

[Figure: sequence reads (atgctcgatctcg, atcgatctcgctg, atgccgatctaa, . . .) assembled into Contig 1, Contig 2, . . .]

Page 4: Projects


Comparative Metagenomics

Foerstner et al., EMBO Rep, 2005

An amino acid change in Proteorhodopsin proteins is linked to abundant wavelengths in the sample of origin

GC content is shaped by environment

Very different environments: whale bone associated, ocean, acid mine, soil

[Figure legend: Sargasso Sea 2, Sargasso Sea 3, Sargasso Sea 4, Whale 1 (bone), Whale 2 (bone), Whale 1 (microbial mat), Acid mine drainage, Minnesota farm soil; = Average]

Page 5: Projects


Comparative Metagenomics

Dinsdale et al., Nature 2008

There are microbial pathways that discriminate between categorically different environments

[Figure labels: variant, invariant]

Gianoulis et al., PNAS 2009

There are microbial pathways that discriminate between similar environments

Photosynthesis

Page 6: Projects


Motivation

Variation in membrane proteins across different environments may give insight into the adaptations that allow microbes to survive in specific habitats.

Membrane proteins interact with the environment, transporting available nutrients, sensing environmental signals, and responding to changes

Engelman et al., Nature, 2005

Page 7: Projects


Sorcerer II Global Ocean Survey

Sorcerer II journey: August 2003 to January 2006

Sampled approximately every 200 miles

Rusch et al., PLOS Biology 2007

Page 8: Projects


Sorcerer II Global Ocean Survey

Metagenomic Sequence 0.1–0.8 μm size fraction (bacteria)

6.3 billion base pairs (7.7 million reads)

Reads were assembled and genes annotated

Metadata: GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content

The majority of samples are from open ocean, with a few estuaries and lakes

Each site has its own metadata

Assembly was done over all locations, but can be mapped back to a particular site

Rusch et al., PLOS Biology 2007

Page 9: Projects


Extracting environmental data using GPS Coordinates

GOS

Sample Depth: 1 meter

Water Depth: 32 meters

Chlorophyll: 4.0 ug/kg

Salinity: 31 psu

Temperature: 11 °C

Location: 41°5'28"N, 71°36'8"W

GPS coordinates allow us to extract information from other sources:

* World Ocean Atlas

* National Center for Ecological Analysis and Synthesis
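The location on this slide is given in degrees, minutes, and seconds, whereas gridded data sets are usually indexed by decimal degrees. The sketch below shows one way to convert, assuming coordinate strings formatted exactly as on the slide. The function name `dms_to_decimal` is hypothetical and not part of the thesis.

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = re.fullmatch(r"""\s*(\d+)°(\d+)'(\d+(?:\.\d+)?)"([NSEW])\s*""", dms)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value

# Location shown in the GOS metadata on this slide.
lat = dms_to_decimal("41°5'28\"N")
lon = dms_to_decimal("71°36'8\"W")
print(round(lat, 4), round(lon, 4))
```

The signed decimal values can then be used as lookup keys in any latitude/longitude grid.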

Page 10: Projects


World Ocean Atlas 2005

NOAA (National Oceanic and Atmospheric Administration) and NODC (National Oceanographic Data Center)

Nutrient Features Extracted:

Phosphate

Silicate

Nitrate

Apparent Oxygen Utilization

Dissolved Oxygen

* Cumulative annual data at the ocean surface

* Resolution is 1 degree latitude/longitude

. . . no simple geometric shape matches the Earth

Annual Phosphate [umol/l] at the surface
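At 1-degree resolution, extracting a nutrient value for a site amounts to finding the grid cell that contains its coordinates. The sketch below assumes a grid whose rows run from 90°S to 90°N and whose columns run from 180°W to 180°E. That layout, the function name `grid_cell`, and the toy `phosphate` grid are illustrative assumptions, not the World Ocean Atlas file format.

```python
import math

def grid_cell(lat: float, lon: float) -> tuple[int, int]:
    """Return (row, col) of the 1-degree cell containing (lat, lon).

    Assumed layout: row 0 spans latitude [-90, -89), col 0 spans longitude [-180, -179).
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    row = min(int(math.floor(lat + 90)), 179)   # clamp lat == 90 into the last row
    col = min(int(math.floor(lon + 180)), 359)  # clamp lon == 180 into the last column
    return row, col

# Toy 180 x 360 grid filled with a placeholder value (not real data).
phosphate = [[0.0] * 360 for _ in range(180)]

# Decimal-degree form of the location shown in the GOS metadata.
row, col = grid_cell(41.0911, -71.6022)
value = phosphate[row][col]
print(row, col, value)
```

A real extraction would also need a rule for cells with no data, such as coastal sites that fall on land in the grid.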

Page 11: Projects


National Center for Ecological Analysis and Synthesis (NCEAS)

Anthropogenic Features Extracted:

Ultraviolet radiation

Shipping

Pollution

Climate Change

Ocean Acidification

Halpern et al. (2008), Science

* Resolution is 1 km square

* Value of an activity at a particular location is determined by the type of ecosystem present:

Impact = ∑ Features * Ecosystem * impact weight

[Figure panels: Shipping, Climate Change]
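The impact formula on this slide can be read as a sum over every feature and ecosystem pair in a grid cell. The sketch below follows that reading. The dictionary layout, the function name `cumulative_impact`, and all numeric values are hypothetical and are not taken from the NCEAS data set.

```python
def cumulative_impact(feature_intensity, ecosystem_present, impact_weight):
    """Sum feature intensity x ecosystem presence x impact weight over all pairs."""
    total = 0.0
    for feature, intensity in feature_intensity.items():
        for ecosystem, present in ecosystem_present.items():
            # Pairs without a listed weight contribute nothing (an assumption).
            weight = impact_weight.get((feature, ecosystem), 0.0)
            total += intensity * present * weight
    return total

# Hypothetical values for one grid cell.
feature_intensity = {"shipping": 0.5, "pollution": 0.2}
ecosystem_present = {"coral_reef": 1, "seagrass": 0}
impact_weight = {
    ("shipping", "coral_reef"): 2.0,
    ("pollution", "coral_reef"): 1.0,
    ("shipping", "seagrass"): 1.5,
}
score = cumulative_impact(feature_intensity, ecosystem_present, impact_weight)
print(score)
```

Because absent ecosystems multiply their terms by zero, the same activity can score differently in two cells that differ only in which ecosystems are present.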

Page 12: Projects


Predicting membrane proteins in GOS data

- TMHMM (Transmembrane Hidden Markov Model): finds hydrophobic stretches of amino acids

- COG (Clusters of Orthologous Groups): orthologous groups of protein families

Metagenomic Reads → (GOS mapping) → Protein Clusters → (TMHMM filtering) → Membrane Protein Clusters → (COG) → Family 1, Family 2, . . .

* 151 Families

Page 13: Projects


Predicting membrane proteins in GOS data

| Site Name | # proteins | # proteins with predicted membrane-spanning region | % proteins with predicted membrane-spanning regions | # proteins in membrane protein clusters | % proteins in membrane protein clusters | # proteins in membrane protein clusters mapping to COG |
|---|---|---|---|---|---|---|
| Sargasso Sea, Hydrostation S | 138,843 | 29,759 | 21.43% | 20,609 | 14.84% | 6,123 |
| Gulf of Maine | 189,035 | 35,971 | 19.03% | 23,104 | 12.22% | 6,927 |
| Browns Bank, Gulf of Maine | 85,295 | 17,089 | 20.04% | 11,161 | 13.09% | 2,961 |
| Outside Halifax, Nova Scotia | 79,377 | 15,520 | 19.55% | 10,254 | 12.92% | 3,347 |
| Northern Gulf of Maine | 77,342 | 15,219 | 19.68% | 9,728 | 12.58% | 3,008 |
| Block Island, NY | 113,436 | 23,415 | 20.64% | 15,883 | 14.00% | 4,519 |
| Cape May, NJ | 110,513 | 23,302 | 21.09% | 15,975 | 14.46% | 4,842 |
| Off Nags Head, NC | 162,087 | 31,154 | 19.22% | 20,662 | 12.75% | 6,416 |
| South of Charleston, SC | 196,814 | 41,993 | 21.34% | 28,353 | 14.41% | 7,692 |
| Off Key West, FL | 197,474 | 43,081 | 21.82% | 29,545 | 14.96% | 8,596 |
| Gulf of Mexico | 192,335 | 42,027 | 21.85% | 28,161 | 14.64% | 7,804 |
| Yucatan Channel | 370,261 | 77,900 | 21.04% | 53,396 | 14.42% | 13,638 |
| Rosario Bank | 211,933 | 44,876 | 21.17% | 30,712 | 14.49% | 8,003 |
| Northeast of Colon | 213,938 | 45,447 | 21.24% | 31,007 | 14.49% | 8,959 |
| Gulf of Panama | 193,265 | 41,558 | 21.50% | 27,985 | 14.48% | 7,923 |
| 250 miles from Panama City | 192,610 | 41,764 | 21.68% | 27,996 | 14.54% | 7,849 |
| 30 miles from Cocos Island | 191,568 | 39,302 | 20.52% | 26,850 | 14.02% | 6,337 |
| 134 miles NE of Galapagos | 158,620 | 34,221 | 21.57% | 23,367 | 14.73% | 6,280 |
| Devil's Crown, Floreana Island | 320,755 | 68,068 | 21.22% | 46,680 | 14.55% | 11,988 |
| Coastal Floreana | 283,545 | 60,268 | 21.26% | 40,864 | 14.41% | 10,389 |
| North James Bay, Santigo Island | 208,852 | 45,408 | 21.74% | 31,078 | 14.88% | 8,083 |
| Warm seep, Roca Redonda | 570,496 | 124,657 | 21.85% | 85,303 | 14.95% | 23,061 |
| Upwelling, Fernandina Island | 659,780 | 140,276 | 21.26% | 97,806 | 14.82% | 26,658 |
| North Seamore Island | 208,632 | 45,314 | 21.72% | 30,534 | 14.64% | 9,021 |
| Wolf Island | 221,536 | 48,391 | 21.84% | 32,359 | 14.61% | 9,172 |
| Cabo Marshall, Isabella Island | 114,654 | 24,891 | 21.71% | 17,077 | 14.89% | 4,137 |
| Equatorial Pacific TAO Buoy | 107,066 | 22,882 | 21.37% | 15,599 | 14.57% | 4,016 |
| 201 miles from F. Polynesia | 98,014 | 20,789 | 21.21% | 14,385 | 14.68% | 3,540 |
| Rangirora Atoll | 188,985 | 40,136 | 21.24% | 27,285 | 14.44% | 6,581 |
| Sum | 6,057,061 | 1,284,678 | | 873,718 | | 237,870 |

22% of unique proteins in membrane protein clusters map to COG
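The two percentage columns in the per-site table appear to be each count divided by that site's total number of proteins. A small sketch of that arithmetic, using the first row (Sargasso Sea, Hydrostation S); the helper name `percent` is hypothetical:

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# Counts from the first row of the table (Sargasso Sea, Hydrostation S).
n_proteins = 138_843
n_predicted_tm = 29_759
n_in_membrane_clusters = 20_609

tm_share = percent(n_predicted_tm, n_proteins)               # predicted membrane-spanning region
cluster_share = percent(n_in_membrane_clusters, n_proteins)  # in membrane protein clusters
print(tm_share, cluster_share)
```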

Page 14: Projects


What is the Relationship?

• Correlation of Sites based on environmental features or protein families

• Discriminative Partition Matching

• Canonical Correlation Analysis/Protein Features and Environmental Features Network

Environmental Features Membrane Protein Families

?

Page 15: Projects


How Similar are the Sites to each other?

[Figure: site-to-site correlation; scale 1, 0, -1]

Page 16: Projects


Species Distribution

• The 16S rRNA gene encodes a component of the small prokaryotic ribosomal subunit

• Bacteria with 16S rRNA gene sequences more similar than 97% are considered the same ‘species’

• 10,025 16S genes found and classified

20% level, “phylum”

Biers et al. App. Env. Microbiology , 2009
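The 97% rule above can be illustrated with a simple identity calculation. The sketch assumes the two 16S sequences are already aligned to the same length and ignores gap handling, which a real classification pipeline would need. The function names and the toy sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)

def same_species(seq_a: str, seq_b: str, threshold: float = 97.0) -> bool:
    """Apply the rule from the slide: more similar than 97% means the same 'species'."""
    return percent_identity(seq_a, seq_b) > threshold

# Toy 100-base sequences (not real 16S genes) that differ at two positions.
reference = "ACGT" * 25
variant = "TT" + reference[2:]
print(percent_identity(reference, variant), same_species(reference, variant))
```

Lowering `threshold` gives coarser groupings, in the spirit of the "20% level" annotation on this slide.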

Page 17: Projects


Method: For each site, we correlated its environmental feature (EF) profile distances with its membrane protein family (MPF) frequency profile distances and with its 16S profile distances.

This suggests that the observed membrane protein variation is more a function of the measured environmental features than of phylogenetic diversity.
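One way to read the method on this slide: for each site, take its distances to every other site in one feature space and correlate them with its distances in a second feature space. The sketch below assumes Euclidean distance and Pearson correlation, neither of which is stated on the slide, and uses made-up toy profiles.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def per_site_distance_correlation(profiles_a, profiles_b):
    """For each site, correlate its distances to all other sites in two feature spaces."""
    sites = range(len(profiles_a))
    result = []
    for i in sites:
        dist_a = [euclidean(profiles_a[i], profiles_a[j]) for j in sites if j != i]
        dist_b = [euclidean(profiles_b[i], profiles_b[j]) for j in sites if j != i]
        result.append(pearson(dist_a, dist_b))
    return result

# Toy data: 4 sites, 2 environmental features and 3 family frequencies per site.
env = [[11.0, 31.0], [12.0, 32.0], [25.0, 36.0], [27.0, 35.0]]
fam = [[0.2, 0.5, 0.3], [0.25, 0.45, 0.3], [0.6, 0.2, 0.2], [0.65, 0.15, 0.2]]
correlations = per_site_distance_correlation(env, fam)
print([round(r, 3) for r in correlations])
```

Running the same function with 16S profiles in place of `env` would give the second set of per-site correlations for comparison.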

Page 18: Projects


Discriminative Partition Matching

Which membrane protein families are discriminating between these clusters?

We can partition the membrane protein family matrix by these site groupings, and then look for significantly different distributions of protein families between the clusters.

Sites cluster into three distinct groups:

Groups are geographically separated:

Page 19: Projects


First, we performed PCA on the membrane protein families matrix, and grouped the first component scores by the environmental clustering.

This revealed that the Mid-Atlantic and Pacific were more similar to each other in terms of membrane protein content, so these sites were grouped together.

Discriminative Partition Matching

Which families are discriminating between these two site-sets? (T-test)
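A per-family comparison between the two site sets could be sketched as below. The slide only says "T-test", so the choice of Welch's unequal-variance form is an assumption, and the abundance values are hypothetical. The sketch returns the t statistic and degrees of freedom only; turning these into a p-value requires a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical relative abundances of one family in the two site sets.
north_atlantic = [0.031, 0.029, 0.034, 0.030, 0.033]
other_sites = [0.020, 0.022, 0.019, 0.021, 0.023, 0.018]
t, df = welch_t(north_atlantic, other_sites)
print(round(t, 2), round(df, 1))
```

Repeating this for each of the 151 families yields one test per family, so some correction for multiple testing would be a reasonable addition.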

Page 20: Projects


DPM results

• 30 families showed significant differences (p-value < 0.01) between the site sets

• Most were enriched in the North Atlantic (28/30)

• Higher pollution, chlorophyll, and possibly higher nutrients and cell abundance in the North Atlantic

Microbes’ need to expel antimicrobials, by-products of metabolism, or environmental toxins

Buffer against shifts in ocean solute concentrations, again alluding to the increased pollutants, and possibly nutrient fluxes from land and rivers

Chlorophyll content

Stabilization of DNA and RNA

Exchanges ATP for ADP in mitochondria and obligate intracellular parasites; may be nucleotide/H+ transporters

Page 21: Projects


Simultaneous Correlations of Environmental Features and Membrane Proteins: Canonical Correlation Analysis

We have addressed this question by:

1. Comparing site similarity based on these two sets of features

2. Finding particular discriminating families between environmental groupings

But we don’t know what particular features are associated with each other, and we know that they are all likely interdependent: Canonical Correlation Analysis

Environmental Features (Salinity, Pollution, Temp) ? Membrane Protein Families (Family 1, Family 2, Family 5)

Page 22: Projects


Canonical Correlation Analysis

- CCA allows us to take advantage of the continuity of the features and observe which features are invariant or variant, and the type (positive, negative) of relationship between them.

- We correlate all the variables, protein families and environmental features, simultaneously.

- We have two sets of variables, X1. . . X15 (environmental features) and Y1. . . Y151 (membrane protein families)

Environmental Features Membrane Protein Families

We are looking for two vectors, a and b (sets of weights for X and Y), such that the correlation between the weighted combinations of X and Y is maximized:
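The equation from the slide did not survive extraction. A standard way to write the canonical correlation objective, using the slide's symbols a, b, X, and Y, is given here; the covariance notation $\Sigma_{XX}$, $\Sigma_{YY}$, $\Sigma_{XY}$ is an assumption introduced for this sketch.

```latex
\rho \;=\; \max_{a,\,b}\; \operatorname{corr}\!\left(a^{\top}X,\; b^{\top}Y\right)
     \;=\; \max_{a,\,b}\; \frac{a^{\top}\Sigma_{XY}\,b}
                               {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Here $X = (X_1, \ldots, X_{15})$ holds the environmental features, $Y = (Y_1, \ldots, Y_{151})$ holds the membrane protein families, $\Sigma_{XX}$ and $\Sigma_{YY}$ are the within-set covariance matrices, and $\Sigma_{XY}$ is the cross-covariance matrix.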

Page 23: Projects


CCA results

We are defining a change of basis of the cross-covariance matrix.

We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized.

Eigenvalues: squared canonical correlations

Eigenvectors: normalized canonical correlation basis vectors
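The eigenvalue statement can be made concrete with the usual eigenproblem form of CCA. This is the textbook formulation, not necessarily the exact computation used in the thesis. It assumes the within-set covariance matrices $\Sigma_{XX}$ and $\Sigma_{YY}$ are invertible, writes $\Sigma_{XY} = \Sigma_{YX}^{\top}$ for the cross-covariance, and uses a and b for the basis (weight) vectors of X and Y.

```latex
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\;a \;=\; \rho^{2}\,a,
\qquad
\Sigma_{YY}^{-1}\,\Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}\;b \;=\; \rho^{2}\,b
```

Each eigenvalue $\rho^{2}$ is a squared canonical correlation, and the matching eigenvectors a and b give one dimension of the analysis. The first and second dimensions are the ones with the two largest eigenvalues.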

[Figure: correlation circle. Legend: Environment, Family. Labeled circles: Correlation = 0.3, Correlation = 1]

This plot shows the correlations in the first and second dimensions

Correlation Circle: The closer the point is to the outer circle, the higher the correlation

Variables projected in the same direction are correlated

Page 24: Projects


CCA results

[Figure: CCA correlation circle. Axes: Dimension 1, Dimension 2. Regions: variant, invariant. Environmental features plotted: Water depth, Acidity, App. O2 util., Salinity, Pollution, Climate change, Shipping, Phosphate, Nitrate, Silicate, Temperature, Chlorophyll, Dissolved O2, UV, Sample depth]

107 variant membrane protein families

44 invariant membrane protein families

Difficult to see the strength and directionality of a relationship

Weights of the features are difficult to visualize and compare

There is no means of quantifying the variation between sets of features

Page 25: Projects


Protein Families and Environmental Features Network (PEN)

$a \cdot b = |a|\,|b| \cos\theta$

Distance: dot product of two features' coordinates in the 1st and 2nd dimensions of CCA
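As a rough illustration of how such a network could be assembled, the sketch below links an environmental feature to a protein family when the dot product of their coordinates on the first two CCA dimensions is large in magnitude. The threshold value, the function names, and all coordinates are hypothetical; the thesis may use a different cutoff or normalization.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """cos(theta), rearranged from a . b = |a||b| cos(theta)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def build_edges(env_coords, fam_coords, threshold):
    """Link each environmental feature to each family whose |dot product| exceeds threshold."""
    edges = []
    for env_name, e in env_coords.items():
        for fam_name, f in fam_coords.items():
            weight = dot(e, f)
            if abs(weight) > threshold:
                edges.append((env_name, fam_name, weight))
    return edges

# Hypothetical coordinates on CCA dimensions 1 and 2 (not values from the thesis).
env_coords = {"Phosphate": (0.8, 0.3), "Pollution": (-0.6, 0.5)}
fam_coords = {"Family 1": (0.7, 0.4), "Family 2": (0.5, -0.6), "Family 5": (0.1, 0.05)}
edges = build_edges(env_coords, fam_coords, threshold=0.3)
for edge in edges:
    print(edge)
```

The sign of each edge weight indicates a positive or negative relationship, and points near the origin (such as the toy "Family 5") drop out of the network because their dot products are small.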

Page 26: Projects


Protein Families and Environmental Features Network (PEN)

“Bi-modules”: groups of environmental features and membrane protein families that are associated

UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network

COG0598, Magnesium Transporter

COG1176, Polyamine Transporter

Page 27: Projects


Bi-module 1: Phosphate/Phosphate Transporters

Low phosphate: high-affinity phosphate transporters, which are induced during phosphate limitation

High phosphate: low-affinity inorganic phosphate ion transporters, which are constitutively expressed

Page 28: Projects


Microbes modulate content in response to phosphate

Phosphate concentration is related to phosphate acquisition genes in Prochlorococcus (Martiny et al., Env Microbiology, 2009)

Microbes modulate phospholipid content in response to phosphate concentrations (Van Mooy et al., Nature, 2009)

Page 29: Projects


Bi-module 2: Iron Transporters/Pollution/Shipping

Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron

Pollution and Shipping may be a proxy for iron concentrations

Page 30: Projects


Bi-module 2: Iron Transporters/Pollution/Shipping

Ridgwell A. J. (2002) Phil. Trans. R. Soc. Lond.

Iron is usually limiting in oceans: High Nitrate-Nutrient/Low Chlorophyll regions

Delivery of iron is usually by:

- terrestrial input

- fluvial (rivers) input

- upwelling from the ocean floor

- aeolian dust from land

Page 31: Projects


Bi-module 2: Iron Transporters/Pollution/Shipping

Pollution and Dust N/C and Iron Transporters

- Negative correlation between COG4558 and COG0609 and dust/pollution values (p-value < 0.01)

- Searching the BRENDA database for enzymes that use iron as a cofactor reveals that the abundance of these two COGs is negatively correlated with the number of iron-requiring enzymes present.

Page 32: Projects


Conclusions

New method (PEN) to visualize complex relationships in metagenomic data using explicit environmental variables

We show both known and intuitive relationships between features and genomic content

CCA also reveals the invariant fraction of environmental features and protein families (highlights important cellular processes):

Chloride Channel, Type II secretion Proteins (virulence)

Many variant ABC-type transporters (34/41): suggests streamlining for optimization and energy conservation

Page 33: Projects


Much of Membrane Protein Space Remains Uncharacterized

• 15% of predicted membrane proteins had NO homology to GenBank (e-value < 1e-10)

• We used short motifs (PROSITE) to characterize a small fraction of these including ABC Transporters, GPCRs, Lipocalins, beta-lactamases

16% (29,384) were annotated

Page 34: Projects


Intraribotype diversity and the definition of a ‘species’

Eugene V Koonin Nat Biotechnology, 2007

16S analysis of GOS data reveals that most sequences fall into 5 ribotypes

However, there were very few identical sequences, suggesting that no two cells have identical genome sequences

This suggests that ocean microbes are rather adaptive to their environments

We observe diversity in membrane protein content and abundance, and show that it is a reflection of different environmental conditions more than phylogenetic diversity (16S)

These are mostly oligotrophic (nutrient-poor) waters, and environmental conditions have likely been fairly constant over many years, so genomes are “streamlining”

Page 35: Projects


Conclusions

Integration of Environmental Features using GPS coordinates

Environmental clusters show differences in membrane protein content which reflect environmental conditions (pollution/efflux proteins)

Microbes from ocean surface samples show diversity in membrane protein content

Diversity in membrane proteins was shown to be a reflection of different environmental conditions more than phylogenetic diversity

Developed (PEN) and adapted techniques to connect features of environment to specific protein families

Genotypic variation within similar natural populations occurs in response to environmental conditions

Integration of geospatial data can highlight unexpected trends as anthropogenic factors seem to be reflected in microbial function

Page 36: Projects


Advisors: Donald Engelman and Mark Gerstein

Acknowledgements

Committee Members: Jim Bowie (UCLA), Annette Molinaro, Lynne Regan, Mike Snyder

Administrative Staff: Mary Backer, Ann Nicotra, Nessie Stewart

Collaborators

Gerstein Lab:

Tara Gianoulis

Kevin Yip

Rob Bjornson

Nicolas Carriero

Philip Kim

Jan Korbel

Sam Flores

Engelman Lab:

Damien Thevenin

Julia Rogers

Past and Present members of Engelman and Gerstein Labs

Yale University Biomedical High Performance Computing Facility

NIH grant RR19895, which funded the instrumentation

Yale Map Collection:

Stacey Maples