Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families
Prianka Patel, Thesis Defense
Yale University, Molecular Biophysics and Biochemistry
2.17.10
2
Projects
Bowie, James Nature, 2005
Analysis of Membrane Protein Structures
Metagenomics of Ocean Microbes: Co-variation with Environment
[Figure axes: # coevolving pairs vs. sequence separation]
[Figure label: Photosynthesis]
3
What is Metagenomics?
Traditional Genomics:
1. Select organism and culture
2. Extract DNA and sequence
3. Assemble and annotate
Estimated that less than 1% of microbes can be cultured
Metagenomics:
1. Collect sample from environment
2. Extract DNA and sequence
3. Assemble and annotate
Lose information about which gene belongs to which microbe
[Figure: example reads (atgctcgatctcg, atcgatctcgctg, atgccgatctaa) assembled into Contig 1, Contig 2, . . .]
4
Comparative Metagenomics
Foerstner et al., EMBO Rep, 2005
An amino acid change in Proteorhodopsin proteins is linked to abundant wavelengths in the sample of origin
GC content is shaped by environment
Very different environments: whale bone associated, ocean, acid mine, soil
[Figure legend: Sargasso Sea 2, Sargasso Sea 3, Sargasso Sea 4, Whale 1 (bone), Whale 2 (bone), Whale 1 (microbial mat), Acid Mine Drainage, Minnesota farm soil; marked value = average]
5
Comparative Metagenomics
Dinsdale et al., Nature 2008
There are microbial pathways that discriminate between categorically different environments
[Figure labels: variant, invariant]
Gianoulis et al., PNAS 2009
There are microbial pathways that discriminate between similar environments
Photosynthesis
6
Motivation
Variation in membrane proteins across different environments may give insight into the adaptations that allow microbes to survive in specific habitats.
Membrane proteins interact with the environment: they transport available nutrients, sense environmental signals, and respond to changes.
Engelman et al., Nature, 2005
7
Sorcerer II Global Ocean Survey
Sorcerer II journey: August 2003 to January 2006
Sample approximately every 200 miles
Rusch, et al., PLOS Biology 2007
8
Sorcerer II Global Ocean Survey
Metagenomic Sequence 0.1–0.8 μm size fraction (bacteria)
6.3 billion base pairs (7.7 million reads)
Reads were assembled and genes annotated
Metadata: GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content
The majority of samples are from open ocean, with a few estuaries and lakes
Each site has its own metadata
Assembly was done over all locations, but can be mapped back to a particular site
Rusch, et al., PLOS Biology 2007
9
Extracting environmental data using GPS Coordinates
GOS sample metadata:
Sample Depth: 1 meter
Water Depth: 32 meters
Chlorophyll: 4.0 ug/kg
Salinity: 31 psu
Temperature: 11 C
Location: 41°5'28"N, 71°36'8"W
GPS coordinates allow us to extract information from other sources:
* World Ocean Atlas
* National Center for Ecological Analysis and Synthesis
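Before the coordinates can be used to look up values in gridded data sets, the degrees/minutes/seconds notation has to be converted to signed decimal degrees. A minimal sketch, assuming the location string always has the form shown on this slide; the function names are illustrative and not from the original analysis:

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a string such as 41°5'28"N to signed decimal degrees."""
    match = re.fullmatch(r"\s*(\d+)°(\d+)'(\d+(?:\.\d+)?)\"([NSEW])\s*", dms)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value

def parse_location(location: str) -> tuple[float, float]:
    """Split 'lat, lon' and convert both parts."""
    lat_text, lon_text = location.split(",")
    return dms_to_decimal(lat_text), dms_to_decimal(lon_text)

lat, lon = parse_location("41°5'28\"N, 71°36'8\"W")
print(round(lat, 4), round(lon, 4))
```

The sign convention (negative for S and W) is what most gridded products expect, but it should be checked against the specific data source.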
10
World Ocean Atlas 2005
NOAA (National Oceanic and Atmospheric Administration) and NODC (National Oceanographic Data Center)
Nutrient Features Extracted:
Phosphate
Silicate
Nitrate
Apparent Oxygen Utilization
Dissolved Oxygen
* Cumulative annual data at the ocean surface
* Resolution is 1 degree latitude/longitude
. . . no simple geometric shape matches the Earth
Annual Phosphate [umol/l] at the surface
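At 1 degree resolution, each sample site falls into one grid cell, and the nutrient value for that cell is assigned to the site. A minimal sketch of the lookup, assuming cells are identified by their center (x.5 degrees); the grid dictionary and its value are placeholders, not World Ocean Atlas data:

```python
import math

def grid_cell_center(lat: float, lon: float, resolution: float = 1.0) -> tuple[float, float]:
    """Return the center of the grid cell that contains (lat, lon)."""
    half = resolution / 2
    cell_lat = math.floor(lat / resolution) * resolution + half
    cell_lon = math.floor(lon / resolution) * resolution + half
    return cell_lat, cell_lon

# Hypothetical surface phosphate grid keyed by cell center (placeholder value).
phosphate_umol_per_l = {(41.5, -71.5): 0.4}

cell = grid_cell_center(41.0911, -71.6022)
value = phosphate_umol_per_l.get(cell)  # None if the cell has no data
print(cell, value)
```

Cells with no data (for example, on land or near the coast) return `None` here; how such gaps were handled in the original analysis is not stated on the slide.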
11
National Center for Ecological Analysis and Synthesis (NCEAS)
Anthropogenic Features Extracted:
Ultraviolet radiation
Shipping
Pollution
Climate Change
Ocean Acidification
Halpern et al. (2008), Science
* Resolution is 1 km square
* Value of an activity at a particular location is determined by the type of ecosystem present:
Impact = ∑ Features * Ecosystem * impact weight
[Figure panels: Shipping, Climate Change]
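The impact formula on this slide sums, over every feature and ecosystem pair, the feature intensity times the ecosystem presence times an impact weight. A minimal sketch of that sum for a single location; all names and numbers are placeholders chosen for illustration, not values from the NCEAS data:

```python
def cumulative_impact(features, ecosystems, weights):
    """Sum feature intensity * ecosystem presence * impact weight over all pairs."""
    total = 0.0
    for feature, intensity in features.items():
        for ecosystem, present in ecosystems.items():
            # Pairs without a listed weight contribute nothing (assumption).
            total += intensity * present * weights.get((feature, ecosystem), 0.0)
    return total

# Placeholder inputs for one location.
features = {"shipping": 0.5, "pollution": 0.2}
ecosystems = {"coral_reef": 1, "seagrass": 0}  # 1 = present, 0 = absent
weights = {
    ("shipping", "coral_reef"): 2.0,
    ("pollution", "coral_reef"): 1.5,
    ("shipping", "seagrass"): 1.0,
}

score = cumulative_impact(features, ecosystems, weights)
print(round(score, 3))
```

Because absent ecosystems have presence 0, the same activity can score very differently at two locations with different ecosystems.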
12
Predicting membrane proteins in GOS data
- TMHMM (Transmembrane Hidden Markov Model): finds hydrophobic stretches of amino acids
- COG (Clusters of Orthologous Groups): orthologous groups of protein families
[Pipeline diagram: Metagenomic Reads (GOS) -> Protein Clusters -> TMHMM -> Filtering -> Membrane Protein Clusters -> Mapping to COG -> Family 1, Family 2, . . .]
* 151 Families
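The TMHMM step can be sketched as a filter over TMHMM's short output format, which reports one protein per line with key=value fields such as `PredHel` (number of predicted helices). The threshold of at least one predicted helix is an assumption, since the slide does not state the filtering criterion, and the two records below are made up:

```python
def parse_tmhmm_short(line):
    """Parse one line of TMHMM short output into a dict of fields."""
    fields = line.rstrip("\n").split("\t")
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record

def has_membrane_span(record, min_helices=1):
    """Keep proteins with at least min_helices predicted helices (assumed cutoff)."""
    return int(record.get("PredHel", 0)) >= min_helices

# Made-up example records in the short output layout.
lines = [
    "protein_A\tlen=300\tExpAA=66.10\tFirst60=0.00\tPredHel=3\tTopology=o10-32i45-67o80-102i",
    "protein_B\tlen=150\tExpAA=0.05\tFirst60=0.00\tPredHel=0\tTopology=o",
]

records = [parse_tmhmm_short(line) for line in lines]
membrane_ids = [r["id"] for r in records if has_membrane_span(r)]
print(membrane_ids)
```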
13
Predicting membrane proteins in GOS data
Site Name | # proteins | # proteins with predicted membrane-spanning region | % proteins with predicted membrane-spanning regions | # proteins in membrane protein clusters | % proteins in membrane protein clusters | # proteins in membrane protein clusters mapping to COG
Sargasso Sea, Hydrostation S | 138,843 | 29,759 | 21.43% | 20,609 | 14.84% | 6,123
Gulf of Maine | 189,035 | 35,971 | 19.03% | 23,104 | 12.22% | 6,927
Browns Bank, Gulf of Maine | 85,295 | 17,089 | 20.04% | 11,161 | 13.09% | 2,961
Outside Halifax, Nova Scotia | 79,377 | 15,520 | 19.55% | 10,254 | 12.92% | 3,347
Northern Gulf of Maine | 77,342 | 15,219 | 19.68% | 9,728 | 12.58% | 3,008
Block Island, NY | 113,436 | 23,415 | 20.64% | 15,883 | 14.00% | 4,519
Cape May, NJ | 110,513 | 23,302 | 21.09% | 15,975 | 14.46% | 4,842
Off Nags Head, NC | 162,087 | 31,154 | 19.22% | 20,662 | 12.75% | 6,416
South of Charleston, SC | 196,814 | 41,993 | 21.34% | 28,353 | 14.41% | 7,692
Off Key West, FL | 197,474 | 43,081 | 21.82% | 29,545 | 14.96% | 8,596
Gulf of Mexico | 192,335 | 42,027 | 21.85% | 28,161 | 14.64% | 7,804
Yucatan Channel | 370,261 | 77,900 | 21.04% | 53,396 | 14.42% | 13,638
Rosario Bank | 211,933 | 44,876 | 21.17% | 30,712 | 14.49% | 8,003
Northeast of Colon | 213,938 | 45,447 | 21.24% | 31,007 | 14.49% | 8,959
Gulf of Panama | 193,265 | 41,558 | 21.50% | 27,985 | 14.48% | 7,923
250 miles from Panama City | 192,610 | 41,764 | 21.68% | 27,996 | 14.54% | 7,849
30 miles from Cocos Island | 191,568 | 39,302 | 20.52% | 26,850 | 14.02% | 6,337
134 miles NE of Galapagos | 158,620 | 34,221 | 21.57% | 23,367 | 14.73% | 6,280
Devil's Crown, Floreana Island | 320,755 | 68,068 | 21.22% | 46,680 | 14.55% | 11,988
Coastal Floreana | 283,545 | 60,268 | 21.26% | 40,864 | 14.41% | 10,389
North James Bay, Santigo Island | 208,852 | 45,408 | 21.74% | 31,078 | 14.88% | 8,083
Warm seep, Roca Redonda | 570,496 | 124,657 | 21.85% | 85,303 | 14.95% | 23,061
Upwelling, Fernandina Island | 659,780 | 140,276 | 21.26% | 97,806 | 14.82% | 26,658
North Seamore Island | 208,632 | 45,314 | 21.72% | 30,534 | 14.64% | 9,021
Wolf Island | 221,536 | 48,391 | 21.84% | 32,359 | 14.61% | 9,172
Cabo Marshall, Isabella Island | 114,654 | 24,891 | 21.71% | 17,077 | 14.89% | 4,137
Equatorial Pacific TAO Buoy | 107,066 | 22,882 | 21.37% | 15,599 | 14.57% | 4,016
201 miles from F. Polynesia | 98,014 | 20,789 | 21.21% | 14,385 | 14.68% | 3,540
Rangirora Atoll | 188,985 | 40,136 | 21.24% | 27,285 | 14.44% | 6,581
Sum | 6,057,061 | 1,284,678 | | 873,718 | | 237,870
22% of unique proteins in membrane protein clusters map to COG
14
What is the Relationship?
• Correlation of Sites based on environmental features or protein families
• Discriminative Partition Matching
• Canonical Correlation Analysis / Protein Families and Environmental Features Network (PEN)
[Diagram: Environmental Features <-- ? --> Membrane Protein Families]
15
How Similar are the Sites to each other?
[Figure: site-by-site correlation matrix; color scale from 1 to -1]
16
Species Distribution
• The 16S rRNA gene is a component of the small prokaryotic ribosomal subunit
• Bacteria with 16S rRNA gene sequences more similar than 97% are considered the same ‘species’
• 10,025 16S genes found and classified
20% level, “phylum”
Biers et al. App. Env. Microbiology , 2009
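The 97% rule can be illustrated with a simple percent-identity calculation. A minimal sketch, assuming the two 16S sequences are already aligned to equal length with `-` for gaps; real classification pipelines use dedicated alignment tools, and the sequences below are synthetic:

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned columns (ignoring gap-gap columns) that match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

def same_species(seq_a: str, seq_b: str, threshold: float = 97.0) -> bool:
    """Apply the 'more similar than 97%' rule from the slide."""
    return percent_identity(seq_a, seq_b) > threshold

reference = "ACGT" * 25             # synthetic 100-base sequence
close = "TT" + reference[2:]        # 2 mismatches
distant = "TTAA" + reference[4:]    # 4 mismatches
print(percent_identity(reference, close), same_species(reference, close))
print(percent_identity(reference, distant), same_species(reference, distant))
```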
17
Method: For each site, we correlated its environmental feature (EF) profile distances with its membrane protein family (MPF) frequency profile distances and with its 16S profile distances.
This suggests that the observed membrane protein variation is more a function of the measured environmental features than of phylogenetic diversity.
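One way to read the method above: for each site, compute its distance to every other site in one feature space, do the same in a second feature space, and correlate the two distance vectors. A minimal sketch under that reading; the Euclidean metric, the Pearson correlation, and the toy profiles are assumptions, not details given on the slide:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def distance_profile(profiles, site):
    """Distances from one site to all other sites, in a fixed site order."""
    return [euclidean(profiles[site], profiles[other])
            for other in sorted(profiles) if other != site]

def profile_agreement(profiles_a, profiles_b):
    """Per-site correlation between distance profiles in two feature spaces."""
    return {site: pearson(distance_profile(profiles_a, site),
                          distance_profile(profiles_b, site))
            for site in profiles_a}

# Toy profiles: the second space is the first one scaled by 2,
# so every site's distance profiles agree perfectly.
ef = {"s1": [0.0], "s2": [1.0], "s3": [2.0], "s4": [5.0]}
mpf = {site: [2 * v for v in values] for site, values in ef.items()}
agreement = profile_agreement(ef, mpf)
print(agreement)
```

Comparing the EF-vs-MPF agreement with the 16S-vs-MPF agreement, site by site, gives the kind of contrast the slide summarizes.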
18
Discriminative Partition Matching
Sites cluster into three distinct groups, and the groups are geographically separated.
We can partition the membrane protein family matrix by these site groupings, and then look for significantly different distributions of protein families between the clusters.
Which membrane protein families are discriminating between these clusters?
19
Discriminative Partition Matching
First, we performed PCA on the membrane protein families matrix and grouped the first-component scores by the environmental clustering.
This revealed that the Mid-Atlantic and Pacific sites were more similar to each other in terms of membrane protein content, so these sites were grouped together.
Which families are discriminating between these two site-sets? (T-test)
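The per-family comparison between the two site-sets can be sketched as a two-sample t statistic computed for each family. The slide only says "T-test"; the unequal-variance (Welch) form used here is an assumption, and the frequencies are placeholders. A p-value would additionally need the t distribution, for example from `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df

# Placeholder family frequencies: (site-set 1, site-set 2).
families = {
    "family_X": ([4.0, 5.0, 6.0], [1.0, 2.0, 3.0]),
    "family_Y": ([2.0, 3.0, 4.0], [2.5, 3.0, 3.5]),
}

# Rank families by the size of the difference between the two site-sets.
stats = {name: welch_t(a, b) for name, (a, b) in families.items()}
ranked = sorted(stats, key=lambda name: abs(stats[name][0]), reverse=True)
print(ranked, stats["family_X"])
```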
20
DPM results
• 30 families showed significant differences (p-value < 0.01) between the site sets
• Most were enriched in the North Atlantic (28/30)
• Higher pollution, chlorophyll, and possibly higher nutrients and cell abundance in the North Atlantic
microbes’ need to expel antimicrobials, by-products of metabolism, or environmental toxins
Buffer against shifts in ocean solute concentrations again alluding to the increased pollutants, and possibly nutrient fluxes from land and rivers
Chlorophyll content Stabilization of DNA and RNA
Exchanges ATP for ADP in mitochondria and obligate intracellular parasites, may be nucleotide/H+ transporters
21
Simultaneous Correlations of Environmental Features and Membrane Proteins: Canonical Correlation Analysis
We have addressed this question by:
1. Comparing site similarity based on these two sets of features
2. Finding particular discriminating families between environmental groupings
But we don’t know what particular features are associated with each other, and we know that they are all likely interdependent: Canonical Correlation Analysis
[Diagram: Environmental Features (Salinity, Pollution, Temp) <-- ? --> Membrane Protein Families (Family 1, Family 2, Family 5)]
22
Canonical Correlation Analysis
- CCA allows us to take advantage of the continuity of the features and observe which features are invariant or variant, and the type (positive, negative) of relationship between them.
- We correlate all the variables, protein families and environmental features simultaneously.
- We have two sets of variables, X1. . . X15 (environmental features) and Y1. . . Y151 (membrane protein families)
Environmental Features Membrane Protein Families
We are looking for two vectors, a and b (a set of weights for X and Y), such that the correlation between X, Y is maximized:
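The objective described in the sentence above is usually written as follows. This is the standard textbook form of CCA, given here as a reference sketch; the symbols Σ_XX, Σ_YY (within-set covariance matrices) and Σ_XY (cross-covariance matrix) are notation introduced for this block, not taken from the slide:

```latex
\rho \;=\; \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\; b^{\top}Y\right)
\;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
{\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Here a holds one weight per environmental feature (X1 . . . X15) and b one weight per membrane protein family (Y1 . . . Y151).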
23
CCA results
We are defining a change of basis of the cross-covariance matrix.
We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized.
Eigenvalues: squared canonical correlations
Eigenvectors: normalized canonical correlation basis vectors
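In the standard formulation of CCA, the eigenvalue statement above corresponds to the following pair of eigenproblems, where a and b are the weight vectors for X and Y and ρ is the canonical correlation. The covariance symbols are notation assumed for this block:

```latex
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\; a \;=\; \rho^{2}\, a,
\qquad
\Sigma_{YY}^{-1}\,\Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}\; b \;=\; \rho^{2}\, b
```

Each eigenvalue ρ² is a squared canonical correlation, and the corresponding eigenvectors, once normalized, are the basis vectors for that dimension.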
[Figure: example correlation circle with an Environment point and a Family point; annotations: Correlation = .3, Correlation = 1]
This plot shows the correlations in the first and second dimensions
Correlation Circle: The closer the point is to the outer circle, the higher the correlation
Variables projected in the same direction are correlated
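The position of a variable on the correlation circle can be sketched as its correlation with the first and second canonical variates; the distance from the origin then shows how well the two dimensions capture that variable. A minimal sketch with toy score vectors (the two variates below are constructed to be uncorrelated; none of the numbers come from the thesis data):

```python
import math

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def circle_position(variable, variate_1, variate_2):
    """Coordinates on the correlation circle and distance from the origin."""
    x = pearson(variable, variate_1)
    y = pearson(variable, variate_2)
    return x, y, math.hypot(x, y)

# Toy per-site scores on the first two canonical variates.
variate_1 = [1.0, 2.0, 3.0, 4.0]
variate_2 = [1.0, -1.0, -1.0, 1.0]
# Toy feature that follows the first variate exactly.
feature = [2.0, 4.0, 6.0, 8.0]

x, y, radius = circle_position(feature, variate_1, variate_2)
print(round(x, 3), round(y, 3), round(radius, 3))
```

A radius near 1 places the variable on the outer circle; a radius near 0 means the first two dimensions say little about it.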
24
CCA results
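[Figure: CCA correlation circle, Dimension 1 vs. Dimension 2. Environmental features shown: water depth, acidity, apparent O2 utilization, salinity, pollution, climate change, shipping, phosphate, nitrate, silicate, temperature, chlorophyll, dissolved O2, UV, sample depth. Legend: variant, invariant]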
107 variant membrane protein families
44 invariant membrane protein families
Difficult to see the strength and directionality of a relationship
Weights of the features are difficult to visualize and compare
There is no means of quantifying the variation between sets of features
25
Protein Families and Environmental Features Network (PEN)
$a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$
Distance: dot product between two features' coordinates in the 1st and 2nd dimensions of the CCA
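The network construction can be sketched as follows: take each feature's coordinates in the first two CCA dimensions, compute the dot product for every environmental feature and protein family pair, and keep the pairs with a large absolute value. The coordinates, the family names, and the 0.3 cutoff are all placeholders for illustration; the slide does not state how edges were thresholded:

```python
def dot(a, b):
    """Dot product of two 2-D coordinate pairs (dimension 1, dimension 2)."""
    return a[0] * b[0] + a[1] * b[1]

# Placeholder coordinates in the first two CCA dimensions.
environment = {"Phosphate": (0.8, 0.1), "Pollution": (-0.1, 0.7)}
families = {
    "family_A": (-0.7, -0.2),
    "family_B": (0.1, -0.6),
    "family_C": (0.05, 0.05),
}

def build_edges(environment, families, threshold=0.3):
    """Connect feature/family pairs whose |dot product| reaches the (assumed) cutoff."""
    edges = []
    for env_name, env_xy in environment.items():
        for fam_name, fam_xy in families.items():
            weight = dot(env_xy, fam_xy)
            if abs(weight) >= threshold:
                sign = "positive" if weight > 0 else "negative"
                edges.append((env_name, fam_name, sign))
    return edges

edges = build_edges(environment, families)
print(edges)
```

The dot product combines both pieces of information from the correlation circle: vector length (how strongly each variable is captured) and angle (whether the two point in the same or opposite directions), which gives each edge a strength and a sign.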
26
Protein Families and Environmental Features Network (PEN)
“Bi-modules”: groups of environmental features and membrane proteins families that are associated
UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network
COG0598, Magnesium Transporter
COG1176, Polyamine Transporter
27
Bi-module 1: Phosphate/Phosphate Transporters
Low phosphate: high-affinity phosphate transporters, which are induced during phosphate limitation
High phosphate: low-affinity inorganic phosphate ion transporters, which are constitutively expressed
28
Microbes modulate content in response to phosphate
Phosphate concentration is related to phosphate acquisition genes in Prochlorococcus (Martiny et al., Env Microbiology, 2009)
Microbes modulate phospholipid content in response to phosphate concentrations (Van Mooy et al., Nature, 2009)
29
Bi-module 2: Iron Transporters/Pollution/Shipping
Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron
Pollution and Shipping may be a proxy for iron concentrations
30
Bi-module 2: Iron Transporters/Pollution/Shipping
Ridgwell A. J. (2002) Phil. Trans. R. Soc. Lond.
Iron is usually limiting in oceans: High Nitrate-Nutrient/Low Chlorophyll regions
Delivery of iron to the ocean is usually by:
- terrestrial input
- fluvial (rivers) input
- upwelling from the ocean floor
- aeolian dust from land
31
Bi-module 2: Iron Transporters/Pollution/Shipping
Pollution and Dust N/C and Iron Transporters
- Negative correlation between COG4558 and COG0609 and dust/pollution values (p-value < 0.01)
- Searching the BRENDA database for enzymes using iron as a cofactor reveals that an increase in these two COGs is negatively correlated with the number of enzymes present that require iron.
32
Conclusions
New method (PEN) to visualize complex relationships in metagenomic data using explicit environmental variables
We show both known and intuitive relationships between features and genomic content
CCA also reveals the invariant fraction of environmental features and protein families (highlights important cellular processes):
Chloride Channel, Type II secretion Proteins (virulence)
Many variant ABC-type transporters (34/41): suggests streamlining for optimization and energy conservation
33
Much of Membrane Protein Space Remains Uncharacterized
• 15% of predicted membrane proteins had NO homology to GenBank (e-value < 1e-10)
• We used short motifs (PROSITE) to characterize a small fraction of these including ABC Transporters, GPCRs, Lipocalins, beta-lactamases
16% (29,384) were annotated
34
Intraribotype diversity and the definition of a ‘species’
Eugene V Koonin Nat Biotechnology, 2007
16S analysis of GOS data reveals that most sequences fall into 5 ribotypes
However, there were very few identical sequences, suggesting that no two cells have identical genome sequences
This suggests that ocean microbes are rather adaptive to their environments
We observe diversity in membrane protein content and abundance, and show that it is a reflection of different environmental conditions more than phylogenetic diversity (16S)
These are mostly oligotrophic (nutrient-poor) waters, and environmental conditions have likely been fairly constant over many years; genomes are “streamlining”
35
Conclusions
Integration of Environmental Features using GPS coordinates
Environmental clusters show differences in membrane protein content which reflect environmental conditions (pollution/efflux proteins)
Microbes from ocean surface samples show diversity in membrane protein content
Diversity in membrane proteins was shown to be a reflection of different environmental conditions more than phylogenetic diversity
Developed a new technique (PEN) and adapted existing techniques to connect features of the environment to specific protein families
Genotypic variation within similar natural populations occurs in response to environmental conditions
Integration of geospatial data can highlight unexpected trends as anthropogenic factors seem to be reflected in microbial function
36
Acknowledgements
Advisors: Donald Engelman and Mark Gerstein
Committee Members: Jim Bowie (UCLA), Annette Molinaro, Lynne Regan, Mike Snyder
Administrative Staff: Mary Backer, Ann Nicotra, Nessie Stewart
Collaborators
Gerstein Lab:
Tara Gianoulis
Kevin Yip
Rob Bjornson
Nicolas Carriero
Philip Kim
Jan Korbel
Sam Flores
Engelman Lab:
Damien Thevenin
Julia Rogers
Past and Present members of Engelman and Gerstein Labs
Yale University Biomedical High Performance Computing Facility
NIH grant RR19895, which funded the instrumentation
Yale Map Collection:
Stacey Maples