1/17

Identification of thermophilic species by the amino acid compositions deduced from their genomes

David P. Kreil and Christos A. Ouzounis

University of Cambridge and European Bioinformatics Institute, Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

Reporter: Yu Lun Kuo

E-mail: [email protected]

Date: October 26, 2006



2/17

Outline

• Introduction

• Materials and Methods

• Results

• Discussion and Conclusion

3/17

Introduction

• The properties of thermophilic proteins have been examined over the past two decades.

• Thermophilic proteins have been reported to favor particular amino acids, but general rules have not yet emerged.

• The analysis covers not only homologous proteins but also proteins unique to particular species.

4/17

Introduction

• Results are presented for the genomes of six archaea, 19 bacteria, and eukaryotic organisms.

• Using two different approaches, several factors that determine amino acid composition can be deduced.

• GC content of the coding sequences is the dominant influence on amino acid composition

– Even so, it is possible to identify thermophilic species

5/17

Materials and Methods

• Data sources and tools

• Exploratory data analysis

• Sensitivity analysis, sampling adequacy and significance

6/17

Data Sources and Tools

• Obtained from public databases

– EBI (European Bioinformatics Institute)

– NCBI (National Center for Biotechnology Information)

– SRS (Sequence Retrieval System): access to multiple molecular biology databases

– EPCLUST (Expression Profile data CLUSTering and analysis)

– Hierarchical clustering

– PCA (Principal Components Analysis)

7/17

Exploratory Data Analysis

• Global amino acid compositions were determined for all organisms

– Stored as a matrix whose rows represent the listed data sources

– The columns correspond to the respective percentage amino acid content
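The matrix described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the organism names and sequences are made-up placeholders, and the function name `composition` is hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes; this fixes the column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(proteins):
    """Percentage content of each amino acid over all proteins of one organism."""
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguity codes such as X or B (an assumption of this sketch).
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical toy input: organism name -> list of protein sequences.
proteomes = {
    "organism_A": ["MKVLAE", "MEEK"],
    "organism_B": ["MQQLLA"],
}

# One row per organism, one column per amino acid.
matrix = {name: composition(seqs) for name, seqs in proteomes.items()}
```

Each row sums to 100, so organisms with very different proteome sizes can be compared directly.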

8/17

Exploratory Data Analysis

• Interpretation of the principal factors was supported by two variables

– GC ratio (GC counts vs. AT counts)

– A binary variable (therm)

• The binary variable, therm

– 0 (zero): mesophilic

– 1 (one): thermophilic
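A small sketch of the two variables follows. The slide does not say whether "GC counts vs. AT counts" means the fraction GC/(GC+AT) or the quotient GC/AT, so both are shown; the function names, organism names, and `therm` labels are assumptions for illustration only.

```python
def gc_at_counts(coding_sequences):
    """Count G+C and A+T nucleotides over a list of coding sequences."""
    gc = at = 0
    for seq in coding_sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return gc, at

def gc_fraction(coding_sequences):
    """GC / (GC + AT): one possible reading of 'GC ratio'."""
    gc, at = gc_at_counts(coding_sequences)
    return gc / (gc + at)

def gc_to_at(coding_sequences):
    """GC / AT: the other possible reading of 'GC ratio'."""
    gc, at = gc_at_counts(coding_sequences)
    return gc / at

# Hypothetical labels: 0 = mesophilic, 1 = thermophilic.
therm = {"organism_A": 0, "organism_B": 1}
```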

9/17

Sensitivity Analysis, Sampling Adequacy and Significance

• Several clustering methods were tried

– Average linkage (UPGMA)

– Complete linkage (Maximum distance method)

– Single linkage (Minimum distance method)

– Weighted pair group method (WPGMA)

• PCA was repeated to verify that the weighting did not affect any conclusions

– All 20 amino acids were given equal weight
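The four linkage rules listed above differ only in how the distance from a merged cluster to every other cluster is updated. The sketch below is a plain-Python illustration of that idea, not the EPCLUST implementation; Euclidean distance and the toy points are assumptions, since the slides do not state the distance measure.

```python
import math
from itertools import combinations

def agglomerate(points, method, n_clusters):
    """Naive agglomerative clustering; returns sorted lists of point indices."""
    clusters = {i: [i] for i in range(len(points))}
    dist = {frozenset((i, j)): math.dist(points[i], points[j])
            for i, j in combinations(clusters, 2)}
    next_id = len(points)
    while len(clusters) > n_clusters:
        i, j = sorted(min(dist, key=dist.get))       # closest pair of clusters
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            dki, dkj = dist[frozenset((k, i))], dist[frozenset((k, j))]
            if method == "single":                   # minimum distance
                updated[k] = min(dki, dkj)
            elif method == "complete":               # maximum distance
                updated[k] = max(dki, dkj)
            elif method == "average":                # UPGMA: size-weighted mean
                updated[k] = (ni * dki + nj * dkj) / (ni + nj)
            elif method == "weighted":               # WPGMA: plain mean
                updated[k] = (dki + dkj) / 2
            else:
                raise ValueError(f"unknown method: {method}")
        merged = clusters.pop(i) + clusters.pop(j)
        # Drop distances that involve the two merged clusters.
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        for k, d in updated.items():
            dist[frozenset((k, next_id))] = d
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical toy data: two well-separated groups of three points each.
toy = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
partitions = {m: agglomerate(toy, m, 2)
              for m in ("average", "complete", "single", "weighted")}
```

When all four methods return the same partition, as they do for clearly separated groups, the grouping does not depend on the linkage choice, which is the point of this kind of sensitivity check.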

10/17

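Results

[Figure: amino acid composition by species, color-coded. Red: more than average; green: less than average. Marked groups: thermophilic; unusually high GC ratio (57-67%); thermophilic with high GC ratio. Scale ticks: 0.2, 0.6, 1.5.]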

11/17

Results (PCA of Amino Acid Composition)

• A clear separation of thermophiles and mesophiles along the second principal axis

[Figure: PCA plot. Point labels: 0 = mesophile, 1 = thermophile; the thermophilic group is marked. Colors: Archaea (red), Bacteria (green), Eukaryote (purple), outgroup (blue).]

12/17

Component Loadings

• High loading

– Absolute component loading > 0.6

• Component loadings can be interpreted as correlation coefficients

• Component 1

– Correlates with GC ratio

• Component 2

– Correlates with therm
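The reading of loadings as correlation coefficients can be checked numerically. The sketch below assumes PCA on standardized columns (that is, on the correlation matrix), which the slides do not state explicitly; the random data and the name `pca_loadings` are placeholders, not the study's data or code.

```python
import numpy as np

def pca_loadings(X):
    """PCA on the correlation matrix of X; returns (scores, loadings)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize columns
    R = np.corrcoef(Z, rowvar=False)                    # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)                # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                   # largest first
    eigvals = np.clip(eigvals[order], 0.0, None)        # guard tiny negatives
    eigvecs = eigvecs[:, order]
    scores = Z @ eigvecs                                # component scores
    loadings = eigvecs * np.sqrt(eigvals)               # variable x component
    return scores, loadings

# Hypothetical toy data: 12 "organisms", 4 "amino acid" columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
X[:, 1] += X[:, 0]                                      # make two columns correlate

scores, loadings = pca_loadings(X)

# Variables with a "high loading" on component 1, using the 0.6 threshold.
high_on_pc1 = np.flatnonzero(np.abs(loadings[:, 0]) > 0.6)
```

With this scaling, `loadings[j, k]` equals the Pearson correlation between variable `j` and the scores on component `k`, which is why a cutoff such as 0.6 reads like a correlation cutoff.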

13/17

Statistical Evidence and Specific Features of Thermophilic Species

• Starting from the distinct groups of thermophiles and mesophiles obtained by PCA

• Gln (Q) and Glu (E)

– Have very high component loadings

• Table 2 summarizes the results and most of the statistical evidence

[Table 2 callouts. Columns: raw correlations with the binary variable therm; PCA factor loading for component 2; average difference between thermophiles and mesophiles; whether thermophiles have more or less than mesophiles. Row groups: very high factor loadings; strong; low factor loadings. Legend: "Less" means content in thermophiles < in mesophiles; "More" means content in thermophiles > in mesophiles.]
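Two of the quantities named in the table, the raw correlation with therm and the average difference between the groups, are simple to compute. The sketch below uses invented percentages for a single amino acid purely for illustration; none of the numbers come from Table 2.

```python
import math

def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: therm labels and percentage content of one amino acid.
therm = [0, 0, 0, 1, 1, 1]
percent = [5.8, 6.1, 6.0, 8.2, 8.5, 7.9]

r = pearson(percent, therm)

# Average difference between the thermophilic and mesophilic groups.
thermo = [p for p, t in zip(percent, therm) if t == 1]
meso = [p for p, t in zip(percent, therm) if t == 0]
mean_diff = sum(thermo) / len(thermo) - sum(meso) / len(meso)

# "More" if thermophiles have a higher average content, otherwise "Less".
direction = "More" if mean_diff > 0 else "Less"
```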

14/17

Discussion and Conclusion

• The results discern several underlying factors that influence amino acid composition

– Completely sequenced genomes of 27 species

– Employing different methods of data analysis

• The two most prominent observations

– Dominant effect of GC pressure

– Clear identification of thermophilic species

15/17

Discussion and Conclusion

• PCA found GC ratio to be the most important factor

• Environmental adaptations would also be expected to play a role

– A. pernix is found at some distance from the other thermophiles

16/17

Discussion and Conclusion

• This holds not only for individual proteins or groups of proteins but also for entire genomes

– GC content has a stronger influence on amino acid composition than adaptation to extreme environments (e.g., thermophily)

– It would be interesting to extend the analysis to species from different phyla

17/17

Thanks