1
Characterizing Protein Families of Unknown Function Morgan G.I. Langille and Jonathan A. Eisen Genome Center, University of California, Davis, CA Abstract Perhaps one of the most frustrating aspects of genomic and metagenomic analysis is that functional predictions for many genes cannot be identified. These "hypothetical" or "unknown" genes represent a significant fraction of the genes in most genomes or metagenomes, and this fraction will likely increase as sequencing technology continues to outpace functionally informative lab experiments. To start tackling this situation we characterized and ranked protein families with unknown function from completed genome sequences. Ranking of these families were done using several metrics such as quantity of members, presence across tree of life, presence in mostly pathogens or other habitats, etc. This ranking allows particular families of unknown function to be targeted for more in-depth analysis due to their ubiquitous nature or their role in a particular niche. In addition to ranking these families, we analyze their abundance profiles across several metagenomic studies and cluster them with families of known function in the hopes of making novel functional predictions. 1. Identify families of genes with unknown function 1.2 Identify proteins with unknown function 1.1 Data Source All genomes from IMG were downloaded Dec. 2009. 1895 genomes (Eukaryota, Bacteria, and Archaea) 7.3 M Proteins 1.3 Protein families with unknown function Protein families were obtained from IMG (“IMG Ortholog Clusters”) and those families with >70% of their members having unknown function (see 1.2) were labelled as “families of unknown function”. 144,492 familes with unknown function(size >1) *Pfam hits determined using HMMER 3 and Pfam-A version 24. Non-informative PFams such as DUFs (Domains of Unknown Function) and UPFs (Unidentified Protein Families) were not counted. **No Informative Product Annotation identified by searching protein product description (e.g. "hypothetical protein", "predicted protein", etc. ) By taking the intersection of these two methods for identifying genes with no known function we are left with 1.8 million proteins. Thus, 25% of all proteins from completed genomes are of unknown function. No PFam Hits* (2.2 M) No Informative Product Annotation** (2.7 M) “Unknown” Genes (1.8 M) All Proteins (7.3 M, 1900 genomes) 2. Characterize and rank families of unknown function Rank by size (# of proteins in the family) 10 unknown families > 1000 sequences 40 unknown families > 500 sequences Rank by universality (% presence across all species) 26 unknown families > 10% in Bacateria and Eukaryotes 6 unknown families > 10% in all 3 domains Rank by family presence in pathogenic species 75 unknown families (size > 100) with > 80% proteins existing only in species listed as pathogens Rank by family presence in particular habitats (e.g. Aquatic) 12 unknown families (size > 50) with > 80% proteins existing only in species with aquatic habitat 3. Produce novel predictions using metagenomics data 3.1 Hypothesis for “Community Profiling” By looking at the abundance profiles for all protein families across many diverse metagenomic samples we hypothesize that protein families with similar profiles will have similar function. If so, novel predictions can be made for families with unknown function that have similar profiles to known functional protein families. 3.2 Current Approach for “Community Profiling” All PFAMs (11,000) CAMERA’s “All metagenomic proteins” Sanger Seq. Mostly GOS. 43 M proteins, 102 Samples HMMER 3 Count PFAM hits per sample Identify similar protein family profiles (using R & Cytoscape) Normalize Samples* Calculate distances between protein family profiles ** *Samples are normalized by dividing by the sum of each column. Subtraction of taxonomic signal is planned in the near future. ** Distance between protein families have been calculated using Pearson’s Correlation and Sørensen similarity index Additional metagenomic samples are being added for improved resolution between protein family profiles. 3.4 Future Work Various normalization methods, distance calculations, and clustering methods are being investigated in order to maximize the clustering of protein families with similar function. Additional metagenomic samples are being screened against PFams to provide better resolution for those with similar profiles. Extension to protein families other than PFams is planned in the near future. 3.3 Preliminary Results Metagenomic Samples PFams with metagenomic profiles (see 3.1) having correlations > 0.9 were visualized using Cytoscape Conclusions Objectives 1. Identify families of genes with unknown function 2. Characterize and rank families with unknown function 3. Produce novel predictions using metagenomics data In this study we created a list of protein families that have no known function. These protein families were then ranked on various criteria that would allow researchers to identify particular genes with unknown function that are a high level of interest due to their presence across the tree of life, their possible role in pathogenisis, or their contribution to species in particular environments. Secondly, we have started development on a completely novel method that would predict gene function that does not use sequence similarity and would in theory improve as the number of metagenomic datasets become available over time. This method could help gain some insight into the function of the vast number of proteins that we currently can not annotate. PFam hits to proteins from 102 metagenomic samples (see 3.1) shown as a heat map with hierarchal clustering

Characterizing Protein Families of Unknown Function

Embed Size (px)

Citation preview

Characterizing Protein Families of Unknown FunctionMorgan G.I. Langille and Jonathan A. Eisen

Genome Center, University of California, Davis, CA

AbstractPerhaps one of the most frustrating aspects of genomic and metagenomic analysis is that functional predictions for many genes cannot be identified. These "hypothetical" or "unknown" genes represent a significant fraction of the genes in most genomes or metagenomes, and this fraction will likely increase as sequencing technology continues to outpace functionally informative lab experiments. To start tackling this situation we characterized and ranked protein families with unknown function from completed genome sequences. Ranking of these families were done using several metrics such as quantity of members, presence across tree of life, presence in mostly pathogens or other habitats, etc. This ranking allows particular families of unknown function to be targeted for more in-depth analysis due to their ubiquitous nature or their role in a particular niche. In addition to ranking these families, we analyze their abundance profiles across several metagenomic studies and cluster them with families of known function in the hopes of making novel functional predictions.

1. Identify families of genes with unknown function

1.2 Identify proteins with unknown function

1.1 Data Source

• All genomes from IMG were downloaded Dec. 2009.

• 1895 genomes (Eukaryota, Bacteria, and Archaea)

• 7.3 M Proteins

1.3 Protein families with unknown function

• Protein families were obtained from IMG (“IMG Ortholog Clusters”) and those families with >70% of their members having unknown function (see 1.2) were labelled as “families of unknown function”.

• 144,492 familes with unknown function(size >1)

• *Pfam hits determined using HMMER 3 and Pfam-A version 24. Non-informative PFams such as DUFs (Domains of Unknown Function) and UPFs (Unidentified Protein Families) were not counted.

• **No Informative Product Annotation identified by searching protein product description (e.g. "hypothetical protein", "predicted protein", etc. )

• By taking the intersection of these two methods for identifying genes with no known function we are left with 1.8 million proteins. Thus, 25% of all proteins from completed genomes are of unknown function.

No PFamHits*

(2.2 M)

No Informative Product Annotation**

(2.7 M)

“Unknown” Genes (1.8 M)

All Proteins (7.3 M, 1900 genomes)

2. Characterize and rank families of unknown function

Rank by size (# of proteins in the family)

• 10 unknown families > 1000 sequences

• 40 unknown families > 500 sequences

Rank by universality (% presence across all species)

• 26 unknown families > 10% in Bacateria and Eukaryotes

• 6 unknown families > 10% in all 3 domains

Rank by family presence in pathogenic species

• 75 unknown families (size > 100) with > 80% proteins existing only in species listed as pathogens

Rank by family presence in particular habitats (e.g. Aquatic)

• 12 unknown families (size > 50) with > 80% proteins existing only in species with aquatic habitat

3. Produce novel predictions using metagenomics data

3.1 Hypothesis for “Community Profiling”

• By looking at the abundance profiles for all protein families across many diverse metagenomic samples we hypothesize that protein families with similar profiles will have similar function. If so, novel predictions can be made for families with unknown function that have similar profiles to known functional protein families.

3.2 Current Approach for “Community Profiling”

All PFAMs(11,000)

CAMERA’s “All metagenomic proteins”

Sanger Seq. Mostly GOS.43 M proteins, 102 Samples

HMMER 3

Count PFAM hits per sample

Identify similar protein family profiles (using R & Cytoscape)

Normalize Samples*Calculate distances between

protein family profiles **

• *Samples are normalized by dividing by the sum of each column. Subtraction of taxonomic signal is planned in the near future.

• ** Distance between protein families have been calculated using Pearson’s Correlation and Sørensen similarity index

• Additional metagenomic samples are being added for improved resolution between protein family profiles.

3.4 Future Work

• Various normalization methods, distance calculations, and clustering methods are being investigated in order to maximize the clustering of protein families with similar function.

• Additional metagenomic samples are being screened against PFams to provide better resolution for those with similar profiles.

• Extension to protein families other than PFams is planned in the near future.

3.3 Preliminary Results

Metagenomic Samples

• PFams with metagenomic profiles (see 3.1) having correlations > 0.9 were visualized using Cytoscape

Conclusions

Objectives

1. Identify families of genes with unknown function

2. Characterize and rank families with unknown function

3. Produce novel predictions using metagenomics data

• In this study we created a list of protein families that have no known function. These protein families were then ranked on various criteria that would allow researchers to identify particular genes with unknown function that are a high level of interest due to their presence across the tree of life, their possible role in pathogenisis, or their contribution to species in particular environments.

• Secondly, we have started development on a completely novel method that would predict gene function that does not use sequence similarity and would in theory improve as the number of metagenomic datasets become available over time. This method could help gain some insight into the function of the vast number of proteins that we currently can not annotate.

• PFam hits to proteins from 102 metagenomic samples (see 3.1) shown as a heat map with hierarchal clustering