Page 1: A New Interface to GeneKeyDB

A New Interface to GeneKeyDB

Methods for analyzing relationships among proteins based on shared motifs

Chris Symons & Xinxia Peng

Page 2: A New Interface to GeneKeyDB

Protein domains

• are distinct units of protein three-dimensional structure, which also carry function.

• Proteins can be composed of single or multiple domains.

• A few thousand conserved domain models are sufficient to cover more than two thirds of known protein sequences.

Marchler-Bauer A, et al. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Research 31:383-387 (2003).

Page 3: A New Interface to GeneKeyDB

The growth of the number of proteins known vs. the growth in the number of unique domains

Geer,L.Y., Domrachev,M., Lipman,D.J. and Bryant,S.H. (2002) CDART: Protein Homology by Domain Architecture. Genome Res., 12, 1619–1623.

Page 4: A New Interface to GeneKeyDB

Conserved Domain Database (CDD):

http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

• a curated Entrez database of conserved domain alignments at NCBI

• currently contains domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI, such as COG.

Page 5: A New Interface to GeneKeyDB

Data generation using GeneKeyDB

-- Create a master table of associations between
-- LocusLink ID and cdd_key.

CREATE TABLE peng_cddlist AS (
  SELECT a.ll_id, b.ll_refseq_nm_id, c.cdd_key, c.cdd_evalue, a.organism
  FROM ll_locus a, ll_refseq_nm b, ll_np_cdd c
  WHERE a.ll_id = b.ll_id
    AND b.ll_refseq_nm_id = c.ll_refseq_nm_id
);
commit;
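The join above can be tried on a small scale with the following sketch. It assumes an in-memory SQLite database in place of GeneKeyDB; the table and column names come from the slide, but every row of data is invented for illustration, and the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins for the three source tables. All rows are hypothetical,
# and the column types are assumptions (the slide does not give them).
cur.executescript("""
CREATE TABLE ll_locus (ll_id INTEGER, organism TEXT);
CREATE TABLE ll_refseq_nm (ll_id INTEGER, ll_refseq_nm_id INTEGER);
CREATE TABLE ll_np_cdd (ll_refseq_nm_id INTEGER, cdd_key INTEGER, cdd_evalue REAL);

INSERT INTO ll_locus VALUES (101, 'human'), (102, 'mouse');
INSERT INTO ll_refseq_nm VALUES (101, 1), (102, 2);
INSERT INTO ll_np_cdd VALUES (1, 438, 1e-20), (1, 1825, 3e-8), (2, 438, 5e-12);
""")

# Same join as on the slide: locus -> RefSeq record -> conserved domain hit.
cur.execute("""
CREATE TABLE peng_cddlist AS
SELECT a.ll_id, b.ll_refseq_nm_id, c.cdd_key, c.cdd_evalue, a.organism
FROM ll_locus a, ll_refseq_nm b, ll_np_cdd c
WHERE a.ll_id = b.ll_id
  AND b.ll_refseq_nm_id = c.ll_refseq_nm_id
""")
conn.commit()

# One row per (locus, domain) pair.
rows = cur.execute(
    "SELECT ll_id, cdd_key, organism FROM peng_cddlist ORDER BY ll_id, cdd_key"
).fetchall()
for row in rows:
    print(row)
```

Each locus appears once per conserved domain hit, which is the shape the later domain-list queries rely on.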

Page 6: A New Interface to GeneKeyDB

Summary of Data

Organism  Locus  CD
Mouse     6102   1999
Human     8732   2786

Page 7: A New Interface to GeneKeyDB

Looking at groups of domains

We take a list of CDD domains and return the proteins that are found exclusively in the intersection of those domains. If a second (third, etc.) list of domains is added, we return the proteins found exclusively in the intersection of that list, then combine it with the previous lists and do the same for the combined list.
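A minimal sketch of this procedure follows. It assumes that "exclusively" means a protein carries exactly the listed domains and no others, which is one reading of the slide rather than a confirmed definition. The function names, the locus IDs, and the domain assignments are hypothetical.

```python
def exclusive_matches(domains_by_locus, wanted):
    """Return loci whose domain set equals `wanted` exactly (assumed meaning of 'exclusive')."""
    wanted = set(wanted)
    return sorted(
        locus for locus, domains in domains_by_locus.items()
        if set(domains) == wanted
    )


def query_lists(domains_by_locus, lists):
    """Query each list on its own, then the union of all lists seen so far."""
    results = []
    combined = set()
    for position, domain_list in enumerate(lists):
        current = set(domain_list)
        results.append((tuple(sorted(current)),
                        exclusive_matches(domains_by_locus, current)))
        combined |= current
        if position > 0:
            # From the second list on, also report the combined list.
            results.append((tuple(sorted(combined)),
                            exclusive_matches(domains_by_locus, combined)))
    return results


# Hypothetical data: locus ID -> set of CDD keys found in that protein.
toy = {
    1001: {1, 438, 1825},
    1002: {1, 438, 1825},
    1003: {1, 438, 1825, 77},
    1004: {1825, 5},
}

results = query_lists(toy, [[1, 438], [1825]])
for keys, loci in results:
    print(keys, loci)
```

Under this reading, locus 1003 is never returned because it carries an extra domain (77) that is in none of the lists.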

Page 8: A New Interface to GeneKeyDB

Looking at groups of domains

[Diagram: sets A and B, and the combination A + B]

Page 9: A New Interface to GeneKeyDB

Options

This can be done using either human or mouse data. We can also turn the exclusivity off, so that all proteins in the intersection of the list of CDD keys are returned.
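The two options could be expressed as parameters of a single lookup, as in the sketch below. It assumes rows shaped like (ll_id, cdd_key, organism), that organisms are stored as the strings 'human' and 'mouse', and that exclusive mode means an exact match of the domain set; none of this is confirmed by the slides. The function name and all data are hypothetical.

```python
from collections import defaultdict


def find_loci(rows, wanted, organism, exclusive=True):
    """Return loci of one organism that match the wanted CDD keys.

    exclusive=True  -> the locus has exactly the wanted domains (assumed meaning).
    exclusive=False -> the locus has at least the wanted domains.
    """
    wanted = set(wanted)
    domains = defaultdict(set)
    for ll_id, cdd_key, org in rows:
        if org == organism:
            domains[ll_id].add(cdd_key)
    if exclusive:
        return sorted(l for l, d in domains.items() if d == wanted)
    return sorted(l for l, d in domains.items() if wanted <= d)


# Hypothetical rows in the shape (ll_id, cdd_key, organism).
rows = [
    (1001, 1, "human"), (1001, 438, "human"),
    (1002, 1, "human"), (1002, 438, "human"), (1002, 1825, "human"),
    (2001, 1, "mouse"), (2001, 438, "mouse"),
]

print(find_loci(rows, [1, 438], "human"))
print(find_loci(rows, [1, 438], "human", exclusive=False))
print(find_loci(rows, [1, 438], "mouse"))
```

With exclusivity off, locus 1002 is also returned because it contains both requested domains in addition to a third one.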

Page 10: A New Interface to GeneKeyDB

Sample Input and Output

Input the first list of domains.
The domains should be separated by spaces and should all be on one line.
1 438
(1 438):
Input another list of domains separated by spaces (or hit q to quit):
1825
(1825):
(1 438 1825): 28992 83666
Input another list of domains separated by spaces (or hit q to quit):

Page 11: A New Interface to GeneKeyDB

Why useful? A thought

2003

Page 12: A New Interface to GeneKeyDB

?: log[P(k)] ~ - k

k: the number of CDs per protein
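One way to examine whether log[P(k)] falls off linearly in k is to tabulate P(k) from per-protein domain counts and fit a straight line to log P(k) against k. The sketch below does this with an ordinary least-squares fit. The natural logarithm, the function names, and the sample counts are assumptions made for illustration and are not taken from the presentation's data.

```python
import math
from collections import Counter


def domain_count_distribution(domains_per_protein):
    """Return {k: P(k)}, the fraction of proteins that have k conserved domains."""
    counts = Counter(domains_per_protein)
    total = len(domains_per_protein)
    return {k: n / total for k, n in sorted(counts.items())}


def fit_log_linear(pk):
    """Least-squares fit of ln P(k) = intercept + slope * k.

    Needs at least two distinct values of k.
    """
    ks = list(pk)
    ys = [math.log(p) for p in pk.values()]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return slope, intercept


# Hypothetical sample: the number of proteins halves with each extra domain.
sample = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 1

pk = domain_count_distribution(sample)
slope, intercept = fit_log_linear(pk)
print(pk)
print(slope, intercept)
```

A clearly negative slope with small residuals would be consistent with an exponential decay of P(k); a curved plot of ln P(k) against k would suggest a different form.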

Page 13: A New Interface to GeneKeyDB

Redundancy in CDD?

Page 14: A New Interface to GeneKeyDB

Future work:

1. Remove CDD redundancy

2. Distribution of the minimal set of proteins across different biological processes/subcellular locations (GO terms)

3. Application to other types of graphs with the same or different datasets, such as genes + TBS