ProRepeat: a comprehensive directory of exact tandem repeats in proteins

prorepeat.bioinformatics.nl

www.bioinformatics.nl

PolyQ and neurodegenerative diseases

Nine diseases are caused by polyQ repeats:
- HD
- DRPLA
- SCA 1, 2, 3, 6, 7, 17
- Kennedy’s disease (SBMA)

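[Figure: schematic of a transcription factor with termini labelled NH3 and COOH and domains labelled TRANSCRIPTIONAL REGULATION, DNA BINDING and HORMONE BINDING; further labels T1, T2, T3 and Region 1, Region 2, Region 3]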

Androgen receptor (AR)

PolyQ tract length has important consequences:
■ shorter tracts: prostate cancer susceptibility
■ longer tracts: feminization syndromes
■ over 40 residues: SBMA (spinal and bulbar muscular atrophy), or Kennedy’s disease

9-35 residues, average of 20-25 depending on ethnic origin

PolyQ in AR

Collection of polyQ repeats:
- 792 human individuals, available from an earlier study (Edwards, 1992)
- 26 armadillo individuals, sequenced by CP
- 77 mammals and marsupials, from the protein database

Céline Poux, RU

What about repeats in other proteins?

ProRepeat database
- Data sources: UniProt and RefSeq
- Limited to exact tandem repeats
- Standard, linear-time suffix tree algorithm
- Stored in Oracle 10g
- Interface in PHP5

Minimum number of repetitions per unit length:

unit length   repetitions
1             ≥ 5
2             ≥ 4
3             ≥ 3
4 .. N        ≥ 2

Maarten van den Bosch, WUR
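
The repetition thresholds can be illustrated with a small scan over a protein sequence. This is only a sketch: it uses a naive brute-force search, not the linear-time suffix tree algorithm named on the slide, and the names `Repeat`, `min_repetitions`, `is_primitive`, `find_exact_tandem_repeats` and the `max_unit` cut-off are hypothetical choices made for the example.

```python
from typing import Iterator, NamedTuple


class Repeat(NamedTuple):
    unit: str
    repetitions: int
    start: int  # 0-based start position in the sequence (an assumption)


def min_repetitions(unit_length: int) -> int:
    """Thresholds from the slide: 1 -> 5, 2 -> 4, 3 -> 3, 4..N -> 2."""
    return {1: 5, 2: 4, 3: 3}.get(unit_length, 2)


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit."""
    return (unit + unit).find(unit, 1) == len(unit)


def find_exact_tandem_repeats(seq: str, max_unit: int = 10) -> Iterator[Repeat]:
    """Naive scan; max_unit is an arbitrary cut-off for this sketch."""
    for unit_len in range(1, max_unit + 1):
        pos = 0
        while pos + unit_len <= len(seq):
            unit = seq[pos:pos + unit_len]
            # Count how many consecutive copies of the unit start at pos.
            reps = 1
            while seq.startswith(unit, pos + reps * unit_len):
                reps += 1
            if reps >= min_repetitions(unit_len) and is_primitive(unit):
                yield Repeat(unit, reps, pos)
                pos += reps * unit_len  # skip past the reported repeat
            else:
                pos += 1
```

For the toy sequence `MAQQQQQQKDEDEDEDEL` the scan reports a `Q` repeat (6 repetitions) and a `DE` repeat (4 repetitions); `DEDE` x 2 is skipped because its unit is not primitive.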

Simple query syntax:

e.g. “Q” or “DE”

DE is equivalent to ED; DEF is equivalent to EFD and FDE
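
One way to treat rotated units as the same query is to reduce each unit to a canonical rotation before comparing. The sketch below picks the lexicographically smallest rotation; this is an illustrative choice, not necessarily how ProRepeat implements the equivalence, and the function names are hypothetical.

```python
def canonical_unit(unit: str) -> str:
    """Return the lexicographically smallest rotation of a repeat unit."""
    if not unit:
        return unit
    rotations = [unit[i:] + unit[:i] for i in range(len(unit))]
    return min(rotations)


def same_repeat_unit(a: str, b: str) -> bool:
    """True if b is a rotation of a, e.g. DEF, EFD and FDE."""
    return len(a) == len(b) and canonical_unit(a) == canonical_unit(b)
```

With this definition `same_repeat_unit("DEF", "EFD")` holds, while `DFE` is a different unit because it is a permutation of DEF but not a rotation.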

Or use ProSite syntax:

e.g. “[DE]-{P}-X(0,1).”
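
To make the example pattern concrete, the sketch below translates a small subset of PROSITE pattern syntax into a Python regular expression. It covers only the elements used in the example (residue sets in `[]`, exclusions in `{}`, the wildcard `x`, and repetition counts in parentheses); the function name `prosite_to_regex` is hypothetical, and how ProRepeat itself evaluates such patterns is not stated in the slides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional count,
        # e.g. "X(0,1)" -> core "X", count "0,1".
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        if core.lower() == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)
```

For instance, `prosite_to_regex("[DE]-{P}-X(0,1).")` gives `[DE][^P].{0,1}`: D or E, then any residue except P, then at most one further residue.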

Taxonomic distributions of hits

Sorting/grouping options

Identifier, Repeat unit, Repetitions, Unit length, Length, Start location, End location, Protein, Taxonomy, Ontology
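
As a rough illustration of sorting and grouping on these fields, the sketch below models a hit with a few of them. The class `Hit`, the identifiers, the 1-based coordinates, and the assumption that Length equals unit length times repetitions are all hypothetical; the slides do not describe the database schema.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass(frozen=True)
class Hit:
    identifier: str
    repeat_unit: str
    repetitions: int
    start: int  # 1-based start location (an assumption)

    @property
    def unit_length(self) -> int:
        return len(self.repeat_unit)

    @property
    def length(self) -> int:
        # Assumed: total length = unit length x repetitions.
        return self.unit_length * self.repetitions

    @property
    def end(self) -> int:
        return self.start + self.length - 1


def group_by_unit(hits):
    """Group hit identifiers by repeat unit (groupby needs sorted input)."""
    ordered = sorted(hits, key=attrgetter("repeat_unit"))
    return {unit: [h.identifier for h in group]
            for unit, group in groupby(ordered, key=attrgetter("repeat_unit"))}
```

Sorting on any other field follows the same pattern, for example `sorted(hits, key=attrgetter("length"), reverse=True)` for longest repeats first.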

Link to DNA data

- DNA coding sequences of available repeats are also stored in the database
- Extracted from EMBL and/or RefSeq

Hong Luo, WUR

Link to DNA data / errors

Approximately 3% of the corresponding nucleotide sequences cannot be retrieved.

Errors are caused by:
- No links to the nucleotide database (35%)
  • NO_ANNOTATED_CDS
  • No EMBL links
- Annotation errors in the nucleotide database (65%)

Number of different units per unit size per proteome

[Figure: number of different units (y-axis, 0-900) against unit length (x-axis), one series per proteome: H. sapiens, A. thaliana, C. elegans, S. cerevisiae, P. troglodytes, G. gallus, R. norvegicus, M. musculus, E. coli]

Guido Kappé, RU

Single amino acid (SAA) repeat length distribution in Homo sapiens

[Figure: percentage (y-axis, 0-100%) per amino acid (legend A to Z) against total SAA repeat length in aa (x-axis, 5 to 20 and >20); amino acids labelled in the plot: S, Q, P, G, E, A, T]
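
A distribution of this kind could be tabulated as follows, assuming the chart shows, for each length bin, the share of SAA repeats contributed by each amino acid. The input format (pairs of amino acid and total repeat length) and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def length_bin(length: int) -> str:
    """Bins as on the x-axis: 5, 6, ..., 20, then '>20'."""
    return str(length) if length <= 20 else ">20"


def saa_length_distribution(repeats):
    """repeats: iterable of (amino_acid, total_length) pairs.

    Returns {length_bin: {amino_acid: percentage within that bin}}.
    """
    counts = defaultdict(Counter)
    for amino_acid, length in repeats:
        counts[length_bin(length)][amino_acid] += 1
    return {
        label: {aa: 100.0 * n / sum(per_aa.values())
                for aa, n in per_aa.items()}
        for label, per_aa in counts.items()
    }
```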

Amino acid distribution Homo sapiens

[Figure: percentage (y-axis, 0-30%) per amino acid (x-axis, A to Z) for three series: "All prot. - Rep.", "Rep. - SAA" and "SAA"]
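
The percentages in such a chart are residue compositions. A minimal sketch of the calculation, applicable to any of the three sequence sets (the function name `composition` is hypothetical):

```python
from collections import Counter


def composition(sequences):
    """Percentage of each residue over an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}
```

Comparing `composition` of the repeat regions with that of the remaining protein sequence shows which residues are over-represented in repeats.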

Amino acid distribution Arabidopsis thaliana

[Figure: percentage (y-axis, 0-30%) per amino acid (x-axis, A to Z) for three series: "All prot. - Rep.", "Rep. - SAA" and "SAA"]

Current work

- Annotation of repeats versus function
- Adding imperfect tandem repeats, a.k.a. approximate tandem repeats (ATR), to the database
- Offering remote access via web services (WSDL and BioMoby)
- Expansion of the analysis capabilities of the interface

PolyQ in AR (reprise)

- Impure tracts are longer and more variable than pure CAG tracts (impurities are mainly CAA, CCG, and CGG)
- The presence of other codons is better explained by codon duplication than by multiple point mutations: interrupting codons are part of the elongation process, rather than hampering tract dynamics as proposed previously
- Negative correlation between the lengths of the different CAG tracts, suggesting a maximal expansion length that the protein can handle without it becoming deleterious

Céline Poux, RU
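
Whether a tract is pure or impure can be checked by splitting its coding sequence into codons. The sketch below assumes the tract is supplied in frame as a DNA string; the function names are hypothetical.

```python
from collections import Counter


def codon_counts(cds_tract: str) -> Counter:
    """Count the codons of an in-frame coding sequence."""
    tract = cds_tract.upper()
    if len(tract) % 3:
        raise ValueError("tract length must be a multiple of 3")
    return Counter(tract[i:i + 3] for i in range(0, len(tract), 3))


def is_pure_cag(cds_tract: str) -> bool:
    """True if every codon in the tract is CAG."""
    return set(codon_counts(cds_tract)) == {"CAG"}
```

For example, `CAGCAGCAG` is a pure tract, while `CAGCAACAG` is impure because of the interrupting CAA codon.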

Acknowledgements

Wageningen University and Research Centre: Maarten van den Bosch, Hong Luo, Mark Kramer, Harm Nijveen

Radboud University, Nijmegen: Guido Kappé, Céline Poux, Wilfried W. de Jong

This work was supported in part by project grants from NWO/BMI (GK, CP) and the NBIC/BioAssist program (HN).

prorepeat.bioinformatics.nl

Thank you for your attention!

See also our posters on phylogenetic domain visualisation (TreeDomViewer) and microarray (re)annotation at the ISMB.

Post-doc positions available: contact Jack.Leunissen@wur.nl or jack@bioinformatics.nl
