prorepeat.bioinformatics.nl
ProRepeat: a comprehensive directory of exact tandem repeats in proteins
www.bioinformatics.nl
PolyQ and neurodegenerative diseases
9 diseases caused by polyQ repeats:
• HD
• DRPLA
• SCA 1, 2, 3, 6, 7, 17
• Kennedy’s disease (SBMA)
Androgen receptor (AR)
[Figure: domain diagram of the AR transcription factor, drawn from NH3- to -COOH, with transcriptional regulation, DNA binding, and hormone binding domains; labels T1, T2, T3 and Region 1, Region 2, Region 3]
polyQ tract length has important consequences
■ shorter tracts: prostate cancer susceptibility
■ longer tracts: feminization syndromes
■ over 40 residues: SBMA (spinal and bulbar muscular atrophy), or Kennedy’s disease
9-35 residues, average of 20-25 depending on ethnic origin
PolyQ in AR
Collection of polyQ repeats
• 792 human individuals, available from an earlier study (Edwards, 1992)
• 26 armadillo individuals, sequenced by CP
• 77 mammals and marsupials, from protein database
Céline Poux, RU
What about repeats in other proteins?
ProRepeat database
• Data sources: UniProt and RefSeq
• Limited to exact tandem repeats
• Standard, linear-time suffix tree algorithm
• Stored in Oracle 10g
• Interface in PHP5
Minimum number of repetitions per unit length:
unit length    repetitions
1              ≥ 5
2              ≥ 4
3              ≥ 3
4 .. N         ≥ 2
Maarten van den Bosch, WUR
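The unit-length thresholds above can be illustrated with a small Python sketch. It is a naive scan, not the linear-time suffix tree algorithm the database uses, and the function names and the `max_unit` limit are assumptions made for this example only.

```python
def min_repetitions(unit_length: int) -> int:
    """Thresholds from the slide: unit length 1 -> 5, 2 -> 4, 3 -> 3, 4..N -> 2."""
    return {1: 5, 2: 4, 3: 3}.get(unit_length, 2)


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit (e.g. "QQ")."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_exact_tandem_repeats(seq: str, max_unit: int = 10):
    """Naive scan; returns (start, unit, repetitions) with 0-based start."""
    hits = []
    n = len(seq)
    for unit_len in range(1, min(max_unit, n // 2) + 1):
        i = 0
        while i + unit_len <= n:
            unit = seq[i:i + unit_len]
            if not is_primitive(unit):
                i += 1  # reported under the shorter unit instead
                continue
            reps = 1
            while seq[i + reps * unit_len:i + (reps + 1) * unit_len] == unit:
                reps += 1
            if reps >= min_repetitions(unit_len):
                hits.append((i, unit, reps))
                # Resume just inside the last copy, so that a different unit
                # overlapping the tail of this repeat is not skipped.
                i += (reps - 1) * unit_len + 1
            else:
                i += 1
    return hits


print(find_exact_tandem_repeats("MAQQQQQQKDEDEDEDEL"))
```

For the example sequence this reports a run of six Q starting at position 2 and four copies of DE starting at position 9.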
Simple query syntax:
e.g. “Q” or “DE”
DE is equivalent to ED; DEF is equivalent to EFD and FDE
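The equivalence of rotated units can be sketched as follows. This is only an illustration of the idea; how ProRepeat normalises units internally is not stated on the slide, and the function names here are hypothetical.

```python
def canonical_unit(unit: str) -> str:
    """Return the lexicographically smallest rotation of a repeat unit."""
    if not unit:
        return unit
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def same_repeat_unit(a: str, b: str) -> bool:
    """Two units are equivalent if one is a rotation of the other."""
    return len(a) == len(b) and canonical_unit(a) == canonical_unit(b)


print(same_repeat_unit("DE", "ED"))    # rotation of the same unit
print(same_repeat_unit("DEF", "FDE"))  # rotation of the same unit
print(same_repeat_unit("DEF", "DFE"))  # a permutation, but not a rotation
```

Storing or comparing units by a canonical rotation means a query for "ED" also finds repeats recorded as "DE", while "DFE" stays distinct from "DEF".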
Or use ProSite syntax:
e.g. “[DE]-{P}-X(0,1).”
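To make the pattern syntax concrete, here is a minimal Python sketch that turns a ProSite-style pattern into a regular expression. It covers only the common elements (residue sets in square brackets, exclusions in braces, x or X for any residue, repetition counts, and the < and > anchors), and it is an assumption about how such a pattern can be interpreted, not a description of the ProRepeat implementation.

```python
import re

# One pattern element: optional "<", a residue / [set] / {exclusion},
# an optional (n) or (n,m) count, and an optional ">".
_ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Convert e.g. "[DE]-{P}-X(0,1)." to the regex "[DE][^P].{0,1}"."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        elif core in ("x", "X"):
            core = "."                       # any residue
        if low is not None:
            core += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[DE]-{P}-X(0,1).")
print(regex)
print(bool(re.fullmatch(regex, "DAK")), bool(re.fullmatch(regex, "DP")))
```

The example pattern reads as: D or E, followed by any residue except P, followed by zero or one residue of any kind.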
Taxonomic distributions of hits
Sorting/grouping options
Identifier, Repeat unit, Repetitions, Unit length, Length, Start location, End location, Protein, Taxonomy, Ontology
Link to DNA data
• DNA coding sequences of available repeats also stored in the database
• Extracted from EMBL and/or RefSeq
Hong Luo, WUR
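A rough Python sketch of how a protein repeat can be mapped back to its coding sequence, and checked against it, is given below. It assumes the standard genetic code, a CDS that starts at the first codon of the protein, and 0-based residue positions; the function names are hypothetical, and the actual ProRepeat extraction from EMBL and RefSeq is not described on the slide.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def translate(nt: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def repeat_coding_sequence(cds: str, protein: str, start: int, length: int) -> str:
    """Return the nucleotides encoding protein[start:start+length]."""
    nt = cds[3 * start:3 * (start + length)].upper()
    if translate(nt) != protein[start:start + length]:
        raise ValueError("CDS does not encode the repeat (possible annotation error)")
    return nt


cds = "ATG" + "CAG" * 3 + "CAA" + "CAG" + "TAA"
print(repeat_coding_sequence(cds, "MQQQQQ", 1, 5))
```

In the example a run of five Q is encoded by four CAG codons and one CAA codon, so a tract that is pure at the protein level can be impure at the DNA level.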
Link to DNA data / errors
Approximately 3% of corresponding nucleotide sequences cannot be retrieved
Errors caused by:
No links to nucleotide database (35%)
• NO_ANNOTATED_CDS
• No EMBL links
Annotation errors in the nucleotide database (65%)
Number of different units per unit size per proteome
[Figure: number of different units (y-axis, 0 to 900) against unit length (x-axis), one series per proteome: H. sapiens, A. thaliana, C. elegans, S. cerevisiae, P. troglodytes, G. gallus, R. norvegicus, M. musculus, E. coli]
Guido Kappé, RU
Single amino acid (SAA) repeat length distribution in Homo sapiens
[Figure: percentage (y-axis, 0% to 100%) per amino acid (A to Z) against total SAA repeat length in aa (x-axis, 5 to 20 and >20); labelled segments: S, P, G, E, A, T]
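A distribution of this kind could be tabulated roughly as in the Python sketch below. It assumes that the chart shows, for each total repeat length from 5 to >20 residues, the percentage of SAA repeats contributed by each amino acid; the function name, the `cap` parameter, and the use of key 21 for the >20 bin are choices made for this example.

```python
import re
from collections import Counter, defaultdict


def saa_length_distribution(sequences, min_len=5, cap=20):
    """Per repeat length, the percentage of SAA repeats for each amino acid.

    Keys are repeat lengths; all lengths above `cap` share the key cap + 1.
    """
    # A residue followed by at least (min_len - 1) copies of itself.
    run = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    counts = defaultdict(Counter)
    for seq in sequences:
        for m in run.finditer(seq):
            length = min(len(m.group(0)), cap + 1)
            counts[length][m.group(1)] += 1
    return {
        length: {aa: 100.0 * n / sum(c.values()) for aa, n in c.items()}
        for length, c in sorted(counts.items())
    }


# Toy input, not real proteome data.
toy = ["AQQQQQA", "SSSSSK", "MPPPPPPL", "Q" * 25]
print(saa_length_distribution(toy))
```

The minimum length of 5 matches the threshold for unit length 1 used in the database.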
Amino acid distribution Homo sapiens
[Figure: percentage (y-axis, 0 to 30%) per amino acid (x-axis, A to Z) for three series: "All prot. - Rep.", "Rep. - SAA", and "SAA"]
Amino acid distribution Arabidopsis thaliana
[Figure: percentage (y-axis, 0 to 30%) per amino acid (x-axis, A to Z) for three series: "All prot. - Rep.", "Rep. - SAA", and "SAA"]
Current work
Annotation of repeats versus function
Adding imperfect tandem repeats, a.k.a. approximate tandem repeats (ATR), to the database
Offering remote access via web services (WSDL and BioMoby)
Expansion of the analysis capabilities of the interface
PolyQ in AR (reprise)
Impure tracts are longer and more variable than pure CAG tracts (other codons are mainly CAA, CCG, and CGG)
Presence of other codons is better explained by codon duplication than by multiple point mutations: interrupting codons are part of the elongation process, rather than hampering its dynamics as proposed previously
Negative correlation between the lengths of the different CAG tracts: there is a maximal expansion length that the protein can handle without being deleterious
Céline Poux, RU
Acknowledgements
Wageningen University and Research Centre: Maarten van den Bosch, Hong Luo, Mark Kramer, Harm Nijveen
Radboud University, Nijmegen: Guido Kappé, Céline Poux, Wilfried W. de Jong
This work was supported in part by project grants from NWO/BMI (GK, CP) and the NBIC/BioAssist program (HN)
prorepeat.bioinformatics.nl
Thank you for your attention!
See also our posters on phylogenetic domain visualisation (TreeDomViewer) and microarray (re)annotation at the ISMB
Post-doc positions available: contact Jack.Leunissen@wur.nl or jack@bioinformatics.nl