Sequence analysis
• Analysis of primary, secondary, not tertiary ... structures
• Biological sequences. Central dogma.
• Similarities (orthologs, paralogs)
• Methods, algorithms (alignments, models)
• Databases (primary, secondary)
Sequences: DNA, RNA, protein ...
Genome: DNA
  ↓ transcription
Primary transcript: pre-mRNA, pre-ncRNA
  ↓ processing (splicing*, cleavage)
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
  ↓ translation, modification
[a] Translated sequence: protein (amino acids). [b] Mature ncRNA
  ↓ protein cleavage ...
Mature protein.
[ESTs are nucleotide sequences; they may be unspliced, spliced ...]
* Splicing only occurs in eukaryotes.
SEQUENCE ANALYSIS
Where and why?
• Sequencing projects, assembly of sequence data
• Identification of functional elements in sequences
• Sequence comparison
• Classification of proteins
• Comparative genomics
• RNA structure prediction
• Protein structure prediction
• Evolutionary history
Alignments and database searches (Summary)
Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?
* Sequence homology - BLAST, FASTA, SSEARCH
  Simple example: an unknown human protein is highly similar to a protein with known function from another organism
  => The human protein probably has the same function (it's a homolog: ortholog or paralog)
* Pattern/profile search - PROSITE; profile search - Pfam
** Secondary structure prediction
** Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
Comparing non-identical sequencesProtein sequence comparison - basic concepts
When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are two kinds of biological relationships:
Orthologs Proteins that carry out the same function in different species
Paralogs Proteins that perform different but related functions within one organism
Proteins are homologous if they are related by divergence from a common ancestor.
Homology: orthologs & paralogs
Orthology describes genes in different species that derive from a common ancestor. (=MouseA, ChickA, FrogA that come from Alfa-chain gene in common ancestor)
Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB).
Methods in sequence analysis
• Simple transformation/extraction
  a) Translation: RNA > protein
  b) Reverse translation: protein > RNA
  c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)
• Comparison of primary sequences
  a) Identity: finding sites, pattern matches
  b) Alignments: non-identical seqs (pair/multiple/phylogeny)
• Analyzing for other properties
  a) statistical composition
  b) profile analysis (PSI-BLAST)
  c) HMMs (probabilities of aa in position, Pfam)
  d) higher order structure (secondary structure in RNA/prot)
Translation of sequences
• Different nucleotide sequences may translate into identical amino acid sequences.
• Nucleotide sequence may yield different amino acid seqs. (6 reading frames)
• Reverse translation does not give unique nucleotide sequence.
• Different splicing of pre-mRNA: 1 gene, several proteins!
The (degenerate) Genetic code
UUU Phe F   UCU Ser S   UAU Tyr Y   UGU Cys C
UUC Phe F   UCC Ser S   UAC Tyr Y   UGC Cys C
UUA Leu L   UCA Ser S   UAA Stop*   UGA Stop*
UUG Leu L   UCG Ser S   UAG Stop*   UGG Trp W

CUU Leu L   CCU Pro P   CAU His H   CGU Arg R
CUC Leu L   CCC Pro P   CAC His H   CGC Arg R
CUA Leu L   CCA Pro P   CAA Gln Q   CGA Arg R
CUG Leu L   CCG Pro P   CAG Gln Q   CGG Arg R

AUU Ile I   ACU Thr T   AAU Asn N   AGU Ser S
AUC Ile I   ACC Thr T   AAC Asn N   AGC Ser S
AUA Ile I   ACA Thr T   AAA Lys K   AGA Arg R
AUG Met M   ACG Thr T   AAG Lys K   AGG Arg R

GUU Val V   GCU Ala A   GAU Asp D   GGU Gly G
GUC Val V   GCC Ala A   GAC Asp D   GGC Gly G
GUA Val V   GCA Ala A   GAA Glu E   GGA Gly G
GUG Val V   GCG Ala A   GAG Glu E   GGG Gly G
Translation:
AUGUUGGGUUGA = MLG*
||| | || | |
AUGCUAGGAUAA = MLG*
Reverse translation:
MLG* =
AUG UUA GGU UAA   1
AUG UUA GGU UAG   2
AUG UUA GGU UGA   3
...
AUG CUG GGG UGA   72
(1 x 6 x 4 x 3 possible seqs)
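The translation and reverse-translation examples above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `CODON_TABLE`, `translate` and `count_reverse_translations` are illustrative, the table is the standard code listed above, and translation simply continues through stop codons (shown as `*`).

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard code, with codons enumerated U, C, A, G
# at the first, second and third position (UUU, UUC, UUA, UUG, UCU, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> str:
    """Translate RNA codon by codon; 1-2 trailing bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def count_reverse_translations(protein: str) -> int:
    """Number of different RNA sequences that encode `protein` ('*' = stop)."""
    total = 1
    for aa in protein:
        total *= sum(1 for residue in CODON_TABLE.values() if residue == aa)
    return total


print(translate("AUGUUGGGUUGA"), translate("AUGCUAGGAUAA"))
print(count_reverse_translations("MLG*"))
```

Both RNA sequences give the same protein, and the count is the product of the number of codons per residue (1 for Met, 6 for Leu, 4 for Gly, 3 for stop).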
3rd position is not so important!
Changes that affect translation

AUGUUGGGUUGA = MLG*
AUGUUAGGUUGA = MLG*
AUGUUCGGUUGA = MFG*
AUGUGAGGUUGA = M*G* (= M*!)
AUG-UGGGUUGA = MWV (+GA.)   Frameshift => new AA seq
Last example: no Stop!
Open Reading Frame (ORF)

Forward reading frames (frames 1-3):
AUGUUGGGUUGA = MLG*
.UGUUGGGUUGA = CWV
..GUUGGGUUGA = VGL
...UUGGGUUGA = LG*

Backward reading frames (frames 4-6, on the reverse (minus) strand):
AUGUUGGGUUGA original
AGUUGGGUUGUA reversed
UCAACCCAACAU + complement = STQH, QPN, ...
Example unknown RNA:

  1 AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG 60
    ------------------------------------------------------------
    UACAAGGCAGAGUGCGAGUGGUUUGCCGAUCGGGCGCGAAGACGUGUGCAGUGAGGCAGC

Frames 1-3:
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A

Frames 4-6:
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T A
N R R V S V L R S A R A E A C T V G D G
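A six-frame translation like the one shown for the reading frames can be sketched as follows. The function names are illustrative and the standard genetic code is assumed; frames 4-6 are read from the reverse complement of the input.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("AUGC", "UACG")


def translate(rna: str) -> str:
    """Translate RNA codon by codon; 1-2 trailing bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(rna: str) -> list[str]:
    """Frames 1-3 on the given strand, frames 4-6 on its reverse complement."""
    reverse_complement = rna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (rna, reverse_complement)
            for offset in range(3)]


for number, protein in enumerate(six_frames("AUGUUGGGUUGA"), start=1):
    print(f"frame {number}: {protein}")
```

Scanning each frame for a long stretch without `*` is the usual first step when looking for open reading frames.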
Translation tables
• The coding for amino acids depends on species and/or nuclear/mitochondrial DNA.
• At least 17 translation tables exist:
  * The Standard Code
  * The Vertebrate Mitochondrial Code
  * The Yeast Mitochondrial Code
  * The Mold, Protozoan, and Coelenterate Mitochondrial Code and ...
  * The Invertebrate Mitochondrial Code
  * The Ciliate, Dasycladacean and Hexamita Nuclear Code
  * The Echinoderm and Flatworm Mitochondrial Code
  * The Euplotid Nuclear Code
  * ...
Tables with comments may be found at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/
Translation tables (cont), examples
Example:
The Vertebrate Mitochondrial Code (transl_table=2)
Differences from the Standard Code:
        Code 2    Standard
AGA     Ter *     Arg R
AGG     Ter *     Arg R
AUA     Met M     Ile I
UGA     Trp W     Ter *
Example:
The Yeast Mitochondrial Code (transl_table=3)
Differences from the Standard Code:
        Code 3    Standard
AUA     Met M     Ile I
CUU     Thr T     Leu L
CUC     Thr T     Leu L
CUA     Thr T     Leu L
CUG     Thr T     Leu L
UGA     Trp W     Ter *
CGA     absent    Arg R
CGC     absent    Arg R
Alternative Initiation Codon:
Bos: AUA
Homo: AUA, AUU
Mus: AUA, AUU, AUC
Coturnix, Gallus: also GUG.
Big differences if start (initiation) and stop (termination) codes differ!
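The effect of choosing a different translation table can be illustrated by overriding a few codons of the standard code. The sketch below uses only the four differences listed above for the Vertebrate Mitochondrial Code (transl_table=2); the dictionary and function names are illustrative, and alternative start codons are not modelled.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Differences from the standard code, as listed for transl_table=2.
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "AUA": "M", "UGA": "W"}


def translate(rna: str, table: dict[str, str] = STANDARD) -> str:
    """Translate RNA codon by codon with the given table ('*' = stop)."""
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


rna = "AUGAUAUGAAGA"
print(translate(rna))                   # standard code
print(translate(rna, VERTEBRATE_MITO))  # vertebrate mitochondrial code
```

The same RNA reads as a short peptide ending at UGA with the standard code, but runs through UGA (Trp) and stops at AGA with the mitochondrial code.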
Ambiguous sequence notation
Nucleotide examples:
A or C, [AC]: symbol M
A or G, [AG]: symbol R
A or T, [AT]: symbol W
A or C or G, [ACG]: symbol V
... etc.

G A A A A C
G A G A T C
G C A A C C
G C G A G C
-----------------
G[AC][AG]A[ATCG]C
The 4 sequence example may be written as a sequence : GMRANC , or as a pattern : G-[AC]-[AG]-A-x(1)-C
Wildcard: x(N) represents N arbitrary symbols.
Identity (pattern matching)
• Finding short exact matches
  GAATTC: recognition site for enzyme EcoRI
  GDSGGP: typical of serine proteases (e.g. G-[DE]-S-G-[GS]-[SAPHV])
• Patterns for multiple matches
  GA-[AG]-L-[ST] : GA + A or G + L + S or T
    GAALS, GAGLS, GAALT, GAGLT match
  GA-x-G-[STLAG] : GA + any 1 aa + G + S or T or L or A or G
    100 different sequences match
  C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
    pattern for zinc finger proteins (millions of possible sequences)

Programs that use these kinds of patterns:
"Findpatterns" searches a sequence (or set of sequences) for a pattern.
"Motifs" searches a sequence for motifs present in the PROSITE database.
PROSITE has patterns for >1000 protein families.
Important: Match or no match, just true or false, no score!
("Profiles" have probabilities for different amino acids in certain positions.)
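Patterns written in the notation above map almost directly onto regular expressions. The converter below is a minimal sketch, not the implementation used by Findpatterns or Motifs: `prosite_to_regex` is an illustrative name, and only the elements that appear on these slides (letters, `[...]`, `x`, `x(n)`, `x(n,m)`) plus `{...}` exclusions are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a dash-separated pattern such as C-x(2,4)-C into a regex."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":                      # wildcard: any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {..} = any residue except these
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("G-[DE]-S-G-[GS]-[SAPHV]")
print(regex, bool(re.search(regex, "AAGDSGGPLL")))
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
```

As on the slide, the result of a search is only match or no match; there is no score.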
Pairwise alignments:
Global alignment
Considers similarity across the full extent of the sequences
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| | ||||||| | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (most common)
Considers regions of similarity in parts of the sequences only.
xxxxxxx
|||||||   region of similarity
xxxxxxx
Comparing 2 sequences - Dotplot analysis

[Figure: dot plot of MAKLQGALGKRY (horizontal) against MAKIQGALAKRY (vertical); * marks identical residues]

Sequence alignment (2 mismatches):
M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y

Comparing 2 sequences - Gaps

[Figure: dot plot of MAKLQLGKRY (horizontal) against MAKLQGALGKRY (vertical); * marks identical residues]

Sequence alignment (gap):
M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
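A dot plot of the kind used in these two examples is easy to generate: put one sequence along each axis and mark every cell where the residues are identical. The helper below is a small sketch (the name `dot_plot` and the text layout are illustrative); real dot plot programs usually compare windows of several residues to reduce noise.

```python
def dot_plot(horizontal: str, vertical: str) -> list[str]:
    """Return the dot plot as text lines; '*' marks identical residues."""
    lines = ["  " + " ".join(horizontal)]
    for v in vertical:
        row = " ".join("*" if v == h else " " for h in horizontal)
        lines.append(f"{v} {row}".rstrip())
    return lines


print("\n".join(dot_plot("MAKLQGALGKRY", "MAKIQGALAKRY")))
print()
print("\n".join(dot_plot("MAKLQLGKRY", "MAKLQGALGKRY")))
```

In the first plot the main diagonal is interrupted at the two mismatches; in the second the diagonal jumps sideways where the gap is needed.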
Comparing 2 sequences: What are gaps?
Gaps are results of mutations (changes in DNA) that occur during evolution
For instance consider this deletion mutation:
Before the deletion:
DNA:      AACTTGACG GACTGGGCGTAT CGGGCACCG
          TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein:  N  L  T   D  W  A  Y   R  A  P

After the deletion:
DNA:      AACTTGACG CGGGCACCG
          TTGAACTGC GCCCGTGGC
protein:  N  L  T   R  A  P
Alignment report example
Red lines = matches full sequence (high identity) Purple lines = matches contain gap (good identity)
Gap
Best alignment = highest score!
Give scores for match, mismatch and gap (and gap extension).
What is better: mismatch or gap?
Calculate best score for each position, “trace back” to find best alignment.
“Dynamic programming” algorithms.
Very slow algorithm, impractical for routine searches of large databases!
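The scoring and trace-back idea can be sketched with a small global alignment routine (Needleman-Wunsch style). The scores used here (match +1, mismatch -1, gap -1 per position, no separate gap extension penalty) are assumptions chosen for illustration, not the defaults of any particular program.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is why this approach is slow against a whole database.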
Improvement of speed as compared to local alignment algorithm: BLAST and FastA

BLAST lists all matching "words"* between Query and Subject.
For each short match, the program tries to extend in both directions.
* A word is 7-11 nucleotides or 3-.. aa
Searching databases with BLAST
Initial search is for short words.
Word hits are then extended in either direction.
→ we only extend words that are in both sequences
→ fast, but gap can't be long between two close words

Searching databases with FastA
Initial search for short words.
Words are extended, but also linked if they are close!
→ slower, but longer alignments
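The word-search step described above can be sketched as follows. This is a deliberately simplified illustration with made-up function names and toy sequences: words must match exactly, and a hit is extended only while residues are identical, whereas the real programs score the extension and stop when the score drops too far.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Positions (i, j) where query and subject share an identical word."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


def extend_hit(query: str, subject: str, i: int, j: int, word_size: int):
    """Extend a word hit in both directions without gaps.

    Returns (query_start, subject_start, length) of the identical stretch.
    """
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + word_size, j + word_size
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i


query, subject = "TTGAATTCGGA", "CCGAATTCGGT"   # toy sequences
print(word_hits(query, subject, 7))
print(extend_hit(query, subject, 2, 2, 7))
```

If two sequences share no identical word of the chosen size, this step produces no hits at all and nothing is extended, however similar the sequences are overall.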
An alignment that BLAST can’t find!
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Aligning two sequences - Gap extension penalty. Alignment of genomic sequence with mRNA (Global alignment!)
Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).
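Default setting: Extend gap = 3
In a global alignment all residues are matched.
[Figure: alignment with the default setting]

New settings: Extend gap = 0
[Figure: alignment with the new setting; Exon 1, Exon 2 and Exon 3 are labelled]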
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr
457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6
E-value: the number of hits with at least this score that you expect to find by chance in a database of this size.
E-value, as important as score!

Expect Value
E = number of database hits you expect to find by chance
[Figure labels: size of database; your score; expected number of random hits]

Small database = few random hits. Big database = many random hits!
For the same score, a small database gives a lower E-value than a big database.
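The dependence of E on database size can be made concrete with the usual relation between a normalized bit score S' and the expect value, E = m · n · 2^(-S'), where m is the query length and n the total database length. The function below is a rough sketch under that assumption; BLAST itself applies length corrections, so the reported values differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same query and score, databases of different size (illustrative numbers).
for db_len in (10**6, 10**8):
    print(db_len, expected_hits(40.0, 75, db_len))
```

A hundred times larger database gives a hundred times larger E-value for the same score, and every extra bit of score halves E.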
>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
Length = 65

Score = 74.1 bits (179), Expect = 1e-13
Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
In protein alignments some mismatches are marked “similar” (+).
Substitution matrices are used to score matches/mismatches!
Are there better/worse substitutions?
• From comparisons of known proteins, it is known that some changes/mutations are more frequent than others.
• Also, not all amino acids* are equally common ... A match between rare amino acids is more significant than a match between common ones.
• How can we give a score to a mismatch/match that is biologically significant?
  → substitution matrices
* There are 20 amino acids, but only 4 nucleotides!
BLOSUM 62 scores

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Substitution matrices
Unitary matrices (nucleotide, protein)
All matches get '10', all mismatches '0'. Used for nucleotide seqs. Bad protein hits due to identities by chance.

Point Accepted Mutation, PAM (proteins)
PAM30, PAM70 ... matrices. Based on evolutionary distance: 1 PAM = 1 point mutation / 100 residues. Can't handle distant relationships well.

Blocks Substitution Matrix, BLOSUM (proteins)
BLOSUM50, BLOSUM62 ... matrices. Based on alignments in the BLOCKS db. Sequence segments above a certain identity are clustered (BLOSUM62: >62% identity). The most used matrices. BLOSUM62 is the default in BLAST.
Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!
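How a substitution matrix is used to score an ungapped alignment can be sketched with a handful of BLOSUM62 entries taken from the table above. The dictionary holds only the pairs needed for this toy example, and the function names are illustrative.

```python
# A few BLOSUM62 entries copied from the matrix above (symmetric pairs).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("W", "W"): 11, ("I", "I"): 4, ("L", "L"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("K", "R"): 2, ("D", "E"): 2,
    ("G", "W"): -2, ("K", "K"): 5, ("D", "D"): 6,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either order; raises KeyError if absent."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def score_ungapped(seq1: str, seq2: str) -> int:
    """Sum of pair scores for two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(pair_score("W", "W"), pair_score("A", "A"))  # rare vs common identity
print(score_ungapped("WIKD", "WLRE"))
```

The three mismatches in the toy alignment all have positive scores, so they would be marked as "similar" (+) in a BLAST protein alignment.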
Evolution of protein genes: secondary and tertiary structure conserved

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA   60% nucleotide identity
M V R L E K I N Q A G L L V A G   69% amino acid identity

M V R I Q K I N E K G A L L A G   38%
Q V R I Q K I Y E K G A L L A A   19% ('twilight zone')
Q V R I Q K I Y E K T A L L F A   6% ('midnight zone')

Blast report
Sequences producing significant alignments: (bits) Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...  462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...  233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...  184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]  170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...  143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...  132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...  129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...  117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...  104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...  39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...  38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...  37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...  36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]  36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...  36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...  36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...  36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...  36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...  36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...  34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...  34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...  34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...  34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...  33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]  33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR  32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...  32  6.1

Bad scores/E-values might sometimes not matter.
1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60
61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120
121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180
181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240
241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300
301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360
361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ
Low complexity sequence tends to
(1) increase the number of non-specific hits to database sequences
(2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)
Therefore, low complexity parts are filtered out by default in BLAST searches. (Don’t use filtering if you want exact matches.)
Blast variants:
Program   Query              Database
blastp    Protein            Protein
blastn    DNA                DNA
tblastn   Protein            DNA (translated)
blastx    DNA (translated)   Protein
tblastx   DNA (translated)   DNA (translated)

Example: Searching a new genome assembly for a protein homolog.
Input: protein.
Database: DNA (genome sequences)
→ tblastn
Rules of database searches (like BLAST)
• Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level *
• Use the smallest possible database (not too small though)
• Sequence statistics should be used rather than percent identity/similarity as criterion for homology
• Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
F  R  F  S  T  R  S

2) For nucleotide-nucleotide searches, it is often good to set the word size low (-W 7)
BLAST at NCBI
tblastn
BLAST output at NCBI
1 perfect hit, some hits with parts of sequence matched
Alignments below
"HSP" = high scoring pair; there may be several!
Best hit
Next best hit
BLAST output, with many HSPs
gb|CM000011.1| Canis familiaris chromosome 11, whole genome shot...   86  9e-15

>gb|CM000011.1| Canis familiaris chromosome 11, whole genome shotgun sequence
Length = 75769841

Score = 85.7 bits (43), Expect = 9e-15
Identities = 89/102 (87%), Gaps = 3/102 (2%)
Strand = Plus / Minus

Query: 4        cgtgctgaaggcctgtatcctaggctacacactgaggactctgttcctcccctttccgcc 63
                |||||||||||||||| |||||||||||| || ||||||| |||||||    ||| ||||
Sbjct: 53542401 cgtgctgaaggcctgtttcctaggctacagacggaggact-tgttcctta--tttgcgcc 53542345

Query: 64       taggggaaagtccccggacctcgggcagagagtgccacgtgc 105
                ||||||||||||||||||||   ||||||||||||| |||||
Sbjct: 53542344 taggggaaagtccccggacccttggcagagagtgccgcgtgc 53542303

Score = 75.8 bits (38), Expect = 9e-12
Identities = 75/86 (87%), Gaps = 1/86 (1%)
Strand = Plus / Minus

Query: 181      ggggcgtcatccgtcagctccctctagttacgcaggcagtgcgtgtcc-gcgcaccaacc 239
                |||||||| ||||||| |||  ||||||||||||||||| |||  |   |||| ||||||
Sbjct: 53542216 ggggcgtcgtccgtcaactctatctagttacgcaggcagcgcgcctggtgcgcgccaacc 53542157

Query: 240      acacggggctcattctcagcgcggct 265
                ||||||||||||||||||||||||||
Sbjct: 53542156 acacggggctcattctcagcgcggct 53542131

Score = 36.2 bits (18), Expect = 7.7
Identities = 18/18 (100%)
Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919

Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!
Databases at NCBI available for BLAST searches
Protein sequence databases
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
swissprot the last major release of SWISS-PROT
DNA sequence Databases
nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
You may also blast against single genomes ...
Multiple alignments - applications
• Identify conserved motifs - patterns (PROSITE)
• Profiles (Pfam)
• Phylogenetic studies
• Prediction of protein secondary structure
• Experimental: design of probes
Multiple sequence alignment programs (CLUSTALW, PileUp, T-coffee ...)
PILEUP
PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final multiple alignment. A cluster consists of two or more already-aligned sequences.
PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).
The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:
Trees from MSA
[Figure: tree with 3 large groups: SRP54, SRPR, FtsY]
Multiple alignment software
Pileup (GCG)
Clustalw / Clustalx
MSA (program that in principle finds the true optimal multiple alignment by the dynamic programming method)
T-coffee
Multiple alignment editors/viewers
SeqLab (GCG)
MACAW (search for motifs, blocks)
Jalview
CINEMA
Genedoc
Bioedit
Boxshade
Clustalx
njplot
Colours of amino acids according to type: charged, hydrophobic ...
Makes it easier to see matches.
How to find homologs with low sequence identity
• Sequence identity is high if the evolutionary distance is small, but low if the distance is big.
• Many amino acid positions change.
• An amino acid may be substituted differently in different species.
• If we have many known homologs, we can search with "all of them" as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs.
  → align known sequences and make a "profile"
Position Specific Substitution Rates
Position Specific Score Matrix (PSSM)

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Columns: amino acids. Rows: positions of the example sequence.
Serine is scored differently in these two positions: 211 (typical serine) and 216 (active site serine, the active site nucleophile).
How does Serine score in positions 211 and 216?
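Looking up scores in a PSSM works like a substitution matrix lookup, except that the row depends on the position in the profile rather than on the residue being replaced. The sketch below copies six rows (211 and 215-219) from the matrix above; the variable and function names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the PSSM above: position -> scores in AMINO_ACIDS order.
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    215: "-5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
    217: "-3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7",
    218: "-3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7",
    219: "-2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6",
}
PSSM = {pos: dict(zip(AMINO_ACIDS, map(int, row.split())))
        for pos, row in PSSM_ROWS.items()}


def pssm_score(peptide: str, start: int) -> int:
    """Score a peptide placed (without gaps) at profile position `start`."""
    return sum(PSSM[start + offset][aa] for offset, aa in enumerate(peptide))


print(PSSM[211]["S"], PSSM[216]["S"])   # the same residue, two positions
print(pssm_score("DSGGP", 215))
```

A serine at position 216 scores higher than a serine at 211, and almost any other residue at 216 is penalised.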
PSIBLAST – a more sensitive BLAST!
PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:
(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program .
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale , and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.
Advantage : Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
PSI-BLAST creates profiles automatically

[Figure: 1st BLAST round → profile → 2nd BLAST round → profile → 3rd BLAST round, with a threshold for including hits]

When no more new sequences are found, the search terminates.

Problem: If bad sequences enter the profile, it finds only trash!
Example of homology: SRP9/14/21
• SRP9 & SRP14 are related (common ancestor)
• SRP9 is not found in Fungi (but SRP21 is)
• But weak SRP9 hit in the fungus S. pombe (YE07)
• Weak similarity SRP9 S. pombe & SRP21
• Make a profile of known SRP21 sequences and search a database of all known proteins!
Can we detect any similarity SRP9/21?
Profilesearch - based on Saccharomyces SRP21 sequences

    Sequence     ZScore    Orig  Length  Comment
 1. S_BAY         84.40  349.21     146  SRP21 S. bayanus
 2. S_PAR         83.75  351.99     169  Paradoxus
 3. S_KUD         83.71  346.53     146  Kudria
 4. SR21_YEAST    82.41  346.02     166  P32342 saccharomyces cerevisiae
 5. S_MIK         82.06  339.92     145  Mikatae
 6. S_KLU         75.51  314.59     145  Kluyveri
 7. S_CAS         74.91  308.02     125  Castellii
 8. C_ALB         21.67  107.92     168  Candida
 9. N_CRA         12.74   74.20     197  Neurospora
10  YE07_SCHPO     9.61   58.63     120  O13804 schizosaccharomyces pombe
11  CD3D_RAT       8.74   57.34     173  P19377 rattus norvegicus (rat).
12  ARP2_PLAFA     8.52   65.04     451  P13824 plasmodium falciparum.
13  Q23147         8.50   60.12     284  Q23147 caenorhabditis elegans.
14  SR09_ARATH     8.45   53.56     103  Q9smu7 arabidopsis thaliana (mouse-ear
15  Q8K2G5         8.45   60.59     306  Q8k2g5 mus musculus (mouse). riken cdna
16  Q8BFQ4         8.40   60.59     313  Q8bfq4 mus musculus (mouse).
17  Q8I562         8.39   64.65     459  Q8i562 plasmodium falciparum (isolate 3d7).
18  AAH44174       8.28   60.11     313  Aah44174 brachydanio rerio (zebrafish)
19  SR09_MAIZE     8.17   52.51     103  O04438 zea mays (maize). signal recognition
20  CD3D_MOUSE     8.11   54.85     173  P04235 mus musculus (mouse). t-cell surface
21  SR09_CAEEL     7.90   50.47      76  P34642 caenorhabditis elegans. signal

Green box = sequences in profile (should be first!)
Yellow box = unknown SRP21 (incl. YE07 from S. pombe)
Red box = SRP9 sequences (best hits in db of >1 million proteins!)
SRP21 aligned to SRP9 & 14

[Figure: multiple alignment of SRP21, SRP9 and SRP14 sequences; one region is marked as an unaligned box]

Secondary structure prediction by PSI-Pred also showed the conserved αβββα structure.

SRP9/14 αβββα secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).

The most conserved residues are in secondary structure elements. SRP9, SRP21 more similar.

Residues marked according to similarity in sequence and chemical properties.
Proteins share domains
• In primary sequence searches the found proteins are aligned because they share domains
• If the sequences are very different outside the shared domain, they may be paralogs.
• The next example shows a MSA in which the middle part is a GTPase domain. The first or last part is missing ...
N-terminal
C-terminal
Two different proteins (4+4 sequences ) are aligned. They share a domain.
Pfam – protein domains DB
• From multiple alignments of many related proteins, profiles (HMMs) are made
• Input a sequence, match to all families/HMMs.
• Known sequences are in Pfam database.
Pfam DB: Karolinska Inst., Sanger (UK), St. Louis (USA), Pasteur (F)
Pfam model amino acid probability plot in the "structure logo" style

Structure logo for Pfam motif trypsin (only part of the model shown).
[Figure: logo; the horizontal axis shows the positions in the model]
The size of the letters = probability of finding that amino acid in the position. In these positions, some amino acids are much more common than others.
Search a sequence for matches to Pfam models

HMM file: /dbs/pfam/Pfam_ls
Sequence file: pop3_spombe
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query sequence: gi|3560259|emb|CAA20744.1|
Accession: [none]
Description: SPCC16C4.05 [Schizosaccharomyces pombe]

Scores for sequence family classification (score includes all domains):
Model          Description            Score   E-value   N
--------       -----------            -----   -------  ---
RNase_P_pop3   RNase P subunit Pop3   332.1   8.6e-97    1

Parsed for domains:
Model          Domain   seq-f seq-t     hmm-f hmm-t     score   E-value
--------       -------  ----- -----     ----- -----     -----   -------
RNase_P_pop3     1/1        7   165 ..      1   175 []  332.1   8.6e-97

Alignments of top-scoring domains:
RNase_P_pop3: domain 1 of 1, from 7 to 165: score 332.1, E = 8.6e-97
                   *->KrkQvyKPVLeNPytNEAkLWPhVtdqklvlELLqekvlkklvhalk
                      K+kQ++K+VL+NP++++   WP+V+++      +qek++++lv++l+
  gi|3560259     7    KVKQTVKLVLRNPLSIS---WPIVDAN------TQEKLAQTLVQWLP 44

                   ashKgneesevtvGfNeivelLsraCsesddvTQPAVvlfvcnkDgtPsv
                   ashK++++s++tvG+N+++elL+r+C++++dvTQPAVv++++++D   s+
  gi|3560259    45 ASHKDILDSKLTVGLNSVNELLERCCQNAKDVTQPAVVFILHDQD---SM 91

                   LlsQlPLLvavanltGSSKVkLVqLpksaqakfdehlGlskavHDGmlLv
                   L++++P+Lva+an++GSSK++LV+L++saqa+++++lGls+a   G+++v
  gi|3560259    92 LVTHMPQLVANANFYGSSKCRLVPLGFSAQALIAKKLGLSRA---GAIAV 138

                   rkdasldksfadlvdskvEepqiPWLep<-*
                   ++d++l+k+++dlv++ +Eepq++WL++
  gi|3560259   139 QDDSPLWKYLKDLVMN-IEEPQARWLSE    165
The HMMER package is used for searching sequences against one or all Pfam models (or a model that you have made yourself).
As in BLAST searches, you get a score, e-value and an alignment.
Searches may be done at Pfam WWW.
Search at Pfam (Sanger)
For a known protein one may use the UNIPROT accession to get a precomputed alignment.
If the protein is not in the database, just input the sequence ...
WWW results
Good match (RNaseP_pop3)
Matches below threshold
Pfam notes ...
• Even though the Pfam alignments are curated, they may contain sequences that are very different from your sequence => bad score and E-value!
• If your sequence gets a very high score and a good E-value, it is probably in the alignment that was used to create the model.
• Pfam-B models are made “automatically” and are not curated (use with caution).
• Some Pfam models are domains, others are almost complete proteins ...
SRS: InterPro – search all domain DB
InterPro had a “bad” reputation some years ago, but it is a good idea!
PROSITE, Pfam, PRODOM, PRINTS, SMART ...
Seq input
Transmembrane prediction
• 25% of all proteins are membrane bound
• By comparing known transmembrane proteins, programs like TMHMM make predictions. Some use neural networks that are trained on known TM proteins.
• Other methods can be combined with TMHMM or other programs to get a higher specificity (all methods have a flaw somewhere)
TMHMM output
RF47_[Guillardia len=68 ExpAA=37.41 First60=32.65 PredHel=2 Topology=i2-19o47-64i
ORF74_[Odontella len=74 ExpAA=39.05 First60=32.92 PredHel=2 Topology=i2-24o48-65i
ORF71_[Porphyra len=71 ExpAA=36.0 First60=26.14 PredHel=2 Topology=i7-24o53-70i
ORF70_[Chlorella len=70 ExpAA=38.67 First60=32.40 PredHel=2 Topology=i2-21o45-67i
-------------------------------------------------------------------------
PredHel=2 (= 2 TM dom)
Topology=i2-21o45-67i
inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside
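The topology string can be decoded mechanically: "i" and "o" mark loops on the inside and outside, and each "start-end" pair is a predicted transmembrane helix. A minimal sketch, assuming the one-line key=value layout shown above; the function names are hypothetical.

```python
import re

def parse_topology(topology):
    """Split e.g. 'i2-21o45-67i' into helix ranges and the side of each loop."""
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    sides = re.findall(r"[io]", topology)
    return helices, sides

def parse_short_line(line):
    """Parse one short-format line: a name followed by key=value fields."""
    name, *pairs = line.split()
    record = dict(pair.split("=", 1) for pair in pairs)
    record["name"] = name
    return record

# Line copied from the example output on the slide.
record = parse_short_line(
    "ORF70_[Chlorella len=70 ExpAA=38.67 First60=32.40 PredHel=2 Topology=i2-21o45-67i"
)
helices, sides = parse_topology(record["Topology"])
print(helices)  # residue ranges of the predicted TM helices
print(sides)    # loop locations, N-terminus first
```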
Example in which the scores for the first TM domain are too low.
PSI-pred: secondary structure
Confidence in prediction of this residue
Output sent by mail:
EEE = beta strand
HHH = alpha helix
CCC = coil (“normal”)
Link to image files.
http://www.psipred.net
Looking for short sequences
• Sometimes you want to find out whether there are short sequences (often called words) that are shared by a set of sequences. They may, for instance, be transcription factor binding sites ...
• Alignment programs won't find these ...
• MEME is a program that finds “words” of a specified length in a set of sequences.
• MAST may be used to search for known words
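The idea of a shared word can be illustrated with exact k-mer counting. This is only a sketch under a strong simplification: MEME fits a probabilistic motif model and tolerates mismatches, whereas the hypothetical function below reports only words that occur identically. The toy sequences are invented for illustration.

```python
from collections import defaultdict

def shared_words(sequences, k, min_fraction=1.0):
    """Return words of length k found in at least min_fraction of the sequences."""
    seen = defaultdict(set)  # word -> indices of the sequences that contain it
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(idx)
    needed = min_fraction * len(sequences)
    return sorted(word for word, ids in seen.items() if len(ids) >= needed)

# Hypothetical toy sequences that all contain the word TTGACA.
toy = ["ACGTTGACAGG", "TTTGACATCA", "GGTTGACACC"]
words = shared_words(toy, 6)
print(words)
```

Lowering `min_fraction` reports words that are present in only part of the set.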
But what about RNA genes?
• RNA genes are genes that do not code for protein (they are not translated). Their products are usually called “noncoding RNAs”.
• There are structural, catalytic and regulatory ncRNAs; few are conserved in all organisms
• Many ncRNAs are part of ribonucleoprotein complexes (RNPs)
• Some commonly known ncRNAs are: ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), signal recognition particle RNA (SRP RNA), ribonuclease P RNA (RNase P RNA)
ncRNAs are often not annotated
NC_006270.1     -TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC
X06802|BAC.SUB. NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC
                 *************************************  *********** ***  ***

NC_006270.1     CGAATCCCTTTCTCGAGGTTCGTTTACTTTAAGGTCTGCCTTAAGCAAGTGGTGTTGACG
X06802|BAC.SUB. CGAATCCCTT-CTCGAGGTTCGTTTACTTTAAGGCCTGCCTTAAGTAAGTGGTGTTGACG
                ********** *********************** ********** **************

NC_006270.1     CTTGGGTCCTGCGCAATGGGAATCCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
X06802|BAC.SUB. TTTGGGTCCTGCGCAATGGGAATTCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
                 ********************** ************************************

NC_006270.1     AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT
X06802|BAC.SUB. AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT
                **** *****  ************************ ***********************

NC_006270.1     TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
X06802|BAC.SUB. TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
                ********************************
Sequence alignment of the annotated SRP RNA from Bacillus subtilis and the identified SRP RNA from the newly sequenced and “fully” annotated Bacillus licheniformis.
Sequence identity = 94%! Still, no SRP RNA is annotated. SRPDB is needed.
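Percent identity and the line of "*" marks can be computed directly from two aligned rows. A minimal sketch, assuming identity is defined as identical columns divided by alignment length (other conventions exist); the function names are hypothetical, and the two rows are the fourth block of the alignment above.

```python
def match_line(a, b):
    """Return '*' under identical, non-gap columns and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    return "".join("*" if x == y and x != "-" else " " for x, y in zip(a, b))

def identity(a, b):
    """Fraction of alignment columns with identical residues."""
    return match_line(a, b).count("*") / len(a)

# Fourth block of the B. licheniformis / B. subtilis alignment shown above.
row_a = "AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT"
row_b = "AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT"
print(match_line(row_a, row_b))
print(f"{identity(row_a, row_b):.1%}")
```

Concatenating all blocks before calling `identity` gives the figure for the whole alignment.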
ncRNAs in the 3 Kingdoms of Life
Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.
RNA structure
5’-GGGGAUGUAGCUUAGUGGUAGAGCAUUGGAGUUAUAAUCCGGAGGCGCGGGUUCGAAUCCCGUUAUCCCC -3’
primary secondary tertiary
ncRNAs basepair (G-C, A-U, G-U) creating secondary structure
Mutations may maintain secondary structure (G-C => G-U => A-U)
ncRNAs first fold into a secondary structure before adding tertiary interactions => The secondary structure must not change!
RNA: Conserved secondary structure
AU, GC base pairing create “hairpins”

CAGGAAACUG seq1
...|.||...
GCUGCAAAGC seq2
|||||||...
GCUGCAACUG seq3

  A A      C A      C A
 G   A    G   A    G   A
  G-C      U-A      U C
  A-U      C-G      C U
  C-G      G-C      G G
  seq1     seq2     seq3
Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!
The alignment of the primary sequences (structure) doesn’t give us any information.
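Whether the ends of a sequence can pair into a hairpin stem is easy to test. A minimal sketch, assuming a fixed stem of three base pairs and allowing G-C, A-U and G-U pairs; the function name is hypothetical.

```python
# Allowed base pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def forms_hairpin(seq, stem_len=3):
    """True if the first stem_len bases pair with the last stem_len bases, read inward."""
    if len(seq) < 2 * stem_len:
        return False
    return all((seq[i], seq[-1 - i]) in PAIRS for i in range(stem_len))

# The three sequences from the slide.
for name, seq in [("seq1", "CAGGAAACUG"), ("seq2", "GCUGCAAAGC"), ("seq3", "GCUGCAACUG")]:
    print(name, forms_hairpin(seq))
```

seq2 and seq3 share seven of ten bases, yet only seq1 and seq2 pass this structural test.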
Secondary structure pattern
Compensatory base changes maintain secondary structure.We need a way to specify the base pairing!
Secondary structure pattern
Pattern:
h1 s1 h1’
h1 NNN:NNN
s1 GMAA

Note:
M = [AC]
N = [AUGC]
“h” stands for helix
“s” stands for (single) strand

  A A      C A
 G   A    G   A
  G-C      U-A
  A-U      C-G
  C-G      G-C
  seq1     seq2
Programs that search for secondary structure: Patscan, RNAbob.
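A pattern of this kind can be matched with a sliding window. This sketch is not Patscan or RNAbob. It assumes a three-base-pair helix around a four-base loop written in IUPAC codes, with M taken as A or C (which is what the loops GAAA and GCAA of seq1 and seq2 require), and it assumes G-U pairs are accepted in the helix. All names are hypothetical.

```python
# IUPAC codes used by the loop pattern (M = A or C, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "M": "AC", "N": "ACGU"}
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def find_hairpins(seq, stem_len=3, loop="GMAA"):
    """Return (start, window) for every window matching helix-loop-helix'."""
    width = 2 * stem_len + len(loop)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        loop_seq = window[stem_len:stem_len + len(loop)]
        loop_ok = all(base in IUPAC[code] for base, code in zip(loop_seq, loop))
        stem_ok = all((window[i], window[-1 - i]) in PAIRS for i in range(stem_len))
        if loop_ok and stem_ok:
            hits.append((start, window))
    return hits

# seq1 from the slide embedded in hypothetical flanking bases.
hits = find_hairpins("UUCAGGAAACUGAA")
print(hits)
```

The result is a yes/no match per window with no score, which is why pattern searches are fast.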
Creating probabilistic covariance models from alignments (Rfam)
tetraloop with stem
Both primary sequence and secondary structure conservation are captured in a probabilistic model.
COVE (Eddy 1994) is used for creating models and searching.
Covariance models are equivalent to stochastic context-free grammars.
Patterns: Hard to make large patterns and patterns that find new structures. Yes/No match, no scoring. Fast!
Models: a covariance model of which bases appear together is created automatically from an alignment. Time complexity: O(n^3). Slow!
Idea: Use a smaller pattern to filter, then use the covariance model on the filtered sequences => Fast and sensitive!
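The notion of "bases that appear together" can be made concrete with the mutual information between two alignment columns. This is a sketch of the covariation signal only, not COVE or a covariance model. The function name is hypothetical, and the toy alignment consists of seq1 and seq2 from the earlier slide plus two invented hairpins.

```python
from collections import Counter
from math import log2

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(row[i] for row in alignment)
    col_j = Counter(row[j] for row in alignment)
    joint = Counter((row[i], row[j]) for row in alignment)
    return sum(
        (count / n) * log2((count / n) / ((col_i[a] / n) * (col_j[b] / n)))
        for (a, b), count in joint.items()
    )

# seq1 and seq2 from the slides, plus two hypothetical hairpins.
toy = ["CAGGAAACUG", "GCUGCAAAGC", "GAUGAAAAUC", "CCGGCAACGG"]
paired = mutual_information(toy, 0, 9)    # outermost base pair of the stem
unpaired = mutual_information(toy, 3, 9)  # invariant loop base vs. stem base
print(paired, unpaired)
```

Columns that change together to keep a base pair score high, while a column that never varies carries no covariation signal.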
Multiple Sequence Alignments for ncRNAs must specify basepairings
The Rfam database is the “Pfam for ncRNAs”
ncRNA structure evolution
1. Mutations in ncRNAs maintain the secondary structure => primary sequence is poorly conserved => hard to detect similarities by primary sequence searches
2. Structure evolves by losing / adding helices => big gaps in alignments even when primary sequence is conserved
An example ...
SRP RNA variants
Helices H3 and H4 missing in yeast!
Bacteria
Archaea
Eukarya
Comment: t1 and t2 depict tertiary interactions
Helix 8 is the only part found in all SRP RNA!
Fungal SRP RNAs lack helices 3 and 4
[Figure: SRP RNA secondary structure, Yarrowia lipolytica]
[Figure: SRP RNA secondary structure, Neurospora crassa]
[Figure: SRP RNA secondary structure, Candida albicans]
[Figure: SRP RNA secondary structure, S. pombe]
These RNAs are < 300 nts.
BUT ...
The S. cerevisiae SRP RNA (length 519 nts) could not be fitted to this type of SRP RNA.
How do we decide on the structure of this gene?
Note that Yarrowia (bottom) has an extra helix.
Comparative analysis of SRP RNA Saccharomyces species
Using the known SRP sequences from Saccharomyces cerevisiae as queries, regions of the genomes of S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii and S. kluyveri were retrieved from Washington Univ., St. Louis.
By comparative analysis, SRP RNA sequences (453-547 nts) and structures were identified*.
The results showed that all species had large inserts in the helix 5 region, especially close to the small Alu domain, and that helix 7 also was variable.
* The secondary structures were predicted with MFOLD.
[Figure: SRP RNA secondary structure, Yarrowia lipolytica]
[Figure: SRP RNA secondary structure, S. bayanus]
Saccharomyces have a unique inserted part in helix 5 close to the Alu domain.
This was found in all Saccharomyces species.
S.bayanus
Saccharomyces helix insertions
Helix 7
This structure is not found in C. albicans or S. pombe.
MicroRNAs – regulatory ncRNAs
Red part is the mature miRNA, the sequence is complementary to mRNA!
RNAi pathway
Cell. 2004 Apr 2;117(1):1-3. miRNA and siRNA work in a similar fashion
Cross-species genomic sequence conservation can be used
1. for discovery of new regions with regulatory functions
2. to enhance gene predictions
3. to enhance alternative splicing predictions (1 gene => >1 mRNA => >1 protein)
4. to reveal transcription factor binding sites
Cross-species gene location conservation can be used for
1. identification of unknown ORFs (predicted proteins)
2. adding evidence for discovered new genes
Cross-species gene prevalence can be used for prediction of
1. the probability for the existence of a gene in a species (Keep looking!)
2. the function of a certain gene/protein/RNA (Is the product essential?)
Post-genomic Bioinformatics/Genomics
Cross-species genome comparisons
And much more ... (We will show some examples later ...)
SRP component searches
This is part of the secretory pathway.
The SRP pathway is conserved in all domains of life: Eukarya, Bacteria, Archaea.
All organisms have an SRP particle, but it looks different in different organisms.
Mitochondria and chloroplasts are endosymbionts
Origin of photosynthetic organisms (have chloroplasts with own genome!)
Primary endosymbiosis: Cyanobacteria + Eukaryote => algae
Secondary endosymbiosis: algae + Eukaryote
Genome map of P. purpurea chloroplast at NCBI
We downloaded 26 chloroplast genomes and searched with pattern and model for bacterial SRP RNA.
Red algal group
Odontella and Guillardia have chloroplasts of secondary endosymbiosis origin
Green plant group
Found SRP RNA candidates (low scores) in 8 chloroplast genomes
Genome position for SRP RNA gene candidates in “green plant” group
Conserved clusters
The candidates in phylogenetically linked organisms are all found in this position.
No overlap with known genes!
(Conserved gene clusters are marked with ‘3’, ‘4’ ...)
rpoC-rpoB-trnC-RNA
Red algae (incl. secondary endosymbionts)
Porphyra purpurea      psaJ-apcD-RNA-fabH-(tRnaLeu)-psbX-accD-psbV
Cyanidioschyzon mer.   psaJ-apcD-RNA------(tRnaLeu)-psbX-accD-psbV
Cyanidium caldarium    psaJ-apcD-RNA------(tRnaLeu)------accD-psbV
Guillardia theta       psaJ------RNA----------------psbX------psbV
Odontella sinensis     (tRnaPhe)-RNA----------------psbX-p120-psbV

Green algae + ancestral streptophyta
Mesostigma viride      (ycf6)-RNA------(trnC)-rpoB-rpoC1-rpoC2
Nephroselmis olivacea  ndbH---RNA------(trnC)-rpoB-rpoC1-rpoC2
Chlorella vulgaris     p133---RNA-p134-(trnC)-rpoB-rpoC1-rpoC2
Some of these also contain rnpB (gene for RNase P RNA)
• 2 clear groups: Red algae and Green algae
Genome locations of SRP RNA candidates in chloroplasts
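The conserved neighbourhood of the RNA candidates can be checked by treating each genome as an ordered gene list. A minimal sketch, assuming the dash-separated notation of the slide, where runs of dashes are alignment gaps and parentheses are dropped; the function names are hypothetical.

```python
def gene_order(line):
    """Split e.g. 'psaJ------RNA-psbX' into gene names, dropping gap dashes."""
    return [gene.strip("()") for gene in line.split("-") if gene]

def conserved_genes(orders):
    """Genes present in every genome, listed in the order of the first genome."""
    common = set(orders[0]).intersection(*orders[1:])
    return [gene for gene in orders[0] if gene in common]

# Gene orders copied from the slide.
red = [gene_order(s) for s in [
    "psaJ-apcD-RNA-fabH-(tRnaLeu)-psbX-accD-psbV",
    "psaJ-apcD-RNA------(tRnaLeu)-psbX-accD-psbV",
    "psaJ-apcD-RNA------(tRnaLeu)------accD-psbV",
    "psaJ------RNA----------------psbX------psbV",
    "(tRnaPhe)-RNA----------------psbX-p120-psbV",
]]
green = [gene_order(s) for s in [
    "(ycf6)-RNA------(trnC)-rpoB-rpoC1-rpoC2",
    "ndbH---RNA------(trnC)-rpoB-rpoC1-rpoC2",
    "p133---RNA-p134-(trnC)-rpoB-rpoC1-rpoC2",
]]
print(conserved_genes(red))
print(conserved_genes(green))
```

The two groups share no flanking gene, which is consistent with the red and green groups placing the candidate in different genomic contexts.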
The predicted SRP RNAs have conserved promoters (as in cyanobacteria)
Cyanobacteria
The distances between the –10 TATA box and the sequence (5-8 nts), as well as the promoter sequences, are consistent with experimentally verified promoters in Prochlorococcus (Vogel et al. 2003)