Sequence Comparison – Identification of remote homologues
Amir Harel, Moran Yassour
Overview
Homologous proteins
Protein Sequence comparison
BLAST and its improvements
PSI-BLAST
Homologous Proteins
Proteins that share a common ancestor are called homologous.
Common three dimensional folding structure
Homologous Proteins
Homology refers to a similarity that spans an entire folding domain.
The difficulty in defining homology
Why is homology important?
Prediction of a protein's properties
Classification of proteins into families
Evolutionary tree
How to identify homology?
Using sequence similarities
Aligning two proteins
Giving a score to the alignment
Global & Local Alignments
Global alignment: alignment of the entire sequence
Local alignment: alignment of a segment of the sequence
How to score an alignment
Substitution matrix: Sij = a value proportional to the probability that amino acid i mutated into amino acid j
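Matrices such as PAM and BLOSUM are usually stored as log-odds scores: the observed frequency of the aligned pair divided by the frequency expected by chance, on a log scale. The sketch below shows that calculation. The frequencies in the usage line are made-up illustrative numbers, not taken from any published matrix, and the half-bit scaling (`scale=2`) is an assumption chosen to mirror how BLOSUM matrices are commonly reported.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2):
    """Rounded log-odds score for aligning residue i with residue j.

    q_ij  : observed frequency of the pair (i, j) in trusted alignments
    p_i   : background frequency of residue i
    p_j   : background frequency of residue j
    scale : 2 gives half-bit units (an assumption of this sketch)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# Hypothetical numbers: the pair is seen 4 times more often than chance.
score = log_odds_score(q_ij=0.01, p_i=0.05, p_j=0.05)
```

A positive score means the pair occurs more often than chance in related sequences; a negative score means it occurs less often.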
Types of Substitution Matrices
PAM – comparison of closely related sequences
BLOSUM – multiple alignments of distantly related sequences
Substitution Matrices
Different matrices reflect different evolutionary distances:
1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids.
BLOSUM X: sequences more than X% identical to one another were clustered and counted as a single sequence.
Gap costs
The most widely used gap score is -(a + bk) for a gap of length k.
Long gaps do not cost much more than short ones since a single mutation may cause a large gap.
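A small sketch of the gap score -(a + bk) makes the point about long gaps concrete. The values a = 10 and b = 1 are illustrative assumptions, not parameters stated on the slide.

```python
def gap_score(k, a=10, b=1):
    """Score of a single gap of length k: -(a + b*k).

    a is the cost of opening a gap, b the cost per gapped position.
    The defaults are illustrative assumptions.
    """
    return -(a + b * k)

one_long_gap = gap_score(10)          # one gap covering 10 positions
ten_short_gaps = 10 * gap_score(1)    # ten separate gaps of length 1
```

With these values, one gap of length 10 is penalised far less than ten separate single-position gaps, which matches the idea that a single mutation event can create a long gap.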
Basic Sequence Comparison
Smith & Waterman (1981): dynamic programming for sequence comparison
O(mn): every cell of an m × n matrix is filled
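The O(mn) cost comes from filling one matrix cell per pair of positions. The following is a minimal sketch of the local-alignment recurrence. To stay short it assumes a simple match/mismatch score and a fixed per-position gap penalty instead of a substitution matrix and the -(a + bk) gap score, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Simplified scoring (an assumption of this sketch): +match, mismatch,
    and a fixed penalty per gapped position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # H[i][j]: best score ending at (i, j)
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap          # gap in b
            left = H[i][j - 1] + gap        # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 lets a local alignment restart
            best = max(best, H[i][j])
    return best
```

The two nested loops visit all m × n cells, which is what makes a full database scan with this method expensive.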
Complexity issue
When DBs become larger, m grows
Time complexity
Space complexity
Intuition to Solution
Go over less than the whole matrix
Put the spotlight on segments that can be part of the best path and extend them.
The best path is close to a diagonal
Less than O(mn)
Heuristic procedures
Heuristic: An algorithm that usually, but not always, works, or that gives nearly the right answer.
There is no guarantee to find the best match.
BLAST – Basic Local Alignment Search Tool
BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan is O(n).
Each hit is extended in both directions as long as the score hasn’t dropped too much.
[Figure: dot matrix of query against database sequence, with each word hit marked by an x]
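The two stages, finding hits and extending them, can be sketched as follows. This is a simplified illustration and not BLAST's implementation: it only finds identical words (real BLAST also accepts non-identical words that score at least T), it uses match/mismatch scores instead of a substitution matrix, and the names `find_hits`, `extend_hit` and `x_drop` are choices made for this sketch.

```python
from collections import defaultdict

def find_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word of length w matches."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)       # index the query words once
    hits = []
    for j in range(len(subject) - w + 1):     # single pass over the subject
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits

def _extend_one_way(query, subject, qi, sj, step, x_drop, match, mismatch):
    """Best score gained by walking along the diagonal in one direction."""
    gain = best = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        if gain > best:
            best = gain
        elif best - gain > x_drop:            # score dropped too much: stop
            break
        qi += step
        sj += step
    return best

def extend_hit(query, subject, i, j, w=3, x_drop=3, match=2, mismatch=-1):
    """Ungapped score of the hit at (i, j) after extending both ways."""
    seed = sum(match if query[i + k] == subject[j + k] else mismatch
               for k in range(w))
    right = _extend_one_way(query, subject, i + w, j + w, 1, x_drop, match, mismatch)
    left = _extend_one_way(query, subject, i - 1, j - 1, -1, x_drop, match, mismatch)
    return seed + left + right

hits = find_hits("MKVLAAGIT", "XXKVLAAGYY")
score = extend_hit("MKVLAAGIT", "XXKVLAAGYY", *hits[0])
```

Indexing the query words once and then making a single pass over the database sequence is what keeps the scanning stage linear in the database length.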
BLAST
A word about the parameter T
Small T: greater sensitivity, more hits to expand
Large T: lower sensitivity, fewer hits to expand
Gapped BLAST
The original BLAST was un-gapped
Soon after came gapped BLAST
BLAST - Results
P value – The probability of an alignment occurring with score S or better.
E value – Expectation value. The number of different alignments with scores S or better that are expected to occur in this DB search by chance.
A lower E value means a more significant score.
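The relation between a raw score S, the E value and the P value can be sketched with the standard Karlin-Altschul form E = K·m·n·e^(-λS), with P = 1 - e^(-E) as the probability of seeing at least one such alignment by chance. In this sketch m and n stand for the query length and the database length, and the values λ = 0.267 and K = 0.041 in the usage lines are illustrative assumptions; real values depend on the substitution matrix and gap costs.

```python
import math

def bit_score(raw, lam, K):
    """Normalized score S' = (lam*S - ln K) / ln 2, in bits."""
    return (lam * raw - math.log(K)) / math.log(2)

def e_value(raw, m, n, lam, K):
    """Expected number of chance alignments scoring at least `raw`."""
    return K * m * n * math.exp(-lam * raw)

def p_value(e):
    """Probability of at least one chance alignment, given E."""
    return 1.0 - math.exp(-e)

LAM, K = 0.267, 0.041            # illustrative parameters (assumption)
e_low = e_value(100, 300, 1_000_000, LAM, K)
e_high = e_value(120, 300, 1_000_000, LAM, K)
```

For small E the two measures are nearly equal, which is why E values below about 0.01 are often read directly as probabilities.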
E-value and Homology
A non-significant score does not necessarily imply non-homology:
E-value and Homology
Use it wisely
Choose your Substitution Matrix
Choose your DB
Example 1: remote homology
Frequently, identification of a remote homology will require several database searches.
The glutathione transferase family
Remote homology
Remote homology
Testing the possibility that elongation factors share homology with glutathione S-transferases:
There is a clear relationship between this elongation factor and the class-theta glutathione transferases.
Example 2 - mapping
Three different families of G-protein coupled receptors:
the R family (the largest)
the C/S family
the G receptor family
Finding links between families
E-value    Score  Name
0          2347   OPSD_HUMAN RHODOPSIN.
0          1791   OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO
0          1002   OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO

3.10E-30   527    OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE
1.10E-23   435    NK2R_MOUSE SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ
1.50E-23   431    SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5.
3.50E-22   419    TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR.
6.40E-14   283    5H7_HUMAN 5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)
8.50E-14   280    CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR-
1.50E-13   278    ETBR_RAT ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E
1.60E-13   276    AA2B_RAT ADENOSINE A2B RECEPTOR.

0.007      133    MAS_MOUSE MAS PROTO-ONCOGENE.
0.007      130    PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA
0.009      135    OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12.
0.01       131    MAS_RAT MAS PROTO-ONCOGENE.
0.01       130    CAR1_DICDI CYCLIC AMP RECEPTOR 1
0.02       129    OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2.
0.05       124    CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0.06       120    MAS_HUMAN MAS PROTO-ONCOGENE.
0.17       117    OLF1_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR1.
0.23       121    PER2_MOUSE PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.
Finding links between families
E-value    Score  Family  Name
0          2678           CAR1_DICDI CYCLIC AMP RECEPTOR 1.
0          1524           CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0          1497           CAR2_DICDI CYCLIC AMP RECEPTOR 2.

0.00042    167    C/S     CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.00073    161    R       IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.00087    162    C/S     CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)
0.00095    162    C/S     CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)

0.0045     150    C/S     DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R).
0.012      145    C/S     CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.012      145    C/S     GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R).
0.016      141    R       IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.022      139    R       RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG
0.061      133    R       G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D
0.085      130    R       OPSD_HUMAN RHODOPSIN.
0.098      131    C/S     VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP

0.11       129    R       OPSD_SPHSP OPSIN.
0.13       129    C/S     SCRC_RAT SECRETIN RECEPTOR PRECURSOR.
0.14       127    R       IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A
0.16       143.1  C/S     GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO
0.16       126    R       AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2
Building Proteins tree
Conclusions
Searching again with high-scoring sequences, whether related or unrelated, is a very important tool.
Homology is a transitive relation…
BLAST – Pros & Cons
Pros: It works
Cons:
The evaluation is statistical rather than biological.
Convergent evolution
Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this)
BLAST improvements
Running time improvements:
Two-hit method
Seed extension
PSI-BLAST
The two-hit method
The extension step accounts for more than 90% of BLAST’s execution time
Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
[Figure: dot matrix of hits, with a first hit, a second hit, and the resulting two-hit extension marked]
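The two-hit filter can be sketched as a pass over the list of hits. In gapped BLAST the two hits must lie on the same diagonal (equal difference between database position and query position); the sketch makes that explicit. The word length `w=3` and the distance `window=40` are illustrative defaults for this sketch, and `two_hit_seeds` is a name invented here.

```python
def two_hit_seeds(hits, w=3, window=40):
    """Return pairs (first_hit, second_hit) that should trigger an extension.

    hits   : iterable of (query_pos, subject_pos) word hits
    w      : word length, used to reject overlapping hits
    window : maximum distance between the two hits on a diagonal
    """
    last_on_diagonal = {}          # diagonal -> subject position of latest hit
    seeds = []
    for i, j in sorted(hits, key=lambda h: h[1]):
        d = j - i                  # hits on one diagonal share this value
        prev = last_on_diagonal.get(d)
        if prev is not None:
            distance = j - prev
            if distance < w:       # overlaps the earlier hit: ignore it
                continue
            if distance <= window:
                seeds.append(((prev - d, prev), (i, j)))
        last_on_diagonal[d] = j
    return seeds

seeds = two_hit_seeds([(1, 2), (2, 3), (3, 4), (4, 5)])
```

Only the pairs returned here would be passed on to the costly extension step; isolated hits are dropped.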
The two-hit method
Seed Extension
PSI-BLAST
Evolution pressure
Needle in a haystack
PSI-BLAST comes to solve this problem
Evolution reveals itself
Give more significance to the conserved areas and ignore the background noise
PSI-BLAST (Position-Specific Iterated BLAST) shifts our view to these areas using the Position-Specific Score Matrix (PSSM)
Position-Specific Score Matrix - PSSM
Pij = a value proportional to the probability of finding the ith amino acid in the jth position in these sequences
PSSM
Represents the distribution of the amino acids in each position in a collection of sequences
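A minimal sketch of how such a matrix can be derived from a set of aligned, equal-length sequences is shown below. It is not PSI-BLAST's actual procedure, which also weights sequences and uses more careful pseudocounts; the flat `pseudocount`, the log-odds form and the four-letter toy alphabet in the usage lines are assumptions made to keep the example small.

```python
import math
from collections import Counter

def build_pssm(aligned, background, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    aligned    : list of equal-length strings without gaps
    background : {residue: background frequency}
    """
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(freq / background[residue])
        pssm.append(scores)
    return pssm

toy_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["ACA", "ACG", "ACT"], toy_background)
```

In this layout `pssm[j][residue]` plays the role of Pij: conserved columns get strongly positive scores for the conserved residue, and variable columns get scores near zero.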
Steps in PSI-BLAST
Initiation:
Running gapped BLAST on the query, outputting a collection of matching sequences
Iteration:
Constructing the PSSM based on the best sequences in this collection
The PSSM is compared to the protein DB, again seeking alignments
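The loop of building a profile, searching, and rebuilding can be sketched with a toy example. Everything here is a simplification for illustration: the alphabet has four letters, matching is ungapped, a fixed score `threshold` stands in for the E-value threshold, and the function names are invented for this sketch.

```python
import math
from collections import Counter

ALPHABET = "ACGT"   # toy alphabet (assumption)

def build_pssm(segments, pseudocount=1.0):
    """Log-odds profile from equal-length segments, uniform background."""
    pssm = []
    for column in zip(*segments):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({a: math.log2(((counts[a] + pseudocount) / total) / 0.25)
                     for a in ALPHABET})
    return pssm

def best_window(pssm, seq):
    """Best ungapped (score, window) of the profile against seq."""
    width = len(pssm)
    best = (-math.inf, None)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col[c] for col, c in zip(pssm, window))
        if score > best[0]:
            best = (score, window)
    return best

def iterate_search(query, database, threshold, max_rounds=5):
    """Repeat: build profile from included segments, search, add new matches."""
    included = {query}
    for _ in range(max_rounds):
        pssm = build_pssm(sorted(included))
        found = set(included)
        for seq in database:
            score, window = best_window(pssm, seq)
            if window is not None and score >= threshold:
                found.add(window)
        if found == included:      # converged: nothing new was found
            break
        included = found
    return included

db = ["ACGTAA", "ACGAAA", "TTTTTT"]
one_pass = iterate_search("ACGTAC", db, threshold=3.0, max_rounds=1)
converged = iterate_search("ACGTAC", db, threshold=3.0)
```

In this toy run the second database sequence is missed by the first pass and picked up only after the profile has absorbed the first match, which is the effect the iteration is meant to achieve. The same mechanism would equally pull in an unrelated sequence once a wrong match has been included.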
PSI-BLAST Example
We start with an uncharacterized protein – MJ0414
When submitting the query we set the E-value threshold to 0.01 (higher than usual)
Result of initial gapped BLAST
First iteration –
Iterating the search using the derived profile uncovers DNA ligase II with E-value of 0.005
Second iteration –
Interpretation of the results
Including a high-scoring but unrelated protein will shift the PSSM in its direction
E-values retrieved in later iterations should not be taken as automatic proof of homology
Was the ligase a right choice?
PSI-BLAST Conclusions
Uncovers protein relationships missed by single-pass database-search methods
Errors are easily amplified by iterations.
PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret
Running time evaluation
Running time can be highly influenced by modifying parameters
Program           Normalized running time
Smith Waterman    36
Original BLAST    1.0
Gapped BLAST      0.34
PSI BLAST         0.87
Future Improvements
Accepting PSSM as input from other programs
Realignment – improve the alignment before going over the DB
Automatic domain recognition
Summary
In BLAST use multiple searches for maximum knowledge
The BLAST improvements are considerably faster and significantly enhance the abilities of DB search
For many queries, PSI-BLAST can greatly increase sensitivity to weak but biologically relevant sequence relationships
Questions time
Thank You
References
Pearson WR. (1997) Identifying distantly related protein sequences. Comput Appl Biosci., 13, 325-332
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402
Altschul SF, Koonin EV. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem Sci., 23, 444-447
Sites
http://www.ncbi.nlm.nih.gov/BLAST
http://www.cs.huji.ac.il/~cbio
http://www.people.virginia.edu/~wrp/
http://www-lmmb.ncifcrf.gov/
Appendix - Statistics
\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]
\[ S' = \log_2 \frac{N}{E} \]
\[ N = mn \]
\[ E = \frac{N}{2^{S'}} \]