Upload
jonah-reynolds
View
217
Download
1
Tags:
Embed Size (px)
Citation preview
Fast Sequence SearchMultiple Sequence Alignment
Xiaole Shirley Liu
STAT115/STAT215, 2010
STAT115
Outline
• Fast sequence search– BLAST, statistical significance– BLAST programs– BLAT
• Global MSA– ClustalW– ClustalW features– ClustalW example
STAT115
Fast Sequence Similarity Search
• Uses:– Map a sequence to sequenced genome– Infer unknown sequence function– Find family of proteins in an organism– Find homolog/ortholog in other organisms– Find sequence mutations or variations (SNP)
Sequence DBQuery
STAT115
Fast Similar Sequence Search
• Can we run Smith-Waterman between query and every DB sequence?
• Yes, but too slow!• General approach
– Break query and DB sequence to match subsequences
– Extend the matched subsequences, filter hopeless sequences
– Use dynamic programming to get optimal alignment
STAT115
BLAST
• Basic Local Alignment Search Tool• Altschul et al. J Mol Biol. 1990• One of the most widely used bioinformatics
applications– Alignment quality not as good as Smith-Waterman
– But much faster, supported at NCBI with big computer cluster
• For tutorials or information:http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
STAT115
BLAST Algorithm Steps
• Query and DB sequences are optionally filtered to remove low-complexity regions– E.g. ACACACACA, TTTTTTTTT
STAT115
BLAST Algorithm Steps
• Query and DB sequences are optionally filtered to remove low-complexity regions
• Break DB sequences into k-mer words and hash their locations to speed later searches– k is usually 11 for DNA/RNA and 3 for protein
LPPQGLL
LPP
PPQ
PQG
QGL
GLL
STAT115
BLAST Algorithm Steps
• Query and DB sequences are optionally filtered to remove low-complexity regions
• Break DB sequences into k-mer words and hash their locations to speed later searches
• Each k-mer in query find possible k-mers that matches well with it– “well” is evaluated by substitution matrices
STAT115
Scoring K-mer Matches
P E G
P Q G
7 + 2 + 6 = 15
BLOSUM62
STAT115
BLAST Algorithm Steps
• Only words with T cutoff score is kept– T is usually 11-13, ~ 50 words make T cutoff
– Note: this is 50 words at every query position
• For each DB sequence with a high scoring word, try to extend it in both ends
Query: LP PQG LL
DB seq: MP PEG LL
HSP score 9 + 15 + 8 = 32
– Form HSP (High-scoring Segment Pairs)
– Use BLOSUM to score the extended alignment
– No gaps allowed
STAT115
BLAST Algorithm Steps
• Keep only statistically significant HSPs– Based on the scores of aligning 2 random seqs
• Use Smith-Waterman algorithm to join the HSPs and get optimal
alignment– Gaps are allowed
default (-11, -1)
STAT115
Statistical Significance
• Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994)
]exp[1)( xexsp
Probability that a random alignment gets score like this or better pvalue
Digression: hypothesis testing
• Null hypothesis H0 (“nothing special”)
– Like a “defendant” presumed innocent
• Alternative hypothesis HA
– Proven guilty if overwhelming evidences are present
STAT115
Tongji 200914
Two Sample t-test• Statistical significance in the two sample problem
Group 1: X1, X2, … Xn1
Group 2: Y1, Y2, … Yn2
• If Xi ~ Normal (μ1, σ12),
Yi ~ Normal (μ2, σ22)
• Null hypothesis of μ1= μ2
• Use Welch-t statistic• Check T table for p-val• A gene with small p-val
(very big or small t) – Reject null– Significant difference between normal and MM
2221
21 //
)(
nsns
YXt
Tongji 200915
Permutation Test
• Non-parametric method for p-val calculation– Do not assume normal expression distribution
– Do not assume the two groups have equal variance
• Randomly permute sample label, calculate t to form the empirical null t distribution– For MM-study, (14 choose 5) = 2002 different t values
from permutation
• If the observed t extremely high/low differential expression with statistical significance
Tongji 200916
Permutation Technique
Condition 0 Condition 1
Patient 4 Patient 2 Patient 3 Patient 1 Patient 5 Patient 6
Condition 0 Condition 1
Patient 1 Patient 2 Patient 5 Patient 4 Patient 3 Patient 6
Condition 0 Condition 1
Patient 1 Patient 6 Patient 3 Patient 4 Patient 5 Patient 2
Condition 0 Condition 1
Patient 1 Patient 2 Patient 3 Patient 4 Patient 5 Patient 6Compute T0
Compute T1
Compute T2
Compute T3
Compare T0 to T* set
STAT115
Statistical Significance
• Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994)
]exp[1)( xexsp
Probability that a random alignment gets score like this or better pvalue
STAT115
Statistical Significance
• Actual alignment score S can be normalized
– m, n are query and DB length– K, are constants
• Depends on substitution matrix and sequence composition
• For typical amino acid and PAM250 matrix
K = 0.09, = 0.229
KmnSs ln
STAT115
Statistical Significance
• Normalized score s can be used to get p-value
– When x > 2, probability can be approximated
• Another quick check, raw score S/3
xexsp )(
Ssignificant
/3 log2(nm)
]exp[1)( xexsp
STAT115
BLAST Reporting
• Report DB sequences above a threshold– E value: Number (instead of probability
pvalue) of matches expected merely by chance
• Usually [0.05, 10] threshold• Smaller E, more stringent
– User selected (just for display): e.g. top 10, 50, 100
SeKmnE
STAT115
Different BLAST Programs
• If query is DNA, but known to be coding (e.g. cDNA)– Translate
cDNA into protein
– Zero gap-extension penalty
STAT115
Psi-BLAST
• Position Specific Iterative BLAST– Align high scoring hits in initial BLAST to
construct a profile for the hits– Use profile for next iteration BLAST
• Find remote homologs or protein families
• FP sequences can degrade search quickly
QuerySeq1
Seq2
Seq4
Seq3
STAT115
Reciprocal Blast
• Search for orthologous sequences between two species– Orthologs: genes related by vertical descent
from a common ancestor and encode proteins with the same function in different species
– Paralogs: homologous genes evolved by duplication and code for protein with similar but not identical functions
– Finding the correct orthologous sequence is very important in comparative genomics
STAT115
Reciprocal Blast
• Search for orthologous sequences between two species– Orthologs: genes related by vertical descent
from a common ancestor and encode proteins with the same function in different species
– GeneA in Species1 BLAST Species2 GeneB– GeneB in Species2 BLAST Species1 GeneA
– GeneA GeneB
• Also called bi-directional best hit
orthologous
STAT115
BLAT• BLAST-Like Alignment Tool
– Compare to BLAST, BLAT can align much longer regions (MB) really fast with little resources
– E.g. can map a sequence to the genome in seconds on one Linux computer
– Allow big gaps (mRNA to genome)– Need higher similarity (> 95% for DNA and 80% for
proteins) for aligned sequences
• Basic approach– Break long sequence into blocks– Index k-mers, typically 8-13– Stitch blocks together for final alignment
STAT115
BLAT: Indexing
Genome: cacaattatcacgaccgc3-mers: cac aat tat cac gac cgcIndex: aat 3 gac 12 cac 0,9 tat 6 cgc 15cDNA (mRNA -> DNA): aattctcac3-mers: aat att ttc tct ctc tca cac 0 1 2 3 4 5 6hits: aat 0,3 -3 cac 6,0 6 cac 6,9 -3clump: cacAATtatCACgaccgc
STAT115
BLAT Example
• Get result instantly!!
STAT115
Multiple Sequence Alignment
• MSA Uses:– Establish evolutionary relationships (global)
– Find conserved nucleotides and amino acids (global)
– Characterize signature protein patterns or motifs (local)
– Find acceptable substitutions (local)
• Protein MSA gold standard: structural alignment
STAT115
Progressive MSA Method• Progressive:
– Heuristic algorithm: approximation strategy, do not aim at perfect
– Build alignment with most related sequences, progressively add less-related to the alignment
– Often manual examination can improve alignments
• ClustalW, NAR 1994– W stands for weighting: more distant seqs weigh more– Reflect evolutionary distance
STAT115
ClustalW Steps
• Global pairwise alignment for all pairs
• Calculate pairwise sequence distances– Approximate evolutionary distance
STAT115
ClustalW Steps
• Construct a tree based on sequence distances– E.g. solve the following matrix
A B C D
A 4 6 2
B 4 4
C 6
D
A
C
B
1 32
1 1D
STAT115
ClustalW Steps
• Progressively add sequences/alignments by the tree order– Starting from the smallest distance– Add seq to seq, seq to align, align to align
• AD form new node E, calc AE, DE distance
• Calc E consensus, weighted
by AE DE distance
• Calc B, C, E pairwise
distance
• BE form new node F…
A
C
B
1 32
1 1DE F
STAT115
ClustalW Features: Consensus
• Consensus is used to represent the aligned sequences; if different, find AA to maximize score
• Final score weighted based on branch length
– Weight for A = a = 0.2 + 0.3 / 2 + 0.3/3 = 0.45
– Weight for B = b = 0.1 + 0.3 / 2 + 0.3/3 = 0.35
– Weight for C = c = 0.5 + 0.3/3 = 0.6
A
B
C
0.2
0.10.3
0.50.3
STAT115
Scoring an Alignment
Sequence A (weight a) …K…
Sequence B (weight b) …I…
Merge and align toSequence C (weight c) …L…
Sequence D (weight d) …V…
Score for aligning the column[a c Score(K,L) + a d Score(K,V) + b c Score(I,L) + b d Score(I,V)] / 4
STAT115
ClustalW Features: Gaps
• Sequence specific gap penalties – Penalize gaps more in segments that are less
likely to have gaps
STAT115
Progressive Alignment Limitations
• Gaps can proliferate, if not carefulAlign1: ABCD-E ABC-D-E
Align2: ABC-DE ABC-D-E
• Need many heuristic parameters
• Does not guarantee global optimum
• Errors in initial alignments are propagated
• Manual improvements:– Shift residues from one side of gap to the other– Reduce gaps
STAT115
ClustalW Alignment
* - identical
: - conserved
. - semi-conserved
STAT115
ClustalW Tree
Branch length
~ distance
0.023180.01824
0.415960.12694
0.02011
0.10523
0.01147
STAT115
Summary• Fast sequence similarity search
– Break seq, hash DB sub-seq, match sub-seq and extend, use DP for optimal alignment
– *BLAST, most widely used, many applications with sound statistical foundations
– *BLAT, align sequence to genome, fast yet need higher similarity
• Protein global MSA– Progressive heuristic alignment
– ClustalW: pairwise, tree, merge alignments
– Merge with minimum edit, sequence weighting, sequence/position specific gaps
STAT115
Acknowledgment
• David Mount
• Aoife McLysaght
• Ir. Brecht Claerhout