Fast Sequence Search Multiple Sequence Alignment Xiaole Shirley Liu STAT115/STAT215, 2010


Outline

• Fast sequence search
– BLAST, statistical significance
– BLAST programs
– BLAT

• Global MSA
– ClustalW
– ClustalW features
– ClustalW example

Fast Sequence Similarity Search

• Uses:
– Map a sequence to a sequenced genome
– Infer the function of an unknown sequence
– Find a family of proteins in an organism
– Find homologs/orthologs in other organisms
– Find sequence mutations or variations (SNPs)

[Figure: a query sequence searched against a sequence DB]

Fast Similar Sequence Search

• Can we run Smith-Waterman between the query and every DB sequence?
• Yes, but it is too slow!
• General approach
– Break the query and DB sequences into subsequences and find the ones that match
– Extend the matched subsequences, and filter out hopeless sequences
– Use dynamic programming to get the optimal alignment
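To see why running Smith-Waterman against every DB sequence is too slow, a minimal sketch of the dynamic program may help: it fills one table cell per (query residue, DB residue) pair, so each comparison costs time proportional to the product of the two lengths. The match, mismatch, and gap values are illustrative assumptions, not BLAST defaults, and a linear gap penalty is assumed for brevity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept because the sketch returns the score alone; recovering the alignment itself would need the full table or a traceback.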

BLAST

• Basic Local Alignment Search Tool
• Altschul et al. J Mol Biol. 1990
• One of the most widely used bioinformatics applications
– Alignment quality not as good as Smith-Waterman
– But much faster, supported at NCBI with a big computer cluster
• For tutorials or information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

BLAST Algorithm Steps

• Query and DB sequences are optionally filtered to remove low-complexity regions
– E.g. ACACACACA, TTTTTTTTT

• Break DB sequences into k-mer words and hash their locations to speed up later searches
– k is usually 11 for DNA/RNA and 3 for protein
– E.g. LPPQGLL gives the 3-mers LPP, PPQ, PQG, QGL, GLL
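The word-breaking and hashing step can be sketched as follows. This is a minimal illustration, not BLAST's actual data structure; the function names and the one-sequence DB named `seq1` are assumptions made for the example.

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping k-mer words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(db, k):
    """db: dict of name -> sequence. Returns word -> list of (name, offset)."""
    index = defaultdict(list)
    for name, seq in db.items():
        for offset, word in enumerate(kmers(seq, k)):
            index[word].append((name, offset))
    return index

words = kmers("LPPQGLL", 3)                   # the 3-mers of the slide example
index = build_index({"seq1": "MPPEGLL"}, 3)   # hypothetical one-sequence DB
```

Looking up a word in `index` is then a constant-time hash lookup instead of a scan over the DB.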

• For each k-mer in the query, find the possible k-mers that match well with it
– “Well” is evaluated by substitution matrices

Scoring K-mer Matches

P E G
P Q G
7 + 2 + 6 = 15 (BLOSUM62)
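The word score above is a position-by-position sum of substitution scores. A small sketch, assuming a hand-typed subset of BLOSUM62 that holds only the pairs needed here (a real search would load the full matrix):

```python
# Partial BLOSUM62: only the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("E", "Q"): 2, ("G", "G"): 6,
    ("Q", "Q"): 5, ("E", "E"): 5,
}

def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Symmetric lookup: the matrix stores each unordered pair once."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]

def word_score(w1, w2):
    """Sum of substitution scores over aligned positions, no gaps."""
    return sum(pair_score(x, y) for x, y in zip(w1, w2))

score = word_score("PQG", "PEG")   # 7 + 2 + 6
```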

• Only words scoring at least the cutoff T are kept
– T is usually 11-13; about 50 words make the T cutoff
– Note: this is 50 words at every query position

• For each DB sequence with a high-scoring word, try to extend it at both ends
– Form HSPs (High-scoring Segment Pairs)
– Use BLOSUM to score the extended alignment
– No gaps allowed

Query:  LP PQG LL
DB seq: MP PEG LL
HSP score: 9 + 15 + 8 = 32
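The ungapped extension can be sketched as below. The stopping rule (give up once the running score falls more than `x_drop` below the best seen) and the -4 score for unlisted pairs are assumptions made for this illustration; the listed scores are the BLOSUM62 values used in the slide example.

```python
# BLOSUM62 values for the pairs in the example; others default to -4 (assumed).
SCORES = {("L", "L"): 4, ("L", "M"): 2, ("P", "P"): 7, ("E", "Q"): 2, ("G", "G"): 6}

def score(x, y):
    return SCORES.get((x, y), SCORES.get((y, x), -4))

def extend_ungapped(query, target, q_start, t_start, k, x_drop=5):
    """Extend a k-long seed hit left and right without gaps.
    Returns (hsp_score, q_begin, q_end), with q_end exclusive."""
    seed = sum(score(query[q_start + i], target[t_start + i]) for i in range(k))
    best_r = run = r_len = 0
    i = 0
    while q_start + k + i < len(query) and t_start + k + i < len(target):
        run += score(query[q_start + k + i], target[t_start + k + i])
        i += 1
        if run > best_r:
            best_r, r_len = run, i
        elif best_r - run > x_drop:
            break
    best_l = run = l_len = 0
    i = 1
    while q_start - i >= 0 and t_start - i >= 0:
        run += score(query[q_start - i], target[t_start - i])
        if run > best_l:
            best_l, l_len = run, i
        elif best_l - run > x_drop:
            break
        i += 1
    return seed + best_l + best_r, q_start - l_len, q_start + k + r_len

hsp = extend_ungapped("LPPQGLL", "MPPEGLL", 2, 2, 3)   # seed PQG / PEG
```

With the seed PQG/PEG at offset 2 in both sequences, the extension covers the whole example and reproduces the HSP score of 9 + 15 + 8.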

• Keep only statistically significant HSPs
– Based on the scores of aligning 2 random seqs

• Use the Smith-Waterman algorithm to join the HSPs and get the optimal alignment
– Gaps are allowed; default penalties (-11, -1) for gap opening and gap extension

Statistical Significance

• Local similarity scores follow an extreme value distribution (Altschul et al, Nat. Genet. 1994)

$$ p(s \ge x) = 1 - \exp[-e^{-x}] $$

– Probability that a random alignment gets a score like this or better: the p-value

Digression: hypothesis testing

• Null hypothesis H0 (“nothing special”)

– Like a “defendant” presumed innocent

• Alternative hypothesis HA

– Proven guilty only if overwhelming evidence is present

Tongji 2009

Two Sample t-test

• Statistical significance in the two-sample problem
Group 1: X_1, X_2, …, X_{n_1}
Group 2: Y_1, Y_2, …, Y_{n_2}
• If X_i ~ Normal(μ_1, σ_1^2) and Y_i ~ Normal(μ_2, σ_2^2)
• Null hypothesis: μ_1 = μ_2
• Use the Welch t statistic

$$ t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}} $$

• Check the t table for the p-value
• A gene with a small p-value (very big or very small t)
– Reject the null
– Significant difference between normal and MM
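A minimal sketch of the Welch t statistic, using only the standard library. The sample values in the usage line are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic; variance() is the sample variance (n - 1 denominator)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

t = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # hypothetical expression values
```

Turning t into a p-value additionally needs the Welch-Satterthwaite degrees of freedom and a t distribution, which this sketch leaves out.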

Permutation Test

• Non-parametric method for p-value calculation
– Does not assume a normal expression distribution
– Does not assume the two groups have equal variance
• Randomly permute the sample labels and calculate t to form the empirical null t distribution
– For the MM study, (14 choose 5) = 2002 different t values from permutation
• If the observed t is extremely high/low → differential expression with statistical significance

Permutation Technique

[Figure: six patients split between Condition 0 and Condition 1. T0 is computed from the original labels (Patient 1 to Patient 6); T1, T2, T3, … are computed after shuffling the labels (e.g. Patient 4, 2, 3, 1, 5, 6); T0 is then compared to the set of permuted statistics T*.]
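The permutation procedure can be sketched by enumerating every relabeling, which is feasible for small studies (14 samples with 5 in one group give the 2002 relabelings mentioned earlier). The two-sided comparison on |t| and the toy data are assumptions of this sketch; groups need at least two distinct values each so that the variances are non-zero.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_p_value(x, y):
    """Two-sided p-value from all relabelings of the pooled samples.
    Returns (p_value, number_of_relabelings)."""
    pooled = list(x) + list(y)
    n = len(pooled)
    t_obs = abs(welch_t(x, y))                       # T0 from the true labels
    extreme = total = 0
    for idx in combinations(range(n), len(x)):       # every choice of "group 1"
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(welch_t(gx, gy)) >= t_obs - 1e-12:    # tolerance for float ties
            extreme += 1
    return extreme / total, total

p, n_perm = permutation_p_value([10, 11, 12], [1, 2, 3])   # toy data
```

For larger studies the full enumeration is usually replaced by a random sample of relabelings.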

Statistical Significance

• Local similarity scores follow an extreme value distribution (Altschul et al, Nat. Genet. 1994)

$$ p(s \ge x) = 1 - \exp[-e^{-x}] $$

– Probability that a random alignment gets a score like this or better: the p-value

• Actual alignment score S can be normalized

$$ s = \lambda S - \ln(Kmn) $$

– m, n are the query and DB lengths
– K, λ are constants
• Depend on the substitution matrix and sequence composition
• For typical amino acid composition and the PAM250 matrix: K = 0.09, λ = 0.229

• Normalized score s can be used to get the p-value

$$ p(s \ge x) = 1 - \exp[-e^{-x}] $$

– When x > 2, the probability can be approximated

$$ p(s \ge x) \approx e^{-x} $$

• Another quick check uses the raw score S/3

$$ S_{\text{significant}} / 3 > \log_2(nm) $$
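The normalization and the extreme value p-value can be sketched in a few lines. The default constants are the PAM250 values quoted in these slides; the raw score and sequence lengths in the usage lines are made-up inputs.

```python
from math import exp, expm1, log

def normalized_score(S, m, n, K=0.09, lam=0.229):
    """s = lambda * S - ln(K m n), with m and n the query and DB lengths."""
    return lam * S - log(K * m * n)

def p_value(s):
    """p(score >= s) = 1 - exp(-e^{-s}); expm1 keeps precision for tiny p."""
    return -expm1(-exp(-s))

s = normalized_score(S=100, m=250, n=1_000_000)   # hypothetical search
p = p_value(s)
```

For s above about 2 the result is close to exp(-s), which is the approximation quoted above.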

BLAST Reporting

• Report DB sequences above a threshold
– E value: number (instead of probability p-value) of matches expected merely by chance

$$ E = Kmn\,e^{-\lambda S} $$

• Usually a threshold in [0.05, 10]
• Smaller E, more stringent
– User selected (just for display): e.g. top 10, 50, 100
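A short sketch of the E value and its link to the p-value. Since the normalized score is s = λS - ln(Kmn), the E value equals e^{-s} and the p-value equals 1 - exp(-E). The default constants are the PAM250 values from these slides, and the inputs in the usage line are made up.

```python
from math import exp, expm1

def e_value(S, m, n, K=0.09, lam=0.229):
    """Expected number of chance matches scoring at least S."""
    return K * m * n * exp(-lam * S)

def p_from_e(E):
    """p = 1 - exp(-E); nearly equal to E when E is small."""
    return -expm1(-E)

E = e_value(S=100, m=250, n=1_000_000)   # hypothetical search
```

This is why E and p are interchangeable for strong hits, while for weak hits E can exceed 1 and p cannot.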

Different BLAST Programs

• If the query is DNA but known to be coding (e.g. cDNA)
– Translate cDNA into protein
– Zero gap-extension penalty

Psi-BLAST

• Position Specific Iterative BLAST
– Align high-scoring hits from the initial BLAST to construct a profile for the hits
– Use the profile for the next BLAST iteration
• Finds remote homologs or protein families
• False-positive (FP) sequences can quickly degrade the search

[Figure: Query aligned with hits Seq1, Seq2, Seq3, Seq4]

Reciprocal Blast

• Search for orthologous sequences between two species
– Orthologs: genes related by vertical descent from a common ancestor that encode proteins with the same function in different species
– Paralogs: homologous genes that evolved by duplication and code for proteins with similar but not identical functions
– Finding the correct orthologous sequence is very important in comparative genomics

– GeneA in Species1 → BLAST against Species2 → best hit GeneB
– GeneB in Species2 → BLAST against Species1 → best hit GeneA
– GeneA ↔ GeneB: orthologous
• Also called bi-directional best hit
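The bi-directional best hit rule can be sketched on top of precomputed BLAST results. The nested-dict input format, the gene names, and the scores are all hypothetical; parsing real BLAST output is outside this sketch.

```python
def best_hit(hits):
    """hits: dict of subject gene -> score. Returns the top-scoring subject."""
    return max(hits, key=hits.get) if hits else None

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (gene_a, gene_b) that are each other's best hit."""
    pairs = []
    for gene_a, hits in a_vs_b.items():
        gene_b = best_hit(hits)
        if gene_b is not None and best_hit(b_vs_a.get(gene_b, {})) == gene_a:
            pairs.append((gene_a, gene_b))
    return pairs

# Hypothetical scores: Species1 genes searched against Species2, and back.
sp1_vs_sp2 = {"geneA": {"geneB": 250, "geneC": 90}, "geneX": {"geneB": 120}}
sp2_vs_sp1 = {"geneB": {"geneA": 245, "geneX": 118}, "geneC": {"geneA": 88}}
orthologs = reciprocal_best_hits(sp1_vs_sp2, sp2_vs_sp1)
```

Here `geneX` is dropped because its best hit, `geneB`, points back to `geneA` instead.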

BLAT

• BLAST-Like Alignment Tool
– Compared to BLAST, BLAT can align much longer regions (MB) really fast with few resources
– E.g. can map a sequence to the genome in seconds on one Linux computer
– Allows big gaps (mRNA to genome)
– Needs higher similarity (> 95% for DNA and 80% for proteins) for aligned sequences
• Basic approach
– Break long sequence into blocks
– Index k-mers, typically 8-13
– Stitch blocks together for final alignment

BLAT: Indexing

Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc
Index: aat 3; gac 12; cac 0,9; tat 6; cgc 15
cDNA (mRNA -> DNA): aattctcac
3-mers (overlapping): aat att ttc tct ctc tca cac
Positions:            0   1   2   3   4   5   6
Hits (word: cDNA position, genome position, difference): aat 0,3 (-3); cac 6,0 (6); cac 6,9 (-3)
Clump: cacAATtatCACgaccgc
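The indexing example can be reproduced with a short sketch: non-overlapping k-mers of the genome go into the index, and every overlapping k-mer of the cDNA is looked up. The function names are assumptions; hits that share the same difference (cDNA position minus genome position) lie on one diagonal and form a clump.

```python
from collections import defaultdict

def index_genome(genome, k):
    """Index the non-overlapping k-mers of the genome by start position."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query.
    Returns (word, query_pos, genome_pos, query_pos - genome_pos) tuples."""
    hits = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for gpos in index.get(word, []):
            hits.append((word, qpos, gpos, qpos - gpos))
    return hits

index = index_genome("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
clump = [h for h in hits if h[3] == -3]   # hits on the shared diagonal
```

Indexing only non-overlapping genome words keeps the index about k times smaller, at the cost of needing overlapping lookups on the query side.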


BLAT Example

• Get result instantly!!

Multiple Sequence Alignment

• MSA Uses:
– Establish evolutionary relationships (global)
– Find conserved nucleotides and amino acids (global)
– Characterize signature protein patterns or motifs (local)
– Find acceptable substitutions (local)
• Protein MSA gold standard: structural alignment

Progressive MSA Method

• Progressive:
– Heuristic algorithm: an approximation strategy that does not aim for a perfect alignment
– Build the alignment from the most related sequences, then progressively add less-related ones
– Manual examination can often improve alignments
• ClustalW, NAR 1994
– W stands for weighting: more distant seqs weigh more
– Weights reflect evolutionary distance

ClustalW Steps

• Global pairwise alignment for all pairs
• Calculate pairwise sequence distances
– Approximate evolutionary distance
• Construct a tree based on sequence distances
– E.g. solve the following matrix

     A   B   C   D
A        4   6   2
B            4   4
C                6
D

[Figure: the resulting tree. A and D each join a common node by a branch of length 1; B (length 1) and C (length 3) attach on the other side of an internal branch of length 2.]
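A quick way to check a proposed tree against the matrix is to sum branch lengths along the path between each pair of leaves. The sketch below encodes the tree with internal nodes named E (joining A and D) and F (joining B, C, and E); it verifies a solution and does not implement the tree-building method itself.

```python
# Unrooted tree as (node, node, branch length) edges.
EDGES = [("A", "E", 1), ("D", "E", 1), ("E", "F", 2), ("B", "F", 1), ("C", "F", 3)]

def build_adjacency(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj

def tree_distance(adj, start, goal, parent=None):
    """Sum of branch lengths on the unique path from start to goal."""
    if start == goal:
        return 0
    for nxt, w in adj[start]:
        if nxt != parent:                      # never walk back up the path
            d = tree_distance(adj, nxt, goal, start)
            if d is not None:
                return w + d
    return None                                # goal is not below this node

adj = build_adjacency(EDGES)
MATRIX = {("A", "B"): 4, ("A", "C"): 6, ("A", "D"): 2,
          ("B", "C"): 4, ("B", "D"): 4, ("C", "D"): 6}
fits = all(tree_distance(adj, x, y) == d for (x, y), d in MATRIX.items())
```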

ClustalW Steps

• Progressively add sequences/alignments by the tree order
– Starting from the smallest distance
– Add seq to seq, seq to align, align to align
• A and D form new node E; calculate the AE and DE distances
• Calculate the consensus for E, weighted by the AE and DE distances
• Calculate the pairwise distances among B, C, and E
• B and E form new node F…

[Figure: the same tree with internal nodes labeled: E joins A and D, F joins B and E.]

ClustalW Features: Consensus

• A consensus is used to represent the aligned sequences; where they differ, choose the AA that maximizes the score
• Final score is weighted based on branch length
– Weight for A = a = 0.2 + 0.3/2 + 0.3/3 = 0.45
– Weight for B = b = 0.1 + 0.3/2 + 0.3/3 = 0.35
– Weight for C = c = 0.5 + 0.3/3 = 0.6

[Figure: tree with branches A = 0.2 and B = 0.1 below a shared branch of 0.3, branch C = 0.5, and a root branch of 0.3 shared by all three sequences.]
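The weights above follow one rule: each branch length is split equally among the leaves below it, and a leaf's weight is the sum of its shares on the way to the root. A minimal sketch, assuming the tree is written as nested `(content, branch_length)` tuples (a made-up encoding for this example):

```python
def leaf_weights(node):
    """node = (leaf_name or list of child nodes, branch_length).
    Returns dict of leaf -> weight."""
    content, length = node
    if isinstance(content, str):               # a leaf keeps its whole branch
        return {content: length}
    weights = {}
    for child in content:
        weights.update(leaf_weights(child))
    share = length / len(weights)              # split among leaves below
    return {leaf: w + share for leaf, w in weights.items()}

# ((A:0.2, B:0.1):0.3, C:0.5):0.3
tree = ([([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.5)], 0.3)
weights = leaf_weights(tree)
```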

Scoring an Alignment

Sequence A (weight a) …K…
Sequence B (weight b) …I…
Merge and align to
Sequence C (weight c) …L…
Sequence D (weight d) …V…

Score for aligning the column:
[a c Score(K,L) + a d Score(K,V) + b c Score(I,L) + b d Score(I,V)] / 4
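The column score is a weighted average over all cross-group residue pairs. The sketch below plugs in the BLOSUM62 scores for the four pairs in this column; the weight values in the usage line are illustrative (a, b, c reuse the weights computed earlier, d is made up).

```python
# BLOSUM62 scores for the four residue pairs in this column.
SCORE = {("K", "L"): -2, ("K", "V"): -2, ("I", "L"): 2, ("I", "V"): 3}

def column_score(group1, group2, score=SCORE):
    """group1, group2: lists of (residue, weight) from the two alignments.
    Returns the weighted sum over cross pairs divided by the pair count."""
    total = sum(w1 * w2 * score[(r1, r2)]
                for r1, w1 in group1
                for r2, w2 in group2)
    return total / (len(group1) * len(group2))

s = column_score([("K", 0.45), ("I", 0.35)], [("L", 0.6), ("V", 0.5)])
```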

ClustalW Features: Gaps

• Sequence-specific gap penalties
– Penalize gaps more in segments that are less likely to have gaps

Progressive Alignment Limitations

• Gaps can proliferate if not careful
Align1: ABCD-E → ABC-D-E
Align2: ABC-DE → ABC-D-E
• Need many heuristic parameters
• Does not guarantee the global optimum
• Errors in initial alignments are propagated
• Manual improvements:
– Shift residues from one side of a gap to the other
– Reduce gaps


ClustalW Alignment

* - identical

: - conserved

. - semi-conserved

ClustalW Tree

[Figure: ClustalW tree; branch length ~ distance. Branch lengths shown: 0.02318, 0.01824, 0.41596, 0.12694, 0.02011, 0.10523, 0.01147.]

Summary

• Fast sequence similarity search
– Break seq, hash DB sub-seq, match sub-seq and extend, use DP for optimal alignment
– *BLAST: most widely used, many applications, with sound statistical foundations
– *BLAT: aligns sequence to genome, fast yet needs higher similarity
• Protein global MSA
– Progressive heuristic alignment
– ClustalW: pairwise, tree, merge alignments
– Merge with minimum edit, sequence weighting, sequence/position specific gaps


Acknowledgment

• David Mount

• Aoife McLysaght

• Ir. Brecht Claerhout
