Sequence Alignment Lakshmanan Iyer, Ph. D.. The Building Blocks… ATGC VLMFNQEDHKRCSTPYW

Sequence Alignment

Lakshmanan Iyer, Ph. D.

The Building Blocks…

ATGC

VLMFNQEDHKRCSTPYW

Why Align Sequences?

Discover functional, structural, and evolutionary information

Similar sequences may have similar function
– Gene regulation
– Biochemical function
– Similar structure

Homology
– Similar sequences may have a common ancestor

What is Sequence Alignment?

Global Alignment

    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA

Local Alignment

    -------TGKGS-------
            |||
    -------AGKGA-------

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

Example Sequence Alignment?

Evolutionary Tree

Example Alignment

Conserved

Similar

Methods of Sequence Alignment

Pair-wise Sequence Alignment

Multiple Sequence Alignment

• Dot Matrix Analysis
• Dynamic Programming Algorithm
• Word or k-tuple methods (FASTA, BLAST, BLAT)

Dot Matrix Alignment

Place Sequences on X and Y axis and put a dot where there is a match

Especially useful to detect repetitive structure
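A minimal sketch of the dot matrix idea, assuming a window of a single residue and no filtering (the function names are illustrative, not from the slides). One sequence runs along the X axis, the other along the Y axis, and a `*` marks every matching pair. Repeats show up as parallel diagonals.

```python
def dot_matrix(x_seq, y_seq):
    """Return one text row per residue of y_seq; '*' marks x_seq[j] == y_seq[i]."""
    return ["".join("*" if x == y else " " for x in x_seq) for y in y_seq]


def print_dot_matrix(x_seq, y_seq):
    """Print the matrix with x_seq as the column header and y_seq as row labels."""
    print("  " + x_seq)
    for y, row in zip(y_seq, dot_matrix(x_seq, y_seq)):
        print(y + " " + row)


# ATGC occurs twice in the X sequence, so two diagonals appear.
print_dot_matrix("ATGCATGC", "ATGC")
```

Real dot plot tools usually score a sliding window and apply a threshold to suppress noise; that refinement is left out here.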

Dynamic Programming

The problem at hand is divided into a series of sub-problems

The sub-problems are solved in steps

The results are compiled to find the final solution.

Scoring Systems

• Position Independent Matrices
  • Nucleic Acids – identity matrix
  • Proteins
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM numbers are all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used

• Position Specific Score Matrices (PSSMs)
  • PSI and RPS BLAST

BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Gapped Alignments

• Gapping provides more biologically realistic alignments
• Gapped BLAST parameters must be simulated
• Affine gap costs = -(a+bk)
  a = gap open penalty
  b = gap extend penalty
  k = gap length
  A gap of length 1 receives the score -(a+b)

    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA

    -------TGKGS-------
            |||
    -------AGKGA-------

Scores

             V    D    S    –    C    Y
             V    E    T    L    C    F

BLOSUM62    +4   +2   +1  -12   +9   +3    Total:  7

PAM30       +7   +2    0  -10  +10   +2    Total: 11
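The BLOSUM62 row of the Scores slide can be reproduced with a short sketch. It assumes a gap open penalty a = 11 and a gap extend penalty b = 1; the slides only show that a length-1 gap costs 12, so any pair with a + b = 12 gives the same result. The five substitution scores are copied from the BLOSUM62 table; the function names are illustrative.

```python
# Substitution scores for the five aligned pairs, read from the BLOSUM62 table.
BLOSUM62_PAIRS = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "T"): 1,
    ("C", "C"): 9,
    ("Y", "F"): 3,
}


def affine_gap_score(k, gap_open, gap_extend):
    """Score of a gap of length k under affine costs: -(a + b*k)."""
    return -(gap_open + gap_extend * k)


def score_alignment(top, bottom, pairs, gap_open, gap_extend):
    """Sum substitution scores and affine gap scores over an alignment.

    Simplification: consecutive gap columns are treated as one gap run,
    regardless of which sequence carries the gap character.
    """
    total = 0
    gap_len = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        if gap_len:
            total += affine_gap_score(gap_len, gap_open, gap_extend)
            gap_len = 0
        total += pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]
    if gap_len:
        total += affine_gap_score(gap_len, gap_open, gap_extend)
    return total


print(score_alignment("VDS-CY", "VETLCF", BLOSUM62_PAIRS, gap_open=11, gap_extend=1))
```

The printed total is 4 + 2 + 1 - 12 + 9 + 3 = 7, matching the slide.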

Calculate scores for site pairs (BLOSUM62)

      H   E   A
P    -2  -1  -1
A    -2  -1   4
W    -2  -3  -3

          H    E    A
     0   -8  -16  -24
P   -8   -2   -9  -17
A  -16  -10   -3   -5
W  -24  -18  -11   -6

Dynamic Programming

Global Alignment: Needleman-Wunsch

              H    E    A    G    A    W    G    H    E    E
         0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P       -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A      -16  -10   -3   -5  -13  -21  -29  -37  -45  -53  -61
W      -24  -18  -11   -6   -7  -15  -10  -18  -26  -34  -42
H      -32  -16  -18  -13   -8   -9  -17  -12  -10  -18  -26
E      -40  -24  -11  -19  -15   -9  -12  -19  -12   -5  -13
A      -48  -32  -19   -7  -15  -11  -12  -12  -20  -13   -6
E      -56  -40  -27  -15   -9  -16  -14  -14  -12  -15   -8

Trace Back

H E A G A W G H E E
- - P - A W H E A E
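The matrix fill and trace back can be sketched as follows. The sketch assumes a linear gap penalty of 8 per gap position, inferred from the first row and column of the matrix (0, -8, -16, ...), and uses only the BLOSUM62 entries needed for these two sequences. The tie-breaking order in the trace back (diagonal, then up, then left) is a choice made here; other optimal alignments with the same score may exist.

```python
# BLOSUM62 entries for residues of PAWHEAE (rows) against HEAGAWGHEE (columns).
SUB = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 4, "G": 0, "W": -3},
    "W": {"H": -2, "E": -3, "A": -3, "G": -2, "W": 11},
    "H": {"H": 8, "E": 0, "A": -2, "G": -2, "W": -2},
    "E": {"H": 0, "E": 5, "A": -1, "G": -2, "W": -3},
}


def needleman_wunsch(rows, cols, sub, gap):
    """Global alignment; returns (score matrix, aligned cols, aligned rows)."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub[rows[i - 1]][cols[j - 1]],  # match/mismatch
                F[i - 1][j] - gap,                                # gap in cols
                F[i][j - 1] - gap,                                # gap in rows
            )
    # Trace back from the bottom-right cell to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub[rows[i - 1]][cols[j - 1]]:
            top.append(cols[j - 1])
            bottom.append(rows[i - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            top.append("-")
            bottom.append(rows[i - 1])
            i -= 1
        else:
            top.append(cols[j - 1])
            bottom.append("-")
            j -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("PAWHEAE", "HEAGAWGHEE", SUB, gap=8)
print(F[-1][-1])
print(top)
print(bottom)
```

The bottom-right cell holds the optimal global score, which is -8 for this example.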

BLAST…

NCBI Presentation …

NCBI Molecular Biology Resources

January 2006 Peter Cooper

Using NCBI BLAST

Sequence Similarity Searching

Basic Local Alignment Search Tool

What BLAST tells you

BLAST reports surprising alignments
– Different than chance

Assumptions
– Random sequences
– Constant composition

Conclusions
– Surprising similarities imply evolutionary homology

Evolutionary Homology: descent from a common ancestor

Does not always imply similar function

Basic Local Alignment Search Tool

• Widely used similarity search tool
• Heuristic approach based on Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
  – DNA vs DNA
  – DNA translation vs Protein
  – Protein vs Protein
  – Protein vs DNA translation
  – DNA translation vs DNA translation
• www, standalone, and network clients

BLAST and BLAST-like programs

Traditional BLAST (blastall): nucleotide, protein, translations
– blastn: nucleotide query vs. nucleotide database
– blastp: protein query vs. protein database
– blastx: nucleotide query vs. protein database
– tblastn: protein query vs. translated nucleotide database
– tblastx: translated query vs. translated database

Megablast: nucleotide only
– Contiguous megablast: nearly identical sequences
– Discontiguous megablast: cross-species comparison

Position Specific BLAST Programs: protein only
– Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
– Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

GTACTGGACATGGACCCTACAGGAACGT

TGGACATGGACCCTACAGGAACGTATAC

CATGGACCCTACAGGAACGTATACGTAA . . .

Nucleotide Words

GTACTGGACAT

TACTGGACATG

ACTGGACATGG

CTGGACATGGA

TGGACATGGAC

GGACATGGACC

GACATGGACCC

ACATGGACCCT . . .

Make a lookup table of words

Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG

11-mer

WORD SIZE    Def.   Min.
blastn        11      7
megablast     28     12
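As a rough sketch of the lookup table step, the query can be cut into overlapping words of the chosen size and each word mapped to its start positions. The function name and the dictionary layout are assumptions made for illustration; BLAST's internal data structure is more compact.

```python
from collections import defaultdict

QUERY = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"


def build_lookup(seq, word_size):
    """Map every overlapping word of length word_size to its start positions."""
    table = defaultdict(list)
    for start in range(len(seq) - word_size + 1):
        table[seq[start:start + word_size]].append(start)
    return table


lookup = build_lookup(QUERY, word_size=11)  # blastn default word size
for word, starts in list(lookup.items())[:3]:
    print(word, starts)
```

Scanning a database sequence then amounts to sliding the same window along it and checking each word against the table.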

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Neighborhood Words

LTV, MTV, ISV, LSV, etc.

GTQ

TQI

QIT

ITV

TVE

VED

EDL

DLF

...

Make a lookup table of words

Word size = 3 (default)

Word size can only be 2 or 3
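Neighborhood words can be sketched by scoring every possible 3-letter word against a query word and keeping those at or above a threshold. The threshold of 11 used below is an assumption for illustration; the slides do not state one. The three score rows are the I, T, and V entries of the BLOSUM62 table, so the sketch only handles the query word ITV.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 scores of I, T and V against each residue in AMINO order.
ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def words(seq, word_size=3):
    """Overlapping words of the query, in order."""
    return [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]


def word_score(query_word, candidate):
    """Sum of substitution scores between a query word and a candidate word."""
    return sum(ROWS[q][AMINO.index(c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words scoring at least threshold against query_word."""
    candidates = ("".join(c) for c in product(AMINO, repeat=len(query_word)))
    return sorted(w for w in candidates if word_score(query_word, w) >= threshold)


print(words("GTQITVEDLFYNIATRRKALKN")[:4])
print(neighborhood("ITV", threshold=11))
```

With these scores, LTV reaches 11 while MTV, ISV, and LSV score 10, 9, and 7, so a lower threshold would be needed to admit all of the example words listed on the slide.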

Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

GTQITVEDLFYNI

SEI YYN

ATCGCCATGCTTAATTGGGCTT

CATGCTTAATT

neighborhood words

exact word match

one match

two matches

An alignment that BLAST can’t find

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
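One way to see why this alignment is missed is to measure the longest run of consecutive identical positions along it. This is a sketch under the assumption that the two sequences are compared position by position with no gaps, as in the alignment above; it does not search other diagonals. If the longest run is shorter than the word size, there is no exact word to seed an extension.

```python
QUERY = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
SBJCT = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


print(longest_identical_run(QUERY, SBJCT))
```

The longest run here is 6 bases, below even the minimum blastn word size of 7 and far below the default of 11.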

Megablast: NCBI’s Genome Annotator

• Long alignments for similar DNA sequences
• Concatenation of query sequences
• Faster than blastn
• Contiguous Megablast
  – exact word match
  – Word size 28
• Discontiguous Megablast
  – initial word hit with mismatches
  – cross-species comparison

Templates for Discontiguous Words

W = 11, t = 16, coding:       1101101101101101
W = 11, t = 16, non-coding:   1110010110110111
W = 12, t = 16, coding:       1111101101101101
W = 12, t = 16, non-coding:   1110110110110111
W = 11, t = 18, coding:       101101100101101101
W = 11, t = 18, non-coding:   111010010110010111
W = 12, t = 18, coding:       101101101101101101
W = 12, t = 18, non-coding:   111010110010110111
W = 11, t = 21, coding:       100101100101100101101
W = 11, t = 21, non-coding:   111010010100010010111
W = 12, t = 21, coding:       100101101101100101101
W = 12, t = 21, non-coding:   111010010110010010111

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

W = word size; # matches in template

t = template length (window size within which the word match is evaluated)
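A template can be read as a mask over a window of t bases: positions marked 1 must match, positions marked 0 are ignored. The sketch below applies the W = 11, t = 16 coding template; the two 16-base windows are made-up examples that differ only at the 0 positions (every third base), and the function names are illustrative.

```python
CODING_W11_T16 = "1101101101101101"


def spaced_word(window, template):
    """Keep only the bases at positions where the template has a 1."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


def template_hit(window_a, window_b, template):
    """True when two windows agree at every position the template requires."""
    return spaced_word(window_a, template) == spaced_word(window_b, template)


a = "ATGGCTAAAGGTCTGA"
b = "ATAGCCAAGGGCCTAA"  # differs from a only at template positions marked 0

print(spaced_word(a, CODING_W11_T16))
print(template_hit(a, b, CODING_W11_T16))
```

The two windows share no contiguous 11-mer, yet they still produce an initial hit under the template, which is what makes the discontiguous mode useful for cross-species comparison of coding sequence.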

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution

[Figure: number of alignments plotted against score]

(applies to ungapped alignments)

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$

K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K)/\ln 2$

Expect Value

E = number of database hits you expect to find by chance

size of database

your score

expected number of random hits
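The relationship between raw score, bit score, and expect value can be checked numerically. The sketch uses E = K·m·n·e^(-λS) and S' = (λS - ln K)/ln 2. The values λ = 0.267 and K = 0.041 are assumptions: they are the parameters commonly reported for BLOSUM62 with gap costs 11/1 and are not given in the slides. The query length m = 131 comes from the example query; the database size n is a made-up number.

```python
import math


def bit_score(raw_score, K, lam):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_raw(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


K, LAM = 0.041, 0.267          # assumed parameters, see lead-in
m, n = 131, 1_000_000_000      # query length; hypothetical database size

bits = bit_score(97, K, LAM)
print(round(bits, 1))
print(expect_from_raw(97, m, n, K, LAM))
print(expect_from_bits(bits, m, n))
```

Under these assumed parameters a raw score of 97 converts to about 42.0 bits, in line with the "42.0 bits (97)" reported for the MutL hit later in the deck, and both forms of the formula give the same E.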


Position Specific Substitution Rates

Active site serine
Typical serine

Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine scored differently in these two positions

Active site nucleophile
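Scoring against a PSSM differs from scoring against BLOSUM62 only in that the score depends on the position as well as the residue. The sketch below copies five rows of the matrix above (positions 211 and 214 to 217) and looks up scores by position; the helper names are illustrative.

```python
AMINO = "ARNDCQEGHILKMFPSTWYV"

# Selected rows of the PSSM shown above, keyed by position.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
}


def column_score(position, residue):
    """Score of placing residue at the given PSSM position."""
    return PSSM[position][AMINO.index(residue)]


def score_segment(start, segment):
    """Sum of position-specific scores for a segment starting at position start."""
    return sum(column_score(start + i, r) for i, r in enumerate(segment))


print(column_score(211, "S"), column_score(216, "S"))
print(score_segment(214, "GDSG"), score_segment(214, "GDAG"))
```

Serine scores 4 at position 211, the same as a serine match in BLOSUM62, but 7 at position 216, and replacing that serine with alanine drops the segment score from 31 to 22.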


WWW BLAST

The BLAST homepage

Specialized Databases

Standard databases

BLAST Databases: Non-redundant protein

nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein: PIR, Swiss-Prot, PRF
– PDB (sequences from structures)

pat: protein patents

env_nr: environmental samples

Nucleotide Databases: Genomic

Human and mouse genomes and reference transcripts now available

Nucleotide Databases: Standard

Nucleotide Databases: Traditional

nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs

refseq_rna

refseq_genomic
– NC_ RefSeqs

dbest
– EST Division

est_human, mouse, others

htgs
– HTG division

gss
– GSS division

wgs
– whole genome shotgun

env_nt
– environmental samples

BLAST and Molecular Evolution

3000 Myr
1000 Myr
540 Myr

Alzheimer’s Disease
Ataxia telangiectasia
Colon cancer
Pancreatic carcinoma

Yeast, Bacteria, Worm, Fly, Human

MLH1 MutL

Protein BLAST Page

>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS

Protein database

Advanced Options: Entrez limit

all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]

Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]

Advanced Options: Filters

Hides low complexity for initial word hits only

Masks regions of query in lower case (pre-masked)

Masks Human or Mouse interspersed repeats. Default for genome searches.

Protein

Nucleotide

Masks low complexity sequence with X or n

Advanced Options: Composition based stats

Amino acid composition:
Ala (A)   42   19.6%
Arg (R)    4    1.9%
Asn (N)    4    1.9%
Asp (D)    1    0.5%
Cys (C)    0    0.0%
Gln (Q)    2    0.9%
Glu (E)    6    2.8%
Gly (G)   13    6.1%
His (H)    0    0.0%
Ile (I)    3    1.4%
Leu (L)   10    4.7%
Lys (K)   57   26.6%
Met (M)    0    0.0%
Phe (F)    1    0.5%
Pro (P)   19    8.9%
Ser (S)   23   10.7%
Thr (T)   14    6.5%
Trp (W)    0    0.0%
Tyr (Y)    1    0.5%
Val (V)   14    6.5%

Negatively charged residues (Asp + Glu): 7
Positively charged residues (Arg + Lys): 61

Histone H1

BLAST Formatting Page

Conserved Domain

BLAST Output: Graphical Overview

mouse over

Sort by taxonomy

BLAST Output: Descriptions

Link to Entrez

Sorted by E values

3 x 10^-12

Default E value cutoff: 10

Gene Linkout

TaxBLAST: Taxonomy Reports

>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615

Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)

Query  9    LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL  58
            L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280  LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL  338

BLAST Output: Alignments

Identical match

positive score (conservative)

negative substitution

gap
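The Identities and Gaps figures reported for the MutL hit can be recomputed directly from the two aligned rows. This is a small sketch with an illustrative function name; counting Positives would additionally need the substitution matrix and is left out.

```python
QUERY = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
SBJCT = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"


def alignment_stats(query, subject):
    """Return (identities, gaps, alignment length) for two aligned rows."""
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({round(100 * identities / length)}%)")
print(f"Gaps = {gaps}/{length} ({round(100 * gaps / length)}%)")
```

The counts come out as 26/59 identities (44%) and 9/59 gaps (15%), the same values BLAST prints for this hit.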

Low Complexity Filter

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length=756

Score = 231 bits (589), Expect = 1e-62
Identities = 131/131 (100%), Positives = 131/131 (100%), Gaps = 0/131 (0%)

Query  1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct  276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  335

Query  61   GSNSSRMYFTQTLLPGLAGPSGEMVKsttsltssstsgssDKVYAHQMVRTDSREQKLDA  120
            GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA
Sbjct  336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA  395

Query  121  FLQPLSKPLSS  131
            FLQPLSKPLSS
Sbjct  396  FLQPLSKPLSS  406

low complexity sequence filtered

Nucleotide: Human Repeats

Human Albumin Genomic Region

Nucleotide: Human Repeat Filter

Alb mRNAs

Nucleotide BLAST: New Output

Crab-eating macaque CDC20 mRNA

Default human database

New output display

Sortable Results

Pseudogene on Chromosome 9

Functional Gene on Chromosome 1

Separate Sections for Transcript and Genome

Total Score: All Segments

Functional Gene Now First

Sorting in Exon Order

Default Sorting Order: Score
Longest exon usually first

Query start position
Exon order

Links to Map Viewer

Chromosome 1 Chromosome 9

Service Addresses

• General Help: [email protected]
• BLAST: [email protected]

Telephone support: 301-496-2475

Back to Multiple Sequence Alignment

Multiple Sequence Alignment

An extension of the pair-wise alignment…
– We will learn by example
– We will use Jalview to learn it

Jalview

Viewing
– Reads and writes alignments
– Saves alignments and associated trees

Editing
– Insert/delete gaps
– Insert/delete gaps in groups of sequences
– Removal of gapped columns

Analysis
– Align sequences using Web Services
– Amino acid conservation analysis
– Alignment sorting options (by name, tree order, percent identity, group)
– UPGMA and NJ trees calculated and drawn
– Sequence clustering using principal component analysis
– Removal of redundant sequences
– Smith-Waterman pairwise alignment of selected sequences

Acknowledgement

Dr. Peter Cooper at NCBI for permission to use the BLAST Powerpoint presentation

Dr. Kurt Wollenberg for slides on Dynamic Programming
