Sequence Alignment
Lakshmanan Iyer, Ph.D.
The Building Blocks…
ATGC
VLMFNQEDHKRCSTPYW
Why Align Sequences?
Discover functional, structural, and evolutionary information
Similar sequences may have similar function
– Gene Regulation
– Biochemical Function
– Similar Structure
Homology
– Similar sequences may have a common ancestor
What is Sequence Alignment?
Global Alignment
LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA
Local Alignment
-------TGKGS-------
        |||
-------AGKGA-------
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html
Example Sequence Alignment
[Figure: evolutionary tree and example alignment; positions marked as conserved or similar]
Methods of Sequence Alignment
Pair-wise Sequence Alignment
– Dot Matrix Analysis
– Dynamic Programming Algorithm
– Word or k-tuple methods (FASTA, BLAST, BLAT)
Multiple Sequence Alignment
Dot Matrix Alignment
Place Sequences on X and Y axis and put a dot where there is a match
Especially useful to detect repetitive structure
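The dot matrix idea above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_matrix` and the repetitive test sequence are assumptions made for the example, and real dot-plot tools usually add a sliding window and a match threshold to suppress noise.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y; '*' marks a match with seq_x."""
    return [
        "".join("*" if x == y else " " for x in seq_x)
        for y in seq_y
    ]


if __name__ == "__main__":
    # Self-comparison of a hypothetical repetitive sequence: the repeats
    # appear as extra diagonals running parallel to the main diagonal.
    seq = "ATGATGATG"
    print("  " + seq)
    for residue, row in zip(seq, dot_matrix(seq, seq)):
        print(residue + " " + row)
```

In the printed grid the X-axis sequence runs across the top and the Y-axis sequence down the left side.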
Dynamic Programming
The problem at hand is divided into a series of sub-problems
The sub-problems are solved in steps
The results are compiled to find the final solution.
Scoring Systems
• Position Independent Matrices
  • Nucleic Acids – identity matrix
  • Proteins
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM number all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used
• Position Specific Score Matrices (PSSMs)
  • PSI and RPS BLAST
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Gapped BLAST parameters must be simulated
• Affine gap costs = -(a+bk)
  a = gap open penalty
  b = gap extend penalty
  k = gap length
  A gap of length 1 receives the score -(a+b)
LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA
-------TGKGS-------
        |||
-------AGKGA-------
Scores
            V    D    S    –    C    Y
            V    E    T    L    C    F
BLOSUM62   +4   +2   +1  -12   +9   +3   = 7
PAM30      +7   +2    0  -10  +10   +2   = 11
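As a rough check of the totals above, the sketch below sums substitution scores and an affine gap cost -(a+bk) for the same six-column alignment. The dictionaries hold only the five substitution scores printed on the slide. The split of the gap cost into a and b is an assumption: the slide fixes only a+b (12 for BLOSUM62, 10 for PAM30), and 11/1 and 9/1 are used here purely for illustration. The function name `alignment_score` is also hypothetical.

```python
# Substitution scores for the five aligned residue pairs, as shown on the slide.
BLOSUM62 = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}
PAM30 = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2}


def alignment_score(top, bottom, matrix, gap_open, gap_extend):
    """Score a gapped alignment with affine gap costs -(a + b*k).

    Simplification: any run of consecutive gap columns counts as one gap,
    regardless of which sequence carries the '-'.
    """
    score = 0
    gap_len = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1
            continue
        if gap_len:
            score -= gap_open + gap_extend * gap_len
            gap_len = 0
        score += matrix[(a, b)]
    if gap_len:  # gap running to the end of the alignment
        score -= gap_open + gap_extend * gap_len
    return score


if __name__ == "__main__":
    print(alignment_score("VDS-CY", "VETLCF", BLOSUM62, gap_open=11, gap_extend=1))
    print(alignment_score("VDS-CY", "VETLCF", PAM30, gap_open=9, gap_extend=1))
```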
Calculate scores for site pairs (BLOSUM62)
     H   E   A
P   -2  -1  -1
A   -2  -1   4
W   -2  -3  -3
Filling the first cells of the alignment matrix (gap = -8 per position)
         H   E   A
     0  -8 -16 -24
P   -8  -2  -9 -17
A  -16 -10  -3  -5
W  -24 -18 -11  -6
Dynamic Programming
Global Alignment: Needleman-Wunsch
         H   E   A   G   A   W   G   H   E   E
     0  -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P   -8  -2  -9 -17 -25 -33 -41 -49 -57 -65 -73
A  -16 -10  -3  -5 -13 -21 -29 -37 -45 -53 -61
W  -24 -18 -11  -6  -7 -15 -10 -18 -26 -34 -42
H  -32 -16 -18 -13  -8  -9 -17 -12 -10 -18 -26
E  -40 -24 -11 -19 -15  -9 -12 -19 -12  -5 -13
A  -48 -32 -19  -7 -15 -11 -12 -12 -20 -13  -6
E  -56 -40 -27 -15  -9 -16 -14 -14 -12 -15  -8
Trace Back
Path (from the last cell back to the first): -8, -13, -12, -12, -10, -21, -25, -17, -16, -8, 0
H E A G A W G H E E
- - P - A W H E A E
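The matrix fill and trace back above can be reproduced with a short Needleman-Wunsch sketch. It assumes a linear gap penalty of -8 per position (as implied by the first row and column of the matrix) and uses only the BLOSUM62 entries needed for these two sequences. The function names and the tie-breaking order in the trace back (diagonal first, then left, then up) are choices made for this example, not something the slides specify.

```python
GAP = -8  # linear gap penalty per gapped position (assumed from the matrix borders)

# BLOSUM62 entries for the residues that occur in the two example sequences.
_PAIRS = {
    "AA": 4, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 5, "EG": -2, "EH": 0, "EP": -1, "EW": -3,
    "GG": 6, "GH": -2, "GP": -2, "GW": -2,
    "HH": 8, "HP": -2, "HW": -2,
    "PP": 7, "PW": -4,
    "WW": 11,
}


def pair_score(a, b):
    """Symmetric lookup in the reduced substitution table."""
    return _PAIRS[a + b] if a + b in _PAIRS else _PAIRS[b + a]


def needleman_wunsch(top, side):
    """Return (score matrix, aligned top sequence, aligned side sequence)."""
    rows, cols = len(side) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + pair_score(side[i - 1], top[j - 1]),  # align pair
                F[i - 1][j] + GAP,                                       # gap in top
                F[i][j - 1] + GAP,                                       # gap in side
            )
    # Trace back from the bottom-right cell to the top-left cell.
    i, j = len(side), len(top)
    out_top, out_side = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair_score(side[i - 1], top[j - 1]):
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + GAP:
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
        else:
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
    return F, "".join(reversed(out_top)), "".join(reversed(out_side))


if __name__ == "__main__":
    matrix, aligned_top, aligned_side = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
    print(matrix[-1][-1])
    print(aligned_top)
    print(aligned_side)
```

The bottom-right cell holds the score of the best global alignment. Other optimal alignments with the same score may exist, and a different tie-breaking order could return one of them.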
BLAST…
NCBI Presentation …
NCBI Molecular Biology Resources
January 2006 Peter Cooper
Using NCBI BLAST
Sequence Similarity Searching
Basic Local Alignment Search Tool
What BLAST tells you
BLAST reports surprising alignments
– Different than chance
Assumptions
– Random sequences
– Constant composition
Conclusions
– Surprising similarities imply evolutionary homology
Evolutionary Homology: descent from a common ancestor
Does not always imply similar function
Basic Local Alignment Search Tool
Widely used similarity search tool
Heuristic approach based on Smith Waterman algorithm
Finds best local alignments
Provides statistical significance
All combinations (DNA/Protein) query and database.
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
www, standalone, and network clients
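For orientation, here is a minimal sketch of the Smith-Waterman recurrence that BLAST approximates heuristically. It returns only the best local score. The match, mismatch and gap values are arbitrary assumptions for the example (BLAST itself scores proteins with a substitution matrix and affine gaps), and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (simple scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # The floor at 0 is what makes the alignment local: a poorly
            # scoring prefix is dropped instead of being carried along.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


if __name__ == "__main__":
    # The local alignment example from the earlier slide: GKG is shared.
    print(smith_waterman_score("TGKGS", "AGKGA"))
```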
BLAST and BLAST-like programs
Traditional BLAST (blastall): nucleotide, protein, translations
– blastn: nucleotide query vs. nucleotide database
– blastp: protein query vs. protein database
– blastx: nucleotide query vs. protein database
– tblastn: protein query vs. translated nucleotide database
– tblastx: translated query vs. translated database
Megablast: nucleotide only
– Contiguous megablast: nearly identical sequences
– Discontiguous megablast: cross-species comparison
Position Specific BLAST Programs: protein only
– Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
– Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
11-mer
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT . . .
GTACTGGACATGGACCCTACAGGAACGT
TGGACATGGACCCTACAGGAACGTATAC
CATGGACCCTACAGGAACGTATACGTAA . . .
Make a lookup table of words
WORD SIZE    Def.  Min.
blastn        11     7
megablast     28    12
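A minimal sketch of the word lookup table, using the query from this slide and the default blastn word size of 11. The function names are hypothetical, and a plain dictionary stands in for whatever index structure BLAST really uses.

```python
from collections import defaultdict

QUERY = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"


def build_word_table(query, word_size=11):
    """Map every overlapping word of the query to the offsets where it starts."""
    table = defaultdict(list)
    for offset in range(len(query) - word_size + 1):
        table[query[offset:offset + word_size]].append(offset)
    return table


def exact_word_hits(table, subject, word_size=11):
    """Scan a subject sequence; return (query_offset, subject_offset) seed hits."""
    hits = []
    for s_off in range(len(subject) - word_size + 1):
        for q_off in table.get(subject[s_off:s_off + word_size], ()):
            hits.append((q_off, s_off))
    return hits


if __name__ == "__main__":
    table = build_word_table(QUERY)
    print(len(table), "distinct 11-mers")
    print(table["ACATGGACCCT"])
```

Each exact hit found this way is only a seed; the search then tries to extend it into a longer alignment.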
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Neighborhood Words
LTV, MTV, ISV, LSV, etc.
Make a lookup table of words
Word size = 3 (default)
Word size can only be 2 or 3
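The neighborhood idea can be sketched as follows: enumerate candidate words and keep those whose substitution score against the query word reaches a threshold. To stay short, the example is restricted to a six-letter alphabet (I, L, M, S, T, V) with the corresponding BLOSUM62 entries, and the threshold of 7 is an assumption chosen so that the four neighbors named on the slide all qualify. A real search uses all 20 amino acids and its own threshold.

```python
from itertools import product

ALPHABET = "ILMSTV"  # reduced alphabet, for illustration only

# BLOSUM62 entries for the reduced alphabet (keys in alphabetical order).
_SCORES = {
    "II": 4, "IL": 2, "IM": 1, "IS": -2, "IT": -1, "IV": 3,
    "LL": 4, "LM": 2, "LS": -2, "LT": -1, "LV": 1,
    "MM": 5, "MS": -1, "MT": -1, "MV": 1,
    "SS": 4, "ST": 1, "SV": -2,
    "TT": 5, "TV": 0,
    "VV": 4,
}


def pair_score(a, b):
    """Symmetric lookup in the reduced substitution table."""
    return _SCORES[a + b] if a + b in _SCORES else _SCORES[b + a]


def neighborhood(word, threshold):
    """Return {candidate word: score} for all candidates scoring >= threshold."""
    hits = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


if __name__ == "__main__":
    for w, s in sorted(neighborhood("ITV", threshold=7).items(), key=lambda kv: -kv[1]):
        print(w, s)
```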
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa
Protein: two matches (neighborhood words)
GTQITVEDLFYNI
SEI YYN
Nucleotide: one match (exact word match)
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT
An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Megablast: NCBI’s Genome Annotator
Long alignments for similar DNA sequences
Concatenation of query sequences
Faster than blastn
Contiguous Megablast
– exact word match
– Word size 28
Discontiguous Megablast
– initial word hit with mismatches
– cross-species comparison
Templates for Discontiguous Words
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
W = word size; # matches in template
t = template length (window size within which the word match is evaluated)
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
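A small sketch of how a discontiguous template can be applied: only the positions marked 1 contribute to the word, so two windows that differ at the 0 positions still produce the same seed. The template is the W = 11, t = 16 coding one from the list above. The function name and the second sequence are hypothetical (the second window is the first with two bases changed at 0 positions).

```python
TEMPLATE = "1101101101101101"  # W = 11, t = 16, coding


def discontiguous_word(seq, start, template=TEMPLATE):
    """Extract the bases at the '1' positions of the template window."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


if __name__ == "__main__":
    a = "GTACTGGACATGGACC"
    b = "GTGCTGGATATGGACC"  # differs from a at offsets 2 and 8 (template '0' positions)
    print(discontiguous_word(a, 0))
    print(discontiguous_word(a, 0) == discontiguous_word(b, 0))
```

In the coding templates every third position is a 0 somewhere in the pattern, which fits the idea that third codon positions vary most between species.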
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Figure: distribution of alignment scores; x-axis: Score, y-axis: Alignments]
(applies to ungapped alignments)
Expect Value
E = number of database hits you expect to find by chance
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
E = expected number of random hits
m n = size of the search space (query length × size of database)
S = your score
K = scale for search space
$\lambda$ = scale for scoring system
S' = bitscore = $(\lambda S - \ln K)/\ln 2$
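The two forms of the Expect formula can be checked numerically. In the sketch below, λ = 0.267 and K = 0.041 are assumed values of the kind reported for gapped BLOSUM62 searches, and the search-space sizes m and n are invented for the example. With these parameters a raw score of 97 comes out at about 42 bits, which is consistent with the "42.0 bits (97)" line in the BLAST output shown later in the deck.

```python
import math

LAMBDA = 0.267  # assumed scale for the scoring system
K = 0.041       # assumed scale for the search space


def bit_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    m, n = 130, 1_000_000_000  # hypothetical query length and database size
    bits = bit_score(97)
    print(round(bits, 1))
    print(expect_from_raw(97, m, n), expect_from_bits(bits, m, n))
```

Both functions give the same E because the bit score simply folds λ and K into the score.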
Position Specific Substitution Rates
[Figure labels: Active site serine; Typical serine]
Position Specific Score Matrix (PSSM)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D  0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S  4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C  0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q  0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
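To make the "scored differently" point concrete, the sketch below looks up a few residues in two rows of the PSSM above (positions 211 and 216, both serine in the query). The row values are copied from the matrix; the function name is hypothetical. Position 216 rewards only serine, which is presumably the active-site residue the slide refers to, while position 211 also accepts alanine and threonine.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Two rows of the PSSM shown above, keyed by alignment position.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at `position`, read from the PSSM row."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


if __name__ == "__main__":
    for pos in (211, 216):
        print(pos, {aa: pssm_score(pos, aa) for aa in "SAT"})
```

A position-independent matrix such as BLOSUM62 would give S, A and T the same scores at both positions.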
WWW BLAST
The BLAST homepage
Specialized Databases
Standard databases
BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein: PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
pat: protein patents
env_nr: environmental samples
Nucleotide Databases: Genomic
Human and mouse genomes and reference transcripts now available
Nucleotide Databases: Standard
Nucleotide Databases: Traditional
nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs
refseq_rna
refseq_genomic
– NC_ RefSeqs
dbest
– EST Division
est_human, mouse, others
htgs
– HTG division
gss
– GSS division
wgs
– whole genome shotgun
env_nt
– environmental samples
BLAST and Molecular Evolution
[Figure: human disease genes (Alzheimer’s disease, ataxia telangiectasia, colon cancer, pancreatic carcinoma) compared across Human, Fly, Worm, Yeast and Bacteria, with divergence times of 540 Myr, 1000 Myr and 3000 Myr; example pair: MLH1 and MutL]
Protein BLAST Page
>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS
Protein database
Advanced Options: Entrez limit
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]
Advanced Options: Filters
Protein
Nucleotide
Masks Low Complexity Sequence with X or n
Hides low complexity for initial word hits only
Masks regions of query in lower case (pre-masked)
Masks Human or Mouse Interspersed repeats. Default for genome searches.
Advanced Options: Composition based stats
Histone H1
Amino acid composition:
Ala (A)  42  19.6%
Arg (R)   4   1.9%
Asn (N)   4   1.9%
Asp (D)   1   0.5%
Cys (C)   0   0.0%
Gln (Q)   2   0.9%
Glu (E)   6   2.8%
Gly (G)  13   6.1%
His (H)   0   0.0%
Ile (I)   3   1.4%
Leu (L)  10   4.7%
Lys (K)  57  26.6%
Met (M)   0   0.0%
Phe (F)   1   0.5%
Pro (P)  19   8.9%
Ser (S)  23  10.7%
Thr (T)  14   6.5%
Trp (W)   0   0.0%
Tyr (Y)   1   0.5%
Val (V)  14   6.5%
Negatively charged residues (Asp + Glu): 7
Positively charged residues (Arg + Lys): 61
BLAST Formatting Page
Conserved Domain
BLAST Output: Graphical Overview
mouse over
Sort by taxonomy
BLAST Output: Descriptions
Link to entrez
Sorted by e values
3 × 10^-12
Default e value cutoff 10
Gene Linkout
TaxBLAST: Taxonomy Reports
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)
Query  9    LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL  58
            L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280  LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL  338
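The Identities and Gaps figures in this report can be recounted directly from the two aligned lines. The sketch below does only that; counting Positives would additionally need the full BLOSUM62 matrix and is left out. The function name is hypothetical.

```python
QUERY = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
SBJCT = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"


def alignment_stats(a, b):
    """Return (identities, gap columns, alignment length) for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identities = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(a, b) if x == "-" or y == "-")
    return identities, gaps, len(a)


if __name__ == "__main__":
    ident, gaps, length = alignment_stats(QUERY, SBJCT)
    print(f"Identities = {ident}/{length} ({100 * ident / length:.0f}%)")
    print(f"Gaps = {gaps}/{length} ({100 * gaps / length:.0f}%)")
```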
BLAST Output: Alignments
Identical match
positive score (conservative)
negative substitution
gap
Low Complexity Filter
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length=756
Score = 231 bits (589), Expect = 1e-62
Identities = 131/131 (100%), Positives = 131/131 (100%), Gaps = 0/131 (0%)
Query  1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct  276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  335
Query  61   GSNSSRMYFTQTLLPGLAGPSGEMVKsttsltssstsgssDKVYAHQMVRTDSREQKLDA  120
            GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA
Sbjct  336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA  395
Query  121  FLQPLSKPLSS  131
            FLQPLSKPLSS
Sbjct  396  FLQPLSKPLSS  406
low complexity sequence filtered
Nucleotide: Human Repeats
Human Albumin Genomic Region
Nucleotide: Human Repeat Filter
Alb mRNAs
Nucleotide BLAST: New Output
Crab-eating macaque CDC20 mRNA
Default human database
New output display
Sortable Results
Pseudogene on Chromosome 9
Functional Gene on Chromosome 1
Separate Sections for Transcript and Genome
Total Score: All Segments
Functional Gene Now First
Sorting in Exon Order
Default Sorting Order: Score
Longest exon usually first
Query start position
Exon order
Links to Map Viewer
Chromosome 1 Chromosome 9
Service Addresses
• General Help [email protected]
• BLAST [email protected]
Telephone support: 301-496-2475
Back to Multiple Sequence Alignment
Multiple Sequence Alignment
An extension of the pair-wise alignment…
– We will learn by example
– We will use Jalview to learn it
Jalview
Viewing
– Reads and writes alignments
– Saves alignments and associated trees
Editing
– Insert/delete gaps
– Insert/delete gaps in groups of sequences.
– Removal of gapped columns
Analysis
– Align sequences using Web Services
– Amino acid conservation analysis
– Alignment sorting options (by name, tree order, percent identity, group)
– UPGMA and NJ trees calculated and drawn
– Sequence clustering using principal component analysis.
– Removal of redundant sequences.
– Smith Waterman pairwise alignment of selected sequences.
Acknowledgement
Dr. Peter Cooper at NCBI for permission to use the BLAST Powerpoint presentation
Dr. Kurt Wollenberg for slides on Dynamic Programming