NCBI Fundamentals of Sequence Analysis Chuong Huynh NIH/NLM/NCBI Sept 30, 2004 [email protected]

NC

BI

Fundamentals of Sequence Analysis

Chuong HuynhNIH/NLM/NCBISept 30, 2004

[email protected]

NC

BI

3000 Myr3000 Myr

1000 Myr1000 Myr

540 Myr540 Myr

Common ancestry allows us to infer similar function

Alzheimer’sDisease

Ataxiatelangiectasia

Colon cancer

Pancreaticcarcinoma

Yeast BacteriaWormFlyHuman

Molecular Evolution

MLH1 MutL

NC

BI

Why do we need similarity searching?

Identification and annotation•Incomplete or no annotations (GenBank)•Incorrectly annotated sequences

Evolutionary relationshipshomologous molecules may

have similar functions

NC

BI

Why Search Databases?

• To find out if a new DNA sequence is already fully or partially present in the databanks.

• To find homologous proteins to a putative coding ORF that might share similar 3D structure.

• to identify homology (“relatedness”) between a query and entries in a database

NC

BI

Some Terminology

NC

BI

Searching Sequence Databases

• Two sequences are homologous when they share a common ancestry. This ancestry is reflected in strong sequence similarity.

• Computationally, threshold limits for sequence similarity can be defined by :– length of the stretch of similar sequence– percentage of identity between the

sequence– statistical measurements, like E-value, P-

value, Bit-score, etc.

NC

BI

Similarity and Homology

• Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.

• Homology is an evolutionary term used to describe relationship via descent from a common ancestor.

• Homologous things are often similar, but not always (whale flipper <-> human arm)

• Homology is NEVER expressed as a percentage

NC

BI

Orthologs vs Paralogs

• Homologs can be separated into two classes: orthologs and paralogs.

• Orthologs are homologous genes that perform the same function in different species.

• Paralogs are homologous genes within a species that may perform different functions.

NC

BI

Similarity and Homology• Sequence homology can be reliably inferred

from statistically significant similarity over a majority of the sequence length.

• Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.

• Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ <-> tubulin)

• Remember: pair of sequences either is or isn't homologous. There is no such thing as “64% homologous"

NC

BI

Searching sequence databases

• When we search a sequence database, we are usually looking for related sequences.

• Unfortunately, the algorithms that we have for searching databases, do not search for homology, they search for similarity.

• When similarity is found, we must determine if this similarity is a result of homology or it if comes from another source.

NC

BI

Pairwise Sequence Alignments

• Purpose:• identification of sequences with significant similarity to

(a) sequence(s) in a sequence-repository• identification of all homologous sequences the repository• identification of domains with sequence similarity

• Terminology • Global alignment• Local alignment

NC

BI

Terminology: Global Alignment

• Finds the optimal alignment over the entire length of the two compared sequences

• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA

• Suitable for sequences of homologous molecules

NC

BI

Terminology: Local Alignment

• short regions of similarity between a pair of sequences.

• compared sequences can receive high local similarity scores, without the need to have high levels of similarity over their entire length

• useful when looking for domains within proteins or looking for regions of genomic DNA that contain coding exons

NC

BI

Substitutions, Insertions, Deletions

• Mutation: one of– switch from one nucleotide to another– insertion– deletion

• Substitution: a switch in nucleotides which spreads throughout most of a species.

• Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):

ccctaggtccca

cgggtatccaacggtatgcca

NC

BI

Example

• For the previous example cggtatgcca cgggtatccaa , ccctaggtccca, the two

descendent sequences align as follows

c g g g t a - - t - c c a a c c c - t a g g t c c c - a

• “-” (indel) represents an insertion or deletion.

NC

BI

Algorithms: definition

Webster’s definition: “a procedure for solving a

mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”

NC

BI

Algorithms

• Needleman-Wunsch– Exhaustive global alignment– most rigorous method when aligning conserved

sequences of similar length (no exon shuffling, insertion/deletion etc)

• Smith-Waterman– Exhaustive local alignment– alignment does not have to extend along the full

length of the sequences– In contrast to N-W alignments initiating at all

possible positions of the sequence-space will be considered

– Can be very slow

NC

BI

Basic Local Alignment Search Tool

http://www.ncbi.nlm.nih.gov/BLAST/

NC

BI

Basic Local Alignment Search Tool

• Widely used similarity search tool• Heuristic approach based on Smith Waterman

algorithm• Finds best local alignments• Provides statistical significance• All combinations (DNA/Protein) query and

database.– DNA vs DNA– DNA translation vs Protein– Protein vs Protein– Protein vs DNA translation– DNA translation vs DNA translation

• www, standalone, and network clients

NC

BI

BLAST Selection Matrix

NC

BI

Choosing The Right BLAST Flavor for Proteins

What you Want to Do? The Right BLAST Flavor

Find out something about the function of the protein

Use blastp to compare your protein with other proteins contained in the databases.

Discover new genes encoding similar proteins

Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading framesClaverie & Notredame 2003

NC

BI

Choosing the Right BLAST

Flavor for DNAQuestions Answer

Am I interested in non coding DNA?

Yes, Use blastn. Rem: blastn is only for closely related DNA sequences (more than 70% identical)

Do I want to discover new proteins?

Yes, Use tblastx

Do I want to discover proteins encoded in my query DNA sequences?

Yes, Use blastx

Am I unsure of the quality of my DNA?

Yes, Use blastx. Especially if you suspsect your DNA sequence codes for a protein, but may contain sequencing errors.

Claverie & Notredame 2003

NC

BI

Choosing The Right BLAST Flavor

for DNA SequencesUsage Query Database Progra

m

Find very similar DNA sequence

DNA DNA blastn

Protein discovery and ESTs

Translated DNA

Translated DNA

tblastx

Analysis of query DNA sequence

Translated DNA

Protein blastx


NC

BI

WWW BLAST

NC

BI

Web BLAST

NC

BI

BLAST Databases: Non-redundant protein

nr (non-redundant protein sequences)– GenBank CDS translations– NP_ RefSeqs– Outside Protein

• PIR, Swiss-Prot, PRF

– PDB (sequences from structures)

NC

BI

BLAST Databases: Nucleic Acid

• nr (nt)– Traditional GenBank

Divisions– NM_ and XM_ RefSeqs

• dbest – EST Division

• htgs – HTG division

• gss – GSS division

• chromosome – NC_ RefSeqs

NC

BI

Protein BLAST Page

>Mutated in Colon CancerIETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS

Protein database

NC

BI

BLAST Formatting Page

NC

BI

BLAST Output Overview

• Graphic Display: Shows you where your query is similar to other sequences.

• Hit List: Name of sequences similar to your query ranked by similarity

• Alignments: Every alignment between your query and the reported hits

• Parameters: List of the various parameters used for the search

NC

BI

BLAST Output: Graphic

mouse over,click for active links

Sort by taxonomy

Red bar = most similar sequencePink = almost as similarGreen – even less similarBlue/Black – worse scores

NC

BI

BLAST Output: Descriptions

Bacterial mismatch repair proteins

link to entrez

sorted by e values

4 X 10-56

Default e value cutoff 10

LocusLink

Bit scores < 50 unreliable

NC

BI

A Little on Interpretation

• How similar must sequences be in order to be considered homologous?

• More than 25% of the amino acids present are identical for proteins and more than 70% of the nucleotides present are identical for DNA. Above these limits, you can be sure that two proteins have same structure and same common ancestor.

• Rem: only > 100 aa or nt in length

NC

BI

A Little on Interpretation: E-value

• Determine how much you can trust your conclusion on homology.

• E-value = Expectation Values• Allow for comparing pairwise alignment with

different similarities and different length. Advantage over Percent Identity (not discussed).

• Definition: Number of times your database match may have occurred by chance. Match unlikely to occur by chance is a good match. The loest E-values (as close to 0 as possible) are the best. Thus, most significant, since we know we can trust them enough to infer homology

• If you want to be certain of homology your E-values must be below 10-4 or (0.0001).

NC

BI

TaxBLAST: Taxonomy Reports

NC

BI

BLAST Output: Pairwise Alignments

>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL Length = 615

Score = 44.3 bits (103), Expect = 5e-05 Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)

Query: 9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59 L + P L LEI P VDVNVHP KHEV F +H+ + +L +QQ +E+ LSbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338

NC

BI

BLAST Output: Alignments

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1) Length = 756

Score = 233 bits (593), Expect = 8e-62 Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLSbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

Query: 61 GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120 GSNSSRMYFTQTLLPGLAGPSGEMVK DKVYAHQMVRTDSREQKLDASbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

Query: 121 FLQPLSKPLSS 131 FLQPLSKPLSSSbjct: 396 FLQPLSKPLSS 406

low complexity sequence filtered

NC

BI

Results from nr

Sequences producing significant alignments: (bits) Value

gi|604369|gb|AAA85687.1| (U17857) hMLH1 gene product [Homo ... 233 3e-61 gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mut... 233 4e-61 gi|466462|gb|AAA17374.1| (U07418) human homolog of E. coli ... 233 4e-61 gi|13878583|sp|Q9JK91|MLH1_MOUSE DNA mismatch repair protei... 214 2e-55 gi|19387852|ref|NP_081086.1| (NM_026810) mutL homolog 1; DN... 213 2e-55 gi|13591989|ref|NP_112315.1| (NM_031053) mismatch repair pr... 212 5e-55 gi|12835158|dbj|BAB23172.1| (AK004105) DNA MISMATCH REPAIR ... 205 6e-53 gi|3192877|gb|AAC19117.1| (AF068257) mutL homolog [Drosophi... 128 1e-29 gi|17136968|ref|NP_477022.1| (NM_057674) Mlh1-P1 [Drosophil... 127 1e-29 gi|17861656|gb|AAL39305.1| (AY069160) GH18717p [Drosophila ... 125 8e-29 gi|20146218|dbj|BAB89000.1| (AP003238) putative MLH1 [Oryza... 87 2e-17 gi|11357265|pir||T51620 DNA mismatch repair protein MLH1 [i... 83 5e-16 gi|18413196|ref|NP_567345.1| (NM_116983) MLH1 protein [Arab... 83 5e-16 gi|6323819|ref|NP_013890.1| (NC_001145) Required for mismat... 72 1e-12 gi|460627|gb|AAA16835.1| (U07187) Mlh1p [Saccharomyces cere... 71 2e-12gi|19112991|ref|NP_596199.1| (NC_003423) putative DNA misma... 70 5e-12 gi|13517948|gb|AAK29067.1|AF346620_1 (AF346620) MLH1 [Trypa... 57 3e-08 gi|16272041|ref|NP_438240.1| (NC_000907) DNA mismatch repai... 54 3e-07 gi|19173567|ref|NP_597370.1| (NC_003232) DNA MISMATCH REPAI... 52 9e-07 gi|13543339|gb|AAH05833.1|AAH05833 (BC005833) Similar to mu... 50 5e-06 gi|15602769|ref|NP_245841.1| (NC_002663) MutL [Pasteurella ... 50 6e-06 gi|15642797|ref|NP_227838.1| (NC_000853) DNA mismatch repai... 48 2e-05

>gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mutL (E. coli) homolog 1; coli) homolog 1 (colon cancer, nonpolyposis type 2) [Homo sapiens] gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1) gi|631299|pir||S43085 DNA mismatch repair protein MLH1 - human gi|463989|gb|AAC50285.1|(U07343) hMLH1 [Homo sapiens] gi|1079787|gb|AAA82079.1|(U40978) DNA mismatch repair protein homolog [Homo sapiens] gi|13905126|gb|AAH06850.1|AAH06850 (BC006850) mutL (E. coli) homolog 1 type 2) [Homo sapiens] gi|741682|prf||2007430A DNA mismatch repair protein [Homo sapiens] Length = 756

Score = 233 bits (593), Expect = 4e-61 Identities = 117/131 (89%), Positives = 117/131 (89%)


NC

BI

tblastn Results Against ESTs

>gi|12794555|emb|AL531062.1|AL531062 AL531062 LTI_NFL001_NBC4 Homo sapiens cDNA clone CS0DM005YM23 5 prime. Length = 878

Score = 167 bits (422), Expect(3) = 1e-42 Identities = 81/82 (98%), Positives = 81/82 (98%) Frame = +2


Query: 61 GSNSSRMYFTQTLLPGLAGPSG 82 GSNSSRMYFTQTLLPGLAGP GSbjct: 692 GSNSSRMYFTQTLLPGLAGPLG 757

Score = 24.3 bits (51), Expect(3) = 1e-42 Identities = 11/26 (42%), Positives = 11/26 (42%) Frame = +1

Query: 80 PSGEMVKXXXXXXXXXXXXXXDKVYA 105 PSG MVK DKVYASbjct: 748 PSG*MVKSTTSLTSSSTSGSSDKVYA 825

combined expect forhits to multiple frames

NC

BI

BLAST Tips

• It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.

• If in doubt use blastp.• When possible restrict to the subset of

the database you are interested in.• Look around for the database you

need or create your own custom BLAST database. BUT HOW???

• When is the best time to use the BLAST server?

NC

BI

Asking Biological Problems with BLASTWhat You

Want to DOGeneral (but More Complicated) Computational Method

Using BLAST

Finding genes in a genome

Run gene prediction software or an ORF Finder (for bacteria)

Cut your genome sequence in little (2-5kb) overlapping sequences. Use blastx to BLAST each piece of genome against NR (nonredundant protein db). Works better for sequences with no introns (bacteria).

Predicting protein function

Domain analysis or wet-lab experimentation

Use blastp to BLAST your protein sequence against SWISS-Prot (future = UniProt). If you get a good hit (more than 25% identify) over the complete length of the protein, then your protein has the same function as the SWISS-PROT protein

Predicting protein 3-D structure

Homology modeling, X-ray, NMR analysis of protein of interest

Use blastp to BLAST your protein against PDB (Protein structure DB), if you get hit >25% identity, then your protein and the good hit(s) have a similar 3-D structure

Finding protein family members

Clone new family members using PCR techniques

Use blastp (or better use PSI-BLAST) and run against NR (nonredundant protein family). After you have all members of family, you can make multiple sequence alignment phylogenetic tree


NC

BI

BLAST and PSI-BLAST Servers on the Internet

Country

Program

URL

USA BLAST/ PSI-BLAST

http://www.ncbi.nlm.nih.gov/BLAST

USA BLAST http://genome.wustl.edu/gsc/BLAST

EUROPE BLAST http://www.ch.embnet.org/software/bBLAST.html

Europe BLAST http://www.ebi.ac.uk/blast2/

Japan BLAST/ PSI-BLAST

http://www.ddbj.nig.ac.jp/E-mail/ homology.html

NC

BI

Alternative Method for

Homology Searches• Smith-Waterman (ssearch): slower but

more accurate• FASTA: slower than BLAST, but more

accurate when making DNA comparison

• BLAT: for locating cDNA in a genome or finding close proteins in a genome

NC

BI

Common Mistake

• Seq1 has domain A & B; Seq2 has domain A and Seq3 has domain B

• Use Seq 1 as query sequence• What happens? E-value of both of these hits may

be very high if domain A and B are long and well conserved.

• Seq1 is homologous to Seq2&3, but remember Seq1 is not homlogous over the entire length to Seq2&3

• Just don’t depend on the E-value• “BLAST hits are not transitive, unless the

alignments are overlapping”• Most proteins have more than one domain, so

becareful when looking a BLAST results, not all reported hits belong to the same big family.

Sequence 1: AAAAAABBBBBBSequence 2: AAAAAASequence 3: BBBBBB

NC

BI

Common Questions

• When I do a blast job using WU-BLAST vs NCBI BLAST with the same query sequence, I get a different result? Both are based on the same algorithm, but a different implementation. So why the difference?

Usually this is due to the slight variation in the database version, but differences in BLAST program version also play a minor role in the difference. Usually the result, do not change in a dramatic manner, but they do change a bit.

NC

BI

Self Guided Exercises - BLAST

• If you need further help on Blast. • First READ then try the problem set.• Blast Course:http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-

1.htmlBlast

• Tutorial:http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/

information3.html • Blast Quick Start (click on P for the

problem set)http://www.ncbi.nlm.nih.gov/Class/minicourses/

blastex2.html

Documents

NCBI Fundamentals of Sequence Analysis Chuong Huynh NIH/NLM/NCBI Sept 30, 2004 [email protected]