
Page 1: Tools & techniques for finding and aligning homologous sequences Sequence Searching and Alignments Andrew Cowley Web Production Team andrew.cowley@ebi.ac.uk

Tools & techniques for finding and aligning homologous sequences

Sequence Searching and Alignments

Andrew Cowley
Web Production Team

andrew.cowley@ebi.ac.uk


Materials

http://www.ebi.ac.uk/~apc/Courses/Brazil


About me

• MA Biochemistry, Cambridge University

• MRes Bioinformatics, University of York

• PhD CASE Studentship, Structural Bioinformatics, University of York


About me

• Training bioinformatics since 2005

• Joined EMBL-EBI in 2010

From Derbyshire, UK

Many hobbies!


Derbyshire

Beautiful countryside

1871: South Derbyshire Football Association

1884: Derby County F.C.

“Hold the record for the lowest ever points finish in the Premier League”


Contents

• Sequence databases

• Text searching

• Sequence similarity searching

• Alignment basics

• Similarity searching tools

• Improving algorithms

• Guidelines

• Problem sequences

• Multiple sequence alignments


Sequence Databases


Primary vs Secondary

• Primary data comes from experiments/submitters

• Derived (or secondary) data is generated with additional work (by curators etc.) from the primary data


Nucleotide primary data

Sources of primary sequence data:

• Individual scientists

• Large-scale sequencing projects

• Patent Offices

Primary sequence database:

• Original sequence data

• Experimental data

• Patent data

• Submitter-defined

ACTGCTGCTAGCTAGCTGATCTATGCTAGCTGTAGCTGAG


Nucleotide primary data

INSDC: International Nucleotide Sequence Database Collaboration

• GenBank (U.S.A.)

• DDBJ (Japan)

• ENA (Europe)

• Daily exchange of data

• Submission can be made to any INSDC database


Nucleotide primary data

• Submitters: large-scale sequencing projects, individual scientists, Patent Offices

• Data types: raw data, assembled sequences, annotated sequence

• ENA components: Sequence Read Archive (SRA), EMBL-Bank (ENA Annotation)

• Related resources: EMBL-Coding etc., Ensembl/genomes, IMGT/HLA


Nucleotide sequence resources at EMBL-EBI

• European Nucleotide Archive (ENA)

• ENA sequence – Annotated sequence entries

• Sequence Read Archive (SRA) – sequence read data

• Sequence Version Archive (SVA) – historical entry versions

• ENA Coding/Non-coding

• Ensembl

• Assembled genomes and annotations for Vertebrates

• Ensembl Genomes

• Extending Ensembl to other species


ENA updates

When is the data updated?

Data is updated every night, but main releases are quarterly.

Normal Release

• Quarterly release of all EMBL-Bank, e.g. Rel. 116

Updates

• All updates since last normal release

• Rolled into quarterly release


Protein sequence data

Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD


UniProtKB

UniProtKB/Swiss-Prot: 1 entry per protein

• Non-redundant, high-quality manual annotation (reviewed)

UniProtKB/TrEMBL: 1 entry per nucleotide submission

• Redundant, automatically annotated (unreviewed)


UniProtKB/Swiss-Prot: manually annotated

UniProtKB/TrEMBL: computationally annotated


Data sources of UniProtKB

Sources shown feeding UniProtKB/TrEMBL:

• ENA (EMBL) DNA database

• Ensembl

• VEGA (Sanger)

• WormBase

• FlyBase

• PDB

• Sub/Peptide data

• mRNA data

• Patent data


Automatic annotation

UniProtKB employs two prediction systems, UniRule and SAAS.

UniRule is a set of manually established and maintained annotation rules.

SAAS (Statistical Automatic Annotation System) generates a new set of decision trees with every UniProtKB release using data mining.

[Diagram labels: InterPro, Swiss-Prot]


Curation of a UniProtKB/Swiss-Prot entry

Sequence variants

Nomenclature

Sequence features

UniProtKB/TrEMBL

UniProtKB/Swiss-Prot

Ontologies

Literature Annotations

References


UniProtKB


UniProt databases

• UniProtKB/Swiss-Prot

• Manually curated

• UniProtKB/TrEMBL

• Automatically annotated

• UniRef

• Sequences clustered by %identity

• UniParc

• Sequence archive – keeps track of historical sequences & identifiers

• Proteomes


Data

• Simplistically, much of the data held at EMBL-EBI can be thought of as a container

• One part is the raw data itself (e.g. a protein sequence)

• Another part is meta-information, or annotation, about this data


Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, Brazil.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
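In an EMBL flat-file entry like this one, every annotation line starts with a two-letter line code (ID, AC, DE, KW and so on), XX is a spacer, and // ends the entry. The sketch below groups lines by that code. It is only an illustration: the function name `parse_embl_lines` is made up for this example, the shortened entry is retyped from the slide, and a real parser would also need to handle feature tables and sequence lines.

```python
from collections import defaultdict

def parse_embl_lines(text):
    """Group the lines of one EMBL flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        code, _, value = line.partition(" ")
        if code == "XX" or not code:       # XX = spacer; empty code = indented sequence line
            continue
        fields[code].append(value.strip())
    return dict(fields)

# Shortened copy of the example entry (hypothetical subset of its lines).
entry = """ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
//"""

fields = parse_embl_lines(entry)
print(fields["AC"])
print(fields["DE"])
```

Keeping the values as lists matters because several codes (DT, OC, RL, FT) legitimately repeat within one entry.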


Formats

• Different databases store this data in different formats

EMBL

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919

SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//

GENBANK

LOCUS       AJ131285     919 bp    mRNA    linear   INV 20-JUL-2001
DEFINITION  Sabella spallanzanii mRNA for globin 3.
ACCESSION   AJ131285
VERSION     AJ131285.1  GI:13810248
KEYWORDS    globin; globin 3; globin gene.
SOURCE      Sabella spallanzanii
  ORGANISM  Sabella spallanzanii
            Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Palpata;
            Canalipalpata; Sabellida; Sabellidae; Sabella.
REFERENCE   1

ORIGIN
        1 caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
       61 cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
      121 tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
      181 gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
      241 tctgtggaca agttcttcaa gcgtgtcaat ggcaaggaca tcagctcccc agccttccag
      301 gctcacatcc agcgtgtgtt cggtggcttt gacatgtgca tctccatgct tgatgacagt
      361 gatgtgctcg cctctcagct ggctcacctc cacgcccagc acgtcgagag aggaatctct
//


Formats

>gi|13810248|emb|AJ131285.1| Sabella spallanzanii mRNA for globin 3
CAAACAGTCARTTAATTCACAGAGCCCTGAGGTCTCTCGCTCCTTTCTGCGTCACTCTCTCTTACCGTCATCATGTACAAGTGGTTGCTTTGCCTGGCTCTGATTGGCTGCGTCAGCGGCTGCAACATCCTCCAGAGGCTGAAGGTCAAGAACCAGTGGCAGGAGGCTTTCGGCTATGCTGACGACAGGACATCCCYCGGTACCGCATTGTGGAGATCCATCATCATGCAGAAGCCCGAGTCTGTGGACAAGTTCTTCAAGCGTGTCAATGGCAAGGACATCAGCTCCCCAGCCTTCCAGGCTCACATCCAGCGTGTGTTCGGTGGCTTTGACATGTGCATCTCCATGCTTGATGACAGTGATGTGCTCGCCTCTCAGCTGGCTCACCTCCACGCCCAGCACGTCGAGAGAGGAATCTCT

FASTA format
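FASTA is the simplest of these formats: a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below. The function name `read_fasta` is an assumption made for this example, and the record is just the first 60 bases of the entry above, split over two lines to show that sequence lines are joined.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:             # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)                # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>gi|13810248|emb|AJ131285.1| Sabella spallanzanii mRNA for globin 3
CAAACAGTCARTTAATTCACAGAGCCCTGAGGTCTCTCGC
TCCTTTCTGCGTCACTCTCT
"""

records = read_fasta(example)
header, seq = records[0]
print(header.split("|")[3], len(seq))
```

For real work, an established library or the EBI format conversion tools are a safer choice than a hand-written parser.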


Format conversion tools

http://www.ebi.ac.uk/Tools/sfc/


Meta-information

• Contains information important for:

• Identifying/referencing a piece of data or entry

• Classifying an entry

• Determining the source of the data

• And can also contain annotation that adds value:

• Identification of sequence features

• Keywords, GO terms etc.

• Cross-references to other entries that share some property

• Etc.


Searching using the meta-data

• When looking for a sequence we can perform text searches against the meta-data

• Accession look-up

• Keyword search eg: function, species

• Protein family classification

• Accession changes?

• Cross reference services

• PICR

• UniProt ID Mapping


Text search tools

• Each database has its own search engine

• Interface tailored to their specific data use

• There are also EBI-wide search tools

• EBI Search


EBI Search

• First approach/entry point to data resources at EBI


EBI Search

• Just type, with auto-complete


EBI Search

• One stop search across many resources, grouped into categories


Categories


Domains

Multi-domain facet


Domain-specific facet


EBI Search

• First approach/entry point to data resources at EBI

• One stop search across many resources

• Non-expert friendly summaries

• Advanced search available (via direct URL)

• http://www.ebi.ac.uk/ebisearch/advancedsearch.ebi

• Allows domain/field specification

• Boolean etc.


Sequence searching


Sequence searching tools

• Central to modern techniques

• Genome annotation

• Characterising protein families

• Exploring evolutionary relationships


How?

• Search by comparing sequence data rather than meta-data

• Find sequences/entries when meta-data is missing or inaccurate

• More than just an exact look-up

• Allow for sequence variability – look for ‘similar’ sequences

• Sequence variation is important information for bioinformaticians

• Infer homology (shared ancestry)

• If sequences are homologous, information can be transferred between them


Homology vs. Similarity

Homology (inferred)

• Presence of similar features because of common descent

• Cannot be observed, since the ancestors no longer exist

• Is inferred as a conclusion based on ‘similarity’

• Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

Similarity (measurable)

• Quantifies a ‘likeness’

• Uses statistics to determine ‘significance’ of a similarity

• Statistically significant similar sequences are considered ‘homologous’


Sequence alignment

Query: ACATAGGT

Sequence 1: TCATAGAT

Sequence 2: AAATTCTG

Identity score: 6/8 (Query vs Sequence 1), 3/8 (Query vs Sequence 2)
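The identity scores on this slide can be reproduced by counting the positions at which two equal-length sequences carry the same letter. A minimal sketch (the function name `identity` is chosen here for illustration; no gaps are considered yet):

```python
def identity(a, b):
    """Count positions at which two equal-length sequences are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    return sum(x == y for x, y in zip(a, b))

query = "ACATAGGT"
for name, seq in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    print(f"{name}: {identity(query, seq)}/{len(query)}")
```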


Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca

Query:

1

2


Dot plot

• Maybe a dot plot will help

[Dot plot: Query (A C A T A G) along one axis against Sequence 1 along the other (axis labels read G A T A C T)]
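A dot plot simply marks every cell where the letter on one axis equals the letter on the other; runs of matches show up as diagonals. A small text-based sketch, assuming Sequence 1 is the first six letters of TCATAGAT (the function name `dot_plot` is made up for this example):

```python
def dot_plot(query, subject):
    """Return one text row per subject letter: '*' where it matches the query letter."""
    rows = []
    for s in subject:
        rows.append("".join("*" if q == s else "." for q in query))
    return rows

query, subject = "ACATAG", "TCATAG"
print("  " + query)
for letter, row in zip(subject, dot_plot(query, subject)):
    print(letter + " " + row)
```

In the printed grid the main diagonal is marked in five of its six cells, which is the visual signature of the similarity between the two sequences.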


Dot plot

[Dot plots: Query vs Sequence 1 and Query vs Sequence 2]


Algorithms

• To get a computer to solve a problem, the first step is to give it a way to tell what is relatively ‘good’ from what is relatively ‘bad’

• I.e. a score.

• The computer can then assess candidate solutions and choose the best.


• Simple algorithm – penalise movement away from diagonal – gap penalty

[Figure: dot-plot matrix in which steps along the diagonal score 0 and steps away from the diagonal score -10]


Why gap open and extension?

• Adjacent gap positions are likely to have been created by the same in/del event, rather than multiple independent events

• Use a smaller gap extension compared to opening penalty to account for this

G---ATTA G-A-T-TA


• To encourage this, we apply a high penalty just to open a gap and a low penalty for each position the gap covers.

[Figure: the same matrix with separate gap penalties; the first gap position scores -10.5 (-10 to open, -0.5 to extend) and each further gap position adds -0.5, e.g. -11]

Gap open = 10

Gap extend = 0.5
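With these two values, the two gapped versions of GATTA shown on the "Why gap open and extension?" slide can be compared directly. The sketch below follows the convention that appears in the figure, where the first gap position costs open plus extend (10.5); other tools may charge the opening penalty alone for the first position, so treat the exact formula as an assumption. The function name `gap_penalty` is made up for this example.

```python
import re

GAP_OPEN = 10.0
GAP_EXTEND = 0.5

def gap_penalty(aligned, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Total penalty for all gap runs in one row of an alignment.

    Each run of '-' costs gap_open once plus gap_extend per gap position.
    """
    total = 0.0
    for run in re.findall(r"-+", aligned):
        total += gap_open + gap_extend * len(run)
    return total

print(gap_penalty("G---ATTA"))   # one gap of length 3
print(gap_penalty("G-A-T-TA"))   # three separate gaps of length 1
```

The single three-position gap is penalised far less than three scattered one-position gaps, which is exactly the behaviour the separate open and extend penalties are meant to produce.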


Match/mismatch

• Of course, we need to tell the algorithm that matching letters are better than mismatches too

• This is done via a scoring matrix

     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5
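In code, a scoring matrix like this is just a lookup table keyed by a pair of letters. The sketch below builds the +5/-4 nucleotide matrix from the slide and uses it to score an ungapped comparison; the names `SCORES` and `ungapped_score` are chosen here for illustration.

```python
BASES = "ACGT"
MATCH, MISMATCH = 5, -4

# (letter in sequence a, letter in sequence b) -> score
SCORES = {(a, b): (MATCH if a == b else MISMATCH) for a in BASES for b in BASES}

def ungapped_score(a, b):
    """Sum the matrix scores for each aligned pair of letters (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(SCORES[(x, y)] for x, y in zip(a, b))

print(ungapped_score("ACATAGGT", "TCATAGAT"))  # 6 matches, 2 mismatches
print(ungapped_score("ACATAGGT", "AAATTCTG"))  # 3 matches, 5 mismatches
```

Swapping in a protein substitution matrix later only means replacing the contents of the lookup table; the scoring function stays the same.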


• Putting the two together gives us a scoring mechanism

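[Figure: alignment matrix annotated with cumulative scores, such as -4, 1 and 6 on the diagonal route and lower scores (-13.5 to -23) on routes that include gaps]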


• To pick the optimal alignment, start at the end and trace back the highest scoring route.

[Figure: the scored alignment matrix, with the highest-scoring route traced back from the end]


Needleman-Wunsch

• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!

• An example of dynamic programming

• Comparing the full length of both sequences is called a global-global or just global alignment
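The fill-then-trace-back procedure from the previous slides can be written compactly. The sketch below is a simplified Needleman-Wunsch: it uses the +5/-4 match/mismatch scores from the slides but, as an assumption to keep it short, a single linear gap penalty of -10 per gap position instead of separate open and extend penalties. The function name is made up for this example.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):            # first row: b aligned against gaps only
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align the two letters
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the end, following the moves that produced each score.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = needleman_wunsch("ACATAGGT", "TCATAGAT")
print(best)
print(top)
print(bottom)
```

When several routes tie, this trace-back prefers the diagonal move; other implementations may report a different but equally scoring alignment.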


Global vs Local

• But global-global might not be suitable for sequences of very different lengths

• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.

• Sets negative scores in the matrix to 0, so an alignment can end and a new one restart anywhere; trace back starts from the highest-scoring cell and stops at a 0
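The change from global to local alignment is small in code: floor every cell at 0, remember the best cell, and trace back from there until a 0 is reached. As with any short sketch, some simplifications are assumed here: a linear gap penalty of -10, the +5/-4 nucleotide scores, and two made-up test sequences that share the motif CATAG.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # never go negative
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical sequences: unrelated flanks around a shared CATAG motif.
local_score, local_a, local_b = smith_waterman("GGGGCATAGCCCC", "TTTCATAGAAA")
print(local_score, local_a, local_b)
```

Only the shared region is reported; the unrelated flanking letters that a global alignment would be forced to include are ignored.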


Global vs Local

A T G T A T A C G C

A G T A T A - G C

A - T G T A T A C G C

A G T A T A - - - G C


Scoring

• Parameters so far:

• Match/mismatch

• Gap opening

• Gap extending

• Can we improve it?


Substitutions

• Some substitutions are more likely than others


Protein substitution matrices

• Can look at closely related proteins to determine substitution rates

• Two most commonly used models:

• PAM

• BLOSUM


PAM

• Point Accepted Mutation

• Observed mutations in a set of closely related proteins

• Markov chain model created to describe substitutions

• Normalised so that PAM1 = 1 mutation per 100 amino acids

• Extrapolate matrices from model

• Higher PAM number = less closely related

PAM 250
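Extrapolating from PAM1 to a higher PAM number amounts to applying the one-step substitution model repeatedly, i.e. raising the PAM1 probability matrix to that power. The sketch below shows the idea on a toy two-letter alphabet; the 0.99/0.01 values are invented for illustration and are not real PAM data, and the helper names are assumptions.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result

# Toy "PAM1": each letter has a 1% chance of changing in one step (hypothetical values).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

After 250 steps the toy matrix is close to uniform, yet the chance of seeing the original letter is still slightly above one half. Because later changes can hit the same position again or revert it, 250 accepted mutations per 100 positions does not mean that every position differs.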


BLOSUM

• Blocks of Amino Acid Substitution Matrix

• Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

• Count relative frequencies of amino acids and substitution probability

• Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely.

• Higher BLOSUM number = more closely related
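The step from counted frequencies to matrix entries is a log-odds calculation: compare how often a pair is observed with how often it would be expected by chance. A minimal sketch, with several assumptions: the frequencies are invented, the observed frequency is treated as that of the ordered pair (i, j), and scores are scaled to half-bit units and rounded, which is the scaling commonly quoted for BLOSUM62.

```python
import math

def log_odds(observed_pair, freq_i, freq_j, scale=2.0):
    """Log-odds score: positive if the pair is seen more often than chance predicts."""
    expected_pair = freq_i * freq_j
    return round(scale * math.log2(observed_pair / expected_pair))

# Hypothetical numbers: both residues have a background frequency of 5%.
print(log_odds(0.02, 0.05, 0.05))       # pair seen 8x more often than expected
print(log_odds(0.000625, 0.05, 0.05))   # pair seen 4x less often than expected
```

A pair observed exactly as often as chance predicts scores 0, which is why both positive and negative entries appear in the matrix.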


BLOSUM 45    PAM 250    (more divergent)
BLOSUM 62    PAM 160
BLOSUM 90    PAM 100    (less divergent)


Scoring

• Parameters:

• Match/mismatch

• Gap opening

• Gap extending

• Substitution matrix


Dynamic programming alignments at the EBI

• EMBOSS Pairwise Alignment algorithms

• European Molecular Biology Open Software Suite

• Suite of useful tools for molecular biology

• Command line based

• Designed to be used as part of scripts/chained programs

• We implement selected tools to provide web and Web Services access

• Database alignments via FASTA suite of programs


Where to find at the EBI?

http://www.ebi.ac.uk/Tools/psa/



Pairwise alignment tools

• Global alignment

  • Needle

  • Stretcher (big sequences)

• Local alignment

  • Water

  • Matcher (big sequences)

  • LALIGN

• Genomic DNA alignment

  • WISE tools


Change to nucleotide

Sequence input

Parameters

Submit!


Key

- Gap

: Positive match

. Negative match

| Identity


Example sequences

www.ebi.ac.uk/~apc/Courses/Brazil

Pairwise_align1.fsa

Pairwise_align2.fsa


Searching a database

• Multiple pairwise alignments: the query sequence is aligned against each database sequence in turn


Dynamic programming sequence search methods at the EBI

• Global alignment: GGSEARCH

• Local alignment: SSEARCH

• Global query vs local database: GLSEARCH

• Profile-iterative search: PSI-SEARCH


Where to find at the EBI?

http://www.ebi.ac.uk/Tools/sss/



Database selection

Sequence input

Parameters

Submit!


• Dynamic programming methods are rigorous and guarantee an optimal result

• But take up a lot of memory

• And evaluate each position of the matrix

• Predictably, this makes them slow and demanding when you are aligning large sequences


Heuristics

• Therefore we need methods of estimating alignments

• Estimation methods are called heuristics

• Try and take short cuts in an intelligent manner

• Speed up the search

• At the possible expense of accuracy

• Accuracy in sequence searches is important for:

• Aligning the right bits

• Scoring the alignment correctly

• Identifying similar sequences - sensitivity


• Going back to our dot plot


• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.


• Of course, we have to identify likely regions – not all alignments will be as nice as that one!

• This is the method used by FASTA

• W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448


FASTA – step 1

• Identify runs of identical sequence and pick regions with highest density of runs

Ktup parameter: how small ‘words’ can be before they are ignored

Increase Ktup = faster, but less sensitive
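The word-matching step can be sketched as follows: index every word of length ktup in the query, look up each word of the database sequence, and count hits per diagonal (subject position minus query position). This is only an illustration of the idea, not the FASTA program's actual code; the function names are made up and the two short sequences come from the earlier alignment example.

```python
from collections import Counter, defaultdict

def word_positions(seq, ktup):
    """Map each word of length ktup to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def diagonal_hits(query, subject, ktup=2):
    """Count identical words per diagonal, where diagonal = subject pos - query pos."""
    index = word_positions(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits

hits = diagonal_hits("ACATAGGT", "TCATAGAT", ktup=2)
print(hits.most_common())
```

The diagonal with the most word hits marks the region worth examining further. Raising ktup shrinks the index and the number of chance hits, which is faster, but short matching runs are then missed, which is the loss of sensitivity noted on the slide.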


FASTA – step 2

• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores

Parameter: Substitution matrix


FASTA – step 3

• Discard regions too far from the highest scoring region

Joining threshold: Internally determined


FASTA – step 4

• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions

Parameters: Gap open, Gap extend, Substitution matrix
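The idea of restricting dynamic programming to a narrow band can be sketched as follows. This is a simplified illustration, not FASTA's implementation: it uses a single linear gap penalty instead of separate gap open and gap extend values, and a plain match/mismatch score instead of a substitution matrix. All names and default values are assumptions.

```python
def banded_local_score(a, b, diagonal=0, band=3, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells within `band` of `diagonal` (j - i)."""
    prev = {}   # scores of the previous row, keyed by column j
    best = 0
    for i in range(1, len(a) + 1):
        lo = max(1, i + diagonal - band)
        hi = min(len(b), i + diagonal + band)
        cur = {}
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band are treated as 0, i.e. a fresh local start.
            score = max(
                0,
                prev.get(j - 1, 0) + sub,   # align a[i-1] with b[j-1]
                prev.get(j, 0) + gap,       # a[i-1] aligned to a gap in b
                cur.get(j - 1, 0) + gap,    # b[j-1] aligned to a gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


# The true alignment sits on diagonal 2, so only a band centred there finds it.
print(banded_local_score("ACGTACGT", "TTACGTACGT", diagonal=2, band=1))
print(banded_local_score("ACGTACGT", "TTACGTACGT", diagonal=0, band=1))
```

The work is proportional to the sequence length times the band width rather than to the full matrix, which is why choosing the right region in the earlier steps matters so much.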


FASTA

• Repeat against all sequences in the database


FASTA – programs available at EBI

• FASTA: “a fast approximation to Smith & Waterman”

• FASTA – scan a protein or DNA sequence library for similar sequences.

• FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.

• TFASTX/Y – compare a protein sequence to a translated DNA data bank.

• FASTF – compares ordered peptides (Edman degradation) to a protein databank.

• FASTS – compares unordered peptides (Mass Spec.) to a protein databank.


Where to find at the EBI?

http://www.ebi.ac.uk/Tools/sss/


Database selection

Sequence input

Parameters

Submit!


FASTA - results


Key

- Gap

: Identity

. Similarity

X Filtered


Example sequence

www.ebi.ac.uk/~apc/Courses/Brazil

test_prot.fasta


BLAST – Basic Local Alignment Search Tool

• Instead of narrowing the dynamic programming search space, BLAST works in a slightly different way

• Firstly, it creates a word list both of the exact sequence and high scoring substitutions

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.


BLAST – step 1

• w=3

SEWRFKHIYRGQPRRHLLTTGWSTFVT

SEW
EWR
WRF

Parameter: Word length (w)

Increase = faster, but less sensitive


BLAST – step 1(cont.d)

• w=3

• T=13

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood words for GQP, with scores:

GQP 18
GEP 15
GRP 14
GKP 14
GNP 13
GDP 13

Below the threshold (T=13), so not included:

AQP 12
NQP 12

Parameters: Neighbourhood threshold (T), Substitution matrix
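The word list and its neighbourhood can be reproduced with a short sketch. The slide does not name the matrix; the scores shown for GQP are consistent with BLOSUM62, so the three rows below are transcribed from the standard BLOSUM62 table, which is an assumption about what was used. Only the rows for G, Q and P are included, so the `neighbourhood` function as written works only for words made of those three residues.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows of BLOSUM62 for G, Q and P only (assumed matrix; column order = AMINO_ACIDS).
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
}


def words(seq, w=3):
    """All overlapping words of length w in the query."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighbourhood(word, threshold):
    """Every word of the same length scoring at least `threshold` against `word`."""
    per_position = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[res])) for res in word]
    neighbours = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(scores[c] for scores, c in zip(per_position, candidate))
        if score >= threshold:
            neighbours["".join(candidate)] = score
    return neighbours


print(words("SEWRFKHIYRGQPRRHLLTTGWSTFVT")[:3])
for word, score in sorted(neighbourhood("GQP", 13).items(), key=lambda kv: -kv[1]):
    print(word, score)
```

Lowering T admits more neighbours per query word (AQP and NQP appear at T=12), which raises sensitivity at the cost of more database hits to follow up.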


BLAST – step 2

• Then it scans database sequences for exact matches with these words


BLAST – step 3

• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

• This results in a High-scoring Segment Pair (HSP)

Parameters: Drop off, Substitution matrix
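The drop-off rule can be illustrated with a minimal ungapped extension. This is a sketch under simplifying assumptions: it extends from a single seed rather than requiring two hits on the same diagonal, and it scores with a fixed match/mismatch pair instead of a substitution matrix. The names, scores and drop-off value are illustrative only.

```python
def pair_score(a, b, match=2, mismatch=-3):
    return match if a == b else mismatch


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend from (qi, si) in direction `step`; return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += pair_score(query[qi], subject[si])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break   # score has dropped too far below the best seen: stop extending
        qi += step
        si += step
    return best, best_len


def ungapped_hsp(query, subject, q_seed, s_seed, word_len, x_drop=5):
    """Extend a seed word in both directions without gaps."""
    seed = sum(pair_score(query[q_seed + k], subject[s_seed + k]) for k in range(word_len))
    right, r_len = extend_one_way(query, subject, q_seed + word_len, s_seed + word_len, 1, x_drop)
    left, l_len = extend_one_way(query, subject, q_seed - 1, s_seed - 1, -1, x_drop)
    return {
        "score": seed + left + right,
        "q_start": q_seed - l_len,            # inclusive
        "q_end": q_seed + word_len + r_len,   # exclusive
    }


hsp = ungapped_hsp("TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA", q_seed=5, s_seed=5, word_len=3)
print(hsp)
```

A larger drop-off lets the extension ride through a poor stretch to reach further matches, while a smaller one stops sooner and runs faster.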


BLAST – step 4

• If the total HSP score is above another threshold then a gapped extension is initiated

Parameters: Extension threshold (Sg), Substitution matrix


BLAST

• The steps rule out many database sequences early on

• Large increase in speed


BLAST – programs available at the EBI

• Basic Local Alignment Search Tool

• NCBI BLAST programs:

• BLASTP – protein sequence vs. protein sequence library

• BLASTN – nucleotide query vs. nucleotide database

• BLASTX – translated DNA vs. protein sequence library


Key

- Gap

[residue] Identity

+ Similarity

X Filtered


Example sequence

www.ebi.ac.uk/~apc/Courses/Brazil

test_prot.fasta


When to use what?

[Figure: GGSEARCH, FASTA, BLAST, PSI-SEARCH and SSEARCH placed on axes of database size and query length]


When to use what?

[Figure: time to search with GGSEARCH, FASTA, BLAST, PSI-SEARCH and SSEARCH across the databases PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]


Homology and Similarity


Similarity


Homology


• So far, we’ve talked about scoring alignments

• Direct function of the algorithm

• But what we want is to assign some kind of quality to that score


Score vs significance

A A A

A A A

A C A T A A G G C T

A T A C A A G C C T

High score Higher significance?


“Lies, damn lies, and statistics”


• Not just interested in score...

• ...But how likely we are to get that alignment by chance alone

• It is this ‘non-random’ alignment that suggests homology

• Statistics are used to estimate this chance


E-value

• ‘Expect’ value (really ‘expectation’)

• Number of alignments scoring this well or better that we expect to find by chance in the given database, or “how many times you might be wrong”

• Best measure of how biologically significant an alignment is

• Used for ranking results by default

• Most people use a cut-off of 10^-3: “Happy to be wrong one time in a thousand”


• Calculated in slightly different ways for BLAST and FASTA

• Short alignments are more likely to be found by chance so have higher E-values

• Affected by database size

• BLAST and FASTA both optimised for distant relationships
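To make the dependence on database size concrete, here is a minimal sketch using the Karlin-Altschul relation for normalised bit scores, E = m * n * 2^(-S'), where m is the query length and n the database length. Real programs use effective lengths and further corrections, and FASTA estimates its E() values differently, so treat this as an illustration only. The function names are invented.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for a normalised bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)


small_db = expect_value(bit_score=50, query_length=250, database_length=10**8)
large_db = expect_value(bit_score=50, query_length=250, database_length=2 * 10**8)
print(small_db, large_db)   # the same alignment is less significant in the larger database
```

Each additional bit of score halves E, and doubling the database doubles it, which is one reason the same hit can look significant against a small database and marginal against a large one.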


FASTA statistics

• Compares query sequence with every sequence in database

• As most of these sequences are unrelated it is possible to use the distribution of scores (sampled) to assign statistical significance

• As distribution is taken from a random sample, exact E-Value can vary slightly from search to search


FASTA - histogram

Key

* Predicted distribution of scores

= Observed distribution of scores

[Figure: FASTA score histogram with the high scoring region marked]


BLAST statistics

• Main reason for speed is that it doesn’t compare query with lots of other sequences

• Therefore it pre-estimates statistical values using a random sequence model

“Appears to yield fairly accurate results”


Improving algorithms


Sensitivity, Selectivity & Speed

• Sensitivity is how distantly you can determine a homologous sequence (avoid false negative)

• Selectivity is how accurately you can determine whether a sequence is homologous or not (avoid false positive)

• Speed is obviously how long it takes!


• In general, the more information we can add to an alignment, the better the result

• Conserved regions

• Structural information

• Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
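A motif written in this style maps directly onto a regular expression. The sketch below converts the notation used on the slide into a Python pattern; the function name and the test sequence are invented for illustration.

```python
import re


def motif_to_regex(motif):
    """Convert '[R, T or D]-[D, A or Q]-...-A-T-H' into a compiled regular expression."""
    parts = []
    for element in motif.split("-"):
        residues = re.findall(r"[A-Z]", element)   # one-letter residue codes only
        parts.append(residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]")
    return re.compile("".join(parts))


pattern = motif_to_regex("[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H")
print(pattern.pattern)

match = pattern.search("MKTQEATHLL")   # hypothetical protein fragment
print(match.group(), match.start())
```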


Conserved regions

• We can add a new ‘position’ parameter to the substitution matrix

We can even modify a normal search to generate a position specific scoring matrix, or PSSM


PSI-BLAST

Position Specific Iterative – BLAST:

1. Takes the result of a normal BLAST

2. Aligns them and generates profile of conserved positions

3. Uses this to weight scoring on next iteration
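The profile in step 2 can be sketched as a log-odds matrix built column by column from the aligned hits. This is a simplified illustration and not the PSI-BLAST procedure itself: it assumes a uniform background frequency and a flat pseudocount, and it skips sequence weighting. All names and the example hits are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0, background=None):
    """One dict of log2-odds scores per alignment column."""
    background = background or {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in background)   # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm


def pssm_score(pssm, segment):
    """Score an ungapped segment against the profile, position by position."""
    return sum(column[res] for column, res in zip(pssm, segment))


hits = ["GQPA", "GEPA", "GQPS", "GKPA"]   # hypothetical aligned hits from a first search
pssm = build_pssm(hits)
print(round(pssm_score(pssm, "GQPA"), 2), round(pssm_score(pssm, "AQPA"), 2))
```

A substitution at a conserved column (G at position 1 here) is penalised more heavily than the same substitution would be under a general matrix, which is how the profile sharpens the next iteration.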


Example sequence

www.ebi.ac.uk/~apc/Courses/Brazil

test_prot.fasta


PHI-BLAST

• Pattern Hit Initiated-BLAST

• User provides a pattern alongside a protein

• Database hits have to contain this pattern and show similarity to the rest of the sequence

• Results can initiate a PSI-BLAST search as well


PSI-BLAST

• By adding importance to conserved residues we might be able to find more distant sequences

• But iterate too far and we might be assigning importance where there is none

• Problem of Homologous Over-Extension (HOE)

More sensitive

Less selective


Homologous Over-Extension (HOE)

• Alignment region extends over subsequent iterations (2nd, 3rd)

• Contaminated PSSM

• Which can cause (significant) alignment with unrelated protein

Example:

• PSI-BLAST initial search. Expect score: 9.0x10^-5

• PSI-BLAST 2nd Iteration

• PSI-BLAST 3rd Iteration. Expect score: 1.0x10^-4

• PSI-BLAST 5th Iteration. Expect score: 7.0x10^-4; Expect score: 1.0x10^-19


Reducing HOE

• Look for domains in results and manually select sequences that form part of PSSM

• Mask boundaries according to initial alignment

• Results in fewer false positives (improved selectivity)


PSI-SEARCH

• Smith-Waterman implementation (SSEARCH)

• With iterative position specific scoring

• Optional boundary masking to reduce HOE


Reducing HOE errors

• Sequence boundary masking procedure

• First time a significant alignment occurs for a library sequence, store co-ordinates


• Mask regions outside so can’t contribute to PSSM
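The bookkeeping behind boundary masking can be sketched in a few lines. This is only an illustration of the idea described above, not the PSI-Search code: the class, its method names and the half-open coordinate convention are all assumptions.

```python
class BoundaryMask:
    """Remember where each library sequence first aligned significantly."""

    def __init__(self):
        self.boundaries = {}   # sequence id -> (start, end), end exclusive

    def record(self, seq_id, start, end):
        # Only the first significant alignment sets the boundaries.
        self.boundaries.setdefault(seq_id, (start, end))

    def clip(self, seq_id, start, end):
        """Restrict a later, possibly over-extended, alignment to the stored region."""
        first_start, first_end = self.boundaries[seq_id]
        return max(start, first_start), min(end, first_end)

    def mask(self, seq_id, sequence, mask_char="X"):
        """Mask residues outside the stored region so they cannot reach the PSSM."""
        start, end = self.boundaries[seq_id]
        return mask_char * start + sequence[start:end] + mask_char * (len(sequence) - end)


masks = BoundaryMask()
masks.record("P1", 10, 20)   # iteration 1: first significant alignment
masks.record("P1", 5, 30)    # iteration 2: alignment has grown, boundaries stay put
print(masks.clip("P1", 5, 30), masks.mask("P1", "A" * 30))
```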


Reducing HOE errors

PSI-Search 2nd Iteration

PSI-Search 5th Iteration


PSI-Search


So what does that do to sensitivity/selectivity?

[Figure: sensitivity plotted against selectivity]

PSI-Search = Very sensitive + Much more selective


Coming soon

• PSI-Search 2!

• Use domain annotations/predictions to inform alignment


Low complexity regions

• Biologically irrelevant, but likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions


Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look out for first when evaluating results.


Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.


Low complexity regions

• Biologically irrelevant, but likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions

• Compensate by filtering/masking sequence so these regions don’t contribute to scoring

• Filters: seg, xnu, dust, CENSOR

• But check what you are filtering!
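To show what a filter does in principle, the sketch below masks every window whose Shannon entropy falls below a cut-off. This is a deliberately simple stand-in, not the algorithm used by seg, xnu, dust or CENSOR, and the window size and threshold are arbitrary illustrative values.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = mask_char * window
    return "".join(masked)


# Hypothetical protein with a proline-rich stretch in the middle.
seq = "MKTAYIAKQRQISFVKSHFSRQ" + "P" * 16 + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_complexity(seq))
```

Printing the masked sequence next to the original is an easy way to follow the advice above and check what is being filtered: windows that only partly overlap the repeat are masked too, so some flanking residues are lost.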


Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.


Example sequence

www.ebi.ac.uk/~apc/Courses/Brazil

Filtertest_seq.fsa


Database composition

• Statistics rely on database containing wide coverage

• Assumes the query is not homologous to most of the data

• Specialist databases might cause problems

• E.g. immuno-databases, made up of relatively few genes

• A lot of the database IS homologous

• Skews statistics


Database composition

• Can’t make same assumptions about coverage

• So don’t use BLAST

• FASTA based tools sample the score so provide accurate statistics

• Use the histogram to check

• Use shuffled versions of database to create additional coverage


Search Guidelines


Search guidelines 1

• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)

• Then with translated DNA query sequences (fastx, blastx)

• Search with DNA vs. DNA as the next resort

• And then against translated DNA database sequences (tfastx, tblastx) as the VERY LAST RESORT!


Search guidelines 2

• Search the smallest database that is likely to contain the sequence(s) of interest

• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology


Search guidelines 3

• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence

• Examine the histograms

• Use programs such as prss3 to confirm the expectation values.

• Search with shuffled sequences (use MLE/Shuffle in fasta), which should give an E() of ~1.0

• Perform reverse search
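The reasoning behind the shuffling check can be sketched as a small permutation test: shuffle one sequence many times, realign, and see how often a shuffled version scores as well as the real one. This is an illustration of the principle, not how prss3 or the fasta shuffle option work internally (they fit a score distribution rather than counting). The scoring scheme and all names are assumptions.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,
                prev[j - 1] + (match if ca == cb else mismatch),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def shuffle_test(query, subject, n_shuffles=200, seed=0):
    """Return the real score and the fraction of shuffles scoring at least as well."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    residues = list(subject)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)   # keeps composition and length, destroys the order
        if local_score(query, "".join(residues)) >= real:
            as_good += 1
    return real, (as_good + 1) / (n_shuffles + 1)


query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # hypothetical sequence
print(shuffle_test(query, query))
```

If shuffled sequences regularly score as well as the real hit, the similarity is explained by composition alone and the reported significance should not be trusted.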


Search guidelines 4

• Default parameters are set up for most common queries

• Consider searches with different gap penalties and other scoring matrices, especially for short queries/domains

• Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences

• Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)

• Remember to change the gap penalty defaults (if the tool doesn’t change them for you)

MATRIX     open   ext.
BLOSUM50   -10    -2
BLOSUM62   -11    -1
BLOSUM80   -16    -4
PAM250     -10    -2
PAM120     -16    -4


Search guidelines 5

• Homology can be reliably inferred from statistically significant similarity

• But remember:

• Orthologous sequences have similar functions

• Paralogous sequences can acquire very different functional roles

• So further work might be needed to tease out details


Search guidelines 6

• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues

• However, motif identity in the absence of significant sequence similarity usually occurs by chance alone

• Try to produce multiple sequence alignments in order to examine the relatedness of your sequence data

• Clustal Omega

• MUSCLE

• T-Coffee

• Kalign

• MAFFT


Problem Sequences


Short sequences

• What about short sequences?

• Depends on their nature:

• Protein

• Use shallow matrices

• Reduce word length and/or increase the E() value cut off

• DNA

• Reduce the word length

• Ignore gap penalties (force local alignments only)

• Use rigorous methods

• But ask what you are trying to do!


Vector contamination

• You think you know what your sequence is...

• ...but the results are really confusing!

• Maybe you have vector contamination

• Search against known vectors to check


Example sequences

www.ebi.ac.uk/~apc/Courses/Brazil

vectortest_seq1.fsa

vectortest_seq2.fsa


Multiple Sequence Alignments


Uses of Multiple Sequence Alignment (MSA)

• Alignment of three or more sequences

• Functional prediction

• Structural prediction

• Conservation analysis

• Classification

• Phylogeny

• To help distinguish between orthology and paralogy


We have a (computational) problem…

• Pairwise alignments are simple enough to find the optimal (highest scoring) solution in a reasonable timeframe

• Multiple sequence alignment is in a class of problems that is ‘NP-hard’


NP-easy

• Problems that are solvable in polynomial time

• E.g. operations to solve = n^2

NP-hard

• Problems that are hard to solve

• E.g. operations to solve = 2^n


n^2 vs 2^n

• Imagine a computer running 10^9 operations a second

        n = 10     n = 30     n = 50     n = 70
n^2     100        900        2500       4900
        < 1 sec    < 1 sec    < 1 sec    < 1 sec
2^n     1024       10^9       10^15      10^21
        < 1 sec    1 sec      13 days    about 37,000 years
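The running times follow directly from dividing the operation count by the machine speed. A short sketch to check the arithmetic, using the 10^9 operations a second assumed on the slide (the function name is invented):

```python
OPS_PER_SECOND = 10**9
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(operations):
    return operations / OPS_PER_SECOND


for n in (10, 30, 50, 70):
    poly = seconds_needed(n**2)
    expo = seconds_needed(2**n)
    print(f"n={n:>2}  n^2: {poly:.1e} s   2^n: {expo:.3g} s "
          f"= {expo / SECONDS_PER_DAY:.3g} days = {expo / SECONDS_PER_YEAR:.3g} years")
```

At this speed 2^50 operations take about 13 days, and 2^70 operations take roughly 37,000 years, while n^2 stays far below a second throughout.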


What to do about NP-hard problems?

• Give up (do you really need MSA?)

• Use approximations and heuristics


Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 : : .: . .. . :

Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Sequences  Time
2          1 second
3          150 seconds
4          6.25 hours
5          39 days
6          16 years
7          2404 years

Time O(L^N)
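Both the weighted sum of pairs and the growth in running time are easy to check numerically. In the sketch below, `wsp` sums W_ij * D_ij over every pair of sequences with j < i. The timing function assumes a sequence length of L = 150 and one second for two sequences; L = 150 is inferred from the factor between consecutive rows of the table and is not stated on the slide.

```python
def wsp(weights, distances):
    """Weighted sum of pairs over all pairs (i, j) with j < i (0-based indices)."""
    n = len(distances)
    return sum(weights[i][j] * distances[i][j] for i in range(1, n) for j in range(i))


def msa_seconds(n_sequences, length=150):
    """Time for exact alignment, scaled so that two sequences take one second."""
    return length ** (n_sequences - 2)


# Hypothetical pairwise distances for three sequences, all pairs weighted equally.
weights = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
distances = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
print(wsp(weights, distances))

for n in range(2, 8):
    seconds = msa_seconds(n)
    print(n, seconds, "s =", round(seconds / 3600, 2), "h =", round(seconds / 86_400, 1), "days")
```

Each extra sequence multiplies the work by the sequence length, which is why exact methods stop being practical after a handful of sequences.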


Progressive Alignment:

• Barton and Sternberg, 1987

• Florence Corpet, 1988

• Feng and Doolittle, 1987

• Jotun Hein, 1989

• Higgins and Sharp, 1988

• Hogeweg and Hesper, 1984

• Willie Taylor, 1987, 1988


Guide Tree (figure; leaves: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin)
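A guide tree fixes the order in which sequences and groups of sequences are merged during progressive alignment. The sketch below builds one by average-linkage (UPGMA-style) clustering of pairwise distances. The clustering method is chosen here for illustration, the distance values are invented, and the function names are hypothetical.

```python
from itertools import combinations


def make_distances(triples):
    """Turn (name_a, name_b, distance) triples into a symmetric lookup."""
    return {frozenset((a, b)): d for a, b, d in triples}


def build_guide_tree(names, dist):
    """Return a nested-tuple tree built by repeatedly joining the closest pair."""
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


names = ["Human beta", "Horse beta", "Human alpha", "Horse alpha"]
dist = make_distances([          # made-up distances, for illustration only
    ("Human beta", "Horse beta", 0.16),
    ("Human alpha", "Horse alpha", 0.12),
    ("Human beta", "Human alpha", 0.55),
    ("Human beta", "Horse alpha", 0.56),
    ("Horse beta", "Human alpha", 0.57),
    ("Horse beta", "Horse alpha", 0.58),
])
print(build_guide_tree(names, dist))
```

A progressive aligner would then align the closest pairs first and merge the resulting groups by following the tree towards the root.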


Clustal

• >85,000 citations

• Clustal1-Clustal4, 1988: Paul Sharp, Dublin

• Clustal V, 1992: EMBL Heidelberg (Rainer Fuchs, Alan Bleasby)

• Clustal W, Clustal X, 1994-2007: Toby Gibson (EMBL, Heidelberg), Julie Thompson (IGBMC, Strasbourg)

• Clustal W and Clustal X 2.0, 2007: University College Dublin

www.clustal.org


ClustalW2 at the EBI

www.ebi.ac.uk/Tools/msa/clustalw2/


www.ebi.ac.uk/Tools/msa/


ClustalW2

Sequence input

Parameters

Submit!


Jalview


ClustalW2

Advantages

• Quite fast for low numbers

• Not too demanding

• Widely used

Disadvantages

• Fixing of early alignments

• Propagate errors

• Doesn’t search far

• Local minima

• Compresses gaps


Example sequences

www.ebi.ac.uk/~apc/Courses/Brazil

Prot_MSA.fsa


Other progressive aligners

• MUSCLE

• Optimised progressive aligner

• Good alternative to ClustalW

BaliBase     % correct   time (s)
Clustal W    37.4        766
Muscle       47.5        789


Other progressive aligners

• KAlign

• Local regions progressive aligner

• Extremely fast!

• Good for large alignments/input

BaliBase     % correct   time (s)
Clustal W    37.4        766
Kalign       50.1        21


Consistency based alignment

• Maximise similarity to a library of residue pairs


COFFEE

• Consistency based Objective Function For alignmEnt Evaluation

• Maximum Weight Trace (John Kececioglu)

• Maximise similarity to a LIBRARY of residue pairs

• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.


COFFEE

• Library of reference pairwise alignments

• For your given set of sequences

• Objective Function

• Evaluates consistency between multiple alignment and the library of pairwise alignments

• Use SAGA to optimise this function

• Weigh depending on quality of alignment

SAGA is another alignment method, using genetic algorithms
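The idea of scoring a multiple alignment by its consistency with a library of pairwise alignments can be sketched as follows. This is a simplified, unweighted version written for illustration: it reports the fraction of residue pairs aligned in the multiple alignment that also occur in the library, and it leaves out the weighting by alignment quality mentioned above. The function names and data layout are assumptions of this sketch.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Return the residue-index pairs that share a column in two gapped rows."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def consistency_score(msa, library):
    """Fraction of residue pairs in the MSA that also appear in the library.

    msa:     dict name -> gapped row (all rows the same length)
    library: dict (name_a, name_b) -> set of (i, j) residue-index pairs,
             with the two names in sorted order
    """
    shared = total = 0
    for name_a, name_b in combinations(sorted(msa), 2):
        pairs = aligned_pairs(msa[name_a], msa[name_b])
        reference = library.get((name_a, name_b), set())
        shared += len(pairs & reference)
        total += len(pairs)
    return shared / total if total else 0.0


msa = {"a": "AC-GT", "b": "ACCGT", "c": "A-CGT"}
library = {("a", "b"): aligned_pairs("AC-GT", "ACCGT")}
print(consistency_score(msa, library))
```

An optimiser such as SAGA would then search for the multiple alignment that maximises a score of this kind.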


COFFEE

• More accurate than ClustalW

• Much less prone to problems in early alignment stages

• VERY slow!


T-Coffee

• Tree-based COFFEE

• Heuristic approach to COFFEE

• Gets rid of genetic algorithm portion

• Uses progressive alignments

• Changes algorithm based on number of sequences


T-Coffee

• Much faster than COFFEE

• Avoids some of ClustalW’s pitfalls

• Can take information from several data sources

• Still not that fast

• Can be very demanding of memory etc.


Other Tools

• MAFFT

• Iterative based Fast Fourier Transform

• Different modes – can operate in both progressive and consistency type alignments


NEW!: Clustal Omega

• Completely different way of doing things from ClustalW

• Two major areas of improvement:

• 1) Guide tree generation

• 2) Profile-profile alignments


Clustal Omega – Guide Tree improvements

• Guide tree generation is one of the slowest steps

• Especially with large numbers of sequences

• Clustal Omega uses the mBed method to sample a range of sequences and represent all sequences as vectors of distances to these samples

• Results in better scaling with more sequences
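The embedding idea can be sketched in a few lines: instead of comparing every sequence with every other sequence, each sequence is compared only with a small set of reference samples, giving one short vector per sequence. The sketch below is illustrative only. The k-mer distance, the way references are sampled, and all function names are assumptions, not the method as implemented in Clustal Omega.

```python
def kmer_set(seq, k=2):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(seq_a, seq_b, k=2):
    """1 - Jaccard similarity of k-mer sets (an illustrative distance)."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def embed(sequences, n_refs):
    """Represent each sequence as a vector of distances to n_refs samples.

    sequences: dict name -> ungapped sequence
    Returns (reference names, dict name -> vector).
    """
    # Sample references at even steps through the length-sorted names.
    names = sorted(sequences, key=lambda n: len(sequences[n]))
    step = max(1, len(names) // n_refs)
    refs = names[::step][:n_refs]
    vectors = {
        name: [kmer_distance(seq, sequences[r]) for r in refs]
        for name, seq in sequences.items()
    }
    return refs, vectors


seqs = {"s1": "VHLTPEEK", "s2": "VQLSGEEK", "s3": "VLSPADKTNV", "s4": "VLSAADKTNVKA"}
refs, vectors = embed(seqs, n_refs=2)
print(refs, vectors["s2"])
```

With N sequences and a fixed number of references, this needs a number of distance calculations proportional to N rather than to N squared, and the vectors can then be clustered to build the guide tree.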


Clustal Omega – Profile-profile alignments

• As in sequence searching, profiles can be used to increase sensitivity

• HMMs are a form of profile

• Clustal Omega aligns HMMs to HMMs


Clustal Omega

• Better scaling for many sequences

• Speed

• Accuracy

• Better scaling for many computers

• More accurate alignments

• Nucleic Acid alignments still work in progress


Which tool should I use?

Input data                                   Recommendation
2-100 sequences of typical protein length    MUSCLE, T-Coffee, MAFFT, ClustalW2/Omega
100-500 sequences                            Clustal Omega, MUSCLE, MAFFT
>500 sequences                               Clustal Omega, KALIGN
Small number of unusually long sequences     ClustalW, KALIGN


How to evaluate?

• Use a benchmark

• BaliBASE


BaliBASE

Thompson, J.D., Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

• IGBMC Strasbourg

• 141 manual alignments using structures

• 5 sections

• Core alignment regions marked

1. Equidistant (82)

2. Orphan (23)

3. Two groups (12)

4. Long internal gaps (13)

5. Long terminal gaps (11)


BaliBase                   % correct   time (s)
Clustal Omega              55.4        539
Clustal W                  37.4        766
Mafft (default)            45.8        68
Muscle                     47.5        789
Kalign                     50.1        21
T-Coffee                   55.1        81041
Probcons                   55.8        13086
Mafft (auto/consistency)   58.8        1475
MsaProbs                   60.7        12382
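One way to read a table like this is to ask which tools are not beaten on both accuracy and run time by any other tool. The sketch below takes the figures from the table and lists that set, fastest first. The selection rule and the function name are this sketch's own, not something the benchmark itself defines.

```python
# (percent correct, time in seconds), copied from the table above
RESULTS = {
    "Clustal Omega": (55.4, 539),
    "Clustal W": (37.4, 766),
    "Mafft (default)": (45.8, 68),
    "Muscle": (47.5, 789),
    "Kalign": (50.1, 21),
    "T-Coffee": (55.1, 81041),
    "Probcons": (55.8, 13086),
    "Mafft (auto/consistency)": (58.8, 1475),
    "MsaProbs": (60.7, 12382),
}


def best_tradeoffs(results):
    """Tools for which no other tool is at least as accurate and as fast."""
    front = []
    for name, (acc, secs) in results.items():
        dominated = any(
            o_acc >= acc and o_secs <= secs and (o_acc > acc or o_secs < secs)
            for other, (o_acc, o_secs) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: results[n][1])


print(best_tradeoffs(RESULTS))
```

Which of these tools is the right choice still depends on how much accuracy a given analysis needs and how long it can afford to run.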


Benchmark pitfalls

• Benchmark dataset may not be representative

• Danger of over-training towards benchmark

• Goldman: Most MSAs have unrealistic gaps

• Tend towards multiple, independent deletions

• Insertions are rare

• Sequences shrink in length over evolution

• No supporting evidence that this is the case


Solutions

• Use phylogenetic data to guide alignment

• Keep track of changes to ancestor sequences

• Don’t change them again so easily in descendants


Phylogeny

• Multiple Sequence Alignment tries to find best alignment of three or more sequences

• Used to identify groups of similar sequences

• Conserved regions etc.

• But if we want to examine evolutionary relationships we need more than just current sequence similarity

• Phylogeny is an estimate of evolutionary history between sequences

• Model substitutions from theoretical ancestor sequences


Neighbour Joining

• Simple phylogenetic tree method

• Bottom up (starts from alignment of current day sequences)

• Iterate to form a tree with nodes forming minimum distances between paired taxa

• Fast

• Dependent on accuracy of input

• Can sometimes get negative branch lengths
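A single joining step of neighbour joining can be sketched as below, assuming the input is a symmetric matrix of pairwise distances between at least three taxa. The pair to join is the one that minimises the standard Q criterion, and the two branch lengths to the new node follow from the row sums. The function name and the small example matrix are illustrative.

```python
def neighbour_joining_step(dist):
    """Pick the pair to join and the branch lengths to the new node.

    dist: symmetric matrix (list of lists) with zeros on the diagonal,
          for at least three taxa
    Returns ((i, j), (length_i, length_j)).
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return (i, j), (length_i, length_j)


example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbour_joining_step(example))
```

Because each branch length is computed from differences of row sums, it can come out below zero when the distances are far from tree-like, which is how negative branch lengths arise. A full implementation would replace the joined pair with the new node, recompute distances to it, and repeat until the tree is complete.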


ClustalW2 - Phylogeny

• Neighbour joining (and UPGMA) phylogenetic tree algorithm from the ClustalW2 package

http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/


ALIGNED sequence input


PRANK

• Probabilistic Alignment Kit

• webPRANK

• Better suited for closely related sequences

• When solutions tie, one is chosen at random

• Avoids incorrect confidence in result

• Means alignments might not be reproducible

• Alignments look quite different

• Might look worse!

• But gap patterns make sense

• Gaps are good!


http://www.ebi.ac.uk/goldman-srv/webprank/


Common problems with MSA

• Input format

• Try using FASTA format

• Unique sequence identifiers

• Include sequence!

• Usually limit of 500 sequences/1MB

• Job can’t be found/other error

• Results deleted after 7 days

• Some sequence/program combinations run out of memory

• Use a different program
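Several of the input problems listed above can be caught before submitting a job. The sketch below checks FASTA text for duplicate identifiers, records with no sequence, and the usual 500 sequence / 1 MB limits. The function name, the default limits as parameters, and the wording of the messages are assumptions of this sketch, and individual services may apply different limits.

```python
def check_fasta(text, max_sequences=500, max_bytes=1_000_000):
    """Return a list of problems found in FASTA-formatted text."""
    problems = []
    records = []  # [identifier, sequence] pairs in input order
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", ""])
        elif not records:
            message = "sequence data found before the first '>' header"
            if message not in problems:
                problems.append(message)
        else:
            records[-1][1] += line

    seen = set()
    for identifier, sequence in records:
        if not identifier:
            problems.append("header line without an identifier")
        elif identifier in seen:
            problems.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if not sequence:
            problems.append(f"no sequence for: {identifier or '(unnamed)'}")

    if len(records) > max_sequences:
        problems.append(f"more than {max_sequences} sequences")
    if len(text.encode("utf-8")) > max_bytes:
        problems.append(f"input larger than {max_bytes} bytes")
    return problems


print(check_fasta(">a\nACGT\n>a\n\n>c\n"))
```

An empty list means none of these particular problems was found. It does not guarantee that a given aligner will accept the input.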


Example sequences

www.ebi.ac.uk/~apc/Courses/Brazil

Problem_MSA1.fsa

Problem_MSA2.fsa

Problem_MSA3.fsa

Problem_MSA4.fsa


Common mis-uses of MSA

• Performing a sequence assembly

• Specialist type of MSA

• Use other tools (Staden etc.)

• Aligning ESTs to a reference genome

• Use EST2Genome

• Designing primers

• Use primer tools (primer3 etc.)

• Aligning two sequences

• Use a pairwise alignment tool!


Putting it all together

• EBI Search

• Sequence retrieval

• Sequence search

• Sequences retrieval

• Multiple sequence alignment

• Phylogeny

• Analysis


Final remarks

• Don’t assume a single tool will cater for all your needs

• Change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!


Getting Help

• Database documentation

• EBI Support: http://www.ebi.ac.uk/support/

• EBI training programme: http://www.ebi.ac.uk/training

• EBI online training: http://www.ebi.ac.uk/training/online

• IMGT/HLA

[email protected]


Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI


With thanks to our funders

• EMBL-EBI is primarily funded by EMBL member states

• Other major funders:

• European Commission

• National Institutes of Health

• Research Councils UK

• Wellcome Trust

• Industry Programme


Appendix


How is the data organised?

ENA database structure

Data in EMBL-Bank is divided in 2 ways:

1) Data classes

• Type of data or methodology used to obtain data

• Each entry belongs to one data class

2) Taxonomic Divisions

• Each entry belongs to one taxonomic division


1) Data Classes

CON   Constructed from sequence assemblies
EST   Expressed Sequence Tag (cDNA)
GSS   Genome Survey Sequence (high-throughput short sequence)
HTC   High-Throughput cDNA (unfinished)
HTG   High-Throughput Genome sequencing (unfinished)
MGA   Mass Genome Annotation
PAT   Patent sequences
STS   Sequence Tagged Site (short unique genomic sequences)
TPA   Third Party Annotation (re-annotated and re-assembled)
TSA   Transcriptome Shotgun Assembly (computational assembly)
WGS   Whole Genome Shotgun
STD   Standard (high quality annotated sequence)
SRA   Sequence Read Archive (both databank and data class)


• Single pass reads variable quality


• SRA is a separate databank from EMBL-Bank

• SRA can also be searched as a data class within EMBL-Bank


• Bulk of entries

• Highest level of tracked information


2) Taxonomy

Which taxonomy database does ENA use?

All INSDC databases use NCBI Taxonomy

Divisions:

HUM   Human
MUS   Mouse
MAM   Mammal
VRT   Vertebrate
ROD   Rodent
FUN   Fungi
INV   Invertebrate
PLN   Plant
PHG   Phage
PRO   Prokaryote
VIR   Viral

Other:

ENV   Environmental
SYN   Synthetic
TGN   Transgenic
UNC   Unclassified


2) Taxonomy: exclusion

Some species are EXCLUDED from certain taxonomic ranges:

VRT   Vertebrate   excludes human, mouse, rodent, mammal
MAM   Mammal       excludes human, mouse, rodent
ROD   Rodent       excludes mouse

Applies to:

• ftp files

• Sequence search tools

But not:

• ENA Browser
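The exclusions follow from each entry belonging to exactly one division: an entry goes into the most specific division that matches, so it never appears in a broader one as well. The sketch below illustrates this with a first-match rule list. The taxon names used for matching, the simplified lineages, and the function name are assumptions made for this example and do not describe how ENA assigns divisions.

```python
# Most specific division first; the first match wins.
DIVISION_RULES = [
    ("HUM", "Homo sapiens"),
    ("MUS", "Mus musculus"),
    ("ROD", "Rodentia"),
    ("MAM", "Mammalia"),
    ("VRT", "Vertebrata"),
]


def division_for(lineage):
    """Return the single division code for a lineage, or None if no rule matches."""
    for code, taxon in DIVISION_RULES:
        if taxon in lineage:
            return code
    return None


# Simplified lineages, for illustration only
mouse = ["Vertebrata", "Mammalia", "Rodentia", "Mus musculus"]
rat = ["Vertebrata", "Mammalia", "Rodentia", "Rattus norvegicus"]
dog = ["Vertebrata", "Mammalia", "Canis lupus familiaris"]
zebrafish = ["Vertebrata", "Danio rerio"]

for lineage in (mouse, rat, dog, zebrafish):
    print(lineage[-1], division_for(lineage))
```

A search restricted to VRT under this scheme would therefore return the zebrafish entry but none of the mammalian ones.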


How does data organization differ from GenBank?

Database structure

EMBL-Bank: Data classes and Taxonomic Divisions

• Data split into intersecting slices

• Reduces search set

• Ensures complete result set

Data classes: con, est, gss, htc, htg, pat, sts, std, ...

Taxonomic divisions: hum, mus, rod, mam, vrt, fun, ...

GenBank: Divisions

• Data split into parallel slices

• Large search sets

• Classes incomplete for taxonomy

• Taxonomy incomplete for classes

Divisions (data classes and taxonomy side by side): con, est, gss, htc, htg, pat, sts, std, ..., hum, mus, rod, mam, vrt, fun, inv, pln, ...

‘Mouse’ + ‘EST’ intersection

• small data set

• ensured complete set of mouse ESTs

‘Mouse’ set

• large data set

• includes all mouse entries

‘EST’ set

• large data set

• includes all EST entries
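The intersecting-slices idea can be shown with a toy set of entries, each tagged with one data class and one taxonomic division. Selecting on both at once gives the small, complete 'Mouse' + 'EST' set, which is the intersection of the two larger sets. The entry identifiers and the helper function are hypothetical.

```python
# Toy entries; identifiers are made up for illustration
ENTRIES = [
    {"id": "X1", "data_class": "EST", "division": "MUS"},
    {"id": "X2", "data_class": "EST", "division": "HUM"},
    {"id": "X3", "data_class": "STD", "division": "MUS"},
    {"id": "X4", "data_class": "EST", "division": "MUS"},
    {"id": "X5", "data_class": "WGS", "division": "PLN"},
]


def select(entries, data_class=None, division=None):
    """Return the ids of entries matching the given class and/or division."""
    return {
        e["id"]
        for e in entries
        if (data_class is None or e["data_class"] == data_class)
        and (division is None or e["division"] == division)
    }


mouse = select(ENTRIES, division="MUS")        # all mouse entries
est = select(ENTRIES, data_class="EST")        # all EST entries
mouse_est = select(ENTRIES, data_class="EST", division="MUS")

print(sorted(mouse_est), mouse_est == mouse & est)
```

Because every entry carries exactly one value for each of the two tags, the combined selection is guaranteed to contain every mouse EST and nothing else.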