Sequence Searching and Alignments

Tools & techniques for finding and aligning homologous sequences

Andrew Cowley, Web Production Team

[email protected]

Materials

http://www.ebi.ac.uk/~apc/Courses/Brazil

http://www.ebi.ac.uk/~apc/Courses/Cambridge


About me

• MA Biochemistry, Cambridge University

• MRes Bioinformatics, University of York

• PhD CASE Studentship, Structural Bioinformatics, University of York

About me

• Training bioinformatics since 2005

• Joined EMBL-EBI in 2010

From Derbyshire, UK

Many hobbies!

Derbyshire

Beautiful countryside

1871: South Derbyshire Football Association

1884: Derby County F.C.

“Hold the record for the lowest ever points finish in the Premier League”

Contents

• Sequence databases

• Text searching

• Sequence similarity searching

• Alignment basics

• Similarity searching tools

• Improving algorithms

• Guidelines

• Problem sequences

• Multiple sequence alignments

Sequence Databases

Primary vs Secondary

• Primary data comes from experiments/submitters

• Derived (or secondary) data is generated with additional work (by curators etc.) from the primary data

Nucleotide primary data

Sources: individual scientists, large-scale sequencing projects, patent offices

Primary sequence database

• Original sequence data

• Experimental data

• Patent data

• Submitter-defined

ACTGCTGCTAGCTAGCTGATCTATGCTAGCTGTAGCTGAG

GenBank (U.S.A.), DDBJ (Japan), ENA (Europe)

INSDC: • International Nucleotide Sequence Database Collaboration

• Daily exchange of data

• Submission can be made to any INSDC database


[Diagram: data flow into ENA]

Sources: large-scale sequencing projects, individual scientists, patent offices

Data types: raw data, assembled sequences, annotated sequence

ENA components: Sequence Read Archive (SRA), EMBL-Bank (ENA Annotation), EMBL-Coding etc.

Related resources: Ensembl/genomes, IMGT/HLA

Nucleotide sequence resources at EMBL-EBI

• European Nucleotide Archive (ENA)

• ENA sequence – Annotated sequence entries

• Sequence Read Archive (SRA) – sequence read data

• Sequence Version Archive (SVA) – historical entry versions

• ENA Coding/Non-coding

• Ensembl

• Assembled genomes and annotations for Vertebrates

• Ensembl Genomes

• Extending Ensembl to other species

When is the data updated?

Data is updated every night, but main releases are quarterly

Normal Release

• Quarterly release of all EMBL-Bank, e.g. Rel. 116

Updates

• All updates since last normal release

• Rolled into quarterly release

ENA updates

Protein sequence data

Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD

UniProtKB

UniProtKB/Swiss-Prot: manually annotated (reviewed)

• Non-redundant, high-quality manual annotation

• 1 entry per protein

UniProtKB/TrEMBL: computationally annotated (unreviewed)

• Redundant, automatically annotated

• 1 entry per nucleotide submission

Data sources of UniProtKB

[Diagram: sources feeding UniProt/TrEMBL: ENA (EMBL) DNA database, Ensembl, VEGA (Sanger), WormBase, FlyBase, PDB, patent data, mRNA data, Sub/Peptide data]

UniProtKB employs two prediction programs which are referred to as UniRule and SAAS.

UniRule maintains a set of manually established and maintained annotation rules.

SAAS, Statistical Automatic Annotation System, generates a new set of decision-trees with every UniProtKB release using data-mining.

InterPro, Swiss-Prot

Automatic annotation

Curation of a UniProtKB/Swiss-Prot entry

Sequence variants

Nomenclature

Sequence features

UniProtKB/TrEMBL

UniProtKB/Swiss-Prot

Ontologies

Literature Annotations

References

UniProtKB

UniProt databases

• UniProtKB/Swiss-Prot

• Manually curated

• UniProtKB/TrEMBL

• Automatically curated

• UniRef

• Sequences clustered by %identity

• UniParc

• Sequence archive – keeps track of historical sequences & identifiers

• Proteomes

Data

• Simplistically, much of the data held at EMBL-EBI can be thought of as a container

• Part of it is the raw data itself (eg. Protein sequence)

• Another part being meta-information or annotation about this data

Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, Brazil.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//

Formats

• Different databases store this data in different formats

EMBL

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919

SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//

GENBANK

LOCUS       AJ131285 919 bp mRNA linear INV 20-JUL-2001
DEFINITION  Sabella spallanzanii mRNA for globin 3.
ACCESSION   AJ131285
VERSION     AJ131285.1 GI:13810248
KEYWORDS    globin; globin 3; globin gene.
SOURCE      Sabella spallanzanii
  ORGANISM  Sabella spallanzanii
            Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Palpata;
            Canalipalpata; Sabellida; Sabellidae; Sabella.
REFERENCE   1

ORIGIN
        1 caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
       61 cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
      121 tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
      181 gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
      241 tctgtggaca agttcttcaa gcgtgtcaat ggcaaggaca tcagctcccc agccttccag
      301 gctcacatcc agcgtgtgtt cggtggcttt gacatgtgca tctccatgct tgatgacagt
      361 gatgtgctcg cctctcagct ggctcacctc cacgcccagc acgtcgagag aggaatctct
//

Formats

>gi|13810248|emb|AJ131285.1| Sabella spallanzanii mRNA for globin 3
CAAACAGTCARTTAATTCACAGAGCCCTGAGGTCTCTCGCTCCTTTCTGCGTCACTCTCTCTTACCGTCATCATGTACAAGTGGTTGCTTTGCCTGGCTCTGATTGGCTGCGTCAGCGGCTGCAACATCCTCCAGAGGCTGAAGGTCAAGAACCAGTGGCAGGAGGCTTTCGGCTATGCTGACGACAGGACATCCCYCGGTACCGCATTGTGGAGATCCATCATCATGCAGAAGCCCGAGTCTGTGGACAAGTTCTTCAAGCGTGTCAATGGCAAGGACATCAGCTCCCCAGCCTTCCAGGCTCACATCCAGCGTGTGTTCGGTGGCTTTGACATGTGCATCTCCATGCTTGATGACAGTGATGTGCTCGCCTCTCAGCTGGCTCACCTCCACGCCCAGCACGTCGAGAGAGGAATCTCT

FASTA format
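FASTA is simple enough to read with a few lines of code. The following is a minimal sketch, assuming a header line starting with ">" followed by one or more sequence lines; the function name `parse_fasta` is made up for illustration and is not part of any EBI tool.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Everything before the first space in the header is conventionally treated as the identifier, which is the part a text search or accession look-up would use.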

Format conversion tools

http://www.ebi.ac.uk/Tools/sfc/

Meta-information

• Contains information important for:

• Identifying/referencing a piece of data or entry

• Classifying an entry

• Determining the source of the data

• And can also contain annotation that adds value:

• Identification of sequence features

• Keywords, GO terms etc.

• Cross-references to other entries that share some property

• Etc.

Searching using the meta-data

• When looking for a sequence we can perform text searches against the meta-data

• Accession look-up

• Keyword search eg: function, species

• Protein family classification

• Accession changes?

• Cross reference services

• PICR

• UniProt ID Mapping

Text search tools

• Each database has its own search engine

• Interface tailored to their specific data use

• There are also EBI-wide search tools

• EBI Search

EBI Search

• First approach/entry point to data resources at EBI

EBI Search

• Just type, with auto-complete

EBI Search

• One stop search across many resources, grouped into categories

Categories

Domains

Multi-domain facet

Domain-specific facet

EBI Search


• One stop search across many resources

• Non-expert friendly summaries


• Advanced search available (via direct URL)

• http://www.ebi.ac.uk/ebisearch/advancedsearch.ebi

• Allows domain/field specification

• Boolean etc.


Sequence searching

Sequence searching tools

• Central to modern techniques

• Genome annotation

• Characterising protein families

• Exploring evolutionary relationships

How?

• Search by comparing sequence data rather than meta-data

• Find sequences/entries when missing or inaccurate meta-data

• More than just an exact look-up

• Allow for sequence variability – look for ‘similar’ sequences

• Sequence variation is important information for bioinformaticians

• Infer homology (shared ancestry)

• IF homologous, then can transfer information

Homology vs. Similarity

Homology (inferred)

• Presence of similar features because of common descent

• Cannot be observed since the ancestors no longer exist

• Is inferred as a conclusion based on ‘similarity’

• Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

Similarity (measurable)

• Quantifies a ‘likeness’

• Uses statistics to determine ‘significance’ of a similarity

• Statistically significant similar sequences are considered ‘homologous’

Sequence alignment

Query: ACATAGGT

Sequence 1: TCATAGAT

Sequence 2: AAATTCTG

Identity score: 6/8 (query vs sequence 1), 3/8 (query vs sequence 2)
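The identity scores for the two short sequences can be checked by counting matching positions. This is a minimal sketch for ungapped, equal-length sequences only; the function name `identity` is an illustrative choice.

```python
def identity(query, subject):
    """Count positions at which two equal-length sequences carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(1 for q, s in zip(query, subject) if q == s)


query = "ACATAGGT"
print(identity(query, "TCATAGAT"), "/", len(query))  # sequence 1
print(identity(query, "AAATTCTG"), "/", len(query))  # sequence 2
```

Counting identities position by position only works when the sequences are already in register, which is why longer sequences need a real alignment method.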

Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca

Query:

1

2

Dot plot

• Maybe a dot plot will help

Query

Sequence 1

A C A T A G

GATACT

Dot plot

Query vs Sequence 1 Query vs Sequence 2

Query Query

1 2

Algorithms

• To get a computer to solve a problem, the first step is to create a way for the computer to know what is relatively ‘good’ and what is relatively ‘bad’

• I.e. a score.

• Computer can then assess solutions and choose best.

• Simple algorithm – penalise movement away from diagonal – gap penalty

[Figure: alignment grid in which moves along the diagonal score 0 and moves away from the diagonal score -10]

Why gap open and extension?

• Adjacent gap positions are likely to have been created by the same in/del event, rather than multiple independent events

• Use a smaller gap extension compared to opening penalty to account for this

G---ATTA G-A-T-TA

• To encourage this we apply a low penalty for each position in a gap, and a high one just to open a gap.

[Figure: alignment grid with gap open and gap extend penalties; values shown include -10.5 for a newly opened gap, -11 after one extension, and 0 along the diagonal]

Gap open = 10, Gap extend = 0.5
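To see why separate open and extend penalties favour one long gap over several short ones, the two gapped strings from the slide can be costed directly. This sketch assumes the convention suggested by the -10.5 value above, where every gap position (including the first) pays the extension cost on top of a single opening cost; other tools define the split slightly differently.

```python
def gap_runs(aligned):
    """Return the lengths of consecutive runs of '-' in an aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def affine_gap_penalty(aligned, gap_open=10.0, gap_extend=0.5):
    """Total penalty: one opening cost per gap plus an extension cost per gap position."""
    return sum(gap_open + gap_extend * length for length in gap_runs(aligned))


print(affine_gap_penalty("G---ATTA"))  # one gap of length 3
print(affine_gap_penalty("G-A-T-TA"))  # three gaps of length 1
```

The single three-residue gap costs 11.5, while three separate gaps cost 31.5, so the scoring scheme prefers the explanation that needs only one in/del event.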

Match/mismatch

• Of course, we need to tell the algorithm that matching letters are better than mismatches too

• This is done via a scoring matrix

     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

• Putting the two together gives us a scoring mechanism

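[Figure: scoring grid for two short sequences (letters T, A, C, A on one axis and C, A on the other); cell values shown include -4, 1, 6, -13.5, -14, -18.5 and -23]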

• To pick the optimal alignment, start at the end and trace back the highest scoring route.


Needleman-Wunsch

• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!

• An example of dynamic programming

• Comparing the full length of both sequences is called a global-global or just global alignment
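The fill-and-trace-back procedure can be sketched in a few dozen lines. This is a simplified version, not the EMBOSS implementation: it uses the 5/-4 match/mismatch scores from the matrix above but a single linear gap penalty of -10 per position rather than separate open and extend penalties, and all names are illustrative.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill: each cell takes the best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACATAGGT", "TCATAGAT"))
```

The matrix has (n+1) x (m+1) cells and each is visited once, which is where the memory and time cost of rigorous methods comes from.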

Global vs Local

• But global-global might not be suitable for sequences that are very different lengths

• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.

• Sets negative scores in matrix to 0, and allows trace back to end and restart
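The change from global to local alignment is small in code terms. The sketch below computes only the best local score, with the same simplified scoring as a linear-gap global alignment (match 5, mismatch -4, gap -10, all assumed values); a full implementation would also trace back from the best cell until it reaches a zero.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere instead of carrying a negative score.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTT", "GGACGGG"))  # shared ACG core
```

Because the score can never fall below zero, unrelated flanking sequence does not drag down a good internal match, which is what makes this form suitable for sequences of very different lengths.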

Global vs Local

A T G T A T A C G C

A G T A T A - G C

A - T G T A T A C G C

A G T A T A - - - G C

Scoring

• Parameters so far:

• Match/mismatch

• Gap opening

• Gap extending

• Can we improve it?

Substitutions

• Some substitutions are more likely than others

Protein substitution matrices

• Can look at closely related proteins to determine substitution rates

• Two most commonly used models:

• PAM

• BLOSUM

PAM

• Point Accepted Mutation

• Observed mutations in a set of closely related proteins

• Markov chain model created to describe substitutions

• Normalised so that PAM1 = 1 mutation per 100 amino acids

• Extrapolate matrices from model

• Higher PAM number = less closely related

PAM 250

BLOSUM

• Blocks of Amino Acid Substitution Matrix

• Align conserved regions of evolutionary divergent sequences clustered at a given % identity

• Count relative frequencies of amino acids and substitution probability

• Turn that into a matrix where the more positive a substitution score is, the more likely the substitution is to be found, and the more negative, the less likely.

• Higher BLOSUM number = more closely related

More divergent

BLOSUM 45    PAM 250

BLOSUM 62    PAM 160

BLOSUM 90    PAM 100

Less divergent
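The step from counted frequencies to matrix entries is a log-odds calculation: how often a pair is observed, relative to how often it would be expected if the two residues were paired at random. The sketch below uses made-up frequencies and assumes half-bit units (2 x log2, rounded), which is the scaling commonly used for BLOSUM-style matrices; the function name is illustrative.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: observed pair frequency vs. chance expectation."""
    expected = freq_a * freq_b          # chance of pairing a with b at random
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical numbers: both residues have background frequency 0.1.
print(log_odds_score(0.02, 0.1, 0.1))   # seen twice as often as chance: positive score
print(log_odds_score(0.005, 0.1, 0.1))  # seen half as often as chance: negative score
```

A score of zero therefore means "no more or less common than chance", which is why both PAM and BLOSUM matrices mix positive and negative values.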

Scoring

• Parameters:

• Match/mismatch

• Gap opening

• Gap extending

• Substitution matrix

Dynamic programming alignments at the EBI

• EMBOSS Pairwise Alignment algorithms

• European Molecular Biology Open Software Suite

• Suite of useful tools for molecular biology

• Command line based

• Designed to be used as part of scripts/chained programs

• We implement selected tools to provide web and Web Services access

• Database alignments via FASTA suite of programs

Where to find at the EBI?

http://www.ebi.ac.uk/Tools/psa/



Pairwise alignment tools

• Global alignment: Needle, Stretcher (big sequences)

• Local alignment: Water, Matcher (big sequences), LALIGN

• Genomic DNA alignment: WISE tools

Change to nucleotide

Sequence input

Parameters

Submit!

Key

- Gap

: Positive match

. Negative match

| Identity

Example sequences

www.ebi.ac.uk/~apc/Courses/Brazil

Pairwise_align1.fsa

Pairwise_align2.fsa

http://www.ebi.ac.uk/~apc/Courses/Amsterdam


Searching a database

• Multiple pairwise alignments between query sequence and database sequence

Dynamic programming sequence search methods at the EBI

• Global alignment: GGSEARCH

• Local alignment: SSEARCH

• Global query vs local database: GLSEARCH

• Profile-iterative search: PSI-SEARCH


http://www.ebi.ac.uk/Tools/sss/



Database selection

Sequence input

Parameters

Submit!

• Dynamic programming methods are rigorous and guarantee an optimal result

• But take up a lot of memory

• And evaluate each position of the matrix

• Predictably, this makes them slow and demanding when you are aligning large sequences

Heuristics

• Therefore we need methods of estimating alignments

• Estimation methods are called heuristics

• Try and take short cuts in an intelligent manner

• Speed up the search

• At the possible expense of accuracy

• Accuracy in sequence searches is important for:

• Aligning the right bits

• Scoring the alignment correctly

• Identifying similar sequences - sensitivity

• Going back to our dot plot

• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.

• Of course, we have to identify likely regions – not all alignments will be as nice as that one!

• This is the method used by FASTA

• W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

FASTA – step 1

• Identify runs of identical sequence and pick regions with highest density of runs

Ktup parameter: How small are ‘words’ considered before they are ignored

Increase Ktup = faster, but less sensitive
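The first FASTA step can be pictured as a word look-up that counts hits per diagonal. The following is a rough sketch of that idea only, not Pearson and Lipman's implementation; `diagonal_hits` is a hypothetical helper name, and the diagonal is defined here as subject position minus query position.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=3):
    """Count identical words of length ktup shared by two sequences, grouped by diagonal."""
    # Index every word in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject and record which diagonal each word hit falls on.
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


print(diagonal_hits("ACATAGGT", "TCATAGAT", ktup=3))  # three word hits on diagonal 0
print(diagonal_hits("ACATAGGT", "TCATAGAT", ktup=6))  # no exact 6-letter word in common
```

With ktup=3 the shared CATAG core is found; with ktup=6 nothing is, which illustrates the speed versus sensitivity trade-off of a larger Ktup.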

FASTA – step 2

• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores

Parameter: Substitution matrix

FASTA – step 3

• Discard regions too far from the highest scoring region

Joining threshold: Internally determined

FASTA – step 4

• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions

Parameters: Gap open, Gap extend, Substitution matrix

FASTA

• Repeat against all sequences in the database

FASTA – programs available at EBI

• FASTA: ”a fast approximation to Smith & Waterman”

• FASTA – scan a protein or DNA sequence library for similar sequences.

• FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.

• TFASTX/Y – compare a protein sequence to a translated DNA data bank.

• FASTF – compares ordered peptides (Edman degradation) to a protein databank.

• FASTS – compares unordered peptides (Mass Spec.) to a protein databank.



Database selection

Sequence input

Parameters

Submit!

FASTA - results


Key

- Gap

: Identity

. Similarity

X Filtered

Example sequence


test_prot.fasta

http://www.ebi.ac.uk/~apc/Courses/Amsterdam/


BLAST – Basic Local Alignment Search Tool

• Instead of narrowing the dynamic programming search space, BLAST works a slightly different way

• Firstly, it creates a word list both of the exact sequence and high scoring substitutions

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

BLAST – step 1

• w=3

SEWRFKHIYRGQPRRHLLTTGWSTFVT

SEW
 EWR
  WRF

Parameter: Word length (w)

Increase = faster, but less sensitive

BLAST – step 1 (cont'd)

• w=3

• T=13

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood words for GQP, with scores:

At or above T: GQP 18, GEP 15, GRP 14, GKP 14, GNP 13, GDP 13

Below T: AQP 12, NQP 12

Parameters: Neighbourhood threshold (T), Substitution matrix
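The neighbourhood word list for GQP can be reproduced with a small enumeration. In this sketch the score table contains only the residue pairs implied by the word scores on the slide (they agree with BLOSUM62); every other pair gets a placeholder score of -4, which is an assumption made to keep the example short. A real matrix has further pairs scoring 0 with Q, so the true neighbourhood at T=13 is somewhat larger than shown here.

```python
from itertools import product

# Pair scores taken from the slide's word scores; NOT a complete substitution matrix.
SLIDE_SCORES = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1, ("Q", "N"): 0, ("Q", "D"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}
PLACEHOLDER = -4  # assumed stand-in for every pair not listed above
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pair_score(a, b):
    return SLIDE_SCORES.get((a, b), SLIDE_SCORES.get((b, a), PLACEHOLDER))


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


for w, s in sorted(neighbourhood("GQP", 13).items(), key=lambda kv: -kv[1]):
    print(w, s)
```

Lowering T lets more words into the list (AQP and NQP appear at T=12), which increases sensitivity at the cost of more word hits to follow up.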

BLAST – step 2

• Then it scans database sequences for exact matches with these words

BLAST – step 3

• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

• This results in a High-scoring Segment Pair (HSP)

Parameters: Drop off, Substitution matrix

BLAST – step 4

• If the total HSP score is above another threshold then a gapped extension is initiated

Parameters: Extension threshold (Sg), Substitution matrix
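The "extend until the score drops by a certain amount" rule can be sketched as follows. This is a simplified illustration, not NCBI's code: it extends in one direction only, uses plain match/mismatch scores instead of a substitution matrix, and the parameter names and default values are assumptions.

```python
def extend_right(query, subject, q_start, s_start, drop_off=10, match=5, mismatch=-4):
    """Ungapped extension to the right of a seed.

    Stops when the running score falls `drop_off` or more below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= drop_off:
            break  # too far below the best score: give up on this extension
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
```

The extension is trimmed back to the point of the best score, so the trailing mismatches that triggered the stop do not become part of the HSP.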

BLAST

• The steps rule out many database sequences early on

• Large increase in speed

BLAST – programs available at the EBI

• Basic Local Alignment Search Tool

• NCBI BLAST programs:

• BLASTP – protein sequence vs. protein sequence library

• BLASTN – nucleotide query vs. nucleotide database

• BLASTX – translated DNA vs. protein sequence library

Key

- Gap

[residue] Identity

+ Similarity

X Filtered

Example sequence


test_prot.fasta



When to use what?

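[Figure: choice of tool by database size and query length; tools shown: GGSEARCH, FASTA, BLAST, PSI-SEARCH, SSEARCH]

When to use what?

[Figure: time to search for GGSEARCH, FASTA, BLAST, PSI-SEARCH and SSEARCH against databases of increasing size: PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB, UniParc]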

Homology and Similarity

Similarity

Homology

• So far, we’ve talked about scoring alignments

• Direct function of the algorithm

• But what we want is to assign some kind of quality to that score

Score vs significance

A A A

A A A

A C A T A A G G C T

A T A C A A G C C T

High score Higher significance?

“Lies, damn lies, and statistics”


• Not just interested in score...

• ...But how likely we are to get that alignment by chance alone

• It is from this ‘non-random’ alignment that homology is inferred

• Statistics are used to estimate this chance

E-value

• ‘Expect’ value (really ‘expectation’)

• Expected number of alignments scoring at least this well by chance in the given database, or “how many times you might be wrong”

• Best measure of how biologically significant an alignment is

• Used for ranking results by default

• Most people use 10^-3 “Happy to be wrong one time in a thousand”

• Calculated in slightly different ways for BLAST and FASTA

• Short alignments are more likely to be found by chance so have higher E-values

• Affected by database size

• BLAST and FASTA both optimised for distant relationships
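The dependence of the E-value on score and database size can be illustrated with the Karlin-Altschul form E = K x m x n x exp(-lambda x S), where m is the query length and n the database length. The sketch below is for intuition only: the lambda and K defaults are illustrative values of the kind quoted for gapped protein searches and should be treated as assumptions, and real programs apply further corrections.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul style E-value for a raw alignment score (illustrative parameters)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


small_db = expect_value(raw_score=80, query_len=300, db_len=10**7)
large_db = expect_value(raw_score=80, query_len=300, db_len=10**9)
print(small_db, large_db)  # same alignment, 100x larger database, 100x larger E-value
```

This is why the same alignment looks less significant against a larger database, and one reason to search the smallest database likely to contain the sequences of interest.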

FASTA statistics

• Compares query sequence with every sequence in database

• As most of these sequences are unrelated it is possible to use the distribution of scores (sampled) to assign statistical significance

• As distribution is taken from a random sample, exact E-Value can vary slightly from search to search

FASTA - histogram

Key

* Predicted distribution of scores

= Observed distribution of scores

High scoring region

BLAST statistics

• Main reason for speed is that it doesn’t compare query with lots of other sequences

• Therefore it pre-estimates statistical values using a random sequence model

“Appears to yield fairly accurate results”

Improving algorithms

Sensitivity, Selectivity & Speed

• Sensitivity is how distantly you can determine a homologous sequence (avoid false negative)

• Selectivity is how accurately you can determine whether a sequence is homologous or not (avoid false positive)

• Speed is obviously how long it takes!

• In general, the more information we can add to an alignment, the better the result

Conserved regions Structural information Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
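A motif written this way maps directly onto a regular expression, which is one simple way to scan sequences for it. The sketch assumes the notation used on the slide (bracketed alternatives separated by commas and "or", positions joined by hyphens); `motif_to_regex` is a hypothetical helper and the test sequence is made up.

```python
import re


def motif_to_regex(motif):
    """Convert a motif like '[R, T or D]-A-T-H' into a regex such as '[RTD]ATH'."""
    parts = []
    for element in motif.split("-"):
        if element.startswith("["):
            # Keep only the single capital letters inside the brackets.
            residues = re.findall(r"\b[A-Z]\b", element)
            parts.append("[" + "".join(residues) + "]")
        else:
            parts.append(element.strip())
    return "".join(parts)


pattern = motif_to_regex("[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H")
print(pattern)
print(re.search(pattern, "MKRDFATHLL"))  # made-up sequence containing the motif
```

A pattern match on its own says nothing about significance; short motifs like this one will often match unrelated sequences by chance.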

Conserved regions

• We can add a new ‘position’ parameter to the substitution matrix

We can even modify a normal search to generate a position specific scoring matrix, or PSSM

PSI-BLAST

Position Specific Iterative – BLAST:

1. Takes the result of a normal BLAST

2. Aligns them and generates profile of conserved positions

3. Uses this to weight scoring on next iteration
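The profile built in step 2 is a position specific scoring matrix: one column of scores per alignment position. The sketch below builds one from a toy ungapped alignment using simple pseudocounts and a uniform background. It is an assumption-laden illustration; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Log2-odds score for each residue at each column of an ungapped alignment."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            r: math.log2(((counts.get(r, 0) + pseudocount) / total) / background)
            for r in alphabet
        })
    return pssm


def pssm_score(pssm, seq):
    """Score a sequence of the same length as the profile."""
    return sum(column[res] for column, res in zip(pssm, seq))


profile = build_pssm(["ACGT", "ACGA", "ACGT", "TCGT"])
print(pssm_score(profile, "ACGT"), pssm_score(profile, "TTTT"))
```

Fully conserved columns (C and G here) reward a match and penalise a mismatch more strongly than variable columns, which is how the next iteration weights conserved positions.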

PSI-BLAST


Example sequence


test_prot.fasta



PHI-BLAST

• Pattern Hit Initiated-BLAST

• User provides a pattern alongside a protein

• Database hits have to contain this pattern, and show similarity to the rest of the sequence

• Results can initiate a PSI-BLAST search as well

PSI-BLAST

• By adding importance to conserved residues we might be able to find more distant sequences

• But iterate too far and we might be assigning importance where there is none

• Problem of Homologous Over-Extension (HOE)

More sensitive

Less selective

Homologous Over-Extension (HOE)

[Figure: the alignment region extends over subsequent iterations (2nd, 3rd), giving a contaminated PSSM, which can cause a (significant) alignment with an unrelated protein. Expect score: 9.0 x 10^-5]

PSI-BLAST initial search


PSI-BLAST 2nd Iteration


PSI-BLAST 3rd Iteration



PSI-BLAST 5th Iteration



Reducing HOE

• Look for domains in results and manually select sequences that form part of PSSM

• Mask boundaries according to initial alignment

• Results in fewer false positives (improved selectivity)

PSI-SEARCH

• Smith-Waterman implementation (SSEARCH)

• With iterative position specific scoring

• Optional boundary masking to reduce HOE

Reducing HOE errors

• Sequence boundary masking procedure

• First time a significant alignment occurs for a library sequence, store co-ordinates

• Mask regions outside so can’t contribute to PSSM

PSI-Search 2nd Iteration

PSI-Search 5th Iteration

PSI-Search

So what does that do to sensitivity/selectivity?

[Figure: sensitivity plotted against selectivity]

PSI-Search = Very sensitive + Much more selective

Coming soon

• PSI-Search 2!

• Use domain annotations/predictions to inform alignment

Low complexity regions

• Biologically irrelevant, but likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions

Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look out for first when evaluating results.

Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.

Low complexity regions

• Biologically irrelevant, but likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions

• Compensate by filtering/masking sequence so these regions don’t contribute to scoring

• Filters: seg, xnu, dust, CENSOR

• But check what you are filtering!
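The idea behind masking can be shown with a sliding-window entropy filter. This is a deliberately crude sketch and not the algorithm used by seg, xnu, dust or CENSOR; the window size, entropy cut-off and function names are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the letter composition of a string."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue lying in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


seq = "ACDEFGHIKLMN" + "P" * 12 + "QRSTVWYACDEF"  # made-up sequence with a proline run
print(mask_low_complexity(seq))
```

Note that this simple filter also masks a few residues on either side of the proline run, because windows that are mostly proline still fall below the cut-off. That is exactly why it pays to check what has been filtered.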

Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.

Example sequence


Filtertest_seq.fsa



Database composition

• Statistics rely on database containing wide coverage

• Assumption: the query is not homologous to most of the data

• Specialist databases might cause problems

• E.g. immuno- databases, made up of relatively few genes

• A lot of the database IS homologous

• Skews statistics

Database composition

• Can’t make same assumptions about coverage

• So don’t use BLAST

• FASTA based tools sample the score so provide accurate statistics

• Use the histogram to check

• Use shuffled versions of database to create additional coverage

Search Guidelines

Search guidelines 1

• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)

• Then with translated DNA query sequences (fastx, blastx)

• Search with DNA vs. DNA as the next resort

• And then against translated DNA database sequences (tfastx, tblastx) as the VERY LAST RESORT!

Search guidelines 2

• Search the smallest database that is likely to contain the sequence(s) of interest

• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology

Search guidelines 3

• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence

• Examine the histograms

• Use programs such as prss3 to confirm the expectation values.

• Searching with shuffled sequences (use MLE/Shuffle in fasta) which should have an E() ~1.0

• Perform reverse search

Search guidelines 4

• Default parameters are set up for most common queries

• Consider searches with different gap penalties and other scoring matrices, especially for short queries/domains

• Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences

• Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)

• Remember to change the gap penalty defaults (if the tool doesn’t change them for you)

MATRIX      open   ext.
BLOSUM50    -10    -2
BLOSUM62    -11    -1
BLOSUM80    -16    -4
PAM250      -10    -2
PAM120      -16    -4

Search guidelines 5

• Homology can be reliably inferred from statistically significant similarity

• But remember:

• Orthologous sequences have similar functions

• Paralogous sequences can acquire very different functional roles

• So further work might be needed to tease out details

Search guidelines 6

• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues

• However, motif identity in the absence of significant sequence similarity usually occurs by chance alone

• Try to produce multiple sequence alignments in order to examine the relatedness of your sequence data

• Clustal Omega

• MUSCLE

• T-Coffee

• Kalign

• MAFFT

Problem Sequences

Short sequences

• What about short sequences?

• Depends on their nature:

• Protein

• Use shallow matrices

• Reduce word length and/or increase the E() value cut off

• DNA

• Reduce the word length

• Ignore gap penalties (force local alignments only)

• Use rigorous methods

• But ask what you are trying to do!

Vector contamination

• You think you know what your sequence is..

• .. But the results are really confusing!

• Maybe you have vector contamination

• Search against known vectors to check

Vector contamination

Example sequences


vectortest_seq1.fsa

vectortest_seq2.fsa



Multiple Sequence Alignments

Uses of Multiple Sequence Alignment (MSA)

• Alignment of three or more sequences

• Functional prediction

• Structural prediction

• Conservation analysis

• Classification

• Phylogeny

• To help distinguish between orthology and paralogy

We have a (computational) problem…

• Pairwise alignments are simple enough to find the optimal (highest scoring) solution in a reasonable timeframe

• Multiple sequence alignment is in a class of problems that is ‘NP-hard’

NP-easy

• Problems that are solvable in polynomial time

• E.g. operations to solve = n^2

NP-hard

• Problems that are hard to solve

• E.g. operations to solve = 2^n

n^2 vs 2^n

• Imagine a computer running 10^9 operations a second

          n = 10           n = 30          n = 50            n = 70
n^2       100, < 1 sec     900, < 1 sec    2500, < 1 sec     4900, < 1 sec
2^n       1024, < 1 sec    10^9, 1 sec     10^15, 13 days    10^21, about 37,000 years
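The run times for the two growth rates can be recomputed directly, assuming the same 10^9 operations per second. The helper name and the unit conversions below are the only additions.

```python
OPS_PER_SECOND = 10**9
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(operations):
    """Wall-clock seconds for a given number of operations at OPS_PER_SECOND."""
    return operations / OPS_PER_SECOND


for n in (10, 30, 50, 70):
    poly = seconds_needed(n**2)
    expo = seconds_needed(2**n)
    print(f"n={n}: n^2 takes {poly:.2e} s, 2^n takes {expo:.2e} s "
          f"({expo / SECONDS_PER_DAY:.1f} days, {expo / SECONDS_PER_YEAR:.0f} years)")
```

For n = 70 the exponential case comes out at roughly 37 thousand years, while the polynomial case is still a few microseconds.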

What to do about NP-hard problems?

• Give up (do you really need MSA?)

• Use approximations and heuristics

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 : : .: . .. . :

Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Time O(L^N)

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years


Progressive Alignment:

• Barton and Sternberg, 1987

• Florence Corpet, 1988

• Feng and Doolittle, 1987

• Jotun Hein, 1989

• Higgins and Sharp, 1988

• Hogeweg and Hesper, 1984

• Willie Taylor, 1987, 1988


Guide Tree

[Figure: guide tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]

Clustal

• >85,000 citations

• Clustal1-Clustal4, 1988: Paul Sharp, Dublin

• Clustal V, 1992: EMBL Heidelberg, Rainer Fuchs, Alan Bleasby

• Clustal W, Clustal X, 1994-2007: Toby Gibson, EMBL, Heidelberg; Julie Thompson, ICGEB, Strasbourg

• Clustal W and Clustal X 2.0, 2007: University College Dublin

www.clustal.org

ClustalW2 at the EBI

www.ebi.ac.uk/Tools/msa/clustalw2/

www.ebi.ac.uk/Tools/msa/

ClustalW2

• Sequence input

• Parameters

• Submit!

Jalview

ClustalW2

Advantages

• Quite fast for low numbers of sequences

• Not too demanding

• Widely used

Disadvantages

• Fixing of early alignments: propagates errors

• Doesn’t search far: local minima

• Compresses gaps

Example sequences


Prot_MSA.fsa

http://www.ebi.ac.uk/~apc/Courses/Rotterdam


Other progressive aligners

• MUSCLE

• Optimised progressive aligner

• Good alternative to ClustalW

BaliBASE     % correct   time (s)
Clustal W    37.4        766
Muscle       47.5        789

Other progressive aligners

• KAlign

• Local regions progressive aligner

• Extremely fast!

• Good for large alignments/input


BaliBASE     % correct   time (s)
Clustal W    37.4        766
Kalign       50.1        21

Consistency based alignment

• Maximise similarity to a library of residue pairs

COFFEE

• Consistency based Objective Function For alignmEnt Evaluation

• Maximum Weight Trace (John Kececioglu)

• Maximise similarity to a LIBRARY of residue pairs

• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

COFFEE

• Library of reference pairwise alignments

• For your given set of sequences

• Objective Function

• Evaluates consistency between multiple alignment and the library of pairwise alignments

• Use SAGA to optimise this function

• Weigh depending on quality of alignment

SAGA is another alignment method, using genetic algorithms
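The consistency idea can be sketched as follows: collect the residue pairs aligned in each library pairwise alignment, then ask what fraction of the residue pairs aligned in the multiple alignment also occur in that library. This is a simplified illustration with every pair weighted equally; the weighting by alignment quality mentioned above is left out, and the names `residue_pairs` and `consistency` and the toy sequences are hypothetical.

```python
from itertools import combinations


def residue_pairs(row_a, row_b):
    """Set of (index_in_a, index_in_b) for residues sharing a column.
    Indices count residues only, so gaps do not shift them."""
    pairs = set()
    pos_a = pos_b = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((pos_a, pos_b))
        if a != "-":
            pos_a += 1
        if b != "-":
            pos_b += 1
    return pairs


def consistency(msa, library):
    """Fraction of residue pairs in the MSA that are also in the library."""
    found = total = 0
    for name_a, name_b in combinations(msa, 2):
        in_msa = residue_pairs(msa[name_a], msa[name_b])
        in_lib = library.get((name_a, name_b), set())
        found += len(in_msa & in_lib)
        total += len(in_msa)
    return found / total if total else 0.0


# Library built from (toy) pairwise alignments of the same three sequences.
pairwise = {
    ("s1", "s2"): ("ACGT", "A-GT"),
    ("s1", "s3"): ("ACGT", "ACGT"),
    ("s2", "s3"): ("A-GT", "ACGT"),
}
library = {names: residue_pairs(*rows) for names, rows in pairwise.items()}

msa = {"s1": "ACGT", "s2": "AG-T", "s3": "ACGT"}
print(consistency(msa, library))
```

An optimiser such as SAGA would then search for the multiple alignment that maximises this kind of score.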

COFFEE

• More accurate than ClustalW

• Much less prone to problems in early alignment stages

• VERY slow!

T-Coffee

• Tree-based COFFEE

• Heuristic approach to COFFEE

• Gets rid of genetic algorithm portion

• Uses progressive alignments

• Changes algorithm based on number of sequences

T-Coffee

• Much faster than COFFEE

• Avoids some of ClustalW’s pitfalls

• Can take information from several data sources

• Still not that fast

• Can be very demanding of memory etc.

Other Tools

• MAFFT

• Iterative based Fast Fourier Transform

• Different modes – can operate in both progressive and consistency type alignments

NEW!: Clustal Omega

• Completely different way of doing things from ClustalW

• Two major areas of improvement:

• 1) Guide tree generation

• 2) Profile-profile alignments

Clustal Omega – Guide Tree improvements

• Guide tree generation is one of the slowest steps

• Especially with large numbers of sequences

• Clustal Omega uses the mBed method: it samples a range of sequences and represents every sequence as a vector of distances to these samples

• Results in better scaling with more sequences
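The embedding step can be illustrated with a small sketch. Each sequence is turned into a vector of distances to a handful of sampled sequences, so only N × (number of samples) comparisons are needed instead of N × N. The k-mer distance and the way samples are picked here are simplifying assumptions for illustration and are not claimed to match Clustal Omega's implementation; `kmers`, `kmer_distance` and `embed` are hypothetical names.

```python
def kmers(seq, k=2):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Toy distance: 1 minus the Jaccard similarity of the k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def embed(sequences, n_samples):
    """Represent every sequence as distances to a few sampled sequences."""
    step = max(1, len(sequences) // n_samples)
    samples = sequences[::step][:n_samples]   # evenly spaced picks (assumption)
    return [[kmer_distance(s, ref) for ref in samples] for s in sequences]


seqs = ["ACGTAC", "ACGTAG", "TTTTTT", "TTTTTA"]
for seq, vector in zip(seqs, embed(seqs, n_samples=2)):
    print(seq, [round(x, 2) for x in vector])
```

Sequences that are similar end up with similar vectors, and those vectors can be clustered quickly to build the guide tree.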

Clustal Omega – Profile-profile alignments

• Like sequence searching, profiles can be used to increase sensitivity

• HMMs are a form of profile

• Clustal Omega aligns HMMs to HMMs

Clustal Omega

• Better scaling for many sequences

• Speed

• Accuracy

• Better scaling for many computers

• More accurate alignments

• Nucleic acid alignments are still a work in progress

Which tool should I use?

Input data                                    Recommendation
2-100 sequences of typical protein length     MUSCLE, T-Coffee, MAFFT, ClustalW2/Omega
100-500 sequences                             Clustal Omega, MUSCLE, MAFFT
>500 sequences                                Clustal Omega, KALIGN
Small number of unusually long sequences      ClustalW, KALIGN
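The recommendations can be written down as a small lookup. This is only a sketch of the rule of thumb on the slide: the function name `recommend_tools` is hypothetical, the pairing of each input size with its recommendation follows the order in which the slide lists them, and exactly 100 sequences is placed in the first band as an arbitrary choice.

```python
def recommend_tools(n_sequences, unusually_long=False):
    """Suggest aligners for an input, following the slide's rule of thumb."""
    if n_sequences < 2:
        raise ValueError("an alignment needs at least two sequences")
    if unusually_long:
        # small number of unusually long sequences
        return ["ClustalW", "KALIGN"]
    if n_sequences <= 100:
        return ["MUSCLE", "T-Coffee", "MAFFT", "ClustalW2/Omega"]
    if n_sequences <= 500:
        return ["Clustal Omega", "MUSCLE", "MAFFT"]
    return ["Clustal Omega", "KALIGN"]


for n in (20, 250, 5000):
    print(n, recommend_tools(n))
print("long:", recommend_tools(5, unusually_long=True))
```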

How to evaluate?

• Use a benchmark

• BaliBASE

BaliBASE

• Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

• ICGEB Strasbourg

• 141 manual alignments using structures

• 5 sections

• Core alignment regions marked

1. Equidistant (82)

2. Orphan (23)

3. Two groups (12)

4. Long internal gaps (13)

5. Long terminal gaps (11)


BaliBASE                    % correct   time (s)
Clustal Omega               55.4        539
Clustal W                   37.4        766
Mafft (default)             45.8        68
Muscle                      47.5        789
Kalign                      50.1        21
T-Coffee                    55.1        81041
Probcons                    55.8        13086
Mafft (auto/consistency)    58.8        1475
MsaProbs                    60.7        12382

Benchmark pitfalls

• Benchmark dataset may not be representative

• Danger of over-training towards benchmark

• Goldman: Most MSAs have unrealistic gaps

• Tend towards multiple, independent deletions

• Insertions are rare

• Sequences shrink in length over evolution

• No supporting evidence that this is the case

Solutions

• Use phylogenetic data to guide alignment

• Keep track of changes to ancestor sequences

• Don’t change them again so easily in descendants

Phylogeny

• Multiple Sequence Alignment tries to find best alignment of three or more sequences

• Used to identify groups of similar sequences

• Conserved regions etc.

• But if we want to examine evolutionary relationships we need more than just current sequence similarity

• Phylogeny is an estimate of evolutionary history between sequences

• Model substitutions from theoretical ancestor sequences

Neighbour Joining

• Simple phylogenetic tree method

• Bottom up (starts from alignment of current day sequences)

• Iterate to form a tree with nodes forming minimum distances between paired taxa

• Fast

• Dependent on accuracy of input

• Can sometimes get negative branch lengths
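The iteration can be sketched in Python. This is a minimal, unoptimised version of neighbour joining that works on a distance matrix (in practice the distances would be computed from the aligned sequences); the function name, the `nodeN` labels and the example distances are illustrative only.

```python
def neighbour_joining(names, dist):
    """names: list of taxa; dist: dict of dicts of symmetric distances.
    Returns (child, parent, branch_length) edges of an unrooted tree."""
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] = total distance from a to every other current node
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b)
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths to the new internal node; with noisy, non-additive
        # distances one of these can come out negative.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
names = ["a", "b", "c", "d", "e"]
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

for child, parent, length in neighbour_joining(names, dist):
    print(f"{child} -> {parent}: {length}")
```

Each pass joins one pair and shrinks the problem by one node, which is why the method is fast; it is also why an error in the input distances is carried straight into the tree.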

ClustalW2 - Phylogeny

• Neighbour joining (and UPGMA) phylogenetic tree algorithm from the ClustalW2 package

http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/


ALIGNED sequence input


PRANK

• Probabilistic Alignment Kit

• webPRANK

• Better suited for closely related sequences

• When solutions tie, one is chosen at random

• Avoids incorrect confidence in result

• Means alignments might not be reproducible

• Alignments look quite different

• Might look worse!

• But gap patterns make sense

• Gaps are good!

http://www.ebi.ac.uk/goldman-srv/webprank/

Common problems with MSA

• Input format

• Try using FASTA format

• Unique sequence identifiers

• Include sequence!

• Usually a limit of 500 sequences or 1 MB

• Job can’t be found/other error

• Results deleted after 7 days

• Some sequence/program combinations run out of memory

• Use a different program
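Several of the input problems above can be caught before submitting a job. The following is a minimal sketch of such a check for FASTA input; `parse_fasta` and `check_fasta` are hypothetical helper names, and the default limit of 500 sequences simply mirrors the figure quoted above.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # the identifier is the first word after '>'
            records.append([words[0] if words else "", ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("sequence data found before the first '>' header")
    return [(name, seq) for name, seq in records]


def check_fasta(text, max_sequences=500):
    """Return a list of human-readable problems (empty if none were found)."""
    records = parse_fasta(text)
    problems = []
    seen = set()
    for name, seq in records:
        if not name:
            problems.append("a header line has no identifier")
        elif name in seen:
            problems.append(f"identifier '{name}' is not unique")
        seen.add(name)
        if not seq:
            problems.append(f"entry '{name}' has no sequence")
    if len(records) > max_sequences:
        problems.append(
            f"{len(records)} sequences exceeds the limit of {max_sequences}"
        )
    return problems


example = ">a\nACGT\n>b\n>a\nAC\n"
for problem in check_fasta(example):
    print(problem)
```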

Example sequences


Problem_MSA1.fsa

Problem_MSA2.fsa

Problem_MSA3.fsa

Problem_MSA4.fsa



Common mis-uses of MSA

• Performing a sequence assembly

• Specialist type of MSA

• Use other tools (Staden etc.)

• Aligning ESTs to a reference genome

• Use EST2Genome

• Designing primers

• Use primer tools (primer3 etc.)

• Aligning two sequences

• Use a pairwise alignment tool!

Putting it all together

• EBI Search

• Sequence retrieval

• Sequence search

• Sequence retrieval

• Multiple sequence alignment

• Phylogeny

• Analysis

Final remarks

• Don’t assume a single tool will cater for all your needs

• Change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!

Getting Help

• Database documentation

• EBI Support: http://www.ebi.ac.uk/support/

• EBI training programme: http://www.ebi.ac.uk/training

• EBI online training: http://www.ebi.ac.uk/training/online

• IMGT/HLA

• [email protected]

http://www.ebi.ac.uk/support/

http://www.ebi.ac.uk/training/handson/



Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI




With thanks to our funders

• EMBL-EBI is primarily funded by EMBL member states

• Other major funders:

• European Commission

• National Institutes of Health

• Research Councils UK

• Wellcome Trust

• Industry Programme

Appendix

How is the data organised?

Data in EMBL-Bank is divided in 2 ways:

1) Data classes

• Type of data or

• Methodology used to obtain data

• Each entry belongs to one data class

2) Taxonomic Divisions

• Each entry belongs to one taxonomic division

ENA database structure

1) Data Classes

CON   Constructed from sequence assemblies
EST   Expressed Sequence Tag (cDNA)
GSS   Genome Survey Sequence (high-throughput short sequence)
HTC   High-Throughput cDNA (unfinished)
HTG   High-Throughput Genome sequencing (unfinished)
MGA   Mass Genome Annotation
PAT   Patent sequences
STS   Sequence Tagged Site (short unique genomic sequences)
TPA   Third Party Annotation (re-annotated and re-assembled)
TSA   Transcriptome Shotgun Assembly (computational assembly)
WGS   Whole Genome Shotgun
STD   Standard (high quality annotated sequence)
SRA   Sequence Read Archive (both databank and data class)


• Single-pass reads, variable quality


• SRA is a separate databank from EMBL-Bank

• SRA can also be searched as a data class within EMBL-Bank


• Bulk of entries

• Highest level of tracked information

2) Taxonomy

Which taxonomy database does ENA use?

• All INSDC databases use NCBI Taxonomy

Divisions:

HUM   Human
MUS   Mouse
MAM   Mammal
VRT   Vertebrate
ROD   Rodent
FUN   Fungi
INV   Invertebrate
PLN   Plant
PHG   Phage
PRO   Prokaryote
VIR   Viral

Other:

ENV   Environmental
SYN   Synthetic
TGN   Transgenic
UNC   Unclassified

2) Taxonomy: exclusion

Some species are EXCLUDED from certain taxonomic ranges

VRT   Vertebrate   excludes human, mouse, rodent, mammal
MAM   Mammal       excludes human, mouse, rodent
ROD   Rodent       excludes mouse

Applies to:

• ftp files and

• Sequence search tools

But not:

• ENA Browser
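One way to picture the exclusion rule is "most specific division wins": an entry is filed under the first division that matches, checked from the narrowest group outwards. The sketch below encodes only the divisions named on this slide; the label strings and the function name `division_for` are made up for illustration and do not reflect how ENA implements the assignment.

```python
# (division code, label that triggers it), most specific first
DIVISION_RULES = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def division_for(labels):
    """Return the division for an organism described by a set of labels,
    or None if none of the divisions on this slide applies."""
    for code, label in DIVISION_RULES:
        if label in labels:
            return code
    return None


organisms = {
    "mouse": {"mouse", "rodent", "mammal", "vertebrate"},
    "rat": {"rodent", "mammal", "vertebrate"},
    "cow": {"mammal", "vertebrate"},
    "zebrafish": {"vertebrate"},
}
for name, labels in organisms.items():
    print(name, division_for(labels))
```

This is why a search restricted to VRT returns no mouse, rat or cow entries: each of them has already been filed in a narrower division.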

How does data organization differ from GenBank?

EMBL-Bank: Data classes and Taxonomic Divisions

• Data split into intersecting slices

• Reduces search set

• Ensures complete result set

[Figure: data classes (con, est, gss, htc, htg, pat, sts, std, ...) crossed with taxonomic divisions (hum, mus, rod, mam, vrt, fun, ...)]

GenBank: Divisions

• Data split into parallel slices

• Large search sets

• Classes incomplete for taxonomy

• Taxonomy incomplete for classes

[Figure: data classes (con, est, gss, htc, htg, pat, sts, std, ...) and taxonomy (hum, mus, rod, mam, vrt, fun, inv, pln, ...) side by side as one list of divisions]

Database structure

‘Mouse’ + ‘EST’ intersection

• small data set

• ensured complete set of mouse ESTs

‘Mouse’ set

• large data set

• includes all mouse entries

‘EST’ set

• large data set

• includes all EST entries
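The intersection can be pictured with plain Python sets. The entries and their identifiers below are invented for illustration; only the field values (EST, STD, GSS, MUS, HUM) come from the data classes and divisions listed above.

```python
# Hypothetical entries, each with exactly one data class and one division
entries = [
    {"id": "X1", "data_class": "EST", "division": "MUS"},
    {"id": "X2", "data_class": "EST", "division": "HUM"},
    {"id": "X3", "data_class": "STD", "division": "MUS"},
    {"id": "X4", "data_class": "GSS", "division": "MUS"},
]


def ids_where(field, value):
    """Identifiers of all entries whose `field` equals `value`."""
    return {entry["id"] for entry in entries if entry[field] == value}


mouse = ids_where("division", "MUS")    # all mouse entries
est = ids_where("data_class", "EST")    # all EST entries
mouse_est = mouse & est                 # the 'Mouse' + 'EST' intersection

print(sorted(mouse), sorted(est), sorted(mouse_est))
```

Because every entry carries exactly one data class and one division, the intersection is both smaller than either set and guaranteed to contain every mouse EST.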
