Lecture invitation: 8.4.2015, AI, 14:30 (we will finish earlier this day, at 14:20). How Lab IT Accelerates Pharmaceutical Research. For more information, see http://ich.vscht.cz/~svozil/teaching.html

Last lecture summary

• Flavors of sequence alignment
• Homology
• Scoring DNA alignment, gaps
• Substitution matrix
• Scoring protein alignment
• PAM matrices, PAM1, higher PAM

New stuff

Protein substitution matrices – BLOSUM

BLOSUM matrices I

• BLOck SUbstitution Matrix, by Henikoff and Henikoff, 1992.
• They used the BLOCKS database, which contains multiple alignments of ungapped segments (blocks).
• These alignments correspond to the most highly conserved regions of proteins.
• Blocks are ungapped sequence motifs. A sequence motif is a conserved stretch of amino acids conferring a specific function on a protein.
• Any given protein can contain one or more blocks corresponding to its structural/functional motifs.

Blocks

...

BLOSUM matrices II

• Thus the Henikoffs focused on substitution patterns only in the most conserved regions of a protein. These regions are (presumably) the least prone to change.
• The substitution patterns of 2000 blocks (a block is the whole alignment, not an individual sequence within it) representing more than 500 groups were examined, and the BLOSUM matrices were generated from them.
• Sequences sharing no more than 62% identity were used to calculate the BLOSUM62 matrix (more similar sequences were clustered and weighted as a single sequence).

Short and clear explanation of the BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004;22(8):1035-6. PMID: 15286655.
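
To make the idea of "examining substitution patterns in blocks" concrete, here is a minimal sketch that turns one toy block into log-odds scores. It assumes the standard log-odds formulation in half-bit units (score = 2·log2(observed/expected)); the function name `log_odds_scores` and the toy block are illustrative assumptions, and the real procedure additionally pools counts over all blocks and clusters similar sequences, which is not done here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(block):
    """Return {(a, b): score} in half-bit units for one ungapped block.

    Simplified illustration: one block, no sequence clustering.
    """
    pair_counts = Counter()
    for column in zip(*block):                      # walk over alignment columns
        for a, b in combinations(column, 2):        # every pair of sequences in the column
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Expected pair frequency if residues were paired at random.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_block = ["AAC", "AAC", "ASC", "SAC"]        # hypothetical four-sequence block
    for pair, score in sorted(log_odds_scores(toy_block).items()):
        print(pair, score)
```

A positive score means the pair is seen more often than chance would predict, a negative score means it is seen less often. With only four sequences the numbers are not meaningful biologically; the sketch only shows the counting and scoring steps.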

BLOSUM matrices III

• BLOSUM matrices are based on an entirely different type of sequence analysis (local ungapped alignment vs. global gapped alignment in PAM) and on a much larger data set than PAM.
• All BLOSUM matrices are based on observed alignments. Unlike PAM matrices, they are not based on extrapolation.
• The BLOSUM numbering system runs in the reverse order to the PAM numbering system.
  • The lower the BLOSUM number, the more divergent the sequences it represents.

PAM vs. BLOSUM I

• However, you may ask which particular matrix should be used.
• Dayhoff et al. (1978) defined the terms protein family and superfamily.
• A protein family is formed by sequences that are 85% (or more) identical to each other.
• A protein superfamily is defined as sequences that are related at the 30% level or greater.
• A superfamily may clearly contain many families.
• These terms are widely used in contemporary literature, however with different meanings (we’ll come to that later).

Guidance on the choice of scoring matrix: Wheeler D. Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002;Chapter 3:Unit 3.5. www.nshtvn.org/ebook/molbio/Current%20Protocols/CPB/bi0305.pdf

PAM vs. BLOSUM II – PAM

• At the time the PAM matrices were derived, most known proteins were small, globular and hydrophilic. If a researcher believes the protein contains substantial hydrophobic regions, PAM matrices are not that useful.
• The most widely used is PAM250.
• It is capable of detecting similarities in the 30% range (i.e. superfamilies).
• Another point of view: PAM250 provides the best look-back in evolutionary time.
• PAM250 is most effective if the goal is to find the widest possible range of proteins similar to the given protein.

PAM vs. BLOSUM III – PAM

• Assume a protein is a known member of the serine protease family.
• Using the protein as a query against protein databases with PAM250 will detect virtually all serine proteases, but also a considerable number of irrelevant hits.
• In this case, the PAM160 matrix should be used. It detects similarities in the 50% to 60% range (Altschul, 1991).
• And to find only those proteins most similar (70% - 90%) to the query protein, use PAM40.
• Let’s summarize:
  • Locate all potential similarities – PAM250
  • Determine if the protein belongs to the protein family – PAM160
  • Determine the most similar proteins – PAM40

PAM vs. BLOSUM IV – BLOSUM

• The most widely used is BLOSUM62.
• BLOSUM62 appears to be superior to PAM250 in detecting distant relationships, even if the PAM method is updated with current data sets.
• BLOSUM62 is capable of accurately detecting similarities down to the 30% range (superfamilies).
• Determine if the protein belongs to a protein family – BLOSUM80 (detects identities at the 50% level)
• Determine the most similar proteins – BLOSUM90

Selecting an Appropriate Matrix

Matrix     Best use                                              Similarity (%)
PAM40      Short highly similar alignments                       70-90
PAM160     Detecting members of a protein family                 50-60
PAM250     Longer alignments of more divergent sequences         ~30
BLOSUM90   Short highly similar alignments                       70-90
BLOSUM80   Detecting members of a protein family                 50-60
BLOSUM62   Most effective in finding all potential similarities  30-40
BLOSUM30   Longer alignments of more divergent sequences         <30

The Similarity column gives the range of similarities that the matrix is best able to detect.
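
The table can be read as a simple lookup from expected similarity to a matrix name. The sketch below encodes it that way; the function name `choose_matrices` is hypothetical, and the cut-offs used for values the table does not cover (40-50%, 60-70%, and PAM below 30%) are assumptions made only so that every input gets an answer.

```python
def choose_matrices(percent_similarity):
    """Return (PAM name, BLOSUM name) suggested by the table above.

    Assumption: values falling between the listed ranges are assigned to the
    next lower range, and PAM250 is reused below 30% because the table lists
    no more divergent PAM matrix.
    """
    if percent_similarity >= 70:
        return ("PAM40", "BLOSUM90")      # short, highly similar alignments
    if percent_similarity >= 50:
        return ("PAM160", "BLOSUM80")     # members of a protein family
    if percent_similarity >= 30:
        return ("PAM250", "BLOSUM62")     # all potential similarities
    return ("PAM250", "BLOSUM30")         # long, divergent alignments


if __name__ == "__main__":
    for similarity in (85, 55, 35, 20):
        print(similarity, choose_matrices(similarity))
```

In practice the expected similarity is rarely known in advance, so a search is often started with the most permissive matrix and repeated with a stricter one, as in the serine protease example.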

PAM vs. BLOSUM V – comparison

• Careful information theory analysis showed that the following matrices are equivalent:
  • PAM250 is equivalent to BLOSUM45
  • PAM160 is equivalent to BLOSUM62
  • PAM120 is equivalent to BLOSUM80
• Compared to the PAM160 matrix, BLOSUM62 is less tolerant of substitutions involving hydrophilic amino acids, and more tolerant of substitutions involving hydrophobic amino acids.
• Although both PAM250 and BLOSUM62 detect similarities at the 30% level, BLOSUM uses a much wider range of proteins, so PAM250 is actually equivalent to BLOSUM45 when considering all proteins, not just those that are hydrophilic.

Sequence alignment algorithms

Pairwise alignment algorithms

• Dot plot (dot matrix)
  • Graphical way of comparing two sequences
• Dynamic programming
  • Slow, but formally optimizing
• Heuristic methods
  • Efficient, but not as thorough
  • Word (also k-tuple) methods
  • Used in database searches

Dot plot

Dot plot

• Graphical method that allows two biological sequences to be compared and regions of close similarity between them to be identified.
• Also used for finding direct or inverted repeats in sequences.
• Or for predicting regions in RNA that are self-complementary and therefore have the potential to form secondary structures.
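
A minimal sketch of the basic, unfiltered dot plot follows: one sequence runs along the rows, the other along the columns, and a cell is marked wherever the two residues are identical. The function names `dot_matrix` and `render` and the text rendering with `*` and `.` are illustrative assumptions, not the output of any particular program.

```python
def dot_matrix(seq_a, seq_b):
    """Return a matrix of booleans: True where seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text, '*' for a match and '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


if __name__ == "__main__":
    sequence = "GATTACA"                  # toy sequence compared against itself
    print(render(dot_matrix(sequence, sequence)))
```

Comparing a sequence against itself always gives an unbroken main diagonal; the scattered matches away from the diagonal are the background noise that filtering is meant to reduce.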

Self-similarity dot plot I

The DNA sequence EU127468.1 compared against itself.

Introduction to dot-plots, Jan Schulz, http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

[Figure annotations: runs of matched residues; gap; background noise]

Self-similarity dot plot II

Introduction to dot-plots, Jan Schulz, http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

The DNA sequence EU127468.1 compared against itself.

Window size = 16. Linear color mapping.

Improving dot plot

• Sliding window – window size (let’s say 11)
• Stringency (let’s say 7) – a dot is printed only if at least 7 of the next 11 positions in the two sequences are identical
• Color mapping
  • Scoring matrices can be used to assign a score to each substitution. These numbers can then be converted to gray/color.
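
The sliding window and stringency filter can be sketched as follows. The function name `filtered_dot_plot` and the choice to return the dots as a set of (i, j) start positions are assumptions for illustration; the defaults of 11 and 7 are the example values from the slide.

```python
def filtered_dot_plot(seq_a, seq_b, window=11, stringency=7):
    """Return the set of (i, j) window start positions that earn a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


if __name__ == "__main__":
    # Toy self-comparison of a sequence containing a direct repeat of ACGT.
    sequence = "ACGTACGT"
    for dot in sorted(filtered_dot_plot(sequence, sequence, window=4, stringency=4)):
        print(dot)
```

In this toy self-comparison the dots off the main diagonal come from the repeated ACGT, which is how direct repeats show up as lines parallel to the diagonal. Replacing the identity test with a lookup in a scoring matrix would give the graded values used for color mapping.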

Interpretation of dot plot I

1. Plot two homologous sequences of interest. If they are similar, a diagonal line will appear (matches).
2. Frame shifts:
   a) mutations: gaps in the diagonal
   b) insertions: shift of the main diagonal
   c) deletions: shift of the main diagonal

http://ugene.unipro.ru/documentation/manual/plugins/dotplot/interpret_a_dotplot.html

Interpretation of dot plot II

• Identify repeat regions (direct repeats, inverted repeats) – lines parallel to the diagonal line in a self-similarity plot.
• Microsatellites and minisatellites (these are also called low-complexity regions) can be identified as “squares”.
• Palindromic sequences are shown as lines perpendicular to the main diagonal.
  • Palindromic sequence: V ELIPSE SPI LEV

Bioinformatics explained: Dot plots, http://www.clcbio.com/index.php?id=1330&manual=BE_Dot_plots.html

Repeats in dot plot

[Figure: self-similarity dot plot of the NA sequence of the human LDL receptor, window 23, stringency 7, with direct repeats, minisatellites and inverted repeats marked. From the book Bioinformatics, David M. Mount.]

Interpretation of dot plot – summary

http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

Dot plot of the human genome

A. M. Campbell, L. J. Heyer, Discovering genomics, proteomics and bioinformatics

Dot plot rules

• A larger window size is used for DNA sequences because the number of random matches is much greater, due to the presence of only four characters in the alphabet.
• A typical window size for DNA is 15, with stringency 10. For proteins the matrix need not be filtered at all, or a window of 2 or 3 with stringency 2 can be used.
• If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and a small stringency, e.g., 5, should be useful for seeing any similarity.
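
The effect of alphabet size on random matches can be estimated with a short calculation. The sketch below assumes residues are independent and uniformly distributed (a simplification, since real sequences have biased composition) and computes the chance that a window earns a dot purely by chance; the function name `random_dot_probability` is hypothetical.

```python
from math import comb


def random_dot_probability(alphabet_size, window, stringency):
    """Probability that at least `stringency` of `window` positions match by chance.

    Assumes independent, uniformly distributed residues (binomial model).
    """
    p = 1 / alphabet_size                 # chance that two random residues match
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


if __name__ == "__main__":
    print("DNA, unfiltered:     ", random_dot_probability(4, 1, 1))
    print("protein, unfiltered: ", random_dot_probability(20, 1, 1))
    print("DNA, window 15/10:   ", random_dot_probability(4, 15, 10))
    print("protein, window 3/2: ", random_dot_probability(20, 3, 2))
```

Under this model a quarter of all cells in an unfiltered DNA plot are random dots, against one in twenty for proteins, which is why DNA needs the larger window and stringency.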

Dot plot advantages/disadvantages

• Advantages:
  • All possible matches of residues between two sequences are found. It’s just up to you to choose the most significant ones.
  • Readily reveals the presence of insertions/deletions and of direct and inverted repeats that are more difficult to find by the other, more automated methods.
• Disadvantages:
  • Most dot matrix computer programs do not show an actual alignment.
  • Does not return a score to indicate how ‘optimal’ a given alignment is (no statistical significance that could be tested).

Dynamic programming

Dynamic programming (DP)

• General class of algorithms typically applied to optimization problems.
• Recursive approach.
• The original problem is broken into smaller subproblems, which are then solved.
• The pieces of the larger problem have a sequential dependency.
• The 4th piece can be solved using the solution of the 3rd piece, the 3rd piece can be solved using the solution of the 2nd piece, and so on…

We want to align the following two sequences:

ABCDE
PQRST

If you already have the optimal solution to:

A…D
P…R

then you know the next pair of characters will be either:

A…DE      A…D-      A…DE
P…RS  or  P…RS  or  P…R-

You can extend the match by determining which of these has the highest score.
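
The three-way choice above is the core of the score matrix fill used in global alignment. The sketch below is a Needleman-Wunsch-style fill that returns only the final score; the function name `global_alignment_score` and the simple match/mismatch/gap values are illustrative assumptions (a substitution matrix such as BLOSUM62 would normally supply the pair scores), and no traceback is performed.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Fill the DP matrix and return the best global alignment score."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # residue of seq_a against a gap
                score[i][j - 1] + gap,        # residue of seq_b against a gap
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ABCDE", "PQRST"))
    print(global_alignment_score("ABCDE", "ABDE"))
```

Each cell reuses the already computed best scores of the three neighbouring shorter problems, which is the sequential dependency described for dynamic programming.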

[Figure: dynamic programming matrix for Sequence A and Sequence B, showing the best previous alignment.]

New best alignment = previous best + local best

DP algorithms

• Global alignment - Needleman-Wunsch
• Local alignment - Smith-Waterman
• Guaranteed to provide the optimal alignment.
• Disadvantages:
  • Slow due to the very large number of computational steps: O(n^2).
  • Computer memory requirement also increases with the square of the sequence lengths.
  • Therefore, it is difficult to use the method for very long sequences.
  • Many alignments may give the same optimal score, and none of them necessarily corresponds to the biologically correct alignment.
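
For comparison with the global case, here is a minimal Smith-Waterman-style sketch of local alignment scoring. The function name `local_alignment_score` and the match/mismatch/gap values are assumptions for illustration. It keeps only two rows of the matrix, which is enough when just the score is wanted; recovering the alignment itself by traceback is what requires the full matrix and the quadratic memory noted above.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (no traceback)."""
    previous = [0] * (len(seq_b) + 1)     # scores for the row above
    best = 0
    for a in seq_a:
        current = [0]                     # first column is always 0
        for j, b in enumerate(seq_b, start=1):
            pair = match if a == b else mismatch
            cell = max(
                0,                        # start a new local alignment here
                previous[j - 1] + pair,   # extend by aligning the two residues
                previous[j] + gap,        # gap in seq_b
                current[j - 1] + gap,     # gap in seq_a
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    print(local_alignment_score("XXXACGTYYY", "ZZACGTZZ"))
```

The only changes from the global fill are the extra option of 0, which lets an alignment restart anywhere, and taking the maximum over all cells instead of the last one. The number of steps is still proportional to the product of the two sequence lengths.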