Introduction to Bioinformatics Multiple Sequence Alignment

Bioinformatics lesson


Page 1: Bioinformatics lesson

Introduction to Bioinformatics

Multiple Sequence Alignment

Page 2: Bioinformatics lesson

Why Multiple Sequence Alignment?

• Up until now we have only tried to align two sequences.

• What about more than two? And what for?

• A faint similarity between two sequences becomes significant if it is present in many sequences

• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

V T I S C T G S S S N I G
V T L T C T G S S S N I G
V T L S C S S S G F I F S
V T L T C T V S G T S F D
V T I T C V V S D V S H E
V T L V C L I S D F Y P G
V T L V C L I S D F Y P G
V T L V C L V S D Y F P E

Page 3: Bioinformatics lesson

Multiple Sequence Alignment (msa)

VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-

• Goal: Bring the greatest number of similar characters into the same column of the alignment

• Similar to alignment of two sequences.

Page 4: Bioinformatics lesson

Multiple Sequence Alignment: Motivation

• Correspondence: find out which parts “do the same thing”
  – Similar genes are conserved across widely divergent species, often performing similar functions
• Structure prediction
  – Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  – Structure is more conserved than sequence
• Create “profiles” for protein families
  – Allow us to search for other members of the family
• Genome assembly: automated reconstruction of “contig” maps of genomic fragments such as ESTs
• msa is the starting point for phylogenetic analysis
• msa often makes it possible to detect weakly conserved regions that pairwise alignment cannot

Page 5: Bioinformatics lesson

Multiple Sequence Alignment: Approaches

• Optimal Global Alignments
  – Generalization of dynamic programming
  – Find the alignment that maximizes a score function
  – Computationally expensive: time grows as the product of the sequence lengths
• Global Progressive Alignments: match closely related sequences first, using a guide tree
• Global Iterative Alignments: multiple re-building attempts to find the best alignment
• Local alignments
  – Profile analysis
  – Block analysis
  – Pattern searching and/or statistical methods
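
To get a feel for "time grows as the product of the sequence lengths", the following minimal sketch counts the cells in the dynamic programming table and the number of predecessor cells examined per cell. The function names `dp_table_cells`, `moves_per_cell` and `dp_work` are hypothetical, and the work estimate is only a rough count of cell updates, not a timing model.

```python
from math import prod


def dp_table_cells(lengths):
    """Number of cells in the DP table: one axis of size (n + 1) per sequence."""
    return prod(n + 1 for n in lengths)


def moves_per_cell(k):
    """Each of the k sequences either advances or takes a gap, excluding the
    case where all k take a gap: 2**k - 1 predecessor cells."""
    return 2 ** k - 1


def dp_work(lengths):
    """Rough count of (cell, predecessor) pairs examined by exact DP."""
    return dp_table_cells(lengths) * moves_per_cell(len(lengths))


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        lengths = [100] * k
        print(k, dp_table_cells(lengths), dp_work(lengths))
```

With sequences of length 100, going from two to three sequences already multiplies the table size by about a hundred, which is why the exact method is limited to a handful of sequences.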

Page 6: Bioinformatics lesson

Global msa: Challenges

• Computationally expensive
  – If the msa includes matches, mismatches, and gaps, and also accounts for the degree of variation, then global msa can be applied to only a few sequences
• Difficult to score
  – Multiple comparisons are necessary in each column of the msa for a cumulative score
  – Placement of gaps and scoring of substitutions is more difficult
• Difficulty increases with diversity
  – Relatively easy for a set of closely related sequences
  – Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging
  – Even more difficult if some members are more alike than others
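
One common way to turn the "multiple comparisons in each column" into a cumulative score is the sum-of-pairs scheme, which the slides do not name; it is used here only as an illustration. The sketch below assumes hypothetical scores (match +1, mismatch -1, letter against gap -2, gap against gap 0) and hypothetical function names.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters in a column (assumed toy values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(column):
    """Sum the pair scores over all N*(N-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(rows):
    """Cumulative score of an alignment given as equal-length gapped rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

Each column of N sequences requires N(N-1)/2 comparisons, so the scoring effort itself grows quadratically with the number of sequences.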

Page 7: Bioinformatics lesson

Global msa: Dynamic Programming

• The two-sequence alignment algorithm (Needleman-Wunsch) can be generalized to any number of sequences.

• E.g., for three sequences X, Y, W define C[i,j,k] = score of the optimum alignment among X[1..i], Y[1..j], W[1..k]

• As for two sequences, divide possible alignments into different classes, depending on how they end.
  – Devise recurrence relations for C[i,j,k]
  – C[i,j,k] is the maximum out of all possibilities

Page 8: Bioinformatics lesson

msa for 3 sequences: alignment can end in 7 ways

The last column of an alignment of X1 . . . Xi, Y1 . . . Yj, W1 . . . Wk is one of:

X:  Xi  Xi  Xi  -   Xi  -   -
Y:  Yj  Yj  -   Yj  -   Yj  -
W:  Wk  -   Wk  Wk  -   -   Wk
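
The seven endings are simply every way of choosing, for each sequence, either its last character or a gap, minus the all-gap column. A minimal sketch that enumerates them follows; the function name `column_endings` and the string labels are illustrative assumptions.

```python
from itertools import product


def column_endings(symbols=("Xi", "Yj", "Wk")):
    """List every possible last column: each row holds its symbol or a gap,
    and the column made only of gaps is excluded."""
    endings = []
    for keep_mask in product((True, False), repeat=len(symbols)):
        if not any(keep_mask):
            continue  # a column of gaps only is not a valid alignment column
        endings.append(
            tuple(s if keep else "-" for s, keep in zip(symbols, keep_mask))
        )
    return endings


if __name__ == "__main__":
    for ending in column_endings():
        print(ending)
```

The same function called with two symbols yields the three familiar pairwise cases (match/mismatch, gap in the first row, gap in the second row).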

Page 9: Bioinformatics lesson

Aligning Three Sequences

• Same strategy as aligning two sequences

• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

[Figure: a 2-D edit graph with axes V and W, next to a 3-D edit graph with axes V, W and X]

Page 10: Bioinformatics lesson

Dynamic programming for 3 sequences

[Figure: an example alignment of three short sequences, drawn as a path from Start through the 3-D dynamic programming matrix]

Each alignment is a path through the dynamic programming matrix
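
The correspondence between an alignment and a path can be sketched as follows: every column moves one step along the axis of each sequence that contributes a character. The function name `alignment_to_path` and the three gapped rows in the usage example are hypothetical and chosen only for illustration.

```python
def alignment_to_path(rows):
    """Convert gapped alignment rows into the list of lattice points visited,
    starting from the origin (the Start corner of the matrix)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        if all(c == "-" for c in column):
            raise ValueError("a column made only of gaps is not allowed")
        # Advance along each axis whose sequence contributes a character.
        position = [p + (c != "-") for p, c in zip(position, column)]
        path.append(tuple(position))
    return path


if __name__ == "__main__":
    # Hypothetical alignment of three short sequences.
    for point in alignment_to_path(["VSN-S", "-SNA-", "---AS"]):
        print(point)
```

The path always ends at the corner whose coordinates are the three sequence lengths, here (4, 3, 2).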

Page 11: Bioinformatics lesson

2-D cell versus 3-D Alignment Cell

In 2-D, 3 edges in each unit square: C(i,j) is reached from C(i-1,j-1), C(i-1,j), C(i,j-1)

In 3-D, 7 edges in each unit cube: C(i,j,k) is reached from C(i-1,j-1,k-1), C(i-1,j-1,k), C(i-1,j,k-1), C(i,j-1,k-1), C(i-1,j,k), C(i,j-1,k), C(i,j,k-1)

Enumerate all possibilities and choose the best one

Page 12: Bioinformatics lesson

Multiple Alignment: Dynamic Programming

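$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no in/dels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one in/del} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one in/del} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one in/del} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two in/dels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two in/dels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two in/dels}
\end{cases}
$$

• δ(x, y, z) is an entry in the 3-D scoring matrix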
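
The seven-way recurrence for three sequences can be sketched in a few lines. The column score `delta` below is an assumption (a toy sum-of-pairs score: +1 per matching pair, -1 per mismatch or letter against gap, 0 for gap against gap), not the scoring matrix used in the slides, and `align3_score` is a hypothetical name. Traceback is omitted, so only the optimal score is returned.

```python
from itertools import product


def delta(x, y, z):
    """Assumed toy column score; "-" denotes an in/del."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == "-" and b == "-":
            continue
        total += 1 if a == b else -1
    return total


def align3_score(v, w, u):
    """Optimal global alignment score of three sequences by exact DP."""
    n, m, p = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 moves: every (di, dj, dk) in {0, 1}^3 except (0, 0, 0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(p + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the cube
                    x = v[i - 1] if di else "-"
                    y = w[j - 1] if dj else "-"
                    z = u[k - 1] if dk else "-"
                    candidate = s[pi][pj][pk] + delta(x, y, z)
                    if candidate > s[i][j][k]:
                        s[i][j][k] = candidate
    return s[n][m][p]


if __name__ == "__main__":
    print(align3_score("VSNS", "SNA", "AS"))
```

The three nested loops visit every cell of the cube once, so the running time is proportional to the product of the three sequence lengths times the seven moves.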

Page 13: Bioinformatics lesson

• Reading Materials

– Chapter 5: Bioinformatics: Sequence and Genome Analysis, David W. Mount
  • 2nd Edition: Pages 170-194
  • 1st Edition: Pages 140-165

– Cédric Notredame, Desmond G. Higgins and Jaap Heringa, “T-Coffee: a novel method for fast and accurate multiple sequence alignment”, Journal of Molecular Biology, Volume 302, Issue 1, 8 September 2000, Pages 205-217

– Christopher Lee, Catherine Grasso and Mark F. Sharlow, “Multiple sequence alignment using partial order graphs” Bioinformatics Vol. 18 no. 3 2002, Pages 452-464

– Cédric Notredame and Desmond G. Higgins “SAGA: sequence alignment by genetic algorithm”, Nucleic Acids Res. 1996 Apr 15;24(8):1515-24.