1

Multiple Sequence Alignment

Lesson 5

2

Example

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

3

Why multiple sequence alignment?

• Structural similarity - amino acids that play the same role in each structure are in the same column

• Evolutionary similarity - amino acids related to the same ancestor are in the same column

• Functional similarity - amino acids with the same function are in the same column

• Sequence similarity - the alignment with maximal similarity; no biological meaning

• When the sequences are closely related, structural, evolutionary and functional similarity are equivalent

4

Why multiple alignment - example

• Histones: small, abundant proteins present in all eukaryotic chromosomes

• They show a remarkably conserved multiple sequence alignment

• Conservation of structure and function (they aid in DNA packaging)

5

MSA applications

• Generate protein families

• Extrapolation - assign an uncharacterized sequence to a protein family

• Understand evolution - a preliminary step in molecular evolution analysis for constructing phylogenetic trees, e.g. is the duck evolutionarily closer to a lion or to a fruit fly?

• Pattern identification - find the important (conserved) regions in the protein; conserved positions may characterize a function

• Domain identification - build a consensus/profile/motif that describes the protein family and helps to identify new members of the family

• DNA regulatory elements

• Structure prediction (secondary structure and 3D model)

6

multiple sequence alignment

Pairwise solution might be very different from multiple solution

7

Can we simply generalize the dynamic programming method? In theory yes, in practice no.

The running time and the memory size grow as about N^K, where N is the sequence length and K is the number of sequences.

In practice this is impossible for K > 3.

The computational problem: given a number of sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Input sequences:

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned:

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
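To make the N^K growth concrete, the following sketch counts the cells of a full dynamic programming table for K sequences of length N. The function name `dp_cells` and the example length of 100 residues are illustrative assumptions, not values from the lesson.

```python
def dp_cells(n: int, k: int) -> int:
    """Number of cells in a full K-dimensional DP table for K sequences of length n."""
    # Each sequence contributes one axis with n + 1 positions (including the empty prefix).
    return (n + 1) ** k


if __name__ == "__main__":
    # Hypothetical sequence length of 100 residues.
    for k in range(2, 7):
        print(f"K={k}: {dp_cells(100, k):,} cells")
```

Each added sequence multiplies the table size by roughly N, which is why exact dynamic programming stops being practical after a handful of sequences.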

8

Since the computational problem is hard, we know that no practical method we propose will guarantee an optimal solution. We look for a method that gives good results in most cases.

The method we choose may depend on the reason for which we perform the alignment.

The problems we have to answer

1. Choosing the sequences
2. Scoring metrics
3. Approximation/heuristic algorithms
4. MSA formats
5. Interpreting the MSA

9

Scoring metrics

• Distance from consensus - The consensus of an alignment is a string of the most common characters in each column of the alignment. The total distance between the strings is defined as the number of characters that differ from the consensus character of their column. Let C be the consensus sequence; then the total distance is $\sum_i D(S_i, C)$

• Evolutionary tree alignment - The weight of the lightest evolutionary tree that can be constructed from the sequences, with the weight of the tree defined as the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree, summed over all such pairs

• Sum of pairs - The sum of pairwise distances between all pairs of sequences: $\sum_{i<j} D(S_i, S_j)$

10

Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

Score of each pair (identical residues per pair):

-SCGPFIRV
MSCGPGLRA    5

-SCGPFIRV
-SCTPHL-A    3

MSCGPGLRA
-SCTPHL-A    5

Total: 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Column        1  2  3  4  5  6  7  8  9   Total
Homogeneity   4  6  4  2  6  1  4  5  3   35
Distance      2  0  2  4  0  5  2  1  3   19
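A minimal sketch of the two metrics on the example sequences. It assumes that the pair score counts identical residues (a gap never counts as a match) and that a gap is treated as an ordinary character when looking for the most common character of a column. Both conventions are inferred from the example numbers rather than stated in the lesson, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

GAP = "-"


def identity_score(a: str, b: str) -> int:
    """Count columns where two aligned rows carry the same residue (gaps never count)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != GAP)


def sum_of_pairs(rows: list[str]) -> int:
    """Sum the pairwise identity scores over all pairs of rows."""
    return sum(identity_score(a, b) for a, b in combinations(rows, 2))


def distance_from_consensus(rows: list[str]) -> int:
    """Count characters that differ from the most common character of their column."""
    total = 0
    for column in zip(*rows):
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total


three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

print([identity_score(a, b) for a, b in combinations(three, 2)])  # [5, 3, 5]
print(sum_of_pairs(three))           # 13
print(distance_from_consensus(six))  # 19
```

The distance does not depend on how ties for the most common character are broken, because only the count of the winning character enters the sum.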

11

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)

• Iterative methods (Dialign)

• Direct optimization (Monte Carlo, genetic algorithms)

• Local methods: eMotifs, Blocks, PSI-BLAST

12

CLUSTALW algorithm

• Compare all sequence pairs (pairwise alignment)

• Generate a hierarchy for alignment (guide tree)

• Build the multiple alignment step by step according to the guide tree: first align the most similar pair, then add another sequence or another pairwise alignment, and so on

13

CLUSTALW algorithm

(1) Pairwise alignment (prepare a guide tree): 6 pairwise alignments, then cluster analysis

(2) Multiple alignment following the tree from (1): successive alignments

14

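CLUSTALW algorithm

1. Build a table of the edit distance between every two sequences

2. Build a tree by joining the similar sequences, in order of how well they match

Tree leaves: Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale

3. Build the alignment in the order dictated by the tree

There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment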
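The first two steps (distance table, then a joining order) can be sketched as follows. This is a simplified stand-in, not ClustalW itself: it uses plain Levenshtein edit distance and greedy average-linkage clustering, whereas ClustalW derives its distances from pairwise alignments and builds its guide tree by neighbor joining. The toy sequences and the names `edit_distance` and `guide_order` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        previous = current
    return previous[-1]


def guide_order(seqs: dict[str, str]) -> list[tuple]:
    """Return the merge order of a greedy average-linkage clustering."""
    # Step 1: table of edit distances between every two sequences.
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def linkage(pair):
        a, b = pair
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    # Step 2: repeatedly join the two closest clusters.
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


toy = {"a": "AAAA", "b": "AAAT", "c": "GGGG", "d": "GGGC"}
for left, right in guide_order(toy):
    print(left, "+", right)
```

The returned merge order is the order in which a progressive aligner would combine sequences and partial alignments.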

15

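Two alignments (or a sequence and an alignment) can be aligned by a fairly natural extension of the dynamic programming method. Letters are added and inserted in the usual way. The cost of matching letters is the average cost over the different combinations. A special cost has to be assigned to a gap.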
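A small sketch of the averaged column cost used when aligning one alignment against another. Each column is written as a string holding one character per row of its alignment. The unit costs (mismatch 1, gap 2, gap against gap 0) are hypothetical placeholders, not values from the lesson or from ClustalW.

```python
from itertools import product

GAP = "-"


def pair_cost(x: str, y: str, mismatch: float = 1.0, gap: float = 2.0) -> float:
    """Cost of placing two characters in the same column (hypothetical unit costs)."""
    if x == GAP and y == GAP:
        return 0.0
    if x == GAP or y == GAP:
        return gap
    return 0.0 if x == y else mismatch


def column_cost(col_a: str, col_b: str) -> float:
    """Average cost over all combinations of one character from each column."""
    pairs = list(product(col_a, col_b))
    return sum(pair_cost(x, y) for x, y in pairs) / len(pairs)


print(column_cost("GG", "GT"))  # 0.5
print(column_cost("R-", "R"))   # 1.0
```

In a full implementation this function would replace the single-letter substitution cost inside the usual dynamic programming recurrence.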

16

CLUSTALW algorithm

Fine points

• Sequences are given different weights, so that if there are several very similar sequences the relative weight of each one decreases

• A special method is used to set the cost of inserting gaps

Disadvantages

• Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs

• "Once a gap, always a gap": gaps do not disappear

• High sensitivity to the gap penalty

Advantages

• Fast

• Reasonable

• Many parameters can be set

• Everyone uses it

17

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)

• The projected pairwise alignment is not necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

Best pairwise alignment (optimal)

Projected pairwise alignment

18

ClustalW at EMBL: http://www.ebi.ac.uk/clustalw

ClustalW at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a: a good MSA

25

Example 1b: making an MSA of distantly related proteins

26

Example 1c: including more distant relatives in the MSA

27

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins; the iron atoms are bound to amino acid side chains

• In IPNS the metal ion is coordinated by three protein residues

• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme from ACV to isopenicillin N, labeled Fe+2, ascorbate, O2 and 2 H2O; active-site diagram showing Fe coordinated by His268, His212, Asp214, S-ACV, O2 (NO) and H2O]

28

Research IPNS

• Goal: identify the Fe+2 binding residues

• Possible solutions:

1. In the lab

2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1. Obtain sequences (e.g. from NCBI)

IPNS AND Bacteria[Organism]

2. MSA (clustalw) and search for conserved residues in the MSA

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

34

Step 2

Goal: add more enzymes similar to IPNS

Implementation

• Search in http://www.expasy.org/tools/blast

• BLAST with IPNS_CEPAC as the query

• Select sequences similar to the query over the entire length

• Export in FASTA format

• Run CLUSTALW

35

• New multiple alignment: narrowing down the possibilities

36

Simple multiple alignment

• The known IPNS sequences are very similar

• The sequences of closely related enzymes are also quite similar

• Not enough variability to identify the active sites

• We need to obtain even more distant sequences (distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1. Obtain an MSA (clustalw)

2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search

3. MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

• Consensus: each position reflects the most common character found at that position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

39

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T

A A C T T G T

A A C T T C T

Position   1     2     3     4
A          1     0.67  0     0
T          0     0.33  0     1
C          0     0     1     0
G          0     0     0     0
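A minimal sketch of how such a frequency profile can be computed from the three aligned sequences on this slide. The function name `profile`, the A/T/C/G alphabet order and the rounding to two decimals are illustrative choices.

```python
from collections import Counter


def profile(rows: list[str], alphabet: str = "ATCG") -> dict[str, list[float]]:
    """Per-column relative frequency of each character in the alphabet."""
    n = len(rows)
    columns = [Counter(col) for col in zip(*rows)]
    return {ch: [round(c[ch] / n, 2) for c in columns] for ch in alphabet}


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, freqs in profile(rows).items():
    print(ch, freqs)
```

Each list has one entry per alignment column, so the entries for a given column sum to 1 (up to rounding) when the column contains no gaps.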

40

Profile vs Consensus

• The following multiple alignments will have the same consensus

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

41

Profile vs Consensus

• But have a different profile

A A C T T G C
A A G T C G T
C A C T T C T

Position   1     2     3     4
A          0.66  1     0     0
T          0     0     0     1
C          0.33  0     0.66  0
G          0     0     0.33  0

A A C T T G T
A A C T T G T
A A C T T C T

Position   1     2     3     4
A          1     1     0     0
T          0     0     0     1
C          0     0     1     0
G          0     0     0     0
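The point of these two slides can be checked directly: the two alignments give the same consensus but different per-column counts. This is a small sketch with illustrative function names; ties for the most frequent character are broken by first occurrence, which does not matter for these inputs.

```python
from collections import Counter


def consensus(rows: list[str]) -> str:
    """Most frequent character per column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def column_counts(rows: list[str]) -> list[dict[str, int]]:
    """Raw character counts per column, i.e. an unnormalized profile."""
    return [dict(Counter(col)) for col in zip(*rows)]


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))            # AACTTGT AACTTGT
print(column_counts(first) == column_counts(second))  # False
```

A search with the consensus alone therefore cannot distinguish the two families, while a search with the profile can.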

42

Sequence LOGO

A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C - C G C - T

[Figure: sequence logo of the alignment above]

http://weblogo.berkeley.edu

43

PSI-BLAST

• Position-Specific Iterated BLAST - an automatic profile-like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

44

• Alignment with distantly related proteins

45

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd



3

Why multiple sequence alignmentbull Structure similarity ndash aa that play the same role in

each structure are in the same columnbull Evolutionary similarity ndash aa related to the same

ancestor are in the same columnbull Functional similarity - aa with the same function are in

the same columnbull Seq similarity ndash alignment with max similarity No

biological meaning

bull When seqs are closely related structure-evolution-functional similarity equivalent

4

bull Histones small abundant proteins Present in all eukaryotic chromosomes

bull Show a remarkable conserved multiple sequence alignment

bull Conservation of structure and function (aid in DNA package)

Why multiple alignment - examplebull Example

5

MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a

protein familybull Understand evolution - preliminary step in molecular evolution

analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly

bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function

bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family

bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)

6

multiple sequence alignment

Pairwise solution might be very different from multiple solution

7

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת (עי הוספת רווחים) כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

8

מאחר שהבעיה החישובית קשה אנו יודעים שכל שיטה שנציע לא תבטיח פתרון אופטימלי נחפש שיטה שנותנת תוצאות טובות

ברב המקרים יתכן שהשיטה שנבחר תהיה תלויה בסיבה שבגללה

אנו מבצעים את ההתאמה

1 Choosing the sequencess2 Scoring metrics3 Approximationheuristic algorithms4 MSA formats5 Interpreting the MSA

The problems we have to answer

9

Scoring metricsbull Distance from Consensus - The consensus of an alignment

is a string of the most common characters in each column of the alignment The total distance between the strings is defined as the number of characters that differ from the consensus character of their column let C be the consensus sequence then the total distance is sum_i D(Si C)

bull Evolutionary Tree Alignment - The weight of the lightest evolutionary tree that can be constructed from the sequences with the weight of the tree defined as the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree summed over all such pairs

bull Sum of Pairs - The sum of pairwise distances between all pairs of sequences Sum_iltj D(Si Sj)

10

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

11

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee)

bull Iterative methods (Dialign)

bull Direct optimization (monte carlo genetic algorithms)

bull Local methods eMotifs Blocks Psi-blast

12

CLUSTALW algorithm

bull Compare all sequence pairs (pairwise alignment)

bull Generate a hierarchy for alignment (guide tree)

bull Build the multiple alignment step by step according to the guide tree first aligning the most similar pair then add another sequence or another pairwise alignment etc

13

(1) Pairwise alignment (prepare a guide tree)

6 pairwise alignments

then cluster analysis

(2) Multiple alignment following the tree from (1)

successive alignments

CLUSTALW algorithm

14

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

15

ניתן להתאים בין התאמות (או בין רצף להתאמה) עי הרחבה די טבעית של שיטת התיכנות הדינמי הוספה

והכנסת אותיות באופן הרגיל המחיר להתאמת אותיות הוא מחיר ממוצע לקומבינציות השונות צריך לתת מחיר מיוחד

לרווח

16

נקודות עדינות

נותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bullהמשקל היחסי של כל אחד יקטן

שיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

17

bull We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one

Best Pairwise alignment (optimal)

Projected Pairwise alignment

CLUSTALW algorithm

18

ClustalW at EMBLhttpwwwebiacukclustalw Clustalw at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing Jalview

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a a good MSA

25

Example 1b making MSA of distantly related proteins

26

Example 1c including more distant relatives in the MSA

27

bull Mononuclear iron proteins ndash electron carrier proteins Iron atoms are bound to amino acid side chains

bull In IPNS the metal ion is coordinated by three protein residues

bull IPNS is involved in biosynthesis of penicillin

Example 2 Isopenicillin N Synthase

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

28

Research IPNS

bull Goal Identify Fe+2 binding residues

bull Possible solutions1 In the lab

2 Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1 Obtain sequence (eg for MCBI)

IPNS AND Bacteria[Organism]

2 MSA (clustalw) and search for conserved residues in the MSA

30

31

32

MSA ndash bacteria only

Not enough variation

33

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

34

Step 2

Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblastbullblast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire lengthbullExport in FASTA formatbullRun CLUSTALW

35

bull New multiple alignment narrowing down the possibilities

36

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite

similarbull Not enough variability to categorize the active

sitesbull We need to obtain even more distant sequences

(distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1 Obtain an MSA (clustalw)

2 - Construct a consensus sequence and perform a

new search

OR

- Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

39

Profilebull We can deduce a statistical model describing

the multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

40

Profile vs Consensusbull The following multiple alignments will

have the same consensus

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

41

Profile vs Consensusbull But have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

42

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

43

Psi Blast

bull Position Specific Iterated - automatic profile-like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

44

bull Alignment with distantly related proteins

45

bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme Relative Km kcat kcatKm

Activity (mM) (min-1) (mM-1min

-1)Wild type 100 04 388 969

His48Ala 16 056 75 134

His63Ala 31 10 142 142

His114Ala 28 085 125 147

His124Ala 48 084 321 381

His135Ala 22 059 117 198

His212Ala lt0007 nd nd

His268Ala lt0003 nd nd

Asp14Ala 5 086 056 07

Asp113Ala 63 045 238 528

Asp131Ala 68 048 363 755

Asp203Ala 32 091 123 135

Asp214Ala lt0004 nd nd

Isopenicillin N Synthase

46

ndash IPNS

47

48

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Why multiple alignment - example
  • MSA applications
  • Slide 6
  • Slide 7
  • Slide 8
  • Scoring metrics
  • Slide 10
  • MSA algorithms
  • CLUSTALW algorithm
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 25
  • Slide 26
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • Slide 30
  • Slide 31
  • MSA - bacteria only
  • MSA - bacteria & fungi
  • Step 2
  • Slide 35
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 41
  • Slide 42
  • Psi Blast
  • Slide 44
  • Isopenicillin N Synthase
  • Slide 46
  • Slide 47
  • Slide 48

7

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Unaligned:

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned:

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes; in practice no.

• The running time and the memory size grow as N^K, where N is the sequence length and K is the number of sequences.

• For K > 3 this is practically impossible.
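To get a feel for the N^K growth of the dynamic programming table, the short sketch below counts table cells for a few values of K. The sequence length of 300 and the 4 bytes per cell are illustrative assumptions, and `dp_table_cells` is a hypothetical helper name.

```python
def dp_table_cells(seq_len, num_seqs):
    """Order-of-magnitude size of the dynamic programming table: N ** K cells."""
    return seq_len ** num_seqs

BYTES_PER_CELL = 4   # assumption: one 32-bit score per cell

for k in range(2, 7):
    cells = dp_table_cells(300, k)
    print(f"K={k}: {cells:.2e} cells, about {cells * BYTES_PER_CELL / 1e9:.3g} GB")
```

Under these assumptions, going from two to four sequences of length 300 takes the table from well under a megabyte to tens of gigabytes.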

8

The problems we have to answer

Because the computational problem is hard, we know that no method we propose will guarantee an optimal solution. We look for a method that gives good results in most cases. The method we choose may depend on the reason for which we are performing the alignment.

1. Choosing the sequences
2. Scoring metrics
3. Approximation/heuristic algorithms
4. MSA formats
5. Interpreting the MSA

9

Scoring metrics

• Distance from Consensus - The consensus of an alignment is the string of the most common character in each column of the alignment. The total distance is the number of characters that differ from the consensus character of their column: if C is the consensus sequence, the total distance is $\sum_i D(S_i, C)$.

• Evolutionary Tree Alignment - The weight of the lightest evolutionary tree that can be constructed from the sequences, where the weight of a tree is the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree, summed over all such pairs.

• Sum of Pairs - The sum of pairwise distances between all pairs of sequences: $\sum_{i<j} D(S_i, S_j)$.

10

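Scoring metrics - examples

Sum of pairs (here counted as the number of identical positions in each pair)

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV
MSCGPGLRA     5

-SCGPFIRV
-SCTPHL-A     3

MSCGPGLRA
-SCTPHL-A     5

Total: 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Homogeneity per column:  4 6 4 2 6 1 4 5 3    Total homogeneity: 35
Distance per column:     2 0 2 4 0 5 2 1 3    Total distance: 19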
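Both example scores can be reproduced with a few lines of code. The sketch below follows the worked example rather than a particular tool: the sum-of-pairs score counts identical positions per pair (a column where both rows have a gap does not count), and the distance from consensus counts every character, gaps included, that differs from the most common character of its column. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_identities(a, b):
    """Columns where two aligned rows carry the same residue (gap-gap is not a match)."""
    return sum(x == y and x != "-" for x, y in zip(a, b))

def sum_of_pairs(msa):
    """Sum of the pairwise identity counts over all pairs of rows."""
    return sum(pair_identities(a, b) for a, b in combinations(msa, 2))

def distance_from_consensus(msa):
    """Number of characters that differ from the most common character of their column."""
    total = 0
    for column in zip(*msa):
        _, top_count = Counter(column).most_common(1)[0]
        total += len(column) - top_count
    return total

three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

sp_score = sum_of_pairs(three)
consensus_distance = distance_from_consensus(six)
print(sp_score)            # 13
print(consensus_distance)  # 19
```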

11

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)

• Iterative methods (Dialign)

• Direct optimization (Monte Carlo, genetic algorithms)

• Local methods: eMotifs, Blocks, PSI-BLAST

12

CLUSTALW algorithm

• Compare all sequence pairs (pairwise alignment)

• Generate a hierarchy for alignment (guide tree)

• Build the multiple alignment step by step according to the guide tree: first align the most similar pair, then add another sequence or another pairwise alignment, and so on

13

CLUSTALW algorithm

(1) Pairwise alignment (prepare a guide tree): 6 pairwise alignments, then cluster analysis

(2) Multiple alignment following the tree from (1): successive alignments

14

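CLUSTALW algorithm

1. Build a table of the edit distances between every pair of sequences.

2. Build a tree by joining the most similar sequences, in order of their similarity.

Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale

3. Build the alignment in the order dictated by the tree. There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment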
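The first two steps (a table of pairwise edit distances, then a tree that joins the most similar sequences first) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses plain edit distance and average-linkage clustering as a stand-in, which may differ from the tree-building method CLUSTALW actually uses, and the five short sequences are made up (they are not globin sequences).

```python
from itertools import combinations

def edit_distance(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]

def guide_tree(seqs):
    """seqs: {name: sequence}. Returns the join order as nested tuples of names."""
    dist = {frozenset((a, b)): edit_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    clusters = [(name, [name]) for name in seqs]   # (subtree, leaf names)

    def linkage(c1, c2):
        # Average distance over all leaf pairs between the two clusters
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical toy sequences
toy = {"A1": "ACGTACGT", "A2": "ACGTACGA",
       "B1": "TTGTACCC", "B2": "TTGTACCA", "C": "GGGGGGGG"}
tree = guide_tree(toy)
print(tree)
```

The nested tuples give the order in which step 3 would align things: the two close pairs first, then the two pair-alignments with each other, and the outlier last.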

15

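Two alignments (or a sequence and an alignment) can be aligned by a fairly natural extension of the dynamic programming method. Letters are added and inserted in the usual way. The cost of matching letters is the average cost over the different combinations. A special cost must be assigned to gaps.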
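The "average cost over the different combinations" can be made concrete with a small sketch: the score for placing a column of one alignment opposite a column of another is the mean of the letter-pair scores over all combinations. The match, mismatch and gap values below are arbitrary assumptions for illustration, as is the function name.

```python
def column_pair_score(col_a, col_b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Average letter-vs-letter score over every combination from two columns."""
    scores = []
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                scores.append(0.0)          # assumption: gap against gap is neutral
            elif x == "-" or y == "-":
                scores.append(gap)          # assumption: flat gap cost
            else:
                scores.append(match if x == y else mismatch)
    return sum(scores) / len(scores)

# Column "AA" from one alignment against column "AG" from another:
# A-A, A-G, A-A, A-G gives (1 - 1 + 1 - 1) / 4
print(column_pair_score("AA", "AG"))   # 0.0
print(column_pair_score("A", "A"))     # 1.0
```

With single-letter columns this reduces to the ordinary pairwise scoring, which is why the same dynamic programming recurrence can be reused.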

16

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced.
• A special method is used to set the cost of inserting gaps.

Disadvantages
• Because the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs.
• "Once a gap, always a gap": gaps never disappear.
• Highly sensitive to the gap penalties.

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

17

CLUSTALW algorithm

• We can deduce a pairwise alignment for any two sequences in the multiple alignment (the projected pairwise alignment).

• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences.

• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

Best pairwise alignment (optimal)

Projected pairwise alignment
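A projected pairwise alignment is obtained by taking two rows of the multiple alignment and dropping the columns in which both rows have a gap. A minimal sketch, using the six-sequence example alignment from the earlier slides (the function name `project_pair` is hypothetical):

```python
def project_pair(msa, i, j):
    """Project rows i and j out of an MSA, removing columns that are gaps in both."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
       "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

row_a, row_b = project_pair(msa, 0, 2)
print(row_a)   # SCGPFIRV
print(row_b)   # SCTPHL-A
```

The projection keeps whatever gap placement the multiple alignment imposed, which is why it can differ from the optimal alignment of the same two sequences on their own.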

18

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw

ClustalW at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a: a good MSA

25

Example 1b: making an MSA of distantly related proteins

26

Example 1c: including more distant relatives in the MSA

27

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains.

• In IPNS the metal ion is coordinated by three protein residues.

• IPNS is involved in the biosynthesis of penicillin.

[Figure: IPNS converts ACV to isopenicillin N, with Fe+2, ascorbate and O2, releasing 2 H2O. Active-site scheme: Fe bound by His212, Asp214, His268, the sulfur of ACV, O2 (NO) and H2O.]

28

Research IPNS

• Goal: identify the Fe+2-binding residues

• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation:

1. Obtain sequences (e.g. from NCBI) with the query:

IPNS AND Bacteria[Organism]

2. MSA (clustalw) and search for conserved residues in the MSA
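Searching an MSA for conserved residues amounts to scanning its columns for positions where every sequence carries the same residue. The sketch below shows one way to do this; it runs on the small six-sequence example alignment from the earlier slides rather than on real IPNS data, and the function name and the optional `residues` filter (for example `"HD"` to look only at His and Asp) are assumptions made for illustration.

```python
def conserved_columns(msa, residues=None):
    """Return (position, residue) for columns where all rows share one non-gap residue.

    Positions are 1-based alignment columns. If `residues` is given, only
    conserved columns whose residue is in that set are reported.
    """
    found = []
    for pos, column in enumerate(zip(*msa), start=1):
        first = column[0]
        if first != "-" and all(c == first for c in column):
            if residues is None or first in residues:
                found.append((pos, first))
    return found

msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
       "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

conserved = conserved_columns(msa)
print(conserved)   # [(2, 'S'), (5, 'P')]
```

When the sequences are too similar, nearly every column is reported, which is the "not enough variation" problem the following slides run into.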

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

34

Step 2

Goal: add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
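The selection and export steps can be sketched in code. The example below keeps only hits whose alignment covers most of the query and writes them out in FASTA format. Everything in it is an assumption for illustration: the hit records, their field names, the query length of 338 and the 90% coverage cut-off are invented, and no real BLAST output format is parsed.

```python
def covers_query(hit, query_len, min_coverage=0.9):
    """True if the aligned region spans at least min_coverage of the query."""
    aligned = hit["q_end"] - hit["q_start"] + 1
    return aligned / query_len >= min_coverage

def to_fasta(hits, width=60):
    """Format hit records as FASTA text, wrapping sequences at `width` characters."""
    lines = []
    for hit in hits:
        lines.append(">" + hit["id"])
        seq = hit["seq"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Hypothetical hit records (not real search results)
hits = [
    {"id": "hitA", "q_start": 1, "q_end": 330, "seq": "MSCGPGLRA" * 10},
    {"id": "hitB", "q_start": 120, "q_end": 200, "seq": "SCTPHLA" * 5},
]
query_len = 338   # assumed query length

selected = [h for h in hits if covers_query(h, query_len)]
fasta_text = to_fasta(selected)
print(fasta_text)
```

The resulting text can be saved to a file and passed to the alignment program in the last step.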

35

• New multiple alignment: narrowing down the possibilities

36

Simple multiple alignment

• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation:

1. Obtain an MSA (clustalw)

2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search

3. MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of each column of the alignment.

• Consensus: each position reflects the most common character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T
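Deriving the consensus is a per-column majority vote. A minimal sketch (the function name is hypothetical; ties are broken in favor of the character seen first in the column, which is an arbitrary choice):

```python
from collections import Counter

def consensus(alignment):
    """Most frequent character of each column, joined into one string."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*alignment))

# Alignment taken from the slide
result = consensus(["ATCTTGT", "AACTTGT", "AACTTCT"])
print(result)   # AACTTGT
```

Because only the winning character is kept, alignments with quite different column frequencies can collapse to the same consensus string.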

7

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת (עי הוספת רווחים) כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

8

מאחר שהבעיה החישובית קשה אנו יודעים שכל שיטה שנציע לא תבטיח פתרון אופטימלי נחפש שיטה שנותנת תוצאות טובות

ברב המקרים יתכן שהשיטה שנבחר תהיה תלויה בסיבה שבגללה

אנו מבצעים את ההתאמה

1 Choosing the sequencess2 Scoring metrics3 Approximationheuristic algorithms4 MSA formats5 Interpreting the MSA

The problems we have to answer

9

Scoring metricsbull Distance from Consensus - The consensus of an alignment

is a string of the most common characters in each column of the alignment The total distance between the strings is defined as the number of characters that differ from the consensus character of their column let C be the consensus sequence then the total distance is sum_i D(Si C)

bull Evolutionary Tree Alignment - The weight of the lightest evolutionary tree that can be constructed from the sequences with the weight of the tree defined as the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree summed over all such pairs

bull Sum of Pairs - The sum of pairwise distances between all pairs of sequences Sum_iltj D(Si Sj)

10

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

11

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee)

bull Iterative methods (Dialign)

bull Direct optimization (monte carlo genetic algorithms)

bull Local methods eMotifs Blocks Psi-blast

12

CLUSTALW algorithm

bull Compare all sequence pairs (pairwise alignment)

bull Generate a hierarchy for alignment (guide tree)

bull Build the multiple alignment step by step according to the guide tree first aligning the most similar pair then add another sequence or another pairwise alignment etc

13

CLUSTALW algorithm

(1) Pairwise alignment (prepare a guide tree): 6 pairwise alignments, then cluster analysis

(2) Multiple alignment following the tree from (1): successive alignments

14

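CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences.

2. Build a tree by joining the similar sequences, in order of their similarity.

[Guide tree over: Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree. There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment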
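The first step, a table of edit distances between every pair of sequences, can be sketched with the standard dynamic-programming recurrence. This is a minimal version with unit costs for insertion, deletion and substitution; the helper names are illustrative, and a production tool would derive its distances from scored pairwise alignments instead.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs, using two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]

def distance_table(seqs):
    """Full symmetric table of edit distances."""
    return [[edit_distance(s, t) for t in seqs] for s in seqs]

seqs = ["SCGPFIRV", "MSCGPGLRA", "SCTPHLA"]
table = distance_table(seqs)
for name, row in zip(seqs, table):
    print(name.ljust(10), row)
```

The table is symmetric with zeros on the diagonal, so only n(n-1)/2 entries actually need to be computed.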

15

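Two alignments (or a sequence and an alignment) can be aligned to each other by a fairly natural extension of the dynamic programming method. Letter insertions are handled in the usual way. The cost of matching letters is the average cost over the different combinations. A special cost must be assigned to gaps.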
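The "average cost over the different combinations" can be made concrete with a small sketch. It scores one column of the first alignment against one column of the second by averaging a letter-to-letter cost over every cross pair. The cost values (mismatch 1, gap 2, gap against gap 0) are assumptions chosen for the illustration, and the sequence weighting used by real tools is left out.

```python
from itertools import product

def simple_cost(x, y, mismatch=1.0, gap=2.0):
    """Toy letter-to-letter cost; identical symbols (including two gaps) cost 0."""
    if x == y:
        return 0.0
    if "-" in (x, y):
        return gap
    return mismatch

def column_cost(col_a, col_b, cost=simple_cost):
    """Average cost over all pairs (one letter from each column)."""
    pairs = list(product(col_a, col_b))
    return sum(cost(x, y) for x, y in pairs) / len(pairs)

# A column from a 2-row alignment against a column from a 3-row alignment
example = column_cost("AA", "AC-")
print(example)
```

This value takes the place of the single substitution score in the usual pairwise dynamic-programming table when either side is an alignment and not a single sequence.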

16

CLUSTALW algorithm

Subtle points
• Sequences are given different weights, so that when several sequences are very similar, the relative weight of each one decreases.
• A special method is used to set the gap insertion cost.

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs.
• "Once a gap, always a gap": gaps do not disappear.
• High sensitivity to the gap penalty.

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

17

CLUSTALW algorithm

• We can deduce a pairwise alignment for any two sequences in the multiple alignment (the projected pairwise alignment).

• The projected pairwise alignment is not necessarily the best pairwise alignment for the two sequences.

• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]
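Projecting a pairwise alignment out of a multiple alignment amounts to taking two rows and dropping the columns in which both rows hold a gap. The sketch below assumes "-" as the gap character; `project_pair` is an illustrative name.

```python
def project_pair(msa, i, j, gap="-"):
    """Rows i and j of the MSA with all gap-only columns removed."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == gap and b == gap)]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j

msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
       "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
first, third = project_pair(msa, 0, 2)
print(first)
print(third)
```

The projected rows still carry gaps that were forced by the other sequences, which is why the projection can score worse than an alignment computed for the two sequences alone.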

18

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw

ClustalW at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a: a good MSA

25

Example 1b: making an MSA of distantly related proteins

26

Example 1c: including more distant relatives in the MSA

27

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains.

• In IPNS the metal ion is coordinated by three protein residues.

• IPNS is involved in the biosynthesis of penicillin.

[Figure: reaction scheme, ACV to isopenicillin N (Fe2+, ascorbate, O2; 2 H2O), and the iron centre with ligands His212, Asp214 and His268, the ACV sulfur, O2 (NO) and H2O]

28

Research IPNS

• Goal: identify the Fe2+-binding residues

• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1. Obtain sequences (e.g. from NCBI) with the query:

IPNS AND Bacteria[Organism]

2. Run an MSA (ClustalW) and search for conserved residues in the MSA
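Searching an alignment for conserved residues can be sketched as a scan for columns in which every row carries the same non-gap character. The rows below are the short example alignment from earlier in the lecture, not IPNS sequences, and `conserved_columns` is an illustrative name.

```python
def conserved_columns(msa, gap="-"):
    """Return (1-based position, residue) for every fully conserved column."""
    hits = []
    for pos, column in enumerate(zip(*msa), start=1):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            hits.append((pos, column[0]))
    return hits

msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
       "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
hits = conserved_columns(msa)
print(hits)
```

When the input sequences are all very similar, almost every column passes this test, which is the "not enough variation" problem the following slides run into.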

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

34

Step 2

Goal: add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

35

• New multiple alignment: narrowing down the possibilities

36

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1. Obtain an MSA (ClustalW)

2. Either
- construct a consensus sequence and perform a new search, or
- construct a profile and perform a new search

3. Run an MSA (ClustalW) and search for conserved residues in the MSA

38

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.

• Consensus: each position reflects the most common character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T
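Deriving the consensus is a per-column majority vote. In this minimal sketch, ties are broken in favour of the character that appears first in the column, which is an arbitrary choice made for the example.

```python
from collections import Counter

def consensus(msa):
    """Most frequent character of each column, joined into one string."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*msa))

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
result = consensus(msa)
print(result)
```

Note that the consensus keeps only the winner of each column; how strongly it won is lost, which is what a profile adds.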

39

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.

• Profile: each position reflects the frequency of each character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Position   1     2     3     4     5     6     7
A          1     0.67  0     0     0     0     0
T          0     0.33  0     1     1     0     1
C          0     0     1     0     0     0.33  0
G          0     0     0     0     0     0.67  0
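A frequency profile can be computed directly from the alignment columns. The sketch below assumes a DNA alphabet and rounds to two decimals for display; `profile` is an illustrative name, and real profiles usually add pseudocounts and sequence weights that are omitted here.

```python
from collections import Counter

def profile(msa, alphabet="ATCG"):
    """Map each symbol to its list of per-column frequencies."""
    n = len(msa)
    columns = [Counter(column) for column in zip(*msa)]
    return {symbol: [round(counts[symbol] / n, 2) for counts in columns]
            for symbol in alphabet}

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
prof = profile(msa)
for symbol, row in prof.items():
    print(symbol, row)
```

Each column of the result sums to 1 (up to rounding) as long as the alignment contains no characters outside the chosen alphabet.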

40

Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus (both):
A A C T T G T

41

Profile vs Consensus

• But have a different profile

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Position   1     2     3     4     5     6     7
A          0.67  1     0     0     0     0     0
T          0     0     0     1     0.67  0     0.67
C          0.33  0     0.67  0     0.33  0.33  0.33
G          0     0     0.33  0     0     0.67  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Position   1     2     3     4     5     6     7
A          1     1     0     0     0     0     0
T          0     0     0     1     1     0     1
C          0     0     1     0     0     0.33  0
G          0     0     0     0     0     0.67  0

42

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
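The letter heights in a sequence logo are usually derived from the information content of each column. A rough sketch for a DNA alignment follows: the maximum possible information (2 bits for four letters) minus the Shannon entropy of the column. It ignores gaps and applies no small-sample correction, so it is an approximation of the idea and not a reimplementation of any logo tool.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4, gap="-"):
    """Bits of information in one column: log2(alphabet size) - entropy."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((k / total) * math.log2(k / total)
                   for k in Counter(residues).values())
    return math.log2(alphabet_size) - entropy

msa = ["AACCGCTCTT", "AGCCGCGC-T", "A-CAGAGCCT",
       "AAGCACGC-T", "ACGGGTGCTT", "ATGC-CGC-T"]
bits = [information_content(column) for column in zip(*msa)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

Fully conserved columns reach the maximum of 2 bits, while a column split evenly between two letters carries 1 bit.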

43

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results
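The iterate-and-rebuild idea behind this flow can be shown with a toy model. Everything here is a deliberate simplification and an assumption for the sake of illustration: sequences are ungapped and of equal length, the "database" is a Python list, and a hit is any sequence whose mean profile frequency reaches a fixed threshold. The real program uses gapped local alignments, log-odds position-specific scoring matrices and E-value cut-offs.

```python
from collections import Counter

def build_profile(seqs):
    """Per-column residue frequencies for equal-length, ungapped sequences."""
    n = len(seqs)
    return [{res: k / n for res, k in Counter(column).items()}
            for column in zip(*seqs)]

def score(profile, seq):
    """Mean frequency that the profile assigns to the residues of seq."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq)) / len(seq)

def iterated_search(query, database, threshold=0.6, max_rounds=5):
    """Repeat: build a profile from the current hits, then search with it."""
    hits = [query]
    for _ in range(max_rounds):
        prof = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if s != query and score(prof, s) >= threshold]
        if new_hits == hits:   # converged: no change in the hit list
            break
        hits = new_hits
    return hits

database = ["ACGA", "ACTA", "TTTT"]
found = iterated_search("ACGT", database)
print(found)
```

In this toy run the second database entry is too far from the query to be found in the first round, but it is picked up once the first hit has broadened the profile, which is how the iteration reaches more distant relatives.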

44

• Alignment with distantly related proteins

45

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe2+ in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
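The last column of the table is the catalytic efficiency, kcat divided by Km. A small worked check follows, taking Km = 0.4 mM and kcat = 388 min-1 for the wild type and Km = 0.56 mM and kcat = 75 min-1 for His48Ala; the function name is illustrative.

```python
def catalytic_efficiency(kcat_per_min, km_mM):
    """kcat / Km in mM^-1 min^-1."""
    if km_mM <= 0:
        raise ValueError("Km must be positive")
    return kcat_per_min / km_mM

wild_type = catalytic_efficiency(388, 0.4)
his48ala = catalytic_efficiency(75, 0.56)
print(round(wild_type), round(his48ala))
```

The computed values agree with the tabulated ones to within rounding of the inputs. For the three mutants with no measurable activity, Km and kcat are not determined, so no efficiency can be computed.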

46

IPNS

47

48





11

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee)

bull Iterative methods (Dialign)

bull Direct optimization (monte carlo genetic algorithms)

bull Local methods eMotifs Blocks Psi-blast

12

CLUSTALW algorithm

bull Compare all sequence pairs (pairwise alignment)

bull Generate a hierarchy for alignment (guide tree)

bull Build the multiple alignment step by step according to the guide tree first aligning the most similar pair then add another sequence or another pairwise alignment etc

13

CLUSTALW algorithm

(1) Pairwise alignment (prepare a guide tree): 6 pairwise alignments, then cluster analysis

(2) Multiple alignment following the tree from (1): successive alignments

14

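CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences.
2. Build a tree by joining similar sequences in order of their similarity.

   Tree leaves: Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale

3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment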

15

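Two alignments (or a sequence and an alignment) can be aligned by a fairly natural extension of the dynamic programming method. Letters are added and inserted in the usual way. The cost of matching letters is the average cost over the different combinations. A special cost has to be assigned to a gap.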
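Aligning two alignments needs a score for a pair of columns rather than a pair of letters. A minimal sketch of the averaging idea follows; the match, mismatch and gap values and the function names `pair_score` and `column_score` are illustrative assumptions, not ClustalW's actual scoring scheme.

```python
def pair_score(x, y, match=1.0, mismatch=-1.0, gap=-2.0):
    """Score one pair of symbols; the numeric values are illustrative assumptions."""
    if x == "-" and y == "-":
        return 0.0   # gap against gap costs nothing in this sketch
    if x == "-" or y == "-":
        return gap   # special cost for a letter facing a gap
    return match if x == y else mismatch

def column_score(col_a, col_b):
    """Average pair score over all combinations of symbols from two columns."""
    total = sum(pair_score(x, y) for x in col_a for y in col_b)
    return total / (len(col_a) * len(col_b))

# A column from a 2-sequence alignment against a column from a 3-sequence alignment.
print(column_score("AA", "AAC"))  # 4 matching pairs and 2 mismatching pairs out of 6
print(column_score("A-", "AAA"))  # 3 matching pairs and 3 letter-gap pairs out of 6
```

A dynamic programming routine would use `column_score` wherever a pairwise aligner uses the score of two single letters.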

16

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• "Once a gap, always a gap": gaps do not disappear
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everybody uses it

17

CLUSTALW algorithm

• We can deduce a pairwise alignment for any two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist: the algorithm yields a possible alignment, but not necessarily the best one

Best pairwise alignment (optimal)

Projected pairwise alignment
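Projecting a pairwise alignment out of a multiple alignment is mechanical: take the two rows and drop every column in which both have a gap. A small sketch, where the function name `project_pair` and the toy alignment are hypothetical:

```python
def project_pair(msa, i, j):
    """Extract rows i and j from an MSA and drop columns where both rows have a gap."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j

# Made-up three-sequence alignment, for illustration only.
msa = [
    "AC-GT-A",
    "A--GTCA",
    "ACCGT-A",
]
print(project_pair(msa, 0, 1))  # ('ACGT-A', 'A-GTCA')
```

Comparing the projected rows with an independently computed optimal pairwise alignment of the same two sequences shows whether the multiple alignment has forced a compromise.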

18

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw

ClustalW at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a: a good MSA

25

Example 1b: making an MSA of distantly related proteins

26

Example 1c: including more distant relatives in the MSA

27

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme, ACV to isopenicillin N (labels: Fe+2, ascorbate, O2, 2 H2O); active-site sketch with Fe bound to His212, His268, Asp214, S-ACV, H2O and O2 (NO)]

28

Research IPNS

• Goal: identify the Fe+2-binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1. Obtain sequences (e.g. from NCBI), using the query:

   IPNS AND Bacteria[Organism]

2. Run an MSA (ClustalW) and search for conserved residues in the MSA
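Searching an alignment for conserved residues can be as simple as scanning its columns. The sketch below reports columns in which every sequence carries the same residue; the function name `conserved_columns` is hypothetical and the toy alignment is made up, not real IPNS data.

```python
def conserved_columns(msa):
    """Return (1-based column, residue) where all rows share one non-gap residue."""
    hits = []
    for pos, column in enumerate(zip(*msa), start=1):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            hits.append((pos, column[0]))
    return hits

# Toy alignment (made-up sequences, not real IPNS data).
msa = [
    "MHLDK-HE",
    "MHIDRAHE",
    "LHVDK-HQ",
]
print(conserved_columns(msa))  # [(2, 'H'), (4, 'D'), (7, 'H')]
```

The positions are alignment columns, so they have to be mapped back to residue numbers in each sequence before they can be compared with names such as His212.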

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

34

Step 2

Goal: add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

35

• New multiple alignment: narrowing down the possibilities

36

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active-site residues
• We need to obtain even more distant sequences (distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. Run an MSA (ClustalW) again and search for conserved residues in the MSA

38

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position.

Alignment:
A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T
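Computing a consensus takes only a column-wise majority vote. A minimal sketch using the three sequences from this slide; the function name `consensus` is hypothetical, and ties (none occur here) simply go to the character seen first.

```python
from collections import Counter

def consensus(msa):
    """Most frequent character in each column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(msa))  # AACTTGT
```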

39

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters at each column of the alignment.
• Profile: each position reflects the frequency of each character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

1 2 3 4 5 6
A 1 0.67 0 0
T 0 0.33 1 1
C 0 0 0 0
G 0 0 0 0
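A frequency profile can be computed directly from the alignment. The sketch below does so for all seven columns of the three sequences on this slide; the function name `profile` and the rounding to two decimals are choices made for this illustration.

```python
from collections import Counter

def profile(msa, alphabet="ATCG"):
    """Per-column relative frequency of each character in the alphabet."""
    n = len(msa)
    columns = [Counter(col) for col in zip(*msa)]
    return {ch: [round(c[ch] / n, 2) for c in columns] for ch in alphabet}

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, freqs in profile(msa).items():
    print(ch, freqs)
```

Each list holds one value per alignment column, so column 2 reads 0.67 for A and 0.33 for T, the two characters that actually occur there.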

40

Profile vs Consensus

• The following multiple alignments will have the same consensus:

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T

41

Profile vs Consensus

• But they have different profiles:

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

1 2 3 4 5 6
A 0.66 1 0 0
T 0 0 0 1
C 0.33 0 0.66 0
G 0 0 0.33 0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

1 2 3 4 5 6
A 1 1 0 0
T 0 0 0 1
C 0 0 1 0
G 0 0 0 0

42

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
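In a sequence logo the height of each column stack reflects how conserved the column is, usually measured as information content in bits. The sketch below computes that value for the alignment on this slide under simplifying assumptions: a four-letter DNA alphabet (maximum 2 bits), gaps ignored, and no small-sample correction of the kind logo tools commonly apply. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus Shannon entropy.

    Gap characters are ignored; no small-sample correction is applied.
    """
    residues = [c for c in column if c != "-"]
    total = len(residues)
    if total == 0:
        return 0.0  # a column of gaps carries no information
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

msa = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for pos, col in enumerate(zip(*msa), start=1):
    print(pos, round(column_information(col), 2))
```

Fully conserved columns such as 1, 8 and 10 reach the 2-bit maximum, while column 3 (three C, three G) carries 1 bit.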

43

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search (steps 2 and 3 are iterated)
4. Final results
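The "construct a profile, then search with it" step can be illustrated with a tiny position-specific scoring matrix. This is not PSI-BLAST's implementation, only a sketch of the idea under stated assumptions: a DNA alphabet for brevity, a uniform background frequency of 0.25, a pseudocount of 1, and equal-length ungapped hits. The names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(hits, pseudocount=1.0, background=0.25):
    """Log-odds score per position and letter, from equal-length aligned hits."""
    pssm = []
    for col in zip(*hits):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        pssm.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in ALPHABET
        })
    return pssm

def score(pssm, seq):
    """Sum of position-specific scores for a sequence as long as the profile."""
    return sum(col[ch] for col, ch in zip(pssm, seq))

hits = ["AACTTGC", "AAGTCGT", "CACTTCT"]  # toy "hits", for illustration only
pssm = build_pssm(hits)
print(round(score(pssm, "AACTTGT"), 2))   # resembles the hits: higher score
print(round(score(pssm, "GGGGGGG"), 2))   # unrelated: lower score
```

In an iterated search, sequences scoring above a threshold would be added to `hits` and the matrix rebuilt before the next round.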

44

• Alignment with distantly related proteins

45

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

46

- IPNS

47

48

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Why multiple alignment - example
  • MSA applications
  • Slide 6
  • Slide 7
  • Slide 8
  • Scoring metrics
  • Slide 10
  • MSA algorithms
  • CLUSTALW algorithm
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 25
  • Slide 26
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • Slide 30
  • Slide 31
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 35
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 41
  • Slide 42
  • Psi Blast
  • Slide 44
  • Isopenicillin N Synthase
  • Slide 46
  • Slide 47
  • Slide 48

15

ניתן להתאים בין התאמות (או בין רצף להתאמה) עי הרחבה די טבעית של שיטת התיכנות הדינמי הוספה

והכנסת אותיות באופן הרגיל המחיר להתאמת אותיות הוא מחיר ממוצע לקומבינציות השונות צריך לתת מחיר מיוחד

לרווח

16

נקודות עדינות

נותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bullהמשקל היחסי של כל אחד יקטן

שיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

17

bull We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one

Best Pairwise alignment (optimal)

Projected Pairwise alignment

CLUSTALW algorithm

18

ClustalW at EMBLhttpwwwebiacukclustalw Clustalw at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing Jalview

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a a good MSA

25

Example 1b making MSA of distantly related proteins

26

Example 1c including more distant relatives in the MSA

27

bull Mononuclear iron proteins ndash electron carrier proteins Iron atoms are bound to amino acid side chains

bull In IPNS the metal ion is coordinated by three protein residues

bull IPNS is involved in biosynthesis of penicillin

Example 2 Isopenicillin N Synthase

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

28

Research IPNS

bull Goal Identify Fe+2 binding residues

bull Possible solutions1 In the lab

2 Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1 Obtain sequence (eg for MCBI)

IPNS AND Bacteria[Organism]

2 MSA (clustalw) and search for conserved residues in the MSA

30

31

32

MSA ndash bacteria only

Not enough variation

33

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

34

Step 2

Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblastbullblast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire lengthbullExport in FASTA formatbullRun CLUSTALW

35

bull New multiple alignment narrowing down the possibilities

36

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite

similarbull Not enough variability to categorize the active

sitesbull We need to obtain even more distant sequences

(distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1 Obtain an MSA (clustalw)

2 - Construct a consensus sequence and perform a

new search

OR

- Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

39

Profilebull We can deduce a statistical model describing

the multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

40

Profile vs Consensusbull The following multiple alignments will

have the same consensus

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

41

Profile vs Consensusbull But have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

42

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
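In a sequence logo the height of each column reflects how conserved it is. One common way to quantify this is the information content in bits, log2(alphabet size) minus the column's entropy. The sketch below assumes that measure, a four-letter DNA alphabet, and that gap characters are simply ignored; the function name is hypothetical, and real logo tools may also apply a small-sample correction that is omitted here.

```python
import math

def column_information(column, alphabet_size=4, gap="-"):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    entropy = 0.0
    for ch in set(residues):
        p = residues.count(ch) / len(residues)
        entropy -= p * math.log2(p)
    return math.log2(alphabet_size) - entropy

# The six aligned sequences from the slide.
alignment = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for pos, col in enumerate(zip(*alignment), start=1):
    print(pos, f"{column_information(col):.2f}")
```

Fully conserved columns (such as the first and last) reach the maximum of 2 bits, while mixed columns score lower.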

43

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST search with the profile (steps 2 and 3 can be repeated)
4. Final results
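The iterate-and-rebuild idea can be illustrated with a toy example. The sketch below is not PSI-BLAST itself: it assumes ungapped DNA sequences of equal length, a uniform background, a pseudocount of 1 and a fixed score threshold, whereas the real program uses local gapped alignments and E-value cutoffs. All function names and the tiny database are hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_profile(seqs, pseudocount=1.0):
    """Per-column character frequencies with a pseudocount."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    return [{ch: (col.count(ch) + pseudocount) / total for ch in ALPHABET}
            for col in zip(*seqs)]

def score(profile, seq):
    """Log-odds score of seq against the profile (uniform background)."""
    background = 1.0 / len(ALPHABET)
    return sum(math.log2(col[ch] / background) for col, ch in zip(profile, seq))

def iterated_search(query, database, threshold, max_rounds=5):
    """Search, rebuild the profile from the hits, and search again."""
    hits = [query]
    for _ in range(max_rounds):
        prof = build_profile(hits)
        new_hits = [query] + [s for s in database if score(prof, s) >= threshold]
        if set(new_hits) == set(hits):  # no change: the search has converged
            break
        hits = new_hits
    return hits

database = ["AACTTCT", "AAGTTCT", "GGGCCCA"]
print(iterated_search("AACTTGT", database, threshold=3.0))
```

In this toy run the second database sequence scores below the threshold against the query alone, but is picked up once the profile has been rebuilt from the first round's hit, which is the point of iterating.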

44

• Alignment with distantly related proteins

45

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
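The last column of the table is the ratio of the two columns before it, so it can be recomputed as a quick check. The sketch below assumes the Km values read 0.4, 0.56, 1.0 and so on (the decimal points are an interpretation of the extracted slide text) and skips the three mutants listed as "nd".

```python
# (enzyme, Km in mM, kcat in min^-1, kcat/Km as listed in the table)
rows = [
    ("Wild type", 0.40, 388, 969),
    ("His48Ala", 0.56, 75, 134),
    ("His63Ala", 1.00, 142, 142),
    ("His114Ala", 0.85, 125, 147),
    ("His124Ala", 0.84, 321, 381),
    ("His135Ala", 0.59, 117, 198),
    ("Asp14Ala", 0.86, 0.56, 0.7),
    ("Asp113Ala", 0.45, 238, 528),
    ("Asp131Ala", 0.48, 363, 755),
    ("Asp203Ala", 0.91, 123, 135),
]

def catalytic_efficiency(km, kcat):
    """kcat/Km in mM^-1 min^-1."""
    return kcat / km

for name, km, kcat, listed in rows:
    computed = catalytic_efficiency(km, kcat)
    print(f"{name:10s} computed {computed:8.2f}   listed {listed}")
```

Under that reading the recomputed ratios agree with the listed values to within rounding, which supports the placement of the decimal points.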

46

- IPNS

47

48

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Why multiple alignment - example
  • MSA applications
  • Slide 6
  • Slide 7
  • Slide 8
  • Scoring metrics
  • Slide 10
  • MSA algorithms
  • CLUSTALW algorithm
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 25
  • Slide 26
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • Slide 30
  • Slide 31
  • MSA - bacteria only
  • MSA - bacteria & fungi
  • Step 2
  • Slide 35
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 41
  • Slide 42
  • Psi Blast
  • Slide 44
  • Isopenicillin N Synthase
  • Slide 46
  • Slide 47
  • Slide 48

16

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that if there are several very similar sequences, the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends very heavily on the quality of the alignment of the first pairs
• Gaps do not disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

17

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)

• The projected pairwise alignment is not necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best pairwise alignment (optimal) compared with the projected pairwise alignment]
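A projected pairwise alignment is obtained by taking two rows of the multiple alignment and dropping the columns in which both rows have a gap. The sketch below shows this on a small made-up alignment; the function name `project_pair` and the example sequences are hypothetical.

```python
def project_pair(msa, i, j, gap="-"):
    """Rows i and j of an MSA, without the columns where both have a gap."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

# A made-up three-sequence alignment, for illustration only.
msa = [
    "AC-GT-A",
    "AC-GTTA",
    "ACCG--A",
]
first, second = project_pair(msa, 0, 1)
print(first)
print(second)
```

The projected pair keeps the gaps that the multiple alignment forced on it, which is why it can differ from the optimal alignment of the same two sequences computed on their own.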

18

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw

ClustalW at the SRS site at EBI

19

ClustalW Output Aln format

20

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a: a good MSA

25

Example 1b: making an MSA of distantly related proteins

26

Example 1c: including more distant relatives in the MSA

27

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains

• In IPNS the metal ion is coordinated by three protein residues

• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme, ACV to isopenicillin N (Fe+2, ascorbate, O2; 2 H2O), and the iron site with the labels His268, His212, Asp214, S-ACV, O2 (NO) and H2O]

28

Research IPNS

• Goal: identify the Fe+2 binding residues

• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1. Obtain the sequences (e.g. from NCBI), using the query:

IPNS AND Bacteria[Organism]

2. MSA (ClustalW) and search for conserved residues in the MSA
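Once an alignment is available, searching for conserved residues amounts to scanning its columns. The sketch below is one simple way to do that, assuming a column counts as conserved when a single non-gap residue reaches a chosen fraction of the sequences; the function name, the threshold parameter and the short example sequences are hypothetical and are not IPNS data.

```python
from collections import Counter

def conserved_columns(msa, min_fraction=1.0, gap="-"):
    """Return (1-based position, residue) for each conserved column."""
    result = []
    for pos, col in enumerate(zip(*msa), start=1):
        residue, count = Counter(col).most_common(1)[0]
        if residue != gap and count / len(col) >= min_fraction:
            result.append((pos, residue))
    return result

# A made-up protein alignment, for illustration only.
msa = ["MHKDL", "MHRDV", "LHKDI"]
print(conserved_columns(msa))
```

When the aligned sequences are all very similar, nearly every column passes this test, which is why more distant sequences are needed to narrow down the candidate residues.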

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

34

Step 2

Goal: add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW

21

MSA formats - fasta

22

MSA formats - Aln

23

MSA formats - MSF

24

Example 1a a good MSA

25

Example 1b making MSA of distantly related proteins

26

Example 1c including more distant relatives in the MSA

27

bull Mononuclear iron proteins ndash electron carrier proteins Iron atoms are bound to amino acid side chains

bull In IPNS the metal ion is coordinated by three protein residues

bull IPNS is involved in biosynthesis of penicillin

Example 2 Isopenicillin N Synthase

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

28

Research IPNS

bull Goal Identify Fe+2 binding residues

bull Possible solutions1 In the lab

2 Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1 Obtain sequence (eg for MCBI)

IPNS AND Bacteria[Organism]

2 MSA (clustalw) and search for conserved residues in the MSA

30

31

32

MSA ndash bacteria only

Not enough variation

33

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

34

Step 2

Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblastbullblast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire lengthbullExport in FASTA formatbullRun CLUSTALW

35

bull New multiple alignment narrowing down the possibilities

36

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite

similarbull Not enough variability to categorize the active

sitesbull We need to obtain even more distant sequences

(distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1 Obtain an MSA (clustalw)

2 - Construct a consensus sequence and perform a

new search

OR

- Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

39

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column.

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

  1 2 3 4 5 6
A 1 0.67 0 0
T 0 0.33 1 1
C 0 0 0 0
G 0 0 0 0
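A frequency profile of this kind can be computed directly from the three sequences shown. The sketch below is one simple way to do it, assuming plain relative frequencies with no pseudocounts or sequence weighting; the function name is illustrative.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Per-column relative frequencies, one dict per alignment column."""
    n = len(alignment)
    result = []
    for col in zip(*alignment):
        counts = Counter(col)
        result.append({ch: counts[ch] / n for ch in alphabet})
    return result


msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for position, freqs in enumerate(profile(msa), start=1):
    print(position, {ch: round(f, 2) for ch, f in freqs.items()})
```

Each printed line corresponds to one alignment column, so the loop prints one entry for every position in the sequences.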

40

Profile vs Consensus

• The following multiple alignments will have the same consensus

A A C T T G C
A A G T C G T
C A C T T C T

A A C T T G T
A A C T T G T
A A C T T C T

A A C T T G T   (consensus)

41

Profile vs Consensus

• But have a different profile

A A C T T G C
A A G T C G T
C A C T T C T

  1 2 3 4 5 6
A 0.66 1 0 0
T 0 0 0 1
C 0.33 0 0.66 0
G 0 0 0.33 0

A A C T T G T
A A C T T G T
A A C T T C T

  1 2 3 4 5 6
A 1 1 0 0
T 0 0 0 1
C 0 0 1 0
G 0 0 0 0
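The point of these two slides can be checked with a short script: the two example alignments give the same consensus but different profiles. This is a sketch with illustrative function names, using plain relative frequencies and first-seen tie-breaking as assumptions.

```python
from collections import Counter


def column_counts(alignment):
    """One Counter of characters per alignment column."""
    return [Counter(col) for col in zip(*alignment)]


def consensus(alignment):
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def profile(alignment, alphabet="ATCG"):
    n = len(alignment)
    return [{ch: c[ch] / n for ch in alphabet} for c in column_counts(alignment)]


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))   # identical consensus strings
print(profile(first) == profile(second))     # the profiles differ
```

The consensus discards how strongly each column agrees, which is why a profile is the more sensitive query for finding distant homologs.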

42

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
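In a sequence logo the height of each column is its information content in bits. The sketch below computes that quantity for the alignment above as log2(4) minus the Shannon entropy of the column. It is a simplified illustration: gaps are ignored and no small-sample correction is applied, both of which are assumptions rather than a description of what WebLogo itself does.

```python
import math
from collections import Counter


def column_information(alignment, alphabet="ACGT"):
    """Information content (bits) per column: log2(|alphabet|) - entropy."""
    max_bits = math.log2(len(alphabet))
    bits = []
    for col in zip(*alignment):
        counts = Counter(ch for ch in col if ch in alphabet)  # skip gaps
        total = sum(counts.values())
        if total == 0:
            bits.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(max_bits - entropy)
    return bits


msa = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for position, b in enumerate(column_information(msa), start=1):
    print(position, round(b, 2))
```

Fully conserved columns (1, 8 and 10 here) reach the 2-bit maximum for DNA, while variable columns score lower.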

43

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

Regular BLAST

Construct profile from BLAST results

BLAST profile search

Final results
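The loop in the flowchart can be imitated on toy data. The sketch below is not BLAST: it works on ungapped, equal-length DNA strings, scores them against a log-odds profile with a uniform background, and uses a profile built from the query alone in place of the initial regular search. The pseudocount, the threshold of 3.0 and the toy database are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds score per column and character against a uniform background."""
    denom = len(seqs) + pseudocount * len(ALPHABET)
    background = 1 / len(ALPHABET)
    pssm = []
    for col in zip(*seqs):
        counts = Counter(col)
        pssm.append({
            ch: math.log2(((counts[ch] + pseudocount) / denom) / background)
            for ch in ALPHABET
        })
    return pssm


def score(pssm, seq):
    return sum(col[ch] for col, ch in zip(pssm, seq))


def iterated_search(query, database, threshold, max_rounds=5):
    """Toy loop: build profile from current hits, rescan database, repeat."""
    hits = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(hits)
        new_hits = [query] + [s for s in database if score(pssm, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # no change in the hit set: converged
        hits = new_hits
    return hits


query = "AACTTGT"
database = ["AACTTCT", "AAGTCGT", "GGGGGGG"]  # hypothetical toy database
print(iterated_search(query, database, threshold=3.0, max_rounds=1))
print(iterated_search(query, database, threshold=3.0))
```

With a single round only the closest sequence passes the threshold; once it is folded into the profile, the more distant AAGTCGT is also found, which is the effect the iteration is meant to achieve.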

44

• Alignment with distantly related proteins

45

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme       Relative    Km      kcat       kcat/Km
             activity    (mM)    (min-1)    (mM-1 min-1)
Wild type    100         0.4     388        969
His48Ala     16          0.56    75         134
His63Ala     31          1.0     142        142
His114Ala    28          0.85    125        147
His124Ala    48          0.84    321        381
His135Ala    22          0.59    117        198
His212Ala    <0.007      nd      nd
His268Ala    <0.003      nd      nd
Asp14Ala     5           0.86    0.56       0.7
Asp113Ala    63          0.45    238        528
Asp131Ala    68          0.48    363        755
Asp203Ala    32          0.91    123        135
Asp214Ala    <0.004      nd      nd
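The reasoning behind the table can be made explicit: mutants whose relative activity is essentially abolished point to the metal-binding residues. The sketch below uses the relative activity values as listed, storing the "<" entries as upper bounds; the cutoff of 1.0 and the function name are assumptions chosen for illustration.

```python
# Relative activity per mutant, as listed in the table.
# Entries reported as "<x" are stored as the upper bound x.
relative_activity = {
    "His48Ala": 16, "His63Ala": 31, "His114Ala": 28, "His124Ala": 48,
    "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def abolished(activities, cutoff=1.0):
    """Mutants whose relative activity falls below `cutoff` (an assumed threshold)."""
    return sorted(name for name, value in activities.items() if value < cutoff)


print(abolished(relative_activity))
```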

Isopenicillin N Synthase

46

– IPNS

47

48


28

Research IPNS

• Goal: Identify Fe²⁺-binding residues

• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

29

Step 1

Multiple alignment of known IPNS

Implementation

1. Obtain sequences (e.g., from NCBI)

   IPNS AND Bacteria[Organism]

2. Build an MSA (ClustalW) and search it for conserved residues
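A minimal sketch of the conserved-residue search in step 2, assuming the MSA has already been read into a list of equal-length strings. The function name `conserved_columns` and the toy alignment are illustrative only; they are not real IPNS data and not part of ClustalW.

```python
def conserved_columns(alignment, gap="-"):
    """Return (1-based position, residue) for every column in which
    all sequences carry the same non-gap character."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for pos, column in enumerate(zip(*alignment), start=1):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            conserved.append((pos, column[0]))
    return conserved


# Hypothetical toy alignment, for illustration only
toy = ["MHSDKV", "MHTDRV", "LH-DKV"]
print(conserved_columns(toy))  # [(2, 'H'), (4, 'D'), (6, 'V')]
```

With closely related sequences almost every column passes this test, which is why the next slides report "not enough variation".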

30

31

32

MSA - bacteria only

Not enough variation

33

MSA - bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

34

Step 2

Goal: Add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run ClustalW
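One way to express "similar to the query over its entire length" is a query-coverage cutoff. A sketch, assuming each BLAST hit has been summarized as (name, first aligned query position, last aligned query position); the hit names, the query length of 300 and the 0.9 cutoff are all hypothetical.

```python
def full_length_hits(hits, query_length, min_coverage=0.9):
    """Keep hits whose aligned region spans at least `min_coverage`
    of the query. Positions are 1-based and inclusive."""
    selected = []
    for name, q_start, q_end in hits:
        coverage = (q_end - q_start + 1) / query_length
        if coverage >= min_coverage:
            selected.append(name)
    return selected


# Hypothetical hits against a hypothetical 300-residue query
hits = [("hit_A", 1, 295), ("hit_B", 10, 290), ("hit_C", 120, 210)]
print(full_length_hits(hits, query_length=300))  # ['hit_A', 'hit_B']
```

Hits such as `hit_C`, which match only one region of the query, would be left out of the FASTA export.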

35

• New multiple alignment: narrowing down the possibilities

36

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1. Obtain an MSA (ClustalW)

2. Either
   - construct a consensus sequence and perform a new search, or
   - construct a profile and perform a new search

3. Build an MSA (ClustalW) and search it for conserved residues

38

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of each column of the alignment.

• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T
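The consensus rule can be sketched in a few lines, assuming an ungapped alignment held as equal-length strings. The function name is illustrative. Ties are broken by whichever character appears first in the column, which is a choice made here and not something the slide specifies.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent character in each column of the alignment."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*alignment)
    )


print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT"]))  # AACTTGT
```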

39

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in each column of the alignment.

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     0.67  0     0     0     0
T    0     0.33  0     1     1     0
C    0     0     1     0     0     0.33
G    0     0     0     0     0     0.67
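A sketch of how such a frequency profile can be computed from the three sequences shown, assuming an ungapped alignment and plain relative frequencies (no pseudocounts or sequence weighting). The function name and the two-decimal rounding are choices made for this example.

```python
def profile(alignment, alphabet="ATCG"):
    """Per-column character frequencies of an ungapped alignment.

    Returns {character: [frequency in column 1, column 2, ...]},
    rounded to two decimals.
    """
    n_seqs = len(alignment)
    return {
        ch: [round(col.count(ch) / n_seqs, 2) for col in zip(*alignment)]
        for ch in alphabet
    }


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, freqs in profile(alignment).items():
    print(ch, freqs)
```

The result covers all seven alignment columns; the table on the slide lists only the first six.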

40

Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T

41

Profile vs Consensus

• But have a different profile

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6
A    0.67  1     0     0     0     0
T    0     0     0     1     0.67  0
C    0.33  0     0.67  0     0.33  0.33
G    0     0     0.33  0     0     0.67

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     1     0     0     0     0
T    0     0     0     1     1     0
C    0     0     1     0     0     0.33
G    0     0     0     0     0     0.67
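The point of these two slides can be checked directly: both alignments collapse to one consensus, yet their column compositions differ. A small sketch using per-column counts; the helper names are illustrative.

```python
from collections import Counter


def column_counts(alignment):
    """One Counter of characters per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def differing_columns(aln_a, aln_b):
    """1-based columns whose character counts differ between alignments."""
    pairs = zip(column_counts(aln_a), column_counts(aln_b))
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(aln1), consensus(aln2))  # AACTTGT AACTTGT
print(differing_columns(aln1, aln2))     # [1, 3, 5, 7]
```

A search with the consensus alone cannot tell these alignments apart, while a profile search would weight columns 1, 3, 5 and 7 differently.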

42

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
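In a sequence logo the height of each stack is usually the information content of the column in bits. A simplified sketch for DNA, assuming a uniform background over four letters, ignoring gaps, and omitting the small-sample correction that logo tools typically apply; the function name is illustrative.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4, gap="-"):
    """Bits of information in one alignment column:
    log2(alphabet size) minus the Shannon entropy of the column."""
    residues = [ch for ch in column if ch != gap]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


rows = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for pos, column in enumerate(zip(*rows), start=1):
    print(pos, round(information_content(column), 2))
```

Fully conserved columns (1, 8 and 10 here) reach the maximum of 2 bits, while column 3, split evenly between C and G, carries 1 bit.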

43

PSI-BLAST

• Position-Specific Iterated BLAST - automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST search with the profile
4. Final results
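The search-build-search loop can be illustrated with a toy model. This is not the PSI-BLAST algorithm itself, which works on proteins with gapped alignments, sequence weighting and E-value thresholds. The sketch assumes ungapped DNA strings of equal length, a uniform background, a pseudocount of 1 and a fixed score threshold, all chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds score per character and column, from equal-length sequences."""
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*seqs):
        pssm.append({
            ch: math.log2(((column.count(ch) + pseudocount) / total) / background)
            for ch in ALPHABET
        })
    return pssm


def score(pssm, seq):
    """Sum of position-specific scores for one candidate sequence."""
    return sum(col[ch] for col, ch in zip(pssm, seq))


def iterated_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the current hits and search again,
    until a round adds no new sequences."""
    hits = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(hits)
        new_hits = [
            s for s in database
            if s not in hits and score(pssm, s) >= threshold
        ]
        if not new_hits:
            break
        hits.extend(new_hits)
    return hits


database = ["AACTTCT", "ATCTTCT", "GGGGGGG"]
print(iterated_search("AACTTGT", database, threshold=3.0))
```

In this toy run the first round finds only `AACTTCT`; once that hit is folded into the profile, the more distant `ATCTTCT` scores above the threshold in the second round. This is the effect the case study relies on to reach distant homologs.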

44

• Alignment with distantly related proteins

45

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe²⁺ in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min⁻¹)   kcat/Km (mM⁻¹ min⁻¹)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
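The conclusion can be read off the relative-activity column of the table. A small sketch, assuming the "lt" entries are upper bounds (for example <0.007) and storing them as that bound; the cutoff of 1 for "essentially inactive" and the function name are assumptions made for this example.

```python
# Relative activities as read from the table; entries reported as an
# upper bound (e.g. <0.007) are stored as that bound.
relative_activity = {
    "His48Ala": 16, "His63Ala": 31, "His114Ala": 28, "His124Ala": 48,
    "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def inactive_mutants(activities, cutoff=1.0):
    """Mutants whose relative activity falls below `cutoff`,
    ordered by residue number (e.g. 'His212Ala' -> 212)."""
    names = [name for name, value in activities.items() if value < cutoff]
    return sorted(names, key=lambda name: int(name[3:-3]))


print(inactive_mutants(relative_activity))
# ['His212Ala', 'Asp214Ala', 'His268Ala']
```

Only the three substitutions at His212, Asp214 and His268 abolish activity, matching the residues named as Fe²⁺ ligands.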

46

- IPNS

47

48

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Why multiple alignment - example
  • MSA applications
  • Slide 6
  • Slide 7
  • Slide 8
  • Scoring metrics
  • Slide 10
  • MSA algorithms
  • CLUSTALW algorithm
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 25
  • Slide 26
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • Slide 30
  • Slide 31
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 35
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 41
  • Slide 42
  • Psi Blast
  • Slide 44
  • Isopenicillin N Synthase
  • Slide 46
  • Slide 47
  • Slide 48

35

bull New multiple alignment narrowing down the possibilities

36

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite

similarbull Not enough variability to categorize the active

sitesbull We need to obtain even more distant sequences

(distant homologs)

37

Step 3

Using the results of the MSA for further searches

Implementation

1 Obtain an MSA (clustalw)

2 - Construct a consensus sequence and perform a

new search

OR

- Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

38

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

39

Profilebull We can deduce a statistical model describing

the multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

40

Profile vs Consensusbull The following multiple alignments will

have the same consensus

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

41

Profile vs Consensusbull But have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

42

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

43

Psi Blast

bull Position Specific Iterated - automatic profile-like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

44

bull Alignment with distantly related proteins

45

bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme Relative Km kcat kcatKm

Activity (mM) (min-1) (mM-1min

-1)Wild type 100 04 388 969

His48Ala 16 056 75 134

His63Ala 31 10 142 142

His114Ala 28 085 125 147

His124Ala 48 084 321 381

His135Ala 22 059 117 198

His212Ala lt0007 nd nd

His268Ala lt0003 nd nd

Asp14Ala 5 086 056 07

Asp113Ala 63 045 238 528

Asp131Ala 68 048 363 755

Asp203Ala 32 091 123 135

Asp214Ala lt0004 nd nd

Isopenicillin N Synthase

46

– IPNS

47

48

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Why multiple alignment - example
  • MSA applications
  • Slide 6
  • Slide 7
  • Slide 8
  • Scoring metrics
  • Slide 10
  • MSA algorithms
  • CLUSTALW algorithm
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 25
  • Slide 26
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • Slide 30
  • Slide 31
  • MSA – bacteria only
  • MSA – bacteria & fungi
  • Step 2
  • Slide 35
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 41
  • Slide 42
  • Psi Blast
  • Slide 44
  • Isopenicillin N Synthase
  • Slide 46
  • Slide 47
  • Slide 48

36

Simple multiple alignment

• The known IPNS sequences are very similar.
• Sequences of closely related enzymes are also quite similar.
• There is not enough variability to categorize the active sites.
• We need to obtain even more distant sequences (distant homologs).

37

Step 3

Using the results of the MSA for further searches

Implementation

1. Obtain an MSA (ClustalW).

2. Either:
   - Construct a consensus sequence and perform a new search
   OR
   - Construct a profile and perform a new search

3. MSA (ClustalW), then search for conserved residues in the MSA.
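The last step, searching the MSA for conserved residues, can be sketched as a scan over alignment columns. The example below is a minimal illustration on a made-up toy alignment, not real IPNS data. The function name `conserved_columns` and the `min_fraction` parameter are hypothetical; with the default of 1.0 only columns where every sequence carries the same residue are reported.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=1.0):
    """Return (column number, residue) for columns dominated by one residue.

    Column numbers are 1-based. Columns whose most common symbol is a
    gap ('-') are never reported.
    """
    n = len(alignment)
    result = []
    for i, col in enumerate(zip(*alignment), start=1):
        residue, count = Counter(col).most_common(1)[0]
        if residue != "-" and count / n >= min_fraction:
            result.append((i, residue))
    return result


# Toy alignment, invented for illustration only.
toy_msa = [
    "MKHLDAH",
    "MRHLDGH",
    "MKHIDAH",
]
print(conserved_columns(toy_msa))  # [(1, 'M'), (3, 'H'), (5, 'D'), (7, 'H')]
```

Lowering `min_fraction` (for example to 0.6) also reports columns that are conserved in most but not all sequences, which can be useful once distant homologs are included.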
