Comparing Protein Sequences Tutorial 4

Comparing Protein Sequences

• Substitution Matrices
– PAM - Point Accepted Mutations
– BLOSUM - Blocks Substitution Matrix

• Advanced comparison tools
– PSI-BLAST
– PHI-BLAST

Substitution Matrix

• Scoring matrix S
– 20x20 for protein alignment (one row and one column per amino acid)

• Si,j represents the gain/penalty for substituting AAj by AAi (i – row, j – column)
– Based on the likelihood that this substitution is found in nature
– Computed differently in PAM and BLOSUM
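To make the role of the matrix concrete, here is a minimal sketch of scoring an ungapped alignment by summing one matrix entry per aligned pair. The names `TOY_SCORES`, `substitution_score` and `score_ungapped` are hypothetical, and the scores cover only four residues and are illustrative; a real search would use a full published 20x20 matrix.

```python
# Illustrative scores for a 4-residue subset of the amino-acid alphabet.
# Assumption: these values are chosen for the sketch only; use a published
# PAM or BLOSUM matrix for real work.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2, ("D", "E"): 2,          # similar residues: positive score
    ("L", "D"): -4, ("L", "E"): -3,        # dissimilar residues: penalty
    ("I", "D"): -3, ("I", "E"): -3,
}


def substitution_score(a, b, scores=TOY_SCORES):
    """Look up the score for the pair (a, b); the matrix is symmetric."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def score_ungapped(seq1, seq2, scores=TOY_SCORES):
    """Sum the substitution scores over all aligned positions (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(substitution_score(a, b, scores) for a, b in zip(seq1, seq2))


print(score_ungapped("LIDE", "LIDE"))  # identical residues everywhere
print(score_ungapped("LIDE", "ILED"))  # similar residues swapped
print(score_ungapped("LIDE", "DELI"))  # dissimilar residues aligned
```

Identical residues contribute the largest scores, similar residues smaller positive scores, and dissimilar residues penalties, which is the behaviour the slide describes.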

Computing probability of Mutation (Mi,j)

• PAM - Point Accepted Mutations
– Based on closely related proteins (X% divergence)
– Matrices for comparing divergent proteins are computed by extrapolation

• BLOSUM - Blocks Substitution Matrix
– Based on conserved blocks in which sequences at least X% identical are grouped together
– Matrices for divergent proteins are derived using an appropriate X%

PAM-1

• Captures mutation rates between closely related proteins
– 1% divergence
– Mi,j = #(A→B) / #A: the number of observed A-to-B substitutions divided by the number of occurrences of A

• Problematic when comparing distantly related proteins
– The 1% divergence does not capture rarer mutations
– PAM250 is theoretical (extrapolation based)
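The extrapolation can be sketched as repeated matrix multiplication: a PAM-n mutation matrix is the PAM-1 mutation matrix applied n times. The example below is a sketch under stated assumptions: `TOY_PAM1` is a made-up, symmetric 3-letter matrix with roughly 1% change per letter (not real Dayhoff data), and `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Assumption: toy mutation probabilities for a 3-letter alphabet.
# Each letter stays unchanged with probability of about 0.99 (1% divergence).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print(["%.3f" % value for value in row])
```

After 250 steps the diagonal entries are much smaller than in the 1-step matrix. None of the off-diagonal values were observed directly; they all follow from the 1% data, which is why PAM250 is described as theoretical.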

BLOSUM

• Captures mutation rates between divergent proteins

• Why is BLOSUM62 called BLOSUM62? Because, within each block, sequences sharing at least 62% identity with ANY other member of that block are clustered and counted as a single sequence.
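The 62% rule can be sketched as single-linkage clustering inside one block: a sequence joins a cluster when it is at least 62% identical to any member. The block below is made-up data, and `percent_identity` and `cluster_block` are hypothetical names; this is an illustration of the grouping step only, not the full BLOSUM construction.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length segments."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, threshold=62.0):
    """Group sequences that are >= threshold% identical to ANY cluster member."""
    clusters = []
    for seq in block:
        merged = [seq]
        remaining = []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= threshold for member in cluster):
                merged.extend(cluster)      # seq links this cluster to the new one
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


# Assumption: a toy ungapped block of four 10-residue segments.
toy_block = ["LIDEADKMLG", "LIDEADKMLA", "LVDEADRFLA", "MKTWYPCHNQ"]

for cluster in cluster_block(toy_block):
    weight = 1.0 / len(cluster)  # each cluster counts as one sequence in total
    print(cluster, "weight per member: %.2f" % weight)
```

The third segment is only 60% identical to the first but 70% identical to the second, so it still ends up in the same cluster; that is the effect of the "ANY other member" wording.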

BLOSUM62

The idea of BLOSUM matrices is to get a better measure of differences between two proteins specifically for more distantly related proteins.

Similar amino acids receive high scores.

PAM & BLOSUM

• PAM matrices are based on global alignments of closely related proteins.

• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.

• Other PAM matrices are extrapolated from PAM1.

• BLOSUM matrices are based on local alignments.

• BLOSUM62 is a matrix calculated from aligned blocks in which sequences with at least 62% identity are clustered and counted as one.

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

PAM100 ~ BLOSUM90 (closely related)
PAM120 ~ BLOSUM80
PAM160 ~ BLOSUM60
PAM200 ~ BLOSUM52
PAM250 ~ BLOSUM45 (highly divergent)

Use Recommendations

Query length   Matrix     Gap costs
<35            PAM30      9,1
35-50          PAM70      10,1
50-85          BLOSUM80   10,1
>85            BLOSUM62   11,1
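The recommendations can be written as a small lookup. This is a sketch: `recommend_scoring` is a hypothetical helper, the gap costs are read as (existence, extension), and because the ranges share the boundaries 50 and 85, assigning a boundary value to the shorter range is an assumption.

```python
def recommend_scoring(query_length):
    """Return (matrix, gap_existence, gap_extension) for a query length."""
    if query_length < 35:
        return ("PAM30", 9, 1)
    if query_length <= 50:      # assumption: 50 falls in the 35-50 range
        return ("PAM70", 10, 1)
    if query_length <= 85:      # assumption: 85 falls in the 50-85 range
        return ("BLOSUM80", 10, 1)
    return ("BLOSUM62", 11, 1)


for length in (20, 40, 70, 162):
    print(length, recommend_scoring(length))
```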

Example

• Query: >ADRM1_HUMAN (a glycosylated plasma membrane protein which promotes cell adhesion)

• Database: nr, restricted to human sequences
• BLAST program: BLASTP
• Matrices: PAM30, BLOSUM45

What difference do we observe?

PAM30 vs. BLOSUM45

• With BLOSUM45 we found related and divergent sequences.

• With PAM30 we found only related sequences.

[Figure: BLAST results with PAM30 and with BLOSUM45]

With BLOSUM45 we can discover interesting relations between proteins

Mucin-13: a glycosylated membrane protein that protects the cell by binding to pathogens

Using different scoring matrices can produce slightly different alignments:

[Figure: the alignment obtained with PAM30 and with BLOSUM45]

The same alignment can be resolved in several ways, especially when using a matrix for highly divergent sequences (BLOSUM45):

PSI-BLAST

Position Specific Iterative BLAST

We will analyze the following archaeal uncharacterized protein:

>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS

Threshold for initial BLAST search (default: 10)

Threshold for inclusion in PSI-BLAST iterations (default: 0.005)
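One PSI-BLAST round can be sketched in two steps: keep the hits whose E-value passes the inclusion threshold, then turn their aligned residues into position-specific scores. Everything below is a simplified illustration with made-up hits; `select_hits` and `build_pssm` are hypothetical names, the background is assumed uniform, and the fixed pseudocount stands in for the sequence weighting and pseudocount scheme that the real program uses.

```python
import math
from collections import Counter

INCLUSION_THRESHOLD = 0.005   # default inclusion threshold from the slide
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: made-up aligned segments with made-up E-values.
toy_hits = [("LIDEAD", 1e-30), ("LVDEAD", 2e-12), ("IIDEAD", 4e-4), ("MKTWYP", 3.0)]


def select_hits(hits, threshold=INCLUSION_THRESHOLD):
    """Keep only the segments whose E-value passes the inclusion threshold."""
    return [segment for segment, evalue in hits if evalue <= threshold]


def build_pssm(aligned, pseudocount=1.0):
    """Per-position log-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(segment[pos] for segment in aligned)
        column = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                  for aa in ALPHABET}
        pssm.append(column)
    return pssm


included = select_hits(toy_hits)
pssm = build_pssm(included)
print(included)
print("score for D at position 3: %.2f" % pssm[2]["D"])
print("score for W at position 3: %.2f" % pssm[2]["W"])
```

Unlike a fixed 20x20 matrix, the resulting scores depend on the position: a residue conserved in every included hit scores high at that column only. The next iteration searches with these scores instead of the original matrix.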

The query itself

Orthologous sequences in two other archaeal species

Other homologous sequences

Is MJ0577 a filament protein?

Is MJ0577 a cationic amino acid transporter?

Is MJ0577 a universal stress protein?

PHI-BLAST

Pattern-Hit Initiated BLAST

Pattern symbols

[ ] = groups the amino acids that can occur at a given position

( ) = gives the number of times a residue (or group of residues) is repeated

- = separates positions

x = any amino acid

Making a pattern

[LIVM](2)-D-E-A-D-[RKEN]-x-[LI]

…LIDEADKTT…
…IMDEADEFL…
…LLDEADKCL…
…ILDEADRIL…
…VVDEADNFI…
…LVDEADKGI…
…LMDEADEFL…
…MLDEADRSI…
…LIDEADKML…
…MLDEADNWI…
…LVDEADRFL…

Example

>gi|71154193|sp|P0A9P6|DEAD_ECOLI Cold-shock DEAD box protein A (ATP-dependent RNA helicase deaD)
MAEFETTFADLGLKAPILEALNDLGYEKPSPIQAECIPHLLNGRDVLGMAQTGSGKTAAFSLPLLQNLDP
ELKAPQILVLAPTRELAVQVAEAMTDFSKHMRGVNVVALYGGQRYDVQLRALRQGPQIVVGTPGRLLDHL
KRGTLDLSKLSGLVLDEADEMLRMGFIEDVETIMAQIPEGHQTALFSATMPEAIRRITRRFMKEPQEVRI
QSSVTTRPDISQSYWTVWGMRKNEALVRFLEAEDFDAAIIFVRTKNATLEVAEALERNGYNSAALNGDMN
QALREQTLERLKDGRLDILIATDVAARGLDVERISLVVNYDIPMDSESYVHRIGRTGRAGRAGRALLFVE
NRERRLLRNIERTMKLTIPEVELPNAELLGKRRLEKFAAKVQQQLESSDLDQYRALLSKIQPTAEGEELD
LETLAAALLKMAQGERTLIVPPDAPMRPKREFRDRDDRGPRDRNDRGPRGDREDRPRRERRDVGDMQLYR
IEVGRDDGVEVRHIVGAIANEGDISSRYIGNIKLFASHSTIELPKGMPGEVLQHFTRTRILNKPMNMQLL
GDAQPHTGGERRGGGRGFGGERREGGRNFSGERREGGRGDGRRFSGERREGRAPRRDDSTGRRRFGGDA

The DEAD box pattern: [LIVM](2)-D-E-A-D-[RKEN]-x-[LI]
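To see what the pattern matches, it can be translated into a regular expression and searched in the sequence. The converter below is a minimal sketch covering only the symbols used in this tutorial (bracket groups, a single repeat count, `x`, and `-`); `pattern_to_regex` is a hypothetical helper, not part of BLAST, and the test string is a stretch copied from the DEAD_ECOLI sequence above.

```python
import re

# One pattern element: a bracket group or a single letter,
# optionally followed by a repeat count in parentheses.
TOKEN = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)\))?$")


def pattern_to_regex(pattern):
    """Translate a pattern such as [LIVM](2)-D-E-A-D into a Python regex."""
    parts = []
    for token in pattern.split("-"):
        match = TOKEN.match(token.strip())
        if match is None:
            raise ValueError("unsupported pattern element: %r" % token)
        element, repeat = match.groups()
        if element == "x":          # x stands for any amino acid
            element = "."
        parts.append(element + ("{%s}" % repeat if repeat else ""))
    return "".join(parts)


dead_box = pattern_to_regex("[LIVM](2)-D-E-A-D-[RKEN]-x-[LI]")
print(dead_box)

# A stretch of the DEAD_ECOLI sequence shown above.
segment = "KRGTLDLSKLSGLVLDEADEMLRMGFIEDVETIMAQ"
hit = re.search(dead_box, segment)
print(hit.group(), "at offset", hit.start())
```

PHI-BLAST uses a pattern occurrence like this one as the seed: only database sequences containing the pattern are considered, and the alignment is built around the matched positions.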