
Hidden Markov Models

• Probabilistic model of a multiple sequence alignment.

• No indel penalties are needed.

• Experimentally derived information can be incorporated.

• Parameters are adjusted to represent observed variation.

• Requires at least 20 sequences.


The Evolution of a Sequence

• Over long periods of time a sequence will acquire random mutations.

– These mutations may result in a new amino acid at a given position, the deletion of an amino acid, or the introduction of a new one.

– Over VERY long periods of time two sequences may diverge so much that their relationship cannot be seen through direct comparison of their sequences.


Hidden Markov Models

• Pair-wise methods rely on direct comparisons between two sequences.

• In order to overcome the differences in the sequences, a third sequence is introduced, which serves as an intermediate.

• A high hit between the first and third sequences, as well as a high hit between the second and third sequences, implies a relationship between the first and second sequences (a transitive relationship).


Introducing the HMM

• The intermediate sequence is kind of like a missing link.

• The intermediate sequence does not have to be a real sequence.

• The intermediate sequence becomes the HMM.


Introducing the HMM

• The HMM is a mix of all the sequences that went into its making.

• The score of a sequence against the HMM shows how well the HMM serves as an intermediate of the sequence.

– How likely it is to be related to all the other sequences, which the HMM represents.


B M1 M2 M3 M4 E

Match State with no Indels

MSGL
MTNL

Arrows indicate transition probabilities. In this case 1 for each step.


B M1 M2 M3 M4 E

Match State with no Indels

MSGL
MTNL

Also have a probability of each residue at each position:

M1: M = 1
M2: S = 0.5, T = 0.5


B M1 M2 M3 M4 E

MSGL
MTNL

M1: M = 1
M2: S = 0.5, T = 0.5

Typically want to incorporate a small probability for all other amino acids.
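The per-position residue probabilities, and the small probability reserved for unseen amino acids, can be sketched as follows. This is a minimal illustration, assuming the slide's alignment is the two rows MSGL and MTNL and that the "small probability" is a uniform pseudocount added to every residue; the function name and the pseudocount value 0.1 are hypothetical choices, not taken from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probabilities(alignment, pseudocount=0.0):
    """Return one {residue: probability} dict per alignment column.

    pseudocount is a hypothetical uniform count added to every amino acid
    so that residues never observed in a column keep a small probability.
    """
    model = []
    for column in zip(*alignment):  # transpose rows into columns
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        model.append({aa: (counts[aa] + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return model

alignment = ["MSGL", "MTNL"]
raw = emission_probabilities(alignment)                      # observed frequencies only
smoothed = emission_probabilities(alignment, pseudocount=0.1)

print(raw[0]["M"], raw[1]["S"], raw[1]["T"])
print(round(smoothed[0]["M"], 3), round(smoothed[0]["A"], 3))
```

With no pseudocount the first column gives M a probability of 1 and the second column splits evenly between S and T, matching the slide. With the pseudocount, every other amino acid receives a small non-zero probability and each column still sums to 1.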


B M1 M2 M3 M4 E

I0 I1 I2 I3 I4

MS.GL
MT.NL
MSANI

Permit insertion states

Transition probabilities may not be 1


B M1 M2 M3 M4 E

I0 I1 I2 I3 I4

MS..GL
MT..NL
MSA.NI
MTARNL

Permit insertion states


MS..GL--
MT..NLAG
MSA.NIAG
MTARNLAG

Delete states permit incorporation of the last two sites of sequence 1

D1 D2 D3 D4 D5 D6 D7
I0 I1 I2 I3 I4 I5 I6 I7
B M1 M2 M3 M4 M5 M6 M7 E

[Figure: profile HMM state diagram with residue labels on the states]


The bottom line of states are the main states (M)
• These model the columns of the alignment.

The second row of diamond-shaped states are called the insert states (I)
• These are used to model the highly variable regions in the alignment.

The top row of circles are delete states (D)
• These are silent or null states because they do not match any residues; they simply allow the skipping over of main states.

D1 D2 D3 D4 D5 D6
I0 I1 I2 I3 I4 I5 I6
B M1 M2 M3 M4 M5 M6 E
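How the three kinds of state are used can be illustrated by reading a state path off each row of an alignment. The sketch below assumes the four-row alignment from the earlier slide, and assumes that any column containing "." is an insert column while "-" in a main column marks a delete state. Under that assumption the alignment yields six main states, which may differ from the numbering in the slide's figure. The function name is hypothetical.

```python
def state_paths(alignment):
    """Map each aligned row to its path through main/insert/delete states."""
    n_cols = len(alignment[0])
    # Assumption: a column holding '.' in any row is an insert column.
    is_insert = [any(row[c] == "." for row in alignment) for c in range(n_cols)]
    paths = []
    for row in alignment:
        path, main_index = ["B"], 0
        for c, symbol in enumerate(row):
            if is_insert[c]:
                if symbol != ".":
                    # Residue emitted by the insert state after the last main state.
                    path.append(f"I{main_index}")
            else:
                main_index += 1
                # '-' means the main state is skipped through a silent delete state.
                path.append(f"D{main_index}" if symbol == "-" else f"M{main_index}")
        path.append("E")
        paths.append(path)
    return paths

alignment = ["MS..GL--", "MT..NLAG", "MSA.NIAG", "MTARNLAG"]
paths = state_paths(alignment)
for row, path in zip(alignment, paths):
    print(row, " ".join(path))
```

The first sequence ends in two delete states because it has no residues for the last two columns, and the fourth sequence visits the same insert state twice to emit its two extra residues.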


Dirichlet Mixtures

• Additional information to expand potential amino acids in individual sites.

• Observed frequency of amino acids seen in certain chemical environments

– aromatic
– acidic
– basic
– neutral
– polar
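One common way to apply a Dirichlet mixture is to combine the observed counts in a column with several prior "environments" and take the posterior mean. The sketch below is only an illustration under stated assumptions: it uses a toy four-letter alphabet (D, E, K, R), two made-up components standing in for "acidic" and "basic" environments, and invented alpha values. Real mixtures are estimated from large alignment databases.

```python
from math import exp, lgamma, log

def log_beta(vector):
    """Log of the multivariate Beta function: sum(lgamma(x_i)) - lgamma(sum(x))."""
    return sum(lgamma(x) for x in vector) - lgamma(sum(vector))

def mixture_estimate(counts, components):
    """Posterior mean residue probabilities under a Dirichlet mixture prior.

    counts: observed counts per residue in one alignment column.
    components: list of (mixture weight, alpha vector) pairs (hypothetical values).
    """
    # Posterior weight of each component is proportional to
    # weight * B(counts + alpha) / B(alpha).
    log_post = []
    for weight, alpha in components:
        posterior_alpha = [n + a for n, a in zip(counts, alpha)]
        log_post.append(log(weight) + log_beta(posterior_alpha) - log_beta(alpha))
    top = max(log_post)
    post = [exp(v - top) for v in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    total = sum(counts)
    estimate = [0.0] * len(counts)
    for p, (_, alpha) in zip(post, components):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            estimate[i] += p * (n + a) / denom
    return estimate

residues = ["D", "E", "K", "R"]                 # toy alphabet for illustration
components = [
    (0.5, [2.0, 2.0, 0.1, 0.1]),                # hypothetical "acidic" environment
    (0.5, [0.1, 0.1, 2.0, 2.0]),                # hypothetical "basic" environment
]
counts = [2, 0, 0, 0]                            # column where only D was observed
estimate = mixture_estimate(counts, components)
print({aa: round(p, 3) for aa, p in zip(residues, estimate)})
```

Although only D was observed, E receives far more probability than K or R, because the observed counts make the acidic component the likely environment for this site.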


STRUCTURES

α helix
β sheet
coils
turns

Structures are used to build domains - the Legos of evolution.


Rotation around the peptide bond


Ramachandran plot for Glycine

Areas not permitted for other amino acids

Phi angles

Psi Angles


Introduction to Protein Structure, Branden and Tooze, Garland Publishing Co., 1991, p. 13


From: http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure13.html


From: http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure14.html

Longitudinal and Transverse image of alpha helix


Turn connecting two helices

Introduction to Protein Structure, Branden and Tooze, Garland Publishing Co., 1991, p. 17


Hemoglobin - ribbon representation


Proline

• Because of its structure, proline is typically excluded from helices except in the first three positions at the amino end.


β Structure

β strand - single run of amino acids in β conformation

β sheet - multiple strands which are hydrogen bonded to yield a sheet-like structure.

β bulge - disruption of normal hydrogen bonding in a sheet by amino acid(s) that will not fit into the sheet, for example proline


Introduction to Protein Structure, Branden and Tooze, Garland Publishing Co., 1991, p. 17

β sheets - parallel


β sheet - longitudinal and transverse view. Side chains stick “out”

http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure22.html


Superoxide dismutase - β sheet


Superoxide dismutase - β sheet


Six classes of structure

• Class α - bundled α helices connected by loops.

• Class β - sandwich or barrel comprised entirely of β sheets, typically anti-parallel.

• Class α/β - mainly parallel β sheets with intervening α helices.

• Class α+β - segregated α helices and anti-parallel β sheets.

• Multi-domain

• Membrane proteins


CD8 - all β


Thioredoxin - α/β


Endonuclease - Class α+β


Rhodopsin - 7TM protein


Common Hairpin Loop between two Strands

Introduction to Protein Structure, Branden and Tooze, Garland Publishing Co., 1991, p. 17


• Turn - short, regular loop.

– Difference in frequency of amino acids at positions 1-4 of the turn.

• Coils (not coiled coil)

– Random turns or irregular structure.


Disulfide bridges

• Crosslink of two cysteine residues.

• Distance between the two sulfur atoms is about 2 Angstroms.


Coiled coil - two α helices bundled side by side

From: http://catt.poly.edu/~jps/coilcoil.html


Positions a and d of the heptad repeat are internal; the remaining amino acids are solvent exposed

From: http://catt.poly.edu/~jps/coilcoil.html


Coiled Coil

• Two or more adjacent helices


Prediction of potential Coiled coil domain in Groucho


MMFPQSRHSGSSHLPQQLKFTTSDSCDRIKDEFQLLQAQYHSLKLECDKLASEKSEMQRHYVMYYEMSYGLNIEMHKQAEIVKRLNGICAQVLPYLSQEHQQQVLGAIERAKQVTAPELNSIIRQQLQAHQLSQLQALALPLTPLPVGLQPPSLPAVSAGTGLLSLSALGSQTHLSKEDKNGHDGDTHQEDDGEKSD

Potential Residues involved in Coiled Coil


Triple helix coiled coil - built from helices


Backbone of triple coiled coil


E. coli Nucleotide exchange factor


Domains

• Single domain proteins

– Epidermal growth factor
– Serine proteases - Trypsin

• Multi domain proteins - Factor IX - one Ca2+ binding, two EGF, one protease domain.

• Permit building of novel functions by swapping of domains


Ca EGF EGF CT

Factor IX Domain Structure

Ca - Calcium binding domain
EGF - Epidermal growth factor domain
CT - Chymotrypsin domain


Chou - Fasman Prediction of Secondary Structure

• Based upon analysis of known structures (1974).

• Frequency of occurrence of each amino acid in:

– α helix
– β strand
– turn


Chou - Fasman Prediction

• List is then analyzed for stretches of amino acids that have a common tendency to form a given secondary structure.

• Extend until a region with a high probability for a turn, or a region with a low probability of both α helix and β strand, is encountered.

• Window is typically <10
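The nucleate-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the published Chou - Fasman procedure: the propensity values are hypothetical placeholders (helix formers 1.4, breakers 0.6, everything else 1.0), only helices are considered, and the window sizes and cutoff are assumed parameters.

```python
# Hypothetical helix propensities for illustration only, not the published table.
HELIX_PROPENSITY = {aa: 1.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
HELIX_PROPENSITY.update({"A": 1.4, "E": 1.4, "L": 1.4, "M": 1.4, "G": 0.6, "P": 0.6})

def mean_propensity(segment):
    """Average helix propensity over a stretch of residues."""
    return sum(HELIX_PROPENSITY[aa] for aa in segment) / len(segment)

def predict_helices(sequence, nucleus=6, min_formers=4, probe=4, cutoff=1.0):
    """Return (start, end) half-open index pairs of predicted helical regions."""
    regions = []
    i = 0
    while i + nucleus <= len(sequence):
        window = sequence[i:i + nucleus]
        formers = sum(HELIX_PROPENSITY[aa] > cutoff for aa in window)
        if formers < min_formers:
            i += 1
            continue
        start, end = i, i + nucleus
        # Extend right while the probe window ending at the new residue favors helix.
        while (end < len(sequence)
               and mean_propensity(sequence[end - probe + 1:end + 1]) >= cutoff):
            end += 1
        # Extend left while the probe window starting at the new residue favors helix.
        while (start > 0
               and mean_propensity(sequence[start - 1:start - 1 + probe]) >= cutoff):
            start -= 1
        regions.append((start, end))
        i = end
    return regions

regions = predict_helices("GPGPAELLMAEAGPGP")
print(regions)
print(predict_helices("GPGPGPGPGP"))
```

The first sequence has a core of helix-favoring residues flanked by glycine and proline, so a single region is reported that stops once the flanking residues pull the local average below the cutoff. The second sequence contains no nucleation site and yields no regions.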


GOR prediction

• Similar to Chou - Fasman

– More recent (1988) tabulation of amino acid preferences.
– Uses a larger window - 17


More Recent Prediction Programs

• Make use of a library of 3D structures to predict structure.

• Most use a Neural Net approach for prediction.

• Examples

– Nnpredict
– PredictProtein


Neural Net

• Programs “trained” on structures.

• Window - within the window each position is predicted based upon knowledge.

• Rules also applied (alpha helix 4 AA long)

[Figure: neural network with a sequence window feeding Input, Hidden and Output layers; output label shown: coil]


PredictProtein

• Uses an alignment approach.

• Submitted sequence is compared to database and alignment is generated.

• Profile is generated for further database searching.

• Alignment is then used for prediction of secondary structure.

• Confidence predicted - based upon number of residues of given type at a given position in the alignment


Kyte and Doolittle Hydropathy

• Average of hydropathy index for each residue.

• Example of Hydropathy index:

• F +2.8

• R -4.5
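A hydropathy profile is simply a sliding-window average of per-residue index values. The sketch below uses the Kyte and Doolittle scale (which agrees with the F and R values on the slide); the window length of 9 is an assumed default chosen for illustration, and the function name is hypothetical.

```python
# Kyte and Doolittle hydropathy index per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=9):
    """Average hydropathy of every full window, indexed by window start."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

profile = hydropathy_profile("MQRVNMIMAESPGLITICLLGYLLSAECTVFLD")
for start, value in enumerate(profile):
    print(start, round(value, 2))
```

Windows with strongly positive averages mark hydrophobic stretches, and strongly negative averages mark hydrophilic ones. A sequence shorter than the window produces an empty profile.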


Transmembrane Domain

• Characteristics make them easier to predict:

– α helix structure
– Hydrophobic amino acids
– 19 or more amino acids long
– A charged residue will typically have an opposing charge for neutralization.

• Difficulty in predicting the ends of transmembrane domains.


Caveat

• Local secondary structure can be influenced by tertiary structure.

• An identical string of residues can be an α helix in one protein but a β strand in another protein.


3D structural prediction


>gi|14769656|ref|XP_010270.4| coagulation factor IX [Homo sapiens]
MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNLERECMEEKCSFEEAREVFENTERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCPFGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEGYRLAENQKSCEPAVPFPCGRVSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPWQVVLNGKVDAFCGGSIVNEKWIVTAAHCVETGVKITVVAGEHNIEETEHTEQKRNVIRIIPHHNYNAAINKYNHDIALLELDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVFHKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFHEGGRDSCQGDSGGPHVTEVEGTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT

Pfam

Protein Information Resource

KFHU


Tertiary Structure

• Still challenging

• Focus upon core structure for prediction

Hydrophobic interactions that stabilize structure.


Approach

• Determine “fit” of a query sequence to library of known structures.

– Threading - examine compatibility of amino acid side groups with known structures
– Two approaches:

• Environmental template
• Contact potential


Environmental Template

• Each amino acid in known core evaluated for:

– secondary structure
– area of side chain buried
– types of nearby AA side chains


Arginine - basic AA
Isoleucine

Different propensity to be in a hydrophobic environment. A charge might be accommodated by an opposite charge.


Environmental

• Query sequence is submitted to previously analyzed database of structures

– How well does your sequence fit these protein cores?


Contact Potential

• Number and closeness between each AA pair determined.

• Query sequence examined to determine if potential AA interactions match those of known cores.


Structural Profile

• Structural position specific scoring matrix

• Identify which amino acids fit into a specific position in the core of each known structure

– each position is assigned to one of the 18 classes of structural environment
– scores reflect suitability of AA for that position
– log odds matrix

• Use profile to examine query sequence


Z score

• Many programs return an E value or a Z score

• The Z score is the number of standard deviations from the mean score for all sequences.

• The higher the Z score, the more significant the model -typical good score >5.
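The Z score definition above can be written directly as a small calculation. This sketch assumes the background is a list of scores obtained for other sequences against the same model and uses the population standard deviation; the background values are made up for illustration.

```python
from statistics import mean, pstdev

def z_score(score, background_scores):
    """Number of standard deviations that `score` lies above the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # population standard deviation (assumption)
    return (score - mu) / sigma

background = [10, 12, 14, 16, 18]      # hypothetical scores for all sequences
print(round(z_score(30, background), 2))
print(z_score(14, background))
```

A query scoring 30 against this background lies well over 5 standard deviations above the mean, which would count as a good score by the rule of thumb on the slide, while a query scoring exactly the mean has a Z score of 0.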