Hidden Markov Models BIOL 7711 Computational Bioscience

Biochemistry and Molecular GeneticsComputational Bioscience Program

Consortium for Comparative GenomicsUniversity of Colorado School of Medicine

David.Pollock@uchsc.eduwww.EvolutionaryGenomics.com

Hidden Markov ModelsBIOL 7711

Computational Bioscience

University of Colorado School of Medicine

Consortium for Comparative Genomics

Why a Hidden Markov Model?

Data elements are often linked by a string of connectivity, a linear sequence

Secondary structure prediction (Goldman, Thorne, Jones)CpG islands

Models of exons, introns, regulatory regions, genesMutation rates along genome

Occasionally Dishonest Casino

1: 1/62: 1/63: 1/64: 1/65: 1/66: 1/6

1: 1/102: 1/103: 1/104: 1/105: 1/106: 1/2

eFair eLoadedaFair=>Loaded

aLoaded=>Fair

Posterior Probability of Dice

Sequence Alignment ProfilesMouse TCR Va

Hidden Markov Models: Bugs and Features

MemorylessSum of states is conserved

(rowsums =1)Complications?

Insertion and deletion of states (indels)Long-distance interactions

BenefitsFlexible probabilistic framework

E.g., compared to regular expressions

Profiles: an Example

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

insert

delete

continue continue

insert

Profiles, an Example: States

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

insert

delete

continue continue

insert

State #1 State #2 State #3

Profiles, an Example: Emission

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

insert

delete

continue continue

insert

Sequence Elements

(possibly emitted by a state)

Profiles, an Example: Emission

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

insert

delete

continue continue

insert

Sequence Elements

(possibly emitted by a state)

Emission Probabilities

Profiles, an Example: Arcs

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

delete

continue continue

insert

transition

insert

Profiles, an Example: Special States

A .1C .05D .2E .08F .01

A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

delete

continue continue

insert

transition

insert

Self => SelfLoop

No Delete “State”

A Simpler not very Hidden MM

Nucleotides, no Indels, Unambiguous Path

G .1C .3A .2T .4

G .1C .1A .7T .1

G .3C .3A .1T .3

1.0 1.0 1.0

P(D | M) = 0.7∗1.0∗0.4∗1.0∗0.3∗1.0

A Simpler not very Hidden MM

Nucleotides, no Indels, Unambiguous Path

G .1C .3A .2T .4

G .1C .1A .7T .1

G .3C .3A .1T .3

1.0 1.0 1.0

lnP(D | M) = lnP(ED | state)states

∑ + lnP(x− > y)arcs

A Toy not-Hidden MMNucleotides, no Indels, Unambiguous

but Variable PathAll arcs out are equal

Example sequences: GATC ATC GC GAGAGC AGATTTC

BeginEmit G

Emit A

Emit C

Emit T

P(AGATTTC | M) = (0.5∗1.0)l= 7

Arc Emission

A Simple HMMCpG Islands; Methylation Suppressed in

Promoter Regions; States are Really Hidden Now

G .1C .1A .4T .4

G .3C .3A .2T .2 0.1

CpG Non-CpG

0.8 0.9

P(stateyi |D <= i) = P(statex

i−1)∗P(x− > y)x

∑ *P(ED | stateyi )

Fractional likelihood

The Forward AlgorithmProbability of a Sequence is the

Sum of All Paths that Can Produce It

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.3*.8+

.1*.1)=.075

.3*.2+

.1*.9)=.015

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.3*.8+

.1*.1)=.075

.3*.2+

.1*.9)=.015

.075*.8+

.015*.1)=.0185

.075*.2+

.015*.9)=.0029

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.3*.8+

.1*.1)=.075

.3*.2+

.1*.9)=.015

.075*.8+

.015*.1)=.0185

.075*.2+

.015*.9)=.0029

.0185*.8+.0029*.1

)=.003.4*(.0185*.2+.0029*.9

)=.0025

.003*.8+

.0025*.1

)=.0005.4*(.003*.2+.0025*.9

)=.0011

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.3*.8+

.1*.1)=.075

.3*.2+

.1*.9)=.015

.075*.8+

.015*.1)=.0185

.075*.2+

.015*.9)=.0029

.0185*.8+.0029*.1

)=.003.4*(.0185*.2+.0029*.9

)=.0025

.003*.8+

.0025*.1

)=.0005.4*(.003*.2+.0025*.9

)=.0011

The Viterbi AlgorithmMost Likely Path

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.3*.8,

.1*.1)=.072

.3*.2,

.1*.9)=.009

.075*.8,

.015*.1)=.0173

.075*.2,

.015*.9)=.0014

.0185*.8,.0029*.1

)=.0028.4*m(.0185*.2,.0029*.9

)=.0014

.003*.8,

.0025*.1

)=.00044.4*m(.003*.2,.0025*.9

)=.00050

Forwards and BackwardsProbability of a State at a Position

G .1C .1A .4T .4

G .3C .3A .2T .2

0.10.2

Non-CpG

.0185*.8+.0029*.1

)=.003.4*(.0185*.2+.0029*.9

)=.0025

.003*.8+

.0025*.1

)=.0005.4*(.003*.2+.0025*.9

)=.0011

.003*(

.2*.8+

.4*.2)=.0007.0025*(.2*.1+

.4*.9)=.0009

Lki = fk (i)bk (i)

Forwards and BackwardsProbability of a State at a Position

G C G A A

.003*(

.2*.8+

.4*.2)=.0007.0025*(.2*.1+

.4*.9)=.0009

P(CpG | i = 4,D)

=P(CpG)

P(CpG) + P(non −CpG)[ ]

=0.0007

0.0007 + 0.0009= 0.432

Homology HMMGene recognition, identify distant

homologs

Common Ancestral SequenceMatch, site-specific emission probabilitiesInsertion (relative to ancestor), global emission probsDelete, emit nothingGlobal transition probabilities

Homology HMM

insert insert

delete delete

match end

insert

Multiple Sequence Alignment HMM

Defines predicted homology of positions (sites)

Recognize region within longer sequenceModel domains or whole proteinsStructural alignmentCompare alternative models

Can modify model for sub-familiesIdeally, use phylogenetic tree

Often not much back and forthIndels a problem

Model Comparison

Based on For ML, take

Usually to avoid numeric error

For heuristics, “score” isFor Bayesian, calculate

P(D |θ,M)

Pmax (D |θ,M)

−lnPmax (D |θ,M)

−log2 P(D |θ fixed ,M)

Pmax (θ,M |D) =P(D |θ,M) *P θ( ) *P M( )

P(D |θ,M) *P θ( ) *P M( )∑

Parameters,

Types of parametersAmino acid distributions for positionsGlobal AA distributions for insert statesOrder of match statesTransition probabilitiesTree topology and branch lengthsHidden states (integrate or augment)

Wander parameter space (search)Maximize, or move according to

posterior probability (Bayes)

Expectation Maximization (EM)

Classic algorithm to fit probabilistic model parameters with unobservable states

Or missing data

Two Stages, iterateMaximize

If know hidden variables (states), maximize model parameters with respect to that knowledge

ExpectationIf know model parameters, find expected

values of the hidden variables (states)

Works well even with e.g., Bayesian to find near-equilibrium space

Homology HMM EMStart with heuristic (e.g., ClustalW)Maximize

Match states are residues aligned in most sequencesAmino acid frequencies observed in

columns

ExpectationRealign all the sequences given model

Repeat until convergenceProblems: Local, not global

optimizationUse procedures to check how it worked

Model ComparisonDetermining significance depends

on comparing two modelsUsually null model, H0, and test model,

Models are nested if H0 is a subset of H1

If not nestedAkaike Iinformation Criterion (AIC) [similar

to empirical Bayes] or Bayes Factor (BF) [but be careful]

Generating a null distribution of statistic

Z-factor, bootstrapping, , parametric bootstrapping, posterior predictive

Z Test MethodDatabase of known negative controls

E.g., non-homologous (NH) sequencesAssume NH scores

i.e., you are modeling known NH sequence scores as a normal distribution

Set appropriate significance level for multiple comparisons (more below)

ProblemsIs homology certain?Is it the appropriate null model?

Normal distribution often not a good approximation

Parameter control hard: e.g., length distribution

~ N(μ,σ )

Bootstrapping and Parametric Models

Random sequence sampled from the same set of emission probability distributions

Same length is easyBootstrapping is re-sampling columnsParametric models use estimated frequencies, may include variance, tree, etc.

More flexible, can have more complex nullAllows you to consider carefully what the null means,

and what null is appropriate to use! Pseudocounts of global frequencies if data limit

Insertions relatively hard to modelWhat frequencies for insert states? Global?

Homology HMM Resources

UCSC (Haussler)SAM: align, secondary structure

predictions, HMM parameters, etc.

WUSTL/Janelia (Eddy)Pfam: database of pre-computed HMM

alignments for various proteinsHMMer: program for building HMMs

Increasing Asymmetry with Increasing Single

Strandedness

e.g., P ( A=> G) = c + t

t = ( DssH * Slope ) + Intercept €

A C T G

− λ ACπ C λ ATπ T λ AGπG

λCAπ A − λCTπ T λCGπG

λTAπ A λTCπ C − λTGπG

λGAπ A λGCπ C λGTπ T −

⎢ ⎢ ⎢ ⎢

⎥ ⎥ ⎥ ⎥ ⎥

⎥⎥⎥

⎢⎢⎢⎢

−−

2x Redundant Sites

4x Redundant Sites

Beyond HMMs

Neural netsDynamic Bayesian netsFactorial HMMsBoltzmann TreesKalman filtersHidden Markov random fields

COI Functional Regions

Oxygen

Electron

H (alt)

O2 + protons+ electrons = H2O + secondary proton pumping (=ATP)

Hidden Markov Models BIOL 7711 Computational Bioscience

Documents

Georgia Bioscience

English Literature A 7711/1 - Revision World

Spring 2013 Course List · BIOL CAS BIOL 511 007 Biol Teaching Practicum 4 A, G BIOL CAS BIOL 511 008 Biol Teaching Practicum 4 A, G BIOL CAS BIOL 595 004 Thesis Supervision 4 B,

BioScience Trends. 2021; 15(2):107-117. 107 BioScience

Pragma Bioscience(Habibah)

Multiple Sequence Alignment (MSA); short version BIOL 7711 Computational Bioscience

Colorado PLTW Conference - UCCS Home · Colorado PLTW Conference ... Biol 1011 Biol 1012 Biol 1013 Biol 2040 ... BIOL 3140, BIOL 3150, BIOL 3300, BIOL 3620, BIOL 3910, BIOL 4200,

7711 tata-motors-thesis

Breakthroughs in Bioscience

Bioscience Careers

Should I become a Doctor?proteomics.ysu.edu/bioscience/MDprep.pdf · Core courses (2 required; General Biology 1 & 2 are prerequisites) BIOL 3730 – Human Physiology with lab BIOL

BioScience Soundscape

EW-7711 22In Manual

southcentral.edu · BIOL 105 General Biology I BIOL 106 General Biology Il BIOL 211 Genetics BIOL250 Biology Capstone BIOL 220 Human Anatomy BIOL 230 Human Physiology BIOL 270 Microbiology

Yield10 Bioscience, Inc

FIRST SEMESTER 2Y CNSM COURSE RET % Tim ......CSULB CHICO FRESNO FULLERTON LA NORTHRIDGE POMONA SAC SDSU SFSU SJSU SLO BIOL BIOL BIOL BIOL BIOL BIOL BIOL BIOL BIOL BIOL BIOL BIOL HDEV

BioScience Trends. 2021; 15(1):41-49. BioScience Trends

BioScience Trends. 2021; 15(1):24-32. 24 BioScience Trends

What Is Biotechnology? An Introduction BioScience Survey An Introduction BioScience Survey

Biology Programs - SIUE · At least 1 course from the following list: BIOL 319 BIOL 335 BIOL 337 BIOL 350 BIOL 415 BIOL 416 BIOL 421 BIOL 422 BIOL 451 BIOL 467 BIOL 472 Chemistry