DNA Analysis

Amir Golnabi

ENGS 112

Spring 2008


Outline:

1. Markov Chain

2. DNA and Modeling

3. Markovian Models for DNA Sequences

4. Hidden Markov Models (HMM)

5. HMM for DNA Sequences

6. Future Work

7. References


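1. Markov Chain:

• Alphabet: $S = \{s_1, s_2, \ldots, s_J\}$

• $s_j$ are called states, and $S$ is the state space

• Notation: states are identified with their indices, $S = \{1, 2, \ldots, J\}$

• Sequence of random variables: $X_0, X_1, \ldots, X_n, \ldots$

• A sequence of random variables $\{X_n\}_{n \ge 0}$ is called a Markov Chain (MC) if for all $n \ge 1$ and all states $j_0, \ldots, j_n \in S$:

$$P(X_n = j_n \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) = P(X_n = j_n \mid X_{n-1} = j_{n-1})$$

• The conditional probability of a future event $X_n = j_n$ depends only upon the immediate past event $X_{n-1} = j_{n-1}$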


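1. Markov Chain (cont.)

• Conditional Probability (of moving from state $i$ to state $j$): $P_{i|j} = P(X_n = j \mid X_{n-1} = i), \quad n \ge 1, \; i, j \in S$

• Transition Matrix: $P = (P_{i|j})_{i=1, j=1}^{J, J}$

• Property: $P_{i|j} \ge 0, \quad \sum_{j=1}^{J} P_{i|j} = 1$

• Higher-Order Markov Chains:

– Second order MC: $P_{i,k|j} = P(X_n = j \mid X_{n-1} = i, X_{n-2} = k), \quad n \ge 2, \; i, j, k \in S$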
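The transition matrix and its row-sum property can be illustrated with a small Python sketch. The four-state alphabet and every probability in `P` are made-up values chosen only for illustration; `is_stochastic` and `simulate` are hypothetical helper names, not part of the original slides.

```python
import random

STATES = ["A", "C", "G", "T"]  # illustrative state space S, so J = 4

# Hypothetical transition matrix: P[i][j] = P(X_n = j | X_{n-1} = i).
P = [
    [0.40, 0.20, 0.20, 0.20],
    [0.10, 0.50, 0.30, 0.10],
    [0.20, 0.30, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

def is_stochastic(matrix, tol=1e-9):
    """Check the property: all entries >= 0 and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def simulate(matrix, start, n_steps, rng):
    """Sample X_0, ..., X_n; each step depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        row = matrix[path[-1]]
        path.append(rng.choices(range(len(row)), weights=row)[0])
    return path

rng = random.Random(0)  # fixed seed so repeated runs give the same path
path = simulate(P, start=0, n_steps=20, rng=rng)
print(is_stochastic(P))
print("".join(STATES[i] for i in path))
```

A second-order chain could be handled the same way by indexing the rows with pairs of previous states instead of a single state.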


2. DNA and Modeling:

• Bases: {A,T,C,G}

• Complementary strands → it suffices to model the sequence of bases in a single strand

• Sequences are always read from the 5' to the 3' end.

• DNA → mRNA → proteins (transcription and translation)

• Codons: Triples of bases which code for amino acids

• 61 amino-acid codons + 3 'stop' codons (64 triples in total)

• Specific sequence of codons → gene → chromosomes → genome

• exons: coding portion of genes

• introns: non-coding regions

• Goal: To determine the nucleotide sequence of entire genomes
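The codon arithmetic and the complementary-strand idea can be checked with a short Python sketch. The stop codons listed are those of the standard genetic code written in the DNA alphabet; the helper names `reverse_complement` and `codons` are hypothetical and chosen only for this example.

```python
from itertools import product

BASES = "ATCG"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet

def reverse_complement(strand):
    """Return the complementary strand, also written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(strand))

def codons(strand):
    """Split a strand into consecutive triples, dropping any remainder."""
    return [strand[i:i + 3] for i in range(0, len(strand) - 2, 3)]

all_codons = ["".join(t) for t in product(BASES, repeat=3)]
coding = [c for c in all_codons if c not in STOP_CODONS]

print(len(all_codons), len(coding), len(STOP_CODONS))  # 64 61 3
print(reverse_complement("ATGC"))                      # GCAT
print(codons("ATGGCCTAA"))                             # ['ATG', 'GCC', 'TAA']
```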


3. Markov Chains for DNA Sequences

• Nucleotides are chained linearly one by one → local dependence between the bases and their neighbors

• Markov chains offer computationally effective ways of expressing the various frequencies and local dependencies

• Alphabet of bases = {A,T,C,G} → not uniformly distributed in any sequence, and the composition varies within and between sequences

• The probability of finding a particular base at one position can depend not only on the immediately adjacent bases, but also on several more distant bases upstream or downstream → higher-order (heterogeneous) Markov model

• Gene finding: Markov models of coding and non-coding regions to classify segments as either exons or introns.

• Segmentation for decomposing DNA sequences into homogeneous regions → Hidden Markov Models
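As a rough illustration of how a first-order Markov chain expresses local dependencies, the transition probabilities can be estimated from dinucleotide counts. This is a minimal sketch, assuming a plain maximum-likelihood estimate with no smoothing; `estimate_transitions` is a hypothetical helper, and the input is simply the example sequence used later in these slides.

```python
from collections import Counter

BASES = "ACGT"

def estimate_transitions(seq):
    """Estimate P(next base | current base) from dinucleotide counts."""
    pair_counts = Counter(zip(seq, seq[1:]))
    matrix = {}
    for a in BASES:
        total = sum(pair_counts[(a, b)] for b in BASES)
        # A base that is never followed by another base gets an all-zero row.
        matrix[a] = {
            b: (pair_counts[(a, b)] / total if total else 0.0) for b in BASES
        }
    return matrix

seq = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"  # illustrative input
P_hat = estimate_transitions(seq)
for a in BASES:
    print(a, " ".join(f"{P_hat[a][b]:.2f}" for b in BASES))
```

A sequence this short gives very noisy estimates; in practice the counts would come from much longer sequences, and separate matrices for coding and non-coding regions could be compared for gene finding.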


4. Hidden Markov Models (HMM)

• Stochastic process generated by two interrelated probabilistic mechanisms

• Underlying Markov chain with a finite number of states and a set of random functions, each associated with its respective state

• Changing the states: according to the transition matrix

• Only the output of the random functions can be seen

• Advantage: HMMs allow local characteristics of molecular sequences to be modeled and predicted within a rigorous statistical framework, and also allow knowledge from prior investigations to be incorporated into the analysis.
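The two interrelated mechanisms can be sketched as a small generator in Python: a hidden Markov chain picks the state, and a state-specific random function produces the visible output. The state names, output symbols, and all probabilities below are hypothetical values used only for illustration.

```python
import random

def sample_hmm(transition, emission, start, length, rng):
    """Generate (hidden_states, outputs) from a discrete HMM.

    transition[s] and emission[s] are dicts mapping to probabilities.
    """
    hidden, outputs = [], []
    state = start
    for _ in range(length):
        hidden.append(state)
        # Mechanism 2: the random function attached to the current state.
        symbols = list(emission[state])
        weights = [emission[state][o] for o in symbols]
        outputs.append(rng.choices(symbols, weights=weights)[0])
        # Mechanism 1: the underlying Markov chain moves to the next state.
        targets = list(transition[state])
        weights = [transition[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
    return hidden, outputs

# Hypothetical two-state model with outputs "a" and "b".
transition = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}}
emission = {"s1": {"a": 0.8, "b": 0.2}, "s2": {"a": 0.3, "b": 0.7}}

hidden, outputs = sample_hmm(transition, emission, "s1", 30, random.Random(1))
print("hidden :", " ".join(hidden))
print("outputs:", " ".join(outputs))
```

An observer sees only the `outputs` line; the `hidden` line is what inference algorithms try to recover.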


5. HMM for DNA Sequences

• Every nucleotide in a DNA sequence belongs to either a "Normal" region (N) or a GC-rich region (R).

• N and R are not randomly interspersed: they occur in patches, with larger regions of N sequence

• Example of such a sequence:

NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN

• States of HMM: {N,R}

• Possible DNA sequence with this underlying collection:

TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA

• No typical random collection of nucleotides: GC content is 83% in R regions vs. 23% in N regions

• HMM: Identify these types of features in sequences

• Ability to capture both the patchiness of N and R and the different compositional frequencies within the categories
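One way to identify N and R regions from the bases alone is Viterbi decoding, sketched below in Python. The emission probabilities use the 83% and 23% GC figures from the slide, split evenly between G and C (and between A and T), which is an assumption. The start and transition probabilities are invented for illustration, and Viterbi is one standard decoding choice rather than a method named in the slides.

```python
import math

STATES = "NR"
START = {"N": 0.5, "R": 0.5}                      # assumed
TRANS = {"N": {"N": 0.9, "R": 0.1},               # assumed: regions are patchy,
         "R": {"N": 0.2, "R": 0.8}}               # so staying put is likely
EMIT = {
    "N": {"A": 0.385, "T": 0.385, "G": 0.115, "C": 0.115},  # 23% GC
    "R": {"A": 0.085, "T": 0.085, "G": 0.415, "C": 0.415},  # 83% GC
}

def viterbi(seq):
    """Return the most probable N/R path for a non-empty DNA string."""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

seq = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"
print(seq)
print(viterbi(seq))
```

The decoded path depends on the assumed transition probabilities, so it need not match the N/R labeling shown on the slide.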


6. Future work

• Better and deeper understanding of HMMs

• Different applications of HMMs, such as segmentation of DNA sequences and gene finding

• Build an automaton for a simple case

7. References

• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.

• Birney, E. "Hidden Markov models in biological sequence analysis". July 2001.

• Haussler, David, David Kulp, Martin Reese, Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".

• Boufounos, Petros, Sameh El-Difrawy, Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".
