DNA Analysis
Amir Golnabi
ENGS 112
Spring 2008
Outline:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. Hidden Markov Models (HMM)
5. HMM for DNA Sequences
6. Future Work
7. References
1. Markov Chain
• Alphabet: $S = \{s_1, s_2, \ldots, s_J\}$
• The $s_j$ are called states, and $S$ is the state space
• Notation: $S = \{1, 2, \ldots, J\}$
• Sequence of random variables: $X_0, X_1, \ldots, X_n, \ldots$
• A sequence of random variables $\{X_n\}_{n \ge 0}$ is called a Markov chain (MC) if, for all $n \ge 1$ and $j_0, \ldots, j_n \in S$:
$$P(X_n = j_n \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) = P(X_n = j_n \mid X_{n-1} = j_{n-1})$$
• The conditional probability of a future event $X_n = j_n$ depends only upon the immediate past event $X_{n-1} = j_{n-1}$
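1. Markov Chain (cont.)
• Conditional probability: $P_{i|j} = P(X_n = j \mid X_{n-1} = i)$, $n \ge 1$, $i, j \in S$
• Transition matrix: $P = (P_{i|j})_{i=1,\,j=1}^{J,\,J}$
• Property: $P_{i|j} \ge 0$, $\sum_{j=1}^{J} P_{i|j} = 1$
• Higher-order Markov chains:
– Second-order MC: $P_{i,k|j} = P(X_n = j \mid X_{n-1} = i, X_{n-2} = k)$, $n \ge 2$, $i, j, k \in S$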
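A minimal sketch of a first-order chain may help make the transition matrix concrete. The three-state space and all probabilities below are made-up assumptions for illustration; `P[i][j]` stands for the probability of moving from state `i` to state `j`, and the helper names are hypothetical.

```python
import random

# Hypothetical transition matrix on S = {1, 2, 3} (illustrative numbers only).
# P[i][j] is the probability that the next state is j given the current state i.
P = {
    1: {1: 0.5, 2: 0.3, 3: 0.2},
    2: {1: 0.1, 2: 0.6, 3: 0.3},
    3: {1: 0.2, 2: 0.2, 3: 0.6},
}


def is_stochastic(matrix, tol=1e-9):
    """Check that all entries are non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row.values()) and abs(sum(row.values()) - 1.0) < tol
        for row in matrix.values()
    )


def sample_chain(matrix, start, length, rng):
    """Draw a chain of the given length; each step uses only the previous state."""
    chain = [start]
    for _ in range(length - 1):
        row = matrix[chain[-1]]
        chain.append(rng.choices(list(row), weights=list(row.values()))[0])
    return chain


def path_probability(matrix, path):
    """Probability of path[1:] given path[0], as a product of one-step terms."""
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= matrix[prev][cur]
    return prob


rng = random.Random(0)  # fixed seed so the run is repeatable
chain = sample_chain(P, start=1, length=20, rng=rng)
print(is_stochastic(P), chain)
print(path_probability(P, [1, 2, 3]))  # P[1][2] * P[2][3]
```

Because of the Markov property, the probability of a whole path factors into one-step transition probabilities, which is what `path_probability` multiplies together.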
2. DNA and Modeling
• Bases: {A,T,C,G}
• Complementary strands → a DNA molecule can be described by the sequence of bases in a single strand
• Sequences are always read from the 5’ to the 3’ end.
• DNA → mRNA → proteins (transcription and translation)
• Codons: triples of bases which code for amino acids
• 61 coding codons + 3 ‘stop’ codons
• Specific sequence of codons → gene → chromosomes → genome
• Exons: coding portions of genes
• Introns: non-coding regions
• Goal: to determine the nucleotide sequence of entire genomes
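The codon count and the complementary-strand idea can be checked with a short sketch. It assumes the standard genetic code, whose three stop codons written as DNA are TAA, TAG and TGA; the function names are hypothetical.

```python
from itertools import product

BASES = "ATCG"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Stop codons of the standard genetic code, written as DNA (coding strand).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the complementary strand, itself read from 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def codons(seq):
    """Split a sequence into consecutive, non-overlapping base triples."""
    usable = len(seq) - len(seq) % 3  # drop an incomplete trailing triple
    return [seq[i:i + 3] for i in range(0, usable, 3)]


all_codons = ["".join(triple) for triple in product(BASES, repeat=3)]
coding_codons = [c for c in all_codons if c not in STOP_CODONS]

print(len(all_codons), len(coding_codons))  # 64 61
print(reverse_complement("ATGC"))           # GCAT
print(codons("ATGGCCTAAG"))                 # ['ATG', 'GCC', 'TAA']
```

Since either strand determines the other, storing one strand is enough, which is why the models that follow work on a single sequence of bases.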
3. Markov Chains for DNA Sequences
• Nucleotides are chained linearly one by one → local dependence between the bases and their neighbors
• Markov chains offer computationally effective ways of expressing the various frequencies and local dependencies
• Alphabet of bases = {A,T,C,G} → not uniformly distributed in any sequence, and the composition varies within and between sequences
• The probability of finding a particular base at one position can depend not only on the immediately adjacent bases, but also on several more distant bases upstream or downstream → higher-order Markov model (heterogeneous)
• Gene finding: Markov models of coding and non-coding regions to classify segments as either exons or introns
• Segmentation for decomposing DNA sequences into homogeneous regions → Hidden Markov Models
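One way the classification idea above could be sketched: estimate a first-order transition matrix for each class from adjacent-base counts, then compare log-likelihoods of a new segment under the two models. The training strings below are made-up toy data (GC-rich vs. AT-rich), not real exons or introns, and the pseudocount and function names are assumptions of this sketch.

```python
import math
from collections import Counter

BASES = "ACGT"


def estimate_transitions(seq, pseudocount=1.0):
    """Estimate P(next base | current base) from counts of adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))
    matrix = {}
    for a in BASES:
        total = sum(pair_counts[(a, b)] + pseudocount for b in BASES)
        matrix[a] = {b: (pair_counts[(a, b)] + pseudocount) / total for b in BASES}
    return matrix


def log_likelihood(seq, matrix):
    """Log-probability of seq[1:] given seq[0] under a first-order chain."""
    return sum(math.log(matrix[a][b]) for a, b in zip(seq, seq[1:]))


def classify(segment, model_a, model_b):
    """Return 'a' if model_a fits at least as well as model_b, plus the log-odds."""
    score = log_likelihood(segment, model_a) - log_likelihood(segment, model_b)
    return ("a" if score >= 0 else "b"), score


# Toy training strings, invented for illustration only.
gc_rich_training = "GCGCGGCCGCGCCGGCGCGCGGCC"
at_rich_training = "ATATTAATTTAAATATTATAATTA"
gc_model = estimate_transitions(gc_rich_training)
at_model = estimate_transitions(at_rich_training)

print(classify("GCCGCGGC", gc_model, at_model))
print(classify("ATTATAAT", gc_model, at_model))
```

For gene finding, the two training sets would instead be known coding and non-coding sequences. The pseudocount keeps unseen transitions from getting probability zero, which would otherwise make the logarithm undefined.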
4. Hidden Markov Models (HMM)
• Stochastic process generated by two interrelated probabilistic mechanisms
• An underlying Markov chain with a finite number of states, and a set of random functions, each associated with its respective state
• Changing the states: according to the transition matrix
• Only the output of the random functions can be seen
• Advantage: HMMs allow local characteristics of molecular sequences to be modeled and predicted within a rigorous statistical framework, and also allow knowledge from prior investigations to be incorporated into the analysis.
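The two mechanisms can be sketched as a small generative loop: a hidden chain moves between states according to a transition matrix, and at each step the current state emits a visible symbol. The state names, symbols and probabilities below are hypothetical, chosen only to illustrate the idea.

```python
import random

# Hypothetical two-state HMM; every number here is an illustrative assumption.
TRANSITIONS = {
    "s1": {"s1": 0.9, "s2": 0.1},
    "s2": {"s1": 0.2, "s2": 0.8},
}
# The "random function" attached to each state: a distribution over outputs.
EMISSIONS = {
    "s1": {"x": 0.7, "y": 0.3},
    "s2": {"x": 0.1, "y": 0.9},
}


def draw(distribution, rng):
    """Sample one key of a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def sample_hmm(start_state, length, rng):
    """Return (hidden states, observed outputs), both of the given length."""
    states, outputs = [], []
    state = start_state
    for _ in range(length):
        states.append(state)
        outputs.append(draw(EMISSIONS[state], rng))  # visible output
        state = draw(TRANSITIONS[state], rng)        # hidden state change
    return states, outputs


rng = random.Random(1)  # fixed seed so the run is repeatable
hidden, observed = sample_hmm("s1", 15, rng)
print("observed:", "".join(observed))
print("hidden:  ", " ".join(hidden))
```

An observer sees only the `observed` string; the `hidden` list is what an HMM analysis tries to infer.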
5. HMM for DNA Sequences
• Every nucleotide in a DNA sequence belongs to either a “Normal” region (N) or a GC-rich region (R).
• N and R are not randomly interspersed: there are larger regions of (N) sequence
• Example of such a sequence:
NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN
• States of HMM: {N,R}
• Possible DNA sequence with this underlying collection:
TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA
• Not a typical random collection of nucleotides: GC content is 83% in R regions vs. 23% in N regions
• HMM: identify these types of features in sequences
• Ability to capture both the patchiness of N and R and the different compositional frequencies within the categories
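The GC percentages quoted above can be recomputed from the two example strings, and the same numbers can feed a small Viterbi decoder that recovers a most likely N/R labelling. This is only a sketch: the emission model assumes G and C are equally likely within a region (likewise A and T), and the transition and start probabilities are hypothetical values chosen to favour long runs.

```python
import math

STATE_PATH = "NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN"
DNA = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"


def gc_fraction(dna, path, state):
    """Fraction of G or C among the bases labelled with the given state."""
    bases = [b for b, s in zip(dna, path) if s == state]
    return sum(b in "GC" for b in bases) / len(bases)


gc = {s: gc_fraction(DNA, STATE_PATH, s) for s in "NR"}

# Assumption: within a region, G and C share the GC mass equally, as do A and T.
EMIT = {
    s: {"G": gc[s] / 2, "C": gc[s] / 2, "A": (1 - gc[s]) / 2, "T": (1 - gc[s]) / 2}
    for s in "NR"
}
# Assumption: hypothetical transition and start probabilities.
TRANS = {"N": {"N": 0.9, "R": 0.1}, "R": {"N": 0.2, "R": 0.8}}
START = {"N": 0.5, "R": 0.5}


def viterbi(dna):
    """Most likely hidden state path for dna, computed in log space."""
    states = list(TRANS)
    score = {s: math.log(START[s]) + math.log(EMIT[s][dna[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in dna[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print({s: round(value, 2) for s, value in gc.items()})
print(STATE_PATH)
print(viterbi(DNA))
```

Because the emission probabilities are estimated from the very sequence being decoded, the comparison between the labelled path and the decoded path is optimistic; it illustrates the mechanics rather than evaluating the model.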
6. Future Work
• Better and deeper understanding of HMM
• Different applications of HMM, such as segmentation of DNA sequences and gene finding
• Build an automaton for a simple case
7. References
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
• Birney, E. "Hidden Markov models in biological sequence analysis". July 2001:
• Haussler, David, David Kulp, Martin Reese, Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
• Boufounos, Petros, Sameh El-Difrawy, Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".