Outline
• Gene finding using HMMs
• Adding trees to HMMs
  • phyloHMM
  • N-SCAN
• BLAST + Gene Finding
  • SGP2
• Examples
Markov Sequence Models
• Key: distinguish coding/non-coding statistics
• Popular models:
  • 6-mers (5th order Markov model)
  • Homogeneous/non-homogeneous (reading-frame specific)
• Not sensitive enough for eukaryotic genes: exons too short, poor detection of splice junctions
• Simple HMMs can only encode geometric length distributions
• The length of each exon (intron):
CG © Ron Shamir, 2008 4
Length Distribution
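[Figure: two-state HMM. The exon state has self-transition probability p and leaves to the intron state with probability 1-p; the intron state has self-transition probability q and leaves to the exon state with probability 1-q.]
$P(\text{length} = k) = p^{k}\,(1 - p)$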
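A minimal sketch of the geometric length distribution above may help show why it is restrictive. The function names and the value of p are illustrative assumptions, not part of the slides.

```python
def geometric_length_pmf(k: int, p: float) -> float:
    """P(length = k) = p**k * (1 - p), the form given on the slide."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return (p ** k) * (1.0 - p)


def mean_length(p: float) -> float:
    """Mean of the distribution above: p / (1 - p)."""
    return p / (1.0 - p)


if __name__ == "__main__":
    p = 0.99  # hypothetical self-transition probability
    for k in (0, 50, 150, 300):
        print(k, geometric_length_pmf(k, p))
    print("mean length:", mean_length(p))
```

For any p < 1 the probability decreases monotonically in k, so the shortest length is always the most likely one. A single self-looping state therefore cannot represent a length distribution that peaks at an intermediate length.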
Exon Length Distribution
• The length distribution of introns is ≈ geometric
• For exons, it isn’t: it is also affected by splicing itself:
  • Too short (under 50 bp): the spliceosomes have no room
  • Too long (over 300 bp): the ends have problems finding each other
  • But as usual there are exceptions
• A different model is needed for exons
Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94)
• Instead of a single character, each state emits a sequence whose length follows some length distribution
Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94)
• Overview:
  • Hidden Markov states q1, …, qn
  • State qi has output length distribution fi
  • Output of each state can have a separate probabilistic model (weight matrix model, HMM, …)
  • Initial state probability distribution
  • State transition probabilities Tij
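The generative process described by these components can be sketched as follows. This is a toy illustration under stated assumptions: the two states, their length distributions, and the base compositions are hypothetical placeholders, not GenScan parameters.

```python
import random

# Hypothetical toy model: every number below is an illustrative assumption.
INIT = {"exon": 1.0}                                   # initial state distribution
TRANS = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}  # transitions T_ij
LENGTHS = {"exon": {3: 0.5, 6: 0.5}, "intron": {4: 1.0}}    # length distributions f_i
EMIT = {
    # Each state has its own output model: here, i.i.d. bases with
    # state-specific composition over "ACGT".
    "exon": lambda d, rng: "".join(rng.choices("ACGT", weights=[2, 3, 3, 2], k=d)),
    "intron": lambda d, rng: "".join(rng.choices("ACGT", weights=[3, 2, 2, 3], k=d)),
}


def draw(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values])[0]


def sample_ghmm(init, trans, lengths, emit, n_segments, rng=None):
    """Generate n_segments (state, sequence) pairs from a generalized HMM."""
    rng = rng or random.Random(0)
    segments = []
    state = draw(init, rng)
    for _ in range(n_segments):
        d = draw(lengths[state], rng)      # 1. choose a segment length from f_i
        seq = emit[state](d, rng)          # 2. emit a whole segment of that length
        segments.append((state, seq))
        state = draw(trans[state], rng)    # 3. move to the next state
    return segments


if __name__ == "__main__":
    for state, seq in sample_ghmm(INIT, TRANS, LENGTHS, EMIT, 4):
        print(state, seq)
```

The difference from a standard HMM is step 1: the segment length is drawn explicitly from the state's own distribution instead of arising implicitly from repeated self-transitions.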
GenScan model
• States = functional units on a gene
• The allowed transitions ensure the order is biologically consistent
• As an intron may cut a codon, one must keep track of the reading frame, hence the three intron phases:
  • Phase I0: introns between codons
  • Phase I1: introns that start after the 1st base of a codon
  • Phase I2: introns that start after the 2nd base of a codon
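The phase bookkeeping above can be illustrated with a short sketch. It assumes the input lists only the coding portions of the exons, in order, starting at the first base of the start codon; the function name is hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene.

    The phase is the number of bases of the current codon already
    read when the intron begins, i.e. the total coding length so far
    modulo 3.
    """
    phases = []
    coding_bases = 0
    # An intron follows every exon except the last one.
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


if __name__ == "__main__":
    # Exons of 100, 50 and 33 coding bases: the first intron starts after
    # 100 bases (100 % 3 == 1), the second after 150 bases (150 % 3 == 0).
    print(intron_phases([100, 50, 33]))
```

A gene finder has to carry this remainder across the intron so that the next exon resumes translation in the correct frame, which is why the model needs one intron state per phase.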
Phylogenetic HMMs
• Due to Siepel and Haussler
• A simple gene-finding HMM looks at a single Markov process:
  • Along the sequence: each position is dependent on the previous position
• If we incorporate sequences from multiple organisms, we can look at another process:
  • Along the tree: each position is dependent on its ancestor
Phylogenetic HMMs
• A simple HMM can be thought of as a machine that generates a sequence
  • Every state emits a single character
  • Multinomial distribution at every state
• A phyloHMM generates an MSA
  • Every state emits a single MSA column
  • Phylogenetic model at every state
Phylogenetic models in phyloHMM
• Defines a stochastic process of substitution
• Every position is independent
• The following process occurs:
  • A character is assigned to the root
  • Character substitutions occur along the branches, based on some substitution matrix and on the branch lengths
  • The characters at the leaves of the tree correspond to the MSA column
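The three steps above can be sketched in a few lines. As a stand-in for "some substitution matrix", this sketch assumes the Jukes-Cantor (JC69) model with a uniform root distribution; the tree, species names, and branch lengths are hypothetical.

```python
import math
import random

BASES = "ACGT"

# Hypothetical tree as nested (name, branch_length, children) tuples.
TREE = ("root", 0.0, [
    ("human", 0.1, []),
    ("rodent_ancestor", 0.05, [("mouse", 0.3, []), ("rat", 0.3, [])]),
])


def jc69_transition(parent, t):
    """JC69 probabilities of each base after time t, given the parent base."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = (1.0 - same) / 3.0
    return [same if b == parent else diff for b in BASES]


def simulate_column(tree, rng, parent_base=None):
    """Generate one MSA column: a {leaf_name: base} dictionary."""
    name, t, children = tree
    if parent_base is None:
        base = rng.choice(BASES)  # assign a character to the root
    else:
        # substitute along the branch of length t leading to this node
        base = rng.choices(BASES, weights=jc69_transition(parent_base, t))[0]
    if not children:
        return {name: base}       # leaves make up the MSA column
    column = {}
    for child in children:
        column.update(simulate_column(child, rng, base))
    return column


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(5):
        print(simulate_column(TREE, rng))
```

Only the leaf characters are returned, since the ancestral characters are not observed in an alignment.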
Phylogenetic models in phyloHMM
• Different models for different states:
  • Different substitution rates
    • E.g., in exons, we’ll see fewer substitutions
  • Different patterns of substitutions
    • E.g., third-position bias in coding sequences
  • Different tree topologies
    • E.g., following recombination
Formally
• S – set of states
• Ψ – phylogenetic models (instead of E in a standard HMM)
• A – state transitions
• b – initial probabilities
Formally
• Q – substitution rate matrix (e.g., derived from PAM)
• Π – background frequencies
• τ – the phylogenetic tree
• β – branch lengths
Formally
• P(Xi | ψi) – probability of a column Xi being emitted by the model ψi
  • Can be computed efficiently by Felsenstein’s “pruning algorithm” (recitation 6)
• Joint probability of a path in the HMM and an alignment X
• Viterbi, forward-backward etc. – as usual
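A compact sketch of the pruning computation for a single column is given below. It is not the slides' implementation: it assumes the Jukes-Cantor (JC69) substitution model, uniform background frequencies by default, and no gaps or missing data; the tree and names are hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical tree as nested (name, branch_length, children) tuples.
TREE = ("root", 0.0, [
    ("human", 0.1, []),
    ("rodent_ancestor", 0.05, [("mouse", 0.3, []), ("rat", 0.3, [])]),
])


def jc69_prob(a, b, t):
    """JC69 probability that base a becomes base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(tree, column):
    """For each base a, P(observed leaves below this node | node has base a)."""
    name, _t, children = tree
    if not children:
        return {a: 1.0 if column[name] == a else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child in children:
        child_partial = partial_likelihoods(child, column)
        t = child[1]  # length of the branch leading to this child
        for a in BASES:
            result[a] *= sum(jc69_prob(a, b, t) * child_partial[b] for b in BASES)
    return result


def column_likelihood(tree, column, background=None):
    """P(column | model): root partial likelihoods weighted by background frequencies."""
    background = background or {a: 0.25 for a in BASES}
    partial = partial_likelihoods(tree, column)
    return sum(background[a] * partial[a] for a in BASES)


if __name__ == "__main__":
    print(column_likelihood(TREE, {"human": "A", "mouse": "A", "rat": "A"}))
    print(column_likelihood(TREE, {"human": "A", "mouse": "C", "rat": "G"}))
```

Each node is visited once, so the cost is linear in the number of species instead of exponential in the number of unobserved ancestral characters.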
Simple phylo-gene-finder
[Figure: state diagram of the simple phylo-gene-finder; recoverable state labels: "Non-coding", "3rd position"]
• If the parameters are known, Viterbi can be used to find the most probable path, i.e., the segmentation into coding regions
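A minimal log-space Viterbi sketch for this segmentation step follows. The two states, the transition probabilities, and the per-column emission probabilities are placeholder numbers chosen for illustration; in a phyloHMM the emission term for state s at column i would be P(Xi | ψs). The sketch assumes at least one column.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most probable state path.

    log_init[s]     : log initial probability of state s
    log_trans[r][s] : log transition probability from r to s
    log_emit[i][s]  : log probability of column i under the model of state s
    """
    states = list(log_init)
    score = [{s: log_init[s] + log_emit[0][s] for s in states}]
    back = []
    for i in range(1, len(log_emit)):
        row, ptr = {}, {}
        for s in states:
            # best predecessor of s at position i
            prev = max(states, key=lambda r: score[-1][r] + log_trans[r][s])
            row[s] = score[-1][prev] + log_trans[prev][s] + log_emit[i][s]
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # trace back from the best final state
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state example; all probabilities are illustrative.
STATES = ("noncoding", "coding")
LOG_INIT = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {r: {s: math.log(0.9 if r == s else 0.1) for s in STATES} for r in STATES}
LOG_EMIT = (
    [{"noncoding": math.log(0.8), "coding": math.log(0.2)}] * 3
    + [{"noncoding": math.log(0.2), "coding": math.log(0.8)}] * 3
)

if __name__ == "__main__":
    print(viterbi(LOG_INIT, LOG_TRANS, LOG_EMIT))
```

Working with log probabilities avoids numerical underflow on long alignments, where products of many small probabilities would otherwise round to zero.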
Phylo-gene-finder is a good idea
• Use of phylogeny is important:
  • Imposes structure on the substitutions
  • Weights different pairs differently based on the evolutionary distance
N-SCAN
• Another phylogeny-HMM-gene-finder
• A GHMM that emits MSA columns
• Annotates one sequence at a time: the target sequence
• Distinguishes between a target sequence (T) and other informative sequences (Is) that may contain gaps
• States correspond to sequence types in the target sequence
N-SCAN
• Bayesian network instead of a simple evolutionary model
• Accounts for:
  • 5’ UTRs
  • Conserved non-coding regions
    • Highly conserved
    • No “coding” features
SGP-2
• Drawback of the described approaches: they require a meaningful alignment
  • Impossible if one of the genomes is not yet finished
  • An alignment is not necessarily “correct”
SGP-2
• A framework working on two genomes
• Idea:
  • Use BLAST to identify which positions are more/less conserved
  • Feed the BLAST scores into the gene-finding HMM
  • The BLAST results serve to modify the scores of the exons
Summary
• Different approaches for gene finding
• Adding phylogeny generally helps
• But:
  • What about genes/exons that are specific to humans?
  • Ape genomes are (almost) unavailable and too similar
• Phylogenetic help is almost essential in more difficult problems:
  • Motif finding (promoter analysis)
  • Ultraconserved regions with no evident function