Lecture 10, CS567 1
Neural Network Applications
• Problems
• Input transformation
• Network Architectures
• Assessing Performance
Problems
• Deducing the genetic code
• Predicting genes
• Predicting signal peptide cleavage sites
Deducing the genetic code
• Problem: Given a codon, predict the corresponding amino acid
• Of didactic value
– Trivial mapping table, after the fact
• Perfect classification problem, rather than prediction
– With minimal network
• Learning issues
– ‘Similar’ codons code for ‘similar’ amino acids
– Abundance of amino acids proportional to code redundancy (this and the previous point undermine the effect of mutations)
– Third base ‘wobble’
– N:1 mapping between codon and amino acid
The genetic code
http://molbio.info.nih.gov/molbio/gcode.html
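Rows: first base; columns: second base; the four codons in each cell differ in the third base.

     T            C            A            G
T    TTT Phe (F)  TCT Ser (S)  TAT Tyr (Y)  TGT Cys (C)
     TTC Phe (F)  TCC Ser (S)  TAC Tyr (Y)  TGC Cys (C)
     TTA Leu (L)  TCA Ser (S)  TAA Ter      TGA Ter
     TTG Leu (L)  TCG Ser (S)  TAG Ter      TGG Trp (W)
C    CTT Leu (L)  CCT Pro (P)  CAT His (H)  CGT Arg (R)
     CTC Leu (L)  CCC Pro (P)  CAC His (H)  CGC Arg (R)
     CTA Leu (L)  CCA Pro (P)  CAA Gln (Q)  CGA Arg (R)
     CTG Leu (L)  CCG Pro (P)  CAG Gln (Q)  CGG Arg (R)
A    ATT Ile (I)  ACT Thr (T)  AAT Asn (N)  AGT Ser (S)
     ATC Ile (I)  ACC Thr (T)  AAC Asn (N)  AGC Ser (S)
     ATA Ile (I)  ACA Thr (T)  AAA Lys (K)  AGA Arg (R)
     ATG Met (M)  ACG Thr (T)  AAG Lys (K)  AGG Arg (R)
G    GTT Val (V)  GCT Ala (A)  GAT Asp (D)  GGT Gly (G)
     GTC Val (V)  GCC Ala (A)  GAC Asp (D)  GGC Gly (G)
     GTA Val (V)  GCA Ala (A)  GAA Glu (E)  GGA Gly (G)
     GTG Val (V)  GCG Ala (A)  GAG Glu (E)  GGG Gly (G)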
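The N:1 mapping and the uneven code redundancy can be checked directly from the standard table. The sketch below is only an illustration: the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE` and `redundancy` are assumptions made for this example, and the 64-letter string is the standard code written in T, C, A, G order with `*` for Ter.

```python
from collections import Counter
from itertools import product

# Bases in the same T, C, A, G order used by the table.
BASES = "TCAG"

# Standard genetic code, one letter per codon, third base varying fastest.
# '*' marks a termination (Ter) codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

# Map each of the 64 codons to its amino acid (N:1 mapping).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid, i.e. the code redundancy.
redundancy = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    for aa, n in sorted(redundancy.items(), key=lambda kv: -kv[1]):
        print(aa, n)
```

Leu, Ser and Arg each have six codons while Met and Trp have one, which is the imbalance the learning issues on the earlier slide refer to.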
Network Architecture
• Orthogonal (one-hot) coding of the codon (4 × 3 = 12 input units)
• 2 hidden neurons (Is this a linear or non-linear problem?)
• 20 output neurons
– Winner takes all
• Total of 86 parameters (How?)
• Feed-forward network trained with back-propagation (FFBP)
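One reading that reproduces the figure of 86 parameters is a fully connected 12-2-20 network in which every hidden and output neuron also has a bias: (12 + 1) × 2 + (2 + 1) × 20 = 26 + 60 = 86. The sketch below illustrates that count and the orthogonal input coding. The function names `one_hot_codon` and `count_parameters` are hypothetical, and the bias assumption is not stated on the slide.

```python
def one_hot_codon(codon, bases="TCAG"):
    """Orthogonal coding: 4 units per base, 3 bases, 12 inputs in total."""
    vec = []
    for base in codon:
        vec.extend(1.0 if base == b else 0.0 for b in bases)
    return vec


def count_parameters(layer_sizes):
    """Weights plus one bias per neuron for a fully connected network.

    Assumption: every non-input neuron has a bias term.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


if __name__ == "__main__":
    print(one_hot_codon("ATG"))
    print(count_parameters([12, 2, 20]))  # (12+1)*2 + (2+1)*20 = 86
```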
Deducing the genetic code (Fig 6.7)
Deducing the genetic code (Fig 6.8)
Improving classification error
• Training rate high for misclassified codons, low otherwise (in addition to iteration dependence)
• Balanced cycles (Balanced in terms of amino acids, not codons)
• Adaptive training
– Present misclassified examples more often
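The two sampling ideas above can be sketched as follows. This is an illustration under stated assumptions, not the scheme used for the figures: `balanced_cycle` draws one random codon per amino acid so that each amino acid appears equally often in a cycle, and `adaptive_sample` weights misclassified examples by a hypothetical `boost` factor. Both function names and the factor of 3 are made up for this example.

```python
import random
from collections import defaultdict


def balanced_cycle(codon_to_aa, rng):
    """One training cycle with exactly one codon per amino acid."""
    by_aa = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        by_aa[aa].append(codon)
    # Pick a random representative codon for every amino acid.
    cycle = [rng.choice(codons) for codons in by_aa.values()]
    rng.shuffle(cycle)
    return cycle


def adaptive_sample(examples, misclassified, rng, k, boost=3.0):
    """Draw k examples, presenting misclassified ones more often.

    Assumption: misclassified examples get `boost` times the weight.
    """
    weights = [boost if ex in misclassified else 1.0 for ex in examples]
    return rng.choices(examples, weights=weights, k=k)


if __name__ == "__main__":
    toy = {"TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L", "CTT": "L", "ATG": "M"}
    rng = random.Random(0)
    print(balanced_cycle(toy, rng))
    print(adaptive_sample(list(toy), {"ATG"}, rng, k=10))
```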
Is it a gene or not a gene?
• Approaches depend on
– Bias at junctions of coding and non-coding regions
• Donor (5’ end of intron) and acceptor (3’ end of intron) sites have biases in composition (GT [junk]+ C/U+ AG)
– Bias in composition of coding regions (but not of non-coding regions, e.g., introns)
• Exons are “regular guys”, introns are “freshman dorm rooms”
• Seen as GC bias, codon usage frequency and codon bias
– Inverse relationship between the two (splice site strength and regularity within exons)
• “Food exit sign on highway doesn’t need prominent restaurant signs”
• “Stretch of prominent restaurant signs doesn’t need a sign indicating food”
Regularity within coding regions (Fig 6.11)
Panels: Bacteria, Mammals, C. elegans, A. thaliana
Predicting Exons: The holy GRAIL
• Neural networks for gene prediction
– Input representation/transformation is key
– NN per se trivial: MLP with single hidden layer and single output neuron
– Input = Coding region candidate, transformed to
• 6mer (di-codon) score of candidate region
• 6mer (di-codon) score of flanking regions
• GC composition of candidate region
• GC composition of flanking region
• Markov model score
• Length of candidate
• Splice site score
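Two of the input transformations listed above are simple enough to sketch. The code below is a rough illustration, not GRAIL's actual scoring: the log-odds form of `dicodon_score`, the `floor` pseudo-frequency, and the frequency tables passed in are all assumptions made for the example. The Markov model and splice site scores are omitted.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def dicodon_score(seq, coding_freq, noncoding_freq, floor=1e-6):
    """Sum of log-odds over in-frame 6-mers (di-codons).

    Assumption: hexamers missing from a table get the frequency `floor`.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 5, 3):  # step by one codon
        hexamer = seq[i:i + 6]
        p = coding_freq.get(hexamer, floor)
        q = noncoding_freq.get(hexamer, floor)
        score += math.log(p / q)
    return score


def candidate_features(candidate, flanks, coding_freq, noncoding_freq):
    """Assemble a feature vector for one coding-region candidate."""
    return [
        dicodon_score(candidate, coding_freq, noncoding_freq),
        dicodon_score(flanks, coding_freq, noncoding_freq),
        gc_content(candidate),
        gc_content(flanks),
        float(len(candidate)),
    ]
```

A vector like this would then be fed to the single-hidden-layer MLP, whose single output indicates whether the candidate is coding.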
Signal peptide (SignalP) prediction
• Signal peptides are N-terminal subsequences in proteins that act as “export tags”, including a “dotted line” (cleavage site) marking the point of detachment
– Coding is species specific
• Problem analogous to exon/intron delineation
– Distinguish between signal peptide and rest of protein
– Find junction between signal peptide and rest of protein
Signal peptide (SignalP) prediction
• Two kinds of network that output, for each position,
– S-score: Probability of classification as signal peptide
– C-score: Probability of being the junction
• Key is post-processing – using S and C scores to come up with final prediction
• C-score prediction: Based on asymmetric windows (why?)
• S-score prediction: Based on symmetric windows (why?)
• Y-score: $Y_i = (C_i \, \Delta_d S_i)^{1/2}$, where $\Delta_d S_i$ = difference between the average $S$ scores in the windows of size $d$ flanking position $i$
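The post-processing step can be sketched as below. This is a minimal illustration under assumptions the slide does not spell out: the windows are taken as the `d` positions before `i` and the `d` positions starting at `i`, negative differences are clipped to zero, and positions too close to either end get a score of zero. The function name `y_scores` is hypothetical.

```python
import math


def y_scores(c, s, d):
    """Combine per-position C and S scores into a Y score.

    c, s: lists of equal length with scores in [0, 1]
    d:    window size on each side of position i
    """
    n = len(s)
    y = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s[i - d:i]) / d  # mean S just upstream of i
        after = sum(s[i:i + d]) / d   # mean S from i onwards
        delta = max(before - after, 0.0)  # assumption: clip negatives
        y[i] = math.sqrt(c[i] * delta)
    return y


if __name__ == "__main__":
    s = [1, 1, 1, 1, 0, 0, 0, 0]
    c = [0, 0, 0, 0, 1, 0, 0, 0]
    y = y_scores(c, s, d=2)
    print(max(range(len(y)), key=y.__getitem__))  # predicted junction index
```

The Y score is large only where the C score peaks and the S score drops sharply at the same position, which is why it picks the junction more sharply than either score alone.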
Signal peptide (SignalP) prediction (Fig 6.5)