Lecture 10, CS567

Neural Network Applications

• Problems

• Input transformation

• Network Architectures

• Assessing Performance

Problems

• Deducing the genetic code

• Predicting genes

• Predicting signal peptide cleavage sites

Deducing the genetic code

• Problem: Given a codon, predict the corresponding amino acid

• Of didactic value

  – Trivial mapping table, after the fact

• Perfect classification problem, rather than prediction

  – With minimal network

• Learning issues

  – ‘Similar’ codons code for ‘similar’ amino acids

  – Abundance of amino acids proportional to code redundancy (this and previous point undermine effect of mutations)

  – Third base ‘wobble’

  – N:1 mapping between codon and amino acid

The genetic code

http://molbio.info.nih.gov/molbio/gcode.html

Rows: first base. Columns: second base. A ditto mark (") repeats the amino acid of the codon above; Ter = termination (stop) codon.

        T              C              A              G

T   TTT Phe (F)    TCT Ser (S)    TAT Tyr (Y)    TGT Cys (C)
    TTC "          TCC "          TAC "          TGC "
    TTA Leu (L)    TCA "          TAA Ter        TGA Ter
    TTG "          TCG "          TAG Ter        TGG Trp (W)

C   CTT Leu (L)    CCT Pro (P)    CAT His (H)    CGT Arg (R)
    CTC "          CCC "          CAC "          CGC "
    CTA "          CCA "          CAA Gln (Q)    CGA "
    CTG "          CCG "          CAG "          CGG "

A   ATT Ile (I)    ACT Thr (T)    AAT Asn (N)    AGT Ser (S)
    ATC "          ACC "          AAC "          AGC "
    ATA "          ACA "          AAA Lys (K)    AGA Arg (R)
    ATG Met (M)    ACG "          AAG "          AGG "

G   GTT Val (V)    GCT Ala (A)    GAT Asp (D)    GGT Gly (G)
    GTC "          GCC "          GAC "          GGC "
    GTA "          GCA "          GAA Glu (E)    GGA "
    GTG "          GCG "          GAG "          GGG "
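The "trivial mapping table" and the N:1 codon-to-amino-acid mapping can be made concrete with a short sketch. It builds the standard code as a dictionary in the same T, C, A, G order as the table and counts how many codons map to each amino acid. The names `BASES`, `AMINO_ACIDS`, `CODON_TABLE` and `REDUNDANCY` are illustrative choices, not part of the lecture.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # same row/column order as the table above

# One letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
# '*' stands for Ter (stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# product() varies the last base fastest, matching the string above.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Code redundancy: number of codons per amino acid (the N in N:1).
REDUNDANCY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(CODON_TABLE["ATG"])   # Met
    print(REDUNDANCY["L"])      # number of Leu codons
```

With the table available after the fact, the task is a lookup; the network exercise is interesting only because the network has to recover this mapping from examples.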

Network Architecture

• Orthogonal coding (4×3)

• 2 hidden neurons (Is this a linear or non-linear problem?)

• 20 output neurons

  – Winner takes all

• Total of 86 parameters (How?)

• FFBP (feed-forward network trained with backpropagation)
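One way to arrive at the 86 parameters, assuming fully connected layers and one bias per hidden and output neuron, is (12 + 1) × 2 + (2 + 1) × 20 = 26 + 60 = 86. The sketch below shows the 4×3 orthogonal (one-hot) input coding, that parameter count, and a winner-takes-all readout. The function names are illustrative, and the bias assumption is not stated on the slide.

```python
BASES = "TCAG"


def encode_codon(codon):
    """Orthogonal (one-hot) coding: 4 inputs per base, 3 bases, 12 inputs."""
    vec = []
    for base in codon:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def count_parameters(layer_sizes):
    """Weights plus one bias per non-input neuron (assumed), dense layers."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def winner_takes_all(outputs):
    """Index of the output neuron with the largest activation."""
    return max(range(len(outputs)), key=outputs.__getitem__)


if __name__ == "__main__":
    print(encode_codon("ATG"))
    print(count_parameters([12, 2, 20]))  # (12+1)*2 + (2+1)*20 = 86
```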

Deducing the genetic code (Fig 6.7)

Deducing the genetic code (Fig 6.8)

Improving classification error

• Training rate high for misclassified codons, low otherwise (in addition to iteration dependence)

• Balanced cycles (Balanced in terms of amino acids, not codons)

• Adaptive training

  – Present misclassified examples more often
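The balanced-cycle and adaptive-training ideas above can be sketched as two small sampling helpers. This is a minimal illustration under stated assumptions: `balanced_cycle` draws the same number of examples for every amino acid (so rare amino acids are not swamped by six-codon ones), and `adaptive_cycle` simply repeats misclassified examples a fixed number of extra times. Both function names and the repetition scheme are hypothetical, not the lecture's exact procedure.

```python
import random
from collections import defaultdict


def balanced_cycle(examples, per_class, rng):
    """Build one training cycle with per_class draws for each amino acid.

    examples: list of (codon, amino_acid) pairs.
    """
    by_class = defaultdict(list)
    for codon, aa in examples:
        by_class[aa].append(codon)
    cycle = [
        (rng.choice(codons), aa)
        for aa, codons in by_class.items()
        for _ in range(per_class)
    ]
    rng.shuffle(cycle)
    return cycle


def adaptive_cycle(examples, misclassified, extra_copies):
    """Present each misclassified example extra_copies additional times."""
    cycle = list(examples)
    for example in examples:
        if example in misclassified:
            cycle.extend([example] * extra_copies)
    return cycle


if __name__ == "__main__":
    toy = [("TTT", "F"), ("TTC", "F"), ("TGG", "W"),
           ("CTT", "L"), ("CTC", "L"), ("CTA", "L"), ("CTG", "L")]
    print(balanced_cycle(toy, per_class=4, rng=random.Random(0)))
    print(adaptive_cycle(toy, {("TGG", "W")}, extra_copies=2))
```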

Is it a gene or not a gene?

• Approaches depend on

  – Bias at junctions of coding and non-coding regions

    • Donor (5’ end of intron) and acceptor sites (3’ end of intron) have biases in composition (GT [junk]+ C/U+ AG)

  – Bias in composition of coding regions (but not of non-coding regions, e.g., introns)

    • Exons are “regular guys”, introns are “freshman dorm rooms”

    • Seen as GC bias, codon usage frequency and codon bias

  – Inverse relationship between the two (splice site strength and regularity within exons)

    • “Food exit sign on highway doesn’t need prominent restaurant signs”

    • “Stretch of prominent restaurant signs doesn’t need a sign indicating food”

Regularity within coding regions (Fig 6.11)

Panels: Bacteria, Mammals, C. elegans, A. thaliana


Predicting Exons: The holy GRAIL

• Neural networks for gene prediction

  – Input representation/transformation key

  – NN per se trivial: MLP with single hidden layer and single output neuron

  – Input = Coding region candidate, transformed to

    • 6mer (di-codon) score of candidate region

    • 6mer (di-codon) score of flanking regions

    • GC composition of candidate region

    • GC composition of flanking region

    • Markov model score

    • Length of candidate

    • Splice site score
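To illustrate what "input transformation" means here, the sketch below turns a candidate region and its flanks into a few of the listed features: 6mer scores, GC composition and length. It is a rough sketch, not GRAIL's actual scoring. The mean in-frame log-odds used for the 6mer score, the `floor` pseudo-frequency, and the frequency tables `coding_freq` and `noncoding_freq` are all assumptions made for illustration, and the Markov model and splice site scores are left out.

```python
import math


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def hexamer_score(seq, coding_freq, noncoding_freq, floor=1e-6):
    """Mean log-odds (coding vs. non-coding) over in-frame 6mers.

    coding_freq / noncoding_freq: hypothetical dicts mapping 6mer -> frequency.
    Unseen 6mers fall back to the floor value.
    """
    seq = seq.upper()
    hexamers = [seq[i:i + 6] for i in range(0, len(seq) - 5, 3)]
    if not hexamers:
        return 0.0
    total = sum(
        math.log(coding_freq.get(h, floor) / noncoding_freq.get(h, floor))
        for h in hexamers
    )
    return total / len(hexamers)


def feature_vector(candidate, left_flank, right_flank,
                   coding_freq, noncoding_freq):
    """Subset of the slide's inputs; each value feeds one input neuron."""
    flanks = (left_flank, right_flank)
    return [
        hexamer_score(candidate, coding_freq, noncoding_freq),
        sum(hexamer_score(f, coding_freq, noncoding_freq) for f in flanks) / 2,
        gc_fraction(candidate),
        sum(gc_fraction(f) for f in flanks) / 2,
        float(len(candidate)),
    ]
```

The resulting vector would be the input to the single-hidden-layer MLP; the network itself sees only these summary numbers, never the raw sequence.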


Signal peptide (SignalP) prediction

• Signal peptides are N-terminal subsequences in proteins that are “export tags” including a “dotted line” (cleavage site) indicating point of detachment

  – Coding is species specific

• Problem analogous to exon/intron delineation

  – Distinguish between signalP and rest of protein

  – Find junction between signalP and rest of protein


Signal peptide (SignalP) prediction

• Two kinds of network that output, for each position,

  – S-score: Probability of classification as signal peptide

  – C-score: Probability of being the junction

• Key is post-processing: using S and C scores to come up with final prediction

• C-score prediction: Based on asymmetric windows (why?)

• S-score prediction: Based on symmetric windows (why?)

• Y-score: $Y_i = \sqrt{C_i \, \Delta_d S_i}$, where $\Delta_d S_i$ = average difference in $S$ between the windows of size $d$ flanking position $i$
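The post-processing step can be sketched as follows. The Y-score is large only where the C-score is high and the S-score drops from high (inside the signal peptide) to low (mature protein). The exact window placement is an assumption here: the code compares the mean S over the `d` positions before `i` with the mean over the `d` positions starting at `i`, clamps negative differences to zero so the square root is defined, and leaves positions too close to either end at 0. The function name `y_scores` is illustrative.

```python
import math


def y_scores(c, s, d):
    """Combine per-position C- and S-scores into Y-scores.

    c, s: equal-length lists of per-position scores in [0, 1].
    d: flanking window size.
    """
    n = len(s)
    y = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s[i - d:i]) / d    # mean S just upstream of i
        after = sum(s[i:i + d]) / d     # mean S from i onward
        delta = max(before - after, 0.0)  # assumed clamp: only count drops in S
        y[i] = math.sqrt(c[i] * delta)
    return y


if __name__ == "__main__":
    s = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    c = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    y = y_scores(c, s, d=2)
    print(max(range(len(y)), key=y.__getitem__))  # predicted junction position
```

Taking the position with the maximal Y-score as the predicted cleavage site is one simple way to turn the two score tracks into a single call.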


Signal peptide (SignalP) prediction (Fig 6.5)
