Gene Predictor

Date: 20/11/2003
Implemented By: Zohar Idelson
Supervisor: Dr. Yizhar Lavner
Winter - Summer 2003

Genomic Signal Processing

• Genomic Signal Processing is a relatively new field in Bioinformatics, in which signal processing algorithms and methods are used to study functional structures in the DNA.

• An appropriate mapping of the DNA sequence into one or more numerical sequences enables the use of many digital signal processing tools.

[Diagram: DNA segment (atgcggatttgccgtcgatgtc…) → Gene Predictor → Gene, Gene]


DNA Basics

• DNA in eukaryotes is organized in chromosomes.
• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca…).
• In exons (protein-coding regions), these letters are read as triplets (codons) when the protein is built from amino acids. Every codon signals which amino acid to add (there are 20 amino acids).
• There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets × 2 directions).
• Every gene starts with a start codon and ends with a stop codon. An exon cannot contain more than one stop codon in its reading frame.
• Non-coding regions, which usually make up most of the DNA, behave far more randomly than genes.
• Genes can be detected through statistical regularities such as codon usage, nucleotide usage, periodicity, and database comparison.


Organisms

• Classified into two types:
  – Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi, …)
  – Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria)
• Not all single-celled organisms are prokaryotes!


Cells

• Complex system enclosed in a membrane
• Organisms are unicellular (bacteria, baker's yeast) or multicellular
• Humans:
  – 60 trillion cells
  – 320 cell types

[Figure: example animal cell. Source: www.ebi.ac.uk/microarray/biology_intro.htm]


DNA Basics – cont.

• DNA in eukaryotes is organized in chromosomes.


Chromosomes

• In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes
• Humans:
  – 22 pairs of autosomes
  – 1 pair of sex chromosomes

[Figure: human karyotype. Source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html]

[Figure. Source: www.biotec.or.th/Genome/whatGenome.html]


What is DNA?

• DNA: Deoxyribonucleic Acid
• Each strand is a chain of nucleotides (oligomer, polynucleotide)
• 4 different nucleotides:
  – Adenine (A)
  – Cytosine (C)
  – Guanine (G)
  – Thymine (T)


Nucleotide Bases

• Purines (A and G)
• Pyrimidines (C and T)
• Difference is in base structure

[Image source: www.ebi.ac.uk/microarray/biology_intro.htm]


DNA


Genome

• The chromosomal DNA of an organism
• The number of chromosomes and the genome size vary significantly from one organism to another
• Genome size and number of genes do not necessarily determine organism complexity


Genome Comparison

ORGANISM                              CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Human)                  23            3,200,000,000      ~30,000
Mus musculus (Mouse)                  20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)   4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)      16            14,000,000         ~6,000
Zea mays (Corn)                       10            2,400,000,000      ???


DNA Basics – cont.

• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca…).


DNA Basics – cont.

• In genes (protein-coding regions), during the construction of proteins from amino acids, these nucleotides (letters) are read as triplets (codons). Every codon signals one amino acid for the protein synthesis (there are 20 amino acids).


DNA Basics – cont.

• There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets × 2 directions).

…CATTGCCAGT…
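As an illustration, the six reading frames of the fragment CATTGCCAGT can be listed with a few lines of Python. This is a minimal sketch: the helper names `reverse_complement`, `codons` and `reading_frames` are hypothetical and are not taken from the project's code, and incomplete trailing triplets are simply dropped.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Return the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def codons(dna, offset):
    """Split dna into complete triplets, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def reading_frames(dna):
    """Return 6 codon lists: 3 offsets on each of the 2 strands."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [codons(strand, offset) for strand in strands for offset in range(3)]

frames = reading_frames("CATTGCCAGT")
for number, frame in enumerate(frames, start=1):
    print(number, frame)
```

Frames 1 to 3 read the given strand at offsets 0, 1 and 2; frames 4 to 6 read the reverse complement in the same way.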


DNA Basics – cont.

…CATTGCCAGT…

Start: ATG
Stop: TAA, TGA, TAG

[Diagram: a gene composed of exons separated by introns]


The Problem

• Given unannotated DNA, find the genes.
• In practice, find the exons and their reading frames (RF).
• Smaller-scale problem: given some annotated DNA of an organism, find the exons in unannotated DNA of the same organism.

[Diagram: atgcggatttgccgtcgatgtc… → Gene Predictor → Exon, Exon]


Solution Scheme

• Solution scheme:
  – Analyze the sequence in windows.
  – Find parameters that give a good prediction on annotated DNA (of the same organism). Learn how to distinguish exon regions from non-exon regions.
  – Extract those parameters from the unannotated DNA, and apply the discrimination rule in order to predict.
• Almost all methods shown here fit this scheme.


Organisms in the Project

C. elegans, S. cerevisiae (yeast)


Existing Methods

• Many methods rely on the pseudo-periodicity of 3 in genes. For that we define:
  – u_b is the binary indicator sequence for base b.
  – U_b is the STFT of u_b.
    • N, the window size, is in the hundreds. Exon sizes are on the order of 10^1…10^3 (in S. cerevisiae).
    • Overlapping windows.
  – There is a connection between the DFT at frequency k = N/3 and nucleotide usage.


Calculating the DFT of a DNA sequence*

S(n)   = ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG…
u_A(n) = 1000010100000111001010100000001010000
u_T(n) = 01001000001…
u_C(n) = 0010001001…
u_G(n) = 000100001001…

[Diagram: each indicator sequence passes through an N-point DFT evaluated at k = N/3, giving (1/N)·U_A(N/3), (1/N)·U_T(N/3), (1/N)·U_C(N/3) and (1/N)·U_G(N/3)]

$$X(k) = \mathrm{DFT}_N\{x(n)\} = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi}{N} n k}, \qquad 0 \le k \le N-1$$

$$U_b\!\left(\tfrac{N}{3}\right) = \sum_{n=0}^{N-1} u_b(n)\, e^{-i \frac{2\pi}{3} n}, \qquad b = A, T, C, G$$

*Silverman and Linsker 1986; Voss 1992
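The mapping from letters to indicator sequences and the DFT at k = N/3 can be sketched in Python as follows. The function names are hypothetical, the window is the first 36 bases of S(n) from this slide, and the window length is assumed to be a multiple of 3 so that k = N/3 is an integer.

```python
import cmath

def indicator(dna, base):
    """Binary indicator sequence u_b(n): 1 where dna[n] == base, else 0."""
    return [1 if symbol == base else 0 for symbol in dna.upper()]

def dft_at(x, k):
    """X(k) = sum_n x(n) * exp(-i*2*pi*n*k/N), for one frequency index k."""
    n_points = len(x)
    return sum(value * cmath.exp(-2j * cmath.pi * n * k / n_points)
               for n, value in enumerate(x))

def u_third(dna):
    """U_b(N/3) for b in A, T, C, G; len(dna) is assumed to be a multiple of 3."""
    k = len(dna) // 3
    return {base: dft_at(indicator(dna, base), k) for base in "ATCG"}

window = "ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTT"  # first 36 bases of S(n)
coefficients = u_third(window)
for base, value in coefficients.items():
    print(base, round(abs(value), 3), round(cmath.phase(value), 3))
```

Because the four indicator sequences sum to 1 at every position, the four coefficients sum to zero at any k other than 0.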


Spectrogram

• A way of showing the amplitudes of U_A, U_C, U_G and U_T together.
• Linear transform to RGB.
• Magnitude is represented by brightness.
• Finding exons visually: bright horizontal lines, usually at k = N/3.

[Figure: spectrogram. Horizontal axis: position (nucleotides); vertical axis: frequency, with N/3 marked]


Spectrogram – cont.

[Figure: DNA of C. elegans chr. III versus totally random DNA]


Power Spectrum

$$A = \frac{1}{N} U_A(k), \quad C = \frac{1}{N} U_C(k), \ \ldots \qquad k \in \{0, 1, \ldots, N/2\}$$

$$S = |A|^2 + |C|^2 + |G|^2 + |T|^2$$

• The difference between gene and non-gene regions is about one order of magnitude.
• Used at k = N/3.
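A sliding-window evaluation of S at k = N/3 might look like the sketch below. The function names, the window size, the step and the two toy sequences are assumptions made for illustration; they are not the project's data or settings.

```python
import cmath
import random

def power_at_third(window):
    """S = |A|^2 + |C|^2 + |G|^2 + |T|^2 at k = N/3, with A = U_A(N/3)/N, etc."""
    n_points = len(window)
    total = 0.0
    for base in "ACGT":
        coefficient = sum(cmath.exp(-2j * cmath.pi * n / 3)
                          for n, symbol in enumerate(window) if symbol == base)
        total += abs(coefficient / n_points) ** 2
    return total

def sliding_power(dna, window_size, step):
    """Evaluate S on overlapping windows; window_size should be a multiple of 3."""
    return [power_at_third(dna[start:start + window_size])
            for start in range(0, len(dna) - window_size + 1, step)]

random.seed(0)
periodic = "ATG" * 100                                         # perfectly period-3 toy sequence
shuffled = "".join(random.choice("ACGT") for _ in range(300))  # random toy sequence
print(sliding_power(periodic, 150, 50))
print(sliding_power(shuffled, 150, 50))
```

The perfectly periodic toy sequence gives S = 1/3 in every window, while the random one stays close to zero, which is the contrast the measure relies on.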


IIR Anti-Notch Filtering

• An IIR anti-notch filter is aimed at finding "peaks" at a chosen frequency.

All-pass:

$$A(z) = \frac{R^2 - 2R\cos(\theta)\, z^{-1} + z^{-2}}{1 - 2R\cos(\theta)\, z^{-1} + R^2 z^{-2}}$$

Anti-notch:

$$H(z) = \frac{1 - A(z)}{2}$$
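The all-pass and anti-notch pair can be tried out with the following sketch. It assumes a second-order all-pass with pole radius R and pole angle θ, as in the formulas on this slide; the values R = 0.99 and θ = 2π/3 (targeting the period-3 component) and all function names are illustrative choices, not settings taken from the project.

```python
import cmath
import math

def allpass_coefficients(radius, theta):
    """Numerator and denominator of A(z), as coefficients of z^0, z^-1, z^-2."""
    middle = -2.0 * radius * math.cos(theta)
    return [radius ** 2, middle, 1.0], [1.0, middle, radius ** 2]

def antinotch_response(radius, theta, omega):
    """Frequency response H(e^{j*omega}) = (1 - A(e^{j*omega})) / 2."""
    numerator, denominator = allpass_coefficients(radius, theta)
    z_inverse = cmath.exp(-1j * omega)
    a = (sum(c * z_inverse ** p for p, c in enumerate(numerator)) /
         sum(c * z_inverse ** p for p, c in enumerate(denominator)))
    return (1 - a) / 2

def antinotch_filter(x, radius, theta):
    """Time-domain filtering: y[n] = (x[n] - v[n]) / 2, where v is the all-pass output."""
    b, a = allpass_coefficients(radius, theta)
    allpass_out, y = [], []
    for n, sample in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        v1 = allpass_out[n - 1] if n >= 1 else 0.0
        v2 = allpass_out[n - 2] if n >= 2 else 0.0
        v = b[0] * sample + b[1] * x1 + b[2] * x2 - a[1] * v1 - a[2] * v2
        allpass_out.append(v)
        y.append((sample - v) / 2)
    return y

theta = 2 * math.pi / 3
print(abs(antinotch_response(0.99, theta, 0.0)))    # gain at DC
print(abs(antinotch_response(0.99, theta, theta)))  # gain near the period-3 frequency
```

In use, each indicator sequence would be passed through `antinotch_filter`, and a large output magnitude would mark a strong period-3 component at that position.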


Optimized Spectral Content Measure (OSCM)

$$|W|^2 = |aA + cC + gG + tT|^2$$

$$\underset{a,t,g}{\arg\max}\; \frac{E\{|aA + tT + gG|\} - E\{|aA_r + tT_r + gG_r|\}}{\operatorname{std}(|aA + tT + gG|) + \operatorname{std}(|aA_r + tT_r + gG_r|)}$$

• Find good coefficients (a, t, g) for high differentiation between exons and introns.
• C is ignored because of the linear dependency between the four sequences.
• A_r, T_r, G_r are generated from a random DNA sequence, or from introns.
• Performance: [figure]


OSCM Example

[Figure annotations: direction mistake; good forward detection; good reverse detection]


OSCM Justification

• In genes, the 4 complex variables A, T, C, G are not fully random and tend to lie near a specific angle (phase).
• In introns, the phase values seem to be purely random.
• Those characteristic angles enable us to detect the reading frame as well.


Distribution of the phase of the DFT at the frequency of 1/3 in the genes of S. cerevisiae:

[Figure: four panels showing the distributions of arg(A), arg(T), arg(C) and arg(G). Caption: argument distributions for all experimental genes in all chromosomes in S. cerevisiae]

Angular statistics shown on the panels (panel order not preserved):
angular mean = 0.3556, angular deviation = 0.4016
angular mean = -2.6862, angular deviation = 0.8416
angular mean = -1.3734, angular deviation = 0.7903
angular mean = 2.7962, angular deviation = 0.5723


Distribution of the phase of the DFT at the frequency of 1/3 in the introns of S. cerevisiae:

[Figure: four panels showing the distributions of arg(A), arg(T), arg(C) and arg(G). Caption: argument distribution for non-coding regions in all chromosomes in S. cerevisiae]


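Fourier Spectra and Position Asymmetry

$$U_b\!\left(\tfrac{N}{3}\right) = f(b,1) + f(b,2)\, e^{-i \frac{2\pi}{3}} + f(b,3)\, e^{\,i \frac{2\pi}{3}}$$

f(b,i) is the frequency (number of occurrences in the window) of the base b in the codon position i, i=1,2,3.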
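The link between the DFT at k = N/3 and the position-specific base counts can be checked numerically. The sketch below assumes that f(b,i) is a raw count within the window, that codon position 1 is aligned with n = 0, and that the DFT uses the kernel exp(-i·2π·n·k/N); the function names are hypothetical.

```python
import cmath

def u_third(dna, base):
    """U_b(N/3) computed directly from the DFT definition."""
    return sum(cmath.exp(-2j * cmath.pi * n / 3)
               for n, symbol in enumerate(dna) if symbol == base)

def position_counts(dna, base):
    """f(b, i): occurrences of base at codon positions 1, 2, 3 (n mod 3 = 0, 1, 2)."""
    return [sum(1 for n, symbol in enumerate(dna) if symbol == base and n % 3 == i)
            for i in range(3)]

def u_third_from_counts(counts):
    """f(b,1) + f(b,2)*exp(-i*2*pi/3) + f(b,3)*exp(+i*2*pi/3)."""
    rotate = cmath.exp(-2j * cmath.pi / 3)
    return counts[0] + counts[1] * rotate + counts[2] * rotate.conjugate()

window = "ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTT"
for base in "ATCG":
    counts = position_counts(window, base)
    print(base, counts, abs(u_third(window, base) - u_third_from_counts(counts)) < 1e-9)
```

When a base is spread evenly over the three codon positions the three terms cancel, so a large magnitude at N/3 reflects position asymmetry.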


Genes versus Introns

             Introns and              Coding regions
             intergenic spacers       (genes and exons)
Magnitude    small                    LARGE
Phase        Randomly distributed     Narrow distribution

[Figure: two polar plots, "Distribution of the DFT of T at 1/3 frequency" and "Distribution of the DFT of G at 1/3 frequency" (data taken from S. cerevisiae, chr. IV)]


Finding Reading Frame (OSCM Phase)

$$\Theta = \arg(aA + cC + gG + tT)$$

• Θ is concentrated around Θ_1, Θ_2 and Θ_3, corresponding to each reading frame.
• Lowering the variance of Θ with the optimization:

$$\underset{a,t,g}{\arg\max}\; \left| E\left\{ \frac{aA + gG + tT}{|aA + gG + tT|} \right\} \right|$$

• Transforming Θ to color.
• Deriving the reading frame at a glance.

Reading Frame   Color
1               Red
2               Green
3               Blue


New Methods in This Project

• Linear prediction
• Classification by clustering (CC)
• Classification by compression ratios


Linear Prediction

• Create a walk from the indicator sequences:

$$x[n] = \sum_{k=1}^{n} \left( a\,u_A[k] + c\,u_C[k] + g\,u_G[k] + t\,u_T[k] \right)$$

• For each window, find the LP coefficients. Look for differences in correlation through:
  – Pole map
  – Frequency response
  – Prediction error
• This method produced no new findings.


Classification by Clustering

• Recall: the DFT at frequency k = N/3 correlates strongly with gene locations and reading frames (as shown in part A).
• Here we attempt to use it to discriminate exons from the rest, in a 6D space.
• Learning phase: clustering
• Classification phase: fuzzy KNN


Classification by Clustering – Clustering Stage: Example

[Figure: from left to right: C, G and T. S. cerevisiae, 5th chromosome.]


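Classification by Clustering

[Diagram: classification flow. Start here: DNA = …atcgtgactagc… → indicator sequences u_T, u_C, u_G → DFT(k=N/3) of each → new sample (T, C, G). The sample is compared with the RF = 1 centroids as is, and after rotations of +120° and -120°, to test the three candidate reading frames (RF =? 1, RF =? 2, RF =? 3). The maximum score is compared with a threshold to answer "Exon?", and it also gives the reading frame (if it's an exon).]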


Classification Rule

• Fuzzy KNN: create a fuzzy membership function and choose the class with the highest score. Add a fuzzy clustering iteration to the LBG algorithm.
• Two methods for classifying gene/non-gene:
  – Sum the gene scores and the non-gene scores; the larger sum wins.
  – The centroid with the maximum score wins.
• The 2nd method is used (better performance). Score sums are used for the reading frame: the reading frame with the maximum sum wins.
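A fuzzy KNN membership function can be sketched as below. This follows the commonly used inverse-distance weighting with fuzzifier m, which is an assumption here since the slides do not spell out the membership formula; crisp training labels, the function name and the 2D toy points are also illustrative (the project works with centroids in a 6D space).

```python
import math

def fuzzy_knn(sample, points, labels, k=7, m=2.0):
    """Return {label: membership} for sample, using its k nearest labelled points."""
    nearest = sorted(zip(points, labels),
                     key=lambda pair: math.dist(sample, pair[0]))[:k]
    weights = {}
    for point, label in nearest:
        distance = max(math.dist(sample, point), 1e-12)  # guard against a zero distance
        weight = 1.0 / distance ** (2.0 / (m - 1.0))
        weights[label] = weights.get(label, 0.0) + weight
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["non-exon"] * 3 + ["exon"] * 4
memberships = fuzzy_knn((5.2, 5.1), points, labels, k=7, m=2.0)
print(max(memberships, key=memberships.get), memberships)
```

Summing memberships per class corresponds to the first decision method listed above; taking the single best-scoring centroid corresponds to the second.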


Results

• Organism: S. cerevisiae.
• Learning was done on the 5th chromosome.
• Parameters:
  – K = 7 and m = 2 for fuzzy KNN.
  – True exon: 50% exon.
  – Thresh = 1.
• Total: only 4.6% of true exons weren't detected at all.

#       f_p      f_n      rf_true   f_n_exons   # exons   # missed
1       0.1037   0.4524   0.9574    0.0882      102       9
2       0.0821   0.4735   0.9685    0.0472      381       18
3       0.0917   0.4618   0.9551    0.071       155       11
4       0.0821   0.4615   0.9654    0.029       725       21
6       0.1102   0.4247   0.9762    0.05        120       6
7       0.0821   0.4749   0.9647    0.0258      504       13
8       0.103    0.4716   0.9671    0.0456      263       12
9       0.1091   0.452    0.9476    0.04        200       8
10      0.1005   0.4719   0.9723    0.0293      341       10
11      0.0822   0.4816   0.9641    0.0703      327       23
12      0.0973   0.4759   0.9722    0.0514      486       25
13      0.0885   0.4682   0.9607    0.0365      438       16
14      0.1041   0.4597   0.9616    0.0397      378       15
15      0.0904   0.4644   0.9665    0.0311      514       16
16      0.0824   0.4744   0.9662    0.0452      442       20
Total                                           5376      223


CC - Example


CC - Improving

• Instead of deciding for each reading frame separately and then deciding which reading frame "won", we can replicate the centroids for the other reading frames, so that the classification rule determines [exon / non-exon] + [reading frame] at the same time. This is expected to give a fairer competition between the reading frames.
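One way to replicate centroids across reading frames is to rotate them in the complex plane, since shifting a sequence by one base multiplies its DFT coefficient at k = N/3 by a fixed 120° phase factor. The sketch below assumes each 6D centroid is stored as three complex numbers (T, C, G) learned for reading frame 1; which rotation direction belongs to which frame number is an assumption, as are all names.

```python
import cmath

ROTATION = cmath.exp(-2j * cmath.pi / 3)  # effect of a one-base shift on U_b(N/3)

def replicate_centroids(centroids):
    """Map reading frame -> centroids, each (T, C, G) triple rotated by 0, 1 or 2 steps."""
    return {frame: [tuple(value * ROTATION ** (frame - 1) for value in centroid)
                    for centroid in centroids]
            for frame in (1, 2, 3)}

def u_third(indicator):
    """DFT of an indicator sequence at k = N/3."""
    return sum(v * cmath.exp(-2j * cmath.pi * n / 3) for n, v in enumerate(indicator))

u = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0]  # toy indicator sequence, N = 12
shifted = u[-1:] + u[:-1]                  # circular shift by one base
print(abs(u_third(shifted) - u_third(u) * ROTATION) < 1e-9)

frames = replicate_centroids([(1 + 0j, 0.5j, -1 + 0j)])
print(frames[2])
```

With the replicated set, a single nearest-centroid (or fuzzy KNN) decision can return both the exon / non-exon label and the reading frame.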


Classification by Compression Rates

Nucleotides ('A','C','T','G'): A T C G A T C G T A C G C A T G C A T G C A T G C A T G A A A A
Codons (0..63): 60…11829

• In forward coding, create 3 different codon sequences.
• To classify reverse coding, first complement all the DNA, then treat it like forward coding (the results will also be reversed).
• At the end of this stage, we have 6 codon series.
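Mapping triplets to integers in 0..63 can be done by reading each codon as a base-4 number, as in this sketch. The digit assignment (A=0, C=1, T=2, G=3, following the order in which the slide lists the nucleotides) and the positional weighting are assumptions, since the slides do not give the actual code table; the function names are hypothetical.

```python
BASE_VALUE = {"A": 0, "C": 1, "T": 2, "G": 3}  # assumed digit assignment

def codon_index(codon):
    """Map a triplet to an integer in 0..63 by reading it as a base-4 number."""
    first, second, third = (BASE_VALUE[base] for base in codon.upper())
    return 16 * first + 4 * second + third

def forward_codon_sequences(dna):
    """Return the 3 forward codon-index sequences (offsets 0, 1, 2)."""
    return [[codon_index(dna[i:i + 3]) for i in range(offset, len(dna) - 2, 3)]
            for offset in range(3)]

dna = "ATCGATCGTACGCATGCATGCATGCATGAAAA"
for offset, sequence in enumerate(forward_codon_sequences(dna)):
    print(offset, sequence)
```

The three reverse sequences would be obtained by applying the same function to the complemented DNA, as described above.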


The Idea

• If we have a dictionary of words (= codon sequences) that are popular in exons but not popular in non-exons, then:
  – Good compression will be achieved in exons
  – Good compression will not be achieved in introns
• So we need a good dictionary and a good compression algorithm


Building the Dictionary

• Aim: the output dictionary is expected to hold short words that are popular in exons.
• Using the LZW algorithm.
• Input: all exons of the learned chromosome.
• Initial dictionary: all codons.
• Add a restriction on the length of words entered into the dictionary.
• Output I: a dictionary with words that appeared in exons.
• Output II: the code of the exons according to the dictionary.


LZW: Encoding

1) accum ← first input letter
2) If dict.Find(accum) == false:
   1) dict.Add(accum)
   2) code.Add(index)
   3) accum ← accum(end)
   4) Return to (2)
3) Else:
   1) index = dict.FindWhere(accum)
   2) accum.Add(next letter from input)
   3) Return to (2)
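The encoding loop can be written compactly in Python. This is a sketch of standard LZW over integer symbols with an optional cap on word length (the restriction mentioned for building the dictionary); the function name, the parameter names and the tuple-keyed dictionary are illustrative choices rather than the project's implementation.

```python
def lzw_encode(symbols, alphabet_size=64, max_word_len=None):
    """Encode a sequence of symbols in 0..alphabet_size-1; return (codes, dictionary)."""
    # The initial dictionary holds every single symbol (every codon).
    dictionary = {(s,): s for s in range(alphabet_size)}
    codes = []
    accum = ()
    for sym in symbols:
        candidate = accum + (sym,)
        if candidate in dictionary:
            accum = candidate              # keep extending the current word
            continue
        codes.append(dictionary[accum])    # emit the longest known word
        if max_word_len is None or len(candidate) <= max_word_len:
            dictionary[candidate] = len(dictionary)
        accum = (sym,)                     # restart from the last symbol
    if accum:
        codes.append(dictionary[accum])    # flush the final word
    return codes, dictionary

codes, dictionary = lzw_encode([0, 1, 0, 1, 0, 1, 0], alphabet_size=2)
print(codes)
print(sorted(dictionary.items(), key=lambda item: item[1]))
```

The ratio between the number of input codons and `len(codes)` is one possible compression measure for a window.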


Dictionary Pruning

• The output LZW dictionary is a tree (trie).
• Aim: keep the most popular words, but don't allow undesired redundancy.
• Method:
  – Go over every level of the tree (starting with the maximum-length words) and take a predefined number of popular words.
  – Pass the number of appearances (from the output code) to the parents: pass the sum of all, OR pass the sum of the untaken. More variations: multiply by the entropy.


Using Entropy for Better Pruning

Example 1: parent [31 45 1] with children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31]; child counts 6, 6, 6, 6.
  24*log(4) = 48

Example 2: parent [31 45 1] with a single child [31 45 1 30]; child count 40.
  40*log(1) = 0

Example 3: parent [31 45 1] with children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31]; child counts 1, 20, 1, 2.
  24*(-1)*[5/6*log(5/6) + 2*1/24*log(1/24) + 1/12*log(1/12)] = 24*0.9000 = 21.60

(log is base 2.)
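The entropy weighting can be reproduced with a small helper. The sketch assumes the score of a parent is the total count of its children multiplied by the base-2 entropy of the child-count distribution, which matches the 24*log(4) = 48 and 40*log(1) = 0 examples; the function name is hypothetical.

```python
import math

def entropy_score(child_counts):
    """Total child count times the base-2 entropy of the child distribution."""
    total = sum(child_counts)
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in child_counts if count > 0)
    return total * entropy

print(entropy_score([6, 6, 6, 6]))   # four equally popular children
print(entropy_score([40]))           # a single child
print(entropy_score([1, 20, 1, 2]))  # one dominant child
```

A parent whose continuations are spread evenly scores high and is worth keeping as a word of its own, while a parent that is almost always followed by the same codon scores low because the longer word already covers it.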


Compression Rates Classification

1. Input: DNA of a chromosome and a gene-based dictionary
2. 6 codon sequences for the 6 different reading frames
3. Compressing with the gene-based dictionary
4. 6 compression-rate vectors
5. Rf_wins = Argmax{compress_rate(rf), thresh}
   Lowerthresh = Argmax{compress_rate(rf), lower-thresh}
   Too_much_stops = 1 if the window has more than 1 stop codon
6. 6 binary vectors + post-processing data
7. Post processing
8. 6 binary vectors – the final classification


Post Processing

• Lower-threshold technique: tag as true every window that lies between close, already-tagged windows, if its value is larger than the lower threshold.
• Stop-codon count in the window: more than one => not an exon window (this window is larger than the analysis window size).
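Both post-processing rules are easy to prototype. In the sketch below the threshold 0.457 is the THRESH value from the results table, while the lower threshold, the `max_gap` definition of "close" and the function names are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def count_stops(window, offset=0):
    """Count stop codons in a window, read in the frame starting at offset."""
    return sum(window[i:i + 3] in STOP_CODONS
               for i in range(offset, len(window) - 2, 3))

def fill_gaps(rates, thresh, lower_thresh, max_gap):
    """Tag windows above thresh, then tag windows lying between two close tagged
    windows whenever their own value exceeds lower_thresh."""
    tags = [rate > thresh for rate in rates]
    tagged = [i for i, tag in enumerate(tags) if tag]
    for left, right in zip(tagged, tagged[1:]):
        if 0 < right - left - 1 <= max_gap:
            for i in range(left + 1, right):
                if rates[i] > lower_thresh:
                    tags[i] = True
    return tags

rates = [0.6, 0.4, 0.42, 0.7, 0.1, 0.1, 0.1, 0.1, 0.8]
print(fill_gaps(rates, thresh=0.457, lower_thresh=0.35, max_gap=2))
print(count_stops("ATGTAAGGGTGA"))
```

A window would then be rejected as an exon window when `count_stops` returns more than 1 for the candidate reading frame.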


Compression Rates: Example


Stop Codon Usage

• 100,000 bases of the 2nd chromosome
• 1 where there is at most one stop codon in the window


Post Processing: Stop-codon Usage

• Stop codon usage cleans up many potential false positives without harming any success measure.
• Hence, a lower principal threshold can be chosen, which gives better performance.

[Figure: without stop codon usage]


Compression Rates: Results

• Learned chromosome = 1st, window size = 100c, dictionary size = 1381 (32 codons, branching = 3)
• After choosing the best configuration, going over all the chromosomes:

#       f_p        f_n       rf_true   f_n_exons   # exons   # miss   THRESH
2       0.10442    0.13809   0.93866   0.046875    384       18       0.457
3       0.10015    0.16098   0.92234   0.03871     155       6        0.457
4       0.084271   0.14014   0.93809   0.036986    730       27       0.457
5       0.090556   0.13763   0.92723   0.034483    261       9        0.457
6       0.13909    0.14274   0.92495   0.041667    120       5        0.457
7       0.12053    0.14733   0.93927   0.027723    505       14       0.457
8       0.15057    0.14538   0.93362   0.059925    267       16       0.457
9       0.13161    0.13816   0.92458   0.045       200       9        0.457
10      0.12222    0.12411   0.93447   0.03207     343       11       0.457
11      0.07833    0.14575   0.93712   0.069069    333       23       0.457
12      0.14106    0.13654   0.9405    0.064777    494       32       0.457
13      0.11051    0.14338   0.92814   0.040816    441       18       0.457
14      0.15044    0.15475   0.93434   0.026525    377       10       0.457
15      0.089995   0.14578   0.9357    0.044231    520       23       0.457
16      0.12039    0.13794   0.93657   0.033784    444       15       0.457
total                                  0.042339    5574      236


Compression Rates: Improving

• Use a non-exon dictionary, or prune the exon dictionary by taking into account words that are common in non-exons.
• Adaptive dictionary: when detecting an exon, use its common words to update the current dictionary.