Splice Site Prediction Using Artificial Neural Networks (Journal Club, Dec 24, 2015, Hiroya MORIMOTO)


Page 1

Splice Site Prediction Using Artificial Neural Networks

Journal Club, Dec 24, 2015, Hiroya MORIMOTO

Page 2

Information

•  Title: Splice Site Prediction Using Artificial Neural Networks
•  Authors: Øystein Johansen, Tom Ryen, Trygve Eftestøl, Thomas Kjosmoen, and Peter Ruoff
•  Institutional affiliation: University of Stavanger, Norway
•  Publishing year: 2009

Splice Site Prediction Using Artificial Neural Networks

Page 3

0. ABSTRACT

•  SQ: SS prediction using NNs is an established approach.
•  Methods: after applying the NN, the output values are fitted to a parabolic function, which makes accurate SS prediction possible.
•  Data: from Arabidopsis genes, 16,965 genes -> training, 5,000 genes -> benchmark, 20 genes -> verification.
•  Result: at best, Sn = 0.891, Sp = 0.816 and CC = 0.552 were achieved.

A paper that uses ANNs to predict DSSs/ASSs. Tested on Arabidopsis thaliana, it achieved high accuracy.

Abbreviations 1: ANN = Artificial neural network, SS = Splice site DSS = Donor splice site ASS = Acceptor splice site

Abbreviations 2: CC = Correlation coefficient, Sp = Specificity, Sn = Sensitivity, TP = True positive, FP = False positive, TN = True negative, FN = False negative

Page 4

1. INTRODUCTION

•  SQ: some of the data in the DBs already have their genes and promoters annotated through genetics and biochemical methods.
•  Sequence data keeps growing day by day, and much of it remains unannotated.
•  This study: aimed to predict DSSs (donor splice sites) and ASSs (acceptor splice sites) using artificial neural networks.

Let's harness computing power to annotate the ever-growing stream of sequence data at low cost!

(Figure: an example sequence ATGCGATTTAGC AGCGCGAATAGG GTGTCAGTTAAGCTGAG with callouts "This is a DSS!", "This is an ASS!", "Not an ASS..."; segments labelled exon, exon, intron.)

Page 5

About neural networks – Architecture: Neural networks (artificial neural networks) are classifiers modelled on how biological neurons work, and they are built from several layers.

Input layer: external information is handed to each unit, usually as values between 0 and 1; the units integrate and process it and pass the result to every unit of the next layer.

Hidden layer: each unit integrates the inputs from all units of the preceding layer, processes them, and passes the result to every unit of the next layer.

Output layer: each unit integrates the inputs from all units of the preceding layer, processes them, and outputs a value between 0 and 1.

(Figure label: direction of information flow)

Input calculator: if the external information is not numerical, or not within the 0-1 range, it is converted before being fed to the input layer.

Page 6

About neural networks – Unit: in each unit, the input is the sum of the previous layer's unit outputs, each multiplied by its weight. That sum is fed into a function, and the function's value becomes the unit's output.

(Figure: inputs g_{i-1}, g_i, g_{i+1} with weights w_{i-1}, w_i, w_{i+1}, and a sigmoid curve.)
1. Sum the weighted inputs: Σ g_i w_i
2. Feed the sum into the sigmoid function
3. The sigmoid output h_j is emitted by the unit

Page 7

About neural networks – Unit (with bias): as before, each unit sums the weighted outputs of the previous layer's units, but a bias b is subtracted from the sum before it enters the function; a sketch in code follows below.

(Figure: the same unit diagram, now with a bias term b.)
1. Sum the weighted inputs: Σ g_i w_i
2. Subtract the bias b and feed the result into the sigmoid function
3. The sigmoid output h_j is emitted by the unit
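A minimal sketch of this unit computation (my own illustration, not code from the paper; the function name and the example values are made up):

    import math

    def unit_output(inputs, weights, bias, beta=1.0):
        """One unit: weighted sum of the previous layer's outputs,
        minus a bias, passed through the sigmoid activation."""
        s = sum(g * w for g, w in zip(inputs, weights)) - bias
        return 1.0 / (1.0 + math.exp(-beta * s))

    # Example: a unit receiving three inputs from the previous layer.
    h_j = unit_output(inputs=[0.2, 0.9, 0.4], weights=[0.5, -0.3, 0.8], bias=0.1)
    print(h_j)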

Page 8

About neural networks – Backpropagation

•  The flow of backpropagation
  1.  Decide the initial parameters (often at random).
  2.  Run forward propagation on the training data (feed the data through the NN as usual and produce outputs).
  3.  Convergence check
      •  Error: the total error over all output units across all training data is compared with a preset threshold; if it is still outside the threshold, training continues.
  4.  Gradient computation
      •  For every weight and bias, an update is computed starting from the output layer, and these updates are propagated back toward the input layer (= backpropagation).
      •  A learning rate, η (eta), is used in this computation (the larger it is, the faster the network learns).
  5.  Correct the parameters.
  6.  Return to step 2 and repeat until a preset number of epochs is reached or the error falls within the threshold.
•  *Epoch: one complete pass through the training data during training is called an epoch.

Backpropagation is the standard method used for training.

Page 9

About neural networks – Backpropagation: how training proceeds, illustrated with y = x^2 as an example (a runnable sketch follows below).

Training data (input <-> desired label): 0.2 <-> 0.04, 0.5 <-> 0.25, -0.9 <-> 0.81

1st epoch: the inputs 0.2, 0.5, -0.9 produce the outputs 0.01, 0.9, 0.8 against the desired 0.04, 0.25, 0.81.
Error = ((0.04-0.01)^2 + (0.25-0.9)^2 + (0.81-0.8)^2)/2 = 0.21175. There is an error!

2nd epoch: the inputs 0.2, 0.5, -0.9 produce the outputs 0.05, 0.7, 0.84 against the desired 0.04, 0.25, 0.81.
Error = ((0.04-0.05)^2 + (0.25-0.7)^2 + (0.81-0.84)^2)/2 = 0.10175. Still an error, but it shrank!

3rd epoch: the inputs 0.2, 0.5, -0.9 produce the outputs 0.043, 0.6, 0.83 against the desired 0.04, 0.25, 0.81.
Error = ((0.04-0.043)^2 + (0.25-0.6)^2 + (0.81-0.83)^2)/2 = 0.0614545. Still an error, but it shrank again!
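To make the example concrete, here is a toy sketch (my own, not the authors' program) of standard backpropagation on the same y = x^2 data: a tiny 1-3-1 sigmoid network, squared-error loss E = ½ Σ (d − o)^2 and a constant learning rate η. The exact error values depend on the random initialisation; the point is that the error shrinks from epoch to epoch.

    import math, random

    random.seed(0)

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Toy training set for y = x^2, as on the slide.
    data = [(0.2, 0.04), (0.5, 0.25), (-0.9, 0.81)]

    H = 3                                              # hidden units
    w_in  = [random.uniform(-1, 1) for _ in range(H)]  # input -> hidden weights
    b_h   = [random.uniform(-1, 1) for _ in range(H)]  # hidden biases
    w_out = [random.uniform(-1, 1) for _ in range(H)]  # hidden -> output weights
    b_o   = random.uniform(-1, 1)                      # output bias
    eta   = 0.5                                        # learning rate

    for epoch in range(1, 2001):
        sse = 0.0
        for x, d in data:
            # Forward pass (net input = weighted sum minus bias, as on the slide).
            h = [sigmoid(w_in[j] * x - b_h[j]) for j in range(H)]
            o = sigmoid(sum(w_out[j] * h[j] for j in range(H)) - b_o)
            sse += 0.5 * (d - o) ** 2

            # Backward pass: error terms (deltas) for the output and hidden units.
            delta_o = (o - d) * o * (1 - o)
            delta_h = [h[j] * (1 - h[j]) * w_out[j] * delta_o for j in range(H)]

            # Gradient-descent updates (the bias is subtracted, hence the '+=').
            for j in range(H):
                w_out[j] -= eta * delta_o * h[j]
                w_in[j]  -= eta * delta_h[j] * x
                b_h[j]   += eta * delta_h[j]
            b_o += eta * delta_o

        if epoch in (1, 2, 3, 2000):
            print(f"epoch {epoch}: error = {sse:.5f}")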

Page 10

2. NEURAL NETWORK
Prediction system overview: an ANN that takes a fixed-length stretch of DNA sequence as input and outputs whether a DSS/ASS is present within that window.

[Excerpt from the paper, p. 103:]

... are calculated from the input calculator. For each step of the sliding window the neural network will give an output score if it recognizes there is a splice site in the window. A diagram of the entire prediction system is shown in Fig. 2. The window size is chosen to be 60 nucleotides. This is hopefully wide enough to find significant patterns on both sides of the splice site. A bigger window will make the neural network bigger and thereby harder to train. A smaller window would maybe exclude important information around the splice site.

Fig. 1. The connection between the DNA sequence and the neural network system. A sliding window covers 60 nucleotides, which is calculated to 240 input units to the neural network. The neural network feedforward calculates a score for donor and acceptor splice site.

2.1 Network Topology

The neural network structure is a standard three layer feedforward neural network.* There are two output units corresponding to the donor and acceptor splice sites, 128 hidden layer units and 240 input units. The 240 input units were used since the orthogonal input scheme uses four inputs for each nucleotide in the window. The neural network program code was reused from a previous study, and in this code the number of hidden units was hard coded and optimized for 128 hidden units. There is also a bias signal added to the hidden layer and the output layer. The total number of adjustable weights is therefore (240 × 128) + (128 × 2) + 128 + 2 = 31106.

* This kind of neural network has several names, such as multilayer perceptrons (MLP), feed forward neural network, and backpropagation neural network.

Input calculator: each base contributes the input for 4 units, e.g. A -> (1,0,0,0).

Page 11

2. NEURAL NETWORK (cont'd)
•  Network topology
  - A 3 layer feedforward neural network
  - 2 output units (DSS/ASS)
  - 128 hidden layer units
  - 240 input units (= 60 bp)
    - Each base maps to 4 units, e.g. A -> (1,0,0,0).
  - 31,106 parameters in total (see the check below)
    - 240 x 128 + 128 x 2 + 128 + 2, bias parameters included.
•  Activation function
  - The standard sigmoid function is used (β = 0.1).
•  Backpropagation
  - When computing how the error changes at each unit, nothing like second derivatives is used; this presumably means that first derivatives are used.
  - When adjusting the parameters, the goal is to drive the derivative of the error to 0, i.e. a (local) optimum where the error no longer changes.

(Figure: the standard sigmoid plotted over roughly [-50, 50].)
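As a quick sanity check of the 31,106 figure (an illustrative snippet, not the authors' code; the constant names are my own):

    # Topology from the paper: 60 nt x 4 inputs, 128 hidden units, 2 outputs (DSS/ASS).
    N_INPUT, N_HIDDEN, N_OUTPUT = 240, 128, 2

    weights = N_INPUT * N_HIDDEN + N_HIDDEN * N_OUTPUT   # layer-to-layer weights
    biases  = N_HIDDEN + N_OUTPUT                        # one bias per hidden/output unit
    print(weights + biases)                              # 31106 adjustable parameters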

Page 12

3. TRAINING DATA & BENCHMARKING DATA
•  Data
  - The Arabidopsis Information Resource (TAIR) release 8: a database for Arabidopsis thaliana.
•  Excluded genes
  - To prevent several SSs of the same kind from entering the ANN input at the same time, and to remove factors that would hurt (complicate) training, genes meeting the following conditions were removed:
    - genes containing N gaps
    - single exon genes
    - genes containing exons/introns of 30 bp or less
    - genes with alternative splicing
      - only one variant was kept and the others were removed.
•  Training data set & benchmark data set
  - The remaining 21,985 genes were randomly split into 16,965 and 5,000 genes for these two sets.
  - 20 genes were set aside for final verification.
  - Each gene set contains genes from all 5 chromosomes.

(Figure: the remaining data set is divided into Training, Benchmark and Final verification sets.)

Page 13

3. TRAINING DATA & BENCHMARKING DATA (cont'd)
•  The relationship between the training data, the benchmark data and the final verification data:

(Figure: training iterations and the three data sets, with performance measurements)
•  Training data (16,965 genes): used for training; fed in at every epoch.
•  Benchmark data (5,000 genes): used to measure the progress of training; used at the end of every epoch.
•  Final verification data (20 genes): used to evaluate the accuracy of the finally constructed NN (= verification).

Page 14

4. TRAINING METHOD

•  Input data
  - A 60 bp window is placed on each gene sequence and slid along 1 bp at a time.
  - The input calculator converts each window into binary values, which form the input data.
•  Desired labels
  - There are two output layer units, corresponding to DSS and ASS respectively.
  - Basically, an output of 1 from a unit supports the presence of that kind of SS.

The neural network was trained with standard backpropagation. The input data and the corresponding desired labels are illustrated below (an encoding sketch follows at the end of this slide).

...AACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTTGTAGTGAGGGCTTTCGTGGTAAGATTGGTCATAAA…

Input calculator

Input data: 001000100100100010001000010000100010100010000010000101000100…1000010001000100010000100100100000100100

Desired labels: 0 – 1 (for DSS), 0 – 1 (for ASS)

Together these form one training example.
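A sketch of what the input calculator might look like (the names calculate_input and ENCODING are my own; only the orthogonal scheme A=1000, T=0100, G=0010, C=0001 comes from the paper):

    # Orthogonal encoding of a 60-nt window into 240 binary inputs.
    ENCODING = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0), "G": (0, 0, 1, 0), "C": (0, 0, 0, 1)}
    WINDOW = 60

    def calculate_input(window):
        assert len(window) == WINDOW
        bits = []
        for base in window:
            bits.extend(ENCODING[base])
        return bits                               # 240 values, each 0 or 1

    seq = "ACGT" * 20                             # dummy 80-nt sequence
    print(calculate_input(seq[:WINDOW])[:8])      # first two nucleotides -> 8 inputs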

Page 15

4. TRAINING METHOD (cont'd)

•  Desired labels (in detail)
  - When a true SS sits at the centre of the window it is given the maximum value 1; as the SS moves away from the centre, the value approaches the minimum 0 in proportion to the offset, as defined by the score function below.

Instead of a binary 0-or-1 desired label, a value that changes continuously with the position of the SS inside the window is used.

[Excerpt from the paper, p. 106:]

Fig. 2. Score function to calculate the desired output of a sliding window. The example window in the figure has a splice site after 21 nucleotides. This makes the desired output for the acceptor splice site 0.7.

In some cases there may be two splice sites in a window. It is then one acceptor splice site and one donor splice site.

4.4 Algorithm for Training on One Single Gene

The algorithm for training on a single gene is very simple. The program loops through all the possible window positions. For each position, the program computes the desired output for the current window, computes the neural net input and then calls Backpropagation() with these computed data. See Algorithm 1.

Algorithm 1. Training the neural net on one single gene
 1: procedure TrainGene(NN, G, η)        ◃ Train the network on gene G
 2:     n ← length[G] − L_W
 3:     for i ← 0 to n do
 4:         W ← G[i..(i + L_W)]          ◃ Slice the gene sequence
 5:         desired ← CalculateDesired(W)
 6:         input ← CalculateInput(W)
 7:         Backpropagation(NN, input, desired, η)
 8:     end for
 9: end procedure

In this algorithm, NN is a composite data structure that holds the neural network data to be trained. G is one specific gene that the neural network should be trained on. The first integer variable n is simply the total number of positions of the sliding window. L_W is the number of nucleotides in the sliding window, which in this study is set to 60 nucleotides.

The desired output is calculated as described in Section 4.3, and is listed in Algorithm 2.

[Excerpt from the paper, p. 105:]

4 Training Method

The neural network training is done using standard backpropagation. This section describes how the neural network inputs were calculated and how the desired output was obtained.

4.1 Sliding Window

For each gene in the training set, we let a window slide over the nucleotides. The window moves one nucleotide each step, covering a total of L_G − L_W + 1 steps, where L_G is the length of the gene and L_W is the length of the window. As mentioned earlier, the length of the window is 60 nucleotides in this study.

4.2 Input to the Neural Network

For each nucleotide in the sliding window, we have four inputs to the neural net. The four inputs are represented as an orthogonal binary vector (A=1000, T=0100, G=0010, C=0001). This input description has been used in several other studies [5], and is described in Baldi and Brunak [1]. This input system is called orthogonal input, due to the orthogonal binary vectors. According to Baldi and Brunak [1] this is the most used input representation for neural networks in the application of gene prediction. This input scheme also has the advantage that each nucleotide input is separated from each other, such that no arithmetic correlation between the monomers need to be constructed.

4.3 Desired Output and Scoring Function

The task is to predict splice sites, thus the desired output is 1.0 when a splice site is in the middle of the sliding window. There are two outputs from the neural network: one for indicating acceptor splice site and one for indicating donor splice site.

However, if it is only a 1.0 output when a splice site is in the middle of the window, and 0.0 when a splice site is not in the middle of the window, there will probably be too many 0.0 training samples that the neural network would learn to predict everything as 'no splice site'. This is why we introduce a score function which calculates a target output not only when the splice site is in the middle of the window, but whenever there is a splice site somewhere in the window. We use a weighting function where the weight of a splice site depends on the distance from the respective nucleotide to the nucleotide at the window mid-point. The further from the mid point of the window this splice site is, the lower value we get in the target values. The target values decrease linearly from the mid point of the window. This gives the score function as shown in Eq. 2:

    f(n) = 1 − |1 − 2n / L_W|    (2)

If a splice site is exactly at the mid point, the target output is 1.0. An example window is shown in Fig. 2.

In the example window, n = 21 and L_W = 60, so f(21) = 1 − |1 − 42/60| = 0.7 (see the sketch below).

*SS: splice sites

(Figure label: a true splice site)
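A small sketch of the score function of Eq. 2 (my own code; score is an illustrative name). With L_W = 60 and a splice site after n = 21 nucleotides it reproduces the 0.7 of Fig. 2:

    L_W = 60   # window length used in the paper

    def score(n, L_W=L_W):
        """Desired output for a splice site located n nucleotides into the
        window (Eq. 2): 1.0 at the mid-point, falling off linearly to 0."""
        return 1.0 - abs(1.0 - 2.0 * n / L_W)

    print(round(score(21), 3))   # ~0.7, the example from Fig. 2
    print(round(score(30), 3))   # 1.0, a splice site exactly at the mid-point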

Page 16

5. EVALUATION METHOD

•  Input data (same conditions as when the training data were built)
  - A 60 bp window is placed on each gene sequence and slid along 1 bp at a time.
  - The input calculator converts each window into binary values, which form the input data.
•  Output data
  - For each base, the outputs of all the inputs whose windows contained that base are summed.
  - The sum is then normalised, apparently by the window length (?).

The neural network computes a DSS-likeness and an ASS-likeness, so every base receives a score (a sketch of one possible reading of this follows below).

...AACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTTGTAGTGAGGGCTTTCGTGGTAAGATTGGTCATAAA…

(Figure: all output values covering a base are summed.)
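One possible reading of this aggregation step, as a sketch (my own interpretation; since the slide itself is unsure about the normalisation, each base's indicator is taken here as the mean of the outputs of all windows that cover it):

    # window_scores[k] is the network output for the window starting at position k.
    def per_base_indicator(window_scores, gene_length, L_W=60):
        totals = [0.0] * gene_length
        counts = [0] * gene_length
        for start, s in enumerate(window_scores):
            for pos in range(start, start + L_W):   # every base covered by this window
                totals[pos] += s
                counts[pos] += 1
        return [t / c if c else 0.0 for t, c in zip(totals, counts)]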

Page 17

6. BENCHMARK

•  Significant top
  - A peak in the values (= a significant top) is located with the following procedure:
    1.  Find bases whose indicator value exceeds the threshold.
    2.  Find stretches where the bases from step 1 are consecutive.
    3.  Fit a second-order polynomial regression line (which should be a downward-opening parabola) to the indicator values of the region from step 2.
    4.  The position of the vertex of the parabola from step 3 is the estimated position of the SS.

The NN outputs a DSS indicator and an ASS indicator for every base, so the positions where these values peak are taken to be the DSS and the ASS, respectively.

(Figure: DSS/ASS indicator values plotted at positions 0-60, with the fitted parabola (fitted line) drawn through the points above the threshold.)

Page 18

6. BENCHMARK (cont’d)

•  Significant top (cont'd)
  - In the example below, only positions 2-6 exceed the threshold of 0.2, and only those points are used to fit the regression line (see the sketch at the end of this slide).
  - To choose the threshold, dynamic values such as the mean and the SD were tried, but in the end the constant 0.2 gave the best results.

[Excerpt from the paper, p. 109:]

... value in the donor splice site indicator. When the algorithm finds a significant top in the donor splice site indicator, the state switches to intron. The algorithm continues to look for a significant top in the acceptor splice site indicator, and the state is switched back to exon. This process continues until the end of the gene. The gene must end in the exon state.

In the above paragraph, it is unclear what is meant by a significant top. To indicate a top in a splice site indicator, the algorithm first finds an indicator value above some threshold value. It then finds all successive indicator data points that are higher than this threshold value. Through all these values, a second order polynomial regression line is fitted, and the maximum of this parabola is used to indicate the splice site. This method is explained with some example data in Fig. 3. In this example the indicator value at 0 and 1 is below the threshold. The value at 2 is just above the threshold and the successive values at 3, 4, 5 and 6 are also above the threshold, and these five values are used in the curve fitting. The rest of the data points are below the threshold and not used in the curve fitting.

Fig. 3. Predicting a splice site based on the splice site indicator. When the indicator reaches above the threshold value, 0.2 in the figure, all successive data points above this threshold are used in a curve fitting of a parabola. The nucleotide closest to the parabola maxima is used as the splice site.

Finding a good threshold value is difficult. Several values have been tried. We have performed some simple experiments with dynamically computed threshold values based on average and standard deviation. However, the most practical threshold value was found to be a constant at 0.2.

6.2 Performance Measurements

The above method is used to predict the locations of the exons and introns. These locations can be compared with the actual exon and intron locations.

(Figure label: significant top)
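A sketch of the significant-top search (my own code, using numpy's polyfit for the second-order regression; the 0.2 threshold and the rule of taking the run of consecutive above-threshold points come from the paper):

    import numpy as np

    def significant_top(indicator, threshold=0.2):
        """Locate a splice site from the first run of indicator values above
        the threshold by fitting a parabola and taking its maximum."""
        above = [i for i, v in enumerate(indicator) if v > threshold]
        if not above:
            return None
        run = [above[0]]
        for i in above[1:]:                      # keep only consecutive positions
            if i == run[-1] + 1:
                run.append(i)
            else:
                break
        if len(run) < 3:
            return run[0]                        # too few points for a parabola
        a, b, c = np.polyfit(run, [indicator[i] for i in run], 2)
        return round(-b / (2 * a))               # nucleotide closest to the parabola maximum

    # Example data roughly shaped like Fig. 3: positions 2-6 exceed the 0.2 threshold.
    ind = [0.05, 0.1, 0.25, 0.5, 0.7, 0.55, 0.3, 0.15, 0.05]
    print(significant_top(ind))                  # expected to land near position 4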

Page 19

6. BENCHMARK (cont’d)

•  Measurement indicators
  - Sensitivity (Sn): out of 100 targets, how many did you hit? ("Shoot enough and something will hit.")
  - Specificity (Sp): out of 100 shots fired, how many hit a target?
  - Correlation coefficient (CC): did you hit many targets with few shots?

Exons and introns are predicted from the estimated SSs, and the accuracy of those predictions is measured. The evaluation metrics are given below (and in code form at the end of this slide).

Abbreviations: CC = Correlation coefficient, Sp = Specificity, Sn = Sensitivity, TP = True positive, FP = False positive, TN = True negative, FN = False negative

[Excerpt from the paper, p. 110:]

There are four different outcomes of this comparison: true positive (TP), false negative (FN), false positive (FP) and true negative (TN). The comparison of actual and predicted location is done at nucleotide level.

The counts of each comparison outcome are used to compute standard measurement indicators to benchmark the performance of the predictor. The sensitivity, specificity and correlation coefficient have been the de facto standard way of measuring the performance of prediction tools. These prediction measurement values are defined by Burset and Guigo [2] and by Snyder and Stormo [7].

The sensitivity (Sn) is defined as the ratio of correctly predicted exon nucleotides to all actual exon nucleotides, as given in Eq. 3:

    Sn = TP / (TP + FN)    (3)

The higher the ratio, the better the prediction. As we can see, this ratio is between 0.0 and 1.0, where 1.0 is the best possible.

The specificity (Sp) is defined as the ratio of correctly predicted exon nucleotides to all predicted exon nucleotides, as given in Eq. 4:

    Sp = TP / (TP + FP)    (4)

The higher the ratio, the better the prediction. As we can see, this ratio is between 0.0 and 1.0, where 1.0 is the best possible.

The correlation coefficient (CC) combines all four possible outcomes into one value. The correlation coefficient is defined as given in Eq. 5:

    CC = ((TP × TN) − (FN × FP)) / sqrt((TP + FN)(TN + FP)(TP + FP)(TN + FN))    (5)

6.3 The Overall Training Algorithm

The main loop of the training is very simple and is an infinite loop with two significant activities. First, the infinite loop trains the neural network on all genes in the training data set. Second, it benchmarks the same neural network on the genes in the benchmark data set. There are also some other minor activities in the main loop, like reshuffling the order of the training data set, saving the neural network, and logging the results. The main training loop is shown in Algorithm 4.

In this algorithm, NN is the neural net to be trained, T is the data set of genes to be used for training and B is the data set of genes for benchmarking. In this algorithm the learning rate, η, is kept constant. The bm variable is a composite data structure to hold the benchmarking data.

The subroutine Save() saves the neural network weights, and the Shuffle() subroutine reorders the genes in the data set. LogResult() logs the result to the terminal window and to a log file.
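The three metrics in code form (a direct transcription of Eqs. 3-5; the counts in the example call are made up):

    import math

    def sn_sp_cc(tp, fn, fp, tn):
        """Nucleotide-level sensitivity, specificity and correlation
        coefficient as defined in Eqs. 3-5 of the paper."""
        sn = tp / (tp + fn)
        sp = tp / (tp + fp)
        denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
        cc = (tp * tn - fn * fp) / denom
        return sn, sp, cc

    # Hypothetical counts, for illustration only.
    print(sn_sp_cc(tp=850, fn=150, fp=200, tn=800))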


Page 20

7. EXPERIMENTS & RESULTS

•  Finding splice sites in a particular gene
  - The exon/intron structure predicted for one gene arbitrarily picked from the 20-gene final verification set is shown below (a better-than-average prediction).
  - Most SSs were predicted accurately.
  - There were some errors, probably because sharp peaks in the indicator were attenuated by the 'low-pass filtering effect' of using a window.

When the prediction accuracy was checked on this gene, most of the predicted SSs matched the actual ones.

[Excerpt from the paper, p. 112:]

Fig. 4. The splice site indicators plotted along an arbitrary gene (AT4G18370.1) from the verification set. Above the splice site indicators, there are two line indicators where the upper line indicates predicted exons, and the other line indicates actual exons. The sensitivity, specificity and correlation coefficient of this gene is given in the figure heading. (Err is an error rate defined as the ratio of false predicted nucleotides to all nucleotides. Err = 1 − SMC.)

Table 1. Measurements of the neural network performances for each of the three training sessions. Numbers are based on a set of 20 genes which are not found in the training set nor the benchmarking set.

                  Average           All nucleotides in set
    Session       Sn      Sp        Sn      Sp      CC      SMC
    η = 0.20      0.864   0.801     0.844   0.802   0.5205  0.7761
    η = 0.10      0.891   0.816     0.872   0.806   0.5517  0.7916
    η = 0.02      0.888   0.778     0.873   0.777   0.4978  0.7680

8 Conclusion

This study shows an artificial neural network used in splice site prediction. The best neural network trained in this study achieves a correlation coefficient at 0.552. This result is achieved without any prior knowledge of any sensor signals, like 'GT' or 'GC' for the donor splice sites, or 'AG' for the acceptor splice sites. Also note that some of the genes in the data sets did not store the base case for splicing, but an alternative splicing, which may have disturbed some of the training. It is fair to conclude that artificial neural networks are usable in gene prediction, and the method used, with a sliding window over the gene, is worth further study. This method combined with other statistical methods, like General Hidden Markov Models, would probably improve the results further.

References

1. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach, 2nd edn. MIT Press, Cambridge (2001)
2. Burset, M., Guigo, R.: Evaluation of gene structure prediction programs. Genomics 34(3), 353–367 (1996)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, New York (2001)

(Figure labels: 'AT4G18370.1', Sn = 0.961, Sp = 0.910, CC = 0.835, Err = 0.079; predicted exons vs. actual exons.)

Error: the fraction of positions where the prediction and the truth disagree, i.e. (total number of bases predicted as exon/intron opposite to the truth) / (gene length).

Epoch count = about 80, learning rate = 0.2.


Page 22

7. EXPERIMENTS & RESULTS

•  The benchmark was run with three different learning rates. (The higher the learning rate, the faster the learning, but the easier it is to fall into a local optimum.)
•  The benchmark algorithm judged performance by the average of Sn and Sp. (This presumably means that the end of training was decided from that average.)

The accuracy of the NNs built with the three settings was evaluated on the 20-gene final verification set (Table 1, shown on Page 20).


Average: the mean of the metrics computed per gene. All nucleotides in set: the metrics computed over all bases, regardless of gene. SMC: the simple matching coefficient, the fraction of positions where prediction and truth agree, i.e. (bases where both call exon + bases where both call intron) / (total number of bases).

Page 23

8. CONCLUSION

•  Best neural network in this study
  - It achieved at best CC = 0.552.
  - The conclusion below is fair because the result was obtained under the following handicaps:
    - No prior knowledge about splice sites, such as the GT-AG rule, was used.
    - Some genes that break basic rules like the one above, which disturbs training, were also present.
    - Genes with alternative splicing, which likewise disturbs training, were used as well (only one of the splice variants was kept).
•  Outlook
  - Combined with GHMMs, it would be stronger still.

ANNs are useful for gene prediction, and this method of using a sliding window is worth studying in more depth.

Page 24
Page 25

Appendix: FAQ

Q. So what is a neural network, in the end?
A. A capable incompetent.

Q. So what exactly is new in this paper?
A. Using a window to output an SS indicator that changes gradually.

Q. Isn't the training data too large? Aren't there too few verification genes?
A. Fair point.

Q. Why Arabidopsis?
A. In plants, introns tend to be AT-rich compared with exons, so they may be easier to predict.

Q. What about predicting from gene sequences only, leaving out the IRs?
A. Well, that makes training easier, prediction easier and hits more likely.

Q. How good is CC = 0.552? Is it high?
A. Not low, probably. However, the SS algorithm of Genie, which likewise exploits NNs, reaches CC > 0.81.
