
Page 1

PREDICTING THE CELLULAR LOCALIZATION SITES OF PROTEINS USING ARTIFICIAL NEURAL NETWORKS

Submitted by: Vaibhav Dhattarwal (08211018)
Supervisor and Guide: Dr. Durga Toshniwal

Page 2

Organization of Presentation
• Introduction
• Problem Statement
• Background
• Stages Proposed
• Algorithm Implementation
• Results & Discussion
• Conclusion & Future Work
• References

Page 3

Introduction
• If we can deduce the subcellular location of a protein, we can interpret its function, its role in healthy processes and in the onset of disease, and its potential use as a drug target.
• The subcellular location of a protein can provide valuable information about the role it plays in cellular dynamics.
• The intention is to understand proteins' basic or specific functions with regard to the life of the cell.

Page 4

Introduction: Protein Structure

Page 5

Problem Statement
• "Prediction of cellular localization sites of proteins using artificial neural networks."
• This work combines simulated artificial neural networks with the field of bioinformatics to predict the localization of proteins in the yeast genome.
• I introduce a new subcellular prediction method based on an artificial neural network trained with the back-propagation algorithm.

Page 6

Background

• Neural Network Definition
• Neural Network Applications
• Neural Network Categorization
• Types of Neural Network
• Perceptrons and Learning Algorithm
• Classification for the yeast protein data set

Page 7

Background: Neural Network Definition

• A neural network is a system of many simple processing elements operating in parallel, whose function is determined by the network structure, the strengths (weights) of the connections, and the computation performed at the processing nodes.

• A neural network is a massively parallel distributed processor with a strong inherent ability to store large amounts of experiential knowledge. It has two key features:
▫ Knowledge is acquired through a learning procedure.
▫ Interneuron connection strengths (weights) are used to store this knowledge.

Page 8

Background: Neural Network Applications
• Computer scientists use neural networks to investigate the properties of non-symbolic information processing and to learn more about learning systems in general.
• Statisticians can use neural networks as flexible, nonlinear regression and classification models.
• Engineers use neural networks for signal processing and automatic control.
• Cognitive scientists use neural networks to describe models of thinking and consciousness, that is, high-level brain function.
• Neurophysiologists use neural networks to describe and research medium-level brain function.

Page 9

Background: Neural Network Categorization

Page 10

Background: Types of Neural Network
• Supervised Learning based
▫ Feed-Forward Topology based
▫ Feedback Topology based
▫ Competitive Learning based
• Unsupervised Learning based
▫ Competitive Learning based
▫ Dimension Reduction Process
▫ Auto-Associative Memory

Page 11

Background: Perceptrons and Learning Algorithm

Page 12

Background: Perceptrons and Learning Algorithm

Page 13

Background: Yeast Cell

Page 14

Background: Yeast Protein Data Set
• erl: Represents the lumen of the endoplasmic reticulum. This attribute indicates whether an HDEL pattern is present as a retention signal.
• vac: Indicates the amino-acid content of vacuolar and extracellular proteins, obtained by discriminant analysis.
• mit: Gives the composition of the twenty-residue N-terminal region of mitochondrial and non-mitochondrial proteins, obtained by discriminant analysis.
• nuc: Tells us whether nuclear localization patterns are present; it also carries some information about the frequency of basic residues.
• pox: Gives the composition of the protein sequence after discriminant analysis, and also indicates the presence of a short sequence motif.
• mcg: A parameter from the McGeoch signal-sequence detection method; a modified version of it is used here.
• gvh: Represents a weight-matrix-based procedure used to detect cleavable signal sequences.
• alm: This final feature identifies membrane-spanning regions over the entire sequence.
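As an illustrative sketch only (the deck shows no code for this step), the eight attributes and the class label can be read from the UCI "yeast.data" file, assuming its standard whitespace-separated layout of sequence name, eight numeric attributes, then class:

```python
def load_yeast(path="yeast.data"):
    """Read the UCI yeast data set: name, 8 attributes, class label."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 10:          # skip blank or malformed lines
                continue
            # mcg, gvh, alm, mit, erl, pox, vac, nuc (in file order)
            features.append([float(v) for v in parts[1:9]])
            labels.append(parts[9])       # localization site, e.g. "CYT"
    return features, labels
```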

Page 15

Background: Classification Tree

Page 16

Proposed Stages Of Work Done

• Stage one: Simulating the network
• Stage two: Implementing the algorithm
• Stage three: Training the network
• Stage four: Obtaining results and comparing performance

Page 17

Stage One: Simulating the network

Page 18

Stage Two: Implementing the algorithm

Page 19

Stage Three: Training the network
• The output class represents the localization site. The classes are:
▫ CYT (cytosolic or cytoskeletal)
▫ NUC (nuclear)
▫ MIT (mitochondrial)
▫ ME3 (membrane protein, no N-terminal signal)
▫ ME2 (membrane protein, uncleaved signal)
▫ ME1 (membrane protein, cleaved signal)
▫ EXC (extracellular)
▫ VAC (vacuolar)
▫ POX (peroxisomal)
▫ ERL (endoplasmic reticulum lumen)
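The slides do not show how these ten classes are encoded for the network; one plausible scheme, sketched here purely as an assumption, is a one-hot target vector:

```python
CLASSES = ["CYT", "NUC", "MIT", "ME3", "ME2",
           "ME1", "EXC", "VAC", "POX", "ERL"]

def one_hot(label):
    """Encode a localization class as a 10-element target vector."""
    target = [0.0] * len(CLASSES)
    target[CLASSES.index(label)] = 1.0
    return target
```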

Page 20

Stage Four: Obtaining Results & Comparing Performance
• The class statistics of the yeast data set are mapped as output.
• The attributes of the data set are mapped to show how the output varies.
• Performance is evaluated by varying the number of nodes in the hidden layer.
• The parameters for comparing performance are:
▫ Accuracy on the test set.
▫ Ratio of correctly classified instances in the training set.

Page 21

Algorithm Implementation

• Sigmoid Function & Its Derivative
• Pseudo Code for a single network layer
• Pseudo Code for all network layers
• Pseudo Code for training patterns
• Pseudo Code for minimizing error

Page 22

Sigmoid Function & Its Derivative
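The slide's figure is not reproduced here. As a minimal sketch in Python (the deck's pseudocode prescribes no particular language), the logistic sigmoid and its derivative, written in terms of the sigmoid's own output as back-propagation uses it, are:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(y):
    """Derivative of the sigmoid expressed via its output y = sigmoid(x)."""
    return y * (1.0 - y)
```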

Page 23

Pseudo Code for a single network layer
• InputLayer2[j] = Wt[0][j]
• for all elements in Layer One [ NumUnitLayer1 ]
• do
▫ Add to InputLayer2[j] the sum over the product OutputLayer1[i] * Wt[i][j]
• end for
• Compute the sigmoid to get activation output
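A direct Python transcription of this single-layer pass, reusing sigmoid() from the sketch above; the 1-based indexing with the bias stored in row 0 mirrors the pseudocode, and everything else is an assumption:

```python
def layer_forward(output_layer1, wt, num_unit_layer1, num_unit_layer2):
    """Forward pass for one layer: bias + weighted sum, then sigmoid."""
    output_layer2 = [0.0] * (num_unit_layer2 + 1)   # index 0 unused
    for j in range(1, num_unit_layer2 + 1):
        net = wt[0][j]                               # bias term Wt[0][j]
        for i in range(1, num_unit_layer1 + 1):
            net += output_layer1[i] * wt[i][j]
        output_layer2[j] = sigmoid(net)              # activation output
    return output_layer2
```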

Page 24

Pseudo Code for all network layers
• for all elements in hidden layer [ NumUnitHidden ] // computes Hidden Layer PE outputs //
• do
▫ InputHidden[j] = WtInput/Hidden[0][j]
▫ for all elements in input layer [ NumUnitInput ]
▫ do
  Add to InputHidden[j] the sum over OutputInput[i] * WtInput/Hidden[i][j]
▫ end for
▫ Compute sigmoid for output
• end for
• for all elements in output layer [ NumUnitOutput ] // computes Output Layer PE outputs //
• do
▫ InputOutput[k] = WtHidden/Output[0][k]
▫ for all elements in hidden layer [ NumUnitHidden ]
▫ do
  Add to InputOutput[k] the sum over OutputHidden[j] * WtHidden/Output[j][k]
▫ end for
▫ Compute sigmoid for output
• end for
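Chaining two such passes gives the full forward computation. The sketch below renders the two loops under the same assumptions and indexing as the earlier sketches:

```python
def forward(output_input, wt_ih, wt_ho, n_input, n_hidden, n_output):
    """Forward pass through the hidden and output layers."""
    # Hidden layer PE outputs.
    output_hidden = [0.0] * (n_hidden + 1)
    for j in range(1, n_hidden + 1):
        net = wt_ih[0][j]                            # bias
        for i in range(1, n_input + 1):
            net += output_input[i] * wt_ih[i][j]
        output_hidden[j] = sigmoid(net)
    # Output layer PE outputs.
    output = [0.0] * (n_output + 1)
    for k in range(1, n_output + 1):
        net = wt_ho[0][k]                            # bias
        for j in range(1, n_hidden + 1):
            net += output_hidden[j] * wt_ho[j][k]
        output[k] = sigmoid(net)
    return output_hidden, output
```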

Page 25

Design for calculating output

Page 26

Pseudo Code for training patterns
• Er = 0.0
• for all patterns in the training set
• do // computes for all training patterns (E) //
▫ for all elements in hidden layer [ NumUnitHidden ]
▫ do
  InputHidden[E][j] = WtInput/Hidden[0][j]
  for all elements in input layer [ NumUnitInput ]
  do
   Add to InputHidden[E][j] the sum over OutputInput[E][i] * WtInput/Hidden[i][j]
  end for
  Compute sigmoid for output
▫ end for

Page 27

Pseudo Code for training patterns
▫ for all elements in output layer [ NumUnitOutput ]
▫ do
  InputOutput[E][k] = WtHidden/Output[0][k]
  for all elements in hidden layer [ NumUnitHidden ]
  do
   Add to InputOutput[E][k] the sum over OutputHidden[E][j] * WtHidden/Output[j][k]
  end for
  Compute sigmoid for output
  Add to Er the sum over the product (1/2) * (Final[E][k] - Output[E][k]) * (Final[E][k] - Output[E][k])
▫ end for
• end for
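In Python, the same accumulation of the squared-error term Er over all training patterns might look like this, reusing forward() from the earlier sketch; the data layout (1-indexed pattern and target vectors) is an assumption:

```python
def total_error(patterns, targets, wt_ih, wt_ho, n_input, n_hidden, n_output):
    """Sum Er = 1/2 * (Final - Output)^2 over every pattern and output unit."""
    er = 0.0
    for pattern, final in zip(patterns, targets):
        _, output = forward(pattern, wt_ih, wt_ho, n_input, n_hidden, n_output)
        for k in range(1, n_output + 1):
            er += 0.5 * (final[k] - output[k]) ** 2
    return er
```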

Page 28

Pseudo Code for minimizing error
• for all elements in hidden layer [ NumUnitHidden ]
• do // This loop updates the input-to-hidden weights //
▫ Add to ΔWih[0][j] the sum of the product β * ΔH[j] and the product α * ΔWih[0][j]
▫ Add to WtInput/Hidden[0][j] the change ΔWih[0][j]
▫ for all elements in input layer [ NumUnitInput ]
▫ do
  Add to ΔWih[i][j] the sum of the product β * InputHidden[p][i] * ΔH[j] and the product α * ΔWih[i][j]
  Add to WtInput/Hidden[i][j] the change ΔWih[i][j]
▫ end for
• end for

Page 29

Pseudo Code for minimizing error
• for all elements in output layer [ NumUnitOutput ]
• do // This loop updates the hidden-to-output weights //
▫ Add to ΔWho[0][k] the sum of the product β * ΔOutput[k] and the product α * ΔWho[0][k]
▫ Add to WtHidden/Output[0][k] the change ΔWho[0][k]
▫ for all elements in hidden layer [ NumUnitHidden ]
▫ do
  Add to ΔWho[j][k] the sum of the product β * OutputHidden[p][j] * ΔOutput[k] and the product α * ΔWho[j][k]
  Add to WtHidden/Output[j][k] the change ΔWho[j][k]
▫ end for
• end for
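Both update loops translate to Python as below; β acts as the learning rate and α as the momentum coefficient, and the back-propagated error terms ΔH and ΔOutput are taken as already computed, since the slides do not show their derivation:

```python
def update_weights(pattern, output_hidden, delta_h, delta_o,
                   wt_ih, wt_ho, dwt_ih, dwt_ho,
                   n_input, n_hidden, n_output, beta, alpha):
    """Gradient-descent weight update with momentum for both weight layers."""
    # Input-to-hidden weights (row 0 holds the bias).
    for j in range(1, n_hidden + 1):
        dwt_ih[0][j] = beta * delta_h[j] + alpha * dwt_ih[0][j]
        wt_ih[0][j] += dwt_ih[0][j]
        for i in range(1, n_input + 1):
            dwt_ih[i][j] = beta * pattern[i] * delta_h[j] + alpha * dwt_ih[i][j]
            wt_ih[i][j] += dwt_ih[i][j]
    # Hidden-to-output weights.
    for k in range(1, n_output + 1):
        dwt_ho[0][k] = beta * delta_o[k] + alpha * dwt_ho[0][k]
        wt_ho[0][k] += dwt_ho[0][k]
        for j in range(1, n_hidden + 1):
            dwt_ho[j][k] = beta * output_hidden[j] * delta_o[k] + alpha * dwt_ho[j][k]
            wt_ho[j][k] += dwt_ho[j][k]
```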

Page 30

Results & Discussion
• Yeast data set class statistics
• Yeast data set attributes
• Comparison of accuracies of various algorithms
• Variation of success rate with number of iterations
• Variation of success rate with number of nodes in the hidden layer
• Variation of accuracy in training with the criteria in testing

Page 31

Results: Yeast Data Set class statistics

Page 32

Results: Yeast Data Set Attributes

Page 33

Results: Comparison of Accuracies of various algorithms

Page 34

Results: Variation of Success Rate with Number of Iterations

Page 35

Results: Variation of Success Rate with Number of Nodes in the Hidden Layer

Page 36

Results: Variation of Accuracy in Training with Criteria in Testing

Page 37

Conclusion
• The classes CYT, NUC, and MIT have the largest numbers of instances.
• An interesting observation is that the values of erl and pox are almost constant throughout the data set, whereas the remaining attributes vary continually.
• The algorithm achieves slightly higher accuracy than the other algorithms it was compared against.

Page 38

Conclusion
• Also of note, considerable success is achieved on the chosen yeast data set, with accuracy reaching up to 61%.
• After about 100 iterations the success rate remains more or less constant.
• The success rate reaches a constant value once the hidden layer has about 75 nodes.
• Accuracy rises until we reach the limit to which the success-rate criterion can be set.

Page 39

Future Work
• Since predicting the cellular localization sites of proteins is a typical classification problem, many other techniques, such as probabilistic models, Bayesian networks, and k-nearest neighbours, can be compared with our technique.
• An aspect of future work is therefore to examine the performance of these techniques on this particular problem.

Page 40

Key References
• [1] Horton, P., and Nakai, K., "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", Intelligent Systems in Molecular Biology, 109-115.
• [2] Nakai, K., and Kanehisa, M., "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", PROTEINS: Structure, Function, and Genetics, 11:95-110, 1991.
• [3] Nakai, K., and Kanehisa, M., "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Genomics, 14:897-911, 1992.
• [4] Cairns, P., Huyck, C., et al., "A Comparison of Categorization Algorithms for Predicting the Cellular Localization Sites of Proteins", IEEE Engineering in Medicine and Biology, pp. 296-300, 2001.
• [5] Donnes, P., and Hoglund, A., "Predicting Protein Subcellular Localization: Past, Present, and Future", Genomics Proteomics Bioinformatics, 2:209-215, 2004.
• [6] Feng, Z.P., "An Overview on Predicting the Subcellular Location of a Protein", In Silico Biology, 2002.
• [7] Reinhardt, A., and Hubbard, T., "Using Neural Networks for Prediction of the Subcellular Location of Proteins", Nucleic Acids Research, 26(9):2230-2236, 1998.

Page 41

THANK YOU