Data Mining-Project Report Gene Classification using Neural Network- Apil Tamang

Data Mining, Fall 2013
Project Report
Apil Tamang
Gene Classification using Neural Networks

Introduction

Problem: Genes play a fundamental role in the life of any living organism. The processes of life are controlled by proteins that are produced within an organism's cells. Functions such as muscle movement, food digestion, energy production, waste removal, and the production of antibodies to fight infection are all controlled by the production of proteins within an organism. Fundamental processes such as breathing, heartbeat, growth, and regeneration all depend on the right kinds of proteins being produced at the right places and times. In other words, life is sustained by proteins: many different kinds of them!

The synthesis of proteins is controlled by genes. A gene is a certain length of DNA found within the chromosomes of an organism's cells (inside the nucleus, in organisms that have one). It consists of a sequence of DNA bases of four different kinds: (A)denine, (T)hymine, (G)uanine, and (C)ytosine. Only certain specific regions within the DNA serve as protein-synthesizing elements. Each gene results in the production of one and only one kind of protein. The genes for the entire set of proteins available to an organism are found within the chromosomes. Hence, DNA is also called the code of life.

The entire DNA sequence contains many different kinds of sequences in addition to genes. There are regions that serve as binding sites for other processes, regions that signal the beginning and end of gene regions, and regions that serve no purpose at all (to current knowledge), to name a few. Much of the DNA sequence is not well understood. For example, it is estimated that only about 2% of the entire human DNA sequence serves as genes. Researchers do not know for sure what the purpose of the rest of the DNA is.
In this project, we examine the DNA of lower-class organisms, where the entire DNA sequence can be divided into two main categories: the coding (gene) regions and the non-coding (non-gene) regions. The files containing sequences for all the proteins known to two organisms, E. coli (strain MG1655) and A. Baccillus, were downloaded from the NCBI genome repository. The work is based on the hypothesis that coding regions have a certain pattern in their gene statistics, which makes it possible to distinguish them from the innumerable sequence combinations that can be constructed from the DNA sequence. We mentioned previously that most of the DNA consists of non-coding regions. Hence, the aim is to look at a sample of DNA sequence and tell whether it is a coding or a non-coding sequence, i.e. a gene or a non-gene region, respectively. We go a step further and ask whether any specific pattern can be inferred from the genes of two different organisms, such that this information can be used to correctly identify which organism a given sample is from.

Neural Networks: Neural networks are computer algorithms that can be used to solve many classes of artificial intelligence problems, ranging from optimization to classification. The major constituent elements of a neural network are neurons, connections, weights, and transformation functions. The neurons are modeled after their biological counterparts, which play a central role in the survival of higher-class organisms. Biological neurons are specialized cells capable of receiving electrical signals and transmitting them after some processing. They are also capable of forming interconnections within the organism, and they control the movement of virtually every muscle. Neurons constitute the central nervous system (i.e. the brain and spinal cord) by forming a massive and very complex web of connections among themselves.
Thus, neurons are also the seat of intelligence and memory in higher-class organisms. The neurons in a neural network algorithm have very similar features. They take in an input and form connections with neighboring neurons. Weights determine how strongly the signal passed from one neuron influences the next during a computation. Each neuron processes its input via a mathematical function that can be specified by the user. In a typical network, neurons communicate by passing weighted signals along their connections. The output of a network is the collective result of the processing performed by each neuron as it communicates with the other neurons in the network. In this way, neural networks often provide a black-box problem-solving tool for the end user.

There are many different kinds of neural networks. They differ in the kind of function used to process the input, the way the neurons are interconnected, and the way information is passed around the system before an end result is produced. In this class project, we used a fully connected Multilayer Perceptron network with the standard feed-forward, back-propagation learning algorithm. This setup is well suited to classification and is widely used for this class of problems.

Methodology

Data preprocessing: The building block of proteins is the amino acid. Each amino acid is encoded by a set of three consecutive DNA bases, referred to as a codon. Given that there are 4 kinds of DNA bases, there are 4 x 4 x 4 = 64 possible codons. Scientists have identified 20 different amino acids that are used to build proteins. Hence, there is a many-to-one mapping from codons to amino acids. The gene statistics for this project are formed by taking a gene sequence and deriving the normalized frequencies of all its codons and amino acids.
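The frequency derivation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names, the fixed alphabetical codon ordering, and the choice to ignore trailing bases and unrecognized symbols are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 64 codons in a fixed order, so every gene yields a vector with the same layout.
# (The ordering is an assumption; the report does not specify one.)
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_frequencies(tokens, alphabet):
    """Return the share of each alphabet symbol among tokens (values sum to 1)."""
    allowed = set(alphabet)
    counts = Counter(t for t in tokens if t in allowed)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[symbol] / total for symbol in alphabet]


def codon_frequencies(dna):
    """Split a gene into non-overlapping codons and return 64 normalized frequencies."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # assumption: drop an incomplete trailing codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return normalized_frequencies(codons, CODONS)


def amino_acid_frequencies(protein):
    """Return 20 normalized amino acid frequencies for a protein sequence."""
    return normalized_frequencies(protein.upper(), AMINO_ACIDS)


# Hypothetical usage with a short made-up sequence and its made-up protein string.
codon_row = codon_frequencies("ATGGATCCG")
aa_row = amino_acid_frequencies("MDP")
```

Each call produces one row of the statistics file: 64 values for the codon table or 20 values for the amino acid table.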
This is done for each gene in a file containing all known gene sequences for the organism. There are two source files for this purpose: one containing all the gene sequences as DNA bases, and one containing all the sequences as amino acids. These files are used to derive the normalized frequencies of the codons and the amino acids, respectively. The process is illustrated briefly below:

Organism 1
  Protein 1: ATGGATCCG
  Protein 2: ATGCGATCG..
  ..
  Protein N: ATGTTACTG..

Organism 1 Codon Freq. Table
  Cdn1   Cdn2   ...   Cdn64
  0.23   0.12   ...   0.05
  0.11   0.17   ...   0.20
  ..
  0.34   0.15   ...   0.16

Organism 1 AA Freq. Table
  AA1    AA2    ...   AA20
  0.13   0.15   ...   0.25
  0.01   0.21   ...   0.10
  ..
  0.14   0.09   ...   0.25

Once the files containing the statistics are obtained, we perform the following steps.

a. Split the statistics file into two disjoint parts for each organism. The split is made at random.
b. Merge one part of the statistics file from one organism with one part of the statistics file from the second organism. Ensure that the lines from each are randomly distributed in the merged file. Do the same for the remaining part of the statistics file for each organism.
c. Use one of the merged files as training (80%) and testing (20%) data for the neural network classifier.
d. Use the other merged file as activating data for the neural network classifier. This is the set of data that the neural classifier works on to produce classification results. Print the results.
e. Use the results file to analyze the overall accuracy.

Note that the entire set of steps is carried out independently for the statistics files of both the codons and the amino acids. The steps outlined above are presented diagrammatically in the following image.

Classifier Setup: The following describes the structure of the neural network classifier used in this project: The above is a sample of a multilayered perceptron used in classification problems.
The red ovals (far left) represent input neurons. These neurons take in the normalized numerical values of the attributes in the classification problem. There needs to be one input neuron per attribute. Hence, in this project, 20 input neurons are used with the amino acid statistics. Likewise, 64 input neurons are used with the codon statistics, one for each of the 64 possible codons.

The green ovals in the middle represent the hidden neuron layer. This is the layer that performs the analysis on the input data. The number of hidden neurons can vary. It is recommended that there be at least as many of them as there are input neurons.

Finally, the output layer is represented by the oval in dark green (far right). One output neuron is required for each class attribute. However, for a classification problem, it is recommended that one output neuron be used for each possible class value for best performance. Hence, this project uses two output neurons: one for each of the two organisms to which the genes may belong.

The two output neurons are configured to produce a value between 0 and 1. The value represents the probability that the particular tuple is of a certain class. In this project, the closer the value is to 0, the more likely the tuple is a gene of organism 1, and the closer it is to 1, the more likely it is a gene of organism 2. The table of results looks as follows:

ID    Class1    Class2
---------------------
1     0.32      0.68
2     0.11      0.89
..
N     0.09      0.91

A cutoff value must be chosen to make a prediction. If a cutoff of, say, 0.7 is applied to the Class2 value, then the tuples with IDs 2 and N are assigned to Class-2, and the tuple with ID 1 is assigned to Class-1.
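The cutoff rule can be written out as a short Python sketch. The function name `classify` and the use of "greater than or equal to" at the boundary are assumptions for illustration; the report only states the cutoff value and the resulting class assignments for the sample rows.

```python
def classify(class2_score, cutoff=0.7):
    """Assign Class-2 when the Class2 output reaches the cutoff, else Class-1."""
    # Assumption: a score exactly equal to the cutoff counts as Class-2.
    return "Class-2" if class2_score >= cutoff else "Class-1"


# The sample rows from the results table: (ID, Class1 output, Class2 output).
rows = [(1, 0.32, 0.68), (2, 0.11, 0.89), ("N", 0.09, 0.91)]

predictions = {row_id: classify(class2) for row_id, _class1, class2 in rows}
```

With the cutoff of 0.7, this reproduces the assignments given in the text: ID 1 falls in Class-1, while IDs 2 and N fall in Class-2. Raising the cutoff makes the classifier more conservative about predicting Class-2.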
Results

Problem 1: In the first part of the experiment, we chose genes from E. coli (strain MG1655) to produce the corresponding statistics file of amino acid frequencies. Recall that this file contains lines of numbers, each line holding the normalized amino acid frequencies of the gene from which it was constructed. This set of sequences formed the first
