Identifying Cancer Patients using DNA Micro-Array Data in Data Mining Environment


    Online available since 2014/Jan/26 at www.oricpub.com

    (2014) Copyright ORIC Publications

    Journal of Science and Engineering

    Vol. 3 (2), 2013, 63-75


    www.oricpub.com/journal-of-sci-and-eng

    All rights reserved. No part of contents of this paper may be reproduced or transmitted in any form or by any means without the written permission of ORIC Publications, www.oricpub.com.

    Identifying Cancer Patients using DNA Micro-Array Data in Data Mining Environment

    Zakaria Suliman Zubi 1, Marim Aboajela Emsaed 2

    1 Sirte University, Faculty of Science, Computer Science Department, Sirte, P.O. Box 727, Libya.

    2 Tripoli University, Faculty of Information Technology, Computer Science Department, Tripoli, P.O. Box 13210, Libya.

    Abstract

    The purpose of this work is to identify cancer patients using DNA micro-array data. DNA chains contain the informational code for the composition of the human body. The methods are based on the idea of selecting a gene subset that distinguishes all classes, which is more effective for solving a multi-class problem. We will propose a genetic programming (GP) based approach to deal with the gene selection and classification tasks for biological datasets. The biological dataset will be derived from multiple biological databases by an extraction procedure called DNA-Aggregator. We will design this biological aggregator to aggregate various datasets via a community-developed DNA micro-array ontology. The aggregator is composed of modules that retrieve the data from various biological databases, and it will also enable queries by other applications to recognize the genes. The genes will be categorized into groups by a classification method that collects similar expression patterns. A clustering method such as k-means is required to discover the groups of similar objects in the biological database and to characterize the underlying data distribution.

    1. INTRODUCTION

    Data mining techniques are used to make predictions, typically using only recent static data. Sequence mining is a special case of structured data mining concerned with finding statistically relevant patterns between data examples whose values are delivered in a sequence. These values are delivered and then stored in huge collections of data; examples of such collections include biological databases such as DNA sequence databases. These data are sequential in nature, which requires a technique for discovering sequential patterns, such as sequence mining. The principle of sequence mining is to discover useful sequential knowledge, which takes the form of insight into the structure of the data. DNA (gene) chip data are extraordinary in having thousands of attributes, which represent the gene expression values [8].

    Cancers are caused by gene mutations and other types of chromosomal or molecular abnormalities. The frequent sporadic cancers, i.e. cancers in individuals with a negative family history for cancer, carry somatic gene mutations acquired at mitosis. Genes implicated in cancers are mainly those involved in the normal homeostasis of cellular proliferation, differentiation and death.

    Received: 29 June 2013

    Accepted: 20 Dec 2013

    Keywords:

    Data Mining

    Sequence Mining

    Biological Database

    Genetic Algorithm

    Clustering

    Classification

    K-means

    Correspondence: Z. S. Zubi, Sirte University, Faculty of Science, Computer Science Department, Sirte, P.O. Box 727, Libya.


    Cancer growth usually requires several different gene mutations to accumulate in a cell of origin and in its subclones during the clonal evolution of malignant growth. Gene mutations in cancers invariably lead to alterations of gene expression patterns with respect to normal cellular counterparts, including the mutated genes themselves and their downstream targets [5].

    A newer technique called genetic programming (GP) may help us to overcome this limitation. GP is an essential method for both feature selection and generating simple models based on a few genes, as demonstrated on cancer data. GP has been widely applied to classification problems because it can discover underlying data relationships. It is a promising solution for discovering potentially important genes by generating comprehensible rules for classification.

    1.1 Early Diagnosis of Cancer Diseases

    A sound body depends on the continuous interplay of thousands of proteins, acting together in just the right amounts and in just the right places, and each properly functioning protein is the product of an intact gene.

    Many, if not most, diseases have their roots in our genes. More than 4,000 diseases stem from altered genes inherited from one's mother and/or father. Common disorders such as heart disease and most cancers arise from a complex interplay among multiple genes and between genes and factors in the environment [4].

    Cancer is a class of diseases distinguished by out-of-control cell growth. There are over 100 different types of cancer, and each is classified by the type of cell that is initially affected [10].

    The Beginning of Cancer

    All cancers begin in cells, the body's basic unit of life. To understand cancer, it is helpful to know what happens when normal cells become cancer cells.

    The body is made up of many types of cells. These cells grow and divide in a controlled way to produce more cells as they are needed to keep the body healthy. When cells become old or damaged, they die and are replaced with new cells.

    However, occasionally this orderly process goes wrong. The genetic material (DNA) of a cell can become damaged or changed, producing mutations that affect normal cell growth and division. When this occurs, cells do not die when they should and new cells form when the body does not need them. The extra cells may form a mass of tissue called a tumor, as shown in Figure 1 [11].

    Figure 1. The cancer transformation

    Cancer Classifications

    Five broad groups are used to classify cancer:


    Carcinomas are characterized by cells that cover internal and external parts of the body, as in lung, breast, and colon cancer.

    Sarcomas are characterized by cells that are located in bone, cartilage, fat, connective tissue, muscle, and other supportive tissues.

    Lymphomas are cancers that begin in the lymph nodes and immune system tissues.

    Leukemias are cancers that begin in the bone marrow and often accumulate in the bloodstream.

    Adenomas are cancers that arise in the thyroid, the pituitary gland, the adrenal gland, and other glandular tissues [10].

    The objectives of early detection are as follows:

    a) To detect and remove / arrest all premalignant lesions;

    b) To give patients the best treatment available;

    c) To reduce the morbidity and mortality of this disease;

    d) To help spread awareness among patients.

    1.2 Sequence Mining Techniques

    Sequences are an important type of data that occurs frequently in many fields, such as medicine, business, finance, customer behavior, education, security, and other applications. In these applications, the analysis of the data needs to be carried out in different ways to satisfy different application requirements, and it needs to be implemented efficiently.

    DNA sequences encode the genetic makeup of humans and all other species, and protein sequences describe the amino acid composition of proteins and encode their structure and function. Moreover, sequences can capture how individual humans behave through various temporal activity histories, such as weblog histories and customer purchase histories. In general, there are various methods to extract information and patterns from databases, such as time series analysis, association rule mining and data mining [11].

    2 BASIC DNA PRINCIPLES

    The basic element of life is the cell, which is a tiny factory producing the raw materials, energy, and waste removal capabilities necessary to sustain life. Thousands of different proteins, called enzymes, are necessary to keep these cellular factories functioning. An average human being is composed of approximately 100 trillion cells, all of which originated from a single cell. Each cell contains the same genetic structure. Within the nucleus of our cells is a chemical substance known as DNA that contains the informational code for replicating the cell and constructing the needed enzymes. Because the DNA resides in the nucleus of the cell, it is often referred to as nuclear DNA [3].

    DNA has two primary purposes: (1) to make copies of itself so that cells can divide and carry on the same information; and (2) to carry instructions on how to make proteins so that cells can build the machinery of life. Information encoded within the DNA structure itself is passed on from generation to generation, with one-half of a person's DNA information coming from their mother and one-half from their father.

    2.1 DNA Structure and definition

    Nucleic acids, including DNA, are composed of nucleotide units that are made up of three parts: a nucleobase, a sugar, and a phosphate, as shown in Figure 2. The nucleobase or 'base' imparts the variation in each nucleotide unit, while the phosphate and sugar portions form the backbone structure of the DNA molecule. The DNA alphabet is composed of only four characters representing the four nucleobases: A (adenine), T (thymine), C (cytosine), and G (guanine).


    Figure 2. Basic components of nucleic acids: (a) phosphate-sugar backbone with bases coming off the sugar molecules, (b) chemical structure of phosphates and sugar molecules illustrating the numbering scheme on the sugar carbon atoms. DNA sequences are conventionally written from 5' to 3'.

    2.2 Base pairing and hybridization of DNA Strands

    In its natural state in the cell, DNA is actually composed of two strands that are linked together through a process known as hybridization. Individual nucleotides pair up with their complementary base through hydrogen bonds that form between the bases. The base pairing rules are such that adenine can only hybridize to thymine and cytosine can only hybridize to guanine. Figure 3 illustrates the pairing rules.

    Figure 3. Base pairing of DNA strands to form the double-helix structure.
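The pairing rules lend themselves to a short illustration. The following is a minimal Python sketch, not part of the original system: the helper names `complement` and `reverse_complement` are hypothetical, and the only facts used are the A-T and C-G pairing described above and the 5' to 3' writing convention noted under Figure 2.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    strand = strand.upper()
    unknown = set(strand) - set(PAIRS)
    if unknown:
        raise ValueError(f"not a DNA base: {sorted(unknown)}")
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATCG"))          # TAGC
    print(reverse_complement("ATCG"))  # CGAT
```

Because the two strands run in opposite directions, the partner of a strand is usually reported as the reverse complement rather than the plain complement.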

    2.3 Chromosomes, genes, and DNA markers

    There are approximately three billion base pairs in a single copy of the human genome. Obtaining a full

    catalog of our genes was the focus of the Human Genome Project. The information from the Human

    Genome Project will benefit medical science as well as forensic human identity testing and help us better

    understand our genetic makeup.

    Within human cells, DNA found in the nucleus of the cell (nuclear DNA) is divided into chromosomes, which are dense packets of DNA and protective proteins called histones. The human genome consists of 22 matched pairs of autosomal chromosomes and two sex-determining chromosomes; Figure 4 shows these pairs. Thus, normal human cells contain 46 chromosomes, or 23 pairs of chromosomes. Males are designated XY because they contain a single copy of the X chromosome and a single copy of the Y chromosome, while females are designated XX because they contain two copies of the X chromosome.


    Figure 4. Human genome contained in every cell consists of 23 pairs of chromosomes and a small circular genome known

    as mitochondrial DNA.

    Designating physical chromosome locations

    The basic regions of a chromosome are illustrated in Figure 5. The centre region of a chromosome, known as the centromere, controls the movement of the chromosome during cell division. On either side of the centromere are arms that terminate with telomeres. The shorter arm is referred to as p, while the longer arm is designated q.

    Figure 5. Basic chromosome structure and nomenclature

    3 TUMOUR SUPPRESSOR GENE P53

    The p53 tumour suppressor gene is the most frequently altered gene in human cancer, including brain tumours.

    The p53 protein is a transcription factor involved in maintaining genomic integrity by controlling cell cycle progression and cell survival. About 50% of primary human tumours carry mutations in the p53 gene. The function of p53 is critical to the efficiency of many cancer treatment procedures, because radiotherapy and chemotherapy act in part by triggering programmed cell death in response to DNA damage [6]. The p53 gene is a 20 kb gene located on the short arm of chromosome 17, at locus 17p13.1.
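A locus designation such as 17p13.1 combines the chromosome number, the arm (p for short, q for long) and the band. The sketch below shows one way to split such a string in Python. The function `parse_locus` and its regular expression are illustrative assumptions that cover only simple designations of this shape, not the full cytogenetic nomenclature.

```python
import re

# Simple band designations such as "17p13.1":
# chromosome (1-22, X or Y), arm (p or q), then the band label.
LOCUS = re.compile(
    r"^(?P<chrom>1[0-9]|2[0-2]|[1-9]|X|Y)(?P<arm>[pq])(?P<band>\d+(?:\.\d+)?)$"
)


def parse_locus(locus: str) -> dict:
    """Split a locus string into chromosome, arm and band (hypothetical helper)."""
    match = LOCUS.match(locus)
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    arm = match.group("arm")
    return {
        "chromosome": match.group("chrom"),
        "arm": arm,
        "arm_name": "short" if arm == "p" else "long",
        "band": match.group("band"),
    }


if __name__ == "__main__":
    print(parse_locus("17p13.1"))
```

For the p53 locus this yields chromosome 17, the short arm, and band 13.1, matching the description in the text.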


    3.1 Primers for PCR and DNA sequencing

    The primers used were oligonucleotides complementary to the sequence flanking the exon/intron junctions of exons 5 to 9. The sequences of the primers are as follows:

    exon 5, 5'-CTGACTTTCAACTCTG-3' (forward) and 5'-AGCCCTGTCGTCTCT-3' (reverse);

    exon 6, 5'-CTCTGATTCCTCACTG-3' (forward) and 5'-ACCCCAGTTGCAAACC-3' (reverse);

    exon 7, 5'-TGCTTGCCACAGGTCT-3' (forward) and 5'-ACAGCAGGCCAGTGT-3' (reverse);

    exon 8, 5'-AGGACCTGATTTCCTTAC-3' (forward) and 5'-TCTGAGGCATAACTGC-3' (reverse);

    exon 9, 5'-TATGCCTCAGATTCACT-3' (forward) and 5'-ACTTGATAAGAGGTCC-3' (reverse).

    4 DNA MICRO-ARRAYS DATA CONCEPTS

    DNA micro-arrays are produced by placing small drops of liquid containing genes on a glass microscope slide and allowing the spots to dry. Each spot of liquid contains numerous copies of a single gene, and the characteristics of each spot's gene are shown in Figure 6.

    Figure 6 Cartoon of a DNA micro-array

    The mRNA is isolated from each population of cells, and each population of mRNA is converted into coloured cDNA, usually red and green. Once the two populations of cDNA are produced, they are mixed and incubated with the DNA micro-array, and unbound cDNA is washed off; Figure 7 shows the process.

    The DNA micro-array is then scanned to detect the two colours of cDNA, and the green and red images are stored. Software merges the two images, and spots bound by both colours of cDNA appear yellow.

    Figure 7. The method for producing labeled cDNA
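The merging of the red and green images can be pictured numerically. The following Python sketch labels a spot by comparing its two intensities. The function `spot_colour`, the use of a log2 ratio, and the two-fold threshold are assumptions made for illustration; the paper does not specify how the software decides a spot's colour.

```python
import math


def spot_colour(red: float, green: float, fold: float = 2.0) -> str:
    """Label a spot from its red and green intensities (illustrative only).

    A spot is called "red" or "green" when one channel exceeds the other by
    at least `fold`; otherwise both cDNA populations are bound in similar
    amounts and the merged spot appears "yellow".
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(red / green)
    limit = math.log2(fold)
    if log_ratio >= limit:
        return "red"
    if log_ratio <= -limit:
        return "green"
    return "yellow"


if __name__ == "__main__":
    for red, green in [(800, 100), (100, 800), (500, 450)]:
        print(red, green, spot_colour(red, green))
```

A logarithmic ratio is used here only because it treats a two-fold excess of red and a two-fold excess of green symmetrically.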

    Figure 8 shows some real data, analyzed using an application program [1].


    Figure 8. Real micro-array data for three genes.

    The major applications of microchips fall into three categories:

    1- Gene expression profiling: RNA is extracted from tumour samples and hybridised to the micro-array to assess, concurrently and in a single experiment, the expression of thousands of genes within the sample.

    2- Genotyping: Genomic DNA from an individual is tested for hundreds or thousands of genetic markers [notably single nucleotide polymorphisms (SNPs, or "snips") or micro-satellite markers] in a single hybridisation. This yields a genetic fingerprint, which in turn may be linked to the risk of developing single gene disorders or particular common complex diseases.

    3- DNA sequencing: Sequence variations of specific genes can be monitored in a test DNA sample, thereby greatly increasing the scope for precise molecular diagnosis in single gene disorders or complex genetic diseases [5].

    DNA Sequencing Process:

    1- Mapping: Identify a set of clones that span the region of the genome to sequence.

    2- Library Creation: Make sets of smaller clones from mapped clones.

    3- Template Preparation: Purify DNA from smaller clones; set up and perform sequencing chemistries.

    4- Gel Electrophoresis: Determine sequences from smaller clones.

    5- Pre-finishing and Finishing: Specialty techniques to produce high quality sequences.

    6- Data Editing/Annotation: Quality assurance; verification; biological annotation; submission to public database [12].

    Applications of DNA micro-arrays or chips in oncology

    - Global understanding of abnormal gene expression contributing to malignancy, i.e. snapshots of genes either up or down regulated in tumours.

    - Molecular classification of neoplasms by gene expression signatures, forecasting the tissue of origin of a tumour in the context of multiple cancer classes.

    - Classification of novel molecular-based subclasses in the tumours with clinical relevance.

    - Discovery of new prognostic or predictive indicators and biomarkers of therapeutic response.

    - Identification and validation of new molecular targets for drug development.

    - Prediction of drug side effects during preclinical development and toxicology studies.

    - Identification of genes conferring drug resistance.

    - Prediction or selection of patients most likely to benefit from, or suffer from particular side effects of, drugs (pharmacogenomics) [5].

    5 DNA BIOLOGICAL DATABASES

    When starting any research project, it is necessary to gather information on the problem to be investigated. Biological data can be organized in many different ways:

    1. Flat text file databases;

    2. Relational databases;

    3. Object-oriented databases.

    Biological databases can be broadly classified into sequence and structure databases. Sequence databases are applicable to both nucleic acid sequences and protein sequences, whereas structure databases are applicable only to proteins.

    A biological database is a database of sequences, and the three kinds of biological sequence are protein, DNA and RNA. In recent years, biological data has doubled in size every 15 or 16 months. Because there is so much data in biology, biological databases have developed greatly and become part of the biologist's everyday toolbox. The number of queries has also increased, to 40,000 per day. Good database search methods are therefore needed; otherwise, the biological databases cannot be used efficiently.

    The Nature of the Data Collected from Patients

    To construct a database, samples of DNA must be collected, the samples analyzed, and the resulting data stored in such a way that it can be accessed efficiently. In the systems now in use, blood, saliva, or other tissue or fluid is collected.

    Databases and the ability to organize data are needed to keep research efficient and to get optimal output and information from data obtained in the lab.

    5.1 Biological Dataset

    A biological dataset is data or measurements collected from biological sources and stored or exchanged in digital form. Biological datasets are regularly stored in files or databases. Examples of biological data are DNA base-pair sequences and population data used in ecology.

    There are a number of DNA datasets from published cancer gene expression studies, including a leukemia dataset, a colon cancer dataset, a lymphoma dataset, a breast cancer dataset, and an ovarian cancer dataset. Three of them are used in this work.

    Leukemia cancer dataset

    The leukemia dataset consists of 61 samples: 25 samples of Acute Myeloid Leukemia (AML) and 36 samples of Acute Lymphoblastic Leukemia (ALL). The gene expression measurements were taken from 55 bone marrow samples and 6 peripheral blood samples. Of the 61 samples, 34 are leukemia cancer samples and the remaining are normal samples.

    Colon cancer dataset

    The colon dataset consists of 68 samples of colon epithelial cells taken from colon-cancer patients. Of the 68 samples, 46 are colon cancer samples and the remaining are normal samples.

    Lymphoma cancer dataset

    The lymphoma dataset consists of 35 samples of lymphoma cells taken from lymphoma-cancer patients. Of the 35 samples, 27 are lymphoma cancer samples and the remaining are normal samples.
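The sample counts above can be tabulated to make the class balance of each dataset explicit. This Python sketch uses only the totals and cancer counts stated in the text; the dictionary layout and the `summarize` helper are illustrative assumptions, and the normal counts are obtained by simple subtraction.

```python
# Sample counts as stated in the text for the three datasets used.
DATASETS = {
    "leukemia": {"samples": 61, "cancer": 34},
    "colon": {"samples": 68, "cancer": 46},
    "lymphoma": {"samples": 35, "cancer": 27},
}


def summarize(datasets: dict) -> dict:
    """Derive the normal-sample count and cancer fraction per dataset."""
    summary = {}
    for name, counts in datasets.items():
        normal = counts["samples"] - counts["cancer"]
        summary[name] = {
            "normal": normal,
            "cancer_fraction": counts["cancer"] / counts["samples"],
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(DATASETS).items():
        print(f"{name}: {row['normal']} normal, "
              f"{row['cancer_fraction']:.0%} cancer")
```

All three datasets contain more cancer samples than normal ones, which is worth keeping in mind when reading any accuracy figure computed on them.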

    6 METHODS AND MODELS

    6.1 Genetic algorithm

    Genetic algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. GAs are designed to simulate the processes in natural systems that are necessary for evolution, specifically those that follow the principle of survival of the fittest first laid down by Charles Darwin [7].

    Three operators are used by genetic algorithms:

    1. Selection: The selection operator refers to the method used for selecting which chromosomes will reproduce. The fitness function evaluates each of the chromosomes (candidate solutions), and the fitter the chromosome, the more likely it is to be selected to reproduce.

    2. Crossover: The crossover operator performs recombination, creating two new offspring by randomly selecting a locus and exchanging subsequences to the left and right of that locus between two chromosomes chosen during selection. For example, in binary representation, the two strings 11111111 and 00000000 could be crossed over at the sixth locus in each to generate the two new offspring 11111000 and 00000111.

    3. Mutation: The mutation operator randomly changes the bits or digits at a particular locus in a chromosome, usually with very small probability. For example, after crossover, the child string 11111000 could be mutated at locus two to become 10111000. Mutation introduces new information to the genetic pool and protects against converging too quickly to a local optimum.
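The crossover and mutation examples in the list can be reproduced with a few lines of Python. This is a minimal sketch on bit strings; the function names and the convention that loci are counted from 1, with the exchange starting at the named locus, are assumptions chosen so that the output matches the worked example.

```python
def crossover(parent_a: str, parent_b: str, locus: int) -> tuple:
    """Single-point crossover: bits from `locus` (1-based) onward are swapped."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    if not 1 <= locus <= len(parent_a):
        raise ValueError("locus out of range")
    cut = locus - 1
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(chromosome: str, locus: int) -> str:
    """Flip the bit at `locus` (1-based)."""
    if not 1 <= locus <= len(chromosome):
        raise ValueError("locus out of range")
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


if __name__ == "__main__":
    print(crossover("11111111", "00000000", 6))  # ('11111000', '00000111')
    print(mutate("11111000", 2))                 # 10111000
```

In a real run the locus would be drawn at random and mutation applied with a small probability; both are fixed here so the result can be checked against the text.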

    Most genetic algorithms function by iteratively updating a collection of possible solutions called a population. Each member of the population is evaluated for fitness on each cycle. A new population then replaces the old population using the operators above, with the fittest members being chosen for reproduction or cloning.

    The fitness function f(x) is a real-valued function operating on the chromosome (potential solution), not the gene, so that the x in f(x) refers to the numeric value taken by the chromosome at the time of fitness evaluation [2].
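The population update cycle can be sketched as a small generational loop. The Python example below is an illustration under stated assumptions, not the system described in this paper: the fitness function simply counts 1 bits (a toy stand-in for a real gene-selection fitness), selection is fitness-proportional, and the single fittest member is cloned into each new population. All names and parameter values are hypothetical.

```python
import random


def fitness(chromosome: str) -> int:
    """Toy fitness f(x): the number of 1 bits in the chromosome."""
    return chromosome.count("1")


def evolve(pop_size=20, length=16, generations=30, p_mut=0.02, seed=0):
    """Run a small generational GA; return (best chromosome, initial best fitness)."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    initial_best = max(fitness(c) for c in population)
    for _ in range(generations):
        # Selection weights: fitter chromosomes are more likely to reproduce.
        weights = [fitness(c) + 1 for c in population]
        # Cloning: the fittest member is copied unchanged into the new population.
        next_population = [max(population, key=fitness)]
        while len(next_population) < pop_size:
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, length)  # random crossover locus
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with small probability p_mut.
            child = "".join(
                ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
                for bit in child
            )
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), initial_best


if __name__ == "__main__":
    best, initial_best = evolve()
    print(best, fitness(best), "initial best:", initial_best)
```

Because the fittest member is always cloned, the best fitness in the population can never decrease from one cycle to the next.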

    6.2 Clustering

    Clustering refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized.

    k-means clustering

    In statistics and machine learning, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

    Algorithm: The k-means algorithm is a simple and effective algorithm for finding clusters in data. The algorithm proceeds as follows.

    Step 1: Choose the number of clusters, k.

    Step 2: Randomly assign k records to be the initial cluster center locations.

    Step 3: For each record, find the nearest cluster center. In a sense, each cluster center "owns" a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.

    Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.

    Step 5: Repeat steps 3 and 4 until convergence or termination.

    The "nearest" criterion in step 3 is usually Euclidean distance. The cluster centroid in step 4 is found as follows:

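    Suppose that there are n data points (a1, b1, c1), (a2, b2, c2), . . . , (an, bn, cn). The centroid of these points is their center of gravity, located at the point given by Equation (1):

    $$\left(\frac{\sum_{i=1}^{n} a_i}{n},\ \frac{\sum_{i=1}^{n} b_i}{n},\ \frac{\sum_{i=1}^{n} c_i}{n}\right) \qquad (1)$$

    For example, the points (1,1,1), (1,2,1), (1,3,1), and (2,1,1) have centroid (5/4, 7/4, 4/4) = (1.25, 1.75, 1.00).

    The algorithm terminates when the centroids no longer change. In other words, the algorithm terminates when, for all clusters C1, C2, . . . , Ck, all the records "owned" by each cluster center remain in that cluster. Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the sum of squared errors, given by Equation (2):

    $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 \qquad (2)$$

    where p ranges over the data points in cluster C_i, m_i is the centroid of cluster C_i, and d is the distance used in step 3.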
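The steps of the algorithm, the centroid calculation and the sum of squared errors can be sketched in plain Python. The paper's system was built in MATLAB, so this is an independent illustration rather than the authors' code. The function names, the optional `initial_centers` argument (added so that runs are reproducible) and the `max_iter` safety limit are assumptions of this sketch.

```python
import math
import random


def centroid(points):
    """Centre of gravity: the per-coordinate mean of the points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sse(clusters, centers):
    """Sum of squared Euclidean distances from each record to its centre."""
    return sum(math.dist(p, m) ** 2
               for cluster, m in zip(clusters, centers) for p in cluster)


def k_means(records, k, initial_centers=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: k records serve as the initial cluster centres.
    chosen = initial_centers if initial_centers else rng.sample(records, k)
    centers = [tuple(c) for c in chosen]
    clusters = []
    for _ in range(max_iter):
        # Step 3: each record is owned by its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in records:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: move each centre to the centroid of the records it owns
        # (an empty cluster keeps its previous centre).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 5: stop once the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers


if __name__ == "__main__":
    print(centroid([(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1)]))  # (1.25, 1.75, 1.0)
    records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    clusters, centers = k_means(records, 2, initial_centers=[(0, 0), (10, 10)])
    print(clusters, centers, sse(clusters, centers))
```

With random initial centres the final partition can depend on the starting records, which is why the usage example fixes them explicitly.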

    The proposed system

    This chart shows the phases of the system and its operations, in order.

    Figure 9. The proposed system. Blocks shown: DNA sequence input; MATLAB; DNA-Aggregator (data set); genetic algorithm; cluster; result; performance analysis.


    7 IMPLEMENTATION

    The system applies several methods, such as the genetic programming method, beginning with the initialization of GP. We also describe how the data clustering algorithms are used in the system, which was implemented in MATLAB version 7.9.0.529 (R2009b).

    The results are written to an Excel file. Figure 10 shows the results of starting the match program as they appear in the Excel file.

    Figure 10. Result in Excel file

    8 RESULTS

    The results reported in our work were obtained with the proposed processes, which aim at early and accurate diagnosis for cancer patients.

    - Leukemia

    Figure 11. Result of the leukemia process


    - Colon

    Figure 12. Result of the colon process

    - Lymphoma

    Figure 13. Result of the lymphoma process

    9 CONCLUSION

    In this paper, we proposed a genetic algorithm (GA) based approach to deal with the gene selection and classification tasks for multi-class micro-array datasets. The multi-class problem was divided into multiple two-class problems, and a set of sub-ensemble systems was deployed to deal with the respective two-class problems. The procedure responsible for extracting datasets is called DNA-Aggregator. We designed this biological aggregator to aggregate various datasets via a community-developed DNA micro-array ontology, based upon the concept of the semantic Web, for integrating and exchanging biological data. Trees were constructed with different genes, and important genes were selected as references for clinical diagnosis or cancer development. For each dataset, the biological significance of the selected genes was validated against a biological database. The GA based method, working with k-means clustering, presents a useful alternative in the analysis of complex multi-class micro-array datasets [9].

    In our work we have applied GA to the sequencing of DNA molecules. The results produced by the algorithm were very good, and in many cases were optimal or close to optimal. Several challenges were faced and solutions found, so the system that has been designed is used for classifying, clustering and detecting cancer in DNA chip data. The system involves two major modules: the first performs the clustering and the second detects cancer from the DNA chips.

    REFERENCES

    [1] Malcolm Campbell and Laurie J. Heyer. DNA Microarrays: Background, Interactive Databases, and Hands-on Data Analysis. Page 5.

    [2] Daniel T. Larose. Data Mining Methods and Models. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006. Page 241.

    [3] John M. Butler. Forensic DNA Typing. Elsevier (USA), 2005.

    [4] Lydia Schindler, Donna Kerrigan, Jeanne Kelly, Brian Hollen. Understanding Cancer and Related Topics: Understanding Gene Testing.

    [5] M. F. Fey. The impact of chip technology on cancer medicine. DOI: 10.1093/annonc/mdf647.

    [6] Pornima Phatak, S. Kalai Selvi, T. Divya, A. S. Hegde, Sridevi Hegde and Kumaravel Somasundaram. Alterations in tumour suppressor gene p53 in human gliomas from Indian patients. Indian Academy of Sciences, December 2002.

    [7] Tan Jun-shan, He Wei, Qing Yan. Application of Genetic Algorithm in Data Mining. 2009 First International Workshop on Education Technology and Computer Science, IEEE, 2009, 978-0-7695-3557-9/09, page 353. DOI: 10.1109/ETCS.2009.340.

    [8] W. B. Langdon and B. F. Buxton. Genetic Programming for Mining DNA Chip data from Cancer Patients. Computer Science, University College, Gower Street, London, WC1E 6BT, UK. http://www.cs.ucl.ac.uk/sta_/W.Langdon, /sta_/B.Buxton. Page 1.

    [9] Zakaria Suliman Zubi, Marim Aboajela Emsaed. 2010. "Sequence mining in DNA chips data for diagnosing cancer patients". In Proceedings of the 10th WSEAS International Conference on Applied Computer Science (ACS'10), Hamido Fujita and Jun Sasaki (Eds.). World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 139-151.

    [10] http://www.medicalnewstoday.com/info/cancer-oncology/whatiscancer.php. Page header: What is Cancer? Accessed 06-05-2010, 01:37 pm.

    [11] http://www.cancer.gov/cancertopics/what-is-cance Cancer?. Accessed 04-05-2010, 11:37 pm.

    [12] http://www.ornl.gov/sci/techresources/Human_Genome/graphics/DNASeq.Process.pdf. Page header: DNA Sequencing Process. Accessed 16-2-2010, 11:03 pm.

    Please cite this article as: Z. S. Zubi, M. A. Emsaed, (2013), Identifying Cancer Patients using DNA Micro-Array Data in Data Mining Environment, Journal of Science and Engineering, Vol. 3(2), 63-75.
