
[IEEE 2008 Second International Conference on Research Challenges in Information Science (RCIS), Marrakech, June 3-6, 2008]



A New Survey On Knowledge Discovery And Data Mining

Faouzi Mhamdi and Mourad Elloumi

Abstract- Knowledge discovery is an emerging field where we combine techniques from algorithmics, artificial intelligence, mathematics and statistics to deal with the theoretical and practical issues of extracting knowledge, i.e., new concepts or concept relationships hidden in volumes of raw data. Knowledge discovery offers the capacity to automate complex search and data analysis tasks.

Data mining is the main phase in the knowledge discovery process. It consists in extracting nuggets of knowledge, i.e., pertinent patterns, pattern correlations, estimations or rules, hidden in bodies of data. The extracted nuggets of knowledge will be used in the verification of hypotheses or the prediction and explanation of knowledge.

In this paper, we present a new survey on Knowledge Discovery and Data Mining (KDD).

Index Terms- Data Mining, Knowledge Discovery.

I. INTRODUCTION

Knowledge Discovery and Data Mining (KDD) is the process of non-trivial discovery, from data, of implicit information that is previously unknown and potentially interesting [39]. By non-trivial is meant that the identification is done through search or induction.

In many fields, such as Health, Geology, Marketing, Finance and Molecular Biology, the size and the complexity of data are growing exponentially [12,35,39,49,60,68,93,105,111]. Indeed, concerning biological data, for example, the Human Genome Project is providing the sequence of the three billion DNA bases that constitute the human genome. Consequently, we are also provided with the sequences of about 100 000 proteins. Therefore, we are entering the post-genomic era: after having focused so much effort on the accumulation of data, we now have to focus as much effort, and even more, on the analysis of these data. This will enable us to learn more about gene expression, protein interactions and other biological mechanisms. Analyzing this huge volume of data is a challenging task, not only because of its complexity and its numerous correlated factors, but also because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex

Manuscript received December 25, 2007. Faouzi Mhamdi and Mourad Elloumi are with the Research Unit of Technologies of Information and Communication, Higher School of Sciences and Technologies of Tunis, 5, Avenue Taha Hussein, Montfleury 1008 Tunis, Tunisia ([email protected], [email protected]).

biological mechanisms under study. Actually, these approaches use only a very limited number of parameters to represent the many correlated factors involved in the biological mechanisms. From here comes the necessity to use computer tools and to develop new in silico high-performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences, i.e., DNA, RNA and proteins, and, on the other hand, genetic and biochemical mechanisms. KDD is a response to these new trends [12,29,59,102,103,107].

KDD is a process made up of three main phases:
- The phase of data preprocessing,
- The phase of data processing, also called data mining,
- And the phase of data postprocessing [12,35,39,49,60,68,93,105,111].

In this paper, we present a new survey on KDD. In the second section, we discuss the phase of data preprocessing. In the third one, we discuss the phase of data mining. In the fourth one, we discuss the phase of data postprocessing. Finally, in the last one, we present a conclusion.

Fig. 1 The main phases of the KDD process.

II. DATA PREPROCESSING

Data preprocessing [30,34,91] consists in:

- Understanding the domain of application and the acquired knowledge, and identifying the goal of the KDD process from the user's point of view.

- Creating a target data set, i.e., selecting a sample of data on which the KDD must be applied.

- Cleaning the data, e.g., removing errors, eliminating redundant data and/or completing missing data.
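The cleaning step above can be sketched in a few lines; the records, the field names and the mean-fill rule are illustrative assumptions, not taken from the paper:

```python
# Illustrative cleaning sketch: deduplicate records, then fill missing values.
raw = [
    {"id": 1, "temperature": 36.8},
    {"id": 1, "temperature": 36.8},   # redundant record
    {"id": 2, "temperature": None},   # missing value
    {"id": 3, "temperature": 39.1},
]

# Eliminate redundant data: keep the first record seen for each id.
seen, deduplicated = set(), []
for record in raw:
    if record["id"] not in seen:
        seen.add(record["id"])
        deduplicated.append(record)

# Complete missing data: replace None with the mean of the known values.
known = [r["temperature"] for r in deduplicated if r["temperature"] is not None]
mean = sum(known) / len(known)
cleaned = [
    {**r, "temperature": r["temperature"] if r["temperature"] is not None else mean}
    for r in deduplicated
]
```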


- Case-Based Reasoning: this is a form of reasoning that starts from old experiences in order to solve a new problem. We operate as follows [123,124,125]: during the first step, we propose a solution. During the second step, either we adapt this solution to the needs of the new problem or we justify it. During the third step, we criticize the solution obtained, then we evaluate it. If this solution is satisfactory, then we save it; if not, we return to the second step in order to adapt it.

Case-based reasoning is used to make classification [126,127], to make prediction [128] and to make clustering [129].
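The three steps above can be sketched as follows; the case base, the similarity measure and the acceptance test are illustrative assumptions:

```python
# Minimal case-based reasoning cycle: propose, adapt/justify, evaluate, save.
case_base = {10: "solution-A", 20: "solution-B", 40: "solution-C"}

def solve(problem, tolerance=8):
    # Step 1: propose a solution, taken from the most similar old case.
    nearest = min(case_base, key=lambda old: abs(old - problem))
    solution = case_base[nearest]
    # Steps 2 and 3: criticize and evaluate the proposed solution.
    satisfactory = abs(nearest - problem) <= tolerance
    if satisfactory:
        # Save the newly solved case for future reasoning.
        case_base[problem] = solution
    return solution, satisfactory

print(solve(18))  # the old case 20 is close enough, so its solution is reused
```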

Fig. 2 Case-based reasoning cycle.

- Links analysis: this is an analysis which consists in connecting entities (customers, companies, ...) with each other by links. With each link a weight, which quantifies the strength of this connection, is assigned [130]. Links analysis is used to make prediction and to make classification [130,131,132,133,134].
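A minimal sketch of such weighted links; the entities and the weights are illustrative assumptions:

```python
# Entities connected by weighted links; the weight quantifies the strength
# of each connection.
links = {
    ("customer-1", "company-A"): 0.9,
    ("customer-1", "company-B"): 0.2,
    ("customer-2", "company-A"): 0.4,
}

def strongest_link(entity):
    """Return the entity most strongly connected to the given one."""
    related = {
        (a if b == entity else b): w
        for (a, b), w in links.items()
        if entity in (a, b)
    }
    return max(related, key=related.get)

print(strongest_link("customer-1"))  # company-A carries the heaviest weight
```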

- Decision trees: a decision tree is a tree such that each internal node represents a test on the value of one or more attributes. According to the result of this test, we pass to one of the children of this node. In order to process an element, we start from the root of the tree. We stop when a leaf is reached, i.e., a node representing no test. We say then that a decision is taken; hence the name of this tree [43]. Decision trees are used to make classification [10,135,95,136,76,42] and to make regression [10].

Fig. 3 Example of a decision tree.
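The decision tree of Fig. 3 can be written directly as nested tests, one per internal node; the attribute names follow the figure:

```python
# The decision tree of Fig. 3: each "if" is an internal node testing an
# attribute value; each "return" is a leaf carrying the decision.
def classify(temperature, throat):
    if temperature < 37.5:
        if throat == "irritated":
            return "sick"
        return "in good health"
    return "sick"
```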

Indeed, data may contain errors coming from the experiments and/or the measurements done [9,47].

- Coding the data, i.e., extracting attributes and selecting from them those which are pertinent, in order to code the data according to the goal to reach [72,112].

- Choosing the data mining task to achieve, e.g., prediction [107,109], clustering [4,74,104,106,108], estimation [86,94] and classification [65,100,101].

- Choosing the data mining algorithm to apply, according to the task to achieve.

III. DATA MINING

Data mining is the application of specific algorithms for the extraction of nuggets of knowledge, i.e., pertinent patterns, pattern correlations, estimations or rules, from data [35,125,37]. According to Fayyad et al. [35], data mining tasks are:

- Detection of changes and deviations: this task consists in locating the changes and deviations that occurred in the starting data.

- Modeling of dependences: this task consists in modeling the dependences existing between attributes.

- Abbreviation: this task consists in defining a shortened description of a subset of data.

- Regression: this task consists in examining the dependence between random variables, called dependent variables, and other random ones, called independent variables [94,54]. This dependence is represented by an equation called the regression equation [29,38]. Among the techniques allowing to make regression, we quote Bayesian techniques [14,15,17,82,83].
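As a minimal illustration, a regression equation of the form y = a·x + b can be fitted by ordinary least squares in closed form; the data points below are assumptions:

```python
# Least-squares fit of the regression equation y = a*x + b.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 7.2, 8.9]  # roughly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of x and y divided by the variance of x.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
# Intercept: the fitted line passes through the mean point.
b = mean_y - a * mean_x
```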

- Clustering: this task, also called unsupervised classification, consists in forming classes starting from a set of data. The classes are formed so that two elements of the same class resemble each other much more than two elements of two different classes. We distinguish two main types of clustering [4,8,13,16,24,57]:

(a) Hierarchical clustering: by making hierarchical clustering, we build a tree of classes called a dendrogram. Each node in this tree represents a class. The children of a node represent the subclasses of the class represented by the father node. Among the techniques allowing to make hierarchical clustering, we quote the technique based on link metrics [25,26,87,98], the one of hierarchical classes of arbitrary forms [41,45,46] and the one of divisive binary partitioning [107,99,100].

(b) Unhierarchical clustering: by making unhierarchical clustering, we build classes that are not subclasses of each other. Among the techniques allowing to make unhierarchical clustering, we quote the technique of k-means [50,51], the one of k-medoids [32,62,113,115], the one of probabilistic clustering [114,116,117,118] and the density-based one [119,120,121,122].
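A minimal sketch of the k-means technique on one-dimensional data with k = 2; the points and the starting centers are illustrative assumptions:

```python
# Minimal k-means: alternate assignment and update steps until stable.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [0.0, 10.0]

for _ in range(10):
    # Assignment step: attach each point to its closest center.
    classes = [[], []]
    for p in points:
        closest = min(range(2), key=lambda i: abs(p - centers[i]))
        classes[closest].append(p)
    # Update step: move each center to the mean of its class.
    centers = [sum(c) / len(c) for c in classes]
```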

- Classification: this task, also called supervised classification, consists in assigning a new element to one of the predefined classes.

The main techniques used to carry out the tasks presented above are:


Fig. 5 Architecture of a monolayer ANN.

(b) Connections between the neurons, known under the name of synaptic weights, are used to store knowledge.

ANN are used to make classification [5,27,41,56,96], to make clustering [41,53], to make prediction [111] and to make regression [56,75].

Output = Σi xi·wi, the weighted sum computed by the neuron shown in Fig. 5.
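The weighted sum computed by the monolayer unit of Fig. 5 can be sketched as follows; the inputs and weights are illustrative assumptions:

```python
# Monolayer neuron: the output is the weighted sum of the inputs,
# Output = sum_i x_i * w_i.
def neuron_output(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

print(neuron_output([1.0, 0.5, -1.0], [0.2, 0.4, 0.1]))
```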

- Support Vector Machines: a Support Vector Machine (SVM) is a discrimination technique. It consists in separating two, or more, sets of points by a hyperplane. It is based on the use of functions, known as kernels, which allow an optimal separation of the points of the plane into various categories. Unlike learning approaches that are based on the minimization of the empirical risk, this technique is based on the principle of minimization of the structural risk. SVM are used to make classification [19,22,23] and to make regression [67,89,97].

- Association rules: an association rule is a rule of the form « if C then P », where C is the condition of the rule and P is the prediction. The condition C is a conjunction of terms and the prediction P is a simple term. A term is of the form (attribute operator value) [59,66]. Association rules are used to make classification [71,73,99,137] and to make clustering [138].

Example: in the decision tree of Fig. 3, a patient having a temperature strictly lower than 37.5 and having an irritated throat, for example, will be classified as sick by this tree. Thus, the translation of this decision tree into association rules is as follows:

if (temperature < 37.5) and (throat = irritated) then patient = sick
if (temperature < 37.5) and (throat ≠ irritated) then patient = in good health
if (temperature ≥ 37.5) then patient = sick
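The three rules above can be encoded as (condition, prediction) pairs and applied in order; this is an illustrative sketch of rule application, not of the mining step itself:

```python
# Each rule pairs a condition (a conjunction of attribute tests) with a
# prediction; the first rule whose condition holds fires.
rules = [
    (lambda p: p["temperature"] < 37.5 and p["throat"] == "irritated", "sick"),
    (lambda p: p["temperature"] < 37.5 and p["throat"] != "irritated", "in good health"),
    (lambda p: p["temperature"] >= 37.5, "sick"),
]

def apply_rules(patient):
    for condition, prediction in rules:
        if condition(patient):
            return prediction
```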

- Genetic algorithms: a genetic algorithm is an algorithm whose principle of functioning is as follows: we start with an initial population of potential solutions, arbitrarily selected. We evaluate their respective performances. On the basis of these performances, we create a new population of potential solutions by using simple evolutionary operators, i.e., selection, crossing-over and mutation. We reiterate this process until we obtain a satisfactory solution [40,44,55,59]. Genetic algorithms are used to make classification [3,64,88] and to make clustering [18,31,33,58].
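This cycle can be sketched on a toy problem, maximizing the number of ones in a bit string; the problem, the population size and the operators' details are illustrative assumptions:

```python
import random

# Minimal genetic algorithm: selection, crossing-over and mutation.
random.seed(0)
LENGTH, SIZE = 12, 8
fitness = sum  # performance of a potential solution: its number of ones

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(SIZE)]
for _ in range(40):
    if max(fitness(s) for s in population) == LENGTH:
        break  # a satisfactory solution has been obtained
    # Selection: keep the better half of the population.
    population.sort(key=fitness, reverse=True)
    parents = population[: SIZE // 2]
    # Crossing-over and mutation: rebuild the population from the parents.
    children = []
    while len(parents) + len(children) < SIZE:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)
        child = a[:cut] + b[cut:]       # crossing-over at a random cut point
        child[random.randrange(LENGTH)] ^= 1  # mutation: flip one bit
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
```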

Fig. 4 Principle of functioning of a genetic algorithm.

- Artificial neural networks: an Artificial Neural Network (ANN) is a system made up of several simple calculating units functioning in parallel, whose function is determined by the structure of the network, the strength of the connections and the operation carried out by the elements or nodes. An ANN resembles the human brain in two aspects [52]:

(a) Knowledge is acquired by the ANN through a learning process.

Fig. 6 Principle of an SVM (maximum margin and support vectors).

- K-Nearest Neighbors algorithms: a K-Nearest Neighbors (KNN) algorithm is an algorithm dedicated to classification tasks that can be extended to estimation ones. It is a case-based reasoning algorithm: it is based on the idea of making decisions by seeking similar cases already solved in memory. There is no training step. Generally, the functioning of a KNN algorithm is carried out in two steps:


During the first step, we compute the distance between a sample y and each training sample x, by using a Euclidean distance.

During the second step, we identify the K samples which are the closest to y and choose the majority class among these K samples. KNN algorithms are used to make classification [20,48].
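The two steps above can be sketched as follows; the training samples and their classes are illustrative assumptions:

```python
import math
from collections import Counter

# Training samples: (coordinates, class label).
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((6.0, 6.2), "B"), ((5.8, 6.1), "B")]

def knn_classify(y, k=3):
    # Step 1: Euclidean distance between the sample y and each training sample.
    distances = sorted((math.dist(y, x), label) for x, label in training)
    # Step 2: majority class among the K closest samples.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))
```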

IV. DATA POSTPROCESSING

Data postprocessing of the nuggets of knowledge extracted during the data mining phase consists in [11]:

(a) Filtering the nuggets of knowledge in order to keep only those which are pertinent compared to the problem in question [69,70,95].

(b) Interpreting and explaining the nuggets of knowledge in order to make them understandable, if they have to be used by a human, or to express them in a suitable formalism, if they have to be used by a machine. In the case where the nuggets of knowledge have to be used by a human, techniques of visualization are used [2,6,21,36,63,77,78,90].

(c) Evaluating the nuggets of knowledge. This can be done, for example, by calculating the classification error rate or by evaluating the association rules via numerical indicators, e.g., measurements of interest [1].
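The classification error rate named in (c) is the fraction of elements assigned to the wrong class; a minimal sketch, with illustrative labels:

```python
# Classification error rate: wrong predictions over total predictions.
def error_rate(predicted, actual):
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

print(error_rate(["sick", "sick", "healthy", "sick"],
                 ["sick", "healthy", "healthy", "sick"]))
```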

(d) Integrating the nuggets of knowledge. This consists in combining these nuggets in order to obtain knowledge, i.e., new concepts or concept relationships hidden in volumes of raw data [11,95].

V. CONCLUSION

In this paper, we presented a new survey on Knowledge Discovery and Data Mining (KDD). We presented the different phases of KDD: data preprocessing, data processing, also called data mining, and data postprocessing. The most important phase among these three is data mining. We provided the different tasks of data mining and the main techniques used to carry out these tasks. The techniques presented in this paper are quite general. There are other data mining techniques specialized for particular types of data and domains.

REFERENCES

[1] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. 20th International Conference on Very Large Data Bases (VLDB'94) : (1994), p487-499.

[2] M. Ankerst, M. Ester, H-P. Kriegel, Towards an Effective Cooperation of the Computer and the User for Classification, in Proc. KDD'00, ACM SIGKDD, (Boston, USA) : (2000), p179-188.

[3] S. Bandyopadhyay, C. A. Murthy, Pattern Classification Using Genetic Algorithms, Pattern Recognition Letters, Vol. 16 : (1995), p801-808.

[4] P. Berkhin, Survey of Clustering Data Mining Techniques, Technical report, Accrue Software : (2002).

[5] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press (Publish.) : (1995).

[6] L. Boudjeloud, F. Poulet, Visual Interactive Evolutionary Algorithm for High Dimensional Data Clustering and Outlier Detection, Pacific Asia Conference on Knowledge Discovery and Data Mining, PAKDD'2005, (Hanoi, Vietnam) : (May, 2005).

[7] A. Bouguettaya, On Line Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, N°2 : (April 1996).

[8] J. M. Bouroche, G. Saporta, L'Analyse des Données, Presses Universitaires de France (Eds.), (Paris, France) : (2002).

[9] R. Brachman, T. Anand, The Process of Knowledge Discovery in Databases: A Human-Centered Approach, in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, MIT Press : (1996).

[10] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Technical report, Wadsworth International, (Monterey, CA) : (1984).

[11] I. Bruha, A. F. Famili, Postprocessing in Machine Learning and Data Mining, in Proc. Workshop on Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics within KDD-2000: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Boston, MA, USA) : (August, 2000), p20-23.

[12] V. Brusic, J. Zeleznikow, Knowledge Discovery and Data Mining in Biological Databases, The Knowledge Engineering Review, Vol. 14, N°3, Cambridge University Press, UK : (1999), p257-277.

[13] F. Cailliez, Introduction à l'analyse des données, Smash (Publish.) : (1976).

[14] O. Cappe, T. Ryden, Hidden Markov Models, Springer-Verlag, New York (Eds.) : (2004).

[15] B. P. Carlin, S. Chib, Bayesian Model Choice Through Markov Chain Monte Carlo, J. Roy. Statist. Soc. (Ser. B), Vol. 57, N°3 : (1995), p473-484.

[16] G. Celeux, E. Diday, H. Ralambondrainy, Y. Lechevallier, G. Govaert, Classification automatique des données, Dunod (Publish.) : (1989).

[17] M. H. Chen, Q. M. Shao, J. G. Ibrahim, Monte Carlo Methods in Bayesian Computation, Springer-Verlag, Heidelberg, Germany (Eds.) : (2000).

[18] R. M. Cole, Clustering with Genetic Algorithms, Master's Thesis, University of Western Australia, Australia : (1998).

[19] C. Cortes, V. N. Vapnik, Support Vector Networks, Machine Learning J., N°20 : (1995), p273-297.

[20] T. M. Cover, P. E. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information Theory J., Vol. 13 : (1967), p21-27.

[21] K. Cox, S. Eick, G. Wills, Visual Data Mining: Recognizing Telephone Calling Fraud, Data Mining and Knowledge Discovery J., Vol. 1, N°2 : (1997).

[22] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge : (2000).

[23] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK : (2000).

[24] B. V. Custsem, Classification and Dissimilarity Analysis, Springer-Verlag, Heidelberg, Germany (Eds.) : (1994).

[25] W. Day, H. Edelsbrunner, Efficient Algorithms for Agglomerative Hierarchical Clustering Methods, J. Classification, Vol. 1, N°7 : (1984), p7-24.

[26] D. Defays, An Efficient Algorithm for a Complete Link Method, The Computer Journal, N°20 : (1977), p364-366.

[27] L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag (Publish.) : (1996).

[28] N. R. Draper, H. Smith, Applied Regression Analysis, Wiley Series in Probability and Statistics (Publish.) : (1998).

[29] M. Elloumi, Knowledge Discovery and Data Mining, Knowledge Based Systems Journal, Vol. 15, Issue 4, Elsevier Publishing Co., Amsterdam, North-Holland (Publisher) : (April 2002).

[30] R. Engels, C. Theusinger, Using a Data Metric for Offering Preprocessing Advice in Data Mining Applications, in Proc. Thirteenth European Conference on Artificial Intelligence, p430-434 : (1998).

[31] V. Estivill-Castro, Hybrid Genetic Algorithms Are Better for Spatial Clustering, in Pacific Rim International Conference on Artificial Intelligence : (2000), p424-434.

[32] M. Ester, H. P. Kriegel, X. Xu, A Database Interface for Clustering in Large Spatial Databases, in Proc. ACM SIGKDD'95, (Montreal, Canada) : (1995), p94-99.

[33] E. Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley and Sons (Publish.) : (1998).


[34] A. Famili, W. M. Shen, R. Weber, E. Simoudis, Data Preprocessing and Intelligent Data Analysis, Intelligent Data Analysis, Vol. 1 : (1997), p3-23.

[35] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From Data Mining to Knowledge Discovery in Databases: An Overview, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, USA : (1996).

[36] U. Fayyad, G. Grinstein, A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann (Publish.) : (2001).

[37] U. Fayyad, Data Mining Grand Challenges, in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD'04, Vol. 2 : (2004).

[38] J. Fox, Applied Regression Analysis, Linear Models and Related Methods, Sage (Publish.) : (1997).

[39] W. Frawley, G. Piatetsky-Shapiro, C. Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine : (Fall 1992), p213-228.

[40] A. A. Freitas, A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery, in Advances in Evolutionary Computation, A. Ghosh and S. Tsutsui (Eds.), Springer-Verlag : (2002).

[41] B. Fritzke, Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning, Neural Networks J., Vol. 7, N°9 : (1994).

[42] O. Gascuel, Data Model and Classification by Trees, special issue of Studies in Classification, Data Analysis, and Knowledge Organization, W. Gaul, H. Locarek-Junge (Eds.), Springer : (1999), p58-67.

[43] M. Goebel, L. Gruenwald, A Survey of Data Mining and Knowledge Discovery Software Tools, ACM SIGKDD, Vol. 1, N°1 : (June 1999), p20-33.

[44] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley (Publish.) : (1989).

[45] S. Guha, R. Rastogi, K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, in Proc. ACM SIGMOD'98 Conference, (Seattle, WA) : (1998), p73-84.

[46] S. Guha, R. Rastogi, K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, in Proc. ICDE'99, (Sydney, Australia) : (1999), p512-521.

[47] I. Guyon, N. Matic, V. Vapnik, Discovering Informative Patterns and Data Cleaning, in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Menlo Park, California, AAAI Press : (1996), p181-204.

[48] E. H. Han, Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification, PhD thesis, University of Minnesota : (October 1999).

[49] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann (Publish.) : (2006).

[50] J. Hartigan, Clustering Algorithms, John Wiley and Sons (Publish.), New York, NY : (1975).

[51] J. Hartigan, M. Wong, Algorithm AS136: A k-means Clustering Algorithm, Applied Statistics, N°28 : (1979), p100-108.

[52] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, NY (Publish.) : (1994).

[53] J. Herrero, A. Valencia, J. Dopazo, A Hierarchical Unsupervised Growing Neural Network for Clustering Gene Expression Patterns, Bioinformatics J., Vol. 17, N°2 : (2001), p126-136.

[54] R. Hilborn, M. Mangel, The Ecological Detective - Confronting Models with Data, Princeton University Press : (1997).

[55] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press (Publish.) : (1975).

[56] K. Hornik, Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks J., N°4 : (1991), p251-257.

[57] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice Hall (Publish.), Englewood Cliffs, New Jersey, USA : (1988), p55-142.

[58] K. A. Jong, W. M. Spears, D. F. Gordon, Using Genetic Algorithms for Concept Learning, Machine Learning J., Vol. 13 : (1993), p161-188.

[59] L. Jourdan, Métaheuristiques pour l'Extraction de Connaissances : Application à la Génomique, Thèse de Doctorat, Université des Sciences et Technologies de Lille, France : (Novembre 2003).

[60] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press (Publish.) : (2002).

[61] G. Karypis, E. H. Han, V. Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, Computer J., N°32 : (1999), p68-75.

[62] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons (Publish.), New York : (1990).

[63] D. Keim, Information Visualization and Visual Data Mining, IEEE Transactions on Visualization and Computer Graphics, Vol. 8, N°1 : (2002), p1-8.

[64] J. D. Kelly, L. Davis, A Hybrid Genetic Algorithm for Classification, in Proc. IJCAI'91, Morgan Kaufmann (Publish.), p645-650 : (1991).

[65] T. M. Khoshgoftaar, E. B. Allen, Modeling Software Quality with Classification Trees, in Recent Advances in Reliability and Quality Engineering, Hoang Pham (Ed.), World Scientific, Singapore (Publish.) : (1999).

[66] Y. Kodratoff, Techniques et outils de l'extraction de connaissances à partir des données, Signaux J., N°92, p38-43 : (Mars 1998).

[67] T. Y. Kwok, Support Vector Mixture for Classification and Regression Problems, in Proc. 14th International Conference on Pattern Recognition, (Brisbane, Australia) : (1998).

[68] D. T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley (Publish.) : (2004).

[69] R. Lehn, F. Guillet, H. Briand, Qualité d'un Ensemble de Règles : Elimination des Règles Redondantes, Mesures de Qualité pour la Fouille de Données, RNTI-E-1 : (2004), p141-167.

[70] P. Lenca, P. Meyer, P. Picouet, B. Vaillant, S. Lallich, Evaluation et Analyse Multicritères des Mesures de Qualité des Règles d'Association, Mesures de Qualité pour la Fouille de Données, RNTI-E-1 : (2004), p219-246.

[71] W. Li, J. Han, J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01 : (2001).

[72] H. Liu, Feature Extraction, Construction and Selection: A Data Mining Perspective, in Series: The Springer International Series in Engineering and Computer Science, H. Motoda (Eds.), N°453 : (1998).

[73] B. Liu, W. Hsu, Y. Ma, Integrating Classification and Association Rule Mining, KDD'98 : (1998).

[74] A. Bouguettaya, On Line Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, N°2 : (April 1996).

[75] D. Mandic, J. Chambers, Recurrent Neural Networks for Prediction: Architectures, Learning Algorithms and Stability, Wiley (Publish.) : (2001).

[76] T. Mitchell, Decision Tree Learning, in T. Mitchell, Machine Learning, The McGraw-Hill Companies Inc. (Publish.) : (1997), p52-78.

[77] F. Poulet, Full-View: A Visual Data-Mining Environment, International Journal of Image and Graphics, Vol. 2, N°1 : (2002), p127-144.

[78] F. Poulet, P. Kuntz, Visualisation et Extraction de Connaissances, numéro spécial Revue des Nouvelles Technologies de l'Information, RNTI-E-4, Cépaduès (Publish.) : (2005).

[79] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C, Cambridge University Press : (1988).

[80] J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann (Publish.), (San Mateo, CA) : (1993).

[81] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press (Publish.) : (1996).

[82] C. P. Robert, The Bayesian Choice, Springer-Verlag, Heidelberg, Germany (Eds.) : (2001).

[83] C. P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer-Verlag, New York (Eds.) : (2004).

[84] S. Savaresi, D. Boley, On Performance of Bisecting K-means and PDDP, in Proc. SIAM ICDM'01, (Chicago, IL) : (2001).

[85] S. M. Savaresi, D. L. Boley, S. Bittanti, G. Gazzaniga, Cluster Selection in Divisive Clustering Algorithms, in Proc. SIAM ICDM'02, (Arlington, VA) : (2002), p299-314.

[86] D. W. Scott, Multivariate Density Estimation, Wiley, New York (Publish.) : (1992).

[87] R. Sibson, SLINK: An Optimally Efficient Algorithm for the Single Link Cluster Method, Computer Journal, N°16 : (1973), p30-34.

[88] D. B. Skalak, Using a Genetic Algorithm to Learn Prototypes for Case Retrieval and Classification, in Proc. AAAI-93 Case-Based Reasoning Workshop, (Menlo Park, CA) : (1994), p64-69.

[89] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's thesis, Technische Universität München : (1996).

[90] R. Spence, Information Visualization, Addison-Wesley, ACM Press (Publish.) : (2001).

[91] L. Soibelman, M. Asce, K. Hyunjoo, Data Preparation Process for Construction Knowledge Generation through Knowledge Discovery in Databases, J. Computing in Civil Engineering : (January 2002).


[92] M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, ACM SIGKDD'00, (Boston, MA) : (2000).

[93] P. N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson Addison Wesley (Publish.) : (May 2005).

[94] J. L. Thorne, H. Kishino, I. S. Painter, Estimating the Rate of Evolution of the Rate of Molecular Evolution, Molecular Biology and Evolution, Vol. 15, N°12 : (1998), p1647-1657.

[95] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, H. Mannila, Pruning and Grouping of Discovered Association Rules, ECML-95 Workshop on Statistics, Machine Learning, and Discovery in Databases, (Heraklion, Greece) : (1995).

[96] J. Utans, J. Moody, Selecting Neural Network Architectures via the Prediction Risk: Application to Corporate Bond Rating Prediction, in Proc. AI Applications on Wall Street, IEEE Computer Society (Publish.), (Los Alamitos, CA) : (1991), p35-41.

[97] V. Vapnik, S. Golowich, A. Smola, Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing, in M. Mozer, M. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, (Cambridge, MA), MIT Press, N°9 : (1997), p281-287.

[98] E. M. Voorhees, Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval, Information Processing and Management, Vol. 22, N°6 : (1986), p465-476.

[99] J. Wang, G. Karypis, HARMONY: Efficiently Mining the Best Rules for Classification, SIAM'05 : (2005).

[100] J. T. L. Wang, T. G. Marr, D. Shasha, B. A. Shapiro, G. Chim, Discovering Active Motifs in Sets of Related Protein Sequences and Using Them for Classification, Journal of Nucleic Acids Research, Vol. 22, N°14 : (1994), p2769-2775.

[101] J. T. L. Wang, S. Rozen, B. A. Shapiro, D. Shasha, Z. Wang, M. Yin, New Techniques for DNA Sequence Classification, Journal of Computational Biology, Vol. 6, N°2 : (1999), p209-218.

[102] J. T. L. Wang, Q. Ma, D. Shasha, C. H. Wu, Application of Neural Networks to Biological Data Mining: A Case Study in Protein Sequence Classification, in Proc. of 6th International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD : (2000), p305-309.

[103] J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen, Data Mining in Bioinformatics, Springer, Germany : (2005).

[104] P. Willett, Parallel Database Processing, Text Retrieval and Cluster Analyses, Pitman (Publish.), London, UK : (1990).

[105] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann (Publish.) : (2006).

[106] N. Ye, A Scalable Clustering Technique for Intrusion Signature Recognition, in Proc. 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point (Publish.), NY, 5-6 June : (2001), p1-4.

[107] M. M. Yin, J. T. L. Wang, GeneScout: A Data Mining System for Predicting Vertebrate Genes in Genomic DNA Sequences, Journal of Information Sciences, Vol. 163, N°1-3 : (June 2004), p201-218.

[108] L. T. H. Yu, F. Chung, S. C. F. Chan, Emerging Pattern Based Projected Clustering for Gene Expression Data, European Workshop on Data Mining and Text Mining for Bioinformatics, ECML/PKDD'03, (Dubrovnik, Croatia) : (September 2003).

[109] X. Zhang, J. P. Mesirov, D. L. Waltz, A Hybrid System for Protein Secondary Structure Prediction, Journal of Molecular Biology, Vol. 225 : (1992), p1049-1063.

[110] D. Zighed, R. Rakotomalala, Data Mining, Éditions Techniques de l'Ingénieur : (2002), p1-26.

[111] G. Piatetsky-Shapiro, R. Grossman, C. Djeraba, R. Feldman, L. Getoor, M. Zaki, What Are the Grand Challenges for Data Mining?, KDD-2006 Panel Report, SIGKDD Explorations, Vol. 8, N°2 : (Dec 2006).

[112] R. Rakotomalala, F. Mhamdi, M. Elloumi, Hybrid Feature Ranking for Protein Classification, in Proc. 1st International Conference on Advanced Data Mining and Applications, ADMA'05, (Wuhan, China), Springer Lecture Notes in Computer Science, Springer-Verlag, Berlin : (July 2005).

[113] R. Ng, J. Han, Efficient and Effective Clustering Methods for Spatial Data Mining, in Proc. VLDB'94, (Santiago, Chile) : (1994), p144-155.

[114] C. Wallace, D. Dowe, Intrinsic Classification by MML - The Snob Program, in Proc. Australian Joint Conference on Artificial Intelligence, World Scientific (Publish.), (Armidale, Australia) : (1994), p37-44.

[115] N. Beckmann, H. P. Kriegel, R. Schneider, B. Seeger, The R*-Tree: An Efficient Access Method for Points and Rectangles, in Proc. International Conference on Geographic Information Systems, (Ottawa, Canada) : (1990).

[116] P. Cheeseman, J. Stutz, Bayesian Classification (AutoClass): Theory and Results, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press : (1996).

[117] C. Fraley, A. Raftery, MCLUST: Software for Model-Based Cluster and Discriminant Analysis, Technical Report, Dept. Statistics, Univ. of Washington, N°342 : (1999).

[118] W. Erray, F. Mhamdi, FaBR-CL : Méthode de Classification Croisée de Protéines, in Proc. EGC'06, (Lille, France) : (January 2006).

[119] H. P. Kriegel, B. Seeger, R. Schneider, N. Beckmann, The R*-Tree: An Efficient Access Method for Geographic Information Systems, in Proc. International Conference on Geographic Information Systems, (Ottawa, Canada) : (1990).

[120] M. Ester, H. P. Kriegel, J. Sander, X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proc. ACM SIGKDD'96, (Portland, Oregon) : (1996), p226-231.

[121] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann (Publish.) : (2006).

[122] E. Kolatch, Clustering Algorithms for Spatial Databases: A Survey, Dept. of Computer Science, University of Maryland, College Park : (2001).

[123] J. L. Kolodner, An Introduction to Case-Based Reasoning, Artificial Intelligence J., Vol. 6, N°1 : (1992), p3-34.

[124] J. L. Kolodner, Case-Based Reasoning, Morgan Kaufmann (Publish.), (San Mateo, CA) : (1993).

[125] A. Aamodt, E. Plaza, Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches, AI Communications J., Vol. 7, N°1 : (1994), p39-59.

[126] C. G. Overton, J. Haas, Case-Based Reasoning Driven Gene Annotation, in S. L. Salzberg, D. B. Searls, S. Kasif (Eds.), Computational Methods in Molecular Biology, Elsevier (Publish.) : (1998), p65-86.

[127] J. Lieber, Strong, Fuzzy and Smooth Hierarchical Classification for Case-Based Problem Solving, in Proc. European Conference on Artificial Intelligence : (2002).

[128] B. Leng, B. G. Buchanan, H. B. Nicholas, Protein Secondary Structure Prediction using Two-Level Case-Based Reasoning, Journal of Computational Biology, Vol. 1 : (1994), p25-38.

[129] C. Tsatsoulis, H. A. Amthauer, Finding Clusters of Similar Events within Clinical Incident Reports: A Novel Methodology Combining Case-Based Reasoning and Information Retrieval, International Journal of Quality and Safety in Health Care, Vol. 12, p24-32.

[130] M. J. A. Berry, G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley and Sons (Publish.) : (1997).

[131] D. Jensen, H. Goldberg, AAAI Fall Symposium on AI and Link Analysis, AAAI Press (Publish.) : (1998).

[132] G. Attardi, A. Gulli, F. Sebastiani, Automatic Web Page Categorization by Link and Context Analysis, in C. Hutchison, G. Lanzarone (Eds.), Proc. THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, (Varese, Italy) : (1999), p105-119.

[133] R. Feldman, Link Analysis: Current State of the Art, in Proc. KDD'02 Tutorial : (2002).

[134] M. Thelwall, Link Analysis: An Information Science Approach, Academic Press (Publish.) : (2004).

[135] P. Winston, Learning by Building Identification Trees, in P. Winston, Artificial Intelligence, Addison-Wesley (Publish.) : (1992), p423-442.

[136] Y. Yuan, M. J. Shaw, Induction of Fuzzy Decision Trees, Fuzzy Sets and Systems J., N°69 : (1995), p125-139.

[137] X. Yin, J. Han, CPAR: Classification based on Predictive Association Rules, SDM'03 : (2003).

[138] B. Lent, A. Swami, J. Widom, Clustering Association Rules, in Proc. International Conference on Data Engineering, ICDE'97, IEEE Computer Society (Publish.) : (1997), p220-231.

Faouzi Mhamdi received the Licence's Degree in Computer Science in 1999 from the Faculty of Sciences of Tunis, Tunisia. Then, he received a Master's Degree in Computer Science from the National School of Computer Science, Tunis, Tunisia. Now, he is preparing a PhD Degree in Computer Science at the Faculty of Sciences of Tunis. His main research area is Knowledge Discovery in Databases and its application in Bioinformatics.