
Article type: Advanced Review

Data Discretization: Taxonomy and Big Data Challenge

Sergio Ramírez-Gallego 1, Salvador García 1*, Héctor Mouriño-Talín 2, David Martínez-Rego 2,3, Verónica Bolón-Canedo 2, Amparo Alonso-Betanzos 2, José Manuel Benítez 1, Francisco Herrera 1

1. Department of Computer Science and Artificial Intelligence, University of Granada, 18071, Spain, {sramirez|salvagl|j.m.benitez|herrera}@decsai.ugr.es
2. Department of Computer Science, University of A Coruña, 15071 A Coruña, Spain. {h.mtalin|dmartinez|veronica.bolon|ciamparo}@udc.es
3. Department of Computer Science, University College London, WC1E 6BT London, United Kingdom.
*. Corresponding Author.

Keywords
Data Discretization, Taxonomy, Big Data, Data Mining, Apache Spark

Abstract

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories which are adequate for the learning task, retaining as much information in the original continuous attribute as possible. In this paper, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this paper is thus twofold: first, we provide a comprehensive taxonomy of discretization techniques to help practitioners in the use of the algorithms; second, we demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We therefore propose a distributed implementation of one of the most well-known discretizers, based on Information Theory and obtaining better results: the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and it is intended to be the first to face the Big Data challenge.

INTRODUCTION

Data is present in diverse formats, for example in categorical, numerical or continuous values. Categorical or nominal values are unsorted, whereas numerical or continuous values are assumed to be sorted or to represent ordinal data. It is well known that Data Mining (DM) algorithms depend very much on the domain and type of data. In this way, the techniques belonging to the field of statistical learning prefer numerical data (i.e., support vector machines and instance-based learning), whereas symbolic learning methods require inherently finite values and also prefer to branch on values that are not ordered (such as in the case of decision trees or rule induction learning). These techniques are either expected to work on discretized data or to be integrated with internal mechanisms to perform discretization.

The process of discretization has aroused general interest in recent years (51; 23) and has become one of the most effective data pre-processing techniques in DM (71). Roughly speaking, discretization translates quantitative data into qualitative data, procuring a non-overlapping division of a continuous domain. It also ensures an association between each numerical value and a certain interval. Actually, discretization is considered a data reduction mechanism since it diminishes data from a large domain of numeric values to a subset of categorical values.

Many DM algorithms can only deal with discrete attributes and thus need to use discretized data. For example, three of the ten methods pointed out as the top ten in DM (75) demand data discretization in one form or another: C4.5 (73), Apriori (8) and Naïve Bayes (41). Among its main benefits, discretization brings remarkable improvements in learning speed and accuracy to learning methods. Besides, some decision tree-based algorithms produce shorter, more compact, and accurate results when using discrete values (39; 23).

The specialized literature gathers a huge number of proposals for discretization. In fact, some surveys have been developed attempting to organize the available pool of techniques (51; 23; 43). When dealing with a new real problem or data set, it is crucial to make the best choice in the selection of a discretizer. This will condition the success and the suitability of the subsequent learning phase in terms of accuracy and simplicity of the solution obtained. In spite of the effort already made in (51) to categorize the whole family of discretizers, a new taxonomy is presented in this paper, updated at the time of writing, which includes probably the most well-known and surely most effective discretizers.

Classical data reduction methods are not expected to scale well when managing huge data (both in number of features and instances), so that their application can be undermined or even become impracticable (59). Scalable distributed techniques and frameworks have appeared along with the problem of Big Data. MapReduce (26) and its open-source version Apache Hadoop (62; 74) were the first distributed programming techniques to face this problem. Apache Spark (64; 72) is one of these new frameworks, designed as a fast and general engine for large-scale data processing based on in-memory computation. Through this ability, Spark can speed up the iterative processes present in many DM problems. Similarly, several DM libraries for Big Data have appeared as support for this task. The first one was Mahout (63) (as part of Hadoop), subsequently followed by MLlib (67), which is part of the Spark project (64). Although many state-of-the-art DM algorithms have been implemented in MLlib, this is not yet the case for discretization algorithms.


In order to fill this gap, we face the Big Data challenge by presenting a distributed version of the entropy minimization discretizer proposed by Fayyad and Irani in (6), which is based on the Minimum Description Length Principle, using Apache Spark. Our main objective is to prove that well-known discretization algorithms such as this MDL-based discretizer (henceforth called MDLP) can be parallelized in these frameworks, providing good discretization solutions for Big Data analytics. Furthermore, we have transformed the iterativity of the original proposal into a single-step computation. Notice that this new version for distributed environments has required a deep restructuring of the original proposal and has been a challenge for the authors. Finally, to demonstrate the effectiveness of our framework, we perform an experimental evaluation with two large datasets, namely, ECBDL14 and epsilon.

In order to achieve the goals mentioned, this paper is structured as follows. First, we provide in the next Section (Background and Properties) an explanation of discretization, its properties and the description of the standard MDLP technique. The next Section (Taxonomy) presents the updated taxonomy of the most relevant discretization methods. Afterwards, in the Section Big Data Background, we focus on the background of the Big Data challenge, including the MapReduce programming framework as the most prominent solution for Big Data. The following section (Distributed MDLP Discretization) describes the distributed algorithm based on entropy minimization proposed for Big Data. The experimental framework, results and analysis are given in the penultimate section (Experimental Framework and Analysis). Finally, the main concluding remarks are summarized.

BACKGROUND AND PROPERTIES

Discretization is a wide field and there have been many advances and ideas over the years. This section is devoted to providing a proper background on the topic, including an explanation of the basic discretization process and enumerating the main properties that allow us to categorize the techniques and to build a useful taxonomy.

Discretization Process

In supervised learning, and specifically in classification, the problem of discretization can be defined as follows. Assuming a data set S consisting of N examples, M attributes and c class labels, a discretization scheme D_A would exist on the continuous attribute A ∈ M, which partitions this attribute into k_A discrete and disjoint intervals: {[d_0, d_1], (d_1, d_2], ..., (d_{k_A-1}, d_{k_A}]}, where d_0 and d_{k_A} are, respectively, the minimum and maximal value, and P_A = {d_1, d_2, ..., d_{k_A-1}} represents the set of cut points of A in ascending order.

We can characterize a typical discretization as univariate. Although this property will be reviewed in the next section, it is necessary to introduce it here for the basic understanding of the basic discretization process. Univariate discretization operates with one continuous feature at a time, while multivariate discretization considers multiple features simultaneously.

Fig. 1. Discretization Process.

A typical discretization process generally consists of four steps (seen in Figure 1): (1) sorting the continuous values of the feature to be discretized, (2) evaluating either a cut point for splitting or adjacent intervals for merging, (3) splitting or merging intervals of continuous values according to some defined criterion, and (4) stopping at some point. Next, we explain these four steps in detail; a minimal sketch of this generic loop is given after the list.

– Sorting: The continuous values for a feature are sorted in either descending or ascending order. It is crucial to use an efficient sorting algorithm with a time complexity of O(N log N). Sorting must be done only once and for the entire initial process of discretization. It is a mandatory treatment and can be applied when the complete instance space is used for discretization.

– Selection of a Cut Point: After sorting, the best cut point or the best pair of adjacent intervals should be found in the attribute range in order to split or merge in a following required step. An evaluation measure or function is used to determine the correlation, gain, improvement in performance or any other benefit according to the class label.

– Splitting/Merging: Depending on the operation method of the discretizers, intervals either can be split or merged. For splitting, the possible cut points are the different real values present in an attribute. For merging, the discretizer aims to find the best adjacent intervals to merge in each iteration.


– Stopping Criteria: It specifies when to stop the discretization process. It should assume a trade-off between a final lower number of intervals, good comprehension and consistency.
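
As a minimal illustration of this generic loop, the sketch below (written in Scala, the language of the Spark implementation discussed later) applies steps (1)-(4) to a single attribute. The evaluate and stop functions are hypothetical placeholders that a concrete discretizer would supply; this is not the code of any particular method.

```scala
object GenericSplittingSketch {
  // points: (value, class label) pairs for one attribute, in any order.
  // evaluate: scores a candidate cut point (lower is better, by convention here).
  // stop: decides whether the best score is not worth another split.
  def discretize(
      points: Seq[(Double, Int)],
      evaluate: (Seq[(Double, Int)], Double) => Double,
      stop: (Seq[(Double, Int)], Double) => Boolean): List[Double] = {
    val sorted = points.sortBy(_._1)                              // (1) sort once
    val candidates = sorted.map(_._1).distinct.sliding(2)
      .collect { case Seq(a, b) => (a + b) / 2 }.toList           // midpoints between distinct values
    if (candidates.isEmpty) Nil
    else {
      val (best, score) =
        candidates.map(c => (c, evaluate(sorted, c))).minBy(_._2) // (2) select a cut point
      if (stop(sorted, score)) Nil                                // (4) stopping criterion
      else {
        val (left, right) = sorted.partition(_._1 <= best)        // (3) split the interval
        (discretize(left, evaluate, stop) :+ best) ++ discretize(right, evaluate, stop)
      }
    }
  }
}
```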

Discretization Properties

In (51; 23; 43), various pivots have been used in order to make a classification of discretization techniques. This section reviews and describes them, underlining the major aspects and alliances found among them. The taxonomy presented afterwards will be founded on these characteristics (acronyms of the methods correspond with those presented in Table 1):

– Static vs. Dynamic: This property refers to the level of independence between the discretizer and the learning method. A static discretizer is run prior to the learning task and is autonomous from the learning algorithm (23), as a data preprocessing algorithm (71). Almost all known isolated discretizers are static. By contrast, a dynamic discretizer responds when the learner requires it, during the building of the model. Hence, dynamic discretizers must belong to the local discretizer family (see later), embedded in the learner itself, producing an accurate and compact outcome together with the associated learning algorithm. Good examples of classical dynamic techniques are the ID3 discretizer (73) and ITFP (31).

– Univariate vs. Multivariate: Univariate discretizers only operate with a single attribute at a time. This means that they sort the attributes independently and, then, the derived discretization scheme for each attribute remains unchanged in the following phases. Conversely, multivariate techniques concurrently consider all or various attributes to determine the initial set of cut points or to make a decision about the best cut point chosen as a whole. They may accomplish discretization handling the complex interactions among several attributes to decide also the attribute in which the next cut point will be split or merged. Interest has recently appeared in developing multivariate discretizers since they are decisive in complex predictive problems, where univariate operations may ignore important interactions between attributes (68; 69), and in deductive learning (56).

– Supervised vs. Unsupervised: Supervised discretizers consider the class label whereas unsupervised ones do not. The interaction between the input attributes and the class output, and the measures used to make decisions on the best cut points (entropy, correlations, etc.), define the supervised manner to discretize. Although most of the discretizers proposed are supervised, there is a growing interest in unsupervised discretization for descriptive tasks (48; 56). Unsupervised discretization can be applied to both supervised and unsupervised learning, because its operation does not require the specification of an output attribute. This is not the case for supervised discretizers, which can only be applied to supervised learning. Unsupervised discretization also opens the door to transferring the learning between tasks, since the discretization is not tailored to a specific problem.

– Splitting vs. Merging: These two options refer to the approach used to define or generate new intervals. The former methods search for a cut point to divide the domain into two intervals among all the possible boundary points. On the contrary, merging techniques begin with a pre-defined partition and search for a candidate cut point to mix both adjacent intervals after removing it. In the literature, the terms Top-Down and Bottom-Up are highly related to these two operations, respectively. In fact, top-down and bottom-up discretizers are conceived for hierarchical discretization developments, so they consider that the process is incremental, a property which will be described later. Splitting/merging is more general than top-down/bottom-up because it is possible to have discretizers whose procedure manages more than one interval at a time (33; 35). Furthermore, we consider the hybrid category as the way of alternating splits with merges during running time (9; 69).

– Global vs. Local: At the time a discretizer must select a candidate cut point to be either split or merged, it could consider either all the available information in the attribute or only partial information. A local discretizer makes the partition decision based only on partial information. MDLP (6) and ID3 (73) are classical examples of local methods. By definition, all the dynamic discretizers and some top-down based methods are local, which explains the fact that few discretizers apply this form. The dynamic discretizers search for the best cut point during the internal operations of a certain DM algorithm, thus it is impossible to examine the complete data set. Besides, top-down procedures are associated with the divide-and-conquer scheme, in such a manner that when a split is considered, the data is recursively divided, restricting access to partial data.

– Direct vs. Incremental: For direct discretizers, the range associated with an interval must be divided into k intervals simultaneously, requiring an additional criterion to determine the value of k. One-step discretization methods and discretizers which select more than a single cut point at every step are included in this category. However, incremental methods begin with a simple discretization and pass through an improvement process, requiring an additional criterion to determine when it is the best moment to stop. At each step, they find the best candidate boundary to be used as a cut point and, afterwards, the rest of the decisions are made accordingly.

– Evaluation Measure: This is the metric used by the discretizer to compare two candidate discretization schemes and decide which is more suitable to be used. We consider five main families of evaluation measures:

• Information: This family includes entropy as the most used evaluation measure in discretization (MDLP (6), ID3 (73), FUSINTER (18)) and others derived from information theory (Gini index, Mutual Information) (40).

• Statistical: Statistical evaluation involves the measurement of dependency/correlation among attributes (Zeta (15), ChiMerge (5), Chi2 (17)), interdependency (27), probability and Bayesian properties (13) (MODL (32)), contingency coefficient (36), etc.

• Rough Sets: This class is composed of methods that evaluate the discretization schemes by using rough set properties and measures (66), such as class separability, lower and upper approximations, etc.

• Wrapper: This collection comprises methods that rely on the error provided by a classifier or a set of classifiers that are used in each evaluation. Representative examples are MAD (52), IDW (55) and EMD (69).


• Binning: In this category of techniques, there is no evaluation measure. This refers to discretizing an attribute with a predefined number of bins in a simple way. A bin assigns a certain number of values per attribute by using an unsophisticated procedure. EqualWidth and EqualFrequency discretizers are the most well-known unsupervised binning methods (a minimal EqualWidth sketch is given after this list).
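
For instance, a minimal sketch of the EqualWidth scheme (an illustration of the idea, not any library's implementation) computes k - 1 evenly spaced cut points over the attribute range:

```scala
object EqualWidthSketch {
  // Returns the k - 1 inner cut points of k equal-width bins over the value range.
  def cutPoints(values: Seq[Double], k: Int): Seq[Double] = {
    require(k > 1 && values.nonEmpty)
    val (lo, hi) = (values.min, values.max)
    val width = (hi - lo) / k
    (1 until k).map(i => lo + i * width)
  }
}

// Example: EqualWidthSketch.cutPoints(Seq(0.0, 3.0, 7.0, 10.0), 5) yields
// Vector(2.0, 4.0, 6.0, 8.0).
```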

Table 1 Most Important Discretizers.

Acronym Ref.           Acronym Ref.           Acronym Ref.

EqualWidth (1)         EqualFrequency (1)     Chou91 (4)
D2 (3)                 ChiMerge (5)           1R (7)
ID3 (73)               MDLP (6)               CADD (9)
MDL-Disc (10)          Bayesian (13)          Friedman96 (12)
ClusterAnalysis (11)   Zeta (15)              Distance (14)
Chi2 (17)              CM-NFD (16)            FUSINTER (18)
MVD (19)               Modified Chi2 (24)     USD (22)
Khiops (25)            CAIM (27)              Extended Chi2 (30)
Heter-Disc (28)        UCPD (29)              MODL (32)
ITPF (31)              HellingerBD (33)       DIBD (34)
IDD (35)               CACC (36)              Ameva (38)
Unification (40)       PKID (41)              FFD (41)
CACM (46)              DRDS (47)              EDISC (50)
U-LBG (48)             MAD (52)               IDF (55)
IDW (55)               NCAIC (60)             Sang14 (58)
IPD (56)               SMDNS (66)             TD4C (68)
EMD (69)

Minimum Description Length-based Discretizer

The Minimum Description Length-based discretizer (MDLP) (6), proposed by Fayyad and Irani in 1993, is one of the most important splitting methods in discretization. This univariate discretizer uses the Minimum Description Length Principle to control the partitioning process. It also introduces an optimization based on reducing the whole set of candidate points to only the boundary points of this set.

Let A(e) denote the value of attribute A in the example e. A boundary point b ∈ Dom(A) can be defined as the midpoint value between A(u) and A(v), assuming that, in the sorted collection of points in A, there exist two examples u, v ∈ S with different class labels such that A(u) < b < A(v), and no other example w ∈ S exists such that A(u) < A(w) < A(v). The set of boundary points for attribute A is denoted B_A.

This method also introduces other important improvements. One of them is related to the number of cut points to derive in each iteration. In contrast to discretizers like ID3 (73), the authors proposed a multi-interval extraction of points, demonstrating that these schemes yield better classification models, both in error rate and simplicity.

It recursively evaluates all boundary points, computing the class entropy of the derived partitions as a quality measure. The objective is to minimize this measure to obtain the best cut decision. Let b_α be a boundary point to evaluate, S_1 ⊂ S be a subset where ∀a′ ∈ S_1, A(a′) ≤ b_α, and S_2 be equal to S − S_1. The class information entropy yielded by a given binary partitioning can be expressed as:

$$E_P(A, b_\alpha, S) = \frac{|S_1|}{|S|}\, E(S_1) + \frac{|S_2|}{|S|}\, E(S_2), \qquad (1)$$

where E represents the class entropy 1 of a given subset following Shannon's definitions (21).

Finally, a decision criterion is defined in order to control when to stop the partitioning process. The use of MDLP as a decision criterion allows us to decide whether or not to partition. Thus a cut point b_α will be applied iff:

$$G(A, b_\alpha, S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, b_\alpha, S)}{N}, \qquad (2)$$

where Δ(A, b_α, S) = log_2(3^c − 2) − [c·E(S) − c_1·E(S_1) − c_2·E(S_2)], with c_1 and c_2 the number of class labels in S_1 and S_2, respectively; and G(A, b_α, S) = E(S) − E_P(A, b_α, S).
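
As a concrete reading of Equations (1) and (2), the following Scala sketch evaluates the MDLP acceptance test from the class-count vectors of the two subsets S_1 and S_2 induced by a candidate cut point (base-2 logarithms, as noted in footnote 1). It restates the published criterion and is not the authors' code.

```scala
object MDLPCriterion {
  private def log2(x: Double): Double = math.log(x) / math.log(2)

  // Shannon class entropy E(S) of a subset given its class-count vector.
  def entropy(counts: Array[Long]): Double = {
    val n = counts.sum.toDouble
    if (n == 0) 0.0
    else counts.filter(_ > 0).map { c => val p = c / n; -p * log2(p) }.sum
  }

  // MDLP acceptance test for the cut point that splits S into S1 and S2
  // (s1, s2 are the class-count vectors of the two subsets).
  def accepted(s1: Array[Long], s2: Array[Long]): Boolean = {
    val s = s1.zip(s2).map { case (a, b) => a + b }
    val n = s.sum.toDouble
    val (n1, n2) = (s1.sum.toDouble, s2.sum.toDouble)
    val (entS, ent1, ent2) = (entropy(s), entropy(s1), entropy(s2))
    val weighted = (n1 / n) * ent1 + (n2 / n) * ent2          // Equation (1)
    val gain = entS - weighted                                // G(A, b, S)
    val (c, c1, c2) = (s.count(_ > 0), s1.count(_ > 0), s2.count(_ > 0))
    val delta = log2(math.pow(3, c) - 2) - (c * entS - c1 * ent1 - c2 * ent2)
    gain > (log2(n - 1) + delta) / n                          // Equation (2)
  }
}
```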

TAXONOMY

Currently, more than 100 discretization methods have been presented in the specialized literature. In this section, we consider a subgroup of methods which can be considered the most important from the whole set of discretizers. The criteria adopted to characterize this subgroup are based on their repercussion, availability and novelty. Thus, the precursory discretizers which have served as inspiration to others, those which have been integrated in software suites, and the most recent ones are included in this taxonomy.

Table 1 enumerates the discretizers considered in this paper, providing the name abbreviation and reference for each one. We do not include the descriptions of these discretizers in this paper. Their definitions are contained in the original references, thus we recommend consulting them in order to understand how the discretizers of interest work. In Table 1, 30 discretizers included in the KEEL software are considered. Additionally, implementations of these algorithms in Java can be found in (37).

1 Logarithm in base 2 is used in this function


Fig. 2. Discretization Taxonomy.


In the previous section, we studied the properties which could be used to classify the discretizers proposed in the literature. Given a predefined order among the seven characteristics studied before, we can build a taxonomy of discretization methods. All techniques enumerated in Table 1 are collected in the taxonomy depicted in Figure 2. It represents a hierarchical categorization following the next arrangement of properties: static / dynamic, univariate / multivariate, supervised / unsupervised, splitting / merging / hybrid, global / local, direct / incremental and evaluation measure.

The purpose of this taxonomy is twofold. Firstly, it identifies the subset of the most representative state-of-the-art discretizers for both researchers and practitioners who want to compare novel techniques against them or require discretization in their applications. Secondly, it characterizes the relationships among techniques, the extension of the families and possible gaps to be filled in future developments.

When managing huge data, most of these discretizers become impracticable in real settings, due to the complexity they entail (this is the case of MDLP, among others). The adaptation of these classical methods implies a thorough redesign that becomes mandatory if we want to exploit the advantages derived from the use of discrete data on large datasets (42; 44). As reflected in our taxonomy, no relevant methods in the field of Big Data have been proposed to solve this problem. Some works have tried to deal with large-scale discretization. For example, in (53) the authors proposed a scalable implementation of the Class-Attribute Interdependence Maximization algorithm by using GPU technology. In (61), a discretizer based on windowing and hierarchical clustering is proposed to improve the performance of classical tree-based classifiers. However, none of these methods has been proved to cope with the data magnitude considered here.

BIG DATA BACKGROUND

The ever-growing generation of data on the Internet is leading us to manage huge collections using data analytics solutions. Exceptional paradigms and algorithms are thus needed to efficiently process these datasets (65) so as to obtain valuable information, making this problem one of the most challenging tasks in Big Data analytics.

Gartner (20) introduced the popular denomination of Big Data and the 3V terms that define it as high volume, velocity and variety of information that require new forms of large-scale processing. This list was later extended with 2 additional terms. All of them are described in the following: Volume, the massive amount of data that is produced every day is still exponentially growing (from terabytes to exabytes); Velocity, data needs to be loaded, analyzed and stored as quickly as possible; Variety, data comes in many formats and representations; Veracity, the quality of the data to process is also an important factor, as the Internet is full of missing, incomplete, ambiguous, and sparse data; Value, extracting value from data is also established as a relevant objective in big analytics.

The unsuitability of many knowledge extraction algorithms in the Big Data field has meant that new methods have been developed to manage such amounts of data effectively and at a pace that allows value to be extracted from it.


MapReduce Model and Other Distributed Frameworks

The MapReduce framework (26), designed by Google in 2003, is currently one of the most relevant tools in Big Data analytics. It was aimed at processing and generating large-scale datasets, automatically processed in an extremely distributed fashion through several machines 2. The MapReduce model defines two primitives to work with distributed data: Map and Reduce. These two primitives imply two stages in the distributed process, which we describe below. In the first step, the master node breaks up the dataset into several splits, distributing them across the cluster for parallel processing. Each node then hosts several Map threads that transform the generated key-value pairs into a set of intermediate pairs. After all Map tasks have finished, the master node distributes the matching pairs across the nodes according to a key-based partitioning scheme. Then, the Reduce phase starts, combining those coincident pairs so as to form the final output.

Apache Hadoop (62; 74) is presented as the most popular open-source implementation of MapReduce for large-scale processing. Despite its popularity, Hadoop presents some important weaknesses, such as poor performance on iterative and online computing, poor inter-communication capability and inadequacy for in-memory computation, among others (49). Recently, Apache Spark (64; 72) has appeared and been integrated with the Hadoop Ecosystem. This novel framework is presented as a revolutionary tool capable of performing even faster large-scale processing than Hadoop through in-memory primitives, making it a leading tool for iterative and online processing and, thus, suitable for DM algorithms. Spark is built on distributed data structures called Resilient Distributed Datasets (RDDs), which were designed as a fault-tolerant collection of elements that can be operated on in parallel by means of data partitioning.
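
As a toy illustration of these primitives with Spark's RDD API (a self-contained sketch run locally, not part of the proposed method), the Map stage below emits one key-value pair per record and reduceByKey merges the pairs that share a key:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapReduceToy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-toy").setMaster("local[*]"))
    // A tiny RDD of (attribute name, value) records.
    val records = sc.parallelize(Seq(("age", 23.0), ("age", 31.0), ("income", 1200.0)))
    // Map: one (key, 1) pair per record; Reduce: sum the pairs with a coincident key.
    val counts = records.map { case (att, _) => (att, 1L) }.reduceByKey(_ + _)
    counts.collect().foreach(println)   // prints (age,2) and (income,1)
    sc.stop()
  }
}
```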

DISTRIBUTED MDLP DISCRETIZATION

In the Background Section, a discretization algorithm based on an information entropy minimization heuristic was presented (6). In this work, the authors proved that multi-interval extraction of points and the use of boundary points can improve the discretization process, both in efficiency and error rate. Here, we adapt this well-known algorithm to distributed environments, proving its discretization capability on real-world large problems.

One important point in this adaptation is how to distribute the complexity of this algorithm across the cluster. This is mainly determined by two time-consuming operations: on the one hand, the sorting of the candidate points and, on the other hand, the evaluation of these points. The sorting operation has a O(|A| log(|A|)) complexity (assuming that all points in A are distinct), whereas the evaluation has a O(|B_A|^2) complexity. In the worst case, it implies a complete evaluation of entropy for all points.

Note that the previous complexity is bounded to a single attribute. To avoid repeating the previous process on all attributes, we have designed our algorithm to sort and evaluate all points in a single step. Only when the number of boundary points in an attribute is higher than the maximum per partition is computation by feature necessary (which is extremely rare according to our experiments).

2 For a complete description of this model and other distributed models, please review (54).

Spark primitives extend the idea of MapReduce to implement more complex operations on distributed data. In order to implement our method, we have used some extra primitives from Spark's API, such as: mapPartitions, sortByKey, flatMap and reduceByKey 3.

Main discretization procedure

Algorithm 1 explains the main procedure of our discretization algorithm. The algorithm calculates the minimum-entropy cut points by feature according to the MDLP criterion. It uses a parameter to limit the maximum number of points to yield.

Algorithm 1 Main discretization procedure
Input: S Data set
Input: M Feature indexes to discretize
Input: mb Maximum number of cut points to select
Input: mc Maximum number of candidates per partition
Output: Cut points by feature
1: comb ←
2: map s ∈ S
3: v ← zeros(|c|)
4: ci ← class index(v)
5: v(ci) ← 1
6: for all A ∈ M do
7: EMIT < (A, A(s)), v >
8: end for
9: end map
10: distinct ← reduce(comb, sum vectors)
11: sorted ← sort by key(distinct)
12: first ← first by part(sorted)
13: bds ← get boundary(sorted, first)
14: bds ←
15: map b ∈ bds
16: < (att, point), q > ← b
17: EMIT < (att, (point, q)) >
18: end map
19: (SM, BI) ← divide atts(bds, mc)
20: sth ←
21: map sa ∈ SM
22: th ← select ths(SM(sa), mb, mc)
23: EMIT < (sa, th) >
24: end map
25: bth ← ()
26: for all ba ∈ BI do
27: bth ← bth + select ths(ba, mb, mc)
28: end for
29: return(union(bth, sth))

The first step creates combinations from instances through a Map function in order to separate values by feature. It generates tuples with the value and the index of each feature as key and a class counter as value (< (A, A(s)), v >). Afterwards, the tuples are reduced using a function that aggregates all subsequent vectors with the same key, obtaining the class frequency for each distinct value in the dataset. The resulting tuples are sorted by key so that we obtain the complete list of distinct values ordered by feature index and feature value. This structure will be used later to evaluate all these points in a single step. The first point of each partition is also calculated (line 12) for this process. Once such information is saved, the process of evaluating the boundary points can be started.
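
A simplified Spark sketch of this first stage (lines 1-12 of Algorithm 1) could look as follows; the names and types are ours, and details such as handling categorical attributes are omitted:

```scala
import org.apache.spark.rdd.RDD

object DistinctPointsSketch {
  // data: one (feature vector, class index) pair per instance; nClasses known beforehand.
  def distinctSortedPoints(
      data: RDD[(Array[Double], Int)],
      nClasses: Int): RDD[((Int, Double), Array[Long])] = {
    // Map: emit ((attribute index, value), one-hot class counter) for every attribute.
    val comb = data.flatMap { case (features, classIdx) =>
      val v = Array.fill(nClasses)(0L)
      v(classIdx) = 1L
      features.zipWithIndex.map { case (value, att) => ((att, value), v) }
    }
    // Reduce: sum the class counters of coincident (attribute, value) pairs ...
    val distinct = comb.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
    // ... and sort by key so each partition holds a contiguous, ordered range of points.
    distinct.sortByKey()
  }
}
```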

3 For a complete description of Spark's operations, please refer to Spark's API: https://spark.apache.org/docs/latest/api/scala/index.html


Boundary points selection

Algorithm 2 (get boundary) describes the function in charge of selecting those points falling on the class borders. It executes an independent function on each partition in order to parallelize the selection process as much as possible, so that a subset of tuples is fetched in each thread. The evaluation process is described as follows: for each instance, it evaluates whether the feature index is distinct from the index of the previous point; if so, it emits a tuple with the last point as key and the accumulated class counter as value. This means that a new feature has appeared, saving the last point from the current feature as its last threshold. If the previous condition is not satisfied, the algorithm checks whether the current point is a boundary with respect to the previous point or not. If so, it emits a tuple with the midpoint between these points as key and the accumulated counter as value.

Algorithm 2 Function to generate the boundary points (get boundary)
Input: points An RDD of tuples (< (att, point), q >), where att represents the feature index, point the point to consider and q the class counter.
Input: first A vector with all first elements by partition
Output: An RDD of points.
1: boundaries ←
2: map partitions part ∈ points
3: < (la, lp), lq > ← next(part)
4: accq ← lq
5: for all < (a, p), q > ∈ part do
6: if a <> la then
7: EMIT < (la, lp), accq >
8: accq ← ()
9: else if is boundary(q, lq) then
10: EMIT < (la, (p + lp)/2), accq >
11: accq ← ()
12: end if
13: < (la, lp), lq > ← < (a, p), q >
14: accq ← accq + q
15: end for
16: index ← get index(part)
17: if index < npartitions(points) then
18: < (a, p), q > ← first(index + 1)
19: if a <> la then
20: EMIT < (la, lp), accq >
21: else
22: EMIT < (la, (p + lp)/2), accq >
23: end if
24: else
25: EMIT < (la, lp), accq >
26: end if
27: end map
28: return(boundaries)

Finally, some evaluations are performed over the last point in each partition. This point is compared with the first point of the next partition to check whether there is a change in the feature index (emitting a tuple with the last point saved) or not (emitting a tuple with the midpoint), as described above. All tuples generated by the partitions are then joined into a new mixed RDD of boundary points, which is returned to the main algorithm as bds.
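
The per-partition pass can be sketched as below. The isBoundary test marks two adjacent distinct values as a class border unless all their examples belong to one and the same class; the stitching of partition borders against the first vector is omitted for brevity, so this is an approximation of Algorithm 2 rather than a faithful copy:

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

object BoundarySketch {
  // Adjacent distinct values form a class border unless all their examples share one class.
  def isBoundary(q1: Array[Long], q2: Array[Long]): Boolean =
    q1.zip(q2).map { case (a, b) => a + b }.count(_ > 0) > 1

  // Simplified per-partition pass over the sorted distinct points of all attributes.
  def boundaryPoints(
      sorted: RDD[((Int, Double), Array[Long])]): RDD[((Int, Double), Array[Long])] =
    sorted.mapPartitions { it =>
      if (it.isEmpty) Iterator.empty
      else {
        val ((a0, p0), q0) = it.next()
        var lastAtt = a0; var lastPoint = p0; var lastQ = q0
        var accq = q0.clone()
        val out = ArrayBuffer.empty[((Int, Double), Array[Long])]
        for (((att, point), q) <- it) {
          if (att != lastAtt) {
            out.append(((lastAtt, lastPoint), accq))               // close the previous attribute
            accq = Array.fill(q.length)(0L)
          } else if (isBoundary(q, lastQ)) {
            out.append(((lastAtt, (point + lastPoint) / 2), accq)) // midpoint as candidate
            accq = Array.fill(q.length)(0L)
          }
          accq = accq.zip(q).map { case (x, y) => x + y }          // accumulate and advance
          lastAtt = att; lastPoint = point; lastQ = q
        }
        out.append(((lastAtt, lastPoint), accq))                   // last point of the partition
        out.iterator
      }
    }
}
```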

In Algorithm 1 (line 14), the bds variable is transformed by using a Map function, changing the previous key to a new key with a single value: the feature index (< (att, (point, q)) >). This is done to group the tuples by feature so that we can divide them into two groups according to the total number of candidate points by feature. The divide atts function is then aimed at dividing the tuples into two groups (big and small) depending on the number of candidate points by feature (count operation). Features in each group will be filtered and treated differently according to whether the total number of points for a given feature exceeds the threshold mc or not. Small features will be grouped by key so that they can be processed in a parallel way. The subsequent tuples are now re-formatted as follows: (< point, q >).
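
A possible sketch of this division step (our own simplification, with the per-attribute counts collected on the driver) is:

```scala
import org.apache.spark.rdd.RDD

object DivideAttsSketch {
  // Splits the boundary points into "small" attributes (grouped locally, one group
  // per attribute) and the indexes of "big" attributes whose candidates exceed mc.
  def divideAtts(
      bds: RDD[(Int, (Double, Array[Long]))],
      mc: Int): (RDD[(Int, Iterable[(Double, Array[Long])])], Seq[Int]) = {
    val countsByAtt = bds.countByKey()                       // small map: one entry per attribute
    val bigAtts = countsByAtt.filter { case (_, n) => n > mc }.keys.toSeq
    val small = bds.filter { case (att, _) => !bigAtts.contains(att) }.groupByKey()
    (small, bigAtts)
  }
}
```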

MDLP evaluation

Features in each group are evaluated differently, as mentioned before. Small features are evaluated in a single step where each feature corresponds to a single partition, whereas big features are evaluated iteratively since each feature corresponds to a complete RDD with several partitions. The first option is obviously more efficient; the second case is, however, less frequent, since the number of candidate points for a single feature usually fits perfectly in one partition. In both cases, the select ths function is applied to evaluate and select the most relevant cut points by feature. For small features, a Map function is applied independently to each partition (each one representing a feature) (arr select ths). In the case of big features, the process is more complex and each feature needs a complete iteration over a distributed set of points (rdd select ths).

Algorithm 3 (select ths) evaluates and selects the most promising cut points grouped by feature according to the MDLP criterion (single-step version). This algorithm starts by selecting the best cut point in the whole set. If the criterion accepts this selection, the point is added to the result list and the current subset is divided into two new partitions using this cut point. Both partitions are then evaluated, repeating the previous process. This process finishes when there is no partition left to evaluate or the maximum number of selected points is reached.

Algorithm 4 (arr select ths) explains the process that accumulates frequencies and then selects the minimum-entropy candidate. This version is more straightforward than the RDD version as it only needs to accumulate frequencies sequentially. Firstly, it obtains the total class counter vector by aggregating all candidate vectors. Afterwards, a new iteration is necessary to obtain the accumulated counters for the two partitions generated by each point. This is done by aggregating the vectors from the left-most point to the current one, and from the current point to the right-most point. Once the accumulated counters for each candidate point are calculated (in the form < point, q, lq, rq >), the algorithm evaluates the candidates using the select best function.
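
The sequential accumulation of Algorithm 4 can be sketched as a single pass that attaches left and right accumulated counters to every candidate (names are ours, not the authors'):

```scala
object ArrSelectSketch {
  type Counts = Array[Long]

  private def add(a: Counts, b: Counts): Counts =
    a.zip(b).map { case (x, y) => x + y }

  // For each candidate point, attach the class counts accumulated to its left
  // (inclusive) and to its right, ready to be scored by a select_best-like function.
  def withLeftRight(cands: Array[(Double, Counts)]): Array[(Double, Counts, Counts, Counts)] = {
    val total = cands.map(_._2).reduce(add)
    var left = Array.fill(total.length)(0L)
    cands.map { case (point, q) =>
      left = add(left, q)
      val right = total.zip(left).map { case (t, l) => t - l }
      (point, q, left, right)
    }
  }
}
```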

Algorithm 5 (rdd select ths) explains the selection process, in this case for "big" features (more than one partition). This process needs to be performed in a distributed manner since the number of candidate points exceeds the maximum size defined. For each feature, the subset of points is hence re-distributed in a better partition scheme to homogenize the quantity of points by partition and node (coalesce function, lines 1-2). After that, a new parallel function is started to compute the accumulated counter by partition. The results (by partition) are then aggregated to obtain the total accumulated frequency for the whole subset.


Algorithm 3 Function to select the best cut points for a given feature (select ths)
Input: cands An RDD/array of tuples (< point, q >), where point represents a candidate point to evaluate and q the class counter.
Input: mb Maximum number of intervals or bins to select
Input: mc Maximum number of candidates to eval in a partition
Output: An array of thresholds for a given feature
1: st ← enqueue(st, (candidates, ()))
2: result ← ()
3: while |st| > 0 & |result| < mb do
4: (set, lth) ← dequeue(st)
5: if |set| > 0 then
6: if type(set) = 'array' then
7: bd ← arr select ths(set, lth)
8: else
9: bd ← rdd select ths(set, lth, mc)
10: end if
11: if bd <> () then
12: result ← result + bd
13: (left, right) ← divide(set, bd)
14: st ← enqueue(st, (left, bd))
15: st ← enqueue(st, (right, bd))
16: end if
17: end if
18: end while
19: return(sort(result))

Algorithm 4 Function to select the best cut point according to MDLP criterion (single-step version) (arr select ths)
Input: cands An array of tuples (< point, q >), where point represents a candidate point to evaluate and q the class counter.
Output: The minimum-entropy cut point
1: total ← sum freqs(cands)
2: lacc ← ()
3: for < point, q > ∈ cands do
4: lacc ← lacc + q
5: freqs ← freqs + (point, q, lacc, total − lacc)
6: end for
7: return(select best(cands, freqs))

In line 9, a new distributed process is started with the aim of computing the accumulated frequencies at points on both sides (as explained in Algorithm 4). In this procedure, the process accumulates the counter from all previous partitions to the current one to obtain the first accumulated value (the left one). Then the function computes the accumulated values for each inner point using the counter for points in the current partition, the left value and the total values (line 17). Once these values are calculated (< point, q, lq, rq >), the algorithm evaluates all candidate points and their associated accumulators using the select best function (as above).
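
A distributed counterpart (again a simplification of Algorithm 5, with our own names) first gathers one class-count vector per partition and then lets every partition derive its left offsets from the totals of the preceding ones:

```scala
import org.apache.spark.rdd.RDD

object RddSelectSketch {
  type Counts = Array[Long]
  private def add(a: Counts, b: Counts): Counts = a.zip(b).map { case (x, y) => x + y }

  def withLeftRight(
      cands: RDD[(Double, Counts)],
      nClasses: Int): RDD[(Double, Counts, Counts, Counts)] = {
    // One aggregated class counter per partition, collected on the driver (a tiny array).
    val perPartition: Array[Counts] = cands
      .mapPartitions(it => Iterator(it.map(_._2).foldLeft(Array.fill(nClasses)(0L))(add)))
      .collect()
    val total = perPartition.foldLeft(Array.fill(nClasses)(0L))(add)

    cands.mapPartitionsWithIndex { (index, it) =>
      // Class counts accumulated in all partitions preceding this one.
      var left = perPartition.take(index).foldLeft(Array.fill(nClasses)(0L))(add)
      it.map { case (point, q) =>
        left = add(left, q)
        (point, q, left, total.zip(left).map { case (t, l) => t - l })
      }
    }
  }
}
```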

Algorithm 6 evaluates the discretization schemes yielded by each point by computing the entropy for each generated partition, also taking into account the MDLP criterion. Thus, for each point 4, the entropy is calculated for the two generated partitions (line 8) as well as the total entropy for the whole set (lines 1-2). Using these values, the entropy gain for each point is computed, together with its MDLP condition, according to Equation 2. If the point is accepted by MDLP, the algorithm emits a tuple with the weighted entropy average of the partitions and the point itself. From the set of accepted points, the algorithm selects the one with the minimum class information entropy.

4 If the set is an array, it is used as a loop structure, else it is used as a distributed map function


Algorithm 5 Function that selects the best cut points according to MDLP criterion (RDD version) (rdd select ths)
Input: cands An RDD of tuples (< point, q >), where point represents a candidate point to evaluate and q the class counter.
Input: mc Maximum number of candidates to eval in a partition
Output: The minimum-entropy cut point
1: npart ← round(|cands|/mc)
2: cands ← coalesce(cands, npart)
3: totalpart ←
4: map partitions partition ∈ cands
5: return(sum(partition))
6: end map
7: total ← sum(totalpart)
8: freqs ←
9: map partitions partition ∈ cands
10: index ← get index(partition)
11: ltotal ← ()
12: freqs ← ()
13: for i = 0 until index do
14: ltotal ← ltotal + totalpart(i)
15: end for
16: for all < point, q > ∈ partition do
17: freqs ← freqs + (point, q, ltotal + q, total − ltotal)
18: end for
19: return(freqs)
20: end map
21: return(select best(cands, freqs))


Algorithm 6 Function that calculates class entropy values and selects the minimum-entropy cut point (select best)
Input: freqs An array/RDD of tuples (< point, q, lq, rq >), where point represents a candidate point to evaluate, lq the left accumulated frequency, rq the right accumulated frequency and q the class frequency counter.
Input: total Class frequency counter for all the elements
Output: The minimum-entropy cut point
1: n ← sum(total)
2: totalent ← ent(total, n)
3: k ← |total|
4: accp ← ()
5: for all < point, q, lq, rq > ∈ freqs do
6: k1 ← |lq|; k2 ← |rq|
7: s1 ← sum(lq); s2 ← sum(rq)
8: ent1 ← ent(s1, k1); ent2 ← ent(s2, k2)
9: partent ← (s1 ∗ ent1 + s2 ∗ ent2)/s
10: gain ← totalent − partent
11: delta ← log2(3^k − 2) − (k ∗ hs − k1 ∗ ent1 − k2 ∗ ent2)
12: accepted ← gain > ((log2(s − 1)) + delta)/n
13: if accepted = true then
14: accp ← accp + (partent, point)
15: end if
16: end for
17: return(min(accp))

The results produced by both groups (small and big) are joined into the final set of cut points.


Analysis of efficiency

In this section, we analyze the performance of the main operations that determine the overall performance of our proposal. Note that the first two operations are quite costly from the point of view of network usage, since they imply shuffling data across the cluster (wide dependencies). Nevertheless, once data are partitioned and saved, they remain unchanged. This is exploited by the subsequent steps, which take advantage of the data locality property. Having data partitioned also benefits operations like groupByKey, where the grouping is performed locally. The list of such operations (shown in Algorithm 1) is presented below:

1. Distinct points (lines 1-10): this is a standard map-reduce operation that fetches all the points in the dataset. The map phase generates and distributes tuples using a hash partitioning scheme (linear distributed complexity). The reduce phase fetches the set of coincident points and sums up the class vectors (linear distributed complexity).

2. Sorting operation (line 11): this operation uses a more complex Spark primitive: sortByKey. This samples the set and produces a set of bounds to partition it. Then, a shuffling operation is started to re-distribute the points according to the previous bounds. Once data are re-distributed, a local sorting operation is launched in each partition (loglinear distributed order).

3. Boundary points (lines 12-13): this operation is in charge of computing the candidate subset of points to be evaluated. Thanks to the data partitioning scheme generated in the previous phases, the algorithm can yield the boundary points for all attributes in a distributed manner using a linear map operation.

4. Division of attributes (lines 14-19): once the reduced set of boundary points is generated, it is necessary to separate the attributes into two sets. To do that, several operations are started to complete this part. All these sub-operations are performed linearly using distributed operations.

5. Evaluation of small attributes (lines 20-24): this is mainly formed by two sub-operations: one that groups the tuples by key (done locally thanks to data locality), and one map operation to evaluate the candidate points. In the map operation, each feature starts an independent process that, like the sequential version, is quadratic. The main advantage here is the parallelization of these processes.

6. Evaluation of big features (lines 26-28): the complexity order for each feature is the same as in the previous case. However, in this case, the evaluation of features is done iteratively.

EXPERIMENTAL FRAMEWORK AND ANALYSIS

This section describes the experiments carried out to demonstrate the usefulness and performance of our discretization solution over two Big Data problems.


Experimental Framework

Two huge classification datasets are employed as benchmarks in our experiments. The first one (hereinafter called ECBDL14) was used as a reference at the ML competition of the Evolutionary Computation for Big Data and Big Learning held on July 14, 2014, under the international conference GECCO-2014. It consists of 631 characteristics (including both numerical and categorical attributes) and 32 million instances. It is a binary classification problem where the class distribution is highly imbalanced: 2% of positive instances. For this problem, the MapReduce version of the Random OverSampling (ROS) algorithm presented in (57) was applied in order to replicate the minority class instances from the original dataset until the number of instances for both classes was equalized. As a second dataset, we have used epsilon, which consists of 500 000 instances with 2000 numerical features. This dataset was artificially created for the Pascal Large Scale Learning Challenge in 2008. It was further pre-processed and included in the LibSVM dataset repository (45).

Table 2 gives a brief description of these datasets. For each one, the number of examples for training and test (#Train Ex., #Test Ex.), the total number of attributes (#Atts.), and the number of classes (#Cl.) are shown. For evaluation purposes, Naive Bayes (70) and two variants of Decision Tree (2) (with different impurity measures) have been chosen as reference classifiers, using the distributed implementations included in the MLlib library (67). The recommended parameters of the classifiers, according to their authors' specification 5, are shown in Table 3.

Table 2 Summary description for classification datasets

Data Set         #Train Ex.   #Test Ex.   #Atts.   #Cl.
epsilon          400 000      100 000     2000     2
ECBDL14 (ROS)    65 003 913   2 897 917   631      2

Table 3 Parameters of the algorithms used

Method                          Parameters
Naive Bayes                     lambda = 1.0
Decision Tree - gini (DTg)      impurity = gini, max depth = 5, max bins = 32
Decision Tree - entropy (DTe)   impurity = entropy, max depth = 5, max bins = 32
Distributed MDLP                max intervals = 50, max by partition = 100,000

As evaluation criteria, we use two well-known evaluation metrics to assess the quality of the underlying discretization schemes. On the one hand, classification accuracy is used to evaluate the accuracy yielded by the classifiers (number of examples correctly labeled divided by the total number of examples). On the other hand, in order to prove the time benefits of using discretization, we have employed the overall classification runtime (in seconds) in training as well as the overall discretization time as additional measures.

5 https://spark.apache.org/docs/latest/api/scala/index.html

For all experiments we have used a cluster composed of twenty computing nodes and one master node. The computing nodes have the following characteristics: 2 processors x Intel Xeon CPU E5-2620, 6 cores per processor, 2.00 GHz, 15 MB cache, QDR InfiniBand Network (40 Gbps), 2 TB HDD, 64 GB RAM. Regarding software, we have used the following configuration: Hadoop 2.5.0-cdh5.3.1 from Cloudera's open-source Apache Hadoop distribution 6, Apache Spark and MLlib 1.2.0, 480 cores (24 cores/node), 1040 GB RAM (52 GB/node). The Spark implementation of the algorithm can be downloaded from the first author's GitHub repository 7. The design of the algorithm has been adapted to be integrated into the MLlib Library.

Experimental Results and Analysis

Table 4 shows the classification accuracy results for both datasets 8. According to these results, we can assert that using our discretization algorithm as a preprocessing step leads to an improvement in classification accuracy with Naive Bayes for the two datasets tested. This is especially relevant in ECBDL14, where there is an improvement of 5%. This shows the importance of discretization in the application of some classifiers like Naive Bayes. For the other classifiers, our algorithm is capable of achieving results as competitive as those obtained with the discretization performed implicitly by the decision trees.

Table 4 Classification accuracy values

Dataset    NB       NB-disc   DTg      DTg-disc   DTe      DTe-disc
ECBDL14    0.6276   0.7260    0.7347   0.7339     0.7459   0.7508
epsilon    0.6550   0.7065    0.6616   0.6623     0.6611   0.6624

Table 5 shows the classification runtime values for both datasets, distinguishing whether discretization is applied or not. As we can see, there is a slight improvement in both cases when using MDLP, but it is not significant enough. According to the previous results, we can state that the application of MDLP is relevant at least for epsilon, where the best accuracy result has been achieved by using Naive Bayes and our discretizer. For ECBDL14, it is better to use the implicit discretization performed by the decision trees, since our algorithm is more time-consuming and obtains similar results.

6 http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-homepage.html

7 https://github.com/sramirez/SparkFeatureSelection
8 In all tables, the best result by column (best by method) is highlighted in bold.


Table 5 Classification time values: with vs. w/o discretization (in seconds)

Dataset    NB      NB-disc   DTg      DTg-disc   DTe      DTe-disc
ECBDL14    31.06   26.39     347.76   262.09     281.05   264.25
epsilon    5.72    4.99      68.83    63.23      74.44    39.28

Table 6 shows the discretization time values for the two versions of MDLP, namely, sequential and distributed. For the sequential version on ECBDL14, the time value was estimated from small samples of this dataset, since its direct application is unfeasible. A graphical comparison of these two versions is shown in Figure 3. Comparing both implementations, we can notice the great advantage of using the distributed version against the sequential one. For ECBDL14, our version obtains a speedup ratio (speedup = sequential / distributed) of 271.86, whereas for epsilon the ratio is equal to 12.11. This shows that the bigger the dataset, the higher the efficiency improvement; and, when the data size is large enough, the cluster can distribute the computational burden fairly across its machines. This is notably the case of ECBDL14, where the resolution of this problem was found to be impractical using the original approach.
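
Using the times reported in Table 6 below, the quoted ratios follow directly:

$$\text{speedup}_{\text{ECBDL14}} = \frac{295\,508}{1\,087} \approx 271.86, \qquad \text{speedup}_{\text{epsilon}} = \frac{5\,764}{476} \approx 12.11$$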

Table 6 Sequential vs. distributed discretization time values (in seconds)

Dataset    Sequential   Distributed   Speedup Rate
ECBDL14    295 508      1 087         271.86
epsilon    5 764        476           12.11

Fig. 3. Discretization time: sequential vs. distributed (logarithmic scale).

Conclusion

Discretization, as an important part of DM preprocessing, has aroused general interest in recent years. In this work, we have presented an updated taxonomy and description of the most relevant algorithms in this field. The aim of this taxonomy is to help researchers to better classify the algorithms that they use, on the one hand, while also helping to identify possible new future research lines. In this respect, and although Big Data is currently a trending topic in science and business, no distributed discretization approach has been developed in the literature, as we have shown in our taxonomy.

Here, we propose a completely distributed version of the MDLP discretizer with the aim of demonstrating that standard discretization methods can be parallelized in Big Data platforms, boosting both performance and accuracy. This version is capable of transforming the iterativity of the original proposal into a single-step computation through a complete redesign of the original version. According to our experiments, our algorithm is capable of performing 270 times faster than the sequential version, improving the accuracy results in all the datasets used. For future work, we plan to tackle the problem of discretization in large-scale online problems.

Acknowledgments

This work is supported by the National Research Projects TIN2014-57251-P, TIN2012-37954 and TIN2013-47210-P, the Andalusian Research Plan projects P10-TIC-6858, P11-TIC-7765 and P12-TIC-2958, and the Xunta de Galicia through the research project GRC 2014/035 (all projects partially funded by FEDER funds of the European Union). S. Ramírez-Gallego holds an FPU scholarship from the Spanish Ministry of Education and Science (FPU13/00047). D. Martínez-Rego and V. Bolón-Canedo acknowledge support from the Xunta de Galicia under postdoctoral Grant codes POS-A/2013/196 and ED481B 2014/164-0.
