
Int J Data Sci Anal (2017) 3:247–266
DOI 10.1007/s41060-017-0054-1

REGULAR PAPER

L2-norm transformation for improving k-means clustering
Finding a suitable model by range transformation for novel data analysis

Piyush Kumar Sharma1 · Gary Holness2

Received: 1 December 2016 / Accepted: 28 March 2017 / Published online: 29 April 2017
© Springer International Publishing Switzerland 2017

Abstract In the age of increasingly pervasive sensing applications, measurement of unknown pattern phenomena resulting in novel data presents a challenge to selection of appropriate modeling tools. Because there is no rich history of domain knowledge, one can easily make early commitments to poor modeling choices. Data transformation, a solution in effort to modify the data's geometry, can make important regularities more clear. The wrong transformation can damage the very pattern information one seeks to identify. In contrast to data transformation, we contribute an alternative method, range transformation, focusing on altering the measurement tool. As a function, a model maps data inputs to a range. Focusing on transformations of the model's range, we can find a generally applicable way to alter the model's properties to best suit the data. Every modification to a function class, something we call editing the function, results in a change to the original function's range. This work contributes a method for modifying a broad class of models to suit novel data through range transformation. We investigate range transformation for a class of information theoretic transformations and evaluate impact on classification and clustering. We also develop an optimization-based framework employing range transformation based on desired geometric properties and use it to improve a widely used model, k-means clustering.

This paper is an extended version of the DSAA-2016 paper: Dilation of Chisini–Jensen–Shannon divergence [39].

This work was supported by National Science Foundation Award# HRD-1242067 and Award# CNS-1205426.

Piyush Kumar Sharma
[email protected]

Gary Holness
[email protected]

1 Department of Mathematical Sciences, Delaware State University, Dover, DE 19904, USA

2 Department of Computer and Information Sciences, Delaware State University, Dover, DE 19904, USA


Keywords Chisini–Jensen–Shannon divergence · CJSD kernel · k-Means clustering · L2-norm · Range transformation · LIBS

1 Introduction

A number of successful methods in machine learning, most notably kernel methods, modify the input representation by a transformation φ : X → Z that maps the input representation X into a new higher dimensional feature space representation Z. In doing so, models are afforded the ability to define complex nonlinear surfaces without paying increased cost in terms of learning complexity. In transforming the input representation, a singular transformation is applied across all regions of input space (Fig. 1).

In this work, we study a method that strives to transform the model's measurement instrument or distance function as an alternative to transformation of the input representation. The impact of this approach is a transformation of the range associated with the learned hypothesis. This results in an effective repositioning of data points in the input representation, thus changing their geometry, when learning a model. In transforming the range, more fine-grained control can be achieved over regions of the input representation beyond a singular transformation en masse of the input space itself. Consider a data set where points near the boundary between clusters are tightly packed. A solution for improved clustering would amplify distances corresponding to regions of input space near cluster boundaries and attenuate distances corresponding to regions of input space further away from cluster boundaries. The effect of such a range transformation would make clusters more compact and well separated, thus resulting in better clustering performance.

Fig. 1 Schematic illustrates a model's range transformation

Fig. 2 Schematic illustrates our strategy of finding a suitable model by range transformation in both ways

A range transformation is achieved in two ways; the first changes or edits the model, resulting in a transformation or geometric change to its range (what we call the forward direction), and the second begins with a desired geometric change to the range and rearranges or edits the model to achieve it (what we call the reverse direction), as shown in Fig. 2. An approach for range transformation in the reverse direction is the primary contribution of our work. We explore range transformation as a concept using two different models. For the forward direction, we explore a recently introduced information theoretic measure, CJSD [40]. For the reverse direction, we explore the very popular k-means clustering model.

The literature is full of examples from many disciplines where a model's performance is improved through its direct modification [14,16,33,50]. Recently published results in information theoretic models [41] demonstrated that Jensen–Shannon divergence (JSD) [10,21,24–26,38] does not provide adequate separation for drawing distinctions between subtly different distributions. This work reformulated (or edited) JSD with alternate operators in Fig. 3, namely, Chisini means [8]. The resulting Chisini–Jensen–Shannon divergences (CJSDs) rescale JSD's range. While the range remains uncountably infinite, it is transformed from the interval [0, 1] to a different scale on the real number line, [0,∞] (Fig. 4).

Because of the broad applicability of information theoretic measures in describing regularities in data, for range transformation in the forward direction we focus on the development of range transformations arising from using Chisini means for function edits of JSDs (i.e., CJSDs). We derive the governing parameter, namely the dilation parameter, describing the relationship between ranges associated with different CJSDs (JSD edits). Closely related is the Bregman divergence framework, which implements a distortion function (rearrangement of points under the function) [30]. Bregman divergence is limited to a specific set of functions, namely those with a convex set domain and a function f that is strictly convex, i.e., its derivative is monotonically increasing [1]. Transformation of a function's range extends beyond the more limited set of functions required by Bregman divergence.

Fig. 3 Schematic presents model shaping by function edit for JSD

Fig. 4 CJSDs map input distributions onto nonnegative real number line

Extending beyond the recently introduced CJSDs, our work contributes an exploration of the mechanism responsible for CJSDs' improvement. Our larger contribution to machine learning is the development of range transformation as an approach to selection of suitable models. This is particularly important when analyzing novel data for which there is little knowledge about pattern phenomena (e.g., quantum effects due to photon energy absorption) or the resolution needed between model outputs in order to discriminate between classes. A method generally focused on the resolution needed for discriminability can be valuable, particularly when there are subtle differences between class boundaries or members of different clusters. Using range transformation, one begins with the requisite resolution in the distance measure needed for discriminability, and then selects a transformation that achieves it. This approach opens a broader class of functions available for modeling.


Open questions that we address concern: How does a CJSD transform a JSD's range? What are the properties of these transformations? For what types of data are these reconfigurations useful? How are distances rescaled for different CJSDs and what is the magnitude of this rescaling? What is the relationship between different CJSDs?

The range of CJSDs falls on the nonnegative real number line (Fig. 4). We describe dilation mapping, the rescaling of CJSDs, in Fig. 9. Apart from rescaling, the dilation parameter also plays an important role in another set of parameters which we call dilation ratios, the relative dilation between two CJSDs. The intuition behind dilation ratios concerns the relative difference in scaling. A dilation ratio describes the rescaling magnitude. Our concept of CJSD dilation ratios can be considered similar to the ratio of magnification in geometry [10]. Dilation ratios have an interesting property, namely that they can be put in rank order. With this one can begin to surmise the improvement in discrimination between input distributions afforded by the family of CJSDs.

Motivated by applications in laser spectroscopy, the original CJSD contribution [41] applied CJSD in a new kernel for SVM classification. Our experiments carry forward investigation of CJSDs' utility beyond laser spectroscopy with validation on a synthetic data set where subtlety between the classes is controlled. This validation tests the applicability of CJSDs on arbitrary data characterized by the general properties of stochasticity and subtlety. Most importantly, our experiments empirically confirm our theories concerning the dilation parameter.

We also study range transformation in the backwards direction in Fig. 2. There are many examples in the literature for improving the performance of a model by transformation of the data. For example, k-means assumes clusters are spherical in shape. This technique works well for certain types of data sets; however, it performs poorly for others. This stems from trying to construct clusters comprised of equal numbers of data points. Examples in the literature demonstrate that this problem can be solved by data transformation (such as polar coordinate transformation). With range transformation, there is no early commitment about the data's geometry and, because data are not projected into a higher dimensional representation, we do not run against the curse of dimensionality. Statistical learning theory tells us that one can only increase dimensionality to the point where the sample size remains large enough to pay for the increased complexity [49].

We implement range transformation in the backward direction using an optimization framework to modify the model. The contributed implementation of range transformation automates alteration of the measurement tool, namely the distance measure. We chose the L2-norm because of its ubiquity. Moreover, the L2-norm appears in many machine learning algorithms: several SVM kernels (RBF, CJSDs, Gaussian, String, etc.), nonlinear manifold embeddings (ISOMAP, LLE, Laplacian eigenmaps, etc.), and clustering (k-means, hierarchical, mean shift, etc.). Consider k-means clustering. In Sect. 7.2, we demonstrate that range transformation of the L2-norm gives improved performance on novel spectroscopy data.

2 Chisini–Jensen–Shannon divergences

Jensen–Shannon divergence (JSD) is a symmetric and smoothed version of Kullback–Leibler divergence (KLD) [25,26]. Let P = {p_i}_{i=1}^{N} and Q = {q_i}_{i=1}^{N} be two probability distributions, where p_i and q_i are the respective probabilities associated with the i-th state (possible values). The Jensen–Shannon divergence (JSD) of Q from P is

JSD(P\|Q) = \frac{1}{2}\left[\sum_{i=1}^{N} p_i \log\frac{p_i}{M_i} + \sum_{i=1}^{N} q_i \log\frac{q_i}{M_i}\right]    (1)

where M is the Arithmetic Mean,

A.M. = \left\{\frac{p_i + q_i}{2}\right\}_{i=1}^{N}    (2)

The average value of M in (2) computes JSD precisely at the midpoint of P and Q; however, when distributions have subtle differences, it is hard to distinguish them. Therefore, a reformulation of JSD with alternate operators for better discrimination was proposed [41]. Though the authors demonstrated the utility of this reformulation by replacing (2) by (3) or (4), given as follows:

Geometric Mean, G.M. = \left\{\sqrt{p_i q_i}\right\}_{i=1}^{N}    (3)

Harmonic Mean, H.M. = \left\{\frac{2 p_i q_i}{p_i + q_i}\right\}_{i=1}^{N}    (4)

this reformulation is not limited to Chisini means.
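To make the reformulation concrete, the sketch below computes JSD_AM, JSD_GM, and JSD_HM for two discrete distributions with numpy. It is our own minimal illustration, not the authors' code; the small epsilon only guards against log(0) on zero-probability states, and the function name is ours.

```python
import numpy as np

def chisini_jsd(p, q, mean="am", eps=1e-12):
    """Jensen-Shannon divergence with the midpoint replaced by a Chisini mean.

    mean: "am" (arithmetic, ordinary JSD), "gm" (geometric), or "hm" (harmonic).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    if mean == "am":
        m = (p + q) / 2.0                     # Eq. (2)
    elif mean == "gm":
        m = np.sqrt(p * q)                    # Eq. (3)
    elif mean == "hm":
        m = 2.0 * p * q / (p + q)             # Eq. (4)
    else:
        raise ValueError("mean must be 'am', 'gm', or 'hm'")
    # Eq. (1) with M_i given by the chosen Chisini mean
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

if __name__ == "__main__":
    P = np.array([0.30, 0.30, 0.40])
    Q = np.array([0.25, 0.35, 0.40])
    vals = {k: chisini_jsd(P, Q, k) for k in ("am", "gm", "hm")}
    print(vals)   # observes JSD_AM < JSD_GM < JSD_HM, inequality (5) below
```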

3 Transformations

An important question concerns the type of transformations one might perform on a function's range. We investigate different types of transformations which are useful in disciplines such as mechanics, engineering, geometry, and computer graphics. These transformations are well known for their applicability to rigid and nonrigid body deformation and distortion. In computer graphics, their applications are also well studied in modeling the shape of an object. We consider the effect of different transformations to understand how a function's range can be modified. This helps provide an answer to the questions How does a CJSD transform a JSD's range? and What are the properties of these transformations?


Fig. 5 Some affine transformations

3.1 Affine transformation

An affine transformation is a linear mapping that preserves collinearity. After transformation, all points lying on a line initially still lie on a line. An affine transformation is not an angle- or length-preserving transformation. However, it does preserve proportions on lines. Moreover, it preserves ratios of distances so that sets of parallel lines remain parallel after an affine transformation. Therefore, a midpoint on a straight line will remain a midpoint after transformation. Affine transformations do not move any object from the affine space R3 to the plane at infinity or conversely. They are considered a type of projective transformation and are also known as affinities [42].

Affine transformations are widely used in geometry to correct deformations or distortions caused by camera angles. In image registration, satellite cameras cause wide-angle lens distortion. Affine transformations are used to correct such distortions. This is sometimes achieved by transforming and fusing the images to a large, flat coordinate system. Some affine transformations are translation, scaling, rotation, shear, reflection, expansion, contraction, composition, homothety, dilation, and spiral similarity, shown in Fig. 5.

3.2 Geometrical deformation

Deformation is a function from R3 → R3 which maps each point P(x, y, z) in Euclidean space to a point P′(x′, y′, z′). As demonstrated for affine transformations in the previous subsection, a desired geometric deformation can be obtained by function compositions. Therefore, it is a transformation from some initial to some final geometry.

Deformation can be defined as a change in shape, volume, position or rotation due to an applied stress. In the case of a rigid body, such changes can be obtained by the affine transformations we discussed in the previous subsection; however, in the case of a nonrigid body it is called distortion and can be obtained by dilation. Under such transformations, we can obtain homogeneous and heterogeneous deformations (Fig. 6).

Fig. 6 Homogeneous and heterogeneous deformation

Fig. 7 A change occurs in the point cloud position after deformation. An injective mapping exists from initial to final position

Homogeneous deformation is obtained by uniformly applying a transformation on a rigid body. It preserves parallelism of lines; however, each geometrical object is distorted into a different shape: a cube into a prism, a square into a parallelogram, a circle into an ellipse, and a sphere into an ellipsoid.

Heterogeneous deformation is obtained by irregularly or nonuniformly applying a transformation on a rigid body. It does not preserve parallelism of lines, and each object, such as a square, cube, or circle, is distorted into an irregular, complex shape.

Each particle on a geometrical object has its unique location in space. As these locations change under deformation, we may assert that there exists a mapping between initial and final positions, as shown in Fig. 7. Let vectors x and y represent the initial and the final position, respectively, of a particular particle of interest on any geometrical shape. Because no two distinct particles can be deformed into the same location, this mapping is an injective mapping defined by y = f(x), where f is the mapping function [4].

In our results we prove that there exists a dilation mapping between Chisini–Jensen–Shannon divergences (CJSDs). These CJSDs result from the transformation of JSD using dilation. Our concept of range transformation of CJSDs is similar to the above concept of mapping for geometric deformations. Such a mapping uniquely defines the positions of particles (or inputs to CJSDs) before and after deformation (range transformation in CJSDs). Controlling the governing parameter(s) responsible for deformation (range transformation for CJSDs), we can preset the final positions of particles. By this approach, we can implement the desired relative spacing between two CJSDs; thus, a desired magnification by scaling can be obtained. Our approach, JSD's range transformation, is particularly helpful with nuanced data sets where it is difficult to distinguish classes from each other. This modification gives rise to the formulation of CJSD-based kernels which improve generalization performance in SVM classification tasks over our JSD-based kernel (kernel without dilation), Sect. 6.2.

3.3 Modulation and perturbation transformation

Deformation techniques interactively model a rough surface. Because some objects have a geometric representation, these techniques study the point locations of a deformed object, which is useful in modeling [3,7,37]. In the area of computer graphics, two known deformation-based transformation techniques are

Modulation Transformation, TM, for modulation M and amplitude vector (Ax, Ay, Az), defined as:

TM(x, y, z) = (x + t·Ax, y + t·Ay, z + t·Az),   t = M(x, y, z)

Perturbation Transformation, TP, for perturbation P and amplitude vector A, defined as:

TP(x, y, z) = (x + u·A, y + v·A, z + w·A),   (u, v, w) = P(x, y, z)

These transformations can introduce a desired irregularity or deformation to an object at the right position, orientation and size. Therefore, proper tuning of the parameter values can provide a wide variety of visual effects. Moreover, these deformations can be reformulated by combining them with the other affine transformations mentioned earlier.

3.4 Isotropic versus non-isotropic dilation

An isotropic dilation is a uniform contraction or expansion of the plane or space about some fixed point (center), as illustrated in Fig. 5. If this point is the origin, the transformation is a homogeneous isotropic dilation. Even though it is an affine transformation, it is not an isometry. However, the product of such a dilation and an isometry is a similarity, which is a one-to-one mapping. In a similarity mapping each length is multiplied by the same number, which is called the similarity ratio, or strain in physics [28] (for CJSDs we call these dilation ratios, Sect. 4.1).

An anisotropic dilation is a contraction or expansion of the plane or space whose ratio is dependent on orientation. Under this transformation, points move either farther away from a center or closer to a center by unequal distances. Therefore, an object is changed nonuniformly, which may appear as a distortion or deformation of the original object's shape. For example, an isosceles or an equilateral triangle no longer remains an isosceles or an equilateral triangle under this transformation.

Fig. 8 (Left) points A and B in 2D plane under anisotropic dilation with scale factors k1 and k2. (Right) obtained elliptical shape after anisotropic dilation

Let, in the 3D case, point (x, y, z) be transformed to (x′, y′, z′). Let X be a vector in a Cartesian coordinate system. In matrix form, the transformation can be represented by X′ = KX, where K is the transformation matrix. The transformation matrices (dilation matrices) for isotropic and non-isotropic dilations are given as:

K_{Isotropic} = \begin{bmatrix} k & 0 & 0 & 0 \\ 0 & k & 0 & 0 \\ 0 & 0 & k & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\qquad
K_{Non\text{-}Isotropic} = \begin{bmatrix} k_1 & 0 & 0 & 0 \\ 0 & k_2 & 0 & 0 \\ 0 & 0 & k_3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

Here, for the isotropic case, x′ = k x, y′ = k y, z′ = k z, and for the non-isotropic case, x′ = k_1 x, y′ = k_2 y, z′ = k_3 z.

It was shown that two points in a 2D plane produce an elliptical shape under non-isotropic dilation [28]. Let A ≡ (x_1, y_1) and B ≡ (x_2, y_2) be two points in the 2D plane under a non-isotropic dilation transformation with scale factors k_1 and k_2 toward the X and Y axes, respectively (Fig. 8). Let the line on which these two points lie make an angle φ with the X axis.

Distance between A and B: d_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

Distance after dilation: d_2 = \sqrt{(x'_1 - x'_2)^2 + (y'_1 - y'_2)^2}

But x′ = k_1 x and y′ = k_2 y, as defined above for anisotropic dilation. Therefore,

k_\varphi = \frac{d_2}{d_1} = \frac{\sqrt{k_1^2 (x_1 - x_2)^2 + k_2^2 (y_1 - y_2)^2}}{\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}}

Substituting (y_1 - y_2) = (x_1 - x_2)\tan\varphi in the previous expression, we get

k_\varphi = \sqrt{k_1^2 \cos^2\varphi + k_2^2 \sin^2\varphi}


We can control the amount of dilation by varying k_1 and k_2 toward the X and Y axes, respectively. This means that k_φ and φ represent an ellipse with k_1 and k_2 as its major and minor axes.
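As a quick numerical sanity check of this relation, the short sketch below (our own illustration, not from the paper) dilates the segment between two points anisotropically and compares the measured length ratio against k_φ = sqrt(k_1² cos²φ + k_2² sin²φ).

```python
import numpy as np

k1, k2 = 3.0, 0.5                       # anisotropic scale factors along X and Y
A = np.array([1.0, 2.0])
B = np.array([4.0, 6.0])

d1 = np.linalg.norm(A - B)              # distance before dilation
A2, B2 = A * [k1, k2], B * [k1, k2]     # x' = k1 x, y' = k2 y
d2 = np.linalg.norm(A2 - B2)            # distance after dilation

phi = np.arctan2(B[1] - A[1], B[0] - A[0])   # angle of the segment with the X axis
k_phi = np.sqrt(k1**2 * np.cos(phi)**2 + k2**2 * np.sin(phi)**2)

print(d2 / d1, k_phi)                   # the two values agree
```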

The diagonal elements of K are responsible for dilation. In the isotropic case, the transformation is uniform, as each point or distance is dilated by the same number k. This is very useful as we can transform things from microscale to macroscale. Its utility comes in handy in applications where micro-level quantities need to be measured with exact precision and measuring equipment is limited in its measuring capability. For example, biochemists depend heavily on high-precision microscopes and sensors for the measurement of distances between bio-macromolecules (protein, RNA, etc.). In such domains, distances are measured at the nanometer scale. Magnification of relative distributions of distance (e.g., from some spectroscopic measurement technique) by isotropic dilation can give a better resolution of the chemical structure of bio-macromolecules. Moreover, astronomical distances are extremely large, measured in light years (relative distances between celestial bodies). Again, such measurements depend heavily on advanced equipment (satellites, telescopes, etc.). Dilation can be used to scale down these measurements.

4 Dilation of CJSDs

Definition 1 Dilation of a function means either stretching away from an axis or compressing toward an axis.

Definition 2 There exists a bijective mapping from the domain of input distributions to their respective ranges mapped by CJSDs. We call this function mapping a dilation mapping of CJSDs.

Consider S, a set of pairs s_i of probability distributions P = {p_i} and Q = {q_i}, where i = 1, . . . , n, and p_i, q_i ∈ [0, 1] ⊂ R. So, there are n such pairs of probability distributions in set S, and for each pair we compute their JSDs.

Let the functions JSD_AM, JSD_GM, and JSD_HM map each s_i ∈ S to images x_i ∈ X, y_i ∈ Y, and z_i ∈ Z, respectively, as follows (Fig. 9):

JSD_AM : S ↦ X
JSD_GM : S ↦ Y
JSD_HM : S ↦ Z

JSD with subscripts AM, GM, and HM refers to the respective divergences reformulated with the Chisini means (arithmetic, geometric, and harmonic), Fig. 17.

Fig. 9 Schematic of dilation mappings for CJSDs

4.1 Dilation ratios

The following inequality was proved in [41]:

JSD_AM < JSD_GM < JSD_HM    (5)

Consider the ratios:

α = JSD_GM / JSD_AM > 1    (6)

β = JSD_HM / JSD_AM > 1    (7)

γ = JSD_HM / JSD_GM > 1    (8)

(6), (7), and (8) are true by (5). We call α, β, γ the dilation ratios, which give us the scale factors that tell us the amount of magnification in scaling. Dilation ratios show how CJSDs space apart given distributions w.r.t. each other.

4.2 Relationship between α, β, γ

β − α = (JSD_HM − JSD_GM) / JSD_AM > 0
⇒ β > α    (9)

and,

1/γ − 1/β = (JSD_GM − JSD_AM) / JSD_HM > 0
⇒ β > γ    (10)

Alternately, we can prove (10) by substituting (6) and (7) in (8), i.e.,

γ = (β · JSD_AM) / (α · JSD_AM)
⇒ β = γα    (11)

As α, γ > 1, we have β > γ.

Now,

γ − α = JSD_HM / JSD_GM − JSD_GM / JSD_AM = (JSD_AM JSD_HM − JSD_GM^2) / (JSD_AM JSD_GM)

As the numerator is less than 0 by Proposition 2,

⇒ γ < α

Also,

γ − 1/α = (JSD_HM − JSD_AM) / JSD_GM > 0    (12)
⇒ γ > 1/α    (13)

As α > 1, 1/α is a nonnegative real number smaller than 1.

From (9), (12), and (13):

β > α > γ > 1/α    (14)
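The ordering in (14) is easy to verify numerically. The sketch below is our own illustration with hypothetical names; it inlines a small CJSD helper and checks β > α > γ > 1/α on an example pair of distributions.

```python
import numpy as np

def cjsd(p, q, mean):
    # Chisini-JSD with the chosen mean in place of the midpoint
    p, q = np.asarray(p, float) + 1e-12, np.asarray(q, float) + 1e-12
    m = {"am": (p + q) / 2, "gm": np.sqrt(p * q), "hm": 2 * p * q / (p + q)}[mean]
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

P, Q = np.array([0.30, 0.30, 0.40]), np.array([0.20, 0.38, 0.42])
am, gm, hm = (cjsd(P, Q, m) for m in ("am", "gm", "hm"))

alpha, beta, gamma = gm / am, hm / am, hm / gm      # dilation ratios, Eqs. (6)-(8)
assert beta > alpha > gamma > 1.0 / alpha           # ordering in Eq. (14)
print(alpha, beta, gamma)
```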

Proposition 1 Prove that CJSDs have a dilation relationship. Given arbitrary input S, JSD_GM's scaling is greater than JSD_AM's; likewise, JSD_HM's scaling is greater than both JSD_AM's and JSD_GM's.

Proof Define x_i ∈ [0, 1], y_i ∈ [0,∞), and z_i ∈ [0,∞] in co-domains X, Y, and Z, respectively, as the mapped images of each tuple s_i in S (Fig. 17). For JSD_AM, s_i is mapped to co-domain X described by the interval [0, 1] ⊂ R, while JSD_GM and JSD_HM map s_i to co-domains Y and Z, respectively, described by the interval [0,∞]. This means that JSD_GM and JSD_HM provide greater scaling than JSD_AM. Moreover, from (14), JSD_HM provides greater scaling than both JSD_AM and JSD_GM (Fig. 17). □

Remark 1 Proposition 1 shows that the ranges [0,∞] of JSD_GM and JSD_HM are scaled differently. We show that they have the same cardinality. Any interval on the real number line and the real number line itself have the same cardinality, namely uncountably infinite. This is demonstrated by proof that a bijection exists between intervals on R and R. Consider [0, 1] from the set of intervals on R. Map [0, 1] to [0,∞) by taking an element a ∈ [0, 1] and mapping it through f(x) = 1/x to an element b ∈ [0,∞]. Algebraically, the value of f(a) = 1/a. Therefore, 1/a = b. Through the inverse mapping, b's pre-image is f⁻¹(b) = 1/b = a. This also applies to arbitrary intervals [i, j] ⊂ R. Because an onto mapping exists and a is the only value mapped onto b, the pair (a, b) is a unique mapping. Hence, [0, 1] and [0,∞) have the same cardinality.

Proposition 2 JSD_AM JSD_HM − JSD_GM^2 < 0.

Proof As (5) holds, we have the following:

JSD_GM − JSD_AM > 0  and  JSD_HM − JSD_GM > 0

∴ (JSD_GM − JSD_AM)(JSD_HM − JSD_GM) > 0

or JSD_GM JSD_HM − JSD_GM^2 − JSD_AM JSD_HM + JSD_AM JSD_GM > 0

or JSD_GM^2 + JSD_AM JSD_HM < JSD_AM JSD_GM + JSD_GM JSD_HM

Adding −2 JSD_GM^2 on both sides:

JSD_AM JSD_HM − JSD_GM^2 < JSD_AM JSD_GM + JSD_GM JSD_HM − 2 JSD_GM^2
                         = JSD_GM (JSD_AM + JSD_HM − 2 JSD_GM)

Now, JSD_GM = (1/2)[JSD_AM + JSD_HM]  [Proved in [41]]

⇒ JSD_AM JSD_HM − JSD_GM^2 < 0  □

Proposition 3 Prove the following equality:

JSD_HM − JSD_AM = 2(JSD_HM − JSD_GM) = 2(JSD_GM − JSD_AM)

Proof

JSD_{HM} − JSD_{AM} = \frac{1}{2}\left[\left\{\sum_{i=1}^{n} p_i \log\frac{p_i}{HM_i} + \sum_{i=1}^{n} q_i \log\frac{q_i}{HM_i}\right\} − \left\{\sum_{i=1}^{n} p_i \log\frac{p_i}{AM_i} + \sum_{i=1}^{n} q_i \log\frac{q_i}{AM_i}\right\}\right]

= \frac{1}{2}\left[\sum_{i=1}^{n} p_i \log\frac{AM_i}{HM_i} + \sum_{i=1}^{n} q_i \log\frac{AM_i}{HM_i}\right]

= \frac{1}{2}\sum_{i=1}^{n}\left[p_i \log\frac{AM_i}{GM_i^2/AM_i} + q_i \log\frac{AM_i}{GM_i^2/AM_i}\right]   [∵ G.M.^2 = A.M. × H.M.]

= \frac{1}{2}\left[\sum_{i=1}^{n} p_i \log\left(\frac{AM_i}{GM_i}\right)^2 + \sum_{i=1}^{n} q_i \log\left(\frac{AM_i}{GM_i}\right)^2\right]

Therefore, we get

JSD_{HM} − JSD_{AM} = \sum_{i=1}^{n}\left[(p_i + q_i)\log\frac{AM_i}{GM_i}\right]    (15)

Similarly, we get the following two equations:

JSD_{HM} − JSD_{GM} = \frac{1}{2}\sum_{i=1}^{n}\left[(p_i + q_i)\log\frac{AM_i}{GM_i}\right]    (16)

JSD_{GM} − JSD_{AM} = \frac{1}{2}\sum_{i=1}^{n}\left[(p_i + q_i)\log\frac{AM_i}{GM_i}\right]    (17)

The right-hand side of (16) and (17) is also known as the arithmetic–geometric mean divergence [20,46]. The above three steps give the desired proof. □

4.3 Order of dilation in CJSDs

The question that remains is: up to what order do JSD_GM (w.r.t. JSD_AM) and JSD_HM (w.r.t. JSD_AM and JSD_GM) space apart distributions on the real number line? Representing the right-hand side of (15) as κ, Equations (15), (16), and (17) can be written as:

JSD_HM = JSD_AM + κ    (18)

JSD_HM = JSD_GM + (1/2)κ    (19)

JSD_GM = JSD_AM + (1/2)κ    (20)

Remark 2 Since (18), (19), and (20) represent a consistent and dependent system of linear equations, it is not possible to solve it algebraically.

Substituting (20) in (6):

α = 1 + \frac{1}{2}\left[\frac{κ}{JSD_{AM}}\right]    (21)

Substituting (18) in (7):

β = 1 + \left[\frac{κ}{JSD_{AM}}\right]    (22)

Substituting (19) in (8):

γ = 1 + \frac{1}{2}\left[\frac{κ}{JSD_{GM}}\right]    (23)

Substituting (20) in (23):

∴ γ = 1 + \left[\frac{κ}{2\,JSD_{AM} + κ}\right]    (24)

Relative improvement (R.I.) in scaling is computed as:

\frac{JSD_{GM} − JSD_{AM}}{JSD_{AM}} × 100\%,\quad \frac{JSD_{HM} − JSD_{AM}}{JSD_{AM}} × 100\%,\quad \frac{JSD_{HM} − JSD_{GM}}{JSD_{GM}} × 100\%

= (α − 1) × 100\%,\quad (β − 1) × 100\%,\quad (γ − 1) × 100\%    [By (6), (7), and (8)]

Remark 3 While Eqs. (18), (19), and (20) compute the differences (spacing) between CJSDs, Eqs. (21), (22), and (24) compute dilation ratios, which tell us precisely up to what order (scale factor) CJSDs will space apart the given distributions. Moreover, Eqs. (21), (22), and (24) also confirm the validity of (14). By substituting the expressions for the dilation ratios (21), (22), and (24) into (6), (7), and (8), respectively, we get the expressions ordering dilations, i.e., (18), (19), (20).

Equations (21), (22), and (24) are advantaged over (6), (7), and (8) because they depend only on JSD_AM and κ. Thus, we can compute α, β, and γ without first computing JSD_GM and JSD_HM. Also, from Eqs. (18) and (20), we can compute JSD_HM and JSD_GM implicitly as they depend only on JSD_AM and κ. See the modified schematic of this approach in Fig. 10.

4.4 Alternate representation of CJSDs and κ

Fig. 10 Modified schematic of dilation mapping shown in Fig. 17. JSD_HM (set Z) can be computed implicitly by mappings φ3, φ2

We call κ the dilation parameter. We elaborate κ in order to understand its principal constituents:

κ = \sum_{i=1}^{n}\left[(p_i + q_i)\log\frac{AM_i}{GM_i}\right]
  = \sum_{i=1}^{n}\left[(p_i + q_i)\log\frac{p_i + q_i}{2\sqrt{p_i q_i}}\right]
  = \sum_{i=1}^{n}\left[(p_i + q_i)\log(p_i + q_i) − (p_i + q_i)\log 2 − \frac{1}{2}p_i\log p_i − \frac{1}{2}p_i\log q_i − \frac{1}{2}q_i\log p_i − \frac{1}{2}q_i\log q_i\right]

Expanding the cross-entropy terms:

= \sum_{i=1}^{n}\left[(p_i + q_i)\log(p_i + q_i) − (p_i + q_i)\log 2 − \frac{1}{2}p_i\log p_i − \frac{1}{2}p_i\log p_i − \frac{1}{2}p_i\log\frac{q_i}{p_i} − \frac{1}{2}q_i\log q_i − \frac{1}{2}q_i\log\frac{p_i}{q_i} − \frac{1}{2}q_i\log q_i\right]

= \sum_{i=1}^{n}\left[(p_i + q_i)\log(p_i + q_i) − (p_i + q_i)\log 2 − p_i\log p_i + \left(\frac{1}{2}p_i\log\frac{p_i}{q_i} + \frac{1}{2}q_i\log\frac{q_i}{p_i}\right) − q_i\log q_i\right]

= \left[H_P + H_Q − H_{P+Q} − \sum_{i=1}^{n}(p_i + q_i) + \frac{1}{2}(SKLD)\right]

κ = \left[I(P, Q) − \sum_{i=1}^{n}(p_i + q_i) + \frac{1}{2}(SKLD)\right]    (25)

where H_P, H_Q, and H_{P+Q} are the Shannon entropies of the distributions P and Q, and of their joint distribution P + Q, respectively. I(P, Q) is the mutual information [9]. SKLD is the symmetric KL divergence. κ can be used for deriving alternate representations of JSD_GM and JSD_HM. First, expand JSD_AM:

JSD_{AM} = \frac{1}{2}\sum_{i=1}^{n}\left[p_i\log\frac{p_i}{(p_i + q_i)/2} + q_i\log\frac{q_i}{(p_i + q_i)/2}\right]

= \frac{1}{2}\sum_{i=1}^{n}\left[p_i\log p_i − p_i\log\frac{p_i + q_i}{2} + q_i\log q_i − q_i\log\frac{p_i + q_i}{2}\right]

= \frac{1}{2}\sum_{i=1}^{n}\left[p_i\log p_i + q_i\log q_i − (p_i + q_i)\log(p_i + q_i) + (p_i + q_i)\log 2\right]

= \frac{1}{2}\left[−H_P − H_Q + H_{P+Q} + \sum_{i=1}^{n}(p_i + q_i)\right]

JSD_{AM} = \frac{1}{2}\left[−I(P, Q) + \sum_{i=1}^{n}(p_i + q_i)\right]    (26)

Substituting (25) and (26) in (20):

JSD_{GM} = \frac{1}{4}[SKLD]    (27)

Though relation (27) was established by [41], ours provides an alternate proof. Substituting (25) and (26) in (18):

JSD_{HM} = \frac{1}{2}\left[H_P + H_Q − H_{P+Q} − \sum_{i=1}^{n}(p_i + q_i) + SKLD\right]

JSD_{HM} = \frac{1}{2}\left[I(P, Q) − \sum_{i=1}^{n}(p_i + q_i) + SKLD\right]    (28)

The alternate representations in (25), (26), (27), and (28) are important in understanding the effect of κ's spacing apart of JSD_AM outputs on the real number line. Section 4.2 discussed mutual differences between CJSDs. The parameter κ modifies JSD_AM's values, given by (18) and (20), by dilating them further apart on the real number line. Moreover, from (25) we can also tell precisely how much it will dilate. In order to get JSD_GM and JSD_HM, we dilate JSD_AM by a quantity κ/2 and κ, respectively.
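The identities (18), (20), and (27) can also be checked numerically; the following sketch is our own illustration with hypothetical names, computing κ directly from its definition and comparing it against the CJSD values.

```python
import numpy as np

def cjsd(p, q, mean):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = {"am": (p + q) / 2, "gm": np.sqrt(p * q), "hm": 2 * p * q / (p + q)}[mean]
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

P, Q = np.array([0.30, 0.30, 0.40]), np.array([0.20, 0.38, 0.42])
am, gm, hm = (cjsd(P, Q, m) for m in ("am", "gm", "hm"))

# dilation parameter kappa = sum_i (p_i + q_i) log(AM_i / GM_i)
kappa = np.sum((P + Q) * np.log(((P + Q) / 2) / np.sqrt(P * Q)))

# symmetric KL divergence
skld = np.sum(P * np.log(P / Q)) + np.sum(Q * np.log(Q / P))

assert np.isclose(hm, am + kappa)          # Eq. (18)
assert np.isclose(gm, am + kappa / 2)      # Eq. (20)
assert np.isclose(gm, skld / 4)            # Eq. (27)
```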

4.5 Modified dilation mapping

We redefine the dilation mapping defined earlier in Def. 2. Consider new dilation mapping functions φ1, φ2, and φ3 such that:

φ1 : X_{κ/2} ↦ Y    φ2 : Y_{κ/2} ↦ Z    φ3 : X_κ ↦ Z

X_κ contains the corresponding elements of X and κ, X_{κ/2} contains the corresponding elements of sets X and κ/2, and Y_{κ/2} contains the corresponding elements of sets Y and κ/2, where κ can be computed implicitly by (20), that is, κ = 2(Y − X). In Fig. 10, mappings φ1, φ2 and φ3 represent (20), (19), and (18), respectively.

5 L2-norm range transformation

So far we have studied what happens to the range when the model changes and proved it in the forward direction. Now, we will study what happens to the model when the range changes and prove it in the backward direction (Fig. 2). For experimental purposes, we choose the L2-distance (Euclidean distance), which is commonly used in real-world applications such as face recognition, financial prediction [17], and data mining [13].

The L2-distance is a widely used measure for computing the distance between two points. Its range falls on the nonnegative real number line. Let X = {x_1, x_2, . . . , x_n} be a data set of length n. Then, for any two given data points x_i, x_j ∈ X, it is defined as:

L_2 = \|x_i − x_j\| = \sqrt{\sum_{l=1}^{n}(x_{i,l} − x_{j,l})^2}    (29)

Let W = {w_1, w_2, . . . , w_n} be a transformation parameter of length n. Then, the modified expression for the weighted L2-distance is:

L_{2W} = \sqrt{\sum_{l=1}^{n} w_l (x_{i,l} − x_{j,l})^2}    (30)

The weighted version of the L2-distance has been widely used as a distance measure for classification and other algorithmic tasks [15,44,45,51].
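Equation (30) translates into a one-line computation; the snippet below is a minimal illustration with hypothetical names.

```python
import numpy as np

def weighted_l2(xi, xj, w):
    """Weighted Euclidean distance, Eq. (30); w has one nonnegative weight per feature."""
    xi, xj, w = map(np.asarray, (xi, xj, w))
    return np.sqrt(np.sum(w * (xi - xj) ** 2))

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
print(weighted_l2(xi, xj, np.ones(3)))                   # plain L2, Eq. (29)
print(weighted_l2(xi, xj, np.array([4.0, 1.0, 0.25])))   # anisotropically rescaled axes
```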

We show how to transform the L2-distance's range directly by choosing a suitable transformation out of the many possible transformations discussed in Sect. 3. This improves the model's performance. For validation purposes, we select k-means clustering [19] and employ it on LIBS amino acid data (Sect. 6.3). Our choice of k-means clustering is to impact a very well known and widely used algorithm. We show how range transformation of the L2-distance gives rise to improved k-means clustering performance by finding a suitable W.

Open questions that we address concern: How is L2-distance range transformation related to k-means clustering? How does the range transformation impact k-means clustering? What kind of transformations can we choose? How do we find a suitable transformation parameter W that gives improved clustering results?

For n-many d-dimensional data points {x_i}_{i=1}^{n}, k-means clustering partitions the data into k clusters, where k ≤ n. The objective is to minimize the sum of distances between each data point in a cluster and its cluster center (i.e., the Sum of Squared Errors (SSE)). This is expressed as an objective function:

\arg\min_{K}\sum_{i=1}^{k}\sum_{x \in K_i} \|x − μ_i\|^2    (31)

where K is the set of all clusters, and μ_i is the mean of the data points belonging to cluster K_i.

Fig. 11 k-means clustering algorithm

The k-means algorithm produces exactly k different clusters maintaining the maximum possible separation between them. While k is not known a priori and the prior probability for all k clusters is the same, an appropriate value for k can be found through validation. Therefore, k-means tries to produce clusters with approximately equal numbers of data points. Moreover, for k-means, the distribution of each variable is assumed to have the same variance (i.e., k-means tries to produce clusters which are spherical in shape). The algorithm for k-means clustering is given in Fig. 11.
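For reference, a compact Lloyd-style k-means loop in the spirit of Fig. 11 is sketched below, written so that a per-feature weight vector as in Eq. (30) can be plugged into the distance computation. This is our own sketch, not the authors' implementation, and the names are hypothetical.

```python
import numpy as np

def kmeans(X, k, weights=None, n_iter=100, seed=0):
    """Basic k-means; `weights` optionally rescales each feature as in Eq. (30)."""
    rng = np.random.default_rng(seed)
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights, float)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # initialize k centers
    for _ in range(n_iter):
        # squared weighted L2 distance from every point to every center
        d2 = np.sum(w * (X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)                            # assignment step
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])  # update step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)
```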

5.1 Understanding L2-distance transformation in k-means

Consider data with two classes for k-means clustering as shown in Fig. 12. If there are subtle differences in the classes, and the cluster centers are unfavorably initialized as shown in Fig. 12a, k-means will yield poor clustering results by producing clusters containing both classes. An isotropic transformation of the L2-distance matrix will transform all distances with the same scale factor, thus preserving the proportion of all relative distances (Fig. 12b, c). Therefore, in such a scenario, an isotropic transformation cannot improve clustering results. Alternately, an anisotropic transformation can put data points near the right centers by shifting and orienting all L2-distances with different scale factors (Fig. 12d). This refers back to our discussion on isotropic and anisotropic transformations in Sect. 3.4.

Fig. 12 a Considering a scenario of a 2-class data problem in k-means clustering. Here the L2-distance of red point B from center C1 is bigger than its distance from center C2. The same is true for blue point A, which is closer to center C1 than to center C2. b Isotropic dilation of L2-distances, shown on the horizontal axis, does not help. c Isotropic dilation of L2-distances, shown on the vertical axis, does not help. d Anisotropic dilation of L2-distances can allow to put the transformation at the right place and the right orientation so that the data points are closer to the right centers (color figure online)

Fig. 13 Some possible scenarios of bad clustering results by k-means clustering. a Clusters touching b clusters overlapping

For good clustering, data points should be close to their respective centers and clusters should be far from each other. Depending on the data, k-means clustering gives poor results in the following scenarios:

(i) Clusters are touching each other at boundaries, Fig. 13a. It depends on the distributions of data points and assigned cluster centers. As k-means tries to make spherical clusters of equal size, it may create clusters which touch at each other's boundaries.

(ii) Overlapping clusters, Fig. 13b. This may happen if data points from different classes are equidistant from the cluster centers. As k-means tries to make clusters with roughly equal numbers of data points, it may yield overlapping clusters.

(iii) Data points are assigned to wrong clusters. If two data points from different classes have subtle differences, they may be assigned to the wrong cluster centers.

L2-distance range transformation may help in the following scenarios:

– In the scenario when cluster boundaries are touching each other, a suitable choice of anisotropic transformation (Sect. 3) on the L2-distance matrix causes distances between data points and their respective centers to become smaller or bigger. Smaller L2-distances give compact clusters. The applied anisotropic transformation reconfigures the L2-distance matrix such that the resulting clusters are separated from each other (Fig. 14). The transformed range of L2-distances rescales the original L2-distances in a way so that the clusters contain data points which fall within that transformed range. This results in separated (compact) clusters (Fig. 14a). The other possibility of getting well-separated clusters is to reorient the L2-distances by anisotropic transformation without making clusters compact (Fig. 14b).

– In the scenario when clusters are overlapping, the L2 transformation may give separable compact clusters; nevertheless, data points may not necessarily be assigned to the right cluster center as they are equidistant from the centers. There is a 50% chance for an equidistant data point to be assigned to the correct center.

– If there are subtle differences in data distributions, this strategy is not successful.

Fig. 14 Possible scenarios of separated clusters obtained by L2-distance range transformation in k-means clustering. a Compact clusters, b separated clusters

5.2 Finding the transformation parameter by optimization

As we discussed in Sect. 5.1, our L2 transformation approach can help in the scenario when clusters are touching (Fig. 13a). A clustering is good if data points are close to their centroids and clusters are well separated from each other. This improvement is measured by computing the Davies–Bouldin Index (DBI), which is defined as a function of the ratio comparing the sum of within-cluster scatter and the between-cluster separation [11].

Definition 3 (Davies–Bouldin Index) Let d_i and d_j be the average distances between each point and the centroid in the i-th and j-th clusters, respectively, and let d_{i,j} be the distance between the centroids of the i-th and j-th clusters. Then,

DBI = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\left\{\frac{d_i + d_j}{d_{i,j}}\right\}

where k is the number of clusters.
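Definition 3 translates directly into code. The helper below is our own minimal sketch (scikit-learn's davies_bouldin_score computes the same quantity, if a library routine is preferred).

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin Index (Definition 3): lower means tighter, better-separated clusters."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ks])
    # d_i: average distance of the points in cluster i to its centroid
    scatter = np.array([np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
                        for i, c in enumerate(ks)])
    dbi = 0.0
    for i in range(len(ks)):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        dbi += max(ratios)
    return dbi / len(ks)

# tiny usage example
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(davies_bouldin(X, np.array([0, 0, 1, 1])))
```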

We employ an optimization approach to find a suitable transformation parameter, W, responsible for an improvement in clustering results by transforming the original L2.

Define the optimization problem for the objective function DBI w.r.t. W. Let W ∈ R^n be a real vector with n ≥ 1 and let f : R^n → R be an objective function. Then, the unconstrained optimization problem is:

\min_{W} f(X, W)

where X is the data set.

Our optimization approach finds a suitable parameter W using an anisotropic transformation of the L2-distance matrix. Therefore, the transformed L2W-distances describe a model that improves k-means clustering results. This refers back to our strategy for finding a suitable model by transforming the range in the backward direction (Fig. 2). The benefit of range transformation as an alternative to data transformation is that there is no early commitment about the data's geometry. Unlike data transformation, the implementation of range transformation focuses on altering the measurement tool, in our case, the L2-distances. This is the primary contribution of this work.

We choose DBI for the interpretation and validation of consistency within clusters of data. It is important to note that other clustering validation methods may be employed (e.g., the Calinski–Harabasz criterion, also known as the variance ratio criterion (VRC) [6], Gap Value [48], Silhouette Value [22,35]).
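The paper does not state which optimizer is used to search for W, so the sketch below assumes a generic derivative-free routine (scipy's Nelder–Mead) that minimizes the DBI of a weighted-L2 k-means clustering; kmeans and davies_bouldin refer to the helpers sketched earlier, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize
# assumes kmeans(X, k, weights=...) and davies_bouldin(X, labels) from the earlier sketches

def find_weights(X, k, w0=None):
    """Search for a per-feature weight vector W that minimizes DBI of weighted k-means."""
    w0 = np.ones(X.shape[1]) if w0 is None else w0

    def objective(w):
        w = np.abs(w) + 1e-8                      # keep weights nonnegative
        labels, _ = kmeans(X, k, weights=w)
        if len(np.unique(labels)) < 2:            # degenerate clustering: penalize
            return 1e6
        return davies_bouldin(X, labels)

    res = minimize(objective, w0, method="Nelder-Mead")
    return np.abs(res.x) + 1e-8

# usage (continuing the synthetic example from the k-means sketch):
# W = find_weights(X, k=2); labels, _ = kmeans(X, 2, weights=W)
```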

6 Experiments

As described in Sect. 1, we study the range transformation approach in two ways: first changing the model to transform its range (forward direction) and then transforming the range to change the model (backward direction) (Fig. 2).

For the forward direction, we employed synthetic data in our experiments (Sect. 6.1). A goal of our experimentation was to demonstrate the impact of dilation afforded by CJSDs. We also set out to understand whether the dilation parameter's empirical behavior coincides with the theory by controlling for subtlety in the input data's distribution. We computed all CJSDs from input distributions from scratch and compared the results to our proofs. In the following discussion, we refer to the Jensen–Shannon divergence measure as JSD, the Chisini–Jensen–Shannon divergence measure as CJSD, the kernel function employing CJSD as the CJSD kernel, and kernel density estimation as KDE.

For the backward direction, we chose spectral data from the analysis of chemical compounds (Sect. 6.3). A goal of our experimentation was to demonstrate the impact of anisotropic dilation on the L2-norm in k-means clustering, as described in Sect. 5.

6.1 Synthetic data generation

In the ideal case, classes in feature space consist of point clouds that are both compact and easily separable from the other classes. As noise increases, it becomes increasingly difficult to distinguish data points between classes, particularly those closer to the boundaries between two modes. We model noise as an increase in dispersion causing a high degree of overlap between the resulting point clouds. To model subtlety, we constructed a series of one-vs-one decision problems by combining multimodal point clouds consisting of classes whose dispersions differ in regular increments.

Fig. 15 Synthetic data for γ = 0.1, 0.5, 0.9

Fig. 16 Example of class overlap representing different subtlety problems

Define the mixture distribution

p(x|θ) = \sum_{i=1}^{M} w_i\, g(x|μ_i, Σ_i)

where x is a d-dimensional continuous-valued data vector, w_i, i = 1, . . . , M, are the mixture weights, and g(x|μ_i, Σ_i), i = 1, . . . , M, are component multivariate Gaussian densities in d dimensions.

g(x|μ_i, Σ_i) = \frac{1}{\sqrt{(2π)^d |Σ_i|}}\exp\left\{−\frac{1}{2}(x − μ_i)^{t} Σ_i^{−1}(x − μ_i)\right\}

Beginning with the interval [−1000, 1000] ⊂ R, we set the means of our multimodal distribution by equally spacing two points per axis along 10 dimensions. This results in a total of 1024 modes in 10 dimensions. The Gaussian mixture is defined by setting the mean vectors to each mode's locus. The component distributions share a single isotropic covariance matrix Σ_i = γσ²I, where σ² is initialized to the square of half the distance between the centers of adjacent modes. Setting γ along a schedule, 0.0, 0.1, . . . , 0.9, of 10% increments, we sample data sets of increasing smoothness.
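A sketch of this sampling scheme is given below; it is our own reading of the construction, reduced to a 2-dimensional, 4-mode setting for brevity, and the function names are hypothetical.

```python
import numpy as np

def sample_modes(gamma, n_samples=2048, low=-1000.0, high=1000.0, dim=2, seed=0):
    """Sample from an equal-weight Gaussian mixture with modes on a 2-per-axis grid.

    Each component shares the covariance gamma * sigma^2 * I, where sigma is half
    the spacing between adjacent mode centers.
    """
    rng = np.random.default_rng(seed)
    axis = np.array([low + (high - low) / 4, low + 3 * (high - low) / 4])  # 2 points per axis
    centers = np.array(np.meshgrid(*([axis] * dim))).T.reshape(-1, dim)    # 2**dim modes
    sigma = (axis[1] - axis[0]) / 2.0
    cov = gamma * sigma**2 * np.eye(dim)
    which = rng.integers(len(centers), size=n_samples)
    return np.array([rng.multivariate_normal(centers[i], cov) for i in which])

X_smooth = sample_modes(gamma=0.1)
X_noisy = sample_modes(gamma=0.9)   # pairing these two gives one one-vs-one test set
```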

We depict an example sampling distribution in Fig. 15, consisting of 4-modal 2-dimensional versions of the data with γ set to 0.1, 0.5, and 0.9. It is important to note in Fig. 15 that the vertical axis is the probability, depicted as a visualization, of the data points in the (x1, x2) plane. As can be seen, an increase in γ results in overlap between modes due to variability. For example, a one-vs-one decision problem constructed by pairing the 4-modal γ = 0.1 data (red points) with the 4-modal γ = 0.5 data (green points) has a more subtle difference between the classes than the decision problem constructed by pairing γ = 0.1 (red points) with γ = 0.9 (green points) (Fig. 16). Moreover, subtlety between a pair of classes is not the same even when their difference in smoothness is the same (Fig. 16). For example, while the pairings (γ = 0.1, γ = 0.2) and (γ = 0.2, γ = 0.3) both have a difference in smoothness Δγ = 0.1, the latter involves classes whose point clouds are more disperse and thus more noisy (stochasticity). In our synthetic data sets, we control for both subtlety between the classes as well as stochasticity. For our synthetic data, γ is the smoothness parameter. Moreover, for a one-vs-one decision problem, the difference in smoothness Δγ is the relative difference in smoothness between the paired data sets. Δγ measures the decision problem's subtlety.

Fig. 17 High-intensity peaks of d-Serine (red) at certain wavelengths in nanometer (nm) regions due to micro-well influence. Intensities in arbitrary units (a.u.) of d-Serine (red) and Water (black) seem to be on top of other compounds due to the subtle differences in data classes. a LIBS experimental setup, b wavelength versus intensity plot (color figure online)


Fig. 18 ISOMAP and LLE scatter plots with neighborhood size 15 on LIBS amino acid data. a ISOMAP, b LLE

Each test data set is a one-vs-one decision problem consisting of 2048 instances in 10 dimensions with equal priors over the classes. Given the unique pairings covering Δγ = 0.1, . . . , 0.9, we generated a total of 45 tests.

6.2 CJSD kernels for SVM classification

We implemented amplified and scaled versions of the CJSD kernels [41].

K_{amplified} = CJSD(P\|Q)\, e^{−\frac{|x_i − x_j|^2}{2σ^2}}    (32)

K_{scaled} = e^{−\frac{CJSD(P\|Q)\,|x_i − x_j|^2}{2σ^2}}    (33)
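In code, the two kernel variants of (32) and (33) can be written as follows for a single pair of inputs whose estimated distributions P and Q are given. This is a minimal sketch with names of our choosing, not the authors' WEKA kernel implementation.

```python
import numpy as np

def cjsd(p, q, mean="gm"):
    p, q = np.asarray(p, float) + 1e-12, np.asarray(q, float) + 1e-12
    m = {"am": (p + q) / 2, "gm": np.sqrt(p * q), "hm": 2 * p * q / (p + q)}[mean]
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

def k_amplified(xi, xj, P, Q, sigma, mean="gm"):
    """Eq. (32): RBF term amplified by the CJSD between the inputs' distributions."""
    r2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return cjsd(P, Q, mean) * np.exp(-r2 / (2 * sigma**2))

def k_scaled(xi, xj, P, Q, sigma, mean="gm"):
    """Eq. (33): CJSD used to rescale the squared distance inside the exponential."""
    r2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return np.exp(-cjsd(P, Q, mean) * r2 / (2 * sigma**2))
```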

Because they modulate the radial basis function with information about the data's distribution, we performed SVM classification using WEKA's [18] sequential minimal optimization (SMO) algorithm [31] augmented with our custom kernels, with the RBF kernel as a comparative measure for CJSD's impact on classification performance. Where RBF takes pointwise information over the data set, augmentation with JSD considers the shape of the distribution over all data points, while CJSD dilates the way in which the shape is considered. Moreover, given a complex decision problem with unknown nonlinearity, because the classification model must be able to shatter the data set, the RBF kernel is a reasonable choice for SVM classification as it offers infinite VC dimension. The RBF also carries the added benefit of constructing smooth solutions [5].

In all experiments, we performed tenfold cross-validation and report the sample mean and standard error. We estimated probability densities in our CJSDs using nonparametric KDE [29,34,43], where the bandwidth parameter is initialized with a diagonal covariance matrix computed from the data and grown, using Newton's method, to maximize the sum of leave-one-out log density [32]. It is important to note that the data sets used in experiments for each tested CJSD kernel (AM, GM, and HM) are randomized with unique random number seeds. For fair comparison, the benchmark RBF experiment is run using the same randomized data sets.

Fig. 19 (Top) γ versus CJSDs. (Green JSD_AM, Blue JSD_GM, Orange JSD_HM). (Middle) γ versus κ. (Bottom) γ versus slope of κ. (Notice the white line overlapping the green line in the middle and bottom subfigures, confirming (18), (19), (20)) (color figure online)


6.3 LIBS data

Classification and identification of amino acids in aqueous solutions is important in the study of bio-macromolecules. Laser-Induced Breakdown Spectroscopy (LIBS) is a powerful analytical technique that provides information about the elemental composition of a given sample [17,18]. The method uses an intense short laser pulse to break down the matrix of the target and to create a short-lived micro-plasma. During the cooling of the plasma, atomic, ionic, and occasionally molecular constituents emit spectra which are then collected and analyzed, allowing characterization of the sample content (Fig. 17a).

The distinction between amino acids from the perspective of their elemental composition is subtle, as they appear very similar when measured by the emissions of their principal constituents, namely hydrogen, carbon, nitrogen, and oxygen. In this work we analyze six compounds, namely water, polysaccharide (Ficoll), aspartic acid (ASP), glutamic acid (Glu), cysteine (Cys), and d-Serine.

ASP, Cys, and Glu each consisted of 100 instances, d-Serine and water each consisted of 50 instances, and water consisted of 270 instances. Each instance represents a single LIBS spectrum, produced from 13 different samples drawn from the six aforementioned compounds. Captured wavelengths remain consistent across all LIBS spectra. This resulted in 26,100 different wavelengths. In our experiment, we used the spectral intensity at each wavelength as a feature. Therefore, our data instances were vector valued in 26,100 dimensions.

We produced two different versions of the data by employing ISOMAP [47] and LLE [36] with manifold neighborhood size 15. We computed the ISOMAP scree plot, which we used to select the top 3 principal components that explain over 98.5% of the variance. We chose the top 3 principal components for both ISOMAP and LLE. From our 3D scatter plots in Fig. 18, it is obvious that these reduced data sets are not the same. Therefore, our experimental results with k-means clustering are tested on two different data sets.

7 Results

We set out to understand the performance improvement afforded by CJSDs. After deriving the dilation parameter and the relationship between CJSDs, it was important to examine how the theory plays out empirically. We report results for CJSD output versus Δγ, Δγ versus κ, and Δγ versus Δκ (Fig. 19 top, middle, and bottom, respectively).

Directing attention to the results for CJSD versus Δγ (upper left subplot), the difference between CJSD outputs remains constant. This confirms Proposition 3, because the relative relationship between the AM, GM, and HM versions is maintained regardless of Δγ. Moreover, as variability in the data increases (reading top to bottom, left to right), Proposition 3 still holds. This means Proposition 3 is maintained as variability (or stochasticity) in the data increases. Results for Δγ versus κ show that the order of dilations coincides empirically with the results derived in (18), (19), and (20) and holds regardless of the difference in subtlety Δγ. That is, the difference in scaling κ between the CJSD versions GM–AM and HM–GM is half the difference in scaling between HM and AM. Finally, in the results for Δγ versus Δκ, as we increase subtlety (going to the right), the difference in value Δκ for the CJSD AM, GM, and HM versions preserves the aforementioned 2:1 ratio.
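Using κ_AM, κ_GM, and κ_HM as shorthand (introduced here) for the scaling of the three CJSD versions, the relationship just described can be restated compactly as

\[
\kappa_{GM}-\kappa_{AM} \;=\; \kappa_{HM}-\kappa_{GM} \;=\; \tfrac{1}{2}\,(\kappa_{HM}-\kappa_{AM}),
\]

which is exactly the 2:1 ratio of (HM−AM) to (GM−AM) referred to above.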

SVM classification results appear in Tables 1 and 2. As can be seen, in all cases, CJSDs outperform the SVM with RBF kernel with statistical significance at the 95% confidence level. Out of the 45 generated tests with unique pairings of smoothness Δγ = 0.1, 0.2, ..., 0.9, for roughly 65% of the experiments, the scaled and/or amplified versions of the CJSD kernels for GM and HM give statistically significantly better performance than the amplified version of the CJSD kernel for AM. For the remaining experiments, both the amplified and scaled versions of the CJSD kernels give similar performance for AM, GM, and HM.
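The sketch below shows how CJSD-style Gram matrices plug into an SVM under tenfold cross-validation via scikit-learn's precomputed-kernel interface. The cjsd_gram function here is a hypothetical placeholder (the amplified and scaled CJSD kernels are defined earlier in the paper and are not reproduced); the data and seeds are likewise illustrative.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

def cjsd_gram(X, Y=None):
    """Placeholder: return a Gram matrix between rows of X and Y.
    The actual amplified/scaled CJSD kernels follow the paper's earlier definitions."""
    Y = X if Y is None else Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)            # stand-in similarity so the sketch runs end to end

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

G = cjsd_gram(X)                  # n x n Gram matrix
clf = SVC(kernel="precomputed")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, G, y, cv=cv)
print(scores.mean(), scores.std() / np.sqrt(len(scores)))   # mean and standard error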

7.1 Distortion in L2-distance

We discussed the role of the L2-distance in k-means clustering. Now, in order to understand the role of the L2-distance transformation in k-means clustering, recall the transformation parameter W described in Sect. 5. The k-means algorithm iterates and computes L2-distances between data points and cluster centers until it converges (Fig. 11). These L2-distances are represented in a matrix of size n × k, where n is the number of data points and k is the number of clusters.

Define distortion as a change in L2-distances due to the introduced range transformation parameter W. Let distortion ΔL = L2W − L2, where L2W is the transformed L2 defined in (30). We compute distortions for the ISOMAP and LLE data sets and plot them w.r.t. the number of iterations taken by k-means clustering to converge. This is done for clusters k = 2:10. The number of iteration steps taken by k-means can differ between the L2 and L2W distances. Generally, the L2W version of k-means converges in fewer iterations than the L2 version. We report distortion ΔL for the number of k-means iterations needed for the L2W version to converge. This was done to measure the impact of the anisotropic transformation using comparative performance between the L2 and L2W versions of k-means. Therefore, we consider at most 10 iterations for computing distortion ΔL. We use an optimization approach in order to find L2W.

From the subplots, we notice that ΔL decreases as the number of iterations increases and becomes stable after a few iterations (Fig. 20). This indicates that our optimization algorithm, defined in Sect. 5.2, converges toward an optimal W, required to introduce the necessary anisotropic transformation in the L2-distances.
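As an illustration of how the distortion ΔL can be tabulated, the sketch below compares the plain L2-distances and a weighted (anisotropic) L2 between points and cluster centers. The diagonal-weight form of L2W used here is an assumption made for illustration; the actual transformed distance is the one defined in (30).

import numpy as np

def l2_matrix(X, C):
    """Standard L2-distances between n points and k centers: shape (n, k)."""
    return np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1))

def l2w_matrix(X, C, w):
    """Anisotropic (weighted) L2-distances; w holds positive per-dimension weights.
    This is only a stand-in for the transformed L2W of Eq. (30)."""
    diff = X[:, None, :] - C[None, :, :]
    return np.sqrt((w * diff ** 2).sum(-1))

def distortion(X, C, w):
    """Distortion ΔL = L2W - L2, elementwise over the n x k distance matrix."""
    return l2w_matrix(X, C, w) - l2_matrix(X, C)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # e.g., a 3-D ISOMAP/LLE embedding
C = X[rng.choice(len(X), 4, replace=False)]         # k = 4 initial centers
w = rng.uniform(0.5, 1.5, size=3)                   # hypothetical anisotropic weights
print(np.abs(distortion(X, C, w)).mean())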


Table 1 CJSD kernels SVM classification accuracies

Datasets         Kernel    A.M.(%)     G.M.(%)     H.M.(%)
D000 vs. D010    Amplify   59.6±0.9    59.5±0.4    60.2±1.1
                 Scaled    57.8±0.9    82.7±0.4    57±0.7
D000 vs. D020    Amplify   68.3±0.7    68±1.4      67.3±1.0
                 Scaled    68.6±0.9    83.2±0.8    83.3±0.8
D000 vs. D030    Amplify   78.1±1.0    81.6±0.9    81.9±0.4
                 Scaled    81.6±0.9    83.7±0.7    83.8±0.9
D000 vs. D040    Amplify   88.7±0.9    87.5±0.8    87.8±1.0
                 Scaled    88.8±0.7    83.1±0.8    83.5±0.6
D000 vs. D050    Amplify   92.5±0.5    92.3±0.5    92.4±0.7
                 Scaled    91.8±0.9    86.9±1.3    83.7±0.7
D000 vs. D060    Amplify   95.7±0.4    95.8±0.6    95.7±0.4
                 Scaled    95.7±0.4    93.6±1.8    86.6±1.9
D000 vs. D070    Amplify   97.3±0.2    97.4±0.4    97.5±0.3
                 Scaled    95.7±0.2    97.4±0.4    96.5±1.1
D000 vs. D080    Amplify   98±0.3      98±0.3      98±0.3
                 Scaled    98±0.3      98.1±0.3    98.2±0.3
D000 vs. D090    Amplify   98.9±0.2    98.9±0.2    98.9±0.3
                 Scaled    98.9±0.2    98.9±0.2    99.1±0.3
D010 vs. D020    Amplify   63.2±1.3    95.6±0.8    63.8±1.3
                 Scaled    63.2±1.3    69.3±1.1    68.7±1.0
D010 vs. D030    Amplify   72.5±1.3    77.1±0.9    78.1±0.8
                 Scaled    74.7±1.1    79.1±0.8    75.7±0.7
D010 vs. D040    Amplify   82.8±1.2    84.7±0.9    85.6±0.7
                 Scaled    82.5±0.9    80±0.9      74.4±1.2
D010 vs. D050    Amplify   89.2±0.6    89.5±0.7    90.6±0.5
                 Scaled    87.8±0.7    81±1        75.5±0.7
D010 vs. D060    Amplify   94.5±0.3    93.6±0.7    93.3±0.3
                 Scaled    93.3±0.4    80.4±0.9    73.1±1.2
D010 vs. D070    Amplify   96.6±0.4    96.2±0.4    96.2±0.4
                 Scaled    95.7±0.6    80.6±0.9    74.2±1.1
D010 vs. D080    Amplify   97.5±0.4    96.9±0.2    96.6±0.4
                 Scaled    96.6±0.4    81.3±0.5    72.9±0.6
D010 vs. D090    Amplify   98.7±0.2    98.1±0.3    98±0.23
                 Scaled    97.9±0.3    81.7±0.9    75.7±0.7
D020 vs. D030    Amplify   61.6±0.6    63.8±1.0    63.3±1.2
                 Scaled    61.4±1.1    64.7±0.8    64.1±1.4
D020 vs. D040    Amplify   70±1.0      74.3±0.6    74.5±0.6
                 Scaled    72.8±0.6    74.2±0.6    68.2±0.8
D020 vs. D050    Amplify   76.8±1.0    80.9±0.6    80.5±0.9
                 Scaled    80±0.7      80.5±0.8    75.7±0.7
D020 vs. D060    Amplify   86.9±0.7    88.8±0.9    88.2±0.9
                 Scaled    87.5±0.8    87.4±1.2    76.8±1.7
D020 vs. D070    Amplify   91.4±0.6    91.6±0.6    91.6±0.7
                 Scaled    91.3±0.7    90.7±0.7    81.1±2.2
D020 vs. D080    Amplify   94.5±0.6    93.9±0.5    94.2±0.6
                 Scaled    93.7±0.6    90.9±0.8    83.7±1.3
D020 vs. D090    Amplify   96.5±0.5    95.8±0.5    95.3±0.4
                 Scaled    95.7±0.5    94.6±0.5    87.2±0.8
D030 vs. D040    Amplify   56.4±1.0    59.2±1.0    58.1±1.1
                 Scaled    57.8±1.4    59.5±0.9    57.7±0.6
D030 vs. D050    Amplify   61.8±1.1    67.1±0.9    66.2±0.8
                 Scaled    67±1.2      67.5±0.7    67.5±0.8
D030 vs. D060    Amplify   74.1±0.8    77.4±0.9    76.8±0.6
                 Scaled    77.2±0.8    77.2±0.9    75.5±0.9
D030 vs. D070    Amplify   81.3±1.0    82.9±0.7    83.1±0.5
                 Scaled    84±0.7      82.7±0.7    80.7±0.6
D030 vs. D080    Amplify   86.4±0.7    88.3±0.6    87±1.0
                 Scaled    88.6±0.7    87.8±0.6    82.9±1.2
D030 vs. D090    Amplify   90.1±0.6    91.4±0.8    90.6±0.6
                 Scaled    91.3±0.4    89.5±0.7    86.4±0.9
D040 vs. D050    Amplify   54.2±1.0    56.4±1.0    56.4±0.7
                 Scaled    56.4±1.1    56.6±1.1    56.8±0.9
D040 vs. D060    Amplify   62.4±0.7    66.9±1.0    67±0.9
                 Scaled    66.8±0.8    66.2±1.0    66.2±1.0
D040 vs. D070    Amplify   69.8±0.9    74.5±1.0    74.2±0.9
                 Scaled    75.1±0.6    74.1±0.9    72.5±1.0
D040 vs. D080    Amplify   76.4±0.9    80.5±1.0    80.5±0.8
                 Scaled    81.6±0.8    80.1±1.1    77.3±0.7
D040 vs. D090    Amplify   81.2±0.8    84.6±0.4    84.4±0.8
                 Scaled    85.1±0.9    84±0.5      81.1±0.7
D050 vs. D060    Amplify   54.6±0.5    56.2±0.7    57±0.9
                 Scaled    55.1±1      57±0.7      57.6±0.7
D050 vs. D070    Amplify   60.6±1.1    66±0.9      64.2±0.6
                 Scaled    66.2±1.1    65.9±1.0    64.4±0.7
D050 vs. D080    Amplify   67±0.7      72.8±1.2    71.4±0.8
                 Scaled    74.9±0.7    73.2±1.2    70.1±0.9
D050 vs. D090    Amplify   72.9±0.8    78.3±0.8    77.8±0.6
                 Scaled    79±0.8      78.3±0.9    75.1±0.7
D060 vs. D070    Amplify   53.6±0.8    56.4±0.9    56.3±0.9
                 Scaled    57.9±1      56.4±0.7    56.4±0.7
D060 vs. D080    Amplify   59.7±0.8    63.6±0.7    62.1±0.8
                 Scaled    64.8±0.9    63.8±0.6    61.2±1.2
D060 vs. D090    Amplify   65.8±1.2    69.9±0.9    68.9±0.9
                 Scaled    72±1.4      69.8±0.9    68±0.9
D070 vs. D080    Amplify   54.6±1.4    56.3±0.6    55.5±1.2
                 Scaled    56.5±1.1    56.3±0.9    55.2±1.1
D070 vs. D090    Amplify   57.3±0.8    62.2±0.8    62.2±1.1
                 Scaled    61.9±1      62.5±0.9    61.9±1.1
D080 vs. D090    Amplify   53.1±1.0    52.3±0.8    52.9±0.8
                 Scaled    53.9±0.7    50.6±0.6    53.2±1.1

Table 2 RBF kernels SVM classification accuracies

Datasets         Kernel    A.M.(%)     G.M.(%)     H.M.(%)
D000 vs. D010    Amplify   59.6±0.9    59.5±0.4    60.2±1.1
                 Scaled    57.8±0.9    82.7±0.4    57±0.7
D000 vs. D020    Amplify   68.3±0.7    68±1.4      67.3±1.0
                 Scaled    68.6±0.9    83.2±0.8    83.3±0.8
D000 vs. D030    Amplify   78.1±1.0    81.6±0.9    81.9±0.4
                 Scaled    81.6±0.9    83.7±0.7    83.8±0.9
D000 vs. D040    Amplify   88.7±0.9    87.5±0.8    87.8±1.0
                 Scaled    88.8±0.7    83.1±0.8    83.5±0.6
D000 vs. D050    Amplify   92.5±0.5    92.3±0.5    92.4±0.7
                 Scaled    91.8±0.9    86.9±1.3    83.7±0.7
D000 vs. D060    Amplify   95.7±0.4    95.8±0.6    95.7±0.4
                 Scaled    95.7±0.4    93.6±1.8    86.6±1.9
D000 vs. D070    Amplify   97.3±0.2    97.4±0.4    97.5±0.3
                 Scaled    95.7±0.2    97.4±0.4    96.5±1.1
D000 vs. D080    Amplify   98±0.3      98±0.3      98±0.3
                 Scaled    98±0.3      98.1±0.3    98.2±0.3
D000 vs. D090    Amplify   98.9±0.2    98.9±0.2    98.9±0.3
                 Scaled    98.9±0.2    98.9±0.2    99.1±0.3
D010 vs. D020    Amplify   63.2±1.3    95.6±0.8    63.8±1.3
                 Scaled    63.2±1.3    69.3±1.1    68.7±1.0
D010 vs. D030    Amplify   72.5±1.3    77.1±0.9    78.1±0.8
                 Scaled    74.7±1.1    79.1±0.8    75.7±0.7
D010 vs. D040    Amplify   82.8±1.2    84.7±0.9    85.6±0.7
                 Scaled    82.5±0.9    80±0.9      74.4±1.2
D010 vs. D050    Amplify   89.2±0.6    89.5±0.7    90.6±0.5
                 Scaled    87.8±0.7    81±1        75.5±0.7
D010 vs. D060    Amplify   94.5±0.3    93.6±0.7    93.3±0.3
                 Scaled    93.3±0.4    80.4±0.9    73.1±1.2
D010 vs. D070    Amplify   96.6±0.4    96.2±0.4    96.2±0.4
                 Scaled    95.7±0.6    80.6±0.9    74.2±1.1
D010 vs. D080    Amplify   97.5±0.4    96.9±0.2    96.6±0.4
                 Scaled    96.6±0.4    81.3±0.5    72.9±0.6
D010 vs. D090    Amplify   98.7±0.2    98.1±0.3    98±0.23
                 Scaled    97.9±0.3    81.7±0.9    75.7±0.7
D020 vs. D030    Amplify   61.6±0.6    63.8±1.0    63.3±1.2
                 Scaled    61.4±1.1    64.7±0.8    64.1±1.4
D020 vs. D040    Amplify   70±1.0      74.3±0.6    74.5±0.6
                 Scaled    72.8±0.6    74.2±0.6    68.2±0.8
D020 vs. D050    Amplify   76.8±1.0    80.9±0.6    80.5±0.9
                 Scaled    80±0.7      80.5±0.8    75.7±0.7
D020 vs. D060    Amplify   86.9±0.7    88.8±0.9    88.2±0.9
                 Scaled    87.5±0.8    87.4±1.2    76.8±1.7
D020 vs. D070    Amplify   91.4±0.6    91.6±0.6    91.6±0.7
                 Scaled    91.3±0.7    90.7±0.7    81.1±2.2
D020 vs. D080    Amplify   94.5±0.6    93.9±0.5    94.2±0.6
                 Scaled    93.7±0.6    90.9±0.8    83.7±1.3
D020 vs. D090    Amplify   96.5±0.5    95.8±0.5    95.3±0.4
                 Scaled    95.7±0.5    94.6±0.5    87.2±0.8
D030 vs. D040    Amplify   56.4±1.0    59.2±1.0    58.1±1.1
                 Scaled    57.8±1.4    59.5±0.9    57.7±0.6
D030 vs. D050    Amplify   61.8±1.1    67.1±0.9    66.2±0.8
                 Scaled    67±1.2      67.5±0.7    67.5±0.8
D030 vs. D060    Amplify   74.1±0.8    77.4±0.9    76.8±0.6
                 Scaled    77.2±0.8    77.2±0.9    75.5±0.9
D030 vs. D070    Amplify   81.3±1.0    82.9±0.7    83.1±0.5
                 Scaled    84±0.7      82.7±0.7    80.7±0.6
D030 vs. D080    Amplify   86.4±0.7    88.3±0.6    87±1.0
                 Scaled    88.6±0.7    87.8±0.6    82.9±1.2
D030 vs. D090    Amplify   90.1±0.6    91.4±0.8    90.6±0.6
                 Scaled    91.3±0.4    89.5±0.7    86.4±0.9
D040 vs. D050    Amplify   54.2±1.0    56.4±1.0    56.4±0.7
                 Scaled    56.4±1.1    56.6±1.1    56.8±0.9
D040 vs. D060    Amplify   62.4±0.7    66.9±1.0    67±0.9
                 Scaled    66.8±0.8    66.2±1.0    66.2±1.0
D040 vs. D070    Amplify   69.8±0.9    74.5±1.0    74.2±0.9
                 Scaled    75.1±0.6    74.1±0.9    72.5±1.0
D040 vs. D080    Amplify   76.4±0.9    80.5±1.0    80.5±0.8
                 Scaled    81.6±0.8    80.1±1.1    77.3±0.7
D040 vs. D090    Amplify   81.2±0.8    84.6±0.4    84.4±0.8
                 Scaled    85.1±0.9    84±0.5      81.1±0.7
D050 vs. D060    Amplify   54.6±0.5    56.2±0.7    57±0.9
                 Scaled    55.1±1      57±0.7      57.6±0.7
D050 vs. D070    Amplify   60.6±1.1    66±0.9      64.2±0.6
                 Scaled    66.2±1.1    65.9±1.0    64.4±0.7
D050 vs. D080    Amplify   67±0.7      72.8±1.2    71.4±0.8
                 Scaled    74.9±0.7    73.2±1.2    70.1±0.9
D050 vs. D090    Amplify   72.9±0.8    78.3±0.8    77.8±0.6
                 Scaled    79±0.8      78.3±0.9    75.1±0.7
D060 vs. D070    Amplify   53.6±0.8    56.4±0.9    56.3±0.9
                 Scaled    57.9±1      56.4±0.7    56.4±0.7
D060 vs. D080    Amplify   59.7±0.8    63.6±0.7    62.1±0.8
                 Scaled    64.8±0.9    63.8±0.6    61.2±1.2
D060 vs. D090    Amplify   65.8±1.2    69.9±0.9    68.9±0.9
                 Scaled    72±1.4      69.8±0.9    68±0.9
D070 vs. D080    Amplify   54.6±1.4    56.3±0.6    55.5±1.2
                 Scaled    56.5±1.1    56.3±0.9    55.2±1.1
D070 vs. D090    Amplify   57.3±0.8    62.2±0.8    62.2±1.1
                 Scaled    61.9±1      62.5±0.9    61.9±1.1
D080 vs. D090    Amplify   53.1±1.0    52.3±0.8    52.9±0.8
                 Scaled    53.9±0.7    50.6±0.6    53.2±1.1


Fig. 20 Applied distortion in the L2-distance due to its range transformation decreases and plateaus after a few iterations. a ISOMAP, b LLE

Fig. 21 DBI decreases and becomes stable (fewer fluctuations) with increasing distortion in the L2-distance due to its range transformation. a ISOMAP, b LLE

Fig. 22 Pattern search, global search, and multipoint search perform better than all other tested optimization methods. The optimal DBI value (smaller) indicates better clustering than the case when no optimization (i.e., no L2 transformation) was performed (red line). a Optimization results on ISOMAP data, b optimization results on LLE data (color figure online)


We have analyzed the role of L2W through the distortion it introduces in the L2-distances. Now, we analyze its impact on k-means clustering. In other words, we set out to understand how clustering improves with the transformation of the L2-distance. The improvement in clustering can be measured by computing the Davies–Bouldin Index (DBI; smaller is better). We compute distortions ΔL for the ISOMAP and LLE data sets and plot them w.r.t. DBI. This is done for clusters k = 2:10. As before, we find the optimal W, which gives the transformed L2-distances, using our optimization approach given in Sect. 5.2.
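A minimal sketch of the DBI comparison for k = 2, ..., 10, using scikit-learn's KMeans and davies_bouldin_score. Realizing the anisotropic L2 by rescaling features with the square root of a diagonal W is an assumption made here for illustration, standing in for the optimized W of Sect. 5.2.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))            # e.g., a 3-D embedded LIBS data set
w = rng.uniform(0.5, 1.5, size=3)        # hypothetical diagonal W (stand-in for Sect. 5.2 output)
Xw = X * np.sqrt(w)                      # rescaling by sqrt(W) realizes a diagonal-W L2

for k in range(2, 11):
    base = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    trans = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xw)
    # DBI evaluated on the original coordinates for both clusterings (smaller is better)
    print(k, davies_bouldin_score(X, base), davies_bouldin_score(X, trans))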

From the subplots, we notice that ΔL versus DBI is not a smooth plot; DBI decreases non-monotonically (Fig. 21). This is an artifact of the optimization. However, as distortion increases, the DBI values become closer together. This means that our optimization algorithm creates distortions in the L2-distance until it finds a suitable W responsible for the L2-distance transformation. This transformed L2W gives an optimal DBI.

7.2 Optimal DBI

Using our approach in Sect. 5.2, we employed many optimization algorithms to find a suitable method that gives an optimal DBI value for some transformation parameter W.


Optimization methods can be characterized on the basis of the search algorithm they employ: for example, linear or nonlinear, gradient-based or non-gradient (direct search) based, deterministic or stochastic start point, and constrained or unconstrained [12]. We compute optimal DBIs w.r.t. the number of clusters for both the ISOMAP and LLE data sets and benchmark our DBI results against the case in which no L2 transformation was used. This is done for clusters k = 2:10. From Fig. 22 (best viewed in color), we notice that Pattern Search (blue line), Global Search (black line), and Multipoint Search (black asterisk line) give better clustering results (smaller DBI) than when no optimization was used (red line, i.e., when no L2 transformation was used for k-means clustering). While Pattern Search and Multipoint Search outperform the other employed optimization methods for most k, Global Search outperforms them for every k.
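The named optimizers (Pattern Search, Global Search, Multipoint/MultiStart search) are typical of MATLAB's Global Optimization Toolbox; as a hedged, runnable stand-in, the sketch below minimizes DBI over a diagonal W with SciPy's derivative-free differential evolution. The diagonal-W parametrization and all names here are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.optimize import differential_evolution
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))            # placeholder embedded data
k = 4                                    # number of clusters under study

def dbi_for_weights(w):
    """Objective: DBI of k-means run under a diagonal-W anisotropic L2 (via feature rescaling)."""
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X * np.sqrt(w))
    return davies_bouldin_score(X, labels)

bounds = [(1e-3, 10.0)] * X.shape[1]     # keep the weights positive
res = differential_evolution(dbi_for_weights, bounds, maxiter=30, seed=0, polish=False)
print(res.x, res.fun)                    # best diagonal W found and its DBI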

8 Conclusion

This work contributed a method for modifying a broad class of models to suit novel data through transformation of the range. We investigated different classes of transformations and evaluated their impact on machine learning tasks (classification and clustering). We demonstrated that by focusing on transformations of a function's range, we can find a generally applicable way to change the function's properties to best suit the data.

Our contribution to machine learning is an approach to selecting suitable models by focusing on transformations of their range. In particular, when faced with novel data with little knowledge about the pattern phenomena (e.g., quantum effects due to photon energy absorption) and the resolution required between model outputs in order to discriminate between classes, a more general approach informed by resolution can be valuable. Given a certain degree of subtlety between classes within the data or between members of a cluster, one begins with the requisite resolution in the distance measure needed for discriminability and then selects a transformation achieving it. By this approach, a broader class of functions is available for modeling. A new framework employing range transformations to improve models was developed.

We studied the range transformation approach in two ways: first, changing the model to transform its range; and second, transforming the range to change the model (Fig. 2).

Going in the forward direction, because of their broad applicability in describing regularities in data, we focused on the development of range transformations arising from function edits of JSDs using Chisini means. The family of modified Jensen–Shannon divergences, or Chisini–Jensen–Shannon divergences (CJSDs), carries important properties responsible for performance improvements. We addressed open questions related to the transformation of the CJSD's range, the properties and limitations of JSD reconfigurations, the data domains to which these reconfigurations are applicable in general, the precise rescaling done by CJSDs between input distributions, the magnitude of this rescaling, and computing CJSDs implicitly.

Beginning with an initial focus on how reformulation of JSDs with Chisini means rescaled their range, we proved a number of mathematical propositions that establish the governing parameters, namely the dilation parameter and the dilation mapping. For empirical validation, we generated synthetic data sets controlling for unit step increments in variability and subtlety between the classes. We validated the dilation parameter and showed that it holds regardless of subtlety (something we control for in our synthetic data experiments; we control for subtlety using covariance in multimodal multivariate Gaussians). Our findings demonstrate the utility of Chisini means in the classification of complex data whose properties include stochasticity and subtlety. These findings explain the AM < GM < HM ordering among CJSDs and shed light on how the JSD distance metric spaces apart its outputs, resulting in improved kernel performance. When the differences between classes in the data are subtle, an understanding of dilation provides tools to begin suggesting the appropriate measure for drawing distinctions. For example, if subtlety is small, our results recommend using the HM version of the CJSD. Most importantly, our experiments confirm our theory concerning the governing parameters as well as their utility as kernels in SVM classification. As the current work is theoretical in nature, future work will include evaluation of performance on real-world data sets.

After empirically validating our theoretical results in the forward direction, we studied range transformation in the backward direction. In doing so, we developed an optimization approach for finding a suitable model. With our approach, we showed that one can change the model by focusing on the transformation of the model's range. A validation using L2-distance range transformation was provided for k-means clustering on LIBS data. This work demonstrated the benefit of range transformation as an alternative to data transformation in clustering methods. Our optimization approach finds a suitable parameter W using an anisotropic transformation of the L2-distance matrix. Therefore, the transformed L2W-distances describe a model that improves k-means clustering results. This refers back to our strategy of finding a suitable model by transforming the range in the backward direction. The benefit of range transformation as an alternative to data transformation is that there is no early commitment about the data's geometry. Unlike data transformation, the implementation of range transformation focuses on altering the measurement tool, in our case, the L2-distances. This is the primary contribution of this work.


The L2-distance is widely applicable as a measuring tool. We demonstrated an anisotropic range transformation approach for this measure. Further investigation can be carried out for other domains beyond k-means clustering. Moreover, transformations for other measures, such as the Mahalanobis distance [27], the Bhattacharyya distance (Hellinger distance) [2], and the Manhattan distance [23], can be studied and analyzed for and beyond k-means clustering.

One distance measure can often be reduced to another. For example, if the covariance matrix in the definition of the Mahalanobis distance is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. Future work will investigate variations in distance measures by using range transformation; in particular, we will study what kinds of transformations these variations induce.
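For concreteness, the reduction reads

\[
d_{M}(\mathbf{x},\mathbf{y}) \;=\; \sqrt{(\mathbf{x}-\mathbf{y})^{\top}\Sigma^{-1}(\mathbf{x}-\mathbf{y})}
\;\;\overset{\Sigma = I}{\longrightarrow}\;\;
\sqrt{(\mathbf{x}-\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y})} \;=\; \lVert \mathbf{x}-\mathbf{y}\rVert_{2}.
\]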

Finally, in this work we explored range transformation for the analysis of synthetic and spectral data sets, which carry the properties of subtlety, stochasticity, and unknown nonlinearities. Future work will explore range transformation for other data domains, such as astronomical, video, and voice data, each with unique characteristics (noise, nonlinearity, nuances in data classes, etc.).

Acknowledgements The authors would like to thank Dr. Yuri Markushin, Dr. Poopalasingam Sivakumar, and Dr. Noureddine Melikechi for their assistance in understanding LIBS and the LIBS experimental protocol.

Compliance with ethical standards

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

1. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
2. Bhattacharyya, A.: On a measure of divergence between two multinomial populations. Sankhya Indian J. Stat., 401–406 (1946)
3. Blanc, C., Guitton, P., Schlick, C.: A methodology for description of geometrical deformations. In: Proceedings of Pacific Graphics, vol. 94 (1994)
4. Brannon, R.: Kinematics: the mathematics of deformation. Course Notes, ME EN 6530 (2008)
5. Buhmann, M.D.: Radial basis functions: theory and implementations. Camb. Monogr. Appl. Comput. Math. 12, 147–165 (2004)
6. Calinski, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974)
7. Chen, J., Thalmann, N.M., Tsang, Z., Thalmann, D.: Fundamentals of Computer Graphics. World Scientific, Singapore (1994)
8. Chisini, O.: Sul concetto di media. Periodico di Matematiche 9:2(4), 106–116 (1929)
9. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)
10. Coxeter, H.S.M., Greitzer, S.L.: Geometry Revisited. MAA, Washington, DC (1967)
11. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1(2), 224–227 (1979)
12. Dixon, L.C.W., Szegö, G.P.: Towards Global Optimisation, vol. 2. North-Holland, Amsterdam (1978)
13. Faloutsos, C., Lin, K.I.: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. SIGMOD Rec. 24(2), 163–174 (1995)
14. Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2006)
15. Greenacre, M.J., Groenen, P.J.: Weighted Euclidean biplots. J. Classif. 33(3), 442–459 (2016)
16. Griffith, D.A.: Reformulating classical linear statistical models. In: Advanced Spatial Statistics, pp. 82–107. Springer Netherlands (1988)
17. Güvenir, H.A., Altingovde, S., Uysal, I., Erel, E.: Bankruptcy prediction using feature projection based classification. In: Proceedings of SCI/ISAS (1999)
18. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
19. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
20. Inder, J.: New developments in generalized information measures. Adv. Imag. Electron Phys. 91, 37–135 (1995)
21. Kapur, J.N.: A comparative assessment of various measures of directed divergence. Adv. Manag. Stud. 3(1), 1–16 (1984)
22. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, New York (2009)
23. Krause, E.F.: Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Courier Corporation, North Chelmsford (2012)
24. Kullback, S.: Information Theory and Statistics, 2nd edn. Dover Publications, New York (1968)
25. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
26. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991)
27. Mahalanobis, P.C.: On the generalised distance in statistics. Proc. Natl. Inst. Sci. India 2(1), 49–55 (1936)
28. Mortenson, M.: Geometric Transformations for 3D Modeling. Industrial Press Inc., Norwalk (2007)
29. Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)
30. Pflug, G.C.: On distortion functionals. Stat. Decis. 24(1/2006), 45–60 (2006)
31. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods, pp. 185–208. MIT Press, Cambridge (1999)
32. Raykar, V.C., Duraiswami, R.: Fast optimal bandwidth selection for kernel density estimation. In: SDM, pp. 524–528. SIAM (2006)
33. Rodgers, J.L., Kohler, H.P.: Reformulating and simplifying the DF analysis model. Behav. Genet. 35(2), 211–217 (2005)
34. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 27(3), 832–837 (1956)
35. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
36. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
37. Schlick, C.B.P.G.C.: A methodology for description of geometrical deformations. Fundam. Comput. Graph. 94 (1994)
38. Shannon, C.: A mathematical theory of communication. Bell Syst. Techn. J. 27, 379–423, 623–656 (1948)
39. Sharma, P., Holness, G.: Dilation of Chisini–Jensen–Shannon divergence. In: 3rd IEEE International Conference on Data Science and Advanced Analytics (2016)
40. Sharma, P.K., Holness, G.: Dilation of Chisini–Jensen–Shannon divergences. In: 33rd International Conference on Machine Learning (ICML 2016) (2016)
41. Sharma, P.K., Holness, G., Markushin, Y., Melikechi, N.: A family of Chisini mean based Jensen–Shannon divergence kernels. In: 14th IEEE International Conference on Machine Learning and Applications. IEEE, Miami, FL (2015)
42. Simon, U.: Affine differential geometry. Handb. Differ. Geom. 1, 905–961 (2000)
43. Simonoff, J.S.: Smoothing Methods in Statistics. Springer, Berlin (2012)
44. Singha, J., Das, K.: Indian sign language recognition using eigen value weighted Euclidean distance based classification technique. arXiv preprint arXiv:1303.0634 (2013)
45. Smith, A.M.: A dual algorithm for the weighted Euclidean distance min–max location problem in R2 and R3. ProQuest. Doctoral dissertation, Clemson University, South Carolina, USA (2009)
46. Taneja, I.J.: On a difference of Jensen inequality and its applications to mean divergence measures. arXiv:math/0501302 (2005)
47. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
48. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(2), 411–423 (2001)
49. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Berlin (2013)
50. Wagstaff, A.: The demand for health: an empirical reformulation of the Grossman model. Health Econ. 2(2), 189–198 (1993)
51. Wang, L., Zhang, Y., Feng, J.: On the Euclidean distance of images. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1334–1339 (2005)
