Our reference: KNOSYS 2544 P-authorquery-v11
AUTHOR QUERY FORM
Journal: KNOSYS
Article Number: 2544
Please e-mail or fax your responses and any corrections to:
E-mail: [email protected]
Fax: +31 2048 52799
Dear Author,
Please check your proof carefully and mark all corrections at the appropriate place in the proof (e.g., by using on-screen annotation in the PDF file) or compile them in a separate list. Note: if you opt to annotate the file with software other than Adobe Reader then please also highlight the appropriate place in the PDF file. To ensure fast publication of your paper please return your corrections within 48 hours.
For correction or revision of any artwork, please consult http://www.elsevier.com/artworkinstructions.
Any queries or remarks that have arisen during the processing of your manuscript are listed below and highlighted by flags in the proof. Clickon the ‘Q’ link to go to the location in the proof.
Location in article
Query / Remark: click on the Q link to go. Please insert your reply or correction at the corresponding line in the proof
Q1 Please confirm that given names and surnames have been identified correctly.
Q2 Please check whether the identification of corresponding author is okay as typeset.
Q3 As per journal requirements, a minimum of five keywords is required. Kindly provide the same.
Q4 Please check whether the insertion of ‘Ref. [2]’ text citation is okay as typeset.
Q5 Please note that the author name ‘Hongwu et al.’ does not match with Ref. [30]. Kindly check.
Q6 Please note that ‘Figs. 2–6, 7–10 and 11 and 12’ have identical captions. Kindly check.
Thank you for your assistance.
Please check this box if you have no corrections to make to the PDF file
Knowledge-Based Systems xxx (2013) xxx–xxx
KNOSYS 2544 No. of Pages 11, Model 5G
1 August 2013
Contents lists available at SciVerse ScienceDirect
Knowledge-Based Systems
journal homepage: www.elsevier .com/ locate /knosys
MAR: Maximum Attribute Relative of soft set for clustering attribute selection
0950-7051/$ - see front matter © 2013 Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.knosys.2013.05.009
⇑ Corresponding author.
E-mail addresses: [email protected] (R. Mamat), [email protected] (T. Herawan), [email protected] (M.M. Deris).
Please cite this article in press as: R. Mamat et al., MAR: Maximum Attribute Relative of soft set for clustering attribute selection, Knowl. Based Syst.http://dx.doi.org/10.1016/j.knosys.2013.05.009
Rabiei Mamat a,⇑, Tutut Herawan b, Mustafa Mat Deris c
a Department of Computer Science, Universiti Malaysia Terengganu, 21300 Kuala Terengganu, Terengganu, Malaysia
b Department of Information System, University of Malaya, 50603 Pantai Valley, Kuala Lumpur, Malaysia
c Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, 86400 Batu Pahat, Johore, Malaysia
Article history:
Received 13 July 2012
Received in revised form 15 May 2013
Accepted 15 May 2013
Available online xxxx
Keywords:
Soft set theory
Clustering attributes
Attribute relative
Complexity
Clustering, the grouping of a set of categorical data into homogeneous classes, is a fundamental operation in data mining. One technique of data clustering is performed by introducing a clustering attribute. A number of algorithms have been proposed to address the problem of clustering attribute selection. However, the performance of these algorithms is still an issue due to their high computational complexity. This paper proposes a new algorithm called Maximum Attribute Relative (MAR) for clustering attribute selection. It is based on soft set theory and introduces the concept of the attribute relative in information systems. In experiments on fourteen UCI datasets and a supplier dataset, the proposed algorithm achieved a lower computational time than three rough set-based algorithms, i.e. TR, MMR, and MDA, by up to 62%, 64%, and 40% respectively, and than a soft set-based algorithm, i.e. NSS, by up to 33%. Furthermore, MAR has good scalability, i.e. its execution time tends to increase linearly as the number of instances and attributes increases.
© 2013 Published by Elsevier B.V.
1. Introduction
Data clustering refers to the method of creating groups of objects, or clusters. The result of data clustering, the so-called data cluster, is a collection of data objects in which the objects within each cluster are similar to one another while different clusters are quite distinct [1]. Data clustering is also known as cluster analysis, segmentation analysis, taxonomy analysis, and unsupervised classification. Clustering has been used in many areas such as gene data processing [2], transactional data processing [3], decision support [4], and radar signal processing [5]. Recently, greater attention has been given to categorical data clustering [6,7], where data objects are made up of non-numerical attributes.
The main difference between categorical data and numerical data is the multi-valued attribute that belongs to categorical data. This property leads to difficulties in similarity and dissimilarity measurement in the clustering process, since normal distance measures cannot be applied directly to categorical data. Therefore, the best similarity measurement of categorical data is done by defining the common objects for an attribute as well as the common values of the attribute, and the association between the two [8]. A number of algorithms for clustering
categorical data have been proposed, including works by Ganti et al. [9], Gibson et al. [10], Guha et al. [11] and Zaki [12].
These methods make important contributions to the issue of clustering categorical data. However, these approaches lead to increased computational complexity and inefficiency because of the properties of multi-valued attributes in categorical data, which have many structures and evolve into very large data sets. Another major issue is that they are not designed to handle uncertainty, which is very important in many real-world applications since there are no sharp boundaries between clusters.
To this end, rough set theory [13–15] has been used by introducing a series of clustering attributes, in which one of the attributes is selected and used at each step to cluster the objects that are not yet clustered, until all objects are clustered. These approaches to finding a clustering attribute successfully exploited the uncertainties in multi-valued information systems [26,27,29]. However, some scenarios under study have shown that those approaches have their own drawbacks, i.e. there exist unexpected iterations that lead to an increase in processing time. In addition, a clustering attribute can also lead to instability in the number of clusters produced in the clustering process.
A new way to manage uncertain data, called soft set theory, was proposed by Molodtsov [16] in 1999. According to him, the main advantage of soft set theory is that it is free from the inadequacy of parameterization tools, unlike previous theories
Table 1
Tabular representation of soft set (F, E) in (5).

U/E   e1  e2  e3  e4  e5  e6  e7
c1    1   0   0   0   0   0   0
c2    1   1   0   1   1   0   0
c3    0   0   0   1   0   0   0
c4    1   0   0   1   0   0   0
c5    1   0   0   1   1   0   0
c6    0   0   1   0   1   1   1
c7    0   0   0   0   1   0   0
c8    0   1   0   1   1   0   0
c9    0   1   1   0   1   1   1
c10   0   0   1   0   1   1   1
Table 2
Multi-valued information system.

SID  Degree  English  Exp     IT      Math    Prog    Stat
1    Ph.D    Good     Medium  Good    Good    Good    Good
2    Ph.D    Medium   Medium  Good    Good    Good    Good
3    M.Sc    Medium   Medium  Medium  Good    Good    Good
4    M.Sc    Medium   Medium  Medium  Good    Good    Medium
5    M.Sc    Medium   Medium  Medium  Medium  Medium  Medium
6    M.Sc    Medium   Medium  Medium  Medium  Medium  Medium
7    B.Sc    Medium   Good    Good    Medium  Medium  Medium
8    B.Sc    Bad      Good    Good    Medium  Medium  Good
such as the theory of probability. Soft set theory uses parameterized sets as its main tool for problem solving, which makes it very convenient and easy to apply in practice. As a result, great progress has been made in the study of soft set theory, including the works of [17–22]. A standard soft set may be regarded as a classification of objects into two distinct classes (yes/1 and no/0), thus confirming that a soft set can deal with a Boolean-valued information system.
For a multi-valued information system, the concept of multi-soft sets has been proposed in [23]. Since a direct proof that every rough set is a soft set has been given in [24], Hongwu et al. [30] proposed a soft set model on the equivalence classes of an information system, which can easily be applied to obtain the approximate sets of rough sets. Furthermore, they used it to select a clustering attribute for categorical datasets via a heuristic algorithm, namely NSS. Experimental results on fifteen UCI benchmark datasets showed that the proposed NSS algorithm provides a faster decision in selecting a clustering attribute than the Maximum Dependency Attributes (MDA) approach. However, since it transfers MDA to a soft set attribute dependency model, the time complexity of the algorithm is still an outstanding issue.
In this paper, by applying the concept of attribute relatives in soft set theory, an alternative technique to select a clustering attribute is proposed. The contributions of this work are summarized as follows:
a. MAR, an alternative technique to select a clustering attribute using soft set theory, is suggested. It is based on the concept of the attribute relative, where the comparison of attributes is made by taking into account the relatives of an attribute at the category level.
b. The proposed technique potentially discovers attribute subsets with better coverage and achieves a lower computational time. In experiments on fifteen UCI benchmark data sets, MAR achieves a lower computational time than the three rough set-based algorithms, i.e. TR, MMR, and MDA, by up to 62%, 64%, and 40% respectively, and than the soft set-based algorithm NSS by up to 33%.
The rest of the paper is organized as follows. In Section 2, we explain some basic knowledge about this topic. Then, we give a literature survey and review of rough and soft set-based clustering attribute selection techniques in Sections 3 and 4. In Section 5, we present the MAR technique. Section 6 discusses the experiments and a series of evaluations of the results. Finally, we conclude this work in Section 7.
2. Essential rudiments
2.1. Decision system
An information system as defined in [15] is a 4-tuple (quadruple) S = (U, A, V, f), where U = {u_1, u_2, ..., u_|U|} is a non-empty finite set of objects, A = {a_1, a_2, ..., a_|A|} is a non-empty finite set of attributes, V = ∪_{a∈A} V_a, where V_a is the domain (value set) of attribute a, and f: U × A → V is an information (knowledge) function such that f(x, a) ∈ V_a for every (x, a) ∈ U × A. An information system can be intuitively expressed in terms of an information table such as Table 1.
In many applications, the outcome of a classification is known. This a posteriori knowledge is expressed by one (or more) distinguished attribute called the decision attribute; the process is known as supervised learning. An information system of this kind is called a decision system. A decision system is an information system of the form D = (U, A ∪ {d}, V, f), where d ∉ A is the decision attribute.
2.2. Rough set theory
Rough set theory is an approach to aid decision making in the presence of uncertainty [13–15]. Within this theory, a data set can be formally described using an information system. The notion of an information system provides a convenient tool for the representation of objects in terms of their attribute values. Based on the concept of an information system, some basic definitions in rough set theory are given below.
Let S = (U, A, V, f) be an information system as defined in Section 2.1, where U is the set of all objects.
Definition 1 (Indiscernibility Relation IND(B)). Given two objects x_i, x_j ∈ U, they are indiscernible by the set of attributes B in A if and only if f(x_i, a) = f(x_j, a) for every a ∈ B. That is, (x_i, x_j) ∈ IND(B) if and only if ∀a ∈ B, where B ⊆ A, f(x_i, a) = f(x_j, a).
Obviously, IND(B) is an equivalence relation. It is well known that an equivalence relation induces a unique partition. The partition of U induced by IND(B) in S = (U, A, V, f) is denoted by U/B, and the equivalence class in the partition U/B containing x ∈ U is denoted by [x]_IND(B).
Definition 2 (Equivalence class). Given IND(B), the set of objects in U having the same values for the set of attributes in B is called an equivalence class, [x]_IND(B). It is also known as an elementary set with respect to B.
Definition 3 (Lower Approximation). Given a set of attributes B in A and a set of objects X in U, the lower approximation of X is defined as the union of all elementary sets contained in X. That is

  X_B = {x ∈ U | [x]_IND(B) ⊆ X}   (1)
Definition 4 (Upper Approximation). Given a set of attributes B in A and a set of objects X in U, the upper approximation of X is defined as the union of the elementary sets having a non-empty intersection with X. That is

  X^B = {x ∈ U | [x]_IND(B) ∩ X ≠ ∅}   (2)
Definition 5 (Roughness). The accuracy of approximation (roughness) of any subset X ⊆ U with respect to B ⊆ A, denoted by α_B(X), is measured by

  α_B(X) = |X_B| / |X^B|   (3)
2.3. Soft set theory
Throughout this section, U refers to an initial universe, E is a set of parameters describing objects in U, P(U) is the power set of U, and A ⊆ E.
Definition 6 (See [16,25]). A pair (F, A) is called a soft set over U, where F is a mapping given by

  F: A → P(U)   (4)
In other words, a soft set (F, A) over U is a parameterized family of subsets of the universe U. For a ∈ A, F(a) may be considered as the set of a-elements of the soft set (F, A), or the set of a-approximate elements of the soft set (F, A). Clearly, a soft set is not a (crisp) set.
Example 1. Let the universe U = {c1, c2, c3, c4, c5, c6, c7, c8, c9, c10} be a set of candidates and the set of parameters E = {e1, e2, e3, e4, e5, e6, e7} be a set of soft skills which stand for the parameters "communicative", "critical thinking", "team work", "information management", "entrepreneurship", "leadership" and "moral", respectively. Let F be a mapping of E into the set of all subsets of U given by F(e1) = {c1, c2, c4, c5}, F(e2) = {c3, c8, c9}, F(e3) = {c6, c9, c10}, F(e4) = {c2, c3, c4, c5, c8}, F(e5) = {c2, c5, c6, c7, c8, c9, c10}, F(e6) = {c6, c9, c10} and F(e7) = {c6, c9, c10}. Now consider the soft set (F, E), which describes the "capabilities of the candidates for hire". According to the data collected, the soft set (F, E) is given by
  (F, E) = { communicative = {c1, c2, c4, c5},
             critical thinking = {c3, c8, c9},
             team work = {c6, c9, c10},
             information management = {c2, c3, c4, c5, c8},
             entrepreneurship = {c2, c5, c6, c7, c8, c9, c10},
             leadership = {c6, c9, c10},
             moral = {c6, c9, c10} }   (5)
Obviously, the soft set (F, E) is not a crisp set; (F, E) is a parameterized family {F(e_i), i = 1, 2, ..., 7} of subsets of the set U, and each approximation has two parts: a predicate (p) and a value (v). For example, for the approximation "moral = {c6, c9, c10}", p is moral and v = {c6, c9, c10}.
Definition 7. Let S = (U, A, V_{0,1}, f) be an information system. If V_a = {0, 1} for every a ∈ A, then S = (U, A, V_{0,1}, f) is called a Boolean-valued information system.
Proposition 1. Each soft set can be considered as a Boolean-valuedinformation system.
Proof. Let (F, E) be a soft set over the universe U and S = (U, A, V, f) be an information system. Obviously, the universe U in (F, E) can be considered as the universe U, and the parameter set E may be considered as the attributes A. Then the information function f is defined by

  f(h, e) = 1 if h ∈ F(e), and f(h, e) = 0 if h ∉ F(e)
That is, when h_i ∈ F(e_j), where h_i ∈ U and e_j ∈ E, then f(h_i, e_j) = 1; otherwise f(h_i, e_j) = 0. To this end, we have V(h_i, e_j) = {0, 1}. Therefore, a soft set (F, E) can be considered as a Boolean-valued information system S = (U, A, V_{0,1}, f) and can be represented in the form of a binary table.
As can be seen in Table 1, '1' denotes the presence of the described parameter, while '0' means the parameter is not part of the capabilities of the candidate.
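Proposition 1's construction is easy to mirror in code. The sketch below stores Example 1's soft set as a mapping from parameters to subsets of U and derives the Boolean-valued table of Table 1; the variable and function names are ours, not the paper's:

```python
# Soft set of Example 1: each parameter maps to its approximate elements.
U = [f"c{i}" for i in range(1, 11)]
F = {
    "e1": {"c1", "c2", "c4", "c5"},
    "e2": {"c3", "c8", "c9"},
    "e3": {"c6", "c9", "c10"},
    "e4": {"c2", "c3", "c4", "c5", "c8"},
    "e5": {"c2", "c5", "c6", "c7", "c8", "c9", "c10"},
    "e6": {"c6", "c9", "c10"},
    "e7": {"c6", "c9", "c10"},
}

def boolean_table(U, F):
    # f(h, e) = 1 if h is in F(e), else 0 (Proposition 1).
    return {h: {e: int(h in objs) for e, objs in F.items()} for h in U}

table = boolean_table(U, F)
print(table["c2"]["e1"], table["c2"]["e3"])  # 1 0
```

Reading the resulting dictionary row by row reproduces the binary entries of Table 1.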
However, a multi-valued information system, as described in [23], needs to be converted to multi-soft sets based on the notion of a decomposition of a multi-valued information system. Let S = (U, A, V, f) be a multi-valued information system and S_i = (U, a_i, V_{0,1}, f), i = 1, 2, ..., |A|, be Boolean-valued information systems. As a result of the decomposition, we have

  S = (U, A, V, f) = { S_1 = (U, a_1, V_{0,1}, f) ⟺ (F, a_1)
                       ...
                       S_|A| = (U, a_|A|, V_{0,1}, f) ⟺ (F, a_|A|) }
                    = ((F, a_1), (F, a_2), ..., (F, a_|A|))

We define (F, E) = ((F, a_1), (F, a_2), ..., (F, a_|A|)) as multi-soft sets over the universe U representing the multi-valued information system S = (U, A, V, f). As an example, using the information system in Table 2, which describes students' enrollment qualifications, the concept of multi-soft sets is illustrated in Fig. 1. □
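The decomposition above amounts to building one Boolean soft set per (attribute, value) pair. A minimal sketch, with a toy fragment of Table 2's Degree column (the dict layout and names are illustrative assumptions):

```python
from collections import defaultdict

def multi_soft_set(U, f):
    """Return {(attribute, value): set of objects taking that value}."""
    soft = defaultdict(set)
    for x in U:
        for a, v in f[x].items():
            soft[(a, v)].add(x)
    return dict(soft)

# Toy rows mirroring Table 2's Degree attribute.
U = [1, 2, 3, 4]
f = {1: {"Degree": "PhD"}, 2: {"Degree": "PhD"},
     3: {"Degree": "MSc"}, 4: {"Degree": "BSc"}}
print(multi_soft_set(U, f)[("Degree", "PhD")])  # {1, 2}
```

Applied to the full Table 2 this yields exactly the sixteen soft sets of Fig. 1.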
3. Rough set-based clustering attribute selection techniques
Currently, there are two measurement approaches based on rough set theory for clustering attribute selection. The first approach is based on the roughness of the attribute, and the second is based on attribute dependencies. In this section, two techniques of the first approach, i.e. Total Roughness (TR) and Min–Min Roughness (MMR), and a technique of the second approach, called Maximum Dependency of Attributes (MDA), are reviewed.
Fig. 1. Multi-soft set composition from dataset in Table 2.
3.1. Total Roughness (TR)
The total roughness technique, proposed by Mazlack et al. [26], has three main steps. The first is to determine the roughness of each attribute by exploiting the relationship between the lower and upper approximations defined in Eqs. (1) and (2), respectively. Let S = (U, A, V, f) be an information system as described in Section 2.1, and let a_i ∈ A have k different values b_k, k = 1, 2, ..., n. Let X(a_i = b_k), k = 1, 2, ..., n, be the subset of objects having value b_k of attribute a_i. The roughness of the set X(a_i = b_k) with respect to a_j, where i ≠ j, denoted TR_Rough_{a_j}(X | a_i = b_k), is defined by

  TR_Rough_{a_j}(X | a_i = b_k) = |X_{a_j}(a_i = b_k)| / |X^{a_j}(a_i = b_k)|,  k = 1, 2, ..., n   (6)

where |X_{a_j}(a_i = b_k)| and |X^{a_j}(a_i = b_k)| are, respectively, the cardinalities of the lower and upper approximations of the subset of objects having value b_k of attribute a_i with respect to attribute a_j, where i ≠ j.
In the next step, the mean roughness of attribute a_i ∈ A with respect to attribute a_j ∈ A, where i ≠ j, denoted TR_MeanRough_{a_j}(a_i), is evaluated as follows:

  TR_MeanRough_{a_j}(a_i) = ( Σ_{k=1}^{|V(a_i)|} TR_Rough_{a_j}(X | a_i = b_k) ) / |V(a_i)|   (7)
where V(a_i) is the set of values of attribute a_i ∈ A. Finally, the total roughness of attribute a_i ∈ A with respect to attribute a_j ∈ A, where i ≠ j, denoted TR(a_i), is obtained by the following formula:

  TR(a_i) = ( Σ_{j=1}^{|A|} TR_MeanRough_{a_j}(a_i) ) / (|A| − 1)   (8)
Mazlack et al. suggested that the attribute with the highest TR is the best selection for a clustering attribute.
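Under the definitions of Eqs. (6)–(8), TR can be sketched as follows. This is an illustrative Python sketch, not the authors' code; objects are assumed to be dicts of attribute values:

```python
from collections import defaultdict

def classes(U, f, a):
    """Elementary sets of U induced by a single attribute a."""
    d = defaultdict(set)
    for x in U:
        d[f[x][a]].add(x)
    return d

def mean_roughness(U, f, ai, aj):
    # Eq. (7): average over the values b_k of ai of |lower|/|upper| w.r.t. aj.
    parts_j = list(classes(U, f, aj).values())
    blocks_i = list(classes(U, f, ai).values())
    total = 0.0
    for X in blocks_i:
        lower = sum(len(c) for c in parts_j if c <= X)   # Eq. (1)
        upper = sum(len(c) for c in parts_j if c & X)    # Eq. (2)
        total += lower / upper                           # Eq. (6)
    return total / len(blocks_i)

def total_roughness(U, f, A, ai):
    # Eq. (8): mean roughness averaged over all other attributes.
    others = [aj for aj in A if aj != ai]
    return sum(mean_roughness(U, f, ai, aj) for aj in others) / len(others)
```

On the data of Table 2 this reproduces entries such as the Exp/Degree mean roughness of 1 reported in Table 5.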
3.2. Min–Min Roughness (MMR)
MMR, proposed by Parmar et al. [27], is another rough set-based technique. The roughness value of the MMR technique is given below:

  MMR_Rough_{a_j}(X | a_i = b_k) = 1 − |X_{a_j}(a_i = b_k)| / |X^{a_j}(a_i = b_k)| = 1 − TR_Rough_{a_j}(X | a_i = b_k)   (9)
It is clear that the MMR technique uses the Marczewski–Steinhaus metric [28] to measure the roughness of the set X(a_i = b_k), k = 1, 2, ..., n, with respect to a_j, where i ≠ j. The mean roughness of the MMR technique is given as follows:

  MMR_MeanRough_{a_j}(a_i) = ( Σ_{k=1}^{|V(a_i)|} MMR_Rough_{a_j}(X | a_i = b_k) ) / |V(a_i)|   (10)
From Eqs. (9) and (10), it is easily seen that the mean roughness of the MMR technique is the complement of the mean roughness of TR:

  MMR_MeanRough_{a_j}(a_i) = ( Σ_{k=1}^{|V(a_i)|} (1 − TR_Rough_{a_j}(X | a_i = b_k)) ) / |V(a_i)|
                           = ( Σ_{k=1}^{|V(a_i)|} 1 − Σ_{k=1}^{|V(a_i)|} TR_Rough_{a_j}(X | a_i = b_k) ) / |V(a_i)|
                           = |V(a_i)| / |V(a_i)| − ( Σ_{k=1}^{|V(a_i)|} TR_Rough_{a_j}(X | a_i = b_k) ) / |V(a_i)|
                           = 1 − TR_MeanRough_{a_j}(a_i)
Please cite this article in press as: R. Mamat et al., MAR: Maximum Attribute Relhttp://dx.doi.org/10.1016/j.knosys.2013.05.009
for i ≠ j.

Let m be the number of attributes. The Min-Roughness (MR) of attribute a_i with respect to all a_j, where i ≠ j, is defined as

  MinR(a_i) = min{MMR_MeanRough_{a_1}(a_i), ..., MMR_MeanRough_{a_m}(a_i)}

where 1 ≤ i, j ≤ m. The minimum of the Min-Roughness, known as Min–Min-Roughness (MMR), is formulated as follows:

  MMR = min(MinR(a_1), ..., MinR(a_i), ..., MinR(a_|A|))   (11)

where |A| is the number of attributes in the information system and a_i ∈ A, 1 ≤ i ≤ |A|. Parmar et al. suggested that the attribute with the least mean roughness is the best selection of a clustering attribute.
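Since Eq. (9) is just the complement of TR's roughness, MMR can be expressed directly on a matrix of TR mean-roughness values. A small sketch; the dict-of-dicts layout and function name are our own assumptions:

```python
def mmr_choice(tr_mean):
    """tr_mean: {ai: {aj: TR mean roughness}} for aj != ai.
    Returns the attribute minimizing MinR (Eq. (11))."""
    # MinR(ai) over the complemented roughness values (Eq. (9)).
    min_r = {ai: min(1 - v for v in row.values()) for ai, row in tr_mean.items()}
    # MMR: the attribute whose MinR is smallest.
    return min(min_r, key=min_r.get)
```

Note that ties (as between Exp and IT in Table 6) would need the additional tie-breaking values shown in the table's MMR result column.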
3.3. Maximum Attribute Dependency (MDA)
Maximum Dependency of Attributes, proposed by Herawan et al. [29], selects among the attributes of an information system by using attribute dependencies. MDA has a 4-step algorithm which starts with a search for the indiscernibility relation. Subsequently, the attribute dependency for each attribute in the information system is calculated. Given two attributes a_i, a_j ∈ A, the dependency degree (in k) of attribute a_i on attribute a_j, denoted a_j ⇒_k a_i, is calculated using the following equation:

  k_{a_j}(a_i) = ( Σ_{X ∈ U/a_i} |a_j(X)| ) / |U|   (12)
The highest dependency degree (Max-Dependency), denoted MD, for each a_i ∈ A, 1 ≤ i ≤ |A|, is selected in the following step. Given m attributes, the MD of attribute a_i is defined as

  MD(a_i) = max(k_{a_1}(a_i), ..., k_{a_j}(a_i), ..., k_{a_m}(a_i))   (13)
where a_i ≠ a_j and 1 ≤ i, j ≤ m. Lastly, the MDA technique selects the attribute with the highest MD as the clustering attribute using Eq. (14). If more than one attribute shares the highest k value, the next highest k in each attribute is used and compared until the tie is broken.

  MDA = max(MD(a_1), ..., MD(a_i), ..., MD(a_m))   (14)
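Eq. (12) counts the objects whose a_j-elementary set falls entirely inside some class of U/a_i. A hedged Python sketch (objects as dicts of attribute values; names are ours):

```python
from collections import defaultdict

def classes(U, f, a):
    """Elementary sets of U induced by attribute a."""
    d = defaultdict(set)
    for x in U:
        d[f[x][a]].add(x)
    return list(d.values())

def dependency(U, f, aj, ai):
    # k_{aj}(ai): fraction of objects in aj-lower approximations of U/ai (Eq. (12)).
    parts_j = classes(U, f, aj)
    covered = sum(len(c) for X in classes(U, f, ai) for c in parts_j if c <= X)
    return covered / len(U)
```

On Table 2's data this recovers, for instance, the full dependency of Exp on Degree (the value 1 in Table 7).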
Reviewing the above three techniques for handling large data sets, even with increasing computing capabilities, the complexity of computation is still an outstanding issue. From the analysis, we have found that two techniques, i.e. MMR and MDA, suffer from 'unpredicted iteration'. If there are n objects in the data set, in the worst-case scenario this situation can trigger n² processing steps, which increases the computational complexity for large data sets. Although TR does not face the same problem, TR already incurs a considerable computational time, as discussed in [29]. Therefore, there is a need for a clustering attribute selection technique that further improves the computational complexity.
4. Soft set-based clustering attribute selection techniques
A novel soft set (NSS) approach has been proposed by Hongwu et al. in [30]. In this approach, the re-definition of attribute dependency in rough set theory is given in the context of soft set theory. In NSS, the degree of dependency is defined as:

  k_{a_j}(a_i) = |a_j(a_i)| / |U|   (15)

The cardinality of a_i(a_j) can be calculated as

  |a_i(a_j)| = Σ_{k=D+1}^{D+|U/a_j|} |a_i(e_k)|   (16)
where the lower approximation of attribute a_j with respect to attribute a_i is defined as

  a_i(a_j) = {x | x ∈ a_i(e_k), k = D + 1, ..., D + |U/a_j|}   (17)

where D = Σ_{l=1}^{j−1} |U/a_l|.

As another alternative to the rough set theory approaches, NSS takes the attribute with the highest total degree of dependency as the selected clustering attribute.
5. Proposed maximum attribute relative
Throughout this section, a pair (F, A) refers to multi-soft sets over the universe U representing a categorical-valued information system S = (U, A, V, f).
Definition 8. Let (F, A) be multi-soft sets over the universe U, where (F, a_1), ..., (F, a_|A|) ⊆ (F, A) and (F, a_i1), ..., (F, a_i|a_i|) ⊆ (F, a_i). The support of (F, a_ij) by (F, a_ik), denoted sup_(F, a_ik)(F, a_ij), is defined as

  sup_(F, a_ik)(F, a_ij) = |(F, a_ij) ∩ (F, a_ik)| / |(F, a_ik)|   (18)
Definition 9. The maximum support is the summation of all supports with value equal to 1. For each soft set (F, a_ij), the maximum support, max sup_(F, a_ij), is defined as

  max sup_(F, a_ij) = Σ (sup_(F, a_ik)(F, a_ij) = 1)   (19)
Definition 10. The minimum support is the summation of all supports with value less than 1. For each soft set (F, a_ij), the minimum support, denoted min sup_(F, a_ij), is defined as

  min sup_(F, a_ij) = Σ (sup_(F, a_ik)(F, a_ij) ≠ 1)   (20)
Definition 11. The mode refers to the value that occurs most frequently in the probability distribution.
Definition 12. The max refers to the highest value in the probability distribution.
Proposition 2. If Mode(max(max sup_(F, a_ij), ..., max sup_(F, a_mn))) = 1, then (F, a_ij) is a clustering attribute.
Proof. Let (F, a_i) and (F, a_j) be two soft sets over the universe U. If a_i = a_j, then the support value of (F, a_i) is equal to 1; therefore it is said that a_i is relative to a_j, and a_i can be used to describe a_j and vice versa. If a_i ≠ a_j, then there exists (F, a_k), where a_i = a_k and a_j ≠ a_k, such that the support value of (F, a_i) is greater than 1. Based on Definition 5, it is clear that (F, a_i) is selected as a clustering attribute. □
Corollary 1. If Mode(max(max sup_(F, a_ij), ..., max sup_(F, a_mn))) > 1, then max(min sup_(F, a_ij), ..., min sup_(F, a_mn)) is a clustering attribute.
Proof. The proof is clear from Definition 10 and Proposition 2. h
5.1. The complexity
Suppose that in an information system there are n objects and m attributes, and l is the maximum number of distinct values of each attribute. The computational cost to determine the elementary sets of all attributes is nm. The proposed technique needs ml(ml − 1) steps to determine the support for each category. Thus, the computational complexity of the proposed technique is O(ml(ml − 1) + nm + 1).
5.2. Algorithm
The algorithm for the proposed technique is described below.
Algorithm 1. MAR Algorithm
Input: Categorical-valued data-set
Output: A clustering attribute
Begin
  1. Build the multi-soft set approximation
  2. Calculate Support, MaxSup and MinSup
     for i = all categories
       for j = all categories
         Intersection = Data(i) And Data(j)
         Sup(i,j) = Intersection / Data(j)
         if Sup(i,j) = 1 then
           MaxSup(i) = MaxSup(i) + Sup(i,j)
         else
           MinSup(i) = MinSup(i) + Sup(i,j)
         end
       end
     end
  3. Find the clustering attribute
     if Mode(Max(MaxSup(Data(1)..Data(n)))) = 1 then
       ClusteringAttribute = Max(MaxSup(Data(1)..Data(n)))
     else
       ClusteringAttribute = Max(MinSup(Data(1)..Data(n)))
     end
End
Example 2. In this part, an example of implementing the proposed technique is presented. The dataset in Table 2 is used as input. It contains 8 objects with 7 categorical-valued attributes: Degree, English, Exp, IT, Math, Prog, and Stat, where the first two attributes, i.e. Degree and English, have 3 categorical values and the rest have 2. Following Algorithm 1, the technique begins by transforming the input dataset into multi-soft sets, as shown in Fig. 1. Secondly, the support value of each soft set by every other soft set is acquired using the formula defined in Definition 8. For example, the support value of the soft set (F, ExpMedium) by (F, MathMedium) is calculated as follows:
  sup_(F, MathMedium)(F, ExpMedium) = |{1,2,3,4,5,6} ∩ {5,6,7,8}| / |{5,6,7,8}| = 2/4 = 0.5
As a result, the overall support values for the case (F, ExpMedium) are 1.0, 1.0, 0.0, 1.0, 0.8333, 0.0, 1.0, 0.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 0.75, and 0.75. The support values of all soft sets are summarized in Table 3. Once the support values have been obtained, the maximum and minimum support values for each soft set can be computed directly using Definitions 9 and 10, respectively. Then the support index for each soft set is computed. Lastly,
Table 3
Support values from the data set in Table 2.

Soft set            Support values
(F, DegreePhD)      1, 0, 0, 1, 0.1667, 0, 0.3333, 0, 0.5, 0, 0.5, 0, 0.5, 0, 0.5, 0
(F, DegreeMSc)      0, 1, 0, 0, 0.6667, 0, 0.6667, 0, 0, 1, 0.5, 0.5, 0.5, 0.5, 0.25, 0.75
(F, DegreeBSc)      0, 0, 1, 0, 0.1667, 1, 0, 1, 0.5, 0, 0, 0.5, 0, 0.5, 0.25, 0.25
(F, EnglishGood)    0.5, 0, 0, 1, 0, 0, 0.1667, 0, 0.25, 0, 0.25, 0, 0.25, 0, 0.25, 0
(F, EnglishMedium)  0.5, 1, 0.5, 0, 1, 0, 0.8333, 0.5, 0.5, 1, 0.75, 0.75, 0.75, 0.75, 0.5, 1
(F, EnglishBad)     0, 0, 0.5, 0, 0, 1, 0, 0.5, 0.25, 0, 0, 0.25, 0, 0.25, 0.25, 0
(F, ExpGood)        1, 1, 0, 1, 0.8333, 0, 1, 0, 0.5, 1, 1, 0.5, 1, 0.5, 0.75, 0.75
(F, ExpMedium)      0, 0, 1, 0, 0.1667, 1, 0, 1, 0.5, 0, 0, 0.5, 0, 0.5, 0.25, 0.25
(F, ITGood)         1, 0, 1, 1, 0.3333, 1, 0.3333, 1, 1, 0, 0.5, 0.5, 0.5, 0.5, 0.75, 0.25
(F, ITMedium)       0, 1, 0, 0, 0.6667, 0, 0.6667, 0, 0, 1, 0.5, 0.5, 0.5, 0.5, 0.25, 0.75
(F, MathGood)       1, 0.5, 0, 1, 0.5, 0, 0.6667, 0, 0.5, 0.5, 1, 0, 1, 0, 0.75, 0.25
(F, MathMedium)     0, 0.5, 1, 0, 0.5, 1, 0.3333, 1, 0.5, 0.5, 0, 1, 0, 1, 0.25, 0.75
(F, ProgGood)       1, 0.5, 0, 1, 0.5, 0, 0.6667, 0, 0.5, 0.5, 1, 0, 1, 0, 0.75, 0.25
(F, ProgMedium)     0, 0.5, 1, 0, 0.5, 1, 0.3333, 1, 0.5, 0.5, 0, 1, 0, 1, 0.25, 0.75
(F, StatGood)       1, 0.25, 0.5, 1, 0.3333, 1, 0.5, 0.5, 0.75, 0.25, 0.75, 0.25, 0.75, 0.25, 1, 0
(F, StatMedium)     0, 0.75, 0.5, 0, 0.6667, 0, 0.5, 0.5, 0.25, 0.75, 0.25, 0.75, 0.25, 0.75, 0, 1
Table 4
Overall MaxSup and MinSup from the above example.

Soft set             MaxSup  MinSup   Categorical rank  Attribute rank
(F, DegreePhD)       2       2.5000   10                6
(F, DegreeMSc)       2       4.3340   8
(F, DegreeBSc)       3       2.1667   7
(F, EnglishGood)     1       1.6667   13                4
(F, EnglishMedium)   4       6.3333   4
(F, EnglishBad)      1       2.0000   12
(F, ExpGood)         3       2.1667   7                 1
(F, ExpMedium)       7       3.8333   1
(F, ITGood)          6       3.6666   2                 2
(F, ITMedium)        2       4.3334   9
(F, MathGood)        4       3.6667   6                 3
(F, MathMedium)      5       3.3333   3
(F, ProgGood)        4       3.6667   6                 3
(F, ProgMedium)      5       3.3333   3
(F, StatGood)        4       5.0833   5                 5
(F, StatMedium)      1       5.9167   11
Table 5
Total roughness and TR choice.

Attribute  Degree   English  Exp     IT      Math  Prog  Stat    TR
Degree     –        0.0953   0.3333  0.3333  0     0     0       0.127
English    0.1667   –        0       0.1667  0     0     0.1667  0.0833
Exp        1        0.1429   –       0.25    0.25  0.25  0       0.3155
IT         1        0.1250   0.1250  –       0     0     0       0.2083
Math       0.3333   0.1429   0.1250  0       –     1     0       0.2669
Prog       0.3333   0.1429   0.1250  0       1     –     0       0.2669
Stat       0.1250   0.1250   0       0       0     0     –       0.0417
Table 6
Minimum roughness and MMR choice.

Attribute  Degree   English  Exp     IT      Math    Prog    Stat    MMR result
Degree     –        0.9048   0.6667  0.6667  1       1       1       0.6667
English    0.8333   –        1       0.8333  1       1       0.8333  0.8333
Exp        0        0.8571   –       0.7500  0.7500  0.7500  1       0, 0.75
IT         0        0.875    0.875   –       1       1       1       0, 0.875
Math       0.6667   0.8571   0.875   1       –       1       1       0.6667
Prog       0.6667   0.8571   0.875   1       1       –       1       0.6667
Stat       0.875    0.875    1       1       1       1       –       0.875
Table 7
Degree of dependency and MDA choice.

Attribute  Degree  English  Exp    IT    Math  Prog  Stat  MDA (per iteration)
Degree     –       0.25     0.25   0.5   0     0     0     0.5
English    0.5     –        0.25   0.5   0     0     0.5   0.5
Exp        1       0.25     –      0.5   0.5   0.5   0     1, 0.5, 0.5
IT         1       0.25     0.25   –     0     0     0     1, 0.25
Math       0.5     0.25     0.25   0     –     1     0     1, 0.5, 0.25
Prog       0.5     0.25     0.25   0     1     –     0     1, 0.5, 0.25
Stat       0.25    0.25     0.25   0     0     0     –     0.25
Table 8
Degree of dependency of NSS and selection criteria.

Attribute  Degree  English  Exp    IT    Math  Prog  Stat  Total
Degree     –       0.25     0.25   1     0     0     0     1.5
English    0.5     –        0      0.5   0     0     0.5   1.5
Exp        1       0.25     –      1     0.5   0.5   0     3.25
IT         0.75    0.125    0      –     0     0     0     0.875
Math       0.5     0.25     0.25   0.5   –     1     0     2.5
Prog       0.5     0.25     0.25   0.5   1     –     0     2.5
Stat       0.25    0.25     0      0.5   0     0     –     1.0
6 R. Mamat et al. / Knowledge-Based Systems xxx (2013) xxx–xxx
KNOSYS 2544 No. of Pages 11, Model 5G
1 August 2013
a clustering attribute is selected using Definition 7. From Table 4, the attribute Exp is selected as the clustering attribute since it has the highest support index. It is followed by IT, Math and Prog, English, Stat, and Degree, as stated in the attribute rank column of Table 4.
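The ranking step above can be sketched as follows. Since Definitions 4-6 are not restated in this section, the formulas below are assumptions inferred from Table 4 (MaxSup as the count of full supports a soft set receives, MinSup as the sum of its remaining partial supports), not the paper's exact definitions:

```ruby
# Hedged sketch of the ranking step; formulas inferred from Table 4.
# MaxSup: how many of the 16 supports equal 1.0.
def max_sup(supports)
  supports.count(1.0)
end

# MinSup: sum of the supports that are strictly partial (not 1.0).
def min_sup(supports)
  supports.reject { |s| s == 1.0 }.sum
end

# Support values reported for (F, ExpMedium) in the running example.
exp_medium = [1.0, 1.0, 0.0, 1.0, 0.8333, 0.0, 1.0, 0.0,
              0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 0.75, 0.75]

max_sup(exp_medium)          # => 7, as in Table 4
min_sup(exp_medium).round(4) # => 3.8333, as in Table 4
```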
Please cite this article in press as: R. Mamat et al., MAR: Maximum Attribute Relative of soft set for clustering attribute selection, Knowl. Based Syst. (2013), http://dx.doi.org/10.1016/j.knosys.2013.05.009
As a comparison, the results of the implementations using TR, MMR, MDA, and NSS are given in Tables 5-8, respectively.
From Table 5, the highest total roughness is 0.3155, which belongs to the Exp attribute. Thus, the TR and MMR techniques also select the Exp attribute as the clustering attribute. This result is expected, since the MMR technique is the reversal form of the TR technique. However, the MMR technique needs to compare two attributes, i.e., Exp and IT, which share the minimum roughness value 0, before the Exp attribute is finally selected on the second minimum roughness value 0.75.
It is also clear that the MDA technique selects the Exp attribute as the clustering attribute. However, MDA needs 3 iterations before the attribute Exp is selected. From Table 7, after the first iteration, MDA selects Exp, IT, Math, and Prog as candidate attributes. It then retains Exp, Math, and Prog in the second iteration. The decision to select attribute Exp can only be made in the third iteration, where Exp has the highest MDA compared to Math and Prog. In contrast, NSS does not face this problem even though it uses the same approach: as Table 8 clearly shows, NSS can choose Exp as the clustering attribute directly, without further iterations.
6. Experimental results
This section explains and discusses the experimental results of the proposed technique. The main focus of the experiments is the performance of the proposed technique, with execution time used as the measured parameter. For comparison, the four techniques discussed earlier are used with fourteen datasets obtained from the benchmark UCI Machine Learning Repository [31] and a Supplier Dataset from [32], as described in Table 9 below.
All the selected data sets differ from one another in terms of size, either horizontally or vertically, in order to analyze
Table 9
Datasets.

No  Data set            Number of instances  Number of attributes
1   Acute Inflammation  120                  6
2   Balloons            16                   4
3   Bank Marketing      45,211               11
4   Car Evaluation      1728                 6
5   Chess               3196                 36
6   Cylinder Bands      512                  15
7   Lenses              24                   4
8   Lung Cancer         32                   56
9   Mushroom            8124                 22
10  Nursery             12,960               7
11  Solar Flare         1389                 10
12  Soybean (Large)     307                  35
13  Soybean (Small)     47                   35
14  Supplier Dataset    27                   7
15  Zoo                 101                  17
Fig. 2. The execution times (in seconds) of TR, MMR, MDA, NSS and MAR on the Balloons, Lenses, and Supplier datasets.
Fig. 3. The execution times (in seconds) of TR, MMR, MDA, NSS and MAR on the Acute Inflammation, Soybean (Small), and Zoo datasets.
the performance of the proposed technique when a high number of records as well as a high number of attributes is involved. Some datasets have been modified by removing instances with incomplete data and removing attributes that have only one categorical value. All techniques (TR, MMR, MDA, NSS and MAR) are implemented in the Ruby programming language, version 1.9, under the Windows 7 Home Edition operating system on an Intel i5 processor with 4 GB of memory.
The results of the experiments are summarized in five figures. Fig. 2 shows the execution time results between 0.001 s and
Fig. 4. The execution times (in seconds) of TR, MMR, MDA, NSS and MAR on the Car Evaluation, Lung Cancer, and Nursery datasets.
Fig. 5. The execution times (in seconds) of TR, MMR, MDA, NSS and MAR on the Cylinder Bands, Solar Flare, and Soybean (Large) datasets.
Fig. 6. The execution times (in seconds) of TR, MMR, MDA, NSS and MAR on the Bank Marketing, Chess, and Mushroom datasets.
Table 10
Summary of comparisons between TR, MMR, MDA, NSS and MAR in terms of execution time (in seconds).

No  Data set            TR       MMR      MDA      NSS      MAR      Improvement (%)
1   Acute Inflammation  0.0362   0.0367   0.0345   0.0300   0.0224   25.45
2   Balloons            0.0015   0.0017   0.0017   0.0016   0.0014   15.40
3   Bank Marketing      297.46   297.61   262.72   210.28   18.6556  91.12
4   Car Evaluation      0.2337   0.2388   0.1640   0.1093   0.1017   6.95
5   Chess               118.55   156.36   69.27    59.57    3.3663   94.34
6   Cylinder Bands      0.7900   0.8530   0.4160   0.3860   0.3860   0.00
7   Lenses              0.0035   0.0040   0.0010   0.0010   0.0010   0.00
8   Lung Cancer         0.2700   0.2500   0.1500   0.1364   0.1200   12.00
9   Mushroom            112.91   128.39   112.15   83.079   5.2396   93.69
10  Nursery             0.2685   0.2691   0.2395   0.2056   0.0362   82.40
11  Solar Flare         0.9905   0.9891   0.8351   0.7167   0.1646   77.02
12  Soybean (Large)     0.8597   0.9284   0.4525   0.4205   0.4064   3.35
13  Soybean (Small)     0.0575   0.0425   0.0234   0.0210   0.0208   0.00
14  Supplier Dataset    0.0028   0.0039   0.0025   0.0022   0.0022   0.00
15  Zoo                 0.0593   0.0406   0.0189   0.0178   0.0175   0.00

Average improvement: 33.45
Fig. 7. The scalability of TR, MMR, MDA, NSS and MAR to the number of instances.
Fig. 8. The scalability of TR, MMR, MDA, NSS and MAR to the number of instances.
Fig. 9. The scalability of TR, MMR, MDA, NSS and MAR to the number of attributes (datasets with 4-10 attributes).
Fig. 10. The scalability of TR, MMR, MDA, NSS and MAR to the number of attributes (datasets with 11-56 attributes).
0.004 s, involving the Balloons, Lenses, and Supplier datasets.
Fig. 3 illustrates the execution time results between 0.017 s and 0.06 s, involving the Acute Inflammation, Soybean (Small), and Zoo datasets.
The execution time results between 0.03 s and 0.3 s are shown in Fig. 4, involving the Car Evaluation, Lung Cancer, and Nursery datasets.
Fig. 5 illustrates the execution time results between 0.1 s and 1 s, involving the Cylinder Bands, Solar Flare, and Soybean (Large) datasets.
Fig. 6 shows the execution time results between 3 s and 300 s, involving the Bank Marketing, Chess, and Mushroom datasets.
Table 10 summarizes the comparison results between TR, MMR, MDA, NSS and MAR in terms of execution time. To compute the relative improvement of MAR with respect to TR, MMR, MDA, and NSS, i.e. how much faster MAR is than each of them, we use the following formula:
Fig. 12. The scalability of TR, MMR, MDA, NSS and MAR to the data size.
Improvement (%) = (|Other Algorithm - MAR| / Other Algorithm) × 100%
In summary, based on the experiments on fourteen UCI datasets and a supplier dataset, the proposed algorithm achieved a lower computational time than the three rough set-based algorithms, i.e. TR, MMR, and MDA, by up to 62%, 64%, and 40%, respectively, and than the soft set-based algorithm NSS by up to 33%. On average, MAR outperforms the four algorithms by 33.45%.
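Applied to a row of Table 10, the improvement computation looks as follows; the helper name is ours, and the denominator is the competing algorithm's time, which is the convention that reproduces the percentages reported in the Improvement column:

```ruby
# Relative improvement of MAR over another algorithm:
# |other - mar| / other * 100, with times in seconds.
def improvement(other, mar)
  (other - mar).abs / other * 100.0
end

# Chess row of Table 10: NSS takes 59.57 s, MAR takes 3.3663 s.
improvement(59.57, 3.3663).round(1) # => 94.3, matching Table 10's 94.34
```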
Furthermore, we examine two types of scalability of the TR, MMR, MDA, NSS and MAR algorithms on the fifteen data sets in Table 9. The first is their scalability to the number of instances, and the second is their scalability to the number of attributes.
Figs. 7 and 8 show the scalability of these algorithms to the number of instances over the fifteen datasets. It can be observed that the execution time of these algorithms tends to increase. However, as a contrasting example, the execution times to select a clustering attribute with the TR, MMR, MDA, NSS and MAR algorithms on the Chess data set, with 3196 instances, are 118.55, 156.36, 69.27, 59.57, and 3.3663 s, respectively. Meanwhile, on the Nursery data set, with 12,960 instances (about four times the instances of Chess), these algorithms take 0.2685, 0.2691, 0.2395, 0.2056, and 0.0362 s, respectively. Hence, in general, the execution times of the five algorithms increase non-linearly as the number of instances in the fifteen datasets increases.
Figs. 9 and 10 show the scalability of these algorithms to the number of attributes over the fifteen datasets. From Fig. 9, it can be observed that the execution time of these algorithms tends to increase linearly for datasets with attribute sizes ranging from 4 to 10.
However, when the number of attributes in the data sets ranges from 11 to 56, as shown in Fig. 10, the execution times of the five algorithms increase non-linearly. As an example, the execution times to select a clustering attribute with the TR, MMR, MDA, NSS and MAR algorithms on the Bank Marketing data set, with 11 attributes, are 297.46, 297.61, 262.72, 210.28, and 18.6556 s, respectively. Meanwhile, on the Lung Cancer data set, with 56 attributes (about five times the attributes of Bank Marketing), these algorithms take 0.2700, 0.2500, 0.1500, 0.1364, and 0.1200 s, respectively.
Fig. 11. The scalability of TR, MMR, MDA, NSS and MAR to the data size.

In reviewing the scalability of the TR, MMR, MDA, NSS and MAR algorithms on the fifteen data sets, the values vary across the numbers of instances and attributes. On the whole, however, these algorithms have a good scalability to data size. Let S = (U, A, V, f) be an information system as described in Section 2.1; the data size here is based on the complexity of computing S = (U, A, V, f). Since there are |U| instances and |A| attributes, the complexity is O(|U| × |A|). For the fifteen data sets in Table 9, Figs. 11 and 12 depict the scalability of these algorithms to the data size. They clearly show good scalability to the data size, which is highly desired in real data mining applications.
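As a small illustration of the data-size measure |U| × |A| used for Figs. 11 and 12, the instance and attribute counts below are taken from Table 9; the variable names are ours:

```ruby
# Data size measured as |U| * |A|: one pass over S = (U, A, V, f)
# touches every instance-attribute cell once, hence the O(|U| * |A|) cost.
datasets = {
  "Lenses"         => [24, 4],      # [|U|, |A|] from Table 9
  "Chess"          => [3196, 36],
  "Bank Marketing" => [45_211, 11],
}
sizes = datasets.map { |name, (u, a)| [name, u * a] }.to_h
sizes["Chess"] # => 115_056 cells
```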
7. Conclusion
A number of algorithms have been proposed to address the problem of clustering attribute selection, including TR, MMR, MDA, and NSS. However, the performance of these algorithms is still an issue due to their high computational complexity. This paper has proposed an alternative soft set-based technique called MAR (Maximum Attribute Relative) for clustering attribute selection in multi-valued information systems. The MAR technique exploits the relativity relationship among attributes in a multi-valued information system to determine a clustering attribute. Comparisons at the attribute level are computed first, before the representative of each attribute is compared to select a clustering attribute. Using this technique, the repeated looping faced by many rough set-based clustering attribute selection techniques can be avoided. The experimental results show that MAR achieves a lower execution time and good scalability, as in general its execution time increases linearly as the data size increases.
The main contribution of this work is the reduction of execution time, which is improved especially when large categorical data sets are involved, as compared to the three rough set-based techniques TR, MMR, and MDA as well as the soft set-based technique NSS. However, the proposed MAR technique can only be applied to multi-valued information systems, since multi-soft sets can only deal with such information systems. Therefore, MAR is suited to data sets whose attributes have been categorized.
In the next phase, the ability of the MAR technique to cluster categorical data will be tested in terms of clustering efficiency as well as clustering accuracy.
Acknowledgements
The work of Rabiei Mamat is supported by the Ministry of Higher Education, Malaysia. The work of Tutut Herawan is supported by a University of Malaya research grant.
References
[1] G. Guojun, M. Chaoqun, W. JianHong, Data clustering: theory, algorithms andapplications, ASA-SIAM Series on Statistic and Applied Probability (2000) 3–17.
[2] D. Jiang, C. Tang, A. Zhang, Cluster analysis for gene expression data: a survey,IEEE Transaction on Knowledge and Data Engineering 16 (11) (2004) 1370–1386.
[3] F. Giannotti, C. Gozzi, G. Manco, Clustering transactional data, in: Proceeding ofPractices of Knowledge Discovery in Database, 2002, pp. 175–187.
[4] R. Mathieu, J. Gibson, A methodology for large scale R&D planning based forcluster analysis, IEEE Transaction on Engineering Management 40 (3) (2004)283–292.
[5] S. Haimov et al., Classification of radar signatures by autoregressive modelfitting and cluster analysis, IEEE Transactions on Geo Science and RemoteSensing 8 (1) (1989) 606–610.
[6] S. Deng, Z. He, X. Hu, G-ANMI: a mutual information based genetic clusteringalgorithm for categorical data, Knowledge Based System 23 (2010) 144–149.
[7] K. Chen, L. Liu, K. Best, Critical clustering structures in categorical dataset,Knowledge Information System 20 (2009) 1–33.
[8] S. Wu et al., Cluster analysis of gene expression data based on self-splitting andmerging competitive learning, IEEE Transaction on Information Technology inBioMedicine 8 (1) (2004) 5–15.
[9] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS – clustering categorical datausing summaries, in: Proc. 5th ACM Sigmod Intl. Conf. on Knowledge Discoveryin Databases, 1999, pp. 73–83.
[10] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approachbased on dynamical systems, The Very Large Data Bases Journal 8 (3–4) (2000)222–236.
[11] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for largedatabases, in: Proc. of the ACM SIGMOD Conference on Management of Data,1998, pp. 73–84.
[12] M.J. Zaki et al., CLICKS: An effective algorithm for mining subspace clusters incategorical datasets, Data and Knowledge Discovery 60 (1) (2007) 51–70.
[13] Z. Pawlak, Rough sets, International Journal of Computer and InformationScience 11 (1982) 341–356.
[14] Z. Pawlak, Rough Sets – Theoretical Aspect of Reasoning about Data, KluwerAcademic Publisher, Boston, 1991.
[15] Z. Pawlak, A. Skowron, Rudiments of rough sets, Information Sciences 177 (1) (2007) 3–27.
[16] D. Molodtsov, Soft set theory – first results, Computer and Mathematics withApplications 37 (4/5) (1999) 19–31.
[17] P.K. Maji, A.R. Roy, R. Biswas, An application of soft sets in a decision makingproblem, Computer and Mathematics with Applications 44 (2002) 1077–1083.
[18] D. Chen et al., The parameterization of soft sets and its application, Computerand Mathematics with Applications 49 (5-6) (2005) 757–763.
[19] Z. Kong et al., The normal parameter reductions of soft sets and its algorithm,Computer and Mathematics with Applications 56 (2008) 3029–3037.
[20] M.I. Ali et al., On some new operation in soft sets theory, Computer andMathematics with Applications 57 (9) (2009) 1547–1553.
[21] T. Herawan, M.M. Deris, Soft decision making for patients suspected influenza,in: ICCSA (Lecture Notes in Computer Science), Springer-Verlag, 2010, pp. 405–418.
[22] T. Herawan, M.M. Deris, Soft set theory for association rules mining,Knowledge Based Systems 24 (1) (2011) 186–195.
[23] T. Herawan, M.M. Deris, On multi-soft sets construction in informationsystems, in: ICIC (Lecture Notes in Computer Science), Springer-Verlag, 2009,pp. 101–110.
[24] T. Herawan, M.M. Deris, A direct proof of every rough sets is a soft sets, in:Third Asia International Conference on Modelling and Simulation (AMS 2009),IEEE Press, 2009, pp. 119–124.
[25] P.K. Maji, R. Biswas, A.R. Roy, Soft set theory, Computer and Mathematics withApplications 45 (2003) 555–562.
[26] L.J. Mazlack, et al., A rough sets approach in choosing partitioning attributes,in: Proceeding of ICSA 13th. International Conference, CAINE-2000, 2000, pp.1–6.
[27] D. Parmar, T. Wu, J. Blackhurst, MMR: an algorithm for clustering categoricaldata using rough set theory, Data and Knowledge Discovery 63 (3) (2007) 879–893.
[28] Y.Y. Yao, Constructive and algebraic methods of the theory of rough sets,Information Science 109 (1–4) (1998) 21–47.
[29] T. Herawan, M.M. Deris, J.H. Abawajy, A rough set approach for selecting clustering attribute, Knowledge-Based Systems 23 (3) (2010) 220–231.
[30] H. Qin, X. Ma, J.M. Zain, T. Herawan, A novel soft set approach in selectingclustering attribute, Knowledge-Based Systems 36 (2012) 139–145.
[31] UCI Repository of Machine Learning Databases. <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[32] D. Parmar, T. Wu, T. Callarman, J. Fowler, P. Wolfe, A clustering algorithm forsupplier base management, International Journal of Production Research 48(13) (2010) 3803–3821.