This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/authorsrights



Computer Methods and Programs in Biomedicine 112 (2013) 135–145

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Privacy-preserving Kruskal–Wallis test

Suxin Guo a,∗, Sheng Zhong b, Aidong Zhang a

a Department of Computer Science and Engineering, SUNY at Buffalo, United States
b State Key Laboratory for Novel Software Technology, Nanjing University, China

Article info

Article history:

Received 4 January 2013

Received in revised form 17 May 2013

Accepted 28 May 2013

Keywords:

Data security

Statistical test

Kruskal–Wallis test

Abstract

Statistical tests are powerful tools for data analysis. The Kruskal–Wallis test is a non-parametric statistical test that evaluates whether two or more samples are drawn from the same distribution. It is commonly used in various areas. But sometimes, the use of the method is impeded by privacy issues raised in fields such as biomedical research and clinical data analysis because of the confidential information contained in the data. In this work, we give a privacy-preserving solution for the Kruskal–Wallis test which enables two or more parties to coordinately perform the test on the union of their data without compromising their data privacy. To the best of our knowledge, this is the first work that solves the privacy issues in the use of the Kruskal–Wallis test on distributed data.

© 2013 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Statistical hypothesis tests are very widely used for data analysis. Some popular statistical tests include the t-test [1], ANOVA [2], the Kruskal–Wallis test [3], and the Wilcoxon rank sum test [4]. Although these four are different tests, they serve the same goal, which is to find out whether the samples come from the same population. The t-test and ANOVA are parametric tests and assume that the data are normally distributed. The non-parametric equivalents of these two tests are the Wilcoxon rank sum test, which is also known as the Mann-Whitney U test [5], and the Kruskal–Wallis test, respectively. They do not assume the data to be normally distributed. The t-test can only compare two samples, and ANOVA extends it to multiple samples. Similarly, the Kruskal–Wallis test is a generalization of the Wilcoxon rank sum test from two samples to multiple samples.

As stated above, the four tests do similar things under different assumptions. The non-parametric tests perform better when the data is not normally distributed, and are especially suitable when the data size is small (<25 per sample group) [6]. Although the Kruskal–Wallis test is a helpful tool in many areas, its use is sometimes impeded by privacy concerns due to the confidential information in the data, especially in clinical and biomedical research.

∗ Corresponding author. Tel.: +1 7165647706.

For example, suppose some hospitals conducted a study and tested the INR (International Normalized Ratio) values of their patients, so that each hospital holds a set of INR values. The hospitals want to perform the Kruskal–Wallis test to check whether their values follow the same trend. In this case, the set of INR values of each hospital is treated as a sample. To conduct the Kruskal–Wallis test, all samples should be known, which means that the hospitals have to share their data with each other. The problem is that it might be improper for the hospitals to share their samples because the data contains private information of patients. Currently there is no method that enables the Kruskal–Wallis test to be conducted on such distributed data with privacy concerns.

0169-2607/$ – see front matter © 2013 Elsevier Ireland Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.cmpb.2013.05.023


To solve this problem, we propose a privacy-preserving algorithm that allows the Kruskal–Wallis test to be applied on samples distributed among different parties without revealing each party's private information to others. Due to the similarity among non-parametric tests, our method can also help the design of privacy-preserving solutions for other non-parametric tests. For example, the Wilcoxon rank sum test and the Kruskal–Wallis test are used in the situations of two samples and two or more samples, respectively, and are essentially the same in the two-sample case [3]. So our algorithm also solves the privacy issue of the Wilcoxon rank sum test to some extent.

The rest of this paper is organized as follows: In Section 2, we present the related work. Section 3 provides the technical preliminaries, including the background knowledge about the Kruskal–Wallis test and the cryptographic tools we need. We propose the basic algorithm and the complete algorithm in Sections 4 and 5, respectively. The basic algorithm shows the procedure of conducting the Kruskal–Wallis test securely when there is no tie in the data. The complete algorithm follows the basic algorithm and takes the existence of ties into consideration. In Section 6, we present the experimental results and finally, Section 7 concludes the paper.

2. Related work

In recent years, due to the increasing awareness of privacy problems, many data analysis methods have been enhanced to be privacy-preserving, including many popular data mining and machine learning algorithms. Most of these approaches can be divided into two categories. Approaches in the first category protect data privacy with data perturbation techniques, such as randomization [7,8], rotation [9] and resampling [10]. Since the original data is changed, these approaches usually lose some accuracy. The methods in the second category are generally based on Secure Multiparty Computation (SMC) and apply cryptographic techniques to protect data during the computations [11,12]. Such methods usually cause no accuracy loss but have higher computational cost. In our case, since the Kruskal–Wallis test is often used on small data sets, we choose the second way, which is to protect privacy with cryptographic tools. It enables us to achieve higher accuracy with an affordable computational cost.

In the cryptographic category, some SMC tools are very commonly used, such as secure sum [13], secure comparison [14,15], secure division [16], secure scalar product [13,16,17], secure matrix multiplication [18–20], and secure set operations [13].

Many data mining and machine learning algorithms have been extended with privacy solutions, such as decision tree classification [11,21], k-means clustering [22,23], and gradient descent methods [24], but only a few works have studied the privacy issues in statistical tests. [25] gives a privacy-preserving algorithm to compare survival curves with the logrank test. [26] presents a privacy-preserving solution to perform the permutation test securely on distributed data. No existing work studies the privacy issues of the Kruskal–Wallis test on distributed data. To the best of our knowledge, our work is the first one.

3. Technical preliminaries

3.1. The Kruskal–Wallis test

We first review the Kruskal–Wallis test in this section. The test, as proposed by Kruskal and Wallis [3], evaluates whether two or more samples are from the same distribution. The null hypothesis is that all the samples come from the same distribution.

Suppose we have k samples, each of which contains a set of values. To perform the Kruskal–Wallis test, we first rank all the values together without considering which sample the values belong to, then compute the sum of the ranks of the values within every sample, so that each sample has its sum of ranks. If there is no tie among the values, the test statistic is:

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1), \tag{1}$$

where N is the total number of values in all samples, $n_i$ is the number of values contained in the ith sample, and $R_i$ is the sum of ranks in the ith sample.

After the calculation of H, we compare it to a value $\chi^2_{\alpha:k-1}$, which can be found in a table of the chi-squared probability distribution with k − 1 degrees of freedom and α as the desired significance. If $H \ge \chi^2_{\alpha:k-1}$, the hypothesis is rejected. Otherwise, the hypothesis is accepted.

If there are ties in the values, the calculation of the test statistic should be changed slightly. First, when ranking all the values, each group of tied values is given the average of the ranks that these tied values would have received without ties. For example, suppose we have values {1, 3, 3, 5} with one tie of two “3”s. Without considering the tie, their ranks would be 1, 2, 3, 4, respectively. After we change the ranks of the tied values to their average, the ranks become 1, 2.5, 2.5, 4. Then we can compute H with these new ranks.

Besides the adjustment of ranks, we also need to divide H by:

$$C = 1 - \frac{\sum_{i=1}^{g} (t_i^3 - t_i)}{N^3 - N}, \tag{2}$$

where g is the number of groups of tied values, and $t_i$ is the number of tied values in the ith group. For the above example {1, 3, 3, 5}, we have only 1 group of 2 tied values, so g = 1 and $t_1 = 2$.

To sum up, when ties exist we need to adjust the ranks of the tied values, and the test statistic is:

$$H_c = \frac{H}{C}. \tag{3}$$

In fact, Eq. (3) is the general form that holds whether or not there are ties. If there is no tie, C = 1 and thus $H_c = H$.
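As a plaintext reference point (with no privacy protection at all), Eqs. (1)–(3) can be sketched as follows. The function names `average_ranks` and `kruskal_wallis` are illustrative choices rather than part of the original work, and exact fractions are used only to keep the worked numbers free of rounding.

```python
from fractions import Fraction

def average_ranks(values):
    """Rank all values together; tied values share the average of their ranks."""
    order = sorted(values)
    rank_of = {}
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j] == order[i]:
            j += 1
        # 1-based positions i+1 .. j are tied; their average is (i + 1 + j) / 2
        rank_of[order[i]] = Fraction(i + 1 + j, 2)
        i = j
    return rank_of

def kruskal_wallis(samples):
    """Return (H, C, Hc) as defined in Eqs. (1), (2) and (3)."""
    pooled = [v for s in samples for v in s]
    n_total = len(pooled)
    rank_of = average_ranks(pooled)
    # Eq. (1): 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    h = Fraction(12, n_total * (n_total + 1)) * sum(
        sum(rank_of[v] for v in s) ** 2 / len(s) for s in samples
    ) - 3 * (n_total + 1)
    # Eq. (2): one term per group of equal values (groups of size 1 contribute 0)
    tie_sizes = [pooled.count(v) for v in set(pooled)]
    c = 1 - Fraction(sum(t ** 3 - t for t in tie_sizes), n_total ** 3 - n_total)
    return h, c, h / c
```

For the example values {1, 3, 3, 5}, split here (as an assumption for illustration) into the samples {1, 3} and {3, 5}, the tied ranks are 2.5 and the correction factor is C = 1 − 6/60 = 0.9.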


3.2. Privacy protection of the Kruskal–Wallis test

As in the hospital example mentioned in the introduction, we assume that each party has a sample and that the parties hope to jointly conduct a Kruskal–Wallis test to find out whether their samples follow the same distribution without revealing their data to others. Our solution is based on the semi-honest model, which is widely used in the cryptographic category of privacy-preserving methods [27,11,28,29,24,13,16,30]. In this model, all parties strictly follow the protocol, but may attempt to derive the private information of other parties from the intermediate results they get during the execution of the protocol.

3.3. Cryptographic tools

3.3.1. Homomorphic cryptographic scheme

An additive homomorphic asymmetric cryptographic system is used to encrypt and decrypt the data in our work. A cryptographic scheme that encrypts an integer x as E(x) is additive homomorphic if there are operators ⊕ and ⊗ such that for any two integers $x_1$, $x_2$ and a constant a, we have

$$E(x_1 + x_2) = E(x_1) \oplus E(x_2),$$

$$E(a \times x_1) = a \otimes E(x_1).$$

This means that, with an additive homomorphic cryptographic system, we can compute the encrypted sum of integers directly from their encryptions. We do not need to decrypt the integers and compute the sum.

In an asymmetric cryptographic system, we have a pair of keys: a public key for encryption and a private key for decryption.

3.3.2. ElGamal cryptographic system

There are several additive homomorphic cryptographic schemes [30,31]. In this work, we apply a variant of the ElGamal scheme [32], which is semantically secure under the Diffie-Hellman assumption [33].

The ElGamal cryptographic system is a multiplicative homomorphic asymmetric cryptographic system. With this system, the encryption of a cleartext m is the pair:

$$E(m) = (m \times y^r,\; g^r),$$

where g is a generator, x is the private key, y is the public key with $y = g^x$, and r is a random integer.

We call the first part of the pair $c_1$ and the second part $c_2$, so $c_1 = m \times y^r$ and $c_2 = g^r$. To decrypt E(m), we compute $s = c_2^x = g^{rx} = g^{xr} = y^r$. Then we compute $c_1 \times s^{-1} = m \times y^r \times y^{-r}$ and get the cleartext m.

In the variant of the ElGamal scheme we use, the cleartext m is encrypted as:

$$E(m) = (g^m \times y^r,\; g^r).$$

The only difference between the original ElGamal scheme and this variant is that m in the first part is changed to $g^m$. With this change, the variant is an additive homomorphic cryptosystem such that:

$$E(x_1 + x_2) = E(x_1) \times E(x_2),$$

$$E(a \times x_1) = E(x_1)^a.$$

To decrypt E(m), we follow the same procedure as in the original ElGamal algorithm. But this time, after the above calculations, we obtain $g^m$ instead of m. To get m from $g^m$, we need to perform an exhaustive search, which is to try every possible m and look for the one that matches $g^m$. Please note that this exhaustive search is limited to a small range of possible plaintexts only, so the time needed is reasonable.
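The additive variant can be illustrated with a deliberately tiny toy instance. Everything specific here is an assumption made for illustration and is not taken from the original work: the group parameters (P = 2039, Q = 1019, G = 4), the function names, and the use of a single unshared private key. Parameters of this size offer no security; the sketch only shows the homomorphic identities and the exhaustive-search decryption.

```python
import secrets

# Toy parameters (assumed for illustration): P = 2*Q + 1 with P, Q prime,
# and G = 4 generates the subgroup of order Q modulo P.
P, Q, G = 2039, 1019, 4

def keygen():
    x = secrets.randbelow(Q - 1) + 1          # private key x
    return x, pow(G, x, P)                    # public key y = g^x

def encrypt(m, y):
    r = secrets.randbelow(Q - 1) + 1          # fresh random r per ciphertext
    return (pow(G, m, P) * pow(y, r, P) % P, pow(G, r, P))   # (g^m * y^r, g^r)

def add(ct1, ct2):
    """E(x1 + x2) = E(x1) * E(x2), multiplying componentwise."""
    return (ct1[0] * ct2[0] % P, ct1[1] * ct2[1] % P)

def scale(ct, a):
    """E(a * x1) = E(x1)^a, exponentiating componentwise."""
    return (pow(ct[0], a, P), pow(ct[1], a, P))

def decrypt(ct, x, max_plain=Q):
    c1, c2 = ct
    s = pow(c2, x, P)                         # s = c2^x = y^r
    gm = c1 * pow(s, -1, P) % P               # g^m (modular inverse needs Python 3.8+)
    for m in range(max_plain):                # exhaustive search over a small range
        if pow(G, m, P) == gm:
            return m
    raise ValueError("plaintext outside the searched range")
```

In this toy group the plaintext is only determined modulo Q, so sums and products must stay below 1019 for the search to return the intended integer.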

In our work, the private key is shared by all the parties and no party knows the complete private key. The parties need to coordinate with each other to do the decryptions, and the ciphertexts can be exposed to any party, because no party can decrypt them without the help of the others.

The private key is shared in this way: Suppose there are two parties, A and B. A holds one part of the private key, $x_1$, and B holds the other part, $x_2$, such that $x_1 + x_2 = x$, where x is the complete private key. In the decryption, we need to compute $s = c_2^{x} = c_2^{x_1 + x_2} = c_2^{x_1} \times c_2^{x_2}$. Party A computes $s_1 = c_2^{x_1}$ and party B computes $s_2 = c_2^{x_2}$, so that $s = s_1 \times s_2$. We need to compute $c_1 \times s^{-1} = c_1 \times (s_1 \times s_2)^{-1} = c_1 \times s_1^{-1} \times s_2^{-1}$. Party A computes $c_1 \times s_1^{-1}$ and sends it to party B. Then party B computes $c_1 \times s_1^{-1} \times s_2^{-1} = c_1 \times s^{-1} = g^m$ and sends it to A. In this way both parties can get the decrypted result. Here, since party B does its part of the decryption later, it gets the final result first. If it does not send the result to A, the decrypted result is known only to party B. The order of the parties can be changed, so if we need the result to be known to only one party, that party should do its part of the decryption last.
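The two-party decryption just described can be sketched in a toy setting. The parameters (P = 2039, Q = 1019, G = 4) and the names `share_key`, `strip_share` and `joint_decrypt` are assumptions for illustration; both parties are simulated in one process, so the "sending" between A and B is just a function call.

```python
import secrets

# Toy group (assumed for illustration): G has prime order Q modulo the prime P.
P, Q, G = 2039, 1019, 4

def share_key():
    """Return the shares x1 (party A), x2 (party B) and the joint public key."""
    x1, x2 = secrets.randbelow(Q), secrets.randbelow(Q)
    return x1, x2, pow(G, x1 + x2, P)         # y = g^(x1 + x2); x is never assembled

def encrypt(m, y):
    r = secrets.randbelow(Q)
    return (pow(G, m, P) * pow(y, r, P) % P, pow(G, r, P))

def strip_share(value, c2, share):
    """Multiply value by (c2^share)^-1, removing one party's factor s_i."""
    return value * pow(pow(c2, share, P), -1, P) % P

def joint_decrypt(ciphertext, x1, x2):
    c1, c2 = ciphertext
    from_a = strip_share(c1, c2, x1)          # party A: c1 * s1^-1, sent to B
    gm = strip_share(from_a, c2, x2)          # party B: c1 * s1^-1 * s2^-1 = g^m
    # B, acting last, is the first to see g^m and recovers m by exhaustive search.
    return next(m for m in range(Q) if pow(G, m, P) == gm)
```

Swapping the order of the two `strip_share` calls corresponds to letting party A act last, which is how the protocol chooses who learns the result.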

3.3.3. Secure comparison

We apply the secure comparison protocol proposed in [15] to compare two values from different parties securely. The inputs of this algorithm are two integers a and b, which are from different parties. The output is an encryption of 1 if a > b, or an encryption of 0 otherwise.

The basic idea of the secure comparison algorithm is as follows.

Let the binary representations of a and b be $a_l, \ldots, a_1$ and $b_l, \ldots, b_1$, where $a_1$ and $b_1$ are the least significant bits. If a > b, there is a “pivot bit” i such that $b_i - a_i + 1 = 0$ and $a_j \text{ XOR } b_j = a_j + b_j - 2 a_j b_j = 0$ for every $i < j \le l$. This method applies homomorphic encryption to check whether the pivot bit exists.

This method can find out whether a > b, but it cannot directly find out whether a ≥ b. So when we want to know whether a ≥ b, we compare 2a + 1 and 2b instead of a and b. Since both a and b are integers, 2a + 1 > 2b holds exactly when a ≥ b.
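The pivot-bit condition can be checked in plaintext to see why it characterizes a > b. The sketch below is not the protocol of [15], which evaluates these expressions on encrypted bits; it only mirrors the arithmetic. The helper names and the `nbits` parameter are assumptions for illustration.

```python
def bit(x, i):
    """Return bit i of x, where i = 0 is the least significant bit."""
    return (x >> i) & 1

def greater_than(a, b, nbits):
    """Return 1 if a > b, else 0, by searching for the pivot bit."""
    for i in range(nbits):
        pivot = bit(b, i) - bit(a, i) + 1            # 0 only when a_i = 1 and b_i = 0
        higher = sum(bit(a, j) + bit(b, j) - 2 * bit(a, j) * bit(b, j)
                     for j in range(i + 1, nbits))   # 0 only when all higher bits agree
        if pivot == 0 and higher == 0:
            return 1
    return 0

def greater_or_equal(a, b, nbits):
    """a >= b is tested as 2a + 1 > 2b, which needs one extra bit."""
    return greater_than(2 * a + 1, 2 * b, nbits + 1)
```

Both expressions are linear or quadratic in the bits, which is what makes them suitable for evaluation under an additive homomorphic scheme when one party knows its own bits in plaintext.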

4. The basic algorithm of privacy-preserving Kruskal–Wallis test

In this part, we present the basic algorithm for computing the H statistic of the Kruskal–Wallis test securely without considering the existence of ties. The complete algorithm that also deals with ties will be discussed in the next section. To make the presentation clear, we first give the algorithm for performing the test between two parties, then extend it to the multiparty case.


Suppose there are two parties, A and B. Party A has sample $S_1$, which contains $n_1$ values, and party B has sample $S_2$, which contains $n_2$ values. The total number of values is $N = n_1 + n_2$. The basic structure of the algorithm is as follows:

1. For each value in each party, count how many values in its own party (including itself) are smaller than or equal to it. Encrypt these counts.

2. For each value in each party, compare it with all the values in the other party using the secure comparison algorithm. Then, by adding the comparison results up, count how many values in the other party are smaller than or equal to it. Since the results of the secure comparison algorithm are in cipher text, these counts are also in cipher text.

3. For each value in each party, add the above two counts securely to get the total number of values in both parties that are smaller than or equal to it, which is its rank in cipher text. Then for each party, add all the encrypted ranks of its values; this is the encrypted rank sum of this party. Call the rank sums of the two parties $R_1$ and $R_2$, respectively.

4. With the encrypted rank sums of both parties, compute the H statistic with Eq. (1). Here comes a problem: to calculate H, we need the squared rank sums of both parties, $R_1^2$ and $R_2^2$. Since we only have the encrypted rank sums of the two parties, $E(R_1)$ and $E(R_2)$, we have to compute $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. This is not easy because we are using an additive homomorphic system, which does not support the direct multiplication of two encrypted integers. So we need to develop an algorithm to solve it.

Let us explain each step in detail.

4.1. Secure computation of the rank sums

To compute the rank of one value, we just need to count how many values in both parties are smaller than or equal to it. For example, with values {5, 6, 7}, the rank of value 5 is 1, because only 1 value is smaller than or equal to it, which is itself (5 ≤ 5). The rank of 6 is 2 since there are 2 values smaller than or equal to it (5 ≤ 6 and 6 ≤ 6). Similarly, the rank of 7 is 3.

For each value in each party, counting how many values in its own party are smaller than or equal to it is simple: the party compares it with all of its own values in plaintext. Counting the number of smaller or equal values in the other party is not that straightforward. We also need to compare the value with all values in the other party, and these comparisons should be conducted with the secure comparison algorithm.

Suppose the values in party A are $a_1, a_2, \ldots, a_{n_1}$, and the values in party B are $b_1, b_2, \ldots, b_{n_2}$. For each value $a_i$ ($i = 1, 2, \ldots, n_1$), we need to compare it with every value in party B with the secure comparison protocol. After these $n_2$ secure comparisons, we have $n_2$ results, and each of them is an encryption of 0 or 1 (E(0) or E(1)). For each value $b_j$ ($j = 1, 2, \ldots, n_2$), the comparison between $a_i$ and $b_j$ is E(1) if $a_i \ge b_j$ and E(0) otherwise. Since the results are in cipher text, no party knows what they are. The sum of the $n_2$ results is the encrypted number of values in party B that are smaller than or equal to $a_i$. We call it $E(R_B(a_i))$.

The number of values in party A that are smaller than or equal to $a_i$ can be easily computed. It is named $R_A(a_i)$. We encrypt it and get $E(R_A(a_i))$. The sum of $R_A(a_i)$ and $R_B(a_i)$ is the rank of $a_i$, which is $R(a_i)$. The encryption of this rank, $E(R(a_i))$, can be computed from $E(R_A(a_i))$ and $E(R_B(a_i))$ with the additive homomorphic system that we utilize.

In this way, we can get the encryptions of the ranks of all values from both parties:

$$E(R(a_1)), E(R(a_2)), \ldots, E(R(a_{n_1})),$$

$$E(R(b_1)), E(R(b_2)), \ldots, E(R(b_{n_2})).$$

Then $E(R_1)$ and $E(R_2)$, which are the encryptions of the rank sums of party A and B, respectively, can be computed from them because $R_1 = R(a_1) + R(a_2) + \cdots + R(a_{n_1})$ and $R_2 = R(b_1) + R(b_2) + \cdots + R(b_{n_2})$.
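The counting view of ranks can be checked with a plaintext simulation. In the sketch below nothing is encrypted: the `local` count stands for $R_A(a_i)$, which a party computes on its own data, and the `remote` count stands for $R_B(a_i)$, which in the protocol would be the homomorphic sum of secure-comparison outputs. The function name is an illustrative assumption.

```python
def rank_sums_by_counting(sample_a, sample_b):
    """Return (R1, R2) for two tie-free samples, using only <= / >= counts."""
    def rank(v, own, other):
        local = sum(1 for u in own if u <= v)       # includes v itself
        remote = sum(1 for u in other if v >= u)    # one comparison per remote value
        return local + remote
    r1 = sum(rank(a, sample_a, sample_b) for a in sample_a)
    r2 = sum(rank(b, sample_b, sample_a) for b in sample_b)
    return r1, r2
```

For the values {5, 6, 7} used above, split as {5, 7} and {6}, the ranks are 1, 3 and 2, so the rank sums are 4 and 2. Without ties the two rank sums always add up to N(N + 1)/2.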

4.2. Secure computation of the squared rank sums

We need to compute $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. Since the additive homomorphic cryptosystem does not support the direct multiplication of two encrypted integers, here we present an algorithm to solve it.

To compute E(ab) from E(a) and E(b) that are known to both parties, first we need to make one of the integers additively shared by the two parties. For example, we make a additively shared by the two parties such that party A holds an integer $a_A$ and party B holds an integer $a_B$ with $a_A + a_B = a$. $a_A$ and $a_B$ can be obtained from E(a) in this way: Party A randomly generates an integer $a_A$ and computes $E(a_A)$. Then $E(a - a_A) = E(a_B)$ can be computed from E(a) and $E(a_A)$ by party A. A sends it to party B and the two parties coordinate with each other to decrypt $E(a_B)$. During the decryption, we make sure that the decryption result $a_B$ is only known to party B. This can be achieved with the cryptographic system that we use, as explained in Section 3.

After A gets $a_A$ and B gets $a_B$, the two parties A and B can compute $E(a_A \times b)$ and $E(a_B \times b)$, respectively. This can be done with the additive homomorphic system from $a_A$, $a_B$ and E(b) because $a_A$ and $a_B$ are both integers in plaintext.

What we want is $E(ab) = E((a_A + a_B) \times b) = E(a_A \times b + a_B \times b)$. Since $E(a_A \times b)$ is held by party A and $E(a_B \times b)$ is held by party B, the two parties should exchange their values so that both of them can compute the final result E(ab). But exchanging the values directly may cause privacy loss. For example, if party A gives $E(a_A \times b)$ to party B, since $E(a_A \times b) = E(b)^{a_A}$ with the variant of the ElGamal system we use, and E(b) is known to party B, party B can derive some information about $a_A$ from $E(a_A \times b)$. So before the two parties calculate $E(a_A \times b)$ and $E(a_B \times b)$ and exchange their values, they rerandomize their copies of E(b). With the rerandomizations, the random numbers r that are used in the encryptions are changed, so the encryptions are different from the original ones. To make the presentation clear, we call the rerandomized E(b) of party A and party B E′(b) and E″(b), respectively. Then parties A and B can calculate $E'(a_A \times b) = E'(b)^{a_A}$ and $E''(a_B \times b) = E''(b)^{a_B}$, respectively, and exchange them. Since the encryptions are changed, the parties cannot derive information from the value they get from each other. For example, although party B gets $E'(a_A \times b)$ from A, $E'(a_A \times b) = E'(b)^{a_A}$ and party B does not know E′(b) because it is the rerandomization done by party A. So B cannot derive $a_A$.

After the exchange, both parties hold $E'(a_A \times b)$ and $E''(a_B \times b)$, so each of them can compute $E(ab) = E(a_A \times b + a_B \times b)$ by itself. The rerandomizations do not affect the calculation of the encrypted sum. In this way, both parties can get E(ab) from E(a) and E(b). Algorithm 1 shows the main procedure of this encrypted multiplication.

Algorithm 1. Encrypted multiplication of two integers

Input. Encryptions of integers a and b, E(a) and E(b), that are known to both parties;
Output. The encryption of a × b, E(ab);

1: Party A generates a random integer $a_A$ and computes $E(a_A)$;
2: Party A computes $E(a - a_A)$ and sends it to party B;
3: The two parties coordinately decrypt $E(a - a_A)$ and only party B gets the result $a - a_A = a_B$;
4: Parties A and B rerandomize E(b) and get E′(b) and E″(b), respectively;
5: Parties A and B calculate $E'(a_A \times b)$ and $E''(a_B \times b)$, respectively, and exchange the two values;
6: Parties A and B compute $E(ab) = E(a_A \times b + a_B \times b)$ by themselves;
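The six steps can be traced in a single-process simulation over a toy group. Several things here are assumptions for illustration only: the parameters (P = 2039, Q = 1019, G = 4), all function names, rerandomization implemented as multiplication by a fresh encryption of 0, and the choice to let the shares $a_A$ and $a_B$ live modulo the group order Q so that $a - a_A$ never has to be a negative plaintext.

```python
import secrets

P, Q, G = 2039, 1019, 4                      # toy group: G has prime order Q mod P

def encrypt(m, y):
    r = secrets.randbelow(Q)
    return (pow(G, m % Q, P) * pow(y, r, P) % P, pow(G, r, P))

def add(c, d):                               # E(u + v) from E(u), E(v)
    return (c[0] * d[0] % P, c[1] * d[1] % P)

def scale(c, k):                             # E(k * u) from E(u) and plaintext k
    k %= Q
    return (pow(c[0], k, P), pow(c[1], k, P))

def rerandomize(c, y):                       # same plaintext, fresh randomness
    return add(c, encrypt(0, y))

def strip_share(value, c2, share):           # one party's part of a decryption
    return value * pow(pow(c2, share, P), -1, P) % P

def dlog(gm):                                # exhaustive search for m with g^m = gm
    return next(m for m in range(Q) if pow(G, m, P) == gm)

def encrypted_multiply(enc_a, enc_b, y, x1, x2):
    a_A = secrets.randbelow(Q)                          # step 1 (party A)
    enc_aB = add(enc_a, scale(encrypt(a_A, y), -1))     # step 2: E(a - a_A)
    # step 3: A strips its share first, B finishes, so only B learns a_B
    a_B = dlog(strip_share(strip_share(enc_aB[0], enc_aB[1], x1), enc_aB[1], x2))
    part_A = scale(rerandomize(enc_b, y), a_A)          # steps 4-5: E'(a_A * b)
    part_B = scale(rerandomize(enc_b, y), a_B)          # steps 4-5: E''(a_B * b)
    return add(part_A, part_B)                          # step 6: E(ab)
```

Because arithmetic in the exponent is modulo Q, the product ab must stay below 1019 in this toy setting for a later decryption to recover it.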

4.3. Secure computation of H

With Algorithm 1 we can get $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. Because we assume there are two parties, the H statistic is calculated as:

$$H = \frac{12}{N(N+1)}\left(\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2}\right) - 3(N+1),$$

where N, $n_1$ and $n_2$ are constants known to both parties. From $E(R_1^2)$ and $E(R_2^2)$, both parties can compute $E(R_1^2 \times n_2 + R_2^2 \times n_1)$. They then coordinately decrypt it and get $R_1^2 \times n_2 + R_2^2 \times n_1$. The final result is calculated as:

$$H = \frac{12}{N(N+1)\,n_1 n_2}\left(R_1^2 \times n_2 + R_2^2 \times n_1\right) - 3(N+1).$$

The reason why we compute $R_1^2 \times n_2 + R_2^2 \times n_1$ and then divide it by $n_1 n_2$, instead of computing $R_1^2/n_1 + R_2^2/n_2$ directly, is that the cryptographic system we use only supports operations on non-negative integers. To avoid decimal fractions in the encryptions, we compute $R_1^2 \times n_2 + R_2^2 \times n_1$ and apply the division after the decryption.
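The integer-only rearrangement can be verified with a short plaintext check. The function name is an illustrative assumption; `aggregate` plays the role of the single value that the parties decrypt.

```python
from fractions import Fraction

def h_from_integer_aggregate(r1, r2, n1, n2):
    """Compute H for two parties from the decrypted integer R1^2*n2 + R2^2*n1."""
    n = n1 + n2
    aggregate = r1 * r1 * n2 + r2 * r2 * n1        # a non-negative integer
    # The division by n1 * n2 happens only after decryption, in plaintext.
    return Fraction(12 * aggregate, n * (n + 1) * n1 * n2) - 3 * (n + 1)
```

For example, with rank sums 10 and 5 over samples of sizes 3 and 2, the decrypted aggregate is 100 × 2 + 25 × 3 = 275, and H = 12 × 275 / (5 × 6 × 6) − 18 = 1/3, which matches the direct evaluation 12/30 × (100/3 + 25/2) − 18.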

4.4. The summarized algorithm

The main steps of the algorithm are summarized in Algorithm 2.

Algorithm 2. The basic algorithm of privacy-preserving Kruskal–Wallis test

Input. Party A has sample $S_1$, which contains $n_1$ values, and party B has sample $S_2$, which contains $n_2$ values. The total number of values is $N = n_1 + n_2$;
Output. The statistic H;

1: for each value $a_i$ in party A do
2:   Calculate its encrypted rank $E(R(a_i))$;
3: end for
4: for each value $b_j$ in party B do
5:   Calculate its encrypted rank $E(R(b_j))$;
6: end for
7: Compute the encrypted rank sum of each party, $E(R_1)$ and $E(R_2)$, where $R_1 = \sum_{i=1}^{n_1} R(a_i)$ and $R_2 = \sum_{j=1}^{n_2} R(b_j)$;
8: Calculate $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$ with Algorithm 1;
9: Calculate $E(R_1^2 \times n_2 + R_2^2 \times n_1)$ and decrypt it;
10: Compute H from $R_1^2 \times n_2 + R_2^2 \times n_1$;

4.5. Extension to multiparty

The extension of the algorithm from two-party to multiparty is straightforward. For each value in each party, to get its rank in the two-party case, we count the number of values that are smaller than or equal to it in its own party and in the other party. To count the number in the other party, we need the secure comparison protocol. Similarly, in the multiparty case, we count the number of values that are smaller than or equal to it in its own party and in every other party with the help of the secure comparison protocol.

After the computation of encrypted ranks for every value in every party, the encrypted rank sums are calculated, just like in the two-party case. Then the encrypted squared rank sums $E(R_1^2), E(R_2^2), \ldots, E(R_k^2)$ can be computed with Algorithm 1. They are known to all the parties. Just as we compute $E(R_1^2 \times n_2 + R_2^2 \times n_1)$ when there are two parties, for k parties, $E(R_1^2 \times n_2 n_3 \cdots n_k + R_2^2 \times n_1 n_3 \cdots n_k + \cdots + R_k^2 \times n_1 n_2 \cdots n_{k-1})$ is computed. We decrypt it and divide the decrypted result by $n_1 n_2 \cdots n_k$ instead of $n_1 n_2$ as in the two-party case. Then the final result H is calculated.
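For k parties the same plaintext check generalizes as below; each squared rank sum is weighted by the product of all the other sample sizes. The function name is an illustrative assumption, and `math.prod` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import prod

def h_multiparty(rank_sums, sizes):
    """Compute H from the decrypted integer sum_i R_i^2 * prod_{j != i} n_j."""
    n = sum(sizes)
    all_sizes = prod(sizes)                                  # n_1 * n_2 * ... * n_k
    aggregate = sum(r * r * (all_sizes // n_i)               # exact: n_i divides the product
                    for r, n_i in zip(rank_sums, sizes))
    return Fraction(12 * aggregate, n * (n + 1) * all_sizes) - 3 * (n + 1)
```

As a usage example, three samples {1, 2}, {3, 4}, {5, 6} have rank sums 3, 7 and 11; the decrypted aggregate is (9 + 49 + 121) × 4 = 716 and H = 12 × 716 / (6 × 7 × 8) − 21 = 32/7.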

5. The complete algorithm of privacy-preserving Kruskal–Wallis test

In this section, we present the privacy-preserving Kruskal–Wallis test that takes ties into account.

5.1. Modifying the data to eliminate ties

Before we explain the complete algorithm, we give a simpler method to deal with tied values: modify the values slightly to eliminate the ties and then apply the basic algorithm to the modified data. Since the data is modified a little, this method causes a slight accuracy loss.


To eliminate ties between parties, we do the following: If there are two parties, multiply every value in the first party by 10 and then add 0 to it, and multiply every value in the second party by 10 and then add 1 to it. For example, suppose $a_i$ belongs to the first party and $b_j$ belongs to the second party. We set $a_i = a_i \times 10 + 0$ and $b_j = b_j \times 10 + 1$. In this way, the ties between the two parties are eliminated and the ranks of the other values are not affected.

If there are more than two parties, the data is modified similarly depending on the number of parties. For example, if there are ten parties, we still multiply every value in every party by 10 and add 0 to the values in the first party, 1 to the values in the second party, and so on, up to 9 for the values in the tenth party. If there are 100 parties, we multiply every value by 100 and add 0 to 99 to the values of the first to the 100th party, respectively.
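The cross-party modification can be sketched as follows, assuming integer-valued data. The function name is illustrative, and choosing the multiplier as the smallest power of ten that is at least the number of parties is an assumption that reproduces the two examples above (10 for two or ten parties, 100 for one hundred parties).

```python
def eliminate_cross_party_ties(samples):
    """Scale every value and append the 0-based party index as a tie-breaker."""
    base = 10
    while base < len(samples):               # 10 for up to 10 parties, 100 for up to 100
        base *= 10
    return [[v * base + idx for v in sample] for idx, sample in enumerate(samples)]
```

For instance, the samples {1, 3} and {3, 5} become {10, 30} and {31, 51}: the two tied 3s are now ordered by party, and the relative order of all other values is unchanged.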

To deal with the ties within a party, we do not need to modify the data. We can ignore these ties when calculating the ranks. For example, suppose one party has three values, {1, 1, 1}. With our algorithm, the ranks are calculated by counting the number of smaller or equal values. For these three values, the numbers of smaller or equal values in their own party are 3, 3 and 3. We change them to 1, 2 and 3, respectively. This can be done easily because every party knows the ties within its own data. After changing the local counts, the counts of smaller or equal values from other parties are added to get the ranks. The ranks do not contain any tie because both the ties within the local party and the ties between parties are disregarded. After these modifications, we can apply the basic algorithm that deals with data without ties.
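The adjustment of the local counts can be sketched as below. The function name is an illustrative assumption, and breaking a local tie by order of appearance is one possible convention; any fixed order within the tie gives the same rank sum.

```python
def local_counts_without_ties(sample):
    """Local 'smaller or equal' counts with ties inside the party broken in order."""
    seen = {}
    counts = []
    for v in sample:
        strictly_smaller = sum(1 for u in sample if u < v)
        seen[v] = seen.get(v, 0) + 1                 # position of v within its local tie
        counts.append(strictly_smaller + seen[v])
    return counts
```

For the party holding {1, 1, 1}, this turns the counts 3, 3, 3 into 1, 2, 3, as in the example above.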

5.2. The complete algorithm

Here we present the complete algorithm that works for data containing ties. As in the previous section, the algorithm is first proposed under the assumption that there are only two parties and then extended to the multiparty case.

As mentioned in Section 3, when there are ties in the data, the calculation of the statistic changes in two respects: the ranks of the tied values should be adjusted when computing H, and H should be divided by C. Both of them are discussed in detail below.

5.2.1. Adjustment of the ranks of tied values

The ranks of each group of tied values should be changed to the average of the ranks that these tied values would have received without ties. We use an example to show the basic idea behind this adjustment. Suppose there are values {1, 2, 3, 4, 4, 4, 4, 4} that are distributed in two samples held by two parties, respectively. Party A has sample $S_1$, which contains values {1, 2, 4, 4}, and party B has sample $S_2$, which contains values {3, 4, 4, 4}. Without considering the tie, the ranks of the values {1, 2, 3, 4, 4, 4, 4, 4} are 1, 2, 3, 4, 5, 6, 7, 8, respectively. The five “4”s are tied and their ranks are 4, 5, 6, 7, 8. The largest rank in this tie is 8 and the smallest rank is 4. The average of the ranks is 6, and it can be calculated by taking the average of the largest rank 8 and the smallest rank 4. This is because the ranks of the values in a tie form an arithmetic sequence, so the average of all values in the sequence is the same as the average of the smallest and the largest values.

After changing the ranks of the tied values to their average, the ranks should be 1, 2, 3, 6, 6, 6, 6, 6. In our algorithm, since we calculate the rank of each value by counting the values that are smaller than or equal to it, the ranks are 1, 2, 3, 8, 8, 8, 8, 8, because for each 4 there are 8 values smaller than or equal to it. So with our algorithm, the rank of each group of tied values is actually the largest rank in the tie. We need to add some steps to our algorithm to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6.

The basic idea is as follows. Since the rank of the values in each tie is the largest rank in the tie, we only need to get the smallest rank in the tie and take the average of the largest rank and the smallest rank. To get the smallest rank from the largest rank, we need to know the number of values in the tie. With the largest rank denoted l, the smallest rank denoted s, and the number of values in the tie denoted t, we have s = l − t + 1. In our example, the tie contains 5 values, with largest rank 8 and smallest rank 4, and indeed 8 − 5 + 1 = 4. So, to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6, we need to get the number of values in each tie, compute the smallest rank in each tie, and take the average of the largest and the smallest ranks.

We assume that each value is in a tie and calculate the number of values in each value’s tie. In our example, value 1 is in a tie that contains only 1 value, and so are values 2 and 3. Each value 4 is in a tie that contains 5 values. So for the values {1, 2, 3, 4, 4, 4, 4, 4}, the numbers of values in each value’s tie are 1, 1, 1, 5, 5, 5, 5, 5. Then, for each value, we compute the smallest rank in its tie with s = l − t + 1. For value 1, the smallest rank is 1 − 1 + 1 = 1. For value 2, the smallest rank is 2 − 1 + 1 = 2. For value 3, the smallest rank is 3 − 1 + 1 = 3. For each value 4, the smallest rank is 8 − 5 + 1 = 4. So the smallest ranks for the eight values are 1, 2, 3, 4, 4, 4, 4, 4. With the largest ranks 1, 2, 3, 8, 8, 8, 8, 8, we get the averaged ranks 1, 2, 3, 6, 6, 6, 6, 6. For values 1, 2 and 3, which are not tied, assuming that they are in ties containing 1 value does not affect their ranks. The reason for this assumption is that, although we show all the values, ranks and tie sizes together in cleartext to make the idea easier to understand, in the real setting they are encrypted or distributed and no party has complete information about them. So no party knows whether a value is in a tie or not. For example, party A has one value 1 and this value is not in a tie within party A. But A does not know whether party B has value 1, so A does not know whether value 1 is in a tie globally. Therefore all values are treated as being in a tie.

Having explained the basic idea of the rank adjustment, we now show the steps by which the two parties perform the adjustment securely.
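The cleartext version of this adjustment can be written in a few lines. The sketch below is for illustration only: it works on the pooled values in the clear, which no party can do in the real protocol, and the function name `adjusted_ranks` is hypothetical.

```python
def adjusted_ranks(values):
    """Average ranks computed from largest rank l and tie size t."""
    # l: number of values smaller than or equal to v (largest rank in v's tie)
    largest = [sum(1 for w in values if w <= v) for v in values]
    # t: number of values equal to v (every value is treated as being in a tie)
    tie_size = [sum(1 for w in values if w == v) for v in values]
    # s = l - t + 1: smallest rank in v's tie
    smallest = [l - t + 1 for l, t in zip(largest, tie_size)]
    # averaged rank = (s + l) / 2
    return [(s + l) / 2 for s, l in zip(smallest, largest)]


values = [1, 2, 3, 4, 4, 4, 4, 4]
print(adjusted_ranks(values))  # [1.0, 2.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0]
```

Untied values pass through unchanged because t = 1 gives s = l.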

We follow the basic algorithm in Section 4 to get the rank of each value in each party. Here the “ranks” are the numbers of smaller or equal values, which are the largest ranks of each tie. To count the smaller or equal values for value ai in party A, ai is compared with the values in both party A and party B. When comparing ai with the values in party A, we also count the number of values in party A that are equal to ai and name it TA(ai). As mentioned in Section 4, when comparing ai with every value in party B securely, each comparison result is an encryption of 0 or 1 such that if bj ≤ ai, the comparison result between bj and ai is E(1), and otherwise E(0). The sum of these results is the encrypted number of values in party B that are smaller than or equal to ai. Here we keep all the comparison results between every pair of ai and bj in an n1 × n2 table such that the element in the ith row and jth column is the comparison result between ai and bj, which is E(1) if bj ≤ ai and E(0) otherwise. Table 1 shows an example of such an n1 × n2 table.

Table 1 – An example table.

        b1     . . .   bj      . . .   bn2
a1
. . .
ai                     E(1)
. . .
an1

Similarly, to count the smaller or equal values for value bj in party B, we compare it with the values in both party A and party B. When comparing bj with the values in party B, we also count the number of values in party B that are equal to bj and name it TB(bj). When comparing bj with the values in party A securely, the comparison results differ from the previous case: here the comparison result between bj and ai is E(1) if ai ≤ bj and E(0) otherwise. We also keep these comparison results in an n1 × n2 table.

The two tables storing the comparison results are not the same. In the first table, the value in the ith row and jth column is E(1) if bj ≤ ai and E(0) otherwise, while in the second table the value in the ith row and jth column is E(1) if bj ≥ ai and E(0) otherwise. We now introduce a third n1 × n2 table in which each element is the secure sum of the two corresponding elements in the first and second tables. For example, if the value in the ith row and jth column is E(1) in the first table and E(0) in the second table, then the value in the ith row and jth column of the third table is E(1 + 0).

Every value in the third table is either E(1) or E(2). If ai < bj, the value in the second table is E(1) and the value in the first table is E(0). Thus, the value in the third table is E(1). If ai > bj, the value in the first table is E(1) and the value in the second table is E(0). Thus, the value in the third table is also E(1). If ai = bj, both values in the first and second tables are E(1) and the value in the third table is E(2). To sum up, the value in the ith row and jth column of the third table is E(1) if ai ≠ bj and E(2) if ai = bj.

We securely subtract 1 from every element in the third table. Then the value in the ith row and jth column of the new table is E(0) if ai ≠ bj and E(1) if ai = bj. This new table contains the information about equal values between the two parties. The sum of all the values in the ith row is the encrypted number of values in party B that are equal to ai, which is denoted E(TB(ai)). The sum of all the values in the jth column is the encrypted number of values in party A that are equal to bj, which is denoted E(TA(bj)). Since parties A and B have computed TA(ai) and TB(bj), respectively, these two numbers can be encrypted and added to E(TB(ai)) and E(TA(bj)), respectively, to get E(T(ai)) = E(TA(ai) + TB(ai)) and E(T(bj)) = E(TA(bj) + TB(bj)).
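The table construction above can be illustrated with a toy additively homomorphic scheme. The sketch below assumes a textbook Paillier cryptosystem with tiny hard-coded primes (insecure, for illustration only); the section does not say which cryptosystem the authors used, and the helper names `enc`, `dec`, `add` and `enc_sum` are hypothetical. The comparison results are computed in the clear here as stand-ins for the outputs of the secure comparison protocol of Section 4, and decryption is done only to check the result.

```python
import random
from math import gcd

# Toy Paillier parameters: far too small to be secure, illustration only.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)  # modular inverse (Python 3.8+)


def enc(m):
    """Encrypt integer m as (1 + m*N) * r^N mod N^2 with random r."""
    while True:
        r = random.randrange(1, N)
        if gcd(r, N) == 1:
            break
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def dec(c):
    """Decrypt: L(c^PHI mod N^2) * MU mod N, where L(x) = (x - 1) // N."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def add(c1, c2):
    """Homomorphic addition: E(x) * E(y) mod N^2 = E(x + y)."""
    return c1 * c2 % N2


def enc_sum(ciphertexts):
    total = enc(0)
    for c in ciphertexts:
        total = add(total, c)
    return total


A = [1, 2, 4, 4]
B = [3, 4, 4, 4]

# Stand-ins for the two tables of secure comparison results.
first = [[enc(int(b <= a)) for b in B] for a in A]   # E(1) if bj <= ai
second = [[enc(int(a <= b)) for b in B] for a in A]  # E(1) if ai <= bj

# Third table minus 1: E(1) where ai == bj, E(0) elsewhere.
equal = [[add(add(first[i][j], second[i][j]), enc(-1))
          for j in range(len(B))] for i in range(len(A))]

# Row sums give E(TB(ai)); column sums give E(TA(bj)).
enc_TB_of_a = [enc_sum(row) for row in equal]
enc_TA_of_b = [enc_sum(col) for col in zip(*equal)]

# Each party encrypts its own local count and adds it.
enc_T_a = [add(c, enc(A.count(a))) for c, a in zip(enc_TB_of_a, A)]
enc_T_b = [add(c, enc(B.count(b))) for c, b in zip(enc_TA_of_b, B)]

T_a = [dec(c) for c in enc_T_a]
T_b = [dec(c) for c in enc_T_b]
print(T_a)  # [1, 1, 5, 5]
print(T_b)  # [1, 5, 5, 5]
```

The decrypted tie sizes match the example: values 1, 2 and 3 are in ties of size 1, and every 4 is in a tie of size 5.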

For each value ai (i = 1, 2, . . ., n1) in party A, we have E(R(ai)), which is the encrypted largest rank in ai’s tie, and E(T(ai)), which is the encrypted number of values in ai’s tie, that is, the number of values equal to ai in both parties. For each value bj (j = 1, 2, . . ., n2) in party B, we have the analogous numbers E(R(bj)) and E(T(bj)).

To get the averaged rank for each value, we need to know the smallest rank in each value’s tie. The smallest ranks can be calculated from the largest ranks and the numbers of values in the ties. For each value ai (i = 1, 2, . . ., n1) in party A, the encrypted smallest rank E(S(ai)) in ai’s tie is E(R(ai) − T(ai) + 1) and the encrypted adjusted rank of ai is E((S(ai) + R(ai))/2), which is the average of the largest and the smallest ranks. To avoid decimal fractions in the ciphertext, we only calculate E(S(ai) + R(ai)), and the division by 2 is applied after the final decryption. For each value bj (j = 1, 2, . . ., n2) in party B, the encrypted smallest rank E(S(bj)) in bj’s tie is E(R(bj) − T(bj) + 1) and the encrypted adjusted rank of bj is E((S(bj) + R(bj))/2). Again we calculate E(S(bj) + R(bj)) and apply the division by 2 after the final decryption.
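Substituting the expression for the smallest rank shows that the doubled rank needs only additions, subtractions and a multiplication by the public constant 2, all of which an additively homomorphic scheme supports. The worked lines below reuse the running example (tied value 4 with R = 8, T = 5, and untied value 3 with R = 3, T = 1); they are an illustration, not an additional step of the algorithm.

```latex
\begin{align*}
S(a_i) + R(a_i) &= \bigl(R(a_i) - T(a_i) + 1\bigr) + R(a_i) = 2R(a_i) - T(a_i) + 1,\\
\text{tied value 4:}\quad & 2\cdot 8 - 5 + 1 = 12, \qquad 12/2 = 6,\\
\text{untied value 3:}\quad & 2\cdot 3 - 1 + 1 = 6, \qquad 6/2 = 3.
\end{align*}
```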

In this way, we can adjust the rank of every value, and the rank sums are calculated from these new ranks. Note that if a value is not tied with others, the adjustment does not change its rank. The complete algorithm for calculating H is summarized in Algorithm 3.

Algorithm 3. The complete algorithm of privacy-preserving Kruskal–Wallis test

Input. Party A has sample S1 which contains n1 values, and party B has sample S2 which contains n2 values. The total number of values N = n1 + n2;
Output. The statistic H;

1: for each value ai in party A do
2:     Calculate the encrypted rank of it E(R(ai)) and record the secure comparison results;
3: end for
4: for each value bj in party B do
5:     Calculate the encrypted rank of it E(R(bj)) and record the secure comparison results;
6: end for
7: From the secure comparison results, get the information of equal values between the two parties;
8: for each value ai in party A do
9:     Calculate the encrypted number of values equal to it E(T(ai));
10:    Calculate the encrypted smallest rank in its tie E(S(ai)) from E(T(ai)) and E(R(ai));
11:    Calculate the encrypted averaged rank of it;
12: end for
13: for each value bj in party B do
14:    Calculate the encrypted number of values equal to it E(T(bj));
15:    Calculate the encrypted smallest rank in its tie E(S(bj)) from E(T(bj)) and E(R(bj));
16:    Calculate the encrypted averaged rank of it;
17: end for
18: Do the remaining calculations to compute H as in Algorithm 2 with the encrypted averaged ranks;


To extend the adjustment from two parties to multiple parties, we just need to create, for each pair of parties, a table containing the information about equal values during the computation of ranks. For each value, we calculate the encrypted number of values equal to it by collecting information from all the tables in which it is involved. Then the encrypted smallest rank in its tie and the averaged rank can be computed, and the following steps are the same as in the extension of Algorithm 2 in Section 4.
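A cleartext sketch of the multiparty tie-size computation is given below. It assumes one equality table per pair of parties, as described above, but builds the tables from plain comparisons rather than from encrypted results. The function name `tie_sizes` and the data of party C are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def tie_sizes(parties):
    """parties: dict mapping a party name to its list of values.

    Returns a dict mapping each party name to the list of T(v), the number
    of values equal to v over all parties, for each of its values v.
    """
    # Local counts, which each party knows in the clear.
    T = {p: [vals.count(v) for v in vals] for p, vals in parties.items()}
    # One equality table per pair of parties (encrypted in the real protocol).
    for p, q in combinations(parties, 2):
        table = [[int(a == b) for b in parties[q]] for a in parties[p]]
        for i, row in enumerate(table):
            T[p][i] += sum(row)   # row sum: values in q equal to p's i-th value
        for j, col in enumerate(zip(*table)):
            T[q][j] += sum(col)   # column sum: values in p equal to q's j-th value
    return T


parties = {"A": [1, 2, 4, 4], "B": [3, 4, 4, 4], "C": [2, 4, 5]}
print(tie_sizes(parties))
# {'A': [1, 2, 6, 6], 'B': [1, 6, 6, 6], 'C': [2, 6, 1]}
```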

5.2.2. Calculation of C

In most cases, dividing H by C makes little change in the final result. If no more than 1/4 of the values are tied, the division does not change the result by more than 10% for some degrees of freedom and significance levels [3].

To calculate C securely for two parties A and B, we need the tie information computed during the adjustment of ranks: E(T(ai)) for each value ai (i = 1, 2, . . ., n1) in party A and E(T(bj)) for each value bj (j = 1, 2, . . ., n2) in party B.

From Eq. (2), we have

$$C = 1 - \frac{\sum_i \left(t_i^3 - t_i\right)}{N^3 - N},$$

where $t_i$ is the number of values in the ith tie.

To compute C securely, we treat T(ai) of each distinct ai and T(bj) of each distinct bj as $t_i$. For the values that are not tied with others, their T values are equal to 1, and since $1^3 - 1 = 0$, adding them does not affect the value of C. For the tied values, their T values should be considered just once in the calculation of C, so we consider the T’s of the distinct values in each party. In the example used before, where party A has values {1, 2, 4, 4} and party B has values {3, 4, 4, 4}, for party A we only consider T(1) = 1, T(2) = 1 and T(4) = 5. For party B, we consider T(3) = 1 and T(4) = 5. Here all the T’s are encrypted and no party knows the exact numbers. C can be securely computed from the encryptions of the $t_i$’s: $E(t_i^3)$ is calculated from $E(t_i)$ with Algorithm 1, and then $E\left(\sum_i (t_i^3 - t_i)\right)$ can be computed.

The problem is that, although only the T’s of the distinct values in each party are included in the calculation of C, there are still duplicates. Considering only the distinct values in each party ensures that the ties within parties are counted only once, but it cannot eliminate the duplicated ties between parties. In the above example, T(4) is counted twice because the tie of value 4 exists in both parties.

We denote by TA the set of ties that exist only in party A, by TB the set of ties that exist only in party B, and by TAB the set of ties that exist in both parties. We want the information about TA, TB and TAB to be included in C just once. With the above solution, TAB is counted twice.

If we consider only T(ai) for each value ai (i = 1, 2, . . ., n1) in party A and do not add T(bj) for each value bj (j = 1, 2, . . ., n2) in party B, TA and TAB are counted once but the information about TB is lost. We cannot add the information about TB alone without also adding TAB, because no party knows whether a tie in its data is local or global.

We have not worked out a solution that calculates C exactly. The two solutions mentioned above either add duplicate tie information or lose some tie information when calculating C. However, together they give a range for C by providing an upper bound and a lower bound, which limits the loss of accuracy.
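The two bounds can be illustrated in the clear on the running example. In the sketch below the global tie sizes are read from a plain counter, whereas in the protocol they are available only in encrypted form; the helper name `c_from_tie_sizes` is hypothetical. Counting shared ties twice makes the subtracted sum larger and therefore gives a lower bound on C, while using only party A's values can only drop ties and therefore gives an upper bound.

```python
from collections import Counter


def c_from_tie_sizes(tie_sizes, n_total):
    """C = 1 - sum(t^3 - t) / (N^3 - N) for the given tie sizes."""
    return 1 - sum(t**3 - t for t in tie_sizes) / (n_total**3 - n_total)


A = [1, 2, 4, 4]
B = [3, 4, 4, 4]
N = len(A) + len(B)
global_count = Counter(A + B)  # T(v); encrypted in the real protocol

t_from_A = [global_count[v] for v in set(A)]  # T of A's distinct values
t_from_B = [global_count[v] for v in set(B)]  # T of B's distinct values

exact = c_from_tie_sizes(global_count.values(), N)
only_A = c_from_tie_sizes(t_from_A, N)            # may miss ties only in B
both = c_from_tie_sizes(t_from_A + t_from_B, N)   # counts shared ties twice

print(round(both, 4), round(exact, 4), round(only_A, 4))
# 0.5238 0.7619 0.7619
```

In this example party B has no tie of its own, so the upper bound coincides with the exact value, while the double-counted tie of value 4 pulls the lower bound down.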

Table 2 – The BMI dataset.

Asians       Indians      Malays
32 (15)      26.4 (11)    24.9 (8)
30.1 (14)    23.1 (2)     25.3 (9)
27.6 (12)    23.5 (4)     23.8 (5)
26.2 (10)    24.6 (7)     22.1 (1)
28.2 (13)    24.3 (6)     23.4 (3)

We use an example to show how the calculation of C extends from two parties to multiple parties. Suppose there are three parties, A, B and C. As in the two-party case, we denote by TA, TB and TC the sets of ties that exist only in party A, B and C, respectively. TAB is the set of ties in parties A and B. TAC, TBC and TABC are defined in the same way.

For each pair of parties, we have a table storing the information about tied values between the two parties. The three tables are named Table(AB), Table(AC) and Table(BC), respectively. We collect the tie information of each distinct value in party A from all the tables that involve A, which are Table(AB) and Table(AC). This gives us the information about all the ties that appear in party A, which are TA, TAB, TAC and TABC. Then we disregard party A and the tables involving A, and collect the tie information of each distinct value in party B from all the remaining tables that involve B, which is only Table(BC). With this step, we add the information about all the ties appearing in party B but not in A, which are TB and TBC. Then we encounter the same problem as in the two-party case: if we stop here, the tie information of TC is lost; if we add the tie information of each distinct value in party C from a table involving C such as Table(AC), both TC and TAC are added, and thus TAC is counted twice.

When there are k parties, we follow the same procedure: we get the information about all the ties that appear in the first party, then add the information about the ties in the second party, and so on. When it comes to the last party, we either lose the information about ties appearing only in the last party, or add duplicate information about ties appearing in both the last party and some other party. This gives us an upper bound and a lower bound of C.

6. Experiments

The experimental results are presented in this section. All the algorithms are implemented with the Crypto++ library in the C++ language, and the communications between parties are implemented with the socket API. The experiments are conducted on a Red Hat server with 16 × 2.27 GHz CPUs and 24 GB of memory.

We use the two datasets from [34] to test the accuracy of our algorithms. The first dataset, shown in Table 2, contains 3 samples of equal size. The term “sample” in the context of this paper differs from its use in many other papers: each “sample” here is the set of data held by one party, and the number of samples is the number of parties. In this dataset, the data are simulated Body Mass Index (BMI) values for subjects of 3 different races from a suburb of San Francisco. Here the BMI values for the subjects of each race form a sample. There is no tie in this dataset, and the rank of every value is given in parentheses.


Table 3 – The INR dataset.

Hospital A     Hospital B     Hospital C     Hospital D
1.68 (1)       1.71 (6)       1.74 (13.5)    1.71 (6)
1.69 (2)       1.73 (10)      1.75 (16)      1.71 (6)
1.70 (3.5)     1.74 (13.5)    1.77 (18)      1.74 (13.5)
1.70 (3.5)     1.74 (13.5)    1.78 (20)      1.79 (22)
1.72 (8)       1.78 (20)      1.80 (23.5)    1.81 (26)
1.73 (10)      1.78 (20)      1.81 (26)      1.85 (29)
1.73 (10)      1.80 (23.5)    1.84 (28)      1.87 (30)
1.76 (17)      1.81 (26)      –              1.91 (31)

The second dataset is presented in Table 3. It contains 4 samples, and their sizes are not all equal. Each sample is a set of simulated International Normalized Ratio (INR) values of patients in one hospital. The ranks are given in parentheses. There are ties in the data and the tied ranks are bold.

Since our secure algorithms only deal with non-negative integers, each value in dataset 1 is multiplied by 10 and each value in dataset 2 is multiplied by 100. This step changes all the values to non-negative integers without changing the ranks of the values, so it does not affect the result of the Kruskal–Wallis test, which is calculated from the ranks.
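That the scaling leaves the ranks unchanged can be checked directly on dataset 1. The sketch below is a cleartext check only; the helper name `ranks` is hypothetical, and `round` is used because multiplying a binary floating-point value by 10 is not always exact.

```python
def ranks(values):
    """Rank of v = number of values <= v (plain ranks when there are no ties)."""
    return [sum(1 for w in values if w <= v) for v in values]


bmi = [32, 30.1, 27.6, 26.2, 28.2,     # Asians
       26.4, 23.1, 23.5, 24.6, 24.3,   # Indians
       24.9, 25.3, 23.8, 22.1, 23.4]   # Malays

scaled = [round(v * 10) for v in bmi]  # non-negative integers
print(scaled[:5])                      # [320, 301, 276, 262, 282]
print(ranks(scaled) == ranks(bmi))     # True
```

The ranks computed this way agree with the values given in parentheses in Table 2.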

The accuracy of our basic algorithm for data without ties is 100%. This is shown with dataset 1. We provide the H values calculated in both the two-party and the multiparty scenarios in Table 4. In the two-party case, we take the first two samples of dataset 1 and calculate the H value on these two samples. In the multiparty case, the H value is calculated on all three samples of dataset 1.
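The reference values for dataset 1 can be reproduced in the clear. The sketch below assumes the standard form of the statistic, H = 12/(N(N+1)) · Σ Ri²/ni − 3(N+1), where Ri is the rank sum and ni the size of sample i; this is taken to match the definition in Section 3, which is not repeated here. The function name `kruskal_wallis_h` is hypothetical, and the inputs are the Table 2 values multiplied by 10.

```python
def kruskal_wallis_h(samples):
    """Kruskal-Wallis H (without the tie correction C) from averaged ranks."""
    pooled = [v for s in samples for v in s]
    n = len(pooled)

    def rank(v):
        largest = sum(1 for w in pooled if w <= v)  # largest rank in v's tie
        ties = sum(1 for w in pooled if w == v)     # size of v's tie
        return (2 * largest - ties + 1) / 2         # (S + R) / 2

    total = sum(sum(rank(v) for v in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


asians = [320, 301, 276, 262, 282]
indians = [264, 231, 235, 246, 243]
malays = [249, 253, 238, 221, 234]

print(round(kruskal_wallis_h([asians, indians]), 4))          # 5.7709
print(round(kruskal_wallis_h([asians, indians, malays]), 2))  # 8.72
```

Both values agree with the entries reported for the original Kruskal–Wallis test in Table 4.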

Our algorithms for data with ties cause some accuracy loss. There are two methods to deal with tied values. The first one is to modify the data slightly to eliminate ties and then compute H with the basic algorithm. Accuracy loss occurs because the data are changed. The second method is to keep the data unchanged, but adjust the ranks and divide H by C. Here the accuracy loss comes from the calculation of C. Because we can compute an upper bound and a lower bound for C, we can also get an upper bound and a lower bound for the final result Hc. We test the two methods with dataset 2, and the results are shown in Table 5. Here we again take the first two samples of dataset 2 to test the two-party case and all four samples of dataset 2 to test the multiparty case.
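The two-party bounds of the second method can be reproduced in the clear from the first two samples of dataset 2 (Hospitals A and B, values multiplied by 100). The sketch below again assumes the standard form of H and reads the global tie sizes from a plain counter, which the protocol holds only in encrypted form; all names are hypothetical. Since Hc = H/C, the lower bound of C yields the upper bound of Hc and vice versa.

```python
from collections import Counter

hospital_a = [168, 169, 170, 170, 172, 173, 173, 176]
hospital_b = [171, 173, 174, 174, 178, 178, 180, 181]
pooled = hospital_a + hospital_b
n = len(pooled)
count = Counter(pooled)  # global tie size T(v) of every value


def avg_rank(v):
    largest = sum(1 for w in pooled if w <= v)
    return (2 * largest - count[v] + 1) / 2


def correction(tie_sizes):
    return 1 - sum(t**3 - t for t in tie_sizes) / (n**3 - n)


h = 12 / (n * (n + 1)) * sum(
    sum(avg_rank(v) for v in s) ** 2 / len(s) for s in (hospital_a, hospital_b)
) - 3 * (n + 1)

t_a = [count[v] for v in set(hospital_a)]  # T of A's distinct values
t_b = [count[v] for v in set(hospital_b)]  # T of B's distinct values

hc_exact = h / correction(count.values())
hc_upper = h / correction(t_a + t_b)  # shared ties counted twice, smaller C
hc_lower = h / correction(t_a)        # ties only in B dropped, larger C

print(round(hc_exact, 4), round(hc_upper, 4), round(hc_lower, 4))
# 6.419 6.4574 6.4
```

The bounds match the two-party column of Table 5, and the exact value agrees with the reported 6.4191 up to rounding in the last digit.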

As we can see from the results, the second method has better accuracy than the first one. In the case with two parties, although the first two samples of dataset 2 contain many ties (9 out of 16 values are in ties), the two bounds are both very close to the accurate result. In the multiparty case, both the upper and lower bounds are equal to the accurate result. This is because the two bounds are calculated by either disregarding the ties only in the last sample, or counting the ties between the last and the first samples twice. Fortunately, in this dataset, the last sample does not contain any tie that is only in it, and there is no tie between the last sample and the first sample. So with this dataset, the two bounds are equal to the accurate result.

Table 4 – Kruskal–Wallis test result on data without ties.

                                                        2 samples    3 samples
H calculated by the original Kruskal–Wallis test        5.7709       8.72
H calculated by our basic algorithm                     5.7709       8.72

We now show the computation overheads of the algorithms. In Fig. 1 we compare the running times of the proposed algorithms for different data sizes under the two-party scenario. The running times are in seconds. The execution times of the basic algorithm for data without ties and of the first method for data with ties are very close. This is because in the first method of dealing with ties, we eliminate the ties and then follow the same procedure as the basic algorithm. The second method for data with ties takes more time than the first one, mostly because the adjustment of ranks takes time.

We also show the overheads in the multiparty case with datasets 1 and 2. The execution time of the basic algorithm on dataset 1 is:

Running time for 2 samples: 5 s
Running time for 3 samples: 17 s

The execution time of the first method for data containing ties on dataset 2 is:

Running time for 2 samples: 15 s
Running time for 3 samples: 67 s
Running time for 4 samples: 599 s

The execution time of the second method for data containing ties on dataset 2 is:

Running time for 2 samples: 26 s
Running time for 3 samples: 169 s
Running time for 4 samples: 2159 s

Table 5 – Kruskal–Wallis test result on data with ties.

                                                        2 samples    4 samples
Hc calculated by the original Kruskal–Wallis test       6.4191       11.876
H calculated from modified data (the first method)      6.89338      12.6971
The upper bound of Hc (the second method)               6.4574       11.876
The lower bound of Hc (the second method)               6.4          11.876


Fig. 1 – Running time comparison of algorithms with two parties. [Plot: running time (s), 0 to 600, against sample size, 5 to 20, for the basic algorithm for data without ties, the first method for data with ties, and the second method for data with ties.]

7. Conclusion

In this work, we proposed several algorithms that enable parties to conduct the Kruskal–Wallis test securely without revealing their data to others. We showed the procedure of the algorithms for data both with and without ties. We also presented an algorithm for multiplying two encrypted integers under an additive homomorphic cryptosystem. Our algorithms can be extended to make other non-parametric rank-based statistical tests secure, such as the Friedman test. We leave this as future work.

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

References

[1] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical recipes in C: the art of scientific computing, Transform 1 (i) (1992) 504–510 (online). Available at: http://www.jstor.org/stable/1269484?origin=crossref.

[2] G.E.P. Box, Non-normality and tests on variances, Biometrika (1953) 318–335.

[3] W.H. Kruskal, W.A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association 47 (260) (1952) 583–621 (online). Available at: http://www.jstor.org/stable/2280779.

[4] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin (1945) 80–83.

[5] H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics 18 (1) (1947) 50–60 (online). Available at: http://www.jstor.org/stable/2236101.

[6] C.M.R. Kitchen, Nonparametric versus parametric tests of location in biomedical research, American Journal of Ophthalmology (2009) 571–572.

[7] R. Agrawal, R. Srikant, Privacy-Preserving Data Mining (2000).

[8] Z. Huang, W. Du, B. Chen, Deriving private information from randomized data, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data SIGMOD 05, 2005, p. 37.

[9] K. Chen, L. Liu, Privacy preserving data classification with rotation perturbation, in: Proceedings of the Fifth IEEE International Conference on Data Mining, ser. ICDM ’05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 589–592 (online). Available at: http://dx.doi.org/10.1109/ICDM.2005.121.

[10] G.R. Heer, A bootstrap procedure to preserve statistical confidentiality in contingency tables, in: Proceedings of the International Seminar on Statistical Confidentiality, 1993, pp. 261–271.

[11] Y. Lindell, B. Pinkas, Privacy preserving data mining, Journal of Cryptology 15 (3) (2002) 177–206.

[12] W. Du, Z. Zhan, Building decision tree classifier on private data, Reproduction (2002) 1–8.

[13] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M.Y. Zhu, Tools for privacy preserving distributed data mining, ACM SIGKDD Explorations Newsletter 4 (2) (2002) 28–34.

[14] I. Damgard, M. Fitzi, E. Kiltz, J.B. Nielsen, T. Toft, Unconditionally Secure Constant-Rounds Multi-party Computation for Equality, Comparison, Bits and Exponentiation, vol. 3876, Springer, 2006, pp. 285–304.

[15] I. Damgard, M. Geisler, M. Kroigard, Homomorphic encryption and secure comparison, International Journal of Applied Cryptography 1 (2008) 22.

[16] W. Du, M. Atallah, Privacy-Preserving Cooperative Statistical Analysis, IEEE Computer Society, 2001, p. 102.

[17] B. Goethals, S. Laur, H. Lipmaa, T. Mielikäinen, On private scalar product computation for privacy-preserving data mining, Science 3506 (2004) 104–120.

[18] S. Han, W.K. Ng, Privacy-preserving linear Fisher discriminant analysis, in: Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, ser. PAKDD’08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 136–147.

[19] W. Du, Y.Y.S. Han, S. Chen, Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification, vol. 233, Lake Buena Vista, Florida, 2004.

[20] S. Han, W.K. Ng, P.S. Yu, Privacy-preserving singular value decomposition, in: 2009 IEEE 25th International Conference on Data Engineering, 2009, pp. 1267–1270.

[21] Z. Teng, W. Du, A hybrid multi-group privacy-preserving approach for building decision trees, in: Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, ser. PAKDD’07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 296–307.

[22] J. Vaidya, W. Lafayette, C. Clifton, Privacy-preserving k-means clustering over vertically partitioned data, Security (2003) 206–215.

[23] G. Jagannathan, R.N. Wright, Privacy-Preserving Distributed k-Means Clustering Over Arbitrarily Partitioned Data, ACM, 2005, pp. 593–599.

[24] L. Wan, W.K. Ng, S. Han, V.C.S. Lee, Privacy-preservation for gradient descent methods, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 07, 2007, p. 775.

[25] T. Chen, S. Zhong, Privacy-preserving models for comparing survival curves using the logrank test, Computer Methods and Programs in Biomedicine 104 (2) (2011) 249–253 (online). Available at: http://www.ncbi.nlm.nih.gov/pubmed/21636164.

[26] Y. Zhang, S. Zhong, Privacy preserving distributed permutation test, 2012, submitted for publication.

[27] O. Goldreich, Foundations of Cryptography, vol. 1, no. 3, Cambridge University Press, 2001.

[28] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, IEEE Transactions on Knowledge and Data Engineering 16 (9) (2004) 1026–1037.

[29] J. Vaidya, C. Clifton, Privacy-Preserving Outlier Detection, vol. 41, no. 1, IEEE, 2004, pp. 233–240.

[30] S. Zhong, Privacy-preserving algorithms for distributed mining of frequent itemsets, Information Sciences 177 (2) (2007) 490–503.

[31] P. Paillier, Public-key cryptosystems based on composite degree residuosity classes, Computer 1592 (1999) 223–238.

[32] T. ElGamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Transactions on Information Theory 31 (4) (1985) 469–472.

[33] D. Boneh, The Decision Diffie-Hellman Problem, vol. 1423, Springer-Verlag, 1998, pp. 48–63.

[34] A.C. Elliott, L.S. Hynan, A SAS® macro implementation of a multiple comparison post hoc test for a Kruskal–Wallis analysis, Computer Methods and Programs in Biomedicine 102 (1) (2011) 75–80 (online). Available at: http://www.ncbi.nlm.nih.gov/pubmed/21146248.