M-Sanit: Evaluation of Misusability Measure Based Big Data
Sanitization
1Y.SOWMYA,
2Dr.M.NAGARATNA,
3Dr.C.SHOBA BINDHU
1Dept. of Computer Science and Engineering, Research Scholar JNTUA, Anantapuramu, India
2Dept. of Computer Science and Engineering, Associate Professor, JNTUH Hyderabad, India
3Dept. of Computer Science and Engineering, Professor, JNTUA, Anantapuramu, India
E-Mail: [email protected];
Abstract
Big data, in the wake of distributed computing technologies,
frameworks and cloud computing, has the potential to add significant
value to enterprises. Because data is growing exponentially, machine
learning techniques have become indispensable for discovering useful
patterns in it. Most existing data mining techniques focus on
extracting hidden trends from databases. However, there is a risk
that the data will be misused. The risk is greater in the era of
cloud computing, as data owners outsource data to the public cloud
and do not maintain a local copy, because the data is too voluminous
to be accommodated on local machines. As the cloud is untrusted from
the user's point of view, focusing on extracting information from
such data is inadequate unless there is a mechanism to withstand
inference attacks or misuse of data. In our previous work, a
framework named M-Sanit was proposed for misusability measure based
big data sanitization using Hadoop's MapReduce programming paradigm.
In this paper we evaluate the M-Sanit prototype for misusability
measure based sanitization of big data. The evaluation is required to
standardize the prototype by experimenting with different levels of
sanitization based on the misusability measure and deriving
thresholds for each level, so as to ensure optimal sanitization of
big data with an appropriate level of privacy.
Index Terms – Big data, misusability measure, sanitization, M-Sanit
International Journal of Pure and Applied Mathematics
Volume 119 No. 18 2018, 1859-1870
ISSN: 1314-3395 (on-line version)
url: http://www.acadpubl.eu/hub/
Special Issue
1 Introduction
Data is an asset to an organization because it enables data-driven
decision making. In every domain, business partners need access to data
to perform their duties, and limiting that access in the interest of
privacy can hinder them. This is one side of the coin, which supports
giving sufficient data to partners. The other side is the privacy risk
posed by malicious insiders. It is therefore essential to have a
mechanism in place for detecting data misuse and data leakage. However,
detecting malicious insiders is very challenging. According to a survey
by Cyber Security Watch [31], 26% of cyber security attacks in a year
are made by malicious insiders. Around 43% of the victims of cyber
attacks reported that insider attacks are the most damaging. Leakage of
sensitive data was at 15% and theft of sensitive data was at 16%. These
statistics reveal the severity of the security and privacy issues
caused by malicious insiders.
To mitigate misuse of data in the context of big data, in our prior
work [29] we proposed algorithms for parallelizing k-Anonymity for
privacy preserving big data mining using the MapReduce framework, in
order to protect big data from inference attacks. We then proposed a
framework known as M-Sanit [32] for effective sanitization of big data
using the MapReduce paradigm. This framework uses an extended
misusability score measure, based on the work of [10], to determine the
level of sanitization required. M-Sanit applies our algorithm named
Misusability Measure-based Big Data Sanitization (MMBDS) to reduce the
probability of misuse. However, although this algorithm is capable of
applying an appropriate level of sanitization based on the misusability
measure, it had not been evaluated formally.
The focus of this paper is to evaluate the M-Sanit framework and draw
conclusions on the level of sanitization needed to protect big data
from leakage and abuse. As insiders have legitimate rights to access
data, it is crucial to have such a mechanism to prevent malicious
activities by them. Existing privacy preserving data mining techniques
such as k-Anonymity [11], l-Diversity [33] and t-Closeness [24] do
provide privacy, but they need to be adapted to the MapReduce
programming approach for big data anonymization. Towards this end we
studied k-Anonymity by redefining it for MapReduce [29]. However, that
work does not use a misusability measure to determine the level of
anonymization. To exploit the extended misusability measure [32], we
proposed M-Sanit for misusability measure based big data sanitization.
This framework thus serves the dual purpose of mining big data and
protecting it from misuse. Our contributions in this paper are as
follows.
(a) We propose a methodology for evaluating M-Sanit [32] for
misusability based big data sanitization. In particular, it focuses
on the level of sanitization needed given the misusability
probability detected on the data.
(b) We evaluate the MMBDS algorithm experimentally in order to
generalize thresholds for different levels of sanitization in
response to the value of the misusability measure.
(c) We built a prototype application to demonstrate parallelization
of k-Anonymity, computation of the misusability measure on big data
and the utility of the M-Sanit framework with the MapReduce
programming paradigm in a distributed environment.
The remainder of the paper is structured as follows. Section 2
reviews related literature on privacy preserving data mining
techniques, privacy issues and prevention measures in the context of
big data. Section 3 presents our methodology for evaluating the
M-Sanit framework. Section 4 presents the results of the evaluation.
Section 5 concludes the paper and provides directions for future
research.
2 Related Works
This section reviews literature on privacy preserving data mining
techniques, privacy in big data and misusability of data. Liu et al. [1]
proposed a perturbation-based method for privacy preserving data mining
in a distributed environment. Similar research is reported in [4].
Bonomi et al. [2] proposed an information-theoretic approach to
sequential data sanitization. Xu and Jiang [3], on the other hand,
proposed a framework for categorizing and applying privacy preservation
techniques in big data mining. Matatov et al. [5] proposed a feature-set
partitioning approach for privacy preserving data mining. Cheng and
Kumar [6] studied sanitization of log files. Li et al. [7] studied
privacy preserving outsourcing of data. Towards this end, they sanitized
and minimized databases for outsourced software testing.
Hong et al. [8] applied differential privacy and boosted utility to the
sanitization of collaborative search logs, so that the search logs can
be distributed to stakeholders without worrying about internal attacks
on privacy. Zhang et al. [9] proposed a proximity-aware methodology for
local-recoding anonymization using the MapReduce programming paradigm.
Harel et al. [10] proposed a misusability measure known as M-score. Its
extension, the extended misusability measure, was proposed in our
previous work [32]. Chiu and Tsai [11] proposed a data privacy
preservation method based on k-anonymity clustering. Chakravorty et al.
[12], on the other hand, considered privacy preserving data analytics
for smart homes. Various data privacy issues associated with big data
are discussed by Smith et al. [13]. Big data issues and challenges are
explored in [17]. Similar work is reported in [18].
Design principles for efficient knowledge discovery from big data are
proposed in [14]. Data mining for big data [15] and the future of
enterprise computing [16] explore their respective ideas on big data
processing. Risk control for big data, in terms of assessing risk and
taking corrective measures, is explored in [19]. Recognizing big data
value [20], scalable k-anonymization [21], protection of big data
privacy [22], tools for privacy preserving big data processing [23],
anonymity techniques such as k-anonymity, l-diversity and t-closeness
[24], privacy preserving techniques for pervasive systems [25], privacy
approaches for data pertaining to the Internet of Things (IoT) [26],
decentralization of privacy preserving policies [27] and t-closeness
along with differential privacy [30] are other important studies found
in the literature. In this paper, we evaluate privacy preservation for
big data based on the computed misusability measure, so that
appropriate sanitization serves the dual purpose of achieving privacy
and utility.
3 Methodology to Evaluate M-Sanit
The M-Sanit framework is meant to minimize the risk of data leakage or
misuse by exploiting the probability of misusability of big data. The
MMBDS algorithm presented in this section is defined in our earlier
work [32] and determines the level of sanitization needed based on the
value of the misusability measure. More details on M-Sanit and the
MMBDS algorithm, including the rrs (1), rdf (2), frs (3) and ms (4)
equations pertaining to the extended misusability score, can be found
in [32]. The main purpose of this paper is to evaluate the algorithm
within the M-Sanit framework, which takes a dataset (big data) as input
and produces a sanitized dataset. The sanitization depends on first
determining the level of sanitization needed. This is done using the
misuse probability (0.0 to 1.0) provided by the extended misusability
measure explored in [32]. This measure takes big data as input and
computes the probability of misuse of the data. Based on this
probability, an appropriate level of sanitization is applied for
privacy preserving mining of big data. Which level of sanitization is
appropriate for a given misuse probability is the question this paper
sets out to answer.
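The mapping from misuse probability to sanitization level described above can be sketched as follows. This is a minimal illustration only: the cut-off values in THRESHOLDS and the k values in K_PER_LEVEL are hypothetical placeholders, since the paper does not publish concrete thresholds, and the function names are not taken from the M-Sanit prototype.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the misuse probability (0.0 to 1.0).
# The paper leaves the actual thresholds to be derived experimentally.
THRESHOLDS = [0.3, 0.6]

# Hypothetical k values for sanitization levels 1, 2 and 3.
K_PER_LEVEL = {1: 5, 2: 20, 3: 50}


def sanitization_level(misuse_probability: float) -> int:
    """Map a misuse probability in [0, 1] to a sanitization level 1..3."""
    if not 0.0 <= misuse_probability <= 1.0:
        raise ValueError("misuse probability must lie in [0.0, 1.0]")
    # bisect_right counts how many cut-offs are <= the probability,
    # so a higher probability yields a higher (stronger) level.
    return bisect_right(THRESHOLDS, misuse_probability) + 1


def k_for_probability(misuse_probability: float) -> int:
    """Return the k-Anonymity parameter assumed for the resulting level."""
    return K_PER_LEVEL[sanitization_level(misuse_probability)]
```

Keeping the cut-offs in one sorted list means that a domain expert could adjust them without touching the lookup logic.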
Algorithm 1: MMBDS algorithm [32]
This algorithm computes the misusability score from big data and
indicates the level of sanitization needed. However, labels such as
level 1, level 2 or level 3 are vague on their own. Therefore, we
evaluate how sanitization is applied and what level of it is really
needed to protect big data from privacy attacks by malicious insiders.
Proactive misusability reduction is the main purpose of M-Sanit: the
data is modified, or sanitized, so as to reduce its misusability level.
At the same time, the sanitized data should remain useful for mining
tasks. It is therefore important to consider both reducing the
misusability probability and preserving the utility of the data for
partners, so that they can perform their tasks meaningfully. The risk
of exposing data is measured first using our extended misusability
measure. The risk is then reduced by sanitization. Finally, it is
important to know how well data mining algorithms perform once
sanitization has been applied.
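The sanitization step in this evaluation relies on k-Anonymity run under MapReduce [29]. The sketch below is not the authors' algorithm, which is not reproduced in this paper. It is a single-process illustration of the map and reduce idea, and it uses record suppression (dropping groups smaller than k) rather than generalization, which is a simplification. The attribute names cap_shape and odor are illustrative quasi-identifiers, not a choice documented in the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Record = Dict[str, str]
Key = Tuple[str, ...]


def map_phase(records: Iterable[Record],
              quasi_ids: Sequence[str]) -> Iterator[Tuple[Key, Record]]:
    """Emit (quasi-identifier values, record) pairs, as a mapper would."""
    for record in records:
        yield tuple(record[q] for q in quasi_ids), record


def reduce_phase(pairs: Iterable[Tuple[Key, Record]], k: int) -> List[Record]:
    """Group records by key and keep only groups with at least k members."""
    groups: Dict[Key, List[Record]] = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    kept: List[Record] = []
    for members in groups.values():
        if len(members) >= k:  # smaller groups are suppressed
            kept.extend(members)
    return kept


def is_k_anonymous(records: Iterable[Record],
                   quasi_ids: Sequence[str], k: int) -> bool:
    """Check that every quasi-identifier combination occurs at least k times."""
    sizes = Counter(key for key, _ in map_phase(records, quasi_ids))
    return all(size >= k for size in sizes.values())


# Illustrative rows; the values are made up for the example.
rows = [
    {"cap_shape": "x", "odor": "n", "label": "e"},
    {"cap_shape": "x", "odor": "n", "label": "p"},
    {"cap_shape": "b", "odor": "a", "label": "e"},
]
sanitized = reduce_phase(map_phase(rows, ["cap_shape", "odor"]), k=2)
```

Suppression discards records, so a larger k generally leaves less data for the classifier, which is one way to see why utility is measured after sanitization.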
The evaluation uses a classification algorithm, the Naive Bayes
classifier, with the MMBDS algorithm in place. Different levels of
sanitization are evaluated using our parallelized k-Anonymity [29]. The
evaluation measures considered are precision, recall, and Root Mean
Square Error (RMSE) at different levels of anonymity. Precision and
recall are based on the confusion matrix provided in Table 1.
Table 1: Confusion matrix used for evaluation
                                               Ground Truth (correct       Ground Truth (incorrect
                                               classification decision)    classification decision)
Algorithm (correct classification decision)    True Positive (TP)          False Positive (FP)
Algorithm (incorrect classification decision)  False Negative (FN)         True Negative (TN)
Precision = (TP / (TP + FP)) * 100    (1)
Recall = (TP / (TP + FN)) * 100    (2)
Precision is the ratio of correctly classified positive instances to
all instances that the algorithm classified as positive. Recall is the
ratio of correctly classified positive instances to all instances that
are actually positive in the dataset. RMSE is the square root of the
mean squared difference between the predicted and the actual values.
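The three measures can be computed as in the short sketch below. It follows the standard definitions: precision and recall use the confusion-matrix counts of Table 1 and are scaled to percentages as in equations (1) and (2), and RMSE is the square root of the mean squared difference between actual and predicted values. The function names and the sample numbers in the usage lines are illustrative assumptions, not values from the experiments.

```python
import math
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Percentage of predicted positives that are truly positive."""
    if tp + fp == 0:
        raise ValueError("no predicted positives")
    return 100.0 * tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Percentage of actual positives that were predicted positive."""
    if tp + fn == 0:
        raise ValueError("no actual positives")
    return 100.0 * tp / (tp + fn)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative counts and labels only.
example_precision = precision(tp=90, fp=10)
example_recall = recall(tp=90, fn=30)
example_rmse = rmse([1, 0, 1, 1], [1, 0, 0, 1])
```

Note that the precision and recall values reported in Table 3 are on a 0 to 1 scale, so dividing these percentages by 100 gives comparable numbers.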
4 Experimental Results
Experiments were carried out with the Mushroom dataset from the UCI
machine learning repository using the M-Sanit framework, and the
results were evaluated.
Table 2: Classification accuracy and RMSE against k value
k-Anonymity Level   Classification Accuracy   RMSE
No anonymization    0.95456                   0.1756
K=5                 0.95023                   0.1751
K=10                0.94605                   0.1795
K=20                0.9362                    0.1821
K=25                0.934505                  0.1873
K=30                0.923827                  0.1885
K=40                0.914120                  0.1913
K=45                0.907924                  0.1935
K=50                0.907021                  0.1941
Table 2 shows how classification accuracy and RMSE change with the
k-Anonymity level. As K increases, accuracy decreases and RMSE
generally increases. RMSE is highest at K=50 and lowest at K=5.
Accuracy is highest with no anonymization and lowest at K=50.
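One way to turn Table 2 into a threshold decision is to pick the largest K whose accuracy loss, relative to no anonymization, stays within a tolerance. The sketch below uses the accuracy values from Table 2. The tolerance values in the usage lines are hypothetical, since the paper leaves the choice of threshold to domain experts, and the key 0 is used here as a stand-in for "no anonymization".

```python
from typing import Dict

# Classification accuracy per K from Table 2.
# Key 0 stands for "no anonymization" (an encoding chosen for this sketch).
ACCURACY: Dict[int, float] = {
    0: 0.95456,
    5: 0.95023,
    10: 0.94605,
    20: 0.9362,
    25: 0.934505,
    30: 0.923827,
    40: 0.914120,
    45: 0.907924,
    50: 0.907021,
}


def accuracy_drop(k: int) -> float:
    """Accuracy lost at level k compared with no anonymization."""
    return ACCURACY[0] - ACCURACY[k]


def largest_k_within(tolerance: float) -> int:
    """Largest K in Table 2 whose accuracy drop does not exceed tolerance."""
    candidates = [k for k in ACCURACY if accuracy_drop(k) <= tolerance]
    return max(candidates)


# Hypothetical tolerances, expressed as absolute accuracy loss.
k_strict = largest_k_within(0.01)
k_moderate = largest_k_within(0.02)
```

Taking the maximum is appropriate here because accuracy in Table 2 falls steadily as K grows, so every K below the selected one also satisfies the tolerance.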
The accuracy figure plots classification accuracy against the level of
anonymization. As K increases, accuracy decreases. Accuracy is highest
with no anonymization (0.95456) and lowest at K=50 (0.907021).
[Figure: Classification Accuracy (y-axis, 0.88 to 0.96) versus Level of Anonymity (x-axis)]
The RMSE figure plots RMSE against the level of anonymization. As K
increases, RMSE generally increases. RMSE is highest at K=50 (0.1941)
and lowest at K=5 (0.1751), compared with 0.1756 with no anonymization.
Table 3: Precision and recall measures
k-Anonymity Level   Precision   Recall
No anonymization    0.961       0.957
K=5                 0.957       0.953
K=10                0.950       0.946
K=20                0.940       0.936
K=25                0.937       0.933
K=30                0.925       0.923
K=40                0.915       0.913
K=45                0.909       0.908
K=50                0.908       0.906
Table 3 shows how precision and recall change with the k-Anonymity
level. As K increases, both precision and recall decrease. Both are
highest with no anonymization and lowest at K=50.
[Figure: RMSE (y-axis, 0.165 to 0.2) versus Level of Anonymity (x-axis)]
The precision and recall figure plots both measures against the level
of anonymization. As K increases, precision and recall decrease.
Precision is highest with no anonymization (0.961) and lowest at K=50
(0.908). Recall is highest with no anonymization (0.957) and lowest at
K=50 (0.906).
5 Conclusions And Future Work
In this paper we proposed a methodology to evaluate the functionality
of M-Sanit, a framework we proposed in [32] for misusability based
sanitization of big data. Since the level of sanitization is based on
the misusability measure, which provides the probability of big data
misuse, this paper focused on evaluating the work using precision,
recall, classification accuracy and RMSE. The Naive Bayes classifier
was used along with the parallelized k-Anonymity proposed in our
previous work [29] to evaluate M-Sanit. The results show that it is
important to understand the utility of big data after anonymization as
well, and that domain experts need to set thresholds based on the
usability of the sanitized data.
[Figure: Precision and Recall (y-axis, 0.87 to 0.97) versus Level of Anonymity (x-axis)]
References
[1] Kun Liu, Hillol Kargupta and Jessica Ryan, Random Projection-Based Multiplicative Data Perturbation For Privacy Preserving Distributed Data Mining, IEEE Transactions On Knowledge And Data Engineering, 18 (1), (2006), 92-106.
[2] Luca Bonomi, Liyue Fan and Hongxia Jin, An Information-Theoretic Approach To Individual Sequential Data Sanitization, (2016), 337-346.
[3] Lei Xu and Chunxiao Jiang, A Framework For Categorizing And Applying Privacy-Preservation Techniques In Big Data Mining, (2016), 54-62.
[4] Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham, The Applicability Of The Perturbation Based Privacy Preserving Data Mining For Real-World Data, Elsevier, (2008), 5-21.
[5] Nissim Matatov, Lior Rokach and Oded Maimon, Privacy-Preserving Data Mining, A Feature Set Partitioning Approach, Elsevier, (2010), 2696-2720.
[6] Hsin-Jung Cheng and Akhil Kumar, Process Mining On Noisy Logs, Can Log Sanitization Help To Improve Performance, Elsevier, (2015), 1-12.
[7] Boyang Li, Mark Grechanik and Denys Poshyvanyk, Sanitizing And Minimizing Databases For Software Application Test Outsourcing, IEEE International Conference On Software Testing, (2014), 1-10.
[8] Yuan Hong, Jaideep Vaidya, Haibing Lu, Panagiotis Karras and Sanjay Goel, Collaborative Search Log Sanitization, Toward Differential Privacy And Boosted Utility, IEEE Transactions On Dependable And Secure Computing, (2014), 1-16.
[9] Xuyun Zhang, Wanchun Dou, Jian Pei, Surya Nepal, Chi Yang, Chang Liu and Jinjun Chen, Proximity-Aware Local-Recoding Anonymization With MapReduce For Scalable Big Data Privacy Preservation In Cloud, IEEE Transactions On Computers, (2013), 1-14.
[10] Amir Harel, Asaf Shabtai, Lior Rokach and Yuval Elovici, M-Score, A Misuseability Weight Measure, IEEE Transactions On Dependable And Secure Computing, 9 (3), (2012), 1-15.
[11] Chuang-Cheng Chiu and Chieh-Yuan Tsai, A K-Anonymity Clustering Method For Effective Data Privacy Preservation, (2007), 89-99.
[12] Antorweep Chakravorty, Tomasz Wlodarczyk and Chunming Rong, Privacy Preserving Data Analytics For Smart Homes, (2013), 1-5.
[13] Matthew Smith, Christian Szongott, Benjamin Henne and Gabriele Von Voigt, Big Data Privacy Issues In Public Social Media, (2012), 1-6.
[14] Edmon Begoli and James Horey, Design Principles For Effective Knowledge Discovery From Big Data, (2012), 1-4.
[15] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding, Data Mining With Big Data, IEEE Transactions On Knowledge And Data Engineering, 26 (1), (2014), 1-11.
[16] Juhnyoung Lee, The Future Of Enterprise Computing, (2013), 1-1.
[17] Avita Katal, Mohammad Wazid and R H Goudar, Big Data, Issues, Challenges, Tools And Good Practices, (2012), 1-6.
[18] Elisa Bertino, Big Data - Opportunities And Challenges Panel Position Paper, IEEE, (2013), 479-480.
[19] Duncan Hodges and Sadie Creese, Breaking The Arc, Risk Control For Big Data, IEEE, (2013), 613-621.
[20] Ningyuxin and Liyueling, How We Could Realize Big Data Value, IEEE, (2013), 425-427.
[21] Antorweep Chakravorty, Tomasz Wiktor Wlodarczyk and Chunming Rong, A Scalable K-Anonymization Solution For Preserving Privacy In An Aging-In-Place Welfare Intercloud, IEEE International Conference On Cloud Engineering, (2014), 424-431.
[22] Abid Mehmood, Iynkaran Natgunanathan, Yong Xiang, Guang Hua and Song Guo, Protection Of Big Data Privacy, IEEE, 4, (2016), 1821-1834.
[23] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin and Michael Y. Zhu, Tools For Privacy Preserving Distributed Data Mining, 4 (2), (2002), 1-7.
[24] Ninghui Li, Tiancheng Li and Suresh Venkatasubramanian, T-Closeness, Privacy Beyond K-Anonymity And L-Diversity, (2005), 1-10.
[25] Claudio Bettini and Daniele Riboni, Privacy Protection In Pervasive Systems: State Of The Art And Technical Challenges, (2014), 1-36.
[26] Mahadev Satyanarayanan, Pieter Simoens, Yu Xiao, Padmanabhan Pillai, Zhuo Chen, Kiryong Ha, Wenlu Hu and Brandon Amos, Edge Analytics In The Internet Of Things, (2004), 1-6.
[27] G. Zyskind, O. Nathan and A. Pentland, Decentralizing Privacy, Using Blockchain To Protect Personal Data, (2016), 1-4.
[29] Y. Sowmya and Dr. M. Nagaratna, Parallelizing K-Anonymity Algorithm For Privacy Preserving Knowledge Discovery From Big Data, International Journal Of Applied Engineering Research, 11 (2), (2016), 1314-1321.
[30] Josep Domingo-Ferrer and Jordi Soria-Comas, From T-Closeness To Differential Privacy And Vice Versa In Data Anonymization, (2015), 1-20.
[31] Cyber Security Watch Survey, Http://Www.Cert.Org/Archive/Pdf/Ecrimesummary10.Pdf, 2012.
[32] Y. Sowmya and Dr. M. Nagaratna, M-Sanit: A Framework For Effective Big Data Sanitization Using Map Reduce Programming In Hadoop, International Journal Of Applied Engineering Research, 11 (2), (2017), 1314-1321.
[33] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, L-Diversity: Privacy Beyond K-Anonymity, In Proc. 22nd Intnl. Conf. Data Engg. (ICDE), (2006), 1-24.