An Experimental Study of Association Rule Hiding Techniques
Emmanuel Pontikakis*
[email protected]@ceid.upatras.gr
Dept. of Computer Engineering and Informatics, University of Patras, Patra, Greece
Vassilios Verykios*
[email protected]@cti.gr
Dept. of Computer and Communication Engineering, University of Thessaly, Volos, Greece
*Computer Technology Institute, Research Unit 3, Athens, Greece
Outline
Introduction - Related Work
Distortion-based Techniques
Blocking-based Techniques
Comparison and Analysis
Conclusions
Introduction
[Diagram; labels: Database, User, Data Mining, Association Rules, Hide Sensitive Rules, Changed Database]
Related Work
Association Rule Hiding:
  Blocking-based Technique (Saygin, Verykios, Clifton)
  Distortion-based (Sanitization) Technique (Oliveira, Zaiane, Verykios, Dasseni)
Distortion-based Techniques
Sample Database
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
Rule A→C has: Support(A→C) = 80%, Confidence(A→C) = 100%

Distortion Algorithm

Distorted Database (C changed from 1 to 0 in transactions 2 and 5)
A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1
Rule A→C now has: Support(A→C) = 40%, Confidence(A→C) = 50%
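The support and confidence values on this slide can be rechecked with a few lines of Python. This is only a minimal sketch: the row-string encoding and the helper names `count`, `support` and `confidence` are assumptions made for illustration, not part of the presentation.

```python
from fractions import Fraction

COLUMNS = "ABCD"  # item order used in the slide's tables

# One string per transaction, one character per item (assumed encoding).
SAMPLE = ["1110", "1011", "0001", "1110", "1011"]
DISTORTED = ["1110", "1001", "0001", "1110", "1001"]


def count(db, items):
    """Number of transactions in which every item in `items` is 1."""
    cols = [COLUMNS.index(i) for i in items]
    return sum(all(row[c] == "1" for c in cols) for row in db)


def support(db, lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return Fraction(count(db, lhs + rhs), len(db))


def confidence(db, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return Fraction(count(db, lhs + rhs), count(db, lhs))


for name, db in (("sample", SAMPLE), ("distorted", DISTORTED)):
    print(name, float(support(db, "A", "C")), float(confidence(db, "A", "C")))
```

Run on the two tables, this gives 0.8 and 1.0 for the sample database and 0.4 and 0.5 for the distorted one, in agreement with the percentages on the slide.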
Side Effects
Before hiding process: Rule Ri had conf(Ri) > MCT. After hiding process: Rule Ri now has conf(Ri) < MCT. Side effect: Rule Eliminated (undesirable side effect).
Before hiding process: Rule Ri had conf(Ri) < MCT. After hiding process: Rule Ri now has conf(Ri) > MCT. Side effect: Ghost Rule (undesirable side effect).
Before hiding process: Large itemset I had sup(I) > MST. After hiding process: Itemset I now has sup(I) < MST. Side effect: Itemset Eliminated (undesirable side effect).
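The three cases can be written as a small classifier. This is a sketch under assumptions: MCT and MST are taken to be the minimum confidence and minimum support thresholds, the function names are hypothetical, and values exactly equal to a threshold are left unclassified because the slide does not cover that case.

```python
def rule_side_effect(conf_before, conf_after, mct):
    """Classify what the hiding process did to a non-sensitive rule."""
    if conf_before > mct and conf_after < mct:
        return "rule eliminated"
    if conf_before < mct and conf_after > mct:
        return "ghost rule"
    return None  # the rule stays on the same side of the threshold


def itemset_side_effect(sup_before, sup_after, mst):
    """Classify what the hiding process did to a large itemset."""
    if sup_before > mst and sup_after < mst:
        return "itemset eliminated"
    return None


# Hypothetical values, not taken from the experiments.
print(rule_side_effect(0.80, 0.55, mct=0.70))
print(rule_side_effect(0.60, 0.75, mct=0.70))
print(itemset_side_effect(0.40, 0.20, mst=0.30))
```

For a sensitive rule, dropping below MCT is the goal of the hiding process rather than a side effect, so the classifier is meant for the remaining rules only.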
Distortion-based Techniques
Challenges/Goals:
To minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
To minimize the number of 1’s that must be deleted from the database.
Algorithms must run in time linear in the size of the database.
Our Proposal: Weight-based Sorting Distortion Algorithm (WSDA)
High Level Description:
Input:
  Initial Database
  Set of Sensitive Rules
  Safety Margin (for example 10%)
Output:
  Sanitized Database
  Sensitive Rules no longer hold in the Database
WSDA Algorithm
High Level Description:
1st step:
  Retrieve the set of transactions that support the sensitive rule RS.
  For each sensitive rule RS, find the number N1 of transactions in which one item that supports the rule will be deleted.
2nd step:
  For each rule Ri in the database that has items in common with RS, compute a weight w that denotes how strong Ri is.
  For each transaction that supports RS, compute a priority Pi that denotes how many strong rules this transaction supports.
3rd step:
  Sort the N1 transactions in ascending order according to their priority value Pi.
4th step:
  For the first N1 transactions, hide an item that is contained in RS.
5th step:
  Update the confidence and support values of the other rules in the database.
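The five steps can be sketched in Python as follows. This is not the authors' implementation: the slides do not say how N1, the weight w, or the priority Pi are computed, so the sketch assumes that N1 is the smallest number of deletions that pushes the confidence below MCT minus the safety margin, that the weight of a rule is its confidence, that the priority of a transaction is the sum of the weights of the related rules it supports, and that the item to hide is taken from the right-hand side of the rule. The name `wsda_sketch` and all parameters are hypothetical.

```python
def supports(transaction, rule):
    """True if the transaction contains both sides of the rule."""
    lhs, rhs = rule
    return lhs <= transaction and rhs <= transaction


def confidence(db, rule):
    lhs, _ = rule
    lhs_count = sum(lhs <= t for t in db)
    both = sum(supports(t, rule) for t in db)
    return both / lhs_count if lhs_count else 0.0


def wsda_sketch(db, sensitive, other_rules, mct, safety_margin):
    db = [set(t) for t in db]  # work on a copy of the database
    lhs, rhs = sensitive
    target = mct - safety_margin  # assumed role of the safety margin

    # Step 1: transactions supporting the sensitive rule, and N1 (assumed rule).
    supporting = [i for i, t in enumerate(db) if supports(t, sensitive)]
    lhs_count = sum(lhs <= t for t in db)
    n1 = 0
    while n1 < len(supporting) and (len(supporting) - n1) / lhs_count >= target:
        n1 += 1

    # Step 2: weights (assumed: confidence) and priorities (assumed: weight sum).
    related = [r for r in other_rules if (r[0] | r[1]) & (lhs | rhs)]
    weights = [confidence(db, r) for r in related]
    priority = {
        i: sum(w for r, w in zip(related, weights) if supports(db[i], r))
        for i in supporting
    }

    # Step 3: ascending priority, so weakly connected transactions come first.
    order = sorted(supporting, key=lambda i: priority[i])

    # Step 4: hide one item of the rule (assumed: an item of the right side).
    victim = min(rhs)
    for i in order[:n1]:
        db[i].discard(victim)

    # Step 5: recompute the confidence of the other rules.
    return db, [confidence(db, r) for r in other_rules]


db = [{"A", "B", "C"}, {"A", "C", "D"}, {"D"}, {"A", "B", "C"}, {"A", "C", "D"}]
sensitive = (frozenset("A"), frozenset("C"))
others = [(frozenset("B"), frozenset("C"))]
new_db, other_conf = wsda_sketch(db, sensitive, others, mct=0.7, safety_margin=0.1)
print(confidence(new_db, sensitive), other_conf)
```

With these assumed choices and thresholds, the sketch removes C from the two transactions that do not support B→C, so A→C drops to 50% confidence while B→C keeps its confidence.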
Experimental Results of WSDA algorithm
[Figure: Itemsets remained unaffected in the database. x-axis: Safety Margin (10%, 20%, 40%, 60%); y-axis: Itemsets Remained (0 to 700); series: 1.b, WSDA]
[Figure: Rules changed in the database. x-axis: Safety Margin (10%, 20%, 40%, 60%); y-axis: Rules Changed (%) (0% to 80%); series: 1.b, WSDA]
Experimental Results of WSDA algorithm
[Figure: Average number of items per transaction: 13/50. x-axis: Database Transactions (2500, 5000, 7500, 10000); y-axis: Time in secs (0 to 90); series: 1.b, WSDA]
[Figure: Average number of items per transaction: 20/50. x-axis: Database Transactions (2500, 5000, 7500, 10000); y-axis: Time in secs (0 to 140); series: 1.b, WSDA]
Quality of Data
Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the false data may create undesirable effects.
So, we have to hide the rules in the database by adding uncertainty, without distorting the database.
Blocking-based Techniques
Initial Database
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Blocking Algorithm

New Database
A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1
Support and confidence become marginal.
In the new database: 60% ≤ conf(A → C) ≤ 100%
Modification of Association Rule Definition
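The confidence and support of a rule A→B become marginal:
$$\mathrm{sup}(A \to B) \in [\mathrm{minsup}(A \to B),\ \mathrm{maxsup}(A \to B)]$$
$$\mathrm{conf}(A \to B) \in [\mathrm{minconf}(A \to B),\ \mathrm{maxconf}(A \to B)]$$
$$\mathrm{minsup}(A \to B) = \frac{|(A=1) \wedge (B=1)|}{|D|}$$
$$\mathrm{maxsup}(A \to B) = \frac{|(A=1) \wedge (B=1)| + |(A=1) \wedge (B=?)| + |(A=?) \wedge (B=1)| + |(A=?) \wedge (B=?)|}{|D|}$$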
Modification of Association Rule Definition
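$$\mathrm{minconf}(A \to B) = \frac{|(A=1) \wedge (B=1)|}{|A=1| + |(A=?) \wedge (B=0)| + |(A=?) \wedge (B=?)|}$$
$$\mathrm{maxconf}(A \to B) = \frac{|(A=1) \wedge (B=1)| + |(A=1) \wedge (B=?)| + |(A=?) \wedge (B=1)| + |(A=?) \wedge (B=?)|}{|A=1| + |(A=?) \wedge (B=1)| + |(A=?) \wedge (B=?)|}$$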
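As a check of these interval definitions, the sketch below evaluates them for the rule A→C on the blocked database from the Blocking-based Techniques slide. It assumes single-item left and right sides and a row-string encoding with `?` for blocked values; the helper names `n`, `n_item` and `bounds` are hypothetical.

```python
from fractions import Fraction

COLUMNS = "ABCD"
# Blocked database: one string per transaction, '?' marks a blocked value.
NEW_DB = ["1110", "10?1", "?001", "1110", "1011"]


def n(db, a, b, va, vb):
    """Count transactions where item a has value va and item b has value vb."""
    ia, ib = COLUMNS.index(a), COLUMNS.index(b)
    return sum(row[ia] == va and row[ib] == vb for row in db)


def n_item(db, a, va):
    """Count transactions where item a has value va."""
    ia = COLUMNS.index(a)
    return sum(row[ia] == va for row in db)


def bounds(db, a, b):
    """Return (minsup, maxsup, minconf, maxconf) for the rule a -> b."""
    both = n(db, a, b, "1", "1")
    # Transactions that could support the rule if every '?' were resolved in its favour.
    possible = both + n(db, a, b, "1", "?") + n(db, a, b, "?", "1") + n(db, a, b, "?", "?")
    a1 = n_item(db, a, "1")
    minsup = Fraction(both, len(db))
    maxsup = Fraction(possible, len(db))
    # Denominators can be zero on other data; the sketch does not guard against that.
    minconf = Fraction(both, a1 + n(db, a, b, "?", "0") + n(db, a, b, "?", "?"))
    maxconf = Fraction(possible, a1 + n(db, a, b, "?", "1") + n(db, a, b, "?", "?"))
    return minsup, maxsup, minconf, maxconf


print([float(x) for x in bounds(NEW_DB, "A", "C")])
```

The confidence interval comes out as 60% to 100%, the same range quoted for conf(A → C) on the blocked example database; the support interval is 60% to 80%.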
Negative Border Rules Set (NBRS) Definition
When a rule R has either
sup(R) > MST AND conf(R) < MCT
OR
sup(R) < MST AND conf(R) > MCT,
then we say that R belongs to NBRS.
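The definition translates directly into a membership test. A minimal sketch, assuming MST and MCT are the minimum support and confidence thresholds; the function name `in_nbrs` and the numeric values are made up for illustration.

```python
def in_nbrs(sup, conf, mst, mct):
    """True if exactly one of the two thresholds is exceeded, per the definition."""
    return (sup > mst and conf < mct) or (sup < mst and conf > mct)


# Hypothetical thresholds and rule statistics.
MST, MCT = 0.30, 0.60
print(in_nbrs(0.50, 0.40, MST, MCT))  # frequent but not confident enough
print(in_nbrs(0.10, 0.90, MST, MCT))  # confident but not frequent enough
print(in_nbrs(0.50, 0.80, MST, MCT))  # the rule already holds
```

Rules in this set are the ones closest to becoming valid, which is presumably why the blocking algorithm tracks them when it looks for ghost rules.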
Side Effects Definition Modification in Blocking-based Techniques
Before hiding process: Rule Ri had conf(Ri) > MCT. After hiding process: Rule Ri now has minconf(Ri) < MCT. Side effect: Rule Eliminated (undesirable side effect).
Before hiding process: Rule Ri had conf(Ri) < MCT. After hiding process: Rule Ri now has maxconf(Ri) > MCT. Side effect: Ghost Rule (desirable side effect).
Before hiding process: Large itemset I had sup(I) > MST. After hiding process: Itemset I now has minsup(I) < MST. Side effect: Itemset Eliminated (undesirable side effect).
Before hiding process: Itemset I had sup(I) < MST. After hiding process: Itemset I now has maxsup(I) > MST. Side effect: Ghost Itemset (desirable side effect).
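In the blocking setting the same classification uses the interval bounds instead of a single value. The following is a sketch with hypothetical function names and inputs; cases where a value equals the threshold are left unclassified, since the slide does not define them.

```python
def blocking_rule_side_effect(conf_before, minconf_after, maxconf_after, mct):
    """Return (side effect, desirability) for a non-sensitive rule, or None."""
    if conf_before > mct and minconf_after < mct:
        return ("rule eliminated", "undesirable")
    if conf_before < mct and maxconf_after > mct:
        return ("ghost rule", "desirable")
    return None


def blocking_itemset_side_effect(sup_before, minsup_after, maxsup_after, mst):
    """Return (side effect, desirability) for an itemset, or None."""
    if sup_before > mst and minsup_after < mst:
        return ("itemset eliminated", "undesirable")
    if sup_before < mst and maxsup_after > mst:
        return ("ghost itemset", "desirable")
    return None


# Hypothetical values for illustration only.
print(blocking_rule_side_effect(0.80, 0.55, 0.85, mct=0.70))
print(blocking_rule_side_effect(0.60, 0.50, 0.75, mct=0.70))
print(blocking_itemset_side_effect(0.25, 0.20, 0.35, mst=0.30))
```

Ghost rules and ghost itemsets count as desirable here because they add uncertainty about which rules actually hold in the original database.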
Privacy Breaches Definitions
If an item i, some values of which are hidden by ?’s, is contained in a sensitive rule, a privacy breach occurs if the adversary can infer this with c% confidence.
For a rule R with maxconf(R) > MCT, a privacy breach occurs if it can be estimated, with c% confidence, whether R is a sensitive rule or a ghost rule.
For a blocked item i in a specific transaction T, a privacy breach occurs if the adversary can estimate, with c% confidence, whether its original value is 0 or 1.
Blocking-Based Techniques
Goals that an algorithm has to achieve:
To insert a relatively small number of ?’s while significantly reducing the confidence of sensitive rules.
To minimize the undesirable side effects (rules and itemsets lost) by selecting the items in the appropriate transactions to change, and to maximize the desirable side effects.
To modify the database in such a way that an adversary cannot recover the original values of the database.
Our Proposal: Blocking Algorithm (BA)
High Level Description
1st step:
  For each sensitive rule RS (with left itemset IL and right itemset IR), compute how many 0’s and 1’s have to be blocked in order to reduce the confidence of RS.
2nd step:
  Find the set of transactions TR that support RS and the set of transactions TLpR’ that partially support RS (they partially support the left itemset and do not support the right itemset).
  For each transaction in TR, find the rules Rcommon that have at least one item in common with IR, and for each transaction in TLpR’, find the rules R’common ∈ NBRS that have at least one item in common with IL.
  Assign a weight w to each Rcommon and a weight w’ to each R’common.
  Assign a priority value PT to each transaction Ti in TR, such that PT is large if Ti has many Rcommon rules with large w, and a priority value PT’ to each transaction Ti’ in TLpR’, such that PT’ is small if Ti’ has many R’common rules with large w’.
3rd step:
  Sort the transactions T ∈ TR starting from those with the lowest PTi, and sort the transactions T’ ∈ TLpR’ starting from those with the highest PTi’.
4th step:
  For the first N1 sorted T ∈ TR, block an item i ∈ IR, and for the first N0 sorted T’ ∈ TLpR’, block an item i ∈ IL.
5th step:
  Update the values minconf(Ri) and minsup(Ri) for all other rules that have been affected.
Blocking-Based Techniques
Main problems of the blocking technique:
1. The maximum confidence of a sensitive rule cannot be reduced.
2. An adversary who applies a smart inference technique can infer the hidden values if the blocking algorithm does not add much uncertainty to the database.
3. Both 0’s and 1’s must be hidden, because if only 1’s were hidden the adversary would simply replace all the ?’s with 1’s and easily restore the initial database.
4. Many ?’s must be inserted if we do not want an adversary to infer the hidden data.
Experimental Results of Blocking Algorithm
[Figure: Large itemsets remained after the hiding process. x-axis: Safety Margin (10%, 20%, 40%, 60%); y-axis: Large Itemsets Remained (0 to 700); series: BA, CRA]
[Figure: Rules changed (%) after the process. x-axis: Safety Margin (10%, 20%, 40%, 60%); y-axis: Rules Changed (%) (0% to 100%); series: BA, CRA]
Experimental Results of Blocking Algorithm (2)
[Figure: Databases with average 20 items per transaction. x-axis: Database Transactions (2500, 5000, 7500, 10000); y-axis: Time in secs (0 to 160); series: BA, CRA]
[Figure: Databases with average 13 items per transaction. x-axis: Database Transactions (2500, 5000, 7500, 10000); y-axis: Time in secs (0 to 140); series: BA, CRA]
Experimental Results of Blocking Algorithm (3)
[Figure: Rules changed when we change the proportion 0:1. x-axis: Proportion (0:1) ("3:1", "2:1", "1:1", "1:2", "1:3"); y-axis: Rules changed (0 to 200)]
[Figure: Decision Tree Experiments, Misclassified Items (%). x-axis: Safety Margin (10%, 20%, 40%, 60%); y-axis: Misclassified Items (%) (0% to 45%)]
Comparison and Analysis
Privacy breaches. Distortion-based techniques: no privacy breaches. Blocking-based techniques: many kinds of privacy breaches.
Simplicity of algorithms. Distortion-based techniques: simpler. Blocking-based techniques: more complicated.
Database modification. Distortion-based techniques: database contains false information. Blocking-based techniques: many ?’s must be inserted in the database.
Conclusions
There are open research problems in the Blocking Technique:
A) What techniques must be used in order to reduce the privacy breaches?
B) In what other ways can we prevent an adversary from inferring the association rules in the database?
C) Applying a chi-square test to the final database may reveal some correlations between the items.
References
Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke, Privacy Preserving Mining of Association Rules, SIGKDD 2002, Edmonton, Alberta, Canada.
Murat Kantarcioglu and Chris Clifton, Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2002), 24–31.
Jaideep Vaidya and Chris Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002), 639–644.
Stanley R. M. Oliveira and Osmar R. Zaïane, Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining, In Proc. of the Seventh International Database Engineering & Applications Symposium (IDEAS'03), pp. 54-63, Hong Kong, July 16-18, 2003.
Yucel Saygin, Vassilios Verykios, and Chris Clifton, Using Unknowns to Prevent Discovery of Association Rules, SIGMOD Record 30 (2001), no. 4, 45–54.
Vassilios S. Verykios, Ahmed K. Elmagarmid, Elisa Bertino, Yucel Saygin, and Elena Dasseni, Association Rule Hiding, IEEE Transactions on Knowledge and Data Engineering (2003).