PDM Workshop April 8, 2006
Deriving Private Information from Perturbed Data Using IQR-based Approach
Songtao Guo (UNC Charlotte), Xintao Wu (UNC Charlotte), Yingjiu Li (Singapore Management University)
Source: http://www.privacyinternational.org/survey/dpmap.jpg
• HIPAA for health care
• California State Bill 1386
• Gramm-Leach-Bliley Act for financial institutions
• COPPA for children's online privacy, etc.
• PIPEDA 2000
• European Union (Directive 95/46/EC)
Mining vs. Privacy
• Data mining
  The goal of data mining is to obtain summary results (e.g., classification, clustering, association rules) from the data (distribution)
• Individual privacy
  Individual values in the database must not be disclosed, or at least no close estimate can be derived by attackers
• Privacy Preserving Data Mining (PPDM)
  How can we "perturb" data such that we can build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?
Our Focus
     SSN  Name  Zip    Age  Sex  Balance  …  Income  Interest Paid
1    ***  ***   28223  20   M    10k      …  85k     2k
2    ***  ***   28223  30   F    15k      …  70k     18k
3    ***  ***   28262  20   M    50k      …  120k    35k
.    .    .     .      .    .    .        …  .       .
n    ***  ***   28223  20   M    80k      …  110k    15k

Annotations on the table: "Focus in this talk"; "k-anonymity, L-diversity, SDC, etc."
Additive Noise based PPDM
• Distribution reconstruction
  AS method (Agrawal and Srikant, SIGMOD 00)
  EM method (Agrawal and Aggarwal, PODS 01)
• Individual value reconstruction
  Spectral Filtering (SF) (Kargupta et al., ICDM 03)
  PCA (Huang, Du and Chen, SIGMOD 05)
Additive Randomization (Y = X + R)
[Figure: original records (50 | 40K | ..., 30 | 70K | ..., ...) pass through randomizers to give the perturbed records (65 | 20K | ..., 25 | 60K | ..., ...). The distributions of Age and Salary are reconstructed from the perturbed records and fed to a classification algorithm to build the model. Example: Alice's age 30 becomes 65 (30 + 35) when a random number is added to Age.]
• R. Agrawal and R. Srikant, SIGMOD 00
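The randomization step Y = X + R can be sketched in a few lines. This is only an illustration: the function name `additive_randomize`, the uniform noise, and the half-width of 50 are assumptions chosen so that a shift like "30 becomes 65" is possible; the slide does not fix a noise distribution.

```python
import numpy as np

def additive_randomize(x, half_width, rng):
    """Return y = x + r, with r drawn i.i.d. from Uniform[-half_width, half_width].

    Hypothetical helper for illustration; the noise model is an assumption.
    """
    x = np.asarray(x, dtype=float)
    r = rng.uniform(-half_width, half_width, size=x.shape)
    return x + r

rng = np.random.default_rng(0)
ages = np.array([50.0, 30.0, 42.0, 27.0])   # original Age values
perturbed = additive_randomize(ages, half_width=50.0, rng=rng)
print(perturbed)  # each value is shifted by an independent random amount
```

Individual values change a lot, but because the noise has zero mean, aggregates such as the mean are approximately preserved on large samples.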
Distribution Reconstruction
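• Algorithm
  $f_X^0$ := Uniform distribution
  j := 0 // Iteration number
  repeat
    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$
    j := j+1
  until (stopping criterion met)
  Here $x_i + y_i$ is the i-th perturbed value and $f_Y$ is the density of the added noise.
• Converges to maximum likelihood estimate (Agrawal and Aggarwal, PODS 01)
[Figure: Number of People versus Age (20 to 60; vertical axis 0 to 1200) for the Original, Randomized, and Reconstructed distributions.]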
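The iterative update on this slide can be sketched on a discretized grid. This is a minimal sketch under stated assumptions, not the authors' implementation: the grid, the Gaussian noise, the stopping rule (L1 change below `tol`), and the names `reconstruct_distribution` and `gaussian_pdf` are all illustrative choices. The estimate is stored as probability mass per grid point, so the bin width cancels in the ratio.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, n_iter=200, tol=1e-6):
    """Iterative Bayesian reconstruction of the distribution of X from w = x + noise.

    w         : perturbed values (1-D array)
    noise_pdf : callable giving the noise density f_Y
    grid      : candidate values a_k for X (assumed discretization)
    Returns the estimated probability mass at each grid point.
    """
    w = np.asarray(w, dtype=float)
    grid = np.asarray(grid, dtype=float)
    lik = noise_pdf(w[:, None] - grid[None, :])      # f_Y(w_i - a_k)
    f = np.full(grid.size, 1.0 / grid.size)          # f_X^0: uniform
    for _ in range(n_iter):
        joint = lik * f                              # numerator terms
        denom = joint.sum(axis=1, keepdims=True)     # discretized integral
        denom = np.maximum(denom, np.finfo(float).tiny)  # guard against 0/0
        f_new = (joint / denom).mean(axis=0)         # average of posteriors
        f_new /= f_new.sum()
        converged = np.abs(f_new - f).sum() < tol    # assumed stopping criterion
        f = f_new
        if converged:
            break
    return f

def gaussian_pdf(sigma):
    """Density of N(0, sigma^2) as a callable."""
    return lambda r: np.exp(-0.5 * (r / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(40.0, 5.0, size=2000)                 # hidden original ages
w = x + rng.normal(0.0, 10.0, size=x.size)           # released perturbed ages
grid = np.linspace(0.0, 100.0, 101)
f_hat = reconstruct_distribution(w, gaussian_pdf(10.0), grid)
print("reconstructed mean:", float((grid * f_hat).sum()))
```

The reconstructed distribution is much narrower than the distribution of the perturbed values, which is what makes it useful to both the miner and the attacker.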
Individual Reconstruction
• Spectral Filtering Technique (Kargupta et al., ICDM 03)
  Perturbed data: $U_p = U + V$, where V is the noise
  Apply EVD: $\Sigma_{U_p} = Q \Lambda Q^T$
  Using the covariance of V, extract the first k principal components
  $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge \lambda_e$, and $e_1, e_2, \ldots, e_k$ are the corresponding eigenvectors of $\Sigma_{U_p}$
  $Q_k = [e_1\ e_2\ \cdots\ e_k]$ forms an orthonormal basis of a subspace $\mathcal{X}$
  Find the orthogonal projection onto $\mathcal{X}$: $P = Q_k Q_k^T$
  Estimate the data as $\hat{U} = U_p P$
• PCA Technique (Huang, Du and Chen, SIGMOD 05)
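The projection step above can be sketched with NumPy. This is a simplified illustration: it takes k as given, whereas the spectral filtering method chooses k from bounds on the noise eigenvalues. The function name `project_onto_top_k` and the synthetic data are assumptions; records are stored as rows, so the estimate is `U_p @ P`.

```python
import numpy as np

def project_onto_top_k(U_p, k):
    """Estimate the original data by projecting perturbed rows onto the top-k eigenvectors.

    U_p : perturbed data, one record per row
    k   : number of principal components kept (assumed known here)
    """
    mean = U_p.mean(axis=0)
    centered = U_p - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    Q_k = eigvecs[:, top]                             # orthonormal basis of the subspace
    P = Q_k @ Q_k.T                                   # orthogonal projection
    return centered @ P + mean

rng = np.random.default_rng(0)
latent = rng.normal(0.0, 10.0, size=(2000, 2))        # two hidden factors
U = latent @ rng.normal(size=(2, 10))                 # 10 correlated attributes
U_p = U + rng.normal(0.0, 1.0, size=U.shape)          # additive noise V
U_hat = project_onto_top_k(U_p, k=2)
print("error before filtering:", np.linalg.norm(U_p - U))
print("error after filtering: ", np.linalg.norm(U_hat - U))
```

When the attributes are strongly correlated, most of the noise lies outside the signal subspace and is removed by the projection.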
Motivation
• The goal of randomization-based perturbation
  To hide the sensitive data by randomly modifying the data values using some additive noise
  To keep the aggregate characteristics or distribution unchanged or recoverable
• Do those aggregate characteristics or distributions contain confidential (private) information that snoopers may exploit to derive an individual's sensitive data?
Our Scenario
• A single party (data holder) holds a collection of original individual data

       Balance  …  Income  Interest Paid
  1    10k      …  85k     2k
  2    15k      …  70k     18k
  3    50k      …  120k    35k
  .    .        …  .       .
  n    80k      …  110k    15k

• Each individual data value is associated with one privacy interval
  privacy policies
  corporate agreements
• The data holder can utilize or release data to a third party for analysis; however, the holder is required not to disclose any individual data within its privacy interval
Inter-Quantile Range (IQR)
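• The Inter-Quantile Range $[x_{\alpha_1}, x_{\alpha_2}]$ is defined by $P(x_{\alpha_1} \le x \le x_{\alpha_2}) \ge c\%$, where $c = \alpha_2 - \alpha_1$ denotes the confidence.
• IQR measures the amount of spread and variability of the variable. Hence attackers can use it to estimate the range of each individual value.
• IQR we used: $[x_{(1-c)/2}, x_{(1+c)/2}]$
[Figure: distribution plot marking the quantiles $x_{\alpha_1}$ and $x_{\alpha_2}$ at levels $\alpha_1$ and $\alpha_2$.]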
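The symmetric range $[x_{(1-c)/2}, x_{(1+c)/2}]$ can be computed from a sample with empirical quantiles. A minimal sketch, assuming the attacker holds sample values (or values drawn from a reconstructed distribution); the name `iqr_interval` is illustrative.

```python
import numpy as np

def iqr_interval(values, c):
    """Return (x_{(1-c)/2}, x_{(1+c)/2}) for confidence c given as a fraction, e.g. 0.95."""
    lo = (1.0 - c) / 2.0
    hi = (1.0 + c) / 2.0
    q_lo, q_hi = np.quantile(np.asarray(values, dtype=float), [lo, hi])
    return float(q_lo), float(q_hi)

sample = np.arange(101)              # toy data: 0, 1, ..., 100
print(iqr_interval(sample, 0.95))    # central 95% range of the sample
```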
Comparison with Other Privacy Definitions
• Interval privacy (Agrawal and Srikant, SIGMOD 00)
  If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b - a) defines the amount of privacy at the c% confidence level
• Mutual information (Agrawal and Aggarwal, PODS 01)
• Reconstruction privacy (Rizvi and Haritsa, VLDB 02)
• $\rho_1$-to-$\rho_2$ privacy breach (Evfimievski et al., PODS 03)
Disclosure Measure
Measure similarity:
  $d_i = \frac{[u_i^l, u_i^u] \;\square\; [x_{(1-c)/2}, x_{(1+c)/2}]}{[u_i^l, u_i^u] \;\square\; [x_{(1-c)/2}, x_{(1+c)/2}]}$  (□ marks an interval operator that is missing from the extracted slide text)
  $D = \frac{1}{n} \sum_{i=1}^{n} d_i$
  $[u_i^l, u_i^u]$: individual's privacy interval
  $[x_{(1-c)/2}, x_{(1+c)/2}]$: attacker's estimated range
A point is completely disclosed if its estimated range
• contains the original value
• fully falls within the pre-specified privacy interval
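The two conditions for a completely disclosed point can be checked directly. This sketch assumes both conditions must hold together, that the privacy interval takes the form $[u_i(1-p), u_i(1+p)]$ used in the empirical evaluation, and that the values $u_i$ are positive; the function name `complete_disclosure_ratio` is illustrative.

```python
import numpy as np

def complete_disclosure_ratio(u, p, est_lo, est_hi):
    """Fraction of points whose estimated range [est_lo, est_hi]
    (a) contains the original value u_i and
    (b) lies fully inside the privacy interval [u_i(1-p), u_i(1+p)].
    """
    u = np.asarray(u, dtype=float)
    priv_lo, priv_hi = u * (1.0 - p), u * (1.0 + p)
    contains_value = (est_lo <= u) & (u <= est_hi)
    inside_privacy = (priv_lo <= est_lo) & (est_hi <= priv_hi)
    return float(np.mean(contains_value & inside_privacy))

# Toy example: one estimated range shared by all individuals.
values = np.array([100.0, 60.0, 1000.0])
print(complete_disclosure_ratio(values, p=0.2, est_lo=95.0, est_hi=105.0))
```

Only the first individual is counted here: the range [95, 105] contains 100 and lies inside [80, 120], while it misses the other two values.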
Empirical Evaluation
• Data sets:
  Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs), 50,000 tuples
  Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions), 30,000 tuples
• Pre-specified individual privacy intervals: $[u_i(1-p), u_i(1+p)]$, where p is varied
IQR from Reconstructed Dist. Using AS with Uniform noise
• IQR direct inference (from the perturbed data)
• IQR with AS inference (from the reconstructed distribution)
• IQR ideal inference (from the original data)
• Uniform noise: [-125, 125]
• Bank data set
• Attribute: Stock/Bonds
• 95% IQR
• Information loss for AS: 14.6%
[Figure: ratio of complete disclosure points.]
IQR from Reconstructed Dist. Using AS with Uniform noise

Interval p (%)   No. of disclosed points (%)          D
                 direct   IQR ideal   IQR with AS     ideal   AS
35               13.9     21.2        3.5             0.605   0.663
40               16.0     32.5        15.1            0.660   0.698
45               17.9     43.0        29.6            0.712   0.746
50               19.8     52.9        41.8            0.763   0.796
55               22.0     62.9        53.2            0.814   0.844
60               23.9     72.9        63.4            0.864   0.889
65               26.0     83.3        73.5            0.916   0.932
70               28.0     94.3        83.7            0.972   0.977
75               29.9     99.9        94.5            0.999   0.999
80               32.0     100         100             1       1

$D = \frac{1}{n} \sum_{i=1}^{n} d_i$, where $d_i$ is the per-individual measure from the Disclosure Measure slide.
AS vs. SF with Gaussian Noise
• Gaussian noise N(0,8)
• Signal data set
• Feature 2 (sinusoidally distributed)
• 95% IQR
• Information loss for AS: 32.9%
• Information loss for SF: 47.0%
Disclosure vs. noise
• Uniform noise with varied range
• Bank data set
• Attribute: Stock/Bonds
• 95% IQR
Extend to Multivariate Cases
• In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, $N(\mu, \Sigma)$
• The ellipsoid $\{z : (z - \mu)' \Sigma^{-1} (z - \mu) \le \chi_p^2(\alpha)\}$ contains a fixed percentage, $(1 - \alpha)100\%$, of data values.
• The projection of this ellipsoid on axis $z_i$ has bound: $[\mu_i - \sqrt{\chi_p^2(\alpha)\,\sigma_{ii}},\ \mu_i + \sqrt{\chi_p^2(\alpha)\,\sigma_{ii}}]$
[Figure: ellipse in the ($Z_1$, $Z_2$) plane.]
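For the bivariate case shown in the figure (p = 2), the projection bounds can be computed without a statistics library, because the upper-α quantile of a chi-square distribution with 2 degrees of freedom has the closed form $\chi_2^2(\alpha) = -2 \ln \alpha$. The sketch below is limited to p = 2 for that reason; the function name and the example $\mu$, $\Sigma$ are assumptions.

```python
import math
import numpy as np

def projection_bounds_2d(mu, sigma, alpha):
    """Per-axis bounds mu_i -/+ sqrt(chi2_2(alpha) * sigma_ii) for a bivariate normal."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    chi2 = -2.0 * math.log(alpha)              # exact for 2 degrees of freedom only
    half = np.sqrt(chi2 * np.diag(sigma))
    return np.column_stack([mu - half, mu + half])

mu = np.array([50.0, 100.0])                    # hypothetical means
sigma = np.array([[100.0, 30.0],
                  [30.0, 400.0]])               # hypothetical covariance matrix
bounds = projection_bounds_2d(mu, sigma, alpha=0.05)
print(bounds)                                   # one [low, high] row per attribute
```

Each row is the range an attacker could assign to one attribute from the fitted distribution alone.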
Related Work
• Rotation based approach: Y = RX
  When R is an orthonormal matrix ($R R^T = I$), the following are preserved:
    Vector length: |Rx| = |x|
    Euclidean distance: |Rx - Ry| = |x - y|
    Inner product: <Rx, Ry> = <x, y>
  Popular classifiers and clustering methods are invariant to this perturbation.
• K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE 2006.
• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.
Is Y=RX Secure?
Y = R X, with $R R^T = R^T R = I$

  R =
     0.3333   0.6667   0.6667
    -0.6667   0.6667  -0.3333
    -0.6667  -0.3333   0.6667

  X =
    10   15   50    45   80
    85   70  120    23  110
     2   18   35   134   15

  Y =
     61.33   63.67  120.00  119.67  110.00
     49.33   30.67   35.00  -59.33   15.00
    -33.67  -21.33  -50.00   51.67  -80.00

Original data (each record is one column of X):

       Bal   income  …  IntP
  1    10k   85k     …  2k
  2    15k   70k     …  18k
  3    50k   120k    …  35k
  4    45k   23k     …  134k
  .    .     .       …  .
  N    80k   110k    …  15k
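The invariants that make rotation perturbation attractive for mining can be checked numerically with the matrices on this slide. The sketch assumes the entries of R are exact thirds (the slide shows them rounded to four decimals); the helper `pairwise_distances` is illustrative.

```python
import numpy as np

# Rotation matrix from the slide, assuming exact thirds behind the rounded entries.
R = np.array([[1.0, 2.0, 2.0],
              [-2.0, 2.0, -1.0],
              [-2.0, -1.0, 2.0]]) / 3.0

# Original data from the slide: one record per column (Bal, income, IntP in thousands).
X = np.array([[10.0, 15.0, 50.0, 45.0, 80.0],
              [85.0, 70.0, 120.0, 23.0, 110.0],
              [2.0, 18.0, 35.0, 134.0, 15.0]])

Y = R @ X   # perturbed data released to the miner

def pairwise_distances(M):
    """Euclidean distances between all pairs of columns of M."""
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

print("R is orthonormal:   ", np.allclose(R @ R.T, np.eye(3)))
print("lengths preserved:  ", np.allclose(np.linalg.norm(X, axis=0), np.linalg.norm(Y, axis=0)))
print("distances preserved:", np.allclose(pairwise_distances(X), pairwise_distances(Y)))
print("inner products kept:", np.allclose(X.T @ X, Y.T @ Y))
```

The same invariance is what an attacker can exploit: anything that pins down R also recovers X exactly, since $X = R^T Y$.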
Our Preliminary Results
• Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers.
• R can be any random matrix

Y = R X + E

  R =
    4.751  2.429  2.282
    1.156  4.457  0.093
    3.034  3.811  4.107

  X =
    10   15   50    45   80
    85   70  120    23  110
     2   18   35   134   15

  E =
    7.334  4.199  9.199  6.208  9.048
    3.759  7.537  8.447  7.313  5.692
    0.099  7.939  3.678  1.939  6.318

  Y =
    265.95  286.63  475.68  581.71  520.53
    394.30  338.49  569.58  174.22  277.79
    362.55  394.11  665.37  776.46  463.08
A-priori Knowledge ICA Based Attack
• Privacy can be breached when a small subset of the original data X is available to attackers

       Bal   income  …  IntP
  1    10k   85k     …  2k
  2    15k   70k     …  18k
  3    50k   120k    …  35k
  4    45k   23k     …  134k
  .    .     .       …  .
  N    80k   110k    …  15k
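To see why a small known subset is dangerous, the sketch below uses a simple known-sample least-squares attack. This is a deliberately simpler stand-in, not the ICA-based attack named in the slide title: the attacker fits $Y \approx RX + b$ on the known records and inverts the fit for everyone else. The function name, the synthetic data, the uniform noise, and the choice of 20 known records are all assumptions; only R is taken from the previous slide.

```python
import numpy as np

def known_sample_attack(Y, X_known, known_idx):
    """Estimate all columns of X from Y = R X + E, given a few known columns of X.

    Fits an affine map Y ~ R_hat X + b_hat by least squares on the known
    records, then inverts it on every perturbed record.
    """
    Y_known = Y[:, known_idx]
    A = np.vstack([X_known, np.ones(X_known.shape[1])])       # (d+1) x m design
    coef, *_ = np.linalg.lstsq(A.T, Y_known.T, rcond=None)     # (d+1) x d
    R_hat = coef[:-1].T                                        # estimated mixing matrix
    b_hat = coef[-1]                                           # estimated noise offset
    return np.linalg.solve(R_hat, Y - b_hat[:, None])

rng = np.random.default_rng(0)
R = np.array([[4.751, 2.429, 2.282],
              [1.156, 4.457, 0.093],
              [3.034, 3.811, 4.107]])            # R from the previous slide
X = rng.uniform(0.0, 150.0, size=(3, 200))      # synthetic private data (assumption)
E = rng.uniform(0.0, 10.0, size=X.shape)        # synthetic noise (assumption)
Y = R @ X + E

known_idx = np.arange(20)                       # attacker knows the first 20 records
X_hat = known_sample_attack(Y, X[:, known_idx], known_idx)
print("mean absolute error:", float(np.abs(X_hat - X).mean()))
```

With only 10% of the records known, the remaining values are recovered to within a few units on a 0 to 150 scale in this toy setting.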
Summary
• The reconstructed distribution can be exploited by attackers to derive sensitive individual information.
• We present a simple IQR-based attacking method.
• Complex and effective attacking methods exist
  More research is needed on attacking methods from the attacker's point of view
Acknowledgement
• NSF Grant
  CCR-0310974
  IIS-0546027
• Personnel
  Xintao Wu
  Songtao Guo
  Ling Guo
• More Info
  http://www.cs.uncc.edu/~xwu/
Information Loss
• Distribution level
  $I(f_X, f'_X) = \frac{1}{2} E\left[\int_{\Omega_X} |f_X(x) - f'_X(x)|\, dx\right]$
• Individual value level
  $ae(U, \hat{U}) = \|U - \hat{U}\|_F$
  $re(U, \hat{U}) = \frac{\|U - \hat{U}\|_F}{\|U\|_F}$
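Both kinds of information loss are easy to compute once the data are in arrays. The sketch below is illustrative: the function names are assumptions, and the distribution-level function evaluates the half-L1 distance for a single pair of discretized distributions rather than the expectation over reconstructions that appears in the formula.

```python
import numpy as np

def absolute_error(U, U_hat):
    """ae(U, U_hat): Frobenius norm of the difference."""
    return float(np.linalg.norm(np.asarray(U) - np.asarray(U_hat), ord="fro"))

def relative_error(U, U_hat):
    """re(U, U_hat): absolute error divided by the Frobenius norm of U."""
    return absolute_error(U, U_hat) / float(np.linalg.norm(np.asarray(U), ord="fro"))

def distribution_information_loss(p, q):
    """Half the L1 distance between two histograms over the same bins.

    Lies between 0 (identical distributions) and 1 (disjoint supports).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(0.5 * np.abs(p / p.sum() - q / q.sum()).sum())

U = np.array([[3.0, 0.0], [0.0, 4.0]])
U_hat = np.zeros_like(U)
print(absolute_error(U, U_hat), relative_error(U, U_hat))
print(distribution_information_loss([0.5, 0.5], [1.0, 0.0]))
```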
National Laws
• US
  HIPAA for health care
    Passed August 21, 1996
    Sets the lowest bar; the States are welcome to enact more stringent rules
  California State Bill 1386
  Gramm-Leach-Bliley Act of 1999 for financial institutions
  COPPA for children's online privacy, etc.
• Canada
  PIPEDA 2000
    Personal Information Protection and Electronic Documents Act
    Effective from Jan 2004
• European Union (Directive 95/46/EC)
  Passed by European Parliament Oct 95 and effective from Oct 98
  Provides guidelines for member state legislation
  Forbids sharing data with states that do not protect privacy
ICA Direct Attack?
• Can we get X when only Y is available?
  It seems Independent Component Analysis (ICA) can help.
  Y = R X + E    General linear perturbation model
  X = A S + N    ICA model
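ICA
Linear mixing process (observed = mixing matrix × source):
  $\begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & & \vdots \\ A_{n1} & \cdots & A_{nm} \end{bmatrix} \begin{bmatrix} s_1(t) \\ \vdots \\ s_m(t) \end{bmatrix}$
Separation process (separated = demixing matrix × observed):
  $\begin{bmatrix} y_1(t) \\ \vdots \\ y_m(t) \end{bmatrix} = \begin{bmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & & \vdots \\ W_{m1} & \cdots & W_{mn} \end{bmatrix} \begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix}$
[Figure: the separated outputs are tested for independence ("Independent?"); a cost function is optimized to update the demixing matrix.]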
Restriction of ICA
• Restrictions:
  All the components $s_i$ should be independent
  They must be non-Gaussian, with the possible exception of one component
  The number of observed linear mixtures m must be at least as large as the number of independent components n
  The matrix A must be of full column rank
• Can we apply ICA directly? No
  There are correlations among the attributes of X
  More than one attribute may have a Gaussian distribution
  Y = RX + E
  X = AS + N