Upload
rosa-holt
View
217
Download
0
Embed Size (px)
Citation preview
1
Effects of Rounding on Data Quality
Lawrence H. Cox, Jay J. Kim,Myron Katzoff, Joe Fred Gonzalez, Jr.
U.S. National Center for Health Statistics
2
Outline
I. Introduction
II. Four Rounding Rules
III. Mean and Variance
IV. Distance Measure
V. Concluding Comments
VI. References
3
I. Introduction
Reasons for rounding
• Rounding noninteger values to integer values for statistical purposes;
• To enhance readability of the data;
• To protect confidentiality of records in the file;
• To keep the important digits only.
4
• Purpose of this paper:
Evaluate the effects of four rounding methods on data quality and utility in two ways:
(1) bias and variance;
(2) effects on the underlying distribution of the data determined by a distance measure.
5
B : Base
: Quotient
: Remainder
x xx q B r
xq
xr
( ) ( )x xR x q B R r
6
Types of rounding:
• Unbiased rounding: E [R (r) |r] = r
Example
E[R(3)= 0 or 10|3] = 3
• Sum-unbiased rounding: E [R (r)] = E (r)
7
II. Four rounding rules
1. Conventional rounding
1.1 B even
Suppose r = 0, 1, 2, . . . ,9.
In this case B = 10.
If r (B/2), round r up to 10
else round r down to zero (0).
5
8
1.2 B odd
Round r up to B when
r
Otherwise round down to zero.
Example:
When B = 5, round up r = 3 and 4 to 5;
Otherwise, round down.
1
2
B
9
Assumptions for r and q
r follows a discrete uniform distribution;
q follows lognormal, Pareto of second kind or multinomial distribution.
Thus,
x has some mixed distribution, but the term qB dominates.
10
2. Modified Conventional rounding
Same as conventional rounding, except whenrounding 5 (B/2) up or down with probability ½.
3. Zero-restricted 50/50 rounding
Except zero (0), round r up or down with probability ½.
11
4. Unbiased rounding rule
Round r up with probability r/B,
and
Round r down with probability 1 - r/B
Example:
r = 1, P [R(1)=B] = 1/10P [R(1)=0] = 9/10
12
III. Mean and variance
III.1 Mean and variance of unrounded number
r = 0, 1, 2, 3, . . . , B-1.
P (r) = and
E (r) =
= .
1
1
1B
r
rB
1
2
B
1
B
13
=
In general when r and q are independent,
22 1
( ) ( )12
BV x B V q
2 1( )
12
BV r
( ) ( | ) ( | )V x V E x q E V x q
12 2
1
1( )
B
r
E r rB
( 1)(2 1)
6
B B
14
III. 2 Conventional rounding when B is even
for unrounded number.
1( )
2
BE r
[ ] Pr[ ( ) ] 0 Pr[ ( ) 0]
2
E R r B R r B R r
B
15
2
[ ]4
BV R r
2 1( ) .
12
BV r for unrounded number
2 2 1( )
2E R r B
1[ ] 3 ( )
12V R r V r
16
2
2( ) ( ) ,4
BV R x B V q and
2
2 1( ) ( ) .
4
BMSE R x B V q
17
III.3 Conventional rounding when B is odd
Note ,
(B-1)/2 out of B elements can be rounded up.
is sum unbiased, and
for unrounded number
1[ ]
2
BE R r
2 1[ ( )] .
4
BV R r
2 1( )
12
BV r
1P ( )
2
BR r B
B
18
P [R (r) = B] for modified conventional rounding.
P [R (r) = B] for 50/50 rounding: same as above.
No. of elements which can be rounded up: B-1.
All B elements. Probability of rounding up is ½.
1P ( )
2 2
BR r B
B
1 1 1P ( )
2 2 2
BR r B
B B
19
P [R (r) = B] for unbiased rounding
Modified conventional rounding,
50/50 rounding
and
unbiased rounding
have the same mean, variance and MSE
as the conventional rounding with odd B.
1
1
1 1P ( )
2
B
r
r BR r B
B B B
20
IV. Distance measure
Assume that when x = 0, U = 0.
Define
2
2
[ ( ) ]
[ ( ) ]x x
R x xU
x
R r r
x
1,
0, .x
x
if r is rounded up
otherwise
21
Reexpressing the numerator of U, we have
With conventional rounding with B=10,
Then we have
2( )x xB r
21
0
( )( | ) | ( ).
x
x
x xx
B rE U x x P
x
1 5, 6, . . , 9x xwith r
22
Expected value of U
We define
2
( ) ( | ) |
( )| ( ) | ( ) ( ).
x x x
q r
x xx x x
q r
E U E E E U x q
B rx P q P r P q
x
21
10
( )( | ) | | ( ) | ( ).
x x
x xr x x
r x x
B rU E E U x q x P q P r
q B r
23
IV.1 Conventional rounding with B even
which can be expressed as
2 2/ 2 1 1
1 / 21
( ) 1
x x
B Bx x
x x x xr r B
r B rU
q B r q B r B
2/ 2 1 1
11 / 2
( ) 1
x x
B Bx
xr r B x
B rU r
r B
2 2/ 2 1 1
1 / 2
( ) 1
x x
B Bx x
x x x xr r B
r B r
q B r q B r B
24
Sum of 1/r terms.
Recall the harmonic series:
The upper and lower bounds for harmonic series
2/ 2 1 1
11 / 2
( ) 1[ ]
x x
B Bx
xr r xB
B rU r
r B
2 2/ 2 1 1
1 / 2
( ) 1[ ]
x x
B Bx x
x xr r B
r B r
q B q B B
ln( 1) 1 ln( )nn H n
1 1 11 2 3 . .nH n
25
The upper bound for the first term of is
The second term of is
Note that the second term of E(U) is
1ln
2
2( 1)
2
BB
B
B
1U
/ 2 1 12 2
21 / 2
1( )
1x
x x
B B
q x xr r Bx
E r B rBq
2 2
12
B
B
1U
26
IV.2 Modified conventional rounding with even B
This has the same E(U) as conventional rounding.
IV.3 50/50 rounding
The first term of is
The second term of is
1U
1ln ( 1)
2 2
BB
22 3 1
6
B B
B
1U
27
IV.4 Unbiased rounding
The first term of is
The second term of is
1U
1
2
B
2 1
6
B
B
1U
28
IV.5 Comparison of four rounding rules
Conventional or
Mod. Conven.
50/50 Unbiased
Term
1
Term
2
2( 1) 1ln
2 2
B BB
B
2 2 1
12
BE
B q
1ln( 1)
2 2
BB
22 3 1 1
6
B BE
B q
1
2
B
2 1 1
6
BE
B q
29
Comparison of four rounding rulesB = 10
Conventional or
Mod. Conven.
50/50 Unbiased
Term
1 2.61 11.49 (4.4) 4.5 (1.7)
Term
2 .85 2.85 1.651
Eq
1E
q
1E
q
30
Comparison of four rounding rulesB = 1,000
Conventional or
Mod. Conven.
50/50 Unbiased
Term
1 194 3,454 (18) 500 (2.6)
Term
2 83 323 1661
Eq
1E
q
1E
q
31
IV.6 E(1/q) for log-normal distribution
Suppose
and
Then, x has a lognormal distribution, i.e.,
2( , )y N
2
12
ln12( | , ) 2 , 0
x
f x x e x
yx e
32
Let
Then
which is equivalent to
2
1( | , )c f x dx
21
1 1 1| 1, , ( )E q f q dq
q c q
212
11e
c
33
IV.6 E(1/q) for Pareto distribution of the 2nd kind
The Pareto distribution of the second kind is
In the above k = min(q).
Let
1( ) , 0, 0
a
a
akf q a q k
q
1( | 0, 0)c f x a q k dx
34
IV.7 Upper bound for E(1/q) for multinomial distribution
The multinomial distribution has the form
= 0,1,2,
11
( 1)
a kE
q c a
1 2 1 21
1
!( , , . . | , , . . )
!
i
kq
k k iki
ii
nf q q q p p p p
q
iq
35
In binomial distribution, we let
When is truncated at 1 from below, we have
Note that
for all i.
1s p
1 1 3
1 ( 1)( 2)i i i i
E E Eq q q q
( )if q
1( | 1, , )
1in ni
i i ni
nf q q n p p s
ns
36
In general,
Using the above relationship, we have the following
does not generate any term having or .
Hence,
1 23
1|E q q
q
1 2 3 3 1 2 2 1 1( ) ( | ) ( | ) ( )P q q q P q q q P q q P q
1 2 11 2 3 1 2 3
1 1 1 1| |E E E E q q q
q q q q q q
1q 2q
1 2 11 2 3 3 2 1
1 1 1 1| |E E q q E q E
q q q q q q
37
The upper bound of the expected value is
Let be the size of the category j andjn
1
1
1
1
2
211 2
5( 1)( 2) 2( 2) 6
2( 1)( 2) (1 )
1
i
jj
i
jj
n
i i i
n
i i
k
ik
n n p s n p
n n p s
Eq q q
1 21 1 11 2 1 1 1
2 2 2
1 1 3
1 ( 1)( 2)
1 3 1 3
1 ( 1)( 2) 1 ( 1)( 2)
kq q qk
k k k
Eq q q q q q
q q q q q q
38
V. Concluding comments
Various methods of rounding and in some applications various choices for rounding base B are available.
The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving distributional properties of original data and, quantitatively, what is the expected distortion due to rounding?
39
The expected value of U, the distance measure, is intractable, so we derived its upper bound.
The expected value of 1/q is also intractable for a multinomial distribution. So we derived an upper bound. There should be room for improvement.
This paper provides a preliminary analysis toward answering these questions.
In summary,
• In terms of bias, unbiased rounding is optimal.
40
• In terms of the distance measure, conventional or modified conventional rounding performs best.
• In terms of protecting confidentiality, 50/50 rounding rule is best.
VI. References
Grab, E.L & Savage, I.R. (1954), Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables, Journal of the American Statistical Association 49, 169-177.
N.L. Johnson & S. Kotz (1969). Distributions in Statistics, Discrete Distributions, Boston: Houghton Mifflin Company.
41
N.L. Johnson & S. Kotz (1970). Distributions in Statistics, Continuous Univariate Distributions-1, New York: John Wiley and Sons, Inc.
Kim, Jay J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004), Effects of Rounding Continuous Data Using Rounding Rules, Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).
Vasek Chvatal. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html