1 Effects of Rounding on Data Quality Lawrence H. Cox, Jay J. Kim, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics


Effects of Rounding on Data Quality

Lawrence H. Cox, Jay J. Kim, Myron Katzoff, Joe Fred Gonzalez, Jr.

U.S. National Center for Health Statistics


Outline

I. Introduction

II. Four Rounding Rules

III. Mean and Variance

IV. Distance Measure

V. Concluding Comments

VI. References


I. Introduction

Reasons for rounding

• To convert noninteger values to integer values for statistical purposes;

• To enhance readability of the data;

• To protect confidentiality of records in the file;

• To keep the important digits only.


• Purpose of this paper:

Evaluate the effects of four rounding methods on data quality and utility in two ways:

(1) bias and variance;

(2) effects on the underlying distribution of the data determined by a distance measure.


$$x = q_x B + r_x$$

B: Base

$q_x$: Quotient

$r_x$: Remainder

$$R(x) = q_x B + R(r_x)$$
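The decomposition into quotient and remainder can be sketched in a few lines of Python. The function names `decompose` and `rounded_value` are illustrative only, and the sketch assumes a nonnegative integer x; which remainders are rounded up is left to the rounding rule.

```python
def decompose(x, base):
    """Return (q_x, r_x) such that x == q_x * base + r_x and 0 <= r_x < base."""
    if x < 0 or base <= 0:
        raise ValueError("this sketch assumes x >= 0 and base > 0")
    return divmod(x, base)


def rounded_value(x, base, round_up):
    """R(x) = q_x * B + R(r_x), where R(r_x) is B when rounding up and 0 otherwise."""
    q, _ = decompose(x, base)
    return q * base + (base if round_up else 0)


q, r = decompose(137, 10)
print(q, r)                           # 13 7
print(rounded_value(137, 10, True))   # 140
print(rounded_value(137, 10, False))  # 130
```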


Types of rounding:

• Unbiased rounding: E [R (r) |r] = r

Example (B = 10, r = 3):

R(3) = 0 or 10, and E[R(3) | r = 3] = 3

• Sum-unbiased rounding: E [R (r)] = E (r)


II. Four rounding rules

1. Conventional rounding

1.1 B even

Suppose r = 0, 1, 2, . . . ,9.

In this case B = 10.

If r ≥ 5 (= B/2), round r up to 10;

else round r down to zero (0).


1.2 B odd

Round r up to B when

$$r \ge \frac{B+1}{2}.$$

Otherwise round down to zero.

Example:

When B = 5, round up r = 3 and 4 to 5;

otherwise, round down.
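A minimal sketch of conventional rounding of the remainder, covering both the even and the odd base. The name `conventional_round` is an assumption of this example; the single integer test `2 * r >= base` is just a compact way to write both thresholds, r ≥ B/2 for even B and r ≥ (B+1)/2 for odd B.

```python
def conventional_round(r, base):
    """Round a remainder r in {0, ..., base - 1} to 0 or base."""
    if not 0 <= r < base:
        raise ValueError("r must satisfy 0 <= r < base")
    # Even B: up when r >= B/2. Odd B: up when r >= (B + 1)/2.
    # For integer r both conditions are equivalent to 2 * r >= base.
    return base if 2 * r >= base else 0


print([conventional_round(r, 10) for r in range(10)])  # [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]
print([conventional_round(r, 5) for r in range(5)])    # [0, 0, 0, 5, 5]
```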


Assumptions for r and q

r follows a discrete uniform distribution;

q follows a lognormal distribution, a Pareto distribution of the second kind, or a multinomial distribution.

Thus,

x has some mixed distribution, but the term qB dominates.


2. Modified Conventional rounding

Same as conventional rounding, except that r = 5 (= B/2) is rounded up or down with probability ½.

3. Zero-restricted 50/50 rounding

Except for r = 0, which stays at zero, round r up or down with probability ½.
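The two randomized rules above can be sketched as follows. The function names and the explicit `rng` argument (a seeded `random.Random`, used here only for reproducibility) are assumptions of this example, not part of the slides.

```python
import random


def modified_conventional_round(r, base, rng):
    """Conventional rounding, except that r == B/2 (even B) goes up or down with probability 1/2."""
    if 2 * r == base:
        return base if rng.random() < 0.5 else 0
    return base if 2 * r > base else 0


def zero_restricted_5050_round(r, base, rng):
    """r == 0 stays at 0; every other remainder goes up or down with probability 1/2."""
    if r == 0:
        return 0
    return base if rng.random() < 0.5 else 0


rng = random.Random(0)
draws = [modified_conventional_round(5, 10, rng) for _ in range(10_000)]
print(sum(d == 10 for d in draws) / len(draws))  # share rounded up, close to 0.5
```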


4. Unbiased rounding rule

Round r up with probability r/B,

and

Round r down with probability 1 - r/B

Example:

r = 1, B = 10: P [R(1) = B] = 1/10, P [R(1) = 0] = 9/10
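As a small illustration of why this rule is unbiased, the sketch below computes E[R(r) | r] exactly with rational arithmetic and also draws samples. The names `unbiased_round` and `conditional_mean` are hypothetical helpers introduced for this example.

```python
import random
from fractions import Fraction


def unbiased_round(r, base, rng):
    """Round r up to base with probability r/base, otherwise down to 0."""
    return base if rng.random() < r / base else 0


def conditional_mean(r, base):
    """Exact E[R(r) | r] = B * (r/B) + 0 * (1 - r/B)."""
    p_up = Fraction(r, base)
    return base * p_up + 0 * (1 - p_up)


# The conditional mean equals r for every remainder, which is the unbiasedness property.
print([conditional_mean(r, 10) == r for r in range(10)])

rng = random.Random(0)
sample = [unbiased_round(3, 10, rng) for _ in range(20_000)]
print(sum(sample) / len(sample))  # sample mean, close to 3
```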


III. Mean and variance

III.1 Mean and variance of unrounded number

r = 0, 1, 2, 3, . . . , B-1.

$$P(r) = \frac{1}{B}$$

and

$$E(r) = \frac{1}{B}\sum_{r=1}^{B-1} r = \frac{B-1}{2}.$$


$$E(r^2) = \frac{1}{B}\sum_{r=1}^{B-1} r^2 = \frac{(B-1)(2B-1)}{6}$$

$$V(r) = \frac{B^2-1}{12}$$

In general, when r and q are independent,

$$V(x) = V[E(x \mid q)] + E[V(x \mid q)] = B^2 V(q) + \frac{B^2-1}{12}.$$
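The moments of the uniform remainder and the variance decomposition for x can be checked exactly with rational arithmetic. In this sketch the helper names and the three-point distribution for q are made up purely for illustration.

```python
from fractions import Fraction


def remainder_moments(base):
    """Exact E(r), E(r^2), V(r) for r uniform on {0, ..., base - 1}."""
    rs = range(base)
    mean = Fraction(sum(rs), base)
    second = Fraction(sum(r * r for r in rs), base)
    return mean, second, second - mean**2


def variance_of_x(base, q_pmf):
    """Exact V(x) for x = q*B + r with q ~ q_pmf and r uniform, q and r independent."""
    outcomes = [(q * base + r, p / base) for q, p in q_pmf.items() for r in range(base)]
    mean = sum(x * p for x, p in outcomes)
    return sum(x * x * p for x, p in outcomes) - mean**2


for base in (5, 10, 1000):
    mean, second, var = remainder_moments(base)
    assert mean == Fraction(base - 1, 2)
    assert second == Fraction((base - 1) * (2 * base - 1), 6)
    assert var == Fraction(base * base - 1, 12)

# Hypothetical distribution for q, used only to exercise the formula.
q_pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
mean_q = sum(q * p for q, p in q_pmf.items())
var_q = sum(q * q * p for q, p in q_pmf.items()) - mean_q**2
assert variance_of_x(10, q_pmf) == 10**2 * var_q + Fraction(10**2 - 1, 12)
```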


III. 2 Conventional rounding when B is even

$$E(r) = \frac{B-1}{2}$$

for unrounded number.

$$E[R(r)] = B \cdot \Pr[R(r) = B] + 0 \cdot \Pr[R(r) = 0] = \frac{B}{2}$$


$$E[R(r)^2] = \frac{1}{2} B^2$$

$$V[R(r)] = \frac{B^2}{4}$$

$$V(r) = \frac{B^2-1}{12}$$

for unrounded number.

$$V[R(r)] = 3\left[V(r) + \frac{1}{12}\right]$$


$$V(R(x)) = B^2 V(q) + \frac{B^2}{4},$$

and

$$MSE(R(x)) = B^2 V(q) + \frac{B^2+1}{4}.$$
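A quick exact check of the remainder-level quantities for conventional rounding with an even base. This sketch assumes the MSE is the variance plus the squared bias E[R(r)] − E(r); the function name is illustrative.

```python
from fractions import Fraction


def conventional_even_moments(base):
    """Exact mean, variance and bias of R(r) under conventional rounding, even base."""
    if base % 2:
        raise ValueError("this sketch covers even bases only")
    rounded = [base if 2 * r >= base else 0 for r in range(base)]
    mean = Fraction(sum(rounded), base)
    var = Fraction(sum(v * v for v in rounded), base) - mean**2
    bias = mean - Fraction(base - 1, 2)  # E[R(r)] - E(r)
    return mean, var, bias


for base in (10, 1000):
    mean, var, bias = conventional_even_moments(base)
    v_r = Fraction(base * base - 1, 12)          # variance of the unrounded remainder
    assert mean == Fraction(base, 2)
    assert var == Fraction(base * base, 4)
    assert var == 3 * (v_r + Fraction(1, 12))
    assert bias == Fraction(1, 2)
    assert var + bias**2 == Fraction(base * base + 1, 4)
```

The B²V(q) term is unaffected by rounding the remainder, so it is left out of the check.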


III.3 Conventional rounding when B is odd

Note

$$P[R(r) = B] = \frac{B-1}{2B},$$

since (B-1)/2 out of B elements can be rounded up.

$$E[R(r)] = \frac{B-1}{2}$$

is sum unbiased, and

$$V[R(r)] = \frac{B^2-1}{4}.$$

$$V(r) = \frac{B^2-1}{12}$$

for unrounded number.


P [R (r) = B] for modified conventional rounding:

$$P[R(r) = B] = \frac{B-1}{2B}$$

P [R (r) = B] for 50/50 rounding: same as above.

No. of elements which can be rounded up: B-1 out of all B elements. Probability of rounding up is ½.

$$P[R(r) = B] = \frac{B-1}{B} \cdot \frac{1}{2} = \frac{B-1}{2B}$$


P [R (r) = B] for unbiased rounding:

$$P[R(r) = B] = \sum_{r=1}^{B-1} \frac{r}{B} \cdot \frac{1}{B} = \frac{B-1}{2B}$$

Modified conventional rounding, 50/50 rounding and unbiased rounding have the same mean, variance and MSE as conventional rounding with odd B.
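The claim that these rules share the same round-up probability can be checked by enumerating remainders. The rule labels and helper names below are assumptions of this sketch; each rule is encoded by its probability of rounding a given remainder up.

```python
from fractions import Fraction


def p_up(rule, r, base):
    """Probability that remainder r is rounded up to base under the named rule."""
    if rule == "conventional":
        return Fraction(int(2 * r >= base))
    if rule == "modified":
        return Fraction(1, 2) if 2 * r == base else Fraction(int(2 * r > base))
    if rule == "fifty_fifty":
        return Fraction(0) if r == 0 else Fraction(1, 2)
    if rule == "unbiased":
        return Fraction(r, base)
    raise ValueError(f"unknown rule: {rule}")


def prob_round_up(rule, base):
    """P[R(r) = B] when r is uniform on {0, ..., base - 1}."""
    return sum(p_up(rule, r, base) for r in range(base)) / base


base = 10
for rule in ("modified", "fifty_fifty", "unbiased"):
    p = prob_round_up(rule, base)
    mean = base * p                  # E[R(r)]
    var = base**2 * p * (1 - p)      # V[R(r)], since R(r) takes only the values 0 and B
    print(rule, p, mean, var)

print(prob_round_up("conventional", 5))  # conventional rounding with odd B
```

For B = 10 every rule in the loop gives P = 9/20, mean 9/2 and variance 99/4, matching (B−1)/(2B), (B−1)/2 and (B²−1)/4.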


IV. Distance measure

Define

$$U = \frac{[R(x) - x]^2}{x} = \frac{[R(r_x) - r_x]^2}{x}.$$

Assume that when x = 0, U = 0.

Let

$$\delta_x = \begin{cases} 1, & \text{if } r_x \text{ is rounded up} \\ 0, & \text{otherwise.} \end{cases}$$


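Reexpressing the numerator of U with the round-up indicator $\delta_x$ ($\delta_x = 1$ if $r_x$ is rounded up, 0 otherwise), we have

$$(\delta_x B - r_x)^2.$$

With conventional rounding with B = 10,

$$\delta_x = 1 \text{ with } r_x = 5, 6, \ldots, 9.$$

Then we have

$$E(U \mid x) = \sum_{\delta_x = 0}^{1} \frac{(\delta_x B - r_x)^2}{x} \, P(\delta_x \mid x).$$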


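Expected value of U

$$E(U) = E_q E_r E_\delta[(U \mid x) \mid q] = \sum_{q_x} \sum_{r_x} \sum_{\delta_x} \frac{(\delta_x B - r_x)^2}{x} \, P(\delta_x \mid x) \, P(r_x \mid q_x) \, P(q_x),$$

where $\delta_x = 1$ if $r_x$ is rounded up and 0 otherwise.

We define

$$U_1 = E_r E_\delta[(U \mid x) \mid q] = \sum_{r_x} \sum_{\delta_x = 0}^{1} \frac{(\delta_x B - r_x)^2}{q_x B + r_x} \, P(\delta_x \mid x) \, P(r_x \mid q_x).$$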
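To make the conditional expectation U_1 concrete, the sketch below evaluates it exactly for a given quotient q, assuming r is uniform on {0, ..., B−1} and U = 0 when x = 0. The rule labels and function names are hypothetical and only serve this example.

```python
from fractions import Fraction


def p_up(rule, r, base):
    """Probability that remainder r is rounded up under the named rule."""
    if rule == "conventional":
        return Fraction(int(2 * r >= base))
    if rule == "fifty_fifty":
        return Fraction(0) if r == 0 else Fraction(1, 2)
    if rule == "unbiased":
        return Fraction(r, base)
    raise ValueError(f"unknown rule: {rule}")


def u1(rule, q, base):
    """Exact U_1: the mean of [R(r) - r]^2 / x over r and the rounding outcome, for fixed q."""
    total = Fraction(0)
    for r in range(base):
        x = q * base + r
        if x == 0:
            continue  # U is defined to be 0 when x = 0
        up = p_up(rule, r, base)
        # Rounded up: squared error (B - r)^2. Rounded down: squared error r^2.
        total += (up * (base - r) ** 2 + (1 - up) * r**2) / x
    return total / base


for rule in ("conventional", "fifty_fifty", "unbiased"):
    print(rule, float(u1(rule, 0, 10)), float(u1(rule, 1, 10)))
```

Calling `u1` with q = 0 isolates the contribution from values smaller than the base, where the relative distortion is largest.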


IV.1 Conventional rounding with B even

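$$U_1 = \left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right] \frac{1}{B},$$

which can be expressed as a first term, for $q_x = 0$ (so that $x = r_x$),

$$U_1 = \left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right] \frac{1}{B},$$

and a second term, for $q_x \ge 1$,

$$\left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right] \frac{1}{B}.$$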


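The first term involves a sum of 1/r terms:

$$U_1 = \left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right] \frac{1}{B}$$

The second term, bounded using $q_x B + r_x \ge q_x B$:

$$\left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B}\right] \frac{1}{B}$$

Recall the harmonic series:

$$H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$$

The upper and lower bounds for harmonic series:

$$\ln(n+1) < H_n \le 1 + \ln(n)$$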
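The logarithmic bounds on the harmonic numbers, ln(n+1) < H_n ≤ 1 + ln(n), are easy to confirm numerically. This is a minimal check in floating point; the helper name `harmonic` is an assumption of the example.

```python
import math


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, summed with fsum to limit rounding error."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


for n in (1, 2, 10, 1000):
    h = harmonic(n)
    lower, upper = math.log(n + 1), 1 + math.log(n)
    assert lower < h <= upper
    print(n, round(lower, 4), round(h, 4), round(upper, 4))
```

The upper bound holds with equality only at n = 1.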


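The upper bound for the first term of $U_1$ is

$$B \ln\left[\frac{2(B-1)}{B-2}\right] - \frac{B+1}{2}.$$

The second term of $U_1$ is

$$\frac{B^2+2}{12B} \cdot \frac{1}{q_x}.$$

Note that the second term of E(U) is

$$E_q\left\{\frac{1}{B^2 q_x}\left[\sum_{r_x=1}^{B/2-1} r_x^2 + \sum_{r_x=B/2}^{B-1} (B - r_x)^2\right]\right\}.$$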


IV.2 Modified conventional rounding with even B

This has the same E(U) as conventional rounding.

IV.3 50/50 rounding

The first term of $U_1$ is bounded above by

$$\frac{B}{2}\ln(B-1) + \frac{1}{2}.$$

The second term of $U_1$ is

$$\frac{2B^2 - 3B + 1}{6B} \cdot \frac{1}{q_x}.$$


IV.4 Unbiased rounding

The first term of $U_1$ is

$$\frac{B-1}{2}.$$

The second term of $U_1$ is

$$\frac{B^2-1}{6B} \cdot \frac{1}{q_x}.$$


IV.5 Comparison of four rounding rules

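| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | $B \ln\left[\frac{2(B-1)}{B-2}\right] - \frac{B+1}{2}$ | $\frac{B}{2}\ln(B-1) + \frac{1}{2}$ | $\frac{B-1}{2}$ |
| Term 2 | $\frac{B^2+2}{12B} \, E\left(\frac{1}{q}\right)$ | $\frac{2B^2-3B+1}{6B} \, E\left(\frac{1}{q}\right)$ | $\frac{B^2-1}{6B} \, E\left(\frac{1}{q}\right)$ |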


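Comparison of four rounding rules, B = 10

| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | 2.61 | 11.49 (4.4) | 4.5 (1.7) |
| Term 2 | .85 E(1/q) | 2.85 E(1/q) | 1.65 E(1/q) |

Values in parentheses are ratios to the conventional rounding value.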


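Comparison of four rounding rules, B = 1,000

| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | 194 | 3,454 (18) | 500 (2.6) |
| Term 2 | 83 E(1/q) | 332 E(1/q) | 166 E(1/q) |

Values in parentheses are ratios to the conventional rounding value.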
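The Term 1 bounds and Term 2 coefficients can be evaluated for any base with a short script. The closed forms coded below are the ones read off the comparison of the rounding rules, so treat them as an assumption of this sketch; the dictionary keys and the function name are illustrative.

```python
import math


def comparison_terms(base):
    """Return {rule: (term_1, term_2_coefficient)}; Term 2 multiplies E(1/q)."""
    b = base
    return {
        "conventional": (b * math.log(2 * (b - 1) / (b - 2)) - (b + 1) / 2,
                         (b * b + 2) / (12 * b)),
        "fifty_fifty": (b / 2 * math.log(b - 1) + 0.5,
                        (2 * b * b - 3 * b + 1) / (6 * b)),
        "unbiased": ((b - 1) / 2,
                     (b * b - 1) / (6 * b)),
    }


for base in (10, 1000):
    print(f"B = {base}")
    for rule, (term1, term2) in comparison_terms(base).items():
        print(f"  {rule:13s} term 1 = {term1:10.2f}   term 2 = {term2:8.2f} * E(1/q)")
```

Dividing each Term 1 value by the conventional one gives the ratios used to compare the rules.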


IV.6 E(1/q) for log-normal distribution

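Suppose

$$y \sim N(\mu, \sigma^2)$$

and

$$x = e^y.$$

Then, x has a lognormal distribution, i.e.,

$$f(x \mid \mu, \sigma^2) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2\right], \quad x > 0.$$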


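Let

$$c = \int_1^\infty f(x \mid \mu, \sigma^2) \, dx.$$

Then

$$E\left(\frac{1}{q} \,\middle|\, q \ge 1, \mu, \sigma^2\right) = \frac{1}{c} \int_1^\infty \frac{1}{q} f(q) \, dq,$$

which is equivalent to

$$\frac{1}{c}\left[1 - \Phi\left(\sigma - \frac{\mu}{\sigma}\right)\right] e^{-\mu + \frac{1}{2}\sigma^2},$$

where $\Phi$ is the standard normal distribution function.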
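The truncated lognormal expectation can be cross-checked numerically. The closed form used below comes from completing the square in y = ln q, which is this sketch's own derivation rather than something quoted from the slides. The function names, the integration range [0, μ + 12σ] and the parameter values are all assumptions.

```python
import math
from statistics import NormalDist


def closed_form(mu, sigma):
    """E(1/q | q >= 1) for lognormal q, written with the standard normal CDF."""
    phi = NormalDist().cdf
    c = 1 - phi(-mu / sigma)  # c = P(q >= 1) = P(y >= 0)
    return math.exp(-mu + sigma**2 / 2) * (1 - phi(sigma - mu / sigma)) / c


def numeric(mu, sigma, steps=20_000):
    """Midpoint-rule integral of exp(-y) * normal_pdf(y) over y in [0, mu + 12*sigma], divided by c."""
    dist = NormalDist(mu, sigma)
    upper = mu + 12 * sigma  # assumed to be positive and far into the tail
    h = upper / steps
    integral = h * sum(math.exp(-(i + 0.5) * h) * dist.pdf((i + 0.5) * h)
                       for i in range(steps))
    return integral / (1 - dist.cdf(0.0))


mu, sigma = 2.0, 1.0  # illustrative parameter values
print(closed_form(mu, sigma), numeric(mu, sigma))
```

The two values should agree to several decimal places. A Monte Carlo draw from the truncated distribution would be another way to check the result.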


IV.6 E(1/q) for Pareto distribution of the 2nd kind

The Pareto distribution of the second kind is

$$f(q) = \frac{a k^a}{q^{a+1}}, \quad a > 0, \; q \ge k > 0.$$

In the above k = min(q).

Let

$$c = \int_1^\infty f(x \mid a > 0, \; q \ge k > 0) \, dx.$$


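Then, for the Pareto distribution,

$$E\left(\frac{1}{q}\right) = \frac{a k^a}{c \, (a+1)}.$$

IV.7 Upper bound for E(1/q) for multinomial distribution

The multinomial distribution has the form

$$f(q_1, q_2, \ldots, q_k \mid p_1, p_2, \ldots, p_k) = \frac{n!}{\prod_{i=1}^{k} q_i!} \prod_{i=1}^{k} p_i^{q_i}, \quad q_i = 0, 1, 2, \ldots$$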


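In binomial distribution, we let

$$s = 1 - p.$$

When $f(q_i)$ is truncated at 1 from below, we have

$$f(q_i \mid q_i \ge 1, n, p) = \frac{1}{1 - s^n} \binom{n}{q_i} p^{q_i} s^{n - q_i}.$$

Note that

$$E\left(\frac{1}{q_i}\right) \le E\left(\frac{1}{q_i + 1}\right) + E\left(\frac{3}{(q_i + 1)(q_i + 2)}\right)$$

for all i.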
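The bound on E(1/q_i) rests on the pointwise inequality 1/q ≤ 1/(q+1) + 3/((q+1)(q+2)) for q ≥ 1. The sketch below checks that inequality and the resulting bound exactly for a binomial truncated at 1. The parameters n = 20 and p = 3/10 are arbitrary illustrative choices, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb


def truncated_binomial_pmf(n, p):
    """Binomial(n, p) conditioned on q >= 1, as a dict {q: probability}."""
    s = 1 - p
    norm = 1 - s**n
    return {q: comb(n, q) * p**q * s ** (n - q) / norm for q in range(1, n + 1)}


def expectation(pmf, g):
    """Expected value of g(q) under the given pmf."""
    return sum(g(q) * prob for q, prob in pmf.items())


def bound_term(q):
    """Pointwise upper bound on 1/q used in the inequality."""
    return Fraction(1, q + 1) + Fraction(3, (q + 1) * (q + 2))


# Pointwise check; equality holds at q = 1.
assert all(Fraction(1, q) <= bound_term(q) for q in range(1, 200))

pmf = truncated_binomial_pmf(20, Fraction(3, 10))
exact = expectation(pmf, lambda q: Fraction(1, q))
bound = expectation(pmf, bound_term)
print(float(exact), float(bound))
```

Because the inequality holds for every q ≥ 1, it carries over to the expectation under any distribution truncated at 1.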


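In general,

$$P(q_1 q_2 q_3) = P(q_3 \mid q_1 q_2) \, P(q_2 \mid q_1) \, P(q_1).$$

Using the above relationship, we have the following:

$$E\left(\frac{1}{q_1 q_2 q_3}\right) = E\left\{\frac{1}{q_1} E\left[\frac{1}{q_2} E\left(\frac{1}{q_3} \,\middle|\, q_1 q_2\right) \,\middle|\, q_1\right]\right\}.$$

$E\left(\frac{1}{q_3} \mid q_1 q_2\right)$ does not generate any term having $q_1$ or $q_2$.

Hence,

$$E\left(\frac{1}{q_1 q_2 q_3}\right) = E\left(\frac{1}{q_3} \,\middle|\, q_1 q_2\right) E\left(\frac{1}{q_2} \,\middle|\, q_1\right) E\left(\frac{1}{q_1}\right).$$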


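The upper bound of the expected value is

$$E\left(\frac{1}{q_1 q_2 \cdots q_k}\right) \le E\left\{\left[\frac{1}{q_1+1} + \frac{3}{(q_1+1)(q_1+2)}\right]\left[\frac{1}{q_2+1} + \frac{3}{(q_2+1)(q_2+2)}\right] \cdots \left[\frac{1}{q_k+1} + \frac{3}{(q_k+1)(q_k+2)}\right]\right\}.$$

Let $n_j$ be the size of the category j and $s_i = 1 - p_i$. Then

$$E\left(\frac{1}{q_1 q_2 \cdots q_k}\right) \le \prod_{i=1}^{k} \frac{-5(n+1)(n+2) \, p_i^2 \, s_i^{\,n - \sum_{j=1}^{i-1} n_j} + 2(n+2) \, p_i + 6}{2(n+1)(n+2) \, p_i^2 \left(1 - s_i^{\,n - \sum_{j=1}^{i-1} n_j}\right)}.$$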


V. Concluding comments

Various methods of rounding and in some applications various choices for rounding base B are available.

The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving distributional properties of original data and, quantitatively, what is the expected distortion due to rounding?


The expected value of U, the distance measure, is intractable, so we derived its upper bound.

The expected value of 1/q is also intractable for a multinomial distribution. So we derived an upper bound. There should be room for improvement.

This paper provides a preliminary analysis toward answering these questions.

In summary,

• In terms of bias, unbiased rounding is optimal.


• In terms of the distance measure, conventional or modified conventional rounding performs best.

• In terms of protecting confidentiality, the 50/50 rounding rule is best.

VI. References

Grab, E.L. & Savage, I.R. (1954). Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables. Journal of the American Statistical Association 49, 169-177.

Johnson, N.L. & Kotz, S. (1969). Distributions in Statistics: Discrete Distributions. Boston: Houghton Mifflin Company.


Johnson, N.L. & Kotz, S. (1970). Distributions in Statistics: Continuous Univariate Distributions-1. New York: John Wiley and Sons, Inc.

Kim, J.J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004). Effects of Rounding Continuous Data Using Rounding Rules. Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).

Chvatal, V. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html