1 Effects of Rounding on Data Quality Lawrence H. Cox, Jay J. Kim, Myron Katzoff, Joe Fred Gonzalez,...

Effects of Rounding on Data Quality

Lawrence H. Cox, Jay J. Kim,Myron Katzoff, Joe Fred Gonzalez, Jr.

U.S. National Center for Health Statistics

Outline

I. Introduction

II. Four Rounding Rules

III. Mean and Variance

IV. Distance Measure

V. Concluding Comments

VI. References

I. Introduction

Reasons for rounding

• To convert noninteger values to integer values for statistical purposes;

• To enhance readability of the data;

• To protect confidentiality of records in the file;

• To keep the important digits only.

• Purpose of this paper:

Evaluate the effects of four rounding methods on data quality and utility in two ways:

(1) bias and variance;

(2) effects on the underlying distribution of the data determined by a distance measure.

B : Base

$q_x$ : Quotient

$r_x$ : Remainder

$$x = q_x B + r_x$$

$$R(x) = q_x B + R(r_x)$$

Types of rounding:

• Unbiased rounding: E [R (r) |r] = r

Example (B = 10, r = 3): R(3) is either 0 or 10, and

E[R(3) | r = 3] = 3

• Sum-unbiased rounding: E [R (r)] = E (r)

II. Four rounding rules

1. Conventional rounding

1.1 B even

Suppose r = 0, 1, 2, . . . ,9.

In this case B = 10.

If r ≥ 5 (= B/2), round r up to 10;

else round r down to zero (0).

1.2 B odd

Round r up to B when

$$r \ge \frac{B+1}{2}.$$

Otherwise round down to zero.

Example:

When B = 5, round up r = 3 and 4 to 5;

Otherwise, round down.

Assumptions for r and q

r follows a discrete uniform distribution;

q follows a lognormal, Pareto (second kind), or multinomial distribution.

Thus,

x has some mixed distribution, but the term qB dominates.

2. Modified Conventional rounding

Same as conventional rounding, except that r = 5 (= B/2) is rounded up or down with probability ½.

3. Zero-restricted 50/50 rounding

Round every r except zero (0) up or down with probability ½ each.

4. Unbiased rounding rule

Round r up with probability r/B,

and

Round r down with probability 1 - r/B

Example:

r = 1, B = 10: P [R(1) = B] = 1/10, P [R(1) = 0] = 9/10
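The four rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `round_remainder` and `round_value` and the rule labels are assumptions introduced here, and x is assumed to be a nonnegative integer.

```python
import random

def round_remainder(r, B, rule, rng=random):
    """Return R(r), which is either 0 or B, for a remainder r in 0..B-1."""
    if r == 0:
        return 0  # a zero remainder is never rounded up
    if rule == "conventional":
        # 2r >= B covers r >= B/2 (even B) and r >= (B+1)/2 (odd B)
        return B if 2 * r >= B else 0
    if rule == "modified":
        if 2 * r == B:  # only possible when B is even
            return B if rng.random() < 0.5 else 0
        return B if 2 * r > B else 0
    if rule == "fifty_fifty":
        return B if rng.random() < 0.5 else 0
    if rule == "unbiased":
        return B if rng.random() < r / B else 0
    raise ValueError(f"unknown rule: {rule}")

def round_value(x, B, rule, rng=random):
    """R(x) = q_x * B + R(r_x) for a nonnegative integer x."""
    q, r = divmod(x, B)
    return q * B + round_remainder(r, B, rule, rng)

if __name__ == "__main__":
    for rule in ("conventional", "modified", "fifty_fifty", "unbiased"):
        print(rule, round_value(37, 10, rule))
```

Passing a seeded `random.Random` instance as `rng` makes the randomized rules reproducible.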

III. Mean and variance

III.1 Mean and variance of unrounded number

r = 0, 1, 2, 3, . . . , B-1.

$P(r) = \frac{1}{B}$ and

$$E(r) = \frac{1}{B}\sum_{r=1}^{B-1} r = \frac{B-1}{2}.$$

$$E(r^2) = \frac{1}{B}\sum_{r=1}^{B-1} r^2 = \frac{(B-1)(2B-1)}{6}$$

$$V(r) = \frac{B^2-1}{12}$$

In general when r and q are independent,

$$V(x) = V[E(x \mid q)] + E[V(x \mid q)]$$

$$V(x) = B^2 V(q) + \frac{B^2-1}{12}$$
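The moments of the unrounded remainder can be checked with exact rational arithmetic. The sketch below assumes r is uniform on 0, 1, ..., B-1 and independent of q, as stated on the assumptions slide. The pmf `q_pmf` is a hypothetical example chosen only to exercise the variance decomposition, and the function names are illustrative.

```python
from fractions import Fraction

def remainder_moments(B):
    """Exact E(r) and V(r) for r uniform on 0, 1, ..., B-1."""
    p = Fraction(1, B)
    mean = sum(p * r for r in range(B))
    second_moment = sum(p * r * r for r in range(B))
    return mean, second_moment - mean ** 2

def variance_of_x(B, q_pmf):
    """Exact V(x) for x = q*B + r, with q and r independent."""
    p = Fraction(1, B)
    ex = Fraction(0)
    ex2 = Fraction(0)
    for q, pq in q_pmf.items():
        for r in range(B):
            x = q * B + r
            ex += pq * p * x
            ex2 += pq * p * x * x
    return ex2 - ex ** 2

# Hypothetical distribution for q, used only for illustration.
q_pmf = {0: Fraction(1, 2), 1: Fraction(3, 10), 4: Fraction(1, 5)}

if __name__ == "__main__":
    B = 10
    mean, var = remainder_moments(B)
    print(mean, var)              # E(r) and V(r)
    print(variance_of_x(B, q_pmf))
```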

III. 2 Conventional rounding when B is even

$E(r) = \frac{B-1}{2}$ for unrounded number.

$$E[R(r)] = B \cdot \Pr[R(r) = B] + 0 \cdot \Pr[R(r) = 0] = \frac{B}{2}$$

$$E[R(r)^2] = \frac{1}{2}B^2$$

$$V[R(r)] = \frac{B^2}{4}$$

$V(r) = \frac{B^2-1}{12}$ for unrounded number.

$$V[R(r)] = 3\left[V(r) + \frac{1}{12}\right]$$

$$V(R(x)) = B^2 V(q) + \frac{B^2}{4},$$ and

$$MSE(R(x)) = B^2 V(q) + \frac{B^2+1}{4}.$$
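A short exact check of the even-B results, assuming r is uniform on 0, ..., B-1. The function name `conventional_moments` is illustrative. The MSE term is taken here as variance plus squared bias of R(r), which is how the (B² + 1)/4 term arises when the bias is ½.

```python
from fractions import Fraction

def conventional_moments(B):
    """Exact mean, variance and bias of R(r) under conventional rounding."""
    p = Fraction(1, B)
    rounded = [B if 2 * r >= B else 0 for r in range(B)]
    mean = sum(p * v for v in rounded)
    var = sum(p * v * v for v in rounded) - mean ** 2
    bias = mean - Fraction(B - 1, 2)  # E[R(r)] - E(r)
    return mean, var, bias

if __name__ == "__main__":
    mean, var, bias = conventional_moments(10)
    print(mean, var, bias, var + bias ** 2)
```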

III.3 Conventional rounding when B is odd

Note

$$P[R(r) = B] = \frac{B-1}{2B};$$

(B-1)/2 out of B elements can be rounded up.

$$E[R(r)] = \frac{B-1}{2}$$

is sum unbiased, and

$$V[R(r)] = \frac{B^2-1}{4}.$$

$V(r) = \frac{B^2-1}{12}$ for unrounded number.

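P [R (r) = B] for modified conventional rounding (even B): the B/2 - 1 elements above B/2 are always rounded up and r = B/2 is rounded up with probability ½, so

$$P[R(r) = B] = \frac{B/2 - 1/2}{B} = \frac{B-1}{2B}.$$

P [R (r) = B] for 50/50 rounding: same as above.

No. of elements which can be rounded up: B-1 out of all B elements. Probability of rounding up is ½.

$$P[R(r) = B] = \frac{1}{2}\cdot\frac{B-1}{B} = \frac{B-1}{2B}.$$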

P [R (r) = B] for unbiased rounding

$$P[R(r) = B] = \sum_{r=1}^{B-1} \frac{r}{B}\cdot\frac{1}{B} = \frac{B-1}{2B}$$

Modified conventional rounding, 50/50 rounding and unbiased rounding have the same mean, variance and MSE as the conventional rounding with odd B.
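The marginal probability of rounding up can be tabulated exactly for each rule. The sketch below assumes r is uniform on 0, ..., B-1. The rule labels and the helper name `prob_round_up` are illustrative, not part of the original slides.

```python
from fractions import Fraction

def prob_round_up(B, rule):
    """Exact P[R(r) = B] when r is uniform on 0, 1, ..., B-1."""
    total = Fraction(0)
    for r in range(B):
        if rule == "conventional":
            up = Fraction(int(2 * r >= B))
        elif rule == "modified":
            if 2 * r == B:
                up = Fraction(1, 2)  # the midpoint B/2 is a coin flip
            else:
                up = Fraction(int(2 * r > B))
        elif rule == "fifty_fifty":
            up = Fraction(1, 2) if r > 0 else Fraction(0)
        elif rule == "unbiased":
            up = Fraction(r, B)
        else:
            raise ValueError(f"unknown rule: {rule}")
        total += up * Fraction(1, B)
    return total

if __name__ == "__main__":
    for B in (5, 10):
        for rule in ("conventional", "modified", "fifty_fifty", "unbiased"):
            print(B, rule, prob_round_up(B, rule))
```

Because R(r) takes only the values 0 and B, equal values of P[R(r) = B] imply equal means and variances of R(r).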

IV. Distance measure

Define

$$U = \frac{[R(x) - x]^2}{x} = \frac{[R(r_x) - r_x]^2}{x}.$$

Assume that when x = 0, U = 0.

$$\delta_x = \begin{cases} 1, & \text{if } r_x \text{ is rounded up} \\ 0, & \text{otherwise.} \end{cases}$$

Reexpressing the numerator of U, we have

$$(\delta_x B - r_x)^2.$$

With conventional rounding with B=10,

$$\delta_x = 1 \text{ with } r_x = 5, 6, \ldots, 9.$$

Then we have

$$E(U \mid x) = \sum_{\delta_x=0}^{1} \frac{(\delta_x B - r_x)^2}{x}\, P(\delta_x \mid x).$$
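As a small worked illustration of the distance measure, the sketch below computes U for one rounded value and E(U | x) for a given probability of rounding up. The names `distance_u`, `expected_u_given_x` and `p_up` are assumptions made for this example, and x is assumed to be a nonnegative integer.

```python
from fractions import Fraction

def distance_u(x, rounded_x):
    """U = [R(x) - x]^2 / x, with U = 0 when x = 0."""
    if x == 0:
        return Fraction(0)
    return Fraction((rounded_x - x) ** 2, x)

def expected_u_given_x(x, B, p_up):
    """E(U | x): average of (delta*B - r)^2 / x over delta in {0, 1}."""
    if x == 0:
        return Fraction(0)
    _, r = divmod(x, B)
    total = Fraction(0)
    for delta, prob in ((0, 1 - p_up), (1, p_up)):
        total += prob * Fraction((delta * B - r) ** 2, x)
    return total

if __name__ == "__main__":
    # x = 37, B = 10: conventional rounding always rounds r = 7 up.
    print(distance_u(37, 40))
    print(expected_u_given_x(37, 10, Fraction(1)))
    # Unbiased rounding rounds r = 7 up with probability 7/10.
    print(expected_u_given_x(37, 10, Fraction(7, 10)))
```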

Expected value of U

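$$E(U) = E_{q_x} E_{r_x} E_{\delta_x}(U \mid x) = \sum_{q_x}\sum_{r_x}\sum_{\delta_x} \frac{(\delta_x B - r_x)^2}{x}\, P(\delta_x \mid x)\, P(r_x)\, P(q_x).$$

We define

$$U_1 = E_{r_x} E_{\delta_x}(U \mid x) \mid q_x = \sum_{r_x}\sum_{\delta_x=0}^{1} \frac{(\delta_x B - r_x)^2}{q_x B + r_x}\, P(\delta_x \mid x)\, P(r_x).$$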

IV.1 Conventional rounding with B even

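$$U_1 = \left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right]\frac{1}{B},$$

which can be expressed as

$$U_1 = \left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right]\frac{1}{B}$$

$$\qquad + \left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right]\frac{1}{B},$$

where the first bracket is the $q_x = 0$ case (the denominator $q_x B + r_x$ reduces to $r_x$) and the second bracket covers $q_x \ge 1$.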

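$$U_1 \le \left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right]\left[\frac{1}{B}\right] + \left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B}\right]\left[\frac{1}{B}\right]$$

The first bracket contains a sum of 1/r terms.

Recall the harmonic series:

$$H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$$

The upper and lower bounds for harmonic series:

$$\ln(n+1) < H_n \le 1 + \ln(n)$$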
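The harmonic-number bounds are easy to confirm numerically. The second function evaluates the q = 0 bracket for even B directly. It rests on the assumption that this bracket is [Σ r + Σ (B - r)²/r]/B, with the first sum over r below B/2 and the second over r from B/2 to B-1. Both function names are illustrative.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def first_term_exact(B):
    """q = 0 bracket for conventional rounding with even B (assumed form)."""
    half = B // 2
    low = sum(range(1, half))                        # r rounded down
    high = sum((B - r) ** 2 / r for r in range(half, B))  # r rounded up
    return (low + high) / B

if __name__ == "__main__":
    for n in (1, 2, 10, 1000):
        print(n, math.log(n + 1), harmonic(n), 1 + math.log(n))
    print(first_term_exact(10))
```

Expanding (B - r)²/r = B²/r - 2B + r shows that this bracket equals B(H_{B-1} - H_{B/2-1}) - (B+1)/2, which is where the 1/r sums enter.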

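The upper bound for the first term of $U_1$ is

$$B \ln\frac{2(B-1)}{B-2} - \frac{B+1}{2}.$$

The second term of $U_1$ is

$$\frac{B^2+2}{12B}\cdot\frac{1}{q_x}.$$

Note that the second term of E(U) is

$$E_{q_x}\left\{\frac{1}{B q_x}\left[\sum_{r_x=1}^{B/2-1} r_x^2 + \sum_{r_x=B/2}^{B-1} (B - r_x)^2\right]\frac{1}{B}\right\}.$$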

IV.2 Modified conventional rounding with even B

This has the same E(U) as conventional rounding.

IV.3 50/50 rounding

The first term of $U_1$ is

$$\frac{B}{2}\ln(B-1) + \frac{1}{2}.$$

The second term of $U_1$ is

$$\frac{2B^2 - 3B + 1}{6B}\cdot\frac{1}{q_x}.$$

IV.4 Unbiased rounding

The first term of $U_1$ is

$$\frac{B-1}{2}.$$

The second term of $U_1$ is

$$\frac{B^2 - 1}{6B}\cdot\frac{1}{q_x}.$$

IV.5 Comparison of four rounding rules

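| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | $B \ln\frac{2(B-1)}{B-2} - \frac{B+1}{2}$ | $\frac{B}{2}\ln(B-1) + \frac{1}{2}$ | $\frac{B-1}{2}$ |
| Term 2 | $\frac{B^2+2}{12B}\, E\left(\frac{1}{q}\right)$ | $\frac{2B^2-3B+1}{6B}\, E\left(\frac{1}{q}\right)$ | $\frac{B^2-1}{6B}\, E\left(\frac{1}{q}\right)$ |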

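Comparison of four rounding rules, B = 10

| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | 2.61 | 11.49 (4.4) | 4.5 (1.7) |
| Term 2 | .85 E(1/q) | 2.85 E(1/q) | 1.65 E(1/q) |

Values in parentheses are ratios to the conventional Term 1.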

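Comparison of four rounding rules, B = 1,000

| | Conventional or Mod. Conven. | 50/50 | Unbiased |
|---|---|---|---|
| Term 1 | 194 | 3,454 (18) | 500 (2.6) |
| Term 2 | 83 E(1/q) | 323 E(1/q) | 166 E(1/q) |

Values in parentheses are ratios to the conventional Term 1.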
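The tabulated values can be recomputed from the closed-form expressions. The sketch below assumes those expressions are: Term 1 equal to B ln[2(B-1)/(B-2)] - (B+1)/2, (B/2) ln(B-1) + ½ and (B-1)/2, and the Term 2 coefficients of E(1/q) equal to (B²+2)/(12B), (2B²-3B+1)/(6B) and (B²-1)/(6B). The function names are illustrative.

```python
import math

def term1_bounds(B):
    """Term 1 for each rule (upper bounds for the first two), per the assumed formulas."""
    return {
        "conventional": B * math.log(2 * (B - 1) / (B - 2)) - (B + 1) / 2,
        "fifty_fifty": B / 2 * math.log(B - 1) + 0.5,
        "unbiased": (B - 1) / 2,
    }

def term2_coefficients(B):
    """Coefficient multiplying E(1/q) in Term 2, per the assumed formulas."""
    return {
        "conventional": (B * B + 2) / (12 * B),
        "fifty_fifty": (2 * B * B - 3 * B + 1) / (6 * B),
        "unbiased": (B * B - 1) / (6 * B),
    }

if __name__ == "__main__":
    for B in (10, 1000):
        print(B, term1_bounds(B), term2_coefficients(B))
```

Under these assumed formulas the 50/50 Term 2 coefficient at B = 1,000 comes out near 332.8 rather than the tabulated 323, and the unbiased coefficient near 166.7, so those two table entries may deserve a second look.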

IV.6 E(1/q) for log-normal distribution

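Suppose

$$y \sim N(\mu, \sigma^2)$$

and

$$x = e^{y}.$$

Then, x has a lognormal distribution, i.e.,

$$f(x \mid \mu, \sigma^2) = \left(x \sigma \sqrt{2\pi}\right)^{-1} \exp\left[-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2\right], \quad x > 0.$$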

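Let

$$c = \int_1^\infty f(x \mid \mu, \sigma^2)\,dx.$$

Then

$$E\left(\frac{1}{q} \,\Big|\, q \ge 1, \mu, \sigma^2\right) = \frac{1}{c}\int_1^\infty \frac{1}{q} f(q)\,dq,$$

which is equivalent to

$$\frac{1}{c}\, e^{-\mu + \frac{1}{2}\sigma^2}\left[1 - \Phi\left(\sigma - \frac{\mu}{\sigma}\right)\right],$$

where $\Phi$ is the standard normal distribution function.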
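A numerical check of the truncated lognormal expectation. The closed form used below, exp(-μ + σ²/2)[1 - Φ(σ - μ/σ)]/c with c = 1 - Φ(-μ/σ), is an assumption obtained by completing the square in the integral. The sketch compares it with a midpoint-rule integration over y = ln q, and the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mean_closed_form(mu, sigma):
    """Assumed closed form for E(1/q | q >= 1), with ln q ~ N(mu, sigma^2)."""
    c = 1.0 - normal_cdf(-mu / sigma)  # P(q >= 1)
    tail = 1.0 - normal_cdf(sigma - mu / sigma)
    return math.exp(-mu + 0.5 * sigma ** 2) * tail / c

def inverse_mean_numeric(mu, sigma, steps=200_000):
    """Midpoint-rule estimate of E(1/q | q >= 1), integrating over y = ln q >= 0."""
    upper = max(mu, 0.0) + 12.0 * sigma  # density is negligible beyond this
    h = upper / steps
    num = 0.0
    den = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        dens = math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        num += math.exp(-y) * dens * h  # 1/q = exp(-y)
        den += dens * h
    return num / den

if __name__ == "__main__":
    for mu, sigma in ((1.0, 0.5), (0.0, 1.0)):
        print(mu, sigma, inverse_mean_closed_form(mu, sigma), inverse_mean_numeric(mu, sigma))
```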

IV.7 E(1/q) for Pareto distribution of the 2nd kind

The Pareto distribution of the second kind is

$$f(q) = \frac{a k^{a}}{q^{a+1}}, \quad a > 0, \; q \ge k > 0.$$

In the above k = min(q).

Let

$$c = \int_1^\infty f(x \mid a > 0, q \ge k > 0)\,dx.$$

Then, for $k \le 1$,

$$E\left(\frac{1}{q}\right) = \frac{1}{c}\cdot\frac{a k^{a}}{a+1}.$$

IV.8 Upper bound for E(1/q) for multinomial distribution

The multinomial distribution has the form

$$f(q_1, q_2, \ldots, q_k \mid p_1, p_2, \ldots, p_k) = \frac{n!}{\prod_{i=1}^{k} q_i!}\prod_{i=1}^{k} p_i^{q_i}, \quad q_i = 0, 1, 2, \ldots$$

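In binomial distribution, we let

$$s = 1 - p.$$

When $f(q_i)$ is truncated at 1 from below, we have

$$f(q_i \mid q_i \ge 1, n, p) = \frac{1}{1 - s^{n}}\binom{n}{q_i} p^{q_i} s^{n - q_i}.$$

Note that

$$E\left(\frac{1}{q_i}\right) \le E\left(\frac{1}{q_i+1}\right) + E\left(\frac{3}{(q_i+1)(q_i+2)}\right)$$

for all i.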
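The inequality behind this bound can be checked exactly. The sketch assumes the pointwise form 1/q ≤ 1/(q+1) + 3/((q+1)(q+2)) for q ≥ 1 and a binomial truncated at 1 from below. The values n = 8 and p = 3/10 are hypothetical, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def truncated_binomial_pmf(n, p):
    """pmf of a binomial(n, p) variable conditioned on q >= 1."""
    s = 1 - p
    norm = 1 - s ** n
    return {q: comb(n, q) * p ** q * s ** (n - q) / norm for q in range(1, n + 1)}

def bound_term(q):
    """Right-hand side 1/(q+1) + 3/((q+1)(q+2))."""
    return Fraction(1, q + 1) + Fraction(3, (q + 1) * (q + 2))

def inverse_mean_and_bound(n, p):
    """Exact E(1/q) and its upper bound under the truncated binomial."""
    pmf = truncated_binomial_pmf(n, p)
    exact = sum(prob * Fraction(1, q) for q, prob in pmf.items())
    bound = sum(prob * bound_term(q) for q, prob in pmf.items())
    return exact, bound

if __name__ == "__main__":
    exact, bound = inverse_mean_and_bound(8, Fraction(3, 10))  # hypothetical n and p
    print(float(exact), float(bound))
```

The two sides coincide at q = 1, so the bound is tightest when most of the probability mass sits on small counts.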

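In general,

$$P(q_1 q_2 q_3) = P(q_3 \mid q_1 q_2)\, P(q_2 \mid q_1)\, P(q_1).$$

Using the above relationship, we have the following:

$$E\left(\frac{1}{q_1 q_2 q_3}\right) = E\left\{\frac{1}{q_1} E\left[\frac{1}{q_2} E\left(\frac{1}{q_3} \,\Big|\, q_1 q_2\right) \Big|\, q_1\right]\right\}.$$

$E\left(\frac{1}{q_3} \mid q_1 q_2\right)$ does not generate any term having $q_1$ or $q_2$.

Hence,

$$E\left(\frac{1}{q_1 q_2 q_3}\right) = E\left(\frac{1}{q_3} \,\Big|\, q_1 q_2\right) E\left(\frac{1}{q_2} \,\Big|\, q_1\right) E\left(\frac{1}{q_1}\right).$$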

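The upper bound of the expected value is

$$E\left(\frac{1}{q_1 q_2 \cdots q_k}\right) \le E_{q_1}\left[\frac{1}{q_1+1} + \frac{3}{(q_1+1)(q_1+2)}\right] E_{q_2 \mid q_1}\left[\frac{1}{q_2+1} + \frac{3}{(q_2+1)(q_2+2)}\right] \cdots E_{q_k \mid q_1 \cdots q_{k-1}}\left[\frac{1}{q_k+1} + \frac{3}{(q_k+1)(q_k+2)}\right].$$

Let $n_j$ be the size of the category j and $s_i = 1 - p_i$. Then

$$E\left(\frac{1}{q_1 q_2 \cdots q_k}\right) \le \prod_{i=1}^{k} \frac{6 + 2(n+2)\, p_i - 5(n+1)(n+2)\, p_i^2\, s_i^{\,n - \sum_{j=1}^{i-1} n_j}}{2(n+1)(n+2)\, p_i^2 \left(1 - s_i^{\,n - \sum_{j=1}^{i-1} n_j}\right)}.$$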

V. Concluding comments

Various methods of rounding and in some applications various choices for rounding base B are available.

The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving distributional properties of original data and, quantitatively, what is the expected distortion due to rounding?

The expected value of U, the distance measure, is intractable, so we derived its upper bound.

The expected value of 1/q is also intractable for a multinomial distribution. So we derived an upper bound. There should be room for improvement.

This paper provides a preliminary analysis toward answering these questions.

In summary,

• In terms of bias, unbiased rounding is optimal.

• In terms of the distance measure, conventional or modified conventional rounding performs best.

• In terms of protecting confidentiality, the 50/50 rounding rule is best.

VI. References

Grab, E.L. & Savage, I.R. (1954). Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables. Journal of the American Statistical Association 49, 169-177.

Johnson, N.L. & Kotz, S. (1969). Distributions in Statistics, Discrete Distributions. Boston: Houghton Mifflin Company.

Johnson, N.L. & Kotz, S. (1970). Distributions in Statistics, Continuous Univariate Distributions-1. New York: John Wiley and Sons, Inc.

Kim, J.J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004). Effects of Rounding Continuous Data Using Rounding Rules. Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).

Chvatal, V. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html
