
ST2352 - Probability and Theoretical Statistics II Project - Benford’s Law

Team Number 5:

Karl Downey 11445848

Conn McCarthy 11313016 Graeme Meyler 11367851


Table of Contents

1. Introduction
   1.1. Benford's Law?
   1.2. Scale Invariance
   1.3. Discrete Probability Distribution
   1.4. Discrete vs. Continuous Distribution

2. Benford's Law and Distributions
   2.1. Extension of Law to Common Distributions
   2.2. Uniform Distribution
   2.3. Uniform Ratio Distribution
   2.4. Exponential Distribution

3. Moments and Moment Generating Function
   3.1. Moment Generating Function
   3.2. Raw and Centred Moments of Benford

4. Extension of Benford's Law Beyond the First Digit
   4.1. Extension of Benford's Law to the nth digit
   4.2. Distribution of the nth digit as n approaches infinity
   4.3. Conditional Probabilities and Independence
   4.4. Covariance and Correlation

Layout: There are four main sections in the project as seen above. These treat different aspects of Benford’s Law separately with roughly two major questions per section broken into subsections.


1. Introduction

1.1. Benford’s Law?

Benford's Law, also called the First Digit Law, refers to the frequency distribution of the first digit in many naturally occurring data sets. It was made famous in 1938 by the physicist Frank Benford, who noticed the surprising distribution of the first digit after examining a number of naturally occurring data sets. Although discovered before Benford, in 1881, by Simon Newcomb, who noticed that the front pages of his logarithm book were more worn than the back, the law bears Benford's name as he was the first to confirm it with extensive empirical evidence. To his amazement, it was satisfied by a plethora of data sets, such as surface areas of rivers, molecular weights and countless more. In essence, the law states that given a naturally occurring data set, the probability distribution of the leading digit follows a logarithmic law, with the probability of a one being approximately 30%, the probabilities decreasing from there down to a likelihood of about 4.5% for a leading nine. Although this works well for natural data sets, it is also possible to construct well-defined probability distributions which obey Benford's Law. It will also be shown that we can generalise the law beyond the first digit, and that the distribution of digits beyond the leading one approaches uniformity.

1.2. Scale Invariance

Since Benford's Law applies to many data sets, most of which are measured in different units, it is reasonable to expect the distribution to be scale invariant, which is indeed the case. This means that if we multiply the numbers in a data set by an arbitrary constant, the distribution of the first digit should remain the same. Since we are interested only in the distribution of the first digit, let us express our numbers in scientific notation, $x = m \times 10^n$ with significand $m \in [1, 10)$. The first significant digit, $d$, is then the first digit of $m$. We can therefore find a scale invariant distribution for $d$ if we find one for $m$. If the distribution of $m$ is scale invariant, then the distribution of $\log_{10}(m)$ (to the base ten, since we use a base ten number system) should stay the same when we add a constant to it (taken modulo 1, since the significand wraps around at 10). This is due to one of the defining properties of logarithms: adding a constant $k$ to $\log_{10}(m)$ is the same as multiplying $m$ by some constant $c$:

$$\log_{10}(c \cdot m) = \log_{10}(c) + \log_{10}(m) = k + \log_{10}(m), \quad \text{where } k \text{ is the constant } \log_{10}(c).$$

The only probability distribution for $\log_{10}(m)$ on the unit interval which remains unaffected by the addition of an arbitrary constant (modulo 1) is the uniform distribution. This is clear from the probability density function of the uniform distribution.

Figure 1 - PDF of Uniform Distribution on (a, b)

4

Thus, $\log_{10}(m)$ is uniformly distributed between $\log_{10}(1) = 0$ and $\log_{10}(10) = 1$. Here is a representation of the distribution of $m$ and that of the uniform.

Figure 2 – Uniform vs. Log Scale

Now, to find the corresponding probabilities for the digits one through nine, we can integrate the probability density function of $\log_{10}(m)$.

$$P(d) = P(\text{first digit} = d) = P\left(\log_{10}(d) \le \log_{10}(m) < \log_{10}(d+1)\right)$$

We can now do the integration, since $\log_{10}(m)$ is uniformly distributed on the unit interval and thus has probability density function $f(x) = 1$:

$$P(d) = \int_{\log_{10}(d)}^{\log_{10}(d+1)} 1 \, dx = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(\frac{d+1}{d}\right) = \log_{10}\left(1 + \frac{1}{d}\right)$$

As a result of scale invariance, we have now derived Benford's Law.

1.3. Discrete Probability Distribution

As a result of the previous section, we know the distribution of the first digits as described by Benford's Law. The corresponding probabilities for the leading digits are as follows, where $d \in \{1, 2, \dots, 9\}$ and $P(d) = \log_{10}(1 + 1/d)$.

d | $P(d)$ | Approx. | Frequency
1 | $\log_{10}(2/1)$ | 0.301 | 30.1%
2 | $\log_{10}(3/2)$ | 0.176 | 17.6%
3 | $\log_{10}(4/3)$ | 0.125 | 12.5%
4 | $\log_{10}(5/4)$ | 0.097 | 9.7%
5 | $\log_{10}(6/5)$ | 0.0791 | 7.91%
6 | $\log_{10}(7/6)$ | 0.067 | 6.7%
7 | $\log_{10}(8/7)$ | 0.058 | 5.8%
8 | $\log_{10}(9/8)$ | 0.051 | 5.1%
9 | $\log_{10}(10/9)$ | 0.0457 | 4.57%

Table 1 – Benford’s Law


To check that this is indeed a well-defined probability distribution, we can check that the probability mass function summed over all digits equates to one. The product inside the logarithm telescopes:

$$\sum_{d=1}^{9} P(d) = \sum_{d=1}^{9} \log_{10}\left(1 + \frac{1}{d}\right) = \log_{10}\left(\prod_{d=1}^{9} \frac{d+1}{d}\right) = \log_{10}(10) = 1$$

This is a valid probability mass function and its associated graph is a monotonically decreasing step function. This is quite an interesting result, as we are modelling a discrete distribution from a continuous function, specifically $f(x) = \log_{10}(1 + 1/x)$.

Figure 3 – Benford’s Discrete PMF
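The two calculations above are easy to cross-check. A minimal Python sketch (our illustration; the original project works in Excel, so this code is not the authors' method):

```python
import math

# First-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
probs = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"P({d}) = {p:.4f}")   # 0.3010, 0.1761, ..., 0.0458

# The telescoping product makes the total exactly log10(10) = 1
print("sum =", sum(probs.values()))  # 1.0, up to floating point
```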

1.4. Discrete vs. Continuous Distribution

Now we can raise a valid question: although Benford's Law is modelled as a discrete distribution, can we extend it to a continuous one, given that it is derived from the continuous function $\log_{10}(1 + 1/x)$? This is a valid thought, but in fact it doesn't work in the continuous case. To see this we can check whether it is well defined: the integral over its domain (taken here as $[1, 9]$, the range of the leading digits) should be equal to one.

$$\int_{1}^{9} \log_{10}\left(1 + \frac{1}{x}\right) dx = \frac{1}{\ln(10)} \int_{1}^{9} \ln\left(1 + \frac{1}{x}\right) dx$$

Here we just used the change of base formula. Now doing integration by parts yields the result:

$$\frac{1}{\ln(10)} \left[\, x \ln\left(1 + \frac{1}{x}\right) + \ln(1 + x) \,\right]_{1}^{9} \approx 0.81 \neq 1$$



So treating this as a continuous distribution doesn't make sense, but the fact that Benford's Law is modelled on a continuous function will play an important role in deriving certain distributions which also obey Benford's Law.

Figure 4 – Discrete vs. Continuous
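The non-normalisation can also be confirmed numerically. A short sketch, assuming the integration domain $[1, 9]$ used above (SciPy's quad is our choice of tool, not the project's):

```python
import math
from scipy.integrate import quad

# Integrate log10(1 + 1/x) over [1, 9]; a valid pdf would integrate to 1
area, _ = quad(lambda x: math.log10(1 + 1 / x), 1, 9)
print(area)  # ~0.81, so this is not a valid continuous density

# Cross-check against the antiderivative from integration by parts
F = lambda x: (x * math.log(1 + 1 / x) + math.log(1 + x)) / math.log(10)
print(F(9) - F(1))  # same ~0.81
```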


2. Benford's Law and Distributions

2.1. Extension of Law to Common Distributions

So far we have only looked at naturally occurring data sets, which model quite well with Benford's Law; there are, however, some standard distributions that also satisfy the law. The key thing to note here is that a distribution will satisfy the law if its probability density function closely resembles the function $\log_{10}(1 + 1/x)$.

Figure 5 – Graph of $\log_{10}(1 + 1/x)$

2.2. Uniform Distribution

The first distribution we look at is the uniform distribution. As the name suggests, one would believe that the distribution of the first digit in this case would be uniform, with the probability for each digit being exactly the same, namely $1/9$. It should be noted that when we say first digit we mean the first nonzero significant digit, as Benford's Law doesn't apply to zero. We can generate a thousand samples in Excel to see if the uniform distribution obeys Benford's Law.

Figure 6 – Uniform Distribution

So, as expected, the uniform distribution doesn't obey Benford's Law, but it was useful to conduct this experiment as we now know the accuracy of the random number generator. In general, there was a mean deviation of approximately 5% from the theoretical count of $1000/9 \approx 111$ for each respective digit, from a sample size of 1000.
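A comparable experiment in Python (a sketch only: the seed and the helper name first_digit are our own choices, not part of the original Excel workbook):

```python
import numpy as np

rng = np.random.default_rng(0)

def first_digit(x):
    """First nonzero significant digit of each positive number in x."""
    exponent = np.floor(np.log10(x))       # scale each value into [1, 10)
    return (x / 10.0**exponent).astype(int)

u = rng.uniform(0, 1, 1000)                # 1000 uniform draws on (0, 1)
counts = np.bincount(first_digit(u), minlength=10)[1:]
print(counts)  # roughly 111 of each digit, nowhere near Benford's 301, 176, ...
```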


2.3. Uniform Ratio Distribution

We now turn our attention to the uniform ratio distribution. Put simply, this distribution is best thought of in terms of Monte-Carlo simulation: if we generate two independent, identically distributed uniform random variables on the unit interval and isolate the first nonzero significant digit of their ratio, the resulting proportions obey Benford's Law very well. We have done this in Excel with a sample size of one thousand; a comparable Python sketch is shown below, followed by our results.
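A minimal sketch of the Monte-Carlo experiment (seed and variable names are our own illustration; the original was run in Excel):

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0, 1, 1000)
y = rng.uniform(0, 1, 1000)
z = x / y                                  # ratio of two independent uniforms

# first nonzero significant digit of each ratio
digits = (z / 10.0 ** np.floor(np.log10(z))).astype(int)
counts = np.bincount(digits, minlength=10)[1:]

expected = 1000 * np.log10(1 + 1 / np.arange(1, 10))
for d in range(1, 10):
    print(d, counts[d - 1], round(expected[d - 1]))   # observed vs Benford
```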

Digit | Proportion | Benford Proportion | Approx. Error
1 | 330 | 301 | 10%
2 | 148 | 176 | -16%
3 | 115 | 125 | -8%
4 | 84 | 97 | -13%
5 | 69 | 79 | -13%
6 | 62 | 67 | -7%
7 | 78 | 58 | 35%
8 | 53 | 51 | 4%
9 | 58 | 46 | 27%

Table 2 – Uniform Ratio Distribution

Figure 7 – Uniform Ratio Distribution

As we can see here, this distribution fits Benford's Law quite nicely, even with only one thousand replications. We can now confirm our empirical evidence above with some theory: we will find the probability density function. Obtaining density functions can be quite tricky in some cases, but here it's not too bad. Let our two uniform random variables be $X \sim U[0,1]$ and $Y \sim U[0,1]$. We know the cumulative distribution function for a uniform variable on $[0,1]$:

$$F_X(x) = \begin{cases} 0 & x < 0 \\ x & x \in [0,1] \\ 1 & x > 1 \end{cases}$$

And its corresponding probability density function is just the derivative of this:

$$f_X(x) = \begin{cases} 1 & x \in [0,1] \\ 0 & \text{otherwise} \end{cases}$$


Now we define the ratio $Z = X/Y$ and find its corresponding cumulative distribution function by integrating over the required area. Since $X$ and $Y$ are independent, their joint probability density function is equal to the product of their densities, namely $f_{X,Y}(x,y) = f_X(x) f_Y(y) = 1$ on the unit square.

$$F_Z(z) = P(X/Y \le z) = P(X \le zY) = \iint_{x \le zy} f_{X,Y}(x,y) \, dx \, dy$$

Now we can split this into two parts: the interval when $z \le 1$ and the interval when $z > 1$. Doing this we obtain two easy integrals:

$$F_Z(z) = \int_0^1 zy \, dy = \frac{z}{2} \quad (z \le 1), \qquad F_Z(z) = 1 - \int_0^1 \frac{x}{z} \, dx = 1 - \frac{1}{2z} \quad (z > 1)$$

Now we can differentiate these two functions in order to get the probability density function:

$$f_Z(z) = \begin{cases} \frac{1}{2} & 0 \le z \le 1 \\ \frac{1}{2z^2} & z > 1 \end{cases}$$

We can now sketch this function using Excel (Figure 8) and note the similarities with the graph of $\log_{10}(1 + 1/x)$ in Figure 5.

Figure 8 – Uniform Ratio PDF
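As an aside (not in the original report), the first-digit probabilities implied by this CDF can be computed exactly by summing over decades; a short sketch:

```python
import numpy as np

def F(z):
    """CDF of Z = X/Y for independent U(0,1) variables, derived above."""
    return np.where(z <= 1, z / 2, 1 - 1 / (2 * z))

# P(first significant digit = d) = sum over decades k of F((d+1)*10^k) - F(d*10^k)
for d in range(1, 10):
    p = sum(F((d + 1) * 10.0**k) - F(d * 10.0**k) for k in range(-8, 9))
    print(d, round(p, 4), round(np.log10(1 + 1 / d), 4))
# Close to, though not exactly, Benford: e.g. d = 1 gives 1/3 vs 0.3010
```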

The relationship between this function and that of $\log_{10}(1 + 1/x)$ is clear from above. So, a probability distribution whose density function resembles the function $\log_{10}(1 + 1/x)$ of Figure 5 will model quite well with Benford's Law. We now look at another example where this occurs.


2.4. Exponential Distribution

We now check to see if the exponential distribution models well to Benford's Law. This distribution has probability density function given by:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$

In this example there are two things to note: the range from which we sample our $x$'s and our value of $\lambda$. The former will have little effect but the latter can completely change the distribution. So for convenience we set $x \in [0, 5]$ and investigate the distribution depending on our choice of lambda. Using Excel's exponential distribution function we can generate a thousand samples using random numbers between zero and five and look at the distribution of their first digit. A comparable Python sketch is shown below, followed by one such Excel replication for a fixed value of $\lambda$.
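A sketch using inverse-transform sampling, a standard way to draw exponential variates; it may differ in detail from the Excel procedure described above, and the value lam = 1.0 is our assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.0                                  # rate parameter (our assumed choice)

# Inverse transform: if U ~ Uniform(0,1), then -ln(1 - U)/lambda ~ Exp(lambda)
u = rng.uniform(0, 1, 1000)
samples = -np.log(1 - u) / lam

digits = (samples / 10.0 ** np.floor(np.log10(samples))).astype(int)
print(np.bincount(digits, minlength=10)[1:])  # close to 301, 176, 125, ...
```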

Digit | Proportion | Benford Proportion | Approx. Error
1 | 295 | 301 | -2%
2 | 176 | 176 | 0%
3 | 128 | 125 | 2%
4 | 77 | 97 | -21%
5 | 67 | 79 | -15%
6 | 74 | 67 | 11%
7 | 67 | 58 | 16%
8 | 63 | 51 | 23%
9 | 53 | 46 | 16%

Table 3 – Exponential with fixed $\lambda$

Figure 9 – Exponential with fixed $\lambda$

Here we can observe that the exponential distribution with this choice of $\lambda$ fits quite closely to Benford's Law. Now we pose the question: what happens when we vary lambda? It is easy to check via simulation that the model works well for large lambda but breaks down when it goes below approximately 1. This is clear once again from the probability density function of the exponential model.


Here is a graph of the probability density functions for three possible lambdas:

Figure 10 – Varying $\lambda$

Here we can see that as lambda gets small the graph becomes more linear, whereas when lambda gets bigger, the graph becomes more curved. In fact, for large lambda, the exponential distribution obeys Benford's Law quite well. On the other hand, as lambda goes below one, the law breaks down, as the graph becomes almost linear and bears no resemblance to $\log_{10}(1 + 1/x)$.


3. Moments and Moment Generating Function

3.1. Moment Generating Function

Given a discrete distribution $X$, we can calculate its moment generating function from the definition $M_X(t) = E(e^{tX})$. So, given that we know the probability mass function of Benford's Law, we can find the moment generating function as follows:

$$M_X(t) = E(e^{tX}) = \sum_{d=1}^{9} e^{td} P(d) = \sum_{d=1}^{9} e^{td} \log_{10}\left(1 + \frac{1}{d}\right)$$

As a result of this we can calculate the raw moments of the distribution using the formula $E(X^n) = M_X^{(n)}(0)$, the nth derivative of $M_X$ evaluated at zero, and this is not too bad as we are working with exponentials. We can now derive a general formula to calculate all moments:

$$M_X^{(n)}(t) = \sum_{d=1}^{9} d^n e^{td} \log_{10}\left(1 + \frac{1}{d}\right) \implies M_X^{(n)}(0) = \sum_{d=1}^{9} d^n \log_{10}\left(1 + \frac{1}{d}\right)$$

We can now recognise this as the correct formula, as the raw moments are defined as:

$$E(X^n) = \sum_{d=1}^{9} d^n P(d) = \sum_{d=1}^{9} d^n \log_{10}\left(1 + \frac{1}{d}\right)$$

3.2. Raw and Centred Moments of Benford

Recall that given a discrete distribution $X$, the raw moments of the distribution are defined as $E(X^n)$, $n \in \{1, 2, 3, \dots\}$. This can be easily calculated for Benford's Law as follows:

$$E(X^n) = \sum_{d=1}^{9} d^n P(d) = \sum_{d=1}^{9} d^n \log_{10}\left(1 + \frac{1}{d}\right)$$

This can be done in Excel using the sum-product formula and yields the following result for the first four raw moments:

$E(X)$ | $E(X^2)$ | $E(X^3)$ | $E(X^4)$
3.44 | 17.89 | 115.08 | 823.27

Table 4 – Raw Moments
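These sums are easy to reproduce outside Excel; a minimal sketch mirroring the SUMPRODUCT calculation:

```python
import numpy as np

d = np.arange(1, 10)
p = np.log10(1 + 1 / d)                    # Benford pmf

# Raw moments E(X^n) as probability-weighted sums
for n in range(1, 5):
    print(n, np.sum(d**n * p))             # ~3.44, 17.89, 115.08, 823.28
```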

As a whole, these results aren't too interesting as they are raw moments, i.e. centred on zero. A much better indicator of the distribution is given by the centred moments; these are the moments centred at the mean of the distribution, and in some cases they can be standardised.

1. The first moment is the same as above, the mean: $E(X) \approx 3.44$.

2. The second centred moment is the variance, which measures squared deviation from the mean: $\mathrm{Var}(X) = E\left((X - E(X))^2\right) = E(X^2) - E(X)^2$.


This can be written explicitly as:

$$\mathrm{Var}(X) = \sum_{d=1}^{9} (d - E(X))^2 P(d) = \sum_{d=1}^{9} (d - 3.44)^2 \log_{10}\left(1 + \frac{1}{d}\right) \approx 6.06$$

This also gives us the standard deviation, $\sigma = \sqrt{\mathrm{Var}(X)} \approx 2.46$.

3. The third centred moment, once standardised, is known as the skewness. This measures the asymmetry of the distribution, with a negative skew indicating a tail on the left and a positive skew indicating a tail on the right. Knowing the graph of our distribution, we should expect a positive skew.

Figure 11 – Skewness

We can now calculate the skewness as follows:

$$\gamma_1 = \frac{E\left((X - E(X))^3\right)}{\mathrm{Var}(X)^{3/2}} = \frac{1}{\sigma^3} \sum_{d=1}^{9} (d - 3.44)^3 \log_{10}\left(1 + \frac{1}{d}\right) \approx 0.80$$

This is positive and sizeable, as expected, and is indicative of the positive skew of Benford's Law.

4. The fourth centred moment we use here is the excess kurtosis. This can be computed as:

$$\gamma_2 = \frac{E\left((X - E(X))^4\right)}{\mathrm{Var}(X)^2} - 3 = \frac{1}{\sigma^4} \left(\sum_{d=1}^{9} (d - 3.44)^4 \log_{10}\left(1 + \frac{1}{d}\right)\right) - 3 \approx -0.55$$

It should be noted that the subtraction of three may seem quizzical, but its main reason for being there is that it makes the excess kurtosis of the normal distribution equal to zero.


4. Extension of Benford's Law Beyond the First Digit

4.1. Extension of Benford's Law to the nth digit

We have seen in some detail how Benford's Law applies in the case of the first digit. In this section, we wish to generalise the law beyond the first digit. This can be done in a very natural way if we start with an intuitive example and then generalise. For example, let's take the probability that the second digit of a number is 9. The starting two digits can then be 19, 29, 39, and so on. We can now calculate the probability, since these are mutually exclusive events:

$$P(d_2 = 9) = \log_{10}\left(1 + \frac{1}{19}\right) + \log_{10}\left(1 + \frac{1}{29}\right) + \dots + \log_{10}\left(1 + \frac{1}{99}\right)$$

We can now generalise this for the second digit and obtain a formula for all its probabilities:

$$P(d_2 = d) = \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d}\right), \quad d \in \{0, 1, \dots, 9\}$$

Using this result we can create a table for the probabilities of the second digit, remembering zero is now a possibility:

d | $P(d_2 = d)$ | Approx. Frequency
0 | 0.119679 | 12%
1 | 0.11389 | 11%
2 | 0.108821 | 11%
3 | 0.10433 | 10%
4 | 0.100308 | 10%
5 | 0.096677 | 10%
6 | 0.093375 | 9%
7 | 0.090352 | 9%
8 | 0.08757 | 9%
9 | 0.084997 | 8%

Table 5 – Second Place Probabilities

Figure 12 – Second Place PMF


As we can see, this graph looks somewhat like the first digit case except that the probabilities are becoming more uniform; we will see the general result in the next section. The final thing we can do here is check that this is indeed well defined for the second digit, by summing over all probabilities (computed in Excel):

$$\sum_{d=0}^{9} P(d_2 = d) = \sum_{d=0}^{9} \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d}\right) = 1$$

It is now easy to make the final step in generalising this to the nth digit. We already have the main formula, as seen in the probability for the second digit; all we have to worry about for larger n is where we start and end the sum. For example, if we look at the case when $n = 3$ and the probability that the third digit is 1, the possibilities go from 101, 111, 121, …, 211, 221, …, 911, 921, …, 991. So we must start the sum at 10 and end the sum at 99 to get all possibilities. In general we obtain the formula:

$$P(d_n = d) = \sum_{k=10^{n-2}}^{10^{n-1} - 1} \log_{10}\left(1 + \frac{1}{10k + d}\right), \quad d \in \{0, 1, \dots, 9\}$$

4.2. Distribution of the nth digit as n approaches infinity

As we saw in the previous section, it looks like the distribution of the nth digit tends to become more uniform as n increases. In fact this is the case, and it approaches uniformity very quickly. We've already seen the instance when $n = 2$; to further illustrate this result, let's see the case when $n = 3$. Using the formula above for $P(d_n = d)$, we calculate the probabilities in Excel for the third place digit. The required formula when $n = 3$ is:

$$P(d_3 = d) = \sum_{k=10}^{99} \log_{10}\left(1 + \frac{1}{10k + d}\right), \quad d \in \{0, 1, \dots, 9\}$$

Using this formula in Excel gives the following table for third place probabilities:

d | $P(d_3 = d)$ | Approx. Frequency
0 | 0.101784 | 10.178%
1 | 0.101376 | 10.138%
2 | 0.100972 | 10.097%
3 | 0.100573 | 10.057%
4 | 0.100178 | 10.018%
5 | 0.099788 | 9.979%
6 | 0.099401 | 9.940%
7 | 0.099019 | 9.902%
8 | 0.098641 | 9.864%
9 | 0.098267 | 9.827%

Table 6 – Third Place Probabilities


Using this table we can create a graph in Excel which will give a clear illustration of the distribution of the third place digits.

Figure 13 – Third Place Probabilities

Clearly, by looking at Figure 13, the distribution of the third digit is nearly exactly uniform. The standard deviation of the ten probabilities from their mean is 0.001122, almost negligible. Running more spreadsheets in Excel for larger numbers will further confirm this result, but as it stands it is clear that as n becomes large, the distribution of the nth digit tends towards uniformity.

4.3. Conditional Probabilities and Independence

We now pose the question of independence and conditional probability. For simplicity, we will look at the case with the first and second digit distributions only. If we are given that the first digit is some predetermined number, say $d_1$, does this affect the distribution of the second digit, $d_2$? In a more mathematically succinct manner: does $P(d_2 = d \mid d_1) = P(d_2 = d)$? Using some basic probability formulas we can answer this question. Recall the formula for the conditional probability of an event A given an event B:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Now let us take the value $d_1 = 1$ and compute the corresponding probabilities for $d_2$ given this value. So we have:

$$P(d_2 = d \mid d_1 = 1) = \frac{P(d_1 = 1, \, d_2 = d)}{P(d_1 = 1)} = \frac{\log_{10}\left(1 + \frac{1}{10 + d}\right)}{\log_{10}(2)}, \quad d \in \{0, 1, \dots, 9\}$$

For example:

$$P(d_2 = 1 \mid d_1 = 1) = \frac{P(d_1 = 1, \, d_2 = 1)}{P(d_1 = 1)} = \frac{\log_{10}\left(1 + \frac{1}{11}\right)}{\log_{10}(2)} \approx 0.12553$$
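The full conditional distribution is just as easy to tabulate (a sketch reproducing Tables 7 and 8 below; the helper name is our own):

```python
import numpy as np

def second_digit_given_first(d1):
    """P(second digit = d | first digit = d1) for d = 0..9."""
    joint = np.log10(1 + 1 / (10 * d1 + np.arange(10)))  # P(d1 and d)
    return joint / np.log10(1 + 1 / d1)                  # divide by P(d1)

print(second_digit_given_first(1))   # matches Table 7
print(second_digit_given_first(9))   # matches Table 8
```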


This is quite a bit higher than our unconditioned probability for the second digit to be 1, which was 0.11389 in Table 5. We can now complete a table for $P(d_2 = d \mid d_1 = 1)$ to illustrate the conditional probability.

d | $P(d_2 = d \mid d_1 = 1)$ | $P(d_2 = d)$ | Difference
0 | 0.13750 | 0.11967 | 1.78%
1 | 0.12553 | 0.11389 | 1.16%
2 | 0.11548 | 0.10882 | 0.67%
3 | 0.10692 | 0.10433 | 0.26%
4 | 0.09954 | 0.10030 | -0.08%
5 | 0.09311 | 0.09667 | -0.36%
6 | 0.08746 | 0.09337 | -0.59%
7 | 0.08246 | 0.09035 | -0.79%
8 | 0.07800 | 0.08757 | -0.96%
9 | 0.07400 | 0.08499 | -1.10%

Table 7 – Conditional Probability for $d_2$ with $d_1 = 1$

So, from the above data, given that the first digit was a one, there is an increased probability that the second digit will be a 0, 1, 2 or 3, and it is less likely to be a 4, 5, 6, 7, 8 or 9. This suggests a very intuitive result: given that the first digit is a low number, the second digit is also likely to be low. We can now test this hypothesis in regard to the higher numbers. Let's look at the probability distribution of $P(d_2 = d \mid d_1 = 9)$. As before this is:

$$P(d_2 = d \mid d_1 = 9) = \frac{P(d_1 = 9, \, d_2 = d)}{P(d_1 = 9)} = \frac{\log_{10}\left(1 + \frac{1}{90 + d}\right)}{\log_{10}\left(1 + \frac{1}{9}\right)}, \quad d \in \{0, 1, \dots, 9\}$$

We can now complete a chart for $P(d_2 = d \mid d_1 = 9)$ to illustrate the conditional probability.

d | $P(d_2 = d \mid d_1 = 9)$ | $P(d_2 = d)$ | Difference
0 | 0.10488 | 0.11967 | -1.48%
1 | 0.10373 | 0.11389 | -1.02%
2 | 0.10261 | 0.10882 | -0.62%
3 | 0.10151 | 0.10433 | -0.28%
4 | 0.10044 | 0.10030 | 0.01%
5 | 0.09939 | 0.09667 | 0.27%
6 | 0.09836 | 0.09337 | 0.50%
7 | 0.09735 | 0.09035 | 0.70%
8 | 0.09636 | 0.08757 | 0.88%
9 | 0.09539 | 0.08499 | 1.04%

Table 8 – Conditional Probability for $d_2$ with $d_1 = 9$


So, as it turns out, our intuition was correct: given a higher number for the first digit, it is more likely that the second digit will also be a higher number. Here is a graphical juxtaposition of the above information.

Figure 14 – Conditional Probability for $d_2$ with $d_1 = 1$

Figure 15 – Conditional Probability for $d_2$ with $d_1 = 9$

4.4. Covariance and Correlation

Now that we've studied the relationship of the first and second digit distributions separately, it is natural to ask how they vary together. Recall that there is a nice way of doing this by calculating the covariance of the distributions. One can then standardise this and get a good estimate of the linear relationship of the distributions on each other; this is called the correlation. In this section, let $X$ be the first digit Benford distribution and $Y$ be the second digit Benford distribution. Now we know:

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y), \quad \text{where } E(XY) = \sum_{x=1}^{9} \sum_{y=0}^{9} x y \, P(X = x, Y = y)$$

Now the only difficulty in calculating the covariance here is $E(XY)$, as we need the joint probability mass function, namely $P(X = x, Y = y)$. In fact, this is not that bad to derive and can be easily visualised in a table. In Table 9, X is on the vertical axis and Y is on the horizontal axis.


X\Y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | P(X)
1 | 0.041 | 0.038 | 0.035 | 0.032 | 0.030 | 0.028 | 0.026 | 0.025 | 0.023 | 0.022 | 0.301
2 | 0.021 | 0.020 | 0.019 | 0.018 | 0.018 | 0.017 | 0.016 | 0.016 | 0.015 | 0.015 | 0.176
3 | 0.014 | 0.014 | 0.013 | 0.013 | 0.013 | 0.012 | 0.012 | 0.012 | 0.011 | 0.011 | 0.125
4 | 0.011 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.009 | 0.009 | 0.009 | 0.009 | 0.097
5 | 0.009 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.007 | 0.007 | 0.079
6 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.006 | 0.006 | 0.006 | 0.067
7 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.005 | 0.058
8 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.051
9 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.004 | 0.004 | 0.004 | 0.046
P(Y) | 0.120 | 0.114 | 0.109 | 0.104 | 0.100 | 0.097 | 0.093 | 0.090 | 0.088 | 0.085 | 1

Table 9 – Joint PMF of X and Y

Using this table we can now write down the joint probability mass function explicitly:

$$P(X = x, Y = y) = \log_{10}\left(1 + \frac{1}{10x + y}\right), \quad x \in \{1, \dots, 9\}, \; y \in \{0, 1, \dots, 9\}$$

Now we can calculate $E(XY)$ in Excel, as we have the formula above for the covariance and the joint probability mass function:

$$E(XY) = \sum_{x=1}^{9} \sum_{y=0}^{9} x y \log_{10}\left(1 + \frac{1}{10x + y}\right) \approx 14.80$$

We already have all the relevant information to get the expectations of X and Y, which are $E(X) \approx 3.44$ and $E(Y) \approx 4.19$.

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) \approx 14.80 - (3.44)(4.19) \approx 0.40$$

This is a positive covariance, as expected, since the two variables show similar behaviour, both conforming to Benford's Law. We can get a better idea of the strength of this relationship by computing the correlation coefficient; this will standardise the covariance. Recall the formula for the correlation coefficient of two variables:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}$$

This has the defining property of being standardised, i.e. $-1 \le \rho(X, Y) \le 1$, so it is a good indicator of how X and Y vary linearly together. We already have the means to calculate all of the above, as we just computed the covariance, and from the marginal distributions $\mathrm{Var}(X) \approx 6.06$ and $\mathrm{Var}(Y) \approx 8.25$.

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}} \approx \frac{0.40}{\sqrt{6.06}\sqrt{8.25}} \approx 0.06$$

So there is indeed a positive linear relationship between the first and second digits of Benford's Law, although it is quite weak. This may be expected: there is a strong relationship between the digits via Benford's Law, but the relationship isn't very linear, as seen from the numerous diagrams of Benford's Law.
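As a final cross-check, the covariance and correlation can be recomputed directly from the joint probability mass function (a sketch; the original calculation was done in Excel):

```python
import numpy as np

x = np.arange(1, 10)[:, None]              # first digit 1..9, column vector
y = np.arange(0, 10)[None, :]              # second digit 0..9, row vector
joint = np.log10(1 + 1 / (10 * x + y))     # joint pmf of Table 9

px, py = joint.sum(axis=1), joint.sum(axis=0)    # marginals P(X), P(Y)
ex = np.sum(np.arange(1, 10) * px)               # ~3.44
ey = np.sum(np.arange(10) * py)                  # ~4.19
cov = np.sum(x * y * joint) - ex * ey            # ~0.40
vx = np.sum(np.arange(1, 10) ** 2 * px) - ex**2  # ~6.06
vy = np.sum(np.arange(10) ** 2 * py) - ey**2     # ~8.25
print(cov, cov / np.sqrt(vx * vy))               # correlation ~0.056
```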