Review of Basic Concepts in Probability and Statisticssetia/cs700-S04/slides/lecture1.pdfReview of Basic Concepts in Probability and Statistics Sanjeev Setia Dept of Computer Science

1

1

Review of Basic Concepts in Probabilityand Statistics

Sanjeev SetiaDept of Computer ScienceGeorge Mason University

2

About this Class

CS 700: Quantitative Methods andExperimental Design in Computer Science Required class for CS Ph.D. students

Prerequisites: Undergraduate probability & statistics Doctoral status

2

3

What you will learn

Applications of probability and statistical techniques forcomputer science comparing systems using sample data fitting distributions to sample data confidence interval calculations regression models design of experiments simulation and analysis of simulation results introduction to analytic performance modeling and queuing analysis workload characterization, pitfalls in performance analysis and

reporting Back-of-the envelope calculations

Goal: motivate these techniques with examples from theresearch literature

4

Logistics

Grade: 35% project, 35% midterm, 30% take homefinal

Slides, assignments, reading material on class webpage http://www.cs.gmu.edu/~setia/cs700/

Several small assignments related to materialdiscussed in class Not graded, but we will go over solutions in class

Term project should involve experimentation (measurement, simulation) select a topic in your research area if possible apply techniques discussed in this class

3

5

Acknowledgement

These slides are based onpresentations created and

copyrighted by Prof. Daniel Menasce(GMU)

6

Review of Probability

4

7

Review of Probability Concepts

Classical (theoretical) approach:

Empirical approach (relative frequency):

The relative frequency converges to theprobability for a large number of experiments.

Events ofNumber Total

OccurCan Event WaysNo. A process has to be known!

nsObservatio ofNumber Total

Experiment in the Occurred Result Times No. A

8

Review of Probability Rules1. A probability is a number between 0 and 1

assigned to an event that is the outcome of anexperiment:

2. Complement of event A.

3. If events A and B are mutually exclusive then

[ ]1,0][ ∈AP

][1][ APAP −=

[ ] ][][or BPAPBAP +=

[ ] 0 and =BAP

5

9

Review of Probability Rules (cont’d)4. If events A1, …, AN are mutually exclusive

and collectively exhaustive then:

5. If events A and B are not mutually exclusivethen:

6. Conditional Probability:

1][1

=∑=

N

iiAP

] and [][][]or [ BAPBPAPBAP −+=

][

][]|[

][

] and []|[

BP

APABP

BP

BAPBAP ==

10

Review of Probability Rules (cont’d)7. If events A and B are independent (i.e., P[A] =

P[A|B] and P[B]=P[B|A]) then:

8. Regardless of whether events A and B areindependent or not

9. Theorem of Total Probability: if events A1, …, ANare mutually exclusive and collectively exhaustivethen

][][] and [ BPAPBAP ×=

][]|[][]|[] and [ APABPBPBAPBAP ==

][]|[][1

i

N

ii APABPBP ∑

=

=

6

11

Discrete Random Variables

12

Random Variables

A variable is called a random variable if it takesone of a specified set of values with a specifiedprobability Discrete random variables: can only take discrete values,

e.g. age (in years) of students in this class, number ofcalls to a telephone exchange in one minute

Continuous random variables: can take on “continuous”values, i.e. every real number in sample space has aprobability of occurring, e.g. time between consecutivecalls to telephone exchange, time before a componentfails

7

13

Discrete Probability Distribution

Distribution: set of all possible values andtheir probabilities.

Number of I/Os per

Transaction Probability

0 0.3501 0.1202 0.0953 0.0854 0.0705 0.0606 0.0547 0.0488 0.0439 0.04010 0.035

1.000

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0 1 2 3 4 5 6 7 8 9 10

14

Moments of a Discrete Random Variable

Expected Value:

k-th moment:

][][ ii

i XPXXE ×== ∑∀

µ

][][ ii

ki

k XPXXE ×== ∑∀

µ

Number of I/Os per


For First Moment

(average)

For Second Moment

0 0.350 0.000 0.0001 0.120 0.120 0.1202 0.095 0.190 0.3803 0.085 0.255 0.7654 0.070 0.280 1.1205 0.060 0.300 1.5006 0.054 0.324 1.9447 0.048 0.336 2.3528 0.043 0.344 2.7529 0.040 0.360 3.24010 0.035 0.350 3.500

1.000 2.859 17.673

meansecond moment

8

15

Central Moments of a Discrete RandomVariable

k-th central moment:

The variance is the second central moment:

][)(])[( ik

ii

k XPXXXXE ×−=− ∑∀

22

222

2222

)(][

)(2)(][

]2)([])[(

XXE

XXXE

XXXXEXXE

−=

=−+=

−+=−=σ

16

Central Moments of a Discrete RandomVariable

Number of I/Os per


For First Moment

(average)

For Second Moment

For Second Central Moment

0 0.350 0.000 0.000 2.86091 0.120 0.120 0.120 0.41472 0.095 0.190 0.380 0.07013 0.085 0.255 0.765 0.00174 0.070 0.280 1.120 0.09115 0.060 0.300 1.500 0.27506 0.054 0.324 1.944 0.53287 0.048 0.336 2.352 0.82318 0.043 0.344 2.752 1.13659 0.040 0.360 3.240 1.508510 0.035 0.350 3.500 1.7848

1.000 2.859 17.673 9.4991

varianceaverage

9

17

Properties of the Mean

The mean of the sum is the sum of themeans.

If X and Y are independent randomvariables, then the mean of the product isthe product of the means.

][][][ YEXEYXE +=+

][][][ YEXEXYE =

18

Important discrete random variables

BinomialHypergeometricNegative BinomialGeometricPoisson

10

19

The Binomial Distribution Distribution: based on carrying out Bernoulli

trials (independent experiments with twopossible outcomes): Success with probability p and Failure with probability (1-p).

A binomial r.v. counts the number ofsuccesses in n trials.

knkknk ppknk

npp

k

nkXP −− −

−=−

== )1(

)!(!

!)1(][

20

The Binomial Distribution

Success Probability 0.6 (p)Number of Attempts 10 (n) Number

of Attempts

(k)

Probability k successful

attempts in n Cumulative

0 0.000105 0.0001051 0.001573 0.0016782 0.010617 0.0122953 0.042467 0.0547624 0.111477 0.1662395 0.200658 0.3668976 0.250823 0.6177197 0.214991 0.8327108 0.120932 0.9536439 0.040311 0.993953

10 0.006047 1.000000

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 1 2 3 4 5 6 7 8 9 10

Probability Distribution Cumulative Distribution

11

21

Shape of the Binomial Distribution

Probability Distribution

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 1 2 3 4 5 6 7 8 9 10

p = 0.5 symmetric for any n.

22


p = 0.2 right skewed


0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 1 2 3 4 5 6 7 8 9 10

12

23


p = 0.8 left skewed


0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 1 2 3 4 5 6 7 8 9 10

24

Moments of the Binomial Distribution

Average: n p Variance: Standard Deviation: Coefficient of Variation:

)1( pnp −

)1( pnp −

np

p

np

pnp −=

− 1)1(

13

25

Hypergeometric Distribution Binomial was based on experiments with equal success

probability “sampling with replacement”

Hypergeometric: not all experiments have the samesuccess probability “sampling without replacement”

Given a sample size of n out of a population of size Nwith A known successes in the population, theprobability of k successes is

−

−

==

n

N

kn

AN

k

A

kXP ][

total # of possible samples

choose (n-k) failures fromN-A failures in the population

choose k successes out of Asuccesses in the population

26

Hypergeometric DistributionNo. successes

in samplesample

size

no. successes

in population

population size

k n A N0 20 10 100 0.095116271 20 10 100 0.267933162 20 10 100 0.318170633 20 10 100 0.209208094 20 10 100 0.084107305 20 10 100 0.021531476 20 10 100 0.003541367 20 10 100 0.000367938 20 10 100 0.000023009 20 10 100 0.0000007810 20 10 100 0.00000001

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0 1 2 3 4 5 6 7 8 9 10

In Excel:Pr[X=k]=HYPGEOMDIST (k,n,A,N)

14

27

Moments of the Hypergeometric

Average:

Standard Deviation:

If the sample size is less than 5% of thepopulation, the binomial is a goodapproximation for the hyper-geometric.

N

nA

1

)(2 −

−−

N

nN

N

ANnA

28

Negative Binomial Distribution

Probability of success is equal to p and is thesame on all trials.

Random variable X counts the number of trialsuntil the k-th success is observed.

( ) kkn ppk

nnXP −−

−

−== 1

1

1][

1 2 3 4 n-1 n

SS F F S F. . .

15

29

Negative Binomial DistributionSuccess probability 0.8

k n Prob[X=n]1 1 0.8000001 2 0.1600001 3 0.0320001 4 0.0064005 5 0.3276805 6 0.3276805 7 0.1966085 8 0.0917505 9 0.0367005 10 0.0132125 11 0.004404

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

5 6 7 8 9 10 11

n

In Excel:Pr [X=n] = NEGBINOMDIST (n-k,k,p)

30

Moments of the Negative BinomialDistribution

Average:

Standard Deviation:

Coefficient of Variation:

p

k

2)1(

p

pk −

k

p−1

16

31

Geometric Distribution

Special case of the negative binomial withk=1.

Probability that the first success occursafter n trials is

Discrete random variable with the“memoryless” property

,...2,1)1(][ 1 =−== − nppnXp n

32

Geometric DistributionSuccess probability 0.6

n P[X=n]1 0.60002 0.24003 0.09604 0.03845 0.01546 0.00617 0.00258 0.00109 0.000410 0.0002

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

1 2 3 4 5 6 7 8 9 10

17

33

Moments of the Geometric Distribution

Average:

Standard Deviation:

Coefficient of Variation:

p

1

21

p

p−

11 ≤− p

34

Poisson Distribution

Used to model the number of arrivals overa given interval, e.g., Number of requests to a server Number of failures of a component Number of queries to the database.

A Poisson distribution usually arises whenarrivals come from a large number ofindependent sources.

18

35

Poisson Distribution

Distribution: Counting arrivals in an interval of duration t:

Average = Standard Deviation = λ

∞===−

,...,1,0!

][ kk

ekXP

k λλ

( )∞==

−,...,1,0

!]t)[0,in arrivals [ k

k

etkP

tk λλ

36

Poisson DistributionLambda 10

KPoisson Distribution CDF

0 0.00005 0.00001 0.00045 0.00052 0.00227 0.00283 0.00757 0.01034 0.01892 0.02935 0.03783 0.06716 0.06306 0.13017 0.09008 0.22028 0.11260 0.33289 0.12511 0.457910 0.12511 0.583011 0.11374 0.696812 0.09478 0.791613 0.07291 0.864514 0.05208 0.916515 0.03472 0.951316 0.02170 0.973017 0.01276 0.985718 0.00709 0.992819 0.00373 0.996520 0.00187 0.9984

0.00.10.20.30.40.50.60.70.80.91.0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Distribution CDF

In Excel:P[X=k] = POISSON (k,λ,FALSE)P[X≤k] = POISSON (k,λ,TRUE)

19

37

Continuous Random Variables

38

Relevant Functions

Probability density function (pdf) of r.v. X:

Cumulative distribution function (CDF):

pdf is the derivative of the CDF f(x) = dF(x)/dx

Tail of the distribution (reliability function):

)(xfX

dxxfbXaPb

a X )(][ ∫=≤≤

][)( xXPxFX ≤=

)(1][)( xFxXPxR XX −=>=

20

39

Moments

k-th moment: Expected value (mean): first moment

k-th central moment:

Variance: second central moment

dxxfxXE Xkk )(][ ∫

+∞

∞−=

dxxxfXE X )(][ ∫+∞

∞−==µ

dxxfxXE Xkk )()(])[( ∫

+∞

∞−−=− µµ

dxxfxXE X )()(])[( 222 ∫+∞

∞−−=−= µµσ

40

Important continuous distributions

Uniform Exponential Normal Erlang Hypo-exponential Hyper-exponential Weibull Lognormal Pareto

21

41

The Uniform Distribution

pdf:

Mean:

Variance:

≤≤

−=otherwise0

1)( bxa

abxfX

2

ba +=µ

( )12

22 ab −=σ

42

The Uniform Distribution

0 1

1

0.2 0.5

P[0.2<X<0.5]=(0.5-0.2)x1.0=0.3

U(0,1)

22

43

The Normal Distribution

Important because Many natural phenomena follow a normal distribution (bell curve) Sum of independent normal variables is normally distributed Sum of a large number of independent observations from any

distribution tends to have a normal distribution

Two parameters: mean and standard deviation.

),( σµN

[ ]2/)()2/1(

2

1)( σµ

σπ−−= x

X exf

44

The Normal Distribution ),( σµN

0.0

0.1

0.1

0.2

0.2

0.3

0.3

0.4

0.4

0.5

-6 -4 -2 0 2 4 6

N (0,1) N (0,3)

23

45

The Standard Normal Distribution

To use tables for computing values related to thenormal distribution, we need to standardize a normalr.v. as

Given X, compute a Z value z. Find the area value in a Table (Prob [0<Z<z]).

σµ−

=X

Z

standard normal score

46

Normal CDF

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

-10 -9 -8 -7 -6 -5 -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 5 6 7 8 9 10

In Excel:FX(x)=NORMDIST(x,µ,σ,TRUE)fX(x)=NORMDIST(x,µ,σ,FALSE)

24

47

Using Normal TablesZ (0,1)

0.0

0.1

0.1

0.2

0.2

0.3

0.3

0.4

0.4

0.5

-6 -4 -2 0 2 4 6

Table shows area from 0 to Z.

48

The Exponential Distribution

Widely used in queuing systems to modelthe inter-arrival time between requests toa system.

If the inter-arrival times are exponentiallydistributed then the number of arrivals inan interval t has a Poisson distribution andvice-versa.

01)( . ≥−= − xexF xX

λxX exf .)( λλ −=

25

49

The Exponential Distribution

Mean and Standard Deviation:

The c.v is 1. The exponential is the only continuousr.v. with c.v =1.

The exponential distribution is “memoryless.” Thedistribution of the residual time until the nextarrival is also exponential with the same mean asthe original distribution.

λσµ /1==

50

Memoryless Property of the Exponential Distribution

arrival i arrival i+1

t

€

P[Y ≤ y X > t] = P[X − t ≤ y X > t]

P[X ≤ y + t | X > t] =P[t < X ≤ y + t]

P[X > t]

=P[X ≤ y + t]− P[X ≤ t]

P[X > t]

=1− e−λ.(y+ t ) − (1− e−λ.t )

1− (1− e−λ.t )=1− e−λ.y

X

Y

26

51

Exponential Distribution

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 2 4 6 8 10 12

pdf CDF

In Excel:FX(x) = EXPONDIST(x,λ,TRUE)fX(x) = EXPONDIST(x,λ,FALSE)

52

Generation of Random Variables

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

-10 -9 -8 -7 -6 -5 -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 5 6 7 8 9 10

• randomly generate a number u = U(01,)• x = F-1 (u) where F is the CDF

27

53

Goals in Studying Statistics

Analyze, present, and describe numericalinformation properly.

Draw conclusions about the properties of largepopulations from sample information (inference).

Design experiments to learn about real-worldsituations.

To forecast or predict not-measured values from aset of measurements.

54

Population and Sample

Population (or universe): all N members of aclass or group. E.g., all files retrieved from a Web site since

the site went into operation. Sample: portion of the population. Its size

is denoted by n. E.g., the set of files retrieved from a Web site

from 10:00 AM to 2:00 PM on January 03,2001.

28

55

Census, Parameter, Statistic

Census: enumeration or count of every member of thepopulation.

Parameter: summary measure of the individualobservations made in census of an entire population. E.g., average size of all files ever retrieved from the Web

site.

Statistic: summary measure obtained from a sample. E.g., average size of all files retrieved from the Web site

from 10:00 AM to 2:00 PM on January 03, 2001.

56

Visualizing Numerical Data

Type of Plots: Time ordered plots: the time scale is time.

o Time-scale analysis: time is slotted into fixed timeintervals. The y-axis displays a statistics over the timeslot (e.g., sum, average).

o Changing the time scale may reveal interestingproperties about the variable being plotted (e.g., strongcorrelations between adjacent time intervals).

Percent frequency histograms: show the percentage ofoccurrences of values in a bin (range of values).

Cumulative frequency histograms.

29

57

Example of a Time Plot

0

1

2

3

4

5

6

7

8

9

102 78 156

236

316

404

486

571

660

741

820

900

987

1072

1153

1238

1325

1406

1487

1560

1644

1726

1808

1894

1973

2059

2146

2228

2303

2378

2453

2538

2620

2711

2801

2883

2963

3056

3138

3216

3294

3374

3458

3546

Time (sec)

Inte

rrar

riva

l Tim

e o

f R

equ

ests

to

a W

eb S

erve

r (i

n s

ec)

58

Time-scale Analysis

From “In Search of Invariants in E-commerce Workloads,” Menascé et al,Proc. ACM Conference onE-commerce, Minneapolis,MN, October 17-20, 2000.

30

59

0%

5%

10%

15%

20%

25%

30%

0.40 0.80 1.20 1.60 2.00 2.40 2.80 3.20 3.60 4.00 4.40 4.80 5.20 5.60 6.00 6.40 6.80 7.20 7.60 8.00 8.40 8.80 9.20 9.60

Example of a Percentage Frequency Histogram for Inter-arrivalTime Between Requests

60

Example of a Cumulative Percentage Frequency Plot for Inter-arrivalTime Between Requests

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0.40 0.80 1.20 1.60 2.00 2.40 2.80 3.20 3.60 4.00 4.40 4.80 5.20 5.60 6.00 6.40 6.80 7.20 7.60 8.00 8.40 8.80 9.20 9.60

31

61

Other kinds of plots

Stacked histograms, Gantt charts, Kiviatcharts, Schumacher charts see Chapter 10 of Jain

Documents

Review of Basic Concepts in Probability and Statisticssetia/cs700-S04/slides/lecture1.pdfReview of Basic Concepts in Probability and Statistics Sanjeev Setia Dept of Computer Science