Review of Basic Concepts in Probability and Statistics
Sanjeev Setia
Dept. of Computer Science
George Mason University
About this Class
CS 700: Quantitative Methods and Experimental Design in Computer Science
  Required class for CS Ph.D. students
Prerequisites:
  Undergraduate probability & statistics
  Doctoral status
What you will learn
Applications of probability and statistical techniques for computer science
  comparing systems using sample data
  fitting distributions to sample data
  confidence interval calculations
  regression models
  design of experiments
  simulation and analysis of simulation results
  introduction to analytic performance modeling and queuing analysis
  workload characterization, pitfalls in performance analysis and reporting
Back-of-the-envelope calculations
Goal: motivate these techniques with examples from the research literature
Logistics
Grade: 35% project, 35% midterm, 30% take-home final
Slides, assignments, reading material on class web page: http://www.cs.gmu.edu/~setia/cs700/
Several small assignments related to material discussed in class
  Not graded, but we will go over solutions in class
Term project
  should involve experimentation (measurement, simulation)
  select a topic in your research area if possible
  apply techniques discussed in this class
Acknowledgement
These slides are based on presentations created and copyrighted by Prof. Daniel Menasce (GMU)
Review of Probability
Review of Probability Concepts
Classical (theoretical) approach (the process has to be known!):
$$P[A] = \frac{\text{No. of ways event } A \text{ can occur}}{\text{Total number of events}}$$
Empirical approach (relative frequency):
$$P[A] = \frac{\text{No. of times result } A \text{ occurred in the experiment}}{\text{Total number of observations}}$$
The relative frequency converges to the probability for a large number of experiments.
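The convergence of the relative frequency to the classical probability can be illustrated with a short simulation. The Python sketch below is only an illustration: the fair six-sided die, the event A = "an even face shows", the function name `relative_frequency`, and the fixed seed are assumptions and do not come from the slides.

```python
import random

def relative_frequency(num_trials, seed=0):
    """Estimate P[A] for A = 'a fair die shows an even face' (hypothetical event)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) % 2 == 0)
    return hits / num_trials

# Classical approach: 3 favourable faces out of 6 equally likely faces.
classical = 3 / 6

# Empirical approach: the estimate should approach 0.5 as the number of trials grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n), classical)
```

With few trials the estimate can be far from 0.5; with many trials it should settle close to the classical value.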
Review of Probability Rules
1. A probability is a number between 0 and 1 assigned to an event that is the outcome of an experiment:
$$P[A] \in [0, 1]$$
2. Complement of event A:
$$P[\bar{A}] = 1 - P[A]$$
3. If events A and B are mutually exclusive then
$$P[A \text{ or } B] = P[A] + P[B]$$
$$P[A \text{ and } B] = 0$$
Review of Probability Rules (cont’d)
4. If events A1, …, AN are mutually exclusive and collectively exhaustive then:
$$\sum_{i=1}^{N} P[A_i] = 1$$
5. If events A and B are not mutually exclusive then:
$$P[A \text{ or } B] = P[A] + P[B] - P[A \text{ and } B]$$
6. Conditional Probability:
$$P[A \mid B] = \frac{P[A \text{ and } B]}{P[B]} = \frac{P[B \mid A]\, P[A]}{P[B]}$$
Review of Probability Rules (cont’d)
7. If events A and B are independent (i.e., P[A] = P[A|B] and P[B] = P[B|A]) then:
$$P[A \text{ and } B] = P[A] \times P[B]$$
8. Regardless of whether events A and B are independent or not:
$$P[A \text{ and } B] = P[A \mid B]\, P[B] = P[B \mid A]\, P[A]$$
9. Theorem of Total Probability: if events A1, …, AN are mutually exclusive and collectively exhaustive then
$$P[B] = \sum_{i=1}^{N} P[B \mid A_i]\, P[A_i]$$
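A small numeric example may help connect the conditional probability rule with the theorem of total probability. The sketch below uses hypothetical numbers that are not from the slides: three servers A1, A2, A3 receive 50%, 30% and 20% of the requests, and B is the event "the request fails".

```python
# Hypothetical numbers: the servers A1..A3 are mutually exclusive and
# collectively exhaustive, so their probabilities sum to 1 (rule 4).
p_server = [0.5, 0.3, 0.2]                 # P[A_i]
p_fail_given_server = [0.01, 0.02, 0.05]   # P[B | A_i]

# Rule 9 (total probability): P[B] = sum_i P[B | A_i] * P[A_i]
p_fail = sum(pb * pa for pb, pa in zip(p_fail_given_server, p_server))

# Rule 6 (conditional probability): P[A_i | B] = P[B | A_i] * P[A_i] / P[B]
p_server_given_fail = [pb * pa / p_fail
                       for pb, pa in zip(p_fail_given_server, p_server)]

print(p_fail)               # probability that a request fails
print(p_server_given_fail)  # which server a failed request most likely hit
```

Under these assumed numbers the least-used server accounts for the largest share of failures, because its failure probability is much higher.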
Discrete Random Variables
Random Variables
A variable is called a random variable if it takes one of a specified set of values with a specified probability.
  Discrete random variables: can only take discrete values, e.g. age (in years) of students in this class, number of calls to a telephone exchange in one minute
  Continuous random variables: can take on “continuous” values, i.e. any real number in a range of the sample space (probabilities are assigned to intervals of values, since any single value has probability zero), e.g. time between consecutive calls to telephone exchange, time before a component fails
Discrete Probability Distribution
Distribution: set of all possible values and their probabilities.

Number of I/Os per Transaction   Probability
 0                               0.350
 1                               0.120
 2                               0.095
 3                               0.085
 4                               0.070
 5                               0.060
 6                               0.054
 7                               0.048
 8                               0.043
 9                               0.040
10                               0.035
Total                            1.000

[Figure: bar chart of probability (0.00 to 0.40) versus number of I/Os per transaction (0 to 10)]
Moments of a Discrete Random Variable
Expected Value:
$$\mu = E[X] = \sum_{\forall i} X_i \times P[X_i]$$
k-th moment:
$$E[X^k] = \sum_{\forall i} X_i^k \times P[X_i]$$

Number of I/Os per Transaction   Probability   For First Moment (average)   For Second Moment
 0                               0.350         0.000                        0.000
 1                               0.120         0.120                        0.120
 2                               0.095         0.190                        0.380
 3                               0.085         0.255                        0.765
 4                               0.070         0.280                        1.120
 5                               0.060         0.300                        1.500
 6                               0.054         0.324                        1.944
 7                               0.048         0.336                        2.352
 8                               0.043         0.344                        2.752
 9                               0.040         0.360                        3.240
10                               0.035         0.350                        3.500
Total                            1.000         2.859 (mean)                 17.673 (second moment)
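The mean and second moment in the table above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `moment` is an assumption introduced for illustration.

```python
# Distribution of the number of I/Os per transaction, copied from the table.
values = list(range(11))
probs = [0.350, 0.120, 0.095, 0.085, 0.070, 0.060,
         0.054, 0.048, 0.043, 0.040, 0.035]

def moment(values, probs, k):
    """k-th moment E[X^k] = sum_i x_i^k * P[x_i] of a discrete random variable."""
    return sum(x ** k * p for x, p in zip(values, probs))

mean = moment(values, probs, 1)            # first moment (average)
second_moment = moment(values, probs, 2)   # second moment
print(round(mean, 3), round(second_moment, 3))  # 2.859 17.673
```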
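Central Moments of a Discrete Random Variable
k-th central moment:
$$E[(X - \bar{X})^k] = \sum_{\forall i} (X_i - \bar{X})^k \times P[X_i]$$
The variance is the second central moment:
$$
\begin{aligned}
\sigma^2 = E[(X - \bar{X})^2] &= E[X^2 + (\bar{X})^2 - 2 X \bar{X}] \\
&= E[X^2] + (\bar{X})^2 - 2 (\bar{X})^2 \\
&= E[X^2] - (\bar{X})^2
\end{aligned}
$$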
Central Moments of a Discrete Random Variable

Number of I/Os per Transaction   Probability   For First Moment (average)   For Second Moment   For Second Central Moment
 0                               0.350         0.000                        0.000               2.8609
 1                               0.120         0.120                        0.120               0.4147
 2                               0.095         0.190                        0.380               0.0701
 3                               0.085         0.255                        0.765               0.0017
 4                               0.070         0.280                        1.120               0.0911
 5                               0.060         0.300                        1.500               0.2750
 6                               0.054         0.324                        1.944               0.5328
 7                               0.048         0.336                        2.352               0.8231
 8                               0.043         0.344                        2.752               1.1365
 9                               0.040         0.360                        3.240               1.5085
10                               0.035         0.350                        3.500               1.7848
Total                            1.000         2.859 (average)              17.673              9.4991 (variance)
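As a check on the table, the variance can be computed both from its definition (the second central moment) and from the shortcut E[X^2] - (E[X])^2. The following Python sketch does both; the variable names are assumptions chosen for illustration.

```python
import math

values = list(range(11))
probs = [0.350, 0.120, 0.095, 0.085, 0.070, 0.060,
         0.054, 0.048, 0.043, 0.040, 0.035]

mean = sum(x * p for x, p in zip(values, probs))

# Definition: second central moment, sum_i (x_i - mean)^2 * P[x_i].
var_central = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Shortcut: E[X^2] - (E[X])^2.
var_shortcut = sum(x * x * p for x, p in zip(values, probs)) - mean ** 2

std_dev = math.sqrt(var_central)
print(round(var_central, 4), round(var_shortcut, 4))  # 9.4991 9.4991
```

Both routes agree up to floating-point rounding; the shortcut needs only the first two moments.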
Properties of the Mean
The mean of the sum is the sum of the means.
$$E[X + Y] = E[X] + E[Y]$$
If X and Y are independent random variables, then the mean of the product is the product of the means.
$$E[XY] = E[X]\, E[Y]$$
Important discrete random variables
Binomial
Hypergeometric
Negative Binomial
Geometric
Poisson
The Binomial Distribution
Distribution: based on carrying out Bernoulli trials (independent experiments with two possible outcomes):
  Success with probability p and
  Failure with probability (1-p).
A binomial r.v. counts the number of successes in n trials.
$$P[X = k] = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
The Binomial Distribution
Success Probability (p): 0.6
Number of Attempts (n): 10

Number of Successful Attempts (k)   Probability of k Successful Attempts in n   Cumulative
 0                                  0.000105                                    0.000105
 1                                  0.001573                                    0.001678
 2                                  0.010617                                    0.012295
 3                                  0.042467                                    0.054762
 4                                  0.111477                                    0.166239
 5                                  0.200658                                    0.366897
 6                                  0.250823                                    0.617719
 7                                  0.214991                                    0.832710
 8                                  0.120932                                    0.953643
 9                                  0.040311                                    0.993953
10                                  0.006047                                    1.000000

[Figure: probability distribution and cumulative distribution of the binomial for n = 10, p = 0.6, k = 0 to 10]
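The binomial table for p = 0.6 and n = 10 can be regenerated directly from the formula. The sketch below is one way to do it in Python; the function name `binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P[X = k] = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.6
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Cumulative distribution: running sum of the probabilities.
cdf = []
running = 0.0
for value in pmf:
    running += value
    cdf.append(running)

for k in range(n + 1):
    print(f"{k:2d}  {pmf[k]:.6f}  {cdf[k]:.6f}")
```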
Shape of the Binomial Distribution
p = 0.5: symmetric for any n.
[Figure: binomial probability distribution for p = 0.5, k = 0 to 10]
Shape of the Binomial Distribution
p = 0.2: right skewed
[Figure: binomial probability distribution for p = 0.2, k = 0 to 10]
Shape of the Binomial Distribution
p = 0.8: left skewed
[Figure: binomial probability distribution for p = 0.8, k = 0 to 10]
Moments of the Binomial Distribution
Average: $np$
Variance: $np(1-p)$
Standard Deviation: $\sqrt{np(1-p)}$
Coefficient of Variation: $\frac{\sqrt{np(1-p)}}{np} = \sqrt{\frac{1-p}{np}}$
Hypergeometric Distribution
Binomial was based on experiments with equal success probability: “sampling with replacement”
Hypergeometric: not all experiments have the same success probability: “sampling without replacement”
Given a sample of size n out of a population of size N with A known successes in the population, the probability of k successes is
$$P[X = k] = \frac{\binom{A}{k} \binom{N-A}{n-k}}{\binom{N}{n}}$$
  $\binom{A}{k}$: choose k successes out of A successes in the population
  $\binom{N-A}{n-k}$: choose (n-k) failures from N-A failures in the population
  $\binom{N}{n}$: total # of possible samples
Hypergeometric Distribution

No. successes in sample (k)   Sample size (n)   No. successes in population (A)   Population size (N)   P[X=k]
 0                            20                10                                100                   0.09511627
 1                            20                10                                100                   0.26793316
 2                            20                10                                100                   0.31817063
 3                            20                10                                100                   0.20920809
 4                            20                10                                100                   0.08410730
 5                            20                10                                100                   0.02153147
 6                            20                10                                100                   0.00354136
 7                            20                10                                100                   0.00036793
 8                            20                10                                100                   0.00002300
 9                            20                10                                100                   0.00000078
10                            20                10                                100                   0.00000001

[Figure: bar chart of P[X=k] (0.00 to 0.35) versus k (0 to 10)]
In Excel: Pr[X=k] = HYPGEOMDIST(k, n, A, N)
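Outside Excel, the same probabilities follow directly from the counting formula. The Python sketch below mirrors the argument order of HYPGEOMDIST; the function name `hypergeom_pmf` is an assumption introduced for illustration.

```python
from math import comb

def hypergeom_pmf(k, n, A, N):
    """P[X = k]: k successes in a sample of n drawn without replacement
    from a population of N that contains A successes."""
    return comb(A, k) * comb(N - A, n - k) / comb(N, n)

n, A, N = 20, 10, 100
pmf = [hypergeom_pmf(k, n, A, N) for k in range(A + 1)]

for k, prob in enumerate(pmf):
    print(f"{k:2d}  {prob:.8f}")

# The mean of the distribution should equal n * A / N.
mean = sum(k * prob for k, prob in enumerate(pmf))
print(mean, n * A / N)
```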
Moments of the Hypergeometric
Average:
$$\frac{nA}{N}$$
Standard Deviation:
$$\sqrt{\frac{nA(N-A)}{N^2} \cdot \frac{N-n}{N-1}}$$
If the sample size is less than 5% of the population, the binomial is a good approximation for the hypergeometric.
Negative Binomial Distribution
Probability of success is equal to p and is the same on all trials.
Random variable X counts the number of trials until the k-th success is observed.
$$P[X = n] = \binom{n-1}{k-1} p^k (1-p)^{n-k}$$
[Figure: sequence of trials 1, 2, 3, 4, ..., n-1, n, each marked S (success) or F (failure)]
Negative Binomial Distribution
Success probability: 0.8

k   n    Prob[X=n]
1   1    0.800000
1   2    0.160000
1   3    0.032000
1   4    0.006400
5   5    0.327680
5   6    0.327680
5   7    0.196608
5   8    0.091750
5   9    0.036700
5   10   0.013212
5   11   0.004404

[Figure: bar chart of Prob[X=n] (0.00 to 0.35) versus n (5 to 11) for k = 5]
In Excel: Pr[X=n] = NEGBINOMDIST(n-k, k, p)
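The negative binomial probabilities can also be computed without Excel. The Python sketch below implements the formula for the number of trials until the k-th success; the function name `neg_binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb

def neg_binomial_pmf(n, k, p):
    """P[X = n]: the k-th success is observed on trial n (n >= k)."""
    if n < k:
        return 0.0  # k successes need at least k trials
    return comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

p = 0.8
for k, n in [(1, 1), (1, 2), (5, 5), (5, 6), (5, 7), (5, 8)]:
    print(k, n, round(neg_binomial_pmf(n, k, p), 6))
```

Note that Excel's NEGBINOMDIST takes the number of failures (n-k) as its first argument, whereas this sketch takes the total number of trials n.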
Moments of the Negative Binomial Distribution
Average:
$$\frac{k}{p}$$
Standard Deviation:
$$\sqrt{\frac{k(1-p)}{p^2}}$$
Coefficient of Variation:
$$\sqrt{\frac{1-p}{k}}$$
Geometric Distribution
Special case of the negative binomial with k=1.
Probability that the first success occurs on the n-th trial is
$$P[X = n] = p(1-p)^{n-1}, \quad n = 1, 2, \ldots$$
Discrete random variable with the “memoryless” property
Geometric Distribution
Success probability: 0.6

n    P[X=n]
1    0.6000
2    0.2400
3    0.0960
4    0.0384
5    0.0154
6    0.0061
7    0.0025
8    0.0010
9    0.0004
10   0.0002

[Figure: bar chart of P[X=n] (0.0 to 0.7) versus n (1 to 10)]
Moments of the Geometric Distribution
Average:
$$\frac{1}{p}$$
Standard Deviation:
$$\sqrt{\frac{1-p}{p^2}}$$
Coefficient of Variation:
$$\sqrt{1-p} \le 1$$
Poisson Distribution
Used to model the number of arrivals over a given interval, e.g.,
  Number of requests to a server
  Number of failures of a component
  Number of queries to the database.
A Poisson distribution usually arises when arrivals come from a large number of independent sources.
Poisson Distribution
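Distribution:
$$P[X = k] = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, \ldots, \infty$$
Counting arrivals in an interval of duration t:
$$P[k \text{ arrivals in } [0, t)] = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \quad k = 0, 1, \ldots, \infty$$
Average = Variance = λ (so the standard deviation is √λ)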
Poisson Distribution
Lambda: 10

K    Poisson Distribution   CDF
0    0.00005                0.0000
1    0.00045                0.0005
2    0.00227                0.0028
3    0.00757                0.0103
4    0.01892                0.0293
5    0.03783                0.0671
6    0.06306                0.1301
7    0.09008                0.2202
8    0.11260                0.3328
9    0.12511                0.4579
10   0.12511                0.5830
11   0.11374                0.6968
12   0.09478                0.7916
13   0.07291                0.8645
14   0.05208                0.9165
15   0.03472                0.9513
16   0.02170                0.9730
17   0.01276                0.9857
18   0.00709                0.9928
19   0.00373                0.9965
20   0.00187                0.9984

[Figure: Poisson distribution and CDF (0.0 to 1.0) versus K (0 to 20)]
In Excel:
  P[X=k] = POISSON(k, λ, FALSE)
  P[X≤k] = POISSON(k, λ, TRUE)
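The Poisson table can be reproduced from the formula, and the same code gives a numerical check of the moments: for a Poisson random variable the mean and the variance are both equal to λ. The Python sketch below uses hypothetical helper names (`poisson_pmf`, `poisson_cdf`) and truncates the infinite sums at k = 99, which is an assumption that is adequate for λ = 10.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P[X = k] = lam^k * exp(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

def poisson_cdf(k, lam):
    """P[X <= k], the running sum of the pmf."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 10
for k in (0, 5, 10, 20):
    print(k, round(poisson_pmf(k, lam), 5), round(poisson_cdf(k, lam), 4))

# Numerical check of the moments (sums truncated at k = 99).
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(100))
print(mean, variance)  # both should be very close to lam
```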
Continuous Random Variables
Relevant Functions
Probability density function (pdf) of r.v. X: $f_X(x)$
$$P[a \le X \le b] = \int_a^b f_X(x)\, dx$$
Cumulative distribution function (CDF):
$$F_X(x) = P[X \le x]$$
pdf is the derivative of the CDF: f(x) = dF(x)/dx
Tail of the distribution (reliability function):
$$R_X(x) = P[X > x] = 1 - F_X(x)$$
Moments
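k-th moment:
$$E[X^k] = \int_{-\infty}^{+\infty} x^k f_X(x)\, dx$$
Expected value (mean): first moment
$$\mu = E[X] = \int_{-\infty}^{+\infty} x f_X(x)\, dx$$
k-th central moment:
$$E[(X - \mu)^k] = \int_{-\infty}^{+\infty} (x - \mu)^k f_X(x)\, dx$$
Variance: second central moment
$$\sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{+\infty} (x - \mu)^2 f_X(x)\, dx$$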
Important continuous distributions
Uniform
Exponential
Normal
Erlang
Hypo-exponential
Hyper-exponential
Weibull
Lognormal
Pareto
The Uniform Distribution
pdf:
$$f_X(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Mean:
$$\mu = \frac{a+b}{2}$$
Variance:
$$\sigma^2 = \frac{(b-a)^2}{12}$$
The Uniform Distribution
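[Figure: pdf of U(0,1), a rectangle of height 1 over the interval from 0 to 1, with the points 0.2 and 0.5 marked]
P[0.2 < X < 0.5] = (0.5 - 0.2) x 1.0 = 0.3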
The Normal Distribution
Important because
  Many natural phenomena follow a normal distribution (bell curve)
  Sum of independent normal variables is normally distributed
  Sum of a large number of independent observations from any distribution with finite variance tends to have a normal distribution
Two parameters: mean and standard deviation, written $N(\mu, \sigma)$.
$$f_X(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(1/2)\,[(x-\mu)/\sigma]^2}$$
The Normal Distribution $N(\mu, \sigma)$
[Figure: pdfs of N(0,1) and N(0,3) for x from -6 to 6]
The Standard Normal Distribution
To use tables for computing values related to the normal distribution, we need to standardize a normal r.v. as
$$Z = \frac{X - \mu}{\sigma} \quad \text{(standard normal score)}$$
Given X, compute a Z value z.
Find the area value in a Table (Prob[0 < Z < z]).
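The table lookup can be mimicked in code with the error function, since the standard normal CDF satisfies Φ(z) = (1 + erf(z/√2))/2. The Python sketch below is an illustration under assumed numbers: the response time X ~ N(100, 15) and the threshold 130 are hypothetical, as are the function names.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Standardize x: z = (x - mu) / sigma."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """P[Z <= z] for Z ~ N(0, 1), using the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def table_area(z):
    """Area from 0 to z, i.e. Prob[0 < Z < z], as listed in a normal table."""
    return standard_normal_cdf(z) - 0.5

# Hypothetical example: X ~ N(mu=100, sigma=15); what is P[X <= 130]?
z = z_score(130, 100, 15)
p = 0.5 + table_area(z)   # area below 0 plus area from 0 to z
print(z, round(table_area(z), 4), round(p, 4))
```

Adding 0.5 accounts for the half of the distribution below the mean, which a table of areas from 0 to z does not include.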
Normal CDF
[Figure: normal CDF, y-axis from 0.0 to 1.0, x-axis from -10 to 10]
In Excel:
  FX(x) = NORMDIST(x, µ, σ, TRUE)
  fX(x) = NORMDIST(x, µ, σ, FALSE)
Using Normal Tables
[Figure: pdf of the standard normal Z (0,1) for values from -6 to 6]
Table shows area from 0 to Z.
The Exponential Distribution
Widely used in queuing systems to model the inter-arrival time between requests to a system.
If the inter-arrival times are exponentially distributed then the number of arrivals in an interval t has a Poisson distribution and vice-versa.
$$F_X(x) = 1 - e^{-\lambda x}, \quad x \ge 0$$
$$f_X(x) = \lambda e^{-\lambda x}$$
The Exponential Distribution
Mean and Standard Deviation:
$$\mu = \sigma = 1/\lambda$$
The c.v. is 1.
The exponential distribution is “memoryless”, and it is the only continuous distribution with this property. The distribution of the residual time until the next arrival is also exponential with the same mean as the original distribution.
Memoryless Property of the Exponential Distribution
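[Figure: timeline from arrival i to arrival i+1; X is the inter-arrival time, t is the time already elapsed since arrival i, and Y is the residual time until arrival i+1]
$$
\begin{aligned}
P[Y \le y \mid X > t] &= P[X - t \le y \mid X > t] = P[X \le y + t \mid X > t] \\
&= \frac{P[t < X \le y + t]}{P[X > t]} = \frac{P[X \le y + t] - P[X \le t]}{P[X > t]} \\
&= \frac{1 - e^{-\lambda (y+t)} - (1 - e^{-\lambda t})}{1 - (1 - e^{-\lambda t})} = 1 - e^{-\lambda y}
\end{aligned}
$$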
Exponential Distribution
[Figure: pdf and CDF of the exponential distribution, y-axis from 0.0 to 1.0, x-axis from 0 to 12]
In Excel:
  FX(x) = EXPONDIST(x, λ, TRUE)
  fX(x) = EXPONDIST(x, λ, FALSE)
Generation of Random Variables
[Figure: plot of a CDF, y-axis from 0.0 to 1.0, x-axis from -10 to 10]
• Randomly generate a number u = U(0,1)
• x = F^{-1}(u), where F is the CDF
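For the exponential distribution the two steps above have a closed form: solving u = 1 - e^{-λx} for x gives x = -ln(1 - u)/λ. The Python sketch below applies this inverse transform; the rate λ = 2, the sample count, the seed, and the function names are assumptions chosen for illustration.

```python
import math
import random

def exponential_inverse_cdf(u, lam):
    """F^{-1}(u) for the exponential CDF F(x) = 1 - exp(-lam * x)."""
    return -math.log(1.0 - u) / lam

def sample_exponential(lam, count, seed=0):
    """Draw `count` exponential variates by the inverse-transform method."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # rng.random() returns u in [0, 1), so 1 - u is never zero.
    return [exponential_inverse_cdf(rng.random(), lam) for _ in range(count)]

lam = 2.0
samples = sample_exponential(lam, 100_000)
sample_mean = sum(samples) / len(samples)
print(sample_mean, 1 / lam)  # the sample mean should be close to 1/lam
```

The same pattern works for any distribution whose CDF can be inverted, either analytically or numerically.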
Goals in Studying Statistics
Analyze, present, and describe numerical information properly.
Draw conclusions about the properties of large populations from sample information (inference).
Design experiments to learn about real-world situations.
Forecast or predict unmeasured values from a set of measurements.
Population and Sample
Population (or universe): all N members of a class or group.
  E.g., all files retrieved from a Web site since the site went into operation.
Sample: portion of the population. Its size is denoted by n.
  E.g., the set of files retrieved from a Web site from 10:00 AM to 2:00 PM on January 03, 2001.
Census, Parameter, Statistic
Census: enumeration or count of every member of the population.
Parameter: summary measure of the individual observations made in a census of an entire population.
  E.g., average size of all files ever retrieved from the Web site.
Statistic: summary measure obtained from a sample.
  E.g., average size of all files retrieved from the Web site from 10:00 AM to 2:00 PM on January 03, 2001.
Visualizing Numerical Data
Types of Plots:
  Time ordered plots: the x-axis is time.
    o Time-scale analysis: time is slotted into fixed time intervals. The y-axis displays a statistic over the time slot (e.g., sum, average).
    o Changing the time scale may reveal interesting properties about the variable being plotted (e.g., strong correlations between adjacent time intervals).
  Percent frequency histograms: show the percentage of occurrences of values in a bin (range of values).
  Cumulative frequency histograms.
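The two histogram types can be computed from raw measurements in a few lines. The Python sketch below is an illustration only: the inter-arrival times are made-up values, the function name `percent_histogram` is hypothetical, and non-negative data with equal-width bins is assumed.

```python
def percent_histogram(data, bin_width, num_bins):
    """Return (percent, cumulative) frequencies for equal-width bins starting at 0.

    Assumes non-negative data; values beyond the last bin are counted in it.
    """
    counts = [0] * num_bins
    for x in data:
        index = min(int(x // bin_width), num_bins - 1)
        counts[index] += 1
    percent = [100.0 * c / len(data) for c in counts]

    # Cumulative frequency: running sum of the per-bin percentages.
    cumulative = []
    running = 0.0
    for value in percent:
        running += value
        cumulative.append(running)
    return percent, cumulative

# Hypothetical inter-arrival times in seconds.
times = [0.1, 0.3, 0.5, 0.9, 1.1, 1.3, 2.2, 2.5, 3.9, 7.0]
percent, cumulative = percent_histogram(times, bin_width=1.0, num_bins=5)
print(percent)     # percentage of observations per bin
print(cumulative)  # cumulative percentage, ending at 100
```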
Example of a Time Plot
[Figure: time plot; x-axis: Time (sec); y-axis: Inter-arrival Time of Requests to a Web Server (in sec)]
Time-scale Analysis
From “In Search of Invariants in E-commerce Workloads,” Menascé et al., Proc. ACM Conference on E-commerce, Minneapolis, MN, October 17-20, 2000.
Example of a Percentage Frequency Histogram for Inter-arrival Time Between Requests
[Figure: histogram; x-axis bins from 0.40 to 9.60, y-axis from 0% to 30%]
Example of a Cumulative Percentage Frequency Plot for Inter-arrival Time Between Requests
[Figure: cumulative frequency plot; x-axis bins from 0.40 to 9.60, y-axis from 0.0 to 1.0]
Other kinds of plots
Stacked histograms, Gantt charts, Kiviat charts, Schumacher charts: see Chapter 10 of Jain