1
Introduction to statistics
Literature
Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley
Schickinger, Steger: Diskrete Strukturen Band 2, Springer
David Lilja: Measuring Computer Performance: A Practitioner’s Guide, Cambridge University Press
2
Goals
❒ Provide intuitive conceptual background for some standard statistical methods
❍ Draw meaningful conclusions in the presence of noisy measurements
❍ Learn how to apply techniques in new situations
→ Don’t simply plug and crank from a formula
❒ Present techniques for aggregating large quantities of data
❍ Obtain a big-picture view of your results
❍ Obtain new insights from complex measurement and simulation results
3
Statistics: Why do we need it?
1. Aggregate data into meaningful information.
[Figure: a block of raw measurements (445, 446, 397, 226, 388, 3445, 188, 1002, 47762, ...) reduced to a single summary value x̄]
4
What is a statistic?
❒ “A quantity that is computed from a sample [of data].”
Merriam-Webster
→ A single number used to summarize a larger collection of values
What are statistics?
❒ “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”
Merriam-Webster
→ We are most interested in analysis and interpretation here
❒ “Lies, damn lies, and statistics!”
5
The simplest statistic: a mean?
❒ Reduces dataset to a single number
❒ But what does this mean mean?
❒ Indices of central tendency
❍ Sample mean
❍ Sample median
❍ Sample mode
❒ Other means
❍ Arithmetic
❍ Harmonic
❍ Geometric
❒ Quantifying dispersion
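The three "other means" listed above can be compared side by side. The following is a minimal sketch, assuming strictly positive values; the function names and the sample data are illustrative and do not come from the slides.

```python
import math

def arithmetic_mean(values):
    # Sum of the values divided by their count.
    return math.fsum(values) / len(values)

def harmonic_mean(values):
    # Count divided by the sum of reciprocals; values must be positive.
    return len(values) / math.fsum(1.0 / v for v in values)

def geometric_mean(values):
    # n-th root of the product, computed through logarithms to avoid overflow.
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))

sample = [1.0, 2.0, 4.0]  # illustrative values only
print("arithmetic:", arithmetic_mean(sample))
print("harmonic:  ", harmonic_mean(sample))
print("geometric: ", geometric_mean(sample))
```

For positive data that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean the largest, which is one reason the choice of mean matters.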
6
The problem with means
❒ Performance is multidimensional
❍ CPU or I/O time
❍ Network delay
❍ Interactions of various components
❍ …
❒ Systems are often specialized
❍ Perform well on application type X
❍ Perform poorly on anything else
❒ Potentially a wide range of execution times on one system using different benchmark programs
7
The problem with means (2)
❒ Nevertheless, people still want a single-number answer!
❒ How to (correctly) summarize a wide range of measurements with a single value?
8
Index of central tendency
❒ Tries to capture the “center” of a distribution of values
❒ Use this “center” to summarize overall behavior
❒ You will be pressured to provide a “mean” value
❍ Understand how to choose the best type for the circumstance
❍ Be able to detect bad results from others
❒ Examples
❍ Sample mean: “Average” value
❍ Sample median: ½ of the values are above, ½ below
❍ Sample mode: Most common value
9
Indices of central tendency (2)
❒ “Sample” implies
❍ Values are measured from a discrete random variable X
❒ Value computed is only an approximation of the true mean value of the underlying process
❒ True mean value cannot actually be known
❍ Would require an infinite number of measurements
10
Sample mean
❒ Expected value of X = E[X]
❍ First moment of X
❍ x_i = values measured (i = 1, …, n)
❍ p_i = P(X = x_i) = P(we measure x_i)
$$E[X] = \sum_{i=1}^{n} x_i p_i$$
11
Sample mean (2)
❒ Without additional information, assume
❍ p_i = constant = 1/n (Laplace principle)
❍ n = number of measurements
❒ Arithmetic mean
❍ Common “average”
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
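To make the link between the expected value and the arithmetic mean concrete, here is a small sketch. The function `expected_value` is a hypothetical helper, not part of the slides; it falls back to equal probabilities 1/n when none are supplied, which reduces the weighted sum to the common average.

```python
import math

def expected_value(values, probs=None):
    """Weighted sum of x_i * p_i; with no probabilities given, assume p_i = 1/n."""
    n = len(values)
    if probs is None:
        probs = [1.0 / n] * n  # Laplace principle: all outcomes equally likely
    if abs(math.fsum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return math.fsum(x * p for x, p in zip(values, probs))

measurements = [10, 20, 15, 18, 16]
print(expected_value(measurements))              # equal weights: arithmetic mean
print(expected_value([1, 2], probs=[0.25, 0.75]))  # unequal weights
```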
12
Potential problem with means
❒ Sample mean gives equal weight to all measurements
❒ Outliers can have a large influence on the computed mean value
❒ Distorts our intuition about the central tendency of the measured values
13
Potential problem with means (2)
[Figure: two plots of measured values, each with the mean marked]
14
Median
❒ Index of central tendency with
❍ ½ of the values larger, ½ smaller
❍ Algorithm
• Sort n measurements
• If n is odd
– Median = middle value
• Else
– Median = mean of two middle values
❒ Reduces skewing effect of outliers
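The median algorithm above translates almost line for line into code. This is a minimal sketch; the function name `median` is chosen here for illustration (Python's standard library offers `statistics.median` with the same behavior).

```python
def median(values):
    """Sort, then take the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))     # odd n: the middle value
print(median([4, 1, 3, 2]))  # even n: mean of the two middle values
```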
15
Example
❒ Measured values: 10, 20, 15, 18, 16
❍ Mean = 15.8
❍ Median = 16
❒ Obtain one more measurement: 200
❍ Mean = 46.5
❍ Median = ½ (16 + 18) = 17
❒ Median gives a more intuitive sense of central tendency
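The numbers in this example can be reproduced with Python's standard `statistics` module. The helper `summarize` is only an illustrative wrapper.

```python
import statistics

def summarize(data):
    """Return (mean, median) of the measurements."""
    return statistics.mean(data), statistics.median(data)

values = [10, 20, 15, 18, 16]   # measurements from the example
with_outlier = values + [200]   # one additional measurement

print("without outlier:", summarize(values))
print("with outlier:   ", summarize(with_outlier))
```

A single outlier roughly triples the mean, while the median moves only from 16 to 17.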
16
Potential problem with means (3)
[Figure: two plots of measured values, each with both the mean and the median marked]
17
Mode
❒ Value that occurs most often
❒ May not exist
❒ May not be unique ⇒ multiple modes
❍ E.g., “bi-modal” distribution
• Two values occur with the same (highest) frequency
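A sketch of how the mode, including the "may not exist" and "may not be unique" cases, could be computed. The convention that a sample of all-distinct values has no mode is an assumption made here; other conventions exist (for example, `statistics.multimode` returns every value in that case).

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, sorted; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # assumed convention: every value occurs once, so no mode exists
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3]))  # bi-modal
print(modes([1, 2, 3]))        # no mode
```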
18
Mean, median, or mode?
❒ Mean
❍ If the sum of all values is meaningful
❍ Incorporates all available information
❒ Median
❍ Intuitive sense of central tendency with outliers
❍ What is “typical” of a set of values?
❒ Mode
❍ When data can be grouped into distinct types or categories (categorical data)
19
Quantifying dispersion
❒ How “spread out” are the values?
❒ How much spread relative to the mean?
❒ What is the shape of the distribution of values?
⇒ A mean hides information about variability!
20
Histograms
[Figure: two histograms, axis ticks running from 0 to 40]
❒ Similar mean values
❒ Widely different distributions
❒ How to capture this variability in one number?
21
Index of dispersion
Quantifies how “spread out” measurements are
❒ Range
❍ (max value) – (min value)
❒ 10th and 90th percentiles
❒ Maximum distance from the mean
❍ Max of | x_i – mean |
❒ None of these efficiently incorporates all available information
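The three simple indices of dispersion listed above can be computed in a few lines. This sketch uses `statistics.quantiles` with `method="inclusive"` for the percentiles; that interpolation choice is an assumption, and other percentile definitions give slightly different numbers.

```python
import statistics

def dispersion_indices(values):
    """Range, 10th/90th percentiles, and maximum distance from the mean."""
    mean = statistics.fmean(values)
    # n=10 yields nine cut points: the 10th, 20th, ..., 90th percentiles.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "range": max(values) - min(values),
        "p10": deciles[0],
        "p90": deciles[-1],
        "max_abs_deviation": max(abs(x - mean) for x in values),
    }

print(dispersion_indices([10, 20, 15, 18, 16, 200]))
```

Note that the range and the maximum deviation are each determined by one or two extreme values, which illustrates why they do not use all available information.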
22
Determine the distribution of data?
❒ Plot a histogram
❍ Count of observations within a cell or bucket
❒ Problem
❍ How to determine cell size?
• Small cells ⇒ large variations in number of observations per cell
• Large cells ⇒ details are lost
• Guideline: if any cell has fewer than five observations, increase the cell size or use a variable-cell histogram
❍ How to determine cell spacing?
• Linear
• Logarithmic
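The effect of cell spacing can be seen by binning the same data with linear and logarithmic cells. The sketch below is illustrative: the helper names are hypothetical, and the data are a handful of the raw values shown on the earlier "Why do we need it?" slide.

```python
from bisect import bisect_right

def linear_edges(lo, hi, cells):
    """Cell boundaries of equal width between lo and hi."""
    width = (hi - lo) / cells
    return [lo + i * width for i in range(cells)] + [hi]

def log_edges(lo, hi, cells):
    """Cell boundaries of equal ratio between lo and hi (requires 0 < lo < hi)."""
    ratio = hi / lo
    return [lo * ratio ** (i / cells) for i in range(cells)] + [hi]

def histogram(values, edges):
    """Count values per cell; each cell is [left, right), the last one includes hi."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            cell = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[cell] += 1
    return counts

def sparse_cells(counts, minimum=5):
    """Indices of cells that violate the 'at least five observations' guideline."""
    return [i for i, c in enumerate(counts) if c < minimum]

data = [1, 12, 34, 54, 98, 188, 226, 345, 1002, 8839, 77492]
print("linear:", histogram(data, linear_edges(1, 100_000, 5)))
print("log:   ", histogram(data, log_edges(1, 100_000, 5)))
```

With linear cells nearly everything falls into the first cell, so the detail is lost; logarithmic cells spread the same observations across all cells.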
23
Determine the distribution of data? (2)
❒ Plot a scatter plot
❍ For each value: X vs. Y
❒ Problem
❍ Too many points on top of each other?
• Large dots ⇒ hard to distinguish points
• Small dots ⇒ hard to see outliers
→ Use two-dimensional histograms
→ Use densities
❍ Which scale?
• Linear
• Logarithmic
24
Determine the distribution of data? (3)
❒ Plot an empirical CDF
❍ Concentrate 1/n probability at each of the n numbers in a sample
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$
❒ Problem
❍ Tail of interest ⇒ plot the CCDF (complementary CDF), 1 − F_n(x)
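The empirical CDF places probability 1/n on each observation, so F_n(x) is simply the fraction of sample values that are at most x. A minimal sketch follows; `make_ecdf` and `make_ccdf` are illustrative names.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F_n, where F_n(x) = (number of sample values <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        return bisect_right(ordered, x) / n

    return ecdf

def make_ccdf(sample):
    """Return the complementary CDF, 1 - F_n(x), useful when the tail is of interest."""
    ecdf = make_ecdf(sample)
    return lambda x: 1.0 - ecdf(x)

F = make_ecdf([1, 2, 2, 4])
G = make_ccdf([1, 2, 2, 4])
print(F(2), G(2))
```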
25
Determine the distribution of data? (4)
❒ Plot a density
❍ Smoothed normalized counts of observations
❒ Problem
❍ How to determine cell size?
❍ How to do the smoothing?
❍ How to determine cell spacing?
• Linear
• Logarithmic
26
Sources of Experimental Errors
Accuracy, precision, resolution
27
Experimental errors
❒ Errors → noise in measured values
❒ Systematic errors
❍ Result of an experimental “mistake”
❍ Typically produce constant or slowly varying bias
❍ Controlled through skill of experimenter
❍ Examples
• Temperature change causes clock drift
• Forget to clear cache before timing run
28
Experimental errors (2)
❒ Random errors
❍ Unpredictable, non-deterministic
❍ Unbiased → equal probability of increasing or decreasing measured value
❒ Result of
❍ Limitations of measuring tool
❍ Observer reading output of tool
❍ Random processes within system
❒ Typically cannot be controlled
❍ Use statistical tools to characterize and quantify
29
A model of errors
❒ P(X = x_i) = P(we measure x_i) corresponds to the “number of possible paths” that lead to x_i
❒ P(X = x_i) ~ binomial distribution
❒ As number of error sources becomes large
❍ n → ∞,
❍ Binomial → Gaussian (Normal)
❒ Thus, the bell curve
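The error model can be sketched under one explicit assumption that the slide does not spell out: each of n independent error sources shifts the measurement by either +e or −e with probability ½. The number of "paths" with k upward shifts is then the binomial coefficient C(n, k). All names and parameter values below are illustrative.

```python
import math
import random

def error_pmf(n_sources):
    """P(k of the n sources push the value up) = C(n, k) / 2^n."""
    total_paths = 2 ** n_sources
    return [math.comb(n_sources, k) / total_paths for k in range(n_sources + 1)]

def simulate_measurement(true_value, n_sources, e, rng):
    """True value plus n independent errors of +e or -e (assumed equally likely)."""
    return true_value + sum(rng.choice((-e, e)) for _ in range(n_sources))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [simulate_measurement(10.0, 20, 0.1, rng) for _ in range(10_000)]
print("mean of simulated measurements:", sum(samples) / len(samples))
print("pmf for 4 error sources:", error_pmf(4))
```

Plotting `error_pmf(n)` for growing n shows the symmetric, bell-shaped profile that the Gaussian approximates.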
30
[Figure: bell-shaped curve of the frequency of measuring a specific value, annotated with the mean of measured values, the true value, resolution, precision, and accuracy]
31
Accuracy, precision, resolution
❒ Systematic errors → accuracy
❍ How close mean of measured values is to true value
❍ Hard to determine true accuracy
❍ Relative to a predefined standard
• E.g. definition of a “second”
❒ Random errors → precision
❍ Repeatability of measurements
❍ Dependent on tools
❒ Characteristics of tools → resolution
❍ Smallest increment between measured values
❍ Quantify amount of imprecision using statistical tools
32
Confidence interval for the mean
[Figure: distribution with area 1 − α between c1 and c2 and area α/2 in each tail]
1 − α = probability of c1 ≤ x ≤ c2
33
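Normalize x
$$z = \frac{\bar{x} - x}{s / \sqrt{n}}$$
n = number of measurements
$$\bar{x} = \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$s = \text{standard deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$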
34
Confidence interval for the mean (2)
❒ Normalized z follows the Student’s t distribution
❍ (n − 1) degrees of freedom
❍ Area left of c2 = 1 − α/2
❍ Tabulated values for t
[Figure: t distribution with area 1 − α between c1 and c2 and area α/2 in each tail]
35
Confidence interval for the mean (3)
❒ As n → ∞, normalized distribution becomes Gaussian (normal)
[Figure: normal distribution with area 1 − α between c1 and c2 and area α/2 in each tail]
36
An example
Experiment   Measured value
1            8.0 s
2            7.0 s
3            5.0 s
4            9.0 s
5            9.5 s
6            11.3 s
7            5.2 s
8            8.5 s
37
An example (2)
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 7.94$$
$$s = \text{sample standard deviation} = 2.14$$
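The mean and sample standard deviation of the eight measurements can be checked directly. This sketch uses the n − 1 divisor for the standard deviation; the function names are illustrative.

```python
import math

# Measured values in seconds for experiments 1..8.
measurements = [8.0, 7.0, 5.0, 9.0, 9.5, 11.3, 5.2, 8.5]

def sample_mean(xs):
    return math.fsum(xs) / len(xs)

def sample_std(xs):
    # Sample standard deviation with the (n - 1) divisor.
    m = sample_mean(xs)
    return math.sqrt(math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1))

print("mean =", round(sample_mean(measurements), 2))
print("s    =", round(sample_std(measurements), 2))
```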
38
An example (3)
❒ 90% CI → 90% chance that the actual mean is in the interval
❒ 90% CI → α = 0.10
[Figure: distribution with area 1 − α between c1 and c2 and area α/2 in each tail]
39
An example (4)
❒ 90% CI = [6.5, 9.4]
❍ 90% chance the actual mean is between 6.5 and 9.4
❒ 95% CI = [6.1, 9.7]
❍ 95% chance the actual mean is between 6.1 and 9.7
❒ Why is the interval wider when we are more confident?
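Both intervals follow from x̄ ± t · s/√n with n − 1 = 7 degrees of freedom. In the sketch below the t quantiles (1.895 and 2.365) are hard-coded from a standard Student's t table, because Python's standard library has no t distribution; the dictionary and function names are illustrative.

```python
import math
import statistics

# Measured values in seconds for experiments 1..8.
measurements = [8.0, 7.0, 5.0, 9.0, 9.5, 11.3, 5.2, 8.5]

# t(1 - alpha/2; 7 degrees of freedom), taken from a standard t table.
T_7_DOF = {0.90: 1.895, 0.95: 2.365}

def confidence_interval(xs, t_value):
    """Two-sided interval: mean -/+ t * s / sqrt(n)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    half_width = t_value * statistics.stdev(xs) / math.sqrt(n)
    return mean - half_width, mean + half_width

for level, t_value in T_7_DOF.items():
    lo, hi = confidence_interval(measurements, t_value)
    print(f"{level:.0%} CI = [{lo:.1f}, {hi:.1f}]")
```

The larger t quantile needed for 95% confidence is what stretches the interval.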
40
Higher confidence → Wider interval?
[Figure: the 90% interval [6.5, 9.4] shown against the wider 95% interval [6.1, 9.7]]
41
Key assumption
❒ Measurement errors are Normally distributed.
❒ Is this true for most measurements on real systems?
[Figure: distribution with area 1 − α between c1 and c2 and area α/2 in each tail]
42
Key assumption (2)
❒ Saved by the Central Limit Theorem
The sum of a “large number” of independent values from any distribution with finite variance will be approximately Normally (Gaussian) distributed.
❒ What is a “large number”?
❍ Typically assumed to be ≳ 6 or 7
❍ But in our case often millions or billions
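A quick simulation can illustrate the theorem. The sketch below sums values drawn from an exponential distribution (a skewed distribution chosen only for illustration) and reports how many sums fall within one standard deviation of their mean; for a Normal distribution that share is about 68%. The function name, seed, and parameters are assumptions of this example.

```python
import math
import random

def fraction_within_one_sigma(n_terms, trials, rng):
    """Share of sums of n_terms exponential(1) values within 1 std. dev. of the mean."""
    mean = float(n_terms)       # exponential(1) has mean 1 ...
    std = math.sqrt(n_terms)    # ... and variance 1, so the sum has variance n_terms
    hits = 0
    for _ in range(trials):
        total = sum(rng.expovariate(1.0) for _ in range(n_terms))
        if abs(total - mean) <= std:
            hits += 1
    return hits / trials

rng = random.Random(42)  # fixed seed so the run is repeatable
print("1 term:  ", fraction_within_one_sigma(1, 10_000, rng))
print("50 terms:", fraction_within_one_sigma(50, 10_000, rng))
```

A single exponential value is far from Normal (about 86% lies within one standard deviation), while the sum of 50 values is already close to the Normal share.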
43
How many measurements?
❒ Width of interval inversely proportional to √n
❒ Want to minimize number of measurements
❒ Find confidence interval for mean, such that:
❍ P(actual mean in interval) = (1 – α)
44
How many measurements? (2)
❒ But n depends on knowing mean and standard deviation!
❒ Estimate s with a small number of measurements
❒ Use this s to find n needed for desired interval width
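One way to turn this procedure into a calculation, offered as a sketch rather than the slides' own formula: require the half-width z · s/√n to be at most a fraction e of the estimated mean, which gives n ≥ (z · s / (e · x̄))². The code assumes a Normal quantile z (reasonable when the resulting n is large) instead of a t quantile; the function name, the parameter `rel_half_width`, and the 3.5% target are illustrative choices.

```python
import math
from statistics import NormalDist

def measurements_needed(mean_estimate, s_estimate, alpha, rel_half_width):
    """Smallest n with z * s / sqrt(n) <= rel_half_width * mean (Normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * s_estimate / (rel_half_width * mean_estimate)) ** 2)

# Pilot estimates from the eight-measurement example: mean 7.94, s 2.14.
n = measurements_needed(7.94, 2.14, alpha=0.10, rel_half_width=0.035)
print("measurements needed for a 90% CI of +/-3.5% around the mean:", n)
```

Because the width shrinks only with √n, halving the desired interval width roughly quadruples the number of measurements.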