Introduction to statistics - TU Berlin

Introduction to statistics

Literature

Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley

Schickinger, Steger: Diskrete Strukturen Band 2, Springer

David Lilja: Measuring Computer Performance: A Practitioner’s Guide, Cambridge University Press

Goals

• Provide intuitive conceptual background for some standard statistical methods
  – Draw meaningful conclusions in presence of noisy measurements
  – Learn how to apply techniques in new situations
    → Don’t simply plug and crank from a formula
• Present techniques for aggregating large quantities of data
  – Obtain a big-picture view of your results
  – Obtain new insights from complex measurement and simulation results

Statistics: Why do we need it?

1. Aggregate data into meaningful information.

445 446 397 226
388 3445 188 1002
47762 432 54 12
98 345 2245 8839
77492 472 565 999
1 34 882 545 4022
827 572 597 364

$$\ldots = \bar{x}$$

What is a statistic?

• “A quantity that is computed from a sample [of data].” (Merriam-Webster)
  → A single number used to summarize a larger collection of values

What are statistics?

• “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” (Merriam-Webster)
  → We are most interested in analysis and interpretation here
• “Lies, damn lies, and statistics!”

The simplest statistic: a mean?

• Reduces dataset to a single number
• But what does this mean mean?
• Indices of central tendency
  – Sample mean
  – Sample median
  – Sample mode
• Other means
  – Arithmetic
  – Harmonic
  – Geometric
• Quantifying dispersion

The problem with means

• Performance is multidimensional
  – CPU or I/O time
  – Network delay
  – Interactions of various components
  – …
• Systems are often specialized
  – Performs great on application type X
  – Performs lousy on anything else
• Potentially a wide range of execution times on one system using different benchmark programs

The problem with means (2)

• Nevertheless, people still want a single number answer!
• How to (correctly) summarize a wide range of measurements with a single value?

Index of central tendency

• Tries to capture the “center” of a distribution of values
• Use this “center” to summarize overall behavior
• You will be pressured to provide a “mean” value
  – Understand how to choose the best type for the circumstance
  – Be able to detect bad results from others
• Examples
  – Sample mean: “Average” value
  – Sample median: ½ of the values are above, ½ below
  – Sample mode: Most common value

Index of central tendency (2)

• “Sample” implies
  – Values are measured from a discrete random variable X
• Value computed is only an approximation of true mean value of underlying process
• True mean value cannot actually be known
  – Would require infinite number of measurements

Sample mean

• Expected value of X = E[X]
  – First moment of X
  – x_i = values measured (i = 1, …, n)
  – p_i = P(X = x_i) = P(we measure x_i)

$$E[X] = \sum_{i=1}^{n} x_i p_i$$

Sample mean (2)

• Without additional information, assume
  – p_i = constant = 1/n (Laplace principle)
  – n = number of measurements
• Arithmetic mean
  – Common “average”

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
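The step from E[X] to the arithmetic mean can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `expected_value` is an assumption made for illustration, and the five values are the ones used in the median example of this deck.

```python
import statistics


def expected_value(values, probabilities):
    """E[X] = sum of x_i * p_i over all measured values."""
    return sum(x * p for x, p in zip(values, probabilities))


values = [10, 20, 15, 18, 16]  # measurements from the median example
n = len(values)

# Laplace principle: without further information every p_i is 1/n.
uniform = [1 / n] * n

print(expected_value(values, uniform))  # weighted sum with p_i = 1/n
print(statistics.mean(values))          # arithmetic mean of the same values
```

With p_i = 1/n both lines print the same value up to floating-point rounding, which is the point of the slide: the arithmetic mean is the expected value under equal weights.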

Potential problem with means

• Sample mean gives equal weight to all measurements
• Outliers can have a large influence on the computed mean value
• Distorts our intuition about the central tendency of the measured values

Potential problem with means (2)

[Figure: two example distributions, each with its mean marked]

Median

• Index of central tendency with
  – ½ of the values larger, ½ smaller
  – Algorithm
    • Sort n measurements
    • If n is odd
      – Median = middle value
    • Else
      – Median = mean of two middle values
• Reduces skewing effect of outliers

Example

• Measured values: 10, 20, 15, 18, 16
  – Mean = 15.8
  – Median = 16
• Obtain one more measurement: 200
  – Mean = 46.5
  – Median = ½ (16 + 18) = 17
• Median gives more intuitive sense of central tendency
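The median algorithm and the numbers of this example can be reproduced in a few lines. This is a sketch of the sort-based procedure described in the slides; the function names are chosen here for illustration, and empty input is not handled.

```python
def mean(measurements):
    """Arithmetic mean: every measurement gets weight 1/n."""
    return sum(measurements) / len(measurements)


def median(measurements):
    """Sort, then take the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(measurements)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


first = [10, 20, 15, 18, 16]
second = first + [200]  # one additional outlier

print(mean(first), median(first))    # 15.8 16
print(mean(second), median(second))  # 46.5 17.0
```

The single outlier moves the mean from 15.8 to 46.5, while the median only moves from 16 to 17.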

Potential problem with means (3)

[Figure: two example distributions, each with its mean and its median marked]

Mode

• Value that occurs most often
• May not exist
• May not be unique → multiple modes
  – E.g., “bi-modal” distribution
    • Two values occur with same frequency
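A short sketch of the non-unique case, using Python's standard `statistics.multimode` (available from Python 3.8). The observations are hypothetical and chosen only to produce two modes.

```python
import statistics

# Hypothetical observations: 5 and 7 both occur twice.
observations = [3, 5, 5, 7, 7, 9]

# multimode returns every value that reaches the highest count,
# in the order in which the values first appear.
modes = statistics.multimode(observations)
print(modes)  # [5, 7]
```

When all values are distinct, `multimode` returns every value, which corresponds to the "may not exist" case above: no value is more common than any other.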

Mean, median, or mode?

• Mean
  – If the sum of all values is meaningful
  – Incorporates all available information
• Median
  – Intuitive sense of central tendency with outliers
  – What is “typical” of a set of values?
• Mode
  – When data can be grouped into distinct types, categories (categorical data)

Quantifying dispersion

• How “spread out” are the values?
• How much spread relative to the mean?
• What is the shape of the distribution of values?

→ A mean hides information about variability!

Histograms

[Figure: two histograms, axis ticks from 0 to 40]

• Similar mean values
• Widely different distributions
• How to capture this variability in one number?

Index of dispersion

Quantifies how “spread out” measurements are

• Range
  – (max value) – (min value)
• 10th and 90th percentiles
• Maximum distance from the mean
  – Max of | x_i – mean |
• None of these efficiently incorporates all available information
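The three indices listed above can be computed as follows. This is a sketch under assumptions: the function name `dispersion_indices` is illustrative, the percentiles use the default interpolation method of `statistics.quantiles` (other definitions give slightly different values), and at least two values are required.

```python
import statistics


def dispersion_indices(values):
    """Return the range, the 10th/90th percentiles, and the maximum distance from the mean."""
    mean = statistics.mean(values)
    # n=10 yields nine cut points; the first and last are the 10th and 90th percentiles.
    deciles = statistics.quantiles(values, n=10)
    return {
        "range": max(values) - min(values),
        "p10": deciles[0],
        "p90": deciles[-1],
        "max_distance": max(abs(x - mean) for x in values),
    }


# Measurements from the median example, including the outlier.
print(dispersion_indices([10, 20, 15, 18, 16, 200]))
```

For this data the range (190) and the maximum distance from the mean (153.5) are both determined by the single outlier, which illustrates why none of these indices uses all the available information.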

Determine the distribution of data?

• Plot a histogram
  – Count of observations within a cell or bucket
• Problem
  – How to determine cell size?
    • Small cells → large variations in number of observations per cell
    • Large cells → details are lost
    • Guideline: if any cell has fewer than five observations, increase the cell size or use a variable-cell histogram
  – How to determine cell spacing?
    • Linear
    • Logarithmic
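The cell-size and cell-spacing choices can be sketched as follows. All function names and the data are hypothetical; the logarithmic variant assumes strictly positive values and uses one cell per decade, which is only one of several possible logarithmic spacings.

```python
import math
from collections import Counter


def linear_histogram(values, cell_size):
    """Count observations per cell [k * cell_size, (k + 1) * cell_size)."""
    counts = Counter(math.floor(v / cell_size) for v in values)
    return {k * cell_size: counts[k] for k in sorted(counts)}


def log_histogram(values):
    """Count positive observations per decade [10**k, 10**(k + 1))."""
    counts = Counter(math.floor(math.log10(v)) for v in values)
    return {10 ** k: counts[k] for k in sorted(counts)}


def sparse_cells(cells, minimum=5):
    """Cells below the five-observation guideline; candidates for a larger cell size."""
    return [start for start, count in cells.items() if count < minimum]


# Hypothetical measurements, for illustration only.
values = [1, 2, 3, 11, 12, 25, 140, 2300]

print(linear_histogram(values, cell_size=10))
print(log_histogram(values))
print(sparse_cells(log_histogram(values)))
```

With so few observations every cell is flagged as sparse, which is the situation in which the guideline suggests larger or variable cells.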

Determine the distribution of data? (2)

• Plot a scatter plot
  – For each value: X vs. Y
• Problem
  – Too many points on top of each other?
    • Large dots → hard to distinguish points
    • Small dots → hard to see outliers
    → Use two-dimensional histograms
    → Use densities
  – Which scale?
    • Linear
    • Logarithmic

Determine the distribution of data? (3)

• Plot an empirical CDF
  – Concentrate 1/n probability at each of the n numbers in a sample
• Problem
  – Tail of interest → plot CCDF

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$
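The empirical CDF translates directly into code: each of the n sample values contributes 1/n once x reaches it. The sketch below uses illustrative function names, and the CCDF is taken to be 1 − F_n(x), the usual definition of the complementary CDF.

```python
def ecdf(sample, x):
    """F_n(x): fraction of the sample values that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


def ccdf(sample, x):
    """Complementary CDF, 1 - F_n(x): fraction of the sample values that are > x."""
    return 1.0 - ecdf(sample, x)


sample = [10, 20, 15, 18, 16, 200]  # measurements from the median example

for x in (16, 20, 200):
    print(x, ecdf(sample, x), ccdf(sample, x))
```

Plotting `ccdf` on a logarithmic axis makes the tail (here the single value 200) visible where the CDF itself is already close to 1.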

Determine the distribution of data? (4)

• Plot a density
  – Smoothed normalized counts of observations
• Problem
  – How to determine cell size?
  – How to do the smoothing?
  – How to determine cell spacing?
    • Linear
    • Logarithmic

Sources of Experimental Errors

Accuracy, precision, resolution

Experimental errors

• Errors → noise in measured values
• Systematic errors
  – Result of an experimental “mistake”
  – Typically produce constant or slowly varying bias
    → Controlled through skill of experimenter
  – Examples
    • Temperature change causes clock drift
    • Forget to clear cache before timing run

Experimental errors (2)

• Random errors
  – Unpredictable, non-deterministic
  – Unbiased → equal probability of increasing or decreasing measured value
• Result of
  – Limitations of measuring tool
  – Observer reading output of tool
  – Random processes within system
• Typically cannot be controlled
  – Use statistical tools to characterize and quantify

A model of errors

• P(X = x_i) = P(we measure x_i) corresponds to the “number of possible paths”
• P(X = x_i) ~ binomial distribution
• As number of error sources becomes large
  – n → ∞
  – Binomial → Gaussian (Normal)
• Thus, the bell curve
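The binomial-to-Gaussian limit can be checked numerically. The sketch below assumes, for illustration, that each of n independent error sources pushes the measured value up with probability 0.5 and down otherwise, so that the number of upward pushes is binomially distributed; it then compares the binomial probabilities with a normal density of the same mean and standard deviation.

```python
import math


def binomial_pmf(n, k, p=0.5):
    """Probability that exactly k of n error sources push the value upward."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


n = 100                      # number of error sources (assumed)
mu = n * 0.5                 # binomial mean n*p
sigma = math.sqrt(n * 0.25)  # binomial standard deviation sqrt(n*p*(1-p))

for k in (40, 45, 50, 55, 60):
    print(k, binomial_pmf(n, k), normal_pdf(k, mu, sigma))
```

Already for n = 100 the two columns agree closely, and the agreement improves as n grows.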

[Figure: frequency of measuring a specific value, annotated with the mean of measured values, the true value, and the resolution, precision, and accuracy]

Accuracy, precision, resolution

• Systematic errors → accuracy
  – How close mean of measured values is to true value
  – Hard to determine true accuracy
  – Relative to a predefined standard
    • E.g. definition of a “second”
• Random errors → precision
  – Repeatability of measurements
  – Dependent on tools
• Characteristics of tools → resolution
  – Smallest increment between measured values
  – Quantify amount of imprecision using statistical tools

Confidence interval for the mean

[Figure: distribution with central area 1 − α between c1 and c2 and a tail of area α/2 on each side]

1 − α = probability of c1 ≤ x ≤ c2

Normalize x

$$z = \frac{\bar{x} - x}{s / \sqrt{n}}$$

$$n = \text{number of measurements}$$

$$\bar{x} = \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$s = \text{standard deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Confidence interval for the mean (2)

• Normalized z follows Student’s t distribution
  – (n − 1) degrees of freedom
  – Area left of c2 = 1 – α/2
  – Tabulated values for t

[Figure: distribution with central area 1 − α between c1 and c2 and a tail of area α/2 on each side]

Confidence interval for the mean (3)

• As n → ∞, normalized distribution becomes Gaussian (normal)

[Figure: distribution with central area 1 − α between c1 and c2 and a tail of area α/2 on each side]

An example

Experiment   Measured value
1            8.0 s
2            7.0 s
3            5.0 s
4            9.0 s
5            9.5 s
6            11.3 s
7            5.2 s
8            8.5 s

An example (2)

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 7.94$$

$$s = \text{sample standard deviation} = 2.14$$

38

An example (3)

r 90% CI → 90% chance that the measured value is in the interval

r 90% CI → α = 0.10

c1 c2

1-α

α/2 α/2

An example (4)

• 90% CI = [6.5, 9.4]
  – 90% chance the actual mean is between 6.5 and 9.4
• 95% CI = [6.1, 9.7]
  – 95% chance the actual mean is between 6.1 and 9.7
• Why is the interval wider when we are more confident?
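Both intervals follow from c1, c2 = x̄ ∓ t · s/√n with n − 1 = 7 degrees of freedom. The sketch below is an illustration, not the slides' own code: the function name is an assumption, and the two t values (1.895 for 1 − α/2 = 0.95 and 2.365 for 1 − α/2 = 0.975) are taken from a standard t table because the Python standard library has no t-distribution quantile function.

```python
import math
import statistics

# Tabulated t values for 7 degrees of freedom, keyed by confidence level.
T_7_DOF = {0.90: 1.895, 0.95: 2.365}


def confidence_interval(samples, t_value):
    """Return (c1, c2) = mean -/+ t * s / sqrt(n) for the sample mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    s = statistics.stdev(samples)  # sample standard deviation, n - 1 divisor
    half_width = t_value * s / math.sqrt(n)
    return mean - half_width, mean + half_width


measurements = [8.0, 7.0, 5.0, 9.0, 9.5, 11.3, 5.2, 8.5]

for level, t_value in T_7_DOF.items():
    c1, c2 = confidence_interval(measurements, t_value)
    print(f"{level:.0%} CI = [{c1:.1f}, {c2:.1f}]")
```

The larger t value required for 95% confidence is what widens the interval: covering the actual mean with higher probability needs a wider range.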

Higher confidence → Wider interval?

[Figure: the 90% interval [6.5, 9.4] shown inside the wider 95% interval [6.1, 9.7]]

Key assumption

• Measurement errors are Normally distributed.
• Is this true for most measurements on real systems?

[Figure: distribution with central area 1 − α between c1 and c2 and a tail of area α/2 on each side]

Key assumption (2)

• Saved by the Central Limit Theorem
  → Sum of a “large number” of values from any distribution will be approximately Normally (Gaussian) distributed.
• What is a “large number”?
  – Typically assumed to be ≳ 6 or 7
  – But in our case often millions or billions
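A small simulation illustrates the theorem. This is a sketch under assumptions: the exponential distribution, the sample size of 50, the 2000 repetitions, and the fixed seed are all choices made here for illustration, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(draw, sample_size, repetitions):
    """Draw `repetitions` samples of `sample_size` values and return each sample's mean."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(repetitions)
    ]


# Exponential values with mean 1: a strongly skewed, clearly non-normal distribution.
means = sample_means(lambda: random.expovariate(1.0), sample_size=50, repetitions=2000)

# The sample means cluster around 1 with spread close to 1 / sqrt(50), about 0.14.
print(statistics.fmean(means), statistics.stdev(means))
```

A histogram of `means` is roughly bell-shaped even though the individual values are not, which is what justifies the normality assumption for averaged measurements.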

How many measurements?

• Width of interval inversely proportional to √n
• Want to minimize number of measurements
• Find confidence interval for mean, such that:
  – P(actual mean in interval) = (1 – α)

How many measurements? (2)

• But n depends on knowing mean and standard deviation!
• Estimate s with small number of measurements
• Use this s to find n needed for desired interval width
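One way to turn these steps into a calculation is sketched below. It assumes, which the slides do not state, that the desired interval is expressed as a relative half-width e around the mean, so that t · s/√n ≤ e · x̄ and therefore n ≥ (t · s / (e · x̄))². The function name and the choice e = 3.5% are illustrative; the mean, standard deviation, and t value are the rounded numbers from the example above.

```python
import math


def required_measurements(mean, s, t_value, relative_half_width):
    """Smallest n with t * s / sqrt(n) <= relative_half_width * mean."""
    return math.ceil((t_value * s / (relative_half_width * mean)) ** 2)


# Pilot estimates from the eight measurements of the example:
# mean 7.94, s 2.14, and t = 1.895 for a 90% interval with 7 degrees of freedom.
n_needed = required_measurements(mean=7.94, s=2.14, t_value=1.895, relative_half_width=0.035)
print(n_needed)  # 213
```

The t value itself depends on n, so reusing the pilot value is only an approximation; for large n the corresponding normal quantile is slightly smaller, and the estimate can be refined once more measurements are available.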