Introduction to statistics - TU Berlin · Introduction to statistics Literature Raj Jain: The Art...

Preview:

Citation preview

1

Introduction to statistics

Literature

Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley

Schickinger, Steger: Diskrete Strukturen Band 2, Springer

David Lilja: Measuring Computer Performance: A Practitioner’s Guide, Cambridge University Press

2

Goals

r Provide intuitive conceptual background for some standard statistical methods

m Draw meaningful conclusions in presence of noisy measurements

m Learn how to apply techniques in new situations

→ Don’t simply plug and crank from a formula

r Present techniques for aggregating large quantities of data

m Obtain a big-picture view of your results

m Obtain new insights from complex measurement and simulation results

3

Statistics: Why do we need it?

1. Aggregate data into

meaningful information.

445 446 397 226

388 3445 188 1002

47762 432 54 12

98 345 2245 8839

77492 472 565 999

1 34 882 545 4022

827 572 597 364

...=x

4

What is a statistic?

r “A quantity that is computed from a sample [of data].”

Merriam-Webster

→ A single number used to summarize a larger collection of values

What are statistics ?r “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”

Merriam-Webster

→ We are most interested in analysis and interpretation here

r “Lies, damn lies, and statistics!”

5

The simplest statistic: a mean?

r Reduces dataset to a single number

r But what does this mean mean?

r Indices of central tendencym Sample mean

m Sample median

m Sample mode

r Other meansm Arithmetic

m Harmonic

m Geometric

r Quantifying dispersion

6

The problem with means

r Performance is multidimensional

m CPU or I/O time

m Network delay

m Interactions of various components

m …

r Systems are often specialized

m Performs great on application type X

m Performs lousy on anything else

r Potentially a wide range of execution times on one system using different benchmark programs

7

The problem with means (2)

r Nevertheless, people still want a single number answer!

r How to (correctly) summarize a wide range of measurements with a single value?

8

Index of central tendency

r Tries to capture “center” of a distribution of values

r Use this “center” to summarize overall behavior

r You will be pressured to provide “mean” value

m Understand how to choose the best type for the circumstance

m Be able to detect bad results from others

r Examples

m Sample mean: “Average” value

m Sample median: ½ of the values are above, ½ below

m Sample mode: Most common value

9

Indices of central tendency (2.)

r “Sample” implies

m Values are measured from a discrete random variable X

r Value computed is only an approximation of true mean value of underlying process

r True mean value cannot actually be known

m Would require infinite number of measurements

10

Sample mean

r Expected value of X = E[X]

m First moment of X

m xi = values measured (i = {1, …, n})

m pi = P(X = xi) = P(we measure xi)

∑=

=

n

i

ii pxXE1

][

11

Sample mean (2)

rWithout additional information, assumem pi = constant = 1/n (Laplace principle)

m n = number of measurements

r Arithmetic meanmCommon “average”

∑=

=

n

i

ixn

x1

1

12

Potential problem with means

r Sample mean gives equal weight to all measurements

r Outliers can have a large influence on the computed mean value

r Distorts our intuition about the central tendency of the measured values

13

Potential problem with means (2.)

Mean

Mean

14

Median

r Index of central tendency with

m ½ of the values larger, ½ smaller

m Algorithm

• Sort n measurements

• If n is odd

– Median = middle value

– Else, median = mean of two middle values

r Reduces skewing effect of outliers

15

Example

r Measured values: 10, 20, 15, 18, 16

m Mean = 15.8

m Median = 16

r Obtain one more measurement: 200

m Mean = 46.5

m Median = ½ (16 + 18) = 17

r Median gives more intuitive sense of central tendency

16

Potential problem with means (3.)

Mean

Mean

Median

Median

17

Mode

r Value that occurs most often

r May not exist

r May not be unique == multiple modes

m E.g., “bi-modal” distribution

• Two values occur with same frequency

18

Mean, median, or mode?

r Mean

m If the sum of all values is meaningful

m Incorporates all available information

r Median

m Intuitive sense of central tendency with outliers

m What is “typical” of a set of values?

r Mode

m When data can be grouped into distinct types, categories (categorical data)

19

Quantifying dispersion

r How “spread out” are the values?

r How much spread relative to the mean?

r What is the shape of the distribution of values?

=> A mean hides information about variability!

20

Histograms

0

5

10

15

20

25

30

35

40

0

5

10

15

20

25

30

35

40

r Similar mean values

r Widely different distributions

r How to capture this variability in one number?

21

Index of dispersion

Quantifies how “spread out” measurements are

r Range

m (max value) – (min value)

r 10- and 90- percentiles

r Maximum distance from the mean

m Max of | xi – mean |

r Neither efficiently incorporates all available information

22

Determine the distribution of data?

r Plot a histogram

m Count of observations within a cell or bucket

r Problem

m How to determine cell size?

• Small cells => large variations in # of obs per cell

• Large cells => details are lost

• Guideline: if any cell has less than five obs. increase cell size or use variable cell histogram

m How to determine cell spacing?

• Linear

• Logarithmic

23

Determine the distribution of data(2)?

r Plot a scatter plot

m For each value: X vs. Y

r Problem

m Too many points on top of each other ?

• Large dots => hard to distinguish points

• Small dots => hard to see outliers

Use two-dimensional histograms

Use densities

m Which scale?

• Linear

• Logarithmic

24

Determine the distribution of data(3)?

r Plot an empirical CDF

m Concentrate 1/n probability at each of the n numbers in a sample

r Problem

m Tail of interest => plot CCDF

∑−

≤=

n

i

inxXInxF

1

)(/1)(

25

Determine the distribution of data(4)?

r Plot a density

m Smoothed normalized counts of observations

r Problem

m How to determine cell size?

m How to do the smoothing

m How to determine cell spacing?

• Linear

• Logarithmic

26

Sources of Experimental ErrorsAccuracy, precision, resolution

27

Experimental errors

r Errors → noise in measured values

r Systematic errors

m Result of an experimental “mistake”

m Typically produce constant or slowly varying bias

Controlled through skill of experimenter

m Examples

• Temperature change causes clock drift

• Forget to clear cache before timing run

28

Experimental errors

r Random errors

m Unpredictable, non-deterministic

m Unbiased → equal probability of increasing or decreasing measured value

r Result of

m Limitations of measuring tool

m Observer reading output of tool

m Random processes within system

r Typically cannot be controlled

m Use statistical tools to characterize and quantify

29

A model of errors

r P(X=xi) = P(to measure xi)

corresponds to the “number of possible paths”

r P(X=xi) ~ binomial distribution

r As number of error sources becomes large

m n → ∞,

m Binomial → Gaussian (Normal)

r Thus, the bell curve

30

Frequency of measuring specific value

Mean of measured values

True valueResolution

Precision

Accuracy

31

Accuracy, precision, resolution

r Systematic errors → accuracym How close mean of measured values is to true value

m Hard to determine true accuracy

m Relative to a predefined standard• E.g. definition of a “second”

r Random errors → precisionm Repeatability of measurements

m Dependent on tools

r Characteristics of tools → resolutionm Smallest increment between measured values

m Quantify amount of imprecision using statistical tools

32

Confidence interval for the mean

c1 c2

1-α

α/2 α/2

= probability of c1 ≤ x ≤ c2

33

Normalize x

1

)(deviation standard

mean

tsmeasuremen ofnumber

/

n

1i

2

1

==

==

=

−=

=

=

n

xxs

x x

n

ns

xxz

i

n

i

i

34

Confidence interval for the mean (2)

r Normalized z follows the Student’s t distribution

m (n-1) degrees of freedom

m Area left of c2 = 1 – α/2

m Tabulated values for t

c1 c2

1-α

α/2 α/2

35

Confidence interval for the mean (2)

r As n → ∞, normalized distribution becomes Gaussian (normal)

c1 c2

1-α

α/2 α/2

36

An example

8.5 s8

5.2 s7

11.3 s6

9.5 s5

9.0 s4

5.0 s3

7.0 s2

8.0 s1

Measured valueExperiment

37

An example (2)

14.2deviation standard sample

94.71

==

=∑

==

s

n

xx

n

i i

38

An example (3)

r 90% CI → 90% chance that the measured value is in the interval

r 90% CI → α = 0.10

c1 c2

1-α

α/2 α/2

39

An example (4)

r 90% CI = [6.5, 9.4]

m 90% chance value is between 6.5, 9.4

r 95% CI = [6.1, 9.7]

m 95% chance value is between 6.1, 9.7

r Why is interval wider when we are more confident?

40

Higher confidence → Wider interval?

6.5 9.4

90%

6.1 9.7

95%

41

Key assumption

r Measurement errors are Normally distributed.

r Is this true for most measurements on real systems?

c1 c2

1-α

α/2 α/2

42

Key assumption (2)

r Saved by the Central Limit Theorem

Sum of a “large number” of values from any distribution will be Normally (Gaussian) distributed.

r What is a “large number?”m Typically assumed to be >≈ 6 or 7

m But in our case often millions or billions

43

How many measurements?

r Width of interval inversely proportional to √n

r Want to minimize number of measurements

r Find confidence interval for mean, such that:

m P(actual mean in interval) = (1 – α)

44

How many measurements (2)?

r But n depends on knowing mean and standard deviation!

r Estimate s with small number of measurements

r Use this s to find n needed for desired interval width

Recommended