Introduction to statistics

Literature

Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley

Schickinger, Steger: Diskrete Strukturen Band 2, Springer

David Lilja: Measuring Computer Performance: A Practitioner’s Guide, Cambridge University Press

r Provide intuitive conceptual background for some standard statistical methods

m Draw meaningful conclusions in presence of noisy measurements

m Learn how to apply techniques in new situations

→ Don’t simply plug and crank from a formula

r Present techniques for aggregating large quantities of data

m Obtain a big-picture view of your results

m Obtain new insights from complex measurement and simulation results

Statistics: Why do we need it?

1. Aggregate data into

meaningful information.

445 446 397 226

388 3445 188 1002

47762 432 54 12

98 345 2245 8839

77492 472 565 999

1 34 882 545 4022

827 572 597 364

What is a statistic?

r “A quantity that is computed from a sample [of data].”

Merriam-Webster

→ A single number used to summarize a larger collection of values

What are statistics ?r “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”

Merriam-Webster

→ We are most interested in analysis and interpretation here

r “Lies, damn lies, and statistics!”

The simplest statistic: a mean?

r Reduces dataset to a single number

r But what does this mean mean?

r Indices of central tendencym Sample mean

m Sample median

m Sample mode

r Other meansm Arithmetic

m Harmonic

m Geometric

r Quantifying dispersion

The problem with means

r Performance is multidimensional

m CPU or I/O time

m Network delay

m Interactions of various components

r Systems are often specialized

m Performs great on application type X

m Performs lousy on anything else

r Potentially a wide range of execution times on one system using different benchmark programs

The problem with means (2)

r Nevertheless, people still want a single number answer!

r How to (correctly) summarize a wide range of measurements with a single value?

Index of central tendency

r Tries to capture “center” of a distribution of values

r Use this “center” to summarize overall behavior

r You will be pressured to provide “mean” value

m Understand how to choose the best type for the circumstance

m Be able to detect bad results from others

r Examples

m Sample mean: “Average” value

m Sample median: ½ of the values are above, ½ below

m Sample mode: Most common value

Indices of central tendency (2.)

r “Sample” implies

m Values are measured from a discrete random variable X

r Value computed is only an approximation of true mean value of underlying process

r True mean value cannot actually be known

m Would require infinite number of measurements

Sample mean

r Expected value of X = E[X]

m First moment of X

m xi = values measured (i = {1, …, n})

m pi = P(X = xi) = P(we measure xi)

ii pxXE1

Sample mean (2)

rWithout additional information, assumem pi = constant = 1/n (Laplace principle)

m n = number of measurements

r Arithmetic meanmCommon “average”

Potential problem with means

r Sample mean gives equal weight to all measurements

r Outliers can have a large influence on the computed mean value

r Distorts our intuition about the central tendency of the measured values

Potential problem with means (2.)

Median

r Index of central tendency with

m ½ of the values larger, ½ smaller

m Algorithm

• Sort n measurements

• If n is odd

– Median = middle value

– Else, median = mean of two middle values

r Reduces skewing effect of outliers

Example

r Measured values: 10, 20, 15, 18, 16

m Mean = 15.8

m Median = 16

r Obtain one more measurement: 200

m Mean = 46.5

m Median = ½ (16 + 18) = 17

r Median gives more intuitive sense of central tendency

Potential problem with means (3.)

Median

r Value that occurs most often

r May not exist

r May not be unique == multiple modes

m E.g., “bi-modal” distribution

• Two values occur with same frequency

Mean, median, or mode?

r Mean

m If the sum of all values is meaningful

m Incorporates all available information

r Median

m Intuitive sense of central tendency with outliers

m What is “typical” of a set of values?

r Mode

m When data can be grouped into distinct types, categories (categorical data)

Quantifying dispersion

r How “spread out” are the values?

r How much spread relative to the mean?

r What is the shape of the distribution of values?

=> A mean hides information about variability!

Histograms

r Similar mean values

r Widely different distributions

r How to capture this variability in one number?

Index of dispersion

Quantifies how “spread out” measurements are

r Range

m (max value) – (min value)

r 10- and 90- percentiles

r Maximum distance from the mean

m Max of | xi – mean |

r Neither efficiently incorporates all available information

Determine the distribution of data?

r Plot a histogram

m Count of observations within a cell or bucket

r Problem

m How to determine cell size?

• Small cells => large variations in # of obs per cell

• Large cells => details are lost

• Guideline: if any cell has less than five obs. increase cell size or use variable cell histogram

m How to determine cell spacing?

• Linear

• Logarithmic

Determine the distribution of data(2)?

r Plot a scatter plot

m For each value: X vs. Y

r Problem

m Too many points on top of each other ?

• Large dots => hard to distinguish points

• Small dots => hard to see outliers

Use two-dimensional histograms

Use densities

m Which scale?

• Linear

• Logarithmic

r Plot an empirical CDF

m Concentrate 1/n probability at each of the n numbers in a sample

r Problem

m Tail of interest => plot CCDF

∑−

inxXInxF

)(/1)(

r Plot a density

m Smoothed normalized counts of observations

r Problem

m How to determine cell size?

m How to do the smoothing

m How to determine cell spacing?

• Linear

• Logarithmic

Sources of Experimental ErrorsAccuracy, precision, resolution

Experimental errors

r Errors → noise in measured values

r Systematic errors

m Result of an experimental “mistake”

m Typically produce constant or slowly varying bias

Controlled through skill of experimenter

m Examples

• Temperature change causes clock drift

• Forget to clear cache before timing run

Experimental errors

r Random errors

m Unpredictable, non-deterministic

m Unbiased → equal probability of increasing or decreasing measured value

r Result of

m Limitations of measuring tool

m Observer reading output of tool

m Random processes within system

r Typically cannot be controlled

m Use statistical tools to characterize and quantify

A model of errors

r P(X=xi) = P(to measure xi)

corresponds to the “number of possible paths”

r P(X=xi) ~ binomial distribution

r As number of error sources becomes large

m n → ∞,

m Binomial → Gaussian (Normal)

r Thus, the bell curve

Frequency of measuring specific value

Mean of measured values

True valueResolution

Precision

Accuracy

Accuracy, precision, resolution

r Systematic errors → accuracym How close mean of measured values is to true value

m Hard to determine true accuracy

m Relative to a predefined standard• E.g. definition of a “second”

r Random errors → precisionm Repeatability of measurements

m Dependent on tools

r Characteristics of tools → resolutionm Smallest increment between measured values

m Quantify amount of imprecision using statistical tools

Confidence interval for the mean

α/2 α/2

= probability of c1 ≤ x ≤ c2

Normalize x

)(deviation standard

tsmeasuremen ofnumber

Confidence interval for the mean (2)

r Normalized z follows the Student’s t distribution

m (n-1) degrees of freedom

m Area left of c2 = 1 – α/2

m Tabulated values for t

α/2 α/2

Confidence interval for the mean (2)

r As n → ∞, normalized distribution becomes Gaussian (normal)

α/2 α/2

An example

8.5 s8

5.2 s7

11.3 s6

9.5 s5

9.0 s4

5.0 s3

7.0 s2

8.0 s1

Measured valueExperiment

An example (2)

14.2deviation standard sample

An example (3)

r 90% CI → 90% chance that the measured value is in the interval

r 90% CI → α = 0.10

α/2 α/2

An example (4)

r 90% CI = [6.5, 9.4]

m 90% chance value is between 6.5, 9.4

r 95% CI = [6.1, 9.7]

m 95% chance value is between 6.1, 9.7

r Why is interval wider when we are more confident?

Higher confidence → Wider interval?

6.5 9.4

6.1 9.7

Key assumption

r Measurement errors are Normally distributed.

r Is this true for most measurements on real systems?

α/2 α/2

Key assumption (2)

r Saved by the Central Limit Theorem

Sum of a “large number” of values from any distribution will be Normally (Gaussian) distributed.

r What is a “large number?”m Typically assumed to be >≈ 6 or 7

m But in our case often millions or billions

How many measurements?

r Width of interval inversely proportional to √n

r Want to minimize number of measurements

r Find confidence interval for mean, such that:

m P(actual mean in interval) = (1 – α)

How many measurements (2)?

r But n depends on knowing mean and standard deviation!

r Estimate s with small number of measurements

r Use this s to find n needed for desired interval width

Introduction to statistics - TU Berlin · Introduction to statistics Literature Raj Jain: The Art...

Documents

marching.alexyoder.netmarching.alexyoder.net/pdf/earth_wind_fire.pdf% % % % % % % % % > > > > % > % % % ª ª α ∀ ∀ ∀∀ ∀ ∀ ∀ ∀ ∀ α α α α α α α α α Fl. Bα

Statistics Anxiety, Basic Mathematics Skills and Academic …25qt511nswfi49iayd31ch80-wpengine.netdna-ssl.com/wp... · 2020. 5. 12. · N M SD α Statistics Performance 75 53.45 16.05

Statistics Chapter 1 Introduction to Statistics

Introduction to Statistics - Curtin Universitystudyskills.curtin.edu.au/.../Statistics-Session-1.pdf · Introduction to Statistics WORKSHOP 1: DESCRIPTIVE STATISTICS PRESENTED BY

Currents through Quantum dots - Harish-Chandra Research ...nanotr16/notes/BMuralidharan-3.pdf · α ααα α αα αα α α α α α α ααα τ π ε ε εε τ τ ρρ ρρ ρρ

Score TheCzardasReformed- MaestroScore..…Α > > α α ∀ α ∀ α α α α α α α α α α α 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 Flute Oboe Clarinet in Bα Bassoon

α β β α β β - University of Texas at Dallasbiewerm/5-applications.pdf · α + β " α + 0.97β " α - β α + 1.65β " α ‒ 0.68β " 1β " 0.68β The MO energy levels can

Basic Statistics · Basic Statistics Introduction to Statistics Basic Statistical Formulas Commonly used Ecological Equations INTRODUCTION TO STATISTICS Statistics is the branch of

Introduction Statistics

rec Alph α Black Series α

Statistics Introduction to Statistics. Section 1.1 An Overview of Statistics

ΚΥΠΡΟΣ - moi.gov.cy...Α Α Μ Ε Σ Α Ο Ρ Ι Α Σ Π Ρ Κ Ι Α Τ Ι ςΛ ςΛ Ι Ρ Ι Α Μ Α Ρ Α Θ ςΑ Σ Α Σ Ο Λ Ε Α Π Ε Ρ Ι Ο Χ Η Κ Ο Υ Μ ΑΝ

Descriptive Statistics Introduction

Applied Statistics - Introduction

α GOLDFINGER Ω USER MANUAL α - Musette · PDF fileBOGNER AMPLIFICATION 1 α GOLDFINGER Ω USER MANUAL α α Vintage Valve Ω Guitar System α Please read safety instructions before

Α Ν Δ Ρ Ε Α Σ Κ. Α Ρ Β Α Ν Ι Τ Η Σ - Arvanitis Clinicarvanitisclinic.gr/wp-content/uploads/2018/07/C.V...Α Ν Δ Ρ Ε Α Σ Κ. Α Ρ Β Α Ν Ι Τ Η Σ GP, MD,

Applied Statistics and Machine Learning€¦ · 2. Random split CV: randomly select (1-α) × 100 % data units for training and use the remaining α × 100 % data units for testing

Distinctness I. Introduction: Bans on Overcrowdingweb.mit.edu/norvin/OldFiles/www/24.956/handout4.pdf · • image of α includes everything α dominates • α c-commands β iff:

Introduction to Statistics - gauss.stat.su.segauss.stat.su.se/master/slht/probability_pdf/Introduction to Statistics...Introduction to Statistics Bob Conatser Irvine 210 Research Associate