UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA SCHOOL OF SOFTWARE ENGINEERING OF USTC Data Visualisation & Interpretation The art of reading datasets Devert Alexandre School of Software Engineering of USTC 14 February 2012 — Slide 1/1


Descriptive statistics

Descriptive statistics help to give a general summary of the data


Mean

An example of a descriptive statistic: the arithmetic mean

$$\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i$$


$$\bar{a} = \frac{1}{n}\,(a_1 + a_2 + \cdots + a_n)$$


The mean is also defined for points in R^n ⇒ it is the geometric center of the points
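As a minimal sketch of the geometric center, the mean can be taken one coordinate at a time. The function name `centroid` and the sample points are illustrative assumptions, not part of the original slides.

```python
import math

def centroid(points):
    """Return the component-wise arithmetic mean of a list of equal-length points."""
    n = len(points)
    # zip(*points) groups the coordinates dimension by dimension
    return [math.fsum(coords) / n for coords in zip(*points)]

# Four corners of a 2 x 2 square (hypothetical data)
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(centroid(square))
```

The same function works unchanged for points of any dimension, as long as all points have the same length.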


Mean computation

Do you think it is easy to compute the mean?

0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1


A naive summation algorithm will return this

    >>> 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1
    0.8999999999999999


An accurate summation algorithm will return this

    >>> import math
    >>> math.fsum([0.1] * 9)  # fsum expects an iterable of numbers
    0.9


Algorithms like the Kahan summation algorithm or the Shewchuk summation algorithm reduce the numerical error

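    def KahanSum(data):
        s = 0.0  # running sum
        c = 0.0  # compensation for the low-order bits lost so far
        for i in range(len(data)):
            y = data[i] - c    # apply the compensation to the next term
            t = s + y          # low-order bits of y may be lost here
            c = (t - s) - y    # what was lost, with opposite sign
            s = t
        return s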

Listing 1: Kahan summation
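To see why the numerical error matters, here is a small sketch comparing a plain accumulation loop with `math.fsum` on a longer list. The list length is an arbitrary assumption. The explicit loop is used on purpose, since the built-in `sum` may itself apply a compensated algorithm in recent Python versions.

```python
import math

data = [0.1] * 100000  # hypothetical data: many copies of the same value

# Naive left-to-right accumulation: rounding errors pile up
naive = 0.0
for x in data:
    naive += x

# Correctly rounded sum of the stored floating point values
accurate = math.fsum(data)

print(naive, accurate, abs(naive - accurate))
```

The gap between the two results grows with the number of terms, which is why it shows up in the mean of large datasets.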


Central tendency

The mean is a measure of central tendency ⇒ the main behaviour, the main value of some phenomenon


Mean robustness

The mean is not a robust estimator of the central tendency: a single extreme value can move it arbitrarily far


Median

The median is the value such that 50% of the values are higher and 50% of the values are lower

a = [6, 1, 7, 9, 6, 3, 4, 5, 2]

sorted: a = [1, 2, 3, 4, 5, 6, 6, 7, 9]

$$\tilde{a} = 5$$


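With an even number of values, the median is the mean of the two middle values

a = [6, 1, 7, 9, 6, 3, 4, 8, 5, 2]

sorted: a = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9]

$$\tilde{a} = \frac{1}{2}(5 + 6) = 5.5$$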


Median computation

To compute the median of n samples, you can

  1 sort the list of samples

  2 • if n is odd → $\tilde{a} = a_{\frac{n+1}{2}}$

    • if n is even → $\tilde{a} = \frac{1}{2}\left(a_{\frac{n}{2}} + a_{\frac{n}{2}+1}\right)$

Note that this is for indexes starting from 1


Let's code some Python

    def median(data):
        data = sorted(data)  # sort a copy, the caller's list is left untouched
        if len(data) % 2 == 0:
            m = len(data) // 2  # integer division, needed in Python 3
            return 0.5 * (data[m - 1] + data[m])
        else:
            return data[(len(data) - 1) // 2]


    >>> a = [6, 1, 7, 9, 6, 3, 4, 5, 2]
    >>> median(a)
    5


The median has an equivalent in R^n ⇒ the median center

Compute the median for each dimension to get the median center
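The per-dimension recipe above can be sketched with the standard `statistics` module. The function name `median_center` and the sample points are assumptions made for illustration.

```python
import statistics

def median_center(points):
    """Return the per-dimension median of a list of equal-length points."""
    # zip(*points) yields all x values, then all y values, and so on
    return [statistics.median(coords) for coords in zip(*points)]

# Hypothetical 2D points, one of them far away on the x axis
points = [(1, 10), (2, 30), (3, 20), (100, 25), (4, 15)]
print(median_center(points))
```

Note that the median center is generally not one of the input points, since each coordinate is picked independently.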


Median robustness

The median is a more robust estimator of the central tendency

• green is the median

• pink is the arithmetic mean
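A quick way to observe the difference in robustness is to add one extreme value to a small dataset and compare both estimators. The value 1000 below is a hypothetical corrupted measurement, chosen only for illustration.

```python
import statistics

clean = [1, 2, 3, 4, 5, 6, 6, 7, 9]
with_outlier = clean + [1000]  # one hypothetical corrupted measurement

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The mean jumps by two orders of magnitude, while the median barely moves.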


Statistical dispersion

The following datasets have the same central tendency


But they have different dispersions


Standard deviation

A traditional measure of dispersion is the standard deviation σ

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2$$
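As a worked example, the formula can be transcribed directly and checked against `statistics.stdev`, which uses the same n − 1 denominator. The function name `sample_std` and the data are assumptions for illustration; this direct version is a sketch, not a numerically hardened implementation.

```python
import math
import statistics

def sample_std(data):
    """Standard deviation with the n - 1 denominator, straight from the formula."""
    n = len(data)
    mean = math.fsum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(math.fsum(squared_deviations) / (n - 1))

a = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5
print(sample_std(a), statistics.stdev(a))
```

Here the squared deviations sum to 32, so the result is the square root of 32 / 7.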


Standard deviation computation

Robust computation of the standard deviation ⇒ Knuth-Welford algorithm

    import math

    def stdDev(data):
        n = 0
        mean = 0.0
        M2 = 0.0
        # Shift the data by a first estimate of the mean to limit cancellation
        meanEstimate = math.fsum(data) / len(data)

        for x in data:
            y = x - meanEstimate
            n = n + 1
            delta = y - mean
            mean = mean + delta / n        # running mean of the shifted data
            M2 = M2 + delta * (y - mean)   # running sum of squared deviations

        return math.sqrt(M2 / (n - 1))


Standard deviation

The standard deviation suffers from the same robustness issues as the mean. We will look at why later.


Quartiles

The lower quartile or first quartile is the value such that 75% of the values are higher and 25% of the values are lower

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]

sorted: a = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]

q1 = 2


The higher quartile or third quartile is the value such that 25% of the values are higher and 75% of the values are lower

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]

sorted: a = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]

q3 = 6


Where is the second quartile? ⇒ it's the median


Interquartile range

The difference Q3 − Q1 is the interquartile range or IQR ⇒ it's a more robust dispersion measure
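A minimal sketch of the quartiles and the IQR on the example list, assuming Python 3.8 or later for `statistics.quantiles`. Several quartile conventions exist; the default method of `statistics.quantiles` happens to agree with the values given above for this list, which may not hold for other data.

```python
import statistics

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median) and Q3
q1, q2, q3 = statistics.quantiles(a, n=4)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```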


normal distribution

A model for random variables, with 2 parameters µ and σ

[Figure: plot with x axis from −6 to 6 and y axis from 0.00 to 0.40]


The normal distributions have 2 parameters µ and σ.

$$\Phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

This is the probability density of the normal distribution.


It tells how likely values near x are to appear, according to this distribution. It is a density, not a probability: probabilities are obtained as areas under the curve.
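The density formula translates directly into a few lines of Python. The function name `normal_pdf` and its default parameters are assumptions made for this sketch.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with parameters mu and sigma."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, normal_pdf(x))

# With a small sigma the density exceeds 1, which a probability never could
print(normal_pdf(0.0, sigma=0.25))
```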


µ is the mode, the central tendency of the normal distribution

[Figure: plot with x axis from −6 to 6 and y axis from 0.00 to 0.40]


If some data follow a normal distribution, then the arithmetic mean estimates µ

$$\mu \approx \bar{a}$$

The more samples you have, the more "true" it will be


σ controls the shape of the normal distribution

[Figure: plot with x axis from −6 to 6 and y axis from 0.0 to 1.6]


If some data follow a normal distribution, then σ is estimated by

$$\sigma^2 \approx \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2$$

The standard deviation comes from here ⇒ dispersion of a normal distribution
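To illustrate how the sample mean and the sample standard deviation approach µ and σ, here is a small simulation sketch. The "true" parameters, the seed and the sample sizes are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
mu, sigma = 2.0, 0.5     # hypothetical "true" parameters

for n in (10, 1000, 100000):
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    # Both estimates should get closer to mu and sigma as n grows
    print(n, statistics.fmean(samples), statistics.stdev(samples))
```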


µ and σ are completely independent parameters

[Figure: plot with x axis from −6 to 6 and y axis from 0.0 to 1.6]


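Practical interpretation of the normal distribution

[Figure: normal density with the x axis marked at µ, ±1σ, ±2σ and ±3σ; on each side of µ the area is 34.1% between µ and 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, and 0.1% beyond 3σ]

68% of the values lie within [µ − σ, µ + σ]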


95% of the values lie within [µ − 2σ, µ + 2σ]


99.7% of the values lie within [µ − 3σ, µ + 3σ]
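These three percentages can be recomputed from the error function, using the standard identity that a normal variable falls within k standard deviations of µ with probability erf(k / √2). The function name `fraction_within` is an assumption for this sketch.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of mu."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, fraction_within(k))
```

The result does not depend on µ or σ, which is why the rule applies to every normal distribution.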


skewed distributions

Your data might not have a symmetric distribution ⇒ they might have a skewed distribution

[Figure: plot with x axis from 0.0 to 3.0]

• red is the true central tendency

• green is the median

• pink is the arithmetic mean


You can compute the skewness of your data

$$\frac{\frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^2 \right)^{\frac{3}{2}}}$$
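A direct transcription of the skewness formula, as a sketch. The function name `skewness` and both datasets are assumptions for illustration; the function is undefined when all values are equal, since the denominator is then zero.

```python
import math

def skewness(data):
    """Third central moment divided by the second central moment to the power 3/2."""
    n = len(data)
    mean = math.fsum(data) / n
    m2 = math.fsum((x - mean) ** 2 for x in data) / n
    m3 = math.fsum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]  # a long tail towards high values
print(skewness(symmetric), skewness(right_tail))
```

A symmetric dataset gives a skewness of zero, a long tail towards high values gives a positive skewness, and a long tail towards low values gives a negative one.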


multimodal distributions

Your data might have multiple modes

[Figure: plot with x axis from −3 to 5 and y axis from 0.0 to 0.9]


In such a case, the mean, the median and other descriptive quantities might have no reliable meaning

[Figure: plot with x axis from −3 to 5 and y axis from 0.0 to 0.9]
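The following sketch builds a hypothetical two-mode dataset (the mode locations, spreads and seed are arbitrary assumptions) and counts how many samples actually sit near the mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical mixture: half of the samples around -1, half around 3
samples = ([rng.gauss(-1.0, 0.3) for _ in range(5000)]
           + [rng.gauss(3.0, 0.3) for _ in range(5000)])

mean = statistics.fmean(samples)
near_mean = sum(1 for x in samples if abs(x - mean) < 0.5)
print("mean:", mean)
print("samples within 0.5 of the mean:", near_mean, "out of", len(samples))
```

The mean lands between the two modes, in a region where almost no data lives, so it describes none of the samples well.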


Observe your data

Descriptive statistics can completely miss important information in your data!


Anscombe's quartet

[Figure: the datasets of Anscombe's quartet as plots; x axis from 0 to 20, y axis from 4 to 12]


Those 4 datasets have almost exactly the same

• mean

• variance

• regression line

But they are not quite the same thing!
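As a check, here is a sketch computing the mean and the variance of the y values of the four datasets. The numbers are the classic values published by Anscombe (1973), typed in here by hand rather than taken from the slides.

```python
import statistics

# y values of Anscombe's quartet (Anscombe, 1973)
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": y1, "II": y2, "III": y3, "IV": y4}
for name, y in quartet.items():
    # The summaries agree to about two decimal places, the shapes do not
    print(name, statistics.fmean(y), statistics.variance(y))
```

Only a plot reveals that one dataset is a curve, one is a line with a single outlier, and one has all but one point at the same x value.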


Boxplot

A nice way to summarize a data distribution is the boxplot


The red mark shows the mean


The box goes from the lower quartile to the upper quartile


The box thus always contains the median, although the median is not necessarily at its centre


The whiskers are the minimum and maximum values, outliers excluded


Outlier values are shown as blue crosses

Outliers are values which are more than 1.5 × IQR away from the quartiles, that is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
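The outlier rule can be sketched as follows, assuming Python 3.8 or later for `statistics.quantiles`. The function name `find_outliers` and the data are assumptions; plotting libraries may use a slightly different quartile convention, so their outliers can differ on borderline values.

```python
import statistics

def find_outliers(data):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

a = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9, 25]  # hypothetical data with one extreme value
print(find_outliers(a))
```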


Scatter plot

A scatter plot is simply a plot with the data as points along 2 dimensions

[Figure: scatter plot with x axis from −3 to 3 and y axis from −5 to 4]
