
Mathematics 3 – Statistics

Chapter 1: DESCRIPTIVE STATISTICS – PART I

2

What is Statistics?

Statistics is the science of learning from data exhibiting random fluctuation.

Descriptive statistics:

Collecting data

Presenting data

Describing data

Inferential statistics:

Drawing conclusions and/or making decisions concerning a population based only on sample data

Based on probability theory


3

What are data?

Data can be numbers, record names, or other labels.

Data are useless without their context…

To provide context, we need the Who, What (and in what units), When, Where, and How of the data.

In civil engineering, we most often encounter numerical data.

Presentation tools for numerical data (one sample):

Histogram

Boxplot

Data Presentation


4

Histogram (example)

[Figure: histogram of the compressive strength of concrete (MPa); sample size = 150 concrete cylinders. Horizontal axis: 1 MPa classes from 22-23 to 38-39. Vertical axis: frequency, scale 0 to 25.]
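The class counts behind such a histogram can be sketched in a few lines of Python. This is only an illustration: the function name `class_frequencies`, the left-closed class convention [lower, upper), and the ten strength values are assumptions, not the 150 measurements shown on the slide.

```python
from collections import Counter


def class_frequencies(data, start, width, n_classes):
    """Count how many values fall into each class [lower, upper)."""
    counts = Counter()
    for x in data:
        k = int((x - start) // width)  # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_classes)]


# Hypothetical strengths in MPa (assumed values, for illustration only).
strengths = [22.4, 25.1, 25.9, 27.3, 27.8, 28.0, 30.6, 31.2, 31.9, 38.5]

# 17 classes of width 1 MPa, from 22-23 up to 38-39, as on the slide's axis.
table = class_frequencies(strengths, start=22, width=1, n_classes=17)

for lower, upper, freq in table:
    print(f"{lower}-{upper}: {'#' * freq}")
```

Each printed row is one bar of the histogram, drawn sideways.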


5

Other examples of histograms:

Example 1.1, part a), on my personal website: http://mat.fsv.cvut.cz/Zamestnanci/nHomePage.aspx?hala

How to construct a boxplot? This will be discussed later (numerical measures are needed first).

Data Presentation (continued)


6

Measures of Location

Sample mean

Sample median

Mode: the single value that occurs more often than any other

First quartile

Third quartile
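For reference, these measures can be written as follows. This is a sketch in assumed notation: $\bar{x}$ for the mean, $\tilde{x}$ for the median, $Q_1$ and $Q_3$ for the quartiles, and $x_{(k)}$ for the k-th smallest observation. The quartile rule is the one implied by the worked answers in Examples 1.2 and 1.5; other textbooks and software use slightly different conventions.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
\[
\tilde{x} =
\begin{cases}
x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt]
\tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even}
\end{cases}
\]
\[
Q =
\begin{cases}
\tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right) & \text{if } np \text{ is an integer} \\[4pt]
x_{(\lceil np \rceil)} & \text{otherwise}
\end{cases}
\qquad
p = \tfrac{1}{4} \text{ for } Q_1, \quad p = \tfrac{3}{4} \text{ for } Q_3
\]
```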

Numerical Measures for One Sample


7

Measures of Variation

Range

Population variance

Sample variance

Population standard deviation

Sample standard deviation

Interquartile range
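In the same assumed notation, the usual definitions of these measures are sketched below. The population formulas treat the data as the whole population, so the mean of the data plays the role of the population mean; the symbols $R$, $\sigma$, $s$ and $IQR$ are assumptions.

```latex
\[
R = x_{\max} - x_{\min}
\]
\[
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\sigma = \sqrt{\sigma^2}
\]
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
\]
\[
IQR = Q_3 - Q_1
\]
```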

Numerical Measures for One Sample (continued)


8

Other Numerical Measures

Coefficient of variation

Skewness

Kurtosis

Numerical Measures for One Sample (continued)


9

The analysis of 16 samples of building material yields the following weights of unwanted impurities (data in grams):

6 8 11 6 12 7 5 28 9 10 9 10 12 10 11 9

Compute all important numerical measures.

Answers:

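sample mean: $\bar{x} = 163/16 \approx 10.19$

variance: $\sigma^2 \approx 25.40$ (population formula) or $s^2 \approx 27.10$ (sample formula)

standard deviation: $\sigma \approx 5.04$ or $s \approx 5.21$

Comment: We can use the formulas introduced on the previous pages. However, it is much simpler to use a scientific calculator: after entering the data, we can easily recall the mean and the standard deviation (population or sample version). We can also use the alternate formulas for the variance (see later).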
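Where a computer is available, the same statistics can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are assumptions.

```python
import statistics

# Weights of unwanted impurities from Example 1.2 (grams).
impurities = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]

mean = statistics.mean(impurities)          # sum / n
pop_var = statistics.pvariance(impurities)  # divides by n
samp_var = statistics.variance(impurities)  # divides by n - 1
pop_sd = statistics.pstdev(impurities)      # square root of pop_var
samp_sd = statistics.stdev(impurities)      # square root of samp_var

print(f"mean = {mean:.4f}")
print(f"variance: population = {pop_var:.4f}, sample = {samp_var:.4f}")
print(f"standard deviation: population = {pop_sd:.4f}, sample = {samp_sd:.4f}")
```

The two variances differ only in the divisor (n versus n - 1), which matters noticeably for a sample as small as 16.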

Example 1.2


10

Answers (continued):

We have to sort the data to find the median and quartiles:

5 6 6 7 8 9 9 9 10 10 10 11 11 12 12 28

$n = 16$ is even, and moreover it is divisible by 4, so:

sample median: $\tilde{x} = (9 + 10)/2 = 9.5$

mode: does not exist (values 9 and 10 occur with the same maximum frequency)

first quartile: $Q_1 = (7 + 8)/2 = 7.5$

third quartile: $Q_3 = (11 + 11)/2 = 11$

range: $R = 28 - 5 = 23$

interquartile range: $IQR = Q_3 - Q_1 = 3.5$
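The position rule used in these answers can be sketched in Python. The helper `quartile` is hypothetical, and its rule (average two neighbours when n·q/4 is a whole number, otherwise take the next position up) is inferred from the worked answers in this chapter; calculators and spreadsheets may use other quartile conventions.

```python
import statistics


def quartile(sorted_data, q):
    """Return the q-th quartile (q = 1, 2, 3) of already sorted data."""
    n = len(sorted_data)
    k, remainder = divmod(n * q, 4)
    if remainder == 0:
        # n*q/4 is a whole number: average the k-th and (k+1)-th values.
        return (sorted_data[k - 1] + sorted_data[k]) / 2
    # Otherwise take the value at position ceil(n*q/4) = k + 1 (1-based).
    return sorted_data[k]


data = sorted([6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9])

q1 = quartile(data, 1)
median = quartile(data, 2)
q3 = quartile(data, 3)
data_range = data[-1] - data[0]
iqr = q3 - q1
modes = statistics.multimode(data)  # all values with the highest frequency

print(q1, median, q3, data_range, iqr, modes)
```

Two values share the highest frequency here, which is why the answer above reports that no single mode exists.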

Example 1.2 (continued)


11

A particular value $x$ in a random sample is an outlier if:

$x < Q_1 - 1.5 \cdot IQR$ or $x > Q_3 + 1.5 \cdot IQR$

How do we construct a boxplot?

Draw a horizontal plot line and choose a suitable scale.

Plot a "box" above the plot line; its edges represent the quartiles.

The median is represented by a vertical segment inside the box.

Plot horizontal segments (whiskers) outside the box. The left one joins the left edge of the box with the smallest non-outlier. The right one joins the right edge of the box with the largest non-outlier.

Outliers (if they exist) are depicted by particular symbols (e.g. stars or circles).

Outliers, Boxplot


12

Refer to Example 1.2. Construct the boxplot for the data.

Answers:

Recall data array and important measures:

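5 6 6 7 8 9 9 9 10 10 10 11 11 12 12 28

$Q_1 = 7.5$, $Q_3 = 11$, $IQR = 3.5$

Fences (values cutting off the outliers):

lower fence: $Q_1 - 1.5 \cdot IQR = 7.5 - 5.25 = 2.25$

upper fence: $Q_3 + 1.5 \cdot IQR = 11 + 5.25 = 16.25$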

There is one outlier in the sample – the largest observation 28.
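The fence computation and the whisker ends can be sketched as follows. The factor 1.5 is the commonly used fence multiplier and is an assumption here, as are the variable names; the quartiles are taken from Example 1.2.

```python
data = sorted([6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9])

q1, q3 = 7.5, 11.0  # quartiles found in Example 1.2
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are outliers; the whiskers end at the
# smallest and largest values that are still inside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
left_whisker_end = min(inside)
right_whisker_end = max(inside)

print(f"fences: {lower_fence} and {upper_fence}")
print(f"outliers: {outliers}")
print(f"whiskers end at {left_whisker_end} and {right_whisker_end}")
```

Note that the whiskers stop at actual data values, not at the fences themselves.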

Example 1.3


13

Boxplot:



14

Refer to Examples 1.2 and 1.3:

We found in Example 1.3 that value 28 is an outlier. Assume that this value is an erroneous measurement and exclude it from the sample.

a) Compute basic summary measures for the reduced sample of 15 observations.

b) Construct the boxplot for the reduced sample.

c) Compare the results for both samples.

Answers are available on my personal website.

Example 1.4


15

"Normally" distributed data:

The histogram has an almost symmetric shape; it can be fitted well by a Gaussian curve (see Chapter 5).

The median and mean are almost equal.

The boxplot is almost perfectly symmetric; there are no outliers.

Skewness and kurtosis are very close to zero.

Symmetric Data Distribution


16

Examples:

Histogram of compressive strength of concrete on page 4.

Boxplot constructed in Example 1.4 (15 samples of building material – reduced data set).

Comment:

The skewness computed for the data in Example 1.4 is negative, approximately -0.416. It shows that the data are actually slightly left-skewed (see later).

(You will not be asked to compute skewness in the exam.)
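For the curious, the quoted skewness can be reproduced in Python. The sketch below assumes the adjusted sample skewness (the formula behind spreadsheet functions such as Excel's SKEW); the slides do not state which formula was used, but this one gives both values quoted in this chapter after rounding. The function name is hypothetical.

```python
import math


def adjusted_skewness(data):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


original = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]
reduced = [x for x in original if x != 28]  # Example 1.4: outlier removed

skew_original = adjusted_skewness(original)
skew_reduced = adjusted_skewness(reduced)

print(f"original sample (16 values): {skew_original:.3f}")
print(f"reduced sample (15 values):  {skew_reduced:.3f}")
```

A single extreme observation changes both the size and the sign of the coefficient, which is why the outlier is examined before the shape of the distribution is judged.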

Symmetric Data Distribution (continued)


17

In applications, we very often meet left-skewed or right-skewed data.

The coefficient of skewness is negative for left-skewed data and positive for right-skewed data.

Skewed Data Distribution

Left-skewed (longer tail extends to the left): Mean < Median < Mode

Symmetric: Mean = Median = Mode

Right-skewed (longer tail extends to the right): Mode < Median < Mean


18

Examples of right-skewed distributions:

Example 1.2 (16 samples of building material – original data set)

Comment: The skewness for this sample is approximately 2.879.

Earthquake magnitudes:

Skewed Data Distribution (continued)


19

An example of a boxplot for right-skewed data:



20

Examples of left-skewed distributions:

All three variables in Example 1.1 (Excel file Example 1.1_data and answers).

Grade distribution in a class of 80 students:

Additional questions:

What is the range of the marks of the 20 best students?

Which value cuts off the marks of the worst 25 % of students?

Are there any outliers? Discuss.

Can we say anything about the average mark in this exam?



21

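Population variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$

Sample variance: $s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$

When and why to use them?

If a scientific calculator (or even a computer) is not available…

If the data are integer values…

When we have to recalculate the mean and variance after a small change in the data set…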
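The last use case can be illustrated with the data of Example 1.2. The sketch below assumes the usual computational formulas (population variance as the mean of the squares minus the square of the mean, sample variance as the sum of squares minus n times the squared mean, divided by n - 1). Only n, the sum and the sum of squares are stored, so removing one value needs no pass over the data. The function name is hypothetical.

```python
def summary_from_sums(n, sum_x, sum_x2):
    """Mean, population variance and sample variance from running sums."""
    mean = sum_x / n
    pop_var = sum_x2 / n - mean ** 2
    samp_var = (sum_x2 - n * mean ** 2) / (n - 1)
    return mean, pop_var, samp_var


data = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]
n = len(data)
sum_x = sum(data)                   # 163
sum_x2 = sum(x * x for x in data)   # 2067

mean_all, pop_var_all, samp_var_all = summary_from_sums(n, sum_x, sum_x2)

# Remove the suspicious value 28 by correcting the three sums only.
mean_red, pop_var_red, samp_var_red = summary_from_sums(
    n - 1, sum_x - 28, sum_x2 - 28 ** 2
)

print(f"all 16 values: mean = {mean_all:.4f}, s^2 = {samp_var_all:.4f}")
print(f"without 28:    mean = {mean_red:.4f}, s^2 = {samp_var_red:.4f}")
```

With integer data the sums stay exact, which is what makes these formulas convenient for hand calculation.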

Alternate Variance Formulas


22

Example 1.5

Using a microscope, a researcher observed the number of gold particles in a thin coating of gold solution. He completed 517 observations at regular time intervals. The results are listed in the table:

Compute the mode, median, and quartiles. Compute the mean and standard deviation, too.

Comment on the data distribution.

Finding Mean and Variance Using Frequency Table

Number of particles     0    1    2    3    4    5    6    7
Frequency             112  168  130   68   32    5    1    1


23

Answers: The mode obviously equals 1.

To find the median and quartiles easily, let us calculate the cumulative frequencies:

Number of particles     0    1    2    3    4    5    6    7
Frequency             112  168  130   68   32    5    1    1
Cumulative frequency  112  280  410  478  510  515  516  517

The sample size $n = 517$ is odd, so the median equals the value in the 259th position of the sorted data, therefore $\tilde{x} = 1$.

The first quartile equals the value in the 130th position and the third quartile equals the value in the 388th position, therefore $Q_1 = 1$, $Q_3 = 2$.

Additional task: Try to create a boxplot for this data set.
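The lookup by position can be sketched in Python: the value at a given position of the sorted data is the first value whose cumulative frequency reaches that position. The helper `value_at_position` is hypothetical.

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5, 6, 7]           # number of particles
freqs = [112, 168, 130, 68, 32, 5, 1, 1]    # observed frequencies
cumulative = list(accumulate(freqs))        # running totals


def value_at_position(position):
    """Return the data value at a 1-based position of the sorted data."""
    for value, cum in zip(values, cumulative):
        if position <= cum:
            return value
    raise ValueError("position exceeds the sample size")


mode = values[freqs.index(max(freqs))]      # value with the highest frequency
median = value_at_position(259)
q1 = value_at_position(130)
q3 = value_at_position(388)

print(cumulative)
print(f"mode = {mode}, median = {median}, Q1 = {q1}, Q3 = {q3}")
```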


24


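If the data are summarized in a frequency table, the mean and variance formulas can be modified:

$\bar{x} = \frac{1}{n}\sum_{j=1}^{k} f_j x_j$, $\quad \sigma^2 = \frac{1}{n}\sum_{j=1}^{k} f_j (x_j - \bar{x})^2$, $\quad s^2 = \frac{1}{n-1}\sum_{j=1}^{k} f_j (x_j - \bar{x})^2$

where $x_1, \dots, x_k$ are the distinct data values, $f_1, \dots, f_k$ are the corresponding frequencies, and $n = \sum_{j=1}^{k} f_j$ is the total of all frequencies, i.e. the sample size.

An alternate variance formula can be derived, too:

$\sigma^2 = \frac{1}{n}\sum_{j=1}^{k} f_j x_j^2 - \bar{x}^2$, $\quad s^2 = \frac{1}{n-1}\left(\sum_{j=1}^{k} f_j x_j^2 - n\bar{x}^2\right)$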



25


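We obtain the following results:

$\bar{x} = 798/517 \approx 1.54$, $\sigma^2 \approx s^2 \approx 1.53$, $\sigma \approx s \approx 1.24$

The data distribution is obviously right-skewed.

Comment: If you have a scientific calculator that can process data summarized in a frequency table, enter the data values and frequencies and recall the mean and the standard deviation (population or sample version).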
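The frequency-table calculation for Example 1.5 can also be sketched in Python. Each distinct value is weighted by its frequency, and the variance is obtained from the frequency-weighted sum of squares; the variable names are assumptions.

```python
import math

values = [0, 1, 2, 3, 4, 5, 6, 7]           # number of particles
freqs = [112, 168, 130, 68, 32, 5, 1, 1]    # observed frequencies

n = sum(freqs)                                            # sample size
sum_fx = sum(f * x for f, x in zip(freqs, values))        # sum of f * x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, values))   # sum of f * x^2

mean = sum_fx / n
pop_var = sum_fx2 / n - mean ** 2
samp_var = (sum_fx2 - n * mean ** 2) / (n - 1)
samp_sd = math.sqrt(samp_var)

print(f"n = {n}, sum f*x = {sum_fx}, sum f*x^2 = {sum_fx2}")
print(f"mean = {mean:.4f}")
print(f"variance: population = {pop_var:.4f}, sample = {samp_var:.4f}")
print(f"sample standard deviation = {samp_sd:.4f}")
```

With 517 observations the divisors n and n - 1 are so close that the two variances agree to two decimal places.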



26

If a large data set is grouped by classes in a frequency table and no computer is available, we can approximate the values of the mean and variance using the table.

Hint:

Round all data values within each class interval to the midpoint of that interval.

Use the formulas listed on page 24 (where each data value is now the midpoint of a class interval).
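The hint can be sketched in Python as follows. The class table below is hypothetical (it is not the speed data of Example 1.1), and the variable names are assumptions; every observation in a class is treated as if it equalled the class midpoint.

```python
import math

# Hypothetical grouped table: (lower bound, upper bound, frequency).
classes = [(40, 50, 4), (50, 60, 11), (60, 70, 18), (70, 80, 7)]

n = sum(freq for _, _, freq in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [freq for _, _, freq in classes]

# Frequency-table formulas with midpoints in place of the data values.
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))

approx_mean = sum_fm / n
approx_var = (sum_fm2 - n * approx_mean ** 2) / (n - 1)   # sample variance
approx_sd = math.sqrt(approx_var)

print(f"approximate mean = {approx_mean:.2f}")
print(f"approximate variance = {approx_var:.2f}")
print(f"approximate standard deviation = {approx_sd:.2f}")
```

The results are approximations only: the narrower the classes, the closer they come to the values computed from the raw data.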

Example 1.6: Approximate the mean speed and the variance and standard deviation of the speed (Excel file Example 1.1_data and answers) using the frequency table in the sheet Histograms.

Answers are available in Excel file Example 1.1_data and answers in the sheet Approximate calculation.

Estimating Mean and Variance Using Grouped Frequency Table
