DATA ANALYSIS - Universiti Teknologi Malaysia

DATA ANALYSIS

Faculty of Civil Engineering

DATA

DATA - Introduction

• Data is a collection of facts, such as

numbers, words, measurements,

observations or even just descriptions of

things.

• Qualitative data is descriptive information

(it describes something).

• Quantitative data is numerical information

(numbers).

DATA - Introduction

• Quantitative data can also be discrete or

continuous.

• Discrete data can only take certain values

(like whole numbers).

• Continuous data can take any value

(within a range).

Data Analysis - Introduction

• A process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making.

• Data analysis has multiple facets and

approaches, encompassing diverse

techniques under a variety of names, in

different disciplines.

Data Analysis - Introduction

• Data analysis is about manipulating and

presenting results.

• Data need to be organised, summarised and analysed in order to draw (infer) conclusions.

Data Analysis - Processes

• Data requirements.

• Data collection.

• Data processing.

• Data cleaning.

• Exploratory data analysis.

• Modelling and algorithms.

• Results & Report.

Sources of Data

• Lab Experimentation

• Survey

• Census

• Theoretical Analysis

• Numerical Analysis

• Software

• Other researchers' data

Example Analysis – Results

• Estimation of parameter mean values

• Estimation of parameters variability

• Comparison of parameter mean values

• Comparison of parameter variability

• Modelling the dependence of a dependent variable on several quantitative & qualitative independent variables

Data Processing

• Data initially obtained must be processed

or organized for analysis.

• For instance, this may involve placing data

into rows and columns in a table format for

further analysis, such as within a

spreadsheet or statistical software.

Data Cleaning

• The data may be incomplete, contain

duplicates, or contain errors.

• The need for data cleaning will arise from

problems in the way that data is entered

and stored.

• Data cleaning is the process of preventing

and correcting these errors.

Data Checking

• Before data analysis and interpretation, check for invalid data using a suitable data checking procedure.

• Weed out bad data continuously throughout the data gathering process.

• Bad data can bias results & interpretation.

• Repeat the data gathering or experimentation if suspicious data exist.

Exploratory Data Analysis

• Once the data is cleaned, it can be analyzed.

• Analysts may apply a variety of techniques

referred to as exploratory data analysis to begin

understanding the messages contained in the

data.

• The process of exploration may result in

additional data cleaning or additional requests

for data, so these activities may be iterative in

nature.

Trial Test

• Do a simple trial test.

• To ensure that all parts in the testing set-

up function well.

• To determine the range of measurement

to be taken.

• To anticipate the time taken for each step

in the experiment.

• To see the error.

Error (Uncertainty)

• Writing a measurement result with ± e does not mean that we have made a mistake.

• It expresses the uncertainty due to the limits of the equipment and the experimental technique.

Error (Uncertainty)

• Example, Case 1:

• Theory predicts a deflection of 5 mm; in the experiment the deflection is 5.5 mm. Does this mean that the theory is wrong?

• First ask what the error limit is. If the error limit is ±0.75 mm, the measured range (4.75 mm to 6.25 mm) includes 5 mm, so the result is consistent with the theory.

Error (Uncertainty)

• Example, Case 2:

• Two experimentalists measure the time taken for ….

• The first researcher gives the result as 20.4 ± 0.4 s.

• The second researcher gives 19.8 ± 0.8 s.

• Do their results contradict each other?

Error (Uncertainty)

• No, their results actually overlap.

• However, we are more confident in the first result because its uncertainty is half that of the second, which suggests that the measurement was done more carefully.
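The two cases above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `interval` and `overlaps` are hypothetical, and the ±0.75 error limit in Case 1 is assumed to be in millimetres.

```python
def interval(value, uncertainty):
    """Return the (low, high) range implied by value ± uncertainty."""
    return (value - uncertainty, value + uncertainty)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Case 1: theory says 5 mm, measured 5.5 mm, error limit ±0.75 (assumed mm).
measured = interval(5.5, 0.75)
theory_consistent = measured[0] <= 5.0 <= measured[1]

# Case 2: 20.4 ± 0.4 s versus 19.8 ± 0.8 s.
first = interval(20.4, 0.4)
second = interval(19.8, 0.8)
results_agree = overlaps(first, second)

print("Case 1 consistent with theory:", theory_consistent)
print("Case 2 results overlap:", results_agree)
```

Two results are treated as agreeing here when their uncertainty ranges share at least one value; stricter criteria exist, but this matches the reasoning used in the two cases.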

Analysis & Interpretation

• Mathematical formulas or numerical

models called algorithms may be applied

to the data to identify relationships among

the variables.

• Numerical models: using software

• Statistical analysis: using software

STATISTICAL ANALYSIS

What is Statistics

• The science of collecting and analyzing

data.

• It’s about the whole process of using the

scientific method to answer questions and

make decisions.

What is Statistics

• The process involves designing studies,

collecting good data, describing the data

with numbers and graphs, analyzing the

data, and then drawing conclusions.

Statistical Analysis

1) Designing studies

2) Collecting & selecting data

3) Describing data

4) Analyzing data

5) Drawing conclusions

Designing Studies

• Once a research question is defined, the

next step is designing a study in order to

answer that question.

• Figure out what process would be used to

get the data we need.

Designing Studies

• An observational study could be a survey.

• Surveys are questionnaires that are presented to individuals who have been selected from a population of interest.

• Another widely used type of observational study is based on nature, such as wildlife, geology, hydrology, meteorology, environment, etc.

Designing Studies

• Experiments take place in a controlled

setting, and are designed to minimize

biases that might occur.

• It is perhaps most important to note that

no matter what the study, it has to be

designed so that the original questions can

be answered in a credible way.

Collecting & selecting data

• If you select your subjects in a way that is biased

- that is, favoring certain individuals or groups of

individuals – then the results will also be biased.

• Experiments and observational studies that use instrumentation are sometimes even more challenging when it comes to collecting data.

• For example, something may happen during the experiment to distract the subjects or the researchers.

Describing Data

• Once data are collected, the next step is to

summarize it all to get a handle on the big

picture.

• Statisticians describe data in two major

ways: with pictures (that is, chart & graph)

and with numbers, called descriptive

statistics.

CHARTS AND GRAPHS

Charts and Graphs

• Line graphs for trend & behaviour

• Time charts for time series data

• Scatter graphs for relationships

• Pie charts & bar charts for categorical data

• Histogram & box plots for numerical data

Line Graphs

• A powerful tool to explain results in terms of cause and effect.

• The horizontal x-axis is normally used for the

independent variable (the cause or controlled

variable).

• The vertical y-axis is normally used for

dependent variable (the effect).

• To describe the development or progression.

• To show trend, response or behaviour in data.

Line Graph

[Figure: example of a line graph.]

Time Charts

• To examine trends over time; another name for a time chart is a line graph.

• Typically a time chart has some unit of time on the horizontal axis (year, day, month, and so on) and a measured quantity on the vertical axis (income, birth rate, total sales, and so on).

Time Chart

[Figure: time chart of total sales (vertical axis, 0 to 60) against time (horizontal axis, 0 to 12).]

Scatter Graphs

• Useful to present many data values.

• To show correlations between two

variables.

• To draw conclusions about relationship in

the data.

Scatter Graphs

[Figure: scatter graph of Y (vertical axis, 0 to 90) against X (horizontal axis, 0 to 14).]

Pie Charts

• Present data in segments, conveying the proportion of each category simply and directly.

• A pie chart takes categorical data and shows the percentage of individuals that fall into each category.

• Each segment is presented as a percentage, and a pie chart can only be used with one data set.

Bar Charts

• An effective way of presenting frequencies.

• Common in reports of small scale research.

• The bar height represents quantity or amount.

• The number of bars represents the categories.

• Often used to compare groups by breaking them down and showing them side by side.

• Visually striking and simple to read.

Bar Charts

Histogram

• The statistician’s graph of choice for numerical data; it provides a quick way to get the big idea about a numerical data set.

• A histogram is a graphical display of tabulated frequencies, that is, a graphical version of a table that shows what proportion of cases falls into each of several specified classes.

Histogram

• A histogram is the most important graphical tool

for exploring the shape of data distributions

(Scott, 1992).

• The shape examined from the histogram puts

the type of distribution into view.

• A histogram can be constructed by plotting the frequency of observations against the class midpoints of the data.

Number of Class Interval

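Rule of thumb (Sturges’s rule, in its commonly quoted form) for choosing the number of class intervals k and an appropriate class width a:

$$k = 1 + 3.322\,\log_{10}(n), \qquad a = \frac{x_{\max} - x_{\min}}{k}$$

a is the bin width (width of the class interval)

n is the number of observations (data)

log10(n) is the base-10 logarithm of the number of observations

According to Sturges’s rule, 1000 observations would be graphed with 11 class intervals, since 1 + 3.322 × 3 ≈ 11.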
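As a rough sketch, the rule of thumb can be evaluated in Python. The form k = 1 + 3.322 log10(n) is the commonly quoted version of Sturges's rule and is assumed here; the function names and the idea of dividing the data range by k to get the class width are illustrative choices, not taken from the slides.

```python
import math


def sturges_classes(n):
    """Number of class intervals, assuming k = 1 + 3.322 * log10(n)."""
    return round(1 + 3.322 * math.log10(n))


def class_width(data, k):
    """Class width a, assuming the data range is split into k equal classes."""
    return (max(data) - min(data)) / k


k = sturges_classes(1000)
print("Class intervals for 1000 observations:", k)
```

For 1000 observations this reproduces the 11 class intervals quoted above.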

Histogram

Histogram - tips

• If there are too few classes, it is difficult to

see how the data vary.

• If there are too many classes, then the

table is less of a summary

Histogram tells three features

• How the data are distributed (symmetric,

skewed right, skewed left, bell-shaped).

• The amount of variability in the data.

• Where the center of the data is

(approximately).

Histogram tells three shapes

• Symmetric: the left-hand side of the

histogram is a mirror image of the right-

hand side.

• Skewed right: it looks like a lopsided

mound with one long tail going off to the

right.

• Skewed left: it looks like a lopsided mound

with one long tail going off to the left.

Histogram tells variability

• If a histogram is quite flat with the bars

close to the same height, it indicates high

variability.

• A histogram with a big lump in the middle and tails on the sides indicates more data in the middle bars than in the outer bars; the data are closer together, that is, less variable.

Histogram tells center

• A histogram can also give you a rough

idea of where the center of the data lies.

• To visualize the mean; the mean is the

point where the fulcrum has to be in order

to balance the weight on each side.

Boxplot

• A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value.

• In essence, these five descriptive statistics divide the data set into four parts, each containing about 25% of the data.

Making a Boxplot

1) Find the five number summary of data

set.

2) Create a horizontal number line whose

scale includes the numbers in the five-

number summary.

3) Label the number line using appropriate

units of equal distance from each other.

Making a Boxplot

[Figure: number line for exam score, marked from 40 to 100 in steps of 10.]

Five-number summary:

43: Minimum

68: 25th percentile

77: Median

89: 75th percentile

99: Maximum

Making a Boxplot

4) Mark the location of each number in the five-

number summary just above the number line.

5) Draw a box around the marks for the 25th

percentile and the 75th percentile.

6) Draw a line in the box where the median is located.

7) Draw lines from the outside edges of the box

out to the minimum & maximum values of the

data set.

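Making a Boxplot

[Figure: boxplot of the exam scores on a number line from 40 to 100, with the box (Step 5), the median line (Step 6) and the lines out to the minimum and maximum (Step 7) labelled.]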

Interpreting a Boxplot

A boxplot can show information about the

distribution, variability, and center of a

data set.

Symmetric data shows a symmetric

boxplot.

Skewed data show a lopsided boxplot,

where the median cuts the box into two

unequal pieces.


If the longer part of the box is to the right

(or above) the median, the data is said to

be skewed right.

If the longer part is to the left (or below)

the median, the data is skewed left.


The upper part (vertical line) of the box is

wider than the lower part (vertical line).

This means that the data between the

median (77) and Q3 (89) are a little more

spread out, or variable, than the data

between the median (77) and Q1 (68).


Variability in a data set is measured by the interquartile range (IQR).

The IQR is equal to Q3 – Q1.

A large distance from the 25th percentile to the 75th percentile indicates the data are more variable.

The IQR ignores data below the 25th percentile or above the 75th percentile, which may contain outliers.


The median is part of the five-number

summary, and is shown by the line that

cuts through the box in the boxplot.

The mean, however is not part of the box

plot.

A common misinterpretation of a boxplot: the bigger the box, the more data.

In fact, a bigger part of the box means there is more variability (a wider range of values), not more data.

DESCRIPTIVE STATISTICS

Summarizing Data

• Descriptive statistics are numbers that

summarize some characteristic about a

set of data.

• Summarizing data by numerical measures

makes a point clearly and concisely.

• Mean, Median, Mode, Standard Deviation,

Variance, Coefficient of Variation,

Skewness, Kurtosis.

Sample Mean

• The sample mean is defined as the sum of the observed values of the variable x divided by the number of observed values n:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sample Median

• The sample median of a variable

x is defined as the middle value

when the n sample observations

of x are ranked in increasing

order of magnitude.

Sample Median

• S = 1,6,3,8,2,4,9

• We need to find the value x, where half of the values are above x and half of the values are below x.

• Rearrange, S = 1,2,3,4,6,8,9

• The median is 4

Sample Mode

• The sample mode of a variable x is defined as the value with the highest frequency.

• The mode of a data set is the value that occurs most often or, in other words, has the highest probability of occurring.

Sample Mode

• Sometimes we can have two, three, or

more values that have relatively large

probability of occurrence.

• In such cases, we say that the distribution

is bimodal, tri-modal or multimodal,

respectively.

Sample Mode

• Consider the rolls of a ten-sided die:

• R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2

• The number that appears the most is the

number 2.

• Therefore the mode of set R is the number

2

Sample Mode

• Consider the rolls of a ten-sided die:

• R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2

• Note that if the number 7 had appeared

one more time, it would have been present

four times as well.

• In this case, we would have had a bimodal

distribution, with 2 and 7 as the modes.
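The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch only; the variable names are illustrative, and `multimode` requires Python 3.8 or later.

```python
from statistics import median, multimode

# Median example: S = 1, 6, 3, 8, 2, 4, 9
s_values = [1, 6, 3, 8, 2, 4, 9]
s_median = median(s_values)  # sorts internally and takes the middle value

# Mode example: rolls of a ten-sided die
rolls = [2, 8, 1, 9, 5, 2, 7, 2, 7, 9, 4, 7, 1, 5, 2]
modes = multimode(rolls)  # list of all values with the highest frequency

# If 7 had appeared one more time, 2 and 7 would both occur four times.
modes_with_extra_seven = multimode(rolls + [7])

print("Median of S:", s_median)
print("Mode(s) of R:", modes)
print("Mode(s) with one more 7:", modes_with_extra_seven)
```

`multimode` returns every value that ties for the highest frequency, which is why it is used here instead of `mode` to show the bimodal case.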

Mean Median Mode

• When to use mean, median & mode?

• Mean – for normally distributed data

(symmetrical distribution).

• Median & Mode – for markedly

skewed data.

Measures of Dispersion

• Consider the following data set:

• S = 5,5,5,5,5,5 and R = 0,0,0,10,10,10

• If we calculated the mean for both S and

R, we would get the number 5.

• However, these are two vastly different data sets.


• Therefore, we need another descriptive

statistic besides a measure of central

tendency, which we shall call a measure of

dispersion.

• We shall measure the dispersion or scatter

of the values of our data set about the

mean of the data set.


• If the values tend to be concentrated near

the mean, then this measure shall be

small, while if the values of the data tend

to be distributed far from the mean, then

the measure will be large.

• The two measures of dispersion that are usually used are called the variance and the standard deviation.

Variance and Std Deviation

• A quantity of great importance in

probability and statistics is called the

variance.

• The variance, denoted by σ², for a set of n numbers x1, x2, …, xn with mean μ is given by

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$


• The variance is a nonnegative number.

• The positive square root of the variance (σ²) is called the standard deviation (σ).

• Find the variance and std deviation for the

following set of test scores:

• T = 75, 80, 82, 87, 96

from data set, μ = 84


• T = 75, 80, 82, 87, 96

from data set, μ = 84

Variance σ² = [(75 − 84)² + (80 − 84)² + (82 − 84)² + (87 − 84)² + (96 − 84)²] / 5 = 254 / 5 = 50.8

Std Deviation, σ = √50.8 = 7.13
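The worked example can be checked with the standard `statistics` module, as a quick sketch. `pvariance` and `pstdev` divide by n, which matches the result of 50.8 in this example; the variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance

scores = [75, 80, 82, 87, 96]  # test scores T

mu = mean(scores)                   # arithmetic mean
sigma_squared = pvariance(scores)   # variance with divisor n
sigma = pstdev(scores)              # positive square root of the variance

print("mean =", mu)
print("variance =", round(sigma_squared, 2))
print("std deviation =", round(sigma, 2))
```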

Variance & Std Deviation

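• It is also widely accepted to divide by (n − 1) as opposed to n, which gives the sample variance s² and the sample standard deviation s:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$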
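To illustrate the effect of dividing by (n − 1), the sketch below computes the sample variance by hand and compares it with `statistics.variance`, which also uses the (n − 1) divisor. Reusing the test scores T = 75, 80, 82, 87, 96 here is an illustrative choice, and `sample_variance` is a hypothetical helper name.

```python
import math
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


scores = [75, 80, 82, 87, 96]
s_squared = sample_variance(scores)  # 254 / 4
s = math.sqrt(s_squared)

print("sample variance =", s_squared)
print("library value   =", variance(scores))
print("sample std dev  =", round(s, 2))
```

With only five scores the difference from the n divisor is noticeable (63.5 against 50.8); it shrinks as n grows.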

Percentiles

• It is often convenient to subdivide an ordered data set by the use of ordinates, so that the number of data points less than the ordinate is some percentage of the total number of observations.

• The values corresponding to such areas are called percentile values, or briefly, percentiles.

Percentiles

• For example, the proportion of scores that fall below the ordinate at xα is α.

• For instance, the proportion of scores less than x0.10 would be 0.10 or 10%, and x0.10 would be called the 10th percentile.

Percentiles

• Another example is the median.

• Since half the data points fall below the median, it is the 50th percentile (or fifth decile), and can be denoted by x0.50.

Percentiles

• The 25th percentile is often thought of as

the median of the scores below the

median, and the 75th percentile is often

thought of as the median of the scores

above the median.

Percentiles

• The 25th percentile is called the first

quartile, while the 75th percentile is called

the third quartile.

• The median is also known as the second

quartile.

Interquartile Range

• Another measure of dispersion is the

interquartile range (IQR).

• The interquartile range is defined to be the first quartile subtracted from the third quartile.

• IQR = x0.75 - x0.25

Interquartile Range

• Find interquartile range from the following

data set:

• S = (67, 69, 70, 71, 74, 77, 78, 82, 89)

• The median is 74.

• The first quartile, x0.25, is the median of the scores below the fifth position, that is, the average of the second and third scores, which leads to x0.25 = 69.5

Interquartile Range (IQR)

• The third quartile, x0.75, is the median of the scores above the fifth position, that is, the average of the seventh and eighth scores, which leads to x0.75 = 80

• The interquartile range is x0.75 - x0.25 = 80 – 69.5 = 10.5.

• The semi-interquartile range is 0.5(x0.75 - x0.25), which leads to 5.25
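The quartile calculation above (median of the lower half and median of the upper half) can be sketched as follows. The function name `quartiles_by_halves` is hypothetical, and other quartile conventions, including those used by some software packages, can give slightly different values.

```python
from statistics import median


def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-halves method.

    For an odd number of points the middle value is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


scores = [67, 69, 70, 71, 74, 77, 78, 82, 89]
q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1
semi_iqr = 0.5 * iqr

print("Q1 =", q1, "median =", q2, "Q3 =", q3)
print("IQR =", iqr, "semi-IQR =", semi_iqr)
```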

Skewness

• Distribution of scores in data set.

• We might have a symmetrical data set, or

a data set that is evenly distributed, or a

data set with more high values than low

values.

Skewness

• Often a distribution is not symmetric about any value. If it instead has a few more higher values, it is said to be skewed to the right.

• If the data set has a few more lower values, then it is said to be skewed to the left.

Skewed

[Figure: two distributions, one skewed to the right and one skewed to the left.]

Kurtosis

• Kurtosis (from the Greek word κυρτός, kyrtos or

kurtos, meaning bulging) is a measure of the

"peakedness" of the probability distribution of a

real valued random variable

• Kurtosis is a measure of whether the data are

peaked or flat relative to a normal distribution.

• Higher kurtosis means more of the variance is

due to infrequent extreme deviations, as

opposed to frequent modestly-sized deviations.

Kurtosis

• A distribution having a relatively high peak is called leptokurtic, while a curve which is flat-topped is called platykurtic.

• The normal distribution which is not very

peaked or very flat-topped is called

mesokurtic.

Kurtosis

[Figure: three curves, (a) leptokurtic, (b) platykurtic and (c) mesokurtic.]

Moments

• If X1, X2, …, XN are the N values assumed by the variable X, we define the quantity:

$$\overline{X^r} = \frac{X_1^r + X_2^r + \cdots + X_N^r}{N} = \frac{\sum_{j=1}^{N} X_j^r}{N} = \frac{\sum X^r}{N}$$

• This is called the rth moment. The first moment, with r = 1, is the arithmetic mean.

Moments

• The rth moment about the mean is defined as:

$$m_r = \frac{\sum_{j=1}^{N}(X_j - \bar{X})^r}{N} = \frac{\sum (X - \bar{X})^r}{N} = \overline{(X - \bar{X})^r}$$

• If r = 1, m1 = 0.

• If r = 2, m2 = s², the variance.
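A short sketch of the moment definitions: the rth moment about an origin A is the average of (X − A)^r over the N values, and taking A as the mean gives the moment about the mean. The function names and the sample data (the test scores used earlier) are illustrative assumptions.

```python
def moment_about(values, r, origin=0.0):
    """rth moment about `origin`: mean of (X - origin)**r over all N values."""
    n = len(values)
    return sum((x - origin) ** r for x in values) / n


def central_moment(values, r):
    """rth moment about the mean."""
    mean = moment_about(values, 1)  # first moment about zero is the mean
    return moment_about(values, r, origin=mean)


scores = [75, 80, 82, 87, 96]
first_moment = moment_about(scores, 1)
m1 = central_moment(scores, 1)
m2 = central_moment(scores, 2)

print("first moment (mean) =", first_moment)
print("m1 =", m1)
print("m2 (variance with divisor N) =", round(m2, 2))
```

The output illustrates that m1 is zero and m2 equals the variance computed with divisor N.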

Moments

• The rth moment about any origin A is defined as:

$$m'_r = \frac{\sum_{j=1}^{N}(X_j - A)^r}{N} = \frac{\sum d^r}{N} = \frac{\sum (X - A)^r}{N} = \overline{(X - A)^r}$$

• where d = X − A are the deviations of X from A.

PROBABILITY DISTRIBUTION

Probability

• The classical definition of probability:

• Suppose an event E can happen in h ways out of a total of n possible equally likely ways.

• Then the probability of occurrence of the event (called its success) is denoted by:

$$p = \Pr\{E\} = \frac{h}{n}$$

Probability

• The probability of non-occurrence of the event (called its failure) is denoted by:

$$q = \Pr\{\text{not } E\} = \frac{n - h}{n} = 1 - \frac{h}{n} = 1 - p = 1 - \Pr\{E\}$$

• Thus p + q = 1, or Pr{E} + Pr{not E} = 1

Probability Distribution

Discrete probability distribution:

• If a variable X can assume a discrete set of

values X1, X2, …XK with respective probabilities

p1, p2,…,pk where p1 +p2 + ..+ pK = 1;

• We say that a discrete probability distribution for

X has been defined.

• In discrete case, by cumulating probabilities, we

obtain cumulative probability distributions.


• The function p(X), which has the respective values p1, p2, …, pK for X = X1, X2, …, XK, is called the probability function or frequency function of X.

• Because X can assume certain values with given probabilities, it is often called a discrete random variable.

• Random variables are also called chance variables or stochastic variables.


Continuous probability distribution:

• Suppose the variable X may assume a continuous set of values.

• The relative frequency polygon of a sample becomes, in the theoretical or limiting case of a population, a continuous curve such as shown in the figure.

• The equation of the curve is Y = p(X); the total area under the curve bounded by the X axis is equal to one.

[Figure: curve of p(X) against X, with the area between X = a and X = b shaded.]


• The area under the curve between the lines X = a and X = b (shaded in the figure) gives the probability that X lies between a and b, which can be denoted by

$$\Pr\{a < X < b\}$$

• We call p(X) a probability density function.

• The variable X is called a continuous random variable.

Mathematical Expectation

• If p is the probability that a person will

receive a sum of money S, the

mathematical expectation, or simply the

expectation, is defined by pS.

• If the probability that a man wins a RM100 prize is 1/5, his expectation is:

$$\tfrac{1}{5}\,(\text{RM}100) = \text{RM}20$$

Mathematical Expectation

• If X denotes a discrete random variable which can assume the values X1, X2, …, XK with respective probabilities p1, p2, …, pK, where p1 + p2 + … + pK = 1, the mathematical expectation of X, or simply the expectation of X, denoted by E(X), is defined as:

$$E(X) = p_1X_1 + p_2X_2 + \cdots + p_KX_K = \sum_{j=1}^{K} p_jX_j$$
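The expectation of a discrete random variable is the sum of each value multiplied by its probability. The sketch below applies this to the RM100 prize example, treating "no prize" as the value 0 with probability 4/5 (an assumption needed so that the probabilities add up to 1). The function name `expectation` is illustrative, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def expectation(values, probabilities):
    """E(X) = sum of p_j * X_j; the probabilities must add up to 1."""
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probabilities))


# Prize of RM100 with probability 1/5, otherwise nothing.
prize_expectation = expectation([100, 0], [Fraction(1, 5), Fraction(4, 5)])

# A fair six-sided die, as a second illustration.
die_expectation = expectation(range(1, 7), [Fraction(1, 6)] * 6)

print("Expected prize (RM):", prize_expectation)
print("Expected die value:", die_expectation)
```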

Binomial Distribution

• Consider an experiment such as tossing a coin or die repeatedly; each toss or selection is called a trial.

• In any single trial there will be a probability associated with a particular event, such as a head on the coin or a four on the die.

• If this probability does not change from one trial to the next, the trials are said to be independent and are often called Bernoulli trials.

• The binomial distribution is a discrete distribution.


• Let p be the probability that an event will happen in any single Bernoulli trial (called the probability of success).

• Then q = 1 − p is the probability that the event will fail to happen in any single trial (called the probability of failure).


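$$f(x) = P(X = x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$$

$$\text{mean: } \mu = np$$

$$\text{variance: } \sigma^2 = np(1-p)$$

$$\text{standard deviation: } \sigma = \sqrt{np(1-p)}$$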


• Toss a fair coin 100 times, and count the

number of heads that appear. Find the

mean, variance, and standard deviation of

this experiment.

• In 100 tosses of a fair coin, the expected

or mean number of heads is μ = (100)(0.5)

= 50

• Variance σ2 = 100(0.5)(0.5) = 25

• Std deviation σ = √(100)(0.5)(0.5) = 5
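The coin-toss example can be reproduced numerically. This sketch assumes the standard binomial probability function f(x) = C(n, x) p^x q^(n−x) with q = 1 − p, and uses `math.comb` (Python 3.8 or later) for the binomial coefficient; the function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 100, 0.5                 # 100 tosses of a fair coin
mean = n * p                    # mu = np
variance = n * p * (1 - p)      # sigma^2 = np(1 - p)
std_dev = math.sqrt(variance)   # sigma

# The probabilities over x = 0..n should add up to 1.
total_probability = sum(binomial_pmf(x, n, p) for x in range(n + 1))

print("mean =", mean, "variance =", variance, "std deviation =", std_dev)
print("sum of probabilities =", round(total_probability, 6))
```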

Poisson Distribution

• Discrete distribution.

• Let X be a discrete random variable that can take on the values 0, 1, 2, … such that the probability function of X is given by

$$f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$

• where λ is a given positive constant.


• A random variable having this distribution is said to be Poisson distributed.

• The values of the Poisson distribution can be obtained using a table (available in statistics textbooks), which gives values of e^(−λ) for various values of λ.


$$\text{mean: } \mu = \lambda$$

$$\text{variance: } \sigma^2 = \lambda$$

$$\text{standard deviation: } \sigma = \sqrt{\lambda}$$
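Instead of a printed table, the Poisson probabilities can be computed directly. The sketch below assumes the standard Poisson probability function f(x) = λ^x e^(−λ) / x! and checks numerically that the mean and variance both come out close to λ. The value λ = 2 and the cut-off of 60 terms are arbitrary choices for illustration.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distributed variable with parameter lam > 0."""
    return lam**x * math.exp(-lam) / math.factorial(x)


lam = 2.0          # hypothetical value of lambda
terms = range(60)  # the tail beyond this is negligible for lam = 2

total = sum(poisson_pmf(x, lam) for x in terms)
mean = sum(x * poisson_pmf(x, lam) for x in terms)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in terms)

print("sum of probabilities =", round(total, 6))
print("mean =", round(mean, 6), "variance =", round(variance, 6))
```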

Normal Distribution

• One of the most important examples of a

continuous probability distribution is the

normal distribution.

• Sometimes called the Gaussian

distribution.

• It is very important and comes up quite often in practice.

Normal Distribution

• The density function for this distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad -\infty < x < \infty$$

• where μ = mean; σ = std deviation; π = 3.14159…; e = 2.71828…

Normal Distribution

• The total area bounded by the curve

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

and the X axis is one.

• Hence the area under the curve between two ordinates X = a and X = b, where a < b, represents the probability that X lies between a and b, denoted by Pr{a < X < b}.

Normal Distribution

• The corresponding distribution function is given by:

$$F(x) = P(X \le x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-(v-\mu)^2/(2\sigma^2)}\, dv$$

• If X has the distribution function above, then we say that the random variable X is normally distributed with mean μ and variance σ².

Normal Distribution

• If we let Z be the random variable corresponding to the following:

$$Z = \frac{X - \mu}{\sigma}$$

• Then Z is called the standard variable corresponding to X. The mean or expected value of Z is 0 and the std deviation is 1.

Normal Distribution

• The density function for Z can be obtained from the definition of a normal distribution by allowing μ = 0 and σ² = 1:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

• This is often referred to as the standard normal density function.

Normal Distribution

• The corresponding distribution is given by:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

• We sometimes call the value z of the standardized variable Z the standard score.

• A graph of the standard normal density function is sometimes called the standard normal curve.

Normal Distribution

• The standard normal curve indicates the areas within 1, 2, and 3 standard deviations of the mean.

• That is, the areas between z = −1 and +1, z = −2 and +2, and z = −3 and +3 are equal, respectively, to 68.27%, 95.45% and 99.73% of the total area, which is one. This means that:

$$P(-1 \le Z \le 1) = 0.6827$$

$$P(-2 \le Z \le 2) = 0.9545$$

$$P(-3 \le Z \le 3) = 0.9973$$
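These three areas can be verified without a table. For a standard normal variable, the area between −k and +k equals erf(k/√2), where erf is the error function available as `math.erf`. The function name `area_within` is illustrative.

```python
import math


def area_within(k):
    """Area under the standard normal curve between z = -k and z = +k."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"P(-{k} <= Z <= {k}) = {area_within(k):.4f}")
```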

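Standard Normal Curve

[Figure: standard normal curve f(z) against z from −3 to 3 (vertical axis 0.1 to 0.4), with the areas 68.27%, 95.45% and 99.73% marked within 1, 2 and 3 standard deviations of the mean.]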

Normal Distribution

• A table giving the areas under the curve bounded by the ordinates at z = 0 and any positive value of z is available in most statistics textbooks.

• From this table the areas between any two

ordinates can be found by using the

symmetry of the curve about z = 0.

Normal Distribution

• Approximately 68% of the area under any normal distribution curve lies within one standard deviation of the mean.

• Approximately 95% of the area under any normal distribution curve lies within two standard deviations of the mean.

• Approximately 99.7% of the area under any normal distribution curve lies within three standard deviations of the mean.

Normal Distribution

•Total area under the curve = 1.0 or 100%

• The area under the curve :

within 1 std. deviation = 0.68 or 68%;

within 2 std deviation = 95%

within 3 std deviation = 99.7%

Normal Distribution

• A standard normal distribution is a normal

distribution with zero mean and one unit

variance , given by the probability function and

distribution function

POPULATION & SAMPLE

Population and Sample

• Often in practice we are interested in drawing valid conclusions about a large group of individuals or objects.

• Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample.

• The process of obtaining samples is called sampling.

Population and Sample

• Statistical inference is drawing conclusions from sample data about the larger populations from which the samples are drawn.

• A population is the whole set of measurements or counts about which we want to draw a conclusion.

• A sample is a subset of the population, a set of some of the measurements or counts which comprise the population.

Sampling

• If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again.

• In the first case a particular object can come up again and again, whereas in the second it can come up only once.

Sampling

• Sampling where each member of a population

may be chosen more than once is called

sampling with replacement.

• Sampling where each member cannot be

chosen more than once is called sampling

without replacement.

• For practical purposes, sampling from a finite population that is very large can be considered as sampling from an infinite population.
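The difference between the two kinds of sampling can be shown with Python's `random` module: `choices` draws with replacement, while `sample` draws without replacement. The population of ten numbered objects and the seed are arbitrary choices for illustration.

```python
import random

population = list(range(1, 11))  # ten numbered objects in the urn
rng = random.Random(42)          # fixed seed so the draws are repeatable

# With replacement: the same object may come up more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each object can come up at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```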

Random samples

• For a finite population: make sure that each member of the population has the same chance of being in the sample; such a sample is called a random sample.

• Random sampling can be accomplished

for relatively small populations by drawing

lots, or equivalently, by using a table of

random numbers specially constructed for

such purposes.

Random samples

• Because inference from sample to

population cannot be certain, we must use

the language of probability in any

statement of conclusions.

Population parameters

• One important problem of statistical inference is

the estimation of population parameters or

briefly parameters (such as population mean,

variance etc.) from the corresponding sample

statistics or briefly statistics (i.e. sample mean,

variance, etc).

• If the mean of the sampling distribution of a

statistic equals the corresponding population

parameter, the statistic is called an unbiased

estimator of the parameter, otherwise it is called

a biased estimator.


• If the sampling distributions of two statistics have

the same mean (or expectation), the statistic

with smaller variance is called an efficient

estimator of the mean while the other statistic is

called an inefficient estimator.

• If we consider all possible statistics whose

sampling distributions have the same mean, the

one with the smallest variance is sometimes

called the most efficient or best estimator of this

mean.


• An estimate of a population parameter given by

a single number is called a point estimate of the

parameter.

• An estimate of a population parameter given by

two numbers between which the parameter may

be considered to lie is called an interval estimate

of the parameter.

• Interval estimates indicate the precision or

accuracy of an estimate and are therefore

preferable to point estimates.


• A population is considered to be known when

we know the probability distribution f(x) of the

associated random variable X.

• If X is normally distributed, we say the

population is normally distributed or that we

have a normal population.

• If X is binomially distributed, we say that the

population is binomially distributed or that we

have a binomial population.

Sample Statistics

• We can take random samples from the population and then use these samples to obtain values that serve to estimate and test hypotheses about the population parameters.

• For example, we wish to draw conclusions about the heights of 12,000 adult students by examining only 100 students selected from the population.

• In this case, X can be a random variable whose values are the various heights.

Standard error

• The standard deviation of a sampling distribution

of a statistic is often called its standard error.

• If the sample size N is large enough (N ≥ 30), the sampling distributions are normal or nearly normal. For this reason the methods are known as large sampling methods.

• When N < 30, samples are called small, and the theory of small samples, or exact sampling theory, is used.

Confidence interval

• Confidence interval estimates of population

parameters.

• Let μs & σs be the mean and std deviation of the

sampling distribution of a statistic S.

• If the sampling distribution of S is approximately

normal (n ≥ 30), we can expect S to lie in the interval:

μs – σs to μs + σs : 68.27% of the time

μs – 2σs to μs + 2σs : 95.45% of the time

μs – 3σs to μs + 3σs : 99.73% of the time

Confidence interval

• Equivalently, we can expect to find, or we can be

confident of finding, μs in the intervals:

S – σs to S + σs : 68.27% confidence interval

S – 2σs to S + 2σs : 95.45% confidence interval

S – 3σs to S + 3σs : 99.73% confidence interval

(i.e. for estimating the population parameter, in the case of

an unbiased S)

• The end points of these intervals are the confidence limits:

S ± σs : 68.27% confidence limits

S ± 2σs : 95.45% confidence limits

S ± 3σs : 99.73% confidence limits

Confidence level

Confidence level      99.73%   99%    98%    96%    95.45%   95%    90%     80%    68.27%   50%
Zc (critical value)   3.00     2.58   2.33   2.05   2.00     1.96   1.645   1.28   1.00     0.6745

S ± 1.96σs : 95% or 0.95 confidence level

S ± 2.58σs : 99% or 0.99 confidence level
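
The critical values in the table can be reproduced from the standard normal distribution. The sketch below is illustrative only: the function names and the sample figures (mean, standard error inputs) are assumptions, not values from the lecture. It uses Python's standard `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value Zc for a confidence level such as 0.95."""
    tail_area = (1.0 - confidence) / 2.0
    return NormalDist().inv_cdf(1.0 - tail_area)


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Large-sample limits: sample_mean ± Zc * sigma / sqrt(n)."""
    half_width = z_critical(confidence) * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical numbers, for illustration only.
low, high = mean_confidence_interval(sample_mean=67.45, sigma=2.93, n=100)
print(f"Zc(95%) = {z_critical(0.95):.3f}")
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

Here sigma / sqrt(n) plays the role of the standard error σs of the sample mean; for small samples the t distribution would replace `NormalDist`.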

Confidence interval

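• For small samples (n < 30), use the t distribution

(table) to obtain confidence limits.

• For example, if –t0.975 and t0.975 are the values of

T for which 2.5% of the area lies in each tail of

the t distribution, then a 95% confidence interval

for T is given by:

$$-t_{0.975} < \frac{(\bar{X} - \mu)\sqrt{n}}{\hat{S}} < t_{0.975}$$

$$\mu = \bar{X} \pm t_c \frac{\hat{S}}{\sqrt{n}} \quad \text{[in general terms]}$$

(Ŝ denotes the sample standard deviation and t_c the critical value for the chosen confidence level.)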

The t-distribution

• The normal distribution is the well-known

bell-shaped distribution whose mean is μ

and standard deviation is σ.

• The t-distribution has a basic bell shape

with an area of 1 under it, but shorter and

flatter than a normal distribution.

• The standard deviation for t-distribution is

proportionally larger compared to the

standard normal, Z-distribution

The t-distribution

• Each t-distribution is distinguished by the

term degrees of freedom.

• If the sample size n = 10, the degrees of

freedom for corresponding t-distribution is

n-1= 10 – 1 = 9 degrees of freedom = t9.

• Smaller sample sizes have flatter t-distributions

than larger sample sizes.

• For larger sample sizes the t-distribution approaches

the standard normal Z-distribution.

Frequency distribution

• If a sample (or even a population) is large,

it is difficult to observe the various

characteristics or to compute statistics

such as mean or standard deviation.

• For this reason it is useful to organize or

group the raw data.


• Suppose that a sample consists of the

heights of 100 male students at XYZ

University.

• We arrange data into classes or

categories, and determine the number of

individuals belonging to each class, called

the class frequency.


Height (inches)    Number of students
60-62              5
63-65              18
66-68              42
69-71              27
72-74              8
Total              100
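
Once the data are grouped, the mean and standard deviation can be approximated from the class midpoints and class frequencies. The following is a minimal sketch using the height table above; the function and variable names are illustrative assumptions.

```python
from math import sqrt

# (lower class limit, upper class limit, class frequency) from the height table.
classes = [(60, 62, 5), (63, 65, 18), (66, 68, 42), (69, 71, 27), (72, 74, 8)]


def grouped_mean_std(classes):
    """Approximate mean and standard deviation from class midpoints."""
    total = sum(freq for _, _, freq in classes)
    marks = [((low + high) / 2, freq) for low, high, freq in classes]
    mean = sum(mark * freq for mark, freq in marks) / total
    variance = sum(freq * (mark - mean) ** 2 for mark, freq in marks) / total
    return mean, sqrt(variance)


mean_height, std_height = grouped_mean_std(classes)
print(f"mean = {mean_height:.2f} in, standard deviation = {std_height:.2f} in")
```

The variance here divides by the total frequency; dividing by (total - 1) instead would give the unbiased sample estimate.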

HYPOTHESIS TESTS

Hypothesis testing

• Hypothesis testing is a statistician’s way of trying

to confirm or deny a claim about a population

using data from a sample.

• A hypothesis is a conjecture about a population

parameter.

• Hypothesis testing is a process of using sample

data and statistical procedures to decide

whether to reject or not reject a hypothesis

(statement) about a population parameter value.

Hypothesis testing

• Because parameters tend to be unknown

quantities, everyone wants to make claims about

what their values may be.

• This conjecture may or may not be true.

• The null hypothesis (Ho) always states the

population parameter is equal to the claimed

value.

• If the null hypothesis is rejected, the alternative

hypothesis (Ha or H1) is what we conclude instead.

Hypothesis testing

• Decide on null hypothesis, H0.

• Decide on an alternative hypothesis, H1

• Decide on a significance level.

• Calculate the appropriate test statistic, using the sample

data.

• Find from tables the appropriate tabulated test statistic.

• Compare the calculated and tabulated test statistics, and

decide whether to reject the null hypothesis, H0.

• State a conclusion, after checking to see whether the

assumptions required for the test in question are valid.
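
The steps above can be sketched for a one-sample z-test as follows. This is an illustration under stated assumptions: the function name, the `tail` argument and all numbers are hypothetical, and the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """One-sample z-test of H0: mu = mu0.

    tail is "two", "right" or "left".
    Returns (calculated z, critical value, reject_h0).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))  # calculated test statistic
    std_normal = NormalDist()
    if tail == "two":
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    elif tail == "right":
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z > critical
    elif tail == "left":
        critical = std_normal.inv_cdf(alpha)
        reject = z < critical
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, critical, reject


# Hypothetical sample: 36 observations with mean 84, tested against mu0 = 82.
z, critical, reject = z_test(sample_mean=84.0, mu0=82.0, sigma=6.0, n=36)
print(f"z = {z:.2f}, critical = ±{critical:.2f}, reject H0: {reject}")
```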

Hypothesis testing

• The null hypothesis H0, generally

expresses the idea of no difference.

• The alternative hypothesis, which we

denote by H1, expresses the idea of some

difference.

• Alternative hypothesis may be one-sided

(greater or less than) or two-sided (not

equal to).

Critical values of Z

Level of significance, α                     0.10               0.05               0.01             0.005            0.002
Critical values of Z for one-tailed tests    -1.28 or 1.28      -1.645 or 1.645    -2.33 or 2.33    -2.58 or 2.58    -2.88 or 2.88
Critical values of Z for two-tailed tests    -1.645 and 1.645   -1.96 and 1.96     -2.58 and 2.58   -2.81 and 2.81   -3.08 and 3.08

Level of significance

[Figure: normal curve over z with a central acceptance region and a rejection region in each tail]

Total shaded area is called the level of significance

of the decision rule: two-tailed test

Hypothesis Example

• Situation A:

• A researcher is interested in finding out whether a new

medicine will have any undesirable side effects on the

pulse rate of the patient. Will the pulse rate increase,

decrease or remain unchanged? Since the researcher

knows the pulse rate of the population under study is 82

beats per minute, the hypothesis will be

Ho : μ = 82 (remain unchanged)

H1 : μ ≠ 82 (will be different)

• This is a two-tailed test since the possible effect

could be to raise or lower the pulse

Hypothesis Example

• Situation B:

• A chemist invents an additive to increase the life of an

automobile battery. The mean lifetime of an ordinary

battery is 36 months. The hypothesis will be:

Ho : μ ≤ 36

H1 : μ > 36

• The chemist is interested only in increasing the lifespan

of the battery. His alternative hypothesis is that the mean

is larger than 36. Therefore the test is right-tailed:

only an increase is of interest.

Hypothesis Example

• Situation C:

• A contractor wishes to lower heating bills by using

a special type of insulation in houses. If the

average monthly bill is RM100, his hypothesis

will be:

Ho : μ ≥ RM 100

H1 : μ < RM 100

• This is a left-tailed test since the contractor

is only interested in reducing the bill

Test of significance

• A z-test is used for testing the mean of a

population versus a standard, or comparing the

means of two populations, with large (n ≥ 30)

samples whether you know the population

standard deviation or not.

• It is also used for testing the proportion of some

characteristic versus a standard proportion, or

comparing the proportions of two populations.

• A significance level of 5% is the risk we take of

rejecting the null hypothesis when it is in fact true.


• A t-test is used for testing the mean of one

population against a standard or comparing the

means of two populations if you do not know the

populations’ standard deviation and when you

have a limited sample (n < 30).

• If you know the populations’ standard deviation,

you may use a z-test.

• Example: Measuring the average diameter of

shafts from a certain machine when you have a

small sample.


• An F-test is used to compare 2 populations’

variances. The samples can be any size. It is the

basis of ANOVA.

• Example: Comparing the variability of bolt

diameters from two machines.

Chi-square goodness of fit test

• The chi-square statistic, denoted χ2, provides a

good test of how well a hypothesised distribution

fits the observed one.

• The observed data are grouped into class

intervals, each with an observed frequency, O.

• For a given set of observations, a distribution

of any type can be hypothesised on the basis of

the histogram shape.


• For each class of the grouped data, the

expected frequency can be estimated on the

basis of the hypothesised distribution.

• This is done by multiplying the probability that

the hypothesised distribution assigns to each

class interval by the number of data, n, to

obtain the expected frequency, E.

• The χ2 contribution of each class can then be

computed using the chi-square formula.


• The χ2 values of all classes are then summed.

• The hypothesis is checked by comparing the

calculated χ2 with the critical value of the χ2 statistic

from the chi-square table.

• If the calculated value exceeds the critical value,

the proposed distribution is rejected.

• The critical χ2 value from the table depends on the

level of significance and the degrees of freedom.

Estimated Chi-square

A measure of the discrepancy between observed (O)

and expected (E) frequencies is given by chi-square:

$$\chi^2 = \sum_j \frac{(O_j - E_j)^2}{E_j}$$

If chi-square is zero: observed & theory agree exactly.

If chi-square is greater than zero, they do not agree exactly.
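
As a small illustration, the chi-square statistic sums (O - E)² / E over the classes. The sketch below uses hypothetical counts (120 rolls of a die tested against a uniform distribution); the function name and data are assumptions, not part of the lecture.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 rolls of a die, uniform hypothesis gives E = 20 per face.
observed = [25, 17, 15, 23, 24, 16]
expected = [20] * 6

statistic = chi_square(observed, expected)
critical = 11.07  # chi-square table value for alpha = 0.05 and 5 degrees of freedom
print(f"chi-square = {statistic:.2f}, reject hypothesis: {statistic > critical}")
```

With six classes and no estimated parameters there are 6 - 1 = 5 degrees of freedom, which is why the 5-degree critical value is used in this example.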

Shapiro-Wilk test

Test of normality

Shapiro-Wilk test

• The Shapiro–Wilk test is a test of normality.

• The Shapiro–Wilk test utilizes the null

hypothesis principle to check whether a

sample x1, ..., xn came from a normally

distributed population.

• Empirical testing has found that Shapiro–Wilk

has the best power for a given significance,

followed closely by Anderson–Darling when

comparing the Shapiro–Wilk, Kolmogorov-

Smirnov, Lilliefors and Anderson-Darling tests.

Shapiro-Wilk test

• The null hypothesis of this test is that the

population is normally distributed.

• Thus if the p-value is less than the chosen

alpha level, then the null hypothesis is

rejected and there is evidence that the

data tested are not from a normally

distributed population.

• In other words, the data are not normal.

Shapiro-Wilk test

• On the contrary, if the p-value is greater than the chosen

alpha level, then the null hypothesis that the data came

from a normally distributed population cannot be

rejected.

• Example: for an alpha level of 0.05, a data set with a p-

value of 0.02 rejects the null hypothesis that the data are

from a normally distributed population.

• However, since the test is sensitive to sample size, it

may report a statistically significant departure from a

normal distribution in any large sample.

• Thus a Q-Q plot is required for verification in addition to

the test.

Q-Q plot

• In statistics, a Q–Q plot ("Q" stands for quantile)

is a probability plot, which is a graphical

method for comparing two probability

distributions by plotting their quantiles against

each other.

• If the two distributions being compared are

similar, the points in the Q–Q plot will

approximately lie on the line y = x. If the

distributions are linearly related, the points in the

Q–Q plot will approximately lie on a line.
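
The coordinates of a normal Q-Q plot can be computed without a plotting library. The sketch below pairs each ordered observation with the matching quantile of a normal distribution fitted to the sample. The plotting position (i - 0.5) / n is one common convention chosen here as an assumption, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, ordered observation) pairs."""
    n = len(sample)
    ordered = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n for i = 1..n (an assumed convention).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical measurements, for illustration only.
data = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.7, 3.1]
for theoretical, observed in normal_qq_points(data):
    print(f"{theoretical:6.3f}  {observed:6.3f}")
```

If the sample is close to normal, the printed pairs fall near the line y = x.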

Q-Q plot

[Figures: example Q-Q plots]

CURVE FITTING

Curve fitting

• The general problem of finding equations

of approximating curves which fit given

sets of data is called curve fitting.

• Linear relationship – straight line

• Non-linear relationship – curve

Curve fitting

• $Y = a_0 + a_1 X$ : straight line

• $Y = a_0 + a_1 X + a_2 X^2$ : parabola/quadratic

• $Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3$ : cubic curve

• $Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + a_4 X^4$ : quartic curve

• $Y = a_0 + a_1 X + a_2 X^2 + \dots + a_n X^n$ : nth degree curve
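
As an illustration of the simplest case, the straight line Y = a0 + a1 X can be fitted by least squares. Least squares is assumed here as the fitting criterion (the slides do not name one at this point), and the function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx              # slope
    a0 = mean_y - a1 * mean_x   # intercept
    return a0, a1


# Hypothetical data lying exactly on Y = 1 + 2X.
a0, a1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"Y = {a0:.2f} + {a1:.2f} X")
```

Higher-degree polynomials follow the same idea but require solving a system of normal equations, one per coefficient.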

Curve fitting

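• $Y = \frac{1}{a_0 + a_1 X}$ or $\frac{1}{Y} = a_0 + a_1 X$ : hyperbola

• $Y = a b^X$ : exponential curve

• $Y = a X^b$ : geometric curve

• $Y = \frac{1}{a b^X + g}$ or $\frac{1}{Y} = a b^X + g$ : logistic curve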

[Figure: Raw data & fitted curve]

[Figure: Polynomial curve fit]

[Figure: Curve fitting & distribution fitting]

[Figure: Curve fitting & confidence interval]

Multiple Regression Analysis

• The multiple regression test is used to

identify change in two or more factors

(independent variables) which contribute to

change in a dependent variable.

• There are three types of multiple regression

procedures; the backward solution, forward

solution and stepwise solution.

• Stepwise has an advantage over the others.

Backward Solution

• This procedure is also known as the full

multiple regression model because every

predictor variable is initially entered into the

regression model.

• The variables which do not contribute

significantly to the regression model will

only be removed later.

Forward Solution

• The predictor variable is entered into the

regression model according to its

contribution to the regression.

• The first variable selected to be entered into

the model has the highest correlation with

the criterion variable.

• Selection of further predictor variables continues

until no remaining predictor variable contributes

a significant change.

Stepwise Solution

• This is a variation of the forward solution.

• The procedure for selecting predictor

variables is similar to the forward solution

except that after each predictor variable is

selected, a second significance test is

conducted to determine the contribution of

each predictor variable already in the model.

Multiple Regression Analysis

$$\hat{Y} = b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k + a$$

where

$\hat{Y}$ is the predicted criterion variable

X is the predictor variable

b is the regression coefficient for each

predictor variable

a is the regression constant

Correlation theory

• Correlation is the degree of relationship

between variables, which seek to

determine how well a linear or other

equation describes or explains the

relationship between variables.

• If all values of the variables satisfy an equation exactly: perfectly correlated.

• If no relationship: uncorrelated.

Correlation theory

• If only two variables are involved: simple

correlation and simple regression.

• If more than two variables are involved:

multiple correlation and multiple

regression.

Correlation theory

• The correlation is called linear if all points

in the scatter diagram seem to lie near a

line.

• A linear equation is appropriate for

purposes of regression or estimation.

• If Y tends to increase as X increases: the

correlation is called positive or direct

correlation.

Correlation theory

• If Y tends to decrease as X increases: the

correlation is called negative or inverse

correlation.

• If all points seem to lie near some curve, the

correlation is called non-linear and a non-linear

equation is appropriate for regression or

estimation.

• The non-linear correlation can be sometimes

positive or sometimes negative.

Explained & Unexplained variation

• The total variation of Y is given by:

Total variation =

unexplained variation + explained variation

$$\sum (Y - \bar{Y})^2 = \sum (Y - Y_{est})^2 + \sum (Y_{est} - \bar{Y})^2$$

Coefficient of Correlation

• The ratio of the explained variation to the

total variation is called the coefficient of

determination.

• The quantity r, called the coefficient of

correlation, is given by:

$$r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\frac{\sum (Y_{est} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$$
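
The split of the total variation into explained and unexplained parts, and the resulting coefficients, can be checked numerically. The sketch below assumes Y_est comes from a least-squares straight line (the split holds exactly in that case); the data and names are hypothetical.

```python
from math import copysign, sqrt


def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained, r) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ys_est = [intercept + slope * x for x in xs]

    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((ye - mean_y) ** 2 for ye in ys_est)
    unexplained = sum((y - ye) ** 2 for y, ye in zip(ys, ys_est))
    # r takes the sign of the slope: positive or negative correlation.
    r = copysign(sqrt(explained / total), slope)
    return total, explained, unexplained, r


# Hypothetical data, for illustration only.
total, explained, unexplained, r = variation_breakdown([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"total = {total:.2f}, explained = {explained:.2f}, unexplained = {unexplained:.2f}")
print(f"coefficient of determination = {explained / total:.2f}, r = {r:.3f}")
```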

Rank Correlation

• Instead of using precise values of the

variables, or when such precision is

unavailable, the data may be ranked in

order of size, importance, etc. using the

numbers 1, 2, 3, ..., N.

Rank Correlation

• If two variables X and Y are ranked in such a

manner, the coefficient of rank correlation is

given by (Spearman’s formula for rank

correlation):

$$r_{rank} = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$

D = differences between ranks of corresponding values

of X & Y.

N = number of pairs of values (X,Y) in the data
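
Spearman's formula, r_rank = 1 - 6 ΣD² / (N (N² - 1)), can be sketched as below. The helper names and data are hypothetical, and the ranking helper assumes there are no tied values (ties would need averaged ranks).

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rank_correlation(xs, ys):
    """1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = rank difference per pair."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical paired observations, for illustration only.
x = [10.2, 11.5, 9.8, 12.1, 10.9]
y = [3.1, 3.9, 2.7, 3.6, 3.3]
print(f"r_rank = {spearman_rank_correlation(x, y):.2f}")
```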

Correlation Tests

• Inferential research is conducted to

describe the characteristics of the

research subjects by identifying the

relationship between the dependent and

independent variables.

• The dependent variable is the effect; the

independent variable is the factor which

causes or effects a change in the

dependent variable.

Correlation Tests

There are 3 steps to determine relationship

between variables:

1. Identify the dependent and independent

variables in the relationship.

2. Determine the measurement for variables

in the relationship.

3. Conduct an analysis of the relationship

between variables.

Correlation Tests

• The relationship between variables is

known as correlation and the strength of a

correlation is represented by the correlation

coefficient in the correlation test.

• There are various types of correlation tests

as shown in table.

Correlation Tests

• The standard coefficient of the relationship

between two variables is the Pearson

product-moment correlation coefficient.

• The Spearman’s rho test is a non-parametric

test. It is used to analyse data which are not

normally distributed. Two sets of data that are

not normally distributed may not correlate linearly.

Correlation Tests

• The Spearman’s rho test is conceptually

similar to the Pearson r test.

• However, the Pearson r test is used to

identify correlation between two sets of

interval or ratio scale data while the

Spearman’s rho test is used to analyse

correlation between two sets of ordinal

scale data.

Correlation Tests

• In some cases, the data collected from a

sample is not ordinal, interval or ratio scale

data; instead, it is nominal scale data.

• The two correlation tests (Pearson r and

Spearman’s rho) are not suitable for

analysing nominal scale data.

Correlation Tests

• Correlation between two nominal scale

variables can be analysed by using the

Cramer’s V test.

• It is calculated based on the chi-square

value.

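$$V = \sqrt{\frac{\chi^2}{N(k - 1)}}$$

where N is the total number of observations and k is the smaller of the number of rows and the number of columns of the contingency table.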
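
A minimal sketch of Cramer's V computed from a contingency table of two nominal variables: the chi-square value is obtained from observed and expected cell counts, then scaled as V = sqrt(χ² / (N (k - 1))) with k the smaller table dimension. The function name and the table are hypothetical.

```python
from math import sqrt


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical 2 x 3 table of counts, for illustration only.
table = [[12, 8, 10], [6, 14, 10]]
print(f"Cramer's V = {cramers_v(table):.3f}")
```

V ranges from 0 (no association) to 1 (perfect association).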

Type of Correlation Tests

Correlation test                     Type of measurement
Pearson product-moment coefficient   It states the relationship between variables using the interval and ratio scales.
Point-biserial coefficient           It states the relationship between an interval or ratio scale variable and a nominal scale variable.
Spearman’s rho or eta coefficient    It states the relationship between variables when the distribution of data is not normal and where both variables are in ordinal scale, arranged according to rank.
Biserial coefficient                 It is similar to the point-biserial coefficient, where one of the variables is measured in the interval or ratio scale whereas the other variable is in the ordinal scale.
Tetrachoric coefficient              It is similar to the Phi coefficient, which states the relationship of variables in the nominal scale. The difference is that this coefficient is used when the researcher estimates that both variable scales have ranking and the data distribution is normal.
Cramer, Phi and Lambda coefficient   Used when variables are in the nominal scale and each variable has more than two categories.
Rank-biserial coefficient            It is similar to the point-biserial coefficient, where one variable in the relationship is in the nominal scale and the other variable is in the ordinal scale.

The Strength of coefficient, r

Correlation coefficient (r) Correlation strength

0.91 – 1.00 Very strong

0.71 – 0.90 Strong

0.51 – 0.70 Average/medium

0.31 – 0.50 Weak

0.01 – 0.30 Very weak

0.00 No correlation

Homogeneity of Variance

• Certain tests (e.g. ANOVA) require that the

variances of different populations are equal.

• This can be determined by the following

approaches:

1. Comparison of graphs (esp. box plots)

2. Comparison of variance, standard deviation

and IQR statistics

3. Statistical tests


• The F test presented in Two Sample Hypothesis

Testing of Variances can be used to determine

whether the variances of two populations are

equal.

• For three or more variables the following

statistical tests for homogeneity of variances are

commonly used:

1. Levene’s test

2. Fligner Killeen test

3. Bartlett’s test


• Ways of dealing with models where the

variances are not sufficiently homogeneous (i.e.

heterogeneous):

1. Non-parametric test: Kruskal-Wallis

2. Modified tests: Brown-Forsythe and Welch’s

ANOVA test

3. Transformations (square root, logarithmic)

Outliers

• The following are ways of identifying the presence of

outliers:

1. Side by side plotting of the raw data

(histograms and box plots).

2. Examination of residuals.

Residuals for Levene’s test,

$e_{ij} = x_{ij} - \bar{x}_j$

Outliers

• The residual is a measure of how far away an

observation is from its group mean value (our

best guess of the value).

• If an observation has a large residual, we

consider it a potential outlier.

• To determine how large a residual must be to be

classified as an outlier we use the fact that if the

population is normally distributed, then the

residuals are also normally distributed with

distribution $e_{ij} \sim N(0, MS_W)$.

TIME SERIES

Time Series

• A time series is a set of observations taken at

specified times, usually at equal intervals.

• Time series has certain characteristic

movements or variations.

• The analysis has great value in the problem of

forecasting future movement.

• Many industries & governmental agencies are

concerned with this analysis.

Analysis of Time Series

• A time series is a sequence of data points,

typically consisting of successive

measurements made over a time interval.

• Examples of time series are ocean tides &

rainfall.

• Time series are very frequently plotted via

line charts.

Analysis of Time Series

• Time series are used in pattern

recognition, weather forecasting,

earthquake prediction, econometrics,

mathematical finance, intelligent transport

forecasting, astronomy and largely in any

domain of applied science and

engineering which involves temporal

measurements.

Analysis of Time Series

• Methods for time series analyses may be

divided into two classes: frequency-

domain methods and time-domain

methods.

• The former include spectral analysis and

recently wavelet analysis; the latter include

auto-correlation and cross-correlation

analysis.


• Time series analysis techniques may be

divided into parametric and non-

parametric.

• Methods of time series analysis may also

be divided into linear and non-linear, and

univariate and multivariate.


• The parametric approaches assume that

the underlying stationary stochastic

process has a certain structure which can

be described using a small number of

parameters (for example, using an

autoregressive or moving average model).

• In these approaches, the task is to

estimate the parameters of the model that

describes the stochastic process.


• By contrast, non-parametric approaches

explicitly estimate the covariance or the

spectrum of the process without assuming

that the process has any particular

structure.

Classification of Time Series

1. Long term or secular movement or long

term trend.

2. Cyclical movements or cyclical variations.

3. Seasonal movements or seasonal

variations.

4. Irregular or random movements.

Secular Movement

• The increase or decrease in the

movements of a time series is called

secular trend.

• A time series data may show upward trend

or downward trend for a period of years

and this may be due to factors like

increase in population, change in

technological progress, shift in consumer

demands.

Cyclical Movement

• Cyclical variations are recurrent upward or

downward movements in a time series but

the period of cycle is greater than a year.

• Also, these variations are not as regular as

seasonal variations.

• Example: A business cycle showing these

oscillatory movements has to pass through

four phases: prosperity, recession,

depression, recovery.

Seasonal Variation

1. Seasonal variations are short term

fluctuation in a time series which occur

periodically in a year

2. These continue to repeat year after year.

3. The major factors are climate condition

and customs of people for example more

woolen clothes are sold in winter and

more ice-creams are sold in summer.

Irregular or Random Movement

1. Irregular variations are fluctuations in time series

that are short in duration, erratic in nature and

follow no regularity in the occurrence pattern.

2. These variations are also referred to as residual

variations since by definition they represent what

is left out in time series after trend, cyclical and

seasonal variations.

3. Irregular fluctuations result from the

occurrence of unforeseen events such as floods

and earthquakes.

[Figure: Long term trend (No. of students vs. Time)]

[Figure: Long term trend & cyclical movement (Climate parameter vs. Time)]

[Figure: Upward trend]

[Figure: Irregular time series]

[Figure: Seasonal & irregular]

[Figure: Multiplicative - seasonal fluctuation varies]

Time Series

• A time series is a series of data points indexed

(or listed or graphed) in time order.

• Most commonly, a time series is a sequence

taken at successive equally spaced points in

time.

• Thus it is a sequence of discrete-time data.

Examples of time series are heights of ocean

tides, rainfall, streamflow, sediment, and daily

traffic flow on the roadway.

Time Series

• Time series are very frequently plotted via line

charts.

• Time series are used in pattern recognition,

weather forecasting, intelligent transport,

earthquake prediction.

• Time series are used largely in any domain of

applied science and engineering which involves

temporal measurements.

Time Series

1. A time series typically consists of a set of

observations on a variable, y, taken at

equally spaced intervals over time.

2. There are two aspects to the study of time

series: Analysis and Modelling.

Time Series Analysis

o The aim of analysis is to summarise the

properties of a series and to characterize

its salient features.

o This may be done either in the time domain or

in the frequency domain.

Time Series Analysis

o In the time domain attention is focused on

the relationship between observations at

different points in time, while in the

frequency domain it is cyclical movements

which are studied.

o Time series analysis comprises methods

for analyzing time series data in order to

extract meaningful statistics and other

characteristics of the data

Time Series Modelling

1. The main reason for modelling a time series is to

enable forecasts of future values to be made.

2. The movements in yt are explained solely in

terms of its own past, or by its position in relation

to time.

3. Forecasts are then made by extrapolation.

4. Time series forecasting is the use of a model to

predict future values based on previously

observed values.

Components of time series

o A time series is essentially composed of

the following four components:

1. Trend

2. Seasonality

3. Cycle

4. Residuals

Trend

o The trend can usually be detected by

inspection of the time series.

o It can be upward, downward or constant,

depending on the slope of the trend-line.

o The equation of the trend-line is

actually the equation of the regression line

of y(t) on t.

Seasonality

o The seasonal factor can easily be detected

from the graph of the time series.

o It is usually represented by peaks and

troughs occurring at regular time intervals,

suggesting that the variable attains maxima

and minima.

o The time interval between any two

successive peaks or troughs is known as

the period.

Cycle

o A cycle resembles a season except that its

period is usually much longer.

o Cycles occur as a result of changes of

qualitative nature, that is, changes in taste,

fashion and trend for example.

o A cycle is very hard to detect visually from

a time series graph and is thus very often

assumed to be negligible, especially for

short-term data

Residuals

o Residuals are also known as errors which are put

on the account of unpredictable external factors

commonly known as freaks of nature.

o They are the differences between the expected

and observed values of the variable.

o Theoretical values are the combination (addition or

multiplication) of trend, seasonality and cycle.

o It is assumed that residuals are normally

distributed and that, over a long range of time, they

cancel one another in such a way that their sum is

zero.

Trend Tests

o Trend detection: Mann-Kendall test,

Seasonal Mann-Kendall test, Correlated

seasonal Mann-Kendall test, Partial Mann-Kendall

test, Partial correlation trend test,

Cochran-Armitage test.

o Magnitude of trend: Sen’s slope, Seasonal

Sen’s slope.

o Change point detection: Pettitt’s test

Mann-Kendall Trend Test

o Mann-Kendall trend test is a nonparametric

test used to identify a trend in a series,

even if there is a seasonal component in

the series.

o The Mann-Kendall test compares the

direction of change for all possible time

period combinations to determine whether

the overall trend is increasing (upward) or

decreasing (downward).

Mann-Kendall Trend Test

• The null hypothesis H0 for these tests is that

there is no trend in the series.

• The three alternative hypotheses are that there

is a negative, non-null, or positive trend.

• The Mann-Kendall tests are based on the

calculation of Kendall's tau measure of

association between two samples, which is itself

based on the ranks within the samples.

• The computations assume that the observations

are independent.
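
A minimal sketch of the Mann-Kendall statistic: S counts increases minus decreases over all pairs of time points, and is standardised with the usual normal approximation. The version below assumes independent observations with no tied values (ties need a variance correction) and a reasonably long series; the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_kendall(series):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # +1 for an increase, -1 for a decrease
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S when there are no ties
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p_value


# Hypothetical annual values, for illustration only.
annual = [12.1, 12.8, 12.4, 13.5, 13.9, 13.2, 14.6, 15.0, 14.8, 15.7]
s, z, p_value = mann_kendall(annual)
print(f"S = {s}, Z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value leads to rejecting H0 (no trend); the sign of S indicates whether the trend is upward or downward.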

Sen’s Slope Trend Test

o If a linear trend is present in a time series, then

the true slope (change per unit time) can be

estimated by using a simple nonparametric

procedure developed by Sen (1968) known as

Sen’s slope estimator.

f(t) = Qt + B

Where Q is the slope, B is a constant

o Could be seasonal slope estimator or non-

seasonal slope estimator
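
Sen's slope Q is the median of the slopes between all pairs of points. The sketch below covers the non-seasonal case; estimating the constant B as the median of (value - Q·t) is one common convention and is an assumption here, as are the names and data.

```python
from statistics import median


def sens_slope(times, values):
    """Return (Q, B) for f(t) = Q*t + B using the median of pairwise slopes."""
    n = len(times)
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    q = median(slopes)
    b = median([v - q * t for t, v in zip(times, values)])  # assumed convention for B
    return q, b


# Hypothetical series following 2t + 1 with one outlier at t = 3.
times = [0, 1, 2, 3, 4, 5]
values = [1, 3, 5, 100, 9, 11]
q, b = sens_slope(times, values)
print(f"f(t) = {q:.2f} t + {b:.2f}")
```

Because medians are used, the single outlier does not change the estimated slope, which illustrates the estimator's robustness.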

Sen’s Slope Trend Test

• Sen's slope is computed if we request to take into

account the autocorrelation(s).

• The Sen’s slope estimator is an unbiased

estimator of the true slope in simple OLS

regression, but is less sensitive to outliers.

• Inference of Sen’s slope estimates may be

affected by the presence of autocorrelation, and

consensus is required on how to make such

adjustments.

Cochran-Armitage Trend Test

• The Cochran–Armitage test for trend is used in

categorical data analysis when the aim is to

assess for the presence of an association

between a variable with two categories and a

variable with k categories.

• It is the most frequently used test for trend among

binomial proportions.

• It applies when the objective is to assess the presence

of an association with some binary variable.

Cochran-Armitage Trend Test

• This test can be performed, for example, to

analyse animal carcinogenicity studies, genetic

association studies, controlled clinical trials, items

in questionnaire-based studies on functional

limitations & disabilities, and community-based

surveys.

• For example, doses of a treatment can be

ordered as 'low', 'medium', and 'high', and we

may suspect that the treatment benefit cannot

become smaller as the dose increases.

Type of time series models

o There are two types of time series models;

additive and multiplicative.

o In the additive model, the components are

added and, in the multiplicative model, they

are multiplied.

o Using T for trend, C for cycle, S for season

and R for residuals, the two models are:

Additive: Y = T + C + S + R

Multiplicative: Y = T × C × S × R

Time Series Models

o Models for time series data can have many

forms and represent different processes.

o When modelling variations in the level of a

process, three broad classes of practical

importance are the autoregressive (AR)

models, the integrated (I) models, and the

moving average (MA) models.

Time Series Models

o These three classes depend linearly on

previous data points.

o Combinations of these ideas produce

autoregressive moving average (ARMA)

and autoregressive integrated moving

average (ARIMA) models.
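
To illustrate what "depends linearly on previous data points" means, the sketch below extrapolates a first-order autoregressive model, y_t = c + φ·y_{t-1}, with the random error term set to its expected value of zero. The coefficient values are hypothetical; in practice they would be estimated from the series.

```python
def ar1_forecast(last_value, phi, constant=0.0, steps=3):
    """Forecast an AR(1) model y_t = constant + phi * y_(t-1) for several steps."""
    forecasts = []
    current = last_value
    for _ in range(steps):
        current = constant + phi * current  # each value depends on the previous one
        forecasts.append(current)
    return forecasts


# Hypothetical model: phi = 0.5, no constant, last observed value 10.
print(ar1_forecast(last_value=10.0, phi=0.5, steps=3))
```

Moving average (MA) terms add a linear dependence on past error terms, and the integrated (I) part applies the model to differences of the series.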

INDEX NUMBERS

Index Numbers

• Price Index (Fisher’s & Marshall-Edgeworth)

• Quantity or Volume Index Numbers

• Value Relatives

• Link and Chain Relatives

• Deflation of Time Series (seasonal index)

1) Use Average or Relative Method

2) Use Weighted Aggregate Method

Index Numbers - example

• Wage index, production index

• Unemployment index, cost of living index

• Consumer price index

• Standardized precipitation index

• Palmer drought index, seasonal index

• Construction cost index

• Water quality index

Index Numbers

• An index number is a statistical measure

designed to show changes in a variable or

group of related variables with respect to

time, geographic location or other

characteristic such as: income, profession

etc.

• A collection of index numbers for different

years, location, etc., is sometimes called

an index series.

Index Numbers

• By using index number we can, for

example, compare food or other living

costs in a city during one year with those

of a previous year, or we can compare

steel production during a given year in one

part of a country with that in another part.

• Although mainly used in business and

economics, index numbers can be applied

in many other fields (civil engineering).

DATA ANALYSIS

Faculty of Civil Engineering

End of Presentation

Thank you

Sobri Harun, UTM Skudai

October, 2016
