DATA ANALYSIS
Faculty of Civil Engineering
DATA
DATA - Introduction
• Data is a collection of facts, such as
numbers, words, measurements,
observations or even just descriptions of
things.
• Qualitative data is descriptive information
(it describes something).
• Quantitative data is numerical information
(numbers).
DATA - Introduction
• Quantitative data can also be discrete or
continuous.
• Discrete data can only take certain values
(like whole numbers).
• Continuous data can take any value
(within a range).
Data Analysis - Introduction
• A process of inspecting, cleaning,
transforming and modeling data with the
goal of discovering useful information,
suggesting conclusions and supporting
decision-making.
• Data analysis has multiple facets and
approaches, encompassing diverse
techniques under a variety of names, in
different disciplines.
Data Analysis - Introduction
• Data analysis is about manipulating and
presenting results.
• Data need to be organised, summarised
and analysed in order to draw (infer)
conclusions.
Data Analysis - Processes
• Data requirements.
• Data collection.
• Data processing.
• Data cleaning.
• Exploratory data analysis.
• Modelling and algorithms.
• Results & Report.
Sources of Data
• Lab Experimentation
• Survey
• Census
• Theoretical Analysis
• Numerical Analysis
• Software
• Other researchers' data
Example Analysis – Results
• Estimation of parameter mean values
• Estimation of parameters variability
• Comparison of parameter mean values
• Comparison of parameter variability
• Modelling the dependence of a dependent
variable on several quantitative &
qualitative independent variables
Data Processing
• Data initially obtained must be processed
or organized for analysis.
• For instance, this may involve placing data
into rows and columns in a table format for
further analysis, such as within a
spreadsheet or statistical software.
Data Cleaning
• The data may be incomplete, contain
duplicates, or contain errors.
• The need for data cleaning will arise from
problems in the way that data is entered
and stored.
• Data cleaning is the process of preventing
and correcting these errors.
Data Checking
• Before doing data analysis and
interpretation, watch for invalid data using
a suitable data checking procedure.
• Weeding out of bad data is to be done
continuously throughout the data gathering
process.
• Bad data can bias results & interpretation.
• Repeat data gathering or experimentation
if there are suspicious data.
Exploratory Data Analysis
• Once the data is cleaned, it can be analyzed.
• Analysts may apply a variety of techniques
referred to as exploratory data analysis to begin
understanding the messages contained in the
data.
• The process of exploration may result in
additional data cleaning or additional requests
for data, so these activities may be iterative in
nature.
Trial Test
• Do a simple trial test.
• To ensure that all parts in the testing
set-up function well.
• To determine the range of measurement
to be taken.
• To anticipate the time taken for each step
in the experiment.
• To see the error.
Error (Uncertainty)
• Writing a measurement result with ± e
does not mean that we have made a
mistake.
• It expresses the uncertainty due to the limits of
the equipment and the experimental technique.
Error (Uncertainty)
• For example, Case 1:
• Theory predicts a deflection of 5 mm; in the
experiment the deflection is 5.5 mm. Does this
mean that the theory is wrong?
• Ask first what the error limit is. If the error
limit is ±0.75 mm, the theory is consistent with the measurement.
Error (Uncertainty)
• For example Case 2:
• Two experimentalists measure
the time taken for ….
• The first researcher gives the result as
20.4 ± 0.4 sec.
• The second researcher gives
19.8 ± 0.8 sec.
• Do their results contradict each other?
Error (Uncertainty)
• No, their results actually overlap.
• However, we are more confident in the
first one because its error is half that of the
second, meaning that the measurement is
more precise.
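The two cases can be checked with a few lines of Python. This is only a sketch: the helper names `interval` and `overlaps` are illustrative, and the ±0.75 limit in Case 1 is assumed to be in mm.

```python
def interval(value, uncertainty):
    """Return the (low, high) range implied by value ± uncertainty."""
    return (value - uncertainty, value + uncertainty)


def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Case 1: theory predicts 5 mm, the experiment gives 5.5 ± 0.75 mm.
measured = interval(5.5, 0.75)
theory_ok = measured[0] <= 5.0 <= measured[1]

# Case 2: two researchers report 20.4 ± 0.4 sec and 19.8 ± 0.8 sec.
first = interval(20.4, 0.4)
second = interval(19.8, 0.8)
consistent = overlaps(first, second)

print(theory_ok, consistent)
```

Both checks come out true: the predicted 5 mm lies inside the measured range, and the two timing ranges share the region from about 20.0 to 20.6 sec.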
Analysis & Interpretation
• Mathematical formulas or numerical
models called algorithms may be applied
to the data to identify relationships among
the variables.
• Numerical models: using software
• Statistical analysis: using software
STATISTICAL ANALYSIS
What is Statistics
• The science of collecting and analyzing
data.
• It’s about the whole process of using the
scientific method to answer questions and
make decisions.
What is Statistics
• The process involves designing studies,
collecting good data, describing the data
with numbers and graphs, analyzing the
data, and then making conclusion.
Statistical Analysis
1) Designing studies
2) Collecting & selecting data
3) Describing data
4) Analyzing data
5) Making conclusion
Designing Studies
• Once a research question is defined, the
next step is designing a study in order to
answer that question.
• Figure out what process would be used to
get the data we need.
Designing Studies
• An observational study could be a survey.
• Surveys are questionnaires that are
presented to individuals who have been
selected from a population of interest.
• Another widely used type of observational study is
based on nature, such as wildlife, geology,
hydrology, meteorology, environment, etc.
Designing Studies
• Experiments take place in a controlled
setting, and are designed to minimize
biases that might occur.
• It is perhaps most important to note that
no matter what the study, it has to be
designed so that the original questions can
be answered in a credible way.
Collecting & selecting data
• If you select your subjects in a way that is biased
- that is, favoring certain individuals or groups of
individuals – then the results will also be biased.
• Experiments and observational studies that use
instrumentation are sometimes even more
challenging when it comes to collecting data.
• For example, something may happen during the experiment to
distract the subjects or the researchers.
Describing Data
• Once data are collected, the next step is to
summarize it all to get a handle on the big
picture.
• Statisticians describe data in two major
ways: with pictures (that is, chart & graph)
and with numbers, called descriptive
statistics.
CHARTS AND GRAPHS
Charts and Graphs
• Line graphs for trend & behaviour
• Time charts for time series data
• Scatter graphs for relationships
• Pie charts & bar charts for categorical data
• Histogram & box plots for numerical data
Line Graphs
• A powerful tool to explain results in terms of
cause and effect.
• The horizontal x-axis is normally used for the
independent variable (the cause or controlled
variable).
• The vertical y-axis is normally used for
dependent variable (the effect).
• To describe the development or progression.
• To show trend, response or behaviour in data.
Line Graph
[Figure: example line graph]
Time Charts
• Used to examine trends over time; a time
chart is a type of line graph.
• Typically a time chart has some unit of
time on the horizontal axis (year, day,
month, and so on) and a measured
quantity on the vertical axis (income, birth
rate, total sales..)
Time Chart
[Figure: time chart of total sales (vertical axis, 0 to 60) against time (horizontal axis, 0 to 12)]
Scatter Graphs
• Useful to present many data values.
• To show correlations between two
variables.
• To draw conclusions about relationship in
the data.
Scatter Graphs
[Figure: scatter graph of Y (0 to 90) against X (0 to 14)]
Pie Charts
• Present data in segments, conveying a simple
and straightforward proportion for each
category.
• A pie chart takes categorical data and
shows the percentage of individuals that
fall into each category.
• Each segment is presented in terms of
percentage and can only be used with one
data set.
Bar Charts
• An effective way of presenting frequencies.
• Common in reports of small scale research.
• The bar height represents quantity or amount.
• The number of bars represents the categories.
• Often used to compare groups by breaking them down and
showing them side by side.
• Visually striking and simple to read.
Bar Charts
Histogram
• The statistician's graph of choice for
numerical data; it provides a quick way to
get the big idea about a numerical data set.
• A histogram is a graphical display of
tabulated frequencies: a
graphical version of a table that shows
what proportion of cases fall into each of
several specified classes.
Histogram
• A histogram is the most important graphical tool
for exploring the shape of data distributions
(Scott, 1992).
• The shape examined from the histogram puts
the type of distribution into view.
• A histogram can be constructed by plotting the
frequency of observations against the class midpoints
of the data.
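Number of Class Intervals
Rule of thumb (Sturges's rule) for the number of class intervals k, from which an appropriate width follows:
$$k = 1 + 3.322\,\log_{10}(n)$$
a is the bin width (width of a class interval), obtained by dividing the range of the data by k
n is the number of observations (data)
log10(n) is the base-10 logarithm of the
number of observations
According to Sturges's rule, 1000 observations
would be graphed with 11 class intervals, since 1 + 3.322 × 3 ≈ 11.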
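As a rough sketch, Sturges's rule in its usual form k = 1 + 3.322 log10(n) can be evaluated as below. The function names are illustrative, and rounding k to the nearest whole number is an assumption; some texts round up instead.

```python
import math


def sturges_classes(n):
    """Number of class intervals from Sturges's rule, k = 1 + 3.322 log10(n)."""
    return round(1 + 3.322 * math.log10(n))


def class_width(data):
    """Bin width: range of the data divided by the number of classes."""
    k = sturges_classes(len(data))
    return (max(data) - min(data)) / k


print(sturges_classes(1000))  # 11 class intervals for 1000 observations
```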
Histogram
Histogram - tips
• If there are too few classes, it is difficult to
see how the data vary.
• If there are too many classes, then the
table is less of a summary
Histogram tells three features
• How the data are distributed (symmetric,
skewed right, skewed left, bell-shaped).
• The amount of variability in the data.
• Where the center of the data is
(approximately).
Histogram tells three shapes
• Symmetric: the left-hand side of the
histogram is a mirror image of the right-
hand side.
• Skewed right: it looks like a lopsided
mound with one long tail going off to the
right.
• Skewed left: it looks like a lopsided mound
with one long tail going off to the left.
Histogram tells variability
• If a histogram is quite flat with the bars
close to the same height, it indicates high
variability.
• A histogram with a big lump in the middle
and tails on the sides indicates more data
in the middle bars than the outer bars, so
the data are closer together and show
less variability.
Histogram tells center
• A histogram can also give you a rough
idea of where the center of the data lies.
• To visualize the mean, picture the histogram
resting on a fulcrum: the mean is the point where
the fulcrum has to be to balance the weight on each side.
Boxplot
• A boxplot is a one-dimensional graph of
numerical data based on the five-number
summary, which includes the minimum
value, the 25th percentile (known as Q1), the
median, the 75th percentile (Q3), and the
maximum value.
• In essence, these five descriptive statistics
divide the data set into four parts, each holding about 25% of the data.
Making a Boxplot
1) Find the five number summary of data
set.
2) Create a horizontal number line whose
scale includes the numbers in the five-
number summary.
3) Label the number line using appropriate
units of equal distance from each other.
Making a Boxplot
Five number summary:
43: Minimum
68: 25th percentile
77: Median
89: 75th percentile
99: Maximum
[Figure: number line for exam score, scaled from 40 to 100]
Making a Boxplot
4) Mark the location of each number in the five-
number summary just above the number line.
5) Draw a box around the marks for the 25th
percentile and the 75th percentile.
6) Draw a line in the box where the median is
located.
7) Draw lines from the outside edges of the box
out to the minimum & maximum values of the
data set.
Making a Boxplot
[Figure: boxplot of exam score on a 40 to 100 scale, annotated with Steps 5, 6 and 7]
Interpreting a Boxplot
A boxplot can show information about the
distribution, variability, and center of a
data set.
Symmetric data shows a symmetric
boxplot.
Skewed data show a lopsided boxplot,
where the median cuts the box into two
unequal pieces.
Interpreting a Boxplot
If the longer part of the box is to the right
(or above) the median, the data is said to
be skewed right.
If the longer part is to the left (or below)
the median, the data is skewed left.
Interpreting a Boxplot
The upper part (vertical line) of the box is
wider than the lower part (vertical line).
This means that the data between the
median (77) and Q3 (89) are a little more
spread out, or variable, than the data
between the median (77) and Q1 (68).
Interpreting a Boxplot
Variability in a data set is measured
by the interquartile range (IQR).
The IQR is equal to Q3 – Q1.
A large distance from the 25th percentile
to the 75th percentile indicates the data
are more variable.
IQR ignores data below 25th or above 75th
which may contain outliers.
Interpreting a Boxplot
The median is part of the five-number
summary, and is shown by the line that
cuts through the box in the boxplot.
The mean, however is not part of the box
plot.
A common misinterpretation of a boxplot: the bigger the box,
the more data.
A bigger part of the box actually means there is
more variability (a wider range of values).
DESCRIPTIVE STATISTICS
Summarizing Data
• Descriptive statistics are numbers that
summarize some characteristic of a
set of data.
• Summarizing data by numerical measures
makes a point clearly and concisely.
• Mean, Median, Mode, Standard Deviation,
Variance, Coefficient of Variation,
Skewness, Kurtosis.
Sample Mean
• The sample mean is defined as the sum of
the observed variable, x divided by the
number of observed values.
Sample Median
• The sample median of a variable
x is defined as the middle value
when the n sample observations
of x are ranked in increasing
order of magnitude.
Sample Median
• S = 1,6,3,8,2,4,9
• We need to find the value x,
where half of the values are
above x and half of the values are
below x.
• Rearrange, S = 1,2,3,4,6,8,9
• The median is 4
Sample Mode
• The sample mode of a variable x is
defined as the value with the highest
frequency.
• The mode of a data set is the value that
occurs most often or, in other words, has the
highest probability of occurring.
Sample Mode
• Sometimes we can have two, three, or
more values that have relatively large
probability of occurrence.
• In such cases, we say that the distribution
is bimodal, tri-modal or multimodal,
respectively.
Sample Mode
• Consider the rolls of a ten-sided die:
• R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2
• The number that appears the most is the
number 2.
• Therefore the mode of set R is the number
2
Sample Mode
• Consider the rolls of a ten-sided die:
• R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2
• Note that if the number 7 had appeared
one more time, it would have been present
four times as well.
• In this case, we would have had a bimodal
distribution, with 2 and 7 as the modes.
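The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

S = [1, 6, 3, 8, 2, 4, 9]
R = [2, 8, 1, 9, 5, 2, 7, 2, 7, 9, 4, 7, 1, 5, 2]

# Median: middle value once the observations are ranked.
median_S = statistics.median(S)

# multimode returns every value tied for the highest frequency.
modes_R = statistics.multimode(R)

# If 7 had appeared one more time, 2 and 7 would both occur four times.
modes_R_plus_7 = statistics.multimode(R + [7])

print(median_S, modes_R, modes_R_plus_7)
```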
Mean Median Mode
• When to use mean, median & mode?
• Mean – for normally distributed data
(symmetrical distribution).
• Median & Mode – for markedly
skewed data.
Measures of Dispersion
• Consider the following data set:
• S = 5,5,5,5,5,5 and R = 0,0,0,10,10,10
• If we calculated the mean for both S and
R, we would get the number 5.
• However, these are two vastly different
data sets.
Measures of Dispersion
• Therefore, we need another descriptive
statistic besides a measure of central
tendency, which we shall call a measure of
dispersion.
• We shall measure the dispersion or scatter
of the values of our data set about the
mean of the data set.
Measures of Dispersion
• If the values tend to be concentrated near
the mean, then this measure shall be
small, while if the values of the data tend
to be distributed far from the mean, then
the measure will be large.
• The two measures of dispersion that are
usually used are the variance and the
standard deviation.
Variance and Std Deviation
• A quantity of great importance in
probability and statistics is called the
variance.
• The variance, denoted by σ², for a set of n
numbers x1, x2, …, xn with mean μ is given by
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
Variance and Std Deviation
• The variance is a nonnegative number.
• The positive square root of the variance
(σ²) is called the standard deviation (σ).
• Find the variance and std deviation for the
following set of test scores:
• T = 75, 80, 82, 87, 96
from data set, μ = 84
Variance and Std Deviation
• T = 75, 80, 82, 87, 96
from data set, μ = 84
Variance σ2 = 50.8
Std Deviation, σ = √50.8 = 7.13
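Variance & Std Deviation
• It is also widely accepted to divide by (n-1)
as opposed to n, which gives the sample variance s² and sample standard deviation s:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$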
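As a quick check of the test-score example, the sketch below uses Python's `statistics` module: `pvariance` and `pstdev` divide by n, while `variance` divides by n - 1. The variable names are illustrative.

```python
import statistics

T = [75, 80, 82, 87, 96]

mu = statistics.mean(T)              # arithmetic mean
pop_var = statistics.pvariance(T)    # sum of squared deviations divided by n
pop_sd = statistics.pstdev(T)        # positive square root of pop_var
sample_var = statistics.variance(T)  # same sum divided by n - 1

print(mu, pop_var, round(pop_sd, 2), sample_var)
```

Dividing by n reproduces the slide values (variance 50.8, standard deviation about 7.13); dividing by n - 1 gives a larger variance of 63.5 for the same data.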
Percentiles
• It is often convenient to subdivide your
ordered data set by use of ordinates so
that the amount of data points less than
the ordinate is some percentage of the
total amount of observations.
• The values corresponding to such areas
are called percentile values, or briefly,
percentiles.
Percentiles
• For example, the fraction of scores that
fall below the ordinate at x_α is α.
• For instance, the fraction of scores less
than x_0.10 would be 0.10 or 10%, and x_0.10
would be called the 10th percentile.
Percentiles
• Another example is the median.
• Since half the data points fall below the
median, it is the 50th percentile (or fifth
decile), and can be denoted by x_0.50.
Percentiles
• The 25th percentile is often thought of as
the median of the scores below the
median, and the 75th percentile is often
thought of as the median of the scores
above the median.
Percentiles
• The 25th percentile is called the first
quartile, while the 75th percentile is called
the third quartile.
• The median is also known as the second
quartile.
Interquartile Range
• Another measure of dispersion is the
interquartile range (IQR).
• The interquartile range is defined to be the
first quartile subtracted from the third
quartile.
• IQR = x0.75 - x0.25
Interquartile Range
• Find interquartile range from the following
data set:
• S = (67, 69, 70, 71, 74, 77, 78, 82, 89)
• The median is 74.
• The first quartile, x0.25 is the median of the
scores below the fifth position, the average
of the second and third score, which leads
to x0.25 = 69.5
Interquartile Range (IQR)
• The third quartile, x0.75 is the median of the
scores above the fifth position, the
average of the seventh and eighth score,
which leads to x0.75 = 80
• The interquartile range is x0.75 - x0.25 = 80
– 69.5 = 10.5.
• The semi-interquartile range is 0.5(x0.75 - x0.25),
which leads to 5.25.
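The quartile calculation above follows the "median of each half" convention. A minimal sketch of that convention is given below; the function name is illustrative, and other quartile conventions used by statistical software can give slightly different values.

```python
import statistics


def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-each-half convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # scores below the median position
    upper = ordered[half + n % 2:]  # scores above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


S = [67, 69, 70, 71, 74, 77, 78, 82, 89]
q1, med, q3 = quartiles_by_halves(S)
iqr = q3 - q1
semi_iqr = iqr / 2

print(q1, med, q3, iqr, semi_iqr)
```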
Skewness
• Distribution of scores in data set.
• We might have a symmetrical data set, or
a data set that is evenly distributed, or a
data set with more high values than low
values.
Skewness
• Often a distribution is not symmetric about
any value. If it instead has a few more
high values, forming a tail on the right, it is said to be skewed
to the right.
• If the data set has a few more low
values, forming a tail on the left, it is said to be skewed to the
left.
Skewed
[Figure: distributions skewed to the right and skewed to the left]
Kurtosis
• Kurtosis (from the Greek word κυρτός, kyrtos or
kurtos, meaning bulging) is a measure of the
"peakedness" of the probability distribution of a
real valued random variable
• Kurtosis is a measure of whether the data are
peaked or flat relative to a normal distribution.
• Higher kurtosis means more of the variance is
due to infrequent extreme deviations, as
opposed to frequent modestly-sized deviations.
Kurtosis
• A distribution having a relatively high peak
is called leptokurtic,
while a curve which is flat-topped is
called platykurtic.
• The normal distribution, which is neither very
peaked nor very flat-topped, is called
mesokurtic.
Kurtosis
[Figure: (a) leptokurtic, (b) platykurtic and (c) mesokurtic curves]
Moments
• If X1, X2, …, XN are the N values assumed
by the variable X, we define the quantity:
$$\overline{X^r} = \frac{X_1^r + X_2^r + \dots + X_N^r}{N} = \frac{\sum_{j=1}^{N} X_j^r}{N} = \frac{\sum X^r}{N}$$
• This is called the rth moment. The first moment,
with r = 1, is the arithmetic mean.
Moments
• The rth moment about the mean is
defined as:
$$m_r = \frac{\sum_{j=1}^{N}(X_j - \bar{X})^r}{N} = \frac{\sum (X - \bar{X})^r}{N} = \overline{(X - \bar{X})^r}$$
• If r = 1, m1 = 0.
• If r = 2, m2 = s², the variance.
Moments
• The rth moment about any origin A is
defined as:
$$m'_r = \frac{\sum_{j=1}^{N}(X_j - A)^r}{N} = \frac{\sum (X - A)^r}{N} = \frac{\sum d^r}{N} = \overline{(X - A)^r}$$
• Where d = X - A are the deviations of X
from A.
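The three moment definitions differ only in the origin used, so one small function covers them all. This is a sketch with an illustrative function name, reusing the test scores T from the variance example as sample data.

```python
def moment(values, r, origin=0.0):
    """r-th moment about `origin`: the mean of (x - origin) ** r."""
    return sum((x - origin) ** r for x in values) / len(values)


T = [75, 80, 82, 87, 96]

mean = moment(T, 1)      # first moment about zero is the arithmetic mean
m1 = moment(T, 1, mean)  # first moment about the mean is always zero
m2 = moment(T, 2, mean)  # second moment about the mean is the variance

print(mean, m1, m2)
```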
PROBABILITY DISTRIBUTION
Probability
• The classical definition of probability:
• Suppose an event E can happen in h ways out
of a total of n possible equally likely ways.
• Then the probability of occurrence of the
event (called its success) is denoted by:
$$p = \Pr\{E\} = \frac{h}{n}$$
Probability
• The probability of non-occurrence of the event
(called its failure) is denoted by:
$$q = \Pr\{\text{not } E\} = \frac{n-h}{n} = 1 - \frac{h}{n} = 1 - p = 1 - \Pr\{E\}$$
• Thus p + q = 1, or Pr{E} + Pr{not E} = 1
Probability Distribution
Discrete probability distribution:
• If a variable X can assume a discrete set of
values X1, X2, …XK with respective probabilities
p1, p2,…,pk where p1 +p2 + ..+ pK = 1;
• We say that a discrete probability distribution for
X has been defined.
• In discrete case, by cumulating probabilities, we
obtain cumulative probability distributions.
Probability Distribution
• The function p(X), which has the respective
values p1, p2, …, pK for X = X1, X2, …, XK, is
called the probability function or frequency
function of X.
• Because X can assume certain values with
given probabilities, it is often called a discrete
random variable.
• Random variables are also called chance or stochastic variables.
Probability Distribution
Continuous probability distribution:
• The variable X may instead assume a continuous
set of values.
• In that case the relative frequency polygon of a
sample becomes, in the theoretical or
limiting case of a population, a continuous
curve such as the one shown in the figure.
Probability Distribution
• The curve equation is Y = p(X); the total area under
the curve bounded by the X axis is equal to one.
[Figure: probability density curve p(X) against X, with the area between X = a and X = b shaded]
Probability Distribution
• The area under the curve between lines X
= a and X = b (shaded in figure) gives the
probability that X lies between a and b,
which can be denoted by
$$\Pr\{a < X < b\}$$
• We call p(X) a probability density function.
• Variable X is called a continuous random
variable.
Mathematical Expectation
• If p is the probability that a person will
receive a sum of money S, the
mathematical expectation, or simply the
expectation, is defined by pS.
• If the probability that a man wins a RM100
prize is 1/5, his expectation is:
$$\frac{1}{5}(\text{RM}100) = \text{RM}20$$
Mathematical Expectation
• If X denotes a discrete random variable which
can assume the values X1, X2, …, XK with
respective probabilities p1, p2, …, pK where p1 + p2
+ … + pK = 1, the mathematical expectation of X, or
simply the expectation of X, denoted by E(X), is
defined:
$$E(X) = p_1 X_1 + p_2 X_2 + \dots + p_K X_K = \sum_{j=1}^{K} p_j X_j$$
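The definition of E(X) is a probability-weighted sum, which can be sketched as below. The function name is illustrative; the prize case reuses the RM100 example, with an assumed second outcome of winning nothing, and the fair six-sided die is an added illustration.

```python
def expectation(values, probabilities):
    """E(X) = sum of p_j * X_j for a discrete random variable."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probabilities))


# RM100 prize won with probability 1/5, nothing otherwise (assumed outcome).
prize = expectation([100, 0], [1 / 5, 4 / 5])

# Fair six-sided die: each face has probability 1/6.
die = expectation(range(1, 7), [1 / 6] * 6)

print(prize, die)
```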
Binomial Distribution
• In an experiment such as tossing a coin or
die repeatedly, each toss or selection
is called a trial.
• In any single trial there will be a probability
associated with a particular event such as
head on the coin, four on the die.
• Such trials are said to be independent and
often called Bernoulli trials.
• Binomial is discrete distribution.
Binomial Distribution
• Let p be the probability that an event will
happen in any single Bernoulli trial
(called the probability of success).
• Then q = 1 - p is the probability that the
event will fail to happen in any single trial
(called the probability of failure).
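Binomial Distribution
$$f(x) = P(X = x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$$
$$\text{mean: } \mu = np$$
$$\text{variance: } \sigma^2 = np(1-p)$$
$$\text{standard deviation: } \sigma = \sqrt{np(1-p)}$$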
Binomial Distribution
• Toss a fair coin 100 times, and count the
number of heads that appear. Find the
mean, variance, and standard deviation of
this experiment.
• In 100 tosses of a fair coin, the expected
or mean number of heads is μ = (100)(0.5)
= 50
• Variance σ2 = 100(0.5)(0.5) = 25
• Std deviation σ = √[(100)(0.5)(0.5)] = 5
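The coin-toss example can be verified numerically. The sketch below uses `math.comb` for the binomial coefficient; the function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for n independent Bernoulli trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 100, 0.5
mean = n * p
variance = n * p * (1 - p)
std_dev = math.sqrt(variance)

# The probabilities over x = 0..n should add up to one.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))

print(mean, variance, std_dev, round(total, 6))
```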
Poisson Distribution
• Discrete distribution.
• Let X be a discrete random variable that can
take on the values 0, 1, 2, … such that the
probability function of X is given by
$$f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$
• Where λ is a given positive constant.
Poisson Distribution
• A random variable having this
distribution is said to be Poisson distributed.
• The values of the Poisson distribution can be
obtained using a table (available in statistics
textbooks), which gives values of e^(-λ) for various
values of λ.
$$f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$
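Poisson Distribution
$$f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$
$$\text{mean: } \mu = \lambda$$
$$\text{variance: } \sigma^2 = \lambda$$
$$\text{standard deviation: } \sigma = \sqrt{\lambda}$$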
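Instead of a printed table of e^(-λ), the probabilities can be computed directly. In this sketch the function name and the value λ = 2 are illustrative assumptions; the sums are truncated at x = 59, where the remaining terms are negligible for this λ.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = lam**x * exp(-lam) / x! for x = 0, 1, 2, ..."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


lam = 2.0  # illustrative value of λ
xs = range(60)

total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

print(round(total, 6), round(mean, 6), round(variance, 6))
```

The numerical mean and variance both come out equal to λ, matching the summary formulas above.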
Normal Distribution
• One of the most important examples of a
continuous probability distribution is the
normal distribution.
• Sometimes called the Gaussian
distribution.
• It is very important and comes up
quite often in practice.
Normal Distribution
• The density function for this distribution is
given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad -\infty < x < \infty$$
• Where μ = mean; σ = std deviation; π =
3.14159..; e = 2.71828..
Normal Distribution
• The total area bounded by the following curve
and the X axis is one:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$$
• Hence the area under the curve between two
ordinates X = a and X = b,
where a < b, represents the probability that X
lies between a and b, denoted by Pr{a < X < b}.
Normal Distribution
• The corresponding distribution function is given
by:
$$F(x) = P(X \le x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-(v-\mu)^2/2\sigma^2}\, dv$$
• If X has the distribution function listed above,
then we say that the random variable X is
normally distributed with mean μ and variance σ².
Normal Distribution
• If we let Z be the random variable corresponding
to the following:
$$Z = \frac{X - \mu}{\sigma}$$
• Then Z is called the standard variable
corresponding to X. The mean or expected
value of Z is 0 and the std deviation is 1.
Normal Distribution
• The density function for Z can be obtained from
the definition of a normal distribution by setting
μ = 0 and σ² = 1:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
• This is often referred to as the standard normal
density function.
Normal Distribution
• The corresponding standard normal density is given by:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
• We sometimes call the value z of the
standardized variable Z the standard score.
• A graph of the standard normal density function is
sometimes called the standard normal curve.
Normal Distribution
• The standard normal curve indicates the areas
within 1, 2, and 3 standard deviations of the
mean.
• i.e. the areas between z = -1 and +1, z = -2 and +2, z = -3
and +3 are equal, respectively, to 68.27%,
95.45% and 99.73% of the total area, which is
one. This means that:
$$P(-1 \le Z \le 1) = 0.6827$$
$$P(-2 \le Z \le 2) = 0.9545$$
$$P(-3 \le Z \le 3) = 0.9973$$
Standard Normal Curve
[Figure: standard normal curve f(z) against z, marking the areas 68.27%, 95.45% and 99.73% within 1, 2 and 3 standard deviations of the mean]
Normal Distribution
• A table giving the areas under the curve
bounded by the ordinates at z = 0 and any
positive value of z is available in most
statistics textbooks.
• From this table the areas between any two
ordinates can be found by using the
symmetry of the curve about z = 0.
Normal Distribution
• Approximately 68% of the area under any
normal distribution curve lies within one
standard deviation of the mean.
• Approximately 95% of the area under any
normal distribution curve lies within two standard
deviations of the mean.
• Approximately 99.7% of the area under any
normal distribution curve lies within three
standard deviations of the mean.
Normal Distribution
•Total area under the curve = 1.0 or 100%
• The area under the curve :
within 1 std. deviation = 0.68 or 68%;
within 2 std deviation = 95%
within 3 std deviation = 99.7%
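These areas can be reproduced without a printed table, because for a standard normal Z the area between -k and +k equals erf(k/√2). The sketch below uses `math.erf`; the function names are illustrative.

```python
import math


def area_within(k):
    """P(-k <= Z <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


def standard_score(x, mu, sigma):
    """z = (x - mu) / sigma, the standardized value of x."""
    return (x - mu) / sigma


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```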
Normal Distribution
• A standard normal distribution is a normal
distribution with zero mean and unit
variance, described by the standard normal density function f(z) and
its distribution function F(z).
POPULATION & SAMPLE
Population and Sample
• Often in practice we are interested in drawing
valid conclusions about a large group of
individuals or objects.
• Examining the entire group, called the
population, may be difficult or impossible
to do.
• Instead, we may examine only a small part of this
population, which is called a sample.
• The process of obtaining samples is called
sampling.
Population and Sample
• Statistical inference is drawing conclusions
from sample data about the larger populations
from which the samples are drawn.
• A population is the whole set of a
measurements or counts about which we want
to draw a conclusion.
• A sample is a subset of the population, a set of
some of the measurements or counts which
comprise the population.
Sampling
• If we draw an object from an urn, we have
the choice of replacing or not replacing the object in the
urn before we draw again.
• In the first case a particular object can
come up again and again, whereas in the
second it can come up only once.
Sampling
• Sampling where each member of a population
may be chosen more than once is called
sampling with replacement.
• Sampling where each member cannot be
chosen more than once is called sampling
without replacement.
• For practical purposes, sampling from a finite
population that is very large can be considered
sampling from an infinite population.
Random samples
• For a finite population: make sure that
each member of the population has the
same chance of being in the sample,
which is then called a random sample.
• Random sampling can be accomplished
for relatively small populations by drawing
lots, or equivalently, by using a table of
random numbers specially constructed for
such purposes.
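In place of drawing lots or a table of random numbers, a pseudo-random generator can do the same job. The sketch below assumes a hypothetical population of 20 labelled members and a fixed seed so the draw is repeatable.

```python
import random

population = list(range(1, 21))  # hypothetical population of 20 labelled members
rng = random.Random(42)          # fixed seed so the draw can be repeated

# Sampling without replacement: no member can be chosen more than once.
without_replacement = rng.sample(population, k=5)

# Sampling with replacement: a member may be chosen more than once.
with_replacement = rng.choices(population, k=5)

print(without_replacement, with_replacement)
```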
Random samples
• Because inference from sample to
population cannot be certain, we must use
the language of probability in any
statement of conclusions.
Population parameters
• One important problem of statistical inference is
the estimation of population parameters or
briefly parameters (such as population mean,
variance etc.) from the corresponding sample
statistics or briefly statistics (i.e. sample mean,
variance, etc).
• If the mean of the sampling distribution of a
statistic equals the corresponding population
parameter, the statistic is called an unbiased
estimator of the parameter, otherwise it is called
a biased estimator.
Population parameters
• If the sampling distributions of two statistics have
the same mean (or expectation), the statistic
with smaller variance is called an efficient
estimator of the mean while the other statistic is
called an inefficient estimator.
• If we consider all possible statistics whose
sampling distributions have the same mean, the
one with the smallest variance is sometimes
called the most efficient or best estimator of this
mean.
Population parameters
• An estimate of a population parameter given by
a single number is called a point estimate of the
parameter.
• An estimate of a population parameter given by
two numbers between which the parameter may
be considered to lie is called an interval estimate
of the parameter.
• Interval estimates indicate the precision or
accuracy of an estimate and are therefore
preferable to point estimates.
Population parameters
• A population is considered to be known when
we know the probability distribution f(x) of the
associated random variable X.
• If X is normally distributed, we say the
population is normally distributed or that we
have a normal population.
• If X is binomially distributed, we say that the
population is binomially distributed or that we
have a binomial population.
Sample Statistics
• We can take random samples from the
population and then use these samples to obtain
values that serve to estimate and test hypotheses
about the population parameters.
• For example, we wish to draw conclusions about
the heights of 12,000 adult students by
examining only 100 students selected from the
population.
• In this case, X can be a random variable whose
values are the various heights.
Standard error
• The standard deviation of a sampling distribution
of a statistic is often called its standard error.
• If the sample size N is large enough, the
sampling distributions are normal or nearly
normal. The methods that rely on this are known
as large sampling methods.
• When N < 30, samples are called small and we use
the theory of small samples, or exact sampling
theory.
Confidence interval
• Confidence interval estimates of population
parameters.
• Let μs & σs be the mean and std deviation of the
sampling distribution of a statistic S.
• If the sampling distribution of S is approximately
normal (for n ≥ 30), we can expect S to lie in the interval:
μs – σs to μs + σs : 68.27% of the time
μs – 2σs to μs + 2σs : 95.45% of the time
μs – 3σs to μs + 3σs : 99.73% of the time
Confidence interval
• Equivalently, we can expect to find, or we can be
confident of finding, μs in the intervals:
S – σs to S + σs : 68.27% confidence interval
S – 2σs to S + 2σs : 95.45% confidence interval
S – 3σs to S + 3σs : 99.73% confidence interval
(i.e. for estimating the population parameter, in the case of
an unbiased S)
Confidence interval
• The end points of these intervals are called the
confidence limits:
S ± σs : 68.27% confidence limits
S ± 2σs : 95.45% confidence limits
S ± 3σs : 99.73% confidence limits
Confidence level
Confidence level    Zc (critical value)
99.73%              3.00
99%                 2.58
98%                 2.33
96%                 2.05
95.45%              2.00
95%                 1.96
90%                 1.645
80%                 1.28
68.27%              1.00
50%                 0.6745
S ± 1.96σs : 95% or 0.95 confidence level
S ± 2.58σs : 99% or 0.99 confidence level
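The critical values in the table can be reproduced from the standard normal distribution. The sketch below is illustrative only: the function names are hypothetical, and the sample mean and standard error in the usage lines are made-up numbers, not data from these slides.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided critical value Zc for a confidence level such as 0.95."""
    tail_area = (1.0 - confidence_level) / 2.0
    return NormalDist().inv_cdf(1.0 - tail_area)


def confidence_limits(statistic, standard_error, confidence_level):
    """Return the limits S - Zc*sigma_s and S + Zc*sigma_s."""
    zc = z_critical(confidence_level)
    return statistic - zc * standard_error, statistic + zc * standard_error


# Hypothetical example: a sample mean of 67.45 with standard error 0.29.
lower, upper = confidence_limits(67.45, 0.29, 0.95)
print(round(z_critical(0.95), 2), round(lower, 2), round(upper, 2))
```

Passing 0.99 or 0.90 instead of 0.95 gives the other rows of the table after rounding.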
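Confidence interval
• For small samples (n < 30), use the t distribution
(table) to obtain confidence levels.
• For example, if –t0.975 and t0.975 are the values of
T for which 2.5% of the area lies in each tail of
the t distribution, then a 95% confidence interval
for T is given by:
$-t_{0.975} < \dfrac{(\bar{X} - \mu)\sqrt{n}}{\hat{S}} < t_{0.975}$
so that μ is estimated to lie in the interval
$\bar{X} \pm t_{0.975}\,\dfrac{\hat{S}}{\sqrt{n}}$, or in general terms $\bar{X} \pm t_{c}\,\dfrac{\hat{S}}{\sqrt{n}}$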
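A minimal sketch of a small-sample confidence interval for the mean follows. The ten measurements are hypothetical, and the critical value 2.262 (t0.975 with 9 degrees of freedom) is assumed to have been read from a t table, since the Python standard library has no t-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_limits(sample, t_critical):
    """Limits x_bar -/+ t_c * s_hat / sqrt(n) for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    s_hat = stdev(sample)  # sample standard deviation with n - 1 in the divisor
    half_width = t_critical * s_hat / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of n = 10, so there are 9 degrees of freedom.
sample = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 9.9]
lower, upper = t_confidence_limits(sample, t_critical=2.262)
print(round(lower, 3), round(upper, 3))
```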
The t-distribution
• The normal distribution is the well-known
bell-shaped distribution whose mean is μ
and standard deviation is σ.
• The t-distribution has a basic bell shape
with an area of 1 under it, but it is lower and
flatter than the normal distribution, with heavier tails.
• The standard deviation of the t-distribution is
proportionally larger than that of the
standard normal (Z) distribution.
The t-distribution
• Each t-distribution is distinguished by the
term degrees of freedom.
• If the sample size is n = 10, the corresponding
t-distribution has n – 1 = 10 – 1 = 9 degrees of
freedom and is written t9.
• Smaller sample sizes have flatter t-distributions
than larger sample sizes.
• For larger sample sizes, the t-distribution approaches
the standard normal Z-distribution.
Frequency distribution
• If a sample (or even a population) is large,
it is difficult to observe the various
characteristics or to compute statistics
such as mean or standard deviation.
• For this reason it is useful to organize or
group the raw data.
Frequency distribution
• Suppose that a sample consists of the
heights of 100 male students at XYZ
University.
• We arrange data into classes or
categories, and determine the number of
individuals belonging to each class, called
the class frequency.
Frequency distribution
Height (inches) Number of students
60-62 5
63-65 18
66-68 42
69-71 27
72-74 8
Total 100
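Once the data are grouped, the mean and standard deviation can be approximated from the class frequencies. The sketch below uses the table above and assumes, as is conventional, that every value in a class sits at the class mark (the midpoint of the class limits); the function name is hypothetical.

```python
from math import sqrt

# (lower class limit, upper class limit, class frequency) from the table above
classes = [(60, 62, 5), (63, 65, 18), (66, 68, 42), (69, 71, 27), (72, 74, 8)]


def grouped_mean_and_std(classes):
    """Approximate mean and standard deviation from a frequency distribution."""
    total = sum(freq for _, _, freq in classes)
    marks = [((low + high) / 2, freq) for low, high, freq in classes]
    mean = sum(mark * freq for mark, freq in marks) / total
    variance = sum(freq * (mark - mean) ** 2 for mark, freq in marks) / total
    return mean, sqrt(variance)


mean_height, std_height = grouped_mean_and_std(classes)
print(round(mean_height, 2), round(std_height, 2))
```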
HYPOTHESIS TESTS
Hypothesis testing
• Hypothesis testing is a statistician’s way of trying
to confirm or deny a claim about a population
using data from a sample.
• A hypothesis is a conjecture about a population
parameter.
• Hypothesis testing is a process of using sample
data and statistical procedures to decide
whether to reject or not reject a hypothesis
(statement) about a population parameter value.
Hypothesis testing
• Because parameters tend to be unknown
quantities, claims are often made about
what their values may be.
• Such a conjecture may or may not be true.
• The null hypothesis (Ho) always states the
population parameter is equal to the claimed
value.
• If the null hypothesis is found not to be true, the
alternative hypothesis (Ha or H1) is the statement
accepted in its place.
Hypothesis testing
• Decide on null hypothesis, H0.
• Decide on an alternative hypothesis, H1
• Decide on a significance level.
• Calculate the appropriate test statistic, using the sample
data.
• Find from tables the appropriate tabulated test statistic.
• Compare the calculated and tabulated test statistics, and
decide whether to reject the null hypothesis, H0.
• State a conclusion, after checking to see whether the
assumptions required for the test in question are valid.
Hypothesis testing
• The null hypothesis H0, generally
expresses the idea of no difference.
• The alternative hypothesis, which we
denote by H1, expresses the idea of some
difference.
• Alternative hypothesis may be one-sided
(greater or less than) or two-sided (not
equal to).
Critical values of Z
Level of significance, α    One-tailed tests     Two-tailed tests
0.10                        -1.28 or 1.28        -1.645 and 1.645
0.05                        -1.645 or 1.645      -1.96 and 1.96
0.01                        -2.33 or 2.33        -2.58 and 2.58
0.005                       -2.58 or 2.58        -2.81 and 2.81
0.002                       -2.88 or 2.88        -3.08 and 3.08
Level of significance
[Figure: normal curve over z with a central acceptance region and a shaded rejection region in each tail]
The total shaded area is called the level of significance
of the decision rule (two-tailed test).
Hypothesis Example
• Situation A:
• A researcher is interested in finding out whether a new
medicine will have any undesirable side effects on the
pulse rate of the patient: will the pulse rate increase,
decrease or remain unchanged? Since the researcher
knows that the mean pulse rate of the population under
study is 82 beats per minute, the hypotheses will be
Ho : μ = 82 (remain unchanged)
H1 : μ ≠ 82 (will be different)
• This is a two-tailed test since the possible effect
could be to raise or lower the pulse
Hypothesis Example
• Situation B:
• A chemist invents an additive to increase the life of an
automobile battery. The mean lifetime of an ordinary
battery is 36 months. The hypotheses will be:
Ho : μ ≤ 36
H1 : μ > 36
• The chemist is interested only in increasing the lifespan
of the battery. His alternative hypothesis is that the mean
is larger than 36. Therefore the test is called right-tailed,
interested in the increase only.
Hypothesis Example
• Situation C:
• A contractor wishes to lower heating bills by using
a special type of insulation in houses. If the
average monthly bill is RM100, his hypotheses
will be:
Ho : μ ≥ RM 100
H1 : μ < RM 100
• This is a left-tailed test since the contractor
is only interested in reducing the bill
Test of significance
• A z-test is used for testing the mean of a
population versus a standard, or comparing the
means of two populations, with large (n ≥ 30)
samples whether you know the population
standard deviation or not.
• It is also used for testing the proportion of some
characteristic versus a standard proportion, or
comparing the proportions of two populations.
• A significance level of 5% is the risk we take of
rejecting the null hypothesis when it is in fact true.
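The z-test for one mean against a standard can be sketched as follows. The function name, its `tail` argument and the battery numbers in the usage lines are assumptions made for illustration; they are not data from the examples above.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, claimed_mean, std_dev, n, tail="two"):
    """Return the z statistic and p-value for a one-sample z-test."""
    z = (sample_mean - claimed_mean) / (std_dev / sqrt(n))
    cdf = NormalDist().cdf(z)
    if tail == "two":      # H1: mean differs from the claimed value
        p_value = 2 * min(cdf, 1 - cdf)
    elif tail == "right":  # H1: mean is greater than the claimed value
        p_value = 1 - cdf
    elif tail == "left":   # H1: mean is less than the claimed value
        p_value = cdf
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, p_value


# Hypothetical right-tailed case: 36 batteries, mean life 37.5 months,
# standard deviation 3 months, tested against a claimed mean of 36 months.
z, p_value = z_test(37.5, 36.0, 3.0, 36, tail="right")
reject_h0 = p_value < 0.05  # 5% significance level
print(round(z, 2), reject_h0)
```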
Test of significance
• A t-test is used for testing the mean of one
population against a standard or comparing the
means of two populations if you do not know the
populations’ standard deviation and when you
have a limited sample (n < 30).
• If you know the populations’ standard deviation,
you may use a z-test.
• Example: Measuring the average diameter of
shafts from a certain machine when you have a
small sample.
Test of significance
• An F-test is used to compare 2 populations’
variances. The samples can be any size. It is the
basis of ANOVA.
• Example: Comparing the variability of bolt
diameters from two machines.
Chi-square goodness of fit test
• The chi-square statistic, denoted χ2, provides
a good test of how well a hypothesised
distribution fits the observed one.
• The observed data can be grouped into class
intervals, each with an observed frequency, O.
• For a group of observed data, a distribution of
any type can be hypothesised based on the
shape of the histogram.
Chi-square goodness of fit test
• For each class of the grouped data, the
expected frequency can be estimated on the
basis of the hypothetical distribution.
• This is done by multiplying the probability of
each class interval under the hypothesised
distribution (its probability density function
integrated over the interval) by the number of
data, n, to obtain the expected frequency, E.
• The χ2 value can then be estimated for each class
using the chi-square formula.
Chi-square goodness of fit test
• The single values of χ2 for all classes are
summed.
• The hypothesis can be verified by comparing the
calculated χ2 with the critical value of the χ2 statistic
from a chi-square table.
• If the critical value is less than the calculated
value, the proposed distribution is rejected.
• The critical χ2 value is read from the table based
on the level of significance and the degrees of freedom.
Estimated Chi-square
A measure of the discrepancy between observed (O)
& expected (E) frequencies is given by chi-square:
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
If chi-square is zero: observed & theoretical
frequencies agree exactly.
If chi-square is greater than zero: they do not
agree exactly.
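The goodness-of-fit calculation can be sketched in a few lines. The die-rolling counts are hypothetical, and the critical value 11.07 (5 degrees of freedom, 0.05 level of significance) is assumed to be read from a chi-square table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 rolls of a die, hypothesised to be uniform.
observed = [25, 17, 15, 23, 24, 16]
expected = [120 / 6] * 6

chi2 = chi_square_statistic(observed, expected)
critical_value = 11.07  # table value for 5 degrees of freedom at alpha = 0.05
reject_distribution = chi2 > critical_value
print(chi2, reject_distribution)
```

Here the calculated value is below the critical value, so the hypothesised uniform distribution is not rejected.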
Shapiro-Wilk test
Test of normality
Shapiro-Wilk test
• The Shapiro–Wilk test is a test of normality.
• The Shapiro–Wilk test utilizes the null
hypothesis principle to check whether a
sample x1, ..., xn came from a normally
distributed population.
• Empirical comparisons of the Shapiro–Wilk,
Kolmogorov–Smirnov, Lilliefors and Anderson–Darling
tests have found that Shapiro–Wilk has the best
power for a given significance level, followed
closely by Anderson–Darling.
Shapiro-Wilk test
• The null hypothesis of this test is that the
population is normally distributed.
• Thus if the p-value is less than the chosen
alpha level, then the null hypothesis is
rejected and there is evidence that the
data tested are not from a normally
distributed population.
• In other words, the data are not normal.
Shapiro-Wilk test
• On the contrary, if the p-value is greater than the chosen
alpha level, then the null hypothesis that the data came
from a normally distributed population cannot be
rejected.
• Example: for an alpha level of 0.05, a data set with a p-
value of 0.02 rejects the null hypothesis that the data are
from a normally distributed population.
• However, since the test is sensitive to sample size, in
large samples it may flag even small departures from
normality as statistically significant.
• Thus a Q-Q plot is required for verification in addition to
the test.
Q-Q plot
• In statistics, a Q–Q plot ("Q" stands for quantile)
is a probability plot, which is a graphical
method for comparing two probability
distributions by plotting their quantiles against
each other.
• If the two distributions being compared are
similar, the points in the Q–Q plot will
approximately lie on the line y = x. If the
distributions are linearly related, the points in the
Q–Q plot will approximately lie on a line.
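The points of a normal Q-Q plot can be computed without a plotting library. This sketch compares sample quantiles with quantiles of a normal distribution fitted to the sample; the plotting positions (i - 0.5)/n are one common convention and are an assumption here, as is the function name.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n are an assumed convention.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical sample; points close to the line y = x suggest normality.
points = normal_qq_points([5, 1, 4, 2, 3])
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 2), sample_q)
```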
CURVE FITTING
Curve fitting
• The general problem of finding equations
of approximating curves which fit given
sets of data is called curve fitting.
• Linear relationship – straight line
• Non linear relationship - curve
Curve fitting
• $Y = a_0 + a_1X$ : straight line
• $Y = a_0 + a_1X + a_2X^2$ : parabola/quadratic
• $Y = a_0 + a_1X + a_2X^2 + a_3X^3$ : cubic curve
• $Y = a_0 + a_1X + a_2X^2 + a_3X^3 + a_4X^4$ : quartic curve
• $Y = a_0 + a_1X + a_2X^2 + \dots + a_nX^n$ : nth degree curve
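For the straight-line case, the coefficients a0 and a1 can be estimated by least squares. This is a minimal sketch with a hypothetical function name and made-up data that lie exactly on a line; least squares is one common fitting criterion, and the slides do not state which criterion they use.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for Y = a0 + a1*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Hypothetical data generated from Y = 1 + 2X.
a0, a1 = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(a0, a1)
```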
Curve fitting
• $Y = \dfrac{1}{a_0 + a_1X}$ or $\dfrac{1}{Y} = a_0 + a_1X$ : hyperbola
• $Y = ab^X$ : exponential curve
• $Y = aX^b$ : geometric curve
• $Y = \dfrac{1}{ab^X + g}$ or $\dfrac{1}{Y} = ab^X + g$ : logistic curve
[Figures: raw data & fitted curve; polynomial curve fit; curve fitting & distribution fitting; curve fitting & confidence interval]
Multiple Regression Analysis
• Multiple regression is used to identify how
changes in two or more factors (independent
variables) contribute to change in a dependent
variable.
• There are three types of multiple regression
procedures: the backward solution, forward
solution and stepwise solution.
• The stepwise solution has an advantage over the
others because it re-tests the predictors already
in the model at each step.
Backward Solution
• This procedure is also known as the full
multiple regression model because every
predictor variable is initially entered into the
regression model.
• The variables which do not contribute
significantly to the regression model will
only be removed later.
Forward Solution
• The predictor variable is entered into the
regression model according to its
contribution to the regression.
• The first variable selected to be entered into
the model has the highest correlation with
the criterion variable.
• Selection of predictor variables then continues
until no remaining predictor variable
contributes a significant change.
Stepwise Solution
• It is a variation of the forward solution.
• The procedure for selecting predictor
variables is similar to the forward solution
except that, after each predictor variable is
selected, a second significance test is
conducted to determine the contribution of
each predictor variable entered previously.
Multiple Regression Analysis
$\hat{Y} = b_1X_1 + b_2X_2 + b_3X_3 + \dots + b_kX_k + a$
where
Ŷ is the predicted criterion variable
X is the predictor variable
b is the regression coefficient for each
predictor variable
a is the regression constant
Correlation theory
• Correlation is the degree of relationship
between variables; correlation analysis seeks
to determine how well a linear or other
equation describes or explains that
relationship.
• If all values of the variables satisfy an equation exactly: perfectly correlated.
• If there is no relationship: uncorrelated.
Correlation theory
• If only two variables are involved: simple
correlation and simple regression.
• If more than two variables are involved:
multiple correlation and multiple
regression.
Correlation theory
• The correlation is called linear if all points
in the scatter diagram seem to lie near a
line.
• A linear equation is appropriate for
purposes of regression or estimation.
• If Y tends to increase as X increases: the
correlation is called positive or direct
correlation.
Correlation theory
• If Y tends to decrease as X increases: the
correlation is called negative or inverse
correlation.
• If all points seem to lie near some curve, the
correlation is called non-linear and a non-linear
equation is appropriate for regression or
estimation.
• The non-linear correlation can be sometimes
positive or sometimes negative.
Explained & Unexplained variation
• The total variation of Y is given by:
Total variation =
unexplained variation + explained variation
$\sum (Y - \bar{Y})^2 = \sum (Y - Y_{est})^2 + \sum (Y_{est} - \bar{Y})^2$
Coefficient of Correlation
• The ratio of the explained variation to the
total variation is called the coefficient of
determination.
• The quantity r, called the coefficient of
correlation, is given by:
$r = \pm\sqrt{\dfrac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\dfrac{\sum (Y_{est} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$
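The split of the total variation, and the coefficient of correlation that follows from it, can be checked numerically. The sketch assumes Y_est comes from a least-squares straight line and takes the sign of r from the slope; the data and the function name are hypothetical.

```python
from math import copysign, sqrt


def variation_summary(xs, ys):
    """Return (total, unexplained, explained, r) for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    y_est = [intercept + slope * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - ye) ** 2 for y, ye in zip(ys, y_est))
    explained = sum((ye - y_bar) ** 2 for ye in y_est)
    r = copysign(sqrt(explained / total), slope)  # sign follows the slope
    return total, unexplained, explained, r


# Hypothetical data.
total, unexplained, explained, r = variation_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(total, 2), round(unexplained, 2), round(explained, 2), round(r, 4))
```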
Rank Correlation
• Instead of using precise values of the
variables, or when such precision is
unavailable, the data may be ranked in
order of size, importance, etc. using the
numbers 1, 2, 3, …, N.
Rank Correlation
• If two variables X and Y are ranked in such a
manner, the coefficient of rank correlation is
given by (Spearman’s formula for rank
correlation):
$r_{rank} = 1 - \dfrac{6\sum D^2}{N(N^2 - 1)}$
D = differences between ranks of corresponding values
of X & Y.
N = number of pairs of values (X,Y) in the data
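Spearman's formula translates directly into code. The sketch assumes the data are already ranked and contain no tied ranks; the function name and the example ranks are hypothetical.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """r_rank = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), assuming no tied ranks."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rankings must have the same length")
    n = len(rank_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks for N = 5 pairs: D = -1, 1, -1, 1, 0, so sum(D^2) = 4.
r_rank = spearman_rank_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r_rank)
```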
Correlation Tests
• Inferential research is conducted to
describe the characteristics of the
research subjects by identifying the
relationship between the dependent and
independent variables.
• The dependent variable is the effect; the
independent variable is the factor which
causes or effects a change in the
dependent variable.
Correlation Tests
There are 3 steps to determine relationship
between variables:
1. Identify the dependent and independent
variables in the relationship.
2. Determine the measurement for variables
in the relationship.
3. Conduct an analysis of the relationship
between variables.
Correlation Tests
• The relationship between variables is
known as correlation and the strength of a
correlation is represented by the correlation
coefficient in the correlation test.
• There are various types of correlation tests
as shown in table.
Correlation Tests
• The standard coefficient of the relationship
between two variables is the Pearson
product-moment correlation coefficient.
• The Spearman’s rho test is a non-parametric
test. It is used to analyse data which are not
normally distributed. Because it works on
ranks, it does not require the two variables
to be linearly related.
Correlation Tests
• The Spearman’s rho test is conceptually
similar to the Pearson r test.
• However, the Pearson r test is used to
identify correlation between two sets of
interval or ratio scale data while the
Spearman’s rho test is used to analyse
correlation between two sets of ordinal
scale data.
Correlation Tests
• In some cases, the data collected from a
sample is not ordinal, interval or ratio scale
data; instead, it is nominal scale data.
• The two correlation tests (Pearson r and
Spearman’s rho) are not suitable for
analysing nominal scale data.
Correlation Tests
• Correlation between two nominal scale
variables can be analysed by using the
Cramer’s V test.
• It is calculated based on the chi-square
(χ2) value.
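A sketch of Cramer's V computed from a contingency table is shown below. It uses the standard definition V = sqrt(χ2 / (n · min(rows - 1, columns - 1))), which the slide itself does not spell out; the function name and the 2 × 2 counts are hypothetical.

```python
from math import sqrt


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    smaller_dimension = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * smaller_dimension))


# Hypothetical 2 x 2 table of counts for two nominal variables.
v = cramers_v([[10, 20], [20, 10]])
print(round(v, 4))
```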
Type of Correlation Tests
Correlation test : Type of measurement
Pearson product-moment coefficient : It states the relationship between variables using the
interval and ratio scales.
Point-biserial coefficient : It states the relationship between an interval or ratio
scale variable and a nominal scale variable.
Spearman’s rho or eta coefficient : It states the relationship between variables when the
distribution of data is not normal and where both
variables are in ordinal scale which are arranged
according to rank.
Type of Correlation Tests
Correlation test : Type of measurement
Biserial coefficient : It is similar to the point-biserial coefficient
where one of the variables is measured in
the interval or ratio scale whereas the other
variable is in the ordinal scale.
Tetrachoric coefficient : It is similar to the Phi coefficient which
states the relationship of variables in the
nominal scale. The difference is that this
coefficient is used when the researcher
estimates that both variable scales have
ranking and the data distribution is normal.
Type of Correlation Tests
Correlation test : Type of measurement
Cramer, Phi and Lambda coefficient : Used when variables are in the nominal
scale and each variable has more than two
categories.
Rank-biserial coefficient : It is similar to the point-biserial coefficient
where one variable in the relationship is in
the nominal scale and the other variable is
in the ordinal scale.
The Strength of coefficient, r
Correlation coefficient (r) Correlation strength
0.91 – 1.00 Very strong
0.71 – 0.90 Strong
0.51 – 0.70 Average/medium
0.31 – 0.50 Weak
0.01 – 0.30 Very weak
0.00 No correlation
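The strength table can be applied as a simple lookup. The sketch makes two assumptions not stated on the slide: the table is applied to the absolute value of r, so negative correlations are graded the same way, and the class boundaries are taken as 0.90, 0.70, 0.50 and 0.30. The function name is hypothetical.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the strength labels of the table."""
    size = abs(r)  # assumption: the sign of r is ignored when grading strength
    if size > 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    if size > 0.90:
        return "Very strong"
    if size > 0.70:
        return "Strong"
    if size > 0.50:
        return "Average/medium"
    if size > 0.30:
        return "Weak"
    if size > 0:
        return "Very weak"
    return "No correlation"


print(correlation_strength(-0.8))
```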
Homogeneity of Variance
• Certain tests (e.g. ANOVA) require that the
variances of different populations are equal.
• This can be determined by the following
approaches:
1. Comparison of graphs (esp. box plots)
2. Comparison of variance, standard deviation
and IQR statistics
3. Statistical tests
Homogeneity of Variance
• The F test presented in Two Sample Hypothesis
Testing of Variances can be used to determine
whether the variances of two populations are
equal.
• For three or more variables the following
statistical tests for homogeneity of variances are
commonly used:
1. Levene’s test
2. Fligner Killeen test
3. Bartlett’s test
Homogeneity of Variance
• Ways of dealing with models where the
variances are not sufficiently homogeneous
(i.e. they are heterogeneous):
1. Non-parametric test: Kruskal-Wallis
2. Modified tests: Brown-Forsythe and Welch’s
ANOVA test
3. Transformations (square root, logarithmic)
Outliers
• The presence of outliers can be identified in the
following ways:
1. Side by side plotting of the raw data
(histograms and box plots).
2. Examination of residuals.
Residuals for Levene’s test,
$e_{ij} = x_{ij} - \bar{x}_j$
Outliers
• The residual is a measure of how far away an
observation is from its group mean value (our
best guess of the value).
• If an observation has a large residual, we
consider it a potential outlier.
• To determine how large a residual must be to be
classified as an outlier we use the fact that if the
population is normally distributed, then the
residuals are also normally distributed with
distribution $e_{ij} \sim N(0, MS_W)$
TIME SERIES
Time Series
• A time series is a set of observations taken at
specified times, usually at equal intervals.
• Time series has certain characteristic
movements or variations.
• The analysis has great value in the problem of
forecasting future movement.
• Many industries & governmental agencies are
concerned with this analysis.
Analysis of Time Series
• A time series is a sequence of data points,
typically consisting of successive
measurements made over a time interval.
• Examples of time series are ocean tides &
rainfall.
• Time series are very frequently plotted via
line charts.
Analysis of Time Series
• Time series are used in pattern
recognition, weather forecasting,
earthquake prediction, econometrics,
mathematical finance, intelligent transport
forecasting, astronomy and largely in any
domain of applied science and
engineering which involves temporal
measurements.
Analysis of Time Series
• Methods for time series analyses may be
divided into two classes: frequency-domain
methods and time-domain methods.
• The former include spectral analysis and,
more recently, wavelet analysis; the latter include
auto-correlation and cross-correlation
analysis.
Analysis of Time Series
• Time series analysis techniques may be
divided into parametric and non-
parametric.
• Methods of time series analysis may also
be divided into linear and non-linear, and
univariate and multivariate.
Analysis of Time Series
• The parametric approaches assume that
the underlying stationary stochastic
process has a certain structure which can
be described using a small number of
parameters (for example, using an
autoregressive or moving average model).
• In these approaches, the task is to
estimate the parameters of the model that
describes the stochastic process.
Analysis of Time Series
• By contrast, non-parametric approaches
explicitly estimate the covariance or the
spectrum of the process without assuming
that the process has any particular
structure.
Classification of Time Series
1. Long term or secular movement or long
term trend.
2. Cyclical movements or cyclical variations.
3. Seasonal movements or seasonal
variations.
4. Irregular or random movements.
Secular Movement
• The increase or decrease in the
movements of a time series is called
secular trend.
• A time series data may show upward trend
or downward trend for a period of years
and this may be due to factors like
increase in population, change in
technological progress, shift in consumer
demands.
Cyclical Movement
• Cyclical variations are recurrent upward or
downward movements in a time series, where
the period of the cycle is greater than a year.
• These variations are also not as regular as
seasonal variations.
• Example: A business cycle showing these
oscillatory movements has to pass through
four phases: prosperity, recession,
depression, recovery.
Seasonal Variation
1. Seasonal variations are short-term
fluctuations in a time series which occur
periodically within a year.
2. They continue to repeat year after year.
3. The major factors are climatic conditions
and the customs of people; for example, more
woollen clothes are sold in winter and
more ice cream is sold in summer.
Irregular or Random Movement
1. Irregular variations are fluctuations in time series
that are short in duration, erratic in nature and
follow no regularity in the occurrence pattern.
2. These variations are also referred to as residual
variations since, by definition, they represent what
is left in a time series after trend, cyclical and
seasonal variations are removed.
3. Irregular fluctuations result from the
occurrence of unforeseen events such as floods
and earthquakes.
Long term trend
[Figure: number of students (0 to 60) plotted against time (0 to 12)]
Long term trend & cyclical movement
[Figure: climate parameter (0 to 60) plotted against time (0 to 12)]
[Figures: upward trend; irregular time series; seasonal & irregular; multiplicative (seasonal fluctuation varies)]
Time Series
• A time series is a series of data points indexed
(or listed or graphed) in time order.
• Most commonly, a time series is a sequence
taken at successive equally spaced points in
time.
• Thus it is a sequence of discrete-time data.
Examples of time series are heights of ocean
tides, rainfall, streamflow, sediment, and daily
traffic flow on the roadway.
Time Series
• Time series are very frequently plotted via line
charts.
• Time series are used in pattern recognition,
weather forecasting, intelligent transport,
earthquake prediction.
• Time series are used largely in any domain of
applied science and engineering which involves
temporal measurements.
Time Series
1. A time series typically consists of a set of
observations on a variable, y, taken at
equally spaced intervals over time.
2. There are two aspects to the study of time
series: Analysis and Modelling.
Time Series Analysis
o The aim of analysis is to summarise the
properties of a series and to characterize
its salient features.
o This may be done either in the time domain or
in the frequency domain.
Time Series Analysis
o In the time domain attention is focused on
the relationship between observations at
different points in time, while in the
frequency domain it is cyclical movements
which are studied.
o Time series analysis comprises methods
for analyzing time series data in order to
extract meaningful statistics and other
characteristics of the data
Time Series Modelling
1. The main reason for modelling a time series is to
enable forecasts of future values to be made.
2. The movements in yt are explained solely in
terms of its own past, or by its position in relation
to time.
3. Forecasts are then made by extrapolation.
4. Time series forecasting is the use of a model to
predict future values based on previously
observed values.
Components of time series
o A time series is essentially composed of
the following four components:
1. Trend
2. Seasonality
3. Cycle
4. Residuals
Trend
o The trend can usually be detected by
inspection of the time series.
o It can be upward, downward or constant,
depending on the slope of the trend-line.
o The trend-line equation of the line is
actually the equation of the regression line
of y(t) on t.
Seasonality
o The seasonal factor can easily be detected
from the graph of the time series.
o It is usually represented by peaks and
troughs occurring at regular time intervals,
suggesting that the variable attains maxima
and minima.
o The time interval between any two
successive peaks or troughs is known as
the period.
Cycle
o A cycle resembles a season except that its
period is usually much longer.
o Cycles occur as a result of changes of
qualitative nature, that is, changes in taste,
fashion and trend for example.
o A cycle is very hard to detect visually from
a time series graph and is thus very often
assumed to be negligible, especially for
short-term data
Residuals
o Residuals, also known as errors, are attributed to
unpredictable external factors, commonly known as
freaks of nature.
o They are the differences between the expected
and observed values of the variable.
o The expected (theoretical) values are the combination
(addition or multiplication) of trend, seasonality and cycle.
o It is assumed that residuals are normally
distributed and that, over a long range of time, they
cancel one another in such a way that their sum is
zero.
Trend Tests
o Trend detection: Mann-Kendall test,
Seasonal Mann-Kendall test, Correlated
seasonal Mann-Kendall test, Partial Mann-Kendall
test, Partial correlation trend test,
Cochran-Armitage test.
o Magnitude of trend: Sen’s slope, Seasonal
Sen’s slope.
o Change point detection: Pettitt’s test
Mann-Kendall Trend Test
o Mann-Kendall trend test is a nonparametric
test used to identify a trend in a series,
even if there is a seasonal component in
the series.
o The Mann-Kendall test compares the
direction of change for all possible time
period combinations to determine whether
the overall trend is increasing (upward) or
decreasing (downward).
Mann-Kendall Trend Test
• The null hypothesis H0 for these tests is that
there is no trend in the series.
• The three alternative hypotheses are that there
is a negative, non-null, or positive trend.
• The Mann-Kendall tests are based on the
calculation of Kendall's tau measure of
association between two samples, which is itself
based on the ranks within the samples.
• The computations assume that the observations
are independent.
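A minimal sketch of the non-seasonal Mann-Kendall test follows. It assumes independent observations with no tied values and uses the normal approximation with a continuity correction, which is usually considered adequate only when the series is not very short. The function name and the example series are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_kendall(series):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test."""
    n = len(series)
    s = 0
    # Compare the direction of change for every pair of time points.
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18  # valid when there are no ties
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p_value


# Hypothetical steadily increasing series of 10 observations.
s, z, p_value = mann_kendall([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(s, round(z, 3), p_value < 0.05)
```

A positive S with a small p-value points to an upward trend, a negative S to a downward one.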
Sen’s Slope Trend Test
o If a linear trend is present in a time series, then
the true slope (change per unit time) can be
estimated by using a simple nonparametric
procedure developed by Sen (1968) known as
Sen’s slope estimator.
f(t) = Qt + B
Where Q is the slope, B is a constant
o Could be seasonal slope estimator or non-
seasonal slope estimator
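The non-seasonal Sen's slope is the median of the slopes between all pairs of points. In the sketch below the constant B is taken as the median of y - Q·t, which is one common choice and an assumption here; the function name and data are hypothetical.

```python
from statistics import median


def sens_slope(times, values):
    """Return (Q, B) for f(t) = Q*t + B using the median of pairwise slopes."""
    n = len(times)
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    q = median(slopes)
    # Assumed intercept estimate: median of the residual offsets.
    b = median([v - q * t for t, v in zip(times, values)])
    return q, b


# Hypothetical series following y = 1 + 2t, with one outlier at the end.
times = [0, 1, 2, 3, 4, 5]
values = [1, 3, 5, 7, 9, 100]
q, b = sens_slope(times, values)
print(q, b)
```

The single outlier does not move the estimated slope, which illustrates why the estimator is described as robust.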
Sen’s Slope Trend Test
• Sen's slope is computed if we request to take into
account the autocorrelation(s).
• The Sen’s slope estimator is an unbiased
estimator of the true slope in simple linear
regression and, compared with the OLS estimate,
is less sensitive to outliers.
• Inference on Sen’s slope estimates may be
affected by the presence of autocorrelation, and
a consensus is still needed on how to make such
adjustments.
Cochran-Armitage Trend Test
• The Cochran–Armitage test for trend is used in
categorical data analysis when the aim is to
assess the presence of an association
between a variable with two categories and a
variable with k ordered categories.
• It is the most frequently used test for trend among
binomial proportions, applied when the objective
is to assess the presence of an association with
some binary variable.
Cochran-Armitage Trend Test
• This test can be performed, for example, to
analyse animal carcinogenicity studies, genetic
association studies, controlled clinical trials, items
in questionnaire-based studies on functional
limitations & disabilities, and community-based
surveys.
• For example, doses of a treatment can be
ordered as 'low', 'medium', and 'high', and we
may suspect that the treatment benefit cannot
become smaller as the dose increases.
Type of time series models
o There are two types of time series models;
additive and multiplicative.
o In the additive model, the components are
added and, in the multiplicative model, they
are multiplied.
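o Using T for trend, C for cycle, S for season
and R for residuals, the two models are written as follows.
Type of time series model
Additive model: Y = T + C + S + R
Multiplicative model: Y = T × C × S × R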
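The difference between the two model types can be seen by building a series each way from the same trend. All numbers below are hypothetical; the cycle and residual components are left at their neutral values (0 when adding, 1 when multiplying) to keep the sketch short.

```python
# Hypothetical upward trend over 8 time steps.
trend = [10.0 + 2.0 * t for t in range(8)]

# Seasonal component: fixed amounts for the additive model,
# proportional factors for the multiplicative model.
seasonal_amounts = [3.0, -3.0] * 4
seasonal_factors = [1.2, 0.8] * 4

additive = [t + s for t, s in zip(trend, seasonal_amounts)]        # Y = T + S
multiplicative = [t * s for t, s in zip(trend, seasonal_factors)]  # Y = T x S

# Size of the seasonal swing around the trend at each time step.
additive_swing = [abs(y - t) for y, t in zip(additive, trend)]
multiplicative_swing = [abs(y - t) for y, t in zip(multiplicative, trend)]

print([round(x, 1) for x in additive_swing])
print([round(x, 1) for x in multiplicative_swing])
```

In the additive series the seasonal swing stays the same size, while in the multiplicative series it grows with the level of the trend.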
Time Series Models
o Models for time series data can have many
forms and represent different processes.
o When modelling variations in the level of a
process, three broad classes of practical
importance are the autoregressive (AR)
models, the integrated (I) models, and the
moving average (MA) models.
Time Series Models
o These three classes depend linearly on
previous data points.
o Combinations of these ideas produce
autoregressive moving average (ARMA)
and autoregressive integrated moving
average (ARIMA) models.
INDEX NUMBERS
Index Numbers
• Price Index (Fisher’s & Marshall-Edgeworth)
• Quantity or Volume Index Numbers
• Value Relatives
• Link and Chain Relatives
• Deflation of Time Series (seasonal index)
1) Use Average or Relative Method
2) Use Weighted Aggregate Method
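The weighted aggregate price indexes named above can be sketched as follows. The Laspeyres and Paasche indexes are included as the standard building blocks of Fisher's index; they are not named on the slide. The function name, prices and quantities are hypothetical.

```python
from math import sqrt


def price_indexes(p0, q0, p1, q1):
    """Weighted aggregate price indexes (base period = 100).

    p0, q0: base-period prices and quantities; p1, q1: given-period values.
    """
    laspeyres = sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0))
    paasche = sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))
    fisher = sqrt(laspeyres * paasche)  # geometric mean of the two
    combined_q = [a + b for a, b in zip(q0, q1)]
    marshall_edgeworth = (sum(a * b for a, b in zip(p1, combined_q))
                          / sum(a * b for a, b in zip(p0, combined_q)))
    return {
        "laspeyres": 100 * laspeyres,
        "paasche": 100 * paasche,
        "fisher": 100 * fisher,
        "marshall_edgeworth": 100 * marshall_edgeworth,
    }


# Hypothetical prices and quantities for two commodities.
indexes = price_indexes(p0=[2.0, 5.0], q0=[10, 4], p1=[3.0, 6.0], q1=[8, 5])
for name, value in indexes.items():
    print(name, round(value, 2))
```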
Index Numbers - example
• Wage index, production index
• Unemployment index, cost of living index
• Consumer price index
• Standardized precipitation index
• Palmer drought index, seasonal index
• Construction cost index
• Water quality index
Index Numbers
• An index number is a statistical measure
designed to show changes in a variable or
group of related variables with respect to
time, geographic location or another
characteristic such as income or profession.
• A collection of index numbers for different
years, location, etc., is sometimes called
an index series.
Index Numbers
• By using index numbers we can, for
example, compare food or other living
costs in a city during one year with those
of a previous year, or we can compare
steel production during a given year in one
part of a country with that in another part.
• Although mainly used in business and
economics, index numbers can be applied
in many other fields (civil engineering).
End of Presentation
Thank you
Sobri Harun, UTM Skudai
October, 2016