CHAPTER 5 STATISTICS
5.1 DATA SUMMARY AND DISPLAY
What is statistics? The word has several meanings:
- Numerical facts.
- A field or discipline of study.
- A collection of methods for planning experiments, obtaining data, and then organizing, analyzing, interpreting and drawing conclusions or making decisions.
Slide 3
BASIC TERMS IN STATISTICS
Population - The entire collection of individuals or objects whose characteristics are being studied.
Sample - A subset of the population.
Slide 4
Census - A survey that includes every member of the population.
Sample survey - A technique for collecting information from only a portion of the population.
Element - A specific subject or object about which information is collected.
Variable - A characteristic that takes different values for different elements.
Slide 5
Observation - The value of a variable for an element.
Data set - A collection of observations on one or more variables.

Table 1: Students' Scores for Business Statistics
Name                     Score
Mohd Amirul bin Hamdi    90
Hashimah                 78

In Table 1, each student is an element, Score is the variable, and each recorded score is an observation (measurement).
Slide 6
TYPES OF VARIABLES
Variable
- Quantitative
  - Discrete (e.g., number of houses, cars, accidents)
  - Continuous (e.g., length, age, height, weight, time)
- Qualitative (e.g., gender, marital status)
Slide 7
QUANTITATIVE AND QUALITATIVE VARIABLE
1) Quantitative variable
A variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data. There are two types of quantitative variables:
i. Discrete variable: a variable whose values are countable; it can assume only certain values, with no intermediate values.
ii. Continuous variable: a variable that can assume any numerical value over a certain interval or intervals.
2) Qualitative variable
A variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories. Data collected on such a variable are called qualitative data.
Slide 8
STATISTICS
- DESCRIPTIVE STATISTICS: using tables, graphs and summary measures.
- INFERENTIAL STATISTICS: using sample results to make decisions or predictions about a population. Also called inductive reasoning or inductive statistics.
Slide 9
Descriptive Statistics
Consists of methods for organizing, displaying and describing data by using tables, graphs and summary measures. In general it is divided into two categories:
- Data presentation (display)
- Statistics (summary measures)
Slide 10
Inferential Statistics
Consists of methods that use sample results to help make decisions or predictions about a population. It is the area of statistics that deals with decision-making procedures.
Example: To find the starting salary of a college graduate, we may select 2000 recent college graduates, find their starting salaries and make a decision based on this information.
Slide 11
DATA PRESENTATION
Data with a large number of observations are usually uninformative in raw form:
- We cannot get much information from the raw data.
- We have to summarize or organize the data so that we can extract information from them.
Slide 12
DATA PRESENTATION OF QUALITATIVE DATA
Tabular presentation of qualitative data is usually in the form of a frequency table.
Frequency table - a table that shows the number of times each observation occurs in the data.
A graphic display can reveal at a glance the main characteristics of a data set. Three types of graphs are used to display qualitative data:
- bar graph
- pie chart
- line chart
Slide 13
Example 5.1
Table 5.1 shows the data of 50 UNIMAP students together with their background. Codes used:
For gender: 1 is male and 2 is female.
For ethnic group: 1 is Malay, 2 is Chinese, 3 is Indian and 4 is others.
Not much information can be obtained from the data in raw form. They have to be summarized so that we can get more information.
Slide 14
If the data from Table 5.1 are summarized by gender and by ethnic group, the following frequency tables are obtained:

Table 5.2: Frequency Table for the Gender
Observation    Frequency
Male           28
Female         22
Total          50

Table 5.3: Frequency Table for the Ethnic Group
Observation    Frequency
Malay          33
Chinese        9
Indian         6
Others         2
Total          50
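A frequency table like Table 5.2 can be built by counting coded observations. The sketch below is illustrative only: the raw Table 5.1 is not reproduced here, so the list `gender_codes` is rebuilt from the published counts, and the names `gender_codes` and `labels` are assumptions made for this example.

```python
from collections import Counter

# Coded gender values rebuilt from the Table 5.2 counts (1 = male, 2 = female).
# The order of the real observations is unknown, so this list is illustrative.
gender_codes = [1] * 28 + [2] * 22
labels = {1: "Male", 2: "Female"}

# Count how many times each decoded observation occurs.
counts = Counter(labels[code] for code in gender_codes)
total = sum(counts.values())

# Print observation, frequency and relative frequency.
for observation, frequency in counts.items():
    print(f"{observation:<8}{frequency:>4}{frequency / total:>8.2f}")
print(f"{'Total':<8}{total:>4}")
```

The relative frequency column is an extra that feeds directly into a pie chart, where each sector is proportional to that share.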
Slide 15
Bar Chart
A bar chart displays a frequency distribution in graphical form. It consists of two orthogonal axes: one axis represents the observations and the other represents their frequencies. The frequency of each observation is represented by a bar.
Figure 1: Bar Chart of the Ethnic Group (data from Table 5.3)
Slide 16
2.1.2 Pie Chart
A pie chart displays a frequency distribution as proportions of a whole. It is a circle divided into sectors. Each sector represents an observation, and the area of the sector represents the proportion of the total frequency that belongs to that observation.
Figure 2: The Pie Chart for the Gender (data from Table 5.2)
Slide 17
2.1.3 Line Chart
A line chart displays the trend of the observations. It consists of two orthogonal axes: one axis represents the observations and the other represents their frequencies. The points that mark the frequencies are joined by lines.
Example: Table 2.4 below shows the number of sandpipers recorded between January 1989 and December 1989.

Table 2.4: The number of sandpipers
Jan Feb Mar Apr May June July Aug Sept Oct Nov Dec
1075 3972603161421149

Figure 3: The Line Chart for the Numbers of Common Sandpipers
Slide 18
DATA PRESENTATION OF QUANTITATIVE DATA
Tabular presentation of quantitative data is usually in the form of a frequency distribution.
Frequency distribution - a table that shows the frequency of the observations that fall inside specific classes (intervals).
There are a few graphs available for the graphical presentation of quantitative data. The most popular are:
- Histogram
- Frequency polygon
- Ogive
Slide 19
FREQUENCY DISTRIBUTION
When summarizing large quantities of raw data, it is often useful to distribute the data into classes. There are no specific rules for determining the classes, but statisticians suggest that the number of classes should be between 5 and 20.
Sturges' Rule: number of classes, $c = 1 + 3.3 \log n$, where $n$ is the number of observations in the data set.
Class width: $i = \dfrac{\text{largest value} - \text{smallest value}}{c}$
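As a rough illustration of Sturges' rule, the sketch below evaluates $c = 1 + 3.3 \log n$ for $n = 50$. It assumes the logarithm is base 10 and takes the class width as the range divided by the number of classes, which is the usual textbook convention. The function names `sturges_classes` and `class_width` are hypothetical helpers.

```python
import math

def sturges_classes(n: int) -> float:
    """Suggested number of classes, c = 1 + 3.3 * log10(n) (base 10 assumed)."""
    return 1 + 3.3 * math.log10(n)

def class_width(smallest: float, largest: float, classes: int) -> float:
    """Approximate class width: range divided by the number of classes."""
    return (largest - smallest) / classes

c = sturges_classes(50)               # roughly 6.6 for 50 observations
width = class_width(2.50, 4.00, 6)    # six classes spanning 2.50 to 4.00
print(round(c, 2), width)
```

The rule gives only a suggestion; in practice the value is rounded to a convenient whole number of classes.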
Slide 20
Example 5.2

Table 5.5: The Frequency Distribution of the Students' CGPA
CGPA (Class)    Frequency
2.50 - 2.75     2
2.75 - 3.00     10
3.00 - 3.25     15
3.25 - 3.50     13
3.50 - 3.75     7
3.75 - 4.00     3
Total           50
Slide 21
Cumulative Frequency Distributions
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class. In a cumulative frequency distribution table, each class has the same lower limit but a different upper limit.

Table 5.7: Class Limit, Class Boundaries, Class Width, Cumulative Frequency
Weekly Earnings (dollars)   Number of       Class Boundaries    Class Width   Cumulative Frequency
(Class Limit)               Employees, f
801-1000                    9               800.5 - 1000.5      200           9
1001-1200                   22              1000.5 - 1200.5     200           9 + 22 = 31
1201-1400                   39              1200.5 - 1400.5     200           31 + 39 = 70
1401-1600                   15              1400.5 - 1600.5     200           70 + 15 = 85
1601-1800                   9               1600.5 - 1800.5     200           85 + 9 = 94
1801-2000                   6               1800.5 - 2000.5     200           94 + 6 = 100
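The running totals in the last column of Table 5.7 can be reproduced with a cumulative sum. This is a minimal sketch; the variable names are chosen for illustration.

```python
from itertools import accumulate

# Class frequencies f and upper class boundaries from Table 5.7.
frequencies = [9, 22, 39, 15, 9, 6]
upper_boundaries = [1000.5, 1200.5, 1400.5, 1600.5, 1800.5, 2000.5]

# Each cumulative frequency is the sum of all frequencies up to that class.
cumulative = list(accumulate(frequencies))

for boundary, cum in zip(upper_boundaries, cumulative):
    print(f"below {boundary:>7}: {cum}")
```

These (upper boundary, cumulative frequency) pairs are also the points that would be plotted on an ogive.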
Slide 22
Histogram
A histogram looks like a bar chart, except that the horizontal axis represents quantitative data (the classes) and there are no gaps between the bars.
Slide 23
Frequency Polygon
A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks (midpoints) of the quantitative data.
Slide 24
Ogive
An ogive is a line graph in which the horizontal axis represents the upper boundary of each class interval and the vertical axis represents the cumulative frequencies.
Slide 25
DATA SUMMARY
What is a statistic? A statistic is a number that describes a sample, such as the sample mean, which describes the sample average.
Types of statistics:
i. Measures of central tendency
ii. Measures of dispersion
Slide 26
MEASURE OF CENTRAL TENDENCY
There are three popular measures of central tendency: mean, median and mode.
1) Mean
The mean of a sample is the sum of the measurements divided by the number of measurements in the set. The sample mean is denoted by $\bar{x}$.
Mean = Sum of all values / Number of values
The mean can be obtained as follows:
- For raw data, the mean is defined by $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Slide 27
- For tabular (grouped) data, the mean is defined by $\bar{x} = \dfrac{\sum f x}{\sum f}$, where $f$ = class frequency and $x$ = class mark (midpoint).
Slide 28
Example: The sample mean for the students' CGPA (raw data) is found by applying $\bar{x} = \sum x / n$ to the raw observations.
Slide 29
Example: The sample mean for Table 5.8

Table 5.8
CGPA (Class)    Frequency, f    Class Mark (Midpoint), x    fx
2.50 - 2.75     2               2.625                       5.250
2.75 - 3.00     10              2.875                       28.750
3.00 - 3.25     15              3.125                       46.875
3.25 - 3.50     13              3.375                       43.875
3.50 - 3.75     7               3.625                       25.375
3.75 - 4.00     3               3.875                       11.625
Total           50                                          161.750
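The grouped mean for Table 5.8 follows from dividing the total of the fx column by the total frequency. The sketch below recomputes the class marks and the sums, assuming the class mark is the midpoint of the two class limits. The variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each CGPA class in Table 5.8.
classes = [
    (2.50, 2.75, 2),
    (2.75, 3.00, 10),
    (3.00, 3.25, 15),
    (3.25, 3.50, 13),
    (3.50, 3.75, 7),
    (3.75, 4.00, 3),
]

# Class mark x = midpoint of the class limits.
class_marks = [(lower + upper) / 2 for lower, upper, _ in classes]

sum_f = sum(f for _, _, f in classes)
sum_fx = sum(f * x for (_, _, f), x in zip(classes, class_marks))

# Grouped mean = sum(f * x) / sum(f).
grouped_mean = sum_fx / sum_f
print(sum_f, sum_fx, grouped_mean)
```

With the table totals this is 161.750 / 50 = 3.235.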
Slide 30
2) Median
The median is the middle value of a set of observations arranged in order of magnitude and is normally denoted by $\tilde{x}$.
i) The median for ungrouped data
- The median depends on the number of observations in the data, $n$.
- If $n$ is odd, then the median is the $\left(\frac{n+1}{2}\right)$th observation of the ordered observations.
- If $n$ is even, then the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2}+1\right)$th observation.
Slide 31
ii) The median of grouped data (frequency distribution)
The median of a frequency distribution is defined by:
$\tilde{x} = L + c\left(\dfrac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right)$
where
$L$ = the lower class boundary of the median class;
$c$ = the size of the median class interval;
$F_{j-1}$ = the sum of frequencies of all classes lower than the median class;
$f_j$ = the frequency of the median class.
Slide 32
Example for ungrouped data:
The median of the data 4, 6, 3, 1, 2, 5, 7, 3 is 3.5.
Proof: Rearranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is, $(3 + 4)/2 = 3.5$.
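The odd/even rule for the ungrouped median can be sketched directly and cross-checked against Python's standard `statistics.median`. The variable names are chosen for illustration.

```python
import statistics

data = [4, 6, 3, 1, 2, 5, 7, 3]

# Step 1: arrange the observations in order of magnitude.
ordered = sorted(data)
n = len(ordered)

# Step 2: pick the middle observation (n odd) or average the two middle ones (n even).
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]                    # the ((n+1)/2)th observation
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # (n/2)th and (n/2 + 1)th

# The standard library applies the same rule.
library_median = statistics.median(data)
print(median, library_median)
```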
Slide 33
Example for grouped data:
Find the median for the frequency distribution below.
CGPA (Class)    Frequency, f    Cum. frequency
2.50 - 2.75     2               2
2.75 - 3.00     10              12
3.00 - 3.25     15              27
3.25 - 3.50     13              40
3.50 - 3.75     7               47
3.75 - 4.00     3               50
Total           50
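One way to work this example is sketched below. It assumes the interpolation form $L + c\,(n/2 - F)/f$, with the median class taken as the first class whose cumulative frequency reaches $n/2$; some texts use $(n+1)/2$ instead, which gives a slightly different value. The function name `grouped_median` is a hypothetical helper.

```python
def grouped_median(boundaries, frequencies):
    """Median of a frequency distribution by interpolation: L + c * (n/2 - F) / f.

    boundaries: (lower, upper) class boundaries in ascending order.
    frequencies: class frequencies in the same order.
    """
    n = sum(frequencies)
    cumulative = 0  # F: total frequency of the classes below the current one
    for (lower, upper), f in zip(boundaries, frequencies):
        if cumulative + f >= n / 2:  # first class to reach n/2 is the median class
            return lower + (upper - lower) * (n / 2 - cumulative) / f
        cumulative += f
    raise ValueError("empty frequency distribution")

cgpa_boundaries = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
                   (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
cgpa_frequencies = [2, 10, 15, 13, 7, 3]

median = grouped_median(cgpa_boundaries, cgpa_frequencies)
print(round(median, 4))
```

Here the median class is 3.00 - 3.25, so the calculation is 3.00 + 0.25 × (25 − 12) / 15, or about 3.22.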
Slide 34
3) Mode
The mode of a set of observations is the observation with the highest frequency and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data.
i) Mode of ungrouped data
- Defined as the value that occurs most frequently.
- The mode has the advantage that it is easy to calculate and is not affected by extreme values.
- However, the mode may not exist, and even if it does exist, it may not be unique.
Slide 35
*Note: If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data are treated as having no mode.
ii) The mode for grouped data (frequency distribution)
- When data have been grouped in classes and a frequency curve is drawn to fit the data, the mode is the value of $x$ corresponding to the maximum point on the curve.
Slide 36
The mode for grouped data (frequency distribution) is defined by:
$\hat{x} = L + c\left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
where
$L$ = the lower class boundary of the modal class;
$c$ = the size of the modal class interval;
$\Delta_1$ = the difference between the modal class frequency and the frequency of the class before it; and
$\Delta_2$ = the difference between the modal class frequency and the frequency of the class after it.
*Note: The class which has the highest frequency is called the modal class.
Slide 37
Example for ungrouped data: The mode for the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.
Example for grouped data based on the table below:
CGPA (Class)    Frequency
2.50 - 2.75     2
2.75 - 3.00     10
3.00 - 3.25     15   (modal class)
3.25 - 3.50     13
3.50 - 3.75     7
3.75 - 4.00     3
Total           50
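Both mode examples can be checked with a short sketch. The grouped version uses the form lower boundary + class size × d1 / (d1 + d2), where d1 and d2 are the frequency differences to the neighbouring classes. Treating a missing neighbour as frequency 0 is an assumption of this sketch, and `grouped_mode` is a hypothetical helper name.

```python
import statistics

def grouped_mode(boundaries, frequencies):
    """Mode of a frequency distribution: L + c * d1 / (d1 + d2)."""
    # Index of the modal class (highest frequency; first one if tied).
    modal = max(range(len(frequencies)), key=frequencies.__getitem__)
    lower, upper = boundaries[modal]
    # Neighbouring frequencies; a missing neighbour is treated as 0 (assumption).
    before = frequencies[modal - 1] if modal > 0 else 0
    after = frequencies[modal + 1] if modal < len(frequencies) - 1 else 0
    d1 = frequencies[modal] - before
    d2 = frequencies[modal] - after
    return lower + (upper - lower) * d1 / (d1 + d2)

# Ungrouped data: the most frequent value.
ungrouped_mode = statistics.mode([4, 6, 3, 1, 2, 5, 7, 3])

# Grouped CGPA data from the table.
cgpa_boundaries = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
                   (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
cgpa_frequencies = [2, 10, 15, 13, 7, 3]
mode = grouped_mode(cgpa_boundaries, cgpa_frequencies)
print(ungrouped_mode, round(mode, 4))
```

For the CGPA table the differences are 15 − 10 = 5 and 15 − 13 = 2, giving 3.00 + 0.25 × 5/7, or about 3.18.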
Slide 38
Measure of Dispersion
A measure of dispersion (spread) is the degree to which a set of data tends to spread around the average value. It shows whether the data set is concentrated around the mean or scattered.
The common measures of dispersion are:
1) range
2) variance
3) standard deviation
The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.
Slide 39
Slide 40
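Variance
i) Variance for ungrouped data
The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
ii) Variance for grouped data
The variance for a frequency distribution is defined by:
$s^2 = \dfrac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}$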
Slide 41
Example for ungrouped data: The incomes of 5 workers are RM 1000, RM 2500, RM 2000, RM 4000 and RM 3500. Find the variance of these data.
Solution:
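A possible way to work this example is sketched below, assuming the sample variance with divisor $n - 1$. It evaluates both the definitional and the shortcut form and cross-checks them against `statistics.variance` from the standard library.

```python
import statistics

incomes = [1000, 2500, 2000, 4000, 3500]  # RM, five workers
n = len(incomes)
mean = sum(incomes) / n

# Definitional form: sum of squared deviations from the mean over (n - 1).
s2_definition = sum((x - mean) ** 2 for x in incomes) / (n - 1)

# Shortcut form: (sum(x^2) - (sum x)^2 / n) / (n - 1).
s2_shortcut = (sum(x * x for x in incomes) - sum(incomes) ** 2 / n) / (n - 1)

# Standard deviation is the square root of the variance.
s = s2_definition ** 0.5

library_variance = statistics.variance(incomes)
print(mean, s2_definition, round(s, 2))
```

The mean is RM 2600, the squared deviations sum to 5,700,000, and dividing by 4 gives a variance of 1,425,000 (RM squared).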
Slide 42
Example for grouped data: The variance for the frequency distribution in the table is:
Class boundaries    Frequency, f    Class Mark, x    fx         fx^2
2.50 - 2.75         2               2.625            5.250      13.781
2.75 - 3.00         10              2.875            28.750     82.656
3.00 - 3.25         15              3.125            46.875     146.484
3.25 - 3.50         13              3.375            43.875     148.078
3.50 - 3.75         7               3.625            25.375     91.984
3.75 - 4.00         3               3.875            11.625     45.047
Total               50                               161.750    528.031
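The column totals above feed into the grouped variance. The sketch below assumes the shortcut form $(\sum f x^2 - (\sum f x)^2 / \sum f) / (\sum f - 1)$; the variable names are illustrative.

```python
# Class marks x and frequencies f from the table.
class_marks = [2.625, 2.875, 3.125, 3.375, 3.625, 3.875]
frequencies = [2, 10, 15, 13, 7, 3]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))

# Grouped sample variance, shortcut form with divisor (sum f - 1).
variance = (sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1)
std_dev = variance ** 0.5
print(round(sum_fx2, 3), round(variance, 4), round(std_dev, 4))
```

With the totals 161.750 and 528.031, the numerator is about 4.77 and the variance is about 0.0973.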
Slide 43
ESTIMATION
Introduction
The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions.
Slide 44
ESTIMATOR VS ESTIMATE
Estimator - In statistics, the method used to obtain the estimate.
Estimate - The value obtained from a sample.
Example: I have a sample of 5 numbers and I take the average. The estimator is the act of taking the average of the sample; it is the estimator of the mean. If, say, the average is 4, then 4 is the estimate.
Slide 45
CONFIDENCE INTERVAL ESTIMATES
Definition: An Interval Estimate
In interval estimation, an interval is constructed around the point estimate and it is stated that this interval is likely to contain the corresponding population parameter.
Definition: Confidence Level and Confidence Interval
Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by $(1 - \alpha)100\%$.
Slide 46
CONFIDENCE INTERVAL ESTIMATES FOR POPULATION MEAN
Slide 47
Slide 48
EXAMPLE
Slide 49
SOLUTION
Slide 50
Slide 51
Example: A publishing company has just published a new textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market. The research department at the company took a sample of 36 comparable textbooks and collected information on their prices. This information produced a mean price of RM 70.50 for this sample. It is known that the standard deviation of the prices of all such textbooks is RM 4.50. Construct a 90% confidence interval for the mean price of all such college textbooks.
Slide 52
Solution
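One way to work the textbook example is sketched below. Because the population standard deviation is known, it assumes the interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. The function `z_interval` is a hypothetical helper, and the critical value comes from `statistics.NormalDist` rather than a printed table, so the last decimals may differ slightly from a table-based answer.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)             # margin of error
    return mean - margin, mean + margin

# Sample of 36 textbooks: mean RM 70.50, population sigma RM 4.50, 90% level.
low, high = z_interval(mean=70.50, sigma=4.50, n=36, confidence=0.90)
print(round(low, 2), round(high, 2))
```

With $z_{0.05} \approx 1.645$ the margin is about RM 1.23, so the interval runs from roughly RM 69.27 to RM 71.73.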
Slide 53
Slide 54
EXAMPLE
Consider a survey on the height of male students in a certain IPTA. A random sample of 100 male students is taken. The height of the male students is normally distributed with mean 178.2 cm and variance 17.75 cm$^2$.
i) Construct a 95% CI for the mean height of male students.
ii) If the mean height of female students is 170.2 cm, verify at the 98% confidence level whether this shows that male students are taller than female students.
Slide 55
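SOLUTION
i) It is known that the mean is 178.2 cm, the variance is 17.75 cm$^2$ and $n = 100$. For a 95% CI, $\alpha = 0.05$ and $z_{\alpha/2} = z_{0.025} = 1.96$.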
Slide 56
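Hence the 95% CI is $178.2 \pm 1.96\sqrt{17.75/100} = 178.2 \pm 0.83$, that is, approximately $(177.37, 179.03)$ cm.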
Slide 57
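ii) For a 98% CI, $\alpha = 0.02$ and $z_{\alpha/2} = z_{0.01} = 2.33$, which gives $178.2 \pm 2.33\sqrt{17.75/100}$, that is, approximately $(177.22, 179.18)$ cm. The mean height of the female students, 170.2 cm, does not lie in this interval, which indicates that the male students are taller than the female students.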
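Both parts of the height example can be checked numerically. The sketch assumes the large-sample interval mean $\pm z_{\alpha/2}\sqrt{\text{variance}/n}$ and takes critical values from `statistics.NormalDist`, so results may differ in the last decimal from table values such as 1.96 and 2.33. The helper name `mean_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_interval(mean, variance, n, confidence):
    """Large-sample confidence interval for a mean, given the variance."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(variance / n)
    return mean - margin, mean + margin

# i) 95% CI for the mean height of male students.
low95, high95 = mean_interval(178.2, 17.75, 100, 0.95)

# ii) 98% CI, then check whether the female mean falls inside it.
low98, high98 = mean_interval(178.2, 17.75, 100, 0.98)
female_mean = 170.2
female_inside = low98 <= female_mean <= high98

print(round(low95, 2), round(high95, 2))
print(round(low98, 2), round(high98, 2), female_inside)
```

The female mean lies well below the lower end of the 98% interval, which matches the conclusion drawn in the solution.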
Slide 58
CONFIDENCE INTERVAL ESTIMATES FOR POPULATION PROPORTION
Slide 59
Example
According to an analysis in Women Magazine in June 2005, stress has become a common part of everyday life among working women in Malaysia. The demands of work, family and home place an increasing burden on the average Malaysian woman. According to this poll, 40% of the working women included in the survey indicated that they had little time to relax. The poll was based on a randomly selected sample of 1502 working women aged 30 and above. Construct a 95% confidence interval for the corresponding population proportion.
Slide 60
Solution
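A sketch of one possible solution follows. It assumes the usual large-sample interval $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$; the function name `proportion_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 40% of 1502 working women reported little time to relax; 95% level.
low, high = proportion_interval(p_hat=0.40, n=1502, confidence=0.95)
print(round(low, 4), round(high, 4))
```

The margin of error is about 0.025, so the interval is roughly 37.5% to 42.5%.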
Slide 61
EXERCISE
Slide 62
HYPOTHESIS TESTS
Hypothesis and Test Procedures
A statistical test of hypothesis consists of:
1. The null hypothesis, $H_0$
2. The alternative hypothesis, $H_1$
3. The test statistic and its p-value
4. The rejection region
5. The conclusion
Slide 63
Definition
Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
Null hypothesis, $H_0$: A null hypothesis is a claim (or statement) about a population parameter that is assumed to be true. (The null hypothesis is either rejected or fails to be rejected.)
Alternative hypothesis, $H_1$: An alternative hypothesis is a claim about a population parameter that will be true if the null hypothesis is false.
Slide 64
Test statistic: a function of the sample data on which the decision is to be based.
p-value: the probability calculated using the test statistic. The smaller the p-value, the more contradictory the data are to $H_0$.
Slide 65
DEVELOPING NULL AND ALTERNATIVE HYPOTHESIS
It is not always obvious how the null and alternative hypotheses should be formulated. When formulating them, the nature or purpose of the test must also be taken into account. We will examine:
1) The claim or assertion leading to the test.
2) The null hypothesis to be evaluated.
3) The alternative hypothesis.
4) Whether the test will be two-tail or one-tail.
5) A visual representation of the test itself.
In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.
Slide 66
9.1.1 Alternative Hypothesis as a Research Hypothesis
Many
applications of hypothesis testing involve an attempt to gather
evidence in support of a research hypothesis. In such cases, it is
often best to begin with the alternative hypothesis and make it the
conclusion that the researcher hopes to support. The conclusion
that the research hypothesis is true is made if the sample data
provide sufficient evidence to show that the null hypothesis can be
rejected.
Slide 67
Example 9.1: A new drug is developed with the goal of lowering
blood pressure more than the existing drug. Alternative Hypothesis:
The new drug lowers blood pressure more than the existing drug.
Null Hypothesis: The new drug does not lower blood pressure more
than the existing drug.
Slide 68
9.1.2 Null Hypothesis as an Assumption to be Challenged
We might begin with a belief or assumption that a statement about the value of a population parameter is true. We then use a hypothesis
test to challenge the assumption and determine if there is
statistical evidence to conclude that the assumption is incorrect.
In these situations, it is helpful to develop the null hypothesis
first.
Slide 69
Example 9.2: The label on a soft drink bottle states that it contains at least 67.6 fluid ounces.
Null Hypothesis: The label is correct, $\mu \ge 67.6$ ounces.
Alternative Hypothesis: The label is incorrect, $\mu < 67.6$ ounces.
Slide 70
Example 9.3: Average tire life is 35000 miles.
Null Hypothesis: $\mu = 35000$ miles
Alternative Hypothesis: $\mu \ne 35000$ miles
Slide 77
How to decide whether to reject or accept $H_0$?
The entire set of values that the test statistic may assume is divided into two regions. One set, consisting of values that support $H_1$ and lead to rejecting $H_0$, is called the rejection region. The other, consisting of values that support $H_0$, is called the acceptance region.
$H_0$ always gets the "=" sign.

Tails of a Test
                      Two-Tailed Test    Left-Tailed Test    Right-Tailed Test
Sign in $H_0$         =                  = or $\ge$          = or $\le$
Sign in $H_1$         $\ne$              $<$                 $>$
Rejection region      In both tails      In the left tail    In the right tail
Slide 78
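Population Mean, $\mu$ ($\sigma$ known and unknown)
Null hypothesis: $H_0: \mu = \mu_0$
Test statistic:
- Any population, $\sigma$ is known and $n$ is large, or normal population, $\sigma$ is known and $n$ is small: $Z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
- Any population, $\sigma$ is unknown and $n$ is large: $Z = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$
- Normal population, $\sigma$ is unknown and $n$ is small: $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$, with $n - 1$ degrees of freedom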
Slide 79
Alternative hypothesis    Rejection region
Slide 80
Definition: p-value
The p-value is the smallest significance level at which the null hypothesis is rejected.
Slide 81
Example
Slide 82
Solution
Slide 83
POPULATION PROPORTION, P
Alternative hypothesis    Rejection region
Slide 84
Example
When working properly, a machine that is used to make chips for calculators does not produce more than 4% defective chips. Whenever the machine produces more than 4% defective chips, it needs an adjustment. To check whether the machine is working properly, the quality control department at the company often takes a sample of chips and inspects them to determine whether they are good or defective. One such random sample of 200 chips taken recently from the production line contained 14 defective chips. Test at the 5% significance level whether or not the machine needs an adjustment.
Slide 85
Solution
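One way to carry out this test is sketched below. It assumes the hypotheses $H_0: p \le 0.04$ against $H_1: p > 0.04$ (a right-tailed test) and the large-sample statistic $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$. The function `proportion_z_test` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(defectives, n, p0, alpha):
    """Right-tailed large-sample z test for a population proportion."""
    p_hat = defectives / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic
    critical = NormalDist().inv_cdf(1 - alpha)   # right-tail critical value
    p_value = 1 - NormalDist().cdf(z)            # P(Z >= z) under H0
    return z, critical, p_value

# 14 defective chips out of 200, claimed maximum 4%, 5% significance level.
z, critical, p_value = proportion_z_test(defectives=14, n=200, p0=0.04, alpha=0.05)
reject_h0 = z > critical
print(round(z, 3), round(critical, 3), round(p_value, 4), reject_h0)
```

The sample proportion is 0.07 and the statistic is about 2.17, which exceeds the critical value of about 1.645. Under these assumptions $H_0$ is rejected, suggesting the machine needs an adjustment.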
Slide 86
REGRESSION AND CORRELATION
Regression is a statistical procedure for establishing the relationship between two or more variables. This is done by fitting a linear equation to the observed data. The regression line is then used by the researcher to see the trend and to predict values for the data.
There are two types of relationship:
- Simple (2 variables)
- Multiple (more than 2 variables)
Slide 87
THE SIMPLE LINEAR REGRESSION MODEL
The simple linear regression model is an equation that describes a dependent variable ($Y$) in terms of an independent variable ($X$) plus random error:
$Y = \beta_0 + \beta_1 X + \varepsilon$
where
$\beta_0$ = intercept of the line with the Y-axis;
$\beta_1$ = slope of the line;
$\varepsilon$ = random error.
The random error, $\varepsilon$, is the difference between a data point and the deterministic value. The regression line is estimated from the collected data by fitting a straight line to the data set and obtaining the equation of that line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
Slide 88
LEAST SQUARES METHOD
The least squares method is commonly used to determine values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that ensure a best fit of the estimated regression line to the sample data points. The straight line fitted to the data set is the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
Slide 89
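$\hat{y}$ is the estimated or predicted value of $y$ for a given value of $x$. In other words, the predicted value of the dependent variable for a given value of the independent variable $x$ can be obtained simply by substituting that value of $x$ into the fitted line. We can find the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ by using the formulas
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$.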
Slide 90
EXAMPLE
Suppose we take a sample of seven households from a low to moderate income neighborhood and collect information on their incomes and food expenditures for the past month. The information obtained (in hundreds of Ringgit Malaysia) is given below. Find the least squares regression line of food expenditure (Y) on income (X).

Income    Food expenditure
35        9
49        15
21        7
39        11
15        5
28        8
25        9
Slide 91
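SOLUTION
Income, x    Food Expenditure, y    xy      x^2     y^2
35           9                      315     1225    81
49           15                     735     2401    225
21           7                      147     441     49
39           11                     429     1521    121
15           5                      75      225     25
28           8                      224     784     64
25           9                      225     625     81
$\sum x = 212$, $\sum y = 64$, $\sum xy = 2150$, $\sum x^2 = 7222$, $\sum y^2 = 646$
Compute the least squares estimates from these totals.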
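The computation can be sketched as follows, assuming the estimators slope $= S_{xy}/S_{xx}$ and intercept $= \bar{y} - \text{slope}\cdot\bar{x}$. The variable names and the `predict` helper are illustrative.

```python
# Income (x) and food expenditure (y), in hundreds of Ringgit Malaysia.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]
n = len(income)

# Column totals from the solution table.
sum_x, sum_y = sum(income), sum(food)
sum_xy = sum(x * y for x, y in zip(income, food))
sum_x2 = sum(x * x for x in income)

# S_xy and S_xx, then the least squares slope and intercept.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
slope = s_xy / s_xx
intercept = sum_y / n - slope * sum_x / n

def predict(x):
    """Predicted food expenditure for a given income x (same units)."""
    return intercept + slope * x

print(round(intercept, 4), round(slope, 4))
```

Under these assumptions the fitted line is roughly $\hat{y} = 1.14 + 0.264x$, so each additional RM 100 of income is associated with about RM 26 more food expenditure.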
Slide 92
Slide 93
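CORRELATION (R)
Correlation measures the strength of the linear relationship between two variables. It is also known as Pearson's product moment coefficient of correlation. The symbol for the sample coefficient of correlation is $r$.
Formula: $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$, $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and $S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$.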
Slide 94
Properties of r:
- Values of r close to 1 imply a strong positive linear relationship between x and y.
- Values of r close to -1 imply a strong negative linear relationship between x and y.
- Values of r close to 0 imply little or no linear relationship between x and y.
Slide 95
EXAMPLE
Refer to the previous example. Calculate the value of r and interpret its meaning.
Solution
Using the totals from the previous example, the value of r is close to 1, which implies that there is a strong positive linear relationship between income (x) and food expenditure (y).
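The value of r for the income and food expenditure data can be checked with the sketch below, which assumes $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$ computed from the column totals. The variable names are illustrative.

```python
from math import sqrt

# Column totals from the regression example (n = 7 households).
n = 7
sum_x, sum_y = 212, 64
sum_xy, sum_x2, sum_y2 = 2150, 7222, 646

# Corrected sums of squares and cross-products.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

# Pearson's sample correlation coefficient.
r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 4))
```

The result is about 0.96, consistent with the strong positive linear relationship described above.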