CHAPTER 5 STATISTICS

5.1 DATA SUMMARY AND DISPLAY
What is statistics? The word has several meanings:
- Numerical facts.
- A field or discipline of study.
- A collection of methods for planning experiments, obtaining data, and then organizing, analyzing and interpreting the data in order to draw conclusions or make decisions.

BASIC TERMS IN STATISTICS
Population - The entire collection of individuals or objects whose characteristics are being studied.
Sample - A subset of the population.

Census - A survey that includes every member of the population.
Sample survey - A survey that collects information from only a portion of the population, chosen by a sampling technique.
Element - A specific subject or object about which information is collected.
Variable - A characteristic under study that takes different values for different elements.

Observation - The value of a variable for an element.
Data set - A collection of observations on one or more variables.

Table 1: Students' Scores for Business Statistics
Name                     Score
Mohd Amirul bin Hamdi    90
Hashimah                 78

In Table 1, each student is an element, Score is the variable, and each individual score (for example, 90) is an observation or measurement.

TYPES OF VARIABLES
Variable
- Quantitative
  - Discrete (e.g., number of houses, cars, accidents)
  - Continuous (e.g., length, age, height, weight, time)
- Qualitative (e.g., gender, marital status)

QUANTITATIVE AND QUALITATIVE VARIABLES
1) Quantitative variable
A variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data. There are two types of quantitative variables:
i. Discrete variable: a variable whose values are countable; it can assume only certain values, with no intermediate values.
ii. Continuous variable: a variable that can assume any numerical value over a certain interval or intervals.
2) Qualitative variable
A variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories. Data collected on such a variable are called qualitative data.

STATISTICS
- Descriptive statistics: using tables, graphs and summary measures.
- Inferential statistics: using sample results to make decisions or predictions about a population. Also called inductive reasoning or inductive statistics.

Descriptive Statistics
Consists of methods for organizing, displaying and describing data by using tables, graphs and summary measures. In general, it is divided into two categories:
- Data presentation (display)
- Statistics (data summary)

Inferential Statistics
Consists of methods that use sample results to help make decisions or predictions about a population. It is the area of statistics that deals with decision-making procedures.
Example:
- To find the starting salary of a college graduate, we may select 2000 recent college graduates, find their starting salaries, and make a decision based on this information.

DATA PRESENTATION
A data set with many observations is usually uninformative in its raw form; we cannot get much information from the raw data. We have to summarize or organize the data in such a way that we can get some information from it.

DATA PRESENTATION OF QUALITATIVE DATA
Tabular presentation of qualitative data usually takes the form of a frequency table.
Frequency table - a table that shows the number of times each observation occurs in the data.
A graphic display can reveal at a glance the main characteristics of a data set. Three types of graphs are used to display qualitative data:
- bar graph
- pie chart
- line chart

Example 5.1
Table 5.1 shows the background data of 50 UNIMAP students.
Codes used:
For gender: 1 is male and 2 is female.
For ethnic group: 1 is Malay, 2 is Chinese, 3 is Indian and 4 is others.
Not much information can be obtained from the data in raw form. It has to be summarized so that we can get more information.

If the data from Table 5.1 are summarized by gender and by ethnic group, the following frequency tables are obtained:

Table 5.2: Frequency Table for the Gender
Observation    Frequency
Male           28
Female         22
Total          50

Table 5.3: Frequency Table for the Ethnic Group
Observation    Frequency
Malay          33
Chinese        9
Indian         6
Others         2
Total          50
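The tallying step behind a frequency table can be sketched in Python. This is a minimal illustration only: the raw rows of Table 5.1 are not available, so the coded list below is a hypothetical stand-in built to reproduce the counts in Table 5.3, and the names `ethnic_codes` and `labels` are assumptions.

```python
from collections import Counter

# Hypothetical coded observations (1 = Malay, 2 = Chinese, 3 = Indian, 4 = others).
# These are NOT the actual rows of Table 5.1, only a list with the same counts.
ethnic_codes = [1] * 33 + [2] * 9 + [3] * 6 + [4] * 2
labels = {1: "Malay", 2: "Chinese", 3: "Indian", 4: "Others"}

# Count how many times each decoded observation occurs.
frequency = Counter(labels[code] for code in ethnic_codes)

for observation, count in frequency.items():
    print(f"{observation:<10}{count:>4}")
print(f"{'Total':<10}{sum(frequency.values()):>4}")
```

`Counter` does the counting in one pass; the same call works on the gender codes.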

Bar Chart
A bar chart displays a frequency distribution in graphical form. It consists of two orthogonal axes: one axis represents the observations and the other represents their frequencies. The frequency of each observation is represented by a bar.
*The bar chart is for the data from Table 5.3.
Figure 1: Bar Chart of the Ethnic Group

2.1.2 Pie Chart
A pie chart displays a frequency distribution as proportions of the whole. It is a circle divided into sectors. Each sector represents an observation, and the area of the sector is proportional to the frequency of that observation.
*The pie chart is for the data from Table 5.2.
Figure 2: The Pie Chart for the Gender

2.1.3 Line Chart
A line chart displays the trend of observations. It consists of two orthogonal axes: one axis represents the observations and the other represents their frequencies. The plotted frequencies are joined by lines.
Example: Table 2.4 below shows the number of sandpipers recorded between January 1989 and December 1989.

Table 2.4: The Number of Sandpipers
Jan Feb Mar Apr May June July Aug Sept Oct Nov Dec
1075 3972603161421149

Figure 3: The Line Chart for the Number of Common Sandpipers

DATA PRESENTATION OF QUANTITATIVE DATA
Tabular presentation of quantitative data usually takes the form of a frequency distribution.
Frequency distribution - a table that shows the frequency of the observations that fall inside specific classes (intervals).
There are a few graphs available for the graphical presentation of quantitative data. The most popular are:
- Histogram
- Frequency polygon
- Ogive

FREQUENCY DISTRIBUTION
When summarizing large quantities of raw data, it is often useful to distribute the data into classes. There are no specific rules for determining the classes, but statisticians suggest using between 5 and 20 classes.
Sturges' Rule
Number of classes, $c = 1 + 3.3 \log n$
where $n$ is the number of observations in the data set and $\log$ is the base-10 logarithm.
Class width:
$\text{class width} \approx \dfrac{\text{largest value} - \text{smallest value}}{c}$
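Sturges' rule can be tried out with a short Python sketch. Several choices here are assumptions rather than part of the slides: rounding the number of classes up, taking the class width as the range divided by the number of classes, and the smallest and largest values passed in (2.50 and 4.00), which are hypothetical.

```python
import math


def sturges_classes(n: int) -> int:
    """Number of classes from c = 1 + 3.3 log10(n), rounded up (an assumption)."""
    return math.ceil(1 + 3.3 * math.log10(n))


def approx_class_width(smallest: float, largest: float, classes: int) -> float:
    """Approximate class width: the range divided by the number of classes."""
    return (largest - smallest) / classes


n = 50  # number of observations, as in the 50-student examples
c = sturges_classes(n)

# Hypothetical smallest and largest values, chosen only for illustration.
width = approx_class_width(2.50, 4.00, c)
print(c, round(width, 3))
```

In practice the computed width is usually rounded to a convenient number, which is why a table may end up with a slightly different number of classes than the rule suggests.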

Example 5.2

Table 5.5: The Frequency Distribution of the Students' CGPA
CGPA (Class)    Frequency
2.50 - 2.75     2
2.75 - 3.00     10
3.00 - 3.25     15
3.25 - 3.50     13
3.50 - 3.75     7
3.75 - 4.00     3
Total           50

Cumulative Frequency Distributions
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class. In a cumulative frequency distribution table, each class has the same lower limit but a different upper limit.

Table 5.7: Class Limits, Class Boundaries, Class Width and Cumulative Frequency
Weekly Earnings in dollars (Class Limits)   Number of Employees, f   Class Boundaries   Class Width   Cumulative Frequency
801 - 1000                                  9                        800.5 - 1000.5     200           9
1001 - 1200                                 22                       1000.5 - 1200.5    200           9 + 22 = 31
1201 - 1400                                 39                       1200.5 - 1400.5    200           31 + 39 = 70
1401 - 1600                                 15                       1400.5 - 1600.5    200           70 + 15 = 85
1601 - 1800                                 9                        1600.5 - 1800.5    200           85 + 9 = 94
1801 - 2000                                 6                        1800.5 - 2000.5    200           94 + 6 = 100
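The derived columns of Table 5.7 can be reproduced with a few lines of Python. This is a sketch only; the variable names are illustrative, and the half-unit adjustment assumes the class limits are whole dollars, as they are in the table.

```python
from itertools import accumulate

# Class limits and frequencies copied from Table 5.7.
class_limits = [(801, 1000), (1001, 1200), (1201, 1400),
                (1401, 1600), (1601, 1800), (1801, 2000)]
frequencies = [9, 22, 39, 15, 9, 6]

# Class boundaries sit half a unit outside the class limits.
boundaries = [(low - 0.5, high + 0.5) for low, high in class_limits]

# Class width is the upper boundary minus the lower boundary.
widths = [upper - lower for lower, upper in boundaries]

# Running total of the frequencies gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

for limits, f, b, w, cf in zip(class_limits, frequencies, boundaries, widths, cumulative):
    print(limits, f, b, w, cf)
```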

Histogram
A histogram looks like a bar chart, except that the horizontal axis represents data that are quantitative in nature and there are no gaps between the bars.

Frequency Polygon
A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks of data that are quantitative in nature.

Ogive
An ogive is a line graph in which the horizontal axis represents the upper limits of the class intervals and the vertical axis represents the cumulative frequencies.

DATA SUMMARY
What is a statistic? A statistic is a number that describes a sample, such as the sample mean, which describes the sample average.
Types of statistics:
i. Measures of central tendency
ii. Measures of dispersion

MEASURE OF CENTRAL TENDENCY
There are three popular measures of central tendency: mean, median and mode.
1) Mean
The mean of a sample is the sum of the measurements divided by the number of measurements in the set. The sample mean is denoted by $\bar{x}$.
Mean = Sum of all values / Number of values
The mean can be obtained as follows:
- For raw data, the mean is defined by
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

- For tabular (grouped) data, the mean is defined by
$\bar{x} = \dfrac{\sum f x}{\sum f}$
where $f$ = class frequency and $x$ = class mark (midpoint).

Example: The sample mean for the students' CGPA (raw data) is $\bar{x} = \dfrac{\sum x}{n}$, with $n = 50$ students.

Example: The sample mean for Table 5.8

Table 5.8
CGPA (Class)    Frequency, f    Class Mark (Midpoint), x    fx
2.50 - 2.75     2               2.625                       5.250
2.75 - 3.00     10              2.875                       28.750
3.00 - 3.25     15              3.125                       46.875
3.25 - 3.50     13              3.375                       43.875
3.50 - 3.75     7               3.625                       25.375
3.75 - 4.00     3               3.875                       11.625
Total           50                                          161.750

$\bar{x} = \dfrac{\sum f x}{\sum f} = \dfrac{161.750}{50} = 3.235$
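The grouped-mean calculation for Table 5.8 can be checked with a minimal Python sketch. The class limits and frequencies are taken from the table; the variable names are illustrative.

```python
# Classes and frequencies copied from Table 5.8.
classes = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
           (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
frequencies = [2, 10, 15, 13, 7, 3]

# Class mark (midpoint) of each class.
midpoints = [(low + high) / 2 for low, high in classes]

# Grouped mean: sum of f * x divided by sum of f.
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
mean = sum_fx / sum(frequencies)

print(sum_fx, mean)
```

Because every observation in a class is replaced by the class mark, the grouped mean is an approximation of the mean of the raw data.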

2) Median
The median is the middle value of a set of observations arranged in order of magnitude, and is normally denoted by $\tilde{x}$.
i) The median for ungrouped data
- The median depends on the number of observations in the data, $n$.
- If $n$ is odd, the median is the $\left(\frac{n+1}{2}\right)$th observation of the ordered observations.
- If $n$ is even, the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2}+1\right)$th observation.

ii) The median of grouped data (frequency distribution)
The median of a frequency distribution is defined by:
$\tilde{x} = L + w\left(\dfrac{\frac{n}{2} - F}{f_m}\right)$
where
$L$ = the lower class boundary of the median class;
$w$ = the size of the median class interval;
$F$ = the sum of frequencies of all classes lower than the median class;
$f_m$ = the frequency of the median class;
$n$ = the total number of observations.

Example for ungrouped data:
The median of the data 4, 6, 3, 1, 2, 5, 7, 3 is 3.5.
Working:
- Arranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As n = 8 (even), the median is the mean of the 4th and 5th observations, that is (3 + 4)/2 = 3.5.
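The odd/even rule for the ungrouped median can be sketched as a small Python function. The function name `median_ungrouped` is illustrative; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median


def median_ungrouped(values):
    """Median of raw data using the odd/even rule on the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation (index mid, counting from 0).
        return ordered[mid]
    # n even: mean of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [4, 6, 3, 1, 2, 5, 7, 3]
result = median_ungrouped(data)
print(result, median(data))
```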

Example for grouped data:
Find the median for the frequency distribution below.

CGPA (Class)    Frequency, f    Cum. frequency
2.50 - 2.75     2               2
2.75 - 3.00     10              12
3.00 - 3.25     15              27
3.25 - 3.50     13              40
3.50 - 3.75     7               47
3.75 - 4.00     3               50
Total           50
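One way to work this example is the sketch below, which applies the usual grouped-median rule: lower boundary of the median class, plus the class size times (n/2 minus the cumulative frequency before the median class) divided by the median-class frequency. The function name and the choice of the first class whose cumulative frequency reaches n/2 as the median class are assumptions of this sketch.

```python
def grouped_median(boundaries, frequencies):
    """Median of a frequency distribution by interpolation inside the median class."""
    n = sum(frequencies)
    cumulative_before = 0  # sum of frequencies of all classes below the current one
    for (lower, upper), f in zip(boundaries, frequencies):
        if cumulative_before + f >= n / 2:
            # Current class is the median class.
            return lower + (upper - lower) * (n / 2 - cumulative_before) / f
        cumulative_before += f
    raise ValueError("empty frequency distribution")


boundaries = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
              (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
frequencies = [2, 10, 15, 13, 7, 3]

cgpa_median = grouped_median(boundaries, frequencies)
print(round(cgpa_median, 4))
```

Here n/2 = 25, so the median class is 3.00 - 3.25 (cumulative frequency 27), with 12 observations below it; the sketch returns 3.00 + 0.25(25 - 12)/15.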

3) Mode
The mode of a set of observations is the observation with the highest frequency, and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data.
i) Mode of ungrouped data
- Defined as the value that occurs most frequently.
- The mode has the advantage that it is easy to calculate and is not affected by extreme values.
- However, the mode may not exist, and even if it does exist, it may not be unique.

*Note: If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data are treated as having no mode.
ii) The mode for grouped data (frequency distribution)
- When the data have been grouped into classes and a frequency curve is drawn to fit the data, the mode is the value of $x$ corresponding to the maximum point on the curve.

ii) The mode for grouped data (frequency distribution)
$\hat{x} = L + w\left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
where
$L$ = the lower class boundary of the modal class;
$w$ = the size of the modal class interval;
$\Delta_1$ = the difference between the modal class frequency and the frequency of the class before it; and
$\Delta_2$ = the difference between the modal class frequency and the frequency of the class after it.
*Note: The class which has the highest frequency is called the modal class.

Example for ungrouped data:
The mode for the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.
Example for grouped data, based on the table:

CGPA (Class)    Frequency
2.50 - 2.75     2
2.75 - 3.00     10
3.00 - 3.25     15    (modal class)
3.25 - 3.50     13
3.50 - 3.75     7
3.75 - 4.00     3
Total           50
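Both mode examples can be sketched in Python. The grouped version interpolates inside the modal class using the differences between its frequency and those of the neighbouring classes. Treating a missing neighbour (when the modal class is first or last) as frequency 0 is an assumption of this sketch, as is the function name.

```python
from statistics import mode


def grouped_mode(boundaries, frequencies):
    """Mode of a frequency distribution by interpolation inside the modal class."""
    # Index of the modal class (first class with the highest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    lower, upper = boundaries[i]
    # Assumption: a missing neighbouring class is treated as frequency 0.
    before = frequencies[i - 1] if i > 0 else 0
    after = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - before  # modal frequency minus the class before it
    d2 = frequencies[i] - after   # modal frequency minus the class after it
    return lower + (upper - lower) * d1 / (d1 + d2)


# Ungrouped data: the most frequent value.
ungrouped_mode = mode([4, 6, 3, 1, 2, 5, 7, 3])

boundaries = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
              (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
frequencies = [2, 10, 15, 13, 7, 3]
cgpa_mode = grouped_mode(boundaries, frequencies)

print(ungrouped_mode, round(cgpa_mode, 4))
```

For the CGPA table the modal class is 3.00 - 3.25, with differences 15 - 10 = 5 and 15 - 13 = 2, so the sketch evaluates 3.00 + 0.25(5/7).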

Measure of Dispersion
A measure of dispersion (spread) is the degree to which a set of data tends to spread around the average value. It shows whether the data set is concentrated around the mean or scattered.
The common measures of dispersion are:
1) range
2) variance
3) standard deviation
The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.

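Variance
i) Variance for ungrouped data
The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
ii) Variance for grouped data
The variance for a frequency distribution is defined by:
$s^2 = \dfrac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}$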

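Example for ungrouped data:
The incomes of 5 workers are RM 1000, RM 2500, RM 2000, RM 4000 and RM 3500. Find the variance of these data.
Solution:
$\bar{x} = \dfrac{13000}{5} = 2600$, so $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{5\,700\,000}{4} = 1\,425\,000$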
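The same calculation can be sketched in Python, using the sample variance with divisor n - 1. The variable names are illustrative, and `statistics.variance` from the standard library (which also uses n - 1) serves as a cross-check.

```python
from math import sqrt
from statistics import variance

incomes = [1000, 2500, 2000, 4000, 3500]  # RM, from the example

n = len(incomes)
mean = sum(incomes) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in incomes) / (n - 1)

# Sample standard deviation is the square root of the variance.
s = sqrt(s2)

print(mean, s2, round(s, 2), variance(incomes))
```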

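Example for grouped data:
The variance for the frequency distribution in the table is:

Class boundaries    Frequency, f    Class Mark, x    fx         fx^2
2.50 - 2.75         2               2.625            5.250      13.781
2.75 - 3.00         10              2.875            28.750     82.656
3.00 - 3.25         15              3.125            46.875     146.484
3.25 - 3.50         13              3.375            43.875     148.078
3.50 - 3.75         7               3.625            25.375     91.984
3.75 - 4.00         3               3.875            11.625     45.047
Total               50                               161.750    528.031

$s^2 = \dfrac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1} = \dfrac{528.031 - \frac{(161.750)^2}{50}}{49} = \dfrac{4.770}{49} = 0.0973$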
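The grouped variance can be recomputed from the class marks and frequencies with a short Python sketch. It assumes the computational form of the sample variance, (sum of f x squared minus (sum of f x) squared over n) divided by (n - 1), which matches the fx and fx^2 columns of the table; the variable names are illustrative.

```python
# Class marks and frequencies from the grouped-data table.
midpoints = [2.625, 2.875, 3.125, 3.375, 3.625, 3.875]
frequencies = [2, 10, 15, 13, 7, 3]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

# Computational formula for the sample variance of a frequency distribution.
s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
s = s2 ** 0.5

print(round(sum_fx, 3), round(sum_fx2, 3), round(s2, 4), round(s, 4))
```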

ESTIMATION
Introduction
The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions.

ESTIMATOR VS ESTIMATE
Estimator - In statistics, the method used to compute a value from a sample.
Estimate - The value obtained from a sample.
Example: I have a sample of 5 numbers and I take the average. Taking the average of the sample is the estimator (here, the estimator of the mean). If, say, the average is 4, then 4 is the estimate.

CONFIDENCE INTERVAL ESTIMATES
Definition: An Interval Estimate
In interval estimation, an interval is constructed around the point estimate, and it is stated that this interval is likely to contain the corresponding population parameter.
Definition: Confidence Level and Confidence Interval
Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by $(1 - \alpha)100\%$.

CONFIDENCE INTERVAL ESTIMATES FOR POPULATION MEAN

Example: A publishing company has just published a new textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market. The research department at the company took a sample of 36 comparable textbooks and collected information on their prices. This information produced a mean price of RM 70.50 for this sample. It is known that the standard deviation of the prices of all such textbooks is RM 4.50. Construct a 90% confidence interval for the mean price of all such college textbooks.

Solution
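One way to carry out this calculation is sketched below in Python. It assumes the usual large-sample interval with known population standard deviation, sample mean plus or minus z times sigma over the square root of n, and obtains z from the standard library's `statistics.NormalDist` rather than from a printed table, so the last digits may differ slightly from a hand calculation. The function name `z_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - confidence
    # Critical value z_{alpha/2} from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Textbook-price example: mean RM 70.50, sigma RM 4.50, n = 36, 90% confidence.
low, high = z_interval(70.50, 4.50, 36, 0.90)
print(round(low, 2), round(high, 2))
```

With z close to 1.645 and a standard error of 4.50/6 = 0.75, the margin of error is about RM 1.23.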

EXAMPLE
Consider a survey of male students' heights in a certain IPTA. A random sample of 100 male students is taken. The heights of the male students are normally distributed with mean 178.2 cm and variance 17.75 cm$^2$.
i) Construct a 95% CI for the mean height of the male students.
ii) If the mean height of the female students is 170.2 cm, use a 98% CI to verify whether this supports the claim that the male students are taller than the female students.

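SOLUTION
i) It is known that $\bar{x} = 178.2$, $\sigma^2 = 17.75$ and $n = 100$.
For 95% CI, $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$.
Hence the 95% CI is
$\bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} = 178.2 \pm 1.96\sqrt{\dfrac{17.75}{100}} = 178.2 \pm 0.826 = (177.37, 179.03)$

ii) It is known that $\bar{x} = 178.2$ and $\sigma^2 = 17.75$.
For 98% CI, $\alpha = 0.02$, so $z_{\alpha/2} = z_{0.01} = 2.33$, and the interval is
$178.2 \pm 2.33\sqrt{\dfrac{17.75}{100}} = 178.2 \pm 0.982 = (177.22, 179.18)$
We can see that the mean height of the female students, 170.2 cm, does not lie in this interval; this indicates that the male students are taller than the female students.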

CONFIDENCE INTERVAL ESTIMATES FOR POPULATION PROPORTION

Example
According to an analysis in Women Magazine in June 2005, stress has become a common part of everyday life among working women in Malaysia. The demands of work, family and home place an increasing burden on the average Malaysian woman. According to this poll, 40% of the working women included in the survey indicated that they had little time to relax. The poll was based on a randomly selected sample of 1502 working women aged 30 and above. Construct a 95% confidence interval for the corresponding population proportion.

Solution
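A minimal Python sketch of this interval follows. It assumes the large-sample (normal approximation) interval for a proportion, sample proportion plus or minus z times the square root of p(1 - p)/n, with z taken from `statistics.NormalDist`; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Estimated standard error of the sample proportion.
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Poll example: sample proportion 0.40, n = 1502, 95% confidence.
low, high = proportion_interval(0.40, 1502, 0.95)
print(round(low, 4), round(high, 4))
```

The interval is roughly 37.5% to 42.5%, a margin of about 2.5 percentage points either side of the sample proportion.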

HYPOTHESIS TESTS
Hypothesis and Test Procedures
A statistical test of hypothesis consists of:
1. The null hypothesis, $H_0$
2. The alternative hypothesis, $H_1$
3. The test statistic and its p-value
4. The rejection region
5. The conclusion

Definition
Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
Null hypothesis, $H_0$: A null hypothesis is a claim (or statement) about a population parameter that is assumed to be true. (The null hypothesis is either rejected or fails to be rejected.)
Alternative hypothesis, $H_1$: An alternative hypothesis is a claim about a population parameter that will be true if the null hypothesis is false.

Test statistic: a function of the sample data on which the decision is to be based.
p-value: the probability calculated using the test statistic. The smaller the p-value, the more contradictory the data are to $H_0$.

9.1 DEVELOPING NULL AND ALTERNATIVE HYPOTHESES
It is not always obvious how the null and alternative hypotheses should be formulated. When formulating them, the nature or purpose of the test must also be taken into account. We will examine:
1) The claim or assertion leading to the test.
2) The null hypothesis to be evaluated.
3) The alternative hypothesis.
4) Whether the test will be two-tail or one-tail.
5) A visual representation of the test itself.
In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.

9.1.1 Alternative Hypothesis as a Research Hypothesis
Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis. In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support. The conclusion that the research hypothesis is true is made if the sample data provide sufficient evidence to show that the null hypothesis can be rejected.

Example 9.1: A new drug is developed with the goal of lowering blood pressure more than the existing drug.
Alternative Hypothesis: The new drug lowers blood pressure more than the existing drug.
Null Hypothesis: The new drug does not lower blood pressure more than the existing drug.

9.1.2 Null Hypothesis as an Assumption to be Challenged
We might begin with a belief or assumption that a statement about the value of a population parameter is true. We then use a hypothesis test to challenge the assumption and determine whether there is statistical evidence to conclude that the assumption is incorrect. In these situations, it is helpful to develop the null hypothesis first.

Example 9.2: The label on a soft drink bottle states that it contains at least 67.6 fluid ounces.
Null Hypothesis: The label is correct. $\mu \geq 67.6$ ounces.
Alternative Hypothesis: The label is incorrect. $\mu < 67.6$ ounces.

Example 9.3: Average tire life is 35000 miles.
Null Hypothesis: $\mu = 35000$ miles
Alternative Hypothesis: $\mu \neq 35000$ miles

How to decide whether to reject or accept $H_0$?
The entire set of values that the test statistic may assume is divided into two regions. One set, consisting of values that support $H_1$ and lead to rejecting $H_0$, is called the rejection region. The other, consisting of values that support $H_0$, is called the acceptance region.
$H_0$ always gets the "=" sign.

Tails of a Test
                    Two-Tailed Test    Left-Tailed Test    Right-Tailed Test
Sign in $H_0$       =                  = or $\geq$         = or $\leq$
Sign in $H_1$       $\neq$             <                   >
Rejection region    In both tails      In the left tail    In the right tail

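Population Mean, $\mu$ ($\sigma$ known and unknown)
Null hypothesis: $H_0: \mu = \mu_0$
Test statistic:
- $Z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ : any population, $\sigma$ is known and $n$ is large; or normal population, $\sigma$ is known and $n$ is small
- $Z = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$ : any population, $\sigma$ is unknown and $n$ is large
- $T = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$ with $n - 1$ degrees of freedom : normal population, $\sigma$ is unknown and $n$ is small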

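Alternative hypothesis      Rejection region
$H_1: \mu \neq \mu_0$       $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$
$H_1: \mu > \mu_0$          $Z > z_{\alpha}$
$H_1: \mu < \mu_0$          $Z < -z_{\alpha}$
When the test statistic follows the $t$ distribution, the critical values $z$ are replaced by $t$ values with $n - 1$ degrees of freedom.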

Definition: p-value
The p-value is the smallest significance level at which the null hypothesis is rejected.

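POPULATION PROPORTION, P
Null hypothesis: $H_0: p = p_0$
Test statistic: $Z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}$

Alternative hypothesis    Rejection region
$H_1: p \neq p_0$         $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$
$H_1: p > p_0$            $Z > z_{\alpha}$
$H_1: p < p_0$            $Z < -z_{\alpha}$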

Example
When working properly, a machine that is used to make chips for calculators does not produce more than 4% defective chips. Whenever the machine produces more than 4% defective chips, it needs an adjustment. To check whether the machine is working properly, the quality control department at the company often takes samples of chips and inspects them to determine whether they are good or defective. One such random sample of 200 chips taken recently from the production line contained 14 defective chips. Test at the 5% significance level whether or not the machine needs an adjustment.

Solution
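A sketch of one way to work this test in Python is given below. It assumes a right-tailed large-sample z test with $H_0: p \leq 0.04$ against $H_1: p > 0.04$ and the statistic $Z = (\hat{p} - p_0)/\sqrt{p_0(1 - p_0)/n}$; the critical value and p-value come from `statistics.NormalDist` rather than a printed table, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.04          # proportion defective claimed under H0
n = 200            # sample size
defective = 14     # defective chips found in the sample
alpha = 0.05       # significance level

p_hat = defective / n

# Test statistic for a proportion, using p0 in the standard error.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Right-tailed test: reject H0 when z exceeds the critical value z_alpha.
z_critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)
reject_h0 = z > z_critical

print(round(z, 3), round(z_critical, 3), round(p_value, 4), reject_h0)
```

Under these assumptions the statistic (about 2.17) exceeds the critical value (about 1.645), so $H_0$ is rejected at the 5% level, which points to the machine needing an adjustment.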

REGRESSION AND CORRELATION
Regression is a statistical procedure for establishing the relationship between two or more variables. This is done by fitting a linear equation to the observed data. The regression line is then used by the researcher to see the trend and to predict values for the data.
There are two types of relationship:
- Simple (2 variables)
- Multiple (more than 2 variables)

THE SIMPLE LINEAR REGRESSION MODEL
The simple linear regression model is an equation that describes a dependent variable (Y) in terms of an independent variable (X) plus random error:
$Y = \beta_0 + \beta_1 X + \varepsilon$
where
$\beta_0$ = intercept of the line with the Y-axis
$\beta_1$ = slope of the line
$\varepsilon$ = random error
The random error, $\varepsilon$, is the difference between a data point and its deterministic value. The regression line is estimated from the collected data by fitting a straight line to the data set and obtaining the equation of that line,
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

LEAST SQUARES METHOD
The least squares method is commonly used to determine values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that ensure the best fit of the estimated regression line to the sample data points. The straight line fitted to the data set is the line:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

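$\hat{y}$ is the estimated or predicted value of $y$ for a given value of $x$. In other words, the predicted value of the dependent variable for a given value of the independent variable $x$ can be obtained simply by substituting that value of $x$ into the fitted line.
We can find the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ by using the formulas
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$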

EXAMPLE
Suppose we take a sample of seven households from a low-to-moderate income neighborhood and collect information on their incomes and food expenditures for the past month. The information obtained (in hundreds of Ringgit Malaysia) is given below.

Income    Food expenditure
35        9
49        15
21        7
39        11
15        5
28        8
25        9

Find the least squares regression line of food expenditure (Y) on income (X).

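SOLUTION

Income, x    Food Expenditure, y    xy      x^2     y^2
35           9                      315     1225    81
49           15                     735     2401    225
21           7                      147     441     49
39           11                     429     1521    121
15           5                      75      225     25
28           8                      224     784     64
25           9                      225     625     81
$\sum x = 212$    $\sum y = 64$    $\sum xy = 2150$    $\sum x^2 = 7222$    $\sum y^2 = 646$

Compute
$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 2150 - \dfrac{(212)(64)}{7} = 211.7143$
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 7222 - \dfrac{(212)^2}{7} = 801.4286$
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{211.7143}{801.4286} = 0.2642$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \dfrac{64}{7} - (0.264171)\dfrac{212}{7} = 1.1422$
Hence the least squares regression line is $\hat{y} = 1.1422 + 0.2642x$.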
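The fit can be reproduced with a short Python sketch that follows the sums-of-products route: slope equal to $S_{xy}/S_{xx}$ and intercept equal to the mean of y minus the slope times the mean of x. Variable names such as `b0` and `b1` are illustrative, and the prediction at an income of 30 is a hypothetical usage example rather than part of the original exercise.

```python
# Income (x) and food expenditure (y), in hundreds of RM, from the example.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

n = len(income)
sum_x, sum_y = sum(income), sum(food)
sum_xy = sum(x * y for x, y in zip(income, food))
sum_x2 = sum(x * x for x in income)

# Corrected sums of products and squares.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

b1 = s_xy / s_xx                   # slope
b0 = sum_y / n - b1 * sum_x / n    # intercept


def predict(x):
    """Predicted food expenditure for a given income x."""
    return b0 + b1 * x


print(round(b0, 4), round(b1, 4), round(predict(30), 2))
```

The slope of about 0.26 says that, within the range of the data, each extra RM 100 of income goes with roughly RM 26 more food expenditure.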

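CORRELATION (R)
Correlation measures the strength of the linear relationship between two variables. It is also known as Pearson's product moment coefficient of correlation. The symbol for the sample coefficient of correlation is $r$.
Formula:
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
where $S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$, $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and $S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$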

Properties of r:
- Values of r close to 1 imply a strong positive linear relationship between x and y.
- Values of r close to -1 imply a strong negative linear relationship between x and y.
- Values of r close to 0 imply little or no linear relationship between x and y.

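EXAMPLE
Refer to the previous example. Calculate the value of $r$ and interpret its meaning.
Solution
From the previous example, $\sum x = 212$, $\sum y = 64$, $\sum xy = 2150$, $\sum x^2 = 7222$, $\sum y^2 = 646$ and $n = 7$, so
$r = \dfrac{2150 - \frac{(212)(64)}{7}}{\sqrt{\left(7222 - \frac{(212)^2}{7}\right)\left(646 - \frac{(64)^2}{7}\right)}} = \dfrac{211.7143}{\sqrt{(801.4286)(60.8571)}} = 0.9587$
Since the value of $r$ is close to 1, there is a strong positive linear relationship between income (x) and food expenditure (y).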
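The correlation coefficient for the income and food expenditure data can be checked with a minimal Python sketch. It assumes Pearson's formula written with corrected sums, $S_{xy}$ divided by the square root of $S_{xx} S_{yy}$; the variable names are illustrative.

```python
from math import sqrt

# Income (x) and food expenditure (y), in hundreds of RM, from the regression example.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

n = len(income)
sum_x, sum_y = sum(income), sum(food)

# Corrected sums of products and squares.
s_xy = sum(x * y for x, y in zip(income, food)) - sum_x * sum_y / n
s_xx = sum(x * x for x in income) - sum_x ** 2 / n
s_yy = sum(y * y for y in food) - sum_y ** 2 / n

# Pearson's product moment coefficient of correlation.
r = s_xy / sqrt(s_xx * s_yy)

print(round(r, 4))
```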
