CHAPTER 5 STATISTICS

  • Slide 1
  • CHAPTER 5 STATISTICS
  • Slide 2
  • 5.1 DATA SUMMARY AND DISPLAY
    What is statistics? The word has three meanings:
    - Numerical facts.
    - A field or discipline of study.
    - A collection of methods for planning experiments, obtaining data, and then organizing, analyzing and interpreting the data in order to draw conclusions or make decisions.
  • Slide 3
  • BASIC TERMS IN STATISTICS
    - Population: the entire collection of individuals or objects whose characteristics are being studied.
    - Sample: a subset of the population.
  • Slide 4
  • Census: a survey that includes every member of the population.
    Sample survey: a technique for collecting information from only a portion of the population.
    Element: a specific subject or object about which information is collected.
    Variable: a characteristic that takes different values for different elements.
  • Slide 5
  • Observation: the value of a variable for an element.
    Data set: a collection of observations on one or more variables.
    Table 1: Students' Scores for Business Statistics
      Name                    Score
      Mohd Amirul bin Hamdi   90
      Hashimah                78
    In Table 1, each student is an element, Score is the variable, and each score is an observation (measurement).
  • Slide 6
  • TYPES OF VARIABLES
    - Quantitative
      - Discrete (e.g., number of houses, cars, accidents)
      - Continuous (e.g., length, age, height, weight, time)
    - Qualitative (e.g., gender, marital status)
  • Slide 7
  • QUANTITATIVE AND QUALITATIVE VARIABLES
    1) Quantitative variable: a variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data. There are two types of quantitative variables:
       i. Discrete variable: a variable whose values are countable; it can assume only certain values, with no intermediate values.
       ii. Continuous variable: a variable that can assume any numerical value over a certain interval or intervals.
    2) Qualitative variable: a variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories. Data collected on such a variable are called qualitative data.
  • Slide 8
  • STATISTICS
    - DESCRIPTIVE STATISTICS: using tables, graphs and summary measures.
    - INFERENTIAL STATISTICS: using sample results to make decisions or predictions about a population. Also called inductive reasoning or inductive statistics.
  • Slide 9
  • Descriptive Statistics
    Consists of methods for organizing, displaying and describing data by using tables, graphs and summary measures. In general it is divided into two categories:
    - Data presentation (display)
    - Statistics
  • Slide 10
  • Inferential Statistics
    Consists of methods that use sample results to help make decisions or predictions about a population. It is the area of statistics that deals with decision-making procedures.
    Example: To find the starting salary of a college graduate, we may select 2000 recent college graduates, find their starting salaries and make a decision based on that information.
  • Slide 11
  • DATA PRESENTATION
    A data set with a lot of observations is usually not informative in its raw form: we cannot get much information from the raw data. We have to summarize or organize the data so that we can extract information from it.
  • Slide 12
  • DATA PRESENTATION OF QUALITATIVE DATA
    Tabular presentation of qualitative data is usually in the form of a frequency table.
    Frequency table: a table that shows the number of times each observation occurs in the data.
    A graphic display can reveal at a glance the main characteristics of a data set. Three types of graphs are used to display qualitative data:
    - bar graph
    - pie chart
    - line chart
  • Slide 13
  • Example 5.1
    Table 5.1 shows background data for 50 UNIMAP students.
    Codes used:
    - Gender: 1 is male and 2 is female.
    - Ethnic group: 1 is Malay, 2 is Chinese, 3 is Indian and 4 is others.
    Not much information can be obtained from the data in raw form. The data have to be summarized so that we can get more information.
  • Slide 14
  • If the data from Table 5.1 are summarized by gender and by ethnic group, we obtain the frequency tables below.
    Table 5.2: Frequency Table for the Gender
      Observation   Frequency
      Male          28
      Female        22
      Total         50
    Table 5.3: Frequency Table for the Ethnic Group
      Observation   Frequency
      Malay         33
      Chinese       9
      Indian        6
      Others        2
      Total         50
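The tallying step behind Table 5.2 can be sketched in Python. The raw rows of Table 5.1 are not reproduced in these slides, so the list `gender_codes` below is a hypothetical stand-in built only to match the totals 28 and 22; the coding 1 = male, 2 = female is from Example 5.1.

```python
from collections import Counter

# Hypothetical stand-in for the gender column of Table 5.1;
# only the totals (28 male, 22 female) come from Table 5.2.
gender_codes = [1] * 28 + [2] * 22
labels = {1: "Male", 2: "Female"}

# Count how many times each observation occurs.
frequency = Counter(labels[code] for code in gender_codes)

for observation, count in frequency.items():
    print(f"{observation:<8}{count}")
print(f"{'Total':<8}{sum(frequency.values())}")
```

The same pattern would apply to the ethnic group column with a four-entry label dictionary.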
  • Slide 15
  • Bar Chart
    A bar chart displays a frequency distribution in graphical form. It has two orthogonal axes: one represents the observations and the other represents their frequencies. The frequency of each observation is represented by a bar.
    *The bar chart is for the data from Table 5.3.
    Figure 1: Bar Chart of the Ethnic Group
  • Slide 16
  • 2.1.2 Pie Chart
    A pie chart displays a frequency distribution as ratios of the observations. It is a circle divided into sectors. Each sector represents an observation, and the area of the sector represents the proportion of the total frequency that belongs to that observation.
    *The pie chart is for the data from Table 5.2.
    Figure 2: The Pie Chart for the Gender
  • Slide 17
  • 2.1.3 Line Chart
    A line chart displays the trend of the observations. It has two orthogonal axes: one represents the observations and the other represents their frequencies. The frequencies of the observations are joined by lines.
    Example: Table 2.4 below shows the number of sandpipers recorded from January 1989 to December 1989.
    Table 2.4: The number of sandpipers
      Jan Feb Mar Apr May June July Aug Sept Oct Nov Dec
      1075 3972603161421149
    Figure 3: The Line Chart for the Numbers of Common Sandpipers
  • Slide 18
  • DATA PRESENTATION OF QUANTITATIVE DATA
    Tabular presentation of quantitative data is usually in the form of a frequency distribution.
    Frequency distribution: a table that shows the frequency of the observations that fall inside specific classes (intervals).
    There are a few graphs available for the graphical presentation of quantitative data. The most popular are:
    - Histogram
    - Frequency polygon
    - Ogive
  • Slide 19
  • FREQUENCY DISTRIBUTION
    When summarizing large quantities of raw data, it is often useful to distribute the data into classes. There are no specific rules for determining the classes, but statisticians suggest that the number of classes be between 5 and 20.
    Sturges' rule: number of classes, $c = 1 + 3.3 \log_{10} n$, where $n$ is the number of observations in the data set.
    Class width: $\text{class width} \approx \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$
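As a quick illustration, Sturges' rule can be evaluated for a data set of 50 observations (the sample size used in this chapter's student examples). Rounding the result up to a whole number is an assumption made here; the rule only suggests a starting point, and the analyst may still choose a nearby number of classes.

```python
import math

n = 50  # number of observations, as in the 50-student examples

# Sturges' rule: c = 1 + 3.3 * log10(n)
c = 1 + 3.3 * math.log10(n)

# Rounding up is one common convention (an assumption, not part of the rule).
num_classes = math.ceil(c)

print(round(c, 2), num_classes)
```

For $n = 50$ the rule gives about 6.6, which falls inside the suggested range of 5 to 20 classes.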
  • Slide 20
  • Example 5.2
    Table 5.5: The Frequency Distribution of the Students' CGPA
      CGPA (Class)   Frequency
      2.50 - 2.75    2
      2.75 - 3.00    10
      3.00 - 3.25    15
      3.25 - 3.50    13
      3.50 - 3.75    7
      3.75 - 4.00    3
      Total          50
  • Slide 21
  • Cumulative Frequency Distributions
    A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class. In a cumulative frequency distribution table, each class has the same lower limit but a different upper limit.
    Table 5.7: Class Limits, Class Boundaries, Class Width and Cumulative Frequency
      Weekly earnings in dollars   Number of      Class boundaries   Class width   Cumulative frequency
      (class limits)               employees, f
      801 - 1000                   9              800.5 - 1000.5     200           9
      1001 - 1200                  22             1000.5 - 1200.5    200           9 + 22 = 31
      1201 - 1400                  39             1200.5 - 1400.5    200           31 + 39 = 70
      1401 - 1600                  15             1400.5 - 1600.5    200           70 + 15 = 85
      1601 - 1800                  9              1600.5 - 1800.5    200           85 + 9 = 94
      1801 - 2000                  6              1800.5 - 2000.5    200           94 + 6 = 100
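The running totals in the last column of Table 5.7 can be reproduced with a short Python sketch; the variable names are illustrative only.

```python
from itertools import accumulate

# Number of employees per class, taken from Table 5.7.
frequencies = [9, 22, 39, 15, 9, 6]

# Each cumulative frequency is the sum of all frequencies up to that class.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency equals the total number of employees, which is a useful check on the table.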
  • Slide 22
  • Histogram
    A histogram looks like a bar chart, except that the horizontal axis represents quantitative data and there are no gaps between the bars.
  • Slide 23
  • Frequency Polygon
    A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks of the quantitative data.
  • Slide 24
  • Ogive
    An ogive is a line graph in which the horizontal axis represents the upper limit of each class interval and the vertical axis represents the cumulative frequencies.
  • Slide 25
  • DATA SUMMARY
    What is a statistic? A statistic is a number that describes a sample, such as the sample mean, which describes the sample average.
    Types of statistics:
    i. Measures of central tendency
    ii. Measures of dispersion
  • Slide 26
  • MEASURE OF CENTRAL TENDENCY
    There are three popular measures of central tendency: mean, median and mode.
    1) Mean
    The mean of a sample is the sum of the measurements divided by the number of measurements in the set. The mean is denoted by $\bar{x}$.
    Mean = Sum of all values / Number of values
    The mean can be obtained as follows:
    - For raw data, the mean is defined by
      $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
  • Slide 27
  • For tabular (grouped) data, the mean is defined by
      $$\bar{x} = \frac{\sum f x}{\sum f}$$
    where $f$ = class frequency and $x$ = class mark (midpoint).
  • Slide 28
  • Example: The sample mean of the students' raw CGPA data is obtained from $\bar{x} = \frac{\sum x}{n}$ with $n = 50$.
  • Slide 29
  • Example: The sample mean for Table 5.8
    Table 5.8
      CGPA (Class)   Frequency, f   Class mark (midpoint), x   fx
      2.50 - 2.75    2              2.625                      5.250
      2.75 - 3.00    10             2.875                      28.750
      3.00 - 3.25    15             3.125                      46.875
      3.25 - 3.50    13             3.375                      43.875
      3.50 - 3.75    7              3.625                      25.375
      3.75 - 4.00    3              3.875                      11.625
      Total          50                                        161.750
    Hence $\bar{x} = \frac{\sum fx}{\sum f} = \frac{161.750}{50} = 3.235$.
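The grouped-data mean for Table 5.8 can be checked with a short Python sketch. Each class is stored as a hypothetical tuple of (lower bound, upper bound, frequency), and the class mark is taken as the midpoint of the two bounds.

```python
# (lower bound, upper bound, frequency) for each CGPA class in Table 5.8
classes = [(2.50, 2.75, 2), (2.75, 3.00, 10), (3.00, 3.25, 15),
           (3.25, 3.50, 13), (3.50, 3.75, 7), (3.75, 4.00, 3)]

sum_f = sum(f for _, _, f in classes)

# Class mark x is the midpoint of the class; accumulate f * x.
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

mean = sum_fx / sum_f
print(sum_f, sum_fx, round(mean, 3))
```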
  • Slide 30
  • 2) Median
    The median is the middle value of a set of observations arranged in order of magnitude, and is normally denoted by $\tilde{x}$.
    i) The median for ungrouped data
    - The median depends on the number of observations in the data, $n$.
    - If $n$ is odd, then the median is the $\left(\frac{n+1}{2}\right)$th observation of the ordered observations.
    - If $n$ is even, then the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2}+1\right)$th observation.
  • Slide 31
  • ii) The median of grouped data (frequency distribution)
    The median of a frequency distribution is defined by
      $$\tilde{x} = L + \left(\frac{\frac{n}{2} - F}{f_m}\right) c$$
    where
    $L$ = the lower class boundary of the median class;
    $c$ = the size of the median class interval;
    $F$ = the sum of frequencies of all classes lower than the median class;
    $f_m$ = the frequency of the median class;
    and $n = \sum f$ is the total frequency.
  • Slide 32
  • Example for ungrouped data:
    The median of the data 4, 6, 3, 1, 2, 5, 7, 3 is 3.5.
    Working: Rearranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is $(3 + 4)/2 = 3.5$.
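The odd/even rule for the ungrouped median can be sketched directly in Python and compared with the standard library's `statistics.median`.

```python
import statistics

data = [4, 6, 3, 1, 2, 5, 7, 3]
ordered = sorted(data)
n = len(ordered)

if n % 2 == 1:
    # Odd n: the ((n + 1) / 2)th ordered observation (1-based position).
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th ordered observations.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median, statistics.median(data))
```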
  • Slide 33
  • Example for grouped data:
    Find the median for the frequency distribution below.
      CGPA (Class)   Frequency, f   Cum. frequency
      2.50 - 2.75    2              2
      2.75 - 3.00    10             12
      3.00 - 3.25    15             27
      3.25 - 3.50    13             40
      3.50 - 3.75    7              47
      3.75 - 4.00    3              50
      Total          50
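A possible worked solution in Python follows. It assumes the usual grouped-median formula, median = L + ((n/2 - F) / f) * c, where L is the lower boundary of the median class, F the cumulative frequency before it, f its frequency and c its width; the symbols on the original slide were lost, so this form is an assumption consistent with the definitions given for the grouped median.

```python
boundaries = [(2.50, 2.75), (2.75, 3.00), (3.00, 3.25),
              (3.25, 3.50), (3.50, 3.75), (3.75, 4.00)]
freqs = [2, 10, 15, 13, 7, 3]

n = sum(freqs)
cum_before = 0  # F: sum of frequencies of classes below the current one

for (lower, upper), f in zip(boundaries, freqs):
    if cum_before + f >= n / 2:
        # First class whose cumulative frequency reaches n/2 is the median class.
        median_class = (lower, upper)
        median = lower + (n / 2 - cum_before) / f * (upper - lower)
        break
    cum_before += f

print(median_class, round(median, 4))
```

Here the median class is 3.00 - 3.25, because its cumulative frequency (27) is the first to reach $n/2 = 25$.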
  • Slide 34
  • 3) Mode
    The mode of a set of observations is the observation with the highest frequency and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data.
    i) Mode of ungrouped data
    - Defined as the value that occurs most frequently.
    - The mode has the advantage that it is easy to calculate and is not affected by extreme values.
    - However, the mode may not exist, and even if it does exist, it may not be unique.
  • Slide 35
  • *Note: If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data are treated as having no mode.
    ii) The mode for grouped data (frequency distribution)
    - When data have been grouped into classes and a frequency curve is drawn to fit the data, the mode is the value of $x$ corresponding to the maximum point on the curve.
  • Slide 36
  • ii) The mode for grouped data (frequency distribution)
      $$\hat{x} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c$$
    where
    $L$ = the lower class boundary of the modal class;
    $c$ = the size of the modal class interval;
    $\Delta_1$ = the difference between the modal class frequency and the frequency of the class before it; and
    $\Delta_2$ = the difference between the modal class frequency and the frequency of the class after it.
    *Note: The class which has the highest frequency is called the modal class.
  • Slide 37
  • Example for ungrouped data: The mode of the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.
    Example for grouped data, based on the table:
      CGPA (Class)   Frequency
      2.50 - 2.75    2
      2.75 - 3.00    10
      3.00 - 3.25    15   (modal class)
      3.25 - 3.50    13
      3.50 - 3.75    7
      3.75 - 4.00    3
      Total          50
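Both mode examples can be sketched in Python. The grouped calculation assumes the common formula mode = L + (d1 / (d1 + d2)) * c, built from the quantities defined for the modal class; the sketch also assumes the modal class is neither the first nor the last class, so that both neighbouring classes exist.

```python
from collections import Counter

# Ungrouped data: the value with the highest frequency.
data = [4, 6, 3, 1, 2, 5, 7, 3]
mode_ungrouped = Counter(data).most_common(1)[0][0]

# Grouped data: CGPA frequency distribution.
lower_boundaries = [2.50, 2.75, 3.00, 3.25, 3.50, 3.75]
freqs = [2, 10, 15, 13, 7, 3]
width = 0.25

m = freqs.index(max(freqs))       # index of the modal class
d1 = freqs[m] - freqs[m - 1]      # modal frequency minus the class before
d2 = freqs[m] - freqs[m + 1]      # modal frequency minus the class after
mode_grouped = lower_boundaries[m] + d1 / (d1 + d2) * width

print(mode_ungrouped, round(mode_grouped, 4))
```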
  • Slide 38
  • Measure of Dispersion
    A measure of dispersion (spread) is the degree to which a set of data tends to spread around the average value. It shows whether the data set is concentrated around the mean or scattered.
    The common measures of dispersion are:
    1) range
    2) variance
    3) standard deviation
    The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.
  • Slide 40
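  • Variance
    i) Variance for ungrouped data
    The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by
      $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
    ii) Variance for grouped data
    The variance for a frequency distribution is defined by
      $$s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}$$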
  • Slide 41
  • Example for ungrouped data: The incomes of 5 workers are RM 1000, RM 2500, RM 2000, RM 4000 and RM 3500. Find the variance of these data.
    Solution:
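The worked solution did not survive on this slide, so here is a minimal Python sketch of the calculation, using the sample variance with divisor $n - 1$.

```python
incomes = [1000, 2500, 2000, 4000, 3500]  # RM, from the example

n = len(incomes)
mean = sum(incomes) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in incomes) / (n - 1)
std_dev = variance ** 0.5

print(mean, variance, round(std_dev, 2))
```

The sample mean is RM 2600 and the squared deviations sum to 5 700 000, so $s^2 = 5\,700\,000 / 4 = 1\,425\,000$ (in RM squared).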
  • Slide 42
  • Example for grouped data: The variance for the frequency distribution in the table is computed from the following sums.
      Class boundaries   Frequency, f   Class mark, x   fx        fx²
      2.50 - 2.75        2              2.625           5.250     13.781
      2.75 - 3.00        10             2.875           28.750    82.656
      3.00 - 3.25        15             3.125           46.875    146.484
      3.25 - 3.50        13             3.375           43.875    148.078
      3.50 - 3.75        7              3.625           25.375    91.984
      3.75 - 4.00        3              3.875           11.625    45.047
      Total              50                             161.750   528.031
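The final step can be sketched in Python. The shortcut formula used below, s² = (Σfx² − (Σfx)²/n) / (n − 1), is an assumption suggested by the fx and fx² columns of the table; the formula itself was lost from the slide.

```python
freqs = [2, 10, 15, 13, 7, 3]
class_marks = [2.625, 2.875, 3.125, 3.375, 3.625, 3.875]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, class_marks))

# Shortcut (computational) form of the grouped sample variance.
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = variance ** 0.5

print(round(sum_fx, 3), round(sum_fx2, 3), round(variance, 4), round(std_dev, 4))
```

With these sums the variance comes out at roughly 0.0973, so the standard deviation of the CGPA distribution is about 0.31.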
  • Slide 43
  • ESTIMATION
    Introduction
    The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods use the information contained in a sample from the population to draw conclusions.
  • Slide 44
  • ESTIMATOR VS ESTIMATE
    - Estimator: in statistics, the method used.
    - Estimate: the value obtained from a sample.
    Example: I have a sample of 5 numbers and I take the average. Taking the average of the sample is the estimator (here, the estimator of the mean). If the average is 4, then 4 is the estimate.
  • Slide 45
  • CONFIDENCE INTERVAL ESTIMATES
    Definition: Interval Estimate
    In interval estimation, an interval is constructed around the point estimate, and it is stated that this interval is likely to contain the corresponding population parameter.
    Definition: Confidence Level and Confidence Interval
    Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by $(1 - \alpha)100\%$.
  • Slide 46
  • CONFIDENCE INTERVAL ESTIMATES FOR POPULATION MEAN
  • Slide 51
  • Example: A publishing company has just published a new textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market. The research department at the company took a sample of 36 comparable textbooks and collected information on their prices. This information produced a mean price of RM 70.50 for the sample. It is known that the standard deviation of the prices of all such textbooks is RM 4.50. Construct a 90% confidence interval for the mean price of all such college textbooks.
  • Slide 52
  • Solution
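The worked solution was lost from this slide. A minimal Python sketch follows, assuming the usual large-sample interval with known population standard deviation, $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$; the critical value is taken from the standard library's `statistics.NormalDist` rather than from a printed table.

```python
from statistics import NormalDist

n, x_bar, sigma = 36, 70.50, 4.50   # from the textbook price example
confidence = 0.90

# z_{alpha/2}: upper alpha/2 point of the standard normal distribution.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * sigma / n ** 0.5
lower, upper = x_bar - margin, x_bar + margin

print(round(z, 3), round(lower, 2), round(upper, 2))
```

Under these assumptions we are 90% confident that the mean price of all such textbooks lies between about RM 69.27 and RM 71.73.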
  • Slide 54
  • EXAMPLE
    Consider a survey of male students' height at a certain IPTA, in which a random sample of 100 male students is taken. The height of the male students is normally distributed with mean 178.2 cm and variance 17.75 cm².
    i) Construct a 95% CI for the mean height of the male students.
    ii) If the mean height of the female students is 170.2 cm, use a 98% CI to check whether this supports the claim that the male students are taller than the female students.
  • Slide 55
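  • SOLUTION
    i) It is known that $n = 100$, $\bar{x} = 178.2$ and $\sigma^2 = 17.75$. For a 95% CI, $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$.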
  • Slide 56
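  • Hence the 95% CI is
      $$\bar{x} \pm z_{0.025}\frac{\sigma}{\sqrt{n}} = 178.2 \pm 1.96\sqrt{\frac{17.75}{100}} = 178.2 \pm 0.826 = (177.37,\ 179.03)$$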
  • Slide 57
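  • ii) It is known that $\bar{x} = 178.2$, $\sigma^2 = 17.75$ and $n = 100$. For a 98% CI, $\alpha = 0.02$, so $z_{\alpha/2} = z_{0.01} = 2.33$ and
      $$178.2 \pm 2.33\sqrt{\frac{17.75}{100}} = 178.2 \pm 0.982 = (177.22,\ 179.18)$$
    The mean height of the female students (170.2 cm) does not lie in this interval, which indicates that the male students are taller than the female students.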
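Both parts of the height example can be checked with one small helper. The function name `z_interval` is hypothetical, and the critical values come from `statistics.NormalDist`, so they differ slightly from rounded table values such as 2.33.

```python
from statistics import NormalDist

n, x_bar, variance = 100, 178.2, 17.75   # male students' heights
female_mean = 170.2

std_error = (variance / n) ** 0.5


def z_interval(confidence):
    """Return (lower, upper) for x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar - z * std_error, x_bar + z * std_error


ci_95 = z_interval(0.95)
ci_98 = z_interval(0.98)

# Part ii: is the female mean inside the 98% interval for the male mean?
female_inside = ci_98[0] <= female_mean <= ci_98[1]

print([round(v, 2) for v in ci_95], [round(v, 2) for v in ci_98], female_inside)
```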
  • Slide 58
  • CONFIDENCE INTERVAL ESTIMATES FOR POPULATION PROPORTION
  • Slide 59
  • Example
    According to an analysis in Women Magazine in June 2005, stress has become a common part of everyday life among working women in Malaysia. The demands of work, family and home place an increasing burden on the average Malaysian woman. According to this poll, 40% of the working women included in the survey indicated that they had little time to relax. The poll was based on a random sample of 1502 working women aged 30 and above. Construct a 95% confidence interval for the corresponding population proportion.
  • Slide 60
  • Solution
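The solution did not survive on this slide. A minimal Python sketch is given below, assuming the usual large-sample interval for a proportion, $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$.

```python
from statistics import NormalDist

n, p_hat = 1502, 0.40     # sample size and sample proportion from the poll
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96

# Estimated standard error of the sample proportion.
std_error = (p_hat * (1 - p_hat) / n) ** 0.5

lower, upper = p_hat - z * std_error, p_hat + z * std_error
print(round(lower, 4), round(upper, 4))
```

Under this assumption, between about 37.5% and 42.5% of all working women aged 30 and above would be expected to report having little time to relax, with 95% confidence.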
  • Slide 62
  • HYPOTHESIS TESTS
    Hypothesis and Test Procedures
    A statistical test of hypothesis consists of:
    1. The null hypothesis, $H_0$
    2. The alternative hypothesis, $H_1$
    3. The test statistic and its p-value
    4. The rejection region
    5. The conclusion
  • Slide 63
  • Definition
    Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
    Null hypothesis, $H_0$: a claim (or statement) about a population parameter that is assumed to be true. (The null hypothesis is either rejected or fails to be rejected.)
    Alternative hypothesis, $H_1$: a claim about a population parameter that will be true if the null hypothesis is false.
  • Slide 64
  • The test statistic is a function of the sample data on which the decision is to be based.
    The p-value is the probability calculated using the test statistic. The smaller the p-value, the more contradictory the data are to $H_0$.
  • Slide 65
  • DEVELOPING NULL AND ALTERNATIVE HYPOTHESIS
    It is not always obvious how the null and alternative hypotheses should be formulated. When formulating them, the nature or purpose of the test must also be taken into account. We will examine:
    1) The claim or assertion leading to the test.
    2) The null hypothesis to be evaluated.
    3) The alternative hypothesis.
    4) Whether the test will be two-tail or one-tail.
    5) A visual representation of the test itself.
    In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.
  • Slide 66
  • 9.1.1 Alternative Hypothesis as a Research Hypothesis
    Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis. In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support. The conclusion that the research hypothesis is true is made if the sample data provide sufficient evidence to show that the null hypothesis can be rejected.
  • Slide 67
  • Example 9.1: A new drug is developed with the goal of lowering blood pressure more than the existing drug. Alternative Hypothesis: The new drug lowers blood pressure more than the existing drug. Null Hypothesis: The new drug does not lower blood pressure more than the existing drug.
  • Slide 68
  • 9.1.2 Null Hypothesis as an Assumption to be Challenged
    We might begin with a belief or assumption that a statement about the value of a population parameter is true. We then use a hypothesis test to challenge the assumption and determine whether there is statistical evidence to conclude that the assumption is incorrect. In these situations, it is helpful to develop the null hypothesis first.
  • Slide 69
  • Example 9.2: The label on a soft drink bottle states that it contains at least 67.6 fluid ounces.
    Null Hypothesis: The label is correct, $\mu \ge 67.6$ ounces.
    Alternative Hypothesis: The label is incorrect, $\mu < 67.6$ ounces.
  • Slide 70
  • Example 9.3: Average tire life is 35000 miles.
    Null Hypothesis: $\mu = 35000$ miles
    Alternative Hypothesis: $\mu \ne 35000$ miles
  • Slide 77
  • How to decide whether to reject or accept $H_0$?
    The entire set of values that the test statistic may assume is divided into two regions. One set, consisting of values that support $H_1$ and lead to rejecting $H_0$, is called the rejection region. The other, consisting of values that support $H_0$, is called the acceptance region. $H_0$ always gets the "=" sign.
    Tails of a Test
                          Two-Tailed Test   Left-Tailed Test   Right-Tailed Test
      Sign in $H_0$       =                 = or ≥             = or ≤
      Sign in $H_1$       ≠                 <                  >
      Rejection region    In both tails     In the left tail   In the right tail
  • Slide 78
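  • Population Mean, $\mu$ ($\sigma$ known and unknown)
    Null hypothesis: $H_0: \mu = \mu_0$
    Test statistic:
    - $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$: any population, $\sigma$ is known and $n$ is large; or normal population, $\sigma$ is known and $n$ is small.
    - $z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$: any population, $\sigma$ is unknown and $n$ is large.
    - $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with $n - 1$ degrees of freedom: normal population, $\sigma$ is unknown and $n$ is small.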
  • Slide 79
  • Table: alternative hypothesis and the corresponding rejection region
  • Slide 80
  • Definition: p-value
    The p-value is the smallest significance level at which the null hypothesis is rejected.
  • Slide 83
  • POPULATION PROPORTION, p
    Table: alternative hypothesis and the corresponding rejection region
  • Slide 84
  • Example
    When working properly, a machine that is used to make chips for calculators does not produce more than 4% defective chips. Whenever the machine produces more than 4% defective chips, it needs an adjustment. To check whether the machine is working properly, the quality control department at the company often takes samples of chips and inspects them to determine whether they are good or defective. One such random sample of 200 chips taken recently from the production line contained 14 defective chips. Test at the 5% significance level whether or not the machine needs an adjustment.
  • Slide 85
  • Solution
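The worked solution was lost from this slide. The sketch below assumes the usual large-sample z test for a proportion, with $H_0: p \le 0.04$ against $H_1: p > 0.04$ (a right-tailed test) and test statistic $z = (\hat{p} - p_0)/\sqrt{p_0(1 - p_0)/n}$; these hypotheses are inferred from the wording of the example.

```python
from statistics import NormalDist

n, defective = 200, 14
p0, alpha = 0.04, 0.05          # hypothesised proportion and significance level

p_hat = defective / n           # sample proportion

# Test statistic, using the standard error under H0.
z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5

# Right-tailed test: reject H0 when z exceeds the upper alpha point.
z_critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)
reject_h0 = z > z_critical

print(round(p_hat, 2), round(z, 3), round(z_critical, 3), round(p_value, 4), reject_h0)
```

Since the test statistic (about 2.17) exceeds the critical value (about 1.645), $H_0$ is rejected at the 5% level under these assumptions, which suggests that the machine needs an adjustment.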
  • Slide 86
  • REGRESSION AND CORRELATION
    Regression is a statistical procedure for establishing the relationship between two or more variables. This is done by fitting a linear equation to the observed data. The regression line is then used by the researcher to see the trend and to predict values for the data.
    There are two types of relationship:
    - Simple (2 variables)
    - Multiple (more than 2 variables)
  • Slide 87
  • THE SIMPLE LINEAR REGRESSION MODEL
    The simple linear regression model is an equation that describes a dependent variable ($Y$) in terms of an independent variable ($X$) plus random error:
      $$Y = \beta_0 + \beta_1 X + \varepsilon$$
    where
    $\beta_0$ = intercept of the line with the Y-axis
    $\beta_1$ = slope of the line
    $\varepsilon$ = random error
    The random error, $\varepsilon$, is the difference between a data point and the deterministic value. The regression line is estimated from the collected data by fitting a straight line to the data set and obtaining the equation of that straight line,
      $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
  • Slide 88
  • LEAST SQUARES METHOD
    The least squares method is commonly used to determine values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that ensure a best fit of the estimated regression line to the sample data points.
    The straight line fitted to the data set is the line
      $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
  • Slide 89
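  • $\hat{y}$ is the estimated or predicted value of $y$ for a given value of $x$. In other words, the predicted value of the dependent variable for a given value of the independent variable $x$ is obtained simply by substituting that value of $x$ into the fitted line.
    We can find the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ by using the formulas
      $$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
    where
      $$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$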
  • Slide 90
  • EXAMPLE
    Suppose we take a sample of seven households from a low to moderate income neighborhood and collect information on their incomes and food expenditures for the past month. The information obtained (in hundreds of Ringgit Malaysia) is given below.
      Income   Food expenditure
      35       9
      49       15
      21       7
      39       11
      15       5
      28       8
      25       9
    Find the least squares regression line of food expenditure (Y) on income (X).
  • Slide 91
  • SOLUTION
      Income, x   Food expenditure, y   xy     x²     y²
      35          9                     315    1225   81
      49          15                    735    2401   225
      21          7                     147    441    49
      39          11                    429    1521   121
      15          5                     75     225    25
      28          8                     224    784    64
      25          9                     225    625    81
      Σx = 212    Σy = 64               Σxy = 2150   Σx² = 7222   Σy² = 646
    Compute the least squares estimates from these sums.
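The remaining computation was lost from the slides. A minimal Python sketch follows, assuming the usual least squares formulas: slope = S_xy / S_xx and intercept = mean(y) − slope × mean(x), with S_xy = Σxy − (Σx)(Σy)/n and S_xx = Σx² − (Σx)²/n.

```python
income = [35, 49, 21, 39, 15, 28, 25]   # x, in hundreds of RM
food = [9, 15, 7, 11, 5, 8, 9]          # y, in hundreds of RM

n = len(income)
sum_x, sum_y = sum(income), sum(food)
sum_xy = sum(x * y for x, y in zip(income, food))
sum_x2 = sum(x * x for x in income)

# Corrected sums of products and squares.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

slope = s_xy / s_xx
intercept = sum_y / n - slope * sum_x / n

print(round(s_xy, 4), round(s_xx, 4), round(slope, 4), round(intercept, 4))
```

Under these assumptions the fitted line is roughly $\hat{y} = 1.14 + 0.264x$: each additional RM 100 of income is associated with about RM 26 more spent on food.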
  • Slide 93
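  • CORRELATION (r)
    Correlation measures the strength of a linear relationship between two variables. It is also known as Pearson's product moment coefficient of correlation. The symbol for the sample coefficient of correlation is $r$.
    Formula:
      $$r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}$$
    where $S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$, $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$.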
  • Slide 94
  • Properties of r:
    - Values of r close to 1 imply a strong positive linear relationship between x and y.
    - Values of r close to -1 imply a strong negative linear relationship between x and y.
    - Values of r close to 0 imply little or no linear relationship between x and y.
  • Slide 95
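  • EXAMPLE
    Refer to the previous example. Calculate the value of r and interpret its meaning.
    Solution
    From the previous example we know that $\sum x = 212$, $\sum y = 64$, $\sum xy = 2150$, $\sum x^2 = 7222$, $\sum y^2 = 646$ and $n = 7$, so
      $$\sum xy - \frac{(\sum x)(\sum y)}{n} = 211.7143, \quad \sum x^2 - \frac{(\sum x)^2}{n} = 801.4286, \quad \sum y^2 - \frac{(\sum y)^2}{n} = 60.8571$$
      $$r = \frac{211.7143}{\sqrt{(801.4286)(60.8571)}} \approx 0.96$$
    Since the value of r is close to 1, there is a strong positive linear relationship between income (x) and food expenditure (y).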
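The value of r can be checked with a short Python sketch. The helper name `corrected_sum` is hypothetical, and the formula used is the standard Pearson form, r = S_xy / sqrt(S_xx × S_yy).

```python
income = [35, 49, 21, 39, 15, 28, 25]   # x
food = [9, 15, 7, 11, 5, 8, 9]          # y
n = len(income)


def corrected_sum(a, b):
    """Return sum(a*b) - sum(a)*sum(b)/n, e.g. S_xy, S_xx or S_yy."""
    return sum(u * v for u, v in zip(a, b)) - sum(a) * sum(b) / n


s_xy = corrected_sum(income, food)
s_xx = corrected_sum(income, income)
s_yy = corrected_sum(food, food)

r = s_xy / (s_xx * s_yy) ** 0.5
print(round(s_yy, 4), round(r, 4))
```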