
Statistics Module



Module Code: LV-STS
Module Writer: Symon B. Chibaya, Mathematics and Statistics Lecturer, Natural Resources College

TABLE OF CONTENTS

1. Module Objectives
2. Basic Concepts of Statistics
   1.1 Meaning of statistics
   1.2 Using statistical data
   1.3 Types of statistics
   1.4 Sources of statistical data
   1.5 Types of data variables
3. Summarizing Data
   2.1 Frequency distribution
   2.2 Graphical presentation
4. Describing Data
   1.1 Measures of central tendency
   1.2 Measures of dispersion
5. Probability Theory
   4.1 Probability concepts
   4.2 Discrete probability distribution: Binomial distribution and Poisson distribution
   4.3 Normal probability distribution
5 Sampling Methods and Sampling Distributions
6 Hypothesis Testing
   6.1 Defining hypothesis
   6.2 Setting up hypothesis
   6.3 Test statistics
   6.4 Analysis of Variance [ANOVA]
7 Test of Proportions (Chi-Square Testing)
8 Simple Linear Regression and Correlation Analysis
9 Appendix
   Table 1 The Binomial Distribution
   Table 2 F Distribution
   Table 3 The Chi-Square Distribution
   Table 4 Critical values for the PPMC
10 References

MODULE OBJECTIVES

By the end of the course students should be able to:
a) describe concepts used in statistics
b) present statistical data in different formats
c) apply statistical tools in problem solving.

TOPIC 1: BASIC CONCEPTS OF STATISTICS

OBJECTIVES

By the end of this topic, you should be able to:
(a) Define the term statistics.
(b) State types of statistics.
(c) List sources of statistical data.
(d) Describe types of data variables.

DEFINITION OF STATISTICS

Statistics is the science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data. Statistics is used to analyze the results of surveys and as a tool in scientific research to make decisions based on controlled experiments.
Other uses of statistics include operations research, quality control, estimation and prediction.

TYPES OF STATISTICS

Statistics is divided into two types, namely (a) descriptive statistics and (b) inferential statistics.

(a) DESCRIPTIVE STATISTICS
This consists of the collection, organization, summarization and presentation of data. An example is a population census.

(b) INFERENTIAL STATISTICS
This uses probability, that is, the chance of an event occurring. Here statisticians try to make inferences from samples to populations.

POPULATION
A population is a collection of all individuals, objects or measurements of interest. Most of the time, because of expense, the size of the population, medical concerns, etc., it is not possible to use the entire population for a statistical study; therefore, researchers use samples.

SAMPLE
A sample is a group of subjects (human or otherwise) selected from the population.

Example
The heights of all students at Natural Resources College constitute a population. A sample from it could be the heights of the students in the Nutrition 12 class.

DATA
These are the values that a variable can assume.

VARIABLES
A variable is a characteristic or attribute that can assume different values.

SOURCES OF STATISTICAL DATA

Sources of data can be internal or external.

INTERNAL DATA SOURCE
All types of organization collect and keep data, which is therefore internal to the organization.

ADVANTAGES OF INTERNAL DATA SOURCES
1. It is cheaper.
2. Readily available information can be used much more quickly.
3. It can be understood much more easily.

EXTERNAL SOURCES
These are sources of statistical information outside the organization.

TYPES OF DATA VARIABLES

Variables can be classified as qualitative or quantitative. Qualitative variables have non-numerical observations such as colour of hair, gender and religious preference, although each possible non-numerical value may be associated with a numerical frequency. Quantitative variables, by contrast, are numerical and can be ranked.
Some examples of quantitative variables are age, height and weight. Quantitative variables can be further classified into two groups: discrete and continuous. Discrete variables can take whole-number values only, for example the number of children in a family. Continuous variables can take any value, whole or decimal, for example the height of students.

The classification of variables can be summarized as follows:

Data
   Qualitative
   Quantitative
      Discrete
      Continuous

TYPES OF DATA

(a) PRIMARY DATA
If data is collected for a specific purpose, it is known as primary data, for example a population census. Primary data is collected by the researcher himself or herself.

(b) SECONDARY DATA
Secondary data is data that has been collected for some purpose other than the one for which it is being used. For example, if a company keeps records of when employees are sick and you use this information to tabulate the number of days employees had malaria in a given month, then this information is classified as secondary data. Secondary data is collected, and possibly processed, by people other than the researcher in question. In other words, it is collected by others and reused by the researcher.

TUTORIAL 1: BASIC CONCEPTS OF STATISTICS

1. Define the term statistics.
2. Distinguish between:
   (a) Inferential statistics and descriptive statistics
   (b) Sample and population
   (c) Qualitative data variables and quantitative data variables
3. Give two advantages of internal sources of data.

TOPIC 2: SUMMARISING DATA

OBJECTIVES

By the end of this topic, you should be able to:
(a) Organize raw data into frequency distributions.
(b) Define the following terms: limits, boundaries and class width.
(c) Represent data in a frequency distribution graphically using histograms, frequency polygons and a cumulative frequency polygon.

To describe situations, draw conclusions or make inferences about events, the researcher or statistician must organize the data in some meaningful way. The most convenient method of organizing data is to construct a frequency distribution.

FREQUENCY DISTRIBUTION

A frequency is the number of times a value occurs.

Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is

A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A

You can summarise the data in the example above with a frequency table or frequency distribution.
A frequency distribution is the organization of raw data in table form, using classes and frequencies.

The steps for constructing an ungrouped frequency distribution for the raw data above are as follows.

1. Make a table as shown.

   Class   Tally   Frequency
   A
   B
   O
   AB

2. Tally the data and place the results in the second column.
3. Count the tallies and place the counts in the third column.

   Class   Tally        Frequency
   A       |||||        5
   B       ||||| ||     7
   O       ||||| ||||   9
   AB      ||||         4

Example
A survey taken in a restaurant shows the number of cups of coffee consumed with each meal. Construct an ungrouped frequency distribution.

0 2 2 1 1 2
3 5 3 2 2 2
1 0 1 2 4 2
0 1 0 1 4 4
2 2 0 1 1 5

Solution

   Class   Frequency
   0       5
   1       8
   2       10
   3       2
   4       3
   5       2

GROUPED FREQUENCY DISTRIBUTION

The following rules apply when constructing a frequency distribution for grouped data.

1. There should be between 5 and 20 classes.
2. The classes must be equal in width. The following rule may be used to find the required class width:

   \[ W = \frac{L - S}{K} \]

   where W = class width, L = the largest data value, S = the smallest data value and K = the number of classes.
3. The class width should be an odd number. This ensures that the mid-point of each class has the same place value as the data. The mid-point is the numerical location of the centre of the class and it is necessary for graphing. The class mid-point X_m is obtained by adding the lower and upper limits and dividing by 2, or by adding the lower and upper boundaries and dividing by 2:

   \[ X_m = \frac{\text{lower limit} + \text{upper limit}}{2} \quad\text{or}\quad X_m = \frac{\text{lower boundary} + \text{upper boundary}}{2} \]

4. The classes must be mutually exclusive. Mutually exclusive classes have non-overlapping class limits, so that no data value can be placed into two classes. For example, use the classes in table A, not those in table B:

   A        B
   10-20    10-20
   21-31    20-30
   32-42    30-40
   43-53    40-50

   If a person is 40 years old, into which class of table B should she or he be placed?
5. The classes must be continuous. Even if there are no values in a class, the class must be included in the frequency distribution. In other words, there should be no gaps in the frequency distribution. The only exception occurs when the class with a zero frequency is the first or last class.
6. The classes must be exhaustive. There should be enough classes to accommodate all the data.

Example
The following data represent the times (in seconds) for 50 runners in a race.

246 238 246 251 240 243 245 243 241 248
244 246 249 246 245 244 248 240 243 249
242 245 239 244 245 245 248 248 249 248
250 242 243 245 242 242 246 246 245 247
244 240 245 247 248 247 250 247 248 250

Construct a grouped frequency distribution for the data.

Solution
Procedure for constructing a grouped frequency distribution:

Step 1: Determine the classes.
Suppose we want to have 5 classes. Then

\[ W = \frac{L - S}{K} = \frac{251 - 238}{5} = 2.6 \approx 3 \]

A number is rounded up if there is any decimal remainder when dividing. For example, 53/4 = 13.25 and 85/6 = 14.167 are both rounded up to 15, the next odd whole number.

Select a starting point for the lowest class limit. This can be the smallest data value or any convenient number less than the smallest data value. In this case 237 is used.

Add the width to the lowest class limit to get the lower limit of the next class. Keep adding until there are 5 classes:

237, 240, 243, 246, 249

Subtract one unit from the lower limit of the second class to get the upper limit of the first class. Then add the width to each upper limit to get all the upper limits:

239, 242, 245, 248, 251

So the five classes are 237-239, 240-242, 243-245, 246-248 and 249-251.

Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit, i.e. 236.5-239.5, 239.5-242.5, 242.5-245.5, etc.

Find the mid-point of each class.

Step 2: Tally the data.

Step 3: Find the numerical frequencies from the tallies.

   Class interval   Class boundaries   Mid-point   Frequency
   237-239          236.5-239.5        238         2
   240-242          239.5-242.5        241         8
   243-245          242.5-245.5        244         16
   246-248          245.5-248.5        247         17
   249-251          248.5-251.5        250         7
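The tallying procedure for the race times can be sketched in a few lines of Python. This is only an illustration, not part of the module: the variable names (`times`, `start`, `width`, `table`) are invented here, and the class index trick assumes whole-number data with no value below the starting point.

```python
import math
from collections import Counter

# Race times (seconds) for the 50 runners, as listed in the example.
times = [
    246, 238, 246, 251, 240, 243, 245, 243, 241, 248,
    244, 246, 249, 246, 245, 244, 248, 240, 243, 249,
    242, 245, 239, 244, 245, 245, 248, 248, 249, 248,
    250, 242, 243, 245, 242, 242, 246, 246, 245, 247,
    244, 240, 245, 247, 248, 247, 250, 247, 248, 250,
]

n_classes = 5
# W = (L - S) / K, rounded up to the next whole number.
width = math.ceil((max(times) - min(times)) / n_classes)
start = 237  # chosen starting point, just below the smallest value

# Each value belongs to class number (x - start) // width.
counts = Counter((x - start) // width for x in times)

table = []
for k in range(n_classes):
    lower = start + k * width
    upper = lower + width - 1
    midpoint = (lower + upper) / 2
    table.append((lower, upper, midpoint, counts.get(k, 0)))

for lower, upper, midpoint, freq in table:
    print(f"{lower}-{upper}  mid-point {midpoint:.0f}  frequency {freq}")
```

Run on the 50 listed times, the sketch gives a class width of 3 and frequencies 2, 8, 16, 17 and 7, which sum to 50.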

Example
The average quantitative GRE scores for the top 30 graduate schools of engineering are listed below. Construct a frequency distribution with six classes.

767 770 761 760 771 768 776 763 760 747 766 754 771 771 780 750 746 764 769 759 753 756 766 758 770 762 746

Solution

\[ W = \frac{L - S}{K} = \frac{780 - 746}{6} = 5.667 \approx 6 \]

The smallest lower limit (starting point) is 745. The other lower limits are 751, 757, 763, 769 and 775.
The upper limit of the first class is 751 - 1 = 750. The other upper limits are 756, 762, 768, 774 and 780.

Tallying the scores listed above gives:

   Class limits   Class boundaries   Frequency
   745-750        744.5-750.5        4
   751-756        750.5-756.5        3
   757-762        756.5-762.5        6
   763-768        762.5-768.5        6
   769-774        768.5-774.5        6
   775-780        774.5-780.5        2

CUMULATIVE FREQUENCY DISTRIBUTION

So far, we have seen how to tabulate a frequency distribution. A further way of presenting frequencies is to form cumulative frequencies. This technique conveys a considerable amount of information and involves adding up the number of times (frequency) that values less than or equal to a particular value occur.

Example
A survey was conducted in a certain town on people's weights in kilograms. The following are the results:

   Weight      Frequency
   32.0-32.7   1
   35.0-35.7   3
   36.0-36.7   9
   37.0-37.7   20
   38.0-38.7   28
   39.0-39.7   26
   40.0-40.7   15
   41.0-41.7   4

Calculate the cumulative frequencies of the data.

Solution

   Weight      Frequency   Cumulative frequency
   32.0-32.7   1           0 + 1 = 1
   35.0-35.7   3           1 + 3 = 4
   36.0-36.7   9           4 + 9 = 13
   37.0-37.7   20          13 + 20 = 33
   38.0-38.7   28          33 + 28 = 61
   39.0-39.7   26          61 + 26 = 87
   40.0-40.7   15          87 + 15 = 102
   41.0-41.7   4           102 + 4 = 106

RELATIVE FREQUENCY DISTRIBUTIONS

RELATIVE FREQUENCY

Relative frequencies are the frequencies divided by the total number of observations:

\[ \text{Relative frequency} = \frac{\text{actual frequency}}{\text{total number of observations}} \]

Example
The following data represent the heights of students in cm.

   Height (cm)   Frequency
   Under 165     7
   Under 170     11
   Under 175     17
   Under 180     20
   Under 185     16
   Under 190     9

Construct a relative frequency distribution.

Solution
The total number of observations is 80.

   Height (cm)   Frequency   Relative frequency
   Under 165     7           0.0875
   Under 170     11          0.1375
   Under 175     17          0.2125
   Under 180     20          0.25
   Under 185     16          0.2
   Under 190     9           0.1125

CUMULATIVE RELATIVE FREQUENCY

We have seen how to calculate cumulative frequencies. Using the same logic, you obtain the cumulative relative frequency by adding the relative frequency of a particular class to the cumulative value already arrived at for the previous class.

Example
Construct a cumulative relative frequency distribution for the above example.

Solution

   Height (cm)   Frequency   Relative frequency   Cumulative relative frequency
   Under 165     7           0.0875               0 + 0.0875 = 0.0875
   Under 170     11          0.1375               0.0875 + 0.1375 = 0.225
   Under 175     17          0.2125               0.225 + 0.2125 = 0.4375
   Under 180     20          0.25                 0.4375 + 0.25 = 0.6875
   Under 185     16          0.2                  0.6875 + 0.2 = 0.8875
   Under 190     9           0.1125               0.8875 + 0.1125 = 1.000

The reasons for constructing a frequency distribution are:
1. To organize the data in a meaningful way.
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread.
4. To enable the researcher to draw charts and graphs for the presentation of the data.
5. To enable the reader to make comparisons among different data sets.

GRAPHICAL PRESENTATION

After the data have been organized into a frequency distribution, they can be presented in graphical form. The purpose of graphs in statistics is to convey the data to the viewer in pictorial form. It is easier for most people to comprehend the meaning of data presented graphically than data presented numerically in tables or frequency distributions.
The three most commonly used graphs are:
(a) the histogram
(b) the frequency polygon
(c) the cumulative frequency graph, or ogive (pronounced o-jive)

THE HISTOGRAM

The word histogram is derived from Greek: histos, anything set upright, and gramma, drawing, record or writing. The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.

Example
The annual exports of a group of small firms in Lilongwe are:

   Exports (K millions)   Number of firms
   2-4                    4
   5-7                    12
   8-10                   15
   11-13                  8
   14-16                  4

Construct a histogram to represent the data shown above.

Solution
Step 1: Construct a frequency distribution that has class boundaries.

   Class   Class boundaries   Frequency
   2-4     1.5-4.5            4
   5-7     4.5-7.5            12
   8-10    7.5-10.5           15
   11-13   10.5-13.5          8
   14-16   13.5-16.5          4

Step 2: Draw and label the x-axis and y-axis. The x-axis is always the horizontal axis and the y-axis is always the vertical axis.
Step 3: Represent the frequency on the y-axis and the class boundaries on the x-axis.
Step 4: Using the frequencies as the heights, draw vertical bars for each class.

NOTE: The frequency in each class is represented by a rectangle. Strictly, it is the area of the rectangle, not its height, that represents the frequency; when all classes have the same width, as here, the heights are proportional to the frequencies.

THE FREQUENCY POLYGON

The frequency polygon is a graph that displays data using lines that connect points plotted for the frequencies at the mid-points of the classes. This is a very quick method of drawing the shape of a frequency distribution.

The steps for drawing a frequency polygon are:
1. Draw a histogram.
2. Mark the mid-point at the top of each rectangle.
3. Join the mid-points with a ruler.
4. Extend the lines at each end of the histogram to the mid-points of the next highest and next lowest classes, which have a frequency of zero.

NOTE: The lines are extended to the x-axis so that the area of the polygon equals that of the histogram it represents.

Example
Draw a frequency polygon for the data in the example above.

Solution

   Class   Class boundaries   Frequency   Mid-point
   2-4     1.5-4.5            4           3
   5-7     4.5-7.5            12          6
   8-10    7.5-10.5           15          9
   11-13   10.5-13.5          8           12
   14-16   13.5-16.5          4           15

OR

Step 1: Find the mid-point of each class.
Step 2: Draw the x-axis to represent scores.
Step 3: Draw the y-axis to represent frequencies.
Step 4: Plot the frequency against the class mid-point.
Step 5: Join the crosses in order; that is, the cross representing the first class should be joined to the one representing the second class, and so on.
Step 6: Include the two extreme points.

For instance,

   Class   Class boundaries   Frequency   Mid-point
   2-4     1.5-4.5            4           3
   5-7     4.5-7.5            12          6
   8-10    7.5-10.5           15          9
   11-13   10.5-13.5          8           12
   14-16   13.5-16.5          4           15

The mid-point before 3 is 3 - 3 = 0 and the mid-point after 15 is 15 + 3 = 18.

CUMULATIVE FREQUENCY POLYGON

The cumulative frequency polygon, also called an ogive, is used to represent the cumulative frequencies for the classes. Each cumulative frequency is plotted against the upper boundary of the corresponding class. In a cumulative frequency polygon the points are joined by straight lines, whereas in a cumulative frequency curve a smooth curve joins the points.

Example
The lengths of 50 fish caught from a pond were measured and the following are the results.

   Length   Frequency
   20-22    3
   23-25    10
   26-28    9
   29-31    28

Construct a cumulative frequency polygon.

Solution

   Class   Class boundaries   Frequency   Cumulative frequency
   20-22   19.5-22.5          3           3
   23-25   22.5-25.5          10          13
   26-28   25.5-28.5          9           22
   29-31   28.5-31.5          28          50
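The points to plot for the ogive can be generated from the class limits and frequencies. The sketch below is illustrative only: the names `classes`, `frequencies` and `points` are invented here, and starting the polygon at the lower boundary of the first class with a cumulative frequency of 0 is a common convention that is assumed rather than taken from the module.

```python
from itertools import accumulate

# Fish lengths: class limits and frequencies from the example.
classes = [(20, 22), (23, 25), (26, 28), (29, 31)]
frequencies = [3, 10, 9, 28]

# Upper class boundary = upper class limit + 0.5.
upper_boundaries = [upper + 0.5 for _, upper in classes]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Assumed convention: begin at the first lower boundary with cumulative frequency 0.
first_lower_boundary = classes[0][0] - 0.5
points = [(first_lower_boundary, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cum_freq in points:
    print(f"x = {boundary}, cumulative frequency = {cum_freq}")
```

Joining these points with straight lines gives the cumulative frequency polygon; the last point, (31.5, 50), confirms that all 50 fish are accounted for.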

TUTORIAL 2: SUMMARISING DATA

1. 50 people were asked to record how many radio stations they listen to in a week. The results are shown below:

   No. of radio stations   No. of listeners
   0-9                     2
   10-19                   32
   20-29                   10
   30-39                   6

   (a) What is this table called?
   (b) Draw a histogram.

2. In a study of reaction times of dogs to a specific stimulus, an animal trainer obtained the following data, given in seconds. Construct a histogram, frequency polygon and ogive for the data.

   Class limits   Frequency
   2.3-2.9        10
   3.0-3.6        12
   3.7-4.3        6
   4.4-5.0        8
   5.1-5.7        4
   5.8-6.4        2

3. The number of calories per serving for selected ready-to-eat cereals is listed here. Construct a frequency distribution using seven classes. Draw a histogram, frequency polygon and ogive for the data.

   130 190 140 80 100 120 220 110 100 210 130 100 90 210 200 120 180 120 190 210 120 200 130 180 260 100 160 190 240 80 120 90 190 200 210 190 180 115 210 110 225 190 130

TOPIC 3: DESCRIBING DATA

OBJECTIVES

By the end of this topic, you should be able to:
(a) Calculate the mean, median and mode.
(b) Explain the advantages and disadvantages of the mean, mode and median.
(c) Explain the difference between the sample standard deviation and the population standard deviation.
(d) Describe data using measures of variation such as the range, variance and standard deviation.

Any set of measurements has two important properties, namely the central or typical value and the spread about that value.

MEASURES OF CENTRAL TENDENCY

The main measures of central tendency are:
- Mean
- Median
- Mode

MEAN

The arithmetic mean of a set of values is the sum of the values divided by the total number of values. The symbol \bar{x} (read as x-bar) represents the sample mean:

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]

where n represents the total number of values in the sample.

For a population, the Greek letter \mu (mu) is used for the mean:

\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} \]

where N represents the total number of values in the population.

If some values appear more than once, we may use the following formula, where f is the frequency of each value x:

\[ \bar{x} = \frac{\sum fx}{\sum f} \]

Example
Find the mean of the following data: 20, 26, 40, 36, 23, 42, 35, 24, 30.

Solution

\[ \bar{x} = \frac{\sum x}{n} = \frac{276}{9} = 30.7 \]

Example
The table below shows the frequency distribution of the number of days on which 100 employees of a firm were late for work in a given month. Using the data, find the mean number of days on which an employee is late in a month.

   Number of days late   Number of employees
   1                     32
   2                     25
   3                     18
   4                     14
   5                     11

Solution

   x   f    fx
   1   32   32
   2   25   50
   3   18   54
   4   14   56
   5   11   55

   \sum f = 100, \sum fx = 247

Then

\[ \bar{x} = \frac{\sum fx}{\sum f} = \frac{247}{100} = 2.47 \]

Therefore the mean number of days late is 2.47, or about 2.5 days.

MEAN FOR GROUPED DATA

Steps to be followed:
1. Make a frequency distribution table with the columns shown below.

   Class   Frequency (f)   Mid-point (x_m)   f·x_m

2. Find the mid-point of each class and place it in column 3.
3. Multiply the frequency by the mid-point for each class and place the product in column 4.
4. Find the sums of columns 2 and 4. In other words, find \sum f and \sum f x_m.
5. Use the following formula to find the mean:

   \[ \bar{x} = \frac{\sum f x_m}{\sum f} \]

Example
The marks scored by 500 candidates in an examination in which the maximum mark was 50 were:

   Mark range   Frequency
   1-5          10
   6-10         41
   11-15        72
   16-20        83
   21-25        94
   26-30        81
   31-35        71
   36-40        27
   41-45        13
   46-50        8

Calculate the mean mark for these candidates.

Solution

   Mark range   Frequency (f)   Mid-point (x_m)   f·x_m
   1-5          10              3                 30
   6-10         41              8                 328
   11-15        72              13                936
   16-20        83              18                1494
   21-25        94              23                2162
   26-30        81              28                2268
   31-35        71              33                2343
   36-40        27              38                1026
   41-45        13              43                559
   46-50        8               48                384

   \sum f = 500, \sum f x_m = 11530
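The column totals for the examination marks can be reproduced with a short Python sketch. It is an illustration only; the names `classes`, `frequencies`, `sum_f` and `sum_fxm` are invented here and do not come from the module.

```python
# Mark ranges and frequencies from the examination example.
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25),
           (26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]
frequencies = [10, 41, 72, 83, 94, 81, 71, 27, 13, 8]

# Mid-point of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

sum_f = sum(frequencies)                                        # total frequency
sum_fxm = sum(f * xm for f, xm in zip(frequencies, midpoints))  # total of f * x_m

mean = sum_fxm / sum_f
print(sum_f, sum_fxm, round(mean, 1))  # 500 11530.0 23.1
```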

Therefore

\[ \bar{x} = \frac{\sum f x_m}{\sum f} = \frac{11530}{500} = 23.1 \]

ADVANTAGES OF ARITHMETIC MEAN
1. It is easy to calculate.
2. It uses all the values.
3. It is used in computing other statistics such as the variance.

DISADVANTAGES OF ARITHMETIC MEAN
1. It is affected by extremely high or low values.
2. It cannot be read from a graph.

MEDIAN

A statistic that is not affected by a few very unusual extreme scores is the median. The median is the middle value when the values are arranged in order (either ascending or descending). Data arranged in order is called a data array. The median is used when one must determine whether data values fall into the upper or lower half of the distribution.

The steps in computing the median are:
1. Arrange the data in order.
2. Select the middle point.

NOTE: If the number of values (n) is odd, the median is the middle value; if n is even, the median is the arithmetic mean of the two middle values. In other words:
(a) if n is odd and M is the median, then M = the value of the \frac{n+1}{2}th observation;
(b) if n is even, the middle observations are the \frac{n}{2}th and the \left(\frac{n}{2} + 1\right)th observations, and M = the mean of these two observations.

Example
Find the median of 4, 3, 5, 2, 11.

Solution
Arranging in ascending order, we have 2, 3, 4, 5, 11, and n = 5 (odd).
Median = the value of the \frac{5+1}{2}th = 3rd observation = 4.

Example
Six people take shoe sizes 7, 9, 9, 8, 5, 6. What is the median?

Solution
Arranging in ascending order, we have 5, 6, 7, 8, 9, 9, and n = 6 (even).
So M = the mean of the \frac{6}{2}th and the \left(\frac{6}{2} + 1\right)th observations
= the mean of the 3rd and 4th observations
= \frac{7 + 8}{2} = 7.5

ADVANTAGES OF MEDIAN
1. Its value is not distorted by extreme values.
2. All the observations are used to order the data, even though only the middle one or two observations are used in the calculation.
3. It can be illustrated graphically in a very simple way.

DISADVANTAGES OF MEDIAN
1. In a grouped frequency distribution, the value of the median within the median class can only be an estimate.
2. It is of little use in calculating other statistical measures.

MODE

The mode is the most frequent data value, that is, the value that occurs most often in a data set. The mode is used when the most typical case is desired. A set of observations with one mode is called unimodal, a set with two modes is called bimodal, and a set with more than two modes is called multimodal.

Example
The monthly salaries of a sample of doctors are: K35000, K58000, K50000, K49000, K50000, K50000, K60000, K70400, K50000, K40000, K50000, K40000, K65000, K55000. What is the modal monthly salary?

Solution
The value that occurs most often is K50000. Therefore the modal salary is K50000.

Example
Find the modal class for the frequency distribution of miles that 20 runners ran in one week.

   Class       Frequency
   5.5-10.5    1
   10.5-15.5   2
   15.5-20.5   3
   20.5-25.5   5
   25.5-30.5   4
   30.5-35.5   3
   35.5-40.5   2

Solution
The modal class is 20.5-25.5, since it has the largest frequency.

ADVANTAGES OF THE MODE
1. It is not distorted by extreme values of the observations.
2. It is easy to calculate.

DISADVANTAGES OF THE MODE
1. It cannot be used to calculate any further statistics.
2. It may have more than one value.

THE WEIGHTED MEAN

Sometimes one must find the mean of a data set in which not all values are equally represented. This type of mean is called the weighted mean. To find the weighted mean, multiply each value by its corresponding weight and divide the sum of the products by the sum of the weights. In other words,

\[ \bar{x} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum wx}{\sum w} \]

where w_1, w_2, ..., w_n are the weights and x_1, x_2, ..., x_n are the values.

Example
A student received an A in English Composition 1 (3 credits), a C in Introduction to Psychology (3 credits), a B in Biology 1 (4 credits) and a D in Physical Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade points, D = 1 grade point and F = 0 grade points, find the student's grade-point average.

Solution

   Course                       Credits (w)   Grade (x)      wx
   English Composition 1        3             A (4 points)   12
   Introduction to Psychology   3             C (2 points)   6
   Biology 1                    4             B (3 points)   12
   Physical Education           2             D (1 point)    2

   \sum w = 12, \sum wx = 32
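The same weighted-mean calculation can be written as a small Python sketch. It is illustrative only; the `grade_points` mapping and the `courses` list are names made up here to hold the figures from the example.

```python
# Grade points assumed in the example.
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# (course, credits, grade): the credits act as the weights.
courses = [
    ("English Composition 1", 3, "A"),
    ("Introduction to Psychology", 3, "C"),
    ("Biology 1", 4, "B"),
    ("Physical Education", 2, "D"),
]

sum_w = sum(credits for _, credits, _ in courses)
sum_wx = sum(credits * grade_points[grade] for _, credits, grade in courses)

gpa = sum_wx / sum_w  # weighted mean = sum(wx) / sum(w)
print(sum_w, sum_wx, round(gpa, 1))  # 12 32 2.7
```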

Therefore

\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{32}{12} = 2.7 \]

MEASURES OF DISPERSION

To describe a data set accurately, statisticians must know more than the measures of central tendency. We also need to know the spread of the data. There are several different measures of dispersion. The most important of these, described in this section, are:
- Range
- Variance
- Standard deviation

RANGE

The range is the difference between the highest value and the lowest value.

Example
The weights of the contents of several small bottles are (in grams): 4, 3, 6, 5, 7, 2 and 4. Find the range.

Solution
The lowest value is 2 and the highest value is 7. Therefore range = 7 - 2 = 5.

ADVANTAGES OF THE RANGE
1. It is easy to understand.
2. It is simple to calculate.
3. It is a good measure for comparison as it spans the whole distribution.

DISADVANTAGES OF THE RANGE
1. It uses only two of the observations and so can be distorted by extreme values.
2. It cannot be used in calculating other functions of the observations.

STANDARD DEVIATION AND VARIANCE

These two measures of dispersion are discussed in the same section because the standard deviation is the square root of the variance. The variance is the average of the squares of the distance of each value from the mean. The symbol for the population variance is \sigma^2 (\sigma is the Greek lowercase sigma). The formula for the population variance is

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]

where x = individual value, \mu = population mean and N = population size.

The standard deviation is the square root of the variance. The symbol for the population standard deviation is \sigma and its formula is

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]

The standard deviation is one of the measures used to describe the variability of a distribution. It has an additional use which makes it more important than other measures of dispersion: it is used as a unit to measure the distance between any two observations.

STEPS TO BE FOLLOWED IN CALCULATING THE STANDARD DEVIATION
1. Find the mean (\mu) of the data.
2. Subtract the mean from each data value: x - \mu.
3. Square each result: (x - \mu)^2. The reason for squaring is to eliminate the negative signs.
4. Find the sum of the squares, \sum (x - \mu)^2.
5. Divide the sum by N to get the variance, \frac{\sum (x - \mu)^2}{N}.
6. Take the square root of the variance to get the standard deviation, \sqrt{\frac{\sum (x - \mu)^2}{N}}. The reason for taking the square root is that, since the distances were squared, the units of the resulting numbers are the squares of the units of the original raw data. Taking the square root of the variance puts the standard deviation in the same units as the raw data.

Example
Find the variance and standard deviation for the following data set: 10, 60, 50, 30, 40, 20.

Solution
The mean is \mu = \frac{210}{6} = 35.

   x    x - \mu         (x - \mu)^2
   10   10 - 35 = -25   625
   60   60 - 35 = 25    625
   50   50 - 35 = 15    225
   30   30 - 35 = -5    25
   40   40 - 35 = 5     25
   20   20 - 35 = -15   225

   \sum (x - \mu)^2 = 1750
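The table of deviations can be checked with a short Python sketch that follows the six steps literally, using the values in the x column of the table. The variable names are invented for illustration; `statistics.pvariance` from the Python standard library is used only as a cross-check.

```python
import math
import statistics

data = [10, 60, 50, 30, 40, 20]  # the x column of the table
N = len(data)

mu = sum(data) / N                                  # step 1: mean
squared_deviations = [(x - mu) ** 2 for x in data]  # steps 2 and 3
sum_sq = sum(squared_deviations)                    # step 4
variance = sum_sq / N                               # step 5: population variance
sigma = math.sqrt(variance)                         # step 6: standard deviation

print(mu, sum_sq, round(variance, 1), round(sigma, 1))  # 35.0 1750.0 291.7 17.1

# Cross-check against the standard library's population variance.
assert math.isclose(variance, statistics.pvariance(data))
```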

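Then

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{1750}{6} = 291.7 \]

Therefore, the variance is 291.7, and

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{291.7} = 17.1 \]

Therefore, the standard deviation is 17.1.

SAMPLE VARIANCE AND STANDARD DEVIATION

The formulas for finding the sample variance and standard deviation are as follows.

Sample variance:

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]

Sample standard deviation:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \quad\text{or}\quad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \]

where x = individual score, \bar{x} = sample mean and n = sample size.

NOTE: Dividing by n - 1 gives a slightly larger value and an unbiased estimate of the population variance.

Example
Find the sample variance and standard deviation for the earnings of three students. The data are in Kwacha: 3, 4, 5.

Solution
(i) The sample variance is

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]

with

\sum x = 3 + 4 + 5 = 12, so (\sum x)^2 = 12^2 = 144
\sum x^2 = 3^2 + 4^2 + 5^2 = 50

\[ s^2 = \frac{50 - \frac{144}{3}}{3 - 1} = \frac{2}{2} = 1 \]

(ii) The standard deviation is s = \sqrt{1} = 1.

For grouped data, we use the following steps.
1. Make a table as shown below and find the mid-point of each class.

   Class   Frequency   Mid-point (x_m)   f·x_m   f·x_m^2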

2. Multiply the frequency by the mid-point of each class and place the product in column 4.
3. Multiply the frequency by the square of the mid-point and place the product in column 5.
4. Find the sums of columns 2 (n), 4 (\sum f x_m) and 5 (\sum f x_m^2).
5. Substitute in the formula and solve to get the variance:

   \[ s^2 = \frac{\sum f x_m^2 - \frac{(\sum f x_m)^2}{n}}{n - 1} \]

6. Take the square root to get the standard deviation.

Example
Find the sample variance and standard deviation for the following frequency distribution.

   Class limits   Frequency
   13-19          2
   20-26          7
   27-33          12
   34-40          5
   41-47          6
   48-54          1
   55-61          0
   62-68          2

Solution

   Class limits   Frequency   Mid-point (x_m)   f·x_m   f·x_m^2
   13-19          2           16                32      512
   20-26          7           23                161     3703
   27-33          12          30                360     10800
   34-40          5           37                185     6845
   41-47          6           44                264     11616
   48-54          1           51                51      2601
   55-61          0           58                0       0
   62-68          2           65                130     8450

   n = \sum f = 35, \sum f x_m = 1183, \sum f x_m^2 = 44527
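The column totals and the resulting variance can be recomputed from the class limits and frequencies alone. The Python sketch below is illustrative; the variable names are invented here, and it applies the grouped-data formula with (\sum f x_m)^2 / n in the numerator and n - 1 in the denominator.

```python
import math

# Class limits and frequencies from the example.
classes = [(13, 19), (20, 26), (27, 33), (34, 40),
           (41, 47), (48, 54), (55, 61), (62, 68)]
frequencies = [2, 7, 12, 5, 6, 1, 0, 2]

midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(frequencies)                                                  # column 2
sum_fxm = sum(f * xm for f, xm in zip(frequencies, midpoints))        # column 4
sum_fxm2 = sum(f * xm ** 2 for f, xm in zip(frequencies, midpoints))  # column 5

# s^2 = (sum(f*xm^2) - (sum(f*xm))^2 / n) / (n - 1)
variance = (sum_fxm2 - sum_fxm ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(n, sum_fxm, sum_fxm2)                   # 35 1183.0 44527.0
print(round(variance, 1), round(std_dev, 1))  # 133.6 11.6
```

Note that the squared total (\sum f x_m)^2, not the total itself, is divided by n; forgetting the square inflates the variance considerably.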

With n = 35, \sum f x_m = 1183 and \sum f x_m^2 = 512 + 3703 + 10800 + 6845 + 11616 + 2601 + 0 + 8450 = 44527:

\[ s^2 = \frac{\sum f x_m^2 - \frac{(\sum f x_m)^2}{n}}{n - 1} = \frac{44527 - \frac{1183^2}{35}}{35 - 1} = \frac{4541.6}{34} = 133.6 \]

\[ s = \sqrt{133.6} = 11.6 \]

Therefore, the sample variance is 133.6 and the standard deviation is 11.6.

TUTORIAL 3: DESCRIBING DATA

1. Find the mean, mode, median, variance and standard deviation of the following sets of numbers:
   (a) 2, 3, 5, 3, 3
   (b) 3, 4, 5, 4, 5, 4, 5, 3, 4, 4
2. Find the weighted mean of 20, 25, 30, 35 if they are assigned weightings of
   (a) 1, 2, 3, 4
   (b) 1, 3, 7, 9
   respectively.
3. An instructor weights exams at 20%, the term paper at 30% and the final exam at 50%. A student had grades of 83, 72 and 90 respectively. Find the student's final average. Use the weighted mean.
4. In a class of 29 students, this distribution of quiz scores was recorded.

   Class limits   Frequency
   0-2            1
   3-5            3
   6-8            5
   9-11           14
   12-14          6

   Find the mean, variance and standard deviation of the quiz scores.
5. A survey was made of the monthly earnings of four agricultural assistants and the results are recorded below:
   K18,000, K19,000, K20,000 and K21,000
   Calculate the following:
   (a) Sample variance
   (b) Sample standard deviation

TOPIC 4: PROBABILITY THEORY

OBJECTIVES

By the end of this topic, you should be able to:
(a) Determine the sample space.
(b) Find the probability of an event, using classical probability.
(c) Calculate probabilities, applying the rules of addition and multiplication.
(d) Distinguish between a discrete random variable and a continuous random variable.
(e) Construct a probability distribution for a random variable.
(f) Find the mean, variance and standard deviation for the variable of a binomial distribution.
(g) Find probabilities for outcomes of variables using the Poisson distribution.
(h) Find the area under the standard normal distribution.
(i) Calculate probabilities for a normally distributed variable by transforming it into a standard normal variable.

PROBABILITY CONCEPTS

(a) EXPERIMENT/TRIAL
It is a repeatable procedure whose outcome is attributed to chance.

(b) SAMPLE SPACE/PROBABILITY SPACE (S)
It is the set of all possible outcomes of a trial. It is denoted by S.
Examples of trials and sample spaces:
(a) Tossing a coin once: S = \{H, T\}
(b) Tossing a die once: S = \{1, 2, 3, 4, 5, 6\}

(c) SAMPLE POINT
It is any distinct element of the sample space.

(d) EVENT
It is any subset of a sample space. For example, let S = \{1, 2, 3, 4, 5, 6\}. One event is the event of picking an even number. If A is the event of picking an even number, then A = \{2, 4, 6\}.

Example
A die is tossed once. Find
(i) the sample space
(ii) the event of odd numbers
(iii) the event of even numbers
(iv) the event of prime numbers
(v) the event of multiples of 3.

Solutions
(i) S = \{1, 2, 3, 4, 5, 6\}
(ii) A = \{1, 3, 5\}
(iii) B = \{2, 4, 6\}
(iv) C = \{2, 3, 5\}
(v) D = \{3, 6\}

(e) MUTUALLY EXHAUSTIVE EVENTS
These are events which together constitute the sample space, i.e. if A \cup B = S then A and B are mutually exhaustive events.
(f) MUTUALLY EXCLUSIVE EVENTS
These are events which do not have any elements in common. That is, if A ∩ B = ∅ then A and B are mutually exclusive events.

(g) COMPLEMENT OF AN EVENT
The complement of an event A with respect to the sample space S is the set of points in S not belonging to the subset A. Ā or A' is often used as a symbol for "not A". A and not A are referred to as complementary events.

(h) PROBABILITY OF AN EVENT
The probability of any event E is
(number of outcomes in E) / (total number of outcomes in the sample space)
This probability is denoted by
P(E) = n(E) / n(S)

Example
Suppose you toss a coin. Find the probability that the outcome is a head.
Solution
S = {H, T} and E = {H}
But P(H) = n(H) / n(S) = 1/2

Example
A card is drawn from a standard pack of cards. Find the probability of getting a queen.
Solution
n(S) = 52 and n(E) = 4
But P(E) = n(E) / n(S) = 4/52 = 1/13

Example
If a family has three children, find the probability that all the children are girls.
Solution
S = {BBB, BBG, BGB, GBB, BGG, GBG, GGB, GGG}
So n(S) = 8 and n(E) = 1
But P(E) = n(E) / n(S) = 1/8

The following are four basic probability rules:
(a) The probability of any event is a number (either a fraction or a decimal) between and including 0 and 1. This is denoted by 0 ≤ P(E) ≤ 1.
NOTE: Rule (a) states that probabilities cannot be negative or greater than 1.
(b) If an event E cannot occur (i.e. the event contains no members of the sample space) then its probability is 0.
(c) If an event E is certain, then the probability of E is 1.
(d) The sum of the probabilities of the outcomes in the sample space is 1.

RULE FOR COMPLEMENTARY EVENTS
P(E') = 1 - P(E) or P(E) = 1 - P(E') or P(E) + P(E') = 1

Example
The weather bureau estimates the probability of rain tomorrow to be 0.42. What is the probability that it does not rain?
Solution
P(no rain) = 1 - P(rain)
= 1 - 0.42
= 0.58

TWO LAWS OF PROBABILITY

1. ADDITION LAW
P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive events,
or
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) if A and B are not mutually exclusive events.

Example
A card is picked at random from a standard pack of cards. What is the probability it is
(a) a red heart or a club?
(b) a red heart or a king?
Solution
(a) There is no red heart that is a club. So
P(r or c) = P(r) + P(c) = 13/52 + 13/52 = 26/52 = 1/2
(b) One of the red hearts is a king. So
P(r or k) = P(r) + P(k) - P(r ∩ k) = 13/52 + 4/52 - 1/52 = 16/52 = 4/13

Example
In a hospital unit there are 8 nurses and 5 physicians; 7 nurses and 3 physicians are females. If a staff person is selected, find the probability that the subject is a nurse or a male.
Solution
The sample space is shown below.

Staff         Females   Males   Total
Nurses        7         1       8
Physicians    3         2       5
Total         10        3       13

There is a nurse who is a male. So
P(nurse or male) = P(nurse) + P(male) - P(nurse and male) = 8/13 + 3/13 - 1/13 = 10/13

MULTIPLICATION LAWS

1. When two events are independent, the probability of both occurring is
P(A ∩ B) = P(A) × P(B)

Example
A coin is tossed and a die is rolled. Find the probability of getting a 4 on the die and a head on the coin.
Solution
P(4 and head) = P(4) × P(head) = 1/6 × 1/2 = 1/12

Example
A card is drawn from a deck and replaced. Then a second card is drawn. Find the probability of getting a queen and then an ace.
Solution
P(queen and ace) = P(queen) × P(ace) = 4/52 × 4/52 = 1/169

2. When two events are dependent, the probability of both occurring is
P(A ∩ B) = P(A) × P(B/A) = P(B) × P(A/B)
where P(A/B), the probability of A given that B has occurred, is
P(A/B) = P(A ∩ B) / P(B)
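The multiplication law for independent events and the formula for P(A/B) can be checked by counting outcomes in the coin-and-die example. The sketch below is illustrative only; the helper name prob and the variable names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# Joint sample space for tossing a coin and rolling a die:
# 2 x 6 = 12 equally likely outcomes such as ("H", 4).
coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
S = list(product(coin, die))


def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [s for s in S if event(s)]
    return Fraction(len(favourable), len(S))


p_head = prob(lambda s: s[0] == "H")
p_four = prob(lambda s: s[1] == 4)
p_both = prob(lambda s: s[0] == "H" and s[1] == 4)

# Independent events: P(A and B) equals P(A) x P(B).
print("P(4 and head) =", p_both)
print("P(4) x P(head) =", p_four * p_head)

# Conditional probability: P(A/B) = P(A and B) / P(B).
p_four_given_head = p_both / p_head
print("P(4/head) =", p_four_given_head)
```

Because the coin and the die do not influence each other, P(4/head) comes out equal to P(4), which is exactly what independence means.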

Example
A card is drawn from a deck without replacement. Then a second card is drawn. Find the probability of getting a queen and then an ace.
Solution
P(queen and ace) = 4/52 × 4/51 = 16/2652 = 4/663

Example
The World Wide Insurance Company found that 53% of the residents of a city had homeowners insurance with the company. Of these clients, 27% also had automobile insurance with the company. If a resident is selected, find the probability that the resident has both homeowners and automobile insurance with the World Wide Insurance Company.
Solution
Let homeowners insurance be H and automobile insurance be A.
So P(H) = 0.53 and P(A/H) = 0.27
P(H and A) = P(H) × P(A/H)
= 0.53 × 0.27
= 0.1431

DISCRETE PROBABILITY DISTRIBUTION

RANDOM VARIABLES
A random variable is the numerical outcome of a random experiment, denoted by X. If the experiment is repeated, different values of X will be obtained, and these values are denoted by small x. If a variable can assume only a specific number of values, such as the outcome for the roll of a die or the outcome for the toss of a coin, then the variable is called a discrete variable. Discrete variables have values that can be counted.

DISCRETE PROBABILITY DISTRIBUTION
It consists of the values a random variable can assume and the corresponding probabilities of the values. The probabilities are determined theoretically or by observation.

Example
Construct a probability distribution for rolling a single die.
Solution
S = {1, 2, 3, 4, 5, 6}
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
So its distribution is

Outcome x           1     2     3     4     5     6
Probability P(x)    1/6   1/6   1/6   1/6   1/6   1/6
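The table for a single die can be built by counting outcomes, and the result can be checked against the basic probability rules (each probability lies between 0 and 1, and the probabilities sum to 1). This is a minimal sketch that assumes equally likely outcomes; the function name probability_distribution is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def probability_distribution(outcomes):
    """Map each distinct value to its probability.

    Assumes every entry in `outcomes` is equally likely.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    return {x: Fraction(count, total) for x, count in sorted(counts.items())}


die = [1, 2, 3, 4, 5, 6]
distribution = probability_distribution(die)

for x, p in distribution.items():
    print(f"P({x}) = {p}")

# Check the basic rules: every probability is in [0, 1] and they sum to 1.
total = sum(distribution.values())
in_range = all(0 <= p <= 1 for p in distribution.values())
print("sum =", total, "| all between 0 and 1:", in_range)
```

Exact fractions are used instead of decimals so that the sum is exactly 1 rather than a rounded value.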

A probability distribution can be shown graphically by representing the values of x on the x-axis and the probabilities P(x) on the y-axis. The graphical representation of the above example is
[Figure: bar graph of P(x) against x, with six bars of equal height 1/6]

NOTE: In this probability distribution, the probabilities are between 0 and 1, i.e. 0 ≤ P(X = x) ≤ 1.

REQUIREMENTS FOR A PROBABILITY DISTRIBUTION
1. The sum of the probabilities of all events in the sample space must equal 1; that is, ΣP(X = x) = 1.
2. The probability of each event in the sample space must be between or equal to 0 and 1. That is, 0 ≤ P(X = x) ≤ 1.

Example
Represent graphically the probability distribution for the sample space for tossing three coins.

Number of heads x       0     1     2     3
Probability P(X = x)    1/8   3/8   3/8   1/8

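Solution
[Figure: bar graph of P(X = x) against number of heads, with bar heights 1/8, 3/8, 3/8 and 1/8]

Example
A random variable has the distribution shown in the following table.

X           0     1     2
P(X = x)    1/4   ?     1/4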

(a) Find P(X = 1)
(b) Represent graphically the probability distribution.
Solution
(a) P(X = 1) = 1 - (1/4 + 1/4), since ΣP(X = x) = 1
= 1 - 1/2
= 1/2
(b) The probability distribution is
[Figure: bar graph of P(X = x) against x, with bar heights 1/4, 1/2 and 1/4]

BINOMIAL DISTRIBUTION
Many types of probability problems have only two outcomes or can be reduced to two outcomes. Some examples of experiments where you have two outcomes are:
1. A win or loss in a football game.
2. A pass or fail in an examination.
3. A head or tail on a coin toss.
4. An effective or ineffective lecturer.
5. A correct or incorrect item.
Situations like these are called binomial experiments. A binomial experiment is a probability experiment that satisfies the following four requirements:
(i) Each trial has only two outcomes, or outcomes that can be reduced to two, which can be considered as either success or failure.
(ii) There must be a fixed number of trials.
(iii) The outcomes of the trials must be independent of each other.
(iv) The probability of success must remain the same for each trial.
The outcomes of a binomial experiment and the corresponding probabilities of these outcomes are called a binomial distribution.
In a binomial experiment, the probability of exactly x successes in n trials is
P(X = x) = [n! / ((n - x)! x!)] p^x q^(n - x)
where
n = number of trials
x = number of successes in n trials (NOTE: 0 ≤ x ≤ n)
p = numerical probability of success, i.e. P(S)
q = numerical probability of failure, i.e. P(F) = 1 - P(S)
NOTE: n! = n(n - 1)(n - 2)(n - 3)...3 × 2 × 1 and 0! = 1

Example
Find (a) 3! and (b) 5!
Solutions
(a) 3! = 3 × 2 × 1 = 6
(b) 5! = 5 × 4 × 3 × 2 × 1 = 120

Example
A survey found that one out of five Malawians says he or she has visited a doctor in any given month. If 10 people are selected at random, find the probability that exactly 3 will have visited a doctor last month.
Solution
n = 10, x = 3, p = 1/5 and q = 1 - 1/5 = 4/5
But P(X = x) = [n! / ((n - x)! x!)] p^x q^(n - x)
P(X = 3) = [10! / ((10 - 3)! 3!)] (1/5)^3 (4/5)^(10 - 3)
= [(10 × 9 × 8 × 7!) / (7! × 3 × 2 × 1)] (1/5)^3 (4/5)^7
= 120 × (1/5)^3 × (4/5)^7
= 0.201

Example
It has been found that on average 5% of the eggs supplied at NRC market are cracked. If you buy a box of 6 eggs, what is the probability that it contains 2 or more cracked eggs?
Solution
p = P(cracked) = 0.05, q = P(not cracked) = 1 - 0.05 = 0.95 and n = 6
P(2 or more cracked) = 1 - P(less than 2 cracked)
= 1 - [P(0) + P(1)]
P(0) = [6! / ((6 - 0)! 0!)] (0.05)^0 (0.95)^(6 - 0)
= 1 × 1 × (0.95)^6
= 0.7351
P(1) = [6! / ((6 - 1)! 1!)] (0.05)^1 (0.95)^(6 - 1)
= 6 × 0.05 × (0.95)^5
= 0.2321
Therefore P(2 or more cracked) = 1 - (0.7351 + 0.2321)
= 1 - 0.9672
= 0.033 to 3 d.p.

For a binomial distribution
(a) mean μ = np
(b) variance σ² = npq or np(1 - p)
(c) standard deviation σ = √(npq) or √(np(1 - p))
where n = number of trials, p = P(S) and q = P(F)

Example
A die is rolled 480 times. Find the mean, variance and standard deviation of the number of 2s that will be rolled.
Solution
Here p = P(2) = 1/6, so q = 1 - 1/6 = 5/6, and n = 480
(a) mean = np
= 480 × 1/6
= 80
(b) variance = npq
= 480 × 1/6 × 5/6
= 66.7
(c) standard deviation = √(npq)
= √66.7
= 8.2

Example
Let X be the number of correct responses out of n = 20 questions and let p be the probability of a correct choice on a single question. A candidate in an examination randomly selects one of the 5 possible answers for each question, and hence p = 1/5. Find the mean, variance and standard deviation for the student.
Solution
Given: n = 20 and p = 1/5. So q = 1 - 1/5 = 4/5
(a) mean = np
= 20 × 1/5
= 4
(b) variance = npq
= 20 × 1/5 × 4/5
= 3.2
(c) standard deviation = √(npq)
= √3.2
= 1.8 to 1 decimal place

POISSON DISTRIBUTION
The binomial distribution is useful in cases where we take a fixed sample size and count the number of successes. Sometimes we do not have a definite sample size, and then the binomial distribution is of no use. In such cases we use another theoretical distribution called the Poisson distribution.
In a Poisson distribution, the probability of exactly r successes is
P(r) = e^(-m) m^r / r!
where
m = mean
r = number of events (successes)
e = 2.7183

WHEN TO USE THE POISSON DISTRIBUTION
(1) When n is large, i.e. a very large number of trials (n ≥ 30)
(2) When p is small
(3) When the independent variable occurs over a period of time, or a density of items is distributed over a given area or volume.

Example
If there are 200 typographical errors randomly distributed in a 500-page manuscript, find the probability that a given page contains exactly three errors.
Solution
Given: r = 3
But m = 200/500 = 0.4 and P(r) = e^(-m) m^r / r!
P(3) = (2.7183)^(-0.4) (0.4)^3 / 3!
= 0.0072

Example
The number of accidents per working week in a particular factory in Lilongwe is known to follow a Poisson distribution with a mean of 0.5. Find the probability that in a particular week there will be
(i) 2 accidents
(ii) less than 3 accidents
Solution
(i) P(r) = e^(-m) m^r / r!
P(2) = (2.7183)^(-0.5) (0.5)^2 / 2!
= 0.08
(ii) P(less than 3 accidents) = P(0) + P(1) + P(2)
= (2.7183)^(-0.5) (0.5)^0 / 0! + (2.7183)^(-0.5) (0.5)^1 / 1! + (2.7183)^(-0.5) (0.5)^2 / 2!
= 0.6065 + 0.3033 + 0.0758
= 0.9856

NORMAL PROBABILITY DISTRIBUTION
A normal distribution is a continuous, unimodal, symmetric, bell-shaped distribution of a variable. For example:
[Figure: bell-shaped normal curve centred on the mean]

PROPERTIES OF THE THEORETICAL NORMAL DISTRIBUTION
(i) The curve is bell-shaped.
(ii) The mean, median and mode are equal and located at the centre of the distribution.
(iii) The curve is unimodal (has only one mode).
(iv) The curve is symmetrical about the mean. This means that if you cut the normal curve vertically at the centre, the two halves so formed are mirror images of each other.
(v) The curve is continuous, that is, there are no gaps or holes. For each value of x, there is a corresponding value of y.
(vi) All the values of y are greater than zero, and y approaches zero as x approaches ±∞.
(vii) The area between the curve and the x-axis is 1 unit or 100%.

The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. All normally distributed variables can be transformed into the standard normally distributed variable by using the formula for the standard score:
Z = (X - μ) / σ
where
Z = Z-score
X = value
μ = mean
σ = standard deviation
The Z-score is the number of standard deviations that a particular X value is away from the mean.

Example
A student scored 65 on a Mathematics test that had a mean of 50 and a standard deviation of 10. Calculate his Z-score.
Solution
Z = (X - μ) / σ
= (65 - 50) / 10
= 1.5

AREA UNDER THE STANDARD NORMAL CURVE
NOTE:
(i) The tabulated value for z1 is the area under the curve between the ordinate at Z = z1 and the mean (Z = 0).
(ii) P(x1 < x < x2) = area under the curve between the ordinates at x1 and x2.

Example
Find the area under the normal curve between Z = 0 and Z = 2.34
Solution
Using the normal tables
1. Look for 2.3 in the first column.
2. Look for 0.04 in the top row.
3. Look for the value where the row of 2.3 and the column of 0.04 meet. They meet at 0.4904.
Therefore the area is 0.4904

Example
Find the area between Z = 0 and Z = -1.75
Solution
Since area is always positive, look for 1.75. From the normal tables the area is 0.4599

Example
Find the area to the left of Z = -1.93
Solution
Look for 1.93. We get 0.4732
Area to the left of Z = -1.93 = 0.5000 - area between Z = 0 and Z = -1.93
= 0.5000 - 0.4732
= 0.0268
NOTE: The area between Z = 0 and +∞ is 0.5000 and the area between Z = 0 and -∞ is 0.5000

Example
Find the area between Z = 2.00 and Z = 2.47
Solution
First step: Find the area between Z = 0 and Z = 2.47, i.e. 0.4932
Second step: Find the area between Z = 0 and Z = 2.00, i.e. 0.4772
Third step: Find the difference of the two, i.e. 0.4932 - 0.4772
Therefore, the area is 0.0160
NOTE: If the areas are on the same side of Z = 0, subtract the areas.

Example
Find the area between Z = 1.68 and Z = -1.37
Solution
First step: Find the area between Z = 0 and Z = 1.68, i.e. 0.4535
Second step: Find the area between Z = 0 and Z = -1.37, i.e. 0.4147
Third step: Find the sum of the two, i.e. 0.4535 + 0.4147
Therefore, the area is 0.8682
NOTE: If the areas are on opposite sides of Z = 0, add the two areas.

NORMAL PROBABILITIES
P(x1 < x < x2) = P(z1 < z < z2) = area under the standard normal curve between the ordinates at z1 and z2.

Example
Find the probability for each of the following
(a) P(0