
Biostatistics, statistical software Krisztina Boda PhD

Department of Medical Informatics, University of Szeged

Part I

INTERREG IIIA Community Initiative Program 2008

Szegedi Tudományegyetem Prirodno-matematički fakultet, Univerzitet u Novom Sadu


1 Introduction

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities. Statistics are also used for making informed decisions.

The word statistics ultimately derives from the New Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state, signifying the "science of state" (then called political arithmetic in English). It acquired the meaning of the collection and classification of data generally in the early 19th century. It was introduced into English by Sir John Sinclair.

Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and to agriculture. Although the terms "biostatistics" and "biometry" are sometimes used interchangeably, "biometry" is more often used of biological or agricultural applications and "biostatistics" of medical applications. In older sources "biometrics" is used as a synonym for "biometry", but this term has now been largely usurped by the information technology industry.

Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modelled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics.

2 Data

2.1 Data table, cases, variables

Data are information resulting from a measurement or collection process. By data we usually mean numerical information, but labels such as people's names are also important information. Data values may be names, dates, identifiers or categories. To be useful, data must be recorded systematically and must have a context.

Definition. Data are systematically recorded information, whether numbers or labels, together with a context.

Individuals or cases are the objects described by a set of data; they may be people, animals or things. For each individual, the data give values for one or more variables. A variable describes some characteristic of an individual, such as a person's age, height, gender or salary.

A data table records the same information about each individual in a structured layout. Information about the same characteristic of each individual is gathered together into a variable. In a data table, each variable is a column. Each row of a data table holds the information for a single individual or case.

The problem of repeated measurements. Sometimes we repeat the same measurement on the same individual. If we would like to record repeated measurements in the data table, they are placed in the same row. Two repeated measurements on each subject therefore form two variables.

Statistical packages store the data in tables. These datasets have a 2-dimensional table structure where the rows typically represent cases (such as individuals or households) and


the columns represent measurements (such as age, sex or educational level). Generally 2 data types are defined, numeric and text (or "string"). Numbers in a text variable are treated as letters. Variables have to be named; in some systems the names may be at most 8 characters long, but variable labels can be used to explain the content of a variable. For example, the variable „EDUCAT” has the label „Educational level”. There are also so-called „value labels”, that is, texts assigned to the numerical values of a variable. For example, in the variable „SEX”, the labels „Male” and „Female” are assigned to the values 1 and 2, respectively. This facilitates data entry, because it is easier to type one number (1) than a whole word (Male).

Example 2.1-1. The following table (Table 2-1) contains data about individuals. There are 20 cases and 9 variables in this table.

Number | Code-word | Gender | Age | Height | Exam mark | Eye-colour | Is biostatistics necessary | Known statistical software
1 | mimi | L | 20 | 168 | 5 | brown | yes |
2 | metro | L | 19 | 160 | 5 | green | yes |
3 | hadó | L | 18 | 160 | 5 | brown | yes |
4 | irdaa | F | 19 | 172 | 3 | brown | no |
5 | sel | F | 18 | 195 | 2 | brown | yes | MS Statistic
6 | MLG | F | 23 | 176 | 2 | black | yes | Excel, Statistica for Windows
7 | lamantin | F | 20 | 182 | 2 | brown | no | dBaseIII, Excel
8 | 11200 | L | 19 | 163 | 2 | blue | yes |
9 | barca | F | 19 | 180 | 2 | blue | yes |
10 | cukor | L | 20 | 175 | 2 | brown | yes |
11 | pillangó | L | 19 | 165 | 2 | blue | yes | Windows XP
12 | fülemüle | L | 20 | 177 | 2 | brown | no |
13 | hopkinsoo1 | F | 24 | 180 | 3 | brown | yes |
14 | Board | F | 19 | 183 | 4 | brown | yes |
15 | pikula | L | 19 | 170 | 2 | green | no |
16 | uvula | L | 19 | 165 | 4 | brown | yes |
17 | szoszi20 | L | 20 | 168 | 5 | blue | yes |
18 | macska | L | 20 | 174 | 3 | brown | yes |

Table 2-1. Database about students, questionnaire data filled in by students.

Summary. Data are systematically recorded information collected in a data table. Observational units are cases in the rows of the data table. Variables are characteristics of cases in the columns of the data table.

2.2 Types of variables

Variables can be distinguished based on the number of values they can have. The exam mark can take five different values; sex may be male or female. These variables may take only a few values (here 5 and 2), so they are called categorical variables. The age or the body height of an individual can be any number in a reasonable interval. For example, age changes continuously from birth, but it can be measured in years, weeks, days or even


in seconds. Such variables with theoretically infinitely many possible values are called continuous variables.

Definition. Variables with finitely many possible values are called discrete or categorical variables. Variables with only two possible values are called binary or dichotomous. Variables with an infinite number of possible values (any real number) are called continuous variables.

Based on the property they represent, we distinguish qualitative and quantitative variables. Qualitative variables represent types or categories. Numeric data in which numbers are measured are called quantitative variables. We can think of quantitative variables as any variable for which it makes sense to average the values.

A third classification of variables is by measurement scale.

Nominal variable: values can be distinguished by names. Nominal variables are always categorical. Examples: sex, blood group, name, eye colour, etc.

Ordinal variable: values are ordered categories. Comparing two values, we are able to decide which is greater. Such variables are the exam mark, categories of hotels or drinks (***, ****, *****), etc. An ordinal variable is always a categorical (discrete) variable.

Interval: the numbers assigned to objects have all the features of ordinal measurements, and in addition equal differences between measurements represent equivalent intervals. That is, differences between arbitrary pairs of measurements can be meaningfully compared. Operations such as addition and subtraction are therefore meaningful. The zero point on the scale is arbitrary; negative values can be used. Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. Examples of interval measures are the year date in many calendars, and temperature on the Celsius or Fahrenheit scale (we can say that 40°C is 20°C higher than 20°C, but we cannot say that 40°C is twice as warm as 20°C).

Ratio: the numbers assigned to objects have all the features of interval measurement and also have meaningful ratios between arbitrary pairs of numbers. Operations such as multiplication and division are therefore meaningful. The zero value on a ratio scale is non-arbitrary. Variables measured at the ratio level are called ratio variables. Most physical quantities, such as mass, length or energy, are measured on ratio scales; so is temperature measured in kelvin, that is, relative to absolute zero.

The different classification types can overlap. For example, the variable „sex” is categorical (it has only 2 values), binary and also nominal. Blood group is also categorical and nominal (we cannot say which of A, B, AB or 0 is „better”). Exam mark is an ordinal, categorical and numerical variable.

Example 2.2-1. Types of variables in Table 2-1.
Code-word: nominal
Gender: categorical (and nominal), binary
Age: continuous
Body height: continuous
Exam mark: categorical (and ordinal)
Eye-colour: categorical (and nominal)
Is biostatistics necessary: categorical and binary; it can also be considered ordinal.
Known statistical software: qualitative, nominal


Changing the type of a variable. It is possible to transform any continuous variable into a categorical one by defining limits. For example, systolic blood pressure measured on a continuous scale can be recategorized to be ordinal (low, normal, high blood pressure) according to the values <80, 80-140, >140.

Finding the appropriate type of a variable. Sometimes the type of a variable depends on its context. For example, „colour” is categorical, but if it is measured by the wavelength it is continuous. Many psychological variables (knowledge, pain, etc.) are continuous, but they are generally measured on an ordinal scale (low, moderate, high). Sometimes these variables are measured on an analogue-visual scale, which results in a continuous variable.

Summary. Based on the number of values it can have, a variable is categorical (discrete) or continuous. Variables with two possible values are called binary or dichotomous. Based on the property it represents, a variable is nominal, ordinal, interval or ratio.

Review questions
What kind of information is in a data table?
What are in the rows of a data table?
What are in the columns of a data table?
What is a case?
What is a variable?
What types of variables are there according to the values they can have?
What types of variables are there according to the property they represent?
What types of variables are there according to the measurement scale?
What is a binary variable?
What is a categorical variable?
What is a continuous variable?
What is a nominal variable?
What is an ordinal variable?
What is the type of the variable „sex”? How many possible values are there in the variable „sex”?
What is the type of the variable „exam”? How many possible values are there in the variable „exam”?
What is the type of the variable „age”? How many possible values are there in the variable „age”?

Problem 2.2-1. Identify the variables as categorical or continuous in Table 2-1 (Database about students, questionnaire data filled in by students).
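The recoding just described can be sketched in a few lines of Python (`bp_category` is a hypothetical helper; the cut points <80, 80-140, >140 are the ones given above):

```python
def bp_category(systolic):
    """Recode a continuous systolic blood pressure (mmHg) into the
    ordinal categories used in the text: <80 low, 80-140 normal, >140 high."""
    if systolic < 80:
        return "low"
    elif systolic <= 140:
        return "normal"
    else:
        return "high"

# e.g. bp_category(120) gives "normal"
```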

Number Group Sex Age Domicile Education Marital status Reaction time 1 Reaction time 2

1.00 sick female 65 city secondary widowed 1.10 0.91
2.00 sick female 43 village secondary married 1.20 1.51
3.00 sick female 18 city secondary unmarried 1.60 0.78
4.00 sick female 18 city secondary unmarried 0.93 1.21
5.00 sick male 26 city elementary unmarried 1.48 1.50
6.00 sick male 21 city elementary married 1.37 1.10
7.00 sick female 28 city secondary unmarried 1.30 1.39
8.00 sick female 44 village elementary married 1.11 1.60


9.00 sick female 55 city secondary married 2.66 1.25
10.00 sick female 35 city elementary married 1.50 1.31
11.00 sick male 23 city elementary unmarried 1.61 1.11
12.00 sick female 33 village elementary married 1.83 1.62
13.00 sick female 60 city elementary married 2.22 1.13
14.00 sick female 43 city high married 1.45 1.86
15.00 sick female 48 city elementary widowed 1.78 0.82
16.00 sick female 20 village elementary married 1.34 1.42
17.00 sick female 46 city secondary married 1.02 1.26
18.00 sick female 39 village elementary married 1.12 1.00
19.00 sick male 32 city elementary married 2.47 0.83
20.00 sick male 29 city secondary married 2.15 1.48
21.00 sick female 27 city secondary divorced 0.98 1.26
22.00 sick male 22 village secondary unmarried 1.09 0.98
23.00 sick female 23 city secondary unmarried 1.83 0.99
24.00 sick female 74 town elementary widowed 1.04 0.77
25.00 sick female 46 city secondary divorced 1.53 1.28
26.00 sick male 56 city secondary married 1.11 1.20
27.00 sick male 41 city elementary divorced 1.03 1.02
28.00 sick male 33 city secondary married 2.48 0.82
29.00 sick male 30 city elementary married 1.10 0.99
30.00 sick female 24 city high unmarried 2.04 1.93
32.00 healthy male 22 town high unmarried 1.08 0.77
33.00 healthy female 22 city high unmarried 1.11 0.85
34.00 healthy male 48 village elementary married 0.95 1.00
35.00 healthy female 19 city secondary unmarried 1.91 0.82
36.00 healthy female 18 town secondary unmarried 1.05 1.07
37.00 healthy female 24 city secondary unmarried 1.28 1.29
38.00 healthy female 23 village secondary unmarried 2.25 1.47
39.00 healthy female 49 city elementary married 1.26 1.07
40.00 healthy male 19 city secondary unmarried 1.21 1.08
41.00 healthy male 20 town secondary unmarried 1.08 1.62
42.00 healthy male 22 city secondary unmarried 1.26 0.97
43.00 healthy female 24 town secondary unmarried 1.60 1.18
44.00 healthy male 22 village secondary unmarried 1.97 1.37
45.00 healthy male 24 city secondary unmarried 2.26 0.95
46.00 healthy female 31 city elementary married 1.12 1.12
47.00 healthy male 23 city secondary married 1.10 0.76
48.00 healthy female 20 town secondary unmarried 1.45 1.08
49.00 healthy female 41 city elementary married 1.05 1.17
50.00 healthy female 42 city elementary divorced 0.97 0.86
51.00 healthy female 28 village elementary married 0.91 1.35
52.00 healthy female 34 city elementary married 0.99 1.01
53.00 healthy female 50 village elementary married 0.98 0.83
54.00 healthy female 44 city secondary married 1.64 0.92
55.00 healthy female 33 city secondary unmarried 1.10 1.03
56.00 healthy female 48 city elementary married 1.02 1.32
57.00 healthy female 20 city secondary unmarried 0.85 0.86
58.00 healthy male 26 city secondary married 1.71 1.48
59.00 healthy male 23 town high unmarried 1.39 0.94
60.00 healthy male 24 town secondary unmarried 1.14 1.12
61.00 healthy female 35 city elementary married 0.87 0.89

Table 2-2.


2.3 Gathering data

To collect data we must first define what to measure. This decision defines the type of the variable. Data are often measured in an experiment; this is a planned way of gathering data. Another way of collecting data is the use of a questionnaire.

2.3.1 Aspects of planning questionnaires

An identification number or a simple serial number must be put on every sheet. Quantitative data can be written directly on the sheet. Categorical data can be coded, and the codes written on the sheet. Give 0-1 codes to binary variables. If possible, write the codes as numbers on the sheet.

Some questions result in only one categorical variable, when only one answer is possible (blood group, sex). In that case give codes to the possible values beginning from 0 or 1. There are also questions with several possible answers, for example hobby, performed operations, etc. For these questions we have to plan as many binary variables as there are possible answers.

Let us suppose that we would like to collect data about humans: sex, age, education, body weight and height, eye colour and hobby. Figure 2.3-1 shows an example of how to plan such a questionnaire. The serial number has three places, so at most 1000 people can be asked. Sex is a binary variable; its value can be described by a single code. The variable age is continuous; its value can be written directly on the sheet. Education could be measured by the number of years, but here it is measured by four categories. Only one category is possible, so one number (one variable) is enough. Eye colour is similar. Recording the hobbies of one person needs several variables, because one person might have several hobbies: we need as many variables as there are hobbies. To avoid the problem that somebody has a hobby not shown on the questionnaire, an „other” category is also given.

1. Serial number
2. Sex (1: male, 2: female)
3. Age in years
4. Highest education (1: <8 years elementary school, 2: elementary school, 3: secondary school, 4: high school)
5. Body weight (kg)
6. Body height (cm)
7. Eye colour (1: blue, 2: green, 3: grey, 4: brown, 5: black)
8. Hobby (1: yes, 2: no, for each): sport, music, stamp, dance, fine arts, other ..........................

Figure 2.3-1. Sample questionnaire.
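As a sketch of the multi-answer rule above, the single question „hobby” can be turned into one binary variable per possible answer (the function and names are illustrative; the coding 1 = yes, 2 = no follows the sample questionnaire):

```python
# Possible answers to the multi-answer question "hobby" (item 8 above).
HOBBIES = ["sport", "music", "stamp", "dance", "fine arts", "other"]

def code_hobbies(answers):
    """Turn one multi-answer question into six binary variables,
    coded 1 = yes, 2 = no as on the sample questionnaire."""
    return {h: (1 if h in answers else 2) for h in HOBBIES}

row = code_hobbies({"sport", "music"})  # one respondent's answers
```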


2.4 Planning the database, data input

Small squares in the questionnaire represent variables. Table 2-3 is a small database based on the answers of 20 persons, after filling in the questionnaire of Figure 2.3-1.

No SEX AGE SCH WEIGHT HEIGHT EYE SPORT MUSIC STAMP DANCE ART OTHER TEXT

1 1 20 3 65 185 3 1 1 2 2 2 2

2 2 17 3 60 170 4 1 2 2 1 2 2

3 1 22 3 62 177 2 2 1 2 2 2 2

4 2 28 4 62 176 4 2 1 2 1 2 1 travel

5 1 9 1 32 148 4 2 2 2 2 2 1 LEGO

6 1 5 1 19 125 3 2 2 2 2 2 1

7 2 26 3 70 166 4 2 2 2 2 2 1 family

8 1 60 4 75 180 1 1 1 2 2 2 2

9 2 35 3 49 155 4 2 1 2 2 2 2 reading

10 2 51 4 61 162 4 2 1 2 2 2 2 knit

11 1 17 2 61 178 4 2 1 2 2 2 2

12 2 50 2 65 164 4 2 2 2 2 2 1

13 1 9 1 30 130 2 1 2 2 2 2 2

14 2 10 1 40 135 1 2 1 2 1 2 2

15 1 19 3 86 187 3 1 1 2 2 2 2

16 1 22 3 67 179 4 2 2 2 2 1 2

17 1 25 3 103 186 4 1 1 2 2 2 1 FIDESZ

18 1 29 4 74 176 1 1 1 2 2 1 2 friends

19 2 27 4 67 164 4 1 1 2 2 2 1 reading

20 1 19 3 70 180 4 1 1 2 2 2 2

Table 2-3. Database SMALLQUES, after filling in the questionnaire of Figure 2.3-1.
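The code-plus-label idea described in Section 2.1 can be sketched like this (a minimal illustration, not any particular package's API):

```python
# Value labels: the data table stores numeric codes; labels are
# attached only for display (SEX: 1 = male, 2 = female, as on the
# sample questionnaire).
SEX_LABELS = {1: "male", 2: "female"}

sex_codes = [1, 2, 1, 2, 1]                    # what is typed in
labelled = [SEX_LABELS[c] for c in sex_codes]  # what is displayed
```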

2.4.1 Using Excel for data collection

Excel is a very good tool for collecting data. Each row should hold the data of one case across all columns, and each column should hold the data of a single variable. Use the first row of each column to name your variables; a name of at most 8 characters is recommended. Be careful with cells where numbers are expected: avoid characters such as „?”, „x”, „-”, „*”, etc. In case of missing data leave the cell empty. See Figure 2.4-1.
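The rules above can be checked programmatically after exporting such a sheet to CSV; a minimal sketch (the file content is hypothetical) that flags stray letters in a numeric column while accepting empty cells as missing data:

```python
import csv
import io

# Hypothetical CSV export of a small sheet; "x" is the kind of stray
# character the text warns against in a numeric column.
raw = "NO,AGE,WEIGHT\n1,20,65\n2,17,x\n3,21,\n"

bad_cells = []
for row in csv.DictReader(io.StringIO(raw)):
    value = row["WEIGHT"]
    if value == "":          # empty cell = missing data, acceptable
        continue
    try:
        float(value)         # numeric entry, fine
    except ValueError:       # letters in a numeric column: flag them
        bad_cells.append((row["NO"], value))
```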


Figure 2.4-1. Data input into Excel

Problem 2.4-1. Plan a database for a real or fictitious problem. Let it contain at least 2 discrete variables (one with two and one with at least three categories) and at least 2 continuous variables. Let the number of cases be 20.

2.4.2 A brief survey of statistical program systems (R, SPSS, Statistica, SAS)

Simple statistical calculations can be performed by hand, based on given formulas, and significance can be stated using statistical tables. The use of a computer system makes even these simple calculations safer, and it is necessary for more complicated problems. There are many statistical systems, and the choice of the appropriate statistical software is not easy. Here we give only a brief summary of some such systems. Several sites contain information about statistical software, for example http://statpages.org/ or http://en.wikipedia.org/wiki/List_of_statistical_packages. Independently of the choice of software, the following checking process is recommended: if we have a problem solved in a statistical textbook, check the software using the data of the solved problem.

2.4.2.1 Statistical software used in the Department of Medical Informatics

SPSS. References: SPSS 15.0 Command Syntax Reference 2006, SPSS Inc., Chicago Ill. http://www.spss.com, http://www.spss.hu, http://en.wikipedia.org/wiki/SPSS Downloads (demo): http://www.spss.com/downloads/papers.cfm

SAS. One of the most comprehensive packages. References: SAS Institute, Inc: The MIXED procedure in SAS/STAT Software: Changes and Enhancements through Release 6.11. Copyright © 1996 by SAS Institute Inc., Cary, NC 27513.


http://www.sas.com, http://www.sas.com/offices/europe/hungary/, http://en.wikipedia.org/wiki/SAS_programming_language

STATISTICA for Windows. References: http://www.statsoft.com/, http://www.statsoft.hu/, Electronic textbook: http://www.statsoft.com/textbook/stathome.html Statistica 8 demo: https://www.statsoft.com/secure/demorequest.html

R - programming language for statistics. R's source code is freely available under the GNU General Public License. References: http://www.r-project.org/, http://en.wikipedia.org/wiki/R_programming_language

Microsoft Excel. Reference: http://en.wikipedia.org/wiki/Microsoft_Excel

2.5 Distribution

We always begin our consideration of data by asking about the values of each variable. In particular, we pay attention to how these values are distributed: which values are common, which are rare, and which do not occur at all. The definition depends on the type of the variable.

2.5.1 The distribution of a categorical variable

Definition. The distribution of a categorical variable describes what values it takes and how often it takes these values.

Discrete distributions can be characterized by frequency tables and can be displayed by bar charts or pie charts. Bar charts show a bar for each category of a categorical variable. The height of each bar depicts the frequency (or relative frequency) of that category. Bar charts thus obey the area principle. The area principle states that in a statistical display, each data value should be represented by the same amount of area in the display. The area principle is observed by most statistical displays. It is most often violated by displays that use a false third dimension to make the display more visually exciting (Figure 2.5-2). For example, Table 2-4 and Figure 2.5-1 show the distribution of the variable „music”: of the 20 persons, 13 have music as a hobby and 7 do not.
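This frequency table can be reproduced directly from the MUSIC column of Table 2-3 (1 = yes, 2 = no); a short sketch using Python's standard library:

```python
from collections import Counter

# MUSIC column of Table 2-3 for the 20 persons (1 = yes, 2 = no).
music = [1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1]

freq = Counter("yes" if v == 1 else "no" for v in music)
percent = {k: 100 * n / len(music) for k, n in freq.items()}
# 13 persons (65%) answered yes, 7 (35%) answered no
```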

Answer | Frequency | Percentage
Yes | 13 | 65%
No | 7 | 35%

Table 2-4. Distribution of the variable “music”



Figure 2.5-1. Distribution of the variable “music” (bar chart and pie chart)


Figure 2.5-2. Inappropriate figures

2.5.2 The distribution of a continuous variable

Continuous variables, and often quantitative variables in general, take so many different values that it is not worthwhile to give the frequency of each value. For example, Figure 2.5-3 shows a bar chart of each value of the variable age: some values occur only once and a few values occur twice. To get an overview of the distribution of such a variable we collapse the data by forming intervals. The length of the intervals depends on the number of data and on our decision, and it also affects the form of the distribution. On Figure 2.5-4, the lengths of the intervals are 10 and 20. The bars are drawn side by side, showing that these intervals are not fixed categories. Such a figure is called a histogram. Sometimes the height of the columns is the relative frequency; in that case the form of the distribution does not change (Figure 2.5-5). Histograms show the distribution of a continuous variable in a „condensed” way. To practice how bin widths (or the number of bins) affect a histogram, see http://www.stat.sc.edu/~west/javahtml/Histogram.html.
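The effect of the interval length can be reproduced on the AGE column of Table 2-3. A small sketch (`bin_counts` is an illustrative helper; with integer ages, the half-open intervals below correspond to the bins 0-10, 11-20, …):

```python
# AGE column of Table 2-3 (20 persons).
ages = [20, 17, 22, 28, 9, 5, 26, 60, 35, 51,
        17, 50, 9, 10, 19, 22, 25, 29, 27, 19]

def bin_counts(data, width, low=0, high=60):
    """Count values per interval (e, e + width] for edges e = low, low+width, ..."""
    edges = range(low, high, width)
    return [sum(1 for x in data if e < x <= e + width) for e in edges]

counts_10 = bin_counts(ages, 10)  # bins 0-10, 11-20, ..., 51-60
counts_20 = bin_counts(ages, 20)  # bins 0-20, 21-40, 41-60
# with width 10 the shape shows a second small "hump" at 51-60;
# with width 20 it does not
```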


Figure 2.5-3. Representation of variable “age” by a bar chart (not recommended)


Figure 2.5-4. Frequency histogram of variable “age” with different interval lengths



Figure 2.5-5. Relative frequency histogram of variable “age” with different interval lengths

Definition. The distribution of a continuous variable describes what values it takes and how often these values fall into an interval.

Continuous distributions can be displayed by histograms, sometimes by dot-plots (Figure 2.5-6), or by so-called stem & leaf plots (Figure 2.5-7). Special programs are necessary to produce these charts.


Figure 2.5-6. Dot-plot

0 | 599
1 | 07799
2 | 02256789
3 | 5
4 |
5 | 01
6 | 0

Figure 2.5-7. Stem & leaf plot

2.5.3 The overall pattern of a distribution

The center, spread and shape describe the overall pattern of a distribution. Some distributions have a simple shape, such as symmetric or skewed. Not all distributions have a simple overall shape, especially when there are few observations. A symmetric distribution is one in which the two "halves" of the histogram appear as mirror images of one another. A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side. The distribution shown on Figure 2.5-4 is skewed to the right.

Another characteristic of the shape is where the “humps” are, i.e., which intervals contain the most data. These “humps” are the peaks of the smoothed distribution, or modes. A mode is the most frequent value assumed by a variable. The length of the intervals (or the number of intervals) affects a histogram: for example, the histogram on the left side of Figure 2.5-4 seems to have two modes, while on the right side there is only one mode. With many data this phenomenon can be observed better. On the site http://www.stat.sc.edu/~west/javahtml/Histogram.html the times until the next eruption of the "Old Faithful" geyser are summarized by a histogram. The length of the intervals can


be changed interactively. Using short intervals, the two peaks of the distribution can be observed.

Histograms can be used to detect outliers. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them (real data, a typing mistake, or something else). On Figure 2.5-8 there is an outlier, a patient with body weight 110 kg. If it is a real observation, then the shape of the distribution is skewed. If it is a typing mistake, then it can be removed from the database and the shape of the distribution becomes more symmetric.

Figure 2.5-8. Distribution of body weights
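A crude version of this outlier check can be sketched in a few lines (the weights are hypothetical values echoing the figure, and the distance threshold is arbitrary):

```python
# Hypothetical body weights like those in Figure 2.5-8, including
# the suspicious 110 kg value.
weights = [65, 60, 62, 62, 70, 75, 61, 65, 67, 74, 67, 70, 110]

median = sorted(weights)[len(weights) // 2]              # middle of 13 values
outliers = [w for w in weights if abs(w - median) > 30]  # arbitrary cut-off
# only the 110 kg observation lies far outside the overall pattern
```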

Summary. The distribution of a categorical variable describes what values it takes and how often it takes these values. Discrete distributions can be characterized by frequency tables and can be displayed by bar charts or pie charts. The distribution of a continuous variable describes what values it takes and how often these values fall into an interval. Continuous distributions can be displayed by histograms. The center, spread and shape describe the overall pattern of a distribution.

Review questions
What is the distribution of a categorical variable?
What is the appropriate chart for the distribution of a categorical variable?
What is the distribution of a continuous variable?
What is the appropriate chart for the distribution of a continuous variable?
What is a histogram?
What is an outlier?

Problem 2.5-1. Find the distribution of the variable „sex” in Table 2-1. Find the distribution of the variable „age” in Table 2-1. Describe its properties.

2.6 Describing distributions with numbers

Besides histograms, the overall shape of a distribution, namely its center and dispersion, can be characterised by numbers computed from formulas. To describe the formulas we shall use the following notation for the data measured on a continuous variable:



x1, x2, …, xn, or xi, i = 1, 2, …, n (2-1)

Here x denotes the measurement and the lower index denotes the serial number of the measurement. The number of cases is denoted by n. This sequence of numbers is often called a sample (see Chapter 3); the xi-s are called sample elements, and n is called the sample size.

2.6.1 Measures of the center

An investigator is often looking for a single number to represent the general trend of the data. Such a number is called an average or a measure of central tendency. The mean, the mode and the median are three commonly used measures, with the following definitions.

Mean. The mean is the arithmetic average of the sample elements: the sum of the data divided by the sample size. It is denoted by x̄ (x bar).

x x x xn

x

nn

ii

n

=+ + +

= =∑

1 2 1... (2-2)

Median. The median is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order. If the sample size is odd, the middle value is the [(n+1)/2]th element, so finding it is no problem. If the sample size is even, there are two scores in the middle [the (n/2)th and the (n/2+1)th] and the median is defined as their arithmetic mean. Mode. The mode is the most frequent number appearing in a sample. Contrary to a common misconception, the mode does not measure or estimate the center of the distribution of values, although in some common distributions the mode is indeed near the values commonly identified as central. Which of the three measures is the best? That depends on the particular situation. Each average has certain properties, and depending on the context these properties may or may not be useful. For example, the median is less affected by extremely large or extremely small values, while the mean is affected by every score. Example 2.6-1. Find the mean, median and mode for the following data: 2,4,3,4,7. The sum of the data is 20 and the number of cases is 5, so the mean is x̄ = 20/5 = 4. To find the median we first sort the data in ascending order: 2,3,4,4,7. There is a unique element in the middle (the third), so the median is 4. The mode is 4 because it occurs twice. Example 2.6-2. Find the mean, median and mode for the following data: 2,4,3,4,17. The sum of the data is 30 and the number of cases is 5, so the mean is x̄ = 30/5 = 6. Sorting the data (2,3,4,4,17), the middle element is again the third, so the median is 4. The mode is 4 because it occurs twice. Example 2.6-3. Find the mean, median and mode for the following data: 2,4,3,3,7,17. The mean is x̄ = 36/6 = 6. Sorting the data: 2,3,3,4,7,17. There are two middle elements, 3 and 4, so the median is their mean, (3+4)/2 = 3.5. The mode is 3 because it occurs twice.
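The three examples above can be checked with Python's standard `statistics` module:

```python
import statistics

data1 = [2, 4, 3, 4, 7]      # Example 2.6-1
data2 = [2, 4, 3, 4, 17]     # Example 2.6-2
data3 = [2, 4, 3, 3, 7, 17]  # Example 2.6-3

# mean = 4, median = 4, mode = 4 for data1
mean1, median1, mode1 = (statistics.mean(data1),
                         statistics.median(data1),
                         statistics.mode(data1))

# one changed value pulls the mean up to 6, but the median stays 4
mean2, median2 = statistics.mean(data2), statistics.median(data2)

# even sample size: median is the mean of the two middle elements, 3.5
median3 = statistics.median(data3)
```

`statistics.median` sorts the data internally, so the input order does not matter.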

2.6.2 Properties of the measures of the center. As can be seen from the previous examples, the different measures of the center are not necessarily equal. Comparing Example 2.6-1 and Example 2.6-2, the two samples differ in only one value (the last one). In the second sequence the last value is greater, resulting in a larger mean; the median, however, did not change. It can also be seen that the distribution


is more skewed in the case of the second data sequence (Figure 2.6-1). In general, for symmetric and unimodal distributions the mean, the median and the mode are equal. The more skewed the distribution, the more these measures separate: for a distribution skewed to the right, mode ≤ median ≤ mean, and for a distribution skewed to the left, mean ≤ median ≤ mode.

[Figure: two frequency bar charts; in the symmetric distribution (a) mean = median = mode coincide, while in the skewed distribution (b) the mean and the median separate.]

Figure 2.6-1. The measures of the center in case of symmetric (a) and skewed (b) distributions

2.6.3 The effect of linear transformations on the measures of the center. Adding (or subtracting) the same number to each data value shifts each measure of center by the amount added (subtracted). Multiplying (or dividing) each data value by the same number multiplies (or divides) the measures of center by that number. Proof for the mean: let the linear transformation be x → ax + b. Then

(1/n) ∑ (a·xi + b) = (1/n)[(a·x1 + b) + (a·x2 + b) + … + (a·xn + b)] = (1/n)[a(x1 + x2 + … + xn) + nb] = a·x̄ + b.

2.6.4 Measures of dispersion.

We need a measure that indicates whether the numbers in a distribution are close together or far apart. Such a measure is called a measure of dispersion, variability, scatter or spread. Ideally such a measure should be large if the data are spread out and small if they are close together. The range, the interquartile range, the variance and the standard deviation are the most commonly used measures of variability. The range is the difference between the largest number (maximum) in the sample and the smallest number (minimum). This measure depends on the extreme values in the data. For example, comparing Example 2.6-1 and Example 2.6-2, the range of the second data sequence is wider because of the outlier 17. Percentiles. As the median cuts the distribution into two parts at 50%, we can define percentiles based on a similar principle. Definition. A p% percentile is the value below which p% of the cases fall. Quartiles. The 25%, 50% and 75% percentiles are called the first, second and third quartiles. They are denoted by Q1, Q2, Q3. Interquartile range. The interquartile range (often abbreviated IQR because 3 syllables is better than 5) is the difference between the upper (75%) quartile and the lower (25%) quartile. The IQR is a measure of spread that is resistant to the effects of outliers, and thus especially favored for unruly data. Comment. Sometimes the 5% and 95% percentiles are used as a measure of dispersion.
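The range and the IQR can be computed with the standard library. Note that quartile conventions differ between statistical packages; the sketch below uses the `inclusive` method of `statistics.quantiles`, which interpolates between data points, so other software may place the quartiles slightly differently:

```python
import statistics

data = [2, 4, 3, 4, 7]  # data of Example 2.6-1

# 'inclusive' treats the data as the whole population of interest and
# interpolates between sorted data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                       # interquartile range
data_range = max(data) - min(data)  # range = maximum - minimum
```

For this sample the quartiles are 3 and 4, so the IQR is 1 and the range is 5, matching the values in Example 2.6-4 below.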


The standard deviation. Next, we need a measure of dispersion about the mean. A value an equal distance above or below the mean should contribute the same amount to our measure of variability, even though in one case the deviation from the mean is positive and in the other it is negative. Squaring a number makes it positive, so let us describe the variability of a sample about the mean by computing the variance as the average squared deviation from the mean (with the sum of squares divided by n − 1 rather than n, for reasons connected with estimation from samples). It can be calculated according to the following formula:

Variance = SD² = (1/(n−1)) ∑ (xi − x̄)², where the sum runs over i = 1, …, n. (2-3)

The standard deviation. Since variances are often hard to visualize, it is more common to present the square root of the variance; this quantity is called the standard deviation:

SD = √[ (1/(n−1)) ∑ (xi − x̄)² ] (2-4)

SD is measured in the same physical units as the original sample data. For computational purposes the following equivalent formula for the standard deviation is more convenient:

SD = √[ (∑ xi² − (∑ xi)²/n) / (n−1) ] = √[ (∑ xi² − n·x̄²) / (n−1) ] (2-5)

Example 2.6-4. Find the measures of dispersion for the following data: 2,4,3,4,7, using formulas 2-4 and 2-5. The range is maximum − minimum = 7 − 2 = 5. The quartiles are 3 and 4, so the interquartile range is 4 − 3 = 1. The computation of the standard deviation is shown in Table 2-5.

i    xi   xi − x̄     (xi − x̄)²   xi²
1     2   2−4 = −2       4          4
2     4   4−4 =  0       0         16
3     3   3−4 = −1       1          9
4     4   4−4 =  0       0         16
5     7   7−4 =  3       9         49
Σ    20          0      14         94

By formula 2-4: SD = √(14/4) = √3.5 = 1.87.
By formula 2-5: SD = √((94 − 20²/5)/4) = √((94 − 80)/4) = √(14/4) = 1.87.

Table 2-5. Computation of the standard deviation
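The two formulas can be checked against each other on the data of Example 2.6-4 with a short Python sketch:

```python
from math import sqrt

data = [2, 4, 3, 4, 7]
n = len(data)
mean = sum(data) / n  # 20/5 = 4

# Definitional formula (2-4): root of the sum of squared deviations over n-1
sd_def = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Computational formula (2-5): needs only the sum of x and the sum of x squared
sum_x = sum(data)                   # 20
sum_x2 = sum(x * x for x in data)   # 94
sd_comp = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# both equal sqrt(14/4) = sqrt(3.5) ≈ 1.87
```

The computational form is convenient when the data arrive one by one: only two running sums have to be maintained, not the whole sample.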

2.6.5 Properties of the measures of dispersion, effect of transformations

Adding (or subtracting) the same number to each data value does not change the measures of dispersion. Multiplying (or dividing) each data value by the same number multiplies (or divides) the measures of spread by the absolute value of that number.
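These rules can be verified numerically; the transformation constants a = 3, b = 10 below are arbitrary choices for illustration:

```python
import statistics

data = [2, 4, 3, 4, 7]
a, b = 3, 10  # hypothetical linear transformation x -> a*x + b

transformed = [a * x + b for x in data]

# the mean both shifts and scales: a*mean + b = 3*4 + 10 = 22
mean_t = statistics.mean(transformed)

# the standard deviation only scales, by |a|; the shift b drops out
sd_t = statistics.stdev(transformed)
```

Subtracting b first and then dividing by a recovers the original mean and SD, which is exactly how standardized scores are built in the next section.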


Proof of the effect of the transformation x → ax + b on the standard deviation (using that the mean of the transformed data is a·x̄ + b, see 2.6.3):

SD' = √[ (1/(n−1)) ∑ (a·xi + b − (a·x̄ + b))² ] = √[ (1/(n−1)) ∑ (a·xi − a·x̄)² ] = √[ a²·(1/(n−1)) ∑ (xi − x̄)² ] = |a|·SD.

2.6.6 A measure of an individual: the z-score or standardized score. We have already mentioned one measure of an individual score: the value assigned to the experimental unit (person, object). Another very important measure of an individual in a sample is the z-score. The z-score measures how many standard deviations a sample element lies from the mean. The formula for the z-score corresponding to a particular sample element xi is:

zi = (xi − x̄)/SD, i = 1, 2, …, n.

Standardizing the whole sample, the mean of the standardized values is 0 and their standard deviation is 1. Example 2.6-5. The results of an exam in a class were good: the class mean was 83, the standard deviation was 5, the median was 87 and the range was 24. Peter is in this class. His grade was 69, so his z-score is z = (69 − 83)/5 = −14/5 = −2.8. This tells us that Peter's score was almost 3 standard deviations below the mean. Example 2.6-6. The z-score can be used to compare relative standings. Consider two test grades you might receive: suppose you received an 85 in English and a 65 in physics, and suppose the mean in English was 70 and the mean in physics was 50. In both classes you scored 15 points above the mean. Does this mean that, relatively speaking, you did equally well in both classes? The answer is no: the number of points above or below the mean is insufficient to rate your relative position in the class, as you can see from the class scores given in Table 2-6. Here we can also check that the mean and the standard deviation of the standardized scores are 0 and 1, respectively.

English              zi      Physics             zi
100                  1.14    65 (Peter's score)  1.86
 99                  1.10    57                  0.87
 98                  1.06    55                  0.62
 85 (Peter's score)  0.57    53                  0.37
 73                  0.11    50                  0.00
 67                 −0.11    49                 −0.12
 60                 −0.38    47                 −0.37
 53                 −0.64    44                 −0.74
 45                 −0.95    44                 −0.74
 20                 −1.89    36                 −1.74
x̄ = 70              x̄ = 0   x̄ = 50             x̄ = 0
s = 26.4             s = 1   s = 8.1             s = 1

Table 2-6.
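The standardization of the English column of Table 2-6 can be reproduced in Python; the assertions at the end confirm the claim that a standardized sample has mean 0 and standard deviation 1:

```python
import statistics

# English scores from Table 2-6
english = [100, 99, 98, 85, 73, 67, 60, 53, 45, 20]

mean = statistics.mean(english)   # 70
sd = statistics.stdev(english)    # ≈ 26.4

# z-score of every element: how many SDs it lies from the mean
z = [(x - mean) / sd for x in english]
```

The score 85 gets z = 15/26.4 ≈ 0.57, while 65 in physics gets z = 15/8.1 ≈ 1.86: the same 15-point surplus ranks much higher in the less spread-out class.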


2.6.7 The use of the sample characteristics; preparing figures. If the distribution of our data has only one peak and is quite symmetric, the distribution can be described by the mean and the standard deviation. Very often (in the case of a normal distribution, see later) 68% of cases fall within one standard deviation of the mean and 95% of cases fall within two standard deviations. For example, if the mean age is 45 with a standard deviation of 10, 95% of the cases would be between 25 and 65 in a normal distribution. If the distribution is unimodal but not symmetric, the median and range, or the median and interquartile range, are used to describe the center and the dispersion, respectively. Using the five-number summary (minimum, first quartile, median, third quartile and maximum), a plot called the box plot can be drawn that describes the shape of the distribution. A box plot displays the 5-number summary of a variable: a central box running from the first to the third quartile, marked by a line at the median. Whiskers then extend either to the maximum and minimum data values, or to the most extreme values within 1.5 IQRs of each quartile. Any data values outside these limits are displayed individually because they may be outliers. Example 2.6-7. Prepare a mean and standard deviation plot for the data of Example 2.6-1 (2,3,4,4,7). The mean was x̄ = 4, the standard deviation SD = 1.87. We draw a bar with the length of the mean and with a "whisker" according to the value of the standard deviation; sometimes double the SD is used (Figure 2.6-2). For the box diagram we need the following 5 numbers: minimum = 2, Q1 = 3, median = 4, Q3 = 4, maximum = 7 (Figure 2.6-3).

Figure 2.6-2. Mean-standard deviation plot

Figure 2.6-3. Box plot
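The five-number summary and the 1.5·IQR whisker limits behind a box plot can be computed as follows (again using the `inclusive` quartile convention; other programs may place the quartiles slightly differently):

```python
import statistics

data = [2, 3, 4, 4, 7]  # data of Example 2.6-7

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_numbers = (min(data), q1, q2, q3, max(data))  # (2, 3.0, 4.0, 4.0, 7)

# 1.5*IQR rule: values outside these fences are drawn individually
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
```

A plotting library would draw the box from q1 to q3 with a line at q2 and whiskers out to the most extreme data values inside the fences.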

These types of figures are useful when data are available in groups, so the groups can be compared. Figure 2.6-4 shows the body weights of boys and girls. The histogram of the girls


is skewed to the right. This asymmetry is reflected in the box diagram; however, it is not detectable on the mean-SD plot, where only the dispersion of the girls appears slightly bigger than that of the boys.

Figure 2.6-4. Histogram, mean-standard deviation plot and box plot for groups of data.

Summary. The mean, the median and the mode are the most commonly used measures of the center of a distribution. The range, the interquartile range, the variance and the standard deviation are the most commonly used measures of variability. Review questions What is the mean? What is the median? What is the mode? When are they equal?


If the distribution is skewed to the right, what is the relationship between the mean and the median? What are the measures of dispersion? What is the range? What is the variance? What is the meaning of the standard deviation? What is the interquartile range? Problem 2.6-8. Find the measures of the center and of the dispersion for the variable "age" in Table 2-1 (the database of questionnaire data filled in by students). Try to describe the shape of the distribution based on these measures.

3 The basics of probability theory

3.1 Experiments, events. Since we are concerned with the analysis of data originating from experiments, we first have to state what we mean by an experiment and its result. We call an experiment a random experiment if the outcome is not determined uniquely by the considered conditions. For example, tossing a coin, rolling a dice, measuring the concentration of a solution, or measuring the body weight of an animal are experiments. Every experiment has several, sometimes infinitely many, outcomes. For example, if the experiment is tossing a coin, there are two outcomes: "heads" or "tails"; if the experiment is rolling a dice, there are 6 outcomes. If the experiment is measuring a concentration, the experiment has as many outcomes as concentrations can be imagined, that is, as many real numbers as lie between 0 and 100. If an experiment has only a few, finitely many outcomes, then these outcomes will all be realised if we repeat the experiment many times. But if the experiment has very many or infinitely many outcomes, there will be outcomes which are never realised. If we measure the concentration 10 times, only ten of all possible outcomes are realised, but we still keep track of those outcomes that were not realised. The result (or outcome) of an experiment is called an event. Events are denoted by capital letters. There are elementary and composite events. The possible outcomes of an experiment are called elementary events. The set of all elementary events of an experiment is called the sample space and is denoted by Ω. An event is called a composite event if it can be divided into sub-events; composite events are subsets of the sample space. An event is called certain if it occurs under all conditions (Ω). An event is called impossible, and is denoted by ∅, if it never occurs.

Examples: 1. The experiment is tossing a coin. The outcomes are "heads" (H) and "tails" (T); these are the elementary events. The events are ∅, {H}, {T}, Ω. The certain event is Ω: the outcome is either heads or tails. 2. The experiment is tossing two coins. The elementary events are the following: the first shows "heads" and the second shows "heads": (H,H); the first shows "heads" and the second shows "tails": (H,T); the first shows "tails" and the second shows "heads": (T,H); the first shows "tails" and the second shows "tails": (T,T).


3. The experiment is rolling a dice. The elementary events are 1,2,3,4,5,6. Events: E1={1,3,5} (the result is an odd number), E2={2,4,6} (the result is an even number), E3={5,6} (the result is greater than 4), Ω={1,2,3,4,5,6} (the certain event).

3.2 Operations with events. The complementary event of an event A is the event which occurs whenever A fails; it is denoted by Ā.

Example: the complement of E1 = {1,3,5} is {2,4,6}; the complement of E3 = {5,6} is {1,2,3,4}; the complement of Ω is ∅.

The sum of two events A and B is the event A+B which occurs if A or B occurs. Examples: E1+E2={1,3,5}+{2,4,6}={1,2,3,4,5,6} E1+E3={1,3,5}+{5,6}={1,3,5,6}

The product of two events A and B is the event AB which occurs if A and B occur. Examples: E1•E2={1,3,5}•{2,4,6}=∅ E1•E3={1,3,5}•{5,6}={5} E2•E3={2,4,6}•{5,6}={6}

If A•B = ∅, we say that A and B are mutually exclusive events. For example, E1 and E2 are mutually exclusive. 3.3 The concept of probability. Some experiments can be repeated every time under the same conditions. Let's repeat such an experiment n times. In a large number of n experiments the event A is observed to occur k times (0 ≤ k ≤ n); k is called the frequency of the occurrence of the event A. The number

h = k/n

is called the relative frequency of the occurrence of the event A. If n is large, h approximates a given number. This number is called the probability of the occurrence of the event A and is denoted by P(A). Because 0 ≤ h ≤ 1, P(A) also lies between 0 and 1. Example: let us consider the simplest experiment, tossing a (fair) coin, and count the frequencies of "heads" when we toss the coin 10, 100, 1000, 10000 times. We may get the frequencies and relative frequencies shown in Table 3-1 and Figure 3.3-1.

n     10    100   1000   10000   50000    100000
k      4     52    482    5160   24821     49850
k/n  0.4   0.52  0.482   0.516  0.49642   0.4985

Table 3-1. Possible frequencies and relative frequencies when tossing a coin
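A run in the spirit of Table 3-1 can be simulated with Python's random number generator (the simulated frequencies will of course differ from the table, since every run of the experiment differs):

```python
import random

random.seed(1)  # fixed seed so this particular run is reproducible

for n in (10, 100, 1000, 10000):
    # count "heads" in n simulated fair-coin tosses
    k = sum(1 for _ in range(n) if random.random() < 0.5)
    print(n, k, k / n)  # the relative frequency k/n settles near 0.5
```

As n grows, the relative frequency stabilises around the probability 0.5, illustrating the frequency definition of probability.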



Figure 3.3-1.Possible relative frequencies when tossing a coin

The greater n is, the closer h is to a given number: if n is large, h approximates the number 0.5. We say that the probability that the coin shows "heads" is 0.5. In notation: P(H) = 0.5.

3.3.1 Axioms of probability. The above definition of probability is generally sufficient for practical purposes, although it is mathematically unsatisfactory. One of the difficulties with this definition is the need for an infinity of experiments, which are of course impossible to perform and even difficult to conceive. We indicate here the basic concepts of the axiomatic theory of probability due to Kolmogorov (1933). Axiom 1. To each event A there corresponds a non-negative number, its probability, which lies between 0 and 1: 0 ≤ P(A) ≤ 1. Axiom 2. The probability of the certain event is one: P(Ω) = 1. Axiom 3. If A and B are mutually exclusive events (that is, A•B = ∅), then the probability of A or B is:

P(A+B) = P(A) + P(B).

Consequences. 1. For every event A, P(Ā) = 1 − P(A). 2. The probability of the impossible event is zero: P(∅) = 0. 3. If A1, A2, …, An are pairwise exclusive events, i.e., P(AiAj) = 0 for 1 ≤ i ≠ j ≤ n, then P(A1 + … + An) = P(A1) + … + P(An). 4. For any events A and B: P(A+B) = P(A) + P(B) − P(AB).

3.3.2 Rules of probability calculus. Let us consider once again the tossing of a coin. What is the probability that a "regular" coin shows "heads" when tossed once? Our intuition suggests that this probability equals 0.5. It is based on the assumption that all elementary events in the sample space are equally probable and that the probability of the certain event is one. Generally, if an experiment has T equally probable outcomes (elementary events), we can compute the probability of an event A in the following way:

P(A) = F/T = (number of favourable outcomes)/(total number of outcomes),

where F is the number of favourable elementary events in A and T is the total number of outcomes.


Examples: 1. Rolling a dice. What is the probability that the dice shows 5? If we let X represent the value of the outcome, then P(X=5) = 1/6, because the total number of elementary events is T = 6. 2. What is the probability that the dice shows an odd number? P(odd) = 1/2. Here F = 3, T = 6, so F/T = 3/6 = 1/2.
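The classical F/T rule can be sketched in Python; the `prob` helper and the lambda events below are our own illustration, not part of any library:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # elementary events of rolling a dice

def prob(event):
    # classical rule: number of favourable outcomes / total number of outcomes
    favourable = sum(1 for x in outcomes if event(x))
    return Fraction(favourable, len(outcomes))

p_five = prob(lambda x: x == 5)      # 1/6
p_odd = prob(lambda x: x % 2 == 1)   # 3/6 = 1/2
p_gt1 = prob(lambda x: x > 1)        # 5/6, cf. Problem 3.3-1 b)
```

`Fraction` keeps the probabilities exact, which makes checking results like 3/6 = 1/2 trivial.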

3.3.3 Conditional probability. Let's consider a series of n repetitions of a given random experiment. From the total sequence of n repetitions, select the sub-sequence consisting of the kA repetitions in which the event A has occurred. In this sub-sequence there are kAB repetitions in which the event B has also occurred. The ratio kAB/kA is the relative frequency of the event B within the selected sub-sequence. Dividing both frequencies by n we get, for large n:

kAB/kA = (kAB/n)/(kA/n) ≈ P(AB)/P(A).

We define the conditional probability of B under the condition A as

P(B|A) = P(AB)/P(A). (3-1)

It follows that

P(AB) = P(A)·P(B|A). (3-2)

Figure 3.3-2. Operations with events

From the example of Figure 3.3-2 it can be seen that this definition is reasonable. Here the event A+B occurs if a point lies inside the region A or B; if the point happens to lie in the overlapping region, the event AB occurs. Let the area of each region be proportional to the probability of the corresponding event. Then the probability of B under the condition A is the quotient of the areas of AB and A.

Example: Two cards, C1 and C2, are drawn simultaneously from the same set. Let A denote the event that C1 is a heart, and B the event that C2 is a heart. Obviously P(A) = 13/52. Further, if A has occurred, C2 is drawn from a set containing 51 cards, 12 of which are hearts, so P(B|A) = 12/51. The probability that both cards are hearts is P(AB) = P(A)·P(B|A) = (13/52)·(12/51) = 1/17.
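The card example can be verified by brute-force enumeration of all ordered pairs of distinct cards (the suit and rank encodings below are our own):

```python
from fractions import Fraction
from itertools import permutations

# a 52-card deck: 4 suits ("H" = hearts) with 13 ranks each
deck = [(suit, rank) for suit in "HSDC" for rank in range(1, 14)]

# all ordered draws of two distinct cards: 52*51 = 2652 equally likely pairs
draws = list(permutations(deck, 2))
both_hearts = sum(1 for c1, c2 in draws if c1[0] == "H" and c2[0] == "H")

p_ab = Fraction(both_hearts, len(draws))  # 156/2652 = 1/17
```

The enumeration agrees with the multiplication rule: 156/2652 reduces to (13/52)·(12/51) = 1/17.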


3.3.4 Independence of events. Two events A and B are said to be independent if the knowledge that A has occurred does not change the probability of B, and vice versa; that is,

P(B|A) = P(B). (3-3)

Substituting (3-3) into (3-2) we get

P(AB) = P(A)·P(B).

The probability of the product of two independent events is equal to the product of the probabilities of the events. In this case the probability of either event is independent of whether the other has occurred, and we say that A and B are independent events from the point of view of probability theory. Sometimes this is expressed by saying that A and B are stochastically independent.

Example: Let us modify the last example so that the cards C1 and C2 are drawn from two different sets of cards. Now the two events A and B are independent, and the probability that both cards are hearts is

P(AB) = (13/52)·(13/52) = 1/16.

Summary. The concept of probability is based on experiments. An event is an outcome of an experiment. The probability of an event is the number approximated by the relative frequencies obtained under many repetitions. There is a simple rule for computing exact probabilities when every elementary event is equally probable: the number of favourable outcomes divided by the total number of outcomes. Review questions: What is the concept of probability? What is the formula for computing simple exact probabilities? When can it be used? Problems. Problem 3.3-1. If we roll a dice, there are 6 possible outcomes. If X represents the value of the outcome, find the following probabilities: a) P(X=1); b) P(X>1); c) P(1<X<4). Problem 3.3-2. A fair coin is tossed twice. List the possible outcomes. What is the probability of getting two tails? Problem 3.3-3. A penny is tossed once and a dice is rolled once. The possible outcomes are H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6. Find the probabilities of the following outcomes: a) tossing a head and rolling an even number; b) tossing a head or rolling an even number; c) tossing a head and rolling a 5; d) tossing a head or rolling a 5; e) rolling either a 4 or a 6. Problem 3.3-4. Measuring the systolic blood pressure, we define the following elementary events:


A: the blood pressure is less than 120 mmHg; B: the blood pressure is between 120 mmHg and 150 mmHg; C: the blood pressure is greater than 150 mmHg. The sample space is {A,B,C}. List the composite events. Problem 3.3-5. Let's denote the weight of glucose in 100 ml blood by x. Let x1 and x2 be two fixed values (x1<x2). Let A,B,C,D be the following events: A: x<x1, B: x>x1, C: x<x2, D: x>x2. Which events are mutually exclusive? Problem 3.3-6. Measuring 15 mice, the following values were found: 28 31 26 26 29 31 30 27 25 30 28 28 23 32 30. What is the relative frequency of the following events: E: x<26, F: x=31, G: 26<x<31? 3.4 Random variables, probability distributions. In a great part of experiments, the result (the elementary event) is associated with a number. This number depends on the experiment; that is, repeating the experiment will result in different elementary events. These numbers are assigned to the elementary events even if we do not actually perform the experiment. This idea leads us to the definition of the random variable. Definition. A random variable is a function defined on the set of elementary events Ω={ω1,ω2,...} taking values from the set R of real numbers. In other words, a function is called a random variable if it assigns a real number to every elementary event. Random variables are denoted by capitals: X, Y, etc. In notation: X: Ω→R, or ω∈Ω → X(ω)∈R.

Examples: 1. The experiment is tossing a coin; the elementary events are "heads" and "tails", Ω={H,T}. We can associate the event "heads" with the number 0 and the event "tails" with the number 1. Then the random variable X has the following two values: X(H)=0 and X(T)=1. 2. We can define another random variable for the same experiment. Suppose two persons are playing a game: in the case of "heads" the first person pays $1 to the other, and in the case of "tails" the first person receives $1 from the other. If X denotes the "income" of the first person, then X(H)=−1 and X(T)=1. 3. Rolling a dice. Ω={1,2,3,4,5,6}. Let X be the number shown on the dice. 4. Rolling two dices. Ω is now the set of all possible pairs of the numbers from 1 to 6: (1,1),(1,2),(1,3),...,(5,6),(6,6); altogether 36 pairs. Let the random variable X be the sum of the two numbers shown on the two dices: X((i,j))=i+j (i,j=1,2,...,6). Table 3-2 shows the elementary events and the values of X. 5. The experiment is measuring the body weight of mice: an elementary event is choosing one mouse arbitrarily, and the random variable X is the result of the measurement. 6. Generally, all processes of measurement or production are subject to smaller or larger imperfections or fluctuations that lead to variations in the result, which is therefore described by one or several random variables. Thus the body height, weight and temperature characterising an animal are random variables.


        j=1    j=2    j=3    j=4    j=5    j=6
i=1   (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
X        2      3      4      5      6      7
i=2   (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (2,6)
X        3      4      5      6      7      8
i=3   (3,1)  (3,2)  (3,3)  (3,4)  (3,5)  (3,6)
X        4      5      6      7      8      9
i=4   (4,1)  (4,2)  (4,3)  (4,4)  (4,5)  (4,6)
X        5      6      7      8      9     10
i=5   (5,1)  (5,2)  (5,3)  (5,4)  (5,5)  (5,6)
X        6      7      8      9     10     11
i=6   (6,1)  (6,2)  (6,3)  (6,4)  (6,5)  (6,6)
X        7      8      9     10     11     12

Table 3-2. Rolling two dices: possible outcomes
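The distribution of X can be obtained directly from Table 3-2 by enumerating the 36 equally probable pairs:

```python
from fractions import Fraction

# count how many of the 36 outcomes of Table 3-2 give each sum
counts = {}
for i in range(1, 7):
    for j in range(1, 7):
        counts[i + j] = counts.get(i + j, 0) + 1

# classical rule: P(X=s) = favourable / 36
prob = {s: Fraction(c, 36) for s, c in counts.items()}
# e.g. P(X=2) = 1/36, P(X=7) = 6/36 = 1/6
```

The probabilities sum to 1, as they must for a distribution.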

Random variables can be discrete, continuous or mixed. A random variable is called discrete if its set of possible values is finite (or countably infinite). The range of a continuous random variable is an interval of real numbers. We will not deal with mixed random variables.

For example, the random variables of Examples 1, 2, 3 and 4 are discrete, while the random variables of Examples 5 and 6 are continuous.

3.4.1 Distribution function of discrete variables.

Let's denote the possible values of a discrete random variable X by x1, x2, …, xn. We can compute the probabilities P(X=xi), i=1,2,…,n. The sequence of these probabilities is called the distribution of the random variable. On the basis of these probabilities we can also compute the probability that the value of X is less than a given number x: P(X<x). Definition. Consider the random variable X and a real number x, which can assume any value between −∞ and +∞, and study the probability of the event (X<x). This probability is a function of x; it is called the (cumulative) distribution function of X and is denoted by F(x):

F(x) = P(X<x).

Example 1. Let the variable X be the result of rolling a dice. X can equal 1,2,3,4,5,6. The probabilities P(X=xi), i=1,...,6, are P(X=1)=P(X=2)=P(X=3)=P(X=4)=P(X=5)=P(X=6)=1/6. This sequence gives the distribution of X. Now let's compute the probabilities that X<x, where x is any real number, by adding the above probabilities: P(X<1)=0, because it is impossible to roll less than 1. P(X<2)=1/6, because the only outcome less than 2 is 1, and its probability is 1/6. P(X<3)=2/6=1/3, because the event (X<3) means either (X=1) or (X=2). P(X<4)=3/6=1/2. P(X<5)=4/6=2/3. P(X<6)=5/6. P(X<6.1)=6/6=1. P(X<7)=6/6=1, and so on. Figure 3.4-1 shows the graph of this function.


Figure 3.4-1. Distribution function of rolling a dice
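The step function F(x) = P(X < x) of Figure 3.4-1 can be computed directly by counting the dice values below x:

```python
from fractions import Fraction

def F(x):
    """Distribution function F(x) = P(X < x) for rolling a dice."""
    favourable = sum(1 for k in range(1, 7) if k < x)
    return Fraction(favourable, 6)

# F jumps by 1/6 at each possible value:
# F(1) = 0, F(4) = 3/6 = 1/2, F(6) = 5/6, F(7) = 1
```

Note the strict inequality k < x: this is what makes F(6) = 5/6 but F(6.1) = 1, exactly as in Example 1.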

Example 2. The experiment is rolling two dices; the random variable X is the sum of the two numbers shown: X((i,j))=i+j (i,j=1,...,6). Now P(X=1)=0 (X=1 is impossible); P(X=2)=1/36 (the only favourable event is (1,1), and the number of all possible events is 36); P(X=3)=2/36 (favourable events are (1,2) and (2,1)); ...; P(X=12)=1/36. The plot of these probabilities can be seen in Figure 3.4-2.


Figure 3.4-2. The distribution of rolling two dices.

We can also find the distribution function of X. To get the probabilities P(X<x) we simply add the above probabilities: P(X<2)=0, because X<2 is an impossible event; P(X<2.1)=1/36; P(X<3)=1/36; P(X<4)=3/36 (favourable events: (1,1), (1,2), (2,1)); P(X<5)=6/36 (favourable events: (1,1), (1,2), (2,1), (1,3), (3,1), (2,2)); ...; P(X<12)=35/36; P(X<13)=1. We can see that the distribution function is not continuous: it has a jump at every possible value (Figure 3.4-3).

Figure 3.4-3. The distribution function of rolling two dices


3.4.2 Distribution function of a continuous variable

If X is a continuous variable, its distribution function, F(x)=P(X<x), can be found in another way. Example: Let the random variable X be the angular position of the hand of a watch read at random moments. If we measure the position of the hand in degrees, then the position (the value of X) is between 0° and 360°. Clearly P(X<0)=0, P(X<361)=1, and the other probabilities can be read from Figure 3.4-4.

[Graph of F(x): 0 for x ≤ 0, rising to 1 at x = 360.]

Figure 3.4-4. Distribution function of a uniform distribution

Properties of the distribution function

The distribution function of a (discrete or continuous) variable has the following properties:
1. F(x) is always a monotone increasing function.
2. The limit of F(x) at minus infinity is 0, and at infinity is 1:

   lim_{x→−∞} F(x) = 0,   lim_{x→∞} F(x) = 1.

3. F(x) is continuous from the left.

3.4.3 The probability density function of a continuous random variable

Let's find the probability that the value of a random variable lies in the interval (a,b), i.e., let's find P(a≤X<b)! This probability can be counted from the F(x) distribution function. It is easy to see that

P(X<a)+P(a≤X<b)=P(X<b). Expressing the second term we get that

P(a≤X<b)=P(X<b)-P(X<a)=F(b)-F(a). This formula gives us the possibility to define continuous random variables.

Definition. The random variable X is called continuous if there exists a function f(x) ≥ 0 for which

   P(a ≤ X < b) = F(b) − F(a) = ∫_a^b f(x) dx        (3-4)

f(x) is called the density function of the random variable X. Consequences: f(x) is the derivative of F(x): f(x)=F'(x). F(x) and f(x) are continuous functions. If a=−∞ and b=x, we get that the distribution function is

   F(x) = ∫_{−∞}^{x} f(t) dt.

The probability that the value of a continuous random variable lies in the interval (a,b) is equal to the area under the curve of f(x).


Properties of the density function

1. f(x) is always non-negative.
2. ∫_{−∞}^{∞} f(x) dx = F(∞) − F(−∞) = 1, where F(∞) = lim_{x→∞} F(x) = 1 and F(−∞) = lim_{x→−∞} F(x) = 0.

Example: Let the random variable X be the angular position of the hand of a watch read at random moments. If we measure the position of the hand in degrees, then the position (the value of X) is between 0 and 360. The density function is the derivative of the function shown in Figure 3.4-4, so it is a constant on the interval (0,360) with height 1/360 (Figure 3.4-5).

[Graph of the constant density f(x) = 1/360 ≈ 0.0028 on the interval (0, 360), zero outside.]

Figure 3.4-5. Density function of a uniform distribution.
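As a sanity check, the area under this density can be approximated numerically (a sketch; the midpoint rule and the step count are arbitrary choices for illustration):

```python
# Watch-hand density: f(x) = 1/360 on (0, 360), zero elsewhere.
def f(x):
    return 1 / 360 if 0 <= x < 360 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over (a, b)."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

total = integrate(f, -60, 420)   # whole support and beyond: should be 1
prob = integrate(f, 90, 180)     # P(90 <= X < 180) = F(180) - F(90) = 1/4
print(round(total, 6), round(prob, 6))
```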

3.5 Population, sample

In the last chapter we discussed distributions of variables, but we have not specified how they are realised in a particular case. We have only stated the probability that a random variable X will lie within an interval. This probability depends on certain parameters describing its distribution, which are usually unknown. We therefore have no direct knowledge of the distribution and have to approximate it by an empirical distribution obtained experimentally. The set of experiments performed for this purpose, called a sample, is necessarily finite.

Definition. The set of all experimental units (persons, objects, scores) with all possible experimental results is called the population. In other words, a population is the entire collection of elements about which information is desired. A population may contain a finite number of elements; in this case we call the population finite. Theoretically, in an idealised case, we may think of a population as infinite: if there is no finite number of elements that could potentially exist in a population, we say that the population is infinite. We are generally interested in studying properties of some numerical characteristic of the population elements. Each element of the population possesses a particular value of the numerical characteristic under study. This property can be expressed with a random variable; that is, the population is connected with a random variable X. The distribution of X is also called the distribution of the population.

Example 1: The inhabitants of Hungary form a finite population. The numerical characteristic (the random variable) may be the yearly income (in Forint) of an inhabitant.

Example 2: The set of all first year pharmacy students might constitute a population. The numerical characteristic may be the height of the students.
Example 3: Suppose a scientist is trying to determine the average weight of all 1-year-old male white rabbits which are raised in laboratories using a certain diet. It is impossible for him to weigh every rabbit in the population because the population never exists completely at any one time. If he selects 50 rabbits and determines their average weight, these 50 would be referred to as a sample from the population.


Definition. Sample. A sample is a part of the population that we actually examine in order to get information. Let X be a random variable connected with a population. The series x1, x2, ..., xn of n mutually independent random variables, having the same distribution as X, is called a random sample. If F(x) is the distribution of X, then we say that x1, x2, ..., xn is a sample drawn from a population distributed by F(x). The xi-s are called elements of the sample, and n is called the sample size. There are several methods for selecting elements randomly from a population in order to get a sample (sampling with or without replacement, sequential sampling, etc.). In a random sample the elements are chosen from the elements of the population in such a way that each element of the population has an equal chance of being chosen. Although sampling is an important part of biostatistics, we will not study sampling methods here.

Example 2: If the population is the set of all first year pharmacy students, a sample might be a set of 10 randomly chosen students.

Our aim is to study how to obtain descriptive information about the population from which the sample was drawn and to test hypotheses about that population. The first goal means doing descriptive statistics, the second doing statistical inference. Let's repeat an experiment n times, on n members of the population, resulting in a series of n observations. The repetitions are assumed to be mutually independent and performed under the same conditions. The result of each experiment may be expressed in numerical form, by stating the observed values of a certain number of random variables. Let's denote this variable by X, and the n observed values by x1, x2, ..., xn.

3.5.1 Population and sample characteristics

3.5.1.1 The distribution of the population and the distribution of the sample

Let's consider a population and a sample x1, x2, ..., xn of the random variable X. The variable X has a certain probability distribution, which may be known or unknown according to circumstances. This is the distribution of the variable, or the distribution of the population, or the theoretical distribution. Like any probability distribution it can be defined by the F(x) distribution function or (in the continuous case) by the f(x) density function. If the observed points x1, x2, ..., xn are marked on the axis of X, and the mass 1/n is placed at each point, we obtain another distribution, which may serve to represent the distribution of the sample values. This distribution will be denoted as the distribution of the sample or the empirical distribution. Similarly, if the sample is drawn from a continuous population and f(x) denotes the density function, then the histogram described in the last chapter and made from the sample can be considered as an approximation of f(x). An example of the distribution function of a sample and of a population can be seen in Figure 3.5-1, while Figure 3.5-2 shows the density function of a sample and of a population.
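The idea of placing mass 1/n at each observed point can be illustrated by comparing the empirical distribution function with the theoretical F(x); a sketch using the watch-hand model (the sample size and the seed are arbitrary choices):

```python
import random

# Draw a sample from the uniform "watch-hand" population on (0, 360).
random.seed(1)
sample = [random.uniform(0, 360) for _ in range(10_000)]

def ecdf(x):
    """Proportion of sample values below x, approximating F(x) = P(X < x)."""
    return sum(1 for xi in sample if xi < x) / len(sample)

# The theoretical distribution function is F(x) = x/360 on (0, 360).
for x in (90, 180, 270):
    print(x, round(ecdf(x), 3), round(x / 360, 3))
```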


Figure 3.5-1. A continuous distribution function and its approximation by an empirical distribution function

[Histogram of body heights (N = 65, mean = 167, SD = 6.14) with bins from 154 to 182 cm, overlaid with a fitted normal density curve.]

Figure 3.5-2. A continuous density function and its approximation by an empirical density function (histogram)

3.5.1.2 Measures of the center of the distribution

Sample characteristics vs. population characteristics. As we have seen in Chapter 2.6.1, the most often used sample characteristics for the measure of the center are the mean, the median and the mode. Just as the theoretical and empirical distributions are related, describing the population and the sample respectively, the sample mean, mode and median also have their corresponding "theoretical" measures. The sample mean is an approximation of the population mean or theoretical mean or expected value. The population mean is denoted by the Greek letter μ: it is the sum of all values in the population divided by the number of values. The mode of a sample is an approximation of the local maximum of the theoretical distribution, and the sample median is the approximation of the population median.

3.5.1.3 Measures of variability of the population

We have seen the measures of dispersion (variability) of a sample in Chapter 2.6.4: sample range, interquartile range, variance, standard deviation. The corresponding measures can be defined for the population, too. The sample range, variance and standard deviation are approximations of the corresponding measures of the population.

3.6 Some important distributions and theorems

3.6.1 The binomial distribution

Let's consider an experiment that may have only two possible mutually exclusive outcomes, and therefore be characterized by the simple decomposition Ω = A + Ā. Let's denote the probabilities of the outcomes by p and q:

   P(A) = p,   P(Ā) = q = 1 − p.


We now repeat the experiment n times; let X denote the absolute frequency of the event A. The frequency X is then a random variable with the possible values 0, 1, ..., n. It can be shown that the probability that X will assume any given possible value k is expressed by the binomial formula

   P_k = P(X = k) = (n choose k) · p^k · q^(n−k),   k = 0, 1, ..., n,   where

   (n choose k) = n! / (k!·(n−k)!),   n! = 1·2·...·n.

It can be shown that the sum of the Pk probabilities is 1. For given values of p and n, the random variable X will thus have a distribution of discrete type. This distribution is known as the binomial distribution; n and p are the parameters of the distribution. This distribution is shown in Figure 3.6-1-a for fixed p and different n, and in Figure 3.6-1-b for fixed n and different p. The figures help us to discover similarities between the binomial distribution and other distributions. Sometimes p is not known, and the aim of an experiment is to approximate it.

[Two bar charts of binomial probabilities: (a) fixed p=0.3 with n=5, 10, 20; (b) fixed n=20 with p=0.3, 0.5, 0.75.]

Figure 3.6-1. The binomial distribution (a) with fixed p=0.3 and different n; (b) with fixed n=20 and different p

Example. In a certain population the occurrence of some disease is p=0.3. What is the probability that, examining n=10 patients, there will be exactly k=4 diseased? According to the formula, P(X=4) = 10!/(4!·6!)·(0.3)^4·(0.7)^6 = 210·0.0081·0.117649 = 0.200120949.
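The computation in the example can be checked with a short function (a sketch using the Python standard library):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: p = 0.3, n = 10 patients, exactly k = 4 diseased.
print(round(binom_pmf(4, 10, 0.3), 6))  # 0.200121
```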

3.6.2 The Poisson distribution

If n tends to infinity while np = λ is kept constant, the binomial distribution approaches a fixed distribution:

   lim_{n→∞} P_k = lim_{n→∞} (n choose k) · p^k · q^(n−k) = f(k) = (λ^k / k!) · e^(−λ)

The quantity f(k) is the probability distribution of the Poisson distribution. It is defined only for integer values of k. The Poisson distribution has one parameter, λ, which is both the mean and the variance of the distribution. This distribution is plotted for different values of λ in Figure 3.6-2.


[Bar charts of Poisson probabilities for λ = 3, 5, 10 and 20.]

Figure 3.6-2. The Poisson distribution for different λ-s.
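The limit can be observed numerically by holding λ = np fixed while n grows (a sketch; the chosen λ = 5 and the n values are arbitrary):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """f(k) = (lam**k / k!) * exp(-lam)."""
    return lam**k / factorial(k) * exp(-lam)

# Binomial with large n and small p = lam/n approaches the Poisson limit.
lam = 5
for n in (10, 100, 1000):
    print(n, round(binom_pmf(3, n, lam / n), 5), round(poisson_pmf(3, lam), 5))
```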

We have obtained the Poisson distribution from the binomial distribution with large n but constant λ=np, i.e., small p. We therefore expect it to apply to processes in which a large number of events occurs but of which only very few have a certain property of interest to us (i.e. a large number of "trials" but few "successes"). Consider the very large number n of atomic nuclei in a radioactive source: the frequency fk with which k decays are observed in a certain interval of time follows the Poisson distribution. The number of raisins per volume element of a fruit cake, the number of telephone calls arriving at an exchange in a time interval, the number of stars per element of the celestial sphere, and the number of erythrocytes in the visual field of a microscope are all distributed according to the Poisson law.

3.6.3 The uniform distribution

So far we have only discussed distributions of discrete variables. We now consider the simplest case of a continuous distribution function. The uniform distribution is defined as follows: the probability density is constant in the interval (a,b) and vanishes outside this region:

   f(x) = c for a ≤ x < b,   f(x) = 0 for x < a or x ≥ b.

Since the total area under the density must be 1, the constant is c = 1/(b−a).

Example. The random variable X of Figure 3.4-5, i.e., the angular position of the hand of a watch read at random moments, has a uniform distribution.

3.6.4 The normal distribution

The distribution of a random variable is called a normal distribution if the density function has the following form:

   f(x) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))

The corresponding distribution function has the form

   F(x) = (1 / (√(2π)·σ)) · ∫_{−∞}^{x} e^(−(t−μ)² / (2σ²)) dt

It is of central importance in mathematical statistics and especially in the theory of errors, and it is also called the Gaussian distribution. It has two parameters: μ (the mean of the distribution) and σ (the standard deviation of the distribution). We denote the normal


distribution by N(μ,σ). Because of its importance we will study its properties in detail and deal with its quantitative behaviour.

3.6.4.1 Properties of the normal density function

There is symmetry about μ, and this point is its only maximum. Therefore μ is the mean, median and mode of the distribution. The graph of the function is bell shaped. Using differential calculus it can be shown that the function has two inflection points, at μ−σ and μ+σ. This distribution has two parameters, μ and σ. One can often assume that measurement errors are distributed normally around a true value μ, so μ is also called the mean of the distribution. The standard deviation σ of the distribution can be approximately determined from a sample by the sample standard deviation. The two parameters have a special meaning: the probability that the result of a single experiment lies within one standard deviation of the true value is only 68.3%. This seems rather low. Therefore it is a habit of experimenters to multiply the standard deviation by a more or less arbitrary factor, for example 2 or 3, which increases that probability to 95.4% or 99.7%, respectively. In other words, only a small percentage of the scores are more than 3 standard deviations away from the mean (Figure 3.6-3).

Figure 3.6-3. Graph of the normal distribution
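These percentages can be verified with the standard library's NormalDist class (available since Python 3.8); a sketch of the "68 - 95 - 99.7 rule":

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution, N(0, 1)

# Probability of lying within k standard deviations of the mean.
for k in (1, 2, 3):
    prob = Z.cdf(k) - Z.cdf(-k)
    print(f"P(mu - {k}*sigma < X < mu + {k}*sigma) = {prob:.4f}")
```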

3.6.4.2 Special case: the standard normal distribution

If μ=0 and σ=1, the density function has the following form and is denoted by φ(x):

   φ(x) = (1/√(2π)) · e^(−x²/2)

and the corresponding distribution function is denoted by Φ(x), having the following form:

   Φ(x) = (1/√(2π)) · ∫_{−∞}^{x} e^(−t²/2) dt

3.6.4.3 Standardization

If a random variable X has a normal distribution N(μ,σ), then the variable

   z = (X − μ) / σ

has a standard normal distribution N(0,1). Therefore, if the sample x1, x2, ..., xn is drawn from a population distributed according to N(μ,σ), then the sample z-scores will have approximately a standard normal distribution.
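A quick numerical illustration of standardization (the height distribution N(170, 10) is a hypothetical example, not from the text):

```python
from statistics import NormalDist

mu, sigma = 170, 10  # hypothetical population, N(170, 10)
x = 185

# Standardize: z = (X - mu) / sigma, then P(X < x) = Phi(z).
z = (x - mu) / sigma
print(z)  # 1.5
print(round(NormalDist(mu, sigma).cdf(x), 4))  # P(X < 185) directly
print(round(NormalDist().cdf(z), 4))           # same value via the z score
```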


3.6.4.4 The use of the normal curve table

The values of Φ(x) can be found in many tables. For each x the table gives the area under the curve to the left of x (Table 3-3).

x        Φ(x): proportion of area to the left of x
-4       0.00003
-3       0.0013
-2.58    0.0049
-2.33    0.0099
-2       0.0228
-1.96    0.0250
-1.65    0.0495
-1       0.1587
 0       0.5
 1       0.8413
 1.65    0.9505
 1.96    0.975
 2       0.9772
 2.33    0.9901
 2.58    0.9951
 3       0.9987
 4       0.99997

Table 3-3. Standard normal probabilities.

Example 1. Find the area under a standard normal curve between x=−1.65 and x=1. Solution. Φ(−1.65)=0.0495, Φ(1)=0.8413. We find the area by subtracting: the area between them is 0.8413−0.0495=0.7918.

Example 2. It was found that the weights of a certain population of laboratory rats are normally distributed with μ=14 ounces and σ=2 ounces. In such a population, what percentage of rats would we expect to weigh between 10 and 15 ounces? Solution. We draw and label a rough sketch of this population (Figure 3.6-4). Since μ=14, the value 14 corresponds to z=0 when standardized. The standard deviation is 2 ounces, therefore the weight increases by 2 ounces each time the z score increases by 1. We want to compute the area between 10 and 15. Converting the data to z scores, we get z15=(15−14)/2=0.5 and z10=(10−14)/2=−4/2=−2. Φ(0.5)=0.6915 and Φ(−2)=0.0228. Subtracting, 0.6915−0.0228=0.6687. Therefore we expect about 67 percent of the population of rats to weigh between 10 and 15 ounces.
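Example 2 can be reproduced in code (a sketch using the standard library's NormalDist):

```python
from statistics import NormalDist

# Rat weights ~ N(mu = 14, sigma = 2); find P(10 < X < 15).
Z = NormalDist()
z15 = (15 - 14) / 2   # 0.5
z10 = (10 - 14) / 2   # -2.0
prob = Z.cdf(z15) - Z.cdf(z10)
print(round(prob, 4))  # about 0.6687
```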

Figure 3.6-4.

3.7 Theoretical distribution of the sample means

Let's see an example first. A young man being offered a job as a secretary in a large company asked the personnel director what the average age of the secretaries in the company was. She replied that the company employed several hundred secretaries, and she did not know the correct answer. But looking around the personnel office at the 38


secretaries there, she said that the average age in her office was about 20, and that the secretaries had been selected at random from the secretarial pool. The young man informally performed a type of hypothesis test which is very commonly done by statisticians. He came to a conclusion about the population mean, denoted by μ, on the basis of what he knew about the mean of a sample taken from that population. Imagine that the young man goes from office to office and for each office separately computes the mean age of the secretaries. By the end of the day he would have a long list of sample means, and these averages would certainly vary. Such a list of averages is called a distribution of sample means.

3.7.1 Central limit theorem

The central limit theorem states that if the sample size n is large (say, at least 30), then the population of all possible sample means approximately has a normal distribution with mean μ and standard deviation σ/√n, in notation N(μ, σ/√n), no matter what probability distribution describes the population sampled (see Figure 3.7-1). In other words, the distribution of means of large samples taken from a population has, in theory, three characteristics:
1. The shape is normal.
2. The mean (of all possible sample means) is the same as the population mean μ.
3. The standard deviation is smaller than the standard deviation of the original population: it is σ/√n.

Figure 3.7-1. Examples (simulations) for the central limit theorem. Source:

http://www.ruf.rice.edu/~lane/stat_sim/index.html
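A small simulation in the spirit of Figure 3.7-1 (a sketch; the exponential population, the sample size and the repeat count are arbitrary choices):

```python
import random
from statistics import mean, stdev

# Draw many samples of size n from an exponential population
# (mean 1, sd 1, clearly non-normal) and look at the sample means.
random.seed(42)
n, repeats = 36, 2000
sample_means = [mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(repeats)]

print(round(mean(sample_means), 2))   # close to the population mean, 1
print(round(stdev(sample_means), 2))  # close to sigma/sqrt(n) = 1/6
```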

3.7.2 The standard error of the mean (SE or SEM)

The expression σ/√n is called the standard error of the mean. It expresses the dispersion of the sample means around the (unknown) population mean. Because the value of σ is generally unknown, the standard error of the mean is approximated by the sample standard deviation divided by the square root of the sample size: SD/√n.
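A minimal sketch of estimating the SEM from a single sample (the data values are hypothetical):

```python
from math import sqrt
from statistics import stdev

# SEM estimated from one sample: SD / sqrt(n).
sample = [12, 15, 11, 14, 13, 16, 12, 15]   # hypothetical measurements
n = len(sample)
sem = stdev(sample) / sqrt(n)
print(round(sem, 3))
```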


Summary. Density curves are a concise way to describe the overall pattern of a distribution. Any density curve lies above the axis and has an area of one. Density curves represent proportions of all observations. Histograms use the size of each bar to indicate the relative frequency of values in the corresponding interval; density curves generalize this idea. The normal density is characterized by several important features:
-It is specified by its mean and standard deviation.
-It is unimodal and symmetric.
-It is characterized approximately by the "68 - 95 - 99.7 rule."
If the data follow a normal distribution, standardizing makes it match the standard normal distribution, the normal distribution with zero mean and unit standard deviation. For the standard normal density curve, tables are usually given in statistics textbooks. The central limit theorem states that the sampling distribution of the sample mean approaches the normal distribution as the sample size n increases, regardless of the distribution of the population. This sampling distribution has a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of n.

Problems
1. For a normal distribution, find the z scores that cut off the top a) 5 percent b) 2.5 percent c) 1 percent d) 0.5 percent.
2. The results of a certain blood test performed in a medical laboratory are known to be normally distributed with N(60,18). a) What percentage of the results is between 40 and 80? b) What percentage of the results is below 60? c) What percentage of the results is outside the "healthy range" of 30 to 90? d) What is the probability that a blood sample picked at random will have results in the "healthy range" of 30 to 90?
3. At an urban hospital the weights of new-born infants are normally distributed with N(3500,70). Let X be the weight of a new-born picked at random. Find the following probabilities: a) P(X>4000) b) P(3000<X<4000)
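Answers to exercises like Problem 1 can be checked with the standard library's inverse normal function (a sketch):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# z scores cutting off the top 5, 2.5, 1 and 0.5 percent (cf. Problem 1).
for top in (0.05, 0.025, 0.01, 0.005):
    print(f"top {top:.1%}: z = {Z.inv_cdf(1 - top):.3f}")
```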

4 Statistical estimation, confidence intervals

4.1 Statistic, confidence interval

A typical problem of data analysis is the following. The general mathematical form of the probability density of the population is known; only the numerical value of one or several parameters still has to be obtained from a sample. For example, it is known that body weights are normally distributed, but the population parameters μ and σ are unknown, and we have to approximate them from a sample. We are therefore dealing with the problem of the estimation of parameters.


Definition. A function of the sample elements x1, x2, ..., xn that is itself a random variable is called a statistic. A well known example is the sample mean: the statistic x̄ (the sample mean) is an estimate of the population mean, i.e., of the population parameter μ. Similarly, the statistic SD, the sample standard deviation, is an estimate of the population standard deviation σ, and SD², the sample variance, is an estimate of the population variance σ².

4.2 Confidence intervals

In most statistical applications it is not enough to say that the correct value of a parameter is more or less well approximated by a statistic. When we give a range of values that we think includes the true value of some population parameter, this range of values is called an interval estimate. Such an interval is usually assigned a probability, and then it is called a confidence interval. The higher the probability assigned, the more confident we are that the interval does, in fact, include the true value. A confidence interval is an interval computed from the sample elements that contains the (unknown) population parameter with high probability. The value of the "high" probability is arbitrary: it depends on us, and it is called the confidence level. The most often used confidence level is 95%. The "high" probability always defines a "small" probability, the level of "mistake" or "error", the level of uncertainty. This small probability is denoted by α. For the 95% confidence level, α=0.05. The confidence interval is based on the concept of repetition of the study under consideration. If the study were to be repeated 100 times, of the 100 resulting 95% confidence intervals we would expect 95 to include the population parameter. A nice demonstration of this idea can be found at the site http://www.kuleuven.ac.be/ucs/java/index.htm.
Definition. Let a be a population parameter, α an arbitrary (small) probability, and x1, x2, ..., xn a sample. If we can find statistics a1, a2 computed from the sample so that P(a1 ≤ a ≤ a2) = 1 − α, then the random interval (a1, a2) is called a confidence interval (CI) for the parameter a. (1−α)·100% is called the confidence level.

α is generally chosen to be 0.1, 0.05, 0.01 or 0.001, so the corresponding confidence levels are 90%, 95%, 99% or 99.9%. For different population parameters several formulas for confidence intervals have been found; confidence intervals for the population mean are the most often used.

4.2.1 Confidence interval for a population mean μ if the population standard deviation (σ) is known

It can be shown that

   ( x̄ − uα/2·σ/√n ,  x̄ + uα/2·σ/√n )        (4-1)

is a (1−α)·100% confidence interval for μ. Here uα/2 is the α/2 critical value of the standard normal distribution; it can be found in the standard normal distribution table (Table 3-3).


For α=0.05, uα/2 = 1.96; for α=0.01, uα/2 = 2.58.

So a 95% confidence interval (CI) for the population mean is ( x̄ − 1.96·σ/√n ,  x̄ + 1.96·σ/√n ).

Proof of the formula. If x1, x2, ..., xn is a statistical sample drawn from a population having the distribution N(μ,σ), then the random variable

   u = (x̄ − μ) / (σ/√n) = √n·(x̄ − μ)/σ

has the distribution N(0,1). If α>0 is given, we can find the points that cut off the top α/2 percent and the bottom α/2 percent of the normal curve. Let's denote these points by uα/2 and −uα/2. So we have

   P(−uα/2 ≤ u ≤ uα/2) = 1 − α.

Substituting u in the above expression and expressing μ we get that

   P( x̄ − uα/2·σ/√n ≤ μ ≤ x̄ + uα/2·σ/√n ) = 1 − α,

that is, the random interval ( x̄ − uα/2·σ/√n ,  x̄ + uα/2·σ/√n ) includes the value μ with probability 1−α. This is therefore a (1−α)·100% level confidence interval for μ.

For a given α, the value of uα/2 can be found from the normal probability tables. If α=0.05, then α/2=0.025 and Φ(1.96)=0.975, so uα/2 = 1.96. Similarly, for α=0.01, uα/2 = 2.58.

Example 4.2-1. We wish to estimate the average number of heartbeats per minute for a certain population. The mean for a sample of 64 subjects was found to be 90. Supposing that the population is normally distributed with a standard deviation of 16, we are able to find the 95% confidence interval for μ. Now α=0.05, σ=16, n=64, uα/2=1.96. The lower limit is 90 − 1.96·16/√64 = 90 − 1.96·2 = 90 − 3.92 = 86.08. The upper limit is 90 + 1.96·16/√64 = 90 + 1.96·2 = 90 + 3.92 = 93.92. The 95% confidence interval for the population mean is (86.08, 93.92). It means that the true (but unknown) population mean lies in the interval (86.08, 93.92) with 0.95 probability: we are 95% confident that the true mean lies in that interval.
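Example 4.2-1 in code (a minimal sketch):

```python
from math import sqrt

# Mean heart rate of n = 64 subjects was 90; population sigma = 16.
xbar, sigma, n = 90, 16, 64
u = 1.96                        # critical value for a 95% CI
half_width = u * sigma / sqrt(n)
print(xbar - half_width, xbar + half_width)  # (86.08, 93.92)
```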


4.2.2 Confidence interval for a population mean μ if the population standard deviation (σ) is not known

Generally not only the mean but also the standard deviation of the population (σ) is unknown. σ can be estimated by the sample standard deviation SD. However, when σ is replaced by SD in the formula of the confidence interval, the multiplier uα/2 is no longer valid. Instead of uα/2, another multiplier, tα/2, has to be used. Values of tα/2 are tabulated in so-called t-tables or Student's t-tables, according to α and the degrees of freedom = n−1. They are also called critical t-values. Because they also depend on the degrees of freedom, the notation tα/2,n−1 is more appropriate. It can be shown that

   ( x̄ − tα/2·SD/√n ,  x̄ + tα/2·SD/√n )        (4-2)

is a (1−α)·100% confidence interval for μ. Here tα/2 can be found in tables of the Student's t distribution with n−1 degrees of freedom.

4.2.3 The t-distribution

W.S. Gosset was a British statistician who worked in a brewery, where he took small samples. In 1908 he wrote a paper under the pseudonym "Student", showing some properties of small samples. Basically, in repeated experiments with small samples the values of SD tend to be quite variable, so that if in any one experiment you take your current value of SD and use it in the formula

   x̄ ± uα/2·SD/√n,

the actual confidence level will be less than the one uα/2 indicates. To solve this problem, what Gosset did, from our point of view, was to come up with different sets of critical scores, called Student's t-scores, to be used in place of the critical z-scores, depending on how big the sample actually is. These scores are also called critical values of t, and they are tabulated. Gosset showed that even these t-scores are not always reliable. They are reliable, however, in the case where the original population from which we are taking our sample is near normal to begin with. There are many t-distributions, and for each degree of freedom (a number that depends on sample size) there is a slightly different distribution. The t-distributions are similar to the normal distribution in that they are symmetrical and bell-shaped. When drawn precisely, it can be seen that the smaller the degrees of freedom, the flatter the curve; conversely, the larger the degrees of freedom, the more closely the t-curve resembles the normal curve. The t-distributions are tabulated according to the different degrees of freedom.

4.2.3.1 Degrees of freedom

To use the t table, you must find the entry that corresponds to your particular sample size. You would therefore expect the table to have a column labelled n for sample size. However, it turns out that this same table can be used in other problems (for example, problems involving two different samples of different sizes) where that label would not make sense. It is usual, instead, to have a column labelled degrees of freedom. It is not obvious why this name should be used. Technically it is used because the t curve is related to another curve we will study later (called the chi-square curve), for which the phrase "degrees of freedom" makes more intuitive sense. We will not go into this relationship. For our purposes it is sufficient to know that in a hypothesis test about a population mean where


we are working with a single small sample of size n, the correct numerical value of the degrees of freedom for t is n−1.

In practice, we find confidence intervals for means with Student's t distribution because we almost never know the true population standard deviation. For small samples, the t-distribution allows for the added variability introduced by estimating the standard deviation from the data. For larger samples, the results are even better, as the t-distributions approach the normal distribution in shape. Table 4-1 contains critical t-values for different α levels (columns) and different degrees of freedom (rows). It can be observed that for a very large sample size (degrees of freedom = ∞) we get the standard normal probabilities.

Two-sided α
df      0.2     0.1     0.05    0.02    0.01    0.001
1       3.078   6.314   12.706  31.821  63.657  636.619
2       1.886   2.920   4.303   6.965   9.925   31.599
3       1.638   2.353   3.182   4.541   5.841   12.924
4       1.533   2.132   2.776   3.747   4.604   8.610
5       1.476   2.015   2.571   3.365   4.032   6.869
6       1.440   1.943   2.447   3.143   3.707   5.959
7       1.415   1.895   2.365   2.998   3.499   5.408
8       1.397   1.860   2.306   2.896   3.355   5.041
9       1.383   1.833   2.262   2.821   3.250   4.781
10      1.372   1.812   2.228   2.764   3.169   4.587
11      1.363   1.796   2.201   2.718   3.106   4.437
12      1.356   1.782   2.179   2.681   3.055   4.318
13      1.350   1.771   2.160   2.650   3.012   4.221
14      1.345   1.761   2.145   2.624   2.977   4.140
15      1.341   1.753   2.131   2.602   2.947   4.073
16      1.337   1.746   2.120   2.583   2.921   4.015
17      1.333   1.740   2.110   2.567   2.898   3.965
18      1.330   1.734   2.101   2.552   2.878   3.922
19      1.328   1.729   2.093   2.539   2.861   3.883
20      1.325   1.725   2.086   2.528   2.845   3.850
21      1.323   1.721   2.080   2.518   2.831   3.819
22      1.321   1.717   2.074   2.508   2.819   3.792
23      1.319   1.714   2.069   2.500   2.807   3.768
24      1.318   1.711   2.064   2.492   2.797   3.745
25      1.316   1.708   2.060   2.485   2.787   3.725
26      1.315   1.706   2.056   2.479   2.779   3.707
27      1.314   1.703   2.052   2.473   2.771   3.690
28      1.313   1.701   2.048   2.467   2.763   3.674
29      1.311   1.699   2.045   2.462   2.756   3.659
30      1.310   1.697   2.042   2.457   2.750   3.646
∞       1.282   1.645   1.960   2.326   2.576   3.291

Table 4-1. The Student's t distribution. For α=0.05 and df=12, the critical value is tα/2 = 2.179.
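In code, a small lookup mirroring a few rows of Table 4-1 is often enough. The sketch below is our own hypothetical helper (the function name and the selection of rows are not from the text), hard-coding the printed critical values rather than computing them.

```python
# A few rows of Table 4-1: two-sided critical t-values, keyed by degrees of
# freedom, then by alpha. None stands for df = infinity (the normal values).
T_TABLE = {
    8:    {0.05: 2.306, 0.01: 3.355},
    12:   {0.05: 2.179, 0.01: 3.055},
    None: {0.05: 1.960, 0.01: 2.576},
}

def critical_t(df, alpha=0.05):
    """Two-sided critical value t_{alpha/2, df}; unknown df falls back to the normal row."""
    return T_TABLE.get(df, T_TABLE[None])[alpha]

print(critical_t(12))       # 2.179, as in the caption of Table 4-1
print(critical_t(10**6))    # very large df: essentially the normal value 1.96
```

In a statistics package these values would come from the inverse t-distribution (for example scipy.stats.t.ppf) rather than from a stored table.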

Example 4.2-2. We wish to estimate the average number of heartbeats per minute for a certain population. The mean for a sample of 36 subjects was found to be 90; the standard deviation of the sample was SD=15.5. Supposing that the population is normally distributed, the 95% confidence interval for μ is found as follows: α=0.05, SD=15.5. Degrees of freedom: df=n−1=36−1=35.


tα/2,35 = 2.0301
The lower limit is 90 − 2.0301·15.5/√36 = 90 − 2.0301·2.5833 = 90 − 5.2444 = 84.76
The upper limit is 90 + 2.0301·15.5/√36 = 90 + 2.0301·2.5833 = 90 + 5.2444 = 95.24
The 95% confidence interval for the population mean is (84.76, 95.24). It means that the true (but unknown) population mean lies in the interval (84.76, 95.24) with 0.95 probability: we are 95% confident that the true mean lies in that interval.
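Example 4.2-2 can be checked with a few lines of Python. This is a minimal sketch (the function name is ours), taking the critical value 2.0301 from the t-table rather than computing it.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """(1 - alpha) confidence interval for the mean: mean ± t_crit·SD/√n."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Example 4.2-2: n = 36 subjects, sample mean 90, SD = 15.5, t_{0.025,35} = 2.0301
low, high = t_confidence_interval(90, 15.5, 36, 2.0301)
print(round(low, 2), round(high, 2))   # 84.76 95.24, as computed in the text
```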

4.2.4 Sample size determination

Suppose we wish to determine the sample size n so that we are (1−α)100% confident that the sample mean is within d units of the population mean. By setting the bound in formula 4-1,

(x̄ − uα/2·σ/√n, x̄ + uα/2·σ/√n),

equal to d and solving for n we can determine the sample size. So we get

n = (uα/2·σ/d)².   (4-3)

This formula involves the population standard deviation σ, which is probably unknown. Thus we must often find an estimate of σ: we can calculate the sample standard deviation of a preliminary sample, consisting of np values randomly selected from the population. If np−1 is at least 30, we calculate n by formula 4-3. If np−1 is less than 30, we calculate n by the formula

n = (tα/2·σ/d)²,

where σ is replaced by the standard deviation of the preliminary sample.
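A short sketch of the sample-size formula. The desired bound d = 4 beats/min and the reuse of σ ≈ 15.5 from Example 4.2-2 are our own hypothetical choices; u0.025 = 1.96 comes from the last row of the t-table.

```python
import math

def sample_size(sigma, d, crit):
    """n = (crit·sigma/d)^2, rounded up to the next whole subject.
    crit is u_{alpha/2} (large preliminary sample) or t_{alpha/2} (small one)."""
    return math.ceil((crit * sigma / d) ** 2)

# Hypothetical example: sigma estimated as 15.5, and we want the sample mean
# to be within d = 4 units of the population mean with 95% confidence.
print(sample_size(15.5, 4, 1.96))   # 58 subjects
```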

5 Statistical inference: hypothesis testing

5.1 General steps of hypothesis tests

5.1.1 Statistical hypotheses

Statistical analysis is concerned not only with summarizing data but also with investigating relationships. The majority of statistical analyses involve comparison, generally between treatments or procedures or between groups of subjects. We examine whether our result (the difference observed in samples) is greater than the difference caused only by chance. In statistics, hypotheses are usually claims about the population. As such, they can never (well, hardly ever) be known to be true, but can be disconfirmed by evidence in the data. An experimenter attempts to prove or disprove the statement "beyond a reasonable doubt" by analyzing a sample from that population.

5.1.2 The two types of hypotheses.

Statisticians usually test the hypothesis which tells them what to expect by giving a specific value to work with. They refer to this hypothesis as the null hypothesis and symbolize it as H0. The null hypothesis is often the one that assumes fairness, honesty or equality. The opposite hypothesis is called the alternative hypothesis and is symbolized by Ha. This hypothesis, however, is often the one that is of interest. Example 5.1-1. If the null hypothesis is that the population mean is 16 then there are several possible alternatives. In the most often used pairs of hypotheses, the null hypothesis contains the "="


sign and the alternative hypothesis contains the "≠" sign.
H0: μ=16 (the population mean is 16)
Ha1: μ≠16 (the population mean is not 16)
Ha2: μ>16 (the population mean is greater than 16)
Ha3: μ<16 (the population mean is less than 16)
Example 5.1-2. Another possible pair of hypotheses:
H0: B=G (boys and girls score the same on mathematics exams)
H1: B≠G (boys and girls score differently on mathematics exams)
To summarize, the null hypothesis gives us a specific value of a population parameter on which to base our expectations. This is reflected by the appearance of an equals sign (=) when we use symbols to write it. The alternative hypothesis in symbols is often characterized by the appearance of a not equal sign (≠) or an inequality sign (<, >).

5.1.3 Testing the null hypothesis: the significance level

In significance testing, the significance level, commonly denoted by α and called the alpha level, is a probability that specifies how extraordinary the observed result must be for us to reject the null hypothesis. Typical significance levels are .10, .05, and .01, but any value may be used provided it is specified before the data are examined.

5.1.4 Testing the null hypothesis: decision rules At the beginning of the experiment you should formulate the two opposing hypotheses. Then you should state what evidence will cause you to say that you think the alternative hypothesis is the true one. This statement is called your decision rule. When the evidence supports the alternative hypothesis, we say that we "reject the null hypothesis". When the evidence does not support the alternative, we say that we "fail to reject the null hypothesis".

5.1.5 Testing the null hypothesis: decision

Based on the experimental outcome and the previously calculated decision rule, you will make one of two decisions. a) Reject the null hypothesis and claim that your alternative hypothesis was correct. We say in that case that "the difference is significant" at the chosen α level. b) Fail to reject the null hypothesis: you have been unable to prove that the alternative hypothesis is correct. We use the phrase "the difference is not significant" or "fail to reject H0" rather than "accept H0".

5.1.6 Steps of hypothesis-testing

Step 1. State the motivated (alternative) hypothesis Ha.
Step 2. State the null hypothesis H0.
Step 3. Select the probability of the "error", i.e. the significance level α. Generally α=0.05 or α=0.01.
Step 4. Choose the size n of the random sample.
Step 5. Select a random sample from the appropriate population and obtain your data.
Step 6. Calculate the decision rule.
Step 7. Decision. a) Reject the null hypothesis and claim that your alternative hypothesis was correct; the difference is significant at the α·100% level. b) Fail to reject the null hypothesis; the difference is not significant at the α·100% level.


5.1.7 One- and two-tailed tests

If you suspect that a certain null hypothesis is false, you can formulate three different alternatives. Suppose that you examine the mean bottle fill: your null hypothesis would be that the population mean is 16, H0: μ=16. Your alternative hypothesis could be any of the following.
1. If you think that the mean is less than 16, you would write Ha: μ<16.
2. If you think that the mean is greater than 16, you would write Ha: μ>16.
3. If you don't have any idea whether the mean is larger or smaller than 16, you could write Ha: μ≠16.
The first two are referred to as one-tailed tests, since the values of interest are only one direction away from 16. The third instance, however, is referred to as a two-tailed test, since values far from 16 in either direction are of interest in your experiment. Note that we have formulated our hypotheses so that the equals sign (=) always appears in the null hypothesis, while either the less than (<) or the greater than (>) sign appears in the alternative hypothesis for a one-tailed test. The alternative hypothesis for a two-tailed test always contains the not equal sign (≠).

5.1.8 Statistical errors. One basic idea which is inseparable from hypothesis testing is that you can almost never have absolute proof as to which of the two hypotheses is the true one. When you are testing a null hypothesis, you are trying to decide if it is true or false. However, since statistical hypothesis testing is based on sample information and you cannot be absolutely sure that your decision is correct, you really are faced with four possible situations.

1. H0 is true and the sample data lead you correctly to decide that it is true.
2. H0 is true but by bad luck the sample data lead you mistakenly to think that it is false.
3. H0 is false and the sample data lead you correctly to decide that it is false.
4. H0 is false but by bad luck the sample data lead you mistakenly to think that it could be true.

In the first and third situations above, you have been led to make a correct decision. In the second situation you rejected a true null hypothesis. We refer to this as a Type I error. In the last situation, you failed to reject a false null hypothesis. Statisticians call this a Type II error.

                  you fail to reject H0    you reject H0
H0 is true        correct                  Type I error
H0 is false       Type II error            correct

Table 5-1. Statistical errors

We use the Greek letter alpha (α) to represent the probability of making an error of Type I. Similarly, beta (β) represents the probability of an error of Type II. We obviously desire that both α and β be small. It is common procedure to base a hypothesis test on taking a sample of a fixed size and on setting α equal to a specified value. We usually choose α=0.05 or α=0.01. Generally, for a fixed sample size, the lower the probability of a Type I error, the higher the probability of a Type II error.
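The meaning of α can be illustrated by simulation: if H0 is in fact true and we test at α = 0.05, about 5% of repeated experiments will wrongly reject H0. The sketch below is our own illustration, not from the text; it draws samples from a population in which H0: μ = 0 is true and counts the Type I errors, using t0.025,29 = 2.045 from Table 4-1.

```python
import math
import random
import statistics

random.seed(1)                 # fixed seed so the run is reproducible
n, t_crit = 30, 2.045          # sample size and t_{0.025, 29} from Table 4-1
reps = 2000
rejections = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]   # H0 is true: mu = 0
    t = statistics.mean(sample) * math.sqrt(n) / statistics.stdev(sample)
    if abs(t) > t_crit:        # any rejection here is, by construction, a Type I error
        rejections += 1
print(rejections / reps)       # close to alpha = 0.05
```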


The power of a test

A specific value of β cannot be computed until we have decided upon a specific value from the alternative hypothesis. Statisticians refer to the value 1−β as the power of the test. The power of a test is a measure of how good the test is at rejecting a false null hypothesis. The more "powerful" a test is (the closer the value of 1−β is to 1), the more likely the test is to reject a false null hypothesis. An important part of statistical theory deals with the problem of finding a decision rule that will make a hypothesis test as powerful as possible for any given value of α.

5.2 Testing the mean μ of a sample drawn from a normal population:

one-sample t-test

Let's begin with a practical example. The normal value of the systolic blood pressure is 120 mm Hg. The systolic blood pressures (mm Hg) of n=9 patients undergoing drug therapy for hypertension were measured. The question is whether the sample is drawn from a "healthy" population, i.e., from a population whose mean is 120. This question can be summarized in the following two hypotheses:
H0: the population mean is 120, μ=120
Ha: the population mean is not 120, μ≠120
To answer this question, we chose the significance level to be α=0.05. The measured data are the following: 182.00 152.00 178.00 157.00 194.00 163.00 144.00 114.00 174.00. The sample mean is 162 mm Hg, the standard deviation is SD=23.92. The decision rule is based on the 95% confidence interval: is the given number (the number in the null hypothesis) in the confidence interval? If yes, we decide that the difference is not significant at the α level. If not, we decide that the difference is significant at the α level. For our data, the 95% confidence interval is (143.61, 180.39). This is the interval that contains the true population mean (the mean of the population from which our data were drawn) with 95% probability. The possible values for the true population mean are between 143.61 and 180.39. With 95% probability, the true population mean cannot be 120 as stated in the null hypothesis. The 95% confidence interval does not contain 120, so we reject H0 and say that the difference is significant at the 5% level.

5.2.1 One-sample t-test for the mean of a normal population

Using the one-sample t-test we can test whether a sample is drawn from a normal population with a given mean. Assumption: the sample is drawn from a normally distributed population.
H0: μ=c (c is a given constant). Ha: μ≠c. Denote the significance level by α.

5.2.1.1 Decision rule using the (1−α)100% confidence interval. Find the (1−α)100% confidence interval for μ according to the formula

(x̄ − tα/2·SD/√n, x̄ + tα/2·SD/√n).   (4-2)

This interval contains the true population mean with high probability. If the constant c (the constant in the null hypothesis) is in the interval, we say that the difference is not significant at the α·100% level. If the constant c is outside of the confidence interval, we say that the difference is significant at the α·100% level.


5.2.1.2 Decision rule using the critical value. There is another way of finding the decision rule: we can use the so-called critical points or rejection points instead of the confidence interval. If the null hypothesis is true, the statistic

t = (x̄ − c)/(SD/√n) = √n·(x̄ − c)/SD   (5-1)

has a t-distribution with n−1 degrees of freedom. If α>0 is given, we can find the values that cut off the top α/2·100% and the bottom α/2·100% of the t distribution curve with n−1 degrees of freedom. Let's denote these values by tα/2 and −tα/2, respectively. The region between these two values is called the acceptance (non-rejection) region; it is the set of values for which we accept the null hypothesis. The critical region (rejection region) is the set of values for which the null hypothesis is rejected. The following equality holds:

P(|t| ≤ tα/2) = P(−tα/2 ≤ t ≤ tα/2) = 1 − α, or P(|t| > tα/2) = α.

The decision rule is the following: we can reject H0 in favor of Ha at the α·100% level if and only if |t| > tα/2. If |t| < tα/2 then we say that the difference is not significant at the α·100% level.

5.2.1.3 Decision rule using a p-value. The most often used computer programs compute the test statistic (the t value) and the degrees of freedom, but instead of giving us the critical values they generally compute the probability that a test statistic at least as large as the one observed would occur if H0 is true. This probability is often denoted by p, or named Sig. Level or Signif of t, and is called the observed significance level. We have to compare this value with α. If this probability is less than α, we decide to reject the null hypothesis, saying that the difference is significant at the α·100% level; in the opposite case we fail to reject the null hypothesis.
Example 5.2-1. The normal value of the systolic blood pressure is 120 mm Hg. The systolic blood pressures (mm Hg) of n=9 patients undergoing drug therapy for hypertension were measured. The question is whether the sample is drawn from a "healthy" population, i.e., from a population whose mean is 120. This question can be summarized in the following two hypotheses:
H0: the population mean is 120, μ=120
Ha: the population mean is not 120, μ≠120
Denote the significance level by α. The measured data are the following: 182.00 152.00 178.00 157.00 194.00 163.00 144.00 114.00 174.00. The sample mean is 162 mm Hg, the standard deviation is SD=23.92.
Decision based on the 95% confidence interval. The degrees of freedom is 8, the critical value in the t-table is t0.05,8=2.306, and the 95% confidence interval is (162 − 2.306·23.92/√9, 162 + 2.306·7.97) = (143.61, 180.39). The constant 120 is not in this interval, so we reject the null hypothesis and say that the difference is significant at the 5% level.
Decision based on the critical value.

The t-statistic according to formula 5-1:


t = (162 − 120)/(23.92/√9) = 42/7.973 = 5.26.

The degrees of freedom is 8, the critical value in the t-table is t0.05,8=2.306. We now compare the absolute value of the test statistic with the critical value in the table. As |5.26| > 2.306, we reject the null hypothesis and say that the difference is significant at the 5% level.
Decision based on the p-value. The p-value computed by a computer program is p=0.001. This is the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme as or more extreme than 5.26. In our case the difference is significant at the 5% level, because p=0.001<0.05. The critical values, the acceptance region and the calculated t-value are shown in Figure 5.2-1. The p-value is the area under the curve to the right of the computed t (=5.26) and to the left of −5.26. This area is equal to 0.001.

Figure 5.2-1. The t-distribution with 8 degrees of freedom
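Example 5.2-1 can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the p-value itself requires the t-distribution (for example scipy.stats.ttest_1samp), so only the t statistic is computed here.

```python
import math
import statistics

data = [182, 152, 178, 157, 194, 163, 144, 114, 174]   # systolic pressures
c = 120                                                # mean under H0
n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)
t = (mean - c) / (sd / math.sqrt(n))                   # formula 5-1
print(mean, round(sd, 2), round(t, 2))
# mean = 162, SD = 23.92; |t| > t_{0.025,8} = 2.306, so H0 is rejected
```

Carrying full precision gives t ≈ 5.27; the text reports 5.26 because it rounds intermediate values.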

5.3 A t-test for paired differences (paired t-test)

A special type of experiment, referred to as a matched-pair experiment, is when we do "before-and-after" comparisons, or when we compare "siblings". Let's suppose that an experimenter would like to prove that a special treatment decreases the blood pressure. To prove this, he first measures the blood pressure of a group of patients randomly selected from a group of people suffering from high blood pressure. This gives him a sample "before treatment". After the treatment the experimenter measures the blood pressure of the same patients; this results in another sample, "after treatment". To see the effect of the treatment, it is possible to compute, for each patient, the difference between the "before treatment" and "after treatment" values. To summarize the situation, we generally have the following data:


Before treatment   After treatment   Difference
x1                 y1                d1
x2                 y2                d2
…                  …                 …
xn                 yn                dn

If the treatment has an effect on the blood pressure, then the mean difference must be a number different from zero. If the treatment has no effect on the blood pressure, then the mean difference must not differ from zero. Let's formulate the two hypotheses:
H0: μ=0 (the mean of the population of differences is 0)
Ha: μ≠0 (the mean of the population of differences is not 0) — two-tailed test, or
Ha: μ>0 or μ<0 — one-tailed test.
This situation is a special case of the one-sample t-test. If we suppose that the population of all d's is approximately normal, then our experiment reduces to a one-sample t-test, where the sample is now the sample of differences d. So we can test the null hypothesis by computing the value

t = (d̄/sd)·√n,

where d̄ is the mean of the sample of differences and sd is the standard deviation of the sample of differences. The decision rule is the following: we can reject H0 in favour of Ha at the α level if and only if |t| > tα/2.

Example. Suppose that we have the following blood pressure data for 8 persons measured before and after treatment:

Before treatment   After treatment   Difference
170                150               20
160                120               40
150                150               0
150                160               −10
180                150               30
170                150               20
160                120               40
160                130               30

d̄ = 21.25, sd = 18.077, t = 3.324

The critical value of t, when the degrees of freedom is 7 and α=0.05, is t0.025,7=2.365 from the Student t table. Decision: t=3.324>2.365, so we reject H0 and claim that the difference in means is significant at the 5% level. (The treatment was effective.) If this decision is wrong, the error is a Type I error (rejecting a true null hypothesis); its probability is 0.05. This means that if the null hypothesis were in fact true and we repeated the same experiment under the same circumstances many times, we would wrongly reject it in about 5% of the experiments. Let's see what would happen if α=0.01. All of the computations are the same, except the critical value of t changes: t0.005,7=3.499. Our t, t=3.324<3.499, so the decision now is not to reject the null hypothesis. At the 1% significance level we cannot reject the null hypothesis: allowing only a 1% probability of a Type I error, we do not have enough evidence to reject it.
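The paired computation above can be sketched as follows (standard library only; scipy.stats.ttest_rel would also return the p-value reported in the text):

```python
import math
import statistics

before = [170, 160, 150, 150, 180, 170, 160, 160]
after  = [150, 120, 150, 160, 150, 150, 120, 130]
d = [b - a for b, a in zip(before, after)]       # differences: 20, 40, 0, -10, ...
n = len(d)
t = statistics.mean(d) / statistics.stdev(d) * math.sqrt(n)
print(statistics.mean(d), round(statistics.stdev(d), 3), round(t, 2))
# mean difference 21.25, s_d = 18.077, |t| ≈ 3.32 > t_{0.025,7} = 2.365
```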


Decision based on the p-value: the p-value is p=0.013, so p<0.05 and the difference is significant at the 5% level. It can be seen that p>0.01, so the difference is not significant at the 1% level.

5.4 Testing the mean of two independent samples: two-sample t-test

Let's suppose that we have two independent samples with not necessarily equal sample sizes: x1, x2, …, xn and y1, y2, …, ym. The problem of testing the difference between two means is simplest when we can make two additional assumptions. 1. Both populations are approximately normal. 2. The variances of the two populations are approximately equal. That is, the xi-s are distributed as N(μ1,σ) and the yi-s are distributed as N(μ2,σ). Let's formulate the two hypotheses:
H0: μ1=μ2
Ha: μ1≠μ2 (two-sided test).
Let's fix the value of α. When these assumptions are reasonable, it follows that the distribution of the differences of sample means can be related to the t distribution. Let's denote the means of the two samples by x̄ and ȳ, and the standard deviations by sx and sy, respectively. If H0 is true, then the quantity

t = (x̄ − ȳ)/(sp·√(1/n + 1/m)) = ((x̄ − ȳ)/sp)·√(n·m/(n + m))

has Student's t distribution with n+m−2 degrees of freedom. Here

sp² = ((n−1)·sx² + (m−1)·sy²)/(n + m − 2)

is the pooled estimate of the variance. The decision rule: find the value tα/2,n+m−2 from the table and a) if |t| > tα/2,n+m−2, we reject the null hypothesis; b) if |t| < tα/2,n+m−2, we do not reject the null hypothesis. Example 5.4-1. Suppose we measured the blood pressure of two groups of people: the first group was not treated (control group), the second group was treated. We would like to prove that the treatment was effective. The null hypothesis is that the two population means are equal; the alternative hypothesis is that they are different. The control group contains 8, the treatment group 10 elements:

Control group   Treatment group
170             120
160             130
150             120
150             130
180             110
170             130
160             140
160             150
                130
                120
n=8             m=10


x̄ = 162.5, ȳ = 128, sx = 10.351, sy = 11.35, sx² = 107.14, sy² = 128.88

sp² = (7·107.14 + 9·128.88)/(10 + 8 − 2) = 119.37

t = (162.5 − 128)/(10.925·√(1/10 + 1/8)) = 34.5/5.183 = 6.657

The computed t statistic is 6.6569, and the critical value of t is t0.025,16=2.12. Because 6.6569>2.12, we reject the null hypothesis and claim that the difference in means is significant at the 5% level. The p-value is 5.51·10⁻⁶ < 0.05, so the difference is significant at the 5% level. The 95% confidence interval for the difference is (23.51, 45.49); it does not contain 0, so the difference is significant at the 5% level.

5.5 Testing the mean of two independent samples in the case of different

standard deviations

If the standard deviations are not supposed to be equal, another formula has to be used to test the same null hypothesis. Suppose we have two independent samples with not necessarily equal sample sizes: x1, x2, …, xn and y1, y2, …, ym. Suppose that the xi-s are distributed as N(μ1,σ1) and the yi-s are distributed as N(μ2,σ2), where the population standard deviations are different. We would like to compare the means of the two populations. Now
H0: μ1=μ2
Ha: μ1≠μ2 (two-sided test).
Let's fix the value of α. If the null hypothesis is true, the statistic

d = (x̄ − ȳ)/√(sx²/n + sy²/m)

has a t-distribution. The degrees of freedom can be computed according to the following formula:

degrees of freedom = ((n−1)·(m−1)) / ((m−1)·g² + (n−1)·(1−g)²),  where  g = (sx²/n) / (sx²/n + sy²/m).

This is not an integer number, so we have to use the closest integer when using the t-distribution table. 5.6 Comparison of the standard deviations of two normal populations:

F-test

We have seen that there are two different formulas to test the equality of two population means, depending on the equality of the population standard deviations. We generally know the nature of our data, and we can also inspect the actual values of the sample standard deviations. Although it is not recommended, it is still a general practice to use a test for the comparison of the standard deviations. Let's suppose we have two independent samples with not necessarily equal sample sizes: x1, x2, …, xn and y1, y2, …, ym. Suppose that the xi-s are distributed as N(μ1,σ1) and the yi-s are


distributed as N(μ2,σ2). One of the conditions of the two-sample t-test was the equality of standard deviations. The test is the following:
H0: σ1=σ2
Ha: σ1>σ2 (one-sided test).
Let's fix the value of α. If the null hypothesis is true, the statistic

F = max(sx², sy²)/min(sx², sy²)

has an F-distribution. The F-distribution is not symmetrical and has two degrees of freedom. In our case the degrees of freedom are the sample size of the numerator minus 1 and the sample size of the denominator minus 1. There are tables for the critical values of the F-distribution; these are one-tailed tables. For α=0.05, Chapter 9.3 contains critical values of the F-distribution. According to α and the two degrees of freedom, we denote the critical value in the table by Ftable. Having this critical value, our decision is the following: in case of F > Ftable we reject the null hypothesis and claim that the variances are different at the α·100% level; in case of F < Ftable we do not reject the null hypothesis and claim that the two variances are not different at the α·100% level.
Example 5.6-1. In the previous example we supposed that the two variances are equal, but we haven't tested it. Now we will perform an F-test. Let's compute the value of F:

F = 128.88/107.14 = 1.2029.

The degrees of freedom of the numerator is 10−1=9, that of the denominator is 8−1=7; the critical value of F (for α=0.05) is F0.05,9,7=3.68. Because 1.2029<3.68, we do not reject the null hypothesis that the two standard deviations are equal.
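The F-test of Example 5.6-1 in code — a minimal sketch; the critical value 3.68 still has to be looked up in the F-table (or computed with, for example, scipy.stats.f.ppf):

```python
import statistics

control   = [170, 160, 150, 150, 180, 170, 160, 160]
treatment = [120, 130, 120, 130, 110, 130, 140, 150, 130, 120]

v_control = statistics.variance(control)     # sample variance s_x^2
v_treat   = statistics.variance(treatment)   # sample variance s_y^2
F = max(v_control, v_treat) / min(v_control, v_treat)   # larger variance on top
print(round(v_control, 2), round(v_treat, 2), round(F, 4))
# 107.14, 128.89 and F ≈ 1.203 < F_{0.05;9,7} = 3.68: variances not different
```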

5.6.1 Using SPSS to compute the two-sample t-test

Data input: data must be arranged in two variables: one variable must contain the data to be evaluated (Test Variable), and another (categorical) variable must contain the codes of the groups (Grouping Variable). In our case, we have the following two variables:

GROUP   BLOOD
1.00    170.00
1.00    160.00
1.00    150.00
1.00    150.00
1.00    180.00
1.00    170.00
1.00    160.00
1.00    160.00
2.00    120.00
2.00    130.00
2.00    120.00
2.00    130.00
2.00    110.00
2.00    130.00
2.00    140.00
2.00    150.00
2.00    130.00


2.00    120.00

Result:

Group Statistics
VAR00001   VAR00002   N    Mean       Std. Deviation   Std. Error Mean
           1.00       8    162.5000   10.35098         3.65963
           2.00       10   128.0000   11.35292         3.59011

Independent Samples Test

                              Levene's Test for          t-test for Equality of Means
                              Equality of Variances
                              F       Sig.     t       df       Sig.        Mean         Std. Error   95% Confidence Interval
                                                                (2-tailed)  Difference   Difference   of the Difference
                                                                                                      Lower       Upper
Equal variances assumed       .008    .930     6.657   16       .000        34.50000     5.18260      23.51337    45.48663
Equal variances not assumed                    6.730   15.669   .000        34.50000     5.12657      23.61347    45.38653

SPSS first gives the summary statistics for the two groups. Then a special test, Levene's test, is used to test the equality of variances. In our case, the variances are not significantly different at the 5% level (p=0.930), so the assumption of the equality of variances can be accepted. The results of the two-sample t-test are then in the first row of the table. The p-value is not zero, but its first three displayed digits are zero; in that case we report that p<0.001 or simply p<0.05.
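Both rows of the SPSS table can be reproduced by hand. Below is a minimal standard-library sketch (p-values and the Welch degrees of freedom are omitted; scipy.stats.ttest_ind with equal_var=True or False would supply them):

```python
import math
import statistics

control   = [170, 160, 150, 150, 180, 170, 160, 160]
treatment = [120, 130, 120, 130, 110, 130, 140, 150, 130, 120]
n, m = len(control), len(treatment)
mx, my = statistics.mean(control), statistics.mean(treatment)
vx, vy = statistics.variance(control), statistics.variance(treatment)

# "Equal variances assumed": pooled variance, df = n + m - 2
sp2 = ((n - 1) * vx + (m - 1) * vy) / (n + m - 2)
t_pooled = (mx - my) / (math.sqrt(sp2) * math.sqrt(1 / n + 1 / m))

# "Equal variances not assumed": the Welch statistic of Section 5.5
t_welch = (mx - my) / math.sqrt(vx / n + vy / m)

print(round(t_pooled, 3), round(t_welch, 3))   # 6.657 6.73, matching the SPSS rows
```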

6 Correlation and prediction 6.1 Relationship between two continuous variables It frequently happens that statisticians want to describe with a single number a relationship between two sets of scores. A number that measures a relationship between two sets of scores is called a correlation coefficient. There are several correlation coefficients for measuring various types of relationships between different kinds of measurements. We will illustrate the basic concepts of correlation by discussing only the Pearson correlation coefficient, which is one of the more widely used correlation coefficients. The statistic is named for its inventor, Karl Pearson (1857-1936), one of the founders of modern statistics. It is denoted by r, and is used to measure what is called the linear relationship between two sets of measurements. To explain how r works and what is meant by a linear relationship, we will look at a few overly simplified examples. It is unlikely that a real application of the correlation coefficient would be made with so few scores. Imagine that 6 students are given a battery of tests by a vocational guidance counsellor with the results shown in Table 6-1:

student   interest in retailing   interest in theater   math aptitude   language aptitude
Pat       51                      30                    525             550
Sue       55                      60                    515             535
Inez      58                      90                    510             535
Amie      63                      50                    495             520
Gene      85                      30                    430             455
Bob       95                      90                    400             420

Table 6-1. Results of a test (hypothetical data).


The counsellor might want to see if there is any correlation among these sets of marks: for example, between math and language. Variables measured on the same individuals are often related to each other. Let us draw a graph called a scattergram to investigate this relationship. We put math scores on the horizontal axis, but that is not important; we could have put them on the vertical axis. After both axes are drawn and labelled, we use one dot for each person (Figure 6.1-1). In a scatter plot, we look for the direction, form, and strength of the relationship between the variables. The simplest relationship is linear in form and reasonably strong. On the scatter diagram there is one point for each pair of scores, 6 points in all. In Figure 6.1-1, the points are arranged approximately in a straight line. When this happens we say that there is a good linear correlation between the two variables. The higher numbers in the math column of the table correspond to the higher numbers in the language column. This causes the line to slope up to the right. This is called positive correlation. Relationship between math scores and retailing-interest scores: there is a tendency for the points to lie in a line that slopes down to the right (Figure 6.1-2). This is called negative correlation. (The high scores in the column for math correspond to the low scores in the column for retailing interest.) Let's see the relationship between math scores and theater-interest scores: you will notice that there is no special tendency for the points to appear in a straight line. We say that there is little or no correlation between the math scores and the theater-interest scores (Figure 6.1-3). Also note that it is not necessary for both variables to be scored on the same scale, since the correlation coefficient describes the pattern of the scores, not the actual values.

Figure 6.1-1. Relationship between math scores versus language scores (positive correlation). [Scatter plot: math score on the horizontal axis, language score on the vertical axis.]


Figure 6.1-2. Relationship between math scores versus retailing (negative correlation). [Scatter plot: math score on the horizontal axis, retailing-interest score on the vertical axis.]

Figure 6.1-3. Relationship between math scores versus theater (no correlation). [Scatter plot: math score on the horizontal axis, theater-interest score on the vertical axis.]

6.1.1 Computation of r

Let us denote the two samples by $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$. Then the coefficient of correlation can be computed according to the following formula:

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
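As a quick sanity check, the formula can be computed directly. The sketch below uses the mean-centred form; the six score pairs are made up for illustration and are not the data of the example:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, mean-centred form of the formula."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical aptitude scores for six students (not the original data):
math_scores = [410, 430, 470, 490, 530, 570]
lang_scores = [420, 440, 490, 500, 550, 580]
print(round(pearson_r(math_scores, lang_scores), 4))
```

On exactly linear data the function returns 1 (or -1 for a decreasing line), matching the properties of r listed below.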

6.1.2 Properties of r

The value of r is always between -1 and 1. When there is no tendency for the points to lie in a straight line, we say that there is no correlation (r=0) or that we have low correlation (r is near 0). If r is near +1 or -1, we say that we have high correlation. If r=1, we say that there is perfect positive correlation; if r=-1, we say that there is perfect negative correlation.


6.1.3 Testing the significance of r

Suppose that we examined an entire population and computed the correlation coefficient for two variables. If this coefficient equalled zero, we would say that there is no correlation between these two variables in this population. Consequently, when we examine a random sample taken from a population, a sample value of r near zero is interpreted as reflecting no correlation between the variables in the population. A sample value of r far from zero (near 1 or -1) indicates that there is some correlation in the population. The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect the correlation in the population.

H0: the correlation coefficient in the population is 0; in notation, ρ=0
Ha: ρ ≠ 0

This test can be carried out by expressing the t statistic in terms of r. It can be proven that the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = r \cdot \sqrt{\frac{n-2}{1-r^2}}$$

has a t-distribution with n-2 degrees of freedom. Decision using a statistical table: if ttable denotes the critical value of the table corresponding to n-2 degrees of freedom and probability α, and |t| > ttable, we reject H0 and state that the population correlation coefficient ρ is different from 0. Decision using the p-value: if p < α, we reject H0 and state that the population correlation coefficient ρ is different from 0.

Example 6.1-1. The correlation coefficient between math skill and language skill was found to be r=0.9989. H0: the correlation coefficient in the population is 0 (ρ=0). Ha: the correlation coefficient in the population is different from 0.

Let's compute the test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.9989 \cdot \sqrt{6-2}}{\sqrt{1-0.9989^2}} = \frac{0.9989 \cdot 2}{0.0469} = 42.6$$

The critical value in the table is t0.05,4 = 2.776. Because 42.6 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

6.2 Prediction based on linear correlation: the linear regression

If the statistician determines that there is high1 linear correlation between two variables, we can try to represent the correspondence by an ideal line - a line that best represents the linear correspondence. We can then write the formula which determines this line, and use this formula to predict, for instance, which value of the Y variable corresponds ideally to any given value of the X variable. For example, let us suppose that math aptitude and language aptitude have a high positive correlation. Suppose we have found a formula which predicts language aptitude from scores of math aptitude. Given a math aptitude of 410 scores, the formula predicts 432.2 scores of language (language = 1.016 math + 15.5, r = 0.9989, r2 = 91.7%). Clearly, if 10 students had 410 scores of math, we do not expect all 10 to get 432.2 in language. In fact, maybe none of them will actually get a 432.2. The predicting formula really says that our best estimate of their mean score will be 432.2. On the other hand, if

1 What is considered to be high correlation varies with the field of application.


we do not want to predict one student's grade, the best point estimate we can make will be this mean, 432.2. NOTE: realize that the interpretation of the situation says nothing about the reasons for the correlation, the nature of the test questions, or the intelligence of the students. It merely acknowledges that a pattern exists, and as long as the population from which the applicants come remains the same, it is likely that the predictions are reasonable.

How do we get the formula for the line which is used to get the best point estimates? The general equation of a line is y = a + b x. We would like to find the values of a and b in such a way that the resulting line is the best-fitting line. Let's suppose we have n pairs of (xi, yi) measurements. If xi is the independent variable, the value of the line at xi is a + b xi. We will approximate yi by the value of the line at xi, that is, by a + b xi. The approximation is good if the differences $y_i - (a + b x_i)$ are small. These differences can be positive or negative, so let's take their squares and sum them:

$$S(a,b) = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$

This is a function of the unknown parameters a and b, also called the sum of squared residuals. To determine a and b, we have to find the minimum of S(a,b). In order to find the minimum, we have to take the derivatives of S and solve the equations

$$\frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0$$

The solution of the equation-system gives the formulas for b and a:

$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad \text{and} \qquad a = \bar{y} - b \cdot \bar{x}$$

It can be shown, using the second derivatives, that this is really a minimum. Geometrical meaning of a and b: b is called the regression coefficient, the slope of the best-fitting line or regression line; a is the y-intercept of the regression line.
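A minimal sketch of these formulas (the mean-centred form of b, then a = ȳ - b·x̄), checked on points that lie exactly on a line:

```python
def fit_line(x, y):
    """Least-squares estimates of slope b and intercept a for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx          # a = y-bar - b * x-bar
    return a, b

# On points lying exactly on y = 1 + 2x the fit reproduces the line:
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # 1.0 2.0
```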

6.2.1 Computation of the correlation coefficient from the regression coefficient

There is a relationship between the correlation and the regression coefficient:

$$r = b \cdot \frac{s_x}{s_y}$$

where $s_x$, $s_y$ are the standard deviations of the samples $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$. From this relationship it can be seen that the sign of r and b is the same: if there exists a negative correlation between the variables, the slope of the regression line is also negative. It can be shown that the same t-test can be used to test the significance of r and the significance of b.

6.2.2 Coefficient of determination and coefficient of correlation.

Let's assume that we have the n observed values of the dependent variable Y but we do not have the n observed values of the independent variable X with which to predict yi. In such


a case the only reasonable prediction of $y_i$ would be the mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. The error of the prediction is $y_i - \bar{y}$.

However, in reality we do have the observed values of the independent variable X to use in predicting $y_i$. The prediction of $y_i$ is the value of the regression line at $x_i$: $\hat{y}_i = a + b x_i$. The error of the prediction is $y_i - \hat{y}_i$. Taking the difference of the error terms:

$$(y_i - \bar{y}) - (y_i - \hat{y}_i) = \hat{y}_i - \bar{y}$$

Therefore, using the independent variable has decreased the error of prediction by an amount $\hat{y}_i - \bar{y}$. It can be shown that in general

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Here the first term is the total amount of variation, the second is called the unexplained variation (not explained by the regression, or error term), and the right side is called the explained variation (explained by the regression). So the above equation shows that "Total" - "Unexplained" = "Explained". If we rearrange, we get "Total" = "Explained" + "Unexplained". It can be shown that the ratio of the explained and the total variation is the square of the correlation coefficient:

$$r^2 = \frac{\text{Explained}}{\text{Total}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$$

This is called the coefficient of determination. Generally it is multiplied by 100: the square of the correlation coefficient shows the percentage of the total variation explained by the linear regression. Example 6.2-1. The correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is r2 = 0.917, so 91.7% of the total variation of Y is explained by its linear relationship with X.
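The identity r² = Explained/Total can be verified numerically; the small data set below is invented for illustration:

```python
def r_squared(x, y):
    """Coefficient of determination computed as Explained/Total from the fitted line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    explained = sum((a + b * xi - my) ** 2 for xi in x)   # sum of (y-hat_i - y-bar)^2
    total = sum((yi - my) ** 2 for yi in y)               # sum of (y_i - y-bar)^2
    return explained / total

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(r_squared(x, y), 4))
```

For data lying exactly on a line the ratio is 1; for the nearly linear data above it is just below 1.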

6.2.3 Regression using transformations

Up to this point, we have fitted linear models, where the relationship between x and y had the form y = a + b x. This model is linear in its parameters. Sometimes, however, useful models are not linear in their parameters. Examining the scatterplot of the data may show a functional, but not linear, relationship between the data. In special cases we are able to find the best fitting curve to the data.


For instance, the model $y = a \cdot b^x$ is not linear in its parameters. Here the independent variable x enters as an exponent. To apply the estimation and prediction techniques of linear regression, we must transform such a nonlinear model into a model that is linear in its parameters. Some non-linear models can be transformed into a linear model by taking logarithms on both sides. Either the base-10 logarithm (denoted log) or the natural (base e) logarithm (denoted ln) can be used. If a>0 and b>0, applying a logarithmic transformation to the model $y = a \cdot b^x$ yields

log y = log a + x log b

If we let Y = log y, A = log a and B = log b, the transformed version of the model becomes

Y = A + B x

Thus we see that the model with dependent variable log y is linear in the parameters A and B. Cases in which a model $y = a \cdot b^x$ may be appropriate can be identified by data scatterplots. If a scatterplot of the observed data has points scattered about an exponential function, we are able to find the best fitting curve using the logarithmic transformation.

Example 6.2-2. A fast food chain opened in 1974. Each year from 1974 to 1988 the number of steakhouses in operation, yi, was recorded. The data are presented in Table 6-2. An analyst from the firm wished to use these data to estimate the growth rate of the company. The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses) (first plot); taking the logarithm of y, we get a linear relationship (second plot).

Year   i    yi    ln yi
1974   0    11    2.398
1975   1    14    2.639
1976   2    16    2.773
1977   3    22    3.091
1978   4    28    3.332
1979   5    36    3.584
1980   6    46    3.829
1981   7    67    4.205
1982   8    82    4.407
1983   9    99    4.595
1984   10   119   4.779
1985   11   156   5.05
1986   12   257   5.549
1987   13   284   5.649
1988   14   403   5.999

[Two scatter plots: y versus time, showing an exponential shape, and ln(y) versus time, approximately linear.]

Table 6-2.

Performing the linear regression procedure on x and ln(y), we get the equation

ln y = 2.327 + 0.2569 x

That is, $y = e^{2.327 + 0.2569x} = e^{2.327} \cdot e^{0.2569x} = 10.25 \cdot (1.293)^x$ is the equation of the best fitting curve to the original data (since $e^{0.2569} = 1.293$, the estimated growth rate is about 29% per year).

Another model that can be linearized by a transformation is the model $y = a \cdot x^b$. Taking the logarithms (natural or base 10) of both sides, we obtain

log y = log a + b log x

Letting Y = log y, A = log a, X = log x, we obtain

Y = A + b X

which is linear in the parameters A and b. If visual inspection of the scatterplot of the data suggests it, or if theoretical considerations suggest it, the model $y = a \cdot x^b$ can be used: we take the logarithm of both variables and perform the linear regression on these transformed data.

When the data plot suggests that the usual linear model may not be appropriate, a reciprocal transformation may be useful. Examples of models involving the reciprocals of y, x, or both are

$$\frac{1}{y} = a + bx, \qquad y = a + b\,\frac{1}{x}, \qquad \frac{1}{y} = a + b\,\frac{1}{x}$$
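The whole procedure for the exponential model (take logarithms, fit a line, back-transform the intercept) can be sketched as follows; the data are generated from a known curve rather than taken from the steakhouse example:

```python
from math import log, exp

def fit_exponential(x, y):
    """Fit y = a * exp(b*x) by linear regression of ln(y) on x."""
    ly = [log(yi) for yi in y]
    n = len(x)
    mx, my = sum(x) / n, sum(ly) / n
    b = sum((xi - mx) * (li - my) for xi, li in zip(x, ly)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = exp(my - b * mx)     # back-transform the intercept
    return a, b

# Data generated from y = 2 * exp(0.3 x); the fit recovers the parameters:
xs = list(range(10))
ys = [2 * exp(0.3 * xi) for xi in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 3), round(b, 3))   # 2.0 0.3
```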

7 Chi-Square Tests

7.1 Relationship between two discrete variables: the chi-square test for independence

Example. In a clinical study, occurrences of diabetes were examined in three groups; the results (frequencies) are tabulated in a table called a contingency table or table of frequencies (Table 7-1).

DIAB    Treatment 1   Treatment 2   Treatment 3   Total
yes     31            27            25            83
no      69            73            75            217
Total   100           100           100           300

Table 7-1. Occurrence of diabetes in three groups

The question is whether the occurrence of diabetes is the same in the three groups. We see the actually observed frequencies; they are similar. However, these frequencies are collected from a sample of 300 patients. The question is whether the percentage of diabetes


is the same in the population. In other words: is the occurrence of diabetes independent of the treatment group? This question can be tested. Here the null hypothesis is that the occurrence of diabetes is independent of the groups.

H0: the occurrence of diabetes is independent of the groups (the rates are the same in the population)
Ha: the occurrence of diabetes is not independent of the groups (the rates are different in the population)

If this null hypothesis is true, the relative frequencies would be the same in each group. According to this null hypothesis, we compute the so-called expected frequencies: frequencies expected under the null hypothesis. Computation of expected frequencies: we fix the row and column totals. If the null hypothesis is true, we expect the same ratio of yes-no answers in each treatment group, according to the ratio of the yes-no answers in the Total column. A simple way to obtain these expected frequencies is to multiply the column total by the row total for each cell in the table and divide by the sample size. The expected frequencies are shown in Table 7-2.

DIAB    Treatment 1   Treatment 2   Treatment 3   Total
yes     27.7          27.7          27.7          83
no      72.3          72.3          72.3          217
Total   100           100           100           300

Table 7-2. The table of expected frequencies

Expected result = (column total × row total) / sample size

Let us examine the differences between the observed results and the expected results: are these differences large or small? We are looking for one number which will indicate whether these differences are large.

cell   category                   observed (O)   expected (E)   O-E     (O-E)2   (O-E)2/E
1      Treatment 1, diabetes      31             27.7           3.3     10.89    0.393
2      Treatment 1, no diabetes   69             72.3           -3.3    10.89    0.151
3      Treatment 2, diabetes      27             27.7           -0.7    0.49     0.018
4      Treatment 2, no diabetes   73             72.3           0.7     0.49     0.007
5      Treatment 3, diabetes      25             27.7           -2.7    7.29     0.263
6      Treatment 3, no diabetes   75             72.3           2.7     7.29     0.101
                                                                Sum=0            Sum=0.932

Table 7-3. Computation of the chi-square statistic

The number that shows whether the differences are large enough to claim no independence is denoted by χ2 and can be computed by the following formula:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

In our case it is equal to 0.932 (Table 7-3).
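The expected frequencies and the chi-square statistic of Table 7-1 can be reproduced in a few lines. Note that keeping the expected counts unrounded gives 0.933, while the 0.932 in the text comes from first rounding the expected counts to 27.7 and 72.3:

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# Table 7-1: diabetes yes/no in three treatment groups
table = [[31, 27, 25],
         [69, 73, 75]]
print(round(chi_square(table), 3))   # 0.933
```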

If there is no relationship between the variables, the differences would be near 0, so χ2 would be near zero. On the other hand, if χ2 is far away from 0, there is a high probability that the variables are not independent, and there is some statistical relationship between them. In these chi-square tests of independence our null hypothesis is that the variables are independent. As we did earlier with the normal and t-distributions, we compare a number based on sample data with a critical value. The theoretical distribution from which the critical values are taken is called the chi-square distribution. It is tabulated according to the degrees of freedom. For a 3 x 2 table there are 2 degrees of freedom.


Generally the degrees of freedom in a contingency table can be computed as df = (number of rows - 1) × (number of columns - 1). The critical value in the chi-square table for α=0.05 and 2 degrees of freedom is χ2table,0.05,2 = 5.99. The decision based on the critical value in the table: if our computed test statistic is greater than the critical value in the table, we reject the null hypothesis and say that the difference is significant at the 5% level. In our case, 0.932 < 5.99 (= χ2table,0.05,2), so the difference is not significant at the 5% level.

Assumption for the chi-square test: it works well when there are enough data, that is, when the sample size is big. This assumption is defined for the expected frequencies: we need expected frequencies greater than 5 (in some textbooks, greater than 10). More precisely: at most 20% of the cells may have expected frequencies less than 5.

7.2 General formula for the test of independence

A total of n experiments may have been performed whose results are characterized by the values of two random variables X and Y. We assume that the variables are discrete and the values of X and Y are x1, x2,...,xr and y1, y2,...,ys, respectively, which are the outcomes of the events A1, A2,...,Ar and B1, B2,...,Bs. Let's denote by kij the number of occurrences of the event (Ai, Bj). These numbers can be grouped into a matrix, called a contingency table. It has the following form:

       B1    B2    ...   Bs    Sum
A1     k11   k12   ...   k1s   k1.
A2     k21   k22   ...   k2s   k2.
...    ...   ...   ...   ...   ...
Ar     kr1   kr2   ...   krs   kr.
Sum    k.1   k.2   ...   k.s   n

Here, $k_{i.} = \sum_{j=1}^{s} k_{ij}$ is the frequency of the event $A_i$ and $k_{.j} = \sum_{i=1}^{r} k_{ij}$ is the frequency of the event $B_j$. The independence of the two variables means the independence of the events $A_i$ and $B_j$, so the null hypothesis is:

H0: $P(A_i B_j) = P(A_i) \cdot P(B_j)$

Clearly the $k_{ij}$-s are the observed frequencies, and the expected frequencies can be computed

by the formula $\frac{k_{i.} \cdot k_{.j}}{n}$. The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\left(k_{ij} - \frac{k_{i.} k_{.j}}{n}\right)^2}{\frac{k_{i.} k_{.j}}{n}}$$

which asymptotically has a χ2 distribution with (r-1)(s-1) degrees of freedom. If X2 > χ2table, we reject the null hypothesis that the two variables are independent; in the opposite case we do not reject the null hypothesis.

7.2.1 Special case: the 2x2 table

Let's denote the frequencies in a 2x2 table by a, b, c, d:


       B1     B2     Sum
A1     a      b      a+b
A2     c      d      c+d
Sum    a+c    b+d    n

Then the formula for the chi-square test is:

$$\chi^2 = \frac{(ad - bc)^2 \cdot (a+b+c+d)}{(a+b)(c+d)(a+c)(b+d)}$$

with 1 degree of freedom.
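A sketch of this shortcut formula, checked against the randomized clinical trial counts used in the example of this section (a=30, b=40, c=40, d=20, as in the SPSS crosstabulation):

```python
def chi2_2x2(a, b, c, d):
    """Chi-square for a 2x2 table via the (ad - bc) shortcut formula."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(chi2_2x2(30, 40, 40, 20), 2))   # 7.37
```

The result agrees with the Pearson chi-square value computed from the general formula.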

For tables with very small expected frequencies, a test called Fisher's exact test should be used. It gives the exact p-value: the probability of obtaining the observed results or results that are more extreme.

Example 7.2-1. The following table summarizes the information from a randomized clinical trial that compared two treatments (test, placebo) for a respiratory disorder. Such a table is called a contingency table. The question of interest is whether the rates of favourable response for test and placebo are the same.

Treatment   Favourable   Unfavourable
Placebo     30           40
Test        40           20

H0: there is no association between treatment and outcome in the population (treatment and

outcome are independent) (the rates of favourable response for test and placebo are the same).

Ha: the rates of favourable response for test and placebo are different.

The table of expected frequencies contains only numbers greater than 5, so the assumption holds:

Treatment   Favourable   Unfavourable   Sum
Placebo     37.7         32.3           70
Test        32.3         27.7           60
Sum         70           60             130

The computed test statistic is χ2=7.37, df=(2-1)⋅(2-1)=1. Decision: χ2table(α=0.05)=3.841, and 7.37 > 3.841, so the difference is significant at the 5% level. The rates of favourable response for test and placebo are significantly different.

SPSS results:

Crosstabulation

Treatment                  F      N      Total
Placebo   Count            30     40     70
          Expected Count   37.7   32.3   70.0
Test      Count            40     20     60
          Expected Count   32.3   27.7   60.0
Total     Count            70     60     130
          Expected Count   70.0   60.0   130.0

Chi-Square Tests

                               Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             7.370b   1    .007
Continuity Correctiona         6.443    1    .011
Likelihood Ratio               7.459    1    .006
Fisher's Exact Test                                        .008         .005
Linear-by-Linear Association   7.313    1    .007
N of Valid Cases               130

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 27.69.

(The Pearson chi-square value and Fisher's exact p-value are the entries of interest in the output.)


7.2.2 The chi-square test for goodness of fit

Goodness of fit tests are used to determine whether sample observations fall into categories in the way they "should" according to some ideal model. When they come out as expected, we say that the data fit the model. The chi-square statistic helps us to decide whether the fit of the data to the model is good.

Suppose we have a sample of observations. Let's prepare a histogram of the sample. Let's denote the cut-points of the categories by c0, c1,...,cr, and the frequency in the i-th interval (the number of sample elements falling in the interval [ci-1, ci]) by ki. Clearly $\sum_{i=1}^{r} k_i = n$.

We would like to test whether the sample is drawn from a population with a given distribution. Now the null hypothesis is:

H0: the distribution of the variable X is the given distribution

Let p1,...,pr denote the probabilities of falling into the given intervals in the case of the given distribution. If H0 is true and n is large, then the relative frequencies are approximations of the pi-s, $\frac{k_i}{n} \approx p_i$, or $k_i \approx n p_i$; so the n·pi-s are the expected frequencies and the ki-s are the observed

frequencies. The formula of the test statistic is

$$X^2 = \sum_{i=1}^{r} \frac{(k_i - n p_i)^2}{n p_i}$$

has a χ2 distribution with (r - s - 1) degrees of freedom. Here s is the

number of parameters of the distribution (if there are any). We will show the use of the chi-square test in the case of the uniform and the normal distribution.

7.2.2.1 Test for uniform distribution

Example 7.2-2. We would like to test whether a dice is fair or biased. The dice is thrown 120 times. If it is fair, all outcomes are equally probable, so in the ideal case we would expect a frequency of 20 for each number. After throwing the dice we got the following results:

                       1    2    3    4    5    6    Sum
Results                24   15   15   19   25   22   120

H0: the dice is fair; all outcomes are equally probable.

The observed frequencies are in the table above; the expected frequencies are all 20. So we get the following table to perform the chi-square test:

                       1    2    3    4    5    6    Sum
Observed frequencies   24   15   15   19   25   22   120
Expected frequencies   20   20   20   20   20   20   120

$$X^2 = \sum_{i=1}^{6} \frac{(k_i - 20)^2}{20} = \frac{16 + 25 + 25 + 1 + 25 + 4}{20} = \frac{96}{20} = 4.8$$

The degrees of freedom is 5, and the critical value in the table is χ20.05,5 = 11.07. As our test statistic 4.8 < 11.07, we do not reject H0 and claim that the dice is fair.
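The dice computation above can be reproduced directly:

```python
def goodness_of_fit(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [24, 15, 15, 19, 25, 22]
expected = [20] * 6          # a fair dice: 120 throws / 6 outcomes
print(round(goodness_of_fit(observed, expected), 3))   # 4.8
```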


7.2.2.2 Test for normality

Let's suppose we have a sample and would like to know whether it comes from a normally distributed population. Let's make a histogram from the sample; this gives the "observed" frequencies. To test the null hypothesis

H0: the sample is drawn from a normally distributed population,

we need the expected frequencies. We have to estimate the parameters of the normal density function. We use the sample mean $\bar{x}$ and the sample standard deviation s. The expected frequencies can be computed using the tables of the normal distribution, or directly using the formula

$$p_i = \int_{c_{i-1}}^{c_i} \hat{f}(x)\,dx = \hat{F}(c_i) - \hat{F}(c_{i-1}), \qquad \text{where} \quad \hat{f}(x) = \frac{1}{\sqrt{2\pi}\,s}\, e^{-\frac{(x-\bar{x})^2}{2s^2}},$$

and the test statistic is

$$X^2 = \sum_{i=1}^{r} \frac{(k_i - n p_i)^2}{n p_i}$$

with r - 2 - 1 degrees of freedom (2 is the number of parameters of the

normal distribution).

7.2.2.3 Using Gauss-paper

There is a graphical method to check normality. The "Gauss-paper" is a special coordinate system: the tick marks of the y axis are the inverse of the normal distribution and are given in percentages. We simply have to draw the distribution function of the sample on this paper. In the case of normality the points are arranged approximately in a straight line.

8 Nonparametric Tests

An important part of statistics deals with tests for which we do not need to make specific assumptions about the distribution of the data. This might happen, for instance, if you were working with a non-normal distribution, or a distribution whose shape was not yet evident. One type of data, called ordinal data, uses numbers to put something in rank order. Many consumer surveys ask people to rank something or other. One point should be made: in general, if you do have data for which there is an appropriate parametric test, then that is the test you should use. It will be a more powerful test because it takes into account certain features of the distribution of your data (say, that it is normal in shape) which the nonparametric test would ignore. Some students ask, "what happens when I use a parametric test even though my data do not really meet the proper requirements?" Currently, much research goes on to see how well various statistical tests work when the populations do not meet the theoretical requirements. For example, when we use the t-test, we assume that our populations are approximately normal; therefore the t-test is a parametric test. But research has indicated that many times its results are still reliable even if the populations are not too close to normal. The idea of measuring how well a statistical test works when we violate some of its theoretical assumptions is associated with the concept of robustness. A test is robust if it still works well when the theoretical assumptions are violated.

8.1.1 Ranking the data

Nonparametric tests cannot use estimates of population parameters. They use ranks instead: in place of the original sample data we have to use their ranks. To show the ranking procedure, suppose we have the following sample of measurements:


Case   Data   Rank
1      199    6
2      126    5
3      81     2
4      68     1
5      112    3.5
6      112    3.5

The sum of all ranks must be $\sum_{i=1}^{n} r_i = \frac{n(n+1)}{2}$. Using this formula we can check our computations. Here the sum of ranks is 21, and 6·7/2 = 21.
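The ranking procedure, including mid-ranks for ties, can be sketched as follows (a quadratic-time version that is fine for small samples):

```python
def rank_with_ties(data):
    """Ranks of the sample values; tied values get the average of their ranks."""
    sorted_vals = sorted(data)
    ranks = []
    for v in data:
        first = sorted_vals.index(v)      # first position of v (0-based)
        count = sorted_vals.count(v)      # how many times v occurs
        # average of the ranks first+1, ..., first+count
        ranks.append(first + (count + 1) / 2)
    return ranks

data = [199, 126, 81, 68, 112, 112]
print(rank_with_ties(data))   # [6.0, 5.0, 2.0, 1.0, 3.5, 3.5]
```

The two tied values 112 both receive the average of ranks 3 and 4, i.e. 3.5, and the rank sum is n(n+1)/2 = 21.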

8.1.2 One sample tests (the difference of two related samples): the sign test and the Wilcoxon signed rank test.

Various one-sample nonparametric procedures are available for testing hypotheses about the parameters of a population. These include procedures for examining differences in paired samples. The paired t-test for means is used to test that the means of the paired populations are equal. Remember that this test requires the assumption that the differences are normally distributed.

The sign test is a nonparametric procedure used with two related samples to test the hypothesis that the distributions of the two variables are the same. This test makes no assumptions about the shape of these distributions. To compute the sign test, the difference between the pairs is calculated for each case. Next, the numbers of positive and negative differences are obtained. If the distributions of the two variables are the same, the numbers of positive and negative differences should be similar. To perform the test, we simply have to compare the number of positive or negative signs to the value given in a table. If the sample size is large (>30), a normal approximation can be used to obtain p-values.

The Wilcoxon signed rank test incorporates information about the magnitude of the differences and is therefore more powerful than the sign test. To compute the Wilcoxon signed rank test, the differences are ranked without considering their signs. In the case of ties, average ranks are assigned. The sums of the ranks of the positive and of the negative differences are calculated. If the sample size is small, tables can be used. The tables contain Rmin-Rmax intervals: if the rank sums lie within this interval, the difference is not significant; in the opposite case the difference is significant.

Example. 13 students were measured in reading speed and comprehension at the course ending and after 1 month (Table 8-1). Suppose we have reason to believe that the two distributions of reading scores are not normal. Then we should not perform a t-test on the paired differences of the means.
So we will perform a Wilcoxon signed rank test. H0: the two samples are taken from the same population. The sum of the positive ranks is R+=40.5, the sum of the negative ranks is R-=25.5. The table gives an interval of 10-56. As both rank sums are in this interval, we do not reject the null hypothesis and claim that the difference is not significant at the 5% level. If the sample size is large, we can compute the mean and standard deviation of the ranks and use the normal distribution to get the p-value. Computer packages use this normal approximation also in the case of small sample sizes.

Page 67: Biostatistics, statistical software...Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and

67

Student   Score at course ending   Score after 1 month   Difference   Rank
1         50                       52                    -2           5.5
2         48                       51                    -3           9
3         46                       46                    0
4         50                       49                    1            2
5         62                       50                    2            5.5
6         80                       70                    10           11
7         23                       21                    2            5.5
8         30                       33                    -3           9
9         45                       46                    -1           2
10        53                       53                    0
11        49                       48                    1            2
12        51                       48                    3            9
13        46                       48                    -2           5.5

Table 8-1. Computation of the Wilcoxon signed rank test

8.1.3 Two-sample test (two independent samples): the Mann-Whitney test

The Mann-Whitney test, also known as the Wilcoxon two-sample test, does not require assumptions about the shape of the underlying distributions. It tests the hypothesis that two independent samples come from populations having the same distribution. The form of the distribution need not be specified. The test does not require that the variable be measured on an interval scale; an ordinal scale is sufficient. To compute the test, the observations from both samples are first combined and ranked from the smallest to the largest value. The statistic for testing the null hypothesis that the two distributions are equal is the sum of the ranks for each of the two groups. If the groups have the same distribution, their sample distributions of ranks should be similar. If one of the groups has more than its share of small or large ranks, there is reason to suspect that the two underlying distributions are different.

Example 8.1-1. Consider the following table, which shows a sample of the King data (1992). King examined the relationship between diet and tumour growth in rats. The rats were divided into two groups and fed diets of saturated or unsaturated fat. One hypothesis of interest is whether the length of time it takes for tumours to develop differs in the two groups. If we assumed a normal distribution of the tumour-free time, the two-sample t-test could be used to test the hypothesis that the population means are equal. However, if the distribution of times does not appear to be normal, and especially if the sample sizes are small, we should consider statistical procedures that do not require assumptions about the shape of the underlying distribution.

          Saturated                     Unsaturated
Case   Time   Rank            Case   Time   Rank
 1     199     6               4      68     1
 2     126     5               5     112     3.5
 3      81     2               6     112     3.5

Table 8-2. Computation of the Mann-Whitney U test

The sum of the ranks in the first group is R1 = 13; in the second group, R2 = 8. The test statistic T is the sum of the ranks in the smaller group. If the total sample size is less than 30, tables can be used in which an acceptance interval Rmin–Rmax is given: the null hypothesis is accepted if T lies in the interval given in the table for the test. For large sample sizes, a normal approximation can be used to obtain the p-value.


Example 7.2-2. Hypothetical example. The change in body weight is compared in two groups: patients on a special diet and control patients. The original data are ranked together (independently of group), and then the sum of ranks in each group is computed, as shown in Table 8-3.

Patient   Change in body weight (kg)   Group     Rank   Rank corrected for ties
  1                 -1                 Diet        3         3
  2                  5                 Diet       16        16.5
  3                  3                 Diet       12        13
  4                 10                 Diet       21        21
  5                  6                 Diet       18        19
  6                  4                 Diet       15        15
  7                  0                 Diet        4         5.5
  8                  1                 Diet        8         9
  9                  6                 Diet       19        19
 10                  6                 Diet       20        19
Sum of ranks, R1                                           140
 11                  2                 Control    11        11
 12                  0                 Control     5         5.5
 13                  1                 Control     9         9
 14                  0                 Control     6         5.5
 15                  3                 Control    13        13
 16                  1                 Control    10         9
 17                  5                 Control    17        16.5
 18                  0                 Control     7         5.5
 19                 -2                 Control     1         1.5
 20                 -2                 Control     2         1.5
 21                  3                 Control    14        13
Sum of ranks, R2                                            91

Table 8-3. Computation of the Mann-Whitney U test

If the distributions of the two variables are the same (i.e., if the null hypothesis is true), the sums of ranks in the two groups should be similar. The sum of ranks in the first group (n1=10) is R1=140; the sum of ranks in the second group (n2=11) is R2=91. The test statistic T is the sum of the ranks in the smaller group: T=140. The null hypothesis is accepted if T lies in the interval given in the table for the test. For n1=10, n2=11 and α=0.05, this interval is 81–139. As T lies outside this interval, we reject the null hypothesis and conclude that the difference is significant at the 5% level.
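The rank sums in Table 8-3 can be verified with a few lines of plain Python; the tie-corrected rank of each value is the average of the 1-based positions it occupies in the combined ordered sample (the `rank` helper is written only for this example).

```python
# Rank-sum computation for the diet example (data from Table 8-3).
diet    = [-1, 5, 3, 10, 6, 4, 0, 1, 6, 6]       # n1 = 10
control = [2, 0, 1, 0, 3, 1, 5, 0, -2, -2, 3]    # n2 = 11

ordered = sorted(diet + control)

def rank(v):
    """Tie-corrected rank: average of the 1-based positions of v."""
    return ordered.index(v) + 1 + (ordered.count(v) - 1) / 2

r1 = sum(rank(v) for v in diet)
r2 = sum(rank(v) for v in control)
print(r1, r2)   # rank sums of the two groups
```

As a check, the two rank sums must add up to N(N+1)/2 with N = n1 + n2 = 21.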

8.1.4 Several related samples: the Friedman test

The Friedman test is the nonparametric equivalent of a one-sample repeated-measures design, or of a two-way analysis of variance with one observation per cell. It tests the null hypothesis that k related variables come from the same population. For each case, the k variables are ranked from 1 to k, and the test statistic is based on these ranks. The test yields a single p-value. If it is not significant, the null hypothesis is accepted. If the null hypothesis is rejected, further tests are required to make pairwise comparisons. These pairwise comparisons are generally not available in standard statistical packages. Pairwise


comparisons can be performed by Wilcoxon signed rank tests, and the p-values can be corrected by Bonferroni correction.
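The Bonferroni correction itself is simple: each raw pairwise p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch with hypothetical p-values:

```python
# Bonferroni adjustment for k pairwise comparisons (hypothetical p-values).
raw_p = [0.010, 0.040, 0.300]        # e.g. three pairwise Wilcoxon tests
k = len(raw_p)
adjusted = [min(1.0, p * k) for p in raw_p]
print(adjusted)   # compare the adjusted p-values with alpha as usual
```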

8.1.5 Several independent samples: the Kruskal-Wallis test

The Kruskal-Wallis test, an extension of the Mann-Whitney U test, is the nonparametric analogue of one-way analysis of variance and detects differences in distribution location. It tests whether k independent samples, defined by a grouping variable, come from the same population. The test assumes that there is no a priori ordering of the k populations from which the samples are drawn. It yields a single p-value. If it is not significant, the null hypothesis is accepted. If the null hypothesis is rejected, further tests are required to make pairwise comparisons. These pairwise comparisons are generally not available in standard statistical packages; they can be performed by Mann-Whitney U tests, and the p-values can be corrected by Bonferroni correction.
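For samples without ties, the Kruskal-Wallis statistic is H = 12/(N(N+1)) · Σ Rj²/nj − 3(N+1), where Rj is the rank sum and nj the size of group j, and N is the total sample size. A minimal sketch with hypothetical data (three groups, no tied values):

```python
# Kruskal-Wallis H for three hypothetical groups with no tied values.
groups = [[1.2, 3.4, 5.1], [2.3, 4.0, 9.9], [6.5, 7.7, 8.8]]

pooled = sorted(v for g in groups for v in g)
rank_sums = [sum(pooled.index(v) + 1 for v in g) for g in groups]

n_total = len(pooled)
h = (12 / (n_total * (n_total + 1))
     * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
     - 3 * (n_total + 1))
print(h)   # compared with the chi-squared distribution with k-1 df
```

Here H = 3.2 with k−1 = 2 degrees of freedom; since the critical value at α=0.05 is 5.991 (Section 9.4), this hypothetical difference would not be significant.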

8.1.6 Relationship between two variables: Spearman's rank correlation coefficient

The Pearson correlation coefficient is appropriate only for data measured on at least an interval scale, and normality is also assumed when testing hypotheses about this correlation coefficient. For data that do not satisfy the normality assumption, another measure of the relationship between two variables, Spearman's rank correlation coefficient, is available. The rank correlation coefficient is the Pearson correlation coefficient computed on the ranks of the data (with adjustments if some of the data are tied): the data for each variable are first ranked, and then the Pearson correlation coefficient between the ranks of the two variables is computed. Like the Pearson correlation coefficient, the rank correlation ranges between -1 and +1, where -1 and +1 indicate a perfect linear relationship between the ranks of the two variables. The interpretation is therefore the same, except that the relationship between ranks, and not values, is examined.
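The definition translates directly into code: rank each variable, then compute the ordinary Pearson coefficient of the ranks. A sketch with hypothetical, tie-free data:

```python
import math

# Spearman's rank correlation as the Pearson correlation of the ranks
# (hypothetical data with no ties).
x = [35, 23, 47, 17, 10]
y = [30, 33, 45, 23, 8]

def ranks(values):
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]   # no ties assumed

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

rho = pearson(ranks(x), ranks(y))
print(rho)
```

For tie-free data this agrees with the shortcut formula ρ = 1 − 6Σd²/(n(n²−1)), where d is the difference between the two ranks of each case.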


9 Statistical Tables

9.1 The standard normal distribution (one-tailed areas)

Proportions of the standard normal distribution below the value z.

   z    area to the left of z        z    area to the left of z
 -3.0         0.001                 0.0         0.500
 -2.9         0.002                 0.1         0.540
 -2.8         0.003                 0.2         0.579
 -2.7         0.003                 0.3         0.618
 -2.6         0.005                 0.4         0.655
 -2.5         0.006                 0.5         0.691
 -2.4         0.008                 0.6         0.726
 -2.3         0.011                 0.7         0.758
 -2.2         0.014                 0.8         0.788
 -2.1         0.018                 0.9         0.816
 -2.0         0.023                 1.0         0.841
 -1.9         0.029                 1.1         0.864
 -1.8         0.036                 1.2         0.885
 -1.7         0.045                 1.3         0.903
 -1.6         0.055                 1.4         0.919
 -1.5         0.067                 1.5         0.933
 -1.4         0.081                 1.6         0.945
 -1.3         0.097                 1.7         0.955
 -1.2         0.115                 1.8         0.964
 -1.1         0.136                 1.9         0.971
 -1.0         0.159                 2.0         0.977
 -0.9         0.184                 2.1         0.982
 -0.8         0.212                 2.2         0.986
 -0.7         0.242                 2.3         0.989
 -0.6         0.274                 2.4         0.992
 -0.5         0.309                 2.5         0.994
 -0.4         0.345                 2.6         0.995
 -0.3         0.382                 2.7         0.997
 -0.2         0.421                 2.8         0.997
 -0.1         0.460                 2.9         0.998
  0.0         0.500                 3.0         0.999
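The tabulated areas can be reproduced with the error function available in most standard libraries; a sketch in Python:

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(phi(1.0), 3), round(phi(-2.0), 3))
```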


9.2 The Student's t distribution (two-tailed areas)

The tabulated values of the t distribution correspond to given two-tailed p-values for different degrees of freedom. For α=0.05 and df=12, the critical value is tα/2 = 2.179.

            Two-sided α
  df    0.2     0.1     0.05    0.02    0.01    0.001
   1   3.078   6.314  12.706  31.821  63.657  636.619
   2   1.886   2.920   4.303   6.965   9.925   31.599
   3   1.638   2.353   3.182   4.541   5.841   12.924
   4   1.533   2.132   2.776   3.747   4.604    8.610
   5   1.476   2.015   2.571   3.365   4.032    6.869
   6   1.440   1.943   2.447   3.143   3.707    5.959
   7   1.415   1.895   2.365   2.998   3.499    5.408
   8   1.397   1.860   2.306   2.896   3.355    5.041
   9   1.383   1.833   2.262   2.821   3.250    4.781
  10   1.372   1.812   2.228   2.764   3.169    4.587
  11   1.363   1.796   2.201   2.718   3.106    4.437
  12   1.356   1.782   2.179   2.681   3.055    4.318
  13   1.350   1.771   2.160   2.650   3.012    4.221
  14   1.345   1.761   2.145   2.624   2.977    4.140
  15   1.341   1.753   2.131   2.602   2.947    4.073
  16   1.337   1.746   2.120   2.583   2.921    4.015
  17   1.333   1.740   2.110   2.567   2.898    3.965
  18   1.330   1.734   2.101   2.552   2.878    3.922
  19   1.328   1.729   2.093   2.539   2.861    3.883
  20   1.325   1.725   2.086   2.528   2.845    3.850
  21   1.323   1.721   2.080   2.518   2.831    3.819
  22   1.321   1.717   2.074   2.508   2.819    3.792
  23   1.319   1.714   2.069   2.500   2.807    3.768
  24   1.318   1.711   2.064   2.492   2.797    3.745
  25   1.316   1.708   2.060   2.485   2.787    3.725
  26   1.315   1.706   2.056   2.479   2.779    3.707
  27   1.314   1.703   2.052   2.473   2.771    3.690
  28   1.313   1.701   2.048   2.467   2.763    3.674
  29   1.311   1.699   2.045   2.462   2.756    3.659
  30   1.310   1.697   2.042   2.457   2.750    3.646
   ∞   1.282   1.645   1.960   2.326   2.576    3.291
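The last row of the table (df = ∞) coincides with the standard normal distribution; for example, the two-sided 5% critical value 1.960 can be recovered by inverting the normal CDF numerically. A sketch using bisection:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_quantile(p, lo=-10.0, hi=10.0):
    """Find z with phi(z) = p by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# two-sided alpha = 0.05  ->  upper-tail area 0.025  ->  phi(z) = 0.975
print(round(z_quantile(0.975), 3))
```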


9.3 The F distribution

The tabulated values of the F distribution correspond to given one-tailed p-values (α=0.05) for different degrees of freedom of the numerator (n1) and of the denominator (n2).

n2\n1      1      2      3      4      5      6      7      8      9     10     50     60     70     80     90    100   1000  10000
    1  161.4  199.5  215.7  224.5  230.1  233.9  236.7  238.8  240.5  241.8  251.7  252.2  252.5  252.7  252.9  253.0  254.2  254.3
    2  18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.38  19.40  19.48  19.48  19.48  19.48  19.48  19.49  19.49  19.50
    3  10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.58   8.57   8.57   8.56   8.56   8.55   8.53   8.53
    4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.70   5.69   5.68   5.67   5.67   5.66   5.63   5.63
    5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.44   4.43   4.42   4.41   4.41   4.41   4.37   4.37
    6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   3.75   3.74   3.73   3.72   3.72   3.71   3.67   3.67
    7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.32   3.30   3.29   3.29   3.28   3.27   3.23   3.23
    8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.02   3.01   2.99   2.99   2.98   2.97   2.93   2.93
    9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   2.80   2.79   2.78   2.77   2.76   2.76   2.71   2.71
   10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.64   2.62   2.61   2.60   2.59   2.59   2.54   2.54
   20   4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35   1.97   1.95   1.93   1.92   1.91   1.91   1.85   1.84
   30   4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.21   2.16   1.76   1.74   1.72   1.71   1.70   1.70   1.63   1.62
   40   4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12   2.08   1.66   1.64   1.62   1.61   1.60   1.59   1.52   1.51
   50   4.03   3.18   2.79   2.56   2.40   2.29   2.20   2.13   2.07   2.03   1.60   1.58   1.56   1.54   1.53   1.52   1.45   1.44
   60   4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   2.04   1.99   1.56   1.53   1.52   1.50   1.49   1.48   1.40   1.39
   70   3.98   3.13   2.74   2.50   2.35   2.23   2.14   2.07   2.02   1.97   1.53   1.50   1.49   1.47   1.46   1.45   1.36   1.35
   80   3.96   3.11   2.72   2.49   2.33   2.21   2.13   2.06   2.00   1.95   1.51   1.48   1.46   1.45   1.44   1.43   1.34   1.33
   90   3.95   3.10   2.71   2.47   2.32   2.20   2.11   2.04   1.99   1.94   1.49   1.46   1.44   1.43   1.42   1.41   1.31   1.30
  100   3.94   3.09   2.70   2.46   2.31   2.19   2.10   2.03   1.97   1.93   1.48   1.45   1.43   1.41   1.40   1.39   1.30   1.28
  200   3.89   3.04   2.65   2.42   2.26   2.14   2.06   1.98   1.93   1.88   1.41   1.39   1.36   1.35   1.33   1.32   1.21   1.19
  300   3.87   3.03   2.63   2.40   2.24   2.13   2.04   1.97   1.91   1.86   1.39   1.36   1.34   1.32   1.31   1.30   1.17   1.15
  400   3.86   3.02   2.63   2.39   2.24   2.12   2.03   1.96   1.90   1.85   1.38   1.35   1.33   1.31   1.30   1.28   1.15   1.13
  500   3.86   3.01   2.62   2.39   2.23   2.12   2.03   1.96   1.90   1.85   1.38   1.35   1.32   1.30   1.29   1.28   1.14   1.12
  600   3.86   3.01   2.62   2.39   2.23   2.11   2.02   1.95   1.90   1.85   1.37   1.34   1.32   1.30   1.28   1.27   1.13   1.11
  700   3.85   3.01   2.62   2.38   2.23   2.11   2.02   1.95   1.89   1.84   1.37   1.34   1.31   1.29   1.28   1.27   1.12   1.10
  800   3.85   3.01   2.62   2.38   2.23   2.11   2.02   1.95   1.89   1.84   1.37   1.34   1.31   1.29   1.28   1.26   1.12   1.09
  900   3.85   3.01   2.61   2.38   2.22   2.11   2.02   1.95   1.89   1.84   1.36   1.33   1.31   1.29   1.27   1.26   1.11   1.09
 1000   3.85   3.00   2.61   2.38   2.22   2.11   2.02   1.95   1.89   1.84   1.36   1.33   1.31   1.29   1.27   1.26   1.11   1.08
10000   3.84   3.00   2.61   2.37   2.21   2.10   2.01   1.94   1.88   1.83   1.35   1.32   1.29   1.28   1.26   1.25   1.08   1.03


9.4 The chi-squared distribution

The tabulated values are values of the χ² distribution corresponding to given two-tailed p-values for different degrees of freedom.

        Two-tailed probability
  df    0.2      0.1      0.05     0.02     0.01     0.001
   1   1.642    2.706    3.841    5.412    6.635   10.828
   2   3.219    4.605    5.991    7.824    9.210   13.816
   3   4.642    6.251    7.815    9.837   11.345   16.266
   4   5.989    7.779    9.488   11.668   13.277   18.467
   5   7.289    9.236   11.070   13.388   15.086   20.515
   6   8.558   10.645   12.592   15.033   16.812   22.458
   7   9.803   12.017   14.067   16.622   18.475   24.322
   8  11.030   13.362   15.507   18.168   20.090   26.124
   9  12.242   14.684   16.919   19.679   21.666   27.877
  10  13.442   15.987   18.307   21.161   23.209   29.588
  11  14.631   17.275   19.675   22.618   24.725   31.264
  12  15.812   18.549   21.026   24.054   26.217   32.909
  13  16.985   19.812   22.362   25.472   27.688   34.528
  14  18.151   21.064   23.685   26.873   29.141   36.123
  15  19.311   22.307   24.996   28.259   30.578   37.697
  16  20.465   23.542   26.296   29.633   32.000   39.252
  17  21.615   24.769   27.587   30.995   33.409   40.790
  18  22.760   25.989   28.869   32.346   34.805   42.312
  19  23.900   27.204   30.144   33.687   36.191   43.820
  20  25.038   28.412   31.410   35.020   37.566   45.315
  21  26.171   29.615   32.671   36.343   38.932   46.797
  22  27.301   30.813   33.924   37.659   40.289   48.268
  23  28.429   32.007   35.172   38.968   41.638   49.728
  24  29.553   33.196   36.415   40.270   42.980   51.179
  25  30.675   34.382   37.652   41.566   44.314   52.620
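A useful cross-check between the tables: for df = 1, the χ² critical value is the square of the corresponding two-tailed standard normal (or t with df = ∞) critical value. A sketch:

```python
# chi-squared (df = 1) critical values vs. squared two-tailed z critical values
z_crit   = {0.05: 1.960, 0.01: 2.576}   # normal / t table, df = infinity
chi_crit = {0.05: 3.841, 0.01: 6.635}   # df = 1 row of the chi-squared table

for alpha in z_crit:
    print(alpha, round(z_crit[alpha] ** 2, 3), chi_crit[alpha])
```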


9.5 The table of the Wilcoxon signed rank test

If the sum of positive (or negative) ranks is equal to the tabulated value or is outside the range shown, the p-value of the test is less than the value at the top of the column.


9.6 The table of the Mann-Whitney U test

For a comparison of two groups of sizes n1 and n2, where n1 ≤ n2, if the sum of the ranks in the smaller group is equal to the tabulated value or is outside the range shown, the p-value of the test is less than the value at the top of the column.


10 References

[1] Altman, D.G.: Practical Statistics for Medical Research. Chapman & Hall, London, 1995.
[2] Bowerman, B.L. and O'Connell, R.T.: Linear Statistical Models. An Applied Approach. PWS-KENT Publishing Company, 1990.
[3] Campbell, M.J. and Machin, D.: Medical Statistics. A Commonsense Approach. John Wiley & Sons, Chichester, 1993.
[4] Hajtman, Béla: Bevezetés a matematikai statisztikába pszichológusok számára. Akadémiai Kiadó, Budapest, 1971.
[5] Hosmer, D.W. and Lemeshow, S.: Applied Logistic Regression. Wiley, 2000.
[6] Juvancz, I. and Paksy, A.: Orvosi Biometria. Medicina Könyvkiadó, Budapest, 1982.
[7] Srivastava, M.S. and Carter, E.M.: An Introduction to Applied Multivariate Statistics. North-Holland, 1983.
[8] Moore, D.S.: The Basic Practice of Statistics. W.H. Freeman and Company, New York, 1995.
[9] Naiman, A., Rosenfeld, R. and Zirkel, G.: Understanding Statistics. McGraw-Hill International Editions, 1983.
[10] Reiczigel, J., Harnos, A. and Solymosi, N.: Biostatisztika nem statisztikusoknak. Pars Kft., Nagykovácsi, 2007.
[11] Rice Virtual Lab in Statistics. http://onlinestatbook.com/rvls.html


11 Contents

1 Introduction ... 2
2 Data ... 2
  2.1 Data table, cases, variables ... 2
  2.2 Types of variables ... 3
  2.3 Gathering data ... 7
    2.3.1 Aspects of planning questionnaires ... 7
  2.4 Planning data base, data input ... 8
    2.4.1 Using Excel to data collection ... 8
    2.4.2 A brief survey of statistical program systems (R, SPSS, Statistica, SAS) ... 9
  2.5 Distribution ... 10
    2.5.1 The distribution of a categorical variable ... 10
    2.5.2 The distribution of a continuous variable ... 11
    2.5.3 The overall pattern of a distribution ... 12
  2.6 Describing distributions with numbers ... 13
    2.6.1 Measures of the center ... 14
    2.6.2 Properties of the measure of the center ... 14
    2.6.3 The effect of linear transformations to the measures of the center ... 15
    2.6.4 Measures of dispersion ... 15
    2.6.5 Properties of the measures of dispersion, effect of transformations ... 16
    2.6.6 Some measures of an individual: z-score or standardized score ... 17
    2.6.7 The use of the sample characteristics, preparing Figures ... 18
3 The basics of the probability theory ... 20
  3.1 Experiments, events ... 20
  3.2 Operations with events ... 21
  3.3 The concept of probability ... 21
    3.3.1 Axioms of probability ... 22
    3.3.2 Rules of probability calculus ... 22
    3.3.3 Conditional probability ... 23
    3.3.4 Independence of events ... 24
  3.4 Random variables, probability distributions ... 25
    3.4.1 Distribution function of discrete variables ... 26
    3.4.2 Distribution function of a continuous variable ... 28
    3.4.3 The probability density function of a continuous random variable ... 28
  3.5 Population, sample ... 29
    3.5.1 Population- and sample characteristics ... 30
  3.6 Some important distributions and theorems ... 31
    3.6.1 The binomial distribution ... 31
    3.6.2 The Poisson distribution ... 32
    3.6.3 The uniform distribution ... 33
    3.6.4 The normal distribution ... 33
  3.7 Theoretical distribution of the sample means ... 35
    3.7.1 Central limit theorem ... 36
    3.7.2 The standard error of mean (SE or SEM) ... 36
4 Statistical estimation, confidence intervals ... 37
  4.1 Statistic, confidence interval ... 37
  4.2 Confidence intervals ... 38
    4.2.1 Confidence interval for a population mean μ if the population standard deviation (σ) is known ... 38
    4.2.2 Confidence interval for a population mean μ if the population standard deviation (σ) is not known ... 40
    4.2.3 The t-distribution ... 40
    4.2.4 Sample size determination ... 42
5 Statistical inference: hypothesis testing ... 42
  5.1 General steps of hypothesis tests ... 42
    5.1.1 Statistical hypotheses ... 42
    5.1.2 The two types of hypotheses ... 42
    5.1.3 Testing the null hypothesis: the significance level ... 43
    5.1.4 Testing the null hypothesis: decision rules ... 43
    5.1.5 Testing the null hypothesis: decision ... 43
    5.1.6 Steps of hypothesis-testing ... 43
    5.1.7 One- and two-tailed tests ... 44
    5.1.8 Statistical errors ... 44
  5.2 Testing the mean μ of a sample drawn from a normal population: one-sample t-test ... 45
    5.2.1 One-sample t-test for the mean of a normal population ... 45
  5.3 A t-test for paired differences (paired t-test) ... 47
  5.4 Testing the mean of two independent samples: two-sample t-test ... 49
  5.5 Testing the mean of two independent samples in the case of different standard deviations ... 50
  5.6 Comparison of the standard deviations of two normal populations: F-test ... 50
    5.6.1 Using SPSS to compute two-sample t-test ... 51
6 Correlation and prediction ... 52
  6.1 Relationship between two continuous variables ... 52
    6.1.1 Computation of r ... 54
    6.1.2 Properties of r ... 54
    6.1.3 Testing the significance of r ... 55
  6.2 Prediction based on linear correlation: the linear regression ... 55
    6.2.1 Computation of the correlation coefficient from the regression coefficient ... 56
    6.2.2 Coefficient of determination and coefficient of correlation ... 56
    6.2.3 Regression using transformations ... 57
7 Chi-Square Tests ... 59
  7.1 Relationship between two discrete variables, the chi-square test for independence ... 59
  7.2 General formula for test of independence ... 61
    7.2.1 Special case: 2x2 table ... 61
    7.2.2 The chi-square test for goodness of fit ... 64
8 Nonparametric Tests ... 65
    8.1.1 Ranking the data ... 65
    8.1.2 One-sample tests (the difference of two related samples): the sign test and the Wilcoxon signed rank test ... 66
    8.1.3 Two-sample test (two independent samples): the Mann-Whitney test ... 67
    8.1.4 Several related samples: the Friedman test ... 68
    8.1.5 Several independent samples: the Kruskal-Wallis test ... 69
    8.1.6 Relationship between two variables: Spearman's rank correlation coefficient ... 69
9 Statistical Tables ... 70
  9.1 The standard normal distribution (one-tailed areas) ... 70
  9.2 The Student's t distribution (two-tailed areas) ... 71
  9.3 The F distribution ... 72
  9.4 The chi-squared distribution ... 72
  9.5 The table of the Wilcoxon signed rank test ... 73
  9.6 The table of the Mann-Whitney U test ... 74
10 References ... 75
11 Contents ... 76