CHAPTER 01 Terminology

Terminology

1

Introduction

2

Statistics (as opposed to statistic) is the science

of gathering, organizing, analyzing, and

interpreting numerical and categorical

information.

“Statistics is the art of decision making in the

presence of uncertainty”

CPT

3

Descriptive Statistics involve methods

of organizing, picturing, and

summarizing information from samples

or populations (Chapters 1-3)

Inferential Statistics involve methods of

using information from a sample to draw

conclusions regarding the population

(Chapters 7-11)

4

5

Descriptive Statistics

Probability

Inferential

Statistics

1. Choose a topic and identify the problem to be addressed.

2. Background research: past research, including historical descriptive statistics and references.

3. Develop a conjecture or hypothesis.

4. Design an experiment.

5. Gather additional information through experimentation and observation.

6. Analyze and interpret the results; if the hypothesis is rejected, repeat Steps 3 through 6.

7. Formulate conclusions and draw inferences.

6

A population is a group of individual persons, objects, or items whose characteristics one wishes to better understand and from which samples are taken for statistical measurement. The size of the population is denoted N.

The population data is the complete collection of information from all of the individuals or subjects of interest in a given study.

7

A census is a survey of every individual in the

population and the information gathered is called

the population data.

Recall: the population data is the complete

collection of information from all of the

individuals or subjects of interest in a given

study.

8

A sample, n, is a partial collection of

information for only some of the individuals or

subjects of interest in a given study

The sample size, n, of a sample is the number of

observations that constitute the sample. Whereas

the size of a population is denoted by the capital

letter N, the size of the sample is denoted by the

lowercase letter n.

9

Descriptive Statistics involve methods of

organizing and summarizing information (data)

and presenting it numerically or visually

(graphically).

Stem-and-leaf, Frequency Tables, and

Contingency Tables

Bar Charts, Pie Charts, and Histograms

Scatter Plots

Among others…

10

The distribution of the data is a list of all the

values recorded in a sample; that is, the observed

outcomes and their frequency.

Distributions can be given in tabular form as a

frequency or contingency table or illustrated

graphically in the form of a bar chart or

histogram.

11

Characteristics of Distribution

Location: the mean, median, and mode are central tendencies

Shape: a uniform distribution is symmetric, not skewed

Spread: the range of information between minimum and maximum

12

Location within a distribution is a specific value

within the domain of the data such as the

extremes: minimum and maximum, central

tendencies, etc.

13

Outliers are data within a distribution of the

data, but outside the overall pattern (cluster) of

the graph; that is, extreme values that can distort

the interpretation of the data by creating

misleading statistics.

14

Common Locations

Extremes: Minimum and Maximum

Central Tendencies: Mean, Median, and Mode (among other weighted or trimmed means)

Percentiles: A value, x_p%, such that p% of the observed values are to the left

Quartiles: p% = 0%, 25%, 50%, 75%, 100%

15

Shape of a distribution describes the symmetry

or lack thereof (skewness).

Data that is symmetric exhibits balance and self-

similarity whereas skewness is a measure of the

asymmetry.

16

Common Shapes

Uniform: Equal frequencies, no mode

Bell Shaped: The mean equals the median equals the mode; looks like a bell

Symmetric: The above are symmetric

Skewed: Left: mean < median; Right: mean > median
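The mean-versus-median comparison on this slide can be turned into a rough check. The sketch below is illustrative only: the function name `skew_direction` and the sample lists are hypothetical, and an equal mean and median suggests, but does not prove, symmetry.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify shape with the slide's rule of thumb: compare mean and median."""
    m, med = mean(values), median(values)
    if m < med:
        return "left-skewed"
    if m > med:
        return "right-skewed"
    return "no skew detected"

print(skew_direction([1, 2, 3, 4, 100]))     # a long right tail pulls the mean up
print(skew_direction([1, 97, 98, 99, 100]))  # a long left tail pulls the mean down
print(skew_direction([1, 2, 3, 4, 5]))       # balanced around the middle value
```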

17

Spread of a distribution is a measure which

indicates how the data values are distributed, a

measure of the dispersion or variability within a

group of values. Some appropriate measures

include:

Range (from minimum to maximum)

Variance (mean squared error)

Standard deviation (square root of the variance)

where error = observed value - expected value
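The three measures of spread listed above can be computed directly from their definitions. This is a minimal sketch under stated assumptions: the data are made up, the mean stands in for the expected value, and the variance divides by n (a sample variance would divide by n - 1 instead). The name `spread_summary` is hypothetical.

```python
import math

def spread_summary(values):
    """Return the range, variance, and standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n                      # expected value, estimated by the mean
    errors = [x - mean for x in values]         # error = observed value - expected value
    variance = sum(e ** 2 for e in errors) / n  # mean of the squared errors (divides by n)
    return {
        "range": max(values) - min(values),     # maximum minus minimum
        "variance": variance,
        "std_dev": math.sqrt(variance),         # square root of the variance
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
summary = spread_summary(data)
print(summary)  # range 7, variance 4.0, standard deviation 2.0
```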

18

Common Measures

Extremes: Minimum and Maximum

Central Tendencies: Mean, Median, and Mode (among other weighted or trimmed means)

Deviations: Range, Variance, and Standard Deviation

Frequency & Proportions: Count, Relative Frequency, Rate

19

An extreme is a characteristic farthest removed

from the ordinary; common extremes are the

minimum, which is the least observed value in a

sample, and the maximum, the greatest observed

value in a sample.

20

A central tendency is a measure of a characteristic that clusters around a central value. Measures that use the data to estimate the central tendency are called averages; there are three common measures: the mode, median, and mean.

21

Common Central Tendencies

Mode: Most frequently observed value

Median: A value such that 50% of the observed values fall to the left

Mean: Sum of all data divided by number of data points (equal weights)

Weighted Mean: Average such that the weights are not equal

Trimmed Mean: Average such that some of the weights are zero
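Each of the five central tendencies above can be written in a few lines. The sketch below is one possible reading of the definitions, not a reference implementation: the data and weights are hypothetical, and the trimmed mean is assumed to drop the k smallest and k largest values (giving them weight zero).

```python
from collections import Counter

def mode(values):
    """Most frequently observed value (assumes a single most frequent value)."""
    return Counter(values).most_common(1)[0][0]

def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def mean(values):
    """Sum of all data divided by the number of data points (equal weights)."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Average in which the weights are not necessarily equal."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Average after giving the k smallest and k largest values a weight of zero."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 2, 3, 4, 5, 60]             # hypothetical data with one extreme value
print(mode(data))                         # 2
print(median(data))                       # 3
print(mean(data))                         # 11.0
print(trimmed_mean(data, 1))              # 3.2
print(weighted_mean([80, 90], [1, 3]))    # 87.5
```

Note how the single extreme value 60 pulls the mean to 11 while the median and trimmed mean stay near the bulk of the data.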

22

Common Deviations

Range: Maximum minus minimum

IQR (Interquartile Range): Q3 - Q1

Variance: Expected value of the squared differences (errors)

Standard Deviation: Square root of the variance

23

Common Frequency & Proportions

Count: 1, 2, 3, …

Frequency: The number of times a given value occurs (a count)

Relative Frequency: The ratio of the frequency of one value to the sample size

Rate: Part in ratio to the whole, the relative frequency
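As a small illustration of frequency versus relative frequency, the sketch below tallies a made-up list of eye colors. The data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample of n = 8 observed eye colors.
eye_colors = ["blue", "brown", "green", "brown", "blue", "brown", "hazel", "brown"]

frequency = Counter(eye_colors)   # number of times each value occurs (a count)
n = len(eye_colors)               # sample size
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(value, count, relative_frequency[value])
```

The relative frequencies are parts in ratio to the whole, so they sum to 1.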

24

A statistical measure is sensitive if the

computed value changes readily even if a

single observed data value is different. Also, a

statistical method is sensitive if the decision

changes radically based on the assumptions

we made to develop it.

A statistical method is robust if the decision

is not strongly dependent on the assumptions.

Not formally defined in the text
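To make the idea of a sensitive measure concrete, the sketch below changes a single observed value and compares how far the mean and the median move. The numbers are hypothetical, and the example only illustrates sensitivity of a measure to one data value, not robustness of a method to its assumptions.

```python
from statistics import mean, median

original = [10, 12, 13, 15, 20]
altered = [10, 12, 13, 15, 200]   # same data with one observed value changed

mean_shift = mean(altered) - mean(original)        # the mean moves from 14 to 50
median_shift = median(altered) - median(original)  # the median stays at 13

print(mean_shift, median_shift)
```

Under this example the mean is the more sensitive measure: one changed value shifts it by 36, while the median does not move.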

25

The degree of confidence represents the

proportion of times the statistical methodology

used captures the true state of nature.

This value will be denoted (1-α)%, where α is

the level of significance.

26

An event is considered statistically significant if

its occurrence is unlikely to happen by chance.

This value will be denoted by α%.

27

Inferential Statistics involves methods of

analyzing and interpreting descriptive statistics

to draw conclusions regarding a particular

characteristic in the population with a certain

degree of assurance based on a preset level of

significance and specified assumptions.

28

Hypothesis Testing is the use of a statistical

method in arguing for or against a hypothesized

value based on observed information and using

this information to make a decision regarding an

initial hypothesis and an alternative hypothesis.

Not formally defined in chapter 1

29

An experiment is the method (procedure) that

we follow to obtain data or information.

An experiment design is the art of planning

and executing experiments designed to gather

information (data) from the population, N, in

such a way as to ensure the sample, n, is

representative of the population, N.

30

Individuals are the people, places, or things

included in a study and for which information is

gathered. In medical research studies,

individuals are referred to as the subjects in the

study.

31

A variable is a distinct characteristic of an individual to be observed and measured. These observed data can be qualitative or quantitative.

A qualitative (categorical) variable is a variable that describes the individual by placing the individual into a category or group.

A quantitative (numerical) variable is a variable that takes on a real value or numerical measurement for which sums, differences, and ratios have meaning.

32

Regression is a statistical procedure used to

estimate the relationship among variables,

specifically between the primary (response)

variable of interest and all other variables.

33

In a study, the response variable is the primary

variable of interest; that is, the objective in the

given study.

This variable is also referred to as the dependent

variable (although the relationship of

dependence or cause-and-effect is yet to be

determined).

34

The explanatory variables are the extraneous variables that have been measured but are not the primary variable of interest; they are used to understand the behavior of the response (primary) variable. These variables are also referred to as independent variables (although the relationship of dependent/independent has not yet been established).

35

Lurking variable(s) are the unknown variables

that have not been measured; however, they do

contribute to the response (primary) variable and

are not included as an explanatory variable.

36

Correlation is a measure of association between

a response variable and an explanatory variable.

Correlation measures the strength and direction

of a simple linear relationship, that is, a straight

line.
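The slide does not name a specific correlation measure; the sketch below assumes the usual Pearson correlation coefficient and uses made-up data. It shows that a perfect straight line gives a value of +1 or -1, while a strong but curved relationship can give 0, since only the linear part is measured.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a straight-line relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])      # perfect increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])    # perfect decreasing line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared, not a line

print(r_up, r_down, r_curve)
```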

37

Causation is more than a measure of association

between a response variable and an explanatory

variable. It also implies direct cause or

dependence.

Correlation between two events may be a

common response to a lurking variable.

38

Confounding variables are the variables that

have been measured and are significantly

contributing variables; however, their

independent contributions to the subject response

are indistinguishable and are not deemed

significantly contributing in the larger model.

39

Discrete/Continuous

40

An instrument is any means by which

information is gathered or measured such as an

exam, survey, or measuring device such as a barometer,

thermometer, etc.

41

A parameter is a numerical measure that describes the outlined characteristic of the population such as central tendencies (mean, median, mode, and proportion), spread (range, variance, and standard deviation), and shape (symmetric and skewed). In general, when a specific parameter is not specified, the lowercase Greek letter theta (θ) is used to denote a population parameter.

42

Common Parameters

Mean: μ

Variance: σ²

Standard Deviation: σ

Proportions: p

Correlation: ρ

43

A sample survey is a survey of only some of the

individuals in the population and the information

gathered is called the sample data.

The number of individuals included in the

sample survey is called the sample size, n.

The sample data is a subset of the population

data often denoted: x1, x2, …, xn.

44

A statistic is a numerical measure that yields an estimate of a population parameter. That is, a numerical measure that uses the data from the sample to estimate the outlined characteristic of the population. As opposed to Statistics - the study of how to gather, organize, analyze and interpret information.

45

POPULATION IS TO SAMPLE AS CENSUS IS TO SAMPLE SURVEY

POPULATION IS TO SAMPLE AS PARAMETER IS TO STATISTIC

46

Measure involves any standard of comparison, estimation, or judgment; property of an individual given a numerical value; a quantity, a count, a degree, a rate, or a proportion. In terms of data collections, the measured values are referred to as the outcomes or observed values. Two types of measure are discrete and continuous.

47

A discrete measure is such that the set of

possible observed outcomes are separate,

distinct, and finite such as a count.

Discrete measures are such that the outcomes can

be enumerated: one, two, three, etc.

48

Examples of Discrete Measures

Number: Number of children in a family tree. Depending on the number of generations included in the tree, there can be 1, 2, 3, …, but not 1.5; there is nothing between 1 and 2, or 2 and 3, etc.

Count: Count of whole beans. Depending on the number of pods included, there can be 1, 2, 3, …, but not 1.2, since the count is restricted to whole numbers.

Frequency: Frequency of blue-eyed men and green-eyed women.

49

A continuous measure is such that the set of

possible observed outcomes are infinite and

uncountable.

Continuous measures are dense; that is, between

any two values (outcomes) there exists another

value (outcome) such as a mean or rate.

50

Examples of Continuous Measures

Length: Length of a road. It can measure 1 mile or 2 miles, and between these possible measures exist 1.5 miles and 1.24 miles; in fact, between any two values there exist other possible values.

Height: Height of a man. A man can be 5 feet tall or 6 feet tall, and between these potential values exist 5.5 feet, 5.14 feet, etc. While we might not have an instrument precise enough to measure 1/100th of a foot, this measure exists.

Age: Age of a woman. Between this moment and the next, there is a continuous existence. Between 1 yr and 2 yrs, 1.8 yrs exists, etc.

51

Samplings

52

53

Validity refers to the degree of accuracy to which a study reflects the specific concept or characteristic that the analyst or researcher is attempting to measure. Internal validity is the degree to which one can draw valid conclusions about the causal effect between variables. External validity is the degree to which one can extend the findings that are relevant to subjects and settings outside those included in the experimental design.

54

For example, when evaluating a class of 180 students from a single mass lecture of STA 2023, can this information be used to evaluate all students taking STA 2023, given that there is more than one section taught by different instructors?

Internal Validity: drawing conclusions about the specific subjects inside the study.

External Validity: the ability to extend conclusions to subjects outside of the study.

55

56

57

Bias is a consistent deviation of the statistics to

one side of the parameter.

LOW BIAS HIGH BIAS

58

For example, when weighing out coffee to be

ground and brewed at a coffee shop, the

employee forgets to zero-out the scale with the

cup used to measure the coffee. This leads to the

coffee measured in the cup to be off by the

weight of the cup.

Solution: add the weight of the cup in coffee to

each cup.
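The coffee example can be sketched numerically. The values below are hypothetical: a 20 g target, a 4 g cup, and small random errors chosen to sum to zero. Because the cup was not zeroed out, every serving is short by the same amount, which is what makes this a bias rather than random variability.

```python
from statistics import mean

target = 20.0        # grams of coffee intended per serving (hypothetical)
cup_weight = 4.0     # grams; the scale was not zeroed with the cup on it (hypothetical)
random_error = [0.3, -0.2, 0.1, -0.1, -0.1]  # small scatter that averages out to zero

# The scale reads cup + coffee, so each serving contains about cup_weight too little coffee.
coffee_in_cup = [target - cup_weight + e for e in random_error]

bias = mean(coffee_in_cup) - target                 # consistent deviation to one side
corrected = [w + cup_weight for w in coffee_in_cup] # the slide's solution: add the cup's weight in coffee

print(round(bias, 6), round(mean(corrected), 6))
```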

59

60

61

Variability measures the degree of dispersion within a given data set. Some common measures of dispersion include range, mean (average) deviation, standard deviation, variance, inter-quartile range, and mean difference. Variability can appear as gaps in the data when illustrated graphically.

62

Reliable refers to the accuracy and precision of

the actual measuring instrument or procedure.

A reliable measure is a (precise) measurement

such that the random error is small.

63

Valid (Accurate)

Reliable (Precise)

We like samples to

represent the population

and the measures taken to

represent the parameters

estimated. These statistics

need to be a valid measure,

accurately estimating the

parameter with low bias as

well as be reliable,

measured with such

precision as to have low

variability when estimating

the parameter using

statistics.

64

ACCURACY (VALID MEASURE)

HITS THE TARGET’S “BULL’S EYE”

PRECISION (RELIABLE MEASURE) HITS THE SAME LOCATION REPEATEDLY

65

ACCURATE INACCURATE

66

PRECISE IMPRECISE

67

PRECISE IMPRECISE

68

Nominal, Ordinal, Interval, & Ratio

69

Common Levels of Measure

Nominal: Data that consist of names, labels, or categories

Ordinal: Data that can be arranged in order; however, differences between data values cannot be determined or are meaningless

Interval: Data that can be ordered and differences have meaning, but ratios do not (equal distances, but no fixed zero)

Ratio: Ordinal and interval, but ratios have meaning (equal distances and fixed zero)

70

A nominal measure is one that measures a

characteristic of an individual by name only;

information in the form of categorical data where

the order of the categories is not relevant.

Names only – no calculations can be performed.

71

Examples of Nominal Measures

Surnames: Can be made ordinal if considered alphabetically, but otherwise, this is a name only

SSN: There are relations among the digits that make up such numbers, but there is not a true ordering, difference, or ratio

Zip Code: While these codes can be “ordered” numerically, the order is arbitrary and therefore not meaningful. The zip code 33617 is not “less than” the zip code 33620; the only difference is geographical

Gender: Male/Female: these are clearly labels for which there is no order other than “alphabetically”; however, it is meaningless to argue “less than” or “greater than” in general

72

An ordinal measure is one that measures a

characteristic of an individual by the rank order

(1st, 2nd, 3rd, etc.) of the entities measured or by

implied ordering such as worst, bad, good, great.

Ordering the measured outcomes.

73

A simple ranking imposes an order on the

measured characteristic of an individual and the

set of natural numbers by defining a relationship

that establishes the position within a sequence of

outcomes "ranked higher than," "ranked lower

than," or "ranked equal to."

Imposing an ordinal scale.

74

A Likert scale establishes the hierarchy within a

sequence of outcomes.

For example, “how attractive is a person on a

scale from 1 to 10,” 1 meaning not very

attractive to a 10 which represents perfect

attraction.

75

Examples of Ordinal Measures

Ranking: What is the best-selling flavor of ice cream?

Likert Scale: A five-point scale by which to evaluate an instructor: poor, unsatisfactory, satisfactory, good, great

Dress Size & Shoe Size: “Sizes” are inconsistent between designers. A size “0” is smaller than a size “2,” which is smaller than a “4,” but this does not mean the difference between a “2” and a “4” is the same as the difference between a “0” and a “2.” Furthermore, a “4” is not twice as large as a “2”; this ratio has no meaning.

76

An interval measure is one that measures a

characteristic of an individual where differences

between measures have meaning; that is, the

distance between two adjacent units is the same

but there is not a meaningful zero point. An interval

measure is such that sums and averages have

meaning; however, ratios do not have meaning.

Sums (differences) but not ratios.

77

Examples of Interval Measures

Time of Day: If your watch reads 12:05 and mine reads 12:07, then my watch reads a later time than yours; hence the measure is at least ordinal. Moreover, the 2-minute difference has meaning; therefore this measure is interval. It is not ratio since 12:07 in ratio to 12:05 has no meaning.

Temperature: If the daytime temperature is 50°F in New York and 100°F in Miami, then it is 50°F hotter in Miami than it is in New York. While the ratio of 100°F to 50°F is 2, this ratio has no meaning and is therefore an invalid measure. You cannot say 100°F is “twice as hot” as 50°F.

Some may argue that the Kelvin scale, which has an “absolute zero,” is ratio; however, in general, temperature is interval.
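One way to see why the temperature ratio has no meaning is to change units. The sketch below converts the slide's two temperatures to Celsius: the difference survives the conversion (up to the scale factor 9/5), but the ratio does not. The function name is hypothetical.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion between the two interval scales."""
    return (f - 32) * 5 / 9

ny_f, miami_f = 50, 100
ny_c, miami_c = fahrenheit_to_celsius(ny_f), fahrenheit_to_celsius(miami_f)

diff_f = miami_f - ny_f    # 50 Fahrenheit degrees
diff_c = miami_c - ny_c    # the same gap expressed in Celsius degrees
ratio_f = miami_f / ny_f   # 2.0 in Fahrenheit
ratio_c = miami_c / ny_c   # about 3.78 in Celsius: the "ratio" depends on the unit

print(diff_f, round(diff_c * 9 / 5, 6))   # the two differences agree after rescaling
print(ratio_f, round(ratio_c, 2))         # the two ratios do not agree
```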

78

A ratio measure is one that measures a

characteristic of an individual where not only do

differences between measures have meaning, but

ratios also have meaning. That is, a measure in

which any two adjoining values are the same

distance apart and there is a true zero point. Ratio

measures have fixed zeros; that is, an interval

measure with a true zero.

79

Examples of Ratio Measures

Time Past Noon: At 2:00, the measure is 2 hours past noon and at 4:00, the measure is 4 hours past noon. 4 hours is greater than 2 hours; hence at least ordinal. The difference between 4 hours and 2 hours is 2 hours, which has meaning; hence at least interval. Moreover, the ratio of 4 hours to 2 hours is 2, that is, 4 hours is twice as much time as 2 hours; thus this measure is ratio.

Height: If you are 6 feet tall and your child is 3 feet tall, then you are taller than your child (ordinal), you are 3 feet taller than your child (interval), and you are twice as tall as your child (ratio). Therefore, this measure is ratio.

Age: If you are 36 years old and your child is 12 years old, then you are older than your child (ordinal), you are 24 years older than your child (interval), and you are three times as old as your child (ratio). Therefore, this measure is ratio.

80

Changing Level of Measure

Ratio: What is your yearly salary? (a continuous scale)

Interval: What is your income bracket? (a discrete scale) 0-9,999, 10,000-19,999, 20,000-29,999, 30,000-39,999, 40,000-49,999, 50,000-59,999, etc., where the difference between intervals is 10,000

Ordinal: What is your tax bracket? (a discrete scale) 0-9,999, 10,000-39,999, 40,000-59,999, 60,000 or more, where differences are not well-defined

Nominal: In what currency are you paid? Dollar, Yen, Euro, etc. (ordinal if you consider exchange rates)

81

SRS, Systematic, Cluster, Stratified, etc.

82

Samples

Simple Random

Samples

Systematic

Cluster Samples

Stratified

Samples

Convenience

Samples

83

A random sample is a sample of size n taken

from a population of size N in such a way that

each individual observed has an equally likely

chance of being selected.

84

A simple random sample (SRS) is such that

(1) each individual has an equally likely chance

of being selected as well as

(2) all groups of size n have an equally likely

chance of being selected.
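A simple random sample can be sketched with Python's standard `random` module, whose `sample` method draws without replacement so that every group of size n is equally likely. The population of 20 numbered individuals and the seed are hypothetical choices made only so the example is repeatable.

```python
import random

population = list(range(1, 21))   # hypothetical individuals numbered 1..20, so N = 20
n = 5                             # sample size

rng = random.Random(2023)         # seeded generator so the run can be repeated
srs = rng.sample(population, n)   # each group of 5 individuals is equally likely

print(srs)
```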

85

Common Sampling Schemes

Systematic: Using a system to select

Cluster: Using clusters of individuals that are pre-existing

Stratified: Using “clusters” of individuals selected by a specified stratum

Convenience (Volunteer Response): Using individuals who are conveniently surveyed

Multi-stage: More than one stage of sampling done in succession

86

Systematic sampling is a sample such that every

kth individual or item is measured.

Every 3rd: 1, not 2, not 3, 4, not 5, not 6, 7

Every 5th: 1, 6, 11,16,… or 2, 7, 12, 17,…

or 5,10,15,20…. Etc.
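The every-kth patterns listed above correspond to a list slice with a step. In this sketch the 20 numbered individuals and the function name `systematic_sample` are hypothetical; `start` is a 0-based position in the list.

```python
def systematic_sample(items, k, start=0):
    """Take every k-th item, beginning at 0-based position `start`."""
    return items[start::k]

individuals = list(range(1, 21))   # hypothetical individuals numbered 1..20

every_third = systematic_sample(individuals, 3)                  # 1, 4, 7, ...
every_fifth_from_2 = systematic_sample(individuals, 5, start=1)  # 2, 7, 12, 17
every_fifth_from_5 = systematic_sample(individuals, 5, start=4)  # 5, 10, 15, 20

print(every_third)
print(every_fifth_from_2)
print(every_fifth_from_5)
```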

87

Cluster sampling is such that groups are

selected based on pre-existing groupings that are

arbitrary to the individual and not based on any

characteristic of the individual.

In the country, by region

In the state, by zip code

In the state or nation, by area code

For example, in a state, randomly selecting five

counties and surveying 100 individuals from each
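The county example can be sketched as two random draws: first whole clusters, then individuals inside each chosen cluster. Everything here is hypothetical (8 counties of 50 residents, 3 counties chosen, 10 residents surveyed in each) and is scaled down from the slide's numbers to keep the output small.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster, rng):
    """Randomly pick whole clusters, then survey individuals within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}

# Hypothetical state: 8 counties, each with 50 residents identified by a label.
counties = {f"county_{c}": [f"c{c}_r{i}" for i in range(50)] for c in range(8)}

rng = random.Random(1)
surveyed = cluster_sample(counties, n_clusters=3, per_cluster=10, rng=rng)

for county, residents in surveyed.items():
    print(county, len(residents))
```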

88

Stratified sampling is such that individuals are first grouped by a specific characteristic, such as gender, and then samples are taken from each group, or stratum.

Individuals grouped by gender

Individuals grouped by age

Individuals grouped by race

For example, grouping individuals by gender, male/female, then selecting 100 individuals from each group
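In contrast to cluster sampling, stratified sampling groups individuals by one of their own characteristics and then samples from every group. The sketch below uses a made-up list of 40 people and draws 5 from each stratum; the names `stratified_sample`, `people`, and `picked` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, key, per_stratum, rng):
    """Group individuals by a characteristic, then sample from every group."""
    strata = defaultdict(list)
    for person in individuals:
        strata[key(person)].append(person)
    return {label: rng.sample(members, per_stratum) for label, members in strata.items()}

# Hypothetical population of 40 people, half recorded as female and half as male.
people = [{"id": i, "gender": "female" if i % 2 else "male"} for i in range(40)]

rng = random.Random(7)
picked = stratified_sample(people, key=lambda p: p["gender"], per_stratum=5, rng=rng)

for label, members in picked.items():
    print(label, [p["id"] for p in members])
```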

89

Convenience sampling is such that individuals

are selected based upon ease of access. Such

sampling techniques are prone to bias. An

example of convenience sampling is a

volunteer response sample.

Individuals as they passed by

Individuals willing to call in on a talk show

Individuals who agree to take online surveys

90

Multistage sampling is such that more than one

sampling technique is employed in the gathering

of information.

First stratify by gender, then systematically

take every other individual in each group.

First cluster individuals by state, then poll

these regions using mailers which individuals

have the option to fill out at their convenience

91

Too Regular

Implausible Numbers Inconsistencies

Missing Information

Non-Adherers

Non-sampling Error

Hidden Agenda

Hidden Bias

Survey Error

Under-coverage

Incorrect Arithmetic

92

Control, Randomization, Replication, & Enough Information

93

An observational study is an experiment

designed to observe without interference from

the observer in that every effort is made not to

sway the subject response or lead a subject in

their response.

Do not sway individuals!

94

Common Observational Studies

Retrospective studies: Historical data (past)

Cross Sectional: Single point in time (present)

Prospective studies (Longitudinal): Data gathered over an extended period of time (future)

95

An experimental study is an experiment designed to be observed with interference from the observer, in that specific treatments are applied to the individuals in an effort to measure differences in the subject response.

Note: the treatments used in an experiment are intended to sway the outcome of the subject response.

Subject → Treatment → Response (Outcome)

96

A treatment is any condition set forth that is

applied to the individual or subject in an effort to

determine differences among a variety of

treatments as compared to each other or a control

group.

97

A control group is a group created for sake of

comparison. This group can be one of the

treatment groups or a group that receives a false

treatment called a placebo.

Experimental Group: Subject → Treatment → Response (Outcome)

Control Group: Subject → No Treatment (placebo) or Secondary Treatment → Response (Outcome)

98

The placebo effect occurs when a subject

receives a false treatment (such as a sugar pill) or

no treatment, but (incorrectly) believes he or she

is in fact receiving treatment and responds

favorably.

99

In an experimental design, a block is a group of

individuals stratified based on a similar

characteristic and given treatments.

A block design is an experimental design in

which individuals or subjects are grouped into

categories or blocks and then test blocks are

treated as experimental units given different

treatments.

100

A randomized-block design is an experimental

design in which individual subjects are matched

based on a specific variable. The subjects are

then put into blocks of the same size as the

number of treatments and then each block is

assigned to different treatment groups randomly.

101

A (single) blind experiment is an experiment in

which individual subjects do not know the

treatment they receive; however, the researcher is

aware.

A double blind experiment is an experiment in

which neither the individual subjects nor the

researcher is aware of who received what

treatment.

102

Principles of Experimental Design

Control: A comparative or control group

Randomization: Selected at random

Replication: To verify validity and reliability

Enough Information: More important in inferential statistics and not so much in descriptive statistics

103

Stages of Sampling

Population: Define population of concern

Sampling Frame: The set of variables to be measured

Sampling Method: Systematic, Cluster, Stratified, etc.

Sample Size (n): Large enough n (compared to N)

Experimental Design (ED): Implement sampling plan (ED)

Sampling: Action of data collection

104

Medical Trials and Simulations

105

Medical Trials

IRB: Institutional Review Board

IEC: Independent Ethics Committee

ERB: Ethical Review Board

Informed Consent: Requires that the individual (1) be informed and (2) give consent

106

Anonymity is when no personal information is taken; a coding system is in place to allow the subject to get the information regarding a survey without giving out any personal information. That is, the information is not personally identifiable.

Confidentiality is when personal information is given, but not shared. Only the statistical summaries are made available to other organizations or persons involved in the study.

107

Informed consent is when the individual

person is both informed of the ramifications

involved in the study and gives consent to

participate in the knowledge of such things as

side effects.

108

Simulation is the imitation of a natural

process using general characteristics or

behaviors in an effort to mimic or model the

natural system.

“A simulation is only as good as the underlying

analytical model”

CPT

Can be used to verify statistical methods.

109

Examples of Simulation

Use a fair die to simulate the tossing of a fair coin.

ONE POSSIBILITY: Let evens represent a head and odds represent a tail. Hence the sequence 1,5,4,6,5 would represent T,T,H,H,T
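The even/odd rule can be written as a one-line mapping. The sketch below first reproduces the slide's sequence and then simulates ten further rolls; the seed and the function name `die_to_coin` are hypothetical.

```python
import random

def die_to_coin(roll):
    """Even faces stand for a head, odd faces for a tail (three faces each)."""
    return "H" if roll % 2 == 0 else "T"

slide_rolls = [1, 5, 4, 6, 5]
slide_tosses = [die_to_coin(r) for r in slide_rolls]
print(slide_tosses)   # ['T', 'T', 'H', 'H', 'T']

rng = random.Random(0)                          # seeded so the run can be repeated
rolls = [rng.randint(1, 6) for _ in range(10)]  # ten rolls of a fair die
tosses = [die_to_coin(r) for r in rolls]
print(rolls, tosses)
```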

110

Random digit chart is the table of digits

selected at random and placed in a table in

Appendix B which can be used to simulate or

sample data.

07892632401926795457

111

Examples of Simulation using Random Digits

A man has a 60% chance of having a boy and a 40% chance of having a girl. Use the random digit chart to simulate the birth order of three children.

Random digits: 07892632401926795457

Let 0-5 represent a boy and 6-9 represent a girl; hence, the sequence of random numbers 078 would simulate the sequence of children: boy, girl, girl.
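The digit rule works because six of the ten digits (0-5) map to a boy and four (6-9) map to a girl, matching the 60%/40% split. The sketch below applies the rule to the digits from the chart, three at a time; reading the digits in consecutive groups of three is an assumption about how the chart would be used.

```python
RANDOM_DIGITS = "07892632401926795457"   # digits from the random digit chart

def digit_to_child(digit):
    """Digits 0-5 (six of ten, 60%) are a boy; digits 6-9 (four of ten, 40%) are a girl."""
    return "boy" if int(digit) <= 5 else "girl"

# First three digits, as on the slide: 0, 7, 8.
first_family = [digit_to_child(d) for d in RANDOM_DIGITS[:3]]
print(first_family)   # ['boy', 'girl', 'girl']

# Assumed continuation: keep reading the chart in consecutive groups of three digits.
families = [
    [digit_to_child(d) for d in RANDOM_DIGITS[i:i + 3]]
    for i in range(0, 18, 3)
]
for family in families:
    print(family)
```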

112

Randomization or random charts can be used to sample or re-sample the data.

For example, if there are 100 data points available and we only need 30, then we can randomly select this sample by enumerating the data and using the random chart to select the required number with or without replacement. With replacement, we can resample 200 times even though there are only half this many data points to start – this technique is called bootstrapping.
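Both cases in this example, 30 of 100 without replacement and 200 resamples with replacement, can be sketched with the standard `random` module. The 100 data points are simulated here purely as a stand-in, and the final step (recomputing the mean on many resamples) is one common use of bootstrapping rather than something stated on the slide.

```python
import random
from statistics import mean

rng = random.Random(42)
data = [rng.gauss(50, 10) for _ in range(100)]   # hypothetical stand-in for 100 data points

subsample = rng.sample(data, 30)      # without replacement: 30 distinct points out of 100
resample = rng.choices(data, k=200)   # with replacement: 200 draws from only 100 points

# Assumed use: repeat the resampling and recompute a statistic each time.
boot_means = [mean(rng.choices(data, k=len(data))) for _ in range(500)]

print(len(subsample), len(resample), len(boot_means))
print(round(mean(data), 2), round(mean(boot_means), 2))
```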

113

Examples of Sampling using Random Digits

A committee of four is to be selected from a group of ten individuals: A, B, C, D, E, F, G, H, I, and J. Using the random set of digits 07892632401926795457, generate a random committee. Explain.

Let 0 represent A, 1 B, 2 C, 3 D, 4 E, 5 F, 6 G, 7 H, 8 I, and 9 J. Using the random set of digits 9263, generate a random committee as follows:

9 → J, 2 → C, 6 → G, 3 → D
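The digit-to-letter rule can be sketched as a lookup into the string of member labels. One detail is an assumption not spelled out on the slide: since a person cannot sit on the committee twice, a repeated digit is skipped and the next digit is read.

```python
RANDOM_DIGITS = "07892632401926795457"   # digits from the random digit chart
MEMBERS = "ABCDEFGHIJ"                   # 0 -> A, 1 -> B, ..., 9 -> J

def pick_committee(digits, size):
    """Read digits left to right, skipping repeats, until the committee is full."""
    committee = []
    for d in digits:
        person = MEMBERS[int(d)]
        if person not in committee:      # assumed rule: no one can be selected twice
            committee.append(person)
        if len(committee) == size:
            break
    return committee

print(pick_committee("9263", 4))          # the slide's digits: ['J', 'C', 'G', 'D']
print(pick_committee(RANDOM_DIGITS, 4))   # reading the chart from its first digit
```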

114

115

Descriptive Statistics vs.

Inferential Statistics

Population vs. Sample

N vs. n

Census vs. Sample Survey

Representative Samples

Sampling Techniques

Simulations

Re-sampling

Statistical

Perspective

Biologists have

microscopes

Physicists have

telescopes

Statisticians have

kaleidoscopes

116
