STA6166-1-1 Introduction and Data Gathering (Chapters 1 – 2) At the end of this lecture, the student should: Be able to provide a definition of Statistics

STA6166-1-1

Introduction and Data Gathering (Chapters 1 – 2)

At the end of this lecture, the student should:

• Be able to provide a definition of Statistics.
• Discuss the role of statistics in research.
• Be able to state reasons for using statistics.
• Identify the difference between observational and experimental studies.
• Be able to organize data into a two-dimensional matrix or array.

I hear and I forget. I see and I understand. I do and I remember. (Chinese proverb)

STA6166-1-2

• Keep technical demands low.
• Emphasize examples and context more than mathematical derivations.
• Incorporate active learning exercises.
• Concentrate on a small set of basic but easily generalizable ideas.

In mathematics, context obscures structure.
In data analysis, context provides meaning.

General Course Approach

STA6166-1-3

Applied statistics lends itself naturally to discussion. To have time for discussion, the student needs to get most of the material from the readings in the book.

Learning statistics is a lot like learning a foreign language.

Discussion Versus Readings

As is typical of graduate level courses, discussion with classmates is encouraged. However, when it comes to assignments, each student should do his/her own work!

STA6166-1-4

What Is Statistics?

• Summary measures, such as totals, averages or percentages of measurements, counts or ranks.

• A set of methods for obtaining, organizing, summarizing, presenting and analyzing numerical facts in order to help make wise decisions in the face of uncertainty.

• An area of science concerned with the extraction of information from numerical data and its use in making inferences about a population from which the data are obtained.

STA6166-1-5

• The pursuit of systematic interrelation of facts by logical arguments from accepted postulates, observation, and experimentation and a combination of these three in varying proportions.

Scientific Method

• Aid in creating the “best” research design with which to generate new data.

• Extract the information from the noise or variability at the data analysis step.

Roles of Statistics

STA6166-1-6

Logical Arguments

• Deductive argument: Conclusion follows with logical necessity or certainty from the premises. Nothing new is revealed because we are arguing from the general to the specific.

• Specialization: Moving from a large set of objects, postulates, or events, to consideration of a smaller set of objects or events.

• Inductive argument: The premises provide some evidence for the truth of the conclusions. Discovering general laws by the observation and combination of particular instances.

• Generalization: Passing from the consideration of one object, postulate, or occurrence, to the consideration of a set of objects, postulates, or occurrences.

• Analogy: Consideration of the kind and amount of agreement among different objects or events.

In statistics we attempt to formalize and use these concepts in a quantitative way.

STA6166-1-7

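[Figure: diagram linking “Hypothesis, Model, Conjecture” and “Data, Measurements” with arrows labeled “Inductive Argument” and “Deductive Argument”, leading on to “New Hypothesis, New Model” and “New Data”. The overall direction is labeled “Progress and Understanding”.]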

Scientific Progress

We gain knowledge by iterating between models and data.

STA6166-1-8

Scientific Thinking

[Figure: diagram with arrows labeled “Inductive Argument” and “Deductive Argument” connecting Literature Data, Research Hypothesis, Sample Data (Observation), Important Factors, and Experimental Data (Experiment).]

A typical PhD research project iterates as follows:
1. Questions are asked and hypotheses formulated based on existing knowledge.
2. An observational study is performed to examine the validity of the hypothesis. (knowledge created)
3. Using the observational data, we identify factors that are hypothesized as “driving” the process under study.
4. An experiment is designed to test the importance of these factors and possibly shed light on mechanisms. (knowledge created)

STA6166-1-9

• State the problem. What are the questions?
• Devise a plan of solution. What will I do?
• Implement the plan. How do I do it?
• Analysis of data. What happened?
• Interpretation of results. What does this mean?
• Reexamination. Is my logic correct? What next?

Basic Study Steps

Study design and study implementation may require iteration.

STA6166-1-10

Graphical Depiction of Scientific Study

[Figure: flow diagram of a scientific study. Elements: Knowledge Base; Problem; Objectives & Hypotheses; DESIGN; Sample; Experiment; How to measure?; Constraints; DATA; STATISTICAL ANALYSIS; Interpretation; Conclusions.]

STATISTICAL ANALYSIS
• Graphics & Visualization
• Modeling
• Estimates and Confidence Intervals
• Formal Statistical Tests

STA6166-1-11

• Census (Complete Enumeration): Every individual in the population of interest is observed.

• Sampling Studies (Mensurative Experiments or Surveys): Populations to be compared are defined, and individuals are selected from these populations for measurement. All members of the populations have a positive probability of selection for inclusion in the study.

• Experimental Studies (Manipulative Experiments): Individuals in one or more populations are carefully chosen or created to test specific manipulations under highly controlled conditions.

Research Design Categories

STA6166-1-12

• What are the populations of interest?
• How will individuals be selected for measurement?
• What will be measured?
• Which analyses will be performed?
• How many individuals are needed?
• How large an effect will be considered important?
• Are available resources adequate for this study?

Many of these questions are answered by subject matter experts; some can be answered by a statistical analysis.

Sampling Study Design

STA6166-1-13

Sampling Study (Mensurative Experiment)

[Figure: Population 1 and Population 2 each yield a sample (Sample 1 with n observations, Sample 2 with m observations). The step from population to sample is annotated “How Selected?”. Each sample's data form a matrix with one row per observation (observation number, sample number, then the measured characteristics x x x ...), annotated “What is measured?”.]

STA6166-1-14

• Individually identified (the “sample unit”).
• Randomly chosen (no biases introduced in selection). Each possible set of individuals has the same probability of selection (Simple Random Sampling).

Special situations allow for increased efficacy of selection:
• Stratification (account for an extraneous factor)
• Clusters (select natural groups of sample units)
• Multi-stage (select large units then parts of units)
• Systematic (set pattern)

How are individuals selected?

STA6166-1-15

A researcher wishes to determine the prevalence of a disease in a greenhouse of tomato seedlings. Each seedling tested for the disease is destroyed in the process, hence only a minimal number should be tested. Expectations are that only about 0.01% of the roughly 50,000 seedlings in the greenhouse have the disease.

How to select a simple random sample?

1. Number each pot. Use a random number table (or a spreadsheet random number generator) to produce a list of numbers in random order, from 1 to the total number of pots. Measure the plants in the pots whose numbers are selected (difficult).

2. Align the pots in rows and columns. Use a random number table to select a list of (row, column) pairs. Measure the plant in the pot at each selected (row, column) position (easier).

Table 2 in Ott and Longnecker.

Simple Random Sampling
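The two selection methods above can be sketched in Python, with the standard `random` module standing in for the random number table. The sample size of 20, the 200 x 250 grid, and the seed are illustrative assumptions, not values from the lecture.

```python
import random

def sample_pot_numbers(total_pots, n, seed=None):
    """Method 1: draw n distinct pot numbers from 1..total_pots."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pots + 1), n))

def sample_grid_positions(n_rows, n_cols, n, seed=None):
    """Method 2: draw n distinct (row, column) pairs from the grid."""
    rng = random.Random(seed)
    flat = rng.sample(range(n_rows * n_cols), n)
    # Convert each flat index into a 1-based (row, column) pair.
    return sorted((k // n_cols + 1, k % n_cols + 1) for k in flat)

# Assumed layout: 200 rows x 250 columns = 50,000 pots; assumed n = 20.
pots = sample_pot_numbers(50_000, 20, seed=1)
cells = sample_grid_positions(200, 250, 20, seed=1)
```

Both functions sample without replacement, so no pot can be selected twice.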

STA6166-1-16

Textbook definition: A simple random sample of n units is defined such that each possible sample of size n is equally likely to be drawn.

Practical definition: This sampling principle assures that each unit in the population has the same probability (likelihood) of being selected in the sample.

Simple Random Sample
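To see how the two definitions connect, the following sketch enumerates every possible sample for a toy population and checks that equal sample probabilities imply equal inclusion probabilities of n/N. The population size N = 5 and sample size n = 2 are assumptions chosen only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import combinations

N, n = 5, 2                      # assumed toy population and sample size
units = range(1, N + 1)

# Textbook definition: every possible sample of size n is equally likely.
samples = list(combinations(units, n))
p_sample = Fraction(1, len(samples))

# Practical consequence: add up the probability of the samples containing
# each unit to get that unit's chance of being selected.
inclusion = {u: sum(p_sample for s in samples if u in s) for u in units}
```

Here every entry of `inclusion` equals `Fraction(n, N)`, which is the "same probability for each unit" property stated above.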

STA6166-1-17

Stratification

Pine forest: Estimate expected yield from plot.

[Figure: pine forest plot divided into three strata: 22 years, healthy; 16 years, healthy; 20 years, diseased.]

Individuals are selected at random within each stratum. Variability in the diseased subpopulation is expected to be much greater than in the healthy areas. Mean yield is greater at 22 years than at 16 years.

Stratification allows us to take into account a factor we already know affects the response of interest, that is, to “remove a source of known variability”.
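A minimal sketch of stratified selection for the pine forest example follows. The tree ID ranges and the per-stratum sample sizes are hypothetical; drawing more trees from the diseased stratum is one common response to its larger expected variability, not something the slide prescribes.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample separately within each stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(units, n_per_stratum[name]))
        for name, units in strata.items()
    }

# Hypothetical tree IDs in each stratum of the plot.
strata = {
    "22 years, healthy": list(range(1, 401)),
    "16 years, healthy": list(range(401, 701)),
    "20 years, diseased": list(range(701, 1001)),
}
# Assumed allocation: more trees where variability is expected to be larger.
n_per_stratum = {
    "22 years, healthy": 10,
    "16 years, healthy": 10,
    "20 years, diseased": 30,
}
chosen = stratified_sample(strata, n_per_stratum, seed=2)
```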

STA6166-1-18

Clusters

Estimate the average sponge size on natural reefs.

[Figure: map of reefs (REEF), each labeled with the number of sponges on the reef. Labels as extracted: 259, 14, 5, 21, 7, 12.]

Selecting sponges at random would be very resource inefficient. It is cheaper to select reefs (sponge clusters) at random with probability proportional to size. All sponges on the selected reefs are then measured (a cheap thing to do that increases the sample size easily).
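Selection with probability proportional to size can be sketched with `random.choices`, which draws with replacement using the given weights. The reef names and sponge counts below are hypothetical, and sampling with replacement is an assumption made to keep the sketch short.

```python
import random

def select_reefs_pps(sponge_counts, k, seed=None):
    """Draw k reefs with probability proportional to the number of sponges.

    Draws are made with replacement, so a large reef may be picked twice.
    """
    rng = random.Random(seed)
    reefs = list(sponge_counts)
    weights = [sponge_counts[r] for r in reefs]
    return rng.choices(reefs, weights=weights, k=k)

# Hypothetical reefs and sponge counts (not taken from the figure).
sponge_counts = {"A": 25, "B": 9, "C": 14, "D": 5, "E": 21, "F": 7, "G": 12}
draws = select_reefs_pps(sponge_counts, k=3, seed=7)
```

Under this with-replacement scheme, averaging the mean sponge size of each drawn reef gives the usual (Hansen-Hurwitz) estimate of the overall mean sponge size.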

STA6166-1-19

[Figure: three panels: a. Random Selection; b. Systematic Selection; c. Multi-Stage Selection. Labels in the figure: first-stage unit, second-stage unit, measurement units, randomly located grid, random starting point.]

Typically large areas or large complex populations can be more effectively sampled in stages. At the first stage, natural or synthetic clusters are selected. At subsequent stages the selected clusters are subdivided into units and samples of these are selected.

Example: National crop yield survey.

Multi-Stage Sampling

STA6166-1-20

Greenhouse Example

Stratification: Maybe we have observed that plants near the door seem less healthy than those farther into the greenhouse. Divide the room into plants near the door and plants “inside”. Take random samples from each stratum.

Cluster: Suppose plants are arranged on tables. We could select tables at random then examine all plants on each table selected. Note that if one plant on a table is diseased, all plants on table have an increased probability of also being diseased.

Multi-Stage: Again suppose plants are on tables. Select some tables at random. Next select a few plants from each selected table for testing. First stage unit is the table. Second stage unit is the plant. Third stage unit could be the leaf on the plant, etc.

Systematic: Imagine plants arranged on a large table. Randomly pick a row and column to start. Then, following a systematic route, pick, say, every 10th plant.
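The multi-stage and systematic schemes in the greenhouse example can be sketched as follows. The table and plant labels, the counts, and the step of 10 are illustrative assumptions. The systematic version uses a random start within the first `step` positions along the route, a common variant of the "random row and column" start described above.

```python
import random

def two_stage_sample(plants_by_table, n_tables, n_plants, seed=None):
    """Stage 1: pick tables at random. Stage 2: pick plants on each table."""
    rng = random.Random(seed)
    tables = rng.sample(sorted(plants_by_table), n_tables)
    return {
        t: sorted(rng.sample(plants_by_table[t], n_plants))
        for t in sorted(tables)
    }

def systematic_sample(n_units, step, seed=None):
    """Pick a random start, then every step-th unit along a fixed route."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return list(range(start, n_units, step))

# Hypothetical greenhouse: 25 tables with 40 plants each.
plants_by_table = {
    t: [f"T{t:02d}-P{p:02d}" for p in range(1, 41)] for t in range(1, 26)
}
chosen = two_stage_sample(plants_by_table, n_tables=3, n_plants=4, seed=5)
route = systematic_sample(n_units=1000, step=10, seed=5)
```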

STA6166-1-21

Variable: Apt or liable to vary or change from individual to individual, capable of being varied or changed (factor), alterable, inconsistent, having much variation or diversity, a quantity that may assume any given value from a set of values (the variable’s range).

What is measured?

Examples:
• Plant biomass – varies from plant to plant.
• Blood arsenic level – varies from person to person.
• Gender – we are not all male or all female.

Opposite of variable - Constant

STA6166-1-22

Types of Variables

• Categorical, classification, or qualitative variable: Discrete; essentially describes some characteristic of a sample unit. For example: color, gender, age class, health status, treatment group. (Further subdivided into nominal or ordinal.)

• Quantitative or amount variable: Either discrete or continuous; measures the amount or level of a characteristic of a sample unit. For example: age, weight, height, temperature, biomass, volume. (Further subdivided into interval or ratio.)

In STA 6166-7, we will deal primarily with quantitative variables. STA 6126-7 deals primarily with categorical variables.

STA6166-1-23

Sampling Study Design Questions

• How is the response (effect) to be measured?

• What characteristics of the response are to be analyzed?

• What factors influence the characteristics to be analyzed?

• Which of these factors will be studied in this investigation?

• How many times should the basic experiment be performed?

• What should be the form of the analysis?

• How large an effect (effect size) will be considered important?

• What resources are available for this study? Are they adequate?

It is important to be able to define the underlined words.

STA6166-1-24

Terminology

• The response typically refers to the measured variable(s) of primary interest (e.g. weight, health status, growth, etc).

• Characteristics – Is it change in the average response, the spread of responses, the maximum response, etc, that will be examined? These characteristics typically refer to some “statistical” aspect of effects measured among individuals in the populations being studied.

• A factor refers to the characteristic(s) that primarily differ among the populations being studied (compared). Some factors we cannot manipulate (e.g. descriptors such as gender, geographic location, genetic makeup). Other factors identify characteristics we have caused to be different between the two populations (as in an experiment where we manipulate the populations by giving them different “treatments”).

• Basic Experiment – The selecting of an individual for measurement. In a sampling study, the basic experiment is the selection and measurement of an individual from the population. In an Experimental Study, the basic experiment is the selection of an individual from the “pool”, the application of a treatment, and the measurement of responses.

STA6166-1-25

Terminology (Cont)

• By the form of the analysis, we refer to the statistical procedure(s) that match the characteristics of the study design, the characteristics of the responses measured and the estimates and hypothesis tests needed to answer the questions of interest. So, when someone asks “What form will your analysis take?” you might answer with something like “I will be using regression analysis (the statistical method) to explore associations between fat intake and cholesterol level (the hypotheses of interest) between two populations identified geographically and by gender (study design factors).”

• The size of the effect of interest refers to how big a difference there must be before I (or others) would conclude that there is a “real” difference. Typically we are interested in specifying this at the design phase of a study, since the size of the effect of interest drives the sample size question. Thus if you say a difference of less than 2 points in cholesterol level between gender groups would not be important but anything greater than 2 would be, you could use this to set the study sample size. If the difference of interest were raised to 10 points, a much smaller sample size would be needed.

• Resources – Money, personnel, time, access, material.
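The link between effect size and sample size can be illustrated with the standard normal-approximation formula for comparing two group means, n per group = 2((z_{1-α/2} + z_{power})σ/δ)². This is a sketch only: the standard deviation of 10 points, the 5% two-sided significance level, and the 80% power are assumptions, not values given in the lecture.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a mean difference delta.

    Uses the normal approximation for a two-sided, two-sample comparison
    with a common standard deviation sigma (assumed known).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

sigma = 10.0                                   # assumed SD of cholesterol level
n_small_effect = n_per_group(2.0, sigma)       # 2-point difference
n_large_effect = n_per_group(10.0, sigma)      # 10-point difference
```

Because the required size scales with 1/δ², raising the difference of interest from 2 to 10 points cuts the needed sample by a factor of roughly 25.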

STA6166-1-26

Manipulation Experiment

• Manipulation Experiment: A research design in which the researcher deliberately introduces certain changes in the levels of factors that are hypothesized as affecting the process of interest, and then makes observations to determine the effect of these changes.

• Experimental Design: A study plan which assures that measurements will be relevant to the problem under study.

• Treatments: Changes to those factors which are suspected of affecting the process under study.

STA6166-1-27

FACTORS: Nitrogen Level, Phosphorus Level

LEVELS: 0, 10, 20 kg/ha for one factor; 0, 10 kg/ha for the other

TREATMENTS (three-level factor / two-level factor, in kg/ha): 0 / 0, 0 / 10, 10 / 0, 10 / 10, 20 / 0, 20 / 10

BLOCKED LAYOUT (complete block - all treatments in each block): SITE 1 (block 1) and SITE 2 (block 2) each contain six plots, one for each treatment.

EXPERIMENTAL UNIT: the plot.

Factorial Experiment
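The treatment list of a factorial experiment is the cross product of the factor levels, which the sketch below builds with `itertools.product`. Randomizing the order of treatments within each block is standard practice but is an assumption here, as the slide only shows the complete-block layout. The names `levels_a` and `levels_b` are placeholders because the slide residue does not show which factor has three levels.

```python
import random
from itertools import product

levels_a = [0, 10, 20]   # kg/ha, the three-level factor
levels_b = [0, 10]       # kg/ha, the two-level factor

# Every combination of levels is one treatment: 3 x 2 = 6 treatments.
treatments = list(product(levels_a, levels_b))

def randomized_complete_blocks(treatments, blocks, seed=None):
    """Assign every treatment once per block, in random plot order."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)
        layout[block] = order
    return layout

layout = randomized_complete_blocks(treatments, ["SITE 1", "SITE 2"], seed=3)
```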

STA6166-1-28

             |------------ CATEGORIES ------------|  |------ AMOUNTS ------|
Observation  strata  gender  color  Other            weight  Other
Number                              categorical              quantitative
                                    variable                 variable
1            1       F       RED    x x ...          10.2    x x ...
2            1       F       WHITE  x x ...          12.9    x x ...
3            1       M       BLUE   x x ...          20.1    x x ...
.            .       .       .      .                .       .
n            1       F       BLUE   x x ...          16.0    x x ...

Standard Form for a Data Set
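In code, the standard form is one record per observation and one field per variable. The sketch below uses the first three rows shown above; storing them as CSV text and the column names `obs` and `stratum` are assumptions made for illustration.

```python
import csv
import io

# First three observations from the slide, one row per observation.
raw = """obs,stratum,gender,color,weight
1,1,F,RED,10.2
2,1,F,WHITE,12.9
3,1,M,BLUE,20.1
"""

CATEGORICAL = ("stratum", "gender", "color")   # kept as labels
QUANTITATIVE = ("weight",)                     # converted to numbers

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    row = {"obs": int(rec["obs"])}
    row.update({name: rec[name] for name in CATEGORICAL})
    row.update({name: float(rec[name]) for name in QUANTITATIVE})
    rows.append(row)

# Arithmetic is meaningful only for the quantitative columns.
mean_weight = sum(r["weight"] for r in rows) / len(rows)
```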

STA6166-1-29

OBS ITEMP  IRH  IWB    FWB    REP  BIRD  BN  IBT   ATBT  WEIGHT  SATBT     SITEMP    SIWB
1   24.47  64   20.2   20.25  1    1     1   40.6  39.7  2.21    -1.24351  -1.28723  -1.27434
2   24.47  64   20.2   20.25  1    2     2   40.6  40.2  2.265   -0.69343  -1.28723  -1.27434
3   24.47  64   20.2   20.25  1    3     3   40.9  39.4  2.185   -1.57355  -1.28723  -1.27434
4   24.45  50   18.55  18.6   2    1     4   40.3  40.1  2.275   -0.80345  -1.29196  -1.67386
5   24.45  50   18.55  18.6   2    2     5   40.4  39.4  2.264   -1.57355  -1.29196  -1.67386
6   24.45  50   18.55  18.6   2    3     6   40.1  39.2  2.205   -1.79358  -1.29196  -1.67386
7   24.68  50   18.45  19.52  3    1     7   41.1  40.5  2.343   -0.36338  -1.23754  -1.69807
8   24.68  50   18.45  19.52  3    2     8   41.2  40.8  2.193   -0.03334  -1.23754  -1.69807
9   24.68  50   18.45  19.52  3    3     9   40.9  40.9  2.238   0.07668   -1.23754  -1.69807
10  24.79  51   18.57  18.2   4    1     10  39.8  39.4  2.32    -1.57355  -1.21151  -1.66902
11  24.79  51   18.57  18.2   4    2     11  39.6  39.4  2.298   -1.57355  -1.21151  -1.66902
12  24.79  51   18.57  18.2   4    3     12  39.8  39.8  2.31    -1.13349  -1.21151  -1.66902
13  25.03  74   21.6   21.8   1    1     13  39.8  38.9  2.212   -2.12363  -1.15472  -0.93536
14  25.03  74   21.6   21.8   1    2     14  39.8  38.7  2.21    -2.34366  -1.15472  -0.93536
15  25.03  74   21.6   21.8   1    3     15  39.4  39.4  2.198   -1.57355  -1.15472  -0.93536
16  24.44  74   21.22  21.5   2    1     16  40.1  39.6  2.235   -1.35352  -1.29433  -1.02737
17  24.44  74   21.22  21.5   2    2     17  40.1  39.8  2.257   -1.13349  -1.29433  -1.02737
18  24.44  74   21.22  21.5   2    3     18  40    39.6  2.284   -1.35352  -1.29433  -1.02737
19  24.43  73   21.2   21.76  3    1     19  39.4  39.9  2.33    -1.02348  -1.29669  -1.03221
20  24.43  73   21.2   21.76  3    2     20  39.8  40.2  2.314   -0.69343  -1.29669  -1.03221
21  24.43  73   21.2   21.76  3    3     21  39.5  39.2  2.295   -1.79358  -1.29669  -1.03221
22  25.24  78   21.91  22.06  4    1     22  .     .     2.149   .         -1.10503  -0.8603
23  25.24  78   21.91  22.06  4    2     23  .     .     2.12    .         -1.10503  -0.8603
24  25.24  78   21.91  22.06  4    3     24  .     .     2.127   .         -1.10503  -0.8603
25  25.35  89   23.78  24.01  1    1     25  .     .     2.213   .         -1.079    -0.40751
26  25.35  89   23.78  24.01  1    2     26  .     .     2.216   .         -1.079    -0.40751
27  25.35  89   23.78  24.01  1    3     27  .     .     2.36    .         -1.079    -0.40751

. = indicator of missing data

Example Data Set in Spreadsheet Format
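When such a data set is read by a program, the missing-data indicator has to be handled explicitly. The sketch below parses two rows of the example (observations 21 and 22) and maps "." to `None`; the whitespace-delimited layout and the helper name `parse_value` are assumptions for illustration.

```python
def parse_value(token):
    """Convert one field to a number, treating '.' as missing."""
    return None if token == "." else float(token)

header = ("OBS ITEMP IRH IWB FWB REP BIRD BN IBT ATBT "
          "WEIGHT SATBT SITEMP SIWB").split()

lines = [
    "21 24.43 73 21.2 21.76 3 3 21 39.5 39.2 2.295 -1.79358 -1.29669 -1.03221",
    "22 25.24 78 21.91 22.06 4 1 22 . . 2.149 . -1.10503 -0.8603",
]

records = [dict(zip(header, map(parse_value, line.split()))) for line in lines]

# Summaries should skip missing values rather than treat them as zero.
ibt_values = [r["IBT"] for r in records if r["IBT"] is not None]
```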

STA6166-1-30

The more ambitious the plan, the more chances of success, and the more opportunity for failure.

How does one decide on what to do?

Are there open questions?
Are there available resources?
Does someone really want the answer?
Can a study be done?
Will the study be able to answer the question?

Statistics may help answer the last question!

Inventor's Paradox
