Statistical Thinking VIB Course Notes Nov 2011

Luc Wouters November 2011

STATISTICAL THINKING AND REASONING IN

EXPERIMENTAL RESEARCH

Statistical thinking will one day be as necessary for efficient citizenship as the

ability to read and write!

S. Wilks (1951) after H.G. Wells (1903; 1938).

TABLE OF CONTENTS

1 Introduction

2 The Architecture of Experimental Research

2.1 The Controlled Experiment

2.2 Some Terminology

2.3 Scientific Research as a Phased Process

2.4 Scientific Research as an Iterative, Dynamic Process

3 Smart Research Design by Statistical Thinking

3.1 Research Styles – The Smart Researcher

3.2 Principles of Statistical Thinking

4 Planning the experiment

4.1 Types of Experiments

4.2 The Objective of the Study and the Research Hypothesis

4.3 The Pilot Experiment

5 Principles of Statistical Design

5.1 The Structure of the Response Variable

5.2 Defining the experimental unit

5.3 Variation is Omnipresent

5.4 Balancing Internal and External Validity

5.5 Bias and Variability

5.6 Requirements for a Good Experiment

5.7 Strategies for Minimizing Bias and Maximizing Signal‐to‐noise Ratio

5.7.1 Strategies for Minimizing Bias – Good Experimental Practice

5.7.2 Strategies for Controlling Variability – Good Experimental Design

5.8 Simplicity of Design

5.9 The Calculation of Uncertainty

6 Common Designs in Biological Experimentation

6.1 The Completely Randomized Design

6.2 The Factorial Design

6.3 The Randomized Complete Block Design

6.4 The Latin Square Design

6.5 Split Plot Designs

6.6 Repeated Measures designs

7 The Required Number of Replicates – Sample Size

7.1 Determining Sample Size is a Risk – Cost Assessment

7.2 The Context of Biomedical Experiments

7.3 The Hypothesis Testing Context – The Population Model

7.4 Sample Size Calculations

7.4.1 Power Analysis Computations

7.4.2 Mead’s Resource Requirement Equation

7.4.3 Multiplicity and Sample Size

8 The Statistical Analysis

8.1 The Statistical Triangle

8.2 The Statistical Model Revisited

8.3 Types of data

8.4 Verifying the Statistical Assumptions

8.5 Multiplicity

9 The Study Protocol

10 Interpretation and Reporting

10.1 The Methods Section

10.1.1 Experimental Design

10.1.2 Statistical Methods

10.2 The Results Section

10.2.1 Summarizing the Data

10.2.2 Graphical Displays

10.2.3 Interpreting and Reporting Significance tests

11 Concluding Remarks

12 Recommended Reading

References and Selected Bibliography

Appendix Tools for Randomization in MS Excel and R

Completely Randomized Design

Randomized Complete Block Design

1 INTRODUCTION

Many biomedical studies are conducted with little or no thought about the statistical aspects of the

study design. As a consequence, these studies are often seriously flawed and are not capable of

meeting their intended purpose. In some cases the studies are designed too small to enable an

answer to the research question. Conversely, some studies use too many experimental units so that

valuable resources are wasted. This lack of quality was also demonstrated in a recent review carried

out by Kilkenny et al. (2009) who surveyed 271 papers reporting laboratory animal experiments.

They found that most of the studies had flaws in their design and almost one third of the papers that

used statistical methods did not describe them or present their results adequately. There is also a

genuine concern about the poor replicability (lack of confirmation) of research findings, and it has been

argued that most research findings could be false (Ioannidis, 2005).

A way to circumvent this lack of quality is by changing the way scientists look at the research

process. This can be accomplished by introducing statistical thinking as an informed skill that

enhances the quality of the research data (Vandenbroeck et al., 2006).

Statistical thinking and reasoning are two powerful skills based on the fundamentals of statistics.

While the science of statistics is mostly involved with the complexities and techniques of statistical

analysis, statistical thinking and reasoning are generalist skills that focus on the application of

nontechnical concepts and principles. There are no clear, generally accepted definitions of statistical

thinking and reasoning. In our conceptualization we consider statistical thinking as a skill that helps

to better understand how statistical methods can contribute to finding answers to specific research

problems and what the implications are in terms of data collection, experimental setup, data

analysis, and reporting. Statistical thinking will provide us with a generic methodology to design

insightful experiments. On the other hand, we will consider statistical reasoning as being more

concerned with the interpretation of statistical analyses. Of course, as is apparent from the above,

there is a large overlap between the concepts of statistical thinking and reasoning.

2 THE ARCHITECTURE OF EXPERIMENTAL RESEARCH

2.1 THE CONTROLLED EXPERIMENT

There are two basic approaches to implement a scientific research project. One approach is to

conduct an observational (also called correlational) study in which we investigate the effect of

naturally occurring variation and the assignment of treatment groups is outside the control of the

investigator. Although there are often good and valid reasons for conducting an observational study,

its main drawback is that the presence of confounding variables can never be excluded, thus

weakening the conclusions. An alternative to an observational study is an experimental or

manipulative study in which the investigator manipulates the experimental system and measures

the effect of his manipulations on the experimental material. Since the manipulation of the

experimental system is under control of the experimenter, one also speaks of controlled

experiments. A well‐designed experimental study eliminates the bias caused by confounding

variables. The great power of a controlled experiment, provided it was well conceived, lies in the fact

that it allows us to demonstrate causal relationships. In the following we will focus on controlled

experiments and how statistical thinking and reasoning can be of use to optimize their design and

interpretation.

2.2 SOME TERMINOLOGY

We refer to a factor as the condition or set of conditions that we manipulate in the experiment, e.g.

the concentration of a drug. A factor level is the particular value of the factor, e.g. 10⁻⁶ M or 10⁻⁵ M. A

treatment consists of the combination of factor levels. In single‐factor studies a treatment

corresponds to a factor level. The experimental unit is defined as the smallest physical entity to

which a treatment is independently applied. The characteristic that is measured and on which the

effect of the different treatments is investigated and analyzed is referred to as the response or

dependent variable. The observational unit is the unit on which the response is measured. Often

the observational unit is identical to the experimental unit, but this is not necessarily always the

case.

2.3 SCIENTIFIC RESEARCH AS A PHASED PROCESS

From a systems analysis point of view the scientific research process can be divided into five distinct

stages: definition of the research question, design of the experiment, conduct of the experiment

and data collection, data analysis, and reporting. Each of these phases results in a specific

deliverable. The definition of the research question will usually result in a research or grant proposal

stating the hypothesis related to the research and the predictions that follow from this hypothesis.

The design of the experiment needed for testing the research hypothesis is formalized in a written

protocol. After the experiment has been carried out, the data will be collected yielding the dataset.

Statistical analysis of the data will yield conclusions that answer the research question by accepting

or rejecting the formalized hypothesis. Finally, a well carried out research project will result in a

report, thesis, or journal article.

2.4 SCIENTIFIC RESEARCH AS AN ITERATIVE, DYNAMIC PROCESS

As depicted in Figure 1, scientific research is not a simple static activity, but an iterative and highly

dynamic process. A research project is carried out within some organizational or management

context which can be rather authoritative; this context can be academic, government, or business. In

this context, the management objectives of the research project are put forward. The aim of our

research project itself is to fill an existing information gap. To fill this gap the research question is

defined, the experiment designed and carried out and the data analyzed. The results of this analysis

allow informed decisions to be made and provide a way of feedback to adjust the definition of the

research question. On the other hand, the experimental results will trigger research management to

reconsider their objectives and possibly request more information.

Figure 1. Scientific research as an iterative process

3 SMART RESEARCH DESIGN BY STATISTICAL THINKING

3.1 RESEARCH STYLES – THE SMART RESEARCHER

The five phases that make up the research process modulate between the concrete and the abstract

world (Figure 2). Definition and reporting are conceptual and complex tasks requiring a great deal of

abstract reasoning. Conversely, experimental work and data collection are very concrete,

measurable tasks that deal with the practical details and complications of the specific research

domain.

Scientists exhibit different styles in their research depending on the relative fraction of the available

resources that they are willing to spend at each phase of the research process. This allows us to

recognize different archetypes of researchers: the “novelist” who needs to spend a lot of time

distilling a report from an ill-conceived experiment; the “data salvager” who believes that no matter

how you collect the data or set up the experiment, there is always a statistical fix-up at analysis time;

the “lab freak” who strongly believes that if enough data are collected something interesting will

always emerge. Finally, there is the “smart researcher” who is aware of the architecture of the

experiment as a sequence of steps and allocates a major part of his time budget to the first two

steps: definition and design. The “smart researcher” is convinced that time spent planning and

designing an experiment at the outset will save time and money in the long run. He opposes the

“lab freak” by trying to reduce the number of measurements to be taken, thus effectively reducing

the time spent in the lab. In contrast to the “data salvager”, the “smart researcher” recognizes that

the design of the experiment will govern how the data will be analyzed thereby reducing time spent

at the data analysis stage to a minimum. By carefully preparing and formalizing the definition and

design phase, the “smart researcher” can look ahead to the reporting phase with peace of mind,

which is in contrast to the “novelist”.

Figure 2. Modulating between the concrete and the abstract

3.2 PRINCIPLES OF STATISTICAL THINKING

The “smart researcher” recognizes the value of “statistical thinking” for his application area and he

himself is skilled in “statistical thinking” or he collaborates with a professional who masters this skill.

As noted before, “statistical thinking” is related to, but distinct from, statistical science. While

statistics is a specialized technical skill based on mathematical statistics as a science on its own,

statistical thinking is a generalist skill based on informed practice and focused on the applications of

nontechnical concepts and principles. The “statistical thinker” attempts to understand how

statistical methods can contribute to finding answers to specific research problems in terms of data

collection, experimental setup, data analysis, and reporting. He or she is able to postulate which

statistical expertise is required to enhance the research project’s success. In this capacity the

“statistical thinker” acts as a “diagnoser”. In contrast to statistics which operates in a closed and

secluded mathematical context, statistical thinking is a practice that is fully integrated with the

researcher’s scientific field, not merely an autonomous science. Hence the “statistical thinker”

operates in a more ambiguous setting where he is deeply involved in applied research, with a good

working knowledge of the substantive science. In this role the “statistical thinker” acts as an

intermediary between scientists and statistician and goes into dialogue with them. He attempts to

integrate the several potentially competing priorities that make up the success of a research project:

resource economy, statistical power, and scientific relevance, into a coherent and statistically

underpinned research strategy. While the impact of the statistician on the research process is

limited to discrete interventions, the “statistical thinker” truly permeates the research process. His

combined skills lead to increased efficiency, which is important to increase the speed with which

research data, analyses, and conclusions become available. Moreover, these skills help to enhance

the quality of the research and to reduce the associated cost. Statistical thinking then helps the scientist to build a

case and negotiate it on fair and objective grounds with those in the organization seeking to

contribute to more business‐oriented measures of performance. In that sense, the successful

“statistical thinker” is a persuasive communicator.

Smart research design is based on seven basic principles of statistical thinking: 1) time spent

thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of

an experiment reflects the contributions from different sources of variability; 3) the design of an

experiment balances between its internal validity (proper control of noise) and external validity (the

experiment’s generalizability); 4) good experimental practice provides the key to bias minimization;

5) good experimental design is the key to the control of variability; 6) experimental design

integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar

of an effective experiment.

4 PLANNING THE EXPERIMENT

4.1 TYPES OF EXPERIMENTS

Experimental set‐ups can be classified into four broad categories depending on the type of objective

of the study in question. A comparative experiment is one in which two or more techniques,

treatments, or levels of an explanatory variable are to be compared with one another. There are

many examples of comparative experiments in biomedical areas. For example, in nutrition studies

different diets can be compared to one another in laboratory animals. In clinical studies, the efficacy

of an experimental drug is assessed in a trial by comparing it to treatment with placebo.

Comparative experiments are the primary focus of this tutorial.

A second type of experiment is the optimization experiment which has the objective of finding

conditions that give rise to a maximum or minimum response. Optimization experiments are often

used in product development, such as finding the optimum combination of concentration,

temperature, and pressure that give rise to the maximum yield in a chemical production plant. Dose

finding trials in clinical development are another example of optimization experiments.

The third type of experiment is the prediction experiment in which the objective is to provide some

statistical/mathematical model to predict new responses. Examples are dose response experiments

in pharmacology and immuno‐assay experiments.

The final experimental type is the variation experiment. The objective of this experiment is to study

the size and structure of random variation. Variation experiments are implemented as uniformity

trials, i.e. studies without different treatment conditions. An example is the assessment of sources of

variation in microtiter plate experiments. These sources of variation can be plate effects, row

effects, column effects and their combination. Another example is a study to investigate the effect of

cage location in an animal experiment where animals are kept in a rack of 24 cages.
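
As a rough illustration of how the data from a uniformity trial on a single microtiter plate might be summarized, the sketch below splits hypothetical readings into a grand mean, row effects, column effects, and a residual. The plate values and all variable names are invented for this example; plate-to-plate effects would require several plates and are not shown.

```python
from statistics import mean

# Hypothetical readings from a uniformity trial on one plate:
# every well received the same treatment, so all differences are variation.
plate = [
    [0.98, 1.02, 1.01, 1.05],
    [0.95, 1.00, 0.99, 1.03],
    [0.97, 1.01, 1.00, 1.06],
]

grand_mean = mean(value for row in plate for value in row)

# Row and column effects are deviations of the row/column means from the grand mean.
row_effects = [mean(row) - grand_mean for row in plate]
column_effects = [mean(column) - grand_mean for column in zip(*plate)]

# What is left after removing row and column effects is the residual variation.
residuals = [
    [value - grand_mean - row_effects[i] - column_effects[j]
     for j, value in enumerate(row)]
    for i, row in enumerate(plate)
]

print("row effects:   ", [round(e, 3) for e in row_effects])
print("column effects:", [round(e, 3) for e in column_effects])
```

Large row or column effects relative to the residuals would suggest that plate position should be taken into account in the design of later experiments.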

4.2 THE OBJECTIVE OF THE STUDY AND THE RESEARCH HYPOTHESIS

Before designing the experiment, it is important that the scientist is clear about the actual goal

of the study. Sometimes a pilot study, making preliminary observations on the study material, is

useful to generate clear questions that will be answered in the actual experiment. It is wise to limit

the objectives of a study to a maximum of, say, three (Selwyn, 1996). Having a lot of research

objectives compromises the integrity of the study and as a result, often none of the study objectives

will be satisfied. After having formulated the research objectives, the scientist will then try to

transfer them into scientific hypotheses that might answer the question and he will make

predictions of what to expect if these hypotheses are true. The next step in the planning process is

to decide which data are required to confirm or refute these predictions. Generating sensible

predictions is one of the key factors of good experimental design. Good predictions will follow

logically from the hypotheses that we wish to test, and not from other rival hypotheses. Good

predictions will also lead to insightful experiments that allow the predictions to be tested.

Throughout the sequence of question, hypothesis, prediction it is essential to assess each step

critically with enough skepticism and even ask a colleague to play the “devil’s advocate”. It is much

better that problems are identified at this early stage of the research process than after the

experiment started. At the end of the experiment the scientist should be able to determine whether

the objectives have been met, i.e. whether the research questions were answered to satisfaction.

4.3 THE PILOT EXPERIMENT

As researchers are often under considerable time pressure there is the temptation to start as soon

as possible with the experiment. However, a critical step in a new research project that is often

missed is to spend some time and resources at the beginning of the study collecting pilot

data. Such a preliminary or pilot experiment on a limited scale is of use for assessing the feasibility of

the actual experiment. During the pilot stage the researcher is allowed to make variations in

experimental conditions such as measurement method, experimental set‐up, etc.

The pilot experiment can be of help to make sure that a sensible research question was asked. For

instance, if our research question was about whether there is a difference in concentration of a

certain protein between diseased and non‐diseased tissue, it is of importance that this protein is

present in a measurable amount. Carrying out a pilot experiment in this case can save considerable

time, resources and eventual embarrassment.

A second crucial role for the pilot study is for the researcher to practice, validate and standardize the

experimental techniques that will be used in the full study. When appropriate, trial runs of different

types of assays allow fine‐tuning them so that they will give optimal results.

Finally, the pilot study provides basic data to debug and fine‐tune the experimental design. Provided

the experimental techniques work well, carrying out a small‐scale version of the actual experiment

will yield some preliminary experimental data. These pilot data can be very valuable: they allow the researcher to

calculate or adjust the required sample size of the experiment and to set up the data analysis

environment.

The pilot experiment still belongs to the exploratory phase of the research project and is not part of

the actual, final experiment. In order to preserve the quality of the data and the validity of the

statistical analysis, the pilot data cannot be included in the final dataset.

5 PRINCIPLES OF STATISTICAL DESIGN

5.1 THE STRUCTURE OF THE RESPONSE VARIABLE

We assume that the response obtained for a particular experimental unit can be described by a

simple additive model consisting of the effect of the specific treatment, the effect of the

experimental design and an error component that describes the deviation of this particular

experimental unit from the mean value of its treatment group. There are some strong assumptions

associated with this simple model: 1) the treatment terms add rather than, for example, multiply; 2)

treatment effects are constant; 3) the response in one unit is unaffected by the treatment applied to

the other units. These assumptions are particularly important in the statistical analysis.

Figure 3. The response variable as additive model

5.2 DEFINING THE EXPERIMENTAL UNIT

The experimental unit corresponds to the smallest division of the experimental material to which a

treatment can be applied, such that any two units can receive different treatments. It is important

that the experimental units respond independently of one another, in the sense that a treatment

applied to one unit cannot affect the response obtained in another unit and that the occurrence of a

high or low result in one unit has no effect on the result of another unit.

In many studies the choice of the experimental unit is obvious. However, in studies with multiple

levels of sampling it often happens that investigators have difficulties recognizing the basic unit in

their experimental material. This can lead to serious errors in both the design and subsequent

analysis of their study. The following examples represent typical situations commonly encountered

in biomedical research.

Figure 4. Morphometric analysis of the diameter of bile canaliculi in wild‐type and Cx32‐deficient liver.

Means±SEM from three livers. *P<0.005 (after Temme et al., 2001)

Temme et al. (2001) compared two genetic strains of mice, wild‐type and connexin32‐deficient. They

measured the diameters of bile canaliculi in the livers of three wild‐type and of three Cx32‐deficient

animals, making several observations on each liver. Their results are shown in Figure 4. There is a

fairly obvious problem with the definition of the experimental units which also act as units of

analysis here. They have two groups and in each group there are only three experimental units, not

groups of 280 and 162 units. Hence their statistical analysis is completely wrong. Similar mistakes are

abundant whenever microscopy is concerned and the individual cell is used as experimental unit.

One could wonder whether these are mistakes made out of ignorance or out of convenience. The

concern is even greater when such studies get published in peer reviewed scientific journals.
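
One possible way to respect the experimental unit in a situation like this is to reduce the repeated observations to a single summary value per animal and to base the comparison on those values only. The sketch below illustrates the idea with invented numbers; the data, the dictionary layout, and the choice of a pooled two-sample t statistic are assumptions made for this example and do not reproduce the analysis or the data of the cited study.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements (arbitrary units). Each key is one animal,
# i.e. one experimental unit; each list holds repeated observations on its liver.
wild_type = {
    "wt1": [1.0, 1.2, 0.9, 1.1],
    "wt2": [1.3, 1.1, 1.2],
    "wt3": [0.8, 1.0, 0.9, 1.0, 0.9],
}
deficient = {
    "ko1": [1.5, 1.7, 1.6],
    "ko2": [1.2, 1.4, 1.3, 1.3],
    "ko3": [1.8, 1.6, 1.7],
}

def animal_means(group):
    """Collapse the repeated observations to one summary value per animal."""
    return [mean(observations) for observations in group.values()]

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / df
    t = (mean(y) - mean(x)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, df

wt_means = animal_means(wild_type)
ko_means = animal_means(deficient)
t_stat, df = pooled_t(wt_means, ko_means)

n_obs = sum(len(obs) for group in (wild_type, deficient) for obs in group.values())
print(f"observations: {n_obs}, experimental units: {len(wt_means) + len(ko_means)}")
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The degrees of freedom follow from the number of animals, not from the number of individual observations, which is exactly the distinction made in the text.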

Another example of a wrong definition of experimental unit concerns a study in laboratory animals

about the toxicity of N‐nitrosamines (Rivenson et al., 1988). The investigators provided the following

description of their experimental set‐up:

“The rats were housed in groups of 3 in solid‐bottomed polycarbonate cages with hardwood bedding

under standard conditions diet and tap water with or without N‐nitrosamines were given ad libitum.”

Since the treatment was supplied in the drinking water, it is impossible to provide different

treatments to any two rats sharing a cage. Furthermore, the responses obtained from the different

animals within a cage can be considered to be dependent upon one another in the sense that the

occurrence of extreme values in one unit can affect the result of another unit. Therefore, the

experimental unit here is not the single rat but the cage.

Still another example shows that a single individual can relate to several experimental units. In a

clinical trial the skin irritation potential of five dermatological compounds is evaluated by

administering all five compounds simultaneously to five separate test areas on the backs of normal

volunteers. Each test area was randomly assigned to one of the five different treatment conditions.

Therefore, the test area rather than the volunteer was the experimental unit.
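
A random assignment of this kind could be generated along the following lines. This is only a sketch: the compound labels, the numbering of the test areas, the volunteer identifiers, and the function name are all hypothetical, and the trial described above may have used a different procedure.

```python
import random

COMPOUNDS = ["A", "B", "C", "D", "E"]  # hypothetical labels for the five compounds
TEST_AREAS = [1, 2, 3, 4, 5]           # hypothetical numbering of the test areas

def randomize_test_areas(volunteers, seed=None):
    """Return {volunteer: {test_area: compound}}.

    Each volunteer gets an independent random permutation of the compounds,
    so the test area, not the volunteer, is the unit that receives a treatment.
    """
    rng = random.Random(seed)
    plan = {}
    for volunteer in volunteers:
        order = rng.sample(COMPOUNDS, k=len(COMPOUNDS))  # random permutation
        plan[volunteer] = dict(zip(TEST_AREAS, order))
    return plan

plan = randomize_test_areas(["V01", "V02", "V03"], seed=2011)
for volunteer, areas in plan.items():
    print(volunteer, areas)
```

Fixing the seed makes the allocation reproducible, which can be convenient when the randomization list has to be documented in the study protocol.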

5.3 VARIATION IS OMNIPRESENT

Variation is everywhere in the natural world and is often substantial in the biomedical area. Despite

accurate execution of the experiment, the measurements obtained in identically treated objects will

yield different results. For example, cells grown in test tubes will vary in their growth rates and in

animal research no two animals will behave exactly the same. In general, the more complex the

system that we study, the more factors will interact with each other and the greater will be the

variation between the experimental units. Experiments in whole animals will undoubtedly show

more variation than in vitro studies on isolated organs. When the variation cannot be controlled or

its source cannot be measured, we will refer to it as noise, random variation or error. This

uncontrollable variation masks the effects under investigation and is the reason why statistical

methods are required to extract the necessary information. This is in contrast to other scientific

areas such as physics, chemistry and engineering where the studied effects are much larger than the

natural variation.

5.4 BALANCING INTERNAL AND EXTERNAL VALIDITY

Internal validity refers to the fact that in a well‐conceived experiment the effect of the treatment is

unequivocally attributed to the treatment. However, as we saw earlier the effect of the treatment is

masked by the presence of the uncontrolled variation of the experimental material. An experiment

with a high level of internal validity should have a good chance of detecting the effect of the treatment.

If we consider the treatment effect as a signal and the inherent variation of our experimental

material as noise, then a good experimental design will maximize the signal/noise ratio. Increasing

the signal can be accomplished by choosing experimental material that is more sensitive to the

treatment. Identification of factors that increase the sensitivity of the experimental material could

be carried out in preliminary experiments. Reducing the noise is another way to increase the signal‐

to‐noise ratio. This can be accomplished by repeating the experiment in a number of animals, but

this is not a very efficient way of reducing the noise. An alternative for noise reduction is by using

very uniform experimental material. The use of cells harvested from a single animal is an example of

noise reduction by uniformity of experimental material.

External validity is related to the extent that our conclusions can be generalized to the target

population. The choice of the target population, how a sample is selected from this population and

the experimental procedures used in the study are all determinants of its external validity. Clearly,

the experimental material should mimic the target population as closely as possible. In animal

experiments specifying species and strain of the animal, the age and weight range and other

characteristics determine the target population and make the study as realistic and informative as

possible. External validity can be very low if we work in a highly controlled environment using very

uniform material. Thus there is a trade‐off between internal and external validity, as one goes up the

other comes down. Fortunately, as we will see, there are statistical strategies for designing a study

such that the noise is reduced while the external validity is maintained.

5.5 BIAS AND VARIABILITY

Bias and variability (Figure 5) are two important concepts when dealing with the design of

experiments. A good experiment will minimize or, at best, try to eliminate bias and will control for

variability. By bias, we mean a systematic deviation in observed measurements from the true value.

For example, an animal study is performed in which the effect of an experimental treatment is

investigated with respect to a control treatment. Suppose the experimenter allocates all males to

the control treatment and all females to the experimental treatment. Further assume that at the end

of the experiment the investigator finds a strong difference between the two treatment groups. It is

clear that this treatment effect is a biased result since the difference between the two groups is

confounded with the difference between both sexes. One of the most important sources of bias in a

study is the way experimental units are allocated to treatment groups.

By variability, we mean a random fluctuation about a central value. The terms bias and variability are

also related to the concepts of accuracy and precision of a measurement process. Absence of bias

means that our measurement is accurate, while little variability means that the measurement is

precise. Good experiments are as free as possible from bias and variability. Of the two, bias is the

most important. Failure to minimize the bias of an experiment leads to erroneous conclusions and

thereby jeopardizes the internal validity. Conversely, if the outcome of the experiment shows too

much variability, this can sometimes be remedied by refinement of the experimental methods,

increasing the sample size, or other techniques. In this case the study may still reach the correct

conclusions.

Figure 5. Bias and variability illustrated by a marksman shooting at a bull’s eye

5.6 REQUIREMENTS FOR A GOOD EXPERIMENT

Cox (1958) enunciated the following requirements for a “good experiment”:

1. treatment comparisons should as far as possible be free of systematic error (bias);

2. the comparisons should also be made sufficiently precise (signal‐to‐noise);

3. the conclusions should have a wide range of validity (external validity);

4. the experimental arrangement should be as simple as possible;

5. uncertainty in the conclusions should be assessable.

These five requirements will determine the basic elements of the design. We have already discussed

the first three requirements in the preceding sections. The following section provides some basic

strategies to fulfill these requirements.

5.7 STRATEGIES FOR MINIMIZING BIAS AND MAXIMIZING SIGNAL‐TO‐NOISE RATIO

The basic principle of experimental design is about maximizing the internal validity by minimizing

the bias and maximizing the signal‐to‐noise ratio and, at the same time, maximizing the external

validity of the experiment. To accomplish this objective, the researcher should maximize the signal

by the proper choice of the measuring device and experimental domain, and minimize bias and

variability. Strategies for minimizing the bias are based on good experimental practice, such as: the

use of controls, blinding, the presence of a protocol, calibration, randomization, random sampling,

and standardization. Variability can be minimized by elements of experimental design: replication,

blocking, covariate measurement, and sub‐sampling. In addition random sampling can be added to

enhance the external validity. We will now consider each of these strategies in more detail.

5.7.1 STRATEGIES FOR MINIMIZING BIAS – GOOD EXPERIMENTAL PRACTICE

5.7.1.1 THE USE OF CONTROLS

In biomedical studies, a control or reference standard is a standard treatment condition against

which all others may be compared. The control can either be a negative control or a positive

control. The term active control is also used for the latter. In some studies, both negative and

positive controls are present. In this case, the purpose of the positive control is mostly to provide an

internal validation of the experiment. In some special types of experiment, active controls are used to

show equivalence of treatments. When negative controls are used, subjects can act as their own

control (self‐control). In this case the subject is first evaluated under standard conditions (i.e.

untreated). Subsequently, the treatment is applied and the subject is re‐evaluated. This design has

the property that all comparisons are made within the same subject. Since most of the time¹

variability within a subject is smaller than between subjects, this is a more efficient design than

comparing control and treatment in two separate groups. However, there are some drawbacks with

this design. The effect of treatment is completely confounded with the effect of time, which

introduces a potential source of bias. Furthermore, blinding which is another method to minimize

bias is impossible in this design.

¹ For a quantitative outcome this is the case when the Pearson product‐moment correlation coefficient between the control measurement and the measurement following treatment is at least 0.5.

As in the previous case of self‐control, untreated controls are not blinded. Furthermore, applying

the treatment (e.g. a drug) often requires extra manipulation of the subjects (e.g. injection). The

effect of the treatment is then confounded with that of the manipulation and consequently

potentially biased.

Vehicle control (laboratory experiments) or placebo control (clinical trials) are terms that refer to a

control group that receives a matching treatment condition without the active ingredient. Another

term in the context of experimental surgery is sham control. In the sham control group subjects or

animals undergo a faked operative intervention that omits the step thought to be therapeutically

necessary. This type of control is the most desirable and truly minimizes bias. In clinical research the

placebo controlled trial has become the gold standard. However, in the same context of clinical

research ethical considerations may sometimes preclude its application.

5.7.1.2 BLINDING

Blinding is a very useful strategy for minimizing bias, especially when the response variable is

subjectively evaluated. Knowledge of which treatment was assigned to a particular experimental

unit can subconsciously lead to a biased assessment by the observer. In such experiments evaluation

must be done by a person who is blinded to treatment group. In other types of experiments handling

of the experimental material, especially animals, by the investigator can also influence the results.

Here too, proper blinding of investigators will enhance the credibility and reliability of the results.

In single blinding the investigators are uninformed regarding the treatment condition of the

experimental subjects. Single blinding neutralizes investigator bias. The term double blinding in

laboratory experiments means that both the experimenter and the observer are uninformed about

the treatment condition of the experimental units. In clinical trials double blinding means that both

investigators and subjects are unaware of the treatment condition.

Two methods of blinding have found their way to the laboratory: group blinding and individual

blinding. Group blinding involves identical codes, say A, B, C, etc., for entire treatment groups. This

approach is less appropriate, since when results accumulate and the investigator is able to break the

code, blindness is completely destroyed. A much better approach is to assign a code (e.g. sequence

number) to each experimental unit individually and to maintain a list that indicates which code

corresponds to which particular treatment. The sequence of the treatments in the list should be

randomized. In practice, this procedure often involves an independent person who maintains the list

and prepares the treatment conditions (e.g. drugs).
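The individual coding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_blinding_key`, the unit labels and the seed are hypothetical, and equal group sizes are assumed. The full key stays with the independent person, while the blinded staff only receive the codes.

```python
import random


def make_blinding_key(unit_ids, treatments, seed=None):
    """Return (code, unit_id, treatment) rows with treatments in random order."""
    unit_ids = list(unit_ids)
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    replicates = len(unit_ids) // len(treatments)
    # Balanced set of treatment labels, shuffled so that their sequence is random.
    labels = [trt for trt in treatments for _ in range(replicates)]
    rng.shuffle(labels)
    # The sequence number (code) is the only identifier shown to blinded staff.
    return [(code, unit, trt)
            for code, (unit, trt) in enumerate(zip(unit_ids, labels), start=1)]


# Full key: kept by the independent person who prepares the treatments.
key = make_blinding_key([f"unit-{i:02d}" for i in range(1, 13)], ["A", "B"], seed=2011)
# Blinded view: codes and units only, without the treatment column.
blinded_view = [(code, unit) for code, unit, _ in key]
```

Because every unit carries its own code, guessing the treatment of one unit reveals nothing about the others, which is the advantage over group codes.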

5.7.1.3 THE PRESENCE OF A PROTOCOL

The presence of a written technical protocol containing in full detail the specific definitions of

measurement and scoring methods is imperative to minimize potential bias. The technical protocol

describes practical actions and gives guidelines for lab technicians on how to manipulate the

experimental units (animals, etc.), the materials involved in the experiment, and the required logistics. It

also gives details on data collection and processing. Last but not least the protocol lays down the

personal responsibilities of the technical staff. The importance and contents of the study protocol

will be discussed further in Section 9.

5.7.1.4 CALIBRATION

Calibration is an operation that compares the output of a measurement device to standards of

known value, leading to correction of the values indicated by the measurement device. Calibration

neutralizes instrument bias, i.e. the bias in the investigator’s measurement system.

5.7.1.5 RANDOMIZATION

Randomization, in our context, is the process of allocating experimental units to treatment groups or

conditions according to a well‐defined stochastic law¹. It is an objective and scientifically accepted

method for the allocation of experimental units to treatment groups. Randomization ensures that

the effect of uncontrolled sources of variability has equal probability in all treatment groups. In the

long run, randomization balances treatment groups on unimportant or unobservable variables, of

which we are often unaware. Any differences that exist in these variables are to be attributed to the

“play of chance”. The random allocation of experimental units to treatment conditions is also a

necessary condition for a rigorous statistical analysis, in the sense that it provides an unbiased

estimate of the standard error of the treatment effects and justifies the use of exact significance

tests (Fisher, 1935; Cox, 1958; Lehmann, 1975). Moreover, randomization is also of use as a device

for blinding the experiment. It is essential that the randomization procedure covers all important

sources of variation connected with the experimental units. Experimental units receiving the same

treatment should be dealt with separately and independently at all stages. If this is not the case

additional randomization procedures should be introduced. Methods of randomization using Excel

and the R system are contained in the Appendix.

Some investigators are convinced that a systematic arrangement is the preferred way to eliminate

the influence of uncontrolled variables. For example when two treatments A and B have to be

compared, one possibility is to set up pairs of experimental units and always assign treatment A to

the first member of the pair and B to the remaining unit. However, if there is a systematic effect

such that the first member of each pair consistently yields a higher or lower result than the second

member, the estimated treatment effect will be biased. Other arrangements are more commonly

used in the laboratory, e.g. the alternating sequence AB, BA, AB, BA, …. Here too, it cannot be

excluded that a certain pattern in the uncontrolled variability coincides with this arrangement. For

instance, if 8 experimental units are tested in one day, the first unit on a given day will always

receive treatment A. Furthermore, when a systematic arrangement has been applied, the statistical

analysis is based on the false assumption of randomness and can be totally misleading.

¹ By the term “stochastic” is meant that it involves some element of chance, such as picking numbers out of a hat or, preferably, using a computer program to assign experimental units to treatment groups.

Researchers are sometimes tempted to improve on the random allocation of animals by re‐

arranging individuals so that the mean weights are exactly identical. However, by reducing the

variability between the treatment groups, the variability within the groups is automatically

increased. This reduces the precision of the experiment and invalidates the subsequent statistical

analysis. Later, we will see that the randomized block design instead of systematic arrangement is

the correct way of handling these last two cases.

Proper randomization must be distinguished from haphazard allocation to treatment groups. For

example, an investigator wishes to compare the effect of two treatments (A, B) on the body weight

of rats. All twelve animals are delivered in a single cage to the laboratory. The laboratory technician

takes six animals out of the cage and assigns them to treatment A, while the remaining animals will

receive treatment B. Although many scientists would consider this a random assignment, it is not.

Indeed, one could imagine the following scenario. Heavy animals react slower and are easier to

catch than the smaller animals. Consequently, the first six animals will on average weigh more than

the remaining six.
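For the twelve rats of this example, a proper random allocation might look like the following Python sketch. It is not the procedure from the Appendix (which uses Excel and R); the function name `randomize_units` and the seed are assumptions made for illustration. The essential point is that the animals are numbered first and a random number generator, not the technician's hand, decides the group.

```python
import random


def randomize_units(unit_ids, treatments, seed=None):
    """Allocate units at random to equally sized treatment groups."""
    unit_ids = list(unit_ids)
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    rng.shuffle(unit_ids)  # every ordering of the animals is equally likely
    group_size = len(unit_ids) // len(treatments)
    return {trt: sorted(unit_ids[i * group_size:(i + 1) * group_size])
            for i, trt in enumerate(treatments)}


# Twelve rats, numbered 1..12 before anyone reaches into the cage.
allocation = randomize_units(range(1, 13), ["A", "B"], seed=12)
for treatment, rats in allocation.items():
    print(treatment, rats)
```

Recording the seed makes the allocation reproducible and auditable, which a haphazard pick from the cage can never be.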

An important issue in the design of an experiment is the moment of randomization. For example, in

an experiment brain cells were taken from animals and placed in Petri dishes, such that one Petri

dish corresponded to one particular animal. The Petri dishes were then randomly divided into two

groups and placed in an incubator. After 72 hrs incubation one group of Petri dishes was treated

with the experimental drug, while the other group received solvent.

Although the investigators made a serious effort to introduce randomization in their experiment,

they overlooked the fact that the placement of the Petri dishes in the incubator introduces a systematic

error. Instead of randomly dividing the Petri dishes in two groups at the start of the experiment, they

should have made random treatment allocation after the incubation period. It is important that the

randomization covers all substantial sources of variation connected with the experimental units. As

a rule, randomization should be performed immediately before treatment application. Furthermore,

after the randomization process has been carried out the randomized sequence of the experimental

units must be maintained, otherwise a new randomization procedure is required.

5.7.1.6 RANDOM SAMPLING

Using a random sample in our study increases its external validity and allows us to make a broad

inference, i.e. a population model of inference (Lehmann, 1975). However, in practice it is often

difficult and/or impractical to conduct studies with true random sampling. Clinical trials are usually

conducted using eligible patients from a small number of study sites. Animal experiments are based

on the available animals. This certainly limits the external validity of these studies and is one of the

reasons that the results are not always replicable.

In some cases, maximizing the external validity of the study is of great importance. This is especially

the case in studies that attempt to make a broad inference towards the target population

(population model), like gene expression experiments that try to relate a specific pathology to the

differential expression of certain gene probes (Nahon and Shoemaker, 2002). For such an

experiment the bias in the results is minimized if it is based on a random sample from the target

population.

5.7.1.7 STANDARDIZATION

Standardization of the experimental conditions can be of use to minimize the bias but also to reduce

the intrinsic variability in the results. Examples of standardization of the experimental conditions are

the use of genetically uniform animals, use of phenotypically uniform animals, environmental

control, nutritional control, acclimatization, and standardization of the measurement system. As

discussed before, too much standardization of the experimental conditions can jeopardize the

external validity of the results.

5.7.2 STRATEGIES FOR CONTROLLING VARIABILITY – GOOD EXPERIMENTAL DESIGN

5.7.2.1 REPLICATION

Ronald Fisher¹ (1935) noted in his pioneering book The Design of Experiments that replication serves

two purposes. The first is to increase the precision of estimation and the second is to supply an

estimate of error by which the significance of the comparisons is to be judged.

¹ Sir Ronald Aylmer Fisher (London, 1890 – Adelaide, 1962) is considered a genius who almost single‐handedly created the foundations of modern statistical science.

In a comparative experiment with two treatments and equal number of experimental units per

treatment group, the precision of the experiment is quantified by the standard error of the

difference between the two treatments, i.e. the quantity $\sqrt{2/n}\cdot SD$, where $n$ is the number of experimental units per treatment group.

The precision of the experiment depends on the standard deviation SD, which is composed of the

intrinsic variability of the experimental material and the precision of the experimental work, and

inversely on the number of experimental units. Reduction of the standard deviation is only possible

to a limited extent. However, by increasing the number of experimental units one can effectively

enhance the experiment’s precision. Unfortunately, due to the inverse square root dependency of

the standard error on the sample size, this is not an efficient way to control the precision. Indeed,

the standard error is halved by a fourfold increase in the number of experimental units, but a

hundredfold increase in the number of units is required to divide the standard error by ten. In other

words, replication is an effective but expensive strategy to control variability.
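The square root law can be checked numerically. The sketch below assumes that the standard error of the difference between two equally sized groups is SD·√(2/n), with n the number of experimental units per group; the function name and the baseline of five units per group are illustrative choices, not values from the text.

```python
import math


def se_difference(sd, n_per_group):
    """Standard error of the difference between two group means (equal n, common SD)."""
    return sd * math.sqrt(2.0 / n_per_group)


baseline_n = 5  # hypothetical starting group size
baseline_se = se_difference(sd=1.0, n_per_group=baseline_n)
for factor in (1, 4, 100):
    n = baseline_n * factor
    ratio = se_difference(1.0, n) / baseline_se
    print(f"{factor:3d} x units (n = {n:3d} per group): SE is {ratio:.2f} x baseline")
```

The ratio depends only on the factor by which the number of units grows, not on the SD or on the starting group size.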

5.7.2.2 SUBSAMPLING

As mentioned above, reduction of the standard deviation is only possible to a very limited extent.

This can be accomplished by standardization of the experimental conditions, but this method too is

limited and jeopardizes the external validity of the experiment. In some experiments it is possible

to manipulate the physical size of the experimental units. In this case one could choose units of a larger

size to obtain more precise estimates. In still other experiments, there are multiple levels of

sampling. The process of taking samples below the primary level of the experimental unit is known

as subsampling. The experiment reported by Temme et al. (2001) where the diameter of many liver

cells was measured in 3 animals/experimental condition, is an example of subsampling with animals

at the primary level and cells at the subsample level. When the variability of the measurement at the

sublevel (i.e. within‐subject variability) is substantial, e.g. large standard deviation between cell

diameters of the same animal, as compared to the intrinsic variability between the animals, it makes

sense to increase the number of subsamples. However, subsample replication is not identical to, and

not as effective as, replication on the level of the true experimental unit.
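Why subsamples help less than true replicates can be made concrete with a simple two‐level variance model. The sketch assumes independent between‐unit and within‐unit components and the same number of subsamples for every unit, so that the variance of a treatment mean is (SD_between² + SD_within²/m)/n; the function name and the numerical values are hypothetical.

```python
import math


def se_group_mean(sd_between, sd_within, n_units, m_subsamples):
    """Standard error of a treatment mean with n units and m subsamples per unit."""
    # Averaging m subsamples only shrinks the within-unit component.
    var_unit_mean = sd_between ** 2 + sd_within ** 2 / m_subsamples
    return math.sqrt(var_unit_mean / n_units)


# Hypothetical case: large variation between cells, smaller variation between animals.
for n, m in [(3, 1), (3, 10), (3, 100), (6, 10)]:
    se = se_group_mean(sd_between=1.0, sd_within=3.0, n_units=n, m_subsamples=m)
    print(f"{n} animals x {m:3d} cells: SE = {se:.3f}")
```

However many cells are measured, the standard error cannot drop below SD_between/√n; only adding animals lowers that floor.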

Apart from replication and reduction of intrinsic variability there is a third effective and efficient

method to increase the precision of an experiment. This is accomplished by choosing an appropriate

experimental design which takes into account the different sources of variability that can be

identified.

5.7.2.3 BLOCKING

If one or more factors other than the treatment conditions can be identified as potentially

influencing the outcome of the experiment, it may be appropriate to group the experimental units

on these factors. Such groupings are referred to as blocks or strata. Units within a block are then

(randomly) assigned to the treatments. Examples of blocking factors are plates (in microtiter plate

experiments), greenhouses, animals, litters, date of experiment, or groupings based on

continuous characteristics such as body weight, baseline measurement, etc.

What we effectively do by blocking is dividing the variation between the individuals into variation

between blocks and variation within blocks. If the blocking factor has an important effect on the

response, then the between‐block variation is much greater than the within‐block variation. We will

take this into account in the analysis of the data (ANOVA with blocks as additional factor).

Comparisons of treatments are then carried out within blocks, where the variation is much smaller.
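A randomized allocation within blocks can be sketched as follows, here with blocks formed on body weight. The function name, the unit labels and the weights are hypothetical, and the sketch assumes that each block contains exactly one unit per treatment.

```python
import random


def randomized_block_allocation(weights, treatments, seed=None):
    """Form blocks of similar weight and randomize treatments within each block.

    weights: dict mapping unit id -> body weight.
    Returns a list of (block, unit, treatment) rows.
    """
    k = len(treatments)
    if len(weights) % k != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    ranked = sorted(weights, key=weights.get)  # lightest to heaviest
    plan = []
    for b in range(len(ranked) // k):
        block_units = ranked[b * k:(b + 1) * k]  # k units of similar weight
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order of treatments in every block
        plan.extend((b + 1, unit, trt) for unit, trt in zip(block_units, order))
    return plan


# Hypothetical body weights in grams.
weights = {"r1": 251, "r2": 263, "r3": 248, "r4": 270, "r5": 259, "r6": 244}
plan = randomized_block_allocation(weights, ["A", "B", "C"], seed=1)
```

Each treatment occurs once in every weight block, so the treatment comparison is made between animals of similar weight, while chance still decides which animal in a block gets which treatment.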

5.7.2.4 COVARIATES

Blocking on a baseline characteristic such as body weight is one possible strategy to eliminate the

variability induced by the heterogeneity in weight of the animals or patients. Instead of grouping in

blocks, or in addition to it, one can also make use of the actual value of the measurement. Such a

concomitant measurement is referred to as a covariate. It is an uncontrollable but measurable

attribute of the experimental units (or their environment) that is unaffected by the treatments but

may have an influence on the measured response. Examples of covariates are body weight, age,

ambient temperature, measurement of the response variable before treatment, etc. The covariate

filters out the effect of a particular source of variability. Unlike a blocking factor, it represents a

quantifiable attribute of the system studied. The statistical model underlying the design of an

experiment with covariate adjustment is conceptualized in Figure 6. The model implies that there is

a linear relationship between the covariate and the response and that this relationship is the same in

all treatment groups. In other words, there is a series of parallel lines, one per treatment group,

relating the response to the covariate.

Figure 6. Additive model with linear covariate adjustment
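In symbols, the additive model with a linear covariate adjustment is commonly written as below. The notation is an assumption introduced here, not taken from the notes: y_ij is the response of unit j in treatment group i, τ_i the effect of treatment i, x_ij the covariate value, β the slope shared by all groups and ε_ij the random error.

```latex
y_{ij} = \mu + \tau_i + \beta\,(x_{ij} - \bar{x}) + \varepsilon_{ij}
```

Because β carries no treatment index, the fitted lines are parallel and the vertical distance between two lines, τ_i − τ_k, is the covariate‐adjusted treatment difference.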

5.8 SIMPLICITY OF DESIGN

In addition to external validity, bias and precision, Cox (1958) also stated that the design of our

experiment should be as simple as possible. When the design of the experiment is too complex, it

may be difficult to ensure adherence to a complicated schedule of alterations, especially if these are to

be carried out by relatively unskilled people. An efficient and simple experimental design has the

additional advantage that the statistical analysis will be simple without unreasonable assumptions.

5.9 THE CALCULATION OF UNCERTAINTY

This is the last of Cox’s requirements for a “good experiment” and it is the only true statistical

requirement. Fisher (1935) lamented that: “It is possible, and indeed it is all too frequent, for an

experiment to be so conducted that no valid estimate of error is available”. Without the ability to

estimate error, there is no basis for statistical inference. Therefore, in a well conceived experiment,

we should always be able to calculate the uncertainty in the estimates of the treatment

comparisons. This usually means estimating the standard error of the difference between the

treatment means. To make this calculation in a rigorous manner, the set of experimental units must

respond independently to a specific treatment and may only differ in a random way from the set of

experimental units assigned to the other treatments. This requirement again stresses the

importance of independence of experimental units and randomness in treatment allocation.
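As a minimal sketch of this calculation for two independent treatment groups, the standard error of the difference between the means can be obtained from a pooled estimate of the variance. The function name is hypothetical, and the sketch assumes independent experimental units and a common variance in both groups.

```python
import math
import statistics


def se_of_difference(group_a, group_b):
    """Standard error of the difference between two independent group means."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    return math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
```

The calculation needs at least two units per group; with a single unit per treatment no estimate of error is available from the data, which is exactly the situation Fisher warned against.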

In experiments with a very small number of experimental units, it is sometimes not possible to

obtain an effective estimate of the standard deviation from the observations themselves. In this

case, one can make use of the results of previous experiments to estimate the standard deviation.

However, we then make the strong assumption that random variation is the same in the new

experiment.

6 COMMON DESIGNS IN BIOLOGICAL EXPERIMENTATION

There is a multitude of designs that can be considered when planning an experiment and some have

been employed more commonly in the area of biological research. Unfortunately, the literature

about experimental designs is littered with technical jargon, which makes its understanding quite a

challenge. To name a few, there are: completely randomized designs, randomized complete block

designs, factorial designs, split plot designs, Latin square designs, Greco‐Latin squares, Youden

square designs, lattice designs, Plackett‐Burman designs, simplex designs, Box‐Behnken designs, etc.

It helps to find our way through this jungle of designs by keeping in mind that the major principle of

experimental design is to provide a synthetic approach to minimize bias and control variability.

Furthermore, we can consider each of the specialized experimental designs as integrating three

different aspects of the design: the treatment design, the error control design, and the sampling

design.

The treatment aspect concerns which treatments are to be included in the experiment

and is closely linked to the goals and aims of the study. Should a negative or positive control be

incorporated in the experiment or should both be present? How many doses or concentrations

should be tested and at which level? Is the interaction of two treatment factors of interest or not?

The error control aspect of the study design implements the strategies that we learned in Section

5.7.1 to filter out different sources of variability. The sampling and observation aspect of the design

of our experiment is concerned with how experimental units are sampled from the population, how and

how many subsamples should be drawn, etc.

The complexity and required resources of a study are determined by these three aspects of

experimental design. The required resources, i.e. the number of experimental units, of a study are

governed by the number of treatments, the number of blocks and the standard error. The more

treatments, or the more blocks, the more experimental units are needed. The complexity of the

experiment is determined by the underlying statistical model of Figure 3. In particular the error

design aspect of the study controls its complexity. The randomization process is a major part of the

error‐control aspect. A justified and rigorous estimation of the standard error is only possible in a

randomized experiment. In addition to the above requirement for the calculation of the standard

error, randomization has the advantage that it distributes the effects of uncontrolled variability

randomly over the treatment groups.

Replication of experimental units should be sufficient, such that an adequate number of degrees of

freedom are available for estimating the experiment’s precision (standard error). This parameter is

related to the sampling & observation aspect of the design.

The three aspects of experimental design provide a framework for classifying and comparing the

different types of experimental design that are used in biomedical studies. As we will see, each of

these designs has its advantages and disadvantages.

6.1 THE COMPLETELY RANDOMIZED DESIGN

This is the most common and simplest possible experimental design. Each experimental unit is

randomly assigned to exactly one treatment condition. This is often the default design used by

investigators who do not really think about design problems. In our classification the completely

randomized design is a one‐way treatment design with a completely randomized error‐control

design. When the treatment condition represents a single factor (e.g., different concentrations of

the same drug including a zero level), then this design is also called a one‐way layout. On the other

hand, when the treatment condition consists of a combination of more factors, each factor at two or

more levels, it is called a factorial design. We will discuss factorial designs in more detail later.

The advantage of the completely randomized design is that it is simple to implement as

experimental units are simply randomized to the various treatments. The obvious disadvantage is

the lack of precision in the comparisons among the treatments, which is based on the variation

between the experimental units.

The following real‐life laboratory experiment is an example of a completely randomized design in

which the investigators used randomization, blinding, and individual housing of animals to guarantee

the absence of systematic error and independence of experimental units. To investigate the

interaction of chronic treatment with drugs on the proliferation of gastric epithelial cells in rats, an

experiment was set up in which two experimental drugs were compared with their respective

solvent. A total of 40 rats were randomly divided into four groups of ten animals each, using the MS

Excel randomization procedure described in the Appendix. To guarantee independence of the

experimental units, the animals were kept in separate cages. Cages were distributed over the racks

according to their sequence number. Blinding was accomplished by letting the sequence number of

each animal correspond to a given treatment. One laboratory worker was familiar with the codes

and prepared the daily drug solutions. Treatment codes were concealed from the laboratory staff

that was responsible for the daily treatment administration and final histological evaluation.

Figure 7. Presence of bias in 96‐well plates (after Levasseur et al., 1996)

Another example of a completely randomized design is about eliminating the bias present in

experiments using 96‐well microtiter plates. Burrows (1984) already described the presence of plate

location effects in ELISA assays carried out in 96‐well microtiter plates. Similar findings were

reported by Levasseur et al. (1995) and Faessel et al. (1999) who described the presence of

parabolic patterns (Figure 7) in cell growth experiments carried out in 96‐well microtiter plates. They

were not able to show conclusively the underlying causes of this systematic error, which, as shown

in Figure 7, could be of considerable magnitude. Therefore, they concluded that only by random

allocation of the treatments to the wells these systematic errors could be avoided. An ingenious

method (Figure 8) for randomizing treatments in 96‐well microtiter plates was developed. Drugs

were serially diluted into tubes in a 96‐tube rack. Next, a randomization map was generated using an

Excel macro. The randomization map was taped to the top of an empty tube rack. The original tubes

were then transferred to the new rack by pushing the numbered tubes through the corresponding

numbered rack position. Using a multipipette system, drug‐containing medium was then transferred

from the tubes to the wells of a 96‐well microtiter plate. At the end of the assay, the random data

file generated by the plate reader was imported into the Excel spreadsheet and automatically

resorted. Other research groups make use of robotized systems to implement randomization in their

96‐well plate experiments. This example is also a paradigm of how randomization can be used to

eliminate systematic errors by transforming them into random noise.
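The logic of the randomization map and of the automatic re‐sorting can be mimicked in a few lines of R. This is a hedged sketch and not the Excel macro of Faessel et al.; the column names and the simulated plate readings are placeholders introduced for illustration.

```r
# Hypothetical sketch of a randomization map for a 96-tube rack.
set.seed(1999)                       # arbitrary seed
n_wells <- 96

# Each source tube (numbered in dilution order) is sent to a random position.
map <- data.frame(source = 1:n_wells, position = sample(n_wells))

# Placeholder for the plate-reader file, which is ordered by physical position.
reading <- data.frame(position = 1:n_wells, value = rnorm(n_wells))

# Re-sorting: link every reading to its source tube and restore dilution order.
merged   <- merge(map, reading, by = "position")
resorted <- merged[order(merged$source), ]
head(resorted)
```

Any location effect of the plate is now spread at random over the treatments instead of coinciding with the dilution series.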

Figure 8. Scheme for the addition and dilution of drug‐containing medium to tubes in a 96 tube rack,

randomization of the tubes, and then addition of drug‐containing medium to cell‐containing wells of a 96‐well

plate. (after Faessel et al. (1999))

6.2 THE FACTORIAL DESIGN

In some types of experimental work it can be of interest to assess the joint effect of two or more

factors. In this case, the investigator can make use of factorial designs. Factorial designs are

identified as designs with a factorial treatment design and a completely randomized error‐control

design. The factorial treatment design allows estimating the main effects of the treatments and also

their interaction, i.e. the deviation from additivity of their joint effect.

Table 1. Treatment combinations in 2 x 2 factorial study about joint effect of radiotherapy and

chemotherapy

Control: no radiation, no chemotherapy     Chemotherapy alone
Radiotherapy alone                         Radiotherapy and chemotherapy

An example of a simple 2 x 2 factorial design is an experiment in which the joint effect of

radiotherapy and chemotherapy was assessed in an animal model of tumor growth. It was shown

that radiotherapy as well as treatment with an experimental drug effectively reduced tumor size.

However, for the drug to be of real therapeutic value the joint effect of both radiotherapy and

chemotherapy should be more than the sum of their individual effects. A factorial experiment was

set up in which animals were randomly allocated to 4 treatment groups as in Table 1: a control

group that was not exposed to radiation and was administered the drug’s vehicle, a group with

radiotherapy and the drug’s vehicle, a group without radiation and treated with the experimental

drug, and finally a group receiving both radiotherapy and chemotherapy.

Table 2. Hypothetical results (expected values) from a 2 x 2 factorial experiment

                          Factor B (Drug)
Factor A (Radiotherapy)   Absent (no drug)   Present (drug)
Absent (no radiation)     μ                  μ + β
Present (radiotherapy)    μ + α              μ + α + β + γ

To further explore the concept of interaction, consider the hypothetical outcome of the 2 x 2

experiment in Table 2. Let μ stand for the result obtained in the control group that received only the

solvent of the drug. When an additive model applies, the result obtained in the group that received

the vehicle and was exposed to radiotherapy can be written as μ + α, where α stands for the main

effect of radiotherapy. Analogously, the result obtained in the group that was treated with the active

drug but left without radiotherapy can be written as μ + β. Still using an additive model, let γ stand

for the additional effect in the combination; the result obtained in the group treated with both

drug and radiotherapy is then μ + α + β + γ. Note that γ¹, called the interaction effect, can have the

same sign as α or β which means that the effect of the corresponding factor is enhanced, or can

have the opposite sign indicating a counteraction of the effect of the respective factor. Table 2

summarizes these hypothetical results. The interaction effect γ can be estimated by subtracting the

sum of the off‐diagonal elements from the sum of the diagonal elements, i.e. [(μ) + (μ + α + β + γ)] –

[(μ + α) + (μ + β)] = γ.

Figure 9. Results from a hypothetical experiment with complete additivity of the two factors

The results from a hypothetical experiment with complete additivity are shown in Figure 9. The lines

which connect the treatment means for the two levels of factor A are parallel to one another. The

presence of interaction is demonstrated by the lack of this parallelism as is illustrated in Figure 10

where the effect of radiotherapy is enhanced in the presence of the drug. Of course, the opposite is

also possible when the effect of one factor is counteracted by the presence of another factor.

Statistical interaction is sometimes confused with the concept of synergism in pharmacology.

¹ In statistical texts, the interaction effect γ is normally denoted as $(\alpha\beta)_{ij}$.

However, the requirements for two drugs to be synergistic with each other are much more stringent

than just the superadditivity which is demonstrated by the presence of interaction (Tallarida, 2001).

Factorial designs are highly efficient, but unfortunately their usage is not widespread. Although our

discussion here is restricted to the 2 × 2 factorial, more factors and more levels can be used.

However, the number of experimental units that are involved may become rather large. For

instance, an experiment with three factors at three levels involves 3³ = 27 different treatment

combinations. When there are at least two experimental units in each treatment group, then 54

units will be required. Such an experiment may soon be too large to be practical.
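The growth in the number of treatment combinations is easy to check by enumerating them. The sketch below is illustrative only; the factor names A, B and C are placeholders.

```r
# Enumerate all treatment combinations of three factors at three levels each.
design <- expand.grid(A = 1:3, B = 1:3, C = 1:3)

n_combinations <- nrow(design)        # 3^3 treatment combinations
n_units        <- 2 * n_combinations  # two experimental units per combination

c(combinations = n_combinations, units = n_units)
```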

Figure 10. Results from a hypothetical experiment in which the two factors interact

Table 3. Results from 2 x 2 factorial experiment without interaction

               No drug   Drug
No radiation   18        15
Radiotherapy   13        10

When, from both a statistical and scientific point of view, there is no important interaction

between the two factors, it is possible to use the factorial experiment to increase the generality of

the results without increasing the size of the experiment. For instance, the mean results of the

experiment in Table 3 indicate complete additivity of the two factors. Suppose that for each

treatment condition n animals were tested. We can estimate the effect of the drug in animals

without radiation as (15 – 18) and in those with radiotherapy as (10 – 13). We can combine these two

estimates and calculate the standard error on this new estimate. The latter would be based on 4 x

(n ‐ 1) degrees of freedom since all data from Table 3 are used. The results obtained in this way would be

more general, in the sense that we can say that irrespective of radiotherapy or not, the drug will

induce a decrease in tumor size of 3. Furthermore, using the same data, the effect of radiotherapy

can be estimated in drug treated and solvent treated animals.
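The arithmetic behind these statements can be written out for the cell means of Table 3. The following is a minimal sketch; the object names are chosen here for illustration only.

```r
# Cell means of Table 3 (rows: radiation, columns: drug).
means <- matrix(c(18, 15,
                  13, 10),
                nrow = 2, byrow = TRUE,
                dimnames = list(radiation = c("none", "radiotherapy"),
                                drug      = c("no drug", "drug")))

# Effect of the drug within each level of radiation, and their average.
drug_effects <- means[, "drug"] - means[, "no drug"]
drug_main    <- mean(drug_effects)

# Effect of radiotherapy within each level of the drug, and their average.
radio_effects <- means["radiotherapy", ] - means["none", ]
radio_main    <- mean(radio_effects)

# Interaction: sum of the diagonal minus sum of the off-diagonal elements.
gamma_hat <- (means[1, 1] + means[2, 2]) - (means[1, 2] + means[2, 1])

drug_effects; drug_main; radio_main; gamma_hat
```

Both drug effects are equal, so the interaction estimate is zero and the averaged drug effect applies with and without radiotherapy.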

6.3 THE RANDOMIZED COMPLETE BLOCK DESIGN

In the randomized complete block design the effect of a single factor is investigated in the presence

of a single isolated extraneous source of variability (block) closely related to the response. We

already discussed the use of blocking as a design tool for enhancement of the signal‐to‐noise ratio in

Section 5.7.2.3. In a randomized complete block design, all treatments are applied within each block

and treatments are compared within the blocks. The randomization procedure randomizes

treatments separately within each block. When a study is designed such that the number of

experimental units within each block and treatment is equal, the design is called a balanced

randomized complete block design. The randomized complete block design is identified as a one‐

way treatment design and a one‐way block error‐control design.
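The separate randomization within each block can be sketched as follows. The number of blocks, the treatment labels and the seed are assumptions made for the purpose of illustration.

```r
# Hypothetical sketch: randomized complete block design with
# four treatments and five blocks, one unit per treatment per block.
set.seed(42)                          # arbitrary seed
treatments <- c("A", "B", "C", "D")
n_blocks   <- 5

plan <- do.call(rbind, lapply(seq_len(n_blocks), function(b) {
  data.frame(block     = b,
             unit      = seq_along(treatments),
             treatment = sample(treatments))   # new permutation for every block
}))

print(table(plan$block, plan$treatment))        # every cell should equal 1
```

Because every treatment occurs once in every block, the design is balanced by construction.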

There are two main reasons for choosing a randomized complete block design over a completely

randomized design. Suppose there is an extraneous factor that is strongly related to the outcome of

the experiment. It would be most unfortunate if our randomization procedure yielded a design in

which there was a great imbalance on this factor. If this were the case, the comparisons between

treatment groups would be confounded with differences in this nuisance factor and be biased. The

second main reason for a randomized complete block design is its possibility to considerably reduce

the error variation in our experiment, thereby making the comparisons more precise. The main

objection to a randomized complete block design is that it makes the strong assumption that there is

no interaction between the treatment variable and the blocking characteristics, i.e. that the effect of

the treatments is the same among all blocks.

Table 4. Number of viable cardiomyocytes in a paired experiment

Rat No.   Solvent   Drug   Difference
1         44        68     24
2         64        88     24
3         62        81     19
4         46        54     8
5         76        92     16

When only two treatments are compared, the blocking can be simplified to a paired design. An

example of a paired experiment using animal as blocking factor is the following real‐life experiment.

Isolated cardiomyocytes provide an easy tool to assess the effect of drugs on calcium‐overload.

Cardiomyocytes of a single animal are isolated and seeded in plastic Petri dishes. The Petri dishes are

treated with the experimental drug or with its solvent. After a stabilization period the cells are

exposed to a stimulating substance (e.g. veratridine) and the percentage viable, i.e. rod‐shaped,

cardiomyocytes in a dish is counted. Although comparison of the treatment with the solvent control

within a single animal provides the best precision, it lacks external validity. Therefore, a paired

experiment with myocytes from different animals and with animal as blocking factor was carried

out. For each animal two Petri dishes containing exactly 100 cardiomyocytes were prepared. From

the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while

the remaining member received the vehicle. After stabilization and exposure to the stimulus, the

number of viable cardiomyocytes in each Petri dish was counted. The resulting data are contained in

Table 4 and displayed in Figure 11.

Figure 11. The gain in efficiency induced by blocking illustrated in a paired design. In the left panel the

myocyte experiment is considered as a completely randomized design and the two samples largely overlap

each other. In the right panel the lines connect the data of the same animal and show a marked effect of

treatment.

There are 10 experimental units, since Petri dishes can be independently assigned to vehicle or drug.

However, the statistical analysis should take the particular structure of the experiment into account.

More specifically, the pairing has imposed restrictions on the randomization such that data obtained

from one animal cannot be freely interchanged with that from another animal. Therefore, for each

pair the difference between the solvent‐treated and drug‐treated Petri dish is evaluated as is

illustrated in the right panel of Figure 11. For each pair the drug‐treated Petri dish yields consistently

a higher result than its vehicle control counterpart. Since the different pairs (animals) are

independent from one another, the mean difference and its standard error can be calculated and a

Student t‐test can be carried out. The mean difference is 18.2 with a standard error of 2.97.
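The paired analysis can be reproduced from the data of Table 4. The sketch below uses only base R; the variable names are chosen here for illustration.

```r
# Data of Table 4: viable cardiomyocytes per Petri dish, one pair per rat.
solvent <- c(44, 64, 62, 46, 76)
drug    <- c(68, 88, 81, 54, 92)

d         <- drug - solvent                  # within-animal differences
mean_diff <- mean(d)                         # mean difference
se_diff   <- sd(d) / sqrt(length(d))         # its standard error

paired <- t.test(drug, solvent, paired = TRUE)   # Student t-test on the pairs

round(c(mean = mean_diff, se = se_diff), 2)
paired
```

The t statistic reported by `t.test` is simply the mean difference divided by its standard error, with 4 degrees of freedom.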

Suppose the experimenter had not used blocking, i.e. had used myocytes originating

from 10 completely different animals. The 10 Petri dishes would then be randomly distributed over

the two treatment groups. Suppose also that the results of this new experiment are identical to

those obtained in the paired experiment. As is illustrated in the left panel of Figure 11, the two

groups largely overlap one another. Since all experimental units are now independent of one

another, the effect of the drug is evaluated by calculating the difference between the two mean

values and its standard error¹. Obviously, the mean difference is the same as in the paired

experiment. However, the standard error on the mean difference has risen considerably to a value of

9.18, i.e. use of blocking induced a threefold increase in the precision of the experiment.
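The comparison of the two standard errors can be verified with the values of Table 4, treating them once as two independent groups and once as pairs. This is a minimal sketch with illustrative variable names.

```r
solvent <- c(44, 64, 62, 46, 76)
drug    <- c(68, 88, 81, 54, 92)
n       <- length(solvent)                   # units per treatment group

# Unpaired: SD is the square root of the mean of the two group variances.
sd_pooled   <- sqrt((var(solvent) + var(drug)) / 2)
se_unpaired <- sd_pooled * sqrt(2 / n)

# Paired: standard error of the mean within-animal difference.
se_paired <- sd(drug - solvent) / sqrt(n)

round(c(unpaired = se_unpaired, paired = se_paired,
        ratio = se_unpaired / se_paired), 2)
```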

This example demonstrates that carrying out a paired experiment has the potential to enhance the

precision of the experiment considerably, while the conclusions have the same validity as in a

completely randomized experiment. However, the forming of blocks of experimental units is only

successful when the criterion upon which the pairing is based is related to the outcome of the

experiment. Using a characteristic that does not have an important effect on the response variables

as blocking factor is worse than useless since the statistical analysis will lose a bit of power by taking

the blocking into account. This can be of particular importance for small sized experiments.

6.4 THE LATIN SQUARE DESIGN

The Latin square design is an extension of the randomized complete block design, but now blocking

is done on two characteristics that affect the response variable. An obvious example from laboratory

practice is the simultaneous control of the row and column effects in microtiter plates, without loss

of precision. Another example concerns experiments on neuronal protection, where a pair of

animals was tested each day and the investigator expected a systematic difference not only between

the pairs but also between the animal tested in the morning and the one tested in the afternoon.

¹ The standard error on the difference between two means is equal to $SD \sqrt{2/n}$, where n is the number of experimental units in each group and SD is the square root of the mean of the two individual variances of the treated and control group. The variance is defined as the square of SD.

In a Latin square design the k treatments are arranged in a k x k square such as in Table 5. Each of

the four treatments A, B, C, or D occurs exactly once in each row and exactly once in each column.

The Latin square design is categorized as a one‐way treatment design and two‐way block error

control design.

The main advantage of the Latin square design is that it simultaneously balances out two sources of

error. The disadvantage is the strong assumption that there are no interactions between the

blocking variables or between the treatment variable and blocking variables. In addition, Latin square

designs are limited by the fact that the number of treatments, number of rows, and number of

columns must all be equal. Fortunately, there are arrangements that do not have this limitation

(Cox, 1958).

Table 5. Arrangement for a 4 x 4 Latin square controlling for column and row effects. The letters A‐D indicate

the four treatments

        Column 1   Column 2   Column 3   Column 4
Row 1   B          A          D          C
Row 2   C          B          A          D
Row 3   A          D          C          B
Row 4   D          C          B          A
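An arrangement like the one in Table 5 can be generated by starting from a cyclic square and randomly permuting its rows, its columns and the treatment labels. This is one common construction and is offered as a sketch only; it does not sample uniformly from all possible Latin squares, and the seed is arbitrary.

```r
# Hypothetical sketch: randomized k x k Latin square.
set.seed(5)                           # arbitrary seed
k          <- 4
treatments <- LETTERS[1:k]

# Cyclic square: entry (i, j) holds ((i + j - 2) mod k) + 1.
cyclic <- outer(1:k, 1:k, function(i, j) (i + j - 2) %% k + 1)

# Permute rows and columns, then assign the treatments in random order.
shuffled <- cyclic[sample(k), sample(k)]
square   <- matrix(sample(treatments)[shuffled], nrow = k,
                   dimnames = list(paste("Row", 1:k), paste("Column", 1:k)))
square
```

Permuting rows and columns preserves the defining property that every treatment occurs exactly once in each row and in each column.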

In a k x k Latin square, only k experimental units are assigned to each treatment group. However, it

may happen that more experimental units are required to obtain an adequate precision. The Latin

square can then be replicated and several squares can be used to obtain the necessary sample size.

In doing this, there are two possibilities to consider. Either one stacks the squares on top of each

other and keeps them as separate independent squares, or one completely randomizes the order of

the rows (or columns) of the design. In general, keeping the squares separate is not a good idea and

leads to less precise estimation and loss of degrees of freedom, especially in the case of a 2 x 2

square. It is only when there is a reason to believe that the column (or row) effects are different

between the squares that keeping them separate makes sense.

6.5 SPLIT PLOT DESIGNS

This type of design incorporates subsampling and allows one to make comparisons among different

treatments at two or more sampling levels. The split plot design allows assessment of the effect of

two independent factors using different experimental units. The term split plot originates from

agricultural research where fields are randomly assigned to different levels of a primary factor and

smaller areas within the fields are randomly assigned to one level of another secondary factor.

An example of a split plot design is the following experiment on diets and vitamins. Cages, each

containing two mice, were assigned at random to a number of dietary treatments (i.e. cage was

the experimental unit for comparing diets), and the color‐marked mice within the cage were

randomly selected to receive one of two vitamin treatments by injection (i.e. mice were the

experimental units for the vitamin effect).
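The two levels of randomization in this example can be sketched as follows. The number of cages, the number of diets and all labels are assumptions, since the text does not state them.

```r
# Hypothetical sketch: split plot randomization with cage as whole plot
# (diet) and mouse within cage as subplot (vitamin).
set.seed(7)                                   # arbitrary seed
diets   <- c("Diet 1", "Diet 2", "Diet 3")    # assumed: three diets
n_cages <- 12                                 # assumed: four cages per diet

# Whole-plot level: cages are randomly assigned to diets.
cage_diet <- sample(rep(diets, length.out = n_cages))

# Subplot level: within every cage the two marked mice are randomly
# assigned to one of the two vitamin treatments.
plan <- do.call(rbind, lapply(seq_len(n_cages), function(cage) {
  data.frame(cage    = cage,
             diet    = cage_diet[cage],
             mouse   = c("mark 1", "mark 2"),
             vitamin = sample(c("Vitamin X", "Vitamin Y")))
}))
head(plan)
```

Diets are thus compared between cages, whereas vitamins are compared between mice within the same cage.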

The split‐plot design is a two‐way crossed (factorial) treatment design and a split plot error design.

Use of the split plot design as described above is less common in the field of biomedical research.

However, when the second factor is time its use is more frequent. This special type of split‐plot

design is then called a repeated measures design.

6.6 REPEATED MEASURES DESIGNS

The repeated measures design is a special case of the split plot design. In a repeated measures

design, we typically take multiple measurements on a subject over time. If any treatment is applied

to the subjects, they immediately become the whole plots, and “Time” is the subplot factor. The

major disadvantage of repeated measures designs is the presence of carry‐over effects by which the

results obtained for a treatment are influenced by the previous treatment. In addition, any

confounding of the treatment effect with time, as was the case in the use of self‐controls in Section

5.7.1.1, must be avoided. Therefore, a parallel control group must always be included in these

designs.

7 THE REQUIRED NUMBER OF REPLICATES – SAMPLE SIZE

7.1 DETERMINING SAMPLE SIZE IS A RISK – COST ASSESSMENT

Replication is the basis of all experimental design and a natural question that arises in each study is

how many replicates are required. The more replicates, the more confidence we have in our

conclusions. Therefore, we would prefer to carry out our experiment on a sample that is as large as

possible. However, increasing the number of replicates incurs a rise in cost. Thus, the answer to how

large an experiment should be is that it should be just big enough to give confidence that any

biologically meaningful effect that exists can be detected.

7.2 THE CONTEXT OF BIOMEDICAL EXPERIMENTS

The estimation of the appropriate size of the experiment is straightforward and depends on the

statistical context, the assumptions made, and the study specifications. Context and specifications

in their turn depend on the study objectives and the design of the experiment.

In practice, the most frequently encountered contexts in statistical inference are point estimation,

interval estimation and hypothesis testing, of which hypothesis testing is definitely the most

important in biomedical studies.

7.3 THE HYPOTHESIS TESTING CONTEXT – THE POPULATION MODEL

In the hypothesis testing context one defines a null hypothesis and, for the purpose of sample size

estimation, an alternative hypothesis of interest. The null hypothesis will often be that the response

variable does not really depend on the treatment condition. For example, one may state as a null

hypothesis that the population means of a particular measurement are equal under two or more

different treatment conditions and that any differences found can be attributed to chance.

At the end of the study when the data are analyzed, we will either fail to reject the null hypothesis or reject it

in favor of the alternative hypothesis. As is indicated in Table 6, there are four possible

outcomes at the end of the experiment. When the null hypothesis is true and we failed to reject it

then we have made the correct decision. This is also the case when the null hypothesis is false and

we did reject it. However, there are two decisions that are erroneous. If the null hypothesis is true

and we incorrectly rejected it, then we made a false positive decision. Conversely, if the alternative

hypothesis is true (i.e. the null hypothesis is false) and we failed to reject the null hypothesis we

have made a false negative decision.

Table 6. The decision process in hypothesis testing

                                State of Nature
Decision made                   Null hypothesis true       Alternative hypothesis true
Do not reject null hypothesis   Correct decision (1 – α)   False negative β
Reject null hypothesis          False positive α           Correct decision (1 – β)

The basis of sample size calculation is formed by specifying an allowable rate of false positives and

an allowable rate of false negatives for a particular alternative hypothesis and then estimating a

sample size just large enough so that these low error rates can be achieved. The allowable rate for

false positives is called the level of significance or alpha level and is usually set at values of 0.01,

0.05, or 0.10. The false negative rate depends on the postulated alternative hypothesis and is usually

described by its complement, i.e. the probability of rejecting the null hypothesis when the

alternative hypothesis holds. This is called the power of the statistical test of significance. Power

levels are usually 80% or 90%.

Significance level and power are already two of the four major determinants of the sample size for

hypothesis testing. The remaining two are the inherent variability in the study parameter of interest

and the size of the difference to be detected in the postulated alternative hypothesis. Other key

factors that determine the sample size are the number of treatments and the number of blocks used

in the experimental design.

When the significance level decreases or the power increases, the required sample size will become

larger. Similarly, when the variability is larger or the difference to be detected smaller, the required

sample size will also become larger. Conversely, when the difference to be detected is large or

variability low, the required sample size will be small.
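These relationships can be explored with the function `power.t.test` from base R, used here instead of the pwr package. The sketch assumes a two‐sample, two‐sided t test; the particular values of the difference, SD, significance level and power are chosen for illustration only.

```r
# Required number of units per group for a two-sample, two-sided t test.
n_per_group <- function(delta, sd = 1, sig.level = 0.05, power = 0.80) {
  power.t.test(delta = delta, sd = sd, sig.level = sig.level, power = power,
               type = "two.sample", alternative = "two.sided")$n
}

n_ref            <- n_per_group(delta = 0.8)                    # reference case
n_stricter_alpha <- n_per_group(delta = 0.8, sig.level = 0.01)  # lower alpha
n_higher_power   <- n_per_group(delta = 0.8, power = 0.90)      # more power
n_more_variable  <- n_per_group(delta = 0.8, sd = 1.5)          # larger SD
n_smaller_diff   <- n_per_group(delta = 0.5)                    # smaller difference

# Round up, since only whole experimental units can be used.
ceiling(c(reference = n_ref, stricter_alpha = n_stricter_alpha,
          higher_power = n_higher_power, more_variable = n_more_variable,
          smaller_difference = n_smaller_diff))
```

Each of the four changes should give a larger sample size than the reference case.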

It is convenient to express the difference in means as effect size by dividing it by the standard

deviation. The effect size then takes both the difference and inherent variability into account. Cohen

(1988) argues that effect sizes of 0.2, 0.5 and 0.8 can be regarded respectively as small, medium and

large.

7.4 SAMPLE SIZE CALCULATIONS

7.4.1 POWER ANALYSIS COMPUTATIONS

Now that we are familiar with the concepts of hypothesis testing and the determinants of sample

size, we can proceed with the actual calculations. There is a significant amount of free software

available to make elementary sample size calculations. In particular, there is the R‐package pwr

(Champely, 2009).

Assume we wish to plan an experiment comparing the mean values of two treatment groups. We

want to reject the null hypothesis of no difference at a level of significance of 0.05, whatever the

direction of the difference between the two samples (i.e. a two‐tailed test). We want to detect a

sufficiently large effect between the two groups (d = 0.8) with a power of 80%. The calculations carried

out in R show that 26 experimental units are required in each of the two treatment groups.

> library(pwr)
> pwr.t.test(d=0.8,power=0.8,sig.level=0.05,type="two.sample",
+ alternative="two.sided")

     Two-sample t test power calculation

              n = 25.52457
              d = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

A quick and dirty method for sample size calculation in a two‐group comparison with a power of 0.8 and

a two‐sided critical value of 0.05 is provided by Lehr’s equation (Lehr, 1992; Van Belle, 2008):

$n = 16 / \Delta^{2}$

where Δ represents the effect size and n stands for the required sample size in each treatment

group. For the above example, the equation results in: 16/0.64 = 25 animals per treatment group.

The numerator of Lehr’s equation depends on the desired power. Alternative values for the

numerator are 8 and 21 for a power of 50% and 90%, respectively.
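Lehr's rule of thumb is easily wrapped in a small function and compared with an iterative power calculation. In this sketch the function name `lehr_n` is hypothetical and `power.t.test` from base R stands in for the pwr package.

```r
# Lehr's approximation: n per group = numerator / (effect size)^2.
# The numerator is 16 for 80% power (8 for 50%, 21 for 90%).
lehr_n <- function(effect_size, numerator = 16) {
  numerator / effect_size^2
}

n_lehr  <- lehr_n(0.8)                               # 16 / 0.64
n_exact <- power.t.test(delta = 0.8, sd = 1, power = 0.80,
                        sig.level = 0.05)$n          # iterative solution

round(c(lehr = n_lehr, exact = n_exact), 2)
lehr_n(0.8, numerator = 21)                          # rule of thumb for 90% power
```

The approximation falls slightly below the iterative result, which is why it is best regarded as a first rough indication.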

7.4.2 MEAD’S RESOURCE REQUIREMENT EQUATION

There are occasions when it is difficult to use a power analysis, because there is no information on

the inherent variability (i.e. standard deviation) and/or because the effect size of interest is difficult

to specify. An alternative, quick and dirty method for approximate sample size determination was

proposed by Mead (1988). The method is appropriate for comparative experiments which can be

analyzed using analysis of variance (Grafen and Hails, 2002; Kutner et al., 2004), such as:

• Exploratory experiments

• Complex biological experiments with several factors and treatments

• Any experiment where the power analysis method is not possible or practicable

The method depends on the law of diminishing returns. Adding one experimental unit to a small

experiment gives good returns, while adding it to a large experiment does not do so. It has been

used by statisticians for decades, but has been explicitly justified by Mead (1988). An appropriate

sample size can be roughly determined by the number of degrees of freedom for the error term in

the analysis of variance or t test given by the formula:

N - 1 = E + T + B

where E, T and B are the error, treatment and block degrees of freedom (number of occurrences or

levels minus 1) in the ANOVA, and N stands for the total sample size. In order to obtain a good

estimate of error it is necessary to have at least 10 degrees of freedom, and many statisticians would

take 12 or 15 degrees of freedom as their preferred lower limit. On the other hand, if E is allowed to

be large, say greater than 20, then the experimenter is wasting resources. In a non‐blocked design

the total number of animals minus the number of treatments should be between ten and twenty.

Suppose an experiment is planned with four treatments, with eight animals per group (32 rats total).

In this case N=32, B=0 (no blocking), T=3, hence E=28. This experiment is a bit too large, and six

animals per group might be more appropriate (N = 24, so E = 23 – 3 = 20).

There is one problem with this simple equation. It appears as though blocking is "bad" because it

reduces the error degrees of freedom. If, in the above example, the experiment were to be done in

eight blocks, then N=32, B=7, T=3 and E= 32‐1‐7‐3 =21 instead of 28. However, blocking nearly

always reduces the inherent variation which more than compensates for the decrease in the error

degrees of freedom unless the experiment is very small and the blocking criterion was not well

related to the response. Provided the error degrees of freedom is not less than about 6 in a blocked

design, then the experiment is probably of an adequate size.
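The resource equation can be rearranged to E = (N – 1) – T – B and evaluated for the examples in the text. The function name `error_df` and its arguments are introduced here for illustration only.

```r
# Error degrees of freedom according to the resource equation
# N - 1 = E + T + B, with T and B equal to the number of levels minus 1.
error_df <- function(n_total, n_treatments, n_blocks = 1) {
  t_df <- n_treatments - 1
  b_df <- n_blocks - 1            # one block means no blocking, so B = 0
  (n_total - 1) - t_df - b_df
}

e_eight_per_group <- error_df(32, n_treatments = 4)                 # 8 per group
e_six_per_group   <- error_df(24, n_treatments = 4)                 # 6 per group
e_blocked         <- error_df(32, n_treatments = 4, n_blocks = 8)   # 8 blocks

c(eight_per_group = e_eight_per_group, six_per_group = e_six_per_group,
  blocked = e_blocked)
```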

7.4.3 MULTIPLICITY AND SAMPLE SIZE

As we will see in Section 8.5, when more statistical tests are to be carried out on the data the overall

rate of false positive findings is higher than the false positive rate for each one separately. To

circumvent this inflation of the false positive error rate, the critical value of each individual test is

usually set at a more stringent level (e.g. Bonferroni’s adjustment). However, as we already noted

above, when the significance level is set at a lower value, the required sample size will necessarily

increase.

It can be shown that, for a power of 80% or 90%, carrying out two independent statistical tests

instead of one requires a 20% larger sample size to maintain the overall error rate at its level of

0.05. Similarly, when 3 or 4 independent tests are involved the required sample size increases by

30% or 40%, respectively.
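The order of magnitude of these increases can be checked with a normal approximation, in which the sample size is proportional to the square of the sum of two standard normal quantiles. The sketch assumes a Bonferroni adjustment and two‐sided tests; the text does not state which derivation was used, so this is one possible route and not necessarily the original one.

```r
# Approximate inflation of the sample size when k tests are carried out
# at level alpha / k (Bonferroni) instead of one test at level alpha.
inflation <- function(k, alpha = 0.05, power = 0.80) {
  z_sum <- function(a) qnorm(1 - a / 2) + qnorm(power)
  (z_sum(alpha / k) / z_sum(alpha))^2
}

ratios_80 <- sapply(2:4, inflation)                  # power of 80%
ratios_90 <- sapply(2:4, inflation, power = 0.90)    # power of 90%

round(rbind(power_80 = ratios_80, power_90 = ratios_90), 2)
```

For a power of 80% the ratios come out at roughly 1.2, 1.3 and 1.4; for a power of 90% they are slightly smaller.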

8 THE STATISTICAL ANALYSIS

8.1 THE STATISTICAL TRIANGLE

There is a one‐to‐one correspondence between the study objectives, the study design and the

analysis. The objectives of the study will indicate which of the designs may be considered. Once a

study design is selected, it will in turn determine which type of analysis is appropriate. This

principle that the statistical analysis is determined by the way the experiment is conducted was

enunciated by Fisher in 1935:

"All that we need to emphasize immediately is that, if an experiment does allow us to calculate a

valid estimate of error, its structure must completely determine the statistical procedure by which

this estimate is to be calculated. If this were not so, no interpretation of the data could ever be

unambiguous; for we could never be sure that some other equally valid method of interpretation

would not lead to a different result."

In other words, choice of the statistical methods follows directly from the objectives and design of

the study. With this in mind, many of the complexities of the statistical analysis have now almost

become trivial.

8.2 THE STATISTICAL MODEL REVISITED

We already stated that every experimental design is underpinned by a statistical model and that the

experimental results should be considered as being generated by this experimental model. This

conceptual framework greatly simplifies the statistical analysis to just fitting the statistical model to

the data and comparing the model component related to the treatment effect with the error

component (Grafen and Hails, 2002; Kutner et al., 2004). Hence, the choice of the appropriate

statistical analysis is straightforward. However, some important statistical issues remain, such

as the type of data and the assumptions we make about the distribution of the data.
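As an illustration of fitting the model that corresponds to the design, the paired cardiomyocyte data of Table 4 can be analyzed as a randomized complete block design with animal as block. The sketch below uses `lm` and `anova` from base R; the data frame layout and variable names are assumptions made for this example.

```r
# Table 4 in long format: one row per Petri dish.
dat <- data.frame(
  rat       = factor(rep(1:5, times = 2)),
  treatment = factor(rep(c("solvent", "drug"), each = 5),
                     levels = c("solvent", "drug")),
  viable    = c(44, 64, 62, 46, 76,    # solvent, rats 1 to 5
                68, 88, 81, 54, 92)    # drug, rats 1 to 5
)

# Model: response = block effect + treatment effect + error.
fit <- lm(viable ~ rat + treatment, data = dat)
tab <- anova(fit)
tab

# The treatment component is compared with the error component by an F test.
f_treatment <- tab["treatment", "F value"]
t_paired    <- t.test(dat$viable[dat$treatment == "drug"],
                      dat$viable[dat$treatment == "solvent"],
                      paired = TRUE)$statistic
c(F = f_treatment, t_squared = unname(t_paired)^2)
```

With two treatments the F statistic for treatment equals the square of the paired t statistic, which shows that the paired t‐test is just a special case of fitting the block model.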

8.3 TYPES OF DATA

It is important to distinguish between continuous and discrete data. A continuous variable can, in

theory, take on an infinite number of values, but in practice, only a finite number of digits are kept. It

is the number and frequency of values that really make the distinction between continuous and

discrete.

A variable that takes on only a small number of values, say four or less, is called a discrete or

categorical variable and should be analyzed accordingly (Agresti, 1990). A variable whose values

encompass a wide range and for which many different values are observed is usually considered

continuous and in turn requires specific methods for analysis of continuous data.

Within these broad categories of discrete and continuous data, the techniques for analysis differ. For

discrete data one further distinguishes between interval, ordinal, nominal and binary or dichotomous data.

For a continuous variable, it can be assumed that the distribution of the error component of the

model is of a particular form. The most commonly used distributional form in statistics is that of the

normal or Gaussian distribution, which has been found to represent diverse sets of data. However, it

can be of interest to consider alternative distributions such as the log‐normal, exponential, gamma,

etc. Alternatively, one could apply methods that do not depend on an underlying distribution

(Lehmann, 1975). Both approaches have their advantages and disadvantages.

The final result of the statistical analysis is in many cases a p‐value. Unfortunately the concept of p‐

value is often misunderstood. It is related to but not the same as the level of significance α in the

hypothesis testing framework. Once the experiment, which was designed to test a specific null

hypothesis, is carried out the experimental data are used to calculate a quantity which we will call a

test statistic. The corresponding p‐value is then calculated as the probability of obtaining a test

statistic that is as extreme or more extreme than the one observed, provided the null hypothesis is

true. When this p‐value is less than the level of significance α, the result is declared significant.
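The definition can be made concrete with the within‐animal differences of Table 4. The sketch assumes a two‐sided Student t‐test of the null hypothesis that the mean difference is zero and a significance level of 0.05.

```r
# Differences (drug minus solvent) from Table 4.
d <- c(24, 24, 19, 8, 16)

# Test statistic: mean difference divided by its standard error.
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))

# p-value: probability, under the null hypothesis, of a t statistic at
# least as extreme as the observed one (both tails), on n - 1 df.
p_value <- 2 * pt(-abs(t_stat), df = length(d) - 1)

alpha <- 0.05
c(t = t_stat, p = p_value)
p_value < alpha          # TRUE means the result is declared significant
```

The value of `p_value` agrees with the one reported by `t.test(d)`.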

8.4 VERIFYING THE STATISTICAL ASSUMPTIONS

When the inferential results are sensitive to the distributional and other assumptions of the

statistical analysis, it is essential that these assumptions are also verified. The aptness of the

statistical model is preferably assessed by informal methods such as diagnostic plotting (Grafen and

Hails, 2002; Kutner et al., 2004). When planning the experiment, historical data, or the results of the

pilot experiment can already be used for a preliminary verification of the model assumptions.

Another option is to use statistical methods that are robust against departures from the assumptions

(Lehmann, 1975).
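As an illustration of diagnostic plotting, the sketch below fits a one-way model to simulated (hypothetical) data and draws two common residual plots. The group means, the standard deviation and the seed are arbitrary assumptions.

```r
# Hypothetical data: a response measured in three treatment groups
set.seed(1)  # arbitrary seed, for reproducibility only
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 8)),
  y     = rnorm(24, mean = rep(c(10, 12, 15), each = 8), sd = 1.5)
)

fit <- lm(y ~ group, data = dat)  # one-way analysis of variance model

op <- par(mfrow = c(1, 2))
# Residuals versus fitted values: look for unequal spread or outliers
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
# Normal quantile plot: look for systematic departures from the line
qqnorm(resid(fit))
qqline(resid(fit))
par(op)
```

The same two plots could be drawn from historical or pilot data at the planning stage.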

8.5 MULTIPLICITY

In Section 4.2 we already mentioned that it is wise to limit the number of objectives in a study. The

same problem of multiplicity arises when a study includes a large number of variables, or when

measurements are made at a large number of time points. Only in studies of the most exploratory

nature is the statistical analysis of every possible variable acceptable. In this case, the exploratory

nature of the study should be stressed and the results interpreted with great care. The reason is that

in comparative experiments the overall rate of false positive findings across all variables is much

higher than the false positive rate for each one separately.

This is illustrated by the following example. Suppose a drug is tested at 20 different doses on a

variable. Suppose we reject the null hypothesis of no treatment effect for each dose separately


when the probability of falsely rejecting the null hypothesis (the significance level) is less than or

equal to 0.05. Then, assuming the 20 tests are independent, the overall probability of falsely declaring

the existence of a treatment effect when all underlying null hypotheses are in fact true is about 0.64.

This means that we are more likely to get at least one significant result than not. The same multiplicity problem arises when a single dose of the

drug is tested on 20 independent variables. The problem of multiplicity is of particular importance

and magnitude in gene expression microarray experiments (Bretz et al., 2005). For example, suppose a microarray

experiment examines the differential expression of 30,000 genes in wildtype and in a mutant.

Assume that for each gene an appropriate two‐sided two‐sample test is performed at the 5%

significance level. Then, if none of the genes is differentially expressed, we expect to obtain roughly 1,500 false positives. Strategies for dealing with

this curse of multiplicity in microarrays are provided by Amaratunga and Cabrera (2004) and Bretz et al. (2005).

The multiplicity problem must at least be recognized at the planning stage. Ways to deal with it

(Curran‐Everett, 2000; Bretz et al., 2010) should be investigated and specified in the protocol.
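The arithmetic behind both examples, and one possible way of adjusting p-values, can be sketched in R as follows. The five unadjusted p-values are hypothetical, and the calculation for 20 tests assumes that the tests are independent. Which adjustment is appropriate depends on the study and should be fixed in the protocol.

```r
alpha <- 0.05

# Probability of at least one false positive in 20 independent tests,
# when all 20 null hypotheses are true
fwer_20 <- 1 - (1 - alpha)^20

# Expected number of false positives among 30,000 true null hypotheses
expected_fp <- 30000 * alpha

# Hypothetical unadjusted p-values from five comparisons
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.200)

# Adjustments available in base R through p.adjust()
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")
p_holm       <- p.adjust(p_raw, method = "holm")
p_bh         <- p.adjust(p_raw, method = "BH")  # false discovery rate
```

Here `fwer_20` is roughly 0.64 and `expected_fp` equals 1,500, in line with the figures quoted above.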

9 THE STUDY PROTOCOL

Writing the protocol concludes the research design phase. Every study should have a

written protocol before it is started. The complete study protocol consists of a more conceptual

research protocol and the technical protocol, which we already discussed in Section 5.7.1.3. The

research protocol states the rationale for performing the study, the study’s objectives, the related

hypotheses and working hypotheses that are tested and their consequential predictions. It should

contain a section on experimental design, how treatments will be assigned to experimental units,

information and justification of planned sample sizes, and a description of the statistical analysis that

is to be performed. Defining the statistical methods in the protocol is of importance since it allows

preparation of the data analytic procedures beforehand and ensures against the misleading practice

of data dredging or data snooping. Writing down the statistical analysis plan beforehand also prevents

the investigator from trying several methods of analysis and reporting only those results that suit him or her.

Such a practice is of course inappropriate, unscientific, and unethical.

Many investigators consider writing a detailed protocol a waste of time. However, the “smart

researcher” understands that by writing a good protocol he is actually preparing his final study

report. A well‐written protocol is even more essential when the design is complex or the study is

collaborative. Once the protocol has been formalized, it is important that it is followed as closely as

possible and that every deviation from it is documented.

1 This probability (Section 8.5) is in fact $1 - (1 - 0.05)^{20} \approx 0.64$.


10 INTERPRETATION AND REPORTING

While the previous sections focused on the planning phase of the study with the protocol as final

deliverable, this section deals with some points to consider when interpreting and reporting the

results of the statistical analysis. As a general rule in writing reports containing statistical

methodology, it is recommended not to use technical statistical terms such as “random”, “normal”,

“significant”, “correlation”, and “sample” in their everyday meaning.

10.1 THE METHODS SECTION

The Methods section should contain details of the experimental design such as the size and number

of experimental groups, how experimental units were assigned to treatment groups, how

experimental outcomes were assessed, and what statistical and analytical methods were used.

10.1.1 EXPERIMENTAL DESIGN

Readers should be told about weaknesses and strengths in study design; for example, the use of randomization

and/or blinding adds to the reliability of the data. A detailed description of the

randomization and blinding procedures, and how and when these were applied, will allow the reader

to judge the quality of the study. The reasons for blocking, the blocking factors, and the way

blocking was dealt with in the statistical analysis should be given. When there is ambiguity about the

experimental unit, the unit used in the statistical analysis should be specified and a justification for

its choice should be provided.

10.1.2 STATISTICAL METHODS

Statistical methods should be described with enough detail to enable a knowledgeable reader with

access to the original data to verify the reported results. The authors should report and justify which

methods they used. A term like “tests of significance” is too vague; the specific tests should be named.

The level of significance and, when applicable, direction of statistical tests should be specified, e.g.

“two‐sided p‐values less than or equal to 0.05 were considered to indicate statistical significance”.

Some procedures, e.g. analysis of variance, chi‐square tests, etc. are by definition two‐sided. Issues

about multiplicity (Section 8.5) and a justification of the strategy that deals with them should also be

addressed here.

The software used in the statistical analysis and its version should also be specified. When the R

system is used (see Dalgaard, 2002), both R and the packages that were used should be referenced (see footnote 1).

1 The R function citation("pkgname") yields the required reference for a package; citation() without arguments yields the reference for R itself.
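A brief sketch of how these references might be retrieved in an R session is shown below. The package "stats" is used only because it ships with every R installation; any installed package name could be substituted.

```r
# Reference for R itself, including the version-specific year
ref_r <- citation()

# Reference for a package; "stats" is part of the base distribution
ref_pkg <- citation("stats")

# The running R version, to be quoted in the Methods section
r_version <- R.version.string

# BibTeX form of the reference, convenient for reference managers
bib <- toBibtex(ref_r)
```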


10.2 THE RESULTS SECTION

10.2.1 SUMMARIZING THE DATA

The number of experimental units used in the analysis should always be clearly specified. When

possible, findings should be quantified and presented with appropriate indicators of measurement

error or uncertainty. As measures of spread and precision, standard deviations (SD) and standard

errors (SEM) should not be confused. Standard deviations are a measure of spread and as such a

descriptive statistic, while standard errors are a measure of precision of the mean. Normally

distributed data should preferably be summarized as mean (SD), not as mean ± SD. For non‐normally

distributed data, medians and inter‐quartile ranges are the most appropriate summary statistics. The

practice of reporting mean ± SEM should preferably be replaced by the reporting of confidence

intervals which are more informative. Extremely small datasets should not be summarized at all

but should preferably be reported or displayed as raw data.

When reporting SD (or SEM) one should realize that for positive variables such as concentrations,

durations, and counts, the mean minus 2 × SD (or minus 2 × SEM × √n, with n the sample size), which indicates the lower 2.5%

point of the distribution, can lead to a nonsensical negative value. In this case, an appropriate 95%

confidence interval based on the lognormal distribution, or alternatively a distribution‐free interval,

will avoid such a pitfall.
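The pitfall, and one way around it, can be illustrated with a small sketch in R. The concentrations are hypothetical. The interval is computed on the log scale and back-transformed, so it refers to the geometric mean rather than the arithmetic mean.

```r
# Hypothetical, right-skewed concentrations (all positive)
conc <- c(0.4, 0.7, 1.1, 1.5, 2.2, 3.8, 9.6, 14.3)

m <- mean(conc)
s <- sd(conc)
lower_naive <- m - 2 * s  # negative, although a concentration cannot be

# 95% confidence interval computed on the log scale and back-transformed;
# both limits are positive by construction
ci_log   <- exp(t.test(log(conc), conf.level = 0.95)$conf.int)
geo_mean <- exp(mean(log(conc)))
```

Because the limits are obtained by exponentiation, they can never fall below zero.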

Figure 12. Scatter diagram with indication of median values and 95% distribution‐free confidence intervals (vertical axis: occurrences, 0 to 20; groups M a, M b, F a and F b)

Spurious precision detracts from a paper’s readability and credibility. Therefore, unnecessary

precision, particularly in tables, should be avoided. When presenting means and standard deviations,

it is important to bear in mind the precision of the original data. Means should be given to one

decimal place more than the raw data. Standard deviations and standard errors usually

require one further decimal place. Percentages should not be expressed to more than one

decimal place, and with samples of fewer than 100 the use of decimal places should be avoided.

Percentages should not be used at all for small samples. Note that these remarks about rounding

apply only to the presentation of results; rounding should not be done at all before or during the

statistical analysis.

10.2.2 GRAPHICAL DISPLAYS

Graphical displays complement tabular presentations of descriptive statistics. Generally, graphs are

better suited than tables for identifying patterns in the data, whereas tables are better for providing

large amounts of data with a high degree of numerical detail. When possible, one should always

attempt to graph individual data points. This is especially the case when treatment groups are small.

Graphs such as Figure 12 and Figure 13 are much more informative than the usual mean ± SEM

plots. These graphs are easily constructed using the R language. Specifically the R package

“beeswarm” (Eklund, 2010) can be of great help.
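A graph of individual data points with group medians can be sketched with base R alone, as shown below. The data and group labels are hypothetical. With the beeswarm package installed, its beeswarm() function could be used in place of stripchart(); we assume here that it accepts the same formula interface.

```r
# Hypothetical counts in four small treatment groups
dat <- data.frame(
  group = factor(rep(c("G1", "G2", "G3", "G4"), each = 6)),
  count = c(3, 5, 4, 8, 6, 2,
            7, 9, 12, 6, 10, 8,
            1, 4, 2, 5, 3, 3,
            11, 15, 9, 13, 18, 12)
)

# Plot every observation; jittering separates coinciding points
stripchart(count ~ group, data = dat, vertical = TRUE, method = "jitter",
           jitter = 0.1, pch = 16, xlab = "Group", ylab = "Occurrences")

# Add the group medians as short horizontal lines
med <- tapply(dat$count, dat$group, median)
pos <- seq_along(med)
segments(pos - 0.25, med, pos + 0.25, med, lwd = 2)
```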

Figure 13. Graphical display of longitudinal data showing individual patient profiles

10.2.3 INTERPRETING AND REPORTING SIGNIFICANCE TESTS

When data are summarized in the Results section, the statistical methods that were used to analyze

them should be specified. It is of little help to the reader to have in the Methods section a statement

such as “statistical methods included analysis of variance, regression analysis, as well as tests of

significance” without any reference to which specific procedure is reported in the Results part.


Tests of statistical significance should be two‐sided (two‐tailed). When comparing two means or

two proportions, there is a choice between a two‐sided or a one‐sided test. In a one‐sided test the

investigator’s alternative hypothesis specifies the direction of the difference, e.g. experimental

treatment greater than control. In a two‐sided test, no such direction is specified. A one‐sided test is

rarely appropriate and when one‐sided tests are used, their use should be justified (Bland and

Altman, 1994). For all two group comparisons, the report should clearly state whether one‐sided or

two‐sided p‐values are reported.

Exact p‐values, rather than statements such as “P<0.05” or even worse “NS” (not significant), should

be reported where possible. Readers can then compare the reported p‐value with their own choice

of critical values. One should avoid reporting a p‐value as p = 0.000 since a value with zero

probability of occurrence is, by definition, an impossible value. No observed event can ever have a

probability of zero. Therefore, such an extremely small p‐value must be reported as p < 0.001. In

rounding a p‐value, it happens that a value that is technically larger than the critical value of 0.05,

say 0.051, is rounded down to p = 0.05. This is inaccurate and, to avoid this error, p‐values should be

reported to the third decimal. If a one‐sided test is used and the result is in the wrong direction, then

the report must state that p>0.05 (Levine and Atkin, 2004), or even better report the complement of

the p‐value, i.e. 1 – p.
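The reporting rules for p-values lend themselves to a small helper function. The function below is a hypothetical illustration, not part of any package: it reports three decimals and never prints p = 0.000.

```r
# Hypothetical helper: report p-values to three decimals, never as 0.000
format_p <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  ifelse(p < 0.001, "p < 0.001", sprintf("p = %.3f", p))
}

# A value just above 0.05 keeps its third decimal instead of being
# rounded down to 0.05; a very small value is reported as an inequality
reported <- format_p(c(0.051, 0.00004, 0.2))
```

For these three inputs the function yields "p = 0.051", "p < 0.001" and "p = 0.200".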

An effect that is statistically significant is not necessarily of biomedical importance. Therefore, one

should avoid sole reliance on statistical hypothesis testing and preferably supplement one’s findings

with confidence intervals, which are more informative. Confidence intervals on a difference of means

or proportions provide information about the size of an effect and its uncertainty and are of

particular value when the results of the test fail to reject the null hypothesis.

There is a common misconception among scientists that a non‐significant result implies that the null

hypothesis can be accepted. Consequently, they conclude that there is no effect of the treatment or

that there is no difference between the treatment groups. However, from a philosophical point of

view, one can never prove the non‐existence of something. As Fisher (1935) clearly pointed out “it

should be noted that the null hypothesis is never proved or established, but is possibly disproved,

in the course of experimentation”. To state it otherwise: Lack of evidence is no evidence for lack

of effect. However, in the case of a non‐significant result a confidence interval will provide us with a

region of plausible values for the magnitude of the treatment effect.
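The following sketch, based on hypothetical data, shows how a confidence interval complements a non-significant test. Welch's two-sample t-test, the default of t.test(), is assumed only for illustration.

```r
# Hypothetical measurements for a control and a treated group
control <- c(10.1, 9.4, 10.8, 10.0, 9.5, 10.6)
treated <- c(10.5, 9.9, 11.3, 9.8, 10.9, 10.2)

res <- t.test(treated, control)  # Welch two-sample t-test, two-sided

p_value <- res$p.value           # larger than 0.05 for these data
ci_diff <- res$conf.int          # 95% CI for the difference in means
diff_means <- mean(treated) - mean(control)
```

The interval contains zero, but it also contains differences that may well be of practical importance, so the absence of an effect has not been demonstrated.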


11 CONCLUDING REMARKS

We have looked at the complexities of the research process from the vantage point of a generalist.

Statistical thinking was introduced as a non‐specialist generalist skill that permeates the entire

research process. The seven principles of statistical thinking were formulated as: 1) time spent

thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of

an experiment reflects the contributions from different sources of variability; 3) the design of an

experiment balances between its internal validity (proper control of noise) and external validity (the

experiment’s generalizability); 4) good experimental practice provides the clue to bias minimization;

5) good experimental design is the clue to the control of variability; 6) experimental design

integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar

of an effective experiment.

We elaborated on each of these and finally discussed some points to consider in the interpretation

and reporting. What we did not touch on was the role of the statistician in the research project. The

statistician is a professional particularly skilled in solving research problems. She should be

considered as a team member and often even as collaborator or partner in the research process in

which she can have a critical role. Whenever possible, the statistician should be consulted when

there is doubt with regard to design, sample size, or statistical analysis. A statistician working closely

together with a scientist can greatly improve the project’s likelihood of success. Many applied

statisticians become involved in the subject area and, by virtue of their statistical training, take on

the role of statistical thinker, thereby permeating the research process. In a great many instances

this key role of the statistician is recognized and rewarded with co‐authorship.

The most effective way to work with a consulting statistician is to include her or him from the very

beginning of the project. What should be avoided is contacting the statistical support group after the

experiment has reached its completion. Finally, to quote once more R.A. Fisher: “To consult the

statistician after an experiment is finished is often merely to ask him to conduct a post mortem

examination. He can perhaps say what the experiment died of.” (Presidential Address to the First

Indian Statistical Congress, 1938; Fisher, 1938).

12 RECOMMENDED READING

The book by David Salsburg “The Lady Tasting Tea” is a lucidly written account of the history of

statistics and how statistical thinking revolutionized 20th Century science. A clear and comprehensive

work on experimental design is the book by Murray Selwyn “Principles of Experimental Design for

the Life Sciences”. On a more introductory level is the book by Ruxton and Colegrave “Experimental


Design for the Life Sciences”. A gentle introduction to statistics in general and hypothesis testing,

confidence intervals and analysis of variance in particular, can be found in the highly recommended

book of the two Wonnacott brothers “Introductory Statistics”. For those who want to carry out their

analyses in the freely available R language, the book by Peter Dalgaard “Introductory Statistics with

R” is a good starting point. Guidance and tips for efficient data visualization can be found in “The Visual

Display of Quantitative Information” by Edward Tufte and in the two books by William Cleveland:

“Visualizing Data” and “The Elements of Graphing Data”. In “Speaking of Graphics”, Paul Lewi takes

the reader on a fascinating journey through the history of statistical graphics. This e‐book is freely

available at http://www.datascope.be.


REFERENCES AND SELECTED BIBLIOGRAPHY

Agresti A (1990). The Analysis of Categorical Data. J. Wiley, New York, NY.

Amaratunga D, Cabrera J (2004). Exploration and Analysis of DNA Microarray and Protein Array

Data. J. Wiley, New York, NY.

Bailar III JC, Mosteller F (1988). Guidelines for statistical reporting in articles for medical journals.

Ann Int Medicine 108, 226‐273.

Bland M, Altman DG (1994). One and two sided tests of significance. British Medical Journal 309,

248.

Bretz F, Landgrebe J, Brunner E (2005). Multiplicity issues in microarray experiments. Methods Inf

Med 44, 431‐437.

Bretz F, Hothorn T, Westfall P (2010). Multiple Comparisons Using R. CRC Press, Boca Raton, FL.

Burrows PM, Scott SW, Barnett OW, McLaughlin MR (1984). Use of experimental designs with

quantitative ELISA. J Virol Methods 8(3), 207‐216.

Champely S (2009). pwr: Basic functions for power analysis. R package version 1.1.1. http://CRAN.R‐

project.org/package=pwr

Cleveland WS (1993). Visualizing Data. Hobart Press, Summit, NJ.

Cleveland WS (1994). The Elements of Graphing Data. Hobart Press, Summit, NJ.

Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd Ed. Lawrence Erlbaum

Associates, Hillsdale, NJ.

Cox DR (1958). Planning of Experiments. J. Wiley, New York, NY.

Curran‐Everett D (2000). Multiple comparisons: philosophies and illustrations. Am J Physiol

Regulatory Integrative Comp Physiol 279, R1‐R8.

Dalgaard P (2002). Introductory Statistics with R. Springer, New York, NY.

Eklund AC (2010). beeswarm: The bee swarm plot, an alternative to stripchart. R package version

0.0.7. http://CRAN.R‐project.org/package=beeswarm.

Faessel H, Levasseur L, Slocum H, Greco W (1999) Parabolic growth patterns in 96‐well plate cell

growth experiments. In Vitro Cell.Dev.Biol.‐Animal 35: 270‐278.

Fisher RA (1935). The Design of Experiments (8th Ed. 1966). Hafner Press, New York, NY (first

published 1935, Oliver‐Boyd, London UK).

Fisher RA (1938). Presidential address: The first session of the Indian Statistical Conference, Calcutta.

Sankhya 4, 14‐17.

Giesbrecht FG, Gumpertz ML (2004). Planning, Construction, and Statistical Analysis of Comparative

Experiments. J. Wiley, New York, NY.

Grafen A, Hails R (2002). Modern Statistics for the Life Sciences. Oxford University Press, Oxford, UK.

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124.

doi:10.1371/journal.pmed.0020124

Kilkenny C, Parsons N, Kadyszewski E, Festing MFW, Cuthill IC, Fry D, Hutton J, Altman DG (2009).

Survey of the quality of experimental design, statistical analysis and reporting of research

using animals. PLoS ONE 4 (11), e7824.

Kutner MH, Nachtsheim C, Neter J, Li W (2004). Applied Linear Statistical Models, 5th Ed.

McGraw‐Hill/Irwin, Chicago, IL.

Lehmann EL (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden‐Day, San

Francisco, CA.

Lehr R (1992). Sixteen s squared over d squared: a relation for crude sample size estimates. Statistics

in Medicine 11, 1099–1102.

Levasseur L, Faessel H, Slocum H, Greco W (1995). Precision and pattern in 96‐well plate cell growth

experiments. Proc Am Stat Assoc Biopharm Sect, 227‐232.

Levine TR, Atkin C (2004). The accurate reporting of software‐generated p‐values: A cautionary

note. Communication Research Reports, 21, 324‐327.

Lewi PJ (2006). Speaking of Graphics. http://www.datascope.be

Mead R (1988). The design of experiments. Cambridge University Press, Cambridge, NY.

Nadon R, Shoemaker J (2002). Statistical issues with microarrays: processing and analysis. Trends in

Genetics 15 (5), 265‐271.

R Development Core Team (2010). R: A language and environment for statistical computing.

R Foundation for Statistical Computing, Vienna, Austria. ISBN 3‐900051‐07‐0,

URL http://www.R‐project.org/.

Rivenson A, Hoffmann D, Prokopczyk B, Amin S, Hecht SS (1988). Induction of lung and exocrine

pancreas tumors in F344 rats by tobacco‐specific and Areca‐derived N‐nitrosamines. Cancer

Research, 48, 6912‐6917.

Ruxton GD, Colegrave N (2003). Experimental Design for the Life Sciences. Oxford University Press,

Oxford, UK.

Salsburg D (2001). The Lady Tasting Tea. Freeman, New York, NY.

Selwyn MR (1996). Principles of Experimental Design for the Life Sciences. CRC Press, Boca Raton, FL.

Tallarida RJ (2001). Drug synergism: its detection and applications. J Pharmacol & Exp Ther 298 (3),

865‐872.

Temme A, Sümpel F, Rieber GSEP, Willecke KJK, Ott T. (2001) Dilated bile canaliculi and attenuated

decrease of nerve‐dependent bile secretion in connexin32‐deficient mouse liver. Eur J Physiol

442, 961‐966.

Tufte ER (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.

Van Belle G (2008). Statistical Rules of Thumb. J. Wiley, New York, NY.

Vandenbroeck P, Wouters L, Molenberghs G, Van Gestel J, Bijnens L (2006). Teaching statistical

thinking to life scientists. A case‐based approach. J Biopharm Stat 16, 61‐75.

Wells HG (1903). Mankind In The Making. 2007 Edition. 1st World Library, Fairfield, IA. p. 149

Wells HG (1938). World Brain. Cambridge University Press, London, UK, p. 88.

(http://ebooks.adelaide.edu.au/w/wells/hg/world_brain/).

Wilks SS (1951). Undergraduate statistical education. J Amer Statist Assoc 46, 1‐18.

Wonnacott TH, Wonnacott RJ (1990). Introductory Statistics. 5th Ed., J. Wiley, New York, NY.


APPENDIX TOOLS FOR RANDOMIZATION IN MS EXCEL AND R

COMPLETELY RANDOMIZED DESIGN

Suppose 21 experimental units have to be randomly assigned to three treatment groups, such that

each group contains exactly seven animals.

A. MS EXCEL

A randomization list is easily constructed using a spreadsheet program like Microsoft Excel. This is

illustrated in Figure 14. We enter in the first column of the spreadsheet the code for the treatment

(1, 2, 3). Using the RAND() function, we fill the second column with pseudo‐random numbers

between 0 and 1. Subsequently the two columns are selected and the Sort command from the Data‐

menu is executed. In the Sort window that appears now, we select column B as column to be sorted

by. The treatment codes in column A are now in random order, i.e. the first animal will receive

treatment 2, the second treatment 3, etc.

Figure 14. Generating a completely randomized design in MS Excel

B. R LANGUAGE

In the open source statistical language R, the same result is obtained by:

> set.seed(1324)
> x<-c(rep(c("A","B","C"),7)) # vector with treatment groups A,B,C x 7
> x # displays the 21 treatment codes in their original order
> rx<-sample(x) # randomized treatment assignment
> rx # displays the 21 treatment codes in random order

RANDOMIZED COMPLETE BLOCK DESIGN

Suppose 20 experimental units, organized in 5 blocks of size 4, have to be randomly assigned to 4

treatment groups A, B, C, D, such that each treatment occurs exactly once in each block.

A. MS EXCEL

In MS Excel follow the procedure that is depicted in Figure 15. We enter in the first column of the

spreadsheet the code for the treatment (A, C, B, D). The second column (Column B) is filled with an

indication of the block (1:5). Using the RAND() function, we fill the third column with pseudo‐random

numbers between 0 and 1. Subsequently the three columns are selected and the Sort command

from the Data‐menu is executed. In the Sort window that appears now, we select Column B as first

sort criterion and Column C as second sort criterion. The treatment codes in column A are now for

each block in random order, i.e. the first animal in block 1 will receive treatment A, the second

treatment D, etc.

Figure 15. Generating a randomized complete block design in MS Excel

B. R LANGUAGE

> set.seed(3223)
> treat<-rep(c("A","B","C","D"),rep(5,4)) # treatments, each repeated once per block
> design<-data.frame(treat=treat,block=rep(1:5,4)) # treats & blocks
> rdesign<-design[sample(dim(design)[1]),] # random sequence
> rdesign<-rdesign[order(rdesign$block),] # order by blocks for convenience
> rdesign
   treat block
11     C     1
6      B     1
16     D     1
1      A     1
17     D     2
2      A     2
12     C     2
7      B     2
13     C     3
18     D     3
3      A     3
8      B     3
14     C     4
9      B     4
4      A     4
19     D     4
5      A     5
15     C     5
10     B     5
20     D     5
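An alternative way to obtain such a design is to permute the treatments separately within each block. The sketch below is only one possible formulation. The seed is arbitrary, and the resulting sequence depends on the R version, since the default random sampling method has changed over time.

```r
set.seed(2011)  # arbitrary seed, for reproducibility only
treatments <- c("A", "B", "C", "D")
n_blocks   <- 5

# Permute the four treatments independently within each block
rcbd <- data.frame(
  block = rep(seq_len(n_blocks), each = length(treatments)),
  treat = unlist(lapply(seq_len(n_blocks), function(b) sample(treatments)))
)

# Check: every treatment occurs exactly once in every block
check <- table(rcbd$block, rcbd$treat)
```

Every cell of `check` equals 1, which confirms that the design is a randomized complete block design.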
