241

Click here to load reader

Introduction to Statistics in Pharmaceutical Clinical Trials

Embed Size (px)

Citation preview

Page 1: Introduction to Statistics in Pharmaceutical Clinical Trials

Introduction to Statisticsin PharmaceuticalClinical TrialsTodd A Durham and J Rick Turner

Page 2: Introduction to Statistics in Pharmaceutical Clinical Trials

Introduction to Statistics inPharmaceutical Clinical Trials

Page 3: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 4: Introduction to Statistics in Pharmaceutical Clinical Trials

Introduction to Statistics inPharmaceutical Clinical Trials

Todd A Durham MSSenior Director of Biostatistics and Data ManagementInspire PharmaceuticalsDurham, North Carolina, USA

J Rick Turner PhDChairman, Department of Clinical ResearchCampbell University School of PharmacyResearch Triangle Park, North Carolina, USAand Consulting Executive Director of OperationsCTMG, Inc.Greenville and Wilson, North Carolina, USA

London • Chicago

Page 5: Introduction to Statistics in Pharmaceutical Clinical Trials

Published by the Pharmaceutical PressAn imprint of RPS Publishing1 Lambeth High Street, London SE1 7JN, UK100 South Atkinson Road, Suite 200, Greyslake, IL 60030-7820, USA

© J Rick Turner and Todd A Durham 2008

is a trade mark of RPS PublishingRPS Publishing is the publishing organisation of the RoyalPharmaceutical Society of Great Britain

First Published 2008

Typeset by J&L Composition, Filey, North YorkshirePrinted in Great Britain by Cambridge University Press,Cambridge

ISBN 978 0 85369 714 5

All rights reserved. No part of this publication may be repro-duced, stored in a retrieval system, or transmitted in any formor by any means, without the prior written permission of thecopyright holder.

The publisher makes no representation, express or implied,with regard to the accuracy of the information contained in thisbook and cannot accept any legal responsibility or liability forany errors or omissions that may be made.

The right of J Rick Turner and Todd A Durham to be identified asthe authors of this work has been asserted by them in accordancewith the Copyright, Designs and Patents Act, 1988.

A catalogue record for this book is available from the BritishLibrary

Disclaimer

The drug selections and doses given in this book are for illustration only. The authors and publisherstake no responsibility for any actions consequent upon following the contents of this book withoutfirst checking current sources of reference.

All doses mentioned are checked carefully. However, no stated dose should be relied on as the basis forprescription writing, advising or monitoring. Recommendations change constantly, and a current copy ofan official formulary, such as the British National Formulary or the Summary of Product Characteristics,should always be consulted. Similarly, therapeutic selections and profiles of therapeutic and adverse activi-ties are based upon the authors’ interpretation of official recommendations and the literature at the time ofpublication. The most current literature must always be consulted.

Page 6: Introduction to Statistics in Pharmaceutical Clinical Trials

Contents

Foreword xPreface xiiDedications xiv

1 The discipline of Statistics: Introduction and terminology 11.1 Introduction 11.2 The discipline of Statistics 21.3 The term “statistic” and the plural form “statistics” 31.4 The term “statistical analysis” 31.5 Association versus causation 31.6 Variation and systematic variation 41.7 Compelling evidence 41.8 The terms “datum” and “data” 51.9 Results from statistical analyses as the basis for decision-making 51.10 Blood pressure and blood pressure medication 51.11 Organization of the book 61.12 Some context before reading Chapters 2–11 71.13 Review 81.14 References 8

2 The role of clinical trials in new drug development 92.1 Introduction 92.2 Drug discovery 92.3 Regulatory guidance and governance 102.4 Pharmaceutical manufacturing 132.5 Nonclinical research 142.6 Clinical trials 152.7 Postmarketing surveillance 182.8 Ethical conduct during clinical trials 192.9 Review 202.10 References 20

Page 7: Introduction to Statistics in Pharmaceutical Clinical Trials

3 Research questions and research hypotheses 233.1 Introduction 233.2 The concept of scientific research questions 233.3 Useful research questions 233.4 Useful information 243.5 Moving from the research question to the research hypotheses 243.6 The placebo effect 243.7 The drug treatment group and the placebo treatment group 253.8 Characteristics of a useful research question 253.9 The reason why there are two research hypotheses 263.10 Other forms of the null and alternate hypotheses 273.11 Deciding between the null and alternate hypothesis 283.12 An operational statistical definition of “more” 293.13 The concept of statistically significant differences 303.14 Putting these thoughts into more precise language 303.15 Hypothesis testing 313.16 The relationship between hypothesis testing and ethics in clinical trials 313.17 The relationship between research questions and study design 323.18 Review 333.19 References 33

4 Study design and experimental methodology 354.1 Introduction 354.2 Basic principles of study design 364.3 A common design in therapeutic exploratory and confirmatory trials 384.4 Experimental methodology 404.5 Why are we interested in blood pressure? 414.6 Uniformity of blood pressure measurement 434.7 Measuring change in blood pressure over time 434.8 The clinical study protocol 444.9 Review 454.10 References 45

5 Data, central tendency, and variation 475.1 Introduction 475.2 Populations and samples 475.3 Measurement scales 485.4 Random variables 495.5 Displaying the frequency of values of a random variable 495.6 Central tendency 525.7 Dispersion 535.8 Tabular displays of summary statistics of central tendency and dispersion 55

v i Contents

Page 8: Introduction to Statistics in Pharmaceutical Clinical Trials

5.9 Review 565.10 References 56

6 Probability, hypothesis testing, and estimation 576.1 Introduction 576.2 Probability 576.3 Probability distributions 606.4 Binomial distribution 616.5 Normal distribution 626.6 Classical probability and relative frequency probability 676.7 The law of large numbers 686.8 Sample statistics and population parameters 696.9 Sampling variation 696.10 Estimation: General considerations 706.11 Hypothesis testing: General considerations 746.12 Hypothesis test of a single population mean 786.13 The p value 806.14 Relationship between confidence intervals and hypothesis tests 816.15 Brief review of estimation and hypothesis testing 826.16 Review 836.,17 References 83

7 Early phase clinical trials 857.1 Introduction 857.2 A quick recap of early phase studies 857.3 General comments on study designs in early phase clinical studies 867.4 Goals of early phase clinical trials 867.5 Research questions in early phase clinical studies 877.6 Pharmacokinetic characteristics of interest 877.7 Analysis of pharmacokinetic and pharmacodynamic data 897.8 Dose-finding trials 917.9 Bioavailability trials 927.10 Other data acquired in early phase clinical studies 937.11 Limitations of early phase trials 947.12 Review 957.13 References 95

8 Confirmatory clinical trials: Safety data I 978.1 Introduction 978.2 The rationale for safety assessments in clinical trials 978.3 A regulatory view on safety assessment 98

Contents v i i

Page 9: Introduction to Statistics in Pharmaceutical Clinical Trials

8.4 Adverse events 998.5 Reporting adverse events 998.6 Using all reported AEs for all participants 1008.7 Absolute and relative risks of participants reporting specific AEs 1018.8 Analyzing serious AEs 1028.9 Concerns with potential multiplicity issues 1028.10 Accounting for sampling variation 1038.11 A confidence interval for a sample proportion 1038.12 Confidence intervals for the difference between two proportions 1058.13 Time-to-event analysis 1078.14 Kaplan–Meier estimation of the survival function 1098.15 Review 1148.16 References 115

9 Confirmatory clinical trials: Safety data II 1179.1 Introduction 1179.2 Analyses of clinical laboratory data 1179.3 Vital signs 1239.4 QT interval prolongation and torsades de pointes liability 1249.5 Concluding comments on safety assessments in clinical trials 1259.6 Review 1269.7 References 126

10 Confirmatory clinical trials: Analysis of categorical efficacy data 12710.1 Introduction: Regulatory views of substantial evidence 12710.2 Objectives of therapeutic confirmatory trials 13010.3 Moving from research questions to research objectives: Identification

of endpoints 13110.4 A brief review of hypothesis testing 13210.5 Hypothesis tests for two or more proportions 13310.6 Concluding comments on hypothesis tests for categorical data 14510.7 Review 14510.8 References 146

11 Confirmatory clinical trials: Analysis of continuous efficacy data 14711.1 Introduction 14711.2 Hypothesis test of two means: Two-sample t test or independent groups

t test 14711.3 Hypothesis test of the location of two distributions: Wilcoxon rank

sum test 15011.4 Hypothesis tests of more than two means: Analysis of variance 152

v i i i Contents

Page 10: Introduction to Statistics in Pharmaceutical Clinical Trials

11.5 A worked example with a small dataset 15511.6 A statistical methodology for conducting multiple comparisons 15911.7 Bonferroni’s test 16011.8 Employing Bonferroni’s test in our example 16111.9 Tukey’s honestly significant difference test 16311.10 Implications of the methodology chosen for multiple comparisons 16411.11 Additional considerations about ANOVA 16611.12 Nonparametric analyses of continuous data 16711.13 The Kruskal–Wallis test 16711.14 Hypothesis test of the equality of survival distributions: Logrank test 16911.15 Review 17111.16 References 172

12 Additional statistical considerations in clinical trials 17312.1 Introduction 17312.2 Sample size estimation 17312.3 Multicenter studies 18112.4 Analysis populations 18212.5 Dealing with missing data 18412.6 Primary and secondary objectives and endpoints 18512.7 Evaluating baseline characteristics 18612.8 Equivalence and noninferiority study designs 18712.9 Additional study designs 18912.10 Review 19012.11 References 190

13 Concluding comments 191Reference 191

Appendices 193Appendix 1: Standard normal distribution areas 195Appendix 2: Percentiles of t distributions 205Appendix 3: Percentiles of v2 distributions 207Appendix 4: Percentiles of F distributions (a � 0.05) 209Appendix 5: Values of q for Tukey’s HSD test (a � 0.05) 211

Review exercise solutions by chapter 215

Index 219

Contents ix

Page 11: Introduction to Statistics in Pharmaceutical Clinical Trials

With this introductory text, the authors havemanaged to de-mystify Statistics for students ofpharmacy and clinical research who may betaking their first, or one of their earliest, coursesin the subject. Three fundamental departuresfrom the standard treatment of statistics areevident from the start – the way in which “Statis-tics” is defined, the organization of the bookitself, and the use of a single, unifying diseasearea for illustration throughout the book. Thereference to Statistics as “an experimentalapproach to gaining knowledge” at the start ofthe first chapter sets the tone for the rest of thebook. Statistical concepts are defined andexplained relative to their usefulness in clinicaldecision making. A unique operational defini-tion of the discipline of Statistics is presentedthat consists of six components, beginning withthe posing of a research question and concludingwith both a regulatory submission and peer-reviewed publication. This is a far cry from thestandard definitions used in most Statistics textsand alerts the reader to the fact that applicationsand discussions of utility will be intertwinedwith mathematical concepts and methodologiesfor the duration of their reading.

Rather than simply providing an exposition ofmathematical terms and operators followed by acanvassing of the usual array of statistical toolsand techniques, the authors choose instead tofollow the product development pathway inorganizing their book, showing how statisticsplays an important role in providing the abilityto move from step to step with objectivity andsound decision making. After an overview of thedrug development paradigm that includes thenonclinical, manufacturing, and marketingaspects, the reader is introduced to the funda-mentals of experimental design, probability dis-

tributions, and hypothesis testing. The reader isthen guided through each phase of pharmaceu-tical clinical trials. From early phase to confir-matory trials, the questions that need to beaddressed and the types of data and statisticaltools needed to address them are explained andfully illustrated with an antihypertensive treat-ment example. Note, however, that the straight-forward nature of the exposition does not equateto simplicity of subject matter. Nonparametricstatistics, noninferiority trials, and adaptivedesigns all receive mention as the text covers themajority of situations these future researchers arelikely to encounter. And unlike many basic sta-tistics texts with a focus on efficacy occupyingthe majority of the pages, safety analyses receivenearly equal treatment here. Timely safety topicsof particular clinical interest, such as QT/QTcinterval prolongation and how to design a studyto test for this adverse effect, are covered.

Focusing on a potential treatment of hyper-tension for illustration throughout the book pro-vides consistency and really makes the productdevelopment pathway come alive. The readerhas the sense of helping move this product fromphase to phase, and the example serves to notonly illustrate the statistical methods, but alsothe clinical decision making surrounding theproduct’s development. In addition, the reader isable to accumulate the medical backgroundrequired to appreciate the data examples in aminimum number of pages, and the example isrich enough so as to not afford any loss ofgenerality with this singular focus.

The dialogue between clinician and statisticianis of the utmost importance in the successfulexecution of a product development programtoday. The need for strong communication skills,verbal and written, among those involved in this

Foreword

Page 12: Introduction to Statistics in Pharmaceutical Clinical Trials

complex process has never been greater. Theauthors have provided a text that fully equipsstudents to engage in clear and meaningful dia-logue with their clinical colleagues and regula-tory counterparts. By focusing on the commongoal of learning essential information aboutexperimental products through well-designedand well-conducted studies, and accuratelycollected and appropriately analyzed data, thereader never loses sight of the fact that statisticsare the means to the end, and not the end inthemselves. As the authors note on page 191,“Our interest in Statistics, then, is a pragmaticone: The discipline provides the best way cur-rently available to conduct clinical developmentprograms.”

I have had the pleasure of working with ToddDurham for over 10 years and I am not surprisedin the least with his clear and concise treatmentof Statistics in this text. He is known amongfriends, colleagues, and students for his thought-ful approach to study design and analysis

problems, his excellent communication skills,and his great capacity for mentoring and tutor-ing others. His contributions through this textwill enable other classrooms to benefit from hiswinning style of teaching even when he is notpresent to lead the discussion himself. Todd is anAdjunct Professor of Clinical Research at theCampbell University School of Pharmacy, andteaches Statistics to students in the Master ofScience in Clinical Research program. I amdelighted that this collaboration with RickTurner, the Chairman of the Department ofClinical Research, has proved so successful.

Lisa M. LaVange, Ph.D. Professor and Director, Collaborative

Studies Coordinating Center Department of Biostatistics,

School of Public Health University of North Carolina at Chapel Hill

Chapel Hill, NC, USAJanuary 2008

Foreword xi

Page 13: Introduction to Statistics in Pharmaceutical Clinical Trials

This book is an introductory statistics textbookdesigned primarily for students of pharmacy,clinical research, and allied health professions. Ittakes a novel approach by not only teaching youhow to conduct individual statistical analyses,but also placing these analyses in the context ofthe clinical research activities needed to developa new pharmaceutical drug. By taking thisapproach, we are able to provide you with a uni-fied theme throughout the book and, in addition,to teach you the computational steps needed toconduct these analyses and provide you with apowerful conceptual understanding of why theseanalyses are so informative. This approach alsomakes the book well suited to professionalsentering the pharmaceutical, biotechnology, andcontract research organization (CRO) industrieswhowish togainabroaderunderstandingof studydesign and research methodology in clinicaltrials. Both target audiences will find this book auseful introduction to the central role of the disci-pline of Statistics in the clinical development ofpharmaceutical drugs that improve the humancondition. Important concepts are reinforcedwith review questions at the end of chapters.

By focusing on the statistical analyses mostcommonly used in drug development andemploying an organizational structure that fol-lows the order in which these statistical analysesare commonly used in clinical drug develop-ment, the book shows you how the discipline ofStatistics facilitates the acquisition of optimumquality data, that is, numerical representationsof relevant information, which form the basis ofrational decision-making throughout the drugdevelopment process.

Although this book meaningfully integratesthe computational aspects of statistics with theoverall conceptual objectives for which they are

used, we have not included some topics that aretraditionally included in introductory statisticstextbooks, including linear regression and corre-lation. We believe that the selected topics andthe depth at which they are discussed are appro-priate and unique for our intended audience.While we are very happy with the title of thebook as it is, the title The Statistical Basis ofDecision-making in Pharmaceutical Clinical Trialswould capture one of the book’s major themesextremely well.

The motivation to write this book is directlyrelated to our professional activities. Both of usteach Statistics in the Department of ClinicalResearch at the Campbell University School ofPharmacy (TD, a professional biostatistician, isalso an Adjunct Professor of Clinical Research,and RT is Chairman of this department). Thedepartment is located in the heart of NorthCarolina’s Research Triangle Park, one of theworld’s leading pharmaceutical and biotech-nology research centers. Statistics courses in thisdepartment are therefore taught in the contextof the development of new pharmaceutical andbiopharmaceutical products, with the goal ofproviding a solid knowledge and understandingof the nature, methods, applications, and impor-tance of the discipline of Statistics. It should beemphasized that we are not training our studentsto be professional statisticians. Rather, we wishthem to become familiar with the basics ofdesign, methodology, and analysis as used in thedevelopment of new drugs. We aim to conveythe following information:

• why, and how, data are collected in clinicalstudies (to investigate a specific question,using appropriate study design and researchmethodology)

Preface

Page 14: Introduction to Statistics in Pharmaceutical Clinical Trials

Preface xi i i

• how these data are summarized and analyzed(descriptive statistics, hypothesis testing andinferential statistics, statistical significance)

• what the results mean in the context of theclinical research question (interpretation,estimation, and clinical significance)

• how the results are communicated to regu-latory agencies and to the scientific andmedical communities.

By presenting statistical analysis in a meaningful,integrated, and relevant manner, our students’knowledge and retention of the material ismarkedly improved. Moreover, their under-standing and appreciation of the discipline ofStatistics in all of their future scientific endeavors(both academically while studying, and profes-sionally once in the workforce) is considerablyenhanced. This book will become the text for thefirst of two Statistics courses in our Master ofScience in Clinical Research program.

It is appropriate to acknowledge here thatneither author is a clinician. The first author isa professional statistician and the second a pro-fessional educator, medical writer, and researchmethodologist. We are also both clinical trial-ists: The first author is experienced in statisticalaspects of clinical trials, and the second in writ-ing regulatory clinical documentation. At vari-ous points throughout this book, we discusshow the discipline of Statistics provides therational evidence for making clinical decisions.On several occasions we use hypothetical data,show how statistical analyses of these data andthe associated statistical interpretations canform the rational basis for clinical decision-making, and illustrate what the hypothetical

clinical decision might be. This is done for edu-cational purposes. Please remain aware, whenreading our hypothetical clinical interpreta-tions, that we are not clinicians: We are con-veying the logic and importance of incorporat-ing statistical information in the process of clin-ical decision-making. The crucial role that thediscipline of Statistics plays in clinical practiceis to provide the information upon which evi-dence-based clinical practice is based. The mosteffective drug development programs, one facetof the larger field of clinical research, resultfrom the collaboration of many specialists,including statisticians and clinicians. Actualclinical decisions are, of course, the province ofclinicians.

We express our thanks to professional col-leagues and students who have supported andinformed us during the preparation of this book.Christina De Bono and Kevin Tuley atPharmaceutical Press have provided constantsupport and assistance throughout this project,and we are very grateful. Richard Zink provideddetailed reviews of several drafts of the manu-script. Finally, our previous students have pro-vided invaluable feedback on lecture materialand initial drafts of this book.

Views expressed in this book are those of theauthors and Turner Medical CommunicationsLLC, and not necessarily those of InspirePharmaceuticals and/or the Campbell UniversitySchool of Pharmacy.

Todd Durham and Rick TurnerResearch Triangle Park, NC, USA

August 2007

Page 15: Introduction to Statistics in Pharmaceutical Clinical Trials

This book is dedicated to Heidi Durham and Karen Turner, who have been there for us every stepof the way. We also thank Rachel, Daisy, Misty, and Mishadow for their wonderful companionship.

Dedications

Page 16: Introduction to Statistics in Pharmaceutical Clinical Trials

1.1 Introduction

It is common to start a textbook with a defini-tion of the book’s topic. However, in the case ofthe discipline of Statistics (indicated in this bookwith a capital “S”), it is difficult to find a univer-sally accepted definition. Different textbooks arewritten for different target audiences, and theirgoals can therefore be quite different. So, beforegoing any further, it is appropriate to identifyour target audiences, and to provide you with adefinition of Statistics that is informative andmeaningful in the context of this book.

The discipline of Statistics is discussed in thecontext of pharmaceutical clinical trials becausetwo of the primary target audiences for this bookare students of pharmacy and students of clinicalresearch. A clinical trial can be defined as anexperiment testing a medical treatment onhumans (Piantadosi, 2005). In this definition,the words “human” and “clinical” are closelylinked. However, this definition also makes itclear that clinical trials can be performed to testa variety of medical treatments. In addition topharmaceutical trials, the focus of this book,clinical trials are conducted to test medicaldevices (see Becker and Whyte, 2006) and somesurgical practices.

In addition to being suitable for students ofboth pharmacy and clinical research this book isalso suitable for other students who are inter-ested in the development of new pharmaceuticaldrugs, including students of medicine, nursing,and other health-related professions wherepharmacotherapy is of relevance. Statisticscourses are typically part of such degreeprograms, and we have designed this book sothat these students will also benefit from thematerial presented and taught. Our goal in

this book is to introduce you to basic statisticalmethodology and analysis in a meaningful andvery relevant context, the conduct of pharma-ceutical clinical trials. We use the phrase “clin-ical trials” from now on with the understandingthat all discussions are about the development ofpharmaceutical drugs.

By teaching Statistics in a context that is veryrelevant to you, the statistical analyses that youwill learn about will not simply be abstract ideas:They will be techniques that meaningfullycollect and analyze numerical information ofimportance in your profession. The developmentof new drugs, whether brand-new chemical enti-ties (NCEs), biologics, or new forms of existingdrugs, requires three steps:

1. collection of numerical representations ofinformation

2. analysis and interpretation of this numericalinformation

3. decision-making based on this analysis andinterpretation.

By the end of this book you will have a solidconceptual knowledge and understanding of theexperimental methods and statistical analysesused in new drug development. In addition, youwill have gained computational knowledge: Youwill have learned how to conduct the mostcommonly used statistical analyses and how tointerpret the results of these analyses. This combi-nation of conceptual and computational knowl-edge and understanding is a powerful one thatwill serve you well in the rest of your studies.

A couple of conceptual points are useful here.First, there is much more to the discipline ofStatistics than “obligatory calculations” at theend of a study. Rather, the discipline of Statisticsis an integral component throughout the entire

1The discipline of Statistics: Introduction and terminology

Page 17: Introduction to Statistics in Pharmaceutical Clinical Trials

new drug process. Second, conducting the statis-tical analyses most commonly employed in newdrug development is really not that hard. Thehard parts of new drug development are askingthe correct research questions, making sure thatthe correct study design is used in each clinicaltrial, and making sure that the correct experi-mental methodologies are used to acquireoptimum quality data during the clinical trial(we talk more about these issues in Chapters 3and 4). If good research questions are asked, andthe correct study designs and experimentalmethodologies used to collect optimum qualitydata, performing the statistical analyses is notreally that difficult. As Piantadosi (2005)commented, “good trials are usually simple toanalyze correctly.”

1.2 The discipline of Statistics

The discipline of Statistics is a well-developedand powerful discipline that involves muchmore than simply number crunching.Crunching numbers is certainly part of it, but,unless those numbers have been carefully andmeaningfully collected, crunching them is notgoing to provide any useful information.

Statistics is a scientific discipline because itadopts the scientific method: A method ofthinking and conducting business in a certainway. You are very likely familiar with the sciencesof biology, chemistry, and physics, but these arenot the only sciences. Individual fields of investi-gation can be called a science, or a scientificdiscipline, if they adopt the scientific methodof inquiry. In this method, theories lead tohypotheses, and these hypotheses are then testedin the scientific manner. A key characteristic ofscientific hypotheses is that they need to be ableto be disproved. If repeated evaluations of atheory via appropriate hypothesis testing do notdisprove the theory, compelling evidence starts toaccumulate that the theory may have merit. It isthen deemed reasonable to proceed on the basisthat the theory does indeed have merit, but withthe knowledge and acceptance that future inves-tigations may provide evidence that the theorydoes not have merit (see Turner, 2007).

In this book we have adopted the broad oper-ational definition of Statistics provided byTurner (2007). In this definition, Statistics isregarded as a multi-faceted scientific disciplinethat comprises the following activities:

• Identifying a research question that needs tobe answered.

• Deciding upon the design of the study, themethodology that will be employed, and thenumerical information (data) that will becollected.

• Presenting the design, methodology, anddata to be collected in a study protocol. Thisstudy protocol specifies the manner of datacollection and addresses all methodologicalconsiderations necessary to ensure the collec-tion of optimum quality data for subsequentstatistical analysis.

• Identifying the statistical techniques that willbe used to describe and analyze the data in anassociated statistical analysis plan that iswritten inconjunctionwith thestudyprotocol.

• Describing and analyzing the data. Thisincludes analyzing the variation in the data tosee if there is compelling evidence that thedrug is safe and effective. This processincludes evaluation of the statistical signifi-cance of the results obtained and, of criticalimportance, their clinical significance.

• Presenting the results of a clinical study to aregulatory agency in a clinical study report,and presenting the results to the clinicalcommunity in journal publications.

This functional definition makes clear that thediscipline of Statistics is indeed multi-facetedand essential throughout clinical trials. It is crit-ical at the start of the clinical trial process so thata study can be designed appropriately to facili-tate the collection of optimum quality data,which then need to be organized and managedcorrectly. These data are then described andanalyzed, and the numerical results of theseanalyses are interpreted in the context of theparticular study. Finally, the numerical results ofthe analyses and the authors’ interpretation ofthese results are presented to regulatory agenciesto request permission to market the drug, andpublished in clinical communications to provideinformation to physicians.

2 Chapter 1 • The discipline of Statistics: Introduction and terminology

Page 18: Introduction to Statistics in Pharmaceutical Clinical Trials

This functional definition of Statistics maywell contain some concepts and terms withwhich you are not familiar at this point, and thatis fine. This book’s goal is to make you familiarwith these terms and concepts so that you willunderstand the statistical processes and proce-dures that are used in clinical trials. Individualchapters address different parts of this defini-tion. However, it is important for us to empha-size here that the individual aspects presented inthe chapters are really seamless components ofone overall experimental approach to gainingknowledge, the discipline of Statistics. Thesecomponents act together to ensure that high-quality data acquisition, correct analysis, andappropriate interpretations provide optimalanswers to good research questions.

1.3 The term “statistic” and the pluralform “statistics”

Having operationally defined the discipline ofStatistics, we will now operationally define theterms “statistic” and “statistics,” each writtenwith a lower case “s.” A statistic typicallyinvolves one piece of numerical information. Forexample, the number of states in the UnitedStates of America is 50, and the number of coun-tries comprising the United Kingdom is 4. Insome circumstances, however, a statistic canusefully involve more than one piece of numer-ical information. Consider how you mightdescribe a sports team’s performance in a season.A simple numerical representation of this successmight be the number of games they won,perhaps 17. However, it is probably more usefulto provide information about wins and losses(assume no draws), and therefore to provide thetotal number of games as well. In this context, a“17-3” summary of the season’s performance (awinning season) is quite different from a “17-23”summary of performance (a losing season), eventhough the number of wins is the same. So, ifyou want to regard 17-3 as a single statistic, eventhough it contains two numbers, this seemsperfectly reasonable to us. Indeed, one medicalexample of a single statistic involving two piecesof numerical information, a person’s blood

pressure, is particularly relevant for the discus-sions in this book concerning the developmentof a new investigational drug that is intended toreduce high blood pressure. This point isdiscussed further in Section 1.10.

The word “statistics” with a lower case “s” isused throughout the book as simply the pluralof the term statistic. A listing of the sportsteam’s wins and losses for the last 10 seasons,perhaps along with similar details for everyother team in the same division or confer-ence, would very adequately be described asstatistics.

1.4 The term “statistical analysis”

The term “statistical analysis” has two mean-ings. Statistical analysis, used in a generalsense, can be regarded as a global descriptionor plan of how data collected will be analyzed.The term “a statistical analysis” refers to anindividual analytical technique that is used todescribe and analyze numerical information.This book teaches you how to conduct a collec-tion of statistical analyses that are appropriatefor use in analyzing the results of clinicaltrials.

1.5 Association versus causation

If we were to measure a number of characteristics(for example, age, height, and bone mineraldensity) in a large group of individuals we wouldundoubtedly find that they were related in somesense. For example, as we age from infants toyoung adults, our height increases from 18inches (or 45 cm) or so to 50 inches (1.27 m) ormore. If we were to examine the age and heightof children aged less than 17 years, we would notfind many (if any) 17-year-old children whowere 18 inches (45 cm) tall, and we would notfind many (if any) 2-year-old children who were60 inches (1.52 m) tall. Biological and medicaltraits with such a relationship are said to beassociated.

Association versus causation 3

Page 19: Introduction to Statistics in Pharmaceutical Clinical Trials

When someone suffers an acute injury suchas a cut from a kitchen knife the immediatebodily response may include bleeding and sharpstinging pain. Various biological responses tothe trauma can lead to a number of measurableeffects. Had the trauma not occurred at thistime the finger would not have bled and norwould the sharp stinging pain have occurred.Some cuts are so minor that they do not resultin either bleeding or pain so the occurrence ofthe trauma does not perfectly predict the effectsbleeding and pain. The philosophical descrip-tion of causative effects is controversial.Without delving into this controversy we canthink of cause-and-effect relationships as beingestablished on the basis of:

• biological plausibility• temporal relationship between the antecedent

(cause) and the result (effect)• some quantitative demonstration that occur-

rence of the “cause” increases the likelihoodof observing the effect.

The two concepts association and causationoccur throughout this text and they should notbe confused. Association of two characteristics isa requirement to establish causation, but theconverse is not true. It is truer to say that aging“causes” human growth than to say that growthcauses aging. There are a number of researchmethods, especially in the field of epidemiology,that may be used to study the relationshipamong various health risks (including the use ofmedical treatments) and health outcomes.However, the randomized clinical trial is consid-ered the gold standard when it comes to estab-lishing cause-and-effect relationships. In thisbook we discuss the likely causative effects ofnew drugs on health outcomes of interest.

1.6 Variation and systematic variation

The study of the biological sciences has majordifferences from the physical and mathematicalsciences. In the physical sciences, the same oper-ation done under the same conditions alwaysproduces the same result. For example, a balldropped off the same building always accelerates

towards the earth at the same rate, a rategoverned by the gravitational pull between theearth and the ball. In the biological sciences,including the clinical sciences, this is simply notthe case. The same dose of medicine (even doseadjusted for weight) will not have an identicaleffect on two different people. In a large group ofpeople there will typically be considerable varia-tion in response. Similarly, the optimum clinicalcare of one patient will likely involve a differentcombination of therapeutic interventions thanthe optimum care of another. In clinical researchwe have to deal with variation, and examiningdata for systematic variation falls squarelywithin the province of Statistics. The topic ofvariation is discussed further in Chapter 5.

1.7 Compelling evidence

Compelling evidence in Statistics might bethought of as the inverse of ‘reasonable doubt’ inthe legal system, but with the advantage that itcan be quantified according to the precise rulesof Statistics. The discipline of Statistics has beendeveloped as a widely accepted method ofconducting scientific investigation in manyfields, including drug development. Data arecollected, analyzed, interpreted, and presentedin a certain way such that the scientific and clin-ical communities at large recognize the validityof the study.

In the practice of law, each attorney presentshis or her evidence in a certain manner to a juryunder the scrutiny of a judge, who makes surethat each component of the evidence is legiti-mate in that it meets an acknowledged level ofacceptance. It is often true in legal cases thatabsolute proof is not possible (unless there isincredibly strong evidence such as a video of thecrime, and even then the defense lawyer willprobably argue the existence of extenuatingcircumstances), so the prosecutor’s argumentshave to be demonstrated to be likely true beyonda reasonable doubt. In the context of clinicaltrials the discipline of Statistics incorporatesaccepted methods of data collection and dataanalysis that can provide compelling evidencethat an investigational drug does indeed do what

4 Chapter 1 • The discipline of Statistics: Introduction and terminology

Page 20: Introduction to Statistics in Pharmaceutical Clinical Trials

it is intended to do. A clinical trial may providecompelling evidence that an investigationaldrug does indeed lower blood pressure. The term“compelling evidence” is not the same as theterm “proof” because a single clinical trialcannot prove that a drug is effective. However, itcan certainly provide compelling evidence.

1.8 The terms “datum” and “data”

The word data is a plural word indicating morethan one piece of numerical information. Thesingular form of the term is datum. As the statis-tical analyses described in this book alwaysanalyze more than one data point the term“data” is used throughout. Accordingly, accom-panying plural words are used in conjunctionwith the word data. Examples are “the data are,the data were, these data, the data show.”

If you have any uncertainty as to how toconstruct a phrase including the word data,replace the word data in your mind with theword results. While the words data and resultsare not synonymous, the word results is a pluralconstruct, as is the word data. This strategy willlikely help you express a phrase including theword data correctly.

The word datum does not occur again in thisbook.

1.9 Results from statistical analyses asthe basis for decision-making

Many decisions have to be made throughout thedrug development process. A lot of these deci-sions concern whether it is prudent to continueto the next phase of development, as outlined inChapter 2. To allow rational decisions to bemade we need reliable, quantitative informationupon which to base our decisions. In a very realsense the main contribution of the discipline ofStatistics in research endeavors is providingnumerical representations of information thatfacilitate good decision-making.

Data are numerical representations of indi-vidual pieces of information. Once data have

been collected in a clinical trial that employedthe appropriate study design and experimentalmethodology, statistical analysis utilizes theseindividual pieces of numerical information toobtain a numerical representation of the “bigpicture.” The results of a statistical analysis and,very importantly, the interpretation of theseresults in the context of the specific researchquestion being asked provide the empiricalinformation upon which decisions can be made.We talk a lot about decision-making in this book.

1.10 Blood pressure and bloodpressure medication

We have deliberately chosen to focus on oneparticular therapeutic area for the worked exam-ples throughout this book: This strategy providesa unified approach in all of the discussions aboutclinical trials and statistical analyses. Our discus-sion and worked examples focus on the develop-ment of a new investigational drug for thetreatment of high blood pressure. The termhypertension is used for this condition, anddrugs intended for the treatment of hyperten-sion are called antihypertensive drugs, or simplyantihypertensives.

1.10.1 Blood pressure and itsmeasurements

The measurement of blood pressure usuallyinvolves the joint measurement of two aspects ofarterial blood pressure, systolic blood pressure(SBP) and diastolic blood pressure (DBP). The SBPprovides a representation of the pressure of bloodas it is ejected from the heart into the body’sarteries at each heart beat. The DBP provides arepresentation of the blood pressure in thearteries in between each heart beat. A healthyblood pressure for a young adult is often repre-sented as “120/80.” This is pronounced “onetwenty over eighty.” In this case, the “/” symboldoes not represent the division of 120 by 80 toget a value of 1.5: It is simply used to separate thetwo blood pressure readings. The units of bloodpressure are millimeters of mercury (mmHg), so

Blood pressure and blood pressure medication 5

Page 21: Introduction to Statistics in Pharmaceutical Clinical Trials

the actual blood pressure values in this case are anSBP of 120 mmHg and a DBP of 80 mmHg.

Suppose now that we measured someone’sblood pressure five times, once every 5 minutes.The average blood pressure could be representedby calculating the average SBP (say 124 mmHg)and the average DBP (say 82 mmHg) and thenwriting this average blood pressure as 124/82.

1.10.2 Medical management of bloodpressure

The Seventh Report of the Joint NationalCommittee on Prevention, Detection, Evalution,and Treatment of High Blood Pressure (JNC 7:National Institutes of Health, 2004) is a defin-itive publication concerning the treatment ofhypertension. It provides the following bloodpressure classifications for adult blood pressures:

• Normal: SBP � 120 mmHg and DBP � 80mmHg.

• Pre-hypertension: SBP 120–139 mmHg or DBP80–89 mmHg.

• Stage 1 hypertension: SBP 140–159 mmHg orDBP 90–99 mmHg.

• Stage 2 hypertension: SBP � 160 mmHg orDBP � 100 mmHg.

These classifications are related to manage-ment strategies for hypertension. This report isthe first of the JNC’s series of reports to use theterm “pre-hypertension,” a term introduced tosignal the need for increased awareness andeducation among healthcare professionals andthe general public of the benefits of reducingblood pressure before it reaches the levels in thehypertensive categories. The relationshipbetween blood pressure and risk of cardiovas-cular events is “continuous, consistent, andindependent of other risk factors. The higher theBP, the greater is the chance of heart attack, heartfailure, stroke, and kidney disease” (NationalInstitutes of Health, 2004, p 2). These classifica-tions are provided as a very useful means ofdirecting the management of blood pressure byclinicians, who have to make a decision whetheror not to treat a patient. If they decide that phar-macological treatment is warranted they need todecide what regimen should be prescribed. There

are several classes of antihypertensive drugs onthe market, and clinicians typically follow therecommendations in this report.

1.11 Organization of the book

As noted in Section 1.1, reading this bookprovides you with a solid conceptual knowledgeand understanding of the experimental methodsand statistical analyses used in new drug devel-opment, teaches you how to conduct statisticalanalyses commonly used in clinical trials, andshows you how to interpret the results of theseanalyses to facilitate rational, information-baseddecision-making. This information is taught inthe context of a particular category of clinicaltrials that are conducted during the develop-ment of a new drug. As you will see in Chapter 2,clinical drug development programs typicallyconsist of a progressive series of clinical trials.These range from the first trials in which theinvestigational drug is administered to humansto much larger trials that are conducted as thelast item before requesting marketing approvalfor the drug from a regulatory agency. This bookfocuses on these larger trials, frequently calledPhase III trials and also known as therapeuticconfirmatory trials.

This book, therefore, is a self-contained intro-duction to the discipline of Statistics and its use intherapeutic confirmatory clinical trials. The firstpart of the book provides introductory commentsabout the discipline of Statistics and lays thefoundations for our later discussions in thecontext of pharmaceutical trials. Chapter 2presents an overview of the process of new drugdevelopment and the role of clinical trials in thisprocess. The categories of different types of clin-ical trials are identified and discussed so that youwillunderstandthetypesofdatacollectedinthem.

Chapters 3 and 4 discuss how research questions are asked and answered in statistical language during clinical trials, andintroduce the study designs and experimentalmethodologies that are used to acquire optimumquality data with which to answer our researchquestions. Chapter 5 discusses statistical ways ofdescribing and summarizing these data. Chapter 6

6 Chapter 1 • The discipline of Statistics: Introduction and terminology

Page 22: Introduction to Statistics in Pharmaceutical Clinical Trials

introduces hypothesis testing and estimation,two important ways of analyzing data fromclinical trials.

The concepts and techniques discussed inthese chapters are then developed and extendedin later chapters, in which we teach you howto conduct statistical analyses commonlyemployed in preapproval clinical trials, that is,clinical trials that are conducted before applyingfor marketing approval from a regulatory agency.Chapter 7 discusses clinical trials that areconducted at the beginning of a clinical devel-opment program. While these trials are not themajor focus of this book, it is appropriate toconsider them briefly. The statistical challengesin early phase trials are different from those inPhase III trials, and it is appropriate to highlightthese differences.

Chapters 8–11 discuss clinical trials that areconducted later during the clinical developmentprogram. These chapters address both safety dataand efficacy data. Throughout these chapters,each new statistical analysis taught is addressedin the following way:

• identification of the research question (whatis the decision to be made?)

• identification of data that will provide ananswer to the research question

• identification of the appropriate study design(how to conduct a trial that will provide datacapable of answering the research question asaccurately as possible)

• identification of the best methodologies tocollect optimum quality data during the studywith which to answer the research question asaccurately as possible

• identification of the appropriate statisticalanalysis to be employed

• computational steps necessary to conduct thestatistical analysis chosen

• inference and decision-making: Interpretingthe results in the light of the specific researchquestion asked.

Chapters 12 and 13 then conclude the bookby providing an overarching discussion of thetopics covered in previous chapters, anddiscussing further the philosophical rationalesfor the employment of Statistics in new drugdevelopment.

1.12 Some context before readingChapters 2–11

The purpose of this section is to provide youwith a conceptual framework within which toassimilate the statistical material presented inChapters 2–11. While this book teaches you thecomputational skills to conduct some statisticalanalyses, as an introductory statistics textbookshould, we also want to provide you with aconceptual knowledge and understanding ofwhy these analyses are undertaken. Rephrasingthis last point, we want to provide you with aconceptual knowledge and understanding ofhow the information gained from a clinical trialis used in various forms of decision-making.

1.12.1 Decision-making during a clinicaldevelopment program

The process of developing a new drug is anextremely expensive one. While we do not knowthe precise exchange rate on the day that you arereading this, estimates of US$1bn and £600m arecertainly very meaningful at the time of writing.A related and highly relevant observation is thatmost drugs fail to reach marketing approval. Forevery 10 000 potential drug compounds only 10make it to initial clinical trials in which theinvestigational drug is administered to humansfor the first time. Of these ten, only one, ormaybe two, will successfully make it through allphases of clinical trials and be approved by aregulatory agency for marketing. Given that theother eight or nine investigational drugs will notreceive marketing approval, that is, they will“fail,” it makes sense from many perspectivesthat they fail as early as possible.

This statement may initially (and very reason-ably) seem counterintuitive to you: Isn’t the goalto approve a new drug? It certainly is, but, to geta drug approved, we need to provide a regulatoryagency with compelling evidence that the drugis both safe and efficacious. If a drug is not safeand efficacious, the sooner we find out thebetter, for various reasons. First, money spent ona drug that fails cannot be spent on developinganother drug that might receive marketing

Some context before reading Chapters 2–11 7

Page 23: Introduction to Statistics in Pharmaceutical Clinical Trials

approval – that is, from a business perspective, itis not optimal. Second, and more importantly,individuals volunteer to be in clinical trials.

In later phase clinical trials, individuals withthe disease or condition of interest, that is, thedesired indication for the investigational drugunder development, volunteer for these trials.One crucial ethical aspect of preclinical trials isthat we must be uncertain about whether theinvestigational drug works. If we know that thedrug works we should not be giving a placebo tohalf of our participants. If we know (or arguablyeven strongly suspect) that the drug does notwork we should not administer it to clinical trialparticipants. In addition, in the case of relativelyless common diseases, there are only so manyindividuals who can participate in clinical trials,and we would prefer that they participate in atrial employing an investigational drug with arelatively higher chance of being approved thanin one employing a drug with a relatively lowerchance. Therefore, we should be constantlylooking for evidence that our investigationaldrug does not work, and we should stop the clin-ical development program at that point. Thediscipline of Statistics provides the informationthat forms the rational basis for making the deci-sion to stop the clinical development program,that is, to kill the drug.

1.12.2 Decision-making during evidence-based clinical practice

Evidence-based clinical practice has twocomponents:

1. Providing the evidence: This is the domain ofclinical research and, in the case of drugdevelopment, the province of clinical trialists.The discipline of Statistics is a central and crit-ical component of the planning, conduct,analysis, and interpretation of clinical trials.

2. Using the evidence to decide on the best treat-ment for individual patients on a case-by-casebasis. This is the domain of clinical practice.

Both components are vital to evidence-basedclinical practice. Katz (2001, p xvii) has writtenpersuasively and eloquently on this issue:

All of the art and all of the science of medicinedepend on how artfully and scientifically we aspractitioners reach our decisions. The art of clin-ical decision making is judgment, an even moredifficult concept to grapple with than evidence.

1.12.3 Summary

This book introduces you to the statisticalmethodology employed in drug development atboth a conceptual and a computational level.Statistical methodology provides numericalrepresentations of information that facilitaterational, information-based decision-makingduring regulatory considerations and clinicalpractice.

1.14 References

Becker KM, Whyte JJ (eds) (2006). Clinical Evaluation ofMedical Devices: Principles and case studies, 2nd edn.Totowa, NJ: Humana Press.

Katz DL (2001). Clinical Epidemiology and Evidence-basedMedicine: Fundamental principles of clinical reasoning& research. Thousand Oaks, CA: Sage Publications.

National Institutes of Health (2004). The Seventh Reportof the Joint National Committee on Prevention, Detec-tion, Evaluation, and Treatment of High Blood Pressure.NIH Publication 04-5230. Bethesda, MD: NationalInstitutes of Health.

Piantadosi S (2005). Clinical Trials: A methodologicperspective, 2nd edn. Chichester: John Wiley & Sons.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

8 Chapter 1 • The discipline of Statistics: Introduction and terminology

1.13 Review

1. When studying the effect of an investigationalantihypertensive drug, what sources of backgroundvariation might we be concerned with inmeasurements of blood pressure?

2. Describe three cases where two biological orhealth traits are associated.

3. Describe three cases where two biological orhealth traits represent a cause and effect.

Page 24: Introduction to Statistics in Pharmaceutical Clinical Trials

2.1 Introduction

This chapter provides an overview of the processof new drug development. It introduces you tothis process in a relatively succinct way, but stillprovides enough details for you to understandthe relevance and importance of the statisticalmethodologies discussed in the following chapters.

As well as evaluating how well a new drug doesits job (for example, how well an investigationalantihypertensive drug lowers blood pressure), itis vital to evaluate the safety of the drug. Bothsafety and efficacy are investigated very thor-oughly in preapproval clinical trials, and thediscipline of Statistics provides the tools toconduct these investigations. As you will see indue course the types of statistical analyses usedin the assessment of safety and of efficacy arequite different. Moreover, the evaluation of datacollected during postmarketing surveillanceonce the drug has been approved for marketingemploys additional types of statistical analyses.Hence, there is a wide variety of statisticalanalyses, and it is critical to use the appropriatemethodologies and statistical analyses in eachcase.

New drug development is a very long,complex, and expensive undertaking. Theprocess starts with the identification of a drugmolecule, a chemical compound that maybecome an approved drug many years later.Extensive formulation and chemistry researchmust be undertaken to put the drug molecule ina form that may be used in nonclinical (animal)and clinical studies. The characteristics of a newdrug that make it useful include the following(Norgrady and Weaver, 2005):

• It is safe.• It is efficacious.• It can successfully navigate all necessary regu-

latory oversight, including those that governnonclinical trials and preapproval clinicaltrials, and be approved by regulatory agenciesfor marketing.

• It can be manufactured in sufficiently largequantities by processes that can comply withall necessary regulatory oversight and that arefinancially viable for the sponsor.

This list of attributes provides a good map forour discussions in this chapter.

2.2 Drug discovery

Drug discovery is the first part of the process ofdrug development. It can be conceptualized asthe research done from the time of the recogni-tion of a therapeutic need to the time that a drugcandidate is selected for initial nonclinicaltesting. A drug candidate may be a small mole-cule or a biological macromolecule such as aprotein or nucleic acid. Drug discovery activitiesvary between small molecules and biologicalmacromolecules but the way in which preap-proval clinical trials are structured is very similarin both cases. The descriptions here addresssmall molecule drug discovery.

The ultimate goals of drug discovery are toidentify a lead compound, a drug molecule that isthe first choice candidate for the next stage of thedrug development process, and then to optimizethe molecule. This latter activity is called leadoptimization. It refers to searching for a closelyrelated molecule or chemically engineering

2The role of clinical trials in new drug development

Page 25: Introduction to Statistics in Pharmaceutical Clinical Trials

modifications in the lead drug molecule toproduce the molecule that is best suited toprogress to nonclinical testing. Contemporarydisciplines such as genomics and proteomics,bioinformatics, structural–activity relationships,and in silico computer modeling (see Turner,2007) are used to maximize the chance of anidentified molecule producing the desired biolog-ical result (having beneficial pharmacodynamicactivity) while simultaneously minimizing itschance of producing unwanted side-effects(having pharmacotoxic activity). Once a drugmolecule that appears to have a ‘good’ chance ofbeing suitable for human pharmacotherapy isidentified (someone has to make this decision)the candidate drug is tested extensively innonclinical research.

2.3 Regulatory guidance andgovernance

Once a certain point in a nonclinical develop-ment program is reached, compliance with regu-latory governance becomes necessary. From thatpoint on all aspects of the drug developmentprocess – manufacturing, the remaining nonclin-ical studies, and clinical trials – are conductedfollowing regulatory guidance and governance.This section provides an overview of regulatoryagencies and their work and responsibilities.

There are many regulatory agencies aroundthe world, each charged to be responsible forpublic health within their respective countries.While there are some differences among theseagencies, there are also many similarities: Theactivities of the International Conference onHarmonisation of Technical Requirements forRegistration of Pharmaceuticals for Human Use,a long name that is usefully represented by theacronym ICH, have led to greater homogeneity.The ICH is an amalgamation of expertise fromregulatory agencies and trade associations inEurope, Japan, and the USA. This chapterincludes overviews of two regulatory agencies,the US Food and Drug Administration (FDA) andEuropean Medicines Agency (EMEA), and of theregulatory dossiers submitted to them during thedevelopment of a new drug.

2.3.1 The Food and Drug Administration

The US FDA is part of the Executive branch ofthe US government, and it is the country’s regu-latory agency responsible for the governance ofnew drug development. The FDA is housedwithin the Public Health Service, part of theDepartment of Health and Human Services.When a sponsor has generated sufficientdiscovery, formulation, and nonclinical data tojustify (in their opinion) initiation of studies inhumans, they prepare an investigational newdrug application (IND). Generally, an INDincludes data and information in four broadareas:

1. Animal pharmacology, pharmacokinetic, andtoxicology studies.

2. Manufacturing information: These dataaddress the composition, manufacture,stability, and controls used for manufacturingthe drug.

3. Clinical study protocol: When originallysubmitted the general investigational planshould outline the overall plan, but it needarticulate only the studies to be conductedduring the first year of clinical development.The clinical study protocol submittedincludes precise accounts of the design,methodology, and analysis considerationsnecessary to conduct the proposed studyand analyze their results (see Section 4.8). Aclinical investigator brochure is also typicallyincluded.

4. Investigator information: Information on thequalifications of clinical investigators isprovided to allow assessment of whether theyare qualified to fulfill their duties at theinvestigational sites used during the clinicaltrials.

As the IND progresses further clinical studyprotocols and the results of completed studies(manufacturing, nonclinical, and clinical) aresubmitted, and the IND grows accordingly.

When the clinical development program iscomplete and all nonclinical studies beingconducted contemporaneously are complete, thesponsor submits a new drug application (NDA)(in the case of a biologic product, a biologicslicense application [BLA] is submitted). Typically,

10 Chapter 2 • The role of clinical trials in new drug development

Page 26: Introduction to Statistics in Pharmaceutical Clinical Trials

sponsors meet with the FDA to discuss thecontent and format of an NDA before its prepara-tion, because this “pre-NDA meeting” can becrucial for the sponsor to understand the contentand format that will best facilitate the FDA’sreview (see Regulatory Affairs ProfessionalsSociety or RAPS, 2007, for more details).Marketing approval by the FDA means that a drugcan be marketed in all 50 states within the USA.

While the ICH publishes an extensive list ofguidances, the FDA also publishes guidances forindustry that can be very helpful and can belocated via the FDA’s website (www.fda.gov).

2.3.2 The European Medicines Agency

The European Medicines Agency (EMEA: thesecond E is correct here) has its headquarters inLondon and is responsible for the evaluation andsupervision of medicines for human (and veteri-nary) use in Europe. The EMEA coordinates theevaluation and supervision of medicinal prod-ucts throughout the European Union, bringingtogether the scientific resources of the 27 (at thetime of writing) European Union member states.It cooperates closely with international partnersin ICH activities.

At the point where an IND would besubmitted to the FDA, a clinical trial application(CTA) is submitted by the sponsor. We notedearlier that an IND grows in size as additionalclinical study protocols in a clinical develop-ment program are submitted to the FDA, eachbeing incorporated into the overall IND. Incontrast, CTAs are protocol specific and one CTAmust be filed for each clinical study protocol.Hence, in this case, the number of individual CTAsincreases during a clinical development program.CTAs are based on summary information only;no full study reports are submitted.

At the completion of the sponsor’s clinicaldevelopment program a marketing authoriza-tion application (MAA) is submitted. An MAA isused for both small molecule drugs andbiologics. There are two submission routes forthe sponsor to choose from:

1. the centralized procedure2. the decentralized procedure.

The centralized procedure has been in placesince 1995. The review of the MAA is coordi-nated by nominees from the Committee forMedicinal Products for Human Use (CHMP)called the rapporteur and co-rapporteur. Thisprocedure leads to a single EU scientific opinion,which is then translated into a pan-EU decisionby the European Commission. The centralizedprocedure is mandatory in some cases (forexample, for biotech drugs, and drugs intendedfor oncology, HIV, diabetes, and neurodegenera-tive disease indications) and it is also gainingpopularity for all new NCEs.

The decentralized procedure has been in placesince 2006. The review of the MAA is conductedby a single agency, called the Reference MemberState (RMS). However, other EU countries inwhich the sponsor wishes to market the drugreceive a copy of the MAA and are involved inconfirming the assessment made by the RMS.These additional agencies are called concernedmember states (CMSs). The decentralized proce-dure has its roots in the earlier “mutual recogni-tion” procedure that was put in place in 1995.The mutual recognition procedure operated in asimilar way except that the CMSs did not receivethe whole MAA until after the RMS had approvedthe product. In both the decentralized and themutual recognition procedure, the EMEA andCHMP do not get involved unless the RMSs andCMSs cannot reach a consensus decision.

Choice between centralized and decentralizedprocedures in the case of many NCEs (those forwhich the centralized procedure is not manda-tory) involves many factors, and the decision is astrategic milestone involving medical practice,manufacturing plans, the nature of product,market forces, and the size, resources, andstrengths of the sponsor in the EU (see Harman,2004, for more details).

Similarly to the FDA, CHMP and its ExpertWorking Parties provide scientific and regulatoryguidelines that apply across the EU to comple-ment ICH guidance. (Regulatory agencies inother countries and regions may develop guide-lines as needed.) Thus, while considerableprogress towards harmonization has beenmade, it is still important for those seekingglobal regulatory approvals to consider regionaland national regulatory guidance. (For further

Regulatory guidance and governance 11

Page 27: Introduction to Statistics in Pharmaceutical Clinical Trials

information see www.emea.europa.eu/htms/human/humanguidelines/efficacy.htm and www.emea.europa.eu/htms/general/contacts/CHMP/CHMP_WPs.html).

2.3.3 GMP, GLP, and GCP

These three acronyms refer to good manufac-turing practice (GMP), good laboratory practice(GLP), and good clinical practice (GCP). Thevarious stages of new drug development shouldbe conducted according to the appropriate regu-lations and guidance. The initial “c” can precedeeach of these acronyms, in each case standingfor the word current. The implication here isthat, in the years between rewrites of regulationsand guidance, certain modifications in the gener-ally accepted best way of performing a certainactivity (best practices) may occur. Therefore,while the guidance as written in the most recentversion reflects the “official” stance, it is consid-ered wise to conform to modified ideologies asappropriate.

2.3.4 Statistical aspects in the preparationof regulatory documentation

The discipline of Statistics plays a major role inthe preparation of all regulatory submissions,including the INDs, NDAs, CTAs, and MAAs thatwe have already mentioned. It also includesother important documents such as study proto-cols, statistical analysis plans, clinical investi-gator brochures, and the prescribing informationand promotional materials that will be used bythe sponsor to inform clinicians (and, in coun-tries such as the USA where direct-to-consumermarketing is permitted, patients) about the drug.Prespecification of the statistical analysis plan isnecessary to establish the credibility of studyresults. The accurate (and concise) presentationof the design of studies and their results is vitalto ensuring a favorable marketing decision andapproval of related documents.

2.3.5 Statistical aspects in the preparationof clinical communications

Sponsors typically publish the results ofimportant clinical trials in clinical communica-tions in medical journals, and present theresults at scientific conferences. As for the prepa-ration of regulatory documentation, scientificcommunications depend heavily on the disci-pline of Statistics. Piantadosi (2005) made thefollowing comment about publishing clinicalcommunications:

Reporting the results of a clinical trial is oneof the most important aspects of clinicalresearch. Investigators have an obligation toeach other, the study participants, and the scien-tific community to disseminate results in acompetent and timely manner (p. 479).

While the format of clinical communicationsis different from the format of clinical studyreports that are submitted to regulatory agencies,the discipline of Statistics provides the basis forthe approach taken. A typical format for aclinical communication is:

• Abstract: A concise overview of the entirearticle

• Introduction: The rationale for the study• Methods: The study design, study sample,

methodology used, statistical analysesemployed

• Results: The findings from the study • Conclusions: The main findings and their

interpretation• Discussion: How the conclusions fit in with

previous literature, and what the implicationsare for future research. Any limitations of thestudy are also a legitimate (and useful) aspectof this section.

The purpose of clinical research is to provideinformation that guides clinical practice. Clin-ical communications are read by physicians whouse the information provided when decidingwhether a particular treatment might be appro-priate for an individual patient. These articles are

12 Chapter 2 • The role of clinical trials in new drug development

Page 28: Introduction to Statistics in Pharmaceutical Clinical Trials

therefore extremely important, and the informa-tion must be presented accurately, meaningfully,and ethically. The term “ethically” emphasizesthat authors must tell “the truth, the wholetruth, and nothing but the truth” about theirstudy in these communications (see Turner,2007, for further discussion).

Guidelines for reporting clinical trials in clin-ical communications are provided by theConsolidated Standards of Reporting Trials(CONSORT) group. We recommend that you readthe CONSORT statements (see www.consort-statement.org). We also refer you to Bowers et al.(2006) and Stuart (2007) for extensive coverageof this topic.

2.4 Pharmaceutical manufacturing

Once it has been decided to progress an identi-fied drug molecule to nonclinical testing, themolecule will be tested in various ways, bothin vitro and in vivo. Initial testing may requireextremely small amounts of the candidate drugmolecule. However, as testing progresses, largeramounts are needed. In addition, when themolecule is ready to be given to animals, it needsto be administered in a certain manner – that is,a drug molecule delivery system needs to bemanufactured. Once a certain stage of nonclin-ical research has been reached, the drug deliverysystem must be manufactured according tocGMP standards as detailed by regulatory agen-cies. GMP regulations also apply to drug deliverysystems used in clinical trials.

When an investigational drug is tested in earlypreapproval clinical trials relatively smallamounts of drug product supplies are needed.However, if and when later stage preapprovalclinical trials are conducted, considerably largerdrug product supplies are needed. If the investi-gational drug is then approved for marketing theamount of drug that needs to be manufacturedincreases again. The manufacturing facilities thatare needed to make the drug on a postapproval

marketing scale are likely to be very different intheir operation (not just bigger) from the variousmanufacturing facilities used during the drugdevelopment process (see Turner, 2007).

Manufacturing is a critical topic thatfrequently does not get the recognition andattention that it deserves. Imagine discovering anew drug molecule that could do wonders, butyou cannot find a way to manufacture a drugdelivery system that will get the drug safely intoa patient who needs it. The drug delivery systemneeds to be able to be manufactured and then bereadily transported from the manufacturingplant to the pharmacy in a form that demon-strates stability, and therefore has a suitably longshelf life. If you cannot do all this, the wonderdrug would, for all intents and purposes, beuseless (recall that the definition of a usefuldrug in Section 2.1 included manufacturingconsiderations).

There are various methods of introducing adrug molecule into the body: Tablets and injec-tions are just two examples. In this chapter wefocus on the manufacture of drugs that are givenorally in tablets, because the largest percentageof drugs are administered in this manner. Atablet is a complex, manufactured, drug deliverysystem that gets the drug molecule, the activepharmaceutical ingredient (API) that exerts thedrug’s pharmacodynamic effect, into thesystemic (whole body) blood supply, whichcarries it round the body to its target receptors.

The API is likely to be a small component ofthe tablet. Various other nonpharmacologicallyactive ingredients, called excipients, are alsoconstituents of the tablet. Each of these excipi-ents has a specific characteristic that enables it toperform a useful function in getting the API toits target receptor. Some of the excipients protectthe API from various chemical attacks in themouth and on its way to the gastrointestinaltract. Others help it to travel through the gastro-intestinal tract. Eventually the API is releasedfrom its formulation so that it can be absorbed inthe small intestine and be transported aroundthe body in the blood supply.

Pharmaceutical manufacturing 13

Page 29: Introduction to Statistics in Pharmaceutical Clinical Trials

2.4.1 Manufacturing drug products forclinical trials

An additional consideration when manufac-turing the drug products that are used in preap-proval clinical trials is that they need to be“disguised.” As we see later in the book, thesafety and efficacy of an investigational drug arecompared with those of another compound,called a control compound. In the types of trialson which we focus this control compound istypically (but not always) a placebo. The experi-mental methodology employed in these trials(discussed in Chapter 4) requires that neither theparticipants in the trials nor the investigatorswho are conducting them know whether theparticipants are being given the investigationaldrug or the placebo. Therefore, both clinicaldrug products need to look, smell, and taste thesame – that is, they need to be blinded. Trials inwhich neither the investigator nor the partici-pant can identify the investigational drug arecalled double-blind trials.

The blinding of clinical drug products addsanother degree of complexity to the manufac-turing needed for these trials. It involves twosteps: Making the investigational drug and theplacebo the same in appearance, as noted, andthen packaging them in such a way that theycannot be distinguished by the package in whichthey are supplied to investigators. This practiceis, necessarily, contrary to the manner in whichmarketed drugs are supplied.

2.5 Nonclinical research

Nonclinical research (often called preclinicalresearch, but we prefer the term “nonclinical”)involves the in vitro and in vivo animal researchthat is conducted and reported to regulatoryagencies before starting preapproval clinicaltrials. Once the drug molecule candidate iden-tified in drug discovery has been optimized, itmoves into the nonclinical developmentprogram. While human pharmacological therapyis the ultimate goal, an understanding ofnonclinical drug safety and efficacy is critical tosubsequent, rationally designed, ethical, human

trials. The term “efficacy” is used in drug devel-opment to refer to the desired therapeutic(biological) effect of the candidate drug.Nonclinical research gathers critical informationabout the best likely drug dose, frequency, androute of administration if and when researchprogresses to human trials. It also investigatespharmacokinetics, pharmacodynamics, andtoxicology in animals.

Pharmacokinetics is the study of the effect thatthe body has on the drug. The pharmacokineticphase can be regarded as the time from thedrug’s absorption into the body until it reachesits target receptor site. Dhillon and Gill (2006)noted that pharmacokinetics “provides a mathe-matical basis to assess the time course of drugsand their effects in the body.” Pharmacokineticprocesses that determine the concentration of adrug that has been administered include absorp-tion, distribution, metabolism, and elimination(ADME). Pharmacodynamics is the study of thedesired effect that a drug has on the body. Forexample, the pharmacodynamic effect of anantihypertensive drug is to lower blood pressure.The pharmacodynamic phase begins once thedrug molecule reaches its target receptor. Toxico-dynamics is the analogous study of the unde-sired effect(s) that a drug has on the body. Thetoxicodynamic phase begins once the drugmolecule reaches a nontarget receptor(s).

Nonclinical safety pharmacology studiessubmitted to regulatory agencies are outlined in ICH Guidance S7A (2001), and the basic pack-age includes evaluation of a drug candidate’seffects on the central nervous system, respiration,and the cardiovascular system. Cardiovascularsystem evaluation includes assessment of cardiacfunction and cardiac electrophysiologicalactivity.

Nonclinical toxicological testing is necessarybecause some compounds can be so toxic thatthey cause cell death, leading to loss of impor-tant organ function. Other toxicological effectsare the result of interactions with variousbiochemical and physiological processes that donot affect the survival of the cells. ICH GuidanceM3 (R1) (2000) addresses several topics related totoxicity, including single and repeat dose toxi-city studies, genotoxicity, carcinogenicity, andreproductive toxicity. Relatively less evidence of

14 Chapter 2 • The role of clinical trials in new drug development

Page 30: Introduction to Statistics in Pharmaceutical Clinical Trials

toxicity is considered as relatively greaterevidence of the safety of the drug. The route ofadministration of the drug compound innonclinical research is typically the intendedroute in clinical settings and therefore the routethat will be used in clinical trials.

Exploratory toxicology studies are conductedto provide an idea of the main organs and phys-iological systems involved, and to estimate thedrug’s toxicity when administered across a rela-tively short period of time. These studies do notneed to be conducted according to cGLP guide-lines and they are not typically conducted witha drug compound that has been manufactured tocGMP standards.

Regulatory toxicology studies are submitted toregulatory agencies and are conducted accordingto cGLP standards. Some regulatory toxicologystudies need to be done before the first clinicaltrials are started. Other regulatory toxicologystudies are typically conducted in parallel withclinical trials. These include toxicological studiesin two or more animal species lasting up to 1 year, carcinogenicity tests and reproductive toxi-cology studies lasting up to 2 years, and interac-tion studies that examine possible drug–druginteractions with other drugs that may beprescribed concurrently in humans. These studiesare expensive to conduct, and so they are typi-cally not started unless and until the drugprogresses into clinical studies.

Mutagenicity is the chemical alteration ofDNA that is sufficient to cause abnormal geneexpression. Mutagenicity, also known as geno-toxicity, includes a comprehensive set of events,of which carcinogenicity and teratogenicity areimportant subsets. Carcinogenicity describesactivity that leads to cancer, and teratogenicitydescribes activity that leads to the impairment offetal development.

Nonclinical information can be useful inanother arena once a drug has been approved.Prescribing information can include the resultsof nonclinical toxicology studies (carcinogen-esis, mutagenesis, and impairment of fertility).In instances where no human (clinical) dataare available, it is possible for a clinician toincorporate nonclinical evidence into his orher decision-making process when decidingwhether the benefit:risk ratio of prescribing the

drug to a patient is favorable. The process ofusing clinical data to form such decisions canbe challenging at times, and the process ofusing nonclinical data even more so. Neverthe-less, in some instances nonclinical data mayprove of assistance in this regard.

While a nonclinical development program isinformative and important, no amount ofnonclinical research can predict precisely whatwill happen once the candidate drug is givento humans. Therefore, a clinical developmentprogram is also necessary. The clinical develop-ment program builds in many meaningful wayson the results from the nonclinical developmentprogram.

2.6 Clinical trials

Clinical development programs consist of avariety of preapproval clinical trials, all designedfor a specific purpose of revealing particularinformation concerning the investigationaldrug’s safety and efficacy.

2.6.1 Categorization of clinical trials byphase

A common system of categorization for preap-proval clinical trials includes Phase I, Phase II,and Phase III clinical trials (Phase IV clinicaltrials are conducted postapproval to collectadditional information about a marketeddrug). Phase I, II, and III clinical trials can besummarized as follows:

• Phase I: Pharmacologically oriented trials thattypically look for the best range of doses toemploy. These trials employ healthy adults,usually men. Comparison of the investiga-tional drug’s efficacy with other treatments(such as a placebo or a drug that is alreadymarketed) is not a specific aim of these trials,because by definition healthy individuals donot have the disease or condition of interest.However, incorporation of an inactive controlcan be useful because some of the proceduresemployed in these trials may themselves

Clinical trials 15

Page 31: Introduction to Statistics in Pharmaceutical Clinical Trials

give rise to physiological changes that couldotherwise be perceived as adverse events (seeSection 7.5 for additional discussion).

• Phase II: These trials are designed to look forevidence of activity and preliminary evidenceof efficacy and safety at a number of doses.Relatively small numbers of individuals withthe condition or disease of interest are used.To gain an understanding of efficacy a controltreatment is typically used at this stage.

• Phase III: These trials employ larger numbersof individuals with the condition or disease ofinterest, and they are comparative in nature –comparison with another treatment (often aplacebo, but possibly an active control) is afundamental component of the design. Thesetrials are undertaken if Phase I and II studieshave provided preliminary evidence that thenew treatment is safe and effective.

While this system of categorizing preapprovalclinical trials is widespread, unfortunately it isnot used consistently. As Turner (2007) noted,two studies with the same aims may be classifiedinto different phases, and two studies classifiedinto the same phase may have different aims. An alternative system has been suggested by the ICH.

2.6.2 ICH categorization of clinical trials

The ICH has published a series of Guidances onmany aspects of conducting clinical trials (seewww.ich.org). One of these, ICH Guidance E8(1997), provides an alternative approach to categorizing clinical trials, classifying themaccording to their objective. This system isshown in Table 2.1.

16 Chapter 2 • The role of clinical trials in new drug development

Table 2.1 ICH classification of clinical trials

Objectives of study Study examples

Human pharmacologyAssess tolerance Dose–tolerance studiesDescribe or define pharmacokinetics (PK) and Single and multiple dose PK and/or PD studies

pharmacodynamics (PD) Drug interaction studiesExplore drug metabolism and drug interactionsEstimate (biological) activity

Therapeutic exploratoryExplore use for the targeted indication Earliest trials of relatively short duration in well-definedEstimate dosage for subsequent studies narrow patient populations, using surrogates ofProvide basis for confirmatory study design, endpoints, pharmacological endpoints or clinical measures

methodologies Dose–response exploration studies

Therapeutic confirmatoryDemonstrate/confirm efficacy Adequate and well-controlled studies to establishEstablish safety profile efficacyProvide an adequate basis for assessing benefit:risk Randomized parallel dose–response studies

relationship to support licensing Clinical safety studiesEstablish dose–response relationship Studies of mortality/morbidity outcomes

Large simple trialsComparative studies

Therapeutic useRefine understanding of benefit:risk relationship in Comparative effectiveness studies

general or special populations and/or environments Studies of mortality/morbidity outcomesIdentify less common adverse reactions Studies of additional endpointsRefine dosing recommendation Large simple trials

Pharmacoeconomic studies

From ICH Guidance E8 (1997).

Page 32: Introduction to Statistics in Pharmaceutical Clinical Trials

Trials that might otherwise be categorized asPhase I, II, III, and IV trials are referred to ashuman pharmacology, therapeutic exploratory,therapeutic confirmatory, and therapeutic usetrials, respectively.

2.6.2.1 Human pharmacology trials

Human pharmacology or first-time-in-human(FTIH) clinical trials are undertaken in anextremely careful manner in tightly controlledsettings, often in residential or inpatient medicalcenters. Typically, between 20 and 80 healthyadults participate in these relatively shortstudies, and are often recruited from universitymedical school settings where trials are beingconducted. The main objectives are to assess thesafety of the investigational drug, understandthe drug’s pharmacokinetic profile and anypotential interactions with other drugs, andestimate pharmacodynamic activity. A rangeof doses and/or dosing intervals is typicallyinvestigated in a sequential manner.

Participants are given extensive physicalexaminations before they receive their first doseof the drug, at various intervals throughout thetreatment, and once they have finished the drugregimen. The trials are designed to collect datathat can be compared with similar types ofdata collected in nonclinical studies. As notedearlier, no animal model data can ensure that adrug will be safe when given to humans.However, it can be informative to see howsimilar the overall pictures of animal responsesand human responses are. Single-dose trials inwhich the dose chosen is based on the nonclin-ical work are conducted first. Later, dose-findingstudies are conducted to determine themaximum tolerated dose (MTD) of the drug, andto answer questions concerning the side-effectsthat are seen, their characteristics, and whetherthey are consistent across participants to anynotable degree.

Although the data collected during humanpharmacology trials are not in themselvesenough to obtain marketing approval, they cancertainly have the opposite effect: Unfavorabledata can lead to a sponsor’s decision not topursue further development of the drug. Whileachieving marketing approval of an investiga-

tional drug is the sponsor’s goal, if the drug isunlikely to succeed it is financially attractive todiscover this as early as possible in the drugdevelopment program. As we noted earlier, thesentiment here is “If you are going to fail, failfast!” A well-conducted human pharmacologystudy can reduce the possibility of later failedtrials by revealing unfavorable characteristics ofthe drug at this stage. This is preferable to thesponsor and, more importantly, in the bestinterests of individuals who may have beenparticipants in later trials that failed.

From a statistical viewpoint the design ofhuman pharmacology studies has certain impli-cations. These trials include a relatively smallnumber of participants, but a lot of measure-ments are collected for each participant. Thisstrategy has both advantages and limitations.The extensive array of measurements madeallows the drug’s effects to be characterizedreasonably thoroughly. However, few participantsin these studies makes generalization of resultsto the general population relatively harder thanfor studies with larger sample sizes.

2.6.2.2 Therapeutic exploratory trials

Therapeutic exploratory trials are conducted ifthe results of the human pharmacology trials areconsidered positive (someone has to decide thatthe results are positive). These trials involve thecomprehensive assessment of the investigationaldrug’s safety in perhaps 200–300 individualswith the disease or condition of interest. Theyare typically conducted by clinical pharmacolo-gists, and participants in these trials are oftenhospitalized and can therefore be closely moni-tored. Extensive data are collected, includingself-report assessments by the participants andbiochemical assessments.

Sometimes efficacy will be investigated inthese trials. Participants are again typicallyhospitalized and can therefore be closely moni-tored. Assessments of efficacy are typicallyconducted by individuals specifically trained inclinical trial methodology.

2.6.2.3 Therapeutic confirmatory trials

If the therapeutic exploratory trials are successful(someone has to decide if this is the case), the

Clinical trials 17

Page 33: Introduction to Statistics in Pharmaceutical Clinical Trials

drug development program will proceed to therapeutic confirmatory trials. By now, theearlier human pharmacology and therapeuticexploratory trials have defined the most likelysafe and effective dosage regimen(s) for use inthese therapeutic confirmatory trials. These trialsmay employ around 3000–5000 participants,each of whom has the disease or condition of interest, and they are typically conducted as randomized, double-blind, concurrentlycontrolled trials (see Chapter 4).

Tight experimental control is an extremelyimportant goal in all experimental trials. Conse-quently, we talk a lot about how to maximizesuch control in these trials. However, it issimply a realistic consequence of the way thattherapeutic confirmatory trials have to beconducted that the experimental control cannotbe quite as tight in these trials as it is in humanpharmacology trials and therapeutic exploratorytrials.

As we note many times in this book there areadvantages and disadvantages associated withmany occurrences in clinical trials, and with thelast statement of the previous paragraph. Thevery high level of experimental control that ispossible in therapeutic exploratory trials meansthat these trials are better at assessing the ‘pure’biological effect of the drug, that is, the efficacyof the drug under near-ideal circumstances.However, if and when a drug is approved formarketing and is being used by many patients,the drug will not be taken in the same veryhighly controlled manner in which it was givento participants in therapeutic exploratory trials.The data from therapeutic confirmatory trials aretherefore likely to be more indicative of how thedrug will actually work in the general populationif it receives marketing approval.

2.6.2.4 Therapeutic use trials

Therapeutic use trials are conducted once a drughas been approved to gain additional informa-tion about the safety and efficacy, or effective-ness, of the drug: The term “effectiveness” isused to describe how well the drug works inpatients once the drug has been approved. Oneexample of this type of trial is the simplified clin-ical trial (SCT). The intent of the word simplified

in the term “SCT” should be clarified here. Itrefers to the fact that the demands on partici-pants and investigators are less than in preap-proval trials. It is not meant to indicate that theimplementation of these large trials is simple:On the contrary, their design and logistics arecomplex.

As in preapproval therapeutic confirmatorytrials, participants in SCTs are randomly assignedto a treatment group. However, SCTs haveseveral characteristics that distinguish themfrom therapeutic confirmatory trials. Probablythe most immediately noticeable difference isthe number of participants who participate inthem. SCTs are designed to detect rare events byincluding sample sizes that are much larger thanthose employed in preapproval trials, and thusthey are very important in safety monitoring.However, in order to facilitate the conduct of anSCT involving such a large number of partici-pants, the amount of information collected perparticipant is much smaller than in therapeuticexploratory studies. The demands on both theparticipants and the investigators conductingthe trial have to be reduced to make these studiesviable. So, for example, instead of visiting theinvestigative site every week during a 12-weektreatment period and having a large number ofassessments made, participants may visit onlytwice (at the start and end of the treatmentperiod) and have relatively few assessmentsmade on those occasions. These assessments arethose of most interest.

Participants in SCTs receive treatments in amore naturalistic setting than those in thera-peutic confirmatory trials. Therefore, as well asmore participants, the treatment settings aremuch more representative of how patients ingeneral will be treated when the drug is approved.Advantages and disadvantages accompany manychoices and occurrences in designing andconducting clinical trials.

2.7 Postmarketing surveillance

Postmarketing surveillance is conducted oncethe approved drug is in widespread use toexamine safety in a more comprehensive

18 Chapter 2 • The role of clinical trials in new drug development

Page 34: Introduction to Statistics in Pharmaceutical Clinical Trials

manner. Postmarketing surveillance monitors allreports of adverse reactions and thus can be usedto compile extended safety data. This pharmaco-surveillance is a critical component of the overallprocess of ensuring that all members of a targetdisease population receive the greatest protec-tion from adverse reactions. As this book focuseson preapproval trials and their statisticalmethodology, postmarketing surveillance is notdiscussed. Readers are referred to Mann andAndrews (2007).

2.8 Ethical conduct during clinical trials

The ethical conduct of all clinical researchers isof supreme importance. Participants in all clin-ical trials are volunteers: while the word “volun-teers” is typically exclusively used to describeparticipants in human pharmacology studies, allparticipants in all clinical trials are, by defini-tion, volunteers (see Turner, 2007). Individualsparticipate in clinical trials for the greater good,not specifically to benefit themselves. Everyoneinvolved in clinical research has an obligation toconduct all aspects of this research to the highestethical standards.

2.8.1 Ethical principles

Several fundamental ethical principles guidedrug development research in clinical trials (seeTurner, 2007):

• Clinical equipoise: This requires that acomparative clinical trial must be started inthe good faith that the investigational drugand the control treatment are of equalmerit. The aim is to discover whether or notthe investigational drug is of greater merit.Once there is compelling evidence that theinvestigational drug is of greater merit, itbecomes unethical to give the control drugto participants.

• Respect for individuals: Investigators mustgive potential trial participants all pertinentinformation about the study, and answer allof their questions. If a potential participantthen agrees to participate voluntarily (that is,

he or she is not coerced in any real or impliedmanner), informed consent is obtained.

• Beneficence: The study design employed inthe trial must be scientifically sound, andany known risks of the research must beacceptable in relation to the likely beneficialknowledge that will be obtained.

• Justice: The burdens and the benefits ofparticipation in clinical trials must be distrib-uted evenly and fairly. Vulnerable popula-tions (for example, prisoners, residents innursing homes) should not be deliberatelychosen for participation in clinical trialswhen nonvulnerable populations are alsoappropriate participants. The benefits ofparticipation, such as access to potentiallylife-saving new therapies, should be availableto all, including those not historically wellrepresented such as women, children, andmembers of ethnic minorities.

The topic of ethics in clinical trials is addressedagain in Section 3.16.1, where we talk aboutethical considerations in choosing the nature ofthe control treatment used in trials involvinginvestigational antihypertensive drugs.

2.8.2 Ethical considerations in statisticalmethodology

It is appropriate here to highlight the additionalethical responsibilities that are shouldered bythose involved in statistical aspects of clinicaltrials, as outlined in the operational definition ofStatistics presented in Chapter 1. Derenzo andMoss (2006) addressed the importance of ethicalconsiderations in scientific and statistical aspectsof clinical studies:

Each study component has an ethical aspect.The ethical aspects of a clinical trial cannot beseparated from the scientific objectives. Segre-gation of ethical issues from the full range ofstudy design components demonstrates a flawin understanding the fundamental nature ofresearch involving humans. Compartmentaliza-tion of ethical issues is inconsistent with a well-run trial. Ethical and scientific considerationsare intertwined.

Derenzo and Moss (2006, p 4)

Ethical conduct during clinical trials 19

Page 35: Introduction to Statistics in Pharmaceutical Clinical Trials

Ethical awareness and ethical responsibility arekey aspects of statistical methodology in clin-ical trials. Areas where ethical and scientificconsiderations are inextricably linked include:

• Study design and experimental methodology:It is unethical to include people in a studywhere poor design and/or poor methodologywill lead to less-than-optimum quality dataand therefore less-than-optimum qualityanswers to the study’s research question.

• Sample-size estimation: A trial requires suffi-cient participants to answer the research ques-tion without exposing them unnecessarily tothe risks of the experimental therapy.

• Early termination of trials: Data monitoringcommittees (DMCs), independent groupscharged with reviewing interim data fromclinical trials, face difficult ethical challengeswhen deciding whether a clinical trial shouldbe terminated early. In a recent guidancedocument the FDA described the roles andresponsibilities of DMCs and when suchcommittees may be useful or required (USDepartment of Health and Human Services,FDA, 2006).

• Communicating trial results: Researchershave an ethical responsibility to report infor-mation accurately and fully in clinicalcommunications, as these directly impactpatient care.

Correct study design is absolutely essentialfrom both scientific and ethical perspectiveswhen conducting clinical trials. If a study’sdesign cannot lead to the collection of data thatcan be analyzed meaningfully, no meaningfulinformation about the investigational drug canbe gained. Participants in clinical trials have thelegitimate expectation that their participation inthe trial will help advance our knowledge of theinvestigational drug, and if the study’s designcannot possibly provide additional knowledgeabout the drug their expectation is not fulfilled(Turner, 2007).

2.10 References

Bowers D, House A, Owens D (2006). UnderstandingClinical Papers, 2nd edn. Chichester: John Wiley &Sons.

Derenzo E, Moss J (2006). Writing Research Protocols:Ethical considerations. Oxford: Elsevier.

Dhillon S, Gill K (2006). Basic pharmacokinetics. In:Dhillon S, Kostrzewski A (eds), Clinical Pharmaco-kinetics. London: Pharmaceutical Press, 1–44.

Harman RJ (2004). Development and Control of Medicinesand Medical Devices. London: Pharmaceutical Press.

ICH Guidance E8 (1997). General Consideration ofClinical Trials. Available at: www.ich.org (accessedJuly 1 2007).

ICH Guidance M3 (R1) (2000). Nonclinical Safety Studiesfor the Conduct of Human Clinical Trials for Pharma-ceuticals. Available at: www.ich.org (accessed July 12007).

ICH Guidance S7A (2001). Safety Pharmacology Studiesfor Human Pharmaceuticals. Available at:www.ich.org (accessed July 1 2007).

20 Chapter 2 • The role of clinical trials in new drug development

2.9 Review

1. What are the characteristics of a useful new drug?

2. Describe the ethical considerations in preparingclinical communications.

3. What is the role of pharmaceutical manufacturingin new drug development?

4. What is the role of nonclinical research in newdrug development?

5. Consider the ICH classification of clinical trials:Human pharmacology, therapeutic exploratory,therapeutic confirmatory, and therapeutic use:

(a) What is the role of each type of trial in thedevelopment of a new drug from a molecule to a new therapy?

(b) How do results from these types of trialspertain to the use of the new drug in medicalpractice?

Page 36: Introduction to Statistics in Pharmaceutical Clinical Trials

Mann, R. and Andrews, E. (2007) Pharmacovigilance.Hoboken, NJ: John Wiley & Sons.

Norgrady T, Weaver DF (2005). Medicinal Chemistry: Amolecular and biochemical approach, 3rd edn. Oxford:Oxford University Press.

Piantadosi S (2005). Clinical Trials: A methodologicperspective, 2nd edn. Chichester: John Wiley & Sons.

Regulatory Affairs Professionals Society (2007).Fundamentals of US Regulatory Affairs. Rockville, MD:RAPS.

Stuart MC (2007). The Complete Guide to MedicalWriting. London: Pharmaceutical Press.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

US Department of Health and Human Services, Foodand Drug Administration (2006). Establishmentand Operation of Clinical Trial Data MonitoringCommittees. Available from www.fda.gov (accessedJuly 1 2007).

References 21

Page 37: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 38: Introduction to Statistics in Pharmaceutical Clinical Trials

3.1 Introduction

One of the most efficient ways to acquire newknowledge about any topic is to devise questionsthat guide our investigations. These questionscan focus our thinking and keep us on track as we gather information that contributes to ourknowledge base. Devising more specific ques-tions as we progress is typically a good strategy.While relatively loosely formed questions can bevery helpful early on in the knowledge acquisi-tion process, refining our knowledge is facili-tated by asking more specific questions, and, inturn, acquiring more specific information andknowledge.

The scientific method is one particularmethod of acquiring new knowledge. In scien-tific research, including drug development, ourquestions need to be asked in a particularmanner. These questions are called researchquestions, and they lead to the development ofresearch hypotheses. The scientific methodrequires that these research hypotheses be struc-tured in a certain way, and then tested in a scien-tific manner. As noted earlier, a fundamentalcharacteristic of these research hypotheses is thatthey can be disproved. The word “disproved” isnot a typo: These hypotheses need to be able tobe disproved, not “proved.”

3.2 The concept of scientific researchquestions

In our operational definition, Statistics isregarded as a multifaceted scientific disciplinethat comprises several activities. The first ofthese is identifying a research question that

needs to be answered. We noted in Chapter 2that the reason for the development of a newdrug is usually the identification of an unmetmedical need. The development of a drug thatwill meet this need requires a series of nonclin-ical studies that comprise the nonclinical devel-opment program, followed by a series of clinicalstudies that comprise the clinical program. Ineach case, the study will be designed to answer aspecific research question or questions. (From astatistical perspective, it is a good idea to have asmall number of research questions that will beanswered by an individual study, despite the realtemptation to try to collect data that will answermany research questions.)

3.3 Useful research questions

In Section 2.1 we cited Norgrady and Weaver’s(2005) definition of a useful drug in the contextof drug development. That definition is a goodillustration of how the term “useful” is used inscientific research. A precise, operational defini-tion is needed, rather than a vague statementsuch as “it looks pretty useful to me.”

Turner (2007) provided an operationaldefinition of a useful research question:

• It needs to be specific (precise).• It needs to be testable.

If a candidate drug successfully makes it throughthe nonclinical development program, the safetyand efficacy of the investigational drug will betested in humans in a series of clinical trials thatcomprise the clinical development program.Each study that is conducted will test a particularaspect or facet of the drug, and the overall

3Research questions and research hypotheses

Page 39: Introduction to Statistics in Pharmaceutical Clinical Trials

development program will employ these trials ina systematic fashion. We noted in Section 2.6 that three kinds of trials are typically conductedbefore a new drug is approved for marketing:human pharmacology trials, therapeuticexploratory trials, and therapeutic confirmatorytrials. The information that is collected duringhuman pharmacology trials forms the basis for the testing that is done in therapeuticexploratory trials, and the information fromboth categories of trials forms the basis for thetesting that is done in therapeutic confirmatorytrials – that is, a body of information and knowl-edge about the investigational drug is gained inan incremental manner, a hallmark of scientificinvestigation.

Information is gathered by conducting studieseach of which asks a specific, testable researchquestion. A general and vague question such as“Is this investigational drug good for people’sblood pressure?” is simply not useful in thiscontext, because it does not facilitate the acqui-sition of useful information. The same is true innonclinical studies: the question “Do you thinkthat the drug is pretty safe when given to abunch of animals?” will not facilitate theacquisition of useful nonclinical information.

3.4 Useful information

You may have noticed that we used the term“useful information” twice in the previous paragraph. Accordingly, it is helpful to providean operational definition of useful informationin the context of drug development. Usefulinformation has the following characteristics:

• It needs to be specific (this is the same as the first characteristic of a useful researchquestion – see Section 3.3).

• It provides a solid basis for further studies thatwill acquire more useful information.

• It provides the rational basis for decision-making during the drug development process.

• It will be acceptable to regulatory agencies,which will eventually review the reports thatprovide the information to them.

3.5 Moving from the research questionto the research hypotheses

Before discussing the connection between theresearch question and the associated researchhypotheses, it should be noted that the word“hypotheses” here is not a typo: We meant towrite the plural form “hypotheses” and not thesingular form “hypothesis.” As discussed shortly,each research question has two associatedresearch hypotheses.

We noted in Section 3.3 that a research ques-tion needs to be useful. We also noted that thequestion “Is this investigational drug good forpeople’s blood pressure?” is not useful, because itdoes not provide a precise definition of “good” –that is, the research question is not specific, orprecise. A better research question might be“Does the investigational drug alter blood pres-sure?” At the outset of any scientific inquiryabout the potential effects of an investigationalhypertensive drug, we must entertain the notionthat the drug may, contrary to our expectations,actually increase blood pressure. For reasons thatwe explain in Chapter 6, this potential isexpressed in the statement of the statisticalhypotheses. For now, this improved researchquestion addresses the intention of our experi-ment and drug development program. However,we can do better.

3.6 The placebo effect

An interesting phenomenon in pharma-cotherapy is called the “placebo effect.” Thedictionary currently sitting in our office (TheAmerican Heritage College Dictionary, 3rd edn)provides several definitions of the word placebo,including:

A substance containing no medication andgiven to reinforce a patient’s expectation to getwell

An inactive substance used as a control in anexperiment to determine the effectiveness of amedicinal drug

24 Chapter 3 • Research questions and research hypotheses

Page 40: Introduction to Statistics in Pharmaceutical Clinical Trials

Both of these definitions are helpful in thepresent context. The first is a more general one,and relates to the observation that, if a person isgiven a substance containing no medication, butthat person believes that the substance will havea beneficial therapeutic effect, it is not unusualthat improvement may be seen in the person’scondition. The second definition is a particularlyrelevant definition in the context of this book,and the term “placebo” will be used extensivelyin later chapters.

3.7 The drug treatment group and theplacebo treatment group

The terms “drug treatment group” and “placebotreatment group” will become very familiar toyou as you work your way through this book.Chapter 4 provides more detail, but it is helpfulat this point to comment briefly on thefollowing aspect of certain preapproval clinicaltrials.

It is very common in therapeutic confirmatorytrials (and in some therapeutic exploratory trials)to compare the effects of the investigational drugwith the effects of a placebo (see also the discus-sions in Section 3.16.1) – that is, these trials arecomparative in nature. One common studydesign involves giving the investigational drugto one group of individuals, the drug treatmentgroup, and the placebo to a second group ofindividuals, the placebo treatment group. Inaddition, and extremely importantly, these indi-viduals do not know whether they are receivingthe investigational drug or the placebo. All indi-viduals are treated identically throughout thetrial, with the one exception of the treatmentthat they receive.

A key point to note is that individualsreceiving the placebo treatment often show asmall improvement in the condition that is thefocus of the trial. In some instances, such asstudies of antidepressants and analgesics,improvement is self-reported by the individualsthemselves and can often be marked. In addi-tion, improvement among individuals can be

accounted for by a phenomenon called “regres-sion to the mean.” Imagine a clinical therapeutictrial involving an investigational antihyperten-sive drug where individuals are eligible forenrollment only if they have a documentedsystolic blood pressure (SBP) of a specified level(say at least 140 mmHg). It is a simple fact that,as a result of expected and naturally occurringrandom variation, some individuals mayinitially show this level of SBP even though itdoes not accurately reflect their true SBP. Onsubsequent observations the level of the charac-teristic will return closer to its expected level,resulting in an “improvement” caused simply bythe fact that the individual was enrolled at atime when his or her SBP was higher thannormal. In trials involving investigational anti-hypertensive drugs, it is not unusual for individ-uals in the placebo treatment group to showsmall decreases in BP during the trial. Therefore,it becomes important to determine whether theinvestigational drug has a larger effect on BPthan the placebo – that is, the trial is compara-tive in nature and the placebo is the controlagainst which the investigational drug iscompared.

3.8 Characteristics of a useful researchquestion

In Section 3.4 we formulated some initialversions of a research question. We decided thatthe question “Is this investigational drug goodfor people’s blood pressure?” is not useful. Wenoted that an improved research question mightbe “Does the investigational drug alter bloodpressure?” This is certainly moving in the rightdirection. However, as noted at the end of thatsection, we can do better.

A good clue as to how we can devise a betterresearch question comes from the discussionsthat we have just had about the comparativenature of these trials, in which the effect of theinvestigational drug is compared with the effectof the placebo. Based on these discussions animproved research question can be phrased as

Characteristics of a useful research question 25

Page 41: Introduction to Statistics in Pharmaceutical Clinical Trials

“Does the new drug alter SBP more thanplacebo?” This version of the research questionhas several useful characteristics:

• It involves both of the treatments received byindividuals in the trial: The investigationaldrug and the placebo.

• It is comparative. The goal is to compare theeffects of the two treatments.

• It is precise. There will be a precise answer –yes it does, or no it doesn’t. We will have todefine the term “more” in more detail shortly,but in the meantime please trust us that thiscan indeed be done in a precise manner byusing the discipline of Statistics.

3.9 The reason why there are tworesearch hypotheses

In this book all research questions are addressedand then answered via the construction of tworesearch hypotheses, commonly called the nullhypothesis and the alternate hypothesis.(Although another name for the alternatehypothesis, the research hypothesis, has its ownappeal, we employ the commonly used term“alternate hypothesis” in this book.) Both ofthese hypotheses are key components of theprocedure of hypothesis testing. This procedureis a statistical way of doing business. It isdescribed and discussed in detail in Chapter 6,but it is beneficial to introduce the main concepthere.

3.9.1 The null hypothesis

The null hypothesis is the crux of hypothesistesting. (It is important to note that the form ofthe null hypothesis varies in different statisticalapproaches. As the main type of clinical trialdiscussed in this book is the therapeutic confir-matory trial, we talk about this first. We thentalk briefly about the forms of the null hypoth-esis that are used in other types of trials inSection 3.10.) As noted earlier, therapeuticconfirmatory trials are comparative in nature.We want to evaluate the efficacy of the investi-

gational drug, and the way that we do this is tocompare its efficacy with the efficacy of a controltreatment, typically a placebo. The key question,expressed in our research question, is “Does thenew drug alter SBP more than placebo?” Asnoted earlier, we need to provide a precise defin-ition of “more,” and we will do this in duecourse. In this type of trial, called a superioritytrial, the null hypothesis takes the followingform:

The average effect of the investigational drug onSBP is equivalent to the average effect of theplacebo on SBP.

3.9.2 The alternate hypothesis

The alternate hypothesis reflects the alternatepossible outcome of the trial, and therefore thealternate possible answer to the research ques-tion: The trial was conducted with the specificgoal of providing an answer to the researchquestion. In a superiority trial the alternatehypothesis takes the following form:

The average effect of the investigational drug onSBP is not equivalent to the average effect of theplacebo on SBP.

Note that the research question “Does thenew drug alter SBP more than placebo?” allowsfor the fact that the investigational drug couldactually increase SBP more than placebo, as doesthe alternate hypothesis, which includes thepossibility that the drug could increase SBP. Thereason for this is that the statistical informationused to decide which hypothesis is more plau-sible must include the possibility, howeverremote we believe it to be, that the drug has theopposite effect of what we hope for. Anotherway of stating this is that exclusion of one sideof the alternate hypothesis – that is, that thedrug is worse than the placebo – is presumptiveand contrary to the scientific process ofcollecting data to search for the true state ofnature. It is certainly true that, if we findourselves in a position to claim that the alter-nate hypothesis should be accepted because thedrug did more harm than good (that is,increased SBP), we will have answered the

26 Chapter 3 • Research questions and research hypotheses

Page 42: Introduction to Statistics in Pharmaceutical Clinical Trials

research question. For now, we must acceptthat, despite its seeming incongruence with ourpreferred research question, the use of a two-sided alternate hypothesis is the norm in theregulated world of drug development.

3.9.3 Facts about the null and alternatehypotheses

It is important to note that, whatever theoutcome of the trial, both of these hypothesescannot be correct. It is also true that one of themwill always be correct. Again, we operationallydefine the term “correct” in this context in duecourse, but the point to note here is that:

• It is never the case that neither hypothesis iscorrect.

• It is never the case that both hypotheses arecorrect.

• It is always the case that one of them iscorrect and the other is not correct.

The procedure of hypothesis testing allows us todetermine which hypothesis is correct.

A helpful way of remembering which hypoth-esis is which – that is, which form the nullhypothesis takes and which the alternatehypothesis takes – is to conceptualize that thealternate hypothesis states what you are“hoping” to find and the null hypothesis stateswhat you are not hoping to find. It must beemphasized here, however, that, althoughhelpful, this conceptualization skates on verythin scientific ice: As Turner (2007, p 101) noted:

In strict scientific terms, hope has no place inexperimental research. The goal is to discoverthe truth, whatever it may be, and one shouldnot start out hoping to find one particularoutcome. In the real world, this ideologicallypure stance is not common for many reasons(financial reasons being not the least of them).

When a pharmaceutical or biotechnologycompany has spent many years and huge sumsof money developing an investigational drug,and the drug has made it to the point where atherapeutic confirmatory trial is beingconducted, the company hopes that the drugwill indeed be more effective than placebo, and

in due course be approved for marketing by aregulatory agency. One reason for this hope isthat patients will (relatively) soon have theopportunity to receive a new drug that is thera-peutically beneficial for them. Another reason,as noted in the previous quote, is that the drugwill be approved and make money for thecompany. Drug companies are for-profit busi-nesses, and this is not a negative judgmentalcomment. The only way that they can developfuture drugs is to sell present drugs for a profit:The costs in pharmaceutical research and devel-opment (R&D) are enormous. In this pragmaticsense, the alternate hypothesis (at least one sideof it) can meaningfully be conceptualized asstating the outcome that you are hoping for.

3.10 Other forms of the null andalternate hypotheses

The forms of the null and alternate hypothesesare dictated by the goal of the trial. In the thera-peutic confirmatory trial discussed so far the goalof the trial is to demonstrate that the investiga-tional drug shows greater efficacy than thecontrol treatment – that is, we are hoping todemonstrate that the investigational drug showssuperior efficacy, hence the name superioritytrial. Following our earlier memory tip, you canconceptualize the alternate hypothesis as statingwhat you are hoping to find and the nullhypothesis as stating what you are not hoping tofind. As we have seen, this leads to the followingforms of these hypotheses:

• Null hypothesis: The average effect of theinvestigational drug on SBP is equivalent tothe average effect of the placebo on SBP.

• Alternate hypothesis: The average effect of theinvestigational drug on SBP is not equivalentto the average effect of the placebo on SBP.

When we are hoping to demonstrate some-thing different, however, these hypotheses takedifferent forms. Consider the example of a trialcalled an equivalence trial. Equivalence trials areconducted to demonstrate that an investigationaldrug has therapeutic equivalence compared witha control treatment. Equivalence trials are also

Other forms of the null and alternate hypotheses 27

Page 43: Introduction to Statistics in Pharmaceutical Clinical Trials

comparative in nature, but in this case the controltreatment is not a placebo but a marketed drug.Here, the control treatment is referred to as anactive comparator drug. This active comparatordrug is typically the drug that is currently thebest, or perhaps the only, treatment available forthe disease or condition of interest, and is referredto as the gold standard treatment. The intent inan equivalence trial is to provide compellingevidence that the efficacy of the investigationaldrug is “equivalent” to that of the activecomparator drug. (The term “equivalent” requiresa precise statistical definition, and this isprovided in Chapter 12.)

Why are equivalence trials important? That is,why would we be interested in a new drug thatis only as effective as an existing drug? This is agood question, but one that has an equally goodanswer. One reason would be that we believe(hope) that the investigational drug is equallyeffective and also has the considerable advantagethat its safety profile is better. This would lead tothe same efficacy with less likelihood of side-effects. If the side-effects of the current gold stan-dard drug are particularly unpleasant, this wouldbe a considerable advantage. Other advantagesthat may justify the use of equivalence trialsinclude convenience of the dosing regimen, andthe inability to use an inactive control for ethicalreasons.

In an equivalence trial the research questionis: Does the new drug demonstrate equivalentefficacy compared with the reference drug? Theresultant accompanying research and alternatehypotheses take the following forms:

• Null hypothesis: The investigational drugdoes not show equivalent efficacy to thecomparator drug (the reference drug).

• Alternate hypothesis: The investigationaldrug does show equivalent efficacy to thecomparator drug (the reference drug).

Following the logic of our memory tip, youwill see that the alternate hypothesis in thiscase, just like in the case of a superiority trial,expresses what we are hoping to find, while thenull hypothesis states what we are hoping notto find. The actual natures of the null and alter-nate hypotheses in an equivalence trial are

different from those in a superiority trial, butthis sentiment is the same. This is equally truein the case of various other types of trials,including noninferiority trials.

Noninferiority trials are similar to equivalencetrials, but require that the investigational drugbe in the worst case only trivially worse than thereference to be considered noninferior. A precisestatistical definition of “trivially worse” must beagreed upon before the start of the trial. Fornoninferiority trials the research question is:Does the new drug demonstrate efficacy that is not unacceptably worse (Fleming 2007) thanthe reference drug? The null and alternatehypotheses corresponding to this researchquestion take the form of:

• Null hypothesis: The efficacy of the investiga-tional drug is unacceptably worse than theefficacy of the comparator drug (the referencedrug).

• Alternate hypothesis: The efficacy of theinvestigational drug is not unacceptablyworse than the efficacy of the comparatordrug (the reference drug).

As for equivalence trials, additional detailsabout noninferiority designs are provided inChapter 12.

3.11 Deciding between the null andalternate hypothesis

A research question of interest, then, leads to thenull hypothesis and the alternate hypothesis. Asnoted in Section 3.9.3:

• It is never the case that neither hypothesis iscorrect.

• It is never the case that both hypotheses arecorrect.

• It is always the case that one of them iscorrect and the other is not correct.

(We also noted in that section that we wouldaddress the precise operational definition ofcorrect in due course, and we have notforgotten this.) Therefore, statistical analysis ofthe data acquired in the trial enables us to

28 Chapter 3 • Research questions and research hypotheses

Page 44: Introduction to Statistics in Pharmaceutical Clinical Trials

decide between these two mutually exclusivehypotheses. An ongoing theme in this book isthat many decisions have to be made in drugdevelopment, and the discipline of Statisticsprovides numerical representations of informa-tion that provide the rational basis for decision-making. At the end of every trial, a decisionneeds to be made: Which of these two mutuallyexclusive hypotheses is correct – the nullhypothesis or the alternate hypothesis? For therest of this chapter we talk about superioritytrials, but the points made apply equally toequivalence trials and noninferiority trials.

3.12 An operational statisticaldefinition of “more”

Imagine the following results from a superioritytrial:

• The average decrease in SBP for the individ-uals in the drug treatment group was 3 mmHg– that is, on average, the investigational druglowered SBP by 3 mmHg in this trial.

• The average decrease in SBP for the individ-uals in the placebo treatment group was 2 mmHg – that is, on average, the placebolowered SBP by 2 mmHg in this trial.

In this superiority trial we are interested to findout whether the investigational drug is moreeffective than the comparator treatment, theplacebo. As the number “3” is numerically greaterthan the number “2,” the simple mathematicalanswer is clear: Yes, the investigational druglowered BP more than the placebo. However, youmay well feel that this simple mathematicalanswer, although true, does not capture the spiritof the findings from the trial. The investigationaldrug managed to lower blood pressure only1 mmHg more than the placebo, a substancethat has no pharmacotherapeutic capability.

The term “treatment effect” is an importantone in drug development, and is defined as thedifference between the average response to theinvestigational drug and the average response tothe comparator being used in the trial. In thiscase of the development of a new antihyperten-

sive drug it is defined as the difference betweenthe average decrease in BP shown by the indi-viduals in the drug treatment group and theaverage decrease in BP shown by the individualsin the placebo treatment group. The logic here isthat, as the placebo resulted in a small decrease,even though it has no pharmacotherapeuticcapability, the pharmacotherapeutic capabilityof the investigational drug should be regarded asthe decrease in BP over and above that caused bythe placebo. Another interpretation of the treat-ment effect is that it is the amount of change inSBP attributed to the drug over and above thatwhich would have been observed had the drugnot been given. Therefore, the treatment effecthere is 1 mmHg. This is � 0, and so the investi-gational drug is mathematically more effectivethan the comparator treatment, but, as notedearlier, you may be left feeling that the term“more” is not quite appropriate.

3.12.1 A very important aside: Theconcept of clinical significance

We are about to introduce you to the concept ofa statistically significant difference between themeans of two sets of numbers. These numberscan be any kind of data, not just the clinical datathat are the focus of this book. No matter howlarge or how small the difference found betweenthe means of two sets of numbers, it is possiblefor that difference to be declared statisticallysignificantly different after appropriate analysisof those data. In some circumstances and forsome data, a difference of just one unit betweenthe means of the two groups of data may indeedbe found to be statistically significantlydifferent, and in such a case the use of the term“more” would be statistically appropriate.

However, in the context of clinical data,another extremely important concept is clinicalsignificance. The clinical significance of a treat-ment effect is a completely separate assessmentfrom the treatment effect’s statistical signifi-cance. This is a clinical judgment, not the resultof a single numerical calculation. It is perfectlypossible for a treatment effect to be found to bestatistically significant after an appropriate

An operational statistical definition of “more” 29

Page 45: Introduction to Statistics in Pharmaceutical Clinical Trials

statistical analysis and yet judged by clinicians tobe not clinically significant. As you will see, bothstatistical significance and clinical significancemust be addressed in drug development – anobservation that highlights that statisticians andclinicians must work very closely togetherduring this process. Later discussions address thetopic of clinical significance in detail.

3.12.2 An operational statistical definitionof the term “more”

The discipline of Statistics provides us withmethodology that tells us whether or not the useof the word “more” is appropriate from a statis-tical point of view. The formulation of a scientif-ically meaningful research question and its twoassociated hypotheses, the null hypothesis andthe alternate hypothesis, allows us to reach ananswer in an objective manner by following aprescribed methodology. Moreover, the regula-tory and clinical communities acknowledge thismethodology. Therefore, for a given set of datathat have been collected in a trial, statisticaltesting provides a precise answer that is couchedin statistical terms and that has effectively beenagreed upon as objective by all interested parties(see Turner, 2007). This leads us to the concept ofa statistically significant difference.

3.13 The concept of statisticallysignificant differences

The concept of statistical significance and itspractical implementation are discussed in moredetail in Chapter 6, but it is appropriate here toset the scene for those discussions. The words“significant,” “significance,” and “significantly”are used differently in Statistics than they are ineveryday language. In the language of Statisticsthey have precise quantitative meanings. Wefocus here on the meaning of the term “signifi-cantly” in the discipline of Statistics. The disci-pline of Statistics facilitates a single, quantitative

answer to questions concerning assessments of“more.” For a given set of data collected in asuperiority trial, the employment of the appro-priate statistical analysis will reveal whether thetreatment effect attained statistical significance –that is, it will reveal whether or not the investi-gational drug was statistically significantly moreeffective than the placebo. In this manner itprovides a precise definition of the term “more.”If there is a statistically significant differencebetween the average decrease in blood pressurein the drug treatment group and the placebotreatment group – that is, if there is a statisticallysignificant treatment effect – the use of the term“more” is warranted.

3.14 Putting these thoughts into moreprecise language

The following are the research question and thetwo associated research hypotheses that we haveformulated so far:

• Research question: Does the new drug alterSBP more than the placebo?

• Null hypothesis: The average effect of theinvestigational drug on SBP is equivalent tothe average effect of the placebo on SBP.

• Alternate hypothesis: The average effect of theinvestigational drug on SBP is not equivalentto the average effect of the placebo on SBP.

The concept of statistical significance allows allthree of these to be reframed as follows:

• Research question: Does the new drug alterSBP statistically significantly more than theplacebo?

• Null hypothesis: The average effect of theinvestigational drug on SBP is not statisticallysignificantly different from the average effectof the placebo on SBP.

• Alternate hypothesis: The average effect of theinvestigational drug on SBP is statisticallysignificantly different from the average effectof the placebo on SBP.

30 Chapter 3 • Research questions and research hypotheses

Page 46: Introduction to Statistics in Pharmaceutical Clinical Trials

Each of these is now expressed in a moreprecise manner. In addition, the null hypothesisnow allows for the treatment groups to differ toa certain extent. In any trial involving any treat-ments, the group averages will almost certainlydiffer to some extent: The probability of atreatment effect of zero is extremely small.

3.15 Hypothesis testing

We have now formulated our research question,null hypothesis, and alternate hypothesis inprecise statistical language. This facilitates thestrategy of hypothesis testing. Hypothesis testingrevolves around two actions after an appropriatestatistical analysis: Rejecting the null hypothesisor failing to reject the null hypothesis. Thelanguage used to express these two actions isvery important. After thinking about these twoactions for a few minutes, you might think thatthese actions could be expressed as “acceptingthe alternate hypothesis” and “accepting thenull hypothesis,” respectively. In everydaythinking this might be thought very reasonable.However, in the discipline and the language ofStatistics, the terminology of rejecting the nullhypothesis or failing to reject the null hypoth-esis is deliberately employed. While data fromtwo groups (for example, means) may suggestthat the alternate is more plausible than the nullhypothesis, the sample size of the study also hasa direct bearing on our ability to reject the nullhypothesis. If there is at least a small differencebetween groups in a study (which will almostcertainly be the case), it is possible that a nullhypothesis that is not rejected in a study of acertain size would be rejected if the study hadbeen larger. The relationship between samplesize and the probability of rejecting the nullhypothesis or failing to reject the null hypoth-esis is explored in Chapter 12.

The statistical convention of using the expres-sion “failing to reject the null hypothesis”reflects the position that null hypotheses of nodifference can always be rejected if enough

observations are studied. Statistical methodologynecessitates making a choice here. One of thesetwo actions, rejecting or failing to reject the nullhypothesis, has to be taken at the end of allhypothesis testing. The action taken is preciselydetermined by the result obtained from thestatistical technique used to analyze the data.

3.16 The relationship betweenhypothesis testing and ethics in clinicaltrials

The issue of ethical considerations in clinicaltrials was introduced in Section 2.8. The proce-dure of hypothesis testing illustrates one impor-tant ethical consideration, clinical equipoise,particularly well. The fact that we have tworesearch hypotheses that express two opposingpossible occurrences makes it clear that we donot know which best represents the actual stateof affairs. The bottom line is that, at the point intime when the study is planned and started, wedo not know whether or not the investigationaldrug will be more effective than placebo. Thisuncertainty is a necessary prerequisite ofconducting a trial: If it were known that theinvestigational drug were more effective it wouldbe unethical for people with the disease orcondition of interest to be given a placebo.

The philosophy that makes it acceptable thatsome individuals receive a placebo in a clinicaltrial is that a comparative clinical trial in whichsome receive a placebo while others receive theinvestigational drug is the best way to find out ifthat drug is indeed effective. If it is, the individ-uals who received the placebo would not them-selves have benefited on this occasion, but theirparticipation in the trial was a crucial compo-nent contributing to the later treatment ofpatients with the approved drug. As noted earlierindividuals take part in clinical trials for thegreater good, not for their own immediatebenefit.

Uncertainty is therefore a fundamental prereq-uisite to conducting a therapeutic exploratory or

The relationship between hypothesis testing and ethics in clinical trials 31

Page 47: Introduction to Statistics in Pharmaceutical Clinical Trials

therapeutic confirmatory trial. The results of thetrial will be used in a decision-making process atthe end of the trial: In the light of the prevailinguncertainty, we need to decide whether theresults of the trial have provided compellingevidence that the investigational drug is indeedeffective. If the statistical results show that thedrug is indeed effective – that is, the treatment isstatistically significant – the next step is for clin-icians to decide if the treatment is also clinicallysignificant (as noted in Section 3.12.1 we discussclinical significance later in the book). If thetreatment effect is deemed to be both statisti-cally and clinically significant, the study team islikely to decide to move forward to the nextstudy in their clinical development program.

3.16.1 Ethics and the use of placebocontrols

At this point, it is appropriate to point outdiverging views on the use of placebo controltreatment in any clinical trial. Some authorshave expressed the view that the comparatorshould always be an active control if possible– that is, in cases where there is already atleast one drug on the market that has beendemonstrated to be effective for treating thedisease or condition of concern, thecomparator should be one of these drugs. Thisissue is reviewed very effectively by Templeand Ellenberg (2000), and we strongly recom-mend that you read their paper. We agreewith the arguments that they presentsupporting the use of placebo controls inappropriate circumstances. In addition, ICHGuidance E12A (2000) comments on this issuespecifically in the context of the evaluation ofefficacy in clinical trials for an investigationalantihypertensive drug. It states that, forseveral important reasons, short-term (definedas 4–12 weeks in duration), blinded, placebo-controlled studies are “essential.” It also statesthat long-term studies (defined as 6 months ormore) should also be conducted to demon-strate maintenance of efficacy and to assesslong-term safety, and that these trials wouldtypically use an active control.

In this book we have chosen to focus onteaching the computational aspects of statisticalanalyses by using examples involving designs inwhich the investigational drug is compared witha placebo in short-term trials. In Section 4.7, wedescribe two potential measurement scenariosthat are possible during a 12-week trial: Takingmeasurements at baseline and at the end ofevery 2-week period, and simply taking measure-ments at baseline and at the end of week 12, theend-of-treatment measure. In this book we usethe latter design for the sake of simplicity. Wehave therefore chosen this form of examplebecause it is ethical as discussed in ICH GuidanceE12A, and, as noted in the previous section, acomparative trial using both an investigationaldrug and a placebo treatment group is the bestway to find out if the former is indeed effective.

3.17 The relationship between researchquestions and study design

Having introduced and discussed researchquestions in this chapter, the following quotesfrom authoritative sources emphasize theirimportance:

The most critical and difficult prerequisite fora good study is to select an important feasiblequestion to answer. Accomplishing this is pri-marily a consequence of biological knowledge.(Piantadosi, 2005)

The essence of rational drug development is toask important questions and answer them withappropriate studies. (ICH Guidance E8, 1997)

Between them, these quotes capture thenotion that, once an important research ques-tion has been formulated, a “good” and “appro-priate” study must be conducted to answer thequestion. As you have seen already, we provideoperational definitions of terms that are used instatistical contexts. Accordingly, Chapter 4provides operational definitions of the terms“good” and “appropriate” in the context ofdeciding how best to answer an importantresearch question.

32 Chapter 3 • Research questions and research hypotheses

Page 48: Introduction to Statistics in Pharmaceutical Clinical Trials

Before that, however, it is informative toconsider the second sentence in the quotefrom Piantadosi (2005) – “Accomplishing this[selecting an important feasible question toanswer] is primarily a consequence of biologicalknowledge.” This book discusses the employ-ment of the discipline of Statistics in a particularcontext, the development of a new drug. It iscertainly true that the individual statisticalanalyses that we teach you can be informativelyapplied in other areas of investigation, but in thisbook their application is in the development of abiologically active drug that will influencepatients’ biology for the better (see Turner, 2007).

3.19 References

Fleming TR, DeMets DL (1996). Surrogate endpoints inclinical trials: are we being misled? Ann Intern Med125:605–613.

ICH Guidance E8 (1997). General Consideration of Clin-ical Trials. Available at: www.ich.org (accessed July 12007).

ICH Guidance E12A (2000). Principles for Clinical Eval-uation of New Antihypertensive Drugs. Available at:www.ich.org (accessed July 1 2007).

Norgrady T, Weaver DF (2005). Medicinal Chemistry: Amolecular and biochemical approach, 3rd edn. Oxford:Oxford University Press.

Piantadosi S (2005). Clinical Trials: A methodologicperspective, 2nd edn. Chichester: John Wiley & Sons.

Temple R, Ellenberg S (2000). Placebo-controlled trialsand active-control trials in the evaluation of newtreatments. Ann Intern Med 133:455–463.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

References 33

3.18 Review

1. What are characteristics of a useful researchquestion?

2. Why is a placebo control used in many clinicaltrials?

3. When would a placebo control not be used?

4. How do the null and alternate hypotheses relate tothe objective of a clinical trial?

5. What form do the null and alternate hypothesestake for a superiority trial with a placebo control?

6. What form do the null and alternate hypothesestake for an equivalence trial with an active control?

Page 49: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 50: Introduction to Statistics in Pharmaceutical Clinical Trials

4.1 Introduction

Chapter 3 introduced the central topic ofresearch questions in clinical trials. It also introduced the null and alternate hypotheses.This chapter discusses the relationship betweenresearch questions and study design, and showshow optimum experimental methodology is crit-ical to the successful implementation of studiesconducted to answer research questions.

Each study in a clinical development programaddresses one or more research questions (wenoted in the previous chapter that it is a goodidea to limit the number of research questions inany given trial). In Chapter 3 we also noted twocharacteristics that a research question mustpossess to be considered useful:

• It needs to be specific (precise).• It needs to be testable.

We can now take this thinking one stepfurther. Formulating a good research questionand then fine-tuning it is critical to the potentialsuccess of a trial. The research question is thedriving force behind the way that the trial will bedesigned and implemented, because certain trialdesigns, or study designs, are needed to permitthe acquisition of data that can be used success-fully to answer the research question. The bestresearch question in the world cannot beanswered by the acquisition of inappropriatedata via the conduct of an inappropriatelydesigned trial, no matter how well the data arecollected.

A useful research question suggests how astudy needs to be designed to provide the appro-priate information to answer the question.Choosing the best study design to answer theresearch question is therefore critical. The word

“best” in the previous sentence is meaningfulbecause there may be more than one studydesign that is capable of providing data thatenable the question to be addressed andanswered, but one of these designs may be moreappropriate than the other possibilities.

This occurrence illustrates an important point.It is certainly true that the discipline of Statisticscontains precise aspects that provide definiteanswers in situations where that answer is theonly possible correct answer. However, it is alsotrue that the successful practice and implemen-tation of the discipline of Statistics require aconsiderable amount of well-informed judg-ment. It is therefore vital that professional statis-ticians are involved in all aspects of clinicaltrials. This comment may initially come as some-what of a surprise: This is because there is a wide-spread tendency to think of statisticians beinginvolved only at the end of a trial when all thedata have been collected. This misperception isas unfortunate as it is widespread. The conductof a successful trial requires that statisticians areinvolved from Day 1, which can be thought of asthe time when a research team first decides thatknowledge about a certain characteristic of theinvestigational drug is needed, to the end of theentire process of acquiring and disseminatingthe trial’s results. This includes submitting theresults from the trial to a regulatory agency(multiple agencies if marketing permission isdesired in multiple countries) and publishing theresults in a clinical communication for otherresearch scientists and clinicians to read.

This chapter also discusses experimentalmethodology. As well as employing the beststudy design to facilitate the collection of data toanswer a research question most appropriately

4Study design and experimental methodology

Page 51: Introduction to Statistics in Pharmaceutical Clinical Trials

and successfully the data acquired for thispurpose must be of optimum quality. Forexample, each and every individual’s blood pressure must be measured as accurately as poss-ible every single time that a measurement needsto be made. The appropriate choice of the beststudy design and the implementation ofoptimum quality methodology work hand inhand to facilitate the acquisition of optimumquality data with which to answer the researchquestion that led to the clinical trial beingconducted.

4.2 Basic principles of study design

At the end of Chapter 3 we noted that we wouldprovide operational definitions of the terms“good” and “appropriate study design” in thecontext of answering an important researchquestion of biological (clinical) importance. Twomore quotes from Piantadosi (2005) and ICHGuidance E8 (1997) are illuminating:

Conceptual simplicity in design and analysis is avery important feature of good trials . . . . Goodtrials are usually simple to analyze correctly.

Piantadosi (2005, p 130)

Clinical trials should be designed, conducted,and analyzed according to sound scientificprinciples to achieve their objectives.

ICH Guidance E8 (1997, p 2)

The first quote provides an excellent opera-tional definition of the term “good” in thecontext of the design of clinical trials. It alsocaptures a sentiment to which we return timeand time again in this book. Study design andstatistical analysis “are intimately and inextri-cably linked: the design of a study determinesthe analysis that will be used once the data havebeen collected” (Turner, 2007, p 5). Conceptualsimplicity, as Piantadosi (2005) noted, is veryimportant when designing a trial. As design andanalysis are intimately linked, conceptualsimplicity in design leads to conceptualsimplicity in the associated statistical analyses.

Our operational definition of the term “appro-priate” in the context of the design of clinical

trials has two aspects, one of which comes fromICH Guidance E8 (1997) as cited earlier: Trialsneed to be designed, conducted, and analyzedaccording to sound scientific principles. Thischapter discusses the scientific experimentalmethodology that is appropriate for the designand conduct of trials. The second aspect of ouroperational definition of the term “appropriate”is that the design employed must be capable ofproviding the data needed to answer the researchquestion of interest. There are many studydesigns, each of which is appropriate forproviding the data necessary for specificallyformulated research questions. A design thatcannot possibly provide the data to answer theresearch question of interest is not appropriate,and the decision not to use that design is there-fore clear cut. In some circumstances more thanone study design is capable of providing the dataneeded to answer the research question. In thiscase the decision as to which one is the mostappropriate requires a decision based on aninformed judgment, and statisticians must beinvolved in this decision.

Clinical trials embody several fundamentalprinciples of experimental design (Piantadosi,2005). Three of these are:

1. replication2. randomization3. local control.

4.2.1 Replication

Replication refers to the fact that clinical trialsemploy more than one individual in each treat-ment group. The reason for this is that there isconsiderable variation in how individualsrespond to the administration of the same drug,so it is not appropriate to choose only one indi-vidual to receive the investigational drug andanother to receive the placebo: There is no wayof knowing how representative the individuals’responses are of the typical responses of peoplein general.

Replication allows two important features ofindividuals’ responses to the investigational drugto be assessed. One is just how different theirresponses are from each other. It might be thatall individuals show responses that are pretty

36 Chapter 4 • Study design and experimental methodology

Page 52: Introduction to Statistics in Pharmaceutical Clinical Trials

close to each other, or that there is a consider-able difference between individuals’ responses.The second is to evaluate the “typical” responseof all the individuals. Both of these features areimportant assessments in the discipline of Statis-tics. Chapter 5 talks about these assessments andputs these ideas into statistical language. It alsoprovides operational definitions of the term“typical”: We use the plural term “definitions”because the typical response can be opera-tionally defined in several ways, each of which isappropriate in certain circumstances.

4.2.2 Randomization

The goal of randomization is to eliminate bias or,in practical terms, to reduce bias to the greatestextent possible. Bias is the difference betweenthe true value of a particular quantity and anestimate of the quantity obtained from scientificinvestigation. Various influences can introduceerror into our assessment of treatment effects,and these are discussed at various points in thefollowing chapters. At this point we discuss anexample of systematic error, or bias.

Randomization involves randomly assigningexperimental individuals to one of the treatmentgroups, the drug treatment group or the placebotreatment group. The premise of randomizationis simple: Many potential influences on thedrug response of individuals participating in thetrial (for example, differences in the heightsand weights of participants, differences in meta-bolic pathways involved in the metabolism ofthe investigational drug) cannot readily becontrolled for. It is therefore important that, tothe best of our ability, we take steps to ensurethat these characteristics are likely to be equallyrepresented in both treatment groups. If all ofthe individuals in one treatment group share acharacteristic that is not present in any of theindividuals in the other treatment group, it isnot possible to ascribe differences between thegroups to the one influence of central interest,that is, the different treatments received by thetwo groups. Putting all relatively tall individualsinto one treatment group and all relatively shortindividuals into the other treatment groupwould be an example of systematic bias. In this

scenario, height would have a direct impact onthe formation of the treatment groups and,therefore, if height were to be a source of influ-ence on the blood pressure change demonstratedby individuals, height could be a cause ofsystematic bias in the results obtained.

The preapproval clinical trials discussed in thisbook are experimental studies: The datacollected comprise a series of observations madeunder conditions in which the influence ofinterest, the type of treatment received, iscontrolled by the research scientist. (The term“experimental” is used here as defined byPiantadosi [2005]. The converse of an experi-mental study is a nonexperimental study.Nonexperimental studies are often called obser-vational studies, but this term is inadequate,because it does not definitively distinguishbetween nonexperimental studies and experi-mental studies, for example, preapproval clinicaltrials, in which observations are also made. Themethodology employed in preapproval clinicaltrials is experimental: It comprises a series ofobservations made under conditions in whichthe influences of interest are controlled by theresearch scientist. The methodology employedin other types of study can be nonexperimental;the research scientist collects observations butdoes not exert control over the influences ofinterest. The term “nonexperimental” is not arelative quality judgment compared with experi-mental; the nomenclature simply distinguishesdifferent methodological approaches [Turner,2007].)

There are various types of randomizationstrategies. The strategy employed in the trialsdiscussed in this book is called simple random-ization, which involves assigning treatmentsto individuals in a completely random way.Other more complex randomization techniquesinclude block randomization, stratified random-ization, and cluster randomization: These arenot addressed in this book in any detail (seeTurner, 2007, for a brief review). Randomizationtechniques have an important role in clinicalresearch in general and in drug development inparticular because they allow for balancedassignment of treatments within strata ofinterest (stratified randomization), minimize thepossibility of a long run of assignments to the

Basic principles of study design 37

Page 53: Introduction to Statistics in Pharmaceutical Clinical Trials

same treatment (block randomization), andfacilitate the assignment of large groups ofindividuals to the same treatment (clusterrandomization).

4.2.3 Local control

Another important feature of conducting, orrunning, clinical trials is local control. This topictakes us into the realm of methodology. Tightcontrol on all aspects of methodology – forexample, the manner in which the treatmentsare administered, the manner in which bloodpressure measurements are made, and the appa-ratus used to make these measurements – mustbe exercised at all investigative sites. As anexample, it is not appropriate that blood pres-sures for all individuals in one treatment groupbe measured using one strategy and measuringdevice whereas blood pressures for all individ-uals in the other treatment group are measureddifferently. This naïve strategy could bias theresults of the study. As noted in Section 4.2.2, animportant objective of control in clinical trials isto remove as much error from the results aspossible, that is, to reduce potential bias.

Environmental conditions should also becontrolled as much as possible. Taking measure-ments and evaluating some individuals in rela-tively cold conditions and others in a relativelywarmer environment is not recommended.Taking this example further, and consideringfactors such as ease of access to the investigativesite and the general atmosphere (relaxed,frenetic) of the site and its investigators, it is notappropriate to have all individuals in one treat-ment group enrolled at one investigative site andall individuals in the other treatment groupenrolled at a different site.

4.3 A common design in therapeuticexploratory and confirmatory trials

As noted in Section 4.2, there are many studydesigns that are employed during a clinical

development program. Some of these are typi-cally used in early human pharmacology trials,whereas others are typically used in later thera-peutic exploratory and therapeutic confirmatorytrials. We discuss one particular study designmore than any others, but it must be emphasizedthat this does not mean that it is more importantthan other designs. Rather, its employment asour central example allows us to introduce youto statistical methodology and statistical analysisin our chosen way.

The design that is predominantly discussed inthis book is the randomized, concurrentlycontrolled, double-blind, parallel group design.The four descriptors in this title – randomized,double-blind, concurrently controlled, andparallel group – identify different aspects of thisdesign. We start with the last two descriptors,parallel group and concurrently controlled,because these capture the fundamental natureof the design. We then discuss the first twodescriptors, randomized and double blind.

4.3.1 The concurrently controlled, parallelgroup design

Individuals participating in a parallel group trialare randomly assigned to one of two or moredistinct treatments. Those who are assigned tothe same treatment are frequently referred to asa treatment group. While the treatments thatthese groups receive differ, all groups are treatedequally in every other regard, and they completeexactly the same procedures. This parallelactivity on the part of the groups of individualsis captured in the term “parallel group design.”

The term “concurrently controlled” capturestwo aspects of this study design. We have alreadycome across the concept that to quantifymeaningfully the effect of the investigationaldrug (that is, the treatment effect), it is necessaryto compare the average blood pressure reduc-tion of the group of individuals receiving theinvestigational drug with the average of thosereceiving the placebo. The two parallel groupshere are the drug treatment group and theplacebo treatment group, and the latter functions

38 Chapter 4 • Study design and experimental methodology

Page 54: Introduction to Statistics in Pharmaceutical Clinical Trials

as the control group. Sometimes this design iscalled a placebo-controlled parallel group design,because the control employed is a placebo andnot another active, marketed drug. This isperfectly valid.

The term “concurrently” refers to the factthat the individuals in the placebo treatmentgroup are participating in the trial at the sametime as those in the drug treatment group. Thisis an important aspect of the study design. Ifall the individuals in the drug treatment groupparticipated first, followed at some later timeby all those in the placebo treatment group,several potential influences could impact theresults of the trial. For example, the staff atthe investigational sites at which individualsparticipate in the trial might have changedconsiderably, and aspects of the overall opera-tion of these sites may have changed. Thegoal of experimental methodology is to controlfor all influences other than the type of treat-ment (drug or placebo) received by individ-uals, and so having all the individuals in onegroup participate at one time under one setof conditions and all those in the othergroup(s) participate at a later time under apotentially different set of conditions is notdesirable. (The goal of study protocols is todetail the experimental procedures to beemployed during the trial in sufficient detailthat they will be executed identically byall research staff at all times. This shouldtherefore minimize the differences justdescribed. However, practical reality sets inhere and, in the extreme example employed inthe text, the study protocol probably wouldnot be 100% successful.) This point links wellwith the discussion of local control in Section4.2.3.

This last point is well acknowledged, and it isunlikely that a trial would not be “concurrently”controlled. Therefore, if the term “placebocontrolled” is seen in a published report of atrial, it is almost certainly fair to assume that thetrial is concurrently controlled. However,assumption is a dangerous thing, and youshould check the details of the trial presented inthe report to confirm this.

4.3.2 The crossover design

In contrast to the parallel design, individuals ina crossover design are assigned to receive two ormore treatments in a particular sequence. Forexample, an individual in a crossover study mayreceive the drug treatment in the first period.Then, after a suitably long washout periodduring which the individual is off drug, he or shewill receive placebo. Other individuals willreceive placebo in the first period and then crossover to the drug treatment in the second period.Such a study would be considered a two-period,two-treatment, two-sequence crossover design.Crossover designs may involve a number of treat-ments, sequences, and periods. In a crossoverdesign, individuals are randomized to treatmentsequences, not treatment groups.

The greatest advantage of the crossover designis that individuals receive more than one treat-ment so that they act as their own controls. Thisresults in more statistical efficiency and there-fore smaller sample sizes. Crossover designs can,for this reason, be particularly useful in earlypharmacology studies. Another advantage ofcrossover designs is that they can aid in recruit-ment of study participants when a serious condi-tion is being treated and they would like to haveaccess to a potentially helpful investigationaldrug.

Crossover designs also have some disadvan-tages – one is that their results can be difficult tointerpret. As all individuals receive more thanone treatment there can be a carryover effectfrom one or more early periods to subsequentperiods, leading to a biased estimate of the treat-ment effect. When an individual does notcontribute data to one of the periods, none ofthe data can be used in the most straightforwardanalyses – attributing adverse events or otheruntoward effects to a single treatment can bedifficult. Another disadvantage is that they arenot applicable to all therapeutic indications.Crossover trials are ideally suited for indicationsthat are chronic in nature and do not vary inseverity over time. For example, studying a newanalgesic for migraines may not be feasiblebecause, thankfully, migraines do not occur as

A common design in therapeutic exploratory and confirmatory trials 39

Page 55: Introduction to Statistics in Pharmaceutical Clinical Trials

frequently or predictably as would be required toevaluate a number of treatments over the courseof two or more periods. Crossover designs wouldbe better suited for chronic conditions such ashypercholesterolemia or hypertension. However,having two or more observations from the sameindividual under different experimental condi-tions introduces additional complexities (that is,dependence) for statistical analyses. Thesemethods are beyond the scope of this book.

4.3.3 Randomization and the descriptor“randomized”

The topic of randomization was discussed inSection 4.2.2. Any trial that has employed arandomization strategy in its design is called arandomized trial.

4.3.4 Blinding and double-blind trials

We discussed blinding earlier. As a brief recap,making a drug and a placebo look, taste, andsmell the same ensures one part of the doubleblind: It means that individuals do not knowwhich treatment they are receiving. A secondcomponent of the blinding process is needed toensure that the investigators – that is, thoseadministering the treatments – do not knowwhich treatment individuals are receiving. Thiscomponent necessitates packaging the drugand placebo products at their site of manufac-ture so that investigators receiving them at theinvestigational sites cannot tell which is which.A system of codes guarantees that, when theblind is eventually broken once all the data havebeen acquired, it will be known which treatmenteach and every individual received.

The importance of double-blind trials can beexpressed in both scientific and regulatoryterms, with the second being a consequence ofthe first. These trials are the scientific gold stan-dard, and the results from a trial that is run in a

double-blind manner are afforded particularweight by regulatory agencies and clinicians.

4.4 Experimental methodology

Experimental methodology is concerned with allthe aspects of implementation and conduct of astudy. Experimental methodology and studydesign work hand in hand to ensure thatoptimum quality data are collected from whichoptimum quality answers to the research ques-tion can be provided. As we have seen, an appro-priate study design must be used to allow thecollection of optimum quality data, and we canthink of study design as providing the oppor-tunity to collect such data. To take advantage of this opportunity, optimum experimentalmethodology must be used in the acquisition ofthe data. Optimum quality methodology is nouse if the wrong study design has beenemployed, and the appropriate study design canlead to optimum quality data only if optimumexperimental methodology is employed.

Consider also the data analysis and interpreta-tion that occur once data have been acquired ina trial. First, the appropriate analysis has to beemployed as determined by the study design.However, the employment of this analysis aloneis not enough to ensure optimum qualityanswers to the research question. A computa-tionally perfect execution of the appropriateanalysis, and the most meaningful interpreta-tion of the results obtained, will not yieldoptimal answers if the data being analyzed are ofless than optimal quality. Therefore, experimentalmethodology is also of critical importance.

At this point it is worth considering the lengthof time it takes to run a therapeutic confirmatoryclinical trial. Such trials are often conducted asmulticenter trials. Although the total numbers ofindividuals who participate in trials vary, wenoted earlier that a typical number for a thera-peutic confirmatory trial is 3000–5000 individ-uals. Each of these individuals needs to have the

40 Chapter 4 • Study design and experimental methodology

Page 56: Introduction to Statistics in Pharmaceutical Clinical Trials

disease or condition of interest. It may be that 50investigational sites are needed to enroll thetotal number of individuals needed for the trial.Imagine a hypothetical scenario where 5000individuals are recruited at 50 sites, with thetypical number of individuals per site at around100 (in reality, the number of individualsrecruited at each of the sites might differ consid-erably). Imagine also that the treatment lengthemployed in this trial is 12 weeks. That is, inves-tigational site 01 recruits a total of 100 individ-uals, and each individual receives either theinvestigational drug or the placebo for a periodof 12 weeks.

The question of interest is: How long does ittake to complete the trial? Although the answer“12 weeks” tends to come to mind when firstthinking about this, the answer is that it willalmost certainly take much longer than thatbecause not all of the 100 individuals will starttheir participation in the trial on the same day.They will be recruited into the trial, and hencestart participation in the trial, in a staggeredmanner. It is therefore quite possible that the lastof the 100 individuals might start his or her 12-week participation months (and possibly years)after the first. This will likely be true at all theinvestigational sites. The expression “first partici-pant first visit to last participant last visit” is oftenused to describe the length of the entire trial.

In addition to giving you a feeling for howlong it takes to run real-life trials, we mentionthis point because it emphasizes that it is essen-tial that methodological considerations receiveconstant vigilance in all studies because sometrials can last several years.

4.5 Why are we interested in bloodpressure?

The domain of experimental methodologyembraces many aspects of conducting a trial, andwe do not discuss the vast majority in this book.However, it is important to make you aware of

the need for optimum quality methodology, andyou can learn more about this from othersources: In particular we recommend Piantadosi(2005). Our discussions focus on one aspect ofmethodology that is directly relevant to a centraltheme of this book, namely the measurement ofblood pressure. Before discussing this topic,however, it is worth considering why we want todevelop drugs that lower blood pressure in thefirst place, and why optimum quality bloodpressure measurements are therefore critical.

4.5.1 Clinically relevant observations

It is possible to make all sorts of observationsabout people. For example, some are tall, someare short, some have blonde hair, some have darkhair, some love dogs, some love cats, some haverelatively high blood pressure, and some haverelatively low blood pressure. In drug develop-ment we are interested in clinically relevantobservations and in making these observationsduring a trial. (Recall the definition of experi-mental studies presented earlier: In experimentalstudies, observations are made when the influ-ence of interest is under the control of theresearcher.) In the trials discussed in this book, weare interested in observing (measuring) bloodpressure for the duration of individuals’ partici-pation in the trial with the ultimate goal ofassessing the investigational drug’s treatmenteffect, that is, how much more the investigationaldrug lowers blood pressure than a placebo. There-fore, the question of interest here is: Why is bloodpressure a clinically relevant observation? Thistakes us into the realm of surrogate endpoints.

4.5.2 Surrogate endpoints

Two clinical endpoints of particular relevanceare morbidity and mortality: Morbidity canlessen quality of life and make mortality morelikely, and mortality speaks for itself. Notsurprisingly, pharmacotherapy (along with

Why are we interested in blood pressure? 41

Page 57: Introduction to Statistics in Pharmaceutical Clinical Trials

other medical interventions) is concerned withreducing both these clinical endpoints. However,the development of morbidity can be prolonged,and the impact of drug therapy on mortalityduring a clinical trial can be very difficult toevaluate. As it is very unlikely that many individ-uals will die during (most) clinical trials, thedifference in death rates between the drug treat-ment group and the placebo treatment group islikely to be very small, and quite possibly zero.(Mortality is unfortunately not uncommon inclinical trials in some therapeutic areas and intrials involving very ill or terminally ill patients.On these occasions, it may well be possible todetect the beneficial influence of an investiga-tional drug by focusing on the clinical endpointof mortality.)

It therefore becomes important in clinical trialsto evaluate the influence of the investigationaldrug on other endpoints of relevance. These canbe termed “clinically relevant endpoints” or“surrogate endpoints.” Surrogate endpoints arebiomarkers or other indicators that substitute forthe clinical endpoint by predicting its likelybehavior. Justification for the choice of theseendpoints is of fundamental importance: Theendpoint chosen as the surrogate needs to repre-sent the clinical endpoint in a meaningfulmanner. How can this be demonstrated? Thefollowing are characteristics of meaningful anduseful surrogate endpoints (see Oliver and Webb,2003; Machin and Campbell, 2005):

• Biological plausibility: A detailed knowledgeof the pathophysiology of the disease or condi-tion of interest is helpful, as is demonstrationthat the surrogate endpoint of interest in theclinical trial is on the causal pathway to theclinical endpoint of primary interest.

• A detailed knowledge of the drug’s mecha-nism of action: Coupled with similar knowl-edge of the pathophysiology of the disease orcondition of interest, this can provide a solidbasis for believing that the drug will be bene-ficial. (A drug can certainly be clinically bene-ficial even if we do not know its mechanismof action, but this does not mitigate the pointmade here in the context of good surrogateendpoints.)

• The surrogate endpoint predicts the clinicalendpoint consistently and independently.

• They are particularly useful in cases where theclinical endpoints occur after long periods.

The choice of endpoints used in studies of newtherapies may evolve over time as knowledge isgained about the natural history of the disease orthe reliability of surrogate endpoints. Endpointsused to evaluate the benefits of new drugs areprovided in Table 4.1 for a number of diseases.Many diseases are associated with numerousmedical conditions of consequence to thepatient (for example, pain and disabilityresulting from rheumatoid arthritis), which maybe the target for a particular new therapy.

Fleming and DeMets (1996) suggested that theuse of surrogate endpoints is most helpful inearly therapeutic exploratory studies to studyactivity and decide if larger, more definitivestudies are warranted. Establishing the accept-ability of a surrogate endpoint is a difficultundertaking. Fleming and DeMets (1996)cautioned about the use of surrogate endpointsin Phase III confirmatory trials. One frequentlycited example (CAST Investigators 1989, 1992) ofa misleading surrogate endpoint is from theCardiac Arrhythmia Suppression Trial (CAST).Ventricular arrhythmia has been established as arisk factor for sudden death. In this study, threedrugs that had been approved for the control ofarrhythmias (encainide, flecainide, and mori-cizine) were evaluated for their effect onmortality among individuals with myocardialinfarction and ventricular arrhythmia. Theresults from this study were surprising. All threedrugs were associated with higher risks of deaththan placebo. Hence the benefit of these drugswith respect to arrhythmia did not extend to theunderlying clinical endpoint.

We are interested in high blood pressure(hypertension) because of our interest in cardio-vascular disease, a leading cause of morbidityand mortality. High blood pressure is a mean-ingful and useful cardiovascular surrogate end-point because it is well established that chronichigh blood pressure causes cardiovascular andcerebrovascular events.

42 Chapter 4 • Study design and experimental methodology

Page 58: Introduction to Statistics in Pharmaceutical Clinical Trials

4.6 Uniformity of blood pressuremeasurement

One method of measuring blood pressure is touse a stethoscope and a sphygmomanometer;you may have experienced this in your doctor’sclinic/office. Other methods include the use ofvarious automated devices. Although we do notgo into these in detail here, the important pointis that considerable attention must be paid tomethodological considerations. It is importantthat the same measurement technique be used atall the investigational sites in a trial, and thattime is taken before the trial starts to train everysite in the correct use of whichever measuringdevice is chosen. This might happen at an inves-tigators’ meeting or a central meeting of all prin-cipal investigators held before the start of thetrial to address procedural consistency acrosssites. It is also important that every measure-ment at each site be made correctly, and that anyroutine calibration of the measurement device isconducted as mandated.

4.7 Measuring change in bloodpressure over time

As antihypertensive drugs are intended to lowerblood pressure, their evaluation in clinical trialsrequires at least two measurements. One of theseis an initial measurement, typically called a base-line measurement, and the other is a measure-ment some time later, such as at the end of thetreatment phase (the end-of-treatment measure-ment). These two measurements allow us tocalculate a change score that represents thechange in blood pressure from the start to theend of the treatment phase. Change scores canbe calculated in several ways. One of these, andthe method that is used in all of the examples inthis book, is simply to calculate the arithmeticdifference between each individual’s baselinemeasurement and his or her end-of-treatmentmeasurement.

It is also possible that blood pressure may bemeasured more than twice in a trial. If the treat-ment period is 12 weeks long, measurements

Measuring change in blood pressure over time 43

Table 4.1 Examples of endpoints used in clinical trials of experimental drugs

Disease Example endpoints

Cancera SurvivalObjective response (reduction in tumor size for a minimum amount of time)Time to progression of cancer symptoms

Rheumatoid arthritisb Improvement in signs and symptomsRadiological progression of disease

Uncomplicated urinary tract infectionc Eradication of bacterial pathogenHypertensiond Change from baseline SBPPostmenopausal osteoporosise Bone mineral density

Bone fractures

aFood and Drug Administration (US Department of Health and Human Services or DHHS, FDA, 2007).bFood and Drug Administration (DHHS, FDA, 1999). cFood and Drug Administration (DHHS, FDA, 1998). dICH Guidance E12A (2000). eFood and Drug Administration (DHHS, FDA, 1994).

Page 59: Introduction to Statistics in Pharmaceutical Clinical Trials

might be taken, for example, at baseline, week 2,week 4, week 6, week 8, week 10, and week 12(end of treatment). By taking several measure-ments, the change across the treatment phasecan be examined in more detail. Suppose that anindividual’s SBP decreases by 20 mmHg frombaseline to end of treatment. There are manypossible patterns of change across time here. Forexample, most of the individual’s decrease inblood pressure could happen in the first fewweeks, it could decrease steadily across the 12weeks, or most of the decrease could occurduring the last few weeks. Although this levelof analysis is of interest in some trials, we focuson change scores calculated by using twomeasurements, the baseline measurement andthe end-of-treatment measurement.

4.8 The clinical study protocol

When the clinical research team has decided ontheir research question, and the appropriatestudy design and methodology to acquireoptimum quality data with which to answer thisquestion, all this information needs to be docu-mented. The clinical study protocol is the docu-ment that is written for this purpose. Chow andChang (2007, p 1) noted that the study protocolis “the most important document in clinicaltrials, since it ensures the quality and integrity ofthe clinical investigation in terms of its planning,execution, conduct, and the analysis of the data.”

The study protocol is a comprehensive plan ofaction that contains information concerning thegoals of the study, details of individual recruit-ment, details of safety monitoring, and allaspects of design, methodology, and analysis.Input is therefore required, for example, fromclinical scientists, medical safety officers, studymanagers, data managers, and statisticians.Consequently, although one clinical scientist ormedical writer may take primary responsibilityfor its preparation, many members of the studyteam make critical contributions to it.

The following are some of the fundamentalcomponents in a study protocol for a thera-peutic confirmatory trial for an investigationalantihypertensive drug:

• How the disease or condition of interest willbe diagnosed, that is, participating individ-uals need to be diagnosed as hypertensive.The protocol will state the precise criteria thatconstitute high blood pressure in thisparticular study, and how and by whomdetermining measurements will be taken.

• Inclusion and exclusion criteria: Theseprovide detailed criteria for individual eligi-bility for participation in the trial. Theseeligibility criteria can often represent acompromise among several perspectives, suchas regulatory, medical, and logistical. Forexample, the most valuable informationabout the benefits of the new treatment willbe obtained from a group of study individualswho are most representative of the patients towhom the drug will be prescribed. On theother hand, “real world” patients may betaking a number of medications or haveconcurrent illnesses that may confound theability to evaluate the investigational treat-ment. It may be logistically impossible tostudy individuals with poor reading abilitiesbecause they will not comply with studyprocedures. Eligibility criteria define the studypopulation, a term that is discussed in greaterdetail in Chapter 5.

• The primary objective and any secondaryobjectives (it is a very good idea to limit thenumber of objectives): These must be statedprecisely.

• Measures of safety: The criteria to be used toevaluate safety are provided. These will typi-cally include adverse events, clinical labora-tory assays, electrocardiograms (ECGs), vitalsigns, and physical examinations.

• Measures of efficacy: The criteria to be used todetermine efficacy are provided. Decrease inblood pressure will be the primary measure-ment of interest. Also, it may be the case thataverage decreases of a certain magnitude arerequired for the investigational drug to bedeemed effective.

• Drug treatment schedule: Route of adminis-tration, dosage, and dosing regimen aredetailed. This information is also provided for the control treatment.

• The statistical analyses that will be used oncethe data have been acquired. The precise

44 Chapter 4 • Study design and experimental methodology

Page 60: Introduction to Statistics in Pharmaceutical Clinical Trials

analytical strategy needs to be detailed, hereand/or in an associated statistical analysisplan.

A study protocol is often supplemented withanother very important document called thestatistical analysis plan (sometimes referred to bysimilar names such as a data analysis plan orreporting analysis plan). The statistical analysisplan often supplements a study protocol byproviding a very detailed account of the analysesthat will be conducted at the completion of dataacquisition. The statistical analysis plan shouldbe written in conjunction with (and at the sametime as) the protocol, but in reality this does notalways happen. At the very least it should befinalized before the statistical analysis andbreaking of the blind. In many instances (forexample, confirmatory trials) it may be helpfulto submit the final statistical analysis plan to theappropriate regulatory authorities for theirinput.

4.10 References

Cardiac Arrhythmia Suppression Trial Investigators(1989). Preliminary report: Effect of encainide andfleicanide on mortality in a randomized trial ofarrhythmia suppression after myocardial infarction.N Engl J Med 321:406–412.

Cardiac Arrhythmia Suppression Trial Investigators(1992). Effect of the antiarrhythmic agent moricizine on survival after myocardial infarction.N Engl J Med 327:227–233.

Chow S-C, Chang M (2007). Adaptive Design Methods inClinical Trials. Boca Raton, FL: Chapman &Hall/CRC.

Fleming TR, DeMets DL (1996). Surrogate end points inclinical trials: are we being misled? Ann Intern Med125:605–613.

ICH Guidance E8 (1997). General Consideration of Clin-ical Trials. Available at: www.ich.org (accessed July 12007).

ICH Guidance E12A (2000). Principles for Clinical Eval-uation of New Antihypertensive Drugs. Available at:www.ich.org (accessed July 1 2007).

Machin D, Campbell MJ (2005). Design of Studiesfor Medical Research. Chichester: John Wiley &Sons.

Oliver JJ, Webb DJ (2003). Surrogate endpoints. In:Wilkins MR (ed.), Experimental Therapeutics. BocaRaton, FL: Taylor & Francis Group, 145–165.

Piantadosi S (2005). Clinical Trials: A methodologicperspective, 2nd edn. Chichester: John Wiley &Sons.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

US Department of Health and Human Services, Foodand Drug Administration (1998). Providing ClinicalEvidence of Effectiveness for Human Drug and Biological Products. Available from www.fda.gov(accessed July 1 2007).

US Department of Health and Human Services, Foodand Drug Administration (2007). Challenge andOpportunity on the Critical Path to New Medical Products. Available from www.fda.gov (accessed July1 2007).

References 45

4.9 Review

1. What is the importance of replication,randomization, and local control in experimentaldesign?

2. Define the following aspects of clinical trial studydesign:

(a) double blind

(b) concurrent control

(c) parallel group.

3. What is the difference between a clinical endpointand a surrogate endpoint?

4. What information is included in a study protocol?

Page 61: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 62: Introduction to Statistics in Pharmaceutical Clinical Trials

5.1 Introduction

Selecting an appropriate study design to bestaddress the study objectives is just the first steptowards answering the questions of interest.When most people think about Statistics theyare probably thinking about data. Unfortunately,statisticians are not infrequently assigned thenickname “number crunchers,” a name thataccentuates the numerical aspects of the use ofstatistics but completely ignores the design,methodology, and interpretation aspects of thediscipline of Statistics. Number crunching (andcomputational accuracy) is certainly a necessarycomponent of Statistics, but it is important tobear in mind that it is far from sufficient.

Having reviewed the concepts of study designwe now turn our attention to data. We are inter-ested in various questions relating to data, suchas: What are data? How might we classifydifferent types of data? How are data used toanswer questions arising during clinical trials?This last question is, perhaps, the most impor-tant one for this book. We start to answer it firstin conceptual terms before turning our attentionto more specific points.

5.2 Populations and samples

It is of considerable interest in new drug devel-opment to assess the effects of a drug in a partic-ular population, the population containingindividuals who may be prescribed the drug ifand when it is approved. This population isknown as the target population. Not all theadults in the USA and the UK would be ideal

candidates for a therapeutic confirmatory trialbecause of the presence of other conditions orthe use of other drugs, or for logistical reasonsbecause they do not live close enough to a centerthat participates in clinical studies. Therefore,another population of interest is all adults in theUSA and the UK who meet the specific eligibilitycriteria (including a precise definition of hyper-tensive) of a study. This group of individuals isconsidered the study population.

As study populations are often very large,however, it is not possible to administer the drugto every member of the population, so a samplefrom the study population is chosen and theeffects of the drug in that sample are determinedin a clinical study. In clinical trials, samples aretypically considered or assumed to be simplerandom samples from the study population. Asimple random sample is a sample in which eachobservational unit (for example, study partici-pant) has the same probability of selection fromthe population. In other fields in which Statisticsare used (most notably population surveys)samples need not be selected in this manner.

A clinical trial provides numerical state-ments of the drug’s effects in the specificsample employed, but the investigator and theregulatory agency are really interested in thedrug’s (likely) effect in the whole population.Therefore, statistical procedures have beendeveloped to allow numerical assessments ofthe likely effects in the study populationbased on the evidence collected from thesample that participated in the trial.

There are important limitations to the useful-ness of generalizing the effects from a series ofclinical trials to the patient population as awhole. The population from which clinical trialparticipants are sampled, the study population,

5Data, central tendency, and variation

Page 63: Introduction to Statistics in Pharmaceutical Clinical Trials

may not truly be representative of the population(the target population) about which we wouldlike to make conclusions. The target populationmay be sicker, have greater needs for concomitantmedications, and have more chronic illnessesthan the relatively homogeneous populationfrom which the study sample arose. This point iswell expressed by Senn (1997, p 28):

In a clinical trial the primary formal objective isto assess what effects the treatments did have onthe patients studied in order to say what effectsthey may have. To say what effects the treat-ments will have or even will probably have,requires arguments which go well beyond anyformal examination of the data.

The following discussions address statisticalmethods that are applied to data from a sampleof study participants with the objective ofmaking an inference about the study population.By including relevant populations in studies andcarefully documenting the methodology thatgave rise to the study sample in regulatory docu-ments and clinical communications, reviewersand physicians can judge for themselves theextent to which the results from the study can beinferred to the clinical situation.

5.3 Measurement scales

Data are anything that is measured. Examples ofdata encountered in clinical studies includeheight, weight, plasma concentration of a drugin a sample, days from the start of a study to aparticular adverse event, the presence or absenceof a characteristic of interest, and the gender of astudy participant. Some of these examples maybe surprising because we often think of data asnumbers, but data may also be non-numeric.

Data can generally be classified into one of the following scales of measurement: nominal,ordinal, interval, or ratio.

5.3.1 Nominal scale

Nominal measurement scales involve names of characteristics. Characteristics frequently

encountered in clinical studies that are measuredon the nominal scale include gender (female ormale), occurrence or not of an adverse event, acoded adverse event (for example, headache,asthenia, nausea), and race or ethnicity. Datameasured on a nominal scale cannot be operatedon arithmetically. We could not, for example,compare the values of females and males andcome up with a meaningful result. An importantcaution is worth noting at this point. It is notuncommon to encounter data measured on thenominal scale to be represented as numbers orcodes in electronic databases. An example wouldbe when, in a database of a clinical study, thepresence or absence of an adverse event (forexample, headache) is represented as 0 (absent)or 1 (present). Before we undertake a statisticalanalysis of any sort it is necessary to understandfully the nature of the data.

5.3.2 Ordinal scale

This scale is best defined as one in which anordering of values can be assigned. Examples ofdata from clinical studies measured on anordinal scale include: severity of an adverse eventclassified as mild, moderate, or severe; age cate-gorized as � 65, 65–70, 71–75, and � 75 years.The ordinal nature of the measurement scalesmeans that we can say that a mild headache isless severe than a moderate headache, which is less severe than a severe headache. However,we cannot say that the difference between mildand moderate is the same as the differencebetween moderate and severe.

5.3.3 Interval scale

In contrast, differences between any two valuesmeasured on the interval scale do have meaning.Temperature measured on the Celsius or Fahren-heit scale is an example of an interval scale. Forexample, the difference between 32°F and 64°F isthe same as the difference between 64°F and96°F. On the interval scale, a value of zero is nota true zero (meaning absence of heat) because avalue of �1°F is colder still. We can perform

48 Chapter 5 • Data, central tendency, and variation

Page 64: Introduction to Statistics in Pharmaceutical Clinical Trials

addition and subtraction on interval scaled databut, because the value of zero is meaningless, wecannot perform multiplication or division andobtain a meaningful result.

5.3.4 Ratio scale

Data measured on the ratio scale have all of thecharacteristics of interval scaled data with theexception that, in this case, a value of zero doesrepresent a true zero. Height, which is measuredon the ratio scale, has a true zero. A height ofzero centimeters or inches means that there is noheight. Likewise, a weight of zero kilograms orpounds means that there is no weight. Animportant characteristic of ratio scaled data isthat the ratio of two values can be computed. Forexample, a study participant who weighs 220pounds weighs twice as much as one who weighs110 pounds.

The importance of identifying these scales ofmeasurement is that not all statistical analysisapproaches are appropriate for each of them. It isimportant to note that, although a particularcharacteristic may be measured on one scale, itmay be reported using another. For example, ageat the time of study entry may be measured on aratio scale, but reported using an ordinal scale(for example, � 25, 25–64, � 64 years).

5.4 Random variables

Many individuals are involved in clinical trialsand a number of characteristics of these partici-pants are recorded. As characteristics such asage, systolic blood pressure, and gender canvary from individual to individual, they aregenerally classified as random variables (or,simply, variables). A common convention instatistics is to represent a particular randomvariable as a letter, such as x. A particular real-ization, or value, of a random variable for aparticular individual (participant i in this case)is often denoted using a subscript such as xi. Weuse these conventions in this chapter andthroughout the text.

5.5 Displaying the frequency of valuesof a random variable

Since a random variable such as age can take ona number of values for a group of study partici-pants it is of interest to know something aboutthe relative frequency of each value. The relativefrequency is the count of the number of obser-vations with a specific value (for example, thenumber of 30-year-old participants) divided bythe total number in the sample. An informativefirst step in a statistical analysis is to examinecharacteristics of the relative frequency of valuesof the random variable of interest, which canalso be called the empirical distribution of therandom variable. This knowledge is an essentialpart of selecting the most appropriate statisticalanalysis. Statistical software packages offer anumber of methods to describe the relativefrequency of values including tabular frequencydisplays, dot plots, relative frequency histograms,and stem-and-leaf plots.

An example of a frequency table is provided inTable 5.1, in which the frequency of age valuesin a sample of 100 study participants isdisplayed. The left-hand column is the value ofage for which frequency information isprovided. The column labeled “Frequency” is thecount of the number of participants with theparticular value of age. The column “Percentage”is the count of the number of participants withthe particular value of age divided by the totalnumber of observations in the sample and multi-plied by 100 to express this figure as a percentageof the total. The next column “Cumulativefrequency” represents the total count of agevalues less than or equal to the age value on acertain row. Similarly, “Cumulative percentage”is the cumulative frequency count of age valuesas a percentage of the total. As seen in Table 5.1there is one 40-year-old individual (1% of thetotal) and there are five who are 40 and younger(5% of the total). A frequency table allows us tosee how common all values are, but it can bedifficult to see whether or not certain valuestend to cluster together.

Another helpful way of displaying the relativefrequency of observed values is to group valuesinto equally spaced intervals and display the

Displaying the frequency of values of a random variable 49

Page 65: Introduction to Statistics in Pharmaceutical Clinical Trials

resulting frequency in a histogram. There is nosingle width of each interval, or bin, that can berecommended. However, one might considerthe quantity W as a starting point for thewidth:

Maximum value � Minimum valueW � ––––––––––––––––––––––––––––––––––

n

It is typically desirable to have at least 5 binsand no more than 10, although less or more may

50 Chapter 5 • Data, central tendency, and variation

Table 5.1 Frequency table of age values

Age (years) Frequency Percentage Cumulative frequency Cumulative percentage

31 1 1.00 1 1.0035 1 1.00 2 2.0039 2 2.00 4 4.0040 1 1.00 5 5.0043 3 3.00 8 8.0045 1 1.00 9 9.0048 2 2.00 11 11.0049 4 4.00 15 15.0050 4 4.00 19 19.0051 2 2.00 21 21.0052 2 2.00 23 23.0053 2 2.00 25 25.0055 3 3.00 28 28.0057 3 3.00 31 31.0058 3 3.00 34 34.0059 5 5.00 39 39.0060 2 2.00 41 41.0061 6 6.00 47 47.0062 4 4.00 51 51.0063 2 2.00 53 53.0064 1 1.00 54 54.0065 1 1.00 55 55.0066 3 3.00 58 58.0067 4 4.00 62 62.0068 3 3.00 65 65.0069 2 2.00 67 67.0070 3 3.00 70 70.0071 4 4.00 74 74.0072 3 3.00 77 77.0073 4 4.00 81 81.0074 1 1.00 82 82.0075 3 3.00 85 85.0076 1 1.00 86 86.0077 3 3.00 89 89.0078 3 3.00 92 92.0079 1 1.00 93 93.0080 2 2.00 95 95.0081 2 2.00 97 97.0082 1 1.00 98 98.0083 1 1.00 99 99.0088 1 1.00 100 100.00

Page 66: Introduction to Statistics in Pharmaceutical Clinical Trials

be informative. Once the number and width ofthe bins have been determined the next step is tocount the number of observations that fall intoeach interval and display the frequency of eachgrouping with contiguous bars. It is importantthat the intervals or bins are defined such thateach observation can be assigned to only oneinterval. Using the 100 age values in the previousexample, a histogram, displayed in Figure 5.1, hasbeen constructed from the following categories:30–39, 40–49, 50–59, 60–69, 70–79, 80–89. Notethat each bar is centered over the intervalmidpoint. For example, the bar centered at 54.5represents the relative frequency of age values inthe interval 50–59.

By grouping the 100 age values (that is, the100 participants in the indicated age groups)into categories, much of the detail evident inTable 5.1 has been lost. A display that retains thegraphical nature of the histogram and the detailof the tabular frequency is a stem-and-leaf plot.A stem-and-leaf plot displays the first significantdigit of the value of random variable as a “stem”and the subsequent significant digit as a “leaf.”The stems are ordered from lowest to highest sothat the relative frequency of each value can be

surmised in one concise display. A stem-and-leafplot of 100 individuals’ age values is provided inFigure 5.2. To assist in your interpretation of thisdisplay, the youngest participant in this studywas 31, the oldest was 88, and there were four 50year olds.

The shape of the overall distribution in thiscase could be called somewhat bell shaped,as characterized by relatively fewer observationsat either extreme than in the middle. Somedistributions are symmetric, whereas others areasymmetric. Those that are asymmetric are said tobe skewed. If fewer observations are at the upperend of the distribution (that is, the long tail istoward the right or higher values) the distribu-tion’s shape is called positively skewed. If thelong tail is pointing toward the left, or lowervalues, the distribution is called negativelyskewed. In the case of this particular example,turning Figure 5.2 on its side so it has lower valueson the left reveals that the distribution of agevalues is somewhat negatively skewed. Althoughthe stem-and-leaf display in Figure 5.2 has moredetail (that is, more bins) than the histogram inFigure 5.1, the histogram retains the basic shapeelements of the stem-and-leaf display.

Displaying the frequency of values of a random variable 51

Perc

enta

ge

34.5 44.5 54.5 64.5 74.5 84.5

Age (years)

0

5

10

15

20

25

30

Figure 5.1 Histogram of 100 age values

Page 67: Introduction to Statistics in Pharmaceutical Clinical Trials

5.6 Central tendency

One fundamental idea in the development ofnew pharmaceutical products is that pharmaceu-tical companies (sponsors) would like to demon-strate that participants who receive a testtreatment tend to fare better than those whoreceive some alternate therapy. This alternatetherapy could be an inactive control (a placebo)or some other approved therapy (an activecontrol). We said “tend to fare better” becauseparticipants will not all respond in the same wayto the same test treatment. It is also true that, ifand when the drug is approved for marketingand prescribed for patients, some patients will dobetter on the drug than others, but it is still veryuseful to clinicians to know how patients willtend to respond.

When we flip a fair coin ten times, we do notalways expect to observe five heads and five tails.If we do several series of ten flips, we know that,by chance, we will observe six heads and fourtails sometimes, and even more lopsided resultswould not be all that surprising. The samephenomenon happens with the response to testtreatments in clinical studies. When doctorsprescribe a new medicine to a patient it would behelpful to know what kind of response could beexpected. Although we might expect that a faircoin flipped ten times will result in five heads,we also would expect that four or three headscould be observed. The determination of valuesthat might be expected is the next topic in thischapter, that is, measures of central tendency.

Once we have assembled individual observa-tions in a sample from a clinical study, ourability to understand the nature of those obser-vations as a whole is limited by our ability tosynthesize several disparate pieces of observationinto an overall impression. Imagine that youhave observed the following 10 observations ofage of study participants in an early exploratorytherapeutic clinical trial: 45, 62, 32, 38, 77, 28,25, 62, 41, and 50.

Regulatory authorities are concerned abouthow well study participants match those in thegeneral population of patients with the condi-tion. How might such a question be answered?There are several strategies here.

5.6.1 The mode

One possible way to answer this question is toreport that the most common value of age is 62.There are two such observations with this valueof age. This measure of central tendency isknown as the mode. The mode is mostcommonly used with non-numeric data (forexample, most of the study participants werefemale), but it may also be useful for numericdata if there are only a few unique values. Unfor-tunately, the choice of the mode as the typicalvalue in this case is a little misleading. Althoughthere are two 62-year-olds in the study, moststudy participants (seven of them) are youngerthan that.

A question that comes to mind here is: Whatwould the mode have been if all values of agewere unique? The answer is that there wouldhave been no mode – all values occurred equallyas frequently. Likewise, suppose that there hadalso been two observations with the value of ageof 32: In this case there would not be one valueof the mode, but two. These two properties ofthe mode – that is, it is undefined in someinstances and it may have multiple values inothers – are considerable drawbacks to its use.

5.6.2 The median

Another reasonable choice for the typical valueof age would be the value of age that is right in

52 Chapter 5 • Data, central tendency, and variation

334455667788

1599033358899990000112233555777888999990011111122223345666777788899000111122233334555677788890011238

Figure 5.2 Stem-and-leaf display of 100 age values

Page 68: Introduction to Statistics in Pharmaceutical Clinical Trials

the middle of all values. This middle value iscalled the median. Citing the median wouldensure that there were as many participantsyounger than the typical value as there wereparticipants older. In this case, as there are 10values there is no single middle value becausethere are an even number of values. The fifthand sixth greatest values of age are 41 and 45. Toobtain the median value with an even number ofobservations, we simply split the differencebetween the two middle values. In this case themedian is calculated as:

(41 � 45)––––––––– � 43.

2

A quick check to know that we got it right is tosee that exactly 5 observations are less than 43and exactly 5 observations are greater than 43.When there is an odd number of observations,the median is the value of the middle observa-tion after ordering them from the smallest to thelargest. Unlike the mode, the median for a set ofobservations is unique: There is only one valueand it is always defined.

5.6.3 The arithmetic mean

The last measure of central tendency that weconsider is the most commonly encountered.The arithmetic mean is the sum of the individualobservations divided by the total number ofobservations. Using mathematical notation themean is calculated as:

n

R xi

i � 1x � –––—–n

where R stands for the addition of the values ofeach observation in the sample (n of them), thatis:

n

R xi � x1 � x2 � . . . � xn.i � 1

For our sample of 10 values of age, the mean is46 (verification of this calculation is left to you).The arithmetic mean, commonly called theaverage, is the value that balances the weight ofthe distribution.

Some noteworthy characteristics of the meanare that, like the median, it is unique and alwaysdefined for a set of observations. However, themean is sensitive to extreme observations – thatis, if there is a single observation that is muchhigher or lower than the rest, the mean will beheavily influenced by that single observation.

One of the primary goals of Statistics is to usedata from a sample to estimate an unknownquantity from an underlying population, calleda population parameter. In general, we typicallyuse the arithmetic mean as the measure ofcentral tendency of choice because the samplemean is an unbiased estimator of the populationmean, typically represented by the symbol l.The main conceptual point about unbiased esti-mators is that they come closer to estimating thetrue population parameter, in this case the popu-lation mean, than biased estimators. Whenextreme observations influence the value of themean such that it really is not representative of atypical value, use of the median is recommendedas a measure of central tendency.

Returning to the query posed by the regulatoryauthorities, we came up with the followingresponses. The typical value of age in our sampleusing the mode is 62, using the median is 43,and using the mean is 46. Suppose that theauthorities are satisfied with that responseinitially and then pose the following question.“So was your study among middle-aged adultswith the condition?” You refer once again to thelist of 10 observations and realize that it is notthat simple. There actually were some youngeradults in your study and it would be ideal toquantify the extent to which the mean does nottell the whole story. It is no surprise that not allvalues of age in the study are the same. Fortu-nately there are ways to quantify the extent towhich they vary from participant to participant.

5.7 Dispersion

Dispersion refers to the variety or “spread” ofindividual observations in a sample. As forcentral tendency there are various measures ofdispersion.

Dispersion 53

Page 69: Introduction to Statistics in Pharmaceutical Clinical Trials

5.7.1 The range

A quick way to reflect the variety of values in asample is to cite the lowest and highest values,the minimum and maximum. Calculating thedifference between these two (calculated asmaximum minus minimum) yields a valuecalled the range:

Range � xmax � xmin.

Although the range is informative in that itconveys the difference between the two mostextreme values, it does have a deficiency: Itreally does not adequately reflect the extent towhich observations are similar or dissimilar.Imagine a study in which 99 of the 100 partici-pants are aged between 20 years and 29 years,and one is 60 years old. The range is quite large(40 years), but the value of this range does notgive any indication about how close togethermost age values in the sample are to each other.

5.7.2 The variance

In contrast, the variance of a sample does indi-cate how close together most values in a sampleare. The sample variance is calculated as the sumof squared deviations of each observation fromthe sample mean divided by the sample sizeminus 1:

n

R (xi � x)2

i � 1s2 � –––––––––—–.n � 1

A calculation of this sort ensures that the measureof dispersion is positive (squaring the deviationsensures that) and dividing by (n � 1) results in aquantity that represents an average of sorts. Thesample variance is the “typical” or “average”squared deviation of observations from thesample mean. The use of the (n � 1) in thedenominator may seem confusing, but the reasonwhy this is done is that calculating the samplevariance in this manner yields an unbiased esti-mator of the population variance, which is repre-sented by the symbol r2. (The exact mathematical

demonstration that s2 is an unbiased estimator ofr2 is beyond the scope of this text.)

5.7.3 The standard deviation

Although very useful in some ways, the samplevariance has the unfortunate characteristic thatit is expressed in terms of squared units that aretypically nonsensical. From our earlier exampleof 10 ages, we would calculate the sample vari-ance as 282 “squared years.” To overcome thesignificant drawback of squared units we cantake the square root of the sample variance toobtain the standard deviation (s):

s � �__s2 .

The standard deviation represents an average(of sorts) deviation of each observation from thesample mean. Again, the only reason why we donot call this quantity the average deviationwithout qualification is that there really are ndeviations from the sample mean, but the stan-dard deviation is calculated using the denomi-nator of (n � 1) instead of n. The samplestandard deviation is an unbiased estimator ofthe population standard deviation. For ourprevious example of 10 age values, the value ofthe sample standard deviation is 16.8 years (weleave confirmation of this to you).

The sample standard deviation captures a greatdeal of information about the spread of the data.The value of the standard deviation is helpfulacross a number of datasets because of the resultsof what is called Tchebysheff’s theorem. A simpleway of thinking of Tchebysheff’s theorem is thatmost values lie close to the sample mean.According to this theorem, no matter what theshape of the distribution is:

• 25% (1–4 � 1–22) or less of observations lie outsideof 2 standard deviations away from the mean

• 11% (1–9 � 1–32) or less of observations lie outsideof 3 standard deviations away from the mean

• 6% ( 1––16 � 1–42) or less of observations lie outsideof 4 standard deviations away from the mean.

Applying Tchebysheff’s theorem to our sampleof 10 ages, we can say to the regulatory agency thatthe study really is not just among middle-agedadults.

54 Chapter 5 • Data, central tendency, and variation

Page 70: Introduction to Statistics in Pharmaceutical Clinical Trials

5.7.4 Variability and the coefficient ofvariation

A commonly asked question among investiga-tors is: How do I know if I have a lot of or a littlevariability in my study results? There is nostraightforward answer to this question: Themagnitude of the variance (or synonymously thestandard deviation) can be called a lot or a littleonly when it is compared with some otherquantity – that is, it is relative.

In Chapter 11 we discuss an analytical strategycalled analysis of variance (ANOVA) in whichone variance is compared with another. For now,another useful measure of relative dispersion isthe coefficient of variation (CV), calculated asthe ratio of the sample standard deviation to thesample mean:

sCV � ––.x

The coefficient of variation is useful whencomparing the magnitude of variability betweentwo or more different random variables.

To illustrate the coefficient of variation,consider the following (extremely simple andartificial) example. Imagine that there are tworandom variables in an early therapeuticexploratory clinical trial. One random variable ispulse (ranging from 50 to 80) and the other isage, which in this case is pulse minus 20. We cansee that, from this example, values of pulse andage are just as disperse, but what differs betweenthem is the mean. Hence, when we calculate thestandard deviation, one random variable willappear to have more or less dispersion, but, afterre-scaling the standard deviation with thesample mean, the measure of dispersion is thesame.

5.7.5 Percentiles

Another descriptive measure of variability ordispersion is the percentile. The Pth percentile isthe value of the random variable, X � X P––100

, suchthat:

• P% of values of X are � X P––100• 100 � P% of values of X are � X P––100

.

For example, the 75th percentile is the value ofX below which 75% of the values lie and abovewhich 25% lie. The 50th percentile is synonymouswith the median. Likewise the 25th percentile isthe value of X below which 25% of the values lieand above which 75% lie. The difference betweenthe 75th and 25th percentiles is called theinterquartile range, which can be a useful measureof dispersion when the distribution of the randomvariable is heavily skewed or asymmetric.

5.8 Tabular displays of summarystatistics of central tendency anddispersion

As we discuss in more detail in Chapter 6, one ofthe primary goals of studies in a clinical devel-opment program is to describe the effect that thetest treatment had on study participants so thatsome inference can be made about the drug’seffects on patients who may receive the drug in the future. Summary descriptive statistics ofcentral tendency and dispersion give us better understanding of the typical effect of the test treatment and how varied participants’responses were.

In our experience the mean and the standarddeviation are the most commonly used summarystatistics for these purposes. However, othermeasures can be useful to reviewers when inter-preting data from clinical studies. We encourageresearchers to present the following statistics:The sample size, the mean, the median, the stan-dard deviation, and the minimum and themaximum. Presenting all these values for a givenrandom variable provides a reviewer with twomeasures of central tendency and two measuresof dispersion. For clinical studies that arecomparative in nature, such as therapeuticconfirmatory trials, it is our recommendationthat summary statistics – for example, the meanand standard deviation – be formatted in areport so that the primary comparison of interestis read across columns (left to right). In clinicalstudies this is typically treatment groups or dosegroups. Secondary comparisons of interest – forexample, time points of observation – should bearranged as separate rows.

Tabular displays of summary statistics of central tendency and dispersion 55

Page 71: Introduction to Statistics in Pharmaceutical Clinical Trials

5.10 Reference

Senn S (1997). Statistical Issues in Drug Development.Chichester: John Wiley & Sons.

56 Chapter 5 • Data, central tendency, and variation

5.9 Review

1. What scale are each of the following participantcharacteristics measured on:

(a) eye color(b) body mass index (kg/m2)(c) number of cerebrovascular events diagnosed in

the past 5 years(d) days from study entry to last follow-up visit(e) concentration of test drug in plasma (ng/mL) (f) blood pressure classification: Normal;

prehypertension; stage 1 hypertension; stage 2hypertension.

2. Using the histogram in Figure 5.1 and the stem-and-leaf plot in Figure 5.2, comment on theappropriateness of each of the following measuresof central tendency:

(a) mean(b) median(c) mode.

3. From the frequency table of age values in Table 5.1,calculate:

(a) the median or 50th percentile(b) the 25th percentile(c) the 75th percentile.

Page 72: Introduction to Statistics in Pharmaceutical Clinical Trials

6.1 Introduction

A common goal of pharmaceutical clinical trialsis to establish with some high degree of confi-dence that the test treatment is superior to acontrol with respect to some measurable effect. Ifwe are able to say that the expected effect of thetest treatment tends to be superior (by someamount) to the expected effect of the control,we could conclude that the test treatment wassuperior to the control.

To accomplish this objective, sponsors designstudies that allow them to attribute any differ-ence in the response of interest to the test treat-ment itself. This is accomplished through theuse of randomization, a carefully selected studypopulation, treatment blinding, careful datacollection, and other measures that minimizethe possibility that other factors may have influ-enced the outcome of the study. However, if toofew study participants are studied any differenceobserved might have been caused by chance. Achance, or spurious, result is one that may not berepeatable or, to put it another way, a chanceresult is not reliable. Provision of a high degreeof confidence that a new drug is beneficialrequires sponsors to demonstrate that effectsobserved from a new treatment are reliable. Thischapter discusses the statistical concepts thatallow researchers to make the conclusion thatthe effect seen in a study was unlikely to bethe result of chance.

6.2 Probability

The statements at the end of the previous sectioncan be expressed differently, and more quantita-

tively, in the language of Statistics. We notedthat provision of substantial evidence that a newdrug is beneficial requires sponsors to demon-strate that effects observed from a new treatmentare reliable. This chapter discusses the statisticalconcepts that allow researchers to make theconclusion that the effect seen in a study is reli-able, that is, it is unlikely to be the result ofchance.

The statistical techniques that can be used torule out chance events require us first to considersome concepts of probability. Many outcomes inlife are inherently uncertain, and others can beconsidered certain. If you play the lottery, it isuncertain whether you will win on any givenoccasion (it is also incredibly unlikely). If youdrop an apple, it is certain that it will fall to theground. Other outcomes fall in the middle ofthe range. It is useful to be able to quantify thedegree of certainty, and conversely the degreeof uncertainty, associated with a particularoccurrence. This is the realm of probability.

Like the word significance, the concept ofprobability is used in everyday language as wellas in the discipline of Statistics. As Turner (2007)noted, the statement “I’ll probably be there onSaturday” involves a probabilistic statement, butthere is no degree of quantification (if you knowthe individual making this statement, past expe-rience may lead you to have an informedopinion concerning the relative meaning of“probably,” but this is a subjective judgment).

As for many aspects of statistical analysis,there are axioms in probability that make it avery useful tool. In the context of Statistics,probability can be defined in quantifiable terms.A probability is a numerical quantity betweenzero and one that expresses the likely occurrenceof a future event. A probability of 0 denotes that

6Probability, hypothesis testing, and estimation

Page 73: Introduction to Statistics in Pharmaceutical Clinical Trials

the event will not occur. A probability of 1denotes that the event will undoubtedly occur.Any numerical value between 0 and 1 expressesa relative likelihood of an event occurring.

A probability value can be represented as afraction or as a decimal value. In addition, it iscommon in some aspects of Statistics to multiplythe decimal expression of a particular probabilityby 100 to create a percentage statement of likeli-hood. A probability of 0.5 would thus beexpressed as a 50% chance that an event wouldoccur. Percentage statements of likelihood are acentral component of hypothesis testing, whichis introduced later in the chapter.

The probability of an event (E) can be repre-sented as P(E) and we use this notationalconvention. In general, the probability of eitherof two events (A or B) occurring is calculated as:

P(A or B) � P(A) � P(B) � P(A and B).

In other words, the probability of either eventoccurring is the sum of the probabilities of eachevent minus the probability of both occurringtogether (or jointly).

Consider the cross-tabulation of the genderand age of participants in a clinical trial aspresented in Table 6.1. As seen there were 200participants, 100 of whom were male and 100female. There were 65 participants aged 45 yearsor younger, 90 between 46 and 64 years, and 45who were aged 65 years or older. We illustrateseveral of the axioms of probability using Table 6.1. For example, the probability ofselecting at random a participant from thisgroup who was male or aged 65 years or older:

P(maleor�65) � P(male) � P(�65) � P(maleand�65) �

100 45 15 130P(male or � 65) � –––– � –––– � –––– � –––– � 0.65.

200 200 200 200

In the special case that the events A and Bcannot occur at the same time, they are said tobe mutually exclusive, meaning that P(A and B)� 0. Hence, for mutually exclusive events Aand B:

P(A or B) � P(A) � P(B).

A randomly selected participant cannot beboth “� 45” and “� 65.” The events of selectinga participant aged 45 years or younger and one

65 years or older are mutually exclusive. Thisresult is generalizable to more than two events ofinterest.

If one or more events, E1, E2, . . . En, representall unique and mutually exclusive outcomes in a particular circumstance, the probability ofobserving at least one of the events sums to one:

P(E1 or E2 or . . . or En) � P(E1) � P(E2) � . . . � P(En) � 1.

This result can be used to calculate the proba-bility of one or more events of interest. For anyevent E1 among n mutually exclusive andexhaustive events:

P(E1) � 1 � {P(E2) � . . . � P(En)}.

This expression is called the complement ruleand will be referenced throughout this book.

The probability of selecting a male at randomcan be calculated by adding the probabilities forthe events “male � 45 years,” “male 46–64years,” and “male � 65 years,” because these areall mutually exclusive events. The probabilitycan be calculated as follows:

P(male) � P(male � 45 years) �P(male 46–64 years) � P(male � 65 years)

35 50 15� –––– � –––– � ––––

200 200 200100 1

� –––– � –– .200 2

The probability of an event B given that A hasbeen observed is called a conditional probabilityand is defined as:

P(A and B)P(B | A) � ––––––––––,

P(A)

where the vertical bar signifies “given.”

58 Chapter 6 • Probability, hypothesis testing, and estimation

Table 6.1 Cross-tabulation of age and gender

Age (years) Male Female

� 45 35 30 6546–64 50 40 90� 65 15 30 45

100 100 200

Page 74: Introduction to Statistics in Pharmaceutical Clinical Trials

The probability of selecting a participant � 45years of age, given that a male has beenselected, is:

P(� 45 years and male)P(� 45 years | male) � –––––––––––––––––––––

P(male)

35––––P(� 45 years and male) 200 35–––––––––––––––––––––– � –––– � –––– .

P(male) 100 100––––200

It follows that the probability of two eventsoccurring jointly is calculated as:

P(A and B) � P(A)P(B|A).

One important use of conditional probabilitiesoccurs in Bayes’ theorem. The conditionalprobability of an event A given an event B is:

P(B | A)P(A)P(A | B) � ——————.

P(B)

Note that throughout this book we haveadopted a standard mathematical notation forthe product of two or more terms. In the expres-sion above the numerator is the product of thetwo terms, P(B|A) and P(A), that is, these twoquantities are multiplied. Please keep thisstandard in mind when you encounter othermathematical expressions.

It is also possible to state the probability of anevent, A, as a function of two or more condi-tional events. If the events B and C are mutuallyexclusive and exhaustive – for example, theyrepresent male and female – the probability ofevent A can be expressed as:

P(A) � P(A | B)P(B) � P(A | C)P(C).

This expression can be extended to more thantwo conditional events.

A common application of Bayes’ theorem is inestimating the probability of a participanthaving a disease, given a positive test for thatdisease. These concepts are important in theirown right with regard to the development ofdiagnostic tests. As the clinical trials discussed inthis book are for the purposes of developing newpharmaceutical interventions rather than testingfor the existence of a disease or condition, thisissue may not seem directly relevant. However,

these concepts are discussed in Chapter 12 in adifferent light, and we would therefore like toestablish these concepts at this earlier stage.

For simplicity, in the notation used in thisexample we define the following events usingthe symbol “�” which means “is equivalent to”:

• D� � participant has the disease of interest• D� � participant does not have the disease of

interest• T� � participant tests positive for the disease• T� � participant tests negative for the disease.

When developing a diagnostic test, investigatorsidentify two groups: One is known (by somegold standard testing procedure) to have thedisease; the other is known (also by a gold stan-dard testing procedure) not to have the disease.Then all participants in both these groups aregiven the new diagnostic test. The accuracy of anew diagnostic test is measured by two criteria:

1. Sensitivity is the probability that a new testwill have a positive result among those whoare known to have the disease. This is denotedby: P(T�|D�).

2. Specificity is the probability that a new testwill have a negative result among those whoare known not to have the disease. This isdenoted by: P(T�|D�).

Once a new diagnostic test has been developed itmay be considered for a public health screeningprogram. Evaluating the utility of a proposednew diagnostic test in a population involves thefollowing two criteria:

1. The true positive rate is the probability that aparticipant has the disease given that she orhe has tested positive. This is denoted byP(D�|T�) and is also referred to as predictivevalue positive. The complement, 1 � P(D�|T�)� P(D�|T�), is the false-positive rate.

2. The true-negative rate is the probability that aparticipant does not have the disease giventhat she or he has tested negative. This isdenoted by P(D�|T�) and is also referred to as predictive value negative. The complement,1 � P(D�|T�) � P(D�|T�), is the false-negative rate.

If the true-positive rate is low (or the false-positive rate high) a number of participants will

Probability 59

Page 75: Introduction to Statistics in Pharmaceutical Clinical Trials

needlessly incur the expense and anxiety offurther medical investigations. If the true-negative rate is low (or the false-negative ratehigh) a number of them will carry on undiag-nosed. The goal would be to adopt a screeningtool that had high rates of true positives and truenegatives. Bayes’ theorem can be used to showthat the rates of true positives and true negativesare a function of the sensitivity and specificity ofthe diagnostic test itself and the prevalence ofthe disease in the population of interest.

We illustrate this concept for the true-positiverate, which is:

� P(D� | T�)

P(T� | D�)P(D�)� ––––––––––––––––– by Bayes’ theorem.

P(T�)

Bayes’s theorem is applied again to obtain:

P(T� | D�)P(D�)� –––––––––––––––––––––––––––––––––––.

P(T� | D�)P(D�) � P(T� | D�)P(D�)

Noting that P(T� | D�) � 1 � P(T� | D�) wehave the desired result.

P(T� | D�)P(D�)� ––––––––––––––––––––––––––––––––––––––––.

P(T� | D�)P(D�) � [1 � P(T� | D�)]P(D�)

Note that P(D�) is often called the prevalence ofthe disease in a population. Its complement isP(D�) � 1 � P(D�). Prevalence of a disease isestimated through the use of epidemiologicstudies and not clinical trials. Thus, to fully eval-uate the utility of the new diagnostic test, wemust have an estimate of the prevalence ofthe disease, the sensitivity of the test, and thespecificity of the test.

Two events are said to be statistically indepen-dent if the probability of one occurring does notdepend on the other. If A and B are independentevents the joint probability is given by:

P(A and B) � P(A)P(B).

If we sample from our 200 study participants“with replacement” – that is, after each selectionthe participant is available for selection again –the probability of selecting a male does notdepend on previous selections. The probabilityof selecting two males in a row is given by:

P(two males in a row) � P(male)P(male)

1 1� ( –– ) ( –– )2 2

1� –– � 0.25.

4

The probability of selecting four males in arow is:

P(four males in a row) �P(male)P(male)P(male)P(male)

1 1 1 1� ( –– ) ( –– ) ( –– ) ( –– )2 2 2 2

1� –– � 0.0625.

16

These basic principles and characteristics ofprobability are referred to throughout subse-quent chapters.

6.3 Probability distributions

In Chapter 5 we described a number of ways toexamine the relative frequency distribution of arandom variable (for example, age). An importantstep in preparation for subsequent discussions isto extend the idea of relative frequency to proba-bility distributions. A probability distribution is amathematical expression or graphical representa-tion that defines the probability with which allpossible values of a random variable will occur.There are many probability distribution func-tions for both discrete random variables andcontinuous random variables. Discrete randomvariables are random variables for which thepossible values have “gaps.” A random variablethat represents a count (for example, number ofparticipants with a particular eye color) is consid-ered discrete because the possible values are 0, 1,2, 3, etc. A continuous random variable does nothave gaps in the possible values. Whether therandom variable is discrete or continuous, allprobability distribution functions have thesecharacteristics:

60 Chapter 6 • Probability, hypothesis testing, and estimation

Page 76: Introduction to Statistics in Pharmaceutical Clinical Trials

• All possible values of the random variablemust be represented by the distributionfunction.

• The probability of each value of the randomvariable occurring is bounded by 0 and 1,inclusive.

• The probabilities of values of the randomvariable occurring must sum to 1 (in thecase of a discrete random variable) or inte-grate to 1 (in the case of a continuousrandom variable).

A simple example of a discrete probabilitydistribution is the process by which a singleparticipant is assigned the active treatmentwhen the event “active treatment” is equallylikely as the event “placebo treatment.” Thisrandom process is like a coin toss with aperfectly fair coin. If the random variable, X,takes the value of 1 if active treatment israndomly assigned and 0 if the placebo treat-ment is randomly assigned, the probabilitydistribution function can be described asfollows:

1P(X � x) � ––, where x � 0 or 1.

2

This probability distribution function has thecharacteristics defined previously:

• The random variable can take on only valuesof 0 or 1.

• The probability distribution function isdefined for both values.

• The probability of each value is between 0and 1, inclusive. It is, in fact, half for bothvalues.

• Finally, the sum of the probability of allmutually exclusive outcomes is equal to one,that is, P(X � 0)�P(X � 1) � 1.

6.4 Binomial distribution

The first probability distribution function that wediscuss in detail is the binomial distribution,which is used to calculate the probability ofobserving x number of successes out of n observa-tions. As the random variable of interest, the

number of successes, is discrete (as are all counts),the binomial distribution is called a discreterandom variable distribution. The binomialdistribution is applicable when the followingconditions apply:

• Each of n observations results in only one oftwo outcomes (one is typically called a successand the other failure).

• The probability of a success, p, is the samefrom observation to observation.

• Each observation is independent of theothers.

The probability of observing x successes out of nobservations under these conditions (called aBernoulli process) can be expressed as:

P(X � x; p, n) � Cnxpx(1 � p)n�x.

The left part of this expression can be read as“the probability of the random variable, X,taking on a particular value of x, given parame-ters p and n.” The quantity (1 � p) is the proba-bility of failure for any trial. The notation Cn

x isshorthand to represent the number of combina-tions of taking x successes out of n observationswhen ordering is not important. This quantitycan be calculated as:

n!Cn

x � –––––––––.x!(n � x)!

The expression n! is read as “n factorial” and iscalculated as n(n � 1)(n � 2) . . . (1).

The mean of the binomial distribution func-tion is:

Mean � np.

The variance of the binomial distribution is:

Variance � np(1 � p).

A simple example of the use of the binomialdistribution is the result of four random assign-ments to either the active or the placebo treat-ment group when each outcome is equally likely.What is the probability of observing 0, 1, 2, 3, or4 assignments to the active treatment group outof 4 random treatment assignments when theprobability of assigning to active or placebo isequally likely? We must assume that theoutcome of one assignment does not impact the

Binomial distribution 61

Page 77: Introduction to Statistics in Pharmaceutical Clinical Trials

outcome on subsequent assignments, that is,they are independent. There are only twopossible outcomes on any given trial: Assign-ment to active or placebo. The probability ofeach outcome, the number of “successes” orassignments to active, can then be calculatedusing the binomial probability distributionfunction.

The probability of each outcome of fourrandom treatment assignments is displayed inTable 6.2. In some instances, we may be inter-ested in knowing what the probability ofobserving x or fewer successes would be, that is,P(X � x). This cumulative probability is alsodisplayed for each outcome in Table 6.2. For adiscrete random variable distribution, the sum ofprobabilities of each outcome must sum to 1, orunity.

As you might expect, the most probableoutcome is 2 actives (probability 0.375) and theleast probable outcomes are 0 and 4 actives (eachwith a probability of 0.0625). We can use thecumulative probability distribution to answerother probability questions of interest. Forexample, what is the probability of observing 3or fewer actives? This probability is denoted asP(X � 3) � 0.9375. We can use the complementrule from Section 6.2 to calculate the probabilityof observing 2 or more actives, P(X � 2), as:

1 � P(X � 1) � 1 � 0.3125 � 0.6875.

The binomial distribution is discussed later inthe chapter to illustrate concepts of hypothesistesting.

6.5 Normal distribution

Similar probability models can be used forcontinuous random variables. The mostcommon, and arguably the most important ofthese in Statistics, is the normal distribution. Asit is encountered so frequently in this book, wespend some time describing its characteristicsand uses.

The normal distribution is a particular form ofa continuous random variable distribution. Therelative frequency of values of the normal distri-bution is represented by a normal density curve.This curve is typically described as a bell-shapedcurve, as displayed in Figure 6.1.

More precisely, it is one specific kind of symmet-rical curve. The precise nature of this curve can bedescribed mathematically by a formula thatcontains both the mean, l, and the standarddeviation, r, of the population that is beingrepresented graphically by the normal curve:

(x��)2

1 � ––––––2r2

f (x ; l, r) � –––––– e .r�

___2p

The term “population” is defined in detail laterin the chapter. Until then we can think of apopulation as the largest group of experimentalunits (for example, study participants) aboutwhich we would like to make a conclusion.

As we need to know two parameters – that is,the mean, l, and the standard deviation, r, tofully characterize this distribution – it is consid-

62 Chapter 6 • Probability, hypothesis testing, and estimation

Table 6.2 Distribution of the number of assignments to active from four random assignments when the probabilityof assignment to active and placebo is equal (p�0.5)

Outcome Probability of the Cumulative probability(no. of actives, x) outcome P(X � x) P (X � x)

0 C400.50(0.5)4 � 0.0625 0.0625

1 C410.51(0.5)3 � 0.2500 0.3125

2 C420.52(0.5)2 � 0.3750 0.6875

3 C430.53(0.5)1 � 0.2500 0.9375

4 C440.54(0.5)0 � 0.0625 1.0000

Page 78: Introduction to Statistics in Pharmaceutical Clinical Trials

ered to be a two-parameter distribution. This factis also conveyed by the use of the symbols l andr on the left side of the expression. The meanspecifies the distribution’s location, whereas thestandard deviation specifies the spread of thedistribution. If a random variable X has a normaldistribution with mean l and variance r2, this iswritten as X ~ N(l,r2). Note that most practicalapplications involve the use of r rather than r2,but it is conventional to describe the normaldistribution in terms of its mean and variance.

Figure 6.2 displays three normal density curveswith the same mean (location) but differentstandard deviations (spread). Several characteris-tics of the normal distribution are very helpful indeveloping the statistical tests introduced in thisbook:

• The highest point of the normal curve occursfor the mean of the population, l.

• The shape of the curve (relatively narrow orrelatively broad) is influenced by the standarddeviation, r. The sides of the curve descendmore gently as the standard deviationincreases.

• At a distance of 2 standard deviations fromthe mean, the slope of the curve changes froma relatively smooth downward slope to acurve that technically extends out to infinity,that is, the curve technically never reaches(touches) the x axis of the graph. This concept

is analogous to starting a certain distanceaway from a fence and taking steps thatalways cover half the distance between youand the fence. As your next step always coversonly half the remaining distance, theoreti-cally you never reach the fence. However,after a certain number of steps, you are, to allpractical purposes, at the fence. In the samemanner, the curve is regarded as interceptingthe axis at a distance of 4 standard deviationsfrom the mean.

• The area under the curve is 1.0. This can bedemonstrated formally using integralcalculus, which is beyond the scope of thisbook: A simpler demonstration is provided byTurner (2007, pp 94–5). That the area underthe curve is equal to 1 is analogous to thestatement that the probability of all mutuallyexclusive events must sum to 1.

The precise mathematics of the normal distribu-tion allows quantitative statements of the areaunder the curve between any two points on thex axis. Of most interest here is the area under thecurve between two points that are equidistantfrom the mean.

These points, equidistant from the mean oneither side, can be represented by statements ofthe form:

l � distance.

Normal distribution 63

f (x)

0 10 20 30 40 50 60 70 80

x

0.00

0.02

0.01

0.04

0.03

0.07

0.06

0.05

0.09

0.08

0.11

0.10

0.13

0.12

Figure 6.1 A normal density curve (l � 40, r � 10)

Page 79: Introduction to Statistics in Pharmaceutical Clinical Trials

It can be shown for any normal distribution that:

• 68.3% of the area under the curve lies in therange l � r

• 95.4% of the area under the curve lies in therange l � 2r

• 99.73% of the area under the curve lies in therange l � 3r.

As the area under the entire density curve equals1 the statements above also imply by thecomplementary rule that:

• 31.7% of the area under the curve lies outsidel � r

• 4.6% of the area under the curve lies outsidel � 2r

• 0.27% of the area under the curve lies outsidel � 3r.

Expressing a similar concept in terms of pertinent“round number” percentages:

• The central 90% of the area lies in the range l � 1.645r

• The central 95% of the area lies in the range l � 1.960r

• The central 99% of the area lies in the range l � 2.576r.

You may recall from Chapter 5 that by usingTchebysheff’s theorem we could estimate theprobability with which observations fall within k standard deviations for any distribution. Youare encouraged to compare the results fromTchebysheff’s theorem and those cited above forthe normal distribution. Although values of anypercentage of interest can be determined fromstatistical tables of normal distributions, the 95%and 99% values are of particular importance inthe context of this book.

It is important to note here that the areasunder the curve of a continuous random variabledistribution can be thought of as probabilities.Assume that we know that age in a population ofstudy participants is normally distributed with amean of 40 and variance of 100 (standard devia-tion of 10). This normal distribution is displayedin Figure 6.3 with vertical lines marking 1, 2, and3 standard deviations from the mean.

It is then possible, using the results above andsimilar ones from statistical tables, to estimatethe probability that a participant randomlyselected from the population of study partici-pants would be aged 50 or 30. The answer is0.32 or 32% – that is, the proportion or

64 Chapter 6 • Probability, hypothesis testing, and estimation

f (x)

�100 0 100 200 x

0.00

0.02

0.01

0.04

0.03

0.07

0.06

0.05

0.09

0.08

0.11

0.10

0.13

0.12

Figure 6.2 Three normal density curves with the same mean (location) but different standard deviations (spreads)

Page 80: Introduction to Statistics in Pharmaceutical Clinical Trials

percentage of the area under the curve translatesdirectly in to the percentage of participants (orother observational units) whose age values falloutside of the two identified points.

6.5.1 The standard normal (Z) distribution

One unique and important normal distributionis the standard normal distribution, or Z distrib-ution, which has a mean of 0 and a variance of1. If a random variable X is distributed as stan-dard normal with mean 0 and variance 1, it iswritten as X ~ N(0,1). To use some of the generalresults from normal distributions providedearlier, we can make the following statements forthe standard normal distribution:

• The central 90% of the area lies between � 1.645

• The central 95% of the area lies between � 1.960

• The central 99% of the area lies between � 2.576.

The standard normal or Z distribution is usedextensively in Statistics and throughout this

book. For later reference, the standard normaldistribution is provided in Figure 6.4. Note thatthe area under the curve to the left of the value�1.96 is 0.025 (or 2.5%). As the distribution issymmetric, the area under the curve to the rightof the value 1.96 is also 0.025. Another way ofstating this is that, if we were to randomly selecta value from the distribution, there is a 95%chance that the value would be between �1.96and �1.96. One can also think of the values�1.96 and �1.96 as the 2.5th and 97.5th

percentiles, respectively. Values of the Z that define areas under the

standard normal curve in the left tail, the righttail, and the symmetric central region areprovided in Appendix 1.

6.5.2 Transforming a normal distributionto the standard normal distribution

One helpful method possible with a randomvariable that has a normal distribution withmean, l, and variance, r2, is to transform valuesof the random variable so that they have thescale of the standard normal distribution. This

Normal distribution 65

f (x)

0 10 20 30 40 50 60 70 80

x

0.00

0.02

0.01

0.04

0.03

0.07

0.06

0.05

0.09

0.08

0.11

0.10

0.13

0.12

Figure 6.3 Normal distribution with mean of 40 and standard deviation of 10. Note that the vertical lines representl�r, l�2r, and l�3r

Page 81: Introduction to Statistics in Pharmaceutical Clinical Trials

makes it possible to answer a number of proba-bility questions using statistical tables thatprovide the areas under the standard normalcurve. In general, for a random variable X ~N(l,r2), the random variable:

X � �Z � ––––––

r

is normally distributed with mean 0 and variance 1.

We can use the example from earlier in Section6.5 to illustrate this method. If age in a popula-tion of study participants is normally distributedwith a mean of 40 and variance of 100 (standarddeviation of 10), what is the probability that aparticipant randomly selected from the popula-tion of study participants would be aged 50 or 30?

First we are interested in the probability that arandomly selected participant will be 50 yearsof age. The transformed value for X � 50 is:

50 � 40Z � –––––––– � 1.

10

As a result of this transformation, P(X 50)corresponds to P(Z 1). The probability, P(Z 1), can be obtained from Appendix 1, thelook-up table for areas under the standard

normal distribution curve. As seen in Appendix1, the area under the standard normaldistribution curve for Z 1 is 0.159.

Then we would like to know what the proba-bility is that a randomly selected participant willbe 30 years of age. The transformed value forX � 30 is:

30 � 40Z � –––––––– � �1.

10

As above, P(X 30) is equal to P(Z �1). UsingAppendix 1 as a reference, the area under thestandard normal distribution curve for Z �1 is0.159.

The probability of interest is obtained bysumming the two probabilities associated withP(Z 1) and P(Z �1) because the two eventsare mutually exclusive. That is, a participantcannot be both 30 and 50, so the probabilityof interest is 0.159 � 0.159 or 0.318.

At first glance it may seem that this transfor-mation method is useful only in a few instances(when the random variable is known to have anormal distribution) and contrived ones at that.However, it is actually useful in many instances.Many random variables can be shown to haveapproximately normal distributions. The reasonfor this is given shortly. It turns out that, if a

66 Chapter 6 • Probability, hypothesis testing, and estimation

f (x)

�4 �3 �2 �1 0 1 2 3 4

x

X~N (0, 1)

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

Figure 6.4 The standard normal (Z ) distribution

Page 82: Introduction to Statistics in Pharmaceutical Clinical Trials

random variable has an approximate normaldistribution, for which the mean and varianceare known, a transformation results in a randomvariable that has an approximate standardnormal distribution.

6.6 Classical probability and relativefrequency probability

Before concluding the first part of this chapteron the fundamentals of probability it is impor-tant to point out that there are two ways to esti-mate a probability. To contrast these two types ofprobability we consider the question: “What isthe probability of observing a ‘head’ whentossing a coin?”

The first type of probability, termed by some“classical” probability, is based on an assumptionabout the state of the experiment and some basicmathematical expressions. For example, wewould begin answering this question byassuming that the coin was fair. Further, wewould note to ourselves that a fair coin has twosides, the only two outcomes of a coin toss are“heads” and “tails,” and only one of these twooutcomes is the one of interest. The probability ofobserving a head from a single toss of a fair coinis therefore 1⁄2 or 0.5. The most straightforwardway to solve classical probability problems is towrite out all of the unique possible outcomes, thesample space, and then identify the number oftimes that the outcome of interest would occur. Inthis case the sample space is “heads” or “tails.”The event of interest, observing heads, is repre-sented by just one of these events, so the proba-bility of interest is 1⁄2. The use of the binomialdistribution to calculate the probability distribu-tion of observing the number of assignments toactive is another example. In that case we knew(by design) that the probability of assignmentto active was exactly half.

Many Statistics students have sufferedimmensely over the years by having to solveclassical probability problems. Marilyn vosSavant (1997) stumped many readers with thefollowing classical probability problem:

A woman and a man (unrelated) each have twochildren. At least one of the woman’s children is

a boy, and the man’s older child is a boy. Do thechances that the woman has two boys equal thechances that the man has two boys?

vos Savant (1997, p 15)

What is your answer? We leave it to you toconduct an online search to investigate thecontroversy surrounding this problem. We donot dwell any further on this method of esti-mating probabilities because we also dislikethem, and the second type is more useful for usanyway.

The second type of probability, relativefrequency probability, is calculated by repeatingan experiment a large number of times (say n)and counting the number of times out of n thatthe outcome of interest (say m) occurred. Theprobability of the event is then calculated as:

mP(event) � –––.

n

The calculated probability is simply an estimateof the true probability (which remainsunknown).

Using a relative frequency approach to esti-mating the probability of observing a head wewould toss the coin a number of times (forexample, 10), count the number of times a headlanded face up (for example, 4 times), and thencalculate the probability as 4/10 or 0.4. It isperhaps not surprising that the estimated proba-bility here is not exactly 0.5. We were only onehead shy of 5/10, so the relatively small numberof coin tosses may have had an impact. You canimagine tossing the coin 100 times andobserving 46 heads for a probability of 0.46. Thatwould be much closer to the classical probabilitysolution. Another possible reason that only fourheads came up could be that the coin really wasnot fair at all. For the classical probability solu-tion to this problem we would need to assumethat the coin was fair or be told that it was. Therelative frequency solution has the advantage ofnot requiring the assumption of a fair coin, buthas the disadvantage of possibly being limited bythe number of experiments.

Our initial probability estimate of 0.4 from 10coin tosses does not seem to be that far offbecause we might reason that observing 4, 5, or6 heads would be expected from 10 tosses of a

Classical probability and relative frequency probability 67

Page 83: Introduction to Statistics in Pharmaceutical Clinical Trials

fair coin. You may be intuitively thinking that, ifwe were to repeat the 10 coin tosses, we wouldprobably count 4–6 heads again. Your intuitionwould be correct, and there is a statisticalconcept that explains how results from experi-ments vary from sample to sample. The magni-tude of expected differences from sample tosample enables us to estimate a quantity that wecan never really know, one that represents thetruth. In the case of the coin-tossing experiment,our goal would be to infer whether or not thetrue probability of observing a head was 0.5.

6.7 The law of large numbers

In clinical trials we do not know what the proba-bility of observing a particular serious adverseevent is, but we observe a large number ofoutcomes (for example, participants exposed to anew treatment) to estimate it. As the sample sizeincreases the estimate becomes more precise (thatis, closer to the truth). An illustration of the “lawof large numbers” is provided in Figure 6.5.Suppose that a relatively uncommon adverse

event is represented by the chance event of twothrown dice landing with a total of two (or “snakeeyes”). The classical probability solution to esti-mating the probability of this event is (1/6)2 �

0.02778. The relative frequency solution can beobtained by rolling two dice a large number oftimes (n), counting the number of times “snakeeyes” occurs (m), and estimating the probabilityas m/n. The most convenient means to conductthis experiment is using computer simulation.As seen in Figure 6.5, the estimates of the prob-ability (denoted by the oscillating curve) varyquite a bit from the truth (represented by thehorizontal reference line) until the sample size isaround 10 000. The implication of the law oflarge numbers for clinical trials of new drugs isthat the unknown quantities of interest are moreprecisely estimated with larger samples. Thesewould include the mean change in SBP (systolicblood pressure) or the proportion of partici-pants experiencing a serious adverse event. Giventhe limited size of most clinical developmentprograms, the most precise estimates of risks ofnew therapies become evident only once a newdrug has been marketed and used by manythousands of patients.

68 Chapter 6 • Probability, hypothesis testing, and estimation

f

0 10 000 20 000

n

0.0000.0020.0040.0060.0080.0100.0120.0140.0160.0180.0200.0220.0240.0260.0280.0300.0320.0340.0360.038

Figure 6.5 Illustration of the law of large numbers: Proportion of two dice coming up as “snake eyes” as a function ofsample size (n)

Page 84: Introduction to Statistics in Pharmaceutical Clinical Trials

6.8 Sample statistics and populationparameters

The unknown quantities of interest described inthe previous section are examples of parameters.A parameter is a numerical property of a popula-tion. One may be interested in measures ofcentral tendency or dispersion in populations.Two parameters of interest for our purposes arethe mean and standard deviation. The popula-tion mean and standard deviation are repre-sented by l and r, respectively. The populationmean, l, could represent the average treatmenteffect in the population of individuals with aparticular condition. The standard deviation, r,could represent the typical variability of treat-ment responses about the population mean. Thecorresponding properties of a sample, the samplemean and the sample standard deviation, aretypically represented by x� and s, which wereintroduced in Chapter 5. Recall that the term“parameter” was encountered in Section 6.5when describing the two quantities that definethe normal distribution. In statistical applica-tions, the values of the parameters of the normaldistribution cannot be known, but are estimatedby sample statistics. In this sense, the use ofthe word “parameter” is consistent between theearlier context and the present one. We haveadhered to convention by using the term“parameter” in these two slightly differentcontexts.

An expression that defines how individualobservations are used to derive a numerical esti-mate is called an estimator (much like a formulais used to calculate a number). The sample mean,

n

R xi

i � 1x � –––—–,n

is considered an estimator for the populationmean, l. When individual observations areapplied to the estimator, the result is a numericvalue or estimate. When a single value is calcu-lated, it represents a best guess of sorts, and iscalled a point estimate. No single estimate couldbe expected to be perfect so “interval estimates”are commonly used to reflect more accurately arange of plausible values.

Inferential statistics comprises two distinct,although closely related, procedures. In eachcase observations from a sample are used to:

• calculate an interval estimate that includesthe unknown population parameter withsome degree of confidence; in clinical trials, itis common practice to use a 95% confidenceinterval

• test whether or not a sample statistic is consis-tent with or contrary to a hypothesized valueof the population parameter.

Inferences about a population are made on thebasis of a sample taken from that population.The process of inferential statistics requires:

• identification of a representative sample ofparticipants from a population of interest

• collection of individual observations• calculation of sample statistics from the indi-

vidual observations• a statistical method to relate the sample

statistic to the parameter of interest; this canbe done in one of two ways:

– estimation of plausible values of theparameter

– testing a hypothesis of a proposed valueof the parameter.

We discuss the former method, confidenceintervals, first, after a necessary introduction tothe concept of sampling variation. The lattermethod (hypothesis testing) is discussed later.First, however, it is useful to introduce a fewother ideas.

6.9 Sampling variation

If we take a sample of 100 numbers from a popu-lation of 100 000 numbers, that sample’s mean,which is precisely known, will provide an esti-mate of the population mean. The same is truefor the standard deviation, that is:

• x� is an estimate of l• s is an estimate of r.

If we replaced the first sample of 100 numbersand then took another sample of 100 numbers,it is likely (effectively guaranteed) that the

Sampling variation 69

Page 85: Introduction to Statistics in Pharmaceutical Clinical Trials

numbers would not be identical to those in thefirst sample, and that the calculated samplemean would be different from the first one. Thislogic applies to any number of means taken.Suppose that we were to repeat this process anumber of times (in a simulated manner usingcomputer software) and, at the end of each repli-cation, tabulate the values of the sample mean orplot their relative frequencies. The shape of theresulting distribution of values would be recog-nizable. We would notice that a typical valuewould be apparent (the population mean l), aswould a symmetrical bell-shaped distribution. Inshort, the sample statistic from a sample of size n(in this case the sample mean) varies fromsample to sample and its distribution has a meanand a standard deviation. Such a distribution iscalled a sampling distribution. An importantgeneral result for the sampling distribution ofthe sample mean is as follows:

• For any continuous random variable X whichhas a distribution with population mean, �,and variance, r2, the sampling distribution ofthe mean for samples of size n has a distribu-tion with population mean, l, and variance,r2/n.

The square root of the variance,

___r2 r�––– � ––– ,n �

__n

is the population standard error of the mean,which describes the typical variability of samplemeans around the population mean. If we know,or can assume, that the random variable X has anormal distribution with population mean, l,and variance, r2, the sampling distribution ofthe mean of samples of size n will also have anormal distribution with population mean, l,and variance, r2/n. Using the notation describedearlier, this result can be summarized in thismanner:

r2If X ~ N(l,r2) then Xn ~ N(l,–––).n

6.10 Estimation: General considerations

It is not possible to know whether any singlesample estimate, like the sample mean, is a goodestimate of the population parameter that it isintended to estimate. However, it is possible touse the fact that most estimates of the samplestatistic (for example, sample mean) are not toofar removed from the population parameter, asspecified by the shape of the sampling distribu-tion, to define a range of values of the popula-tion parameter (for example, population mean)that are best supported by the sample data.

As exact knowledge of the population para-meter is not possible, we must settle for a rangeof values that, with some specified probability orconfidence, are most plausible. In other words,we would like to know the lower limit (LL) andupper limit (UL) of the most probable range ofvalues of the true population parameter. In thecase of the population mean, we seek two values,LL and UL, such that:

P(LL � UL) � 1 � a.

The quantity a is the probability that the intervalestimate does not include the value of the para-meter of interest – that is, l in this case. In mostcases small values of a are desirable (for example,0.10 or 0.05). Depending on the importance ofthe decision to be made on the basis of theinterval estimate defined by LL and UL, verysmall values of a may be desirable (for example,0.01 or 0.001).

When conducting a clinical trial, we do notknow if our sample was representative of thepopulation or not. We have only data from asample and the statistics calculated from thesample data. Yet, our ultimate interest is not inthe sample but in the population. In this chapterwe consider the sample statistics for the meanand the standard deviation, x and s. A clinicaltrial represents a situation in which we can takeonly one sample from a population. Given that,what degree of certainty can we have that the

70 Chapter 6 • Probability, hypothesis testing, and estimation

Page 86: Introduction to Statistics in Pharmaceutical Clinical Trials

mean of that sample represents the mean of thepopulation? Before we answer this question fullywe define a confidence interval for the samplemean in a special case. This special case will serveas our starting point for more realistic andcommon cases.

6.10.1 Confidence interval for thepopulation mean when the populationvariance is known

Assume that the random variable X has a normaldistribution with an unknown population mean,l, and with a known population variance, r2. Fora sample size of n, the sampling distribution ofthe sample mean has a normal distribution withpopulation mean, l, and variance, r2/n. Theimplication of this result is that, for example:

• 90% of the sample mean values lie between

l � 1.645 r–

�–n

• 95% of the sample mean values lie between

l � 1.960 r–

�–n

• 99% of the sample mean values lie between

l � 2.576 r–

�–n.

In general, the following statement is true: Forsamples of size n, (1 � a)% of sample means x liein the range:

rl � z1�a/2

–––�__n

where z1�a/2 is the value from the standardnormal distribution that defines the upper andlower tail areas of size a/2. Note that z1�a/2 is aparticular example of a reliability factor. As the Zdistribution is symmetric it is also true that the Zvalue on the negative side that cuts off an area ofsize a/2 in the lower tail is equal to the Z valueon the positive side (change in sign) that cuts offan area of size a/2 in the upper tail. Equivalently,in mathematical terms, this means:

|za/2| � z1�a/2.

We can therefore express a two-sided (1 � a)%confidence interval for the population mean as:

r rP(x � z1�a/2 ––– l x � z1�a/2 –––) � 1 � a.

�__n �

__n

This expression for a two-sided (1 � a)% confi-dence interval can be shortened in the followingmanner:

x � z1�a/2 (r/�__n ).

The assumption of a normal distribution for therandom variable X is somewhat restrictive.However, for any random variable, as the samplesize increases, the sampling distribution of thesample mean becomes approximately normallydistributed according to a mathematical resultcalled the central limit theorem. For a randomvariable X that has a population mean, l, andvariance, r2, the sampling distribution of themean of samples of size n (where n is large, thatis, 200) will have an approximately normaldistribution with population mean, l, and vari-ance, r2/n. Using the notation described earlier,this result can be summarized as:

r2

Xn → N(l, –––)n

when n is large. This is an important result,because it holds no matter the shape of the orig-inal distribution of the random variable, X. Thereader is encouraged to search for online refer-ences that illustrate, through animation, thisimportant theorem.

Therefore, the expression written above for theconfidence interval for the population mean alsoapplies to any continuous random variable aslong as the sample size is large (as just noted, ofthe order of 200 or more). The other ratherrestrictive assumption required for this confi-dence interval is that the population variance beknown. Such a scenario is neither common norrealistic.

We now apply the fundamental concept of theconfidence interval as developed here to the case

Estimation: General considerations 71

Page 87: Introduction to Statistics in Pharmaceutical Clinical Trials

where the population variance is not known, butthere is interest in defining a confidence intervalfor the population mean.

6.10.2 Confidence interval for thepopulation mean when the populationvariance is unknown

A reasonable suggestion for devising a confi-dence interval for the population mean would beto substitute the sample estimate, s, for the corre-sponding population parameter, a and proceedas described earlier in Section 6.10. However,when the sample size is small (particularly 30)the use of the Z distribution is less appropriate.William S Gossett, writing anonymously as“Student” while employed at Guinness Brewery,proposed the following statistic as an alternative.When X is a normally distributed variable andthe sample size is small, the statistic

x � �t � ––––––

s/�__n

follows a t distribution (“Student’s t”). The singleparameter defining its shape is (n � 1) degrees offreedom (df), the sufficient number of observa-tions needed to estimate the sample mean. The

t distribution is symmetric about its mean (zero)and looks like a normal distribution with, incases of sample sizes 200, heavier “tails.”

Three density functions are plotted for t distri-butions with 5, 30, and 200 df in Figure 6.6. Thegreater the number of df, the “flatter” the tails.In the figure, the two curves that are closesttogether are associated with 30 and 200 df. It isinteresting to note (and a convenient fact) thatthe area under the density curve between anytwo points for the case with 30 df is not appre-ciably different from the case with 200 df.

As was the case with the normal distribution,the shape of the t distribution can be used tofind two values that define a central area underthe density curve of size (1 � a). It can be shownthat, once a value of t associated with an area ofinterest is determined, the difference betweenthe sample mean x– is within t(s/�

__n ) of the

population mean, l. This enables us to calculatea confidence interval for the population mean when the sample size is small and thepopulation variance unknown.

The interval estimate of the population mean,the two-sided (1 � a)% confidence interval forthe population mean, is:

x � t1�a/2,n�1(s/�__n ).

72 Chapter 6 • Probability, hypothesis testing, and estimation

f (x)

�5 �4 �3 �2 �1 0 1 2 3 4 5

x

t (5 df), t (30 df), t (200 df),

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

Figure 6.6 The t distribution with 5, 30, and 200 degrees of freedom

Page 88: Introduction to Statistics in Pharmaceutical Clinical Trials

As in Section 6.10, the confidence interval hasthree components:

1. point estimate2. standard error3. reliability factor.

The point estimate in this case is the samplemean, which represents the best estimate of thepopulation mean.

The second component is the standard error ofthe mean, which quantifies the extent to whichthe process of sampling has mis-estimated thepopulation mean. The standard error of themean has the same meaning as in the case fornormally distributed data – that is, the standarderror describes the degree of uncertainty presentin our assessment of the population mean on thebasis of the sample mean. It is also the standarddeviation of the sampling distribution of themean for samples of size n. The smaller the stan-dard error, the greater the certainty with whichthe sample mean estimates the populationmean. When n is very large the standard error isvery small, and therefore the sample mean is avery precise estimate of the population mean. Aswe know the standard deviation of the sample, s,we can make use of the following formula todetermine the standard error of the mean, SE:

sSE � –––.�__n

At this point, it is worth emphasizing thedifference between the terms “standard error”and “standard deviation,” which, despite thesame initial word, represent very differentaspects of a data set. Standard error is a measureof how certain we are that the sample meanrepresents the population mean. Standard devia-tion is a measure of the dispersion of the originalrandom variable. There is a standard error asso-ciated with any statistical estimator, including asample proportion, the difference in two means,the difference in two proportions, and the ratioof two proportions. When presented with theterm “standard error” in these applications theconcept is the same. The standard error quanti-fies the extent to which an estimator varies oversamples of the same size. As the sample sizeincreases (for the same standard deviation) there

is greater precision in the estimate of the popu-lation mean because the standard error becomessmaller as a result of the division of the squareroot of the sample size.

The third component of the interval estimateis a reliability factor, which represents thenumber of standard deviations required toenclose (1 � a)% of the sample means from thesampling distribution. It is used to quantify howclose we would like our estimate to be to the realpopulation mean or, in short, how reliable it is.The particular value of the reliability factorchosen above, t1�a/2,n�1, is the value of t with(n � 1) df that “cuts off” an area of a/2 in theupper tail. As the t distribution is symmetric it isalso true that the t value on the negative sidethat cuts off an area of size a/2 in the lowertail is equal to the same t value but with achange in sign that cuts off a/2 in the upper tail.Equivalently, in mathematical terms, this meansthat |ta/2,n�1| � t1�a/2,n�1. Values of t1�a/2,n�1 areprovided in Appendix 2 for various values of a.

For a sample size of 100 (99 df) the reliabilityfactors for two-sided 90%, 95%, and 99% confi-dence intervals are 1.66, 1.98, and 2.63. Theimplication of these three values is that, all otherthings being equal (that is, x–, s, and n), requiringgreater confidence in the interval estimateresults in wider interval estimates. The moreconfidence that is required, the less reliable is thesingle sample estimate, and therefore greaternumerical uncertainty is expressed in theinterval estimate. This very important point isillustrated in the following example.

The following values of age (n � 100) wereexamined using a stem-and-leaf display inChapter 5:

53, 69, 72, 48, 60, 61, 49, 71, 43, 31, 62, 51,58, 61, 70, 66, 78, 39, 75, 63, 59, 53, 49, 61,50, 88, 51, 80, 68, 75, 78, 81, 57, 70, 68, 66,43, 60, 57, 35, 75, 61, 71, 45, 50, 82, 52, 65,61, 77, 80, 58, 50, 59, 55, 59, 50, 39, 78, 72,71, 79, 48, 55, 52, 55, 62, 59, 68, 63, 81, 69,67, 67, 58, 57, 70, 73, 49, 43, 76, 73, 71, 77,61, 62, 72, 73, 67, 62, 64, 40, 66, 74, 77, 67,49, 83, 73, 59.

Assuming that these observations represent asimple random sample from the population of

Estimation: General considerations 73

Page 89: Introduction to Statistics in Pharmaceutical Clinical Trials

interest and that the age values in the popula-tion are normally distributed, the task is tocalculate the two-sided 90%, 95%, and 99%confidence intervals for the population meanage.

The sample mean age is 62.6 and the standarddeviation 12.01. The standard error of the meanis calculated as:

SE � 12.01/�____100�1.20.

With these numbers calculated, all that is left tocompute the three confidence intervals are thereliability factors associated with each. For the90% confidence interval, the value of the relia-bility factor will be the value of t that cuts off theupper 5% of the area (half the size of a) underthe t distribution with 99 df. This value is 1.66and can be verified from a table of values or fromstatistical software. Note that the t value of�1.66 is the value of t that cuts off the lower 5% of the area (half of the size of a) under the t distribution with 99 df. The reliability factorslisted previously for the two-sided 95% and 99%confidence intervals can also be used to computethe following interval estimates:

• 90% CI � 62.6 � (1.20)(1.66) � (60.6, 64.6)• 95% CI � 62.6 � (1.20)(1.98) � (60.2, 65.0)• 99% CI � 62.6 � (1.20)(2.63) � (59.4, 65.8).

Note that the two values that comprise the lowerand upper limits of the confidence interval aretypically placed in parentheses. The width of theconfidence intervals (the difference between theupper and lower limits) increases because greaterconfidence (corresponding to smaller values ofa) is required.

A statistical interpretation of these results is tosay that we are 90% confident that the mean ageof the population from which this sample wasselected is enclosed in the interval 60.6–64.6years. If greater confidence is required, we cansay that with 99% confidence the mean age ofthe population is enclosed in the interval59.4–65.8 years. Another interpretation of theseconfidence intervals is that they represent themost plausible values of the population mean. Itis important to note that the lower and upperlimits of the confidence interval are randomvariables. The population mean is considered to

be an unknown fixed quantity for which theconfidence interval serves as an estimate.

To summarize, the computational aspects ofconfidence intervals involve a point estimate ofthe population parameter, some error attributedto sampling, and the amount of confidence (orreliability) required for interpretation. We haveillustrated the general framework of the compu-tation of confidence intervals using the case ofthe population mean. It is important to empha-size that interval estimates for other parametersof interest will require different reliability factorsbecause these depend on the sampling distri-bution of the estimator itself and differentcalculations of standard errors. The calculatedconfidence interval has a statistical interpretationbased on a probability statement.

Another useful interpretation of confidenceintervals is that the values that are enclosedwithin the confidence interval are those thatare considered the most plausible values of theunknown population parameter. Values outsidethe interval are considered less plausible. Allother things being equal, the need for greaterconfidence in the estimate results in widerconfidence intervals, and confidence intervalsbecome narrower (that is, more precise) as thesample size increases. This last fact is exploredin greater detail in Chapter 12 because it isdirectly relevant to the estimation of therequired sample size for a clinical trial. Themethods to use for the calculation of confi-dence intervals for other population parame-ters of interest are provided in subsequentchapters.

6.11 Hypothesis testing: Generalconsiderations

As this book focuses on clinical trials our primaryinterest is in providing you with relevant exam-ples of hypothesis testing in that arena.However, it is useful initially to lay some concep-tual foundations with simpler examples. As formany other examples in statistics and proba-bility, we illustrate these concepts first with flipsof a coin.

74 Chapter 6 • Probability, hypothesis testing, and estimation

Page 90: Introduction to Statistics in Pharmaceutical Clinical Trials

Imagine the following scenario. You areholding a half-dollar coin. Our question to youis: Do you think this coin is fair or not? Youexamine it and hold it and with no other infor-mation you decide that you really cannot tellwithout more information. You propose to flipthe coin twice. Flipping it twice, you get thefollowing results: Heads (H) and heads (H). If weforced you to answer our question at this time,you may guess, based on these two observations,that the coin is not fair. After all, if the coin werereally fair you would “expect” one head (H) andone tail (T). However, you are not at all confi-dent with your answer because you note that theprobability of observing two heads is not thatsmall. It is (0.5)(0.5) � 0.25. This means that anoutcome like this results 25% of the time thatyou conduct such an experiment. Accordingly,you wisely recognize that it would be better tohave additional data before making your guess,because with just two heads observed out of twoflips there is a non-trivial chance that you haveguessed incorrectly.

Suppose you then revise the experiment andrequest that the results from 10 flips of the coinbe recorded. You reason that, if the coin werefair, you would expect five heads and five tails. Ifyou were to observe that only one or as many asnine heads came up out of ten tosses, you wouldconclude that the coin was not fair. Your logic isthat, by chance alone, a fair coin would not verylikely yield such a lopsided result. If you were toobserve an event with even more extreme result,that is, 0 or 10 heads out of 10 tosses, you wouldalso have concluded, perhaps with even moreconfidence, that the coin was not fair.

The rule that you intuitively arrived at wasthat if you observed as few as 0 or 1 or as manyas 9 or 10 heads out of 10 coin flips, you wouldconclude that the coin was not fair. How likely isit that such a result would happen? In otherwords, suppose you repeated this experiment anumber of times with a truly fair coin. Whatproportion of experiments conducted in thesame manner would result in an erroneousconclusion on your part because you followedthe evidence in this way? This is the point wherethe rules of probability come into play. You canfind the probability of making the wrongconclusion (calling the fair coin biased) by

following such a decision rule using the bino-mial distribution.

Using the binomial distribution, the proba-bility of observing 9 heads out of 10 whenthe probability of observing a head with eachtrial is 1⁄2 is 0.00977. Likewise, the probability ofobserving 10 heads out of 10 is 0.00098. So theprobability of observing either 9 or 10 heads isthe sum of these two (we sum them becausethese are mutually exclusive outcomes). Thatprobability is 0.01075 (around 1%). We note thatthe probability of observing 0 or 1 heads is thesame as for observing 9 or 10. Therefore theprobability of observing a result as extreme as 1or fewer or 9 or more heads is around 0.02. Ifafter 10 coin flips, we have observed 1 or fewerheads or 9 or more heads, we would concludethat the coin was biased because a fair coinwould yield such a result only with probabilityaround 2% (not very often). Put another way, ifwe conducted this experiment many times andused such a rule when we have observed such aresult, we would be incorrect in 2% of the exper-iments. That seems like an acceptable risk totake. Besides, in this scenario, there seems to beno adverse consequence to being wrong exceptfor a bit of damaged pride.

This rather simple example is an illustration ofthe conceptual components of hypothesistesting. The basis of hypothesis testing is “proofby contradiction.” We use the word “proof”rather liberally here because the scientific stan-dard for establishing proof is more rigorous thana single trial or set of trials could possiblyprovide. Hypothesis testing is a statisticalmethod in which we use data (evidence) tochoose between two decisions, each with theirown course of action and related implications.The real world implication of making either deci-sion depends on the field of study. In the worldof new drug development, these decisions couldbe to decide that a drug is not efficacious at anydose studied, and is therefore not worthstudying further. Another decision could be toselect one particular dose (among many studied)for further development in confirmatory trials.

The process of testing a hypothesis usuallybegins with the statement of the hypothesisthat we would like to conclude as a result ofthe research (we refer to this as the alternate

Hypothesis testing: General considerations 75

Page 91: Introduction to Statistics in Pharmaceutical Clinical Trials

hypothesis). There is another hypothesis that weneed to define and it is referred to as the nullhypothesis. The null hypothesis can be viewedas a “straw man” hypothesis, one that we wouldlike to knock over by collecting evidence thatcontradicts it in favor of the alternate hypoth-esis. One important consideration in the state-ment of the two hypotheses is that they shouldrepresent all possible outcomes. In the context of our coin-tossing illustration, the alternatehypothesis would be that the coin is biased andthe null hypothesis would be that the coin is fair.In that experiment, we counted the number ofheads and were looking for evidence that wouldcontradict the null hypothesis and compel us toconclude that the alternate hypothesis was true.Evidence that contradicted the null hypothesiswould be a very high or very low proportion ofheads because a fair coin would yield approxi-mately the same number of heads as tails. Impor-tantly, these two hypotheses cover the only twopossible outcomes: The coin is either fair orbiased.

The next part of the hypothesis-testing processis to decide on a numerical result (a test statistic)that, if observed, would sufficiently contradictthe null hypothesis such that the null hypoth-esis would be rejected in favor of the alternatehypothesis. As we discovered with our coin-tossing example, some results would not be allthat rare by chance alone. Therefore, our deci-sion rule should be defined such that erroneousconclusions are not made more often than weare willing to tolerate.

You will recall that we might have chosenother results before we concluded that the coinwas biased, but we chose results that wouldrarely be expected by chance alone. In fact, thedecision rule is based on our chosen probabilityof rejecting the null hypothesis when it is reallytrue. For the coin example, this is the probabilityof claiming that the coin is biased when it isreally fair. When asked to take part in this exper-iment the fairness of the coin remains unknownto us, but we choose a decision rule that isconsistent with results that would not beexpected by chance very often.

6.11.1 Type I errors and type II errors

Rejecting the null hypothesis when it is true iscalled a type I error. The probability of making atype I error is called alpha (a). There is anotherkind of error that we might commit by usingdata from our sample (in this case, 10 cointosses) to make an inference about the state ofnature. This second kind of error is called a type IIerror and results from failing to reject the nullhypothesis (suppose we observed seven heads)when, in fact, the alternate hypothesis is true(that is, the coin was biased). We would then actas if the coin were fair – perhaps taking part in anew challenge that involved wagering a lot ofmoney.

When making decisions of any type, whetherthey are as inconsequential as our coin-tossingexperiment or as important and costly as devel-oping a new drug, we would like to minimize thechances that we make the wrong decision. Inplanning a new study or experiment, such as aclinical trial, it is worthwhile to consider mini-mizing the probability of committing each ofthese errors. The two types of errors arepresented in Table 6.3. In clinical trials a type Ierror is committed when we claim that the newantihypertensive is superior to placebo but itreally is similar. A type II error is committedwhen we fail to claim the new antihypertensiveis superior to placebo but it really is. In reality wecannot know the truth, but the study design,including the sample size and the statisticalanalyses used to evaluate the trial, will enable usto limit the probability of committing each ofthese errors.

6.11.2 Probability of type I and II errors

An important aspect of study design is definingthe probabilities of committing each of thesetwo kinds of errors. A type I error could meanthat a new drug is approved for marketing butreally does not provide a benefit. Ideally, theprobability of committing a type I error of thistype would be fairly small. Committing a

76 Chapter 6 • Probability, hypothesis testing, and estimation

Page 92: Introduction to Statistics in Pharmaceutical Clinical Trials

type II error in a superiority trial of a new anti-hypertensive is not appealing for a studysponsor because it could lead to discontinua-tion of a development program for a new treat-ment that is actually efficacious. Therefore, it isdesirable to limit the probability of committinga type II error as well. We have more to sayabout these two probabilities in subsequentchapters, but for now it is sufficient to identifythem formally.

The probability of committing a type I error isthe probability of rejecting the null hypothesiswhen it is true (for example, claiming that thenew treatment is superior to placebo when theyare equivalent in terms of the outcome). Theprobability of committing a type I error is calleda, which is sometimes referred to as the size ofthe test. The probability of committing a type IIerror is the probability of failing to reject the nullhypothesis when it is false. This probability isalso called beta (b). The quantity (1 � b) is referredto as the power of the statistical test. It is theprobability of rejecting the null hypothesis (infavor of the alternate) when the alternate is true.As stated earlier it is desirable to have low errorprobabilities associated with a test. As we wouldlike a and b to be as low as possible the quanti-

ties (1 � a) and (1 � b) are typically fairly large.These probabilities are provided in Table 6.4.

6.11.3 Hypothesis testing and researchquestions

Statistical hypothesis testing represents a meansto formulate and answer the research question ina quantitative manner. The null hypothesis isthe hypothesis that is tested. If quantitative dataare produced that are not consistent with thenull hypothesis, it is rejected.

Beforeproceedingwith this statistical approach,a researchquestionmustbeposed,whichwill thenprompt the design of a study that will lead to thecollection of data and an appropriate statisticalanalysis. A simple research question from a drugdevelopment program, as stated in Chapter 3, is“Does the investigational drug lower blood pres-sure?” A way to answer this research question is todesign a study to estimate the mean change frombaseline in SBP. If the mean change from baselineis negative, the answer to the research questionwould be that the investigational drug does lowerblood pressure. This example will be used toillustrate the concept of hypothesis testing.

Hypothesis testing: General considerations 77

Table 6.3 Two possible errors in hypothesis testing

Truth about null hypothesis

Decision based on test statistic True False Fail to reject null hypothesis Correct Type II errorReject null hypothesis Type I error Correct

Table 6.4 Probabilities of outcomes (conditional on the null hypothesis) in hypothesis testing

Truth about null hypothesis

Decision based on test statistic True False Fail to reject null hypothesis 1 � a b

Reject null hypothesis a 1 � b (i.e., power)

Page 93: Introduction to Statistics in Pharmaceutical Clinical Trials

6.12 Hypothesis test of a singlepopulation mean

Suppose our interest is in testing whether thepopulation mean was equal to a particularhypothesized value, l0. A hypothesis testingprocess typically starts with a statement of thenull and alternate hypotheses. The null hypoth-esis can be stated in the following manner:

H0: � � l0.

If data are found to contradict the null hypoth-esis, it will be rejected in favor of the alternatehypothesis:

HA: � � l0.

The alternate hypothesis is two sided in thesense that values clearly less than l0 would beconsistent with it as would values that wereclearly greater than l0. Rejection of the nullhypothesis because l0 l (l is much greaterthan the hypothesized value l0) may lead to onedecision (for example, continue with the devel-opment of the new drug with a larger study)whereas rejection of the null hypothesis becausel0 l (l is much less than the hypothe-sized value l0) may lead to a completely differentdecision (for example, to stop development ofthe new drug because it has no effect on SBP oractually increases SBP). What is important isthat, a priori, either outcome is possible.

The next step of the hypothesis testing processis to identify a numeric criterion by which theplausibility of the null hypothesis is tested. Thisnumeric criterion is called the test statistic, andwe use it to decide if the value that resulted fromthe study contradicts the null hypothesis or not.The test statistic to be used in this case is:

x � �0t � –––––– .s/�

__n

If the null hypothesis is true – that is, the popu-lation mean is the hypothesized value, l0 – thevalue of the test statistic will be close to 0. Thefurther the test statistic value is from 0 (eithernegative or positive) the less plausible is thehypothesized value, l0 – that is, the null hypoth-esis should be rejected in favor of the alternate.

The next step of hypothesis testing is to deter-mine those values of the test statistic that wouldlead to rejection of the null hypothesis, that is,to determine the critical region.

Assuming that the random variable isnormally distributed (or approximately so if thesample size is 30) and if the null hypothesis istrue, the test statistic just defined has a t distrib-ution with (n �1) df. Referring to Figure 6.6 youwill see that most values of a random variablethat follow a t distribution fall in the range �1 to�1. A value in this range would be expected justby chance alone. However, values �2 or �2occur much less frequently, that is, there is lessarea to the left of �2 and to the right of �2. Wewould like to define a critical region that is asso-ciated with small tail areas because values in thetail do not occur frequently, whereas values inthe center of the distribution are very common.In other words, we would like to define the testso that we do not reject the null hypothesis veryoften when in fact it is true – that is, we wouldlike to define a critical region so that the proba-bility of committing a type I error, a, is small.

In most scientific endeavors the choice of a is0.05. We are willing to accept a 1 in 20 chancethat, at the end of the study, it is concluded thatthe population mean is not the hypothesizedvalue when in fact it really is. It is important toremember that the choice of a is part of thestudy design, and not a result of a study. Also, itis important to note here that there is nothingspecial about the value of 0.05. Depending onthe stage of development or the severity orimportance of the disease for which we wish todevelop the drug, we may choose a value of athat is higher or lower than 0.05. What is impor-tant in the choice of a are the implications (forsponsors, regulatory bodies, clinicians, andpatients) of committing a type I error. Havingalerted you to the possibility of choosing othervalues for a, and the fact that this choice hasvarious implications, we adopt the conventionalvalue of a of 0.05 in subsequent discussions.

Knowledge of the distribution of the teststatistic enables us to define a critical region thatwould erroneously lead to rejection with proba-bility of 0.05. In the case of the current test, thecritical region will be any value of the teststatistic such that:

78 Chapter 6 • Probability, hypothesis testing, and estimation

Page 94: Introduction to Statistics in Pharmaceutical Clinical Trials

t ta/2,n�1 or t t1�a/2,n�1.

Similarly to the case of the standard normaldistribution, the critical values can be obtainedfrom a series of tabulated values or from statis-tical software. A number of percentiles of varioust distributions are provided in Appendix 2. It isimportant to note that there is not just one t distribution; there are many of them, and theirshapes are determined by the number of degreesof freedom. As either low or high values of thetest statistic could lead to rejection, the hypoth-esis test is considered a two-sided test. The prob-ability of committing a type I error is 0.05, but,because the critical region is evenly splitbetween low values and high values, the proba-bility of committing a type I error in favor of onedirection (for example, large values of t) is a/2.

Once the critical region of the test has beendefined, the next step of hypothesis testing is tocalculate the value of the test statistic from thesample data. The test statistic is calculated as:

x � �0t � ––––––s/�

__n

where x is the sample mean, l0 the hypothesizedvalue of the population mean, s the samplestandard deviation, and n the sample size.

If the value of the test statistic is in the criticalregion the null hypothesis is rejected and theconclusion is made that the population mean isnot equal to l0. When the null hypothesis isrejected, such a result is considered “statisticallysignificant” at the a level, meaning that theresult was unlikely (with probability no greaterthan a) to have been observed by chance alone.If the value of the test statistic is not in the crit-ical region we fail to reject the null hypothesis. Itis important to emphasize the fact that wecannot claim that the population mean is equalto l0, but simply that the data were not sufficientto conclude that they were different.

The use of this method, the one-sample t test,is appropriate when:

• the observations represent a simple randomsample from the population of interest

• the random variable is continuous • the random variable is normally distributed

or approximately normally distributed

(mound shaped) with a sample size of atleast 30.

This hypothesis test is illustrated with thefollowing simple example.

Imagine that, having identified a promisingnew investigational antihypertensive drug, apharmaceutical company would like to admin-ister it to a group of 10 hypertensive individualsto see if the drug has the desired effect. Forsimplicity we assume that there is no controlgroup. The first study of the new antihyperten-sive will be a single-dose, nonrandomized,uncontrolled trial in 10 participants. SBP wasrecorded at the start of the study before initia-tion of treatment (baseline) and at the end of 4weeks (end of study). The research question ofinterest is: Does the new drug lower SBP? Thescientists designing the trial would like to main-tain a type I error of 0.05, that is, a � 0.05. As thetest conducted is two sided, the probability ofmaking a type I error in favor of the drug havinga beneficial effect (one side of the critical region)is 0.025.

The null hypothesis is:

H0: � � 0.

And the alternate hypothesis is:

HA: � � 0.

The one-sample t test will be used to test thenull hypothesis. As there are 10 observations andassuming the change scores (the random vari-able of interest) are normally distributed, the teststatistic will follow a t distribution with 9 df. Atable of critical values for the t distribution(Appendix 2) will inform us that the two-sidedcritical region is defined as t �2.26 and t

2.26 – that is, under the null hypothesis, theprobability of observing a t value �2.26 is0.025 and the probability of observing a t value 2.26 is 0.025.

Baseline and end-of-study values of SBP arepresented for the 10 participants in Table 6.5,along with their respective change scores.

The mean change score is �7 and the standarddeviation is 7.1. (We leave it to you to verifythis.) The test statistic is therefore calculated as:

� 7 � 0t � –––––––––– � �3.10.

7.1/�_____10

Hypothesis test of a single population mean 79

Page 95: Introduction to Statistics in Pharmaceutical Clinical Trials

As this calculated test statistic is in the criticalregion (t � �3.10 �2.26) the null hypothesis isrejected. The result is considered statisticallysignificant at the a � 0.05 level because therewas less than a 5% chance of such a result beingobserved by chance alone. The conclusion fromthe study is that the new drug did lower SBP bya mean of 7 mmHg. Scientists from the sponsorcompany may use this information as sufficientpreliminary evidence to continue with thedevelopment of the new drug.

6.13 The p value

One shortcoming of the hypothesis testingapproach is the arbitrary choice of a value for a.Depending upon our risk tolerance for commit-ting a type I error, the conventional value of 0.05may not be acceptable. Another way to conveythe “extremeness” of the resulting test statistic isto report a p value.

A p value is the probability that the resultobtained or one more extreme (in favor of thealternate) would be observed by chance alone.We know from the definition of the criticalregion that a value of the test statistic t �2.26or t 2.26 would have occurred with probability� 0.05 by chance alone. In fact, the test statisticvalue was �3.10 which lies to the left of �2.26.A value of �3.10 led to rejection, as would values �3.10 or 3.10. The p value in this case is the

area under the t-distribution density curve with9 df associated with values of t �3.10 or 3.10and is equal to 0.01. This means that there isonly a 1% chance of observing a value of the teststatistic as large as 3.10 (in absolute magnitude)or larger by chance alone. The differencebetween a (a design parameter) and the p value(a study result) can be seen in Figure 6.7, wherethe areas to the left and right of the dashed linesrepresent a and the areas to the left and right ofthe solid line represent the p value.

The p values can be estimated from a table ofvalues from the appropriate t distribution (forexample, by finding the tail areas associated witha particular value of the test statistic). Morecommonly, however, statistical software is usedfor all statistical analyses and p values areincluded in the results. The following is a helpfulway to interpret p values:

• Hypothesis tests are rejected if the calculatedp value � a.

• Hypothesis tests are not rejected if thecalculated p value a.

It is not uncommon for results of hypothesistests to be represented simply by the p value.However, it is not a wise practice to rely solely onthem. Recall that increasing the sample sizereduces the standard error, which increases thesize of the test statistic and therefore reduces thep value. This serves as a reminder that it is notjust the statistical significance of the result (thatis, the p value) that counts. The clinical rele-vance of the size of the effect (for example, the

80 Chapter 6 • Probability, hypothesis testing, and estimation

Table 6.5 Systolic blood pressure (SBP) values and change scores

Study participant Baseline SBP End-of-study SBP Change in SBP (mmHg) (mmHg) (mmHg)

1 143 147 42 152 144 �83 162 159 �34 158 157 �15 147 131 �166 149 133 �167 150 145 �58 148 144 �49 154 150 �4

10 149 132 �17

Page 96: Introduction to Statistics in Pharmaceutical Clinical Trials

confidence interval for the parameter of interest)is probably more important than the p value, aswe argue in the following section.

6.14 Relationship between confidenceintervals and hypothesis tests

Confidence intervals can be used to test anumber of hypotheses. This is illustrated usingthe study data from the previous example inSection 6.12.

Scientists from the pharmaceutical companybelieve that reporting a 95% confidence intervalfor the population mean change in SBP mayprove helpful. Following the confidence intervaldefined in Section 6.10, a 95% confidenceinterval for the population mean is:

� 7 � 2.26(7.1/�_____10) � (�12.1, �1.9).

The scientists can report from this study thatthey are 95% confident that the true populationmean change in SBP is within the interval(�12.1, �1.9). One interpretation of this intervalis that the scientists are 95% confident that thedrug works by reducing SBP, as evidenced by anupper limit of the confidence interval that is lessthan 0. Another less favorable interpretation isthat the drug does not work all that well – afterall, the confidence interval does not rule outsome very minor reductions in SBP (upper limitof �1.9 mmHg). It is true that, had the scientistshypothesized a value of the population meanoutside of the values of this 95% confidenceinterval, the null hypothesis would have beenrejected at the a � 0.05 level. For example, thefollowing null hypotheses would have beenrejected:

H0: � � 2H0: � � �15.

Relationship between confidence intervals and hypothesis tests 81

�4 �3 �2 �1 0 1 2 3 4

Figure 6.7 The t distribution with 9 degrees of freedom, critical region (dashed line) and p value (solid line). Note thatthe critical region is represented by the tail areas to the left and right of the dashed lines; the p value is represented bythe tail areas to the left and right of the solid lines

Page 97: Introduction to Statistics in Pharmaceutical Clinical Trials

Conversely, the following null hypotheses wouldnot have been rejected:

H0: � � �8H0: � � �2.

This relationship can be stated more generally as:

• All values outside the (1 � a)% confidenceinterval for a parameter of interest would berejected by a hypothesis test (of size a) of theparameter.

• Values within the (1 � a)% confidence intervalfor a parameter of interest would fail to berejected by a hypothesis test (of size a) of theparameter.

In this example, l represents the populationmean change from baseline SBP. If the upperlimit of the 95% confidence interval excludes 0,negative values of population mean are mostplausible, implying that the drug lowered SBP. Ifthe lower limit of the 95% confidence intervalexcludes 0, positive values of the populationmean are most plausible, implying that the drugactually increased SBP. If 0 is enclosed in the 95%confidence interval, negative and positive valuesof the mean are most plausible, implying that wecannot rule out the possibility that the drug hadno effect. These three scenarios are displayed inFigure 6.8.

As confidence intervals can be used to test anumber of hypotheses simultaneously, theyconvey much more information than a single p value resulting from a hypothesis test. In addi-tion to being able to test various hypotheses (thenull hypothesis of zero change from baseline wasrejected) the confidence interval allows regula-tory agencies and physicians who review thedata to interpret the clinical relevance of themagnitude of the values within the confidenceinterval.

6.15 Brief review of estimation andhypothesis testing

This chapter started with an introduction to theconcepts of probability and random variabledistributions. The role of probability is to assistin our ability to make statistical inferences. Teststatistics are the numeric results of an experi-ment or study. The yardstick by which a teststatistic is measured is how extreme it is. Theterm “extreme” in Statistics is used in relation toa value that would have been expected if therewas no effect, that is, the value that would beexpected by random chance alone. Confidenceintervals provide an interval estimate for a popu-lation parameter of interest. Confidence intervalsof (1 � a)% can also be used to test hypotheses,as seen in Chapter 8.

The process of hypothesis testing is carriedout using the following steps, which will behighlighted in subsequent chapters:

• State the null and alternate hypotheses. It issometimes easier to state the alternatehypothesis first because that is what we would like to conclude at the end of thestudy. The null hypothesis then covers theremainder of values of the population para-meter. The specific statements of the null andalternate hypotheses depend on the type ofstudy and the analysis approach used. Wecover many different examples in laterchapters.

• Determine the test statistic appropriate for themethod used. Choosing the appropriate teststatistic depends on the analysis method andthe assumptions that we must make.

• Select a value of a (as noted earlier, ourstandard for this book is 0.05).

82 Chapter 6 • Probability, hypothesis testing, and estimation

H0: Drug had no effecton SBPHa: Drug lowered SBP Ha: Drug increased SBP

No change from baseline

0lendpoint minus baseline

SBP reduced from baseline SBP increased from baseline

Figure 6.8 Conclusions to be drawn from the population mean change from baseline

Page 98: Introduction to Statistics in Pharmaceutical Clinical Trials

• Calculate the value of the test statistic underthe null hypothesis and the corresponding p value. Compare the p value with the valueof a.

• State the statistical decision either to reject orto fail to reject the null hypothesis.

Statistical inference is one way to use data tomake a decision in the presence of uncertainty.The resulting decisions are not perfect. Thecommission of either a type I or a type II errorcan have significant impacts on drug companies,study participants, patients, and public health.Therefore, minimizing the probability that eachmight occur is an important part of the studydesign, including the manner in which data areanalyzed and interpreted.

6.17 References

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

vos Savant M (1997). Ask Marilyn. Parade Magazine 30March, p. 15.

References 83

6.16 Review

1. Using Table 6.1, calculate the probability ofselecting a participant who is:

(a) female(b) female and � 65 years of age(c) � 65 years of age(d) female given that the participant is � 65 years

of age.

2. Show that the true negative rate of a diagnostictest is a function of the sensitivity and specificity ofthe test and the prevalence of the disease.

3. Assume that SBP among all adults aged 30 yearsand older in the UK has a normal distribution withmean 120 mmHg and variance 100 mmHg. Whatproportion of participants in this population has:

(a) SBP 90 mmHg?(b) SBP 120 mmHg?(c) SBP 100 mmHg or SBP 140 mmHg?(d) SBP 160 mmHg?

4. What is the difference between standarddeviation and standard error?

5. What is a? How does a researcher decide on avalue for a?

6. What is b? How does a researcher decide on avalue for b?

7. What are the three components of a confidenceinterval?

8. What is a two-sided hypothesis test?

9. The one-sample t test is being used for a two-sided test of the null hypothesis, H0: l � 0. Foreach of the following scenarios, define therejection region for the test:

(a) n � 10; a � 0.10(b) n � 10; a � 0.01(c) n � 30; a � 0.05(d) n � 30; a � 0.001.

10. For each of the following 95% confidenceintervals for the population mean, would a two-sided test of the null hypothesis, H0: l � 0, berejected or not rejected?

(a) (�4.0, 4.0)(b) (�2.0, �1.0)(c) (22.3, 44.6)(d) (�12.7, 0.01).

Page 99: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 100: Introduction to Statistics in Pharmaceutical Clinical Trials

7.1 Introduction

As we noted in Section 1.11, this book focuses onteaching you the statistical methodologies andanalyses that are employed in the therapeuticconfirmatory clinical trials conducted before asponsor applies for marketing approval for thedrug that they have been developing. We alsonoted that there are other clinical trials thatprecede therapeutic confirmatory trials. Twoother categories of preapproval trials mentionedare Phase I (human pharmacology) trials andPhase II (therapeutic exploratory) trials. There-fore, before focusing on therapeutic confirma-tory trials in Chapters 8–11, it is appropriate toprovide an overview of human pharmacologyand therapeutic exploratory trials.

The usefulness of numerical information fromclinical trials in decision-making is an ongoingtheme in this book. The first few clinical studiesfor new drugs are important because theyprovide information relevant to the critical deci-sions that must be made with regard tocontinued investment in the developmentprogram. Ideally, studies are designed to answerresearch questions, the answers to which providesufficient information to inform the next step ofdevelopment, that is, either to go forward (a“go” decision) or not to go forward (a “no-go”decision). The answer to these critical early ques-tions must be “go” if we are to reach the laterstages of clinical development. For example, weneed to have reasonable confidence that thedrug is safe enough to progress to therapeuticexploratory trials in which it will be adminis-tered for the first time to participants with thedisease or condition of interest.

In addition, we need to have reasonable confi-dence that a particular selected route of adminis-

tration will prove successful for administeringthe drug to patients if and when the drug isapproved. Although many other questions mustbe addressed during later-stage clinical develop-ment, these critical early phase questions havesignificant bearing on the ultimate safety andefficacy attributes of the product, as well ascommercial implications (for example, route orschedule of administration).

Discussions in this chapter emphasize statis-tical considerations in early phase clinical trials.These include study designs employed, thetypes of data collected, and the usefulness andlimitations of these data.

7.2 A quick recap of early phasestudies

Human pharmacology studies are pharmacolog-ically oriented trials that typically look for thebest range of doses to employ. These trials typi-cally involve healthy adults. Comparison withother treatments (such as a placebo or a drugthat is already marketed) is not typically an aimof these trials, which are undertaken in anextremely careful manner in very controlledsettings, often in residential or inpatient medicalcenters. Typically, between 20 and 80 healthyadults participate in these relatively shortstudies, and participants are often recruited fromuniversity medical school settings where trialsare being conducted. The main objectives are toassess the safety of the investigational drug,understand the drug’s pharmacokinetic profileand any potential interactions with other drugs,and estimate pharmacodynamic activity. A rangeof doses and/or dosing intervals is typicallyinvestigated in a sequential manner.

7Early phase clinical trials

Page 101: Introduction to Statistics in Pharmaceutical Clinical Trials

From a statistical viewpoint, the design ofhuman pharmacology studies has certain impli-cations. They include a relatively small numberof participants, but a lot of measurements arecollected for each participant. This strategy hasboth advantages and limitations. The extensivearray of measurements made allows the drug’seffects to be characterized reasonably thor-oughly. However, as so few participants partici-pate in these studies, generalizations to thegeneral participant population are relativelymore tenuous than for studies with larger samplesizes.

7.3 General comments on studydesigns in early phase clinical studies

A disappointing result in early clinical studies, asa result of either a real liability of the investiga-tional drug or chance alone, can doom theprospects for the new drug ever entering themarket. No-go decisions are a logical conse-quence of such disappointing results. To provideoptimum quality data and the associatedoptimum quality information upon which tobase go and no-go decisions, early clinicalstudies are very well controlled, thereby limitingextraneous sources of variation as much aspossible. Early clinical studies, especially FTIH(first-time-in-human) studies, are typicallyconducted at a single investigative center. As arelatively small number of participants arestudied in such early phase trials, a single centercan feasibly accommodate the study by itself. Itcan recruit enough participants at that singlelocation, and provide all the necessary resourcesfor investigators at that site to conduct all thestudy procedures documented in the studyprotocol. Conducting a study at a single centerensures greater consistency with respect to parti-cipant management, study conduct, and assess-ment of adverse events, and provides for frequentand careful monitoring of study participants.

Participants in early clinical studies are usuallyhealthy adults whose health status is carefullydocumented at the start of the study throughphysical examinations, clinical laboratory tests,and medical histories. Limiting early studies to

healthy participants allows the sponsor toattribute any untoward findings to the drug, orto a particular dose of the drug, as significantbackground diseases are all but absent.

Early clinical studies frequently involve theuse of a concurrent inactive control. This can be important because the study procedures canbe somewhat invasive and associated with someadverse effects themselves – for example,frequent blood draws resulting in a lowering ofhematocrit. Without a concurrent control arm(even in a study of healthy participants) studysponsors and investigators would not be able to rule out a drug effect when observing such occurrences, which are expected, easilyexplained, and non-drug related. In early studiesthat involve inpatient facilities for close moni-toring, other controls may be instituted, forexample, standardized meals and set times forstudy procedures.

7.4 Goals of early phase clinical trials

Early clinical trials used in new drugdevelopment typically have the following goals:

• characterize the pharmacokinetic profile ofthe investigational drug

• describe the safety and tolerability of theinvestigational drug in study participants whodo not have significant medical conditions

• describe the extent to which a pharmacody-namic effect is affected by different doses ofthe new drug

• begin to identify a dose range that wouldlikely provide adequate exposure to yield animportant clinical effect.

Although somewhat overly simplistic (especiallyto readers who are students of pharmacy) we canconsider pharmacokinetic effects as “what thebody does to the drug” and pharmacodynamiceffects as “what the drug does to the body.” Forthose readers who are less familiar with pharma-cokinetics and pharmacodynamics, Tozer andRoland (2006) provide an excellent and veryreadable introduction to these topics.

Patients with diseases or conditions of interestcan have a number of attributes that, although

86 Chapter 7 • Early phase clinical trials

Page 102: Introduction to Statistics in Pharmaceutical Clinical Trials

very important in the context of the eventualuse of the new drug, make accurate assessmentsof the safety and pharmacokinetics of the inves-tigational drug difficult. For example, patientswith the disease may have compromised kidneyor liver function, which would confound thecharacterization of the metabolism of the newdrug. Similarly, patients with the disease maytake several other medications for the diseaseunder study or for other related or unrelateddiseases. It then becomes difficult to ascertain inearly studies whether potential adverse effects orlaboratory abnormalities are attributable to theinvestigational drug, to concomitant drugs, or toany potential interactions between the investiga-tional drug and other drugs. (Drug interactionsare not discussed in this book and readers arereferred to Hansten [2004].)

The employment of healthy participants inearly clinical studies provides essential informa-tion about the pharmacokinetics, pharmacody-namics, and safety of the new drug. This chapterfocuses on the research questions relevant toearly human studies, the designs used to address them, the data and analysis approachescommonly encountered, and the developmentdecisions that are made as a result of thesestudies.

We should note here that there are somespecial cases for which the use of healthyparticipants is not justified in early studies. Forparticularly invasive therapies (for example,implantation of a medical device) or therapieswith known toxicity (for example, oncologics) itis not ethical to study healthy participants. Theuse of healthy participants in early studies mayalso provide a misleading result for futurestudies of participants with disease. Forexample, the maximum tolerated dose of newantidepressants or anxiolytics may differ quitemarkedly between healthy participants andthose with the disease.

7.5 Research questions in early phaseclinical studies

In the early clinical development of a new drug,the following questions arise:

• How does the magnitude of systemic expo-sure to the new drug differ as a function ofincreasing concentrations of the drug?

• How does the magnitude of systemic expo-sure to the new drug differ as a function ofdifferent dosing schedules (for example, once,twice, or three times a day)?

• How do varying degrees of drug exposuremodify measurable pharmacodynamic effects?

• How does the total amount of drug exposurefrom the route of administration being studied(for example, oral) compare with the totalamount of drug exposure when administeredparenterally (that is, intravenously or intra-arterially)? In other words, how bioavailable isthe drug?

• How safe is the new drug? Evaluations includeclinical laboratory tests, physical assessments,vital signs, adverse events, and cardiac effectsthrough electrophysiological monitoring viaan ECG.

To address the first four research questions listedabove, pharmacokinetic data are typicallycollected at various time points in early clinicalstudies: These data are discussed in the nextsection. To evaluate the difference between back-ground variation (influences that are not directlyof interest) and changes brought about by theadministration of drug (influences that are ofinterest), measurements are collected on severaloccasions before the start of the drug, at severaltimes during drug administration, and at leastonce after the administration of the drug whenits effect is likely to be minimal (for example, 24hours later). Evaluation of the fifth question isdiscussed in Section 7.10.

7.6 Pharmacokinetic characteristics ofinterest

Investigations at this stage of a clinical develop-ment program focus primarily on a very carefulevaluation of how well the drug reaches thebloodstream, and how its concentrations in the bloodstream change over time, that is, onpharmacokinetics. The extent and duration of adrug’s presence in the bloodstream determine

Pharmacokinetic characteristics of interest 87

Page 103: Introduction to Statistics in Pharmaceutical Clinical Trials

how good a chance it has ultimately to exert itsintended clinical effect by reaching and inter-acting with its target receptors, the domain ofpharmacodynamic investigation. Therefore, weneed to study pharmacokinetic factors beforestudying the actual clinical effects of the drug intherapeutic exploratory trials, trials in which therelationship between drug concentrations andclinical response are typically addressed for thefirst time.

As mentioned in Section 2.5, the term “phar-macokinetics” generally refers to the absorption,distribution, metabolism, and excretion (ADME)of a drug. When developing a new drug a greatdeal of time and effort is devoted to formulatingthe drug so that it has the most desirable charac-teristics from the standpoint of safety, efficacy,and commercial concerns (for example, patientconvenience and patient adherence to theprescribed regimen). There are several commonly

used summary measures that are useful for quan-tifying absorption and excretion. In contrast,metabolism and distribution are not as easy todefine in terms of quantifiable measures,although it is possible to characterize how a drugis metabolized through the identification ofcertain markers.

7.6.1 Total systemic exposure

Total systemic exposure to an administered drugis usually measured by the area under the drugconcentration curve. For each participant thedrug concentration (in nanograms/milliliter) canbe plotted as a function of time, as displayed inFigure 7.1. The maximal drug concentration(Cmax) and the time at which it is observed (tmax)are also shown. These two parameters arediscussed in Section 7.6.2.

88 Chapter 7 • Early phase clinical trials

00

100

200

Dru

g co

ncen

tratio

n (n

g/m

L)

tmax Hours

Cmax

1 2 3 4 5 6 9 12 15 18 21 24

Figure 7.1 Sample drug–concentration time curve for a single participant (Cmax of 290 ng/mL and tmax of 6 hours)

Page 104: Introduction to Statistics in Pharmaceutical Clinical Trials

The estimated area under the curve from timepoint zero to infinity (AUC(0��)) is calculatedusing the trapezoidal rule. There are two steps inthis process:

1. Calculate the trapezoidal area between alladjacent time points

2. Sum all areas calculated in the first step.

This calculation is an estimate of the realAUC(0��), but a meaningful and useful estimate.By progressively increasing the samplingfrequency we could obtain more and moreprecise measurements, but pragmatism dictatesfrequency, and a reasonable frequency producesa useful estimate of AUC(0��). AUC(0�t) denotesthe area under the curve from 0 to any timepoint t.

7.6.2 Maximum concentration

Another important measure of absorption is thepeak or maximum concentration or maximumsystemic exposure (Cmax). It may be of interest toknow the Cmax associated with a beneficial effect.However, it is more common to use the value ofCmax to provide assurance that, despite observinga specific Cmax value, there was no unwanted toxi-city. If the Cmax is too high for a given dose of drugas measured by a clinical effect, such a findingcould guide development of other formulationsand treatment schedules. The Cmax is calculated asthe maximum value of the drug concentrationduring the period of monitoring. The time fromadministration to achieve the Cmax is called tmax.Depending on the intended clinical use for thenew drug, it may be more desirable to haveshorter or longer values of tmax. For example,when in need of headache pain relief, we mightbe interested in a tmax that is as short as possible.As noted in the previous section, both Cmax andtmax are shown in Figure 7.1, where Cmax has a valueof 290 ng/mL and tmax has a value of 6 hours.

7.6.3 Elimination

Elimination of a drug is measured using a quan-tity called a half-life (t1/2). A half-life is the timerequired to reduce the plasma concentration to

half its initial value. Longer half-lives can beassociated with desirable characteristics (forexample, longer activity requiring less frequentadministration of the drug) or undesirable ones(for example, adverse effects).

7.6.4 Excretion

Excretion concerns the removal of a drugcompound from the body. Both the original(parent) drug compound and its metabolites canbe excreted. The primary mode of investigationhere is excretion balance studies. A radiolabeleddrug compound is administered and radioac-tivity is then measured from excretion sites (forexample, urine, feces, expired air). These studiesprovide information on which organs areinvolved in excretion and the time course ofexcretion.

7.7 Analysis of pharmacokinetic andpharmacodynamic data

Statistical analyses of pharmacokinetic and phar-macodynamic effects are primarily descriptive innature. As described in Chapter 6, inferentialstatistical methods such as hypothesis testing areused to make decisions in the presence of uncer-tainty, while limiting the likelihood of makingdecisions with unwanted consequences (forexample, marketing an ineffective drug or notbringing to market an effective one). The deci-sions to be made in pharmacokinetic studies donot have such dire consequences nor are theydirectly applicable to the real world use of thenew drug. Rather, the data acquired in pharma-cokinetic studies are used as a starting point toidentify doses, dosage forms, and dosage regi-mens for the new drug which, when studied inindividuals with the disease, will allow a reason-able chance at evaluating the potential benefitsand risks associated with its use.

Pharmacokinetic measures such as AUC(0–24),Cmax, and tmax are analyzed as continuousmeasures. As seen in the example in Table 7.1,measures of central tendency and dispersion canbe helpful to highlight differences among groups.

Analysis of pharmacokinetic and pharmacodynamic data 89

Page 105: Introduction to Statistics in Pharmaceutical Clinical Trials

Before discussing the interpretation of the datain Table 7.1, a couple of points about tabulardisplays like this one are worth pointing out.First, every study or analysis has a primarycomparison of interest: In this case it is tocompare AUC across groups. The ability of aregulatory reviewer to interpret data with respectto the primary comparison is aided by displayingdata across columns for the comparison. In thiscase, a single summary measure of interest(AUC) represents a row and the groups may becompared by reading left to right. Secondarycomparisons should then be placed as rows on atable.

A common example of a secondary compar-ison in pharmacokinetic studies is the concen-tration of drug at various time points during thestudy. It is important to know how the within-group average concentration changes over time,but it is more important to know how the meanconcentration differs among groups at one timepoint. The fundamental nature of clinical trials iscomparative, above all else. The second pointabout tabular displays such as this one is that thetable itself is well labeled with titles and columnheaders. In the regulated world of drug develop-ment, presentation is extremely important.“Substance” is our first concern, but “style” iscertainly important to convey the substance to aregulatory reviewer.

Descriptive analyses were discussed in Chapter 5, particularly measures of centraltendency and dispersion. Those discussions nowenable us to examine the pharmacokinetic datapresented in Table 7.1. As can be seen, a total of10 participants were studied in each group. Themean (SD) AUC values were 812 (132), 1632

(264), and 2237 (412), respectively. From theseresults we can conclude that the three timesdaily dosage regimen resulted in overall greatersystemic exposure to the drug.

7.7.1 Decisions and inferences from FTIHstudies

Having completed one or more pharmacokineticand pharmacodynamic studies in early develop-ment, a multidisciplinary team, consisting ofclinical scientists, regulatory specialists, pharma-cologists, and statisticians, will examine theseearly clinical data to plan for studies of early effi-cacy and safety in individuals with the disease orcondition of interest. They will interpret the datato decide which combinations of dosage forms,concentration, and regimen resulted in theoptimal exposure to the drug with minimalapparent toxicities. In many instances the datamay be too ambiguous to make clear decisions,especially as a degree of subjectivity is present insuch decisions. For example, two regimens mayhave similar total exposure (as measured byAUC) but one may be associated with a greaterCmax, which may lead to adverse effects in subse-quent studies. These judgments and decisionsare fairly imperfect anyway because the relation-ship between pharmacokinetics and clinicaleffects is more relevant.

For this reason alone, early clinical studies arenot considered definitive and most sponsors arewise to interpret the data carefully. Ideally,certain combinations of dosage forms, drugconcentrations, and regimens can be eliminatedfrom future consideration as a result of early

90 Chapter 7 • Early phase clinical trials

Table 7.1 Pharmacokinetic measures: Mean (SD) for three dosage regimens of a new investigational drug

Dosage regimen

20 mg once a day 20 mg twice daily 20 mg three times daily (n � 10) (n � 10) (n � 10)

AUC(0–24) (ng h/mL) 812 (132) 1632 (264) 2237 (412)Cmax (ng/mL) 174 (61) 181 (74) 308 (94)tmax (h) 2.4 (0.9) 2.6 (0.6) 7.4 (1.0)

Page 106: Introduction to Statistics in Pharmaceutical Clinical Trials

studies (for example, inadequate exposure, toomuch exposure). Pharmaceutical companies canthen conduct future research on the drug informs that may realistically provide a benefit.AUC and Cmax are very useful measures in initialclinical development activities and, accordingly,we need statistical methods and analyses toassess them in a scientific and therefore infor-mative manner. However, these are not the para-meters that are of ultimate interest: It is theclinical benefits and risks of the new drug thatare ultimately the characteristics of importance.The statistical evaluation clinical benefits (thera-peutic efficacy) and risk (adverse events, etc.) arecovered in Chapters 8–11.

7.8 Dose-finding trials

A drug’s dosing regimen comprises the dose ofthe drug given and the schedule on which it isadministered – that is, both concentration andtiming are important characteristics. A variety ofdosing regimens may be explored in these trials,and the specific regimens chosen in a specifictrial depend on the objectives of the trial and thetype of drug being studied.

Dose-finding studies are conducted to provideinformation that facilitates selection of a safeand efficient drug administration regimen.Chevret (2006a, p 5) defined dose-finding trialsas “early phase clinical experiments in whichdifferent doses of a new drug are evaluated todetermine the optimal dose that elicits a certainresponse to be recommended for the treatmentof patients with a given medical condition.”Chevret (2006a) also provided some relateddefinitions that are helpful:

• Dose: The amount of active substance that isgiven in a single administration or repeatedover a given period, as dictated by an admin-istration schedule of equal or unequal singledoses at equal or unequal intervals.

• Response: The outcome of interest in studyparticipants. This can be defined in pharma-codynamic terms as the therapeutic points ofinterest, or in terms of pharmacotoxicity/tolerability of the drug.

• Maximum tolerated dose (MTD): The highestdose that produces an “acceptable” risk fortoxicity or, expressed differently, the dosethat, if exceeded, would put individuals at“unacceptable” risk for toxicity.

• Minimally effective dose (MED): The dose thatelicits a specified lowest therapeutic response.

Before continuing with our current discussions,the word “acceptable” in the third bullet pointmay initially seem somewhat incongruous here.All drugs lead to some side-effects, that is, someadverse events. Therefore, there is some degreeof risk associated with taking any drug. To be useful, a drug needs to have an acceptablebenefit–risk ratio – that is, the benefit must belarger than the risk, and it must be larger by acertain amount. Stating the precise amount bywhich a drug must provide more benefit than itmay lead to harm is a difficult judgment call thatmust be made ultimately by physicians.However, we can make some observations. If a drug that is extremely beneficial to very sickpatients shows relatively strong side-effects, aclinician may well decide that the benefit–riskratio is still acceptable. In contrast, a drug takenfor a relatively mild condition such as a head-ache would need to show relatively much lessstrong side-effects for the benefit–risk ratio to beacceptable.

Human pharmacology studies often involvedose-finding trials that focus on the evaluationof MTDs such as trials in oncology. It is impor-tant to note that studies that aim to define anMTD require clear and consistent definitions oftoxicities and toxicity grades. In many diseaseareas outside cancer it is difficult to define theMTD in a clear manner because the drugs them-selves may not be as apparently toxic as a newchemotherapeutic. Dose-finding trials that focuson the evaluation of MED are commonlyreferred to as early Phase II trials. (As noted inChapter 2, the categorization of clinical trialsinto Phase I, II, or III, although very common,can result in confusing and less than definitivenomenclature. Here, the nomenclature “earlyPhase II trials” is used to distinguish these trialsfrom therapeutic exploratory or “late Phase IItrials.”) One common design for FTIH studies isa dose-escalation cohort study. In this design the

Dose-finding trials 91

Page 107: Introduction to Statistics in Pharmaceutical Clinical Trials

first cohort consists of participants administeredthe lowest dose of the drug or placebo. An assess-ment of the safety of the first dose is undertakenand, if the lowest dose is considered safe, asecond cohort of participants is studied at thenext highest dose. Additional cohorts are studiedin this manner until a dose has been found tohave unacceptable risks or the maximum dosehas been studied in the final cohort. Chevret(2006b) provides a comprehensive discussion ofdose-finding experiments.

7.9 Bioavailability trials

Another type of early clinical study may beconducted with the primary objective of estab-lishing the bioavailability of a particular dosageform, concentration, and regimen. Bioavail-ability can be defined as the proportion of anadministered dose that reaches the systemiccirculation in an unchanged form. Maximumbioavailability results after an intravenous injec-tion of the drug. In this case, the bioavailabilityis by definition 100%. When administeredorally, however, a drug experiences first-passmetabolism, also called first-pass loss, before itreaches the systemic circulation.

Metabolism is a complex and tremendouslybeneficial process in most cases, but one thatposes interesting challenges in pharmacologicaltherapy. We are constantly exposed to xenobi-otics, substances that are foreign to our bodies.For example, our modern environment is aconstant source of xenobiotics that are toxi-cants. These can enter our bodies via our lungsas we breathe and our stomachs as we eat, andsome can enter the body through our skin. Inaddition, animal and plant food contains manychemicals that have no nutritional value butdo have potential toxicity. Fortunately, ourbodies are very good at getting rid of bodilytoxicants. The processes of metabolism andexcretion are involved in this. As noted byMulder (2006), metabolism can be divided intothree phases:

1. Phase 1: The chemical structure of thecompound is modified by oxidation, reduc-

tion, or hydrolysis. This process forms anacceptor group.

2. Phase 2: A chemical group is attached to theacceptor group. This typically generatesmetabolites that are more water soluble andtherefore more readily excreted.

3. Phase 3: Transporters transport the drug ormetabolites out of the cell in which Phase 1and Phase 2 metabolism has occurred.

Along with all animals, humans have a widevariety of xenobiotic-metabolizing enzymesthat convert a wide range of chemical structuresto water-soluble metabolites, which can beexcreted in urine. Humans have a high concen-tration of these enzymes in the gut mucosaand the liver. This arrangement ensures thatsystemic exposure to potentially toxic chemicalsis limited. A high percentage of these may becaught in first-pass metabolism. Xenobioticsthat are absorbed from the intestine travel viathe hepatic portal vein to the liver, the majororgan of metabolism, before being circulatedsystemically, and metabolism in the liver meansthat damage to the rest of the body is amelio-rated. Under normal circumstances this isextremely advantageous.

From the point of view of pharmacologicaltherapy, however, this protective system repre-sents a considerable challenge. Orally adminis-tered drugs also travel via the hepatic portal veinto the liver before being circulated systemically.Therefore, before the drug gets a chance to exertany therapeutic activity in the body, it has towithstand this first attempt to degrade it. Thisfirst-pass metabolism is more or less effectivedepending on factors including the drug’schemical and physical properties, but almostcertainly there will be some degree of degrada-tion. This means that most orally administereddrugs display less than 100% bioavailability.

The most rigorous quantitative way to assessthe extent of bioavailability for an orally admin-istered drug is to compare the areas under therespective plasma–concentration curves afteroral and intravenous administration of the samedose of drug. The AUC is then calculated forboth, and a ratio calculated by dividing the AUCfor the oral administration by that for the intra-venous administration. If the area ratio for the

92 Chapter 7 • Early phase clinical trials

Page 108: Introduction to Statistics in Pharmaceutical Clinical Trials

drug administered orally and intravenously is0.5 (which can be expressed as 50%), only 50%of the oral dose was absorbed systemically.

Consider the development of a new drug thatis going to be given orally. Assessing its bioavail-ability is important. An intravenous infusion ofthe new drug will result in a certain systemicexposure as measured by the AUC. This amountof systemic exposure will by definition be called100% bioavailability. It is important to identifythe dosage form and schedule that provide rela-tively high bioavailability. In this case, partici-pants may be randomly assigned to receive oneof the following drug administration regimens:

• intravenous infusion of the drug for 4 hours• 10 mg tablet once a day • 10 mg tablet twice a day • 10 mg three times a day • 20 mg tablet once a day • 20 mg tablet twice a day • 20 mg three times a day.

At the end of the study, the pharmacokineticcharacteristics of the drug would be evaluated,and the systemic exposure for each dosageregimen compared with the intravenous route ofadministration.

7.10 Other data acquired in earlyphase clinical studies

As we saw in Section 7.5, one research questionof interest in early phase trials is:

• How safe is the new drug? Evaluations includeclinical laboratory tests, physical examina-tions, vital signs, adverse events, and cardiaceffects through ECG monitoring.

More extensive discussion of these safety assess-ments is provided in the following chapters, butit is useful to introduce these topics at this point.

7.10.1 Clinical laboratory tests

There is a very wide range of clinical chemistrytests that can be conducted, including liver

(hepatic) and kidney (renal) tests. These arediscussed in Section 9.2.

7.10.2 Physical examinations

Although perhaps not as sensitive as other safetyassessments, physical examinations are still very helpful, because a general exam may iden-tify more pronounced effects to the drug such asallergic reactions or edema (fluid retention).Data collected from physical exams include asubjective assessment by the investigator as to whether the participant has “normal” or“abnormal” function for each body system (forexample, respiratory, dermatologic) examined. Ifthe body system is considered abnormal, addi-tional descriptions of the particular abnormalityare also recorded. Data recorded as normal orabnormal are measured on the nominal scale.These data are typically summarized by tabu-lating the number and percentage of individualswith each result.

7.10.3 Vital signs

Monitoring of vital signs, including heart rate,respiration rate, and blood pressure, is carriedout on a regular basis, typically several times aday. Each of these is measured on the continuousscale. Analyses of these outcomes primarily focuson measures of central tendency and dispersion.

7.10.4 Adverse events

The collection of adverse events can be based onobservation by either the investigator or partici-pant self-report. Participant self-reports ofadverse events can vary according to how theinformation is elicited from them. It is advisableto standardize the manner in which participantsare asked about how they feel during the trial.Data collected from adverse events usuallyinclude text descriptions of several characteris-tics of the adverse event:

• the adverse event, for example, “rash on leftforearm”

Other data acquired in early phase clinical studies 93

Page 109: Introduction to Statistics in Pharmaceutical Clinical Trials

• the severity or intensity of the adverse event,for example, mild, moderate, severe

• the date and time of onset• the outcome of the adverse event (resolved

without sequelae, resolved with sequelae, orongoing)

• any treatments administered for the adverseevent

• any action taken with the study drug (forexample, temporarily discontinued, stopped,none)

• whether or not the adverse event is consideredserious.

To standardize the reporting of adverse events,the adverse event descriptions are coded usingmedical dictionaries such as MedDRA (MedicalDictionary for Drug Regulatory Affairs codingdictionary: See, for example, Chow and Liu,2004, p. 563) or COSTART. The original descrip-tion of the adverse event provides qualitativeinformation about the finding that may not becaptured in the coded event. Both aspects – thatis, coded and uncoded – are retained in thescientific database for reporting and analysis.

7.11 Limitations of early phase trials

In this chapter we have discussed the impor-tance, and the strengths, of early phase clinicaltrials. Before moving on to later phase clinicaltrials, it is also appropriate to consider their limi-tations. The word “limitations” should not beseen as a negative assessment in this context. Aswe will discuss in Chapter 12, later-phase preap-proval clinical trials also have their limitations.Acknowledgment of the strength and the limita-tions of any method of inquiry is legitimate andhelpful: As Katz (2001, p xi) noted, “to workskillfully with evidence is to acknowledge itslimits.”

7.11.1 Studying pharmacokinetics inhealthy participants

Studying the pharmacokinetics of a new investi-gational drug in FTIH studies – that is, in indi-viduals with healthy renal and hepatic systems –results in a pharmacokinetic assessment that issomewhat artificial. In later stages of develop-ment, it may be necessary to study the drug inindividuals with impaired kidney or liver func-tion, especially if these conditions are expectedin the types of patients who will be prescribedthe drug if and when it is approved formarketing. However, this initial FTIH assessmentcan serve as a useful starting point and provideguidance for such later studies.

7.11.2 Extremely tight experimentalcontrol

It may seem paradoxical to see tight experi-mental control listed in a section discussing thelimitations of a clinical trial. After all, inChapter 4 we extolled the merits of such control.The issue here is related to the issue addressed inSection 7.2. Since the investigational drug isadministered in such a carefully controlledmanner, the generalizability of the results fromthese studies becomes questionable. If and whenthe drug is approved for marketing, patients whoare prescribed the medication will be unlikely totake the medication in such a preciselycontrolled manner. As in many places in drugdevelopment, there are advantages and disad-vantages to this strategy. We have noted thedisadvantages and now focus on the advantages.

The advantage of very tight control in earlyPhase II (therapeutic exploratory) trials is thatthe “pure” efficacy of the drug can be assessed aswell as possible. The drug has every chance todemonstrate its efficacy in these circumstances.

94 Chapter 7 • Early phase clinical trials

Page 110: Introduction to Statistics in Pharmaceutical Clinical Trials

In other words, we can assess how well the drugcan work. It is not so easy to assess how the drugwill work if and when approved and prescribedto a very large population of heterogeneouspatients who take the drug in various states ofadherence with the prescribed regimen, but thatis another question for another stage of theclinical development program.

7.13 References

Chevret S (2006a). Basic concepts in dose-finding.In: Chevret S (ed.), Statistical Methods for Dose-finding Experiments. Chichester: John Wiley &Sons, 5–18.

Chevret S (ed.) (2006b). Statistical Methods for Dose-finding Experiments. Chichester: John Wiley & Sons.

Chow S-C, Liu J-P (2004). Design and Analysis of ClinicalTrials: Concepts and methodologies. Chichester: JohnWiley & Sons.

Hansten PD (2004). Important drug interactions andtheir mechanisms. In: Katzung BG (ed.), Basic and Clinical Pharmacology, 9th edn. New York: McGraw-Hill, 1110–1124.

Katz DL (2001). Clinical Epidemiology and Evidence-based Medicine: Fundamental principles of clinicalreasoning & research. Thousand Oaks, CA: SagePublications.

Mulder GJ (2006). Drug metabolism: inactivation andactivation of xenobiotics. In: Mulder GJ, Dencker L(eds), Pharmaceutical Pharmacology. London:Pharmaceutical Press, 41–66.

Tozer TN, Roland M (2006). Introduction to Pharmacoki-netics and Pharmacodynamics: The quantitative basisof drug therapy. Baltimore, MD: Lippincott, Williams& Wilkins.

References 95

7.12 Review

1. What are some reasons that inferential statistics(that is, hypothesis testing) are not used very oftenin early phase studies?

2. What information from early phase trials may beused to inform the study designs of therapeuticexploratory and therapeutic confirmatory trials?

3. Name three advantages or strengths of early phasetrials as they pertain to the overall development ofa new drug.

4. Name three disadvantages of early phase trials asthey pertain to the overall development of a newdrug.

Page 111: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 112: Introduction to Statistics in Pharmaceutical Clinical Trials

8.1 Introduction

The regulatory standard for the approval of newdrugs for marketing can be framed in thefollowing manner: The benefits associated withthe new treatment outweigh the risks associatedwith the new treatment. All pharmaceuticalproducts carry the potential for side-effects,some of which are more serious than others.Therefore, for a given investigational drug to beapproved for marketing the regulatory agencyneeds to be presented with compelling evidencethat the likely benefits to the target populationwith the disease or condition of interest out-weigh the likely risks. This requires conductingclinical trials that employ samples selected fromthe target population, and use of Statistics todesign these trials appropriately, collect optimumquality data, analyze and interpret the datacorrectly, and make inferences about the popula-tion from which those samples were drawn.

Judgments about the benefit–risk profile of aninvestigational drug require, by definition,consideration of both benefit and risk. Thismeans that the therapeutic benefit of the inves-tigational drug needs to be assessed quantita-tively, and considered together with quantitativeassessments of risk. In this chapter we discuss theassessment of risk in terms of evaluating thedrug’s safety profile. Even though we typicallyuse the nomenclature benefit–risk profile andnot risk–benefit profile, we discuss safety evalua-tions first because the safety of patients must beour first concern.

Safety analyses in pharmaceutical clinical trialstend to be largely descriptive because there are somany adverse events (AEs) and other safety para-meters evaluated, and analysis of them leads toissues of multiplicity (see Section 8.9). As

described in Chapter 6, the appropriate use ofinferential statistics requires a prespecifiedhypothesis of interest. As knowledge is gainedabout an experimental therapy during its devel-opment (for example, in therapeutic exploratorytrials) a specific hypothesis about the drug’ssafety may emerge and can then be tested appro-priately. In such instances there are inferentialstatistical analyses that can be used for safetydata, and we present some of those applicable toAEs in this chapter. (See also Chow and Liu[2004b, Chapter 13] for additional discussions ofsafety assessment.)

8.2 The rationale for safetyassessments in clinical trials

When a clinician prescribes a new treatment fora patient for the first time, the clinician andindeed the patient may be interested in thefollowing questions about the safety of the drug:

• How likely is it that my patient will experiencean adverse drug reaction? (The term “adversedrug reaction” refers to an unwanted occur-rence caused by a drug. Hence, a prescribingclinician [and researchers conducting post-marketing surveillance studies] is concernedwith adverse drug reactions. During preap-proval clinical trials, we do not know whichtreatment an individual is receiving, sounwanted occurrences are called AEs. Formaldefinitions are provided shortly.)

• How likely is it that my patient will experi-ence an adverse drug reaction that is soserious that it may be life threatening?

• How will the risk of an adverse drug reactionvary with different doses of the drug?

8Confirmatory clinical trials: Safety data I

Page 113: Introduction to Statistics in Pharmaceutical Clinical Trials

• How will the risk of an adverse drug reactionchange with the length of treatment?

• Are the typical adverse drug reactionstemporary or permanent in nature?

• Are there specific clinical parameters thatshould be monitored more closely in mypatient while he or she is receiving this treat-ment because of increased risks from thenewly marketed drug?

At the time that a new drug receives marketingapproval, the best information available uponwhich the clinician can form an answer to thesequestions is the information gathered during thepreapproval clinical trials. This information isprovided to the clinician (and to all patientsreceiving an approved drug) in the packageinsert. (This situation changes in due course asadditional [and more detailed] safety evaluationtakes place during the process of postmarketingsurveillance [see Mann and Andrews, 2007].However, this process may take several years toacquire meaningful data, and so the statement inthe text may remain true for quite a while.)

A number of clinical parameters are assessedduring preapproval clinical trials. This informa-tion provides the basis upon which the clinicianwill formulate answers to these questions. Theprecise set of clinical parameters employed in agiven trial may vary according to the disease andthe type of drug under study. In general, thesafety evaluation of new drugs is intended todetect quantifiable effects in as many organs andsystems as possible. In other words, whenlooking for risks associated with a new drug, thestrategy is to “cast a wide net.”

8.3 A regulatory view on safetyassessment

The view of the US Food and Drug Administra-tion (FDA) concerning safety reviews ispresented in their guidance document on thesafety review of new drug applications (US FDA,2005, p 5). As this guidance states, most thera-peutic exploratory and therapeutic confirmatorytrials are carefully designed to establish that a

new drug is efficacious, while controlling theprobability of committing a type I or II error.Unless safety concerns have arisen in earlierstages of the clinical development program,these trials typically do not involve assessmentsof safety that are as sensitive as those designedfor establishing the efficacy of the investiga-tional drug. Quoting from this guidance:

In the usual case, however, any apparent findingemerges from an assessment of dozens of poten-tial endpoints (adverse events) of interest,making description of the statistical uncertaintyof the finding using conventional significancelevels very difficult. The approach taken is there-fore best described as one of exploration andestimation of event rates, with particular atten-tion to comparing results of individual studiesand pooled data. It should be appreciated thatexploratory analyses (for example, subsetanalyses, to which a great caution is applied in ahypothesis testing setting) are a critical andessential part of a safety evaluation. Theseanalyses can, of course, lead to false conclusions,but need to be carried out nonetheless, withattention to consistency across studies and priorknowledge. The approach typically followed isto screen broadly for adverse events and toexpect that this will reveal the common adversereaction profile of a new drug and will detectsome of the less common and more seriousadverse reactions associated with drug use.

US FDA (2005, p 5)

Safety evaluations of investigational drugs focusprimarily on estimating the risk of unwantedevents associated with the drug, and, morespecifically, on the risk of those events relative towhat would be expected in the patient popula-tion as a whole if the drug were to be approved.Although more specialized tests and assays maybe evaluated in certain instances, in this chapterand in Chapter 9 we describe statisticalapproaches used for the most common clinicaldata used to assess the safety of new drugs: AEs,clinical laboratory data, vital signs, and changesin ECG parameters. This chapter focuses ondiscussions of AEs. Adverse events are nominaldata, and therefore summaries of AEs are basedon counts.

98 Chapter 8 • Confirmatory clinical trials: Safety data I

Page 114: Introduction to Statistics in Pharmaceutical Clinical Trials

8.4 Adverse events

ICH Guidance E6 (R1) (1996, p 2) provides thefollowing definition of the term adverse event:

Any untoward medical occurrence in a patientor clinical investigation subject administered apharmaceutical product and which does notnecessarily have a causal relationship with thetreatment. An adverse event (AE) can thereforebe any unfavourable and unintended sign(including an abnormal laboratory finding),symptom, or disease temporally associated withthe use of a medicinal (investigational) product,whether or not associated with the medicinal(investigational) product.

ICH Guidance E2A (1995, p 3) provides a defin-ition of the term “adverse drug reaction” that isapplicable during preapproval clinical experi-ences with a new medicinal product:

All noxious and unintended responses to amedicinal product related to any dose should beconsidered adverse drug reactions.

There are various types of AEs, as shown in Table 8.1.

The length of observation for AEs is typicallyspecified in the study protocol. In most instances,on-treatment AEs (also called treatment-emergent AEs) are considered to be thoseevents with an onset from the time that studytreatment has been initiated through theprotocol-defined follow-up period. For example,

a protocol may specify that AEs occurring within30 days of the last exposure to the study drug bereported. In some therapeutic areas it may bedesirable to assess separately those AEs thatoccur once treatment has been discontinued, forexample, to evaluate withdrawal or reboundeffects during the follow-up period.

In the hypothetical data presented in Table 8.1,the numbers of participants in the drug andplacebo groups are deliberately similar but notidentical. This is why provision of both absolutenumbers and percentages is so informative whenmaking comparisons between the treatmentgroups.

8.5 Reporting adverse events

Adverse events are typically reported in one oftwo ways:

1. By study investigators on the basis of theirown observations (for example, from aphysical exam)

2. By the study participant as a self-reportedevent.

In the second case, it is advisable to elicit AEsfrom participants using a standardized script toensure that they are collected as accurately aspossible. For example, a question such as “Haveyou noticed anything different or had anyhealth problems since you were last here?” is a

Reporting adverse events 99

Table 8.1 Participant accountability (Safety Population: Study AB0001)

Adverse events (AEs) Number (%) of participants

Placebo (n � 2603) Drug (n � 2456)

Pre-treatment AEs 24 (1) 31 (1)On-treatment AEsa 297 (11) 386 (16)Drug-related AEsb 31 (1) 42 (2)Serious AEs 20 (1) 27 (1)AEs leading to withdrawal 12 (� 1) 17 (1)

aAEs that occur on any treatment, whether active or nonactive.b”Drug-related” is a designation made by an investigator who decides that there is a reasonable chance that the AE was caused by the treatment being

taken.

Page 115: Introduction to Statistics in Pharmaceutical Clinical Trials

way of asking a participant about potential AEswithout leading him or her to answer in a certainway.

Study personnel who interact with partici-pants are trained to capture the essence of anyself-reported AEs on a case report form (CRF),one of the most important documents in clinicaltrials. Examples of reported AEs include “short-ness of breath,” “rash on left wrist,” “dry mouth,”and “vomiting.” In addition to the description ofthe nature of the AE, additional informationsuch as the following is typically collected:

• the severity• the date and time of onset• the resolution date (if the event resolved), any

action that was taken with the study drug (forexample, stopped, dosage reduced)

• the presumed relationship to the studytreatment

• whether or not the AE was considered“serious” according to a regulatory definition.

8.6 Using all reported AEs for allparticipants

The first question listed in Section 8.2 was: “Howlikely is it that my patient will experience anadverse drug reaction?” We turn this questionaround, and reframe it in terms of assessing howlikely it is that a participant in a preapprovalclinical trial will experience an AE. The data thatare typically used to answer this question are allon-treatment AEs for all participants treated (orexposed) in each treatment group. The proba-bility that a participant in a particular treatmentgroup will report any AE is estimated by theproportion of participants in the group whoreported any AE.

When describing proportions, it is importantto note what event is being counted in thenumerator and what event in the denominator.Many times it is clear what the appropriatenumerator for a proportion should be, but notso clear what the appropriate denominatorshould be. The simplest starting point for deter-mining which participants should be countedin the denominator is to identify all those who

are at risk of experiencing the event of interest.For example, the proportion of participantsexperiencing an AE in the first 90 days shouldbe calculated by counting the number of parti-cipants who were treated for at least 90 days inthe denominator and the number of parti-cipants who were treated for at least 90 daysand reported an AE in the first 90 days in thenumerator.

As described earlier, proportions are numbersbetween 0 and 1. We have also noted that it iscommon for proportions to be multiplied by 100so that the quantity being assessed is expressedin percentage terms. In the present context, weare interested in the percentages of participantsexperiencing a certain event.

The probability of an individual reporting anAE in a trial is estimated by the followingproportion:

[Number of participants who were administered the treatment and reported

any AE]––––––––––––––––––––––––––––––––––––––––

[Number of participants who wereadministered the treatment]

Some participants will have reported more thanone AE. For this analysis, we count participantsonly once if they experienced any AE(s).

As noted in Chapter 6, this calculated propor-tion is considered a point estimate, because itwas obtained from a single sample and the esti-mate does not take into account any variabilityattributed to sampling. In most clinical studyreports (and, ultimately, package inserts formarketed products), the point estimate of theproportion of individuals experiencing AEs isexpressed as a percentage of individuals. Thisquantity can be thought of as a rate (ratio ofindividuals experiencing an event among thoseexposed to the treatment) or, in the terminologyused in the discipline of epidemiology, theincidence of AEs.

Calculating the proportion (or, equivalently,the percentage) of individuals reporting any AEfor all treatment groups in a study enables us tosee whether AEs are more or less likely in the testtreatment group than in other groups. The use ofan inactive control group (for example, a

100 Chapter 8 • Confirmatory clinical trials: Safety data I

Page 116: Introduction to Statistics in Pharmaceutical Clinical Trials

placebo) in a study allows us to compare theprobability attributed to the test treatment groupto what can be thought of as the backgroundrisk, which is approximated for by the risk in theinactive control group.

8.7 Absolute and relative risks ofparticipants reporting specific AEs

Similar analysis approaches are used to describethe risk, in both the absolute and relative(comparative) sense, of individuals reportingspecific AEs. These analyses are much moreuseful clinically because not all AEs are createdequal. One example is to estimate the proportionof individuals in a given group who reported aheadache. To do this in a standardized manner itis necessary to “code” the AE descriptions (forexample, “tension headache”, “achy head”). Theuse of the MedDRA coding dictionary for thispurpose is now widely accepted, and in someinstances may be required. Coding is performedbefore statistical analysis and the “coded” termsare used in statistical summaries that requirecounting of participants reporting each event.

The proportion from the sample in the studycan be estimated as:

[Number of participants who received the treatment and reported a headache during the study]

––––––––––––––––––––––––––––––––––[Number of participants who received

the treatment].

For example, if 25 participants received treat-ment A and, among them, 5 reported a head-ache, the estimated proportion of participantsreporting a headache is 5/25 � 0.20, whichcan also be expressed as 20%. When suchan analysis is repeated for all AEs reported,and the quantities are expressed as percentagesand displayed in tabular form in a packageinsert, it is relatively easy for prescribing physi-cians and their potential patients to answertheir questions.

Suppose that an investigational antihyperten-sive drug is evaluated at multiple doses in aparallel-group placebo-controlled study. Partici-pants in this therapeutic exploratory study wererandomly assigned to receive either placebo orone of three possible doses of the test treatment(low, medium, or high). The treatment periodwas for 6 weeks. The number and percentage ofparticipants experiencing any AE, and particularAEs, are displayed in Table 8.2.

We now have data with which to begin toanswer the question: How likely is a patient toexperience an AE after use of the new treatment?As this study included three doses of the testtreatment, we need to consider the dose in ouranswer. Examining the top row in Table 8.2, itseems that the overall chance of observing an AEat all doses of the active treatment is similar tothat for the placebo group: The percentages for“Any event” range from 10% to 13% across thegroups with no apparent relationship to dose.From these data, our best guess as to the proba-bility of an individual treated with the test treat-ment experiencing any AE is between 10% and13%. However, the probability of experiencing

Absolute and relative risks of participants reporting specific AEs 101

Table 8.2 Number and percentage of participants reporting adverse events (AEs) by group

AE Placebo (n � 98) Low (n � 101) Medium (n � 104) High (n � 97)

Any event 12 (12%) 13 (13%) 10 (10%) 12 (12%)Headache 6 (6%) 8 (8%) 9 (9%) 8 (8%)Dizziness 1 (1%) 3 (3%) 4 (4%) 3 (3%)Upper respiratory infection 4 (4%) 1 (1%) 2 (2%) 2 (2%)Nausea 1 (1%) 2 (2%) 1 (1%) 3 (3%)

Page 117: Introduction to Statistics in Pharmaceutical Clinical Trials

an AE is almost equal to the probability of expe-riencing an AE after treatment with placebo. Theimplication of this result is that the risk of expe-riencing an AE after treatment with the new drugis no different than if the participant had notbeen treated with the drug.

Looking at the specific AEs in Table 8.2, there isreally little difference among the groups with oneexception, the AE of dizziness. Only 1% of parti-cipants in the placebo group reported dizzinesscompared with 3–4% of participants treated withthe active drug. How might a regulatory reviewerinterpret these data? The first conclusion is thatdizziness was not reported very often in anyparticipant group, so, if the drug is approved andmarketed, most patients treated with the newdrug would probably not have a problem.However, the difference in the percentage ofparticipants might generate some concern.

Initially, the absolute difference in dizzinessrates (2–3%) may not seem extreme. However,when considering the rates in relative terms,those treated with the investigational drug arethree to four times as likely to experience dizzi-ness as someone who did not receive the activedrug. This measure of risk is called a relative riskand is calculated as follows:

The probability of the event in group ARelative risk � –––––––––––––––––––––––––––––––––––––.

The probability of the event in group B

In many instances, the communication of arisk (probability of experiencing the event) ismost clear with an absolute measure (such as the point estimate for a group) and a relativemeasure (such as the relative risk). The relativerisk is a ratio of two probabilities, and cantherefore range from zero to infinity.

8.8 Analyzing serious AEs

ICH Guidance E2A (1994) provides the followingdefinition of a serious event: A serious adverseevent (experience) or reaction is any untowardmedical occurrence that at any dose:

• results in death• is life threatening (note that the term “life

threatening” in the definition of “serious”

refers to an event in which the patient was atrisk of death at the time of the event; it doesnot refer to an event that hypotheticallymight have caused death if it were moresevere)

• requires inpatient hospitalization orprolongation of existing hospitalization

• results in persistent or significant disability/incapacity

• is a congenital anomaly/birth defect.

A similar analysis can be performed to addressanother question of interest: How likely is it thatan individual will experience an adverse effectthat is potentially life threatening? The data usedto answer this question include all of the AEsthat were rated as serious at the time ofreporting. We would estimate the probability bycalculating the proportion of participants treatedin each group who experienced a serious AE. Theproportions (or, equivalently, the percentages) ofpatients could be compared across groups to seeif there was an increased risk of a serious AEassociated with the new treatment.

8.9 Concerns with potential multiplicityissues

As noted earlier, safety analyses in clinical trialstend to be largely descriptive because so manyAEs and other safety parameters are evaluated. Ifwe were to perform hypothesis tests for the largenumber of parameters evaluated – for example,for all AEs reported in a trial – it is probable thatat least one of the tests would be nominallystatistically significant at the a � 0.05 level. Inmost instances the statistical analysis is plannedso that the probability of making a type I error is� 0.05. In Table 8.2 rates were presented for fiveAEs (including any event). If we were interestedin identifying statistically significant differencesfor each active dose group versus placebo, wewould need to conduct 15 hypothesis tests (threedose groups to be tested against placebo for fiveAEs). As we saw in Chapter 6, if we test a numberof hypotheses without taking into accountmultiplicity of comparisons we will likelycommit a type I error. For the 15 tests that could

102 Chapter 8 • Confirmatory clinical trials: Safety data I

Page 118: Introduction to Statistics in Pharmaceutical Clinical Trials

be conducted using the data in Table 8.2, it iscertainly possible that one of them might have anominal p value � 0.05 by chance alone.Committing a type I error in this setting wouldmean concluding that the new treatment wasassociated with an excess risk of an AE when thatreally is not the case.

If a single test were considered nominallystatistically significant after looking at so manyother AEs, the result should be treated with agreat deal of skepticism. Before making any regu-latory or business decisions on the basis of sucha result, medical, clinical, and statistical expertsshould, at a minimum, evaluate the medical andstatistical plausibility of the result. Ideally, addi-tional data would be collected to providesupporting evidence for such a finding. As wehave pointed out a number of times already,statistical results such as these aid in decision-making, in concert with insights and evidencefrom other disciplines. This view, as it relates toanalyses of safety data, is perfectly in line withthe EMEA’s Committee for Proprietary MedicinalProducts (CPMP) (2002, p 4) guidance, Points toConsider on Multiplicity Issues in Clinical Trials:

In those cases where a large number of statis-tical test procedures is used to serve as a flag-ging device to signal a potential risk caused bythe investigational drug it can be generallystated that an adjustment for multiplicity iscounterproductive for considerations of safety.It is clear that in this situation there is nocontrol over the type I error for a singlehypothesis and the importance and plausibilityof such results will depend on prior knowledgeof the pharmacology of the drug.

8.10 Accounting for sampling variation

Hypothesis tests and interval estimates ofproportions are frequently presented in clinicalstudy reports, especially in earlier studies ofdevelopment when late phase studies are beingplanned. Accordingly, discussion now turns toanalysis methods that can be used to account forsampling variation and, therefore, determine ifthe results observed are likely due to chancealone.

In Chapter 6 we described the basic compo-nents of hypothesis testing and interval estima-tion (that is, confidence intervals). One of thebasic components of interval estimation is thestandard error of the estimator, which quantifieshow much the sample estimate would vary fromsample to sample if (totally implausibly) we wereto conduct the same clinical study over and overagain. The larger the sample size in the trial, thesmaller the standard error. Another componentof an interval estimate is the reliability factor,which acts as a multiplier for the standard error.The more confidence that we require, the largerthe reliability factor (multiplier). The reliabilityfactor is determined by the shape of thesampling distribution of the statistic of interestand is the value that defines an area under thecurve of (1 � a). In the case of a two-sidedinterval the reliability factor defines lower andupper tail areas of size a/2.

If the shape of the sampling distribution issymmetric (for example, the Z or t distributions),the reliability factor used for the lower and upperlimits is exactly the same, but with a change insign. Some sampling distributions are notsymmetric (for example, the F distribution forthe ratio of two variances) and, therefore, thereliability factors for the lower and upper limitsare not equal.

Let us now look at how we would calculate aconfidence interval for a single proportion, suchas a within-treatment group proportion ofparticipants experiencing an AE.

8.11 A confidence interval for a sampleproportion

The estimator for a sample proportion can bedefined as follows:

number of observations with the event of interestp � ––––––––––––––––––––––––––––––––––––––––––––––,

total number of observations at risk of the event

which is an unbiased estimator of the unknownpopulation proportion, P. The standard error ofthe estimator,

_____p q

SE(p) � � ––––,n

A confidence interval for a sample proportion 103

Page 119: Introduction to Statistics in Pharmaceutical Clinical Trials

where q � 1 � p, the sample proportion of obser-vations without the event of interest. The esti-mator p is approximately normally distributedfor large samples (that is, when pn � 5) so thereliability factor for interval estimates will comefrom the Z distribution. For now, we willconsider only two-sided confidence intervals.Hence, the reliability factor Z1�a/2 will be thespecific value of Z such that an area of (1 � [a/2])lies to the right of the cutoff value. A two-sided(1 � a)% confidence interval for a sampleproportion, p is:

p � z1�a/2SE(p).

This is also a confidence interval for the para-meter p, probability of success, of the binomialdistribution. The use of the Z distribution forthis interval is made possible because of theCentral Limit Theorem. Consider the randomvariable X taking on values of 0 or 1, such thatthe sampling distribution of the sample mean(the proportion) is approximately normallydistributed. A table of the most commonlyencountered values of the standard normaldistribution is provided in Table 8.3 for quick reference. Others are provided inAppendix 1.

This methodology can be used to answer aquestion about the data presented in Table 8.2,where the percentages of participants reportingheadache during the 6-week study were 6%, 8%,9%, and 8% for the placebo, low-dose, medium-dose, and high-dose groups respectively.Headaches may be reported fairly often amongpeople with hypertension as a matter of course,but these data suggest that the proportion(expressed here as a percentage) of individualsreporting headache is a bit higher for individuals

treated with the active treatment than for thosein the placebo group. We can calculate the 95%confidence interval for the proportion of partici-pants in the combined active dose groupsreporting a headache. We can also calculate the corresponding confidence interval for theplacebo group and compare the two.

The research question

Is the risk (or probability) of experiencing aheadache after treatment with the active drug(all doses combined) higher than the risk aftertreatment with placebo?

Study design

In this study, an investigational antihypertensivedrug was evaluated at multiple doses in aparallel-group, placebo-controlled study. Animportant feature of the design was random-ization to treatment, which provides us withunbiased (accurate) estimates of treatment differences. Another feature of the design of thestatistical analysis is that we have chosen tocompare the rates of one particular AE amongmany only after seeing the results (that is, aposteriori). As we have already seen, any differ-ence between treatments that we may find atthis point may be a type I error resulting fromthe large number of AEs that could have beenselected for this particular analysis.

Data

The data for this analysis are the counts of parti-cipants treated in each group (that is, the denom-inator for within-group proportions) and thecounts of participants within each group whoreported a headache during the study (that is, thenumerator for the within-group proportions). Asthe research question involved all active dosegroups combined (that is, any dose of the drug) itis necessary to pool the data across the active dosegroups to calculate the confidence interval ofinterest. Having done that, we now have thefollowing data for our example: 6 out of 98participants in the placebo group reported aheadache, and 25 out of 302 participants in thecombined active groups.

104 Chapter 8 • Confirmatory clinical trials: Safety data I

Table 8.3 Selected values of Z for two-sidedconfidence intervals

a (two sided) Z1 � a/2

0.10 1.6450.05 1.960.01 2.5760.001 3.3

Page 120: Introduction to Statistics in Pharmaceutical Clinical Trials

Statistical analysis

The statistical analysis approach is to calculate95% confidence intervals for the proportion ofparticipants in each group (placebo and com-bined active) reporting a headache. This analysisapproach is reasonable because the sample size is sufficiently large (that is, the values, pn, ineach group are at least five). Satisfying thisassumption enables us to use the Z distributionfor the reliability factor.

The first step is to calculate the point estimateof the proportion. For the placebo group theproportion is 0.06. The second step is to calcu-late the standard error. For this estimator thestandard error is calculated as follows:

____________

(0.06)(0.94)� ––––––––––– � 0.02.98

The third component of the interval estimate isthe reliability factor. As we are calculating a two-sided 95% confidence interval, we select thevalue of Z from Table 8.3 corresponding to a of0.05, that is, 1.96.

With all of the components now available, thelast step is to calculate the confidence interval.The lower limit is 0.06 – 1.96(0.02) � 0.02. Theupper limit is 0.06 � 1.96(0.02) � 0.10. We writethe 95% confidence interval as (0.02, 0.10).Repeating these steps for the combined activedose group, we obtain a 95% confidence intervalof (0.04, 0.12). (We leave it to you to verify thiscalculation.)

Interpretation and decision-making

Using these two confidence intervals we cannow make some conclusions about theunknown population proportion of participantswho experience headache after exposure in eachgroup. In the case of the placebo group, we are95% confident that the population proportion ofparticipants experiencing a headache is enclosedin the interval (0.02, 0.10). For the combinedactive dose group, we are 95% confident that thepopulation proportion of participants experi-encing a headache is enclosed in the interval(0.04, 0.12). Although it may initially haveseemed that there may be an increased risk ofheadache associated with the active treatment,

the overlapping within-group confidence inter-vals suggest that there is insufficient evidence toconclude that the observed difference is real(that is, not due to chance).

8.12 Confidence intervals for thedifference between two proportions

There is also another way to answer this researchquestion. If the proportions of individualsreporting headache are the same among partici-pants in the active dose groups and the placebogroup, the difference between the two propor-tions would be 0. Further, because of the influ-ence of sampling error, with which we are nowvery familiar, we would not necessarily expectthe difference to be exactly 0 (just like we do notexpect precisely equal numbers of heads andtails in a series of coin tosses). In this approach,therefore, we calculate a confidence intervalabout the difference in proportions for two inde-pendent groups. This interval estimate allows usto exclude implausible values of the difference.This method and others throughout this bookrequire independence of groups (for example,two groups of participants). Examples of groupsthat are not considered independent aremeasurements on the same study participant (forexample, in ophthalmology left eyes are notconsidered independent of right eyes in thesame individual).

For this method we have sample proportionsfor independent groups 1 and 2 defined asabove:

number of observations in group 1 with the event of interestp1� ––––––––––––––––––––––––––––––––––––––––––––––––––––––––– and

total number of observations in group 1 at risk of the event

number of observations in group 2 with the event of interestp2� –––––––––––––––––––––––––––––––––––––––––––––––––––––––––.

total number of observations in group 2 at risk of the event

The estimator for the difference in the twosample proportions is p1 � p2 and the standarderror of p1 � p2 is:

_______________p1q 1 p2q 2

SE(p1 � p2) � �–––––– � ––––––,n1 n2

where q1 � 1 � p1 and q2 � 1 � p2.

Confidence intervals for the dif ference between two proportions 105

Page 121: Introduction to Statistics in Pharmaceutical Clinical Trials

For large samples (that is, when p1 n1 � 5 and p2 n2 � 5) the estimator p1 � p2 isapproximately normally distributed with mean,

p1 � p2

and variance,

p1(1 � p1) p2(1 � p2)–––––––––– + ––––––––––.

n1 n2

So the reliability factor for interval estimates willcome from the Z distribution. Then a two-sided(1 � a)% confidence interval for the differencein sample proportions, p1 � p2 is:

(p1 � p2) � z1�a/2SE(p1 � p2).

While this form of the confidence interval iswidely used we suggest the use of a correctionfactor,

1 1 1–– ( ––– � ––– ) ,2 n1 n2

attributed to Yates (see Fleiss et al., 2003). Thiscontinuity correction factor accounts for the factthat the normal distribution is being used as anapproximation to the binomial. With the correc-tion factor, a two-sided (1 � a)% confidenceinterval for the difference in sample proportions,p1 � p2, is:

1 1 1(p1 � p2) � (z1�a/2SE(p1 � p2) � –– (––– � –––)) .

2 n1 n2

As an example, we look at the headache AE dataagain and calculate a two-sided confidenceinterval for the difference in sample proportions.

Data

As above, the data are in the form of counts: 6out of 98 participants in the placebo groupreported a headache and 25 out of 302participants in the combined active groupsreported a headache.

Statistical analysis

As with the previous method, the first step is tocalculate the point estimate, but this time thepoint estimate of the difference in sampleproportions. For the placebo group the propor-tion is 0.06. For the active group the proportion

is 0.08. So the point estimate for the difference is0.06 � 0.08 � �0.02.

The next step is to calculate the standard error,which is:

__________________________

(0.06)(0.94) (0.08)(0.92)� ––––––––––– � ––––––––––– � 0.03.98 302

The third component of the interval estimate isthe reliability factor. The Z value will be the sameas for the previous example (that is, 1.96). Forthis interval estimate, we also use the continuitycorrection factor. The continuity correction iscalculated as 0.5(1/98 � 1/302) � 0.007.

We now have all the components of theinterval calculation. The lower limit is given asfollows:

�0.02 � 1.96(0.03) � 0.007 � �0.09.

The upper limit is given as follows:

�0.02 � 1.96(0.03) � 0.007 � 0.04.

Note that the calculated limits do not appear tobe equidistant from the point estimate, as wemight have expected. This is the result ofrounding to two significant digits in the calcula-tions. The calculated 95% confidence intervalabout the difference in proportion of partici-pants reporting headache as an AE is written asfollows:

95% Cl � (�0.09, 0.04).

Interpretation and decision-making

Given its importance, it is worth restating theinterpretation of this confidence interval. We are95% confident that the true difference in propor-tions of individuals reporting headache as an AEis within the interval (�0.09, 0.04). As theinterval includes 0, there is not enough evidenceto suggest that the two groups are statisticallysignificantly different with respect to the risk ofheadache as an AE. Following this conclusion,we could reasonably continue with furtherstudies in our clinical development program ofthe active drug, with some assurance from theselimited data that the active treatment did notincrease the risk of headache.

Suppose, however, that a skeptical colleagueinsisted that the risk of headache had to be

106 Chapter 8 • Confirmatory clinical trials: Safety data I

Page 122: Introduction to Statistics in Pharmaceutical Clinical Trials

higher for participants treated with the activedrug than with placebo. Using these data, howconfident could he or she be that this was reallythe case? What if all of the headaches in theactive treated group were reported in the firstweek of treatment, whereas in the placebo groupthe events were spread evenly over the entire 6-week treatment period? Would your view ofthe relationship between the active treatmentand the risk of headache change? A method-ology called time-to-event analysis is useful here.

8.13 Time-to-event analysis

An illustration of this scenario is given in Figure 8.1, which shows data from a hypothet-ical study, study 1 (we discuss another hypothet-ical scenario, study 2, in due course). There aretwo treatment groups represented: Active andplacebo. Suppose for this example that there are10 participants in each group, and the length oftreatment is 20 days. On the x axis of each panelis time, that is, the number of days since the startof study treatment. Different study participantsare represented on the y axis of each panel.Participants numbered 1–10 are in the placebogroup and participants numbered 11–20 are inthe active group. The occurrence of an AE (“A”)is represented with an “X.” Completion of thestudy on day 20 without the AE is denoted by anopen circle. The time to either the first report ofthe AE or the completion of the study is repre-sented by the length of the line from day 1 to theevent. Note that it is possible for participants toreport more than one instance of the same AE,but only the first occurrence is represented inFigure 8.1.

Here is a descriptive summary of the datadisplayed in Figure 8.1. For both groups (placeboand active), 5 out of 10 (50%) of the participantsreported the particular AE. So, if we were toreport these rates and a 95% confidence intervalabout the difference in proportions, there wouldnot appear to be any difference between thesetwo groups. However, when we look at the timesrelative to the start of study treatment, this is notso clear any more. In the placebo group, the AE was reported on days 4, 9, 11, 14, and 18. In

contrast, the AE was reported much earlieramong participants in the active group, on days1, 2, 4, 5, and 6. The remainder of participants inboth groups completed the study on day 20without experiencing the AE. It appears as if theprobability of experiencing the AE (as estimatedby the proportion of participants reporting it) isthe same between the groups, but that there is atemporal relationship between the start of thestudy treatment and the time at which the AE isreported. How might we report such a result?

One possibility that might come to mind,although it is not recommended for reasons wediscuss shortly, would be to calculate the averagenumber of days to the reported AE. This is prob-lematic, however, because we can calculate sucha quantity only for those participants who actu-ally reported the AE. The mean number of daysis 11.2 and 3.6 among participants reporting AE“A” in the placebo and active groups, respec-tively. This analysis completely ignores thosewho did not report the AE. It hardly seems accu-rate to exclude these individuals from ouranalysis. In fact, although half of the partici-pants in both groups did not report “A,” theymight have eventually reported it if we hadfollowed them longer. Such an estimate of theexpected time at which an AE is reported isbiased, because not all participants were part ofthe estimate.

This example suffers from an oversimplifica-tion that we have to deal with in the real world,namely that study participants do not alwayscomplete the study for the full length of thefollow-up period. Participants may drop out ofstudies for a number of reasons, some of whichreflect their experience with the drug (forexample, it may be poorly tolerated). Therefore,the “time at risk” differs from individual toindividual within the same trial, and it can differto a considerable degree from trial to trialthroughout a clinical development program.

The most important points to remember hereare as follows. Simply comparing the relativefrequency (that is, the proportion of participantsreporting the AE) of the AE between two groupsdoes not tell the whole story: Such an analysisdoes not address the potential temporal relation-ship between exposure to the study treatmentand the AE of interest. As we saw in this

Time-to-event analysis 107

Page 123: Introduction to Statistics in Pharmaceutical Clinical Trials

example, exactly the same proportion of partici-pants in both groups reported the AE. However,the AE occurred in the first 6 days among parti-cipants in the active group, whereas the AEreported by participants in the placebo groupoccurred at evenly spaced intervals over thecourse of the time at risk. Such a difference intimes of the events would suggest that there is acause-and-effect relationship between the activetreatment and the AE.

A more informative approach would be to takeinto account the time of the event relative to thestart of treatment. Ideally, we should use the datafrom all participants in this approach and shouldaccount for varying lengths of time at risk forexperiencing the event. O’Neill (1987) advocated

such an approach especially for serious AEscaused by the shortcomings of simply describingthe incidence (or “crude rate” as he defines it) of AEs:

For drugs used for chronic exposure, onenumber or rate such as the crude rate is notlikely to be informative without reference totime. To be useful as a summary measure ofcombined safety data from several studies andwhich would estimate an overall rate thatdescribes experiences of all participants exposedfor varying time periods, there is a need tostratify for time as well as other factors. (O’Neill1987, p 20)

The next section in this chapter addresses justsuch a method.

108 Chapter 8 • Confirmatory clinical trials: Safety data I

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Days since start of study drug

20

Act

ive

Plac

ebo

19

18

17

16

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

Figure 8.1 Days since the start of study drug at which adverse event “A” was first reported: Study 1 with no dropouts

Page 124: Introduction to Statistics in Pharmaceutical Clinical Trials

8.14 Kaplan–Meier estimation of thesurvival function

The analysis method attributed to Kaplan andMeier (1958) enables us to analyze the time to thefirst reported AE while accounting for differentlengths of time at risk. To illustrate this methodfully, we have modified the data from theprevious example slightly, as shown in Figure 8.2.We refer to this new example as study 2.

The proportion of participants with the eventis still equal between the groups (this time 0.6 inboth). As seen in Figure 8.2, some participantsdropped out of the study before reporting theAE, which are denoted by the open circles atdays before day 20. When analyzing data in thisway, observations for which the event of interestwas not recorded during the time at risk arecalled censored observations. As noted earlier it

is conceivable that, if we had followed theseparticipants for a longer period of time, or if theyhad not dropped out of the study, they may haveexperienced the AE of interest.

When analyzing the time to the AE, we needan analytic way to deal with these censoredobservations. Although we do not know whatwould have happened for these participants, wedo know that they were at risk for some period oftime and “survived” their time in the studywithout experiencing the AE. Accordingly, themain objective of this analysis is to describe howlong participants survive without experiencingthe event.

The name survival analysis reflects one situa-tion in which this type of analysis is used. Whenthe participants in a clinical trial are very ill, themeasurement of efficacy can be the length oftime that they live, that is, death is the “event.”

Kaplan–Meier estimation of the survival function 109

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Days since start of study drug

20

Act

ive

Plac

ebo

19

18

17

16

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

Figure 8.2 Days since the start of study drug at which adverse event “A” was first reported: Study 2 with dropouts

Page 125: Introduction to Statistics in Pharmaceutical Clinical Trials

An example may be an oncology trial in whichone group receives the investigational drug andthe other receives an active control, usually thecurrent gold standard of therapy for the specifictype of cancer. Of interest is whether thosereceiving the investigational treatment survivelonger than those receiving the active control.However, survival analysis, as will be seen in ourexamples, can also be used to measure the timeto any defined event.

In this analytic methodology the data for eachparticipant are expressed in a different manner.We present the event times for every participant,defined in one of two ways:

1. The day at which the participant reported theAE

2. The last day the participant was “at risk” forreporting the AE without having done so. Thistype of participant is labeled parentheticallyas “censored.”

The data are therefore as follows:

• placebo: 4, 9, 11, 14, 15 (censored), 17, 18, 19(censored), 20 (censored), 20 (censored)

• active: 1, 2, 4, 4 (censored), 5, 6, 9, 9(censored), 20 (censored), 20 (censored).

Before discussing the formal definition of thismethod, it is instructive to think through howwe might interpret these data. Let us start withthe active group. At the start of the study, all 10participants are at risk of reporting the AE.Therefore, at day 0 (the day before the start ofstudy treatment), the probability of survivingday 0 without having experienced the AE is 1.00(we accept this as a given when we define thisanalysis formally). On day 1, 1 participant out of10 at risk reported the AE. The probability of anAE on day 1 is 1/10 or 0.10 (that is, 10%). Thisparticipant is no longer at risk of reporting theAE later. On day 2 there are nine participants atrisk and on this day one more participantreported the event. The probability of an AE onday 2 is 1/9 or 0.11.

This also leaves eight participants at risk onday 3. On day 3 no participant reported theevent. Of the eight participants who were at riskon day 4, one reported the AE and one droppedout (that is, was “censored” from the analysis).As before, the probability of an AE occurring is

calculated relative to the number at risk, that is,1/8 or 0.13. On day 5 there are only six partici-pants still at risk. These data are provided in thefirst five columns of Table 8.4 for the activegroup, and the same interpretation followsthrough the end of the 20-day study.

The primary interest in this analysis is notwhat happens at a single time point, but ratherwhat happens at time t and all points precedingtime t. This leads us to the final column of Table 8.4. The numbers in this last column arethe estimated probabilities of participantssurviving the interval time t without havingreported the AE. Given these data, it becomespossible to compare among treatments the prob-ability of a participant not having the event ofinterest at any given time t.

This method has two desirable characteristicsthat a simple comparison of proportions doesnot have. First, it takes into account the variabletiming of AEs, which can occur if there is acause-and-effect relationship of drug to AE.Second, it takes into account the possibility thatnot all participants will remain at risk for thesame amount of time.

Having thought about this methodology inconceptual terms, we now address the necessarycalculations for arriving at the data presented inthe final column in Table 8.4. This methodologyis called the survival function.

8.14.1 The survival function

We introduced you to Bayes’ theorem in Chapter 6. According to this theorem, the condi-tional probability of A given B can be written as:

P(B | A)P(A | B) � ———— P(A).

P(B)

Or, equivalently, as:

P(A | B)P(A) � ———— P(B).

P(B | A)

In this methodology we define A as survivingthrough time t, and B as surviving through timet � 1. Then, P(A|B) is the probability of a partic-ipant surviving through time t given that he orshe has survived through all preceding times

110 Chapter 8 • Confirmatory clinical trials: Safety data I

Page 126: Introduction to Statistics in Pharmaceutical Clinical Trials

(t � 1), (t � 2), . . . , (1). In addition, P(B|A) is theprobability of surviving through t � 1 given thatthe participant survived through time t. Bydefinition, that probability is 1.00. Therefore, to calculate the conditional probability ofsurviving through time t, we need two pieces ofinformation:

1. The probability of surviving through time t given that the participant survived theprevious time

2. The probability of surviving the previousinterval.

At day 0 (before any participants are at risk), theprobability of surviving through time t is 1.00 bydefinition. On day 1 the probability of survivingthrough day 1 is the probability of survivingthrough day 1 given survival through day 0(that is, 1 minus the probability of the event onday 1 among those at risk), which is equal to

1 � 0.10 � 0.90 times the probability ofsurviving through day 0 (1.00). That is, theprobability is 0.90 1.00 � 0.90. Therefore, tocalculate the probability in the last column weuse the cumulative survival probability (lastcolumn) for the previous time and the probabilityof the event in the interval among those at risk.

Sometimes, these data are presented in ashorter table that displays only those time pointsat which an individual had an event or wascensored, and thus the only values of time forwhich the probability of survival changes. It ismore common, however, to see analyses of thistype displayed graphically. The Kaplan–Meierestimate of the survival distribution is displayedfor both groups in Figure 8.3. The survival curvesdisplayed in the figure are termed “step func-tions” because of their appearance. We return tothe interpretation of Figure 8.3 after we havefully specified the survival distribution function.

Kaplan–Meier estimation of the survival function 111

Table 8.4 Event times for the active group in study 2

Time (day), t Individuals at risk Individuals Probability of AE Individuals Probability offor the AE before reporting AE at time t among dropping surviving throughtime t at time t those at risk out at time t time t without AE

0 10 0 0 0 1.001 10 1 0.10 0 0.902 9 1 0.11 0 0.803 8 0 0 0 0.804 8 1 0.13 1 0.705 6 1 0.17 0 0.586 5 1 0.20 0 0.467 4 0 0 0 0.468 4 0 0 0 0.469 4 1 0.25 1 0.35

10 2 0 0 0 0.3511 2 0 0 0 0.3512 2 0 0 0 0.3513 2 0 0 0 0.3514 2 0 0 0 0.3515 2 0 0 0 0.3516 2 0 0 0 0.3517 2 0 0 0 0.3518 2 0 0 0 0.3519 2 0 0 0 0.3520 2 0 0 2 0.35

Page 127: Introduction to Statistics in Pharmaceutical Clinical Trials

8.14.2 Kaplan–Meier estimation of asurvival distribution

The survival function is the probability that aparticipant survives (that is, does not experiencethe event) longer than time t:

S(t) � P (participant survives longer than t).

By definition, a participant cannot experiencethe event until he or she is at risk of the event,so will survive longer than time 0 or, equiva-lently, S(0) � 1. Also, we accept as a given that, ifwe waited an infinite amount of time, an indi-vidual would eventually experience the event nomatter how rare. Therefore, the survival distribu-tion at infinity is defined to be 0, or S() � 0.We also define the following:

• ti is the unique event time, where i � 1, 2, . . ., i.• ni is the number of participants who are at risk

just before ti

• mi is the number of participants with eventsat time ti

• ci is the number of participants censored inthe interval (ti, ti�1).

The Kaplan–Meier estimate of the survivalfunction at time t is:

mtS(t) � P (1 � –––).ti �t

nt

We can write this series of products out in full asfollows:

m1 m1 m2 mt�1 mtS(t) � 1 (1 �–––) (1 �–––)(1 �–––) . . . (1 �–––––)(1 �–––).

n1 n1 n2 nt�1 nt

This expression means that the probability ofsurviving past time t is the product of the proba-bility of surviving time t conditional uponsurviving all preceding time points and the prob-ability of surviving all other preceding timepoints.

112 Chapter 8 • Confirmatory clinical trials: Safety data I

Surv

ival

dis

tribu

tion

func

tion

0.0

days

group � Active

group � Placebo

Censored group � Active

Censored group � Placebo

0.00

2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0

0.25

0.50

0.75

1.00

Figure 8.3 Kaplan–Meier estimate of the survival distribution for adverse event A

Page 128: Introduction to Statistics in Pharmaceutical Clinical Trials

The variance of the survival distributionfunction at time t is:

mtvar[S(t)] � [S(t)] R –––––––––––.t(i)�t

nt (nt � mt )

Consequently, we can take the square root of thevariance to obtain the standard error and calcu-late a (1 � a)% confidence interval:

________S(t) � Z1�a/2 �var[S(t)].

A common measure of central tendency fromthe Kaplan–Meier estimate is the mediansurvival time (note that this can be estimatedonly if more than half the participants experi-ence the event). The median survival time is theearliest value of t such that the probability ofsurvival is � 0.5. Note that when observationsare censored any estimate of the mean is biasedbecause, technically, the event would eventuallyoccur if we followed participants indefinitely.

We now return to our example to work withsome of these expressions. Looking at the lastcolumn in Table 8.4 (the estimated survivaldistribution), we can see that the probability ofsurviving day 5 is 0.58. Similarly the probabilityof surviving day 6 is 0.46. Therefore, the esti-mated median time to an AE in the active groupis 6 days, the earliest time at which the proba-bility of survival is � 0.5. For a comparison, themedian time to an AE is 16 days in the placebogroup. The graphical representation of thesurvival distribution in Figure 8.3 can also beused to estimate the median time to event.

In Figure 8.3 the survival distribution isplotted against time. As can be seen from thetabular presentation of these estimates in Table 8.4, the survival estimate changes onlywhen there is an event. In the active group onday 1 the estimate is 0.9 and then it drops downto 0.8 on day 2. An important property of thestep function defined using discrete event timesis that it is a discontinuous function (that is, notdefined) between event times. For example, thesurvival distribution function is 0.46 on days 6,7 and 8, and then at day 9 the estimate is 0.35.Looking at the Kaplan–Meier curve for the activegroup you could read day 9 as having an esti-mate of 0.35 or 0.46, but it is appropriate toremember that the outside edge of the step (right

at day 9) is discontinuous, and thus the esti-mated probability of survival for day 9 or later is0.35.

Using this guideline we can read off themedian survival times by drawing a referenceline across Figure 8.3 at S(t) � 0.50 and findingthe earliest value of time on the curve below thereference line. We leave it to you to verify themedian times of 6 and 16 days for the active andplacebo groups, respectively, using this method.

The point estimate of the probability ofsurviving past day 6 is 0.46 for the active group.Using the notation above, we write S(6) � 0.46.We can now calculate a 95% confidence intervalabout this estimate. The first step is to calculatethe variance about the estimate. Using theexpression above and point estimate and thenumber of events and participants at risk at eachtime point before day 6, we obtain the following:

1 1 1 1var[S(6)] � (0.46)2 [––––– � –––– � –––– � ––––]10(9) 9(8) 8(7) 6(5)

1 1 1 1� (0.46)2 [––– � ––– � ––– � –––]90 72 56 30

� 0.016.

As we have chosen a confidence level of 95%,the corresponding value of Z (the reliabilityfactor) is 1.96. Finally, the 95% confidenceinterval is calculated as follows:

0.46 � 1.96 (0.016), i.e., (0.43, 0.49).

That is, we are 95% confident that the true prob-ability of not experiencing the event (surviving)past day 6 is in the interval (0.43, 0.49).

The Kaplan–Meier estimate is a non-parametric method that requires no distribu-tional assumptions. The only assumptionrequired is that the observations are indepen-dent. In the case of this example, the observa-tions are event times (or censoring times) foreach individual. Observations on unique studyparticipants can be considered independent. Theconfidence interval approach described here isconsistent with the stated preference for estima-tion and description of risks associated with newtreatments. A method for testing the equality ofsurvival distributions is discussed in Chapter 11.

Kaplan–Meier estimation of the survival function 113

Page 129: Introduction to Statistics in Pharmaceutical Clinical Trials

8.14.3 Cox’s proportional hazards model

Although we do not cover them in detail, thereare parametric methods to analyze time to eventdata of this type, the most notable of which isCox’s proportional hazards model.

A hazard can be thought of as the risk of theevent in a small interval of time, given survivalup to the start of the short interval. Parametricapproaches to time-to-event data such as Cox’smodel have a number of advantages, includingthe ability to adjust for other explanatory effectsin a model and to extend them to recurringevents for a single individual. In this case, eventtimes would not be independent because within-participant event times would be correlated.Such an approach is appealing statisticallybecause it makes use of more data. However, the main disadvantage of Cox’s model is that thesingle parameter of the model, the ratio of thehazards of two groups, is assumed to be constantover time. The risk of an AE for participantstreated with an active drug could vary in anonconstant manner over time relative to therisk for placebo-treated participants, makingsuch an assumption tenuous.

8.14.4 Considerations for the use ofKaplan–Meier estimation for AEs

We suggest the use of the Kaplan–Meier estimatefor a better understanding of the risk of AEs inclinical trials for two reasons. First, the propor-tional hazards model has important assumptionswhich must be made. Secondly, theKaplan–Meier method is easier to implementand interpret. The analysis of AEs using theKaplan–Meier method allows us to account forthe different lengths of time at risk withoutmaking any significant assumptions about theshape of the underlying distributions of thesurvival or hazard functions. Reviewing therather exaggerated data from Figure 8.3 it mayseem obvious that ignoring the time at risk couldbe problematic. Employing an appropriatemethod of analysis (for example, properlyaccounting for all individuals and calculating aninterval estimate for the proportion) does notnecessarily mean that the analysis is the most

appropriate one. Consideration should be givento the varying lengths of follow-up or “time atrisk” when reporting AEs. It is wise to considerthe denominator carefully when making anystatement about probabilities.

A final word of caution here is that, althoughthe Kaplan–Meier method (and other methodsfor time-to-event data) appropriately accountsfor the time at risk of an event within a group, ifthe pattern of censoring is dependent on thetreatment (for example, suppose the dropoutrate is dose dependent as might be seen withchemotherapy), any treatment group compar-isons of the estimate of the risk of AEs would bepotentially biased. Thus, a more completeanalysis would include first an assessment ofcensoring times (visually at a minimum) and thereason for drop out, and then the appropriateanalysis to account for the time at risk. Failing toquantify the probability of an AE accuratelyduring drug development can have significantimplications for sponsors, regulatory authorities,prescribing clinicians, and patients.

114 Chapter 8 • Confirmatory clinical trials: Safety data I

8.15 Review

1. What measures are taken to ensure that AE dataare of a high quality?

2. Refer to Table 8.2. Calculate a two-sided 99%confidence interval for the proportion ofparticipants reporting any event in the:

(a) placebo group(b) active dose groups combined.

3. In a therapeutic exploratory trial, 22 participantsout of 140 reported an AE:

(a) What is the 95% confidence interval for thesample proportion of participants reporting anAE?

(b) What is the 99% confidence interval for thesample proportion of participants reporting anAE?

(c) How confident would you be that the truepopulation proportion of participants reportingan AE does not exceed 0.18?

Page 130: Introduction to Statistics in Pharmaceutical Clinical Trials

8.16 References

Chow S-C, Liu J-P (2004). Design and Analysis of ClinicalTrials: Concepts and methodologies. Chichester: JohnWiley & Sons.

EMEA Committee for Proprietary Medicinal Products(CPMP) (2002). Points to Consider on MultiplicityIssues in Clinical Trials. London: EMEA.

Fleiss JL, Paik MC, Levin B (2003). Statistical Methods forRates and Proportions, 3rd edn. Chichester: JohnWiley & Sons.

ICH Guidance E2A (1995). Clinical Safety Data Manage-ment: Definitions and Standards for ExpeditedReporting. Available at: www.ich.org (accessed July 12007).

ICH Guidance E6 (R1) (1996). Good Clinical Practice.Available at: www.ich.org (accessed July 1 2007).

Kaplan EL, Meier P (1958). Nonparametric estimationfrom incomplete observations. J Am Statist Assn53:457–481.

Mann R, Andrews E, eds (2007). Pharmacovigilance, 2ndedn. Chichester: John Wiley & Sons.

O’Neill RT (1987). Statistical analyses of adverse eventdata from clinical trials: special emphasis on seriousevents. Drug Information J 21:9–20.

US Food and Drug Administration (2005). Conducting aClinical Safety Review of a New Product Applicationand Preparing a Report on the Review. Available fromwww.fda.gov (accessed July 1 2007).

References 115

4. A total of 290 participants were studied in the firsttherapeutic exploratory trial of an investigationalantihypertensive drug. Of the 150 individualstreated with the test treatment, 32 reported fatigue.Of the 140 treated with placebo, 19 reportedfatigue:

(a) Calculate a 90% confidence interval for thedifference in proportions of participantsreporting fatigue.

(b) Calculate a 95% confidence interval for thedifference in proportions of participantsreporting fatigue.

(c) Calculate a 99% confidence interval for thedifference in proportions of participantsreporting fatigue.

(d) What is the statistical interpretation of theseresults?

(e) How might these results influence the course offuture development of the drug?

5. Why is it important to account for the time thatindividuals are at risk of an AE?

6. Describe in your own words what a survivalfunction is.

Page 131: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 132: Introduction to Statistics in Pharmaceutical Clinical Trials

9.1 Introduction

Chapter 8 focused on adverse event (AE) data, alarge component of the overall safety datacollected in clinical trials. Although AE data areoften presented descriptively, we demonstratedthat it is indeed possible to conduct inferentialstatistical analyses using AE data. This chapterdiscusses other safety data, including laboratorydata, vital signs, and an assessment of cardiacsafety that involves investigation of the cardiacQT interval (the QT interval can be identifiedon the ECG, as seen in Figure 9.2). In each ofthese cases, descriptive statistics, includingmeasures of central tendency and dispersion,and categorical data are common forms ofassessment.

9.2 Analyses of clinical laboratory data

Safety monitoring in clinical studies can be bothdata and labor intensive. In the context of later-stage therapeutic exploratory and therapeuticconfirmatory trials, the collection of laboratorydata is no exception. Typically, participants inclinical trials provide blood or urine samples atevery clinic visit. There is an expansive range ofclinical chemistry tests that can be conductedusing these samples.

Samples may be analyzed by laboratories asso-ciated with each site (sometimes called locallabs), each with its own handling procedures,assays, and reporting conventions, but this is notan optimal strategy. The use of a site’s own labor-atory poses no difficulties when the emphasis ison medical care, that is, the values obtained fora single individual. However, when conducting

clinical research the emphasis is on using datafrom a group of individuals to make optimallyinformed conclusions and decisions.

Differences from local lab to local lab maypreclude a sponsor from meaningfullycombining data from all participants across anumber of investigative sites. A statisticalapproach to standardizing laboratory values froma number of different labs (each potentially withtheir own reference ranges) has been described byChuang-Stein (1992). However, standardizationis time-consuming and the use of a number oflocal labs can introduce unwanted sources ofvariability that are neither easily quantified noraccounted for.

To overcome the difficulties with using locallabs the use of central laboratories (central labs)is desirable. The advantages of using a central labare that the samples are handled in a similarfashion, the assays used are consistent overtime and across individuals, and the reportingconventions (for example, units of measurement)are uniform. Techniques for proper samplecollection, storage, and handling, includingshipment to a central lab, should be included instudy protocols. Once the samples have beenobtained by the central lab they are analyzedand the data recorded in a database that includesparticipant identifiers, study visit, date and timeof sample collection, test name, result, reportingunits, and the value of the reference (“normal”)range.

The determination of values for referenceranges is based on the distribution of test valuesin large samples. Reference ranges are deter-mined using large databases from a generalpopulation and typically represent “2r” limits,assuming that the values are normally distrib-uted in the general population. The lower limit

9Confirmatory clinical trials:Safety data II

Page 133: Introduction to Statistics in Pharmaceutical Clinical Trials

of the reference range is the value that cuts offthe lowest 2.5% of values from individuals in thegeneral population (l �2r). Likewise, the upperlimit of the reference range is the value that cutsoff the highest 2.5% of values from individualsin a general population (l � 2r). Reference rangesfor certain parameters (for example, hematocrit)may be defined specific to age and gender.Whichever approach is employed, local orcentral labs, the reference ranges are providedwith lab values themselves to gauge the extent towhich an individual’s value is considered withinan expected range or extreme.

In ICH Guidance E3 (1995), several analyses ofclinical laboratory data are recommended. Theapproaches to describing clinical laboratory datainclude:

• measures of central tendency (for example,means or medians) for all groups at all timepoints examined

• shift analyses that classify laboratory values atbaseline and later time points as normal, lowor high relative to a reference range

• description of the number and proportion ofparticipants for whom a change of a specifiedmagnitude or more was reported at a partic-ular time point. This is typically called aresponders’ analysis

• graphical displays of each subject’s baselinevalue plotted against an on-treatment and/orend-of-study value

• identification of individual values that areso extreme that they would be consideredclinically significant.

9.2.1 Measures of central tendency ateach time point

Laboratory values are summarized descriptivelyfor continuous measures by displaying thesample size, measures of central tendency(including the mean and median), the standarddeviation, and the minimum and maximumvalues. A sample of such a descriptive display isprovided in Table 9.1.

As the primary comparison is among orbetween treatment groups, the groups aredisplayed in the columns. Values of each testover time are of secondary interest and, there-fore, are placed on the rows of the table. Readingbetween columns, we can see if the typical value(for example, the mean) for a parameter differsbetween groups. It is also possible to read downthe column (that is, across time within a group)to see how the typical values vary over time.Provision of the minimum and maximum valuesallows the reviewer to identify any extremevalues that might be considered out of thenormal range. On occasion, similar analyses mayalso be presented for change from baselinevalues (typically calculated as endpoint valueminus baseline value). If there are consistent andsystematic changes from the start of the study,they may be apparent by examining the meanvalues and looking for values that deviateconsiderably from zero.

It may be of interest to provide a confidenceinterval for the change from baseline valuewithin a group where an interval estimate that

118 Chapter 9 • Confirmatory clinical trials: Safety data II

Table 9.1 Summary of hemoglobin values (g/dL)

Treatment group

Visit Statistic Placebo Active

Baseline n 20 20Mean (SD) 13.78 (1.97) 14.61 (2.05)Median 13.5 14.6Min., Max. 11.0, 17.3 11.2, 17.7

Endpoint (last visit) n 20 20Mean (SD) 13.41 (2.07) 13.75 (2.00)Median 13.3 13.5Min., Max. 10.6, 16.9 10.4, 17.2

Page 134: Introduction to Statistics in Pharmaceutical Clinical Trials

excludes zero represents evidence of a change inmean value that exceeds what might be observedby chance alone. Similarly, confidence intervalsmay be calculated to provide an estimate of thebetween-group difference in a laboratory para-meter. Comparison with a control group can beespecially important when there is a laboratorytest that changes as a result of study procedures(for example, decreases in hematocrit or hemo-globin as a consequence of frequent bloodsampling). A summary of the change frombaseline at the last visit is provided in Table 9.2.

9.2.2 A confidence interval for a meanwith unknown variance

For a sample size of n observations of a randomvariable, the sample mean, an estimator of thepopulation mean, is calculated as:

n

R xi

i�1x � –––—–n

and the sample standard deviation, s, an esti-mator of the population standard deviation iscalculated as:

n____________

R (xi � x)2

i�1s � �–––––––––—–.n � 1

The standard error of the sample mean is thencalculated as:

sSE(x) � –––.�__n

Finally, assuming that the random variable isnormally distributed (or at least symmetricallydistributed with a sample size � 30), a (1 �a/2)%confidence interval is:

x � t1�a/2,n�1SE(x),

where t1�a/2,n�1 represents the reliability factorand is the value of the t distribution with n �1degrees of freedom (df) to the left of which is (1 �a/2)% of the area under the curve. Thesevalues are provided in Appendix 2.

As an example, let us calculate a 95% two-sided confidence interval for the mean hemo-globin value at the end of the study for the activegroup using data in Table 9.1.

Data

The data are 20 hemoglobin values from indi-viduals treated with the active drug, obtainedfrom blood samples collected at the last visit ofthe study. The mean and standard deviationwere calculated as 13.75 and 2.00, respectively,and these values serve as the basis of the confi-dence interval.

Statistical analysis

As the population variance is unknown and istherefore being estimated by the sample vari-ance, we use the t distribution for a reliabilityfactor. The use of the t distribution requires us toassume that the underlying distribution ofhemoglobin values is approximately normallydistributed, or at least symmetrically distributed.The standard error of the sample mean iscalculated as:

Analyses of clinical laboratory data 119

Table 9.2 Summary of change from baseline hemoglobin values (g/dL)

Treatment group

Visit Statistic Placebo Active

Endpoint (last visit) n 20 20Mean (SD) �0.37 (1.47) �0.86 (1.67)Median �0.4 �0.8Min., Max. �3.8, 2.1 �4.5, 1.4

Page 135: Introduction to Statistics in Pharmaceutical Clinical Trials

s 2.00SE(x) � ––– � –––– � 0.44.�__n �

___20

As we are interested in a 95% two-sided confi-dence interval, the value of the variate thatcuts off the upper 2.5% of area from the t distri-bution with 19 df is 2.093. Therefore, the 95%confidence interval for the mean hemoglobin iscalculated as follows:

13.75 � 2.093 (0.44) � (12.83, 14.67).

Interpretation and decision-making

From the confidence interval we can concludewith 95% confidence that the true populationmean hemoglobin is in the interval (12.83, 14.67).Assuming that the reference range is 12–15 g/dLfor females and 14–17 g/dL for males, we canproceed with development of the new drug withsome degree of assurance although gender-specific intervals would be more informative.

9.2.3 A confidence interval for thedifference in two means with equalunknown variance

Within-group confidence intervals can be infor-mative, but usually the primary interest in a clinical trial is to compare the effect of one treatment with that of another. Therefore, aconfidence interval for the difference in twomeans can better address the goals of theresearch.

For two independent groups 1 and 2, a samplesize of n1 observations of a random variable fromgroup 1 and n2 observations of a random variablefrom group 2, the sample means from eachgroup are:

n1

R x1i

i�1x1 � –––—– andn1

n2

R x2j

j�1x2 � –––—–, respectively.n2

The within-group sample variances areestimated as:

n1

R (x1i � x1)2

i�1s21 � ––––––––––—– and

n1 � 1

n2

R (x2j � x2)2

j�1s22 � ––––––––––—–, respectively.

n2 � 1

As before, these sample statistics are estimates ofthe unknown population parameters, the popu-lation means, and the population variances. Ifthe population variances are assumed to beequal, each sample statistic is a different estimateof the same population variance. It is thenreasonable to average or “pool” these estimatesto obtain the following:

(n1 � 1)s21 � (n1 � 1)s2

2s2p � –––––––––––––––––––––.

(n1 � n1 � 2)

The standard error of the difference in samplemeans is:

_______1 1SE(x1 � x2) � sp �–– � ––.n1 n2

Calculation of a confidence interval for thedifference in means requires an assumption ofnormal data (or, alternately, symmetricaldistributions with sample sizes of 30 ormore). If the population variances areassumed to be equal, a two-sided (1 � a)%confidence interval is:

(x1 � x2) � t1�a/2,n1+n2�2SE(x1 � x2),

where t1�a/2,n1+n2�2 represents the reliability factorand is the value of the t distribution withn1�n2�2 df to the left of which is (1 � a/2)% ofthe area under the curve.

To illustrate this methodology we use the datafrom Table 9.2 to calculate a between-groupdifference in the mean change from baselinehemoglobin at the end of the study.

120 Chapter 9 • Confirmatory clinical trials: Safety data II

Page 136: Introduction to Statistics in Pharmaceutical Clinical Trials

Data

The description of the data for this analysis isprovided in Section 9.2.1.

Statistical analysis

For this analysis we are required to use the t distribution, and therefore to make the assumption that the distribution of change from baseline values is normally or approxi-mately normally distributed. When calculating a between-group confidence interval, it is veryimportant to understand how the difference isbeing calculated and what the interpretationof the interval is, given the direction of thedifference.

In this case, each change from baseline value iscalculated as “endpoint minus baseline.” There-fore, a mean value of change from baseline thatwas � 0 would imply an increase from baseline,whereas a mean change from baseline value thatwas � 0 would imply a decrease from baseline.In this instance we are interested in the between-group difference in mean change from baseline.We interpret the calculated confidence intervalaccordingly.

To start, the point estimate for the between-group (active minus placebo) difference inmean change from baseline is (�0.86) � (�0.37)� �0.49. To calculate the standard error we firstneed to obtain an estimate of the pooled variance,which is calculated as follows:

(20 � 1)1.672 � (20 � 1)1.472 (19)2.79 � (19)2.16s2p � –––––––––———–—–––—––––– � –––––––——––––––– � 2.47.

(20 � 20 � 2) 38

The pooled standard deviation is calculated as:

�____2.47 � 1.57.

The standard error of the difference in means iscalculated as:

_______1 1SE(x1 � x2) � 1.57�–– � –– � 0.50.

20 20

The final component is to obtain the reliabilityfactor from the t distribution with 38 df, which

is 2.02. The 95% confidence interval for thedifference in means is therefore calculated as:

�0.49 � 2.02(0.50) � (�1.5, 0.52).

Interpretation and decision-making

On the basis of this confidence interval, theredoes not appear to be much of a differencebetween the groups with respect to a change inhemoglobin from baseline to the end of thestudy, particularly because the confidenceinterval includes the value 0.

9.2.4 Shift analysis

Another method used to analyze clinical labora-tory data is called a shift analysis. For this analysisthe data themselves are not the actual numericvalues of the laboratory test, but a categoricalordinal variable that indicates whether the valuewas within the reference range (normal), lowrelative to the reference range (low), or high rela-tive to the reference range (high). With theseclassifications on observations from baseline andsome other post-randomization time point (forexample, end of study), the primary interest is inthe proportion of individuals who shifted fromnormal to high or normal to low. Depending onthe parameter being investigated, a shift fromhigh to low or low to high may also be ofinterest.

A typical summary table representing this kindof analysis is provided in Table 9.3. As seen there25% of participants in the placebo group whohad normal values at baseline had low values atthe last visit. In the active group 20% of partici-pants experienced this shift from baseline to lastvisit.

9.2.5 Responders’ analysis

We noted earlier in this book that there is nosuch thing as an effective drug without someassociated risks. Some drugs may be known to be

Analyses of clinical laboratory data 121

Page 137: Introduction to Statistics in Pharmaceutical Clinical Trials

associated with a small but consistent change ina clinical laboratory parameter (for example,treatment with hydrochlorothiazide is oftenassociated with increases in blood glucose).Imagine a scenario in which a small change isnot troubling in itself. A concern may then be:What is the chance that an individual whoreceives the test treatment will have a change inthe lab test above a certain threshold, one thatwould no longer be trivial?

An analysis approach that may be informa-tively used here is to calculate a change frombaseline for each observation and then categor-ize the change from baseline value as either aresponder (that is, someone whose change frombaseline was less or greater than a specifiedvalue) or a non-responder (that is, someonewhose change from baseline was within thetolerable values of change). Whether or not adecrease or increase in the lab value is indicativeof harm depends on the laboratory test itself.The descriptive analysis for this type of dataincludes the presentation of counts and percent-ages (recall that these can be represented asproportions) of responders in each group. Asthere usually are a number of visits at which thelab test is performed, the analysis may bepresented for all post-baseline visits, the lastvisit, or both.

An extension of the responder analysisdescribed above would be to categorize thechange from baseline values into several (� 2)categories (for example, no change, increase X,increase � X).

9.2.6 Graphical displays of end-of-studyvalues plotted against baseline

One common element shared by a number ofthe analyses of laboratory data that we havedescribed is that the magnitude of change fromthe start to the end of the study is important,but so is the final value itself. In addition, therelative frequency of such outcomes is of vitalinterest when gauging the overall risk of treat-ment with a new drug. One descriptiveapproach to address several of these issues is agraphical one.

A scatter plot of each individual’s baselinevalue plotted against his or her end-of-studyvalue enables us to see how many individuals (inthe absolute or relative sense) had end-of-studyvalues beyond a normal level or changes frombaseline to end-of-study that represent a signifi-cant health risk. As an example, hemoglobinvalues at the end of the study have been plottedagainst the baseline value for two treatmentgroups (placebo and active) in Figure 9.1.

Note the diagonal line in each plot thatconnects all points for which the baseline valueis equal to the end-of-study value. With the end-of-study value on the y axis, points above thediagonal line represent an increase from baselineand points below represent a decrease from base-line. Larger vertical deviations from the diagonalline represent larger changes from baselinevalues. Thus, the need to interpret a number ofquantities at once is satisfied by one graphicaldisplay.

122 Chapter 9 • Confirmatory clinical trials: Safety data II

Table 9.3 Shift analysis of hemoglobin values

Baseline value

Placebo (n � 20) Active (n � 20)

Last visit Low Normal High Low Normal High

Low 3 (15%) 5 (25%) 0 1 (5%) 4 (20%) 1 (5%)Normal 2 (10%) 7 (35%) 2 (10%) 1 (5%) 11 (55%) 0High 0 1 (5%) 0 0 0 2 (10%)

Page 138: Introduction to Statistics in Pharmaceutical Clinical Trials

9.2.7 Clinically significant laboratoryvalues

A graphical display such as the one in Figure 9.1,or a table of summary descriptive statisticsincluding the minimum and maximum, mayreveal values that are so extreme that they meritadditional scrutiny. This is typically accom-plished with the use of a listing that provides allvalues of the laboratory test, the dates and timesof the sample collection, and the characteristicsof the participant. Such an analysis is not basedon aggregate information but rather on an indi-vidual observation. If a clinically significantobservation were noted a medical reviewerwould look to see if the participant’s valuesreturned to normal levels or remained abnormal,and if there were any accompanying AEs.

9.3 Vital signs

Vital signs typically measured in clinical trialsare blood pressure (both systolic or SBP and dias-tolic or DBP) and heart rate, often measured aspulse rate, in beats per minute. Weight mightalso be of interest. In our ongoing scenario of

the development of a new antihypertensivedrug, blood pressure measurements are efficacymeasurements, not safety measures as such.However, we discuss the use of blood pressures insafety assessment here because this is so commonin the development of non-antihypertensivedrugs.

As for laboratory data, both continuous andcategorical data analytical methods can beemployed here. Measures of central tendencyand dispersion are appropriate for continuousdata. Categories of interest, and the associatedcategorical data, can take various forms. Imaginea trial in which the treatment phase is 12 weeksand participants visit their investigational siteevery 2 weeks – that is, a baseline value takenbefore treatment commences is followed by sixvalues measured during the treatment phase. Itmay be of interest to know how many individ-uals show clinically significant vital sign changesduring the treatment period. In this case aprecise definition of clinically significant mustbe provided in the study protocol. The followinghypothetical changes in vital signs might beconsidered of clinical significance by clinicianson the study team if they occurred at any ofthe six measurement points in the treatmentphase:

Vital signs 123

Hgb

(g/d

L) a

t end

poin

t

10

Hgb (g/dL) at baseline(a)

10

11 12 13 14 15 16 17 18

11

12

13

14

15

16

17

18

Hgb

(g/d

L) a

t end

poin

t

10

Hgb (g/dL) at baseline(b)

10

11 12 13 14 15 16 17 18

11

12

13

14

15

16

17

18

Figure 9.1 Scatterplot of hemoglobin values at baseline and end of study: (a) Placebo; (b) active

Page 139: Introduction to Statistics in Pharmaceutical Clinical Trials

• an increase from baseline in SBP � 20 mmHg• an increase from baseline in DBP � 12 mmHg• an increase from baseline in SBP � 15 mmHg

and an increase in DBP � 10 mmHg• a pulse rate � 120 beats/min and an associated

increase from baseline of at least 15 beats/min.

The clinicians on the study team might also beinterested in sustained changes in vital signs.Hypothetical examples of definitions ofsustained changes might be:

• an increase from baseline in SBP � 15 mmHgat each of three consecutive visits

• an increase from baseline in DBP � 10 mmHgat each of three consecutive visits

• an increase from baseline in pulse rate � 10beats/min at each of three consecutive visits.

Appropriate categorical analyses could then beused with these data.

9.4 QT interval prolongation andtorsades de pointes liability

The ECG is a very recognizable pattern of bio-logical activity. The ECG consists of the P wave,the QRS complex, and the T wave. These compo-nents, represented in Figure 9.2, are associatedwith different aspects of the cardiac cycle: Atrialactivity, excitation of the ventricles, and repolar-ization of the ventricles, respectively. Modern

computerized systems not only display theseelectrophysiological signals but also concur-rently digitize them and store them for laterexamination.

The QT interval is highlighted in Figure 9.2.This interval is of particular interest in assessingcardiac safety in drug development, because QTinterval prolongation is one potentially informa-tive surrogate biomarker available for veryserious cardiac events including sudden cardiacdeath. (This section focuses on cardiac safetyassessment in all systemically available drugsbeing developed for uses other than the controlof cardiac arrhythmias: QT/QTc intervalprolongation – an occurrence deemed highlyundesirable in all other drugs – can occur withantiarrhythmic drugs as a consequence of theirmechanism of clinical efficacy [ICH GuidanceE14, 2005].) The ICH Guidance E14 addressesthe evaluation of QT intervals in clinicaldevelopment programs.

The time interval between the onset of theQRS complex and the offset of the T wave isdefined as the QT interval. Consider an indi-vidual with a steady heart rate of 60 beats/min, anumber chosen to make the math easy in thisexample. This represents one heart beat/second,and so the total length (in the time domain) ofall ECG segments during one beat would add upto 1 second, represented in this research field as1000 milliseconds (ms). Each component of theECG can therefore be assigned a length, or dura-tion, in milliseconds. The length of the QTinterval can be obtained by measurement frominspecting the ECG and identifying the QRSonset and the T-wave offset.

As the heart beats faster (heart rate increases),the duration of an individual cardiac cycledecreases, because more cardiac cycles now occurin the same time. Therefore, as the cardiac cycleshortens, so do each of the components of thecardiac cycle. This means that the QT intervalwill tend to be shorter at a higher heart rate. Asit is of interest to examine the QT interval atvarious heart rates, the interval can be“corrected” for heart rate. This leads to the termQTc, which is calculated (by one of severalmethods including two corrections attributed toBazett and Fridericia), taking into account theactual QT and the heart rate (the duration of

124 Chapter 9 • Confirmatory clinical trials: Safety data II

QRS onset

QS

P

T

R

T offset

QT interval

Figure 9.2 Stylized representation of the ECG, showingthe QT interval

Page 140: Introduction to Statistics in Pharmaceutical Clinical Trials

the entire cardiac cycle, sometimes referred to asthe RR interval) at that point. The title of ICHGuidance E14 uses the term “QT/QTc interval” toindicate that both QT and QTc are of interest: Inthis book, the term “QT interval” representsboth QT and QTc.

It is of considerable interest in drug develop-ment to determine whether the investigationaldrug under development leads to prolongationof the QT interval: Although QT interval prolon-gation can be congenital, it can also be acquired,for example, induced by drug therapy. QTprolongation, which represents delayed cardiacrepolarization of the myocardial cells, is regardedas a potentially very informative surrogatemarker for certain dangerous cardiac arrhyth-mias, namely polymorphic ventricular tachy-cardia and torsades de pointes,and sudden cardiacdeath. Extensive ECG monitoring during preap-proval clinical trials is therefore a critical part ofclinical development programs, and results fromthis testing must be presented to a regulatoryagency to obtain marketing approval. One ofthe biggest causes of delay in getting a newdrug approved by a regulatory agency, or failureto be given marketing approval, is cardiacsafety issues, and therefore the choice of thecorrect study design, appropriate methodologyfor collecting optimum quality data, and appro-priate statistical analyses are of tremendousimportance.

Although ICH Guidance E14 (2005) providesguidance on each of these considerations, wefocus here on the statistical approaches thatshould be taken in the investigation of QTprolongation. As this guidance noted (p 9), “TheQT/QTc interval data should be presented bothas analyses of central tendency (for example,means, medians) and categorical analyses. Bothcan provide relevant information on clinical riskassessment.” The effect of the investigationaldrug on the QT intervals is most commonlyanalyzed using the largest time-matched meandifference between the drug and placebo(adjusted for baseline) over the data collectionperiod.

Categorical analyses are based on the numberand the percentage of individuals who meet orexceed a predefined upper limit. Such limits canbe stated in the study protocol in terms of either

absolute QT interval prolongation values orchanges from baseline. At this time, there is noconsensus concerning what is the “best” choiceof these upper limit values. ICH Guidance 14therefore suggested that multiple analyses usingseveral predefined limits is a reasonableapproach in light of this lack of consensus. Forabsolute QT interval data, the guidance suggestsproviding absolute numbers and percentages ofindividuals whose QT intervals exceed 450, 480,and 500 ms. For change-from-baseline QTinterval data, the same information might beprovided for increases exceeding 30 ms andthose exceeding 60 ms. The design and analysisof studies intended to evaluate changes in QTcan be rather difficult to implement. Some of thedifficulties and areas for further research broughtto light by ICH Guidance E14 are discussed in arecent paper (Pharmaceutical Research andManufacturers of America QT Statistics ExpertWorking Team, 2005).

For further discussion of QT/QTc intervalprolongation and other cardiac safety assess-ments for noncardiac drugs, see Morganroth andGussak (2005) and Turner and Durham (2008).

9.5 Concluding comments on safetyassessments in clinical trials

In this chapter we have seen that the goal ofsafety analyses is to cast a wide net in the hopesof identifying any events that may be attribut-able to treatment with the new drug. Such abroad search, however, also has a significantdisadvantage: If we look at so many outcomes,we might find one that looks problematic just bychance alone. Rather than rely solely on statis-tical approaches to limit the chance of thisoccurring, a sensible approach is to substantiatesuch a finding with additional data, either asimilar result in a different study or some data onthe medical explanation for the event (themechanism of action).

The analysis tools that we have described inthis chapter provide ways to evaluate the risk ofthe new drug, given the constraints of samplesizes obtainable in clinical development. Thelimitations of relatively little human experience

Concluding comments on safety assessments in clinical trials 125

Page 141: Introduction to Statistics in Pharmaceutical Clinical Trials

before marketing approval have to be consid-ered, especially when reviewing clinical safetydata. Thus far, regulatory agencies have notrequired pharmaceutical companies to increasethe sizes of their studies to find the best way touncover safety risks that would otherwise behard to find. Rather, the emphasis has been touse more modern tools (for example, geneticsand candidate screening) to identify potentiallydangerous drugs before there are a large numberof participant exposures (US Department ofHealth and Human Services, FDA, 2004). Therole of postmarketing surveillance will continueto be important (see also ICH Guidance E2E,2004; Strom, 2005; Mann and Andrews, 2007).This is especially true when we think of the rela-tive homogeneity of participants in clinical trialscompared with patients in the real world and theimplications of the law of large numbers (recalldiscussions in Chapter 6).

As a final note to this chapter, any potentialrisks to individuals treated with a new drug haveto be considered and cannot automatically beconsidered trivial. The acceptability of themagnitude of the risk depends largely on astatistical demonstration of the expected benefitof the new treatment, which is the topic ofChapters 10 and 11.

9.7 References

Chuang-Stein C (1992). Summarizing laboratory datawith different reference ranges in multi-centertrials. Drug Information J 26:74–84.

ICH Guidance E2E (2004). Pharmacovigilance Planning.Available at: www.ich.org (accessed July 1 2007).

ICH Guidance E3 (1995). Structure and Content ofClinical Study Reports. Available at: www.ich.org(accessed July 1 2007).

ICH Guidance E14 (2005). The Clinical Evaluation ofQT/QTc Interval Prolongation and ProarrhythmicPotential for Non-Antiarrhythmic Drugs. Available at:www.ich.org (accessed July 1 2007).

Mann R, Andrews E, eds (2007). Pharmacovigilance, 2ndedn. Chichester: John Wiley & Sons.

Morganroth J, Gussak I, eds (2005). Cardiac Safety ofNoncardiac Drugs: Practical guidelines for clinicalresearch and drug development. Totowa, NJ: HumanaPress.

Pharmaceutical Research and Manufacturers ofAmerica QT Statistics Expert Working Team (2005).Investigating drug-induced QT and QTc prolon-gation in the clinic: a review of statistical designand analysis considerations: report from thePharmaceutical Research and Manufacturers ofAmerica QT Statistics Expert Team. Drug InformationJ 39:243–266.

Strom BL, ed. (2005). Pharmacoepidemiology, 4th edn.Chichester: John Wiley & Sons.

Turner JR, Durham TA (2008). Integrated Cardiac Safety:Assessment methodologies for noncardiac drugs indiscovery, development, and postmarketing surveillance.Hoboken, NJ: John Wiley & Sons, in press.

US Department of Health and Human Services, Foodand Drug Administration (2004). Challenge andOpportunity on the Critical Path to New Medical Prod-ucts. Available from www.fda.gov (accessed July 12007).

126 Chapter 9 • Confirmatory clinical trials: Safety data II

3. What is the statistical and clinicalinterpretation (or relevance) of the following95% confidence intervals for the between-group difference (for example, test groupminus placebo) in mean change from baselinehemoglobin (g/dL) at endpoint (last visit)?

(a) (�1.2, 2.6)(b) (1.7, 3.4)(c) (�6.2, �2.3).

9.6 Review

1. What are some advantages and disadvantages ofthe various analytical approaches cited from ICHGuidance E3 listed in Section 9.2?

2. Refer to the data in Table 9.1:

(a) Calculate a two-sided 90% confidence intervalfor the difference in mean hemoglobin value atendpoint (last visit).

(b) Calculate a two-sided 95% confidence intervalfor the difference in mean hemoglobin value atendpoint (last visit).

(c) Calculate a two-sided 99% confidence intervalfor the difference in mean hemoglobin value atendpoint (last visit).

(d) What is the statistical interpretation of theseresults?

Page 142: Introduction to Statistics in Pharmaceutical Clinical Trials

10.1 Introduction: Regulatory views ofsubstantial evidence

When thinking about the use of statistics in clin-ical trials, the first thing that comes to mind formany people is the process of hypothesis testingand the associated use of p values. This is veryreasonable, because the role of a chance outcomeis of utmost importance in study design and theinterpretation of results from a study. A sponsor’sobjective is to develop an effective therapy thatcan be marketed to patients with a certain diseaseor condition. From a public health perspective,the benefits of a new treatment cannot be sepa-rated from the risks that are tied to it. Regulatoryagencies must protect public health by ensuringthat a new treatment has “definitively” beendemonstrated to have a beneficial effect. Themeaning of the word “definitively” as used hereis rather broad, but we discuss what it means inthis context – that is, we operationally define theterm “definitively” as it applies to study design,data analysis, and interpretation in new drugdevelopment.

Most of this chapter is devoted to describingvarious types of data and the correspondinganalytical strategies that can be used to demon-strate that an investigational drug, or testtreatment, is efficacious. First, however, it isinformative to discuss the international stand-ards for demonstrating efficacy of a new product,and examine how regulatory agencies haveinterpreted these guidelines. ICH Guidance E9(1998, p 4) addresses therapeutic confirmatorystudies and provides the following definition:

A confirmatory trial is an adequately controlledtrial in which the hypotheses are stated in

advance and evaluated. As a rule, confirmatorytrials are necessary to provide firm evidence ofefficacy or safety. In such trials the key hypoth-esis of interest follows directly from the trial’sprimary objective, is always pre-defined, and isthe hypothesis that is subsequently tested whenthe trial is complete. In a confirmatory trial it isequally important to estimate with due precisionthe size of the effects attributable to the treat-ment of interest and to relate these effects totheir clinical significance.

It is common practice to use earlier phasestudies such as therapeutic exploratory studiesto characterize the size of the treatment effect,while acknowledging that the effect size foundin these studies is associated with a certainamount of error. As noted earlier, confidenceintervals can be helpful for planning confirma-tory studies. The knowledge and experiencegained in these earlier studies can lead tohypotheses that we wish to test (and hopefullyconfirm) in a therapeutic confirmatory trial, forexample, the mean reduction in systolic bloodpressure (SBP) for the test treatment is 20 mmHggreater than the mean reduction in SBP forplacebo. As we have seen, a positive result froma single earlier trial could be a type I error, so asecond study is useful in substantiating thatresult.

The description of a confirmatory study inICH Guidance E9 (1998) also illustrates theimportance of the study design employed. Thestudy should be designed with several importantcharacteristics:

• It should test a specific hypothesis.• It should be appropriately sized.• It should be able to differentiate treatment

effects from other sources of variation (for

10Confirmatory clinical trials: Analysis of categorical efficacy data

Page 143: Introduction to Statistics in Pharmaceutical Clinical Trials

example, time trends, regression to the mean,bias).

• The size of the treatment effect that is beingconfirmed should be clinically relevant.

The clinical relevance, or clinical significance, ofa treatment effect is an extremely importantconsideration. The size of a treatment effect thatis deemed clinically relevant is best defined bymedical, clinical, and regulatory specialists.

Precise description of the study design andadherence to the study procedures detailed inthe study protocol are particularly important forconfirmatory studies. Quoting again from ICHGuidance E9 (1998, p 4):

Confirmatory trials are intended to provide firmevidence in support of claims and hence adher-ence to protocols and standard operating proce-dures is particularly important; unavoidablechanges should be explained and documented,and their effect examined. A justification of thedesign of each such trial, and of other importantstatistical aspects such as the principal featuresof the planned analysis, should be set out inthe protocol. Each trial should address only alimited number of questions.

Confirmatory studies should also providequantitative evidence that substantiates claimsin the product label (for example, the packageinsert) as they relate to an appropriate popula-tion of patients. In the following quote fromICH Guidance E9 (1998, p 4), the elements ofstatistical and clinical inference can be seen:

Firm evidence in support of claims requires thatthe results of the confirmatory trials demonstratethat the investigational product under test hasclinical benefits. The confirmatory trials shouldtherefore be sufficient to answer each key clinicalquestion relevant to the efficacy or safety claimclearly and definitively. In addition, it is impor-tant that the basis for generalisation . . . to theintended patient population is understood andexplained; this may also influence the numberand type (e.g. specialist or general practitioner) ofcentres and/or trials needed. The results of theconfirmatory trial(s) should be robust. In somecircumstances the weight of evidence from asingle confirmatory trial may be sufficient.

The terms “firm evidence” and “robust” donot have explicit definitions. However, as clin-ical trials have been conducted and reported inrecent years, some practical (operational) defini-tions have emerged, and these are discussedshortly.

In its guidance document Providing ClinicalEvidence of Effectiveness for Human Drug andBiological Products, the US Food and DrugAdministration (US Department of Health andHuman Services, FDA, 1998) describes theintroduction of an effectiveness requirementaccording to a standard of “substantial evidence”in the Federal Food, Drug, and Cosmetic Act(the FDC Act) of 1962:

Substantial evidence was defined in section 505(d)of the Act as “evidence consisting of adequateand well-controlled investigations, includingclinical investigations, by experts qualified byscientific training and experience to evaluate theeffectiveness of the drug involved, on the basisof which it could fairly and responsibly beconcluded by such experts that the drug willhave the effect it purports or is represented tohave under the conditions of use prescribed,recommended, or suggested in the labeling orproposed labeling thereof.”

US Department of Health and Human Services, FDA (1998, p 3)

The phrase “adequate and well-controlled inves-tigations” has typically been interpreted as atleast two studies that clearly demonstrated thatthe drug has the effect claimed by the sponsorsubmitting a marketing approval. Furthermore, atype I error of 0.05 has typically been adopted asa reasonable standard upon which data fromclinical studies are judged. That is, it was widelybelieved that the intent of the FDC Act of 1962was to state that a drug could be concluded to beeffective if the treatment effect was clinicallyrelevant and statistically significant at the a �

0.05 level in two independent studies. The ICH Guidance E8 (1998, p 4) clarified this

issue:

The usual requirement for more than oneadequate and well-controlled investigationreflects the need for independent substantiation

128 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Page 144: Introduction to Statistics in Pharmaceutical Clinical Trials

of experimental results. A single clinical experi-mental finding of efficacy, unsupported by otherindependent evidence, has not usually beenconsidered adequate scientific support for aconclusion of effectiveness. The reasons for thisinclude the following:

• Any clinical trial may be subject to unantici-pated, undetected, systematic biases. Thesebiases may operate despite the best intentionsof sponsors and investigators, and may leadto flawed conclusions. In addition, someinvestigators may bring conscious biases toevaluations.

• The inherent variability in biological systemsmay produce a positive trial result by chancealone. This possibility is acknowledged, andquantified to some extent, in the statisticalevaluation of the result of a single efficacytrial. It should be noted, however, thathundreds of randomized clinical efficacy trialsare conducted each year with the intent ofsubmitting favorable results to the FDA. Evenif all drugs tested in such trials were ineffec-tive, one would expect one in forty of thosetrials to “demonstrate” efficacy by chancealone at conventional levels of statisticalsignificance. It is probable, therefore, thatfalse positive findings (that is, the chanceappearance of efficacy with an ineffectivedrug) will occur and be submitted to FDA asevidence of effectiveness. Independentsubstantiation of a favorable result protectsagainst the possibility that a chance occur-rence in a single study will lead to an erro-neous conclusion that a treatment is effective.

• Results obtained in a single center may bedependent on site or investigator-specificfactors (for example, disease definition,concomitant treatment, diet). In such cases,the results, although correct, may not begeneralizable to the intended population.This possibility is the primary basis foremphasizing the need for independence insubstantiating studies.

• Rarely, favorable efficacy results are theproduct of scientific fraud.

Although there are statistical, methodologic,and other safeguards to address the identifiedproblems, they are often inadequate to addressthese problems in a single trial. Independent

substantiation of experimental results addressessuch problems by providing consistency acrossmore than one study, thus greatly reducing thepossibility that a biased, chance, site-specific, orfraudulent result will lead to an erroneousconclusion that a drug is effective.

This guidance further clarified that the need forsubstantiation does not necessarily require twoor more identically designed trials:

Precise replication of a trial is only one of anumber of possible means of obtaining indepen-dent substantiation of a clinical finding and, attimes, can be less than optimal as it could leavethe conclusions vulnerable to any systematicbiases inherent to the particular study design.Results that are obtained from studies that areof different design and independent in execu-tion, perhaps evaluating different populations,endpoints, or dosage forms, may providesupport for a conclusion of effectiveness that isas convincing as, or more convincing than, arepetition of the same study.

ICH Guidance E8 (1998, p 5)

Regulatory agencies have traditionally acceptedonly two-sided hypotheses because, theoreti-cally, one could not rule out harm (as opposed tosimply no effect) associated with the test treat-ment. If the value of a test statistic (for example,the Z-test statistic) is in the critical region at theextreme left or extreme right of the distribution(that is, � �1.96 or � 1.96), the probability ofsuch an outcome by chance alone under the nullhypothesis of no difference is 0.05. However, theprobability of such an outcome in the directionindicative of a treatment benefit is half of 0.05,that is, 0.025. This led to a common statisticaldefinition of “firm” or “substantial” evidence asthe effect was unlikely to have occurred bychance alone, and it could therefore be attrib-uted to the test treatment. Assuming that twostudies of the test treatment had two-sided p values � 0.05 with the direction of the treat-ment effect in favor of a benefit, the probabilityof the two results occurring by chance alonewould be 0.025 � 0.025, that is, 0.000625 (whichcan also be expressed as 1/1600).

It is important to note here that this standardis not written into any regulation. Therefore,

Introduction: Regulatory views of substantial evidence 129

Page 145: Introduction to Statistics in Pharmaceutical Clinical Trials

there may be occasions where this statisticalstandard is not met. In fact, it is possible toredefine the statistical standard using one largewell-designed trial, an approach that has beendescribed by Fisher (1999).

Whether the substantial evidence comes fromone or more than one trial, the basis forconcluding that the evidence is indeed substan-tial is statistical in nature. That is, the regulatoryagency must agree with the sponsor on severalkey points in order to approve a drug formarketing:

• The effect claimed cannot be explained byother phenomena such as regression to themean, time trends, or bias. This highlights theneed for appropriate study design and dataacquisition.

• The effect claimed is not likely a chanceoutcome. That is, the results associated with aprimary objective have a small p value, indi-cating a low probability of a type I error.

• The effect claimed is large enough to beimportant to patients, that is, clinically rele-vant. The magnitude of the effect mustaccount for sampling during the trial(s).

A clinical development program containsvarious studies that are designed to provide thequantity and quality of evidence required tosatisfy regulatory agencies, which have theconsiderable responsibility of protecting publichealth. The requirements for the demonstra-tion of substantial evidence highlight theimportance of study design and analytic strate-gies. Appropriate study design features such asconcurrent controls, randomization, standardiza-tion of data collection, and treatment blinding

help to provide compelling evidence that anobserved treatment effect cannot be explainedby other phenomena. Selection of the appropriateanalytical strategy maximizes the precision andefficiency of the statistical test employed. Theemployment of appropriate study design andanalytical strategies provides the opportunity foran investigational drug to be deemed effective ifa certain treatment effect is observed in clinicaltrials.

10.2 Objectives of therapeuticconfirmatory trials

Table 10.1 provides a general taxonomy of theobjectives of confirmatory trials and specificresearch questions corresponding to each.Confirmatory trials typically have one primaryobjective that varies by the type of trial. In thecase of a new antihypertensive it may be suffi-cient to demonstrate simply that the reductionin blood pressure is greater for the test treatmentthan for the placebo. A superiority trial is appro-priate in this instance. However, in other thera-peutic areas – for example, oncology – otherdesigns are appropriate. In these therapeuticareas it is not ethical to withhold life-extendingtherapies to certain individuals by randomizingthem to a placebo treatment if there is already anexisting treatment for the disease or condition.

In such cases, it is appropriate to employ trialswith the objective of demonstrating that theclinical response to the test treatment is equiva-lent (that is, no better or worse) to that of anexisting effective therapy. These trials are called

130 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Table 10.1 Taxonomy of therapeutic confirmatory trial objectives

Objective of trial Example indication Example research question

Demonstrate superiority Hypercholesterolemia Is the magnitude of LDL reduction for the test treatment greater than for placebo?

Demonstrate equivalence Oncology Is the test treatment at worst trivially inferior to and at best slightly better than the active control with respect to the rate of partial tumor response?

Demonstrate noninferiority Anti-infective Is the microbial eradication rate for the test treatment at least not unacceptably worse than for the active control?

Page 146: Introduction to Statistics in Pharmaceutical Clinical Trials

equivalence trials. A question that arises here is: Why would we want to develop another drugif there is already an existing effective treat-ment? The answer is that we believe the testtreatment offers other advantages (for example,convenience, tolerability, or cost) to justify itsdevelopment. Another type of trial is the noninferiority trial. These trials are intendedonly to demonstrate that a test treatment is notunacceptably worse (noninferior) than an activecontrol. Again, the test treatment may provideadvantages other than greater therapeuticresponse such as fewer adverse effects or greaterconvenience.

Equivalence and noninferiority trials are quitedifferent from superiority trials in their design,analysis, and interpretation (although exactlythe same methodological considerations applyto collect optimum quality data in these trials).Superiority trials continue to be our focus in thisbook, but it is important that you are aware ofother designs too. Therefore, in Chapter 12 wediscuss some of the unique features of theseother design types.

10.3 Moving from research questions toresearch objectives: Identification ofendpoints

There is an important relationship betweenresearch questions and study objectives, and it isrelatively straightforward to restate researchquestions such as those in Table 10.1 in terms ofstudy objectives. As stated in ICH Guidance E9, aconfirmatory study should be designed toaddress at most a few objectives. If a treatmenteffect can be quantified by an appropriatestatistical measure, study objectives can be trans-lated into statistical hypotheses. For example,the extent of low-density lipoprotein (LDL)-cholesterol reduction can be measured by themean change from baseline to end-of-treatment,or by the proportion of study participants whoattain a goal level of LDL according to a treat-ment guideline. The efficacy of a cardiovascularintervention may be measured according to themedian survival time after treatment. For manydrugs, identification of an appropriate measure

of the participant-level response (for example,reported pain severity using a visual analog scale)is not difficult. However, there may be instanceswhen the use of a surrogate endpoint can bejustified on the basis of statistical, biological andpractical considerations. Measuring HIV viralload as a surrogate endpoint for occurrence ofAIDS is an example.

Identification of the endpoint of interest isone of the many cases in clinical research thatinitially seem obvious and simple. We knowexactly what disease or condition we are inter-ested in treating, and it should be easy to iden-tify an endpoint that will tell us if we havebeen successful. In reality, the establishment ofan appropriate endpoint, whether it is themost clinically relevant endpoint or a surrogateendpoint, can be difficult. Some of the statisticalcriteria used to judge the acceptability of surro-gate endpoints are described by Fleming andDeMets (1996), who caution against their use inconfirmatory trials. One might argue that themost clinically relevant endpoint for a antihy-pertensive is the survival time from myocardialinfarction, stroke, or death. Fortunately, theincidence of these events is relatively low duringthe typical observation period of clinical trials.The use of SBP as a surrogate endpoint enables theuse of shorter and smaller studies than would berequired if the true clinical endpoint had to beevaluated. For present purposes, we assume thesimplest scenario: The characteristic that we aregoing to measure (blood pressure) is uncontro-versial and universally accepted, and a clinicallyrelevant benefit is acknowledged to be associatedwith a relative change in blood pressure for thetest treatment compared with the control.

Common measures of the efficacy of a testtreatment compared with a placebo include thedifferences in means, in proportions, and insurvival distributions. How the treatment effectis measured and analyzed in a clinical trialshould be a prominent feature of the studyprotocol and should be agreed upon withregulatory authorities before the trial begins. Inthis chapter we describe between-group differ-ences in general terms. It is acceptable tocalculate the difference in two quantities, A andB as “A minus B” or “B minus A” as long as theprocedure chosen is identified unambiguously.

Moving from research questions to research objectives 131

Page 147: Introduction to Statistics in Pharmaceutical Clinical Trials

10.4 A brief review of hypothesistesting

We discussed hypothesis testing in some detailin Chapter 6. For present purposes, the role ofhypothesis testing in confirmatory clinical trialscan be restated simply as follows:

Hypothesis testing provides an objective way tomake a decision to proceed as if the drug iseither effective or not effective based on thesample data, while also limiting the probabilityof making either decision in error.

For a superiority trial the null hypothesis is thatthe treatment effect is zero. Sponsors of drugtrials would like to generate sufficient evidence,in the form of the test statistic, to reject the nullhypothesis in favor of the alternate hypothesis,thereby providing compelling evidence that thetreatment effect is not zero. The null hypothesismay be rejected if the treatment effect favorsthe test drug, and also if it favors the placebo(as discussed, we have to acknowledge thispossibility).

The decision to reject the null hypothesisdepends on the value of the test statistic relativeto the distribution of its values under the nullhypothesis. Rejection of the null hypothesismeans one of two things:

1. There really is a difference between the twotreatments, that is, the alternate hypothesis istrue.

2. An unusually rare event has occurred, that is,a type I error has been committed, meaningthat we reject the null hypothesis given thatit is true.

Regulatory authorities have many reasons to beconcerned about type I errors. As a review at theend of this chapter, the reader is encouraged tothink about the implications for a pharmaceu-tical company of committing a type I or II errorat the conclusion of a confirmatory efficacy study.

The test statistic is dependent on the analysismethod, which is dependent on the studydesign; this, in turn, is dependent on a preciselystated research question. By now, you have seen

us state this fundamental point several times,but it really cannot be emphasized enough. Inour experience, especially with unplanned dataanalyses, researchers can be so anxious to know“What’s the p value?” that they forget toconsider the possibility that the study thatgenerated the data was not adequatelydesigned to answer the specific questionof interest. The steps that lead toward opti-mally informed decision-making in confirma-tory trials on the basis of hypothesis testing areas follows:

1. State the research question.2. Formulate the research question in the form

of null and alternate statistical hypotheses.3. Design the study to minimize bias, maximize

precision, and limit the chance of committinga type I or II error. As part of the study design,prespecify the primary analysis method thatwill be used to test the hypothesis. Dependingon the nature of the data and the size of thestudy, consider whether a parametric ornonparametric approach is appropriate.

4. Collect optimum-quality data using optimum-quality experimental methodology.

5. Carry out the primary statistical analysisusing the prespecified method.

6. Report the results of the primary statisticalanalysis.

7. Make a decision to proceed as if the drug iseither effective or ineffective:

(a) If you decide that it is effective based onthe results of this study, you may chooseto move on to conduct the next study inyour clinical development plan, or, if thisis the final study in your developmentplan, to submit a dossier (for example,NDA [new drug application], MAA[marketing authorisation application]) toa regulatory agency.

(b) If you decide that it is ineffective based onthe results of this study, you may chooseto refine the original research questionand conduct a new study, or to abandonthe development of this investigationalnew drug.

132 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Page 148: Introduction to Statistics in Pharmaceutical Clinical Trials

10.5 Hypothesis tests for two or moreproportions

The research question of interest in some studiescan be phrased: Does the test treatment result ina higher probability of attaining a desired statethan the control? Examples of such applicationsinclude:

• survival after 1 year following a cardiovascularintervention

• avoiding hospitalization associated withasthmatic exacerbations over the course of6 months

• attaining a specific targeted level of LDLaccording to one’s background risk.

In a confirmatory trial of an antihypertensive,for example, a sponsor might like to know ifthe test treatment results in a higher propor-tion of hypertensive individuals (which can beinterpreted as a probability) reaching an SBP � 140 mmHg.

10.5.1 Hypothesis test for twoproportions: The Z approximation

In the case of a hypothesis test for two propor-tions the null and alternate statisticalhypotheses can be stated as follows:

H0: p1 � p2 � 0HA: p1 � p2 � 0

where the population proportions for each oftwo independent groups are represented by p1and p2.

The sample proportions will be used to esti-mate the population proportions and, as inChapter 8, are defined as:

number of observations in group 1 with the event of interestp1 � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––

total number of observations in group 1 at risk of the event

and

number of observations in group 2 with the event of interestp2 � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––.

total number of observations in group 2 at risk of the event

The estimator for the difference in the twosample proportions is p1 �p2 and the standarderror of the difference p1 �p2 is:

_______________p1q1 p2q 2

SE(p1 � p2) � �–––––– � ––––––,n1 n2

where q1 � 1 �p1 and q2 � 1 �p2. The test statisticfor the test of two proportions is equal to:

(p1 � p2)Z � –––––––––––.

SE(p1 � p2)

Use of a correction factor may be useful as well,especially with smaller sample sizes. A teststatistic that makes use of the correction factor is:

1 1 1|p1 � p2| � –– (–– � ––)2 n1 n2

Z � ––––––––––––––––––––––.SE(p1 � p2)

For large samples (that is, when p1n1 � 5 andp2n2 � 5), these test statistics follow a standardnormal distribution under the null hypothesis.Values of the test statistic that are far away fromzero would contradict the null hypothesis andlead to rejection. In particular, for a two-sidedtest of size a, the critical region (that is, thosevalues of the test statistic that would lead torejection of the null hypothesis) is defined by F � Fa/2 or F � F1 �a/2. If the calculated value ofthe test statistic is in the critical region, the null hypothesis is rejected in favor of the alter-nate hypothesis. If the calculated value of thetest statistic is outside the critical region, the null hypothesis is not rejected.

As an illustration of this hypothesis test,consider the following hypothetical data from aconfirmatory study of a new antihypertensive.In a randomized, double-blind, 12-week study,the test treatment was compared with placebo.The primary endpoint of the study was theproportion of participants who attained an SBPgoal � 140 mmHg. Of 146 participants assignedto placebo, 34 attained an SBP � 140 mmHg atweek 12. Of 154 assigned to test treatment, 82attained the goal. Let us look at how these resultscan help us to make a decision based on the

Hypothesis tests for two or more proportions 133

Page 149: Introduction to Statistics in Pharmaceutical Clinical Trials

information provided. We go through the stepsneeded to do this.

The research question

Is the test treatment associated with a higher rateof achieving target SBP?

Study design

As noted, the study is a randomized, double-blind, placebo-controlled, 12-week study of aninvestigational antihypertensive drug.

Data

The data from this study are in the form ofcounts. We have a count of the number of partic-ipants in each treatment group, and, for both ofthese groups, we have a count of the numberof participants who experienced the event ofinterest. As the research question pertains to aprobability, or risk, we use the count data toestimate the probability of a proportion ofparticipants attaining the goal SBP.

Hypotheses and statistical analysis

The null and alternate statistical hypotheses inthis case can be stated as:

H0: pTEST � pPLACEBO � 0HA: pTEST � pPLACEBO � 0

where the population proportions for eachgroup are represented by pTEST and pPLACEBO. Asthe response is attaining a lower SBP, the groupwith the greater proportion of responses will beregarded as the treatment with a more favorableresponse. The difference in proportions is calcu-lated as “test minus placebo.” Positive values ofthe test statistic will favor the test treatment.

As the samples are large according to the defi-nition given earlier, the test of the two propor-tions using the Z approximation is appropriate.For a two-sided test of size 0.05 the critical regionis defined by Z � �1.96 or Z � 1.96. The valueof the test statistic is calculated as:

pTEST � pPLACEBOZ � ––––––––––––––––––.SE(pTEST � pPLACEBO)

The difference in sample proportions is calculated as:

82 34pTEST � pPLACEBO � –––– � –––– � 0.5325 � 0.2329 � 0.2996.

154 146

The standard error of the difference in sampleproportions is calculated as:

_________________________________(0.5325)(0.4675) (0.2329)(0.7671)

SE(pTEST � pPLACEBO) � � ––––––––––––––– � ––––––––––––––– � 0.0533.154 146

Using these calculated values, the value of thetest statistic is:

0.2996Z � ––––––– � 5.62.

0.0533

The test statistic using a correction factor isobtained as:

1 1 10.2996 � – (–––– � ––––)2 154 146

Z � –––––––––––––––––––––––– � 5.50.0.0533

Interpretation and decision-making

As the value of test statistic – that is, 5.62 – is inthe critical region (5.62 � 1.96), the null hypoth-esis is rejected in favor of the alternate hypoth-esis. Note that the value of the test statistic usingthe correction factor was also in the criticalregion. The probability of attaining the SBP goalis greater for those receiving the test treatmentthan for those receiving placebo.

It is fairly common to report a p value fromsuch an analysis. As we have seen, the p value isthe probability (under the null hypothesis) ofobserving the result obtained or one that is moreextreme. In this analytical strategy we refer to atable of Z scores and the tail areas associatedwith each to find the sum of the two areas (thatis, probabilities) to the left of �5.62 (a result asextreme as the observed or more so) and to theright of 5.62 (the result observed and those moreextreme). A Z score of this magnitude is way outin the right-hand tail of the distribution, leadingto a p value � 0.0001.

The results of this study may lead the sponsorto decide to conduct a second confirmatory trial,being confident that the drug is efficacious.Alternately, if the entire set of clinical data are

134 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Page 150: Introduction to Statistics in Pharmaceutical Clinical Trials

satisfactory, the sponsor may decide to apply formarketing approval.

10.5.2 Hypothesis test for two (or more)proportions: v2 test of homogeneity

An alternative method to the Z approximationfor the comparison of two proportions fromindependent groups is called the v2 test, which isconsidered a goodness-of-fit test; this quantifiesthe extent to which count data (for example, thenumber of individuals with and without theresponse of interest) deviate from counts thatwould be expected under a particular mathemat-ical model. The mathematical model used inclinical studies for goodness-of-fit tests is that ofhomogeneity. That is, if a particular response ishomogeneous with respect to treatment, wewould expect all the responses of interest to beproportionally distributed among all treatmentgroups. The assumption of homogeneity willallow us to calculate the cell counts that wouldbe expected. These will then be compared withwhat was actually observed. The more theexpected counts under the particular model ofinterest (for example, homogeneity) deviatefrom what is observed, the greater the value ofthe test statistic, and therefore the more the datado not represent goodness of fit. The v2 test isuseful because it can be used to test homogeneityacross two or more treatment groups. We firstdescribe the case of two groups and the moregeneral case is described in Section 10.5.3.

If there are two independent groups of interest(for example, treatment groups in a clinical trial)each representing an appropriate population,the proportions of participants with the charac-teristic or event of interest are represented by p1� m1/n1 and p2 � m2/n2. The counts of partici-pants with events and nonevents can bedisplayed in a contingency table with twocolumns and two rows, representing thenumbers of observations with (m1 and m2) andwithout (n1 � m1 and n2 � m2) the characteristicof interest. The marginal total of individualswith events (the sum across the two groups) isdenoted by R � m1 � m2. The marginal total ofindividuals without the events (sum across thetwo groups) is denoted by S � (n1 � n2) � (m1 � m2).

Finally, the total sample size (sum across the twogroups) is denoted by N � n1 � n2. The overallproportion of responses of interest across bothgroups is p � R/N. The complementary propor-tion of responses is q � S/N. A sample contin-gency table displaying the observed counts isrepresented in Table 10.2.

The null hypothesis for the v2 test of homogeneity for two groups is stated as:

H0: The distribution of the response of interest ishomogeneous with respect to the two treatmentgroups. Equivalently, the proportion of “yes”responses is equal across the two groups.

The alternate hypothesis is:

HA: The distribution of the response of interestis not homogeneous with respect to the twotreatment groups.

If the null hypothesis is true – that is, the propor-tion of participants with the event of interest issimilar across the two groups – the expectedcount of responses in groups 1 and 2 would be inthe same proportion as observed across allgroups. That is, the expected cell count in row 1(participants with events of interest) forgroup 1 is:

E1,1 � pn1.

Likewise, the expected cell count in row 1(participants with events of interest) forgroup 2 is:

E1,2 � pn2.

Similarly, the expected cell count in row 2(participants without the event of interest) forgroup 1 is:

E2,1 � q n1.

Hypothesis tests for two or more proportions 135

Table 10.2 Sample contingency table for two groupsand two responses (2 � 2)

Group

Event or characteristic? 1 2 Total

Yes m1 m2 RNo n1 � m1 n2 � m2 S

n1 n2 N

Page 151: Introduction to Statistics in Pharmaceutical Clinical Trials

Lastly, the expected cell count in row 2 (partici-pants without the event of interest) for group 2 is:

E2,2 � q n2.

The corresponding observed counts in Table 10.2are:

O1,1 � m1,O1,2 � m2,O2,1 � n1 � m1,

and

O2,2 � n2 � m2.

The test statistic v2 is calculated as the sum ofsquared differences between the observed andexpected counts divided by the expected countfor all four cells (two groups and two responses)of the contingency table:

2 2 (Or,i � Er,i)2

X2 � RR –––––––––––.i�1 r�1

Er,i

Under the null hypothesis of homogeneity, thetest statistic, X2, for two groups and tworesponses (for example, interest is in the propor-tion) is approximately distributed as a v2 with 1degree of freedom (df). Only large values of thetest statistic are indicative of a departure fromthe null hypothesis. Therefore, the v2 test isimplicitly a one-sided test. Values of the teststatistic that lie in the critical region are thosewith X2 � v1

2. The notation in this section tends to be

more complex than we have encountered inprevious chapters. A worked example using thedata from Section 10.5.1 may clarify the descrip-tion. In a randomized, double-blind, 12-weekstudy, the test treatment was compared withplacebo. The primary endpoint of the study was the proportion of participants who attainedan SBP goal � 140 mmHg. Of 146 partici-pants assigned to placebo, 34 attained an SBP � 140 mmHg at week 12. Of 154 assigned totest treatment, 82 attained the goal.

The research question

Are participants who take the test treatmentmore likely than placebo participants to attaintheir SBP goal?

Study design

The study is a randomized, double-blind, placebo-controlled, 12-week study of an investigationalantihypertensive drug.

Data

The data from the study are represented as thecontingency table displayed in Table 10.3.

Statistical analysis

The null and alternate statistical hypotheses canbe stated as follows:

H0: The proportion of individuals who attainedSBP � 140 mmHg is homogeneous (equal) acrossthe two treatment groups.

HA: The proportion of individuals who attainedSBP � 140 mmHg is not homogeneous acrossthe two treatment groups.

In cases where there are only two categories,such as in this one, we need to know only howmany individuals are in the “yes” row, becausethe number in the “no” row can be obtained bysubtraction from the sample size within eachgroup.

To calculate the test statistic, we first need toknow the expected cell counts. These can becalculated as the product of the marginal rowtotal and the marginal column total divided bythe total sample size. The expected cell countsunder the null hypothesis of homogeneity areprovided in Table 10.4. The expected cell countfor the placebo group in the first row (“Yes”) wascalculated as: (146)(116)/300 � 56.453. Theexpected cell count for the test treatment groupin the second row (“No”) was calculated as:(154)(184)/300 � 94.453. You are encouraged to

136 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Table 10.3 Contingency table for individualsattaining goal SBP

Attained SBP � 140? Placebo Test Total

Yes 34 82 116No 112 72 184

146 154 300

Page 152: Introduction to Statistics in Pharmaceutical Clinical Trials

verify the remaining two cell counts using thesame methodology.

Now that we have calculated the expected cellcounts, we can calculate the test statistic usingthese expected cell counts in conjunction withthe observed cell counts:

(34 � 56.453)2 (82 � 59.547)2 (112 � 89.547)2 (72 � 94.453)2

X2 � ––––––––––––– � ––––––––––––– � ––––––––––––– � –––––––––––––56.453 59.547 89.547 94.453

� 28.3646

Tabled values to determine critical regions arenot as concise as those for the standard normaldistribution, because there is not just one v2

distribution but many of them. However, the v2 distribution with 1 df is quite frequentlyencountered as 2 � 2 contingency tables. Hence,for reference, values of the v2 distribution for 1 df that cut off various areas in the right-handtail are provided in Table 10.5. Additional valuesof v2

1�a are provided in Appendix 3.

For a test of size 0.05 the value of the teststatistic, 28.3646, is much greater than thecritical value of 3.841.

Interpretation and decision-making

Just as the hypothesis test using the Z approxi-mation resulted in a rejection of the null hypo-thesis, so does this v2 test. We can also tell fromthe critical values in Table 10.5 that the p valuemust be � 0.001 because less than 0.001 of thearea under the 1 df v2 distribution lies to theright of the value 10.38 and the calculated teststatistic, 28.3646 lies to the right of that value.

10.5.2.1 Odds ratio as a measure ofassociation from 2 � 2 contingency tables

Many articles published in medical journals citea measure of association called an odds ratio,which is an estimate of the relative risk of theevent or outcome of interest, a concept that wasintroduced in Chapter 8. If the probability of anoutcome of interest for group 1 is estimated as p1the odds of the event are:

p1Odds of the event for group 1 � –––––––.

1 � p1

Similarly:p2

Odds of the event for group 2 � ––––––.1 � p2

Then the estimated odds ratio is calculated as:p1(1 � p2)

Odds ratio � ––––––––––.p2(1 � p1)

Note that an equivalent definition of the oddsratio using the observed counts from the 2 � 2contingency table in Section 10.5.2 is:

O1,1O2,2Odds ratio � ––––––––.

O1,2O2,1

A standard error may be calculated forpurposes of constructing a confidence intervalfor the odds ratio, but it requires an iterativesolution. Statistical software is useful for thispurpose. Interested readers will find a wealth ofinformation on the odds ratio in Fleiss et al.(2003).

If the estimated probabilities of the event arethe same (or similar) between the two groups,the odds ratio will have a value around 1 (unity).Thus an assumption of no association in a 2 � 2table implies that the odds ratio is equal to 1.

Hypothesis tests for two or more proportions 137

Table 10.4 Expected cell counts for v2 test ofhomogeneity

Attained SBP � 140? Placebo Test Total

Yes 56.453 59.547 116No 89.547 94.453 184

146 154 300

Table 10.5 Critical values for the v2 distribution with1 degree of freedom

a (one sided) v2(1 � a),1

0.10 2.7060.05 3.8410.01 6.6350.001 10.38

Page 153: Introduction to Statistics in Pharmaceutical Clinical Trials

This also means that the v2 test for binaryoutcomes from Section 10.5.2 can be considereda test of the null hypothesis that the populationodds ratio � 1. Values of the odds ratio appro-priately � 1 or appropriately � 1 are suggestiveof an association between the group and theoutcome.

Using the data from Table 10.3 as presentedand using the formula for observed cell counts,the estimated odds ratio is calculated as:

(34)(72)Odds ratio � –––––––– � 0.27.

(112)(82)

Interpreting this value as an estimate of the rela-tive risk of attaining the target SBP level, wewould say that patients treated with placebo are0.27 times as likely as patients treated with theactive drug to attain the SBP goal. This statementmay seem awkward (we would not disagree),which points out a potentially difficult aspect ofthe odds ratio. As the name implies it is a ratioscaled quantity so the odds ratio can beexpressed as a/b or b/a. Keeping in mind that theodds ratio is an estimate of the relative risk,selecting the more appropriate method will aidthe clinical interpretation of the result. In thiscase the response of interest is a favorableoutcome, so a relative risk � 1 would imply thata favorable outcome was more likely after treat-ment with the active drug than the placebo.Similarly, if the response of interest is a badoutcome (for example, serious adverse event) arelative risk � 1 would suggest that the proba-bility of a bad outcome was less for the activedrug than the placebo.

Hence we can make more sense of this calcu-lated value by taking its inverse as 1/0.27 � 3.75.This expression is more appealing and an accu-rate interpretation in that patients treated withthe test drug are 3.75 times more likely to attainthe SBP goal than those treated with placebo.One can also obtain this result by switching theorder of the columns in Table 10.3 andperforming the calculation as:

(82)(112)Odds ratio � –––––––– � 3.75.

(72)(34)

Odds ratios are one of the most common statis-tics cited from logistic regression analyses.

Logistic regression is an advanced topic andtherefore not included in this book. An overly simple description is that it is an analysismethod by which binary outcomes are modeled(or explained) using various predictor variables.The proper interpretation of odds ratios fromlogistic regression models will depend on theway in which the predictors were used in thestatistical model. However, the general conceptis the same as in this example. The odds ratiorepresents the relative increase in risk of a partic-ular event for one group versus another. Werecommend two excellent texts on logisticregression by Hosmer and Lemeshow (2000) andKleinbaum and Klein (2002).

10.5.2.2 Use of the v2 test for twoproportions

The v2 test of homogeneity is useful forcomparing two proportions under the followingcircumstances:

• The groups need to be independent.• The responses need to be mutually exclusive.• The expected cell counts are reasonably sized.

With regard to the last of these requirements, weneed to operationally define “reasonably sized.”A commonly accepted guideline is that the v2

test is appropriate when at least 80% of the cellshave expected counts of at least five. In the caseof the worked example, the use of the v2 test isappropriate on the basis of independence (noparticipant was treated with both placebo andtest treatment) and sample size. If a participantcan be counted in only one response categorythe responses are considered mutually exclusiveor non-overlapping, as was the case here.

The v2 test of homogeneity is a special casebecause it can be used for any number of groups.The more general case is discussed in thefollowing section.

10.5.3 Hypothesis test for g proportions:v2 test of homogeneity

If there are g independent groups of interest (forexample, treatment groups in a clinical trial)each representing relevant populations, the

138 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Page 154: Introduction to Statistics in Pharmaceutical Clinical Trials

proportion of individuals with the characteristicor event of interest is represented by:

pi �

mi—

ni

for i � 1, 2, . . . g, where g represents the numberof groups. The counts of individuals with eventsand nonevents can be displayed in a contin-gency table with g columns and two rows repre-senting the numbers of observations with (mi)and without (ni � mi) the characteristic of interest.The marginal total of individuals with events(the sum across the g groups) is denoted by:

g

R � R mi .i�1

The marginal total of individuals without theevents (sum across the g groups) is denoted by:

g

S � R ni �mi .i�1

Finally, the total sample size (sum across the g groups) is denoted by:

g

N � R ni .i�1

The overall proportion of responses across allgroups is:

Rp � ––.

N

A sample contingency table displaying theobserved counts in this more general case isrepresented in Table 10.6.

As before with the case of two groups, the nullhypothesis is stated as:

H0: The distribution of the response of interest ishomogeneous with respect to the g treatmentgroups. Equivalently, the proportion of “yes”responses is equal across all g groups.

The alternate hypothesis is:

HA: The distribution of the response of interest isnot homogeneous with respect to the g treatmentgroups.

If the null hypothesis is true, that is, the propor-tion of individuals with the event of interest issimilar across the groups, the expected count ofresponses in group i will be in the same propor-tion as observed across all groups. That is, theexpected cell count in row 1 (individuals withevents of interest) for group i is:

E1,i � pni .

Similarly, the expected cell count in row 2(individuals without the event of interest) forgroup i is:

E2,i � q ni.

The expected cell counts are calculated in thismanner for all 2g cells of the contingency table.The corresponding observed counts for groups i � 1, 2, . . ., g, in Table 10.6 are:

O1,i � mi

and

O2,i � ni � mi.

The test statistic X2 is calculated as the sum ofsquared differences between the observed andexpected counts divided by the expected count

Hypothesis tests for two or more proportions 139

Table 10.6 Sample contingency table for g groups and two responses (g � 2)

Group

Event or characteristic? 1 2 . . . g Total

Yes m1 m2 . . . mg RNo n1 � m1 n2 � m2 . . . ng � mg S

n1 n2 . . . ng N

Page 155: Introduction to Statistics in Pharmaceutical Clinical Trials

for all 2g cells (g groups and 2 responses) of thecontingency table:

g 2 (Oi,g � Ei,g)2

X2 � RR –––––––––––.i�1 r�1

Ei,g

Under the null hypothesis of homogeneity, thetest statistic, X2, for g groups and two responsesis approximately distributed as a v2 with (g �1)df. Values of the test statistic that lie in thecritical region are those with X2 � v2

g �1.

10.5.4 Hypothesis test for r responsesfrom g groups

The v2 test can be applied to more general situa-tions, including data with r response levels and gindependent groups. When there are more thantwo response categories, however, the null andalternate hypotheses cannot be stated simply interms of one proportion, but need to be stated interms of the distribution of response categories.

One example containing more than twogroups would be an evaluation of the followingthree categories of response: Worsening, nochange, and improvement. It would not be suffi-cient to state the null hypothesis in terms of theproportion of individuals with a response ofworsening because there are two other responsesof interest. We highlight this point because thev2 test is used extensively in clinical research,and it can be correctly applied to multilevelresponses and multiple groups. If we use themore general terminology, “distribution ofresponses is homogeneous with respect to treat-ment group,” we are always correct no matterhow many responses there were or how manygroups.

The specific methodology associated withthese more general cases is beyond the scope ofour text. The most appropriate and efficientanalyses of data of this type can depend on thehypothesis of interest and whether or not theresponse categories are ordered. Additionaldetails can be found in two excellent texts byStokes et al. (2001) and Agresti (2007).

10.5.5 Hypothesis test for twoproportions: Fisher’s exact test

The two methods described earlier, the Z approxi-mation and the v2 test of homogeneity, are appro-priate when the sample sizes are large enough.There are times, however, when the sample sizes ineach group are not large enough or the proportionof events is low such that np � 5. In such casesanother analysis method, one that does notrequire any approximation, is appropriate.

An alternate hypothesis test for two propor-tions is attributed to Fisher. Fisher’s exact test isapplicable to contingency tables with two ormore responses in two or more independentgroups. We consider one case, 2 � 2 tables, repre-sented by counts of individuals with andwithout the characteristic of interest (two rows)in each of two treatment groups (two columns),for which the cell counts are small. For this testthe row and column marginal totals are consid-ered fixed. That is, one assumes that the totalnumber of individuals with events is fixed aswell as the number in each group. The extent towhich the two groups are similar or dissimilaraccounts for the distribution of events betweenthe two groups. For any 2 � 2 table, the proba-bility of the particular distribution of responsecounts, assuming the fixed marginal totals, canbe calculated exactly via something called thehypergeometric distribution (we do not go intodetails here). Using slightly different notationfrom the examples above, the cell counts andmarginal totals of a general 2 � 2 table aredisplayed in Table 10.7. The total number of

140 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Table 10.7 Cell counts and marginal totals from ageneral 2 � 2 table

Event or characteristic Group 1 Group 2 Totalof interest?

Yes Y1 Y2 Y.No N1 N2 N.

n1 n2 n

Page 156: Introduction to Statistics in Pharmaceutical Clinical Trials

“yes” responses is denoted by the symbol, Y.,where the dot in the index means that the countis obtained by summing the responses over thetwo columns, that is, Y1 � Y2. Likewise, the totalnumber of “no” responses is denoted by thesymbol, N., the sum over groups 1 and 2.

Given the fixed margins as indicated in Table 10.7, the probability of the distribution ofresponses in the 2 � 2 table is calculated fromthe hypergeometric distribution as:

Y.!N.!n1!n2!P(Y1, Y2, N1, N2, | Y., N., n1, n2, n) � ––––––––––––.n!Y1!Y2!N1!N2!

The null and alternate hypotheses in this caseare as follows:

H0: The proportion of responses is independentof the group.HA: The proportion of responses is not independent of the group.

If the null hypothesis is rejected, the alternatehypothesis is better supported by the data.

For this test there is no test statistic as such,because this test is considered an exact test.Therefore, we need not compare the value of atest statistic to a distribution. Instead, the p valueis calculated directly and compared with thepredefined a level. Recall that a p value is theprobability, under the null hypothesis, ofobserving the obtained results or those moreextreme, that is, results contradicting the nullhypothesis. The calculation of the p value forthis exact test entails the following three steps:

1. Calculate the probability of the observed cellcounts using the expression above.

2. For all other permutations of 2 � 2 tables withthe same marginal totals, calculate the proba-bility of observed cell counts in a similarmanner.

3. Calculate the p value as the sum of theobserved probability (from the first step) andall probabilities for other permutations thatare less than the probability for the observedtable.

As a consequence, the p value represents the like-lihood of observing, by chance alone, the actual

result or those more extreme. The calculated pvalue is compared with the value of a and weeither reject or fail to reject the null hypothesis.

As an example of Fisher’s exact test, weconsider other data from the antihypertensivetrial introduced in Section 10.5.1. These data arepresented in Table 10.8.

The research question

Is there sufficient evidence at the a � 0.05 levelto conclude that the probability of attaining aSBP � 120 mmHg (a remarkable response for ahypertensive person!) is greater for peoplereceiving the test treatment than for thosereceiving the placebo?

Study design

The study is a randomized, double-blind, placebo-controlled, 12-week study of an investigationalantihypertensive drug.

Data

The data from the study are represented as acontingency table as displayed in Table 10.8. Asseen in Table 10.8, only four individuals had theevent of interest. Neither the Z approximationnor the v2 test would be appropriate given thesmall cell sizes of one and three.

Statistical analysis

The null and alternate statistical hypotheses canbe stated as:

Hypothesis tests for two or more proportions 141

Table 10.8 Contingency table for individualsattaining SBP � 120 mmHg

Attained SBP � 120? Placebo Test Total

Yes 1 3 4No 145 151 296

146 154 300

Page 157: Introduction to Statistics in Pharmaceutical Clinical Trials

H0: The proportion of individuals who attainedSBP � 120 mmHg is independent of treatmentgroup.HA: The proportion of participants who attainedSBP � 120 mmHg is not independent of treat-ment group.

In this instance, independence means that theprobability of the response is no more or lesslikely for one group versus the other. In his orig-inal paper, Fisher stated the null hypothesisslightly differently (although equivalent mathe-matically). The null hypothesis, after Fisher, canbe stated in this form: The population odds ratioof response to nonresponse for one group versusthe other is equal to one.

In Figure 10.1 all of the possible permutationsof cell counts, given the marginal totals, aredisplayed. To be concise, the row and columnlabels are not included. The calculated proba-bility from the hypergeometric distribution isprovided to the right of each arrangement of cellcounts. The probabilities in Figure 10.1 areincluded to illustrate the calculation. Note thatby definition, 0! � 1. For this particular dataset itis manageable to calculate each probability witha calculator, but in many instances this partic-

ular test should be done using statistical soft-ware. When calculating these probabilities byhand it is helpful to re-write the factorialexpressions in a way so that numerator anddenominator terms “cancel out.” For example,writing 154! as 154*153*152*151! allows us tocancel 151! from the numerator and denomi-nator of the probability associated with theobserved result.

The calculated p value is the probability fromthe observed result plus all probabilities less thanthe probability associated with the observedresult. For this example the exact p value is:

p value � 0.263453 � 0.236537 � 0.068119� 0.054910 � 0.623019.

Rounding to three significant digits, this can beexpressed as p value � 0.623.

Interpretation and decision-making

Comparing the p value of 0.623 to a � 0.05, thestatistical conclusion is not to reject the nullhypothesis. There is insufficient evidence to con-clude that the alternate hypothesis is true. If thegoal of a new antihypertensive therapy were to

142 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

0 4

146 150

4! 296! 146! 154!

0! 4! 146! 150! 300!P � 0.068119�

1 3

145 151

4! 296! 146! 154!

1! 3! 145! 151! 300!P � 0.263453 (observed result)�

2 2

144 152

4! 296! 146! 154!

2! 2! 144! 152! 300!P � 0.376981�

3 1

143 153

4! 296! 146! 154!

3! 1! 145! 151! 300!P � 0.236537�

4 0

142 154

4! 296! 146! 154!

4! 0! 142! 154! 300!P � 0.054910�

Figure 10.1 All permutations of response counts given fixed marginal totals and probabilities of each

Page 158: Introduction to Statistics in Pharmaceutical Clinical Trials

reduce SBP to levels � 120 mmHg, such a resultwould be disappointing and may lead to a decis-ion to halt the clinical development program.However, the study was not designed to answersuch a question. In fact, the research question,having been formulated as an exploratoryanalysis, may not be well suited for the study thatwas actually conducted. Perhaps a greater dose ormore frequent administration of the investiga-tional antihypertensive drug would increase therate of the desired response. In any case, as theanalysis earlier in the chapter illustrated, the newdrug does seem to lower SBP to levels that wouldbeconsideredclinically important (�140mmHg).

10.5.6 Test of two proportions fromstratified samples: The Mantel–Haenszelmethod

Confirmatory efficacy studies typically involve anumber of investigative centers and, accord-ingly, are known as multicenter trials. Multi-center trials have a number of benefits, whichare discussed later. A common analysis methodused in multicenter trials is to account for differ-ences from center to center by including them inthe analysis. Stratifying the randomization totreatment assignment by investigative centerensures that there are approximately equalnumbers of participants assigned to test orplacebo within each center. Analyses fromstudies with this design typically account forcenter as it is conceivably another source of vari-ation. This is accomplished by calculating asummary test statistic within each center andthen pooling or calculating weighted averages of the within-center statistics across all centers,thereby removing the effect of the centers fromthe overall test statistic.

The weights used in the analysis are chosen atthe trial statistician’s discretion, which providesa good example of the “art” of Statistics, becausethe statistician must make a well-informed judg-ment call. Some commonly employed choices ofweights are as follows:

• equal weights for all centers• weights proportional to the size of the center• weights that are related to the standard error

of the within-center statistic (for example,

more precise estimates have more weightthan less precise estimates).

One method applicable to the difference of twoproportions, originally described by Mantel andHaenszel (1959) and well described by Fleiss etal. (2003), utilizes weights that are proportionalto the size of each stratum (in this case, centers)to calculate a test statistic that follows approxi-mately a v2 distribution.

Assume that there are h strata of interest, andwithin each of the strata (h � 1, 2, . . ., H) thereare nh1 observations for group 1 (for example,treatment group 1) and nh2 observations forgroup 2 (for example, treatment group 2). Theproportion of observations with the character-istic of interest within each stratum for the twogroups is denoted by ph1 and ph2, respectively.The overall proportion of participants with thecharacteristic of interest within each stratum isdenoted by p–h; the overall proportion withoutthe characteristic of interest with each stratum isdenoted by q–h �1 � p–h.

The null hypothesis tested by theMantel–Haenszel method is as follows:

H0: There is no overall association betweenresponse and group after accounting for thestratification factor.

If the null hypothesis is rejected, the data favorthe following alternate hypothesis:

HA: There is an overall association betweenresponse and group after accounting for thestratification factor.

The test statistic for the Mantel–Haenszelmethod is:

H nh1 nh2( | R –––––– (ph1 � p h2)| � 0.5)2

h�1nh

X2MH � ––––––––––––––––––––––––––––––.

H nh1 nh2R –––––– p hq h

h�1nh�1

Note that the differences in proportions, ph1 – ph2, are weighted by the quantities nh1nh2——— .nhThis test statistic utilizes a continuity correctionfactor of 0.5 as well. As described by Fleiss et al.(2003), the test performs well when expected cellcounts within each of H 2 � 2 tables differ by at

Hypothesis tests for two or more proportions 143

Page 159: Introduction to Statistics in Pharmaceutical Clinical Trials

least 5 (maximum – minimum). The test statisticthat is computed in this manner is approxi-mately distributed as a v2 with 1 df.

A similar test statistic, Cochran’s statistic, orig-inally attributed to Cochran (1954), is describedby Fleiss et al. (2003):

H nh1 nh2( R –––––– (ph1 � p h2))2

h�1nh

X2CMH � –––––––––––––––––––––––––.

H nh1 nh2R –––––– phqh

h�1nh

Note that Cochran’s statistic does not use acorrection factor and the denominator of thestratum weights is nh instead of (nh � 1). Wemention Cochran’s statistic because it is used bysome statistical software packages instead of theMantel–Haenszel statistic. Fleiss points out thatthe difference between the Mantel–Haenszelstatistic and Cochran’s statistic is small when thesample sizes are large, but considerable when thesample sizes within each of the strata are small.

As an illustration of the Mantel–Haenszelmethod, we take the data from our example asdetailed in Section 10.5.1 and separate them intodata collected at each of three centers, which inthis case represent the three strata.

The research question

Is there sufficient evidence at the a � 0.05 levelto conclude that the probability of attaining agoal SBP level is greater for individuals receivingtest treatment than for those receiving theplacebo after accounting for differences inresponse among centers?

Study design

The study is a randomized, double-blind, placebo-controlled, 12-week study of an investigationalantihypertensive drug.

Data

The data from the study are represented as threecontingency tables, one for each of the centers inTable 10.9.

Statistical analysis

The null and alternate statistical hypotheses canbe stated as:

H0: There is no overall association between theresponse (attaining SBP � 140 mmHg) andtreatment group after accounting for center.HA: There is an overall association between theresponse and treatment group after accountingfor center.

For a test of size a � 0.05, a v2 test with 1 df hasa critical value of 3.841.

The differences in the proportions of interest(test minus placebo) are as follows:

144 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

Table 10.9 Contingency table for individualsattaining goal SBP by center

Center 1Attained SBP � 140? Placebo Test Total

Yes 12 24 36No 34 21 55

46 45 91

Center 2Attained SBP � 140? Placebo Test Total

Yes 15 31 46No 29 19 48

44 50 94

Center 3Attained SBP � 140? Placebo Test Total

Yes 7 27 34No 49 32 81

56 59 115

OverallAttained SBP � 140? Placebo Test Total

Yes 34 82 116No 112 72 184

146 154 300

Page 160: Introduction to Statistics in Pharmaceutical Clinical Trials

• Center 1: (0.533 � 0.261) � 0.272• Center 2: (0.620 � 0.341) � 0.279• Center 3: (0.458 � 0.125) � 0.333.

The overall response rates for the event ofinterest and their complements are:

36 55Center 1: p1 � ––– � 0.396 and q1 � ––– � 0.60491 91

46 48Center 2: p2 � ––– � 0.489 and q2 � ––– � 0.51194 94

34 81Center 3: p3 � ––– � 0.296 and q3 � ––– � 0.704.115 115

The test statistic is then computed as:

46 * 45 44 * 50 56 * 59{ | [(––––––)(0.272) �(––––––)(0.279) � (––––––)(0.333)] | � 0.5}2

91 94 115X2

MH � –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––46 * 45 44 * 50 56 * 59(––––––)(0.396)(0.604) �(––––––)(0.489)(0.511) �(––––––)(0.296)(0.704)

90 93 114

� 27.21

Although the calculation details are not shownhere, the value of Cochran’s statistic for thisexample is 28.47, which is consistent with theresult obtained for the Mantel–Haenszel statistic.

Interpretation and decision-making

The value of the test statistic is much greaterthan the critical value of 3.841. Hence the statis-tical decision is to reject the null hypothesis ofno association after accounting for center differ-ences. The proportion of responders is signifi-cantly higher among those receiving the testtreatment. The p value associated with the testcan be obtained from statistical software.However, we know from the sample of criticalvalues in Table 10.5 that the p value must be � 0.001. As before, a pharmaceutical companywould be encouraged by such results.

10.6 Concluding comments onhypothesis tests for categorical data

All of the methods described in this chapter areapplicable to data that are in the form of “binary”

events, that is, either the event or characteristicoccurred for a given individual or it did not. Forbinary data, the summary statistic representingeach treatment group is a sample proportion. Toaccount for variation from sample to sample,hypothesis-testing methods allow a researcher todraw an inference about the underlying popula-tion difference in proportions. Although notcovered in great detail, some of the methods canalso be expanded to more than two categories.

In contrast, the methods described in Chapter 11 are applicable to data with outcomesthat are continuous in nature. In those cases,other summary statistics are required to describethe typical effect in each group and the typicaleffect expected for the population under study.

Review 145

10.7 Review

1. What constitutes “compelling evidence” of abeneficial treatment effect?

2. Consider a pharmaceutical company that has justcompleted a confirmatory efficacy study. What are the implications for the company of committinga type I error? What are the implications for thecompany of committing a type II error?

3. The equality of two proportions is being tested withthe null hypothesis, H0: pTEST � pPLACEBO � 0. Giventhat this is a two-sided test and using the followinginformation, would the null hypothesis be rejectedor not rejected?

(a) a � 0.05, Z approximation test statistic � 1.74(b) a � 0.10, Z approximation test statistic � 1.74(c) a � 0.05, Z approximation test statistic � 4.23(d) a � 0.01, Z approximation test statistic � 4.23(e) a � 0.05, v2 test statistic � 1.74(f) a � 0.10, v2 test statistic � 1.74(g) a � 0.05, v2 test statistic � 4.23(h) a � 0.01, v2 test statistic � 4.23.

4. The equality of two proportions is being tested withthe null hypothesis, H0: pTEST � pPLACEBO � 0. Giventhat this is a two-sided test, what is the p value thatcorresponds to the following values of the Zapproximation test statistic?

(a) �1.56(b) �2.67

Page 161: Introduction to Statistics in Pharmaceutical Clinical Trials

10.8 References

Agresti A (2007). An Introduction to Categorical DataAnalysis, 2nd edn. Chichester: John Wiley & Sons.

Cochran WG (1954). Some methods of strengtheningthe common v2 tests. Biometrics 10:417–451.

Fisher LD (1999). One large, well-designed, multicenterstudy as an alternative to the usual FDA paradigm.Drug Information J 33:265–271.

Fleiss JL, Paik MC, Levin B (2003). Statistical Methods forRates and Proportions, 3rd edn. Chichester: JohnWiley & Sons.

Fleming TR, DeMets DL (1996). Surrogate end points inclinical trials: are we being misled? Ann Intern Med125:605–613.

Hosmer DW, Lemeshow S (2000). Applied LogisticRegression, 2nd edn. Chichester: John Wiley & Sons.

ICH Guidance E8 (1997). General Consideration of Clin-ical Trials. Available at: www.ich.org (accessed July 12007).

ICH Guidance E9 (1998). Statistical Principles for Clin-ical Trials. Available at: www.ich.org (accessed July 12007).

Kleinbaum DG, Klein M (2002). Logistic Regression: Aself-learning text, 2nd edn. New York: Springer.

Mantel N, Haenszel W (1959). Statistical aspects of theanalysis of data from retrospective studies ofdisease. J Natl Cancer Instit 22:719–748.

Stokes ME, Davis CS, Koch GG (2001). Categorical DataAnalysis using the SAS System, 2nd edn. Chichester:Wiley & Sons.

US Department of Health and Human Services, Foodand Drug Administration (1998). Providing ClinicalEvidence of Effectiveness for Human Drug and Biolog-ical Products. Available from www.fda.gov (accessedJuly 1 2007).

146 Chapter 10 • Confirmatory clinical trials: Analysis of categorical efficacy data

(c) 3.29(d) 1.00.

5. The term “responders’ analysis” was firstintroduced in Chapter 9 with regard to clinicallaboratory data. A responders’ analysis approachcan be used in the context of efficacy data, aswell. Consider a double-blind, placebo-controlled,therapeutic confirmatory trial of an investigationalantihypertensive (“test drug”). Based on earlierexperience, a period of 12 weeks is consideredsufficient to observe a clinically meaningfultreatment effect that can be sustained for manymonths. In this study, a participant whose SBP isreduced by at least 10 mmHg after 12 weeks of treatment is considered a responder. Similarly, a participant whose SBP is not reduced by at least 10 mmHg after 12 weeks is considered anon-responder. A total of 1000 participants werestudied: 502 on placebo and 498 on test drug.Among the placebo participants, 117 wereresponders. Among those on the test drug, 152were responders.

(a) Summarize these results in a 2 � 2contingency table.

(b) The sponsor’s research question of interest is:Are individuals treated with the test drug morelikely to respond than those treated withplacebo? What are the null and alternatestatistical hypotheses corresponding to thisresearch question?

(c) What statistical tests may be used to test thenull hypothesis? Are any more appropriatethan others?

(d) Is there sufficient evidence to reject the nullhypothesis using a test of size a � 0.05?Describe any assumptions necessary and showthe calculation of the test statistic.

(e) Calculate the odds ratio from the contingencytable. What is the interpretation of thecalculated odds ratio?

6. When would the Mantel–Haenszel v2 test be moreuseful than the v2 test?

Page 162: Introduction to Statistics in Pharmaceutical Clinical Trials

11.1 Introduction

As we have seen, several summary measures ofcentral tendency can be used for continuousoutcomes. The most common of these measuresis the mean. In clinical trials we calculate samplestatistics, and these serve to estimate theunknown population means. When developinga new drug, the estimated treatment effect ismeasured by the difference in sample means forthe test treatment and the placebo. If we caninfer (conclude) that the corresponding popula-tion means differ by an amount that is consid-ered clinically important (that is, in the positivedirection and of a certain magnitude) the testtreatment will be considered efficacious.

In Chapter 10 we saw that there are variousmethods for the analysis of categorical (andmostly binary) efficacy data. The same is truehere. There are different methods that are appro-priate for continuous data in certain circum-stances, and not every method that we discuss isappropriate for every situation. A careful assess-ment of the data type, the shape of the distribu-tion (which can be examined through a relativefrequency histogram or a stem-and-leaf plot),and the sample size can help justify the mostappropriate analysis approach. For example, ifthe shape of the distribution of the random vari-able is symmetric or the sample size is large(� 30) the sample mean would be considered a“reasonable” estimate of the population mean.Parametric analysis approaches such as the two-sample t test or an analysis of variance (ANOVA)would then be appropriate. However, when thedistribution is severely asymmetric, or skewed,the sample mean is a poor estimate of the popu-lation mean. In such cases a nonparametricapproach would be more appropriate.

It should be emphasized at this point that theterm “nonparametric” is not a quality judgmentcompared with the term “parametric.” Thenomenclature simply serves to delineate twotypes of analyses. Nonparametric tests are not“less good” than parametric tests. Indeed, if it were appropriate to use a nonparametricapproach in a certain circumstance, that testwould have higher statistical power than a para-metric approach. We respectfully feel that thedifferentiation between parametric and nonpara-metric approaches in many introductory Statis-tics textbooks is misleading, and does tend toimply that nonparametric tests are naturallyinferior to the other: Nonparametric tests arecommonly discussed separately, often towardthe end of the book, leaving the reader feelingthat the books’ authors regarded these discus-sions as unwanted but obligatory. We encourageyou as your first step to consider what valid andappropriate analyses there are for a given situa-tion, and then to select the most efficientanalysis method from among them. We havereinforced this notion by including nonpara-metric analysis approaches side by side withparametric approaches.

11.2 Hypothesis test of two means:Two-sample t test or independentgroups t test

A common measure of central tendency ofcontinuous outcomes is the mean. In clinicalstudies employing measurement of continuousvariables such as blood pressure, the typicalresponse among participants in a treatmentgroup is represented by this summary descriptive

11Confirmatory clinical trials: Analysis of continuousefficacy data

Page 163: Introduction to Statistics in Pharmaceutical Clinical Trials

statistic. As we have seen, sample statistics, bydefinition, vary from sample to sample. Whendeveloping new drugs we would like to make aninference about the magnitude of the differencebetween two population means, typically repre-sented by the symbol l, one for a test treatmentand the other for a control. If the difference inmeans exceeds the typical variability that wouldbe expected from sample to sample, we canconclude that the difference is unlikely to be dueto chance. More specifically, when comparingtwo population means, we are interested intesting the null hypothesis,

H0: l1 � l2 � 0.

If the null hypothesis is rejected the followingalternate hypothesis is better supported by thedata:

HA: l1 � l2 � 0.

Treatment group 1 is represented by n1 observa-tions, x11, x12, x13, . . ., x1n1

. Similarly, treatmentgroup 2 has n2 observations, x21, x22, x23, . . ., x2n2

.For this statistical test these two groups must beindependent. The population means, l1 and l2,are estimated by the sample means from eachgroup, x1 and x2:

n1

R x1i

i � 1x1 � –––—–,n1

n2

R x2i

i � 1x2 � –––—–.n2

The population variances, r21 and r2

2, areestimated by the sample variances:

n1

R (xli � x1)2

i � 1s21 � –––––––––—––,

n1 � 1

n2

R (x2i � x2)2

i � 1s22 � ––––––––––—––.

n2 � 1

Assuming that the two populations have thesame, albeit unknown, population variance, anaverage or pooled estimate of the sample

variances is an estimator of the unknown popu-lation variance. The pooled variance, s2

p, isobtained as:

s21(n1 � 1) � s2

2(n2 � 1)s2p � –––––––––––––––––––––.

n1 � n2 � 2

Finally, the pooled standard deviation, sp, is thesquare root of the variance:

sp � �__s2

p .

The estimator for the difference in populationmeans is the difference in sample means, that is,x1 � x2. The standard error of the estimator,SE(x1 � x2), is calculated as:

_______1 1SE(x1� x2) � sp � –– � ––.n1 n2

The test statistic for the two-sample t test is:x1� x2t � ––––––––––.

SE(x1� x2)

Under the null hypothesis of equal populationmeans, the test statistic follows a t distributionwith n1 � n2 � 2 degrees of freedom (df),assuming that the sample size in each group islarge (that is, � 30) or the underlying distribu-tion is at least mound shaped and somewhatsymmetric. As the sample size in each groupapproaches 200, the shape of the t distributionbecomes more like a standard normal distribu-tion. Values of the test statistic that are far awayfrom zero would contradict the null hypothesisand lead to its rejection. In particular, for a two-sided test of size a, the critical region (that is,those values of the test statistic that would leadto rejection of the null hypothesis) is defined byt � ta/2,n1�n2�2 or t � t1�a/2,n1�n2�2. Note that as tdistributions are symmetric, |ta/2|�t1�a/2. If thecalculated value of the test statistic is in thecritical region, the null hypothesis is rejected in favor of the alternate hypothesis. If thecalculated value of the test statistic is outside thecritical region, the null hypothesis is notrejected.

As there are an infinite number of t distribu-tions there is no concise way to display allpossible values that may be encountered.However, as can be seen in Table 11.1, the valueof the t distribution that cuts off the upper 2.5%

148 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 164: Introduction to Statistics in Pharmaceutical Clinical Trials

area of the distribution becomes smaller withincreasing sample sizes (and therefore increasingdf). Tabled values in Appendix 2 are provided forother values of a.

The use of the two-sample t test is illustratedhere with sample data from a clinical trial of aninvestigational antihypertensive drug.

The research question

Does the test treatment lower SBP more thanplacebo?

Study design

In a randomized, double-blind, 12-week study,the test treatment, one tablet taken once a day,was compared with placebo (taken in the samemanner). The primary endpoint of the study wasthe mean change from baseline SBP. The primaryanalysis will be based on a two-sample t test with a � 0.05 (two-sided).

Data

In the placebo group (146 individuals) the meanchange from baseline was �3.4 mmHg with astandard deviation of 17.4 mmHg. In the testtreatment group (154 individuals) the meanchange from baseline was �19.2 mmHg with astandard deviation of 16.9 mmHg.

Statistical analysis

The null and alternate statistical hypotheses canbe stated as:

H0: lTEST � lPLACEBO � 0.

HA: lTEST � lPLACEBO � 0.

The pooled sample variance is calculated as:

17.42(145) � 16.92(153)s2p � –––––––––––––––––––––– � 293.95.

146 � 154 � 2

It follows from this that the pooled standarddeviation is:

sp � �______293.95 � 17.1.

The estimate of the difference in mean changefrom baseline is:

xTEST � xPLACEBO � �19.2 � (�3.4) � �15.8.

The standard error of the difference is calculatedas:

__________1 1SE(xTEST � xPLACEBO) � 17.1� –––– � –––– � 1.98.

146 154

The test statistic is then calculated using thesevalues:

�15.8t � –––––– � �7.98

1.98

Under the null hypothesis of no difference inpopulation means, and assuming somewhatsymmetric distributions, the test statistic followsa t distribution with 298 (that is, 146 � 154 � 2)df. Therefore the critical region (values of the teststatistic that lead to rejection) is defined ast � �1.968 and t � 1.968. Note that this partic-ular entry is not in Appendix 2, but the closest isfor 300 df.

Interpretation and decision-making

As �7.98 � �1.968, the null hypothesis isrejected in favor of the alternate one. The meanchange from baseline for the test treatmentgroup is significantly different from the placebogroup’s at the a � 0.05 level. To determine the pvalue associated with this test, we need statisticalsoftware or an extensive look-up table. Given thelarge sample size in this example, we can use thepercentiles of the standard normal distributionto approximate the p value. These study resultsallow us to conclude that the test treatment is

Hypothesis test of two means 149

Table 11.1 Sample values from t distributions for atwo-sided test of a � 0.05

Degrees of freedom (n1 � n2 � 2) t0.975

10 2.228130 2.042350 2.0086

100 1.9840200 1.9719

Page 165: Introduction to Statistics in Pharmaceutical Clinical Trials

efficacious. The difference between treatments inthe magnitude of the change in SBP was unlikelyto be the result of chance. Therefore the sponsorcan submit these data as substantial statisticalevidence of the test treatment’s efficacy.

11.3 Hypothesis test of the location oftwo distributions: Wilcoxon rank sumtest

The two-sample t test is useful on many occa-sions, but there are occasions when its use is notjustified. One reason is that the sample size issmall (� 30 per group). Although small studiesare certainly encountered frequently in clinicalresearch, most confirmatory efficacy studies aresizable, and so this reason is not applicable here.A second reason is, however, applicable. Themost common reason why a two-sample t testwould not be appropriate is a heavily skeweddistribution, whether or not the sample size islarge.

The sample mean is a poor measure of centraltendency when the distribution is heavilyskewed. Despite our best efforts at designingwell-controlled clinical trials, the data that aregenerated do not always compare with the(deliberately chosen) tidy examples featured inthis book. When we wish to make an inferenceabout the difference in typical values among twoor more independent populations, but the distri-butions of the random variables or outcomes are not reasonably symmetric, nonparametricmethods are more appropriate. Unlike para-metric methods such as the two-sample t test,nonparametric methods do not require anyassumption about the shape of a distribution forthem to be used in a valid manner. As the next analysis method illustrates, nonparametricmethods do not rely directly on the value of therandom variable. Rather, they make use of therank order of the value of the random variable.

It is appropriate to note here that performingan analysis on an assigned rank instead of on theraw data results in a loss of information. Thinkof the related example of receiving a grade A onan assignment. If a grade A is given for any markbetween 90% and 100%, the grade alone does

not tell you how well you have actually done onthe assignment: A score of 91% is assigned thesame grade as a score of 100%. If the mark forthis assignment is the first one of several in acourse that will ultimately be combined to yieldyour final grade in some manner, you may verylegitimately be interested in your actual (raw)score. Nevertheless, in clinical trials there can be a sound rationale for not using raw data incertain circumstances.

When rather extreme departures fromrequired assumptions are noted, our choice of anappropriate statistical method should be one offirst validity and second efficiency. Thedifference between an extreme departure fromrequired assumptions and any departure fromrequired assumptions is again a matter ofjudgment. It should be noted that many of theparametric methods in this book are robust todepartures from distributional assumptions,meaning that the results are valid under anumber of conditions. This is especially truewith the larger sample sizes encountered in ther-apeutic exploratory and confirmatory trials. Weshould also note that all the methods describedin this book require that observations in theanalysis are independent. There are statisticalmethods to be used for dependent data, but theyare not described in this book.

In our opinion, therefore, nonparametricmethods should be chosen when assumptions(such as normality for the t test) are clearly notmet and the sample sizes are so small that thereis very little confidence about the properties ofthe underlying distribution. The nonparametricmethod discussed in this section is a test of ashift in the distribution between two popula-tions with a common variance represented bytwo samples, and it will always be valid whencomparing two independent groups.

The two-sample t test was based on theassumption that the two samples were drawnfrom an underlying normal population with thesame (assumed) population variance. A rejectionof the null hypothesis in the setting of the two-sample t test would imply that the two popula-tions from which the samples were drawn wererepresented by two normal distributions withthe same variance (shape), but with differentmeans. The Wilcoxon rank sum test does not

150 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 166: Introduction to Statistics in Pharmaceutical Clinical Trials

require the assumption of the normal distribu-tion, but does require that the samples be drawnfrom the same population. The Wilcoxon ranksum test tests a similar hypothesis such that, if itis rejected, the two populations from which thesamples were drawn had the same shape (notnecessarily normal or otherwise symmetric), butdiffered by some distance. That is, a rejection ofthe null hypothesis in the setting of Wilcoxon’srank sum test would imply that the two popu-lation distributions were shifted, that is, notoverlapping.

Although this approach has its advantages,one disadvantage is that no single numericalestimate, either a point estimate or an intervalestimate, can convey the extent to which thepopulations differ because the test of the loca-tion shift is based on relative rank and not theoriginal scale.

Using the Wilcoxon rank sum test, interest isin a location shift between two populationdistributions so the following null hypothesis istested:

H0: The location of the distribution of therandom variable in population 1 does not differfrom the location of the random variable forpopulation 2.

If the null hypothesis is rejected the followingalternate hypothesis is better supported by thedata:

HA: The location of the distribution of therandom variable in population 1 is differentfrom the location for population 2.

Treatment group 1 (representing population 1) isrepresented by n1 observations measured on acontinuous scale, x11, x12, x13, . . ., x1n1

. Similarly,treatment group 2 (representing population 2)has n2 observations measured in a continuousscale, x21, x22, x23, . . ., x2n2

. The total sample sizeof the two groups is n1 � n2. The first step incalculating the test statistic is to order the valuesof all observations from smallest to largest,without regard to the treatment group. Then, arank is assigned to each observation, startingwith 1 for the smallest value after sorting, then2, and so on for all n1 � n2 observations. If twoor more observations are tied, the assigned rankwill be the average of the ranks that would havebeen assigned if there were no ties. For example,

if the third, fourth, and fifth sorted observationswere all tied, the assigned rank for each of thethree observations would be [3 � 4 � 5]/3 � 4.The next largest value would then be assigned arank of 6.

At this stage, we now have n1 ranks for treat-ment group 1, r11, r12, r13, . . ., r1n1

. Similarly, treat-ment group 2 has n2 ranks, r21, r22, r23, . . ., r2n2

.The test statistic for the Wilcoxon rank sum testis the sum of the ranks in group 1:

n1

S1 � R r1i .i�1

Only the ranks from group 1 are requiredbecause, if the values from group 1 tend to besmaller than those from group 2, the sum ofranks will be small, leading to rejection of thenull hypothesis. Similarly, if the values fromgroup 1 tend to be larger than those from group 2 the sum or ranks will be a large numberand will also lead to rejection.

The null hypothesis will be rejected if the teststatistic is less than or equal to or greater than orequal to cut points obtained from a table (whichneed not be provided here) – that is, the nullhypothesis will be rejected if S1 � WL or S1 � WU.Other authors (Schork and Remington, 2000)have suggested a large sample approximation,which is possible because the test statistic,S1, is approximately normally distributedwith mean [n1(n1 � n2 � 1)]/2 and variance[n1n2(n1 � n2 � 1)]/12. The derivation of thesetwo parameters is beyond the scope of this text.Applying a familiar mathematical operation(standardization of a normally distributedrandom variable), we obtain an alternate teststatistic, which has an approximate standardnormal distribution:

n1(n1 � n2 � 1)S1 � ––––––––––––––

2Z � –––––––––––––––––––._________________

n1n2(n1 � n2 � 1)––––––––––––––––� 12

Values of this test statistic can then be comparedwith the more familiar critical values of the stan-dard normal distribution.

To illustrate this method, consider thefollowing example that (deliberately) has a smalldataset.

Hypothesis test of the location of two distributions 151

Page 167: Introduction to Statistics in Pharmaceutical Clinical Trials

The research question

Does the test treatment lower SBP more thanplacebo?

Study design

In a randomized, double-blind, 6-week study, thetest treatment (one tablet taken once a day) wascompared with placebo. The primary endpointof the study was the mean change from base-line SBP. Given the small sample size of the study,the primary analysis is based on the Wilcoxonrank sum test with a � 0.05 (two-sided).

Data

Each value listed below represents change frombaseline SBP for a participant in a clinical trialcomparing a new antihypertensive treatmentwith placebo. Lower values indicate a greaterreduction in blood pressure from baseline, thefavored outcome.

Test treatment (n � 10):�8, �1, 0, 2, �20, �18, �12, �17, �14, �11.Placebo (n � 10):�9, 0, �4, �4, �3, 1, �7, 1, 2, �3.

Statistical analysis

After ordering all observations from highest tolowest within the two groups, we have thefollowing:

Then ranking each observation across groups,accounting for ties as described above, we obtainthe following ranks:

The test statistic is computed as the sum of theranks for the test treatment group:

S1 � 1�2�3�4�5�6�8�14�15.5�19.5 � 78.

When testing the null hypothesis at the two-sided a � 0.05 level and a sample size of 10 in

each group, the critical region is any value ofS1 � 78 or � 132.

Interpretation and decision-making

As the value of the test statistic is in the rejectionregion (only just, but still in it), the null hypoth-esis is rejected. The conclusion is that the distri-butions of the two populations from which thesamples were selected differ in their location.The test treatment is associated with a greaterreduction in SBP than placebo.

Alternately, if we were to use the test statisticbased on a normal approximation, it would be:

10(10 � 10 � 1)78 � ––––––––––––––––

2Z � –––––––––––––––––––––– � �2.007._____________________

10 * 10(10 � 10 � 1)––––––––––––––––––––� 12

Under the null hypothesis, this test statisticfollows a standard normal distribution. Thenull hypothesis is rejected because the teststatistic falls in the rejection region for a two-sided test of a � 0.05 based on the standardnormal distribution (Z � � 1.96 or Z � 1.96).

11.4 Hypothesis tests of more than twomeans: Analysis of variance

The t tests are extremely helpful, commonlyused tests, but they do have one noteworthylimitation: They can address only the equality oftwo means. In the present context, they cancompare only the results from two treatmentgroups. Situations that require us to test theequality of more than two means occur quitefrequently, and so a test that can be used withtwo or more groups is needed.

In many instances in drug development, twoor more doses may seem to be promising basedon results from earlier phases of clinical devel-opment. The question of interest thereforebecomes: Of all the doses studied, which has thegreatest beneficial effect? Confirmatory efficacystudies aim to answer this question. As in otherstudy designs that we have discussed, thesponsor would like to minimize the chance of

152 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Test �20 �18 �17 �14 �12 �11 �8 �1 0 2Placebo �9 �7 �4 �4 �3 �3 0 1 1 2

Test 1 2 3 4 5 6 8 14 15.5 19.5Placebo 7 9 10.5 10.5 12.5 12.5 15.5 17.5 17.5 19.5

Page 168: Introduction to Statistics in Pharmaceutical Clinical Trials

committing a type I or II error. We thereforeneed an appropriate statistical method that canidentify the best dose (among a number ofthem), while accounting for the inherent vari-ability in the data and limiting the chances ofcommitting an error in the final decision-making process. Analysis of variance (ANOVA) iswell suited to this task.

Assume that there are k independent groups(k � 2), each of which represents populations ofinterest, for example, individuals given a partic-ular treatment. An important objective of manyclinical trials is to determine if there is any differ-ence among the treatments administered withregard to the underlying population means. Thenull hypothesis for such an objective is:

H0: l1 � l2 � . . . � lk.

If there is sufficient evidence to conclude thatthe null hypothesis should be rejected, thealternate hypothesis that would be favored isthat there was at least one difference among all[k(k � 1)]/2 pairs of population means:

HA: l1 � l2 or . . . l1 � lk or . . . lk�1 � lk.

Each treatment group j (j � 1, 2, . . ., k) is repre-sented by nj observations, x1j, x2j, x3j, . . ., xnjj

. Thesample sizes for each of the groups need not beequal. For each group the population mean, �j, isestimated by the sample mean, xj:

nj

Rxij

i�1x j � –––—–.nj

We can calculate the mean of all values acrossthe k groups, the grand mean, as:

k nj

RRxij

j�1 i�1x . � ––––––––––,n

where

k

n �R nj ,j�1

the overall sample size. The total variabilityacross all n � n1 � n2 � . . . nk observations is thesum of the squared difference between each

observation and the grand mean divided by thenumber of df:

k nj

RR(xij � x.)2

j�1 i�1VT � –––––––––––––––.

n � 1

The sum of the squared deviations of each obser-vation from the overall mean (the numerator) isalso called the “total sums of squares.”

The population variance for each group, r2j, is

estimated by the sample variance:

nj

R(xij � x j)2

i�1s2

j � –––––––––––––.nj � 1

While the notation here is a little more compli-cated than we have seen before (because of theaddition of the subscript j) the basic principle isexactly the same. All we have done to this pointin this example is to calculate the sample meansand variances for each group in the study.

An estimate of the average variance over all k groups represents the “typical” spread of dataover the entire study or experiment. This vari-ability is often referred to as random variation ornoise. In the ANOVA strategy this number iscalled the within-group variance (or meansquare error), and is calculated as a weightedaverage of the sample variances:

k

R(nj � 1)s 2j

j�1Within-group variance (Vw) � –––––––––––––.

n � k

The denominator – that is, the df – in this calcu-lation may be puzzling at first, but, again, theprinciple is the same as we have seen before.Recall that, when estimating the sample vari-ance, the df value is n � 1. This is because thesum of deviations has to equal 0. Given knowl-edge of n � 1 observations in the sample, we candetermine the last observation: It is the valuethat will ensure that the sum of all deviationsadds to 0. In this case, the “minus 1” is appliedfor all k groups. This leads to:

(n1 � 1) � (n2 � 1) � . . . (nk � 1) �

(n1 � n2 � . . . nk) � k � n � k.

Hypothesis tests of more than two means: Analysis of variance 153

Page 169: Introduction to Statistics in Pharmaceutical Clinical Trials

As there are also k sample means, each repre-senting an estimate of the typical value of thepopulation (that is, the population mean), thoseestimates may also vary from sample to sample.The variance of the means across all groups iscalled the among-group variance (or the meansquare among groups), and is calculated as aweighted average (weighted by the sample size)of the squared differences of each sample meanfrom the grand mean:

k

Rnj (xj � x.)2

j�1Among-group variance (VA) � ––––––––––––––.

k � 1

where

k nj

RRxij

j�1 i�1x � ––––––––––,. n

the grand mean, as before. The total variabilityin the data can be split or partitioned as thewithin-group variability (the background vari-ability) and the among-group variability ofmeans (how much the sample means vary fromthe overall mean):

VT � VA � VW.

As we have seen with a number of methods so far(most notably, the two-sample t test) the extentto which point estimates differ is measuredagainst the typical variability of means fromsample to sample. In the case of an ANOVA, wehave an analogous method by which we canevaluate the extent to which the means differ. Ifthe variance among the samples greatly exceedsthe typical variance of the data in general thereis an indication that the typical difference inmeans is not the result of random variation, butof systematic variation. If the variance amongthe samples is similar to the variance of the datain general such a result suggests that, whateverthe difference in means, it is just like whathappens by chance alone.

The test statistic (F) for this comparison in theANOVA takes the form of a ratio of the among-sample variance to the within-sample variance:

VAF � ––– .VW

This test statistic is not well defined in all cases,which means that a rejection region is not auto-matically defined from a known distribution.However, if some assumptions are made aboutthe distribution of the random variable X, thedistribution of the test statistic can be defined.The following assumptions are required for anappropriate use of ANOVA:

• Each group represents a simple randomsample from each of k populations and theobservations are statistically independent.

• The random variable, X, is normallydistributed within each population.

• The variance of the random variable, X, isequal among all k populations.

Given these assumptions the test statistic, F,follows an F distribution with (k � 1) numeratordf and (n � k) denominator df. This is written inshorthand as Fk�1,n�k. Although we do notdescribe this distribution in detail, its essentialcharacteristics are that it is a two-parameterdistribution (that is, the numerator and denom-inator df) and it is asymmetric. As you mightimagine, this distribution is not nearly asconvenient to work with as the standard normaldistribution. Defining the critical region for agiven situation is best accomplished usingstatistical software because there are countless F distributions, each requiring a table. Similarly,calculating the sums of squares is best left to soft-ware (it can certainly be done by hand, but therequired calculations are tedious).

ANOVA can be extended to situations wherethe experimental units (in our context, studyparticipants) are classified on a number offactors. When they are classified on the basis ofone factor, it is referred to as a one-way ANOVA.The result of partitioning the total variance intoits components, in this case among and withinsamples defined by one factor, is displayed inTable 11.2.

The F distribution with (k � 1) numerator dfand (n � k) denominator df is used to define therejection region for a test of size a. The criticalregion may be obtained from a table of values orprovided by statistical software. Tabled F values

154 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 170: Introduction to Statistics in Pharmaceutical Clinical Trials

for a number of combinations of a, numeratorand denominator df are provided in Appendix 4.The null hypothesis of no difference amongmeans will be rejected only if the value of thetest statistic, F, is larger than the cut point speci-fied from the parameters of the distribution.Therefore, the test is inherently one sided – thatis, the rejection region is any value F � F(k�1),(n�k).

Rejection of the null hypothesis means onlythat there is at least one difference among allpairwise comparisons of means. This conclusionis hardly satisfactory in the world of drug devel-opment because the decisions to be made typi-cally require the selection of a dose or treatmentregimen for purposes of designing another studyor proposing a dose for marketing approval.

11.5 A worked example with a smalldataset

Since, as noted, the calculations involved inANOVA are fairly tedious, we illustrate themethod using an overly simplistic example witha small dataset. This example is for illustrative

purposes: In reality, datasets for which ANOVA ismost appropriate have large sample sizes and areanalyzed using statistical software. However,once you have a conceptual understanding ofANOVA you can interpret ANOVA tables for awide variety of study designs.

The research question

Does the reduction in SBP differ among threedoses of an investigational antihypertensivedrug?

Study design

A clinical study was conducted to investigatethree doses of an investigational antihyperten-sive drug. Fifteen participants were recruited(five per group), and randomized to three treat-ment groups: 10 mg, 20 mg, and 30 mg. Eachtreatment was taken once a day. SBP wasmeasured 5 min before the administration of thedrug (baseline) and again 30 min after. A“change from baseline score” was calculated foreach participant by subtracting the baselinevalue from the post-treatment value.

A worked example with a small dataset 155

Table 11.2 General one-way ANOVA table

Source Sum of squares Degrees of freedom Mean square F

Among samples k � 1

Within samples n � k

Total n � 1

k

Rnj(xj � x .)2

j�1

� SSA

k

Rnj(xj � x .)2

j�1–––––––––––––––––––

k � 1� VA

VA–––VW

k nj

RR (xij � x.)2

j�1 i�1

� SST

k nj

RR (xij � x.)2

j�1 i�1–––––––––––––––––––

n � 1� VT

k

R (nj � 1)s2j

j�1

� SSW

k

R (nj � 1)s2j

j�1–––––––––––––––––––

n � k� VW

Page 171: Introduction to Statistics in Pharmaceutical Clinical Trials

Data

The change from baseline scores for the 15participants are displayed below:

• 10 mg treatment group: �6, �5, �6, �7, �6• 20 mg treatment group: �8, �9, �8, �9, �6• 30 mg treatment group: �10, �8, �10, �8, �9.

Statistical analysis

A one-factor ANOVA is the appropriate analysishere assuming that the data are normal: Theonly factor of interest is the dose of drug given.There are three levels of this factor: 10, 20, and30 mg. Following convention, the results of anANOVA are displayed in an ANOVA summarytable such as the model in Table 11.3. In thefollowing calculations the values are presentedwithout their units of measurement (mmHg)simply for convenience. At the end of the calcu-lations, however, it is very important toremember that the numerical terms representvalues measured in mmHg. The calculationsneeded are as follows.

1. Calculate the group means and the grandmean:

�30• 10 mg group mean � x10 � –––– � �6

5�40

• 20 mg group mean � x20 � –––– � �85

�45• 30 mg group mean � x30 � –––– � �9

5(�6) � (�8) � (�9)

• grand mean � x � –––––––––––––––––– � �7.67.3

2. Calculate the group sample variances:

3. Calculate the total sums of squares (SST): Thetotal sums of squares is the variability ofobservations across all three groups. It iscalculated by summing the squared differenceof each observation (in this case 15 of them)from the grand mean, � 7.67. For brevity, thecalculation is not written out here. We suggestthat you verify the calculations with software:

• SST � 35.33.

4. Calculate the among-sample sums of squares(SSA):

SSA � 5((� 6) � (� 7.67))2 � 5(( � 8) �

(� 7.67))2 � 5((� 9) � (� 7.67))2 � 23.33.

5. Calculate the within-sample sums of squares(SSW):

• SSW � (4)(0.50) � (4)(1.50) � (4)(1.00) � 12.

As expected, the total sums of squares is thesum of the among-sample sums of squaresand the within-sample sums of squares.

6. Calculate the df:

• Total: We started with 15 scores. To getthe same grand mean, 14 of these canvary, but number 15 cannot. Therefore,there are (n � 1) df:

df (total) � 15 � 1 � 14.

• Among samples: There are three groups,and thus three sample means. These mustalso average to the grand mean. Once twohave been determined, the third can beonly one value (that is, it cannot vary).Again, therefore, there are (k � 1) df:

156 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

• 10 mg group sample variance �

((�6) � (�6))2 � ((�5) � (�6))2 � ((�6) � (�6))2 � ((�7) � (�6))2 � ((�6) � (�6))2

s210 � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– � 0.50

4

• 20 mg group sample variance �

((�8) � (�8))2 � ((�9) � (�8))2 � ((�8) � (�8))2 � ((�9) � (�8))2 � ((�6) � (�8))2

s220 � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– � 1.50

4

• 30 mg group sample variance �

((�10) � (�9))2 � ((�8) � (�9))2 � ((�10) � (�9))2 � ((�8) � (�9))2 � ((�9) � (�9))2

s230 � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– � 1.00.

4

Page 172: Introduction to Statistics in Pharmaceutical Clinical Trials

df (among) � 3 � 1 � 2.

• Within samples: By exactly the same logicthat we saw for the within-groups sums ofsquares, we can calculate these df as:

df (within) � df (total) � df (among) �

14 � 2 � 12.

(Note: There is also another way to thinkof this. Within each sample there are fivevalues. Therefore, there are four df persample. There are three samples. The totalwithin-samples df is the total of the dfwithin each sample, or 4 � 4 � 4 � 12.)

7. Construct the ANOVA table: Having calcu-lated the total sums of squares from allsources of variation, along with their degreesof freedom, we can now start to construct theANOVA table. The only other calculationsrequired are the mean squares for among-samples and within-samples (divide eachsums of squares by its associated df) and thetest statistic, F (divide among-samples meansquare by within-samples mean square). All ofthis information is shown in the partialANOVA table presented as Table 11.3.

8. Determine if the test statistic is in the rejec-tion region: As always, we need to determineif the test statistic F falls in the rejectionregion. So far, we have not determined the

rejection region for this test. As noted earlier,the F distribution has two parameters thatdetermine its shape and, therefore, the Fvalues that cut off tail areas of the distribu-tion. The two parameters are the numeratordf (associated with the numerator of the Fratio or the among-sample source of varia-tion) and the denominator df (associated withthe denominator of the F ratio or the within-samples source of variation). In this case, thenumerator df is 2 and the denominator df is12. This is written as:

F(2,12) � 11.67.

Tables with values of F for several distributionsare used to determine the significance of thisresult, or the critical values can be obtained fromstatistical software. We have provided a table inAppendix 4. For a test of size a � 0.05, the crit-ical value associated with 2 numerator df and 12denominator df that cuts off the upper 5% of thedistribution is 3.89. Although tabled values arehelpful at identifying nominal p values (forexample, � 0.01) statistical software is requiredto report the specific p value. Using statisticalsoftware, you will find that the actual p value is0.002. Table 11.4 shows the completed ANOVAtable for this example. You will see that the p value is commonly included in a completeANOVA table.

A worked example with a small dataset 157

Table 11.3 One-way ANOVA table for the SBP study (partially complete)

Source Sum of squares Degrees of freedom Mean square F

Among samples 23.33 2 11.67 11.67Within samples 12.00 12 1.00

Total 35.33 14

Table 11.4 Completed one-way ANOVA table for the SBP study

Source Sum of squares Degrees of freedom Mean square F p value

Among samples 23.33 2 11.67 11.67 0.002Within samples 12.00 12 1.00

Total 35.33 14

Page 173: Introduction to Statistics in Pharmaceutical Clinical Trials

It is important to recognize that the actual pvalue, not simply p � 0.05, is stated in the table.Regulatory reviewers and journal editors preferthis practice, because the actual value providesmore information than simply a statement thatthe value is less than 0.05.

Interpretation and decision-making

As the value of the test statistic, 11.67, is in therejection region for this test of size a � 0.05 (thatis, 11.67 � 3.89), the null hypothesis is rejected infavor of the alternate, which means that at leastone pair of the population means is not equal.

Recall the original research question: Does thereduction in SBP differ among three doses of anew antihypertensive? The results of the one-way ANOVA that we have conducted so far areinterpreted in the following manner:

• There is evidence at the a � 0.05 level thatthe levels of the factor “dose of drug” differ.Therefore, there is a statistically significantdifference in SBP change scores between thegroups. (The p value of 0.002 indicates thatthe null hypothesis would also have beenrejected at smaller a levels, for example, at thea � 0.01 level.)

The above statement by itself does not, however,tell us anything about which group showed thegreatest change score, or indeed how any specificgroup compared with any of the other groups.Consideration of the group means is necessary todo this. These means, with the associated unitsof measurement reinserted, are:

• 10 mg group � �6 mmHg• 20 mg group � �8 mmHg• 30 mg group � �9 mmHg.

Therefore, we can now state that the 30 mggroup showed the greatest mean decrease in SBP,the 20 mg group the second greatest meandecrease, and the 10 mg group the least meandecrease. However, a full answer to the researchquestion has still not been supplied, at least notin terms of determining possible statisticaldifferences between specific pairs of dose levels.

The ANOVA test statistic revealed that, overall,the groups differed statistically significantly, but,

as there are more than two groups, it cannotreveal the precise pattern of statistical signifi-cance. For any three groups (call them D, E, andF) there are C3

2 � 3 possible comparisonsbetween pairs of groups: D can be comparedwith E; D can be compared with F; and E can becompared with F. These three comparisons canlead to the following patterns of outcomes:

• All groups differ statistically significantlyfrom each other.

• None of the groups differs statistically significantly from any other group.

• D and E both differ statistically significantlyfrom F, but do not differ statistically significantly from each other.

• D and F both differ statistically significantlyfrom E, but do not differ statistically significantly from each other.

• E and F both differ statistically significantlyfrom D, but do not differ statistically significantly from each other.

To determine which pattern of outcomesoccurred in any given situation, an additionalstatistical test is needed. In situations such asthis, where we have a partial answer to our orig-inal research question, multiple comparisons areperformed. These are tests that allow us tocompare the means of each pair of groups to seewhich pairs (if any) differ statistically signifi-cantly from each other. Multiple comparisonstherefore provide a more detailed understandingof our data than the overall test (referred to asthe omnibus test) provided by the ANOVA. If theomnibus test yields a nonsignificant result,multiple comparisons are not necessary, because,in fact, none of them would be significant. Inthe case of a significant omnibus test, the secondoption above is not actually a possible outcome,whereas all of the others are. This means that weneed a method of determining which of theother possibilities is the case – that is, we need astatistical methodology that will allow us toconduct multiple comparisons, and to use thismethodology before we can provide the fullanswer to our original research question. The fullanswer is provided in Section 11.10, but first weneed to look at another issue.

158 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 174: Introduction to Statistics in Pharmaceutical Clinical Trials

11.6 A statistical methodology forconducting multiple comparisons

In clinical studies, the probability of declaring atreatment efficacious when in reality it is not effi-cacious is termed a. This is the probability ofdetecting a false positive, or committing a type Ierror. As we have seen, the probability of commit-ting a type I error should be limited to a specificvalue so that erroneous conclusions are not madevery often. For a sponsor, committing a type I errorcould result in investing significant amounts ofmoney on a drug that really does not work. For aregulatory agency, committing a type I error(approving a drug that is not efficacious) couldresult in many people taking a drug that does notoffer a meaningful treatment benefit and maycarry some risk (every drug has a side-effectprofile). It is therefore important to constrain theprobability of committing a type I error to anacceptable level. Traditionally, this acceptablelevel has been and is still regarded as the a � 0.05level, but, as noted before, we can choose othervalues when we consider them appropriate.

The important point to note here is that thea � 0.05 level is deemed appropriate when asingle test is being conducted. Multiple compar-isons, by definition, mean that more that one testis being conducted. When testing a number ofpairwise comparisons – for example, after anANOVA where the null hypothesis has beenrejected – it is not acceptable to test each pairwisecomparison at the a � 0.05 level because of thepotential inflation of the overall type I error rate.

When three treatment groups are evaluated ina clinical study, there are three possible pairwisecomparisons of means (D vs E, D vs F, and E vsF). If each mean is tested at the a � 0.05 level,and assuming that they are mutually exclusive,the probability of declaring at least one of thepairs significantly different is equal to 1 minusthe probability of accepting all three (by thecomplement rule). Assuming that the compar-isons are independent, the probability ofaccepting all three null hypotheses is the proba-bility of accepting the first null hypothesismultiplied by the probability of accepting the second multiplied by the probability of

accepting the third. When testing each at thea � 0.05 level, this probability becomes:

P (incorrectly rejecting at least one hypothesis)� 1 � (0.95)(0.95)(0.95) � 1 � 0.953 � 0.14.

That is, instead of a type I error rate of a � 0.05,this analysis has resulted in a higher probability ofcommitting a type I error, just by chance alone.

In fact, the comparisons made here cannot bethought of as independent because each group iscompared with two others in this case. It is morecorrect to use an inequality sign to say that theprobability is no more than 0.14, that is, � 0.14.However, this technicality is of little comfortbecause, to make sound decisions, we wouldreally like to limit that probability to a reason-able level. In general, if C comparisons are eachmade at the a level, the probability of rejectingat least one by chance alone is:

P (rejecting at least one of c hypotheses) � (1 � (1 � a)c)

Table 11.5 lists the probability of rejecting atleast one hypothesis for a number of values of C,the number of hypothesis tests performed at theconventional a � 0.05 level.

Suppose that a clinical trial has to evaluatefour doses of a test treatment and a placebo (atotal of five groups) on relieving headache pain.The study was carefully designed and conducted,

A statistical methodology for conducting multiple comparisons 159

Table 11.5 Maximum probability of committing atype I error when each hypothesis is tested at a � 0.05

C: No. of hypotheses Maximum probabilitytested at a � 0.05 of type I error

1 0.0502 0.0983 0.1434 0.1855 0.2266 0.2657 0.3028 0.3379 0.370

10 0.40115 0.53720 0.642

Page 175: Introduction to Statistics in Pharmaceutical Clinical Trials

and the data are now ready for the statisticalanalysis. A one-way ANOVA is conducted, andthe conclusion from the omnibus F test (compar-ison of the among-sample variance with thewithin-sample variance) is that the populationmeans are not all equal. Five treatment groupsgive rise to C5

2 � 10 pairwise group comparisons.Suppose that one of the researchers failed to getinput from the trial statistician, and hurriedly(and mathematically correctly) analyzed all 10pairwise comparisons of means performed using10 two-sample t tests. The researcher takes his orher results to the study director and the rest ofthe study team and points out with tremendousexcitement that the pairwise comparison of thelowest dose with the placebo yielded a p value of0.023, a statistically significant result at thea � 0.05 level. A surge of positive energy fills theroom as everyone but the statistician declares,“We have found our lowest effective dose! On tothe confirmatory trial!”

As you have probably realized by now, therewould actually be little reason for enthusiasm, asthe study statistician would very soon point out.The problem is this: While each of the 10 two-sample t tests had been conducted mathemati-cally correctly, it is not appropriate statisticalmethodology to use 10 two-sample t tests in thissetting. The analytic strategy employed did notlimit the type I error rate to 0.05. Rather, as seenin Table 11.5, when 10 such pairwise compar-isons are made – that is, 10 hypotheses are tested– the probability of rejecting at least one of thehypotheses is limited to 0.401, a value consider-ably greater in magnitude than 0.05. In otherwords, use of this naïve analytic strategy hasresulted in an inflated type I error. There is up toa 40% chance of being misled by one test with anominal p value � 0.05.

The issue of type I error inflation caused bymultiple testing appears in many guises in therealm of new drug development. This issue is ofgreat importance to decision-makers, and wediscuss this topic again later in the chapter. Fornow, we have not yet provided a full answer toour research question; our description of analysisof variance is incomplete without a discussion ofat least one analysis method that controls theoverall type I error rate when evaluating pairwisecomparisons from an ANOVA.

11.7 Bonferroni’s test

Bonferroni’s test is the most straightforward ofseveral statistical methodologies that can appro-priately be used in the context of multiplecomparisons. That is, Bonferroni’s test canappropriately be used to compare pairs of meansafter rejection of the null hypothesis following asignificant omnibus F test. Imagine that we havec groups in total. Bonferroni’s method makes useof the following inequality:

P(R1 or R2 or R3 or . . . or Rc)� P(R1) � P(R2) � P(R3) � . . . � P(Rc).

This means that the probability of rejecting atleast one of c hypotheses is less than or equal to(thus the term “inequality”) the sum of the prob-abilities of rejecting each hypothesis. Thisinequality is true even if the events, in this caserejecting one of c null hypotheses, are not inde-pendent. Recall from Section 6.2 that, whenevents are not independent, the probability ofintersecting events should be subtracted. UsingBonferroni’s method, testing each pair of meanswith an a level of aB � a–c will ensure that theoverall type I error rate does not exceed thedesired value of a. It follows that the probabilityof rejecting at least one of c null hypotheses canbe expressed as follows:

ap(rejecting at least one of c hypotheses at aB level)� c (––) � a.

c

It is important to note that the researcher in ourscenario in Section 11.6 who hurriedlyconducted 10 pairwise comparisons using 10two-sample t tests and rejoiced in one particularfinding was not completely out of line in theanalytic strategy chosen. It is indeed possible toapproach this situation (the need for 10 pairwisecomparisons) with the intent to conduct 10 two-sample t tests. However, a correction must bemade to the a level used to determine statis-tical significance. In the scenario as told inSection 11.6 the researcher did not perform thiscritical step.

In practice, then, we can carry out each of aseries of pairwise comparisons of means using atwo-sample t test for each comparison, but the a level must be modified accordingly. Whendeciding whether or not to reject the null

160 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 176: Introduction to Statistics in Pharmaceutical Clinical Trials

hypothesis associated with each comparison, weneed to use an a level of aB � a–c instead of thenaïve choice of a. Note that this is equivalent todefining a rejection region for each test as:

t � ta/2c,n�k or t � t1�(a/2c),n�k

which makes sense as the tail areas in the leftand right of the t distribution are smaller thanthose obtained using the two-sided test of size a.

Consider the two-sample t-test statistic again:

(x1 � x2)t � ––––––––––– ._______1 1sp � –– � ––n1 n2

In an ANOVA involving more than two groups,we estimate the underlying variability frommore than two samples, and yet we are inter-ested in the extent to which (only) two of themeans differ from each other. Therefore, whencomparing the means of two samples, the pooledstandard deviation from the two-sample case, sp,is replaced by an estimate that captures the vari-ability across all groups in the analysis – themean square error or the within-samples meansquare. Recall from Section 11.4 that this quan-tity has the same interpretation as the pooledstandard deviation, the typical spread of dataacross all observations.

When using Bonferroni’s method, the nullhypothesis associated with a pairwise com-parison is rejected if the calculated test statistic,that is,

x1 � x2t � ––––––––––––––____________1 1� Vw (–– � ––)n1 n2

is in the rejection region defined as t � t(a/2c),n�kor t � t1�(a/2c),n�k.

Remember that Vw comes from the ANOVAtable and it is the mean square error, which hasalso been referred to as the within-samples vari-ability or, more informally, the backgroundnoise. This is analogous to s2

p in the two-samplecase. As we assume equal variances, we use theestimator that uses the most data and thereforegives the most precise estimate.

The critical value can be determined from atable or software (using a two-sided test of size

a/c). The estimate of the underlying variability,Vw, comes from the ANOVA table, and thesample sizes for each group are known and equal.Then we can define a quantity, the minimallysignificant difference (MSD), which is thesmallest difference (in absolute value) betweenany two sample means that could be consideredstatistically significant at the a level. (Note thatwhen sample sizes are not equal the MSD is notdefined, but there are other methods available.)

____________1 1

MSD � t1�(a/2c),n�k � Vw (–– � ––).n1 n2

Once the value of MSD has been determined, theabsolute value of the difference in means will becompared with the MSD. If the absolute value ofthe difference in means, |(x1 � x2)|, is greaterthan or equal to the MSD the null hypothesiswill be rejected.

11.8 Employing Bonferroni’s test in ourexample

Having introduced Bonferroni’s test, we can now return to our earlier example to see how to apply Bonferroni’s method to our pairwisecomparisons of treatment group means.

Statistical analysis

The significant result of the omnibus F test ledto the rejection of the null hypothesis of nosignificant differences, thereby revealing thepresence of a significant difference between atleast one pair of means. It is now of interest todetermine precisely which pair or pairs ofmeans are significantly different.

Given that the decisions made from this trial could result in sizeable further investmentin the development of the investigationalantihypertensive drug, the company would liketo minimize its chances of committing a type Ierror. That is, it would like to maintain an overalltype I error of 0.05. As we have just seen inSection 11.7, one analysis that will maintain thisdesired type I error of 0.05 is Bonferroni’smethod.

Employing Bonferroni’s test in our example 161

Page 177: Introduction to Statistics in Pharmaceutical Clinical Trials

In our example of three treatment groupsthere are three pairwise comparisons of interest.Therefore, each pairwise comparison will betested at an a level of 0.05/3 � 0.01667. This alevel will require defining a critical value fromthe t distribution with 12 (that is, 15 � 3) df thatcuts off an area of 0.00833 (half of 0.01667) inthe right-hand tail. Use of statistical softwarereveals that the critical value is 2.77947. Frominspection of the ANOVA table presented asTable 11.4 the within-samples mean square(mean square error) can be seen to be 1. The finalcomponent needed for the MSD is:

________1 1� –– � –– � �

___0.4 � 0.632.

n1 n2

Then the MSD is equal to:

MSD � (2.77947)(1)(0.632) � 1.757.

The mean values for each group are �6 mmHg(10 mg), �8 mmHg (20 mg), and �9 mmHg (30 mg). The absolute values of the threedifferences in means are displayed in Table 11.6.

Each cell represents the differences in meansfor the groups represented by each row andcolumn. The differences between the 10 mg and20 mg groups and the 10 mg and 30 mg groupswere both greater than the MSD (1.757). There-fore, these differences are considered statisticallysignificant at the a � 0.05 level. The differencebetween the 20 mg and 30 mg groups was notsignificant, however, because it was less than theMSD.

Interpretation and decision-making

We are now in a position to provide a fullanswer to our research question of interest asexpressed at the start of Section 11.5: Will the

reduction in SBP differ among three doses of aninvestigational antihypertensive drug?

The first step in our analytical strategy was toconduct an ANOVA. This ANOVA tested the nullhypothesis that there were no differences amongthe three means. The null hypothesis was testedat an a level of 0.05, and was rejected on thebasis of the significant omnibus F test.

The second step in our analytical strategy wasto determine which of the pairs of means weresignificantly different from each other. Testingeach of the three hypotheses at an a level of0.05 would have resulted in a probability ofcommitting a type I error possibly � 0.05 (thedesired level). Bonferroni’s inequality was there-fore used to test each of the three hypotheses atan a level of 0.05/3 � 0.01667. Using the crit-ical value for this a level resulted in two pairs ofmeans being declared significantly different atthe 0.05 level.

The full interpretation of the study, therefore,is that the magnitude of the reduction in SBPdoes indeed differ according to different doselevels. The 20 mg and 30 mg doses both resultedin a statistically significantly greater SBP reduc-tion than the 10 mg dose. There was insufficientevidence to claim that there is a statisticallysignificant difference between the 20 mg and 30 mg doses.

What are the implications of this interpreta-tion? First, if we decided that it would be usefulto continue the clinical development programwith another trial, it would be salient to notethat, in terms of efficacy, the 10 mg dose wasinferior to the other two. Therefore, if contin-uing, it is likely that we would not include the 10 mg dose in further trials. What else wouldhelp us to decide to continue with the clinicaldevelopment program? The safety and tolera-bility of the 20 and 30 mg doses would need tobe examined and deemed acceptable. Examiningthe safety and tolerability data from the partici-pants in these two treatment groups wouldprovide the evidence on which to base this deci-sion (the safety and tolerability data from parti-cipants in the 10 mg treatment group would notbe informative at this point). If there were nosafety or tolerability concerns with the 20 or 30 mg doses, the next stage in developmentcould be to continue to investigate both of these

162 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Table 11.6 Absolute values of differences in means

20 mg 30 mg treatment treatment group group

10 mg treatment group 2 320 mg treatment group 1

Page 178: Introduction to Statistics in Pharmaceutical Clinical Trials

doses. Another possible interpretation isdiscussed in Section 11.10.

11.9 Tukey’s honestly significantdifference test

Bonferroni’s method that we have just discussedis perhaps one of the most easily understoodmethods to maintain an overall type I error,which is one of its advantages. In addition,Bonferroni’s method does indeed control theoverall type I error rate well, such that it is guar-anteed to be � a. However, like many items thatwe discuss in this book, it has its disadvantagesas well as its advantages.

Bonferroni’s test is overly conservative, in thatthe critical values required for rejection need notbe as large as they are. In other words, using aless conservative method may result in morenull hypotheses being rejected. The reason thatBonferroni’s method is so conservative is that itdoes not in any way account for the extent ofcorrelation among the various hypotheses beingtested. If a method could take into account the overlap, or lack thereof, of the varioushypotheses, the critical values would not need tobe defined as narrowly as with Bonferroni’s. Inthis section, we therefore discuss another analyt-ical strategy for multiple comparisons, Tukey’shonestly significant difference (HSD) test.

Bonferroni’s method for testing pairs of means(maintaining an overall type I error rate of a)involved comparing the absolute differences inmeans to the MSD, which was defined as afunction of:

• the critical value from a t distribution with acombined area of a/c in the tails of thedistribution

• the within-samples variability• the sample sizes in each group.

Once a value of the MSD was determined eachdifference in means was calculated andcompared with the MSD. Any difference that wasequal to or greater than the MSD was consideredstatistically significant. Tukey’s HSD test iscarried out in a similar manner. A value called

the honestly significant difference is determinedas a function of three things:

1. The critical value from the studentized rangestatistic

2. The within-samples variability3. The sample sizes in each group.

The studentized range statistic, called q in thefollowing description of the test, has a limiteduse for us now and we shall not spend any addi-tional time characterizing it, except to say thatthe value of q does account for the relative size ofdifferences among the normalized means,resulting in a test with an overall type I error ofexactly 0.05. The value, q, is often provided intables and to look it up we need to know thenumber of groups (k from the ANOVA descrip-tion), and the number of df associated with thewithin-samples mean square (n � k). Statisticalsoftware packages also supply this number. TheHSD (or, equivalently, the MSDT for minimumsignificant difference – Tukey) is defined as:

___VWHSD � MSDT � q�–––.n

In this expression n represents the per-groupsample size which, for the moment, we requireto be equal.

Once the value of HSD has been determined,the absolute value of the difference in means iscompared with it. If the absolute value of thedifference in means, |(x1 � x2)|, is greater than orequal to the HSD the null hypothesis is rejected.

The quantity represented by the letter “q” isdetermined from a table of values used justfor this test. Two characteristics are needed todetermine the appropriate value of q eachtime that it is used. These characteristics arerepresented by the letters “a” and “v.” Theletter a represents the number of groups,which in this example is 3. The letter vrepresents the df, which in this test is the dfassociated with the within-samples meansquare. In this case, the value of v is 12, ascalculated for and shown in the ANOVAsummary table in Table 11.4. From the tableof q values for Tukey’s test (provided inAppendix 5) the value of q associated with an

Tukey’s honestly significant dif ference test 163

Page 179: Introduction to Statistics in Pharmaceutical Clinical Trials

(a, v) value of (3, 12) is 3.77. HSD is thencalculated as follows:

__1

HSD � 3.77� –– � 1.686.5

The absolute values of the three differences inmeans were displayed in Table 11.6. The differ-ences between the 10 mg and 20 mg groups andthe 10 mg and 30 mg groups were both greaterthan the HSD (1.686). Therefore, these differ-ences are considered statistically significant atthe 0.05 level. The difference between the 20 mgand 30 mg groups was not significant, however,because it was less than the HSD.

Although Tukey’s method does not requireequal sizes among the groups, imbalanced groupsizes do require a different calculation of HSD.When the sample sizes are unequal among allgroups being compared, there is not onecommon value of HSD because this value relieson the sample size per group. For the compar-ison of any two means with group sample sizesof n1 and n2, the value of HSD corresponding tothat particular comparison is:

____________q 1 1

HSD � –––– � Vw (–– � ––).�__2 n1 n2

In the case that n1 and n2 are equal this expres-sion simplifies to the one we originallypresented.

Interpretation and decision-making

Having gone through the calculations necessaryfor Tukey’s test, we can look at how these resultswould lead to decision-making, and alsocompare the interpretation and decision-makingwith those that followed from using Bonferroni’smethodology on the same dataset.

The statistical interpretations of these resultsare the same as with Bonferroni’s method. The 20and 30 mg doses both resulted in a statisticallysignificantly greater SBP reduction than the 10 mgdose. There was insufficient evidence to claimthat there is a statistically significant differencebetween the 20 mg and the 30 mg doses.

11.10 Implications of the methodologychosen for multiple comparisons

The most important lesson to be learned fromour discussions of various analytic methodolo-gies for multiple comparisons is that the methodchosen can have a major impact on the risk ofmaking incorrect decisions.

Consider the absolute difference in any twomeans that was required to reject a null hypoth-esis of H0: l1 � l2 � 0 after rejection of theomnibus F test. In the case of the naïveapproach, which was to test each pair of meansseparately and use an a level of 0.05 in each case,the minimum significant difference would be1.126, but the overall type I error could be guar-anteed to be bounded only by 0.143 (see Table 10.5). The use of Bonferroni’s methodresulted in a minimum significant difference of1.757, but it is overly conservative and theoverall type I error rate would be guaranteed tobe � 0.050. Tukey’s method, which accounts forthe actual distribution of differences through q,resulted in a minimum significant difference of1.686 and guaranteed that the overall type Ierror rate � 0.050, resulting in a more powerfultest than Bonferroni’s method. Given theirimportance, these characteristics are summarizedin Table 11.7.

Lastly, it is important to note that differencessuch as these underscore the importance ofdeclaring the primary analysis approach in astudy protocol or statistical analysis plan.Committing to the most appropriate analysisfrom first principles is not only good scientificdiscipline, it is also necessary to withstandregulatory scrutiny.

It should be noted that these are not the onlyacceptable methods applicable to multiplecomparisons from an ANOVA. In each individualcase, the choice among possible approaches islargely dependent on the study design. Forexample, Dunnett’s test can be used when theonly comparisons of interest are each test treat-ment versus a control (for example, in a placebo-controlled, dose-ranging study). Like Tukey’stest, Dunnett’s method is more powerful thanBonferroni’s. In general, other methods gainpower compared with Bonferroni’s method by

164 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 180: Introduction to Statistics in Pharmaceutical Clinical Trials

using methods that account for the correlationof tests (for example, Tukey’s HSD test) or byreducing the number of tests about which wewould like to make an inference (for example,Dunnett’s test). When conducting these types ofanalyses, it is theoretically possible (althoughnot common) to report a significant overall Ftest, but not declare any pairwise comparison asstatistically significant as a result of the multiplecomparison procedure.

Consideration of the possible clinical interpre-tation of these results is also worthwhile. Theinterpretations given in the above sections arethe full statistical interpretations from the statis-tical analyses that were performed on the datacollected in this study. In real clinical trials, theseresults are also interpreted clinically, that is, theirclinical significance is discussed. Making theseclinical efficacy interpretations is the province ofthe clinicians on the study team. As we empha-sized earlier in this book, we are not clinicians,and these “hypothetical comments” concerningthe potential clinical significance of hypothet-ical data must be regarded in this light.

First, the clinical significance of a decrease inSBP of 6 mmHg versus a decrease of 8 or 9mmHg would need to be considered. As thesenumerical values are all relatively close, let uscreate some hypothetical values that conformto the same overall pattern of significance butare more different from each other. Supposethat these mean decreases in SBP wereobserved using the same doses of a differentantihypertensive drug:

• 10 mg group mean � �6 mmHg• 20 mg group mean � �18 mmHg• 30 mg group mean � �19 mmHg.

Suppose also that Tukey’s test provided evidenceof the same pattern of statistical significance:

• 10 versus 20 mg � �6�(�18) � 12; p � 0.05• 10 versus 30 mg � �6�(�19) � 13; p � 0.05• 20 versus 30 mg � �18�(�19) � 1; not

significant (ns).

In this scenario, the clinical significance of adecrease in SBP of 6 mmHg versus a decrease of18 or 19 mmHg would need to be considered.Suppose that decreases of 18 and 19 mmHg areboth considered to be much more clinicallysignificant than a decrease of 6 mmHg. Supposealso that the 20 mg dose had a good (and there-fore acceptable) safety profile, whereas the safetyprofile of the 30 mg dose was not so good. Ofrelevance in this scenario is that there was not astatistically significant difference in efficacybetween these two dose groups. It is true that themean decrease in the 20 mg group was numeri-cally less than the mean decrease in the 30 mggroup, but it was not statistically significantlyless. Therefore, it might be the case that, wheninput had been received from all members of thestudy team, including statisticians and clini-cians, a decision would be made to progress onlythe 20 mg dose to further trials: The 10 mg doseis statistically significantly less effective, and the 30 mg dose has a less desirable safety profilewhile also not being statistically significantlymore effective (see Turner, 2007).

This scenario illustrates several key points:

• Decision-making is not necessarily straight-forward.

• The empirical evidence from our clinical trials provides the basis for rational decision-making.

Implications of the methodology chosen for multiple comparisons 165

Table 11.7 Characteristics of the methods to test the three pairwise comparisons of means in the ANOVA example

Method Minimally significant Overall type I error rate: P (rejectingdifference at least one null hypothesis)

Naïve approach (incorrect) 1.126 � 0.143Bonferroni’s test (correct but 1.757 � 0.050conservative)Tukey’s HSD test (correct, and more 1.686 � 0.050powerful than Bonferroni’s)

Page 181: Introduction to Statistics in Pharmaceutical Clinical Trials

• In most cases many members of the studyteam, including statisticians and clinicians,are needed to make the optimum decision.

In real life, clinical interpretations are vital tobalance the relative weight of safety and efficacyconsiderations. If a higher dose of a given drug isconsiderably more efficacious than a lower doseand leads to only a minimal increase in verymild side-effects, a clinician may decide that, onbalance, it is worth recommending the higherdose. Conversely, if a higher dose of a given drugis only minimally more efficacious than a lowerdose and leads to a considerable increase inmoderate or severe side-effects, a clinician mayrecommend the lower dose.

11.11 Additional considerations aboutANOVA

Before completing our discussions of ANOVA,there are several additional points that we wouldlike to address, because these questions mayhave occurred to you as you have read thepreceding descriptions of the use of ANOVA andmultiple comparisons in this chapter.

11.11.1 ANOVAs with only two groups

A one-way ANOVA containing three levels wasused as the worked example in this sectionbecause a t test cannot address a design withmore than two levels. However, the one-wayANOVA can certainly be used in situationsinvolving only two levels. A reasonable question,therefore, is: In situations involving only twolevels, where the only possible comparison isbetween one level and the other, is there anyadvantage in using the one-way ANOVA insteadof the t test?

The answer is no. In cases where there are only two levels, either test is applicable. Thevalues obtained in the calculations of the respec-tive tests will be different, but the tests will giveprecisely the same answer in terms of the degreeof statistical significance obtained by the respec-tive test statistic. That is, the t value and F value

will not be the same (the F-test statistic will besquare of the t-test statistic), but the associated p values will be identical. The advantage of theANOVA lies with its ability to address situationsinvolving more than two levels, which are verycommon in clinical research.

11.11.2 Only collect data that you intendto analyze

Consider a scenario where a series of possiblecomparisons exists, but the investigator isgenuinely interested only in one of these compar-isons. Such a hypothetical scenario might involvea study employing four groups, with participantsin each group receiving one of four dose levels (1,2, 3, and 4) of a particular drug, and primaryinterest lay with comparing dose levels 1 and 4 –that is, out of the possible six comparisons,interest lay only with the comparison of doselevels 1 and 4. A question that arises here is: Is itpossible to argue that this one comparison couldbe made without having to adopt a more conser-vative approach? The correct answer from apurely statistical computational viewpoint is yes,this argument can successfully be made. The indi-vidual test may be undertaken using a t test at thea � 0.05 level, that is, without adopting a moreconservative approach, because this one partic-ular comparison of interest was specified fromfirst principles. However, this is not the finalanswer here.

Although this argument is perfectly satisfac-tory from a purely computational view, anotherquestion begs to be asked: If the investigator wasinterested only in comparing dose levels 1 and 4,why were dose levels 2 and 3 included in thestudy? This question is pertinent in several ways.It costs a lot of time and money to collect suchclinical data, and the costs associated withparticipants in two of the four experimentalgroups would be wasted. Much more importantthan the unnecessary costs, however, would bethat the participants in the dose level 2 and 3treatment groups would have taken part in thestudy for no useful reason, a gross violation ofexperimental ethics.

A much more realistic scenario is one in whichfour doses are included in such a study because

166 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Page 182: Introduction to Statistics in Pharmaceutical Clinical Trials

the investigator does not have clear logical ideas(hypotheses) about the relative merits (perhapsrelative efficacy) of the doses. In this case, anoriginal omnibus analysis such as the one-factorANOVA provides a very efficient initial test fordifferences among the groups. If a statisticallysignificant result is given by the ANOVA, theinvestigator can then proceed to comparing pairsof groups in formal (and appropriate) multiplecomparison testing.

11.12 Nonparametric analyses ofcontinuous data

There are times when the required assumptionsfor ANOVA, a parametric test, are not met. Oneexample would be if the underlying distributionsare non-normal. In these cases, nonparametrictests are very useful and informative. Forexample, we saw in Section 11.3 that a nonpara-metric analog to the two-sample t test, Wilcoxon’srank sum test, makes use of the ranks of obser-vations rather than the scores themselves. Whena one-factor ANOVA is not appropriate in aparticular case a corresponding nonparametricapproach called the Kruskal–Wallis test can beused. This test is a hypothesis test of the locationof (more than) two distributions.

11.13 The Kruskal–Wallis test

All that is required for this test to be employed isthat the observations classified into k groups areindependently sampled from populations andthe random variable is continuous with the samevariability across the populations represented bythe samples. Importantly, no assumption aboutthe shape of the underlying distribution isrequired, making this test suitable for non-normal underlying distributions.

In the Kruskal–Wallis test the original scoresare first ranked and an ANOVA analysis is thencarried out on the ranks. As with Wilcoxon’srank sum test, ranking of the observations mustdeal with ties. The sums of squares are based on

these ranks, and the test statistic is based on aratio of the among-samples variability in ranksand the within-samples variability in ranks.

All observations, xij, are assigned ranks, rij, andtherefore the usual sums of squares can be calcu-lated for the rank scores, rij. For brevity, theexpressions for each are provided in Table 11.8, ageneral one-way ANOVA table, on the basis ofranks.

The quantities in the ANOVA table based onranks represent similar quantities as the ANOVAtable based on the original scores:

rij is the rank for individual i in group jnj is the sample size for group j

k

n �R nj is the total sample sizej�1

rj is the average rank for group jnj

R(rij � r j )2

i�1s2j � –––––––—–– is the variance of ranks in group j.

nj � 1

r is the average rank over all groups (the grandmean rank), which can be simplified as:

n � 1r . � ––––––.

2

The omnibus test statistic, X2, follows a v2 distri-bution with k � 1 df. If the omnibus test isrejected the pairs of groups can be evaluatedusing a Bonferroni-type approach. This requiresthe assumption that the ranks are normallydistributed. As with the parametric one-wayANOVA, a minimally significant difference inranks can be calculated for this purpose as:

________________1 1

MSD � z1�(a/2c) � VW,ranks (–– � ––).n1 n2

For the sake of this example, we use the datafrom the parametric ANOVA example to illus-trate the Kruskal–Wallis test. If it seems at allstrange to use the same data for both examples,a parametric analysis and a nonparametricanalysis, it is worth noting that a nonparametricanalysis is always appropriate for a given datasetmeeting the requirements at the start of thechapter. Parametric analyses are not alwaysappropriate for all datasets.

The Kruskal–Wallis test 167

Page 183: Introduction to Statistics in Pharmaceutical Clinical Trials

Statistical analysis

The analysis begins with ordering all 15 observa-tions. Note that statistical software packagesorder and rank the observations and do theANOVA for you. The ordered observations fromlowest to highest across the three groups are asfollows:

Then ranking each observation, accounting forties as described for the one-way ANOVA, thefollowing ranks are obtained:

The within-samples average ranks are:

r10 � 12.5r20 � 7.1r30 � 4.4.

And the grand mean rank:

r . � 8.

The within-samples variances (of ranks) are:

s210 � 4.63

s220 � 12.18

s230 � 9.05.

The among-samples mean square is calculatedas:

5(12.5 � 8)2 � 5(7.1 � 8)2 � 5(4.4 � 8)2

VA,ranks � –––––––––––––––––––––––––––––––––––––– � 85.05.2

The within-samples mean square (mean squareerror) is calculated as:

4(3.13) � 4(12.18) � 4(9.05)VW,ranks � ––––––––––––––––––––––––––– � 8.12.

12

Finally, the test statistic is the ratio of these two:

85.05/8.12 � 10.48.

We note that the test statistic is greater than thecritical value of 5.991 (2 df with an a level of0.05), so the null hypothesis is rejected.

The next step is to decide which groups (threecomparisons) are different with respect to theirlocation. For this purpose the MSD is calculatedas:

____________ ____________1 1 1 1MSD � z0.992 �8.12 (–– � ––) � 2.41 �8.12 (–– � ––) � 4.34.5 5 5 5

168 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Table 11.8 General one-way ANOVA table for ranks (Kruskal–Wallis test)

Source Sum of squares Degrees of freedom Mean square X2

Among samples k � 1

Within samples n � k

Total n � 1

k

Rnj(r j � r .)2

j�1

� SSAranks

k

Rnj(r j � r .)2

j�1–––––––––––––––––––

k � 1� VA, ranks

VA,ranks/VW,ranks

k nj

RR (rij � r )2

j�1 i�1

� SSTranks

k nj

RR (rij � r )2

j�1 i�1–––––––––––––––––––

n � 1� VT, ranks

k

R (nj � 1)s2j

j�1

� SSWranks

k

R (nj � 1)s2j

j�1–––––––––––––––––––

n � k� VW, ranks

10 mg �7 �6 �6 �6 �520 mg �9 �9 �8 �8 �630 mg �10 �10 �9 �8 �8

10 mg 10 12.5 12.5 12.5 1520 mg 4 4 7.5 7.5 12.530 mg 1.5 1.5 4 7.5 7.5

Page 184: Introduction to Statistics in Pharmaceutical Clinical Trials

The differences in mean ranks are displayed inTable 11.9. Differences in mean ranks that aregreater than the MSD are considered significantlydifferent.

Interpretation and decision-making

As the difference in mean ranks exceeds the MSDfor the comparison of 10 vs 20 mg and 10 vs 30 mg, we can conclude that these distributionsdiffer in location. This testing procedure ensuredthat the overall type I error did not exceed 0.05.To interpret the clinical relevance of the differ-ences detected by the test requires some addi-tional point estimates. As the initial procedurewas a nonparametric one, the differences insample means are not appropriate. A morereasonable choice would be to compare themedians as an estimate of the treatment effect.

The nonparametric one-way ANOVA can bequite useful in a number of settings. The mostobvious is when reasonable judgment does notallow you to conclude that the distributionalassumptions for the one-way parametric ANOVAwill hold. Another instance is when the dataavailable for analysis are only ordinal (forexample, like a rank) such that the differencebetween two values does not hold the samemeaning as an interval scaled random variable.

There are a number of nonparametric analysismethods dealing with continuous data. The laststatistical method included in this chapter is tobe used when the continuous outcome is time toan event.

11.14 Hypothesis test of the equality ofsurvival distributions: Logrank test

In Chapter 8 we described analyses to estimatethe survival distribution of time to an adverseevent. The survival function is the probability

that an individual survives (that is, does notexperience the event) longer than time t:

S(t) � P(individual survives longer than t).

In Chapter 10 the use of this method wasdiscussed in terms of estimating the mediansurvival time for participants in a clinical trial.The median survival time can be helpful as asingle summary statistic that defines a typicalsurvival time. However, survival distributionsmay deviate at various points in time. In thissection we present the logrank test, which canbe used to test the equality of two or moresurvival distributions. This is not the only testthat can be used for this purpose, but it is anatural extension of a method that we havealready described and so we have chosen todiscuss it.

A test of the equality of two survival distribu-tions would be expressed in terms of the nullhypothesis:

H0: S1(t ) � S2(t ).

If there is sufficient evidence to reject the nullhypothesis the alternate hypothesis would befavored:

HA: S1(t ) � S2(t ).

If, in the context of the survival distribution weconsider all of the times at which an eventoccurred and index them as t(1) � t(2) � t(3). . . � t(H), it is possible to create a 2 2 classifi-cation table for event times t(h), where h � 1, 2,3, . . ., H in which the numbers of individualswith and without the event of interest aredisplayed for each group. Table 11.10 is a samplecross-classification table for time h.

Given the familiar set-up of this contingencytable it may not surprise you that we can use the

Hypothesis test of the equality of survival distributions: Logrank test 169

Table 11.9 Absolute differences in mean ranks

20 mg 30 mg

10 mg 5.4 8.120 mg 2.7

Table 11.10 Cross-classification table of treatmentand event at time h

Event? Group 1 Group 2 Total

Yes m1h m2h mh

No n1h � m1h n2h � m2h nh � mh

n1h n2h nh

Page 185: Introduction to Statistics in Pharmaceutical Clinical Trials

methods of the stratified (Mantel–Haenszel) v2

test to define a test statistic. Each of the distinctevent times is treated as a stratum, just as wetreated investigative centers as strata earlier. Thetest statistic for the logrank test is:

H nh1 nh2( R –––––– (ph1 � ph2))2

h�1nh

X2LR � ––––––––––––––––––––––––.

H nh1 nh2R –––––– phqh

h�1nh�1

As before, the proportion of observations withthe characteristic of interest at time h for the twoindependent groups is denoted by ph1 and ph2,respectively. The overall proportion of individ-uals with the characteristic of interest withineach time h is denoted by ph. The overall propor-tion of individuals without the characteristic ofinterest within each time h is denoted byqh � 1 � ph.

When the sample size is reasonably large(n � 30), the test statistic X2

LR follows a v2 distri-bution with 1 df. Values of the test statistic thatlie in the critical region are those withX2

LR � v21,1�a, that is, values of v2 with 1 df that cut

off the upper tail area of a.To illustrate an example, we use the data from

Chapter 9 with some modifications. Althoughthe event of interest in that case was an adverseevent, a safety parameter, we can treat it thistime as an efficacy parameter.

Event of interest

The event of interest is return to a state ofnormal blood pressure (by some measure). Thetreatment administered to the group demon-strating earlier event times would be consideredthe better treatment.

Design

In this 10-day study of a novel antihypertensive,hypertensive study participants were randomlyassigned to test treatment or placebo (10 in eachgroup). They were monitored once a day (in theevening) to measure their resting SBP. Theprimary endpoint of the study was the time(days) to return to a normal blood pressure.

Data

The event times for the placebo and activegroups are provided below (“C” indicates acensored observation):

The unique times at which events occurred (notcensored observations) are on days 2, 3, 4, 5, and 8. Table 11.11 represents the requiredcontingency tables for the logrank test.

170 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

Table 11.11 Contingency table of treatment by eventat each event time

Day 2:Normal SBP? Active Placebo Total

Yes 1 0 1No 9 10 19

10 10 20

Day 3:Normal SBP? Active Placebo Total

Yes 2 0 2No 7 10 17

9 10 19

Day 4:Normal SBP? Active Placebo Total

Yes 3 1 4No 4 8 12

7 9 16

Day 5:Normal SBP? Active Placebo Total

Yes 0 1 1No 4 7 11

4 8 12

Day 8:Normal SBP? Active Placebo Total

Yes 0 3 3No 4 4 8

4 7 11

Placebo: 3(C) 4 5 8 8 8 10(C) 10(C) 10(C) 10(C)Active: 2 3 3 4 4 4 10(C) 10(C) 10(C) 10(C).

Page 186: Introduction to Statistics in Pharmaceutical Clinical Trials

Note that on day 2 there were 10 participantsat risk for the event in the active group. On day2 one participant in the active group had theevent of interest and is therefore removed fromthe number at risk at later time points. At day 3there were nine remaining in the active group,two of whom experienced the event, leavingseven in the “risk set” for later times. On day 3one placebo participant was censored, meaningthat day 3 was the last known time at which theparticipant had not experienced the event. Thisperson is removed from the risk set for latertimes. The tables are filled out in a similarmanner for all times at which the eventsoccurred. The important thing to remember withthese contingency tables is that the number ineachgroupdecreasesfor latertimepointswhentheindividual either had the event or was censored.

The test statistic can be computed by hand, butsoftware is the ideal method, especially for morethan a handful of event times. The numeratorpart of the test statistic would be calculated as:

(10)(10) (9)(10) (4)(7)[ ––––––– (0.10 � 0) � –––––– (0.22 � 0) � . . . � ––––– (0 � 0.43)]2

20 19 11

� 1.90.

The denominator would be calculated as:

(10)(10) (9)(10) (4)(7)––––––– (0.05)(0.95) � –––––– (0.11)(0.89) � . . . � ––––– (0.27)(0.73)

19 18 10

� 2.286.

The test statistic is calculated as the ratio of thetwo:

1.90v2

LR � –––––– � 0.831.2.286

Interpretation and decision-making

As we saw in Table 10.5, the critical value for thetest at an a level of 0.05 is 3.841. As the value ofthe test statistic 0.831 � 3.841 there is notenough evidence to reject the null hypothesis.

Small studies such as this can be difficult tointerpret. There is a suggestion that the times toresponse may be shorter with the active treat-ment, but the hypothesis test did not suggestthat the variation seen was attributable toanything but chance given the sample size.

Review 171

11.15 Review

1. In a therapeutic exploratory trial comparing asingle dose of a new analgesic to placebo, 17individuals were treated with the new analgesic(test treatment) and 15 were treated with theplacebo (control). The participants reported theseverity of their pain 6 hours after dental surgeryusing a visual analog scale (VAS). Pain scores onthis scale range from 0 to 100, where 0 � “nopain” and 100 � “very severe pain.” The mean(SD) pain score in the test treatment group(n � 17) was 18 (7). The mean (SD) score in thecontrol group (n � 15) was 24 (8). Investigatorswould like to know if the mean VAS pain score isdifferent between the two populations assumed tobe represented by the two samples of studyparticipants.

(a) What are the null and alternate hypotheses?(b) Assume a � 0.05. What are the values of the

rejection region?(c) What assumptions are necessary for the use

of the t test? (d) What is the value of the test statistic?(e) What is your interpretation of the hypothesis

test?

2. This ANOVA table represents data from a study ofan analgesic. The variable of interest is a painscore (higher values mean greater pain).

Source SS df MS Fdrug 99.89459 2 * *Error * 30 *Total 338.57355 32

(a) Write in the missing values of the ANOVAtable (denoted with *).

(b) In this study, how many treatments weretested?

(c) What are the null and alternate hypotheses?(d) How many individuals were studied? (e) What is the critical region for a test with

a � 0.05?(f) What is the statistical conclusion and

interpretation of the hypothesis test?

3. Consider an ANOVA with four treatment groups(30 participants in each), placebo, and threedoses of an investigational drug: Low, medium,and high.

Page 187: Introduction to Statistics in Pharmaceutical Clinical Trials

11.16 References

Schork MA, Remington RD (2000). Statistics with Appli-cations to the Biological and Health Sciences, 3rd edn.Upper Saddle River, NJ: Prentice-Hall.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

172 Chapter 11 • Confirmatory clinical trials: Analysis of continuous efficacy data

(a) What are the null and alternate hypotheses?(b) What assumptions must be made for the

ANOVA?(c) If the omnibus F test is significant, what are

the pairwise comparisons that would be ofinterest?

(d) Why would Tukey’s test be useful to evaluatethe pairwise comparisons in (b)?

(e) Assume the mean square within-samples is20. What is the value of the minimumsignificant difference – Tukey (MSDT) thatwould determine whether pairs of treatmentswere significantly different?

4. In what situations would the Kruskal–Wallis test beappropriate?

Page 188: Introduction to Statistics in Pharmaceutical Clinical Trials

12.1 Introduction

The previous chapters have provided you withan introduction to statistical methods andanalyses that are commonly used in pharmaceu-tical clinical trials, with an emphasis on thera-peutic confirmatory trials. Although we certainlyhave not covered all of the analyses that can beconducted in these trials, those that we havediscussed have given you a solid foundation thatwill also enable you to understand the basics ofother analyses.

Throughout our discussions we have illus-trated the importance of selecting the appro-priate analytical strategy that best serves theobjective of a given trial. There is hardly asingle statistical method that always applies to agiven study design or type of data: Rather, thechoice of the analytical strategy for a given trialis the result of statistical considerations, clinicaljudgments, and regulatory standards.

In this chapter we highlight additional statis-tical considerations relevant to therapeuticconfirmatory trials, and other study designs thatalso provide important information upon whichto base decision-making. These additionalinsights and information build upon the mater-ial presented so far. As this chapter is largelyconceptual rather than computational, we haveincluded a number of references to guide yourfurther reading.

12.2 Sample size estimation

An important part of study design is the “deter-mination” of the required sample size. Beforestarting, we should note that we prefer the term

“estimation” to the terms “determination” and“calculation” of a sample size. Although a math-ematical calculation is certainly performed here,the values that are put into the appropriateformula are chosen by the researcher.

It is also appropriate to note that not all clin-ical trials utilize formal sample size estimationmethods. In many instances (for example, FTIHstudies) the sample size is determined on thebasis of logistical constraints and the size of thestudy thought to be necessary to gather suffi-cient evidence (for example, pharmacokineticprofiles) to rule out unwanted effects. However,when the objective of the clinical trial (forexample, a superiority trial) is to claim that atrue treatment effect exists while at the sametime limiting the probability of committing typeI or II errors (a and b), there are computationalmethods used to estimate the required samplesize. The use of formal sample size estimation isrequired in therapeutic confirmatory trials, thisbook’s major focus, and strongly suggested intherapeutic exploratory trials.

12.2.1 Sample size for continuousoutcomes in superiority trials

Consider the simple case of a superiority trial ofan investigational drug (the test treatment)being compared with placebo with respect to acontinuous outcome (for example, change frombaseline SBP). The null hypothesis typicallytested in such a trial and its complementaryalternate hypothesis are:

H0: lTEST � lPLACEBO � D

HA: lTEST � lPLACEBO � D.

12Additional statistical considerations in clinical trials

Page 189: Introduction to Statistics in Pharmaceutical Clinical Trials

There are a number of values of the treatmenteffect (delta or D) that could lead to rejection ofthe null hypothesis of no difference between thetwo means. For purposes of estimating a samplesize the power of the study (that is, the proba-bility that the null hypothesis of no difference isrejected given that the alternate hypothesis istrue) is calculated for a specific value of D. In thecase of a superiority trial, this specific valuerepresents the minimally clinically relevantdifference between groups that, if found to beplausible on the basis of the sample data throughconstruction of a confidence interval, would beviewed as evidence of a definitive and clinicallyimportant treatment effect.

Another way of stating this is that, if the truedifference in population means is as large as aspecific value of D proposed as clinically impor-tant, we would like to find the sample size suchthat the null hypothesis would be rejected (1 �b)% of the time. The sample size must alsobe chosen so that a is maintained at an acceptablylow value.

The sample size formula required to test (two-sided) the equality of two means from randomvariables with normal distributions is:

2r2(Z1�a/2 � Z�)2

n per group � ––––––––––––––––.D2

In this equation:

• n is the sample size per group• r2 is the assumed variance• Z1�a/2 is the value of the Z distribution that

defines an area of size a/2 in the upper tail ofthe Z distribution

• Zb is the value of the Z distribution thatdefines an area of size b in the lower tail of theZ distribution

• D is the difference in means that we wouldlike to detect, if it exists, by virtue of rejectingthe null hypothesis.

Both a and b are design parameters, and arechosen at the discretion of those designing thetrials. In confirmatory trials, a is 0.05 and b istypically 0.10 or 0.20 (meaning that the studyhas 90% or 80% power, respectively). Thechoices of r and D are not quite as straightfor-ward, because the range of possible values is

outside the direct control of the study planner.The standard deviation r must be estimatedusing (any) available data, and the value of thetreatment effect D is determined using clinicaljudgment.

It is important to note that, all other thingsbeing equal, the following statements are true:

• The required sample size increases as the vari-ance increases.

• The required sample size increases as the sizeof the treatment effect decreases.

• The required sample size increases as adecreases.

• The required sample size increases as thepower (1 �b) increases.

This sample size formula can be illustrated withthe following example. Suppose that, inexploratory therapeutic trials of a new anti-hypertensive, the standard deviation for thebetween-treatment difference in mean changefrom baseline SBP was estimated to be 50 mmHg.After reviewing the literature and consultingwith regulatory authorities, it is agreed that abetween-treatment group difference in meanchange from baseline (that is, the treatmenteffect) of at least 20 mmHg would be considereda clinically important benefit of a new drug totreat hypertension. The study sponsor is plan-ning a confirmatory trial comparing the testdrug with a placebo and would like to have anexcellent chance (90%) of claiming that thetreatment effect is not zero if the drug is as effi-cacious as they believe. From the expressionabove, the sample size required per group is:

2(50)2(1.96 � 1.645)2

n � ––––––––––––––––––––202

� 133 per group for a total of 266 individuals.

This sample size estimate would be described inthe study protocol in this manner:

A total of 266 participants (133 per group) willbe randomized in this study in a 1:1 ratio to testand placebo. Assuming a common standarddeviation of 50 mmHg, this sample size willprovide 90% power to detect a between-groupdifference in mean change from baseline of atleast 20 mmHg using a two-sided test of size a �

0.05.

174 Chapter 12 • Additional statistical considerations in clinical trials

Page 190: Introduction to Statistics in Pharmaceutical Clinical Trials

The power of the study is the probability ofrejecting the null hypothesis of no difference inmeans, assuming that the true difference is atleast 20 mmHg and the estimated variance iscorrect. As we have seen in this book, all esti-mates have sampling variation associated withthem. Therefore, it can be helpful to see how thepower to detect a difference of 20 mmHg variesas a function of sample size using three differentvalues of the standard deviation. The impact ofthese two factors on the power can be seen inFigure 12.1, a graphical display called a powercurve. Figure 12.1 is a compelling illustration ofthe importance of the assumed value of the stan-dard deviation. Consider that, in the design ofthe study in our worked example, the assumedstandard deviation of 50 mmHg led to a samplesize of 133 per group for a power of 90% todetect the important difference of 20 mmHg. Ifthe standard deviation was underestimated suchthat it was really 70 mmHg, the study wouldreally only have 64% power to detect the differ-ence that was considered important. Of course,this cannot be known in advance of a trial, but apost hoc examination of the study data, and apossible re-estimation of the standard deviation,can better inform future trials and increase theprobability that they will be successful.

12.2.2 Sample size for binary outcomes insuperiority trials

We have encountered a number of statisticalmethods used to test the difference between twopopulation proportions. Suppose that we areinterested in estimating the sample size for asuperiority trial of an investigational drug (thetest treatment), which will be compared withplacebo with respect to a binary outcome, forexample, proportion of individuals attaining agoal SBP. The null hypothesis and its comple-mentary alternate hypothesis typically tested insuch a trial are:

H0: pTEST � pPLACEBO � 0.

HA: pTEST � pPLACEBO � 0.

As in Chapter 10, the population proportions foreach of two independent groups are representedby pTEST and pPLACEBO. Just as for the case ofcontinuous outcomes, the power of the study iscalculated for a specific value of D � pTEST �

pPLACEBO, a value that is considered the minimallyclinically relevant difference (CRD).

The sample size formula required to test (two-sided) the equality of two population propor-tions used here is cited from Fleiss et al. (2003).

Sample size estimation 175

Pow

er

0

Sample size per group

0

100 200

30SD

50

70

300 400

0.2

0.4

0.6

0.8

1.0

Figure 12.1 Power curve (D � 20 mmHg) as a function of sample size (n) for r � 30 mmHg, 50 mmHg, and 70 mmHg

Page 191: Introduction to Statistics in Pharmaceutical Clinical Trials

The calculation involves two parts. The firstmakes use of a normal approximation:

_____ ___________________________(Z1�a/2 �2p q � Zb�pTESTqTEST � pPLACEBOqPLACEBO)2

n� � –––––––––––––––––––––––––––––––––––––––––––,(D)2

where:

D � pTEST � pPLACEBO,

pTEST � pPLACEBO,p � ––––––––––––––,

2

q � 1 � p ,

qTEST � 1 � pTEST,

and

qPLACEBO � 1 � pPLACEBO.

Note that the sample size depends not only onthe value of D, but also on the individual propor-tions themselves. The implication of this is thatthe sponsor must make a reasonable estimate ofthe response in the placebo group (that is,pPLACEBO) and then postulate a value of D that isclinically relevant. The corresponding value ofpTEST can be obtained by subtraction. This firstsample size estimate (n�) can be improvedthrough the use of a continuity correction,which gives more accurate results when adiscrete distribution (in this case the binomialdistribution) is used to approximate a contin-uous distribution (in this case the normal). Thesample size formula with continuity correctionis:

________n� 4

n per group � –– (1 � � 1� ––––)2

.4 n���

In a confirmatory efficacy trial the study sponsorwould like to evaluate a test treatment (an anti-hypertensive) versus placebo with respect to abinary outcome of attaining a goal SBP 140mmHg. After reviewing several sources of datathe sponsor estimates that the placebo responsewill be around 0.20 (that is, 20% of individualswill attain the goal without medical therapy).The sponsor would like to estimate the sample

size required to detect a difference in responserates of 0.20 – that is, the postulated value of theresponse for test treatment is 0.40. As the studyis a confirmatory trial, 90% power is recom-mended and the test will be a two-sided test witha � 0.05.

Substituting these values into the formula forthe per-group sample size, we obtain:

____________ ________________________(1.96�2(0.30)(0.70) � 1.645�(0.40)(0.60) � (0.20)(0.80)2

n� � ––––––––––––––––––––––––––––––––––––––––––––––––––––– � 306.(0.20)2

With a continuity correction the result is:_____________

306 4n � –––– (1 � � 1� –––––––––)

2

� 316 individuals per group.4 306(0.20)

This sample size estimate would be described inthe study protocol in this manner:

A total of 632 individuals (316 per group) will berandomized in this study in a 1:1 ratio to the testtreatment and the placebo treatment. Assuminga placebo response rate of 20%, this sample sizewill provide 90% power to detect a between-group difference in response rates of 20% usinga two-sided test with a � 0.05.

As for continuous data, a power curve can begenerated for a number of scenarios for binaryoutcomes. As seen in Figure 12.2, the power of atest of proportions (for a fixed value of D) is quitesensitive to the particular assumed value of theresponse rate in the control (for example,placebo) group.

12.2.3 a and b reconsidered using Bayes’theorem

After a study has been completed, a statisticalanalysis provides a means either to reject or tofail to reject the null hypothesis. The statisticalconclusion will, in part, be used to justifywhether or not further investment is made inthe development of a test product. A sound busi-ness strategy would dictate that further invest-ment be made only if objective informationfrom the study suggests it. Inferential statistics

176 Chapter 12 • Additional statistical considerations in clinical trials

Page 192: Introduction to Statistics in Pharmaceutical Clinical Trials

(for example, a hypothesis test) is the appro-priate means to differentiate a real effect from achance effect. For the remainder of this sectionwe investigate the wisdom of adopting such apolicy, using an approach similar to thatdescribed by Lee and Zelen (2000). In particular,the remainder of this section will address thefollowing two questions:

1. How likely is the sponsor to be misled by theresult of the statistical conclusion from ahypothesis test with design parameters aand b?

2. What is the role of accumulating evidenceabout the true treatment effect on thecredibility of results from a hypothesistest?

Sample size estimation 177

Pow

er

0

Sample size per group

0100 200

0.05Referenceproportion 0.20

0.40

300 400 500 600

0.2

0.4

0.6

0.8

1.0

Figure 12.2 Power curve (D � 0.10) as a function of sample size (n) for pTEST � 0.05, pTEST � 0.20, and pTEST � 0.40

Page 193: Introduction to Statistics in Pharmaceutical Clinical Trials

Throughout this book we have emphasized therole of Statistics in designing and analyzingstudies that enable sponsors to make decisionsabout the future development of new drugs.When developing a new drug, information isaccumulated over time, with each stepinforming the next. As studies are completedthrough various stages of clinical development(FTIH studies, therapeutic exploratory studies,and one or more therapeutic confirmatorystudies) evidence is gathered that supports theefficacy of the new drug. This is true only if newstudies are planned because such a promiseexists. Hence we assume over time, with theaccumulation of new information, that variousscientists involved in the development programcould make an informed guess about theprobability that the drug works (that is, that thealternate hypotheses considered in Sections12.2.1 and 12.2.2 really represent the truth). Letus call this probability, s (tau).

s � probability that D � 0.

However, as additional studies would beconducted only if the accumulating evidencesuggested that there was a treatment benefit, srepresents the probability that the treatment istruly effective. We can think of s � 0 as repre-senting a molecule that has just been discovered,for which no evidence has been generated aboutits ultimate effect on a clinical outcome ofinterest. At the other extreme a value of s � 0.8represents a drug for which a great deal of infor-mation has been collected and most of the datasupport a beneficial effect of the treatment.Values of s around 0.5 may represent a drug forwhich some (or limited) data support a treat-ment benefit.

Consider the following probabilities, whichexpress the likelihood of the true state of affairsgiven the statistical conclusion at the end of thestudy:

a* � P(Null is true|Reject null)

b* � P(Alternate is true|Fail to reject the null).

The value a* is the probability that a rejectednull hypothesis (for example, p value a) ismisleading. That is, it represents the chance that,having rejected the null hypothesis of no effect,the treatment is not efficacious. Its complement,(1 �a*), is the probability that the rejected nullhypothesis is consistent with the truth (that is,the treatment is efficacious).

Similarly, the value b* is the probability thatfailure to reject the null hypothesis (for example,p value � a) is misleading. It represents thechance that, having failed to reject the nullhypothesis of no effect (and acting as if the nullis true), the treatment really is efficacious. Thecomplement, (1 �b*), is the probability that ourinability to reject the null hypothesis is consis-tent with the truth (that is, the treatment is notefficacious).

If we are to adopt a policy of using inferentialstatistics to make decisions in the light of uncer-tainty, we would like to minimize these proba-bilities, a* and b*, as they directly lead to wastedinvestment in the former case or a lost commer-cial opportunity in the latter.

Using Bayes’ theorem (recall our discussions inChapter 6), the probability,

a* � P(Null is true|Reject null),

can be written as:

P(Null is true)a* � P(Reject null | Null is true) –––––––––––––.

P(Reject null)

Recall also from Chapter 6 that the marginalprobability of an event can be expressed as aseries of conditional probabilities as long asthe conditional events are mutually exclusiveand exhaustive. This allows us to express theprobability,

178 Chapter 12 • Additional statistical considerations in clinical trials

P(Reject null) � P(Reject null) | Null is true) P(Null is true) � P(Reject null) | Alternative is true)P(Alternative is true).

Finally, putting this entire expression together we have:

P(Reject null) | Null is true)P(Null is true)a* � ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––.

P(Reject null) | Null is true)P(Null is true) � P(Reject null) | Alternative is true)P(Alternative is true)

Page 194: Introduction to Statistics in Pharmaceutical Clinical Trials

This probability can then be expressed as a func-tion of the design parameters, a and b, and theestimated probability that the alternative istrue, s:

a(1 � s)a* � –––––––––––––––––.

a(1 � s) � (1 � b)s

Bayes’ theorem and algebra can be used in asimilar fashion to obtain the following expres-sion for b*:

�sb* � –––––––––––––––––.

�s � (1 � a)(1 � s)

We can use these two expressions to answer thequestions posed at the beginning of this section.The first of these is:

How likely is the sponsor to be misled by theresult of the statistical conclusion from ahypothesis test?

The short answer is that it depends on the power(and therefore the sample size) of the study. Toillustrate this, assume that the value of thedesign parameter a is dictated by regulatoryconcerns, which is reasonable especially inconfirmatory trials. Further, before a new studyis completed there is still some doubt as towhether the new treatment is efficacious, suchthat the value of s is conjectured to be 0.5.Resulting values of the error rates, a* and b*, arepresented in Table 12.1 as a function of thepower (or, equivalently, b) of the study.

The key message from Table 12.1 is that theprobability of both errors decreases withincreases in statistical power. A study plannedwith power of 0.5 and a statistical decision toreject the null (for example, because p value

0.05) yields a probability of 0.09 that the two

treatments are not significantly different. Incontrast, a study with power 0.9 and the sameoutcome (to reject the null) yields a probabilityof 0.05 that the two treatments are really signif-icantly different. Even though the statistical testhas indicated that further investment should beconsidered because the test treatment appears tobe efficacious, the underpowered study leads toan unwise decision 1.8 times (0.09/0.05) moreoften than the conventionally powered study.Similar statements can be made about unwiselyabandoning an efficacious product by exam-ining the values of b*. Another way of statingthis is that the greater the statistical power for astudy, the more reliable the decisions made as aresult.

Now consider the second question:

What is the role of accumulating evidence aboutthe true treatment effect on the credibility ofresults from a hypothesis test?

To address this question, the error rates, a* andb*, are presented in Table 12.2 as a function of s(a measure of the likelihood the treatment is effi-cacious) with power 0.9 and a � 0.05, typicalvalues for highly powered studies. An examina-tion of a couple of cases will help to answer thisquestion.

When s � 0 there is a great deal of uncertaintyabout the probability that the treatment is effi-cacious. This situation may apply when there isno experience with the test treatment or someexperience with mixed or poor results. When astatistically significant result leading to rejectionof the null hypothesis has been observed in thissituation, the sponsor will be misled intothinking that the drug is effective when it reallyis not with probability 0.33. On the other hand,

Sample size estimation 179

Table 12.1 Error rates a* and b* as a function of b (a � 0.05 and s � 0.5)

b 1 � b (power) a* 1 � a* b* 1 � b*

0.5 0.5 0.09 0.91 0.34 0.660.4 0.6 0.08 0.92 0.30 0.700.3 0.7 0.07 0.93 0.24 0.760.2 0.8 0.06 0.94 0.17 0.830.1 0.9 0.05 0.95 0.10 0.90

Page 195: Introduction to Statistics in Pharmaceutical Clinical Trials

failure to reject the null hypothesis in this situa-tion will mislead the sponsor who abandonsdevelopment with probability 0.01.

Once some studies have been completed andevidence has been gathered to support the effi-cacy of the new treatment, the value of s may bearound 0.5. This value represents at leastmoderate evidence that the treatment is trulyefficacious. When a statistically significant resultleading to rejection of the null hypothesis hasbeen observed in this situation, the sponsor willbe misled into thinking that the drug is effectivewhen it really is not, with probability only 0.05.This reflects previous experience, which hasshown that the treatment provides a benefit.Failure to reject the null hypothesis will misleadthe sponsor who then abandons development asa result, with probability 0.10. Again, this proba-bility reflects previous experience because thenew evidence contradicts the prior belief thatthe treatment is efficacious so that acting on thenew study result may be misleading.

The case where s � 0.9 represents nearlycertain knowledge. It is hard to understand whyany additional data would be required in thisinstance. However, an examination of the errorrates a* and b* in this situation is illuminating.Rejection of the null hypothesis would come asno surprise so that such a result would rarely bemisleading. Failure to reject the null hypothesiswould come as a surprise because it is almostknown with certainty that the null is false. Thus,this information is too bad to be true and actingon it is unwise.

The probability s is analogous to the under-lying prevalence of disease in a population. In

the setting of diagnostic testing, a* and b* referto the positive and negative predictive values ofa test. As illustrated in Chapter 6, when evalu-ating a diagnostic test, even high values of sensi-tivity and specificity can lead to skepticismabout a positive test when the prevalence of theunderlying disease is low.

In a similar manner, s should serve to temperthe enthusiasm of study sponsors who haveobserved a new positive study result, especiallyearly in development programs. It can be used tocalibrate the credibility of statistical results.Without sufficient prior information about thetreatment even a statistically significant resultcan lead to poor (and expensive) business deci-sions. When a sponsor desires either to continueor to discontinue development of a new drug asa result of a study, the results in this sectionpoint to the importance of power. Despite theirother benefits, exploratory therapeutic trials,which tend to be small in size (and thereforehave low power), are poor studies on which tomake business decisions. Small, early clinicalstudies may provide some evidence on which tobase future research, including s. However, oncethat is done, there is no substitute for definitive,highly powered studies in appropriate popula-tions, using acceptable clinically relevantendpoints. In short, power, a statistical designparameter, has a direct bearing on the quality ofdecision-making. We believe that recognition ofthis relationship is very much underappreciated,and that it has a profound bearing on the waysound business decisions should be made.

12.2.4 Importance of collaboration insample size estimation

Sample size estimation requires the input of anumber of specialists involved with the develop-ment of new drugs. The estimate of the standarddeviation can be informed by exploratory thera-peutic trials of the same drug or by literaturereviews of similar drugs. Synthesis of these datafrom a number of sources requires statistical andclinical judgments. As was seen in Figure 12.1the estimate of the standard deviation has animportant effect on the sample size. Study teamsshould understand the sources of variability in

180 Chapter 12 • Additional statistical considerations in clinical trials

Table 12.2 Error rates a* and b* as a function of s(a � 0.05 and 1 � b � 0.9)

s a* 1 � a* b* 1 � b*

0.1 0.33 0.67 0.01 0.990.3 0.11 0.89 0.04 0.960.5 0.05 0.95 0.10 0.900.7 0.02 0.98 0.20 0.800.9 0.01 0.99 0.49 0.51

Page 196: Introduction to Statistics in Pharmaceutical Clinical Trials

the response variable and attempt to minimizeunwanted variability.

The definition of the minimally clinically rele-vant difference of interest involves clinical,medical, and regulatory experience and judg-ments. The appropriate sample size formuladepends on the test of interest and should takeinto account the need for multiple comparisons(either among treatments or with respect tomultiple examinations of the data). The projectstatistician provides critical guidance in this area.

It is appropriate to note here that in someinstances the sample size may not be completelydictated by the statistical requirements for agiven power calculation. The ICH has publisheda guidance document (ICH Guidance E1, 1994)applicable to drugs given chronically. This guid-ance specifies the minimum number of individ-uals who should be exposed for certain periodsof time so that potential adverse events (AEs)may come to light before the drug is marketed.The need for a larger safety database may super-sede the sample size required to demonstrate astatistically significant and clinically relevanttreatment effect.

In summary, sample size estimation requiresthe input of a number of disciplines involved inthe design of clinical trials.

12.3 Multicenter studies

A certain number of participants need to berecruited for any given trial. In Section 12.2 wediscussed sample size estimation, which takesinto account a number of considerations that areimportant not only to the statistician but also tothe clinical scientist and the regulator. Oncedetermined, the value produced by this processof estimation is incorporated into the studyprotocol.

We have seen that relatively small numbers ofparticipants are recruited for early phase trials(perhaps 20–80 in FTIH studies and 200–300 inearly Phase II studies), and relatively largernumbers are recruited for therapeutic confirma-tory trials (perhaps 3000–5000). It is relativelyeasy to recruit between 20 and 80 participants at

a single investigational site. Indeed, as we notedin Section 7.3, conducting a FTIH study at asingle center enhances consistency with respectto management of participants, study conduct,and assessment of AEs, and provides forfrequent and careful monitoring of study partic-ipants. However, it is not feasible to recruit3000–5000 participants at a single investiga-tional site, so multicenter studies are typical atthis stage of clinical development programs.

As for so many of the topics that we havediscussed, multicenter studies have both advan-tages and disadvantages. Let us consider thedisadvantages first and then focus on the advan-tages. The major disadvantage relates to thelogistic demands of coordinating a multicentertrial. The rarer the medical condition of interestin the trial, the more sites that will probablybe needed because fewer individuals will likelybe available at each site. It is not unusual tohave between 50 and 100 investigational sitesparticipating in some trials. These sites may bescattered across a country and, increasingly,they may be scattered across several countriesand continents. This occurrence has manyconsequences, including:

• All investigational sites must obtain approvalfrom their investigational review board (IRB)to conduct their portion of the trial.

• The drug products used in the trial must beshipped to all sites, which may entail dealingwith customs and import/export controls.

• If some sites speak different languages, allrelevant issues must be addressed (forexample, translating the informed consentform into each language).

• All principal investigators (one from each site)and certain members of their staff must receivetraining that will attempt to ensure consis-tency of all methodology used in the trial.Investigator meetings are held accordingly.

• Multicenter studies benefit from (rely on) theuse of central labs to analyze certain samplestaken during the trial (for example, bloodsamples). This is a complex shippingproblem, especially when samples must betransported to the central laboratory undercertain conditions and very quickly fromdistant locations.

Multicenter studies 181

Page 197: Introduction to Statistics in Pharmaceutical Clinical Trials

Each of the above considerations addsconsiderably to the total cost of a multicentertrial.

From a statistical point of view, while everyattempt is made to standardize the implementa-tion of study methodology at all sites, perfectstandardization is not a realistic expectation.Although simpler is usually better, study proto-cols often become complex during their devel-opment, and different investigational sites willlikely differ in their implementation of someprocedural aspects. This occurrence introducesextraneous variability into data collected andanalyzed. Extensions of some of the analysismethods described in this book can be used to account for center-to-center variability,including multi-way ANOVA models for contin-uous variables, stratified v2 tests for categoricalvariables, and stratified log-rank tests for time-to-event analyses. Various other methodologicalcontrols can be introduced in an attempt tominimize such extraneous variability, but thesuccess of any control strategy is unlikely to beperfect.

For these and other reasons, we believe that, ifit were possible to conduct a trial requiring3000–5000 participants at a single investiga-tional site, sponsors would do so, even thoughthis statement is at odds with the commonlycited major advantage of multicenter trials,which is that they enable greater generalizationof results obtained from the trial. It is statisticallypossible to assess the treatment effect at eachinvestigational site as well as assessing it usingthe data from all sites, although a given siteneeds to have reasonable enrollment for thetreatment effect calculated from its participantsto be meaningful. If similar treatment effects areobserved at sites that tended to enroll relativelyolder individuals, relatively younger ones, ethni-cally homogeneous samples, ethnically hetero-geneous samples, with less or more experience oftreating the designated indication, and so forth,it is reasonable to have a certain degree of faiththat the treatment effect is generalizable to theeventual patient population if and when the drugis approved for marketing.

12.4 Analysis populations

Various analysis populations for clinical trialdata can be defined and are used in statisticalanalyses, including:

• The intent-to-treat (ITT) population: Thiscomprises all participants in a clinical trialwho were randomized to a treatment group,regardless of whether any data were actuallycollected from them.

• The safety population: This is a subset of theITT population, defined as the population ofparticipants who received at least one dose ofa study treatment.

• The per-protocol population (also known asthe efficacy or evaluable population): This isalso a subset of the ITT population, andcomprises individuals whose participationand involvement in the trial were consideredto comply with significant requirements andactivities detailed in the study protocol.Participants would typically be excluded fromthe per-protocol population if they exhibitedpoor dosing compliance, missed a number ofclinic visits, or used prohibited medicationsthat may interfere with the evaluation of thetest treatment.

Both the ITT and the safety populations can beused in the analysis of safety data. The ITT andper-protocol population are typically used in theanalysis of efficacy data.

12.4.1 Using both ITT and per-protocolpopulations in efficacy analyses

In therapeutic confirmatory trial efficacy, thesame analyses are typically conducted twice,using data from the ITT population and datafrom the per-protocol population (see Turner,2007). The analyses conducted using the ITTpopulation are considered to be the primaryanalyses because ITT analysis provides a conser-vative strategy in the sense that it tends to biasagainst finding the results that the researcher

182 Chapter 12 • Additional statistical considerations in clinical trials

Page 198: Introduction to Statistics in Pharmaceutical Clinical Trials

“hopes” for, particularly in the case of superi-ority trials. The conservative nature of ITTanalysis is deemed particularly appropriate whenattempting to demonstrate the efficacy of aninvestigational drug because these data do notfavor the desired outcome. Then, if there iscompelling evidence of the drug’s efficacy, thisevidence will be particularly noteworthy. TheITT population is the most appropriate samplepopulation from which to make inferences tothe population of patients who may receive thedrug if and when it receives marketing approval.

Having conducted primary analyses using theITT population it is then appropriate to conductsecondary analyses using the per-protocol popu-lation, the subset of participants whose partici-pation in the trial was compliant with the studyprotocol. This analysis is regarded as less conser-vative than ITT analysis because analysis of theper-protocol population may maximize theopportunity to demonstrate efficacy: The per-protocol population is the population in whichthe treatment is likely to perform best.

Regulatory authorities are encouraged if theresults from the ITT efficacy analysis and the per-protocol efficacy analysis are similar, andtheir overall confidence in the trial results isincreased. However, if they are not similar, ques-tions may be raised as to why they are not. Someof these questions are (Turner, 2007):

• Is the per-protocol population a lot smallerthan the ITT population (it will almostcertainly be somewhat smaller)?

• If so, were there a lot of major protocolviolations?

• Were a lot of participants removed for thesame protocol violation?

• Were many of the participants with protocolviolations enrolled at the same investigativesite?

• Are there any systematic problems in theconduct of the trial?

All of the issues addressed by these questions can reduce the regulatory reviewers’ overallconfidence in the trial’s findings.

12.4.2 Proper and improper subgroupanalysis

Investigators may be interested to examinepotential differences among groups of partici-pants according to some characteristic. Forexample, there may be differences in the responseto treatment according to age. An analysis toinvestigate such a phenomenon could involveseparate analyses for participants aged 18–34,35–54, 55–74, and 75 years and older. Similaranalyses could be presented in which partici-pants are grouped according to some measure ofdisease severity. Results such as these should beinterpreted with caution because the moresubgroups that are examined the greater thechance of discovering a false positive (recall ourearlier discussions of multiple comparisons).

Although we have not discussed this topic,differences in treatment effects may be tested tosee if they are homogeneous across the varioussubgroups. This test is called a test of thetreatment-by-subgroup (for example, treatmentby age) interaction. It is useful because it can ruleout, using a hypothesis test, apparent differencesamong subgroups of subgroups that really repre-sent random variation. Citing from the ICHGuidance E9 (1994, p 27):

The treatment effect itself may also vary withsubgroup or covariate – for example, the effectmay decrease with age or may be larger in aparticular diagnostic category of subjects. Insome cases such interactions are anticipated orare of particular prior interest (e.g. geriatrics),and hence a subgroup analysis, or a statisticalmodel including interactions, is part of theplanned confirmatory analysis. In most cases,however, subgroup or interaction analyses areexploratory and should be clearly identified assuch; they should explore the uniformity of any treatment effects found overall. In general,such analyses should proceed first through theaddition of interaction terms to the statisticalmodel in question, complemented by additionalexploratory analysis within relevant subgroupsof subjects, or within strata defined by thecovariates. When exploratory, these analyses

Analysis populations 183

Page 199: Introduction to Statistics in Pharmaceutical Clinical Trials

should be interpreted cautiously; any conclusionof treatment efficacy (or lack thereof) or safetybased solely on exploratory subgroup analysesare unlikely to be accepted.

These cautions having been noted, some unex-pected subgroup findings may actually revealimportant findings that should be further inves-tigated in additional studies. This can be espe-cially important when there is evidence ofdifferent safety profiles among subgroups.Matthews (2006) distinguished between twosorts of subgroup formation, and hence analysis:

• a limited number of subgroups identified apriori with an apparent biological/clinicalreason for the difference of interest

• subgroups whose apparent importance isretrospective, and arises only as a result ofdoing analyses.

If the treatment effect appears to differ acrosssubgroups identified in the first way, thephenomenon “should be taken much more seri-ously” than if the subgroups came to light viathe second process (Matthews, 2006, p 171).

12.5 Dealing with missing data

For various reasons there are often participantsin a trial for whom a complete set of data is notcollected. This is the province of missing data.When conducting efficacy analyses we need toaddress this issue, and the way(s) in which it isaddressed can influence the regulatory reviewers’interpretation of the analyses presented. Theissue of missing data is problematic in clinicalresearch because humans have complex lives.Human participants may choose to leave a studyearly or be unable to attend a specific visit, bothsituations leading to missing data. Nonclinicalresearch involves tighter experimental control inwhich the subjects (animals) do not have theability voluntarily to leave the study early.

Piantadosi (2005) observed that there are onlythree generic analytic approaches to addressingthe issue of missing data:

1. Disregard the observations that contain amissing value.

2. Disregard a particular variable if it has a highfrequency of missing values.

3. Replace the missing values by some appropriatevalue.

The last of these approaches is called imputationof missing values. As Piantadosi (2005, p 400)commented, although this approach sounds alot like “making up data,” when done properly itmay be the most sensible strategy. While tech-niques for addressing missing data can be tech-nically difficult, one commonly used, simpleimputation method is called last observationcarried forward (LOCF). In a study with repeatedmeasurements over time, the most recentobservation replaces any subsequent missingobservations (Piantadosi, 2005).

An assumption of such an imputation strategyis that the future course of the individual’scondition can reasonably be predicted by the lastknown state. If participants in the test groupdrop out of the study more often than those onplacebo because the test treatment has failed,such an assumption may not be realistic. It ispossible that participants who dropped out fortreatment failure actually got worse than whenthey left the study. A commonly proposedstrategy is to use a number of imputationmethods and see how the analysis results changeas a result. If the results of this sensitivityanalysis suggest that the overall conclusionremains the same, it is less important how themissing data are managed.

Differential rates of loss to follow-up amonggroups or high rates in any single group compli-cate the management of missing data. Strategiesthat minimize the chance that participants willleave a study prematurely should be consideredat the design and protocol writing stage, andincorporated in the protocol as appropriate.

A number of approaches to dealing withmissing data are described by Molenberghs andKenward (2007).

12.5.1 The importance of study conductand study monitoring

While there are widely accepted methodologiesfor dealing with missing data, it is certainly

184 Chapter 12 • Additional statistical considerations in clinical trials

Page 200: Introduction to Statistics in Pharmaceutical Clinical Trials

preferable to have as many “actual” data aspossible. This simple point underscores the crit-ical nature of study conduct. All proceduresdetailed in the study protocol need to befollowed, and all data required need to becollected to the greatest degree possible (therewill always be occasional genuine reasons whythis was not possible in a specific situation).

This need for as complete a dataset as possibleunderscores the importance of the clinicalmonitor. Two related and critical responsibilitiesof clinical monitors are to ensure that all sites inthe trial follow the study protocol, and to checkthat the required data are recorded as and whenthey should be. A good monitor will spot theabsence of recorded data sooner rather thanlater, which considerably increases the likeli-hood of locating and subsequently recordingthose data.

12.6 Primary and secondary objectivesand endpoints

A given trial is conducted to collect optimumquality data with which to answer an identifiedand important research question. The datacollected are intended to provide the most accu-rate answers to the research questions posed. Astudy protocol will often include both primaryand secondary objectives, and also the associatedprimary and secondary endpoints.

12.6.1 The primary objective andendpoint

Turner (2007) noted that, in a very real sense, allthe clinical studies that are conducted before atherapeutic confirmatory trial is undertakenhave one purpose: To allow the primary objec-tive in the therapeutic confirmatory trial to bestated as simply as possible. An objective thatcan be stated simply can be tested simply, that is,in a straightforward and unambiguous manner.This is a highly desirable attribute in a primaryobjective.

By the time a therapeutic confirmatory trial isappropriate it should be possible to state a single

primary objective (or perhaps two if the sponsorreally feels that this is appropriate) that is clini-cally relevant and biologically plausible. Havingstated this primary objective, deciding upon theprimary endpoint should be straightforward.Deciding on the appropriate study design andthe associated statistical analyses should also bestraightforward. Throughout this book we havefocused on the development of a new antihyper-tensive drug. The primary objective of a thera-peutic confirmatory trial in this therapeuticarea will be to determine if the investigationaldrug does indeed lower blood pressure, and theassociated endpoint(s) may be a certain magni-tude reduction in systolic blood pressure (SBP),diastolic BP (DBP), or both.

At the analysis stage of the trial this endpointprovides the focus for rigorous statistical analysisand interpretation (Machin and Campbell, 2005).Formal hypothesis testing will be employed todetermine the presence or absence of a statisti-cally significant difference between the meandecrease seen in the drug treatment group andthat seen in the control group. In addition, theclinical significance of the treatment effect willbe addressed.

Having a single primary objective has an addi-tional advantage in a study. It means thatsample-size estimation can be based on thatobjective and the associated estimated treatmenteffect of interest (recall our discussion of sample-size estimation in Section 12.2). Having multipleprimary endpoints requires adjustments formultiplicity and can be difficult to interpret ifonly one of multiple primary endpoints is foundto have a statistically significant effect.

12.6.2 Secondary objectives andendpoints

In addition to the primary objective, a study mayhave a small number of secondary objectives. Asecondary endpoint will be associated with eachsecondary objective. For example, assessments of quality of life may fall under the category ofsecondary objectives. (In some studies quality of life may be the primary objective: It is simplyused here as a realistic example.) Quality of life(QoL) is an extremely important consideration,

Primary and secondary objectives and endpoints 185

Page 201: Introduction to Statistics in Pharmaceutical Clinical Trials

and particularly so in long-term pharmaceuticaltherapy. Even if a disease or condition cannot becured, keeping the symptomatology at accept-able levels can be considered a tremendoussuccess.

Formal hypothesis testing is less likely to occurfor secondary endpoints. Descriptive statisticsare more likely to be presented. It is also possiblethat findings of particular interest may lead to aprimary objective in a subsequent trial. That is,these data are more suited to hypothesis forma-tion than hypothesis testing. It is important toemphasize here that data leading to the forma-tion of a hypothesis cannot be used to test thathypothesis: As just noted, a new dataset must begenerated.

12.6.3 How many objectives should welist?

The number of objectives that should be incor-porated in any clinical trial is often a topic ofconsiderable debate among study teams (Turner,2007). Some members will likely argue that,while taking all the trouble to conduct the trial,why not collect as much data as possible and askas many questions as possible? This approachleads to a large number of study objectives,sometimes broken down into primary objectives,secondary objectives, and even tertiary objec-tives. It is certainly legitimate in some studies tobe interested in more than one primaryendpoint and possibly in several secondaryendpoints. However, from a statistical point ofview, increasing the number of objectives leadsto serious problems, and it can compromise theweight of any particular piece of evidence that iseventually presented to regulatory agencies.

In Chapter 11 we discussed the issue ofmultiple comparisons and multiplicity in thecontext of pairwise treatment comparisonsfollowing a significant omnibus F test. When weadopt the 5% significance level (a � 0.05), bydefinition it is likely that a type I error will occurwhen 20 separate comparisons are made. That is,a statistically significant result will be “found”by chance alone. The greater the number ofobjectives presented in a study protocol, thegreater the number of comparisons that will be

made at the analysis stage, and the greater thechance of a type I error. Machin and Campbell(2005) commented: “If there are too manyendpoints defined, the multiplicity of compar-isons then made at the analysis stage may resultin spurious statistical significance.”

The concern of multiplicity can also apply tostudies in which data are examined during thestudy at interim time points. Interim analysesare discussed in Section 12.9.

12.7 Evaluating baseline characteristics

It is common practice in analyses of clinical datato inspect the distributions of baseline character-istics – for example, demographics and measuresof disease severity – through the use of descrip-tive summary statistics. This is an importantanalysis because it helps to describe the samplerepresenting the target population of interest. Ifthe sample is representative of the target popula-tion the inferences drawn from the study will beconsidered relevant.

Sometimes the baseline homogeneity of thesecharacteristics is assessed using a hypothesis test,for example, an omnibus F-test from a one-wayANOVA testing for differences in age. If a “signif-icant” result is found, some researchers mightoffer this as evidence that something went awrywith the randomization process. However, thisview has two problems: One is that multiplehypothesis tests can lead to spurious findings or“false positives;” the second is that, on any givensingle instance, a proper randomization cannotensure that this possibility does not occur. Whatrandomization can ensure is that, on average,over all possible randomizations, distributions ofbaseline characteristics will be homogeneousacross groups. This result is all that is required forproper statistical inferences. Senn (1997) empha-sized that “inferential statistics calculated from aclinical trial make an allowance for differencesbetween patients and that this allowance will becorrect on average if randomization has beenemployed.” It is worth noting that standarderrors represent such allowances.

When there is evidence to suggest a baselineimbalance with respect to a characteristic that

186 Chapter 12 • Additional statistical considerations in clinical trials

Page 202: Introduction to Statistics in Pharmaceutical Clinical Trials

may influence an important outcome of thestudy, such as the primary efficacy endpoint,some investigators choose to examine the effectof this factor in additional statistical analyses.Possible approaches to this would includeANOVA or analysis of covariance (ANCOVA) inwhich a continuous variable (for example, age) isadjusted for in assessing the main effect ofinterest, that is, the treatment effect for theprimary outcome variable.

Such a step is not required from a statisticalpoint of view, as a result of the role of a properlyexecuted randomization process, but it can becomforting if it supports the clinical relevance ofthe effect after adjustment for the baselinecovariate. If there are specific explanatory factorsthat are suspected of having an effect on theoutcome of interest at the start of a study, it isadvisable to incorporate them into the overallstudy design (for example, through stratifiedrandomization). A brief discussion of this topichas been published by Roberts and Torgerson(1999). The EMEA CPMP has also published aguidance document on baseline covariates(EMEA CPMP 2003).

12.8 Equivalence and noninferioritystudy designs

The goal of equivalence trials is to demonstratethat a new (test) drug (T) and an activecomparator drug (C) are “equivalent” or have asimilar effect. This means that, in the best-casescenario, the test treatment is trivially betterthan the reference treatment and, in the worst, itis tolerably worse.

Equivalence trials are important when itwould be unethical to compare the test treat-ment with an inactive control, and whencomparing the test with the control for equiva-lent efficacy with a superior safety profile forthe test drug. The difference between groupsthat we believe to be “trivially better” or “toler-ably worse” is called the equivalence margin.Defining the equivalence margin is not an easytask and requires input from regulatory authori-ties. The definition of the equivalence margin isrequired in estimating a sample size for such a

study and it must be decided upon in advance ofthe study and detailed in the study protocol.

Noninferiority trials are very similar to equiva-lence trials in the manner of their statisticalapproach. In noninferiority trials the objective isto demonstrate that the test drug is no worsethan – that is, not inferior to – the control.Assuming that the test drug had some otherbenefit, such as better tolerability or safety orcost, a claim of noninferiority could mean thatthe effect for the test drug is trivially worse thanthe control. The design, including the choice of the noninferiority margin, must be agreed towith regulatory authorities and provided in thestudy protocol. A guidance document publishedby the EMEA CPMP (2000) addresses issuesrelated to interpreting data from superioritystudies for noninferiority claims, although it isour opinion that such a practice is rarely justified.

12.8.1 Why the hypothesis-testingstrategies are different in these designs

The research questions in equivalence andnoninferiority trials are different from thoseused in superiority trials. Hypothesis testingstrategies that are so frequently used in superi-ority trials do not serve the needs of thesedesigns well. As Matthews (2006, p 199)commented: “Failing to establish that one treat-ment is superior to the other is not the same asestablishing their equivalence.” In other words,obtaining a nonsignificant p value in a superi-ority trial does not demonstrate that the twotreatments are the same. As we shall see, conven-tional p values have no role in establishingequivalence or noninferiority.

12.8.2 Use of confidence intervals forinferences

Given that the research questions in these trialsare different from those used in superiority trials,the formats of the null and alternate hypothesesare also different. The research question associ-ated with an equivalence trial is: Does the testdrug demonstrate equivalent efficacy comparedwith the comparator drug? The null hypothesis,

Equivalence and noninferiority study designs 187

Page 203: Introduction to Statistics in Pharmaceutical Clinical Trials

stated in terms of differences in populationmeans, is:

H0: | lTEST � lCONTROL | � equivalence.

The alternate hypothesis is:

HA: | lTEST � lCONTROL | � equivalence.

If the null hypothesis is rejected in this case, theconclusion would be that the two populationmeans were within dequivalence units of each other.The equivalence margin would be selected suchthat the two treatments were considered equiva-lent. If two antihypertensive therapies werecompared in this manner, an equivalencemargin might be 5 mmHg (a trivial difference).The inferential statistical analysis for equiva-lence trials typically involves the calculation of a(1 �a)% confidence interval for the difference inpopulation means. If the lower and upperbounds of the confidence interval are bothwithin the equivalence margin, the conclusion isthat we are (1 �a)% confident that the truedifference in population means does not exceeddequivalence. The conclusions that can be drawnfrom an equivalence trial are displayed in Figure12.3.

The research question for a noninferiority trialis stated as: Is the test drug not inferior to thecontrol? The null hypothesis to be tested in thisstudy is:

H0: lTEST � lCONTROL � noninferiority.

If the null hypothesis is rejected, the followingalternate hypothesis will be favored:

HA: lTEST � lCONTROL � noninferiority.

If the null hypothesis is rejected in this case, theconclusion would be that the population meanfor the control treatment did not exceed that forthe test group by more than dnoninferiority. Theinferential statistical analysis for noninferioritytrials typically involves the calculation of a one-sided (1 �a)% confidence interval for the differ-ence in population means. If the upper bound of the confidence interval is within the non-inferiority margin, the conclusion is that we are (1 �a)% confident that the true difference in population means is less than dnoninferiority. The conclusions that can be drawn from a noninferiority trial are displayed in Figure 12.4.

Equivalence and noninferiority trials may bethe only viable means to test a new drug in

188 Chapter 12 • Additional statistical considerations in clinical trials

HA: Test is non-inferior tocontrol

H0: Test is inferior tocontrol

Test appreciably worse than controlTest better than controlTest tolerably

worse than control

l test � lcontrol

dnoninferiority0

Figure 12.4 Conclusions to be drawn from the difference in population means from a noninferiority trial

HA: Test is equivalent tocontrol

H0: Test is inferior tocontrol

Test worse than controlTest better than control

H0: Test is superior tocontrol

0l test � lcontrol

�dequivalence�dequivalence

Figure 12.3 Conclusions to be drawn from the difference in population means from an equivalence trial

Page 204: Introduction to Statistics in Pharmaceutical Clinical Trials

certain circumstances. One important considera-tion in equivalence trials with a single activecomparator is to consider what it means toconclude that the test drug is equivalent to thecomparator. Not all marketed drugs are effica-cious in every study. If the test drug were shownto be equivalent to the control, the test drugwould be either efficacious or not efficacious.Which of these outcomes represents the truthdepends on how the comparator would haveperformed had it been tested against a placebo.The ability to establish that a study can distin-guish effective treatments from ineffective onesis called assay sensitivity. One way to establishthis for equivalence trials is to select a comparatorthat had demonstrated consistent superiority toa placebo. Another option for equivalence trialsin some instances is to include a third placeboarm. This is not possible when the ethics of thesituation preclude this possibility. This is yetanother illustration of the complexity ofdesigning trials for which the outcomes haveuniversally meaningful interpretations.

12.9 Additional study designs

Other appealing design features in new drugdevelopment include those that allow for moni-toring of data while the trial is ongoing, andthose that permit adaptations during a trial.

12.9.1 Interim analyses

Analyses conducted during a study are calledinterim analyses. Common uses of interimanalyses are as follows:

• re-estimate the study sample size• evaluate whether or not a study has accumu-

lated sufficient data to stop early for definitiveevidence of efficacy, for evidence of harm, ordefinitive evidence that the trial is unlikely tobe successful in terms of its originally plannedobjectives.

A number of methodologies are available toassist in the quantification of evidence(accounting for type I and II errors) that enable

early stopping of trials. Jennison and Turnbull(1999) provide a detailed description of sequen-tial designs in which data are evaluated periodi-cally for evidence of benefit, harm, or futility.Sequential designs typically involve the use ofboundaries for the test statistic that define eachof these outcomes.

One complicating factor of interim analyses isthat they require the use of a data monitoringcommittee (DMC), which is independent of thestudy sponsor and others involved in the study.This is intended to protect the integrity of theclinical trial and to avoid any influence thatknowledge of results may have on the futurecourse of the trial. The work of the DMC isdictated by a specific protocol, or charter, writtenfor the purpose of listing responsibilities ofall parties and measures undertaken to protectthe integrity of the trial. Ellenberg et al. (2003)have written a valuable reference outliningthe complex issues associated with DMCinvolvement in trials.

12.9.2 Adaptive designs

Adaptive designs have become a topic of greatinterest, as evidenced by a recent PharmaceuticalResearch and Manufacturers of America(PhRMA) working group convened to discussadaptive designs methods. Dragalin (2006)provided an excellent overview of these studies.The ability to modify a study in midcourse mayoffer significant advantages to pharmaceuticalcompanies, especially given the tremendousinvestment of time and money required fordeveloping new drugs.

However, the logistical aspects of monitoringdata at several points during a study are nottrivial. An important concern with interimanalyses is to ensure that knowledge of theresults, however vague, does not unduly influ-ence or bias the study. Hung et al. (2006, p 572)stated: “When the adaptation in confirmatorytrials is extensive, the key hypothesis testedbecomes unclear, protection of trial integrity isdifficult, the infrastructure that is needed forlogistics may be impossible to establish, andevaluation by regulatory agencies may be impos-sible.” Summarizing the opportunities and the

Additional study designs 189

Page 205: Introduction to Statistics in Pharmaceutical Clinical Trials

challenges of adaptive designs on behalf of thePhRMA working group, Gallo and Krams (2006,p 423) stated that: “We feel that the potentialbenefits for all involved parties suggested byadaptive designs are too enticing not to makeevery effort to find out if their promise can berealized.” These designs represent an area foremerging research.

12.11 References

Dragalin V (2006). Adaptive designs: terminology andclassification. Drug Information J 40:425–435.

Ellenberg SE, Fleming TR, DeMets DL (2003). DataMonitoring Committees in Clinical Trials: A practicalperspective. Chichester: John Wiley & Sons.

EMEA Committee for Proprietary Medicinal Products(CPMP) (2000). Points to Consider on SwitchingBetween Superiority and Non-Inferiority. London:EMEA.

EMEA CPMP (2003). Points to Consider on Adjustment forBaseline Covariates. London: EMEA.

Fleiss JL, Paik MC, Levin B (2003). Statistical Methods forRates and Proportions, 3rd edn. Chichester: JohnWiley & Sons.

Gallo P, Krams M (2006). PhRMA working group onadaptive designs: introduction to the full whitepaper. Drug Information J 40:421–423.

Hung HMJ, O’Neill RT, Wang SJ, Lawrence J (2006). Aregulatory view on adaptive/flexible clinical trialdesign. Biometr J 48:565–573.

ICH Guidance E1 (1994). The Extent of PopulationExposure to Assess Clinical Safety for Drugs Intendedfor Long-Term Treatment of None-Life-ThreateningConditions. Available at: www.ich.org (accessed July1 2007).

Jennison C, Turnbull BW (2000). Group SequentialMethods with Applications to Clinical Trials. BocaRaton, IL: Chapman & Hall/CRC.

Lee SJ, Zelen M (2000). Clinical trials and sample sizeconsiderations: another perspective. Statist Sci 15:95–110.

Machin D, Campbell MJ (2005). Design of Studies forMedical Research. Chichester: John Wiley & Sons.

Matthews JNS (2006). Introduction to RandomizedControlled Clinical Trials, 2nd edn. Boca Raton, FL:Chapman & Hall/CRC.

Molenberghs G, Kenward M (2007). Missing Data inClinical Studies. Chichester: John Wiley & Sons.

Piantadosi S (2005). Clinical Trials: A methodologicperspective, 2nd edn. Chichester: John Wiley & Sons.

Roberts C, Torgerson DJ (1999). Understandingcontrolled trials: baseline imbalance in randomisedcontrolled trials. BMJ 319:185.

Senn S (1997). Statistical Issues in Drug Development.Chichester: John Wiley & Sons.

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

190 Chapter 12 • Additional statistical considerations in clinical trials

12.10 Review

1. Consider a design for an exploratory therapeutictrial of a new antihypertensive drug compared withplacebo. It has been agreed that a between-treatment group difference in mean change frombaseline (that is, the treatment effect) of at least 20 mmHg in SBP would be considered clinicallymeaningful. The primary hypothesis must be testedwith a � 0.05.

(a) If the standard deviation for the betweendifference in mean change from baseline SBPis 40 mmHg, what is the required sample sizefor a test with power of 80%? What is therequired sample size for a test with 90%power?

(b) If the standard deviation for the betweendifference in mean change from baseline SBPis 60 mmHg, what is the required sample sizefor a test with power of 80%? What is therequired sample size for a test with 90%power?

(c) How is the estimate of the standard deviationobtained?

2. What are some advantages and disadvantages ofusing multiple investigational centers in clinicaltrials?

3. In what ways are noninferiority trials different fromsuperiority trials?

Page 206: Introduction to Statistics in Pharmaceutical Clinical Trials

We conclude with two overarching commentsthat bring together the various topics andconsiderations that we have discussedthroughout the book.

First, when faced with the myriad challengesthat occur during drug development, the disci-pline of Statistics is the knight in shining armorthat rides to our assistance and facilitates the collection, analysis, and interpretation ofoptimum quality data as the basis for rationaldecision-making at all stages of the process. Thediscipline of Statistics as operationally defined inthis book includes study design, experimentalmethodology, statistical analysis, and the inter-pretation of the findings of the trial. Very impor-tantly, the interpretation involves addressingissues of both statistical significance and clinicalsignificance. Our interest in Statistics, then, is apragmatic one: The discipline provides the bestway currently available to conduct clinicaldevelopment programs.

Second, it is appropriate to remind ourselvesthat our ultimate interest is in providing a new,biologically active, pharmacological agent thatwill alter a patient’s biology for the better. Statis-tical analyses can be performed on any kind of data (some extremely influential statisticalmethods were developed in the agriculturalarena). In this book, however, the data of interestare biological data, both desirable biologicallytherapeutic effects and undesirable biologicalside-effects. Producing a drug that has anacceptable benefit–risk ratio is a long andcomplex process, and one in which statisticalmethodology is invaluable.

It is also good to remind ourselves frequentlythat the welfare of real patients is our ultimateconcern. This may not be the first thought thatpops into our heads when we are in the middleof a sample-size estimation calculation for anupcoming clinical trial, or when deciding uponwhich imputation methodology to use to dealwith missing data in a clinical database. Never-theless, this is why we do these things. Thepharmaceutical industry is not immune tocontroversy; far from it. However, as Turner(2007, p 239) noted: “New drug development isa very complicated and difficult undertaking,but one that makes an enormous difference tothe health of people across the globe. It is anoble pursuit.”

In this book, set in the context of drug devel-opment, we have taught you how to conduct anarray of statistical analyses. While teaching suchcomputational skills is appropriate in a statisticstextbook, we also hope that we have beensuccessful in providing you with a conceptualunderstanding of and appreciation for thecontribution of the discipline of Statistics to thedevelopment of pharmaceutical drugs that mayimprove the health of your family members,your friends, and yourself.

Reference

Turner JR (2007). New Drug Development: Design,methodology, and analysis. Hoboken, NJ: John Wiley& Sons.

13Concluding comments

Page 207: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 208: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1 Standard normal distribution areas

Appendix 2 Percentiles of t distributions

Appendix 3 Percentiles of v2 distributions

Appendix 4 Percentiles of F distributions (a � 0.05)

Appendix 5 Values of q for Tukey’s HSD test (a � 0.05)

AppendicesStatistical tables

Page 209: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 210: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1

Standard normal distribution areas

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

0.00 0.50000 0.50000 0.00000 1.000000.01 0.50399 0.49601 0.00798 0.992020.02 0.50798 0.49202 0.01596 0.984040.03 0.51197 0.48803 0.02393 0.976070.04 0.51595 0.48405 0.03191 0.968090.05 0.51994 0.48006 0.03988 0.960120.06 0.52392 0.47608 0.04784 0.952160.07 0.52790 0.47210 0.05581 0.944190.08 0.53188 0.46812 0.06376 0.936240.09 0.53586 0.46414 0.07171 0.928290.10 0.53983 0.46017 0.07966 0.920340.11 0.54380 0.45620 0.08759 0.912410.12 0.54776 0.45224 0.09552 0.904480.13 0.55172 0.44828 0.10343 0.896570.14 0.55567 0.44433 0.11134 0.888660.15 0.55962 0.44038 0.11924 0.880760.16 0.56356 0.43644 0.12712 0.872880.17 0.56749 0.43251 0.13499 0.865010.18 0.57142 0.42858 0.14285 0.857150.19 0.57535 0.42465 0.15069 0.849310.20 0.57926 0.42074 0.15852 0.841480.21 0.58317 0.41683 0.16633 0.833670.22 0.58706 0.41294 0.17413 0.825870.23 0.59095 0.40905 0.18191 0.818090.24 0.59483 0.40517 0.18967 0.810330.25 0.59871 0.40129 0.19741 0.802590.26 0.60257 0.39743 0.20514 0.794860.27 0.60642 0.39358 0.21284 0.787160.28 0.61026 0.38974 0.22052 0.779480.29 0.61409 0.38591 0.22818 0.771820.30 0.61791 0.38209 0.23582 0.764180.31 0.62172 0.37828 0.24344 0.756560.32 0.62552 0.37448 0.25103 0.748970.33 0.62930 0.37070 0.25860 0.741400.34 0.63307 0.36693 0.26614 0.733860.35 0.63683 0.36317 0.27366 0.726340.36 0.64058 0.35942 0.28115 0.718850.37 0.64431 0.35569 0.28862 0.711380.38 0.64803 0.35197 0.29605 0.70395

Page 211: Introduction to Statistics in Pharmaceutical Clinical Trials

196 Appendix 1 • Standard normal distribution areas

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

0.39 0.65173 0.34827 0.30346 0.696540.40 0.65542 0.34458 0.31084 0.689160.41 0.65910 0.34090 0.31819 0.681810.42 0.66276 0.33724 0.32551 0.674490.43 0.66640 0.33360 0.33280 0.667200.44 0.67003 0.32997 0.34006 0.659940.45 0.67364 0.32636 0.34729 0.652710.46 0.67724 0.32276 0.35448 0.645520.47 0.68082 0.31918 0.36164 0.638360.48 0.68439 0.31561 0.36877 0.631230.49 0.68793 0.31207 0.37587 0.624130.50 0.69146 0.30854 0.38292 0.617080.51 0.69497 0.30503 0.38995 0.610050.52 0.69847 0.30153 0.39694 0.603060.53 0.70194 0.29806 0.40389 0.596110.54 0.70540 0.29460 0.41080 0.589200.55 0.70884 0.29116 0.41768 0.582320.56 0.71226 0.28774 0.42452 0.575480.57 0.71566 0.28434 0.43132 0.568680.58 0.71904 0.28096 0.43809 0.561910.59 0.72240 0.27760 0.44481 0.555190.60 0.72575 0.27425 0.45149 0.548510.61 0.72907 0.27093 0.45814 0.541860.62 0.73237 0.26763 0.46474 0.535260.63 0.73565 0.26435 0.47131 0.528690.64 0.73891 0.26109 0.47783 0.522170.65 0.74215 0.25785 0.48431 0.515690.66 0.74537 0.25463 0.49075 0.509250.67 0.74857 0.25143 0.49714 0.502860.68 0.75175 0.24825 0.50350 0.496500.69 0.75490 0.24510 0.50981 0.490190.70 0.75804 0.24196 0.51607 0.483930.71 0.76115 0.23885 0.52230 0.477700.72 0.76424 0.23576 0.52848 0.471520.73 0.76730 0.23270 0.53461 0.465390.74 0.77035 0.22965 0.54070 0.459300.75 0.77337 0.22663 0.54675 0.453250.76 0.77637 0.22363 0.55275 0.447250.77 0.77935 0.22065 0.55870 0.441300.78 0.78230 0.21770 0.56461 0.435390.79 0.78524 0.21476 0.57047 0.429530.80 0.78814 0.21186 0.57629 0.423710.81 0.79103 0.20897 0.58206 0.417940.82 0.79389 0.20611 0.58778 0.412220.83 0.79673 0.20327 0.59346 0.406540.84 0.79955 0.20045 0.59909 0.400910.85 0.80234 0.19766 0.60467 0.395330.86 0.80511 0.19489 0.61021 0.389790.87 0.80785 0.19215 0.61570 0.38430

Page 212: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1 • Standard normal distribution areas 197

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

0.88 0.81057 0.18943 0.62114 0.378860.89 0.81327 0.18673 0.62653 0.373470.90 0.81594 0.18406 0.63188 0.368120.91 0.81859 0.18141 0.63718 0.362820.92 0.82121 0.17879 0.64243 0.357570.93 0.82381 0.17619 0.64763 0.352370.94 0.82639 0.17361 0.65278 0.347220.95 0.82894 0.17106 0.65789 0.342110.96 0.83147 0.16853 0.66294 0.337060.97 0.83398 0.16602 0.66795 0.332050.98 0.83646 0.16354 0.67291 0.327090.99 0.83891 0.16109 0.67783 0.322171.00 0.84134 0.15866 0.68269 0.317311.01 0.84375 0.15625 0.68750 0.312501.02 0.84614 0.15386 0.69227 0.307731.03 0.84849 0.15151 0.69699 0.303011.04 0.85083 0.14917 0.70166 0.298341.05 0.85314 0.14686 0.70628 0.293721.06 0.85543 0.14457 0.71086 0.289141.07 0.85769 0.14231 0.71538 0.284621.08 0.85993 0.14007 0.71986 0.280141.09 0.86214 0.13786 0.72429 0.275711.10 0.86433 0.13567 0.72867 0.271331.11 0.86650 0.13350 0.73300 0.267001.12 0.86864 0.13136 0.73729 0.262711.13 0.87076 0.12924 0.74152 0.258481.14 0.87286 0.12714 0.74571 0.254291.15 0.87493 0.12507 0.74986 0.250141.16 0.87698 0.12302 0.75395 0.246051.17 0.87900 0.12100 0.75800 0.242001.18 0.88100 0.11900 0.76200 0.238001.19 0.88298 0.11702 0.76595 0.234051.20 0.88493 0.11507 0.76986 0.230141.21 0.88686 0.11314 0.77372 0.226281.22 0.88877 0.11123 0.77754 0.222461.23 0.89065 0.10935 0.78130 0.218701.24 0.89251 0.10749 0.78502 0.214981.25 0.89435 0.10565 0.78870 0.211301.26 0.89617 0.10383 0.79233 0.207671.27 0.89796 0.10204 0.79592 0.204081.28 0.89973 0.10027 0.79945 0.200551.29 0.90147 0.09853 0.80295 0.197051.30 0.90320 0.09680 0.80640 0.193601.31 0.90490 0.09510 0.80980 0.190201.32 0.90658 0.09342 0.81316 0.186841.33 0.90824 0.09176 0.81648 0.183521.34 0.90988 0.09012 0.81975 0.180251.35 0.91149 0.08851 0.82298 0.177021.36 0.91309 0.08691 0.82617 0.17383

Page 213: Introduction to Statistics in Pharmaceutical Clinical Trials

198 Appendix 1 • Standard normal distribution areas

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

1.37 0.91466 0.08534 0.82931 0.170691.38 0.91621 0.08379 0.83241 0.167591.39 0.91774 0.08226 0.83547 0.164531.40 0.91924 0.08076 0.83849 0.161511.41 0.92073 0.07927 0.84146 0.158541.42 0.92220 0.07780 0.84439 0.155611.43 0.92364 0.07636 0.84728 0.152721.44 0.92507 0.07493 0.85013 0.149871.45 0.92647 0.07353 0.85294 0.147061.46 0.92785 0.07215 0.85571 0.144291.47 0.92922 0.07078 0.85844 0.141561.48 0.93056 0.06944 0.86113 0.138871.49 0.93189 0.06811 0.86378 0.136221.50 0.93319 0.06681 0.86639 0.133611.51 0.93448 0.06552 0.86896 0.131041.52 0.93574 0.06426 0.87149 0.128511.53 0.93699 0.06301 0.87398 0.126021.54 0.93822 0.06178 0.87644 0.123561.55 0.93943 0.06057 0.87886 0.121141.56 0.94062 0.05938 0.88124 0.118761.57 0.94179 0.05821 0.88358 0.116421.58 0.94295 0.05705 0.88589 0.114111.59 0.94408 0.05592 0.88817 0.111831.60 0.94520 0.05480 0.89040 0.109601.61 0.94630 0.05370 0.89260 0.107401.62 0.94738 0.05262 0.89477 0.105231.63 0.94845 0.05155 0.89690 0.103101.64 0.94950 0.05050 0.89899 0.101011.65 0.95053 0.04947 0.90106 0.098941.66 0.95154 0.04846 0.90309 0.096911.67 0.95254 0.04746 0.90508 0.094921.68 0.95352 0.04648 0.90704 0.092961.69 0.95449 0.04551 0.90897 0.091031.70 0.95543 0.04457 0.91087 0.089131.71 0.95637 0.04363 0.91273 0.087271.72 0.95728 0.04272 0.91457 0.085431.73 0.95818 0.04182 0.91637 0.083631.74 0.95907 0.04093 0.91814 0.081861.75 0.95994 0.04006 0.91988 0.080121.76 0.96080 0.03920 0.92159 0.078411.77 0.96164 0.03836 0.92327 0.076731.78 0.96246 0.03754 0.92492 0.075081.79 0.96327 0.03673 0.92655 0.073451.80 0.96407 0.03593 0.92814 0.071861.81 0.96485 0.03515 0.92970 0.070301.82 0.96562 0.03438 0.93124 0.068761.83 0.96638 0.03362 0.93275 0.067251.84 0.96712 0.03288 0.93423 0.065771.85 0.96784 0.03216 0.93569 0.06431

Page 214: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1 • Standard normal distribution areas 199

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

1.86 0.96856 0.03144 0.93711 0.062891.87 0.96926 0.03074 0.93852 0.061481.88 0.96995 0.03005 0.93989 0.060111.89 0.97062 0.02938 0.94124 0.058761.90 0.97128 0.02872 0.94257 0.057431.91 0.97193 0.02807 0.94387 0.056131.92 0.97257 0.02743 0.94514 0.054861.93 0.97320 0.02680 0.94639 0.053611.94 0.97381 0.02619 0.94762 0.052381.95 0.97441 0.02559 0.94882 0.051181.96 0.97500 0.02500 0.95000 0.050001.97 0.97558 0.02442 0.95116 0.048841.98 0.97615 0.02385 0.95230 0.047701.99 0.97670 0.02330 0.95341 0.046592.00 0.97725 0.02275 0.95450 0.045502.01 0.97778 0.02222 0.95557 0.044432.02 0.97831 0.02169 0.95662 0.043382.03 0.97882 0.02118 0.95764 0.042362.04 0.97932 0.02068 0.95865 0.041352.05 0.97982 0.02018 0.95964 0.040362.06 0.98030 0.01970 0.96060 0.039402.07 0.98077 0.01923 0.96155 0.038452.08 0.98124 0.01876 0.96247 0.037532.09 0.98169 0.01831 0.96338 0.036622.10 0.98214 0.01786 0.96427 0.035732.11 0.98257 0.01743 0.96514 0.034862.12 0.98300 0.01700 0.96599 0.034012.13 0.98341 0.01659 0.96683 0.033172.14 0.98382 0.01618 0.96765 0.032352.15 0.98422 0.01578 0.96844 0.031562.16 0.98461 0.01539 0.96923 0.030772.17 0.98500 0.01500 0.96999 0.030012.18 0.98537 0.01463 0.97074 0.029262.19 0.98574 0.01426 0.97148 0.028522.20 0.98610 0.01390 0.97219 0.027812.21 0.98645 0.01355 0.97289 0.027112.22 0.98679 0.01321 0.97358 0.026422.23 0.98713 0.01287 0.97425 0.025752.24 0.98745 0.01255 0.97491 0.025092.25 0.98778 0.01222 0.97555 0.024452.26 0.98809 0.01191 0.97618 0.023822.27 0.98840 0.01160 0.97679 0.023212.28 0.98870 0.01130 0.97739 0.022612.29 0.98899 0.01101 0.97798 0.022022.30 0.98928 0.01072 0.97855 0.021452.31 0.98956 0.01044 0.97911 0.020892.32 0.98983 0.01017 0.97966 0.020342.33 0.99010 0.00990 0.98019 0.019812.34 0.99036 0.00964 0.98072 0.01928

Page 215: Introduction to Statistics in Pharmaceutical Clinical Trials

200 Appendix 1 • Standard normal distribution areas

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

2.35 0.99061 0.00939 0.98123 0.018772.36 0.99086 0.00914 0.98173 0.018272.37 0.99111 0.00889 0.98221 0.017792.38 0.99134 0.00866 0.98269 0.017312.39 0.99158 0.00842 0.98315 0.016852.40 0.99180 0.00820 0.98360 0.016402.41 0.99202 0.00798 0.98405 0.015952.42 0.99224 0.00776 0.98448 0.015522.43 0.99245 0.00755 0.98490 0.015102.44 0.99266 0.00734 0.98531 0.014692.45 0.99286 0.00714 0.98571 0.014292.46 0.99305 0.00695 0.98611 0.013892.47 0.99324 0.00676 0.98649 0.013512.48 0.99343 0.00657 0.98686 0.013142.49 0.99361 0.00639 0.98723 0.012772.50 0.99379 0.00621 0.98758 0.012422.51 0.99396 0.00604 0.98793 0.012072.52 0.99413 0.00587 0.98826 0.011742.53 0.99430 0.00570 0.98859 0.011412.54 0.99446 0.00554 0.98891 0.011092.55 0.99461 0.00539 0.98923 0.010772.56 0.99477 0.00523 0.98953 0.010472.57 0.99492 0.00508 0.98983 0.010172.58 0.99506 0.00494 0.99012 0.009882.59 0.99520 0.00480 0.99040 0.009602.60 0.99534 0.00466 0.99068 0.009322.61 0.99547 0.00453 0.99095 0.009052.62 0.99560 0.00440 0.99121 0.008792.63 0.99573 0.00427 0.99146 0.008542.64 0.99585 0.00415 0.99171 0.008292.65 0.99598 0.00402 0.99195 0.008052.66 0.99609 0.00391 0.99219 0.007812.67 0.99621 0.00379 0.99241 0.007592.68 0.99632 0.00368 0.99264 0.007362.69 0.99643 0.00357 0.99285 0.007152.70 0.99653 0.00347 0.99307 0.006932.71 0.99664 0.00336 0.99327 0.006732.72 0.99674 0.00326 0.99347 0.006532.73 0.99683 0.00317 0.99367 0.006332.74 0.99693 0.00307 0.99386 0.006142.75 0.99702 0.00298 0.99404 0.005962.76 0.99711 0.00289 0.99422 0.005782.77 0.99720 0.00280 0.99439 0.005612.78 0.99728 0.00272 0.99456 0.005442.79 0.99736 0.00264 0.99473 0.005272.80 0.99744 0.00256 0.99489 0.005112.81 0.99752 0.00248 0.99505 0.004952.82 0.99760 0.00240 0.99520 0.004802.83 0.99767 0.00233 0.99535 0.00465

Page 216: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1 • Standard normal distribution areas 201

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

2.84 0.99774 0.00226 0.99549 0.004512.85 0.99781 0.00219 0.99563 0.004372.86 0.99788 0.00212 0.99576 0.004242.87 0.99795 0.00205 0.99590 0.004102.88 0.99801 0.00199 0.99602 0.003982.89 0.99807 0.00193 0.99615 0.003852.90 0.99813 0.00187 0.99627 0.003732.91 0.99819 0.00181 0.99639 0.003612.92 0.99825 0.00175 0.99650 0.003502.93 0.99831 0.00169 0.99661 0.003392.94 0.99836 0.00164 0.99672 0.003282.95 0.99841 0.00159 0.99682 0.003182.96 0.99846 0.00154 0.99692 0.003082.97 0.99851 0.00149 0.99702 0.002982.98 0.99856 0.00144 0.99712 0.002882.99 0.99861 0.00139 0.99721 0.002793.00 0.99865 0.00135 0.99730 0.002703.01 0.99869 0.00131 0.99739 0.002613.02 0.99874 0.00126 0.99747 0.002533.03 0.99878 0.00122 0.99755 0.002453.04 0.99882 0.00118 0.99763 0.002373.05 0.99886 0.00114 0.99771 0.002293.06 0.99889 0.00111 0.99779 0.002213.07 0.99893 0.00107 0.99786 0.002143.08 0.99896 0.00104 0.99793 0.002073.09 0.99900 0.00100 0.99800 0.002003.10 0.99903 0.00097 0.99806 0.001943.11 0.99906 0.00094 0.99813 0.001873.12 0.99910 0.00090 0.99819 0.001813.13 0.99913 0.00087 0.99825 0.001753.14 0.99916 0.00084 0.99831 0.001693.15 0.99918 0.00082 0.99837 0.001633.16 0.99921 0.00079 0.99842 0.001583.17 0.99924 0.00076 0.99848 0.001523.18 0.99926 0.00074 0.99853 0.001473.19 0.99929 0.00071 0.99858 0.001423.20 0.99931 0.00069 0.99863 0.001373.21 0.99934 0.00066 0.99867 0.001333.22 0.99936 0.00064 0.99872 0.001283.23 0.99938 0.00062 0.99876 0.001243.24 0.99940 0.00060 0.99880 0.001203.25 0.99942 0.00058 0.99885 0.001153.26 0.99944 0.00056 0.99889 0.001113.27 0.99946 0.00054 0.99892 0.001083.28 0.99948 0.00052 0.99896 0.001043.29 0.99950 0.00050 0.99900 0.001003.30 0.99952 0.00048 0.99903 0.000973.31 0.99953 0.00047 0.99907 0.000933.32 0.99955 0.00045 0.99910 0.00090

Page 217: Introduction to Statistics in Pharmaceutical Clinical Trials

202 Appendix 1 • Standard normal distribution areas

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

3.33 0.99957 0.00043 0.99913 0.000873.34 0.99958 0.00042 0.99916 0.000843.35 0.99960 0.00040 0.99919 0.000813.36 0.99961 0.00039 0.99922 0.000783.37 0.99962 0.00038 0.99925 0.000753.38 0.99964 0.00036 0.99928 0.000723.39 0.99965 0.00035 0.99930 0.000703.40 0.99966 0.00034 0.99933 0.000673.41 0.99968 0.00032 0.99935 0.000653.42 0.99969 0.00031 0.99937 0.000633.43 0.99970 0.00030 0.99940 0.000603.44 0.99971 0.00029 0.99942 0.000583.45 0.99972 0.00028 0.99944 0.000563.46 0.99973 0.00027 0.99946 0.000543.47 0.99974 0.00026 0.99948 0.000523.48 0.99975 0.00025 0.99950 0.000503.49 0.99976 0.00024 0.99952 0.000483.50 0.99977 0.00023 0.99953 0.000473.51 0.99978 0.00022 0.99955 0.000453.52 0.99978 0.00022 0.99957 0.000433.53 0.99979 0.00021 0.99958 0.000423.54 0.99980 0.00020 0.99960 0.000403.55 0.99981 0.00019 0.99961 0.000393.56 0.99981 0.00019 0.99963 0.000373.57 0.99982 0.00018 0.99964 0.000363.58 0.99983 0.00017 0.99966 0.000343.59 0.99983 0.00017 0.99967 0.000333.60 0.99984 0.00016 0.99968 0.000323.80 0.99993 0.00007 0.99986 0.000143.82 0.99993 0.00007 0.99987 0.000133.84 0.99994 0.00006 0.99988 0.000123.86 0.99994 0.00006 0.99989 0.000113.88 0.99995 0.00005 0.99990 0.000103.90 0.99995 0.00005 0.99990 0.000103.92 0.99996 0.00004 0.99991 0.000093.94 0.99996 0.00004 0.99992 0.000083.96 0.99996 0.00004 0.99993 0.000073.98 0.99997 0.00003 0.99993 0.000074.00 0.99997 0.00003 0.99994 0.000064.02 0.99997 0.00003 0.99994 0.000064.04 0.99997 0.00003 0.99995 0.000054.06 0.99998 0.00002 0.99995 0.000054.08 0.99998 0.00002 0.99995 0.000054.10 0.99998 0.00002 0.99996 0.000044.12 0.99998 0.00002 0.99996 0.000044.14 0.99998 0.00002 0.99997 0.000034.16 0.99998 0.00002 0.99997 0.000034.18 0.99999 0.00001 0.99997 0.000034.20 0.99999 0.00001 0.99997 0.00003

Page 218: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 1 • Standard normal distribution areas 203

z P(Z � z) P(Z � z) P(�z � Z � z) P(Z � �z or Z � z)

4.22 0.99999 0.00001 0.99998 0.000024.24 0.99999 0.00001 0.99998 0.000024.26 0.99999 0.00001 0.99998 0.000024.28 0.99999 0.00001 0.99998 0.000024.30 0.99999 0.00001 0.99998 0.000024.32 0.99999 0.00001 0.99998 0.000024.34 0.99999 0.00001 0.99999 0.000014.36 0.99999 0.00001 0.99999 0.000014.38 0.99999 0.00001 0.99999 0.000014.40 0.99999 0.00001 0.99999 0.000014.42 1.00000 0.00000 0.99999 0.000014.44 1.00000 0.00000 0.99999 0.000014.46 1.00000 0.00000 0.99999 0.000014.48 1.00000 0.00000 0.99999 0.000014.50 1.00000 0.00000 0.99999 0.000014.52 1.00000 0.00000 0.99999 0.000014.54 1.00000 0.00000 0.99999 0.000014.56 1.00000 0.00000 0.99999 0.000014.58 1.00000 0.00000 1.00000 0.000004.60 1.00000 0.00000 1.00000 0.00000

Page 219: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 220: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 2

Percentiles of t distributions

Area in the symmetric central region P(�t � T � t)0.80 0.90 0.95 0.99 0.999

Area to the left of t: P(T �t)Degrees of freedom (df) 0.90 0.95 0.975 0.995 0.9995

1 3.078 6.314 12.71 63.66 636.62 1.886 2.920 4.303 9.925 31.603 1.638 2.353 3.182 5.841 12.924 1.533 2.132 2.776 4.604 8.6105 1.476 2.015 2.571 4.032 6.8696 1.440 1.943 2.447 3.707 5.9597 1.415 1.895 2.365 3.499 5.4088 1.397 1.860 2.306 3.355 5.0419 1.383 1.833 2.262 3.250 4.781

10 1.372 1.812 2.228 3.169 4.58711 1.363 1.796 2.201 3.106 4.43712 1.356 1.782 2.179 3.055 4.31813 1.350 1.771 2.160 3.012 4.22114 1.345 1.761 2.145 2.977 4.14015 1.341 1.753 2.131 2.947 4.07316 1.337 1.746 2.120 2.921 4.01517 1.333 1.740 2.110 2.898 3.96518 1.330 1.734 2.101 2.878 3.92219 1.328 1.729 2.093 2.861 3.88320 1.325 1.725 2.086 2.845 3.85021 1.323 1.721 2.080 2.831 3.81922 1.321 1.717 2.074 2.819 3.79223 1.319 1.714 2.069 2.807 3.76824 1.318 1.711 2.064 2.797 3.74525 1.316 1.708 2.060 2.787 3.72526 1.315 1.706 2.056 2.779 3.70727 1.314 1.703 2.052 2.771 3.69028 1.313 1.701 2.048 2.763 3.67429 1.311 1.699 2.045 2.756 3.65930 1.310 1.697 2.042 2.750 3.64631 1.309 1.696 2.040 2.744 3.63332 1.309 1.694 2.037 2.738 3.62233 1.308 1.692 2.035 2.733 3.61134 1.307 1.691 2.032 2.728 3.60135 1.306 1.690 2.030 2.724 3.591

Page 221: Introduction to Statistics in Pharmaceutical Clinical Trials

206 Appendix 2 • Percentiles of t distributions

Area in the symmetric central region P(�t � T � t)0.80 0.90 0.95 0.99 0.999

Area to the left of t: P(T � t)Degrees of freedom (df) 0.90 0.95 0.975 0.995 0.9995

36 1.306 1.688 2.028 2.719 3.58237 1.305 1.687 2.026 2.715 3.57438 1.304 1.686 2.024 2.712 3.56639 1.304 1.685 2.023 2.708 3.55840 1.303 1.684 2.021 2.704 3.55145 1.301 1.679 2.014 2.690 3.52050 1.299 1.676 2.009 2.678 3.49660 1.296 1.671 2.000 2.660 3.46070 1.294 1.667 1.994 2.648 3.43580 1.292 1.664 1.990 2.639 3.41690 1.291 1.662 1.987 2.632 3.402

100 1.290 1.660 1.984 2.626 3.390120 1.289 1.658 1.980 2.617 3.373140 1.288 1.656 1.977 2.611 3.361160 1.287 1.654 1.975 2.60 3.352180 1.286 1.653 1.973 2.603 3.345200 1.286 1.653 1.972 2.601 3.340220 1.285 1.652 1.971 2.598 3.335240 1.285 1.651 1.970 2.596 3.332260 1.285 1.651 1.969 2.595 3.328280 1.285 1.650 1.968 2.594 3.326300 1.284 1.650 1.968 2.592 3.323

Page 222: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 3

Percentiles of �2 distributions

Area to the left of X2: P(X � X2)Degrees of freedom (df) 0.90 0.95 0.99 0.999

1 2.706 3.841 6.635 10.8282 4.605 5.991 9.210 13.8163 6.251 7.815 11.345 16.2664 7.779 9.488 13.277 18.4675 9.236 11.070 15.086 20.5156 10.645 12.592 16.812 22.4587 12.017 14.067 18.475 24.3228 13.362 15.507 20.090 26.1249 14.684 16.919 21.666 27.877

10 15.987 18.307 23.209 29.58811 17.275 19.675 24.725 31.26412 18.549 21.026 26.217 32.90913 19.812 22.362 27.688 34.52814 21.064 23.685 29.141 36.12315 22.307 24.996 30.578 37.69716 23.542 26.296 32.000 39.25217 24.769 27.587 33.409 40.79018 25.989 28.869 34.805 42.31219 27.204 30.144 36.191 43.82020 28.412 31.410 37.566 45.31521 29.615 32.671 38.932 46.79722 30.813 33.924 40.289 48.26823 32.007 35.172 41.638 49.72824 33.196 36.415 42.980 51.17925 34.382 37.652 44.314 52.62026 35.563 38.885 45.642 54.05227 36.741 40.113 46.963 55.47628 37.916 41.337 48.278 56.89229 39.087 42.557 49.588 58.30130 40.256 43.773 50.892 59.70335 46.059 49.802 57.342 66.61940 51.805 55.758 63.691 73.40245 57.505 61.656 69.957 80.07750 63.167 67.505 76.154 86.66155 68.796 73.311 82.292 93.16860 74.397 79.082 88.379 99.60770 85.527 90.531 100.43 112.32

Page 223: Introduction to Statistics in Pharmaceutical Clinical Trials

208 Appendix 3 • Percentiles of �2 distributions

Area to the left of X2: P(X � X2)Degrees of freedom (df) 0.90 0.95 0.99 0.999

80 96.578 101.88 112.33 124.8490 107.57 113.15 124.12 137.21

100 118.50 124.34 135.81 149.45110 129.39 135.48 147.41 161.58120 140.23 146.57 158.95 173.62130 151.05 157.61 170.42 185.57140 161.83 168.61 181.84 197.45150 172.58 179.58 193.21 209.26160 183.31 190.52 204.53 221.02170 194.02 201.42 215.81 232.72180 204.70 212.30 227.06 244.37190 215.37 223.16 238.27 255.98200 226.02 233.99 249.45 267.54

Page 224: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 4

Percentiles of F distributions (a � 0.05)

Numerator degrees of freedom (ndf)(ddf) 1 2 3 4 5 6 7 8 9

1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.52 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.383 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.814 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.005 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.776 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.107 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.688 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.399 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18

10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.0211 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.9012 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.8013 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.7114 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.6515 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.5916 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.5417 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.4918 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.4619 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.4220 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.3921 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.3722 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.3423 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.3224 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.3025 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.2826 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.2727 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.2528 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.2429 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.2230 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.2140 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.1250 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.0760 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.0470 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.0280 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.0090 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99

100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97110 3.93 3.08 2.69 2.45 2.30 2.18 2.09 2.02 1.97120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96

Page 225: Introduction to Statistics in Pharmaceutical Clinical Trials

210 Appendix 4 • Percentiles of F distributions (a � 0.05)

Numerator degrees of freedom (ndf)(ddf) 10 12 15 20 25 30 40 60 80 100 120

1 241.9 243.9 245.9 248.0 249.3 250.1 251.1 252.2 252.7 253.0 253.32 19.40 19.41 19.43 19.45 19.46 √19.46 19.47 19.48 19.48 19.49 19.493 8.79 8.74 8.70 8.66 8.63 8.62 8.59 8.57 8.56 8.55 8.554 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.67 5.66 5.665 4.74 4.68 4.62 4.56 4.52 4.50 4.46 4.43 4.41 4.41 4.406 4.06 4.00 3.94 3.87 3.83 3.81 3.77 3.74 3.72 3.71 3.707 3.64 3.57 3.51 3.44 3.40 3.38 3.34 3.30 3.29 3.27 3.278 3.35 3.28 3.22 3.15 3.11 3.08 3.04 3.01 2.99 2.97 2.979 3.14 3.07 3.01 2.94 2.89 2.86 2.83 2.79 2.77 2.76 2.75

10 2.98 2.91 2.85 2.77 2.73 2.70 2.66 2.62 2.60 2.59 2.5811 2.85 2.79 2.72 2.65 2.60 2.57 2.53 2.49 2.47 2.46 2.4512 2.75 2.69 2.62 2.54 2.50 2.47 2.43 2.38 2.36 2.35 2.3413 2.67 2.60 2.53 2.46 2.41 2.38 2.34 2.30 2.27 2.26 2.2514 2.60 2.53 2.46 2.39 2.34 2.31 2.27 2.22 2.20 2.19 2.1815 2.54 2.48 2.40 2.33 2.28 2.25 2.20 2.16 2.14 2.12 2.1116 2.49 2.42 2.35 2.28 2.23 2.19 2.15 2.11 2.08 2.07 2.0617 2.45 2.38 2.31 2.23 2.18 2.15 2.10 2.06 2.03 2.02 2.0118 2.41 2.34 2.27 2.19 2.14 2.11 2.06 2.02 1.99 1.98 1.9719 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.96 1.94 1.9320 2.35 2.28 2.20 2.12 2.07 2.04 1.99 1.95 1.92 1.91 1.9021 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.89 1.88 1.8722 2.30 2.23 2.15 2.07 2.02 1.98 1.94 1.89 1.86 1.85 1.8423 2.27 2.20 2.13 2.05 2.00 1.96 1.91 1.86 1.84 1.82 1.8124 2.25 2.18 2.11 2.03 1.97 1.94 1.89 1.84 1.82 1.80 1.7925 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.80 1.78 1.7726 2.22 2.15 2.07 1.99 1.94 1.90 1.85 1.80 1.78 1.76 1.7527 2.20 2.13 2.06 1.97 1.92 1.88 1.84 1.79 1.76 1.74 1.7328 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.74 1.73 1.7129 2.18 2.10 2.03 1.94 1.89 1.85 1.81 1.75 1.73 1.71 1.7030 2.16 2.09 2.01 1.93 1.88 1.84 1.79 1.74 1.71 1.70 1.6840 2.08 2.00 1.92 1.84 1.78 1.74 1.69 1.64 1.61 1.59 1.5850 2.03 1.95 1.87 1.78 1.73 1.69 1.63 1.58 1.54 1.52 1.5160 1.99 1.92 1.84 1.75 1.69 1.65 1.59 1.53 1.50 1.48 1.4770 1.97 1.89 1.81 1.72 1.66 1.62 1.57 1.50 1.47 1.45 1.4480 1.95 1.88 1.79 1.70 1.64 1.60 1.54 1.48 1.45 1.43 1.4190 1.94 1.86 1.78 1.69 1.63 1.59 1.53 1.46 1.43 1.41 1.39

100 1.93 1.85 1.77 1.68 1.62 1.57 1.52 1.45 1.41 1.39 1.38110 1.92 1.84 1.76 1.67 1.61 1.56 1.50 1.44 1.40 1.38 1.36120 1.91 1.83 1.75 1.66 1.60 1.55 1.50 1.43 1.39 1.37 1.35

Page 226: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 5

Values of q for Tukey’s HSD test (a � 0.05)

av 2 3 4 5 6

4 3.92649 5.04024 5.75706 6.28702 6.706445 3.63535 4.60166 5.21848 5.67312 6.032906 3.46046 4.33902 4.89559 5.30494 5.628557 3.34392 4.16483 4.68124 5.06007 5.359098 3.26115 4.04101 4.52880 4.88575 5.167239 3.19906 3.94850 4.41490 4.75541 5.02352

10 3.15106 3.87676 4.32658 4.65429 4.9120211 3.11265 3.81952 4.25609 4.57356 4.8229512 3.08132 3.77278 4.19852 4.50760 4.7501513 3.05529 3.73414 4.15087 4.45291 4.6897014 3.03319 3.70139 4.11051 4.40661 4.6385415 3.01432 3.67338 4.07597 4.36699 4.5947416 2.99800 3.64914 4.04609 4.33269 4.5568117 2.98373 3.62796 4.01999 4.30271 4.5236518 2.97115 3.60930 3.99698 4.27629 4.4944219 2.95998 3.59274 3.97655 4.25283 4.4684620 2.95000 3.57794 3.95829 4.23186 4.4452421 2.94102 3.56463 3.94188 4.21300 4.4243622 2.93290 3.55259 3.92704 4.19594 4.4054723 2.92553 3.54167 3.91356 4.18045 4.3883124 2.91880 3.53170 3.90126 4.16632 4.3726525 2.91263 3.52257 3.89000 4.15337 4.3583126 2.90697 3.51417 3.87964 4.14146 4.3451127 2.90174 3.50643 3.87009 4.13047 4.3329428 2.89690 3.49918 3.86125 4.12030 4.3216729 2.89240 3.49263 3.85304 4.11087 4.3112130 2.88822 3.48651 3.84540 4.10208 4.3014731 2.88432 3.48065 3.83828 4.09389 4.2923832 2.88068 3.47525 3.83162 4.08622 4.2838933 2.87726 3.47019 3.82537 4.07904 4.2759234 2.87405 3.46544 3.81951 4.07230 4.2684435 2.87103 3.46097 3.81400 4.06595 4.2614136 2.86818 3.45676 3.80880 4.05997 4.2547737 2.86550 3.45278 3.80389 4.05432 4.2485138 2.86296 3.44902 3.79925 4.04898 4.2425839 2.86055 3.44546 3.79486 4.04392 4.2369740 2.85827 3.44208 3.79069 4.03913 4.2316541 2.85610 3.43888 3.78673 4.03457 4.22659

Page 227: Introduction to Statistics in Pharmaceutical Clinical Trials

212 Appendix 5 • Values of q for Tukey’s HSD test (a � 0.05)

av 2 3 4 5 6

42 2.85404 3.43582 3.78296 4.03024 4.2217943 2.85208 3.43292 3.77938 4.02611 4.2172144 2.85020 3.43015 3.77596 4.02217 4.2128445 2.84842 3.42751 3.77270 4.01842 4.2086846 2.84671 3.42499 3.76958 4.01483 4.2046947 2.84508 3.42257 3.76660 4.01140 4.2008948 2.84352 3.42026 3.76375 4.00812 4.1972449 2.84203 3.41805 3.76102 4.00497 4.1937550 2.84059 3.41592 3.75839 4.00195 4.1904051 2.83921 3.41389 3.75588 3.99906 4.1871952 2.83789 3.41193 3.75346 3.99627 4.1841053 2.83662 3.41005 3.75104 3.99360 4.1811354 2.83540 3.40824 3.74886 3.99103 4.1782755 2.83422 3.40649 3.74677 3.98855 4.1755256 2.83308 3.40482 3.74475 3.98616 4.1728757 2.83199 3.40320 3.74268 3.98386 4.1703158 2.83093 3.40163 3.74075 3.98164 4.1678559 2.82992 3.40013 3.73889 3.97949 4.1654760 2.82893 3.39867 3.73709 3.97742 4.1631761 2.82798 3.39726 3.73535 3.97542 4.1609462 2.82706 3.39590 3.73367 3.97348 4.1587963 2.82617 3.39458 3.73204 3.97161 4.1567164 2.82531 3.39331 3.73047 3.96979 4.1547065 2.82448 3.39207 3.72894 3.96804 4.1527566 2.82367 3.39088 3.72746 3.96633 4.1508567 2.82288 3.38971 3.72603 3.96468 4.1490268 2.82212 3.38859 3.72464 3.96308 4.1472469 2.82138 3.38750 3.72329 3.96152 4.1455270 2.82067 3.38644 3.72198 3.96001 4.1438471 2.81997 3.38540 3.72071 3.95855 4.1422172 2.81929 3.38440 3.71947 3.95712 4.1406373 2.81864 3.38343 3.71827 3.95574 4.1390974 2.81800 3.38248 3.71710 3.95439 4.1375975 2.81738 3.38156 3.71596 3.95308 4.1361476 2.81665 3.38067 3.71485 3.95181 4.1347277 2.81606 3.37979 3.71377 3.95056 4.1333478 2.81548 3.37894 3.71273 3.94935 4.1320079 2.81492 3.37812 3.71170 3.94818 4.1306980 2.81437 3.37731 3.71071 3.94703 4.1294181 2.81384 3.37652 3.70973 3.94591 4.1281782 2.81332 3.37575 3.70879 3.94481 4.1269683 2.81281 3.37501 3.70786 3.94375 4.1257784 2.81232 3.37428 3.70696 3.94271 4.1246285 2.81184 3.37356 3.70608 3.94169 4.1234986 2.81136 3.37287 3.70522 3.94070 4.1223987 2.81090 3.37219 3.70438 3.93974 4.1213288 2.81045 3.37152 3.70356 3.93879 4.1202789 2.81001 3.37087 3.70276 3.93778 4.11924

Page 228: Introduction to Statistics in Pharmaceutical Clinical Trials

Appendix 5 • Values of q for Tukey’s HSD test (a � 0.05) 213

av 2 3 4 5 6

90 2.80958 3.37024 3.70197 3.93691 4.1182491 2.80916 3.36962 3.70121 3.93607 4.1172592 2.80875 3.36901 3.70046 3.93524 4.1163093 2.80835 3.36842 3.69972 3.93443 4.1153694 2.80795 3.36784 3.69901 3.93363 4.1144495 2.80757 3.36727 3.69830 3.93274 4.1135496 2.80719 3.36671 3.69762 3.93194 4.1126697 2.80682 3.36617 3.69694 3.93117 4.1118098 2.80646 3.36564 3.69628 3.93041 4.1109599 2.80611 3.36511 3.69564 3.92967 4.11013

100 2.80576 3.36460 3.69501 3.92894 4.10932101 2.80542 3.36410 3.69439 3.92822 4.10853102 2.80509 3.36361 3.69378 3.92752 4.10775103 2.80476 3.36313 3.69318 3.92684 4.10699104 2.80444 3.36266 3.69260 3.92616 4.10624105 2.80412 3.36219 3.69203 3.92550 4.10550106 2.80382 3.36174 3.69147 3.92486 4.10478107 2.80351 3.36129 3.69092 3.92422 4.10408108 2.80322 3.36085 3.69038 3.92360 4.10339109 2.80293 3.36043 3.68984 3.92299 4.10271110 2.80264 3.36000 3.68932 3.92239 4.10204111 2.80236 3.35959 3.68881 3.92180 4.10139112 2.80208 3.35918 3.68831 3.92122 4.10074113 2.80181 3.35878 3.68782 3.92065 4.10011114 2.80155 3.35839 3.68733 3.92009 4.09949115 2.80129 3.35801 3.68686 3.91954 4.09888116 2.80103 3.35763 3.68639 3.91900 4.09828117 2.80078 3.35726 3.68593 3.91847 4.09769118 2.80053 3.35689 3.68548 3.91795 4.09711119 2.80028 3.35653 3.68503 3.91744 4.09655120 2.80004 3.35618 3.68460 3.91694 4.09599121 2.79981 3.35583 3.68417 3.91644 4.09544122 2.79958 3.35549 3.68375 3.91596 4.09490123 2.79935 3.35516 3.68333 3.91548 4.09436124 2.79913 3.35482 3.68292 3.91501 4.09384125 2.79890 3.35450 3.68252 3.91454 4.09333126 2.79869 3.35418 3.68213 3.91409 4.09282127 2.79847 3.35387 3.68174 3.91364 4.09232128 2.79826 3.35356 3.68135 3.91320 4.09183129 2.79806 3.35325 3.68098 3.91276 4.09135130 2.79785 3.35295 3.68061 3.91234 4.09087131 2.79765 3.35265 3.68024 3.91192 4.09040132 2.79745 3.35236 3.67988 3.91150 4.08994133 2.79726 3.35208 3.67953 3.91109 4.08949134 2.79707 3.35179 3.67918 3.91069 4.08904135 2.79688 3.35152 3.67883 3.91029 4.08860136 2.79669 3.35124 3.67849 3.90990 4.08817137 2.79651 3.35097 3.67816 3.90952 4.08774

Page 229: Introduction to Statistics in Pharmaceutical Clinical Trials

214 Appendix 5 • Values of q for Tukey’s HSD test (a � 0.05)

av 2 3 4 5 6

138 2.79633 3.35071 3.67783 3.90914 4.08723139 2.79615 3.35045 3.67751 3.90877 4.08683140 2.79598 3.35019 3.67719 3.90840 4.08644141 2.79580 3.34993 3.67687 3.90804 4.08606142 2.79563 3.34968 3.67656 3.90768 4.08562143 2.79547 3.34943 3.67626 3.90732 4.08538144 2.79530 3.34919 3.67596 3.90698 4.08495145 2.79514 3.34895 3.67566 3.90663 4.08459146 2.79498 3.34871 3.67537 3.90630 4.08423147 2.79482 3.34848 3.67508 3.90596 4.08388148 2.79466 3.34825 3.67479 3.90563 4.08342149 2.79451 3.34802 3.67451 3.90531 4.08306150 2.79435 3.34780 3.67423 3.90499 4.08271

Page 230: Introduction to Statistics in Pharmaceutical Clinical Trials

Chapter 5

1.

a nominalb ratioc ratiod ratioe ratiof ordinal

3.

a 63b 53c 72

Chapter 6

1.

a 100/200 � 0.5b 30/200 � 0.15c 45/200 � 0.225d 30/45 � 0.67

3.

a 0.00135b 0.5c 0.05d 0.00003

9.

a t � � 1.833 or t � 1.833b t � � 3.250 or t � 3.250c t � � 2.045 or t � 2.045d t � � 3.659 or t � 3.659

10.

a not rejectb rejectc rejectd not reject

Chapter 8

2.

a (0.04, 0.21)b (0.08, 0.16)

3.

a (0.10, 0.22)b (0.08, 0.24)c Z � 0.74 therefore we would be 54%

confident.

4.

a (0.005, 0.15)b (� 0.01, 0.16)c (� 0.04, 0.19)

Chapter 9

2.

a (�0.745, 1.425)b (�0.963, 1.643)c (�1.405, 2.085)

3.

a The difference between groups is notstatistically significant.

Review exercise solutions by chapter

Page 231: Introduction to Statistics in Pharmaceutical Clinical Trials

b The difference between groups is statisti-cally significant and the treatment appearsto increase SBP.

c The difference between groups is statisti-cally significant and the treatment appearsto lower SBP.

Chapter 10

3.

a not rejectedb rejectedc rejectedd rejectede not rejectedf not rejectedg rejectedh not rejected

4.

a 0.119b 0.008c 0.001d 0.317

5.

a

b H0: p1 � p2 � 0HA: p1 � p2 � 0

c Chi-square test, Fisher’s exact test, or zapproximation

d Yes, there is sufficient evidence to rejectthe null hypothesis (a � 0.05). Using thechi-square test the assumptions are that the two groups are independent;responses are mutually exclusive; and the expected cell counts are at least 5 in� 80% of the cells. The value of the chi-square test statistic is 6.62.

(152)(385)e The odds ratio � ––––––––– � 1.45(117)(346)

Participants exposed to the test treatmentare 1.45 times more likely to respondthan participants in the placebo group.

Chapter 11

1.

a H0: lTEST � lPLACEBO � 0HA: lTEST � lPLACEBO � 0

b t � � 2.0423 or t � 2.042c Independent samples; outcome approxi-

mately normally distributed; equalunknown variance

d �2.26e Do not reject H0 since � 2.042 � � 0.68

� 2.042. The mean pain scores are notsignificantly different between the twogroups.

2.

a SS error � 238.67896MS Drug � 49.947295MS Error � 7.95597F � 6.28

b 3c H0: l1 � l2 � l3

HA: at least one pair of population meansare unequal

d 33e F � 3.32f Reject H0 since 6.28 � 3.32. At least one

pair of population means is unequal.

3.

a H0: lP � lL � lM � lHHA: at least one pair of population meansare unequal

b Each group represents a simple randomsample from relevant populations; obser-vations are independent; outcome isapproximately normally distributed; thevariance is equal across the populations

c Placebo vs. low; placebo vs. medium;placebo vs. high; low vs. medium; low vs.high; medium vs. high

d To control the overall Type I error–––20

e MSDT � 3.68639�–––– � 3.0130

Placebo Test

Responder 117 152Non-responder 385 346

502 498

216 Review exercise solutions by chapter

Page 232: Introduction to Statistics in Pharmaceutical Clinical Trials

Chapter 12

4.

a 80% power: 64/group90% power: 86/group

b 80% power: 143/group90% power: 191/group

Review exercise solutions by chapter 217

Page 233: Introduction to Statistics in Pharmaceutical Clinical Trials
Page 234: Introduction to Statistics in Pharmaceutical Clinical Trials

a, 70, 76, 77, 159application of Bayes’ theorem, 178–180in Bonferroni’s test, 160–161choice of, 78, 82, 128, 174

b, 77, 174application of Bayes’ theorem, 178–180

v2 distributionscritical values, 137percentiles, 207–208

v2 test of homogeneity, 135–137comparing g proportions, 138–140comparing two proportions, 138r responses from g groups, 140

active comparator drugs, 28, 32active pharmaceutical ingredients (APIs), 13adaptive designs, 189–190adverse drug reactions, 99adverse events (AEs), 97, 99, 181

analysis of, 102Kaplan–Meier estimation of survival function,

109–114probability of, 100–101reporting of, 93–94, 99–100risks of specific AEs, 101–102time-to-event analysis, 107–108see also safety analyses; side effects

alternate hypothesis, 27, 28–29, 30, 75–76, 78, 82in equivalence trials, 28, 187–188in non-inferiority trials, 28, 187–188in superiority trials, 26see also hypothesis testing

among-samples variance, 154, 157analysis of covariance (ANCOVA), 187analysis of variance (ANOVA), 152–155

data collection, 166–167interpretation, 158Kruskal–Wallis test, 167–169with only two groups, 166worked example, 155–158

ANOVA tables, 155, 157Kruskal–Wallis test, 168

antihypertensive drugs, 5“appropriate” clinical trials, 36area under curve (AUC)

drug-concentration time curves, 89–90normal distribution, 63–65standard normal distribution, 65

arithmetic mean, 53see also mean

arrhythmias, 125assay sensitivity, 189association, 3

distinction from causation, 4

baseline characteristics, evaluation, 186–187baseline measurements, 43Bayes’ theorem, 59–60

application to a and b, 178–180survival function, 110–111

bell-shaped curve, 62, 63beneficence, 19benefit–risk ratio, 91, 97

see also safety analysesBernoulli processes, 61best practice guidelines, 12bias, 37, 129

risk from censoring, 114bin width, histograms, 50–51binary outcomes, sample size estimation, 173–175binomial distribution, 61–62, 67

confidence intervals, 104bioavailability trials, 92–93biologics license applications (BLA), 10blinding, 14, 40block randomization, 38blood pressure, 5–6

high see hypertensionmeasurement over time, 43–44medical management, 6uniformity of measurement, 43

blood pressure measurements, safety analysis,123–124

Bonferroni’s test, 160–163, 164, 165

Index

Note: page numbers in italics refer to Figures, and those in bold to Tables.

Page 235: Introduction to Statistics in Pharmaceutical Clinical Trials

carcinogenicity tests, 15Cardiac Arrhythmia Suppression Trial (CAST),

surrogate endpoint, 42cardiac arrhythmias, 125cardiovascular disease risk, relationship to blood

pressure, 6carryover effect, crossover trials, 39–40case report forms (CRFs), adverse event reporting, 100categorical data analysis, 145

v2 test of homogeneity, 135–140Fisher’s exact test, 140–143Mantel–Haenszel method, 143–145Z approximation, 133–135

causation, 4censored data, 110, 113, 171

center-to-center variability, 182central laboratories, 117, 181central limit theorem, 71, 104central tendency, 52–53central tendency measures, laboratory data, 118–119centralized procedure, MAA, 11change scores, 43classical probability, 67clinical communications

ethical issues, 20statistical aspects, 12–13

clinical equipoise, 19, 31clinical monitors, 185clinical significance

of laboratory data, 123minimally clinically relevant difference, 175of treatment effects, 29–30, 80–81, 82, 128, 130,

165–166clinical study protocols, 10, 44–45clinical trial applications (CTAs), EMEA, 11clinical trials, 6, 85

categorization by phase, 15–16definition, 1ethical considerations, 19–20ICH classification, 16–18

see also confirmatory trials; early phase (humanpharmacology) clinical trials; equivalence trials;first-time-in-human (FTIH) studies;noninferiority trials; superiority trials

clinically relevant observations, 41cluster randomization, 38Cmax (maximal drug concentration), 88, 89, 90co-rapporteur, 11Cochran’s statistic, 144coding of adverse events, 101coefficient of variation (CV), 55combinations, probabilities of, 61Committee for Medicinal Products for Human Use

(CHMP), 11compelling evidence, 4–5complement rule, 58concerned member states (CMSs), 11

concurrently controlled, parallel group design, 38–39conditional probability, 58–59confidence intervals, 74

for difference in two means, 120–121in equivalence and noninferiority studies, 188with known population variance, 71for a mean with unknown variance, 119–120relationship to hypothesis tests, 81–82for a sample proportion, 103–105of survival distribution functions, 113with unknown population variance, 72–74

confirmatory trials, 16, 17–18, 24duration, 40–41endpoints, 131, 185–186hypothesis testing, 132–133

v2 test of homogeneity, 135–140analysis of variance, 152–158Bonferroni’s test, 160–163Fisher’s exact test, 140–143Mantel–Haenszel method, 143–145two-sample t tests, 147–150Wilcoxon rank sum test, 150–152Z approximation, 133–134

ICH Guidance E9, 127–128objectives, 130–131, 185–186

Consolidated Standards of Reporting Trials(CONSORT) group, 13

contingency tables, 135, 136, 137g groups and two responses, 139for logrank test, 169, 170multicenter trials, 144

continuous data analysis, 147analysis of variance (ANOVA), 152–155, 166–167Kruskal–Wallis test, 167–169logrank test, 169–171two-sample t test, 147–150Wilcoxon rank sum test, 150–152

continuous outcomes, sample size estimation,173–175

continuous random variables, 60normal distribution, 62–67

control compounds, manufacture, 14controls, in human pharmacology studies, 86correction factor, Z approximation, 133costs of drug development, 7Cox’s proportional hazards model, 114critical regions, 78–79, 81cross-tabulation, 58, 169crossover designs, 39–40cumulative frequency, 49cumulative percentage, 49

data, 5measurement scales, 48–49missing, 184–185

data collection, 166–167data monitoring committees (DMCs), 20, 189

220 Index

Page 236: Introduction to Statistics in Pharmaceutical Clinical Trials

decentralized procedure, MAA, 11decision-making, 5, 165–166

in analysis of variance, 158during a clinical development program, 7–8during evidence-based clinical practice, 8relationship to statistical power, 180

descriptive displaysof central tendency and dispersion, 55of laboratory data, 118, 119, 122of systolic blood pressure values, 80in time-to-event analysis, 108, 109

diagnostic test development, application of Bayes’theorem, 59–60

diastolic blood pressure (DBP), 5–6discrete random variables, 60, 61

binomial distribution, 61–62dispersion, 53–55dose, definition, 91dose-escalation cohort studies, 91–92dose-finding studies, 17, 91–92double-blind trials, 14, 40drop-outs, 107drug-concentration time curve, 88drug delivery systems, 13drug development, 9drug discovery, 9–10drug interactions, 87drug treatment groups, 25drug treatment schedule, 44drugs, useful attributes, 9Dunnett’s test, 164, 165

early phase (human pharmacology) clinical trials, 16,17, 24, 85–86, 93–94

bioavailability trials, 92–93dose-finding trials, 91–92goals, 86–87limitations, 94–95pharmacokinetic and pharmacodynamic data

analysis, 89–91pharmacokinetic studies, 87–89research questions, 87study designs, 86see also first-time-in-human (FTIH) studies

ECG monitoring, 124–125effectiveness, 18efficacy, 14

measures of, 44, 131efficacy analysis, use of ITT and per-protocol

populations, 182–183eligibility criteria, 44elimination of drugs, 89end-of-study values, plots against baseline values,

122, 123end-of-treatment measurements, 43endpoints, 41–42

confirmatory trials, 131

examples, 43primary, 185secondary, 185–186

environmental conditions, control of, 38equivalence trials, 27–28, 130–131

study design, 187–189errors see type I errors; type II errorsestimation, 70–71, 82estimators, 69

for sample proportions, 103–104ethics

in clinical communications, 13in clinical trials, 19, 166in preclinical trials, 8relationship to hypothesis testing, 31–32in statistical methodology, 19–20use of healthy participants, 87use of placebo controls, 32

European Medicines Agency (EMEA), 11–12guidance on baseline covariates, 187

events, probability of, 58–59evidence-based clinical practice, 8evidence provision, confirmatory trials, 128excipients, 13excretion of drugs, 89experimental control, 18experimental methodology, 36, 40–41

blood pressure measurement over time, 43–44local control, 38uniformity of blood pressure measurement, 43

experimental studies, 37exploratory studies, 15, 180

F distributions, 154, 157percentiles, 209–210

factorials, 61false-positive/negative rates, 59–60FDA (Food and Drug Administration), 10–11

definition of “substantial evidence”, 128guidance on safety reviews, 98

first-pass metabolism, 92first-time-in-human (FTIH) studies, 16, 17, 86, 181

limitations, 94pharmacokinetic and pharmacodynamic studies,

89–91sample size, 173see also human pharmacology studies

Fisher’s exact test, 140–143fraud, 129frequency tables, 49, 50

generalization, multicenter studies, 182genotoxicity, 15“go” decisions, 85goals of human pharmacology trials, 86–87good clinical practice (GCP), 12good laboratory practice (GLP), 12, 15

Index 221

Page 237: Introduction to Statistics in Pharmaceutical Clinical Trials

good manufacturing practice (GMP), 12, 13good study design, 36goodness-of-fit tests, 135grand mean, 153, 156graphical displays of end-of-study values, 122, 123groups, independence, 105

half-life of drugs (t1/2), 89hazards, Cox’s model, 114healthy participants, use in early phase trials, 87hemoglobin values, descriptive display, 118high blood pressure see hypertensionhistograms, 50–51homogeneity assumption, 135honestly significant difference (HSD) test, Tukey,

163–164, 165q values, 211–214

human pharmacology trials, 16, 17, 24, 85–86,93–94

bioavailability trials, 92–93dose-finding trials, 91–92goals, 86–87limitations, 94–95pharmacokinetic and pharmacodynamic data

analysis, 89–91pharmacokinetic studies, 87–89research questions, 87study designs, 86see also first-time-in-human (FTIH) studies

hypergeometric distribution, 140, 141hypertension, 5, 6

as a surrogate endpoint, 42hypothesis tests, 2, 31, 74–76, 82–83

v2 test of homogeneity, 135–140analysis of variance (ANOVA), 152–158in confirmatory trials, 132–133in equivalence and noninferiority studies, 187–188ethical considerations, 31–32Fisher’s exact test, 140–143Kruskal–Wallis test, 167–169Mantel–Haenszel method, 143–145multiple comparisons

Bonferroni’s test, 160–163implications of method chosen, 164–166risk of type I error, 159–160Tukey’s honestly significant difference test,163–164

p values, 80–81problems in safety analysis, 103relationship to confidence intervals, 81–82research questions, 77and secondary objectives, 186single population means, 78–80survival distributions (logrank test), 169–171of two means, 147–150type I and type II errors, 76–77, 178–180Wilcoxon rank sum test, 150–152

Z approximation, 133–134see also alternate hypothesis; null hypothesis

ICH 10, 11classification of clinical trials, 16–18definition of adverse events, 99definition of confirmatory trials, 127guidance on confirmatory trials, 127–128, 128–129guidance on laboratory analyses, 118guidance on nonclinical studies, 14guidance on QT interval prolongation assessment,

124, 125guidance on sample size, 181

imputation of missing values, 184independence of observations, 150independent events, 60independent groups, 105

t test, 148–149independent substantiation of results, 128–129inferential statistics, 69, 83

and early phase studies, 89information accumulation, 178, 179–180intent-to-treat (ITT) population, 182

use in efficacy analyses, 182–183interaction studies, 15interim analyses, 186, 189interquartile range, 55interval measurement scales, 48–49investigational new drug (IND) applications, FDA, 10,

11

JNC 7 report, 6

Kaplan–Meier estimation of survival distribution,109, 111, 112–113

Kruskal–Wallis test, 167–169

laboratory data analyses, 117–118clinical significance, 123confidence intervals

for difference in two means, 120–121for a mean with unknown variance, 119–120

graphical displays, 122, 123measures of central tendency, 118–119responders’ analysis, 121–122shift analysis, 121, 122

laboratory values, standardization, 117large numbers, law of, 68last observation carried forward (LOCF), 184lead compounds, 9lead optimization, 9–10local control, 38local laboratory analyses, problems, 117logrank test, 169–171logistic problems, multicenter studies, 181logistic regression, 138lower limits (LLs), 70

222 Index

Page 238: Introduction to Statistics in Pharmaceutical Clinical Trials

Mantel–Haenszel method, 143–145, 170manufacturing, 13

for clinical trials, 14marketing authorization application (MAA), EMEA, 11maximal drug concentration (Cmax), 88, 89, 90maximum tolerated dose (MTD), 91mean, 53, 147

of a binomial distribution, 61of a normal distribution, 62–63, 65regression to, 25two-sample t test, 147–150see also population mean

mean square error (within-samples variance), 153measurement scales, 48–49MedDRA codes, 101median, 52–53, 55median survival time, 113metabolism, 92minimally clinically relevant difference, 175

definition of, 181minimally effective dose (MED), 91minimally significant difference (MSD)

Bonferroni’s test, 161, 162, 163Kruskal–Wallis test, 168–169

missing data, 184–185mode, 52morbidity, 41, 42mortality, 41, 42multicenter studies, 143, 181–182multiple comparisons

Bonferroni’s test, 160–163implications of method chosen, 164–166risk of type I error, 159–160, 186Tukey’s honestly significant difference test, 163–164

multiplicity issues, safety analyses, 102–103mutagenicity, 15mutual recognition procedure, 11mutually exclusive events, 58

new drug applications (NDAs), FDA, 10–11“no-go” decisions, 85nominal measurement scales, 48nonclinical research, 14–15

ethical issues, 8nonexperimental studies, 37noninferiority trials, 28, 130, 131

study design, 187–189nonparametric analysis, 147

Kruskal–Wallis test, 167–169Wilcoxon rank sum test, 150–152

normal distribution, 62–65assumption of, 71standard normal (Z) distribution, 65, 66transformation to standard normal distribution,

65–67null hypothesis, 27, 28–29, 30, 76, 78, 82

in equivalence trials, 28, 187–188

in non-inferiority trials, 28, 187–188rejection of, 79, 132in superiority trials, 26see also hypothesis testing

objectives, 44of confirmatory trials, 130–131number of, 186primary, 185secondary, 185–186

observations, clinical relevance, 41odds ratio, 137–138omnibus F test, 158, 160on-treatment adverse events, 99one-factor ANOVA, 156–158one-sample t test, 78–80ordinal measurement scales, 48

p values, 80–81P wave, 124pairwise comparisons

multipleBonferroni’s test, 160–162risk of type I error, 159–160Tukey’s honestly significant difference test,163–164

two-sample t test, 147–150parallel group design, 38, 39parameters, 69patient welfare, 191per-protocol population, 182

use in efficacy analyses, 182–183percentiles, 55pharmaceutical manufacturing, 13

for clinical trials, 14pharmacodynamics, 14, 86, 87

data analysis, 89–91pharmacokinetics, 14, 86, 87–89

data analysis, 89–91Phase I clinical trials, 15–16

see also human pharmacology trialsPhase II clinical trials, 16Phase III clinical trials, 16

see also equivalence trials; noninferiority trials;superiority trials

physical examinations, in early phase clinical trials, 93placebo-controlled, parallel group design, 39placebo controls, ethical considerations, 32, 189placebo effect, 24–25placebo treatment groups, 25, 39pooled standard deviation, 148, 149pooled variance, 148, 149population mean

confidence intervalknown population variance, 71unknown population variance, 72–74

hypothesis testing, 78–80

Index 223

Page 239: Introduction to Statistics in Pharmaceutical Clinical Trials

population parameter estimates, 53, 69population standard error of the mean, 70populations, 47–48, 182

in efficacy analyses, 182–183postmarketing surveillance, 18–19power curves, 175, 176, 177power of a study, 77, 179

relationship to decision-making, 180relationship to required sample size, 174, 175

pre-hypertension, definition, 6precision, research questions, 26prevalence of a disease, 60primary analyses, 182–183primary comparisons, 90primary objective and endpoint, 185probability, 57

classical, 67in hypothesis testing, 75relative frequency, 67–68

probability distributions, 60–61binomial distribution, 61–62normal distribution, 62–67

probability of events, 58–59probability values, 57–58proof by contradiction, 75proportions

choice of denominator, 100confidence intervals, 103–105

for differences between proportions, 105–107hypothesis testing, 133

v2 test of homogeneity, 135–140Fisher’s exact test, 140–143Mantel–Haenszel method, 143–145Z approximation, 133–134

q statistic, Tukey’s HSD test, 163values, 211–214

QRS complex, 124QT interval, 124QT interval prolongation, 124–125quality of life (QoL), 185–186

random variables, 49assumption of normal distribution, 71display of frequency values, 49–52

random variation, 153randomization, 37–38, 186randomized trials, 40range of data, 54rank analyses

Kruskal–Wallis test, 167–169Wilcoxon rank sum test, 151, 152

rapporteur, 11ratio measurement scales, 49Reference Member State (RMS), 11reference ranges, 117–118regression to the mean, 25

regulatory agencies, 10European Medicines Agency (EMEA), 11–12FDA (Food and Drug Administration), 10–11

regulatory submissions, statistical aspects, 12regulatory toxicology studies, 15relative frequency probability, 67–68relative frequency of random variable values, 49relative risk of an adverse event, 102reliability factor, confidence intervals, 73, 74, 103replication of studies, 129replication within studies, 36–37reproductive toxicology studies, 15research hypotheses, 24, 26–27research questions, 23–24

in early phase studies, 87in equivalence studies, 187hypothesis testing, 77in noninferiority studies, 188relationship to study design, 32–33, 35in safety analyses, 104useful characteristics, 25–26

responders’ analysis, 118, 121–122response, definition, 91

safety, measures of, 44safety analyses, 97, 125–126

clinical laboratory data analyses, 117–123confidence intervals for differences between

proportions, 105–107confidence intervals for sample proportions,

103–105Kaplan–Meier estimation of survival function,

109–114laboratory data analyses, 117–123multiplicity issues, 102–103probability of adverse events, 100–101QT interval prolongation, 124–125rationale, 97–98regulations, 98risks of specific AEs, 101–102sampling variation, 103serious adverse events, 102time-to-event analysis, 107–108vital signs, 123–124

safety population, 182safety studies, nonclinical, 14sample proportions

confidence intervals, 103–105confidence intervals for differences between

proportions, 105–107sample range, 54sample size

and assumption of normal distribution, 71effect on p value, 80law of large numbers, 68relationship to power of a study, 175

sample size estimation, 173

224 Index

Page 240: Introduction to Statistics in Pharmaceutical Clinical Trials

binary outcomes in superiority trials, 175–176continuous outcomes in superiority trials, 173–175ethical issues, 20role of collaboration, 180–181

sample standard deviation (s), 54sample statistics, as estimators of population

parameters, 69, 70sample variance (s2), 54samples, 47sampling distribution, 70sampling variation, 69–70, 103scatterplots, baseline and end-of-study values, 122,

123scientific method, 2, 23secondary analyses, 183secondary comparisons, 90secondary objectives and endpoints, 185–186self-reporting of adverse events, 93, 99–100sensitivity of a diagnostic test, 59sequential designs, 189serious adverse events, 102shift analysis, 118, 121, 122side effects

acceptability, 91see also adverse events

significance, 30simple random samples, 47simple randomization, 37simplified clinical trials (SCTs), 18single center factors, 129single-dose trials, 17size of a test, 77skewed distributions, 51, 150specificity of a diagnostic test, 59standard deviation, 54

estimation of, 175, 180of a normal distribution, 62, 63, 64, 65pooled, 148, 149

standard error of difference in sample means, 120standard error of estimator, 105

effect of sample size, 103standard error of the mean, 73, 74, 119standard normal (Z) distribution, 65, 66

transformation of a normal distribution, 65–67standard normal distribution areas, 195–203standardization, multicenter studies, 182statistic, definition of, 3statistical analysis, 3statistical analysis plan, 45statistical significance, 30Statistics

discipline of, 1–3, 191role clinical communications, 12–13role in regulatory submissions, 12

stem-and-leaf plots, 51, 52step functions, 111stratified randomization, 38

stratified samples, Mantel–Haenszel method, 143–145Student’s t distributions, 72–74, 78–79, 81,

148–149percentiles, 205–206

study conduct, 185study design, 35–36, 57, 132

adaptive designs, 189–190basic principles, 36–38blinding, 40concurrently controlled, parallel group design,

38–39confirmatory trials, 128crossover design, 39–40equivalence and noninferiority trials, 187–189ethical issues, 20in human pharmacology trials, 86interim analyses, 189randomized trials, 40relationship to research questions, 32–33in safety analyses, 104sample size estimation, 173–181

study monitoring, 185study populations, 47–48study protocols, 10, 44–45

confirmatory trials, 128multicenter studies, 182sample size estimation, 174

subgroup analysis, 183–184substantial evidence

FDA definition, 128regulatory views, 127–130

summary statistics, 55laboratory data, 118

superiority trials, 26, 130hypothesis testing, 132sample size estimation

for binary outcomes, 175–176for continuous outcomes, 173–175

surrogate endpoints, 41–42, 131examples, 43

survival analysis, 109–110logrank test, 169–171

survival distributions, Kaplan–Meier estimation, 111,112–113

survival function, 110–111symmetric distributions, 51systematic bias, 37systematic variation, 4systolic blood pressure (SBP), 5–6

t1/2, 89t distributions, 72–74, 78–79, 81, 148–149

percentiles, 205–206t test, two sample, 147–150T wave, 124tabular display of summary statistics, 55

laboratory data, 118

Index 225

Page 241: Introduction to Statistics in Pharmaceutical Clinical Trials

target populations, 47Tchebysheff’s theorem, 54, 64teratogenicity, 15test statistics, 76, 78, 79, 82, 132

v2, 136–7ANOVA (F), 154Bonferroni’s test, 161Cochran’s statistic, 144Kruskal–Wallis test, 168Mantel–Haenszel method, 143Tukey’s honestly significant difference test (q), 163two-sample t test, 148Wilcoxon rank sum test, 151Z approximation, 133

test of treatment-by-subgroup interaction, 183–184therapeutic confirmatory trials, 16, 17–18, 24

duration, 40–41endpoints, 131, 185–186hypothesis testing, 132–133

v2 test of homogeneity, 135–140analysis of variance, 152–158Bonferroni’s test, 160–163Fisher’s exact test, 140–143Mantel-Haenszel method, 143–145two-sample t tests, 147–150Wilcoxon rank sum test, 150–152Z approximation, 133–134

ICH Guidance E9, 127–128objectives, 130–131, 185–186

therapeutic exploratory trials, 16, 17, 24, 85therapeutic use clinical trials, 16, 18time at risk, variation, 107time-to-event analysis, 107–108tmax (time at maximal drug concentration), 88, 90total sum of squares, 153total systemic exposure measurement, 88–89toxicity, maximum tolerated dose (MTD), 91toxicodynamics, 14toxicology testing, nonclinical, 14–15trapezoidal rule, 89treatment effects

clinical relevance, 29–30, 80–81, 82, 128, 130,165–166

relationship to required sample size, 174treatment groups, 38

true positive/negative rates, 59–60Tukey’s honestly significant difference (HSD) test,

163–164, 165q values, 211–214

two-sample t test, 148–149two-sided hypothesis tests, 78, 129type I errors, 76–77, 132

acceptable level, 128risk in multiple comparisons, 159–160risk in safety analysis, 102–103see also a

type II errors, 76–77see also b

typical responses, 37

unbiased estimators, 53uncertainty, 179upper limits (ULs), 70useful drugs, attributes, 9useful information, characteristics, 24useful research questions, 23–24, 25–26

variability, 129center-to-center, 182coefficient of variation, 55percentiles, 55

variance, 54of a binomial distribution, 61pooled, 148, 149relationship to required sample size, 174of survival distribution function, 113

variation, 4vital signs, 123–124

monitoring in early phase clinical trials, 93

weighting, multicenter trials, 143Wilcoxon rank sum test, 150–152within-samples variance (mean square error), 153

xenobiotics, metabolism, 92

Z approximation, 133–134Z (standard normal) distribution, 65, 66

transformation of a normal distribution, 65–67values for two-sided confidence intervals, 104

Z distribution areas, 195–203

226 Index