
RESEARCH COMMENT

Javier Palarea-Albaladejo, Iain McKendrick

Best practice for the design and statistical analysis of animal studies

RECENT developments in technology and computing are facilitating the collection and processing of large volumes of increasingly complex data, while apparently supporting easier access to advanced data analysis methodologies and cutting-edge algorithms that, superficially at least, are straightforward to use.

While it is true that the technical aspect of statistical computation is dramatically simplified by access to software tools, the most important requirements for success are a genuine understanding of variation and uncertainty coupled with well-developed statistical knowledge and thinking. However, with the proliferation of systems for rapid, semi-unsupervised data processing and analysis, particularly in relation to new forms of complex high-throughput data, users are increasingly less likely to be aware of, understand or know how to assess the assumptions that new algorithms and advanced statistical methods are making, especially where these assessments are subjective and require use of other concepts and methods.

Statistics as a discipline may be perceived by some researchers as a ‘necessary evil’, with which they engage to facilitate publication by scientific journals. Yet ethical considerations and the economic costs of animal experimentation should themselves be sufficient reasons to encourage the use of well-considered and rigorous statistical analysis to support reliable scientific conclusions.

Many of the most serious issues affecting scientific studies arise from poor experimental design, misunderstanding of statistical concepts and naïve use of powerful computational resources. However, when reviewing scientific studies, it is paradoxical that although much weight is put on the statistical results – even, occasionally, to the extent of overlooking the actual biological relevance – the methods used and the rationale for key decisions regarding their use are often reported only in terms that are vague or even actively misleading.

Quantitative methods should not be used blindly as a routine step in the research pathway without critical exploration and assessment of their appropriateness in the light of the scientific objectives of the study and the features of the data collected. A myriad of data types, methodological approaches, critical assumptions and data-specific issues can all be relevant, in a unique constellation, to a particular study. The potential for these to influence results and interpretations must not be underestimated, and transparency is a prerequisite when reporting conclusions. For example, where two valid approaches lead to conflicting results, this should be acknowledged, accepted and discussed.

Although not seeking to be comprehensive, this commentary aims to raise awareness of some of the most common issues affecting experimental design and data analysis, with a particular focus on animal studies, while suggesting general principles and guidance to support better practice.

WHAT YOU NEED TO KNOW
• Thoughtful and effective experimental design is crucial in obtaining accurate and meaningful results.
• Decisions about the design must be driven by the overall objectives of the study, a well-defined hypothesis about the biological parameters of interest and the planned statistical analysis.
• All studies should incorporate an element of randomisation, to make the design robust against unidentified and unanticipated sources of bias.
• Before starting data collection, it is important to perform a power analysis to determine the number of replicates required to statistically detect a specific biological effect.
• Selection of the statistical methodology used to analyse the data collected should be based on careful consideration of the objectives of the analysis and the nature of the data.
• The level of measurement and distribution of the independent and dependent variables must be taken into account, as this will ultimately affect which assumptions are met.
• While the conventional practice is to report statistical significance using P values, a P value in isolation is a limited summary that is open to misuse. In veterinary studies, it may be more appropriate to report results based on the clinically significant effect size.
• As scientific research becomes increasingly data-driven, there are substantial benefits in having specialist quantitative expertise available within a project.



Experimental design

The theory of experimental design is well established, and there is an extensive bibliography aimed at varied audiences, as well as numerous online resources.1

The purpose of experimental design is to identify relevant biological measurements and experimental units, and, hence, define appropriate treatment groups and decide where there should be experimental replication in the study. Thoughtful and effective experimental design with a clear focus on the biological quantities of interest is crucial in obtaining accurate and meaningful results.

The starting point for any design should be a clear understanding of the key objective(s) of the study, the hypotheses being tested, the identification of logistical or practical constraints that might lead to the confounding of effects or otherwise impact on the design, and an understanding of the major sources of variability in the observed data. Based on this understanding of variation, statistical approaches such as blocking and the inclusion of covariates should be used to reduce uncertainty in the estimates of key quantities of interest.

All studies should incorporate an element of randomisation, to make the design robust against unidentified and unanticipated sources of bias. For practical reasons, blinding of the study is not always feasible, but it is preferable, where possible, to remove what can be an appreciable source of bias. In general, balanced designs are preferable, giving statistical robustness and better estimation of treatment effects.

Decisions about the design must be driven by the overall objectives of the study, a well-defined hypothesis about the biological parameters of interest and the planned statistical analysis. The protocol for the statistical analysis should be specified in advance and not changed over the course of the study in the light of initial results (although it may be changed in the event of poor model fit). The importance of this ‘joined-up’ approach is captured in the extensive activity currently underway in the area of ‘estimands’, promoting the use of a structured framework to ensure that the objectives of a clinical trial are identified and propagate into consistent study design, implementation and analysis.2

Such holistic thinking should help avoid fundamental misunderstandings. For instance, the standard statistical test for differences in the mean between groups assumes, as the null hypothesis, that the means are equal (ie, no treatment effect) and aims to reject this hypothesis based on the observed data. However, a failure to find sufficient evidence to reject the null hypothesis is not evidence that the means are equal, as has been mistakenly concluded in some studies. Rather, if the scientific objective is to establish that the mean response is indistinguishable between groups, a test for equivalence should be used – where the null hypothesis is that the groups are different. As always, the nature of the statistical hypothesis, and, hence, the design and analysis, will critically depend on the scientific objective.
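
Where the goal really is to demonstrate equivalence, the two one-sided tests (TOST) procedure is a standard way to frame the reversed null hypothesis. Below is a minimal sketch using statsmodels; the ±0.5 unit equivalence margin and the simulated data are illustrative assumptions, not recommendations.

```python
# A minimal sketch of a test for equivalence via TOST (two one-sided tests).
# The +/-0.5 unit margin and the simulated data are illustrative assumptions.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 1.0, size=30)  # simulated responses, group A
group_b = rng.normal(10.1, 1.0, size=30)  # simulated responses, group B

# Null hypothesis: the group means differ by more than the margin.
# A small P value therefore supports equivalence within (-0.5, +0.5).
p_value, lower_test, upper_test = ttost_ind(group_a, group_b, low=-0.5, upp=0.5)
print(f"TOST P value for equivalence: {p_value:.3f}")
```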

Power analysis

Power analysis informs and supports good experimental design by determining the number of replicates required to statistically detect a specific biological effect with a given level of confidence (usually, as a convention, set at 95 per cent). Note that retrospective power calculations are immaterial: where these are carried out on previous, statistically insignificant results, they add nothing to the analysis that has been carried out.

In general, there are four elements in any power calculation: the sample size (where ‘more’ will always give stronger results but be more costly), an estimate of the variability that will be seen in the data (where higher variability will always make it more difficult to identify a given effect), the size of the effect of interest (where the larger this is the more likely it is that the analysis will deliver strong results) and the power – the probability that the null hypothesis is correctly rejected (conventionally targeting a figure of 80 per cent).

The estimate of variability is most commonly thought of as the exogenous element – preferably derived from a carefully designed pilot study or via a literature review. Nevertheless, it may be subject to appreciable uncertainty. Depending on the context of the study, it can be useful to specify any two of the other elements to allow the researcher to look at the impact on the third. For example, if the sample size is capped at a specific value, and the researchers have a clear idea of the likely effect size, an estimate of the power can be derived. If this is substantially greater than 80 per cent then they might wish to reduce the sample size to comply with the imperative to reduce animal use. Conversely, if the power is very small then it would be unethical for an animal study to proceed. In this latter situation, researchers should look to optimise the choice of experimental design further.
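
As a concrete sketch of this ‘specify two elements, derive the third’ logic, the statsmodels power routines solve for whichever element is left unspecified. The standardised effect size of 0.8 and the cap of 20 animals per group below are illustrative assumptions.

```python
# A sketch of the 'specify two elements, derive the third' logic using
# statsmodels; the effect size of 0.8 and the cap of 20 animals per group
# are illustrative assumptions, not recommendations.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size capped at 20 per group, anticipated effect size known:
achieved_power = analysis.power(effect_size=0.8, nobs1=20, alpha=0.05)
print(f"Power with 20 animals per group: {achieved_power:.2f}")

# Alternatively, fix the power at 80 per cent and solve for the sample size:
n_required = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"Animals per group for 80% power: {n_required:.1f}")
```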

We find that specifying a biologically meaningful effect size is usually challenging for researchers, particularly for novel experiments. Many find it easier to specify the effect size not in absolute terms but as a percentage change relative to a baseline. For example, we might seek to specify the percentage mean reduction in a vaccinated group relative to the control group, where information about the latter is available from previous trials. Alternatively, the sample size and desired power can be specified, allowing calculation of what can be thought of as the minimum effect size that can be detected with at least the specified power. This latter quantity can be a helpful benchmark in assessing the validity of proposed experiments.
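
The minimum detectable effect size mentioned above can be computed in the same framework by leaving the effect size unspecified; the figures used here are again assumptions for illustration.

```python
# Sketch: the minimum standardised effect size detectable with 80 per cent
# power when the design is fixed at 25 animals per group (assumed figures).
from statsmodels.stats.power import TTestIndPower

min_effect = TTestIndPower().solve_power(effect_size=None, nobs1=25,
                                         alpha=0.05, power=0.8)
print(f"Minimum detectable effect size: {min_effect:.2f}")
```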

Given the approximate and/or contingent nature of much of the information fed into power calculations, it is not helpful to interpret any single set of results as specifying how the researchers should carry out their study. Rather, the power calculation defines a useful framework to make informed decisions, based on clear assumptions, quantifying the probability of success of any particular experiment, providing warning of proposed experiments that are either grossly over- or underpowered and, hence, supporting better use of available resources. It also forces consideration of the nature of the statistical hypotheses, which can be invaluable.

It is important to understand that power calculations are determined by the requisite parameter values, are only valid for the biological quantity under consideration in its specific experimental setting and also depend on the specific statistical analysis planned. Only in some circumstances will a previous power calculation be relevant to a new study. When a study is seeking to explore multiple objectives, separate power calculations should be conducted – potentially predicated on different statistical analyses – and amalgamated to come to a consensus for the entire study.

Computer routines for power analysis are widely available for the most basic statistical tests (eg, t test, chi-square test and binomial test for proportions), and simple ANOVA and linear models. For more complicated models, power analysis may only be feasible via stochastic simulation (ie, using a computer program to generate multiple realisations of the experiment given specific parameters, analysing each of these pseudo-datasets and collating the results so that power is empirically estimated).
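
A minimal sketch of the simulation approach just described, assuming a planned two-group comparison analysed with a t test; all parameter values are illustrative assumptions chosen for the example.

```python
# A sketch of power estimation by stochastic simulation: generate many
# realisations of the experiment, analyse each pseudo-dataset with the
# planned test and record how often the null hypothesis is rejected.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group, alpha = 5000, 15, 0.05
control_mean, treated_mean, sd = 10.0, 8.5, 2.0

rejections = 0
for _ in range(n_sims):
    control = rng.normal(control_mean, sd, n_per_group)
    treated = rng.normal(treated_mean, sd, n_per_group)
    _, p = stats.ttest_ind(control, treated)  # the planned analysis
    rejections += p < alpha

print(f"Empirically estimated power: {rejections / n_sims:.2f}")
```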

Our experience suggests that the most positive benefits are likely not to arise from power calculations per se, but from identifying and revising the design of suboptimal experiments. Any investment in improving the design of an experiment will pay off statistically, ethically and economically. Well-designed, adequately powered studies that are analysed using appropriate statistical methods can improve the quality of the evidence base in veterinary medicine and reduce wastage of both animals and money.

Statistical analysis

Statistical science is a mature discipline that has given rise to a wealth of statistical tests, models and methodological frameworks. When identifying a suitable statistical approach for use in analysing an experimental dataset, it is important to distinguish between numerical and categorical data and between different types of numerical data. Careful consideration of the objectives of the analysis and the nature of the data should greatly restrict the range of relevant statistical methods.

Unfortunately, it is not uncommon to see analyses where an inappropriate statistical method is ‘press-ganged’ into use, possibly because the researcher has previous experience in using it. Needless to say, this is not an appropriate criterion for selecting a method. However, it is true that the choice of methodology will inevitably be influenced by the analyst’s background and conception of the modelling problem. The variability in analytical methods arising from these decisions can have appreciable effects, as demonstrated in a recent study.3

Assessing assumptions

Having identified a plausible methodology, it is important to understand that all statistical methods are subject to some set of technical assumptions – even those described as ‘non-parametric’, which should not be misconstrued as meaning ‘free from any assumptions’. It should be a key part of any statistical analysis to confirm that these assumptions are plausibly being met. Deviations from these assumptions rarely cause computational problems – hence, no warning will be provided by the algorithm – but they will lead to unreliable results that may not reflect the actual biology and may over- or understate the strength of statistical evidence derived from the data. It is, therefore, imperative that users understand what these assumptions are and how to evaluate them.

A good example is the fitting of a generalised linear model without adjusting for overdispersion in the data. Such a model is demonstrably poorly fitted. It is easy to test whether overdispersion is present and to fit a better model that accounts for it, thus avoiding serious consequences such as spuriously identifying effects as statistically significant. Nevertheless, it remains a very common mistake.
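
As a sketch of this check, assuming count data and a Poisson generalised linear model, the Pearson chi-squared statistic can be compared with the residual degrees of freedom and, where the ratio is well above one, the model refitted with an estimated scale (quasi-Poisson-style inference). The data, variable names and the threshold of 1.5 below are all illustrative assumptions.

```python
# Sketch: checking a Poisson GLM for overdispersion and refitting with an
# estimated scale when it is found. The data, variable names and the
# threshold of 1.5 are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
treatment = np.repeat([0, 1], 40)                # control vs treated
counts = rng.negative_binomial(2, 0.2, size=80)  # deliberately overdispersed
X = sm.add_constant(treatment)

poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"Dispersion estimate: {dispersion:.2f} (values well above 1 signal trouble)")

if dispersion > 1.5:
    # Quasi-Poisson-style inference: standard errors inflated via the
    # Pearson-based scale estimate rather than assuming a scale of 1.
    adjusted_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit(scale='X2')
    print(adjusted_fit.summary())
```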

Some assumptions cannot readily be assessed. In practice, these high-level assumptions (eg, about the absence of possible confounding variables or the presence of measurement errors) are very rarely discussed, although such a discussion might sometimes form an important part of an analysis.

Basic assumptions of many tests and models include independence, distributional properties and homogeneity of variances. After fitting a model, the residuals provide rich information about potential problems in these respects. Lack of independence between observations will show up as patterns in residual plots, and this assessment can be formalised using a test such as the Bartels rank test. Statistical methods for measurements made on a continuous scale commonly assume normally distributed data, and graphical representations, like a simple histogram or a quantile-quantile plot, and formal statistical tests (eg, the Kolmogorov-Smirnov or Shapiro-Wilk tests) are helpful in assessing this. A lack of homogeneity in variances is also easy to spot graphically, and specific tests are available to assess this (eg, the Bartlett or Levene tests).
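
The checks named above are widely implemented; here is a minimal sketch with scipy, run on simulated raw data purely for illustration (in practice they would be applied to the model residuals).

```python
# Sketch of the named checks with scipy, on simulated data for illustration;
# in practice these are applied to the residuals of a fitted model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(5.0, 1.0, size=30)
group_b = rng.normal(6.0, 2.5, size=30)   # deliberately unequal variance

w_a, p_a = stats.shapiro(group_a)         # normality within each group
w_b, p_b = stats.shapiro(group_b)
print(f"Shapiro-Wilk P values: {p_a:.3f}, {p_b:.3f}")

stat, p_var = stats.levene(group_a, group_b)  # homogeneity of variances
print(f"Levene P value: {p_var:.3f}")
```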

Many models have been shown to be robust against moderate deviations from their distributional assumptions. A lack of independence or variance heterogeneity in data will typically have more serious consequences for the statistical outcomes. The benefits of carrying out an exploratory data analysis to help understand and visualise the dataset before conducting any formal analysis should not be underestimated.

Data transformations can be useful in dealing with some deviations from model assumptions, particularly in relation to normality and homogeneity of variances in measurement data. Popular options include log transformation to address data asymmetry, square-root transformation to stabilise the variance and standardisation to remove the influence of heterogeneous scales and units of measurement and facilitate comparisons between multiple variables.
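
For reference, the three transformations just listed, sketched with numpy on an assumed positive-valued measurement vector y:

```python
# The three transformations listed above, on an assumed positive-valued
# measurement vector y.
import numpy as np

y = np.array([1.2, 3.5, 7.9, 20.4, 55.0])
log_y = np.log(y)                        # reduce right-skew/asymmetry
sqrt_y = np.sqrt(y)                      # stabilise variance
std_y = (y - y.mean()) / y.std(ddof=1)   # common scale across variables
```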

Nevertheless, although data transformations can be useful in practice, they should not be used to mitigate the effect of making a poor choice of statistical model or to mask unwanted features of the data. For instance, when modelling data consisting of counts, use of a generalised linear model is usually preferable to transforming the data onto a continuous scale before using a standard linear regression model. Similarly, when outlier observations are present, rather than just relying on transformations or omitting them, it is advisable to investigate whether they might result from a different underlying process or from a technical issue. In general, to downplay the effect of outliers, methods based on robust statistics are preferable (eg, MM-estimation for robust regression analysis).
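
A full MM-estimator is available in, for example, R's robustbase::lmrob; as a sketch of the same idea, here is the closely related M-estimation via statsmodels' RLM with a Tukey biweight loss, on simulated data with a few injected outliers.

```python
# Sketch of robust regression on simulated data with injected outliers.
# Note: RLM implements M-estimation, a close relative of the MM-estimation
# named above (available in full in, eg, R's robustbase::lmrob).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, size=50)
y[-3:] += 8.0                          # inject a few gross outliers
X = sm.add_constant(x)

robust_fit = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()
print(robust_fit.params)               # estimates largely unaffected by outliers
```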

Dealing with non-negligible complexities in the data that have an impact on the validity of model assumptions will generally require increased sophistication in the statistical analysis. For example, in the common situations of observing data where animals come from the same litter or are housed in pens, or where data are collected from the same animals over time, mixed models (ie, multilevel models) allow the inclusion of random effects. These can be used to account for variability at different levels (eg, variability seen in piglets within litters, within pens) and autocorrelation in repeated observations over time, providing improved estimates of variability.
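
A minimal sketch of such a random-effects analysis, assuming piglets nested within litters; the column names, effect sizes and simulated data are all hypothetical.

```python
# A minimal sketch of a random-intercept mixed model for piglets nested
# within litters; the column names, effect sizes and data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
litter = np.repeat(np.arange(8), 6)                  # 8 litters, 6 piglets each
litter_effect = rng.normal(0, 1.0, size=8)[litter]   # shared within-litter variation
treatment = np.tile([0, 1], 24)
response = 10 + 1.5 * treatment + litter_effect + rng.normal(0, 0.8, size=48)
df = pd.DataFrame({"response": response, "treatment": treatment, "litter": litter})

# The random intercept for litter accounts for piglets sharing a litter
fit = smf.mixedlm("response ~ treatment", df, groups=df["litter"]).fit()
print(fit.summary())
```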

Moreover, multiple observations of different attributes are commonly collected from the same experimental units. In many instances, it makes most sense to examine these simultaneously, using multivariate data analysis methods, in order to fully understand the structure and key features of the data overall, instead of looking at each measurement in isolation. To maximise clarity, the data analysis should be as simple as possible to meet the scientific objectives, but not so simplistic that important features of the data are ignored.

P values

P values are commonly used to summarise statistical evidence and, hence, support scientific conclusions. However, an unfortunate corollary of this central role is that there are instances of misuse, misinterpretation and ‘P hacking’ – a conscious or unconscious bias towards finding small P values that distorts scientific research and that is contributing to the crisis in reproducibility.

Variability and uncertainty are integral to scientific experimentation, and the essence of statistical analysis is to account for and quantify these aspects of an experiment to better support understanding of the biology. A P value in isolation is a limited summary and, thus, emphatic scientific statements based solely on whether it exceeds an essentially arbitrary threshold are not advisable. Reporting a P value cannot replace close scrutiny of results and effective scientific reasoning. Recent work in the statistical community has sought to raise awareness of the abuse of P values in scientific research, providing principles for their proper use and interpretation.4

One well-understood way in which the validity of statistical results can be undermined is through biased or multiple testing of hypotheses. If hypotheses are selectively tested or reported (because that aspect of the data looks interesting) then this ‘cherry-picking’ of results will lead to the reporting of P values that are spuriously low. This is one of the fundamental rationales for defining an analysis protocol before the data are collected, and only deviating from it where it is demonstrably leading to a poorly fitting model. Multiple testing of many hypotheses is problematic because the more tests that are carried out, the more likely it is that some of them will, by chance alone, give rise to small P values, even if the null hypothesis is always true. Hence, where multiple testing occurs, the P values reported will overstate the level of statistical significance.

Various approaches have been developed to adjust P values. For example, Bonferroni or Tukey corrections are popular options, although they can become too conservative as the number of comparisons increases. Alternative approaches based on controlling for the expected proportion of incorrectly rejected null hypotheses (ie, false discovery rate) have proved effective and less stringent when screening high dimensional datasets for signals of interest. We would recommend that researchers stay vigilant to the risks involved in multiple testing, adjusting P values where appropriate but focusing on interpreting the point estimates and confidence intervals for key parameters from their statistical models since these are likely to be more valuable in leading to enhanced scientific understanding.
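
As a sketch of the adjustments mentioned above, statsmodels implements both the classical corrections and the Benjamini-Hochberg false discovery rate procedure; the raw P values below are invented for illustration.

```python
# Sketch: adjusting a set of raw P values with Bonferroni and with the
# Benjamini-Hochberg false discovery rate procedure; the values are invented.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.021, 0.041, 0.095, 0.240]

reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print("Bonferroni-adjusted:", p_bonf.round(3))
print("BH (FDR)-adjusted:  ", p_fdr.round(3))
```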

Reporting statistical analysis

Objective, hypothesis, design and data analysis all interact and must be mutually consistent. These should be clearly stated and detailed in a dedicated section of the scientific output that describes: the purpose of each analysis conducted; the variables involved and any manipulations performed on them; methodological assumptions and their validity; criteria to assess the biological relevance of the results; a list of any software packages used; and, where relevant, details of any specialised or customised routines used. Any ad hoc analyses conducted in addition to those originally specified should be identified as such and, in general, interpreted on an exploratory or indicative basis.

Numerical results are best summarised in succinct tables recording group sizes, key estimates and associated measures of uncertainty. Graphs are useful to represent the observed data and, when appropriate, to visually support the conclusions from the modelling. Statistical analysis should always report the range of uncertainty in the results or some measure of variability, preferably by reporting the confidence intervals around estimated effects and/or numerical significance levels. Scientific statements should clearly refer to the statistical results on which they are based.

Conclusions

As scientific research becomes increasingly data-driven, and large, complex datasets proliferate, rigorous experimental design, statistical modelling and interpretation are indispensable to successful scientific discovery and the maintenance of a positive perception of the credibility and value of scientific work. To mitigate the risks highlighted in this commentary and drive good practice, formal expert review and validation of the quantitative elements of a study before and after it is conducted would be helpful.

Given the complex methodological landscape and the substantial work that is involved in delivering a valid statistical analysis, there are substantial benefits in having specialist quantitative expertise available within a project. Trained and experienced statisticians, as professional modellers and interpreters of data, can bring a wide view of potential approaches to the design of studies and the analysis of data; promote efficiency in execution; implement practices to facilitate replicability and reproducibility; and promote consistency in reporting methods and results, while providing independent and objective translation of statistical results into biological insights. In general, they can help to reduce the incidence of the common statistical problems found in scientific studies.

However, it is also enormously valuable to seek to continuously improve statistical sophistication among veterinary and biological scientists through training. Likewise, improving biological sophistication among applied statisticians will lead to better outcomes. It is also the case that increased quantitative sophistication needs to be accompanied by an equivalent open-mindedness among journals and editors toward approaches that might be novel or less familiar.

When all is taken into account, the key message has to be that the more reliant on dependable statistical results a scientific study is, and the more complex the dataset and statistical methods used, the more important it is that all involved adhere to the elements of good practice described in this commentary.

Javier Palarea-Albaladejo, Iain McKendrick
Biomathematics and Statistics Scotland, Edinburgh, UK
email: [email protected]

doi: 10.1136/vr.m117

References
1 National Centre for the Replacement, Refinement and Reduction of Animals in Research. Experimental design. www.nc3rs.org.uk/experimental-design (accessed 6 January 2020)
2 International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. Estimands and sensitivity analysis in clinical trials. https://database.ich.org/sites/default/files/E9-R1_EWG_Draft_Guideline.pdf (accessed 6 January 2020)
3 Silberzahn R, Uhlmann EL, Martin DP, et al. Many analysts, one data set: making transparent how variations in analytic choices affect results. Adv Meth Pract Psychol Sci 2018;1:337–56
4 Statistical inference in the 21st century: a world beyond P<0.05. Am Stat 2019;73(Supp 1)
