

Cyber Security & Privacy Experiments: A Design & Reporting Toolkit

Kovila P.L. Coopamootoo and Thomas Groß

Newcastle University, Newcastle upon Tyne, United Kingdom

{kovila.coopamootoo|thomas.gross}@newcastle.ac.uk

Abstract. With cyber security increasingly flourishing into a scientific discipline, there have been a number of proposals to advance evidence-based research, ranging from introductions of evidence-based methodology [8], proposals to make experiments dependable [30], and guidance for experiment design [38,8], to overviews of pitfalls to avoid when writing about experiments [42]. However, one is still given to wonder: what are the best practices in reporting research that act as tell-tale signs of reliable research? We aim at developing a set of indicators for complete reporting that can drive the quality of experimental research as well as support the reviewing process. As method, we review literature on key ingredients for sound experiments and on fallacies and shortcomings studied in other fields. We draw on lessons learned and infuse them into indicators. We provide a definition, reporting examples, importance and impact, and guiding steps to be taken for each indicator. As results, we offer a toolkit with nine systematic indicators for designing and reporting experiments. We report on lessons and challenges from an initial sharing of this toolkit with the community. The toolkit is a valuable companion for researchers. It incites the consideration of scientific foundations at the experiment design and reporting phases. It also supports program committees and reviewers in quality decisions, thereby impacting the state of our field.

1 Introduction

Cyber security and privacy are both exciting fields that weave together methodologies, theories and perspectives from various disciplines: mathematics, engineering, law, psychology and social sciences. As a consequence, it gains a collective tapestry, a definite strength that exemplifies inter-disciplinary fields. However, the sharing of expertise and drawing on the best practices of each discipline is a challenge. For example, the research area of human factors of cyber security and privacy is an inter-disciplinary research area. It clearly benefits from the systematic design and reporting standards characteristic of the rigorous methodology of experimental psychology at its best.

Without guidelines, we rely on researchers to assess the quality of their designs and reporting. We ask program committees and reviewers to make decisions on submissions to the best of their judgment. At the same time, these submissions impact the future of the field and may sow uncertainty in the research community and among policy makers alike, especially when they intend to transfer research findings into practice.

Workshop at IFIP Privacy & Identity Management Summerschool 2017. Our workshop on Evidence-Based Methods was intended as a first evaluation of a set of indicators we originally developed for a systematic literature review in the Research Institute in Science of Security (RISCS). We gave a presentation of each of the indicators and their specifications and offered participants a codebook and a marking sheet [9] as well as publications reporting experimental privacy studies.

Contribution. This paper aims to offer support for experimental cyber security and privacy research as a scientific discipline. It provides nine clear guidelines to support the design and reporting of experiments and discusses challenges to dissemination from a first encounter with the community.

Outline. In the rest of the paper, we discuss our choice of the set of indicators before detailing each of the nine completeness indicators. For each indicator, we proceed with theoretical background, benefits of fulfilling the indicator, the outcome of not achieving it, and practical steps to take in design and reporting, together with best practice examples. We then provide the lessons learnt from a first connection with the community before providing the discussion and conclusion.

2 Choice of Completeness Indicators

We chose indicators that contribute and build up towards sound statistical inference. As a consequence, we addressed reproducibility, internal validity, correct statistical reporting and parameter estimation. We deliberately excluded criteria on external validity and ethics, but may consider them in future versions of the toolkit.

Benefits of this first toolkit. The indicators are designed as a toolkit mainly for researchers, providing both a theoretical and a practical component. First, it acts as support for the design phase of a user study or experiment and aims to be a one-stop resource. We provide a theoretical background with each indicator, substantiated with reasoning in the form of the benefits of having the indicators and the outcome of not catering for them. Second, it acts as a companion for reporting via the practical steps to take, typical locations in articles and examples of good practice. Further, an additional benefit is a clear list that enables program committees to evaluate the reporting of research studies.

3 CI1: Upstream Replication

3.1 Theoretical Background

Similar to Coopamootoo & Groß [8], we call replication the attempt to recreate the conditions sufficient to obtaining a previously observed finding, a definition adapted from the Open Science Collaboration [7]. We refer to upstream replication when a study replicates existing studies or previously validated methods/instruments.

We note that replication of studies is an important research practice that provides confidence in the findings. Cumming & Calin-Jageman [12] point out that rarely, if ever, can a single finding give a definitive answer to a research question, while the Open Science Collaboration notes the alarming discovery that a number of widely known and accepted research findings cannot be replicated [12].

We ask: ‘Is the study replicating existing studies or methods?’

Benefits of fulfilling CI1. Researchers engaging in upstream replication gain sound foundations for their studies by employing methods whose exact properties are known and well-tested. For instance, for a measurement instrument we expect it to be known which parameters of the population are measured. We expect of the instrument itself internal validity, repeatability and reproducibility. In the logic of the statistical inference of the given experiment, we are then entitled to assume that the properties of the instrument are a given and will be the same for other researchers in the future. As a completeness indicator, CI1 thereby yields evidence whether the foundations of the given reported study are sound.

Outcome for not fulfilling CI1. Should evidence towards CI1 be missing, we would need to assume that the study did not pay attention to its sound foundations. This, in turn, means that the study is on uncertain footing. For the manipulation instruments, it is not assured that they cause the intended change in the participants reliably. For the measurement instruments, it is not assured that they measure the intended property. Consequently, instruments without evidence of sound a priori validation yield sources of errors that can well confound the main experiment and thereby put the overall inference in question.

3.2 Steps to Take

How to achieve CI1. The key principle towards gaining evidence for CI1 is the use of validated tools. We recommend selecting manipulation and measurement instruments that come with strong evidence of their validation and properties.

In the experiment design and execution, researchers will employ instruments exactly as validated, for example, using the defined scale and scoring sheet as provided. If adaptations are made to the instrument instructions, these will be documented.

In the reporting of the study, researchers will then document the evidence for their exact replication, for instance, by including the exact materials used and by citing the validation study they rely upon.

We note that the documentation of instruments also applies to those employed for manipulation checks.

Typical location in articles. CI1 is reported in the methods section with subsections on measurement apparatus and manipulation apparatus or experimental conditions.

Reporting Example.

Example 1 (CI1 - Manipulation Apparatus, with amendments from Nwadike et al. [37]). We induce a happy and sad affect via video stimulus, a mood induction protocol recommended by Westermann’s critical review of different methods [47]. For happiness affect we used the restaurant scene from the movie When Harry meets Sally [clip length 155 seconds], while for sadness affect we used the dying scene from the movie The Champ [clip length 171 seconds]. We refer to Rottenberg et al. [40] to start and end the clips at the exact frames as previously validated.

Example 2 (CI1 - Manipulation Check, with amendments from [37]). We used the 60-item full PANAS-X questionnaire [46] as a manipulation check on the induced affect state. We focus on sadness and joviality as the equivalent of happiness. The PANAS-X scale is based on 5-point Likert items anchored on 1 - very slightly or not at all, 2 - a little, 3 - moderately, 4 - quite a bit, and 5 - extremely. We anchored the PANAS-X for affect “at the present moment.”

Example 3 (CI1 - Validated Measurement Apparatus). “The State-Trait Anxiety Inventory for Adults (STAI-AD) [43] is a 40-question self-report questionnaire. We use the temporary construct of state anxiety, that is, “how you feel right now.” It employs 4-point Likert items anchored on 1 – Not At All, 2 – Somewhat, 3 – Moderately So, and 4 – Very Much So.”
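As a practical complement to Example 2, the sketch below illustrates in R how a validated scale such as the PANAS-X is typically scored, namely by aggregating the Likert items that belong to a subscale. The column names and item counts are hypothetical placeholders; the actual item-to-subscale mapping must follow the PANAS-X manual [46].

# Minimal scoring sketch for a validated Likert scale (assumptions: hypothetical
# item columns jov_1..jov_8 and sad_1..sad_5; real item assignments follow the
# PANAS-X manual [46]).
responses <- data.frame(
  jov_1 = c(4, 2), jov_2 = c(5, 1), jov_3 = c(4, 2), jov_4 = c(3, 1),
  jov_5 = c(4, 2), jov_6 = c(5, 1), jov_7 = c(4, 2), jov_8 = c(3, 1),
  sad_1 = c(1, 4), sad_2 = c(1, 5), sad_3 = c(2, 4), sad_4 = c(1, 5), sad_5 = c(1, 4)
)

# Subscale scores are the sum of the corresponding 5-point items,
# computed exactly as prescribed by the validation study (CI1).
responses$joviality <- rowSums(responses[, grep("^jov_", names(responses))])
responses$sadness   <- rowSums(responses[, grep("^sad_", names(responses))])

print(responses[, c("joviality", "sadness")])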

3.3 Further Sources

Across sciences, a replication crisis has been observed. Prominently in psychology, a large-scale replication endeavor by the Open Science Collaboration [7] of N = 100 studies across 3 psychology journals found that only 47% of the original effect sizes were in the 95% confidence interval of the replication effect size.

The Open Science Collaboration makes a case that research claims gain credibility when the supporting evidence undergoes sound replication [7]. We note that the replication needs to be done deliberately to increase the overall Positive Predictive Value of the results [25,34].

In the security literature, Maxion [30] postulates that repeatability, reproducibility and validity are the main criteria differentiating a well-designed experiment from those that are not.

4 CI2: Reproducibility

4.1 Theoretical Background

CI2 considers the enablement of downstream replication. While downstream replication includes repeatability, that is, whether a study can be replicated by the same researchers, CI2 considers especially whether the study is sufficiently reported to be reproducible by other researchers. We refer to Maxion [30] for further discussion on repeatability and reproducibility.

CI2 establishes whether the reporting supports reproducibility, defined as the closeness of results obtained on the same test material under “changes of [. . . ] conditions, technicians, apparatus, laboratories and so on” [13]. A key requirement of replicating existing studies is the availability of clear documentation, which ideally would entail a detailed step-by-step experimental protocol that makes provisions for reproducibility.

The principle for reproducibility is diligent documentation of all variables of the study’s lifecycle. We ask ‘Is there correct reporting of manipulation apparatus, measurement apparatus, detailed procedure, sample size, demographics, sampling and recruitment method, contributing towards reproducibility?’

Benefits of fulfilling CI2. Offering sound reporting for reproducibility allows for downstream replication and contributes to the enablement of research synthesis in a field. This is crucial to enable falsification and hence empirical progress. Having a reproducible study at hand means that other researchers can test the theories evaluated in the given study and establish independent evidence on the theories, possibly falsifying the earlier result. Furthermore, replication studies inform the overall positive predictive value for the considered relations and allow for a meta-analysis on the effect sizes and their confidence intervals.

Hence, as a completeness indicator, CI2 evaluates whether the theories named in the given study can be empirically scrutinized in subsequent experimentation from the given reporting, and thereby whether the given study makes a sound contribution to the empirical sciences.

Outcome for not fulfilling CI2. Should the evaluation for CI2 not offer evidence towards reproducibility, we need to assume that the given study cannot be replicated downstream. First, the lack of reproducibility leaves other researchers with great ambiguity about what was actually done. Second, following Popper’s discussion on falsifiability [39], a study that cannot be reproduced does not actually yield strong empirical evidence because other researchers cannot execute the offered experiment to falsify the reported theory, which in turn casts doubt on the study advancing empirical knowledge.

4.2 Steps to Take

How to achieve CI2. Researchers will provide a detailed description of the experiment design, including all choices made, possibly supplemented by an experiment diagram, as well as the procedure executed in the experiment itself.

We note that documentation towards reproducibility will often also include planned analyses, which we consider under other CIs. A recommended practice in this case is to pre-commit the experiment and analysis plan at organizations such as the Open Science Framework (https://osf.io) or AsPredicted (https://aspredicted.org). As an example, a committed analysis plan and analysis report [22] were published for password research [17].

Typical location in articles. CI2 covers the whole method section, including a detailed procedure, sample recruitment and demographics, and manipulation and measurement instruments. Planned analyses will be in the analysis or the results section.

Reporting Example.

Example 4 (CI2 - Demographics). We refer to Table 3 of Kluever and Zanibbi [27] for a detailed demographics report that is relevant to the context of the study reported.

Example 5 (CI2 - Measurement Apparatus precisely referencing sources). “We administered the NASA Task Load Index in an online form. The form exactly replicated the full NASA TLX questionnaire as specified in NASA Task Load Index (TLX), v. 1.0, Appendix, p. 13 [24].”

Example 6 (CI2 - Procedure). “The procedure consisted of (i) pre-task questionnaires for demographics and personality traits, (ii) a manipulation to induce cognitive depletion, (iii) a manipulation check on the level of depletion, (iv) a password entry for a mock-up GMail registration, and (v) a debriefing and memorability check one week after the task with a GMail login mockup.” This was followed by details of each section.

4.3 Further Sources

First, for reproducibility of the experiment design, which is what this CI mainly focuses on, we refer to experiment design methodology [16,31,33].

Second, for reproducibility of the planned analyses, which involves the documentation of the plan as well as the recording of all the analyses done, we suggest inspiration from reproducibility principles from general computing science research [41] or more specific sources with a focus on computation-supported scientific practice [45]. To render all computations, statistical analyses and graphs reproducible, we suggest the R framework knitr [48], as demonstrated within the analysis report [22].

5 CI3: Internal Validity

5.1 Theoretical Background

CI3 addresses the internal validity of the experiment, which refers to the truth that can be ascribed to cause-effect relationships between independent variables (IV) and dependent variables (DV) [3], where the IV is a variable that is induced/manipulated and the DV is the variable that is observed/measured [32].

This CI asks for research questions and hypotheses that provide the foundations for null hypothesis significance testing (NHST) [36]. Operationalization enables systematic and explicit clarification of the predictors or independent variables, and hence the cause and manipulation, while the target variable or dependent variables clarify the effect, hence the measurements. Subject assignment points to whether and how participants were randomly assigned and balanced across experimental conditions.

Manipulation check refers to verification that the manipulation has actually taken effect, hence assuring systematic effects.

We ask ‘Is there an explicit and operational specification of the RQs, null and alternative hypotheses, IVs, DVs, subject assignment method and manipulation checks?’

Benefits of fulfilling CI3. CI3 ensures internal validity and a solid statement of intention for Null Hypothesis Significance Testing (NHST) [36].

Outcome for not fulfilling CI3. Should evidence for CI3 be missing, we would need to assume other possible explanations for the cause-effect relationship investigated, that is, the reported design could involve variables contributing unsystematic effects. This in turn would mean that other researchers could not rely on the results reported.

5.2 Steps to Take

How to achieve CI3. We propose, in the first instance, that following the step-by-step exercise on ‘An Exercise in Experiment Design’ we previously detailed [8] is beneficial for internal validity. In particular, this means developing research questions, defining testable hypotheses, and operationalizing hypotheses into IVs and DVs. For IVs, researchers will answer ‘What factor is being manipulated and influences the outcome?’ For DVs, ‘What is being measured?’ and ‘How can we measure the outcome of the manipulation reliably?’

Typical location in articles. The aims section can detail the research questions and hypotheses, whereas the method section includes sub-sections on operationalizing the variables into measures and experimental conditions. The method section will also include subject assignment information.

Reporting Example.

Example 7 (CI3 - Research Question, from Cherapau et al. [5]). “How availability of Touch ID sensor impacts users’ selection of unlocking authentication secrets?”

Example 8 (CI3 - Hypotheses, from Cherapau et al. [5]). For the null hypotheses H0: “Use of Touch ID has no effect on the entropy of passcodes used for iPhone locking.” or “Availability of Touch ID has no effect on ratio of users who lock their iPhones.” For the corresponding alternative hypotheses H1: “Use of Touch ID affects the entropy of passcodes used for iPhone locking.” or “Availability of Touch ID increases the ratio of users who lock their iPhones” [5].

Example 9 (CI3 - Subject Assignment, amended from Bursztein et al. [4]). “Our task scheduler presented the CAPTCHAs to Turkers in the following way . . . Random Order - fully random, where any captcha from any scheme could follow any other.”
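For between-subjects designs, balanced random assignment can itself be scripted and thereby made reproducible. The R sketch below is a minimal illustration under assumed condition names and an assumed sample size; it is not the assignment procedure of the cited studies.

# Minimal sketch of balanced random assignment (assumed: 60 participants,
# three hypothetical conditions; not the procedure of the cited studies).
set.seed(2017)                      # fixed seed for a reproducible assignment
n_per_condition <- 20
conditions <- c("control", "happy", "sad")

# Build a balanced vector of condition labels and shuffle it.
assignment <- sample(rep(conditions, each = n_per_condition))

participants <- data.frame(id = sprintf("P%02d", 1:(3 * n_per_condition)),
                           condition = assignment)
table(participants$condition)       # check: 20 participants per condition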

We also refer to Example 2 for manipulation checks.

6 CI4: Limitations

6.1 Theoretical Background

CI4 establishes what other factors could affect the cause and effect relationship under investigation and hence limit validity, including both internal and external validity. This CI is related to the requirement of controlled variables for experiment design, that is, the assurance that an observed change in the dependent variable is a result of a systematic change in the independent variable [32].

We ask ‘Was there a discussion on the limitations, possible confounders, biases and assumptions made?’

Benefits of fulfilling CI4. CI4 provides transparency of validity and assurance that other possible explanations for the stated causal relations have been considered. This in turn provides confidence in the reported results.

Outcome for not fulfilling CI4. Should the limitations not have been discussed in the experiment report, we would need to assume that the researchers might have failed to control variables that impact the internal validity of the experiment. This puts the reported effects into question.

6.2 Steps to Take

How to achieve CI4. Researchers are (1) to evaluate experimental designs for alternative explanations that could influence the observed effects, such as identifying confounding variables and controlling for them, (2) to make explicit the boundaries of the design, such as whether a convenience sample was used, and (3) to acknowledge the limits of the interpretations that can be inferred from the findings, such as whether the results are a correct reflection of estimates for the general population.

A discussion of the limits and boundaries of the study, identification of possible confounding variables whose presence affects the relationship under study, and possible assumptions made in the setup are all valuable inputs that strengthen the validity of the experiment.

Typical location in articles. While researchers may report and discuss limitations throughout the article, it is preferable to define a dedicated limitations section that shows clarity and researcher awareness of the limits of their design.

Reporting Example.

Example 10 (CI4 - Sampling bias, from Akhawe & Felt [1]). “The participants in our field study are not a random population sample. Our study only represents users who opt in to browser telemetry programs. This might present a bias. The users who volunteered might be more likely to click through dialogs and less concerned about privacy. Thus, the clickthrough rates we measure could be higher than population-wide rates.”

7 CI5: Reporting Standard

7.1 Theoretical Background

Statistical reporting guidelines help the reader, reviewer and policy maker to gain confidence in the reported statistical analysis and results. As an example, we propose the reporting recommendations of the American Psychological Association (APA) [2] as a quality standard.

We ask ‘Was the result reported in the APA style?’

Benefits of fulfilling CI5. Reporting standards provide a degree of comprehensiveness in the information that is reported for empirical investigations. Uniform reporting standards make it easier to generalize within and across fields and to understand the implications of individual studies, and they support research synthesis. Comprehensive reporting also supports decision makers in policy and practice in understanding how the research was conducted [2].

Outcome for not fulfilling CI5. Not fulfilling CI5 opens gaps and leads to questions about research quality, reuse and reproducibility.

7.2 Steps to Take

How to achieve CI5. Researchers are to closely adhere to statistical reporting standards such as the APA’s [2] and to report statistical inference as recommended, whether in paragraphs, tables or figures. This includes reporting actual p-values, that is, not only whether the p-value is less than α, as well as effect sizes and confidence intervals.

Typical location in articles. Reporting standards usually focus on the specification of the results section, yet can also indicate the format of a structured abstract or the structure of the overall publication.

Reporting Example.

Example 11 (CI5 - with amendments from Coopamootoo et al. [10]). We computed a one-way ANOVA. “There was a statistically significant effect of the experiment condition on the password strength score, F (2, 63) = 6.716, p = .002 < .05. We measure the effect size . . . η2 = .176, 95% CI [0.043, 0.296] [. . . ].”
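As a companion to Example 11, the following R sketch shows how the reported quantities (F, degrees of freedom, exact p-value and η²) can be obtained. The data frame pw with columns condition and score is a hypothetical stand-in for the study’s data; a confidence interval on η² additionally requires noncentral-F methods (e.g., dedicated effect-size packages) and is omitted here.

# Minimal one-way ANOVA sketch (assumed data: hypothetical data frame 'pw'
# with a three-level factor 'condition' and a numeric 'score').
set.seed(1)
pw <- data.frame(
  condition = factor(rep(c("control", "fatigue", "depletion"), each = 22)),
  score     = c(rnorm(22, 60, 15), rnorm(22, 48, 15), rnorm(22, 52, 15))
)

fit <- aov(score ~ condition, data = pw)
tab <- summary(fit)[[1]]            # ANOVA table: Df, Sum Sq, Mean Sq, F, p

# Quantities needed for an APA-style statement: F(df1, df2), exact p, eta squared.
F_value <- tab[1, "F value"]
p_value <- tab[1, "Pr(>F)"]
eta_sq  <- tab[1, "Sum Sq"] / sum(tab[, "Sum Sq"])
df1 <- tab[1, "Df"]; df2 <- tab[2, "Df"]

cat(sprintf("F(%d, %d) = %.3f, p = %.3f, eta^2 = %.3f\n",
            as.integer(df1), as.integer(df2), F_value, p_value, eta_sq))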

8 CI6: Test Statistic

8.1 Theoretical Background

The reporting of the test statistic offers a precise interface to the result of the computed statistical analysis. This data allows for a future analysis of a posteriori likelihoods, such as in a Positive Predictive Value (PPV) [25]. Simply put, this data helps other researchers to ascertain whether the result could be a false positive or not.

We consider the precise documentation of the outcome of the statistical test. For instance, for a t-test we would expect to learn the t-value as well as the degrees of freedom, along with the exact p-value computed for this t.

We ask ‘Did the result statement include test statistic and p-value?’

Benefits of fulfilling CI6. If the test statistic is fully specified, we gain important data for the subsequent analysis of the result. From the consistency of the reported test statistic and the p-value, we gain confidence in the correct reporting. In addition, the data includes sufficient redundancy that others can validate the presented p-values or use the reporting of the test statistic to compute standardized effect sizes for subsequent meta-analysis.

Outcome for not fulfilling CI6. Should the test statistic or the p-values not be reported, e.g., by just stating that the result “is statistically significant, p < .05,” we lose a lot of information. We could neither ascertain the confidence level of the significance nor the internal consistency of the reported test. Hence, the reported result will lack internal credibility and not be particularly trustworthy.

8.2 Steps to Take

How to achieve CI6. The key principle is to report sufficient data such that others can cross-check the reported values and use them in further research synthesis. Usually, this involves reporting the test statistic itself, the degrees of freedom vis-à-vis the sample size, and the exact p-value. When comparisons between conditions are made, the descriptive statistics for the relevant conditions should be provided (e.g., mean and standard deviation for the conditions of a t-test).
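The R sketch below illustrates this principle for an independent-samples t-test on hypothetical data: it collects the test statistic, degrees of freedom, exact p-value and per-condition descriptives in one place, ready to be reported.

# Minimal sketch: gather everything CI6 asks for from a t-test
# (hypothetical two-condition data; Welch correction left at the default).
set.seed(42)
control   <- rnorm(30, mean = 55, sd = 12)
treatment <- rnorm(30, mean = 48, sd = 12)

tt <- t.test(control, treatment)    # Welch two-sample t-test

report <- list(
  t       = unname(tt$statistic),   # test statistic
  df      = unname(tt$parameter),   # degrees of freedom
  p       = tt$p.value,             # exact p-value
  m_ctrl  = mean(control),   sd_ctrl  = sd(control),
  m_treat = mean(treatment), sd_treat = sd(treatment)
)

cat(sprintf("t(%.1f) = %.2f, p = %.3f; M_control = %.1f (SD = %.1f), M_treatment = %.1f (SD = %.1f)\n",
            report$df, report$t, report$p,
            report$m_ctrl, report$sd_ctrl, report$m_treat, report$sd_treat))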

Typical location in articles. The test statistics will be specified in the results section of the paper. As a rule of thumb, for each result we claim as being statistically significant, we will provide the test statistic supporting that claim as a suffix.

Reporting Example. We refer to Example 11 for test statistic and p-value reporting.

9 CI7: Assumptions

9.1 Theoretical Background

Statistical tests can easily lead us astray if their assumptions are not fulfilled: they may produce spurious results. Even though some tests have been shown to be somewhat robust against borderline violations of their underlying assumptions, the burden of proof that the assumptions were sufficiently fulfilled is on the researchers who conducted the test.

In general, the exact type of test in a family needs to be specified to inform which assumptions come to bear. For instance, the assumptions of an independent-samples t-test will be different from those of a dependent-samples t-test. Similarly, it needs to be specified whether the test is “one-tailed” or “two-tailed” to put the reported p-values into perspective.

To ascertain whether the statistical analyses were correctly employed on the data, statistical assumptions need to be made explicit in the reporting. For example, the assumptions for parametric tests, in general, are normally distributed data, homogeneity of variance, interval data and independence [15]. Parametric statistical tests often require a systematic treatment of outliers.

We ask ‘Were the significance level α and the test statistic properties and assumptions appropriately stated?’

Benefits of fulfilling CI7. A precise specification of the test used and explicit documentation of the assumptions checked give the reader confidence that the statistical tools were appropriately chosen and employed diligently.

Outcome for not fulfilling CI7. Should test properties and assumptions not be documented, we need to assume that the researchers did not establish that they could reliably employ the statistical test. Consequently, the reported test statistics and p-values could be off and should not be relied upon.

9.2 Steps to Take

How to achieve CI7. One would choose the designated significance level a priori and state it explicitly. Similarly, the researchers need to establish whether the test will be one- or two-tailed in advance. Researchers check whether the data meets the assumptions of the planned statistical test and explicitly report whether and how the data met the test assumptions. Decisions on how the data was treated (e.g., outlier management) need to be reported explicitly.
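A minimal R sketch of such assumption checks for a two-group comparison is given below, using Shapiro-Wilk for normality and Levene’s test for homogeneity of variance (the latter via the car package, which is assumed to be installed); the data frame is hypothetical.

# Minimal assumption-check sketch for a parametric two-group comparison
# (hypothetical data; Levene's test requires the 'car' package to be installed).
library(car)

set.seed(7)
d <- data.frame(
  group = factor(rep(c("A", "B"), each = 40)),
  score = c(rnorm(40, 50, 10), rnorm(40, 55, 10))
)

# Normality of the outcome within each group (Shapiro-Wilk).
by(d$score, d$group, shapiro.test)

# Homogeneity of variance across groups (Levene's test).
leveneTest(score ~ group, data = d)

# The chosen significance level should be stated a priori, e.g. alpha = .05,
# and any outlier treatment documented alongside these checks.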

We emphasize that complex statistical models (such as regressions) usually require comprehensive post-hoc model diagnostics to evaluate whether the model is sound.

Typical location in articles. The treatment of assumptions is documented in the results section, either close to the report of the statistical test or in a separate subsection. It will often support confidence in the reported results if a comprehensive analysis report is published alongside the research paper that documents all checks of assumptions and diagnostics, transformations of the data, and decisions made.

Reporting Example.

Example 12 (CI7 - Significance level α & test statistics properties, from Groß et al. [23]). “All inferential statistics are computed with two-tailed tests and at an α level of .05.”

Example 13 (CI7 - Test statistics assumptions, from Groß et al. [23]). “The distribution of the Passwordmeter password strength score is measured on interval level and is not significantly different from a normal distribution, Shapiro-Wilk, D(100) = .99, p = .652 > .05.” (a) “We computed Levene’s test for the homogeneity of variances. For the Passwordmeter scores, the variances were not significantly unequal.”

(a) We note here that numerical normality tests, such as Shapiro-Wilk, may have too little sensitivity for small sample sizes and too much sensitivity for large sample sizes [44].

10 CI8: Confidence Intervals on Effects

10.1 Theoretical Background

An effect size estimates the magnitude of an effect, an unknown parameter of the population, given the observed data of an experiment. Confidence interval procedures on the effect estimate the range of plausible values for the population parameter, if the experiment were repeated independently infinitely many times. We note that this is a frequentist view, in which the confidence level applies to the procedure. For instance, a series of 95% confidence intervals will tend to contain the population parameter in 95% of the intervals on average.
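The frequentist reading can be made concrete with a small simulation, a sketch under assumed parameters: drawing many samples from a known population and counting how often the computed 95% confidence interval covers the true mean.

# Minimal coverage simulation for the frequentist reading of a 95% CI
# (assumed population: normal with mean 50 and SD 10; 1,000 replications).
set.seed(123)
true_mean <- 50
covered <- replicate(1000, {
  x  <- rnorm(30, mean = true_mean, sd = 10)
  ci <- t.test(x)$conf.int              # 95% CI for the mean
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)                           # close to 0.95 in the long run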

Effect sizes and their confidence intervals offer an informative view of an experiment’s observed effect magnitudes. Consequently, the APA guidelines [2] state that “estimates of appropriate effect sizes and confidence intervals are the minimum expectations.” CI8 includes that the effect sizes are reported in an easily human-interpretable form.

We ask ‘Were appropriate effect sizes and confidence intervals (CI) reported?’

An effect that is statistically significant is not necessarily scientifically significant or important. To draw conclusions on an effect’s importance or practical implications, we consult the magnitude of the effect, its effect size [6].

In estimation theory, the effect size (ES) provides a point estimate of the effect in the population, while the confidence interval (CI) provides the interval estimate. While we endorse the use of estimation theory [12,19], we note that interpreting confidence intervals correctly requires diligence [35]. Notably, it is a fallacy to interpret a post-data X% confidence interval to have an X% probability of including the true population parameter.

Benefits of fulfilling CI8. CI8 evaluates the robust reporting of effect magnitudes through parameter and interval estimation, which yields, in turn, the foundation for future meta-analysis and research synthesis.

Outcome for not fulfilling CI8. Without an effect size estimate, we only have the significance of the results and the p-values to go on. However, we will miss out on information on the magnitude of the claimed effects. For example, a significant p-value does not say how important the observed effect is: it could well be trivial, and neither contribute much to research nor vouch for changes to practice.

10.2 Steps to Take

How to achieve CI8. Researchers are to compute effect sizes in experiments together with their confidence intervals and to report these in publications. The literature already provides a number of manuals and research articles on computing the different families of effect sizes [18,29]. We also refer to the New Statistics [12] for the estimation approach, effect sizes and confidence intervals.
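As a minimal R sketch, the following computes Cohen’s d for two hypothetical groups together with an approximate 95% confidence interval based on the common large-sample standard-error approximation; dedicated effect-size packages provide exact noncentral-t intervals.

# Minimal effect-size sketch: Cohen's d with an approximate 95% CI
# (hypothetical data; large-sample normal approximation for the CI).
set.seed(99)
g1 <- rnorm(40, mean = 52, sd = 10)
g2 <- rnorm(40, mean = 46, sd = 10)

n1 <- length(g1); n2 <- length(g2)
sd_pooled <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))
d <- (mean(g1) - mean(g2)) / sd_pooled

# Approximate standard error of d (Hedges & Olkin-style approximation).
se_d <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
ci_d <- d + c(-1, 1) * qnorm(0.975) * se_d

cat(sprintf("d = %.2f, 95%% CI [%.2f, %.2f]\n", d, ci_d[1], ci_d[2]))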

Typical location in articles. Effect sizes and their confidence intervals are documented in the results section, either stated as a suffix after the p-value of the corresponding statistical inference or provided in dedicated tables.

Reporting Example. We refer to Example 11 for effect size and confidence interval reporting.

10.3 Further Sources

Kirk [26] and Cumming [11] argued that the current research practice of exclusively focusing on a dichotomous reject-nonreject decision strategy in null hypothesis testing can impede scientific progress. Rather, they posit, the focus should be on the magnitude of effects, that is, the practical significance of effects and the steady accumulation of knowledge. They advise switching from the much-disputed NHST to effect sizes, estimation and the cumulation of evidence.

11 CI9: Statistical Inference

11.1 Theoretical Background

CI9 evaluates the overall correctness of the statistical inference, that is, how statements on statistical significance are expressed and what conclusions are drawn from the statement. As such, CI9 relies to some extent on observations made with respect to preceding completeness indicators.

We ask ‘Was the significance and hypothesis testing decision interpreted correctly and put in the context of effect size and sample size/power?’

Nickerson [36] offers a comprehensive overview of the controversies around Null Hypothesis Significance Testing (NHST), while Maxwell and Delaney [31, p. 48] and Goodman [21] point to p-value misconceptions, Morey et al. [35] analyze confidence interval fallacies, and Ioannidis [25] argues “why most published research findings are false.”

The evaluation in our work is founded on Nickerson’s review [36] of misconceptions around NHST, which include:

– p misperceived as the probability that the hypothesis be true and 1 − p misperceived as the probability that the alternative hypothesis be true,
– a small p considered as evidence that the results be replicable,
– a small value of p misinterpreted as a treatment effect of large magnitude,
– statistical significance considered as theoretical or practical significance,
– significance level α misinterpreted as the probability that a Type I error will be made,
– Type II error rate β considered to mean the probability that the null hypothesis be false,
– failing to reject the null hypothesis misrepresented as equivalent to demonstrating it to be true,
– failure to reject the null hypothesis misinterpreted as evidence of a failed experiment.

While Nickerson’s observations are concerned with the correct interpretation of NHST, for us CI9 also includes preparing the ground with population and sampling as well as a priori hypothesis specification, and post-hoc concerns such as multiple-comparison corrections.

Benefits of fulfilling CI9. Evidence towards CI9 convinces us of the robustness and diligence of the statistical inference made, because common pitfalls and fallacies have been avoided. The result statement will offer a sound starting point for the interpretation of the findings.

Outcome for not fulfilling CI9. Should there be evidence of incorrect statistical inference or the presence of fallacies, we would need to assume that the researchers’ interpretation of said results is tainted by the misinterpretations and misrepresentations made. Hence, the overall conclusion of the study would be put into question. We perceive reviews on misconceptions and fallacies as important guard rails [36,31,21,35].

11.2 Steps to Take

How to achieve CI9. To achieve CI9, we recommend investigating how p-values and confidence intervals can and cannot be interpreted. The key principle here is diligence: the devil is in the details.
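Since CI9 also asks that significance decisions be put in the context of sample size and power, an a priori power analysis is one concrete preparation step. The R sketch below, under an assumed target effect size, uses base R’s power.t.test to find the per-group sample size for a two-sided test.

# Minimal a priori power analysis sketch (assumed target: a medium effect
# of d = 0.5, alpha = .05 two-sided, 80% power; independent-samples t-test).
pwr <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
ceiling(pwr$n)    # participants required per group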

Typical location in articles. The correctness of the statistical inference is prepared by the documentation of the a priori elements of a study in the methods section, supported by the correct reporting of statistical tests in the results, and finally completed by the interpretation of the outcomes in the discussion.

Reporting Example.

Example 14 (CI9 - Type I error correction). “Given the number of comparative t-tests computed on the data set, we compute a multiple comparisons correction, where differences marked with a dagger † in Table 1 are statistically significant under Bonferroni-Holm correction for all comparisons made.”
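A minimal sketch of such a correction with base R’s p.adjust is shown below; the p-values are hypothetical placeholders for the comparative tests of Example 14.

# Minimal multiple-comparison correction sketch (hypothetical p-values
# from a family of comparative t-tests).
p_raw  <- c(0.001, 0.012, 0.030, 0.041, 0.200)
p_holm <- p.adjust(p_raw, method = "holm")   # Bonferroni-Holm correction

# Comparisons that remain significant at alpha = .05 after correction
# would be the ones marked (e.g., with a dagger) in the results table.
data.frame(p_raw, p_holm, significant = p_holm < 0.05)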

12 Lessons Learnt from the Workshop

12.1 Aim

To assess whether and how the set of nine indicators could be applied in practice.

12.2 Method

Procedure. We gave a short presentation of the hallmarks of experiment design (following our 2016 workshop at the same venue) and then presented the nine indicators as a set of ‘Quality Indicators’, where quality assessment is a stage employed within Systematic Literature Review procedures [14]. These nine indicators were developed as a checklist of factors to be evaluated within experimental studies, as part of a UK Research Institute in Science of Cyber Security (RISCS) funded project, which had the overall aim to evaluate the state of the art in evidence-based methods in cyber security and privacy.

Prior to the workshop, we developed a first version of a codebook, which specified each of the indicators in terms of sub-criteria and examples, and a codesheet providing a marking scheme.

Next, we facilitated open coding with the aim of extracting concepts from the free-form text. We provided participants with (1) two example research articles reporting user experiments in the context of privacy [20,28], (2) the CI specification as a codebook [9] and (3) the marking scheme as a codesheet [9].

Participants. Participants worked in two groups to review the two articles. N = 9 participants attended the workshop, 6 female and 3 male. The 7 participants who provided their age had a mean age of 31.86 years (SD = 8.28). 5 participants were from a usable privacy and security background, while the others were from other areas of privacy and security. Participants’ first language varied (4 German, 2 English and 1 Tamil; 2 did not answer). With the sole aim of gauging participants’ expertise, we offered participants three Likert questions to rate their frequency of use of evidence-based methods (from 1 – ‘Never’ to 5 – ‘A great deal’), their skills (from 1 – ‘Poor’ to 5 – ‘Excellent’) and their familiarity (from 1 – ‘Not at all familiar’ to 5 – ‘Extremely familiar’) in designing experiments. Participants reported using evidence-based methods such as experiments with a median value of 3, a median skill level of 2 and a median familiarity in designing experiments of 2.

12.3 Results

We provide results in the form of participant feedback and recommendations.

Practical Requirement. Participants recommended shaping the indicators as a toolkit that can readily be employed by the community. This involves designing clear sections in the tool set that researchers and committee members can pick up. As a result, following the workshop, we revised the indicators to match these requirements, as presented through sections CI1 to CI9. In this paper we provide the theoretical underpinnings for each CI together with ‘Steps to Take’ and ‘Examples’.

Design Requirement. Participants noted the time commitment required if one does not know what to look for when applying the toolkit in a reviewing exercise. To address this, we provide clear examples for each CI together with the typical sections in research papers that provide support for criteria fulfilling each CI.

Ethical Considerations. Participants suggested factoring in ethical considerations, an aspect of experimental reporting we omitted but whose benefits for completeness of reporting we foresee.

13 Discussion

Community progress. A toolkit such as the one we provide here contributes to a standard to aspire to. It supports the community in developing the skills to design, run and report rigorous experiments in cyber security. At the same time, while the lack of defined best practices requires individual researchers to determine the standards they adhere to, our toolkit offers a common ground.

It also supports the reviewing process and program committee decisions by offering syntactic criteria to check the completeness of scientific reporting. In addition, it contributes to a culture of well-designed and reported experiments that can serve as notable examples to follow in the field.

Added value for researchers. We believe this toolkit can be a valuable ingredient for inter-disciplinary security and privacy research. It combines theoretical background and practical guidelines to support foundations in experiment design and reporting. By following the requirements voiced by participants during the workshop, we provide clear sections that can be picked up by researchers and committee members. The toolkit supports both novice and experienced usable security and privacy researchers. While learning a methodology takes time for any novice, we believe that this toolkit may support learning the nitty-gritty of experimental methodology by being designed as a one-stop resource. For more advanced researchers, it presents itself as a checklist and offers good practices to follow.

Not exhaustive. We observe that our current toolkit is not exhaustive and foresee that it will grow as discussions advance within the community. In line with this, we plan to facilitate further discussion exercises within the community and to seek ways for engagement.

14 Conclusion

This paper provides a first toolkit for experimental research in cyber security and privacy with a sampler of theoretical foundations and practical guidelines. It can support a study’s lifecycle from conception, design, analysis and reporting to replication. It provides a companion for novice researchers as well as reviewers needing a structured checklist. Although the toolkit is certainly not exhaustive, it may still grow with discussions and evidence-based projects within the community. We believe that already in its current form, it can support a culture of robustly designed and reported experiments, thereby contributing to empirical research in the field.

15 Acknowledgment

We are indebted to the participants of the “Workshop on Evidence-Based Methods” at the 2017 IFIP Summerschool on Privacy and Identity Management for their generous feedback. This work was supported by the UK Research Institute in Science of Cyber Security (RISCS II) project “Scientific Methods in Cyber Security: Systematic Evaluation and Community Knowledge Base for Evidence-Based Methods in Cyber Security.” It was in parts funded by the ERC Starting Grant CASCAde (GA no. 716980).

References

1. D. Akhawe and A. P. Felt. Alice in warningland: A large-scale field study of browser security warning effectiveness. In USENIX Security Symposium, volume 13, 2013.
2. American Psychological Association (APA). Publication manual. American Psychological Association, 6th revised edition, 2009.
3. M. B. Brewer. Research design and issues of validity. Handbook of Research Methods in Social and Personality Psychology, pages 3–16, 2000.
4. E. Bursztein, S. Bethard, C. Fabry, J. C. Mitchell, and D. Jurafsky. How good are humans at solving CAPTCHAs? A large scale evaluation. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 399–413. IEEE, 2010.
5. I. Cherapau, I. Muslukhov, N. Asanka, and K. Beznosov. On the impact of Touch ID on iPhone passcodes. In SOUPS, pages 257–276, 2015.
6. J. Cohen. A power primer. Psychological Bulletin, 112(1):155, 1992.
7. Open Science Collaboration. Estimating the reproducibility of psychological science. Science, 349(6251):aac4716, 2015.
8. K. P. Coopamootoo and T. Groß. Evidence-based methods for privacy and identity management. In Privacy and Identity Management. Facing up to Next Steps, pages 105–121. Springer, 2016.
9. K. P. Coopamootoo and T. Groß. A codebook for experimental research: The nifty nine indicators v1.0. Technical Report 1514, Newcastle University, November 2017.
10. K. P. Coopamootoo and T. Groß. An empirical investigation of security fatigue - the case of password choice after solving a CAPTCHA. In The LASER Workshop: Learning from Authoritative Security Experiment Results (LASER 2017). USENIX Association, 2017.
11. G. Cumming. The new statistics: Why and how. Psychological Science, 25(1):7–29, 2014.
12. G. Cumming and R. Calin-Jageman. Introduction to the New Statistics: Estimation, Open Science, and Beyond. Routledge, 2016.
13. B. Everitt. Cambridge Dictionary of Statistics. Cambridge University Press, 1998.
14. Evidence-Based Software Engineering (EBSE). Guidelines for performing systematic literature reviews in software engineering. EBSE Technical Report EBSE-2007-01, Keele University and University of Durham, July 2007.
15. A. Field. Discovering Statistics Using IBM SPSS Statistics. Sage, 2013.
16. A. Field and G. Hole. How to Design and Report Experiments. Sage, 2003.
17. T. Fordyce, S. Green, and T. Groß. Investigation of the effect of fear and stress on password choice. In Proceedings of the 7th ACM Workshop on Socio-Technical Aspects in Security and Trust (STAST 2017), 2017.
18. C. O. Fritz, P. E. Morris, and J. J. Richler. Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1):2, 2012.
19. M. J. Gardner and D. G. Altman. Confidence intervals rather than p values: Estimation rather than hypothesis testing. Br Med J (Clin Res Ed), 292(6522):746–750, 1986.
20. J. Gideon, L. Cranor, S. Egelman, and A. Acquisti. Power strips, prophylactics, and privacy, oh my! In Proceedings of the Second Symposium on Usable Privacy and Security, pages 133–144. ACM, 2006.
21. S. Goodman. A dirty dozen: Twelve p-value misconceptions. In Seminars in Hematology, volume 45, pages 135–140. Elsevier, 2008.
22. T. Groß. Analysis report – investigation of the effect of fear and stress on password choice. OSF Report https://osf.io/3cd9h/, Open Science Framework, 2017.
23. T. Groß, K. Coopamootoo, and A. Al-Jabri. Effect of cognitive depletion on password choice. In The LASER Workshop: Learning from Authoritative Security Experiment Results (LASER 2016), pages 55–66. USENIX Association, 2016.
24. S. G. Hart and L. E. Staveland. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology, 52:139–183, 1988.
25. J. P. Ioannidis. Why most published research findings are false. PLoS Med, 2(8):e124, 2005.
26. R. E. Kirk. The importance of effect magnitude. Handbook of Research Methods in Experimental Psychology, pages 83–105, 2003.
27. K. A. Kluever and R. Zanibbi. Balancing usability and security in a video CAPTCHA. In Proceedings of the 5th Symposium on Usable Privacy and Security, page 14. ACM, 2009.
28. S. Korff and R. Böhme. Too much choice: End-user privacy decisions in the context of choice proliferation. In Symposium on Usable Privacy and Security (SOUPS), pages 69–87, 2014.
29. D. Lakens. Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 2013.
30. R. Maxion. Making experiments dependable. Dependable and Historic Computing, pages 344–357, 2011.
31. S. E. Maxwell and H. D. Delaney. Designing Experiments and Analyzing Data: A Model Comparison Perspective, volume 1. Psychology Press, 2nd edition, 2004.
32. S. Miller. Experimental Design and Statistics. Routledge, 2005.
33. D. C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, 8th edition, 2012.
34. R. Moonesinghe, M. J. Khoury, and A. C. J. Janssens. Most published research findings are false—but a little replication goes a long way. PLoS Med, 4(2):e28, 2007.
35. R. D. Morey, R. Hoekstra, J. N. Rouder, M. D. Lee, and E.-J. Wagenmakers. The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1):103–123, 2016.
36. R. S. Nickerson. Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2):241, 2000.
37. U. Nwadike, T. Groß, and K. P. Coopamootoo. Evaluating users’ affect states: Towards a study on privacy concerns. In Privacy and Identity Management. Facing up to Next Steps, pages 248–262. Springer, 2016.
38. S. Peisert and M. Bishop. How to design computer security experiments. In Fifth World Conference on Information Security Education, pages 141–148. Springer, 2007.
39. K. Popper. The Logic of Scientific Discovery. Routledge, 2005.
40. J. Rottenberg, R. Ray, and J. Gross. Emotion elicitation using films. In J. A. Coan and J. J. B. Allen, editors, Handbook of Emotion Elicitation and Assessment, 2007.
41. G. K. Sandve, A. Nekrutenko, J. Taylor, and E. Hovig. Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10):e1003285, 2013.
42. S. Schechter. Common pitfalls in writing about security and privacy human subjects experiments, and how to avoid them. Microsoft, January 2013.
43. C. D. Spielberger, R. L. Gorsuch, and R. E. Lushene. Manual for the State-Trait Anxiety Inventory. 1970.
44. Laerd Statistics. Testing for normality. https://statistics.laerd.com [Accessed 2018-01-20].
45. V. Stodden, F. Leisch, and R. D. Peng. Implementing Reproducible Research. CRC Press, 2014.
46. D. Watson and L. A. Clark. The PANAS-X: Manual for the Positive and Negative Affect Schedule - Expanded Form. 1999.
47. R. Westermann, G. Stahl, and F. Hesse. Relative effectiveness and validity of mood induction procedures: Analysis. European Journal of Social Psychology, 26:557–580, 1996.
48. Y. Xie. knitr: A comprehensive tool for reproducible research in R. Implementing Reproducible Research, 1:20, 2014.