
arXiv:2004.07213v2 [cs.CY] 20 Apr 2020

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims∗

Miles Brundage1†, Shahar Avin3,2†, Jasmine Wang4,29†‡, Haydn Belfield3,2†, Gretchen Krueger1†, Gillian Hadfield1,5,30, Heidy Khlaaf6, Jingying Yang7, Helen Toner8, Ruth Fong9, Tegan Maharaj4,28, Pang Wei Koh10, Sara Hooker11, Jade Leung12, Andrew Trask9, Emma Bluemke9, Jonathan Lebensold4,29, Cullen O’Keefe1, Mark Koren13, Théo Ryffel14, JB Rubinovitz15, Tamay Besiroglu16, Federica Carugati17, Jack Clark1, Peter Eckersley7, Sarah de Haas18, Maritza Johnson18, Ben Laurie18, Alex Ingerman18, Igor Krawczuk19, Amanda Askell1, Rosario Cammarota20, Andrew Lohn21, David Krueger4,27, Charlotte Stix22, Peter Henderson10, Logan Graham9, Carina Prunkl12, Bianca Martin1, Elizabeth Seger16, Noa Zilberman9, Seán Ó hÉigeartaigh2,3, Frens Kroeger23, Girish Sastry1, Rebecca Kagan8, Adrian Weller16,24, Brian Tse12,7, Elizabeth Barnes1, Allan Dafoe12,9, Paul Scharre25, Ariel Herbert-Voss1, Martijn Rasser25, Shagun Sodhani4,27, Carrick Flynn8, Thomas Krendl Gilbert26, Lisa Dyer7, Saif Khan8, Yoshua Bengio4,27, Markus Anderljung12

1OpenAI, 2Leverhulme Centre for the Future of Intelligence, 3Centre for the Study of Existential Risk, 4Mila, 5University of Toronto, 6Adelard, 7Partnership on AI, 8Center for Security and Emerging Technology, 9University of Oxford, 10Stanford University, 11Google Brain, 12Future of Humanity Institute, 13Stanford Centre for AI Safety, 14École Normale Supérieure (Paris), 15Remedy.AI, 16University of Cambridge, 17Center for Advanced Study in the Behavioral Sciences, 18Google Research, 19École Polytechnique Fédérale de Lausanne, 20Intel, 21RAND Corporation, 22Eindhoven University of Technology, 23Coventry University, 24Alan Turing Institute, 25Center for a New American Security, 26University of California, Berkeley, 27University of Montreal, 28Montreal Polytechnic, 29McGill University, 30Schwartz Reisman Institute for Technology and Society

    April 2020

∗Listed authors are those who contributed substantive ideas and/or work to this report. Contributions include writing, research, and/or review for one or more sections; some authors also contributed content via participation in an April 2019 workshop and/or via ongoing discussions. As such, with the exception of the primary/corresponding authors, inclusion as author does not imply endorsement of all aspects of the report.

†Miles Brundage ([email protected]), Shahar Avin ([email protected]), Jasmine Wang ([email protected]), Haydn Belfield ([email protected]), and Gretchen Krueger ([email protected]) contributed equally and are corresponding authors. Other authors are listed roughly in order of contribution.

    ‡Work conducted in part while at OpenAI.

    http://arxiv.org/abs/2004.07213v2

Contents

Executive Summary
    List of Recommendations

1 Introduction
    1.1 Motivation
    1.2 Institutional, Software, and Hardware Mechanisms
    1.3 Scope and Limitations
    1.4 Outline of the Report

2 Institutional Mechanisms and Recommendations
    2.1 Third Party Auditing
    2.2 Red Team Exercises
    2.3 Bias and Safety Bounties
    2.4 Sharing of AI Incidents

3 Software Mechanisms and Recommendations
    3.1 Audit Trails
    3.2 Interpretability
    3.3 Privacy-Preserving Machine Learning

4 Hardware Mechanisms and Recommendations
    4.1 Secure Hardware for Machine Learning
    4.2 High-Precision Compute Measurement
    4.3 Compute Support for Academia

5 Conclusion

Acknowledgements

References

Appendices
    I Workshop and Report Writing Process
    II Key Terms and Concepts
    III The Nature and Importance of Verifiable Claims
    IV AI, Verification, and Arms Control
    V Cooperation and Antitrust Laws
    VI Supplemental Mechanism Analysis
        A Formal Verification
        B Verifiable Data Policies in Distributed Computing Systems
        C Interpretability

Executive Summary

Recent progress in artificial intelligence (AI) has enabled a diverse array of applications across commercial, scientific, and creative domains. With this wave of applications has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development [1] [2] [3].

Steps have been taken by the AI community to acknowledge and address this insufficiency, including widespread adoption of ethics principles by researchers and technology companies. However, ethics principles are non-binding, and their translation to actions is often not obvious. Furthermore, those outside a given organization are often ill-equipped to assess whether an AI developer’s actions are consistent with their stated principles. Nor are they able to hold developers to account when principles and behavior diverge, fueling accusations of "ethics washing" [4]. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, there is a need to move beyond principles to a focus on mechanisms for demonstrating responsible behavior [5]. Making and assessing verifiable claims, to which developers can be held accountable, is one crucial step in this direction.

With the ability to make precise claims for which evidence can be brought to bear, AI developers can more readily demonstrate responsible behavior to regulators, the public, and one another. Greater verifiability of claims about AI development would help enable more effective oversight and reduce pressure to cut corners for the sake of gaining a competitive edge [1]. Conversely, without the capacity to verify claims made by AI developers, those using or affected by AI systems are more likely to be put at risk by potentially ambiguous, misleading, or false claims.

This report suggests various steps that different stakeholders in AI development can take to make it easier to verify claims about AI development, with a focus on providing evidence about the safety, security, fairness, and privacy protection of AI systems. Implementation of such mechanisms can help make progress on the multifaceted problem of ensuring that AI development is conducted in a trustworthy fashion.1 The mechanisms outlined in this report deal with questions that various parties involved in AI development might face, such as:

• Can I (as a user) verify the claims made about the level of privacy protection guaranteed by a new AI system I’d like to use for machine translation of sensitive documents?

• Can I (as a regulator) trace the steps that led to an accident caused by an autonomous vehicle? Against what standards should an autonomous vehicle company’s safety claims be compared?

• Can I (as an academic) conduct impartial research on the impacts associated with large-scale AI systems when I lack the computing resources of industry?

• Can I (as an AI developer) verify that my competitors in a given area of AI development will follow best practices rather than cut corners to gain an advantage?

Even AI developers who have the desire and/or incentives to make concrete, verifiable claims may not be equipped with the appropriate mechanisms to do so. The AI development community needs a robust "toolbox" of mechanisms to support the verification of claims about AI systems and development processes.

1The capacity to verify claims made by developers, on its own, would be insufficient to ensure responsible AI development. Not all important claims admit verification, and there is also a need for oversight agencies such as governments and standards organizations to align developers’ incentives with the public interest.


This problem framing led some of the authors of this report to hold a workshop in April 2019, aimed at expanding the toolbox of mechanisms for making and assessing verifiable claims about AI development.2

This report builds on the ideas proposed at that workshop. The mechanisms outlined do two things:

• They increase the options available to AI developers for substantiating claims they make about AI systems’ properties.

• They increase the specificity and diversity of demands that can be made of AI developers by other stakeholders such as users, policymakers, and members of civil society.

Each mechanism and associated recommendation discussed in this report addresses a specific gap preventing effective assessment of developers’ claims today. Some of these mechanisms exist and need to be extended or scaled up in some way, and others are novel. The report is intended as an incremental step toward improving the verifiability of claims about AI development.

The report organizes mechanisms under the headings of Institutions, Software, and Hardware, which are three intertwined components of AI systems and development processes.

• Institutional Mechanisms: These mechanisms shape or clarify the incentives of people involved in AI development and provide greater visibility into their behavior, including their efforts to ensure that AI systems are safe, secure, fair, and privacy-preserving. Institutional mechanisms play a foundational role in verifiable claims about AI development, since it is people who are ultimately responsible for AI development. We focus on third party auditing, to create a robust alternative to self-assessment of claims; red teaming exercises, to demonstrate AI developers’ attention to the ways in which their systems could be misused; bias and safety bounties, to strengthen incentives to discover and report flaws in AI systems; and sharing of AI incidents, to improve societal understanding of how AI systems can behave in unexpected or undesired ways.

• Software Mechanisms: These mechanisms enable greater understanding and oversight of specific AI systems’ properties. We focus on audit trails, to enable accountability for high-stakes AI systems by capturing critical information about the development and deployment process; interpretability, to foster understanding and scrutiny of AI systems’ characteristics; and privacy-preserving machine learning, to make developers’ commitments to privacy protection more robust.

• Hardware Mechanisms: Mechanisms related to computing hardware can play a key role in substantiating strong claims about privacy and security, enabling transparency about how an organization’s resources are put to use, and influencing who has the resources necessary to verify different claims. We focus on secure hardware for machine learning, to increase the verifiability of privacy and security claims; high-precision compute measurement, to improve the value and comparability of claims about computing power usage; and compute support for academia, to improve the ability of those outside of industry to evaluate claims about large-scale AI systems.

Each mechanism provides additional paths to verifying AI developers’ commitments to responsible AI development, and has the potential to contribute to a more trustworthy AI ecosystem. The full list of recommendations associated with each mechanism is found on the following page and again at the end of the report.

    2See Appendix I, "Workshop and Report Writing Process."


List of Recommendations

    Institutional Mechanisms and Recommendations

1. A coalition of stakeholders should create a task force to research options for conducting and funding third party auditing of AI systems.

2. Organizations developing AI should run red teaming exercises to explore risks associated with systems they develop, and should share best practices and tools for doing so.

3. AI developers should pilot bias and safety bounties for AI systems to strengthen incentives and processes for broad-based scrutiny of AI systems.

4. AI developers should share more information about AI incidents, including through collaborative channels.

Software Mechanisms and Recommendations

5. Standards setting bodies should work with academia and industry to develop audit trail requirements for safety-critical applications of AI systems.

6. Organizations developing AI and funding bodies should support research into the interpretability of AI systems, with a focus on supporting risk assessment and auditing.

7. AI developers should develop, share, and use suites of tools for privacy-preserving machine learning that include measures of performance against common standards.

Hardware Mechanisms and Recommendations

8. Industry and academia should work together to develop hardware security features for AI accelerators or otherwise establish best practices for the use of secure hardware (including secure enclaves on commodity hardware) in machine learning contexts.

9. One or more AI labs should estimate the computing power involved in a single project in great detail (high-precision compute measurement), and report on the potential for wider adoption of such methods.

10. Government funding bodies should substantially increase funding of computing power resources for researchers in academia, in order to improve the ability of those researchers to verify claims made by industry.


1 Introduction

    1.1 Motivation

With rapid technical progress in artificial intelligence (AI)3 and the spread of AI-based applications over the past several years, there is growing concern about how to ensure that the development and deployment of AI is beneficial – and not detrimental – to humanity. In recent years, AI systems have been developed in ways that are inconsistent with the stated values of those developing them. This has led to a rise in concern, research, and activism relating to the impacts of AI systems [2] [3]. AI development has raised concerns about amplification of bias [6], loss of privacy [7], digital addictions [8], social harms associated with facial recognition and criminal risk assessment [9], disinformation [10], and harmful changes to the quality [11] and availability of gainful employment [12].

In response to these concerns, a range of stakeholders, including those developing AI systems, have articulated ethics principles to guide responsible AI development. The amount of work undertaken to articulate and debate such principles is encouraging, as is the convergence of many such principles on a set of widely-shared concerns such as safety, security, fairness, and privacy.4

However, principles are only a first step in the effort to ensure beneficial societal outcomes from AI [13]. Indeed, studies [17], surveys [18], and trends in worker and community organizing [2] [3] make clear that large swaths of the public are concerned about the risks of AI development, and do not trust the organizations currently dominating such development to self-regulate effectively. Those potentially affected by AI systems need mechanisms for ensuring responsible development that are more robust than high-level principles. People who get on airplanes don’t trust an airline manufacturer because of its PR campaigns about the importance of safety - they trust it because of the accompanying infrastructure of technologies, norms, laws, and institutions for ensuring airline safety.5 Similarly, along with the growing explicit adoption of ethics principles to guide AI development, there is mounting skepticism about whether these claims and commitments can be monitored and enforced [19].

Policymakers are beginning to enact regulations that more directly constrain AI developers’ behavior [20]. We believe that analyzing AI development through the lens of verifiable claims can help to inform such efforts. AI developers, regulators, and other actors all need to understand which properties of AI systems and development processes can be credibly demonstrated, through what means, and with what tradeoffs.

We define verifiable claims6 as falsifiable statements for which evidence and arguments can be brought to bear on the likelihood of those claims being true.

3We define AI as digital systems that are capable of performing tasks commonly thought to require intelligence, with these tasks typically learned via data and/or experience.

4Note, however, that many such principles have been articulated by Western academics and technology company employees, and as such are not necessarily representative of humanity’s interests or values as a whole. Further, they are amenable to various interpretations [13][14] and agreement on them can mask deeper disagreements [5]. See also Beijing AI Principles [15] and Zeng et al. [16] for examples of non-Western AI principles.

5Recent commercial airline crashes also serve as a reminder that even seemingly robust versions of such infrastructure are imperfect and in need of constant vigilance.

6While this report does discuss the technical area of formal verification at several points, and several of our recommendations are based on best practices from the field of information security, the sense in which we use "verifiable" is distinct from how the term is used in those contexts. Unless otherwise specified by the use of the adjective "formal" or other context, this report uses the word verification in a looser sense. Formal verification seeks mathematical proof that a certain technical claim is true with certainty (subject to certain assumptions). In contrast, this report largely focuses on claims that are unlikely to be demonstrated with absolute certainty, but which can be shown to be likely or unlikely to be true through relevant arguments and evidence.


While the degree of attainable certainty will vary across different claims and contexts, we hope to show that greater degrees of evidence can be provided for claims about AI development than is typical today. The nature and importance of verifiable claims is discussed in greater depth in Appendix III, and we turn next to considering the types of mechanisms that can make claims verifiable.

    1.2 Institutional, Software, and Hardware Mechanisms

AI developers today have many possible approaches for increasing the verifiability of their claims. Despite the availability of many mechanisms that could help AI developers demonstrate their claims and help other stakeholders scrutinize their claims, this toolbox has not been well articulated to date.

We view AI development processes as sociotechnical systems,7 with institutions, software, and hardware all potentially supporting (or detracting from) the verifiability of claims about AI development. AI developers can make claims about, or take actions related to, each of these three interrelated pillars of AI development.

In some cases, adopting one of these mechanisms can increase the verifiability of one’s own claims, whereas in other cases the impact on trust is more indirect (i.e., a mechanism implemented by one actor enabling greater scrutiny of other actors). As such, collaboration across sectors and organizations will be critical in order to build an ecosystem in which claims about responsible AI development can be verified.

• Institutional mechanisms largely pertain to values, incentives, and accountability. Institutional mechanisms shape or clarify the incentives of people involved in AI development and provide greater visibility into their behavior, including their efforts to ensure that AI systems are safe, secure, fair, and privacy-preserving. These mechanisms can also create or strengthen channels for holding AI developers accountable for harms associated with AI development. In this report, we provide an overview of some such mechanisms, and then discuss third party auditing, red team exercises, safety and bias bounties, and sharing of AI incidents in more detail.

• Software mechanisms largely pertain to specific AI systems and their properties. Software mechanisms can be used to provide evidence for both formal and informal claims regarding the properties of specific AI systems, enabling greater understanding and oversight. The software mechanisms we highlight below are audit trails, interpretability, and privacy-preserving machine learning.

• Hardware mechanisms largely pertain to physical computational resources and their properties. Hardware mechanisms can support verifiable claims by providing greater assurance regarding the privacy and security of AI systems, and can be used to substantiate claims about how an organization is using their general-purpose computing capabilities. Further, the distribution of resources across different actors can influence the types of AI systems that are developed and which actors are capable of assessing other actors’ claims (including by reproducing them). The hardware mechanisms we focus on in this report are hardware security features for machine learning, high-precision compute measurement, and computing power support for academia.

7Broadly, a sociotechnical system is one whose "core interface consists of the relations between a nonhuman system and a human system", rather than the components of those systems in isolation. See Trist [21].


1.3 Scope and Limitations

This report focuses on a particular aspect of trustworthy AI development: the extent to which organizations developing AI systems can and do make verifiable claims about the AI systems they build, and the ability of other parties to assess those claims. Given the backgrounds of the authors, the report focuses in particular on mechanisms for demonstrating claims about AI systems being safe, secure, fair, and/or privacy-preserving, without implying that those are the only sorts of claims that need to be verified.

We devote particular attention to mechanisms8 that the authors have expertise in and for which concrete and beneficial next steps were identified at an April 2019 workshop. These are not the only mechanisms relevant to verifiable claims; we survey some others at the beginning of each section, and expect that further useful mechanisms have yet to be identified.

Making verifiable claims is part of, but not equivalent to, trustworthy AI development, broadly defined. An AI developer might also be more or less trustworthy based on the particular values they espouse, the extent to which they engage affected communities in their decision-making, or the extent of recourse that they provide to external parties who are affected by their actions. Additionally, the actions of AI developers, which we focus on, are not all that matters for trustworthy AI development–the existence and enforcement of relevant laws matters greatly, for example.

Appendix I discusses the reasons for the report’s scope in more detail, and Appendix II discusses the relationship between different definitions of trust and verifiable claims. When we use the term "trust" as a verb in the report, we mean that one party (party A) gains confidence in the reliability of another party’s claims (party B) based on evidence provided about the accuracy of those claims or related ones. We also make reference to this claim-oriented sense of trust when we discuss actors "earning" trust (providing evidence for claims made) or being "trustworthy" (routinely providing sufficient evidence for claims made). This use of language is intended to concisely reference an important dimension of trustworthy AI development, and is not meant to imply that verifiable claims are sufficient for attaining trustworthy AI development.

    1.4 Outline of the Report

The next three sections of the report, Institutional Mechanisms and Recommendations, Software Mechanisms and Recommendations, and Hardware Mechanisms and Recommendations, each begin with a survey of mechanisms relevant to that category. Each section then highlights several mechanisms that we consider especially promising. We are uncertain which claims are most important to verify in the context of AI development, but strongly suspect that some combination of the mechanisms we outline in this report are needed to craft an AI ecosystem in which responsible AI development can flourish.

The way we articulate the case for each mechanism is problem-centric: each mechanism helps address a potential barrier to claim verification identified by the authors. Depending on the case, the recommendations associated with each mechanism are aimed at implementing a mechanism for the first time, researching it, scaling it up, or extending it in some way.

8We use the term mechanism generically to refer to processes, systems, or approaches for providing or generating evidence about behavior.


The Conclusion puts the report in context, discusses some important caveats, and reflects on next steps.

The Appendices provide important context, supporting material, and supplemental analysis. Appendix I provides background on the workshop and the process that went into writing the report; Appendix II serves as a glossary and discussion of key terms used in the report; Appendix III discusses the nature and importance of verifiable claims; Appendix IV discusses the importance of verifiable claims in the context of arms control; Appendix V provides context on antitrust law as it relates to cooperation among AI developers on responsible AI development; and Appendix VI offers supplemental analysis of several mechanisms.


2 Institutional Mechanisms and Recommendations

    "Institutional mechanisms" are processes that shape or clarify the incentives of the people involved inAI development, make their behavior more transparent, or enable accountability for their behavior.Institutional mechanisms help to ensure that individuals or organizations making claims regarding AIdevelopment are incentivized to be diligent in developing AI responsibly and that other stakeholders canverify that behavior. Institutions9 can shape incentives or constrain behavior in various ways.

    Several clusters of existing institutional mechanisms are relevant to responsible AI development, andwe characterize some of their roles and limitations below. These provide a foundation for the subse-quent, more detailed discussion of several mechanisms and associated recommendations. Specifically,we provide an overview of some existing institutional mechanisms that have the following functions:

    • Clarifying organizational goals and values;

    • Increasing transparency regarding AI development processes;

    • Creating incentives for developers to act in ways that are responsible; and

    • Fostering exchange of information among developers.

Institutional mechanisms can help clarify an organization’s goals and values, which in turn can provide a basis for evaluating their claims. These statements of goals and values–which can also be viewed as (high level) claims in the framework discussed here–can help to contextualize the actions an organization takes and lay the foundation for others (shareholders, employees, civil society organizations, governments, etc.) to monitor and evaluate behavior. Over 80 AI organizations [5], including technology companies such as Google [22], OpenAI [23], and Microsoft [24], have publicly stated the principles they will follow in developing AI. Codes of ethics or conduct are far from sufficient, since they are typically abstracted away from particular cases and are not reliably enforced, but they can be valuable by establishing criteria that a developer concedes are appropriate for evaluating its behavior.

The creation and public announcement of a code of ethics proclaims an organization’s commitment to ethical conduct both externally to the wider public, as well as internally to its employees, boards, and shareholders. Codes of conduct differ from codes of ethics in that they contain a set of concrete behavioral standards.10

Institutional mechanisms can increase transparency regarding an organization’s AI development processes in order to permit others to more easily verify compliance with appropriate norms, regulations, or agreements. Improved transparency may reveal the extent to which actions taken by an AI developer are consistent with their declared intentions and goals. The more reliable, timely, and complete the institutional measures to enhance transparency are, the more assurance may be provided.

9Institutions may be formal and public institutions, such as: laws, courts, and regulatory agencies; private formal arrangements between parties, such as contracts; interorganizational structures such as industry associations, strategic alliances, partnerships, coalitions, joint ventures, and research consortia. Institutions may also be informal norms and practices that prescribe behaviors in particular contexts; or third party organizations, such as professional bodies and academic institutions.

10Many organizations use the terms synonymously. The specificity of codes of ethics can vary, and more specific (i.e., action-guiding) codes of ethics (i.e. those equivalent to codes of conduct) can be better for earning trust because they are more falsifiable. Additionally, the form and content of these mechanisms can evolve over time–consider, e.g., Google’s AI Principles, which have been incrementally supplemented with more concrete guidance in particular areas.


Transparency measures could be undertaken on a voluntary basis or as part of an agreed framework involving relevant parties (such as a consortium of AI developers, interested non-profits, or policymakers). For example, algorithmic impact assessments (AIAs) are intended to support affected communities and stakeholders in assessing AI and other automated decision systems [2]. The Canadian government, for example, has centered AIAs in its Directive on Automated Decision-Making [25] [26]. Another path toward greater transparency around AI development involves increasing the extent and quality of documentation for AI systems. Such documentation can help foster informed and safe use of AI systems by providing information about AI systems’ biases and other attributes [27][28][29].
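To make the documentation idea concrete, the sketch below shows how the kind of information emphasized in such documentation efforts (e.g., model cards [28]) might be recorded as structured, machine-readable metadata. It is a minimal illustration only: the field names, example system, and numbers are hypothetical, and this is not the schema proposed in the cited work.

```python
# Minimal, hypothetical sketch of structured documentation for a released model,
# loosely in the spirit of model cards [28]; field names and values are illustrative only.
model_documentation = {
    "model_name": "sentiment-classifier-v2",               # hypothetical system
    "intended_uses": ["moderating product reviews"],
    "out_of_scope_uses": ["employment screening"],
    "evaluation_data": "held-out reviews sampled across product categories and languages",
    "performance_by_group": {"english": 0.91, "spanish": 0.84},   # illustrative numbers
    "known_limitations": ["accuracy drops sharply on code-switched text"],
}

# Disaggregated reporting of this kind gives outside parties something concrete to
# check a developer's fairness-related claims against, e.g. the worst-case subgroup.
worst_group = min(model_documentation["performance_by_group"],
                  key=model_documentation["performance_by_group"].get)
print(worst_group, model_documentation["performance_by_group"][worst_group])
```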

Institutional mechanisms can create incentives for organizations to act in ways that are responsible. Incentives can be created within an organization or externally, and they can operate at an organizational or an individual level. The incentives facing an actor can provide evidence regarding how that actor will behave in the future, potentially bolstering the credibility of related claims. To modify incentives at an organizational level, organizations can choose to adopt different organizational structures (such as benefit corporations) or take on legally binding intra-organizational commitments. For example, organizations could credibly commit to distributing the benefits of AI broadly through a legal commitment that shifts fiduciary duties.11

Institutional commitments to such steps could make a particular organization’s financial incentives more clearly aligned with the public interest. To the extent that commitments to responsible AI development and distribution of benefits are widely implemented, AI developers would stand to benefit from each other’s success, potentially12 reducing incentives to race against one another [1]. And critically, government regulations such as the General Data Protection Regulation (GDPR) enacted by the European Union shift developer incentives by imposing penalties on developers that do not adequately protect privacy or provide recourse for algorithmic decision-making.

Finally, institutional mechanisms can foster exchange of information between developers. To avoid "races to the bottom" in AI development, AI developers can exchange lessons learned and demonstrate their compliance with relevant norms to one another. Multilateral fora (in addition to bilateral conversations between organizations) provide opportunities for discussion and repeated interaction, increasing transparency and interpersonal understanding. Voluntary membership organizations with stricter rules and norms have been implemented in other industries and might also be a useful model for AI developers [31].13

Steps in the direction of robust information exchange between AI developers include the creation of consensus around important priorities such as safety, security, privacy, and fairness;14 participation in multi-stakeholder fora such as the Partnership on Artificial Intelligence to Benefit People and Society (PAI), the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the International Telecommunications Union (ITU), and the International Standards Organization (ISO); and clear identification of roles or offices within organizations who are responsible for maintaining and deepening interorganizational communication [10].15

11The Windfall Clause [30] is one proposal along these lines, and involves an ex ante commitment by AI firms to donate a significant amount of any eventual extremely large profits.

12The global nature of AI development, and the national nature of much relevant regulation, is a key complicating factor.

13See for example the norms set and enforced by the European Telecommunications Standards Institute (ETSI). These norms have real "teeth," such as the obligation for designated holders of Standard Essential Patents to license on Fair, Reasonable and Non-discriminatory (FRAND) terms. Breach of FRAND could give rise to a breach of contract claim as well as constitute a breach of antitrust law [32]. Voluntary standards for consumer products, such as those associated with Fairtrade and Organic labels, are also potentially relevant precedents [33].

14An example of such an effort is the Asilomar AI Principles [34].



It is also important to examine the incentives (and disincentives) for free flow of information within an organization. Employees within organizations developing AI systems can play an important role in identifying unethical or unsafe practices. For this to succeed, employees must be well-informed about the scope of AI development efforts within their organization and be comfortable raising their concerns, and such concerns need to be taken seriously by management.16 Policies (whether governmental or organizational) that help ensure safe channels for expressing concerns are thus key foundations for verifying claims about AI development being conducted responsibly.

The subsections below each introduce and explore a mechanism with the potential for improving the verifiability of claims in AI development: third party auditing, red team exercises, bias and safety bounties, and sharing of AI incidents. In each case, the subsections below begin by discussing a problem which motivates exploration of that mechanism, followed by a recommendation for improving or applying that mechanism.

15Though note competitors sharing commercially sensitive, non-public information (such as strategic plans or R&D plans) could raise antitrust concerns. It is therefore important to have the right antitrust governance structures and procedures in place (i.e., setting out exactly what can and cannot be shared). See Appendix V.

16Recent revelations regarding the culture of engineering and management at Boeing highlight the urgency of this issue [35].


2.1 Third Party Auditing

Problem: The process of AI development is often opaque to those outside a given organization, and various barriers make it challenging for third parties to verify the claims being made by a developer. As a result, claims about system attributes may not be easily verified.

AI developers have justifiable concerns about being transparent with information concerning commercial secrets, personal information, or AI systems that could be misused; however, problems arise when these concerns incentivize them to evade scrutiny. Third party auditors can be given privileged and secured access to this private information, and they can be tasked with assessing whether safety, security, privacy, and fairness-related claims made by the AI developer are accurate.

Auditing is a structured process by which an organization’s present or past behavior is assessed for consistency with relevant principles, regulations, or norms. Auditing has promoted consistency and accountability in industries outside of AI such as finance and air travel. In each case, auditing is tailored to the evolving nature of the industry in question.17 Recently, auditing has gained traction as a potential paradigm for assessing whether AI development was conducted in a manner consistent with the stated principles of an organization, with valuable work focused on designing internal auditing processes (i.e. those in which the auditors are also employed by the organization being audited) [36].

Third party auditing is a form of auditing conducted by an external and independent auditor, rather than the organization being audited, and can help address concerns about the incentives for accuracy in self-reporting. Provided that they have sufficient information about the activities of an AI system, independent auditors with strong reputational and professional incentives for truthfulness can help verify claims about AI development.

Auditing could take at least four quite different forms, and likely further variations are possible: auditing by an independent body with government-backed policing and sanctioning power; auditing that occurs entirely within the context of a government, though with multiple agencies involved [37]; auditing by a private expert organization or some ensemble of such organizations; and internal auditing followed by public disclosure of (some subset of) the results.18 As commonly occurs in other contexts, the results produced by independent auditors might be made publicly available, to increase confidence in the propriety of the auditing process.19

Techniques and best practices have not yet been established for auditing AI systems. Outside of AI, however, there are well-developed frameworks on which to build. Outcomes- or claim-based "assurance frameworks" such as the Claims-Arguments-Evidence framework (CAE) and Goal Structuring Notation (GSN) are already in wide use in safety-critical auditing contexts.20 By allowing different types of arguments and evidence to be used appropriately by auditors, these frameworks provide considerable flexibility in how high-level claims are substantiated, a needed feature given the wide-ranging and fast-evolving societal challenges posed by AI.
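To illustrate the general structure such frameworks impose, the sketch below represents a claim-based assurance case as a simple tree of claims, arguments, and evidence. This is a minimal sketch under our own simplifying assumptions, not the CAE or GSN notation itself, and the example claim, evidence, and helper function are hypothetical.

```python
# Minimal sketch of a claims-arguments-evidence style assurance case as a tree.
# Illustrative only; this is not the CAE/GSN notation itself.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    description: str     # e.g., a test report, audit log, or benchmark result
    source: str          # where an auditor can inspect it

@dataclass
class Argument:
    reasoning: str                                     # why the evidence supports the parent claim
    evidence: List[Evidence] = field(default_factory=list)

@dataclass
class Claim:
    statement: str                                     # a falsifiable claim made by the developer
    arguments: List[Argument] = field(default_factory=list)
    subclaims: List["Claim"] = field(default_factory=list)

# Hypothetical top-level claim an auditor might be asked to assess.
case = Claim(
    statement="The deployed translation model does not retain user-submitted text beyond 30 days",
    arguments=[
        Argument(
            reasoning="Retention is bounded by an automated deletion job whose logs "
                      "are available for inspection",
            evidence=[Evidence("Deletion job logs for Q1", "internal audit trail")],
        )
    ],
)

def unsupported_claims(claim: Claim) -> List[str]:
    """Return statements in the case that have no argument backing them."""
    missing = [] if claim.arguments else [claim.statement]
    for sub in claim.subclaims:
        missing.extend(unsupported_claims(sub))
    return missing

print(unsupported_claims(case))   # [] -- every claim in this toy case is argued for
```

A tree like this makes explicit which claims rest on which evidence, which is the property that lets an auditor decide whether the evidence offered is appropriate to the claim being made.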

17See Raji and Smart et al. [36] for a discussion of some lessons for AI from auditing in other industries.

18Model cards for model reporting [28] and data sheets for datasets [29] reveal information about AI systems publicly, and future work in third party auditing could build on such tools, as advocated by Raji and Smart et al. [36].

19Consumer Reports, originally founded as the Consumers Union in 1936, is one model for an independent, third party organization that performs similar functions for products that can affect the health, well-being, and safety of the people using those products (https://www.consumerreports.org/cro/about-us/what-we-do/research-and-testing/index.htm).

20See Appendix III for further discussion of claim-based frameworks for auditing.



Possible aspects of AI systems that could be independently audited include the level of privacy protection guaranteed, the extent to (and methods by) which the AI systems were tested for safety, security or ethical concerns, and the sources of data, labor, and other resources used. Third party auditing could be applicable to a wide range of AI applications, as well. Safety-critical AI systems such as autonomous vehicles and medical AI systems, for example, could be audited for safety and security. Such audits could confirm or refute the accuracy of previous claims made by developers, or compare their efforts against an independent set of standards for safety and security. As another example, search engines and recommendation systems could be independently audited for harmful biases.

Third party auditors should be held accountable by government, civil society, and other stakeholders to ensure that strong incentives exist to act accurately and fairly. Reputational considerations help to ensure auditing integrity in the case of financial accounting, where firms prefer to engage with credible auditors [38]. Alternatively, a licensing system could be implemented in which auditors undergo a standard training process in order to become a licensed AI system auditor. However, given the variety of methods and applications in the field of AI, it is not obvious whether auditor licensing is a feasible option for the industry: perhaps a narrower form of licensing would be helpful (e.g., a subset of AI such as adversarial machine learning).

Auditing imposes costs (financial and otherwise) that must be weighed against its value. Even if auditing is broadly societally beneficial and non-financial costs (e.g., to intellectual property) are managed, the financial costs will need to be borne by someone (auditees, large actors in the industry, taxpayers, etc.), raising the question of how to initiate a self-sustaining process by which third party auditing could mature and scale. However, if done well, third party auditing could strengthen the ability of stakeholders in the AI ecosystem to make and assess verifiable claims. And notably, the insights gained from third party auditing could be shared widely, potentially benefiting stakeholders even in countries with different regulatory approaches for AI.

Recommendation: A coalition of stakeholders should create a task force to research options for conducting and funding third party auditing of AI systems.

AI developers and other stakeholders (such as civil society organizations and policymakers) should collaboratively explore the challenges associated with third party auditing. A task force focused on this issue could explore appropriate initial domains/applications to audit, devise approaches for handling sensitive intellectual property, and balance the need for standardization with the need for flexibility as AI technology evolves.21 Collaborative research into this domain seems especially promising given that the same auditing process could be used across labs and countries. As research in these areas evolves, so too will auditing processes–one might thus think of auditing as a "meta-mechanism" which could involve assessing the quality of other efforts discussed in this report such as red teaming.

One way that third party auditing could connect to government policies, and be funded, is via a "regulatory market" [42]. In a regulatory market for AI, a government would establish high-level outcomes to be achieved from regulation of AI (e.g., achievement of a certain level of safety in an industry) and then create or support private sector entities or other organizations that compete in order to design and implement the precise technical oversight required to achieve those outcomes.22 Regardless of whether such an approach is pursued, third party auditing by private actors should be viewed as a complement to, rather than a substitute for, governmental regulation. And regardless of the entity conducting oversight of AI developers, in any case there will be a need to grapple with difficult challenges such as the treatment of proprietary data.

21This list is not exhaustive - see, e.g., [39], [40], and [41] for related discussions.

22Examples of such entities include EXIDA, the UK Office of Nuclear Regulation, and the private company Adelard.




2.2 Red Team Exercises

Problem: It is difficult for AI developers to address the "unknown unknowns" associated with AI systems, including limitations and risks that might be exploited by malicious actors. Further, existing red teaming approaches are insufficient for addressing these concerns in the AI context.

In order for AI developers to make verifiable claims about their AI systems being safe or secure, they need processes for surfacing and addressing potential safety and security risks. Practices such as red teaming exercises help organizations to discover their own limitations and vulnerabilities as well as those of the AI systems they develop, and to approach them holistically, in a way that takes into account the larger environment in which they are operating.23

A red team exercise is a structured effort to find flaws and vulnerabilities in a plan, organization, or technical system, often performed by dedicated "red teams" that seek to adopt an attacker’s mindset and methods. In domains such as computer security, red teams are routinely tasked with emulating attackers in order to find flaws and vulnerabilities in organizations and their systems. Discoveries made by red teams allow organizations to improve security and system integrity before and during deployment. Knowledge that a lab has a red team can potentially improve the trustworthiness of an organization with respect to their safety and security claims, at least to the extent that effective red teaming practices exist and are demonstrably employed.

As indicated by the number of cases in which AI systems cause or threaten to cause harm, developers of an AI system often fail to anticipate the potential risks associated with technical systems they develop. These risks include both inadvertent failures and deliberate misuse. Those not involved in the development of a particular system may be able to more easily adopt and practice an attacker’s skillset. A growing number of industry labs have dedicated red teams, although best practices for such efforts are generally in their early stages.24 There is a need for experimentation both within and across organizations in order to move red teaming in AI forward, especially since few AI developers have expertise in relevant areas such as threat modeling and adversarial machine learning [44].

AI systems and infrastructure vary substantially in terms of their properties and risks, making in-house red-teaming expertise valuable for organizations with sufficient resources. However, it would also be beneficial to experiment with the formation of a community of AI red teaming professionals that draws together individuals from different organizations and backgrounds, specifically focused on some subset of AI (versus AI in general) that is relatively well-defined and relevant across multiple organizations.25

A community of red teaming professionals could take actions such as publishing best practices, collectively analyzing particular case studies, organizing workshops on emerging issues, or advocating for policies that would enable red teaming to be more effective.

23Red teaming could be aimed at assessing various properties of AI systems, though we focus on safety and security in this subsection given the expertise of the authors who contributed to it.

24For an example of early efforts related to this, see Marshall et al., "Threat Modeling AI/ML Systems and Dependencies" [43].

25In the context of language models, for example, 2019 saw a degree of communication and coordination across AI developers to assess the relative risks of different language understanding and generation systems [10]. Adversarial machine learning, too, is an area with substantial sharing of lessons across organizations, though it is not obvious whether a shared red team focused on this would be too broad.

Doing red teaming in a more collaborative fashion, as a community of focused professionals across organizations, has several potential benefits:



• Participants in such a community would gain useful, broad knowledge about the AI ecosystem, allowing them to identify common attack vectors and make periodic ecosystem-wide recommendations to organizations that are not directly participating in the core community;

• Collaborative red teaming distributes the costs for such a team across AI developers, allowing those who otherwise may not have utilized a red team of similarly high quality or one at all to access its benefits (e.g., smaller organizations with fewer resources);

    • Greater collaboration could facilitate sharing of information about security-related AI incidents.26

Recommendation: Organizations developing AI should run red teaming exercises to explore risks associated with systems they develop, and should share best practices and tools for doing so.

Two critical questions that would need to be answered in the context of forming a more cohesive AI red teaming community are: what is the appropriate scope of such a group, and how will proprietary information be handled?27 The two questions are related. Particularly competitive contexts (e.g., autonomous vehicles) might be simultaneously very appealing and challenging: multiple parties stand to gain from pooling of insights, but collaborative red teaming in such contexts is also challenging because of intellectual property and security concerns.

As an alternative or supplement to explicitly collaborative red teaming, organizations building AI technologies should establish shared resources and outlets for sharing relevant non-proprietary information. The subsection on sharing of AI incidents also discusses some potential innovations that could alleviate concerns around sharing proprietary information.

26This has a precedent from cybersecurity; MITRE’s ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations, which serves as a foundation for development of more specific threat models and methodologies to improve cybersecurity (https://attack.mitre.org/).

27These practical questions are not exhaustive, and even addressing them effectively might not suffice to ensure that collaborative red teaming is beneficial. For example, one potential failure mode is if collaborative red teaming fostered excessive homogeneity in the red teaming approaches used, contributing to a false sense of security in cases where that approach is insufficient.


2.3 Bias and Safety Bounties

Problem: There is too little incentive, and no formal process, for individuals unaffiliated with a particular AI developer to seek out and report problems of AI bias and safety. As a result, broad-based scrutiny of AI systems for these properties is relatively rare.

    "Bug bounty" programs have been popularized in the information security industry as a way to compen-sate individuals for recognizing and reporting bugs, especially those related to exploits and vulnerabil-ities [45]. Bug bounties provide a legal and compelling way to report bugs directly to the institutionsaffected, rather than exposing the bugs publicly or selling the bugs to others. Typically, bug bountiesinvolve an articulation of the scale and severity of the bugs in order to determine appropriate compen-sation.

    While efforts such as red teaming are focused on bringing internal resources to bear on identifying risksassociated with AI systems, bounty programs give outside individuals a method for raising concernsabout specific AI systems in a formalized way. Bounties provide one way to increase the amount ofscrutiny applied to AI systems, increasing the likelihood of claims about those systems being verified orrefuted.

    Bias28 and safety bounties would extend the bug bounty concept to AI, and could complement existingefforts to better document datasets and models for their performance limitations and other properties.29

We focus here on bounties for discovering bias and safety issues in AI systems as a starting point for analysis and experimentation, but note that bounties for other properties (such as security, privacy protection, or interpretability) could also be explored.30

While some instances of bias are easier to identify, others can only be uncovered with significant analysis and resources. For example, Ziad Obermeyer et al. uncovered racial bias in a widely used algorithm affecting millions of patients [47]. There have also been several instances of consumers with no direct access to AI institutions using social media and the press to draw attention to problems with AI [48]. To date, investigative journalists and civil society organizations have played key roles in surfacing different biases in deployed AI systems. If companies were more open earlier in the development process about possible faults, and if users were able to raise (and be compensated for raising) concerns about AI to institutions, users might report them directly instead of seeking recourse in the court of public opinion.31

In addition to bias, bounties could also add value in the context of claims about AI safety. Algorithms or models that are purported to have favorable safety properties, such as enabling safe exploration or robustness to distributional shifts [49], could be scrutinized via bounty programs. To date, more attention has been paid to documentation of models for bias properties than safety properties,32 though in both

28 For an earlier exploration of bias bounties by one of the report authors, see Rubinovitz [46].
29 For example, model cards for model reporting [28] and datasheets for datasets [29] are recently developed means of documenting AI releases, and such documentation could be extended with publicly listed incentives for finding new forms of problematic behavior not captured in that documentation.

30 Bounties for finding issues with datasets used for training AI systems could also be considered, though we focus on trained AI systems and code as starting points.

31 We note that many millions of dollars have been paid to date via bug bounty programs in the computer security domain, providing some evidence for this hypothesis. However, bug bounties are not a panacea and recourse to the public is also appropriate in some cases.

32 We also note that the challenge of avoiding harmful biases is sometimes framed as a subset of safety, though for the


cases, benchmarks remain in an early state. Improved safety metrics could increase the comparability of bounty programs and the overall robustness of the bounty ecosystem; however, there should also be means of reporting issues that are not well captured by existing metrics.

Note that bounties are not sufficient for ensuring that a system is safe, secure, or fair, and it is important to avoid creating perverse incentives (e.g., encouraging work on poorly-specified bounties and thereby negatively affecting talent pipelines) [50]. Some system properties can be difficult to discover even with bounties, and the bounty hunting community might be too small to create strong assurances. However, relative to the status quo, bounties might increase the amount of scrutiny applied to AI systems.

Recommendation: AI developers should pilot bias and safety bounties for AI systems to strengthen incentives and processes for broad-based scrutiny of AI systems.

    Issues to be addressed in setting up such a bounty program include [46]:

    • Setting compensation rates for different scales/severities of issues discovered;

    • Determining processes for soliciting and evaluating bounty submissions;

    • Developing processes for disclosing issues discovered via such bounties in a timely fashion;33

• Designing appropriate interfaces for reporting of bias and safety problems in the context of deployed AI systems;

    • Defining processes for handling reported bugs and deploying fixes;

    • Avoiding creation of perverse incentives.

There is not a perfect analogy between discovering and addressing traditional computer security vulnerabilities, on the one hand, and identifying and addressing limitations in AI systems, on the other. Work is thus needed to explore the factors listed above in order to adapt the bug bounty concept to the context of AI development. The computer security community has developed norms (though not a consensus) regarding how to address "zero day" vulnerabilities,34 but no comparable norms yet exist in the AI community.

There may be a need for distinct approaches to different types of vulnerabilities and associated bounties, depending on factors such as the potential for remediation of the issue and the stakes associated with the AI system. Bias might be treated differently from safety issues such as unsafe exploration, as these have distinct causes, risks, and remediation steps. In some contexts, a bounty might be paid for information even if there is no ready fix to the identified issue, because providing accurate documentation to system users is valuable in and of itself and there is often no pretense of AI systems being fully robust. In other

purposes of this discussion, little hinges on this terminological issue. We distinguish the two in the title of this section in order to call attention to the unique properties of different types of bounties.

33 Note that we specifically consider public bounty programs here, though instances of private bounty programs also exist in the computer security community. Even in the event of a publicly advertised bounty, however, submissions may be private, and as such there is a need for explicit policies for handling submissions in a timely and legitimate fashion–otherwise such programs will provide little assurance.

34 A zero-day vulnerability is a security vulnerability that is unknown to the developers of the system and other affected parties, giving them "zero days" to mitigate the issue if the vulnerability were to immediately become widely known. The computer security community features a range of views on appropriate responses to zero-days, with a common approach being to provide a finite period for the vendor to respond to notification of the vulnerability before the discoverer goes public.


cases, more care will be needed in responding to the identified issue, such as when a model is widely used in deployed products and services.


2.4 Sharing of AI Incidents

Problem:
Claims about AI systems can be scrutinized more effectively if there is common knowledge of the potential risks of such systems. However, cases of undesired or unexpected behavior by AI systems are infrequently shared, since doing so unilaterally is costly.

Organizations can share AI "incidents," or cases of undesired or unexpected behavior by an AI system that causes or could cause harm, by publishing case studies about these incidents from which others can learn. This can be accompanied by information about how they have worked to prevent future incidents based on their own and others’ experiences.

By default, organizations developing AI have an incentive to primarily or exclusively report positive outcomes associated with their work rather than incidents. As a result, a skewed image is given to the public, regulators, and users about the potential risks associated with AI development.

The sharing of AI incidents can improve the verifiability of claims in AI development by highlighting risks that might not have otherwise been considered by certain actors. Knowledge of these risks, in turn, can then be used to inform questions posed to AI developers, increasing the effectiveness of external scrutiny. Incident sharing can also (over time, if used regularly) provide evidence that incidents are found and acknowledged by particular organizations, though additional mechanisms would be needed to demonstrate the completeness of such sharing.

AI incidents can include those that are publicly known and transparent, publicly known and anonymized, privately known and anonymized, or privately known and transparent. The Partnership on AI has begun building an AI incident-sharing database, called the AI Incident Database.35 The pilot was built from publicly available information, with a set of volunteers and contractors manually collecting known AI incidents in which AI caused harm in the real world.

Improving the ability and incentive of AI developers to report incidents requires building additional infrastructure, analogous to the infrastructure that exists for reporting incidents in other domains such as cybersecurity. Infrastructure to support incident sharing that involves non-public information would require the following resources:

• Transparent and robust processes to protect organizations from undue reputational harm brought about by the publication of previously unshared incidents. This could be achieved by anonymizing incident information to protect the identity of the organization sharing it. Other information-sharing methods should be explored that would mitigate reputational risk to organizations, while preserving the usefulness of information shared;

• A trusted neutral third party that works with each organization under a non-disclosure agreement to collect and anonymize private information;

35 See Partnership on AI’s AI Incident Registry as an example (http://aiid.partnershiponai.org/). A related resource is a list called Awful AI, which is intended to raise awareness of misuses of AI and to spur discussion around contestational research and tech projects [51]. A separate list summarizes various cases in which AI systems "gamed" their specifications in unexpected ways [52]. Additionally, AI developers have in some cases provided retrospective analyses of particular AI incidents, such as with Microsoft’s "Tay" chatbot [53].


• An organization that maintains and administers an online platform where users can easily access the incident database, including strong encryption and password protection for private incidents as well as a way to submit new information. This organization would not have to be the same as the third party that collects and anonymizes private incident data;

• Resources and channels to publicize the existence of this database as a centralized resource, to accelerate both contributions to the database and positive uses of the knowledge from the database; and

• Dedicated researchers who monitor incidents in the database in order to identify patterns and shareable lessons.

The costs of incident sharing (e.g., public relations risks) are concentrated on the sharing organization, although the benefits are shared broadly by those who gain valuable information about AI incidents. Thus, a cooperative approach needs to be taken for incident sharing that addresses the potential downsides. A more robust infrastructure for incident sharing (as outlined above), including options for anonymized reporting, would help ensure that fear of negative repercussions from sharing does not prevent the benefits of such sharing from being realized.36
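
To make the kind of record such an infrastructure would handle more concrete, the following is a minimal, hypothetical sketch of an anonymized incident entry in Python. The field names and example values are our own illustration, not a schema used by the AI Incident Database or any other existing platform.

from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class AIIncidentRecord:
    """A minimal, illustrative schema for an anonymized AI incident report."""
    incident_id: str                   # assigned by the neutral third party
    system_type: str                   # e.g., "recommendation", "perception"
    harm_category: str                 # e.g., "bias", "safety", "security"
    severity: str                      # e.g., "near-miss", "minor", "major"
    description: str                   # anonymized narrative of what happened
    mitigations: List[str] = field(default_factory=list)
    sharing_level: str = "anonymized"  # "public", "anonymized", or "private"

record = AIIncidentRecord(
    incident_id="2020-0042",
    system_type="recommendation",
    harm_category="bias",
    severity="minor",
    description="Ranking model systematically down-weighted content from one region.",
    mitigations=["Retrained with re-weighted data", "Added regional parity monitoring"],
)
print(asdict(record))

Even a simple shared schema of this kind would make incidents easier to aggregate, anonymize, and analyze across organizations.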

Recommendation: AI developers should share more information about AI incidents, including through collaborative channels.

Developers should seek to share AI incidents with a broad audience so as to maximize their usefulness, and take advantage of collaborative channels such as centralized incident databases as that infrastructure matures. In addition, they should move towards publicizing their commitment to (and procedures for) doing such sharing in a routine way rather than in an ad-hoc fashion, in order to strengthen these practices as norms within the AI development community.

Incident sharing is closely related to but distinct from responsible publication practices in AI and coordinated disclosure of cybersecurity vulnerabilities [55]. Beyond implementation of progressively more robust platforms for incident sharing and contributions to such platforms, future work could also explore connections between AI and other domains in more detail, and identify key lessons from other domains in which incident sharing is more mature (such as the nuclear and cybersecurity industries).

Over the longer term, lessons learned from experimentation and research could crystallize into a mature body of knowledge on different types of AI incidents, reporting processes, and the costs associated with incident sharing. This, in turn, can inform any eventual government efforts to require or incentivize certain forms of incident reporting.

36 We do not mean to claim that building and using such infrastructure would be sufficient to ensure that AI incidents are addressed effectively. Sharing is only one part of the puzzle for effectively managing incidents. For example, attention should also be paid to ways in which organizations developing AI, and particularly safety-critical AI, can become "high reliability organizations" (see, e.g., [54]).


3 Software Mechanisms and Recommendations

Software mechanisms involve shaping and revealing the functionality of existing AI systems. They can support verification of new types of claims or verify existing claims with higher confidence. This section begins with an overview of the landscape of software mechanisms relevant to verifying claims, and then highlights several key problems, mechanisms, and associated recommendations.

Software mechanisms, like software itself, must be understood in context (with an appreciation for the role of the people involved). Expertise about many software mechanisms is not widespread, which can create challenges for building trust through such mechanisms. For example, an AI developer that wants to provide evidence for the claim that "user data is kept private" can help build trust by demonstrating the lab’s compliance with a formal framework such as differential privacy, but non-experts may have in mind a different definition of privacy.37 It is thus critical to consider not only which claims can and can’t be substantiated with existing mechanisms in theory, but also who is well-positioned to scrutinize these mechanisms in practice.38
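
To illustrate the kind of formal guarantee differential privacy actually provides (and the gap between it and lay notions of privacy described in footnote 37), the following is a minimal, illustrative sketch of the Laplace mechanism for a private count query. The function name, dataset, and epsilon value are our own choices for exposition, not part of any particular developer’s pipeline.

import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """Return a differentially private count of records satisfying `predicate`.

    The Laplace mechanism adds noise calibrated to the query's sensitivity
    (here 1, since adding or removing one record changes the count by at most 1).
    The guarantee bounds how much any single individual's presence can shift
    the output distribution -- it does not hide population-level patterns.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: a hypothetical dataset of (age, diagnosis) pairs.
records = [(34, "flu"), (61, "diabetes"), (45, "flu"), (29, "none")]
noisy = private_count(records, lambda r: r[1] == "flu", epsilon=0.5)
print(f"Noisy count of flu diagnoses: {noisy:.1f}")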

Keeping their limitations in mind, software mechanisms can substantiate claims associated with AI development in various ways that are complementary to institutional and hardware mechanisms. They can allow researchers, auditors, and others to understand the internal workings of any given system. They can also help characterize the behavioral profile of a system over a domain of expected usage. Software mechanisms could support claims such as:

    • This system is robust to ’natural’ distributional shifts [49] [56];

    • This system is robust even to adversarial examples [57] [58];

• This system has a well-characterized error surface and users have been informed of contexts in which the system would be unsafe to use;

• This system’s decisions exhibit statistical parity with respect to sensitive demographic attributes39 (see the sketch following this list); and

    • This system provides repeatable or reproducible results.
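
As a small illustration of how a statistical parity claim can be checked, the sketch below computes the demographic parity gap for a binary classifier’s decisions. The variable names and example data are illustrative assumptions, not a standard or a real system’s outputs.

from collections import defaultdict

def demographic_parity_gap(decisions, groups):
    """Compute the largest gap in positive-decision rates across groups.

    `decisions` is a list of 0/1 model outputs; `groups` holds the sensitive
    attribute value for each corresponding individual. Statistical parity
    holds (approximately) when this gap is close to zero.
    """
    totals, positives = defaultdict(int), defaultdict(int)
    for d, g in zip(decisions, groups):
        totals[g] += 1
        positives[g] += d
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

# Illustrative data: hypothetical loan approvals for two demographic groups.
decisions = [1, 0, 1, 1, 0, 1, 0, 0]
groups    = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap, rates = demographic_parity_gap(decisions, groups)
print(rates, "gap:", gap)

Note that statistical parity is only one of many contested fairness measures (see footnote 39), so such a check substantiates a narrowly scoped claim rather than fairness in general.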

Below, we summarize several clusters of mechanisms which help to substantiate some of the claims above.

    Reproducibility of technical results in AI is a key way of enabling verification of claims about system

37 For example, consider a desideratum for privacy: access to a dataset should not enable an adversary to learn anything about an individual that could not be learned without access to the database. Differential privacy as originally conceived does not guarantee this–rather, it guarantees (to an extent determined by a privacy budget) that one cannot learn whether that individual was in the database in question.

38 In Section 3.3, we discuss the role that computing power–in addition to expertise–can play in influencing who can verify which claims.

39 Conceptions of, and measures for, fairness in machine learning, philosophy, law, and beyond vary widely. See, e.g., Xiang and Raji [59] and Binns [60].


properties, and a number of ongoing initiatives are aimed at improving reproducibility in AI.40,41 Publication of results, models, and code increases the ability of outside parties (especially technical experts) to verify claims made about AI systems. Careful experimental design and the use of (and contribution to) standard software libraries can also improve reproducibility of particular results.42
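
One small but concrete step toward repeatability is fixing and recording the sources of randomness and the software environment for a training run. The sketch below assumes PyTorch as the framework (other libraries have analogous calls) and is a minimal illustration rather than a complete reproducibility protocol.

import json
import platform
import random

import numpy as np
import torch

def pin_and_record_run_config(seed: int, path: str = "run_config.json"):
    """Fix random seeds and record environment details alongside a training run.

    Pinning seeds addresses repeatability (same result from the same setup);
    recording library versions helps others reproduce the setup itself.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False

    config = {
        "seed": seed,
        "python": platform.python_version(),
        "numpy": np.__version__,
        "torch": torch.__version__,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config

pin_and_record_run_config(seed=20200420)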

Formal verification establishes whether a system satisfies some requirements using the formal methods of mathematics. Formal verification is often a compulsory technique deployed in various safety-critical domains to provide guarantees regarding the functional behaviors of a system. These are typically guarantees that testing cannot provide. Until recently, AI systems utilizing machine learning (ML)43 have not generally been subjected to such rigor, but the increasing use of ML in safety-critical domains, such as automated transport and robotics, necessitates the creation of novel formal analysis techniques addressing ML models and their accompanying non-ML components. Techniques for formally verifying ML models are still in their infancy and face numerous challenges,44 which we discuss in Appendix VI(A).

The empirical verification and validation of machine learning by machine learning has been proposed as an alternative paradigm to formal verification. Notably, it can be more practical than formal verification, but since it operates empirically, it cannot provide guarantees as strong as those of formal methods. Machine learning could be used to search for common error patterns in another system’s code, or to create simulation environments that adversarially find faults in an AI system’s behavior.

For example, adaptive stress testing (AST) of an AI system allows users to find the most likely failure of a system for a given scenario using reinforcement learning [61], and is being used to validate the next generation of aircraft collision avoidance software [62]. Techniques requiring further research include using machine learning to evaluate another machine learning system (either by directly inspecting its policy or by creating environments to test the model) and using ML to evaluate the inputs of another machine learning model. In the future, data from model failures, especially pooled across multiple labs and stakeholders, could potentially be used to create classifiers that detect suspicious or anomalous AI behavior.
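
The core loop of simulation-based fault finding can be illustrated without the full reinforcement-learning machinery of AST. The sketch below uses plain random search over disturbances to surface likely failures of a hypothetical controller; the `simulate` and `naive_controller` functions are placeholders we introduce for illustration, and AST would replace the random search with a learned policy.

import random

def simulate(controller, disturbance):
    """Placeholder environment: returns (failed, severity) for one rollout."""
    # A real implementation would roll out the system under the disturbance.
    response = controller(disturbance)
    failed = abs(disturbance - response) > 1.0
    return failed, abs(disturbance - response)

def find_likely_failures(controller, trials=10_000, seed=0):
    """Randomly search disturbances and keep the most severe observed failures.

    Adaptive stress testing replaces this random search with a learned policy
    that steers the simulator toward high-likelihood failure trajectories.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        disturbance = rng.gauss(0.0, 2.0)
        failed, severity = simulate(controller, disturbance)
        if failed:
            failures.append((severity, disturbance))
    return sorted(failures, reverse=True)[:10]

naive_controller = lambda d: 0.5 * d  # a deliberately imperfect stand-in
print(find_likely_failures(naive_controller)[:3])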

Practical verification is the use of scientific protocols to characterize a model’s data, assumptions, and performance. Training data can be rigorously evaluated for representativeness [63] [64]; assumptions can be characterized by evaluating modular components of an AI model and by clearly communicating output uncertainties; and performance can be characterized by measuring generalization, fairness, and performance heterogeneity across population subsets. Causes of differences in performance between

40 We note the distinction between narrow senses of reproducibility that focus on discrete technical results being reproducible given the same initial conditions, sometimes referred to as repeatability, and broader senses of reproducibility that involve reported performance gains carrying over to different contexts and implementations.

41 One way to promote robustness is through incentivizing reproducibility of reported results. There are increasing efforts to award systems recognition that they are robust, e.g., through ACM’s artifact evaluation badges (https://www.acm.org/publications/policies/artifact-review-badging). Conferences are also introducing artifact evaluation, e.g., in the intersection between computer systems research and ML; see, e.g., https://reproindex.com/event/repro-sml2020 and http://cknowledge.org/request.html. The Reproducibility Challenge is another notable effort in this area: https://reproducibility-challenge.github.io/neurips2019/

42 In the following section on hardware mechanisms, we also discuss how reproducibility can be advanced in part by leveling the playing field between industry and other sectors with respect to computing power.

43 Machine learning is a subfield of AI focused on the design of software that improves in response to data, with that data taking the form of unlabeled data, labeled data, or experience. While other forms of AI that do not involve machine learning can still raise privacy concerns, we focus on machine learning here given the recent growth in associated privacy techniques as well as the widespread deployment of machine learning.

44 Research into perception-based properties such as pointwise robustness, for example, is not sufficiently comprehensive to be applied to real-time critical AI systems such as autonomous vehicles.


models could be robustly attributed via randomized controlled trials.

A developer may wish to make claims about a system’s adversarial robustness.45 Currently, the security balance is tilted in favor of attacks rather than defenses, with only adversarial training [65] having stood the test of multiple years of attack research. Certificates of robustness, based on formal proofs, are typically approximate and give meaningful bounds on the increase in error for only a limited range of inputs, often only around the data available for certification (i.e., not generalizing well to unseen data [66] [67] [68]). Without approximation, certificates are computationally prohibitive for all but the smallest real-world tasks [69]. Further research is needed on scaling formal certification methods to larger model sizes.
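
To make the notion of an adversarial input concrete, below is a minimal sketch of the fast gradient sign method (FGSM), one standard attack used in adversarial training. It assumes a differentiable PyTorch classifier and image inputs scaled to [0, 1]; it is illustrative, not a recommended robustness evaluation protocol.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.03):
    """Perturb input `x` to increase the classifier's loss (FGSM).

    The perturbation moves each input dimension by +/- epsilon in the
    direction that most increases the loss, often flipping the prediction
    while remaining nearly imperceptible to a human.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), label)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)  # keep inputs in the valid pixel range
    return x_adv.detach()

# Illustrative usage with a hypothetical trained `model`, input batch `x`,
# and integer labels `label`:
# x_adv = fgsm_attack(model, x, label, epsilon=0.03)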

The subsections below discuss software mechanisms that we consider especially important to advance further. In particular, we discuss audit trails, interpretability, and privacy-preserving machine learning.

45 Adversarial robustness refers to an AI system’s ability to perform well in the context of (i.e., to be robust against) "adversarial" inputs, or inputs designed specifically to degrade the system’s performance.


3.1 Audit Trails

Problem:
AI systems lack traceable logs of steps taken in problem-definition, design, development, and operation, leading to a lack of accountability for subsequent claims about those systems’ properties and impacts.

Audit trails can improve the verifiability of claims about engineered systems, although they are not yet a mature mechanism in the context of AI. An audit trail is a traceable log of steps in system operation, and potentially also in design and testing. We expect that audit trails will grow in importance as AI is applied to more safety-critical contexts. They will be crucial in supporting many institutional trust-building mechanisms, such as third-party auditors, government regulatory bodies,46 and voluntary disclosure of safety-relevant information by companies.

Audit trails could cover all steps of the AI development process, from the institutional work of problem and purpose definition leading up to the initial creation of a system, to the training and development of that system, all the way to retrospective accident analysis.

There is already strong precedent for audit trails in numerous industries, in particular for safety-critical systems. Commercial aircraft, for example, are equipped with flight data recorders that capture multiple types of data each second [70]. In safety-critical domains, the compliance of such evidence is usually assessed within a larger "assurance case" utilizing the CAE or Goal Structuring Notation (GSN) frameworks.47 Tools such as the Assurance and Safety Case Environment (ACSE) exist to help both the auditor and the auditee manage compliance claims and corresponding evidence. Version control tools such as GitHub or GitLab can be utilized to demonstrate individual document traceability. Proposed projects like Verifiable Data Audit [71] could establish confidence in logs of data interactions and usage.
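
One simple ingredient of a tamper-evident audit trail is a hash-chained log, broadly in the spirit of the Verifiable Data Audit proposal cited above. The sketch below is our own minimal illustration (not that project’s implementation): each entry commits to the previous one, so later alteration of any entry is detectable.

import hashlib
import json
import time

def append_entry(log, event: dict) -> dict:
    """Append an event to a hash-chained audit log.

    Each entry stores the hash of the previous entry, so modifying or deleting
    any earlier record changes every subsequent hash and is easy to detect.
    """
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"timestamp": time.time(), "event": event, "prev_hash": prev_hash}
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": body_hash})
    return log[-1]

def verify_chain(log) -> bool:
    """Recompute every hash and check that the chain links are intact."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("timestamp", "event", "prev_hash")}
        if entry["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"step": "training_run", "dataset_version": "v1.2", "commit": "abc123"})
append_entry(log, {"step": "evaluation", "metric": "accuracy", "value": 0.91})
print(verify_chain(log))  # True unless the log has been tampered with

A production audit trail would add access control, external timestamping, and secure storage, but the hash chain conveys the core traceability property.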

Recommendation: Standards setting bodies should work with academia and industry to develop audit trail requirements for safety-critical applications of AI systems.

Organizations involved in setting technical standards–including governments and private actors–should establish clear guidance regarding how to make safety-critical AI systems fully auditable.48 Although application dependent, software audit trails often require a base set of traceability49 trails to be demonstrated for qualification;50 the decision to choose a certain set of trails requires considering trade-offs about efficiency, completeness, tamperproofing, and other design considerations. There is flexibility in the type of documents or evidence the auditee presents to satisfy these general traceability requirements

46 Such as the National Transportation Safety Board with regards to autonomous vehicle traffic accidents.
47 See Appendix III for discussion of assurance cases and related frameworks.
48 Others have argued for the importance of audit trails for AI elsewhere, sometimes under the banner of "logging." See, e.g., [72].
49 Traceability in this context refers to "the ability to verify the history, location, or application of an item by means of documented recorded identification" (https://en.wikipedia.org/wiki/Traceability), where the item in question is digital in nature, and might relate to various aspects of an AI system’s development and deployment process.
50 This includes traceability: between the system safety requirements and the software safety requirements, between the software safety requirements specification and software architecture, between the software safety requirements specification and software design, between the software design specification and the module and integration test specifications, between the system and software design requirements for hardware/software integration and the hardware/software integration test specifications, between the software safety requirements specification and the software safety validation plan, and between the software design specification and the software verification (including data verification) plan.


(e.g., between test logs and requirement documents, verification and validation activities, etc.).51

Existing standards often define in detail the required audit trails for specific applications. For example, IEC 61508 is a basic functional safety standard required by many industries, including nuclear power. Such standards are not yet established for AI systems. A wide array of audit trails related to an AI development process can already be produced, such as code changes, logs of training runs, all outputs of a model, etc. Inspiration might be taken from recent work on internal algorithmic auditing [36] and ongoing work on the documentation of AI systems more generally, such as the ABOUT ML project [27]. Importantly, we recommend that in order to have maximal impact, any standards for AI audit trails should be published freely, rather than requiring payment as is often the case.

51 See Appendix III.


3.2 Interpretability

Problem:
It’s difficult to verify claims about "black-box" AI systems that make predictions without explanations or visibility into their inner workings. This problem is compounded by a lack of consensus on what interpretability means.

Despite remarkable performance on a variety of problems, AI systems are frequently termed "black boxes" due to the perceived difficulty of understanding and anticipating their behavior. This lack of interpretability in AI systems has raised concerns about using AI models in high-stakes decision-making contexts where human welfare may be compromised [73]. Having a better understanding of how the internal processes within these systems work can help proactively anticipate points of failure, audit model behavior, and inspire approaches for new systems.

Research in model interpretability is aimed at helping to understand how and why a particular model works. A precise, technical definition for interpretability is elusive; by nature, the definition is subject to the inquirer. Characterizing desiderata for interpretable models is a helpful way to formalize interpretability [74] [75]. Useful interpretability tools for building trust are also highly dependent on the target user and the downstream task. For example, a model developer or regulator may be more interested in understanding model behavior over the entire input distribution whereas a novice layperson may wish to understand why the model made a particular prediction for their individual case.52

Crucially, an "interpretable" model may not be necessary for all situations. The weight we place upon a model being interpretable may depend upon a few different factors, for example:

• More emphasis in sensitive domains (e.g., autonomous driving or healthcare,53 where an incorrect prediction adversely impacts human welfare) or when it is important for end-users to have actionable recourse (e.g., bank loans) [77];

• Less emphasis when there is sufficient historical performance data (e.g., a model with a strong track record may be used even if it is not interpretable); and

    • Less emphasis if improving interpretability incurs other costs (e.g., compromising privacy).

In the longer term, for sensitive domains where human rights and/or welfare can be harmed, we anticipate that interpretability will be a key component of AI system audits, and that certain applications of AI will be gated on the success of providing adequate intuition to auditors about the model behavior. This is already the case in regulated domains such as finance [78].54

An ascendant topic of research is how to compare the relative merits of different interpretability methods in a sensible way. Two criteria appear to be crucial: a. The method should provide sufficient insight for

52 While definitions in this area are contested, some would distinguish between "interpretability" and "explainability" as categories for these two directions, respectively.

53 See, e.g., Sendak et al. [76], which focuses on building trust in a hospital context and contextualizes the role of interpretability in this process.

54 In New York, an investigation is ongoing into apparent gender discrimination associated with the Apple Card’s credit line allowances. This case illustrates the interplay of (a lack of) interpretability and the potential harms associated with automated decision-making systems [79].


the end-user to understand how the model is making its predictions (e.g., to assess if it aligns with human judgment), and b. the interpretable explanation should be faithful to the model, i.e., accurately reflect its underlying behavior.

Work on evaluating a., while limited in treatment, has primarily centered on comparing methods using human surveys [80]. More work at the intersection of human-computer interaction, cognitive science, and interpretability research–e.g., studying the efficacy of interpretability tools or exploring possible interfaces–would be welcome, as would further exploration of how practitioners currently use such tools [81] [82] [83] [78] [84].

Evaluating b., the reliability of existing methods, is an active area of research [85] [86] [87] [88] [89] [90] [91] [92] [93]. This effort is complicated by the lack of ground truth on system behavior (if we could reliably anticipate model behavior under all circumstances, we would not need an interpretability method). The wide use of interpretability tools in sensitive domains underscores the continued need to develop benchmarks that assess the reliability of produced model explanations.

It is important that techniques developed under the umbrella of interpretability not be used to provide clear explanations when such clarity is not feasible. Without sufficient rigor, interpretability could be used in service of unjustified trust by providing misleading explanations for system behavior. In identifying, carrying out, and/or funding research on interpretability, particular attention should be paid to whether and how such research might eventually aid in verifying claims about AI systems with high degrees of confidence to support risk assessment and auditing.

Recommendation: Organizations developing AI and funding bodies should support research into the interpretability of AI systems, with a focus on supporting risk assessment and auditing.

Some areas of interpretability research are more developed than others. For example, attribution methods for explaining individual predictions of computer vision models are arguably among the most well-developed research areas (a minimal sketch of one such method follows the list below). As such, we suggest that the following under-explored directions would be useful for the development of interpretability tools that could support verifiable claims about system properties:

• Developing and establishing consensus on the criteria, objectives, and frameworks for interpretability research;

• Studying the provenance of a learned model (e.g., as a function of the distribution of training data, choice of particular model families, or optimization) instead of treating models as fixed; and

• Constraining models to be interpretable by default, in contrast to the standard setting of trying to interpret a model post-hoc.
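
For reference, the sketch below shows the kind of well-developed attribution method mentioned above: a gradient-times-input saliency map for a PyTorch image classifier. It is a minimal illustration under assumed inputs (a hypothetical trained `model` and a normalized image tensor), not an endorsement of this particular method’s faithfulness.

import torch

def gradient_times_input(model, image, target_class):
    """Compute a simple gradient x input attribution map for one image.

    Large-magnitude entries indicate pixels whose small changes most affect
    the score of `target_class`; this is a common (if imperfect) saliency method.
    """
    image = image.clone().detach().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    return (image.grad * image).detach()

# Illustrative usage with a hypothetical classifier `model` and an image
# tensor of shape (3, 224, 224):
# attribution = gradient_times_input(model, image, target_class=281)
# heatmap = attribution.abs().sum(dim=0)  # per-pixel importance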

This list is not intended to be exhaustive, and we recognize that there is uncertainty about which research directions will ultimately bear fruit. We discuss the landscape of interpretability research further in Appendix VI(C).


3.3 Privacy-Preserving Machine Learning

Problem:
A range of methods can potentially be used to verifiably safeguard the data and models involved in AI development. However, standards are lacking for evaluating new privacy-preserving machine learning techniques, and the ability to implement them currently lies outside a typical AI developer’s skill set.

Training datasets for AI often include sensitive information about people, raising risks of privacy violation. These risks include unacceptable access to raw data (e.g., in the case of an untrusted employee or a data breach), unacceptable inference from a trained model (e.g., when sensitive private information can be extracted from a model), or unacceptable access to a model itself (e.g., when the model represents personalized preferences of an individual or is protected by intellectual property).

For individuals to trust claims about an ML system sufficiently so as to participate in its training, they need evidence about data access (who will have access to what kinds of data under what circumstances), data usage, and data protection. The AI development community, and other relevant communities, have developed a range of methods and mechanisms to address these concerns, under the general heading of "privacy-preserving machine learning" (PPML) [94].

    Privacy-preserving machine learning aims to protect the privacy of data or mo