
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5502–5515, July 5–10, 2020. ©2020 Association for Computational Linguistics


    Towards Debiasing Sentence Representations

Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, Louis-Philippe Morency

Machine Learning Department and Language Technologies Institute
Carnegie Mellon University
[email protected]

    Abstract

As natural language processing methods are increasingly deployed in real-world scenarios such as healthcare, legal systems, and social science, it becomes necessary to recognize the role they potentially play in shaping social biases and stereotypes. Previous work has revealed the presence of social biases in widely used word embeddings involving gender, race, religion, and other social constructs. While some methods were proposed to debias these word-level embeddings, there is a need to perform debiasing at the sentence-level given the recent shift towards new contextualized sentence representations such as ELMo and BERT. In this paper, we investigate the presence of social biases in sentence-level representations and propose a new method, SENT-DEBIAS, to reduce these biases. We show that SENT-DEBIAS is effective in removing biases, and at the same time, preserves performance on sentence-level downstream tasks such as sentiment analysis, linguistic acceptability, and natural language understanding. We hope that our work will inspire future research on characterizing and removing social biases from widely adopted sentence representations for fairer NLP.

1 Introduction

Machine learning tools for learning from language are increasingly deployed in real-world scenarios such as healthcare (Velupillai et al., 2018), legal systems (Dale, 2019), and computational social science (Bamman et al., 2016). Key to the success of these models are powerful embedding layers which learn continuous representations of input information such as words, sentences, and documents from large amounts of data (Devlin et al., 2019; Mikolov et al., 2013). Although word-level embeddings (Pennington et al., 2014; Mikolov et al., 2013) are highly informative features useful for a variety of tasks in Natural Language Processing (NLP), recent work has shown that word-level embeddings reflect and propagate social biases present in training corpora (Lauscher and Glavaš, 2019; Caliskan et al., 2017; Swinger et al., 2019; Bolukbasi et al., 2016). Machine learning systems that incorporate these word embeddings can further amplify biases (Sun et al., 2019b; Zhao et al., 2017; Barocas and Selbst, 2016) and unfairly discriminate against users, particularly those from disadvantaged social groups. Fortunately, researchers working on fairness and ethics in NLP have devised methods towards debiasing these word representations for both binary (Bolukbasi et al., 2016) and multiclass (Manzini et al., 2019) bias attributes such as gender, race, and religion.

More recently, sentence-level representations such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GPT (Radford et al., 2019) have become the preferred choice for text sequence encoding. When compared to word-level representations, these models have achieved better performance on multiple tasks in NLP (Wu and Dredze, 2019), multimodal learning (Zellers et al., 2019; Sun et al., 2019a), and grounded language learning (Urbanek et al., 2019). As their usage proliferates across various real-world applications (Huang et al., 2019; Alsentzer et al., 2019), it becomes necessary to recognize the role they play in shaping social biases and stereotypes.

Debiasing sentence representations is difficult for two reasons. Firstly, it is usually unfeasible to fully retrain many of the state-of-the-art sentence-based embedding models. In contrast with conventional word-level embeddings such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) which can be retrained on a single machine within a few hours, the best sentence encoders such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) are trained on massive amounts of text data over hundreds of machines for several weeks. As a result, it is difficult to retrain a new sentence encoder whenever a new source of bias is uncovered from data. We therefore focus on post-hoc debiasing techniques which add a post-training debiasing step to these sentence representations before they are used in downstream tasks (Bolukbasi et al., 2016; Manzini et al., 2019). Secondly, sentences display large variety in how they are composed from individual words. This variety is driven by many factors such as topics, individuals, settings, and even differences between spoken and written text. As a result, it is difficult to scale traditional word-level debiasing approaches (which involve bias-attribute words such as man, woman) (Bolukbasi et al., 2016) to sentences.

Related Work: Although there has been some recent work on measuring the presence of bias in sentence representations (May et al., 2019; Basta et al., 2019), none of them have been able to successfully remove bias from pretrained sentence representations. In particular, Zhao et al. (2019), Park et al. (2018), and Garg et al. (2019) are not able to perform post-hoc debiasing and require changing the data or underlying word embeddings and retraining, which is costly. Bordia and Bowman (2019) only study word-level language models and also require re-training. Finally, Kurita et al. (2019) only measure bias on BERT by extending the word-level Word Embedding Association Test (WEAT) (Caliskan et al., 2017) metric in a manner similar to May et al. (2019).

In this paper, as a compelling step towards generalizing debiasing methods to sentence representations, we capture the various ways in which bias-attribute words can be used in natural sentences. This is performed by contextualizing bias-attribute words using a diverse set of sentence templates from various text corpora into bias-attribute sentences. We propose SENT-DEBIAS, an extension of the HARD-DEBIAS method (Bolukbasi et al., 2016), to debias sentences for both binary(1) and multiclass bias attributes spanning gender and religion. Key to our approach is the contextualization step in which bias-attribute words are converted into bias-attribute sentences by using a diverse set of sentence templates from text corpora. Our experimental results demonstrate the importance of using a large number of diverse sentence templates when estimating bias subspaces of sentence representations. Our experiments are performed on two widely popular sentence encoders, BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), showing that our approach reduces the bias while preserving performance on downstream sequence tasks. We end with a discussion about possible shortcomings and present some directions for future work towards accurately characterizing and removing social biases from sentence representations for fairer NLP.

(1) Although we recognize that gender is non-binary and there are many important ethical principles in the design, ascription of categories/variables to study participants, and reporting of results in studying gender as a variable in NLP (Larson, 2017), for the purpose of this study, we follow existing research and focus on female and male gendered terms.

Binary Gender: (man, woman), (he, she), (father, mother), (son, daughter)
Multiclass Religion: (jewish, christian, muslim), (torah, bible, quran), (synagogue, church, mosque), (rabbi, priest, imam)

Table 1: Examples of word pairs to estimate the binary gender bias subspace and the 3-class religion bias subspace in our experiments.

    2 Debiasing Sentence Representations

Our proposed method for debiasing sentence representations, SENT-DEBIAS, consists of four steps: 1) defining the words which exhibit bias attributes, 2) contextualizing these words into bias attribute sentences and subsequently their sentence representations, 3) estimating the sentence representation bias subspace, and finally 4) debiasing general sentences by removing the projection onto this bias subspace. We summarize these steps in Algorithm 1 and describe the algorithmic details in the following subsections.

1) Defining Bias Attributes: The first step involves identifying the bias attributes and defining a set of bias attribute words that are indicative of these attributes. For example, when characterizing bias across the male and female genders, we use the word pairs (man, woman), (boy, girl) that are indicative of gender. When estimating the 3-class religion subspace across the Jewish, Christian, and Muslim religions, we use the tuples (Judaism, Christianity, Islam), (Synagogue, Church, Mosque). Each tuple should consist of words that have an equivalent meaning except for the bias attribute. In general, for $d$-class bias attributes, the set of words forms a dataset $D = \{(w_1^{(i)}, \ldots, w_d^{(i)})\}_{i=1}^{m}$ of $m$ entries where each entry $(w_1, \ldots, w_d)$ is a $d$-tuple of words that are each representative of a particular bias attribute (we drop the superscript $(i)$ when it is clear from the context). Table 1 shows some bias attribute words that we use to estimate the bias subspaces for binary gender and multiclass religious attributes (full pairs and triplets in the appendix).

Algorithm 1 SENT-DEBIAS: a debiasing algorithm for sentence representations.
1: Initialize (usually pretrained) sentence encoder $M_\theta$.
2: Define bias attributes (e.g. binary gender $g_m$ and $g_f$).
3: Obtain words $D = \{(w_1^{(i)}, \ldots, w_d^{(i)})\}_{i=1}^{m}$ indicative of bias attributes (e.g. Table 1).
4: $S = \bigcup_{i=1}^{m} \textsc{Contextualize}(w_1^{(i)}, \ldots, w_d^{(i)}) = \{(s_1^{(i)}, \ldots, s_d^{(i)})\}_{i=1}^{n}$  // words into sentences
5: for $j \in [d]$ do
6:   $R_j = \{M_\theta(s_j^{(i)})\}_{i=1}^{n}$  // get sentence representations
7: end for
8: $V = \text{PCA}_k\big(\bigcup_{j=1}^{d} \bigcup_{\mathbf{w} \in R_j} (\mathbf{w} - \mu_j)\big)$  // compute bias subspace
9: for each new sentence representation $\mathbf{h}$ do
10:   $\mathbf{h}_V = \sum_{j=1}^{k} \langle \mathbf{h}, \mathbf{v}_j \rangle \mathbf{v}_j$  // project onto bias subspace
11:   $\hat{\mathbf{h}} = \mathbf{h} - \mathbf{h}_V$  // subtract projection
12: end for

Existing methods that investigate biases tend to operate at the word-level, which simplifies the problem since the set of tokens is bounded by the vocabulary size (Bolukbasi et al., 2016). This simple approach has the advantage of identifying the presence of biases using predefined sets of word associations and estimating the bias subspace using the predefined bias word pairs. On the other hand, the potential number of sentences is unbounded, which makes it harder to precisely characterize the sentences in which bias is present or absent. Therefore, it is not trivial to directly convert these words to sentences to obtain a representation from pretrained sentence encoders. In the subsection below, we describe our solution to this problem.

2) Contextualizing Words into Sentences: A core step in our SENT-DEBIAS approach involves contextualizing the predefined sets of bias attribute words to sentences so that sentence encoders can be applied to obtain sentence representations. One option is to use a simple template-based design to simplify the contextual associations a sentence encoder makes with a given term, similar to how May et al. (2019) proposed to measure (but not remove) bias in sentence representations. For example, each word can be slotted into templates such as "This is ___." and "I am a ___.". We take an alternative perspective and hypothesize that for a given bias attribute (e.g. gender), a single bias subspace exists across all possible sentence representations. For example, the bias subspace should be the same in the sentences "The boy is coding.", "The girl is coding.", "The boys at the playground.", and "The girls at the playground.". In order to estimate this bias subspace accurately, it becomes important to use sentence templates that are as diverse as possible to account for all occurrences of that word in surrounding contexts. In our experiments, we empirically demonstrate that estimating the bias subspace using a large and diverse set of templates from text corpora leads to improved bias reduction as compared to using simple templates.

To capture the variety in syntax across sentences, we use large text corpora to find naturally occurring sentences. These naturally occurring sentences therefore become our sentence "templates". To use these templates to generate new sentences, we replace words representing a single class with another. For example, a sentence containing a male term "he" is used to generate a new sentence by replacing it with the corresponding female term "she". This contextualization process is repeated for all word tuples in the bias attribute word dataset $D$, eventually contextualizing the given set of bias attribute words into bias attribute sentences. Since there are multiple templates which a bias attribute word can map to, the contextualization process results in a bias attribute sentence dataset $S$ which is substantially larger in size:

$$S = \bigcup_{i=1}^{m} \textsc{Contextualize}(w_1^{(i)}, \ldots, w_d^{(i)}) \quad (1)$$
$$\;\;\; = \{(s_1^{(i)}, \ldots, s_d^{(i)})\}_{i=1}^{n}, \quad |S| > |D| \quad (2)$$

where $\textsc{Contextualize}(w_1, \ldots, w_d)$ is a function which returns a set of sentences obtained by matching words with naturally-occurring sentence templates from text corpora.


Dataset    | Type    | Topics                               | Formality | Length | Samples
WikiText-2 | written | everything                           | formal    | 24.0   | "the mailing contained information about their history and advised people to read several books, which primarily focused on {jewish/christian/muslim} history"
SST        | written | movie reviews                        | informal  | 19.2   | "{his/her} fans walked out muttering words like horrible and terrible, but had so much fun dissing the film that they didn't mind the ticket cost."
Reddit     | written | politics, electronics, relationships | informal  | 13.6   | "roommate cut my hair without my consent, ended up cutting {himself/herself} and is threatening to call the police on me"
MELD       | spoken  | comedy TV-series                     | informal  | 8.1    | "that's the kind of strength that I want in the {man/woman} I love!"
POM        | spoken  | opinion videos                       | informal  | 16.0   | "and {his/her} family is, like, incredibly confused"

Table 2: Comparison of the various datasets used to find natural sentence templates. Length is the average number of words in a sentence. The words in braces are the bias-attribute words used to estimate the binary gender or multiclass religion subspaces, e.g. (man, woman), (jewish, christian, muslim). This demonstrates the variety in our naturally occurring sentence templates in terms of topics, formality, and spoken/written text.

Our text corpora originate from the following five sources: 1) WikiText-2 (Merity et al., 2017a), a dataset of formally written Wikipedia articles (we only use the first 10% of WikiText-2, which we found to be sufficient to capture formally written text), 2) Stanford Sentiment Treebank (Socher et al., 2013), a collection of 10,000 polarized written movie reviews, 3) Reddit data collected from discussion forums related to politics, electronics, and relationships, 4) MELD (Poria et al., 2019), a large-scale multimodal multi-party emotional dialog dataset collected from the TV-series Friends, and 5) POM (Park et al., 2014), a dataset of spoken review videos collected across 1,000 individuals spanning multiple topics. These datasets have been the subject of recent research in language understanding (Merity et al., 2017b; Liu et al., 2019; Wang et al., 2019) and multimodal human language (Liang et al., 2018, 2019). Table 2 summarizes these datasets. We also give some examples of the diverse templates that occur naturally across various individuals, settings, and in both written and spoken text.
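To make the contextualization step concrete, here is a minimal Python sketch (not the released implementation; the exact tokenization and matching rules in the official code may differ): every naturally occurring template that contains a word from a bias-attribute tuple is rewritten once per class.

```python
def contextualize(templates, word_tuples):
    """Match bias-attribute words against naturally occurring template
    sentences and return a list of d-tuples of sentences, one sentence
    per bias class (a sketch of the CONTEXTUALIZE function above)."""
    sentence_tuples = []
    for sentence in templates:
        tokens = sentence.lower().split()
        for word_tuple in word_tuples:        # e.g. ("he", "she") or ("torah", "bible", "quran")
            for w in word_tuple:
                if w in tokens:
                    # rewrite the template once for every class in the tuple
                    variants = tuple(
                        " ".join(v if t == w else t for t in tokens)
                        for v in word_tuple
                    )
                    sentence_tuples.append(variants)
    return sentence_tuples

templates = ["the boy is coding .", "she gave a talk ."]
gender_pairs = [("he", "she"), ("boy", "girl"), ("man", "woman")]
print(contextualize(templates, gender_pairs))
# [('the boy is coding .', 'the girl is coding .'), ('he gave a talk .', 'she gave a talk .')]
```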

3) Estimating the Bias Subspace: Now that we have contextualized all $m$ word $d$-tuples in $D$ into $n$ sentence $d$-tuples $S$, we pass these sentences through a pre-trained sentence encoder (e.g. BERT, ELMo) to obtain sentence representations. Suppose we have a pre-trained encoder $M_\theta$ with parameters $\theta$. Define $R_j$, $j \in [d]$, as sets that collect all sentence representations of the $j$-th entry in the $d$-tuple, $R_j = \{M_\theta(s_j^{(i)})\}_{i=1}^{n}$. Each of these sets $R_j$ defines a vector space in which a specific bias attribute is present across its contexts. For example, when dealing with binary gender bias, $R_1$ (likewise $R_2$) defines the space of sentence representations with a male (likewise female) context. The only difference between the representations in $R_1$ versus $R_2$ should be the specific bias attribute present. Define the mean of set $j$ as $\mu_j = \frac{1}{|R_j|} \sum_{\mathbf{w} \in R_j} \mathbf{w}$. The bias subspace $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is given by the first $k$ components of principal component analysis (PCA) (Abdi and Williams, 2010):

$$V = \text{PCA}_k\Big(\bigcup_{j=1}^{d} \bigcup_{\mathbf{w} \in R_j} (\mathbf{w} - \mu_j)\Big). \quad (3)$$

$k$ is a hyperparameter in our experiments which determines the dimension of the bias subspace. Intuitively, $V$ represents the top-$k$ orthogonal directions which most represent the bias subspace.
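A sketch of this subspace estimation in numpy/scikit-learn is shown below; `encode` is an assumed helper that maps a sentence to a fixed-length vector (e.g. a wrapper around a pretrained encoder), not part of the paper's code.

```python
import numpy as np
from sklearn.decomposition import PCA

def estimate_bias_subspace(sentence_tuples, encode, k=1):
    """Estimate the bias subspace V of Eq. (3).
    sentence_tuples: list of d-tuples of sentences from CONTEXTUALIZE.
    encode: callable mapping a sentence to a 1-D numpy array.
    Returns a (k, dim) array whose rows span the bias subspace."""
    d = len(sentence_tuples[0])
    # R_j collects the representations of the j-th class across all contexts
    R = [np.stack([encode(tup[j]) for tup in sentence_tuples]) for j in range(d)]
    # center each class set by its own mean mu_j, pool everything, and run PCA
    centered = np.concatenate([R_j - R_j.mean(axis=0, keepdims=True) for R_j in R])
    return PCA(n_components=k).fit(centered).components_   # shape (k, dim)
```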

4) Debiasing: Given the estimated bias subspace $V$, we apply a partial version of the HARD-DEBIAS algorithm (Bolukbasi et al., 2016) to remove bias from new sentence representations. Taking the example of binary gender bias, the HARD-DEBIAS algorithm consists of two steps:

4a) Neutralize: Bias components are removed from sentences that are not gendered and should not contain gender bias (e.g., I am a doctor., That nurse is taking care of the patient.) by removing the projection onto the bias subspace. More formally, given a representation $\mathbf{h}$ of a sentence and the previously estimated gender subspace $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$, the debiased representation $\hat{\mathbf{h}}$ is given by first obtaining $\mathbf{h}_V$, the projection of $\mathbf{h}$ onto the bias subspace $V$, before subtracting $\mathbf{h}_V$ from $\mathbf{h}$. This results in a vector that is orthogonal to the bias subspace $V$ and therefore contains no bias:

$$\mathbf{h}_V = \sum_{j=1}^{k} \langle \mathbf{h}, \mathbf{v}_j \rangle \mathbf{v}_j, \quad (4)$$
$$\hat{\mathbf{h}} = \mathbf{h} - \mathbf{h}_V. \quad (5)$$
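In code, the neutralize step of Eqs. (4)-(5) is a simple projection removal; a minimal numpy sketch (assuming V holds orthonormal rows, as returned by PCA):

```python
import numpy as np

def debias(h, V):
    """Remove from sentence representation h its projection onto the
    bias subspace spanned by the rows of V (Eqs. 4-5)."""
    h = np.asarray(h, dtype=float)
    h_V = sum(np.dot(h, v) * v for v in V)   # projection onto the bias subspace
    return h - h_V                           # component orthogonal to every bias direction
```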

4b) Equalize: Gendered representations are centered and their bias components are equalized (e.g. man and woman should have bias components in opposite directions, but of the same magnitude). This ensures that any neutral words are equidistant to biased words with respect to the bias subspace. In our implementation, we skip this Equalize step because it is hard to identify all or even the majority of sentence pairs to be equalized due to the complexity of natural sentences. For example, we can never find all the sentences that man and woman appear in to equalize them appropriately. Note that even if the magnitudes of sentence representations are not normalized, the debiased representations are still pointing in directions orthogonal to the bias subspace. Therefore, skipping the Equalize step still results in debiased sentence representations as measured by our definition of bias.

3 Experiments

We test the effectiveness of SENT-DEBIAS at removing biases and retaining performance on downstream tasks. All experiments are conducted on English terms and downstream tasks. We acknowledge that biases can manifest differently across different languages, in particular gendered languages (Zhou et al., 2019), and emphasize the need for future extensions in these directions. Experimental details are in the appendix and code is released at https://github.com/pliang279/sent_debias.

3.1 Evaluating Biases

Biases are traditionally measured using the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). WEAT measures bias in word embeddings by comparing two sets of target words to two sets of attribute words. For example, to measure social bias surrounding genders with respect to careers, one could use the target words programmer, engineer, scientist, and nurse, teacher, librarian, and the attribute words man, male, and woman, female. Unbiased word representations should display no difference between the two target word sets in terms of their relative similarity to the two sets of attribute words. The relative similarity as measured by WEAT is commonly known as the effect size. An effect size with absolute value closer to 0 represents lower bias.

To measure the bias present in sentence representations, we use the method described in May et al. (2019), which extended WEAT to the Sentence Encoder Association Test (SEAT). For a given set of words for a particular test, words are converted into sentences using a template-based method. The WEAT metric can then be applied to fixed-length, pre-trained sentence representations. To measure bias over multiple classes, we use the Mean Average Cosine similarity (MAC) metric, which extends SEAT to a multiclass setting (Manzini et al., 2019). For the binary gender setting, we use words from the Caliskan Tests (Caliskan et al., 2017) which measure biases in common stereotypes surrounding gendered names with respect to careers, math, and science (Greenwald et al., 2009). To evaluate biases in the multiclass religion setting, we modify the Caliskan Tests used in May et al. (2019) with lexicons used by Manzini et al. (2019).
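For reference, the WEAT/SEAT effect size can be computed as below; this is a generic sketch of the Caliskan et al. (2017) statistic over representation vectors, not the exact evaluation script used in the paper (e.g. the choice of standard-deviation estimator may differ).

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean cosine similarity of w to attribute set A minus to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    """WEAT/SEAT effect size for target sets X, Y and attribute sets A, B,
    each a list of representation vectors. Values near 0 indicate low bias."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)
```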

3.2 Debiasing Setup

We first describe the details of applying SENT-DEBIAS on two widely-used sentence encoders: BERT(2) (Devlin et al., 2019) and ELMo (Peters et al., 2018). Note that the pre-trained BERT encoder must be fine-tuned on task-specific data. This implies that the final BERT encoder used during debiasing changes from task to task. To account for these differences, we report two sets of metrics: 1) BERT: simply debiasing the pre-trained BERT encoder, and 2) BERT post task: first fine-tuning BERT and post-processing (i.e. normalization) on a specific task before the final BERT representations are debiased. We apply SENT-DEBIAS on BERT fine-tuned on two single-sentence datasets, Stanford Sentiment Treebank (SST-2) sentiment classification (Socher et al., 2013) and Corpus of Linguistic Acceptability (CoLA) grammatical acceptability judgment (Warstadt et al., 2018). It is also possible to apply BERT (Devlin et al., 2019) on downstream tasks that involve two sentences. The output sentence pair representation can also be debiased (after fine-tuning and normalization). We test the effect of SENT-DEBIAS on Question Natural Language Inference (QNLI) (Wang et al., 2018), which converts the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) into a binary classification task. These results are reported as BERT post SST-2, BERT post CoLA, and BERT post QNLI respectively.

(2) We used uncased BERT-Base throughout all experiments.



Test                          | BERT            | BERT post SST-2 | BERT post CoLA  | BERT post QNLI  | ELMo
C6: M/F Names, Career/Family  | +0.477 → −0.096 | +0.036 → −0.109 | −0.009 → +0.149 | −0.261 → −0.054 | −0.380 → −0.298
C6b: M/F Terms, Career/Family | +0.108 → −0.437 | +0.010 → −0.057 | +0.199 → +0.186 | −0.155 → −0.004 | −0.345 → −0.327
C7: M/F Terms, Math/Arts      | +0.253 → +0.194 | −0.219 → −0.221 | +0.268 → +0.311 | −0.584 → −0.083 | −0.479 → −0.487
C7b: M/F Names, Math/Arts     | +0.254 → +0.194 | +1.153 → −0.755 | +0.150 → +0.308 | −0.581 → −0.629 | +0.016 → −0.013
C8: M/F Terms, Science/Arts   | +0.399 → −0.075 | +0.103 → +0.081 | +0.425 → −0.163 | −0.087 → +0.716 | −0.296 → −0.327
C8b: M/F Names, Science/Arts  | +0.636 → +0.540 | −0.222 → −0.047 | +0.032 → −0.192 | −0.521 → −0.443 | +0.554 → +0.548
Multiclass Caliskan           | +0.035 → +0.379 | +1.200 → +1.000 | +0.243 → +0.757 | –               | –

Table 3: Debiasing results on BERT and ELMo sentence representations. The first six rows measure binary SEAT effect sizes for sentence-level tests adapted from the Caliskan tests. SEAT scores closer to 0 represent lower bias. CN: test from Caliskan et al. (2017), row N. The last row measures bias in a multiclass religion setting using MAC (Manzini et al., 2019) before and after debiasing. The MAC score ranges from 0 to 2 and closer to 1 represents lower bias. Results are reported as x1 → x2 where x1 is the score before debiasing and x2 the score after. Our method reduces the bias of BERT and ELMo for the majority of binary and multiclass tests.

For ELMo, the encoder stays the same for downstream tasks (no fine-tuning on different tasks), so we just debias the ELMo sentence encoder. We report this result as ELMo.

3.3 Debiasing Results

We present these debiasing results in Table 3, and see that for both binary gender bias and multiclass religion bias, our proposed method reduces the amount of bias as measured by the given tests and metrics. The reduction in bias is most pronounced when debiasing the pre-trained BERT encoder. We also observe that simply fine-tuning the BERT encoder for specific tasks also reduces the biases present as measured by the Caliskan tests, to some extent. However, fine-tuning does not lead to consistent decreases in bias and cannot be used as a standalone debiasing method. Furthermore, fine-tuning does not give us control over which type of bias to control for and may even amplify bias if the task data is skewed towards particular biases. For example, while the bias effect size as measured by Caliskan test C7 decreases from +0.542 to −0.033 and +0.288 after fine-tuning on SST-2 and CoLA respectively, the effect size as measured by the multiclass Caliskan test increases from +0.035 to +1.200 and +0.243 after fine-tuning on SST-2 and CoLA respectively.

3.4 Comparison with Baselines

We compare to three baseline methods for debiasing: 1) FastText derives debiased sentence embeddings using an average of debiased FastText word embeddings (Bojanowski et al., 2016) using word-level debiasing methods (Bolukbasi et al., 2016), 2) BERT word obtains a debiased sentence representation from average debiased BERT word representations, again debiased using word-level debiasing methods (Bolukbasi et al., 2016), and 3) BERT simple adapts May et al. (2019) by using simple templates to debias BERT sentence representations. From Table 4, SENT-DEBIAS achieves a lower average absolute effect size and outperforms the baselines based on debiasing at the word-level and averaging across all words. This indicates that it is not sufficient to debias words only and that biases in a sentence could arise from their debiased word constituents. In comparison with BERT simple, we observe that using diverse sentence templates obtained from naturally occurring written and spoken text makes a difference in how well we can remove biases from sentence representations. This supports our hypothesis that using increasingly diverse templates estimates a bias subspace that generalizes to different words in their context.

Debiasing Method                    | Ave. Abs. Effect Size
BERT original (Devlin et al., 2019) | +0.354
FastText (Bojanowski et al., 2016)  | +0.565
BERT word (Bolukbasi et al., 2016)  | +0.861
BERT simple (May et al., 2019)      | +0.298
SENT-DEBIAS BERT (ours)             | +0.256

Table 4: Comparison of various debiasing methods on sentence embeddings. FastText (Bojanowski et al., 2016) (and BERT word) derives debiased sentence embeddings from an average of debiased FastText (and BERT) word embeddings using word-level debiasing methods (Bolukbasi et al., 2016). BERT simple adapts May et al. (2019) by using simple templates to debias BERT representations. SENT-DEBIAS BERT is our method using diverse templates. We report the average absolute effect size across all Caliskan tests. Average scores closer to 0 represent lower bias.


Figure 1: Influence of the number of templates on the effectiveness of bias removal on BERT fine-tuned on SST-2 (left) and BERT fine-tuned on QNLI (right). All templates are from WikiText-2. The solid line represents the mean over different combinations of domains and the shaded area represents the standard deviation. As increasing subsets of data are used, we observe a decreasing trend and lower variance in average absolute effect size.

3.5 Effect of Templates

We further test the importance of sentence templates through two experiments.

1) Same Domain, More Quantity: Firstly, we ask: how does the number of sentence templates impact debiasing performance? To answer this, we begin with the largest domain, WikiText-2 (13,750 templates), and divide it into 5 partitions each of size 2,750. We collect sentence templates using all possible combinations of the 5 partitions and apply these sentence templates in the contextualization step of SENT-DEBIAS. We then estimate the corresponding bias subspace, debias, and measure the average absolute values of all 6 SEAT effect sizes. Since different combinations of the 5 partitions result in sets of sentence templates of different sizes (20%, 40%, 60%, 80%, 100%), this allows us to see the relationship between size and debiasing performance. Combinations with the same percentage of data are grouped together and for each group we compute the mean and standard deviation of the average absolute effect sizes. We perform the above steps to debias BERT fine-tuned on SST-2 and QNLI and plot these results in Figure 1. Please refer to the appendix for experiments with BERT fine-tuned on CoLA, which show similar results.

For BERT fine-tuned on SST-2, we observe a decreasing trend in the effect size as increasing subsets of the data are used. For BERT fine-tuned on QNLI, there is a decreasing trend that quickly tapers off. However, using a larger number of templates reduces the variance in average absolute effect size and improves the stability of the SENT-DEBIAS algorithm. These observations allow us to conclude the importance of using a large number of templates from naturally occurring text corpora.
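The partition-combination protocol described above can be sketched as follows; `run_debias_and_eval` is an assumed helper (not in the paper) that contextualizes with the given templates, debiases, and returns the average absolute SEAT effect size.

```python
from itertools import combinations
import numpy as np

def subset_effect_sizes(partitions, run_debias_and_eval):
    """partitions: the 5 equal-sized template partitions of WikiText-2.
    Returns {fraction_of_templates: (mean, std)} of the average absolute
    effect size over all combinations of that many partitions."""
    results = {}
    for r in range(1, len(partitions) + 1):
        scores = [run_debias_and_eval([t for part in combo for t in part])
                  for combo in combinations(partitions, r)]
        results[r / len(partitions)] = (np.mean(scores), np.std(scores))
    return results
```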

2) Same Quantity, More Domains: How does the number of domains that sentence templates are extracted from impact debiasing performance? We fix the total number of sentence templates to be 1,080 and vary the number of domains these templates are drawn from. Given a target number k, we first choose k domains from our Reddit, SST, POM, and WikiText-2 datasets and randomly sample 1080/k templates from each of the k selected domains. We construct 1,080 templates using all possible subsets of k domains and apply them in the contextualization step of SENT-DEBIAS. We estimate the corresponding bias subspace, debias, and measure the average absolute SEAT effect sizes. To see the relationship between the number of domains k and debiasing performance, we group combinations with the same number of domains (k) and for each group compute the mean and standard deviation of the average absolute effect sizes. This experiment is also performed for BERT fine-tuned on the SST-2 and QNLI datasets. Results are plotted in Figure 2.

We draw similar observations: there is a decreasing trend in effect size as templates are drawn from more domains. For BERT fine-tuned on QNLI, using a larger number of domains reduces the variance in effect size and improves the stability of the algorithm. Therefore, it is important to use a large variety of templates across different domains.

Figure 2: Influence of the number of template domains on the effectiveness of bias removal on BERT fine-tuned on SST-2 (left) and BERT fine-tuned on QNLI (right). The domains span the Reddit, SST, POM, and WikiText-2 datasets. The solid line is the mean over different combinations of domains and the shaded area is the standard deviation. As more domains are used, we observe a decreasing trend and lower variance in average absolute effect size.

3.6 Visualization

As a qualitative analysis of the debiasing process, we visualize how the distances between sentence representations shift after the debiasing process is performed. We average the sentence representations of a concept (e.g. man, woman, science, art) across its contexts (sentence templates) and plot the t-SNE (van der Maaten and Hinton, 2008) embeddings of these points in 2D space. From Figure 3, we observe that BERT average representations of science and technology start off closer to man while literature and art are closer to woman. After debiasing, non gender-specific concepts (e.g. science, art) become more equidistant to both man and woman average concepts.

Figure 3: t-SNE plots of average sentence representations of a word across its sentence templates before (left, pretrained BERT embeddings) and after (right, debiased BERT embeddings) debiasing. After debiasing, non gender-specific concepts (in black) are more equidistant to genders.
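The visualization can be reproduced with an off-the-shelf t-SNE implementation; a sketch with scikit-learn is shown below (hyper-parameters such as the perplexity are illustrative and must stay below the number of concepts).

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_concepts(concept_vectors, labels):
    """concept_vectors: (n_concepts, dim) array of averaged sentence
    representations, one row per concept (man, woman, science, art, ...).
    Returns 2-D coordinates for a scatter plot like Figure 3."""
    coords = TSNE(n_components=2, perplexity=3, init="pca",
                  random_state=0).fit_transform(np.asarray(concept_vectors))
    for (x, y), name in zip(coords, labels):
        print(f"{name}: ({x:.2f}, {y:.2f})")
    return coords
```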

3.7 Performance on Downstream Tasks

To ensure that debiasing does not hurt the performance on downstream tasks, we report the performance of our debiased BERT and ELMo on SST-2 and CoLA by training a linear classifier on top of debiased BERT sentence representations. From Table 5, we observe that downstream task performance shows a small decrease, ranging from 1-3%, after the debiasing process. However, the performance of ELMo on SST-2 increases slightly from 89.6 to 90.0. We hypothesize that these differences in performance are due to the fact that CoLA tests for linguistic acceptability, so it is more concerned with low-level syntactic structure such as verb usage, grammar, and tenses. As a result, changes in sentence representations across bias directions may impact its performance more. For example, sentence representations after the gender debiasing steps may display a mismatch between gendered pronouns and the sentence context. For SST, it has been shown that sentiment analysis datasets have labels that correlate with gender information and therefore contain gender bias (Kiritchenko and Mohammad, 2018). As a result, we do expect possible decreases in accuracy after debiasing. Finally, we test the effect of SENT-DEBIAS on QNLI by training a classifier on top of debiased BERT sentence pair representations. We observe little impact on task performance: our debiased BERT fine-tuned on QNLI achieves 90.6% performance as compared to the 91.3% we obtained without debiasing.

Test  | BERT | debiased BERT | ELMo | debiased ELMo
SST-2 | 92.7 | 89.1          | 89.6 | 90.0
CoLA  | 57.6 | 55.4          | 39.1 | 37.1
QNLI  | 91.3 | 90.6          | –    | –

Table 5: We test the effect of SENT-DEBIAS on both single sentence (BERT and ELMo on SST-2, CoLA) and paired sentence (BERT on QNLI) downstream tasks. The performance (higher is better) of debiased BERT and ELMo sentence representations on downstream tasks is not hurt by the debiasing step.

    4 Discussion and Future Work

Firstly, we would like to emphasize that the WEAT, SEAT, and MAC metrics are not perfect since they only have positive predictive ability: they can be used to detect the presence of biases but not their absence (Gonen and Goldberg, 2019). This calls for new metrics that evaluate biases and can scale to the various types of sentences appearing across different individuals, topics, and in both spoken and written text. We believe that our positive results regarding contextualizing words into sentences imply that future work can build on our algorithms and tailor them for new metrics.

Secondly, a particular bias should only be removed from words and sentences that are neutral to that attribute. For example, gender bias should not be removed from the word "grandmother" or the sentence "she gave birth to me". Previous work on debiasing word representations tackled this issue by listing all attribute-specific words based on dictionary definitions and only debiasing the remaining words. However, given the complexity of natural sentences, it is extremely hard to identify the set of neutral sentences and its complement. Thus, in downstream tasks, we removed bias from all sentences, which could possibly harm downstream task performance if the dataset contains a significant number of non-neutral sentences.

Finally, a fundamental challenge lies in the fact that these representations are trained without explicit bias control mechanisms on large amounts of naturally occurring text. Given that it becomes infeasible (in standard settings) to completely retrain these large sentence encoders for debiasing (Zhao et al., 2018; Zhang et al., 2018), future work should focus on developing better post-hoc debiasing techniques. In our experiments, we needed to re-estimate the bias subspace and perform debiasing whenever the BERT encoder was fine-tuned. It remains to be seen whether there are debiasing methods which are invariant to fine-tuning, or can be efficiently re-estimated as the encoders are fine-tuned.

5 Conclusion

This paper investigated the post-hoc removal of social biases from pretrained sentence representations. We proposed the SENT-DEBIAS method that accurately captures the bias subspace of sentence representations by using a diverse set of templates from naturally occurring text corpora. Our experiments show that we can remove biases that occur in BERT and ELMo while preserving performance on downstream tasks. We also demonstrate the importance of using a large number of diverse sentence templates when estimating bias subspaces. Leveraging these developments will allow researchers to further characterize and remove social biases from sentence representations for fairer NLP.

    Acknowledgements

PPL and LPM were supported in part by the National Science Foundation (Awards #1750439, #1722822) and National Institutes of Health. RS was supported in part by US Army, ONR, Apple, and NSF IIS1763562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes of Health, DARPA, and AFRL, and no official endorsement should be inferred. We also acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments.

References

Hervé Abdi and Lynne J. Williams. 2010. Principal component analysis. WIREs Computational Statistics, 2(4):433–459.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

David Bamman, A. Seza Doğruöz, Jacob Eisenstein, Dirk Hovy, David Jurgens, Brendan O'Connor, Alice Oh, Oren Tsur, and Svitlana Volkova. 2016. Proceedings of the First Workshop on NLP and Computational Social Science.

Solon Barocas and Andrew D. Selbst. 2016. Big data's disparate impact. California Law Review, 104:671.

Christine Basta, Marta R. Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 33–39, Florence, Italy. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. of NIPS, pages 4349–4357.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, Minnesota. Association for Computational Linguistics.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science.

Robert Dale. 2019. Law and word order: NLP in legal tech. Natural Language Engineering, 25(1):211–217.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 219–226, New York, NY, USA. Association for Computing Machinery.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Anthony Greenwald, T. Poehlman, Eric Luis Uhlmann, and Mahzarin Banaji. 2009. Understanding and using the implicit association test: III. Meta-analysis of predictive validity.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. CoRR, abs/1904.05342.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics.

Brian Larson. 2017. Gender as a variable in natural-language processing: Ethical considerations. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 1–11, Valencia, Spain. Association for Computational Linguistics.

Anne Lauscher and Goran Glavaš. 2019. Are we consistently biased? Multidimensional analysis of biases in distributional word vectors. In *SEM 2019.

Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2019. Strong and simple baselines for multimodal utterance embeddings. In NAACL-HLT, pages 2599–2609, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621, Minneapolis, Minnesota. Association for Computational Linguistics.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017a. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017b. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, Florence, Italy. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Chen Sun, Austin Oliver Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019a. VideoBERT: A joint model for video and language representation learning. CoRR, abs/1904.01766.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019b. Mitigating gender bias in natural language processing: Literature review. In ACL.

Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. 2019. What are the biases in my word embedding? In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 305–311, New York, NY, USA. Association for Computing Machinery.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. CoRR, abs/1903.03094.

Sumithra Velupillai, Hanna Suominen, Maria Liakata, Angus Roberts, Anoop D. Shah, Katherine Morley, David Osborn, Joseph Hayes, Robert Stewart, Johnny Downs, Wendy Chapman, and Rina Dutta. 2018. Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances. Journal of Biomedical Informatics, 88:11–19.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP.

Chenguang Wang, Mu Li, and Alexander J. Smola. 2019. Language models with transformers. CoRR, abs/1904.09408.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In AIES.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 629–634, Minneapolis, Minnesota. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium. Association for Computational Linguistics.

Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276–5284, Hong Kong, China. Association for Computational Linguistics.


    A Debiasing Details

We provide some details on estimating the bias subspaces and debiasing steps.

Bias Attribute Words: Table 6 shows the bias attribute words we used to estimate the bias subspaces for binary gender bias and multiclass religious biases.

Datasets: We provide some details on dataset downloading below:

1. WikiText-2 was downloaded from https://github.com/pytorch/examples/tree/master/word_language_model. We took the first 10% of WikiText-2 sentences as naturally occurring templates representative of highly formal text.

2. Hacker News and Reddit Subreddit data collected from news and discussion forums related to topics ranging from politics to electronics was downloaded from https://github.com/minimaxir/textgenrnn/.

    B Experimental Details

    B.1 BERT

All three variants of BERT (BERT, BERT post SST, BERT post CoLA) are uncased base models with the hyper-parameters described in Table 7.

For all three models, the second output ("pooled output") of BERT is treated as the sentence embedding. The variant BERT is the pretrained model with weights downloaded from https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz. The variant BERT post SST is BERT after being fine-tuned on the Stanford Sentiment Treebank (SST-2) task, a binary single-sentence classification task (Socher et al., 2013). During fine-tuning, we first normalize the sentence embedding and then feed it into a linear layer for classification. The variant BERT post CoLA is BERT fine-tuned on the Corpus of Linguistic Acceptability (CoLA) task, a binary single-sentence classification task. Normalization and classification are done exactly the same as for BERT post SST. All BERT models are fine-tuned for 3 epochs, which is the default hyper-parameter in the huggingface transformers repository. Debiasing for BERT models that are fine-tuned is done just before the classification layer.
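The "normalize, debias, then classify" pipeline described here can be sketched in PyTorch as below; this is an illustration, not the released training code, and `encoder` / `bias_directions` are assumed inputs (a module returning BERT's pooled output and the estimated bias subspace, respectively).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DebiasedClassifier(nn.Module):
    """Linear classifier over a normalized, debiased pooled sentence embedding."""
    def __init__(self, encoder, hidden_size, num_labels, bias_directions):
        super().__init__()
        self.encoder = encoder                        # returns (batch, hidden_size)
        self.register_buffer("V", bias_directions)    # (k, hidden_size) bias subspace
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, *encoder_inputs):
        pooled = self.encoder(*encoder_inputs)
        pooled = F.normalize(pooled, dim=-1)          # normalization before debiasing
        proj = (pooled @ self.V.t()) @ self.V         # projection onto the bias subspace
        return self.classifier(pooled - proj)         # classify the debiased embedding
```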

Binary Gender: (man, woman), (boy, girl), (he, she), (father, mother), (son, daughter), (guy, gal), (male, female), (his, her), (himself, herself), (John, Mary)

Multiclass Religion: (jewish, christian, muslim), (jews, christians, muslims), (torah, bible, quran), (synagogue, church, mosque), (rabbi, priest, imam), (judaism, christianity, islam)

Table 6: Word pairs used to estimate the binary gender bias subspace and word triplets used to estimate the 3-class religion bias subspace.

Hyper-parameter              | Value
attention_probs_dropout_prob | 0.1
hidden_act                   | gelu
hidden_dropout_prob          | 0.1
hidden_size                  | 768
initializer_range            | 0.02
intermediate_size            | 3072
max_position_embeddings      | 512
num_attention_heads          | 12
num_hidden_layers            | 12
type_vocab_size              | 2
vocab_size                   | 30522

Table 7: Configuration of BERT models, including BERT, BERT → SST, and BERT → CoLA.

B.2 ELMo

We use the ElmoEmbedder from allennlp.commands.elmo. We perform summation over the aggregated layer outputs. The resulting sentence representation is a time sequence vector with data dimension 1024. When computing the gender direction, we perform mean pooling over the time dimension to obtain a 1024-dimensional vector for each definitional sentence. In debiasing, we remove the gender direction from each time step of each sentence representation. We then feed the debiased representation into an LSTM with hidden size 512. Finally, the last hidden state of the LSTM goes through a fully connected layer to make predictions.
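A PyTorch sketch of this ELMo pipeline (remove the gender direction from every time step, then an LSTM and a fully connected layer) is given below; it is an illustration under the stated assumptions (k = 1 and a unit-norm `bias_direction` of size 1024), not the exact released code.

```python
import torch
import torch.nn as nn

class DebiasedElmoClassifier(nn.Module):
    def __init__(self, bias_direction, num_labels, hidden_size=512):
        super().__init__()
        self.register_buffer("v", bias_direction)     # (1024,) unit-norm bias direction
        self.lstm = nn.LSTM(1024, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, elmo_embeddings):               # (batch, time, 1024)
        proj = (elmo_embeddings @ self.v).unsqueeze(-1) * self.v
        debiased = elmo_embeddings - proj             # debias every time step
        _, (h_n, _) = self.lstm(debiased)
        return self.out(h_n[-1])                      # last hidden state -> logits
```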

C Additional Results

We also studied the effect of templates on BERT fine-tuned on CoLA as well. Steps taken are exactly the same as described in Effect of Templates: Same Domain, More Quantity and Effect of Templates: Same Quantity, More Domains. Results are plotted in Figure 4. It shows that debiasing performance improves and stabilizes with the number of sentence templates as well as the number of domains.

Figure 4: Evaluation of bias removal on BERT fine-tuned on CoLA with varying percentage of data from a single domain (left) and varying number of domains with fixed total size (right).