


Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 93–102, August 1–6, 2021. ©2021 Association for Computational Linguistics


Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization

Dongkyu Lee, Zhiliang Tian, Lanqing Xue, Nevin L. Zhang
Department of Computer Science and Engineering,
The Hong Kong University of Science and Technology
{dleear, ztianac, lxueaa, lzhang}@cse.ust.hk

Abstract

Text style transfer aims to alter the style (e.g., sentiment) of a sentence while preserving its content. A common approach is to map a given sentence to a content representation that is free of style, and to feed this content representation to a decoder together with a target style. Previous methods filter style at the token level by completely removing tokens with style, which incurs a loss of content information. In this paper, we propose to enhance content preservation by implicitly removing the style information of each token with reverse attention, and thereby retain the content. Furthermore, we fuse content information when building the target style representation, making it dynamic with respect to the content. Our method thus creates not only a style-independent content representation, but also a content-dependent style representation for transferring style. Empirical results show that our method outperforms the state-of-the-art baselines by a large margin in terms of content preservation. In addition, it is also competitive in terms of style transfer accuracy and fluency.

1 Introduction

Style transfer is a popular task in computer vision and natural language processing. It aims to convert an input with a certain style (e.g., sentiment, formality) into a different style while preserving the original content.

One mainstream approach is to separate style from content, and to generate a transferred sentence conditioned on the content information and a target style. Recently, several models (Li et al., 2018; Xu et al., 2018; Wu et al., 2019) have proposed removing style information at the token level by filtering out tokens with style information, which are identified using either attention-based methods (Bahdanau et al., 2015) or frequency-ratio based methods (Wu et al., 2019). This line of work is built upon the assumption that style is localized to certain tokens in a sentence, and that a token carries either content or style information, but not both. Thus, by utilizing a style marking module, these models filter out the style tokens entirely when constructing a style-independent content representation of the input sentence. The drawback of the filtering method is that one needs to manually set a threshold to decide whether a token is stylistic or content-related. Previous studies address this issue by using the average attention score as a threshold (Li et al., 2018; Xu et al., 2018; Wu et al., 2019). A major shortcoming of this approach is its inability to handle flat attention distributions. When the distribution is flat, i.e., similar attention scores are assigned to the tokens, the style marking module removes/masks out more tokens than necessary. This incurs a loss of content information, as depicted in Figure 1.

Figure 1: Illustration of the difference between our method and the filtering method in handling a flat attention distribution. Each bar indicates the attention score of the corresponding word. Given the input "to our knowledge, this is the best deal in phoenix.", filtering produces "To our knowledge, this is the <MASK> <MASK> <MASK> <MASK>.", whereas ours keeps "To our knowledge, this is the best deal in phoenix."

In this paper, we propose a novel method for text style transfer. A key idea is to exploit the fact that a token often possesses both style and content information. For example, the word "delicious" is a token with strong style information, but it also implies that the subject is food. Such words play a pivotal role in representing style (e.g., positive sentiment) as well as giving a hint at the subject matter/content (e.g., food). The complete removal of such tokens leads to the loss of content information.

For the sake of enhancing content preservation, we propose a method to implicitly remove style at the token level using reverse attention. We utilize knowledge attained from attention networks (Bahdanau et al., 2015) to estimate the style information of a token, and suppress this signal to take out style. The attention mechanism is known to attend to interdependent representations given a query. In a style classification task, an attention score can be interpreted as the extent to which a token carries the style attribute. If we can identify which tokens reveal stylistic property and to what extent, it is then possible to take the negation and approximate the amount of content attribute within a token. In this paper, we call this reverse attention. We utilize these scores to suppress the stylistic attribute of tokens, fully capturing the content property.

This paper further enhances content preservation by fusing content information into the target style representation. Despite extensive efforts in creating content representations, previous work has overlooked building content-dependent style representations. The common approach is to project the target style onto an embedding space and to share the style embedding among inputs of the same style as an input to the decoder. Our work, in contrast, sheds light on building content-related style by utilizing conditional layer normalization (CLN). This module takes in the content representation and creates a content-dependent style representation by shaping the content variable to fit the distribution of the target style. This way, our style representation varies according to the content of the input sequence, even for the same target style.

Our method is based on two techniques, Reverse Attention and Conditional Layer Normalization; thus we call it RACoLN. In empirical evaluation, RACoLN achieves state-of-the-art performance in terms of content preservation, outperforming the previous state of the art by a large margin, and shows competency in style transfer accuracy and fluency. The contributions are as follows:

• We introduce reverse attention as a way to suppress style information while preserving content information when building a content representation of an input.

• Aside from building a style-independent content representation, our approach utilizes conditional layer normalization to construct a content-dependent style representation.

• Our model achieves state-of-the-art performance in terms of content preservation, outperforming the current state of the art by more than 4 BLEU points on the Yelp dataset, and shows competency in other metrics as well.

2 Related Work

In recent years, text style transfer in an unsupervised learning environment has been studied and explored extensively. The text style transfer task views a sentence as being composed of content and style. Thus, there have been attempts to disentangle the two components (Shen et al., 2017; Li et al., 2018; Xu et al., 2018; Wu et al., 2019). Shen et al. (2017) map a sentence to a shared content space among styles to create a style-independent content variable. Some studies view style as a localized feature of sentences. Xu et al. (2018) propose to identify style tokens with the attention mechanism and to filter out such tokens. A frequency-ratio based method was proposed to enhance the filtering process (Wu et al., 2019). This stream of work is similar to ours in that the objective is to take out style at the token level, but different in that ours does not remove tokens completely.

Instead of disentangling content and style, other papers focus on revising an entangled representation of an input. A few previous studies utilize a pre-trained classifier and edit the entangled latent variable until it contains the target style using gradient-based optimization (Wang et al., 2019; Liu et al., 2020). He et al. (2020) view each domain of data as a partially observable variable, and transfer sentences using amortized variational inference. Dai et al. (2019) use the transformer architecture and rewrite style in the entangled representation at the decoder. We consider this model the strongest baseline in terms of content preservation.

In the domain of computer vision, it is a prevalent practice to exploit variants of normalization to transfer style (Dumoulin et al., 2017; Ulyanov et al., 2016). Dumoulin et al. (2017) proposed conditional instance normalization (CIN), in which each style is assigned separate instance normalization parameters; in other words, the model learns separate gain and bias parameters of instance normalization for each style.

Our work differs in several ways. Style transfer in images views the task as changing the "texture" of an image. Therefore, Dumoulin et al. (2017) place a CIN module after every convolution layer, "painting" the content representation with style-specific parameters. As a result, the


Figure 2: The input x first passes through the style marker module to compute reverse attention. The reverse attention scores are then applied to the token embeddings, implicitly removing style. The content representation from the encoder is fed to the stylizer, in which the style representation is made from the content. The decoder generates the transferred output by conditioning on the two representations.

network passes on an entangled representation of the image. Our work is different in that we disentangle content and style; thus we do not overwrite the content with style-specific parameters. In addition, we apply CLN only once, before passing the representation to the decoder.

3 Approach

3.1 Task Definition

Let D = {(x_i, s_i)}_{i=1}^{N} be a training corpus, where each x_i is a sentence and s_i is its style label. Our experiments were carried out on sentiment transfer, where there are two style labels, namely "positive" and "negative."

The task is to learn from D a model x_ŝ = f_θ(x, ŝ), with parameters θ, that takes an input sentence x and a target style ŝ as inputs, and outputs a new sentence x_ŝ that is in the target style and retains the content information of x.

3.2 Model Overview

We conduct this task in an unsupervised environment in which the ground truth sentence x_ŝ is not provided. To achieve our goal, we employ a style classifier s = C(x) that takes a sentence x as input and returns its style label. We pre-train this model on D and keep it frozen in the process of learning f_θ.

Given the style classifier C(x), our task becomes to learn a model x_ŝ = f_θ(x, ŝ) such that C(x_ŝ) = ŝ. As such, the task is conceptually similar to an adversarial attack: the input x is from the style class s, and we want to modify it so that it will be classified into the target style class ŝ.

The architecture of our model f_θ is shown in Figure 2; we will sometimes refer to it as the generator network. It consists of an encoder, a stylizer and a decoder. The encoder maps an input sequence x into a style-independent representation z_x. In particular, the encoder has a style marker module that computes attention scores of the input tokens, and it "reverses" them to estimate the content information. The reversed attention scores are applied to the token embeddings E(x), and the results E′(x) are fed to a bidirectional GRU to produce z_x.

The stylizer takes a target style ŝ and the content representation z_x as inputs, and produces a content-related style representation z_ŝ. Finally, the decoder takes the content representation z_x and the style representation z_ŝ as inputs, and generates a new sequence x_ŝ.

3.3 Encoder

3.3.1 Style Marker Module

Let x = [x_1, x_2, . . . , x_T] be an input sequence of length T with style s. The style marker module is pre-trained in order to calculate the amount of style information in each token of a given input. We use one layer of bidirectional GRU with attention (Yang et al., 2016). Specifically,

v_t = tanh(W_w h_t + b_w)    (1)

α_t = exp(v_t⊤ u / τ) / ∑_{t=1}^{T} exp(v_t⊤ u / τ)    (2)

where h_t is the hidden representation from the bidirectional GRU at time step t, u is a learnable parameter vector initialized with random weights, and τ denotes the temperature of the softmax. When pre-training the style marker module, we construct a sentence representation by taking the weighted sum of the token representations with the weights being the attention scores, and feed the context vector to a fully-connected layer.

o = ∑_{t=1}^{T} α_t h_t    (3)

p = softmax(W_c o + b_c)    (4)

The cross-entropy loss is used to learn the parameters of the style marker module. The attention scores in the style marker module indicate which tokens are important for style classification, and to what extent. Those scores will be "reversed" in the next section to reveal the content information. The fully-connected layer of the style marker module is no longer needed once the module is trained; it is hence removed.
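The computation above is compact enough to sketch end to end. The following is a minimal PyTorch sketch of such a style marker module, written purely for illustration: the class name, layer sizes, and training details are our assumptions (the sizes follow Section 4.4), not the authors' released code.

```python
import torch
import torch.nn as nn

class StyleMarker(nn.Module):
    """Bidirectional GRU with attention (Eqs. 1-4); alpha is reused later."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=500, n_styles=2, tau=1.0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid_dim, 2 * hid_dim)   # W_w, b_w in Eq. (1)
        self.u = nn.Parameter(torch.randn(2 * hid_dim))   # query vector u in Eq. (2)
        self.cls = nn.Linear(2 * hid_dim, n_styles)       # W_c, b_c in Eq. (4)
        self.tau = tau

    def forward(self, x):                                   # x: (batch, T) token ids
        h, _ = self.gru(self.emb(x))                        # h_t for every token
        v = torch.tanh(self.attn(h))                        # Eq. (1)
        alpha = torch.softmax(v @ self.u / self.tau, dim=1) # Eq. (2), one score per token
        o = (alpha.unsqueeze(-1) * h).sum(dim=1)            # Eq. (3), attention-weighted sum
        logits = self.cls(o)                                # Eq. (4) before the softmax
        return alpha, logits   # cross-entropy on the logits trains the module
```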

3.3.2 Reverse Attention

Using the attention scores from the pre-trained style marker module, we propose to implicitly remove the style information in each token. We negate the extent of style information in each token to estimate the extent of content information, namely reverse attention.

α̃_t = 1 − α_t,    ∑_{t=1}^{T} α_t = 1    (5)

where α_t is an attention value from the style marker module, and α̃_t is the corresponding reverse attention score. We multiply the reverse attention scores with the embedding vectors of the tokens:

ẽ_t = α̃_t e_t,    e_t = E(x_t)    (6)

Intuitively, this can be viewed as implicitly removing the stylistic attribute of a token by suppressing the norm of its embedding according to the corresponding reverse attention score. The representations finally flow into a bidirectional GRU,

z_x = bidirectionalGRU(ẽ)    (7)

to produce a content representation z_x, which is the last hidden state of the bidirectional GRU. By utilizing reverse attention, we map a sentence to a style-independent content representation.
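Continuing the illustrative PyTorch sketch (module and variable names are ours, not from the paper's code, and the imports come from the previous snippet), reverse attention and the content encoder of Eqs. (5)-(7) could look like:

```python
class ContentEncoder(nn.Module):
    """Encodes a sentence into z_x after down-weighting stylistic tokens."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, x, alpha):
        rev_alpha = 1.0 - alpha                     # Eq. (5): reverse attention
        e = self.emb(x)                             # e_t = E(x_t)
        e_tilde = rev_alpha.unsqueeze(-1) * e       # Eq. (6): suppress stylistic tokens
        _, h_n = self.gru(e_tilde)                  # Eq. (7): bidirectional GRU
        z_x = torch.cat([h_n[0], h_n[1]], dim=-1)   # last hidden states of both directions
        return z_x                                  # style-independent content representation
```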

3.4 Stylizer

The goal of the stylizer is to create a content-related style representation. We do this by applying conditional layer normalization on the content representation z_x from the encoder, which is the input to this module.

Layer normalization requires the number of gain and bias parameters to match the size of the input representation. Therefore, mainly for the purpose of shrinking the size, we perform an affine transformation on the content variable:

ẑ_x = W_z z_x + b_z    (8)

The representation is then fed to conditional layer normalization so that it falls into the target style distribution in the style space. Specifically,

z_s = CLN(ẑ_x; s) = γ_s ⊙ N(ẑ_x) + β_s    (9)

N(ẑ_x) = (ẑ_x − µ) / σ    (10)

where µ and σ are the mean and standard deviation of the input vector respectively, and s is the target style. Our model learns separate γ_s (gain) and β_s (bias) parameters for different styles.

Normalization is commonly used to bring feature values onto a common scale, but it is known to implicitly keep the features. Therefore, we argue that the normalized content feature values retain the content information of the content variable. By passing through the conditional layer normalization module, the content latent vector is scaled and shifted with style-specific gain and bias parameters, falling into the target style distribution. Thus, unlike previous attempts in text style transfer, the style representation is dynamic with respect to the content, being a content-dependent embedding.

In order to block the backpropagation signal related to style from flowing into z_x, we apply a stop gradient to z_x before feeding it to the stylizer.
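A minimal sketch of the stylizer, under the same assumptions as the previous snippets: the affine projection of Eq. (8), the normalization of Eq. (10), per-style gain and bias as in Eq. (9), and a stop gradient on the incoming content representation. The dimensions follow Section 4.4; everything else is illustrative.

```python
class Stylizer(nn.Module):
    """Conditional layer normalization: content-dependent style representation."""
    def __init__(self, content_dim=1000, style_dim=200, n_styles=2, eps=1e-5):
        super().__init__()
        self.proj = nn.Linear(content_dim, style_dim)   # Eq. (8): W_z, b_z
        self.gain = nn.Embedding(n_styles, style_dim)   # gamma_s
        self.bias = nn.Embedding(n_styles, style_dim)   # beta_s
        self.eps = eps

    def forward(self, z_x, style):                       # style: (batch,) style ids
        z = self.proj(z_x.detach())                      # stop gradient before the stylizer
        mu = z.mean(dim=-1, keepdim=True)                # Eq. (10): normalize the content
        sigma = z.std(dim=-1, keepdim=True)
        z_norm = (z - mu) / (sigma + self.eps)
        return self.gain(style) * z_norm + self.bias(style)   # Eq. (9)
```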


3.5 Decoder

The decoder generates a sentence with the target style conditioned on the content-related style representation and the content representation. We construct our decoder using a single layer of GRU.

x_ŝ ∼ Dec_θ(z_x, z_ŝ) = p_D(x_ŝ | z_x, z_ŝ)    (11)

As briefly discussed in Section 3.2, the outputs of our generator are further passed on to different loss functions. However, the sampling process or greedy decoding does not allow gradients to flow, because these operations are not differentiable. Therefore, we use soft sampling to keep the gradient flowing. Specifically, when the gradient flow is required through the outputs, we take the product of the probability distribution at each time step and the weight of the embedding layer to project the outputs onto the word embedding space. We empirically found that soft sampling is more suitable in our environment than Gumbel-softmax (Jang et al., 2017).
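Soft sampling, as described, amounts to replacing the discrete token choice at each step with the expectation of the embedding under the output distribution. A hedged one-function sketch (names are ours, reusing the imports from the earlier snippets):

```python
def soft_embedding(logits, embedding):
    """logits: (batch, vocab) for one decoding step; embedding: nn.Embedding.

    Returns the expected token embedding, which is differentiable, so gradients
    can flow back through the generated 'soft' tokens.
    """
    probs = torch.softmax(logits, dim=-1)   # distribution over the vocabulary
    return probs @ embedding.weight         # (batch, emb_dim)
```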

3.6 Pre-trained Style Classifier

Due to the lack of a parallel corpus, we cannot train the generator network with maximum likelihood estimation on its style transfer ability. Therefore, this paper employs a pre-trained classifier C(x) to train our generator on transferring style. Our classifier network has the same structure as the style marker module with a fully-connected layer appended; nonetheless, it is a separate model obtained from a different set of initial model parameters. We use the cross-entropy loss for training:

L_pre = −E_{(x,s)∼D}[log p_C(s | x)]    (12)

We freeze the weights of this network after it has been fully trained.
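For illustration, pre-training and freezing such a classifier might look as follows; StyleMarker is the hypothetical module sketched in Section 3.3.1 (with its fully-connected layer kept), and the data loader, epoch count, and learning rate are placeholders rather than the authors' settings.

```python
def pretrain_classifier(data_loader, vocab_size, n_styles=2, epochs=10):
    clf = StyleMarker(vocab_size, n_styles=n_styles)      # separate initialization
    opt = torch.optim.Adam(clf.parameters(), lr=5e-4)
    for _ in range(epochs):
        for x, s in data_loader:                           # (sentence ids, style label)
            _, logits = clf(x)
            loss = torch.nn.functional.cross_entropy(logits, s)   # Eq. (12)
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in clf.parameters():                             # freeze after training
        p.requires_grad_(False)
    return clf
```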

3.7 The Loss Function

As shown in Figure 3, our loss function consists of four parts: a self reconstruction loss L_self, a cycle reconstruction loss L_cycle, a content loss L_content, and a style transfer loss L_style.

3.7.1 Self Reconstruction Loss

Let (x, s) ∈ D be a training example. If we ask our model f_θ(x, ŝ) to "transfer" the input into its original style, i.e., ŝ = s, we would expect it to reconstruct the input:

L_self = −E_{(x,s)∼D}[log p_D(x | z_x, z_s)]    (13)

Figure 3: Illustration of the loss functions in the training phase. Enc_θ, Sty_θ, and Dec_θ denote the encoder, the stylizer, and the decoder respectively. The circle denotes a generated sentence with soft sampling. As illustrated, L_cycle, L_style and L_content require soft sampling to keep the gradient flow.

where z_x is the content representation of the input x, z_s is the representation of the style s, and p_D is the conditional distribution over sequences defined by the decoder.

3.7.2 Cycle Reconstruction Loss

Suppose we first transfer a sequence x into another style ŝ to get x_ŝ using soft sampling, and then transfer x_ŝ back to the original style s. We would expect to reconstruct the input x. Hence we have the following cycle reconstruction loss:

L_cycle = −E_{(x,s)∼D}[log p_D(x | z_{x_ŝ}, z_s)]    (14)

where z_{x_ŝ} is the content representation of the transferred sequence x_ŝ.¹

3.7.3 Content Loss

In the aforementioned cycle reconstruction process, we obtain a content representation z_x of the input x and a content representation z_{x_ŝ} of the transferred sequence x_ŝ. As the two transfer steps presumably involve only style but not content, the two content representations should be similar. Hence we have the following content loss:

L_content = E_{(x,s)∼D} ||z_x − z_{x_ŝ}||₂²    (15)

¹ Strictly speaking, the quantity is not well-defined because there is no description of how the target style ŝ is picked. In our experiments, we use data with two styles, so the target style simply means the other style. To apply the method to problems with multiple styles, random sampling of a different style should be added. This remark applies also to the two loss terms introduced below.


3.7.4 Style Transfer Loss

We would like the transferred sequence x_ŝ to be of style ŝ. Hence we have the following style transfer loss:

L_style = −E_{(x,s)∼D}[log p_C(ŝ | x_ŝ)]    (16)

where p_C is the conditional distribution over styles defined by the style classifier C(x). As mentioned in Section 3.5, x_ŝ is generated with soft sampling.

3.7.5 Total Loss

In summary, we balance the four loss functions to train our model:

L = λ_1 L_self + λ_2 L_cycle + λ_3 L_content + λ_4 L_style    (17)

where the λ_i are balancing parameters.
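Putting the four losses together, one training step could be organized roughly as below. This is a sketch of the training logic, not the authors' implementation: model.encode, model.stylize, model.decode, model.decode_soft, model.encode_soft, and model.reconstruction_loss are hypothetical wrappers around the components sketched earlier, the frozen classifier is assumed to accept soft (expected-embedding) inputs, and the default λ values follow Section 4.4.

```python
def training_step(model, classifier, x, s, s_hat, lambdas=(0.5, 0.5, 1.0, 1.0)):
    l1, l2, l3, l4 = lambdas
    # Self reconstruction (Eq. 13): transfer back into the original style s.
    z_x = model.encode(x)
    loss_self = model.reconstruction_loss(model.decode(z_x, model.stylize(z_x, s)), x)
    # Cycle reconstruction (Eq. 14): x -> soft-sampled x_s_hat -> back to style s.
    x_hat_soft = model.decode_soft(z_x, model.stylize(z_x, s_hat))
    z_x_hat = model.encode_soft(x_hat_soft)
    loss_cycle = model.reconstruction_loss(model.decode(z_x_hat, model.stylize(z_x_hat, s)), x)
    # Content loss (Eq. 15): content representations before/after transfer should match.
    loss_content = ((z_x - z_x_hat) ** 2).sum(dim=-1).mean()
    # Style transfer loss (Eq. 16): the frozen classifier should predict the target style.
    loss_style = torch.nn.functional.cross_entropy(classifier(x_hat_soft), s_hat)
    # Total loss (Eq. 17).
    return l1 * loss_self + l2 * loss_cycle + l3 * loss_content + l4 * loss_style
```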

4 Experiment

4.1 Datasets

Following prior work on text style transfer, we use two common datasets: Yelp and IMDB reviews.

4.1.1 Yelp Review

Our study uses the Yelp review dataset (Li et al., 2018), which contains 266K positive and 177K negative reviews. The test set contains a total of 1,000 sentences, 500 positive and 500 negative, and human-annotated reference sentences are provided, which are used in measuring content preservation.

4.1.2 IMDB Movie Review

Another dataset we test on is the IMDB movie review dataset (Dai et al., 2019). This dataset is comprised of 17.9K positive and 18.8K negative reviews for the training corpus, and 2K sentences are used for testing.

4.2 Automatic Evaluation

4.2.1 Style Transfer Accuracy

Style transfer accuracy (S-ACC) measures whether the generated sentences carry the target style. We have mentioned a style classifier before, C(x), which is used in the loss function. To evaluate transfer accuracy, we train another style classifier, C_eval(x). It has the identical architecture and is trained on the same data, but starts from a different set of initial model parameters. We utilize this structure due to its superior performance compared to that of the commonly used CNN-based classifier (Kim, 2014). Our evaluation classifier achieves an accuracy of 97.8% on Yelp and 98.9% on IMDB, higher than that of the CNN-based classifier.

4.2.2 Content Preservation

A well-transferred sentence must maintain its content. In this paper, content preservation is evaluated with two BLEU scores (Papineni et al., 2002): one between the generated sentence and the input sentence (self-BLEU), and the other against the human-generated reference sentence (ref-BLEU). With this metric, one can evaluate how well a sentence maintains its content throughout inference.

4.2.3 Fluency

A natural language generation task aims to output sentences that are not only task-specific but also fluent. This study measures the perplexity (PPL) of generated sentences in order to measure fluency. Following Dai et al. (2019), we use a 5-gram KenLM (Heafield, 2011) trained on the two training datasets. A lower PPL score indicates a more fluent transferred sentence.

4.2.4 BERT Score

Zhang et al. (2020) proposed BERTScore, which computes the contextual similarity of two sentences. Previous metrics, such as the BLEU score, compute an n-gram matching score, while BERTScore evaluates the contextual embeddings of the tokens obtained from pre-trained BERT (Devlin et al., 2019). This evaluation metric has been shown to correlate with human judgement; thus our paper includes the BERT score between model-generated outputs and the human reference sentences. We report precision, recall, and F1 score.
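As an illustration of how the automatic metrics above could be computed with off-the-shelf tools (NLTK BLEU, the kenlm Python binding, and the bert-score package), here is a hedged sketch; the paper does not specify its exact evaluation scripts, and the language-model path is a placeholder.

```python
import kenlm                                           # 5-gram LM for fluency (PPL)
from nltk.translate.bleu_score import corpus_bleu      # n-gram overlap
from bert_score import score as bert_score             # contextual similarity

def evaluate(outputs, inputs, references, lm_path="yelp.5gram.arpa"):
    # self-BLEU: overlap between generated outputs and the original inputs
    self_bleu = corpus_bleu([[x.split()] for x in inputs],
                            [y.split() for y in outputs]) * 100
    # ref-BLEU: overlap with the human-written references
    ref_bleu = corpus_bleu([[r.split()] for r in references],
                           [y.split() for y in outputs]) * 100
    # perplexity under a pre-trained 5-gram KenLM (lower is more fluent)
    lm = kenlm.Model(lm_path)
    ppl = sum(lm.perplexity(y) for y in outputs) / len(outputs)
    # BERTScore precision/recall/F1 against the references
    P, R, F1 = bert_score(outputs, references, lang="en")
    return {"self-BLEU": self_bleu, "ref-BLEU": ref_bleu,
            "PPL": ppl, "BERT-F1": F1.mean().item()}
```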

4.3 Human Evaluation

In addition to automatic evaluation, we validate the generated outputs with human evaluation. For each model, we randomly sample 150 outputs from each of the two datasets, a total of 300 outputs per model. Given the target style and the original sentence, the annotators are asked to rate the model-generated sentence on a scale from 1 (Very Bad) to 5 (Very Good) for content preservation, style transfer accuracy, and fluency. We report the average scores from the 4 hired annotators in Table 3.



Table 1: Automatic evaluation results on the Yelp dataset. Bold numbers indicate the best performance. G-score denotes the geometric mean of self-BLEU and S-ACC, and BERT-P, BERT-R, and BERT-F1 are BERTScore precision, recall and F1 respectively. All the baseline model outputs and codes were obtained from their official repositories where publicly available.

Model | S-ACC | ref-BLEU | self-BLEU | PPL | G-score | BERT-P | BERT-R | BERT-F1
Cross-Alignment (Shen et al., 2017)² | 74.2 | 4.2 | 13.2 | 53.1 | 32.0 | 87.8 | 86.2 | 87.0
ControlledGen (Hu et al., 2017)³ | 83.7 | 16.1 | 50.5 | 146.3 | 65.0 | 90.6 | 89.0 | 89.8
Style Transformer (Dai et al., 2019)⁴ | 87.3 | 19.8 | 55.2 | 73.8 | 69.4 | 91.6 | 89.9 | 90.7
Deep Latent (He et al., 2020)⁵ | 85.2 | 15.1 | 40.7 | 36.7 | 58.9 | 89.8 | 88.6 | 89.2
RACoLN (Ours) | 91.3 | 20.0 | 59.4 | 60.1 | 73.6 | 91.8 | 90.3 | 91.0

Table 2: Automatic evaluation results on the IMDB dataset. Bold numbers indicate the best performance. For the IMDB dataset, in the absence of human references, BERTScore and reference BLEU are not reported.

Model | S-ACC | self-BLEU | PPL | G-score
Cross-Alignment | 63.9 | 1.1 | 29.9 | 8.4
ControlledGen | 81.2 | 63.8 | 119.7 | 71.2
Style Transformer | 74.0 | 70.4 | 71.2 | 72.2
Deep Latent | 59.3 | 64.0 | 41.1 | 61.6
RACoLN (Ours) | 83.1 | 70.9 | 45.3 | 76.8

Table 3: Human evaluation results. Each score is the average score from the hired annotators. The inter-annotator agreement, Krippendorff's alpha, is 0.729.

Model | Yelp Style | Yelp Content | Yelp Fluency | IMDB Style | IMDB Content | IMDB Fluency
Cross-Alignment | 2.6 | 2.4 | 3.3 | 2.2 | 2.1 | 2.3
ControlledGen | 3.3 | 4.0 | 3.7 | 3.3 | 3.8 | 3.6
Style Transformer | 3.7 | 4.3 | 4.0 | 3.3 | 4.0 | 3.8
Deep Latent | 3.5 | 3.6 | 4.3 | 2.7 | 3.7 | 4.2
RACoLN (Ours) | 4.0 | 4.5 | 4.2 | 3.6 | 4.1 | 4.1

4.4 Implementation Details

In this paper, we set the embedding size to 128 and the hidden dimension of the encoder to 500. The size of the bias and gain parameters of the conditional layer normalization is 200, and the hidden dimension of the decoder is set to 700 in order to condition on both the content and the style representation. The Adam optimizer (Kingma and Ba, 2015) was used to update the parameters with the learning rate set to 0.0005. For the balancing parameters of the total loss function, we set λ_1 and λ_2 to 0.5, and the rest to 1.

4.5 Experimental Results & Analysis

We compare our model with the baseline models; the automatic evaluation results are presented in Table 1.

² https://github.com/shentianxiao/language-style-transfer
³ https://github.com/asyml/texar/tree/master/examples/text_style_transfer
⁴ https://github.com/fastnlp/style-transformer
⁵ https://github.com/cindyxinyiwang/deep-latent-sequence-model

Our model outperforms the baseline models in terms of content preservation on both datasets. In particular, on the Yelp dataset, our model achieves a self-BLEU score of 59.4, surpassing the previous state-of-the-art model by more than 4 points. Furthermore, our model also achieves the state-of-the-art result in content preservation on the IMDB dataset, which is comprised of longer sequences than those of Yelp.

In terms of style transfer accuracy and fluency, our model is highly competitive. Our model achieves the highest score in style transfer accuracy on both datasets (91.3 on Yelp and 83.1 on IMDB). Additionally, our model shows the ability to produce fluent sentences, as reflected in the perplexity scores. In terms of the BERT scores, the proposed model performs best, having the highest contextual similarity with the human references among the style transfer models.

From the automatic evaluation results, we see a trend of trade-offs. Most of the baseline models are good at a particular metric, but show room for improvement on the others. For example, Deep Latent and Cross-Alignment consistently perform well in terms of perplexity, but their ability to transfer style and preserve content needs improvement. Style Transformer achieves comparable performance across all evaluation metrics, but our model outperforms it on every metric on both datasets. Therefore, the results show that our model is not only well-balanced but also strong in every aspect of the text style transfer task.

As for the human evaluation, we observe that the results largely conform with the automatic evaluation. Our model received the highest scores on the style and content evaluation metrics on both datasets by a large margin compared to the other baselines. Moreover, the fluency score is comparable with that of the Deep Latent model, showing its competency in creating fluent output.


Table 4: Sample outputs generated by the baseline models and our approach on the Yelp and IMDB datasets. Bold words indicate successful transfer in style without grammatical error.

YELP

Original Input: Everyone is always super friendly and helpful .
Cross-Alignment: Everyone is always super friendly and helpful and inattentive .
ControlledGen: Tonight selection of meats and cheeses .
Deep Latent: Now i 'm not sure how to be .
Style Transformer: Which is n't super friendly .
RACoLN (Ours): Everyone is always super rude and unprofessional .

Original Input: I love this place , the service is always great !
Cross-Alignment: I know this place , the food is just a horrible !
ControlledGen: I avoid this place , the service is nasty depressing vomit
Deep Latent: I do n't know why the service is always great !
Style Transformer: I do n't recommend this place , the service is n't !
RACoLN (Ours): I avoid this place , the service is always horrible !

IMDB

Original Input: I actually disliked the leading characters so much that their antics were never funny but pathetic .
Cross-Alignment: I have never get a good movie , i have never have seen in this movie .
ControlledGen: I actually anticipated the leading characters so much that their antics were never funny but timeless .
Deep Latent: I actually disliked the leading characters so much that their antics were never funny but incredible .
Style Transformer: I actually disliked the leading characters so much that their antics were never funny but vhs .
RACoLN (Ours): I actually liked the leading characters so much that their antics were never corny but appropriate .

Original Input: The plot is clumsy and has holes in it .
Cross-Alignment: The worst film is one of the worst movies i 've ever seen .
ControlledGen: The plot is top-notch and has one-liners in it .
Deep Latent: The plot is tight and has found it in a very well done .
Style Transformer: The plot is joys and has flynn in it .
RACoLN (Ours): The plot is incredible and has twists in it .

Both automatic and human evaluation depict the strength of the proposed model not only in preserving content, but also on the other metrics.

4.5.1 Style and Content Space

We visualize the Yelp test set projected onto the content and style spaces using t-SNE in Figure 4. It is clearly observed that the content representations (z_x) are spread across the content space, showing that the representations are independent of style. After the content representations go through the stylizer module, there is a clear distinction between the representations of different styles (z_s) in the style space. This is in sharp contrast to the corresponding distributions of the style-independent content representations shown on the right of the figure. The figure clearly depicts how the style-specific parameters in the stylizer module shape the content representations to fall into the target style distribution. This illustrates how our model successfully removes style at the encoder, and constructs content-related style at the stylizer module.

4.5.2 Ablation Study

In order to validate the proposed modules, we conduct an ablation study on the Yelp dataset, which is presented in Table 5.

Figure 4: Visualization of the Yelp test set in the content space and the style space using t-SNE. Gray dots denote sentences with negative style transferred to positive sentiment, while red dots are sentences with positive style transferred to negative sentiment.

Table 5: Ablation study on the proposed model. (-) indicates removing the corresponding component from the proposed model.

Model | S-ACC | ref-BLEU | self-BLEU | PPL
Input Copy | 2.2 | 22.7 | 100.0 | 41.2
Proposed Model | 91.3 | 20.0 | 59.4 | 60.1
(-) Reverse Attention | 84.0 | 16.6 | 47.2 | 60.5
(-) Stylizer | 91.8 | 19.1 | 53.0 | 59.0
(-) L_content | 87.2 | 19.5 | 54.8 | 62

We observe a significant drop across all aspects without the reverse attention module. In the other case, where we remove the stylizer module and use a style embedding as in previous papers, the model loses some of its ability to retain content, a drop of around 6 points in self-BLEU. We find that the two core components are interdependent in successfully transferring style in text. Lastly, as for the loss functions, incorporating L_content brings a meaningful increase in content preservation.⁶

5 Conclusion

In this paper, we introduce a way to implicitly remove style at the token level using reverse attention, and to fuse content information into the style representation using conditional layer normalization. With the two core components, our model is able to enhance content preservation while keeping the outputs fluent and in the target style. Both automatic and human evaluation show that our model has the best ability in preserving content and is strong on other metrics as well. In the future, we plan to study problems with more than two styles and to apply multiple-attribute style transfer, where the target style is comprised of multiple styles.

⁶ Other loss functions were not included, since those loss functions have been extensively tested and explored in previous papers (Prabhumoye et al., 2018; Dai et al., 2019).



Acknowledgement

Research on this paper was supported by the Hong Kong Research Grants Council under grant 16204920 and the Tencent AI Lab Rhino-Bird Focused Research Program (No. GF202035).

Ethical Considerations

A text style transfer model is a conditional generative model, in which the condition is the target style. This makes a wide range of applications possible, since a style can be defined as any common feature in a corpus, such as formality, tense, sentiment, etc.

However, at the same time, due to its inherent functionality, a text style transfer model can pose potential harm when used with malicious intention. It can lead to a situation where one deliberately distorts a sentence for his or her own benefit. To give an example in a political context, political stance can be viewed as a style in the political slant dataset (Voigt et al., 2018), as in Prabhumoye et al. (2018). If one intentionally changes the style (political stance) of a person's statement with the proposed model structure, the generated output can be exploited to create fake news or misinformation. One possible remedy for such a potentially problematic situation is to employ a fact checking system as a safety measure (Nadeem et al., 2019). We are fully aware that fact checking is not a fundamental solution to the potential harm that text style transfer models pose. Nevertheless, one can filter out misleading information using such a system in certain domains (e.g., politics), lowering the level of danger that could otherwise be posed by style transfer. In conclusion, this problem is shared among conditional generative models in general, and future studies on how to mitigate it are in crucial need.

Our work validates the proposed model and the baseline models with human evaluation, in which manual work was involved. Thus, we disclose the compensation level given to the hired annotators. The average sentence lengths of the two corpora tested are 10.3 words for Yelp and 15.5 words for IMDB. In addition, the annotation was performed at the sentence level, in which the annotators were asked to score a model-generated sentence. Considering the length and the difficulty, the expected number of annotations per hour was 100 sentences. The hourly pay was set to 100 Hong Kong dollars (HK$), which is higher than Hong Kong's statutory minimum wage. The annotators evaluated 1,500 sentences in total (750 sentences per dataset); thus each annotator was compensated with a total amount of HK$1,500.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5997–6007, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. 2017. A learned representation for artistic style. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1587–1596, International Convention Centre, Sydney, Australia. PMLR.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.

Dayiheng Liu, Jie Fu, Yidan Zhang, Chris Pal, and Jiancheng Lv. 2020. Revision in continuous space: Unsupervised text style transfer without adversarial learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8376–8383. AAAI Press.

Moin Nadeem, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass. 2019. FAKTA: An automatic end-to-end fact checking system. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 78–83, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876, Melbourne, Australia. Association for Computational Linguistics.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30, pages 6830–6841. Curran Associates, Inc.

Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022.

Rob Voigt, David Jurgens, Vinodkumar Prabhakaran, Dan Jurafsky, and Yulia Tsvetkov. 2018. RtGender: A corpus for studying differential responses to gender. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Ke Wang, Hang Hua, and Xiaojun Wan. 2019. Controllable unsupervised text attribute transfer via editing entangled latent representation. In Advances in Neural Information Processing Systems 32, pages 11036–11046. Curran Associates, Inc.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5271–5277. International Joint Conferences on Artificial Intelligence Organization.

Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xuancheng Ren, Houfeng Wang, and Wenjie Li. 2018. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 979–988, Melbourne, Australia. Association for Computational Linguistics.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.