
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4061–4071, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text

Lukas Ruff†  Yury Zemlyanskiy‡  Robert Vandermeulen§  Thomas Schnake†  Marius Kloft§‡

†Technical University of Berlin, Berlin, Germany
‡University of Southern California, Los Angeles, United States
§Technical University of Kaiserslautern, Kaiserslautern, Germany

{lukas.ruff,t.schnake}@tu-berlin.de  [email protected]  {vandermeulen,kloft}@cs.uni-kl.de

Abstract

There exist few text-specific methods for unsupervised anomaly detection, and for those that do exist, none utilize pre-trained models for distributed vector representations of words. In this paper we introduce a new anomaly detection method—Context Vector Data Description (CVDD)—which builds upon word embedding models to learn multiple sentence representations that capture multiple semantic contexts via the self-attention mechanism. Modeling multiple contexts enables us to perform contextual anomaly detection of sentences and phrases with respect to the multiple themes and concepts present in an unlabeled text corpus. These contexts in combination with the self-attention weights make our method highly interpretable. We demonstrate the effectiveness of CVDD quantitatively as well as qualitatively on the well-known Reuters, 20 Newsgroups, and IMDB Movie Reviews datasets.

1 Introduction

Anomaly Detection (AD) (Chandola et al., 2009; Pimentel et al., 2014; Aggarwal, 2017) is the task of discerning rare or unusual samples in a corpus of unlabeled data. A common approach to AD is one-class classification (Moya et al., 1993), where the objective is to learn a model that compactly describes "normality"—usually assuming that most of the unlabeled training data is "normal," i.e. non-anomalous. Deviations from this description are then deemed to be anomalous. Examples of one-class classification methods are the well-known One-Class SVM (OC-SVM) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004).

Applying AD to text is useful for many applications, including discerning anomalous web content (e.g. posts, reviews, or product descriptions), automated content management, spam detection, and characterizing news articles so as to identify similar or dissimilar novel topics. Recent work has found that proper text representation is critical for designing well-performing machine learning algorithms. Given the exceptional impact that universal vector embeddings of words (Bengio et al., 2003) such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017; Joulin et al., 2017) or dynamic vector embeddings of text by language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have had on NLP, it is somewhat surprising that there has been no work on adapting AD techniques to use such unsupervised pre-trained models. Existing AD methods for text still typically rely on bag-of-words (BoW) text representations (Manevitz and Yousef, 2001, 2007; Mahapatra et al., 2012; Kannan et al., 2017).

In this work, we introduce a novel one-class classification method that takes advantage of pre-trained word embedding models for performing AD on text. Starting with pre-trained word embeddings, our method—Context Vector Data Description (CVDD)—finds a collection of transforms to map variable-length sequences of word embeddings to a collection of fixed-length text representations via a multi-head self-attention mechanism. These representations are trained along with a collection of context vectors such that the context vectors and representations are similar while keeping the context vectors diverse. Training these representations and context vectors jointly allows our algorithm to capture multiple modes of normalcy which may, for example, correspond to a collection of distinct yet non-anomalous topics. Disentangling such modes allows for contextual anomaly detection with sample-based explanations and enhanced model interpretability.


(a) Overall most anomalous reviews in the IMDB test set according to CVDD.

(b) Top words per context by self-attention weights from the most similar sentences per CVDD context:
    c1: great, excellent, good, superb, well, wonderful, nice, best, terrific, beautiful
    c2: awful, downright, stupid, inept, pathetic, irritating, annoying, inane, unfunny, horrible
    c3: plot, characters, story, storyline, scenes, narrative, subplots, twists, tale, interesting

(c) Most normal reviews in the IMDB test set for CVDD contexts c1, c2, and c3 with highlighted self-attention weights.

Figure 1: Illustration of CVDD trained on IMDB Movie Reviews. The five most anomalous movie reviews are shown in (a). Table (b) shows the most important words from three of the CVDD contexts. Our model successfully disentangles positive and negative sentiments as well as cinematic language in an unsupervised manner. (c) shows the most normal examples w.r.t. those contexts, with explanations given by the highlighted self-attention weights.

2 Context Vector Data Description

In this section we introduce Context Vector Data Description (CVDD), a self-attentive, multi-context one-class classification method for unsupervised AD on text. We first describe the CVDD model and objective, followed by a description of its optimization procedure. Finally, we present some analysis of CVDD.

2.1 Multi-Head Self-Attention

Here we describe the problem setting and the multi-head self-attention mechanism which lies at the core of our method. Let $S = (w_1, \ldots, w_\ell) \in \mathbb{R}^{d \times \ell}$ be a sentence or, more generally, a sequence of $\ell \in \mathbb{N}$ words (e.g. a phrase or document), where each word is represented via some $d$-dimensional vector. Given some pre-trained word embedding, let $H = (h_1, \ldots, h_\ell) \in \mathbb{R}^{p \times \ell}$ be the corresponding $p$-dimensional vector embeddings of the words in $S$. The vector embedding $H$ might be some universal word embedding (e.g. GloVe, fastText) or the hidden vector activations of sentence $S$ given by some language model (e.g. ELMo, BERT).

The aim of multi-head self-attention (Lin et al., 2017) is to define a transformation that accepts sentences $S^{(1)}, \ldots, S^{(n)}$ of varying lengths $\ell^{(1)}, \ldots, \ell^{(n)}$ and returns a vector of fixed length, thereby allowing us to apply more standard AD techniques. The idea here is to find a fixed-length vector representation of size $p$ via a convex combination of the word vectors in $H$. The coefficients of this convex combination are adaptive weights which are learned during training.

We describe the model now in more detail. Given the word embeddings $H \in \mathbb{R}^{p \times \ell}$ of a sentence $S$, the first step of the self-attention mechanism is to compute the attention matrix $A \in (0,1)^{\ell \times r}$ by

$$A = \operatorname{softmax}\big(\tanh(H^\top W_1)\, W_2\big), \qquad (1)$$

where $W_1 \in \mathbb{R}^{p \times d_a}$ and $W_2 \in \mathbb{R}^{d_a \times r}$. The tanh-activation is applied element-wise and the softmax column-wise, thus making each vector $a_k$ of the attention matrix $A = (a_1, \ldots, a_r)$ a positive vector that sums to one, i.e. a weighting vector. The $r$ vectors $a_1, \ldots, a_r$ are called attention heads, with each head giving a weighting over the words in the sentence.


The dimension $d_a$ is the internal dimensionality and thus sets the complexity of the self-attention module. We now obtain a fixed-length sentence embedding matrix $M = (m_1, \ldots, m_r) \in \mathbb{R}^{p \times r}$ from the word embeddings $H$ by applying the self-attention weights $A$ as

$$M = HA. \qquad (2)$$

Thus, each column $m_k \in \mathbb{R}^p$ is a weighted linear combination of the vector embeddings $h_1, \ldots, h_\ell \in \mathbb{R}^p$ with weights $a_k \in \mathbb{R}^\ell$ given by the respective attention head $k$, i.e. $m_k = H a_k$. Often, a regularization term $P$ such as

$$P = \frac{1}{n} \sum_{i=1}^{n} \big\| A^{(i)\top} A^{(i)} - I \big\|_F^2 \qquad (3)$$

is added to the learning objective to promote the attention heads to be nearly orthogonal and thus capture distinct views that focus on different semantics and concepts of the data. Here, $I$ denotes the $r \times r$ identity matrix, $\|\cdot\|_F$ is the Frobenius norm, and $A^{(i)} := A(H^{(i)}; W_1, W_2)$ is the attention matrix corresponding to sample $S^{(i)}$.
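To make the mechanism concrete, the following is a minimal PyTorch sketch of Eqs. (1)–(3). The class and function names are ours, the batch is assumed to consist of padded sentences of equal length (masking of padded positions is omitted), and the small random weight initialization is an arbitrary choice rather than one prescribed by the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, p: int, d_a: int, r: int):
            super().__init__()
            # W1 in R^{p x d_a}, W2 in R^{d_a x r}; small random init (our choice)
            self.W1 = nn.Parameter(0.01 * torch.randn(p, d_a))
            self.W2 = nn.Parameter(0.01 * torch.randn(d_a, r))

        def forward(self, H: torch.Tensor):
            # H: (batch, p, l) word embeddings of a batch of (padded) sentences
            # Eq. (1): A = softmax(tanh(H^T W1) W2), softmax over the word dimension
            scores = torch.tanh(H.transpose(1, 2) @ self.W1) @ self.W2  # (batch, l, r)
            A = F.softmax(scores, dim=1)
            # Eq. (2): M = H A, one p-dimensional embedding per attention head
            M = H @ A                                                   # (batch, p, r)
            return M, A

    def attention_orthogonality_penalty(A: torch.Tensor) -> torch.Tensor:
        # Eq. (3): squared Frobenius norm of (A^T A - I), averaged over the batch
        I = torch.eye(A.size(-1), device=A.device)
        G = A.transpose(1, 2) @ A - I
        return (G ** 2).sum(dim=(1, 2)).mean()

In practice one would mask padded positions before the column-wise softmax so that padding receives zero attention weight; the sketch glosses over this.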

2.2 The CVDD Objective

In this section, we introduce an unsupervised AD method for text. It aims to capture multiple distinct contexts that may appear in normal text. To do so, it leverages the multi-head self-attention mechanism described in the previous section, with the heads focusing on distinct contexts (one head per context).

We first define a notion of similarity. Let $\operatorname{sim}(u, v)$ be the cosine similarity between two vectors $u$ and $v$, i.e.

$$\operatorname{sim}(u, v) = \frac{\langle u, v \rangle}{\|u\|\,\|v\|} \in [-1, 1], \qquad (4)$$

and denote by $d(u, v)$ the cosine distance between $u$ and $v$, i.e.

$$d(u, v) = \tfrac{1}{2}\big(1 - \operatorname{sim}(u, v)\big) \in [0, 1]. \qquad (5)$$

As before, let $r$ be the number of attention heads. We now define the context matrix $C = (c_1, \ldots, c_r) \in \mathbb{R}^{p \times r}$ to be a matrix whose columns $c_1, \ldots, c_r$ are vectors in the word embedding space $\mathbb{R}^p$. Given an unlabeled training corpus $S^{(1)}, \ldots, S^{(n)}$ of sentences (or phrases, documents, etc.), which may vary in length $\ell^{(i)}$, and their corresponding word vector embeddings $H^{(1)}, \ldots, H^{(n)}$, we formulate the Context Vector Data Description (CVDD) objective as follows:

$$\min_{C,\,W_1,\,W_2}\ \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{r} \sigma_k(H^{(i)})\, d\big(c_k, m_k^{(i)}\big)}_{=:\,J_n(C, W_1, W_2)}. \qquad (6)$$

$\sigma_1(H), \ldots, \sigma_r(H)$ are input-dependent weights, i.e. $\sum_k \sigma_k(H) = 1$, which we specify further below in detail. That is, CVDD finds a set of vectors $c_1, \ldots, c_r \in \mathbb{R}^p$ with small cosine distance to the attention-weighted data representations $m_1^{(i)}, \ldots, m_r^{(i)} \in \mathbb{R}^p$. Note that each concept vector $c_k$ is paired with the data representation $m_k^{(i)}$ of the $k$th head. This causes the network to learn attention weights that extract the most common concepts and themes from the data. We call the vectors $c_1, \ldots, c_r \in \mathbb{R}^p$ context vectors because they represent a compact description of the different contexts that are present in the data. For a given text sample $S^{(i)}$, the corresponding embedding $m_k^{(i)}$ provides a sample representation with respect to the $k$th context.

Multi-context regularization  To promote the context vectors $c_1, \ldots, c_r$ to capture diverse themes and concepts, we regularize them towards orthogonality:

$$P(C) = \big\| C^\top C - I \big\|_F^2. \qquad (7)$$

We can now state the overall CVDD objective as

$$\min_{C,\,W_1,\,W_2}\ J_n(C, W_1, W_2) + \lambda P(C), \qquad (8)$$

where $J_n(C, W_1, W_2)$ is the objective function from Eq. (6) and $\lambda > 0$. Because CVDD minimizes the cosine distance

$$d(c_k, m_k) = \frac{1}{2}\left(1 - \left\langle \frac{c_k}{\|c_k\|}, \frac{H a_k}{\|H a_k\|} \right\rangle\right), \qquad (9)$$

regularizing the context vectors $c_1, \ldots, c_r$ to be orthogonal implicitly regularizes the attention weight vectors $a_1, \ldots, a_r$ to be orthogonal as well, a phenomenon which we also observed empirically. Despite this, we found that regularizing the context vectors as in (7) allows for faster, more stable optimization in comparison to regularizing the attention weights as in (3). We suspect this is because in (3) $P = P_n(W_1, W_2)$ depends nonlinearly on the attention network weights $W_1$ and $W_2$ as well as on the data batches. In comparison, the gradients of $P(C)$ in (7) can be computed directly.


Empirically, we found that selecting $\lambda \in \{1, 10\}$ gives reliable results with the desired effect that CVDD learns multiple distinct contexts.

Optimization of CVDD  We optimize the CVDD objective jointly over the self-attention network weights $\{W_1, W_2\}$ and the context vectors $c_1, \ldots, c_r$ using stochastic gradient descent (SGD) and its variants (e.g. Adam (Kingma and Ba, 2014)). Thus, CVDD training scales linearly in the number of training batches. Training is carried out until convergence. Since the self-attention module is just a two-layer feed-forward network, the computational complexity of training CVDD is very low. However, evaluating a pre-trained model for obtaining word embeddings may add to the computational cost (e.g. in the case of large pre-trained language models), in which case parallelization strategies (e.g. using GPUs) should be exploited. We initialize the context vectors with the centroids from running k-means++ (Arthur and Vassilvitskii, 2007) on the sentence representations obtained from averaging the word embeddings. We empirically found that this initialization strategy improves optimization speed as well as performance.
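A minimal sketch of this initialization, assuming the training corpus is given as a list of NumPy embedding matrices $H$ of shape $(p, \ell_i)$ and using scikit-learn's k-means++ seeding; the function name is ours.

    import numpy as np
    from sklearn.cluster import KMeans

    def init_context_vectors(train_embeddings, r):
        # One averaged word embedding per training sample -> array of shape (n, p)
        avg = np.stack([H.mean(axis=1) for H in train_embeddings])
        kmeans = KMeans(n_clusters=r, init="k-means++", n_init=10).fit(avg)
        # Context matrix C in R^{p x r}: one k-means++ centroid per context
        return kmeans.cluster_centers_.T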

Weighting contexts in the CVDD objective  There is a natural motivation to consider multiple vectors for representation because sentences or documents may have multiple contexts, e.g. cinematic language, movie theme, or sentiment in movie reviews. This raises the question of how these context representations should be weighted in a learning objective. For this, we propose to use a parameterized softmax over the $r$ cosine distances of a sample $S$ with embedding $H$ in our CVDD objective:

$$\sigma_k(H) = \frac{\exp\big(-\alpha\, d(c_k, m_k(H))\big)}{\sum_{j=1}^{r} \exp\big(-\alpha\, d(c_j, m_j(H))\big)}, \qquad (10)$$

for $k = 1, \ldots, r$ with $\alpha \in [0, +\infty)$. The $\alpha$ parameter allows us to balance the weighting between two extreme cases: (i) $\alpha = 0$, which results in all contexts being equally weighted, i.e. $\sigma_k(H) = 1/r$ for all $k$, and (ii) $\alpha \to \infty$, in which case the softmax approximates the argmin function, i.e. only the closest context $k_{\min} = \operatorname{argmin}_k d(c_k, m_k)$ has weight $\sigma_{k_{\min}} = 1$, whereas $\sigma_k = 0$ for $k \neq k_{\min}$.

Traditional clustering methods typically only consider the argmin, i.e. the closest representatives (e.g. the nearest centroid in k-means). For learning multiple sentence representations as well as contexts from data, however, this might be ineffective and result in poor representations. This is due to the optimization getting "trapped" by the closest context vectors, which strongly depend on the initialization. Not penalizing the distances to other context vectors also does not foster the extraction of multiple contexts per sample. For this reason, we initially set $\alpha = 0$ in training and then gradually increase the $\alpha$ parameter using some annealing strategy. Thus, learning initially focuses on extracting multiple contexts from the data before the sample representations gradually get fine-tuned w.r.t. their closest contexts.
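Putting Eqs. (6)–(8) and (10) together, the following is a minimal PyTorch sketch of the training loss, assuming $M$ is produced by a self-attention module like the one sketched in Section 2.1. Names such as cvdd_loss and lam are ours, and implementation details such as masking or whether gradients are propagated through the $\sigma$ weights are left open here.

    import torch
    import torch.nn.functional as F

    def cosine_distance(c, m):
        # Eq. (5): d(u, v) = (1 - cosine similarity) / 2, computed per attention head
        return 0.5 * (1.0 - F.cosine_similarity(c, m, dim=-1))

    def cvdd_loss(M, C, alpha, lam):
        # M: (batch, p, r) head-wise sentence embeddings, C: (p, r) context vectors
        dists = cosine_distance(C.T.unsqueeze(0), M.transpose(1, 2))  # (batch, r)
        # Eq. (10): softmax context weights; alpha = 0 gives equal weights 1/r
        sigma = F.softmax(-alpha * dists, dim=1)
        J = (sigma * dists).sum(dim=1).mean()                         # Eq. (6)
        # Eq. (7): regularize the context vectors towards orthogonality
        I = torch.eye(C.size(1), device=C.device)
        P = ((C.T @ C - I) ** 2).sum()
        return J + lam * P                                            # Eq. (8)

Here $C$ would be a trainable parameter (e.g. an nn.Parameter of shape $(p, r)$), initialized with the k-means++ centroids sketched above, and $\alpha$ would follow the annealing schedule described in this paragraph.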

2.3 Contextual Anomaly Detection

Our CVDD learning objective and the introduction of context vectors allow us to score the "anomalousness" of a sample in relation to these various contexts, i.e. to determine anomalies contextually. We naturally define the anomaly score w.r.t. context $k$ for some sample $S$ with embedding $H$ as

$$s_k(H) = d\big(c_k, m_k(H)\big), \qquad (11)$$

the cosine distance of the contextual embedding $m_k(H)$ to the respective context vector $c_k$. A greater distance of $m_k(H)$ to $c_k$ implies a more anomalous sample w.r.t. context $k$. A straightforward choice for an overall anomaly score then is to take the average over the contextual anomaly scores:

$$s(H) = \frac{1}{r} \sum_{k=1}^{r} s_k(H). \qquad (12)$$

One might, however, select different weights for different contexts, as particular contexts might be more or less relevant in certain scenarios. Using word lists created from the most similar attention-weighted sentences to a context vector provides an interpretation of the particular context.
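A minimal sketch of the contextual scores (11) and the averaged score (12), assuming a trained context matrix $C$ and head-wise embeddings $M$ as above; the function name is ours.

    import torch
    import torch.nn.functional as F

    def anomaly_scores(M, C):
        # M: (batch, p, r) contextual embeddings, C: (p, r) trained context vectors
        # Eq. (11): per-context cosine distances; Eq. (12): their average
        s_k = 0.5 * (1.0 - F.cosine_similarity(C.T.unsqueeze(0), M.transpose(1, 2), dim=-1))
        return s_k, s_k.mean(dim=1)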

2.4 Avoiding Manifold Collapse

Neural approaches to AD (Ruff et al., 2018) and clustering (Fard et al., 2018) are prone to converge to degenerate solutions where the data is transformed to a small manifold or a single point. CVDD potentially may also suffer from this manifold collapse phenomenon. Indeed, there exists a theoretically optimal solution $(C^*, W_1^*, W_2^*)$ for which the (nonnegative) CVDD objective (6) becomes zero due to trivial representations.


This is the case for $(C^*, W_1^*, W_2^*)$ where

$$m_k(H^{(i)}; W_1^*, W_2^*) = c_k^* \quad \forall\, i = 1, \ldots, n \qquad (13)$$

holds, i.e. if the contextual representation $m_k(\,\cdot\,; W_1^*, W_2^*)$ is a constant mapping. In this case, all contextual data representations have collapsed to the respective context vectors and are independent of the input sentence $S$ with embedding $H$. Because the pre-trained embeddings $H$ are fixed, and the self-attention embedding must be a convex combination of the columns in $H$, it is difficult for the network to overfit to a constant function. A degenerate solution may only occur if there exists a word which occurs in the same location in all training samples. This property of CVDD, however, might be used "as a feature" to uncover such simple common structure amongst the data, such that appropriate pre-processing steps can be carried out to rule out such "Clever Hans" behavior (Lapuschkin et al., 2019). Finally, since we normalize the contextual representations $m_k$ as well as the context vectors $c_k$ with cosine similarity, a trivial collapse to the origin ($m_k = 0$ or $c_k = 0$) is also avoided.

3 Related Work

Our method is related to works on unsupervised representation learning for NLP, methods for AD on text, as well as representation learning for AD.

Vector representations of words (Bengio et al., 2003; Collobert and Weston, 2008), or word embeddings, have been the key for many substantial advances in NLP in recent history. Well-known examples include word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Bojanowski et al., 2017; Joulin et al., 2017). Approaches for learning sentence embeddings have also been introduced, including SkipThought (Kiros et al., 2015), ParagraphVector (Le and Mikolov, 2014), Conceptual Sentence Embedding (Wang et al., 2016), Sequential Denoising Autoencoders (Hill et al., 2016), or FastSent (Hill et al., 2016). In a comparison of unsupervised sentence embedding models, Hill et al. (2016) show that the optimal embedding critically depends on the targeted downstream task. For specific applications, more complex deep models such as recurrent (Chung et al., 2014), recursive (Socher et al., 2013), or convolutional (Kim, 2014) networks that learn task-specific dynamic sentence embeddings usually perform best. Recently, large language models like ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018) that learn dynamic sentence embeddings in an unsupervised manner have proven to be very effective for transfer learning, beating the state of the art in many downstream tasks. Such large deep models, however, are very computationally intensive. Finally, no method for learning representations of words or sentences specifically for the AD task has been proposed yet.

There are only a few works addressing AD on text. Manevitz and Yousef study one-class classification of documents using the OC-SVM (Schölkopf et al., 2001; Manevitz and Yousef, 2001) and a simple autoencoder (Manevitz and Yousef, 2007). Liu et al. (2002) consider learning from positively labeled as well as unlabeled mixed data of documents, what they call a "partially supervised classification" problem that is similar to one-class classification. Kannan et al. (2017) introduce a non-negative matrix factorization method for AD on text that is based on block coordinate descent optimization. Mahapatra et al. (2012) include external contextual information for detecting anomalies using LDA clustering. All the above works, however, only consider document-to-word co-occurrence text representations. Other approaches rely on specific handcrafted features for particular domains or types of anomalies (Guthrie, 2008; Kumaraswamy et al., 2015). None of the existing methods make use of pre-trained word models that were trained on huge corpora of text.

Learning representations for AD, or Deep Anomaly Detection (Chalapathy and Chawla, 2019), has seen great interest recently. Such approaches are motivated by applications on large and complex datasets and the limited scalability of classic, shallow AD techniques and their need for manual feature engineering. Deep approaches aim to overcome those limitations by automatically learning relevant features from the data and by mini-batch training for improved computational scaling. Most existing deep AD works are in the computer vision domain and show promising, state-of-the-art results for image data (Andrews et al., 2016; Schlegl et al., 2017; Ruff et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019). Other works examine deep AD on general high-dimensional point data (Sakurada and Yairi, 2014; Xu et al., 2015; Erfani et al., 2016; Chen et al., 2017).


Few deep approaches examine sequential data, and those that do exist focus on time series AD using LSTM networks (Bontemps et al., 2016; Malhotra et al., 2015, 2016). As mentioned earlier, there exists no representation learning approach for AD on text.

4 Experiments

We evaluate the performance of CVDD quantitatively in one-class classification experiments on the Reuters-21578 [1] and 20 Newsgroups [2] datasets, as well as qualitatively in an application on IMDB Movie Reviews [3] to detect anomalous reviews. [4]

[1] daviddlewis.com/resources/testcollections/reuters21578
[2] qwone.com/~jason/20Newsgroups
[3] ai.stanford.edu/~amaas/data/sentiment
[4] Code available at: github.com/lukasruff/CVDD-PyTorch

4.1 Experimental Details

Pre-trained Models  We employ the pre-trained GloVe (Pennington et al., 2014) as well as fastText (Bojanowski et al., 2017; Joulin et al., 2017) word embeddings in our experiments. For GloVe, we consider the 6B-token vector embeddings of p = 300 dimensions that have been trained on the Wikipedia and Gigaword 5 corpora. For fastText, we consider the English word vectors, which also have p = 300 dimensions and have been trained on Wikipedia and the English webcrawl. We also experimented with dynamic word embeddings from the BERT (Devlin et al., 2018) language model but did not observe improvements over GloVe or fastText on the considered datasets that would justify the added computational cost.
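For concreteness, a minimal sketch of turning a tokenized sentence into the embedding matrix $H$ used throughout Section 2 from a local GloVe file (glove.6B.300d.txt, one word followed by 300 floats per line); skipping out-of-vocabulary tokens is our choice here, not something prescribed by the paper.

    import numpy as np

    def load_glove(path="glove.6B.300d.txt"):
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def embed_sentence(tokens, vectors, p=300):
        # H in R^{p x l}; out-of-vocabulary tokens are simply skipped (our choice)
        cols = [vectors[t] for t in tokens if t in vectors]
        return np.stack(cols, axis=1) if cols else np.zeros((p, 0), dtype=np.float32)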

Baselines  We consider three baselines for aggregating word vector embeddings to fixed-length sentence representations: (i) mean, (ii) tf-idf weighted mean, and (iii) max-pooling. It has been repeatedly observed that the simple average sentence embedding proves to be a strong baseline in many tasks (Wieting et al., 2016; Arora et al., 2017; Rückle et al., 2018). Max-pooling is commonly applied over hidden activations (Lee and Dernoncourt, 2016). The tf-idf weighted mean is a natural sentence embedding baseline that includes document-to-term co-occurrence statistics. For AD, we then consider the OC-SVM (Schölkopf et al., 2001) with cosine kernel (which in this case is equivalent to SVDD (Tax and Duin, 2004)) on these sentence embeddings, where we always train for hyperparameters ν ∈ {0.05, 0.1, 0.2, 0.5} and report the best result.
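A minimal sketch of this baseline pipeline, assuming sentence embeddings built as above; L2-normalizing the vectors and using a linear kernel reproduces the cosine kernel, and ν would be swept over {0.05, 0.1, 0.2, 0.5} as stated. The tf-idf weighted mean baseline is omitted from the sketch.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def mean_embedding(H):
        return H.mean(axis=1)

    def max_pool_embedding(H):
        return H.max(axis=1)

    def fit_ocsvm(train_embeddings, nu=0.1, aggregate=mean_embedding):
        X = np.stack([aggregate(H) for H in train_embeddings])
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors: linear = cosine kernel
        return OneClassSVM(kernel="linear", nu=nu).fit(X)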

CVDD configuration  We employ self-attention with $d_a = 150$ for CVDD and present results for r ∈ {3, 5, 10} attention heads. We use Adam (Kingma and Ba, 2014) with a batch size of 64 for optimization and first train for 40 epochs with a learning rate of η = 0.01, after which we train another 60 epochs with η = 0.001, i.e. we establish a simple two-phase learning rate schedule. For weighting contexts, we consider the case of equal weights (α = 0) as well as a logarithmic annealing strategy α ∈ {0, 10^-4, 10^-3, 10^-2, 10^-1}, where we update α every 20 epochs. For multi-context regularization, we choose λ ∈ {1, 10}.

Data pre-processing  On all three datasets, we always lowercase text and strip punctuation, numbers, as well as redundant whitespace. Moreover, we remove stopwords using the stopword list from the nltk library (Bird et al., 2009) and only consider words with a minimum length of 3 characters.
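A minimal sketch of this pre-processing, assuming the nltk stopword list has been downloaded; the exact regular expression is our approximation of stripping punctuation, numbers, and redundant whitespace.

    import re
    from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

    STOPWORDS = set(stopwords.words("english"))

    def preprocess(text):
        # lowercase, drop punctuation and numbers, then filter stopwords and short tokens
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        return [t for t in text.split() if len(t) >= 3 and t not in STOPWORDS]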

4.2 One-Class Classification of News Articles

Setup  We perform one-class classification experiments on the Reuters-21578 and 20 Newsgroups topic classification datasets, which allow us to quantitatively evaluate detection performance via the AUC measure by using the ground-truth labels in testing. Such use of classification datasets is the typical evaluation approach in the AD literature (Erfani et al., 2016; Ruff et al., 2018; Golan and El-Yaniv, 2018). For the multi-label Reuters dataset, we only consider the subset of samples that have exactly one label and have selected the classes such that there are at least 100 training examples per class. For 20 Newsgroups, we consider the six top-level subject matter groups computer, recreation, science, miscellaneous, politics, and religion as distinct classes. In every one-class classification setup, one of the classes is the normal class and the remaining classes are considered anomalous. We always train the models only with the training data from the respective normal class. We then perform testing on the test samples from all classes, where samples from the normal class get label y = 0 ("normal") and samples from all the remaining classes get label y = 1 ("anomalous") for determining the AUC.
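A minimal sketch of this evaluation protocol, assuming anomaly scores where higher means more anomalous; function and variable names are ours.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate_auc(scores, test_classes, normal_class):
        # y = 0 for the normal class, y = 1 for every other class
        y_true = (np.asarray(test_classes) != normal_class).astype(int)
        return roc_auc_score(y_true, scores)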


                        GloVe                                     fastText
Normal          OC-SVM              CVDD                  OC-SVM              CVDD
Class       mean tf-idf   max   r=3   r=5  r=10    c*    mean tf-idf   max   r=3   r=5  r=10    c*

Reuters
earn        91.1  88.6  77.1  94.0  92.8  91.8  97.6    87.8  82.4  74.9  95.3  92.7  93.9  94.5
acq         93.1  77.0  81.4  90.2  88.7  91.5  95.6    91.8  74.1  80.2  91.0  90.3  92.7  92.4
crude       92.4  90.3  91.2  89.6  92.5  95.5  89.4    93.3  90.2  84.7  90.9  94.1  97.3  85.0
trade       99.0  96.8  93.7  98.3  98.2  99.2  97.9    97.6  95.0  92.1  97.9  98.1  99.3  97.7
money-fx    88.6  81.2  73.6  82.5  76.7  82.8  99.7    80.5  82.6  73.8  82.6  79.8  82.5  99.5
interest    97.4  93.5  84.2  92.3  91.7  97.7  98.4    91.6  88.7  82.8  93.3  92.1  95.9  97.4
ship        91.2  93.1  86.5  97.6  96.9  95.6  99.7    90.0  90.6  85.0  96.9  94.7  96.1  99.7

20 News
comp        82.0  81.2  54.5  70.9  66.4  63.3  86.6    77.5  78.0  65.5  74.0  68.2  64.2  88.2
rec         73.2  75.6  56.2  50.8  52.8  53.3  68.9    66.0  70.0  51.9  60.6  58.5  54.1  85.1
sci         60.6  64.1  53.0  56.7  56.8  55.7  61.0    61.0  64.2  57.0  58.2  57.6  55.9  64.4
misc        61.8  63.1  54.1  75.1  70.2  68.6  83.8    62.3  62.1  55.7  75.7  70.3  68.0  83.9
pol         72.5  75.5  64.9  62.9  65.3  65.1  75.4    73.7  76.1  68.1  71.5  66.4  67.1  82.8
rel         78.2  79.2  68.4  76.3  72.9  70.7  87.3    77.8  78.9  73.9  78.1  73.2  69.5  89.3

Table 1: AUCs (in %) of one-class classification experiments on Reuters and 20 Newsgroups.

computer
  c1:      get, help, thanks, appreciated, got, know, way, try, tried, take
  c2 (c*): windows, software, disk, dos, unix, computer, hardware, desktop, macintosh, cpu
  c3:      use, using, used, uses, possible, system, need, allow, could, application

politics
  c1:      kill, killed, escape, away, back, shoot, shot, crying, killing, fight
  c2:      think, know, say, really, thing, anyone, guess, something, understand, sure
  c3 (c*): government, peace, arab, political, occupation, forces, support, movement, leaders, parties

religion
  c1:      example, particular, specific, certain, analysis, rather, therefore, consistent, often, context
  c2 (c*): god, christ, christians, faith, jesus, christianity, bible, scripture, religion, worship
  c3:      one, first, two, three, also, later, time, last, year, four

Table 2: Example of top words per context on the 20 Newsgroups one-class classification experiments comp, pol, and rel for the CVDD model with r = 3 contexts.

Reuters
  earn:     shr, dividend, profit, qtr, net, prior, cts, dividends, share, loss
  acq:      acquire, buy, purchase, acquisition, stake, acquired, assets, transaction, sell, sale
  crude:    oil, crude, barrels, petroleum, prices, refinery, supply, exports, dlr, gas
  trade:    trade, imports, economic, exports, tariffs, goods, export, trading, deficit, pact
  money-fx: bank, market, dollar, currency, exchange, rates, liquidity, markets, monetary, treasury
  interest: rate, pct, bank, rates, discount, effective, interest, lending, raises, cuts
  ship:     port, shipping, ships, seamen, vessel, canal, cargo, vessels, sea, ferry

20 Newsgroups
  rec:      game, team, season, games, league, play, win, scoring, playoffs, playoff
  sci:      use, systems, modified, method, system, types, data, provide, devices, require
  misc:     sale, offer, shipping, price, sell, items, sold, selling, brand, bought

Table 3: Top words of the best single contexts c∗ for one-class classification experiments of news articles.

Results  The results are presented in Table 1. Overall, we can see that CVDD shows competitive detection performance. We compute the AUCs for CVDD from the average anomaly score over the contextual anomaly scores as defined in (12). We find CVDD performance to be robust over λ ∈ {1, 10}, and results are similar for weighting contexts equally (α = 0) or employing the logarithmic annealing strategy. The CVDD results in Table 1 are averages over those hyperparameters.

We get an intuition of the theme captured by some CVDD context vector by examining a list of top words for this context. We create such lists by counting the top words according to the highest self-attention weights from the most similar test sentences per context vector. Table 2 shows an example of such context word lists for three CVDD contexts. Such lists may guide a user in weighting and selecting relevant contexts in a specific application.


IMDB Movie Reviews
  c1:  great, excellent, good, superb, well, wonderful, nice, best, terrific, beautiful
  c2:  awful, downright, stupid, inept, pathetic, irritating, annoying, inane, unfunny, horrible
  c3:  plot, characters, story, storyline, scenes, narrative, subplots, twists, tale, interesting
  c4:  two, one, three, first, five, four, part, every, best, also
  c5:  think, anybody, know, would, say, really, want, never, suppose, actually
  c6:  actions, development, efforts, establishing, knowledge, involvement, policies, individuals, necessary, concerning
  c7:  film, filmmakers, filmmaker, movie, syberberg, cinema, director, acting, filmmaking, actors
  c8:  head, back, onto, cut, bottom, neck, floor, flat, thick, front
  c9:  william, john, michael, richard, davies, david, james, walter, robert, gordon
  c10: movie, movies, porn, sex, watch, teen, best, dvd, scenes, flick

Table 4: Top words per context on IMDB Movie Reviews for CVDD model with r = 10 contexts.

Following this thought, we also report the best single-context detection performance in AUC to illustrate the potential of contextual anomaly detection. Those results are given in the c* column of Table 1 and demonstrate the possible boosts in performance. We highlighted the respective best contexts in Table 2 and present word lists of the best contexts of all the other classes in Table 3. One can see that those contexts indeed appear to be typical for what one would expect as a characterization of those classes. Note that the OC-SVM on the simple mean embeddings establishes a strong baseline, as has been observed on other tasks. Moreover, the tf-idf weighted embeddings prove especially beneficial on the larger 20 Newsgroups dataset for filtering out "general language contexts" (similar to stop words) that are less discriminative for the characterization of a text corpus. A major advantage of CVDD is its strong interpretability and the potential for contextual AD, which allows one to sort out relevant contexts.

4.3 Detecting Anomalous Movie Reviews

Setup  We apply CVDD for detecting anomalous reviews in a qualitative experiment on IMDB Movie Reviews. For this, we train a CVDD model with r = 10 context vectors on the full IMDB training set of 25,000 movie review samples. After training, we examine the most anomalous and most normal reviews according to the CVDD anomaly scores on the IMDB test set, which also includes 25,000 reviews. We use GloVe word embeddings and keep the CVDD model settings and training details as outlined in Section 4.1.

Results  Table 4 shows the top words for each of the r = 10 CVDD contexts of the trained model. We can see that the different contexts of the CVDD model capture the different themes present in the movie reviews well. Note, for example, that c1 and c2 represent positive and negative sentiments respectively, c3, c7, and c10 refer to different aspects of cinematic language, and c9 captures names. Figure 1 in the introduction depicts qualitative examples from this experiment. Figure 1a shows the movie reviews having the highest anomaly scores. The top anomaly is a review that repeats the same phrase many times. Judging from the most anomalous reviews, the dataset seems to be quite clean in general, though. Figure 1c shows the most normal reviews w.r.t. the first three contexts, i.e. the samples that have the lowest respective contextual anomaly scores. Here, the self-attention weights give a sample-based explanation for why a particular review is normal in view of the respective context.

5 Conclusion

We presented a new self-attentive, multi-context one-class classification approach for unsupervised anomaly detection on text which makes use of pre-trained word models. Our method, Context Vector Data Description (CVDD), enables contextual anomaly detection and has strong interpretability features. We demonstrated the detection performance of CVDD empirically and showed qualitatively that CVDD is well capable of learning distinct, diverse contexts from unlabeled text corpora.

Acknowledgments

We thank Fei Sha for insightful discussions and comments as well as the anonymous reviewers for their helpful suggestions. LR acknowledges support from the German Federal Ministry of Education and Research (BMBF) in the project ALICE III (FKZ: 01IS18049B). YZ is partially supported by NSF Awards IIS-1513966, IIS-1632803, IIS-1833137, CCF-1139148, DARPA Award FA8750-18-2-0117, DARPA-D3M Award UCB-00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO W911NF-12-1-0241 and W911NF-15-1-0484.


MK and RV acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Federal Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while MK was a sabbatical visitor of the DASH Center at the University of Southern California.

References

Charu C. Aggarwal. 2017. Outlier Analysis. Springer.

J. T. A. Andrews, E. J. Morton, and L. D. Griffin. 2016. Detecting Anomalous Data Using Auto-Encoders. IJMLC, 6(1):21.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.

David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. JMLR, 3(Feb):1137–1155.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Loïc Bontemps, James McDermott, Nhien-An Le-Khac, et al. 2016. Collective anomaly detection based on long short-term memory recurrent neural networks. In International Conference on Future Data and Security Engineering, pages 141–152. Springer.

Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. arXiv:1901.03407.

V. Chandola, A. Banerjee, and V. Kumar. 2009. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):1–58.

J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. 2017. Outlier Detection with Autoencoder Ensembles. In SDM, pages 90–98.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS Workshop.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pages 160–167.

Lucas Deecke, Robert A. Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. 2018. Image anomaly detection with generative adversarial networks. In ECML-PKDD.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie. 2016. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134.

Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. 2018. Deep k-means: Jointly clustering with k-means and learning representations. arXiv:1806.10069.

Izhak Golan and Ran El-Yaniv. 2018. Deep anomaly detection using geometric transformations. In NIPS.

David Guthrie. 2008. Unsupervised Detection of Anomalous Text. Ph.D. thesis, University of Sheffield.

Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. 2019. Deep anomaly detection with outlier exposure. In ICLR.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377, San Diego, California. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Ramakrishnan Kannan, Hyenkyun Woo, Charu C. Aggarwal, and Haesun Park. 2017. Outlier detection for text data. In SDM, pages 489–497.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

D. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980.


Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS, pages 3294–3302.

Raksha Kumaraswamy, Anurag Wazalwar, Tushar Khot, Jude W. Shavlik, and Sriraam Natarajan. 2015. Anomaly detection in text: The value of domain knowledge. In FLAIRS Conference, pages 225–228.

Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML, pages 1188–1196.

Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 515–520, San Diego, California. Association for Computational Linguistics.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In ICLR.

Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. 2002. Partially supervised classification of text documents. In ICML, volume 2, pages 387–394.

Amogh Mahapatra, Nisheeth Srivastava, and Jaideep Srivastava. 2012. Contextual anomaly detection in text data. Algorithms, 5(4):469–489.

Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. In Anomaly Detection Workshop at the 33rd International Conference on Machine Learning.

Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. 2015. Long short term memory networks for anomaly detection in time series. In European Symposium on Artificial Neural Networks.

Larry M. Manevitz and Malik Yousef. 2001. One-class SVMs for document classification. JMLR, 2(Dec):139–154.

Larry M. Manevitz and Malik Yousef. 2007. One-class document classification via neural networks. Neurocomputing, 70(7-9):1466–1481.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

Mary M. Moya, Mark W. Koch, and Larry D. Hostetler. 1993. One-class classifier networks for target recognition applications. In Proceedings World Congress on Neural Networks, pages 797–801.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Processing, 99:215–249.

Andreas Rückle, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. 2018. Concatenated p-mean word embeddings as universal cross-lingual sentence representations. arXiv:1803.01400.

Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib A. Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In ICML, volume 80, pages 4390–4399.

M. Sakurada and T. Yairi. 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the 2nd MLSDA Workshop, page 4.

T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. 2017. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In IPMI, pages 146–157.

B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443–1471.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

D. M. J. Tax and R. P. W. Duin. 2004. Support Vector Data Description. Machine Learning, 54(1):45–66.


Yashen Wang, Heyan Huang, Chong Feng, Qiang Zhou, Jiahui Gu, and Xiong Gao. 2016. CSE: Conceptual sentence embeddings based on attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 505–515, Berlin, Germany. Association for Computational Linguistics.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In ICLR.

D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe. 2015. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection. In BMVC, pages 8.1–8.12.