
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1106–1117, July 5–10, 2020. ©2020 Association for Computational Linguistics


Hierarchy-Aware Global Model for Hierarchical Text Classification

Jie Zhou1,2∗, Chunping Ma2, Dingkun Long2, Guangwei Xu2, Ning Ding3, Haoyu Zhang4, Pengjun Xie2, Gongshen Liu1†

1School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
2Alibaba Group, 3Tsinghua University, 4National University of Defense Technology

{sanny02,lgshen}@sjtu.edu.cn, {kunka.xgw,chengchen.xpj}@taobao.com
{chunping.mcp,dingkun.ldk}@alibaba-inc.com
{dingn18}@mails.tsinghua.edu.cn, {zhanghaoyu10}@nudt.edu.cn

Abstract

Hierarchical text classification is an essential yet challenging subtask of multi-label text classification with a taxonomic hierarchy. Existing methods have difficulties in modeling the hierarchical label structure in a global view. Furthermore, they cannot make full use of the mutual interactions between the text feature space and the label space. In this paper, we formulate the hierarchy as a directed graph and introduce hierarchy-aware structure encoders for modeling label dependencies. Based on the hierarchy encoder, we propose a novel end-to-end hierarchy-aware global model (HiAGM) with two variants. A multi-label attention variant (HiAGM-LA) learns hierarchy-aware label embeddings through the hierarchy encoder and conducts inductive fusion of label-aware text features. A text feature propagation model (HiAGM-TP) is proposed as the deductive variant that directly feeds text features into hierarchy encoders. Compared with previous works, both HiAGM-LA and HiAGM-TP achieve significant and consistent improvements on three benchmark datasets.

1 Introduction

Text classification is widely used in Natural Language Processing (NLP) applications, such as sentiment analysis (Pang and Lee, 2007), information retrieval (Liu et al., 2015), and document categorization (Yang et al., 2016). Hierarchical text classification (HTC) is a particular multi-label text classification (MLC) problem, where the classification result corresponds to one or more nodes of a taxonomic hierarchy. The taxonomic hierarchy is commonly modeled as a tree or a directed acyclic graph, as depicted in Figure 1.

Existing approaches for HTC can be categorized into two groups: the local approach and the global approach.

∗This work was done during an internship at Alibaba Group. †Corresponding author.

Figure 1: This short sample is tagged with news, sports, football, features, and books. Note that HTC can be either a single-path or a multi-path problem.

The first group tends to construct multiple classification models and then traverse the hierarchy in a top-down manner. Previous local studies (Wehrmann et al., 2018; Shimura et al., 2018; Banerjee et al., 2019) propose to overcome the data imbalance on child nodes by learning from the parent ones. However, these models contain a large number of parameters and easily suffer from exposure bias due to the lack of holistic structural information. The global approach treats the HTC problem as a flat MLC problem and uses a single classifier for all classes. Recent global methods introduce various strategies to utilize the structural information of top-down paths, such as recursive regularization (Gopal and Yang, 2013), reinforcement learning (Mao et al., 2019), and meta-learning (Wu et al., 2019). So far, no global method encodes the holistic label structure for label correlation features. Moreover, these methods still exploit the hierarchy in a shallow manner, ignoring the fine-grained label correlation information that proves to be fruitful in our work.

In this paper, we formulate the hierarchy as a directed graph and utilize prior probabilities of label dependencies to aggregate node information. A hierarchy-aware global model (HiAGM) is proposed


to enhance textual information with the label structural features. It comprises a traditional text encoder for extracting textual information and a hierarchy-aware structure encoder for modeling hierarchical label relations. The hierarchy-aware structure encoder can be either a Tree-LSTM or a Hierarchy-GCN in which hierarchical prior knowledge is integrated. Moreover, these two structure encoders are calculated bidirectionally, allowing them to capture label correlation information in both top-down and bottom-up manners. As a result, HiAGM is more robust than previous top-down models and is able to alleviate the problems caused by exposure bias and imbalanced data.

To aggregate text features and label structural features, we present two variants of HiAGM, a multi-label attention model HiAGM-LA and a text feature propagation model HiAGM-TP. Both variants extract hierarchy-aware text features based on the structure encoders. HiAGM-LA extracts inductive label-wise text features, while HiAGM-TP generates hybrid information in a deductive manner. Specifically, HiAGM-LA updates the label embeddings across the holistic hierarchy and then employs the node outputs as hierarchy-aware label representations. Finally, it conducts multi-label attention to obtain label-aware text features. On the other hand, HiAGM-TP directly utilizes text features as the input of the structure encoder in a serial dataflow. Hence it propagates textual information throughout the overall hierarchy. The hidden state of each node in the entire hierarchy represents the class-specific textual information.

The major contributions of this paper are:

• With the prior hierarchy knowledge, we adopt typical structure encoders for modeling label dependencies in both top-down and bottom-up manners, which has not been investigated for hierarchical text classification.

• We propose a novel end-to-end hierarchy-aware global model (HiAGM). We further present two variants for label-wise text features, a hierarchy-aware multi-label attention model (HiAGM-LA) and a hierarchy-aware text feature propagation model (HiAGM-TP).

• We empirically demonstrate that both variants of HiAGM achieve consistent improvements on various datasets when using different structure encoders. Our best model outperforms the state-of-the-art model by 3.25% Macro-F1 and 0.66% Micro-F1 on RCV1-V2.

• We release our code and experimental splits of Web-of-Science and NYTimes for reproducibility.1

2 Related Work

Existing works on HTC can be categorized into local and global approaches. Local approaches can be subdivided into local classifier per node (LCN) (Banerjee et al., 2019), local classifier per parent node (LCPN) (Dumais and Chen, 2000), and local classifier per level (LCL) (Shimura et al., 2018; Wehrmann et al., 2018; Kowsari et al., 2017). Banerjee et al. (2019) transfers the parameters of parent models to child models in an LCN fashion. Wehrmann et al. (2018) alleviates the exposure bias problem with a hybrid of LCL and global optimization. Peng et al. (2018) decomposes the hierarchy into subgraphs and conducts Text-GCN on n-gram tokens.

The global approach improves flat MLC models with hierarchy information. Cai and Hofmann (2004) modifies SVM into Hierarchical-SVM by decomposition. Gopal and Yang (2013) proposes a simple recursive regularization of parameters among adjacent classes. Deep learning architectures are also employed in global models, such as sequence-to-sequence (Yang et al., 2018), meta-learning (Wu et al., 2019), reinforcement learning (Mao et al., 2019), and capsule networks (Peng et al., 2019). Those models mainly focus on improving decoders based on the constraint of hierarchical paths. In contrast, we propose an effective hierarchy-aware global model, HiAGM, that extracts label-wise text features with hierarchy encoders based on prior hierarchy information.

Moreover, the attention mechanism was introduced into MLC by Mullenbach et al. (2018) for ICD coding. Rios and Kavuluru (2018) trains label representations through a basic GraphCNN and conducts multi-label attention with residual shortcuts. AttentionXML (You et al., 2019) converts MLC into a multi-label attention LCL model via label clusters. Huang et al. (2019) improves HMCN (Wehrmann et al., 2018) with label attention per level. Our HiAGM-LA, however, employs multi-label attention in a single model with a simplified structure encoder, reducing the computational complexity.

Recent works in semantic analysis (Chen et al., 2017b), semantic role labeling (He et al., 2018), and machine translation (Chen et al., 2017a) show improvements in sentence representation from syntax

1 https://github.com/Alibaba-NLP/HiAGM


Figure 2: Example of the taxonomic hierarchy. The numbers indicate the prior probabilities of label dependencies according to the training corpus.

encoders, such as Tree-Based RNN (Tai et al., 2015; Chen et al., 2017a) and GraphCNN (Marcheggiani and Titov, 2017). We modify these structure encoders for HTC with fine-grained prior knowledge in both top-down and bottom-up manners.

3 Problem Definition

Hierarchical text classification (HTC), a subtask of text classification, organizes the label space with a predefined taxonomic hierarchy. The hierarchy is predefined based on the holistic corpus and groups label subsets according to class relations. The taxonomic hierarchy mainly takes the form of a tree-like structure or a directed acyclic graph (DAG). Note that a DAG can be converted into a tree-like structure by distinguishing each label node as a single-path node. Thus, the taxonomic hierarchy can be simplified as a tree-like structure.

As illustrated in Figure 2, we formulate a taxonomic hierarchy as a directed graph $G = (V, \overrightarrow{E}, \overleftarrow{E})$, where $V = \{v_1, v_2, \ldots, v_C\}$ refers to the set of label nodes and $C$ denotes the number of label nodes. $\overrightarrow{E} = \{(v_i, v_j) \mid i \in V, j \in child(i)\}$ is the set of top-down hierarchy paths and $\overleftarrow{E} = \{(v_j, v_i) \mid i \in V, j \in child(i)\}$ is the set of bottom-up hierarchy paths. Formally, we define HTC as $H = (X, L)$ with a sequence of text objects $X = (x_1, x_2, \ldots, x_N)$ and an aligned sequence of supervised label sets $L = (l_1, l_2, \ldots, l_N)$.

As depicted in Figure 1, each sample $x_i$ corresponds to a label set $l_i$ that includes multiple classes. These classes belong to either one or more sub-paths in the hierarchy. Note that a sample belonging to a child node $v_j \in child(i)$ also belongs to its parent node $v_i$.

4 Hierarchy-Aware Global Model

As depicted in Figure 3, we propose a Hierarchy-Aware Global Model (HiAGM) that leverages fine-grained hierarchy information and then aggregates label-wise text features. HiAGM consists of a traditional text encoder for textual information and a hierarchy-aware structure encoder for hierarchical label correlation features.

We present two variants of HiAGM for hybrid information aggregation, a multi-label attention model (HiAGM-LA) and a text feature propagation model (HiAGM-TP). HiAGM-LA updates label representations with the structure encoder and generates label-aware text features with a multi-label attention mechanism. HiAGM-TP propagates text representations throughout the holistic hierarchy, thus obtaining label-wise text features with the fusion of label correlations.

4.1 Prior Hierarchy Information

The taxonomic hierarchy describes the hierarchical relations among labels. The major bottleneck of HTC is how to make full use of this established structure. Previous studies directly utilize the hierarchy paths in a static manner based on a pipeline framework, a hierarchical model, or a label assignment model. In contrast, based on Bayesian statistical inference, HiAGM leverages the prior knowledge of label correlations derived from the predefined hierarchy and the corpus. We exploit the prior probabilities of label dependencies as prior hierarchy knowledge.

Suppose that there is a hierarchy path $e_{i,j}$ between the parent node $v_i$ and the child node $v_j$. This edge feature $f(e_{i,j})$ is represented by the prior probabilities $P(U_j \mid U_i)$ and $P(U_i \mid U_j)$ as:

$$
P(U_j \mid U_i) = \frac{P(U_j \cap U_i)}{P(U_i)} = \frac{P(U_j)}{P(U_i)} = \frac{N_j}{N_i}, \qquad
P(U_i \mid U_j) = \frac{P(U_i \cap U_j)}{P(U_j)} = \frac{P(U_j)}{P(U_j)} = 1.0, \tag{1}
$$

where $U_k$ denotes the occurrence of $v_k$ and $P(U_j \mid U_i)$ is the conditional probability of $v_j$ given that $v_i$ occurs. $P(U_j \cap U_i)$ is the probability of $\{v_j, v_i\}$ occurring simultaneously, and $N_k$ refers to the number of occurrences of $U_k$ in the training subset. Note that the hierarchy ensures the occurrence of $U_k$ given that $v_{child(k)}$ occurs, hence $P(U_j \cap U_i) = P(U_j)$. We rescale and normalize the prior probabilities of the child nodes $v_{child(k)}$ so that they sum to 1.
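For concreteness, the sketch below shows one way the prior-probability matrices of Eq. (1) could be derived from training-set label counts. It is an illustrative reconstruction, not the released implementation; the helper `build_prior_matrices` and its inputs (`hierarchy` as a parent-to-children mapping, `label_counts` as per-label occurrence counts) are hypothetical names.

```python
# Illustrative sketch (not the authors' released code) of Eq. (1):
# the top-down edge weight is P(U_j | U_i) = N_j / N_i, the bottom-up
# edge weight is P(U_i | U_j) = 1.0, and child probabilities of each
# parent are rescaled to sum to 1 as described in Section 4.1.
import numpy as np

def build_prior_matrices(hierarchy, label_counts, num_labels):
    """hierarchy: {parent_id: [child_ids]}, label_counts[k] = N_k."""
    top_down = np.zeros((num_labels, num_labels), dtype=np.float32)
    bottom_up = np.zeros((num_labels, num_labels), dtype=np.float32)
    for parent, children in hierarchy.items():
        weights = np.array(
            [label_counts[c] / max(label_counts[parent], 1) for c in children],
            dtype=np.float32)
        if weights.sum() > 0:
            weights = weights / weights.sum()   # normalize child priors to sum to 1
        for child, w in zip(children, weights):
            top_down[parent, child] = w         # f_c(e_{i,j}) = N_j / N_i (normalized)
            bottom_up[child, parent] = 1.0      # f_p(e_{i,j}) = 1.0
    return top_down, bottom_up
```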


Figure 3: The overall structure of our hierarchy-aware global model. HiAGM consists of a text encoder and a hierarchy-aware encoder. The dataflows of the structure encoders are illustrated in the grey dashed box. The two variants, HiAGM-LA and HiAGM-TP, are presented in black dashed boxes, respectively.

4.2 Hierarchy-Aware Structure Encoder

Tree-LSTM and graph convolutional neural networks (GCN) are widely used as structure encoders for aggregating node information in NLP (Tai et al., 2015; Chen et al., 2017a; He et al., 2018; Rios and Kavuluru, 2018). As depicted in Figure 3, HiAGM models fine-grained hierarchy information based on the hierarchy-aware structure encoder. Based on the prior hierarchy information, we improve typical structure encoders for the directed hierarchy graph. Specifically, the top-down dataflow employs the prior hierarchy information as $f_c(e_{i,j}) = \frac{N_j}{N_i}$ while the bottom-up one adopts $f_p(e_{i,j}) = 1.0$.

Bidirectional Tree-LSTM Tree-LSTM can be utilized as our structure encoder. The implementation is similar to that of syntax encoders (Tai et al., 2015; Zhang et al., 2016; Li et al., 2018). The predefined hierarchy is identical for all samples, which allows mini-batch training of this recursive computational module. The node transformation is as follows:

$$
\begin{aligned}
i_k &= \sigma(W^{(i)} v_k + U^{(i)} h_k + b^{(i)}), \\
f_{k,j} &= \sigma(W^{(f)} v_k + U^{(f)} h_j + b^{(f)}), \\
o_k &= \sigma(W^{(o)} v_k + U^{(o)} h_k + b^{(o)}), \\
u_k &= \tanh(W^{(u)} v_k + U^{(u)} h_k + b^{(u)}), \\
c_k &= i_k \odot u_k + \textstyle\sum_{j} f_{k,j} \odot c_j, \\
h_k &= o_k \odot \tanh(c_k),
\end{aligned}
\tag{2}
$$

where $h_k$ and $c_k$ represent the hidden state and the memory cell state of node $k$, respectively.

To induce label correlations, HiAGM employs a bidirectional Tree-LSTM with the fusion of a child-sum and a top-down module:

$$
\begin{aligned}
h^{\uparrow}_k &= \textstyle\sum_{j \in child(k)} f_p(e_{k,j})\, h^{\uparrow}_j, \\
h^{\downarrow}_k &= f_c(e_{k,p})\, h^{\downarrow}_p, \\
h^{bi}_k &= h^{\uparrow}_k \oplus h^{\downarrow}_k,
\end{aligned}
\tag{3}
$$

where $h^{\uparrow}_k$ and $h^{\downarrow}_k$ are calculated separately in the bottom-up and top-down manner as $h_k = \mathrm{TreeLSTM}(h_k)$, and $\oplus$ indicates the concatenation of hidden states. The final hidden state of node $k$ is the hierarchical node representation $h^{bi}_k$.

Hierarchy-GCN GCN (Kipf and Welling, 2017) is proposed to enhance node representations based on local graph structural information. Several NLP studies have improved Text-GCNs for rich word representations upon syntactic structure and word correlation (Marcheggiani and Titov, 2017; Vashishth et al., 2019; Yao et al., 2019; Peng et al., 2018). We introduce a simple Hierarchy-GCN for the hierarchy structure, thus incorporating the aforementioned fine-grained hierarchy information.

Hierarchy-GCN aggregates dataflows within the top-down, bottom-up, and self-loop edges. In the hierarchy graph, each directed edge represents a pair-wise label correlation feature. Thus, these dataflows should conduct node transformations with edge-wise linear transformations. However, edge-wise transformations would lead to over-parameterized edge-wise weight matrices. Our Hierarchy-GCN simplifies this transformation with a weighted adjacency matrix.


This weighted adjacency matrix represents the hierarchical prior probability. Formally, Hierarchy-GCN encodes the hidden state of node $k$ based on its associated neighbourhood $N(k) = \{n_k, child(k), parent(k)\}$ as:

$$
\begin{aligned}
u_{k,j} &= a_{k,j} v_j + b^k_l, \\
g_{k,j} &= \sigma(W^{d(j,k)}_g v_k + b^k_g), \\
h_k &= \mathrm{ReLU}\Big(\textstyle\sum_{j \in N(k)} g_{k,j} \odot u_{k,j}\Big),
\end{aligned}
\tag{4}
$$

where $W^{d(j,k)}_g \in \mathbb{R}^{dim}$, $b_l \in \mathbb{R}^{N \times dim}$, and $b_g \in \mathbb{R}^{N}$. $d(j,k)$ indicates the hierarchical direction from node $j$ to node $k$, covering top-down, bottom-up, and self-loop edges. Note that $a_{k,j} \in \mathbb{R}$ denotes the hierarchy probability $f_{d(k,j)}(e_{k,j})$, where the self-loop edge employs $a_{k,k} = 1$, top-down edges use $f_c(e_{j,k}) = \frac{N_k}{N_j}$, and bottom-up edges use $f_p(e_{j,k}) = 1$. The holistic edge feature matrix $F = \{a_{0,0}, a_{0,1}, \ldots, a_{C-1,C-1}\}$ is the weighted adjacency matrix of the directed hierarchy graph. Finally, the output hidden state $h_k$ of node $k$ denotes its label representation corresponding to the hierarchy structural information.
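As a concrete reading of Eq. (4), the following PyTorch sketch implements one Hierarchy-GCN layer. It is an assumption-laden illustration rather than the released code: the prior matrices `A_td` and `A_bu` are built as in Section 4.1, the self-loop uses an identity matrix, and the edge bias is added once per direction instead of once per edge for simplicity.

```python
import torch
import torch.nn as nn

class HierarchyGCNLayer(nn.Module):
    """Sketch of one Hierarchy-GCN layer following Eq. (4); not the released code."""
    def __init__(self, num_labels, dim):
        super().__init__()
        self.bias_l = nn.Parameter(torch.zeros(num_labels, dim))        # b_l
        # one gate vector W_g^d and gate bias b_g per direction d(j, k)
        self.gate_w = nn.ParameterDict(
            {d: nn.Parameter(0.01 * torch.randn(dim)) for d in ("td", "bu", "loop")})
        self.gate_b = nn.ParameterDict(
            {d: nn.Parameter(torch.zeros(num_labels)) for d in ("td", "bu", "loop")})

    def forward(self, v, A_td, A_bu):
        # v: (batch, C, dim) node inputs; A_td / A_bu: (C, C) weighted adjacency
        eye = torch.eye(v.size(1), device=v.device)
        out = 0.0
        for name, A in (("td", A_td), ("bu", A_bu), ("loop", eye)):
            # sum_j a_{k,j} v_j + b_l^k  (bias added once per direction for simplicity)
            u = torch.einsum("kj,bjd->bkd", A, v) + self.bias_l
            # g_k = sigmoid(W_g^d . v_k + b_g^k): a scalar gate per node and direction
            g = torch.sigmoid(v @ self.gate_w[name] + self.gate_b[name])
            out = out + g.unsqueeze(-1) * u
        return torch.relu(out)                                           # h_k
```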

4.3 Hybrid Information Aggregation

Previous global models classify labels upon the original textual information and improve the decoder with predefined hierarchy paths. In contrast, we construct a novel end-to-end hierarchy-aware global model (HiAGM) for the mutual interaction of text features and label correlations. It combines a traditional text classification model with a hierarchy encoder, thus obtaining label-wise text features. HiAGM is extended into two variants, a parallel model for inductive fusion (HiAGM-LA) and a serial model for deductive fusion (HiAGM-TP).

Given a document $x = (w_1, w_2, \ldots, w_s)$, the sequence of token embeddings is first fed into a bidirectional GRU layer to extract contextual text features. Then, multiple CNNs are used to generate n-gram features. The concatenation of n-gram features is filtered by a top-k max-pooling layer to extract the key information. Finally, by reshaping, we obtain the continuous text representation $S = (s_1, \ldots, s_n)$, where $s_i \in \mathbb{R}^{d_c}$ and $d_c$ indicates the output dimension of the CNN layer. $n = n_k \times n_c$ is the product of the top-k number and the number of CNNs.
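The following sketch mirrors this text encoder pipeline (bi-GRU, parallel CNNs, top-k max pooling). Hyperparameters such as `cnn_dim` and `top_k` are placeholders, and the exact layer sizes are assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text encoder: bi-GRU -> parallel CNNs -> top-k max pooling."""
    def __init__(self, vocab_size, emb_dim=300, hidden=64,
                 kernel_sizes=(2, 3, 4), cnn_dim=100, top_k=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hidden, cnn_dim, k, padding=k // 2) for k in kernel_sizes])
        self.top_k = top_k

    def forward(self, tokens):
        # tokens: (batch, seq_len) token ids; assumes seq_len >= top_k
        x, _ = self.gru(self.embed(tokens))       # (batch, seq_len, 2*hidden)
        x = x.transpose(1, 2)                     # (batch, 2*hidden, seq_len)
        feats = []
        for conv in self.convs:
            c = torch.relu(conv(x))               # (batch, cnn_dim, seq_len')
            feats.append(c.topk(self.top_k, dim=-1).values)   # keep top-k activations
        # S: (batch, n, cnn_dim) with n = top_k * number of CNN branches
        return torch.cat(feats, dim=-1).transpose(1, 2)
```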

Hierarchy-Aware Multi-Label Attention The first variant of HiAGM is proposed based on multi-label attention and is called HiAGM-LA.

The attention mechanism is usually utilized as a memory unit in text classification (Yang et al., 2016; Du et al., 2019). Recent LCL studies (Huang et al., 2019; You et al., 2019) construct one multi-label attention-based model per level so as to avoid optimizing label embeddings across different levels.

Our HiAGM-LA is similar to those baselines but simplifies multi-label attention LCL models to a global model. Based on our hierarchy encoders, HiAGM-LA can overcome the convergence problem of label embeddings across various levels. Label representations are enhanced with bidirectional hierarchical information. This local structural information makes it feasible to learn label features across different levels in a single model. Formally, suppose that the trainable label embedding of node $k$ is randomly initialized as $L_k \in \mathbb{R}^{d_l}$. The initial label embedding $L_k$ is directly fed into the structure encoder as the input vector of the aligned label node $x_k$. Then, the output hidden states $h \in \mathbb{R}^{C \times d_c}$ serve as the hierarchy-aware label features. Given the text representation $S \in \mathbb{R}^{n \times d_c}$, HiAGM-LA calculates the label-wise attention value $\alpha_{ki}$ as:

$$
\alpha_{kj} = \frac{e^{s_j h^{T}_k}}{\sum_{j=1}^{n} e^{s_j h^{T}_k}}, \qquad v_k = \sum_{i=1}^{n} \alpha_{ki}\, s_i, \tag{5}
$$

where $\alpha_{ki}$ indicates how informative the $i$-th text feature vector is for the $k$-th label. We obtain the inductive label-aligned text features $V \in \mathbb{R}^{C \times d_c}$ through multi-label attention, which are then fed into the classifier for prediction. Furthermore, we can directly use the hidden states of the hierarchy encoder as pretrained label representations, so that HiAGM-LA becomes even lighter in the inference process.
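Eq. (5) amounts to a softmax over text positions followed by a weighted sum; the short sketch below is a direct translation (the helper name `label_attention` is illustrative, not part of the released code).

```python
import torch

def label_attention(S, H):
    """S: (batch, n, d_c) text features; H: (C, d_c) hierarchy-aware label states.
    Returns label-aligned text features V: (batch, C, d_c), as in Eq. (5)."""
    scores = torch.einsum("bnd,cd->bcn", S, H)     # s_i . h_k for every position and label
    alpha = torch.softmax(scores, dim=-1)          # normalize over the n text positions
    return torch.einsum("bcn,bnd->bcd", alpha, S)  # attention-weighted sum of text features
```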

Hierarchical Text Feature Propagation Graph neural networks are capable of message passing (Gilmer et al., 2017; Duvenaud et al., 2015), learning both local node correlations and the overall graph structure. To avoid noise from heterogeneous fusion, the second variant obtains label-wise text features in a deductive manner. It directly takes the text features $S$ as the node inputs and updates textual information through the hierarchy-aware structure encoder. This variant mainly conducts the propagation of text features and is called HiAGM-TP. Formally, the node inputs $V$ are reshaped from the text features by a single linear transformation:

$$V = MS, \tag{6}$$


where the trainable weight matrix $M \in \mathbb{R}^{(n \times d_c) \times (C \times d_v)}$ transforms the text features $S \in \mathbb{R}^{n \times d_c}$ into node inputs $V \in \mathbb{R}^{C \times d_v}$.

Given the predefined structure, each sample updates its textual information throughout the same holistic taxonomic hierarchy. In a mini-batch learning manner, the initial node representation $V$ is fed into the hierarchy encoder. The output hidden state $h$ denotes the deductive hierarchy-aware text features and serves as the input of the final classifier. Compared with HiAGM-LA, the transformation of HiAGM-TP is conducted on textual information without the fusion of label embeddings. Thus, the structure encoder is activated in both the training and inference procedures for passing textual messages across the hierarchy. HiAGM-TP converges more easily but has slightly higher computational complexity than HiAGM-LA.
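A sketch of the HiAGM-TP dataflow is given below: the projection of Eq. (6), propagation through a structure encoder, and a flat fully connected classifier. The classifier layout and module names are assumptions; the structure encoder could be, for instance, the Hierarchy-GCN sketch above, with its adjacency matrices assumed to be stored inside the module.

```python
import torch
import torch.nn as nn

class HiAGMTP(nn.Module):
    """Sketch of HiAGM-TP; `structure_encoder` maps (batch, C, d_v) -> (batch, C, d_v)."""
    def __init__(self, structure_encoder, n, d_c, num_labels, d_v):
        super().__init__()
        self.proj = nn.Linear(n * d_c, num_labels * d_v)      # the matrix M in Eq. (6)
        self.structure_encoder = structure_encoder
        self.classifier = nn.Linear(num_labels * d_v, num_labels)
        self.num_labels, self.d_v = num_labels, d_v

    def forward(self, S):
        # S: (batch, n, d_c) text features from the text encoder
        batch = S.size(0)
        V = self.proj(S.reshape(batch, -1)).reshape(batch, self.num_labels, self.d_v)
        h = self.structure_encoder(V)                          # hierarchy-aware text features
        return torch.sigmoid(self.classifier(h.reshape(batch, -1)))  # per-label scores
```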

4.4 Classification

We flatten the hierarchy by taking all nodes as leaf nodes for multi-label classification, no matter whether a node is a leaf node or an internal node. The final hierarchy-aware features are fed into a fully connected layer for prediction. HiAGM is complementary with recursive regularization (Gopal and Yang, 2013),

$$\mathcal{L}_r = \sum_{i \in C} \sum_{j \in child(i)} \frac{1}{2} \| w_i - w_j \|^2,$$

on the parameters of the final fully connected layer. For multi-label classification, HiAGM uses a binary cross-entropy loss function:

$$\mathcal{L}_c = -\sum_{i=1}^{N} \sum_{j=1}^{C} \left[ y_{ij} \log(y'_{ij}) + (1 - y_{ij}) \log(1 - y'_{ij}) \right],$$

where $y_{ij}$ and $y'_{ij}$ are the ground truth and the sigmoid score for the $j$-th label of the $i$-th sample. Thus, the final loss function is $\mathcal{L}_m = \mathcal{L}_c + \lambda \cdot \mathcal{L}_r$.
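The sketch below combines the two terms; it assumes sigmoid probabilities `probs`, binary targets, classifier weights with one row per label, and `children` as a parent-to-children mapping. It is illustrative rather than the released training loop.

```python
import torch
import torch.nn.functional as F

def hiagm_loss(probs, targets, classifier_weight, children, lam=1e-6):
    """probs, targets: (N, C); classifier_weight: (C, D); children: {i: [child ids]}."""
    # L_c: summed binary cross-entropy, matching the equation above
    loss_c = F.binary_cross_entropy(probs, targets.float(), reduction="sum")
    # L_r: recursive regularization over adjacent (parent, child) weight rows
    loss_r = probs.new_zeros(())
    for i, childs in children.items():
        for j in childs:
            diff = classifier_weight[i] - classifier_weight[j]
            loss_r = loss_r + 0.5 * torch.sum(diff * diff)
    return loss_c + lam * loss_r                    # L_m = L_c + lambda * L_r
```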

5 Experiment

In this section, we introduce our experiments, covering the datasets, evaluation metrics, implementation details, comparisons, ablation studies, and an analysis of the experimental results.

5.1 Experiment Setup

We evaluate our proposed architecture on the RCV1-V2, Web-of-Science (WOS), and NYTimes (NYT) datasets for comparison and ablation studies.

Datasets RCV1-V2 (Lewis et al., 2004) and NYT (Sandhaus, 2008) are both news categorization corpora, while WOS (Kowsari et al., 2017) includes abstracts of published papers from Web of Science. These typical text classification datasets

Dataset   |L|   Depth   Avg(|Li|)   Train    Val     Test
RCV1      103   4       3.24        20,833   2,316   781,265
WOS       141   2       2.0         30,070   7,518   9,397
NYT       166   8       7.6         23,345   5,834   7,292

Table 1: Data statistics. |L| is the number of classes, Avg(|Li|) is the average number of classes per sample, and Depth indicates the maximum level of the hierarchy.

are all annotated with the ground truth of hierarchical taxonomic labels. We use the benchmark split of RCV1-V2 and select a small part of the training subset for validation. The WOS dataset is randomly split into training, validation, and test subsets. For NYT, we randomly select and split subsets from the original raw data. We also remove samples with no label or with only a single first-level label. Note that WOS is for single-path HTC while NYT and RCV1-V2 include multi-path taxonomic tags. The statistics of the datasets are shown in Table 1.

Evaluation Metrics We measure experimental results with standard evaluation metrics (Gopal and Yang, 2013), namely Micro-F1 and Macro-F1. Micro-F1 takes the overall precision and recall of all instances into account, while Macro-F1 equals the average F1-score over labels. Hence Micro-F1 gives more weight to frequent labels, while Macro-F1 weights all labels equally.
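With predictions and ground truth encoded as binary indicator matrices, both scores can be computed with scikit-learn, as in this short sketch.

```python
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    """y_true, y_pred: (num_samples, num_labels) binary indicator arrays."""
    return {"micro_f1": f1_score(y_true, y_pred, average="micro"),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}
```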

Implementation Details We use a one-layer bi-GRU with 64 hidden units and 3 parallel CNN layers with filter region sizes of {2, 3, 4}. The vocabulary is created from the most frequent words with a maximum size of 60,000. We use 300-dimensional pretrained word embeddings from GloVe2 (Pennington et al., 2014) and randomly initialize out-of-vocabulary words above the minimum count of 2. The key information for text classification can usually be extracted from the beginning of a document, so we set the maximum length of token inputs to 256. The fixed threshold for tagging is set to 0.5. Dropout is employed in the embedding layer and the MLP layer with a rate of 0.5, and in the bi-GRU layer and node transformation with rates of 0.1 and 0.05, respectively. Additionally, for HiAGM-LA, the label embedding is initialized by Kaiming uniform (He et al., 2015) while the other model parameters are initialized by Xavier uniform (Glorot and Bengio, 2010). We use the Adam optimizer with a mini-batch size of 64 and a learning rate of

2 https://nlp.stanford.edu/projects/glove


Model                              Micro     Macro
Local Models
  HR-DGCNN-3 (Peng et al., 2018)   76.18     43.34
  HMCN (Mao et al., 2019)          80.80     54.60
  HFT(M) (Shimura et al., 2018)    80.29     51.40
  Htrans (Banerjee et al., 2019)   80.51     58.49
Global Models
  SGM4 (Yang et al., 2018)         77.30     47.49
  HE-AGCRCNN (Peng et al., 2019)   77.80     51.30
  HiLAP-RL (Mao et al., 2019)      83.30     60.10
Baselines
  TextRCNN                         81.57     59.25
  TextRCNN+LabelAttention          81.88     59.85
HiAGM-LA
  TreeLSTM                         82.54†‡   61.90†‡
  GCN                              82.21†‡   61.65†‡
  GCN w/o Rec                      82.26†‡   61.85†‡
HiAGM-TP
  TreeLSTM                         83.20†    62.32†
  GCN                              83.96†    63.35†
  GCN w/o Rec                      83.95†    63.23†

Table 2: Comparison to previous models on RCV1-V2. Note that the prior probability matrix in HiAGM-TP is fine-tuned during training while the one in HiAGM-LA is fixed. w/o Rec denotes training without recursive regularization. "†" and "‡" indicate a statistically significant difference (p<0.01) from TextRCNN and TextRCNN+LabelAttention, respectively.

$\alpha = 1 \times 10^{-4}$, momentum parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-6}$. The penalty coefficient of recursive regularization is set to $1 \times 10^{-6}$. We evaluate on the test subset with the best model selected on the validation subset.
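For reference, these optimizer settings correspond to the following PyTorch configuration (a sketch; the placeholder `model` stands in for the assembled HiAGM network).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder; replace with the assembled HiAGM network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-6)
```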

5.2 Comparison

In Table 2, we compare the performance of HiAGM with traditional MLC models and state-of-the-art HTC studies on RCV1-V2. With recursive regularization on the last MLP layer, the conventional text classification models also obtain competitive performance. As for our proposed architecture, both HiAGM-LA and HiAGM-TP outperform most state-of-the-art results of global and local studies, especially in Macro-F1, which shows the strong benefit of our hierarchy encoders for HTC. HiAGM-LA achieves 61.90% Macro-F1 and 82.54% Micro-F1, while HiAGM-TP obtains the best performance of 63.35% Macro-F1 and 83.96% Micro-F1.

To clarify the improvement of our proposed

4 The result is reproduced with the benchmark split using the released project of SGM.

Model       HiAGM-LA                 HiAGM-TP
            Micro   Macro   Time     Micro   Macro   Time
TreeLSTM    82.54   61.90   1.0×     83.24   62.60   3.2×
GCN         82.21   61.65   0.9×     83.92   63.01   1.1×

Table 3: Comparison of the HiAGM variants on RCV1-V2 with fixed prior probability. Note that Time denotes the time cost of one epoch during inference compared to TreeLSTM-based HiAGM-LA. Statistically significant difference (p<0.01) compared to the best one.

HiAGM, we also experiment without recursive regularization. Compared with the recent state-of-the-art work HiLAP (Mao et al., 2019), our HiAGM-LA and HiAGM-TP without recursive regularization still achieve competitive improvements of 1.75% and 3.13% in terms of Macro-F1. This demonstrates that recursive regularization is complementary to, but not necessary for, our proposed architecture.

According to Table 4, HiAGM achieves consistent improvements on HTC performance across the RCV1-V2, WOS, and NYT datasets. This indicates the strong benefit of label-wise text features for the HTC task. The results show that our proposed global model HiAGM has an advanced capability of enhancing text features for HTC.

All in all, HiAGM strongly improves the performance on the benchmark dataset RCV1-V2 and the other two classical text classification datasets. In particular, it obtains better results on the Macro-F1 score, indicating that HiAGM has a strong ability to handle data-sparse classes deep in the hierarchy.

5.3 Analysis

Hybrid Information Aggregation According to Table 2, both variants outperform the baseline models and previous studies, which indicates that the enhanced text features are beneficial for HTC. We present the ablation study of the two variants and structure encoders in Table 3. Both HiAGM-LA and HiAGM-TP are trained with fixed prior probabilities. With the help of the recursive computation process, the bidirectional Tree-LSTM achieves better performance in learning hierarchy-aware label embeddings. However, it also leads to lower computational efficiency compared to Hierarchy-GCN. For HiAGM-TP, Hierarchy-GCN shows better performance and efficiency than the bidirectional Tree-LSTM.

The two variants have their respective advantages. Specifically, HiAGM-TP performs better than HiAGM-LA with both the Bi-TreeLSTM and Hierarchy-GCN encoders.


Model        RCV1-V2          RCV1-V2-R        WOS              NYT
             Micro   Macro    Micro   Macro    Micro   Macro    Micro   Macro
Global Text Classification Baseline
  TextRNN    81.10   51.09    87.78   70.42    77.94   69.65    70.29   53.06
  TextCNN    79.37   55.45    84.97   68.06    82.00   76.18    70.11   56.84
  TextRCNN   81.57   59.25    88.32   72.23    83.55   76.99    70.83   56.18
HiAGM-LA
  GCN        82.21   61.65    88.49   73.14    84.61   79.37    72.35   58.67
  TreeLSTM   82.54   61.90    88.47   72.81    84.82   79.51    72.50   58.86
HiAGM-TP
  GCN        83.96   63.35    88.64   74.00    85.82   80.28    74.97   60.83
  TreeLSTM   83.20   62.32    88.86   74.16    85.18   79.95    74.43   60.76

Table 4: Experimental results of our proposed HiAGM-LA and HiAGM-TP on various datasets. Note that RCV1-V2-R refers to the version with the original train and test subsets transposed. All models are trained with the constraint of recursive regularization. HiAGM-LA is trained with a fixed prior probability while HiAGM-TP is trained with a trainable one.

The multi-label attention variant, HiAGM-LA, may introduce noise from the randomly initialized label embeddings. In contrast, HiAGM-TP aggregates the fusion of local structural information and text feature maps without the negative impact of label embeddings.

As for efficiency, HiAGM-LA is more computationally efficient than HiAGM-TP, especially in the inference process. The label representations from the hierarchy encoder can be utilized as pretrained label embeddings for multi-label attention during inference. Thus, HiAGM-LA omits the hierarchy-aware structure encoder module after training.

We recommend HiAGM-TP for the highest performance, and HiAGM-LA for empirically good performance with faster inference.

GCN Layers The impact of the number of GCN layers is also an important issue for HiAGM. As illustrated in Figure 4, the one-layer structure encoder consistently performs best in both HiAGM-LA and HiAGM-TP. This indicates that correlations between non-adjacent nodes are not essential for HTC and instead add noise to hierarchical information aggregation. This empirical conclusion is consistent with the implementation of recursive regularization (Peng et al., 2018; Gopal and Yang, 2013) and transfer learning (Banerjee et al., 2019; Shimura et al., 2018) between adjacent labels or levels.

Prior Probability According to the aforementioned comparisons, our simplified structure encoders with prior probabilities are undoubtedly beneficial for HTC. We also investigate different choices of prior probabilities with the Hierarchy-GCN encoder

on the HiAGM-TP variant, as summarized in Table 5. Note that the weighted adjacency matrix is initialized with the prior probabilities.

The simple weighted adjacency matrix performs better than the complex edge-wise weight matrices for node transformation. The fixed weighted adjacency matrix also achieves better results than the original unweighted adjacency matrix and a trainable randomly initialized one. This demonstrates that the prior probability of the hierarchy is capable of representing hierarchical label dependencies. Furthermore, the best result is obtained by the setting that follows the calculating direction of the prior probability. Comparing the results of the fixed adjacency matrix and the trainable one, we find that the weighted adjacency matrix can be fine-tuned for higher flexibility and better performance.

In Table 5, the settings that allow interactions between all nodes

Figure 4: Ablation study on the depth of GCN.


Top-Down               Bottom-Up    Fixed            Trainable
                                    Micro   Macro    Micro   Macro
Edge-Wise Matrix       -            -       -        82.75   60.81
Randomly Initialized   -            -       -        83.86   62.12
Randomly Initialized∗  -            -       -        82.80   62.51
1                      1            83.77   62.31    83.86   62.96
P                      P            83.61   63.65    83.83   63.14
1                      P            83.65   62.46    83.95   63.23
P                      1            83.92   63.01    83.96   63.35
P∗                     1∗           -       -        83.33   62.86

Table 5: Ablation study of the fine-grained hierarchy information on RCV1-V2 based on GCN-based HiAGM-TP. Edge-Wise Matrix denotes that each directional edge has a distinct trainable weight matrix for the transformation, while the others use the weighted adjacency matrix. P is $f_c(e_{i,j}) = \frac{N_j}{N_i}$ and 1 is $f_p(e_{i,j}) = 1.0$. "∗" allows information propagation between all nodes, while the others obey the constraint of the hierarchy.

perform worse than those that only allow propagation along the hierarchy paths. As analyzed for GCN layers, interactions between non-adjacent nodes have a negative impact on HTC. We also validate this conclusion through the ablation study of the prior probability.

Performance Study We analyze the improvement in performance by dividing labels according to their levels. We compute level-based Micro-F1 scores on NYT for the baseline, HiAGM-LA, and HiAGM-TP. Figure 5 shows that our models retain better performance than the baseline on all levels, especially at the deeper levels.

Figure 5: Evaluation of labels among different levels. Note that we observe similar results for the other datasets and omit them for a cleaner view.

6 Conclusion

In this paper, we propose a novel end-to-end hierarchy-aware global model that extracts label structural information for aggregating label-wise text features. We present a bidirectional TreeLSTM and a Hierarchy-GCN as the hierarchy-aware structure encoders. Furthermore, our framework is extended into a parallel variant based on multi-label attention and a serial variant based on text feature propagation. Our approaches empirically achieve significant and consistent improvements on three distinct datasets, especially on low-frequency labels. Specifically, both variants outperform the state-of-the-art model on the RCV1-V2 benchmark dataset, and our best model obtains a Macro-F1 score of 63.35% and a Micro-F1 score of 83.96%.

Acknowledgments

We thank all the anonymous reviewers for their valuable suggestions. This research work was supported by the National Natural Science Foundation of China (Grant Nos. 61772337 and U1736207).

References

Siddhartha Banerjee, Cem Akkaya, Francisco Perez-Sorrosal, and Kostas Tsioutsiouliklis. 2019. Hierarchical transfer learning for multi-label text classification. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 6295–6300. Association for Computational Linguistics.

Lijuan Cai and Thomas Hofmann. 2004. Hierarchical document categorization with support vector machines. In Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004, pages 78–87. ACM.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017a. Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1936–1945. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1657–1668. Association for Computational Linguistics.


Cunxiao Du, Zhaozheng Chen, Fuli Feng, Lei Zhu, Tian Gan, and Liqiang Nie. 2019. Explicit interaction model towards text classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 6359–6366. AAAI Press.

Susan T. Dumais and Hao Chen. 2000. Hierarchical classification of web content. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece, pages 256–263. ACM.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-Bombarelli, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2224–2232.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, volume 9 of JMLR Proceedings, pages 249–256. JMLR.org.

Siddharth Gopal and Yiming Yang. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, pages 257–265. ACM.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1026–1034. IEEE Computer Society.

Shexia He, Zuchao Li, Hai Zhao, and Hongxiao Bai. 2018. Syntax for semantic role labeling, to be, or not to be. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2061–2071. Association for Computational Linguistics.

Wei Huang, Enhong Chen, Qi Liu, Yuying Chen, Zai Huang, Yang Liu, Zhou Zhao, Dan Zhang, and Shijin Wang. 2019. Hierarchical multi-label text classification: An attention-based recurrent network approach. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 1051–1060. ACM.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Kamran Kowsari, Donald E. Brown, Mojtaba Heidarysafa, Kiana Jafari Meimandi, Matthew S. Gerber, and Laura E. Barnes. 2017. HDLTex: Hierarchical deep learning for text classification. In 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017, Cancun, Mexico, December 18-21, 2017, pages 364–371. IEEE.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397.

Zuchao Li, Shexia He, Jiaxun Cai, Zhuosheng Zhang, Hai Zhao, Gongshen Liu, Linlin Li, and Luo Si. 2018. A unified syntax-aware framework for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2401–2411. Association for Computational Linguistics.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 912–921. The Association for Computational Linguistics.

Yuning Mao, Jingjing Tian, Jiawei Han, and Xiang Ren. 2019. Hierarchical text classification with reinforced label assignment. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, pages 445–455. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017


Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1506–1515. Association for Computational Linguistics.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1101–1111. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2007. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Hao Peng, Jianxin Li, Qiran Gong, Senzhang Wang, Lifang He, Bo Li, Lihong Wang, and Philip S. Yu. 2019. Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification. CoRR, abs/1906.04898.

Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 1063–1072. ACM.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL.

Anthony Rios and Ramakanth Kavuluru. 2018. Few-shot and zero-shot multi-label learning for structured label spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3132–3142. Association for Computational Linguistics.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

Kazuya Shimura, Jiyi Li, and Fumiyo Fukumoto. 2018. HFT-CNN: Learning hierarchical category structure for multi-label short text categorization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 811–816. Association for Computational Linguistics.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1556–1566. The Association for Computer Linguistics.

Shikhar Vashishth, Manik Bhandari, Prateek Yadav, Piyush Rai, Chiranjib Bhattacharyya, and Partha P. Talukdar. 2019. Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 3308–3318. Association for Computational Linguistics.

Jonatas Wehrmann, Ricardo Cerri, and Rodrigo C. Barros. 2018. Hierarchical multi-label classification networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmassan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 5225–5234. PMLR.

Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, pages 4353–4363. Association for Computational Linguistics.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3915–3926. Association for Computational Linguistics.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1480–1489. The Association for Computational Linguistics.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI


2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7370–7377. AAAI Press.

Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2019. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In Advances in Neural Information Processing Systems, pages 5812–5822.

Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 310–320. The Association for Computational Linguistics.