
TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays

Xiaosong Wang∗1, Yifan Peng∗2, Le Lu1, Zhiyong Lu2, Ronald M. Summers1

1 Department of Radiology and Imaging Sciences, Clinical Center,
2 National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, MD 20892
{xiaosong.wang,yifan.peng,le.lu,luzh,mohammad.bagheri,rms}@nih.gov

Abstract

Chest X-rays are one of the most common radiological examinations in daily clinical routines. Reporting thorax diseases using chest X-rays is often an entry-level task for radiologist trainees. Yet, reading a chest X-ray image remains a challenging job for learning-oriented machine intelligence, due to (1) the shortage of large-scale machine-learnable medical image datasets, and (2) the lack of techniques that can mimic the high-level reasoning of human radiologists, which requires years of knowledge accumulation and professional training. In this paper, we show that clinical free-text radiological reports can be utilized as a priori knowledge for tackling these two key problems. We propose a novel Text-Image Embedding network (TieNet) for extracting distinctive image and text representations. Multi-level attention models are integrated into an end-to-end trainable CNN-RNN architecture for highlighting the meaningful text words and image regions. We first apply TieNet to classify chest X-rays by using both image features and text embeddings extracted from associated reports. The proposed auto-annotation framework achieves high accuracy (over 0.9 on average in AUCs) in assigning disease labels for our hand-labeled evaluation dataset. Furthermore, we transform TieNet into a chest X-ray reporting system. It simulates the reporting process and can output disease classification and a preliminary report together. The classification results are significantly improved (6% increase on average in AUCs) compared to the state-of-the-art baseline on an unseen and hand-labeled dataset (OpenI).

1. Introduction

In the last decade, challenging tasks in computer vision have gone through different stages, from sole image classification to multi-category multi-instance classification/detection/segmentation to more complex cognitive tasks that involve understanding and describing the relationships of object instances inside images or videos. The rapid and significant performance improvement is partly driven by the public availability of large-scale image and video datasets with quality annotations, e.g., ImageNet [8], PASCAL VOC [10], MS COCO [22], and Visual Genome [18]. In particular, ImageNet pre-trained deep Convolutional Neural Network (CNN) models [15, 19, 21] have become an essential basis (indeed an advantage) for many higher-level tasks, e.g., Recurrent Neural Network (RNN) based image captioning [34, 17, 30, 11], Visual Question Answering [36, 42, 38, 27], and instance relationship extraction [16, 14, 6].

∗Both authors contributed equally.

Figure 1. Overview of the proposed automated chest X-ray reporting framework. A multi-level attention model is introduced.

In contrast, there are few publicly available large-scale image datasets in the medical image domain. Conventional means of annotating natural images, e.g., crowd-sourcing, cannot be applied to medical images because these tasks often require years of professional training and domain knowledge. On the other hand, radiological raw data (e.g., images, clinical annotations, and radiological reports) have been accumulated in many hospitals’ Picture Archiving and Communication Systems (PACS) for decades. The main challenge is how to transform those retrospective radiological data into a machine-learnable format. Accomplishing this with chest X-rays represents a major milestone in the medical-imaging community [35].

arXiv:1801.04334v1 [cs.CV] 12 Jan 2018

Unlike current deep learning models, radiologists routinely observe multiple findings when they read medical images and compile radiological reports. One main reason is that these findings are often correlated. For instance, liver metastases can spread to regional lymph nodes or other body parts. By obtaining and maintaining a holistic picture of relevant clinical findings, a radiologist will be able to make a more accurate diagnosis. To the best of our knowledge, developing a universal or multi-purpose CAD framework, which is capable of detecting multiple disease types in a seamless fashion, is still a challenging task. However, such a framework is a crucial part of building an automatic radiological diagnosis and reporting system.

Toward this end, we investigate how free-text radiological reports can be exploited as a priori knowledge using an innovative text-image embedding network. We apply this novel system in two different scenarios. We first introduce a new framework for auto-annotation of chest X-rays by using both image features and text embeddings extracted from associated reports. Multi-level attention models are integrated into an end-to-end trainable CNN-RNN architecture for highlighting the meaningful text words and image regions. In addition, we convert the proposed annotation framework into a chest X-ray reporting system (as shown in Figure 1). The system simulates the real-world reporting process by outputting a disease classification and generating a preliminary report simultaneously. The text embeddings learned from the retrospective reports are integrated into the model as a priori knowledge, and the joint learning framework boosts the performance in both tasks in comparison to the previous state-of-the-art.

Our contributions are fourfold: (1) We propose the Text-Image Embedding Network, which is a multi-purpose end-to-end trainable multi-task CNN-RNN framework; (2) We show how raw report data, together with the paired image, can be utilized to produce meaningful attention-based image and text representations using the proposed TieNet; (3) We outline how the developed text and image embeddings are able to boost the auto-annotation framework and achieve extremely high accuracy for chest X-ray labeling; (4) Finally, we present a novel image classification framework which takes images as the sole input, but uses the paired text-image representations from training as an a priori knowledge injection, in order to produce improved classification scores and preliminary report generations.

Importantly, we validate our approach on three different datasets, and TieNet improves the image classification result (6% increase on average in area under the curve (AUC) for all disease categories) in comparison to the state-of-the-art on an unseen and hand-labeled dataset (OpenI [7]) from another institute. Our multi-task training scheme helps not only the image classification but also the report generation, producing reports with higher BLEU scores than the baseline method.

2. Related work

Computer-Aided Detection (CADe) and Diagnosis (CADx) have long been a major research focus in medical image processing [5]. In recent years, deep learning models have started to outperform conventional statistical learning approaches in various tasks, such as automated classification of skin lesions [9], detection of liver lesions [4], and detection of pathological-image findings [40]. However, current CADe methods typically target one particular type of disease or lesion, such as lung nodules, colon polyps or lymph nodes [24].

Wang et al. [35] provide a recent and prominent exception, where they introduced a large-scale chest X-ray dataset by processing images and their paired radiological reports (extracted from their institutional PACS database) with natural language processing (NLP) techniques. The publicly available dataset contains 112,120 front-view chest X-ray images of 30,805 unique patients¹. However, radiological reports contain richer information than simple binary disease labels, e.g., disease location and severity, which should be exploited in order to fully leverage existing PACS datasets. Thus, we differ from Wang et al.’s approach by leveraging this rich text information in order to produce an enhanced system for chest X-ray CADx.

In terms of visual captioning, our work is close to [37, 33, 29, 38, 27]. Xu et al. [37] first introduced the sequence-to-sequence model and spatial attention model into the image captioning task. They conditioned the long short-term memory (LSTM) decoder on different parts of the input image during each decoding step, and the attention signal was determined by the previous hidden state and CNN features. Vinyals et al. [33] cast the syntactical parsing problem as a sequence-to-sequence learning task by linearizing the parsing tree. Pedersoli et al. [29] allowed a direct association between caption words and image regions. More recently, multi-attention models [38, 27] extract salient regions and words from both image and text and then combine them for better representations of the pair. In the medical imaging domain, Shin et al. [32] proposed to correlate the entire image or salient regions with MeSH terms. Promising results [41] are also reported in summarizing the findings in pathology images using task-oriented reports in the training. The difference between our model and theirs lies in that we employ multi-attention models with a mixture of image and text features in order to provide more salient and meaningful embeddings for the image classification and report generation tasks.

¹ https://nihcc.app.box.com/v/ChestXray-NIHCC/

Figure 2. Architecture of the proposed chest X-ray auto-annotation and reporting framework. Multi-level attentions are introduced to produce saliency-encoded text and image embeddings.

Apart from visual attention, text-based attention has also been increasingly applied in deep learning for NLP [2, 26, 31]. It attempts to relieve one potential problem of the traditional encoder-decoder framework: when the input is long or very information-rich, selective encoding is not possible. The attention mechanism eases this problem by allowing the decoder to refer back to the input sequence [39, 23, 25]. To this end, our approach closely follows the one used in [23], where an interpretable sentence embedding is extracted by introducing self-attention. Our model pairs the attention-based image and text representations from training as an a priori knowledge injection to produce improved classification scores.

3. Text-Image Embedding Network

The radiological report is a summary of all the clinical findings and impressions determined during examination of a radiography study. A sample report is shown in Figure 1. It usually contains richer information than disease keywords alone, and may also include negation and uncertainty statements. In the ‘findings’ section, normal and abnormal observations are listed for each part of the body examined in the image. Attributes of the disease patterns, e.g., specific location and severity, are also noted. Furthermore, critical diagnosis information is often presented in the ‘impression’ section by considering all findings, patient history, and previous studies. Suspicious findings may lead to recommendations for additional or follow-up imaging studies. As such, reports consist of a challenging mixture of information, and a key for machine learning is extracting the parts that are useful for particular applications.

In addition to mining the disease keywords [35] as a summarization of the radiological reports, we want to learn a text embedding to capture the richer information contained in raw reports. Figure 2 illustrates the proposed Text-Image Embedding Network. We first introduce the foundation of TieNet, which is an end-to-end trainable CNN-RNN architecture. Afterwards we discuss two enhancements we develop and integrate, i.e., attention-encoded text embedding (AETE) and saliency weighted global average pooling (SW-GAP). Finally, we outline the joint learning loss function used to optimize the framework.

3.1. End-to-End Trainable CNN-RNN Model

As shown in Figure 2, our end-to-end trainable CNN-RNN model takes an image $I$ and a sequence of 1-of-$V$ encoded words,

$$S = \{\mathbf{w}_1, \ldots, \mathbf{w}_T\}, \quad \mathbf{w}_t \in \mathbb{R}^V, \tag{1}$$

where $\mathbf{w}_t$ is a vector standing for a $d_w$-dimensional word embedding for the $t$-th word in the report, $V$ is the size of the vocabulary, and $T$ is the length of the report. The initial CNN component uses layers borrowed from ImageNet pre-trained models for image classification, e.g., ResNet-50 (from Conv1 to Res5c). The CNN component additionally includes a convolutional layer (transition layer) to manipulate the spatial grid size and feature dimension.

Our RNN is based on Xu et al.’s visual spatial attention model [37] for image captioning. The convolutional activations from the transition layer, denoted as $\mathbf{X}$, initialize the RNN’s hidden state, $\mathbf{h}_t$, where a fully-connected embedding, $\phi(\mathbf{X})$, maps the size-$d_X$ transition layer activations to the LSTM state space of dimension $d_h$. In addition, $\mathbf{X}$ is also used as one of the RNN’s inputs. However, following Xu et al. [37], our sequence-to-sequence model includes a deterministic and soft visual spatial attention, $\mathbf{a}_t$, that is multiplied element-wise with $\mathbf{X}$ before the latter is fed to the RNN. At each time step, the RNN also outputs the subsequent attention map, $\mathbf{a}_{t+1}$.
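The element-wise weighting of the transition-layer activations by a soft attention map can be sketched as follows. This is a minimal NumPy illustration and not the original implementation: the function name, the random `scores` array standing in for the RNN's attention logits, and the softmax normalization over the $D \times D$ grid are all assumptions that follow common soft-attention practice.

```python
import numpy as np

def soft_attention_weight(X, scores):
    """Weight a (D, D, C) feature map by a soft spatial attention map.

    `scores` is a hypothetical (D, D) array of unnormalized attention logits.
    """
    flat = scores.reshape(-1)
    flat = flat - flat.max()                  # numerical stability
    a = np.exp(flat) / np.exp(flat).sum()     # softmax over all grid cells
    a = a.reshape(scores.shape)               # attention map a_t, sums to 1
    return a, a[:, :, None] * X               # broadcast a_t over channels

rng = np.random.default_rng(0)
D, C = 16, 1024                               # spatial and channel sizes from the text
X = rng.standard_normal((D, D, C))
scores = rng.standard_normal((D, D))
a_t, X_att = soft_attention_weight(X, scores)
```

The weighted activations `X_att` keep the shape of `X`, so they can be passed to the recurrent unit in place of the raw features.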

In addition to the soft-weighted visual features, the RNN also accepts the current word at each time step as input. We adopt standard LSTM units [13] for the RNN. The transition to the next hidden state can then be denoted as

$$\mathbf{h}_t = \mathrm{LSTM}([\mathbf{w}_t, \mathbf{a}_t, \mathbf{X}], \mathbf{h}_{t-1}). \tag{2}$$

The LSTM produces the report by generating one word at each time step conditioned on a context vector, i.e., the previous hidden state $\mathbf{h}_t$, the previously generated words $\mathbf{w}_t$, and the convolutional features of $\mathbf{X}$, whose dimension is $D \times D \times C$. Here $D = 16$ and $C = 1024$ denote the spatial and channel dimensions, respectively. Once the model is trained, reports for a new image can be generated by sequentially sampling $\mathbf{w}_t \sim p(\mathbf{w}_t \mid \mathbf{h}_t)$ and updating the state using Equation 2.

The end-to-end trainable CNN-RNN model provides a powerful means to process both text and images. However, our goal is also to obtain an interpretable global text and visual embedding for the purposes of classification. For this reason, we introduce two key enhancements in the form of the AETE and SW-GAP.

3.2. Attention Encoded Text Embedding

To compute a global text representation, we use an approach that closely follows the one used in [23]. More specifically, we use attention to combine the most salient portions of the RNN hidden states. Let $\mathbf{H} = (\mathbf{h}_1, \ldots, \mathbf{h}_T)$ be the $d_h \times T$ matrix of all the hidden states. The attention mechanism outputs an $r \times T$ matrix of weights $\mathbf{G}$ as

$$\mathbf{G} = \mathrm{softmax}(\mathbf{W}_{s2} \tanh(\mathbf{W}_{s1} \mathbf{H})), \tag{3}$$

where $r$ is the number of global attentions we want to extract from the sentence, and $\mathbf{W}_{s1}$ and $\mathbf{W}_{s2}$ are $s$-by-$d_h$ and $r$-by-$s$ matrices, respectively. $s$ is a hyperparameter governing the dimensionality, and therefore maximum rank, of the attention-producing process.

With the attention calculated, we compute an $r \times d_h$ embedding matrix, $\mathbf{M} = \mathbf{G}\mathbf{H}$, which in essence executes $r$ weighted sums across the $T$ hidden states, aggregating them together into $r$ representations. Each row of $\mathbf{G}$, denoted $\mathbf{g}_i$ ($i \in \{1 \ldots r\}$), indicates how much each hidden state contributes to the final embedded representation of $\mathbf{M}$. We can thus draw a heat map for each row of the embedding matrix $\mathbf{M}$ (see Figure 10 for examples). This way of visualization gives hints on what is encoded in each part of the embedding, adding an extra layer of interpretation.

To provide a final global text embedding of the sentences in the report, the AETE executes max-over-$r$ pooling across $\mathbf{M}$, producing an embedding vector $\mathbf{X}_{AETE}$ with size $d_h$.
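The AETE computation can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the original code: the function names and the random weights are hypothetical, the softmax is taken over the $T$ time steps so that each row of the weight matrix sums to one (our reading of [23]), and, because the hidden states are stored as a $d_h \times T$ matrix here, the $r \times d_h$ embedding matrix is computed with a transpose of the hidden-state matrix.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_encoded_text_embedding(H, Ws1, Ws2):
    """H: (d_h, T) hidden states; Ws1: (s, d_h); Ws2: (r, s)."""
    G = softmax(Ws2 @ np.tanh(Ws1 @ H), axis=1)  # (r, T), Eq. 3
    M = G @ H.T                                  # (r, d_h): r weighted sums of states
    x_aete = M.max(axis=0)                       # max-over-r pooling -> (d_h,)
    return G, M, x_aete

rng = np.random.default_rng(0)
d_h, T, s, r = 256, 12, 64, 5                    # small s chosen only for the demo
H = rng.standard_normal((d_h, T))
Ws1 = 0.01 * rng.standard_normal((s, d_h))
Ws2 = 0.01 * rng.standard_normal((r, s))
G, M, x_aete = attention_encoded_text_embedding(H, Ws1, Ws2)
```

Each row of `G` can be plotted against the report tokens to obtain the kind of heat map described above.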

3.3. Saliency Weighted Global Average Pooling

In addition to using attention to provide a more meaningful text embedding, our goal is also to produce improved visual embeddings for classification. For this purpose, we re-use the attention mechanism, $\mathbf{G}$, except that we perform a max-over-$r$ operation, producing a sequence of saliency values, $g_t$ ($t = 1, \ldots, T$), for each word, $\mathbf{w}_t$. These saliency values are used to weight and select the spatial attention maps, $\mathbf{a}_t$, generated at each time point:

$$a_{ws}(x, y) = \sum_{t} a_t(x, y) * g_t. \tag{4}$$

This map is encoded with all spatial saliency regions guided by the text attention. We use this map to highlight the spatial regions of $\mathbf{X}$ with more meaningful information:

$$X_{SW\text{-}GAP}(c) = \sum_{(x, y)} a_{ws}(x, y) * X(x, y, c), \tag{5}$$

where $x, y \in \{1 \ldots D\}$ and $\mathbf{X}_{SW\text{-}GAP}$ is a 1-by-$C$ vector representing the global visual information, guided by both text- and visual-based attention. The lower part of Figure 2 illustrates an example of such a pooling strategy.
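Equations 4 and 5 amount to two weighted sums, which the following NumPy sketch makes explicit. The function name and the random inputs are hypothetical and serve only to illustrate the shapes involved.

```python
import numpy as np

def saliency_weighted_gap(G, A, X):
    """G: (r, T) text attention; A: (T, D, D) spatial maps a_t; X: (D, D, C)."""
    g = G.max(axis=0)                            # max-over-r saliency per word, (T,)
    a_ws = np.tensordot(g, A, axes=(0, 0))       # Eq. 4: sum_t a_t(x, y) * g_t -> (D, D)
    x_sw_gap = np.einsum("xy,xyc->c", a_ws, X)   # Eq. 5: pooled feature -> (C,)
    return a_ws, x_sw_gap

rng = np.random.default_rng(0)
r, T, D, C = 5, 12, 16, 1024
G = rng.random((r, T))
A = rng.random((T, D, D))
X = rng.standard_normal((D, D, C))
a_ws, x_sw_gap = saliency_weighted_gap(G, A, X)
```

Words with high text saliency contribute more of their spatial attention map to `a_ws`, so the pooled vector emphasizes the image regions attended to while those words were generated.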

3.4. Joint Learning

With global representations computed for both the image and report, these must be combined to produce the final classification. To accomplish this, we concatenate the two forms of representations, $\mathbf{X} = [\mathbf{X}_{AETE}; \mathbf{X}_{SW\text{-}GAP}]$, and use a final fully-connected layer to produce the output for multi-label classification. The intuition behind our model is that the connection between the CNN and RNN network will benefit the training of both, because the image activations can be adjusted for the text embedding task and salient image features can be extracted by pooling based on high text saliency.


In a similar fashion to Wang et al. [35], we define an $M$-dimensional disease label vector $\mathbf{y} = [y_1, \ldots, y_m, \ldots, y_M]$, $y_m \in \{0, 1\}$, for each case, where $M = 15$ is the number of classes. $y_m$ indicates the presence of a pathology or ‘no finding’ (of the listed disease categories) in the image. Here, we adopt the NLP-mined labels provided by [35] as the ‘ground truth’ during training.

The instance numbers for different disease categories are highly unbalanced, from hundreds to tens of thousands. In addition to the positive/negative balancing introduced in [35], we add weights to instances associated with different categories,

$$L_m(f(I, S), \mathbf{y}) = \beta_P \sum_{y_m = 1} -\ln(f(I, S)) \cdot \lambda_m + \beta_N \sum_{y_m = 0} -\ln(1 - f(I, S)) \cdot \lambda_m, \tag{6}$$

where $\beta_P = \frac{|N|}{|P| + |N|}$ and $\beta_N = \frac{|P|}{|P| + |N|}$. $|P|$ and $|N|$ are the total number of images with at least one disease and with no diseases, respectively. $\lambda_m = (Q - Q_m)/Q$ is a set of precomputed class-wise weights, where $Q$ and $Q_m$ are the total number of images and the number of images that have disease label $m$. $\lambda_m$ will be larger if the number of instances from class $m$ is small.
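The weighting scheme of Equation 6 can be sketched as below. This is an illustrative NumPy version under stated assumptions: the function names are hypothetical, the last label column is assumed to encode ‘no finding’, and the tiny label matrix and constant predictions are made up purely to exercise the code.

```python
import numpy as np

def class_balance_weights(labels, no_finding_col=-1):
    """Precompute beta_P, beta_N and lambda_m from a (Q, M) binary label matrix.

    Assumption: one column (by default the last) encodes 'no finding'.
    """
    labels = np.asarray(labels)
    disease = np.delete(labels, no_finding_col, axis=1)
    n_pos = int(disease.any(axis=1).sum())    # |P|: images with at least one disease
    n_neg = labels.shape[0] - n_pos           # |N|: images with no disease
    beta_p = n_neg / (n_pos + n_neg)
    beta_n = n_pos / (n_pos + n_neg)
    q = labels.shape[0]
    lam = (q - labels.sum(axis=0)) / q        # lambda_m = (Q - Q_m) / Q
    return beta_p, beta_n, lam

def weighted_class_losses(probs, labels, beta_p, beta_n, lam, eps=1e-7):
    """Per-class losses L_m of Eq. 6 for predictions `probs` of shape (B, M)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    labels = np.asarray(labels)
    pos_term = -(np.log(probs) * (labels == 1)).sum(axis=0)
    neg_term = -(np.log(1.0 - probs) * (labels == 0)).sum(axis=0)
    return lam * (beta_p * pos_term + beta_n * neg_term)

# Toy data: two disease columns plus a 'no finding' column.
labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
probs = np.full(labels.shape, 0.5)
beta_p, beta_n, lam = class_balance_weights(labels)
losses = weighted_class_losses(probs, labels, beta_p, beta_n, lam)
```

In the toy example the rarest column receives the largest weight in `lam`, matching the statement that the weight grows as the number of instances of a class shrinks.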

Because TieNet can also generate text reports, we also optimize the RNN generative model loss [37], $L_R$. Thus the overall loss is composed of two parts, the sigmoid cross entropy loss $L_C$ for the multi-label classification and the loss $L_R$ from the RNN generative model [37],

$$L_{overall} = \alpha L_C + (1 - \alpha) L_R, \tag{7}$$

where $\alpha$ is added to balance the large difference between the two loss types.

3.5. Medical Image Auto-Annotation

One straightforward application of TieNet is the auto-annotation task to mine image classification labels. By omitting the generation of sequential words, we accumulate and back-propagate only the classification loss for better text-image embeddings in image classification. Here, we use the NLP-mined disease labels as ‘ground truth’ in the training. In effect, we want to learn a mapping between the input image-report pairs and the image labels. The report texts often contain more easy-to-learn features than the image side. The contribution of both sources to the final classification prediction should be balanced by either controlling the feature dimensions or dropping part of the ‘easy-to-learn’ data during training.

3.6. Automatic Classification and Reporting of Thorax Diseases

For a more difficult but real-world scenario, we transform the image-text embedding network to serve as a unified system of image classification and report generation when only the unseen image is available. During training, both image and report are fed and two separate losses are computed as stated above, i.e., the loss for image classification and the loss for sequence-to-sequence modeling. During testing, only the image is required as the input. The generated text carries the learned text embedding recorded in the LSTM units, which is later used in the final image classification task. The generative model we integrated into the text-image embedding network is the key to associating an image with its attention encoded text embedding.

4. Dataset

ChestX-ray14 [35] is a recently released benchmark dataset for common thorax disease classification and localization. It consists of 14 disease labels that can be observed in chest X-rays, i.e., Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening, and Hernia. The NLP-mined labels are used as ‘ground truth’ for model training throughout the experiments. We adopt the patient-level data splits published with the data².

Hand-labeled: In addition to the NLP-mined labels, we randomly select 900 reports from the testing set and have two radiologists annotate the 14 categories of findings for evaluation purposes. A trial set of 30 reports was first used to synchronize the annotation criteria between the two annotators. Then, each report was independently annotated by both annotators. In this paper, we use the inter-rater agreement (IRA) to measure the consistency between the two observers. The resulting Cohen’s kappa is 84.3%. Afterwards, the final decision on the inconsistent cases was adjudicated between the two observers.
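For reference, Cohen’s kappa compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below computes it for one label; the function name and the two toy annotation vectors are hypothetical, and how the agreement was aggregated over the 14 categories is not stated in the text, so that step is left out.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels on the same items."""
    a, b = np.asarray(a), np.asarray(b)
    categories = np.union1d(a, b)
    p_o = np.mean(a == b)                                   # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c)             # chance agreement
              for c in categories)
    return (p_o - p_e) / (1.0 - p_e)

# Toy example: presence (1) or absence (0) of one finding in ten reports.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Note that the statistic is undefined when chance agreement is exactly one, e.g., when both annotators assign the same single label to every item.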

OpenI [7] is a publicly available radiography dataset collected from multiple institutes by Indiana University. Using the OpenI API, we retrieved 3,851 unique radiology reports and 7,784 associated frontal/lateral images, where each OpenI report was annotated with key concepts (MeSH words) including body parts, findings, and diagnoses. For consistency, we use the same 14 categories of findings as above in the experiments. In our experiments, only 3,643 unique front-view images and corresponding reports are selected and evaluated.

5. Experiments

Report vocabulary: We use all 15,472 unique words in the training set that appear at least twice. Words that appear less frequently are replaced by a special out-of-vocabulary token, and the start and the end of the reports are marked with special 〈START〉 and 〈END〉 tokens. The pre-trained word embedding vectors were learned on PubMed articles using the gensim word2vec implementation with the dimensionality set to 200³. The word embedding vectors are updated along with the other LSTM parameters.

² https://nihcc.app.box.com/v/ChestXray-NIHCC
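The vocabulary construction described above can be sketched with the standard library. The token strings, function names, lowercasing, and whitespace tokenization are assumptions made for illustration; the original preprocessing is not specified in the text.

```python
from collections import Counter

# Hypothetical spellings of the special tokens.
OOV, START, END = "<OOV>", "<START>", "<END>"

def build_vocab(reports, min_count=2):
    """Keep words seen at least `min_count` times; everything else maps to OOV."""
    counts = Counter(tok for rep in reports for tok in rep.lower().split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate([OOV, START, END] + kept)}

def encode(report, vocab):
    """Wrap a report in START/END markers and map its tokens to indices."""
    tokens = [START] + report.lower().split() + [END]
    return [vocab.get(tok, vocab[OOV]) for tok in tokens]

reports = ["no acute disease", "no pleural effusion", "acute effusion"]
vocab = build_vocab(reports)
ids = encode("no pleural effusion", vocab)
```

In this toy corpus ‘pleural’ occurs only once, so it is encoded with the out-of-vocabulary index.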

Evaluation Metrics: To compare with previous state-of-the-art works, we choose different evaluation metrics for different tasks so as to maintain consistency with the data reported in the previous works.

Receiver Operating Characteristic (ROC) curves are plotted for each disease category to measure the image classification performance, and afterwards the Areas Under the Curve (AUC) are computed, which reflect the overall performance as a summary of different operating points.
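The AUC of a single class equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The pairwise sketch below uses this equivalence; the function name and the toy scores are hypothetical, and the quadratic pair comparison is chosen for clarity rather than speed.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]            # every positive/negative pair
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (pos.size * neg.size)

toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In a multi-label setting such as this one, the function would be applied once per disease column.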

To assess the quality of the generated text reports, BLEU scores [28], METEOR [3] and ROUGE-L [20] are computed between the original reports and the generated ones. Those measures reflect the word overlap statistics between two text corpora. However, we believe their capability is limited for showing the actual accuracy of disease words (together with their attributes) that overlap between two text corpora.
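The core of BLEU is a clipped (modified) n-gram precision, sketched below for a single candidate and a single reference. The function names and the toy sentences are hypothetical; full BLEU additionally combines these precisions over several n-gram orders and applies a brevity penalty, which is omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

candidate = "the the the heart".split()
reference = "the heart is normal".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

The example also illustrates the limitation noted above: a report can score well on overlap while repeating common words and missing the clinically important ones.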

Training: The LSTM model contains a 256-dimensional cell and $s = 2000$ in $\mathbf{W}_{s1}$ and $\mathbf{W}_{s2}$ for generating the attention weights $\mathbf{G}$. During training, we use 0.5 dropout on the MLP and 0.0001 for L2 regularization. We use the Adam optimizer with a mini-batch size of 32 and a constant learning rate of 0.001.

In addition, our self-attention LSTM has a hidden layer with 350 units. We choose the matrix embedding to have 5 rows (the $r$), and a coefficient of 1 for the penalization term. All the models are trained until convergence is achieved, and the hyper-parameters for testing are selected according to the corresponding best validation set performance.

Our text-image embedding network is implemented based on TensorFlow [1] and Tensorpack⁴. The ImageNet pre-trained model, i.e., ResNet-50 [12], is obtained from the Caffe model zoo and converted into a TensorFlow-compatible format. The proposed network takes the weights from the pre-trained model and fixes them during training. Other layers in the network are trained from scratch. In a similar fashion to [35], we reduce the size of the mini-batch to fit the entire model in each GPU, while we accumulate the gradients for a number of iterations and also across a number of GPUs for better training performance. The DCNN models are trained using a Dev-Box Linux server with 4 Titan X GPUs.

5.1. Auto-annotation of Images

Figure 3 illustrates the ROC curves for the image classification performance with 3 different inputs evaluated on 3 different testing sets, i.e., the ChestX-ray14 testing set (ChestX-ray14), the hand-labeled set (Hand-labeled) and the OpenI set (OpenI). Separate curves are plotted for each disease category and ‘No finding’. Here, two different auto-annotation frameworks are trained by using different inputs, i.e., taking reports only (R) and taking image-report pairs (I+R) as inputs. When only the reports are used, the framework does not have the saliency weighted global average pooling path. In this way, we can get a sense of how the features from the text path and the image path individually contribute to the final classification prediction.

³ https://radimrehurek.com/gensim/models/word2vec.html

⁴ https://github.com/ppwwyyxx/tensorpack/

We train the proposed auto-annotation framework using the training and validation sets from the ChestX-ray14 dataset and test it on all three testing sets, i.e., ChestX-ray14, Hand-labeled and OpenI. Table 1 shows the AUC values for each class computed from the ROC curves shown in Figure 3. The auto-annotation framework achieves high performance on both ChestX-ray14 and Hand-labeled, i.e., over 0.87 in AUC with reports alone as the input and over 0.90 in AUC with image-report pairs on the sample number weighted average (#wAVG). The combination of image and report demonstrates a clear advantage in this task. In addition, the auto-annotation framework trained on ChestX-ray14 performed equivalently on OpenI. This indicates that the model trained on a large-scale image dataset can easily be generalized to unseen data from other institutes. The model trained solely on images also generalizes well to datasets from other sources. In this case, both the proposed method and the one in [35] are able to perform equally well on all three testing sets.
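We read the sample number weighted average (#wAVG) as the mean of the per-class AUCs weighted by the per-class sample counts (the # column of Table 1); this interpretation is an assumption. A minimal sketch follows, using the I+R AUCs and sample counts of only two Hand-labeled rows of Table 1 (Atelectasis and Cardiomegaly) for illustration; the reported #wAVG covers all categories, so the value computed here is not a result of the paper.

```python
import numpy as np

def sample_weighted_average(aucs, counts):
    """Average of per-class AUCs weighted by the number of samples per class."""
    aucs = np.asarray(aucs, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return float((aucs * counts).sum() / counts.sum())

# I+R on Hand-labeled: Atelectasis (.919, 261 samples), Cardiomegaly (.989, 185 samples).
wavg = sample_weighted_average([0.919, 0.989], [261, 185])
```

Compared with a plain mean, this weighting keeps classes with very few test samples, such as Hernia, from dominating the summary.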

5.2. Classification and Reporting of Chest X-ray

When TieNet is switched to an automatic disease classification and reporting system, it takes a single image as the input and is capable of outputting a multi-label prediction and the corresponding radiological report together. The ROC curves on the right in Figure 3 and Table 1 show the image classification performance produced by the multi-purpose reporting system. The AUCs from our TieNet (I+GR) demonstrate a consistent improvement (2.3% to 5.7% on #wAVG for all the disease categories) across all three datasets. The multi-label classification framework [35] serves as a baseline model that also takes solely the images. Furthermore, the performance improvement achieved on the Hand-labeled and OpenI datasets (with ground truth image labels) is even larger than the performance gain on ChestX-ray14 (with NLP-mined labels). This indicates that TieNet is able to learn more meaningful and richer text embeddings directly from the raw reports and correct the inconsistency between embedded features and erroneous mined labels.

Table 2 shows that the generated reports from our proposed system obtain higher scores in all evaluation metrics in comparison to the baseline image captioning model [37].


Disease        | ChestX-ray14                     | Hand-labeled                     | OpenI
               | R / I+R / I [35] / I+GR / #      | R / I+R / I [35] / I+GR / #      | R / I+R / I [35] / I+GR / #
Atelectasis    | .983 / .993 / .700 / .732 / 3255 | .886 / .919 / .680 / .715 / 261  | .981 / .976 / .702 / .774 / 293
Cardiomegaly   | .978 / .994 / .810 / .844 / 1065 | .964 / .989 / .820 / .872 / 185  | .944 / .962 / .803 / .847 / 315
Effusion       | .984 / .995 / .759 / .793 / 4648 | .938 / .967 / .780 / .823 / 257  | .968 / .977 / .890 / .899 / 140
Infiltration   | .960 / .986 / .661 / .666 / 6088 | .849 / .879 / .648 / .664 / 271  | .981 / .984 / .585 / .718 / 57
Mass           | .984 / .994 / .693 / .725 / 1712 | .935 / .943 / .696 / .710 / 93   | .959 / .903 / .756 / .723 / 14
Nodule         | .981 / .994 / .668 / .685 / 1615 | .974 / .974 / .662 / .684 / 130  | .967 / .960 / .647 / .658 / 102
Pneumonia      | .947 / .969 / .658 / .720 / 477  | .917 / .946 / .724 / .681 / 55   | .983 / .994 / .642 / .731 / 36
Pneumothorax   | .983 / .995 / .799 / .847 / 2661 | .983 / .996 / .784 / .855 / 166  | .960 / .960 / .631 / .709 / 22
Consolidation  | .989 / .997 / .703 / .701 / 1815 | .923 / .910 / .609 / .631 / 60   | .969 / .989 / .790 / .855 / 28
Edema          | .976 / .989 / .805 / .829 / 925  | .970 / .987 / .815 / .834 / 33   | .984 / .995 / .799 / .879 / 40
Emphysema      | .996 / .997 / .833 / .865 / 1093 | .980 / .981 / .835 / .863 / 44   | .849 / .868 / .675 / .792 / 94
Fibrosis       | .986 / .986 / .786 / .796 / 435  | .930 / .989 / .688 / .714 / 11   | .985 / .960 / .744 / .791 / 18
PT             | .988 / .997 / .684 / .735 / 1143 | .904 / .923 / .679 / .776 / 41   | .948 / .953 / .691 / .749 / 52
Hernia         | .929 / .958 / .871 / .876 / 86   | .757 / .545 / .864 / .647 / 2    | – / – / – / – / 0
NoFinding      | .920 / .985 / – / .701 / 9912    | .889 / .908 / – / .666 / 85      | .933 / .936 / – / .747 / 2789
AVG            | .976 / .989 / .745 / .772 / –    | .922 / .925 / .735 / .748 / –    | .960 / .965 / .719 / .779 / –
#wAVG          | .978 / .992 / .722 / .748 / –    | .878 / .900 / .687 / .719 / –    | .957 / .966 / .741 / .798 / –

Table 1. Evaluation of image classification results (AUCs) on the ChestX-ray14, Hand-labeled and OpenI datasets. Performance is reported for four methods, i.e., multilabel classification based on Report (R), Image + Report (I+R), Image [35], and Image + Generative Report (I+GR). The # column gives the number of samples per class.
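
As a worked check of the two summary rows, the following minimal sketch recomputes AVG and #wAVG for the I+GR column on ChestX-ray14 from the per-class values in Table 1. The function and variable names are illustrative and not taken from the paper. Leaving the NoFinding row out of both averages is an assumption: it is the choice that reproduces the reported .772 and .748.

```python
# I+GR AUCs and sample counts (#) for the 14 disease classes,
# copied from the ChestX-ray14 columns of Table 1.
# Assumption: the NoFinding row is excluded from both averages.
IGR_CHESTXRAY14 = {
    "Atelectasis": (0.732, 3255),
    "Cardiomegaly": (0.844, 1065),
    "Effusion": (0.793, 4648),
    "Infiltration": (0.666, 6088),
    "Mass": (0.725, 1712),
    "Nodule": (0.685, 1615),
    "Pneumonia": (0.720, 477),
    "Pneumothorax": (0.847, 2661),
    "Consolidation": (0.701, 1815),
    "Edema": (0.829, 925),
    "Emphysema": (0.865, 1093),
    "Fibrosis": (0.796, 435),
    "PT": (0.735, 1143),
    "Hernia": (0.876, 86),
}


def plain_average(rows):
    """Unweighted mean of the per-class AUCs (the AVG row)."""
    return sum(auc for auc, _ in rows.values()) / len(rows)


def weighted_average(rows):
    """Mean of the per-class AUCs weighted by sample count (the #wAVG row)."""
    total = sum(count for _, count in rows.values())
    return sum(auc * count for auc, count in rows.values()) / total


print(f"AVG   = {plain_average(IGR_CHESTXRAY14):.3f}")    # AVG   = 0.772
print(f"#wAVG = {weighted_average(IGR_CHESTXRAY14):.3f}")  # #wAVG = 0.748
```

Because #wAVG weights each class by its sample count, frequent classes such as Infiltration dominate it, whereas AVG gives a rare class such as Hernia the same influence as any other.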

[Figure 3: six ROC-curve panels, each plotting True Positive Rate against False Positive Rate: R, I + R and I + GR classification on ChestX-ray14, and R, I + R and I + GR classification on Hand-labeled. The panel legends list the per-class AUCs reported in Table 1.]

Figure 3. A comparison of classification performance with different testing inputs, i.e., Report (R), Image+Report (I+R), and Image+Generative Report (I+GR).
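
To make the AUC values concrete, here is a minimal sketch of how the area under a ROC curve can be computed for one class from binary labels and predicted scores. It uses the rank-based (Mann-Whitney) formulation, which equals the area under the empirical ROC curve. The function name and the toy data are illustrative assumptions; the paper does not state which implementation was used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for one class.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not data from the paper).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

In the multi-label setting each disease class is scored independently in this way, which is why a class without positive samples (e.g., Hernia on OpenI) has no AUC entry in Table 1.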

It may be because the gradients from the RNN are backpropagated to the CNN part, and the resulting adjustment of image features from the Transition layer benefits the report generation task.

Figure 4 illustrates 4 sample results from the proposed automatic classification and reporting system. Please see more examples in Appendix A. Original images are shown along with the classification predictions, original reports and generated reports. Text-attended words are also


[Figure 4: chest X-ray images for sample cases A-D. The text content of the figure is listed below; the color coding of the predictions and the attention highlighting are not reproduced.]

Sample A
Predictions (P): Atelectasis, Effusion
Original report: findings : a single ap view of the chest demonstrates increasing bibasilar interstitial opacities with decreased overall aeration . increasing blunting of right costophrenic angle. … impression : increasing bibasilar atelectasis with possible development of right pleural effusion .
Generated report: findings : a single ap view of the chest demonstrates unchanged bilateral reticular opacities , consider atelectasis . continued left basilar atelectasis . no evidence of developing infiltrate . the cardiac and mediastinal contours are stable . impression : no evidence of developing infiltrate .

Sample B
Predictions (P): No finding
Original report: Normal no evidence of lung infiltrate .
Generated report: findings : pa and lateral views of the chest demonstrate lungs that are clear without focal mass , infiltrate or effusion . cardiomediastinal silhouette is normal size and contour . pulmonary vascularity is normal in caliber and distribution . impression : no evidence of acute pulmonary pathology

Sample C
Predictions (P): Nodule, Pneumothorax, Mass, Consolidation
Original report: findings : heart and mediastinum unchanged . multiple lung nodules . evidence of recent left chest surgery with left chest tube in place . very small left apical pneumothorax . lungs unchanged , no evidence of acute infiltrates . impression : stable chest .
Generated report: findings : pa and lateral views of the chest demonstrate unchanged bilateral chest tubes . again pulmonary nodules are seen on the right and cardiac silhouette unchanged . the cardiac and mediastinal contours are stable . impression : 1. bilateral masses and left lower lung field consolidation . 2. new bilateral lung masses .

Sample D
Predictions (P): Mass
Original report: findings : large left suprahilar and infrahilar masses as well as the well circumscribed nodule the level of the aortic knob . the right infrahilar mass as well . no effusion . impression : metastatic lung disease .
Generated report: comparison is to previous upright study of no significant interval change is seen in the appearance of the chest . the mediastinal soft tissue and pulmonary vascularity are stable . there are blastic bone lesions in the chest . bones , soft tissues are normal . the lung fields are clear . there are calcified lymph nodes in the left lower lung . impression : . sclerotic lesions in the left humeral , consistent with metastasis.

Figure 4. 4 sample image Classification Predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red and missing predictions in blue.

Table 2. Evaluation of generated reports on the ChestX-ray14 testing set using BLEU, METEOR and ROUGE-L.

Metric     Captioning [37]   TieNet I+GR
BLEU-1     0.2391            0.2860
BLEU-2     0.1248            0.1597
BLEU-3     0.0861            0.1038
BLEU-4     0.0658            0.0736
METEOR     0.1024            0.1076
ROUGE-L    0.1988            0.2263
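
For readers unfamiliar with the BLEU-n scores above, the following is a minimal single-reference sketch of the metric: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is an illustration only; the function names and the example sentences are assumptions, and the scores in Table 2 were not necessarily produced with this exact formulation (corpus-level aggregation and smoothing are omitted).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=1):
    """Sentence-level BLEU-max_n against a single reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision)


# Hypothetical sentence pair in the style of the reports.
generated = "no evidence of acute infiltrate"
original = "no evidence of lung infiltrate"
print(round(bleu(generated, original, max_n=1), 4))  # 0.8
print(round(bleu(generated, original, max_n=2), 4))  # 0.6325
```

The example shows why higher-order scores drop quickly: a single substituted word breaks two bigrams, so BLEU-2 falls further than BLEU-1, which matches the decreasing trend from BLEU-1 to BLEU-4 in Table 2.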

highlighted over the generated reports. Looking at the generated reports alone, we find that they all read well. However, the described diseases may not truly appear in the images. For example, ‘Atelectasis’ is correctly recognized in sample A but ‘Effusion’ is missed. ‘Effusion’ (not too far from the negation word ‘without’) is erroneously highlighted in sample B, but the system is still able to correctly classify the image as ‘No finding’. In sample D, the generated report misses ‘Mass’ while it is right about the metastasis in the lung. One promising finding is that the false predictions (‘Mass’ and ‘Consolidation’) in sample C can actually be observed in the image (verified by a radiologist) but somehow were not noted in the original report, which indicates that our proposed network can, to some extent, associate the image appearance with the text description.
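
The three prediction categories used in the sample figures (correct, false and missing) amount to a set comparison between the predicted labels and the reference labels. A minimal sketch follows, assuming hypothetical function and key names; the label sets for sample A follow the discussion above and are used only as an illustration.

```python
def mark_predictions(predicted, reference):
    """Split labels into correct, false and missed predictions.

    correct: predicted and present in the reference labels (green)
    false:   predicted but absent from the reference labels (red)
    missed:  in the reference labels but not predicted (blue)
    """
    predicted, reference = set(predicted), set(reference)
    return {
        "correct": sorted(predicted & reference),
        "false": sorted(predicted - reference),
        "missed": sorted(reference - predicted),
    }


# Sample A as discussed in the text: 'Atelectasis' recognized, 'Effusion' missed.
print(mark_predictions(["Atelectasis"], ["Atelectasis", "Effusion"]))
# {'correct': ['Atelectasis'], 'false': [], 'missed': ['Effusion']}
```

Note that a "false" prediction under this scheme is only false with respect to the reference labels; as sample C shows, such a finding may still be visible in the image.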

6. Conclusion

Automatically extracting machine-learnable annotations from retrospective data remains a challenging task, for which images and reports are the two main useful sources. Here, we proposed a novel text-image embedding network integrated with multi-level attention models. TieNet is implemented in an end-to-end CNN-RNN architecture for learning a blend of distinctive image and text representations. We then demonstrated and discussed the pros and cons of including radiological reports in both auto-annotation and reporting tasks. While significant improvements have been achieved in multi-label disease classification, there is still much room to improve the quality of generated reports. For future work, we will extend TieNet to include multiple RNNs for learning not only disease words but also their attributes, and further correlate them and the image findings with the descriptions in the generated reports.

Acknowledgements This work was supported by the Intramural Research Programs of the NIH Clinical Center and National Library of Medicine. Thanks to Adam Harrison and Shazia Dharssi for proofreading the manuscript. We are also grateful to NVIDIA Corporation for the GPU donation.


References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 2016.

[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), pages 1–15, 2015.

[3] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.

[4] A. Ben-Cohen, I. Diamant, E. Klang, M. Amitai, and H. Greenspan. Fully convolutional network for liver segmentation and lesions detection. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pages 77–85, 2016.

[5] G. Chartrand, P. M. Cheng, E. Vorontsov, M. Drozdzal, S. Turcotte, C. J. Pal, S. Kadoury, and A. Tang. Deep learning: a primer for radiologists. Radiographics : a review publication of the Radiological Society of North America, Inc, 37(7):2113–2131, 2017.

[6] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3076–3086, 2017.

[7] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2015.

[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[9] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.

[10] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[11] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–13, 2017.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[14] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1115–1124, 2017.

[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.

[16] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4565–4574, 2016.

[17] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE transactions on pattern analysis and machine intelligence, 39(4):664–676, 2017.

[18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and F.-F. Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[20] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8, pages 1–8. Barcelona, Spain, 2004.

[21] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations (ICLR), pages 1–10, 2014.

[22] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollar. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.

[23] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations (ICLR), pages 1–15, 2017.

[24] J. Liu, D. Wang, L. Lu, Z. Wei, L. Kim, E. B. Turkbey, B. Sahiner, N. Petrick, and R. M. Summers. Detection and diagnosis of colitis on computed tomography using deep convolutional neural networks. Medical Physics, 44(9):4630–4642, 2017.

[25] Y. Liu, C. Sun, L. Lin, and X. Wang. Learning natural language inference using bidirectional LSTM model and inner-attention. 2016.

[26] F. Meng, Z. Lu, M. Wang, H. Li, W. Jiang, and Q. Liu. Encoding source language with convolutional neural network for machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-CoNLL), pages 20–30, 2015.

[27] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 299–307, 2017.

[28] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (ACL), pages 311–318, 2002.

[29] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek. Areas of attention for image captioning. In International Conference on Computer Vision (ICCV), pages 1–22, 2017.

[30] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In International Conference on Computer Vision (ICCV), 2015.

[31] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 379–389, 2015.

[32] H.-C. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers. Learning to read chest X-rays: recurrent neural cascade model for automated image annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2497–2506, 2016.

[33] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.

[34] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164, 2015.

[35] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2097–2106, 2017.

[36] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: free-form visual question answering based on knowledge from external sources. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–5, 2016.

[37] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), pages 2048–2057, 2015.

[38] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2017.

[39] W. L. L. C.-C. Yulia, T. S. Amir, R. F. A. C. D. Alan, and W. B. I. Trancoso. Not all contexts are created equal: Better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1367–1372, 2015.

[40] Z. Zhang, P. Chen, M. Sapkota, and L. Yang. Tandemnet: Distilling knowledge from medical images using diagnostic reports as optional semantic references. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 320–328. Springer, 2017.

[41] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang. MDNet: a semantically and visually interpretable medical image diagnosis network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6428–6436, 2017.

[42] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

A. More Experiment Results

In this section, we present 20 more classification and reporting results (cases E-X) from the proposed TieNet in addition to the four examples (cases A-D) shown in the main paper. Sample images are illustrated along with the associated classification Predictions (P), original reports and generated reports. Text attention is highlighted with different saturation levels over the generated text; darker red means a higher text attention weight. Correct classification predictions are marked in green, false predictions in red and missed predictions in blue.
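
The saturation-based highlighting can be sketched as a mapping from per-word attention weights to shades of red. This is an illustrative assumption about the rendering, not the procedure used to produce the figures: the function name, the normalization by the maximum weight and the hex color encoding are all hypothetical.

```python
def attention_to_hex(weights):
    """Map non-negative attention weights to red shades.

    The largest weight maps to pure red (#ff0000) and a zero weight to
    white (#ffffff); intermediate weights get proportionally lighter reds.
    """
    if not weights:
        return []
    top = max(weights)
    colors = []
    for w in weights:
        level = w / top if top > 0 else 0.0
        # Green and blue channels fade from 255 (white) to 0 (pure red).
        fade = round(255 * (1 - level))
        colors.append(f"#ff{fade:02x}{fade:02x}")
    return colors


# Hypothetical weights for three words of a generated report.
print(attention_to_hex([0.0, 0.2, 1.0]))  # ['#ffffff', '#ffcccc', '#ff0000']
```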


[Figure 5: enlarged view of sample cases A-D, showing the same classification predictions, original reports and generated reports as Figure 4.]

Figure 5. 4 sample image Classification Predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red and missing predictions in blue.


[Figure 6: chest X-ray images for sample cases E-H. The text content of the figure is listed below; the color coding of the predictions and the attention highlighting are not reproduced.]

Sample E
Predictions (P): No finding
Original report: findings : previously noted right lower lobe infiltrates have resolved . venous line is noted in svc . lungs are now clear . the heart size and mediastinal contour are normal . impression : no acute cardiopulmonary disease is noted .
Generated report: findings : pa and lateral chest. diffuse bony lesions are noted . suggest correlation with chest ct for more sensitive assessment for . minimal interstitial thickening is seen in lung . this is stable , but suggest correlation with chest ct or for interval developing acute pulmonary pathology . no evidence of pneumothorax . no acute infiltrates or pulmonary infiltrates are . . bony structures are intact .

Sample F
Predictions (P): No finding
Original report: lungs are moderately well-inflated . no infiltrates or effusions . interstitial markings are upper limits of normal . cardiomediastinal silhouette is normal .
Generated report: findings : lungs clear . normal cardiomediastinal contours . no pleural effusions . vague density or atelectasis or scarring in the right lung base. impression : no acute process .

Sample G
Predictions (P): Effusion
Original report: impression : moderate right pleural effusion . no pneumothorax . right chest tube with tip and sideport deep in the right costophrenic sulcus .
Generated report: findings : the bilateral pleural effusions are noted . the heart is normal in size . there is no pleural effusion . bony structures are intact . impression : new pleural effusions .

Sample H
Predictions (P): No finding
Original report: findings : no radiographic abnormalities . impression : stable chest .
Generated report: findings : pa and lateral views of the chest demonstrate lungs that are clear without focal mass , infiltrate or effusion . cardiomediastinal silhouette is normal size and contour . pulmonary vascularity is normal in limits and top normal size cardiac silhouette . skeletal structures are intact . impression : no acute lung infiltrates .

Figure 6. 4 sample image Classification Predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red and missing predictions in blue.


[Figure 7: chest X-ray images for sample cases I-L. The text content of the figure is listed below; the color coding of the predictions and the attention highlighting are not reproduced.]

Sample I
Predictions (P): Emphysema
Original report: minimal-moderate left neck and upper chest subcutaneous emphysema unchanged or minimally decreasing. moderate-marked right chest , minimal-moderate right neck and upper abdomen subcutaneous emphysema.
Generated report: reason for study : s/p vats clinical information : aplastic anemia chest 1 view : chest x-ray performed on the same day . the heart and mediastinum are normal . the subcutaneous emphysema is seen in the right neck and neck on the right . there is unchanged subcutaneous emphysema seen on the right .

Sample J
Predictions (P): Infiltration
Original report: findings : interval development of left upper lobe patchy nodular infiltrate inferiorly . unchanged radiopaque catheter compatible with vp shunt . stable catheter overlying the stomach . costophrenic angles are clear . cardiac and mediastinal borders are within normal limits of size . impression : interval development of left upper lobe patchy nodular infiltrate
Generated report: reason for exam ( entered by ordering clinician into cris): r/o acute , r/o pulmonary disease interval changes no interval change and seen are available of the heart . there are a nodular interstitial infiltrate or scarring as well as mild degenerative changes of the spine .

Sample K
Predictions (P): Mass, Consolidation
Original report: findings : a single ap view of the chest demonstrates unchanged or minimally increasing , dependent positioning , right lung mass/consolidation. the cardiac and mediastinal contours are stable . impression : 1. unchanged or minimally increasing , dependent positioning , right lung mass/consolidation . 2. no evidence of developing infiltrate the visualized left lung
Generated report: findings include 1. left lung consolidation . 2. left lung consolidation .

Sample L
Predictions (P): No finding, Nodule
Original report: findings : lungs are well aerated with no evidence of infiltrate . cardiac and mediastinal borders are within normal limits of size . .. . impression : no evidence of infiltrate
Generated report: Findings: heart and mediastinum unchanged . lungs unchanged , no evidence of acute infiltrates . nodule projecting on posterior rib posteriorly . osseous structures intact . impression : stable chest .

Figure 7. 4 sample image Classification Predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red and missing predictions in blue.


Sample cases M, N, O, P (image, predictions P, original report, generated report)

P: Effusion, Pneumothorax, Effusion, Atelectasis, pneumothorax, Nodule, Atelectasis, Consolidation, No finding, Atelectasis

Original report

continued visualization of a small pleural effusion on the left side and slight elevation of the left hemidiaphragm . there is decrease in size of the left apical pneumothorax which is minimal now . no developing infiltrates .

findings : a single ap view of the chest demonstrates stable et tube . there is no change in ng or swan . likely right pleural fluid with loculation near horizontal fissure . the cardiac and mediastinal contours are stable . impression : 1 . stable lines tubes 2 . right pleural effusion

findings include 2 . . left lung nodules . 3 . . venous catheter , tip in superior vena cava . 4 . . evidence of previous bilateral axillary surgery . impression increased density left base ( of pleural fluid /left lower lobe consolidation , atelectasis ?

findings compatible with excretory phase , as requested -- no evidence of pneumothorax .

Generated Report

findings : left picc line remains in place . small residual pleural effusion in the left lateral lung base has been removed . no definite infiltrates or effusions minimally . . no definite pleural effusions . a very small right pleural effusion has been placed since last study . impression : resolution of right pleural effusion .

findings : there is still endotracheal tube above the carina . a small right apical pleural effusion 1 . interval removal of an nasogastric tube . ng tube is noted . there are diffuse bilateral perihilar airspace opacities , including diffuse airspace disease and a small right pleural effusion . there are no pleural effusions . there is no pneumothorax . the heart is stable . impression : 1 . two chest tubes , continued bilateral pneumothorax . 2 . bilateral pleural effusions and bilateral lower lung atelectasis .

findings include unchanged in appearance of left lung opacities . no evidence of pneumothorax . one is in pleural density in the chest and maybe of uncertain location and not definitely seen .

findings include 1 . blunting of costophrenic angles and linear components in the lung bases , predominantly in the chest stable and a small linear components consistent with atelectasis . 3 . . no evidence of pneumothorax . 3 . . metal clips indicative of previous left axillary surgery . impression unchanged unchanged since april 3 no evidence of acute pulmonary process

Figure 8. Four sample image classification predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red, and missing predictions in blue.

Sample cases Q, R, S, T (image, predictions P, original report, generated report)

P: Mass, Atelectasis, Infiltration, Edema, Nodule, Effusion, Consolidation, Pneumonia, Infiltration, Edema, Atelectasis, Nodule, Pneumothorax, Emphysema, Infiltration, Effusion

Original report

findings : heart size within normal limits . right hilar mass and right middle lobe atelectasis unchanged . increased interstitial infiltrates in the lungs most apparent in the left lung . no pleural fluid . soft tissue nodule in the left supraclavicular region . osseous structures intact impression : increasing interstitial infiltrate ? edema or infection .

again there is a small left pleural effusion developed since the study . bilateral interstitial lung disease is again seen present on the previous study primarily in both lung bases but now increased density in the right upper lobe as well suggesting bilateral patchy areas of consolidation , presumably pneumonia .

stable left lung base linear densities consistent with scarring or atelectasis . no developing infiltrates or effusions . extensive diffuse bony involvement from prostate cancer again seen .

findings : unchanged right sided chest tube and right apical pneumothorax . stable right-sided chest wall subcutaneous emphysema . stable left apical infiltrate costophrenic angles are clear . cardiac and mediastinal borders are within normal limits of size . impression : unchanged right sided chest tube and right apical pneumothorax .

Generated Report

findings : multiple right lung masses are seen bilaterally with bilateral pulmonary nodules . large right hilar masses is seen that is slightly worse on the right and the right midlung . the right lung continues to remain normal .

reason for study : new shortness of breath clinical information : chronic lymphoma two dr images marked , 1 . minimal bilateral mid lung infiltrates appearing since 10 september , possibly due to edema and/or pneumonia . anterior chest surgery . clothing artifact appearing . anterior right upper abdomen surgery .

findings : 1 . increased linear atelectasis in the lingula consistent with metastases . 2 . minimal linear densities in the costophrenic angles characteristic of scarring . 3 . healed rib fractures . 3 . minimal tortuosity thoracic aorta . 4 . Multiple calcified pulmonary nodules consistent with pulmonary edema . impression : stable chest , negative for evidence of pneumothorax .

findings : minimal right pneumothorax despite right pleural tube with large pleural effusion and removal right chest tube . 2 right chest subcutaneous emphysema . impression : stable tubes . right chest . right lower chest surgery increasing or . . right chest surgery .

Figure 9. Four sample image classification predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red, and missing predictions in blue.

Sample cases U, V, W, X (image, predictions P, original report, generated report)

P: Atelectasis, Effusion, Infiltration, Effusion, Nodule, Mass, Mass

Original report

findings : 2 views of the chest were performed . left-sided effusion as decreased . right-sided effusion is grossly stable given differences in patient ` s positioning . there is bibasilar atelectasis right greater than left . no pneumothorax is seen . the cardiac silhouette is partially obscured by the airspace process and pleural effusions .

findings compatible with bilateral lung infiltrates , possible pleural effusions , and spine degenerative arthritis . tracheostomy tube . right swan-ganz catheter , tip projected over right pulmonary artery . left pic catheter . ekg leads . advise follow up chest , if clinically indicated .

findings : pa and lateral views of the chest demonstrate unchanged bilateral nodules masses , largest right upper lung field measuring approximately 4.3 cm in diameter . postoperative features bilaterally . the cardiac and mediastinal contours are stable . impression : unchanged bilateral nodules and masses with no evidence of infiltrate .

findings : unchanged large mediastinal masses with tracheal deviation towards right and anteriorly . small hydropneumothorax remains on the right anteriorly . the cardiac and mediastinal contours are stable . impression : right anterior hydropneumothorax unchanged large mediastinal masses with tracheal deviation

Generated Report

findings : compared to a previous day ` s study . there is a left chest tube with its tip in the left apex . subsegmental atelectasis are seen in the left lower lobe . there is also a small right pleural effusion . impression : increasing left pleural effusion .

findings : no significant interval change . again noted is an illdefined bilateral pleural effusions and bilateral infiltrates . no pneumothorax . patient is not rotated on the last exam . impression : stable examination .

findings : the heart and mediastinum are normal . the lung fields are clear . bilateral lung nodules are seen in the lungs . left lung is clear . impression : no acute airspace disease .

indication : developing mass in right lower lobe , s/p right mediastinal mediastinal enlargement unchanged . 2 . right mediastinal mass unchanged . postoperative features in superior mediastinum appearing since last study with no definite evidence of pneumothorax noted with right lung volume loss with enlargement mediastinal soft tissue thickening & right hilar enlargement unchanged . postoperative features with right chest tube unchanged .

Figure 10. Four sample image classification predictions (P) along with original and generated reports. Text attentions are highlighted over the generated text. Correct predictions are marked in green, false predictions in red, and missing predictions in blue.