

Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound

Subhankar Roy, Willi Menapace, Sebastiaan Oei, Ben Luijten, Enrico Fini, Cristiano Saltori, Iris Huijben, Nishith Chennakeshava, Federico Mento, Alessandro Sentelli, Emanuele Peschiera, Riccardo Trevisan, Giovanni Maschietto, Elena Torri, Riccardo Inchingolo, Andrea Smargiassi, Gino Soldati, Paolo Rota, Andrea Passerini, Ruud J.G. van Sloun, Elisa Ricci, Libertario Demi

Abstract— Deep learning (DL) has proved successful in medical imaging and, in the wake of the recent COVID-19 pandemic, some works have started to investigate DL-based solutions for the assisted diagnosis of lung diseases. While existing works focus on CT scans, this paper studies the application of DL techniques for the analysis of lung ultrasonography (LUS) images. Specifically, we present a novel fully-annotated dataset of LUS images collected from several Italian hospitals, with labels indicating the degree of disease severity at a frame-level, video-level, and pixel-level (segmentation masks). Leveraging these data, we introduce several deep models that address relevant tasks for the automatic analysis of LUS images. In particular, we present a novel deep network, derived from Spatial Transformer Networks, which simultaneously predicts the disease severity score associated to an input frame and provides localization of pathological artefacts in a weakly-supervised way. Furthermore, we introduce a new method based on uninorms for effective frame score aggregation at a video-level. Finally, we benchmark state-of-the-art deep models for estimating pixel-level segmentations of COVID-19 imaging biomarkers. Experiments on the proposed dataset demonstrate satisfactory results on all the considered tasks, paving the way to future research on DL for the assisted diagnosis of COVID-19 from LUS data.

    Index Terms— COVID-19, Lung ultrasound, Deep learning

I. INTRODUCTION

The rapid global SARS-CoV-2 outbreak resulted in a scarcity of medical equipment. In addition to a worldwide

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. Manuscript submitted on April 14, 2020. DOI: 10.1109/TMI.2020.2994459. S. Oei∗, B. Luijten∗, I. Huijben∗, N. Chennakeshava∗, and R.J.G. van Sloun† are with the Dept. of Electrical Engineering, Eindhoven University of Technology, The Netherlands (e-mail: [email protected]). S. Roy∗, C. Saltori∗, E. Fini∗, W. Menapace∗, P. Rota†, E. Ricci†, A. Passerini† are with the Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy (e-mail: [email protected], [email protected]). S. Roy and E. Ricci† are with FBK, Trento, Italy. F. Mento∗, A. Sentelli, E. Peschiera, R. Trevisan, G. Maschietto, and L. Demi† are with ULTRa, Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy (e-mail: [email protected]). E. Torri is with Bresciamed, Italy. R. Inchingolo and A. Smargiassi are with the Dept. of Cardiovascular and Thoracic Sciences, Fondazione Policlinico Universitario A. Gemelli IRCCS, Italy. G. Soldati is with the Diagnostic and Interventional Ultrasound Unit, Valle del Serchio General Hospital, Lucca, Italy. ∗ Joint first authors, contributed equally to this work. † Joint lead authors, contributed equally to this work.

shortage of mouth masks and mechanical ventilators, testing capacity has been severely limited. Priority of testing was therefore given to suspected patients and hospital staff [1]. However, extensive testing and diagnostics are of great importance in order to effectively contain the pandemic. Indeed, countries that have been able to achieve large-scale testing of possibly infected people, combined with massive citizen surveillance, reached significant containment of the SARS-CoV-2 virus [2]. The insufficient testing capacity in most countries has therefore spurred the need and search for alternative methods that enable diagnosis of COVID-19. In addition, the accuracy of the current lab test, reverse transcription polymerase chain reaction (RT-PCR) arrays, remains highly dependent on swab technique and location [3].

COVID-19 pneumonia can rapidly progress into a very critical condition. Examination of radiological images of over 1,000 COVID-19 patients showed many acute respiratory distress syndrome (ARDS)-like characteristics, such as bilateral and multi-lobar ground glass opacifications (mainly posteriorly and/or peripherally distributed) [4], [5]. As such, chest computed tomography (CT) has been coined as a potential alternative for diagnosing COVID-19 patients [4]. While RT-PCR may take up to 24 hours and requires multiple tests for definitive results, diagnosis using CT can be much quicker. However, use of chest CT comes with significant drawbacks: it is costly, exposes patients to radiation, requires extensive cleaning after scans, and relies on interpretation by a radiologist.

Lately, ultrasound imaging, a more widely available, cost-effective, safe and real-time imaging technique, is gaining attention. In particular, lung ultrasound (LUS) is increasingly used in point-of-care settings for detection and management of acute respiratory disorders [6], [7]. In some cases, it demonstrated better sensitivity than chest X-ray in detecting pneumonia [8]. Clinicians have recently described use of LUS imaging in the emergency room for diagnosis of COVID-19 [9]. Findings suggest specific LUS characteristics and imaging biomarkers for COVID-19 patients [10]–[12], which may be used to both detect these patients and manage the respiratory efficacy of mechanical ventilation [13]. The broad range of applicability and relatively low costs make ultrasound imaging an extremely useful technique in situations when patient inflow exceeds the regular hospital imaging infrastructure capabilities. Thanks to its low costs, it is also accessible for low- and middle-income countries [14]. However, interpreting


Fig. 1: Overview of the different tasks considered in this work. Given a LUS image sequence, we propose approaches for: (orange) prediction of the disease severity score for each input frame and weakly supervised localization of pathological patterns; (pink) aggregation of frame-level scores for producing predictions on videos; (green) estimation of segmentation masks indicating pathological artifacts.

ultrasound images can be a challenging task and is prone to errors due to a steep learning curve [15].

Recently, automatic image analysis by machine and deep learning (DL) methods has already shown promise for reconstruction, classification, regression and segmentation of tissues using ultrasound images [16], [17]. In this paper we describe the use of DL to assist clinicians in detecting COVID-19 associated imaging patterns on point-of-care LUS. In particular, we tackle three different tasks on LUS imaging (Fig. 1): frame-based classification, video-level grading and pathological artifact segmentation. The first task consists of classifying each single frame of a LUS image sequence into one of the four levels of disease severity defined by the scoring system in [12]. Video-level grading aims to predict a score for the entire frame sequence based on the same scoring scale. Segmentation instead comprises pixel-level classification of the pathological artifacts within each frame.

This paper advances the state of the art in the automatic analysis of LUS images for supporting medical personnel in the diagnosis of COVID-19 related pathologies in many directions. (1) We propose an extended and fully-annotated version of the ICLUS-DB database [18]. The dataset contains labels on the 4-level scale proposed in [12], both at frame- and video-level. Furthermore, it includes a subset of pixel-level annotated LUS images useful for developing and assessing semantic segmentation methods. (2) We introduce a novel deep architecture which permits to predict the score associated to a single LUS image, as well as to identify regions containing pathological artifacts in a weakly supervised manner. Our network leverages Spatial Transformer Networks (STN) [19] and consistency losses [20] to achieve disease pattern localization, and benefits from a soft ordinal regression loss [21] for robust score estimation. (3) We introduce a simple and lightweight approach based on uninorms [22] to aggregate frame-level predictions and estimate the score associated to a video sequence. (4) We address the problem of automatic localization of pathological artifacts by evaluating the performance of state-of-the-art semantic segmentation methods derived from fully convolutional architectures. (5) Finally, we conduct an extensive evaluation of our methods on all the tasks, showing that accurate prediction and localization of COVID-19 imaging biomarkers can be achieved with the proposed solutions. Dataset and code are available at https://iclus-web.bluetensor.ai and at https://github.com/mhug-Trento/DL4covidUltrasound.

    II. RELATED WORK

DL has proven to be successful in a multitude of computer vision tasks ranging from object recognition and detection to semantic segmentation. Motivated by these successes, more recently, DL has been increasingly used in medical applications, e.g. for biomedical image segmentation [23] or pneumonia detection from chest X-ray [24]. These seminal works indicate that, with the availability of data, DL can lead to the assistance and automation of preliminary diagnoses, which are of tremendous significance in the medical community.

In the wake of the current pandemic, recent works have focused on the detection of COVID-19 from chest CT [25], [26]. In [27], a U-Net type network is used to regress a bounding box for each suspicious COVID-19 pneumonia region on consecutive CT scans, and a quadrant-based filtering is exploited to reduce possible false positive detections. Differently, in [28] a threshold-based region proposal is first used to retrieve the regions of interest (RoIs) in the input scan, and the Inception network is exploited to classify each proposed RoI. Similarly, in [29], a VNET-IR-RPN model pre-trained for pulmonary tuberculosis detection is used to propose RoIs in the input CT, and a 3D version of ResNet-18 is employed to classify each RoI. However, very few works using DL on LUS images can be found in the literature [30]. A classification and weakly-supervised localization method for lung pathology is described in [17]. Based on the same idea, in [18] a frame-based classification and weakly-supervised segmentation method is applied to LUS images for COVID-19 related pattern detection. Here, EfficientNet is trained to recognize COVID-19 in LUS images, after which class activation maps (CAMs) [31] are exploited to produce a weakly-supervised segmentation map of the input image. Our work has several differences compared to all the previous works. First, while in [18] CAMs are used for localization, in this work we exploit STN to learn a weakly-supervised localization policy from the data (i.e. not exploiting explicitly labelled locations but inferring them from simple frame-based classification labels). Second, while in [18] a classification problem is solved, we focus on ordinal regression, predicting not only the presence of COVID-19 related artifacts, but also a score connected to the disease severity. Third, we move a step forward compared to all previous methods by proposing a video-level prediction model


Fig. 2: The distribution of the probes (convex, linear) and the scores of frames (score 0 to score 3) grouped by hospital (Brescia, Rome, Lucca, Pavia, Tione) and overall statistics.

built on top of the frame-based method. Finally, we propose a simple yet effective method to predict segmentation masks using an ensemble of multiple state-of-the-art convolutional network architectures for image segmentation. Additionally, the model's predictions are accompanied with uncertainty estimates to facilitate interpretation of the results.

    III. ICLUS-DB: DATA COLLECTION AND ANNOTATION

We here present the Italian COVID-19 Lung Ultrasound DataBase (ICLUS-DB), which currently includes a total of 277 lung ultrasound (LUS) videos from 35 patients, corresponding to 58,924 frames¹. The data were acquired within different clinical centers (BresciaMed, Brescia, Italy; Valle del Serchio General Hospital, Lucca, Italy; Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy; Fondazione Policlinico Universitario San Matteo IRCCS, Pavia, Italy; Tione General Hospital, Tione (TN), Italy) and using a variety of ultrasound scanners (Mindray DC-70 Exp®, Esaote MyLabAlpha®, Toshiba Aplio XV®, WiFi Ultrasound Probes - ATL). Both linear and convex probes were used, depending on necessities. Of the 35 patients, 17 were confirmed positive to COVID-19 by swab technique (49%), 4 were COVID-19 suspected (11%), and 14 were healthy and symptomless individuals (40%).

A recent proposal by Soldati et al. describes how specific imaging biomarkers in LUS can be used in the management of COVID-19 patients [12]. Specifically, to evaluate the progression of the pathology, a 4-level scoring system was devised [32], with scores ranging from 0 to 3. Score 0 indicates the presence of a continuous pleural line accompanied by horizontal artifacts called A-lines [33], which characterize a healthy lung surface. In contrast, score 1 indicates the first signs of abnormality, i.e., the appearance of alterations in the pleural line in conjunction with vertical artifacts. Scores 2 and 3 are representative of a more advanced pathological state, with the presence of small or large consolidations, respectively. Finally, score 3 is also associated with the presence of a wider hyperechogenic area below the pleural surface, which can be referred to as "white lung".

A total of 45,560 and 13,364 frames, acquired using respectively convex and linear probes, were labelled according to the scoring system defined above. Of the 58,924 LUS frames forming the dataset, 5,684 were labeled score 3 (10%), 18,972 score 2 (32%), 14,295 score 1 (24%), and 19,973 score 0 (34%). A plot showing the distribution of the scores and probes per hospital is shown in Fig. 2. To guarantee objective

¹ https://iclus-web.bluetensor.ai.

annotation, the labelling process was stratified into 4 levels: 1) score assigned frame-by-frame by four master students with ultrasound background knowledge, 2) validation of the assigned scores performed by a PhD student with expertise in LUS, 3) second level of validation performed by a biomedical engineer with more than 10 years of experience in LUS, and 4) third level of validation and agreement between clinicians with more than 10 years of experience in LUS.

Additionally, a subset of 60 videos sampled across all 35 patients was selected and video-level annotations were provided for them. These annotations use the same scoring system defined for the frame-level annotations. In order to address subjective biases in the evaluation of the videos, five different clinicians provided their evaluation for each sequence. We assess the complexity of this task by calculating the inter-operator agreement, comparing the evaluation of each doctor against the average prediction of the remaining four. The resulting average agreement is about 67% among the available labels.

Finally, for 33 patients, a total of 1,005 and 426 frames, respectively acquired using convex and linear probes, were semantically annotated at a pixel level by contouring the aforementioned imaging biomarkers using the annotation tool LabelMe [34]. For the frames acquired using the linear probe, the relative pixel-level occurrences for scores 0, 1, 2, and 3 are 6.4%, 0.080%, 0.67%, and 3.7%, respectively. For the convex probe, these statistics are 1.9%, 0.074%, 1.8%, and 2.1%, respectively. Notably, a large proportion of pixels is not associated to any of these scores. These pixels do not display clear characteristics of a specific class, and are referred to as background (BG). A few images and the corresponding annotations are shown in the supplementary material.

IV. DEEP LEARNING-BASED ANALYSIS OF LUS IMAGES

This paper tackles several challenges towards the development of automatic approaches for supporting medical personnel in the diagnosis of COVID-19 related pathologies (see Fig. 1). In particular, following the COVID-19 LUS scoring system in [12], we present a novel deep architecture which automatically predicts the pathological scores associated to all frames of a LUS image sequence (Section IV-A) and optimally fuses them to produce a disease severity score at video-level (Section IV-B). We also show that the proposed model automatically identifies regions in an image which are associated to pathological artifacts without requiring pixel-level annotation. Finally, to further improve the accuracy in the automatic detection of disease-related patterns, we also


consider a scenario where frames are provided with pixel-level annotations, and we propose a segmentation model derived from a state-of-the-art convolutional network architecture (Section IV-C). In the following, we describe the proposed deep learning models.

    A. Frame-based score prediction

1) Problem formulation and notation: With the purpose of supporting medical personnel in the analysis of LUS images, in this paper we introduce an approach for predicting the presence or absence of a pathological artifact in each frame of a LUS image sequence and for automatically assessing the severity score of the disease related to such patterns, according to the COVID-19 LUS scoring system [12]. We are also interested in the spatial localisation of a pathological artifact in the frame, without assuming any annotation about artifact positions in a frame. The weak localization is achieved through the use of Spatial Transformer Networks (STN) [19]. The use of STN stems from the fact that most of the pathological artifacts are concentrated in a relatively small area of the image, and hence the entire image should not be considered by the network to make predictions. The problem can be formalized as follows.

Let X denote the input space (i.e. the image space) and S the set of possible scores. During training, we are given a training set T = {(x_n, s_n)}_{n=1}^N, where x_n ∈ X and s_n ∈ S.

2) Model definition: We are interested in learning a mapping Φ : X → S, which given an input LUS image outputs the associated pathological score label. We model Φ as the composition of two functions Φ = Φ_cnn ◦ Φ_stn, where Φ_stn : X → X estimates an affine transformation and applies it to the input image x, and Φ_cnn : X → S assigns the score to the transformed image. Intuitively, Φ_stn learns to localize regions of interest in the input image and provides Φ_cnn with an image crop where information about the score is most salient. Consequently, Φ_stn produces as a side effect the localization of pathological artifacts in the frame. The mapping Φ_cnn is composed of a convolutional feature extractor and a linear layer with |S|-dimensional output logits. The model Φ_stn is implemented as a deep neural network derived from STN [19]. Fig. 3 shows an overview of the proposed deep architecture.

In the context of deep learning, the generalization capability of a network is of critical importance. To this end, data augmentation has been shown to be very effective [35] in improving the performance of a network. Previous works [18] showed that augmenting a dataset composed of LUS images can drastically improve the ability of the network to discriminate healthy and ill patients. Another way to achieve robust predictions is to enforce some consistency between two perturbed versions (colour jitter, dropout, etc.) of the same image [20], [36]. This makes the network produce smoothed predictions by attending to the salient features in an image. Inspired by this idea, we propose to use STN [19] to produce two different crops from a single image and enforce the predictions of the network to be similar. We name our approach Regularised Spatial Transformer Networks (Reg-STN).

STN [19] is a differentiable module that applies a learnable affine transformation to an input image, or more generally to a

Fig. 3: Illustration of the architecture for frame-based score prediction. An STN modeled by Φ_stn predicts two transformations θ1 and θ2 which are applied to the input image, producing two transformed versions x1 and x2 that localize pathological artifacts. The feature extractor Φ_cnn is applied to x1 to generate the final prediction.

feature map, conditioned on the input itself. It consists of three parts: (i) a localization network that predicts the parameters of the affine transformation, (ii) a grid generator which selects the grid coordinates in the source image to be sampled from, and (iii) a sampler that warps the input image based on the transformation, producing the output map.

As for the localization network, it is trained to output a transformation matrix θ such that:

$$\begin{pmatrix} \alpha^{s} \\ \beta^{s} \end{pmatrix} = \theta \begin{pmatrix} \alpha^{t} \\ \beta^{t} \\ 1 \end{pmatrix} \quad (1)$$

where α^s, β^s and α^t, β^t are the source and target coordinates in the input and output feature map, respectively. In principle θ can describe any affine transformation; however, keeping in mind the properties of LUS images, we restrict the space of possible transformations to rotation, translation, and isotropic scaling:

$$\theta = \begin{bmatrix} \sigma & r_1 & \tau_\alpha \\ r_2 & \sigma & \tau_\beta \end{bmatrix} \quad (2)$$

In our proposed method, an input image x is processed by Φ_stn, which predicts two sets of transformations θ1 and θ2 instead of a single θ. Subsequently, the transformations are applied to x, generating cropped images x1 and x2, respectively. The network Φ_cnn is then applied to x1 and x2, producing two sets of logits for the same image under different transformations. As a side effect, the intermediate images x1 and x2 are produced and can be interpreted as the localization of the pathological artifacts in the input image x. Finally, the Φ_cnn(x1) branch can be trained with any standard supervised classification loss, and the pair (Φ_cnn(x1), Φ_cnn(x2)) is trained with a consistency-enforcing loss (see below).
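To make the two-crop mechanism concrete, the following is a minimal PyTorch sketch of this idea, not the authors' released implementation: the localization backbone `loc_net`, its `out_dim` attribute and the layout of the predicted parameters are illustrative assumptions, while the restricted matrix of Eq. (2) and the two predicted transformations follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_theta(params):
    """Assemble the restricted affine matrix of Eq. (2).

    params: (B, 5) tensor with [sigma, r1, r2, t_alpha, t_beta].
    Returns a (B, 2, 3) matrix [[sigma, r1, t_alpha], [r2, sigma, t_beta]].
    """
    sigma, r1, r2, ta, tb = params.unbind(dim=1)
    row1 = torch.stack([sigma, r1, ta], dim=1)
    row2 = torch.stack([r2, sigma, tb], dim=1)
    return torch.stack([row1, row2], dim=1)

class RegSTN(nn.Module):
    """Two-crop Reg-STN: one localization net predicts two transformations."""

    def __init__(self, loc_net, cnn):
        super().__init__()
        self.loc_net = loc_net                      # backbone -> feature vector
        self.head = nn.Linear(loc_net.out_dim, 10)  # 2 x [sigma, r1, r2, ta, tb]
        self.cnn = cnn                              # classifier: crop -> |S| logits

    def forward(self, x):
        feats = self.loc_net(x)
        p1, p2 = self.head(feats).chunk(2, dim=1)
        theta1, theta2 = build_theta(p1), build_theta(p2)
        # Sample the two crops selected by the predicted transformations.
        grid1 = F.affine_grid(theta1, x.size(), align_corners=False)
        grid2 = F.affine_grid(theta2, x.size(), align_corners=False)
        x1 = F.grid_sample(x, grid1, align_corners=False)
        x2 = F.grid_sample(x, grid2, align_corners=False)
        logits1, logits2 = self.cnn(x1), self.cnn(x2)
        scales = (p1[:, 0], p2[:, 0])   # sigma values, needed for the scale prior
        return logits1, logits2, scales
```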

3) Loss definition: As stated before, we are interested in devising a deep network Φ for automatically predicting the 4-level scores identified in [12]. While this problem can trivially be cast within a classification framework, in this paper we argue that ordinal regression [37] is more appropriate, as we are interested in predicting labels from an ordinal scale. The rationale behind the choice of ordinal regression is that certain categories are more correct than others with respect to the true label, as opposed to an independent-class scenario, in which the order of the levels does not matter. In fact, errors on low-distance levels should be penalized less


than long-distance errors. For instance, predicting a severely ill patient (score 3) as healthy (score 0) should be strongly discouraged, while sometimes the difference between score 1 and score 2 can be subtle and the network should not be overly penalized.

While ordinal regression can be implemented by resorting to the traditional approach of decomposing the problem assuming a |S|-rank formulation [38], following [21] we introduce a lightweight approach for Soft ORDinal regression (SORD). In practice, we implement an ordinal regression framework by using a carefully devised label smoothing mechanism. Instead of one-hot representations of labels, we encode the ground truth information into soft-valued vectors (SORD vectors) ŝ ∈ R^{|S|}, where S is the set of possible scores for a frame. Hence, for a frame x with score s ∈ S, the i-th element of the SORD vector is computed as follows:

$$\hat{s}_i = \frac{e^{-\delta(s,i)}}{\sum_{j \in S} e^{-\delta(j,i)}} \quad (3)$$

where δ is a manually defined distance function between scores/levels, for which we use the squared distance multiplied by a constant factor. This formulation produces a smooth probability distribution over S, in which the magnitude of the elements decreases as the distance to the ground truth increases. Encoding ground truth labels as probability distributions seamlessly blends with common classification loss functions that use a softmax output. Therefore, at training time, we simply train the network Φ using cross entropy:

$$\mathcal{L}_{SORD} = -\sum_{i=0}^{|S|} \hat{s}_i \log\left(\frac{\exp(\Phi(x)_i)}{\sum_{j}^{|S|} \exp(\Phi(x)_j)}\right) \quad (4)$$

The result is a loss function that yields a smaller cost for predictions that are in the neighbourhood of the ground truth label, which in turn generates smaller gradients, hence discouraging drastic updates of the network for small errors. Empirically, we found that our algorithm works best when we increase the distance of score 0 from the others. As mentioned before, this is also validated by the semantics of the scores.
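As an illustration of this label-smoothing mechanism, the sketch below builds SORD target vectors as a softmax over negative squared distances between the true score and every level, which is the standard SORD construction; the scaling constant `alpha` (and any extra separation of score 0 mentioned above) are tunable choices, not values reported in the paper.

```python
import torch

def sord_targets(scores, num_classes=4, alpha=2.0):
    """Soft ordinal labels: softmax of negative (scaled) squared distances.

    scores: (B,) integer ground-truth scores in {0, ..., num_classes - 1}.
    alpha: constant factor multiplying the squared distance (an assumption).
    """
    levels = torch.arange(num_classes, dtype=torch.float32)            # (C,)
    dist = alpha * (scores.float().unsqueeze(1) - levels) ** 2          # (B, C)
    return torch.softmax(-dist, dim=1)                                  # (B, C)

# Example: a frame with true score 2 yields a distribution peaked at 2,
# with mass decaying towards scores 1 and 3 and almost none on score 0.
print(sord_targets(torch.tensor([2])))
```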

Another desirable property of the network is to extract important semantic features of the input image, in order to enable accurate frame score prediction. This can be strengthened by resorting to a regularization in the form of a consistency loss on the two branch predictions (Φ_cnn(x1), Φ_cnn(x2)), with the rationale that two different crops from the same image should have similar predictions. In our case, these two crops are produced by Φ_stn. In detail, the consistency loss is defined on the network representations as follows:

$$\mathcal{L}_{MSE} = \left\| \Phi_{cnn}(x_1) - \Phi_{cnn}(x_2) \right\|_2^2 \quad (5)$$

Unfortunately, L_MSE coupled with learnable affine transformations produces degenerate solutions in which the localization network of the STN learns to output identical parameters for the affine transformations. In fact, it is enough to impose θ1 = θ2 to minimize L_MSE. To prevent this pathological behaviour of the network, we enforce a prior on the parameters of the transformations. In particular, we encourage the localization network to produce reasonably scaled patches by minimizing |σ − σp|, where σp is a fixed prior. Now, in order to enable

the STN to yield different parameters θ1 ≠ θ2, we simply choose σp1 ≠ σp2. Hence, a loss is defined as follows:

$$\mathcal{L}_{P} = |\sigma_1 - \sigma_{p1}| + |\sigma_2 - \sigma_{p2}| \quad (6)$$

Finally, the proposed Reg-STN model is trained end-to-end minimizing the following joint loss function:

$$\mathcal{L}_{TOT} = \mathcal{L}_{SORD} + \mathcal{L}_{MSE} + \mathcal{L}_{P} \quad (7)$$
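A possible way to wire the three terms of Eq. (7) together, assuming the outputs of the Reg-STN sketch above and SORD targets built as in Eq. (3); function and argument names are hypothetical, and the default prior values are the ones reported in the training strategy below.

```python
import torch.nn.functional as F

def reg_stn_loss(logits1, logits2, scales, targets, sigma_p=(0.50, 0.75)):
    """Joint loss of Eq. (7): SORD cross-entropy + consistency + scale prior."""
    # Eq. (4): cross-entropy against the soft ordinal targets of Eq. (3).
    log_probs = F.log_softmax(logits1, dim=1)
    l_sord = -(targets * log_probs).sum(dim=1).mean()
    # Eq. (5): consistency between the predictions on the two crops.
    l_mse = F.mse_loss(logits1, logits2)
    # Eq. (6): keep the two predicted scales close to two distinct fixed priors.
    l_prior = (scales[0] - sigma_p[0]).abs().mean() + \
              (scales[1] - sigma_p[1]).abs().mean()
    return l_sord + l_mse + l_prior
```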

4) Training strategy: We split the ICLUS-DB dataset into a train and a test split. The test split comprises 80 videos from 11 patients, with a total of 10,709 frames. All the frames from the remaining videos are included in the train set. The split is performed at patient level, such that the sets of patients in the training and test set are disjoint. The STN is modeled by a ConvNet similar to [17]. Specifically, we removed the average pooling and the output layer and replaced them with two fully connected layers to predict the affine transformation parameters. The CNN architecture [17] is kept unchanged. The STN and CNN are jointly trained for 120 epochs using the Adam optimizer with an initial learning rate of 1e−4 and a batch size of 64. We also used similar data augmentation strategies and learning rate decay as suggested in [17], [18]. We set the values of the priors σp1 and σp2 to 0.50 and 0.75 respectively, leveraging the prior knowledge about LUS images that pathological artifacts roughly cover 25% to 50% of the image area.

B. Video-level score aggregation

1) Problem formulation and notation: The identification of

potentially pathological artifacts in LUS images is a crucial step towards diagnosis support. However, frame-based predictions should be turned into a single video-based score prediction in order to assess the pathological state of a patient. The video-based score aggregation problem can be formalized as follows. Let v = {x_i}_{i=1}^M be a video, V be the set of videos of any length, and S the set of scores. The goal of video-level score prediction is learning a mapping Ψ : V → S.

In principle, the mapping Ψ could be obtained by taking the maximum score assigned to any frame of the current video, because the identification of an artifact of score s in a frame implies that the patient has a severity level of at least s. This hard rule, however, is inapplicable in practice when dealing with machine-predicted scores, as even a single frame-based prediction error could harm the overall prediction. Thus, in this section we propose a more flexible aggregation mechanism devised for predicting the score associated to a video, leveraging the video-level annotations provided in the ICLUS-DB (Section III).

2) Model definition: In designing the model Ψ, we consider the fact that it needs to operate in a low-data regime, where few videos are provided with annotations, as in the current version of the ICLUS-DB. Inspired by the hard rule previously mentioned, we propose a simple strategy that combines frame-level predictions using a parameterized aggregation layer, i.e.:

$$\Psi(v) = \Psi_U\big(\Phi(x_1), \ldots, \Phi(x_M)\big) \quad (8)$$

Here Φ is the frame-level mapping and Ψ_U is an aggregation function based on uninorms [39], which are a principled way


to soften the hard rule. A uninorm U is a monotonically increasing, commutative and associative mapping from [0, 1] × [0, 1] to [0, 1] with neutral element e ∈ [0, 1]. This means that U(a, e) = U(e, a) = a for all a ∈ [0, 1]. If e = 1, U is fully non-compensatory (like taking the minimum between a and b), while it is fully compensatory if e = 0 (like taking the maximum). Choosing e ∈ (0, 1) allows the uninorm to have a hybrid behaviour. Note that, being associative, uninorms can be applied to an arbitrary number of inputs (e.g., U(a, b, c) = U(U(a, b), c)). Following [22], we learn the appropriate value for the neutral element e from data. Our aggregation layer takes as input the sequence of frame-based prediction scores Φ(x), aggregates them along each dimension/score using a uninorm U, and returns the softmax of the resulting aggregation as the video-based prediction. The layer has only four parameters, which are the neutral elements for each candidate score {0, 1, 2, 3}, and it is thus amenable to training with little supervision.

    Any uninorm with neutral element e can be written as [39]:

$$U_e(a, b) = \begin{cases} e\,T\!\left(\dfrac{a}{e}, \dfrac{b}{e}\right) & \text{if } a, b \in [0, e] \\[4pt] e + (1 - e)\,S\!\left(\dfrac{a - e}{1 - e}, \dfrac{b - e}{1 - e}\right) & \text{if } a, b \in [e, 1] \\[4pt] \hat{U}(a, b) & \text{otherwise} \end{cases} \quad (9)$$

for a certain choice of T, S and Û(a, b) such that min(a, b) ≤ Û(a, b) ≤ max(a, b). The functions T and S are called t-norm and t-conorm respectively, and model the non-compensatory and compensatory behaviour. Different choices for these functions lead to different uninorms. We found the product t-norm T(a, b) = ab (and the corresponding t-conorm S(a, b) = a + b − ab) to be the most effective choice, as it allows the gradient to flow the most. Concerning the function Û(a, b), common choices are min(a, b) and max(a, b), producing the so-called min-uninorms and max-uninorms respectively. We found min-uninorms to be the best choice in our setting (with respect to max(a, b) but also mean(a, b)), likely because of their fully non-compensatory behaviour in the area of highest discrepancy between frame-based predictions.
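A minimal sketch of such an aggregation layer, using the product t-norm/t-conorm and the min-uninorm in the mixed region as described above, with one learnable neutral element per score; the sigmoid re-parametrisation of e and the class name are our own choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class UninormAggregation(nn.Module):
    """Aggregates per-frame score probabilities with a min-uninorm (Eq. 9)."""

    def __init__(self, num_scores=4):
        super().__init__()
        # One learnable neutral element per score, kept in (0, 1) via sigmoid.
        self.e_logit = nn.Parameter(torch.zeros(num_scores))

    @staticmethod
    def _uninorm(a, b, e):
        both_low = (a <= e) & (b <= e)
        both_high = (a >= e) & (b >= e)
        t_norm = e * (a / e) * (b / e)                      # product t-norm
        u, v = (a - e) / (1 - e), (b - e) / (1 - e)
        s_conorm = e + (1 - e) * (u + v - u * v)            # product t-conorm
        mixed = torch.minimum(a, b)                          # min-uninorm region
        return torch.where(both_low, t_norm,
               torch.where(both_high, s_conorm, mixed))

    def forward(self, frame_probs):
        """frame_probs: (M, num_scores) softmax outputs of the frame model."""
        e = torch.sigmoid(self.e_logit)
        agg = frame_probs[0]
        for m in range(1, frame_probs.size(0)):              # associativity
            agg = self._uninorm(agg, frame_probs[m], e)
        return torch.softmax(agg, dim=0)                     # video-level scores
```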

3) Loss definition: The architecture is trained using the SORD loss described in Eq. (4), computed over the video-level prediction.

4) Training strategy: The frame-based predictor outputs prediction scores with a distribution that differs between the training and the test set. In order not to overfit the video-based predictor on the training score distribution, we completely separate the training sets of the frame-based and video-based predictors. We train the frame-based predictor on all video sequences T without any video-based annotation, and evaluate it on the remaining sequences T′. We then train and evaluate the video-based predictor on T′, using a k-fold cross validation procedure (k = 5) with splits made at the patient level (i.e. all videos from the same patient are in the same fold). We choose to use as video-level annotations the ones produced by the first annotator, the clinician with the highest expertise. We train our model using the Adam optimizer with learning rate 10^−2, without weight decay and with no learning rate scheduling. For each epoch, we compute the loss for each training video sequence and accumulate its gradients, performing a single optimization step at the end of each epoch. We train

the model for a maximum of 30 epochs and use the loss on the training set to define an early stopping strategy.

Note that the entire architecture, including the frame-level component, could be trained entirely end-to-end. However, this solution is not effective given the vast disproportion in the amount of supervision at the video and frame levels currently available in ICLUS-DB. We thus trained the aggregation layer after freezing the weights of the frame-based architecture. Full end-to-end training combining frame-based and video-based supervision will be investigated in future work.

C. Semantic Segmentation

1) Problem formulation and notation: Let X = R^{i×j} and Y

denote the input (i.e. the image space) and output (i.e. the segmentation masks) space, respectively. In the earlier presented frameworks for image- and video-based classification, the score set was defined as S = {0, 1, 2, 3}. For semantic segmentation we however distinguish five different scores, i.e. the four scores in S, complemented by the background (BG) score, assigned to pixels that were not annotated as showing markers associated with any of the classes in S. As such, Y = {0, 1, 2, 3, BG}^{i×j}.

2) Model definition: We are interested in learning a mapping Ω : X → Y, which given an input LUS image outputs the associated pathological segmentation mask. To model Ω, we compare several network architectures for end-to-end image segmentation, such as the vanilla U-Net [23], and the more recently proposed U-Net++ [40] and Deeplabv3+ [41].

Our baseline U-Net model has three encoding layer blocks, each comprising two convolutional layers with ReLU activations and one maxpool layer (pooling across 2, 2, and 5 pixels in both dimensions, respectively), a latent layer, and a mirrored decoder (where pooling is replaced by nearest neighbour upsampling). We use skip connections between each layer block of the encoder and decoder. To mitigate overfitting we apply dropout (p = 0.5) during training at the latent bottleneck of the model. The U-Net++ variant leverages the first four encoder blocks of the ResNet50 model [42] to construct a latent space. The latent space is upsampled in the decoder stage by means of transposed 2D convolutional layers. The decoder contains residual blocks, and also exploits skip connections between (same-sized) hidden layer outputs in the ResNet50 encoder and the decoder. The Deeplabv3+ model similarly employs an encoder-decoder structure, where features are extracted using spatial pyramid pooling (i.e. pooling at different grid scales) and atrous convolutions, resulting in decoded segmentation maps with detailed object boundaries.

3) Loss definition: We adopt a pixel-wise categorical cross-entropy loss between the segmentation masks g(y_n) and the model predictions ŷ_n = Ω(h(x_n)). Functions g(·) and h(·) are pre-processing transformations applied prior to training.

Function h(·) comprises the resizing of all acquired B-mode images to 260 × 200 pixels, preserving the original aspect ratio of the scans by appropriate zero padding, and subsequent normalization between -1 and 1.
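A sketch of what h(·) could look like; only the 260 × 200 target size and the [-1, 1] normalization come from the text, while the interpolation mode and the placement of the zero padding are assumptions.

```python
import torch
import torch.nn.functional as F

def preprocess(image, out_h=260, out_w=200):
    """Resize preserving aspect ratio, zero-pad to (out_h, out_w), map to [-1, 1].

    image: (1, H, W) float tensor with values in [0, 1].
    """
    _, h, w = image.shape
    scale = min(out_h / h, out_w / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                            mode="bilinear", align_corners=False).squeeze(0)
    pad_h, pad_w = out_h - new_h, out_w - new_w
    # Pad right/bottom with zeros (placement is an implementation choice).
    padded = F.pad(resized, (0, pad_w, 0, pad_h), value=0.0)
    return padded * 2.0 - 1.0
```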

4) Training strategy: Due to the larger (and more representative) set of pixel-level annotations for the convex probe compared to the linear probe acquisitions (1,005 and 426


Fig. 4: Examples of the image crops produced by the Reg-STN network. The first column shows input images acquired with linear and convex sensors, respectively. In the second column we report the heatmaps produced by GradCam [44] and the bounding boxes obtained by thresholding. In the remaining columns, the original image overlaid with bounding boxes and the two respective crops (in red and green) are shown for the cases in which the Reg-STN models: a) only translation and a fixed scaling; b) all possible transformations, viz. translation, scaling and rotation. In each case the Reg-STN focuses on the most salient parts, which contain the pathological artifacts.

annotations, respectively), we here specifically focus on the convex acquisitions. We split our dataset into a train (70%) and test set (30%) at the patient level, i.e. all videos and frames from one patient fall into a specific set. Among the 1,005 frames, a total of 1,158 imaging biomarkers were segmented.

During training, we are given a training set of N image-label pairs T = {(x_n, y_n)}_{n=1}^N, where x_n ∈ X and y_n ∈ Y. The model parameters are learned by back-propagating the earlier defined categorical cross-entropy using the Adam optimizer (default settings), with a learning rate of 10^−5. Training was stopped upon convergence of the training loss.

Each training batch consists of 32 B-mode images and their corresponding segmentation masks, which are balanced across patients and scores to avoid biases resulting from the length of the ultrasound scan (number of frames in a single video) and the population-level distribution of scores. While these biases generally aid the overall accuracy, they hamper patient-level decision making across demographics.

To promote invariance to common LUS image transformations and thereby improve generalization at inference, each image-label pair is heavily manipulated on-line during training by a set of augmentation functions, each activated on the image-label pair with a probability of 0.33. The set of augmentation functions, each applied with a randomly sampled strength bounded by a set maximum, consists of: affine transformations (translation (max. ±15%), rotation (max. ±15°), scaling (max. ±45%), and shearing (max. ±4.5°)), multiplication with a constant (max. ±45%), Gaussian blurring (σ_max = 3/4), contrast distortion (max. ±45%), horizontal flipping (p = 0.5), and additive white Gaussian noise (σ_max = 0.015).
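For illustration, an approximate version of this augmentation pipeline written with the Albumentations library (the paper does not state which library was used); the mapping of "multiplication with a constant" and "contrast distortion" onto RandomBrightnessContrast, as well as the noise and blur limits, are our approximations of the bounds listed above.

```python
import numpy as np
import albumentations as A

image = np.random.rand(260, 200).astype(np.float32)   # placeholder B-mode frame
mask = np.zeros((260, 200), dtype=np.uint8)            # placeholder label mask

# Each geometric/intensity augmentation fires with probability 0.33,
# horizontal flipping with probability 0.5, as in the training strategy.
augment = A.Compose([
    A.Affine(translate_percent=(-0.15, 0.15), rotate=(-15, 15),
             scale=(0.55, 1.45), shear=(-4.5, 4.5), p=0.33),
    A.RandomBrightnessContrast(brightness_limit=0.45, contrast_limit=0.45, p=0.33),
    A.GaussianBlur(sigma_limit=(0.0, 0.75), p=0.33),
    A.HorizontalFlip(p=0.5),
    A.GaussNoise(var_limit=(0.0, 0.015 ** 2), p=0.33),
])

augmented = augment(image=image, mask=mask)   # image-label pair transformed jointly
image_aug, mask_aug = augmented["image"], augmented["mask"]
```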

5) Inference: To further boost robustness and performance, we apply model ensembling and calculate the unweighted average over the predicted softmax outputs of the U-Net, U-Net++, and Deeplabv3+ models (all trained with data augmentation).

To allow for qualitative assessment of the uncertainty of the predictions, we produce pixel-level estimates of model uncertainty by using Monte-Carlo (MC) dropout [43]. During inference, we stochastically apply dropout in the latent space, yielding multiple point estimates of our class predictions. The amount of variation in the resulting predictions ultimately provides an indication of uncertainty for every pixel.
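A sketch of these two inference-time steps; `models`, `model` and the sample count are placeholders (40 MC samples are used for the uncertainty maps of Fig. 6), and the exact dropout placement in the released models may differ.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ensemble_predict(models, image):
    """Unweighted average of the softmax maps of several segmentation models."""
    probs = [torch.softmax(m(image), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)

@torch.no_grad()
def mc_dropout_predict(model, image, num_samples=40):
    """MC dropout: keep dropout active at inference and sample predictions.

    Returns the mean per-pixel class probabilities and their standard
    deviation, the latter serving as a pixel-wise uncertainty map.
    """
    model.eval()
    for module in model.modules():        # re-enable only the dropout layers
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()
    samples = torch.stack([torch.softmax(model(image), dim=1)
                           for _ in range(num_samples)])     # (T, B, C, H, W)
    return samples.mean(dim=0), samples.std(dim=0)
```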

    V. EXPERIMENTAL RESULTS

A. Frame-based score prediction

To evaluate the performance of our proposed frame-based

scoring method and its constituent components, we consider the following baselines: i) CNN trained with cross-entropy loss (CE), ii) CNN trained with SORD, iii) ResNet-18 trained with SORD, iv) STN-based CNN trained with SORD, v) CNN+Random Crop+SORD, a CNN trained with SORD on random crops rather than the bounding boxes extracted by the STN, and vi) our proposed Reg-STN model.

In Table I, we evaluate the performance of our method in terms of F1-score. Since the annotations in LUS images are quite subjective (see later), we also report results for two additional evaluation settings, referred to as Setting 2 and Setting 3, respectively. The settings are: i) Setting 1 considers the F1 score computed on the entire test set; ii) Setting 2 considers the F1 score computed on a modified version of the test set obtained by dropping, for each video, the K frames before and after each transition between two different ground truth scores, potentially removing ambiguous frames that present characteristics at the boundary between two classes, thereby allowing us to identify the impact of noisy labeling on the performance of the model; and iii) in Setting 3, we drop the most challenging videos by using the inter-doctor agreement between the 5 independent video-level annotations. In practice, we only keep in the test set the videos with at least A doctors agreeing on the video-level annotation. For completeness, we report under Setting 3 also the scores obtained on the complete portion of the test set containing video-level annotations (Video Ann.).

As shown in Table I, our proposed Reg-STN trained with SORD beats the baseline models in most of the settings and is the second best in the remaining ones. On average, Reg-STN performs the best among all baselines. This proves the effectiveness of our proposed method for frame-based prediction of pathology in LUS images. Our experiments were run on an RTX-2080 NVIDIA GPU. As for computational complexity, it takes ∼11 hours to train a CNN+Reg-STN+SORD model on this hardware.

B. Video-based score prediction

We evaluate video-based score prediction in terms of

weighted F1 score, Precision and Recall. These are obtained by first computing the metric for each score (zero to three), and


TABLE I: F1 scores (%) for the frame-based classification under different evaluation settings. Setting 1 represents evaluation on the full test set, Setting 2 represents the analysis on the test set with dropped transition frames, and Setting 3 represents the analysis accounting for inter-doctor agreement. The baseline for this setting is provided by the evaluation on the set of test sequences with video-level annotations (Video Ann.). Best and second best F1 scores (%) are in bold and underlined, respectively.

| Model | Setting 1 (Regular) | Setting 2 (K=1) | Setting 2 (K=3) | Setting 2 (K=5) | Setting 2 (K=7) | Setting 3 (Video Ann.) | Setting 3 (A=3) | Setting 3 (A=4) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| CNN+CE | 61.6 | 63.1 | 64.9 | 66.3 | 67.6 | 74.8 | 78.0 | 77.0 | 69.2 |
| CNN+SORD | 63.2 | 64.8 | 66.3 | 67.8 | 68.9 | 73.0 | 76.8 | 75.8 | 69.6 |
| Resnet-18+SORD | 62.2 | 63.9 | 65.5 | 66.9 | 67.8 | 74.5 | 77.4 | 76.4 | 69.3 |
| CNN+STN+SORD | 61.0 | 62.6 | 63.8 | 64.8 | 65.6 | 78.4 | 82.2 | 81.4 | 70.0 |
| CNN+Random Crop+SORD | 61.8 | 63.0 | 64.2 | 65.1 | 65.9 | 71.9 | 74.6 | 73.5 | 67.5 |
| CNN+Reg-STN+SORD (Ours) | 65.1 | 66.7 | 68.3 | 69.5 | 70.3 | 75.4 | 78.4 | 77.5 | 71.4 |

TABLE II: Mean and standard deviation of weighted F1 score, precision and recall computed over the five cross validation folds, for the proposed video-based classification method and baselines.

| Method | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| max argmax | 46 ± 21 | 55 ± 27 | 49 ± 18 |
| argmax mean | 51 ± 12 | 56 ± 19 | 53 ± 09 |
| uninorms | 61 ± 12 | 70 ± 19 | 60 ± 07 |

TABLE III: Confusion matrices (%) for the proposed video-based classification method and baselines. Rows are ground truth scores, columns are predicted scores.

max argmax:

| True \ Pred | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 7 | 9 | 7 | 3 |
| 1 | 0 | 3 | 12 | 5 |
| 2 | 0 | 0 | 17 | 9 |
| 3 | 0 | 0 | 5 | 22 |

argmax mean:

| True \ Pred | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 16 | 10 | 0 | 0 |
| 1 | 3 | 9 | 9 | 0 |
| 2 | 0 | 3 | 21 | 2 |
| 3 | 0 | 2 | 17 | 9 |

uninorms:

| True \ Pred | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 10 | 16 | 0 | 0 |
| 1 | 2 | 17 | 2 | 0 |
| 2 | 0 | 5 | 16 | 5 |
| 3 | 0 | 5 | 5 | 17 |

then computing the weighted average over scores, where the weight is the fraction of instances having that score. Note that weighted recall corresponds to (multiscore) accuracy, i.e., the fraction of correctly predicted scores over the total number of predictions. Table II reports averages and standard deviations of these metrics over the five folds of the cross validation procedure. We compare our video-level predictor with two standard aggregation methods, max argmax and argmax mean. The former implements the hard rule described in Section IV-B: it labels each frame with the most probable score according to the frame-level predictor, and takes the maximal score along the video. The latter averages frame-level predictions over the video and returns the score with the maximal average. The proposed method outperforms both baselines in terms of F1-score, precision and recall. Table III shows confusion matrices for the three methods, obtained by concatenating the predictions over all folds. As expected, the max argmax hard rule is strongly biased towards predicting the highest score, resulting in bad performance on all other scores. On the other hand, the argmax mean baseline has the best performance in predicting score zero, but performs poorly on the other scores (under-predicting scores one and three and over-predicting score two). The uninorm-based aggregation is more balanced, outperforming each of the baselines on three out of four scores.
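For reference, the two baseline aggregation rules compared in Tables II and III can be written in a few lines (a sketch; `frame_probs` is assumed to be the M × 4 matrix of frame-level softmax outputs for one video):

```python
import torch

def max_argmax(frame_probs):
    """Label each frame with its most probable score, take the video maximum."""
    return frame_probs.argmax(dim=1).max()

def argmax_mean(frame_probs):
    """Average the frame-level probabilities, return the most probable score."""
    return frame_probs.mean(dim=0).argmax()
```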

    C. Semantic Segmentation

Fig. 5 shows several illustrative examples of semantic segmentation results of our ensemble network, along with their ground-truth annotations. A quantitative assessment and comparison of segmentation performance for the U-Net, U-Net++, Deeplabv3+, and ensemble models are provided in Table IV.

We observe that using on-line augmentation of images and annotations in combination with model ensembling yields a strong performance gain over a baseline U-Net, increasing the Dice coefficient from 0.64 to 0.75 for the union of COVID-19 markers. The ensemble model yields a categorical Dice score of 0.65 (mean across the segmentations for scores 0, 2 and 3). This metric was 0.47 for our baseline U-Net.

In Fig. 6 we provide a visualization of uncertainty in the predicted segmentations for two example images by plotting the pixel-wise standard deviation yielded by MC dropout across 40 samples. Arrows in (A) indicate a region displaying COVID-19 markers for which ambiguity in the exact shape and extent is well reflected in the pixel-level uncertainty. Arrows in (B) indicate a seemingly false-positive region which was assessed as a high-grade COVID-19 marker by the deep network, but was not annotated as such. Interestingly, the network output was retrospectively judged as a true positive by the annotators, showing an area of hyperechogenic lung below the pleural surface [12], which characterizes high permeability and an advanced disease state.
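A minimal sketch of this uncertainty estimate is given below: dropout is kept active at inference time and the per-class softmax maps of repeated stochastic forward passes are stacked and summarised, following [43]. The function name and the use of model.train() to keep dropout stochastic are our own illustrative choices, not code from the paper.

```python
import torch

def mc_dropout_uncertainty(model, image, n_samples=40):
    """Pixel-wise predictive mean and standard deviation via MC dropout."""
    model.train()  # keeps dropout stochastic; note this also affects BatchNorm layers
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)
```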

    VI. DISCUSSION AND CONCLUSIONS

    A. Frame-based score prediction evaluation

In Table I we ablate the contribution of the building blocks of our model for frame-based prediction. Replacing the traditional cross-entropy (CE) with the SORD loss for ordinal regression clearly improves the performance. On the other hand, we found that adding the STN alone leads to a drop in the F1-score, owing to the additional trainable parameters (as many as the CNN) introduced by the STN and the absence of regularization. However, the STN comes with two positive side effects: (i) it provides weakly supervised localizations without using fine-grained supervision; and (ii) it enables the use of consistency-based regularization, which is very beneficial in terms of performance. Our full model, which embeds the STN module, the SORD loss and the proposed consistency loss, achieves an F1-score of 65.1, outperforming all the baselines by a large margin. To further investigate whether the boost comes from the consistency term or from the STN, we conducted an experiment using two sufficiently overlapping random crops and enforced the consistency loss between the two. Unsurprisingly, the F1-score for CNN+Random Crop+SORD stays well below that of our proposed method. We hypothesize that the consistency loss is only useful when the crops cover the area of the artefact.
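For readers unfamiliar with the SORD loss [21], the sketch below illustrates the idea: the hard one-hot label is replaced by a soft target whose mass decays with the distance between each candidate score and the annotated one, and the usual cross-entropy is taken against this soft target. The squared-distance penalty and the scale alpha are assumptions for illustration, not necessarily the exact choices used in the paper.

```python
import torch
import torch.nn.functional as F

def sord_targets(scores, num_classes=4, alpha=1.0):
    """Soft ordinal targets in the spirit of SORD [21]: probability mass
    decays with the squared distance from the annotated score."""
    ranks = torch.arange(num_classes, dtype=torch.float32, device=scores.device)
    dist = (ranks.unsqueeze(0) - scores.float().unsqueeze(1)) ** 2
    return torch.softmax(-alpha * dist, dim=1)

def sord_loss(logits, scores):
    """Cross-entropy between the predicted distribution and the soft targets."""
    log_p = F.log_softmax(logits, dim=1)
    q = sord_targets(scores, num_classes=logits.size(1))
    return -(q * log_p).sum(dim=1).mean()
```

Because neighbouring scores receive non-zero target mass, predictions that are off by one ordinal step are penalised less than gross errors, which matches the ordinal nature of the severity scale.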


Fig. 5: Four examples (A–D) of B-mode input image frames (first column), their annotations (second column) including COVID-19 biomarkers (moderate / score 2: orange; severe / score 3: red) and signs of healthy lung (blue). The corresponding semantic segmentations and contours of COVID-19 markers produced by deep learning are given in the third and fourth column, respectively.

In contrast to previous work [18], we found that the use of more complex architectures such as ResNet18 does not bring any improvement in performance. We attribute this to the low intrinsic complexity of the task. Conversely, we suggest that most of the confusion of the model is caused by the noise in both frames and labels. In turn, we believe that this noisiness is due to the subjectivity of the annotation and the presence of ambiguous frames. In fact, frame labels do not take into account that multiple artifacts can be present at a time. This happens mostly when the sensor is moving, causing transitions from one score to another. In order to highlight the concentration of the errors of our models around transitions, we devised the experimental Setting 2 shown in Table I, in which we drop frames close to transition points. The results in Table I show that removing ambiguous frames from the test set dramatically reduces the number of errors of the model, regardless of the architecture, empirically validating our hypothesis about noisy labeling.

In Table I we also measured how the subjectivity of the annotated scores affects the performance of the model in Setting 3, and found that when there is strong agreement among doctors (more than two doctors agree on a score) our network performs notably better, increasing the F1-score by almost 3 points. This suggests that some videos are intrinsically more ambiguous than others. In addition, we found that, on this matter, the network seems to behave similarly to human annotators, which is a desirable property. Moreover, although it may seem counter-intuitive, our experiments indicate that the performance of the model does not change much beyond a certain degree of agreement between doctors (A = 3 vs. A = 4). This is probably caused by the fact that imposing stronger agreement makes the test set smaller, yielding less statistically significant results.

Finally, we visualize the crops yielded by the STN in Fig. 4. We considered two kinds of affine transformations modeled by the Reg-STN in our experiments: (i) learnable translation with fixed scaling; and (ii) learnable translation, scaling and rotation. We obtain an F1-score of 65.9 when the STN models a learnable translation with fixed scaling. In both cases the STN produces highly localized crops that are mostly centred on the area of the pathological artifact. Interestingly, for both convex and linear sensor acquisitions, the Reg-STN learns to ignore the area above the pleura, which is essentially irrelevant for the prediction of a frame. This validates the usefulness of incorporating STN blocks in our frame-based predictor. We also report the heatmaps produced by GradCam [44] for the same images. Qualitatively, GradCam does not always focus on the relevant areas of the image. For example, for the linear probe image displayed in the figure, attention is given to the intercostal tissue layers and not to the areas below the pleural line, which are the areas of interest for the analysis of LUS data. We also noticed that the quality of the heatmaps deteriorates when the prediction of the network is incorrect. Moreover, we found it hard to produce reasonable boxes from the heatmaps produced by GradCam, since doing so requires thresholding. For these reasons, we believe that the STN produces superior localizations.
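To illustrate the "learnable translation with fixed scaling" variant, the sketch below builds the corresponding affine grid in the standard spatial-transformer fashion [19]: a small localisation network predicts only a crop centre, while the zoom factor is held constant. The localisation network layout and the scale value are assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationSTN(nn.Module):
    """Illustrative spatial transformer predicting only a translation,
    with a fixed crop scale (no rotation or shear)."""

    def __init__(self, in_channels=1, scale=0.5):
        super().__init__()
        self.scale = scale
        self.loc_net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 2), nn.Tanh(),  # (tx, ty) in [-1, 1]
        )

    def forward(self, x):
        t = self.loc_net(x)                          # per-image crop centre offset
        theta = torch.zeros(x.size(0), 2, 3, device=x.device)
        theta[:, 0, 0] = self.scale                  # fixed zoom on both axes
        theta[:, 1, 1] = self.scale
        theta[:, :, 2] = t                           # learnable translation
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

Constraining the transform in this way limits the extra parameters to a small localisation head and keeps the crop interpretable as a zoomed-in window on the artefact.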


Fig. 6: Two examples (A, B) of class uncertainty in the segmentations, showing B-mode input image frames (first column), annotations (second column) including COVID-19 biomarkers (moderate / score 2: orange; severe / score 3: red), the corresponding semantic segmentations by deep learning (third column), and pixel-level COVID-19 class uncertainty by MC dropout (fourth column).

    B. Video-based score prediction evaluation

When trained on the annotations by the most expert clinician, video-based classification achieves an F1-score of 61%, a precision of 70% and a recall of 60%. Notably, these values are in line with the low inter-annotator agreement reported in Section III, which, together with the small number of samples with video-level annotations, can explain the high variance of the scores across folds. We expect that extending our relatively small set of video-level annotations will help counteract the labeling noise, increase the model performance and reduce its variance.

    C. Segmentation evaluation

Our segmentation model is able to segment and discriminate between areas in B-mode LUS images that contain background, healthy markers and (different stages of) COVID-19 biomarkers at a pixel level, reaching a pixel-wise accuracy of 96% and a binary Dice score of 0.75. Alongside these segmentations, we provide spatial uncertainty estimates that may be used to interpret model predictions.

Interestingly, and importantly, none of the highest (and most severe) score annotations in the test set were missed by our model, as judged by visual assessment of the resulting segmentations and by analysing the relative image-level intersections between the corresponding predicted and annotated regions. Moreover, we observed model predictions of COVID-19-positive regions that had, however, not been annotated as such. Fig. 6B shows a representative example of such a case. After re-evaluating some of these examples from the test set together with the annotators, we learned that the annotators were sometimes unsure whether to annotate a region as, e.g., score 2 or 3, and therefore decided that the marker was not clear enough to annotate the region at all, leading to the aforementioned discrepancy.

Segmentation performance and extraction of semantics could be further boosted by leveraging the temporal structure among frames in a sequential model. Such models could learn from annotations across full videos, or through partial annotations and weak supervision. We leave these extensions of the present method to future work.

TABLE IV: Segmentation performance in terms of the mean categorical accuracy across all pixels and scores (Acc.), the Dice coefficient for the union of COVID-19-related scores (Dice), and the mean Dice across scores 0, 2, and 3 (Cat. Dice). Score 1 was excluded due to the low number of annotations.

Network and training strategy              | Acc. | Dice | Cat. Dice
-------------------------------------------|------|------|----------
(1) U-Net                                  | 0.94 | 0.64 | 0.47
(2) U-Net with on-line augmentation        | 0.95 | 0.69 | 0.55
(3) U-Net++ with on-line augmentation      | 0.97 | 0.72 | 0.64
(4) Deeplab v3+ with on-line augmentation  | 0.95 | 0.71 | 0.62
Ensemble of (2, 3, 4)                      | 0.96 | 0.75 | 0.65
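As a minimal sketch of how such an ensemble of networks (2), (3) and (4) can be combined at the pixel level, one common rule is to average the per-model softmax maps and take the per-pixel argmax; whether the paper uses this exact rule is an assumption on our part.

```python
import torch

def ensemble_segmentation(models, image):
    """Probability-averaging ensemble: average the per-pixel softmax maps
    of the individual networks and take the per-pixel argmax."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(image), dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)  # (N, H, W) predicted score map
```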

D. Limitations of the dataset

In order to unravel the specific characteristics of this disease, researchers needed to gather as much data from patients as possible. However, due to the enormous impact and rapid spread of the infection, gathering data in an organized manner proved a challenge. As a result, the precise demographics of the patient group in our database remain unknown.

Ideally, the dataset should be larger, more heterogeneous, and more balanced in terms of scores in order to be used for learning accurate deep models. In our case, the data were collected in a limited set of hospitals, all of them located in Italy. Furthermore, the way the data were collected is prone to certain biases: e.g., due to a high patient inflow, the most severe patients were prioritized and assessed, and ultrasound diagnosis was performed on patients with a high clinical suspicion. No subsequent testing was done, resulting in the possible inclusion of false positive cases.

Labels in the ICLUS-DB turned out to be noisy. Furthermore, for the frame-based classification and segmentation tasks the inter-operator agreement was not available. The noise can be indirectly observed in Table I, where, using only a selection of test samples, performance improves by almost 5%. Extending the database to obtain frame-level labels from multiple annotators would surely lead to more robust models. Finally, the included LUS videos with score 0 are all of healthy patients, and therefore we by no means claim to distinguish between COVID-19 patients and those with different pathologies.

E. Possible applications

A benefit of using ultrasound is the low risk of cross-infection when using a plastic disposable cover and individually packaged ultrasound gel on a portable handheld machine [45]. This is in contrast with the use of CT, for which rooms and systems need to be rigorously cleaned to prevent contamination (and which are preferably reserved for patients with a high COVID-19 suspicion). LUS can be performed inside the patient's room without the need for transportation, making it a superior method for point-of-care assessment of patients.

Moreover, ultrasound renders real-time images and, combined with our DL methods, provides results instantly. It may also directly assist in the triage of patients: a first-look estimation of the disease's severity and of the urgency with which a patient needs to be addressed. In addition, low- and middle-income countries, where diagnosis through RT-PCR or CT may not always be available, can particularly benefit from low-cost ultrasound imaging [46]. However, a lack of training in the interpretation of these LUS images [47] could still limit its use in practice. Our proposed DL method may therefore facilitate ultrasound imaging in these countries.

REFERENCES

[1] WHO, "Laboratory testing strategy recommendations for COVID-19: Interim guidance," Tech. Rep., 2020. [Online]. Available: https://apps.who.int/iris/bitstream/handle/10665/331509/WHO-COVID-19-lab_testing-2020.1-eng.pdf
[2] R. Niehus, P. M. D. Salazar, A. Taylor, and M. Lipsitch, "Quantifying bias of COVID-19 prevalence and severity estimates in Wuhan, China that depend on reported cases in international travelers," medRxiv, p. 2020.02.13.20022707, Feb. 2020.
[3] Y. Yang et al., "Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections," medRxiv, p. 2020.02.11.20021493, Feb. 2020.
[4] S. Salehi, A. Abedi, S. Balakrishnan, and A. Gholamrezanezhad, "Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients," Am J Roentgenol, pp. 1–7, Mar. 2020.
[5] A. Bernheim et al., "Chest CT Findings in Coronavirus Disease-19 (COVID-19): Relationship to Duration of Infection," Radiology, p. 200463, Feb. 2020. [Online]. Available: http://pubs.rsna.org/doi/10.1148/radiol.2020200463
[6] F. Mojoli, B. Bouhemad, S. Mongodi, and D. Lichtenstein, "Lung ultrasound for critically ill patients," pp. 701–714, Mar. 2019.
[7] R. Raheja, M. Brahmavar, D. Joshi, and D. Raman, "Application of Lung Ultrasound in Critical Care Setting: A Review," Cureus, vol. 11, no. 7, Jul. 2019.
[8] Y. Amatya, J. Rupp, F. M. Russell, J. Saunders, B. Bales, and D. R. House, "Diagnostic use of lung ultrasound compared to chest radiograph for suspected pneumonia in a resource-limited setting," International Journal of Emergency Medicine, vol. 11, no. 1, Dec. 2018.
[9] E. Poggiali et al., "Can Lung US Help Critical Care Clinicians in the Early Diagnosis of Novel Coronavirus (COVID-19) Pneumonia?" Radiology, p. 200847, Mar. 2020.
[10] Q. Y. Peng et al., "Findings of lung ultrasonography of novel coronavirus pneumonia during the 2019–2020 epidemic," Intensive Care Medicine, no. 87, pp. 6–7, Mar. 2020.
[11] G. Soldati et al., "Is there a role for lung ultrasound during the COVID-19 pandemic?" J Ultrasound Med, 2020.
[12] ——, "Proposal for international standardization of the use of lung ultrasound for COVID-19 patients; a simple, quantitative, reproducible method," J. Ultrasound Med., 2020.
[13] K. Stefanidis et al., "Lung sonography and recruitment in patients with early acute respiratory distress syndrome: A pilot study," Critical Care, vol. 15, no. 4, p. R185, Aug. 2011.
[14] K. A. Stewart et al., "Trends in Ultrasound Use in Low and Middle Income Countries: A Systematic Review," International Journal of MCH and AIDS, vol. 9, no. 1, pp. 103–120, 2020.
[15] L. Tutino, G. Cianchi, F. Barbani, S. Batacchi, R. Cammelli, and A. Peris, "Time needed to achieve completeness and accuracy in bedside lung ultrasound reporting in Intensive Care Unit," SJTREM, vol. 18, no. 1, p. 44, Aug. 2010.
[16] R. J. van Sloun, R. Cohen, and Y. C. Eldar, "Deep learning in ultrasound imaging," Proceedings of the IEEE, vol. 108, no. 1, pp. 11–29, Jul. 2019. [Online]. Available: http://arxiv.org/abs/1907.02994
[17] R. J. van Sloun and L. Demi, "Localizing B-lines in lung ultrasonography by weakly-supervised deep learning, in-vivo results," J-BHI, 2019.
[18] G. Soldati et al., "Towards computer aided lung ultrasound imaging for the management of patients affected by COVID-19," under submission.
[19] M. Jaderberg et al., "Spatial transformer networks," in NIPS, 2015.
[20] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci, "Unsupervised domain adaptation using feature-whitening and consensus loss," in CVPR, 2019.
[21] R. Diaz and A. Marathe, "Soft labels for ordinal regression," in CVPR, 2019.
[22] V. Melnikov and E. Hüllermeier, "Learning to aggregate using uninorms," in ECML, 2016.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in MICCAI, 2015.
[24] P. Rajpurkar et al., "Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning," arXiv preprint arXiv:1711.05225, 2017.
[25] D. Dong, Z. Tang, S. Wang, H. Hui, L. Gong, Y. Lu, Z. Xue, H. Liao, F. Chen, F. Yang et al., "The role of imaging in the detection and management of COVID-19: a review," IEEE Reviews in Biomedical Engineering, 2020.
[26] F. Shi, J. Wang, J. Shi, Z. Wu, Q. Wang, Z. Tang, K. He, Y. Shi, and D. Shen, "Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19," IEEE Reviews in Biomedical Engineering, 2020.
[27] J. Chen et al., "Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study," medRxiv, 2020.
[28] S. Wang et al., "A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)," medRxiv, 2020.
[29] X. Xu et al., "Deep learning system to screen coronavirus disease 2019 pneumonia," arXiv preprint arXiv:2002.09334, 2020.
[30] S. Liu et al., "Deep learning in medical ultrasound analysis: a review," Engineering, 2019.
[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in CVPR, 2016.
[32] G. Soldati et al., "Simple, Safe, Same: Lung Ultrasound for COVID-19 (LUSCOVID19)," ClinicalTrials.gov Identifier: NCT04322487, 2020.
[33] G. Soldati, M. Demi, R. Inchingolo, A. Smargiassi, and L. Demi, "On the physical basis of pulmonary sonographic interstitial syndrome," J Ultrasound Med, vol. 35, no. 10, pp. 2075–2086, Oct. 2016.
[34] K. Wada, "labelme: Image Polygonal Annotation with Python," https://github.com/wkentaro/labelme, 2016.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[36] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in NIPS, 2016.
[37] C. Winship and R. D. Mare, "Regression models with ordinal variables," American Sociological Review, pp. 512–525, 1984.
[38] K. Crammer and Y. Singer, "Pranking with ranking," in NIPS, 2002.
[39] R. R. Yager and A. Rybalov, "Uninorm aggregation operators," Fuzzy Sets Syst., vol. 80, no. 1, pp. 111–120, May 1996.
[40] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "Unet++: A nested U-net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[41] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[43] Y. Gal and Z. Ghahramani, "Dropout as a bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016.
[44] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[45] J. C.-H. Cheung and K. N. Lam, "POCUS in COVID-19: pearls and pitfalls," Tech. Rep. 0, Apr. 2020.
[46] S. Sippel, K. Muruganandan, A. Levine, and S. Shah, "Review article: Use of ultrasound in the developing world," International Journal of Emergency Medicine, vol. 4, no. 1, p. 72, Dec. 2011.
[47] S. Shah, B. A. Bellows, A. A. Adedipe, J. E. Totten, B. H. Backlund, and D. Sajed, "Perceived barriers in the use of ultrasound in developing countries," Critical Ultrasound Journal, vol. 7, no. 1, Dec. 2015.
