
Cross-view Correspondence Reasoning based on Bipartite Graph Convolutional Network for Mammogram Mass Detection

Yuhang Liu1, Fandong Zhang2, Qianyi Zhang1, Siwen Wang1, Yizhou Wang3, Yizhou Yu1,∗

1 Deepwise AI Lab

2 Center for Data Science, Peking University

3 Center on Frontiers of Computing Studies, Dept. of Computer Science & Technology, Advanced Institute of Information Technology, Peking University

{liuyuhang, zhangqianyi, wangsiwen, yuyizhou}@deepwise.com

{fd.zhang, yizhou.wang}@pku.edu.cn

Abstract

Mammogram mass detection is of great clinical significance due to the high proportion of breast cancers. The information from cross views (i.e., mediolateral-oblique and cranio-caudal) is highly related and complementary, and helps in making comprehensive decisions. However, unlike radiologists, who can reason about masses across cross-view images, most existing methods lack the ability to reason under the guidance of domain knowledge, which limits their performance. In this paper, we introduce the bipartite graph convolutional network to endow existing methods with the cross-view reasoning ability of radiologists in mammogram mass detection. The two bipartite node sets are constructed from the two cross-view images respectively and represent relatively consistent regions in breasts, while the bipartite edges learn to model both inherent cross-view geometric constraints and appearance similarities between correspondences. Based on the bipartite graph, information propagates methodically through correspondences and equips spatial visual features with customized cross-view reasoning ability. Experimental results on the DDSM dataset demonstrate that the proposed algorithm achieves state-of-the-art performance. In addition, visual analysis shows that the model has a clear physical meaning, which is helpful to radiologists in clinical interpretation.

1. Introduction

Breast cancer continues to have the highest incidence and mortality rates among women worldwide [49]. Screening mammography has been proved to effectively reduce breast cancer mortality [48]. A mass is one of the most important signs of breast cancer. However, mammogram mass detection is challenging for both radiologists and computer-aided detection (CAD) systems, since masses can be partially obscured by high-intensity compacted glands, especially in dense breasts. In clinical practice, cross-view images provide related and complementary information [46] and help to make comprehensive decisions. As shown in Figure 1, the two views are the cranio-caudal (CC) view, which is a top-down view of the breast, and the mediolateral oblique (MLO) view, which is a side view of the breast taken at a certain angle.

∗Corresponding author.

To exploit the relations between cross-view mammogram images, an intuitive idea is to adopt the approaches of [21, 54] to model inter-image non-local relations. For example, CVR-RCNN [32] adds a relation module to the second stage of Faster RCNN [42] to learn inter-proposal relations. However, unlike the reasoning of radiologists, which is guided by domain knowledge, the relation learning is constrained only implicitly and in an uncontrolled way, and cross-view geometric constraints and semantic relations are not explicitly considered. Thus, the learned relations may be incorrect. Besides, the relation module relies on the quality of the stage-one proposals. It may fail under severe gland occlusion, which also leads to poor performance.

We argue that the key issue is how to endow current detection methods with the power of reasoning. When identifying masses, radiologists follow an explicit reasoning procedure. They first extract suspicious regions in the examined image, and then search the auxiliary view for regions with compatible locations and appearances. If reasonable correspondences are found, the regions in both views are more likely to be masses, and vice versa. Therefore, a cross-view region-based reasoning procedure is helpful for mass detection.

Motivated by the above observations, in this paper we introduce a novel Bipartite Graph convolutional Network (BGN) to provide reasoning ability in mammogram mass detection. BGN can be embedded into any object detection framework [42, 18, 59]. It takes backbone image features as input and outputs cross-view enhanced features. To model the cross-view region-based reasoning procedure, the bipartite graph nodes are constructed from the two cross-view images respectively, and each node represents a relatively consistent region in breasts. The graph edges are designed to jointly model inherent geometric constraints and appearance similarities between cross-view nodes. Therefore, edges exist only between nodes from different views, which leads to a bipartite graph structure. After several layers of bipartite graph convolution, the node features are enhanced through correspondences, which equips the spatial visual features with cross-view reasoning ability. Unlike existing methods that use no or only weak cross-view constraints, the proposed model learns stronger customized cross-view reasoning with both geometric and semantic correspondences. Moreover, instead of applying the module after the proposal stage, the proposed graph module enhances backbone features before the proposal stage. Therefore, it suffers less from the proposal-missing problem.

Figure 1. Relations between CC and MLO views. Figures (a)-(b) show the CC and MLO views of the breast. The line on the right side of figure (b) corresponds to the projected pectoral muscle plane. Figure (c) shows an ideal projection model. The CC view is a top-down view taken along the pectoral muscle plane, while the MLO view is a side view taken at a certain angle.

Experimental results on both a public dataset DDSM

[20] and an in-house dataset demonstrate that the proposed

algorithm achieves state-of-the-art performance. Besides,

visual analysis shows the model has a clear physical mean-

ing, which is helpful for clinical interpretation.

Our contributions are mainly two-fold. First, we propose a novel mammogram mass detection framework that effectively exploits cross-view information and visual correspondences. Second, we build a bipartite graph convolutional network capable of reasoning about cross-view correspondences and of modeling both geometric constraints and visual similarities across views.

2. Related Work

Mammogram Mass Detection. Mammogram mass detection has been studied for several years. Traditional methods use handcrafted features to represent masses and design complex classifiers for identification [37, 52, 11]. However, these methods are limited by their lack of representation ability and end-to-end training ability. In recent years, deep learning has made great progress in medical image computing [58, 17, 10, 61]. Modern object detection networks have been applied to enhance mass detection performance [43, 24, 30, 55, 31, 5]. However, these methods do not consider cross-view mammogram images, which contain related and complementary information. Ma et al. [32] attempt to model the cross-view property and integrate a relation module [21] into Faster RCNN [42] to learn cross-view inter-proposal relations. However, the relation learning is implicit and uncontrolled, and lacks reasoning ability under the guidance of domain knowledge, which limits the performance. Besides, severe gland occlusion may lead to the proposal-missing problem, which also causes poor performance. BGN provides reasoning ability with domain knowledge, which helps to learn stronger customized feature enhancement.

Visual Reasoning based on Graph Convolutional Network. Visual reasoning aims to combine different information or interactions between objects or things, and has been applied to many computer vision tasks, such as classification [1, 34], object detection [8, 22], segmentation [28] and so on [27, 35, 15, 33, 41]. Recently, researchers have attempted to introduce graph convolutional networks [62] for visual reasoning [56]. Li et al. [28] propose graph convolutional units to learn graph visual representations from 2D data. However, information propagates through all semantic correspondences, which introduces noise into the learned representation. Meanwhile, the reasoning procedure is implicit and uncontrolled, which limits both performance and interpretability. Gao et al. [16] enhance target features by introducing a spatial-temporal graph in the visual tracking task. However, the nodes are represented by uniform grids, which are sensitive to object scales, image shapes, geometric distortions, etc. Noticing crucial semantic dependencies among objects, Xu et al. [57] attempt to reason with a class-to-class prior graph. However, the graph remains fixed and is hard to adapt to all cases. Besides, it cannot be applied to single-class detection tasks (e.g., mass detection).

Multi-view Visual Recognition. Understanding and representing 3D objects is a fundamental problem in visual recognition [39, 14] and stereo vision [25, 47, 6, 7, 40]. Multi-view based approaches render the 3D object from multiple views, and deploy image-based classifiers on the individual view images [50, 23]. Feng et al. [14] propose a group-view convolutional network to model hierarchical correlation from multiple views. Yang et al. [60] learn to reinforce the information by exploiting region-level and view-level relations. Different from multi-view based approaches, mammography cross-view images have more explicit correspondences, which helps to design stronger customized reasoning mechanisms. Explicit correspondences also exist in stereo vision [47, 6, 7, 40], which matches key points in general scenes as correspondences using calibrated cameras. However, different from stereo vision, we cannot obtain exactly matched correspondences due to standard mammography screening mechanisms [46]. The key challenge is to utilize fuzzy correspondences for feature enhancement.

Figure 2. The pipeline of the proposed BGN. BGN takes cross-view backbone features as inputs, and outputs enhanced features for further prediction. First, bipartite graph nodes are constructed by mapping spatial visual features with pseudo landmarks. Each mapping cell is a representative region for each graph node. Then, the bipartite graph edge learns to model both geometric constraints and semantic similarities. Next, correspondence reasoning enhancement is conducted for feature enhancement by propagating information on the bipartite graph. Finally, the enhanced features are aggregated with original features for further prediction.

3. Methodology

3.1. Overview

The proposed BGN aims to endow the mammogram detection framework with cross-view correspondence reasoning ability. BGN is stacked on the backbone to enhance feature representations, and can be integrated into any modern detection framework. As illustrated in Figure 2, there are three major steps. First, to model the region-based reasoning procedure, bipartite graph nodes are mapped from cross-view backbone visual features, where each node denotes the representation of a relatively consistent region in breasts. Then, bipartite graph edges are designed to model both cross-view geometric constraints and appearance similarities between bipartite graph nodes. Finally, correspondence reasoning enhancement based on the pre-defined bipartite graph is designed to enhance feature representations. After information propagation through the nodes, the node representations are mapped back to the spatial visual domain, which makes the enhanced spatial features aware of cross-view correspondences. The enhanced features and the original backbone features are then fused for the subsequent proposal stage.

Formally, we are given a pair of 2D feature maps $F^e, F^a \in \mathbb{R}^{HW \times C}$ extracted from the examined view (where detection is performed) and its auxiliary view (the other view), where $e, a \in \{CC, MLO\}$ indicate the view types, and $H$, $W$ and $C$ represent the height, width and number of channels of the feature map. Note that either the CC or the MLO view can be treated as the examined view. As formulated in Equation 1, BGN learns a function $f$, parameterized by the bipartite graph $G = (\mathcal{V}, \mathcal{U}, \mathcal{E})$ with node sets $\mathcal{V}, \mathcal{U}$ and edges $\mathcal{E}$. $\mathcal{V}$ and $\mathcal{U}$ denote the nodes from the CC and MLO views respectively, and each edge in $\mathcal{E}$ connects a node in $\mathcal{V}$ to one in $\mathcal{U}$.

$$Y = f(F^e, F^a; G) \tag{1}$$

3.2. Bipartite Graph Node

Bipartite graph node is designed to represent region-

level correspondences in breasts. There are two issues: (1)

Where to locate? (2) What to represent?


Figure 3. Illustration of pseudo landmarks and bipartite graph node mapping. Panels: (a) CC view, (b) MLO view, (c) mapping cell. (a)-(b) draw the pseudo landmarks and the matched bounding boxes on the CC and MLO views respectively. (c) illustrates how bipartite node mapping works when k = 1. Each mapping cell denotes the representative region of the node in the CC view.

To solve the first issue, we define pseudo landmarks that preserve relatively consistent locations in breasts. To solve the second, bipartite graph node mapping produces node representations from spatial visual features. We describe the details in the following parts.

3.2.1 Pseudo Landmarks

Landmarks are points in a shape object in which correspon-

dences between and within the populations of the object

are preserved [12]. However, there are no specialized land-

marks for breasts. We have to define pseudo landmarks ac-

cording to prior knowledge.

The pseudo landmarks should satisfy the following three

properties: I. Each pseudo landmark represents a relatively

consistent region in breasts; II. Different pseudo landmarks

represent different regions; III. The union of all pseudo

landmarks covers the breast region completely.

An intuitive idea is to treat uniform grids of the image as landmarks. However, property I is then not satisfied, which makes the landmarks sensitive to image scale, geometric distortions, etc.

As illustrated in Figure 1, the design of the pseudo land-

mark embedding method is based on a basic observation:

CC and MLO views of standard mammography screening

have natural geometric correspondences. Ideally, a point

in CC view approximately corresponds to a line parallel to

projected pectoral muscle plane in MLO view.

To embed pseudo landmarks as shown in Figure 3, equidistant parallel lines are first inserted between the nipple and the pectoral muscle line (the projection of the pectoral muscle plane). Each parallel line intersects the breast contour at two points, and we insert points uniformly between these two intersection points. Finally, all the points are re-ordered based on the intersections and defined as pseudo landmarks. In particular, since the MLO view additionally contains pectoral muscle areas, a similar method is applied to define pseudo landmarks in those areas. With these processes, we obtain a set of pseudo landmarks for each view.

3.2.2 Bipartite Graph Node Mapping

Bipartite graph node mapping aims to project the spatial visual features ($F^{CC}$, $F^{MLO}$) to the node domain, parameterized by matrices $X^{CC} \in \mathbb{R}^{|\mathcal{V}| \times C}$ and $X^{MLO} \in \mathbb{R}^{|\mathcal{U}| \times C}$. The features at a node are the region-level features of the region corresponding to the node.

The node mapping reveals the relation between graph nodes and all the pixels. As formulated in the following equations, we design a kNN (k Nearest Neighbor) forward mapping $\phi_k$ with its auxiliary matrix $A$ to obtain node visual representations. Each node corresponds to an irregular region such that, for any pixel in the region, the node is one of its $k$ nearest nodes. $\phi_k$ performs region-level feature pooling within the regions corresponding to the graph nodes:

$$\phi_k(F, \mathcal{N}) = (Q^f)^\top F, \tag{2}$$

$$Q^f = A (\Lambda^f)^{-1}, \tag{3}$$

$$A_{ij} = \begin{cases} 1 & \text{if the } j\text{-th node is among the } k \text{ nearest nodes of the } i\text{-th pixel} \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\mathcal{N} \in \{\mathcal{V}, \mathcal{U}\}$ is the node set corresponding to the spatial visual feature $F \in \mathbb{R}^{HW \times C}$, $A \in \mathbb{R}^{HW \times |\mathcal{N}|}$ is an auxiliary matrix that assigns spatial features to their top-$k$ nearest graph nodes, $\Lambda^f \in \mathbb{R}^{|\mathcal{N}| \times |\mathcal{N}|}$ is a diagonal matrix with $\Lambda^f_{jj} = \sum_{i=1}^{HW} A_{ij}$, and $Q^f \in \mathbb{R}^{HW \times |\mathcal{N}|}$, a normalized form of $A$, serves as the forward mapping matrix.

Compared with fixed-grid assignment methods [16], the proposed node representations are more robust to image scale, geometric distortions, etc., since $\phi_k$ selects the representative region adaptively according to the relations among node locations. Besides, the mapping mechanism has a clear physical meaning, which is helpful for visual interpretation. Specifically, Figure 3(c) shows that the mapping degenerates to Voronoi grids [2] when k = 1.

Based on the kNN forward mapping, we obtain the visual representations of the two bipartite graph node sets:

$$X^{CC} = \phi_k(F^{CC}, \mathcal{V}) \tag{5}$$

$$X^{MLO} = \phi_k(F^{MLO}, \mathcal{U}) \tag{6}$$
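The forward mapping of Equations 2-4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `knn_assignment` and `forward_mapping`, the use of Euclidean distance between pixel and landmark coordinates, and the small `eps` guard for nodes that receive no pixel are all assumptions made here.

```python
import numpy as np

def knn_assignment(pixel_xy, node_xy, k):
    """Build A (HW x |N|): A[i, j] = 1 if node j is among the k nearest nodes of pixel i."""
    # Squared Euclidean distance between every pixel and every node (assumed metric).
    d2 = ((pixel_xy[:, None, :] - node_xy[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    A = np.zeros(d2.shape)
    np.put_along_axis(A, nearest, 1.0, axis=1)
    return A

def forward_mapping(F, A, eps=1e-12):
    """phi_k: X = (A Lambda_f^{-1})^T F, i.e. the mean feature of the pixels assigned to each node."""
    col_sums = A.sum(axis=0)              # diagonal of Lambda_f
    Qf = A / np.maximum(col_sums, eps)    # normalise every column of A
    return Qf.T @ F

# Toy example: 4 pixels, 2 nodes, 2 feature channels.
pixel_xy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
node_xy = np.array([[0.0, 0.5], [3.0, 0.5]])
F = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
X = forward_mapping(F, knn_assignment(pixel_xy, node_xy, k=1))
```

With k = 1 each pixel belongs to exactly one node, so each row of `X` is the average feature of one Voronoi cell; with larger k the cells overlap and each node pools a wider context.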

3.3. Bipartite Graph Edge

Given a mass located at a certain node in the examined view, different nodes in the auxiliary view have different probabilities of representing the same mass instance. The bipartite graph edge aims to reveal such underlying relations between nodes. We characterize the edge in two aspects: geometric constraints and appearance similarities. These describe, respectively, the inherent constraints caused by the mammogram screening mechanism and the visual similarities between nodes.

Formally, the bipartite graph edge is represented as an adjacency matrix $\mathcal{E} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{U}|}$ composed of a geometric graph $\mathcal{E}^g \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{U}|}$ and a semantic graph $\mathcal{E}^s \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{U}|}$. The geometric graph is a global prior graph that reveals the geometric constraints across views. The semantic graph is an instance-dependent graph that represents the semantic similarities between nodes. The two graphs jointly affect cross-view information propagation. Equation 7 gives the relation between these matrices, where $\circ$ denotes the element-wise product.

$$\mathcal{E} = \mathcal{E}^g \circ \mathcal{E}^s \tag{7}$$

3.3.1 Geometric Relation Learning

How should geometric constraints be represented? Although the CC and MLO views have standard camera poses, the exact geometric correspondence is not well-defined due to tissue deformation under pressure and the lack of visual cues. We therefore model the geometric correspondence using masses as visual cues. Each edge in the geometric graph represents the correlation between linked nodes that denote the same mass instance in different views. To approximate the correlation, each mass is represented by the node closest to the center of its bounding box. We then link the nodes that represent the same mass instance in different views (e.g., the 4th node in the CC view and the 3rd node in the MLO view in Figure 3).

We take two steps to obtain the geometric graph. First, we obtain a frequency statistics matrix $\epsilon \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{U}|}$ from the annotated masses in the training set by counting the occurrences of cross-view node pairs that represent the same mass instance. Then, we apply a column-row normalization [57] to obtain $\mathcal{E}^g$.
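The two steps can be sketched as follows. This is a hedged illustration: the exact column-row normalization of [57] is not spelled out in the text, so the form used here, $\epsilon_{ij} / \sqrt{\sum_{j'} \epsilon_{ij'} \sum_{i'} \epsilon_{i'j}}$, is an assumption, as are the helper names `nearest_node` and `geometric_graph`.

```python
import numpy as np

def nearest_node(center_xy, node_xy):
    """Index of the pseudo landmark closest to a bounding-box center."""
    d2 = ((np.asarray(node_xy) - np.asarray(center_xy)) ** 2).sum(axis=1)
    return int(np.argmin(d2))

def geometric_graph(pairs, n_cc, n_mlo, eps=1e-12):
    """pairs: (i, j) node indices of the same annotated mass in the CC and MLO views."""
    counts = np.zeros((n_cc, n_mlo))          # frequency statistics matrix
    for i, j in pairs:
        counts[i, j] += 1.0
    row = counts.sum(axis=1, keepdims=True)   # occurrences per CC node
    col = counts.sum(axis=0, keepdims=True)   # occurrences per MLO node
    # ASSUMED normalisation: divide by the geometric mean of row and column totals.
    return counts / np.sqrt(np.maximum(row * col, eps))

# Toy example: four annotated masses over 2 CC nodes and 2 MLO nodes.
Eg = geometric_graph([(0, 0), (0, 0), (0, 1), (1, 1)], n_cc=2, n_mlo=2)
```

Because the matrix is built once from training annotations, it acts as a fixed prior and does not depend on the image being processed.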

3.3.2 Semantic Relation Learning

The geometric graph provides global geometric prior cor-

relations. However, it is not precise enough to find exact

correspondence pairs across views. Thus, noises can be in-

volved in the reasoning procedure. The semantic graph is

designed to learn the semantic relation between nodes, and

can help filter the noisy relations.

How should semantic similarities between nodes be defined? An intuitive idea is to measure them by the inner product or cosine distance [3, 53]. However, the relations between nodes that represent background are unknown, and such a fixed measure may enhance background representations. We therefore do not fix the similarity function, and instead allow the module to learn its own similarity:

$$\mathcal{E}^s_{ij} = \sigma\big([(X^{CC}_i)^\top, (X^{MLO}_j)^\top]\, w_s\big), \tag{8}$$

where $X^{CC}_i, X^{MLO}_j \in \mathbb{R}^{C}$ represent the $i$-th and $j$-th node features of the CC and MLO views respectively, $w_s \in \mathbb{R}^{2C}$ is the fusion parameter, and $\sigma$ is the sigmoid activation function.
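Equation 8 can be evaluated for all node pairs at once by splitting $w_s$ into its CC half and its MLO half. The sketch below assumes this vectorized form and the function name `semantic_graph`; in the actual model $w_s$ is learned end to end, whereas here it is random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_graph(X_cc, X_mlo, w_s):
    """Es[i, j] = sigmoid([X_cc[i], X_mlo[j]] . w_s) for every cross-view node pair."""
    C = X_cc.shape[1]
    s_cc = X_cc @ w_s[:C]     # contribution of each CC node, shape (|V|,)
    s_mlo = X_mlo @ w_s[C:]   # contribution of each MLO node, shape (|U|,)
    return sigmoid(s_cc[:, None] + s_mlo[None, :])

# Toy example with random features: 3 CC nodes, 5 MLO nodes, C = 4.
rng = np.random.default_rng(0)
X_cc = rng.normal(size=(3, 4))
X_mlo = rng.normal(size=(5, 4))
w_s = rng.normal(size=8)
Es = semantic_graph(X_cc, X_mlo, w_s)   # shape (3, 5), entries in (0, 1)
```

The full edge of Equation 7 is then the element-wise product of this matrix with the geometric prior, e.g. `E = Eg * Es` for arrays of matching shape.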

3.4. Correspondence Reasoning Enhancement

Correspondence reasoning enhancement, based on the defined bipartite graph $G$, is designed to take full advantage of the cross-view reasoning procedure for customized feature enhancement. There are three major steps. First, we augment the bipartite graph so that it fits the modern graph convolution formulation. Then, we map the node representations back to the spatial domain, which makes the spatial features aware of the correspondences. Last, we concatenate the result with the original features to enhance the representations.

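Bipartite Graph Convolution. To fit the formulation of modern graph convolutional networks [16, 26], we first give the augmented form of the bipartite graph:

$$\tilde{X} = [(X^{CC})^\top, (X^{MLO})^\top]^\top, \tag{9}$$

$$\tilde{\mathcal{E}} = \begin{pmatrix} 0 & \mathcal{E} \\ \mathcal{E}^\top & 0 \end{pmatrix}, \tag{10}$$

where $\tilde{X} \in \mathbb{R}^{|\mathcal{V} \cup \mathcal{U}| \times C}$ and $\tilde{\mathcal{E}} \in \mathbb{R}^{|\mathcal{V} \cup \mathcal{U}| \times |\mathcal{V} \cup \mathcal{U}|}$ denote the augmented forms of the bipartite nodes and edges respectively.

We define graph convolution in a fashion similar to [16]. One graph convolution layer with convolution parameters $W_g \in \mathbb{R}^{C \times D}$ is formulated in Equation 11. Multiple graph convolutional layers can be stacked to form a graph convolutional network.

$$Z = \sigma(\tilde{\mathcal{E}} \tilde{X} W_g) \tag{11}$$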
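A single layer of Equation 11 can be sketched as below, building the augmented adjacency matrix of Equation 10 explicitly. The function name `bipartite_graph_conv` is hypothetical, and taking the activation to be the sigmoid is an assumption carried over from Equation 8; the text does not state which nonlinearity is used in this layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bipartite_graph_conv(X_cc, X_mlo, E, W_g):
    """One layer Z = sigma(E_aug @ X_aug @ W_g) on the augmented bipartite graph."""
    n_v, n_u = E.shape
    X_aug = np.vstack([X_cc, X_mlo])                 # (|V| + |U|) x C
    E_aug = np.block([[np.zeros((n_v, n_v)), E],     # no CC-to-CC edges
                      [E.T, np.zeros((n_u, n_u))]])  # no MLO-to-MLO edges
    return sigmoid(E_aug @ X_aug @ W_g)              # activation assumed to be sigmoid

# Toy example: 2 CC nodes, 3 MLO nodes, C = 3 input and D = 4 output channels.
rng = np.random.default_rng(0)
X_cc, X_mlo = rng.normal(size=(2, 3)), rng.normal(size=(3, 3))
E = rng.uniform(size=(2, 3))
W_g = rng.normal(size=(3, 4))
Z = bipartite_graph_conv(X_cc, X_mlo, E, W_g)        # shape (5, 4)
```

The zero diagonal blocks are what make the propagation strictly cross-view: the first $|\mathcal{V}|$ rows of `Z` depend only on MLO node features, and the remaining rows only on CC node features.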

kNN Reverse Mapping. To enhance spatial features, we build a kNN reverse mapping function $\psi_k$ that projects graph node features back to the spatial domain. The mapping follows the same design principles as the kNN forward mapping and keeps the same number ($k$) of nearest neighbors.

Formally, $\psi_k$ is formulated as:

$$\psi_k(Z, \mathcal{N}_e) = Q^r [Z]_e, \tag{12}$$

$$Q^r = (\Lambda^r)^{-1} A, \tag{13}$$

where $\mathcal{N}_e \in \{\mathcal{V}, \mathcal{U}\}$ represents the unipartite node set of the examined view, $A \in \mathbb{R}^{HW \times |\mathcal{N}|}$ follows the definition in Equation 4, $[\cdot]_e$ is an indexing operator that selects the nodes of the examined view from all bipartite nodes, $\Lambda^r \in \mathbb{R}^{HW \times HW}$ is a diagonal matrix with $\Lambda^r_{ii} = \sum_{j=1}^{|\mathcal{N}|} A_{ij}$, and $Q^r \in \mathbb{R}^{HW \times |\mathcal{N}|}$ is the reverse mapping matrix, a normalized form of $A$.
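The reverse mapping differs from the forward mapping only in the direction of normalization: rows of $A$ are normalized instead of columns, so each pixel receives the average of its $k$ nearest node features. A minimal sketch, with the function name `reverse_mapping` and the `eps` guard being assumptions:

```python
import numpy as np

def reverse_mapping(Z_e, A, eps=1e-12):
    """psi_k: project examined-view node features Z_e (|N| x D) back to HW pixels."""
    row_sums = A.sum(axis=1, keepdims=True)   # diagonal of Lambda_r
    Qr = A / np.maximum(row_sums, eps)        # normalise every row of A
    return Qr @ Z_e

# Toy example: 3 pixels, 2 nodes; the middle pixel is assigned to both nodes.
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
Z_e = np.array([[2.0, 0.0], [0.0, 4.0]])
F_tilde = reverse_mapping(Z_e, A)
```

In this example the first and last pixels copy the feature of their single node, while the middle pixel receives the mean of both node features.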


Table 1. Performance on DDSM dataset (%).

Method                   R@t
Campanini et al. [4]     [email protected]
Eltonsy et al. [13]      [email protected], [email protected], [email protected]
Sampat et al. [45]       [email protected], [email protected], [email protected]
Faster RCNN [32]         [email protected], [email protected], [email protected]
CVR-RCNN [32]            [email protected], [email protected], [email protected]
BG-RCNN                  [email protected], [email protected], [email protected]

Table 2. Performance on DDSM dataset (%).

Method                    R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
Faster RCNN, FPN          75.3    81.5    87.3    89.8    91.4
Faster RCNN, FPN, DCN     75.7    82.5    88.4    90.1    91.4
Mask RCNN, FPN            76.0    82.5    88.7    90.8    91.4
Mask RCNN, FPN, DCN       76.7    83.9    89.4    91.4    91.8
BG-RCNN                   79.5    86.6    91.8    92.5    94.5

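Feature fusion. Finally, we fuse the original and the reverse-mapped features to obtain the enhanced feature $Y$, parameterized by $W_f \in \mathbb{R}^{D \times (D+C)}$:

$$\tilde{F} = \psi_k(Z, \mathcal{N}) \tag{14}$$

$$Y = [F, \tilde{F}] W_f^\top \tag{15}$$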
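The fusion step is a channel-wise concatenation followed by a linear projection. The sketch below is illustrative only: the function name `fuse` is hypothetical, and the column order of $W_f$ (first the $C$ original channels, then the $D$ enhanced channels) is an assumption that follows the concatenation order in Equation 15.

```python
import numpy as np

def fuse(F, F_tilde, W_f):
    """Y = [F, F_tilde] @ W_f^T with F: HW x C, F_tilde: HW x D, W_f: D x (C + D)."""
    return np.hstack([F, F_tilde]) @ W_f.T

# Toy example with C = D = 2: this W_f ignores F and passes F_tilde through unchanged.
F = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
F_tilde = np.array([[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]])
W_f = np.hstack([np.zeros((2, 2)), np.eye(2)])
Y = fuse(F, F_tilde, W_f)
```

Applied per pixel, this is equivalent to a 1x1 convolution over the concatenated feature map, which is how such a projection is commonly realized in detection backbones.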

4. Experiments

4.1. Implementation Details

The mammogram images are first segmented by Otsu thresholding [38], and the foreground region is treated as input. We apply the Hough transform to detect the pectoral muscle line and the nipple for pseudo landmark embedding. To avoid over-fitting during training, we apply several augmentation methods (e.g., random flip, random crop, multi-scaling).
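For reference, Otsu's method picks the intensity threshold that maximizes the between-class variance of the image histogram. The following is a generic NumPy sketch of that algorithm, not the authors' preprocessing code; the function name `otsu_threshold`, the 256-bin histogram, and the assumption that the breast is brighter than the background are choices made here.

```python
import numpy as np

def otsu_threshold(image, n_bins=256):
    """Return the threshold that maximises the between-class variance of the histogram."""
    hist, edges = np.histogram(image.ravel(), bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    p = hist / hist.sum()
    w0 = np.cumsum(p)                # probability of the darker class for each split
    w1 = 1.0 - w0                    # probability of the brighter class
    m = np.cumsum(p * centers)       # cumulative (unnormalised) mean
    mu_total = m[-1]
    valid = (w0 > 0) & (w1 > 0)      # skip splits that leave one class empty
    between = np.zeros(n_bins)
    between[valid] = (mu_total * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(between)]

# Toy example: a bright 4x4 square (assumed foreground) on a dark background.
image = np.full((8, 8), 10.0)
image[2:6, 2:6] = 200.0
thr = otsu_threshold(image)
foreground = image > thr
```

The boolean mask `foreground` plays the role of the breast region that is passed on to the detector.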

We build the proposed BG-RCNN by integrating BGN into the Mask RCNN object detection framework [18]. ResNet50 [19] pretrained on ImageNet [44] is taken as the backbone network. Our implementation is based on the PyTorch deep learning framework [10]. We adopt SGD with a learning rate of 0.02, a weight decay of $10^{-4}$, a momentum of 0.9 and Nesterov momentum enabled. The whole training procedure takes 30 epochs. For the stacked bipartite graph model, we keep the same number of nearest neighbors k for both $\phi_k$ and $\psi_k$, i.e., for bipartite node mapping and reverse mapping.

4.2. Datasets

Our experiments are conducted on both a public dataset, DDSM [20], and an in-house dataset. We do not use other datasets such as INBreast [36] or MIAS [51], because they contain too little data.

DDSM dataset. The DDSM dataset contains 2620 mammography cases. Most cases contain two views for each of the two breasts. We split the training, validation and testing sets in the same way as other approaches [32, 13, 4, 45]. There are 512 cases used in the evaluation.

Table 3. Performance on in-house dataset (%).

Method                    R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
Faster RCNN, FPN          82.9    84.7    88.0    89.1    89.6
Faster RCNN, FPN, DCN     83.1    86.9    88.7    89.8    90.3
Mask RCNN, FPN            83.1    85.9    89.6    90.3    90.7
Mask RCNN, FPN, DCN       84.2    87.8    90.2    91.6    92.1
BG-RCNN                   87.8    90.5    92.8    93.9    94.1

In-house dataset. We collect an in-house dataset, which contains 3000 cases and 12000 images. Each case contains cross-view images of each breast. The annotations, namely the mask of each breast lesion, are labeled by 3 radiologists with more than 10 years of experience. When the radiologists disagree, we take the majority opinion. The dataset is randomly divided into training, validation and testing sets with a ratio of 8:1:1.

4.3. Baselines

Faster RCNN, FPN. Faster RCNN [42] with Feature Pyramid Network (FPN) [29] is a solid baseline for object detection. We use ResNet50 [19] as the backbone.

Faster RCNN, FPN, DCN. Deformable Convolution

Network (DCN) [9] is used to enhance the transformation

modeling capability of convolutional networks. DCN is in-

tegrated into baselines to enhance the performance.

Mask RCNN, FPN. Mask RCNN [18] is a state-of-the-

art model on both object detection and instance segmenta-

tion. To exploit the mask supervision for localization, we

employ Mask RCNN as a baseline.

Mask RCNN, FPN, DCN. DCN is also integrated into

Mask RCNN baselines.

4.4. Comparison with state-of-the-art methods

We evaluate the performance by recall (R) at t false positives per image (FPI), abbreviated as R@t, where $t \in \{0.5, 1.0, 2.0, 3.0, 4.0\}$.
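The R@t metric can be computed by sweeping the detection score threshold from high to low and recording the best recall reached while the average number of false positives per image stays at or below t. The sketch below is one common way to do this; the function name `recall_at_fpi` and the input encoding are assumptions, and the rule for deciding whether a detection hits a ground-truth mass is not specified in the text and is left to the caller.

```python
import numpy as np

def recall_at_fpi(scores, matched_gt, n_images, n_gt, fpi_limits):
    """scores: confidences of all detections over the test set.
    matched_gt: per detection, the id of the ground-truth mass it hits, or -1 for a false positive."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    matched = np.asarray(matched_gt)[order]
    fpi = np.cumsum(matched == -1) / n_images   # false positives per image after each detection
    found, recall = set(), np.empty(len(matched))
    for n, g in enumerate(matched):
        if g != -1:
            found.add(int(g))                   # each mass is counted once
        recall[n] = len(found) / n_gt
    out = {}
    for t in fpi_limits:
        ok = t >= fpi
        out[t] = float(recall[ok].max()) if ok.any() else 0.0
    return out

# Toy example: 5 detections over 2 images containing 3 annotated masses in total.
result = recall_at_fpi(scores=[0.9, 0.8, 0.7, 0.6, 0.5],
                       matched_gt=[0, -1, 1, -1, -1],
                       n_images=2, n_gt=3, fpi_limits=[0.25, 0.5, 1.0])
```

In the toy example one mass is never detected, so the recall saturates at 2/3 however many false positives are allowed.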

Table 1 and Table 2 report the experimental performance on the DDSM dataset. Results in Table 1 are taken from [4, 13, 45, 32], and the baselines in Table 2 are implemented by us. We do not compare with [31], as they do not release their dataset split. We keep the same FPI and compare the recall with a strong competitor, CVR-RCNN [32]. We can conclude that the proposed model outperforms state-of-the-art methods. The same conclusion can be drawn on the in-house dataset from Table 3. To understand how the proposed model benefits from the correspondence reasoning mechanism, we analyze the cases in Figure 4, comparing the recall at the same FPI. The proposed method significantly improves the recall (the 2nd and 3rd rows) and the localization ability (the 1st row).

Table 4. Effectiveness of pseudo landmarks on DDSM dataset (%).

Method       R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
Uniform      76.4    84.6    90.4    92.5    93.2
BG-RCNN      79.5    86.6    91.8    92.5    94.5

Table 5. Effectiveness of node number on DDSM dataset (%).

Method       R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
V1, U1       75.7    83.9    88.7    92.1    93.2
V9, U13      79.5    86.0    90.4    92.5    93.2
V21, U25     79.5    86.6    91.8    92.5    94.5
V42, U46     76.0    85.6    90.8    93.5    94.2
V66, U71     78.8    85.6    90.1    92.1    93.2

4.5. Ablation study

Ablation of Pseudo Landmarks. As shown in Table 4, we first investigate the effectiveness of pseudo landmarks versus uniform grids, keeping the number of uniform grids equal to the number of pseudo landmarks. The results demonstrate that pseudo landmarks are more effective than uniform grids. We also investigate how the number of nodes affects the performance. “Vi, Uj” means that there are i nodes in the CC view and j nodes in the MLO view. Specifically, the setting “V1, U1” is equivalent to a two-branch Faster RCNN. Based on Table 5, we choose “V21, U25” as our final setting.

Ablation of Bipartite Graph Node Mapping. To investigate the effectiveness of bipartite node mapping, we first compare with a simple method that directly crops a fixed region for each graph node representation. We also evaluate how k influences the results, keeping the same k for both $\phi_k$ and $\psi_k$. As shown in Table 6, bipartite graph node mapping is effective and necessary. Meanwhile, when nodes are densely embedded, the model works better with a larger k, because a larger k gathers more context features for a node that may otherwise lack sufficient context representation.

Ablation of Bipartite Graph Edge. We analyze the in-

fluence of Es and Eg in the bipartite graph edge. It degen-

erates to naive Mask RCNN when neither Es nor Eg are

used, since no information propagates across views. When

either Es or Eg is used, we set E to Es or Eg respectively. As

shown in Table 7, either Es or Eg can make an improvement,

and combining both parts achieves the best performance.

4.6. Visualization

Table 6. Effectiveness of bipartite graph node mapping on DDSM dataset (%).

Method            R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
V21, U25, crop    76.4    84.2    90.1    91.8    93.2
V21, U25, k=1     79.8    86.3    91.8    92.5    94.2
V21, U25, k=2     79.5    86.6    91.8    92.5    94.5
V21, U25, k=3     79.5    86.3    87.7    91.8    93.5
V66, U71, crop    75.3    84.2    89.4    91.8    92.1
V66, U71, k=1     77.7    85.6    90.1    91.4    93.2
V66, U71, k=2     78.8    85.6    90.1    92.1    93.2
V66, U71, k=3     79.1    86.0    90.1    92.1    93.8

Table 7. Effectiveness of each component in bipartite graph edge on DDSM dataset (%).

Es    Eg    R@0.5   R@1.0   R@2.0   R@3.0   R@4.0
×     ×     76.7    83.9    89.4    91.4    91.8
×     √     78.4    83.9    91.1    92.1    93.8
√     ×     77.7    86.3    89.4    91.8    93.5
√     √     79.5    86.6    91.8    92.5    94.5

Our visualization experiments mainly answer two questions: (1) Where does the bipartite graph focus in the auxiliary view? (2) How does the correspondence reasoning mechanism enhance the feature representations?

To answer the first question, we design a specialized method for correspondence visualization. The main purpose is to find the representative regions of correlated nodes in the auxiliary view, given a query mass in the examined view. We first define a one-hot vector $x \in \mathbb{R}^{|\mathcal{V} \cup \mathcal{U}|}$ to represent the location of the mass in the examined image: the entry of the node nearest to the center of the analyzed mass in the examined image is set to 1. We then visualize the feature by Equation 16, where $o \in \mathbb{R}^{HW}$ represents the response vector, $\tilde{\mathcal{E}}$ is the augmented adjacency matrix of Equation 10, $Q^r$ is the reverse mapping matrix of Equation 13, and $[\cdot]_e$ indicates the indexing operator which selects nodes in the examined view from the bipartite graph nodes. As shown in Figure 4, the bipartite graph focuses on the mated mass area in the auxiliary view, which helps to learn complementary feature representations. Besides, our model has a clear physical meaning and provides visual cues of mated masses. Thus it can help radiologists in clinical interpretation.

$$o = Q^r [\tilde{\mathcal{E}} x]_e \tag{16}$$

To answer the second question, we compare the response maps before and after feature enhancement. Specifically, channel-wise max pooling is conducted on $F^e$ and $Y$ respectively. As shown in Figure 4, the feature response map activates more prominently on the area where the mass is located after enhancement. As a result, the correspondence reasoning enhancement method helps to improve the detection performance and to make more comprehensive judgments.

5. Conclusions and Future Work

In this paper, we introduce the bipartite graph convolutional network to provide customized reasoning ability in mammogram mass detection. To model the cross-view reasoning procedure, bipartite graph nodes induced by pseudo landmarks are constructed from the cross-view images respectively, and represent relatively consistent regions. The bipartite graph edge then learns both inherent cross-view geometric constraints and semantic relations. Finally, in correspondence reasoning enhancement, information propagates on the bipartite graph and makes the spatial visual features aware of cross-view correspondences, which enhances the feature representations. Experiments on both public and in-house datasets demonstrate that the proposed method achieves state-of-the-art performance. In addition, visual analysis shows that the proposed model has a clear physical meaning, which is helpful to radiologists in clinical interpretation.

Figure 4. Detection results of BG-RCNN. Each row shows a representative case. Columns (a)-(b) show the examined image and its auxiliary view image with annotations. Columns (c)-(d) show the detection results of Mask RCNN and BG-RCNN. Column (e) visualizes the attention area on the auxiliary view. Columns (f)-(g) visualize the response maps before and after correspondence visual reasoning.

Future work will include: (1) exploring learnable forms of pseudo landmarks; (2) integrating bilateral (same view of left and right breasts) domain knowledge into the model; (3) exploiting more powerful graph mechanisms to facilitate information propagation.

Acknowledgment

This work was supported in part by Zhejiang Province Key Research Development Program (No. 2020C03073), MOST-2018AAA0102004, NSFC-61625201, 61527804, DFG TRR169 / NSFC Major International Collaboration Project “Crossmodal Learning”. We would like to thank Shu Zhang and Yuchun Chen for valuable discussions.


