arXiv:2003.14014v1 [cs.CV] 31 Mar 2020

SK-Net: Deep Learning on Point Cloud via End-to-end Discovery of Spatial Keypoints

Weikun Wu,1* Yan Zhang,1,2* David Wang,3 Yunqi Lei1†
1Computer Science Department, Xiamen University, China
2School of Mathematics Science, Guizhou Normal University, China
3Department of Electrical and Computer Engineering, The Ohio State University, USA

[email protected], [email protected], [email protected], [email protected]

Abstract

Since PointNet was proposed, deep learning on point clouds has been the focus of intense 3D research. However, existing point-based methods are usually not adequate to extract the local features and the spatial pattern of a point cloud for further shape understanding. This paper presents an end-to-end framework, SK-Net, to jointly optimize the inference of spatial keypoints with the learning of feature representation of a point cloud for a specific point cloud task. One key process of SK-Net is the generation of spatial keypoints (Skeypoints). It is jointly conducted by two proposed regulating losses and a task objective function, without knowledge of Skeypoint location annotations or proposals. Specifically, our Skeypoints are not sensitive to location consistency but are acutely aware of shape. Another key process of SK-Net is the extraction of the local structure of Skeypoints (detail feature) and the local spatial pattern of normalized Skeypoints (pattern feature). This process generates a comprehensive representation, the pattern-detail (PD) feature, which comprises the local detail information of a point cloud and reveals its spatial pattern through the part district reconstruction on normalized Skeypoints. Consequently, our network is prompted to effectively understand the correlation between different regions of a point cloud and to integrate contextual information of the point cloud. In point cloud tasks such as classification and segmentation, our proposed method performs better than or comparably to the state-of-the-art approaches. We also present an ablation study to demonstrate the advantages of SK-Net.

1 Introduction

Rapid development in stereo sensing technology has made 3D data ubiquitous. The native structure of most 3D data obtained by stereo sensors is a point cloud, which makes methods that directly consume points the most straightforward approach to 3D tasks. Most of the existing point-based methods work on the following aspects: (1) utilizing multi-layer perceptrons (MLPs) to extract point features and using a symmetry function to guarantee permutation invariance; (2) capturing the local structure through explicit or other local feature extraction approaches, and/or modeling the spatial pattern of a point cloud by discovering a set of keypoint-like points; (3) aggregating features to deal with a specific point cloud task.

* Contributed equally.
† Corresponding author.

Despite being a pioneer in this area, PointNet (Qi et al. 2017a) only extracts point features to acquire the global representation of a point cloud. To boost point-based performance, studies advancing aspect (2) above have recently fostered many techniques. One of these is downsampling a point cloud to choose a set of spatial locations as local feature extraction regions (Qi et al. 2017b; Jiang et al. 2018; Liu et al. 2019a). However, in this method, those locations are mostly obtained by artificial definition, e.g., the FPS algorithm (Eldar et al. 1997), and do not reckon with the spatial pattern and the correlation between different regions of a point cloud. Similarly, (Xie et al. 2018; Hua, Tran, and Yeung 2018) achieve pointwise local feature extraction by corresponding each point to a region with respect to modeling the local structure of a point cloud. A special way is that of (Le and Duan 2018; Li et al. 2018), which combine the regular grids of 3D space with points, leveraging the spatially-local correlation of regular grids to represent local information. However, performing pointwise local feature extraction and incorporating regular grids often require high computational costs, which is infeasible with large point inputs and higher grid resolutions. Furthermore, SO-Net (Li, Chen, and Hee Lee 2018) introduces the Self-Organizing Map (SOM) (Kakuda et al. 1998) to produce a set of keypoint-like points for modeling the spatial pattern of a point cloud. Even though SO-Net takes the regional correlation of a point cloud into account, it trains the SOM separately. This leaves the spatial modeling of the SOM and a specific point cloud task disconnected.

To address these issues, this paper explores a method that can benefit from an end-to-end framework in which the inference of Skeypoints is jointly optimized with the learning of the local details and the spatial pattern of a point cloud. By this means, the end-to-end learned Skeypoints are strongly connected with the local feature extraction and the spatial modeling of a point cloud. It is worth mentioning that related approaches to end-to-end learning of 3D keypoints usually require keypoint location proposals (Georgakis et al. 2018; Georgakis et al. 2019) and expect the location consistency of 3D keypoints. Besides, most of the existing methods handle the tasks of 3D matching or 3D pose estimation, and those methods are not extended to point cloud recognition tasks (Georgakis et al. 2018; Suwajanakorn et al. 2018; Zhou et al. 2018; ?). Our work focuses on point cloud recognition tasks. Before producing the PD feature, all local features undergo max-pooling to ensure that the local spatial pattern is permutation invariant and finely shape-descriptive. Thus the generated Skeypoints are not sensitive to location consistency. With further training under the adjustment of two regulating losses, these Skeypoints gradually spread to the entire space of a point cloud and distribute around its discriminative regions. The generation of Skeypoints and the preliminary spatial exploration of the point cloud are shown in Figure 1.

Figure 1: The generation of Skeypoints and the preliminary spatial exploration of the point cloud. From left to right, the figures show the distributions of generated Skeypoints (red balls) as training epochs increase; the rightmost figures show the resulting Skeypoints.

The key contributions of this paper are as follows:

• We propose a novel network named SK-Net. It is an end-to-end framework that jointly optimizes the inference of spatial keypoints (Skeypoints) and the learning of feature representation in the process of solving a specific point cloud task, e.g., classification and segmentation. The generated Skeypoints do not require location consistency but are acutely aware of shape, which benefits local detail extraction and the spatial modeling of a point cloud.

• We design a pattern and detail extraction module (PDE module) in which key detail features and pattern features are extracted and then aggregated to form the pattern-detail (PD) feature. It promotes correlation between different regions of a point cloud and integrates the contextual information of the point cloud.

• We propose two regulating losses to conduct the generation of Skeypoints without any location annotations or proposals. These two regulating losses are mutually reinforcing and neither of them can be omitted.

• We conduct extensive experiments to evaluate the performance of our method, which is better than or comparable with the state-of-the-art approaches. In addition, we present an ablation study to demonstrate the advantages of SK-Net.

2 Related Work

In this section, we briefly review existing works on regional feature extraction on point clouds and on end-to-end learning of 3D keypoints.

2.1 Extraction of regional features on point cloud

PointNet++ (Qi et al. 2017b) is the earliest proposed method that addresses the problem of local information extraction. In practice, this method partitions the set of points into overlapping local regions corresponding to the spatial locations chosen by the FPS algorithm through the distance metric of the underlying space. It enables the learning of local features of a local region with increasing contextual scales. Besides, (Jiang et al. 2018; Liu et al. 2019a) also perform similar downsampling of a point cloud to obtain the spatial locations for local feature extraction. PointSIFT (Jiang et al. 2018) extends the traditional SIFT feature (Lowe 2004) to develop a local region sampling approach for capturing the local information of different orientations around a spatial location. It is similar to (Xie et al. 2018), which expands from ShapeContext (Belongie, Malik, and Puzicha 2001). Point2Sequence (Liu et al. 2019a) explores the attention-based aggregation of multi-scale areas of each local region to achieve local feature extraction. In addition, (Hua, Tran, and Yeung 2018) defines a pointwise convolution operator to query each point's neighbors and bin them into kernel cells for extracting pointwise local features. (Shen et al. 2018) proposes a point-set kernel operator as a set of learnable 3D points that jointly respond to a set of neighboring data points according to their geometric affinities. Other techniques exploit extension operators to extract local structure. For example, (Klokov and Lempitsky 2017; Atzmon, Maron, and Lipman 2018) map the point cloud to different representation spaces, and (Simonovsky and Komodakis 2017; Wu et al. 2018) dynamically construct a graph comprising edge features for each specific local sampling of a point cloud. In particular, (Le and Duan 2018; Li et al. 2018) combine the regular grids of 3D space with points and leverage the spatially-local correlation of regular grids to achieve local information capture. Among the latest methods, the study of local geometric relations plays an important role. (Lan et al. 2019) models the geometric local structure by decomposing the edge features along three orthogonal directions. (Liu et al. 2019b) learns the geometric topology constraint among points to acquire an inductive local representation. By contrast, our method not only performs the relevant regional feature extraction in the PDE module, but also learns a set of Skeypoints corresponding to geometrically and semantically meaningful regions of the point cloud. In this way, our network can effectively model the spatial pattern of a point cloud and promote the correlation between different regions of the point cloud.

Most related to our approach is the recent work SO-Net (Li, Chen, and Hee Lee 2018). This work suggests utilizing the Self-Organizing Map (SOM) to build the spatial pattern of a point cloud. However, SO-Net is not proved to be generic to large-scale semantic segmentation. This is intuitively because its network is not end-to-end and the SOM is an early stage of the overall pipeline, so the SOM does not establish a connection with a specific point cloud task. In contrast, our method is capable of extending to large-scale point cloud analysis and achieves end-to-end joint learning of Skeypoints and feature representation towards a specific point cloud task.

2.2 End-to-end learning of 3D keypoints

Approaches for end-to-end learning of 3D keypoints have been investigated. Most of them focus on handling 3D matching or 3D pose estimation tasks; they do not extend to point cloud recognition tasks, especially classification and segmentation (Georgakis et al. 2018; Suwajanakorn et al. 2018; Zhou et al. 2018; Georgakis et al. 2019; Feng et al. 2019; Lu et al. 2019). Some of these methods require keypoint proposals. For example, (Georgakis et al. 2018) proposes using a Region Proposal Network (RPN) to obtain several Regions of Interest (RoIs) and then determines the keypoint locations by the centroids of those RoIs. Similarly, (Georgakis et al. 2019) casts a keypoint proposal network (KPN) comprising two convolutional layers to acquire a keypoint confidence score evaluating whether a candidate location is a keypoint or not. In addition, KeypointNet (Suwajanakorn et al. 2018) presents an end-to-end geometric reasoning framework to learn an ordered set of 3D keypoints, whose discovery is guided by carefully constructed consistency and relative pose objective functions. These approaches tend to depend on the location consistency of keypoints across different views (Georgakis et al. 2018; Suwajanakorn et al. 2018; Zhou et al. 2018; Georgakis et al. 2019). Moreover, (Feng et al. 2019) presents an end-to-end deep network architecture to jointly learn the descriptors for 2D and 3D keypoints from an image and a point cloud, establishing 2D-3D correspondence. (Lu et al. 2019) trains its keypoint detector end-to-end to enable the system to avoid the interference of dynamic objects and leverage sufficiently salient features on stationary objects. By contrast, our SK-Net focuses on point cloud classification and segmentation. In addition, our Skeypoints not only are inferred without any location annotations or proposals but also are insensitive to location consistency. This greatly benefits the exploration of spatial region relations of a point cloud.

3 End-to-end optimization of Skeypoints and feature representation

In this section, we present the end-to-end framework of our SK-Net. First, the architecture of SK-Net is introduced. Then its important process, the PDE module, is discussed in detail. Finally, the significant properties of Skeypoints are analyzed.

3.1 SK-Net architecture

The overall architecture of SK-Net is shown in Figure 2. The backbone of our network consists of the following three components: (a) point feature extraction, (b) Skeypoint inference and (c) pattern and detail extraction.

Figure 2: The overall architecture of the SK-Net.

(a) Point feature extraction. Given an input point cloud P of size N, i.e. P = {p_i ∈ R^3, i = 0, 1, ..., N−1}, where each point p_i is a vector of its (x, y, z) coordinates, P goes through a series of multi-layer perceptrons (MLPs) to extract its point features. Next, we use the symmetry function of max-pooling to acquire the global feature.

(b) Skeypoint inference. M Skeypoints, Sk = {sk_j ∈ R^3, j = 0, 1, ..., M−1}, are regressed by stacking three fully connected layers on top of the global feature; sk_j denotes the j-th Skeypoint. Note that the MLP-64 point features are used to handle point cloud segmentation tasks. Meanwhile, the interpolating operation proposed by (Qi et al. 2017b) is utilized.

(c) Pattern and detail extraction. The generated Skeypoints are forwarded into the PDE module to get PD features. More details are presented in the next subsection.

Finally, our network combines the PD feature aggregated in the PDE module with the global feature obtained in (a) for a specific point cloud task, e.g., classification and segmentation.
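As a rough illustration of component (b), the sketch below regresses M × 3 Skeypoint coordinates from a 1 × 1024 global feature through three stacked fully connected layers with PReLU activations. The NumPy layers and random weights are purely illustrative stand-ins, not the authors' trained TensorFlow implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 192  # number of Skeypoints, as in the experiments

def prelu(x, alpha=0.25):
    """PReLU activation used in the Skeypoint inference head."""
    return np.where(x > 0, x, alpha * x)

def fc(x, w, b):
    """A plain fully connected layer."""
    return x @ w + b

# Hypothetical randomly initialized layers: 1024 -> 512 -> 256 -> M*3.
w1, b1 = rng.normal(size=(1024, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.normal(size=(512, 256)) * 0.01, np.zeros(256)
w3, b3 = rng.normal(size=(256, M * 3)) * 0.01, np.zeros(M * 3)

def infer_skeypoints(global_feature):
    """Regress M Skeypoints (M x 3) from a 1 x 1024 global feature by
    stacking three FC layers and reshaping the activated output."""
    h = prelu(fc(global_feature, w1, b1))
    h = prelu(fc(h, w2, b2))
    out = prelu(fc(h, w3, b3))
    return out.reshape(M, 3)

skeypoints = infer_skeypoints(rng.normal(size=(1, 1024)))
print(skeypoints.shape)  # (192, 3)
```

Each coordinate is produced by an independent output unit, matching the independence of coordinate regression described in Section 3.3.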

3.2 Pattern and detail extraction module

The PDE module is shown in Figure 3. It consumes the M generated Skeypoints and the N input points, and contains three parts: (1) extraction of local details, (2) capture of local spatial patterns, and (3) integration of features.

Extraction of local details. In this part, we apply a simple kNN search to sample each local region. The number of local regions from which local information is extracted is identical to the number of generated Skeypoints. Consequently, the M local neighborhoods captured by the Skeypoints are grouped to form an M × H × 3 tensor named the local detail tensor. We use the (x, y, z) coordinates as the tensor's channels. The captured points are the points that lie in the local neighborhood captured by a generated Skeypoint and are denoted by Cps_j, j = 0, 1, ..., M−1. H is the number of captured points of each generated Skeypoint. Each captured point of a local neighborhood is denoted by cp_h, h = 0, 1, ..., H−1. The M sets of captured points can be formulated by:

Cps_j = kNN(cp_h | sk_j, h = 0, 1, ..., H−1)    (1)
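Eq. (1) can be sketched with a brute-force kNN in NumPy; the function and variable names are our own, and a real implementation would use an optimized neighbor search rather than sorting the full distance matrix:

```python
import numpy as np

def group_local_details(points, skeypoints, H):
    """For each Skeypoint sk_j, gather its H nearest input points (Eq. 1)
    and stack the neighborhoods into the M x H x 3 local detail tensor."""
    # Pairwise squared distances: Skeypoints (M x 3) vs. points (N x 3).
    d2 = ((skeypoints[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :H]  # indices of the H nearest points
    return points[idx]                   # M x H x 3

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(1024, 3))      # N = 1024 input points
skeypoints = rng.uniform(-1, 1, size=(192, 3))   # M = 192 Skeypoints
detail_tensor = group_local_details(points, skeypoints, H=32)
print(detail_tensor.shape)  # (192, 32, 3)
```

The resulting tensor is what the shared MLPs then consume to produce the M × 256 detail feature.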

Subsequently, the PDE module applies a series of MLPs on the local detail tensor to extract the M × 256 detail feature. It is worth mentioning that there are other ways to sample local regions; we present relevant experiments in the next section to illustrate the superiority of using the kNN search in our scheme.

Figure 3: The pivotal PDE module of our SK-Net. The red balls around the point cloud object in (1) are the Skeypoints, and the orange balls in (2) represent the normalized Skeypoints.

Capture of local spatial patterns. To effectively capture the local spatial patterns of a point cloud, we define a concept named the normalized Skeypoint. We average the location coordinates of the points that lie in the local neighborhood captured by each generated Skeypoint to acquire the M normalized Skeypoints, Sk_norm = {sknorm_j ∈ R^3, j = 0, 1, ..., M−1}, where sknorm_j is defined by:

sknorm_j = average(Cps_j)    (2)

The normalized Skeypoints are distributed over the discriminative regions of the point cloud surface, as illustrated in Figure 4. This is beneficial for modeling the entire spatial structure of the point cloud. Next, we use kNN search on the normalized Skeypoints themselves to capture the local spatial patterns of the point cloud. After that, we group the search results to get the M × K × 3 local spatial pattern tensor, where K is the number of normalized Skeypoints in each local spatial region. The M × 256 pattern feature is extracted by the same approach as the detail feature.
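The two steps above, the Eq. (2) averaging and the kNN among the normalized Skeypoints, can be sketched as follows, again with a brute-force neighbor search and illustrative names of our own:

```python
import numpy as np

def normalized_skeypoints(points, skeypoints, H):
    """sknorm_j = average(Cps_j): the mean of each Skeypoint's H captured
    points (Eq. 2), pulling every Skeypoint onto the object surface."""
    d2 = ((skeypoints[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :H]   # H nearest input points
    return points[idx].mean(axis=1)       # M x 3

def local_spatial_pattern(sknorm, K):
    """kNN among the normalized Skeypoints themselves, grouped into the
    M x K x 3 local spatial pattern tensor."""
    d2 = ((sknorm[:, None, :] - sknorm[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :K]   # each point's K nearest peers
    return sknorm[idx]

rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(1024, 3))
sk = rng.uniform(-1, 1, size=(192, 3))
sknorm = normalized_skeypoints(pts, sk, H=32)
pattern = local_spatial_pattern(sknorm, K=16)
print(sknorm.shape, pattern.shape)  # (192, 3) (192, 16, 3)
```

Note that each normalized Skeypoint is trivially its own nearest neighbor, so each K-group includes the query point itself.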

Figure 4: Each pair contains a point cloud (left) generated by Skeypoints and the corresponding one generated by normalized Skeypoints.

Integration of features. In this part, the PDE module concatenates the pattern feature with the detail feature to form an M × 512 intermediate result. This result is further aggregated through a stack of shared FC layers to acquire the pattern-detail (PD) feature. The PD feature builds connections between different regions of a point cloud and integrates the contextual information of the point cloud.

3.3 Properties of Skeypoints

The fact that a point set is permutation invariant requires certain symmetrizations in the network computation; the same holds for our Skeypoint inference and the relevant feature formation. Moreover, for our scheme and point cloud tasks, our Skeypoints are required to be distinct from each other (distinctiveness), adapted to the spatial modeling of a point cloud (adaptation), and close to geometrically and semantically meaningful regions (closeness). However, due to the lack of keypoint annotations, the Skeypoints might either tend to be extremely close or even collapse to the same 3D location, or stray away from the point cloud object. Because of these issues, we propose two regulating losses to conduct the generation of Skeypoints. They guarantee the distinctiveness, adaptation, and closeness of Skeypoints.

Permutation invariance. We employ a modified PointNet framework to acquire the global feature of a point cloud, which is the basis for inferring Skeypoints. After that, the activated outputs of the Skeypoint inference component are directly reshaped to obtain Skeypoints; each coordinate of a Skeypoint is regressed independently. These procedures make Skeypoint generation satisfy permutation invariance. Note that the nonlinear activation function PReLU (He et al. 2015) is adopted to guarantee that each layer's activation matches the location distribution of the point cloud and that the regression is well-adapted and robust. Moreover, all local features use max-pooling to ensure their permutation invariance before producing the PD feature.
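The symmetry argument can be checked with a toy example: max-pooling over the point dimension yields the same pooled feature for any permutation of the points. Random features stand in for the shared-MLP outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
point_features = rng.normal(size=(1024, 64))  # N x C features from shared MLPs
perm = rng.permutation(1024)                  # an arbitrary reordering of points

# Max-pooling along the point axis is a symmetric function, so the global
# feature is identical before and after permuting the input points.
pooled = point_features.max(axis=0)
pooled_permuted = point_features[perm].max(axis=0)
print(np.allclose(pooled, pooled_permuted))  # True
```

Since the global feature is permutation invariant, everything regressed from it (including the Skeypoints) inherits that invariance.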

Distinctiveness and adaptation. To achieve these characteristics, we take the separation distance between Skeypoints as a hyperparameter δ and propose the separation loss L_sep. The hyperparameter δ provides prior knowledge of the location distribution of the point cloud; it makes the generated Skeypoints adapted to the spatial modeling of a point cloud and boosts the convergence of our network. The separation loss L_sep prompts the M Skeypoints to be distinct from each other and penalizes two Skeypoints if their distance is less than δ. L_sep is formulated as:

L_sep = (1 / M^2) Σ_{i=1}^{M} Σ_{j≠i} max(0, δ − ||sk_i − sk_j||_2)    (3)
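A direct NumPy transcription of Eq. (3) might look like the following; `separation_loss` is our own name and the sketch omits batching:

```python
import numpy as np

def separation_loss(sk, delta):
    """L_sep (Eq. 3): hinge penalty on pairs of Skeypoints whose pairwise
    distance is below delta, with the 1/M^2 normalization of the paper."""
    M = sk.shape[0]
    dist = np.linalg.norm(sk[:, None, :] - sk[None, :, :], axis=-1)  # M x M
    hinge = np.maximum(0.0, delta - dist)
    np.fill_diagonal(hinge, 0.0)  # drop the i == j self-terms
    return hinge.sum() / (M * M)

# Two coincident Skeypoints are penalized; a well-separated pair is not.
close_pair = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
far_pair = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(round(separation_loss(close_pair, delta=0.05), 3))  # 0.025
print(separation_loss(far_pair, delta=0.05))              # 0.0
```

Being a hinge on pairwise distances, the loss is zero once every pair of Skeypoints is at least δ apart, so it stops pushing after sufficient spread is reached.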

Closeness. To make Skeypoints correspond to geometrically and semantically meaningful regions of a point cloud, the network should encourage Skeypoints to be close to the discriminative regions of the point cloud surface. A regulating loss named the close loss L_close is proposed to penalize a Skeypoint and the captured points in its local structure if they are farther apart than the value of the hyperparameter θ in 3D Euclidean space. L_close is represented by:

L_close = (1 / (M·H)) Σ_{i=1}^{M} Σ_{h=1}^{H} max(0, ||sk_i − cp_h||_2 − θ)    (4)
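Similarly, Eq. (4) can be transcribed as below, with an illustrative one-Skeypoint example in which one captured point lies inside the θ-ball (no penalty) and one outside (penalized by its overshoot); the names are again our own:

```python
import numpy as np

def close_loss(sk, captured, theta):
    """L_close (Eq. 4): hinge penalty when a captured point lies farther
    than theta from its Skeypoint, averaged over all M x H pairs."""
    M, H = captured.shape[0], captured.shape[1]
    dist = np.linalg.norm(sk[:, None, :] - captured, axis=-1)  # M x H
    return np.maximum(0.0, dist - theta).sum() / (M * H)

# One Skeypoint with H = 2 captured points at distances 0.01 and 0.10;
# only the second exceeds theta = 0.05, contributing 0.05 / 2 = 0.025.
sk = np.array([[0.0, 0.0, 0.0]])
captured = np.array([[[0.01, 0.0, 0.0], [0.10, 0.0, 0.0]]])  # 1 x 2 x 3
print(round(close_loss(sk, captured, theta=0.05), 3))  # 0.025
```

Because the penalty grows with the distance overshoot, gradient descent pulls stray Skeypoints back towards their captured neighborhoods.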


These two loss terms allow our network to discover satisfactory Skeypoints as far as possible. Besides, they prompt our network to extract more compact local features with respect to the regions corresponding to the learned Skeypoints and to capture the effective spatial pattern of a point cloud by utilizing the normalized Skeypoints.

4 Experiments

Datasets. We validate on four datasets to demonstrate the effectiveness of our SK-Net. Object classification on ModelNet (Wu et al. 2015) is evaluated by accuracy, part segmentation on ShapeNetPart (Yi et al. 2016) is evaluated by mean Intersection over Union (mIoU) on points, and semantic scene labeling on ScanNet (Dai et al. 2017) is evaluated by per-point accuracy.

• ModelNet. This includes two datasets, which respectively contain CAD models of 10 and 40 categories. ModelNet10 consists of 4,899 object instances split into 3,991 training samples and 908 testing samples. ModelNet40 consists of 12,311 object instances, among which 9,843 objects belong to the training set and the other 2,468 samples are for testing.

• ShapeNetPart. 16,881 shapes from 16 categories, labeled with 50 parts in total. A large proportion of shape categories are annotated with two to five parts.

• ScanNet. 1,513 scanned and reconstructed indoor scenes. We follow the experiment setting in (Qi et al. 2017b) and use 1,201 scenes for training and 312 scenes for testing.

Implementation details. SK-Net is implemented in TensorFlow with CUDA. We train our model on a GeForce GTX Titan X. In general, we set the number of Skeypoints to 192, K to 16, and H to 32 in the PDE module. In most experiments, the hyperparameters δ and θ of the two regulating losses are both 0.05, and the weights of all loss terms are identical. In addition, we use the Adam (Kingma and Ba 2014) optimizer with an initial learning rate of 0.001; the learning rate is decreased by staircase exponential decay. The batch size is 16. All layers are implemented with batch normalization. PReLU activation is applied to the layers of the point feature extraction and Skeypoint inference components, while ReLU activation is applied to every layer of the subsequent network.

4.1 Classification on ModelNet

For a fair comparison, the ModelNet10/40 datasets used in our experiments are those preprocessed by (Qi et al. 2017b). By default, 1024 input points are used. Moreover, we attempt to use more points and surface normals as additional features to improve performance. Table 1 shows the classification accuracy of state-of-the-art methods on point cloud representations. On ModelNet10, our network outperforms state-of-the-art methods with 95.8% accuracy using 1024 points, and achieves a better accuracy of 96.2% with 5000 points and normal information. On ModelNet40, the network achieves a comparable result of 92.7% when trained with 5000 points and surface normal vectors as additional features. Although SO-Net presents the best result on ModelNet40, its network is trained separately and is not generic to large-scale semantic segmentation, while our proposed model is trained end-to-end and does extend to large-scale point cloud analysis.

4.2 Part segmentation on ShapeNetPart

Part segmentation is more challenging than object classification and can be formulated as a per-point classification problem. We use the segmentation network for this task. For fairness, the ShapeNetPart dataset is prepared by (Qi et al. 2017b) and we also follow the metric protocol from (Qi et al. 2017b). The results of the related approaches are shown in Table 2. Although the best mIoU over all shapes is achieved by (Liu et al. 2019a), ours outperforms state-of-the-art methods in six categories and achieves comparable results in the remaining categories. We provide ShapeNetPart segmentation visualizations in the supplementary material.

4.3 Semantic scene labeling on ScanNet
We conduct experiments on ScanNet's semantic scene labeling task to validate that our method is suitable for large-scale point cloud analysis, again using our segmentation network. For this task, each point cloud is normalized into [−1, 1] because Lsep is sensitive to the distribution of point clouds. All experiments use 8192 points. We compare the per-point accuracy and the number of local region features of each method, as shown in Table 3. Our approach is better than PointNet and slightly behind PointNet++. However, our number of local region features is only 192, which is much smaller than that of PointNet++. These results show that our method extends to large-scale semantic segmentation and that the local region features obtained by our method are compact and efficient.
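The normalization into [−1, 1] mentioned above can be sketched as follows. This is one plausible interpretation (center, then scale by the largest absolute coordinate); the paper does not give the exact formula.

```python
import numpy as np

def normalize_unit_cube(points):
    """Center a point cloud at the origin and scale it into [-1, 1]
    (an assumed interpretation of the paper's normalization step)."""
    centered = points - points.mean(axis=0)
    scale = np.abs(centered).max()
    return centered / scale if scale > 0 else centered
```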

4.4 Ablation study
In this subsection, we conduct several ablation experiments to investigate various setup variations and to demonstrate the advantages of SK-Net.

Effects of features extracted in the PDE module We present an ablation test on ModelNet10 classification to show the effect of the features extracted by the PDE module, i.e., detail features and pattern features, as shown in Figure 3. The accuracies of the experiments are as follows: 93.9% (using only detail features), 93.3% (using only pattern features), and 95.8% (using their concatenation). This demonstrates that our proposed PD feature promotes correlation between different regions of a point cloud and is more effective in modeling the whole spatial distribution of the point cloud.

Complementarity of regulating losses Ablation experiments on the losses are presented to validate the effect of our loss functions (classification loss Lcls, separation loss Lsep, close loss Lclose). As shown in Table 4, EXP.1 is the baseline, where only the classification loss is used. The results show that our method is effective. EXP.2, 3, and 4 show that the two regulating losses proposed by our scheme are mutually

arXiv:2003.14014v1 [cs.CV] 31 Mar 2020

Table 1: Object classification accuracy (%) on ModelNet.

Method | Representation | Input | ModelNet10 (Class / Instance) | ModelNet40 (Class / Instance)
PointNet (Qi et al. 2017a) | points | 1024×3 | − / − | 86.2 / 89.2
PointNet++ (Qi et al. 2017b) | points+normal | 5000×6 | − / − | − / 91.9
Kd-Net (Klokov and Lempitsky 2017) | points | 2^15×3 | 93.5 / 94.0 | 88.5 / 91.8
OctNet (Riegler, Osman Ulusoy, and Geiger 2017) | points | 128^3 | 90.1 / 90.9 | 83.8 / 86.5
SCN (Xie et al. 2018) | points | 1024×3 | − / − | 87.6 / 90.0
ECC (Simonovsky and Komodakis 2017) | points | 1000×3 | 90.0 / 90.8 | 83.2 / 87.4
KC-Net (Shen et al. 2018) | points | 2048×3 | − / 94.4 | − / 91.0
DGCNN (Wu et al. 2018) | points | 1024×3 | − / − | 90.2 / 92.2
PointGrid (Le and Duan 2018) | points | 1024×3 | − / − | 88.9 / 92.0
PointCNN (Li et al. 2018) | points | 1024×3 | − / − | − / 91.7
PCNN (Atzmon, Maron, and Lipman 2018) | points | 1024×3 | − / 94.9 | − / 92.3
Point2Sequence (Liu et al. 2019a) | points | 1024×3 | 95.1 / 95.3 | 90.4 / 92.6
SO-Net (Li, Chen, and Hee Lee 2018) | points+normal | 5000×6 | 95.5 / 95.7 | 90.8 / 93.4
Ours | points | 1024×3 | 95.6 / 95.8 | 89.0 / 91.5
Ours | points+normal | 5000×6 | 96.2 / 96.2 | 90.3 / 92.7

Table 2: Object part segmentation results on ShapeNetPart dataset.

Intersection over Union (IoU), %
Method | mean | air | bag | cap | car | chair | ear. | gui. | knife | lamp | lap. | motor | mug | pistol | rocket | skate | table
PointNet (Qi et al. 2017a) | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6
PointNet++ (Qi et al. 2017b) | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6
Kd-Net (Klokov and Lempitsky 2017) | 82.3 | 80.1 | 74.6 | 74.3 | 70.3 | 88.6 | 73.5 | 90.2 | 87.2 | 81.0 | 94.9 | 57.4 | 86.7 | 78.1 | 51.8 | 69.9 | 80.3
KC-Net (Shen et al. 2018) | 84.7 | 82.8 | 81.5 | 86.4 | 77.6 | 90.3 | 76.8 | 91.0 | 87.2 | 84.5 | 95.5 | 69.2 | 94.4 | 81.6 | 60.1 | 75.2 | 81.3
DGCNN (Wu et al. 2018) | 85.1 | 84.2 | 83.7 | 84.4 | 77.1 | 90.9 | 78.5 | 91.5 | 87.3 | 82.9 | 96.0 | 67.8 | 93.3 | 82.6 | 59.7 | 75.5 | 82.0
Point2Sequence (Liu et al. 2019a) | 85.2 | 82.6 | 81.8 | 87.5 | 77.3 | 90.8 | 77.1 | 91.1 | 86.9 | 83.9 | 95.7 | 70.8 | 94.6 | 79.3 | 58.1 | 75.2 | 82.8
SO-Net (Li, Chen, and Hee Lee 2018) | 84.9 | 82.8 | 77.8 | 88.0 | 77.3 | 90.6 | 73.5 | 90.7 | 83.9 | 82.8 | 94.8 | 69.1 | 94.2 | 80.9 | 53.1 | 72.9 | 83.0
Ours | 85.0 | 82.9 | 80.7 | 87.6 | 77.8 | 90.5 | 79.9 | 91.0 | 88.1 | 84.0 | 95.7 | 69.9 | 94.0 | 81.1 | 60.8 | 76.4 | 81.9

Table 3: Comparisons of per-point classification on ScanNet.

Method | Local regions | Local selection operator | Accuracy
PointNet | − | − | 77.7%
PointNet++ (ssg) | [1024, 256, 64, 16] | FPS | 82.6%
PointNet++ (msg) | [512, 128] | FPS | 83.2%
Ours | 192 | End-to-end learning | 81.4%

reinforcing, and neither of them can be omitted. Combining these two losses refines the spatial model and improves performance by 1.5%.
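One plausible reading of the two regulating losses is a pair of hinge terms: a separation term that pushes Skeypoints apart when any pair is closer than δ, and a close term that pulls each Skeypoint toward the shape when it lies farther than θ from its nearest input point. The exact loss forms used in SK-Net may differ from this sketch; only the thresholds δ = θ = 0.05 are stated.

```python
import numpy as np

def separation_loss(skeypoints, delta=0.05):
    """Hinge-style sketch of Lsep: penalize pairs of Skeypoints that are
    closer than delta, encouraging them to spread out."""
    d = np.linalg.norm(skeypoints[:, None] - skeypoints[None], axis=-1)
    pairs = np.triu(np.ones_like(d, dtype=bool), k=1)  # unique pairs only
    return float(np.maximum(delta - d[pairs], 0).mean())

def close_loss(skeypoints, points, theta=0.05):
    """Hinge-style sketch of Lclose: penalize Skeypoints that are farther
    than theta from their nearest input point, pulling them onto the shape."""
    nearest = np.linalg.norm(
        skeypoints[:, None] - points[None], axis=-1).min(axis=1)
    return float(np.maximum(nearest - theta, 0).mean())
```

Under this reading the two terms are in tension, which matches the ablation: used alone, either one can distort the keypoint layout, while together they balance coverage against adherence to the shape.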

Table 4: Ablation test of the losses on ModelNet10 classifi-cation.

EXP. | Lcls | Lsep | Lclose | Accuracy
1 | ✓ | − | − | 94.3%
2 | ✓ | ✓ | − | 93.8%
3 | ✓ | − | ✓ | 93.2%
4 | ✓ | ✓ | ✓ | 95.8%

Effects of local region sampling In this part, we experiment with other techniques (ball query and SIFT query, proposed by (Jiang et al. 2018)) to sample local regions, and also vary the search radius over 0.1 and 0.2. For all experiments, the number of sampled points is 32 and the number of input points is 1024. As shown in Table 5, the SIFT query is the least effective method and the ball query is slightly worse than kNN. This shows that kNN is the most beneficial to our scheme.

Table 5: Effects of local sample choices on ModelNet10classification.

Method | kNN | ball query (r=0.1) | ball query (r=0.2) | SIFT query (r=0.1) | SIFT query (r=0.2)
Accuracy | 95.8% | 94.3% | 94.7% | 92.5% | 92.7%
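For reference, the two standard grouping operators compared above can be sketched as follows. The ball query follows the PointNet++ convention of padding short groups with the first hit; the SIFT query of (Jiang et al. 2018) adds directional constraints and is omitted here.

```python
import numpy as np

def knn_group(points, centers, k=32):
    """kNN grouping: each center collects its k nearest input points."""
    d = np.linalg.norm(centers[:, None] - points[None], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def ball_query(points, centers, radius=0.1, k=32):
    """Ball query: up to k points within `radius` of each center,
    padded with the first hit when fewer than k points fall inside."""
    d = np.linalg.norm(centers[:, None] - points[None], axis=-1)
    groups = []
    for row in d:
        idx = np.nonzero(row <= radius)[0]
        if idx.size == 0:
            idx = np.array([np.argmin(row)])  # fall back to nearest point
        pad = np.full(k, idx[0])
        pad[:min(k, idx.size)] = idx[:k]
        groups.append(pad)
    return np.stack(groups)
```

kNN adapts the region size to the local density, which may explain its edge over fixed-radius queries in Table 5.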

Effectiveness of Skeypoints To exhibit the effectiveness of our Skeypoints, we eliminate them and instead train the network with keypoints obtained, respectively, by random dropout, the farthest point sampling (FPS) algorithm, and the SOM. We use these methods to discover a set of keypoint-like points for building the spatial pattern of a point cloud. Note that the keypoints are jittered slightly to avoid degenerating into naive downsampling in the random dropout and FPS methods. In addition, we use 11×11 SOM nodes processed by SO-Net (Li, Chen, and Hee Lee 2018), and likewise the number of keypoints is 121 for all comparative experiments. Subsequently, we evaluate the accuracy of the different spatial modeling methods on ModelNet10. Visualization examples and accuracy results are illustrated in Figure 5 and Figure 6. Interestingly, the keypoints selected by the FPS algorithm and learned by the SOM are more evenly distributed than ours, yet yield approximately 2 to 3 percent lower accuracy. Moreover, the unlearned random dropout and FPS methods are more sensitive to density perturbations of the points. By contrast, our Skeypoints are jointly optimized with the feature learning for a specific point cloud task, which promotes robustness to the variability of point density and yields better performance.
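The FPS baseline compared here is the classic greedy procedure of (Eldar et al. 1997); a minimal sketch follows. The `start` index is a deterministic stand-in for the usual random seed point.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the set already
    chosen. Used in our comparison as a non-learned keypoint selector."""
    chosen = [start]
    dists = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dists))  # farthest from current selection
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)
```

Because FPS maximizes coverage without regard to the task, its keypoints are evenly spread but carry no learned shape information, which matches the accuracy gap reported above.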


Figure 5: A point cloud is shown on the left-most side, with the corresponding normalized keypoints (red balls) discovered by the different methods shown alongside: learned by the SOM, selected by random dropout, selected by the FPS algorithm, and discovered by ours.

Figure 6: Curves of accuracy versus number of points for the spatial modeling approaches (random dropout, FPS algorithm, SOM, and ours) at various point densities.

Robustness to corruption The network is trained with 1024 input points and evaluated with random point dropout to demonstrate robustness to point density variability. On ModelNet10/40, our accuracy drops by 3.4% and 4.5% respectively with 50% of the points missing (1024 to 512), and remains at 75.1% and 63.0% respectively with 256 points, as shown in Figure 7(a). Moreover, our SK-Net is robust to noise or corruption of the Skeypoints, as shown in Figure 7(b). When the noise sigma reaches the maximum of 0.6 in our experiments, the accuracy on ModelNet10/40 is 89.1% and 86.5%, respectively.
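The two test-time corruptions used in this experiment can be sketched as follows; these are straightforward implementations of the perturbations described above, not code from the paper.

```python
import numpy as np

def random_dropout(points, keep, rng):
    """Randomly keep `keep` of the input points (density corruption)."""
    idx = rng.choice(len(points), size=keep, replace=False)
    return points[idx]

def jitter_keypoints(keypoints, sigma, rng):
    """Add Gaussian noise N(0, sigma) to the Skeypoints, as in Figure 7(b)."""
    return keypoints + rng.normal(0.0, sigma, size=keypoints.shape)
```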

Figure 7: Robustness to corruption, shown as accuracy curves on ModelNet10 and ModelNet40. (a) Accuracy versus number of points under random point dropout during testing; the size H is decreased by 8 when the number of input points is less than 512. (b) Accuracy versus Skeypoint noise sigma when Gaussian noise N(0, σ) is added to the generated Skeypoints during testing.

Effects of preferences Our proposed method is more than simply using the end-to-end learned Skeypoints to choose local regions for extracting local features, as an existing method (Qi et al. 2017b) does. The Skeypoints also play an important role in the spatial modeling of a point cloud, enhancing regional associations and prompting our model to obtain more compact regional features, as illustrated in Table 3. Therefore, the number of generated Skeypoints should be chosen so that the Skeypoints cover the point cloud adequately, and the size K of the normalized Skeypoints of a local spatial region should properly reinforce the region correlation of the point cloud. As shown in Figure 8(a), our network gradually performs better as the number of Skeypoints increases from 32 to 192, while accuracy decreases slightly with 256 Skeypoints. The effect of the size K is shown in Figure 8(b), with the number of generated Skeypoints fixed at 128. The best accuracies are achieved when K is 16.

Figure 8: Effect of preferences, shown as accuracy curves on ModelNet10 and ModelNet40. (a) The effect of the number of Skeypoints when K is 16. (b) The effect of various K when fixing 128 Skeypoints.

5 Conclusions
In this work, we propose an end-to-end framework named SK-Net to jointly optimize the inference of Skeypoints with the feature learning of a point cloud. These Skeypoints are generated by two complementary regulating losses. Specifically, their generation requires neither location labels and proposals nor location consistency. Furthermore, we design the PDE module to extract and integrate the detail feature and the pattern feature, so that local region feature extraction and the spatial modeling of a point cloud can be achieved efficiently. As a result, the correlation between different regions of a point cloud is enhanced, and the learning of the contextual information of the point cloud is promoted. In addition, our experiments show that for both point cloud classification and segmentation, our method achieves better or similar performance compared with other existing methods. The advantages of SK-Net are also demonstrated by several ablation experiments.

Acknowledgement
Yunqi Lei is the corresponding author. This work was supported by the National Natural Science Foundation of China (61671397). We thank all anonymous reviewers for their constructive comments.


References
[Atzmon, Maron, and Lipman 2018] Atzmon, M.; Maron, H.; and Lipman, Y. 2018. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091.
[Belongie, Malik, and Puzicha 2001] Belongie, S.; Malik, J.; and Puzicha, J. 2001. Shape context: A new descriptor for shape matching and object recognition. In Advances in Neural Information Processing Systems, 831–837.
[Dai et al. 2017] Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nießner, M. 2017. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5828–5839.
[Eldar et al. 1997] Eldar, Y.; Lindenbaum, M.; Porat, M.; and Zeevi, Y. Y. 1997. The farthest point strategy for progressive image sampling. In IEEE Transactions on Image Processing, 1305–1315.
[Feng et al. 2019] Feng, M.; Hu, S.; Ang, M.; and Lee, G. H. 2019. 2D3D-MatchNet: Learning to match keypoints across 2D image and 3D point cloud. arXiv preprint arXiv:1904.09742.
[Georgakis et al. 2018] Georgakis, G.; Karanam, S.; Wu, Z.; Ernst, J.; and Kosecka, J. 2018. End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1965–1973.
[Georgakis et al. 2019] Georgakis, G.; Karanam, S.; Wu, Z.; and Kosecka, J. 2019. Learning local RGB-to-CAD correspondences for object pose estimation. arXiv preprint arXiv:1811.07249.
[He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.
[Hua, Tran, and Yeung 2018] Hua, B.; Tran, M.; and Yeung, S. 2018. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 984–993.
[Jiang et al. 2018] Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; and Lu, C. 2018. PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation. arXiv preprint arXiv:1807.00652.
[Kakuda et al. 1998] Kakuda, N.; Miwa, T.; Nagaoka, M.; and Kohonen, T. 1998. The self-organizing map. Neurocomputing 21(1):1–6.
[Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Klokov and Lempitsky 2017] Klokov, R., and Lempitsky, V. 2017. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, 863–872.
[Lan et al. 2019] Lan, S.; Yu, R.; Yu, G.; and Davis, L. S. 2019. Modeling local geometric structure of 3D point clouds using Geo-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 998–1008.
[Le and Duan 2018] Le, T., and Duan, Y. 2018. PointGrid: A deep network for 3D shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9204–9214.
[Li et al. 2018] Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, 820–830.
[Li, Chen, and Hee Lee 2018] Li, J.; Chen, B. M.; and Hee Lee, G. 2018. SO-Net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9397–9406.
[Liu et al. 2019a] Liu, X.; Han, Z.; Liu, Y.-S.; and Zwicker, M. 2019a. Point2Sequence: Learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 8778–8785.
[Liu et al. 2019b] Liu, Y.; Fan, B.; Xiang, S.; and Pan, C. 2019b. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8895–8904.
[Lowe 2004] Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision, 91–110.
[Lu et al. 2019] Lu, W.; Wan, G.; Zhou, Y.; Fu, X.; Yuan, P.; and Song, S. 2019. DeepICP: An end-to-end deep neural network for 3D point cloud registration. arXiv preprint arXiv:1905.04153.
[Qi et al. 2017a] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
[Qi et al. 2017b] Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, 5099–5108.
[Riegler, Osman Ulusoy, and Geiger 2017] Riegler, G.; Osman Ulusoy, A.; and Geiger, A. 2017. OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3577–3586.
[Shen et al. 2018] Shen, Y.; Feng, C.; Yang, Y.; and Tian, D. 2018. Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4548–4557.
[Simonovsky and Komodakis 2017] Simonovsky, M., and Komodakis, N. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3693–3702.
[Suwajanakorn et al. 2018] Suwajanakorn, S.; Snavely, N.; Tompson, J. J.; and Norouzi, M. 2018. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems, 2059–2070.
[Wu et al. 2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.
[Wu et al. 2018] Wu, B.; Liu, Y.; Lang, B.; and Huang, L. 2018. DGCNN: Disordered graph convolutional neural network based on the Gaussian mixture model. Neurocomputing 321:346–356.
[Xie et al. 2018] Xie, S.; Liu, S.; Chen, Z.; and Tu, Z. 2018. Attentional ShapeContextNet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4606–4615.
[Yi et al. 2016] Yi, L.; Kim, V. G.; Ceylan, D.; Shen, I.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; Guibas, L.; et al. 2016. A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics 35(6):210.


[Zhou et al. 2018] Zhou, X.; Karpur, A.; Gan, C.; Luo, L.; and Huang, Q. 2018. Unsupervised domain adaptation for 3D keypoint estimation via view consistency. In European Conference on Computer Vision, 137–153.