

CrackFormer: Transformer Network for Fine-Grained Crack Detection

Huajun Liu1, Xiangyu Miao1, Christoph Mertz2, Chengzhong Xu3, Hui Kong3*

1Nanjing University of Science and Technology, 2Carnegie Mellon University, 3University of Macau
{liuhj, miaoxy}@njust.edu.cn, [email protected], {czxu, huikong}@um.edu.mo

Abstract

Cracks are irregular line structures that are of interest in many computer vision applications. Crack detection (e.g., from pavement images) is a challenging task due to intensity in-homogeneity, topology complexity, low contrast and noisy background. The overall crack detection accuracy can be significantly affected by the detection performance on fine-grained cracks. In this work, we propose a Crack Transformer network (CrackFormer) for fine-grained crack detection. The CrackFormer is composed of novel attention modules in a SegNet-like encoder-decoder architecture. Specifically, it consists of novel self-attention modules with 1×1 convolutional kernels for efficient contextual information extraction across feature-channels, and efficient positional embedding to capture large receptive field contextual information for long-range interactions. It also introduces new scaling-attention modules that combine outputs from the corresponding encoder and decoder blocks to suppress non-semantic features and sharpen semantic ones. The CrackFormer is trained and evaluated on three classical crack datasets. The experimental results show that the CrackFormer achieves Optimal Dataset Scale (ODS) values of 0.871, 0.877 and 0.881, respectively, on the three datasets and outperforms the state-of-the-art methods.

1. Introduction

Pavement crack detection from images is a challenging issue due to intensity inhomogeneity, topology complexity, low contrast, and noisy texture background [18]. In addition, the diversity of cracks (thin, grid or thick cracks, etc.) makes it more difficult.

There are a large number of studies on crack detection [6, 22, 2, 36, 37, 35, 10]. Recent studies have employed convolutional neural networks (CNNs) to boost detection accuracy to a higher level. In this study, we consider the problem of detecting thin cracks from the image of an asphalt surface. In general, it is much easier to detect thick cracks than thin cracks. Thus, crack detection performance is largely affected by how well a method can detect thin cracks.

*Corresponding author

Figure 1. Crack prediction from our CrackFormer model (best viewed in color). The upper left is a classical crack image. The upper right is the predicted result. The bottom shows a profile slice with normalized grey scale, its ground truth and the corresponding predicted crack probabilities.


The state-of-the-art (SOTA) methods heavily rely on Fully Convolutional Networks (FCNs) [9], such as SegNet [31], U-Net [27] and their variants [21]. SegNets and U-Nets use an encoder-decoder architecture, where the encoder extracts high-level semantic representations by using a cascade of convolution and pooling layers, and the decoder leverages memorized pooling indices or skip connections to re-use high-resolution feature maps from the encoder in order to recover the spatial information lost in the high-level representations. Despite their outstanding performance, these methods suffer from limitations in complex segmentation tasks, e.g., when dealing with thin cracks or when there is low contrast between crack and background. In general, these models rely on stacked 3 × 3 convolution and pooling operations, and cannot achieve pixel-level segmentation precision in the convolution-pooling pipeline, resulting in blurred and coarse crack segmentation. Moreover, suffering from the limited receptive field of 3 × 3 convolutional kernels, these methods tend to fail in detecting long cracks, resulting in discontinuous crack detection.



In this work, we propose a Crack Transformer network (CrackFormer) that combines novel self-attention and scaling-attention mechanisms for crack detection. It leverages the merits of Transformer models [30] to capture long-range interactions while simultaneously adopting small convolution kernels for fine-grained attentive perception. CrackFormer keeps a regular layout by using a SegNet-like architecture, but introduces attention mechanisms in two different ways. Fig. 2 shows our network structure. The main contributions of this paper can be summarized as follows:

1. A new self-attention block (Self-AB) is proposed (Fig. 3). The Self-AB fully extracts contextual information across feature-channels by leveraging 1×1 convolution kernels, and captures large receptive field contextual information across the spatial domain by an efficient position embedding.

2. A new scaling-attention block (Scal-AB) is proposed (Fig. 4), where a set of scaling-attention masks is generated by nonlinearizing the encoder's feature maps, and used to suppress non-semantic features and sharpen the semantic cracks.

3. We propose a Transformer encoder-decoder structure integrating the proposed Self-AB and Scal-AB blocks, where the Self-AB is embedded into different levels of the encoder and decoder modules, and the Scal-AB is introduced between the encoder feature maps and the corresponding decoder ones.

A crack prediction result by our method is shown in Fig. 1, where the original image is shown in the upper left and the predicted result in the upper right. In the lower row, we can observe from the profile that the cracks are predicted precisely.

2. Related Work

Crack detection via classification - Since CNNs were introduced to pavement crack detection, research in this field has achieved significant progress [6, 22, 2, 36, 37, 35, 10]. Earlier works on crack detection are based on an object-detection pipeline for damage region-proposal and damage classification. For instance, the Faster R-CNN [6], YOLO [2], SSD Inception and SSD MobileNet [22], etc., have been used for pavement damage-region extraction. Although these bounding-box based methods can detect crack regions reasonably well, they do not provide information as precise as pixel-wise segmentation methods do, e.g., a crack's width and shape.

Crack detection via segmentation - Since Zhang et al. [36] proposed pixel-level asphalt crack detection based on CNN models, more accurate methods have analyzed pavement damage using deep neural networks [37, 35, 10]. For example, Liu et al. [21] proposed a pyramid feature aggregation network and a Conditional Random Field (CRF) post-processing scheme for crack segmentation. Zou et al. [37] provided a multi-stage fusion on the SegNet encoder-decoder architecture for crack segmentation. Yang et al. [35] proposed a feature pyramid and hierarchical boosting network for pavement crack detection, which integrates context information into low-level features in a feature-pyramid way. Fei et al. [10] proposed the CrackNet-V model, which stacks several 3 × 3 convolutional layers and a 15 × 15 convolution kernel for deep abstraction to achieve high performance for crack segmentation. Although these segmentation-based crack detection methods have obtained promising results, they cannot achieve satisfying pixel-level segmentation precision and result in blurred and coarse segmentation.

Self attention - Very recently, the Transformer [30]'s self-attention mechanism [28, 4, 7, 26, 3] has been adopted or improved for image segmentation tasks. DANet [11] proposed a parallel position-attention and channel-attention enhanced FCN, but its computational complexity, $O((HW)^2C) + O(HWC^2)$, is high. CCNet [15] harvests the contextual information in horizontal and vertical directions to enhance pixel-wise representative capability, and works more efficiently than the non-local block [33]. Axial-attention [32] shows that self-attention layers alone can be stacked to form a fully attentional model by restricting the receptive field of self-attention to a local square region for panoptic segmentation.

Moreover, it has been shown [7] that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Explorations on replacing the 3 × 3 convolution kernels in popular backbones, such as ResNet [13], by a self-attention augmented convolution model [4], a stand-alone self-attention model [26] or a Lambda-attention layer [3] have yielded remarkable gains. These self-attention works, with the merits of long-distance interaction, local receptive field, and computation- and memory-efficiency, have inspired us to explore a more efficient and effective self-attention mechanism for the crack segmentation task.

Scaling attention - Self attention is effective for global-dependency modeling, and it is likely to be valuable for connected and long-range crack segmentation. However, self-attention alone might not be enough for fine-grained cracks, which can be strongly affected by noisy background. Therefore, we seek help from the scaling-attention mechanism.

Scaling attention focuses on emphasizing semantic features and suppressing non-semantic ones.


Figure 2. The structure of Crack Transformer network.

For example, the attention-gate mechanism [23] identifies salient image regions and prunes feature responses to preserve only the activations relevant to the specific task, boosting segmentation performance. The squeeze-and-excitation (SE) [14] module uses global average pooling and a linear layer to compute a scaling factor for each channel and then scales the channels accordingly. The convolutional block attention module (CBAM) [24] adds global max pooling in addition to global average pooling, plus an extra spatial attention sub-module, to compute scaling factors on the channel and spatial domains separately. Spatial attention [12] and multi-scale attention on the U-Net [5] combine local features with their corresponding global dependencies, explicitly modeling the dependencies between channels and different-scale spatial information in segmentation tasks. Oktay et al. [23] propose a soft-attention mechanism to softly weight the encoder and decoder features at each pixel location. These scaling-attention or attention-gate methods operate on a local receptive field and help sharpen semantic features and suppress non-semantic ones via a soft mask after Sigmoid normalization.

3. Our Work

3.1. Overview

The CrackFormer adopts the basic structure of SegNet [31], shown in Fig. 2. To establish long-range interaction between the low-level feature maps, we propose novel self-attention blocks as the basic module. To enhance crack crispness, we introduce a local attention block between the encoder and the corresponding decoder features to generate attention masks. Finally, multi-stage side fusion is exploited between feature maps of different stages to fuse coarse-to-fine cracks and obtain a refined result.

Similar to SegNet, the CrackFormer's encoder consists of the first 13 convolutional layers of the VGG16 [29] network, deployed in 5 stages according to the layout of {2, 2, 3, 3, 3}. Meanwhile, the corresponding decoder has a symmetrical layout of {3, 3, 3, 2, 2}. In contrast, the 3 × 3 convolution module at each layer of SegNet is replaced by the self-attention block in the CrackFormer (Sec. 3.2).

At the end of each stage, max pooling with a 2×2 window and stride of 2 (non-overlapping windows) is used to reduce the size of the feature maps by one half. The max-pooling indices, i.e., the locations of the maximum feature value in each pooling window, are memorized for each encoder feature map. The corresponding decoder upsamples its input feature map(s) using the memorized max-pooling indices from the corresponding encoder feature map(s). At each stage, the corresponding encoder and decoder features are concatenated to generate an attentive mask via a scaling-attention block to refine the tensor of each stage (Sec. 3.3).
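This index-preserving pooling/unpooling pair is available off the shelf in PyTorch; below is a minimal sketch (not the authors' code) of how an encoder stage can memorize pooling indices and a decoder stage can reuse them. Tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# pooling that also returns the argmax locations within each 2x2 window
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# unpooling that scatters values back to the memorized locations
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 512, 512)                 # an encoder feature map
down, indices = pool(x)                          # halve spatial size, keep indices
# ... decoder computation on `down` ...
up = unpool(down, indices, output_size=x.shape)  # restore memorized positions
```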

The predicted results of all stages are then fused to generate the final result. The predicted results of each stage and the fused features are resized to the original dimensions of the input image, and the model is supervised by a multi-term loss function in the training phase.

3.2. Self Attention for Long Range Capture

In the CrackFormer, the self-attention block (Self-AB) in the encoder and decoder is a bottleneck module with two CBR (Conv-BatchNorm-ReLU) blocks, each composed of a 1×1 conv, BatchNorm and ReLU, and a self-attention layer (SAL for short) between them (Fig. 3(b)).

Figure 3. The self-attention block and self-attention layer.
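As a rough structural sketch (our own naming, not the released implementation), the Self-AB can be expressed as a CBR-SAL-CBR bottleneck; the `SelfAttentionLayer` it wraps is sketched after Eq. (2) below.

```python
import torch.nn as nn

def cbr(in_ch, out_ch):
    """Conv(1x1)-BatchNorm-ReLU, the CBR block of the bottleneck."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class SelfAttentionBlock(nn.Module):
    """Self-AB bottleneck: CBR -> self-attention layer (SAL) -> CBR."""
    def __init__(self, in_ch, mid_ch, out_ch, sal):
        super().__init__()
        self.pre, self.sal, self.post = cbr(in_ch, mid_ch), sal, cbr(mid_ch, out_ch)

    def forward(self, x):
        return self.post(self.sal(self.pre(x)))
```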

3785

Page 4: CrackFormer: Transformer Network for Fine-Grained Crack

The self-attention layer is a position-embedded self-attention block with, simultaneously, a large receptive field and 1×1 convolution kernels, shown in Fig. 3(a). Let $X \in \mathbb{R}^{d_{in} \times WH}$ be an input tensor, where $W$ and $H$ represent the width and height of the image and $d_{in}$ denotes the channel dimension of the input tensor. This layer applies three 1×1 convolutions to generate the keys, queries, and values, and produces new features $F^c \in \mathbb{R}^{d_{out} \times WH}$ ($d_{out}$ denotes the channel dimension of the output features) using the following content-based multi-head self-attention operation,

$$F^c = Q \otimes \big(\sigma(K^\top) \otimes V\big), \tag{1}$$

where $\otimes$ is a matrix multiplication operation. Let $h$ be the number of heads, $d_u$ the intra-depth dimension, $r$ the receptive field size, and $d_k$ and $d_v$ the dimensions of the tensors $K$ and $V$, respectively. Then we have $Q \in \mathbb{R}^{d_k \times h \times WH}$, $K \in \mathbb{R}^{d_k \times d_u \times WH}$, and $V \in \mathbb{R}^{d_v \times d_u \times WH}$. Let $\sigma$ denote the operation of applying softmax normalization to the tensor. This attention operation can be interpreted as first aggregating the pixel features in $V$ into global context vectors using the weights in $\sigma(K^\top)$, and then redistributing the global context vectors back to individual pixels using the weights in $Q$. We notice its similarity to the one used in Bello [3], but it does not use batch normalization on queries and values. Softmax normalization of the keys constrains the output features to be convex combinations of the global context vectors.

The relative position embedding allows the global context vector to obtain an effective receptive field in a neighbouring region. A relative positional embedding kernel $E^r_{nm} \in \mathbb{R}^{d_k \times d_u \times r \times r}$ is defined as learnable weight parameters, where $r$ indexes the possible relative positions for all $(n, m)$ pairs. The kernel $E^r_{nm}$ is embedded into the context vectors by applying it to $V$ as a convolution, according to the following equation,

$$F^p = Q \otimes \big(E^r_{nm} \otimes V\big). \tag{2}$$

Finally, the output context vector of the self-attention layer is the element-wise addition of the global content vector and the position-embedded vector, $F^n = F^c \oplus F^p$, where $\oplus$ is the matrix element-wise addition operator. Note that the computational and memory complexities of this layer are $O(N)$ in the number of pixels.
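A minimal PyTorch sketch of such a layer is given below, written from Eq. (1)-(2) rather than from the released code; the head count, depths and the use of a 3D convolution to realize the positional kernel $E^r_{nm}$ are our own assumptions in the spirit of the lambda layer [3].

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Content (Eq. 1) + position (Eq. 2) attention via 1x1 convolutions,
    linear in the number of pixels. Hyper-parameters are illustrative."""
    def __init__(self, din, dout, dk=16, du=4, heads=4, r=7):
        super().__init__()
        assert dout % heads == 0
        self.h, self.dk, self.du, self.dv = heads, dk, du, dout // heads
        self.to_q = nn.Conv2d(din, dk * heads, 1, bias=False)
        self.to_k = nn.Conv2d(din, dk * du, 1, bias=False)
        self.to_v = nn.Conv2d(din, self.dv * du, 1, bias=False)
        # relative position embedding E, realized as a conv over the values
        self.pos = nn.Conv3d(du, dk, (1, r, r), padding=(0, r // 2, r // 2))

    def forward(self, x):
        b, _, H, W = x.shape
        n = H * W
        q = self.to_q(x).reshape(b, self.h, self.dk, n)
        k = self.to_k(x).reshape(b, self.du, self.dk, n)
        v = self.to_v(x).reshape(b, self.du, self.dv, n)
        # Eq. (1): softmax over keys -> global context, redistributed by q
        ctx = torch.einsum('bukn,buvn->bkv', k.softmax(dim=-1), v)
        fc = torch.einsum('bhkn,bkv->bhvn', q, ctx)
        # Eq. (2): E acts as a convolution kernel applied to the values
        ep = self.pos(v.reshape(b, self.du, self.dv, H, W))
        fp = torch.einsum('bhkn,bkvn->bhvn', q,
                          ep.reshape(b, self.dk, self.dv, n))
        return (fc + fp).reshape(b, self.h * self.dv, H, W)  # F^n = F^c + F^p
```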

3.3. Scaling Attention for Sharpening Crack

At each stage, the feature vectors in the encoder and decoder are connected and combined by the scaling-attention block (Scal-AB) to generate a salient and crisp crack boundary map (Fig. 4). The attention-gate mechanism in Attention U-Net [23] suggests that the feature vectors in a specific decoder block can boost segmentation performance when combined with the feature vectors in the corresponding encoder block. In essence, the attention-gate mechanism generates an attentive mask, which is normalized to $\alpha_i \in [0, 1]$ by a Sigmoid activation function and multiplied element-wise with the features to be refined. In this way, it acts as a filter that activates features within a region of interest and simultaneously suppresses other irrelevant features.

Thus, we propose a scaling-attention block on the encoder and decoder features. Specifically, at each stage of the CrackFormer, we use the encoder features to generate an attentive mask as attention coefficients and multiply them element-wise with the corresponding decoder features to activate crack features and suppress the non-crack ones.

Figure 4. The scaling-attention block.

Taking the $k$th-stage fusion as an example, the features from the encoder and decoder are $\{X^k_1, X^k_2, X^k_3\}$ and $\{Y^k_1, Y^k_2, Y^k_3\}$, respectively. Based on Fig. 4, the mask is generated according to

$$L^k_{Mask} = \delta\Big(BN\big(\otimes_{3\times3}\big(\Gamma(X^k_1, X^k_2, X^k_3)\big)\big)\Big), \tag{3}$$

where $\Gamma(\cdot)$ denotes a tensor concatenation operation, $\otimes_{3\times3}(\cdot)$ denotes a 3×3 convolution operation followed by a BatchNorm $BN$, and $\delta(\cdot)$ is a Sigmoid activation function. Subsequently, the side output of the $k$th stage is predicted by the scaling-attention mechanism as follows,

$$S^k_{side} = L^k_{Mask} \odot BN\big(\otimes_{3\times3}\big(\Gamma(Y^k_1, Y^k_2, Y^k_3)\big)\big), \tag{4}$$

where $\odot$ denotes an element-wise multiplication operation. The scaling-attention maps at each stage are visualized in Fig. 5; each is an attention coefficient mask from high-level features, with a stronger response around the semantic crack. From the output features of different stages, we can see a coarse-to-fine semantic crack response, which can be used to refine the boundary to be more crisp.

Figure 5. From left to right: the scaling-attention maps from stage 1 to stage 5, respectively.
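Eqs. (3)-(4) translate almost line-by-line into a small module; the sketch below assumes three encoder maps and three decoder maps of equal channel width and a single-channel side output, which are our assumptions rather than the paper's exact tensor shapes.

```python
import torch
import torch.nn as nn

class ScalingAttentionBlock(nn.Module):
    """Scal-AB sketch: an attentive mask from encoder features (Eq. 3) is
    multiplied element-wise onto the decoder features (Eq. 4)."""
    def __init__(self, ch):
        super().__init__()
        # Γ -> 3x3 conv -> BN -> Sigmoid over the concatenated encoder maps
        self.mask = nn.Sequential(nn.Conv2d(3 * ch, 1, 3, padding=1),
                                  nn.BatchNorm2d(1), nn.Sigmoid())
        # Γ -> 3x3 conv -> BN over the concatenated decoder maps
        self.feat = nn.Sequential(nn.Conv2d(3 * ch, 1, 3, padding=1),
                                  nn.BatchNorm2d(1))

    def forward(self, xs, ys):
        m = self.mask(torch.cat(xs, dim=1))   # L^k_Mask, Eq. (3)
        s = self.feat(torch.cat(ys, dim=1))
        return m * s                           # S^k_side, Eq. (4)
```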

After the feature map of each stage is upsampled to the same dimensions as the input image, we obtain five predicted results $S^k_{side}$, $k = 1, 2, \cdots, 5$, which are concatenated and fused to generate the final output $S_{fuse}$, as in HED [34], RCF [19] and DeepCrack [37]. Finally, all side outputs and the fused output are fully supervised by the crack ground-truth labels.
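The fusion step itself can be as simple as a 1×1 convolution over the stacked side outputs, as in HED-style architectures; the sketch below makes that assumption explicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideFusion(nn.Module):
    """Sketch: upsample the five side outputs to the input resolution and
    fuse them with a learned 1x1 convolution to produce S_fuse."""
    def __init__(self, n_sides=5):
        super().__init__()
        self.fuse = nn.Conv2d(n_sides, 1, kernel_size=1)

    def forward(self, sides, size):
        up = [F.interpolate(s, size=size, mode='bilinear',
                            align_corners=False) for s in sides]
        return self.fuse(torch.cat(up, dim=1))
```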

3.4. Loss Function

The balanced-weight cross-entropy loss function used in the RCF network [20] is adopted with a small modification for training: pixels whose normalized label intensities $y_i$ are higher than $\eta$ are considered positive samples, pixels with intensities between 0 and 0.05 are negative samples, and pixels in between are neglected,

$$l(X_i; W) = \begin{cases} \alpha \cdot \log\big(1 - P(X_i; W)\big) & 0 \le y_i \le 0.05 \\ 0 & 0.05 < y_i \le \eta \\ \beta \cdot \log P(X_i; W) & y_i > \eta \end{cases} \tag{5}$$

where $\alpha = \lambda \frac{|Y^+|}{|Y^+| + |Y^-|}$ and $\beta = \frac{|Y^-|}{|Y^+| + |Y^-|}$, with $|Y^+|$ and $|Y^-|$ representing the numbers of positive and negative samples, respectively. $\lambda$ is used to balance the loss ratio of positive and negative samples. Here $X_i$ is the value of pixel $i$, $y_i$ is the probability of pixel $i$ in the labeled image, $P(X_i; W)$ is the predicted probability of the pixel being crack, and $W$ denotes the weights of the model.

To reweigh each side output during training, we weigh the loss on the different side outputs and increase the weights of the last two sides and the fusion side. The total loss function is

$$\mathcal{L}(W) = \sum_{i=1}^{n}\left(\sum_{k=1}^{5} S^k_{side} \cdot l\big(X^k_i; W\big) + S_{fuse} \cdot l\big(X^{fuse}_i; W\big)\right), \tag{6}$$

where $S^k_{side}$, $k \in \{1, 2, 3, 4, 5\}$, represents the loss weight of the $k$th stage, $S_{fuse}$ the loss weight of the fusion layer, and $n$ the total number of pixels in each sample.
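A compact sketch of Eqs. (5)-(6) follows; the explicit minus sign (Eq. (5) is written without it), the numerical epsilon, and the concrete values of eta, lambda and the per-side weights are our assumptions, since the paper only states that the last two sides and the fusion side receive larger weights.

```python
import torch

def side_loss(pred, y, eta=0.5, lam=1.1):
    """Eq. (5) sketch: balanced cross-entropy with an ignore band.
    `pred`: predicted crack probabilities P(X;W); `y`: normalized labels."""
    pos = y > eta
    neg = y <= 0.05                     # pixels in (0.05, eta] are ignored
    n_pos, n_neg = pos.sum(), neg.sum()
    alpha = lam * n_pos / (n_pos + n_neg)
    beta = n_neg / (n_pos + n_neg)
    return -(beta * torch.log(pred[pos] + 1e-8).sum()
             + alpha * torch.log(1.0 - pred[neg] + 1e-8).sum())

def total_loss(side_preds, fuse_pred, y, weights=(1, 1, 1, 2, 2), w_fuse=2):
    """Eq. (6) sketch: weighted sum of the five side losses and the fused loss."""
    loss = sum(w * side_loss(p, y) for w, p in zip(weights, side_preds))
    return loss + w_fuse * side_loss(fuse_pred, y)
```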

4. Implementation Details

Data Augmentation - We augment the training set by random clipping, flipping, and rotation operations. We also apply Gamma transformation to the training images to reduce the influence of brightness. In the end, we expand each training set to 228 times the original number of samples.

Training & Validation Parameters - To improve the robustness of the model, the images in the training set keep their original dimensions and are not resized. The batch size in the experiments is set to 1, and the shuffle strategy is enabled. We choose Stochastic Gradient Descent (SGD) as the optimizer and set the momentum to 0.9.

Due to the data augmentation, the total number of training epochs is set to 500 and the initial learning rate to 1e-3. We adopt a step learning-rate strategy that adjusts the learning rate at epochs 20, 50 and 100. At each epoch milestone, the learning rate decays to 1/10 of its previous value.
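In PyTorch terms, this schedule corresponds to SGD with a multi-step decay; the sketch below assumes `model`, `train_loader` and the `total_loss` sketch above already exist.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# decay the learning rate to 1/10 at the epoch milestones 20, 50 and 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 50, 100], gamma=0.1)

for epoch in range(500):
    for image, label in train_loader:        # batch size 1, shuffled
        optimizer.zero_grad()
        sides, fused = model(image)          # five side outputs + fused output
        loss = total_loss(sides, fused, label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```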

5. Experiments

5.1. Datasets

Our model is trained and evaluated on three public benchmarks: CrackTree260, CrackLS315 and Stone331.

The CrackTree260 [25] contains 260 road pavement images. These pavement images are captured by an area-array camera under visible-light illumination, and the size of each sample is 800 × 600. 200 samples are chosen for training, 20 samples for validation and 40 samples for testing.

The CrackLS315 [37] contains 315 images of asphalt pavement captured under laser illumination by a line-array camera. Each image has a size of 512 × 512. Among them, 265 samples are selected for training, the remaining 10 samples for validation and 40 samples for testing.

The Stone331 [17] contains 331 images of stone surfaces, captured by an area-array camera under visible-light illumination. The original image size is 1024 × 1024; because of the irregularity of the cutting surface, the original images are center-cropped to 512 × 512 samples. 261 of the images are chosen for training, 20 for validation and 50 for testing.

5.2. Performance Metrics

The performance metrics of Precision (abbreviated as PR) and Recall (abbreviated as RE) are calculated as $PR = \frac{TP}{TP + FP}$ and $RE = \frac{TP}{TP + FN}$ for binary classification tasks.

Specifically, for each image, PR and RE can be calculated by comparing the detected cracks against the human-annotated ground truth. Then, the F-measure ($\frac{2 \cdot PR \cdot RE}{PR + RE}$) can be computed as an overall metric for performance evaluation. Three different F-measure-based metrics are employed in the evaluation: the best F-measure on the dataset for a fixed threshold - Optimal Dataset Scale (ODS); the aggregate F-measure on the dataset for the best threshold on each image - Optimal Image Scale (OIS); and the average precision (AP), which is equivalent to the area under the precision-recall curve [8].
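For concreteness, a sketch of how ODS and OIS can be computed from per-image true/false positive counts accumulated over a sweep of thresholds is given below; the array layout is our assumption.

```python
import numpy as np

def f_measure(p, r):
    return 2 * p * r / (p + r + 1e-12)

def ods_ois(tp, fp, fn):
    """`tp`, `fp`, `fn`: arrays of shape (num_images, num_thresholds) with
    per-image counts at each candidate threshold."""
    # ODS: aggregate counts over the dataset, then pick one best threshold
    P = tp.sum(0) / (tp.sum(0) + fp.sum(0) + 1e-12)
    R = tp.sum(0) / (tp.sum(0) + fn.sum(0) + 1e-12)
    ods = f_measure(P, R).max()
    # OIS: pick the best threshold per image, then average the F-measures
    p = tp / (tp + fp + 1e-12)
    r = tp / (tp + fn + 1e-12)
    ois = f_measure(p, r).max(axis=1).mean()
    return ods, ois
```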

5.3. Comparison with the SOTA methods

To evaluate our model's performance, several classical models, namely SE [8], HED [34], RCF [20], SegNet [31], SRN [16], U-Net [27], FPHBN [35] and DeepCrack [37], are compared with ours on the crack detection task. SE [8] is a classical method based on random decision forests used for edge detection. HED [34] is a model based on the VGG16, whose feature maps are generated at each stage of the VGG16 and aggregated by multi-stage fusion. RCF [20] and SRN [16] are extensions of, and similar to, the HED. SegNet [31] and U-Net [27] are encoder-decoder architectures with symmetrical structures. DeepCrack [37] is an extension of SegNet for crack detection.

5.3.1 The Results on the CrackTree260

The CrackTree260 is a thin-crack dataset labeled with single-pixel-width or extremely tiny edges. On the asphalt surface and under visible-light illumination, the cracks exhibit extremely weak contrast between the "crack" and "non-crack" pixels.

Model          ODS↑   OIS↑   AP↑    FLOPs↓   mPara↓
SE [8]         0.662  0.673  0.683  -        -
FPHBN [35]     0.517  0.579  -      -        -
SRN [16]       0.774  0.781  0.779  451.3G   28.5M
HED [34]       0.816  0.820  0.831  146.9G   14.7M
SegNet [31]    0.844  0.851  0.862  311.3G   29.5M
U-Net [27]     0.847  0.832  0.869  400.0G   31.0M
RCF [20]       0.857  0.863  0.861  187.9G   14.8M
DeepCrack [37] 0.852  0.864  0.875  1001.7G  30.9M
CrackFormer    0.881  0.883  0.896  123.0G   7.35M

Table 1. Performance on the CrackTree260.

From the precision-recall curves in Fig. 9(a) and the statistical performance in Tab. 1, it can be seen that the CrackFormer outperforms the compared SOTA methods on the CrackTree260, with 0.881 ODS, 0.883 OIS and 0.896 AP, respectively. We obtain gains of 2.9% on ODS, 2.3% on OIS and 2.1% on AP, respectively, compared with DeepCrack. The visualized results in Fig. 6 show that the CrackFormer's results are more continuous and crisp than those of the compared deep learning models. The crack profile shows that the CrackFormer can achieve high prediction accuracy even for cracks of one-pixel width or tiny edges.

Figure 6. Predicted results on the CrackTree260. From top to bottom row: the original crack images, the ground truth, the results by the proposed CrackFormer, the results by DeepCrack [37], and the results by RCF [19], respectively.

5.3.2 The Results on the CrackLS315

The images of this dataset are captured under laser illumination. Training on this dataset is more difficult than on the other datasets because of the extremely low contrast. The precision-recall curves are shown in Fig. 9(b).

Model          ODS↑   OIS↑   AP↑    FLOPs↓   mPara↓
SE [8]         0.459  0.521  0.495  -        -
U-Net [27]     0.672  0.703  0.740  218.6G   31.0M
SRN [16]       0.755  0.789  0.795  246.6G   28.5M
SegNet [31]    0.761  0.780  0.780  170.1G   29.5M
HED [34]       0.763  0.798  0.829  80.3G    14.7M
RCF [20]       0.788  0.816  0.829  102.7G   14.8M
DeepCrack [37] 0.853  0.867  0.877  547.4G   30.9M
CrackFormer    0.871  0.879  0.883  67.2G    7.35M

Table 2. Performance on the CrackLS315.

It can be seen from Tab. 2 that the CrackFormer achieves the best performance on the CrackLS315. It obtains gains of 1.8% on ODS, 1.2% on OIS and 0.6% on AP, respectively, compared with DeepCrack. The ODS of the HED, SRN, SegNet and U-Net is 10.8%, 11.6%, 11.0% and 19.9% lower than that of the CrackFormer, respectively. Compared with the SE method, the CrackFormer obtains an improvement of 41.2% in terms of ODS. The HED, SRN, RCF and SegNet show comparable results, while the CrackFormer performs better than all of these methods. The visualized results in Fig. 7 (see the middle row) show that the CrackFormer can predict more detailed thin cracks from low-contrast asphalt pavements.

3788

Page 7: CrackFormer: Transformer Network for Fine-Grained Crack

Figure 7. Predicted results on the CrackLS315. From top to bottom row: the original crack images, the ground truth, the results by the CrackFormer, the results by DeepCrack [37], and the results by RCF [19], respectively.

5.3.3 The Results on the Stone331

This dataset is from stone cutting surfaces, whose smoothness makes the crack texture too weak to be observed even by human eyes. The visualized results in Fig. 8 (see the first row) show that the CrackFormer predicts the most continuous and complete crack detection results. As can be seen from the precision-recall curves in Fig. 9(c), the CrackFormer outperforms the other compared methods.

Model          ODS↑   OIS↑   AP↑    FLOPs↓   mPara↓
SE [8]         0.557  0.623  0.605  -        -
HED [34]       0.719  0.763  0.758  80.3G    14.7M
SRN [16]       0.735  0.776  0.741  246.6G   28.5M
U-Net [27]     0.757  0.776  0.809  218.6G   31.0M
RCF [20]       0.789  0.829  0.820  102.7G   14.8M
SegNet [31]    0.794  0.815  0.787  170.1G   29.5M
DeepCrack [37] 0.856  0.875  0.888  547.4G   30.9M
CrackFormer    0.877  0.885  0.894  67.2G    7.35M

Table 3. Performance on the Stone331.

From the statistical performance in Tab. 3, the CrackFormer achieves an ODS of 0.877, an OIS of 0.885 and an AP of 0.894 on the test set. The CrackFormer obtains gains of 2.1% on ODS, 1.0% on OIS and 0.6% on AP, respectively, compared with DeepCrack. Compared with the mainstream deep learning models, it outperforms the SegNet, RCF, U-Net and SRN by 8.3%, 8.8%, 12.0% and 14.2% on ODS, respectively. Compared with the traditional method SE, the CrackFormer obtains an improvement of 32% in terms of ODS.

Figure 8. Predicted results on the Stone331. From top to bottom: the original crack images, the ground truth, the results by the CrackFormer, the results by DeepCrack [37], and the results by RCF [19], respectively.

5.4. Multi-scale Analysis

The multi-scale fusion scheme has proven to be an effective way to enhance crack detection performance [18], because crack images exhibit different characteristics at different scales. At a large-scale stage, crack detection is reliable, but its localization is poor and may miss thin cracks. At a small-scale stage, details are preserved, but detection suffers a lot from clutter in the background texture. Therefore, we quantitatively analyze the output of each different-scale stage and the scale-wise fusion performance on the three datasets. The statistical results are shown in Tab. 4. Overall, the ODS and OIS values increase step by step from stage S1 to S5, and we obtain a 9.4% ODS gain on average. This means that the output of the CrackFormer from coarse to fine scales (stages) gradually matches the true scale of this kind of thin-crack benchmark. From the viewpoint of multi-scale fusion, it can be found that the incremental fusion experiments from S1+S2 to S1+S2+S3+S4, or even the all-scale fusion (S1+S2+S3+S4+S5), increase the ODS and OIS values over the output of each single scale. Moreover, the final fused results further improve the ODS over the finest scale (S5) by 8.7% on average.

Figure 9. The precision-recall curves on the CrackTree260, CrackLS315 and Stone331, respectively.

Scale         CrackTree260    CrackLS315      Stone331
              ODS↑   OIS↑     ODS↑   OIS↑     ODS↑   OIS↑
S1            0.680  0.702    0.648  0.671    0.760  0.771
S2            0.709  0.723    0.691  0.632    0.769  0.775
S3            0.740  0.742    0.746  0.652    0.779  0.812
S4            0.756  0.761    0.755  0.661    0.796  0.821
S5            0.799  0.801    0.761  0.665    0.809  0.815
S1+S2         0.768  0.772    0.735  0.742    0.815  0.820
S1+S2+S3      0.802  0.818    0.809  0.821    0.821  0.823
S1+S2+S3+S4   0.854  0.857    0.828  0.835    0.851  0.867
Fused         0.881  0.883    0.871  0.879    0.877  0.883

Table 4. Multi-scale analysis on the three datasets.

5.5. Efficiency Analysis

The FLOPs and parameter counts of the compared models are shown in Tab. 1 to Tab. 3, with different inference image sizes (600 × 800, 512 × 512 and 512 × 512, respectively). The results show that the CrackFormer is more efficient and requires fewer parameters. Specifically, the CrackFormer achieves higher accuracy than DeepCrack [37] with 8.1x fewer FLOPs and 4.2x fewer parameters. Compared to the other classical models, the CrackFormer achieves a higher ODS value while being 2x to 3x faster on average with 2x to 3x fewer parameters.

5.6. Ablation Study

To further check the gain from each module of our model, an ablation study is conducted on the CrackLS315. The experimental results are shown in Tab. 5. We first select the SegNet as the baseline. After the 3 × 3 conv is replaced by the Self-AB in the encoder and decoder, the gains on ODS and OIS are 9.8% and 8.9%, respectively, showing that the self-attention block is effective for fine-grained crack representation. Similarly, the Scal-AB independently yields a 9.7% ODS gain and an 8.9% OIS gain. Furthermore, compared with the DeepCrack, after the Self-AB is applied to the DeepCrack, the gains on ODS and OIS are 1.1% and 0.3%, respectively. In addition, the Scal-AB block yields gains of 0.9% and 0.5% on ODS and OIS, respectively, indicating that the modules also work well on a multi-stage fusion architecture. Finally, the Self-AB and Scal-AB modules together further achieve gains of 0.7%-0.9% on ODS and 0.7%-0.9% on OIS, respectively, indicating that the two attention mechanisms are compatible in the crack detection task.

Model        Self-AB  Scal-AB  MSF  ODS↑   OIS↑
SegNet       -        -        -    0.761  0.780
DeepCrack    -        -        ✓    0.853  0.867
-            ✓        -        -    0.859  0.869
-            -        ✓        -    0.858  0.869
-            ✓        -        ✓    0.864  0.870
-            -        ✓        ✓    0.862  0.872
CrackFormer  ✓        ✓        ✓    0.871  0.879

Table 5. Ablation study on the CrackLS315. MSF denotes multi-stage side fusion.

6. Conclusion

The proposed CrackFormer aims at detecting fine-grained cracks. We derive our model from the basic SegNet architecture and novel attention mechanisms. The proposed self-attention modules are embedded in the encoder-decoder blocks, where 1×1 convolutional kernels are adopted to extract contextual information across feature-channels, and an efficient positional embedding captures large receptive field spatial contextual information for long-range interactions. The proposed scaling-attention modules combine the outputs from the corresponding encoder and decoder blocks and enable crisp crack boundaries to be obtained. On three classical crack detection benchmark datasets, CrackTree260, CrackLS315 and Stone331, we obtain pixel-level crack detection accuracy and achieve SOTA performance.


References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2017.

[2] Abdullah Alfarrarjeh, Dweep Trivedi, Seon Ho Kim, and Cyrus Shahabi. A deep learning approach for road damage detection from smartphone images. In IEEE International Conference on Big Data (Big Data), 2018.

[3] Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. In ICLR, 2021.

[4] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In ICCV, 2019.

[5] Yutong Cai and Yong Wang. MA-Unet: An improved version of Unet based on multi-scale and attention mechanism for medical image segmentation, 2020. arXiv:2012.10952.

[6] Young-Jin Cha, Wooram Choi, and Oral Buyukozturk. Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering, 32(5):361–378, 2017.

[7] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In ICLR, 2020.

[8] Piotr Dollar and C. Lawrence Zitnick. Fast edge detection using structured forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8):1558–1570, 2015.

[9] Cao Vu Dung and Le Duc Anh. Autonomous concrete crack detection using deep fully convolutional neural network. Automation in Construction, 99:52–58, 2018.

[10] Yue Fei, Kelvin C. P. Wang, Allen Zhang, Cheng Chen, Joshua Q. Li, Yang Liu, Guangwei Yang, and Baoxian Li. Pixel-level cracking detection on 3D asphalt pavement images through deep-learning-based CrackNet-V. IEEE Transactions on Intelligent Transportation Systems, 21(1):273–284, 2020.

[11] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.

[12] Changlu Guo, Marton Szemenyei, Yugen Yi, Wenle Wang, Buer Chen, and Changqi Fan. SA-UNet: Spatial attention U-Net for retinal vessel segmentation. In ICPR, 2020.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[14] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. In CVPR, 2018.

[15] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In CVPR, 2019.

[16] Wei Ke, Jie Chen, Jianbin Jiao, Guoying Zhao, and Qixiang Ye. SRN: Side-output residual network for object symmetry detection in the wild. In CVPR, 2017.

[17] Jacob Konig, Mark David Jenkins, Mike Mannion, Peter Barrie, and Gordon Morison. Optimized deep encoder-decoder methods for crack segmentation, 2020. arXiv:2008.06266v1.

[18] Haifeng Li, Dezhen Song, Yu Liu, and Binbin Li. Automatic pavement crack detection by multi-scale image fusion. IEEE Transactions on Intelligent Transportation Systems, 20(6):2025–2036, 2019.

[19] Yun Liu and Ming-Ming Cheng. Richer convolutional features for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1936–1946, 2019.

[20] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In CVPR, 2017.

[21] Yahui Liu, Jian Yao, Xiaohu Lu, Renping Xie, and Li Li. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing, 338:139–153, 2019.

[22] Hiroya Maeda, Yoshihide Sekimoto, Toshikazu Seto, Takehiro Kashiyama, and Hiroshi Omata. Road damage detection and classification using deep neural networks with smartphone images. Computer-Aided Civil and Infrastructure Engineering, 33(12):1127–1141, 2018.

[23] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention U-Net: Learning where to look for the pancreas, 2018. arXiv:1804.03999.

[24] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In BMVC, 2018.

[25] Qin Zou, Yu Cao, Qingquan Li, Qingzhou Mao, and Song Wang. CrackTree: Automatic crack detection from pavement images. Pattern Recognition Letters, 33:227–238, 2012.

[26] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NIPS, 2019.

[27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

[28] Zhuoran Shen, Mingyuan Zhang, Shuai Yi, Junjie Yan, and Haiyu Zhao. Efficient attention: Self attention with linear complexities. arXiv:1812.01243, 2018.

[29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[31] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[32] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020.

[33] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.

[34] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In CVPR, 2015.


[35] Fan Yang, Lei Zhang, Sijia Yu, Danil Prokhorov, Xue Mei, and Haibin Ling. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems, 21(4):1525–1535, 2019.

[36] Allen Zhang, Kelvin C. P. Wang, Baoxian Li, and Enhui Yang. Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Computer-Aided Civil and Infrastructure Engineering, 32(10):805–819, 2017.

[37] Qin Zou and Zheng Zhang. DeepCrack: Learning hierarchical convolutional features for crack detection. IEEE Transactions on Image Processing, 28:1498–1512, 2019.
