
Supplementary Material for
Fast Online Object Tracking and Segmentation: A Unifying Approach

Qiang Wang* (CASIA), [email protected]
Li Zhang* (University of Oxford), [email protected]
Luca Bertinetto* (Five AI), [email protected]
Weiming Hu (CASIA), [email protected]
Philip H.S. Torr (University of Oxford), [email protected]

    1. Network architecture details

Network backbone. Table 1 illustrates the details of our backbone architecture (fθ in the main paper). For both variants, we use a ResNet-50 [2] until the final convolutional layer of the 4-th stage. In order to obtain a higher spatial resolution in deep layers, we reduce the output stride to 8 by using convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [1]. Specifically, we set the stride to 1 and the dilation rate to 2 in the 3×3 conv layer of conv4_1. Differently from the original ResNet-50, there is no downsampling in conv4_x. We also add to the backbone an adjust layer (a 1×1 convolutional layer with 256 output channels). Exemplar and search patches share the network's parameters from conv1 to conv4_x, while the parameters of the adjust layer are not shared. The output features of the adjust layer are then depth-wise cross-correlated, resulting in a feature map of size 17×17.
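To make the depth-wise cross-correlation concrete, below is a minimal PyTorch sketch (our own illustration, not the released implementation): each channel of the exemplar feature acts as a convolution kernel over the corresponding channel of the search feature, so 15×15×256 and 31×31×256 inputs yield a 17×17×256 response map.

import torch
import torch.nn.functional as F

def depthwise_xcorr(search, exemplar):
    """Depth-wise cross-correlation: each exemplar channel is a kernel
    correlated with the corresponding search channel.

    search:   (B, C, 31, 31) adjusted search features
    exemplar: (B, C, 15, 15) adjusted exemplar features
    returns:  (B, C, 17, 17) response map (one RoW per spatial location)
    """
    b, c, h, w = search.shape
    # Fold the batch into the channel dimension so a single grouped
    # convolution correlates every (image, channel) pair independently.
    search = search.view(1, b * c, h, w)
    kernel = exemplar.reshape(b * c, 1, exemplar.size(2), exemplar.size(3))
    out = F.conv2d(search, kernel, groups=b * c)
    return out.view(b, c, out.size(2), out.size(3))

# 15x15x256 exemplar vs 31x31x256 search -> (31 - 15 + 1) = 17x17x256
resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 15, 15))
assert resp.shape == (2, 256, 17, 17)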

Network heads. The network architecture of the branches of both variants is shown in Tables 2 and 3. The conv5 block in both variants contains a normalisation layer and a ReLU non-linearity, while conv6 only consists of a 1×1 convolutional layer.

Mask refinement module. With the aim of producing a more accurate object mask, we follow the strategy of [5], which merges low- and high-resolution features using multiple refinement modules made of upsampling layers and skip connections. Figure 1 illustrates how a mask is generated with stacked refinement modules. Figure 2 gives an example of refinement module U3.

* Equal contribution. Work done while at University of Oxford.

block    | exemplar output size | search output size | backbone
conv1    | 61×61                | 125×125            | 7×7, 64, stride 2
conv2_x  | 31×31                | 63×63              | 3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3
conv3_x  | 15×15                | 31×31              | [1×1, 128; 3×3, 128; 1×1, 512] ×4
conv4_x  | 15×15                | 31×31              | [1×1, 256; 3×3, 256; 1×1, 1024] ×6
adjust   | 15×15                | 31×31              | 1×1, 256
xcorr    | 17×17 (depth-wise cross-correlation of the two branches)

Table 1: Backbone architecture. Details of each building block are shown in square brackets.
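As an illustrative reconstruction of the stride and dilation handling in Table 1 (not the authors' code; the Backbone class and the adjust layer names below are ours), a torchvision ResNet-50 can be truncated after conv4_x with its stride replaced by dilation:

import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """ResNet-50 truncated after conv4_x; conv4_x uses stride 1 / dilation 2,
    keeping the total output stride at 8 as in Table 1."""
    def __init__(self):
        super().__init__()
        # replace_stride_with_dilation=[False, True, False] converts the
        # stride-2 convolutions of conv4_x (torchvision's layer3) to dilated ones.
        net = resnet50(replace_stride_with_dilation=[False, True, False])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # conv1 + pool
        self.conv2_x, self.conv3_x, self.conv4_x = net.layer1, net.layer2, net.layer3

    def forward(self, x):
        return self.conv4_x(self.conv3_x(self.conv2_x(self.stem(x))))

backbone = Backbone()               # shared between exemplar and search patches
adjust_z = nn.Conv2d(1024, 256, 1)  # adjust layer, exemplar branch (not shared)
adjust_x = nn.Conv2d(1024, 256, 1)  # adjust layer, search branch (not shared)

z = adjust_z(backbone(torch.randn(1, 3, 127, 127)))
x = adjust_x(backbone(torch.randn(1, 3, 255, 255)))
# Note: with torchvision's default padding these come out slightly larger
# (16x16 / 32x32) than the 15x15 / 31x31 of Table 1, whose sizes assume the
# paper's own padding conventions.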

block  | score     | box       | mask
conv5  | 1×1, 256  | 1×1, 256  | 1×1, 256
conv6  | 1×1, 2k   | 1×1, 4k   | 1×1, (63×63)

Table 2: Architectural details of the three-branch head. k denotes the number of anchor boxes per RoW.

block  | score     | mask
conv5  | 1×1, 256  | 1×1, 256
conv6  | 1×1, 1    | 1×1, (63×63)

    Table 3: Architectural details of the two-branch head.
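A rough PyTorch sketch of the heads in Tables 2 and 3 (hypothetical naming, not the released code; we assume batch normalisation for the normalisation layer, and k = 5 anchors is only an example value):

import torch
import torch.nn as nn

def head_branch(out_channels):
    """conv5 (1x1, 256 + normalisation + ReLU) followed by conv6 (plain 1x1)."""
    return nn.Sequential(
        nn.Conv2d(256, 256, kernel_size=1),            # conv5
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_channels, kernel_size=1),   # conv6
    )

k = 5  # anchor boxes per RoW (illustrative value)
three_branch = nn.ModuleDict({
    "score": head_branch(2 * k),
    "box":   head_branch(4 * k),
    "mask":  head_branch(63 * 63),
})
two_branch = nn.ModuleDict({
    "score": head_branch(1),
    "mask":  head_branch(63 * 63),
})

rows = torch.randn(1, 256, 17, 17)      # depth-wise xcorr output (one RoW per location)
mask_logits = two_branch["mask"](rows)  # (1, 3969, 17, 17): a flattened 63x63 mask per RoW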

Figure 1: Schematic illustration of mask generation with stacked refinement modules. The 127×127×3 exemplar and 255×255×3 search patches pass through the shared ResNet-50 backbone (conv1 to conv4) and the adjust layers, giving 15×15×256 and 31×31×256 features whose depth-wise cross-correlation produces a 17×17×256 map of RoWs (1×1×256 each); a RoW is then progressively upsampled by the stacked refinement modules, a deconvolution and a final 3×3 convolution with sigmoid into a 127×127×1 mask.

Figure 2: Example of a refinement module U3. The 31×31×256 conv2 features and the 31×31×16 features from the previous stage are each processed by 3×3 convolutions with ReLU, merged by an element-wise sum, and upsampled 2× to 61×61×8.
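For concreteness, a hypothetical PyTorch rendering of such a refinement module, with channel sizes loosely following Figure 2 (the released implementation may differ, and the exact 61×61 size depends on padding/cropping choices not reproduced here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    """Merges a backbone skip feature with the coarser mask feature from the
    previous stage, then upsamples 2x (channel sizes roughly follow Figure 2)."""
    def __init__(self, skip_channels=256, mask_channels=16, out_channels=8):
        super().__init__()
        # Path for the backbone skip connection (e.g. conv2 features, 31x31x256).
        self.skip = nn.Sequential(
            nn.Conv2d(skip_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, mask_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Path for the mask feature coming from the previous refinement stage.
        self.mask = nn.Sequential(
            nn.Conv2d(mask_channels, mask_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mask_channels, mask_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # After the element-wise sum, reduce channels and upsample 2x.
        self.post = nn.Conv2d(mask_channels, out_channels, 3, padding=1)

    def forward(self, skip_feat, mask_feat):
        fused = self.skip(skip_feat) + self.mask(mask_feat)   # element-wise sum
        return F.interpolate(self.post(fused), scale_factor=2,
                             mode="bilinear", align_corners=False)

# U3-like usage: conv2 skip features (31x31x256) + previous-stage mask features (31x31x16)
u3 = RefinementModule()
out = u3(torch.randn(1, 256, 31, 31), torch.randn(1, 16, 31, 31))  # -> (1, 8, 62, 62)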

Figure 3: Score maps from the mask branch at different locations, for a given target exemplar and search patch.

    2. Further qualitative results

Different masks at different locations. Our model generates a mask for each RoW. During inference, we rely on the score branch to select the final output mask (using the location attaining the maximum score). The example of Figure 3 illustrates the multiple output masks produced by the mask branch, each corresponding to a different RoW.

Benchmark sequences. More qualitative results for VOT and DAVIS sequences are shown in Figures 4 and 5.
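A minimal sketch of this selection step (variable names are ours; shapes assume the two-branch head of Table 3):

import torch

def select_mask(score_map, mask_logits):
    """Pick the 63x63 mask predicted at the RoW with the highest score.

    score_map:   (1, 1, 17, 17)     output of the score branch
    mask_logits: (1, 63*63, 17, 17) output of the mask branch, one mask per RoW
    returns:     (63, 63) mask probabilities for the best RoW
    """
    w = score_map.size(-1)
    # Index of the highest-scoring RoW in the flattened 17x17 grid.
    idx = int(score_map.view(-1).argmax())
    r, c = idx // w, idx % w
    # Take the mask predicted at that location and reshape it to 63x63.
    return mask_logits[0, :, r, c].view(63, 63).sigmoid()

mask = select_mask(torch.randn(1, 1, 17, 17), torch.randn(1, 63 * 63, 17, 17))
assert mask.shape == (63, 63)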

References

[1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[3] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking VOT2018 challenge results. In European Conference on Computer Vision Workshops, 2018.
[4] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[5] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.
[6] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

Figure 4: Further qualitative results of our method on sequences from the visual object tracking benchmark VOT-2018 [3] (butterfly, crabs1, iceskater1, iceskater2, motocross1, singer2, soccer1).

Figure 5: Further qualitative results of our method on sequences from the semi-supervised video object segmentation benchmarks DAVIS-2016 [4] and DAVIS-2017 [6] (dog, drift-straight, goat, Libby, motocross-jump, parkour, Gold-Fish).