Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks
Ning Zhang∗, Jingen Liu∗, Ke Wang†, Dan Zeng‡ and Tao Mei§

∗JD AI Research, Mountain View, USA. †Migu Culture & Technology, Beijing, China. ‡Shanghai University, Shanghai, China. §JD AI Research, Beijing, China

∗{ning.zhang,jingen.liu}@jd.com, †[email protected], ‡[email protected], §[email protected]

Abstract—Current deep learning based visual tracking approaches have been very successful by learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail to track objects under challenging conditions such as dense distractor objects, confusing background, motion blur, and so on. Inspired by the human "visual tracking" capability, which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking that exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a "wider" residual network, ResNeXt, as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system that makes full use of both appearance and motion features of the target. We have extensively evaluated TS-RCN on the most widely used benchmark datasets, including VOT2018, VOT2019, and GOT-10K. The experimental results demonstrate that our two-stream model greatly outperforms the appearance-based tracker and achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.

I. INTRODUCTION

Generic visual object tracking predicts the location of a class-agnostic object at every frame of a video sequence. It is a highly challenging task due to its class-agnostic nature, background distraction, illumination discrepancy, motion blur, and many other factors [1], [2]. In general, a visual tracking system needs to perform two tasks simultaneously: target classification and bounding-box estimation [3]. The former coarsely identifies the target object region in the current frame against the background, while the latter estimates the precise bounding-box (i.e., the tracker state) of the target object.

Recently, researchers have made great progress in visual object tracking by exploiting the power of deep convolutional networks. The Siamese tracking approaches [4] leverage a large amount of supervised data to learn a more general region similarity measurement in offline mode, which enables tracking to be performed by searching for the image region most similar to the target template. Due to the lack of background appearance (e.g., distractor objects), however, the Siamese approaches are weak at dealing with unseen objects and distractor objects. To address these limits, Bhat et al. [5] propose a discriminative learning architecture (DiMP) which is able to fully exploit both target and background appearance.

Fig. 1: Illustration of the limitations of appearance-based visual tracking. The first column shows the tracking results, where the Green, Blue, and Red bounding-boxes represent the groundtruth, the DiMP tracker, and our TS-RCN tracker, respectively. The second column shows the HSV-color visualization of the optical flow. In all three cases, the target's optical flow has a different pattern from that of its local background. Row (a) shows dense similar objects (i.e., crabs) as distractors; Row (b) shows confusing background textures as distractors (i.e., the flying drone blends with background buildings); Row (c) shows a target (i.e., a soccer ball) with motion blur. This figure is best viewed in PDF format.

Although DiMP is trained to separate the background from the target, it may still fail when the background becomes more confusing and challenging. As shown in Fig. 1 (a) and (b), DiMP is not able to track the targets (blue bounding-box) due to same-class distractors (i.e., similar crabs) and confusing background texture, respectively. Additionally, tracking can be lost if the current frame has motion blur on the target; for example, the soccer ball is blurred due to its high speed, as shown in Fig. 1 (c). All the aforementioned issues can occur in most deep learning based tracking approaches


where models are solely built on the target appearance.

In contrast, humans can effectively leverage an object's motion to separate it from the background or other similar objects when its appearance is less distinguishable. To address the aforementioned issues, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual object tracking, which is able to leverage the complementary information from both appearance and motion streams. This two-stream mechanism has been successfully applied to other video understanding tasks [6], [7]. To the best of our knowledge, however, we are the first to study an end-to-end trainable two-stream system for visual tracking. Although [8] uses both CNN based motion and appearance features for tracking, their conventional tracking system treats deep features as regular features similar to hand-crafted ones such as HOG. In other words, the pre-trained (on UCF101 [9] and ImageNet [10]) deep features are not trainable. As a result, the tracking system is extremely slow and less effective.

In this work, we build our two-stream network on the DiMP tracker [5]; it can also be integrated with other deep learning based trackers. Our TS-RCN computes optical flow for each frame and formulates it as an additional two-channel input alongside the three-channel appearance. Features extracted from both streams using CNNs are combined into one feature for tracking. Unlike [5], we employ ResNeXt [11] as our backbone for better feature extraction, which adopts the split-transform-merge strategy to improve network performance (i.e., expanding the network in width).

We have extensively evaluated our TS-RCN on widely used benchmark datasets such as VOT2018 [12], VOT2019 [13], and GOT-10K [2]. All experimental results demonstrate that two-stream visual tracking can greatly outperform appearance-based tracking across all major evaluation metrics. By adopting Farneback [14] for optical flow estimation, our TS-RCN can run at up to 38.1 FPS on an Nvidia GTX 1080 GPU. Additionally, to demonstrate the capability of long-term tracking, we also propose a two-stream based framework to automatically re-initialize a lost tracker.

In summary, we have made the following contributions:

• We are the first to propose an end-to-end trainable Two-Stream Residual Convolutional Network that captures both appearance and motion cues for visual tracking. It can work with most deep learning based trackers.

• Unlike previous tracking approaches, our TS-RCN exploits a "wider" residual network, ResNeXt, as its feature extraction backbone to further improve the tracking performance.

• We propose a two-stream tracker initialization for long-term video tracking, and demonstrate this capability with soccer ball tracking.

• We have thoroughly evaluated our TS-RCN on the most popular benchmark datasets and achieved state-of-the-art (SOTA) performance in terms of the evaluation metrics.

II. RELATED WORKS

Generic visual object tracking has made great progress thanks to the development of a variety of deep learning based approaches. The SiamFC tracker [4] is the first attempt to leverage a large amount of labeled data for offline tracker training, which greatly improves the performance over conventional online trackers. Following the idea of the SiamFC tracker, more recent works [15], [16], [17], [18] have been developed to leverage various deep learning backbones to further improve the tracking performance. For example, SiamDW [15] adopts Residual Convolutional Networks [19] as its backbone for feature extraction and outperforms the previous SiamFC [4].

Due to the lack of mechanisms to explicitly train a tracker to distinguish the target from background distractors, however, the Siamese approaches generally suffer from low tracking robustness [3], [5]. In other words, the Siamese approaches have a less discriminative target classifier. To address this issue, many efforts have been made to develop discriminative online/offline target classification and estimation [20], [21], [22], [3], [5] to distinguish the target from the background. Inspired by correlation methods such as Discriminative Correlation Filters (DCF) [20], which match region-level deep features to the target template, the ATOM tracker [3] and the DiMP tracker [5] propose Residual Convolutional Network (ResNet) [19] based end-to-end frameworks to learn target estimation and classification offline, which makes visual tracking more robust.

Although motion is an important cue for video understanding, there have been only a few attempts to exploit motion cues for visual object tracking in recent deep learning based tracking approaches. Wu et al. [23] use a Kalman filter to model the target motion along with a Siamese tracker. Instead of integrating the Kalman filter into the end-to-end Siamese network, they use it to verify/rectify the Siamese tracker; hence, it is less effective. A more relevant work by Gladh et al. [8] proposes to fuse deep motion features with deep appearance features, as well as other conventional features such as HOG, into a DCF tracker. The deep motion and appearance features are pre-trained on UCF101 for action classification and on ImageNet for image classification, respectively. Therefore, the deep features are not trainable, which makes them no different from other conventional features. Their tracking system is not a deep learning based tracking approach and runs at an extremely slow speed (i.e., less than 1.0 FPS). In contrast, our approach is an end-to-end trainable deep learning based visual tracking system, which can run at up to 38.1 FPS with much higher tracking accuracy.

Our TS-RCN tracking approach is inspired by previous two-stream deep learning networks for video understanding and action recognition [6], [7], [24], [25]. Typically, the two streams refer to two input sources. The idea can be traced back to the "two-stream hypothesis" proposed in [26], which describes the human visual cortex as containing two pathways: the ventral stream performing object recognition and the dorsal stream recognizing motion.


Fig. 2: TS-RCN Architecture with two stages: Feature Extraction and Tracking Network.

Fig. 3: Optical flow visualization: (a-b) consecutive video frames of a targeted soccer ball. (c): Color visualization based on the displacement vector's magnitude and direction, using the HSV color-space. (d-e): horizontal and vertical displacement vector fields d_u^t and d_v^t, respectively, with higher intensity representing positive values.

III. TS-RCN ARCHITECTURES

In this section, the architecture of TS-RCN is presented. As depicted in Figure 2, the end-to-end trainable network can be viewed as two stages: Feature Extraction and Tracking Network. At the Feature Extraction stage, the appearance and the optical flow are fed to two separate residual convolutional networks (the RGB stream and the optical-flow stream), respectively. The two types of features are then fused via a weighted-sum strategy, which forms the input to the Tracking Network for target classification and estimation. In our current implementation, we adopt DiMP [5] as our tracking network. Nevertheless, the two-stream architecture is generic and can be applied to most existing deep learning based visual tracking algorithms, such as the Siamese trackers [15], [16], [17].

A. Optical Flow

Optical flow is generally employed to capture the motion of objects in a video sequence. In the proposed TS-RCN architecture, pixel-wise dense optical flow is computed as pixel displacement vectors for each frame. As shown in Figure 3, the flow's two fields d_u^t and d_v^t are calculated from two consecutive frames t and t-1. Let d^t(x, y) denote the displacement vector at position (x, y) for frame t. The horizontal component d_u^t and vertical component d_v^t of the optical flow are illustrated in Figure 3 (d) and (e), respectively.

There are various implementations of optical flow computation, such as Farneback [14], TV-L1 [27], FlowNet2.0 [28], and so on. Since our tracker does not require high-precision optical flow estimation, we implemented a faster (GPU based) Farneback version, which runs at 40 FPS at 768x576 resolution. In addition, we also tested the Total Variation based regularization with L1-norm (TV-L1) algorithm for better precision, at 30 FPS at 320x240 resolution.
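For reference, a minimal sketch of the per-frame dense flow computation using OpenCV's Farneback implementation is given below; the CPU API is shown for simplicity, and the parameter values are illustrative rather than the exact settings of our system.

```python
import cv2
import numpy as np

def dense_flow(prev_frame: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Compute the dense Farneback optical flow between frames t-1 and t.

    Returns an (H, W, 2) array holding the horizontal (d_u) and
    vertical (d_v) displacement fields.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    # are illustrative defaults, not tuned values.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # flow[..., 0] = d_u, flow[..., 1] = d_v
```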

During TS-RCN training, we can leverage a ResNet pre-trained on ImageNet for the RGB-stream feature extraction. However, there is no pre-trained ResNet for optical flow features. To mimic the behavior of the RGB stream and treat the two streams equally, we discretize the optical flow into the interval from 0 to 255 using a linear transformation, which makes the range of the optical flow values the same as that of the RGB stream. This transformation unifies the RGB and optical flow streams. As a result, the ResNeXt model pre-trained on ImageNet can be applied to both streams.
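The linear discretization can be sketched as below; the clipping bound `max_disp` is a hypothetical parameter, since the exact normalization range is not specified here.

```python
import numpy as np

def discretize_flow(flow: np.ndarray, max_disp: float = 20.0) -> np.ndarray:
    """Linearly map flow displacements to [0, 255] so the flow stream
    matches the value range of the RGB stream.

    max_disp is a hypothetical clipping bound on displacement magnitude.
    """
    clipped = np.clip(flow, -max_disp, max_disp)
    scaled = (clipped + max_disp) / (2.0 * max_disp) * 255.0
    return scaled.astype(np.uint8)
```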

B. Two-Stream Architecture

After the optical flow computation and preprocessing, the RGB and optical flow streams take their respective inputs and feed them to the backbone networks. In our experiments, the


RGB stream takes the RGB color channels as input, while the flow stream takes the optical flow u/v channels as input. Each of the RGB or flow streams can be a ResNet or one of its variants, such as ResNeXt or Wide Residual Networks (WRNs) [29]. In Figure 2, we use ResNet blocks for illustration. A fusion layer is applied to combine the features of the two streams; we adopt a weighted-sum mechanism, which was introduced in action recognition [6], [24].

The combined two-stream feature is input to the classification and estimation tasks. In the classification branch, the feature goes through a convolutional block to extract the classification-layer feature, which is then used to train the classification predictor. In addition, an RGB-only model initialization is used: when the initial frame F(0) with a precise region of interest (ROI) is given, an ROI pooling operation is conducted to obtain a feature of the same size as the classification-layer feature for the predictor model. This initialization effectively reduces the optimization recursions for the classification prediction. Simultaneously, in the bounding-box estimation branch, a separate IoU convolutional block takes the two-stream fused feature to extract the bounding-box regression-layer feature. This newly extracted feature is fed into the IoU-Net [30] based estimation model. Since the mechanism is end-to-end trainable, the combined loss from the estimation and classification branches is back-propagated through the two-stream structure.

Formally, given a video V, we randomly select two segments {M_mod, M_train}, where M_mod is always selected prior to M_train along the time course. Each set M = {I_j, F_j, B_j}_{j=1}^{N_frames} consists of images I_j, optical flow images F_j, and their corresponding target bounding-boxes B_j at frame j. The optical flow F_j is calculated from the frame pair (I_{j-1}, I_j). {S_mod, S_train} are the modulation samples and training samples. Our setup follows the DiMP tracker [5]: S_mod is used to provide a model predictor as a preprocessing step that predicts the discriminative feature and maintains generalization for the future unseen S_train samples. S_train is formulated below, where x_j = x_ts(I_j, F_j) is the two-stream feature extracted by the backbone residual networks, and c_j is the center coordinate of the bounding-box B_j:

$$S_{train} = \{(x_{ts}(I_j, F_j), c_j) : (I_j, F_j, B_j) \in M_{train}\} \quad (1)$$

The combined two-stream feature x_j is defined as the weighted sum of R_rgb(I_j) and R_flow(F_j), where R_rgb(·) and R_flow(·) are the residual convolutional networks for the RGB and flow streams, respectively:

$$x_j = \lambda_{rgb} \cdot R_{rgb}(I_j) + \lambda_{flow} \cdot R_{flow}(F_j) \quad (2)$$
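A minimal PyTorch sketch of the weighted-sum fusion in Eq. (2) is given below; `rgb_backbone` and `flow_backbone` stand in for the two residual networks R_rgb and R_flow, and the module names are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Weighted-sum fusion of RGB and flow features, as in Eq. (2)."""

    def __init__(self, rgb_backbone: nn.Module, flow_backbone: nn.Module,
                 lambda_rgb: float = 0.5, lambda_flow: float = 0.5):
        super().__init__()
        self.rgb_backbone = rgb_backbone
        self.flow_backbone = flow_backbone
        self.lambda_rgb = lambda_rgb
        self.lambda_flow = lambda_flow

    def forward(self, image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # x_j = lambda_rgb * R_rgb(I_j) + lambda_flow * R_flow(F_j)
        return (self.lambda_rgb * self.rgb_backbone(image)
                + self.lambda_flow * self.flow_backbone(flow))
```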

The obtained two-stream feature x_j is used to feed the classification and estimation networks. The total loss function is L_tot = β·L_cls + L_est, where L_cls is the classification loss and L_est is the bounding-box estimation loss. The target classification is based on a hinge-based regression error, given a confidence score s, the target region z, and a threshold T:

$$l(s, z) = \begin{cases} s - z, & z > T \\ \max(0, s), & z \leq T \end{cases} \quad (3)$$

Fig. 4: Tracker re-initialization via optical flow based local-search object detection.

The above hinge error is used in calculating L_cls. The confidence score s is obtained by convolving the two-stream feature x with the target model f^(i) at the i-th iteration, where * denotes convolution:

$$L_{cls} = \frac{1}{N_{iter}} \sum_{i=0}^{N_{iter}} \sum_{(x, c) \in S_{test}} \left\| l\big(x * f^{(i)}, z_c\big) \right\|^2 \quad (4)$$

The estimation loss L_est minimizes the prediction error of the following equation using a mean-squared error function g(·), following IoU-Net [30]. The function c(·) is the modulation function generating the modulation vector; x_0 and B_0 are the extracted two-stream feature and the corresponding bounding-box from the first frame in M_mod. The function IoU(·) is the IoU prediction of the IoU-Net, taking as input the extracted two-stream feature x and its corresponding bounding-box B from M_train:

$$L_{est}(B) = g\big(c(x_0, B_0), \mathrm{IoU}(x, B)\big) \quad (5)$$
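The hinge-based regression error of Eq. (3) and the weighted total loss can be sketched as follows; `scores`, `targets`, and the threshold `T` follow the notation above, and the averaging over iterations and samples in Eq. (4) is omitted for brevity.

```python
import torch

def hinge_regression_error(scores: torch.Tensor, targets: torch.Tensor,
                           T: float) -> torch.Tensor:
    """l(s, z) = s - z if z > T, else max(0, s), applied element-wise (Eq. 3)."""
    return torch.where(targets > T, scores - targets,
                       torch.clamp(scores, min=0.0))

def total_loss(cls_loss: torch.Tensor, est_loss: torch.Tensor,
               beta: float) -> torch.Tensor:
    """L_tot = beta * L_cls + L_est."""
    return beta * cls_loss + est_loss
```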

C. Tracking

At the inference stage, the current frame of the tracking sequence is input to both the RGB stream and the optical-flow stream for backbone feature extraction. The fusion layer combines the two-stream features with the same weighted average as in the training stage. The resulting two-stream feature is used by the classification predictor to generate a score prediction of the target object's location. The feature is also used in the IoU prediction, where proposals are generated and ranked with the IoU model to obtain the best bounding-box estimate.
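A high-level sketch of the per-frame inference loop is given below; it reuses the flow helpers sketched above, while `preprocess`, `classifier.predict_score`, and `iou_net.rank_proposals` are hypothetical interfaces meant only to illustrate the data flow, not the exact implementation.

```python
def track_sequence(frames, preprocess, fusion, classifier, iou_net, init_box):
    """Illustrative per-frame tracking loop; all modules are caller-supplied."""
    prev_frame, box = frames[0], init_box
    results = [box]
    for frame in frames[1:]:
        flow = discretize_flow(dense_flow(prev_frame, frame))  # motion stream
        feat = fusion(preprocess(frame), preprocess(flow))     # Eq. (2) fusion
        score_map = classifier.predict_score(feat)             # coarse target location
        box = iou_net.rank_proposals(feat, score_map, box)     # refined bounding-box
        results.append(box)
        prev_frame = frame
    return results
```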

D. Long-video Tracking

In long-video SOT tasks, tracker initialization with object detection also suffers from the same appearance-only limitation due to target blurring, speed variations, similar-appearance distractors, etc. In addition to the


tracker model, we further propose an optical flow based re-initialization mechanism, which takes feedback from the tracking results and adaptively re-initializes the tracker with optical flow based local detection. Figure 4 illustrates this re-initialization. When the target classification and estimation collectively signal a tracking failure, for instance when the prediction score or the IoU score falls below a threshold, an optical flow based re-initialization is performed to refine the object's bounding-box. The testing images in Figure 4 depict this situation: the red bounding-boxes show the tracker before and after the failed-tracker re-initialization, which successfully retrieves the soccer ball, while the blue bounding-box indicates a failure to track the target without such re-initialization. Video examples are provided in the supplementary material.
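The failure-triggered re-initialization can be sketched as below; `score_thr`, `iou_thr`, and the caller-supplied `local_detector` are hypothetical names for the thresholds and the optical flow based local detector.

```python
def maybe_reinitialize(box, cls_score, iou_score, flow, local_detector,
                       score_thr=0.25, iou_thr=0.3):
    """Re-initialize from an optical flow based local detection when the
    classification and IoU scores jointly indicate a tracking failure.
    Threshold values are illustrative, not tuned."""
    if cls_score < score_thr or iou_score < iou_thr:
        new_box = local_detector(flow, box)  # search around the last known box
        if new_box is not None:
            return new_box  # refined bounding-box
    return box  # keep the current estimate
```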

IV. EXPERIMENT RESULTS

To verify the effectiveness of our proposed TS-RCN for visual object tracking, we conducted extensive experiments on several broadly used SOT benchmark datasets: VOT2018, VOT2019, and GOT-10K. Our approach was implemented in Python with PyTorch, and all experiments were run on an Nvidia GTX 1080. Our models were mainly trained on three datasets, namely GOT-10K [2], LaSOT [1], and ImageNetVid [10]; TrackingNet and the Microsoft COCO image dataset are not used.¹ The ImageNet pre-trained models were used for both the appearance and motion streams. With GPU-enabled Farneback optical flow computation, the TS-RCN tracker can run in real time on the VOT2018 benchmark. Specifically, with the best performing backbone, ResNeXt-50, its tracking speed is 21.74 FPS; furthermore, TS-RCN's speed can reach up to 38.1 FPS with a ResNet-18 backbone.

A. Evaluation Metrics

In our experiments, we used tracking accuracy and robustness to measure performance. Both metrics were adopted by the VOT2013 Challenge and have been widely used since then [31].

Tracking accuracy is simply defined as the overlap between the tracker's bounding-box and the ground-truth bounding-box. Given a continuous tracker i (no lost frames), the per-frame accuracy at frame t is computed as

$$\phi_t(i) = \frac{A_t^G \cap A_t^T}{A_t^G \cup A_t^T},$$

where A_t^T and A_t^G are the estimated and the ground-truth bounding-boxes, respectively. Due to the stochastic nature of a tracker, a tracking system is repeatedly evaluated N_rep times on one dataset, and the actual per-frame accuracy Φ_t(i) is the average of φ_t(i) over the N_rep runs. Now, let M_valid be the total number of valid frames for tracker i; the tracking accuracy ρ_A(i) of tracker i is the average of the M_valid per-frame accuracies Φ_t(i).
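The per-frame accuracy φ_t(i) is simply the intersection-over-union of the two boxes; a short sketch follows, assuming axis-aligned boxes in (x, y, w, h) format.

```python
def frame_accuracy(box_pred, box_gt):
    """Per-frame accuracy phi_t = area(A^T ∩ A^G) / area(A^T ∪ A^G).
    Boxes are (x, y, w, h); axis-aligned rectangles are assumed."""
    x1 = max(box_pred[0], box_gt[0])
    y1 = max(box_pred[1], box_gt[1])
    x2 = min(box_pred[0] + box_pred[2], box_gt[0] + box_gt[2])
    y2 = min(box_pred[1] + box_pred[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_pred[2] * box_pred[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0
```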

Tracking robustness measures how reliably a tracker follows the target without losing it. It is linked to tracking failures, i.e., the total number of times the tracker drifts away from the target. Let F(i, k) be the number of failures of the i-th tracker in experiment k; then the robustness ρ_R(i) of tracker i is the average of F(i, k) over all N_rep repeated experiments. The overall accuracy and robustness across the entire testing set are calculated as the weighted average of the per-sequence performance, with weights proportional to the lengths of the sequences.

¹A still-image-only dataset is not suitable for extracting optical flow.

TABLE I: Results of TS-RCN trackers with various weighting combinations of [λ_rgb, λ_flow]. The settings [1, 0] and [0, 1] are the two single-stream modes, i.e., the RGB-only and optical-flow-only trackers, respectively. The top-2 results are colored in Red and Blue, respectively.

| [λ_rgb, λ_flow] | [1, 0] | [3/4, 1/4] | [2/3, 1/3] | [1/2, 1/2] | [1/3, 2/3] | [1/4, 3/4] | [0, 1] |
| EAO ↑           | 0.379  | 0.370      | 0.354      | 0.459      | 0.398      | 0.367      | 0.031  |
| Accuracy ↑      | 0.557  | 0.515      | 0.511      | 0.579      | 0.520      | 0.552      | 0.366  |
| Robustness ↓    | 0.187  | 0.182      | 0.178      | 0.139      | 0.148      | 0.201      | 3.467  |

In the VOT2015 Challenge [32], the Expected Average Overlap (EAO) was designed as a principled combination of accuracy and robustness. Since then, it has been adopted as one of the main evaluation criteria for tracking performance; more details can be found in [33]. In the following experiments, we use the widely used VOT-toolkit² and pysot-toolkit³ to calculate EAO, accuracy, and robustness.

B. Ablation Study

Unless otherwise specified, we employed the ResNeXt-50 backbone for feature extraction in our experiments and trained the TS-RCN tracker on the LaSOT and GOT-10K datasets. All trackers are evaluated on the VOT2018 testing dataset.

Two-Stream Hyper-Parameters. As can be seen from Equation 2, two hyper-parameters λ_rgb and λ_flow are used to balance the effect of the two streams on the tracking results. When λ_rgb = 1 and λ_flow = 0, TS-RCN reduces to a DiMP tracker, which serves as our baseline. We conducted a grid-search-like procedure to find the best combination of λ_rgb and λ_flow. Table I shows the performance of TS-RCN in terms of EAO, accuracy, and robustness. When [λ_rgb, λ_flow] = [0.5, 0.5], the system achieves the best performance on all metrics. Additionally, we notice that two-stream systems generally outperform the two single-stream systems (λ_flow = 0 or λ_rgb = 0) across most evaluation metrics. This clearly illustrates that the TS-RCN tracker works better than a single-stream tracker.

It is also interesting to examine the plummeting performance when only the single-stream optical flow is employed for tracking. To further analyze this, we performed another group of experiments that looks into fine-scale changes of the λ values. Figure 5 depicts the tracking performance with λ changing at a step of 0.01, from [0.05, 0.95] to [0, 1.0]. The bars show the number of failures for each λ configuration, and the red line with black dots shows the corresponding EAO values. When [λ_rgb, λ_flow] reaches [0.02, 0.98], the failure count increases significantly and the EAO decays drastically. This observation shows that appearance is the primary stream for the TS-RCN tracker under the current experimental configuration. We

²https://github.com/votchallenge/vot-toolkit
³https://github.com/STVIR/pysot


Fig. 5: EAO and failure counts for different [λ_rgb, λ_flow].

TABLE II: Impact of ResNet depth on performance (EAO) and the corresponding number of network parameters (in millions). Weights [λ_rgb, λ_flow] = [0.5, 0.5] are used.

| ResNet Depth      | 18    | 50    | 101   | 152   |
| EAO ↑             | 0.345 | 0.419 | 0.383 | 0.233 |
| Params (millions) | 11.69 | 25.56 | 44.56 | 60.19 |

conjecture that this may be due to the pre-training of the flow-stream network: we adopted the pre-trained RGB model for the flow stream. To address this, a dedicated optical flow model could be pre-trained separately, which we leave to future work.

Backbone Depth. Network depth is another important factor that affects the performance of TS-RCN. In this group of experiments, we examined how the tracking performance changes with various backbone depths. As shown in Table II, ResNet achieves its best performance at depth 50, even though deeper networks generally produce better results. A deeper structure has many more parameters and therefore needs much more training data to avoid over-fitting; given our current training data, depths of 101 or 152 may cause over-fitting.

Backbone Architecture. As shown in the above experiments, the residual convolutional networks obtain better tracking accuracy at a depth of 50. Besides ResNet, several residual-network variants have been developed, e.g., by increasing the "width" of the network. In this ablation study, we compare the tracking performance of ResNet, ResNeXt, and WRNs at a depth of 50. The classic ResNet-50 has a bottleneck width of 4; ResNeXt-50 has an additional cardinality hyper-parameter of 32; WRNs-50 has an additional widening factor of 2. We first verified that, for all structures, the two-stream approach achieves the best performance with [λ_rgb, λ_flow] = [0.5, 0.5]. With the ResNet-50 backbone, we achieved 0.419 EAO, 0.571 accuracy, and 0.168 robustness. With the WRNs-50 backbone, we obtained 0.390 EAO, 0.568 accuracy, and 0.195 robustness. Both structures perform worse than the ResNeXt-50 counterpart in Table I. This ablation study demonstrates that increasing the cardinality with the split-transform-merge strategy improves the network performance.

TABLE III: EAO scores for different training-dataset choices with the proposed TS-RCN and [λ_rgb, λ_flow] = [0.5, 0.5].

| Datasets | GOT-10k | GOT-10k + LaSOT | GOT-10k + LaSOT + ImgNetVid |
| EAO ↑    | 0.383   | 0.459           | 0.378                       |

Training Dataset. A deeper network may work better with more training data, but the quality and type of the dataset also matter. We conducted a set of experiments in which various combinations of datasets were used for training; Table III presents the tracking performance comparison in terms of EAO. GOT-10K is currently the largest annotated video dataset, with 563 categories from 10,000 video segments and more than 1.5 million manually labeled bounding-boxes [2]. LaSOT is a dataset with 70 categories from 1,400 videos [1]. ImageNetVid is part of the ImageNet ILSVRC2015 competition, consisting of 30 object categories from 4,500 videos, with a total of more than one million annotated frames [10]. As we can see, the combination of LaSOT and GOT-10K achieves the best performance, with an EAO score of 0.459. It is worth noting that more datasets do not always yield better performance: as indicated in Table III, when all three datasets are used, the EAO score drops to 0.378. One possible cause is that the training is object-appearance dependent, so adding categories with balanced data helps improve performance, whereas adding categories with less balanced data may jeopardize it. ImageNetVid has only 30 categories, which largely overlap with the previous two datasets; hence, the performance tends to plateau or even decrease in our experiment.

Two-Stream Fusion Strategy. We have two schemes to fuse the RGB and flow streams. In the first, the features are extracted separately by two identical but independent networks, one per stream; we call this "late fusion". Alternatively, we can stack the 3 RGB channels with the 2 optical flow channels and feed them into one common ResNeXt for feature extraction; this is the so-called "early fusion". In terms of EAO, late fusion greatly outperforms early fusion (0.459 vs. 0.356). This could be because optical flow and RGB are different representations in nature, one describing pixel movement and the other pixel values. Simply stacking them into a single backbone diminishes each stream's value and hence hurts the overall performance.
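The two schemes can be contrasted with a short sketch: late fusion keeps two independent backbones and fuses their features by the weighted sum of Eq. (2), whereas early fusion stacks the 3 RGB channels with the 2 flow channels into a single 5-channel input for one shared backbone. The single convolution layers and the input resolution below are illustrative stand-ins for full backbones.

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 288, 288)   # RGB input (illustrative resolution)
flow = torch.randn(1, 2, 288, 288)  # u/v flow channels

# Early fusion: stack the channels and use a single backbone whose first
# convolution accepts 5 input channels.
early_stem = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3)
early_feat = early_stem(torch.cat([rgb, flow], dim=1))

# Late fusion (the scheme used in TS-RCN): two independent stems/backbones,
# features combined afterwards by the weighted sum of Eq. (2).
rgb_stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
flow_stem = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3)
late_feat = 0.5 * rgb_stem(rgb) + 0.5 * flow_stem(flow)
```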

C. State-of-the-Art Comparison

VOT2018 and VOT2019 datasets. Following previous SOT evaluations, we evaluated our best tracking model on both the VOT2018 and VOT2019 datasets, each containing 60 challenging testing sequences.

As shown in Table IV, our TS-RCN tracker with the ResNeXt-50 backbone achieves the best tracking results on VOT2018 in terms of EAO and robustness, compared with recent popular trackers such as SiamMask, DiMP, ATOM, and so on. This clearly illustrates that motion cues can greatly improve tracking robustness, which is critical for many practical applications. Relatively speaking, accuracy is less important since it only measures the degree of overlap with the ground truth. More specifically, our approach improves the EAO of DiMP-50 (the single-stream version of our TS-RCN) by 3.7% and reduces the failures (robustness) by 2.3%, although our model was trained with less training data than [5]. For a fair comparison, we re-trained the DiMP tracker with the same training data as ours; it achieves 0.385 EAO, 0.563 accuracy, and 0.209 robustness, all worse than ours.

TABLE IV: Comparison with the SOTA on VOT2018.

|              | LADCF [34] | MFT [35] | SiamRPN [18] | DRT [36] | RCO [12] | UPDT [37] | ECO [38] | SiamFC [4] | ATOM [3] | SiamFC++ [17] | DaSiamRPN [39] | SiamMask [40] | SiamRPN++ [16] | DiMP-50 [5] | TS-RCN (ours) |
| EAO ↑        | 0.389 | 0.385 | 0.383 | 0.356 | 0.376 | 0.378 | 0.280 | 0.187 | 0.401 | 0.426 | 0.326 | 0.387 | 0.414 | 0.422 | 0.459 |
| Accuracy ↑   | 0.505 | 0.508 | 0.587 | 0.519 | 0.507 | 0.536 | 0.487 | 0.505 | 0.590 | 0.587 | 0.569 | 0.642 | 0.600 | 0.602 | 0.579 |
| Robustness ↓ | 0.159 | 0.140 | 0.276 | 0.201 | 0.155 | 0.184 | 0.276 | 0.585 | 0.204 | 0.183 | 0.337 | 0.295 | 0.234 | 0.162 | 0.139 |

TABLE V: Comparison with the SOTA on VOT2019.

|              | DRNet [13] | Trackyou [13] | ATP [13] | SiamRPN++ [16] | SiamMask [40] | SiamDW ST [15] | DiMP-50 [5] | TS-RCN (ours) |
| EAO ↑        | 0.393 | 0.394 | 0.393 | 0.282 | 0.287 | 0.297 | 0.342 | 0.375 |
| Accuracy ↑   | 0.602 | 0.610 | 0.649 | 0.598 | 0.596 | 0.597 | 0.600 | 0.582 |
| Robustness ↓ | 0.261 | 0.268 | 0.291 | 0.482 | 0.461 | 0.467 | 0.321 | 0.262 |

TABLE VI: GOT-10k testing results in terms of average overlap (AO) and success rates (SR) at overlap thresholds 0.50 and 0.75.

|             | DiMP-50 [5] | TS-RCN ResNet-50 | TS-RCN ResNeXt-50 | TS-RCN WRNs-50 |
| SR 0.75 (%) | 45.8 | 46.0 | 47.7 | 35.2 |
| SR 0.50 (%) | 71.1 | 71.6 | 71.6 | 61.1 |
| AO (%)      | 59.7 | 60.8 | 60.7 | 52.2 |

There are fewer published results on the VOT2019 dataset. As shown in Table V, our approach achieves the top result in terms of robustness. Our results are comparable to those of the top performers on the VOT2019 benchmark, and better than those of recently developed SOTA tracking approaches.

GOT-10k dataset [2]. On the GOT-10k dataset, average overlap (AO) and success rate (SR) are used to evaluate all trackers. AO denotes the average overlap between the estimated bounding-boxes and the ground-truth; SR measures the percentage of successfully tracked frames whose overlap exceeds a threshold (e.g., 0.5, 0.75). We used the GOT-10k training split for both network training and validation, and the test split for testing. The test split has 180 videos, and there is no overlap of object classes between the train and test splits, which prevents over-fitting to individual classes. Hence, this evaluation focuses on testing the generalization capability of trackers on unseen object classes. To ensure a fair comparison, the presented trackers are trained and validated using the same subset drawn from the GOT-10k training set. This strategy avoids using external datasets or different training splits, ensuring that any difference is caused only by the employed backbone structures.

Table VI presents the results. For the average overlap, the baseline DiMP tracker achieves 59.7%, while the two-stream ResNet-50 achieves 60.8%, a relative gain of 1.1%. ResNeXt-50 achieves 60.7% AO, a relative gain of 1.0%, and tops the success rates at the 0.50 and 0.75 overlap thresholds among all models. This experiment, using the same settings on a different dataset, verifies the generalization ability of the proposed TS-RCN.

D. Visualization

Some visualization examples are given in Figure 6. The top three rows are sequences from the VOT2018 benchmark. Row (a) demonstrates a target that is blurred in appearance. Rows (b-c) show distractors of the same object category. Even when their visual appearance is not identical, the RGB single-stream tracker tends to confuse a distractor with the true target if they are of the same category and in close proximity.

For long-term video tracking, we focus on soccer ball tracking in real matches. The soccer ball may be blurred and deformed due to its high speed, or blended with background distractors. The bottom two rows of Figure 6 are examples of optical flow based local detection, which retrieves the soccer ball using motion information; in comparison, the RGB single-stream approach fails to track. Row (d) shows the soccer ball occluded by a player, and also blurred and deformed. Row (e) illustrates a distractor (an advertisement board) with the same appearance. The corresponding long-video tracking clips are provided in the supplementary materials.

V. CONCLUSION

We propose a Two-Stream Residual Convolutional Network that combines RGB appearance and optical flow motion inputs. Our architecture uses the ResNeXt residual network as its backbone and can be integrated with existing trackers via end-to-end training. The approach has been tested on the VOT2018, VOT2019, and GOT-10k benchmarks, where it outperformed baselines that use only the RGB single stream, as well as other SOTA trackers reported on the benchmarks.

REFERENCES

[1] H. Fan et al., "LaSOT: A high-quality benchmark for large-scale single object tracking," in CVPR, 2019.
[2] L. Huang et al., "GOT-10k: A large high-diversity benchmark for generic object tracking in the wild," IEEE TPAMI, 2019.
[3] M. Danelljan et al., "ATOM: Accurate tracking by overlap maximization," in CVPR, 2019.
[4] L. Bertinetto et al., "Fully-convolutional Siamese networks for object tracking," in ECCV, 2016.
[5] G. Bhat et al., "Learning discriminative model prediction for tracking," in ICCV, 2019.
[6] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[7] C. Feichtenhofer et al., "Convolutional two-stream network fusion for video action recognition," in CVPR, 2016.
[8] S. Gladh et al., "Deep motion features for visual tracking," in ICPR, 2016.
[9] K. Soomro et al., "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[10] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, 2015.
[11] S. Xie et al., "Aggregated residual transformations for deep neural networks," in CVPR, 2017.


Fig. 6: Visualization of the TS-RCN tracker in Red and the baseline DiMP tracker in Blue; the groundtruth is in Green. Rows (a-c): each sequence skips a fixed number of frames rather than showing every consecutive frame; for reference, the frame number is given in the top-left corner. Rows (d-e): long-video tracking with optical flow based local detection.

[12] M. Kristan et al., "The sixth visual object tracking VOT2018 challenge results," 2018.
[13] ——, "The seventh visual object tracking VOT2019 challenge results," 2019.
[14] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis, 2003.
[15] Z. Zhang and H. Peng, "Deeper and wider Siamese networks for real-time visual tracking," in CVPR, 2019.
[16] B. Li et al., "SiamRPN++: Evolution of Siamese visual tracking with very deep networks," in CVPR, 2019.
[17] Y. Xu et al., "SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines," 2020.
[18] B. Li et al., "High performance visual tracking with Siamese region proposal network," in CVPR, 2018.
[19] K. He et al., "Deep residual learning for image recognition," in CVPR, 2016.
[20] D. S. Bolme et al., "Visual object tracking using adaptive correlation filters," in CVPR, 2010.
[21] M. Danelljan et al., "Learning spatially regularized correlation filters for visual tracking," in ICCV, 2015.
[22] ——, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in ECCV, 2016.
[23] C. Wu et al., "Motion guided Siamese trackers for visual tracking," IEEE Access, 2020.
[24] L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in ECCV, 2016.
[25] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in CVPR, 2017.
[26] M. A. Goodale, A. D. Milner et al., "Separate visual pathways for perception and action," 1992.
[27] J. S. Perez et al., "TV-L1 optical flow estimation," Image Processing On Line, 2013.
[28] E. Ilg et al., "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in CVPR, 2017.
[29] S. Zagoruyko and N. Komodakis, "Wide residual networks," ECCV, 2016.
[30] B. Jiang et al., "Acquisition of localization confidence for accurate object detection," in ECCV, 2018.
[31] M. Kristan et al., "The visual object tracking VOT2013 challenge results," in ICCV Workshops, 2013.
[32] ——, "The visual object tracking VOT2015 challenge results," 2015.
[33] ——, "A novel performance evaluation methodology for single-target trackers," IEEE TPAMI, 2016.
[34] T. Xu et al., "Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking," IEEE TIP, 2019.
[35] S. Bai et al., "Multi-hierarchical independent correlation filters for visual tracking," 2018. [Online]. Available: http://arxiv.org/abs/1811.10302
[36] C. Sun, D. Wang, H. Lu, and M.-H. Yang, "Correlation tracking via joint discrimination and reliability learning," in CVPR, 2018.
[37] G. Bhat et al., "Unveiling the power of deep tracking," in ECCV, 2018.
[38] M. Danelljan et al., "ECO: Efficient convolution operators for tracking," in CVPR, 2017.
[39] Z. Zhu et al., "Distractor-aware Siamese networks for visual object tracking," in ECCV, 2018.
[40] Q. Wang et al., "Fast online object tracking and segmentation: A unifying approach," in CVPR, 2019.