MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

ILSVRC 2016 Object Detection from Video

Byungjae Lee¹, Songguo Jin¹, Enkhbayar Erdenee¹, Mi Young Nam², Young Gui Jung², Phill Kyu Rhee¹
Inha University¹, NaeulTech²
Results

• Object Detection from Video (VID): 2nd place (mAP: 73.15%)
• Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%), with additional training data
Overview

I. Faster R-CNN Object Detector
  + Context region
  + Larger feature map
  + Ensemble
  + Data configuration

II. MCMOT: Multi-Class Multi-Object Tracking
  • Tracking by Detection
  • Detection: Ensemble of CNNs
  • Tracking: MCMOT using CPD
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
I. Faster R-CNN Object Detector
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Object Detection from Video
Challenge I: Small Object
* Figure from J. Li et al. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
Object Detection from Video
Challenge I: Small Object
Solution – Larger feature map
* Figure from J. Li et al. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
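To make the fix concrete, here is a minimal PyTorch sketch (our illustration with a stock torchvision VGG16, not the challenge entry's code) of dropping the pool4 layer so the conv5 block runs at stride 8 instead of 16:

```python
# Minimal sketch (not the authors' code): drop 'pool4' from a stock VGG16
# so the conv5 block sees a 2x larger feature map (stride 8 instead of 16).
import torch
import torchvision

vgg = torchvision.models.vgg16()  # weights omitted for brevity
layers = list(vgg.features)       # conv / ReLU / max-pool layers in order
pool_idx = [i for i, m in enumerate(layers)
            if isinstance(m, torch.nn.MaxPool2d)]  # five pools in VGG16

# Faster R-CNN already truncates at conv5_3 (no pool5); additionally skip pool4.
backbone = torch.nn.Sequential(
    *[m for i, m in enumerate(layers[:pool_idx[4]]) if i != pool_idx[3]])

x = torch.randn(1, 3, 600, 1000)   # typical Faster R-CNN input resolution
print(backbone(x).shape)           # [1, 512, 75, 125] vs 37x62 with pool4 kept
```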
Object Detection from Video
Challenge II: Blurred Object
* Figure from Y. Zhu et al. “segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection”. CVPR 2015.
Object Detection from Video
Challenge II: Blurred Object
Solution – Context Region
* Figure from Y. Zhu et al. “segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection”. CVPR 2015.
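As a concrete sketch of the context region (the helper function and layout are ours): enlarge each detection box by 1.5× around its center and clip it to the image before pooling context features.

```python
# Sketch (our helper, not released code): enlarge each RoI by 1.5x around
# its center to form the context region, clipped to the image bounds.
import numpy as np

def context_boxes(boxes, im_w, im_h, scale=1.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = (boxes[:, 2] - boxes[:, 0]) * scale
    h = (boxes[:, 3] - boxes[:, 1]) * scale
    out = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    out[:, 0::2] = out[:, 0::2].clip(0, im_w - 1)  # clip x1, x2
    out[:, 1::2] = out[:, 1::2].clip(0, im_h - 1)  # clip y1, y2
    return out

print(context_boxes(np.array([[100., 100., 200., 300.]]), 640, 480))
# -> [[ 75.  50. 225. 350.]]
```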
Network Architecture

I. Faster R-CNN with VGG16

[Diagram: VGG16 backbone (Conv1–Conv5 with pool1–pool4) feeding a Region Proposal Network and an RoI pooling layer, followed by two FC layers with a class layer (softmax classification loss) and a bbox layer (bounding-box regression loss).]

K. Simonyan, & A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Network Architecture

I. Faster R-CNN with VGG16
+ Larger feature map (remove ‘pool4’ layer) (+2.9% mAP)

[Diagram: the same architecture with the pool4 layer removed (Conv4ˈ), so the Region Proposal Network and RoI pooling layer operate on a larger feature map.]

J. Li, X. Liang, S. Shen, T. Xu, & S. Yan. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Network Architecture

I. Faster R-CNN with VGG16
+ Larger feature map (remove ‘pool4’ layer) (+2.9% mAP)
+ Context region (+2.6% mAP)

[Diagram: a context branch is added — each RoI is enlarged (box × 1.5) and pooled by a context RoI pooling layer, passed through its own FC layers, and merged with the original branch in a concat layer before the class and bbox layers.]

S. Gidaris, & N. Komodakis. “Object detection via a multi-region & semantic segmentation-aware CNN model”. CVPR 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
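A rough sketch of the context branch (function and layer names are ours, and the exact concat point is an assumption — in the slides each branch has its own FC layers before the concat layer): pool conv5 features for both the RoI and its 1.5× context box, then concatenate, assuming torchvision's roi_align.

```python
# Rough sketch of the context branch (names and concat point are ours; the
# slides concatenate after each branch's FC layers): pool conv features for
# the RoI and for its 1.5x context box, then concatenate along channels.
import torch
from torchvision.ops import roi_align

def roi_with_context(feat, rois, ctx_rois, out_size=7, spatial_scale=1 / 8.0):
    f_obj = roi_align(feat, [rois], output_size=out_size, spatial_scale=spatial_scale)
    f_ctx = roi_align(feat, [ctx_rois], output_size=out_size, spatial_scale=spatial_scale)
    return torch.cat([f_obj, f_ctx], dim=1)   # channel-wise concat

feat = torch.randn(1, 512, 75, 125)           # conv5 map (pool4 removed -> stride 8)
rois = torch.tensor([[100., 100., 200., 300.]])
ctx = torch.tensor([[75., 50., 225., 350.]])  # the 1.5x enlarged box from earlier
print(roi_with_context(feat, rois, ctx).shape)  # torch.Size([1, 1024, 7, 7])
```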
Training Data Configuration

Overview of the ILSVRC VID dataset:

ILSVRC VID   Training    Validation
Images       1,122,397   176,126
Snippets     3,862       555

• Redundant images within each snippet
• Diversity is too low to train a CNN
• We need more data

* Images are from the ILSVRC VID dataset
Training Data Configuration
16
ILSVRC VID
ILSVRC DET
AdditionalData
COCODET
ImageNet CLSPre-trained Model
Training Data Configuration
17
ILSVRC VID
ILSVRC DET
AdditionalData
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
Fine-tune
Training Data Configuration
18
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
Fine-tune
Training Data Configuration
19
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
TrainingDataset
Fine-tune
Training Data Configuration
20
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
TrainingDataset
FinalModel
Fine-tune
Fine-tune
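A hedged sketch of the sampling step: the slides only state "sample 2K images per class", so the round-robin draw across snippets below is our assumption, chosen to counter the frame redundancy noted earlier.

```python
# Hedged sketch (our assumption): sample up to 2K images per class,
# round-robin across snippets so no single snippet's near-duplicate
# frames dominate a class.
import random
from collections import defaultdict

def sample_per_class(annotations, per_class=2000, seed=0):
    """annotations: iterable of (image_id, snippet_id, class_name) tuples."""
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for image_id, snippet_id, cls in annotations:
        by_class[cls][snippet_id].append(image_id)
    sampled = {}
    for cls, snippets in by_class.items():
        pools = list(snippets.values())
        for pool in pools:
            rng.shuffle(pool)
        picks = []
        while len(picks) < per_class:
            pools = [p for p in pools if p]   # drop exhausted snippets
            if not pools:
                break                         # class has fewer than per_class images
            for pool in pools:                # one frame per snippet per round
                if len(picks) >= per_class:
                    break
                picks.append(pool.pop())
        sampled[cls] = picks
    return sampled
```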
Detection Components

VGG16                  mAP(%)     ResNet-101   mAP(%)
Baseline               70.7       Baseline     78.8
+ Larger feature map   73.6
+ Context layer        76.2

[Diagram: detection results from Faster R-CNN (VGG16) and Faster R-CNN (ResNet) are combined, and the combination is fed to MCMOT tracking, reaching 82.3% mAP.]

* We used 12 anchors for the RPN
* Performance is evaluated on VID minival (only the initial 30 frames of each snippet are used, 16,524 images in total)
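The slides do not spell out the combination scheme; a common, minimal approach (shown here as an assumption, not the entry's documented method) is to pool each class's detections from both models and run non-maximum suppression on the union.

```python
# Assumption, not the entry's documented method: a minimal ensemble that
# pools per-class detections from both models and runs NMS on the union.
import numpy as np

def nms(boxes, scores, thresh=0.45):
    """Greedy IoU-based non-maximum suppression; returns kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]
    return keep

def ensemble(dets_vgg, dets_resnet, thresh=0.45):
    """dets_*: (N, 5) arrays of [x1, y1, x2, y2, score] for one class."""
    dets = np.vstack([dets_vgg, dets_resnet])
    return dets[nms(dets[:, :4], dets[:, 4], thresh)]
```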
II. MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Motivation

[Chart: ILSVRC object detection performance by year — mAP 0.23 (2013), 0.44 (2014), 0.62 (2015), 0.66 (2016)]

• Object detectors have become robust
• Should we still use a complex multi-object tracking algorithm?
Motivation

Based on high-performance detection, a simple and fast MOT algorithm can achieve competitive results.
Results

• ILSVRC2016 Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%), with additional training data
• Our MCMOT also achieves state-of-the-art results on other MOT datasets

[Chart: Multiple Object Tracking Benchmark 2016, MOTA(%) — MCMOT 62.4, NOMTwSDP16 62.2, KFILDAwSDP 57.3, EAMTT 52.5, AMPL 50.9]
[Chart: KITTI Tracking Benchmark (Car), MOTA(%) — MCMOT 72.11, RBPF 71.44, DuEye 70.15, NOMT 69.73, CNN-OCC 69.46]

https://motchallenge.net/results/MOT16/
http://www.cvlibs.net/datasets/kitti/eval_tracking.php
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Tracking by Detection

[Diagram: a video sequence is processed over time t — per-frame object detection produces detections, and multi-object tracking links them into trajectories.]

• Object Detection: Ensemble of CNNs
• Multi-Object Tracking: MCMOT using CPD

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
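In skeleton form, tracking by detection reduces to a per-frame loop; detect() and associate() below are hypothetical placeholders for the CNN ensemble and the data-association step.

```python
# Skeleton of tracking by detection (detect() and associate() are
# hypothetical placeholders for the CNN ensemble and data association).
def track_video(frames, detect, associate):
    tracks = []                          # each track: list of (t, box, cls, score)
    for t, frame in enumerate(frames):
        detections = detect(frame)       # per-frame detections
        matches, unmatched = associate(tracks, detections)
        for track, det in matches:
            track.append((t,) + det)     # extend an existing track segment
        for det in unmatched:
            tracks.append([(t,) + det])  # birth: start a new segment
    return tracks
```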
MCMOT using CPD

[Diagram: MCMOT pipeline over a video sequence — (a) likelihood calculation, (b) track segment creation, (c) changing point detection with forward-backward validation around each detected change point, (d) trajectory combination into the final MCMOT trajectory result.]

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• MCMC-based MOT approach
A changing number of moving objects is challenging and requires high computational overhead, due to the high-dimensional state space.

• Separating motion dynamics
The method separates the motion dynamics model of the Bayesian filter into entity transitions and motion moves. By separating out the birth and death moves, there is no dimension variation in the iteration loop.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• Estimation of entity state transitions (birth, death)
Entity transitions are modeled as birth and death events. We estimate the entity prior with a data-driven approach, instead of inside the MCMC loop.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• Separating motion dynamics
Pros: since the Markov chain has no dimension variation in the iteration loop, it can reach stationary states with less computational overhead (see the sketch below).
Cons: such a simple approach cannot deal with the complex situations that occur in MOT; many trackers suffer from track drift due to appearance variations.
The drift problem is attacked with a CPD algorithm.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
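A heavily reduced sketch of the fixed-dimension idea (ours, not the paper's actual sampler or likelihood): with births and deaths decided outside the loop, each Metropolis–Hastings iteration only perturbs the motion state of an existing track, so the chain's dimension never changes.

```python
# Heavily reduced sketch (ours; not the paper's sampler or likelihood):
# births/deaths are fixed before sampling, so every Metropolis-Hastings
# move perturbs the motion state of an existing track -- the chain's
# dimension never changes inside the loop.
import random

def mcmc_motion_update(tracks, likelihood, iters=100, step=2.0):
    """tracks: {track_id: [x, y, w, h]}; likelihood(tid, box) -> positive float."""
    state = {tid: list(box) for tid, box in tracks.items()}
    for _ in range(iters):
        tid = random.choice(list(state))             # pick a track, not a dimension
        proposal = [c + random.gauss(0.0, step) for c in state[tid]]
        old = likelihood(tid, state[tid])
        new = likelihood(tid, proposal)
        if new >= old or random.random() < new / max(old, 1e-12):
            state[tid] = proposal                    # accept the motion move
    return state
```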
Changing Point Detection

• Two-stage learning for changing point detection
• Drifts in MCMOT are found by detecting abrupt change points between the stationary time series that represent a track segment
• A possible track drift is flagged by changing point detection
• A 2nd-level time series is built from the scanned average responses to reduce outliers in the time series (a minimal sketch follows)

* Figure from J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
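A minimal, ChangeFinder-style sketch of the two-stage idea (the Gaussian model and all parameters are our assumptions; the paper builds its 2nd-level series from scanned average responses over track-segment features):

```python
# Minimal ChangeFinder-style sketch of the two-stage idea (the Gaussian
# model and parameters are our assumptions, for illustration only).
import numpy as np

def change_scores(x, r=0.05, window=5):
    mu, var = float(x[0]), 0.01        # illustrative initial estimates
    first = []
    for v in x:                        # stage 1: online outlier score
        score = (v - mu) ** 2 / (2 * var) + 0.5 * np.log(2 * np.pi * var)
        mu = (1 - r) * mu + r * v      # sequentially discounted mean
        var = (1 - r) * var + r * (v - mu) ** 2
        first.append(score)
    # stage 2: averaging builds the 2nd-level series; a sustained shift
    # (a change point) survives while a single-frame outlier is damped
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(first), kernel, mode="same")

conf = np.r_[np.full(50, 0.9), np.full(50, 0.4)]  # track confidence with a drift
print(change_scores(conf).argmax())               # peaks near the change at t = 50
```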
Changing Point Detection

[Flow: track segments → changing point detection → CPD score. A low CPD score keeps the segment as confident; a high CPD score triggers forward-backward validation → FB error. A low FB error keeps the segment as confident; a high FB error excludes it as unstable. Output: confident segments.]
J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
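The flow above reduces to a two-threshold filter; here is a sketch where the thresholds and the cpd_score()/fb_error() callbacks are illustrative placeholders, not values from the paper.

```python
# The flow above as a two-threshold filter (thresholds and the
# cpd_score()/fb_error() callbacks are illustrative placeholders).
def filter_segments(segments, cpd_score, fb_error,
                    cpd_thresh=0.5, fb_thresh=0.5):
    confident = []
    for seg in segments:
        if cpd_score(seg) < cpd_thresh:
            confident.append(seg)       # low CPD score: keep confident segment
        elif fb_error(seg) < fb_thresh:
            confident.append(seg)       # change point but low FB error: keep
        # high CPD score and high FB error: exclude unstable segment
    return confident
```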
Changing Point Detection

[Examples: detected change points on MOT16 sequences — MOT16-09 #170, #172, #174; MOT16-07 #179; MOT16-02 #438, #440, #442, #444]

* Images are from the MOT 2016 benchmark
J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Tracking Speed

Tracking performance comparison on the MOT 2016 benchmark:
• The reported timing excludes detection time
• With a Titan X (Maxwell) GPU, the detector runs at approximately 3.5 FPS

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.