MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

ILSVRC 2016 Object Detection from Video

Byungjae Lee¹, Songguo Jin¹, Enkhbayar Erdenee¹, Mi Young Nam², Young Gui Jung², Phill Kyu Rhee¹
Inha University¹, NaeulTech²
Results

• Object Detection from Video (VID): 2nd place (mAP: 73.15%)
• Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%), with additional training data
Overview

I. Faster R-CNN Object Detector
  + Context region
  + Larger feature map
  + Ensemble
  + Data configuration

II. MCMOT: Multi-Class Multi-Object Tracking
  • Tracking by Detection
  • Detection: Ensemble of CNNs
  • Tracking: MCMOT using CPD
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
I. Faster R-CNN Object Detector
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Object Detection from Video
Challenge I: Small Object
* Figure from J. Li et al. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
Object Detection from Video
Challenge I: Small Object
Solution – Larger feature map
* Figure from J. Li et al. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
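To make the fix concrete, here is a minimal PyTorch sketch (our illustration with a stock torchvision VGG16, not the challenge entry's code) of dropping the pool4 layer so the conv5 block runs at stride 8 instead of 16:

```python
# Minimal sketch (not the authors' code): drop 'pool4' from a stock VGG16
# so the conv5 block sees a 2x larger feature map (stride 8 instead of 16).
import torch
import torchvision

vgg = torchvision.models.vgg16()  # weights omitted for brevity
layers = list(vgg.features)       # conv / ReLU / max-pool layers in order
pool_idx = [i for i, m in enumerate(layers)
            if isinstance(m, torch.nn.MaxPool2d)]  # five pools in VGG16

# Faster R-CNN already truncates at conv5_3 (no pool5); additionally skip pool4.
backbone = torch.nn.Sequential(
    *[m for i, m in enumerate(layers[:pool_idx[4]]) if i != pool_idx[3]])

x = torch.randn(1, 3, 600, 1000)   # typical Faster R-CNN input resolution
print(backbone(x).shape)           # [1, 512, 75, 125] vs 37x62 with pool4 kept
```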
Object Detection from Video
Challenge II: Blurred Object
* Figure from Y. Zhu et al. “segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection”. CVPR 2015.
Object Detection from Video
Challenge II: Blurred Object
Solution – Context Region
* Figure from Y. Zhu et al. “segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection”. CVPR 2015.
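As a concrete sketch of the context region (the helper function and layout are ours): enlarge each detection box by 1.5× around its center and clip it to the image before pooling context features.

```python
# Sketch (our helper, not released code): enlarge each RoI by 1.5x around
# its center to form the context region, clipped to the image bounds.
import numpy as np

def context_boxes(boxes, im_w, im_h, scale=1.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = (boxes[:, 2] - boxes[:, 0]) * scale
    h = (boxes[:, 3] - boxes[:, 1]) * scale
    out = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    out[:, 0::2] = out[:, 0::2].clip(0, im_w - 1)  # clip x1, x2
    out[:, 1::2] = out[:, 1::2].clip(0, im_h - 1)  # clip y1, y2
    return out

print(context_boxes(np.array([[100., 100., 200., 300.]]), 640, 480))
# -> [[ 75.  50. 225. 350.]]
```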
Network Architecture

I. Faster R-CNN with VGG16

[Diagram: VGG16 backbone (Conv1–Conv5 with pool1–pool4) feeding a Region Proposal Network and an RoI pooling layer, followed by two FC layers with a class layer (softmax classification loss) and a bbox layer (bounding-box regression loss).]

K. Simonyan, & A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Network Architecture

I. Faster R-CNN with VGG16
+ Larger feature map (remove ‘pool4’ layer) (+2.9% mAP)

[Diagram: the same architecture with the pool4 layer removed (Conv4ˈ), so the Region Proposal Network and RoI pooling layer operate on a larger feature map.]

J. Li, X. Liang, S. Shen, T. Xu, & S. Yan. “Scale-aware Fast R-CNN for Pedestrian Detection”. arXiv 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
Network Architecture

I. Faster R-CNN with VGG16
+ Larger feature map (remove ‘pool4’ layer) (+2.9% mAP)
+ Context region (+2.6% mAP)

[Diagram: a context branch is added — each RoI is enlarged (box × 1.5) and pooled by a context RoI pooling layer, passed through its own FC layers, and merged with the original branch in a concat layer before the class and bbox layers.]

S. Gidaris, & N. Komodakis. “Object detection via a multi-region & semantic segmentation-aware CNN model”. CVPR 2015.
S. Ren, K. He, R. Girshick, & J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. TPAMI 2016.
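A rough sketch of the context branch (function and layer names are ours, and the exact concat point is an assumption — in the slides each branch has its own FC layers before the concat layer): pool conv5 features for both the RoI and its 1.5× context box, then concatenate, assuming torchvision's roi_align.

```python
# Rough sketch of the context branch (names and concat point are ours; the
# slides concatenate after each branch's FC layers): pool conv features for
# the RoI and for its 1.5x context box, then concatenate along channels.
import torch
from torchvision.ops import roi_align

def roi_with_context(feat, rois, ctx_rois, out_size=7, spatial_scale=1 / 8.0):
    f_obj = roi_align(feat, [rois], output_size=out_size, spatial_scale=spatial_scale)
    f_ctx = roi_align(feat, [ctx_rois], output_size=out_size, spatial_scale=spatial_scale)
    return torch.cat([f_obj, f_ctx], dim=1)   # channel-wise concat

feat = torch.randn(1, 512, 75, 125)           # conv5 map (pool4 removed -> stride 8)
rois = torch.tensor([[100., 100., 200., 300.]])
ctx = torch.tensor([[75., 50., 225., 350.]])  # the 1.5x enlarged box from earlier
print(roi_with_context(feat, rois, ctx).shape)  # torch.Size([1, 1024, 7, 7])
```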
Training Data Configuration

Overview of the ILSVRC VID dataset:

ILSVRC VID   Training    Validation
Images       1,122,397   176,126
Snippets     3,862       555

• Redundant images within each snippet
• Diversity is too low to train a CNN
• We need more data

* Images are from the ILSVRC VID dataset
Training Data Configuration
16
ILSVRC VID
ILSVRC DET
AdditionalData
COCODET
ImageNet CLSPre-trained Model
Training Data Configuration
17
ILSVRC VID
ILSVRC DET
AdditionalData
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
Fine-tune
Training Data Configuration
18
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
Fine-tune
Training Data Configuration
19
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
TrainingDataset
Fine-tune
Training Data Configuration
20
ILSVRC VID
ILSVRC DET
Sample 2K images per class
Sample 2K images per class (30 classes)
AdditionalData
100 imagesper class
COCODET
ImageNet CLSPre-trained Model
COCO DETPre-trained Model
TrainingDataset
FinalModel
Fine-tune
Fine-tune
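A hedged sketch of the sampling step: the slides only state "sample 2K images per class", so the round-robin draw across snippets below is our assumption, chosen to counter the frame redundancy noted earlier.

```python
# Hedged sketch (our assumption): sample up to 2K images per class,
# round-robin across snippets so no single snippet's near-duplicate
# frames dominate a class.
import random
from collections import defaultdict

def sample_per_class(annotations, per_class=2000, seed=0):
    """annotations: iterable of (image_id, snippet_id, class_name) tuples."""
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for image_id, snippet_id, cls in annotations:
        by_class[cls][snippet_id].append(image_id)
    sampled = {}
    for cls, snippets in by_class.items():
        pools = list(snippets.values())
        for pool in pools:
            rng.shuffle(pool)
        picks = []
        while len(picks) < per_class:
            pools = [p for p in pools if p]   # drop exhausted snippets
            if not pools:
                break                         # class has fewer than per_class images
            for pool in pools:                # one frame per snippet per round
                if len(picks) >= per_class:
                    break
                picks.append(pool.pop())
        sampled[cls] = picks
    return sampled
```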
Detection Components

VGG16                  mAP(%)     ResNet-101   mAP(%)
Baseline               70.7       Baseline     78.8
+ Larger feature map   73.6
+ Context layer        76.2

[Diagram: detection results from Faster R-CNN (VGG16) and Faster R-CNN (ResNet) are combined, and the combination is fed to MCMOT tracking, reaching 82.3% mAP.]

* We used 12 anchors for the RPN
* Performance is evaluated on VID minival (only the initial 30 frames of each snippet are used, 16,524 images in total)
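The slides do not spell out the combination scheme; a common, minimal approach (shown here as an assumption, not the entry's documented method) is to pool each class's detections from both models and run non-maximum suppression on the union.

```python
# Assumption, not the entry's documented method: a minimal ensemble that
# pools per-class detections from both models and runs NMS on the union.
import numpy as np

def nms(boxes, scores, thresh=0.45):
    """Greedy IoU-based non-maximum suppression; returns kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]
    return keep

def ensemble(dets_vgg, dets_resnet, thresh=0.45):
    """dets_*: (N, 5) arrays of [x1, y1, x2, y2, score] for one class."""
    dets = np.vstack([dets_vgg, dets_resnet])
    return dets[nms(dets[:, :4], dets[:, 4], thresh)]
```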
II. MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Motivation

[Chart: ILSVRC object detection performance by year — mAP 0.23 (2013), 0.44 (2014), 0.62 (2015), 0.66 (2016)]

• Object detectors have become robust
• Should we still use a complex multi-object tracking algorithm?
Motivation

Based on high-performance detection, a simple and fast MOT algorithm can achieve competitive results.
Results

• ILSVRC2016 Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%), with additional training data
• Our MCMOT also achieves state-of-the-art results on other MOT datasets

[Chart: Multiple Object Tracking Benchmark 2016, MOTA(%) — MCMOT 62.4, NOMTwSDP16 62.2, KFILDAwSDP 57.3, EAMTT 52.5, AMPL 50.9]
[Chart: KITTI Tracking Benchmark (Car), MOTA(%) — MCMOT 72.11, RBPF 71.44, DuEye 70.15, NOMT 69.73, CNN-OCC 69.46]

https://motchallenge.net/results/MOT16/
http://www.cvlibs.net/datasets/kitti/eval_tracking.php
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Tracking by Detection

[Diagram: a video sequence is processed over time t — per-frame object detection produces detections, and multi-object tracking links them into trajectories.]

• Object Detection: Ensemble of CNNs
• Multi-Object Tracking: MCMOT using CPD

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
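In skeleton form, tracking by detection reduces to a per-frame loop; detect() and associate() below are hypothetical placeholders for the CNN ensemble and the data-association step.

```python
# Skeleton of tracking by detection (detect() and associate() are
# hypothetical placeholders for the CNN ensemble and data association).
def track_video(frames, detect, associate):
    tracks = []                          # each track: list of (t, box, cls, score)
    for t, frame in enumerate(frames):
        detections = detect(frame)       # per-frame detections
        matches, unmatched = associate(tracks, detections)
        for track, det in matches:
            track.append((t,) + det)     # extend an existing track segment
        for det in unmatched:
            tracks.append([(t,) + det])  # birth: start a new segment
    return tracks
```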
MCMOT using CPD

[Diagram: MCMOT pipeline over a video sequence — (a) likelihood calculation, (b) track segment creation, (c) changing point detection with forward-backward validation around each detected change point, (d) trajectory combination into the final MCMOT trajectory result.]

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• MCMC-based MOT approach
A changing number of moving objects is challenging and requires high computational overhead, due to the high-dimensional state space.

• Separating motion dynamics
The method separates the motion dynamics model of the Bayesian filter into entity transitions and motion moves. By separating out the birth and death moves, there is no dimension variation in the iteration loop.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• Estimation of entity state transitions (birth, death)
Entity transitions are modeled as birth and death events. We estimate the entity prior with a data-driven approach, instead of inside the MCMC loop.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Track Segment Creation

• Separating motion dynamics
Pros: since the Markov chain has no dimension variation in the iteration loop, it can reach stationary states with less computational overhead (see the sketch below).
Cons: such a simple approach cannot deal with the complex situations that occur in MOT; many trackers suffer from track drift due to appearance variations.
The drift problem is attacked with a CPD algorithm.

H. Sakaino. “Video-based tracking, learning, and recognition method for multiple moving objects”. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
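A heavily reduced sketch of the fixed-dimension idea (ours, not the paper's actual sampler or likelihood): with births and deaths decided outside the loop, each Metropolis–Hastings iteration only perturbs the motion state of an existing track, so the chain's dimension never changes.

```python
# Heavily reduced sketch (ours; not the paper's sampler or likelihood):
# births/deaths are fixed before sampling, so every Metropolis-Hastings
# move perturbs the motion state of an existing track -- the chain's
# dimension never changes inside the loop.
import random

def mcmc_motion_update(tracks, likelihood, iters=100, step=2.0):
    """tracks: {track_id: [x, y, w, h]}; likelihood(tid, box) -> positive float."""
    state = {tid: list(box) for tid, box in tracks.items()}
    for _ in range(iters):
        tid = random.choice(list(state))             # pick a track, not a dimension
        proposal = [c + random.gauss(0.0, step) for c in state[tid]]
        old = likelihood(tid, state[tid])
        new = likelihood(tid, proposal)
        if new >= old or random.random() < new / max(old, 1e-12):
            state[tid] = proposal                    # accept the motion move
    return state
```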
Changing Point Detection

• Two-stage learning for changing point detection
• Drifts in MCMOT are found by detecting abrupt change points between the stationary time series that represent a track segment
• A possible track drift is flagged by changing point detection
• A 2nd-level time series is built from the scanned average responses to reduce outliers in the time series (a minimal sketch follows)

* Figure from J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
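A minimal, ChangeFinder-style sketch of the two-stage idea (the Gaussian model and all parameters are our assumptions; the paper builds its 2nd-level series from scanned average responses over track-segment features):

```python
# Minimal ChangeFinder-style sketch of the two-stage idea (the Gaussian
# model and parameters are our assumptions, for illustration only).
import numpy as np

def change_scores(x, r=0.05, window=5):
    mu, var = float(x[0]), 0.01        # illustrative initial estimates
    first = []
    for v in x:                        # stage 1: online outlier score
        score = (v - mu) ** 2 / (2 * var) + 0.5 * np.log(2 * np.pi * var)
        mu = (1 - r) * mu + r * v      # sequentially discounted mean
        var = (1 - r) * var + r * (v - mu) ** 2
        first.append(score)
    # stage 2: averaging builds the 2nd-level series; a sustained shift
    # (a change point) survives while a single-frame outlier is damped
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(first), kernel, mode="same")

conf = np.r_[np.full(50, 0.9), np.full(50, 0.4)]  # track confidence with a drift
print(change_scores(conf).argmax())               # peaks near the change at t = 50
```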
Changing Point Detection

[Flow: track segments → changing point detection → CPD score. A low CPD score keeps the segment as confident; a high CPD score triggers forward-backward validation → FB error. A low FB error keeps the segment as confident; a high FB error excludes it as unstable. Output: confident segments.]
J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
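The flow above reduces to a two-threshold filter; here is a sketch where the thresholds and the cpd_score()/fb_error() callbacks are illustrative placeholders, not values from the paper.

```python
# The flow above as a two-threshold filter (thresholds and the
# cpd_score()/fb_error() callbacks are illustrative placeholders).
def filter_segments(segments, cpd_score, fb_error,
                    cpd_thresh=0.5, fb_thresh=0.5):
    confident = []
    for seg in segments:
        if cpd_score(seg) < cpd_thresh:
            confident.append(seg)       # low CPD score: keep confident segment
        elif fb_error(seg) < fb_thresh:
            confident.append(seg)       # change point but low FB error: keep
        # high CPD score and high FB error: exclude unstable segment
    return confident
```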
Changing Point Detection

[Examples: detected change points on MOT16 sequences — MOT16-09 #170, #172, #174; MOT16-07 #179; MOT16-02 #438, #440, #442, #444]

* Images are from the MOT 2016 benchmark
J. Takeuchi, & K. Yamanishi. “A Unifying Framework for Detecting Outliers and Change Points from Time Series”. TKDE 2006.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.
Tracking Speed

Tracking performance comparison on the MOT 2016 benchmark:
• The reported timing excludes detection time
• With a Titan X (Maxwell) GPU, the detector runs at approximately 3.5 FPS

B. Lee, E. Erdenee, S. Jin, & P. Rhee. “Multi-Class Multi-Object Tracking using Changing Point Detection”. arXiv 2016.