
  • AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

    Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.

  • State-of-the-art frameworks for object detection.

    1. Region-CNN framework. [Gkioxari et al., CVPR’14]

    Pipeline: Object proposal → CNN → SVM → NMS → BB Reg.

    (−) The maximally scored region is prone to focus on a discriminative part (e.g. face)
    rather than the entire object (e.g. human body). (An illustrative NMS sketch follows this item.)
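
    For reference, a minimal sketch of the NMS stage in such a pipeline. This is standard
    greedy non-maximum suppression, not code from the cited work; the IoU threshold is an
    illustrative value.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    # Greedy NMS: keep the highest-scored box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```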

  • State-of-the-art frameworks for object detection.

    2. Detection by CNN-regression. [Szegedy et al., NIPS’13]

    A CNN directly regresses the bounding-box coordinates (x1, y1) and (x2, y2) from the image.

    (−) Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.
    (An illustrative sketch of such a regression head follows this item.)
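
    As a generic illustration of the regression formulation above (a sketch, not the cited
    architecture): a CNN feature vector is mapped by a small head to the four box
    coordinates. The layer sizes and the [0, 1] normalization are assumptions made here
    for illustration.

```python
import torch
import torch.nn as nn

# A regression head predicting (x1, y1, x2, y2) from a CNN feature vector.
box_regressor = nn.Sequential(
    nn.Linear(4096, 1024),
    nn.ReLU(),
    nn.Linear(1024, 4),
    nn.Sigmoid(),            # keep coordinates normalized to [0, 1]
)

feat = torch.randn(1, 4096)              # stand-in for CNN features of the input image
x1, y1, x2, y2 = box_regressor(feat)[0]  # predicted corner coordinates
```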

  • Idea: Ensemble of weak predictions, terminated by a stop signal.

    Instead of predicting an exact bounding box in one shot, accumulate many weak
    directional predictions step by step until a stop signal is predicted.

  • Model: Rather than a CNN regression model, use a CNN classification model.

    Shared layers: Convolution → Normalization → Pooling → Convolution → Normalization →
    Pooling → Convolution → Convolution → Convolution → Fully connected → Fully connected.

    Two classification outputs, one per corner, each [ 3 directions, stop signal, no object ] ∈ ℝ⁵:

    Top-left direction prediction: → ↘ ↓ • F
    Bottom-right direction prediction: ← ↖ ↑ • F

    (A minimal sketch of the two output heads follows this item.)
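
    A minimal sketch of the two 5-way output heads, not the authors' code. The 4,096-d
    shared feature size and the class names are assumptions based on the fully connected
    stage listed above.

```python
import torch
import torch.nn as nn

class CornerHeads(nn.Module):
    """Two 5-way classifiers on a shared CNN feature vector:
    [3 directions, stop signal, no object] for each corner."""
    def __init__(self, feat_dim=4096, num_classes=5):
        super().__init__()
        self.top_left = nn.Linear(feat_dim, num_classes)      # scores for →, ↘, ↓, stop, no-object
        self.bottom_right = nn.Linear(feat_dim, num_classes)  # scores for ←, ↖, ↑, stop, no-object

    def forward(self, feat):
        return self.top_left(feat), self.bottom_right(feat)

heads = CornerHeads()
feat = torch.randn(1, 4096)   # stand-in for the shared fully connected activation
y_tl, y_br = heads(feat)      # two 1x5 score vectors
```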

  • Iterative test: Ensemble of weak directions.

    The current crop is repeatedly adjusted along the predicted corner directions and
    re-evaluated until both corners predict the stop signal. (A sketch of this loop
    follows this item.)
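
    An illustrative sketch of the iterative test, not the authors' code. The
    `predict_directions` function stands in for a forward pass of the trained network on
    the current crop, and the step size is an arbitrary illustrative value.

```python
STEP = 30  # pixels moved per weak prediction (illustrative value)

def localize(image, box, predict_directions, max_iters=50):
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        d_tl, d_br = predict_directions(image, (x1, y1, x2, y2))
        if d_tl == "no_object" or d_br == "no_object":
            return None                              # reject: no target in this window
        if d_tl == "stop" and d_br == "stop":
            break                                    # both corners have converged
        # Top-left corner can only move right / down-right / down.
        if d_tl in ("right", "down_right"): x1 += STEP
        if d_tl in ("down", "down_right"):  y1 += STEP
        # Bottom-right corner can only move left / up-left / up.
        if d_br in ("left", "up_left"):     x2 -= STEP
        if d_br in ("up", "up_left"):       y2 -= STEP
    return (x1, y1, x2, y2)
```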

  • Training AttentionNet.

    1. Generating training samples.

    2. Minimizing the loss function by back-propagation and stochastic gradient descent:

       L = (1/2) L_softmax(y_TL, t_TL) + (1/2) L_softmax(y_BR, t_BR),

    where y and t are the predicted scores and target labels for the top-left (TL) and
    bottom-right (BR) corners.
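
    A minimal sketch of this loss in PyTorch; `F.cross_entropy` already combines the
    softmax and the negative log-likelihood. Tensor names follow the equation above, and
    the batch size is illustrative.

```python
import torch
import torch.nn.functional as F

def attention_loss(y_tl, y_br, t_tl, t_br):
    # Average of two softmax (cross-entropy) losses, one per corner head.
    return 0.5 * F.cross_entropy(y_tl, t_tl) + 0.5 * F.cross_entropy(y_br, t_br)

# Toy usage: a batch of 8 samples with 5 classes per head.
y_tl, y_br = torch.randn(8, 5), torch.randn(8, 5)
t_tl, t_br = torch.randint(0, 5, (8,)), torch.randint(0, 5, (8,))
loss = attention_loss(y_tl, y_br, t_tl, t_br)
```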

  • Result. (Good examples.)

  • Result. (Bad examples.)

  • How to detect multiple instances?

  • Extension to multiple-instance: 1. Fast multi-scale sliding window search
    using a fully-convolutional network.

  • *Fast extraction of multi-scale dense activations.

    Network stack: Conv. 1 → Conv. 2 → Conv. 3 → Conv. 4 → Conv. 5 → FC 6 → FC 7 → FC 8,
    applied to a 227×227×3 input and, for a larger scale, to a 322×322×3 input.

    Idea: A fully connected layer can be equally implemented by a convolutional layer,
    so FC 6 and FC 7 are replaced by Conv. 6 and Conv. 7.

    The fully-convolutional network then produces multi-scale dense activations (4,096-d),
    where each activation vector comes from a distinct input patch. (See the sketch after
    this item.)
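
    A small sketch of the "fully connected as convolution" idea. It assumes an
    AlexNet-style network whose last pooled feature map is 6×6×256 for a 227×227 input;
    the exact shapes are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

# FC 6 applied to a flattened 6x6x256 feature map ...
fc6 = nn.Linear(256 * 6 * 6, 4096)

# ... is equivalent to a 6x6 convolution with the same (reshaped) weights.
conv6 = nn.Conv2d(256, 4096, kernel_size=6)
with torch.no_grad():
    conv6.weight.copy_(fc6.weight.view(4096, 256, 6, 6))
    conv6.bias.copy_(fc6.bias)

feat = torch.randn(1, 256, 6, 6)                  # feature map for one 227x227 crop
same = torch.allclose(fc6(feat.flatten(1)),       # (1, 4096)
                      conv6(feat).flatten(1),     # (1, 4096, 1, 1) -> (1, 4096)
                      atol=1e-5)
print(same)  # expected: True

# On the larger feature map of a 322x322 input, the same conv6 slides over every 6x6
# window and yields a grid of 4,096-d activations, one per patch of the larger input.
```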

  • Extension to multiple-instance: 1. Fast multi-scale sliding window search
    using a fully-convolutional network.

  • Extension to multiple-instance:
    2. Early rejection with the {↘TL, ↖BR} constraint.

    Windows satisfying {↘TL, ↖BR}: start the iterative test.
    Windows not satisfying {↘TL, ↖BR}: reject immediately. (A sketch of this rule
    follows this item.)
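
    A tiny sketch of the early-rejection rule described above: only windows whose very
    first prediction is ↘ for the top-left corner and ↖ for the bottom-right corner are
    kept for the iterative test. The direction names and `predict_directions` are the
    same illustrative assumptions used in the earlier localization sketch.

```python
def passes_initial_constraint(d_tl, d_br):
    # Keep the window only if the first weak prediction pulls both corners inward
    # diagonally: down-right for top-left, up-left for bottom-right.
    return d_tl == "down_right" and d_br == "up_left"

def candidate_windows(windows, predict_directions, image):
    for box in windows:
        d_tl, d_br = predict_directions(image, box)
        if passes_initial_constraint(d_tl, d_br):
            yield box          # start the iterative test from this window
        # otherwise: rejected immediately
```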

  • Extension to multiple-instance: Overall architecture for sliding window search.

  • Extension to multiple-instance: Merging multiple bounding boxes.

  • Evaluation on PASCAL VOC series.

    [Figures: average precision on PASCAL VOC 2007 “Person” and PASCAL VOC 2012 “Person”,
    comparing RCNN (58.7), RCNN-based methods, AttentionNet, and AttentionNet+RCNN;
    precision-recall curve on PASCAL VOC 2007 “Person”.]