


First Place Solution: Three-stage Visual Relationship Detector

Yichao Lu*, Cheng Chang*, Himanshu Rai*

Layer6 AI

First Stage (Object Detection - Partial Weight Transfer)

Second Stage (Spatio-Semantic Model)

Third Stage (Aggregation Model)

➢ The goal of the aggregation model is to learn to aggregate the spatial, semantic, and visual features into the final prediction.

➢ We use a GBM that takes the predictions of both second-stage models, together with the spatial and semantic features, as input, and it outperforms both second-stage models. Experiments on the validation set demonstrate that the GBM-based aggregation model also outperforms averaging the two second-stage models.
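The stacking described above can be sketched in a few lines. The feature layout and function names are assumptions (the poster does not specify them), and the actual third stage is a trained GBM rather than anything shown here; the averaging baseline is the one the poster compares against.

```python
def aggregation_features(gbm_pred, cnn_pred, spatial, semantic):
    """Concatenate both second-stage predictions with the spatial and
    semantic features; this vector is what the third-stage GBM consumes.
    The ordering is an assumption, not taken from the poster."""
    return list(gbm_pred) + list(cnn_pred) + list(spatial) + list(semantic)

def average_baseline(gbm_pred, cnn_pred):
    """The simple baseline the GBM aggregator outperforms:
    averaging the two second-stage models' scores."""
    return [(g + c) / 2.0 for g, c in zip(gbm_pred, cnn_pred)]

# Hypothetical per-relationship scores and features for one candidate pair.
row = aggregation_features([0.7, 0.1], [0.6, 0.2], [5.0, 1.0], [0.3])
# row -> [0.7, 0.1, 0.6, 0.2, 5.0, 1.0, 0.3]
```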

[Pipeline diagram: input image → detection model → per-pair feature extraction (spatial, semantic, visual) → two GBMs and a CNN (second stage) → aggregation GBM (third stage) → predicted relationship, e.g. boy "hits" football.]

Overview of Model Pipeline

➢ Object Detection: an ensemble of detectors trained using our Partial Weight Transfer approach to detect objects belonging to one of the 57 categories.

➢ Relationship Detection: GBM- and CNN-based models to predict relationships.

Introduction

➢ Given an image, the goal is to detect objects and their relationships.

➢ The relationships include human-object relationships, object-object relationships, and object-attribute relationships.

➢ We propose a three-stage model that consists of object detection models as the first stage, a spatio-semantic GBM model and a CNN model as the second stage, and a GBM-based aggregation model as the third stage.
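The three stages can be sketched as a pipeline of callables. Every model below is a toy stand-in: the real first stage is a detector ensemble, and the real second and third stages are trained GBMs and a CNN, none of which are reproduced here.

```python
def run_pipeline(image, detector, spatio_semantic_model, visual_model, aggregator):
    """Three-stage relationship detection, with the models passed in as callables."""
    detections = detector(image)                         # stage 1: object detection
    # Every ordered pair of distinct detections is a relationship candidate.
    candidates = [(s, o) for s in detections for o in detections if s is not o]
    results = []
    for subj, obj in candidates:
        p_ss = spatio_semantic_model(subj, obj)          # stage 2a: spatio-semantic GBM
        p_vis = visual_model(image, subj, obj)           # stage 2b: visual CNN
        results.append((subj[0], obj[0], aggregator(p_ss, p_vis)))  # stage 3
    return results

# Toy stand-ins: one "boy" and one "football" detection, fixed scores.
dets = [("boy", (0, 0, 10, 20)), ("football", (12, 15, 16, 19))]
out = run_pipeline(
    image=None,
    detector=lambda img: dets,
    spatio_semantic_model=lambda s, o: 0.6,
    visual_model=lambda img, s, o: 0.8,
    aggregator=lambda a, b: (a + b) / 2,   # averaging stands in for the GBM
)
```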

➢ Spatio-Semantic Model: We use a tree-based Gradient Boosting Machine (GBM) to model the spatial relationship between pairs of objects (e.g. the Euclidean distance between the centers of the objects) as well as their semantic relationship (e.g. the probability that a pair of objects are related).
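Of the spatial features, only the center distance is explicitly named in the text; the offsets and area ratio below are illustrative additions, and `spatial_features` itself is a hypothetical helper, not the authors' feature extractor.

```python
import math

def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def spatial_features(subj_box, obj_box):
    """A few plausible spatial features for a subject/object box pair."""
    (sx, sy), (ox, oy) = box_center(subj_box), box_center(obj_box)
    subj_area = (subj_box[2] - subj_box[0]) * (subj_box[3] - subj_box[1])
    obj_area = (obj_box[2] - obj_box[0]) * (obj_box[3] - obj_box[1])
    return {
        "center_dist": math.hypot(sx - ox, sy - oy),  # Euclidean center distance
        "dx": ox - sx,                                # horizontal offset
        "dy": oy - sy,                                # vertical offset
        "area_ratio": subj_area / obj_area,           # relative object sizes
    }

# Centers are (1, 1) and (4, 5), so the center distance is 5.0.
feats = spatial_features((0, 0, 2, 2), (3, 4, 5, 6))
```

The semantic features (e.g. the probability that a pair of classes are related) would come from label statistics on the training set and are not shown here.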

[Partial Weight Transfer diagram: classes mapped between the COCO model (CAR, DOG, GIRL, MAN, PERSON, ...) and the Open Images model (CAR, BOY, WOMAN, DOG, ...); classifier-head weights for the mapped classes are transferred.]

➢ In order to speed up training and achieve higher mAP, we propose Partial Weight Transfer (PWT).

➢ We mapped classes between COCO and Open Images.

➢ We transfer the classifier-head weights for the mapped classes, along with the backbone and regression weights.

➢ We used an ensemble of SOTA models such as Cascade R-CNN and HRNet, combined via weighted NMS.

➢ For “is” relationships, we formed all possible subject-attribute pairs and used an object detector trained with PWT.
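A framework-agnostic sketch of the partial weight transfer idea: classifier-head rows for classes present in both label spaces are copied from the source (COCO) checkpoint, while unmapped classes get a fresh initialization. The data layout and names here are assumptions; in practice this would operate on a detector checkpoint (e.g. a state dict), and the backbone and box-regression weights would be copied wholesale.

```python
def partial_weight_transfer(src_head, src_classes, tgt_classes, init_row):
    """Build a target classifier head, copying rows for mapped classes.

    src_head:    per-class weight rows of the source (e.g. COCO) model,
                 as {class_name: row}.
    src_classes: label list of the source dataset.
    tgt_classes: label list of the target (e.g. Open Images) dataset.
    init_row:    called for classes with no source counterpart (fresh init).
    """
    mapped = set(src_classes) & set(tgt_classes)   # the class mapping
    return {
        cls: list(src_head[cls]) if cls in mapped else init_row(cls)
        for cls in tgt_classes
    }

# Tiny hypothetical heads: "car" and "dog" exist in both label spaces.
coco_head = {"car": [0.1, 0.2], "dog": [0.3, 0.4], "person": [0.5, 0.6]}
oi_head = partial_weight_transfer(
    coco_head,
    src_classes=["car", "dog", "person"],
    tgt_classes=["car", "boy", "dog"],
    init_row=lambda cls: [0.0, 0.0],   # stand-in for random initialization
)
# "car" and "dog" rows are transferred; "boy" is freshly initialized.
```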


Conclusion and Leaderboard

➢ We generalized across the public and private leaderboards, with roughly a 5% lead over the second-place team.

➢ Our PWT approach allowed us to train high-performance models at low computational cost in a small amount of time.

Second Stage (Visual Model)

➢ Visual Model: The convolutional neural network based visual model provides visual features to resolve relationships that cannot be inferred from spatial and semantic features alone.

➢ The spatio-semantic model is good at predicting relationships that are largely determined by spatial features (e.g. man “on” horse, Fig 1), while a visual model is required for relationships that are highly visual (e.g. woman “holds” mike, Fig 2).

[Fig 1: man “on” horse. Fig 2: woman “holds” mike.]
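The poster does not specify how the visual model's input is constructed. One common choice for a subject/object pair is to crop the union of the two boxes and feed that region to the CNN; the `union_box` helper below is an assumption in that spirit, not the authors' method.

```python
def union_box(subj_box, obj_box):
    """Smallest box covering both the subject and object boxes,
    as (x_min, y_min, x_max, y_max); a plausible CNN input region."""
    return (
        min(subj_box[0], obj_box[0]),
        min(subj_box[1], obj_box[1]),
        max(subj_box[2], obj_box[2]),
        max(subj_box[3], obj_box[3]),
    )

# Hypothetical woman box and mike box -> region the CNN would look at.
crop = union_box((10, 5, 60, 120), (40, 30, 55, 45))
# crop -> (10, 5, 60, 120): the mike box lies inside the woman box here.
```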