CS 1674: Intro to Computer Vision Sequential Data: Language and Vision; (Video and Motion) Prof. Adriana Kovashka University of Pittsburgh December 3, 2019


CS 1674: Intro to Computer Vision

Sequential Data: Language and Vision; (Video and Motion)

Prof. Adriana Kovashka

University of Pittsburgh

December 3, 2019


Plan for this lecture

• Language and vision

– Image captioning

– Tool: Recurrent neural networks

– Video captioning

– Visual question answering

• Motion and video

– Modeling and replicating motion

– Tracking how an object moves

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”

Scarlett O’Hara described in Gone with the Wind

Tamara Berg

Motivation: Descriptive Text for Images

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

This is a picture of two dogs. The first dog is near the second furry dog.

Kulkarni et al., CVPR 2011

Some pre-RNN good results

Here we see one potted plant.

Missed detections:

This is a picture of one dog.

False detections:

There are one road and one cat. The furry road is in the furry cat.

This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road.

This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass.

Incorrect attributes:

This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass.

Kulkarni et al., CVPR 2011

Some pre-RNN bad results


Karpathy and Fei-Fei, CVPR 2015

Results with Recurrent Neural Networks

Recurrent Networks offer a lot of flexibility:

• vanilla neural networks

• e.g. image captioning: image -> sequence of words

• e.g. sentiment classification: sequence of words -> sentiment

• e.g. machine translation: seq of words -> seq of words

• e.g. video classification on frame level

Andrej Karpathy

Recurrent Neural Network

[Diagram: an input x feeds into an RNN block.]

Andrej Karpathy

Recurrent Neural Network

[Diagram: an input x feeds into an RNN block, which produces an output y.]

We usually want to output a prediction at some time steps.

Adapted from Andrej Karpathy

Recurrent Neural Network

We can process a sequence of vectors x by applying a recurrence formula at every time step:

$$h_t = f_W(h_{t-1}, x_t)$$

where $h_t$ is the new state, $h_{t-1}$ is the old state, $x_t$ is the input vector at some time step, and $f_W$ is some function with parameters W.

Andrej Karpathy

Recurrent Neural Network

We can process a sequence of vectors x by applying a recurrence formula at every time step.

Notice: the same function and the same set of parameters are used at every time step.

Andrej Karpathy

(Vanilla) Recurrent Neural Network

The state consists of a single “hidden” vector h:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$

Andrej Karpathy
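A minimal sketch of the vanilla RNN update in plain Python, using the form that appears later in the deck, h = tanh(Wxh * x + Whh * h). The helper names (`matvec`, `rnn_step`, `run_rnn`) and the toy weights are assumptions for illustration, not part of the lecture.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh):
    """One vanilla RNN update: h = tanh(W_hh * h_prev + W_xh * x)."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    return [math.tanh(z) for z in pre]

def run_rnn(xs, h0, W_hh, W_xh):
    """Apply the same function with the same weights at every time step."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh)
        states.append(h)
    return states

# Toy sizes (hypothetical): 2-dimensional state, 3-dimensional input.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
states = run_rnn([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0], W_hh, W_xh)
```

Note that `run_rnn` reuses `W_hh` and `W_xh` for every element of the sequence, which is the weight sharing the earlier slide points out.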

Character-level language model example

Vocabulary: [h,e,l,o]

Example training sequence: “hello”

Andrej Karpathy
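As a rough illustration of how the “hello” training sequence could be prepared, the sketch below one-hot encodes each character over the vocabulary [h,e,l,o] and pairs every input character with the next character as its target. The variable and function names are assumptions made for this example.

```python
vocab = ["h", "e", "l", "o"]
index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a vector with a 1 at the character's vocabulary position."""
    v = [0] * len(vocab)
    v[index[ch]] = 1
    return v

text = "hello"
# Inputs are "h", "e", "l", "l"; targets are the characters that follow them.
inputs = [one_hot(c) for c in text[:-1]]
targets = [index[c] for c in text[1:]]
```

At each time step the RNN would be trained to give a high score to the target index for that step.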


Extensions

• Vanishing gradient problem makes it hard to model long sequences

– Backpropagation through time multiplies together many values between 0 and 1 (the gradient of tanh lies in (0, 1], that of sigmoid in (0, 0.25])

• One solution: Use ReLU

• Another solution: Use RNNs with gates

– Adaptively decide how much of memory to keep

– Gated Recurrent Units (GRUs), Long Short Term Memories (LSTMs)
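A small numeric sketch of why long products of activation gradients shrink. It is a deliberate simplification: it multiplies only the activation derivatives at a fixed pre-activation value and ignores the recurrent weight matrix that also appears in the true gradient. The function names are assumptions for this example.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at z; always in (0, 1]."""
    return 1.0 - math.tanh(z) ** 2

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at z; always in (0, 0.25]."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def product_of_grads(grad, z, steps):
    """Multiply the same local gradient `steps` times, as over `steps` time steps."""
    p = 1.0
    for _ in range(steps):
        p *= grad(z)
    return p

# Even at its maximum (z = 0), the sigmoid gradient shrinks the product quickly.
sigmoid_50 = product_of_grads(sigmoid_grad, 0.0, 50)
tanh_50 = product_of_grads(tanh_grad, 1.0, 50)
```

Both products are vanishingly small after 50 steps, so an error signal from a late time step barely reaches the early ones.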

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Andrej Karpathy

Generating poetry with RNNs

[Figure: text samples from the model, labeled “at first:” and then after each of three “train more” stages.]

Andrej Karpathy

Generating poetry with RNNs

More info: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Andrej Karpathy

Generating poetry with RNNs


CVPR 2015:

Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei

Show and Tell: A Neural Image Caption Generator, Vinyals et al.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.

Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Adapted from Andrej Karpathy

Image Captioning


Convolutional Neural Network

Recurrent Neural Network

Andrej Karpathy

Image Captioning

[Figure sequence: the test image is shown alongside the network; one build marks part of the network with an X, and the last introduces the first input x0 = <START>.]

Andrej Karpathy

Image Captioning

[Diagram: the test image features im enter the first hidden state h0 through the weights Wih; x0 = <START> is the input and y0 is the output.]

before:

h = tanh(Wxh * x + Whh * h)

now:

h = tanh(Wxh * x + Whh * h + Wih * im)

Andrej Karpathy

Image Captioning

[Diagram sequence: the caption is generated one word at a time, starting from x0 = <START> and the test image.]

• Sample a word from y0: “straw”

• Feed “straw” in as the next input, giving h1 and y1

• Sample a word from y1: “hat”

• Feed “hat” in as the next input, giving h2 and y2

• Sample the <END> token from y2 => finish.

Caption generated: “straw hat”

Adapted from Andrej Karpathy

Image Captioning


Andrej Karpathy

Image Captioning
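The generation loop from the captioning slides (start from <START>, pick a word, feed it back in, stop at <END>) can be sketched as below. This is an illustration under stated assumptions: `toy_step` is a hypothetical lookup table standing in for the trained CNN plus RNN, and the loop picks the highest-scoring word (greedy decoding) so that the result is deterministic, whereas the slides sample from the output distribution.

```python
def generate_caption(step, h0, start="<START>", end="<END>", max_len=20):
    """Feed each chosen word back in as the next input until <END> appears."""
    words, token, h = [], start, h0
    for _ in range(max_len):
        scores, h = step(token, h)            # scores: dict mapping word -> score
        token = max(scores, key=scores.get)   # greedy pick; the slides sample instead
        if token == end:
            break
        words.append(token)
    return " ".join(words)

def toy_step(token, h):
    """Hypothetical stand-in for one RNN step; `h` is just a step counter here."""
    table = {
        "<START>": {"straw": 0.9, "hat": 0.05, "<END>": 0.05},
        "straw": {"straw": 0.1, "hat": 0.8, "<END>": 0.1},
        "hat": {"straw": 0.05, "hat": 0.05, "<END>": 0.9},
    }
    return table[token], h + 1

caption = generate_caption(toy_step, h0=0)
```

In a real model, `step` would compute h = tanh(Wxh * x + Whh * h + Wih * im) at the first time step, so the image features influence every later word through the hidden state.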


Plan for this lecture

• Language and vision

– Image captioning

– Tool: Recurrent neural networks

– Video captioning

– Visual question answering

• Motion and video

– Modeling and replicating motion

– Tracking how an object moves


Generate descriptions for events depicted in video clips

A monkey pulls a dog’s tail and is chased by the dog.

Venugopalan et al., “Translating Videos to Natural Language using Deep Recurrent Neural Networks”, NAACL-HLT 2015

Video Captioning

Key Insight:

Generate feature representation of the video and “decode” it to a sentence

• English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS’14]

• Image -> Encode -> RNN decoder -> Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]

• Video -> Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL’15] (this work)

Venugopalan et al., “Translating Videos to Natural Language using Deep Recurrent Neural Networks”, NAACL-HLT 2015

Video Captioning

Venugopalan et al., “Translating Videos to Natural Language using Deep Recurrent Neural Networks”, NAACL-HLT 2015

Video Captioning

Input Video -> Sample frames @1/10 -> CNN (forward propagate) -> Output: “fc7” features (activations before classification layer)

fc7: 4096 dimension “feature vector”

[Diagram: Input Video -> Convolutional Net -> Recurrent Net (two LSTM units at each time step) -> Output: “A ... boy is playing golf <EOS>”]

Venugopalan et al., “Translating Videos to Natural Language using Deep Recurrent Neural Networks”, NAACL-HLT 2015

Video Captioning

Mean across

all frames
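A minimal sketch of the frame sampling and mean pooling steps, assuming “@1/10” means keeping one frame in every ten and that each sampled frame has already been turned into a feature vector by the CNN. The function names and the tiny 2-dimensional features are assumptions for illustration; the fc7 features on the slide are 4096-dimensional.

```python
def sample_frames(frames, every=10):
    """Keep one frame out of every `every` frames."""
    return frames[::every]

def mean_pool(features):
    """Average per-frame feature vectors into a single video-level vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

kept = sample_frames(list(range(25)))              # frame indices 0..24
video_feature = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Mean pooling discards frame order, so the recurrent decoder sees one fixed-length summary of the whole clip.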

FGM: A person is dancing with the person on the stage.
YT: A group of men are riding the forest.
I+V: A group of people are dancing.
GT: Many men and women are dancing in the street.

FGM: A person is cutting a potato in the kitchen.
YT: A man is slicing a tomato.
I+V: A man is slicing a carrot.
GT: A man is slicing carrots.

FGM: A person is walking with a person in the forest.
YT: A monkey is walking.
I+V: A bear is eating a tree.
GT: Two bear cubs are digging into dirt and plant matter at the base of a tree.

FGM: A person is riding a horse on the stage.
YT: A group of playing are playing in the ball.
I+V: A basketball player is playing.
GT: Dwayne wade does a fancy layup in an allstar game.

Venugopalan et al., “Translating Videos to Natural Language using Deep Recurrent Neural Networks”, NAACL-HLT 2015

Video Captioning

• English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS’14]

• Image -> Encode -> RNN decoder -> Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]

• Video -> Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL’15]

• Video -> RNN encoder -> RNN decoder -> Sentence [Venugopalan et al. ICCV’15] (this work)

Venugopalan et al., “Sequence to Sequence - Video to Text”, ICCV 2015

Video Captioning

S2VT Overview

[Diagram: each video frame passes through a CNN into a two-layer stack of LSTMs. Encoding stage: the LSTMs read the frame features. Decoding stage (“Now decode it to a sentence!”): the LSTMs emit “A man is talking ...”.]

Venugopalan et al., “Sequence to Sequence - Video to Text”, ICCV 2015

Video Captioning


Visual Question Answering (VQA)

Task: Given an image and a natural language open-ended question, generate a natural language answer.

Agrawal et al., “VQA: Visual Question Answering”, ICCV 2015

• Image -> Convolution Layer + Non-Linearity -> Pooling Layer -> Convolution Layer + Non-Linearity -> Pooling Layer -> Fully-Connected (4096-dim) -> Embedding

• Question (“How many horses are in this image?”) -> LSTM -> Embedding

• Embeddings (1024-dim) -> Neural Network -> Softmax over top K answers

Visual Question Answering (VQA)

Agrawal et al., “VQA: Visual Question Answering”, ICCV 2015
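A toy sketch of the final stage of this pipeline: combine an image embedding and a question embedding, score a fixed list of K candidate answers, and apply a softmax. The element-wise product used to fuse the two embeddings is an assumption made for this example (the slide does not show how they are combined), as are the function names, the 2-dimensional embeddings, and the weight values.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def answer(image_emb, question_emb, W, answers):
    """Score each candidate answer from the fused image and question embeddings."""
    fused = [a * b for a, b in zip(image_emb, question_emb)]  # assumed fusion: product
    scores = [sum(w * f for w, f in zip(row, fused)) for row in W]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs

# Hypothetical values: 2-dim embeddings and K = 3 candidate answers.
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
best_answer, probs = answer([1.0, 0.0], [1.0, 1.0], W, ["2", "yes", "white"])
```

Treating answering as classification over the top K answers keeps the output layer small, at the cost of never producing an answer outside that list.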


Wu et al., “Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge From External Sources”, CVPR 2016

Visual Question Answering (VQA)


Agrawal et al., “VQA: Visual Question Answering”, ICCV 2015

Visual Question Answering (VQA)

https://vqa.cloudcv.org/


Plan for this lecture

• Language and vision

– Image captioning

– Tool: Recurrent neural networks

– Video captioning

– Visual question answering

• Motion and video

– Modeling and replicating motion

– Tracking how an object moves


Motion: Why is it useful?

Derek Hoiem


Motion: Why is it useful?

• Even “impoverished” motion data can evoke a strong percept

G. Johansson, “Visual Perception of Biological Motion and a Model For Its Analysis”, Perception and Psychophysics 14, 201-211, 1973.

Derek Hoiem

Modeling Motion: Optical Flow

Walker et al., “Dense Optical Flow Prediction from a Static Scene”, ICCV 2015


Transferring Motion

Thomas, Song and Kovashka

Optical flow in generated video

Optical flow in source video

Key idea: Generate videos with similar flow patterns as source videos (+ many details).


Transferring Motion

Thomas, Song and Kovashka


Transferring Motion

Thomas, Song and Kovashka

Baking Blooming


Body pose tracking, activity recognition

Surveillance

Video-based interfaces

Medical apps

Censusing a bat population

Kristen Grauman

Tracking: some applications


Tracking examples

Traffic: https://www.youtube.com/watch?v=DiZHQ4peqjg

Soccer: http://www.youtube.com/watch?v=ZqQIItFAnxg

Face: http://www.youtube.com/watch?v=i_bZNVmhJ2o

Body: https://www.youtube.com/watch?v=_Ahy0Gh69-M

Eye: http://www.youtube.com/watch?v=NCtYdUEMotg

Gaze: http://www.youtube.com/watch?v=-G6Rw5cU-1c

Amin Sadeghi


Things that make visual tracking difficult

• Erratic movements, moving very quickly

• Occlusions, leaving and coming back

• Surrounding similar-looking objects

Adapted from Amin Sadeghi


Strategies for tracking

• Tracking by repeated detection

– Works well if object is easily detectable (e.g., face or colored glove) and there is only one

– Need some way to link up detections

– Best you can do, if you can’t predict motion

Amin Sadeghi
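One simple way to “link up” detections, sketched under the assumption of a single target: in each frame, attach the detection closest to the last known position. The function name and the coordinates are hypothetical; this is one possible linking rule, not the method prescribed by the lecture.

```python
import math

def link_nearest(start, frames):
    """Build a track by picking, in each frame, the detection nearest the last position."""
    track = [start]
    for detections in frames:
        if not detections:        # detector found nothing in this frame
            continue
        last = track[-1]
        nearest = min(detections,
                      key=lambda d: math.hypot(d[0] - last[0], d[1] - last[1]))
        track.append(nearest)
    return track

# Three frames of (x, y) detections; the second frame has no detection at all.
track = link_nearest((0, 0), [[(1, 0), (9, 9)], [], [(8, 8), (2, 1)]])
```

Because no motion is predicted, a similar-looking object that happens to pass closer than the true target would capture the track.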

Strategies for tracking

• Tracking w/ dynamics: Using model of expected motion, predict object location in next frame

– Restrict search for the object

– Measurement noise is reduced by trajectory smoothness

– Robustness to missing or weak observations

– Assumptions: Camera is not moving instantly to new viewpoint, objects do not disappear/reappear in different places in the scene

Amin Sadeghi
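A sketch of how a motion model can restrict the search, assuming constant velocity between consecutive frames. The predicted location is extrapolated from the last two positions, only detections within `radius` of it are considered, and the prediction itself is used when nothing falls inside that window. The function names and the `radius` parameter are assumptions for this example.

```python
import math

def predict_position(prev, curr):
    """Constant-velocity extrapolation: next = curr + (curr - prev)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])

def gated_match(prev, curr, detections, radius):
    """Pick the detection nearest the predicted location, within `radius` only."""
    px, py = predict_position(prev, curr)
    near = [d for d in detections if math.hypot(d[0] - px, d[1] - py) <= radius]
    if not near:
        return (px, py)   # missing or weak observation: fall back on the prediction
    return min(near, key=lambda d: math.hypot(d[0] - px, d[1] - py))

matched = gated_match((0, 0), (1, 0), [(2.1, 0.0), (1.0, 0.1)], radius=0.5)
fallback = gated_match((0, 0), (1, 0), [(5, 5)], radius=0.5)
```

The fallback branch is what gives robustness to missing observations, and the gate is what keeps nearby distractors from being matched.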

General model for tracking

• State X: The actual state of the moving object that we want to estimate but cannot observe

– E.g. position, velocity

• Observations Y: Our actual measurement or observation of state X, which can be very noisy

• At each time t, the state changes to Xt and we get a new observation Yt

• Our goal is to recover the most likely state Xt given:

– All observations so far, i.e. y1, y2, …, yt

– Knowledge about dynamics of state transitions

Adapted from Amin Sadeghi and Kristen Grauman

[Diagram: a chain of states X0 -> X1 -> X2 -> … -> Xt-1 -> Xt, where each state Xi produces its observation Yi.]

Steps of tracking

• Prediction: What is the next state of the object given past measurements?

$$P(X_t \mid Y_0 = y_0, \ldots, Y_{t-1} = y_{t-1})$$

• Correction: Compute an updated estimate of the state from prediction and measurements

$$P(X_t \mid Y_0 = y_0, \ldots, Y_{t-1} = y_{t-1}, Y_t = y_t)$$

Kristen Grauman

Problem statement

• We have models for

Likelihood of next state given current state (dynamics model): $P(X_t \mid X_{t-1})$

Likelihood of observation given the state (observation or measurement model): $P(Y_t \mid X_t)$

• We want to recover, for each t: $P(X_t \mid y_0, \ldots, y_t)$

Amin Sadeghi


The Kalman filter

• Linear dynamics model: state undergoes linear transformation plus Gaussian noise

• Observation model: measurement is linearly transformed state plus Gaussian noise

• The predicted/corrected state distributions are Gaussian

– You only need to maintain the mean and covariance

– The calculations are easy

Amin Sadeghi

Example: Constant velocity (1D points)

[Plot: states and measurements of 1D position over time.]

Kristen Grauman

Example: Constant velocity (1D points)

• State vector: position p and velocity v

$$p_t = p_{t-1} + (\Delta t)\, v_{t-1} + \varepsilon, \qquad v_t = v_{t-1} + \xi$$

(ε and ξ are the noise terms)

$$x_t = \begin{bmatrix} p_t \\ v_t \end{bmatrix}$$

$$x_t = D_t x_{t-1} + \text{noise} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} p_{t-1} \\ v_{t-1} \end{bmatrix} + \text{noise}$$

• Measurement is position only

$$y_t = M x_t + \text{noise} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} p_t \\ v_t \end{bmatrix} + \text{noise}$$

Kristen Grauman
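A compact sketch of one Kalman predict and correct cycle for this constant-velocity model, written out by hand for the 2x2 case with D = [[1, Δt], [0, 1]] and M = [1, 0]. The function names, the diagonal process-noise parameters `q_pos` and `q_vel`, and the measurement variance `r` are assumptions chosen for illustration.

```python
def kalman_predict(mean, cov, dt, q_pos=0.0, q_vel=0.0):
    """Prediction: mean' = D * mean, cov' = D * cov * D^T + Q."""
    p, v = mean
    (a, b), (c, d) = cov
    mean_pred = (p + dt * v, v)
    cov_pred = [[a + dt * (b + c) + dt * dt * d + q_pos, b + dt * d],
                [c + dt * d, d + q_vel]]
    return mean_pred, cov_pred

def kalman_correct(mean_pred, cov_pred, y, r):
    """Correction with a position-only measurement y of variance r."""
    p, v = mean_pred
    (a, b), (c, d) = cov_pred
    s = a + r                      # innovation variance: M * cov' * M^T + r
    k_p, k_v = a / s, c / s        # Kalman gain: cov' * M^T / s
    resid = y - p                  # measurement minus predicted position
    mean = (p + k_p * resid, v + k_v * resid)
    cov = [[(1 - k_p) * a, (1 - k_p) * b],   # (I - K * M) * cov'
           [c - k_v * a, d - k_v * b]]
    return mean, cov

mean, cov = (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]
mean_pred, cov_pred = kalman_predict(mean, cov, dt=1.0)
mean, cov = kalman_correct(mean_pred, cov_pred, y=1.0, r=1.0)
```

Only the mean and covariance are carried from step to step. In this run the position variance grows to 2 after prediction and drops to 2/3 after the measurement is folded in, and the velocity estimate moves even though only position was measured.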

Prediction and correction

Prediction:

$$P(X_t \mid y_0, \ldots, y_{t-1}) = \int P(X_t \mid X_{t-1})\, P(X_{t-1} \mid y_0, \ldots, y_{t-1})\, dX_{t-1}$$

(dynamics model times the corrected estimate from the previous step)

Correction:

$$P(X_t \mid y_0, \ldots, y_t) = \frac{P(y_t \mid X_t)\, P(X_t \mid y_0, \ldots, y_{t-1})}{\int P(y_t \mid X_t)\, P(X_t \mid y_0, \ldots, y_{t-1})\, dX_t}$$

(observation model times the predicted estimate, normalized)

Adapted from Amin Sadeghi
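To make the prediction and correction steps concrete, here is a sketch for a state that takes one of a few discrete values, so the integrals become sums. The three-cell world, the transition table, and the likelihood values are hypothetical numbers chosen for illustration.

```python
def predict(belief, transition):
    """Prediction: sum over previous states of dynamics model times previous belief.

    belief[j] = P(X_{t-1} = j | y_0..y_{t-1}); transition[j][i] = P(X_t = i | X_{t-1} = j).
    """
    n = len(belief)
    return [sum(transition[j][i] * belief[j] for j in range(n)) for i in range(n)]

def correct(predicted, likelihood):
    """Correction: observation model times predicted belief, then normalize.

    likelihood[i] = P(y_t | X_t = i).
    """
    unnorm = [l * p for l, p in zip(likelihood, predicted)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# The object starts in cell 0 and moves one cell right with probability 0.8.
belief = [1.0, 0.0, 0.0]
transition = [[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]]
predicted = predict(belief, transition)
# A noisy measurement that favors cell 0.
corrected = correct(predicted, [0.9, 0.1, 0.1])
```

The prediction shifts most of the probability to cell 1, and the measurement then pulls the estimate back toward cell 0. The Kalman filter performs the same two steps in closed form when both models are linear with Gaussian noise.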

Kalman filter processing

Example w/ constant velocity

[Plot: position against time, from time t to time t+1. Legend: o state; x measurement; * predicted mean estimate; + corrected mean estimate; bars: variance estimates.]

Adapted from Kristen Grauman


[Plot legend: Correction, Observation, Ground Truth]

Amin Sadeghi

Example w/ constant velocity