AUTOMATED LINK GENERATION FOR SENSOR-ENRICHED SMARTPHONE IMAGES
IBM Interview Talk, September 29, 2015, Singapore
Presenter: Seshadri Padmanabha Venkatagiri
Advisor: Prof. Chan Mun Choon
Collaborating Advisor: Prof. Ooi Wei Tsang
MOBILE USER GENERATED CONTENT: PHOTO
SLIDE: 2
More than 1.8 billion photo uploads per day in 2014
Source: Kleiner Perkins Caufield & Byers, Internet Trends 2014
ADHOC EVENTS: IMPORTANT SOURCE OF MOBILE UGC
SLIDE: 3
Exhibitions
Street Performances
Mishaps
Social events
Well attended
Attendees share same event context
Lack of prior information (e.g., training data)
Lack of planned infrastructure (e.g., GPS, camera deployments)
OBJECTIVE OF AUTOLINK
SLIDE: 4
To organize noisy unstructured photo collections to improve user interaction
APPLICATIONS OF STRUCTURED PHOTO COLLECTIONS
SLIDE: 5
Content Analytics/Discovery
Crowd-sourced surveillance
Photograph this! Scene recommendation
CHALLENGE 1: NOISE IN CONTENT
SLIDE: 6
Occlusion
Varying lighting conditions
Diverse views
Diverse regions of interest
Diversity in moments captured
Scene-localization issues
Redundancy
CHALLENGE 2: RESOURCE BOTTLENECK
SLIDE: 7
UGC from ad-hoc events (exhibitions, street performances, mishaps)
→ resource bottleneck: battery, bandwidth
→ applications: Photograph this! scene recommendation, content analytics/discovery, surveillance
OUR APPROACH: DETAIL-ON-DEMAND PARADIGM
SLIDE: 8
Does not overwhelm users with content; suited for small form-factor devices.
Retrieves specific content on demand; consumes less bandwidth and power.
Organizes content into progressively detailed views, which helps analytics and elicits user interest.
DETAIL-ON-DEMAND PARADIGM: EXISTING SOLUTIONS
SLIDE: 9
Hyperlinking: provides progressive content; can be adapted to link content from multiple sources.
Zooming: provides progressively detailed content; not suitable for relating content from multiple sources.
WE CHOOSE HYPERLINK-BASED DETAIL-ON-DEMAND TO PROVIDE PROGRESSIVELY DETAILED CONTENT
COMPARISON WITH STATE-OF-THE-ART
SLIDE: 10
Technique                                       | Automated | Prior information (e.g. training, visual dictionary) | Camera calibration | Distinguishes "same" from "similar" scenes | Requires GPS
Hyper-Hitchcock (manual technique)              | No        | No                                                    | No                 | Yes                                        | No
Photo Tourism (Photo Synth), Google Street View | Yes       | No                                                    | Yes                | No                                         | Yes
ImageWebs                                       | Yes       | Yes                                                   | Yes                | No                                         | No
Geo-tagging                                     | Yes       | No                                                    | No                 | No                                         | Yes
AutoLink                                        | Yes       | No                                                    | No                 | Yes                                        | No
COMPARISON WITH STATE-OF-THE-ART
SLIDE: 11
(Same comparison table as Slide 10.)
GPS error/non-availability and difficulty of camera calibration in ad-hoc events.
COMPARISON WITH STATE-OF-THE-ART
SLIDE: 12
(Same comparison table as Slide 10.)
A visual vocabulary is not always available in ad-hoc events.
COMPARISON WITH STATE-OF-THE-ART
SLIDE: 13
(Same comparison table as Slide 10, with the "Requires GPS" column generalized to "Specialized equipment", e.g. GPS, laser range finding.)
GPS error/non-availability.
COMPARISON WITH STATE-OF-THE-ART
SLIDE: 14
(Same comparison table as Slide 10.)
AutoLink avoids GPS, visual vocabulary, and camera calibration.
APPLICATION CONTEXT
SLIDE: 15
Multiple scenes
Located indoor/outdoor or both
People move between scenes and capture photos
Photos are captured with different orientations and regions of interest
PROBLEM
SLIDE: 16
Given: an image collection, an inertial sensor log, and the set of scenes the user is interested in.
Find: a detail-on-demand image hierarchy, ranging from high context to high detail.
ARCHITECTURE
SLIDE: 17
AutoLink Server ↔ Mobile Wireless Network ↔ AutoLink Client
(1) The client uploads metadata: the sensor log and content characteristics extracted from each photo. The server runs AutoLink and performs inter-scene and intra-scene clustering, producing an image content hierarchy (high context → high detail).
(2) Photos themselves are uploaded on demand by the smartphones; users request them by navigating through the generated links.
AUTOLINK OUTLINE
SLIDE: 18
AutoLink pipeline:
(1) Capture photo
(2) Sensor log: time, step count, angle
(3) User-selected scenes
(4) Sensor-assisted hybrid scene classification (photo + image features → scene)
(5) Region estimation (scene → region)
(6) Hierarchy creation
STEP 1: RATIONALE OF SCENE CLASSIFICATION
SLIDE: 19
Order of reliability:
Use content features to find the matching scene for an image.
If content does not provide a match, use sensors and already-labeled images to improve the chance of a match.
STEP 1: SCENE CLASSIFICATION
SLIDE: 20
Content features + user-specified scenes
→ Match content features: apply Naïve Bayes Nearest Neighbor classification
→ Scene match? Label the image with the scene.
→ No scene match? The image remains unlabeled.
CONTENT FEATURES
SLIDE: 21
Color-SIFT [Abdel2006]: captures scale- and color-invariant characteristics
Maximally Stable Extremal Regions [Forssen2007]: describes region-level features
ORB [Ethan2011]: rotation- and noise-resistant features
Color: global color histogram description

[Abdel2006] A. Abdel-Hakim and A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. CVPR 2006.
[Forssen2007] P.-E. Forssen. Maximally stable colour regions for recognition and matching. CVPR 2007.
[Ethan2011] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. ICCV 2011.
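Below is a minimal sketch of extracting these four feature types with OpenCV; plain SIFT stands in for Color-SIFT (OpenCV ships no color variant), and all parameter values are illustrative rather than the talk's settings.

```python
import cv2
import numpy as np

def extract_features(path):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # SIFT: scale-invariant descriptors (stand-in for Color-SIFT)
    sift = cv2.SIFT_create()
    _, sift_desc = sift.detectAndCompute(gray, None)

    # MSER: region-level features
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)

    # ORB: rotation- and noise-resistant binary descriptors
    orb = cv2.ORB_create(nfeatures=500)
    _, orb_desc = orb.detectAndCompute(gray, None)

    # Global color histogram (8x8x8 bins over BGR), L1-normalized
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    hist = (hist / hist.sum()).flatten()

    return sift_desc, regions, orb_desc, hist
```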
NAÏVE BAYES NEAREST NEIGHBOUR
SLIDE: 22
1. Compute the descriptors d1, d2, ..., dn of the photo.
2. Compute descriptors for every scene (a preprocessing step, done only once).
3. For each descriptor di and each scene j, find the nearest neighbour of di among scene j's descriptors: NNj(di).
4. Choose the matching scene: arg min over j of Σi ||di − NNj(di)||².
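A minimal NBNN sketch, assuming each scene is represented by a stacked array of local descriptors; brute-force distances are used for clarity, where a k-d tree or FLANN index would be used in practice.

```python
import numpy as np

def nbnn_classify(photo_desc, scene_descs):
    """photo_desc: (n, d) array; scene_descs: {scene_name: (m_j, d) array}."""
    best_scene, best_cost = None, np.inf
    for scene, descs in scene_descs.items():
        # squared distance from every photo descriptor to every scene descriptor
        d2 = ((photo_desc[:, None, :] - descs[None, :, :]) ** 2).sum(axis=-1)
        cost = d2.min(axis=1).sum()  # sum of NN distances: Σi ||di − NNj(di)||²
        if cost < best_cost:
            best_scene, best_cost = scene, cost
    return best_scene
```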
STEP 1: SCENE CLASSIFICATION
SLIDE: 23
Content (interest points) + user-specified scenes
→ Match content features: apply Naïve Bayes Nearest Neighbor classification
→ Scene match? Label the image with the scene.
→ No scene match? Combine content, time, and sensor features (time, step count, angle) to find a matching scene label; if none is found, the image remains unlabeled.
TIME FEATURE: TEMPORAL CLUSTERING
SLIDE: 24
[Figure: cluster ID vs. photo timestamps (0-100 minutes); large inter-photo time gaps separate the photo-clusters.]
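A minimal sketch of the gap-based clustering this figure illustrates; the threshold value is illustrative (the talk derives it from collected traces).

```python
# Start a new photo-cluster whenever the gap between consecutive
# timestamps reaches the threshold.
def temporal_clusters(timestamps_min, gap_threshold_min=3.0):
    ts = sorted(timestamps_min)
    clusters, current = [], [ts[0]]
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev >= gap_threshold_min:
            clusters.append(current)
            current = []
        current.append(cur)
    clusters.append(current)
    return clusters

# temporal_clusters([0, 1.5, 2, 5, 5.1]) -> [[0, 1.5, 2], [5, 5.1]]
```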
TIME FEATURE: GAP LABELING
SLIDE: 25
(a) Before gap labeling:
t1 = 0 min, t2 = 1.5 min, t3 = 2 min, t4 = 5 min, t5 = 5.1 min; Max{tgap} = 3 min
i1 = Scene 1, i2 = no label, i3 = no label, i4 = no label, i5 = Scene 2
(b) After gap labeling:
i1 = Scene 1, i2 = Scene 1, i3 = Scene 1, i4 = Scene 2, i5 = Scene 2
If all the time gaps are smaller than a threshold ΔT (obtained from traces collected from the dataset), time-gap labeling is not applied: the photos are too close together to distinguish cluster boundaries.
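A hedged sketch of gap labeling for the two-cluster example above; it assumes the largest time gap marks the cluster boundary and that each side contains at least one labeled photo.

```python
def gap_label(times, labels):
    """times: sorted capture times; labels: scene label or None per photo."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    boundary = gaps.index(max(gaps)) + 1   # first index of the right cluster
    # propagate the nearest known label within each side of the boundary
    left = next(l for l in labels[:boundary] if l is not None)
    right = next(l for l in labels[boundary:] if l is not None)
    return [left if i < boundary else right for i in range(len(times))]

# gap_label([0, 1.5, 2, 5, 5.1], ["Scene 1", None, None, None, "Scene 2"])
# -> ["Scene 1", "Scene 1", "Scene 1", "Scene 2", "Scene 2"]
```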
ACCELEROMETER FEATURE: STEP COUNT BETWEEN PHOTO CAPTURES
SLIDE: 26
[Figure: photo captures marked on an accelerometer trace; a step-counting algorithm counts the steps between consecutive captures.]
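A minimal peak-counting step counter over the accelerometer magnitude, standing in for the (unspecified) step-counting algorithm; the threshold is illustrative.

```python
import numpy as np

def count_steps(accel_xyz, thresh=11.0):
    """accel_xyz: (n, 3) accelerometer samples in m/s^2."""
    mag = np.linalg.norm(accel_xyz, axis=1)
    # a step ~ a local peak of the magnitude that crosses the threshold
    peaks = (mag[1:-1] > thresh) & (mag[1:-1] >= mag[:-2]) & (mag[1:-1] >= mag[2:])
    return int(peaks.sum())
```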
PHOTO ORIENTATION
SLIDE: 27
Photo orientation is obtained by sensor fusion of the magnetic field sensor, gyroscope, and accelerometer.
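A hedged sketch of the static (accelerometer + magnetometer) part of such a fusion, analogous to Android's getRotationMatrix; a complete solution would also blend in gyroscope rates for stability during motion. Axis conventions are an assumption.

```python
import numpy as np

def azimuth_deg(accel, mag):
    """accel: gravity vector, mag: magnetic field, both in the device frame."""
    up = accel / np.linalg.norm(accel)
    east = np.cross(mag, up)                # horizontal, pointing east
    east /= np.linalg.norm(east)
    north = np.cross(up, east)
    # heading of the device y-axis, measured clockwise from magnetic north
    return float(np.degrees(np.arctan2(east[1], north[1])))
```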
STEP 2: REGION ESTIMATION
SLIDE: 28-31
Image → Scene: estimate the region of the scene that the image covers, using two estimators.

Window-based estimation:
1. Estimate the region center using horizontal and vertical shifts in the matching features of the image and the scene.
2. Correct the estimate using the compass sensor z-axis.

Super-pixel based estimation:
1. Apply SEEDS super-pixel segmentation to the image and the scene.
2. Compute the super-pixels that have matching features.
3. Estimate a bounding box around them.
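A hedged sketch of the super-pixel branch using OpenCV's SEEDS implementation (cv2.ximgproc, from the opencv-contrib-python package); the matched-feature bookkeeping is simplified to a list of matched (x, y) pixel coordinates.

```python
import cv2
import numpy as np

def superpixel_bbox(img, matched_pts, n_superpixels=200):
    h, w, c = img.shape
    seeds = cv2.ximgproc.createSuperpixelSEEDS(w, h, c, n_superpixels, 4)
    seeds.iterate(img, 10)
    labels = seeds.getLabels()                      # (h, w) superpixel ids
    hit = {labels[y, x] for (x, y) in matched_pts}  # superpixels with matches
    ys, xs = np.nonzero(np.isin(labels, list(hit)))
    return xs.min(), ys.min(), xs.max(), ys.max()   # bounding box around them
```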
ESTIMATING REGION CENTER
SLIDE: 32
The region center is estimated by computing the horizontal and vertical shifts in matching image features between the candidate image and a scene reference image.
[Figure: scene reference image vs. candidate image, with matched features.]
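A minimal sketch of the shift computation from ORB matches; aggregating with the median is an assumption made here for robustness to outlier matches.

```python
import cv2
import numpy as np

def region_center(scene_img, cand_img):
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(scene_img, None)
    k2, d2 = orb.detectAndCompute(cand_img, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    # shift of each matched feature from candidate to scene coordinates
    shifts = np.array([np.subtract(k1[m.queryIdx].pt, k2[m.trainIdx].pt)
                       for m in matches])
    dx, dy = np.median(shifts, axis=0)
    h, w = cand_img.shape[:2]
    return (w / 2 + dx, h / 2 + dy)   # candidate center mapped into the scene
```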
ENERGY-DELAY-ACCURACY PERFORMANCE OF REGION CENTER ESTIMATION
SLIDE: 33
Image resize factor (%) | Delay (s) | Energy (mJ) | Accuracy (%)
 20                     | 0.098     |   3.6       | 75.7
 40                     | 0.311     |  12.1       | 82.5
 60                     | 0.899     |  39.3       | 91.5
 80                     | 2.558     | 117.4       | 93.4
100                     | 6.494     | 292.6       | 93.5

Subsampling to 60% of the original image size gives:
• only a 2% reduction in accuracy compared to using the original image
• approximately 86% less energy
• approximately 86% less computation time
STEP 2: REGION ESTIMATION
SLIDE: 34-36
Image → Scene
Find the region center, then run super-pixel estimation:
1. Iterate through bounding boxes around the region center.
2. Find the bounding box that best matches the scene.
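A hedged sketch of the window search; scoring a window by its ORB match count against the scene is an assumption, as the talk does not specify the matching score.

```python
import cv2

def best_window(cand_img, scene_desc, center, sizes=((200, 150), (300, 225))):
    orb = cv2.ORB_create(500)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    cx, cy = int(center[0]), int(center[1])
    best, best_score = None, -1
    for w, h in sizes:                       # try a few window sizes ...
        for dx in (-w // 4, 0, w // 4):      # ... offset around the center
            for dy in (-h // 4, 0, h // 4):
                x0 = max(cx + dx - w // 2, 0)
                y0 = max(cy + dy - h // 2, 0)
                crop = cand_img[y0:y0 + h, x0:x0 + w]
                _, d = orb.detectAndCompute(crop, None)
                if d is None:
                    continue
                score = len(bf.match(d, scene_desc))
                if score > best_score:
                    best, best_score = (x0, y0, w, h), score
    return best
```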
STEP 2: REGION ESTIMATION
SLIDE: 37
Image → Scene
Window-based estimation + super-pixel based estimation: use the average of both estimates.
[Figure: red = our estimate, blue = ground truth.]
DEMO: REGION CENTER AND REGION WINDOW ESTIMATION
SLIDE: 38
STEP 3: ADDING TO IMAGE HIERARCHY USING BOUNDING BOX (BB)
SLIDE: 39-42
Hierarchy levels, from high context to high detail: L1, L2, L3, L4
The image's bounding box is compared against successive levels:
BB <= L1 and > L2?
BB <= L2 and > L3?
BB <= L3 and > L4?
BB matches L4!
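A minimal sketch of level selection by bounding-box size; the area thresholds and the four-level split are illustrative.

```python
# An image whose region covers most of the scene is "high context" (L1);
# a tight crop is "high detail" (L4).
def hierarchy_level(bb_area, scene_area, thresholds=(0.75, 0.5, 0.25)):
    frac = bb_area / scene_area
    if frac > thresholds[0]:
        return "L1"          # high context
    if frac > thresholds[1]:
        return "L2"
    if frac > thresholds[2]:
        return "L3"
    return "L4"              # high detail
```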
DEMO: AUTOLINK IMAGE HIERARCHY
SLIDE: 43
EVALUATION: METHODOLOGY
SLIDE: 44
Metrics: precision and recall, running time, bounding-box accuracy, ranking accuracy
AutoLink compared with: NBNN [Boiman2008], bag-of-visual-words, structure-from-motion [Wu2013], NBNN + time only
Evaluation with 2 datasets:
1. Single participant
2. Six participants
[Wu2013] Changchang Wu. Towards Linear-time Incremental Structure from Motion. 3DV 2013.
[Boiman2008] O. Boiman, E. Shechtman, and M. Irani. In Defense of Nearest-Neighbor Based Image Classification. CVPR 2008.
EVALUATION: ACCURACY
SLIDE: 45
Dataset                Single user          Multiple users
Approach               Precision | Recall   Precision | Recall
AutoLink               0.70      | 0.78     0.71      | 0.51
NBNN + Sensors         0.76      | 0.38     0.74      | 0.37
NBNN + Time            0.64      | 0.73     0.64      | 0.32
NBNN                   0.78      | 0.10     0.78      | 0.25
Structure-from-Motion  0.53      | 0.53     0.37      | 0.37
Bag-of-Visual-Words    0.19      | 0.029    0.19      | 0.19

Up to 70% precision and 78% recall.
BOW performs poorly without training.
SfM does not distinguish scenes with similar features.
EVALUATION: RUNNING TIME
SLIDE: 46
Approach               | Average running time per image (ms)
AutoLink               | 298
NBNN + Sensors         | 290
NBNN + Time            | 292
NBNN                   | 287
Structure-from-Motion  | 5946.7
Bag-of-Visual-Words    | 35.9

20 times faster than SfM.
BOW is faster, but has poor accuracy.
EVALUATION: REGION MATCHING
SLIDE: 47
58% of photos have at least 50% overlap
40% of photos have at least 65% overlap
EVALUATION: TOP-M PREDICTIONS FOR RANDOM BOUNDING BOXES
SLIDE: 48
Top-M predictions | Accuracy (%)
 1                | 56.6
 2                | 70.4
 3                | 74.9
 5                | 76.9
10                | 80.14
15                | 81.11
20                | 81.11

The top-2 results contain the best match to the requested bounding box 70% of the time.
CONTENT + TIME + SENSOR
1. For every pair of an unlabeled image * and a labeled image i, estimate:
   - the number of steps traversed, s
   - the time gap, t
2. For every scene j, compute |s − si,j| and |t − ti,j| using the transition maps, and rank the scenes by this difference (smallest difference ranked highest).
3. Combine the content ranks with these ranks using mean reciprocal ranking.
4. Label the unlabeled image * with the highest-ranking scene.
SLIDE: 50
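A minimal sketch of the rank fusion in step 3; scene names and the per-cue rankings are illustrative.

```python
# Each cue (content, time, steps) ranks the candidate scenes; the fused
# score of a scene is the mean of its reciprocal ranks, and the top scene wins.
def fuse_ranks(rankings):
    """rankings: list of lists of scene names, each sorted best-first."""
    scores = {}
    for ranking in rankings:
        for pos, scene in enumerate(ranking, start=1):
            scores[scene] = scores.get(scene, 0.0) + 1.0 / pos
    return max(scores, key=lambda s: scores[s] / len(rankings))

# fuse_ranks([["A", "B", "C"],   # content rank
#             ["B", "A", "C"],   # time rank
#             ["A", "C", "B"]])  # step rank  -> "A"
```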
TRANSITION MAPS FOR TIME AND STEPS
SLIDE: 51
[Figure: scenes 1-4 as nodes of a graph; each edge (i, j) is labeled with (si,j, ti,j), the steps and time observed between scenes i and j.]
[si,j]: Matrix of steps taken from every scene i to every other scene j
[ti,j]: Matrix of time taken from every scene i to every other scene j
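A minimal sketch of building [si,j] and [ti,j] from labeled consecutive photo pairs; averaging repeated observations is an assumption.

```python
from collections import defaultdict

def build_transition_maps(observations):
    """observations: iterable of (scene_i, scene_j, steps, time_gap)."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for i, j, s, t in observations:
        acc[(i, j)][0] += s
        acc[(i, j)][1] += t
        acc[(i, j)][2] += 1
    s_map = {k: v[0] / v[2] for k, v in acc.items()}   # [s_ij]: mean steps
    t_map = {k: v[1] / v[2] for k, v in acc.items()}   # [t_ij]: mean time
    return s_map, t_map
```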