Compact Descriptor for Visual Search, Deep Learning, Augmented Reality and 5G Network

Yu Huang

[email protected]

Sunnyvale, CA

Outline • 5G Network is coming in 2020;

• Disruptive capabilities;

• Use cases and examples;

• Service and scenario requirements;

• Key technologies;

• Mobile Edge Computing;

• Fog Computing;

• Augmented Reality;

• AR technology and application;

• ARToolkit, Vuforia (PTC from Qualcomm), Total Immersion-D’Fusion, Wikitude, …

• Metaio(Apple) and 13th Lab(Facebook Oculus Rift);

• Google Glass and Google Project Tango;

• Microsoft Hololens;

• Pokemon Go;

• Meta and Magic Leap;

• CDVS (Compact Descriptor for Visual Search)

• Image feature extraction pipeline;

• Image matching and retrieval pipeline;

• What role does CDVS play in the 5G network?

• Visual search and visual SLAM (AR as use case).

• Deep Learning

• Feature representation and learning;

• Deep learning for visual search and SLAM in AR;

• Remarks

• Appendix

5G Network (5th Generation Mobile Technology) in 2020

• Higher peak and average data rates in down/up link

• Peaks of 10Gbps, average of 1Gbps

• Ultra-low latency/Ultra-dense network

• <1ms for certain services, <10ms as the norm, <5ms for end-to-end

• minimum of 50Mbps (10 Tbps/km2 to cover, e.g. a stadium with 30k devices), adapt data stream with (close to) “zero latency”

• Ultra reliable machine type communication

• Wherever, whenever and whatever

• Energy efficiency

• Doing much more with less

• Massive device connectivity/Massive machine type communication

• Internet of Things (IoT), Tactile Internet, …

5G Disruptive Capabilities

Source: https://5g-ppp.eu/wp-content/uploads/2015/02/5G-Vision-Brochure-v1.pdf












5G Use Case Families and Related Examples

Source: https://www.ngmn.org/uploads/media/NGMN_5G_White_Paper_V1_0.pdf

5G Service and Scenario Requirements

Source: “5G: A Technology Vision”, White paper from Huawei Technologies.

Intel Joins 5G World!

Key Technological Components in 5G • Millimeter wave (mm-wave) radio access network (RAN);

• Massive multiple input and multiple output (MIMO);

• Software-defined networking (SDN);

• Network functions virtualization (NFV);

• Mobile edge computing (MEC);

• Fog computing (FC);

• Device-to-device (D2D) communication;

• Moving network (MN).

Mobile Edge Computing • IT based servers at network edge, applying the concepts of cloud computing;

• Convergence of IT and Network infrastructure;

ETSI Version of Mobile Edge Computing

Mobile Edge Computing • Mobile edge computing (MEC): A cloud server running at the edge of a mobile network and performing specific tasks that could not be achieved with traditional network infrastructure;

• Characterizations:

• On-Premises: The Edge is local;

• Proximity: Being close to the source of information;

• Low latency: considerably reduces latency;

• Location awareness: location based services, analytics;

• Network context information: local points-of-interest, businesses and events.

• MEC server looks like a mini-data center (MSR calls it “Cloudlet”).

Note: The MEC technical white paper was authored by top players in the 5G value chain, such as Huawei, IBM, Intel, Nokia, NTT and Vodafone.

Mobile Edge Computing • Use cases:

• Active Device Location Tracking;

• AR Content Delivery;

• Video Analytics;

• RAN-aware Content Optimization;

• Distributed Content and DNS Caching;

• Application-aware Performance Optimization;

Source: “Mobile-edge computing - introductory technical white paper”, Sept. 2014.

Fog Computing • Enable computing directly at the edge of the network;

• Fog computing: extension of the cloud computing paradigm from the core of network to the edge of the network;

• Gaming and real-time analytics;

• Cloudlet from MSR coincides with it;

• FC is a highly virtualized platform that provides computation, storage, and networking services between end devices and traditional cloud servers;

• IOx is an FC product from Cisco (shown in Figure);

• Fog computing is a P2P architecture:

• Image processing, for example, can be partitioned and shared between multiple devices.

Source: Cisco. IOx overview. http://goo.gl/n2mfiw. 2014

Fog Computing • Similar concepts: Mist computing (MC), Cloudlet, MEC and MCC;

• MCC (mobile cloud computing) tries to push the boundaries of mobile applications by offering centralized resources to augment mobile devices;

• Integration of cloud computing into the mobile environment.

• Fog computing is a combination of MCC and MEC;

• Requirement:

• High bandwidth and wireless penetration;

• Higher computation and affordable storage;

• Real time application and lower latency requirement;

• Characterization:

• Geographical distribution and location awareness;

• Very large number of nodes;

• Support for wireless access and mobility.

Fog Computing

Example of edge cloud and underlay network architecture

Cloud down to the edge and

Fog up to the network

Source: Q Li et al. (Intel corp.), “Edge Cloud and Underlay Networks: Empowering 5G Cell-Less Wireless Architecture”, European Wireless 2014.

What is Augmented Reality? • Augmented Reality (AR) is a variation of Virtual Environments (VE), or Virtual Reality (VR);

• AR allows the user to see the real world, with virtual objects superimposed upon or composited with the real world;

• In AR, 3-D virtual objects are integrated into a 3-D real environment in real time.

• AR can be thought of as the “middle ground” between VE (completely synthetic) and tele-presence (completely real); it is one form of Mixed Reality.

Applications of AR • Medical and visualization;

• Manufacturing and repairing;

• Annotation and navigation;

• Robot path planning;

• Entertainment and education;

• Military and emergency.

AR Technology • Most of AR systems involve the use of head mounted displays (HMD):

– There are two main categories of HMD-based AR systems:

• Optical See-Through Augmented Reality;

• Video See-Through Augmented Reality.

• Some AR systems involve use of projectors or other display devices (mobile or desktop).

• Mobile AR is a meeting point between AR, ubiquitous computing and wearable computing.

Important Issues of AR

• Focus and contrast;

– Graphics can be rendered to simulate a limited depth-of-field (DOF);

– Everything must be clipped or compressed into the monitor’s dynamic range;

• Portability;

– AR systems will emphasize the ability to walk around outdoors, away from controlled environments;

– Mobile device is a suitable platform for AR systems.

• Tracking and registration:

– Objects in the real and virtual worlds have to be properly aligned with respect to each other;

– Hybrid method:

• GPS + multiple sensors + image/vision matching.

AR Principle

Marker Based Tracking: AR ToolKit

Vuforia’s AR SDK

• PTC purchased it from Qualcomm;

• Supports a variety of 2D and 3D target types, including marker-less Image Targets, singly trackable 3D Multi-Targets and cylindrical targets;

• Also supports Frame Markers, a form of fiducial marker;

• Includes localized occlusion detection using Virtual Buttons;

• Allows the user to pick an image at runtime through user-defined targets;

• Supports text (word) targets, which recognize and track textual elements;

• Supports Extended Tracking, which enables a continuous experience whether or not the target remains visible in the camera field of view;

• No IMU fusion yet.

Vuforia’s AR SDK

Total Immersion - D’Fusion

• More UI based (D’Fusion Studio & D’Fusion CV); the whole scenario can be built through the GUI;

• One bundled scenario works on both Android & iPhone;

• Tracks body movements for multiple marker tracking;

• Marker-less tracking using different sensors (GPS, compass, accelerometer, gyroscope, etc.) and tracking of 2D/3D objects;

• Supports natural feature tracking and face tracking, which allows robust recognition of faces, focusing on eye and mouth detection;

• Recognizes interactive zones on images through finger pointing;

• Barcodes can be recognized both in portrait and landscape mode.

AR in Wikitude • It provides a direction indicator and a location tracker;

• Combines geo-based and image recognition to provide hybrid tracking;

• It is able to track up to 1000 image targets per dataset;

• Its image recognition engine is based on Natural Feature Tracking (NFT);

• Also supports marker tracking, which uses artificial images, barcodes, etc.;

• Can track a target up to 20 meters away, and can snap content to the screen regardless of whether the target image is still visible;

• Face tracking and 3D object tracking features aren’t provided yet.

AR in Metaio: UnifEye • New version: Unifeye Mobile;

• SDK works for Win. Mobile 6.1, iPhone and Symbian;

• 3D content can be visualized in context within live video as recorded by a cell phone's built-in camera;

• Layers 3D animations over 2D graphics on the cell phone's screen;

• presents virtual highlights superimposed on actual real-life street imagery;

• displays animated objects within a real/virtual game;

• Purchased by Apple.

AR in 13th Lab

• Acquired by Facebook’s Oculus together with Nimble VR;

• Using advanced computer vision through PointCloud SDK;

• Capable of taking a 2D image and mapping it into a 3D environment;

• Also recognize and track 2D images;

• Can detect/track a 2D image and then transfer to SLAM, then map and track the environment around the image;

• Minecraft Reality maps and tracks the world around using the camera, and place Minecraft worlds in reality;

• Learn 3D from a point cloud.

Virtual Image Insertion in Planar Region for AR

• Challenge: how to insert a contextually relevant image (what) at the right place (where) and the right time (when) with an attractive representation (how) in videos, in a less intrusive way?

• Critical Problems:

– Scene understanding to find the right place;

– Relevant image selection for insertion;

– Dynamic registration for seamless real-virtual alignment;

– More attractive and less intrusive in real-virtual blending.

Virtual Image Insertion in Planar Region for AR

Flowchart of Virtual Image Insertion (figure). Stages: Initialization and Registration. Blocks: Input; Vanishing Point Estimation; Camera Calibration; Rectangular Planar Structure Extraction; Hypothesis Verification; Visual Tracking; Camera Prediction; Motion Filtering; Color Harmonization; Virtual Image Insertion; Output. Decision: Is Detected? (Yes/No), with a Set as Not Detected branch. At the start, the state is set as Not Detected.

Model based Markerless AR

Model based Markerless AR

Google Glass • Monocular see-through multiplexed display

– 640 x 360 microprojector, 15 degree FOV

– 5 MP camera, gyro, accelerometer

• Curved Mirror

– off-axis projection

– curved mirrors in front of eye

– high distortion, small eye-box

• Waveguide

– use internal reflection

– unobstructed view of world

– large eye-box

• Waveguide techniques for thin see-through displays

– Wider FOV, enable AR applications

– Social acceptability

Wearable Cognitive Assistance with Google Glass

• Joint work of CMU & Intel labs;

• Cognitive assistive system “Gabriel”;

• First-person image capture and sensing with Google Glass;

• Remote processing for real-time scene interpretation;

• Wi-Fi and WAN;

• Supported cognitive engines

• Face/Object recognition;

• Augmented reality;

• OCR;

• Motion classifier & activity inference;

Project Tango: Google’s AR Platform

• NVIDIA Tegra K1 processor

• 4GB of RAM

• 128GB of storage

• 1080p display

• Stock Android 4.4

• WiFi, Bluetooth LE and 4G LTE

• 120 degree front camera

• 4MP Camera and Depth Sensor

• Motion-tracking Camera

• Accurate Pose Estimation

• Position (x,y,z)

• Pointing direction (i, j, k, rotation)

• Uses feature tracking and SLAM

• Accuracy: <1%

• Fuses inertial and visual odometry

• Significantly more accurate than using inertial sensors alone

• Wide angle lens, global shutter

• Tracking through fast movements

• Avoid rolling shutter artifacts

• GPU accelerated processing

Project Tango: Google’s AR Platform

Microsoft HoloLens

• Microsoft HoloLens is the first holographic computer running exclusively on Windows 10.

• It is completely untethered, i.e. no wires, phones or connection to a PC required.

• Microsoft HoloLens uses Augmented Reality (AR) technology that allows it to pin holograms in your physical environment.

• Windows 10 is the first platform to support Holographic Computing with APIs that enable gaze, gesture, voice and environmental understanding on an untethered device like HoloLens.

• Microsoft HoloLens brings high-definition holograms to life in the real world.

• As holograms, digital content will be as real as physical objects in the actual world.

Microsoft HoloLens

Pokemon GO: Location-based Service (LBS) for AR

Pokemon GO: Location-based Service (LBS) for AR

Meta’s AR

• 90-degree field of view.

• 2560 x 1440 display resolution.

• Designed for comfort, with everything below the eyebrows completely transparent and unobstructed so you can easily make eye contact with others.

• Meta 2 works while wearing eyeglasses and can be worn comfortably for hours under most circumstances.

• 720p front-facing camera.

• Sensor array for hand interactions and positional tracking.

• 4 speaker near-ear audio.

• Volume control.

• 9 foot cable for video, data, and power (HDMI Version 1.4b).

• Support for Windows applications.

• Mac support planned for this year.

• Meta operating environment.

• Meta 2 is a tethered device that requires a modern computer with Windows 8 or 10.

• Meta 2 Augmented Reality Headset

• See-through headset that displays holograms and digital content, along with a software development kit built on top of the most popular 3D engine in the world: Unity.

Meta’s AR

Meta’s AR

Guideline #1: You are The O.S.

Guideline #2: Touch to See.

Guideline #3: The Holographic Campfire.

Magic Leap: Mixed Reality?



Difference of HoloLens, Meta, Magic Leap

Filtering-based SLAM • SLAM: a robot builds a map of an environment while using this map to compute its own location;

• A State Space Method (SSM): both observation and motion (state transition) models

– Observation model describes the probability of making an observation (landmarks) when the robot location/orientation and landmark locations (map) are known;

– Motion model is a probability distribution on state (location and orientation of the robot) transitions with applied control, i.e. a Markov process;

– By recursive Bayes estimation, update robot state and environment maps simultaneously;

• The map building problem is solved by fusing observations from different locations;

• The localization problem is to compute the robot location by estimating its state;

• Convergence: correlation between landmark estimates increases monotonically as more observations are made;

• Typical methods: Kalman filter/Particle filter;

• Robot relative location accuracy = localization accuracy with a given map.
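The recursive Bayes estimation named above can be illustrated with a minimal one-dimensional Kalman filter. This is only a sketch of the predict/update cycle for the robot position, not a full SLAM system (no landmark map is estimated); the function names, noise variances and sample data are illustrative assumptions.

```python
def kalman_predict(x, p, u, q):
    """Motion model: move by control u; process noise variance q grows uncertainty."""
    return x + u, p + q


def kalman_update(x, p, z, r):
    """Observation model: fuse measurement z (variance r) with the prediction."""
    k = p / (p + r)  # Kalman gain
    return x + k * (z - x), (1.0 - k) * p


# Assumed toy data: (control, noisy position measurement) per step.
x, p = 0.0, 1.0  # initial position estimate and its variance
for u, z in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:
    x, p = kalman_predict(x, p, u, q=0.1)
    x, p = kalman_update(x, p, z, r=0.5)
```

Each update shrinks the variance `p`, which mirrors the convergence remark: the estimate becomes more certain as observations accumulate.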

Key frame-based SLAM • Bootstrapping

• Compute an initial 3D map

• Mostly based on concepts from two-view geometry

• Normal mode

• Assumes a 3D map is available and incremental camera motion

• Track points and use PnP for camera pose estimation

• Recovery mode

• Assumes a 3D map is available, but tracking failed: no incremental motion

• Relocalize camera pose w.r.t. previously reconstructed map

• Key-Frame BA: keep subset of frames as keyframes

• State: 3D landmarks + camera poses for all key frames

• BA can refine state in a thread separate from tracking component

Image-Based Localization Pipeline

• Key Frame and its Camera Pose Stored;

• Frame id, image feature and camera pose.

• Extract Local Features;

• Feature original or coded.

• Establish 2D-2D or 3D-to-2D Matches;

• Single camera: 3D-to-2D;

• Stereo or depth camera: 2D-to-2D;

• Camera Pose Estimation.

• Refine by 3-D point estimation;

• 2-D features along with their depths.

MPEG CDVS – Motivations and Timeline • Reduce load on wireless networks carrying visual search-related information;

• Ensure interoperability of visual search applications and databases;

• Enable hardware support for descriptor extraction and matching in mobile devices;

• Enable high level of performance of implementations conformant to the standard;

• Simplify design of descriptor extraction and matching for visual search applications.

MPEG CDVS - Requirements • Robustness: accuracy

– verification >95%

– identification >90%

• Sufficiency: self contained;

• Compactness: compression 10x feature;

• Scalability: web scale (database size >100 million)

• Speed: 30 fps for VGA input (maximal size < 640); note: slower on a mobile CPU, 0.5 fps;

MPEG CDVS - Scopes

Instead of sending images, an application can send compact descriptors, or even perform the search locally.

• Descriptor extraction process needed to ensure interoperability.

• Bitstream of compact descriptors.

• Considered system architectures

• Server mainly performs CDVS indexing and query/retrieval, and clients (mobile devices, desktop, etc.) work differently:

• (a) Client only captures images and displays the retrieval results;

• (b) Client can extract the image descriptor and also displays the retrieval results;

• (c) Client takes on more tasks, like local DB/cache-based image matching, and also displays the retrieval results;

Image Feature CDVS Extraction Pipeline • Keypoint detection: ALP;

• SIFT descriptor;

• Feature selection;

• Local descriptor compression;

• Coordinate coding;

• Global descriptor aggregation;

Image Pairwise Matching and Retrieval Pipeline • Global descriptor: top matches;

• Local descriptor: decoding->matching->geometric verification->localization;

• Location coding and descriptor coding.

MPEG CDVS - Proposals • CE1: global descriptor: Residual Enhanced Visual Vector (REVV);

• SCFV (Scalable Compressed Fisher Vector);

• Robust Visual Descriptor (RVD);

• CE2: local descriptor compression: CHoG;

• Transform + Scalar Quantizer;

• Multi-stage Vector Quantizer;

• CE3: Location coding: context based coordinate coding;

• CE4: key-point detection: ALP;

• Block-based Frequency domain LoG;

• CE5: Local Descriptors: SIFT;

• CE6: retrieval pipeline: MBIT;

• CE7: feature selection: Naïve Bayesian learning based feature selection;

• CE8: pairwise matching pipeline: DISTRAT (local matching with geometric verification)

• Weighted Hamming distance (global matching).

CDVA: Compact Descriptor for Video Analysis

ALP • SIFT Patent: DoG filtering to construct scale space!

• ALP (A Low-degree Polynomial): by Telecom Italia;

• Idea: scale space response modeled by a polynomial function; then estimation of coefficients by LoG filtering at different scales:

• 1. Scale space: approximated by a polynomial; get local extrema of the polynomials, eliminate those exceeding the boundaries, output a list of candidates (x, y, σ);

• 2. Clean candidates: discard those either at edges (large ratio of curvatures) or with low absolute values or curvatures;

• 3. Refinement of coordinates: approximate scale space by a polynomial;

• 4. Eliminates duplicates at octave boundaries;

• 5. Find the remaining candidates.

D. G. Lowe: ”Distinctive image features from scale-invariant keypoints”, IJCV 2004, Patent No US 6,711,293.

LoG kernel

Gaussian kernel K = 4

Assumption 1. LoG kernels are well approximated by linear combinations of a few LoG kernels at discrete scales or displacements;

Assumption 2. Coefficients of the linear combinations are smooth functions of scale or displacement, approximated by polynomials of low degree.

ALP

Scale space response as a polynomial in the displacement parameters (u, v)

K = 9

Gaussian Scale Space Construction

• Gray-scale image only;

• Scale space Gaussian pyramid;

• Downscaled by a factor of two;

• Convolution with a 2-D Gaussian filter;

• Use 1-D horizontal Gaussian filter only (transpose);

• CDVS TM: 1-D horizontal Gaussian plus 1-D vertical Gaussian;

• Save in RGBA OpenCL textures;

• OpenCL global memory access is too slow;

• Four gray pixels in one item;

• Reduce image width by 4: process 4 pixels per kernel instance;

• Only 5 (instead of 15) memory accesses to read the adjacent pixels.

• Memory access and parallelism.
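The "1-D horizontal Gaussian filter only (transpose)" idea can be sketched on the CPU in plain Python: a separable 2-D Gaussian blur is computed by running the same horizontal pass twice with a transpose in between. This is an illustrative sketch, not the OpenCL kernel from the slides; the function names and the border clamping are assumptions.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]


def filter_rows(img, kernel):
    """Convolve every row with the 1-D kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(kernel[k + r] * row[min(max(x + k, 0), n - 1)] for k in range(-r, r + 1))
            for x in range(n)
        ])
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, sigma, radius):
    k = gaussian_kernel_1d(sigma, radius)
    # Horizontal pass, transpose, horizontal pass again, transpose back.
    return transpose(filter_rows(transpose(filter_rows(img, k)), k))
```

Reusing one horizontal kernel keeps memory reads row-contiguous in both passes, which is the access-pattern benefit the slide points at.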

Laplacian Operation • Run after Gaussian filtering: LoG;

• Apply Laplacian operator to 4 gray pixels;

• Access the memory 5 times only;

• The kernel runs only 1/4 as many times as in the TM;

• Note: CDVS TM implements Laplacian operator in serial.

Scale Space Function Approximation

• The ALP (A Low-degree Polynomial) detector identifies interest points finding local extrema in the scale space by approximating the scale-space using polynomials;

• ALP approximates LoG filtering functions with polynomials of low degree;

• The coefficients of polynomial are obtained by computing weighted sums of the Laplacian-of-Gaussian (LoG) images;

• Two kernels: both run 4 pixels in parallel

• Compute the coefficients;

• Evaluate the roots of 1st derivative of the polynomial, and store min and max of polynomial values for each pixel;

• Only one memory access;

• Note: Extrema detection refined to sub-pixel, also in parallel.

if (isExtrema) then
    p = ComputePolynomial(A, B, C, D);
    deltaX = ComputeDeltaX(p);
    deltaY = ComputeDeltaY(p);
    if ((|deltaX| <= TH1) && (|deltaY| <= TH2)) then
        isKeypoint = true;
    end if
end if
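The step "evaluate the roots of the 1st derivative of the polynomial" can be illustrated as follows, assuming the per-pixel scale-space response is a cubic in the scale s with coefficients A, B, C, D (the cubic degree and the function name are assumptions made for this sketch).

```python
import math


def cubic_extrema(a, b, c, d, s_min, s_max):
    """Return (s, p(s)) for real roots of p'(s) strictly inside (s_min, s_max),
    where p(s) = a*s**3 + b*s**2 + c*s + d."""
    def p(s):
        return ((a * s + b) * s + c) * s + d

    # p'(s) = 3a*s^2 + 2b*s + c
    qa, qb, qc = 3.0 * a, 2.0 * b, c
    roots = []
    if abs(qa) < 1e-12:
        if abs(qb) > 1e-12:  # derivative is linear
            roots.append(-qc / qb)
    else:
        disc = qb * qb - 4.0 * qa * qc
        if disc >= 0.0:
            sq = math.sqrt(disc)
            roots += [(-qb - sq) / (2.0 * qa), (-qb + sq) / (2.0 * qa)]
    return [(s, p(s)) for s in sorted(roots) if s_min < s < s_max]
```

For example, `cubic_extrema(1.0, 0.0, -3.0, 0.0, -2.0, 2.0)` finds the maximum at s = -1 and the minimum at s = 1 of s³ − 3s; roots outside the valid scale range are dropped, matching the elimination of candidates at the boundaries.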

Orientation Assignment

• Dominant orientation assigned per keypoint;

• Computation of gradient mag and direction;

• One kernel;

• Computation of orientation histogram;

• Another kernel;

• Orientation histogram smoothing;

• Work-groups (at least one work-item);

• Shared memory for work-items per work-group;

• Buffer for 2048 x 36 (histogram bin #) int. elements.

• Peak localization by 4 kernels in one work-group;

• 9 histogram bins per kernel.

• Quadratic interpolation for better accuracy.
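The orientation steps above (histogram, smoothing, peak localization with quadratic interpolation) can be sketched serially in Python. The 36 bins follow the buffer size quoted on the slide; the 3-tap box smoothing and the function names are assumptions, and the GPU work-group layout is not modeled.

```python
import math


def orientation_histogram(grads, bins=36):
    """grads: iterable of (magnitude, angle_in_radians) around a keypoint."""
    hist = [0.0] * bins
    for mag, ang in grads:
        b = int((ang % (2.0 * math.pi)) / (2.0 * math.pi) * bins) % bins
        hist[b] += mag
    return hist


def smooth(hist):
    """Circular 3-tap box smoothing."""
    n = len(hist)
    return [(hist[i - 1] + hist[i] + hist[(i + 1) % n]) / 3.0 for i in range(n)]


def dominant_orientation(hist):
    """Angle (radians) of the highest peak, refined by a parabola through 3 bins."""
    n = len(hist)
    i = max(range(n), key=hist.__getitem__)
    left, centre, right = hist[i - 1], hist[i], hist[(i + 1) % n]
    denom = left - 2.0 * centre + right
    offset = 0.0 if denom == 0.0 else 0.5 * (left - right) / denom
    return ((i + 0.5 + offset) % n) * (2.0 * math.pi / n)  # bin centre + refinement
```

The parabola fit moves the peak by at most half a bin toward the larger neighbour, which is where the accuracy gain from quadratic interpolation comes from.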

Feature Selection and Duplicate Elimination

• Store keypoints in OpenCL textures rather than CPU data structures, to reduce memory access latency;

• Number of keypoints per octave: 256 (max number of keypoints per row) x 8 = 2048;

• Keypoint: 4 (elements) x 4 = 16 floating-point numbers;

• Max size of textures: 1024x8;

• Feature selection (FS) based on relevance: computing matching probability (MP) and sorting;

• Create a histogram, put every feature in its specific interval, and then select only the most important part;

• Training data for Naïve Bayes learning;

• Consistent with various datasets.

• MP is computed by averaging training values of some of the scale space characteristics, its orientation and coordinates;
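The histogram-based selection described above (bin the matching probabilities, derive a threshold, keep the most relevant part) can be sketched like this. It assumes the MP scores are already computed and lie in [0, 1]; the bin count and function name are illustrative.

```python
def select_features(scores, n_keep, n_bins=64):
    """scores: matching probabilities in [0, 1]. Returns indices of kept features."""
    hist = [0] * n_bins
    bin_of = [min(int(s * n_bins), n_bins - 1) for s in scores]
    for b in bin_of:
        hist[b] += 1
    # Walk from the most relevant bin downwards until n_keep features are covered.
    kept, threshold_bin = 0, n_bins - 1
    for b in range(n_bins - 1, -1, -1):
        kept += hist[b]
        threshold_bin = b
        if kept >= n_keep:
            break
    return [i for i, b in enumerate(bin_of) if b >= threshold_bin]
```

Because the whole threshold bin is kept, slightly more than `n_keep` features may be returned; in exchange no full sort is needed, which suits a parallel implementation.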



Deferred FS


• FS performed before orientation assignment (OA) without performance loss;

• Read the keypoint structure first;

• Compute MP not including orientation contribution;

• Make histogram, same as in Deferred FS;

• Compute threshold, same as in Deferred FS;

• Select feature, same as in Deferred FS;

• Orientation assignment, same as in Deferred FS;

• Update histogram with orientation info;

• Then compute threshold again;

• Finally select feature again.

K. Lee, S. Lee and W. Oh, “Accelerating Local Feature Extraction Using Two Stage Feature Selection and Partial Gradient Computation,” 2014


Hastened FS

GPU’s Performance with OpenCL

• Mobile GPU;

• ALP keypoint detector -> 7 times faster;

• CDVS feature extraction -> 3 times faster;

• Samsung Galaxy Note 3

• 4-core ARM

• GPU Adreno 330

• Arndale Octa board

• 4-core ARM

• GPU Mali T628

CDVS Feature Extraction Time

• SIFT interest point detection and feature extraction made the biggest contribution;

• Global descriptors as complex as Interest Point Detection;

• Very fast in computing local descriptors and coordinate encoding.

GD: Global Descriptor LD: Local Descriptor

Transform + Scalar Quantizer • Simple linear bin combinations in each subregion HoG followed by coarse ternary scalar quantisation of the resultant values;

• Transform of each subregion relies on simple-to-compute bin relations about the HoG’s shape: (1)-(4);

• Divide relations into subsets A & B;

• Spatial neighboring processed similarly.

• Bitrate element selection: element elimination (1k~16k, seen in next slide);

• Scalar quantization and AAC (Adaptive Arithmetic Coding).

Figure: the eight orientation bins h0 … h7 of one subregion HoG.

Neighbouring (45°) bin relations:

$$\frac{h_0 - h_1}{2},\ \frac{h_1 - h_2}{2},\ \frac{h_2 - h_3}{2},\ \frac{h_3 - h_4}{2},\ \frac{h_4 - h_5}{2},\ \frac{h_5 - h_6}{2},\ \frac{h_6 - h_7}{2},\ \frac{h_7 - h_0}{2} \tag{1}$$

Opposite (180°) bin relations:

$$\frac{h_0 - h_4}{2},\ \frac{h_1 - h_5}{2},\ \frac{h_2 - h_6}{2},\ \frac{h_3 - h_7}{2} \tag{2}$$

“Line” relations (horizontal and vertical orientation bins):

$$\frac{(h_0 + h_4) - (h_2 + h_6)}{4},\ \frac{(h_1 + h_5) - (h_3 + h_7)}{4} \tag{3}$$

Relations of two “halves”:

$$\frac{(h_0 + h_2 + h_4 + h_6) - (h_1 + h_3 + h_5 + h_7)}{8},\ \frac{(h_0 + h_1 + h_2 + h_3) - (h_4 + h_5 + h_6 + h_7)}{8} \tag{4}$$

Robust to coarse scalar quantization!

Figure: the 4 x 4 grid of subregion histograms h0 … h15, assigned alternately to sets A and B:

A B A B
B A B A
A B A B
B A B A

Set A

$$\begin{aligned}
v_0 &= (h_2 - h_6)/2 \\
v_1 &= (h_3 - h_7)/2 \\
v_2 &= (h_0 - h_1)/2 \\
v_3 &= (h_2 - h_3)/2 \\
v_4 &= (h_4 - h_5)/2 \\
v_5 &= (h_6 - h_7)/2 \\
v_6 &= ((h_0 + h_4) - (h_2 + h_6))/4 \\
v_7 &= ((h_0 + h_2 + h_4 + h_6) - (h_1 + h_3 + h_5 + h_7))/8
\end{aligned} \tag{5}$$

Set B

$$\begin{aligned}
v_0 &= (h_0 - h_4)/2 \\
v_1 &= (h_1 - h_5)/2 \\
v_2 &= (h_1 - h_2)/2 \\
v_3 &= (h_3 - h_4)/2 \\
v_4 &= (h_5 - h_6)/2 \\
v_5 &= (h_7 - h_0)/2 \\
v_6 &= ((h_1 + h_5) - (h_3 + h_7))/4 \\
v_7 &= ((h_0 + h_1 + h_2 + h_3) - (h_4 + h_5 + h_6 + h_7))/8
\end{aligned} \tag{6}$$
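The Set A and Set B relations, followed by ternary quantisation, can be written out as a short Python sketch. The transform follows equations (5) and (6) term by term; the single symmetric threshold in `ternary_quantize` is a placeholder assumption, since the slides do not give the quantisation thresholds.

```python
def transform_set_a(h):
    """h: the 8 orientation bins h0..h7 of one subregion (equation 5)."""
    return [
        (h[2] - h[6]) / 2, (h[3] - h[7]) / 2,
        (h[0] - h[1]) / 2, (h[2] - h[3]) / 2,
        (h[4] - h[5]) / 2, (h[6] - h[7]) / 2,
        ((h[0] + h[4]) - (h[2] + h[6])) / 4,
        ((h[0] + h[2] + h[4] + h[6]) - (h[1] + h[3] + h[5] + h[7])) / 8,
    ]


def transform_set_b(h):
    """h: the 8 orientation bins h0..h7 of one subregion (equation 6)."""
    return [
        (h[0] - h[4]) / 2, (h[1] - h[5]) / 2,
        (h[1] - h[2]) / 2, (h[3] - h[4]) / 2,
        (h[5] - h[6]) / 2, (h[7] - h[0]) / 2,
        ((h[1] + h[5]) - (h[3] + h[7])) / 4,
        ((h[0] + h[1] + h[2] + h[3]) - (h[4] + h[5] + h[6] + h[7])) / 8,
    ]


def ternary_quantize(values, threshold):
    """Map each value to -1, 0 or +1 (threshold is a hypothetical parameter)."""
    return [(-1 if v < -threshold else 1 if v > threshold else 0) for v in values]
```

For a histogram with all its mass in bin h0, e.g. `[8, 0, 0, 0, 0, 0, 0, 0]`, set B yields `[4, 0, 0, 0, 0, -4, 0, 1]`, which quantises to `[1, 0, 0, 0, 0, -1, 0, 0]` with a threshold of 1.5.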

Transform + Scalar Quantizer

Figure: descriptor element selection over the 4 x 4 grid of subregions (h0 … h15), each with transformed elements v0 … v7; selected elements appear in green. (a) 16K – 128, (b) 8K – 80, (c) 4K – 64, (d) 2K – 40, (e) 1K – 20 elements.

CFV, SCFV

PQ vs Spectral Hashing and Hamming Embedding

CFV, SCFV

• Global Descriptor Matching: I. Using bitwise XOR and POPCNT to compute Hamming distances;

II. Reading the weights from a small look-up table.

D=24 for 512B (in this case) or 32 for others

• Weights: learned off-line from matching/non-matching pairs in the training data:

• INRIA holidays, Oxford buildings and Caltech Pasadena buildings.

• Fast: generate the short list within 1 second from 1 million image database.
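The two matching steps (XOR plus population count, then a weight look-up) can be sketched in Python. Here each descriptor is assumed to be a list of D-bit integers, one per component, and the look-up table is indexed by the per-component Hamming distance; the table values used in the example are made up, whereas the slides state that the real weights are learned off-line.

```python
def hamming(a, b):
    """Hamming distance between two bit words: XOR, then population count."""
    return bin(a ^ b).count("1")


def weighted_similarity(desc_q, desc_r, weight_lut):
    """Sum the looked-up weight of each component's Hamming distance."""
    return sum(weight_lut[hamming(q, r)] for q, r in zip(desc_q, desc_r))


# Hypothetical table for D = 4 bits: small distances score high.
example_lut = [1.0, 0.5, 0.0, -0.5, -1.0]
score = weighted_similarity([0b1010, 0b1111], [0b0110, 0b1111], example_lut)
```

On hardware, the XOR and POPCNT map to single instructions and the table has only D + 1 entries, which is what makes the first-stage shortlist fast.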

Geometric Verification Check: DISTRAT

• DISTance RATio coherence

Calculating the location geometric score: (a) features are matched according to the descriptor paths, (b) distances between features within each image are calculated, (c) log distance ratios (LDR) of corresponding pairs are calculated, and (d) a histogram of log distance ratios is formed. The maximum value of the histogram is the geometric similarity score.
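Steps (b) to (d) of the caption can be sketched as follows, assuming `pts_a[i]` and `pts_b[i]` are the image coordinates of the i-th matched feature pair. The bin count and the LDR range are illustrative assumptions, and the score is simply the histogram maximum, as the caption states.

```python
import math
from itertools import combinations


def ldr_histogram(pts_a, pts_b, n_bins=25, lo=-2.5, hi=2.5):
    """Histogram of log distance ratios over all pairs of matches."""
    hist = [0] * n_bins
    for i, j in combinations(range(len(pts_a)), 2):
        da = math.hypot(pts_a[i][0] - pts_a[j][0], pts_a[i][1] - pts_a[j][1])
        db = math.hypot(pts_b[i][0] - pts_b[j][0], pts_b[i][1] - pts_b[j][1])
        if da == 0.0 or db == 0.0:
            continue  # coincident points carry no ratio information
        ldr = math.log(da / db)
        if lo <= ldr < hi:
            hist[int((ldr - lo) / (hi - lo) * n_bins)] += 1
    return hist


def geometric_score(pts_a, pts_b):
    """Peak of the LDR histogram: consistent matches pile up in one bin."""
    return max(ldr_histogram(pts_a, pts_b))
```

When the second image is a scaled copy of the first, every pair has the same ratio and all votes land in one bin; random mismatches spread over many bins and produce a low peak.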

Geometric Verification Check: DISTRAT

Figure labels: quantizer; LDR evidence; eigenvalue; number of inliers; match weight.

Local Descriptor Matching

Match weight w (formula lost in extraction): learned from the conditional probability of a correct match by DISTRAT.

Location Histogram Coding 1. A core area is identified, and the external features are pruned out (note: applied to location coding only);

2. After the creation of the bit stream header and the encoding of the histogram count using arithmetic coding, the encoding of the histogram map starts from the center of the image, adopting the presented circular scanning;

3. Scanning continues until the number of non-null elements in the image is reached;

4. Each element is encoded using a context built up on a sum-based basis;

5. Context is trained on rescanned or original matrices.

Flowchart (figure): Pruning of external features → Create bit stream header → Encode histogram count → Start circular matrix scanning → Encode one element (using Context) → Has the last non-null element of the histogram map been encoded? (No: encode the next element; Yes: Exit).

Homography Estimation for Localization

• Apply a short run of Random SAmple Consensus (RANSAC) for homographies to the estimated inlier set;

• RANSAC with a fixed number of generated hypotheses and a fixed threshold for inlier inclusion;

• A small number of RANSAC iteration is sufficient: 10;

• Choose bi-directionally matched interest point pairs;

• Direct Linear Transform (DLT) algorithm requires four or more matching point pairs as input;

• The DLT algorithm requires a normalization pass applied to both sets of points.
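The normalization pass can be sketched with the commonly used Hartley scheme: translate each point set so its centroid is at the origin and scale it so the mean distance from the origin is √2. The slides do not say which normalization is used, so this choice and the function name are assumptions; the sketch also assumes the points are not all coincident.

```python
import math


def normalize_points(pts):
    """Return (normalized_points, T), where T is the 3x3 similarity transform
    (nested lists) mapping homogeneous input points to normalized ones."""
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in pts) / n
    s = math.sqrt(2.0) / mean_dist
    t = [[s, 0.0, -s * cx],
         [0.0, s, -s * cy],
         [0.0, 0.0, 1.0]]
    return [(s * (x - cx), s * (y - cy)) for x, y in pts], t
```

Both point sets are normalized separately before solving; a homography estimated in normalized coordinates then has to be mapped back with the two transforms T (one inverted), a step omitted here.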

MBIT Indexing • Multi-Block Index Table;

• To reduce the range of Hamming-distance based similarity comparison of candidate images, eliminating the exhaustive search in the first-stage retrieval;

• Partition each global descriptor SCFV into multiple blocks and set up an index table for each block, following an inverted table structure;

• MBIT searching is to generate a shortlist of candidate images for re-ranking;

• 1. weighted voting to generate a set of candidates;

• An optimal threshold for conflict is learned off-line;

• 2. exhaustive search by Hamming distance.
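A toy version of the multi-block index can illustrate the two search stages. Binary codes are split into fixed-width blocks, each block value indexes an inverted list of image ids, votes shortlist the candidates, and the shortlist is re-ranked by Hamming distance. This sketch uses plain (unweighted) voting and a hand-picked `min_votes`, whereas the slides describe weighted voting with an off-line learned threshold; all names are illustrative.

```python
from collections import defaultdict


def split_blocks(code, n_blocks, block_bits):
    """Split an integer code into n_blocks values of block_bits bits each."""
    mask = (1 << block_bits) - 1
    return [(code >> (i * block_bits)) & mask for i in range(n_blocks)]


def build_mbit(codes, n_blocks, block_bits):
    """One inverted table per block: block value -> list of image ids."""
    tables = [defaultdict(list) for _ in range(n_blocks)]
    for image_id, code in enumerate(codes):
        for table, value in zip(tables, split_blocks(code, n_blocks, block_bits)):
            table[value].append(image_id)
    return tables


def search(query, codes, tables, n_blocks, block_bits, min_votes, top_k):
    # Stage 1: voting over the per-block inverted tables.
    votes = defaultdict(int)
    for table, value in zip(tables, split_blocks(query, n_blocks, block_bits)):
        for image_id in table.get(value, ()):
            votes[image_id] += 1
    candidates = [i for i, n in votes.items() if n >= min_votes]
    # Stage 2: exhaustive Hamming-distance ranking, but only on the candidates.
    candidates.sort(key=lambda i: bin(codes[i] ^ query).count("1"))
    return candidates[:top_k]
```

Only images that share at least `min_votes` identical blocks with the query reach the Hamming comparison, which is how the exhaustive first-stage scan is avoided.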

MBIT Indexing

Illustration of the MBIT indexing structure

What Role Can CDVS Play in 5G Network? • Image-based Relocalization;

• Feature matching for location recognition;

• Failure recovery (loop closure detection);

• Image-based content search and targeting;

• Object or scene search;

• Location finding;

• Targeted ads;

• Video Analytics;

• Action or event search (CDVS -> CDVA).

Figure labels: Fog Computing; Cloud; Edge Computing; Cloudlet; CDVS-based Feature Extraction, Image Matching and Image Retrieval.

CDVS-based Photo Search at WeChat • Rapid-Rich Object Search (ROSE) Lab of NTU and Peking University;

• APIs of “Weixin Chat Smart Platform” for WeChat, Tencent:

– CDVS-based image recognition/search (the other is audio recognition/search);

• A free wine shopping assistant app to recognize wine labels;

• A travel assistant app for users to access tourism information around a landmark;

• Images (posters, advertisements, printed materials, covers of goods, etc.) uploaded to the cloud to be indexed;

• Apple’s iOS and Google’s Android;

JiuKaopu

UGuide

Loop Detection using CDVS in Robotic Navigation • Telecom Italia’s recent work;

• Applied for loop-detection in robotic navigation (keyframe-based SLAM);

• A robot works in the indoor environment using a single down-facing camera;

• Landmark identification (global + local);

• Visual odometry (local only, very fast);

• Direct feature matching is still expensive in bandwidth cost and not low in delay;

• CDVS can achieve trade-offs between distinctiveness & storage requirements;

Source: http://jol.telecomitalia.com/jolvisible/cdvs-nella-robotica/?lang=en

CDVS for Visual Navigation in Drone: MAV & UAV

• MAV( micro aerial vehicle);

• UAV (Unmanned Aerial Vehicle);

• SLAM for drone navigation (GPS denied);

• Key frame-based SLAM;

• PTAM: parallel tracking and mapping;

• What can CDVS do?

• Location/Landmark recognition;

• Loop closure as a special case;

• Visual odometry;

• Feature matching.

Reduce Bandwidth Cost and Latency

Avoid Missing Targeted Position

• Drone cannot take weighty LiDAR;

• Location database is saved in the server;

• It needs fast image search and matching for immediate drone localization (esp. when the drone is flying in a high speed);

• Data transmission btw server and drone should be ultra low for “zero latency”.


• Image transmission for image-based localization:

– Camera 720p @ 30 fps, H.264 compression -> 10 Mbps;

• CDVS transmission:

– Feature compression: 2 KB is acceptable, 4 KB works well;

– Localization at 10 Hz (every 100 milliseconds) -> 0.32 Mbps in total;

• Latency comparison: replace video-based transmission with CDVS-based transmission.

– Data to transmit is reduced to 320/10240 = 3.125% (a factor of 32).

• 5G network bandwidth > 50 Mbps (assume 80 Mbps for drone navigation);

• DARPA drone can fly at 45 miles/h (20 meters/s) in a wide open field, less in urban areas, more in empty space (Sony’s drone flies up to 102 miles/h);

– Delay-induced position error, CDVS vs. compressed video: 0.078 meters vs. 2.5 meters.
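The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes one reading of the numbers: a 4 K byte descriptor is counted as 32 kbit, 1 Mbps is counted as 1024 kbps (which is how 10 Mbps becomes 10240), and the "distance error" is the distance the drone covers while one second's worth of data crosses the assumed 80 Mbps link. The slide does not spell out that last step, so treat it as an interpretation.

```python
SPEED_M_PER_S = 20.0        # drone speed from the slide (about 45 miles/h)
LINK_KBPS = 80 * 1024       # assumed 80 Mbps link, counting 1 Mbps = 1024 kbps
VIDEO_KBPS = 10 * 1024      # 720p @ 30 fps H.264, about 10 Mbps
DESCRIPTOR_KBIT = 4 * 8     # one 4 K byte CDVS descriptor = 32 kbit
LOCALIZATION_HZ = 10        # one descriptor every 100 ms

cdvs_kbps = DESCRIPTOR_KBIT * LOCALIZATION_HZ  # 320 kbps, i.e. about 0.32 Mbps


def delay_error_m(payload_kbps, link_kbps=LINK_KBPS, speed=SPEED_M_PER_S):
    """Distance flown while one second's worth of payload crosses the link."""
    transfer_time_s = payload_kbps / link_kbps
    return speed * transfer_time_s


ratio = cdvs_kbps / VIDEO_KBPS           # 320 / 10240
cdvs_error = delay_error_m(cdvs_kbps)    # CDVS-based localization
video_error = delay_error_m(VIDEO_KBPS)  # compressed-video-based localization

print(f"ratio={ratio:.5f} cdvs={cdvs_error:.3f} m video={video_error:.1f} m")
# prints: ratio=0.03125 cdvs=0.078 m video=2.5 m
```

Changing `SPEED_M_PER_S` or `LINK_KBPS` shows how the position error scales linearly with flight speed and inversely with link bandwidth.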


AT&T and Intel want drones connected to an LTE network.

100 Dancing Spaxels-like Drones with Intel RealSense.

Intelligent drone follows with Intel RealSense.

Source: A. Majdik, D. Verda, Y. Albers-Schoenberg, D. Scaramuzza “Air-ground Matching: Appearance-based GPS-denied Urban Localization of Micro Aerial Vehicles”, Journal of Field Robotics, 2015.

High-Speed Autonomous Obstacle Avoidance with Pushbroom Stereo

• A small autonomous unmanned aerial vehicle capable of high-speed flight through complex natural environments;

• Using only onboard computation, it performs obstacle detection, planning, and feedback control in real time at 14 m/s (31 MPH);

• Pushbroom stereo detects obstacles at 120 frames/s without overburdening processors;

• Model-based planning and control techniques allow it to track precise trajectories that avoid obstacles identified by the vision system.


Online planning: An obstacle is deemed to be in the path if the minimum distance between the trajectory and the point cloud is below a threshold
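The online planning test stated above can be sketched as a minimum-distance check between sampled trajectory points and the obstacle point cloud. This is a brute-force illustration under assumed inputs (3D tuples in meters, a hypothetical 0.5 m threshold); a real system would likely use a spatial index and the vehicle's actual trajectory model.

```python
import math


def min_distance(trajectory, point_cloud):
    """Smallest Euclidean distance between any trajectory sample and any point."""
    return min(math.dist(p, q) for p in trajectory for q in point_cloud)


def obstacle_in_path(trajectory, point_cloud, threshold_m):
    """True if some obstacle point is closer to the path than threshold_m."""
    if not trajectory or not point_cloud:
        return False  # nothing to collide with
    return min_distance(trajectory, point_cloud) < threshold_m


# Straight 5 m trajectory along x at 1 m altitude, sampled every 0.5 m.
trajectory = [(0.5 * i, 0.0, 1.0) for i in range(11)]
cloud_clear = [(2.0, 3.0, 1.0)]    # 3 m to the side of the path
cloud_blocked = [(2.0, 0.3, 1.0)]  # 0.3 m from the path

clear = obstacle_in_path(trajectory, cloud_clear, threshold_m=0.5)
blocked = obstacle_in_path(trajectory, cloud_blocked, threshold_m=0.5)
```

Because the trajectory is only sampled at discrete points, the sampling step should be small relative to the threshold.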


The red voxels show detected obstacles and the green squares show the planned trajectory as it is being executed.

Obstacle Detection and Navigation Planning for Autonomous MAVs

• Obstacle detection and real-time planning of collision-free trajectories are key for the fully autonomous operation of micro aerial vehicles in restricted environments.

• A multimodal sensor setup for omni-directional obstacle perception consisting of a 3D laser scanner, two stereo camera pairs, and ultrasonic distance sensors.

• Detected obstacles are aggregated in egocentric local multi-resolution grid maps.

• Generate trajectories in a multi-layered approach: from mission planning to global and local trajectory planning, to reactive obstacle avoidance.

• Obstacle perception: construct MAV-centric obstacle maps first;

– A hybrid local multiresolution map that represents both occupancy information and the individual distance measurements.

– The most recent measurements are stored in ring buffers within grid cells that increase in size with distance from the robot’s center.

– Resolution is high close to the sensor and lower far away from the robot, which matches the sensor’s characteristics in relative distance accuracy and measurement density.
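A minimal sketch of such a robot-centric map follows: cells grow with distance from the robot and each cell keeps only its most recent measurements in a ring buffer. The doubling of cell size per level, the cubic shells, and all parameter values are assumptions made for illustration; the slide does not give them.

```python
import math
from collections import defaultdict, deque


class MultiResMap:
    """Robot-centric grid whose cells get coarser farther from the robot."""

    def __init__(self, base_cell=0.25, base_half_extent=2.0, levels=3, buffer_len=4):
        self.base_cell = base_cell                # finest cell edge (m), assumed
        self.base_half_extent = base_half_extent  # half-width of finest level (m)
        self.levels = levels
        # Each cell holds a fixed-size ring buffer of its latest measurements.
        self.cells = defaultdict(lambda: deque(maxlen=buffer_len))

    def level_for(self, point):
        """Finest level whose cube around the robot contains the point."""
        reach = max(abs(c) for c in point)
        for level in range(self.levels):
            if reach < self.base_half_extent * 2 ** level:
                return level
        return None  # outside the local map

    def insert(self, point):
        """Store a measurement (robot frame, meters); return its cell key."""
        level = self.level_for(point)
        if level is None:
            return None
        size = self.base_cell * 2 ** level  # cell edge doubles per level
        index = tuple(math.floor(c / size) for c in point)
        key = (level, index)
        self.cells[key].append(point)       # oldest entry drops out when full
        return key


grid = MultiResMap()
near_key = grid.insert((0.3, 0.1, 0.0))  # lands in a 0.25 m cell
far_key = grid.insert((5.0, 0.0, 0.0))   # lands in a 1.0 m cell
```

Keeping raw measurements per cell is what lets the map serve both as an occupancy representation and as input for scan registration.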


• For each measurement and the corresponding 3D point, the individual cell of the map is marked as occupied;

– To compensate for the sensor’s motion during scan acquisition, incorporate a visual odometry estimate from two pairs of wide-angle stereo cameras.

– Register consecutive 3D scans by matching Gaussian point statistics in grid cells (surfels) between local multi-resolution grid maps;

– Fuse the measurements in an occupancy grid maintaining occupancy probabilities.
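One common way to maintain occupancy probabilities per cell is a log-odds Bayesian update, sketched below. The slide only says that occupancy probabilities are maintained, so the log-odds form, the inverse sensor model values (0.7 for a hit, 0.4 for a miss), and the clamping bounds are assumptions.

```python
import math

L_OCC = math.log(0.7 / 0.3)   # log-odds increment for a hit (assumed model)
L_FREE = math.log(0.4 / 0.6)  # log-odds decrement for a miss (assumed model)
L_MIN, L_MAX = -4.0, 4.0      # clamping keeps cells able to change state


def update(log_odds, hit):
    """Fuse one measurement into a cell's log-odds occupancy value."""
    delta = L_OCC if hit else L_FREE
    return max(L_MIN, min(L_MAX, log_odds + delta))


def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))


cell = 0.0  # unknown cell: probability 0.5
for _ in range(3):
    cell = update(cell, hit=True)  # three consecutive hits
p_after_hits = probability(cell)
cell_after_miss = update(cell, hit=False)
```

Working in log-odds turns each fusion step into an addition, which is cheap enough to run for every measurement in a scan.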

The obstacle is circled in blue. The MAV position is circled in red.


• To reduce the planning complexity, divide the overall planning problem into multiple problems at different levels of abstraction;

• A hierarchical control architecture for the MAV, with slower deliberative planners that solve complex path and mission planning problems on the upper layers and high-frequency reactive controllers on the lower layers.


What is Deep Learning?

• Representation learning attempts to automatically learn good features or representations;

• Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);

• Became effective via unsupervised pre-training + supervised fine-tuning;

– Historically, deep networks trained with backpropagation alone (without unsupervised pre-training) often performed worse than shallow networks.

• Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);

• Semi-supervised: structure of manifold assumption;

– labeled data is scarce and unlabeled data is abundant.

Deep Feature Learning

• Hand-crafted features:

• Need expert knowledge

• Require time-consuming hand-tuning

• (Arguably) one limiting factor of computer vision systems

• Key idea of feature learning:

• Learn statistical structure or correlation from unlabeled data

• The learned representations are used as features in supervised and semi-supervised settings

• Hierarchical feature learning

• Deep architectures can be representationally efficient.

• Natural progression from the low level to the high level structures.

• Can share the lower-level representations for multiple tasks in computer vision.

Feature Learning Architectures

[Figure: one feature-learning stage. Pixels / Features -> Filtering with Dictionary (patch/tiled/convolutional) + Non-linearity -> Spatial/Feature pooling (Sum or Max) -> Normalization between feature responses -> Features. Options shown: Local Contrast Normalization (Subtractive & Divisive); (Group) Sparsity; Max / Softmax. Note: not an exact separation.]
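The stages named on this slide (filtering with a dictionary, non-linearity, pooling, normalization between feature responses) can be sketched on a 1D signal. This is a toy illustration only: the two edge filters, ReLU as the non-linearity, max pooling, and L2 divisive normalization are choices made here, not taken from the slide.

```python
import math


def correlate(signal, kernel):
    """Filtering: slide one dictionary element over the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(xs):
    """Non-linearity: keep positive responses only."""
    return [max(0.0, x) for x in xs]


def max_pool(xs, width):
    """Spatial pooling over non-overlapping windows."""
    return [max(xs[i:i + width]) for i in range(0, len(xs) - width + 1, width)]


def normalize_across_features(maps, eps=1e-8):
    """Divisive normalization by the L2 norm over features at each position."""
    out = [[0.0] * len(maps[0]) for _ in maps]
    for pos in range(len(maps[0])):
        norm = math.sqrt(sum(m[pos] ** 2 for m in maps)) + eps
        for f, m in enumerate(maps):
            out[f][pos] = m[pos] / norm
    return out


def stage(signal, dictionary, pool_width=2):
    """One stage: filter -> non-linearity -> pool -> normalize."""
    maps = [max_pool(relu(correlate(signal, k)), pool_width) for k in dictionary]
    return normalize_across_features(maps)


signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
dictionary = [[-1.0, 1.0], [1.0, -1.0]]  # rising-edge and falling-edge filters
features = stage(signal, dictionary)     # two feature maps of length 4
```

Stacking several such stages, each taking the previous stage's features as input, gives the hierarchical architecture the preceding slides describe.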

Deep Learning for SLAM

• Feature extraction and matching;

• Loop closure detection;

• Image-based re-localization;

• Stereo matching;

• Optic flow estimation;

• Geometric transform estimation;

• End-to-end learning.




Optic flow estimation


Visual Tracking

Deep Learning for Image/Video Search


Remarks

• 5G network provides low/ultra-low latency and high bandwidth via fog computing (FC) and mobile edge computing (MEC);

• Augmented reality needs registration and recognition;

• CDVS generates compact descriptors for images;

• CDVA will extend CDVS to video analysis;

• In a 5G network, CDVS can support fast, real-time use cases:

• Augmented reality;

• Content search and targeting;

• Real-time video analytics;

• Deep learning can serve the same 5G network use cases as CDVS.

Appendix


C2TAM: SLAM in Cloud

• Map optimization moved to cloud;

• Latency can be reduced by 5G Network;

• Data flow:

• Server to client: map;

• Client to server;

• Modes of operation:

• Map building and storage;

• Relocalization;

• Map extension;

• Concurrent mapping;

Source: L. Riazuelo, J. Civera, J. M. Montiel, “C2TAM: A Cloud Framework for Cooperative Tracking and Mapping”, RAS, 2014.

RoboEarth

• Cloud Robotics infrastructure to close the loop between robot and cloud;

• WWW database: software components, maps for navigation, task knowledge, and object recognition models;

• The RoboEarth Cloud Engine makes any computation available to robots;

• The Cloud Engine’s environments provide high-bandwidth access to the RoboEarth knowledge repository, enabling robots to benefit from the experience of other robots.

Source: http://roboearth.org/


OpenCL (Open Computing Language) for GPU

• Platform model:

• Host->Compute Device->Compute Unit->Processing Element;

• Memory model: Host memory

• Global Memory, Caches, Local Memory, Private Memory;

• Program structure:

• Kernels (parallel), host (serial) program;

• Execution model:

• Work group, work item;

• Synchronization in the group;

• Data/task parallelism;

• Contexts, queues, events;

• Data I/O structure: buffer

• Image2D of Texturing;

• RGBA for grayscale data;

• Interpolation, border handling, caching.
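The work group / work item decomposition in the execution model can be illustrated with a serial host-side emulation of a 1D index space. This is not OpenCL code and uses no OpenCL API: `run_ndrange` and `vector_add` are hypothetical helpers written for this sketch, and a real device would run work items in parallel rather than in a loop. The index relation (global id = group id × local size + local id, with zero offset) is the part being shown.

```python
def run_ndrange(kernel, global_size, local_size, *args):
    """Serially emulate a 1D index space split into work groups."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):  # one pass per work group
        for local_id in range(local_size):             # work items in the group
            global_id = group_id * local_size + local_id
            kernel(global_id, group_id, local_id, *args)


def vector_add(global_id, group_id, local_id, a, b, out):
    """Kernel body: each work item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]


a = list(range(8))
b = [10] * 8
out = [0] * 8
run_ndrange(vector_add, 8, 4, a, b, out)  # 2 work groups of 4 work items
```

In this data-parallel pattern the kernel never loops over the data itself; the index space does, which is what lets a GPU spread work items across its processing elements.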

OpenCL (Open Computing Language) for GPU

• Intel OpenCL APIs;

• Define the platform;

• Applications run on a host;

• Work of devices from host;

• Com./App. queues in order;

• Kernel code for work item;

• Context for environment;

• Events for sync.

Intel Gen 8 EU, Sub-slice and Slice

Intel Gen 9

Thanks!
