Compact Descriptor for Visual Search, Deep Learning, Augmented Reality and 5G Network
Yu Huang
Sunnyvale, CA
Outline
• 5G Network is coming in 2020;
• Disruptive capabilities;
• Use cases and examples;
• Service and scenario requirements;
• Key technologies;
• Mobile Edge Computing;
• Fog Computing;
• Augmented Reality;
• AR technology and application;
• ARToolkit, Vuforia (PTC from Qualcomm), Total Immersion-D’Fusion, Wikitude, …
• Metaio (Apple) and 13th Lab (Facebook Oculus Rift);
• Google Glass and Google Project Tango;
• Microsoft Hololens;
• Pokemon Go;
• Meta and Magic Leap;
• CDVS (Compact Descriptor for Visual Search)
• Image feature extraction pipeline;
• Image matching and retrieval pipeline;
• What role does CDVS play in the 5G network?
• Visual search and visual SLAM (AR as use case).
• Deep Learning
• Feature representation and learning;
• Deep learning for visual search and SLAM in AR;
• Remarks
• Appendix
5G Network (5th Generation Mobile Technology) in 2020
• Higher peak and average data rates in down/up link
• Peaks of 10Gbps, average of 1Gbps
• Ultra-low latency/Ultra-dense network
• <1ms for certain services, <10ms as the norm, <5ms for end-to-end
• minimum of 50Mbps (10 Tbps/km2 to cover, e.g. a stadium with 30k devices), adapt data stream with (close to) “zero latency”
• Ultra reliable machine type communication
• Wherever, whenever and whatever
• Energy efficiency
• Doing much more with less
• Massive device connectivity/Massive machine type communication
• Internet of Things (IoT), Tactile Internet, …
5G Disruptive Capabilities
Source: https://5g-ppp.eu/wp-content/uploads/2015/02/5G-Vision-Brochure-v1.pdf
5G Use Case Families and Related Examples
Source: https://www.ngmn.org/uploads/media/NGMN_5G_White_Paper_V1_0.pdf
5G Service and Scenario Requirements
Source: “5G: A Technology Vision”, White paper from Huawei Technologies.
Intel Joins 5G World!
Key Technological Components in 5G
• Millimeter wave (mm-wave) radio access network (RAN);
• Massive multiple input and multiple output (MIMO);
• Software-defined networking (SDN);
• Network functions virtualization (NFV);
• Mobile edge computing (MEC);
• Fog Computing (FC);
• Device-to-Device (D2D) communication;
• Moving network (MN);
Mobile Edge Computing
• IT-based servers at the network edge, applying the concepts of cloud computing;
• Convergence of IT and Network infrastructure;
ETSI Version of Mobile Edge Computing
Mobile Edge Computing
• Mobile edge computing (MEC): a cloud server running at the edge of a mobile network and performing specific tasks that could not be achieved with traditional network infrastructure;
• Characterizations:
• On-Premises: The Edge is local;
• Proximity: Being close to the source of information;
• Low latency: considerably reduces latency;
• Location awareness: location based services, analytics;
• Network context information: local points-of-interest, businesses and events.
• MEC server looks like a mini-data center (MSR calls it “Cloudlet”).
Note: The MEC technical white paper was authored by top players in the 5G value chain, such as Huawei, IBM, Intel, Nokia, NTT and Vodafone.
Mobile Edge Computing
• Use cases:
• Active Device Location Tracking;
• AR Content Delivery;
• Video Analytics;
• RAN-aware Content Optimization;
• Distributed Content and DNS Caching;
• Application-aware Performance Optimization;
Source: “Mobile-edge computing - introductory technical white paper”, Sept. 2014.
Fog Computing
• Enable computing directly at the edge of the network;
• Fog computing: extension of the cloud computing paradigm from the core of network to the edge of the network;
• Gaming and real-time analytics;
• Cloudlet from MSR coincides with it;
• FC is a highly virtualized platform that provides computation, storage, and networking services between end devices and traditional cloud servers;
• IOx is an FC product from Cisco (shown in figure);
• Fog computing is a P2P architecture:
• Image processing, for example, can be partitioned and shared between multiple devices.
Source: Cisco. IOx overview. http://goo.gl/n2mfiw. 2014
Fog Computing
• Similar concepts: Mist computing (MC), Cloudlet, MEC and MCC;
• MCC (mobile cloud computing) tries to push the boundaries of mobile applications by offering centralized resources to augment mobile devices;
• Integration of cloud computing into the mobile environment.
• Fog computing is a combination of MCC and MEC;
• Requirement:
• High bandwidth and wireless penetration;
• Higher computation and affordable storage;
• Real time application and lower latency requirement;
• Characterization:
• Geographical distribution and location awareness;
• Very large number of nodes;
• Support for wireless access and mobility.
Fog Computing
Example of edge cloud and underlay network architecture
Cloud down to the edge and Fog up to the network
Source: Q Li et al. (Intel corp.), “Edge Cloud and Underlay Networks: Empowering 5G Cell-Less Wireless Architecture”, European Wireless 2014.
What is Augmented Reality?
• Augmented Reality (AR) is a variation of Virtual Environments (VE), or Virtual Reality (VR);
• AR allows the user to see the real world, with virtual objects superimposed upon or composited with the real world;
• In AR, 3-D virtual objects are integrated into a 3-D real environment in real time.
• AR can be thought of as the “middle ground” between VE (completely synthetic) and tele-presence (completely real), i.e. a form of Mixed Reality.
Applications of AR
• Medical and visualization;
• Manufacturing and repairing;
• Annotation and navigation;
• Robot path planning;
• Entertainment and education;
• Military and emergency.
AR Technology
• Most AR systems involve the use of head-mounted displays (HMD):
– There are two main categories of HMD-based AR systems:
• Optical See-Through Augmented Reality;
• Video See-Through Augmented Reality.
• Some AR systems involve the use of projectors or other display devices (mobile or desktop).
• Mobile AR is a meeting point between AR, ubiquitous computing and wearable computing.
Important Issues of AR
• Focus and contrast;
– Graphics can be rendered to simulate a limited depth-of-field (DOF);
– Everything must be clipped or compressed into the monitor’s dynamic range;
• Portability;
– AR systems will emphasize the ability to walk around outdoors, away from controlled environments;
– A mobile device is a suitable platform for AR systems.
• Tracking and registration:
– Objects in the real and virtual worlds have to be properly aligned with respect to each other;
– Hybrid method:
• GPS + multiple sensors + image/vision matching.
AR Principle
Marker Based Tracking: AR ToolKit
Vuforia’s AR SDK
• PTC purchased it from Qualcomm;
• Supports a variety of 2D and 3D target types, including marker-less Image Targets, singly trackable 3D Multi-Targets and cylindrical targets;
• Also supports Frame Markers, a form of fiducial marker;
• Includes localized occlusion detection using Virtual Buttons;
• Allows the user to pick an image at runtime through user-defined targets;
• Supports text (word) targets, which recognize and track textual elements;
• Supports Extended Tracking, which enables a continuous experience whether or not the target remains visible in the camera field of view;
• No IMU fusion yet.
Total Immersion - D’Fusion
• More UI based (D’Fusion Studio & D’Fusion CV), enabling the whole scenario to be built through the GUI;
• One bundled scenario works on both Android & iPhone;
• Tracks body movements for multiple marker tracking;
• Marker-less tracking using different sensors (GPS, compass, accelerometer, gyroscope, etc.) and tracking of 2D/3D objects;
• Supports natural feature tracking and face tracking, which allows robust recognition of faces, focusing on eye and mouth detection;
• Recognizes interactive zones on images through finger pointing;
• Barcodes can be recognized in both portrait and landscape mode.
AR in Wikitude
• Provides a direction indicator and location tracker;
• Combines geo-based and image recognition to provide hybrid tracking;
• Able to track up to 1000 image targets per dataset;
• Its image recognition engine is based on Natural Feature Tracking (NFT);
• Also uses marker tracking, which relies on artificial images, barcodes, etc.;
• Can track a target up to 20 meters away, and can snap content to the screen regardless of whether the target image is still visible;
• Face tracking and 3D object tracking features aren’t provided yet.
AR in Metaio: UnifEye
• New version: Unifeye Mobile;
• SDK works for Windows Mobile 6.1, iPhone and Symbian;
• 3D content can be visualized in context within live video as recorded by a cell phone's built-in camera;
• Layers 3D animations over 2D graphics on the cell phone's screen;
• Presents virtual highlights superimposed on real-life street imagery;
• Displays animated objects within a real/virtual game;
• Purchased by Apple.
AR in 13th Lab
• Acquired by Facebook's Oculus together with Nimble VR;
• Uses advanced computer vision through the PointCloud SDK;
• Capable of taking a 2D image and mapping it into a 3D environment;
• Also recognizes and tracks 2D images;
• Can detect/track a 2D image and then hand over to SLAM, which maps and tracks the environment around the image;
• Minecraft Reality maps and tracks the surrounding world using the camera, and places Minecraft worlds in reality;
• Learns 3D from a point cloud.
Virtual Image Insertion in Planar Region for AR
• Challenge: how to insert a contextually relevant image (what) at the right place (where) and the right time (when), with an attractive representation (how), into videos in a less intrusive way?
• Critical Problems:
– Scene understanding to find the right place;
– Relevant image selection for insertion;
– Dynamic registration for seamless real-virtual alignment;
– More attractive and less intrusive in real-virtual blending.
Virtual Image Insertion in Planar Region for AR
Flowchart of Virtual Image Insertion
Figure blocks: Input (at the start, set as Not Detected); Is Detected? (Yes/No); Initialization; Vanishing Point Estimation; Camera Calibration; Rectangular Planar Structure Extraction; Hypothesis Verification; Registration; Visual Tracking; Camera Prediction; Motion Filtering; Virtual Image Insertion; Color Harmonization; Set as Not Detected; Output.
Model based Markerless AR
Google Glass
• Monocular see-through multiplexed display
– 640 x 360 microprojector, 15 degree FOV
– 5 MP camera, gyro, accelerometer
• Curved Mirror
– off-axis projection
– curved mirrors in front of eye
– high distortion, small eye-box
• Waveguide
– use internal reflection
– unobstructed view of world
– large eye-box
• Waveguide techniques for thin see-through displays
– Wider FOV, enable AR applications
– Social acceptability
Wearable Cognitive Assistance with Google Glass
• Joint work of CMU & Intel labs;
• Cognitive assistive system “Gabriel”;
• First-person image capture and sensing with Google Glass;
• Remote processing for real-time scene interpretation;
• Wi-Fi and WAN;
• Supported cognitive engines
• Face/Object recognition;
• Augmented reality;
• OCR;
• Motion classifier & activity inference;
Project Tango: Google’s AR Platform
• NVIDIA Tegra K1 processor
• 4GB of RAM
• 128GB of storage
• 1080p display
• Stock Android 4.4
• WiFi, Bluetooth LE and 4G LTE
• 120 degree front camera
• 4MP Camera and Depth Sensor
• Motion-tracking Camera
• Accurate Pose Estimation
• Position (x,y,z)
• Pointing direction (i, j, k, rotation)
• Uses feature tracking and SLAM
• Accuracy: <1%
• Fuses inertial and visual odometry
• Significantly more accurate than using inertial sensors alone
• Wide angle lens, global shutter
• Tracking through fast movements
• Avoid rolling shutter artifacts
• GPU accelerated processing
Microsoft HoloLens
• Microsoft HoloLens is the first holographic computer running exclusively on Windows 10.
• It is completely untethered, i.e. no wires, phones or connection to a PC required.
• Microsoft HoloLens uses Augmented Reality (AR) technology that allows it to pin holograms in your physical environment.
• Windows 10 is the first platform to support Holographic Computing with APIs that enable gaze, gesture, voice and environmental understanding on an untethered device like HoloLens.
• Microsoft HoloLens brings high-definition holograms to life in the real world.
• As holograms, digital content will be as real as physical objects in the actual world.
Pokemon GO: Location-based Service (LBS) for AR
Meta’s AR
• 90-degree field of view.
• 2560 x 1440 display resolution.
• Designed for comfort, with everything below the eyebrows completely transparent and unobstructed so you can easily make eye contact with others.
• Meta 2 works while wearing eyeglasses and can be worn comfortably for hours under most circumstances.
• 720p front-facing camera.
• Sensor array for hand interactions and positional tracking.
• 4-speaker near-ear audio.
• Volume control.
• 9 foot cable for video, data, and power (HDMI Version 1.4b).
• Support for Windows applications.
• Mac support planned for this year.
• Meta operating environment.
• Meta 2 is a tethered device that requires a modern computer with Windows 8 or 10.
• Meta 2 Augmented Reality Headset
• See-through headset that displays holograms and digital content, along with a software development kit built on top of the most popular 3D engine in the world: Unity.
Meta’s AR
Guideline #1: You are The O.S.
Guideline #2: Touch to See.
Guideline #3: The Holographic Campfire.
Magic Leap: Mixed Reality?
Difference of HoloLens, Meta, Magic Leap
Filtering-based SLAM
• SLAM: a robot builds a map of an environment while using this map to compute its own location;
• A State Space Method (SSM): both observation and motion (state transition) models
– Observation model describes the probability of making an observation (landmarks) when the robot location/orientation and landmark locations (map) are known;
– Motion model is a probability distribution on state (location and orientation of the robot) transitions with applied control, i.e. a Markov process;
– By recursive Bayes estimation, update robot state and environment maps simultaneously;
• The map building problem is solved by fusing observations from different locations;
• The localization problem is to compute the robot location by estimating its state;
• Convergence: correlation between landmark estimates increases monotonically as more observations are made;
• Typical methods: Kalman filter/Particle filter;
• Robot relative location accuracy = localization accuracy with a given map.
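The predict/update cycle of recursive Bayes estimation can be illustrated with a scalar Kalman filter. This is only a minimal sketch: it tracks a 1-D robot position with a known map (localization only, not full SLAM), and the class name, function names and noise values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gaussian1D:
    """Belief over a scalar robot position: mean and variance."""
    mean: float
    var: float


def predict(belief: Gaussian1D, control: float, motion_var: float) -> Gaussian1D:
    """Motion model: shift the mean by the control input and grow the uncertainty."""
    return Gaussian1D(belief.mean + control, belief.var + motion_var)


def update(belief: Gaussian1D, measurement: float, meas_var: float) -> Gaussian1D:
    """Observation model: fuse a direct position measurement (Bayes rule for Gaussians)."""
    gain = belief.var / (belief.var + meas_var)  # Kalman gain
    mean = belief.mean + gain * (measurement - belief.mean)
    var = (1.0 - gain) * belief.var
    return Gaussian1D(mean, var)


def run_filter(belief, steps, motion_var=0.1, meas_var=0.5):
    """Apply one predict/update pair per (control, measurement) step."""
    for control, measurement in steps:
        belief = predict(belief, control, motion_var)
        belief = update(belief, measurement, meas_var)
    return belief
```

Each update shrinks the variance, which mirrors the convergence bullet above: more observations give a tighter estimate.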
Key frame-based SLAM
• Bootstrapping
• Compute an initial 3D map
• Mostly based on concepts from two-view geometry
• Normal mode
• Assumes a 3D map is available and incremental camera motion
• Track points and use PnP for camera pose estimation
• Recovery mode
• Assumes a 3D map is available, but tracking failed: no incremental motion
• Relocalize camera pose w.r.t. previously reconstructed map
• Key-Frame BA: keep subset of frames as keyframes
• State: 3D landmarks + camera poses for all key frames
• BA can refine state in a thread separate from tracking component
Image-Based Localization Pipeline
• Key Frame and its Camera Pose Stored;
• Frame id, image feature and camera pose.
• Extract Local Features;
• Feature original or coded.
• Establish 2D-2D or 3D-to-2D Matches;
• Single camera: 3D-to-2D;
• Stereo or depth camera: 2D-to-2D;
• Camera Pose Estimation.
• Refine by 3D point estimation;
• 2D features along with their depths.
MPEG CDVS – Motivations and Timeline
• Reduce load on wireless networks carrying visual search-related information;
• Ensure interoperability of visual search applications and databases;
• Enable hardware support for descriptor extraction and matching in mobile devices;
• Enable high level of performance of implementations conformant to the standard;
• Simplify design of descriptor extraction and matching for visual search applications.
MPEG CDVS - Requirements
• Robustness: accuracy
– verification >95%
– identification >90%
• Sufficiency: self-contained;
• Compactness: 10x feature compression;
• Scalability: web scale (database size >100 million)
• Speed: 30 fps at VGA (maximal size < 640); note: slower on a mobile CPU, 0.5 fps;
MPEG CDVS - Scopes
Instead of sending images, an application can send compact descriptors or even perform search locally.
• Descriptor extraction process needed to ensure interoperability.
• Bitstream of compact descriptors.
• Considered system architectures
• Server mainly performs CDVS indexing and query/retrieval, while clients (mobile devices, desktop, etc.) work differently:
• (a) Client only captures images and displays the retrieval results;
• (b) Client can extract image descriptors and also displays the retrieval results;
• (c) Client takes on more tasks, like local DB/cache-based image matching, and also displays the retrieval results;
Image Feature CDVS Extraction Pipeline
• Keypoint detection: ALP;
• SIFT descriptor;
• Feature selection;
• Local descriptor compression;
• Coordinate coding;
• Global descriptor aggregation;
Image Pairwise Matching and Retrieval Pipeline
• Global descriptor: top matches;
• Local descriptor: decoding->matching->geometric verification->localization;
• Location coding and descriptor coding.
MPEG CDVS - Proposals
• CE1: global descriptor: Residual Enhanced Visual Vector (REVV);
• SCFV (Scalable Compressed Fisher Vector);
• Robust Visual Descriptor (RVD);
• CE2: local descriptor compression: CHoG;
• Transform + Scalar Quantizer;
• Multi-stage Vector Quantizer;
• CE3: Location coding: context based coordinate coding;
• CE4: key-point detection: ALP;
• Block-based Frequency domain LoG;
• CE5: Local Descriptors: SIFT;
• CE6: retrieval pipeline: MBIT;
• CE7: feature selection: Naïve Bayesian learning based feature selection;
• CE8: pairwise matching pipeline: DISTRAT (local matching with geometric verification)
• Weighted Hamming distance (global matching).
CDVA: Compact Descriptor for Video Analysis
ALP
• SIFT Patent: DoG filtering to construct scale space!
• ALP (A Low-degree Polynomial): by Telecom Italia;
• Idea: scale space response modeled by a polynomial function; the coefficients are then estimated by LoG filtering at different scales:
• 1. Scale space: approximated by a polynomial; find local extrema of the polynomials, eliminate those that fall at the boundaries, output a list of candidates (x, y, σ);
• 2. Clean candidates: remove those either at edges with a larger ratio of curvatures, or with lower absolute values or curvatures;
• 3. Refinement of coordinates: approximate scale space by a polynomial;
• 4. Eliminate duplicates at octave boundaries;
• 5. Keep the remaining candidates.
D. G. Lowe: ”Distinctive image features from scale-invariant keypoints”, IJCV 2004, Patent No US 6,711,293.
Figure labels: LoG kernel; Gaussian kernel; K = 4
Assumption 1. LoG kernels are well approximated by linear combinations of a few LoG kernels at discrete scales or displacements;
Assumption 2. Coefficients of the linear combinations are smooth functions of scale or displacement, approximated by polynomials of low degree.
ALP
Scale space response as a polynomial in the displacement parameters (u, v); K = 9
Gaussian Scale Space Construction
• Gray-scale image only;
• Scale space Gaussian pyramid;
• Downscaled by a factor of two;
• Convolution with a 2D Gaussian filter;
• Use a 1D horizontal Gaussian filter only (transpose);
• CDVS TM: 1D horizontal Gaussian plus 1D vertical Gaussian;
• Save in RGBA OpenCL textures;
• OpenCL global memory access is too slow;
• Four gray pixels in one item;
• Reduce image width by 4: process 4 pixels per kernel instance;
• Only 5 (instead of 15) memory accesses to read the adjacent pixels.
• Memory access and parallelism.
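The "horizontal filter only, plus transpose" idea can be sketched on the CPU as follows. This is a pure-Python illustration, not the OpenCL kernel from the slides; the function names, the border clamping and the kernel radius are assumptions.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian taps for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Horizontal 1-D convolution; out-of-range pixels are clamped to the border."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        width = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                for k in range(-radius, radius + 1))
            for x in range(width)
        ])
    return out


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma, radius):
    """2-D Gaussian blur using only the horizontal pass:
    filter rows, transpose, filter rows again, transpose back."""
    kernel = gaussian_kernel_1d(sigma, radius)
    horizontal = convolve_rows(image, kernel)
    vertical = convolve_rows(transpose(horizontal), kernel)
    return transpose(vertical)
```

Reusing one horizontal routine keeps the memory access pattern identical in both passes, which is the property the slide relies on for the GPU version.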
Laplacian Operation
• Run after Gaussian filtering: LoG;
• Apply Laplacian operator to 4 gray pixels;
• Access the memory 5 times only;
• Run the kernel 1/4 as many times as the TM;
• Note: CDVS TM implements the Laplacian operator serially.
Scale Space Function Approximation
• The ALP (A Low-degree Polynomial) detector identifies interest points finding local extrema in the scale space by approximating the scale-space using polynomials;
• ALP approximates LoG filtering functions with polynomials of low degree;
• The coefficients of polynomial are obtained by computing weighted sums of the Laplacian-of-Gaussian (LoG) images;
• Two kernels: both run 4 pixels in parallel
• Compute the coefficients;
• Evaluate the roots of 1st derivative of the polynomial, and store min and max of polynomial values for each pixel;
• Only one memory access;
• Note: Extrema detection refined to sub-pixel, also in parallel.
if (isExtrema) then
    p = ComputePolynomial(A, B, C, D);
    deltaX = ComputeDeltaX(p);
    deltaY = ComputeDeltaY(p);
    if ((|deltaX| <= TH1) && (|deltaY| <= TH2)) then
        isKeypoint = true;
    end if
end if
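The sub-pixel test in the pseudocode above can be illustrated with a simpler stand-in: independent 1-D parabola fits along x and y instead of the ALP polynomial built from A, B, C, D. The function names and the threshold of 0.5 (playing the role of TH1 and TH2) are assumptions.

```python
def parabola_offset(left, center, right):
    """Offset of the vertex of the parabola through (-1, left), (0, center), (1, right)."""
    curvature = left - 2.0 * center + right
    if curvature == 0.0:
        return 0.0  # flat neighbourhood: no refinement possible
    return 0.5 * (left - right) / curvature


def refine_keypoint(response, x, y, max_offset=0.5):
    """Return the refined (x, y) if both offsets pass the threshold, else None.

    `response` is a 2-D list of filter responses; (x, y) must not lie on the border.
    """
    dx = parabola_offset(response[y][x - 1], response[y][x], response[y][x + 1])
    dy = parabola_offset(response[y - 1][x], response[y][x], response[y + 1][x])
    if abs(dx) <= max_offset and abs(dy) <= max_offset:
        return (x + dx, y + dy)
    return None
```

A candidate whose fitted extremum drifts too far from the sampled pixel is rejected, which is the role of the two threshold checks in the pseudocode.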
Orientation Assignment
• Dominant orientation assigned per keypoint;
• Computation of gradient magnitude and direction;
• One kernel;
• Computation of orientation histogram;
• Another kernel;
• Orientation histogram smoothing;
• Work-groups (at least one work-item);
• Shared memory for work-items per work-group;
• Buffer for 2048 x 36 (number of histogram bins) integer elements.
• Peak localization by 4 kernels in one work-group;
• 9 histogram bins per kernel.
• Quadratic interpolation for better accuracy.
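The orientation steps above (histogram, smoothing, peak localization with quadratic interpolation) can be sketched serially as follows. The 36 bins come from the slide; the function names, the 3-tap box smoothing and the bin-center convention are assumptions, and the sketch does not model the OpenCL work-group layout.

```python
import math

NUM_BINS = 36  # 10 degrees per bin, matching the 36-bin histogram on the slide


def orientation_histogram(gradients):
    """Accumulate gradient magnitudes into bins; gradients = [(magnitude, angle_rad), ...]."""
    hist = [0.0] * NUM_BINS
    two_pi = 2.0 * math.pi
    for magnitude, angle in gradients:
        index = int((angle % two_pi) / two_pi * NUM_BINS) % NUM_BINS
        hist[index] += magnitude
    return hist


def smooth(hist):
    """Circular 3-tap box smoothing (assumed kernel)."""
    n = len(hist)
    return [(hist[i - 1] + hist[i] + hist[(i + 1) % n]) / 3.0 for i in range(n)]


def dominant_orientation(hist):
    """Angle (radians) of the highest peak, refined by quadratic interpolation."""
    n = len(hist)
    peak = max(range(n), key=hist.__getitem__)
    left, center, right = hist[peak - 1], hist[peak], hist[(peak + 1) % n]
    curvature = left - 2.0 * center + right
    offset = 0.0 if curvature == 0.0 else 0.5 * (left - right) / curvature
    # Bin i covers [i, i + 1) * bin_width, so its center sits at i + 0.5.
    return ((peak + 0.5 + offset) * 2.0 * math.pi / n) % (2.0 * math.pi)
```

The interpolation step recovers orientations finer than the 10-degree bin width, which is why the slide lists it as an accuracy improvement.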
Feature Selection and Duplicate Elimination
• Store keypoints in OpenCL textures rather than CPU data structures, to reduce memory access latency;
• Number of keypoints per octave: 256 (max number of keypoints per row) x 8 = 2048;
• Keypoint: 4 (elements) x 4 = 16 floating-point numbers;
• Max size of textures: 1024x8;
• Feature selection (FS) based on relevance: computing matching probability (MP) and sorting;
• Create a histogram, put every feature in its specific interval, and then select only the most important part;
• Training data for Naïve Bayes learning;
• Consistent with various datasets.
• MP is computed by averaging training values of some of the scale space characteristics of a keypoint, its orientation and coordinates;
Feature Selection and Duplicate Elimination: Deferred FS
Feature Selection and Duplicate Elimination
• FS performed before OA without performance loss;
• Read the keypoint structure first;
• Compute MP not including orientation contribution;
• Make histogram, the same as in Deferred FS;
• Compute threshold, the same as in Deferred FS;
• Select feature, the same as in Deferred FS;
• Orientation assignment, the same as in Deferred FS;
• Update histogram with orientation info;
• Then compute threshold again;
• Finally select feature again.
K. Lee, S. Lee and W. Oh, “Accelerating Local Feature Extraction Using Two Stage Feature Selection and Partial Gradient Computation,” 2014
Feature Selection and Duplicate Elimination: Hastened FS
GPU’s Performance with OpenCL
• Mobile GPU;
• ALP keypoint detector -> 7 times speed-up;
• CDVS feature extraction -> 3 times speed-up;
• Samsung Galaxy Note 3
• 4-core ARM
• GPU Adreno 330
• Arndale Octa board
• 4-core ARM
• GPU Mali T628
CDVS Feature Extraction Time
• SIFT interest point detection and feature extraction made the biggest contribution;
• Global descriptors as complex as Interest Point Detection;
• Very fast in computing local descriptors and coordinate encoding.
GD: Global Descriptor LD: Local Descriptor
Transform + Scalar Quantizer
• Simple linear bin combinations in each subregion HoG, followed by coarse ternary scalar quantisation of the resultant values;
• The transform of each subregion relies on simple-to-compute bin relations about the HoG’s shape: (1)-(4);
• Divide relations into subsets A & B;
• Spatial neighboring processed similarly.
• Bitrate element selection: element elimination (1k~16k, seen in next slide);
• Scalar quantization and AAC (Adaptive Arithmetic Coding).
HoG bins of one subregion: h0 ... h7 (8 orientation bins, 45° apart).
Neighbouring (45°) bin relations:
(h0 - h1)/2, (h1 - h2)/2, (h2 - h3)/2, (h3 - h4)/2, (h4 - h5)/2, (h5 - h6)/2, (h6 - h7)/2, (h7 - h0)/2 (1)
Opposite (180°) bin relations:
(h0 - h4)/2, (h1 - h5)/2, (h2 - h6)/2, (h3 - h7)/2 (2)
“Line” relations (horiz. and vert. orient. bins):
((h0 + h4) - (h2 + h6))/4, ((h1 + h5) - (h3 + h7))/4 (3)
Relations of two “halves”:
((h0 + h2 + h4 + h6) - (h1 + h3 + h5 + h7))/8, ((h0 + h1 + h2 + h3) - (h4 + h5 + h6 + h7))/8 (4)
Assignment of sets A and B to the 4x4 subregions:
A B A B
B A B A
A B A B
B A B A
Subregion labels in the 4x4 grid:
h0 h1 h2 h3
h4 h5 h6 h7
h8 h9 h10 h11
h12 h13 h14 h15
Robust to coarse scalar quantization!
Set A (5)
v0 = (h2 - h6)/2
v1 = (h3 - h7)/2
v2 = (h0 - h1)/2
v3 = (h2 - h3)/2
v4 = (h4 - h5)/2
v5 = (h6 - h7)/2
v6 = ((h0 + h4) - (h2 + h6))/4
v7 = ((h0 + h2 + h4 + h6) - (h1 + h3 + h5 + h7))/8
Set B (6)
v0 = (h0 - h4)/2
v1 = (h1 - h5)/2
v2 = (h1 - h2)/2
v3 = (h3 - h4)/2
v4 = (h5 - h6)/2
v5 = (h7 - h0)/2
v6 = ((h1 + h5) - (h3 + h7))/4
v7 = ((h0 + h1 + h2 + h3) - (h4 + h5 + h6 + h7))/8
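The Set A and Set B relations (5) and (6) translate directly into code. The sketch below applies them over the 4x4 checkerboard of subregions and then quantizes to three levels; the function names and the quantization thresholds are hypothetical placeholders, since the slides do not give the actual thresholds.

```python
def transform_set_a(h):
    """Set A relations (5) for one 8-bin subregion histogram h[0..7]."""
    return [
        (h[2] - h[6]) / 2, (h[3] - h[7]) / 2,
        (h[0] - h[1]) / 2, (h[2] - h[3]) / 2,
        (h[4] - h[5]) / 2, (h[6] - h[7]) / 2,
        ((h[0] + h[4]) - (h[2] + h[6])) / 4,
        ((h[0] + h[2] + h[4] + h[6]) - (h[1] + h[3] + h[5] + h[7])) / 8,
    ]


def transform_set_b(h):
    """Set B relations (6) for one 8-bin subregion histogram h[0..7]."""
    return [
        (h[0] - h[4]) / 2, (h[1] - h[5]) / 2,
        (h[1] - h[2]) / 2, (h[3] - h[4]) / 2,
        (h[5] - h[6]) / 2, (h[7] - h[0]) / 2,
        ((h[1] + h[5]) - (h[3] + h[7])) / 4,
        ((h[0] + h[1] + h[2] + h[3]) - (h[4] + h[5] + h[6] + h[7])) / 8,
    ]


def ternary_quantize(values, low, high):
    """Map each value to -1, 0 or +1 (thresholds are assumed, not from the slides)."""
    return [-1 if v < low else (1 if v > high else 0) for v in values]


def encode_descriptor(subregions, low=-0.05, high=0.05):
    """subregions: 16 histograms of 8 bins, row-major over the 4x4 grid.

    Sets A and B alternate in a checkerboard, A in the top-left corner.
    """
    elements = []
    for index, hist in enumerate(subregions):
        row, col = divmod(index, 4)
        transform = transform_set_a if (row + col) % 2 == 0 else transform_set_b
        elements.extend(ternary_quantize(transform(hist), low, high))
    return elements
```

The result has 16 x 8 = 128 ternary elements, from which the element selection step would keep a subset depending on the target descriptor length.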
Descriptor element selection: selected elements appear in green. (a) 16K: 128, (b) 8K: 80, (c) 4K: 64, (d) 2K: 40, (e) 1K: 20 elements.
CFV, SCFV
PQ vs Spectral Hashing and Hamming Embedding
CFV, SCFV
• Global Descriptor Matching:
I. Using bitwise XOR and POPCNT to compute Hamming distances;
II. Reading the weights from a small look-up table.
D = 24 for 512B (in this case) or 32 for others
• Weights: learned off-line from matching/non-matching pairs in the training data:
• INRIA Holidays, Oxford Buildings and Caltech Pasadena buildings.
• Fast: generates the short list within 1 second from a 1 million image database.
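The two matching steps, XOR plus population count and a weight look-up indexed by Hamming distance, can be sketched as follows. Descriptors are modeled as lists of D-bit integers; the function names and any weight table passed in are hypothetical, because the real weights are learned off-line as the slide notes.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as integers (XOR + popcount)."""
    return bin(a ^ b).count("1")


def weighted_similarity(query, reference, weight_lut):
    """Sum the look-up weight for the Hamming distance of each descriptor block.

    query, reference: equal-length lists of D-bit integers.
    weight_lut: list of D + 1 weights, indexed by Hamming distance 0..D.
    """
    return sum(weight_lut[hamming(q, r)] for q, r in zip(query, reference))


def shortlist(query, database, weight_lut, k):
    """Return the ids of the k most similar database entries, best first."""
    scored = [(weighted_similarity(query, ref, weight_lut), image_id)
              for image_id, ref in database.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]
```

Because each comparison is a handful of integer operations and table reads, a linear scan over a large database stays cheap, which is consistent with the speed claim above.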
Geometric Verification Check: DISTRAT
• DISTance RATio coherence
Calculating the location geometric score: (a) features are matched according to the descriptor paths, (b) distances between features within each image are calculated, (c) log distance ratios (LDR) of corresponding pairs are calculated, and (d) a histogram of log distance ratios is formed. The maximum value of the histogram is the geometric similarity score.
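Steps (b) to (d) can be written down compactly. In this sketch the matched features are given as two equally ordered point lists; the function names, the histogram range and the bin count are assumptions, and the statistical test used by the full DISTRAT method is not included.

```python
import math
from itertools import combinations


def log_distance_ratios(points_a, points_b):
    """LDR for every pair of matches; points_a[i] is matched to points_b[i]."""
    ldrs = []
    for i, j in combinations(range(len(points_a)), 2):
        dist_a = math.dist(points_a[i], points_a[j])
        dist_b = math.dist(points_b[i], points_b[j])
        if dist_a > 0.0 and dist_b > 0.0:  # skip coincident points
            ldrs.append(math.log(dist_a / dist_b))
    return ldrs


def ldr_histogram(ldrs, low=-2.5, high=2.5, bins=25):
    """Histogram of LDR values; values outside [low, high) are ignored."""
    hist = [0] * bins
    width = (high - low) / bins
    for value in ldrs:
        if low <= value < high:
            hist[min(bins - 1, int((value - low) / width))] += 1
    return hist


def geometric_score(points_a, points_b):
    """Maximum histogram count, used as the geometric similarity score."""
    return max(ldr_histogram(log_distance_ratios(points_a, points_b)))
```

When the two images are related by a similarity transform, all pairs share one distance ratio and the histogram has a single tall peak; unrelated matches spread across many bins.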
Geometric Verification Check: DISTRAT
Figure labels: quantizer; LDR evidence; eigenvalue; number of inliers; match weight
Local Descriptor Matching
Match weight w: learned from the conditional probability of a correct match by DISTRAT
Location Histogram Coding
1. A core area is identified, and the external features are pruned out (note: applied to location coding only);
2. After the creation of the bit stream header and the encoding of the histogram count using arithmetic coding, the encoding of the histogram map starts from the center of the image, adopting the presented circular scanning;
3. Continues until the number of non-null elements in the image is reached;
4. Each element is encoded using a context built up on a sum-based basis;
5. Context trained on rescanned or original matrices.
Flowchart: Pruning of external features -> Create bit stream header -> Encode histogram count -> Start circular matrix scanning -> Encode one element (using Context) -> Has the last non-null element of the histogram map been encoded? (No: encode the next element; Yes: exit)
Homography Estimation for Localization
• Apply a short run of Random SAmple Consensus (RANSAC) for homographies to the estimated inlier set;
• RANSAC with a fixed number of generated hypotheses and a fixed threshold for inlier inclusion;
• A small number of RANSAC iterations is sufficient: 10;
• Choose bi-directionally matched interest point pairs;
• Direct Linear Transform (DLT) algorithm requires four or more matching point pairs as input;
• The DLT algorithm requires a normalization pass applied to both sets of points.
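The normalization pass is sketched below under the assumption that it is the usual Hartley-style conditioning: translate the points so that their centroid is at the origin and scale them so that the mean distance from the origin is sqrt(2). The function name and return format are illustrative only.

```python
import math


def normalize_points(points):
    """Condition 2-D points for DLT.

    Returns (normalized_points, transform), where transform is the 3x3 matrix
    (as nested lists) that maps homogeneous input points to the normalized ones.
    """
    count = len(points)
    centroid_x = sum(x for x, _ in points) / count
    centroid_y = sum(y for _, y in points) / count
    mean_dist = sum(math.hypot(x - centroid_x, y - centroid_y)
                    for x, y in points) / count
    scale = math.sqrt(2.0) / mean_dist if mean_dist > 0.0 else 1.0
    transform = [
        [scale, 0.0, -scale * centroid_x],
        [0.0, scale, -scale * centroid_y],
        [0.0, 0.0, 1.0],
    ]
    normalized = [((x - centroid_x) * scale, (y - centroid_y) * scale)
                  for x, y in points]
    return normalized, transform
```

The same function would be applied to both point sets; with transforms T_a and T_b and a homography H_n estimated on the normalized points, the homography in pixel coordinates is inv(T_b) * H_n * T_a.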
MBIT Indexing
• Multi-Block Index Table;
• To reduce the range of Hamming-distance based similarity comparison of candidate images, eliminating the exhaustive search in the first-stage retrieval;
• Partition each global descriptor SCFV into multiple blocks and set up an index table for each block, following an inverted table structure;
• MBIT searching is to generate a shortlist of candidate images for re-ranking;
• 1. weighted voting to generate a set of candidates;
• An optimal threshold for conflict is learned off-line;
• 2. exhaustive search by Hamming distance.
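A toy version of the two-stage search might look like this. Global descriptors are modeled as plain integers split into fixed-width blocks; the class name, the unit vote weights and the `min_votes` threshold are assumptions standing in for the weighted voting and the off-line learned threshold.

```python
from collections import defaultdict


class MultiBlockIndex:
    """One inverted table per descriptor block, keyed by the block's bit pattern."""

    def __init__(self, num_blocks, block_bits):
        self.num_blocks = num_blocks
        self.block_bits = block_bits
        self.tables = [defaultdict(list) for _ in range(num_blocks)]
        self.descriptors = {}

    def _blocks(self, descriptor):
        """Split an integer descriptor into num_blocks values of block_bits bits."""
        mask = (1 << self.block_bits) - 1
        return [(descriptor >> (i * self.block_bits)) & mask
                for i in range(self.num_blocks)]

    def add(self, image_id, descriptor):
        """Register an image under each of its block values."""
        self.descriptors[image_id] = descriptor
        for table, block in zip(self.tables, self._blocks(descriptor)):
            table[block].append(image_id)

    def query(self, descriptor, min_votes, k):
        """Stage 1: vote per matching block. Stage 2: rank candidates by Hamming distance."""
        votes = defaultdict(int)
        for table, block in zip(self.tables, self._blocks(descriptor)):
            for image_id in table.get(block, ()):
                votes[image_id] += 1  # unit weight per matching block (assumed)
        candidates = [i for i, v in votes.items() if v >= min_votes]
        candidates.sort(
            key=lambda i: bin(self.descriptors[i] ^ descriptor).count("1"))
        return candidates[:k]
```

Only images sharing at least `min_votes` identical blocks with the query reach the exhaustive Hamming comparison, which is how the index narrows the search range.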
MBIT Indexing
Illustration of the MBIT indexing structure
What Role Can CDVS Play in the 5G Network?
• Image-based Relocalization;
• Feature matching for location recognition;
• Failure recovery (loop closure detection);
• Image-based content search and targeting;
• Object or scene search;
• Location finding;
• Targeted ads;
• Video Analytics;
• Action or event search (CDVS -> CDVA).
Figure labels: Fog Computing; Cloud; Edge Computing; Cloudlet; CDVS-based Feature Extraction, Image Matching and Image Retrieval
CDVS-based Photo Search at WeChat
• Rapid-Rich Object Search (ROSE) Lab of NTU and Peking University;
• APIs of “Weixin Chat Smart Platform” for WeChat, Tencent:
– CDVS-based image recognition/search (the other is audio recognition/search);
• A free wine shopping assistant app to recognize wine labels;
• A travel assistant app for users to access tourism information around a landmark;
• Images (posters, advertisements, printed materials, covers of goods, etc.) uploaded to the cloud to be indexed;
• Apple’s iOS and Google’s Android;
JiuKaopu
UGuide
Loop Detection using CDVS in Robotic Navigation
• Telecom Italia’s recent work;
• Applied for loop-detection in robotic navigation (keyframe-based SLAM);
• A robot works in the indoor environment using a single down-facing camera;
• Landmark identification (global + local);
• Visual odometry (local only, very fast);
• Direct feature matching is still expensive in bandwidth cost and not low in delay;
• CDVS can achieve trade-offs between distinctiveness & storage requirements;
Source: http://jol.telecomitalia.com/jolvisible/cdvs-nella-robotica/?lang=en
CDVS for Visual Navigation in Drone: MAV & UAV
• MAV (Micro Aerial Vehicle);
• UAV (Unmanned Aerial Vehicle);
• SLAM for drone navigation (GPS denied);
• Key frame-based SLAM;
• PTAM: parallel tracking and mapping;
• What can CDVS do?
• Location/Landmark recognition;
• Loop closure as a special case;
• Visual odometry;
• Feature matching.
Reduce Bandwidth Cost and Latency
Avoid Missing Targeted Position
• A drone cannot carry a heavy LiDAR;
• The location database is stored on the server;
• Fast image search and matching are needed for immediate drone localization (esp. when the drone is flying at high speed);
• Data transmission delay between server and drone should be ultra-low for “zero latency”.
CDVS for Visual Navigation in Drone: MAV & UAV
• Image transmission for image-based localization:
– Camera 720p @ 30fps, H.264 compression -> 10Mbps;
• CDVS transmission:
– Feature compression: 2 KB per image is acceptable, 4 KB works well;
– Localization at 10 Hz (one 4 KB descriptor every 100 milliseconds) -> about 0.32 Mbps in total;
• Latency comparison: replace video-based transmission with CDVS-based transmission:
– Data rate reduced to 320/10240 = 3.125% (a 32-fold reduction).
• 5G network bandwidth > 50 Mbps (assume 80 Mbps for the drone’s navigation);
• A DARPA drone can fly at 45 miles/h (20 meters/s) in a wide-open field, slower in urban areas and faster in open space (Sony’s drone flies up to 102 miles/h);
– Distance error caused by transmission delay, CDVS vs. video compression-based: 0.078 meters vs. 2.5 meters.
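The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is one reading of the slide, not a stated derivation: it assumes 4 KB descriptors at 10 Hz, 1 Mbit = 1024 × 1024 bits, and a delay model in which one second of data is pushed through the assumed 80 Mbps link while the drone keeps flying at 20 m/s. The function names are illustrative.

```python
def stream_mbps(bytes_per_message, rate_hz):
    """Data rate in Mbps, taking 1 Mbit = 1024 * 1024 bits (assumption)."""
    return bytes_per_message * 8 * rate_hz / (1024 * 1024)

def distance_error_m(stream_rate_mbps, link_mbps, speed_m_per_s):
    """Distance flown while one second of stream data crosses the link."""
    delay_s = stream_rate_mbps / link_mbps
    return speed_m_per_s * delay_s

cdvs_mbps = stream_mbps(4 * 1024, 10)   # 4 KB descriptor, 10 Hz localization
video_mbps = 10.0                        # 720p @ 30fps, H.264 (from the slide)
link_mbps = 80.0                         # assumed 5G share for navigation
speed = 20.0                             # meters/s

print(f"CDVS rate:   {cdvs_mbps:.4f} Mbps")
print(f"Ratio:       {cdvs_mbps / video_mbps:.5f}")
print(f"CDVS error:  {distance_error_m(cdvs_mbps, link_mbps, speed):.3f} m")
print(f"Video error: {distance_error_m(video_mbps, link_mbps, speed):.3f} m")
```

Under these assumptions the CDVS stream is 0.3125 Mbps (rounded to 0.32 on the slide), the ratio is 3.125%, and the distance errors come out at about 0.078 m and 2.5 m.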
CDVS for Visual Navigation in Drone: MAV & UAV
AT&T and Intel want drones connected to an LTE network.
100 dancing Spaxels-like drones with Intel RealSense.
Intelligent drone follows with Intel RealSense.
Source: A. Majdik, D. Verda, Y. Albers-Schoenberg, D. Scaramuzza “Air-ground Matching: Appearance-based GPS-denied Urban Localization of Micro Aerial Vehicles”, Journal of Field Robotics, 2015.
High-Speed Autonomous Obstacle Avoidance with Pushbroom Stereo
• A small autonomous unmanned aerial vehicle capable of high-speed flight through complex natural environments;
• Using only onboard computation, it performs obstacle detection, planning, and feedback control in real time at 14 m/s (31 MPH);
• Pushbroom stereo detects obstacles at 120 frames/s without overburdening processors;
• Model-based planning and control techniques allow it to track precise trajectories that avoid obstacles identified by the vision system.
High-Speed Autonomous Obstacle Avoidance with Pushbroom Stereo
Online planning: An obstacle is deemed to be in the path if the minimum distance between the trajectory and the point cloud is below a threshold
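The online-planning test above can be sketched as a brute-force check. This is a minimal illustration that assumes the trajectory is available as a list of sampled 3D waypoints and the point cloud as a list of 3D points; the function names and the threshold values are hypothetical.

```python
import math

def min_distance(trajectory, point_cloud):
    """Smallest Euclidean distance between any waypoint and any cloud point."""
    return min(math.dist(p, q) for p in trajectory for q in point_cloud)

def obstacle_in_path(trajectory, point_cloud, threshold):
    """True if the point cloud comes closer to the trajectory than `threshold`."""
    if not trajectory or not point_cloud:
        return False
    return min_distance(trajectory, point_cloud) < threshold

trajectory = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
cloud = [(1.0, 0.3, 1.0), (5.0, 5.0, 5.0)]
print(obstacle_in_path(trajectory, cloud, threshold=0.5))
```

The check costs one distance evaluation per waypoint and cloud point pair; a real-time system would likely use a spatial index, which this sketch leaves out.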
High-Speed Autonomous Obstacle Avoidance with Pushbroom Stereo
The red voxels show detected obstacles and the green squares show the planned trajectory as it is being executed.
Obstacle Detection and Navigation Planning for Autonomous MAVs
• Obstacle detection and real-time planning of collision-free trajectories are key for the fully autonomous operation of micro aerial vehicles in restricted environments.
• A multimodal sensor setup for omni-directional obstacle perception consisting of a 3D laser scanner, two stereo camera pairs, and ultrasonic distance sensors.
• Detected obstacles are aggregated in egocentric local multi-resolution grid maps.
• Generate trajectories in a multi-layered approach: from mission planning to global and local trajectory planning, to reactive obstacle avoidance.
• Obstacle perception: construct MAV-centric obstacle maps first;
– A hybrid local multiresolution map that represents both occupancy information and the individual distance measurements.
– The most recent measurements are stored in ring buffers within grid cells that increase in size with distance from the robot’s center.
– A high resolution in the close proximity to the sensor and a lower resolution far away from the robot, which correlates with the sensor’s characteristics in relative distance accuracy and measurement density.
Obstacle Detection and Navigation Planning for Autonomous MAVs
• For each measurement and the corresponding 3D point, the individual cell of the map is marked as occupied;
– To compensate for the sensor’s motion during scan acquisition, incorporate a visual odometry estimate from two pairs of wide-angle stereo cameras.
– Register consecutive 3D scans by matching Gaussian point statistics in grid cells (surfels) between local multi-resolution grid maps;
– Fuse the measurements in an occupancy grid maintaining occupancy probabilities.
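One common way to maintain occupancy probabilities is a log-odds update per cell. The slide does not say which update rule is used, so the sketch below is an assumption: a standard binary Bayes filter in log-odds form with a hypothetical inverse sensor model (0.7 for a hit, 0.3 for a miss).

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def update_log_odds(l_prev, p_measurement, p_prior=0.5):
    """Fuse one measurement into a cell's log-odds occupancy."""
    return l_prev + logit(p_measurement) - logit(p_prior)

def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))

cell = 0.0                          # prior: 0.5 occupied
cell = update_log_odds(cell, 0.7)   # first hit
cell = update_log_odds(cell, 0.7)   # second hit
print(round(probability(cell), 4))
```

Repeated hits push the probability toward 1 and a later miss pulls it back, which lets the grid absorb noisy or conflicting measurements.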
The obstacle is circled in blue; the MAV position is circled in red.
Obstacle Detection and Navigation Planning for Autonomous MAVs
• To reduce the planning complexity, divide the overall planning problem into multiple problems with different levels of abstraction;
• A hierarchical control architecture for MAV, with slower deliberative planners that solve complex path and mission planning problems on the upper layers and high-frequency reactive controllers on the lower layers.
Obstacle Detection and Navigation Planning for Autonomous MAVs
What is Deep Learning? • Representation learning attempts to automatically learn good features or representations;
• Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);
• Become effective via unsupervised pre-training + supervised fine tuning;
– Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.
• Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);
• Semi-supervised: structure of manifold assumption;
– labeled data is scarce and unlabeled data is abundant.
Deep Feature Learning
• Hand-crafted features:
• Needs expert knowledge
• Requires time-consuming hand-tuning
• (Arguably) one limiting factor of computer vision systems
• Key idea of feature learning:
• Learn statistical structure or correlation from unlabeled data
• The learned representations used as features in supervised and semi-supervised settings
• Hierarchical feature learning
• Deep architectures can be representationally efficient.
• Natural progression from the low level to the high level structures.
• Can share the lower-level representations for multiple tasks in computer vision.
Feature Learning Architectures
[Figure: feature learning pipeline. Pixels / Features -> Filtering with Dictionary (patch/tiled/convolutional) + Non-linearity -> Spatial/Feature (Sum or Max) -> Normalization between feature responses -> Features. Other labels: Local Contrast Normalization (Subtractive & Divisive); (Group) Sparsity; Max / Softmax; Not an exact separation.]
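The stages named in the figure labels (filtering with a dictionary plus a non-linearity, max aggregation over space, normalization between feature responses) can be sketched as one layer in NumPy. The filter values, the ReLU non-linearity, and the exact normalization formula are assumptions for illustration; the slide does not fix them.

```python
import numpy as np

def filter_bank(image, filters):
    """Valid 2-D correlation with each k x k dictionary filter, then ReLU."""
    k = filters.shape[1]
    h, w = image.shape
    out = np.empty((len(filters), h - k + 1, w - k + 1))
    for f, kernel in enumerate(filters):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[f, i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return np.maximum(out, 0.0)

def max_pool(maps, size=2):
    """Non-overlapping spatial max over size x size windows."""
    f, h, w = maps.shape
    h2, w2 = h // size, w // size
    trimmed = maps[:, :h2 * size, :w2 * size]
    return trimmed.reshape(f, h2, size, w2, size).max(axis=(2, 4))

def normalize_across_features(maps, eps=1e-8):
    """Subtractive then divisive normalization across feature responses."""
    centered = maps - maps.mean(axis=0, keepdims=True)
    norm = np.sqrt((centered ** 2).mean(axis=0, keepdims=True))
    return centered / (norm + eps)

def feature_layer(image, filters):
    """One layer: filter + non-linearity -> pooling -> normalization."""
    return normalize_across_features(max_pool(filter_bank(image, filters)))
```

Stacking several such layers, each consuming the previous layer's output maps, gives the low-level to high-level progression described on the preceding slide.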
Deep Learning for SLAM
• Feature extraction and matching;
• Loop closure detection;
• Image-based re-localization;
• Stereo matching;
• Optic flow estimation;
• Geometric transform estimation;
• End-to-end learning.
Deep Learning for SLAM
Optic flow estimation
Deep Learning for SLAM
Visual Tracking
Deep Learning for Image/Video Search
Remarks • 5G network provides low/ultra-low latency and high bandwidth via fog computing (FC) and mobile edge computing (MEC);
• Augmented reality needs registration and recognition;
• CDVS generates compact descriptors for images;
• CDVA will extend CDVS to video analysis;
• In 5G network, CDVS can support real-time and fast use cases:
• Augmented reality;
• Content search and targeting;
• Real-time video analytics;
• Deep learning can serve the same 5G use cases as CDVS.
Appendix
C2TAM: SLAM in Cloud • Map optimization moved to cloud;
• Latency can be reduced by 5G Network;
• Data flow:
• Server to client: map;
• Client to server;
• Modes of operation:
• Map building and storage;
• Relocalization;
• Map extension;
• Concurrent mapping;
Source: L Riazuelo, J Civera, J. M. Montiel, “C2TAM: A Cloud Framework for Cooperative Tracking and Mapping”, RAS, 2014.
RoboEarth • Cloud Robotics infrastructure to close the loop between robot and cloud;
• WWW database: software components, maps for navigation, task knowledge, and object recognition models;
• The RoboEarth Cloud Engine makes any computation available to robots;
• The Cloud Engine’s environments provide high-bandwidth access to the RoboEarth knowledge repository, enabling robots to benefit from the experience of other robots.
Source: http://roboearth.org/
OpenCL (Open Computing Language) for GPU
• Platform model:
• Host->Compute Device->Compute Unit->Processing Element;
• Memory model: Host memory
• Global Memory, Caches, Local Memory, Private Memory;
• Program structure:
• Kernels (parallel), host (serial) program;
• Execution model:
• Work group, work item;
• Synchronization in the group;
• Data/task parallelism;
• Contexts, queues, events;
• Data I/O structure: buffer
• Image2D of Texturing;
• RGBA for grayscale data;
• Interpolation, border handling, caching.
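The work group / work item decomposition can be illustrated without a GPU. The sketch below is a pure-Python simulation, not OpenCL code: it enumerates a one-dimensional index space serially and relies on the OpenCL relation global id = group id × local size + local id (zero global offset). The helper names are hypothetical, and requiring the global size to be a multiple of the local size is a simplification of this sketch.

```python
def enumerate_work_items(global_size, local_size):
    """List (group_id, local_id, global_id) for a 1-D index space."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            items.append((group_id, local_id, global_id))
    return items

def run_kernel(kernel, global_size, local_size, *buffers):
    """Host-side stand-in: call the kernel once per work item (serially)."""
    for _, _, global_id in enumerate_work_items(global_size, local_size):
        kernel(global_id, *buffers)

def vector_add(gid, a, b, out):
    """Kernel body: each work item handles exactly one element."""
    out[gid] = a[gid] + b[gid]

a = list(range(8))
b = [10] * 8
out = [0] * 8
run_kernel(vector_add, 8, 4, a, b, out)
print(out)
```

On a real device the work items of a group run in parallel on one compute unit and can synchronize with each other, which this serial loop does not model.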
OpenCL (Open Computing Language) for GPU
• Intel OpenCL APIs;
• Define the platform;
• Applications run on a host;
• Work of devices from host;
• Com./App. queues in order;
• Kernel code for work item;
• Context for environment;
• Events for sync.
Intel Gen 8 EU, Sub-slice and Slice
Intel Gen 9
Thanks!