
Applications of the VOLA Format for 3D Data Knowledge Discovery

Jonathan Byrne, Sam Caulfield, Léonie Buckley, Xiaofan Xu, Dexmont Pena, Gary Baugh and David Moloney
Computer Vision and Machine Learning Group, Movidius / Intel

Email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—VOLA is a compact data structure that unifies computer vision and 3D rendering and allows for the rapid calculation of connected components, per-voxel census/accounting, CNN inference, path planning and obstacle avoidance. Using a hierarchical bit array format allows it to run efficiently on embedded systems and maximize the level of data compression. The proposed format allows massive scale volumetric data to be used in embedded applications where it would be inconceivable to utilize point clouds due to memory constraints. Furthermore, geographical and qualitative data is embedded in the file structure to allow it to be used in place of standard point cloud formats. This work examines the reduction in file size when encoding 3D data using the VOLA format and finds that it is an order of magnitude smaller than the current binary standard for point cloud data.

Index Terms—Point Clouds, Embedded Systems, Auralisation, Deep Learning, GIS.

I. INTRODUCTION

The worlds of computer vision and graphics, although separate, are slowly being merged in the field of robotics. Computer vision takes input from systems such as light detection and ranging (LIDAR), structured light or camera systems and generates point clouds or depth maps of the environment. This data must then be represented internally for the environment to be interpreted correctly. Unfortunately the amount of data generated by modern sensors quickly becomes too large for embedded systems. An example of the amount of memory required by dense representations is SLAMbench (KFusion), which requires 512 MiB to represent a 5 m³ volume with 1 cm accuracy [1]. A terrestrial LIDAR scanner generates a million unique points per second [2] and an hour-long aerial survey can generate upwards of a billion unique points.

The result of having such vast quantities of data is that it soon becomes impossible to process, let alone visualize, the data on all but the most powerful systems. Consequently it is almost never used directly. It is simplified by decimation, flattened into a digital elevation model (DEM), or meshed using a technique such as Delaunay triangulation or Poisson reconstruction.

The original intention of VOLA was to develop a format that enabled an embedded system to process point cloud data into an internalized 3D model in order to easily navigate the environment. It has since been applied to numerous applications, such as ray-casting, deep learning, auralisation, machine learning and 3D mapping. VOLA has shown itself to be a compact yet versatile format. This paper gives a summary of the different application areas and the results obtained.

    II. RELATED RESEARCH

There exist several techniques for organizing point cloud data and converting it to a solid geometry. Point clouds are essentially a list of coordinates, with each line containing positional information as well as colour, intensity, number of returns and other attributes. Although the list can be sorted using the coordinate values, normally a spatial partitioning algorithm is applied to facilitate searching and sorting the data. Commonly used approaches are the Octree [3] and the KD-Tree [4].

Octrees are based in three-dimensional space and so they naturally lend themselves to 3D visualization. There are examples where the Octree itself is used for visualizing 3D data, such as the Octomap [5]. Octrees are normally composed of pointers to locations in memory, which makes it difficult to save the structure as a binary. One notable exception is the DMGOctree [6], which uses a binary encoding for the position of the point in the octree. Three bits are used as a unique identifier for each level of the Octree. The DMGOctree uses a 32 or 64 bit encoding for each point to indicate the location in the tree to a depth of 10 or 20 respectively.
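To illustrate this style of encoding (a minimal sketch, not the DMGOctree implementation itself), the following C function packs an octree path into a single integer using three bits per level, so a 64 bit code can hold a path of depth 20:

```c
#include <stdint.h>

/* Illustrative sketch of a 3-bits-per-level octree path code.
 * path[i] is the child index (0-7) chosen at level i. */
uint64_t encode_octree_path(const uint8_t *path, int depth)
{
    uint64_t code = 0;
    for (int i = 0; i < depth; i++) {
        code = (code << 3) | (uint64_t)(path[i] & 0x7);
    }
    return code;
}
```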

Another technique for solidifying and simplifying a point cloud is to generate a surface that encloses the points. A commonly used approach that locally fits triangles to a set of points is Delaunay triangulation [7]. It maximizes the minimum angle over all angles in the triangulation. Triangular irregular networks (TIN) [8] are extensively used in Geographical Information Systems (GIS) and are based on Delaunay triangulation. One issue with that approach is that noise and overlapping points can cause the algorithm to produce spurious surfaces.

A more modern and accurate meshing algorithm is Poisson surface reconstruction [9]. The space is hierarchically partitioned and information on the orientation of the points is used to generate a 3D model. It has been shown to generate accurate models and it is able to handle noise due to the combination of global and local point information. Poisson reconstruction will always output a watertight mesh, but this can be problematic when there are gaps in the data. In an attempt to fill areas with little information, assumptions are made about the shape, which can lead to significant distortions. There are also problems with small, pointed surface features, which tend to be rounded off or removed by the meshing algorithm.



                               VOLA                 Octree
Implementation                 Bit Sequence         Pointer Based
Traversal                      Modular Arithmetic   Pointer
Variable Depth                 Yes                  Yes
Dense Search Complexity        O(1)                 O(h)
Sparse Search Complexity       O(h)                 O(h)
Embedded System Support        Yes                  No
Look Up Table (LUT) Support    Yes                  No
Easily Save to File            Yes                  No
File Structure                 Implicit             Explicit
Cacheable                      Yes                  No
Hierarchical Memory Structure  No                   Yes

    TABLE I: A comparison between VOLA and Octrees.

Finally, there are volumetric techniques for encoding point clouds. Traditionally used for rasterizing 3D data for rendering [10], the data is fitted to a 3D grid and the occupancy of a point is represented using a volumetric element, or "voxel". Voxels allow the data to be quickly searched and traversed due to being fitted to a grid. While this simplifies the data and may merge many points into a single voxel, each point will have a representative voxel. Unlike meshing algorithms, voxels will not leave out features, but conversely they may be more sensitive to noise. The primary issue with voxel representations is that they encode everything, including open space. This means that there is a cubic relationship between the number of voxels that need to be stored and the volume resolution. For example, if the resolution is doubled then the memory requirements increase by a factor of 8.

VOLA combines the hierarchical structure of the Octree with traditional volumetric approaches, enabling it to encode only occupied voxels. The approach is described in detail below.

    III. THE VOLA FORMAT

VOLA is unique in that it combines the benefits of partitioning algorithms with a minimal voxel format. It hierarchically encodes 3D data using modular arithmetic and bit counting operations applied to a bit array. The simplicity of this approach means that it is highly compact and can be run on hardware with simple instruction sets. The choice of a 64 bit integer as the minimum unit of computation means that modern processor operations are already optimized to handle the format. While octree formats either need to be fully dense to be serialized or require bespoke serialization code for sparse data, the VOLA bit array is immediately readable without header information.

VOLA is built on the concept of hierarchically defining occupied space using "one bit per voxel" within a standard unsigned 64 bit integer. The one-dimensional bit array that makes up the integer value is mapped to three-dimensional space using modular arithmetic. The bounding box containing the points is divided into 64 cells; if there are points contained within a cell the bit is set to 1, otherwise it is set to zero. The result of the first division is shown in Figure 1.
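A minimal sketch of this mapping in C (the 4 × 4 × 4 cell layout follows from the 64-cell division; the axis ordering is an illustrative assumption):

```c
#include <stdint.h>

/* One bit per voxel: a 4x4x4 block of cells packed into a uint64_t.
 * Modular arithmetic maps the 3D cell coordinate to a 1D bit index. */
static inline int cell_bit_index(int x, int y, int z)
{
    return x + 4 * y + 16 * z; /* x, y, z in [0, 3] */
}

static inline void set_cell(uint64_t *block, int x, int y, int z)
{
    *block |= (uint64_t)1 << cell_bit_index(x, y, z);
}

static inline int cell_occupied(uint64_t block, int x, int y, int z)
{
    return (int)((block >> cell_bit_index(x, y, z)) & 1);
}
```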

For the next level each occupied cell is assigned an additional 64 bit integer and the space is further subdivided into 64 cells. Any unoccupied cells on the upper levels are ignored, allowing each 64 bit integer to encode only occupied space. The bits are again set based on occupancy and appended to the bit array.

Fig. 1: Tree depth one: the space is subdivided into 64 cells. The occupied cells are shown in green.

Fig. 2: Tree depth two. Each occupied cell is subdivided into 64 smaller cells.


Fig. 3: Tree depth three. The process is repeated until a sufficiently granular resolution is obtained.

    Fig. 4: Depth 1.

The number of integers in each level can be computed by summing the number of occupied bits in the previous level. The resolution increases fourfold with each additional level, as shown in Figure 2.

The process is repeated, with each level increasing the resolution of the representation by four, until a resolution suitable for the data is reached. This depends on the resolution, or points per meter, of the data itself.
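Because each set bit in one level spawns exactly one 64 bit integer in the next, the size of each level can be derived directly from the bit array with a population count. A minimal sketch, assuming a level is stored as a contiguous array of 64 bit words and a GCC/Clang popcount builtin:

```c
#include <stdint.h>
#include <stddef.h>

/* The number of 64-bit blocks in the next level equals the number
 * of set bits (occupied cells) across the current level. */
size_t next_level_block_count(const uint64_t *level, size_t nblocks)
{
    size_t count = 0;
    for (size_t i = 0; i < nblocks; i++) {
        count += (size_t)__builtin_popcountll(level[i]);
    }
    return count;
}
```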

    A. Two Bits Per Voxel

While the basic VOLA format can give occupancy information in a 3D environment, it cannot store information about the voxels. A backward compatible variation of the format was adopted in order to add this functionality. The VOLA model is essentially mirrored so that each 64 bit block has an additional 64 bit block appended to it. The additional block contains the information related to that volume. The information resolution is less than the voxel resolution, but the additional overhead only equates to an additional bit per voxel.

    Fig. 5: Depth 2.

    Fig. 6: Depth 3.

    Fig. 7: Depth 4.


The result is that the file size doubles, but as it is already a fraction of the size of the original point cloud, the difference is negligible. The 64 bits can be adapted to contain a variety of payloads, such as material properties, geographical information and colour information.
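As a sketch of how the mirrored layout might be addressed (the interleaved arrangement and the meaning of the payload word are illustrative assumptions; the paper does not fix a field layout):

```c
#include <stdint.h>

/* Two bits per voxel: each occupancy block is paired with a 64-bit
 * payload block holding per-volume information. */
typedef struct {
    uint64_t occupancy; /* one bit per voxel, as in basic VOLA */
    uint64_t payload;   /* e.g. material, colour or height bits */
} vola_block2;

/* Read the payload bit that corresponds to a voxel's bit index. */
static inline int payload_bit(const vola_block2 *b, int bit)
{
    return (int)((b->payload >> bit) & 1);
}
```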

This section has described the basic implementation of the VOLA format and the adapted format for embedding information about the voxels. The following sections describe the applications that have been built upon this work.

IV. AUDIO

Auralisation describes the rendering of audible (imaginary) sound fields [11]. For the purpose of JIT (Just In Time) auralisation, the environment being modelled is represented using VOLA. The response being modelled depends on two main factors: how far the sound travels between source and observer, and by how much it is attenuated.

The distance travelled indicates how much the response will be delayed with respect to the source, and the attenuation indicates how the power of the sound source is diminished. The attenuation is determined by the medium that the sound rays travel through or reflect off (e.g. through air or off a wall). Both the distance travelled and the attenuation of the sound can be obtained from the VOLA model.

Using Pythagoras' theorem, the distance between the sound source and an occupied voxel is calculated. This is repeated for the return distance between the same voxel and the observer. These two distances are then combined to give the total distance travelled by the sound ray, which is known as an early reflector.
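A minimal sketch of this calculation (the type and function names are illustrative):

```c
#include <math.h>

typedef struct { double x, y, z; } vec3;

/* Euclidean distance via Pythagoras' theorem. */
static double dist(vec3 a, vec3 b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Path length of an early reflector: source -> voxel -> observer. */
double early_reflector_distance(vec3 source, vec3 voxel, vec3 observer)
{
    return dist(source, voxel) + dist(voxel, observer);
}
```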

Currently only direct sound and early reflectors are taken into account, as they contain the majority of the sound energy of the acoustic response. These rays and their energies are shown in figures 8a and 8b respectively. Direct sound rays are sound rays that travel directly to the observer without reflecting off any surfaces or objects.

The reflection coefficients are stored in the second bit of VOLA's two-bits-per-voxel representation, as described in section III-A. This can be set manually for an entire scene (e.g. if it is known that all the walls and floor are wooden). Manually measuring the reflection coefficient of materials is a complex process [12]; however, CNNs can be trained to distinguish simple materials [13] (e.g. wood, metal, carpet) and assign predetermined reflection coefficients to voxels of these materials.

For optimisation, voxels that cause the same attenuation due to distance travelled and reflection coefficients are binned together. This is referred to as "thresholding" the environment. Each of these bins is combined to generate an FIR filter through which the sound source is passed. The process of constructing this filter is shown in figure 9. The response from multiple sound sources can be calculated simultaneously by generating multiple filters.
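A minimal sketch of the filter construction (the tap layout and attenuation handling are illustrative assumptions, not the SonifEye implementation): each bin contributes a tap whose delay is set by the path length and whose gain is the binned attenuation.

```c
#include <stddef.h>

#define SPEED_OF_SOUND 343.0 /* m/s, in air at room temperature */

/* Accumulate one FIR tap per bin of reflectors: delay from the
 * path length, gain from the combined attenuation of the bin. */
void add_bin_to_filter(double *taps, size_t ntaps, double sample_rate,
                       double path_length_m, double attenuation)
{
    size_t delay = (size_t)(path_length_m / SPEED_OF_SOUND * sample_rate);
    if (delay < ntaps)
        taps[delay] += attenuation; /* coincident reflectors sum */
}
```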

An algorithm, SonifEye [14], implementing auralisation as described above has been implemented on the Myriad 2 [15]. The sound source can be pre-recorded, e.g. a musical recording, or can be recorded using a microphone. The runtime for auralising a single sound source in a 256-voxel-cubed environment is given in Table II.

    (a) Ray reflections

    (b) Energy of ray reflections

    Fig. 8: Sound Rays

    Fig. 9: JIT audio modelling

The user's surroundings are often neglected when providing audio for VR. SonifEye allows more realistic VR audio to be generated in real time, eliminating the need to store canned audio.

V. DEEP LEARNING AND CONVOLUTIONAL NEURAL NETS

Performing object recognition on 3D point-cloud occluded volumes depicting real-world scenes containing ubiquitous objects is an important problem in the computer vision field.

Step                    Time (ms)
Threshold Environment   820
Create Filter           0.3
Apply Filter            4.05

TABLE II: SonifEye step timings for a 256-voxel-cubed environment and a sound source 1 second in length.


It presents multiple challenges, including the significant memory requirements of these volumetric representations as well as real-time recognition of objects in mobile, power-constrained or autonomous conditions.

Convolutional Neural Networks (CNNs) have been shown to be powerful classification tools for multiple real-world computer vision tasks, such as scene segmentation and labelling [16], medical imaging diagnosis [17], [18] and object detection in autonomous driving [19], in many cases approaching near-human performance. As CNNs take on new and more challenging recognition tasks, they have been increasing in complexity, thus requiring larger datasets for training. The lack of readily available training data and the memory requirements are two of the factors hindering the training and accuracy performance of 3D CNNs. Based on previous works [20], [21], a voxelised point-cloud dataset was created containing 10 different 3D objects associated with different scenes in VOLA format, shown in Fig. 10. This dataset minimizes the memory footprint as well as increasing the efficiency of CNN performance. In order to provide high computational power with low power supply requirements and low energy consumption for real-time applications, the CNN model trained with our dataset is ported to the very low power and low cost Fathom NCS [22], based on the Myriad 2 MA2450 VPU [15], and can perform inference in 11 ms.

Fig. 10: ModelNet10 [23] objects placed in different rooms with voxelised representation in VOLA format.

In order to train the network on our 3D objects with scene, we have developed a new architecture, shown in Fig. 11. The input to the network is a 64 × 64 × 64 VOLA format object with scene, as shown in Fig. 10. There are 3 convolutional layers, 1 fully connected layer and a softmax layer to produce the output. We also add a dropout layer after the third convolutional layer with a drop rate of 0.5. The first convolutional layer produces 8 feature blocks of size 57³; the second and third convolutional layers produce 16 feature blocks of size 50³ and 43³ respectively. After that we add 1 fully connected layer to produce 10 units, and the final output gives the probability for the different classes based on the input. This 3D CNN model is designed and trained using Caffe [24]. The accuracy achieved by the 3D CNN is shown in Table III.
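The feature block sizes follow from the standard output-size relation for a valid convolution with stride 1, which reproduces the dimensions quoted above:

```latex
o = i - k + 1:\qquad 64 - 8 + 1 = 57,\quad 57 - 8 + 1 = 50,\quad 50 - 8 + 1 = 43.
```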


Fig. 11: 3D CNN architecture layout for objects with background. The input is a 64 × 64 × 64 object in VOLA format with each voxel represented by a single bit. The input passes through 3 convolutional layers and 1 fully connected layer. The kernels used for the convolutional layers are 8 × 8 × 8 with stride 1, which gives output feature blocks of size 57³, 50³ and 43³ from Conv1, Conv2 and Conv3 respectively. The output from the fully connected layer is 10 units.

Method             Accuracy
PointNet [25]      77.6%
3D ShapeNet [23]   84%
OctNet [26]        90.1%
Our 3D CNN         91.3%
VoxNet [27]        92%

TABLE III: Accuracy achieved for different existing CNN models and our model. The existing CNN models are trained and tested on pure ModelNet10 objects, whereas our 3D CNN is trained and tested on voxelized point clouds containing ModelNet10 objects with background. (A higher percentage indicates better performance.)

VI. KFUSION

Simultaneous localisation and mapping (SLAM) is a classic computer vision problem which requires a significant amount of computational resource. Most SLAM solutions can be categorized as either a dense or a sparse approach. In sparse techniques only a subset of image/sensor features are tracked in order to map an environment and localize the position and orientation of the sensor. For dense approaches, however, the main goal is to create a compact map with sufficient detail of the environment being captured.

KFusion [28] is a dense SLAM open-source implementation of Microsoft's KinectFusion [29] algorithm. KFusion has been ported to the Myriad 2 platform with the goal of using VOLA in the pipeline for storing compact volumetric 3D data. By using VOLA, large maps constructed by the algorithm will occupy a smaller memory footprint compared to the default dense voxel grid of truncated signed distance function values and weights. Figure 12 shows the hardware setup for a live KFusion implementation running on Movidius' Myriad 2 platform. A demonstration of KFusion running on the Myriad chip is available online at [30].

VII. TERRESTRIAL SCALE POINT CLOUD MAPPING IN REAL TIME

Laser scan (LIDAR) surveys have been carried out over most major cities due to the high accuracy and speed of data collection. Furthermore, the open data movement has meant that many governments have made this data publicly available. Unfortunately the size of the point clouds generated from the scans is so large (10 gigabytes up to a terabyte) that the data is almost never used directly. Converting it to the VOLA format allows the size of the data to be massively reduced while still allowing it to be used for mapping and machine learning applications.


    Fig. 12: Live KFusion running on Myriad 2 platform.

    Fig. 13: Height data embedded in geo-referenced VOLA data.


The format specification has been designed to be compatible with terrestrial mapping applications. The point cloud data is georeferenced using both the WGS84 coordinate system [31] and the projected coordinate system that it was originally recorded in. This allows all data to be georeferenced while minimising distortion due to re-projection.

In order for the format to be used with standard GIS applications, an approach was developed to embed information about the voxels while minimizing memory usage. Semantic information, such as colour, classification, number of returns and height, is embedded in the data using the two-bits-per-voxel technique. An example of this is given in figures 13 and 14, which show the height data and the number of returns from the LIDAR scanner respectively.

    VIII. EMBEDDED RAYCASTING

In autonomous applications, a compact and highly detailed map is not useful unless the autonomous agent can meaningfully interact with it. One of the most fundamental interactions is collision detection. Accurate and high-performance collision detection is of particularly high importance in applications such as drones, where a collision at speed can result in damage to property and harm to individuals. Common methods of ray/geometry intersection involve spatial partitioning schemes such as the KD-tree or octree, as mentioned previously.

Fig. 14: LIDAR return data embedded in geo-referenced VOLA data. The trees are highlighted in the dataset as they generate multiple returns.

The octree is the most popular technique for geometry arranged in regular grids. Table I compares the VOLA format to octrees in a number of scenarios. However, for certain architectures, the VOLA format provides other significant improvements over the octree.

    A. Parallelism

In a parallel program, cores may have to wait longer on average to perform writes and removals on a pointer-based octree when compared to VOLA. This is because of the cascading memory allocations and frees that would have to be performed during the critical section. In the worst case VOLA only requires a cascade of memory writes, resulting in less time spent waiting on unlocks.

    B. Cache-friendliness

A level in an octree uses pointers to indicate the occupancy of the level below. On a 64-bit system, a pointer uses 8 bytes of memory where a single bit would suffice to represent occupancy. The ratio of metadata (pointers) to data (voxels) is far lower in VOLA because everything is a voxel. This results in better cache friendliness because voxel data does not compete with pointer data for space in the cache.

    C. Level of detail

VOLA supports a similar level of detail to that of octrees. In an octree, the number of voxels in the first level is 8 and this number increases by a factor of 8 at each successive level. In VOLA, the corresponding number is 64. This means the VOLA levels of detail converge on the full level of detail faster while still offering several degrees of coarseness.
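Concretely, after L levels an octree addresses 8^L cells while VOLA addresses 64^L, so VOLA reaches a given resolution in half as many levels:

```latex
N_{\text{octree}}(L) = 8^{L}, \qquad N_{\text{VOLA}}(L) = 64^{L} = 8^{2L}.
```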

    D. Data locality

Using heap allocation for the octree can result in highly fragmented memory, and no assumptions can be made about the relative positions of data in memory.


Fig. 15: VOLA raycaster running on the Movidius Fathom. The raycaster runs at 8 FPS.

In VOLA, the layout of the voxels is totally deterministic and very compact. Additionally, VOLA does not lose the ability to use Morton encoding [32] to enhance data locality. In situations where multiple VOLA structures are used in parallel (e.g. to store sub-sampled colour information in addition to voxels), a relative address jump can be made from one VOLA structure to the same point in the other.
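For reference, a standard 3D Morton (Z-order) encoding in C, which interleaves the bits of x, y and z so that spatially close voxels receive nearby indices (a common implementation pattern, not code from the paper):

```c
#include <stdint.h>

/* Spread the low 21 bits of v so that two zero bits separate
 * each original bit (standard Morton bit-interleaving step). */
static uint64_t spread_bits(uint64_t v)
{
    v &= 0x1FFFFFULL;
    v = (v | v << 32) & 0x1F00000000FFFFULL;
    v = (v | v << 16) & 0x1F0000FF0000FFULL;
    v = (v | v << 8)  & 0x100F00F00F00F00FULL;
    v = (v | v << 4)  & 0x10C30C30C30C30C3ULL;
    v = (v | v << 2)  & 0x1249249249249249ULL;
    return v;
}

/* Interleave x, y and z into a single Z-order index. */
uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    return spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2);
}
```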

    E. Efficient Ray-casting

The above characteristics of VOLA result in high-performance ray-casting on amenable platforms. Efficient algorithms exist for ray/grid intersection tests [33]. These algorithms can be further optimised to make use of the levels of detail to skip a large number of tests that would otherwise probe known-empty space. Multi-core systems can use the extra cores to cast multiple rays in parallel.
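A minimal sketch of such a ray/grid traversal in the style of Amanatides and Woo [33], stepping one voxel per iteration (the grid bounds, the occupancy callback and all names here are illustrative assumptions):

```c
#include <math.h>
#include <stdbool.h>

/* Step a ray (origin o, direction d) through an n x n x n grid of
 * unit cells, visiting one voxel per iteration. Returns true on the
 * first occupied voxel; occupied() is an assumed VOLA lookup. */
bool raycast_grid(double ox, double oy, double oz,
                  double dx, double dy, double dz,
                  int n, bool (*occupied)(int, int, int))
{
    int x = (int)ox, y = (int)oy, z = (int)oz;
    int sx = dx > 0 ? 1 : -1, sy = dy > 0 ? 1 : -1, sz = dz > 0 ? 1 : -1;
    /* Ray-parameter increment per cell crossed on each axis. */
    double tdx = fabs(1.0 / dx), tdy = fabs(1.0 / dy), tdz = fabs(1.0 / dz);
    /* Ray parameter at the first cell boundary on each axis. */
    double tx = (sx > 0 ? x + 1 - ox : ox - x) * tdx;
    double ty = (sy > 0 ? y + 1 - oy : oy - y) * tdy;
    double tz = (sz > 0 ? z + 1 - oz : oz - z) * tdz;

    while (x >= 0 && x < n && y >= 0 && y < n && z >= 0 && z < n) {
        if (occupied(x, y, z))
            return true;
        if (tx < ty && tx < tz) { x += sx; tx += tdx; }
        else if (ty < tz)       { y += sy; ty += tdy; }
        else                    { z += sz; tz += tdz; }
    }
    return false;
}
```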

The Movidius Myriad 2 is a heterogeneous VPU that features 12 vector processors and an on-chip addressable cache memory. Each vector processor has a 128 KiB preferential slice of the 2 MiB on-chip memory, which facilitates fast low-power access. Using VOLA, each slice can comfortably store its own copy of the top three levels of detail of the VOLA data structure. The remaining levels of the structure are stored in DDR memory.

A realtime raycaster was implemented on the Movidius Fathom stick as an example of the raycasting performance. The rays produce a depth image which visualises the voxels directly without producing an intermediate data structure such as a triangular mesh.

IX. CONCLUSIONS

VOLA has proven to be a versatile and compact format. It has been used for applications as diverse as audio modelling, machine learning, and storing georeferenced geo-data. It allows for massive compression of data without sacrificing functionality. The bit sequences are amenable for use on embedded systems and have been applied to ray-casting and SLAM.

A. Future Work

VOLA is now being developed so that any information can be automatically added to and unpacked from the two-bits-per-voxel representation. This will allow data for specific applications to be served from the same database. VOLA is also being merged with existing GIS data so that a curated training set can be developed for training CNNs on real-world objects.

REFERENCES

[1] C. D. Mutto, P. Zanuttigh, and G. M. Cortelazzo, Time-of-flight cameras and Microsoft Kinect (TM). Springer Publishing Company, Incorporated, 2012.

[2] L. Geosystems, "Leica ScanStation P30/P40," Product Specifications: Heerbrugg, Switzerland, 2015.

[3] D. Meagher, "Geometric modeling using octree encoding," Computer Graphics and Image Processing, vol. 19, no. 2, pp. 129–147, 1982.

[4] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[5] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Autonomous Robots, vol. 34, no. 3, pp. 189–206, 2013.

[6] D. Girardeau-Montaut, Change detection on three-dimensional geometric data. PhD thesis, Télécom ParisTech, 2006.

[7] J.-D. Boissonnat, "Geometric structures for three-dimensional shape representation," ACM Transactions on Graphics (TOG), vol. 3, no. 4, pp. 266–286, 1984.

[8] T. K. Peucker, R. J. Fowler, J. J. Little, and D. M. Mark, "The triangulated irregular network," in Amer. Soc. Photogrammetry Proc. Digital Terrain Models Symposium, vol. 516, p. 532, 1978.

[9] M. Kazhdan, M. Bolitho, and H. Hoppe, "Poisson surface reconstruction," in Proceedings of the Fourth Eurographics Symposium on Geometry Processing, SGP '06, (Aire-la-Ville, Switzerland), pp. 61–70, Eurographics Association, 2006.

[10] J. F. Hughes, A. Van Dam, J. D. Foley, and S. K. Feiner, Computer graphics: principles and practice. Pearson Education, 2014.

[11] M. Kleiner, B.-I. Dalenbäck, and P. Svensson, "Auralization - an overview," Journal of the Audio Engineering Society, vol. 41, no. 11, pp. 861–875, 1993.

[12] D. Markovic, K. Kowalczyk, F. Antonacci, C. Hofmann, A. Sarti, and W. Kellermann, "Estimation of acoustic reflection coefficients through pseudospectrum matching," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 1, pp. 125–137, 2014.

[13] S. Bell, P. Upchurch, N. Snavely, and K. Bala, "Material recognition in the wild with the Materials in Context database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3479–3487, 2015.

[14] L. Buckley, S. Caulfield, and D. Moloney, "Mvecho - acoustic response modelling for auralisation," in 2016 IEEE Hot Chips 28 Symposium (HCS), p. 1, Aug 2016.

[15] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney, R. Richmond, M. O'Riordan, and V. Toma, "Always-on vision processing unit for mobile applications," IEEE Micro, vol. 35, no. 2, pp. 56–66, 2015.

[16] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.

[17] A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen, "Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 246–253, Springer, 2013.

[18] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, "A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 520–527, Springer, 2014.

[19] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730, 2015.

[20] X. Xu, D. Corrigan, A. Dehghani, S. Caulfield, and D. Moloney, "3D object recognition based on volumetric representation using convolutional neural networks," in International Conference on Articulated Motion and Deformable Objects, pp. 147–156, Springer, 2016.

[21] X. Xu, A. Dehghani, D. Corrigan, S. Caulfield, and D. Moloney, "Convolutional neural network for 3D object recognition using volumetric representation," in Sensing, Processing and Learning for Intelligent Machines (SPLINE), 2016 First International Workshop on, pp. 1–5, IEEE, 2016.

[22] "Movidius announces deep learning accelerator and Fathom software framework." https://www.movidius.com/solutions/machine-vision-algorithms/machine-learning.

[23] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920, 2015.

[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678, ACM, 2014.

[25] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, "PointNet: A 3D convolutional neural network for real-time object class recognition," in Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 1578–1584, IEEE, 2016.

[26] G. Riegler, A. O. Ulusoys, and A. Geiger, "OctNet: Learning deep 3D representations at high resolutions," arXiv preprint arXiv:1611.05009, 2016.

[27] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928, IEEE, 2015.

[28] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber, "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in IEEE Intl. Conf. on Robotics and Automation (ICRA), May 2015. arXiv:1410.2167.

[29] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in IEEE ISMAR, IEEE, October 2011.

[30] "Raycaster running on the Myriad demo video." http://www.youtube.com/watch?v=-LHpTfX4cIsg.

[31] B. L. Decker, "World geodetic system 1984," tech. rep., DTIC Document, 1986.

[32] G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, New York, 1966.

[33] J. Amanatides and A. Woo, "A fast voxel traversal algorithm for ray tracing." Accessed: 2017-06-70.
