

The Visual Computing Database: A Platform for Visual Data Processing and Analysis at Internet Scale

1 Introduction

There is strong evidence that a fundamental requirement of the next generation of visual and experiential computing (VEC) applications will be the efficient analysis and manipulation of large repositories of visual data (images, videos, RGBD, point clouds, etc.). For example, highly publicized recent advances in image understanding [24] stem not only from improved deep-learning techniques, but also from the ability to execute these techniques on internet-scale, human-annotated image collections [11, 28, 1, 45]. Analysis of visual data collections is also a major component of other advanced VEC applications such as computational photography, high-level image and video manipulation [19, 26, 25], and 3D scene reconstruction [36] and understanding [14]. Further, with image sensors rapidly becoming ubiquitous in all environments (in public spaces, on vehicles, on the body, etc.), we anticipate the emergence of entirely new ecosystems of applications made possible by large-scale mining of streams of visual sensor data. These applications may range from consumer-centric personal assistants, to management of critical infrastructure for transportation and smart cities, and even data analysis for fundamental science.

Of course, scaling visual data analysis to facilitate applications that operate on collections such as all public images and videos on Facebook/YouTube (Figure 1-A), all video cameras in a major city (1-C), or petabytes of images in a digital sky survey [42] (1-B), poses significant computer science challenges. First, the sheer size of visual data representations (in comparison to text) results in datasets that far exceed the data-management capabilities of most researchers, application developers, or data analysts. For example, as a service to the community, Yahoo recently computed visual features for the 100M images in the Flickr Creative Commons dataset [1] (the world's largest annotated dataset), making the resulting 50 TB of descriptors available to the public. Although 50 TB is a daunting amount of data for most application developers to manage, it is dwarfed by the amount of visual data that is presently available for analysis in the world. For example, nearly three times as many photos are uploaded to Facebook each day! Second, state-of-the-art algorithms for understanding and manipulating large image datasets are computationally expensive and require supercomputing-scale processing. Very few programmers have the capability to develop efficient solutions from scratch at these scales, inhibiting our field's ability to explore more advanced data-driven VEC applications.¹

In response to these challenges, this project seeks to develop a new parallel computing platform that facilitates the development of applications that require visual data analysis at massive scale. We call the proposed system a Visual Computing Database because it draws ideas from traditional database management systems to more easily and powerfully organize and manage visual data collections using relational models. However, unlike existing systems, it provides a unique set of programming abstractions and system capabilities specific to the needs of future large-scale VEC applications. At the core of our approach is the design and implementation of a new visual data processing and query language that integrates concepts from high-performance functional image processing languages [33, 20] with traditional relational operators, providing the ability to execute sequences of complex image and video analysis operations with exceptionally high efficiency in the database (near data storage). We will implement and evaluate a cloud-based prototype runtime for the visual computing database and release the software as an open source platform to the community. Further, we will work with collaborators from Google to deploy the prototype as a cloud computing service on the Google Cloud Platform, enabling third-party applications (written by other analysts or researchers) to use the Visual Computing Database to execute computations at scale on widely used, internet-scale computer vision datasets.

¹The disparity between the large amount of visual data generated in the world and the shortcomings of our ability to extract value from it has led researchers to jokingly call visual data "the dark matter of the internet" [32].

Figure 1: Analysis of large-scale image collections is an increasingly important task in many fields, including (A) fundamental computer vision research (extracting knowledge from internet photo and video collections), (B) survey-based astronomy (the LSST will generate 30 TB of image data each night), and (C) security and digital forensics (thousands of live camera feeds are simultaneously streamed to Manhattan's video command center). This project seeks to develop a scalable image analysis platform that will make it easy to express visual data queries, and more efficient to execute these queries against large-scale repositories.

Since, in practice, visual data analysis workflows will involve more than just data retrieval operations, to ensure a practical system the Visual Computing Database will be augmented with mechanisms for initiating supercomputing-scale, non-relational computations on visual query results (e.g., via integration with popular large-scale machine learning frameworks for classification and regression) and then ingesting these results back into the database. Altogether, we aim to empower the VEC community with a novel, scalable platform for expressing large-scale visual data analytics.

1.1 Intellectual Merit

The design and implementation of the proposed Visual Computing Database presents many challenges at the intersection of image processing, database system design, and high-performance computing that together form the intellectual merit of this work.

What are good programming abstractions for expressing complex queries on visual data collections? We seek to integrate high-performance image/video processing kernels as fundamental logic in visual database query operations. Success will require answering questions such as: how should relational data models be extended with first-class support for common visual computing entities such as images and videos, as well as derived entities such as image patches, image descriptors, bounding boxes, or point clouds? How should the scheduling algebra recently developed for high-level image processing languages (such as Halide [33]) be expanded to incorporate the key relational algebra operations (e.g., select, join, group), allowing global optimizations to be made across both domains? How can we incorporate logical predicates such as spatial relations (above, below, nearby, etc.) and temporal relations (before, during, etc.) into the queries?
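As a toy illustration of what a spatial predicate inside a relational query might look like, the following Python sketch joins two detection sets on bounding-box proximity. All names here (Box, nearby, the toy tables) are hypothetical illustrations of the idea, not a proposed API:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

def nearby(a, b, max_dist):
    """Hypothetical spatial predicate: box centers within max_dist pixels."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist

# Toy "tables": detections from two classifiers over the same frame.
bikes = [("bike", Box(10, 10, 20, 40)), ("bike", Box(300, 5, 20, 40))]
cars  = [("car",  Box(25, 20, 60, 40)), ("car",  Box(500, 0, 60, 40))]

# A spatial join: pairs of (bike, car) detections whose boxes are close,
# i.e., the "danger to cyclists" query discussed in Section 3.
close_pairs = [(b, c) for _, b in bikes for _, c in cars if nearby(b, c, 50)]
```

A query planner could treat `nearby` like any other join predicate, opening the door to spatial indexing and predicate pushdown.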

How should the database system efficiently represent and manage derived and intermediate datasets? Exploratory data analysis involves sequences of analysis operations ("pipelines") that generate large intermediate datasets. If materialized in full, these intermediates can often dwarf the original dataset in size. Efficiently managing these intermediates is critical. For example, when should data be recomputed versus stored? And how can we maximize producer-consumer locality (at all levels of the machine) to avoid generation of large intermediates? Scheduling these operations well will be critical to system performance.
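One simple way to frame the recompute-versus-store decision is as a cost comparison. The Python sketch below uses entirely invented cost constants and ignores real-world factors (bandwidth, cache effects, failure recovery); it only illustrates the shape of the trade-off:

```python
def should_materialize(bytes_out, recompute_seconds, reuse_count,
                       storage_cost_per_gb_s=1e-9, compute_cost_per_s=1.0,
                       retention_seconds=3600.0):
    """Toy cost model (hypothetical constants): store an intermediate only
    if the compute saved across expected reuses exceeds the cost of
    keeping it around for the retention window."""
    recompute_cost = compute_cost_per_s * recompute_seconds * reuse_count
    storage_cost = storage_cost_per_gb_s * (bytes_out / 1e9) * retention_seconds
    return recompute_cost > storage_cost
```

Under this model, a terabyte-scale intermediate that is cheap to recompute is discarded, while a small but expensive derived table (e.g., deep features) is kept.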

What are practical ways to integrate large-scale distributed computation (e.g., for machine learning and analysis) with a visual database system? Modern approaches to image understanding rely heavily on machine learning techniques. Thus, any practical system for visual data analysis must not only support efficient database query operations, but also provide support for seamlessly executing complex analyses on the results of queries. Common operations, such as training deep neural network models, involve expensive computations or large datasets that are often distributed over many machines. To maintain high performance, the system must also avoid unnecessary data marshaling between the database and these external computation frameworks.

What are the hardware platform requirements for large-scale visual data analysis applications? While the focus of this proposal is not to explore specialized hardware for visual data analysis, we intend to analyze the characteristics of large-scale visual computing applications using prototypes running on a variety of platforms. We will build highly tuned implementations for traditional data centers built from multi-node, multi-core CPUs, as well as for throughput-computing architectures such as Xeon Phi and GPUs. The results of this analysis will provide guidance for hardware designers seeking to specialize future hardware architectures for this domain, and for system architects who wish to build large data centers.

1.2 Broader Impacts

Large-scale data analytics is critical to science, medicine, government, and business. Analyzing video and images is particularly challenging because of the amount of data and the difficulty of extracting semantic information from pixel data. Image database and analysis systems stand to have significant impact on a variety of domains, including:

Fundamental research in computer vision and graphics. We believe our system would enable computer graphics and vision researchers to test their algorithms more easily and quickly than they can today, hopefully increasing the rate of progress in these fields. The database will also be useful for computational photography researchers seeking to develop algorithms that leverage large image databases for image manipulation tasks.

Scientific instrumentation. Many areas of science use instruments that generate large collections of images that need to be processed. Sky surveys like the Large Synoptic Survey Telescope (LSST) are anticipated to generate over 30 TB of images a night. Processing and analyzing the images in these sky catalogs will be an important aspect of astronomy research in the coming decade. Similarly, biomedical researchers involved in projects like the Human Brain Project and the Human Connectome Project will be generating collections of very large images that need to be processed. More generally, many future scientific instruments will be based on imagers, which will generate massive amounts of imagery that must be processed and made available to the scientific community.

Security and digital forensics. These needs arise at the scale of local policing as well as the broader scale of international human rights investigations, where organizations, citizens, and observers post videos online, and investigating organizations mine the unstructured collections to establish the locations of images as well as the people and objects in them.

General analytics and decision making. Because cameras are cheap and ubiquitous, image and video processing is becoming widespread and many applications are emerging. One example (see Section 4) is understanding traffic patterns in urban environments. By monitoring traffic in real time, it is possible to improve the efficiency of public and private transportation systems. We believe our system will ultimately have many uses in the commercial world.


A central focus of this project is to develop a working system that allows a large community of users to regularly issue complex queries over large image collections. We will build a scalable visual database system using the Google Cloud infrastructure and populate it with millions of images. We will release our software as open source, so others can build upon our work.

Finally, because imaging is so computationally intensive, progress in image analysis has the potential to stimulate the high-performance computing industry, and projects of strategic national importance such as the Exascale Computing Initiative. For example, recent breakthroughs in image understanding based on convolutional neural networks and deep belief networks have been quickly put into practice by large companies such as Google, Microsoft, Facebook, and Baidu. This in turn has created a large demand for new data centers with significantly increased computational capability, which has benefited HPC vendors such as IBM, Intel, and NVIDIA.

2 Background

Languages for high-performance image processing. Recently, systems such as Halide [33] (created by essential personnel Jonathan Ragan-Kelley) and Darkroom [20] (created by Co-PI Hanrahan) have demonstrated that image processing pipelines expressed in a high-level functional representation can be compiled to highly optimized implementations for parallel architectures ranging from mobile phones to GPUs to FPGAs. For example, the Halide language has been demonstrated to generate pipelines that outperform hand-tuned implementations by advanced programmers. Halide is now in widespread use in industry and has been used to build state-of-the-art image processing systems, including Google's HDR+ pipeline featured in millions of Android phones (as well as Google Glass) and Google Photos' AutoEnhance feature, which runs on tens of thousands of data center machines. Key to Halide's success is the scheduling algebra it defines for concretely implementing pipelines of functional image processing operations. This success mirrors that of the use of relational algebra for query planning and optimization in database systems. In this project we seek to unify these two worlds (image processing pipelines and relational algebra operations) to enable simple, but performant, expression of complex visual data analysis queries. The key to doing this successfully will be tracking and managing intermediate data simultaneously through both the relational and pixel/array-processing worlds, from small blocks of pixels in L1 cache to terabytes of patch features in a distributed file system.
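The algorithm/schedule separation at the heart of Halide can be caricatured in a few lines of Python. This is an illustrative simplification, not Halide's actual machinery: both functions below compute the same two-stage blur (same algorithm), but the first materializes the full intermediate while the second trades recomputation for producer-consumer locality (different schedules):

```python
def blur1(inp, x):
    """First pipeline stage: a pure function of its input (a 2-tap blur)."""
    return (inp[x] + inp[x + 1]) / 2

def staged(inp, n):
    """Schedule 1: materialize all of blur1, then consume it (poor locality,
    full intermediate stored)."""
    tmp = [blur1(inp, x) for x in range(n + 1)]
    return [(tmp[x] + tmp[x + 1]) / 2 for x in range(n)]

def fused(inp, n):
    """Schedule 2: recompute blur1 at each use (no intermediate storage,
    redundant work)."""
    return [(blur1(inp, x) + blur1(inp, x + 1)) / 2 for x in range(n)]

signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
out_staged = staged(signal, 4)
out_fused = fused(signal, 4)
```

A scheduling algebra makes this choice (and many in between, such as tiled interleaving) explicit and searchable, which is exactly the decision space the proposal wants to extend across relational operators.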

Domain-specific languages and runtimes. In addition to domain-specific frameworks for image processing, the project PIs and essential personnel have a strong track record of developing domain-optimized programming and runtime systems. PI Fatahalian and Co-PI Hanrahan collaborated as part of the research team that created the Brook stream processing language for GPUs [4], a system that was influential in generalizing graphics systems to domains beyond rendering (a precursor to industry-standard languages such as CUDA and OpenCL). PI Fatahalian and Co-PI Hanrahan also collaborated on the design of the Sequoia programming language [16], a distributed runtime system that served as the precursor to the Legion runtime system [3], which will be used as the primary distributed compute engine for the prototype developed as part of the proposed work. Co-PI Hanrahan has been involved in many recent efforts to develop high-performance languages and runtimes for domain-optimized systems, leading to the creation of Liszt [13], Riposte (a JIT compiler for R) [38], and Terra [12], a framework for building domain-specific languages (which will be used to implement the proposed visual data query language compiler in this project).

Why a new database specific to visual computing? In recent years, the challenge of analyzing increasingly large data volumes has triggered a variety of database design efforts that explore problem-specific data models and new mechanisms for expressing computation inside database engines. While all systems (including the system we describe in this proposal) are inspired by the same basic tenets (that computation must be moved into the database engine for efficiency, and that highly structured programming models are necessary to simultaneously achieve high performance and application programmer productivity), their designs vary greatly based on their domain of focus: e.g., graph processing (Neo4j [30], FlockDB [40], Teradata's SQL-GR [39]), general machine learning (MLBase [23]), or array-based scientific and numerical computation (SciDB [7]). While many of these existing systems provide functionality that is well aligned with the machine learning aspects of visual computing workloads, no existing system provides an ideal platform for executing complex image and video analysis. For example, although SciDB exports a first-class notion of 2D array types that can be used to store images, the database provides no mechanisms for expressing high-performance pixel processing pipelines on these entities. Such pipelines are the fundamental component of visual data analysis applications. GIS database systems [15] also fail to meet the needs of visual data analysis applications for similar reasons. (Although, as stated in Section 4, we anticipate that 2D and 3D spatial joins, a common operation in GIS databases, will also be an important operation in a visual computing database.)

3 A Visual Data Analysis Scenario

To gain intuition about the needs of visual data analysis workloads, consider a scenario where an analyst seeks to understand non-vehicular traffic in a metropolitan downtown region (e.g., to recommend locations for new bike lanes in the city). At the analyst's disposal is a large corpus of video recorded over the past year from the city's many outdoor security and traffic monitoring cameras. The analyst, employing state-of-the-art computer vision techniques [17], might seek to count instances of bicycles on each downtown city block by crafting a bicycle detector using a pipeline of image processing operations, as illustrated in Figure 2. Given a video frame, this pipeline identifies image regions (at multiple scales) that are likely to contain objects [41] (a step implemented via low-level image processing filters), and then applies a widely available bicycle classifier to these regions (evaluation of a deep neural network followed by a linear SVM classifier) [24, 17]. To generate the desired count of bicycle occurrences per city block, image regions containing bicycles must be joined back to the video streams from which they came and with the city block on which the camera resides. Finally, aggregation within blocks yields the final counts. Although conceptually simple to describe, this content-based visual query pipeline, when executed on a year's worth of video from hundreds of cameras, involves executing many complex operations on terabytes of image pixels. More nuanced analyses would require expression of additional logic beyond just selection. For example, questions about the danger to cyclists on particular city blocks might be answered by not only detecting the presence of bicycles and vehicles in an image, but also computing the spatial proximity of their bounding boxes.

Many visual analysis tasks will also necessitate substantial processing on the results of visual queries. Consider an extension of the example above, where the traffic analyst wishes to ask further questions, such as: what percentage of the cyclist population wears a helmet? As it is unlikely for a detector of such a specific scene concept to be readily available, the analyst might seek to import images of cyclists wearing helmets into the database (e.g., by saving the results of a Google Image Search for "cyclist with helmet"), then use the annotated images as training data for a task-specific classifier to be used in a subsequent query. We note that best-practice training methods involve substantial amounts of computation, necessitating efficient distributed processing [27, 34, 9].
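To make the "train a task-specific classifier on imported results" step concrete, here is a deliberately minimal sketch in pure Python: a logistic-regression stand-in trained on toy 2-D descriptors. A real deployment would train a deep model on actual image features using a distributed ML framework; every name and value here is illustrative:

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Minimal logistic-regression trainer via stochastic gradient descent.
    Stand-in for the task-specific 'helmet' classifier described above."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

# Toy 2-D "descriptors": helmet images cluster high, non-helmet images low.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

The point of the surrounding proposal text is that this training loop, trivial here, becomes a distributed computation over query results at scale, and its output (the classifier) must flow back into the database for use in subsequent queries.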


4 Proposed Work: A Database for Visual Data Processing and Analysis

Although presented in the context of a very specific data analysis scenario, we believe the image analysis pipeline featured in Section 3 is representative of future everyday visual data analysis tasks across a wide variety of disciplines. To meet the needs of these emerging applications, we believe that a platform for visual data analysis at scale must:

• Ease the burden of managing terabyte-to-petabyte scale visual data collections, which include both original source data and (potentially cached) intermediate collections generated as a result of analysis operations.

• Provide a high-level visual data query language that facilitates rapid expression of complex queries and analyses that operate directly on the visual content. This language must emit efficient implementations (near hand-tuned performance) on modern throughput architectures such as multi-core CPUs and GPUs (and afford the possibility of future acceleration by specialized hardware or FPGAs).

• Provide mechanisms to efficiently use the results of visual queries in large-scale parallel computations, in particular machine learning.

We discuss the details of how the proposed Visual Computing Database will address these system requirements in the following subsections.

4.1 Modeling Image Collections as a Relational Database

In most visual data analysis scenarios today (like the one described in Section 3), visual data is stored as an ad-hoc collection of image or video files alongside files containing disparate, inconsistently structured metadata about the visual assets. An analyst seeking to perform a task must collect and organize the files, manually gather and normalize the relevant metadata (e.g., the city block in which each camera is located), and build custom scripts to tie together many stages of complex processing. To execute the computation, terabytes of data must be ingested into a distributed or network file system (e.g., NFS or HDFS), computation must be manually distributed over a cluster, and storage of intermediate values must be manually considered and managed.

In response to this complexity, a core idea of this proposal is that large collections of images are better represented, managed, and processed using a relational database model than as a tree of files in a traditional file system. The language of relations and relational algebra is a natural fit for operating on collections of images and their associated metadata. For example, images are often associated with the GPS position where they were taken, camera settings from the shot, and additional annotations (e.g., "contains a cyclist") that might have been generated from prior analyses and provide information about their contents. Although the advantages of a relational data representation, such as "physical independence", structured data management and administration, efficient and flexible computation from simple declarative queries, and implicit parallelism and distributed execution, are well established in many other domains [6], image collections are still often managed as sets of raw files because traditional database management systems are not optimized for either the computation or storage requirements of large-scale visual data.
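A sketch of how such a relational model of an image collection might look, using SQLite purely for illustration (the proposal itself targets Postgres plus an object store; table and column names here are hypothetical). Note that only metadata lives in the tables; pixel data is referenced by an opaque key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE cameras (camera_id INTEGER PRIMARY KEY, block TEXT,
                          lat REAL, lon REAL);
    CREATE TABLE frames  (frame_id INTEGER PRIMARY KEY,
                          camera_id INTEGER REFERENCES cameras,
                          ts TEXT, pixels_key TEXT);  -- key into object store
    CREATE TABLE annotations (frame_id INTEGER REFERENCES frames, label TEXT);
""")
con.execute("INSERT INTO cameras VALUES (1, 'Main-100', 40.71, -74.00)")
con.execute("INSERT INTO frames VALUES (10, 1, '2015-06-01T08:00', "
            "'objstore://frames/10')")
con.execute("INSERT INTO annotations VALUES (10, 'contains a cyclist')")

# The "count bicycles per city block" aggregation from Section 3, as plain SQL:
rows = con.execute("""
    SELECT c.block, COUNT(*) FROM annotations a
    JOIN frames f  ON a.frame_id = f.frame_id
    JOIN cameras c ON f.camera_id = c.camera_id
    WHERE a.label = 'contains a cyclist'
    GROUP BY c.block
""").fetchall()
```

Once the metadata is relational, the join-and-aggregate logic that today lives in custom scripts collapses into a declarative query the system can parallelize and optimize.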

Figure 2: A hypothetical multi-stage video analysis pipeline used to compute statistics of bike usage in city neighborhoods. The pipeline combines low-level image processing, results filtering and aggregation, neural-network evaluation, and integration of visual query outputs with non-visual data sources. The stages shown are: select videos → build pyramids (image pyramids) → selective search (interesting patches) → warp (resampled patches) → CNN (feature vectors) → bicycle classifier (bicycle detections) → join with labeled video frames → filter (frames with detections) → join with camera locations (bicycle locations) → count(*) group by block (bicycle activity per city block). The pipeline is expressed in pseudocode as:

    patches = select pyramid(video.frame) from videos where video.type = "traffic"
    patches' = selective_search(patches)
    patches'' = warp(patches')
    features = cnn(patches'')
    detections = classify(features)
    matches = filter(join(videos, detections), detection=true)
    bike_locations = join(camera_locations, matches)
    bike_activity = select count(*) from bike_locations group by blockID

Our proposed visual computing database seeks to provide applications many of the benefits of traditional database management systems while also delivering visual data storage and computation efficiency matching or exceeding the hand-designed visual data processing systems of today. A key aspect of achieving this goal is the addition of first-class database support for visual data types and pixel-level visual data processing on these types. For example, in addition to traditional database column types such as numbers, strings, and dates, the proposed database will also feature support for images, videos, and RGBD data. Since processing pixel data presents a substantially different workload than numbers and text, the proposed system will separate the storage of metadata about images and videos (keeping it resident in traditional relational data structures) from the storage of pixel data. We intend to start by extending the Postgres RDBMS for metadata storage and queries [37], coupling it to a distributed object store (such as Amazon S3 or OpenStack Swift [10, 2, 31]) for storing pixel data, and to the Legion runtime for distributed in-memory caching and computation [3]. While the relational layer will continue to manage and arbitrate access to pixel-level data, this data will be located in a subsystem that is optimized for inexpensive and scalable storage of large binary blobs (the object store). Queries on database fields corresponding to pixel data will include complex pixel-level computations that transform images to other images (e.g., reconstruction, filtering, and resampling operations) or into popular image descriptors (e.g., SIFT, HOG, "deep features" [29, 8, 24]).
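The metadata/pixel-data split described above can be sketched with a toy content-addressed blob store standing in for S3/Swift. The class and its API are illustrative, not the proposed system's interface:

```python
import hashlib

class BlobStore:
    """Toy stand-in for a distributed object store: content-addressed put/get
    of opaque pixel blobs, kept separate from relational metadata."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data   # content addressing dedups identical pixels
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = BlobStore()
pixels = bytes(range(16)) * 4                         # stand-in for an encoded frame
key = store.put(pixels)
metadata_row = {"frame_id": 10, "pixels_key": key}    # only the key enters the RDBMS
```

The relational layer keeps arbitrating access (queries resolve `pixels_key` values), while the blob store is free to optimize purely for cheap, scalable storage of large binary objects.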

First-class support for domain-specific visual data types, as opposed to more general array-based types (e.g., SciDB [7]), affords improved opportunities for data compression. Our storage system will rely on domain-specific image and video compression techniques wherever possible to reduce storage footprint and bandwidth requirements. Images and videos also carry explicit metadata describing the asset's visual interpretation, such as the color space and gamma curves with which it is encoded. Failing to track and account for different encodings is a major source of image processing errors that a structured representation can help eliminate.
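As a concrete instance of the encoding pitfall mentioned above: blending two sRGB-encoded pixel values without linearizing first gives a visibly different (darker) result than the physically meaningful average. The transfer functions below follow the standard sRGB definition:

```python
def srgb_to_linear(v):
    """sRGB-encoded value in [0,1] -> linear light (standard sRGB EOTF)."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Linear light in [0,1] -> sRGB-encoded value (inverse transform)."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

a, b = 0.1, 0.9   # two sRGB-encoded pixel values to blend 50/50
naive = (a + b) / 2                                              # averages encoded values
correct = linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2)  # averages light
```

A database that records the encoding as structured metadata can insert the linearization step automatically, instead of relying on every pipeline author to remember it.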

4.2 Defining a Visual Query Language

A second key component of our proposed work is the design of a new visual query language used to retrieve information from the visual data stores described in Section 4.1. Visual data queries require efficient first-class pixel processing operations as a core part of the query language. Like the Halide language [33], the query language will model high-performance pixel manipulation as pipelines of pure user-defined functions. However, to support expression of complex selection logic, the language augments existing Halide constructs with key relational algebra operations as well as logical predicates for both spatial and temporal relations. Whereas Halide programs operate only on a single data type (functions defined on multi-dimensional regular grids), the query pipelines expressed in this new query language operate on an array of object types generated as a result of query processing, including collections, image patches, and feature descriptors.

To establish intuition about the type of queries this language must support, consider the pseudocode for the previously described bicycle traffic analysis task shown in the bottom-left of Figure 2. It proceeds as follows:

• The pipeline begins by selecting relevant videos from the database, and extracting patches by building pyramids of each video frame. This is (1) a traditional relational selection, based on metadata about the video, followed by (2) a relational join of video frames with the set of all patch locations to produce the set of all patches at all pyramid scales for all videos, and finally (3) image processing of the video pixels to produce the pyramid patch pixels.

• Next, the pipeline performs a selective search process on those patches' pixel data to filter out patches less likely to contain objects of interest [41]. This is a combination of image processing on patch pixels to compute an "objectness" score, followed by relational selection of patches based on a score threshold.

• The pipeline then warps the selected patches to a standard input size for a convolutional neural network, applies the network to each patch to compute mid-level features, and applies a linear classifier to those mid-level features. Each of these operations is a traditional image processing operation on an individual patch; together, they form a traditional image processing pipeline.

• The pipeline next uses the classifier response to filter the patch set to only those containing bicycles, and joins back to the metadata of the original videos from which the patches came. This is a relational selection of patches based on classifier response, followed by a relational join of patches to video metadata.

• Finally, the pipeline geospatially joins bike matches to the city blocks where they were filmed, and aggregates the detection counts by block ID. This is a composition of relational and spatial query operations.
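As a rough illustration, the stages above can be sketched as alternating relational operations (comprehensions here) and per-patch pixel functions. All detectors, scores, and data values below are toy stand-ins, not the proposed system's API:

```python
# Sketch of the Figure 2 pipeline as alternating relational and
# pixel-level stages. Detectors and "patches" are toy stand-ins.

videos = [
    {"id": "v1", "city": "SF", "block": 12, "frames": [[0.2, 0.8]]},
    {"id": "v2", "city": "NY", "block": 7,  "frames": [[0.9, 0.1]]},
]

def objectness(patch):       # stand-in for selective-search scoring
    return patch

def bike_classifier(patch):  # stand-in for CNN features + linear classifier
    return patch

# (1) relational selection on metadata, (2) join of frames x patch locations
sf_videos = [v for v in videos if v["city"] == "SF"]
patches = [(v, p) for v in sf_videos for f in v["frames"] for p in f]

# pixel processing to score patches, then relational selection by score
interesting = [(v, p) for (v, p) in patches if objectness(p) > 0.5]

# classifier response, filter to bike detections; the video metadata row
# is carried along, playing the role of the join back to video metadata
bikes = [(v, p) for (v, p) in interesting if bike_classifier(p) > 0.5]

# geospatial join + aggregation of detection counts by city block
counts = {}
for v, _ in bikes:
    counts[v["block"]] = counts.get(v["block"], 0) + 1
```

The alternation is the key point: pixel functions and relational operators interleave freely, which is exactly what a unified query language must express.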

We now sketch additional examples of visual queries that we will consider during the design of the visual query language. Below we use a SQL-like syntax to depict the queries in order to highlight the relational aspects of these operations. However, we anticipate that a functional language syntax will be a more suitable final form of the visual query language.

Selecting "similar" images in feature space. One common data analysis operation is to query a database for "similar" images as defined in a desired feature space. For example, the following query requests the top 10 neighbors in feature space for any images containing a cat:

catImages = select * from images where catDetector(image) > threshold
top10related = select * from images a, catImages b
               order by abs(myFeatureTransform(a.pixels) - myFeatureTransform(b.pixels))
               limit 10

This query first selects images containing detections of an object category (cats), using a predefined pipeline (catDetector) which could be internally similar to the bike detection pipeline described previously. The query then constructs the cross product of all images containing cats with all other images, using a join. For each of these image pairs, it then projects the images into a feature space using myFeatureTransform, a complete image processing pipeline which takes image pixels and returns a feature vector description of the image. (For example, this pipeline might evaluate the primary layers of a deep convolutional neural network.) Finally, the query sorts the image pairs by their distance in feature space and selects the top 10.
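A minimal Python analogue of this query, with a toy detector and a scalar feature transform standing in for catDetector and myFeatureTransform (all values hypothetical):

```python
# Toy analogue of the "similar images" query: select cat images,
# cross-join against all other images, rank by feature-space distance.
# The detector and feature transform are illustrative stand-ins.

images = [{"id": i, "pixels": float(i)} for i in range(20)]

def cat_detector(img):
    return 1.0 if img["pixels"] % 5 == 0 else 0.0

def feature(px):           # stand-in for, e.g., early CNN layers
    return px * 0.5

threshold = 0.5
cat_images = [img for img in images if cat_detector(img) > threshold]

# cross product (excluding self-pairs), ordered by |f(a) - f(b)|, limit 10
pairs = [(a, b) for a in images for b in cat_images if a["id"] != b["id"]]
pairs.sort(key=lambda ab: abs(feature(ab[0]["pixels"])
                              - feature(ab[1]["pixels"])))
top10 = pairs[:10]
```

In the real system, the expensive part is that feature() runs inside the join, which is why the query compiler must schedule the pixel pipeline and the relational operators together rather than treating feature() as a black box.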

Selecting images based on spatial relationships between detected objects. Another common visual analysis technique looks for spatial relationships between objects in an image. We could find all images where a helmet appears above a bicycle with the following query:

helmets = select bbox, imageID from pyramid(images) where helmetDetector(pixels) > threshold
bikes = select bbox, imageID from pyramid(images) where bikeDetector(pixels) > threshold
select b.imageID from helmets h, bikes b where (h.bbox above b.bbox) and (h.imageID = b.imageID)

This query first finds the bounding boxes and image IDs of pyramid patches where either the helmet detector or the bike detector fires. These detectors have been separately defined as pipelines of image- and feature-space computations starting from image patch pixels.
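One plausible semantics for the "above" predicate, sketched here as a simple bounding-box comparison with y growing downward as in image coordinates (the predicate definition and all data are hypothetical):

```python
# Sketch of a spatial join on detections: an "above" predicate over
# bounding boxes (x, y, w, h), with y growing downward (image space).

def above(a, b):
    """True if box a lies entirely above box b, i.e. a's bottom edge
    is higher on screen than b's top edge."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ay + ah <= by

helmets = [{"imageID": 1, "bbox": (40, 10, 20, 20)},
           {"imageID": 2, "bbox": (40, 90, 20, 20)}]
bikes   = [{"imageID": 1, "bbox": (30, 50, 60, 40)},
           {"imageID": 2, "bbox": (30, 50, 60, 40)}]

# join on imageID, filtered by the spatial predicate
matches = [h["imageID"]
           for h in helmets for b in bikes
           if h["imageID"] == b["imageID"] and above(h["bbox"], b["bbox"])]
```

Richer spatial predicates (overlap, containment, distance thresholds) would follow the same pattern: a pure function over bounding-box tuples used as a join condition.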

Iteratively bootstrapping a classifier based on strong detections. Finally, our query system should also support visual-relational queries coupled to non-relational computations. One example in contemporary visual analysis which requires non-relational computation for machine learning is using iterative semi-supervised learning to refine a classifier with new data [5]. To refine a car detector from a new batch of unlabeled images, we could run a query to drive a non-relational training algorithm:

newImages = select image from images where dateAdded > lastTraining
patches = select pyramid(image) for image in newImages
interestingPatches = selectiveSearch(patches)
strongDetections = select patch from interestingPatches where carDetector(patch) > threshold

while (!converged) {
  // iterative map-reduce over query result set:
  map (strongDetections) {
    // compute error
    ...
  } reduce {
    // update detector
    ...
  }
}

In this example, we first run a complex query to find strong image patches to add to the training set, and then consume the results of this query in a non-relational iterative map-reduce over the result set to update the classifier with the new examples. The query first selects new images which have been added since the last training. It then builds a pyramid of patches over these new images and performs selective search to initially prune the set, as in our other detection pipelines. Next it runs the existing car detector over each interesting patch, and selects patches where the detections are above a high threshold (strong matches). Finally, this result collection is fed through the non-relational distributed training process to update the detector with the new strong examples.
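A toy, fully concrete version of this loop: the query step selects strong detections, then a map (per-example errors) and reduce (aggregate update) iterate until the 1-D "detector" weight stabilizes. The data, learning rate, and convergence test are all illustrative:

```python
# Toy version of the bootstrap loop: strong detections from the query
# feed an iterative map/reduce that nudges a 1-D detector weight toward
# the mean of the new strong examples. Purely illustrative.

patches = [0.2, 0.7, 0.9, 0.95, 0.3, 0.85]
threshold = 0.8
strong = [p for p in patches if p > threshold]   # query: strong detections

weight = 0.0            # "detector" state being refined
for _ in range(50):     # while (!converged), bounded for the sketch
    # map: per-example error against the current detector
    errors = [p - weight for p in strong]
    # reduce: aggregate the errors into a single update step
    step = sum(errors) / len(errors)
    weight += 0.5 * step
    if abs(step) < 1e-6:
        break
```

The structural point carries over to real training: the relational query produces the example set once, and the non-relational map/reduce then iterates over it in place rather than re-running the query each epoch.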

4.3 Scheduling Visual Data Analysis Queries

To execute the proposed visual data query language efficiently across large image collections, we must develop a query compiler that unifies techniques for efficiently scheduling functional image processing with methods for implementing relational algebra operations. A naive approach, starting from a generic relational database system, would model pixel operations as black-box external functions to be run at the leaves of the relational operations, e.g., a user-defined function applied by the database independently to individual blobs of pixel data. This is the approach taken by most traditional database systems when confronted with a need for new per-element computations. However, this solution is insufficient to efficiently implement our visual query language, since queries may involve interleaved image-processing and relational operations (recall the queries from Section 4.2).

Figure 3: (top) Execution strategies for image processing computations expressible by Halide's schedule algebra: breadth first (each function is entirely evaluated before the next one), total fusion (values are computed on the fly each time they are needed), sliding window (values are computed when needed, then stored until no longer useful), tiles (overlapping regions are processed in parallel; functions are evaluated one after another), and sliding windows within tiles (tiles are evaluated in parallel using sliding windows). These choices decompose the space of pixel operations in different ways to control the order and granularity of intermediate results between stages. Even simple pipelines exhibit a rich space of scheduling choices, each expressing its own trade-off between parallelism, locality, and redundant recomputation of shared values, determined by how intermediate results are handled. (bottom) Relational algebra operations (select σ, project π, join ⨝) over relations. Traditional relational query planning deals only with these operations over relations. The key challenge in our visual query language will be unifying the data, computation, and scheduling models for these two worlds.

In contrast to the black-box external function approach, SciDB's "computational database" model instead tries to extend relational algebra with numerical array processing operations [7]. However, for visual analysis tasks, we believe the critical bottleneck is in pixel-level processing, not relational operations, and that the space of transformations necessary to express state-of-the-art image processing operations is too large and complex to easily graft into a primarily relational processing system. We therefore propose to take the opposite approach: we will start from the functional Halide [33] language, which is already capable of expressing highly optimized, state-of-the-art image processing pipelines, and extend it with relational concepts.

The fundamental challenge, of course, is defining a new scheduling algebra that admits composition of scheduling transformations over Halide functions on regular, grid-structured numerical data with relational algebra operations on sets and relational data types (Figure 3). Efficient scheduling was the singular challenge in the design and implementation of the Halide system (without it, Halide programs make poor use of throughput-processing resources), and in large-scale image analysis the need for efficient scheduling is even greater. The reason is that complex visual data analysis queries generate intermediate results that are both large (often much larger than the original data) and frequently very expensive to compute. Because of this, there is a tension between saving intermediate results for potential reuse in subsequent analyses and dynamically recomputing them to save storage space. In addition, there is tension between interleaving the computation and consumption of intermediate values at very fine granularity (to exploit producer-consumer locality so as to keep values resident in cache, in memory on a single node, or perhaps across all nodes in a datacenter), and aggregating coarse-grained tasks and values to preserve coherent batch execution. To fully comprehend and globally optimize intermediate computations and the management of derived data in these queries, scheduling operations for relational and image processing computations and data must be unified, affording the flexibility to optimize queries across the boundary between relational and pixel/numerical operations. A unified treatment of scheduling will allow optimization decisions to be applied at all system scales, ranging from orchestration of data in the L1 cache of a single core (where Halide focuses now) to the potential materialization of petabytes of storage across a whole cluster.
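The materialize-versus-recompute tension is visible even in a two-stage 1-D blur like the one diagrammed in Figure 3. This sketch contrasts two schedules for the same pipeline (names and sizes illustrative): breadth-first, which stores the full intermediate array, and total fusion, which stores nothing and recomputes shared intermediate values per output point.

```python
# Sketch of two schedules for a two-stage 1-D blur (cf. Figure 3):
# breadth-first materializes the full intermediate; total fusion
# recomputes intermediate values per output point, trading storage
# and locality against redundant work. Both yield identical results.

def clamp(i, n):
    return max(0, min(n - 1, i))

def blurx(img, i):
    n = len(img)
    return (img[clamp(i - 1, n)] + img[i] + img[clamp(i + 1, n)]) / 3

inp = [float(i) for i in range(8)]
n = len(inp)

# Schedule 1: breadth-first -- blurx fully evaluated, then the output.
tmp = [blurx(inp, i) for i in range(n)]
out_bf = [(tmp[clamp(i - 1, n)] + tmp[i] + tmp[clamp(i + 1, n)]) / 3
          for i in range(n)]

# Schedule 2: total fusion -- no intermediate array; each output point
# recomputes the three blurx values it needs (shared values are
# recomputed rather than stored).
out_fused = [(blurx(inp, clamp(i - 1, n)) + blurx(inp, i)
              + blurx(inp, clamp(i + 1, n))) / 3
             for i in range(n)]
```

The proposed query compiler must make this same choice not just between pixel stages, but across relational operators too, where the "intermediate" may be a petabyte-scale result set rather than a scanline.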

4.4 Integrating Non-relational Computation for Machine Learning

Applications built on relational database systems often require additional logic outside of the relational query language. A common case we expect to see in visual analysis is training an advanced machine learning system, like a deep neural network, on the results of a query (e.g., training a bike helmet detector as discussed in our urban planning example). Some regression and classification algorithms have been expressed in relational systems [21], but state-of-the-art neural network training using back-propagation and stochastic gradient descent [35] is not natural to implement within a relational system. Furthermore, real users already rely on an ever-expanding wealth of existing libraries, built in traditional languages like C and CUDA, for these and other operations. We therefore cannot limit our system to only in-database operations, but must support integration of non-relational computations over the query result sets.

In traditional database systems this sort of non-relational computation is expressed in general-purpose programming languages, and tied to the database by consuming the results of queries through a database client library. Most often, this client library materializes a complete copy of the result data and streams it across the network to a single client machine. When consuming query results in non-relational application logic, the defining characteristics of visual data analysis are the sheer size of the result data and the amount of computation required to process it. We therefore need to provide facilities to push even non-relational computations, outside of our query language, towards the original and derived data, and to distribute this work across a cluster. Our system will support tight integration with downstream non-relational computation, much like the SparkSQL system exposes the results of relational queries as RDDs for further non-relational processing with Spark, in place on the cluster [44, 43].
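The contrast with the single-client model can be sketched as follows: query results stay partitioned where they were computed, the non-relational function runs once per partition, and only small summaries cross the network. The partitioning and the per-partition function below are hypothetical stand-ins:

```python
# Sketch: instead of shipping all query results to one client, run the
# non-relational function per partition (where the data lives) and move
# only small aggregates. The "cluster" here is just a list of partitions.

partitions = [list(range(0, 100)),
              list(range(100, 200)),
              list(range(200, 300))]

def train_step(rows):
    """Stand-in for per-partition non-relational work (e.g. computing
    gradient contributions); returns a small summary, not the rows."""
    return (sum(rows), len(rows))

# conceptually runs "on each node": only (sum, count) pairs cross the wire
summaries = [train_step(p) for p in partitions]

total = sum(s for s, _ in summaries)
count = sum(c for _, c in summaries)
mean = total / count
```

This is the same shape as consuming a relational result as partitioned RDDs in Spark: the heavyweight rows never leave their node, and the driver only sees the reduced summaries.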

5 Open-Source Prototype Implementation

We will implement the ideas described in the prior subsections as an open-source visual database platform. The proposed database will use OpenStack's Object Store [31] as a durable backing store for database contents, enabling deployment of the database runtime on both private clusters and commercial cloud platforms. We will develop a compiler for the proposed visual data processing/query language in Terra/Lua [12], leveraging Halide [33] (for image processing kernels) and the Legion parallel runtime system [3] as compilation targets.

We have selected Legion as a low-level parallel runtime environment because it affords the potential for highly efficient parallel implementation across a variety of machine architectures. First, Legion's logical region abstraction, which encapsulates in-memory collections using a relational model of data, is well aligned with the concept of query result sets in the proposed visual query language. Second, the Legion runtime system provides a distributed scheduler for launching and managing parallel computations across a heterogeneous collection of processing elements, including multiple machines, multi-core CPUs, and GPUs. (We anticipate platforms featuring high-density compute capability via GPUs and Xeon Phi accelerators will be attractive execution platforms for visual data analytics.) Last, although Legion provides the service of managing communication and parallel execution in a heterogeneous distributed system, it leaves full responsibility for determining how database operations should be decomposed and orchestrated to our compiler and runtime system. Thus, the database will be able to make efficient scheduling decisions based on high-level knowledge of query structure.

Our implementation will also use the Legion runtime to provide applications with simple mechanisms to launch large distributed computations on the results of database queries (Section 4.4). For example, our initial prototype will provide applications the ability to perform distributed training of deep neural networks using Caffe [22].

Deployment as a Cloud Service

To aggressively exercise our proposed designs, as well as to make the system available to the broader community at scale, in the second half of the project we will work with collaborators at Google (see the attached letter of support from Solomon Boulos, with whom both PIs have collaborated successfully in the past: five joint publications since 2009) to deploy the Visual Computing Database runtime as a service on the Google Cloud Platform. This deployment will be run by the PIs (it is not an official Google service), but its deployment will mimic how existing services such as Google Cloud SQL and Datastore are available to applications running on Google App Engine [18].

To alleviate the burden of massive data ingest for users of the service, we will pre-populate the database service with several large-scale datasets widely used in computer vision research, including the full ImageNet dataset [11], the Flickr Creative Commons dataset [1], and the Microsoft COCO dataset [28]. In addition to original source image and video data, we will also store materialized copies of commonly used derived quantities such as image features and object bounding boxes.

The deployment of the Visual Computing Database as a service (with common datasets held in a centralized location) provides new opportunities for the visual computing community to efficiently collaborate and share processing and results. For example, intermediate values generated by one research group (that may be expensive to compute) can be cached by the database and made available to other groups in need of similar computations. Further, a common database language, data model, and execution platform will facilitate code sharing and reuse and simplify evaluation of new analysis techniques. (Researchers can easily run other researchers' code!)

6 System Evaluation

No common benchmarks for visual data analytics currently exist, so to evaluate the proposed system we plan to implement a suite of visual query benchmarks in the new visual query language. We will draw our initial benchmarks from sources such as data analysis pipelines commonly used in the computer vision community [17, 41, 14, 5]. We will also seek to include benchmarks representative of queries needed by other domains such as urban traffic planning, digital forensics, and large-survey astronomy.

Using this benchmark suite, we will evaluate the scalability of the prototype database implementation running on the Google Cloud Platform under a variety of system operating conditions. We will measure scalability of queries with increasing platform size (number of nodes and cores per node), under varying database size (number of images/videos, number of features), and varying query complexity (complexity of feature computation, number of joins, etc.).

The PIs also have access to a research cluster at Carnegie Mellon with compute nodes containing both GPU and Xeon Phi accelerators. Using this cluster we will evaluate the performance of different processor architectures on our benchmark visual data analysis workloads. This evaluation will also study platform efficiency under varying database operating conditions.

Finally, to better understand visual analytics workloads (beyond merely those in our benchmark suite), we will develop a website that tracks usage statistics of queries run against the Google Cloud Platform database service. For example, the website will post statistics of queries (images touched, cycles used, joins performed, etc.) against the database for the general public.


References

[1] Yahoo Flickr Creative Commons 100M Dataset, 2015. Available at http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67.

[2] Amazon Web Services. Amazon Simple Storage Service, 2015. Available at http://aws.amazon.com/s3/.

[3] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 66:1–66:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

[4] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, August 2004.

[5] Xinlei Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1409–1416, December 2013.

[6] Edgar F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.

[7] P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, and S. Zdonik. A demonstration of SciDB: A science-oriented DBMS. Proc. VLDB Endow., 2(2):1534–1537, August 2009.

[8] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, 2005.

[9] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240. 2012.

[10] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205–220, New York, NY, USA, 2007. ACM.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[12] Zachary DeVito, James Hegarty, Alex Aiken, Pat Hanrahan, and Jan Vitek. Terra: A multi-stage language for high-performance computing. SIGPLAN Not., 48(6):105–116, June 2013.

[13] Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, Eric Darve, Juan Alonso, and Pat Hanrahan. Liszt: A domain specific language for building portable mesh-based PDE solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 9:1–9:12, New York, NY, USA, 2011. ACM.


[14] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What makes Paris look like Paris? ACM Trans. Graph., 31(4):101:1–101:9, July 2012.

[15] ESRI, Inc. ArcGIS for Server Product Information Page, 2015. Available at http://www.esri.com/software/arcgis/arcgisserver.

[16] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06. ACM, 2006.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587, June 2014.

[18] Google, Inc. Google App Engine home page, 2015. Available at https://cloud.google.com/appengine/.

[19] James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Trans. Graph., 26(3), July 2007.

[20] James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. Darkroom: Compiling high-level image processing code into hardware pipelines. ACM Trans. Graph., 33(4):144:1–144:11, July 2014.

[21] Joseph M. Hellerstein, Christopher Re, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. The MADlib analytics library: Or MAD skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, August 2012.

[22] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[23] Tim Kraska, Ameet Talwalkar, John C. Duchi, Rean Griffith, Michael J. Franklin, and Michael I. Jordan. MLbase: A distributed machine-learning system. In CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6–9, 2013, Online Proceedings, 2013.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[25] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Trans. Graph., 33(4):149:1–149:11, July 2014.

[26] Jean-Francois Lalonde, Derek Hoiem, Alexei A. Efros, Carsten Rother, John Winn, and Antonio Criminisi. Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), August 2007.

[27] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, October 2014. USENIX Association.


[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer International Publishing, 2014.

[29] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision,60(2):91–110, November 2004.

[30] Neo Technology, Inc. The Neo4j Manual v2.2.0-M04, 2015. Available at http://neo4j.com/docs/milestone/.

[31] OpenStack.org. OpenStack home page, 2015. Available at https://www.openstack.org/.

[32] P. Perona. Vision of a Visipedia. Proceedings of the IEEE, 98(8):1526–1534, August 2010.

[33] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Fredo Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4):32:1–32:12, July 2012.

[34] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.

[35] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, (323), 1986.

[36] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Modeling the world from internet photo collections. Int. J. Comput. Vision, 80(2):189–210, November 2008.

[37] Michael Stonebraker and Lawrence A. Rowe. The design of POSTGRES. SIGMOD Rec., 15(2):340–355, June 1986.

[38] Justin Talbot, Zachary DeVito, and Pat Hanrahan. Riposte: A trace-driven compiler and parallel VM for vector code in R. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 43–52, New York, NY, USA, 2012. ACM.

[39] Teradata, Inc. Teradata Aster SQL-GR Product Page, 2015.

[40] Twitter. FlockDB GitHub Repository, 2015. Available at https://github.com/twitter/flockdb.

[41] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.

[42] Keith Wiley, Andrew J. Connolly, Jeffrey P. Gardner, K. Simon Krughoff, Magdalena Balazinska, Bill Howe, YongChul Kwon, and Yingyi Bu. Astronomy in the cloud: Using MapReduce for image coaddition. CoRR, abs/1010.1015, 2010.

[43] Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. Shark: SQL and rich analytics at scale. In SIGMOD, June 2013.

[44] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.


[45] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.
