
TouchPaper – An Augmented Reality Application with Cloud-Based Image Recognition Service

Feng Tang, Daniel R. Tretter, Qian Lin

HP Laboratories
HPL-2012-131R1

Keyword(s): image recognition; cloud service; augmented reality

Abstract: Augmented reality applications are increasingly used to enhance physical objects with digital information. In this paper, we present a TouchPaper system that consists of a mobile application and a cloud-based image recognition service. Through an authoring system, images can be linked to online content that will be invoked when a printed version of the image is captured and "touched" on a mobile phone screen. We describe a sBIP (Structured Binary Intensity Pattern) algorithm that performs the matching of the images and evaluate the performance of the overall TouchPaper system.

External Posting Date: September 6, 2012 [Fulltext] Approved for External Publication
Internal Posting Date: September 6, 2012 [Fulltext]
Published in ICIMCS 2012: 4th International Conference on Internet Multimedia Computing and Service

Copyright ICIMCS 2012: 4th International Conference on Internet Multimedia Computing and Service

TouchPaper – An Augmented Reality Application with Cloud-Based Image Recognition Service

Feng Tang Hewlett Packard Laboratories

1501 Page Mill Rd. Palo Alto, CA, 94304

1(650)857-1501

[email protected]

Daniel R. Tretter Hewlett Packard Laboratories

1501 Page Mill Rd. Palo Alto, CA, 94304

1(650)857-1501

[email protected]

Qian Lin Hewlett Packard Laboratories

1501 Page Mill Rd. Palo Alto, CA, 94304

1(650)857-1501

[email protected]

ABSTRACT

Augmented reality applications are increasingly used to enhance physical objects with digital information. In this paper, we present a TouchPaper system that consists of a mobile application and a cloud-based image recognition service. Through an authoring system, images can be linked to online content that will be invoked when a printed version of the image is captured and “touched” on a mobile phone screen. We describe a sBIP (Structured Binary Intensity Pattern) algorithm that performs the matching of the images and evaluate the performance of the overall TouchPaper system.

Categories and Subject Descriptors
J.7 [Computer Applications]: Computers in Other Systems - Publishing

General Terms
Management, Design, Experimentation.

Keywords

multimedia, digital photography, image analysis, augmented reality.

1. Introduction
Barcodes are gaining popularity as a way to link physical objects to digital content. This is largely driven by the rapid growth of smartphones with cameras. More sophisticated technologies based on image recognition have been developed to compute image features of the physical object and match them to images in a database to provide the linking. Not using explicit barcodes dramatically improves the flexibility and visual appeal of the experience. In addition, augmented reality applications can be developed that track image features and adapt the display of digital augmentations accordingly. One such example is the Aurasma application.

To create the best user experience, the matching of captured scenes with target images needs to be as fast as possible. For this, we need compact image features with low computational complexity. In this paper, we describe an efficient image feature, called sBIP (Structured Binary Intensity Pattern). This feature descriptor is highly efficient to compute, as it only involves pixel intensity comparisons, and highly compact, as the descriptor is formed by a set of binary comparison outputs. These properties make it very well suited to mobile applications.

As an application of the proposed descriptor, we describe the concept of TouchPaper, which crafts more compelling and richer interactive media experiences directly from traditional printed materials. Printed materials such as books, magazines, reports and marketing collateral are often created from digital components. While it is straightforward to embed static content (e.g. text, photos, and illustrations), it is difficult to embed dynamic content (e.g. audio, video, and animation) into these materials. Thus, such dynamic content is left out when prints are created from digital content, and the resulting collateral lacks any direct mapping to more dynamic material. With the proposed TouchPaper system, the printed content (e.g. greeting card, photobook, etc.) itself serves as a visual trigger: when placed in front of a device’s camera, it brings up an interactive interface where the user can click on regions of interest for more information. We created an authoring interface so that the user can upload their photos and specify the active regions and their associated links (for example, the corresponding Facebook page of the photo for comments). The interactive interface and information can also be generated automatically during the document creation process.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICIMCS’12, September 9–11, 2012, Wuhan, Hubei, China. Copyright 2012 ACM 978-1-4503-1600-2/12/09…$10.00.

2. Structured Binary Intensity Pattern (sBIP)
One of the core problems of an image recognition system is how to represent the image. We use a local feature based approach where the image is represented by a set of sparse local keypoints. The feature ensemble collectively captures the most salient parts of the image. The local feature has two components: a feature detector and a feature descriptor. The feature detector locates distinctive keypoints in an image, such as corners or edges. A descriptor is usually computed from a local patch centered at each keypoint to describe the local visual appearance. This descriptor is designed to be invariant to perspective changes, illumination changes, JPEG compression, and other common distortions. We use the FAST (Features from Accelerated Segment Test) [3] corner detector for feature detection due to its low computational complexity. The popular SIFT descriptor [4] is effective in image matching and recognition, but its high computational complexity prevents it from being useful in mobile applications. SURF [5] is a significant improvement in terms of speed compared to SIFT, but it is still not fast enough for near-realtime mobile recognition. BRIEF [6] compares randomly selected pixels to generate a descriptor for image matching, which is very fast to compute. However, the random selection of the pixels ignores the structure of the image patch, which makes it less robust. The OSID feature descriptor [7] is designed to be invariant to brightness changes by using ordinal pixel information, but the ordering of pixels makes it less efficient. In this paper, we develop a feature descriptor that is very robust to image changes while remaining highly efficient to compute. Unlike OSID, which compares a pixel to a set of intensity values (ordinal bin boundaries), our approach compares a pixel to only one intensity value, relaxing the ordering constraint while retaining the invariance property.

The proposed feature descriptor divides an image patch into structured subpatches and compares each subpatch to a pre-defined set of anchor points to form a binary descriptor for the patch. We call our descriptor the Structured Binary Intensity Pattern (sBIP). More specifically, given an image patch centered at a keypoint, the proposed descriptor is computed as follows:

1) Anchor point selection: we select a set of K (for example, 16) anchor points within the patch (shown as orange dots in Figure 1).

2) For each anchor point, a small subpatch centered on that anchor point is selected and the average/sum of the pixels in this subpatch is computed.

3) Divide the image patch into n×m (for example, 4×4) subpatches, as shown in the figure below.

4) Compare every subpatch to each of the anchor point patches. If it is brighter, output 1; otherwise, output 0. For each anchor point, this results in an n×m sequence of 0s and 1s.

5) Concatenate the binary sequences formed from all the anchor points into a long binary vector as the descriptor for this patch.

This process is illustrated in Figure 1 and sketched in the code below. The final feature dimension will be n×m×K = 4×4×16 = 256. Compared to the SURF descriptor [5], our feature is almost 10 times faster to compute while requiring 1/8 the storage space.
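As a concrete illustration of these steps, here is a minimal Python/NumPy sketch of the descriptor computation. It is not the authors' implementation: the anchor-point layout (a regular grid), the subpatch sizes, and all parameter names are our assumptions.

```python
import numpy as np

def sbip_descriptor(patch, n=4, m=4, num_anchors=16, r=2):
    """Minimal sBIP sketch: compare each of the n*m subpatch means
    against the mean intensity around each anchor point."""
    h, w = patch.shape
    # 1) Anchor points: assumed here to lie on a regular grid inside the patch.
    side = int(np.sqrt(num_anchors))
    ys = np.linspace(r, h - r - 1, side).astype(int)
    xs = np.linspace(r, w - r - 1, side).astype(int)
    # 2) Average intensity of a small subpatch centered on each anchor point.
    anchor_means = [patch[y - r:y + r + 1, x - r:x + r + 1].mean()
                    for y in ys for x in xs]
    # 3) Divide the patch into n x m structured subpatches and take their means.
    sh, sw = h // n, w // m
    sub_means = [patch[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw].mean()
                 for i in range(n) for j in range(m)]
    # 4)-5) Binary comparisons, concatenated into an n*m*K-bit descriptor.
    bits = [int(s > a) for a in anchor_means for s in sub_means]
    return np.packbits(bits)  # 256 bits packed into 32 bytes

# Usage: describe a 32x32 grayscale patch around a FAST keypoint
# (keypoint detection and patch extraction are omitted here).
patch = np.random.randint(0, 256, (32, 32)).astype(np.float32)
desc = sbip_descriptor(patch)  # 32-byte binary descriptor
```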

Our feature descriptor has the following advantages:

1. It is highly efficient to compute because all computations only involve integer comparisons.

2. It is highly compact in that the descriptor is a binary string rather than a vector of floating-point values. This makes it efficient for storage and transmission.

3. It is invariant to arbitrary monotonically increasing brightness changes.

Figure 1. Our feature descriptor computation

Figure 2. An example of matching feature points between the captured print image and the digital image.

3. Fast image matching using sBIP
When the user captures a photo of a printed page, features are extracted and sent to the server to be matched against a database of images. Since there may be a large number of photos in the database, the naïve approach of matching the query image to each of the database images using bipartite matching is often too slow for a quick response. Instead, we use a bag-of-features approach [8] for fast search. In addition, we use geometric verification [9] as an additional step to filter incorrectly matched images out of the top-ranked photos to increase the recognition accuracy. More specifically, the algorithm works as follows:

Feature extraction: for each photo in the database, sBIP features are extracted. For a VGA resolution photo, 200-250 features are detected.

Feature clustering: all the features detected across the database images are aggregated and clustered to form the codewords of the bag-of-features representation. These codewords are computed using an approximate K-means approach, which accelerates traditional K-means with a randomized KD tree algorithm [10]. In each iteration of traditional K-means, the distances between all the features and each of the cluster centers are computed; this is the major bottleneck of the clustering. In each iteration of the approximate K-means, a randomized tree structure is instead constructed over the cluster centers, so that the distance computation between a feature descriptor and the cluster centers can be performed very efficiently. These clusters form the visual codewords for all the images in the database. In our experiments, we fix the number of clusters to 2000.
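To make the clustering step concrete, the sketch below approximates K-means by rebuilding a KD-tree over the current cluster centers in each iteration. It uses scikit-learn's exact KDTree rather than the randomized KD-tree forest of [10], and the parameter values are assumptions, so this is an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KDTree

def approximate_kmeans(features, k=2000, iters=10, seed=0):
    """Approximate K-means: assign features to their nearest center via a
    tree built over the k centers, then recompute the centers."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(np.float64)
    for _ in range(iters):
        # Querying the tree replaces the exhaustive feature-to-center
        # distance computation that dominates plain K-means.
        tree = KDTree(centers)
        assign = tree.query(features, k=1, return_distance=False).ravel()
        for c in range(k):
            members = features[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers  # the visual codewords
```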

Bag-of-features representation: for each image in the database, and also for the query image, the extracted features are compared with the codewords. Each feature is assigned to its nearest codeword to form the term frequency histogram. This histogram is a fixed-length sparse representation of the feature statistics. To make it more robust, an inverse document frequency (idf) weighting [11] is applied to the term frequencies, with the intuition that more frequent features are less useful for the search.
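A minimal sketch of this step, again using a KD-tree built over the codewords for the nearest-codeword assignment (the function and variable names are ours):

```python
import numpy as np

def bag_of_features(features, codeword_tree, idf, k=2000):
    """Quantize descriptors to their nearest codeword and build an
    idf-weighted term-frequency histogram."""
    words = codeword_tree.query(features, k=1, return_distance=False).ravel()
    tf = np.bincount(words, minlength=k).astype(float)
    return tf * idf  # down-weight codewords that appear in many images

# idf would be precomputed over the database, e.g. log(N / (1 + n_i)) for
# codeword i, where N is the number of database images and n_i the number
# of images containing codeword i (one common convention, assumed here).
```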

Image search: for a query image, the term frequency vector is computed and compared with the database vectors to obtain the top K most similar images. This is done using the Euclidean distance measure. In our experiments, we set K to 5.
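For a database of this size (2000 images), the search can be done by brute force; one possible sketch, with names of our choosing:

```python
import numpy as np

def search(query_hist, db_hists, top_k=5):
    """Return the indices of the top_k database images whose bag-of-features
    histograms are closest to the query under Euclidean distance."""
    dists = np.linalg.norm(db_hists - query_hist, axis=1)
    return np.argsort(dists)[:top_k]
```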

Geometric verification: after the top K images are found, a geometric verification step is applied to check the consistency of the feature distributions between the query and each candidate image. RANSAC [12] is used to estimate the homography transformation and count the number of inlier matches.
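With OpenCV, this verification can be sketched as follows; the reprojection threshold and the minimum-inlier acceptance rule are our assumptions, not settings reported in the paper.

```python
import numpy as np
import cv2

def geometric_verification(query_pts, cand_pts, min_inliers=15):
    """Fit a homography between matched keypoint coordinates with RANSAC
    and accept the candidate image if enough matches are inliers."""
    if len(query_pts) < 4:
        return False, None, 0  # a homography needs at least 4 correspondences
    H, mask = cv2.findHomography(np.float32(query_pts), np.float32(cand_pts),
                                 cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return inliers >= min_inliers, H, inliers
```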

4. TouchPaper System
We implemented the sBIP algorithm in a hybrid mobile cloud system called TouchPaper. Starting with an image, we use the TouchPaper online portal to mark up the regions on the image for augmentation, as well as to specify the associated URLs. After the image is printed, the TouchPaper mobile app captures the image, runs sBIP, and sends the features to the TouchPaper online server. The online server performs feature matching, and highlights the marked regions on the mobile device screen. The user can touch each of the highlighted regions to view the online content.

Our system consists of two major components, as shown in Figure 3: the image recognition cloud service that recognizes photos captured by the mobile device, and the mobile client that enables the interactive experience. On the mobile client, when the user captures an image of a print, a fingerprint is extracted and sent to the server for recognition. The server maintains a database of fingerprints and the links associated with each photo. Once the query fingerprint is recognized, additional content such as active regions and URLs is sent back to the client. The user can then click hotspots to view linked content. The online service software is a custom-built, high-performance, in-memory database that runs in the cloud. The search system is highly scalable and efficient, allowing large reference libraries. It also supports real-time workflows to add and remove signatures from the database.
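The paper does not specify the wire format between the client and the recognition service, but the round trip described above might carry a payload of roughly the following shape; every field name and value here is purely illustrative.

```python
# Hypothetical response from the recognition service once a query
# fingerprint is matched (field names and values are illustrative only).
response = {
    "image_id": "photobook_page_017",        # matched database image
    "homography": [[1.02, 0.01, -14.3],      # 3x3 mapping from the captured
                   [0.00, 0.98, 22.1],       # view to the original image
                   [0.00, 0.00, 1.00]],
    "active_regions": [                      # hotspots from the authoring portal
        {"rect": [120, 80, 300, 210], "url": "https://www.facebook.com/..."},
        {"rect": [340, 250, 560, 400], "url": "https://www.youtube.com/..."},
    ],
}
```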

Figure 3. TouchPaper System Architecture

5. User interface
Our system has two main user interfaces: one is the mobile client (iPhone app) where the user can point to the print and access the additional content; the other is the authoring portal where the user can upload their own photos and specify the active regions and their corresponding links.

5.1 Mobile interface
One of the design principles of the user interface is simplicity. The user points the smartphone at a photobook page or collage and the system recognizes the page. The recognition process happens in the cloud, and the image ID, together with the estimated homography, is transmitted to the client. Active regions associated with the page are highlighted with transparent overlays, as shown in Figure 4. The user can then click within each region to see the additional information (for example, comments on Facebook). In this process, the homography is used to map the user-clicked coordinates to the original image space. After browsing the augmented content, the user can return to the captured page and click other regions if desired. The user can return to the capture mode by closing the current capture using the icon at the top right corner, as shown in Figure 4.
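Mapping a tap on the live view back into the original image space is a single perspective transform with the estimated homography; a small OpenCV sketch (the function choice and names are ours):

```python
import numpy as np
import cv2

def map_touch_to_image(touch_xy, H):
    """Project a tapped screen coordinate into the original image space
    using the homography returned by the recognition service."""
    pt = np.float32([[touch_xy]])                 # shape (1, 1, 2) as OpenCV expects
    mapped = cv2.perspectiveTransform(pt, np.float32(H))
    x, y = mapped[0, 0]
    return x, y  # coordinates in the original image, used to hit-test the active regions

# Example: x, y = map_touch_to_image((240, 410), H)
```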

Figure 4. User interface for the TouchPaper mobile client.

5.2 Authoring portal
We also provide an authoring portal where the user can upload their own photos and specify the active regions and their links. An example of this authoring process is shown in Figure 5. This online portal also supports addition and deletion of existing images from the database, as well as addition, removal, and editing of the active regions and their links. These links can be arbitrary URLs on the web, such as a Facebook link for the photo so that the user can read comments, or a YouTube video associated with the photo.

Figure 5. An example of a photobook image (left), and the regions marked for detection, as well as the associated URLs (right).

6. Results
We conducted extensive testing of the computation time of the sBIP algorithm, the computation time of the TouchPaper system in a realistic mobile environment, and the accuracy of the matching. We discuss our results in this section.

6.1 sBIP Computation Time and Feature Size
Since the sBIP algorithm only involves additions and subtractions, it is very fast to compute. Table 1 shows the performance of the sBIP algorithm compared with SURF on extracting features and matching two VGA resolution images.

Table 1: sBIP performance compared with SURF for 640x480 images on a 2.8 GHz CPU

                  Feature extraction (ms)   Matching (ms)   Total (ms)
sBIP feature      3.5                       13.8            17.3
SURF (64)         143.5                     19.7            163.2

Another advantage of sBIP is that the feature size is very compact. Since the descriptor is a 256-bit binary string, it takes only 32 bytes to describe a feature, compared with 256 bytes for SURF.

6.2 TouchPaper System Performance
We tested with a database of 2000 images by computing their sBIP features and storing them on the server. 101 images in this database were printed out as 4”x6” prints and captured with our TouchPaper iPhone client. With the combination of global features, sBIP features, and verification, we were able to correctly match 97 images with the corresponding images in the database. The accuracy is 96.03%, with 0% false positives.

The average system speed on an iPod Touch 4 is as follows: feature detection: 0.12 s; feature description: 0.11 s; transmission plus matching: 1.10 s; total time: 1.33 s. This fast response time makes for a very good user interaction experience.

6.3 Lighting Variation Tests
One of the challenges of augmented reality applications is the unknown lighting environment. Figure 6(a) shows the original image in the database. Figure 6(b) shows the image of the printout captured by the iPhone. Figure 6(c) shows the same image captured under low light conditions. Figures 6(d), (e), and (f) show three examples of correct image matching by our TouchPaper system under low lighting conditions.

Figure 6: Lighting variation tests.

7. Conclusion
In this paper, we proposed a system called “TouchPaper” which enables multiple active regions on a printed page such that, when the page is viewed using a mobile device, the user can click on different regions on the screen to access augmented content. This system is demonstrated through an iPhone app coupled with a cloud computing infrastructure.

References

[1] J. He, et al. 2011. Mobile Product Search with Bag of Hash Bits. In Proceedings of ACM Multimedia, 839-840.

[2] Q. Liu, et al. 2010. Embedded media markers: marks on paper that signify associated media. In Proceedings of the 15th International Conference on Intelligent User Interfaces (IUI '10), ACM, New York, NY, USA, 149-158.

[3] E. Rosten and T. Drummond. 2006. Machine learning for high-speed corner detection. In Proceedings of ECCV, Part I, Springer-Verlag, Berlin, Heidelberg, 430-443.

[4] D. G. Lowe. 1999. Object Recognition from Local Scale-Invariant Features. In Proceedings of ICCV, Vol. 2, IEEE Computer Society, Washington, DC, USA, 1150-.

[5] H. Bay et al. 2008. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110, 3, 346-359.

[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. 2010. BRIEF: Binary Robust Independent Elementary Features. In Proceedings of ECCV, Springer-Verlag, Berlin, Heidelberg, 778-792.

[7] F. Tang, S. H. Lim, N. L. Chang, and H. Tao. 2009. A novel feature descriptor invariant to complex brightness changes. In Proceedings of CVPR, 2631-2638.

[8] J. Sivic and A. Zisserman. 2003. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of ICCV, Vol. 2, 1470.

[9] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of CVPR, 1-8.

[10] C. Silpa-Anan and R. Hartley. 2008. Optimised KD-trees for fast image descriptor matching. In Proceedings of CVPR, 1-8.

[11] K. Sparck Jones. 1988. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems, Peter Willett (Ed.), Taylor Graham Series in Foundations of Information Science, Vol. 3, Taylor Graham Publishing, London, UK, 132-142.


