
Stereotime: A Wireless 2D and 3D Switchable Video Communication System

You Yang1, Qiong Liu1, Yue Gao2, Binbin Xiong1, Li Yu1, Huanbo Luan2, Rongrong Ji3, Qi Tian4

1. Department of Electronics & Information Engineering, Huazhong University of Science & Technology, Wuhan, China
2. School of Computing, National University of Singapore, Singapore
3. Department of Cognitive Science, School of Information Science & Engineering, Xiamen University, Xiamen, China
4. Department of Computer Science, University of Texas at San Antonio

{yangyou,q.liu,hustlyu}@hust.edu.cn, {kevin.gaoy,viclol36,jirongrong}@gmail.com, [email protected]

ABSTRACT
Mobile 3D video communication, especially with 2D and 3D compatibility, is a new paradigm for both video communication and 3D video processing. Current techniques face challenges on mobile devices when bundled constraints such as computation resources and compatibility must be considered. In this work, we present a wireless 2D and 3D switchable video communication system, named Stereotime, to handle these challenges. Zig-Zag fast object segmentation, depth cue detection and merging, and texture-adaptive view generation are used for 3D scene reconstruction. We demonstrate the functionality and compatibility of the system on 3D mobile devices in a WiFi network environment.

Categories and Subject Descriptors
H.4.3 [Information Systems Applications]: Communications Applications; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems

Keywords
Video communication, Visual content, View generation, Scene reconstruction

1. INTRODUCTION
Video communication has changed people's daily life, for instance by providing face-to-face long-distance communication [1]. With the development of portable devices, video communication has become ubiquitous, with popular and practical products including Facetime, LinPhone, etc. In recent years, three-dimensional (3D) technologies [2, 3] have been widely applied. 3D displaying has been ported to mobile devices and brings people immersive 3D viewing experiences.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s).
MM'13, October 21–25, 2013, Barcelona, Spain.
ACM 978-1-4503-2404-5/13/10.


Figure 1: An overview of the proposed Stereotime system.

With such an attractive display capability, 2D and 3D switchable functionality becomes a new paradigm in the fields of video communication and 3D video processing.

A natural solution is porting the system directly from wired networks to wireless networks and mobile devices [6]. This design includes a stereo camera for scene capturing, a stereo video codec for signal transmission, and an autostereoscopic screen for 3D displaying. However, challenges arise. The first challenge is the resource requirement of the stereo video codec: the bundled coding techniques and spatio-temporal prediction in stereo video codecs impose significant resource demands [5]. The other challenge is that 2D terminal users will receive redundant views if a stereo camera is used.

Our recent efforts [7] have shown that a 3D scene can be reconstructed by content analysis. In the reconstruction process, the depth information of the captured scene is first restored; the depth information is then propagated from key frames to subsequent frames, from which the 3D scene is finally reconstructed. Therefore, to deal with the above challenges, we present a 2D and 3D switchable video communication system named Stereotime for both traditional and 3D mobile device users. Figure 1 illustrates the framework of the proposed system.

In our system, we use the same configuration as a traditional video communication system, including a single camera for scene capturing, a video encoder for signal compression, and a wireless channel for transmission. The key innovation in the proposed system is a content-analysis-based 3D scene reconstruction method applied on 3D terminals. In the proposed system, the scene is captured as 2D video, compressed, and transmitted to users, but can be displayed in 3D. For a user with a 2D receiver, the system works in the traditional video communication style, while for a user with a 3D receiver, our system provides an immersive 3D viewing experience.

2. SYSTEM OVERVIEW
We developed Stereotime on a 3D mobile phone; the interface of the system is shown in Figure 2. The system is composed of four components: scene capturing, video signal compression, wireless transmission, and content analysis for 3D scene reconstruction.

Scene Capturing In video communication, people usually use convenient configurations for scene capturing. Typically, a web camera with VGA resolution and auto-focus (usually a 2 mm focal length) is used in common systems. This setting is adopted in our system for scene capturing.
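
For readers who want to reproduce this front end, a minimal capture loop of the kind described above could look as follows. This is only an illustrative sketch using OpenCV; the device index and variable names are assumptions, not part of the original system.

import cv2

# Minimal VGA capture sketch (assumption: camera at device index 0).
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)    # VGA width, as in the paper
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)   # VGA height
ok, frame = cap.read()                    # one 640x480 BGR frame, if the camera is available
cap.release()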

Video Signal Compression Video encoding techniques are important for video transmission, since raw video signals cannot be transmitted at their native bitrate over a wireless channel. We use the H.264 video codec as the compression tool in our system. During compression, only I- and P-type frames are used for the sake of real-time encoding. Furthermore, the interval between two I-type frames is selected automatically by the encoder.
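
The exact encoder settings are not published; as an assumption-laden sketch, an I/P-only, low-bitrate H.264 stream of the kind described above could be produced with ffmpeg/libx264 roughly as follows. The file names, preset, and bitrate values are illustrative, not the authors' configuration.

import subprocess

# Hypothetical low-latency H.264 pass: no B-frames (I/P only), ~60 kbps target.
cmd = [
    "ffmpeg", "-i", "captured_vga.avi",   # placeholder input file
    "-c:v", "libx264",
    "-preset", "ultrafast",               # favor real-time encoding
    "-tune", "zerolatency",
    "-bf", "0",                           # disable B-frames
    "-b:v", "60k",                        # bitrate reported in the paper
    "stream.h264",
]
subprocess.run(cmd, check=True)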

Wireless Transmission Our system is designed to be independent of wireless network constraints. In fact, with the help of the H.264 video codec, the bandwidth for video transmission is less than 60 Kbps in our system. Therefore, the system can be used over GSM with about 200 Kbps of bandwidth, as well as over WiFi networks with more than 50 Mbps.

Content Analysis for 3D Scene Reconstruction 3D scene reconstruction is the key part that generates 3D content at the receiver side. The reconstruction procedure is composed of the following four steps.

Fast Object Segmentation: Object segmentation is the prerequisite for depth estimation. We assume that the chroma and illumination of a surface do not change significantly, and that the depth values of an object surface satisfy the same property. Therefore, we use a Zig-Zag scan method for fast object surface segmentation.
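
The paper does not spell out the scan algorithm; a minimal sketch of the idea (start a new surface label whenever the color changes sharply along a zig-zag scan path) might look like the following, where the threshold value and helper names are our own assumptions.

import numpy as np

def zigzag_order(h, w):
    """Yield pixel coordinates row by row, alternating scan direction (zig-zag)."""
    for y in range(h):
        xs = range(w) if y % 2 == 0 else range(w - 1, -1, -1)
        for x in xs:
            yield y, x

def zigzag_segment(ycbcr, thresh=12.0):
    """Greedy surface segmentation: a new label starts whenever the color
    changes by more than `thresh` between consecutive pixels on the scan path."""
    h, w, _ = ycbcr.shape
    labels = np.zeros((h, w), dtype=np.int32)
    prev_color, label = None, 0
    for y, x in zigzag_order(h, w):
        color = ycbcr[y, x].astype(np.float32)
        if prev_color is None or np.linalg.norm(color - prev_color) > thresh:
            label += 1                      # color jump -> new object surface
        labels[y, x] = label
        prev_color = color
    return labels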

Depth Cue Detection: We detect two types of depth cues: (a) crossing lines and the vanishing point, and (b) the occlusion cue. These cues help recover the relative depth information in the captured 2D scene.
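
A common way to obtain the crossing-lines/vanishing-point cue is Hough line detection followed by pairwise line intersection; the OpenCV-based sketch below is only one plausible realization under that assumption, not the authors' implementation.

import cv2
import numpy as np

def detect_vanishing_point(gray):
    """Estimate a single vanishing point as the median intersection of
    straight lines detected in the frame (a simplified heuristic)."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=40, maxLineGap=5)
    if lines is None:
        return None
    segs = [l[0] for l in lines]
    pts = []
    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            p = _intersect(segs[i], segs[j])
            if p is not None:
                pts.append(p)
    return np.median(np.array(pts), axis=0) if pts else None

def _intersect(s1, s2):
    """Intersection point of the two infinite lines through segments s1, s2."""
    x1, y1, x2, y2 = s1
    x3, y3, x4, y4 = s2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-6:
        return None                        # (nearly) parallel lines
    px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / d
    py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / d
    return (px, py)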

Multi-cue Depth Merging and Depth Propagation: We assign a depth value to every object in the scene with the help of the detected depth cues. After that, the depth propagation method in [7] is used to save the computation load of object segmentation and depth cue detection on subsequent frames.
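
As a rough illustration of how such cues could be merged into a per-segment depth map (the actual merging rules and the propagation of [7] are not reproduced), consider the simplified sketch below; the depth scale and occlusion offset are assumed constants.

import numpy as np

def merge_depth_cues(labels, vp, occlusions=()):
    """Toy multi-cue merge (our own simplification of this step).
    Values are disparity-like: larger = nearer to the camera.
    - Geometric cue: segments far from the vanishing point are assumed nearer.
    - Occlusion cue: each (front, back) label pair forces the occluder nearer."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    diag = float(np.hypot(h, w))
    depth = np.zeros((h, w), dtype=np.float32)
    for seg in np.unique(labels):
        mask = labels == seg
        cy, cx = ys[mask].mean(), xs[mask].mean()
        depth[mask] = 255.0 * np.hypot(cy - vp[1], cx - vp[0]) / diag
    for front, back in occlusions:
        fm, bm = labels == front, labels == back
        if depth[fm].mean() <= depth[bm].mean():
            depth[fm] = depth[bm].mean() + 16.0   # push the occluder nearer
    return np.clip(depth, 0.0, 255.0)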

Fast 3D Scene Reconstruction: The 3D scene can be reconstructed from the color image received by the mobile device and the depth information calculated in the previous steps. After that, different views can be obtained, as shown in Figure 3. We adopt a texture-adaptive processing method [4] to fill the holes that appear in view switching, based on which the 3D scene can be displayed smoothly on the autostereoscopic screen.
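
The texture-adaptive hole filling of [4] is not reproduced here; the sketch below only illustrates the generic depth-image-based rendering step (a horizontal shift proportional to depth) followed by a naive raster-order fill, with the maximum disparity being an assumed parameter and no z-buffering for brevity.

import numpy as np

def render_virtual_view(color, depth, max_disp=16):
    """Generic DIBR sketch: shift each pixel horizontally by a disparity
    proportional to its depth, then fill disoccluded holes in raster order
    with the last valid color (a stand-in for the method of [4])."""
    h, w, _ = color.shape
    disp = (depth.astype(np.float32) / 255.0 * max_disp).astype(np.int32)
    view = np.zeros_like(color)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            nx = x + disp[y, x]
            if 0 <= nx < w:
                view[y, nx] = color[y, x]
                filled[y, nx] = True
    for y in range(h):                       # raster-order hole filling
        last = None
        for x in range(w):
            if filled[y, x]:
                last = view[y, x]
            elif last is not None:
                view[y, x] = last
    return view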

The proposed Stereotime system runs on a mobile phone with the Android 2.3 system and an autostereoscopic screen. The captured video resolution is 640×480, and the x264 encoder is applied for video compression. With these settings, Stereotime needs no more than 60 Kbps of bandwidth in a WiFi environment for video communication. Furthermore, the speed of 3D scene reconstruction reaches up to 10 fps, providing users with a fluent 3D display.

Figure 2: The interface and modes of Stereotime.

Figure 3: 3D scene reconstruction for different views.

3. CONCLUSIONS
In this work, we demonstrate a 2D and 3D compatible and switchable video communication system called Stereotime. Our system provides immersive 3D chatting experiences to users at a cost as low as that of a traditional 2D video communication system.

4. ACKNOWLEDGEMENT
This work was supported by the Natural Science Foundation of China (NSFC) (No. 61170194, 61231010, and 61202301) and the National High Technology Research and Development Program ("863" Program) of China (No. 2012AA121604).

5. REFERENCES
[1] K. E. Finn, A. J. Sellen, and S. B. Wilbur. Video-mediated communication. L. Erlbaum Associates Inc., 1997.
[2] Y. Gao, J. Tang, R. Hong, S. Yan, Q. Dai, N. Zhang, and T. Chua. Camera constraint-free view-based 3D object retrieval. IEEE Transactions on Image Processing, 21(4):2269–2281, 2012.
[3] Y. Gao, M. Wang, D. Tao, R. Ji, and Q. Dai. 3D object retrieval and recognition with hypergraph analysis. IEEE Transactions on Image Processing, 21(9):4290–4303, 2012.
[4] Q. Liu, Y. Yang, Y. Gao, and R. Hong. Texture-adaptive hole-filling algorithm in raster-order for three-dimensional video applications. Neurocomputing, 111:154–160, 2013.
[5] P. Merkle, A. Smolic, K. Muller, and T. Wiegand. Efficient prediction structures for multiview video coding. IEEE Transactions on Circuits and Systems for Video Technology, 17(11):1461–1473, 2007.
[6] O. Schreer, P. Kauff, and T. Sikora. 3D Videocommunication. Wiley Online Library, 2005.
[7] Y. Yang, Q. Liu, R. Ji, and Y. Gao. Dynamic 3D scene depth reconstruction via optical flow field rectification. PLoS ONE, 7(11):e47041, 2012.
