
Second Screen TV Synchronization

Christopher Howson, Eric Gautier, Philippe Gilberton, Anthony Laurent and Yvon Legallais

Technicolor, France
[email protected]

Abstract—The combined use of broadcast and broadband networks offers the perspective of efficient delivery of a wide range of personalized TV services. Such services are particularly compelling when related content is simultaneously rendered on a personal terminal and on a TV set. As media components are transported separately over different networks, and delivered to different devices, there is a need for a technique to assure their accurate temporal alignment. This paper describes a solution for frame accurate synchronization of hybrid media components used to compose personalized second screen TV services. The approach enables content components from distributed sources to be synchronously rendered on multiple terminals, even after transport over broadcast and broadband networks using different protocols and timing models. An evaluation system has been built showing accurate lip-sync for an on-demand personalized soundtrack consumed on an IP connected handheld terminal in conjunction with broadcast video on a TV set.

Keywords-broadcast; broadband; lip-sync; streaming

I. INTRODUCTION

The complementary nature of broadcast and broadband IP networks has opened the door to a hybrid delivery model in which the strengths of each network are leveraged to provide personalized TV services. Such a delivery model is already being exploited by a number of actors in the TV landscape. The manufacturers of consumer equipment are providing "Connected TVs" incorporating broadband access to catch-up TV, enhanced program guides and Internet video. Initiatives such as HbbTV [1] and YouView [2] have brought together broadcasters, content providers and Internet service providers seeking to define a standardized approach to the provision of hybrid broadcast broadband services. Whilst the first HbbTV services were launched as long ago as December 2009, current hybrid TV service deployments do not yet exploit the full potential of the Internet for delivery of media content and there remains significant potential for further innovation.

By using broadcast delivery for mass distribution of popular programs and broadband delivery for long tail and on-demand content, a combined delivery model is well adapted to providing personalized value-added TV services to large numbers of subscribers. Companion terminals, such as tablets or smartphones, are becoming well established as “TV buddies” for the consumption of personalized content linked to TV broadcasts. We are envisaging second screen use cases where alternative audio or video content, linked to broadcast programs, is carried over broadband, thereby enabling personalization and alleviating the burden on broadcast network bandwidth.

One example of such a service offers the user the possibility of selecting his preferred audio soundtrack on a handheld device, to accompany the broadcast video displayed on a TV set. The main audio and video components are delivered over a broadcast network, whilst several other languages are available on-demand over the Internet. Another such service would enable a user to select a broadband delivered alternative view of a sporting or music event and render this on his tablet, in conjunction with the broadcast content displayed on a TV set. If the user experience of such second screen services is to be acceptable, then the media components, delivered separately over broadband and broadcast networks, need to be rendered with accurate synchronization.

Whilst existing hybrid TV services do employ trigger mechanisms for interactive applications, they do not incorporate techniques that would allow, for example, an alternative soundtrack delivered over the Internet to be accurately synchronized with a broadcast video component. In this paper we propose a system for synchronizing the rendering of audiovisual components delivered using different transport protocols, based on different system timing models and subject to different network latencies. We cover on-demand use cases where personalized content is rendered on a second screen device in synchronization with content on a broadcast receiver. Our approach accounts for the likelihood that broadband and broadcast content servers will be geographically separated with no common wall clock.

In the next section we will outline the key hybrid network synchronization issues and review previous related work. In Section III we will present our solution, which is a system for accurate synchronization of TV service components delivered over different networks, whatever the content type, transport protocol or timing model. Section IV describes an experimental implementation, which demonstrates lip-sync between a user selected soundtrack on a portable terminal and broadcast video on a TV set. Finally, we present our conclusions in Section V.

II. CONTENT SYNCHRONIZATION BACKGROUND

The different constraints of content delivery over broadcast and broadband networks have led to the adoption of technical solutions based on different transport protocols and underlying timing models. The MPEG2 transport stream (MPEG2-TS) [3], which is well established in the broadcast world, was designed for networks having a constant transmission delay. It specifies a buffer and timing model in which all receivers should have the same behavior. IP-based solutions adopted in the broadband world, such as the real-time transport protocol (RTP) [4] or, more recently, HTTP adaptive streaming [5], were designed for networks having variable transmission delay. To account for this, more flexible timing models have been adopted, resulting in implementation-dependent receiver behavior. If we are to synchronize media components delivered over both types of network, a solution that copes with both of these timing models is required.

One approach to the hybrid network synchronization problem is to use a unique delivery reference clock, such as the MPEG2-TS program clock reference (PCR) and its associated presentation time stamps (PTS), for both networks. In this case, the PTS is carried over the broadband network using an IP transport protocol, such as RTP. Both NHK and BBC researchers have adopted this approach, for cases where the broadband content server and broadcast equipment are collocated [6], or by employing clock recovery at a remote site [7]. However, a problem with this approach is that the re-multiplexing functions used in many networks typically regenerate the PCR, making it difficult to maintain clock continuity. Furthermore, for on-demand applications, knowledge of the PCR/PTS does not provide sufficient information to ensure that the requested content can be aligned with the broadcast stream. The PCR is attached to the service and contains no reference to the temporal position within the current event, where an event is a grouping of elementary streams, with a defined start and end time, belonging to a common service. A timing reference attached to the content itself is required to solve this problem.

Such a timing reference was developed in the SAVANT (Synchronized and scalable AV content Across NeTworks) project [8]. In order to synchronize components delivered simultaneously over broadcast, using MPEG2-TS, and over broadband, by RTP, a common counter (timeline) was associated with the components. The broadcast timeline consisted of Normal Play Time (NPT) descriptors transported in MPEG2-TS digital storage media command and control (DSM-CC) private data sections [9]. The delivery system simultaneously starts the generation of NPT descriptors and the RTP component time-stamping and therefore sets the same NPT values in RTP timestamps. Initial values are set to “0”. Such an approach implies co-localization of broadcast and broadband content sources. Furthermore, DVB now considers that “the use of NPT is obsolete” [10] and the RTP RFC [4] specifies that the initial value of the timestamp should be set randomly. Whilst the use of a common timeline is a promising approach, there still remains a need for a solution demonstrating suitability for distributed content sources, ease of deployment in existing broadcast TV infrastructure, and highly accurate synchronization.

An alternative approach for the synchronization of audiovisual components delivered over heterogeneous systems is to use characteristics of the audiovisual content itself as a temporal reference. One such technique is to exploit watermarks in the audio signal of a TV service. Indeed, audio watermarks are already commonly used to identify a program for audience measurement purposes. By using the channel identifier and timestamp in the watermark, it is possible to detect the position in the program being watched. A number of actors, notably Nielsen [11], are using such techniques for second screen interactivity synchronized with a broadcast program. Another solution involves extracting a fingerprint directly from a captured sample of the audio or video and comparing it to a known database. Some technology providers (e.g. IntoNow [12], VideoSurf [13]) already have tablet applications, based on this approach, which are able to retrieve metadata relevant to, and roughly synchronized with, the main screen program. Unlike the previous techniques, this approach presents the advantage of leaving the broadcast content unchanged, though such solutions require supplementary data processing in the client device. However, both the watermarking and fingerprinting techniques rely on the capture quality of the second screen device and are therefore susceptible to environmental noise and the capabilities of the device itself.

III. THE EVENT TIMELINE SOLUTION

Our “event timeline” solution is a system for accurate synchronization of TV service components delivered over different networks, whatever the content type, transport protocol or timing model. This entails the use of a common timing reference format that is exploitable by all receivers, whatever the delivery system, and in which the timing information is independent of transport and timing protocols. The timing information is carried as a dedicated piece of content, directly attached to a given event.

Timing information is generally present in material available for broadcast, in the form of timecodes in content items such as files or tapes. Whilst this data is a good basis for synchronization, it is incomplete on its own: the timecode does not include a program or event identifier, which is a prerequisite for on-demand applications. Furthermore, the question of how to transport it arises. A standardized approach for the carriage of this data in the group of pictures (GOP) header of the video component exists [14], but extracting the data for synchronization requires use of the video decoder. To avoid this, some video encoders adopt a proprietary solution whereby this information is carried in the adaptation field of the MPEG2-TS header. As header information is not necessarily retained when content is stored, our event timeline solution carries timing information and event identifiers as a component, multiplexed with the other service components.

As part of a service, a timeline component is precisely synchronized with the other service components by means of the existing protocol synchronization mechanisms (e.g. PCR). An event timeline represents a counter indicating the progress of time in the event, such as the time elapsed since the start. As the components for a given second screen TV service may be delivered over both broadcast and broadband networks, the timeline component is readily carried over both MPEG2-TS and IP transport protocols.

When delivered by MPEG2-TS, we format the timeline component according to the DVB specification for the carriage of synchronized auxiliary data [15]. An event timeline packet embeds “broadcast timeline” and “content labelling” descriptors. The former contains the timeline itself, whilst the latter references the associated event and enables use of the timeline by an application.
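As a concrete illustration of the information such a packet carries, the sketch below models the two descriptors in Python. The field and type names are ours, chosen for readability; the bit-level syntax of the actual TS 102 823 descriptors is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class BroadcastTimelineDescriptor:
    """Carries the timeline itself (illustrative fields, not the TS 102 823 syntax)."""
    ticks: int        # elapsed time within the event, in tick units
    tick_rate: int    # ticks per second, e.g. 1000 for millisecond resolution

@dataclass
class ContentLabellingDescriptor:
    """References the event the timeline belongs to (hypothetical field names)."""
    broadcast_id: int  # identifies the service carrying the event
    event_id: int      # identifies the event within the service

@dataclass
class EventTimelinePacket:
    timeline: BroadcastTimelineDescriptor
    labelling: ContentLabellingDescriptor

    def position_seconds(self) -> float:
        """Temporal position within the event, in seconds."""
        return self.timeline.ticks / self.timeline.tick_rate
```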

In the case of IP transport, no existing standard defines a payload format to convey such a component, so we have chosen to adopt the same DVB packet payload format to deliver a timeline component over IP media transport protocols (e.g. RTP). The descriptors are sufficiently small to fit into a single RTP packet and do not therefore require an adaptation layer, such as that employed for video components.

A. Timeline Insertion

Whatever the transport protocol, timeline information is carried as an additional component. That raises the question of how to create and insert the timeline component within the service, whilst ensuring perfect synchronization with the associated audio and video components. The temporal element of the event timeline expresses position within an event, so we need information indicating when to start and stop its creation. It is also necessary to ensure that the event timeline increments in step with the media content.

We propose to insert the timeline component in the same way as a DVB subtitling component, which is typically created and encoded during the audio/video encoding step. We use two specific modules for the timeline component creation and insertion process: the “timeline data supplier” and the “timeline encoder”.

Fig. 1 shows the timeline insertion architecture in a broadcast context, with an MPEG2-TS embedding a timeline component, in addition to the usual encoded audio and video components. As a typical audiovisual stream output from broadcast playout incorporates timecode information, the “timeline data supplier” is able to retrieve this information for an event, generate the full timeline component and provide it to the “timeline encoder”. The “timeline data supplier” is fed in advance with, for instance, one file per event requiring a timeline. Event timeline files are stored independently of the audiovisual content. Each file contains not only the video timecode corresponding to the beginning of the event timeline creation but also the event duration and the parameters to be set such as the “broadcast id” and the “content labelling”. The “timeline data supplier” retrieves the video timecodes from the incoming video and generates the event timeline.

Figure 1. Timeline component insertion in MPEG2-TS
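The logic of the “timeline data supplier” can be sketched as follows, reusing the packet types from the previous sketch. The EventDescription fields stand in for the per-event file described above; the names are illustrative assumptions, not those of our implementation.

```python
from dataclasses import dataclass

@dataclass
class EventDescription:
    """One pre-supplied file per event requiring a timeline (illustrative fields)."""
    start_timecode: float  # video timecode at which the event timeline starts, seconds
    duration: float        # event duration, seconds
    broadcast_id: int
    event_id: int

def timeline_for_timecode(event: EventDescription, video_timecode: float,
                          tick_rate: int = 1000):
    """Map an incoming video timecode to an event timeline packet, or None
    if the timecode falls outside the event."""
    elapsed = video_timecode - event.start_timecode
    if not 0.0 <= elapsed <= event.duration:
        return None  # outside the event: no timeline packet is generated
    return EventTimelinePacket(
        BroadcastTimelineDescriptor(ticks=int(elapsed * tick_rate), tick_rate=tick_rate),
        ContentLabellingDescriptor(event.broadcast_id, event.event_id),
    )
```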

The “timeline encoder” encapsulates the timeline component in the appropriate transport format and assures the synchronization of the timeline with the audio/video components by computing the presentation timestamps from the recovered program system clock. Whilst Fig. 1 shows MPEG2-TS transport, implying the use of the PCR/PTS timing model, a similar approach is employed for IP transport, based on a Network Time Protocol (NTP) reference.
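For the MPEG2-TS case, the timestamp computation amounts to mapping the recovered 27 MHz program clock onto the 33-bit, 90 kHz PTS scale, plus a constant presentation offset. The sketch below illustrates this under the assumption of a fixed pipeline delay; the real encoder derives the offset from the audio/video encoding chain.

```python
PTS_HZ = 90_000        # MPEG2-TS presentation timestamps run at 90 kHz
PCR_HZ = 27_000_000    # MPEG2-TS system clock (PCR) runs at 27 MHz

def pts_from_recovered_clock(pcr_ticks: int, presentation_offset_s: float) -> int:
    """Compute the PTS for a timeline packet from the recovered 27 MHz program
    clock, adding a constant delay so that the packet is presented together
    with its audio/video companions. presentation_offset_s is an assumption
    standing in for the real end-to-end encoding pipeline delay."""
    now_90khz = pcr_ticks * PTS_HZ // PCR_HZ
    return (now_90khz + int(presentation_offset_s * PTS_HZ)) % 2**33  # PTS is 33 bits
```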

B. Synchronizing Broadcast with Broadband Multicast

When content components are delivered by broadcast or multicast, a continuous sequence of events is transmitted and the receiver takes delivery passively, with no explicit indication of when an event begins or how long it lasts. The timeline component provides receivers with this timing information, so an event timeline should always accompany hybrid event components delivered over the broadcast network or multicast over the broadband network.

Fig. 2 shows an example of an audio stream delivered over RTP, synchronized to a video stream broadcast over MPEG2-TS. As hybrid delivery is subject to inter-network delay, the streams from the two networks are not received simultaneously, and the receiver must buffer the stream received in advance. The event timeline is used to compute this buffering delay precisely and to establish and maintain a correspondence between the video event clock (the MPEG2-TS PCR) and the audio event clock (the RTP timestamp clock).

Figure 2. Synchronization of MPEG2-TS and RTP delivered components
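The buffering computation itself is simple once both streams carry the event timeline: at a given local instant, the receiver reads the event position from each stream and buffers the one that is ahead. A minimal sketch, with positions expressed in seconds:

```python
def inter_network_delay(broadcast_pos_s: float, broadband_pos_s: float) -> float:
    """Positions within the same event, read from the timeline components of
    the two streams at the same local instant. A positive result means the
    broadband stream is ahead and must be buffered by that amount."""
    return broadband_pos_s - broadcast_pos_s

# Example: the RTP audio arrives 380 ms ahead of the broadcast video,
# so the receiver delays its rendering by 0.38 s.
delay = inter_network_delay(broadcast_pos_s=12.40, broadband_pos_s=12.78)
```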

C. Synchronizing Broadcast with On-demand Broadband

Whilst in general the timeline should be carried over both networks, many second screen applications are likely to involve on-demand requests via the real-time streaming protocol (RTSP) or HTTP and, in this case, content can be synchronized without a timeline on the IP path. As such a process starts at the receiver’s instigation, the receiver can use the broadcast event timeline to determine the time elapsed since the beginning of the event and request the broadband content accordingly.

Fig. 3 illustrates a case where an audio component is requested over broadband to accompany broadcast content. The timeline is used to determine the temporal position in the current event (T0). The receiver adds to this a margin to account for the request/response time and to ensure that the requested broadband content is received before the presentation of the corresponding broadcast content. We term this margin the overestimated round trip time (ORTT). The presentation timestamp of the first received audio sample corresponds to the computed offset (T3).
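The request computation can be summarized in a few lines. The sketch below assumes timeline positions expressed in seconds; T0 and the ORTT are as defined above.

```python
def on_demand_request_offset(t0_s: float, ortt_s: float) -> float:
    """T0 is the current position in the broadcast event, read from the event
    timeline; ORTT is the overestimated round trip time. The broadband
    component is requested starting from T3 = T0 + ORTT, so that it arrives
    before the corresponding broadcast content is presented."""
    return t0_s + ortt_s

t3 = on_demand_request_offset(t0_s=95.2, ortt_s=1.5)  # request audio from 96.7 s
```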

We seek to preserve the standard broadcast MPEG2-TS buffer and timing model in the receiver, and to align the presentation of the on-demand component with that of the broadcast components. As the timing model for IP-based protocols is flexible, receivers can freely decide when to start the content presentation. The receiver synchronizes the system time clock of the broadband components to that of the broadcast components, which is considered the master clock. In this way, the on-demand event audio does not need to be accompanied by an event timeline; the receiver assures the correspondence between the event video and event audio wall-clocks.

Figure 3. On-demand component request

A potential limitation with this approach is that a receiver knows when to request an on-demand event component only when it receives the first broadcast event timeline packet. At the start of a hybrid event, the receiver will miss the first samples due to the overestimated round trip time, unless there is a means to anticipate the request and be able to present the on-demand event component as soon as the associated broadcast event starts. Anticipation is made possible by adding a “countdown” to the event timeline. The countdown is conveyed in the timeline component and has a specific format. Event timeline countdown packets announce in advance the timestamp when the event will start.
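A hypothetical sketch of how a receiver might exploit such a countdown packet: the broadband request is issued ORTT seconds before the announced event start, so the first on-demand samples are already buffered when the broadcast event begins.

```python
def seconds_until_request(event_start_wallclock_s: float, ortt_s: float,
                          now_s: float) -> float:
    """A countdown packet announces the instant at which the event will start.
    Issuing the broadband request ORTT seconds early lets the on-demand
    component be presented from the very first sample."""
    request_at = event_start_wallclock_s - ortt_s
    return max(0.0, request_at - now_s)  # how long to wait before requesting
```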

D. Second Screen Terminals

For cases where an IP-connected second screen device is to be synchronized with the broadcast receiver, we have defined and implemented a protocol for communication between the broadcast receiver and companion terminals over the home local area network. In the general case, with a broadcast receiver and a portable device receiving broadband multicast content, both terminals have an event timeline to exploit. Each terminal is therefore aware of its temporal position in the ongoing event and communicates this timeline information to the other device. The timeline values exchanged by the terminals are corrected to account for the communication time between the devices. A terminal which determines that it is in advance must delay its rendering by that amount, thereby assuring content synchronization between the devices. In order to maintain synchronization, even in the presence of clock drift, the terminals periodically exchange and compare timeline information.
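The position exchange can be sketched as an NTP-style request/response, correcting the peer's reported position by half the measured round trip time. This is our reading of the mechanism rather than its actual implementation; the request semantics and transport are abstracted behind callables.

```python
import time

def rendering_offset(local_position, peer_position) -> float:
    """local_position() and peer_position() return each terminal's current
    event timeline position in seconds; peer_position() involves a
    request/response over the home network. Transit time is assumed
    symmetric, as in NTP-style exchanges. A positive result means the local
    terminal is in advance and should delay its rendering by that amount."""
    sent = time.monotonic()
    peer_pos = peer_position()        # peer samples its timeline when replying
    rtt = time.monotonic() - sent
    return local_position() - (peer_pos + rtt / 2.0)
```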

For on-demand scenarios, where there is no timeline on the broadband path, the portable terminal initially asks the main device to indicate the position in the event that it is currently presenting. We account for and minimize the delay between the request sent by the companion device and the response from the broadcast receiver. We then implement simple mechanisms to estimate the broadband network delay for the on-demand content, to periodically check the synchronization state, and to adapt media playback accordingly. This is a two-step procedure, whereby a first content request is made in order to evaluate the round trip time of the connection to the server. The desired content component is then requested, starting from a position which incorporates the ORTT and the time elapsed since the last indication of the current rendering position by the broadcast receiver. This ensures that the broadband content is received before the presentation of the corresponding broadcast content. If needed, fine adjustment may be undertaken by adapting the buffering in the player.
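A minimal sketch of this two-step procedure, with hypothetical helpers for the probe request and the media request; the factor used to overestimate the measured round trip time is our assumption.

```python
def request_on_demand(position_from_stb_s: float, elapsed_since_reply_s: float,
                      measure_rtt, request_content):
    """Two-step procedure: a first (probe) request measures the round trip
    time to the content server; the content itself is then requested from a
    position that adds the overestimated RTT and the time elapsed since the
    broadcast receiver reported its rendering position. measure_rtt and
    request_content are hypothetical helpers, not part of any real API."""
    ortt_s = 1.5 * measure_rtt()  # overestimate to stay on the safe side
    start = position_from_stb_s + elapsed_since_reply_s + ortt_s
    return request_content(start_position_s=start)
```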

IV. EVALUATION SYSTEM

We have validated the feasibility of our approach through the implementation of a personalized TV service. The system allows a TV viewer to enjoy a film in his/her favorite language by using a companion device to select and receive this audio stream directly from the Internet. The interest of this scenario is that it not only corresponds to a value-added service, but also sets a challenging synchronization target: for the user experience to be good, we need to achieve acceptable lip-sync, implying sub-25 ms alignment accuracy of the broadband audio with the broadcast video [16]. The system demonstrates second screen synchronization of on-demand IP delivered audio with a broadcast TV program and allows us to gauge user perception of the lip-sync.

Fig. 4 shows the evaluation system in which a TV program is a film, broadcast over DVB-T as an MPEG2-TS multiplex of video, audio and timeline components. The alternative audio tracks are stored as MP4 files on a streaming server connected to the Internet. The TV receiver is an Ethernet connected hybrid set top box (STB) in which we have added a plug-in for timeline management and communication with the personal terminal. The latter may be one of several Linux-based devices, a smartphone, a tablet or a laptop PC. It communicates with the STB and accesses the Internet through an ADSL-connected home gateway incorporating a Wi-Fi access point, or over a cellular 3G connection.

Users are able to select and listen to one of a number of languages for the film soundtrack on their personal device, whilst watching the associated film on the TV. A notification pop-up on the TV informs the user that alternative languages are available for the current film. The user can then launch a simple application on the handheld device to select his preferred soundtrack. In the current version of our system, the personal device makes requests to the audio server with RTSP and the audio is streamed over RTP. This experimental system has been operated by a range of end users who have evaluated language switching and playback performance for content chosen to include significant sequences with speech. Users have, in all cases, experienced robust audio streaming over the Internet and the home network Wi-Fi connection, without detecting any perceptible lip-sync issues. This demonstrates the high performance synchronization capability of our timeline implementation and companion device protocol.

Figure 4. Personalized audio evaluation system

This evaluation system is also suitable for other enhanced TV services and we are currently extending it for evaluation of second screen multi-view video applications. An initial scenario would enable a user to display alternative views of a music concert on his tablet, whilst the main view is displayed on the TV set.

V. CONCLUSION

In order to exploit hybrid broadcast broadband networking for the deployment of second screen TV services, there is a need for an accurate content synchronization solution that accounts for different transport protocols and timing models, whilst being easily incorporated in existing broadcast TV infrastructure.

We have proposed a system, for the deployment of second screen personalized TV services, which enables the rendering of content components, delivered independently over broadband and broadcast networks, to be accurately synchronized in user devices. The solution is based on the addition of an auxiliary timeline component associated with each group of media components delivered over the broadcast network and, in some cases, also over the broadband network. This timeline component conveys synchronization information related to each event and is used to align the presentation of the event media components. A key advantage of this solution is its compatibility with any existing, or future, transport protocol, as the timeline is itself a content component in its own right. The approach is also applicable to any hybrid architecture, even where the broadband and broadcast servers do not share a common clock. We have shown that it is applicable to both multicast and on-demand IP transport.

We have described an experimental implementation of our solution which we are using for the evaluation of a number of second screen services. We have demonstrated the feasibility of the approach for achieving lip-sync accuracy, in the case of an on-demand audio component delivered over RTP to a personal device and played back synchronously with broadcast video on a TV set.

Further work involves exploring other second screen applications, including multi-view video delivery to companion terminals, and the use of HTTP adaptive streaming for delivery to the companion terminal. Also, as the handheld devices rely on Wi-Fi or 3G communications, we are evaluating to what extent the varying characteristics of these networks could be detrimental to the user experience. Finally, we are investigating an approach in which the communications between receivers and companion terminals could leverage home network timing mechanisms defined in existing DLNA/UPnP standards.

ACKNOWLEDGMENT

This work was partly achieved as part of the Quaero Program, funded by OSEO, the French State agency for innovation.

REFERENCES

[1] ETSI TS 102 796: “Hybrid Broadcast Broadband TV”, version 1.1.1, June 2010.

[2] YouView TV Ltd, “YouView Core Technical Specification”, version 1.0, April 2011.

[3] ISO/IEC 13818-1: “Generic Coding of Moving Pictures and Associated Audio Information: Systems”, October 2007.

[4] IETF Network Working Group, “Request for Comments: 3550 - RTP: A Transport Protocol for Real-Time Applications”, July 2003.

[5] T. Stockhammer, “Dynamic adaptive streaming over HTTP - standards and design principles”, ACM Multimedia Systems Conference (MMSys), February 2011.

[6] K. Matsumura, M. Evans, Y. Shishikui and A. McParland, “Personalization of broadcast programs using synchronized internet content”, IEEE International Conference on Consumer Electronics, Jan 2010.

[7] M. Armstrong, J. Barrett and M. Evans, “Enabling and enriching broadcast services by combining IP and broadcast delivery”, BBC Research White Paper WHP 185, Sep 2010.

[8] U. Rauschenbach, W. Putz, P. Wolf, R. Mies and G. Stoll, “A scalable interactive TV service supporting synchronised delivery over broadcast and broadband networks”, IBC Conference, September 2004.

[9] ISO/IEC 13818-6: “Generic Coding of Moving Pictures and Associated Audio Information: Extensions for DSM-CC”, 1998.

[10] ETSI TS 102 809: “Digital Video Broadcasting (DVB); Signalling and Carriage of Interactive Applications and Services in Hybrid Broadcast/broadband Environments”, version 1.1.1, January 2010.

[11] Media-Sync website, http://www.media-sync.tv/

[12] IntoNow website, http://www.intonow.com/

[13] VideoSurf website, http://www.videosurf.com/mobile/

[14] ISO/IEC 13818-2: “Generic Coding of Moving Pictures and Associated Audio Information: Video”, December 2000.

[15] ETSI TS 102 823: “Digital Video Broadcasting (DVB); Specification for the Carriage of Synchronized Auxiliary Data in DVB Transport Streams”, version 1.1.1, November 2005.

[16] ITU-R BT.1359-1, "Relative Timing of Sound and Vision for Broadcasting", 1998.
