© Copyright by James David Wong, 1999
AN EXTENSIBLE FRAMEWORK FOR RTSP APPLICATIONS
BY
JAMES DAVID WONG
B.A., Rice University, 1994
THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate College of the University of Illinois at Urbana-Champaign, 1999
Urbana, Illinois
Abstract
With advances in processor performance, memory sizes, storage capacity and
telecommunications technologies, networked multimedia applications such as streaming
video have become increasingly popular. Early systems used proprietary protocols to
transport control information and multimedia data, hindering interoperation between
different implementations, or adapted existing protocols, sacrificing quality and
flexibility. The Real-Time Streaming Protocol (RTSP) addresses these concerns by
providing a standardized, open architecture for controlling playback and recording
functionality in multimedia systems. This thesis presents an object-oriented framework
for constructing applications that make use of RTSP, describes the integration of the
framework into a pair of multimedia applications, and documents its performance. The
presentation closes with a discussion of opportunities for future development of the
architecture.
For Julie
Acknowledgements
I am indebted to Professor Roy H. Campbell for his guidance on the long, sometimes
strange odyssey that resulted in this thesis.
I must also offer thanks to my friends and co-workers at Vosaic, particularly Zhigang
Chen, Drew MacGregor, See-Mong Tan, Chuck Thompson and Miguel Valdez. Their
encouragement, support, insight and advice have been invaluable.
Likewise, I am grateful to Joe Shidle, David Hyatt, Rebecca Hyatt, Dan Gaines, Tim
Fraser, Garry Sittler, Mei-Ling Tong and numerous others for helping me to maintain my
sense of purpose and perspective and for providing unending motivation.
Finally, heartfelt thanks to my parents, Richard and Neva Wong, and brother, William
Wong, without whom, none of this would be possible.
Table of Contents
Page
Chapter 1 Introduction .................................................................................................. 1
1.1 Multimedia on the Web........................................................................................ 2
1.2 Hardware Technology .......................................................................................... 3
1.3 Software for Internet Multimedia......................................................................... 4
1.4 The Framework .................................................................................................... 5
1.5 Organization ......................................................................................................... 6
Chapter 2 Background................................................................................................... 7
2.1 RTSP Evolution.................................................................................................... 7
2.2 RTSP Features...................................................................................................... 8
2.2.1 Setup and Control.................................................................................. 9
2.2.2 Timing and Synchronization ............................................................... 10
2.2.3 Security................................................................................................ 10
2.2.4 Extensibility......................................................................................... 11
2.2.5 Flexibility ............................................................................................ 12
2.2.6 Ease of Use.......................................................................................... 12
2.3 RTSP Applications............................................................................................. 13
2.3.1 Video-on-Demand............................................................................... 13
2.3.2 Live Broadcasts ................................................................................... 13
2.3.3 Near Video-on-Demand ...................................................................... 14
2.3.4 Virtual Presentations ........................................................................... 14
2.3.5 Conferencing and Telephony .............................................................. 15
2.3.6 Distributed Digital Editing .................................................................. 15
2.4 Related Technologies ......................................................................................... 15
2.4.1 Hypertext Transport Protocol.............................................................. 15
2.4.2 PNA..................................................................................................... 18
2.4.3 H.323 ................................................................................................... 18
2.4.4 Session Initiation Protocol .................................................................. 24
2.4.5 Real-Time Transport Protocol............................................................. 28
2.4.6 Video Datagram Protocol.................................................................... 32
2.4.7 IP Multicast and the MBONE ............................................................. 35
2.4.8 Session Description Protocol and Session Announcement Protocol... 37
2.4.9 Synchronized Multimedia Integration Language................................ 39
2.5 Other RTSP Implementations ............................................................................ 40
2.5.1 Real Networks Reference Implementation.......................................... 40
2.5.2 RealSystem G2.................................................................................... 41
2.5.3 CERN WrtpVoD ................................................................................. 43
2.5.4 IBM RTSP Toolkit .............................................................................. 44
2.5.5 Apple QuickTime................................................................................ 46
2.5.6 Academic Implementations................................................................. 47
Chapter 3 RTSP in Detail ............................................................................................ 48
3.1 RTSP Resources ................................................................................................. 49
3.2 RTSP Sessions.................................................................................................... 51
3.3 RTSP Messages.................................................................................................. 52
3.3.1 Requests .............................................................................................. 54
3.3.2 Request Methods ................................................................................. 56
3.3.3 Responses ............................................................................................ 60
3.3.4 Response Status Codes........................................................................ 62
3.3.5 Headers................................................................................................ 63
3.4 In-Line Data ....................................................................................................... 64
3.5 Extension Mechanisms....................................................................................... 65
Chapter 4 An Extensible Framework for RTSP Applications................................. 67
4.1 Implementation Language.................................................................................. 68
4.1.1 Exception Handling............................................................................. 68
4.1.2 Run-time Type Information................................................................. 69
4.2 Support Libraries................................................................................................ 69
4.2.1 GPLib .................................................................................................. 69
4.2.2 Connection .......................................................................................... 70
4.2.3 OSLib .................................................................................................. 71
4.3 Class Models ...................................................................................................... 71
4.3.1 Overall Class Model............................................................................ 72
4.3.2 Request Class Model ........................................................................... 73
4.3.3 Exceptions ........................................................................................... 74
4.4 Functional Model ............................................................................................... 75
4.5 Dynamic Models ................................................................................................ 76
4.5.1 RTSPStream ........................................................................................ 78
4.5.2 RTSPInlineDataQueue ........................................................................ 79
4.5.3 RTSPMessageQueue........................................................................... 80
4.5.4 RTSPReader ........................................................................................ 81
4.5.5 RTSPDataReader................................................................................. 82
4.5.6 RTSPMessageReader .......................................................................... 83
4.5.7 RTSPRequestFactory .......................................................................... 84
4.5.8 RTSPMissive....................................................................................... 85
4.5.9 RTSPInlineData................................................................................... 85
4.5.10 RTSPMessage ..................................................................................... 86
4.5.11 RTSPRequest ...................................................................................... 87
4.5.12 RTSPAnnounceRequest ...................................................................... 89
4.5.13 RTSPDescribeRequest ........................................................................ 89
4.5.14 RTSPOptionsRequest.......................................................................... 90
4.5.15 RTSPPauseRequest ............................................................................. 90
4.5.16 RTSPPlayRequest ............................................................................... 91
4.5.17 RTSPRecordRequest ........................................................................... 92
4.5.18 RTSPSetupRequest ............................................................................. 92
4.5.19 RTSPTeardownRequest ...................................................................... 92
4.5.20 RTSPChannelListRequest ................................................................... 93
4.5.21 RTSPUserListRequest......................................................................... 94
4.5.22 RTSPGetUserDataRequest.................................................................. 94
4.5.23 RTSPUpdateUserDataRequest............................................................ 94
4.5.24 RTSPResponse .................................................................................... 95
4.5.25 RTSPHeader........................................................................................ 96
4.5.26 RTSPTransportHeader ........................................................................ 97
4.5.27 RTSP Exception Subclasses................................................................ 98
4.6 Programming Interfaces ................................................................................... 100
4.6.1 RTSPStream Interface....................................................................... 100
4.6.2 RTSPMessage Interface .................................................................... 101
4.6.3 RTSPRequest Interface ..................................................................... 102
4.6.4 RTSPResponse Interface................................................................... 103
4.6.5 RTSPHeader Interface ............................................................... 104
Chapter 5 Applications .............................................................................................. 105
5.1 The Vosaic Reflector........................................................................................ 105
5.1.1 Role of RTSP .................................................................................... 106
5.1.2 Integration ......................................................................................... 107
5.2 Vosaic IP Hoot ................................................................................................. 109
5.2.1 Role of RTSP .................................................................................... 110
5.2.2 Integration ......................................................................................... 111
5.3 Performance Data ............................................................................................. 112
5.3.1 Test Environment .............................................................................. 114
5.3.2 Experiments and Results ................................................................... 115
5.3.3 Analysis ............................................................................................. 119
Chapter 6 Framework Evolution .............................................................................. 121
6.1 Implementation Issues...................................................................................... 121
6.1.1 Use of Standard C++ Features .......................................................... 122
6.1.2 Efficiency Concerns .......................................................................... 122
6.1.3 RTSPHeader Support ........................................................................ 123
6.1.4 Protocol Logic ................................................................................... 124
6.2 Architectural Enhancements............................................................................. 124
6.2.1 Client Interface .................................................................................. 125
6.2.2 Server Interface ................................................................................. 127
Chapter 7 Conclusions ............................................................................................... 132
References .................................................................................................................... 134
List of Tables
Table 3.1: Response Classes.....................................................................................................62
Table 5.1: Experimental Results ...........................................................................................115
List of Figures
Figure 4.1: Overall Class Model .............................................................................................72
Figure 4.2: Request Hierarchy.................................................................................................73
Figure 4.3: Exception Hierarchy .............................................................................................74
Figure 4.4: Functional Model..................................................................................................75
Figure 4.5: RTSPStream Dynamic Model ..............................................................................77
Figure 4.6: RTSPInlineDataQueue Dynamic Model ..............................................................79
Figure 4.7: RTSPMessageQueue Dynamic Model .................................................................80
Figure 4.8: RTSPReader Dynamic Model ..............................................................................81
Figure 4.9: RTSPDataReader Dynamic Model.......................................................................82
Figure 4.10: RTSPMessageReader Dynamic Model ..............................................................83
Figure 4.11: RTSPRequestFactory Dynamic Model...............................................................84
Figure 4.12: RTSPMissive Dynamic Model ...........................................................................85
Figure 4.13: RTSPInlineData Dynamic Model.......................................................................86
Figure 4.14: RTSPMessage Dynamic Model..........................................................................87
Figure 4.15: RTSPRequest Dynamic Model...........................................................................88
Figure 4.16: RTSPAnnounceRequest Dynamic Model ..........................................................89
Figure 4.17: RTSPDescribeRequest Dynamic Model............................................................90
Figure 4.18: RTSPPlayRequest Dynamic Model....................................................................91
Figure 4.19: RTSPSetupRequest Dynamic Model..................................................................92
Figure 4.20: RTSPChannelListRequest Dynamic Model .......................................................93
Figure 4.21: RTSPResponse Dynamic Model ........................................................................95
Figure 4.22: RTSPHeader Dynamic Model ............................................................................96
Figure 4.23: RTSPTransportHeader Dynamic Model.............................................................97
Figure 4.24: Exception Class Dynamic Model .......................................................................98
Figure 4.25: RTSPStream Interface ......................................................................................100
Figure 4.26: RTSPMessage Interface....................................................................................101
Figure 4.27: RTSPRequest Interface.....................................................................................102
Figure 4.28: RTSPResponse Interface ..................................................................................103
Figure 4.29: RTSPHeader Interface ......................................................................................104
Figure 5.1: Reflector Main Loop...........................................................................................107
Figure 5.2: Reflector Class Model ........................................................................................108
Figure 5.3: Reflector Functional Model ................................................................................109
Figure 5.4: IP Hoot Class Model...........................................................................................112
Figure 5.5: IP Hoot Functional Model ..................................................................................113
Figure 5.6: Test Request........................................................................................................116
Figure 5.7: Test Response .....................................................................................................117
Figure 6.1: Revamped Client Interface .................................................................................126
Figure 6.2: Revamped Server Interface.................................................................................128
Chapter 1 Introduction
As improving hardware and software technology has made the Internet a richer and more
accessible environment, considerable effort has been expended on attempts to broaden the
scope of the World Wide Web to incorporate audio and video in addition to still images
and text. A number of software architectures and network protocols have been created to
address the difficulties that arise in attempting to do so. The Real Time Streaming
Protocol (RTSP) is one such protocol, an open standard intended to serve as a common
language for continuous media clients and servers; RTSP is a simple, extensible control
protocol for managing the playback and recording of multimedia data streams over
computer networks. This thesis presents the specification and design of an object-oriented
framework for constructing applications that make use of RTSP to provide access to
continuous media services. The framework described herein facilitates the creation of
networked multimedia applications by adopting the same principles of simplicity and
flexibility that underlie the protocol’s design, manifesting them as a collection of
collaborating objects and classes that supply the functionality needed to implement
RTSP.
1.1 Multimedia on the Web
The motivation for the RTSP framework, and for the protocol itself, lies in the growth of
the Internet’s popularity and usage during the 1990s. With the availability of
inexpensive, high-speed modems and the release of the NCSA Mosaic web browser,
usage of the Internet has exploded, particularly among groups that previously had neither
access to the Internet nor need for such access. The dramatic increase in the exposure and
use of the Internet has qualified it as a mass medium, analogous to television, radio and
newspapers, while the unique aspects of the Internet make it possible to provide targeted
content geared toward specific audiences to a degree never before possible.
As a natural consequence of more pervasive use of the Internet and the World Wide Web,
demand has grown for the kind of content users have become accustomed to seeing in
CD-ROM-based software titles: highly graphical, interactive multimedia productions.
Furthermore, the reach and scalability of the Internet have made it an attractive platform
for communications applications such as electronic webcasts and point-to-point and
multi-party conferencing. Implementing these applications in the Internet environment
presents a number of challenges: users must have access to sufficient bandwidth to
support the demands of media-rich applications; their computer systems must have
enough processing capability to decode compressed media streams in real time; and the
application software must include mechanisms for transmitting and decoding continuous
media data streams.
1.2 Hardware Technology
The first two challenges referenced above have been addressed by the
evolution of hardware technology. The amount of bandwidth available to end-users has
increased steadily since the popularization of the web, with modem bitrates increasing
from 14.4 kb/s to 56 kb/s, enough bandwidth to support high quality audio or low
resolution video with medium quality audio. Other connection technologies, such as
Integrated Services Digital Network (ISDN), cable modems and Asymmetric Digital
Subscriber Line (ADSL), have made headway as well, promising to bring additional
bandwidth to users and enable higher quality video in the not-too-distant future.
Likewise, processor performance has increased significantly from year to year. In 1995, a
Dell Dimension XPS personal computer achieved a score of 3.16 on the SPECint95
benchmark, a widely-used benchmark suite for evaluating computer systems’
performance on integer operations; in 1998, a Dell Precision Workstation 610 scored 19.0
on the same benchmark [1], a more than six-fold improvement. The infusion of new
technologies into the mainstream personal computer market, such as superscalar, RISC-
based processing and special purpose instruction set enhancements like the MMX
multimedia extensions to the Intel Architecture, have also increased the amount of
computing power available to consumers. As a result, the computer systems available to
most users are more than capable of decoding the highly compressed audio and video
data streams generated by networked multimedia applications.
The pace of advancement in hardware performance shows no signs of slowing, either.
Products and technologies on the horizon, such as Very Long Instruction Word (VLIW)
computing and accelerated hardware implementations of multimedia algorithms promise
to ensure that users will have enough processing power to handle the higher bitrate data
streams made possible by the new connection technologies described above.
1.3 Software for Internet Multimedia
The improvements in processing power and Internet connectivity have driven the
development of software architectures that take advantage of the available technology.
Various commercial software vendors have released proprietary software products that
provide differing levels of interactivity and richness. Among others, these include
Macromedia, whose Flash and Shockwave products enable the display of simple
animations and interactive presentations; Real Networks, whose Real System is the
dominant platform for delivery of multimedia over the Internet; Microsoft, which offers a
variety of live and on-demand multimedia and conferencing applications; and Apple
Computer, which distributes software that enables QuickTime movies to be played over
the Internet. Typically, these products are integrated with users’ web browsers through a
plug-in mechanism defined by Netscape Communications Corporation in its Navigator
web browser, a commercial follow-on to the original Mosaic browser.
In their first incarnations, the software solutions for providing interactive multimedia
over the Internet have been, to varying degrees, dependent on proprietary transport and
control protocols. As a result of this dependence on proprietary techniques,
interoperability between different vendors’ implementations has been impossible. RTSP
and several related specifications seek to address the problem of interoperation by
providing a standard platform upon which multimedia applications can be built. These
specifications, which include RTSP for control over the playback of multimedia; the
Real-Time Transport Protocol (RTP) for delivery of continuous media data streams; and
the Session Description Protocol (SDP), which facilitates the communication of
information required to display a multimedia presentation, are open standards
administered by the Internet Engineering Task Force. As such, they enable any developer
to create clients and servers that work transparently with applications by other authors,
while still allowing for differentiation through documented extension mechanisms.
1.4 The Framework
The framework described in this thesis is intended to provide a portable, flexible
implementation of RTSP for use in a variety of applications. Although there are a number
of other implementations of the protocol that are available for development purposes,
none adequately capture both the simplicity and versatility of the specification. Some
offer flexibility, but are difficult to use and extend, while others offer ease of use, but
provide only limited functionality. This framework makes use of object-oriented
techniques to provide easy-to-use, high-level abstractions, while still providing the user
with the flexibility and access needed to implement complex, customized applications.
1.5 Organization
The remainder of the thesis is organized as follows. Chapter 2 introduces RTSP, covering
its design goals, evolution and features, and explores related protocols and technologies
as they pertain to RTSP. Chapter 3 follows with a more detailed discussion of the
protocol, providing an explanation of its message structure and semantics sufficient to
understand the workings of the framework. Chapter 4 then presents the design of the
framework itself, detailing the classes and objects that make up the library and the ways
in which they interact, and Chapter 5 elaborates on the design by illustrating how the
components of the framework are integrated into a pair of multimedia applications.
Chapter 5 closes with performance data illustrating that the framework’s message
processing architecture is robust and efficient enough to serve as the basis for heavy-duty
multimedia servers. Finally, Chapter 6 considers issues raised by the current design and
implementation of the framework and highlights directions future development might
take.
Chapter 2 Background
This chapter presents background information central to an understanding of RTSP’s role
in networked multimedia systems in general and the design of this RTSP framework in
particular. Section 2.1 discusses the origins, evolution and status of the RTSP
specification. Section 2.2 describes the functionality offered by the protocol, and section
2.3 outlines some of the applications for which RTSP was designed. Section 2.4 explores
related protocols and standards for networked multimedia, and the closing section of this
chapter discusses other implementations of RTSP that are available at the time of this
writing.
2.1 RTSP Evolution
RTSP is an open standard published by the Internet Engineering Task Force (IETF) in a
standards-track Request for Comments (RFC 2326) [2]. RTSP is currently classified as a
Proposed Standard. Its development followed from the proliferation of proprietary
protocols for control and transport of multimedia data over the Internet and the elusive
goal of interoperability. The intent was to develop a flexible standard that enabled not
just interoperation between similar products from different vendors, but the ability to use
the same tools, file formats and protocols for telephony and conferencing applications as
in video-on-demand and webcasting environments.
With these goals in mind, Anup Rao of Netscape Communications Corporation and Rob
Lanphier of RealNetworks, Inc. (then called Progressive Networks) submitted a draft
proposal to the IETF Multiparty Multimedia Session Control (MMUSIC) working group
in November of 1996. Their proposal outlined a simple protocol based on binary, non-
human-readable messages with support for requesting live or on-demand playback of
multimedia. Shortly thereafter, Henning Schulzrinne of Columbia University submitted a
counterproposal detailing a protocol making use of HTTP-like, textual messages and
incorporating more general extension mechanisms, a more abstract treatment of
multimedia transport protocols, and support for recording functionality, as well as
playback. This specification evolved in a series of Internet-Drafts released by the IETF,
and the effort culminated with the release of the RTSP RFC in April 1998. The final
specification is derived from Schulzrinne’s counterproposal, and contains contributions
from researchers and developers at Netscape, RealNetworks, Columbia University,
International Business Machines Corp., the French National Institute for Research in
Computer Science and Control (INRIA), and Microsoft Corporation, among others.
2.2 RTSP Features
This section contains a high-level discussion of some of the features and characteristics
of RTSP. Chapter 3 describes how these features are implemented by the protocol.
2.2.1 Setup and Control
At its core, RTSP is a protocol that enables applications to set up and control the
playback and recording of multimedia data over a computer network. Setup consists of
arriving at an agreement as to the kind of data that will be played or recorded and the
mechanism through which it will be transported, and control offers the end-user the
ability to interactively manage the flow of data to or from the multimedia server. The
capabilities of RTSP with respect to each of these dimensions of functionality are
described below.
The setup process begins with a simple exchange of the capabilities supported by the
client and server, including any non-standard extensions to the protocol that one party or
the other has implemented or requires. It continues with negotiation of the transport
mechanism to be used to carry the multimedia stream; this is done in advance of the
initiation of delivery of the stream in order to ensure that neither participant is presented
with a stream it cannot handle. Once an appropriate means of transport has been selected,
playback or recording can commence, under the user’s control.
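The exchange described above can be pictured as a short sequence of text messages. The following is an illustrative sketch only; the URL, port numbers, and session identifier are hypothetical, not drawn from any particular implementation.

```python
# Illustrative sketch of the RTSP setup sequence described above.
# The URL, ports, and session identifier are hypothetical examples.

def build_request(method, url, cseq, headers=None):
    """Format an RTSP request; like HTTP, messages are plain text."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"

# 1. Capability exchange: the client asks which methods and extensions
#    the server supports.
options = build_request("OPTIONS", "rtsp://example.com/clip", 1)

# 2. Transport negotiation: the client proposes a transport for the
#    stream before delivery begins, so neither party is presented with
#    a stream it cannot handle.
setup = build_request(
    "SETUP", "rtsp://example.com/clip/track1", 2,
    {"Transport": "RTP/AVP;unicast;client_port=4588-4589"})

# 3. Playback or recording proceeds under the user's control.
play = build_request("PLAY", "rtsp://example.com/clip", 3,
                     {"Session": "12345678"})
```
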
RTSP offers the user the ability to start, suspend and restart the transmission of the
multimedia stream as needed. In addition, RTSP-based applications can offer familiar,
VCR-like features, such as the ability to scan backwards or forwards through a
presentation and to seek to arbitrary points within a stored clip. RTSP also allows for
aggregate control over separate streams, so that presentations consisting of several
distinct tracks, such as a recording of a videoconference with separate audio and video
tracks, can be played and manipulated as a unified whole.
2.2.2 Timing and Synchronization
As described above, RTSP enables applications (and thus users) to begin playback at
arbitrary points within a multimedia clip or presentation; likewise, applications can
indicate that playback should stop at an arbitrary point. The desired starting and stopping
points can be specified in seconds, relative to the start of the presentation or, in the case
of archived recordings of live events, in wall clock time. In addition, clients can instruct
servers to begin or stop playback at a specified wall clock time. Thus, an RTSP client
could tell a server to play the third through sixth seconds of a stored video clip at some
moment in the future, provided the RTSP session is still active at the time.
RTSP-based applications can also use Society of Motion Picture and Television
Engineers (SMPTE) timestamps to express offsets from the beginning of clips. SMPTE
allows for frame-level control over playback and recording, thereby enabling RTSP
clients and servers to be used to perform professional-quality distributed editing of
multimedia presentations.
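The three timing notations just described are conveyed in RTSP's Range header. The helpers below are a minimal sketch with illustrative values; real implementations follow the header syntax of the specification exactly.

```python
# Sketch of the three timing notations discussed above as they appear
# in RTSP's Range header. All values are illustrative.

def npt_range(start, end=""):
    """Normal Play Time: offsets in seconds from the presentation start."""
    return f"Range: npt={start}-{end}"

def clock_range(start, end=""):
    """Absolute wall clock time, e.g. for archived live events."""
    return f"Range: clock={start}-{end}"

def smpte_range(start, end=""):
    """SMPTE timestamps give frame-level precision for editing."""
    return f"Range: smpte={start}-{end}"

# "Play the third through sixth seconds of a stored video clip":
seconds = npt_range(3, 6)
# Frame-accurate offset: 10 seconds and 12 frames into the clip.
frames = smpte_range("0:00:10:12")
```
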
2.2.3 Security
RTSP provides flexible, open mechanisms for clients to interact in a secure manner with
RTSP servers. Authentication and encryption of client-server interactions are supported
through Internet standards, allowing implementations to provide as much or as little
security as is required for a particular application.
In addition, RTSP is designed to be friendly to the firewall and proxy software in place at
many companies and public institutions that provide Internet access to employees and
patrons. The protocol readily lends itself to handling by transport-level proxy
services, such as SOCKS [3], and its status as an Internet standard makes it easy for
vendors of packet-filtering firewalls to allow legitimate RTSP requests to pass
unhindered while blocking the packets of hackers and other intruders.
2.2.4 Extensibility
In addition to supporting basic multimedia delivery through the features described above,
RTSP was designed to accommodate unforeseen applications and usage scenarios
through a variety of extension mechanisms. As needed, RTSP implementations can modify the behavior and semantics of the basic operations defined by the protocol; add entirely new operations; or, in the event that the current protocol is completely unsuitable for a particular problem but developers wish to maintain some degree of backwards compatibility with older software, change nearly every aspect of the protocol.
Of course, the utility of changing the behavior of the protocol is greatly reduced if doing
so results in an application that is unable to inter-operate with other implementations of
RTSP. To address this problem, RTSP requires that clients and servers support a standard
means of querying one another to determine which non-standard options and enhancements the other party supports.
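One way to picture this negotiation is a server checking the list of features a client declares it requires. The sketch below is purely illustrative: the feature names are hypothetical, and a real server would use the protocol's standard headers to report the result.

```python
# Hypothetical sketch of feature negotiation: the client lists the
# extensions it requires, and the server rejects the request if it
# cannot honor them. Feature names here are invented for illustration.

SERVER_FEATURES = {"play.basic", "com.example.fast-start"}

def check_required(features):
    """Return (ok, unsupported) for a client's required-feature list."""
    requested = {f.strip() for f in features.split(",")}
    unsupported = sorted(requested - SERVER_FEATURES)
    return not unsupported, unsupported

ok, missing = check_required("play.basic, com.example.seek-by-frame")
# Here ok is False; the server would refuse the request and report the
# unsupported feature back to the client.
```
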
2.2.5 Flexibility
By design, and unlike some multimedia control protocols that have preceded it, RTSP is
flexible in its support for alternative mechanisms for tasks not directly related to the
control of delivery of multimedia data. In particular, RTSP is agnostic with respect to
such decisions as the choice of transport protocol used to deliver audio and video data
and the representation format used to describe multimedia presentations. The protocol
allows clients and servers to state the formats and protocols they support and arrive at a
mutually acceptable decision.
2.2.6 Ease of Use
RTSP is designed to be easy to implement and use. It is text-based and human-readable,
making debugging clients and servers easier. Its structure is modeled closely after that of
HTTP [4] and MIME [5], allowing the reuse of existing code in new RTSP
implementations. This similarity to HTTP also allows RTSP-based applications to take
advantage of standard extensions to HTTP, such as digest access authentication [6] and
PICS [7, 8], a system for associating labels and ratings with content.
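Because RTSP borrows HTTP's framing (a start line followed by MIME-style "Name: value" headers and a blank line), a small generic parser covers both protocols. The response below is an illustrative sketch, not output from any particular server.

```python
# Minimal sketch of a parser for HTTP-style message framing, which
# RTSP shares. The response text is illustrative.

def parse_response(raw):
    head, _, body = raw.partition("\r\n\r\n")
    start, *header_lines = head.split("\r\n")
    version, code, reason = start.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(code), reason, headers, body

raw = ("RTSP/1.0 200 OK\r\n"
       "CSeq: 2\r\n"
       "Session: 12345678\r\n"
       "\r\n")
version, code, reason, headers, body = parse_response(raw)
```
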
2.3 RTSP Applications
This section describes some of the applications for which RTSP was designed and
provides some insight into how particular features of the protocol are utilized in various
settings.
2.3.1 Video-on-Demand
Video-on-demand is one of the core applications RTSP was designed to support. Through
its basic command set, RTSP allows for the creation of sophisticated streaming video
applications with support for many advanced features, including:
• VCR-style control over the delivery and playback of multimedia,
• the ability to start playback at an arbitrary point within a presentation,
• independence from specific transport mechanisms and media types,
• interoperation of clients and servers from different vendors,
• tight integration with web browsers, and
• pay-per-view and other logging and billing methods.
2.3.2 Live Broadcasts
RTSP provides many of the same benefits to live audio and video applications that it
brings to on-demand systems. Administrators can easily use the protocol’s authentication
features to construct secure commercial systems based on RTSP-aware clients and
servers, and its support for varying transport mechanisms makes it feasible to support
both small-scale events in which the data stream is unicast directly to clients and large-
scale, high-profile broadcasts supporting many thousands of clients via IP Multicast.
Broadcasts can also combine the distribution methodologies to take advantage of the
bandwidth efficiency of multicast transmission while allowing users whose access
providers do not support multicast traffic to participate.
2.3.3 Near Video-on-Demand
In addition to simple live and on-demand applications, RTSP supports near-on-demand
delivery, an amalgamation of the two approaches discussed in [9]. In this usage scenario,
a multimedia presentation is multicast several times at staggered intervals. This allows
for the use of IP multicast so that bandwidth is utilized efficiently, while maintaining
some of the flexibility and convenience of on-demand services. The multicast addresses
used by the staggered multicasts can be determined dynamically by the RTSP server, so
clients automatically receive the most recently started signal.
2.3.4 Virtual Presentations
Moving beyond the realm of simple streaming media applications, RTSP facilitates the
creation of virtual presentations incorporating live, stored and interactive multimedia.
The protocol’s support for playback of arbitrary segments of clips and controlling the rate
of playback makes it easy to integrate several segments into a single presentation, and the
ability to seamlessly access multiple servers simplifies the creation of interactive works
incorporating disparate types of multimedia content.
2.3.5 Conferencing and Telephony
RTSP can also be utilized in network-based conferencing and telephony applications.
Although it does not provide the signaling functionality required of the protocols that
form the foundation of these applications – those functions are left to dedicated protocols
like H.323 [10] and SIP [11] – RTSP can be used to play prerecorded content into an
active call or to record a conference for later retrieval and playback. In addition, RTSP
can be used as the basis for IP-based voice mail and menu systems.
2.3.6 Distributed Digital Editing
RTSP’s explicit support for recording operations, together with the provisions it makes
for frame-level timing, make it possible to implement distributed digital editing systems
for multimedia. Such a system could make use of multiple networked playback and
recording devices coordinated by an RTSP-based software application.
2.4 Related Technologies
2.4.1 Hypertext Transfer Protocol
Developed in 1990 at the European Laboratory for Particle Physics (CERN) in Geneva,
Switzerland, the Hypertext Transfer Protocol (HTTP) provides access to a vast collection
of inter-linked Internet resources, primarily consisting of graphics and text [12], called
the World Wide Web (WWW, or web). Since the release of the NCSA (then the National Center for Supercomputing Applications) web server software and graphical browser,
Mosaic, the web has experienced phenomenal rates of growth in size and usage. In
February 1994, the NCSA web server [13] handled one million requests per week; by
December of the same year, its load had grown to four million per week [12]. In
September 1998, some estimates placed the number of users of the web at 39 million
people [14].
As the web’s usage and accessibility grew, content providers sought to enrich their pages
by moving beyond hypertext and graphics and incorporating new media, including audio
and video, into their web sites. In November 1995, the web was estimated to contain over
eleven million distinct resources hosted on more than 225,000 servers [15]. At that point,
there were approximately 36,000 video files available on the web [16], and although
audio and video accounted for only 1% of the requests handled by NCSA’s server, they
were responsible for 28% of the bytes transferred.
Many early efforts aimed at integrating audio and video into the web made use of HTTP
to retrieve audio and video data in the same fashion as text and images. Several
difficulties arise with this approach. First, HTTP is oriented towards whole-file
downloads: the simplest way to access video via the web is to download an entire video,
storing it locally on the user’s hard disk before playback begins. Because of the large size
of files containing video data, however, this often results in a long wait between the time
a user decides to view a video and the time it actually begins playing. Various “fast start”
solutions have been adopted to address this issue [17], but they do not solve a more
fundamental problem: HTTP’s use of TCP for data transport. As described in Section
2.4.5 below, TCP is inappropriate for transport of audio and video data because it
introduces jitter, degrades picture quality by making sure each and every data packet is
delivered in order, and imposes inappropriate flow control on the data stream [18].
In addition, HTTP itself lacks features that are useful in the context of providing
multimedia services; the protocol includes a very limited command set and requires that
servers handle requests in a stateless fashion. As a consequence, there is no mechanism
for clients to retrieve descriptions of available resources in standard formats; the
transmission of a data stream cannot be paused and resumed in place at the user’s
request; the parameters used to encode and transmit the data stream cannot be changed on
the fly; and clients cannot exercise fine-grained, time-based control over the portions of the data stream they receive. Newer revisions of the HTTP specification provide support for
partial downloads [19], but at the level of byte ranges, which is inappropriate for
multimedia because many encoding formats make it difficult or impossible to
map time offsets to byte ranges without processing and decoding the entire data file.
As a result of these issues, HTTP enjoys a complementary relationship with RTSP. Web
pages served via HTTP can incorporate video accessed via RTSP, and HTTP can be used
to access meta-data such as session descriptions for use within RTSP-based applications.
HTTP is an Internet Engineering Task Force standard protocol described in full in a
series of IETF Request for Comments documents [4, 19].
2.4.2 PNA
PNA is the control and transport protocol used by the popular RealAudio, RealPlayer,
and RealServer streaming multimedia applications until the release of RealSystem G2,
comprising RealPlayer G2 and RealServer G2, in November 1998. PNA is a proprietary
protocol and, as such, the full specification is not publicly available, although RealNetworks has released enough information to enable firewall and proxy software authors to incorporate support for recognizing and relaying PNA traffic. With the release of RealSystem G2, PNA has been replaced by RTSP (for control traffic) and RTP (for multimedia data), though the client and server components continue to support PNA for the sake of interoperability with the installed base of servers and clients.
2.4.3 H.323
Approved in 1996 by Study Group 15 of the International Telecommunication Union,
ITU Recommendation H.323 [10] describes protocols and components for systems that
provide conferencing and telecommunications services over packet-based networks. The
H.323 Recommendation encompasses standards for call control, multimedia and
bandwidth management and interfaces between packet-based communications systems
and circuit-switched networks. H.323 is one of a series of recommendations, known as
H.32X, that propose standards for conferencing applications over a variety of networks,
notably ISDN and the standard voice telephone network, which are addressed by
Recommendations H.320 and H.324, respectively [20]. Intended to provide
interoperability among compliant products and devices, H.323 has garnered considerable
support among vendors in the computing and telecommunications industries. Among
others, Intel, Microsoft Corporation and Netscape Communications Corporation have
pledged to support the standard in their products.
An H.323-based conferencing and communications system comprises a collection of
components, each of which provides a distinct service to the other components of the
system. The types of components defined in the Recommendation are as follows:
Terminals, which constitute the end-user endpoints of H.323 conferences; Gateways,
which provide interfaces between H.323-based systems and other communications
architectures; Gatekeepers, which provide access control services; and Multipoint Control
Units, which work with Multipoint Processors and Multipoint Controllers to enable
conferences involving more than two parties. These components are logical, rather than
physical, constructs; a single piece of hardware might contain one or many H.323
components. In addition, not all environments require every type of component; in the
simplest case, that of a two-party, endpoint-to-endpoint call within a single local area
network, terminals may be all that is required.
In addition to describing the hardware components that make up an H.323 system, the
Recommendation specifies communications protocols and encoding mechanisms through
which the components interact. These include a number of other ITU recommendations,
including standards for audio and video encoding algorithms (H.261, H.263, G.711,
G.722, G.728, G.729, MPEG1, G.723.1, etc.) and protocols for the transmission of binary
data within conferences (T.120), packetization of media streams for transmission over
computer networks (H.225), capability negotiation and channel allocation (H.245) and
call signaling (Q.931). The Recommendation specifies the use of Internet standard
protocols, including RTP, RSVP and IP Multicast, to enable the transmission of audio
and video data over IP-based networks. In addition, H.323 defines a protocol,
Registration/Admission/Status (RAS), used to control allocation of network resources,
regulate access to local and remote systems, and perform address translation.
A more in-depth description of the elements of an H.323 system and the ways in which
those elements interact using the above-mentioned protocols and standards follows.
Terminals: Terminals represent the endpoints of an H.323 conference or call; they
provide for real-time, two-way communications between the user and other parties
involved in the call. The specification requires that all compliant Terminals support audio
encoded using one of the standard mechanisms; optionally, Terminals can also support
video and data conferencing. Encoded audio and video are packetized as dictated by
H.225 and transported via RTP, whereas data is encoded and transmitted as specified in
T.120. The encoding algorithms and data types to be included in a call are negotiated,
along with other considerations, such as the bitrates of the data streams generated by each
party, as described in H.245. Terminals must also support RAS in order to communicate
with Gatekeepers and may optionally incorporate a Multipoint Control Unit, described
below, in order to facilitate participation in multi-party conferences.
Gateways: H.323 Gateways enable H.323-enabled devices and software to communicate
with other ITU-compliant conferencing terminals residing on circuit switched networks
such as ISDN and the public telephone network. Providing this functionality entails
translating between different transmission formats, call control mechanisms, and
encoding formats, as well as establishing and maintaining connections on both of the
involved networks. Terminals communicate with Gateways using the H.245 and Q.931
call setup and signaling protocols.
Gatekeepers: H.323 systems can optionally include a Gatekeeper component that
provides call control services that help preserve the integrity and usability of the network
on which the system resides. The simplest of these services is translation of locally-
known aliases for H.323 endpoints and gateways into transport-level network addresses;
in systems including a Gateway, a Gatekeeper component is also used to translate
incoming E.164 (ISDN) addresses into the corresponding packet network addresses.
In addition to performing address translation, Gatekeepers control conferencing
endpoints’ access to the network. Because Terminals and other endpoints are required to
make use of a Gatekeeper if one is present in a system, Gatekeeper components can limit
the number of simultaneously active calls, thereby managing the amount of network
bandwidth consumed by conferencing applications. Furthermore, Gatekeepers can provide
call authorization functionality; for example, a Gatekeeper could enable or disable calls
to or from certain endpoints or restrict users’ ability to make calls to certain hours of the
day. The vendor of the equipment containing the Gatekeeper defines the actual criteria by which a call is allowed or disallowed.
Finally, a Gatekeeper component can also be used to facilitate the ad hoc creation of
multipoint conferences. When a point-to-point call is established between two parties, the
Gatekeeper for one of the parties can be configured to receive the H.245 signaling
information associated with the call. No processing of the H.245 data is required of the
Gatekeeper; it simply passes the data from one endpoint to the other. Then, when one of
the participants elects to expand the conference to include a third party, the Gatekeeper
directs the H.245 traffic to a Multipoint Controller as well, which then establishes the
parameters under which the expanded call will operate.
Gatekeepers provide their services to Terminals and other endpoints via the RAS
protocol.
Multipoint Control Unit: Like Terminals, Multipoint Control Units (MCUs) are
endpoints – entities which users can call directly. A Multipoint Control Unit consists of a
Multipoint Controller and some number of Multipoint Processors; these components
work together to enable conferences involving three or more parties. Multipoint Control
Units can be located within Gateways and Gatekeepers, but in such an instance
implement wholly separate functions that merely happen to be located in the same piece
of equipment.
H.323 supports two types of multipoint calls, referred to in the recommendation as
centralized and decentralized conferences. In centralized conferences, all the Terminals
or other endpoints involved transmit audio, video, data and control information directly to
a Multipoint Control Unit. The MCU is responsible for multiplexing the incoming audio
streams, enabling endpoints to select a video feed in which they are interested, and
distributing the appropriate streams to the participants. In contrast, in decentralized
conferences, participants use IP Multicast or a comparable technology to simultaneously
deliver audio and video content to all of the participants in a call, while call signaling and
shared data such as whiteboard information are still processed in a centralized fashion by
a Multipoint Control Unit.
Multipoint Controller: A Multipoint Controller resides within a Multipoint Control Unit
and implements control functionality that enables conferences involving three or more
parties. Multipoint Controllers use H.245 to determine the capabilities of each endpoint
involved in the conference and inform the participants of the encoding formats and
communications parameters acceptable to the group as a whole; this set of capabilities is
revised as users join or leave the call.
Multipoint Processor: The remaining component of a Multipoint Control Unit, a Multipoint Processor (MP) is responsible for transforming the audio and video data
streams generated by participants in a centralized multipoint conference into the
appropriate form and returning the resulting streams to the connected parties. The
processing performed by an MP typically consists of some combination of switching and
mixing of the incoming data, and can also entail converting audio or video data into
alternative formats for display or playback on terminals that cannot decode the primary
format in use within the call.
Although H.323 is designed for advanced communications services, it can also be used
for simple media playback applications like those at which RTSP is targeted; Microsoft
Corporation’s NetShow and Media Player streaming video applications use H.323 in
exactly this fashion. Using H.323 for simple media playback and recording operations is
not without disadvantages, however. The ITU Recommendation that defines H.323 is
long and complex, making the specification difficult to implement [21]; likewise, the
protocols that make up the specification are themselves complex, resulting in higher call
setup latencies than are possible with a more lightweight protocol like RTSP. Finally, the
complex nature of H.323 results in more complicated and larger implementations, which
makes them inappropriate for use in some applications, such as within a
downloadable Java applet.
2.4.4 Session Initiation Protocol
Like H.323, the Session Initiation Protocol (SIP) is a signaling protocol that enables the
creation of point-to-point and multi-party conferences and allows users to invite servers
and other users to participate in active calls [11]. SIP is designed as a more lightweight,
flexible and modular alternative to H.323, with a lineage consisting of standard Internet
protocols like HTTP, RTP and RTSP and without the baggage of the circuit-switched
ISDN protocols upon which H.323 is based. The creator of SIP provides a detailed
comparison of the two architectures in [21]; the protocol itself is a work-in-progress
described in draft form in an Internet Draft [22]. A brief synopsis of the design and
underlying philosophy of SIP follows. A comprehensive description of an Internet
telecommunications architecture based on SIP and related protocols like RTP and RTSP
is presented in [23].
One of the primary objectives of the SIP design is simplicity: the protocol is intended to
be easy to parse and debug. This objective is accomplished by using much of the syntax
of HTTP, extending that protocol to allow bi-directional messaging. This enables the
reuse of existing code for parsing HTTP-style messages and the multitude of HTTP
extension mechanisms, and the fact that the protocol is text-based facilitates debugging
SIP-based applications. In contrast, H.323 messages are encoded in binary form using
relatively complicated ASN.1 packed encoding rules.
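For contrast, a minimal, incomplete SIP request illustrates the textual syntax. The addresses and Call-ID below are hypothetical, and several mandatory headers are omitted for brevity.

```python
# A minimal, illustrative SIP request showing the HTTP-like textual
# syntax the text contrasts with H.323's binary ASN.1 encoding. The
# addresses are hypothetical and several mandatory headers are omitted.

invite = "\r\n".join([
    "INVITE sip:bob@example.com SIP/2.0",
    "From: <sip:alice@example.com>",
    "To: <sip:bob@example.com>",
    "Call-ID: a84b4c76e66710@example.com",
    "CSeq: 1 INVITE",
    "",
    "",
])

# Every field can be read, logged and debugged with ordinary text tools.
method = invite.split(" ", 1)[0]
```
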
The simplicity of SIP’s design also extends to the semantics of the messages exchanged
by participants in a conference. Rather than requiring applications to utilize a Byzantine
collection of interrelated protocols in order to provide communications services, SIP is
based on a small number of orthogonal commands that can be composed to provide high-
level call signaling functionality.
SIP’s simplicity brings with it a measure of focus and modularity to the specification.
The protocol’s scope extends solely to call setup and control functions; the mechanisms
through which features like service discovery and quality-of-service are provided are left
unspecified, so any of a number of alternative approaches can be utilized. The protocol
leverages existing Internet infrastructure, such as the Domain Name System and electronic mail address formats, where possible, rather than inventing new solutions to old problems. Furthermore, the features provided by SIP are not interdependent; for example,
an application might use SIP’s call setup functionality to locate the target of a call and
then use H.323’s mechanisms to establish and maintain the call.
Extensibility and scalability are the other main goals of SIP’s design. Whereas H.323 can
be extended to support application- and vendor-specific functionality primarily through
nonstandardParam fields included in specific locations in its protocols’ grammars, SIP
builds on the mechanisms used in Internet protocols like HTTP and Simple Mail Transfer
Protocol (SMTP) [24] to provide significantly more flexibility. SIP’s architecture allows
clients to specify the exact features they require and for servers to accept or reject
requests based on their support for clients’ needs. In addition, SIP allows implementers to
add significant functionality to the protocol while preserving compatibility with existing
implementations: by default, SIP agents ignore unknown headers within requests and
reply messages, so older implementations can handle messages from applications
supporting extensions to the protocol transparently.
SIP’s extensibility also extends to the critical subject of audio and video encoding
formats. The primary mechanism through which information about encoding formats is
conveyed in SIP is the Session Description Protocol (SDP), which uses textual names to identify the codecs the participants
in a SIP session can understand. These names can be registered by individuals or groups
with the Internet Assigned Numbers Authority (IANA), which provides contact
information for the registrant to interested parties so that implementers can incorporate
support for any registered format in their applications. The H.323 Recommendation, on
the other hand, mandates that supported codecs must be centrally registered and
standardized with the ITU. As of July 1998, the only encoding formats approved for use
were ITU-developed, and many of them incorporated significant amounts of proprietary
intellectual property, making developing inexpensive H.323-based systems a challenging
proposition.
Finally, SIP improves upon H.323 in its utility for large-scale communications systems
by dint of its support for stateless and multicast signaling. Use of stateless call processing
enables SIP gateways and servers to handle a larger number of calls by reducing the
amount of memory and processing overhead associated with setting up and maintaining a
call. Multicast signaling allows SIP conferences to scale transparently from two to a
multitude of participants by removing the need for a central location at which all call
processing occurs.
Although it is possible to provide multimedia playback and recording services via SIP,
the signaling functionality that is the primary focus of the specification is superfluous in
these kinds of applications. RTSP, which is geared exclusively towards providing access
to stored and live multimedia data streams, is more appropriate. In fact, RTSP and SIP
are designed to complement each other nicely: SIP-aware conferencing systems can use
RTSP to enable features like voice mail, recording of conferences for future retrieval in
on-demand fashion, and playback of previously recorded material into an active call.
2.4.5 Real-Time Transport Protocol
One of the key challenges of providing advanced multimedia services over TCP/IP-based
Internets is delivering audio, video and other forms of data to viewers and listeners
quickly and efficiently, so that playback can begin promptly and proceed smoothly,
without delays or artifacts. The Real-Time Transport Protocol (RTP) is designed to
facilitate delivery of data with real-time characteristics, like audio and video, providing
functionality such as payload type identification, sequence numbering, source
identification, timestamping, and receiver-generated feedback for quality of service
monitoring. RTP is discussed at a high level in [9] and in detail in a standards track
Internet Engineering Task Force RFC [25].
Internet and web-based multimedia applications are well suited to a specialized protocol
like RTP, rather than the more common HTTP, described above, because of the network
performance requirements imposed by the characteristics of continuous media. In order to
play back high quality audio or video, multimedia-enabled clients and servers depend on
the network to provide predictable, if not necessarily minimal, delay. In addition, it is
typically unnecessary for every last bit of multimedia data streams to be delivered intact;
the delays and artifacts caused by retransmitting lost data packets are often more
noticeable than the effects of ignoring them altogether. For these reasons, reusing
existing protocols like HTTP, which make use of TCP as the underlying transport
mechanism, for multimedia data is usually inappropriate. Because it is designed to deliver
data packets reliably, TCP wastes time retransmitting dropped data packets, even when
the application would be better served by ignoring them. Furthermore, TCP provides
windowing and congestion control mechanisms that, though effective for many data
streams, can introduce latencies that obscure the temporal relationships between packets
in continuous media streams. Finally, TCP-based protocols cannot take advantage of IP
Multicast, described in section 2.4.7, which makes them ill-suited for large-scale
conferencing and live multimedia applications [9, 18, 26].
The origins of RTP can be found in earlier work on NVP [27] and PVP [28]. The protocol
is designed to serve as a common basis for a variety of real-time continuous media
services; RTP is application-independent, with application-specific profiles, or
specializations, providing additional functionality needed for particular real-time
applications. In a sense, the RTP specification is incomplete: in addition to the basic
protocol, an application must incorporate an appropriate profile and payload formats in
order to make effective use of RTP. The profile for audio and video conferences is
described in [29].
Because RTP is designed to be application-independent, implementations can be
packaged into reusable code libraries and incorporated directly into clients, servers, and
associated tools. Thus, protocol processing is performed at the application level, rather
than in a separate layer. This approach, called application level framing and integrated
layer processing, is described in [30]. In addition, the application-independence of RTP
also enables the creation of generic tools for monitoring, tracing and providing quality-
of-service information about real-time traffic without regard for the specific application
that generated it.
RTP traffic is divided into data packets and control packets. Continuous media data is
carried in data packets, while information about the performance of the network and other
non-continuous data are communicated via the control mechanism. When, as is typically
the case in IP-based environments, both data and control information are carried in UDP
packets, sequential port numbers are used, with the lower, even-numbered port the target
for packets containing multimedia data and the higher, odd-numbered port used as the
destination for control packets.
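The even/odd port pairing can be sketched as follows; rounding an odd candidate port down to the adjacent even port is one common convention rather than a rule stated in the text.

```python
def rtp_port_pair(port):
    """Return the (data, control) UDP port pair for an RTP session:
    multimedia data on the lower, even-numbered port and RTCP control
    packets on the next higher, odd-numbered port. Rounding an odd
    candidate down is a common convention, not a requirement here."""
    data_port = port if port % 2 == 0 else port - 1
    return data_port, data_port + 1

print(rtp_port_pair(5004))  # → (5004, 5005)
print(rtp_port_pair(5005))  # → (5004, 5005)
```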
RTP data packets begin with a number of fixed headers that provide information common
across the spectrum of applications supported by RTP. These include a sequence number,
payload type, timestamp, and identifiers specifying the source or sources of the data
contained in the packet. The timestamp is a thirty-two-bit value whose meaning is
dependent on the profile in use and the payload type; source identifiers are also thirty-two
bits in length, but do not correspond to a particular address format, such as four-byte IP
addresses. Rather, individual sources select random identifiers when they first generate
data to be transmitted via RTP, and conflicts are detected and eliminated as they occur.
The fixed header data is followed immediately by whatever profile-specific headers are
required by the profile for a particular application; these headers, if present, are then
followed by data of the variety specified in the payload type header field.
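A sketch of the fixed header described above, assuming the twelve-byte layout of the RTP specification with no padding, no extension, and an empty contributing-source list:

```python
import struct

def make_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header: version 2, no padding or
    extension, zero contributing sources, followed by the marker bit
    and payload type, the 16-bit sequence number, the 32-bit timestamp,
    and the 32-bit synchronization source identifier."""
    byte0 = 2 << 6                                  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

hdr = make_rtp_header(payload_type=26, seq=1, timestamp=90000,
                      ssrc=0x1234ABCD)
print(len(hdr))  # → 12
```

Any profile-specific headers and the payload itself would follow these twelve bytes.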
As described above, RTP control packets are used to convey information regarding the
level of service being provided to users by the network and real-time application. In
addition, RTP control packets can be used to communicate further information about the
party generating RTP traffic and to establish rough synchronization relative to wall clock
time among the participants in a live event or conference.
The format and communications parameters for RTP control packets are defined as a
subsidiary protocol within the RTP specification [25] called the RTP Control Protocol
(RTCP). RTCP packets are similar in structure to RTP data packets, but are usually
carried over a separate transport-level “connection.” The parties using an RTP-based
application periodically multicast RTCP packets to the other participants in order to
provide feedback about their state even when no real-time traffic is being generated.
Both producers and consumers of RTP data must distribute RTCP packets containing
sender or receiver reports. Receiver reports contain information useful to senders such as
the highest sequence number yet received; timestamps, which allow senders to compute
round-trip times; and a measure of jitter, or the variance in packet arrival times within a
data stream. Sender reports enable listeners to estimate the actual data rate of the real-
time data stream and establish a relationship between the timestamp values in the RTP
data packets they are receiving and wall clock time. Sender reports also contain
additional information identifying generators of RTP data above and beyond the thirty-
two-bit identifier included in RTP data packets. This information allows recipients to
provide readable names for senders to users and to maintain sender-specific information
when identifier conflicts occur.
Participants in a conference or broadcast decide how often to send RTCP messages based
on the bandwidth consumed by the data they are sending or receiving, thereby limiting
the amount of control traffic to a known, fixed percentage of the overall load. A value of
5% is suggested in the RTP specification, but the establishment of a mandatory fraction is
left to the creators of profiles. So that newcomers to a session can quickly identify the
parties generating RTP traffic, senders of real-time data are collectively allocated a
quarter of the fraction of the load assigned to control traffic. The remaining three-quarters
are split evenly among all the receiving participants. The specification includes an
algorithm for determining the appropriate interval at which RTCP reports should be
generated based on the available bandwidth and the number of participants [25].
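A simplified version of that interval computation is sketched below; the full algorithm in the specification adds randomization and reconsideration rules that are omitted here.

```python
def rtcp_interval(session_bw, members, senders, we_sent, avg_rtcp_size,
                  rtcp_fraction=0.05):
    """Simplified sketch of the RTCP report-interval computation:
    control traffic is capped at a fraction (5% by default) of the
    session bandwidth, with senders collectively allocated one quarter
    of that amount and receivers the remaining three quarters.
    Bandwidth is in bytes per second, packet sizes in bytes. The full
    algorithm also randomizes the interval, which is omitted here."""
    ctrl_bw = session_bw * rtcp_fraction
    if 0 < senders < members / 4:
        if we_sent:
            ctrl_bw *= 0.25          # senders share a quarter
            n = senders
        else:
            ctrl_bw *= 0.75          # receivers share three quarters
            n = members - senders
    else:
        n = members
    return max(n * avg_rtcp_size / ctrl_bw, 5.0)  # 5-second minimum
```

As the group grows, each participant's reports become proportionally less frequent, so the aggregate control traffic stays near the fixed fraction.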
RTP and RTSP complement each other nicely, with the former providing transport
services for networked multimedia applications and the latter providing control
functionality for those same applications. This is somewhat unsurprising, considering that
both protocols are defined in standards-track Internet RFC’s that share key contributors.
Although RTSP is designed to be neutral with respect to transport mechanisms, RTP is
the only multimedia transport protocol for which a means of specifying its use is
described in the RTSP specification [2].
2.4.6 Video Datagram Protocol
Like RTP, the Video Datagram Protocol (VDP) is a transport protocol geared to the
delivery of continuous media data over computer networks. Unlike RTP, VDP is
designed to take advantage of the point-to-point connections between clients and servers
in order to provide VCR-style control functionality, such as play, pause, rewind and fast-
forward commands, and to enable optimized data transmission appropriate to the levels
of network congestion and CPU utilization experienced by the participating systems.
VDP was developed by researchers at the University of Illinois in 1995 and 1996, and is
described in a paper presented at the Fourth International World Wide Web Conference
[18] and in a U.S. patent application [31]. Motivated by the difficulties posed by the use
of HTTP for the transmission of multimedia data in Internet applications, VDP is
designed to address the variability of Internet performance and client load, while
providing real-time, on-demand delivery of audio and video streams.
VDP is an asymmetric protocol involving two endpoints, a server and a client, which
communicate using two distinct data channels. The first is a reliable control channel used
by the client to convey user commands and connection management information to the
server; the second is an unreliable channel used for the transmission of multimedia data
from the server to the client and non-critical feedback in the opposite direction. This
feedback mechanism provides VDP-based systems with the ability to dynamically adjust
the data stream in order to accommodate changes in network performance and client CPU
utilization.
During playback, a VDP client measures performance by estimating packet round-trip
times and monitoring the rate at which the system is displaying received video frames
and the percentage of incoming data packets which are being lost in transit. In the event
the client system is unable to display the video stream due to insufficient CPU power or a
sufficiently high percentage of data packets are being lost, the VDP software generates a
feedback message which is sent from the client to the server over the unreliable data
channel. Upon receipt of such a message, the server addresses the situation by thinning
the data stream, reducing the frame rate of the video feed sent to the client and, as a
result, the amount of network bandwidth and processing power necessary to deliver and display
the content. Should performance continue to suffer, additional feedback messages are
generated that trigger further thinning of the data stream. Conversely, should the
measured levels of packet loss or CPU utilization drop to more acceptable levels, the
client generates a feedback message that instructs the server to restore the data being
removed from the data stream. In this fashion, VDP clients and servers adapt to changing
conditions by altering the data stream as needed to ensure that video quality degrades
gracefully in the presence of problematic phenomena.
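The client's side of this feedback loop might be sketched as follows; the thresholds and message names are invented for illustration, since the published descriptions of VDP do not fix concrete values.

```python
# Hypothetical thresholds -- illustrative values, not taken from the
# VDP papers, which describe the behavior but not specific constants.
LOSS_THRESHOLD = 0.10       # fraction of data packets lost in transit
DISPLAY_THRESHOLD = 0.80    # frames displayed / frames received

def client_feedback(loss_rate, display_rate):
    """Decide which feedback message, if any, a VDP-style client should
    send, given its measured packet loss and display rates."""
    if loss_rate > LOSS_THRESHOLD or display_rate < DISPLAY_THRESHOLD:
        return "THIN"       # ask the server to reduce the frame rate
    if loss_rate < LOSS_THRESHOLD / 2 and display_rate > 0.95:
        return "RESTORE"    # conditions improved; restore stream quality
    return None             # conditions acceptable; no message needed
```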
In addition to enabling video clients and servers to adapt dynamically to difficult
situations, VDP incorporates a demand re-send algorithm that improves the quality of
video encoded in media formats that include inter-frame dependencies, such as MPEG.
Because the communications channel used by VDP for video data is unreliable, it is
possible that a packet containing frame data upon which subsequent or previous frames
depend could be lost. In this event, VDP clients can request that the lost packet be re-
sent; it is the responsibility of the client to maintain an internal data buffer sufficiently
large to ensure that the re-sent frame arrives in time to be displayed and to identify which
frames are important enough to re-send based on the media format in use.
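For an MPEG-style format, the client's re-send decision might look like the sketch below; the frame-type rule and timing check are plausible assumptions rather than the published algorithm.

```python
def should_request_resend(frame_type, frame_deadline, now, rtt_estimate):
    """Sketch of a demand re-send decision for MPEG-like media: only
    frames that other frames depend on (I- and P-frames) are worth
    re-requesting, and only when the re-sent data can plausibly arrive
    before the frame's display deadline. All names are illustrative."""
    important = frame_type in ("I", "P")   # B-frames have no dependents
    in_time = now + rtt_estimate < frame_deadline
    return important and in_time
```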
The VCR-style control functions supported by VDP overlap significantly with the scope
of RTSP’s functionality. As RTSP is designed to function independently of a designer’s
choice of transport protocol, it would be possible to construct a system that utilized RTSP
to convey control information and VDP to carry audio and video data, but the value of
doing so might be somewhat limited. RTP is presently in more widespread use than VDP,
is implemented in a wider variety of applications and ancillary tools, and offers support
for a broader base of media encoding formats. VDP's ability to let applications adapt to
changing conditions remains compelling, however; similar functionality might be
incorporated into RTP-based applications through the RTCP receiver report mechanism.
Further information about VDP and related research, including discussion of a
comprehensive architecture for structuring streamed multimedia presentations and
enhancements to VDP supporting frame-level addressing and in-line hyperlinking, is
presented in [32].
2.4.7 IP Multicast and the MBONE
In the simplest of network-based, real-time multimedia distribution systems, user clients
contact a server to request access to the feed and the multimedia data stream is sent
directly to them over the network. If 1,000 clients are interested in a particular broadcast,
1,000 identical copies of the stream are generated at the server and distributed over the
network simultaneously. Thus, when many users desire to access a live multimedia
broadcast, it is more efficient to generate a single data stream instead, which is then
replicated as needed when the network paths to recipients of the broadcast diverge [9].
The former approach is called unicasting; the latter, multicasting.
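The bandwidth arithmetic behind this observation is simple; the stream rate below is illustrative.

```python
stream_rate_kbps = 300     # illustrative per-stream video bandwidth
clients = 1000

# Unicast: the server emits one complete copy of the stream per client.
unicast_load = stream_rate_kbps * clients
# Multicast: the server emits a single copy; routers replicate as needed.
multicast_load = stream_rate_kbps

print(unicast_load)    # → 300000 (i.e., 300 Mbps at the server's uplink)
print(multicast_load)  # → 300
```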
An early approach to multicasting over the Internet was ST-II [33]. ST-II utilized a
sender-oriented approach, in which each sender established a set of appropriate
connections based on the currently active recipients. Unfortunately, this approach is
unsuitable for large-scale distribution of streaming multimedia, because each recipient
must inform every potential sender of its participation in order to be added to the list of
active endpoints [9].
To address this problem, IP Multicast, defined as an Internet standard in [34] and
extended in [35], takes a receiver-oriented tack. Clients interested in receiving a multicast
advertise that fact using the Internet Group Management Protocol (IGMP). A multicast-
aware router on the client’s local network receives the advertisement, notes the client’s
interest, and uses the Distance Vector Multicast Routing Protocol (DVMRP) to
communicate the client’s interest to other multicast routers on the Internet. Multicast-
aware routers use DVMRP to ensure that multicast data is routed in such a way that it
reaches all interested parties without unnecessarily congesting networks that do not
contain participating clients. Because the distribution and replication of multicast data is
handled transparently by cooperating routers, there is no need for senders to maintain a
record of all participating clients. As a consequence, the IP Multicast approach is far
more scalable and robust in environments with large numbers of listeners that can come
and go with relative frequency. However, IP Multicast’s reliance on router intelligence is
also a hindrance, as a significant fraction of Internet backbone routers are not multicast-
aware [9].
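In socket terms, a receiver's side of this receiver-oriented model might look like the following sketch (group address and port are illustrative); the operating system issues the IGMP membership report on the application's behalf.

```python
import socket
import struct

def join_group(group="224.2.127.254", port=9875):
    """Create a UDP socket and join a multicast group. The kernel sends
    the IGMP membership report for us; the application only declares
    its interest. The group address and port here are illustrative."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: the group address plus the local interface
    # (INADDR_ANY lets the kernel choose the interface).
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

# join_group() would return a socket on which sock.recvfrom() delivers
# any datagrams sent to the group.
```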
Because so much of the Internet does not support IP Multicast, the MBONE, or Multicast
Backbone, was created. The MBONE is an overlay network that connects islands of
networks that support multicast by encapsulating multicast packets in normal, unicast
UDP packets and transmitting them over the unsuspecting Internet. Described in depth in
[36, 37], the MBONE connects multicast routers using unicast tunnels over which
multicast data and DVMRP traffic is passed.
Although a control protocol like RTSP is not required in order to make use of IP
Multicast, there are a number of ways in which RTSP and IP Multicast can be used
effectively together. For example, using RTSP enables multicast sessions to be named via
a simple URL, which client applications can use to obtain a full specification of a
session’s parameters. This mechanism can also be used to control access to multicast
sessions, using RTSP to authenticate users and provide them with the encryption keys
necessary to decrypt a private multicast. In addition, an RTSP server providing on-
demand audio or video services can be used to play stored multimedia clips into an IP
Multicast session already in progress or to record its contents for later viewing. The
participants in such a conference can use RTSP clients as a virtual remote control,
stopping and starting playback or recording as needed.
2.4.8 Session Description Protocol and Session Announcement Protocol
Described fully in [38], the Session Description Protocol (SDP) is designed to enable the
announcement of the existence of multimedia broadcasts or conferences to clients and to
convey the information necessary for interested parties to participate in such a session.
An SDP description encapsulates such information as the name and purpose of a
multimedia session; the time or times during which the session will be active; the types of
media included in the session, and the formats in which they are encoded; and addressing
information such as URL’s, ports and internet addresses that describe how clients should
obtain access to the session. In addition, descriptions can include additional information,
such as contact data for the individual responsible for administrating the session and
details of the resource requirements of the broadcast or conference.
SDP is simply a data format, despite its name; no mechanism for transporting a session
description is described in the specification. Rather, SDP descriptions are carried over
other protocols, such as RTSP, SAP, and MIME-based electronic mail. Furthermore, SDP
session descriptions are text-based; this facilitates portability, the encapsulation of
descriptions within other text-based Internet protocols, and automated generation of
descriptions using scripting languages such as TCL and Perl.
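A minimal description in this style, with an illustrative parse of its media lines, is shown below; the field values are invented, and the parse demonstrates only the basic one-letter-type, `=`-separated line format.

```python
# An illustrative SDP session description: version, origin, session name,
# connection address (with TTL), active times, and two media lines.
description = """\
v=0
o=jdw 2890844526 2890842807 IN IP4 128.174.5.10
s=Example seminar broadcast
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 31
"""

# Each SDP line is a single-letter type, '=', and a value; collecting
# the media ("m=") lines takes only a few lines of a scripting language.
fields = [line.split("=", 1) for line in description.splitlines()]
media = [value for key, value in fields if key == "m"]
print(media)  # → ['audio 49170 RTP/AVP 0', 'video 51372 RTP/AVP 31']
```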
The Session Announcement Protocol (SAP) is a simple protocol used to disseminate
announcements of multimedia sessions over the Internet. Session announcements take the
form of a single UDP packet containing a SAP header and a textual payload, which is a
single SDP session description. Announcements are multicast to a well-known multicast
address and port and can be received by any individual with multicast-capable hardware
and software. SAP is fully specified in an Internet-Draft [39] which has expired, but
should be the subject of an IETF RFC in the near future.
SDP is complementary to RTSP. An RTSP server can use SDP to provide a client with a
description of a particular multimedia resource in response to a DESCRIBE request, and
the client can then use the information contained in the description to set up appropriate
encoders and decoders to allow the user to participate in the session. Alternatively, a user
might receive an SDP session description via electronic mail or from some other source.
The description can be provided manually to client software, which can then use RTSP to
initiate transmission and playback of the session. In either case, the mechanism used to
communicate the characteristics of the broadcast or conference (SDP) is distinct from the
mechanism used to control the user’s participation in the session (RTSP). In that SAP
constitutes a mechanism for the distribution of session announcements containing SDP
descriptions to potential participants, RTSP and SAP are complementary as well.
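The DESCRIBE exchange mentioned above might be initiated with a request like the following sketch; the URL and header set are illustrative, not a complete RTSP implementation.

```python
def describe_request(url, cseq=1):
    """Build a minimal RTSP DESCRIBE request asking the server for an
    SDP description of a resource. Only the sequence-number and Accept
    headers are included; a real client would add others as needed."""
    return ("DESCRIBE {0} RTSP/1.0\r\n"
            "CSeq: {1}\r\n"
            "Accept: application/sdp\r\n"
            "\r\n").format(url, cseq)

req = describe_request("rtsp://media.example.com/seminar")
print(req.splitlines()[0])
# → DESCRIBE rtsp://media.example.com/seminar RTSP/1.0
```

A successful response would carry `Content-Type: application/sdp` with the session description as its body, which the client then uses to configure its decoders.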
2.4.9 Synchronized Multimedia Integration Language
Synchronized Multimedia Integration Language, or SMIL, is an HTML-like language
designed to let users create, with an ordinary text editor, streaming multimedia presentations
for playback over the WWW. Described in a recommendation of the World Wide Web
Consortium [40], SMIL enables users to assemble multimedia resources to form a
presentation, to describe how the presentation should be displayed on-screen, and to
associate hyperlinks with multimedia objects. The multimedia resources that make up a
SMIL-based presentation are specified as Uniform Resource Identifiers (URI’s) and can
therefore include multimedia content accessible via RTSP as well as other access
protocols, like HTTP and FTP. A number of multimedia and network software companies
have promised support for SMIL in current and future products.
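A minimal SMIL document in this spirit might look like the following sketch; the element names follow the W3C recommendation, while the layout values and URLs are invented.

```xml
<smil>
  <head>
    <layout>
      <root-layout width="320" height="240"/>
    </layout>
  </head>
  <body>
    <!-- <par> plays its children in parallel; the sources here are
         RTSP URIs, but HTTP or FTP URIs would work equally well. -->
    <par>
      <video src="rtsp://media.example.com/seminar/video"/>
      <audio src="rtsp://media.example.com/seminar/audio"/>
    </par>
  </body>
</smil>
```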
2.5 Other RTSP Implementations
2.5.1 Real Networks Reference Implementation
Real Networks Corporation provides a reference implementation of RTSP based on the
July 30, 1997 draft of the RTSP specification; it has not been updated to reflect the final
version of the standard. Released under the terms of the GNU General Public License,
documentation and source code for the reference implementation are available from Real
Networks’ web site [41]. The package includes basic client and server implementations,
as well as several sample applications, all written in the C programming language. The
sample applications support playback of audio files in several formats and incorporate a
simple implementation of RTP, a graphical client interface, and server management via a
configuration file.
The reference implementation is intended to serve as a test platform for other
implementations rather than as the basis for full-featured applications. As such, the
source code is not designed to be particularly flexible or extensible. In particular:
• the implementa