19
Ž . Computer Networks 31 1999 237–255 Internet Telephony: architecture and protocols – an IETF perspective Henning Schulzrinne a, ) , Jonathan Rosenberg b,1 a Department of Computer Science, Columbia UniÕersity, New York, NY 10027, USA b Bell Laboratories, Lucent Technologies, Holmdel, NJ 07974-0616 USA Abstract Internet telephony offers the opportunity to design a global multimedia communications system that may eventually replace the existing telephony infrastructure. We describe the upper-layer protocol components that are specific to Internet Ž . telephony services: the Real-Time Transport Protocol RTP to carry voice and video data, and the Session Initiation Ž . Protocol SIP for signaling. We also mention some complementary protocols, including the Real Time Streaming Protocol Ž . Ž . RTSP for control of streaming media, and the Wide Area Service Discovery Protocol WASRV for location of telephony gateways. q 1999 Published by Elsevier Science B.V. All rights reserved. Keywords: Internet telephony; Internet multimedia; SIP; RTSP; RTP 1. Introduction Internet telephony, also known as voice-over-IP Ž . or IP telephony IPtel , is the real-time delivery of Ž . voice and possibly other multimedia data types between two or more parties, across networks using the Internet protocols, and the exchange of informa- tion required to control this delivery. Internet tele- phony offers the opportunity to design a global multimedia communications systems that may even- tually replace the existing telephony infrastructure, without being encumbered by the legacy of a cen- tury-old technology. While we will use the common term of ‘‘Internet telephony’’, it should be understood that the addition ) Corresponding author. Tel.: q1-212-939-7042; fax: q1-212- 666-0140; e-mail: [email protected]. 1 E-mail: [email protected]. of other media, such as video or shared applications, does not fundamentally change the problem. Also, unlike in traditional telecommunications networks Ž . where distribution applications radio, TV and com- Ž . munications applications telephone, fax are quite distinct in terms of technology, user interface, com- munications devices and even regulatory environ- ments, this is not the case for the Internet. The Ž . delivery of stored ‘‘streaming’’ media and tele- phone-style applications can share almost all of the underlying protocol infrastructure. Indeed, in the model presented here, the same end user application may well serve both. This affords new opportunities to integrate these two modes. Consider, for example, the simple act of rolling a VCR into a conference room to either playback or record a session. The equivalent function requires specialized equipment in the phone system, but is easy to implement in the system to be described here. As another example, it may be desirable to jointly view stored or live 1389-1286r99r$ - see front matter q 1999 Published by Elsevier Science B.V. All rights reserved. Ž . PII: S0169-7552 98 00265-7

Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

Ž .Computer Networks 31 1999 237–255

Internet Telephony: architecture and protocols – an IETFperspective

Henning Schulzrinne a,), Jonathan Rosenberg b,1

a Department of Computer Science, Columbia UniÕersity, New York, NY 10027, USAb Bell Laboratories, Lucent Technologies, Holmdel, NJ 07974-0616 USA

Abstract

Internet telephony offers the opportunity to design a global multimedia communications system that may eventuallyreplace the existing telephony infrastructure. We describe the upper-layer protocol components that are specific to Internet

Ž .telephony services: the Real-Time Transport Protocol RTP to carry voice and video data, and the Session InitiationŽ .Protocol SIP for signaling. We also mention some complementary protocols, including the Real Time Streaming Protocol

Ž . Ž .RTSP for control of streaming media, and the Wide Area Service Discovery Protocol WASRV for location of telephonygateways. q 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Internet telephony; Internet multimedia; SIP; RTSP; RTP

1. Introduction

Internet telephony, also known as voice-over-IPŽ .or IP telephony IPtel , is the real-time delivery of

Ž .voice and possibly other multimedia data typesbetween two or more parties, across networks usingthe Internet protocols, and the exchange of informa-tion required to control this delivery. Internet tele-phony offers the opportunity to design a globalmultimedia communications systems that may even-tually replace the existing telephony infrastructure,without being encumbered by the legacy of a cen-tury-old technology.

While we will use the common term of ‘‘Internettelephony’’, it should be understood that the addition

) Corresponding author. Tel.: q1-212-939-7042; fax: q1-212-666-0140; e-mail: [email protected].

1 E-mail: [email protected].

of other media, such as video or shared applications,does not fundamentally change the problem. Also,unlike in traditional telecommunications networks

Ž .where distribution applications radio, TV and com-Ž .munications applications telephone, fax are quite

distinct in terms of technology, user interface, com-munications devices and even regulatory environ-ments, this is not the case for the Internet. The

Ž .delivery of stored ‘‘streaming’’ media and tele-phone-style applications can share almost all of theunderlying protocol infrastructure. Indeed, in themodel presented here, the same end user applicationmay well serve both. This affords new opportunitiesto integrate these two modes. Consider, for example,the simple act of rolling a VCR into a conferenceroom to either playback or record a session. Theequivalent function requires specialized equipment inthe phone system, but is easy to implement in thesystem to be described here. As another example, itmay be desirable to jointly view stored or live

1389-1286r99r$ - see front matter q 1999 Published by Elsevier Science B.V. All rights reserved.Ž .PII: S0169-7552 98 00265-7

Page 2: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255238

broadcast events, in a kind of ‘‘virtual couch’’,where friends, distributed across the Internet, couldshare the same movie, as well as side comments andconversations.

Internet telephony differs from Internet multime-dia streaming primarily in the control and establish-ment of sessions, the ‘‘signaling’’. While we gener-ally assume that a stored media resource is availableat a given location, identified at different levels of

Ž .abstraction by a URI Uniform Resource IdentiferŽ .or URL Uniform Resource Locator , participants in

phone calls are not so easily located. Personal mobil-ity, call delegation, availability, and desire to com-municate make the process of signaling more com-plex. On the other hand, streaming media can makeuse of time-axis manipulations, such as fast-forwardor absolute positioning. In the Internet, the Real

Ž . w xTime Streaming Protocol RTSP 1 is the standardprotocol for controlling multimedia streams. Section

Ž . w x5 describes the Session Initiation Protocol SIP 2co-developed by the authors for signaling Internettelephony services.

Both RTSP and SIP are part of a protocol stackthat has recently emerged from the Internet Engi-

Ž .neering Task Force IETF – the Internet Multime-w xdia Conferencing Architecture 3 . The protocols en-

compass both IPtel services and stored media ser-

vices in an integrated fashion. Fig. 1 depicts thestack, along with other protocols likely to be used forboth Internet streaming media and Internet tele-phony. Unlike circuit-switched telephony, Internettelephony services are built on a range of packetswitched protocols. For example, the functionality of

w xthe SS7 ISUP and TCAP 4 telephony signalingprotocols encompass routing, resource reservation,call admission, address translation, call establish-ment, call management and billing. In an Internetenvironment, routing is handled by protocols such as

Ž . w xthe Border Gateway Protocol BGP 5 , resourcereservation by the Resource Reservation ProtocolŽ . w xRSVP 6 or other resource reservation protocolsw x7 . SIP, described here, translates application-layeraddresses, establishes and manages calls. There iscurrently no Internet telephony billing protocol in the

w x w xInternet, although RADIUS 8 or DIAMETER 9 ,in combination with SIP authentication, may initiallyserve that purpose for calls through Internet tele-phony gateways. Having a number of different proto-cols, each serving a particular function, allows formodularity, flexibility, simplicity, and extensibility.End systems or network servers that only provide aspecific service need only implement that particularprotocol, without interoperability problems. Further-more, protocol components can be reused in other

Fig. 1. Internet telephony protocol stack.

Page 3: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 239

applications, avoiding re-invention of specific func-tions in each application.

Even though the term Internet telephony is oftenw xassociated with point-to-point voice service 10 , none

of the protocols described here are restricted to asingle media type or unicast delivery. Indeed, one ofthe largest advantages of Internet telephony com-

Ž .pared to the Plain Old Telephone System POTS isthe transparency of the network to the media carried,so that adding a new media type requires no changesto the network infrastructure. Also, at least for sig-naling, the support of multiparty calls differs onlymarginally from two-party calls.

The protocols mentioned, and the rich servicesthey provide, are just one part of the picture forIPtel. As it was designed for data transport, theInternet currently offers only best effort service. Theresult is that voice packets suffer heavy losses andsignificant delays when there is network congestion,making the speech quality poor. A number of efforts

Ž .in the area of Quality of Service QoS managementare underway in order to address this issue.

The remainder of the paper is organized as fol-lows. Section 2 discusses the differences between the

Žgeneral switched telephone network GSTN, i.e., the.current phone system and the Internet telephony

architecture. Section 3 then discusses how thesedifferences translate into some of the advantages ofInternet telephony. We then discuss the key protocolcomponents of Internet telephony. Section 4 dis-

Ž .cusses the Real Time Transport Protocol RTP , andaddresses some of the efforts underway to resolvethe QoS problems with voice transport. Section 5.1discusses SIP and Section 6 discusses some addi-tional protocols, including a call processing languageand service location, which form pieces of the over-all puzzle. We then put them all together in Section7. We conclude in Section 8 and present ideas forfuture work.

2. Differences between Internet telephony and theGSTN

Internet telephony differs in a number of respectsfrom the GSTN, both in terms of architecture andprotocols. These differences affect the design oftelephony services.

Fundamentally, IPtel relies on the ‘‘end-to-end’’paradigm for delivery of services. Signaling proto-cols are between the end systems involved in thecall; network routers treat these signaling packetslike any other data, ignoring any semantics impliedby them. Note, however, that IPtel can make use of

Ž .‘‘signaling routers’’ which are effectively proxiesto assist in functions such as user location. In thiscase, these proxies can be used for routing only ofinitial signaling messages. Subsequent messages canbe exchanged end-to-end. As a consequence of theend-to-end signaling paradigm, call state is as wellend to end, as are instantiation of many telephonyfeatures.

The Internet itself is both multi-service and ser-vice-independent. It provides packet-level transport,end-to-end, for whatever services are deployed at theend systems through higher layer protocols and soft-ware. This leads to tremendous flexibility and exten-sibility. New application level services, such as theweb, email, and now IPtel, can be created anddeployed by anyone with access to the Internet.

IPtel separates call setup from reserving re-w xsources. In the Internet, protocols such as RSVP 6

are used to reserve resources. These protocols areapplication independent, and reservations may takeplace before or after actual flow of data begins.When used after the flow of data begins, the datawill be treated as best effort. As a result, IPtel can beused without per-call resource reservation in net-works with sufficient capacity. On the other hand,removing the ‘‘atomicity’’ of call setup found in thecurrent telephone system also breaks the assumptionof ‘‘all or nothing’’ call completion: since call setupand resource reservation are distinct, one may suc-ceed, while the other may fail. If resources arereserved first, the caller may incur a cost for holdingthose resources while ‘‘the phone rings’’, even

Žthough the call is not answered. If the network doesnot charge for reservations that are not actually used,the network becomes vulnerable to denial-of-serviceattacks, where the attacker can block others from

.making reservations. Also, in the phone system, theresource needs for a single leg of a call are knownahead of time; this is clearly not the case for Internettelephony calls, where the callee may choose tocommunicate with only a subset of the media offeredby the caller.

Page 4: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255240

Due to the limited signaling abilities of GSTNŽ .end systems, GSTN addresses phone numbers are

overloaded with at least four functions: end pointidentification, service indication, indication of whopays for the call and carrier selection. The GSTNalso ties call origination with payment, except as

Ž .modified by the address 800 numbers or specificŽ .manual features such as collect calls . IPtel ad-

dresses, which are formulated as URL’s, are usedsolely for endpoint identification and basic serviceindication. The other functions, such as payment andcarrier selection, are more readily handled by theprotocols, such as RSVP and RTSP, which carry theaddresses.

Internet telephony offers a larger degree of free-dom to allocate functions between network serversand user-supplied and operated end systems. Forexample, because of the end-to-end signaling, phoneservices such as distinctive ringing based on callurgency, media-based endpoint selection, and crypto-

Žgraphically authenticated caller id in addition tofeatures available from POTS phones, such as speed

.dialing and caller ID may be accomplished triviallyby even standalone IP telephones, resulting in betterscaling.

The phone system employs different signalingprotocols between a user and the ‘‘network’’Ž .User–Network Interface – UNI as compared to

Ž . Žbetween implicitly trusted network elements Net-.work–Network Interface – NNI . This makes certain

Ž .features such as mumber translation unavailable toend users, or leads to classifying end users as ‘‘net-works’’, with the attendant security implications inthe access to databases and network resources. TheUNIrNNI distinction does not exist on the Internet,both at the level of data transport and signaling. SIPcan set up calls between two end systems just aseasily as creating a ‘‘channel’’ on an aggregated

Ž .RTP session see Section 4.3 .The open, multi-service, end-to-end nature of the

Internet also means that various components of tele-phony services can be provided by completely differ-

Žent service vendors of course, agreement on proto-.cols is necessary for interoperability . For example,

one vendor can provide a name to IP address map-ping service, another can provide voice-mail, anothercan provide mobility services, while yet another canprovide conferencing services. Furthermore, the end-

to-end nature of the Internet means that anyone withan Internet connection can run and operate such aservice. This leads to an easy-entry, highly competi-tive marketplace for all Internet services, such asIPtel.

This separation of functionality also simplifies thenumber portability problem. As an organization mayprovide just a name mapping service, a user can

Žchange other service providers such as their voice-. 2mail provider or ISP without a change in name .

Changing name mapping providers may require achange in name. However, automated white-pagesservices, such as those run by Four11, allow anotherlayer of indirection which further alleviates the num-ber portability problem.

3. Features of Internet telephony

The architectural differences described in the pre-vious section lead to a number of advantages from

w xboth a user perspective and a carrier perspective 12 ,beyond just ‘‘cheaper phone calls’’:

Adjustable quality: While IPtel currently has theŽreputation for tin-can quality due to low bitrate

. Žcodecs, in part there is no reason except lack of.bandwidth why the same technology cannot supply

audiophile-quality music. Because the Internet is nota service-specific network, the media exchanged ischosen entirely by end systems. As such, end sys-tems can control the amount of compression based

w xon network bandwidth 13 or the content to beŽtransmitted. For example, music-on-hold as it is not

.speech is not suitable for very low bit-rate speechcodecs.

Security: The Internet has the reputation as beinginsecure, even though it is probably easier to tap atelephone demarc box than a router. SIP can encryptand authenticate signaling messages; RTP supportsencryption of media. Together, these provide crypto-graphically secure communications.

User identification: Standard POTS and ISDNtelephone service offers caller id indicating the num-

Ž .ber or, occasionally, name of the caller, but during

2 There is a more difficult issue of maintaining the same IPw xnetwork address when changing providers 11 .

Page 5: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 241

a bridged multi-party conference, there is no indica-tion of who is talking. The real-time transport proto-

Ž . w xcol RTP 14 used for Internet telephony easilysupports talker indication in both multicast andbridged configurations and can convey more detailedinformation if the caller desires.

User interface: Most POTS and ISDN telephoneshave a rather limited user interface, with at best atwo-line liquid crystal display or, in the public net-work, cryptic commands like ‘‘)69’’ for call-back.Advanced GSTN features such as call-forwardingare rarely used or customized, since the sequence ofsteps is typically not intuitive. This is due in part tothe limited signaling capabilities of end systems, andthe general notion of ‘‘network intelligence’’ ascompared to ‘‘end system intelligence’’. As IPtel endsystems have much richer signaling capabilities, thegraphical user interface offered by Internet telephonycan be more readily customized and offer richerindications of features, process and progress.

Computer-telephony integration: Because of thecomplete separation of data and control paths and theseparation of phone equipment from the PC’s con-

Ž .trolling them, computer-telephony integration CTIw x w x15 is very complex, with specs 16 running to 3300pages. Much of the call handling functionality can beeasily accomplished once the data and control pathpass through intelligent, network-connected end sys-tems.

Feature ubiquity: The current phone system offersvery different features depending on whether theparties are connected by the same Private Branch

Ž .Exchange PBX , reside within the same local call-ing area or are connected by a long-distance carrier.Even trivial features such as caller ID only work fora small fraction of international calls, for example. APBX may not allow a call to be forwarded outsidethat PBX, or cause the forwarded call to still gothrough the PBX. Internet telephony does not sufferfrom this problem. This is because the Internet proto-cols are internationally used, and because servicesare defined largely by the end systems.

Multimedia: As pointed out, adding additionalmedia such as video, shared whiteboards or sharedapplications is much easier in the Internet environ-ment compared to the POTS and ISDN, as multiplex-ing is natural for packet networks. This makes sig-naling protocols simpler as well, as issues such as

B-channel allocation and synchronization are non-ex-istent in the Internet.

Carriers benefit as well:Silence suppression and compression: Sending

audio as packets makes it easy to suppress silenceperiods, further reducing bandwidth consumption,particularly in a multi-party conference or for voiceannouncement systems. Unlike the GSTN, whichgenerally does such silence suppression acrosstrans-continental links, IPtel performs silence sup-pression at the endpoints. Furthermore, as packetnetworks are much more suitable to multiplexing, nonetwork support is required to take advantage of endsystem silence suppression. This leads to a reduction

Žin cost to perform the silence suppression as it’sdistributed to end systems where it can be done

.cheaply , and an improvement in the scope of itsbenefits. Furthermore, compression can be used atend systems to reduce bandwidth consumption acrossthe entire network. Unfortunately, compression is atodds with enhanced voice quality services describedabove. However, there exist codecs which can com-press wideband speech to 16 kbrs, which can giveboth excellent voice quality and reduced bandwidthcompared to the GSTN. Note that silence suppres-sion and compression compensate for the decreasedefficiency of packet switching.

Shared facilities: The largest operational savingscan probably come from provisioning and managinga single, integrated network, rather than separatevoice, data and signaling networks.

AdÕanced serÕices: From first experiences andprotocols, it appears to be far simpler to develop anddeploy advanced telephony services in a packet-

w xswitched environment than in the GSTN 17,18 .w xInternet protocols, such as SIP 2 that support stan-

Ždard CLASS Custom Local Area Signaling Ser-. w x Žvices 19 features such as Call Forward No An-.swer take only a few tens of pages to specify. They

can perform the functions of both the user-to-net-work signaling protocols such as Q.931 as well as

Ž .the network signaling ISUP, Signaling System 7 .Separation of Õoice and control flow: In tele-

phony, the signaling flow, even though carried on aseparate network, has to ‘‘touch’’ all the intermedi-ate switches to set up the circuit. Since packet for-warding in the Internet requires no setup, Internet

Žcall control can concentrate on the call rather than

Page 6: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255242

.connection functionality. For example, it is easy toavoid triangle-routing when forwarding or transfer-ring calls; the transferring party can simply informthe transferred party of the address of the trans-ferred-to party, and the two can contact each otherdirectly. As there is no network connection state,none needs to be torn down.

4. RTP for data transport

Real-time flows such as IPtel voice and videostreams have a number of common requirements thatdistinguish them from ‘‘traditional’’ Internet dataservices:

Sequencing: The packets must be re-ordered inreal time at the receiver, should they arrive out oforder. If a packet is lost, it must be detected andcompensated for without retransmissions.

Intra-media synchronization: The amount of timebetween when successive packets are to be ‘‘played-out’’ must be conveyed. For example, no data isusually sent during silence periods in speech. Theduration of this silence must be reconstructed prop-erly.

Inter-media synchronization: If a number of dif-ferent media are being used in a session, there mustbe a means to synchronize them, so that the audiothat is played out matches the video. This is alsoknown as lip-sync.

Payload identification: In the Internet, it is oftennecessary to change the encoding for the mediaŽ .‘‘pay-load’’ on the fly to adjust to changing band-width availability or the capabilities of new usersjoining a group. Some kind of mechanism is there-fore needed to identify the encoding for each packet.

Frame indication: Video and audio are sent inlogical units called frames. It is necessary to indicateto a receiver where the beginning or end of a frameis, in order to aid in synchronized delivery to higherlayers.

These services are provided by a transport proto-w xcol. In the Internet, the RTP 14 is used for this

purpose. RTP has two components. The first is RTPitself, and the second is RTCP, the Real Time Con-trol Protocol. Transport protocols for real time media

w xare not new, dating back to the 1970’s 20 . How-

ever, RTP provides some functionality beyond rese-quencing and loss detection:

Multicast-friendly: RTP and RTCP have been en-gineered for multicast. In fact, they are designed tooperate in both small multicast groups, like thoseused in a three-person phone call, to huge ones, likethose used for broadcast events.

Media independent: RTP provides services neededfor generic real time media, such as voice and video.Any codec-specific additional header fields and se-mantics are defined for each media codec in separatespecifications.

Mixers and translators: Mixers are devices whichtake media from several users, mix or ‘‘bridge’’them into one media stream, and send the resultingstream out. Translators take a single media stream,convert it to another format, and send it out. Transla-tors are useful for reducing the bandwidth require-ments of a stream before sending it over a low-band-width link, without requiring the RTP source toreduce its bitrate. This allows receivers that areconnected via fast links to still receive high qualitymedia. Mixers limit the bandwidth if several sourcessend data simultaneously, fulfilling the function ofconference bridge. RTP includes explicit support formixers and translators.

QoS feedback: RTCP allows receivers to providefeedback to all members of a group on the quality ofthe reception. RTP sources may use this informationto adjust their data rate, while other receivers candetermine whether QoS problems are local or net-work-wide. External observers can use it for scalablequality-of-service management.

Loose session control: With RTCP, participantscan exchange identification information such asname, email, phone number, and brief messages.

Encryption: RTP media streams can be encryptedusing keys that are exchanged by some non-RTPmethod, e.g., SIP or the Session Description ProtocolŽ . w xSDP 21 .

4.1. RTP

RTP is generally used in conjunction with theŽ .User Datagram Protocol UDP , but can make use of

any packet-based lower-layer protocol. When a hostwishes to send a media packet, it takes the media,formats it for packetization, adds any media-specific

Page 7: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 243

packet headers, prepends the RTP header, and placesit in a lower-layer payload. It is then sent into thenetwork, either to a multicast group or unicast toanother participant.

Ž .The RTP header Fig. 2 is 12 bytes long. The Vfield indicates the protocol version. The X flag sig-nals the presence of a header extension between thefixed header and the payload. If the P bit is set, thepayload is padded to ensure proper alignment forencryption.

Users within a multicast group are distinguishedby a random 32-bit synchronization source SSRCidentifier. Having an application-layer identifier al-lows to easily distinguish streams coming from thesame translator or mixer and associate receiver re-ports with sources. In the rare event that two usershappen to choose the same identifier, they redrawtheir SSRCs.

As described above, a mixer combines mediastreams from several sources, e.g., a conferencebridge might mix the audio of all active participants.In current telephony, the participants may have ahard time distinguishing who happens to be speaking

Ž .at any given time. The Contributing SSRC CSRClist, whose length is indicated by the CSRC lengthfield, lists all the SSRC that ‘‘contributed’’ contentto the packet. For the audio conference, it would listall active speakers.

RTP supports the notion of media-dependentframing to assist in the reconstruction and playout

Ž .process. The marker M bit provides information forthis purpose. For audio, the first packet in a voicetalkspurt can be scheduled for playout independentlyof those in the previous talkspurt. The bit is used in

Fig. 2. RTP fixed header format.

this case to indicate the first packet in a talkspurt.For video, a video frame can only be rendered whenits last packet has arrived. Here, the marker bit isused to indicate the last packet in a video frame.

The payload type identifies the media encodingused in the packet. The sequence number incre-ments sequentially from one packet to the next, andis used to detect losses and restore packet order. Thetimestamp, incremented with the media samplingfrequency, indicates when the media frame was gen-erated.

The payload itself may contain headers specificfor the media; this is described in more detail below.

4.2. RTCP: control and management

The Real Time Control Protocol, RTCP is thecompanion control protocol for RTP. Media sendersŽ . Ž .sources and receivers sinks periodically send

ŽRTCP packets to the same multicast group but.different ports as is used to distribute RTP packets.

Each RTCP packet contains a number of elements,Ž .usually a sender report SR or receiver report fol-

Ž .lowed by source descriptions SDES . Each serves adifferent function:

( )Sender Reports SR are generated by users whoŽ .are also sending media RTP sources . They describe

the amount of data sent so far, as well as correlatingŽthe RTP sampling timestamp and absolute ‘‘wall

.clock’’ time to allow synchronization between dif-ferent media.

( )ReceiÕer Reports IR are sent by RTP sessionŽ .participants which are receiving media RTP sinks .

Each such report contains one block for each RTPsource in the group. Each block describes the instan-taneous and cumulative loss rate and jitter from thatsource. The block also indicates the last timestampand delay since receiving a sender report, allowingsources to estimate their distance to sinks.

( )Source Descriptor SDES packets are used forŽsession control. They contain the CNAME Canoni-

.cal Name , a globally unique identifier similar informat to an email address. The CNAME is used forresolving conflicts in the SSRC value and associatedifferent media streams generated by the same user.SDES packets also identify the participant throughits name, email, and phone number. This provides a

Page 8: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255244

simple form of session control. Client applicationscan display the name and email information in theuser interface. This allows session participants tolearn about the other participants in the session. It

Žalso allows them to obtain contact information such.as email and phone to allow for other forms of

Žcommunication such as initiation of a separate con-.ference using SIP . This also makes it easier to

contact a user should he, for example, have left hiscamera running.

If a user is leaving, he includes a BYE message.( )Finally, Application APP elements can be used to

add application-specific information to RTCP pack-ets.

Since the sender reports, receiver reports, andSDES packets contain information which can contin-ually change, it is necessary to send these packetsperiodically. If the RTP session participants simplysent RTCP packets with a fixed period, the resultingbandwidth used in the multicast group would growlinearly with the group size. This is clearly undesir-able. Instead, each session member counts the num-

Žber of other session members it hears from via.RTCP packets . The period between RTCP packets

from each user is then set to scale linearly with thenumber of group members. This ensures that thebandwidth used for RTCP reports remains fixed,independent of the group size. Since the group sizeestimate is obtained by counting the number of otherparticipants, it takes time for each new participant toconverge to the correct group size count. If manyusers simultaneously join a group, as is common inbroadcast applications, each user will have an incor-rect, and very low estimate of the group size. Theresult is a flood of RTCP reports. A back-off algo-

w xrithm called reconsideration 22 is used to preventsuch floods.

4.3. Payload formats

The above mechanisms in RTP provide for ser-vices needed for the generic transport of audio andvideo. However, particular codecs will have addi-tional requirements for information that needs to beconveyed. To support this, RTP allows for payloadformats to be defined for each particular codec.These payload formats describe the syntax and se-

mantics of the RTP payload. The particular semanticof the payload is communicated in the RTP payloadtype indicator bits. These bits are mapped to actualcodecs and formats via a binding to names, regis-tered with the Internet Assigned Numbers AuthorityŽ . ŽIANA , and conveyed out of band through SDP,

.for example . Any number of bindings can be con-veyed out of band; this allows an RTP source tochange encoding formats mid-stream without explicitsignaling. Furthermore, anyone can register a nameŽ .so long as it has not been used , and procedures aredefined for doing so. This allows for RTP to be usedwith any kind of codec developed by anyone.

RTP media payload formats have been defined forw x w x w xthe H.263 23 , H.261 24 , JPEG 25 and MPEG

w x26 video codecs, and a host of other audio andvideo encoders are supported with simpler payload

w xformats 27 .Furthermore, RTP payload formats are being de-

fined to provide some generic services. One format,w xfor redundant audio codings 28 , allows a user to

transmit audio content using multiple audio codecs,each delayed from the previous, and of a lowerbitrate. This allows for lost packets to be recoveredfrom subsequent packets, albeit with a lower qualitycodec. Another payload format is being defined forparity and Reed Solomon like forward error correc-

Ž .tion FEC mechanisms, to allow for recovery of lostw xpackets in a codec-independent manner 29 . Yet

another format is being introduced to multiplex me-w xdia from multiple users into a single packet 30 .

This is particularly useful for trunk replacement be-tween Internet telephony gateways, where it canprovide a significant reduction in packet headeroverheads.

4.4. Resource reserÕation

Given the importance of telephony services, weanticipate that a significant fraction of the Internetbandwidth will be consumed by voice and video, thatis, RTP-based protocols. Due to its tight delay con-straints, IPtel streams are also likely candidates forguaranteed QOS. Unfortunately, existing proposals

w xsuch as RSVP 6 are rather complex, largely due tofeatures such as receiver orientation and support forreceiver diversity that is likely to be of limited use

Page 9: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 245

w xfor Internet telephony. We have proposed 7 todispense with a separate resource reservation proto-col altogether and simply use RTCP messages to tellrouters along the path to reserve sufficient resources.To determine the amount of bandwidth for the reser-

Ž .vation, RTCP sender reports SR can be used as isŽby observing the difference between two subsequent

.SR byte counts , or an additional field can be in-serted that specifies the desired grade of service inmore detail. RTCP messages carrying reservationrequests are marked for special handling by a router

w xalert option 31 . The receiver reports back whetherthe reservation was completely or only partially suc-cessful.

Similar proposals of simplified, sender-based re-Žsource reservation protocols can be found albeit not

.using RTCP in the Scalable Reservation ProtocolŽ . w xSRP 32 .

The issue of reservations for IPtel services hasanother facet: if a reservation fails, the call can stilltake place, but using best effort transport of theaudio. This is certainly preferable to a fast-busysignal, where no communications at all are estab-lished. This introduces the possibility of completelyremoving the admission decision all together. Oneach link in the network, sufficient bandwidth forvoice traffic can be allocated. Time-honored ap-proaches to telephone network engineering can beused to determine the amount of allocation needed.The remaining bandwidth on links will be used forother services, and for best effort. Incoming packetscan be classified as voice or non-voice. Should therebe too many voice connections active at any point in

Ž .time possible since there is no admission control ,the extra calls can simply become best effort. Goodnetwork engineering can place reasonably low prob-abilities on such an event, and the use of ‘‘spillover’’best effort bandwidth can eliminate admission con-trol to handle these unlikely occurrences.

The above mechanism is generally referred to aspart of the class of IP service referred to as differen-

Žtiated serÕices. These rely on user subscription to.facilitate network engineering and packet classifica-

tion and marking at network peripheries. From theabove discussion, IPtel is a prime candidate for such

w xa service. In fact, the premium serÕice 33 providesalmost no delay and jitter for its packets, making itideal for real-time voice and video.

5. Signaling: Session Initiation Protocol

A defining property of IPtel is the ability of oneparty to signal to one or more other parties to initiatea new call. For purposes of discussing the SIP, wedefine a call is an association between a number ofparticipants. The signaling association between a pairof participants in a call is referred to as a connection.Note that there is no physical channel or networkresources associated with a connection; the connec-tion exists only as signaling state at the two end-points. A session refers to a single RTP sessioncarrying a single media type. The relationship be-tween the signaling connections and media sessionscan be varied. In a multiparty call, for example, eachparticipant may have a connection to every otherparticipant, while the media is being distributed viamulticast, in which case there is a single mediasession. In other cases, there may be a single unicastmedia session associated with each connection.Other, more complex, scenarios are possible as well.

An IPtel signaling protocol must accomplish anumber of functions:

Name translation and user location involve themapping between names of different levels of ab-straction, e.g., a common name at a domain and auser name at a particular Internet host. This transla-tion is necessary in order to determine the IP addressof the host to exchange media with. Usually, a userhas only the name or email address of the person

Žthey would like to communicate with [email protected], for example . This must be translated into

an IP address. This translation is more than just asimple table lookup. The translation can vary based

Žon time of day so that a caller reaches you at work.during the day, and at home in the evening , caller

Ž .so that your boss always gets your work number , orŽthe status of the callee so that calls to you are sent

to your voice mail when you are already talking to.someone , among other criteria.

Feature negotiation allows a group of end sys-tems to agree on what media to exchange and theirrespective parameters, such as encodings. The setand type of media need not be uniform within a call,as different point-to-point connections may involvedifferent media and media parameters. Many soft-ware codecs are able to receive different encodingswithin a single call and in parallel, for example,

Page 10: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255246

while being restricted to sending one type of mediafor each stream.

Any call participant can invite others into anexisting call and terminate connections with someŽ .call participant management . During the call, par-ticipants should be able to transfer and hold otherusers. The most general model of a multi-partyassociation is that of a full or partial mesh of connec-

Žtions among participants where one of the partici-.pants may be a media bridge , with the possible

addition of multicast media distribution.Feature changes make it possible to adjust the

composition of media sessions during the course of acall, either because the participants require additionalor reduced functionality or because of constraintsimposed or removed by the addition or removal ofcall participants.

There are two protocols that have been developedw xfor such signaling operations, SIP 2 , developed in

w xthe IETF, and H.323 34 , developed by the ITU. Inthe next section, we will discuss SIP, while H.323 iscovered by Toga et al. within this issue.

5.1. SIP oÕerÕiew

SIP is a client-server protocol. This means thatŽ .requests are generated by one entity the client , and

Ž .sent to a receiving entity the server which pro-cesses them. As a call participant may either gener-ate or receive requests, SIP-enabled end systems

Žinclude a protocol client and server generally called.a user agent server . The user agent server generally

responds to the requests based on human interactionor some other kind of input. Furthermore, SIP re-quests can traverse many proxy serÕers, each ofwhich receives a request and forwards it towards anext hop server, which may be another proxy serveror the final user agent server. A server may also actas a redirect serÕer, informing the client of theaddress of the next hop server, so that the client cancontact it directly. There is no protocol distinctionbetween a proxy server, redirect server, and useragent server; a client or proxy server has no way ofknowing which it is communicating with. The dis-tinction lies only in function: a proxy or redirectserver cannot accept or reject a request, whereas auser agent server can. This is similar to the Hyper-

Ž .text Transfer Protocol HTTP model of clients, ori-

gin and proxy servers. A single host may well act asclient and server for the same request. A connectionis constructed by issuing an INVITE request, anddestroyed by issuing a BYE request.

As in HTTP, the client requests invoke methodson the server. Requests and responses are textual,and contain header fields which convey call proper-ties and service information. SIP reuses many of theheader fields used in HTTP, such as the entity

Ž .headers e.g., Content-type and authenticationheaders. This allows for code reuse, and simplifiesintegration of SIP servers with web servers.

Calls in SIP are uniquely identified by a callidentifier, carried in the Call-ID header field in SIPmessages. The call identifier is created by the creatorof the call and used by all call participants. Connec-tions have the following properties: The logical con-nection source indicates the entity that is requesting

Ž .the connection the originator . This may not be theentity that is actually sending the request, as proxiesmay send requests on behalf of other users. In SIPmessages, this property is conveyed in the Fromheader field. The logical connection destination con-tained in the To field names the party who the

Ž .originator wishes to contact the recipient . The me-Ždia destination conveys the location IP address and

. Ž .port where the media audio, video, data are to besent for a particular recipient. This address may notbe the same address as the logical connection desti-nation. Media capabilities convey the media that aparticipant is capable of receiving and their at-tributes. Media capabilities and media destinationsare conveyed jointly as part of the payload of a SIPmessage. Currently, the Session Description ProtocolŽ . w xSDP 21 serves this purpose, although others arelikely to find use in the future 3. SDP expresses listsof capabilities for audio and video and indicateswhere the media is to be sent to. It also allows toschedule media sessions into the future and schedulerepeated sessions.

SIP defines several methods, where the first threemanage or prepare calls and connections: INVITEinvites a user to a call and establishes a new connec-tion, BYE terminates a connection between two users

3 In fact, H.245 capability sets can be carried in SDP for thispurpose.

Page 11: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 247

Žin a call note that a call, as a logical entity, iscreated when the first connection in the call iscreated, and destroyed when the last connection in

.the call is destroyed , and OPTIONS solicits infor-mation about capabilities, but does not set up aconnection. STATUS informs another server aboutthe progress of signaling actions that it has requested

4 Ž .via the Also header see below . ACK is used forreliable message exchanges for invitations. CAN-CEL terminates a search for a user. Finally, REGIS-TER conveys information about a user’s location toa SIP server.

SIP makes minimal assumptions about the under-lying transport protocol. It can directly use anydatagram or stream protocol, with the only restrictionthat a whole SIP request or response has to be eitherdelivered in full or not at all. SIP can thus be usedwith UDP or TCP in the Internet, and with X.25,AAL5rATM, CLNP, TP4, IPX or PPP elsewhere.Network addresses within SIP are also not restrictedto being Internet addresses, but could be E.164Ž .GSTN addresses, OSI addresses or private number-ing plans. In many cases, addresses can be any URL;for example, a call can be ‘‘forwarded’’ to a mailtoURL for email delivery.

5.2. Message encoding

Unlike other signaling protocols such as Q.931and H.323, SIP is a text-based protocol. This designwas chosen to minimize the cost of entry. The datastructures needed in SIP headers all fall into theparameter-value category, possible with a single levelof subparameters, so that generic data coding mecha-

Ž .nisms like Abstract Syntax Notation 1 ASN.1 offerno functional advantage.

Ž .Unlike the ASN.1 Packed Encoding Rules PERŽ .and Basic Encoding Rules BER , a SIP header is

largely self-describing. Even if an extension has notbeen formally documented, as was the case for manycommon email headers, it is usually easy to reverse-engineer them. Since most values are textual, thespace penalty is limited to the parameter names,usually at most a few tens of bytes per request.

4 The Also and Replaces headers and the STATUS methodare part of the SIP call control extensions.

ŽIndeed, the ASN.1 PER-encoded H.323 signaling.messages are larger than equivalent SIP messages.

Besides, extreme space efficiency is not a concernfor signaling protocols.

If not designed carefully, text-based protocols canbe difficult to parse due to their irregular structure.SIP tries to avoid this by maintaining a commonstructure of all header fields, allowing a genericparser to be written.

Unlike, say, HTTP and Internet email, SIP wasdesigned for character-set independence, so that anyfield can contain any ISO 10646 character. Since SIPoperates on an 8-bit clean channel, binary data suchas certificates does not have to be encoded. Togetherwith the ability to indicate languages of enclosedcontent and language preferences of the requestor,SIP is well suited for cross-national use.

5.3. Addressing and naming

To be invited and identified, the called party hasto be named. Since it is the most common form ofuser addressing in the Internet, SIP chose an email-like identifier of the form ‘‘user@domain’’,‘‘user@host’’, ‘‘user@IP_address’’ or ‘‘phone-number@gateway’’. The identifier can refer to thename of the host that a user is logged in at the time,an email address or the name of a domain-specificname translation service. Addresses of the form‘‘phone-number@gateway’’ designate GSTN phonenumbers reachable via the named gateway.

SIP uses these addresses as part of SIP URLs,such as sip:[email protected]. This URL may wellbe placed in a web page, so that clicking on the link

w xinitiates a call to that address, similar to a mailto 35URL today.

We anticipate that most users will be able to usetheir email address as their published SIP address.Email addresses already offer a basic location-inde-pendent form of addressing, in that the host part doesnot have to designate a particular Internet host, butcan be a domain, which is then resolved into one ormore possible domain mail server hosts via Domain

Ž . Ž .Name System DNS MX mail exchange records.This not only saves space on business cards, but alsoallows re-use of existing directory services such as

Ž .the Lightweight Directory Access Protocol LDAPw x Ž .36 , DNS MX records as explained below and

Page 12: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255248

email as a last-ditch means of delivering SIP invita-tions.

For email, finding the mail exchange host is oftensufficient to deliver mail, as the user either logs in tothe mail exchange host or uses protocols such as the

Ž .Internet Mail Access Protocol IMAP or the PostŽ .Office Protocol POP to retrieve their mail. For

interactive audio and video communications, how-ever, participants are typically sending and receivingdata on the workstation, PC or Internet appliance intheir immediate physical proximity. Thus, SIP hasto be able to resolve ‘‘name@domain’’ to ‘‘user@host’’. A user at a specific host will be derivedthrough zero or more translations. A single exter-nally visible address may well lead to a differenthost depending on time of day, media to be used,and any number of other factors. Also, hosts thatconnect via dial-up modems may acquire a differentIP address each time.

5.4. Basic operation

The most important SIP operation is that of invit-ing new participants to a call. A SIP client firstobtains an address where the new participant is to becontacted, of the form name@domain. The clientthen tries to translate this domain to an IP addresswhere a server may be found. This translation is

Ž .done by trying, in sequence, DNS Service SRVŽ .records, Canonical Name CNAME and finally Ad-

Ž .dress A records. Once the server’s IP address hasbeen found, the client sends it an INVITE messageusing either UDP or TCP.

The server which receives the message is notlikely to be the user agent server where the user isactually located; it may be a proxy or redirect server.For example, a server at example.com contactedwhen trying to call [email protected] may forwardthe INVITE request to [email protected]. AVia header traces the progress of the invitation fromserver to server, allows responses to find their wayback and helps servers to detect loops. A redirectserver, on the other hand, would respond to theINVITE request, telling the caller to [email protected] directly. In either case, theproxy or redirect server must somehow determinethe next hop server. This is the function of a locationserÕer. A location server is a non-SIP entity whichhas information about next hop servers for varioususers. The location server can be anything – anLDAP server, a proprietary corporate database, alocal file, the result of a finger command, etc. Thechoice is a matter of local configuration. Figs. 3 and4 show the behavior of SIP proxy and redirectservers, respectively.

Proxy servers can forward the invitation to multi-ple servers at once, in the hopes of contacting theuser at one of the locations. They can also forwardthe invitation to multicast groups, effectively contact-ing multiple next hops in the most efficient manner.

Fig. 3. SIP proxy server operation.

Page 13: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 249

Fig. 4. SIP redirect server operation.

Once the user agent server has been contacted, itsends a response back to the client. The response hasa response code and response message. The codesfall into classes 100 through 600, similar to HTTP.

Unlike other requests, invitations cannot be an-swered immediately, as locating the callee and wait-ing for a human to answer may take several seconds.Call requests may also be queued, e.g., if the callee

Ž .is busy. Responses of the 100 class denoted as 1xxindicate call progress; they are always followed byother responses indicating the final outcome of therequest.

While the 1xx responses are provisional, the otherclasses indicate the final status of the request: 2xxfor success, 3xx for redirection, 4xx, 5xx and 6xx forclient, server and global failures, respectively. 3xxresponses list, in a Contact header, alternate placeswhere the user might be contacted. To ensure relia-bility even with unreliable transport protocols, theserver retransmits final responses until the clientconfirms receipt by sending an ACK request to theserver.

All responses can include more detailed informa-tion. For example, a call to the central ‘‘switchboard’’address may return a web page that includes links tothe various departments in the company, providingnavigation more appropriate to the Internet than an

Ž .interactive voice response system IVR .

5.5. Protocol extensions

Since IPtel is still immature, it is likely thatadditional signaling capabilities will be needed in thefuture. Also, individual implementations and vendorsmay want to add additional features. SIP is designedso that the client can either inquire about serverabilities first or proceed under the assumption thatthe server supports the extension and then ‘‘backoff’’ if the assumption was wrong.

Methods: As in HTTP, additional methods can beintroduced. The server signals an error if a methodrequested by a client is not supported and informs itwith the Public and Allow response headers about

Page 14: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255250

the methods that it does support. The OPTIONSrequest also returns the list of available methods.

Request and response headers: As in HTTP or theŽ .Simple Mail Transport Protocol SMTP , client and

server can add request and response headers whichare not crucial to interpreting the request or responsewithout explicit indication. The entity receiving theheader simply silently ignores headers that it doesnot understand. However, this mechanism is notsufficient, as it does not allow the client to includeheaders that are vital to interpreting the request.Rather than enumerating ‘‘need-to-know’’ non-standard headers, the SIP Required header indicatesthose features that the client needs from the server.The server must refuse the request if it does notunderstand one of the features enumerated. Featurenames are either registered with the Internet As-

Ž .signed Number Authority IANA or derived hierar-chically from the feature owner’s Internet domainname, giving hints as to where further informationmight be found. SIP uses this to ascertain whethertelephony call-control functions are supported, avoid-ing the problem of partial implementations that haveunpredictable sets of optional features.

Status codes: Status codes returned in responsesare classified by their most-significant digit, so thatthe client knows whether the request was successful,failed temporarily or permanently. A textual statusmessage offers a fall-back mechanism that allows theserver to provide further human-readable informa-tion.

5.6. Telephony serÕices

SIP takes a different approach than standard tele-phony in defining services. Rather than explicitlydescribing the implementation of a particular service,it provides a number of elements, namely headersand methods, to construct services. The two principleheaders used are Also and Replaces. When presentin a request or response, the Also header instructsthe recipient to place a call to the parties listed.Similarly, Replaces instructs the recipient to termi-nate any connections with the parties listed. Anadditional method, called STATUS, is provided toallow a client to obtain results about progress of callsrequested via the Also header.

These elements, along with the basic SIP compo-nents, are easily used to construct a variety of tradi-tional telephony services. 700, 800 and 900 servicesŽpermanent numbers, freephone and paid information

.services are, from a call control perspective, simplyspecial cases of call forwarding, governed by adatabase lookup or some server-specified algorithm.The charging mechanisms, which differ for the threeservices, are handled by other protocols in an orthog-

Ž .onal fashion. Unlike for Intelligent Networking IN ,the number of such lookups within a call is notlimited. Call forwarding services based on user statusor preferences similarly require no additional proto-col machinery. As a simple example, we have imple-mented an automatic call forwarding mechanism thatinspects a callee’s appointment calendar to forwardcalls or indicate a more opportune time to call back.

While call forwarding precedes a call, call trans-fer allows to direct a call participant to connect toadifferent subscriber. Transfer services include blindand supervised call transfer, attendant and operatorservices, auto-dialer for telemarketing, or interactivevoice response. All are supported through use of theAlso header combined with programmed behaviorspecific to the particular service. For example, blindtransfer is implemented by having the transferringparty send a BYE to the transferred party containingan Also header listing the transferred-to party.

In SIP, call setup and session parameter modifica-Ž .tion are accomplished by the same INVITE request,

as all SIP requests are idempotent.In an Internet environment, no ‘‘lines’’ are tied

up by active calls when media is not being sent. Thismeans an IP telephone can support an unlimitednumber of active calls at one time. This makes iteasy, for example, to implement call waiting andcamp-on.

5.7. Multi-party calls

SIP supports the three basic modes of creatingŽ .multiparty conferences and their combinations : via

network-level multicast, via one or more bridgesŽ .also known as multipoint control units or as a meshof unicast connections. Multicast conferences requireno further protocol support beyond listing a groupaddress in the session description; indeed, the caller

Page 15: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 251

does not even have to be a member of the multicastgroup to issue an invitation. Bridges are introducedlike regular session members; they may take overbranches of a mesh through the Replaces header,without the participants needing to be aware thatthere is a bridge serving the conference. SIP alsosupports conferences through full-mesh, also knownas multi-unicast. In this case, the client maintains apoint to point connection with each participant. Whilethis mode is very inefficient, it is very useful forsmall conferences where bridges or multicast serviceare not available. Full meshes are built up easilyusing the Also header. For example, to add a partici-pant C to a call between users A and B, user Awould send an INVITE to C with an Also headercontaining the address of B. Mixes of multi-unicast,multicast, and bridges are also possible.

SIP can also be used to set up calls to severalcallees. For example, if the callee’s address is amailing list, the SIP server can return a list ofindividuals to be called in an Also header. Alterna-tively, a single address, e.g., [email protected],may reach the first available administrator. Serverscan fan-out invitation requests, including sendingthem out via multicast. Multicast invitations are par-ticularly useful for inviting members of a departmentŽ [email protected] to a conference in anextremely efficient manner without requiring central-ized list administration.

Multicast invitations also allow a small confer-ence to gradually and smoothly migrate to a large-

Žscale Mbone-type conference such as the ones de-.scribed by Handley in this issue without requiring

separate protocols and architectures.

6. Additional protocols

We have so far touched on the two major protocolcomponents required for Internet telephony, namelyRTP and SIP. However, a number of other protocolcomponents are very useful for more rich services;we briefly mention them here.

When an IP host wishes to communicate with aGSTN endpoint, it must do so by means of an

Ž .Internet Telephony Gateway ITG . This will gener-ally require the host, or a proxy for the host, toeventually find the IP address of a telephony gate-

way appropriate for terminating the call. The selec-tion of such a gateway is not an easy problem. Thereare many factors which contribute to the process:cost of completing the call, billing methods sup-

Ž .ported credit card, debit card, e-cash , signalingŽ .protocols supported SIP, H.323 , media codecs sup-

ported, service provider, etc. Generally, the clientwill need to provide input, indicating many of thesepreferences, for the selection to take place. In essence,a telephony gateway is a service, and the clientdesires to select a server for this service based onsome criteria. Extensions to the Service Location

w x w xProtocol 37 for discovery of wide area services 38have been proposed which seem applicable to tele-

w xphony gateways 39 .As discussed in Section 5.1, SIP can be used to

translate addresses as the request moves from SIPserver to SIP server. This translation can be based onany criteria, such as caller and time of day. How-ever, there is no currently defined mechanism forallowing a user to express its preference for how anaddress is to be translated. For example, [email protected] calls [email protected], Alice willsend a SIP request to the SIP server at widgets.com.Bob would like the calls forwarded to his PC if he isconnected and if Carol calls, but the calls should be

Ž .forwarded to his voice mail through an ITG other-wise. This requires Bob to express preferences forsuch translations to the server. We are currentlydeveloping a call processing syntax which can beuploaded to a server in a SIP REGISTER message.Such a language is to be standardized in the IETF

Ž .iptel IP Telephony working group.With widespread use of IP telephony, IP tele-

phony voicemail is likely to follow. In order toŽsupport retrieval and recording of voicemail which

.could easily include video , a protocol is necessaryto give a user VCR-like controls over the voicemail

w xserver. The RTSP 40 is used for this purpose.RTSP allows a client to instruct a media server torecord and playback multimedia sessions, includingfunctions such as seek, fast forward, rewind, andpause. RTSP, like SIP, is a textual protocol similarin format to HTTP. RTSP integrates easily with SIP,in fact. A user can use SIP to invite a media serveror voicemail server to a multimedia session, and thenuse RTSP to control operation of the during thesession.

Page 16: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255252

7. Protocol integration

The previous sections have described the basics ofthe protocol mechanisms for Internet telephony: SIP,RTP, wide area service location, and RTSP. In thissection, we put the elements together and show howthey can be used for a complex service.

Fig. 5 shows an IP network consisting of threeŽ .Internet Service Providers ISP’s , A, B, and transit

ISP C. ISP A provides a SIP proxy server A, and aŽ .wide area service location WASRV server. ISP B

provides a SIP proxy server as well, in addition to anRTSP server for voicemail. ISP C provides a tele-

Ž .phony gateway ITG . User A in ISP A wishes tocall user B in ISP B. User A’s SIP user agent clientis configured to use SIP server A as a proxy for allcall requests. To call B, A sends a SIP INVITE

Ž .message to the SIP Server A 1 , indicating the nameŽ .of user B in the To field [email protected] . SIP

server A looks up the domain ispb.com in DNS, andobtains the IP address of server B. Acting as a proxy,

Ž .it forwards the SIP INVITE to SIP server B 2 .Server B checks its records, and finds a set of callprocessing instructions for the user. The instructionsindicate that the user is first to be contacted, viaproxy, at their PC. If no one answers, the servershould return two alternate locations for user B to

the requester: a telephone number and a voicemailserver. The SIP server then follows these instruc-tions, and forwards the INVITE message to user B’s

Ž .PC 3 .User B has instructed his SIP user agent server

software not to accept any calls. User B’s SIP userserver thus responds with an error message to SIP

Ž .server B 4 . The SIP server then sends a redirectŽ .response to SIP server A 5 . The response is in the

300 class set, and includes in the Contact fields twoalternate addresses. The first is a telephone URLŽ .tel: rrq1-732-555-1212 , and the second an RTSPURL for a media server. The location headers canindicate preferences, and the response indicates thatthe caller should try the phone URL first.

The SIP server A then tries to call the user at theGSTN number. To do this, it first queries a WASRV

Ž .server 6 . It provides the server with the telephonenumber to contact, and user preferences about billing

Ž .methods such as credit card payment . These prefer-ences can be supplied to the SIP server by the user inany number of ways, including manual entry by aadministrator. The WASRV server queries itsdatabase, and returns the address of an appropriate

Ž .gateway 7 .The SIP server then sends the original SIP IN-

Ž .VITE to the gateway 8 . The gateway makes a call

Fig. 5. Internet telephony components.

Page 17: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 253

Ž . Ž .to the PSTN 9 , but the line is busy 10 . Thegateway indicates this to the server via a SIP error

Ž .message 11 . The server A still has the RTSP URLas a potential contact point for the client. Since itdoes not understand how to process RTSP URL’s, itreturns this URL to A’s user agent client via a SIP

Ž .redirection response 12 . User A can then contactthe RTSP server, and leave a message for user B.

This example illustrates a number of differentfacets of the architecture. The first is the ‘‘hopping’’of SIP INVITE requests between elements, with each

Želement accessing directories such as a WASRV.server or databases as needed. The behavior of each

SIP server is dependent on its local programmingand implementation. In the example here, SIP serverB had been programmed with user preferences aboutcall handling. Server A was programmed with userpreferences about billing, and had access to WASRVservices to complete calls to PSTN destinations. Thisheterogeneity allows for the definition of new ser-vices, and for differentiation among servers.

Another facet of the architecture is the smoothŽintegration with other services such as tel: and

.rtsp: . SIP allows for any type of URL to be carriedin the From, To, Contact, and Also fields. Thisallows calls to be handed off to other protocols andservices when needed.

8. Conclusion and future work

We have presented a portion of a protocol suitefor supported advanced IP telephony services on theInternet. This suite includes the RTP for transport,the SIP for signaling, and the RTSP for stored mediaretrieval. These protocols are independent and modu-lar, and when combined with billing, service discov-ery, and resource reservation protocols, form a com-plete architecture for future services.

However, much work remains. While IPtelpromises to add greater flexibility to telephony ser-vices and speed their implementation, IPtel first hasto overcome a number of well-known problems,including unpredictable quality of service in the widearea, the lack of reliable and cheap end systems,Internet unreliability, and the lack of a billing infras-tructure.

The Internet also currently lacks a generally ac-cepted reliable multicast protocol. Application shar-

Ž .ing, voting i.e., distributed counting and floor con-Ž .trol i.e., a distributed queue require such reliability.

As mentioned earlier, gateways to existingtelecommunications systems will have to play a largerole in the transition to an Internet-based telecommu-nications infrastructure. We are currently investigat-ing the interconnection of SIP servers with the SS7Ž . w xISUP protocol 4 , so that a gateway can become a

Žfirst-class citizen in the telephone network in otherwords, interface to the telephone network using an

.NNI as opposed to UNI . Gateways of SIP to H.323are also under study.

References

w x1 H. Schulzrinne, A comprehensive multimedia control archi-tecture for the Internet, Proc. Int. Workshop on Network andOperating System Support for Digital Audio and VideoŽ .NOSSDAV , St. Louis, MO, May 1997.

w x2 M. Handley, H. Schulzrinne, E. Schooler, J. Rosenberg, SIP:Session initiation protocol, Internet Draft, Internet Engineer-ing Task Force, October 1998. Work in progress.

w x3 M. Handley, J. Crowcroft, C. Bormann, J. Ott, The InternetMultimedia Conferencing Architecture, Internet Draft, Inter-net Engineering Task Force, July 1997, Work in progress.

w x4 T. Russell, Signaling System a7, McGraw-Hill, New York,1995.

w x Ž .5 Y. Rekhter, T. Li, A Border Gateway Protocol 4 BGP-4 ,Ž .Request for Comments Draft Standard 1771, Internet Engi-

neering Task Force, March 1995.w x6 B. Braden, L. Zhang, S. Berson, S. Herzog, S. Jamin,

Ž .Resource ReSerVation Protocol RSVP – Version 1 Func-Žtional Specification, Request for Comments Proposed Stan-

.dard 2205, Internet Engineering Task Force, October 1997.w x7 P. Pan, H. Schulzrinne, Yessir: A simple reservation mecha-

nism for the Internet, Technical Report RC 20697, IBMResearch, Hawthorne, New York, September 1997.

w x8 C. Rigney, RADIUS Accounting, Request for CommentsŽ .Informational 2139, Internet Engineering Task Force, April1997.

w x9 A. Rubens, P. Calhoun, DIAMETER Base Protocol, InternetDraft, Internet Engineering Task Force, March 1998, Workin progress.

w x10 International Telecommunication Union, Terms and Defini-tions, Recommendation B.13, Telecommunication Standard-ization Sector of ITU, Geneva, Switzerland, 1988.

w x11 L. Zhang, A. Mankin, J. Stewart, T. Narten, M. Crawford,Separating Identifiers and Locators in Addresses: An Analy-sis of the GSE Proposal for IPv6, Internet Draft, InternetEngineering Task Force, March 1998, Work in progress.

w x12 H. Schulzrinne, Re-engineering the telephone system, Proc.

Page 18: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255254

Ž .IEEE Singapore Int. Conf. on Networks SICON , Singapore,April 1997.

w x13 I. Busse, B. Deffner, H. Schulzrinne, Dynamic QoS controlof multimedia applications based on RTP, Computer Com-

Ž .munications 19 1996 49–58.w x14 H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP:

A Transport Protocol for Real-time Applications, Request forŽ .Comments Proposed Standard 1889, Internet Engineering

Task Force, January 1996.w x15 C. Strathmeyer, Feature topic: Computer telephony, IEEE

Ž .Communications Magazine 34 1996 .w x Ž .16 Versit Consortium, Computer Telephony Integration CTI

Encyclopedia, Tech. Rep. Release 1.0, Versit Consortium,October 1996.

w x17 C. Low, The internet telephony red herring, Proc. GlobalInternet, London, England, November 1996, pp. 72–80.

w x18 C. Low, Integrating communications services, IEEE Commu-Ž .nications Magazine 35 1997 .

w x19 P. Moulton, J. Moulton, Telecommunications technical fun-damentals, technical handout, The Moulton Company,Columbia, MD, 1996, see http:rrwww.moultonco.comrsemnotesrtelecommrteladd.htm.

w x20 D. Cohen, A protocol for packet-switching voice communi-cation, Proc. Computer Network Protocols Symp., Liege,Belgium, February 1978.

w x21 M. Handley, V. Jacobson, SDP: Session Description Proto-Ž .col, Request for Comments Proposed Standard 2327, Inter-

net Engineering Task Force, April 1998.w x22 J. Rosenberg, H. Schulzrinne, Timer reconsideration for en-

hanced RTP scalability, Proc. Conf. on Computer Communi-Ž .cations IEEE Infocom , San Francisco, CA, MarchrApril

1998.w x23 C. Zhu, RTP Payload Format for H.263 Video Streams,

Ž .Request for Comments Proposed Standard 2190, InternetEngineering Task Force, September 1997.

w x24 T. Turletti, C. Huitema, RTP Payload Format for H.261Ž .Video Streams, Request for Comments Proposed Standard

2032, Internet Engineering Task Force, October 1996.w x25 L. Berc, B. Fenner, R. Frederick, S. McCanne, RTP Payload

Format for JPEG-compressed Video, Request for CommentsŽ .Proposed Standard 2035, Internet Engineering Task Force,October 1996.

w x26 R. Civanlar, G. Fernando, V. Goyal, D. Hoffman, RTPPayload Format for MPEG1rMPEG2 Video, Request for

Ž .Comments Proposed Standard 2250, Internet EngineeringTask Force, January 1998.

w x27 H. Schulzrinne, RTP Profile for Audio and Video Confer-Žences with Minimal Control, Request for Comments Pro-

.posed Standard 1890, Internet Engineering Task Force, Jan-uary 1996.

w x28 O. Hodson, I. Kouvelas, O. Hodson, M. Handley, J. Bolot,RTP Payload for Redundant Audio Data, Request for Com-

Ž .ments Proposed Standard 2198, Internet Engineering TaskForce, September 1997.

w x29 J. Rosenberg, H. Schulzrinne, An RTP Payload Format forGeneric Forward Error Correction, Internet Draft, InternetEngineering Task Force, March 1998, Work in progress.

w x30 J. Rosenberg, H. Schulzrinne, An RTP Payload Format forUser Multiplexing, Internet Draft, Internet Engineering TaskForce, January 1998, Work in progress.

w x31 D. Katz, IP Router Alert Option, Request for CommentsŽ .Proposed Standard 2113, Internet Engineering Task Force,February 1997.

w x32 W. Almesberger, J.-Y.L. Boudec, T. Ferrari, Scalable re-source reservation for the Internet, Proc. IEEE Conf. onProtocols for Multimedia Systems – Multimedia NetworkingŽ .PROMS-MmNet , Santiago, Chile, November 1997, Techni-cal Report 97r234, DI-EPFL, Lausanne, Switzerland.

w x33 K. Nichols, V. Jacobson, L. Zhang, A Two-bit DifferentiatedServices Architecture for the Internet, Internet Draft, InternetEngineering Task Force, November 1997, Work in progress.

w x34 International Telecommunication Union, Visual TelephoneSystems and Equipment for Local Area Networks WhichProvide a Non-guaranteed Quality of Service, Recommenda-tion H.323, Telecommunication Standardization Sector ofITU, Geneva, Switzerland, May 1996.

w x35 P. Hoffman, L. Masinter, J. Zawinski, The mailto URLŽ .scheme, Request for Comments Proposed Standard 2368,

Internet Engineering Task Force, July 1998.w x36 T. Howes, S. Kille, M. Wahl, Lightweight Directory Access

Ž . Ž .Protocol v3 , Request for Comments Proposed Standard2251, Internet Engineering Task Force, December 1997.

w x37 J. Veizades, E. Guttman, C. Perkins, S. Kaplan, ServiceŽLocation Protocol, Request for Comments Proposed Stan-

.dard 2165, Internet Engineering Task Force, June 1997.w x38 J. Rosenberg, H. Schulzrinne, B. Suter, Wide Area Network

Service Location, Internet Draft, Internet Engineering TaskForce, December 1997, Work in progress.

w x39 J. Rosenberg, H. Schulzrinne, Internet telephony gatewayŽlocation, Proc. Conf. on Computer Communications IEEE

.Infocom , San Francisco, CA, MarchrApril 1998.w x40 H. Schulzrinne, R. Lanphier, A. Rao, Real Time Streaming

Ž . ŽProtocol RTSP , Request for Comments Proposed Stan-.dard 2326, Internet Engineering Task Force, April 1998.

Henning Schulzrinne received his un-dergraduate degree in economics andelectrical engineering from the Technis-che Hochschule in Darmstadt, Germany,in 1984, his MSEE degree as a Fulbrightscholar from the University of Cincin-nati, OH and his Ph.D. degree from theUniversity of Massachusetts in Amherst,Massachusetts in 1987 and 1992, respec-tively. From 1992 to 1994, he was amember of technical staff at AT&T BellLaboratories, Murray Hill. From 1994–

Ž .1996, he was associate department head at GMD-Fokus Berlin ,before joining the Computer Science and Electrical Engineeringdepartments at Columbia University, New York. His researchinterests encompass real-time network services, the Internet andmodeling and performance evaluation.

Page 19: Internet Telephony: architecture and protocols – an IETF …hgs/papers/2071.pdf · 2003-03-25 · Computer Networks 31 1999 237–255 . Internet Telephony: architecture and protocols

( )H. Schulzrinne, J. RosenbergrComputer Networks 31 1999 237–255 255

Jonathan Rosenberg is currently aMember of Technical Staff in the HighSpeed Networks Research Department,Bell Labs, Lucent Technologies, inHolmdel, NJ. After completing his Mas-ters from MIT, he came to work at BellLabs, where he worked on developmentand implementation of low bitrate videocodecs. His current research interests in-clude IP voice transport, signaling pro-tocols and architectures for IP tele-phony, and service discovery. Mr.

Rosenberg is currently pursuing a part-time Ph.D. at ColumbiaUniversity, working with Prof. Henning Schulzrinne on IP tele-phony. He has published numerous papers, and filed over a dozenpatents, in the general areas of multimedia and networking. He isactive in the IETF, where he chairs the newly formed iptelworking group.