
YouTube Traffic Characterization: A View From the Edge
Phillipa Gill, Martin Arlitt, Zongpeng Li, Anirban Mahanti
Enterprise Systems and Software Laboratory, HP Laboratories Palo Alto
HPL-2007-119, July 12, 2007*
Keywords: traffic characterization, YouTube, Web 2.0, caching

This paper presents a traffic characterization study of the popular video sharing service, YouTube. Over a three month period we observed almost 25 million transactions between users on an edge network and YouTube, including more than 600,000 video downloads. We also monitored the globally popular videos over this period of time.

In the paper we examine usage patterns, file properties, popularity and referencing characteristics, and transfer behaviors, and compare them to traditional Web and media streaming workload characteristics. We also analyze social networking (Web 2.0) aspects of YouTube. We conclude the paper with a discussion of the implications of the observed characteristics. For example, we find that as with the traditional Web, caching could improve the end user experience, reduce network bandwidth consumption, and reduce the load on YouTube's core server infrastructure. Unlike traditional Web caching, Web 2.0 provides additional metadata that should be exploited to improve the effectiveness of strategies like caching.

* Internal Accession Date Only Approved for External Publication

To be presented and published at ACM SIGCOMM Internet Measurement Conference, IMC'07. Copyright 2007 ACM

YouTube Traffic Characterization: A View From the Edge

Phillipa Gill, Martin Arlitt, Zongpeng Li, Anirban Mahanti
Department of Computer Science, University of Calgary, Canada
Enterprise Systems & Software Lab, HP Labs, Palo Alto, USA

Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, India

ABSTRACT
This paper presents a traffic characterization study of the popular video sharing service, YouTube. Over a three month period we observed almost 25 million transactions between users on an edge network and YouTube, including more than 600,000 video downloads. We also monitored the globally popular videos over this period of time.

In the paper we examine usage patterns, file properties, popularity and referencing characteristics, and transfer behaviors, and compare them to traditional Web and media streaming workload characteristics. We also analyze social networking (Web 2.0) aspects of YouTube. We conclude the paper with a discussion of the implications of the observed characteristics. For example, we find that as with the traditional Web, caching could improve the end user experience, reduce network bandwidth consumption, and reduce the load on YouTube's core server infrastructure. Unlike traditional Web caching, Web 2.0 provides additional metadata that should be exploited to improve the effectiveness of strategies like caching.

1. INTRODUCTION

The Web is slowly but steadily undergoing a metamorphosis as more and more users are able to create, share, and distribute content on the Web. This shift toward user generated content represents one of the biggest changes of the Web since its inception in the early 1990s. This paradigm shift has resulted in a surge in popularity of Web sites that enable users to build social networks and share content. Today, user generated content available on the Web includes textual information contained in Weblogs (blogs) [7, 41], photos on sites such as Flickr [21] and Facebook [20], and videos on sites such as FlixHunt [22] and YouTube [42]. Collectively, these types of Web sites are referred to in the media as Web 2.0 (to distinguish these from the so-called Web 1.0 sites that host content from established providers) [12].

Web 2.0 changes how users participate in the Web. Instead of consuming content posted by a single administrator, users are now able to post their own content and view content posted by their peers. Some Web 2.0 sites, for example MySpace [34] and Facebook [20], promulgate social networking by allowing individuals with similar interests to form social groups. Tagging [4, 31, 36], a feature that allows users to associate words or phrases (tags) with content they post or view on a Web page, is extensively used on some Web 2.0 sites to categorize and organize content [21].

Adoption of Web 2.0 has been widespread, with users of all ages participating in posting as well as viewing content [32]. The diversity of participants in Web 2.0 is possible because of the low barrier to entry into these online communities. Many Web 2.0 sites are designed such that signing up and posting content are relatively easy. This enables users who may not be technically savvy to participate alongside more experienced users.

Many Web 2.0 sites allow users to store large multimedia files. As the popularity of such sites grows, and as the availability of consumer broadband increases, the sheer volume of data exchanged for Web 2.0 traffic has the potential to severely strain these sites and the shared network resources that connect them to users. Understanding Web 2.0 workloads will aid in capacity planning, network management, the design of new systems, and in future network research. While there are extensive studies of traditional Web workloads [5, 6, 16, 29], there have been no substantive studies of Web 2.0 workloads in the literature. Our work aims to fill this gap and strives to provide insights into how user generated content is viewed and distributed on the Internet.

In this paper, we analyze and characterize one such Web 2.0 site, YouTube [42], the largest video sharing site on the Internet [27]. According to estimates, with 100 million video views per day YouTube accounts for approximately 60% of the videos watched on the Internet; YouTube is also growing at a rapid pace, with 65,000 video uploads per day [40]. This constant growth of YouTube makes capturing its behavior by examining a single point in time almost impossible. Our analysis is based on three months of data that reflects trends of YouTube traffic from both a local campus network and a global (i.e., Internet wide) perspective. Locally, we consider the network resources consumed by YouTube traffic as well as the viewing habits of campus YouTube users. Globally, we consider characteristics of the most popular videos on YouTube and examine the relationship between globally popular videos and videos that are popular on campus.

The main contributions of our paper are threefold. First, we introduce an efficient measurement framework that enables us to monitor a popular and resource intensive Web 2.0 application over an extended period of time (while protecting user privacy). Second, we provide one of the first extensive characterization studies of Web 2.0 traffic. Third, we examine the implications of the observed characteristics. In particular, we analyze a wide range of features, including usage patterns, file properties, popularity and referencing behaviors, and transfer characteristics, which we compare to characteristics of traditional Web and streaming media workloads. For example, we observe that a small fraction of the requests to YouTube are for videos, but video downloads account for almost all of the bytes transferred, as video file sizes are orders of magnitude larger than files of other content types. Although similar properties have been observed for other Internet applications, with Web 2.0 the impact may be more significant, as it is for content that appeals to a much larger audience. An obvious performance and scalability enhancement is to utilize caching effectively. Although caching has been thoroughly studied for traditional Web workloads, there are differences to consider for Web 2.0. For example, the ability for anyone to create content and make it available online implies there will be a sustained supply of content, increasing both the depth and breadth of available content. This can reduce the effectiveness of caching; indeed, we observe a lower concentration of references than has been observed in traditional Web workloads. However, Web 2.0 provides an abundance of meta-data compared to the traditional Web (e.g., user ratings, video categories, etc.); this meta-data can and should be exploited by Web 2.0 caching in order to be more effective.
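To make the idea of metadata-aware caching concrete, the sketch below shows one hypothetical way a cache could blend recency with Web 2.0 metadata (here, a user rating on a 0-5 scale) when choosing eviction victims. The class name, the blend factor, and the scoring formula are all illustrative assumptions; this is not the policy proposed or evaluated in the paper.

```python
import time

class MetadataAwareCache:
    """Illustrative sketch only: an eviction policy that blends recency with
    Web 2.0 metadata (a video's user rating). Hypothetical design, not the
    paper's proposal."""

    def __init__(self, capacity_bytes, rating_weight=0.3):
        self.capacity = capacity_bytes
        self.rating_weight = rating_weight   # assumed blend factor
        self.entries = {}                    # video_id -> (size, rating, last_access)
        self.used = 0

    def _score(self, rating, last_access):
        # Higher score = more worth keeping; lower score = eviction candidate.
        recency = 1.0 / (1.0 + time.time() - last_access)
        return (1.0 - self.rating_weight) * recency + \
               self.rating_weight * (rating / 5.0)

    def lookup(self, video_id):
        entry = self.entries.get(video_id)
        if entry is None:
            return False
        size, rating, _ = entry
        self.entries[video_id] = (size, rating, time.time())  # refresh recency
        return True

    def insert(self, video_id, size, rating):
        # Evict the lowest-scoring videos until the new one fits.
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries,
                         key=lambda v: self._score(*self.entries[v][1:]))
            self.used -= self.entries.pop(victim)[0]
        self.entries[video_id] = (size, rating, time.time())
        self.used += size
```

A plain LRU policy would keep only the recency term; the point of the sketch is that the extra metadata available in Web 2.0 workloads gives the cache another signal to weigh.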

The remainder of the paper is structured as follows. Section 2 presents background information on YouTube. We discuss related work in Section 3. Our data collection framework is described in Section 4, followed by a high-level analysis of the collected data in Section 5. The next four sections characterize the YouTube workload in more detail. Section 6 characterizes YouTube's video and non-video files. The popularity characteristics of video files accessed by users on our campus are analyzed in Section 7, and locality properties of our campus YouTube traffic are analyzed in Section 8. In Section 9, we characterize the transfer sizes and durations for YouTube traffic on our campus network. Section 10 describes the implications of the workload characteristics we identified. We conclude the paper in Section 11 with a summary of our contributions and a discussion of future work.

2. BACKGROUND

YouTube was founded in February 2005 as a Web site that enables users to easily share video content. As YouTube expanded, features were added to facilitate social networking among its users. Users can tag their uploaded videos with keywords or phrases that best describe their content, and these tags are used by YouTube to provide users with a list of related videos. Tagging, social networking, and the abundance of user generated content make YouTube the quintessential Web 2.0 site. According to recent media reports, YouTube is the largest video sharing Web site on the Internet, with over 100 million video accesses per day and 65,000 video uploads per day [40]. Time magazine's 2006 year-end issue named "You" as the person of the year, as an homage to YouTube and other Web 2.0 users. Due to the incredible popularity of YouTube, it attracted the attention of numerous investors. In November 2006, YouTube was acquired by Google for $1.65 billion US.

One of the keys to YouTube's success is its use of Adobe's Flash Video (FLV) format for video delivery. While users may upload content in a variety of media formats (e.g., WMV, MPEG and AVI), YouTube converts them to Flash Video before posting them. This enables users to watch the videos without downloading any additional browser plug-ins, provided they have the Flash Player 7 installed. It is estimated that over 90% of clients have Flash Player 7 installed¹. To enable playback of the Flash video before the content is completely downloaded, YouTube relies on Adobe's progressive download technology.

¹ http://www.adobe.com/products/player_census/flashplayer/version_penetration.html

Traditional download-and-play requires the full FLV file to be downloaded before playback can begin. Adobe's progressive download feature allows the playback to begin without downloading the entire file. This is accomplished using ActionScript commands that supply the FLV file to the player as it is being downloaded, enabling playback of the partially downloaded file. Progressive download works with Web servers, and video content is delivered using HTTP/TCP. This delivery technique is sometimes referred to as pseudo streaming to distinguish it from traditional media streaming. Traditional on-demand streaming of stored media files typically requires the use of dedicated streaming servers that facilitate client-server interaction during the course of the video playback. This interaction may be used for adaptation of video quality or user interactions such as fast forward or rewind operations.
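As a rough illustration of pseudo streaming over plain HTTP, the following sketch downloads a video in fixed-size chunks and hands each chunk to a player callback as it arrives, so playback can start before the transfer completes. The URL, the chunk size, and the player callback are placeholders; this is not YouTube's or Adobe's actual implementation, which uses the Flash player and ActionScript.

```python
import urllib.request

CHUNK = 64 * 1024  # 64 KB read size (arbitrary choice)

def progressive_download(url, feed_player):
    """Fetch a video over HTTP and feed it to a player while the transfer
    is still in progress (pseudo streaming), instead of waiting for the
    whole file as in download-and-play."""
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(CHUNK)
            if not chunk:
                break
            feed_player(chunk)  # playback can begin before the download finishes

# Hypothetical usage; the URL and the callback are placeholders:
# progressive_download("http://example.com/video.flv", player.write)
```

In contrast, a dedicated streaming server would keep a control channel open to the client for quality adaptation or seek operations, which plain HTTP delivery does not provide.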

While video content is usually the focus of a visit to the YouTube Web site, there are many file transfers that happen behind the scenes to embed the video file and display the surrounding Web site content. For example, when a user clicks on a video of interest, a GET request for the title HTML page for the requested video is made. This HTML page typically includes references to a number of Javascript files. These scripts are responsible for embedding the Shockwave Flash (SWF) player file, and other peripheral tasks such as processing video ratings and comments. The SWF file is relatively small (26 KB), so the page loads quickly. Once the player is embedded, a request for the FLV video file is issued. The FLV video file is downloaded to the user's computer using an HTTP GET request, which is serviced by either a YouTube server or a server from a content distribution network (CDN).
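A minimal sketch of how one might label each logged transaction in this page-load sequence (watch page, scripts, SWF player, FLV video) is shown below. The URL suffixes and content types used here are illustrative assumptions rather than YouTube's documented 2007 URL scheme.

```python
def classify_transaction(url, content_type):
    """Rough, assumed classification of a YouTube HTTP transaction by the
    kind of object it carries; patterns are illustrative only."""
    if content_type.startswith("video/") or url.endswith(".flv"):
        return "video"    # the FLV download itself
    if url.endswith(".swf"):
        return "player"   # the embedded Shockwave Flash player
    if url.endswith(".js"):
        return "script"   # scripts for embedding, ratings, comments
    if "text/html" in content_type:
        return "page"     # the title (watch) HTML page
    return "other"        # thumbnails, stylesheets, etc.
```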

3. RELATED WORK

There are numerous studies of traditional Web (now referred to as Web 1.0) workloads. Cunha et al. characterized a set of Web browser traces [17], while Gribble and Brewer analyzed HTTP traces from a dial-in modem pool [23]. Both examine characteristics such as access patterns, object types, and object sizes. Arlitt and Williamson identified a set of ten characteristics common to Web server workloads [6]. Arlitt and Jin examined the much busier World Cup 1998 Web site, and verified that these characteristics existed [5]. Mahanti et al. [29] and Duska et al. [19] characterized Web proxy workloads. A common conclusion from all of these studies was that caching had the potential to improve both the user experience (i.e., through reduced latency) and the scalability of the Web (i.e., by distributing the workload). Our work is complementary, in that it examines Web 2.0 workloads for similar characteristics and opportunities for infrastructure improvements.

Characterization of both stored and live media streaming has also received considerable attention in the literature. Characteristics of media files on the Web have been studied using a crawling or searching perspective, originating from an edge network [1, 26], or by analyzing streams collected in the network [14, 24].

In 1998, Acharya et al. presented one of the earliest known studies of the characteristics of streaming media files stored on Web servers [1]. This was followed by Chesire et al. [14], who analyzed a week-long trace of live and on-demand RTSP sessions, collected in 2000 from their campus's Internet gateway. They found that most media streams viewed on their campus were encoded at low bit rates suitable for streaming to dial-up users, were typically less than 1 MB in size, and had durations of less than 10 minutes. In addition, they found media file popularity to be Zipf-like.

In 2003, Li et al. [26] crawled 17 million Web pages for stored audio/video files and discovered 30,000 such files. Analyzing these files, they reported several observations: media durations are long-tailed; media files are typically encoded in proprietary formats; and most video files are encoded at bit rates targeted at broadband users.

In 2004, Sripanidkulchai et al. [38] analyzed a workload of live media streams collected from a large CDN. They observe that media popularity follows a 2-mode Zipf distribution. The authors also observe exponentially distributed client arrival times within small time windows and heavy-tailed session durations.

Workloads from media servers in corporate, university, and commercial environments have also been studied [2, 3, 13, 15, 25, 43]. For example, Almeida et al. [3] presented a detailed analysis of workloads from two media servers (eTeach and BIBS) located at two large universities in the United States. They found that file popularity can be modeled as a concatenation of two Zipf distributions, and that client interarrival times followed the exponential distribution in the case of eTeach and the Pareto distribution in the case of BIBS. They also observed uniform access to all segments of popular files, whereas access to segments of infrequently accessed files was non-uniform. The authors also observed a lack of temporal locality in client requests.

Cherkasova and Gupta [13] analyzed the workloads of two corporate media servers. They report that video popularity is Zipf-like, that a significant fraction of the total requests and bytes transferred were for new content, and that most accesses to a file occurred soon after the file was made available on the servers.

More recently, Yu et al. [43] presented an analysis of the file reference characteristics and the user behavior of the PowerInfo system, a production video-on-demand system deployed in major Chinese cities by China Telecom. The system mostly hosts older television programs encoded in MPEG format. The authors analyzed 217 days of access logs from one city with 150,000 users. Their access logs recorded requests for 6,700 unique videos and a total of 21 million video requests. They found that: the request arrival rate is strongly influenced by time of day; request arrivals can be modeled by a modified Poisson distribution; video popularity follows the Zipf distribution; and user interest in videos is fueled by several factors, such as the list of videos on the most recommended list and the availability of new videos.

Our work is complementary to the aforementioned works. As previously stated, to the best of our knowledge, this is the first workload characterization study of the popular video sharing Web site YouTube. In addition, there are a number of other factors that distinguish our research from prior work. First, traditional Web and media servers have strict control over who posts content, whereas on YouTube anyone can post videos. Second, YouTube's clientele is vastly different from the clientele of educational, corporate, and even commercial video-on-demand sites. Third, unlike traditional streaming media, where content is typically streamed using a proprietary application layer protocol with either TCP or UDP at the transport layer, YouTube uses a pseudo-streaming model (i.e., progressive download) where content is delivered using HTTP. As such, there are many differences between YouTube and other media streaming service providers, and the increased use of YouTube warrants a careful study of this new type of video traffic.

4. DATA COLLECTION FRAMEWORK

YouTube's workload is a moving target. Every day, new videos are added, new ratings are submitted, and new comments are posted. The popularity of videos also changes on a daily basis. In this paper, we propose a multilevel approach to capturing YouTube traffic and understanding its workload characteristics. First, we monitor YouTube usage on our local (University of Calgary) campus network. Doing so provides us with an understanding of how YouTube may be used by clients of other large edge networks. Section 4.1 describes our local data collection methodology. Second, we collect statistics on the most popular videos on the YouTube site. Section 4.2 explains our global data collection methodology. By keeping statistics of both local and global YouTube usage we are able to compare and contrast characteristics of videos that are popular at both the local and global level.

4.1 Data Collection of Edge YouTube Usage

An enabling step in this work was the collection of data from an edge network. Our goals in data collection were to:

- collect data on all YouTube usage at the University of Calgary network;
- gather such data for an extended period of time;
- protect user privacy.

This conceptually simple task proved challenging, for a number of reasons. One challenge is the global popularity of YouTube. Due to its popularity, YouTube's delivery infrastructure is comprised of many servers, including some from (one or more) Content Distribution Networks (CDNs). A second challenge is that our network monitor has limited CPU and storage resources², making storage of lengthy full packet traces infeasible. A third challenge is that our campus recently upgraded from a 100 to a 300 Mb/s full-duplex network link to the Internet; users on campus were happy to increase their Internet usage, which places greater pressure on our aging network monitor. Figure 1 shows the aggregate bandwidth (inbound + outbound) consumed on our campus Internet link during the collection period.

[Figure 1: Aggregate Campus Internet Bandwidth During Collection Period. The plot shows aggregate (inbound + outbound) bandwidth in Mb/s (0-600) on the campus Internet link from January 14 through April 1.]

² Our monitor was purchased in spring 2003, when our Internet connection was only 12 Mb/s. Our monitor has two Intel Pentium III 1.4 GHz processors, 2 GB RAM, and two 70 GB drives.

The data collection methodology we used to address these challenges is as follows:

- identify a set of servers that provide YouTube content;

- use bro [9] to collect summary information on each HTTP transaction involving one of those servers;

- restart bro daily, compressing the previous day's log.

We identified the servers to monitor a priori. Initially we used tcpdump [39] to gather traces on a workstation while we browsed the YouTube site. This provided a sample of the servers used to deliver YouTube content. We then used whois to determine the networks that the servers' IP addresses were affiliated with. We identified two networks (youtube and youtube2) that were assigned to YouTube. For our long-term data collection, we gather all HTTP transactions involving any IP address on these two networks. We also identified one CDN (Limelight Networks) delivering YouTube content. Extracting traffic for this CDN required a slightly different approach, as the Limelight CDN also serves traffic for other popular sites such as Facebook and MySpace. Fortunately, Limelight incorporates youtube into the fully qualified domain name (FQDN) of each server, so we were able to consider only the transactions including an HTTP Host: field that included the term youtube.³
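The filtering rule described above can be sketched as follows: a transaction is kept if its server IP address falls inside one of the networks assigned to YouTube, or if it was served from a Limelight address and its Host header contains "youtube". The network prefixes below are placeholder documentation prefixes, not the actual networks used in the study.

```python
import ipaddress

# Placeholder prefixes; the study used the two networks assigned to
# YouTube (identified via whois) plus Limelight CDN addresses.
YOUTUBE_NETS = [ipaddress.ip_network("192.0.2.0/24"),
                ipaddress.ip_network("198.51.100.0/24")]
LIMELIGHT_NETS = [ipaddress.ip_network("203.0.113.0/24")]

def is_youtube_transaction(server_ip, http_host):
    """Decide whether a logged HTTP transaction belongs to YouTube."""
    ip = ipaddress.ip_address(server_ip)
    if any(ip in net for net in YOUTUBE_NETS):
        return True
    # Limelight also serves other sites (e.g., Facebook, MySpace), so
    # additionally require "youtube" to appear in the FQDN.
    if any(ip in net for net in LIMELIGHT_NETS):
        return "youtube" in (http_host or "").lower()
    return False
```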

We used bro to extract summaries of each YouTube HTTP transaction in real time. We chose bro because it implements many of the functions we require; we just had to write a script to handle events of interest. For each transaction we record a variety of data about the TCP connection (e.g., duration, initial RTT, start and end sequence numbers), the HTTP request (e.g., the method, URL, Host: name), and the HTTP response (e.g., status code, content length, date). To protect user privacy, we convert the YouTube visitor identifier to a unique integer; furthermore, the mapping is not recorded to disk, and the mapping is only valid for a 24 hour period (i.e., until bro is restarted). This prevents us from analyzing some aspects of user longevity, but protects user privacy.
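The anonymization step can be sketched as an in-memory mapping that assigns each raw visitor identifier a sequential integer, is never written to disk, and is discarded at the daily restart, so identifiers cannot be linked across days. This is a simplified illustration of the approach described above, not the authors' actual bro script.

```python
class DailyAnonymizer:
    """Map raw visitor identifiers to opaque integers. The mapping lives
    only in memory and is thrown away at the daily restart, so the same
    visitor cannot be tracked across days."""

    def __init__(self):
        self._mapping = {}   # raw identifier -> integer; never persisted
        self._next_id = 0

    def anonymize(self, raw_visitor_id):
        if raw_visitor_id not in self._mapping:
            self._mapping[raw_visitor_id] = self._next_id
            self._next_id += 1
        return self._mapping[raw_visitor_id]

    def reset(self):
        """Called at the daily restart; forgets all prior mappings."""
        self._mapping.clear()
        self._next_id = 0
```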

After initial experimentation with bro on our monitor, we found it necessary to add an additional field to each transaction summary. This field indicates the parsing status of each transaction, which falls into one of four categories (a small classification sketch follows the list):

1. Complete - the entire transaction was successfully parsed.

2. Interrupted - the TCP connection was reset before thetransaction was complete.

3. Gap - the monitor missed a packet, and thus bro was unable to parse the remainder of the transaction.

4. Failure - bro was unable to parse the transaction foran unknown reason.
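The sketch below shows one hypothetical way such a status field could be derived from per-connection observations; the flag names are stand-ins for whatever state the monitoring script actually tracks.

```python
from enum import Enum

class ParseStatus(Enum):
    COMPLETE = "complete"        # entire transaction parsed
    INTERRUPTED = "interrupted"  # TCP connection reset before completion
    GAP = "gap"                  # monitor missed a packet; parsing stopped
    FAILURE = "failure"          # could not be parsed for an unknown reason

def parse_status(parsed_ok, saw_reset, saw_seq_gap):
    """Hypothetical mapping from per-connection observations to the four
    categories recorded in the transaction summaries."""
    if parsed_ok:
        return ParseStatus.COMPLETE
    if saw_seq_gap:
        return ParseStatus.GAP
    if saw_reset:
        return ParseStatus.INTERRUPTED
    return ParseStatus.FAILURE
```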

³ Unfortunately, we still had to process traffic from other sites on the Limelight network, as multiple FQDNs often mapped to the same IP address.

Table 1: Breakdown of Transactions

Category      Transactions    % of Total
Completed     22,403,657      90.82
Interrupted   462,903         1.88
Gapped        383,878         1.56
Failed        1,418,178       5.75
Total         24,668,616      100.01

Table 2: Breakdown of Video Transactions

Category      Transactions    % of Total
Completed     154,294         24.66
Interrupted   151,687         24.25
Gapped        319,612         51.09
Total         625,593         100.00

Table 1 summarizes the prevalence of each of the transaction categories.⁴ As we would expect, most transactions have a Complete status. About 6% of transactions failed. For transactions in this category we have no information from HTTP headers. The two most likely reasons for failed transactions are: our monitor dropped a packet in the connection before the HTTP headers were parsed; or the TCP connection was not established in an expected manner, so our script did not know how to handle it properly⁵. Unfortunately, as we are summarizing each transaction in real time and not retaining the raw packet traces, we do not have any definitive evidence to determine the prevalence of each. However, neither of these issues is related to the type of object being transferred, so it is unlikely that a disproportionate fraction of failed transactions were for video objects. We record the number of transactions that ended up in this category to ensure we are gathering information on the majority of identified YouTube transactions. Our analyses in the remainder of this paper ignore the failed transactions.

The breakdown of transactions for video requests is shown in Table 2. For video requests, only about one quarter of the transactions were complete. The main reason for this is the large number of transactions with a gap. As YouTube traffic increased on our campus, we observed that during busy periods our monitor (when running bro) could not keep up with the network load⁶. This resulted in some transaction summaries being incomplete due to gaps in a TCP connection's sequence number space. As Table 2 indicates, most of the gaps occur in video transactions. This happens because the video transactions achieve much higher download rates than most other (smaller) transactions, thus placing a higher load on our monitor. It is important to note, however, that most of the data we use is from the HTTP headers, and these are seen in the first few packets exchanged in a transaction, when the transfer rates are lower. As a result, we are still able to apply all of our analyses to transactions in this category, except those analyses which require the Transfer Duration.

Approximately 24% of video transactions fall into the interrupted category. We can also use these transactions in

⁴ The total is 100.01% due to rounding error.
⁵ Our script expects a three packet establishment handshake for each TCP connection: SYN, SYN ACK, ACK. If a client's TCP stack behaves differently from this (e.g., piggybacks the ACK on a data packet), our script will mark the transaction as failed.
⁶ In the near future we plan to upgrade to a more powerful monitor.

[Figure: CDF (P[X <= x], 0 to 1) plotted against a log-scale x-axis spanning 0.01 to 1000; the figure caption falls outside this excerpt.]