HPL- 04/19/23 Lucy CherkasovaH1
Characterizing Locality, Evolution, and Life Span of Accesses in
Enterprise Media Server Workloads
Ludmila Cherkasova and Minaxi Gupta
Hewlett-Packard Labs
Introduction
Streaming media – a new wave of rich Internet content
Video is popular for:
• News
• Sports
• Entertainment
• Education
• Training
Enterprise media servers:
• Online advertisement
• Web marketing
• Customer interaction centers
• Collaboration
• Training
Challenges
Streaming media delivery challenges:
• Real time
• High bandwidth
• Large amounts of storage
• Sensitivity to network congestion
Understanding the nature of media server workloads is crucial for properly provisioning current and future services
Related Work
Studies of educational workloads:
• Non-streaming multimedia stored on web servers (Acharya et al., 1998)
• mMod (multicast Media on demand) with a mix of educational and entertainment content (Acharya et al., 2000)
• eTeach and BIBS (Almeida et al., 2001)
Media proxy analysis (University of Washington, Chesire et al., 2001):
• Results showed very little locality: 78% of files are accessed once
Goals of Our Study
• Characterize access patterns for enterprise media servers
• Extract some QoS-related metrics for media sites (from the logs)
• Characterize locality properties and compare them with traditional web workload characterizations
• Characterize the evolution of site content and the rate of change on the site
  Two new metrics: new files impact and life span
• Characterize dynamics of the sites and growth trends
• Design a tool (MediaMetrics) for service providers
Data Collection Sites
HP Corporate Media Solutions server (HPC), for about 2.5 years: November 1998 to April 2001 (Windows Media Server)
• Video coverage of major events
• Keynote speeches, addresses, and presentations
• Meetings with industry analysts
• Promotional events and product introductions
• Demos of product usage
HPLabs Media Server (HPLabs), for 1 year 9 months: July 1999 to April 2001 (RealServer G2), internal server
• Coffee talks, prominent presentations, seminars, meetings
• Cooltown videos
• HP-wide business events, etc.
Media Server Log Formats
Media access logs record information about all requests and responses processed by the media server
Windows Media Server and RealServer G2 have different log formats
Typical (common) fields:
• Client IP address
• Timestamp of the request
• File name of the requested video
• Advertised duration of the video (in sec)
• Size of the requested file (in bytes)
• Elapsed time of the requested media file when the play ended
• Average bandwidth available to the client during the session (in Kb/sec)
• Number of bytes sent by the server
• Number of bytes received by the client, etc.
Media Sessions
• Clients can pause, rewind, fast forward, or skip using a slide bar
• A session is a sequence of client requests corresponding to the same file access
• Windows Media Server logs contain a separate entry for each client request (a session = multiple requests)
• RealServer logs did not have this information
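The session definition above can be sketched in code. This is only an illustration: the record layout `(client, file_name, timestamp_sec)` and the idle-gap threshold `max_gap` are assumptions made here, not the rule used in the study or by MediaMetrics.

```python
from collections import defaultdict

def group_sessions(requests, max_gap=1800):
    """Group (client, file_name, timestamp_sec) requests into sessions.

    Assumption (hypothetical rule): requests by the same client for the
    same file that are at most max_gap seconds apart form one session.
    Returns a list of (client, file_name, [timestamps]) tuples.
    """
    by_key = defaultdict(list)
    for client, fname, ts in requests:
        by_key[(client, fname)].append(ts)

    sessions = []
    for (client, fname), stamps in by_key.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] <= max_gap:
                current.append(ts)      # same session continues
            else:
                sessions.append((client, fname, current))
                current = [ts]          # gap too long: start a new session
        sessions.append((client, fname, current))
    return sessions
```

With per-request logs such as those of Windows Media Server, the number of sessions is then `len(group_sessions(requests))`, while the number of requests is `len(requests)`.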
Summary Statistics

                      HPC          HPLabs
Duration              29 months    21 months
Total sessions        666,074      14,489
Total requests        1,179,814    NA
Unique files          2,999        412
Unique clients        131,161      2,482
Storage requirement   42 GB        48 GB
Bytes transferred     2,664 GB     172 GB
In HPC, 471 files corresponded to live streams: we excluded them from further analysis
Files and Session Characteristics
Distribution of stored videos and percentage of corresponding client accesses to those files
42% - short videos (less than 10 min)
23% - medium videos (10-30 min)
34% - long videos (longer than 30 min)
11% - short videos (less than 10 min)
10% - medium videos (10-30 min)
79% - long videos (longer than 30 min)
Interesting observation: the client accesses are almost uniformly distributed across the 6 analyzed classes for both workloads
This is a very useful property for synthetic workload generation.
Session Duration Characterization
77-79% of sessions were less than 10 min long
7-12% of sessions were 10-30 min long
6-13% of sessions were longer than 30 min
In spite of a significant difference in the type of content (in terms of file duration distribution), client viewing behavior was almost identical for both workloads, which reflects the browsing nature of client behavior
Client Interactivity
Percentage of sessions with interactive requests for different file size classes.
99.9% of sessions with interactive requests were high-bandwidth sessions with available bandwidth greater than 56 Kb/s
Of the sessions with interactive requests, 15.3% were short sessions, 22.6% were medium sessions, and 62.2% were long sessions.
Encoding Rates and Available Bandwidth
59% of the files are encoded at 56 Kb/s and lower.
1999: 1.7% of the files were encoded at a rate of 128-256 Kb/s
2001: 27.8% of the files were encoded at a rate of 128-256 Kb/s
For most files, the encoding rate and the average bandwidth available to the user are well aligned.
67% of the files are encoded at 256Kb/s and higher.
The gap between the demanded and the available bandwidth per session is very large.
The information provided by MediaMetrics could be used by service providers for choosing the right encoding rates.
Completed and Aborted Sessions
Completed sessions:
• 29% for HPC
• 12.6% for HPLabs
However, the available bandwidth did not differ much between completed and aborted sessions.
Most of the aborted sessions accessed the initial segments of media files.
Incomplete sessions accessing any segment other than the beginning:
• 1.5% in the short video group
• 2.4% in the medium video group
• 4-7% in the long video group
QoS Related Observations
Media access logs report:
• Number of bytes sent by the server
• Number of bytes received by the client
MediaMetrics estimates the percentage of bytes lost during the file transfer to implicitly judge the QoS observed by the client
The lost-bytes estimate produces useful results when data is transmitted over UDP (the HPC server uses UDP, the HPLabs server uses TCP)
It might be less accurate for data transmitted over TCP:
• in the presence of congestion, the media server retransmits part of the data to compensate for lost packets
• the difference between the bytes sent by the server and the bytes received by the client does not always result in worse QoS (due to buffering on the client side)
Two groups of media sessions:
• low-bandwidth sessions (available bandwidth less than 56 Kb/s)
• high-bandwidth sessions (available bandwidth greater than 56 Kb/s)
QoS Related Observations
• HPC had 61% high-bandwidth sessions
• HPLabs had 23% high-bandwidth sessions
• High-bandwidth sessions transferred 4-6 times more bytes
• HPC workload: QoS observed by low- and high-bandwidth sessions was practically the same:
  • 96.5% of low-bandwidth sessions had 0-5% byte loss per session
  • 97.1% of high-bandwidth sessions had 0-5% byte loss per session
• HPLabs workload QoS:
  • 64.6% of low-bandwidth sessions had 0-5% byte loss per session
  • 88.8% of high-bandwidth sessions had 0-5% byte loss per session
• This stresses the essential role of available bandwidth for media sessions over TCP
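The bytes-lost estimate and the two bandwidth groups can be illustrated with a small sketch. The function names and the tuple layout are hypothetical, and treating a session at exactly 56 Kb/s as low-bandwidth is an assumption, since the slides do not say which group it belongs to.

```python
def percent_bytes_lost(bytes_sent, bytes_received):
    """Estimate the percentage of bytes lost in one session."""
    if bytes_sent <= 0:
        return 0.0
    lost = max(bytes_sent - bytes_received, 0)
    return 100.0 * lost / bytes_sent

def bandwidth_group(avg_bandwidth_kbps, threshold_kbps=56):
    """Classify a session as 'high' or 'low' bandwidth.

    Assumption: exactly threshold_kbps counts as 'low'.
    """
    return "high" if avg_bandwidth_kbps > threshold_kbps else "low"

def share_with_low_loss(sessions, max_loss_pct=5.0):
    """sessions: iterable of (bytes_sent, bytes_received, avg_bandwidth_kbps).

    Returns, per bandwidth group, the percentage of sessions whose
    estimated byte loss is at most max_loss_pct.
    """
    totals = {"low": 0, "high": 0}
    good = {"low": 0, "high": 0}
    for sent, received, bandwidth in sessions:
        group = bandwidth_group(bandwidth)
        totals[group] += 1
        if percent_bytes_lost(sent, received) <= max_loss_pct:
            good[group] += 1
    return {g: (100.0 * good[g] / totals[g] if totals[g] else 0.0)
            for g in totals}
```

As noted above, such an estimate is more meaningful for UDP transfers than for TCP, where retransmissions and client-side buffering blur the relation between lost bytes and perceived quality.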
Locality Characterization
Locality invariant for web server workloads: the 10% most popular files account for 90% of all requests and 90% of all bytes transferred
HPC: 90% of media sessions target 14% of the files
HPLabs: 90% of media sessions target 30% of the files
HPC: sessions to the 14% most popular files transfer 94% of bytes
HPLabs: sessions to the 30% most popular files transfer 92% of bytes
Conclusion: the locality invariant is applicable to media workloads too!
HPC: the 14% most popular files are accessed by 96% of clients
HPLabs: the 30% most popular files are accessed by 97% of clients
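A locality figure such as "90% of media sessions target 14% of the files" can be derived from per-file session counts. The sketch below is one straightforward way to do it; the function name and the input dictionary are illustrative assumptions, not part of MediaMetrics.

```python
def core_files(session_counts, coverage=0.90):
    """Return the most popular files that together reach `coverage` of sessions.

    session_counts: dict mapping file name -> number of sessions.
    Files are added in decreasing order of popularity until the
    covered share of sessions reaches the requested coverage.
    """
    total = sum(session_counts.values())
    ranked = sorted(session_counts.items(), key=lambda kv: kv[1], reverse=True)
    core, covered = [], 0
    for fname, count in ranked:
        if covered >= coverage * total:
            break
        core.append(fname)
        covered += count
    return core
```

The percentage of files in this set is `100 * len(core_files(counts)) / len(counts)`.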
Locality from System Resource Usage Angle
Let us define the active storage set as the combined size of all the media files accessed in the logs
80% to 88% of sessions are to files that constitute only 20% of the active storage set
82% to 92% of all transferred bytes are due to the most popular files that constitute only 20% of the active storage set
These normalized metrics are useful for estimating storage requirements and potential bandwidth savings when designing or applying optimization techniques
Zipf or Not a Zipf?
Zipf-like distributions were observed for web server and web proxy workloads, and were also reported in a recent study of a media proxy workload:
the popularity of the i-th most popular file is proportional to 1/i^alpha
Distribution of the file access frequencies (file popularities) for the entire duration of the log: not Zipf!
Question: does it depend on log duration?
Web servers: typical value of alpha varies between 1.4 and 1.6
Web proxies: typical value of alpha is less than 1; it varies between 0.64 and 0.83
Media proxies: alpha = 0.47
HPLabs media server: six-month periods can be approximated with a Zipf-like distribution with alpha = 1.6
HPC media server: file popularity on a monthly basis can be approximated with a Zipf-like distribution with alpha = 1.5
For different months, alpha varies between 1.4 and 1.6.
These observations are very useful for synthetic workload generation.
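One common way to estimate alpha for a Zipf-like distribution is a least-squares line fit on the log-log plot of access frequency against popularity rank. The sketch below uses that technique as an assumption; the slides do not say how alpha was fitted in the study.

```python
import math

def fit_zipf_alpha(access_counts):
    """Estimate alpha, assuming frequency of the i-th file ~ C / i**alpha.

    access_counts: per-file access counts for one period (e.g. one month).
    Fits a straight line to log(frequency) versus log(rank) by ordinary
    least squares; alpha is the negated slope.
    """
    freqs = sorted((c for c in access_counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two files with accesses")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

Running the fit separately on each month or six-month window, rather than on the whole log, mirrors the observation that popularity is Zipf-like only over shorter periods.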
File Sharing Statistics
Both workloads exhibit a high degree of file sharing among clients!
HPC: the 70 most popular files are each accessed by more than 1,000 clients, with some of the most popular files accessed by 10,000-12,000 clients
HPLabs: the 17 most popular files are each accessed by 113-341 unique clients
Rarely Accessed Files Statistics
            Files requested up to     Storage requirements for
            1 / 5 / 10 times          corresponding files
HPC         16% / 38% / 47%           10% / 26% / 34%
HPLabs      19% / 45% / 59%           17% / 39% / 52%
• These numbers are lower than the corresponding statistics for web server workloads
• For web server workloads, “onetimers” may account for 20% to 40% of the files and 20% to 40% of the active storage
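The rarely-accessed-files statistics above pair a share of files with the share of storage those files occupy. A minimal sketch of that computation follows; the function name and the `(size_bytes, request_count)` input layout are assumptions for illustration.

```python
def rarely_accessed_share(files, max_requests):
    """Share of files requested at most max_requests times.

    files: non-empty list of (size_bytes, request_count), one per file.
    Returns (percent_of_files, percent_of_storage) for the files whose
    request count is at most max_requests.
    """
    total_files = len(files)
    total_size = sum(size for size, _ in files)
    rare = [(size, n) for size, n in files if n <= max_requests]
    pct_files = 100.0 * len(rare) / total_files
    pct_storage = 100.0 * sum(size for size, _ in rare) / total_size
    return pct_files, pct_storage
```

Calling it with `max_requests` set to 1, 5, and 10 yields the three pairs of percentages reported per workload.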
Dynamics and Evolution of Media Sites: Burstiness
On some days, the number of sessions is two orders of magnitude higher than on typical days, for both workloads
For enterprise web server workloads, the daily traffic volume is much more predictable
Studies of educational media server workloads showed a lower degree of burstiness, more correlated with the day of the week
New Files Impact (HPC)
We define a file as new if it was never accessed before (based on the information in the access logs)
Our intent: to observe the site's dynamics and evolution due to new files
The HPC site shows an explicit growth trend in the total number of files accessed per month, and a consistently steady number of new files added to the site each month.
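The new-file definition above can be applied per month by recording the first month in which each file appears in the log. The sketch below assumes sessions are given as `(month, file_name)` pairs with sortable month labels such as `"1999-03"`; this layout and the function name are hypothetical.

```python
def new_file_sessions_by_month(sessions):
    """Count, per month, all sessions and the sessions to new files.

    sessions: list of (month, file_name), month labels sortable as text.
    A file is new in the month of its first access in the log.
    Returns {month: (total_sessions, sessions_to_new_files)}.
    """
    first_seen = {}
    for month, fname in sorted(sessions):
        first_seen.setdefault(fname, month)   # earliest month wins

    stats = {}
    for month, fname in sessions:
        total, to_new = stats.get(month, (0, 0))
        is_new = first_seen[fname] == month
        stats[month] = (total + 1, to_new + (1 if is_new else 0))
    return stats
```

Note that files accessed in the very first month of a log all count as new, so that month is usually not representative.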
New Files Impact (HPLabs)
The growth of the total number of files accessed each month for HPLabs is negative!?
We asked the support team: any specific reasons?
We suspected one of two causes: is there a significant number of files that “nobody watches”? Or did the amount of new media content on that site decrease over time?
The team confirmed that only a limited number of new files had been added lately because of a transition plan to upgrade the entire site design and equipment
So, the negative trend was observed correctly.
New Files Impact (Unique Clients)
These graphs are again correlated with the trends of the sessions to new files!
Conclusion: the number of new files added per month plays a crucial role in defining the site dynamics, evolution, and growth rates!
New Trends Over Time
Analysis of the HPC workload over time revealed interesting overall trends in site media content and session characteristics
The total number of unique clients accessing media content in each 6-month period doubled over the duration of our logs.
The total number of sessions in each 6-month period also doubled over the duration of our logs.
The average file size in each 6-month period increased from less than 7 MB to more than 20 MB.
Bytes transferred per session increased from just over 1 MB to over 6 MB.
New Files Impact (conclusion)
• The access pattern of enterprise media servers resembles the access patterns of news web sites: most of the monthly client accesses (50-80%) target newly added information.
• The dynamics of enterprise web sites are much more stable: only 2% of monthly requests are to new files.
Life Span of File Accesses
Question: how much do the popularity of a file and the frequency of accesses to it change over time?
Enterprise media server workloads exhibit high locality of references: 90% of media sessions target only 14%-30% of the files
We define the core-90% as the set of most frequently accessed files that accounts for 90% of all the media sessions (it is the performance-critical set of files)
Life duration of a file: the time between the first and the last access to this file in the considered workload.
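The life duration defined above follows directly from the first and last access times of each file. A minimal sketch, assuming accesses are available as `(file_name, timestamp)` pairs with timestamps in days (the layout and function name are illustrative):

```python
def life_durations(accesses):
    """Compute the life duration of each file.

    accesses: iterable of (file_name, timestamp_days).
    Returns {file_name: last_access - first_access}; a file that was
    accessed only once has a life duration of 0.
    """
    first, last = {}, {}
    for fname, t in accesses:
        first[fname] = min(t, first.get(fname, t))
        last[fname] = max(t, last.get(fname, t))
    return {fname: last[fname] - first[fname] for fname in first}
```

Restricting the input to files in the core-90% set gives the life durations of the performance-critical files only.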
Life Duration of the Files
High percentage of short-lived files:
• HPC: 37% of all files live less than a month
• HPLabs: 50% of all files live less than a month
• 73% of the files live less than 6 months for both workloads
• Only 8-10% of the files live longer than a year
Question: what is the density of accesses over time?
The histograms plotted for the most frequently accessed files had a lognormal-like shape, with most accesses occurring during the first 1-3 weeks after the file's introduction.
Life Span Metric
Life span metric: cumulative distribution of accesses to the files since their introduction at a site.

                     HPC    HPLabs
First week           52%    51%
Second week          16%    10%
Third week            6%     5%
4th and 5th weeks     3%     1%

Enterprise media servers exhibit access patterns similar to news web sites:
• most accesses are to new documents, and
• after a certain time period these documents are accessed very rarely
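Per-week shares like those in the table above can be computed by aligning every access to the introduction time of its file. The sketch below treats the first access in the log as the introduction time, which is an assumption consistent with the new-file definition used earlier; the function name and input layout are hypothetical.

```python
from collections import Counter

def life_span_by_week(accesses, weeks=5):
    """Percentage of all accesses falling in each week of a file's life.

    accesses: non-empty list of (file_name, day), day as an integer.
    Week 0 starts at the file's first access in the log (assumed to be
    its introduction). Returns a list with one percentage per week.
    """
    first = {}
    for fname, day in accesses:
        first[fname] = min(day, first.get(fname, day))

    per_week = Counter()
    for fname, day in accesses:
        per_week[(day - first[fname]) // 7] += 1   # week index since introduction

    total = len(accesses)
    return [100.0 * per_week[w] / total for w in range(weeks)]
```

A running sum over the returned list gives the cumulative form of the metric.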
Rate of Change
Life span is a normalized metric: the files could have been individually introduced at different times.
The metric reflects the rate of change of the files during their existence at the site.
The life span metric reflects the timeliness of the introduced files: a longer life span means that information at the site is less timely and has a more consistent percentage of accesses over time.
The life span metric allows one to estimate the intensity of client accesses to new and existing files over a future period of time.
Conclusion
Media server access logs are an invaluable source of information about traffic access patterns and system resource requirements
MediaMetrics was specially designed for service providers and system administrators to understand the nature of traffic to their media sites
Our analysis established a set of invariants specific to enterprise media server workloads and compared them with well-known related invariants and observations for web server workloads
Acknowledgments
Neither the tool nor the study would have been possible without the media access logs and help provided by Nic Lyons, Wray Smallwood, Brett Bausk, Magnus Karlsson, Wenting Tang, Yun Fu, John Apostolopoulos, and Susie Wee.
Their help is highly appreciated.