EtE MonitorH1

EtE: Passive End-to-End Internet Service Performance Monitoring

Yun Fu, Lucy Cherkasova, Wenting Tang, and Amin Vahdat

HPLabs and Duke University


HP.com???

A lot of research is done to optimize web server performance in order to improve client experience

BUT
Do we know what the client experience is?
What are the critical latency components in the end-to-end response time?
Do we know whether the improvements on the web server side indeed improve end-user experience?
Do we know who the clients are and where they are located on the Internet?

Service provider problems...


End-to-End Web Service Measurement: Why Is It Important?

Two main factors impact the response time perceived by the clients: network latency and server side processing time

Many web sites use complex multi-tiered architecture
A set of new technologies, such as servlets and JavaServer Pages, extend the web servers to generate information-rich dynamic web pages and to leverage existing business systems
Combination of these technologies could lead to increased server-side processing time, especially in a distributed environment

New ad-hoc business metric: web service is considered to be “unavailable” if its response time exceeds 6 seconds

The service providers need a quantitative analysis of the major latency components contributing to the response time to achieve given business and QoS objectives:
• Invest in more powerful site infrastructure, or
• Choose a CDN service?


Why Is It Difficult?

Web pages are complex objects with multiple embedded images

HTTP protocol is stateless: different images are requested by client browser independently:
• Some of them are issued concurrently
• Some of them use persistent connections
• Some of them are obtained from proxies
• Some of them are obtained from user browser caches

The response time of a web page observed by the client is the result of downloading all page-related images


What Are Currently Available Solutions?

Active periodic probing of a particular web page from a fixed number of clients across the Internet
Keynote service
– Keynote “clients” are not the real web site clients
– Allows monitoring of a particular web page
– Always pulls the entire page (with all embedded images) from the server

Page instrumentation technique based on downloadable JavaScript or Java Applet to a client web browser
HP Open View “Web Transaction Observer”
– The measurement starts after download of the main html page (significant portion of the response time is missing)
– Does not provide latency breakdown unless the web server is also instrumented
eBusiness Assurance (eBA, from Candle Corp)
Quality of Service (QoS) Monitor (IBM, Tivoli)
Research paper by Rajamony and Elnozahy from IBM (Austin) uses JavaScript to instrument the links to particular pages. Somewhat more limited: cannot measure directly accessed pages, e.g., “index.html”…


What Do We Propose?

EtE monitor
Passive monitoring tool for end-to-end response time measurement
Non-intrusive, does not require any changes or modifications to a site content, or server side infrastructure, or client browsers
Can be used for sites with static or dynamically generated content

What does it provide?
End-to-end response measurement for all the pages and all the clients accessing the site
Analysis of response components:
• Server processing time portion
• Network transfer time portion
Reports the % of data delivered from the server vs the % of data cached on the client side
Reports the % of aborted page accesses and the related performance reasons
Analysis of the most frequently accessed documents and their response time
Client clustering by ASes (Autonomous Systems)
• Requests (bytes) clustering by ASes and the corresponding response time
And more …


EtE Monitor Architecture

1. The Network Packet Collector module collects network packets using tcpdump and records them in the Network Trace, enabling offline analysis.

2. In the Request-Response Reconstruction module, EtE monitor reconstructs all TCP connections from the Network Trace and extracts HTTP transactions (a request with the corresponding response) from the payload. EtE monitor stores the HTTP header lines and other related information in the Transaction Log.

3. The Web Page Reconstruction module groups the request-response pairs into logical web page accesses and stores them in the Web Page Session Log.

4. The Performance Analysis and Statistics module summarizes a variety of performance characteristics integrated across all client accesses.


Request-Response Reconstruction Module

The TCP connections are rebuilt from Network Trace using:
• The client IP address
• The client port number
• The request (response) TCP sequence number

Within the payload of the rebuilt TCP connections, HTTP transactions are delimited as defined by HTTP protocol

After reconstructing the HTTP transactions, the monitor records the HTTP header lines and other information of interest in the Transaction Log and discards the transaction body
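The reconstruction step can be illustrated with a minimal Python sketch. It is not the monitor's actual code: the packet tuple layout, the `rebuild_streams` name, and the sample packets are assumptions made for illustration, and sequence-number wraparound is ignored.

```python
from collections import defaultdict

def rebuild_streams(packets):
    """Rebuild one byte stream per (client IP, client port, direction).

    packets: iterable of (client_ip, client_port, direction, seq, payload)
    tuples; this layout is a hypothetical stand-in for parsed trace records.
    """
    segments = defaultdict(dict)
    for client_ip, client_port, direction, seq, payload in packets:
        # setdefault keeps the first copy of a sequence number, so a
        # retransmitted segment does not appear twice in the stream.
        segments[(client_ip, client_port, direction)].setdefault(seq, payload)
    # Concatenate payloads in sequence-number order (wraparound is ignored).
    return {
        key: b"".join(by_seq[seq] for seq in sorted(by_seq))
        for key, by_seq in segments.items()
    }

# Hypothetical capture: segments arrive out of order and one is resent.
packets = [
    ("10.0.0.1", 4000, "request", 27, b"Host: example.com\r\n\r\n"),
    ("10.0.0.1", 4000, "request", 1, b"GET /index.html HTTP/1.1\r\n"),
    ("10.0.0.1", 4000, "request", 27, b"Host: example.com\r\n\r\n"),
]
streams = rebuild_streams(packets)
request = streams[("10.0.0.1", 4000, "request")]

# An HTTP header block ends at the first blank line (CRLF CRLF).
header_block = request.split(b"\r\n\r\n", 1)[0]
request_line = header_block.split(b"\r\n")[0].decode("ascii")
```

Only the header lines would be kept for the Transaction Log; a real implementation would also need the Content-Length or chunked-encoding rules of HTTP to find where each message body ends.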


Request-Response Reconstruction Module (continuation)

Each entry in the Transaction Log includes:
• The client IP address
• A unique flow ID for TCP connection
• The requested URL
• The content type
• The payload size
• The referer field
• The via field
• Whether the request was aborted
• The number of packets resent in the response
• The corresponding timestamps
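One way to picture such an entry is as a small record type. The sketch below is only an illustration: the class name, field names, and types are assumptions, and the slides do not say which timestamps are stored, so they are kept as a free-form mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TransactionLogEntry:
    """Hypothetical layout of one Transaction Log entry."""
    client_ip: str
    flow_id: int                 # unique ID of the TCP connection
    url: str
    content_type: str
    payload_size: int            # bytes
    referer: Optional[str]       # None when the header is absent
    via: Optional[str]           # None when the header is absent
    aborted: bool
    resent_packets: int          # packets resent in the response
    timestamps: Dict[str, float] = field(default_factory=dict)

entry = TransactionLogEntry(
    client_ip="10.0.0.1",
    flow_id=1,
    url="/logo.gif",
    content_type="image/gif",
    payload_size=2048,
    referer="/index.html",
    via=None,
    aborted=False,
    resent_packets=0,
)
```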


Page Reconstruction Module

To measure the client perceived end-to-end response time for retrieving a web page, we need to group the objects in a web page access

We use a two-pass heuristic method and a statistical filtering mechanism to reconstruct different client page accesses
First pass: EtE monitor uses the HTTP requests with referer field to build a Knowledge Base of web pages and their embedded objects
Second pass:
• EtE monitor reconstructs the page accesses without referer field using the Knowledge Base of web pages and some additional heuristics
• EtE monitor uses statistical analysis to identify valid access patterns and filter the accesses grouped incorrectly


Example

Example of an initial HTML file request and the following embedded object request with the corresponding referer field:


First Pass: Client Access Table

EtE monitor stores web page access information into a hash table using client IP addresses:
• If the content type is text/html, a new web page entry is created in the Web Page Table
• For other types, the request URL is inserted according to its referer field
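The first pass described above might look roughly like the following sketch. The tuple layout, the `first_pass` name, and the use of bare paths as referers (real Referer headers carry full URLs) are simplifying assumptions, not details taken from the tool.

```python
from collections import defaultdict

def first_pass(transactions):
    """Build a Client Access Table: client IP -> Web Page Table.

    transactions: iterable of (client_ip, url, content_type, referer),
    in log order. Each Web Page Table is a list of page accesses.
    """
    client_access_table = defaultdict(list)
    for client_ip, url, content_type, referer in transactions:
        web_page_table = client_access_table[client_ip]
        if content_type.startswith("text/html"):
            # An HTML document opens a new web page entry.
            web_page_table.append({"url": url, "objects": []})
        elif referer is not None:
            # Attach the object to this client's most recent access
            # of the page named in the referer field.
            for page in reversed(web_page_table):
                if page["url"] == referer:
                    page["objects"].append(url)
                    break
        # Objects without a referer are left for the second pass.
    return client_access_table

log = [
    ("10.0.0.1", "/index.html", "text/html", None),
    ("10.0.0.1", "/logo.gif", "image/gif", "/index.html"),
    ("10.0.0.1", "/style.css", "text/css", None),
    ("10.0.0.2", "/index.html", "text/html", None),
    ("10.0.0.2", "/banner.gif", "image/gif", "/index.html"),
]
table = first_pass(log)
```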


Building a Knowledge Base of Web Pages

From the Client Access Table, EtE monitor determines the content template of any given web page as a combined set of all objects that appear in all access patterns for this page
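As a rough sketch, the content template is a set union over every access to the same page. The table layout and the `build_knowledge_base` name below are assumptions chosen for illustration.

```python
def build_knowledge_base(client_access_table):
    """Map each page URL to the union of objects seen in any access to it."""
    knowledge_base = {}
    for web_page_table in client_access_table.values():
        for access in web_page_table:
            template = knowledge_base.setdefault(access["url"], set())
            template.update(access["objects"])
    return knowledge_base

# Hypothetical Client Access Table: two clients fetched different subsets
# of the same page (for example, because of browser caching).
client_access_table = {
    "10.0.0.1": [{"url": "/index.html", "objects": ["/logo.gif"]}],
    "10.0.0.2": [{"url": "/index.html", "objects": ["/logo.gif", "/banner.gif"]}],
}
knowledge_base = build_knowledge_base(client_access_table)
```

Here the template for `/index.html` contains both images even though the first client retrieved only one of them.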


Second Pass: Reconstruction of Web Page Accesses

With the help of the Knowledge Base, EtE monitor processes the entire Transaction Log again, and creates a new Client Access Table

This time it processes the objects without referer field:
EtE monitor consults the Knowledge Base while checking all the page entries in the Web Page Table to find the page an object might be embedded in, and appends it at the end of that page
If none of the web page entries in the Web Page Table contains the object based on the Knowledge Base then
• EtE monitor searches for the page accessed with the same flow ID
• Otherwise it appends the object to the latest accessed page (additionally it uses a configurable think time threshold to delimit web pages)
• If the think time threshold is exceeded, the object is dropped
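The fallback order for an object without a referer field can be sketched as below. This is one reading of the slide, not the tool's code: the record layouts, the `assign_object` name, the choice to search the most recent pages first, and the 4-second default threshold are all assumptions.

```python
def assign_object(pages, obj, knowledge_base, think_time=4.0):
    """Attach obj to one of the client's page accesses, or drop it.

    pages: the client's Web Page Table, oldest access first; each page has
    "url", "time", "flow_ids", and "objects". Returns the page, or None.
    """
    # 1. Knowledge Base: a page whose content template contains the object.
    for page in reversed(pages):
        if obj["url"] in knowledge_base.get(page["url"], set()):
            page["objects"].append(obj["url"])
            return page
    # 2. A page accessed over the same TCP connection (flow ID).
    for page in reversed(pages):
        if obj["flow_id"] in page["flow_ids"]:
            page["objects"].append(obj["url"])
            return page
    # 3. The latest accessed page, unless the think time is exceeded.
    if pages and obj["time"] - pages[-1]["time"] <= think_time:
        pages[-1]["objects"].append(obj["url"])
        return pages[-1]
    return None  # dropped

knowledge_base = {"/index.html": {"/logo.gif"}}
pages = [
    {"url": "/index.html", "time": 0.0, "flow_ids": {1}, "objects": []},
    {"url": "/news.html", "time": 10.0, "flow_ids": {2}, "objects": []},
]
by_template = assign_object(pages, {"url": "/logo.gif", "flow_id": 9, "time": 11.0}, knowledge_base)
by_flow = assign_object(pages, {"url": "/x.gif", "flow_id": 2, "time": 11.0}, knowledge_base)
by_latest = assign_object(pages, {"url": "/y.gif", "flow_id": 9, "time": 12.0}, knowledge_base)
dropped = assign_object(pages, {"url": "/z.gif", "flow_id": 9, "time": 100.0}, knowledge_base)
```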


Identifying Valid Accesses Using Statistical Analysis of Access Patterns

Although the above two-pass process is very efficient, there could still be some accesses grouped incorrectly

We use a statistical analysis to better approximate the actual content of web pages and filter out the incorrectly constructed accesses


Metrics to Measure Web Service Performance

Response time metrics
End-to-end response time observed by the client for a web page download
Latency breakdown: server related and network related portions
Connection set-up time

Metrics evaluating web service caching efficiency
Server file hit ratio
Server byte hit ratio

Aborted pages and QoS
Why the accesses are aborted:
• Bad performance?
• Client browsing patterns?


Example: 1-object page retrieval (basic timestamps)


Latency Breakdown for Multiple Concurrent Connections: Server Processing vs Network


Metrics Evaluating Web Service Caching Efficiency

Original web page url1 (page template):
• 10 objects
• 100 Kbytes

Access to url1: Acc1
• 5 objects
• 70 Kbytes

Access to url1: Acc2
• 7 objects
• 80 Kbytes

FileHitRatio(Acc1) = 5/10 = 50%
ByteHitRatio(Acc1) = 70/100 = 70%

FileHitRatio(Acc2) = 7/10 = 70%
ByteHitRatio(Acc2) = 80/100 = 80%

ServerFileHitRatio(url1) = (5/10 + 7/10) / 2 = 60%
ServerByteHitRatio(url1) = (70/100 + 80/100) / 2 = 75%

The smaller, the better: a lower ratio means that fewer objects and bytes had to be retrieved from the server.
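The arithmetic above can be reproduced with a few lines of Python. The function and variable names are hypothetical; the numbers are the ones from the slide.

```python
def hit_ratios(access, template):
    """Return (file hit ratio, byte hit ratio) of one access to a page."""
    return (
        access["objects"] / template["objects"],
        access["kbytes"] / template["kbytes"],
    )

template = {"objects": 10, "kbytes": 100}      # page template of url1
accesses = [
    {"objects": 5, "kbytes": 70},              # Acc1
    {"objects": 7, "kbytes": 80},              # Acc2
]

ratios = [hit_ratios(access, template) for access in accesses]

# Server-level ratios are the averages over all accesses to the page.
server_file_hit_ratio = sum(f for f, _ in ratios) / len(ratios)
server_byte_hit_ratio = sum(b for _, b in ratios) / len(ratios)
```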


Case Studies

HPL external site (HPL)
From July 12, 2001 to August 11, 2001
The site has mostly static content

Open View Support site (Support)
From October 11, 2001 to October 25, 2001
The site uses JavaServer Pages technology for dynamic page generation


Sites Statistics At-A-Glance


HPLabs Site Case Study

• Figure shows the EtE time to index.html on hourly scale during a month
• In spite of overall good performance, hourly averages reflect significant variation in response time observed by the clients

• Periods of increased latency correspond to weekends! What is the problem?

HPL site during a month (accesses to index.html page)


Understanding the Client Population

• Resent packets typically reflect network congestion or network-related bottlenecks
• Periods of increased resent packets correspond to weekends
• The explanation: the client population significantly “changes” during weekends
• Most of the clients access the web site from home via low-bandwidth connections

It is extremely important to understand the client population! Active probing approach using artificial clients (with typically “good” connection to the Internet) lacks this information


Performance Analysis of Accesses to itanium.html

First Figure:
• Number of accesses to itanium.html page
• From being the most popular page at the beginning of the study, it drops to the 7th place after 10 days

Second Figure:
• Percentage of accesses above 6 sec to itanium.html page
• Question: why is the latency observed by the clients getting higher?
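The percentage of accesses above 6 seconds is a simple threshold count over the measured end-to-end times. A minimal sketch, with a hypothetical function name and made-up sample times:

```python
def percent_above(response_times, threshold=6.0):
    """Percentage of accesses whose response time exceeds the threshold."""
    if not response_times:
        return 0.0
    slow = sum(1 for t in response_times if t > threshold)
    return 100.0 * slow / len(response_times)

# Made-up end-to-end response times (seconds) for one page on one day.
sample = [1.2, 7.5, 3.0, 9.1]
share = percent_above(sample)
```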


Caching Efficiency of the Page

When the page is getting less popular, “colder”, the number of objects and bytes retrieved from the original server increases significantly: i.e. fewer network caches store the page related objects

It translates into increased response time observed by the client

Active probing technique cannot reflect the caching efficiency of the site

The tools based on instrumentation technique cannot provide insight into this problem either


Clients Clustering by ASes

• Clients grouped by ASes show a heavy tail distribution
• These figures allow us to see large client clusters and their corresponding end-to-end response time
• The ability of EtE monitor to measure performance metrics for a certain group of clients is particularly attractive for Service Providers to validate required SLAs
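Clustering by AS amounts to mapping each client IP address to an AS number and aggregating per AS. The sketch below assumes a tiny hand-written prefix table (documentation address ranges and private-use AS numbers); the slides do not say how the monitor obtains this mapping, and a real table would need longest-prefix matching over routing data.

```python
import ipaddress
from collections import defaultdict

# Hypothetical prefix-to-AS table; the prefixes do not overlap.
PREFIX_TO_AS = {
    ipaddress.ip_network("192.0.2.0/24"): 64512,
    ipaddress.ip_network("198.51.100.0/24"): 64513,
}

def cluster_by_as(accesses):
    """accesses: iterable of (client_ip, response_time_seconds).

    Returns {AS number: (number of accesses, mean response time)};
    clients with no matching prefix are grouped under None.
    """
    times = defaultdict(list)
    for client_ip, response_time in accesses:
        addr = ipaddress.ip_address(client_ip)
        asn = next((a for net, a in PREFIX_TO_AS.items() if addr in net), None)
        times[asn].append(response_time)
    return {asn: (len(v), sum(v) / len(v)) for asn, v in times.items()}

clusters = cluster_by_as([
    ("192.0.2.10", 1.0),
    ("192.0.2.20", 3.0),
    ("198.51.100.5", 2.0),
])
```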


Validation Experiments

We performed two groups of experiments
To validate the accuracy of EtE measurements
To evaluate the page access reconstruction power of EtE
• How dependent are the reconstruction results on the existence of referer field information?

The results are encouraging:
EtE provides a very close approximation of the response time
EtE monitor does a good job of page reconstruction even when the requests do not have any referer field!
However, two-pass heuristic method and statistical filtering mechanism we use to reconstruct page accesses increase the number of reconstructed pages by about 20-30%


Limitations

EtE monitor is not appropriate for sites that encrypt much of their data (e.g., via SSL)

EtE monitor is not appropriate for sites that “outsource” most of their content to CDNs

Similar limitation applies to pages with “mixed” content: if a portion of the page is served from some other remote sites. In this case, EtE will measure only response time for local site content

For clients coming from behind a proxy, EtE monitor measures the response time as observed from the proxy

Since the tool is based on heuristics and statistics to reconstruct the page content, the best results are obtained when the sample size is large enough

Dynamically generated content creates additional challenges for EtE monitor (typical for other analysis tools too): a configuration file provided by a site administrator is needed


Conclusion and Future Work

Understanding performance characteristics of Internet services is critical to evolving and engineering the web services to match:
• Changing demand levels
• Client populations
• Global network characteristics

EtE monitor, based on a novel technique, offers a number of benefits unavailable from other tools and by other means.

EtE monitor can be extended to work in “almost real-time” to provide timely information about web services and their performance.

Extended analysis on client clustering will provide an opportunity to use the information from EtE monitor for intelligent decision making on service placement and service optimization


Acknowledgements

The tool and the study would not have been possible without the generous help of our HP colleagues:
HPLabs team:
• Mike Rodriquez, Annabelle Eseo, and Peter Haddad
HPO, Managed Web Services:
• Guy Mathews
OpenView team:
• Steve Yonkaitis, Bob Husted, Norm Follett, and Don Reab
US support team:
• Claude Villermain, Vincent Rabiller, Pierre-Emmanuel Delforge

Their help is highly appreciated!