
Gigamon U - Web Performance Monitoring


DESCRIPTION

Alistair Croll, Interop conference faculty and Coradiant’s VP of product management, gives an unbiased, top-down view of Web performance monitoring. This informative look at Web measurement business goals, operating processes, tools, and metrics will give you a solid understanding of the issues, without a product pitch. Coradiant is the leader in Web performance monitoring. Its award-winning TrueSight real-user monitor allows organizations to watch what matters to their business by delivering accurate, detailed information on the performance and integrity of Web applications in real time. Incident management, service-level management, and change-impact management are three key capabilities. TrueSight watches any web or enterprise web application and lets site operators identify problems sooner, isolate root cause faster, and effect fixes more quickly than anything else on the market.



1

Best Practices in Web Performance Monitoring

Alistair A. Croll, VP Product Management and Co-Founder

2

So you want to monitor things.


3

But there are too many toys out there…

4

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals


5

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals (Start here!)

Simplify & interpret

6

What goals? (in plain English)


7

Goals

• Make the application available
  – I can use it
• Ensure user satisfaction
  – It’s fast & meets or exceeds my expectations
• Balance capacity with demand
  – It handles the peak loads
  – It doesn’t cost too much
• Minimize MTTR
  – When it breaks, I can fix it efficiently
• Align operations tasks with business priorities
  – I work on what matters first

8

They can use it


9

Make the application available

• The most basic goal
• App should be reachable, responsive, and functionally correct
• 3 completely different issues
  – Can I communicate with the service?
  – Can I get end-to-end responses in a timely manner?
  – Is the application behaving properly?

10

They’re happy & productive


11

Ensure user satisfaction

• How fast is fast enough?
• Depends on the task
  – Login versus reports
• Depends on user expectations
  – ATMs versus banking systems
• Depends on the user’s state of mind
  – Deeply engaged versus browsing

12

Balance capacity with demand

• Performance degrades with demand

[Chart: end-to-end delay versus load (requests per second), marking the maximum acceptable delay and the resulting maximum capacity]
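
To make the capacity curve actionable, here is a minimal Python sketch that estimates maximum capacity from load-test samples as the highest load whose end-to-end delay still stays under the maximum acceptable delay. The sample data and the 2-second limit are illustrative assumptions, not figures from the deck.

```python
# Estimate maximum capacity from load-test samples: the highest observed load
# (requests/sec) whose end-to-end delay stays within the acceptable limit.
# The sample data and the 2.0 s limit are illustrative assumptions.

MAX_ACCEPTABLE_DELAY = 2.0  # seconds

# (load in requests/sec, measured end-to-end delay in seconds)
samples = [(50, 0.4), (100, 0.6), (200, 0.9), (400, 1.6), (500, 2.4), (600, 4.1)]

within_limit = [load for load, delay in samples if delay <= MAX_ACCEPTABLE_DELAY]
max_capacity = max(within_limit) if within_limit else 0

print(f"Maximum capacity: about {max_capacity} req/s "
      f"(delay exceeds {MAX_ACCEPTABLE_DELAY}s beyond this point)")
```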


13

I can fix it fast

14

Minimize MTTR

• Fix it efficiently
• Know the costs of downtime
• Application- and business-dependent
  – Direct (operational) costs
  – Penalties
  – Opportunity costs
  – Abandonment costs


15

Minimize MTTR

• Don’t just think about lost revenue

16

Minimize MTTR

• And consider the whole resolution cycle

Event occurs → IT aware → Reproduced → Diagnosed → Resolved → Deployed → Verified
(Time to recover spans the whole cycle)
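
One way to make the resolution cycle measurable is to timestamp each stage and report the per-stage durations plus the overall time to recover. A small sketch, with stage names taken from the slide and timestamps invented for illustration:

```python
from datetime import datetime

# Timestamps for each stage of one incident (illustrative values only).
stages = [
    ("Event occurs", datetime(2024, 5, 1, 9, 0)),
    ("IT aware",     datetime(2024, 5, 1, 9, 20)),
    ("Reproduced",   datetime(2024, 5, 1, 10, 5)),
    ("Diagnosed",    datetime(2024, 5, 1, 11, 30)),
    ("Resolved",     datetime(2024, 5, 1, 12, 15)),
    ("Deployed",     datetime(2024, 5, 1, 12, 45)),
    ("Verified",     datetime(2024, 5, 1, 13, 0)),
]

# Per-stage durations show where the resolution cycle loses the most time.
for (prev_name, prev_t), (name, t) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: {(t - prev_t).total_seconds() / 60:.0f} min")

time_to_recover = stages[-1][1] - stages[0][1]
print(f"Time to recover: {time_to_recover.total_seconds() / 60:.0f} min")
```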


17

I worry about what matters

18

Align operations tasks with business priorities

• Know what the business goals are
• Fix problems, not incidents
• Know the real impact of an issue


19

Align operations tasks with business priorities

• Tackle problems, not incidents

Incident: Bob from Houston had a 500 error
  (“So did everyone else in Houston!”)
Problem: Houston can’t use the order app
SLM violation: 10% of requests are getting 500 errors
  (“And they’re all coming from Houston!”)

20

Align operations tasks with business priorities

• Know the real impact of issues

[Chart: requests over time, split into good and errored requests; affected users; total impact shown as the change from “normal”]
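
A minimal sketch of the impact calculation the slide implies: compare the error rate and the set of affected users in an incident window against a “normal” baseline. The field names, sample records, and baseline value are assumptions.

```python
# Each record: (user_id, succeeded). During an incident window, compare the
# error rate and affected-user count against the "normal" baseline.
baseline_error_rate = 0.005   # assumed: 0.5% of requests normally fail

requests = [
    ("alice", True), ("bob", False), ("carol", False),
    ("alice", True), ("dave", False), ("erin", True),
]

errored = [user for user, ok in requests if not ok]
error_rate = len(errored) / len(requests)
affected_users = set(errored)

print(f"Error rate: {error_rate:.1%} (baseline {baseline_error_rate:.1%})")
print(f"Affected users: {len(affected_users)} -> {sorted(affected_users)}")
print(f"Change from normal: {error_rate - baseline_error_rate:+.1%}")
```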


21

So I have these goals…

• Make the application available
• Ensure user satisfaction
• Balance capacity with demand
• Minimize MTTR
• Align operations tasks with business priorities

• How do I make sure I meet them repeatably and predictably?

22

Okay, got the goals


23

But how do I make this real?

24

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals

Goals drive processes


25

Processes

• Reporting & overcommunication
• Capacity planning
• SLA definition
• Problem detection
• Problem localization & resolution

26

Keep people informed


27

Reporting & overcommunication: Know the audience

Network operations: network latency, throughput, retransmissions, service outages
Marketing: abandonment, conversion, demographics
Server operations: host latency, server errors, session concurrency
Security: anomalies, fraudulent activity
Finance: capacity planning, time out of SLA, IT repair costs

Different stakeholders, the same data sources
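
To make “different stakeholders, the same data sources” concrete, a small sketch that derives a server-operations view and a marketing view from one shared set of request records; the field names and values are invented for illustration.

```python
# One set of request records, two stakeholder views (fields are assumptions).
requests = [
    {"session": "s1", "host_latency": 0.12, "status": 200, "converted": True},
    {"session": "s2", "host_latency": 0.95, "status": 500, "converted": False},
    {"session": "s3", "host_latency": 0.30, "status": 200, "converted": False},
]

# Server operations care about host latency and server errors.
errors = sum(1 for r in requests if r["status"] >= 500)
mean_host_latency = sum(r["host_latency"] for r in requests) / len(requests)
print(f"Server ops: {errors} server errors, mean host latency {mean_host_latency:.2f}s")

# Marketing cares about conversion, computed from the very same records.
converted = sum(1 for r in requests if r["converted"])
print(f"Marketing: conversion rate {converted / len(requests):.0%}")
```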

28

I have enough juice


29

Capacity planning

• Define peak load
• Define acceptable performance & availability
• Select margin of error
  – Cost of being wrong
  – Variance and confidence in the data
• Build capacity & monitor
  – Performance versus load
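
A tiny sketch of the margin-of-error step: size capacity from the defined peak load plus headroom, then track how close current demand is to that capacity. All numbers are illustrative assumptions.

```python
# Capacity-planning sketch: provision for peak load plus a margin of error,
# then monitor remaining headroom. Numbers are illustrative assumptions.
peak_load = 800          # requests/sec observed or forecast at peak
margin_of_error = 0.25   # 25% headroom; depends on the cost of being wrong
                         # and on variance/confidence in the data

required_capacity = peak_load * (1 + margin_of_error)
print(f"Provision for at least {required_capacity:.0f} req/s")

current_load = 650
headroom = 1 - current_load / required_capacity
print(f"Current headroom: {headroom:.0%}")
```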

30

Capacity planning


31

We all agree on what’s “good enough”

32

SLA definition

• Select a metric
• Select an SLA target
  – That you control
  – That can be reliably measured
• Define how many transactions can exceed this target before being in violation
• Monitor
  – Metric, percentile


33

SLA definition

• 95% of all searches by zipcode by all HR personnel will take under 2 seconds for the network to deliver

95% – Percentiles, not averages
All searches by zipcode – Application function, not port
All HR personnel – User-centric, actual requests
Under 2 seconds – Performance metric
For the network to deliver – A specific element of delay

34

I know where problems are…


35

Problem detection

• Detect incidents as soon as they affect even one user
• Is the incident part of a bigger problem?
• Prioritize problems by business impact
  – Number of users affected
  – Dollar value lost
  – Severity of the issue
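
One possible way to rank problems by business impact, using the three factors above with invented weights and sample problems; real weights depend on the business:

```python
# Prioritization sketch: rank open problems by business impact.
# The weighting and the sample problems are assumptions, not from the deck.
problems = [
    {"name": "Checkout 500s",  "users_affected": 1200, "dollars_lost": 50000, "severity": 3},
    {"name": "Slow HR search", "users_affected": 40,   "dollars_lost": 0,     "severity": 1},
    {"name": "Login timeouts", "users_affected": 300,  "dollars_lost": 8000,  "severity": 2},
]

def impact(p):
    # Simple weighted score; real weights depend on the business.
    return p["users_affected"] * 1.0 + p["dollars_lost"] * 0.01 + p["severity"] * 100

for p in sorted(problems, key=impact, reverse=True):
    print(f"{p['name']}: impact score {impact(p):.0f}")
```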

36

…and I can figure out what’s behind them


37

Problem localization & resolution

• Reproduction of the error
  – Capture a sample incident
• Deductive reasoning
  – Check tests to see what else is failing
  – Do incidents share a common element?
  – Do incidents happen at a certain load?
  – Do incidents recur around a certain time?
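
A small sketch of the deductive step: tally which attribute the failing requests have in common (the “everyone in Houston” pattern). The incident fields and values are assumptions.

```python
from collections import Counter

# Deductive-reasoning sketch: look for an element the failing requests share.
# Incident fields (page, server, region, hour) are illustrative assumptions.
incidents = [
    {"page": "/order", "server": "web3", "region": "Houston", "hour": 14},
    {"page": "/order", "server": "web1", "region": "Houston", "hour": 14},
    {"page": "/login", "server": "web3", "region": "Houston", "hour": 15},
]

for field in ("page", "server", "region", "hour"):
    value, count = Counter(i[field] for i in incidents).most_common(1)[0]
    print(f"{field}: {count}/{len(incidents)} incidents share '{value}'")
```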

38

Problem localization & resolution


39

Problem localization & resolution

• What do they have in common?

40

Problem localization & resolution


41

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals

Select tools that make processes work best

42

Tools: The three-legged stool

Synthetic

Real User

Device


43

Device monitoring: Watching the infrastructure

• Less relation to application availability
• Vital for troubleshooting and localization
• Will show “hard down” errors
  – But good sites are redundant anyway
• Correlation between a metric (CPU, RAM) and performance degradation shows where to add capacity
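
As a rough illustration of the correlation point, a sketch that correlates a device metric (CPU) with end-to-end delay; a strong positive correlation suggests that is where to add capacity. Sample values are invented, and statistics.correlation requires Python 3.10+.

```python
# Sketch: correlate a device metric (CPU) with page delay to see whether that
# resource is the bottleneck worth expanding. Sample values are illustrative.
from statistics import correlation   # Python 3.10+

cpu_pct = [35, 42, 55, 63, 71, 80, 88, 93]
delay_s = [0.6, 0.7, 0.8, 1.1, 1.5, 2.2, 3.0, 4.1]

r = correlation(cpu_pct, delay_s)
print(f"CPU vs. delay correlation: {r:.2f}"
      + (" -> adding CPU capacity is likely to help" if r > 0.7 else ""))
```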

44

Synthetic testing: Checking it yourself

• Local or outside
• Same test each time
• Excellent for network baselining when you can’t control the end-user’s connection
• Use to check if a region or function is down for everyone
• Limited usefulness for problem re-creation


45

Synthetic testing: Checking it yourself

46

Real User Monitoring: 2 main uses

• Tactical
  – Detect an incident as soon as 1 user gets it
  – Capture session forensics
• Long-term
  – Actual user service delivery
  – Performance/load relations
  – Capacity planning
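
A minimal sketch of the tactical use: alert the first time any real user hits an error on a page, and keep that user’s session for forensics. The function and field names are assumptions, not a real RUM API.

```python
# Tactical RUM sketch: raise an alert the first time any real user hits an
# error, and keep that user's session for forensics. Fields are assumptions.
alerted_pages = set()
sessions = {}   # session_id -> list of (page, status), kept for forensics

def on_request(session_id, page, status):
    sessions.setdefault(session_id, []).append((page, status))
    if status >= 500 and page not in alerted_pages:
        alerted_pages.add(page)
        print(f"ALERT: first user error on {page} (session {session_id})")
        print(f"Forensics: {sessions[session_id]}")

on_request("s42", "/search", 200)
on_request("s42", "/order", 500)   # first error on /order -> alert
on_request("s43", "/order", 500)   # already alerted, no duplicate
```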


47

Real user monitoring: 2 main uses

• Outlined in ITIL
  – Service support: incident management, problem management
  – Service delivery: service level management, availability management, capacity planning

48

OK, I’ve got the tools. What do I look at?


49

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals

Use the right metrics for the audience & question

50

Metrics

• Measure everything
  – A full performance model
• Availability
  – Can I use it?
• User satisfaction
  – What’s the impact of bad performance?
• Use percentiles
  – Averages lie


51

A full performance model

• The HTTP data model
  – Redirects
  – Containers
  – Components
  – User sessions
• HTTP-specific latency
  – SSL
  – Redirect time
  – Host latency
  – Network latency
  – Idle time
  – Think time
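
A small sketch of how the latency components might be modelled per page; the field names follow the slide, and the decision to exclude think time from the end-to-end figure is an assumption:

```python
from dataclasses import dataclass

# Minimal per-page latency model following the slide's breakdown.
# Field names and the sample values are illustrative assumptions.
@dataclass
class PageTiming:
    ssl: float
    redirect: float
    host: float
    network: float
    idle: float
    think: float

    def end_to_end(self) -> float:
        # Think time is user behaviour, not service delay, so it is excluded here.
        return self.ssl + self.redirect + self.host + self.network + self.idle

timing = PageTiming(ssl=0.05, redirect=0.1, host=0.6, network=0.4, idle=0.2, think=5.0)
print(f"End-to-end delay: {timing.end_to_end():.2f}s")
```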

52

Availability

• Network errors
  – High retransmissions, DNS resolution failure


53

Availability

• Client errors
  – 404 Not Found

54

Availability

• Application errors
  – HTTP 500


55

Availability

• Service errors

56

Availability

• Content & back-end errors
  – “ODBC Error #1234”


57

Availability

• Custom errors
  – Specific to your business
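
Pulling the error categories from the last few slides together, a classification sketch; the record fields, thresholds, and the custom pattern are assumptions that would vary per application:

```python
import re

# Classification sketch for the availability error categories above.
# Record fields, thresholds, and the custom pattern are assumptions.
CUSTOM_PATTERNS = [re.compile(r"out of stock", re.I)]   # business-specific

def classify(record):
    if record.get("dns_failed") or record.get("retransmissions", 0) > 10:
        return "network error"
    status = record.get("status", 0)
    if 400 <= status < 500:
        return "client error"
    if status >= 500:
        return "application error"
    body = record.get("body", "")
    if "ODBC Error" in body:
        return "content/back-end error"
    if any(p.search(body) for p in CUSTOM_PATTERNS):
        return "custom error"
    return "ok"

print(classify({"status": 200, "body": "ODBC Error #1234"}))   # content/back-end error
print(classify({"status": 404}))                               # client error
```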

58

User satisfaction: Satisfied, tolerating, frustrated

What metric? What function?
[Diagram: target performance, impact on users, percentile data]
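
A sketch of an Apdex-style satisfied / tolerating / frustrated split against a target performance; the slide does not name Apdex, and the 4x-target tolerating band and the sample times are assumptions:

```python
# Apdex-style satisfaction sketch (satisfied / tolerating / frustrated) based
# on a target performance for one function. Target and samples are assumptions.
target = 2.0                       # target performance in seconds
response_times = [0.9, 1.5, 2.4, 3.1, 9.5, 1.1, 2.2]

satisfied  = sum(1 for t in response_times if t <= target)
tolerating = sum(1 for t in response_times if target < t <= 4 * target)
frustrated = sum(1 for t in response_times if t > 4 * target)

n = len(response_times)
print(f"Satisfied {satisfied}/{n}, tolerating {tolerating}/{n}, frustrated {frustrated}/{n}")
# Apdex-style score: satisfied plus half-weighted tolerating, over total.
print(f"Score: {(satisfied + tolerating / 2) / n:.2f}")
```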


59

Averages lie: Use percentiles

60

Averages lie: Use percentiles

The average varies wildly, making it hard to threshold properly or see a real slow-down.


61

Averages lie: Use percentiles

The 80th percentile only spikes once, for a legitimate slow-down (20% of users affected).

62

Averages lie: Use percentiles

Setting a useful threshold on percentiles gives fewer false positives and more real alerts.
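
A small sketch of why percentiles threshold better than averages: one outlier drags the mean up, while the 80th percentile only moves when many users are genuinely slow. Sample values are illustrative.

```python
# Averages vs. percentiles sketch: a single outlier drags the mean around,
# while the 80th percentile only moves when many users are actually slow.
def p80(values):
    ordered = sorted(values)
    return ordered[int(0.8 * (len(ordered) - 1))]   # crude, no interpolation

normal_minute = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 1.0, 30.0]  # one outlier
slow_minute   = [2.5, 2.8, 3.0, 2.6, 2.9, 2.7, 3.1, 2.8, 2.6, 2.9]   # real slow-down

for label, values in (("one outlier", normal_minute), ("real slow-down", slow_minute)):
    mean = sum(values) / len(values)
    print(f"{label}: mean {mean:.1f}s, 80th percentile {p80(values):.1f}s")
```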


63

A top-down approach to web performance monitoring

Metrics

Tools

Operating processes

Business goals

64

Questions?

acroll<at>coradiant.com
(514) 944-2765