This is a presentation I gave at the 2008 Annual Hotsos Symposium. Its punny title speaks to the obsession many performance analysts have with analyzing utilization (U) decoupled from any observation of how a system is actually performing. The preso is a bit of a tirade, but all in all a good rant for recalling some of the habits that lead to the common practices of massively over-provisioning systems and tragically under-managing their performance.
Capacity: It's Not All About U!
(née: “Regarding Capacity”)
Bob Sneed - Sr. Staff Engineer, Sun Microsystems, Inc.
Performance & Applications Engineering (PAE)
Hotsos Symposium 2008, March 2-6 @ Dallas
Rev 1.9c – March 19, 2008
Copyright © 2008, Sun Microsystems, Inc. All Rights Reserved.
Abstract
When it comes to managing computer capacity, the state-of-the-industry is wildly diverse -- but often both primitive and
inconsistent in the area of enterprise computing. Indeed, most discussions regarding capacity don't even involve appropriate
engineering units of measure! It's no surprise that the relationship between capacity management, performance
management, and Quality of Service (QoS) management is so uneven in practice. This session will survey modern
quandaries in Performance and Capacity Management, and offer some insights and abstractions aimed at stimulating
constructive discussion, progressive engineering development, and intelligent practices in this area.
Disclaimers
Opinions and views expressed herein are those of the author, Bob Sneed, and do not represent any official opinion of Sun Microsystems, Incorporated - or anyone else.
I'm not a doctor and I don't even play one on TV - but I do regard Tom Baker and Chris Eccleston as role models.
There is no warranty, expressed or implied, in the quality of the information herein, or its fitness for any given purpose.
If you goof up applying this stuff and have a bad outcome or destroy a bunch of data – it's not my fault or Sun's.
This is version 1.x material.
Batteries not included.
Your mileage may vary (YMMV).
Agenda
• Motivations [10]
• Let's Talk PerfCap [15]
• Case Study [10]
• Ruminations on the State of the Art [5]
• Heterogeneity, Elasticity, and Covariance [15]
• Concluding Remarks [5]
(All times in Bob-minutes; YMMV ...)
Motivations
Concerns and Premises
• Primitivism: Many customers are doing capacity wrong, with the result being variously massive over-provisioning, surprises in production, or much ado about normal!
• I'm annoyed: Many “capacity crises” are actually either chaos in action or misunderstandings about The Way Things Work.
• Advancing the art: Investments are required to make industry advances in managing Performance and Capacity (PerfCap).
• Customer value: Right-sizing is a win-win scenario.
How widespread is “wrong”?
• It's not that everyone is doing it wrong ...
  > ... though even many who do PerfCap right are crippled by organizational behaviour and GIGO constraints ...
• In some places, PerfCap tends to get done right ...
  > Technical computing (HPC, HPTC)
  > Embedded computing & realtime systems
  > In well-defined tiers with homogeneous workloads
• In some places, PerfCap tends to get done wrong ...
  > Commercial IT – especially around big databases
  > Heterogeneous workloads - some inherently complex, some resulting from consolidation or virtualization
• Bob says: “Tiers are for people who have not discovered resource and workload management!”
PerfCap / Physics Metaphor
• Primitivism, pre-science ~ state of the practice
  > Wonder; everything is mystery and magic
  > Underlying causes attributed to nature or deities
  > Stagnant - “Because we've always done it that way”
• Newtonian physics ~ state of the art
  > Causality; testable hypotheses, repeatable outcomes
  > Mathematical relationships determined
  > Enables the modern era
• Einsteinian physics ~ the horizon
  > Relativity; frames of reference
  > True nature of things theorized; testability gets harder
  > Propels the post-modern era
Over-Provisioning; so What?
• Pros ...
  > Hardware is cheap. Sun sells hardware. Good for Bob!
  > Feature/function time-to-market has priority.
  > Performance expertise scarce and inconsistent.
  > No time for learning “new tricks”.
  > “Throwing Iron” at problems has a fixed cost and a set delivery date - and it often “works”.
• Cons ...
  > Capital costs
  > Operational costs (power, cooling, space, administration)
  > Stagnation: The applicable math, science, and vocabulary has ended up deferred – for nearly an entire era.
Let's Talk PerfCap
PerfCap Language: Goals
• Business Metrics
  > System performance in business terms, such as transactions per second, batch run time, or percent of jobs/transactions meeting some performance criteria (Service Level Agreement, or SLA)
  > Business objectives are typically diverse in terms of importance and resource demands
• Business Metrics and Indicators (BMIs)
  > Business metrics plus secondary indicator variables, such as aggregate packet rate or commit rate
  > These are observables one might monitor and alarm on
PerfCap: Solving the Right Problem
• “The Goal” - Goldratt
  > Written as a novel; an unusual approach to conveying principles from Operations Research
• “Are Your Lights On?” - Gause & Weinberg
  > A fun and easy read
  > From the same Weinberg as the classic “Psychology of Computer Programming”
PerfCap Language: Capacity
• Some definitions
  > English: The ability to do a job.
  > Technical: The maximum reliable throughput with acceptable response times.
  > Geek: The throughput limitation of the bottleneck device.
• Supermarket metaphors
  > What percent of cashiers should be always idle?
  > What purposes do “express lanes” serve?
• Submarine metaphor
  > Compare “100% underwater” with “crush depth”; which one represents capacity?
PerfCap Language: Capacity Planning
• Capacity Planning defined - with footnotes
  > Estimating[A] capacity requirements[B] in time to be able to order, receive, provision, and deploy – before you run out of capacity.
    [A] Prognostication and prestidigitation, usually based on B.S. forecasts from marketing departments
    [B] NOTE: Related disciplines increase capacity without capital outlays
      ● Efficiency – doing more with less; tuning; optimization
      ● Software Performance Engineering (SPE) – the discipline of engineering to meet performance requirements
• It's not all about U! (Utilization)
  > It's mostly about R (response time), X (throughput), service demands, and efficiency (which relates to U) and The Way Things Work
PerfCap Language: Queuology
• Queueing Theory = math used for PerfCap work
  > Too bad it does not have a simple one-word name like arithmetic, calculus, topology, trigonometry, or sadistics (how about “queuology”?)
• Response-time = Queue wait + Service time
  > R = W + S
  > NOTE: This is not Plain English. It must be taught in context to enable meaningful conversations.
• Bottleneck = scaling constraint
  > NOTE: This is not Plain English. In PerfCap, this term has no negative emotional connotation.
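To make R = W + S concrete, here is a small Python sketch. It assumes the simplest textbook case, a single-server M/M/1 queue, where the mean wait is W = S * U / (1 - U); the slides do not name a specific queueing model, so that choice and the 10 ms service time are assumptions made for illustration.

```python
def mm1_response_time(service_time, utilization):
    """Return (R, W) for an M/M/1 queue: R = W + S, with W = S * U / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    wait = service_time * utilization / (1.0 - utilization)
    return wait + service_time, wait


S = 0.010  # hypothetical service time: 10 ms
for u in (0.10, 0.50, 0.80, 0.90, 0.95):
    r, w = mm1_response_time(S, u)
    print(f"U={u:4.0%}  W={w * 1000:7.1f} ms  R={r * 1000:7.1f} ms")
```

Under this assumed model the service time never changes; all of the growth in R comes from W, which is why response time rather than U is the quantity to watch.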
PerfCap Language: Crazy about U!
• Utilization (U)
  > The percent of time a resource is not idle
  > Physics analogy: Work = Force * Displacement
    ● No displacement means no work
• Another physical metaphor ...
  > Helicopter: What does a helicopter's engine tachometer tell you about the helicopter's performance?
PerfCap Language: U is for Useless?
• “Utilization is Virtually Useless as a Metric” - Adrian Cockcroft, CMG 2006
  > http://perfcap.blogspot.com/2005/12/cmg05-trip-comments-and-utilization-is.html
  > http://www.cmg.org/membersonly/2006/papers/6133.pdf
  “We have all been conditioned over the years to use utilization or %busy as the primary metric for capacity planning. Unfortunately, with increasing use of CPU virtualization and sophisticated CPU optimization techniques such as hyper-threading and power management the measurements we get from systems are "virtually useless". This paper will explain many of the fundamental alternatives, and express capacity in terms of headroom, in units of throughput within a response time limit.”
• Adrian wins 2007 CMG Michelson Award
  > http://perfcap.blogspot.com/2007/12/a-michelson-award-acceptance-speech.html
  "Those who ask questions about utilization don't understand that their questions have no meaning so the answers are irrelevant :-)"
Aggregate Utilization: U-all?
• Business Logic
  > Workload classes (eg: OLTP, BATCH, pseudo-BATCH)
    ● Varies in business priority
    ● Varies in relative I/O content
    ● Varies in propensity to compute
  > Per-class utilization varies based on many system factors (CPU architecture, OS scheduling, space/speed tradeoffs, efficiency tradeoffs, virtualization), and also due to often-uncontrolled competition for resources
  > Cycles-per-instruction (CPI) varies with compile/build factors and competition factors
  > Utilization is limited by concurrency of demand and bounded by serialization per Amdahl's Law
  > Utilization is often largely due to bad app code and/or bugs
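The Amdahl's Law bound mentioned above can be sketched in a few lines of Python. The function names and the 10% serial fraction are illustrative assumptions; the formula is the standard one, speedup = 1 / (s + (1 - s) / n).

```python
def amdahl_speedup(n_cpus, serial_fraction):
    """Amdahl's Law: best-case speedup of one job on n_cpus processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


def max_utilization(n_cpus, serial_fraction):
    """Upper bound on the fraction of n_cpus that one such job can keep busy."""
    return amdahl_speedup(n_cpus, serial_fraction) / n_cpus


# Hypothetical job that is 10% serial, on boxes of various sizes.
for n in (4, 12, 24, 64):
    print(f"{n:3d} CPUs: speedup <= {amdahl_speedup(n, 0.10):5.2f}, "
          f"utilization <= {max_utilization(n, 0.10):5.1%}")
```

In this sketch a job with even a modest serial portion cannot drive a large SMP anywhere near 100% busy, so low aggregate utilization by itself says little about spare capacity for that job.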
Aggregate Utilization: U what !?@#!
• Overhead categories
  > Polling operations
  > Lock and latch spins (adaptive)
  > Locking and latching cache coherency
  > Memory management (a maze of twisty passages ...)
  > Re-work (fail-and-retry logic)
  > Migrations & cache invalidations
  > Context switches (voluntary and involuntary)
  > Hardware thread-switching (some cheap, some not)
    ● SMP, VMT, SMT, CMT – all different!
  > Performance monitoring and management tools
    ● Significant “probe effect” can occur from some tools
    ● The aggregate impact of tools is often a root cause of problems
  > Bad tuning and bugs - outside of the business logic
PerfCap Language: Like, U-know?
• Workload Characterization
  > PerfCap definition: Attribution of resource utilization to various distinct business processes or technical functionality
    ● Essential to understanding resource usage
  > Engineering definition: Characterization of platform response factors under a given workload
    ● Interesting to drive systems engineering
  > Vernacular definition: Various broad terms like OLTP, BATCH, DSS, DW, PROD, UETP, DVLP, TEST, OLAP, ERP, ETL, ad-hoc, and my personal favourite - “mixed”
    ● Suggestive of requirements, but non-quantitative
Hockey Sticks and Knees 4 U
Excerpted from "Analyzing Computer System Performance” by Neil J. Gunther, Springer-Verlag 2005. ISBN 3540208658 (Used with permission.)
So, what do U know?
• Do you know your overhead/work ratio?
• Do you know your ratio of OLTP to pseudo-BATCH?
• Do you know how these vary under load?
• Do you know how to observe, measure, and manage these things?
PerfCap Language: Method Rrrrrr!
Wrong
• Performance
  > CPU %busy, %usr/%sys ratio
  > IOPS, disk latency, %wio
  > Graphs of aggregated data
• Capacity
  > Whatever you get at 100% utilization
• Headroom
  > (100% – utilization)
• Utilization
  > (100% – headroom)
Right
• Performance
  > Response time
  > Throughput
  > Variance
• Capacity
  > Latent performance
• Headroom
  > ((100% capacity) – (current peak performance))
• Utilization
  > (100% – %idle)
Case Study
Case Study: Scenario
• Financial E10K user upgraded to E2900
  > CPU power of E2900 was 125% that of the 10K system
    ● E10K: #64 US-II @ (64 “slow” cores)
    ● E2900: #12 US-IV+ @ (24 “fast” cores)
  > Result: Utilization on E2900 was greater than on E10K!
  > Impact: Great angst! Management wanted %idle > 20! E2900 dissed. Move to E6900 contemplated. (Focus was on utilization (U) ... response-time (R) and throughput (X) were essentially ignored)
  > Breakthrough! Customer agreed to a test-to-fail exercise!
    ● Monitor response times per-transaction-class
    ● Increase benchmark workload until SLA not met
[Figure: “It's not all about U!” – three panels plotted against Users (0 to 500): RTX1 (0 to 0.5; SLA = 0.5 sec), RTX2 (0 to 600; SLA = 600 sec), and UCPU (0 to 100; Max = 100%). Callouts: “OMG! 20% Headroom?” and “No! 300% Headroom”.]
Case Study: Experimental Results
• The new system had plenty of latent capacity!
  > Test-to-fail revealed 300% headroom at 80% utilization!
  > All they needed was 1X headroom at 100 users!
  > Workload characterization revealed that a single CPU-greedy transaction of no business importance was vastly over-achieving its SLA
  > The CPU-greedy transaction under Solaris TS scheduling automatically fell to priority 0 - thus having zero impact on real OLTP as OLTP demand ramped up to 4x the level that corresponded with 80% aggregate CPU utilization
  > At the “tipping point”, the chaos may have been due to LGWR priority dropping to 0 under Solaris TS scheduling
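The gap between the two headroom figures in this case study (20% versus 300%) can be reproduced with a short Python sketch of a test-to-fail analysis. The measurement series below is hypothetical; only the 100-user baseline, the 0.5-second SLA, the 80% utilization, and the roughly 4x failure point come from the slides.

```python
def max_load_within_sla(measurements, sla_sec):
    """Largest tested load (users) reached before the SLA was first missed.

    measurements: iterable of (users, response_time_sec) pairs, in any order.
    """
    best = 0
    for users, response_time in sorted(measurements):
        if response_time > sla_sec:
            break
        best = users
    return best


def headroom_pct(baseline_users, max_users):
    """Headroom in workload terms: extra load, as a percent of the baseline."""
    return 100.0 * (max_users - baseline_users) / baseline_users


# Hypothetical test-to-fail samples for one transaction class (SLA = 0.5 sec).
rtx1 = [(100, 0.10), (200, 0.12), (300, 0.18), (400, 0.45), (500, 0.90)]

limit = max_load_within_sla(rtx1, sla_sec=0.5)
print("utilization view:", 100 - 80, "% headroom")
print("test-to-fail view:", headroom_pct(100, limit), "% headroom")
```

The utilization view needs no test at all, which is part of its appeal; the test-to-fail view needs an SLA and a load ramp, and it answers the question the business is actually asking.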
Case Study: Business Outcome
• Customer emergency upgraded to an E6900
  > CPU power of E6900 was 200% that of the 10K system
  > Rumor has it that they got a really good discount
  > E6900 showed a “comforting” 20%+ idle under full test load
• Moral
  > Science is often secondary in commercial IT
  > Due to issues of organizational behaviour, even empirical results might fail to triumph over rules of thumb
  > The cost of hardware is a minor issue to many IT managers' decision-making process
  > Get over it ... or - develop new metrics and methods by which IT managers can be made comfortable!
Ruminations on the State of the Art
Common PerfCap Mistakes
• Absence of business metrics
  > What Problem are You Trying to Solve?
• Equating usage with demand or requirement
  > In other words, assuming that demand is inelastic
• Failure to do performance first and often
  > Why scale waste and inefficiency?
• Assuming supply is inelastic
  > In other words, assuming service times are constant
• Misinterpreting “the device with the highest utilization is the bottleneck device”
  > Hmm, what about polling loops?
• Decisions based on intuition and rules of thumb
  > Sophistication can pay great rewards!
What's the Right Way to do PerfCap?
1) Empirical Methods (The Best & Most Expensive)
  ● Benchmarks, stress testing, test-to-scale, test-to-fail – with known Best Practices & basic performance analysis and tuning
2) Modeling (Highly Recommended & Moderate Cost)
  ● Using tools such as TeamQuest Model (TQM), BMC Perform/Predict, HyPerformix, Gunther's PDQ or other application of proper science and math
3) Expert Opinions (The Minimum & Cheapest)
  ● Listening to the right experts for Best Practices, analysis and tuning methods, and sizing
4) Guesswork (The Norm)
  ● Straight-line extrapolations, naïve use of reference benchmarks, massive over-provisioning, bogus testing, luck
5) Opportunism (Commonplace)
  ● Spend the available budget
RTFM: PerfCap Resources
• Dr. Neil Gunther – prolific, readable, digestible
  > “The Practical Performance Analyst” - foundational http://www.amazon.com/dp/059512674X/
  > “Guerrilla Capacity Planning” - http://www.perfdynamics.com/Manifesto/gcaprules.html
RTFM: PerfCap Resources
• Cary Millsap – digestible, practical, methodical
  > “Optimizing Oracle Performance”
    ● Chapters 1 & 2 – a great intro to the art of PerfCap, whether or not one applies it to Oracle
    ● Method R
RTFM: PerfCap Resources
• Raj Jain - “The Art of Computer Systems Performance Analysis”
  > Fundamental, foundational, readable
When Models Break
• Good models break due to factors that are exogenous to the model (ie: not considered)
  > Examples: bus saturation, cache saturation, lock contention, covariance
• Bad models break because they are bad models
  > Examples: “straight line” projections, models that do not consider basic queuing phenomena
What Breaks Existing Models
• Heterogeneity
  > There is diversity in both supply and demand factors
  > For example, OLTP, BATCH, and DSS are classical characterizations for common workload elements
• Elasticity
  > Resource supply and demand factors are each elastic
  > For example, per-transaction demand might diminish under increasing load and supply might become more efficient
• Covariance
  > Competition for resources impacts all competitors - sometimes adversely or pathologically
Heterogeneity, Elasticity, and Covariance
Heterogeneity: Many Dimensions
• Business priority
  > Importance to the enterprise
• Service demand
  > Resource requirement, including deadline constraints
• Technical priority
  > Solaris scheduling priority
• Quality (versus quantity)
  > Not all CPU-seconds are created equal
• Urgency
  > Importance, as distinct from priority or share
    ● (example: princes and paupers)
Heterogeneity: Early Warning Signs
• “ERP”
• “Consolidation”
• “RDBMS”
• “Ad-hoc”
• “Custom”
• “Producer/Consumer”
• “Client/Server”
• “Dispatcher thread/process”
• Testimony to the contrary (eg: “It's entirely homogeneous OLTP!”)
Heterogeneity: Example(s)
# prstat -m
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 13632 oracle    50  50 0.0 0.0 0.0 0.0 0.0 0.0   0   0 48K   0 sqlplus/1
 13633 oracle   0.0  96 0.0 0.0 0.0 0.0  48 0.0   0   0 46K   0 sqlplus/1
 15849 oracle    92 0.1 0.0 0.0 0.0 100 100 0.1  13  45  1K   0 oracle/11
 27639 oracle    91 0.1 0.0 0.0 0.0 100 100 0.1  24  50  2K   0 oracle/11
 13601 root      18  54 0.0 0.0 0.0 0.0  36 0.0 178 178 87K   0 ps/1
 13551 root     0.0  68 0.0 0.0 0.0 0.0  39 0.0 244 195 93K   0 prstat/1
 12614 oracle    64 0.2 0.0 0.0 0.0 100 100 0.1  50  38  3K   0 oracle/11
 24020 oracle    47 0.5 0.0 0.0 0.0 100 100 0.1 190  36 10K   0 oracle/11
[...]
 11087 oracle   9.3 0.1 0.0 0.0 0.0 0.0  90 0.0   5   6  6K   0 oracle/1
 13490 root     0.0 8.5 0.0 0.0 0.0 0.0  93 0.0 380   0 25K   0 sh/1
  2154 oracle   7.9 0.2 0.0 0.0 0.0 100 100 0.0  53   5  3K   0 oracle/11
  9656 oracle   7.1 0.1 0.0 0.0 0.0 0.0  92 0.0  37   5  2K   0 oracle/1
 24156 oracle   6.7 0.1 0.0 0.0 0.0 100 100 0.0   6   4  2K   0 oracle/11
 13496 oracle   6.2 0.0 0.0 0.0 0.0 0.0  93 0.0 341   0 19K   0 sh/1
 13488 oracle   6.0 0.0 0.0 0.0 0.0 0.0  96 0.0 330   0 19K   0 sh/1
 25478 oracle   3.9 0.1 0.0 0.0 0.0 0.0  96 0.0  46   3  2K   0 oracle/1
  8098 oracle   2.9 0.1 0.0 0.0 0.0 0.0  97 0.0  60   3  2K   0 oracle/1
[...]
Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05
Heterogeneity: Exploring
• Fun commands you can use at home ...
# Taking U apart
prstat -n 8192 -m     # Microstate accounting
prstat -n 8192 -mL    # Per-thread microstate accounting
# Thread count ...
awk '{print $15}' < prstat-sample.1 | sort | grep oracle | uniq -c | more
# CPU intensity ...
grep oracle\/ prstat-sample.1 | awk '{print $3}' | sort -n | uniq -c | more
# Diverse priorities ...
ps -e -o pid,class,pri,args
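For offline analysis, the same "taking U apart" idea can be sketched in Python: parse saved `prstat -m` output and attribute USR+SYS time to each process name. The function name and the embedded sample (a few rows copied from the example slide) are assumptions made for illustration; the parser relies only on the 15-column layout shown on that slide.

```python
from collections import defaultdict

SAMPLE = """\
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 13632 oracle    50  50 0.0 0.0 0.0 0.0 0.0 0.0   0   0 48K   0 sqlplus/1
 13633 oracle   0.0  96 0.0 0.0 0.0 0.0  48 0.0   0   0 46K   0 sqlplus/1
 15849 oracle    92 0.1 0.0 0.0 0.0 100 100 0.1  13  45  1K   0 oracle/11
 27639 oracle    91 0.1 0.0 0.0 0.0 100 100 0.1  24  50  2K   0 oracle/11
 13601 root      18  54 0.0 0.0 0.0 0.0  36 0.0 178 178 87K   0 ps/1
Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05
"""


def cpu_by_process(prstat_text):
    """Map process name -> (process count, summed USR+SYS percent)."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in prstat_text.splitlines():
        fields = line.split()
        # Keep only data rows: 15 columns, starting with a numeric PID.
        if len(fields) != 15 or not fields[0].isdigit():
            continue
        name = fields[14].split("/")[0]  # strip the /NLWP suffix
        totals[name][0] += 1
        totals[name][1] += float(fields[2]) + float(fields[3])
    return {name: (count, cpu) for name, (count, cpu) in totals.items()}


for name, (count, cpu) in sorted(cpu_by_process(SAMPLE).items()):
    print(f"{name:10s} procs={count:3d} usr+sys={cpu:6.1f}")
```

Even in this tiny sample, a good share of the CPU time belongs to `ps` and `sqlplus` rather than to the database itself.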
Heterogeneity: Deal with it!
• Identify it
  > This is one aspect of workload characterization in the language of PerfCap
  > Consider its many dimensions (business priority, service demand, technical priority, urgency, deadlines)
• Tell the OS about it
  > The OS does not know your priorities, so tell it!
  > Automating this is a good investment
• Model it
  > w.r.t. competition and covariance – TBD
Elasticity: Supply Factors
• In general, “supply” is net of competing demands
  > “I'm giving ya all I got, captain!”
  > FCFS – who got in line first?
• In a specific configuration, elastic factors abound
  > With mixed-speed CPUs, Q(CPU-second) = f(MHz)
  > With CMT, Q(CPU-second) = f(core loading)
  > Q(CPU-second) = f(ISA & pipeline sophistication)
• Unmanaged, the probability of thread pinning will increase with increasing interrupt load
Elasticity: Supply Factors
• Priority preemption
  > Good – under TS, compute hogs will drift to priority 0
  > Bad - unmanaged, a large population of homogeneous threads may frivolously preempt each other
  > Ugly – interrupts have top priority; they can even interrupt and “pin” realtime (RT) threads
  > Hideous – it's really tragically bad when TS demotes your highest-importance thread (eg: Oracle LGWR)
Elasticity: Supply Factors
# mpstat 5
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  211  449  142  423    4   21   25    0   460   17   2   0  82
  1    1   0  127  155    2  296    2    6   23    0   199   13   1   0  86
  2    0   0   30   30    0   56    0    3    9    0    64    1   0   0  98
  3    0   0    0    2    0    2    0    1    4    0     0    0   0   0 100
  8    1   0  199  278    0  548    4   11   37    0   470   23   1   0  76
  9    0   0    0    2    0    2    0    1    4    0     0    0   0   0 100
 10    0   0   30   53    0  104    0    3   11    0   155    4   0   0  95
 11    0   0    0    2    0    2    0    1    3    0     0    0   0   0 100
 16    1   0  178  258    0  508    3   10   29    0   521   16   1   0  82
 17    0   0    3    5    3    4    0    1    6    0     2    0   0   0 100
[...]
104    1   0  222  194    4  377    1    6   28    0   281   16   1   0  83
105    0   0    0    2    0    2    0    1    2    0     0    0   0   0 100
106    0   0    0    3    0    4    0    1    3    0    13    0   0   0 100
107    0   0    0    2    0    2    0    1    2    0     0    0   0   0 100
112    1   0  141  229    1  451    2    3   23    0   289   18   1   0  81
113    0   0    1    3    1    2    0    1    1    0     0    0   0   0 100
114    0   0    0    6    0    9    0    2    2    0     3    0   0   0 100
115    0   0    0    2    0    2    0    1    1    0     0    0   0   0 100
120    4   0  397  409    3  804    4    3   44    0   450   23   3   0  74
121    0   0    1    3    1    2    0    1    2    0     0    0   0   0 100
122    0   0   13   15    0   28    0    2    3    0    13    1   0   0  99
123    0   0    0    2    0    2    0    1    1    0     0    0   0   0 100
Elasticity: Supply Factors
$ awk '{print $3,$4}' ps-sample.out | sort | uniq -c | sort -nr +2
   1 RT 157
   1 RT 140
   1 RT 100
   1 SYS 98
   1 SYS 96
   3 TS 60
   2 FX 60
   1 SYS 60
8238 TS 59
   1 TS 58
   3 TS 54
  11 TS 53
   2 TS 52
   1 TS 51
   6 TS 50
  14 TS 49
   1 TS 36
   1 TS 34
   1 TS 29
   1 TS 22
   1 TS 12
   3 TS 0
$ grep lgw ps-sample.out
10494 1 TS 34 ora_lgwr_XYZP
NOTE: ps-sample.out data was from 'ps -e -o pid,ppid,class,pri,args'
Hey! Wait a minute! I'm really important!
Why didn't anyone tell the OS?
Help!
Important!
CPU hogs, punished by TS
Primary modality; OLTP shadows
Elasticity: Demand Factors
• “The mythical CPU-second”
  > Sensitivity to compile options – eg: branch mispredicts, pipelining, inlined macro-operations versus library calls
  > Sensitivity to link options – eg: locality versus I$ and D$ behaviour
  > Sensitivity to competition – could be viewed as elasticity of demand or supply, or as covariance ... depending on one's point of perspective
  > Adaptive algorithms – eg: decisions to yield and re-queue (rather than spin) might be made as a function of system load – and that can reduce the CPU-sec/transaction as load increases
Elasticity: Demand Factors
• Under high load, frivolous migrations should decrease, leading to improved cache utilization and reduced memory waits
• Demand can vary in both quality (overhead/work) and quantity (overhead+work) as load is varied
  > Ratio of business logic to spins for locks and latches
  > Write coalescing by LGWR
  > Checkpoint write deferral by DBWR
Elasticity: Deal with it!
• Demand
  > Seek out and destroy inefficiency – but keep the 80/20 rule in mind
  > Use Resource Management (RM) at the app, OS, and DB levels – maybe Oracle Resource Manager (ORM)?
  > The final constraints are the speed of your components and the speed of light
• Supply
  > Invest in getting required factor-level QoS to various processes in relation to their business criticality
Covariance: Pigs at the Trough
• Workloads are often unmanaged and multi-modal
  > Spectrum is wide, but simple case is BATCH vs. OLTP
• What if your OLTP SLA outliers are due to I/O competition from your BATCH?
  > Maybe your BATCH is being over-served for I/O?
  > Maybe you could throttle your BATCH I/O demands?
• What if your BATCH SLA outliers are due to CPU competition from your OLTP?
  > Maybe your OLTP is being over-served for CPU?
  > Maybe you could dynamically compromise on your OLTP CPU priority?
Covariance: Some Examples
• “Foxes and chickens” problem: mixing incompatible species in the same cage
• Most famously: “batch versus OLTP”
  > I/O demand by batch is what typically slows OLTP, but CPU demand by batch should not impact OLTP
  > OLTP demand for I/O or CPU might impact batch
• Harder to see: “cache-sensitive” versus “cache-polluting” competition
  > Cache-sensitive workload elements can be slowed by elements that constantly spoil the cache
• Heads-up! Virtualization means increased sharing!
Covariance: Deal with It!
• Expensive: Physical segregation and isolation
  > e.g. - run BATCH or reports on another system
  > e.g. - dedicate disks, channels, buses, and CPU to business or technical functions as required
• Primitive: Temporal segregation and isolation
  > e.g. - run BATCH at night
• Refined: Prioritization, throttling, deadline scheduling
  > e.g. - run BATCH at low priority, inject delays, increase priorities as deadlines get closer
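The "increase priorities as deadlines get closer" idea can be sketched as a simple policy function in Python. Everything here is hypothetical: the function name, the linear rule, and the 0 to 59 priority range (chosen to echo the TS priorities seen in the earlier ps sample). It is a sketch of the concept, not any scheduler's actual interface.

```python
def batch_priority(remaining_work_sec, time_to_deadline_sec, low=0, high=59):
    """Map deadline pressure to a priority between low and high.

    Pressure is remaining work divided by time left; with plenty of slack
    the job stays near `low`, and it reaches `high` once there is no slack.
    """
    if time_to_deadline_sec <= 0 or remaining_work_sec >= time_to_deadline_sec:
        return high
    pressure = remaining_work_sec / time_to_deadline_sec  # 0.0 .. 1.0
    return low + round(pressure * (high - low))


# One hour of work left, with the deadline approaching.
for hours_left in (8, 4, 2, 1.5, 1):
    print(f"{hours_left:4} h to deadline -> priority",
          batch_priority(3600, hours_left * 3600))
```

A job with lots of slack yields to OLTP; the same job with its deadline looming competes harder, which is the refinement over simply running BATCH at night.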
Concluding Remarks
Parting Thoughts
• Participate in CMG http://www.cmg.org
  “Ignorance of the law is no excuse!”
• Go where you may not have gone before
  > Test-to-fail
  > Analyse
  > Fix or manage
  > Repeat
• If you are not managing to Business Metrics, you are wasting time and energy!
Special Thanks to ...
• Adrian Cockcroft, Cary Millsap, Jim Holtman, Dr. Neil Gunther
  > mentors and provocateurs
• David J. Miller, Benoit Chaffanjon
  > editorial services & peer review
• Glenn Fawcett
  > smoke-jumping brotherhood & cool graphics
• Jim Mauro
  > northern star
• Larry Klein
  > inspiration from “It's all about U” ... and in general
Q&A?
Extended Discussion Slides
Primitivism
• “You might be a redneck if ...”
  > You think "capacity" is when you pass out.
  > You cannot imagine why anyone would model a cue.
  > You have only seen a queue on Hop Sing or David Carradine.
  > You believe chaos past 80% utilization is a law of nature.
  > You make no effort whatsoever to control what's important to you.
[... with a tip of the hat to Jeff Foxworthy ...]
“Some people think that once they know the tricks of the trade, that they know the trade.”
“A little bit of knowledge can be a dangerous thing.”
Paths Forward
• Increased education in PerfCap
  > Math, science, language/vocabulary
  > “Do performance first, then capacity.”
• Increase usage of available tools
  > Extract benefits, learn limitations, develop art
• Increased networking amongst stakeholders
  > Build awareness of what can go wrong; seek synergy
• Breaking new ground
  > CMT and Virtualization challenges
  > Power management
  > Automating workload management
  > “PerfViz” - CMG focus area
  > “Regarding Capacity” - Our focus for the rest of the hour ...
Water Glass Metaphors
• Is it 50% full or 50% empty?
  > CMG-speak: Is it 80% busy, or 20% under-utilized?
• “Big Rocks”
  > Demonstrates heterogeneity and priority
Two Views of “Best Practices”
• Bob Sneed's
  > “Best Practices are time-proven and customer-proven practices which are well-documented and believed to have little or no downside potential.”
  > “... practical workarounds for product design limitations”
  > “... contrast with just works; needs no practices”
  > “... contrast with tuning, which implies trial and error”
• Dr. Neil Gunther's
  > “Best Practices are an admission of failure.”
  > “... trading workarounds, practices, and 'rules of thumb' does not advance the science or deepen understanding”
  > “... contrast with decomposing, understanding, modeling, proper engineering”
  > “... just another form of trial and error”
Pop Quiz #1
• SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours.
• Q1: How much “headroom” does this system have?
• Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage?
Pop Quiz #1: Answers
• SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours.
• Q1: How much “headroom” does this system have?
• A1: 300% (in workload terms) or 75% (in percent-of-system terms) - it can do 4x the work it now does and remain within the SLA.
• Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage?
• A2a: Huh? Why would anyone want to do that?
• A2b: Resource management.
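The two figures in A1 follow from simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes the compute-bound task scales linearly (4x the work takes 4x the time).

```python
def headroom(busy_hours, sla_hours):
    """Return headroom as (percent of current workload, percent of system)."""
    workload_pct = 100.0 * (sla_hours - busy_hours) / busy_hours
    system_pct = 100.0 * (1.0 - busy_hours / sla_hours)
    return workload_pct, system_pct


workload_pct, system_pct = headroom(busy_hours=1, sla_hours=4)
print(f"{workload_pct:.0f}% in workload terms, {system_pct:.0f}% in system terms")
```

Note that the "100% - utilization" rule would report zero headroom for this system during the hour it runs.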
Pop Quiz #2
• SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system.
• Q1: What is the new system's projected CPU utilization?
• Q2: How can this system's workload be managed to never exceed 75% CPU utilization?
Pop Quiz #2: Answers
• SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system.
• Q1: What is the new system's projected CPU utilization?
• A1: 100%. Each of the four compute-bound threads will keep one CPU 100% busy.
• Q2: How can this system's workload be managed to never exceed 75% CPU utilization?
• A2a: Huh? Why would anyone want to do that?
• A2b: Resource management.
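The reasoning behind A1 can be sketched in Python. The function and parameter names are hypothetical, and the sketch assumes each box's BogoMIPs rating divides evenly across its CPUs and that OLTP demand scales inversely with per-CPU speed; a compute-bound thread, by contrast, occupies a whole CPU however fast that CPU is.

```python
def projected_utilization(old_cpus, old_rating, old_busy_fraction,
                          compute_threads, new_cpus, new_rating):
    """Project CPU utilization on the new box, capped at 1.0 (100%)."""
    old_per_cpu = old_rating / old_cpus
    new_per_cpu = new_rating / new_cpus
    # Busy old CPUs not explained by compute-bound threads are the OLTP share.
    oltp_old_cpus = old_busy_fraction * old_cpus - compute_threads
    # OLTP needs fewer of the faster CPUs to do the same work.
    oltp_new_cpus = oltp_old_cpus * old_per_cpu / new_per_cpu
    # Each compute-bound thread soaks up one whole CPU, whatever its speed.
    demand_cpus = min(compute_threads, new_cpus) + oltp_new_cpus
    return min(1.0, demand_cpus / new_cpus)


print(projected_utilization(8, 1000, 0.75, 4, 4, 2000))
```

A straight-line projection from BogoMIPs alone (75% of 1000 onto 2000) would have predicted under 40% busy, which is the trap the quiz is pointing at.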
Pop Quiz #3
• SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?)
• Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking – no need for precision here!)
Pop Quiz #3: Answers
• SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?)
• Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking – no need for precision here!)
• A1: They should run almost 4x faster. Each new CPU is 4x faster than the old ones. (2000/4)/(1000/8) = 4. The OLTP will use some of the CPU cycles, but its service demand pales next to the compute jobs.
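The per-CPU ratio in A1 is a one-line calculation, shown here as a Python sketch. The function name is hypothetical, and the sketch assumes, as the answer does, that each box's rating divides evenly across its CPUs.

```python
def per_cpu_speedup(old_rating, old_cpus, new_rating, new_cpus):
    """Ratio of per-CPU speed on the new box to per-CPU speed on the old box."""
    return (new_rating / new_cpus) / (old_rating / old_cpus)


print(per_cpu_speedup(1000, 8, 2000, 4))  # (2000/4) / (1000/8)
```

The box-level rating only doubled, yet a single-threaded compute job sees the per-CPU ratio, not the aggregate one.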
Pop Quiz #4
• ESSAY QUESTION: “At what point do these principles become difficult?”