Performance Tuning EC2 Instances


November 12, 2014 | Las Vegas, NV
PFC306: Performance Tuning EC2 Instances
Brendan Gregg, Performance Engineering, Netflix

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.


Brendan Gregg
Senior Performance Architect, Netflix
Linux and FreeBSD performance
On the Performance Engineering Team, led by Coburn Watson (and we're hiring!)
Recent work: Linux perf-tools, using ftrace & perf_events
Previous work includes: USE Method, flame graphs, heat maps, DTrace tools
Sysadmin, training, kernel engineering, performance

Massive Amazon EC2 Linux cloud
Tens of thousands of server instances
Auto scale by ~3k each day
CentOS and Ubuntu
FreeBSD for content delivery
Approx. 33% of US Internet traffic at night
Performance is critical
Customer satisfaction: now over 50M subscribers
$$$ price/performance
Develop tools for cloud-wide and instance analysis

Netflix Performance Engineering Team
Evaluate technology: instance types, Amazon EC2 options
Recommendations & best practices: instance kernel tuning, assist app tuning
Develop performance tools: tools for observability and analysis
Project support: new database, programming language, software change
Incident response: performance issues, scalability issues

Agenda
1. Instance Selection
2. Amazon EC2 Features
3. Kernel Tuning
4. Observability

Performance Tuning on Amazon EC2
In the Netflix cloud, everything is a tunable, including the instance type
Performance wins have immediate benefits
Great place to do performance engineering!

WARNINGS
This is what's in our medicine cabinet
Consider these best before: 2015
Take only if prescribed by a performance engineer

1. Instance Selection

The Netflix Cloud
Many different application workloads: compute, storage, caching

[Diagram: Applications (Services) behind ELB, using EC2, S3, Cassandra, EVCache, Elasticsearch, SQS, SES]

Current Generation Instance Families
i2: storage-optimized. SSD, large capacity storage
r3: memory-optimized. Lowest cost/Gbyte
c3: compute-optimized. Latest CPUs, lowest price/compute perf
m3: general purpose. Balanced
Plus some others

i2.8xlarge

Instance Sizes
Ranges from medium to 8xlarge, depending on type
Netflix has over 30 instance types in use
Traditional: tune the workload to match the server
Cloud: find an ideal workload and instance type combination
Instead of: given A, optimize B; this is: optimize A+B
Greater flexibility, best price/performance

Netflix Instance Type Selection
Flow Chart
By-Resource
Brute Force

Instance Selection Flow Chart

[Flow chart: Start → need large disk capacity? Y → i2. N → disk I/O bound? Y → can cache? Y → r3, selecting memory to cache the working set. Not disk bound: CPU bound → c3; memory bound → r3; otherwise find the best balance → m3. Then choose a size (up to 8xl); exceptions and trade-offs apply.]

Netflix AWS Environment
Elastic Load Balancing allows instance types to be tested with real load
Single instance canary, then Auto Scaling Group
Much better than micro-benchmarking alone, which is extremely error prone

[Diagram: an ELB fronting ASG cluster prod1: ASG-v010 with multiple instances, a single-instance canary, then ASG-v011 with multiple instances]

By-Resource Approach
Determine the bounding resource, e.g., CPU, disk I/O, or network I/O
Found using:
Estimation (expertise)
Resource observability with an existing real workload
Resource observability with a benchmark or load test (experimentation)
Choose the instance type for the bounding resource
If disk I/O, consider caching, and a memory-optimized type
We have tools to aid this choice: Nomogram Visualization
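
As a rough sketch of the observability step, standard sysstat tools can watch an existing workload to suggest the bounding resource (interpretation still takes expertise):

$ sar -u 1 5        # CPU utilization
$ sar -d 1 5        # disk I/O rates and queueing
$ sar -n DEV 1 5    # network throughput per interface
$ sar -r 1 5        # memory usage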

Nomogram Visualization Tool


Select instance families
Select resources
From any desired resource, see types & cost


E.g., 8 vCPU:

By-Resource Approach, cont.
This focuses on optimizing a given workload
More efficiency can be found by adjusting the workload to suit different instance types

Brute Force Choice
Run a load test on ALL instance types
Optionally different workload configurations as well
Measure throughput, and check for acceptable latency
Calculate price/performance for all types
Choose the most efficient type
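
A minimal sketch of the brute-force loop, assuming a hypothetical run_load_test harness and illustrative placeholder prices (both are assumptions, not real tools or rates):

#!/bin/bash
# For each candidate type, load test a canary group and compute
# throughput per dollar. run_load_test is a hypothetical harness that
# prints requests/sec; the prices below are placeholders, not real rates.
declare -A price=( [m3.2xlarge]=0.50 [c3.2xlarge]=0.45 [r3.2xlarge]=0.70 )
for type in "${!price[@]}"; do
  tput=$(run_load_test --target "canary-$type")
  echo "$type: $(echo "$tput / ${price[$type]}" | bc -l) req/s per \$/hr"
done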

Latency Requirements
Check for an acceptable latency distribution when optimizing for price/performance

[Chart: average latency (response time) vs. throughput. Latency stays low while headroom remains, then degrades; measure price/performance within the acceptable region, before latency becomes unacceptable.]

Netflix Instance Type Re-selection
1. Variance
2. Usage
3. Cost

1. Instance Variance
An instance type may be resource constrained only occasionally, after warmup, or after a code change
Continually monitor performance, analyze variance/outliers

2. Instance Usage

Older instance type usage can be quickly identified using internal tools, re-evaluated, and upgraded to newer types

3. Instance Cost
Also checked regularly. Tuning the price in price/perf.


[Chart: cost per hour, by service]

2. EC2 Features

EC2 Features
A) Xen Modes: HVM, PV
B) SR-IOV

Xen Modes
The best performance is the latest PV/HVM hybrid mode that Xen supports
On Amazon EC2:
HVM == PVHVM, if your Linux kernel version supports it
PV == PV
Current fastest on Amazon EC2, in general: HVM (PVHVM)
Future fastest, in general: should be PVH
WARNINGS: your mileage may vary, and beware of outdated info
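
To check which mode a given instance is actually running, a couple of hedged checks (assuming a Linux guest with Xen support; the AMI ID is a placeholder, and the AWS CLI query reads the image's VirtualizationType attribute):

$ cat /sys/hypervisor/type
xen
$ dmesg | grep -i xen | head -3     # HVM guests log Xen HVM boot messages
$ aws ec2 describe-images --image-ids ami-xxxxxxxx \
      --query 'Images[].VirtualizationType'     # "hvm" or "paravirtual"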

Xen Modes
[Table: Xen virtualization modes, from PV to HVM with PV drivers (PVHVM), and their relative performance]

SR-IOV
Single Root I/O Virtualization (SR-IOV): a PCIe device provides virtualized instances of itself
AWS Enhanced Networking
Available for some instance types: C3, R3, I2
VPC only
Example improvements:
Improved network throughput
Reduced average network RTT, and reduced jitter
Improvement depends on instance type and network
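
A quick way to confirm enhanced networking is active (a sketch: the driver check assumes the Intel 82599 VF that EC2 used for SR-IOV at the time; the instance ID is a placeholder):

$ ethtool -i eth0 | grep ^driver
driver: ixgbevf        # ixgbevf => SR-IOV VF in use (vs. the Xen netfront driver)
$ aws ec2 describe-instance-attribute --instance-id i-xxxxxxxx \
      --attribute sriovNetSupport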

3. Kernel Tuning

Kernel Tuning
Typically 5-25% wins, for average performance
Adds up to significant savings for the Netflix cloud
Bigger wins when reducing latency outliers
Deploying tuning:
Generic performance tuning is baked into our base AMI
Experimental tuning is a package add-on (nflx-kernel-tunables)
Workload-specific tuning is configured in application AMIs
Remember to tune the workload with the tunables

OS Distributions
Netflix uses both CentOS and Ubuntu
Operating system distributions can override tuning defaults
Different kernel versions can provide different defaults, and additional tunables
Tunable parameters include sysctl and /sys
The following tunables provide a look into our current medicine cabinet.
Your mileage may vary. Only take if prescribed by a performance engineer.

Tuning Targets
1. CPU Scheduler
2. Virtual Memory
3. Huge Pages
4. File System
5. Storage I/O
6. Networking
7. Hypervisor (Xen)

1. CPU Scheduler
Tunables: scheduler class, priorities, migration latency, tasksets
Usage:
Some apps benefit from reducing migrations using taskset(1), numactl(8), cgroups, and tuning sched_migration_cost_ns
Some Java apps have benefited from SCHED_BATCH, to reduce context switching. E.g.:

# schedtool -B PID
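
A few related hedged examples for the other tunables named above (values are illustrative; test on your own workload before deploying):

# taskset -cp 0-7 PID                                # pin an existing PID to CPUs 0-7
# numactl --cpunodebind=0 --membind=0 command        # keep CPU and memory on one NUMA node
# sysctl -w kernel.sched_migration_cost_ns=5000000   # discourage migrations (ns)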

2. Virtual Memory
Tunables: swappiness, overcommit, OOM behavior
Usage:
Swappiness is set to zero to disable swapping and favor ditching the file system page cache first to free memory. (This tunable doesn't make much difference, as swap devices are usually absent.)

vm.swappiness = 0 # from 60

3. Huge Pages
Page sizes of 2 or 4 Mbytes, instead of 4 Kbytes, should reduce various CPU overheads and improve MMU page translation cache reach.
Tunables: explicit huge page usage, transparent huge pages (THPs)
Usage:
THPs (enabled in later kernels), depending on the workload & CPUs, sometimes improve perf (~5% lower CPU), but sometimes hurt perf (~25% higher CPU during %usr, and more during %sys refrag). To disable:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled # from madvise
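
Note that /sys settings don't persist across reboot, so this echo belongs in a boot script. To check the current setting first (the active value is shown in brackets):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never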

4. File System
Tunables: page cache flushing behavior, file system type and its own tunables (e.g., ZFS on Linux)
Usage:
Page cache flushing is tuned to provide a more even behavior: background flush earlier, aggressive flush later
Access timestamps disabled, and other options depending on the FS

vm.dirty_ratio = 80                # from 40
vm.dirty_background_ratio = 5      # from 10
vm.dirty_expire_centisecs = 12000  # from 3000
mount -o defaults,noatime,discard,nobarrier

5. Storage I/O
Tunables: read-ahead size, number of in-flight requests, I/O scheduler, volume stripe width
Usage:
Some workloads, e.g., Cassandra, can be sensitive to read-ahead size
SSDs can perform better with the noop scheduler (if not default already)
Tuning md chunk size and stripe width to match the workload
/sys/block/*/queue/rq_affinity    2
/sys/block/*/queue/scheduler      noop
/sys/block/*/queue/nr_requests    256
/sys/block/*/queue/read_ahead_kb  256
mdadm --chunk=64 ...
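
Since /sys settings don't survive a reboot, a small boot-time script (e.g., run from rc.local) can apply them; this sketch assumes Xen block devices named xvd* and the values above:

#!/bin/bash
# Apply block queue tunables to all xvd* devices at boot (run as root).
for dev in /sys/block/xvd*; do
  echo 2    > "$dev/queue/rq_affinity"
  echo noop > "$dev/queue/scheduler"
  echo 256  > "$dev/queue/nr_requests"
  echo 256  > "$dev/queue/read_ahead_kb"
done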

6. Networking
Tunables: TCP buffer sizes, TCP backlog, device backlog, TCP reuse, etc.
Usage:
net.core.somaxconn = 1000
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_abort_on_overflow = 1 # maybe
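
To make sysctl settings like these persistent, they can go in a sysctl file and be applied at once (the file name is an arbitrary example; distros without /etc/sysctl.d can use /etc/sysctl.conf):

# cat > /etc/sysctl.d/90-tunables.conf       # paste the settings above
# sysctl -p /etc/sysctl.d/90-tunables.conf   # apply now; also read at boot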

7. Hypervisor (Xen)
Tunables:
PV/HVM (baked into the AMI)
Kernel clocksource. From slow to fast: hpet, xen, tsc
Usage:
We've encountered a Xen clocksource regression moving to Ubuntu Trusty; fixed by tuning the clocksource to TSC (although beware of clock drift). Best case example (so far): CPU usage reduced by 30%, and average app latency reduced by 43%.
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
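
Before switching, it's worth checking which clocksources the kernel offers and which is currently in use (example output from a Xen guest):

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
xen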

4. Observability

Observability
Finding, quantifying, and confirming tunables
Discovering system wins (5-25%) and application wins (2-10x)

Inefficient Tuning
Whack-A-Mole Anti-Method: tune things at random until the problem goes away
Street Light Anti-Method:
1. Pick observability tools that are familiar, found on the Internet, or chosen at random
2. Run tools
3. Look for obvious issues

Efficient Tuning
Based on observability (data-driven): observe workload and resource usage
Analyze results:
Identify software and hardware components in use
Study in-use components for tunables
Quantify expected speed-up
Tune the highest value targets, including experimental ones
Example methodology: the USE Method

USE Method
For every hardware and software resource, check:
Utilization
Saturation
Errors
Resource constraints show as saturation or high utilization:
Resize or change instance type
Investigate tunables for the resource
The USE Method poses questions, which you then use tools to answer
[Diagram: a resource, annotated with utilization (%), saturation, and errors (X)]
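
One possible mapping of USE questions to stock Linux tools (a sketch; each command answers part of utilization/saturation/errors for one resource):

$ mpstat -P ALL 1    # CPU utilization, per CPU
$ vmstat 1           # CPU saturation: run-queue length ("r") vs. CPU count
$ iostat -xz 1       # disk utilization (%util) and saturation (await, avgqu-sz)
$ sar -n EDEV 1      # network interface errors
$ free -m            # memory capacity utilization
$ dmesg | tail       # errors, e.g., OOM kills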

More Inefficient Tuning
Blame-Someone-Else Anti-Method:
1. Find a system or environment component you are not responsible for, or cannot observe
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
E.g., blaming the instance

Exonerating the Instance
Observability often exonerates the instance
Complex perf issue with no obvious cause: an instance issue? After analysis, sometimes yes; usually it's the app (80/20 rule)
The 80/20 Rule:
80%: improvement gained by application refactoring and tuning
20%: OS tuning, instance or infrastructure improvement, etc.
There is also the 20/80 Latency Outlier Rule:
20%: latency outliers are caused by application code
80%: caused by instance, crontab, network, JVM GC, etc.

Or Finding the Monkey
The Netflix Latency Monkey injects latency in odd places
Part of the Simian Army: a test suite for our fault-tolerant architecture

Analysis Perspectives
Workload Analysis (a.k.a. top-down analysis): begin with the application workload, then break down request time
Resource Analysis (a.k.a. bottom-up analysis): begin with resource performance, and then their workloads. E.g., the USE Method, followed by workload characterization
[Diagram: load flowing down through Application, System Libraries, System Calls, Kernel, Devices; workload analysis works top-down, resource analysis bottom-up]

Instance Observability Tools
A. Linux performance tools
B. Netflix custom tools

A. Linux Performance Tools
1. Statistical tools
2. Profiling tools
3. Tracing tools
4. Hardware counters

1. Statistical Tools
vmstat, pidstat, sar, etc., used mostly normally

$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b)  08/18/2014  _x86_64_  (8 CPU)

09:10:43 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:44 PM        lo     14.00     14.00      1.34      1.34      0.00      0.00      0.00
09:10:44 PM      eth0   4114.00   4186.00   4537.46  28513.24      0.00      0.00      0.00

09:10:43 PM  active/s passive/s    iseg/s    oseg/s
09:10:44 PM     21.00      4.00   4107.00  22511.00

09:10:43 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
09:10:44 PM      0.00      0.00     36.00      0.00      1.00
[...]

2. Profiling Tools
Profiling: characterizing usage. E.g.:
Sampling on-CPU stack traces to explain CPU usage
Frequency counting object types to explain memory usage
Profiling leads to tuning:
Hot code path -> any related tunables/config?
Frequent object allocation -> any way to avoid this?

Profiling Types
Application profiling:
Depends on application and language
E.g., Java Flight Recorder, YourKit, Lightweight Java Profiler
Many tools are inaccurate or broken; test, verify, cross-check
System profiling:
Linux perf_events (the perf command)
ftrace can do kernel function counts
SystemTap has various profiling capabilities

Application Profiling: LJP
The Lightweight Java Profiler (LJP)
Basic, open source, free, asynchronous CPU profiler
Uses an agent that dumps hprof-like output
https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted
The profiling output can be visualized as flame graphs
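
A minimal usage sketch, following the LJP getting-started steps (the agent path and the traces.txt output name depend on your build and configuration):

$ java -agentpath:/path/to/liblagent.so -jar app.jar                 # run with the LJP agent
$ ./stackcollapse-ljp.awk < traces.txt | ./flamegraph.pl > ljp.svg   # from the FlameGraph repo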

[Screenshot: Java CPU flame graph. Each box is a stack frame; vertical position shows ancestry; mouse-over frames to quantify.]

System Profiling: perf_events
perf_events sampling of CPU stack traces can show:
All application logic (whether high-level logic can be seen depends on the app and VM)
JVM internals & libraries
The Linux kernel
perf CPU flame graphs:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 99 -ag -- sleep 60
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg

[Screenshot: system CPU flame graph, showing broken Java stacks (missing frame pointer), kernel TCP/IP paths, GC, locks, epoll, and the idle thread]

3. Tracing Tools
Many system tracers exist for Linux:
ftrace
perf_events
eBPF
SystemTap
ktap
LTTng
dtrace4linux
Oracle Linux DTrace
sysdig
I'll summarize ftrace & perf_events

ftrace
Part of the Linux kernel: first added in 2.6.27 (2008), and enhanced in later releases
Use directly via /sys/kernel/debug/tracing
Already available in all Netflix Linux instances
Front-end tools aid usage: the perf-tools collection
https://github.com/brendangregg/perf-tools
Unsupported hacks: see the WARNINGs
Also see the trace-cmd front-end, as well as perf
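
For a feel of the raw interface, without front-ends (a minimal sketch, run as root; tracepoint names vary by kernel version):

# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_issue/enable   # enable a block I/O tracepoint
# cat trace_pipe                                # watch events live; Ctrl-C to stop
# echo 0 > events/block/block_rq_issue/enable   # disable when done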

ftrace tool: iosnoop

# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM       PID   TYPE DEV    BLOCK     BYTES LATms
5982800.302061  5982800.302679  supervise  1809  W    202,1  17039600  4096   0.62
5982800.302423  5982800.302842  supervise  1809  W    202,1  17039608  4096   0.42
5982800.304962  5982800.305446  supervise  1801  W    202,1  17039616  4096   0.48
5982800.305250  5982800.305676  supervise  1801  W    202,1  17039624  4096   0.43
[...]

# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
                 -d device       # device string (eg, "202,1")
                 -i iotype       # match type (eg, '*R*' for all reads)
                 -n name         # process name to match on I/O issue
                 -p PID          # PID to match on I/O issue
                 -Q              # include queueing time in LATms
                 -s              # include start time of I/O (s)
                 -t              # include completion time of I/O (s)
[...]

perf_events
Part of the Linux source; added from linux-tools-common, etc.
Powerful multi-tool and profiler:
Interval sampling, CPU performance counter events
User and kernel dynamic tracing
Kernel line tracing and local variables (debuginfo)
Kernel filtering, and in-kernel counts (perf stat)
Needs kernel debuginfo for advanced uses; a difficulty for our instances, since it's > 100 Mbytes

perf_events Example

# perf record -e skb:consume_skb -ag -- sleep 10
# perf report
[...]
 74.42%  swapper  [kernel.kallsyms]  [k] consume_skb
          |
          --- consume_skb
              arp_process
              arp_rcv
              __netif_receive_skb_core
              __netif_receive_skb
              netif_receive_skb
              virtnet_poll
              net_rx_action
              __do_softirq
              irq_exit
              do_IRQ
              ret_from_intr
[...]

Summarizing stack traces for a tracepoint

perf_events can do many things; it is hard to pick just one example

4. Hardware Counters
Model Specific Registers (MSRs):
Basic details: timestamp clock, temperature, power
Some are available in Amazon EC2
Performance Monitoring Counters (PMCs):
Advanced details: cycles, stall cycles, cache misses
Not available in Amazon EC2 (by default)
Root cause CPU usage at the cycle level
E.g., higher CPU usage due to more memory stall cycles

MSRs
Can be used to verify the real CPU clock rate, which can vary with turbo boost. Important to know for perf comparisons.
Tool from https://github.com/brendangregg/msr-cloud-tools:

ec2-guest# ./showboost
CPU MHz     : 2500
Turbo MHz   : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...

TIME      C0_MCYC      C0_ACYC      UTIL  RATIO  MHz
06:11:35  6428553166   7457384521   51%   116%   2900
06:11:40  6349881107   7365764152   50%   115%   2899
06:11:45  6240610655   7239046277   49%   115%   2899
[...]

The MHz column shows the real CPU clock rate.
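
For raw MSR access there is msr-tools (a hedged example; register addresses are model-specific, and availability depends on what the hypervisor exposes — 0x10 is IA32_TIME_STAMP_COUNTER on Intel):

$ sudo modprobe msr    # expose /dev/cpu/*/msr
$ sudo rdmsr 0x10      # read the TSC from CPU 0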

B. Netflix Online Tools
Netflix develops many online web-based tools for cloud-wide performance observability
Per-instance capable tools include:
Atlas
Vector

Atlas
Cloud-wide and instance monitoring. Resource and app metrics, trends.


[Screenshot: Atlas UI, with region, app, breakdowns, metrics, and options selectors, an interactive graph, and summary statistics]

Atlas
All metrics in one system
System metrics: CPU usage, disk I/O, memory
Application metrics: latency percentiles, errors
Filters or breakdowns by region, application, ASG, metric, instance
Quickly narrow an investigation to an instance or resource type
Can identify potential instance type changes from Atlas data

Vector
Real-time per-second instance metrics. On-demand profiling.


[Screenshot: Vector UI, showing utilization, saturation, and errors, with per-device breakdowns]

Vector
Given an instance, analyze low-level performance
On-demand: CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics
Quick: GUI-driven root cause analysis
Scalable: other teams can use it easily
Currently in development (beta use)

Summary
1. Instance Selection
2. Amazon EC2 Features
3. Kernel Tuning
4. Observability

References & Links
Amazon EC2:
http://aws.amazon.com/ec2/instance-types/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html
Netflix on EC2:
http://www.slideshare.net/cpwatson/cpn302-yourlinuxamioptimizationandperformance
http://www.brendangregg.com/blog/2014-09-27/from-clouds-to-roots.html
Xen Modes:
http://www.brendangregg.com/blog/2014-05-07/what-color-is-your-xen.html
More Performance Analysis:
http://www.brendangregg.com/linuxperf.html
http://www.slideshare.net/brendangregg/linux-performance-tools-2014
http://www.brendangregg.com/USEmethod/use-linux.html
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
https://github.com/brendangregg/FlameGraph
https://github.com/brendangregg/perf-tools

Thanks
Netflix Performance Engineering Team:
Coburn Watson
Scott Emmons: nomogram visualization, insttypes
Martin Spier: Vector
Amer Ather: tracing, Vector
Vadim Filanovsky: AWS cost reporting
Netflix Insight Engineering Team:
Roy Rapoport, et al.: Atlas
Amazon Web Services:
Bryan Nairn, Callum Hughes

Netflix talks at re:Invent

Talk     Time                Title
PFC-305  Wednesday, 1:15pm   Embracing Failure: Fault Injection and Service Reliability
BDT-403  Wednesday, 2:15pm   Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 3:30pm   Performance Tuning EC2
DEV-309  Wednesday, 3:30pm   From Asgard to Zuul: How Netflix's Proven Open Source Tools Can Accelerate and Scale Your Services
ARC-317  Wednesday, 4:30pm   Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm   Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm   Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am      Scheduling Using Apache Mesos in the Cloud


Please give us your feedback on this presentation. Join the conversation on Twitter with #reinvent.

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.