The Performance People
Performance Management with Free and Bundled Tools
Adrian Cockcroft, Netflix Inc.
(Co-authored with Mario Jauvin, MFJ Associates)
11 April 2023
April 11, 2023 Adrian Cockcroft and Mario Jauvin
Agenda
Overview of Capacity Planning Requirements and Data Sources
Performance Data Collection
Free Network Monitoring Tools
Free System Monitoring Tools
Free Load Generation and Modelling Tools
Licences and References
What are we talking about?
Network monitoring with WireShark, MRTG, BigSister, Cacti, Nagios, OpenNMS, Zenoss, Openxtra, ntop
Database Tier monitoring with SEtoolkit, Orca, XEtoolkit
Application Tier monitoring with Orca, Cacti, BigSister, Ganglia, XEtoolkit
QA Load generation with Grinder or SLAMD, modelling with PDQ and R
Capacity Planning Requirements and Data Sources
Definitions
Capacity
– Resource utilization and headroom
Planning
– Predicting future needs by analyzing historical data and modeling future scenarios
Performance Monitoring
– Collecting and reporting on performance data
Free Tools
– Bundled with the OS or available for no $$$
Capacity Planning Requirements
We care about CPU, Memory, Network and Disk resources, and Application response times
We need to know how much of each resource we are using now, and will use in the future
We need to know how much headroom we have to handle higher loads
We want to understand how headroom varies, and how it relates to application response times and throughput
CPU Capacity Measurements
CPU Capacity is defined by CPU type and clock rate, or a benchmark rating like SPECint_rate2000
CPU utilization is defined as busy time divided by elapsed time for each CPU
CPU load average measures the average number of jobs running and ready to run
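The utilization definition above can be sketched in a few lines. This is a minimal illustration, assuming busy time is available as a cumulative counter in seconds; the function names, the sample numbers and the 70% planning limit are hypothetical, and treating headroom as "limit minus utilization" is one simple convention rather than the only one.

```python
def cpu_utilization(busy_start, busy_end, t_start, t_end):
    """Busy time divided by elapsed time for one CPU over a sample interval."""
    elapsed = t_end - t_start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return (busy_end - busy_start) / elapsed


def headroom(utilization, limit=1.0):
    """Remaining capacity below a chosen utilization limit (assumed convention)."""
    return max(0.0, limit - utilization)


# Hypothetical sample: the CPU accumulated 12 s of busy time in a 60 s interval.
util = cpu_utilization(busy_start=100.0, busy_end=112.0, t_start=0.0, t_end=60.0)
room = headroom(util, limit=0.7)  # headroom against a 70% planning limit
print(f"utilization={util:.0%} headroom={room:.0%}")
```

On a multiprocessor the same calculation is repeated per CPU, as the slide states.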
Memory Capacity Measurements
Physical Memory Capacity Utilization and Limits
– Kernel memory
– Shared Memory segment
– Executable code, stack and heap
– File system cache usage
– Unused free memory
Virtual Memory Capacity - Swap Space
Memory Throughput
– Page in and page out rates
Network Capacity Measurements
Network Interface Throughput
– Byte and packet rates input and output
TCP Protocol Specific Throughput
– TCP connection count and connection rates
– TCP byte rates input and output
NFS/SMB Protocol Specific Throughput
– Byte rates read and write
– NFS/SMB service response times
HTTP Protocol Specific Throughput
– HTTP operation rates
– Get and put payload byte rates and size distribution
Disk Capacity Measurements
Detailed metrics vary by platform
Easy for the simple disk cases
Hard for cached RAID subsystems
Almost impossible for shared disk subsystems and SANs
– Another system or volume can be sharing a backend spindle; when it gets busy your own volume can saturate, even though you did not change your own workload
Capacity Planning Challenges
Constantly changing infrastructure
Limited attention span from staff
Horizontally scaled commodity systems
Per node software licensing costs too much
Too many tools, too many agents per node
Too much data, not enough analysis
Non-linear and non-intuitive scalability
Lack of tools and metrics for virtualized resources
Observability
Four different viewpoints
– Management
– Engineering
– QA Testing
– Operations
Each needs very different information
Ideal would be different views of the same performance database
Reality is a mess of disjoint tools
Management Viewpoint
Daily summary of status and problems
Business oriented metrics
Future scenario planning
Marketing and management input
Concise report with dashboard style status indicators
Free tools: R, Spreadsheet and Web based displays, no good summarization tools
Engineering Viewpoint
Large volumes of detailed data at several different time scales
Input to tuning, reconfiguring and future product development
Low level problem diagnosis
Detailed reports with drill down and correlation analysis
Free tools: XE/SE Toolkit, Orca, Ganglia, Cacti, R
QA Test Viewpoint
Workload specification tools
Load generation frameworks
Testing for functionality and performance
Regression tools to compare releases
Modelling difference between test configuration and production configuration
Free Tools: The Grinder, SLAMD, R, PDQ
Operations Viewpoint
Immediate timeframe
Real time display, updated in seconds
Alert based monitoring
High level problem diagnosis
Simple high level graphs and views
Free tools: BigSister, Nagios, OpenNMS, MRTG, Cacti, Ganglia, WireShark, ntop
Measurement Data Interfaces
Several generic raw access methods
– Read the kernel directly (not a good idea)
– Structured system data (Solaris kstat, Linux /proc)
– Process data
– Network data
– Accounting data
– Application data
Command based data interfaces
– Scrape data from vmstat, iostat, netstat, sar, ps
– Higher overhead, lower resolution, missing metrics
Data available is platform specific either way
Much more detail on this topic in the Solaris/Linux Performance Measurement and Tuning Class
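As an illustration of the structured-data route, here is a minimal sketch that parses the per-CPU tick counters in the Linux /proc/stat text format. The function name and the embedded sample text are hypothetical; counting iowait as idle time is an assumption of this sketch, and only the first eight fields are summed because later guest fields are already included in user time.

```python
def parse_proc_stat_cpu(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} from /proc/stat-style text."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime and other non-CPU lines
        # Field order: user nice system idle iowait irq softirq steal
        ticks = [int(v) for v in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        result[fields[0]] = (total - idle, total)
    return result


# Hypothetical sample in /proc/stat layout; on Linux, read the real file instead:
#   with open("/proc/stat") as f: text = f.read()
SAMPLE = """cpu  100 0 50 800 50 0 0 0
cpu0 100 0 50 800 50 0 0 0
intr 12345
"""

stats = parse_proc_stat_cpu(SAMPLE)
busy, total = stats["cpu"]
print(f"busy={busy} total={total}")
```

The counters are cumulative, so a utilization figure needs two samples and the difference between them.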
Free Network Monitoring Tools
SNMP
Simple network management protocol
UDP based protocol on port 161
Client/server like
– Client is called management application entity
– Server is called an agent entity
Agent entity is designed to be implemented on network hardware: routers, switches, etc.
SNMP – MIBs
Management information base
Defines the structure and the semantics of the information that can be reported on
Most commonly used is MIB-II which defines a set of standard networking attributes
– Interface tables
– System level information
– Routing tables
Specified using ASN.1 (Abstract Syntax Notation One)
SNMP – commands
Called PDU (protocol data units)
GET
GETNEXT
GETBULK
SET
Encoded using BER (basic encoding rules)
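To give a feel for BER, the sketch below encodes an object identifier the way it appears inside an SNMP request. It is a hand-written illustration, not taken from any SNMP library: the function names are hypothetical, and it only handles the short length form (bodies up to 127 bytes). The example OID is sysDescr.0 from MIB-II.

```python
def encode_base128(value):
    """Base-128 encoding; the high bit is set on every byte except the last."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))


def ber_encode_oid(oid):
    """Encode a dotted OID string as a BER OBJECT IDENTIFIER (tag 0x06)."""
    arcs = [int(a) for a in oid.strip(".").split(".")]
    if len(arcs) < 2:
        raise ValueError("an OID needs at least two arcs")
    # The first two arcs are packed into one value: 40 * first + second.
    body = encode_base128(40 * arcs[0] + arcs[1])
    for arc in arcs[2:]:
        body += encode_base128(arc)
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(body)]) + body


encoded = ber_encode_oid("1.3.6.1.2.1.1.1.0")  # sysDescr.0
print(encoded.hex())  # 06082b06010201010100
```

Each PDU is built from nested tag-length-value triples like this one.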
Versions
Version 1, original version done in May 1991
Version 2, around 1993. Failed because the IETF credo of “rough consensus and running code” could not be met on securing SNMP
Turned into V2c for community string security (like V1)
Version 3, added security and complexity in 1998
SNMP tools
Too numerous to name all but…
OpenNMS
Nagios
Cacti
MRTG
Net-snmp
– See www.snmplink.org
SNMP tools
snmpwalk – will report all data in a specified MIB
getIf – will report data about interfaces and includes built-in MIB browser
snmptable – will report tabular data from MIB tables
OpenNMS
Well…. it’s not that portable
– 95% java is not 100% java
– Requires about 20-30 different platform specific packages (PostgreSQL, Perl, RRD tool, Tomcat 4 etc…)
– Difficult to install
– Easy auto discovery
– Web-based interface
OpenNMS
Main screen shot
OpenNMS
Node screen shot
Nagios
Easy to build/compile (on Solaris 10)
Easy to install
Quick response from CGI
Configuration is manual and a pain
– 13 configuration files with all kinds of interrelated entries
– Tedious and error prone
Requires plugins to do anything
Nagios
Main screen shot
Nagios
Host detail screen shot
ntop
Similar to the familiar UNIX top tool for processes, but used for network traffic
Provides a huge selection of real-time data
Can be found at http://www.openxtra.co.uk/
ntop – Active Sessions
ntop Hosts
ntop Network Load
ntop Network Throughput
ntop Port Dist
ntop Protocol Dist
ntop Protocols
Zenoss
Open source monitoring and management of IT infrastructure
Zenoss core is free
Other editions are for a fee
Get it from http://www.zenoss.com/download/
zenoss Architecture
zenoss Dash Config
zenoss Google
zenoss Google Alerts
Zenoss Graphs
zenoss Topology
MRTG
Really simple to install and configure
Requires manual config file creation
Only for MIB-II interface plotting out of the box
Graphing is not flexible (axes, time ranges, etc.)
MRTG
Interface screen shot
MRTG
Other CPU screen shot
RRD tool
Software to store, retrieve and graph numerical time series data
Uses a round robin algorithm
Data files are a fixed size
– Don’t grow
– Don’t require maintenance
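The fixed-size property follows from the round robin idea: once the archive is full, each new sample overwrites the oldest slot. The class below is a toy model of that behaviour, not RRDtool's actual file format; the class and method names are hypothetical.

```python
class RoundRobinArchive:
    """Toy fixed-size archive: the newest sample overwrites the oldest one."""

    def __init__(self, slots):
        self.values = [None] * slots  # storage is allocated once and never grows
        self.next_slot = 0

    def update(self, value):
        self.values[self.next_slot] = value
        self.next_slot = (self.next_slot + 1) % len(self.values)

    def in_order(self):
        """Stored samples from oldest to newest, skipping empty slots."""
        n = len(self.values)
        ordered = [self.values[(self.next_slot + i) % n] for i in range(n)]
        return [v for v in ordered if v is not None]


rra = RoundRobinArchive(slots=3)
for sample in [10, 20, 30, 40, 50]:
    rra.update(sample)
print(rra.in_order())  # [30, 40, 50]
```

Five updates into three slots leave only the three most recent samples, and the storage size never changes.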
RRD tool
Compiles on most platforms
Used by many SNMP based tools
– OpenNMS
– Cacti
– BigSister
– WeatherMap4RRD
– MailGraph
RRD tool
14all CGI script that plots data similar to MRTG
Configurable to collect data at different intervals (unlike MRTG)
Flexible and variable in what data can be collected
RRD tool
Sample screen shot
RRD tool
Screen shot
RRD tool
Create a RRD database
rrdtool create test.rrd \
--start 920804400 \
DS:speed:COUNTER:600:U:U \
RRA:AVERAGE:0.5:1:24 \
RRA:AVERAGE:0.5:6:10
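How much history the two RRA lines above retain can be worked out from their fields (consolidation function, xfiles factor, steps per row, rows). The sketch below does that arithmetic, assuming the rrdtool default step of 300 seconds since the command gives no --step option; the helper name is hypothetical.

```python
def rra_span(spec, step_seconds=300):
    """Describe an RRA definition such as 'RRA:AVERAGE:0.5:6:10'."""
    kind, cf, xff, steps, rows = spec.split(":")
    if kind != "RRA":
        raise ValueError("not an RRA definition: " + spec)
    seconds_per_row = int(steps) * step_seconds
    return {
        "cf": cf,
        "seconds_per_row": seconds_per_row,
        "rows": int(rows),
        "span_seconds": seconds_per_row * int(rows),
    }


fine = rra_span("RRA:AVERAGE:0.5:1:24")    # one row per 5 minute step
coarse = rra_span("RRA:AVERAGE:0.5:6:10")  # one row per 6 steps (30 minutes)
print(fine["span_seconds"] / 3600, "hours at 5 minute resolution")
print(coarse["span_seconds"] / 3600, "hours at 30 minute resolution")
```

So this database keeps 2 hours of 5 minute averages and 5 hours of 30 minute averages, in a file whose size is fixed at creation.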
RRD tool
Create a graph
rrdtool graph speed.png \
--start 920804400 --end 920808000 \
DEF:myspeed=test.rrd:speed:AVERAGE \
LINE2:myspeed#FF0000
Free Performance Data Collection and Rules Toolkits
SE toolkit Example Tools
A free performance toolkit for rapidly creating custom data sources
Makes all the very extensive Solaris metrics easily available
Very system specific and not enough metrics exist to port to Linux
Written by Rich Pettit with contributions from Adrian Cockcroft
Get SE3.4 from http://sourceforge.net/projects/setoolkit/
Open source with support for SPARC & x86 Solaris 8, 9, 10

Function            Example SE Programs
Rule Monitors       cpg.se monlog.se mon_cm.se live_test.se percollator.se zoom.se virtual_adrian.se virtual_adrian_lite.se
Disk Monitors       siostat.se xio.se xiostat.se iomonitor.se iost.se xit.se disks.se
CPU Monitors        cpu_meter.se vmmonitor.se mpvmstat.se
Process Monitors    msacct.se pea.se ps-ax.se ps-p.se pwatch.se pw.se
Network Monitors    net.se tcp_monitor.se netmonitor.se netstatx.se nfsmonitor.se nx.se
Clones              iostat.se uname.se vmstat.se nfsstat-m.se perfmeter.se xload.se
Data browsers       aw.se infotool.se multi_meter.se
Contributed Code    anasa dfstats kview systune watch orcallator.se
Test Programs       syslog.se cpus.se pure_test.se collisions.se uptime.se dumpkstats.se net_example nproc.se kvmname.se
SE language features
SE is a 64bit interpreted dialect of C
– Not a new language to learn from scratch!
– Standard C /usr/ccs/bin/cpp used at runtime to preprocess SE scripts
– Main omissions - pointer types and goto
– Main additions - classes and “string” type
– Powerful ways to handle dynamically allocated data
– Built-in fast balanced tree routines for storing key indexed data
Dynamic linking to all existing C libraries
– Built-in classes access kernel data
– Supplied class code hides details, provides the data you want
Example scripts improve on basic utilities e.g. siostat.se, nx.se, pea.se
Example rule based monitors e.g. virtual_adrian.se, orcallator.se
Creating Rules
Based on real experiences of all the things that go wrong
Capture an approximation to intuition
Test and calibrate rules on as many systems as possible
Easy??
Configuring Rules
Thresholds should be configured
Very application dependent
Capture the operating envelope
– Measure the underlying values
– Measure peaks in normal operation
– Note values during problems
– Set thresholds to capture the difference
This applies to any tool
– SE Toolkit, Cacti, Ganglia, Nagios, OpenNMS etc.
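The operating envelope steps can be sketched as a small calibration helper. This is an illustration only: the function name and the run queue samples are hypothetical, and placing the threshold halfway between the normal peak and the lowest problem value is an assumed policy, not something prescribed by any of the tools named above.

```python
def calibrate_threshold(normal_samples, problem_samples):
    """Pick a threshold between the normal peak and the values seen in problems.

    Returns None when the two envelopes overlap, meaning this metric alone
    cannot separate normal operation from trouble.
    """
    normal_peak = max(normal_samples)
    problem_floor = min(problem_samples)
    if problem_floor <= normal_peak:
        return None
    return (normal_peak + problem_floor) / 2.0  # assumed midpoint policy


# Hypothetical run queue lengths for one application
run_queue_normal = [0.5, 1.2, 2.0, 1.8]
run_queue_problem = [6.0, 8.5, 7.2]

threshold = calibrate_threshold(run_queue_normal, run_queue_problem)
print("alert when run queue exceeds", threshold)
```

A None result is useful in itself: it says the threshold for that metric cannot be set from the data collected so far.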
Rules as Objects
Define only the input and output information
Hide implementation details
Make high level rule objects trivial to use and reuse
SE Toolkit does it in three lines of code:
– #include <rules file>
– Declare rule object as a typed variable
– Read and use or print object status
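The same pattern can be imitated outside SE. The Python sketch below is an analogy, not SE Toolkit code: the SwapRule class, its threshold values and its green/amber/red states are all hypothetical, chosen only to show a rule that exposes inputs and status while hiding how the status is derived.

```python
class SwapRule:
    """Hypothetical rule object: free swap in, status out, details hidden."""

    def __init__(self, warn_free_mb=256, alarm_free_mb=64):
        self._warn = warn_free_mb
        self._alarm = alarm_free_mb

    def status(self, free_swap_mb):
        if free_swap_mb < self._alarm:
            return "red"
        if free_swap_mb < self._warn:
            return "amber"
        return "green"


# Using the rule takes the same three steps as on the slide:
# bring the rule definition into scope, declare a rule object, read its status.
rule = SwapRule()
state = rule.status(free_swap_mb=128)
print("swap rule:", state)
```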
"virtual adrian" rules summary
Disk Rule for all disks at once
– Looks for slow disks and unbalanced usage
Network Rule for all networks at once
– Looks for slow nets and unbalanced usage
Swap Rule - Looks for lack of available swap space
RAM Rule - Looks for short page residence times
CPU Power Rule
– Scales on MP systems
– Looks for long run queue delays
Mutex Rule - Looks for kernel lock contention and high sys CPU time
TCP Rule
– Looks for listen queue problems
– Reports on connection attempt failures
XE Toolkit - www.xetoolkit.com
Complete re-write of SE Toolkit by Rich Pettit
– Extensible Java collector, customize with jar files
– Release 1.2 available April 2008
– Multi-platform support: Solaris, Linux/x86, Windows, BSD, OSX, HP-UX, AIX, Linux/s390, Linux/Power
Licensing
– Free GPL version for standard use and shared derivations
– Open source, hosted at http://sourceforge.net/projects/xe-toolkit/
– Commercial support available if needed
– Commercial product license for custom in-house derivations
Addresses all the issues people had with SE toolkit!
Captive Metrics / XE Toolkit Architecture
Free System Monitoring Tools
Collated Performance Data - Orca
Problems with time sync when collecting data from multiple tools
– No timestamp at all for vmstat, netstat, df...
– No timestamp by default for iostat and ps...
– No way to collect realtime stats from an http logfile
Use SE Toolkit to generate one timestamped row containing all the data
– First version of percollator.se written by Adrian Cockcroft in 1996
– Extended orcallator.se written by Blair Zajac a few years later
– Graphs generated by orca batch job feeding rrdtool based web pages
– Active community developing tool at http://www.orcaware.com
– Extended to collect much more data, including process workloads
– Basic data collection ported to Linux, HP-UX and Windows
Orca is basically MRTG for System metrics rather than Network
See http://www.orcaware.com/orca/docs/Orca_Understanding_Performance_Data.ppt
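The "one timestamped row" idea can be sketched independently of SE Toolkit. The code below is a simplified illustration and not the percollator.se or orcallator.se implementation: the function names, the column names and the constant-returning collectors are hypothetical stand-ins for real metric readers.

```python
import time


def collate_row(collectors, now=None):
    """Sample every collector once and stamp the whole row with a single time."""
    timestamp = int(time.time() if now is None else now)
    row = {"timestamp": timestamp}
    for name, collect in collectors.items():
        row[name] = collect()
    return row


def format_row(row, columns):
    """Render the row as one space separated line, in a fixed column order."""
    return " ".join(str(row[c]) for c in columns)


# Hypothetical collectors; real ones would read kstat, /proc or a log file.
collectors = {"load_avg": lambda: 1.5, "tcp_conns": lambda: 42}
row = collate_row(collectors, now=920804400)
line = format_row(row, ["timestamp", "load_avg", "tcp_conns"])
print(line)  # 920804400 1.5 42
```

Because every metric in a row shares one timestamp, rows from different sources never need to be re-aligned before plotting.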
Orca data collections
Collected using “procallator” reading info from /proc on Linux
[Uptime] [Average # Processes in Run Queue (Load Average)] [CPU Usage] [New Process Spawn Rate] [Number of System & Running Processes] [Context Switches & Interrupts Rate] [Interface Input Bits Per Second] [Interface Output Bits Per Second] [Interface Input Packets Per Second] [Interface Output Packets Per Second] [Interface Input Errors Per Second] [Interface Output Errors Per Second] [Interface Input Dropped Per Second] [Interface Output Dropped Per Second] [Interface Output Collisions] [Interface Output Carrier Losses] [TCP Current Connections] [IP Statistics] [TCP Statistics] [ICMP Statistics] [UDP Statistics] [Disk System Wide Reads/Writes Per Second] [Disk System Wide Transfer Rate] [Disk Reads/Writes Per Second] [Disk Transfer Rate] [Disk Space Percent Usage] [Physical Memory Usage] [Swap Usage] [Page Ins & Outs Rate] [Swap Ins & Outs Rate]
Orca on Solaris collects many more metrics than shown above
Strength of Orca is lots of detailed metrics with low overhead for collection
Easily customized to add more system metrics or application metrics
Orca can already track HTTP traffic and parse log files
All metrics are stored in “round robin database” format using RRDtool to generate displays over different time spans
Web page is simple collection of plots with drill down by metric or by time
Suitable for monitoring a relatively small number of systems in great detail, e.g. backend database servers
Cacti – www.cacti.net
Web based user interface based on RRDtool
More sophisticated GUI than Orca or MRTG
Less sophisticated system metric collection, but more coverage of networking
Better management of groups of systems and devices than Orca, useful for tens to hundreds of nodes
Access control and personalization for users
Ganglia – www.ganglia.info
Web based RRDtool GUI somewhat similar to Cacti
Better management of clusters of systems and devices than Cacti, useful for hundreds to thousands of nodes in a hierarchy of clusters
Provides many summary statistic plots at cluster level and collects detailed configuration data
XML based data representation
Uses low overhead network protocol
In common use at hundreds of large HPC Grid sites, less visibly in use at some large commercial sites
BigBrother and BigSister
Network and system dashboard alert monitor
Widely used at internet sites
Bigbrother is at http://www.bb4.com
Bigsister is at http://bigsister.graeff.com
Bigsister seems to have more features, alert logging, better portability and more efficient data collection. Compatible update to BB4.
Free QA Test and Modelling Tools
QA Test Requirements
Generate test workload
– SLAMD, Grinder
Collect performance metrics
– Any of the tools already mentioned
Report regression against baseline
Predict capacity needed for production system
– Use spreadsheets for simple linear prediction
– Use modelling tools such as PDQ for queuing models
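The simple linear prediction a spreadsheet would do can also be written out directly. This sketch fits a least squares line to hypothetical QA measurements of CPU utilization against request rate and extrapolates to a 70% planning limit; the data, the limit and the function name are all assumptions, and a straight line only holds while the system is far from saturation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical QA results: requests per second vs. CPU utilization in percent
load = [100, 200, 300, 400]
cpu_percent = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(load, cpu_percent)
max_load = (70.0 - intercept) / slope  # load at which the fit reaches 70% CPU
print(f"about {max_load:.0f} requests/s at 70% CPU")
```

Response times rise non-linearly as utilization grows, which is where the queueing models on the next slides come in.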
Grinder 3 - Powerful New Features
100% Pure Java - works on any hardware platform and any operating system that supports J2SE 1.3 and above.
Java and Jython based load testing framework
– Web Browsers: simulate web browsers using HTTP, and HTTPS.
– Web Services: test interfaces using SOAP and XML-RPC.
– Database: test databases using JDBC.
– Middleware: RPC and MOM based systems using IIOP, RMI/IIOP, RMI/JRMP, and JMS.
– Other Internet protocols: POP3, SMTP, FTP, and LDAP.
See http://grinder.sourceforge.net/g3/features.html
J2EE Performance Testing with BEA WebLogic Server by Peter Zadrozny, Philip Aston and Ted Osborne, originally published by Expert Press and now by APress, uses Grinder 2 throughout.
SLAMD
Load generation framework, written in Java
Originally built to test LDAP servers by Sun
Extended to be very generic and published as open source. Actively being developed.
Sophisticated functions and user interface
See http://www.slamd.com
Latest Release 2.0 has better usability focus
PDQ Modelling Tool
Dr Neil Gunther’s toolkit at http://www.perfdynamics.com
Library used from C or Perl provides MVA queueing models
Use to calibrate in QA and predict in production
PDQ modelling tool details:
– The Practical Performance Analyst, Dr. Neil Gunther - McGraw-Hill, 1998 ISBN 0-07-912946-3
– Analyzing Computer System Performance with Perl::PDQ, 2004, ISBN 3-54-020865-8
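For readers who have not met MVA (mean value analysis), the sketch below solves a single-class closed queueing network from scratch. It does not call the PDQ library and is not its API: the function name, the service demands and the think time are hypothetical, and the algorithm shown is the textbook exact MVA recursion.

```python
def mva(demands, think_time, users):
    """Exact MVA for a closed network of queueing centers with one workload class.

    demands: service demand in seconds per request at each resource.
    Returns (throughput, response_time, queue_lengths) at the given user count.
    """
    queue = [0.0] * len(demands)
    throughput = 0.0
    response = 0.0
    for n in range(1, users + 1):
        # An arriving request sees the queue left by a network with n-1 users.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in residence]
    return throughput, response, queue


# Hypothetical calibration from QA: 50 ms of CPU and 20 ms of disk per request,
# 1 s think time, evaluated at 10 concurrent users.
x, r, q = mva(demands=[0.05, 0.02], think_time=1.0, users=10)
print(f"throughput={x:.2f}/s response={r * 1000:.1f} ms")
```

Calibrating the demands in QA and then raising the user count toward production levels is the "calibrate in QA and predict in production" workflow described above.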
References and Conclusion
Licences for Free Tools
Open Source Initiative
– “OSI Approved licences”
– http://opensource.org/licenses/category
Comparisons of Common Licences
– http://zooko.com/license_quick_ref.html
Web Pages and Books
Adrian’s Performance and other topics blog
– http://perfcap.blogspot.com
MFJ Associates performance tools link page
– http://www.mfjassociates.net/perf_links.html
More free tools compiled by John Sellens
– http://www.generalconcepts.com/resources/monitoring/
More tools compiled by Openxtra
– http://www.openxtra.co.uk/resource-center/open_source_network_monitor_tools.php
SE toolkit info: Sun Performance and Tuning - Java and the Internet - Adrian Cockcroft and Richard Pettit - Sun Press/Prentice Hall, 2nd Edition, 1998 ISBN 0-13-095249-4
Solaris 8 and Linux: System Performance Tuning 2nd Edition – Gian-Paolo Musumeci, O’Reilly 2002 ISBN: 0-596-00284-X
Solaris Internals http://www.solarisinternals.com
– Richard McDougall and James Mauro - new 2nd edition and new performance book by Richard McDougall and Brendan Gregg
Concluding Remarks
Many large installations depend on free tools
A full suite of functionality is available
Several tools are needed to cover the bases
Tradeoff between function and ease of use
Support may be available, but typically Google is the best support tool
Functionality is increasing….