#IDUG
Agenda
● DB2 pureScale Technology Overview
● Initial Configuration Best Practices
– Cluster Topology, Network, Storage
– Instance and Database Configuration
– Client and Workload Balancing Configuration
● Performance Tuning Best Practices
– Compute
– Network
– Storage
– Workload
• Emerging Best Practices
#IDUG
Technology Overview

[Architecture diagram: clients see a single database view; four members, each with its own log, share access to the database on shared storage; primary and secondary cluster caching facilities (CFs) are connected to all members via a low latency cluster interconnect; integrated cluster services (CS) run on every host]
● DB2 engine runs on several host computers
  – Members co-operate with each other to provide coherent access to the database from any member
● Data sharing architecture
  – Shared access to database
  – Members write to their own logs
  – Logs accessible from other hosts (used during recovery)
● Cluster caching facility (CF)
  – Efficient global locking and buffer management
  – Synchronous duplexing to secondary ensures availability
● Low latency, high speed interconnect
  – Special optimizations provide significant advantages on RDMA-capable interconnects (e.g. InfiniBand)
● Clients connect anywhere, see single database
  – Clients connect into any member
  – Automatic load balancing and client reroute may change the underlying physical member to which a client is connected
● Integrated cluster services
  – Failure detection, recovery automation, cluster file system
  – In partnership with STG (GPFS, RSCT) and Tivoli (SA MP)
● Leverages IBM's System z Sysplex design concepts
#IDUG
Agenda
● DB2 pureScale Technology Overview
● Initial Configuration Best Practices
– Cluster Topology, Network, Storage
– Instance & Database Configuration
– Client and Workload Balancing Configuration
● Performance Tuning Best Practices
– Compute
– Network
– Storage
– Workload
• Emerging Best Practices
#IDUG
Overall System Topology
• Use at least 2 physical machines for production deployments
  • At least 2 members on separate machines
  • Primary and secondary CF on separate machines
• Single machine pureScale instances are not recommended for production deployments
  • However, they can be used for development and/or QA systems
  • e.g. define multiple logical members on a single machine, to match the production topology in a QA system
• Isolate CFs and members using one of the following methods, in order of preference:
  1. Separate physical machines
  2. Separate partitions (e.g. AIX LPARs)
  3. Separate CPUs
• On Linux, rely on pureScale default behavior
  • If a member and CF are co-located, DB2 will automatically bind them to separate CPUs
• On AIX, see bullet 2 (i.e. use LPARs!)
  • If, for some reason, you cannot use LPARs:
    - use DB2_RESOURCE_POLICY to tell DB2 to automatically bind members to a set of CPUs
    - set CF_NUM_WORKERS to limit the CF to the remaining CPUs (see the sketch below)
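For reference, a minimal sketch of that fallback on a shared machine; the policy file path and worker count here are hypothetical and must be matched to your actual CPU layout:

  # hypothetical policy file that binds the members to a subset of CPUs
  db2set DB2_RESOURCE_POLICY=/home/db2inst1/member_cpu_policy.xml
  # limit the CF worker threads to the remaining CPUs
  db2 update dbm cfg using CF_NUM_WORKERS 6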
#IDUG
OK, But How Should I Divide Up CPU Resources ?

• Use these approximate ratios as rules of thumb:
  • For write-heavy workloads, both primary and secondary CF should have at least 1/6th the combined CPU count of all members
  • For predominantly read workloads, both primary and secondary CF may suffice with 1/12th the combined CPU count of all members
• Ensure each CF worker thread gets its own dedicated logical CPU
  • CF response time is critical to overall system performance
  • CF workers constantly look for new requests from members; each worker needs its own logical CPU to deliver optimal response time
• Use dedicated (not shared) LPARs on AIX for the CF
• Use at least 1 physical core for each CF
  • e.g. 4 SMT threads (aka logical CPUs) on POWER7
• Leave some CPU resource for pureScale's housekeeping and recovery threads
  • CF worker threads are in a busy loop looking for new requests
  • Using the default of CF_NUM_WORKERS=AUTO usually takes care of this for you

[Diagram: CF worker threads mapped onto logical CPUs and cores]
#IDUG
Example Configurations : Example 1

Workload : OLTP workload with typical R/W ratio (20% of transactions write)
Target Cluster : Ten machines, each with 4 cores, 8 logical CPUs
Recommendation ?
• Target member to primary CF CPU ratio ~8:1
• Action:
  • Dedicate 8 full machines to 8 members
  • Dedicate 2 full machines to CFs
  • Leave CF_NUM_WORKERS=AUTO
• Notes:
  • When a CF is the only member/CF on a machine, the CF_NUM_WORKERS=AUTO setting will not assign all CPUs to workers
  • It will usually leave 1 CPU unassigned for use by pureScale's recovery automation threads

[Diagram: ten machines – eight dedicated member machines, plus one machine each for the primary (CFp) and secondary (CFs) CF]
#IDUG
Example Configurations : Example 2

Workload : Moderately heavy write ratio (e.g. 30% of transactions write)
Target Cluster : Two quad-core x86 machines
  – each core has 2 logical CPUs (8 logical CPUs per machine)
  – bare metal
Recommendation ?
• Target member to primary CF CPU ratio ~6:1
• Action:
  • Define a member and a CF on each machine
  • Leave CF_NUM_WORKERS=AUTO
• Notes:
  • When a CF and member are co-located on Linux, setting CF_NUM_WORKERS=AUTO usually results in a ~80:20 member:CF split, i.e. member: 6 CPUs (3 cores), CF: 2 CPUs (1 core)
  • DB2 will also automatically bind the member and CF to the appropriate cores!
  • A CF needs at least 1 worker for every port (this example assumes each CF is using <=2 ports)

[Diagram: two machines, each hosting one member plus one CF (CFp on one, CFs on the other), with CF_NUM_WORKERS=AUTO]
#IDUG
Example Configurations : Example 3

Workload : Moderately heavy write ratio (e.g. 30% of transactions write)
Target Cluster : Four 16-core POWER7 machines
  – each core has 4 logical CPUs (64 logical CPUs/machine)
  – use AIX LPARs
Recommendation ?
• Target member to primary CF CPU ratio ~6:1
• Action:
  • Dedicate 2 full machines to 2 members
  • On the other 2 machines, define:
    • one 8-core LPAR (32 logical CPUs) for a member
    • one dedicated 8-core LPAR (32 logical CPUs) for a CF
  • Set CF_NUM_WORKERS=28
• Notes:
  • CF_NUM_WORKERS=AUTO would leave just 1 logical CPU for recovery threads (i.e. in this case, it results in 31 workers). That is just 25% of 1 POWER7 core, and may not be sufficient.

[Diagram: four machines – two dedicated member machines; the other two each split into an 8-core member LPAR and an 8-core CF LPAR (CFp, CFs) with CF_NUM_WORKERS=28]
#IDUG
Network Configuration : RDMA Network
● Use at least 2 switches
  – e.g. connect each CF/member to each of 2 switches
  – Avoids a single point of failure
● Use 2 (or more) RDMA adapters on each CF and member
  – Up to 4 per CF; up to 2 per member
  – Allows a CF/member to survive a single adapter failure
  – Provides additional bandwidth/IOPs
  – Note: a 2nd port on the same adapter may not provide significant additional bandwidth/IOPs
● If a member/CF must use a single adapter, utilize both of its ports to maximize availability
  – e.g. if a member uses just 1 adapter, use both of its ports to connect the member to 2 switches

pureScale can be configured with either InfiniBand or RDMA over Converged Ethernet (aka RoCE).
See "Installation prerequisites for DB2 pureScale Feature" for the current list of supported adapters.
#IDUG
Storage Selection
• DB2 pureScale supports all SAN storage, with 3 categories of support
  • Category 1: Tested storage which supports fast I/O fencing
  • Category 2: Tested storage without fast I/O fencing
  • Category 3: All other types of shared storage
• Use Category 1 storage to get the fastest member recovery
  • Uses fast I/O fencing (aka SCSI-3 Persistent Reserve) to quickly (within seconds) isolate shared storage from a failing member
  • Allows recovery to begin with assurance that a 'comatose' (non-responsive) member will not come back to life and start writing to data or logs
  • Without fast I/O fencing, a lease-expiry fencing mechanism is used that generally takes several tens of seconds to ensure a non-responsive member will not write to data or logs
  • Category 1 includes IBM DS3000, DS5000, DS8000, V7000, EMC Symmetrix, Hitachi, NetApp
• See this link for the latest information regarding storage support
  • http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.qb.server.doc%2Fdoc%2Fc0059360.html
#IDUG
The Importance of Category 1 Storage
[Diagram: two-member cluster with shared data and logs. With Category 1 storage, a failed member (M1) is isolated via a hardware fence and recovery completes in seconds; with other storage, recovery must wait for the lease to expire, taking roughly a minute]
#IDUG
Current Category 1 List (as of Oct 2013)

See the "Shared Storage Considerations" section of the Information Center for the latest information:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.qb.server.doc/doc/c0059360.html
#IDUG
Storage Selection (continued)
• Use fast storage for logs
  ● Like any database, pureScale needs adequate I/O bandwidth to keep response times low when the system is under heavy load
  ● In addition, clustered databases may need to flush their logs to disk more often than a single-node database, so log I/O performance is even more important
  ● Solid-state disks (SSDs) can be very useful in minimizing I/O times
    – A relatively small SSD investment can make a big difference in a log-bound system where the storage write cache can't keep up
    – Ensure appropriate redundancy (e.g. RAID, and/or DB2 log mirroring)

[Diagram: two members and the CF; each member flushes its log buffer to its own log on UPDATE commits, and sends updated pages to the CF]
#IDUG
Cluster File System Configuration
– Place logs and tablespaces on separate file systems
  – General recommendation: one file system for all members' logs and another for all tablespaces
  – Consider additional data file systems for very large databases, or to enable multi-temperature storage on different storage groups
  – Note: each additional file system adds a small incremental cost during member recovery (~1-2 seconds)
– Use the db2cluster command to create additional file systems
  db2cluster -cfs -create …
– Rely on pureScale's default file system parameter settings
  – db2cluster uses defaults appropriate for pureScale (e.g. Direct I/O, block size)
– Add capacity at the cluster file system level
  db2cluster -cfs -add …
  – Use LUNs with the same characteristics where possible to simplify maintenance
    – i.e. same size and performance characteristics
– To rebalance after adding LUNs over new storage (see the sketch below)
  db2cluster -cfs -rebalance …
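For illustration, a minimal sketch of creating, growing and rebalancing a data file system; the file system name and disk device paths are hypothetical:

  # hypothetical file system name and disks
  db2cluster -cfs -create -filesystem db2data -disk /dev/hdisk10,/dev/hdisk11
  db2cluster -cfs -add -filesystem db2data -disk /dev/hdisk12
  db2cluster -cfs -rebalance -filesystem db2data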
#IDUG
New Instance (aka ‘DBM’) Configuration Parameters
Parameter          Description                                                                 Default
CF_NUM_WORKERS     # threads in a CF that can process GBP, GLM and other requests              AUTO (based on number of CPUs)
RSTRT_LIGHT_MEM    % of INSTANCE_MEMORY to reserve for automated recoveries of other members   AUTO (typically ~5%)
CF_NUM_CONNS       # connections allowed at a CF (from each member)                            AUTO (grows as needed)
CF_MEM_SZ          Overall memory to use for a CF                                              AUTO (typically 70-90% of host memory)
CF_DIAGPATH        Path where CF related diagnostic messages will be stored                    NULL (use DIAGPATH)
CF_DIAGLEVEL       Severity of diagnostic messages logged                                      2

• Rely on defaults for pureScale instance configuration
• Ensure DIAGPATH & CF_DIAGPATH are set to local paths (i.e. not shared)
  • Some customers have experienced up to a 3 second reduction in member recovery time
    • Due to less contention when writing diagnostics during recovery
  • pureScale configures local diagnostic paths by default starting in 10.1 (not 9.8)

  db2 update dbm cfg using DIAGPATH /local_path
  db2 update dbm cfg using CF_DIAGPATH /local_path
#IDUG
New Database Configuration Parameters
• Start with defaults for pureScale DB configuration
  • Usually results in an appropriate 80:15:5 ratio between GBP:LOCK:SCA
• Customization may be appropriate if …
  • … you are running more than 1 database
    • If all are similar in priority/size/workload, do the following to give equal CF memory to each:

      db2 update dbm cfg using NUM_DB <# of your DBs>
      db2set DB2_DATABASE_CF_MEMORY -1

    • Otherwise, explicitly configure CF_DB_MEM_SZ for each:

      db2 update db cfg for <db1> using CF_DB_MEM_SZ <nnn>
      db2 update db cfg for <db2> using CF_DB_MEM_SZ <mmm>

  • … you want to reduce the window during which simultaneous primary and secondary CF failure could cause a group crash recovery to occur:

      db2 update db cfg for <db1> using CF_CATCHUP_TRGT 5

Parameter         Description                                                                   Default
CF_CATCHUP_TRGT   Target time for a newly started secondary CF to enter peer state             15 minutes
CF_DB_MEM_SZ      Total amount of CF memory for a given database (includes the 3 areas below)  AUTO
CF_SCA_SZ         Memory to use for the System Communication Area                              AUTO
CF_LOCK_SZ        Memory used for the GLM                                                      AUTO
CF_GBP_SZ         Memory used for the GBP                                                      AUTO
#IDUG
CF Memory Configuration Details
● CF memory is dominated by the Group Bufferpool (GBP)
  – The GBP only stores modified pages, so the higher the read ratio, the less memory required by the CF
  – The GBP is always allocated in 4K pages, regardless of the bufferpool page size(s) at the members
● Size CF memory so that the GBP gets ~35-40% of the sum of all members' bufferpool memory
  • Example:
    – LBP size in each of 4 members : 10 M 4KB pages = 40 GB each, 160 GB total
    – A reasonable CF_GBP_SZ is : ~15 M 4K pages = 60 GB total
  • For higher read workloads (e.g. 85-95% SELECT), the required size decreases since there are fewer modified pages in the system
    – Consider 25% a minimum, even for very read-heavy workloads
  • For 2 member clusters, the % can range higher, to 40-50%
#IDUG
Automatic CF Memory Sizing : 1 Active Database
● Total CF memory allocation is controlled by the DBM config parameter CF_MEM_SZ
● Default AUTOMATIC settings provide reasonable initial calculations (but no self tuning)
  • CF_MEM_SZ is set to 70-90% of physical memory
  • CF_DB_MEM_SZ defaults to CF_MEM_SZ (for a single DB)
  • CF_SCA_SZ = 5-20% of CF_DB_MEM_SZ
    ● Metadata space for table control blocks, etc.
  • CF_LOCK_SZ = 15% of CF_DB_MEM_SZ
  • CF_GBP_SZ = remainder of CF_DB_MEM_SZ

[Diagram: CF_MEM_SZ (instance) contains CF_DB_MEM_SZ (DB 1), which is divided into CF_GBP_SZ, CF_LOCK_SZ and CF_SCA_SZ]
#IDUG
Example CF Memory Sizing : 1 Active Database
● 4 members, each with two bufferpools
  ● 8 M pages (4K page size) = 32 GB
  ● 6 M pages (8K page size) = 48 GB
● How big should the GBP be ?
● A solution
  ● Total LBP size is 4 * (32 GB + 48 GB) = 320 GB
  ● Target GBP size : 35-40% = 112-128 GB
  ● Give the CF partition/machine ~196 GB
  ● Rely on AUTO configuration
    ● CF_MEM_SZ ~ 80% ~ 160 GB
    ● CF_GBP_SZ ~ 80% ~ 128 GB
    ● Other memory areas are given appropriate defaults
  ● Or, use explicit settings (see the sketch below)

[Diagram: CF_MEM_SZ (instance) contains CF_DB_MEM_SZ (DB 1), divided into CF_GBP_SZ, CF_LOCK_SZ and CF_SCA_SZ]
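A minimal sketch of the explicit alternative for this example, following the same 160 GB split; the database name MYDB is hypothetical, and the CF_* sizes are specified in 4 KB pages (so 128 GB ≈ 33554432 pages):

  # MYDB is hypothetical; values are in 4 KB pages
  db2 update db cfg for MYDB using CF_GBP_SZ 33554432
  db2 update db cfg for MYDB using CF_LOCK_SZ 6291456
  db2 update db cfg for MYDB using CF_SCA_SZ 2097152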
#IDUG
CF Memory : Multiple Active Databases

● If you have multiple databases, all of similar importance and workload, you can use the DB2_DATABASE_CF_MEMORY registry variable to evenly divide CF memory across your databases (see the sketch below)
  • Ensures the first database to activate doesn't consume all CF memory
● If DB2_DATABASE_CF_MEMORY=-1:
  CF_DB_MEM_SZ = CF_MEM_SZ / NUM_DB
● If DB2_DATABASE_CF_MEMORY=33:
  CF_DB_MEM_SZ = (33/100) * CF_MEM_SZ

[Diagram: CF_MEM_SZ (instance) divided into three equal CF_DB_MEM_SZ allotments (DB 1, DB 2, DB 3), each containing its own CF_GBP_SZ, CF_LOCK_SZ and CF_SCA_SZ]
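As a worked illustration of the even split; the memory figure and database count are hypothetical:

  # hypothetical: with CF_MEM_SZ ~ 160 GB and 3 similar databases,
  # each database's CF_DB_MEM_SZ becomes ~53 GB
  db2 update dbm cfg using NUM_DB 3
  db2set DB2_DATABASE_CF_MEMORY=-1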
#IDUG
Client Configuration : WLB, Affinity or Subsetting ?
• Use default ("all member") WLB for single workloads
• When consolidating multiple workloads or databases in a single pureScale cluster:
  ● Before 10.5, use affinity routing
  ● With 10.5 and beyond, use member subsetting

[Diagram: Default WLB – a single workload balanced over all members; Member subsets – batch and OLTP workloads independently balanced over defined subsets of members; Affinity – workloads A, B and C each routed to a specific member, with no workload balancing]
#IDUG
Configuring WLB (For “All Members” or Subsets)
• Defaults are reasonable for many deployments
• The most common customizations are:
  1. Enabling transaction level WLB (vs connection level)
  2. Keep alive timeout
  3. Enabling the alternate server list for first connection processing
• The following slides show how to customize these for the CLI/ODBC client
• A subsequent slide maps these to the corresponding parameters for Java
#IDUG
Customize CLI/ODBC WLB with db2dsdriver.cfg
<databases>
  <database name="SAMPLE" host="myhost1.net1.com" port="50001">
    <parameter name="KeepAliveTimeout" value="10"/>
    <WLB>
      <parameter name="enableWLB" value="true"/>
      <parameter name="maxRefreshInterval" value="30"/>
    </WLB>
    <ACR>
      <parameter name="enableAcr" value="true"/>
      <parameter name="enableSeamlessAcr" value="true"/>
      <parameter name="enableAlternateServerListFirstConnect" value="true"/>
      <alternate_server_list>
        <server name="m2" hostname="myhost2.net1.com" port="50001"/>
        <server name="m3" hostname="myhost3.net1.com" port="50001"/>
        <server name="m4" hostname="myhost4.net1.com" port="50001"/>
      </alternate_server_list>
    </ACR>
  </database>
</databases>
#IDUG
1) Enabling Transaction Level WLB
<databases>
  <database name="SAMPLE" host="myhost1.net1.com" port="50001">
    <parameter name="KeepAliveTimeout" value="10"/>
    <WLB>
      <parameter name="enableWLB" value="true"/>
      <parameter name="maxRefreshInterval" value="30"/>
    </WLB>
    <ACR>
      <parameter name="enableAcr" value="true"/>
      <parameter name="enableSeamlessAcr" value="true"/>
      <parameter name="enableAlternateServerListFirstConnect" value="true"/>
      <alternate_server_list>
        <server name="m2" hostname="myhost2.net1.com" port="50001"/>
        <server name="m3" hostname="myhost3.net1.com" port="50001"/>
        <server name="m4" hostname="myhost4.net1.com" port="50001"/>
      </alternate_server_list>
    </ACR>
  </database>
</databases>

(The enableWLB and maxRefreshInterval parameters in the <WLB> section are the relevant settings here.)
#IDUG
Physical Connection Pool Managed by Client Driver

[Diagram: an application with two threads connects through the DB2 client driver, which maintains a pool of physical connections to members 1, 2 and 3 of the DB2 server]

• The connections established by applications are termed logical
• Transparent to the application, the DB2 driver maintains a number of physical connections to members
• Each logical connection is mapped to a single physical connection
  • This mapping is done transparently to the application and may change on connection or transaction boundaries
• In this example, the application has two threads, each with one logical connection
  • Thread 1's logical connection is currently mapped to a physical connection to member 1
  • Thread 2's logical connection is currently mapped to a physical connection to member 2
  • The remaining 7 physical connections are not currently in use
• The maximum size of the physical connection pool is unlimited by default; a finite limit can be set via:

  <WLB>
    <parameter name="maxTransports" value="100"/>
  </WLB>
#IDUG
Connection Level WLB Enabled by Default

[Diagram: five application threads connect through the DB2 client driver to members 1-3, with relative priorities P=20, P=20 and P=60; thread 2 issues CONNECT RESET then CONNECT]

• The client driver transparently routes logical connection requests to members with fewer logical connections than their relative priority calls for
• In this example, the relative member priorities (P) indicate M1, M2 and M3 should have 20%, 20% and 60% of the logical connections, respectively
  • M1 has too many connections, M2 has too few, and M3 has the right amount
  • So, when thread 2 disconnects and then reconnects to the database, the client driver will use a physical connection to M2
• Note: P ranges from 0-100 and is based on CPU load, paging rates and other factors. Larger values mean more cycles are available and more work should be routed to the member.
#IDUG
Transaction Level WLB

[Diagram: five application threads connect through the DB2 client driver to members 1-3, with relative priorities P=30, P=30 and P=90; thread 2 issues COMMIT then SELECT]

• The DB2 client driver transparently routes transactions to members with fewer logical connections than their relative priority calls for
• In this example, the priorities indicate M1, M2 and M3 should have 20%, 20% and 60% of the logical connections, respectively
  • M1 has too many connections, M2 has too few, and M3 has the right amount
  • So, when thread 2 starts a new transaction, the client driver may use a physical connection to M2
• Enabled via:

  <parameter name="enableWLB" value="true"/>
#IDUG
Connection or Transaction Level ?
• Transaction level WLB is a good general purpose approach
  • If in doubt, start with transaction level WLB
• However, watch out for application usage patterns that maintain state across transactions; these revert back to connection level WLB:
  • Sequences
    • Use db2set DB2_ALLOW_WLB_WITH_SEQUENCES=YES to allow WLB with sequences
      • First confirm applications don't fetch the previous sequence value in a transaction before getting the next value
    • This db2set is commonly used
  • Declared Global Temporary Tables
  • Cursors with hold
  • Full list:
    http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.qb.server.doc%2Fdoc%2Fr0056430.html
#IDUG
Connection or Transaction Level ? (continued)
• Connection level WLB may have advantages in certain scenarios:
  • Minimizes reconnect overhead for tiny transactions
• However, watch out for:
  • Long-lived connections (balancing may rarely/never occur)
• Notes:
  • Both transaction level and connection level WLB work best when each client process/JVM has multiple connections, particularly with pre-10.1 clients
    • This is because balancing is performed independently by each client process/JVM; decisions made by one client process/JVM may not be coordinated with decisions of another
  • The 10.1 C-client and 10.5 JCC client have enhancements for transaction level WLB to better handle the single connection per process/JVM case
#IDUG
2) KeepAliveTimeout
<databases>
  <database name="SAMPLE" host="myhost1.net1.com" port="50001">
    <parameter name="KeepAliveTimeout" value="10"/>
    <WLB>
      <parameter name="enableWLB" value="true"/>
      <parameter name="maxRefreshInterval" value="30"/>
    </WLB>
    <ACR>
      <parameter name="enableAcr" value="true"/>
      <parameter name="enableSeamlessAcr" value="true"/>
      <parameter name="enableAlternateServerListFirstConnect" value="true"/>
      <alternate_server_list>
        <server name="m2" hostname="myhost2.net1.com" port="50001"/>
        <server name="m3" hostname="myhost3.net1.com" port="50001"/>
        <server name="m4" hostname="myhost4.net1.com" port="50001"/>
      </alternate_server_list>
    </ACR>
  </database>
</databases>

(The KeepAliveTimeout parameter on the <database> element is the relevant setting here.)
#IDUG
What is KeepAliveTimeout ?
• A value that limits how long it may take before a client (or server) detects the abnormal termination of the server (or client)
• If not set, the system default may be used (typically ~2 hours)
  • This means that in the event of a DB2 member failure, a DB2 client that is awaiting a reply to an SQL statement may not notice the failure for up to 2 hours
  • You can change the host level setting, but the change will affect all TCP/IP connections on the host
  • Note: 10.1 and later clients use a default of 15 seconds
• Use a value that supports your failover objective
  • 5-10 seconds is often a reasonable value: <parameter name="KeepAliveTimeout" value="10"/>

[Diagram: with KeepAliveTimeout set to 10, a socket-lost error is returned to the DB2 driver within 10 seconds of a member failure; ACR then transparently reconnects to member 2 and re-issues the SELECT]
#IDUG
3) Enabling Alternate Members on First Connect
<databases>
  <database name="SAMPLE" host="myhost1.net1.com" port="50001">
    <parameter name="KeepAliveTimeout" value="10"/>
    <WLB>
      <parameter name="enableWLB" value="true"/>
      <parameter name="maxRefreshInterval" value="30"/>
    </WLB>
    <ACR>
      <parameter name="enableAcr" value="true"/>
      <parameter name="enableSeamlessAcr" value="true"/>
      <parameter name="enableAlternateServerListFirstConnect" value="true"/>
      <alternate_server_list>
        <server name="m2" hostname="myhost2.net1.com" port="50001"/>
        <server name="m3" hostname="myhost3.net1.com" port="50001"/>
        <server name="m4" hostname="myhost4.net1.com" port="50001"/>
      </alternate_server_list>
    </ACR>
  </database>
</databases>

(The enableAlternateServerListFirstConnect parameter and the <alternate_server_list> section are the relevant settings here.)
#IDUG
Huh ? Alternate Members on First Connect ?
• After a successful connection to any member of a pureScale cluster, the client receives a complete list of all members and their IP information
  • From this point on, the client can transparently perform workload balancing and reroute connections from failed members to other members
• However, if a client's initial connection targets a member that is not currently active, the attempt will fail
  • The client is unable to transparently try other members, because it does not have their IP information
• Use enableAlternateServerListFirstConnect to give the client all members' IP information up front
  • With this, even if the target member of the first client connect attempt is not available, the client will be able to retry against other members
#IDUG
Notes on Configuring CLI/ODBC vs Java/JCC
● Connection level WLB
  – db2dsdriver.cfg : connectionLevelLoadBalancing
  – Java : not available
● Transaction level WLB
  – db2dsdriver.cfg : enableWLB
  – Java : enableSysplexWLB
● Automatic Client Reroute
  – db2dsdriver.cfg : enableAcr
  – Java : no explicit switch; enabled/disabled via the presence/absence of alternate server information
● Seamless ACR
  – db2dsdriver.cfg : enableSeamlessAcr
  – Java : enableSeamlessFailover
● ACR connection retry and failback behavior
  – db2dsdriver.cfg : maxAcrRetries, acrRetryInterval, affinityFailbackInterval
  – Java : maxRetriesForClientReroute, retryIntervalForClientReroute, affinityFailbackInterval
● Maximum number of underlying physical connections
  – db2dsdriver.cfg : maxTransports
  – Java : db2.jcc.maxTransportObjects
● Maximum age of member weight information before it is refreshed
  – db2dsdriver.cfg : maxRefreshInterval
  – Java : db2.jcc.maxRefreshInterval
● Client-member affinitization
  – db2dsdriver.cfg : client_affinity_defined, affinity_list
  – Java : client_affinity_defined, affinity_list in db2dsdriver.cfg; use db2.jcc.dsdriverConfigFile to specify a db2dsdriver.cfg file
● Try other members if the initial connect to a specific member fails
  – db2dsdriver.cfg : enableAlternateServerListFirstConnect
  – Java : with JNDI, db2.jcc.clientRerouteServerListJNDIName and db2.jcc.DB2ClientRerouteServerList; without JNDI, db2.jcc.clientRerouteAlternateServerName and db2.jcc.clientRerouteAlternatePortNumber
● Rapid detection of a failed server
  – db2dsdriver.cfg : keepAliveTimeout
  – Java : see next slide
#IDUG
Setting KeepAliveTimeout for Java Clients
• Connection URL example:

  jdbc:db2://coralpib19a.torolab.ibm.com:56733/eComHQ:enableSysplexWLB=true;keepAliveTimeOut=10;

• Java application code:

  String url = "jdbc:db2://coralpib19a.torolab.ibm.com:56733/eComHQ";
  Properties properties = new Properties();
  properties.put("user", "yourID");
  properties.put("password", "yourPassword");
  properties.put("enableSysplexWLB", "true");
  properties.put("keepAliveTimeOut", "10");
  Connection con = DriverManager.getConnection(url, properties);

White Paper on Client Configuration and WLB:
http://public.dhe.ibm.com/software/dw/data/dm-1206purescaleenablement/wlb.pdf
#IDUG
Agenda
● DB2 pureScale Technology Overview
● Initial Configuration Best Practices
– Cluster Topology, Network, Storage
– Client and Workload Balancing Configuration
– Instance & Database Configuration
● Performance Tuning Best Practices
– Compute
– Network
– Storage
– Workload
• Emerging Best Practices
#IDUG
Is the CF Healthy ?
[Diagram: four members, each with its own LBP, LockList and dbheap, connected to the CF, which holds the GBP, the global LockList and the SCA. Questions to ask: CF memory? CF CPU? Network?]

Read Steve Rees' "DB2 Performance and Monitoring Best Practices" on www.ibm.com/developerworks
#IDUG
Accounting for pureScale Bufferpool Operations
[Diagram: four scenarios of a member's agent requesting a page, showing which monitor elements are incremented in each case]

Page found where?                   Metrics affected (+1 each)
Found in LBP                        pool_data_l_reads, pool_data_lbp_pages_found
Invalid in LBP, found in GBP        pool_data_l_reads, pool_data_lbp_pages_found, pool_data_gbp_l_reads, pool_data_gbp_invalid_pages
Not in LBP, found in GBP            pool_data_l_reads, pool_data_gbp_l_reads
Not in LBP or GBP, found on disk    pool_data_l_reads, pool_data_gbp_l_reads, pool_data_gbp_p_reads, pool_data_p_reads
#IDUG
Calculating Bufferpool Hit Ratios
● Overall (and non-pureScale) hit ratio
  (POOL_DATA_L_READS - POOL_DATA_P_READS) / POOL_DATA_L_READS
  – Primary indicator of BP quality
  – Great values: 95% for index, 90% for data
  – Good values: 80-90% for index, 75-85% for data

● LBP hit ratio
  (POOL_DATA_LBP_PAGES_FOUND - POOL_ASYNC_DATA_LBP_PAGES_FOUND) / POOL_DATA_L_READS
  – Usually a less significant indicator
    • Excludes GBP hits
    • Includes invalid pages

● GBP hit ratio
  (POOL_DATA_GBP_L_READS - POOL_DATA_GBP_P_READS) / POOL_DATA_GBP_L_READS
  – Typically lower than the overall hit ratio (the GBP currently only caches updated pages)
  – A higher proportion of reads generally decreases the GBP hit ratio

● Calculate via MON_GET_BUFFERPOOL()
  – Can be issued at any member
  – Can retrieve data for any (or all) members
  – e.g. to calculate the GBP hit ratio, request data for all members by specifying member = -2
SELECT POOL_DATA_GBP_L_READS,
       POOL_DATA_GBP_P_READS
FROM TABLE (
  MON_GET_BUFFERPOOL(NULL, -2) ) AS T
#IDUG
Bufferpool Tuning
• See Steve Rees' "DB2 Performance and Monitoring Best Practices" on www.ibm.com/developerworks
• One important indicator : Group Bufferpool Full conditions

  select sum(num_gbp_full) from table(mon_get_group_bufferpool(-2))

  • Occur when there are no free locations in the GBP for incoming pages from the members
  • Cause an internal 'stall' condition during which dirty pages are written synchronously to create more space
    • Of course, this is fully transparent to applications
  • Similar to "dirty steal" in single node DB2
  • They are counted at the members, but are specific to the GBP, hence the above query sums across all members
  • Can often also indicate storage that is too slow for the workload
#IDUG
pureScale Page Negotiation (or 'reclaims')

• pureScale page locks are physical locks, indicating which member currently 'owns' the page. Picture the following:
  • Member A acquires a page P and modifies a row on it, and continues with its transaction. 'A' holds an exclusive page lock on page P until 'A' commits
  • Member B wants to modify a different row on the same page P. What now?
    • 'B' doesn't have to wait until 'A' commits and frees the page lock
    • The CF will negotiate the page back from 'A' in the middle of 'A's transaction, on 'B's behalf. This is called a reclaim.
• Provides far better concurrency and performance than waiting for a page lock until the holder commits

[Diagram: Member B asks the CF for page P; the GLM shows Px held by A and requested by B; the CF negotiates the page back from Member A and passes it to Member B]
#IDUG
Monitoring Page Reclaims
● Page reclaims help eliminate page lock waits, but they're not cheap
● Excessive reclaims can cause contention – low CPU usage, reduced throughput, etc.
● MON_GET_PAGE_ACCESS_INFO gives very useful reclaim stats (see the query sketch below)

[Sample output: reclaim counts by schema and table name, e.g. 12,641 reclaims on one object. Is 12,641 excessive? Maybe – it depends how long these accumulated. Rule of thumb: more than 1 reclaim per 10 transactions is worth looking into]
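A minimal query sketch for pulling those stats; the column list is abbreviated, and the NULL wildcards plus -2 (all members) follow the same conventions as the other MON_GET_* calls in this deck:

  SELECT VARCHAR(TABSCHEMA,10) AS SCHEMANAME,
         VARCHAR(TABNAME,20) AS TABLENAME,
         PAGE_RECLAIMS_X,
         PAGE_RECLAIMS_S,
         RECLAIM_WAIT_TIME
  FROM TABLE( MON_GET_PAGE_ACCESS_INFO(NULL, NULL, -2) ) AS T
  ORDER BY RECLAIM_WAIT_TIME DESC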
#IDUG
Reducing Page Reclaims : Page Size and Extent Size
• Reduce page size if possible
  • On 10.5, row size can exceed page size if EXTENDED_ROW_SZ=ENABLE
• "Small but hot" tables with frequent updates may benefit from increased PCTFREE (see the sketch below)
  • Spreads rows over more pages
  • Increases overall space consumption – 50% PCTFREE can double object size, but halve contention
  • Note : PCTFREE only takes effect on LOAD and REORG
• Larger extent sizes tend to perform better than small ones
  • Some operations require CF communication and other processing each time a new extent is created
  • Larger extents mean fewer CF messages
  • The default 32-page extent size usually works well
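A minimal sketch of the PCTFREE approach; HOT_TABLE is a hypothetical table name, and the setting only takes effect at the next LOAD or REORG:

  # HOT_TABLE is hypothetical
  db2 "alter table hot_table pctfree 50"
  db2 "reorg table hot_table"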
#IDUG
Consider a CURRENT MEMBER Default Column
● Case 1: intensive inserts of successively incrementing/decrementing numeric values, timestamps, etc.
  • Can cause significant demand for the page at the high/low end of the index, as the page getting all the new keys is reclaimed between members
  • Consider adding a hidden CURRENT MEMBER leading column – so each member tends to insert into a different page

    alter table orders add column curmem smallint
      default current member implicitly hidden;
    create index seqindex on orders (curmem, ordnum);

  • DB2 10's multi-range index scan makes this unconventional index work

[Diagram: with orders(ordnum), members M1-M3 all insert keys 0, 1, 2 … 567 into the same high-end page; with orders(curmem, ordnum), each member inserts into its own range, e.g. M1,0 … M1,567 / M2,0 … M2,566 / M3,0 … M3,565]
#IDUG
Consider a CURRENT MEMBER Default Column
● Case 2: low-cardinality indexes – e.g. GENDER, STATE, etc.
  • Here, there can be significant demand for the pages where new RIDs are added for the (relatively few) distinct keys
  • Consider transparently increasing the cardinality (and separating new key values by member) by adding a trailing CURRENT MEMBER column to the index

    alter table customer add column curmem smallint
      default current member implicitly hidden;
    create index genderidx on customer (gender, curmem);

[Diagram: with an index on gender alone, members M1-M3 all add RIDs to the same F and M pages; with (gender, curmem), each member appends under its own key values, e.g. (F,1) and (M,1) for M1, (F,2) and (M,2) for M2, (F,3) and (M,3) for M3]
#IDUG
Other Application / Schema Design Considerations
● SEQUENCEs and IDENTITY columns should use a large cache and avoid the ORDER keyword
  • Obtaining new batches of numbers requires CF communication and a log flush in pureScale
  • A larger cache size (100 or more – best to tune) means fewer refills and better performance

    CREATE SEQUENCE MY_SEQ START WITH 1 INCREMENT BY 1
      NO MAXVALUE NO CYCLE CACHE 200
#IDUG
CF CPU Capacity : Experiences
SELECT VARCHAR(NAME,20) AS ATTRIBUTE,
       VARCHAR(VALUE,25) AS VALUE,
       VARCHAR(UNIT,8) AS UNIT
FROM SYSIBMADM.ENV_CF_SYS_RESOURCES

ATTRIBUTE            VALUE        UNIT
-------------------- ------------ --------
HOST_NAME            coralm215    -
MEMORY_TOTAL         64435        MB
MEMORY_FREE          31425        MB
MEMORY_SWAP_TOTAL    4102         MB
MEMORY_SWAP_FREE     4102         MB
VIRTUAL_MEM_TOTAL    68538        MB
VIRTUAL_MEM_FREE     35528        MB
CPU_USAGE_TOTAL      93           PERCENT
HOST_NAME            coralm216    -
MEMORY_TOTAL         64435        MB
MEMORY_FREE          31424        MB
MEMORY_SWAP_TOTAL    4102         MB
MEMORY_SWAP_FREE     4102         MB
VIRTUAL_MEM_TOTAL    68538        MB
VIRTUAL_MEM_FREE     35527        MB
CPU_USAGE_TOTAL      93           PERCENT

16 record(s) selected.

• vmstat and other CPU monitoring tools typically show the CF at 100% busy – even when the cluster is idle
• Use ENV_CF_SYS_RESOURCES to get more accurate memory and CPU utilization
• Response time to requests from members may degrade as sustained CF CPU utilization climbs above 80-90%
  • Allocating additional CPU cores to the CF may be required
#IDUG
Detecting an Interconnect Bottleneck

● InfiniBand is not infinite…
● A typical ratio is 1 CF RDMA adapter per 6-8 CF cores
● Main symptoms of an interconnect bottleneck:
  • High CF response time
  • Increased member CPU time
  • Poor cluster throughput with CPU capacity remaining on the CF
● Look for an average CF_WAIT_TIME of < 200 µs
  – CF_WAITS : ~ # of times a member makes a CF request (mostly dependent on the workload rather than the tuning)
  – CF_WAIT_TIME : time accumulated by members waiting on a CF response
    • Note: CF_WAIT_TIME does NOT include reclaim time or lock wait time
● These metrics are available at the statement level in MON_GET_PKG_CACHE_STMT, or at the agent level in MON_GET_WORKLOAD, etc. (more useful for overall tuning; see the query sketch below)
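A minimal sketch of computing the per-call average at the agent level; rolling MON_GET_WORKLOAD up across all workloads and members (-2) is my assumption about a reasonable aggregation:

  SELECT SUM(CF_WAITS) AS CF_WAITS,
         SUM(CF_WAIT_TIME) AS CF_WAIT_TIME,
         CASE WHEN SUM(CF_WAITS) > 0
              THEN SUM(CF_WAIT_TIME) / SUM(CF_WAITS) END AS AVG_CF_WAIT
  FROM TABLE( MON_GET_WORKLOAD(NULL, -2) ) AS T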
#IDUG
Interconnect Bottleneck Example

● Situation: very busy pureScale cluster running an SAP workload
● CF with two InfiniBand HCAs
● CF_WAIT_TIME / CF_WAITS gives a rough idea of the average interconnect network time per CF call
  • Important – this is an average over all CF calls
  • Look for an average less than 200 µs
  • The best way to judge good or bad numbers – look for a change from what's normal for your system
● Observed average per-call CF_WAIT_TIME with 2 CF HCAs – 630 µs
  • This is very high – even a very busy system should be less than 200 µs
  • CF CPU utilization was about 75% – high, but not so high as to cause this major slowdown
#IDUG
Add Another CF HCA

[Diagram: four members (Mbr1-Mbr4) connected to CFpri and CFsec, before with 2 HCAs per CF and after with 3]

● And good things happened!

Metric                                   2 CF HCAs    3 CF HCAs
Average CF_WAIT_TIME                     630 µs       145 µs
Activity time of key INSERT statement    15.6 ms      4.2 ms
Activity wait time of key INSERT         8 ms         1.5 ms
#IDUG
Other Ways to Drill Down on Interconnect Traffic

● MON_GET_CF_WAIT_TIME() gives round-trip counts and times by message type (locks, writes and reads between member and CF; see the query sketch below)

CF_CMD_NAME                 REQUESTS     WAIT_TIME
SetLockState                107787498    6223065328
WriteAndRegisterMultiple    4137160      2363217374
ReadAndRegister             57732390     4227970323
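A minimal invocation sketch; -2 (data from all members) follows the convention of the other MON_GET_* table functions shown in this deck:

  SELECT * FROM TABLE( MON_GET_CF_WAIT_TIME(-2) ) AS T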
#IDUG
Other Ways to Drill Down on Interconnect Traffic (continued)

● Typical MON_GET_CF_WAIT_TIME() desirable upper bounds (average wait per request, in µs):

CF_CMD_NAME                 AVG WAIT_TIME
SetLockState                30-40
WriteAndRegisterMultiple    500
ReadAndRegister             100

● MON_GET_CF_CMD() gives command processing time on the CF, without network time:

CF_CMD_NAME                 REQUESTS     CMD_TIME
SetLockState                107787498    3552982001
WriteAndRegisterMultiple    4137160      994550123
ReadAndRegister             57732390     2799436932
CrossInvalidate             3992011      59880165

• CrossInvalidate CMD times > 20 µs per request can indicate a network bottleneck
  – CrossInvalidate processing has the least CF CPU overhead, and so can often be used as a direct indicator of network health
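A sketch for the CF side as well; I'm assuming MON_GET_CF_CMD takes the CF identifier as its single argument (NULL for all CFs), by analogy with MON_GET_CF – check the Information Center for the exact signature:

  SELECT * FROM TABLE( MON_GET_CF_CMD(NULL) ) AS T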
#IDUG
The Role of Cross Invalidation in pureScale
● Cross invalidation (aka silent invalidation)
  ● The CF receives a new version of a page
  ● The CF uses RDMA to turn off the 'valid' bit associated with (now) stale copies of this page in any member's local bufferpool
  ● Since this is done with RDMA, it requires no CPU cycles on the other members
    ● No interrupt or other message processing required
● Cross invalidation is key to efficient scaling as the cluster grows

[Diagram: a member sends a new page image to the CF (GBP, GLM, SCA); the CF's buffer manager uses RDMA to invalidate stale copies of the page in the other members' local bufferpools]
#IDUG
Agenda
● DB2 pureScale Technology Overview
● Initial Configuration Best Practices
– Cluster Topology, Network, Storage
– Client and Workload Balancing Configuration
– Instance & Database Configuration
● Performance Tuning Best Practices
– Compute
– Network
– Storage
– Workload
• Emerging Best Practices
#IDUG
New Recovery Time Controls in 10.5
• page_age_trgt_mcr : target age of the oldest updated page in the LBP not yet reflected in the GBP (pureScale) or on persistent storage (non-pureScale)
  • DB configuration parameter; default 120 seconds
  • Comments:
    – Can approximate member recovery time when you have very large batch update transactions
    – If you do not have large batch update transactions, this parameter has little effect, due to pureScale's 'force-at-commit' policy, which requires all transactions to send all updated pages to the CF's GBP before the transaction can commit, and limits member recovery time to ~20 sec
  • If your workload has very large batch update transactions and you want to avoid the exposure of a longer member recovery period, consider a smaller value than the 120 second default

• page_age_trgt_gcr : target age of the oldest updated page in the LBP not yet reflected on persistent storage
  • DB configuration parameter; default 240 seconds
  • Comments:
    – Can approximate group crash recovery time
    – Recall: group crash recovery only occurs in rare simultaneous failure conditions (e.g. both CFs fail at the same time)
  • Consider a higher value if experiencing high I/O due to castout
  • Consider a lower value if significantly lower recovery times for a cluster-wide outage (e.g. concurrent primary and secondary CF failure) are required

• Set SOFTMAX=0 to use these parameters on an upgraded database (see the sketch below)
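A minimal sketch of applying these; the database name MYDB and the target values are hypothetical:

  # MYDB and the values below are hypothetical
  db2 update db cfg for MYDB using SOFTMAX 0
  db2 update db cfg for MYDB using PAGE_AGE_TRGT_MCR 60
  db2 update db cfg for MYDB using PAGE_AGE_TRGT_GCR 120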
#IDUG
Using Random Indexes (in DB2 10.5)

• Intensive inserts of successively incrementing/decrementing key values can cause significant demand for pages at the high/low end of the index
  – The 'top' page getting all the new keys is reclaimed between members
• The new RANDOM keyword on index creation causes keys to be hashed on insert
  – Use this to address hot high-key indexes that are used for unique/primary constraints, or point-access queries
  – Other indexes with contention points should still use "current member" indexes

    create index ordindex on orders (ordnum random);

[Diagram: with orders(ordnum), all members compete for the last page of the ascending index; with orders(ordnum random), RANDOM spreads increasing keys evenly over the index, avoiding hotspots]
#IDUG
Enabling Sufficient Capacity during Maintenance
• pureScale supports completely seamless online system maintenance
  • With online "rolling maintenance", members/CFs are maintained one at a time:
    1. Drain the member of work
    2. Temporarily remove the member/CF from the cluster, and perform maintenance
    3. Add it back to the cluster and restart
    4. Repeat with the next member/CF
  • Unlike other methods, this requires no forcing of connections, error codes, or quiesce waits
  • DB2 10.5 adds support for online rolling DB2 fixpack updates
• Possible strategies for ensuring sufficient capacity during online maintenance (see the sketch below):
  • Schedule maintenance during off-peak hours
    • For example, if you have 25% less workload during a 2-hour window Sunday morning, define a 4-member cluster (each with the same CPU power), so that shutting down 1 member removes only 25% of capacity
  • Maintain full capacity by dynamically adding CPUs to members
    • Via dynamic LPARs on AIX, for example
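A hedged sketch of draining and restarting one member during rolling maintenance; the member id and quiesce timeout (in minutes) are hypothetical, and the actual maintenance step is elided:

  # hypothetical member id and timeout
  db2stop member 2 quiesce 30
  # ... apply maintenance on that host ...
  db2start member 2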