Troubleshooting SQL Server


Stephen Rose - MCSE, MCT, MCSA, MCP+I
Microsoft MVP - Connected Systems Developer

Agenda

Who Am I?
Where Do I Start?
Case Study - MS Society of Canada
Optimal Environment
Performance Monitor (PerfMon)
Optimizing SQL
Conclusions
Q and A

Who Am I?

Stephen Rose
◦Partner / Network Architect with Odyssey Consulting Group
◦MCSE, MCT, MCSA, MCP+I
◦2007 Microsoft Most Valuable Professional (Networking)
◦Certified in Windows NT, 2000, and 2003
◦15 years of tech experience
◦Technical blogger with Fast Company Magazine: http://blog.fastcompany.com/experts
◦Personal tech blog: http://mcsegeek.wordpress.com
◦Member of the UCSD Advisory Board
◦Member of the INETA.org Board

Let’s Begin

Case Study Background

Odyssey Consulting Group was contracted by the Multiple Sclerosis Society of Canada to help redesign and optimize their internal network systems to better support their new online fundraising portal.

Technologies like web farms, load balancing, SQL clustering and server virtualization were introduced to help the MS Society meet its needs, but the big issue was SQL and its connections to some legacy systems.

Optimal Environment

Disk Array
◦Small disks = faster
◦10 x 30GB disks rather than 2 x 150GB
◦Seek time, latency, search
◦10K to 15K RPM
◦RAID 0+1

32-bit SQL vs. 64-bit SQL
Clustering
Server 2008 w/ SQL 2008
Web Farm
Load Balancing
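The preference for 10K to 15K RPM drives can be illustrated with a little arithmetic. The sketch below assumes the common approximation that average rotational latency is the time for half a revolution; that approximation and the function name are illustrative and not taken from the slides.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds (half of one revolution)."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in one minute
    return ms_per_revolution / 2.0


if __name__ == "__main__":
    for rpm in (10_000, 15_000):
        latency = avg_rotational_latency_ms(rpm)
        print(f"{rpm} RPM: {latency:.1f} ms average rotational latency")
```

Rotational latency is only one component of access time; seek time and queueing add to it, which is part of why spreading IO across more spindles helps.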

PerfMon is an SNMP-based performance monitoring tool.

PerfMon has the following characteristics:
◦High performance
◦It requires little CPU to run, even with more than a thousand hosts being polled.

MS Society of Canada

Network Setup

Web Server
◦Dual Xeon Processor 3 GHz
◦2 GB RAM
◦2 x 72GB 10K drives (RAID 1)
◦Windows 2003 SP1

NAT-SQL-01
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 36 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 72 GB 15K drives (connected to a SAN)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2000 SP4

NAT-SQL-02
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 18 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 18 GB 15K drives (connected to an external SCSI array)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2005

SAN
◦IBM DS4300 Expansion Unit
◦2 x 72GB 15K drives for NAT-APP-03 user files (RAID 1)
◦12 x 72GB 15K drives for NAT-SQL-01 data and log files (RAID 10)
◦2 x 72GB 15K drives for NAT-SQL-03 data and log files (RAID 1)
◦6 x 72GB 15K drives for VMware (RAID 5)
◦2 SAN switches for redundancy (connected to NAT-SQL-01, NAT-APP-03, NAT-SQL-03)
◦All servers have dual HBAs for redundancy

Network
◦2 x Cisco ASA5510 firewalls, connected to 3 SDSL and 1 ADSL internet lines (2 lines per firewall in context mode)
◦2 x Cisco Catalyst 3560G core switches (configured for failover, default gateway for network, firewalls plugged into these and linked to 3Com switches below)
◦VLANs configured for routing tables
◦2 x 3Com SuperStack 3 4228G switches (all servers plugged directly into these along with all hubs and Rogers VPN connection)
◦Dual T1 line connected to 3Com switches via a Cisco 1700 Series router linking 14 remote sites for VPN

Issues/Solutions

% of Processor Time

Processor Usage

Issue:
◦The processor usage averages around 50%-70%. Processor usage should be around 20%. This shows there are not enough processor cycles to manage the data.

Solution:
◦Utilize more processors, preferably 64-bit and capable of Hyper-Threading, with 64-bit SQL and the 2003 OS.

\\NAT-SQL-01\Processor(_Total)\% Processor Time

\\NAT-APP-04\Processor(_Total)\% Processor Time

Disk Speed

Issue:
◦Your average write time is around 75 ms. This should hang around 20 ms. Your read time is averaging 100 ms. It should be around 40 ms.

Solution:
◦Upgrade to smaller disks (40-60GB max) that are faster (15,000 RPM).

\\NAT-SQL-01\PhysicalDisk(_Total)\% Disk Read Time

\\NAT-APP-04\PhysicalDisk(_Total)\% Disk Write Time

How Do We Fix These Issues?

Recommendations

Memory
◦Memory is being maxed out. Upgrade to max RAM.

Web Server
◦Requires 4 GB RAM minimum.

SQL 1
◦Requires more than 4 GB RAM.
◦Reduce the size of the 12 drives from 72 GB to 40 GB.
◦More processing power is required. 1.5 GHz is not enough power.
◦Upgrade to SQL 2005 64-bit.
◦Switch from RAID 1 to RAID 0+1.

SQL 2
◦Same recommendations, except that the drive sizes and SQL version are fine.

SAN
◦Reduce the size of the drives from 72 GB to 40 GB.
◦Go RAID 0+1 on all arrays.

Let’s Dig Deeper

OLTP - Online Transaction Processing

OLTP workloads are characterized by high volumes of similar small transactions.

◦It is important to keep these characteristics in mind as we examine the significance of database design, resource utilization and system performance.

◦In this case study, the OLTP was an online system that allowed people to sponsor walkers/runners in MS Society events.

Database Design Issue If:

Too many table joins for frequent queries. Overuse of joins in an OLTP application results in longer running queries & wasted system resources.

◦Generally, frequent operations requiring 5 or more table joins should be avoided by redesigning the database.


Too many indexes on frequently updated (inclusive of inserts, updates and deletes) tables incur extra index maintenance overhead.

◦Generally, OLTP database designs should keep the number of indexes to a functional minimum, again due to the high volumes of similar transactions combined with the cost of index maintenance.


Big IOs such as table and range scans due to missing indexes.

◦By definition, OLTP transactions should not require big IOs; any that do should be examined.


Unused indexes incur the cost of index maintenance for inserts, updates, and deletes without benefiting any users.

◦Unused indexes should be eliminated.

◦Any index that has been used (by select, update or delete operations) will appear in sys.dm_db_index_usage_stats.

◦Thus, any defined index not included in this DMV has not been used since the last restart of SQL Server.
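The unused-index check above is a set difference: indexes that are defined minus indexes that appear in the usage DMV. A minimal sketch, assuming the (object_id, index_id) pairs have already been fetched from sys.indexes and sys.dm_db_index_usage_stats; the function name and sample IDs are hypothetical.

```python
def unused_indexes(defined, used):
    """Return defined (object_id, index_id) pairs absent from the usage DMV rows."""
    return sorted(set(defined) - set(used))


# Hypothetical sample data: (object_id, index_id) pairs.
defined = [(101, 1), (101, 2), (101, 3), (202, 1)]  # e.g. from sys.indexes
used = [(101, 1), (202, 1)]  # e.g. from sys.dm_db_index_usage_stats

print(unused_indexes(defined, used))
```

Because the DMV resets when SQL Server restarts, a result like this is only meaningful after the server has run through a representative workload.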

CPU Bottleneck If:

Signal waits > 25% of total waits.

◦See sys.dm_os_wait_stats for Signal waits and Total waits.

◦Signal waits measure the time spent in the runnable queue waiting for CPU.

◦High signal waits indicate a CPU bottleneck.
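The 25% rule can be checked with a short calculation once the wait statistics have been read. This sketch assumes each row carries the wait_time_ms and signal_wait_time_ms columns of sys.dm_os_wait_stats; the function name and sample values are hypothetical.

```python
def signal_wait_pct(rows):
    """Percentage of total wait time spent as signal waits.

    rows: iterable of (wait_time_ms, signal_wait_time_ms) pairs.
    """
    rows = list(rows)  # allow generators, since the data is summed twice
    total_wait_ms = sum(wait_ms for wait_ms, _ in rows)
    signal_ms = sum(signal for _, signal in rows)
    if total_wait_ms == 0:
        return 0.0
    return 100.0 * signal_ms / total_wait_ms


# Hypothetical sample rows.
sample = [(1000, 200), (3000, 1000)]
pct = signal_wait_pct(sample)
print(f"signal waits: {pct:.1f}% of total waits")
print("possible CPU bottleneck" if pct > 25 else "signal waits look fine")
```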


Plan re-use < 90%.

◦A query plan is used to execute a query.

◦Plan re-use is desirable for OLTP workloads because re-creating the same plan (for similar or identical transactions) is a waste of CPU resources.

◦Compare SQL Server SQL Statistics: batch requests/sec to SQL compilations/sec.

◦Compute plan re-use as follows: Plan re-use = (Batch requests - SQL compilations) / Batch requests.

◦Special exception to the plan re-use rule: Zero cost plans will not be cached (not re-used) in SQL 2005 SP2.

◦Applications that use zero cost plans will have a lower plan re-use but this is not a performance issue.
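The plan re-use formula above translates directly into code. A minimal sketch, assuming the two counter values have already been sampled from Perfmon; the function name and the sample numbers are hypothetical.

```python
def plan_reuse(batch_requests_per_sec: float, sql_compilations_per_sec: float) -> float:
    """Plan re-use = (batch requests - SQL compilations) / batch requests."""
    if batch_requests_per_sec <= 0:
        raise ValueError("batch_requests_per_sec must be positive")
    return (batch_requests_per_sec - sql_compilations_per_sec) / batch_requests_per_sec


# Hypothetical counter samples.
ratio = plan_reuse(batch_requests_per_sec=1000, sql_compilations_per_sec=50)
print(f"plan re-use: {ratio:.0%}")
print("below the 90% guideline" if ratio < 0.90 else "meets the 90% guideline")
```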


Parallel wait type CXPACKET > 10% of total waits.

◦Parallelism sacrifices CPU resources for speed of execution.

◦Given the high volumes of OLTP, parallel queries usually reduce OLTP throughput and should be avoided.

◦See sys.dm_os_wait_stats for wait statistics.
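The share of a single wait type such as CXPACKET can be computed the same way. This sketch assumes a mapping of wait_type to wait_time_ms taken from sys.dm_os_wait_stats; the function name and the sample figures are hypothetical.

```python
def wait_type_share_pct(waits, wait_type):
    """Percentage of total wait time attributed to one wait type.

    waits: dict mapping wait_type name to accumulated wait_time_ms.
    """
    total_ms = sum(waits.values())
    if total_ms == 0:
        return 0.0
    return 100.0 * waits.get(wait_type, 0) / total_ms


# Hypothetical sample values.
waits = {"CXPACKET": 150, "WRITELOG": 350, "PAGEIOLATCH_SH": 500}
share = wait_type_share_pct(waits, "CXPACKET")
print(f"CXPACKET: {share:.1f}% of total waits")
```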

Memory Bottleneck If:

Consistently low average page life expectancy.

◦See the Page Life Expectancy counter in the Perfmon object SQL Server Buffer Manager (this represents the average number of seconds a page stays in cache).

◦For OLTP, the reference point is an average page life expectancy of 300 seconds (5 minutes).

◦Anything less could indicate memory pressure, missing indexes, or a cache flush.


Sudden big drop in page life expectancy. OLTP applications (e.g. small transactions) should have a steady (or slowly increasing) page life expectancy.

◦See Perfmon object SQL Server Buffer Manager.


Pending memory grants.

◦See counter Memory Grants Pending, in the Perfmon object SQL Server Memory Manager.

◦Small OLTP transactions should not require a large memory grant.


Sudden drops or consistently low SQL Cache hit ratio. OLTP applications (e.g. small transactions) should have a high cache hit ratio.

◦Since OLTP transactions are small, there should not be (1) big drops in SQL Cache hit rates or (2) consistently low cache hit rates < 90%.

◦Drops or low cache hit may indicate memory pressure or missing indexes.

IO Bottleneck If:

High average disk seconds per read. When the IO subsystem is queued, disk seconds per read increases.

◦See Perfmon Logical or Physical disk (disk seconds/read counter).

◦Normally it takes 4-8ms to complete a read when there is no IO pressure.

◦When the IO subsystem is under pressure due to high IO requests, the average time to complete a read increases, showing the effect of disk queues.


Periodic higher values for disk seconds/read may be acceptable for many applications.

◦For high performance OLTP applications, sophisticated SAN subsystems provide greater IO scalability and resiliency in handling spikes of IO activity.

◦Sustained high values for disk seconds/read (>15ms) do indicate a disk bottleneck.
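The difference between a periodic spike and a sustained high disk seconds/read value can be sketched as a run-length check over sampled counter values. The 15ms threshold comes from the guidance above; the definition of "sustained" as five consecutive samples, the function name, and the sample data are assumptions made for illustration.

```python
def sustained_read_bottleneck(samples_sec, threshold_ms=15.0, min_run=5):
    """True if at least min_run consecutive samples exceed threshold_ms.

    samples_sec: disk seconds/read samples, in seconds as Perfmon reports them.
    """
    run = 0
    for sec in samples_sec:
        if sec * 1000.0 > threshold_ms:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0  # a healthy sample breaks the run
    return False


# Hypothetical samples: one isolated spike versus a sustained run.
spike = [0.005, 0.030, 0.006, 0.005, 0.007, 0.004]
sustained = [0.020, 0.020, 0.020, 0.020, 0.020, 0.006]
print(sustained_read_bottleneck(spike))
print(sustained_read_bottleneck(sustained))
```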


High average disk seconds per write. See Perfmon Logical or Physical disk.

◦The throughput for high volume OLTP applications is dependent on fast sequential transaction log writes.

◦A transaction log write can be as fast as 1ms (or less) for high performance SAN environments.

◦For many applications, a periodic spike in average disk seconds per write is acceptable considering the high cost of sophisticated SAN subsystems.

◦However, sustained high values for average disk seconds/write are a reliable indicator of a disk bottleneck.


Big IOs such as table and range scans due to missing indexes.

◦Top wait statistics in sys.dm_os_wait_stats are related to IO, such as ASYNC_IO_COMPLETION, IO_COMPLETION, LOGMGR, WRITELOG, or PAGEIOLATCH_x.

Network Bottleneck If:

High network latency coupled with an application that incurs many round trips to the database.

◦Network bandwidth is used up.

◦See the packets/sec and current bandwidth counters in the network interface object of Performance Monitor.

◦For TCP/IP frames, actual bandwidth in Mbps is computed as packets/sec * 1500 * 8 / 1000000.
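The bandwidth formula above can be sketched as follows. It assumes full 1500-byte frames, as the formula does, and that the Current Bandwidth counter is reported in bits per second; the function names and sample values are hypothetical.

```python
def estimated_mbps(packets_per_sec: float, frame_bytes: int = 1500) -> float:
    """Estimated bandwidth in Mbps: packets/sec * frame size * 8 bits / 1,000,000."""
    return packets_per_sec * frame_bytes * 8 / 1_000_000


def link_utilization(packets_per_sec: float, current_bandwidth_bps: float) -> float:
    """Fraction of the link in use, given the Current Bandwidth counter in bits/sec."""
    if current_bandwidth_bps <= 0:
        raise ValueError("current_bandwidth_bps must be positive")
    return estimated_mbps(packets_per_sec) * 1_000_000 / current_bandwidth_bps


# Hypothetical sample: 5,000 packets/sec on a 100 Mbps interface.
print(f"{estimated_mbps(5000):.1f} Mbps")
print(f"{link_utilization(5000, 100_000_000):.0%} of link capacity")
```

Since real traffic includes many frames smaller than 1500 bytes, this figure is an upper estimate of the bandwidth actually consumed.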

SQL Virtualization

Hyper-V is a hypervisor-based technology that is a key feature of Windows Server 2008. It provides scalability and high performance by supporting features like guest multi-processing support and 64-bit guest and host support; reliability and security through its hypervisor architecture; and flexibility and manageability by supporting features like quick migration of virtual machines from one physical host to another, and integration with System Center Virtual Machine Manager.

Questions?

Thank You

Email:◦[email protected]

Blog:◦http://mcsegeek.wordpress.com
