Troubleshooting SQL Server
Stephen Rose, MCSE, MCT, MCSA, MCP+I, Microsoft MVP, Connected Systems Developer

Page 1: Troubleshooting SQL Server

Troubleshooting SQL Server

Stephen Rose, MCSE, MCT, MCSA, MCP+I
Microsoft MVP, Connected Systems Developer

Page 2: Troubleshooting SQL Server

Agenda

Who Am I?
Where Do I Start?
Case Study: MS Society of Canada
Optimal Environment
Performance Monitor (PerfMon)
Optimizing SQL
Conclusions
Q and A

Page 3: Troubleshooting SQL Server

Who Am I?

Stephen Rose
◦Partner / Network Architect with Odyssey Consulting Group
◦MCSE, MCT, MCSA, MCP+I
◦2007 Microsoft Most Valuable Professional (Networking)
◦Certified in Windows NT, 2000, and 2003
◦15 years of tech experience
◦Technical blogger with Fast Company Magazine: http://blog.fastcompany.com/experts
◦Personal tech blog: http://mcsegeek.wordpress.com
◦Member of the UCSD Advisory Board
◦Member of the INETA.org Board

Page 4: Troubleshooting SQL Server

Let’s begin

Page 5: Troubleshooting SQL Server

Case Study Background

Odyssey Consulting Group was contracted by the Multiple Sclerosis Society of Canada to help redesign and optimize its internal network systems to better support its new online fundraising portal.

Technologies like web farms, load balancing, SQL clustering and server virtualization were introduced to help the MS Society meet its needs, but the big issue was SQL and its connections to some legacy systems.

Page 6: Troubleshooting SQL Server

Optimal Environment

Disk Array
◦Small disks = faster
◦10 x 30 GB disks rather than 2 x 150 GB
◦Seek time, latency, search
◦10K to 15K RPM
◦RAID 0+1
32-bit SQL vs. 64-bit SQL
Clustering
Server 2008 w/ SQL 2008
Web Farm
Load Balancing

Page 7: Troubleshooting SQL Server

PerfMon is an SNMP-based performance monitoring tool.

PerfMon has the following characteristics:
◦High performance
◦It requires little CPU to run, even with more than a thousand hosts being polled.

Page 8: Troubleshooting SQL Server

MS Society of Canada

Page 9: Troubleshooting SQL Server

Network Setup

Page 10: Troubleshooting SQL Server

Web Server
◦Dual Xeon processor, 3 GHz
◦2 GB RAM
◦2 x 72 GB 10K drives (RAID 1)
◦Windows 2003 SP1

NAT-SQL-01
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 36 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 72 GB 15K drives (connected to a SAN)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2000 SP4

Page 11: Troubleshooting SQL Server

NAT-SQL-02
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 18 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 18 GB 15K drives (connected to an external SCSI array)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2005

Page 12: Troubleshooting SQL Server

SAN
◦IBM DS4300 Expansion Unit
◦2 x 72 GB 15K drives for NAT-APP-03 user files (RAID 1)
◦12 x 72 GB 15K drives for NAT-SQL-01 data and log files (RAID 10)
◦2 x 72 GB 15K drives for NAT-SQL-03 data and log files (RAID 1)
◦6 x 72 GB 15K drives for VMware (RAID 5)
◦2 SAN switches for redundancy (connected to NAT-SQL-01, NAT-APP-03, NAT-SQL-03)
◦All servers have dual HBAs for redundancy

Page 13: Troubleshooting SQL Server

Network
◦2 x Cisco ASA5510 firewalls, connected to 3 SDSL and 1 ADSL internet lines (2 lines per firewall in context mode)
◦2 x Cisco Catalyst 3560G core switches (configured for failover; default gateway for the network; firewalls plugged into these and linked to the 3Com switches below)
◦VLANs configured for routing tables

Page 14: Troubleshooting SQL Server

Network
◦2 x 3Com SuperStack 3 4228G switches (all servers plugged directly into these, along with all hubs and the Rogers VPN connection)
◦Dual T1 line connected to the 3Com switches via a Cisco 1700 Series router, linking 14 remote sites for VPN

Page 15: Troubleshooting SQL Server

Issues/Solutions

Page 16: Troubleshooting SQL Server

% of Processor Time

Processor Usage:
Issue:
◦The processor usage averages around 50%-70%. Processor usage should be around 20%. This shows there are not enough processor cycles to manage the data.
Solution:
◦Utilize more processors, preferably 64-bit and capable of Hyper-Threading, with 64-bit SQL and the 2003 OS.

Page 17: Troubleshooting SQL Server

\\NAT-SQL-01\Processor(_Total)\% Processor Time

Page 18: Troubleshooting SQL Server

\\NAT-APP-04\Processor(_Total)\% Processor Time

Page 19: Troubleshooting SQL Server

Disk Speed

Disk Speed:
Issue:
◦Your average write time is around 75 ms. This should hang around 20 ms. Your read time is averaging 100 ms. It should be around 40 ms.
Solution:
◦Upgrade to smaller disks (40-60 GB max) that are faster (15,000 RPM).

Page 20: Troubleshooting SQL Server

\\NAT-SQL-01\PhysicalDisk(_Total)\% Disk Read Time

Page 21: Troubleshooting SQL Server

\\NAT-APP-04\PhysicalDisk(_Total)\% Disk Write Time

Page 22: Troubleshooting SQL Server

How Do We Fix These Issues?

Page 23: Troubleshooting SQL Server

Recommendations

Memory
◦Memory is being maxed out. Upgrade to max RAM.

Web Server
◦Requires 4 GB RAM minimum.

SQL 1
◦Requires more than 4 GB.
◦Reduce the size of the 12 drives from 72 GB to 40 GB.
◦More processing power is required. 1.5 GHz is not enough power.
◦Upgrade to SQL 2005 64-bit.
◦Switch from RAID 1 to RAID 0+1.

SQL 2
◦Same recommendations, except drive sizes and SQL version are fine.

SAN
◦Reduce the size of the drives from 72 GB to 40 GB.
◦Go 0+1 on all RAID.

Page 24: Troubleshooting SQL Server

Let’s Dig Deeper

Page 25: Troubleshooting SQL Server

OLTP: Online Transaction Processing

OLTP workloads are characterized by high volumes of similar small transactions.

◦It is important to keep these characteristics in mind as we examine the significance of database design, resource utilization and system performance.

◦In this case study, the OLTP was an online system that allowed people to sponsor walkers/runners in MS Society events.

Page 26: Troubleshooting SQL Server

Database Design Issue If:

Too many table joins for frequent queries. Overuse of joins in an OLTP application results in longer running queries & wasted system resources.

◦Generally, frequent operations requiring 5 or more table joins should be avoided by redesigning the database.

Page 27: Troubleshooting SQL Server

Database Design Issue If:

Too many indexes on frequently updated (inclusive of inserts, updates and deletes) tables incur extra index maintenance overhead.

◦Generally, OLTP database designs should keep the number of indexes to a functional minimum, again due to the high volumes of similar transactions combined with the cost of index maintenance.

Page 28: Troubleshooting SQL Server

Database Design Issue If:

Big IOs such as table and range scans due to missing indexes.

◦By definition, OLTP transactions should not require big IOs; any that do occur should be examined.

Page 29: Troubleshooting SQL Server

Database Design Issue If:

Unused indexes incur the cost of index maintenance for inserts, updates, and deletes without benefiting any users.

◦Unused indexes should be eliminated.

◦Any index that has been used (by select, update or delete operations) will appear in sys.dm_db_index_usage_stats.

◦Thus, any defined index not included in this DMV has not been used since the last restart of SQL Server.
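
The comparison described on this slide amounts to a set difference between the indexes that are defined and the indexes that show up in the usage DMV. The following is a minimal Python sketch of that idea, assuming both lists have already been fetched into memory. The object IDs, index names, and the function name `find_unused_indexes` are hypothetical and are not taken from the case study.

```python
def find_unused_indexes(defined_indexes, usage_rows):
    """Return names of defined indexes that have no row in the usage DMV.

    defined_indexes: dict mapping (object_id, index_id) -> index name
    usage_rows: iterable of (object_id, index_id) pairs, assumed to have been
                read from sys.dm_db_index_usage_stats beforehand
    """
    used = set(usage_rows)
    return sorted(name for key, name in defined_indexes.items() if key not in used)


# Hypothetical sample data for illustration only.
defined = {
    (101, 1): "PK_Donations",
    (101, 2): "IX_Donations_WalkerId",
    (101, 3): "IX_Donations_LegacyCode",
}
used_since_restart = [(101, 1), (101, 2)]

unused = find_unused_indexes(defined, used_since_restart)
print(unused)
```

Because the DMV only covers activity since the last restart, a result like this is best read as a list of candidates to review after a representative period of workload, not as a list of indexes to drop immediately.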

Page 30: Troubleshooting SQL Server

CPU Bottleneck If:

Signal waits > 25% of total waits.

◦See sys.dm_os_wait_stats for Signal waits and Total waits.

◦Signal waits measure the time spent in the runnable queue waiting for CPU.

◦High signal waits indicate a CPU bottleneck.
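
As a rough illustration of the 25% rule, the sketch below computes signal waits as a share of total waits. It assumes the rows have already been read from sys.dm_os_wait_stats as (wait_type, wait_time_ms, signal_wait_time_ms) tuples. The sample numbers and the function name `signal_wait_percent` are hypothetical.

```python
def signal_wait_percent(wait_rows):
    """Percentage of total wait time that was spent waiting for CPU.

    wait_rows: iterable of (wait_type, wait_time_ms, signal_wait_time_ms),
               assumed to come from sys.dm_os_wait_stats
    """
    rows = list(wait_rows)
    total_wait_ms = sum(wait_ms for _, wait_ms, _ in rows)
    signal_wait_ms = sum(signal_ms for _, _, signal_ms in rows)
    if total_wait_ms == 0:
        return 0.0
    return 100.0 * signal_wait_ms / total_wait_ms


# Hypothetical sample values, not measurements from the case study.
sample_waits = [
    ("WRITELOG", 6000, 900),
    ("PAGEIOLATCH_SH", 3000, 300),
    ("SOS_SCHEDULER_YIELD", 1000, 1000),
]

pct = signal_wait_percent(sample_waits)
cpu_bottleneck_suspected = pct > 25.0  # threshold quoted on the slide
print(f"signal waits: {pct:.1f}% of total waits")
```

With these sample values the share comes out below the 25% threshold, so the rule would not flag a CPU bottleneck.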

Page 31: Troubleshooting SQL Server

CPU Bottleneck If:

Plan re-use < 90%.

◦A query plan is used to execute a query.

◦Plan re-use is desirable for OLTP workloads because re-creating the same plan (for similar or identical transactions) is a waste of CPU resources.

◦Compare SQL Server SQL Statistics: Batch Requests/sec to SQL Compilations/sec.

◦Compute plan re-use as follows: Plan re-use = (Batch requests - SQL compilations) / Batch requests.

◦Special exception to the plan re-use rule: zero cost plans will not be cached (not re-used) in SQL 2005 SP2.

◦Applications that use zero cost plans will have a lower plan re-use, but this is not a performance issue.
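
The plan re-use formula above can be worked through with a small Python sketch. The counter values are hypothetical, and the function name `plan_reuse` is chosen here for illustration only.

```python
def plan_reuse(batch_requests_per_sec, sql_compilations_per_sec):
    """Plan re-use = (Batch requests - SQL compilations) / Batch requests."""
    if batch_requests_per_sec <= 0:
        raise ValueError("batch_requests_per_sec must be positive")
    return (batch_requests_per_sec - sql_compilations_per_sec) / batch_requests_per_sec


# Hypothetical counter readings: 1000 batch requests/sec, 150 compilations/sec.
ratio = plan_reuse(1000, 150)
below_target = ratio < 0.90  # the slide's 90% threshold
print(f"plan re-use: {ratio:.0%}")
```

In this example 850 of every 1000 batches re-use a cached plan, which falls short of the 90% target and would warrant a closer look, keeping in mind the zero cost plan exception noted above.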

Page 32: Troubleshooting SQL Server

CPU Bottleneck If:

Parallel wait type CXPACKET > 10% of total waits.

◦Parallelism sacrifices CPU resources for speed of execution.

◦Given the high volumes of OLTP, parallel queries usually reduce OLTP throughput and should be avoided.

◦See sys.dm_os_wait_stats for wait statistics.

Page 33: Troubleshooting SQL Server

Memory Bottleneck If:

Consistently low average page life expectancy.

◦See the Page Life Expectancy counter in the Perfmon object SQL Server Buffer Manager (this represents the average number of seconds a page stays in cache).

◦For OLTP, an average page life expectancy of 300 seconds is 5 minutes.

◦Anything less could indicate memory pressure, missing indexes, or a cache flush.

Page 34: Troubleshooting SQL Server

Memory Bottleneck If:

Sudden big drop in page life expectancy. OLTP applications (e.g. small transactions) should have a steady (or slowly increasing) page life expectancy.

◦See Perfmon object SQL Server Buffer Manager.

Page 35: Troubleshooting SQL Server

Memory Bottleneck If:

Pending memory grants.

◦See counter Memory Grants Pending, in the Perfmon object SQL Server Memory Manager.

◦Small OLTP transactions should not require a large memory grant.

Page 36: Troubleshooting SQL Server

Memory Bottleneck If:

Sudden drops or consistently low SQL Cache hit ratio. OLTP applications (e.g. small transactions) should have a high cache hit ratio.

◦Since OLTP transactions are small, there should not be (1) big drops in SQL Cache hit rates or (2) consistently low cache hit rates < 90%.

◦Drops or low cache hit may indicate memory pressure or missing indexes.

Page 37: Troubleshooting SQL Server

IO Bottleneck If:

High average disk seconds per read. When the IO subsystem is queued, disk seconds per read increases.

◦See Perfmon Logical or Physical disk (disk seconds/read counter).

◦Normally it takes 4-8 ms to complete a read when there is no IO pressure.

◦When the IO subsystem is under pressure due to high IO requests, the average time to complete a read increases, showing the effect of disk queues.

Page 38: Troubleshooting SQL Server

IO Bottleneck If:

Periodic higher values for disk seconds/read may be acceptable for many applications.

◦For high performance OLTP applications, sophisticated SAN subsystems provide greater IO scalability and resiliency in handling spikes of IO activity.

◦Sustained high values for disk seconds/read (>15 ms) do indicate a disk bottleneck.

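
One way to tell a periodic spike from a sustained problem is to look for a run of consecutive samples above the 15 ms mark. The Python sketch below illustrates that idea on a list of disk seconds/read samples. The run length of five samples is an assumption made here (the slide does not define "sustained"), and the sample values and function name are hypothetical.

```python
READ_LATENCY_LIMIT_S = 0.015  # 15 ms, the threshold quoted on the slide


def has_sustained_high_latency(samples_s, limit_s=READ_LATENCY_LIMIT_S, min_run=5):
    """True if at least min_run consecutive samples exceed limit_s.

    samples_s: disk seconds/read values, in seconds, in collection order
    min_run:   assumed definition of "sustained"; tune to the sample interval
    """
    run = 0
    for value in samples_s:
        run = run + 1 if value > limit_s else 0
        if run >= min_run:
            return True
    return False


# Hypothetical samples: isolated spikes versus a sustained plateau.
spiky = [0.006, 0.040, 0.007, 0.005, 0.030, 0.006, 0.008]
plateau = [0.006, 0.020, 0.022, 0.025, 0.030, 0.028, 0.007]

print(has_sustained_high_latency(spiky))
print(has_sustained_high_latency(plateau))
```

The first series contains two isolated spikes and is not flagged, while the second stays above 15 ms for five samples in a row and is.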

Page 39: Troubleshooting SQL Server

IO Bottleneck If:

High average disk seconds per write. See Perfmon Logical or Physical disk.

◦The throughput for high volume OLTP applications is dependent on fast sequential transaction log writes.

◦A transaction log write can be as fast as 1 ms (or less) for high performance SAN environments.

◦For many applications, a periodic spike in average disk seconds per write is acceptable considering the high cost of sophisticated SAN subsystems.

◦However, sustained high values for average disk seconds/write are a reliable indicator of a disk bottleneck.

Page 40: Troubleshooting SQL Server

IO Bottleneck If:

Big IOs such as table and range scans due to missing indexes.

◦Top wait statistics in sys.dm_os_wait_stats are related to IO, such as ASYNC_IO_COMPLETION, IO_COMPLETION, LOGMGR, WRITELOG, or PAGEIOLATCH_x.

Page 41: Troubleshooting SQL Server

Network Bottleneck If:

High network latency coupled with an application that incurs many round trips to the database.

◦Network bandwidth is used up.

◦See the Packets/sec and Current Bandwidth counters in the Network Interface object of Performance Monitor.

◦For TCP/IP frames, actual bandwidth in Mbps is computed as packets/sec * 1500 * 8 / 1000000.
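
The bandwidth formula above can be checked with a short Python sketch. It follows the slide in assuming 1500 bytes per frame, so the result is an upper-bound estimate when many packets are smaller. It also assumes the Current Bandwidth counter is reported in bits per second. The packet rate, link speed, and function names are hypothetical.

```python
def estimated_mbps(packets_per_sec, frame_bytes=1500):
    """Estimated bandwidth in Mbps: packets/sec * frame bytes * 8 / 1,000,000."""
    return packets_per_sec * frame_bytes * 8 / 1_000_000


def utilization(packets_per_sec, current_bandwidth_bps, frame_bytes=1500):
    """Fraction of the link in use, given Current Bandwidth in bits per second."""
    used_bps = estimated_mbps(packets_per_sec, frame_bytes) * 1_000_000
    return used_bps / current_bandwidth_bps


# Hypothetical readings: 5000 packets/sec on a 100 Mbps interface.
mbps = estimated_mbps(5000)
share = utilization(5000, 100_000_000)
print(f"{mbps:.1f} Mbps, {share:.0%} of the link")
```

Here 5000 full-size frames per second works out to 60 Mbps, or about 60% of a 100 Mbps link.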

Page 42: Troubleshooting SQL Server

SQL Virtualization

Hyper-V is a hypervisor-based technology that is a key feature of Windows Server 2008. It provides scalability and high performance by supporting features like guest multi-processing support and 64-bit guest and host support; reliability and security through its hypervisor architecture; and flexibility and manageability by supporting features like quick migration of virtual machines from one physical host to another, and integration with System Center Virtual Machine Manager.

Page 43: Troubleshooting SQL Server

Questions?

Page 44: Troubleshooting SQL Server

Thank You

Email:◦[email protected]

Blog:◦http://mcsegeek.wordpress.com