Troubleshooting SQL Server
Stephen Rose, MCSE, MCT, MCSA, MCP+I
Microsoft MVP, Connected Systems Developer
Agenda
Who Am I?
Where Do I Start?
Case Study: MS Society of Canada
Optimal Environment
Performance Monitor (PerfMon)
Optimizing SQL
Conclusions
Q and A
Who Am I?
Stephen Rose
◦Partner / Network Architect with Odyssey Consulting Group
◦MCSE, MCT, MCSA, MCP+I
◦2007 Microsoft Most Valuable Professional (Networking)
◦Certified in Windows NT, 2000, and 2003
◦15 years of tech experience
◦Technical blogger with Fast Company Magazine: http://blog.fastcompany.com/experts
◦Personal tech blog: http://mcsegeek.wordpress.com
◦Member of the UCSD Advisory Board
◦Member of the INETA.org Board
Let’s begin
Case Study Background
Odyssey Consulting Group was contracted by the Multiple Sclerosis Society of Canada to help redesign and optimize their internal network systems to better support their new online fundraising portal.
Technologies like web farms, load balancing, SQL clustering and server virtualization were introduced to help the MS Society meet their needs, but the big issue was SQL and its connections to some legacy systems.
Optimal Environment
Disk Array
◦Small disks = faster
◦10 x 30GB disks rather than 2 x 150GB
◦Seek time, latency, search
◦10K to 15K RPM
◦RAID 0+1
32-bit SQL vs. 64-bit SQL
Clustering
Server 2008 w/ SQL 2008
Web Farm
Load Balancing
PerfMon is an SNMP-based performance monitoring tool.
PerfMon has the following characteristics:
◦High performance
◦It requires little CPU to run, even with more than a thousand hosts being polled.
MS Society of Canada
Network Setup
Web Server
◦Dual Xeon Processor 3 GHz
◦2 GB RAM
◦2 x 72GB 10K drives (RAID 1)
◦Windows 2003 SP1
NAT-SQL-01
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 36 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 72 GB 15K drives (connected to a SAN)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2000 SP4
NAT-SQL-02
◦Quad Xeon MP 1.5 GHz
◦4 GB RAM
◦2 x 18 GB 10K drives, RAID 1 (internal, running Windows, page files and SQL app only)
◦12 x 18 GB 15K drives (connected to an external SCSI array)
◦RAID 10 (running SQL data and log files)
◦Windows 2003 SP1, SQL 2005
SAN
◦IBM DS4300 Expansion Unit
◦2 x 72GB 15K drives for NAT-APP-03 user files (RAID 1)
◦12 x 72GB 15K drives for NAT-SQL-01 data and log files (RAID 10)
◦2 x 72GB 15K drives for NAT-SQL-03 data and log files (RAID 1)
◦6 x 72GB 15K drives for VMware (RAID 5)
◦2 SAN switches for redundancy (connected to NAT-SQL-01, NAT-APP-03, NAT-SQL-03)
◦All servers have dual HBAs for redundancy
Network
◦2 x Cisco ASA5510 firewalls, connected to 3 SDSL and 1 ADSL internet lines (2 lines per firewall in context mode)
◦2 x Cisco Catalyst 3560G core switches (configured for failover, default gateway for the network, firewalls plugged into these and linked to the 3Com switches below)
◦VLANs configured for routing tables
Network
◦2 x 3Com SuperStack 3 4228G switches (all servers plugged directly into these, along with all hubs and the Rogers VPN connection)
◦Dual T1 line connected to the 3Com switches via a Cisco 1700 Series router, linking 14 remote sites for VPN
Issues/Solutions
% of Processor Time
Processor Usage
Issue:
◦Processor usage averages around 50%-70%. It should be around 20%. This shows there are not enough processor cycles to manage the data.
Solution:
◦Utilize more processors, preferably 64-bit processors capable of Hyper-Threading, with 64-bit SQL and the 2003 OS.
\\NAT-SQL-01\Processor(_Total)\% Processor Time
\\NAT-APP-04\Processor(_Total)\% Processor Time
Disk Speed
Issue:
◦Your average write time is around 75 ms. This should stay around 20 ms. Your read time is averaging 100 ms. It should be around 40 ms.
Solution:
◦Upgrade to smaller disks (40-60GB max) that are faster (15,000 RPM).
\\NAT-SQL-01\PhysicalDisk(_Total)\% Disk Read Time
\\NAT-APP-04\PhysicalDisk(_Total)\% Disk Write Time
How Do We Fix These Issues?
Recommendations
Memory
◦Memory is being maxed out. Upgrade to max RAM.
Web Server
◦Requires 4 GB RAM minimum.
SQL 1
◦Requires more than 4 GB RAM.
◦Reduce the size of the 12 drives from 72 GB to 40 GB.
◦More processing power is required. 1.5 GHz is not enough power.
◦Upgrade to SQL 2005 64-bit.
◦Switch from RAID 1 to RAID 0+1.
SQL 2
◦Same recommendations, except drive sizes and SQL are fine.
SAN
◦Reduce the size of the drives from 72 GB to 40 GB.
◦Go 0+1 on all RAID.
Let’s Dig Deeper
OLTP: Online Transaction Processing
OLTP workloads are characterized by high volumes of similar small transactions.
◦It is important to keep these characteristics in mind as we examine the significance of database design, resource utilization and system performance.
◦In this case study, the OLTP was an online system that allowed people to sponsor walkers/runners in MS Society events.
Database Design Issue If:
Too many table joins for frequent queries. Overuse of joins in an OLTP application results in longer running queries & wasted system resources.
◦Generally, frequent operations requiring 5 or more table joins should be avoided by redesigning the database.
Database Design Issue If:
Too many indexes on frequently updated (inclusive of inserts, updates and deletes) tables incur extra index maintenance overhead.
◦Generally, OLTP database designs should keep the number of indexes to a functional minimum, again due to the high volumes of similar transactions combined with the cost of index maintenance.
Database Design Issue If:
Big IOs such as table and range scans due to missing indexes.
◦By definition, OLTP transactions should not require big IOs; any that do should be examined.
Database Design Issue If:
Unused indexes incur the cost of index maintenance for inserts, updates, and deletes without benefiting any users.
◦Unused indexes should be eliminated.
◦Any index that has been used (by select, update or delete operations) will appear in sys.dm_db_index_usage_stats.
◦Thus, any defined index not included in this DMV has not been used since the last restart of SQL Server.
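The comparison described above amounts to a set difference between the indexes that are defined and the indexes that appear in sys.dm_db_index_usage_stats. A minimal sketch in Python, assuming both lists have already been fetched from the server; the function name, the (table, index) tuple shape and the sample names are hypothetical and not from the case study:

```python
def find_unused_indexes(defined_indexes, used_indexes):
    """Return defined indexes that do not appear in the usage-stats list.

    Both arguments are iterables of (table_name, index_name) tuples.
    Fetching them from SQL Server is outside the scope of this sketch.
    """
    used = set(used_indexes)
    # Preserve the order of the defined list so reports stay readable.
    return [ix for ix in defined_indexes if ix not in used]


# Hypothetical sample data for illustration only.
defined = [("Donations", "IX_Donations_Date"),
           ("Donations", "IX_Donations_Walker"),
           ("Walkers", "IX_Walkers_Email")]
used = [("Donations", "IX_Donations_Date"),
        ("Walkers", "IX_Walkers_Email")]

unused = find_unused_indexes(defined, used)
print(unused)
```

Because the DMV is reset when SQL Server restarts, a list like this is only a candidate list; it is worth collecting it after the server has been up through a representative workload period.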
CPU Bottleneck If:
Signal waits > 25% of total waits.
◦See sys.dm_os_wait_stats for Signal waits and Total waits.
◦Signal waits measure the time spent in the runnable queue waiting for CPU.
◦High signal waits indicate a CPU bottleneck.
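The 25% rule above can be checked with simple arithmetic once the wait_time_ms and signal_wait_time_ms columns of sys.dm_os_wait_stats have been read. A minimal sketch, assuming the rows are already available as pairs of numbers; the function name and the sample values are hypothetical:

```python
def signal_wait_percent(rows):
    """Percentage of total wait time that was spent waiting for CPU.

    rows: iterable of (wait_time_ms, signal_wait_time_ms) pairs,
    one pair per wait type.
    """
    rows = list(rows)  # allow generators, since we iterate twice
    total_wait = sum(wait for wait, _ in rows)
    total_signal = sum(signal for _, signal in rows)
    if total_wait == 0:
        return 0.0
    return 100.0 * total_signal / total_wait


# Hypothetical sample: 3,000 ms of signal waits out of 10,000 ms total.
sample = [(8000, 1000), (2000, 2000)]
pct = signal_wait_percent(sample)
print(f"signal waits: {pct:.1f}% of total waits")
print("possible CPU bottleneck" if pct > 25 else "within the 25% guideline")
```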
CPU Bottleneck If:
Plan re-use < 90%.
◦A query plan is used to execute a query.
◦Plan re-use is desirable for OLTP workloads because re-creating the same plan (for similar or identical transactions) is a waste of CPU resources.
◦Compare SQL Server SQL Statistics: batch requests/sec to SQL compilations/sec.
◦Compute plan re-use as follows: Plan re-use = (Batch requests - SQL compilations) / Batch requests.
◦Special exception to the plan re-use rule: Zero cost plans will not be cached (not re-used) in SQL 2005 SP2.
◦Applications that use zero cost plans will have a lower plan re-use but this is not a performance issue.
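The plan re-use formula above is easy to apply to a pair of counter readings. A minimal sketch, assuming the two values were sampled over the same interval; the function name and the sample numbers are hypothetical:

```python
def plan_reuse(batch_requests, sql_compilations):
    """Plan re-use = (Batch requests - SQL compilations) / Batch requests."""
    if batch_requests <= 0:
        raise ValueError("batch_requests must be positive")
    return (batch_requests - sql_compilations) / batch_requests


# Hypothetical readings: 1,000 batch requests/sec, 150 compilations/sec.
ratio = plan_reuse(1000, 150)
print(f"plan re-use: {ratio:.0%}")
if ratio < 0.90:
    print("below the 90% guideline; check for zero cost plans before concluding")
```

The reminder in the last line reflects the exception noted above: a workload dominated by zero cost plans shows lower plan re-use without this being a performance issue.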
CPU Bottleneck If:
Parallel wait type cxpacket > 10% of total waits.
◦Parallelism sacrifices CPU resources for speed of execution.
◦Given the high volumes of OLTP, parallel queries usually reduce OLTP throughput and should be avoided.
◦See sys.dm_os_wait_stats for wait statistics.
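The CXPACKET share can be computed from the wait_type and wait_time_ms columns of sys.dm_os_wait_stats. A minimal sketch, assuming the rows have been loaded into a dictionary keyed by wait type; the function name and the sample figures are hypothetical:

```python
def wait_share_percent(wait_times_ms, wait_type):
    """Share of total wait time attributed to one wait type, in percent.

    wait_times_ms: dict mapping wait type name -> wait_time_ms.
    """
    total = sum(wait_times_ms.values())
    if total == 0:
        return 0.0
    return 100.0 * wait_times_ms.get(wait_type, 0) / total


# Hypothetical snapshot of three wait types.
snapshot = {"CXPACKET": 1500, "WRITELOG": 6000, "PAGEIOLATCH_SH": 2500}
cx = wait_share_percent(snapshot, "CXPACKET")
print(f"CXPACKET: {cx:.1f}% of total waits")
print("parallelism may be hurting OLTP throughput" if cx > 10 else "ok")
```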
Memory Bottleneck If:
Consistently low average page life expectancy.
◦See the Average Page Life Expectancy counter in the Perfmon object SQL Server Buffer Manager (this represents the average number of seconds a page stays in cache).
◦For OLTP, an average page life expectancy of 300 seconds is 5 minutes.
◦Anything less could indicate memory pressure, missing indexes, or a cache flush.
Memory Bottleneck If:
Sudden big drop in page life expectancy. OLTP applications (e.g. small transactions) should have a steady (or slowly increasing) page life expectancy.
◦See Perfmon object SQL Server Buffer Manager.
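The two page life expectancy checks (a consistently low average and a sudden big drop) can be sketched over a series of counter samples. This is an illustration only: the 300-second threshold comes from the slides, while the function name, the 50% drop fraction used to define "sudden big drop" and the sample series are assumptions:

```python
from statistics import mean


def ple_warnings(samples, low_threshold=300, drop_fraction=0.5):
    """Return a list of warnings for a series of page life expectancy samples.

    samples: list of readings in seconds, oldest first.
    drop_fraction: assumed definition of a "big" drop between two
    consecutive samples (0.5 means the value fell by more than half).
    """
    warnings = []
    if not samples:
        return warnings
    if mean(samples) < low_threshold:
        warnings.append("low average")
    for prev, cur in zip(samples, samples[1:]):
        if prev > 0 and cur < prev * (1 - drop_fraction):
            warnings.append("sudden drop")
            break
    return warnings


# Hypothetical series: steady growth, then a collapse from 1000 to 300.
print(ple_warnings([900, 950, 1000, 300, 350]))
# Hypothetical series: steady but consistently low.
print(ple_warnings([200, 210, 190]))
```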
Memory Bottleneck If:
Pending memory grants.
◦See counter Memory Grants Pending, in the Perfmon object SQL Server Memory Manager.
◦Small OLTP transactions should not require a large memory grant.
Memory Bottleneck If:
Sudden drops or consistently low SQL Cache hit ratio. OLTP applications (e.g. small transactions) should have a high cache hit ratio.
◦Since OLTP transactions are small, there should not be (1) big drops in SQL Cache hit rates or (2) consistently low cache hit rates < 90%.
◦Drops or low cache hit may indicate memory pressure or missing indexes.
IO Bottleneck If:
High average disk seconds per read. When the IO subsystem is queued, disk seconds per read increases.
◦See Perfmon Logical or Physical disk (disk seconds/read counter).
◦Normally it takes 4-8ms to complete a read when there is no IO pressure.
◦When the IO subsystem is under pressure due to high IO requests, the average time to complete a read increases, showing the effect of disk queues.
IO Bottleneck If:
Periodic higher values for disk seconds/read may be acceptable for many applications.
◦For high performance OLTP applications, sophisticated SAN subsystems provide greater IO scalability and resiliency in handling spikes of IO activity.
◦Sustained high values for disk seconds/read (>15ms) do indicate a disk bottleneck.
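The distinction above between a periodic spike and a sustained high value for disk seconds/read can be sketched as a search for a run of consecutive samples over the threshold. The 15 ms threshold comes from the slides; the function name, the choice of five consecutive samples as "sustained" and the sample series are assumptions. The readings are taken to be in milliseconds, so counter values reported in seconds would need to be multiplied by 1000 first:

```python
def has_sustained_latency(samples_ms, threshold_ms=15.0, min_consecutive=5):
    """True if at least min_consecutive samples in a row exceed threshold_ms."""
    run = 0
    for value in samples_ms:
        # Extend the current run on a slow sample, otherwise reset it.
        run = run + 1 if value > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical series with two isolated spikes.
spiky = [6, 40, 7, 5, 8, 35, 6]
# Hypothetical series with five slow reads in a row.
sustained = [6, 18, 22, 19, 25, 30, 7]

print(has_sustained_latency(spiky))
print(has_sustained_latency(sustained))
```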
IO Bottleneck If:
High average disk seconds per write. See Perfmon Logical or Physical disk.
◦The throughput for high volume OLTP applications is dependent on fast sequential transaction log writes.
◦A transaction log write can be as fast as 1ms (or less) for high performance SAN environments.
◦For many applications, a periodic spike in average disk seconds per write is acceptable considering the high cost of sophisticated SAN subsystems.
◦However, sustained high values for average disk seconds/write are a reliable indicator of a disk bottleneck.
IO Bottleneck If:
Big IOs such as table and range scans due to missing indexes.
◦Top wait statistics in sys.dm_os_wait_stats are related to IO, such as ASYNC_IO_COMPLETION, IO_COMPLETION, LOGMGR, WRITELOG, or PAGEIOLATCH_x.
Network Bottleneck If:
High network latency coupled with an application that incurs many round trips to the database.
◦Network bandwidth is used up.
◦See the packets/sec and current bandwidth counters in the network interface object of Performance Monitor.
◦For TCP/IP frames, actual bandwidth is computed as packets/sec * 1500 * 8 / 1000000 Mbps.
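The bandwidth formula above can be turned into a quick utilization estimate. A minimal sketch, assuming every frame is a full 1500 bytes (so the result is an upper-bound estimate) and that the current bandwidth reading is in bits per second; the function names and the sample figures are hypothetical:

```python
def estimated_mbps(packets_per_sec, frame_bytes=1500):
    """packets/sec * 1500 * 8 / 1000000, in Mbps."""
    return packets_per_sec * frame_bytes * 8 / 1_000_000


def utilization_percent(packets_per_sec, current_bandwidth_bps, frame_bytes=1500):
    """Estimated share of the link in use, in percent."""
    used_bps = estimated_mbps(packets_per_sec, frame_bytes) * 1_000_000
    return 100.0 * used_bps / current_bandwidth_bps


# Hypothetical reading: 5,000 packets/sec on a 100 Mbps interface.
mbps = estimated_mbps(5000)
util = utilization_percent(5000, 100_000_000)
print(f"{mbps:.1f} Mbps, about {util:.0f}% of the link")
```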
SQL Virtualization
Hyper-V is a hypervisor-based technology that is a key feature of Windows Server 2008. It provides scalability and high performance by supporting features like guest multi-processing support and 64-bit guest and host support; reliability and security through its hypervisor architecture; and flexibility and manageability by supporting features like quick migration of virtual machines from one physical host to another and integration with System Center Virtual Machine Manager.
Questions?