89
Quantitative Performance Analysis Joe Chang [email protected] www.sql-server-performance.com/joe _chang.asp

Quantitative Performance Analysis

  • Upload
    brock

  • View
    50

  • Download
    0

Embed Size (px)

DESCRIPTION

Quantitative Performance Analysis. Joe Chang [email protected] www.sql-server-performance.com/joe_chang.asp. Objectives. Estimate DB performance early What design/architecture decisions impact performance? Is the project/architecture feasible? Production database performance tuning - PowerPoint PPT Presentation

Citation preview

Page 2: Quantitative Performance Analysis

Objectives

Estimate DB performance early What design/architecture decisions impact performance?Is the project/architecture feasible?

Production database performance tuning

Reduces guess work, but there are easier ways

Server Performance CharacteristicsProcessor Architecture:

Pentium III – Xeon – Itanium 2– OpteronSystem Architecture: 2, 4, 8, 16-way etc

Page 3: Quantitative Performance Analysis

SPEC CPU 2000 Integer

Pentium M 2.0GHz 90nm, all others 130nmXeon 2.4GHz/512K base: 913 (used later in this presentation)Pentium III Xeon 700MHz/2M : 431

0

500

1000

1500

2000

2500

3000

base gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 tw olf

Pentium M 2.0GHz/2M

Xeon 3.2GHz/2M

Opteron 2.2GHz/1M

Itanium 2 1.5GHz/6M

Power5 1.9GHz

Page 4: Quantitative Performance Analysis

TPC-C Performance – SQL Server

# CPUs System tpm-C $/tpm-C Mem(GB) # Disks1 HP (3.2GHz/2M) 35,030 $1.88 12 43+42 HP (3.2GHz/2M) 60,364 $3.51 12 280+10

4 IBM x365 102,667 $3.52 32 2668 IBM x445 156,195 $4.31 64 616

16 Unisys ES7000 237,869 $5.08 64 700+1032 Unisys ES7000 304,148 $6.18 64 1092+12

# CPUs System tpm-C $/tpm-C Mem(GB) # Disks4 HP DL585 115,110 2.62 32 295+8

Xeon 3.2GHz/2M & Xeon MP 3.0GHz/4M

Opteron 2.2GHz/1M

IA-32 limited max memory (64GB), AWE overhead , bus architecture

Page 5: Quantitative Performance Analysis

Itanium 2 (SQL Server) vs IBM Power 5

# CPUs System tpm-C $/tpm-C Mem(GB) # Disks4 Pwr5 570 /Oracle 194,391 $5.62 128 432+16

8 Pwr5 570 /UDB 429,900 $4.99 256 880+4016 Pwr5 570 /UDB 809,144 $4.95 512 1600+4032 Pwr5 695 /Oracle 1,601,785 $5.27 1024 3200+96

64 Pwr5 695 /UDB 3,210,540 $5.19 2048 6400+140

IBM Power 4+/5 1.9GHz

# CPUs System tpm-C $/tpm-C Mem(GB) # Disks4 HP rx5670 121,065 $4.49 64 448+208 Bull 175,366 $4.54 128 225

16 Unisys ES7000 309,037 $4.49 128 770+2432 NEC Express 577,531 $7.74 512 1150

64 HP Superdome 786,646 $6.49 512 1792+60

Itanium 2 1.5GHz/6M

Page 6: Quantitative Performance Analysis

Scaling

Pn / P1 = S ** log2(n)Pn Performance with n processors P1 Perf. with 1 processorS Scale Factor n Number of processors

0

4

8

12

16

20

24

28

32

36

2 4 8 16 32 64Processors

Scal

ing

S=1.5

S=1.6

S=1.7

S=1.8

Linear

Page 7: Quantitative Performance Analysis

Unit of Measure – CPU-Cycles

Query costs measured in CPU-cyclesAlternative: CPU-sec

Cost = Runtime (sec) × CPU Util. × Available CPU-cycles ÷ IterationsAvailable CPU-cycles = Number of CPUs × FrequencyExample 4 x 700MHz = 2.8B cycles/sec

CPU-cycles does not imply CPU instructions, Unit of time same as CPU clock

1GHz CPU: time unit = 1ns2GHz CPU: time unit = 0.5ns

All tests on Windows 2000/2003, SQL Server 2000 SP1-3

Page 8: Quantitative Performance Analysis

CPU-Cycles dependencies

CPU-cycles on one processor architecture has no relation to another – ex. Pentium III, Pentium 4, Itanium, OpteronSome platform dependencies – cache size, bus speed, SMP

Notes: Some platform dependencies – cache size, bus speed, SMP

Processor System/Cache/Mem

Performance Cost / trans.

Itanium 2 4 x 1.5GHz/6M /64G

121,065 2.974M

Xeon 4 x 3.0GHz/4M /32G

102,667 7.013M

Opteron 4 x 2.2GHz/1M /32G

105,687 4.996M

Page 9: Quantitative Performance Analysis

Cost Structure - ModelStored Procedure Call Cost =

RPC cost (once per procedure)+ Type cost (once per procedure?)+ Query costs (one or more per

procedure)

Query – one or more components

Component Cost = Cost for component operation base

+ Cost per additional row or page

Only stored procedures are examined

Page 10: Quantitative Performance Analysis

RPC Cost

Cost of RPC to SQL Server includes:1) Network roundtrip2) SQL Server handling costs

Calls to SQL Server are made with RPC (not SQL Batch)Profiler -> Event Class: RPC

ADO.NET Command Type Stored Procedure or Text with parametersCommand Text without parameters: SQL Batch

Page 11: Quantitative Performance Analysis

Type Cost?

Blank Procedure: ~250,000 CPU-cyclesCREATE PROC p_ProcBlank ASRETURN

Proc with Single Query ~320K CPU-cyclesCREATE PROC p_ProcSingleQuery ASSELECT … FROM TableA WHERE ID = @ID

Proc with Two Queries ~360K CPU-cyclesCREATE PROC p_ProcTwoQueries ASSELECT … FROM TableA WHERE ID = @IDSELECT … FROM TableB WHERE ID = @ID

RPC Cost ~250K CPU-cycles,Type Cost ~30K CPU-cycles, Query Cost ~40K CPU-cycles

Page 12: Quantitative Performance Analysis

Blank RPC Performance

All systems with 2 processors, Xeon 3.2GHz with 800MHz busRPC’s are expensive relative to simple query

0

5,000

10,000

15,000

20,000

25,000

30,000

Pent III900M/2M

Xeon3.06G/512K

Xeon3.06G/512

w/HT

Xeon3.20G/1M

800

Xeon3.20G/1M

w/HT

Itanium 21.50G/6M

Opteron2.20G/1M

RPC

/sec

120K

270K240K 260K

195K

154K

140K

Costs in CPU-Cycles

Page 13: Quantitative Performance Analysis

RPC Cost

Processor PIII PIII X Xeon Opteron Itanium 2* It2 CPUs 2 4 2 2 2 4 8RPC cost 140K 200K 250 140K 155K 290K 350K

270K (2.xx driver)Type Cost Select 20-30K ~5K 35-55K ~20K ~8K

Systems: Pentium III 2x 600MHz/256K, 2x 733MHz/256K, PIII Xeon 2x 500MHz/2M, 4x 700MHz/2M, 4x900/2M Xeon (P4) 2x 2.0GHz/512K 2x 2.4GHz/512K Opteron 2x 2.2GHz/1M Itanium 2 2x 900MHz/1.5M 8x1.5GHz/6MOS: W2K, W2K3, various spSQL Server 2000, various spPIII: Intel PRO/100+, Others: Broadcom Gigabit Ethernet driver 5.xx+*Itanium 2 system booted with 2, 4 or 8 processors(4P config may have had procs from more than 1 cell)

Costs in CPU-Cycles

Page 14: Quantitative Performance Analysis

RPC Cost – Fiber versus Threads

PIII Xeon - TCP 1P 2P 4PFE-Thread 105K 150K 200KFE-Fiber 95K 120K 170KXeon - TCPGE-Thread 210K 250KGE-Fiber 200K 230KXeon - VIVI Thread 190KVI Fiber 160K 180KItanium 2 - TCP 1P 2P 4P 8PThread 105K 155K 290K 350KFiber 95K 145K 260K 300K

Costs in CPU-Cycles

Broadcom Gigabit Ethernet driver 5.xx, 6.xx, 7.xx (270K for 2P 2.xx driver)VI: QLogic QLA2350, drivers: qla2300 8.2.2.10, qlvika 1.1.3

Page 15: Quantitative Performance Analysis

RPC Cost TCP vs Named Pipes

PIII Xeon 4P TCP named pipesFE-Thread 200K 315KFE-Fiber 170K 370K

Xeon, Thread 1P 2PGE, TCP 210K 250KGE, Named Pipes 320K 360K

Broadcom Gigabit Ethernet driver 5.xx, 6.xx, 7.xx (270K for 2P 2.xx driver)VI: QLogic QLA2350, drivers: qla2300 8.2.2.10, qlvika 1.1.3

Costs in CPU-Cycles

Page 16: Quantitative Performance Analysis

RPC Costs – owner, case

PIII PIII X P4/Xeon

RPC cost 140K 140K? 250Ksp_executesql 210K

Unspecified owner, Ex: user1 calls procedure owned by dbo +100K on 4P PIII, +100K on 2P Xeon, 300K on 8P Itanium2Case mismatch:

Actual procedure: p_Get_RowsCalled procedure: p_get_rows

+100K on 4P PIII, +150K on 2P Xeon, +300K on 8P Itanium2

Costs in CPU-Cycles

Page 17: Quantitative Performance Analysis

Single Row Select Costs

Clustered Index SeekDoes cost depend on index depth?Role of I/O count in cost

Index Seek with Bookmark LookupDoes cost depend on table type?

Heap versus clustered indexTable Scan

Page 18: Quantitative Performance Analysis

Single Row Index Seek vs Index Depth

010,00020,00030,00040,00050,00060,00070,00080,00090,000

100,000110,000

0 2 4 6 8 10SELECT's per Stored Procedure

CPU-

cycl

es p

er S

ELEC

T

Cl 50K Cl 200K Cl 500KNH 50K NH 200K NH 500KNC 50K NC 200K NC 500K

Index depth: 50K rows -2 both, 200K 3-Cl, 2-NC, 500K 3 Pentium III 2x733

Page 19: Quantitative Performance Analysis

Batching Queries into a Single RPC

Each query: clustered index seek returning 1 row

0

20,000

40,000

60,000

80,000

100,000

120,000

0 2 4 6 8 10Queries per batch

Que

ries

/ sec

X 3.06/512K X 3.2/1MIt2 1.5/6M Opt 2.2/1MPIII 733

Page 20: Quantitative Performance Analysis

Single Row Cost Summary

Index depthPlan – no, True cost -yesCost versus index depth not fully examinedFill factor dependence not tested

Bookmark Lookup – Table typePlan cost -noTrue cost higher for clustered index than heap

Page 21: Quantitative Performance Analysis

Multi-row Select Queries

Queries return aggregate- Not a test of network bandwidthSingle Table Example SELECT @Count = count(*), @Value1 = AVG(randDecimal) FROM M3C_01 WHERE GroupID = @ID1 Count, 1 AVG on 5 byte decimalJoin ExampleSELECT count(*), AVG(m.randDecimal), min(n.randDecimal) FROM M2A_02 m INNER LOOP JOIN M2B_02 n ON n.ID = m.IDWHERE m.GroupID = @ID AND n.GroupID = @ID1 Count, 2 AVG on 5 byte decimal

Table Scan tests return either- a single row- 1 Count, 1 aggregate on 5 byte decimal

Page 22: Quantitative Performance Analysis

Index Seek – Aggregates

Xeon 2x3.06GHz/512K

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

count

cnt+avg(money)

cnt+avg(decimal)

Page 23: Quantitative Performance Analysis

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

10 100 1,000 10,000rows per query

Que

ry x

Row

s / s

ec

Seq Heap Dist Heap Seq Cl IX Dist Cl IXSeq Hp PL Dist Hp PL Seq Cl PL Dist Cl PL

Multi-row Bookmark Lookups

Table lock escalation at >5K rows ?

2x2.2GHz/1M Opteron

1 Count, 1 AVG on decimal

Page 24: Quantitative Performance Analysis

Table Scan

0

50,000

100,000

150,000

200,000

250,000

1 10 100 1,000 10,000table size (pages)

page

s/se

c

Default

Row Lock

Page Lock

Table Lock

No Lock

Xeon 3.2GHz/800MHz FSB HT

Page 25: Quantitative Performance Analysis

Table Scan – Component Cost

Total cost = RPC + Type + Base + per page costsPIII (X) P4/Xeon Opteron It2 2P 8P

Type + Base cost 60K/40K 145K 35K 90K

Cost per page: NOLOCK: 24K - 16K 23-35KTABLOCK: 24K 25K 16KPAGLOCK: 26K 26K 20K 17K 23-35KROWLOCK: 140K 250K 110K 100K 150K

Table Scan or Index Scan – Plan FormulaI/O: 0.0375785 + 0.0007407 per pageCPU: 0.0000785 + 0.0000011 per row

Measured Table Scan cost formulas for 99 rows per pageCosts in CPU-Cycles per page

Page 26: Quantitative Performance Analysis

Table Scan Cost per Page

0

1

2

3

4

5

6

7

1 10 100 1000Table size (pages)

Norm

aliz

ed c

ost p

er p

age Default Row Lock

Page Lock Table LockNo Lock Plan

2x733MHz/256KB Pentium III

Page 27: Quantitative Performance Analysis

Bookmark – Table Scan Cross-over

1

10

100

1,000

10 100 1000 10000rows

Norm

aliz

ed C

ost

Bk on ClusterBk on HeapTable ScanClust. IndexPlan Cost, BkPlan Cost, ScanPlan, Clust. IX

Plan Costs: ≤1GB 2x733MHz/256K Pentium III

Page 28: Quantitative Performance Analysis

Loop, Hash and Merge Join

Loop joins: Case 1, 2, 3 etcHash joins: in-memory versus othersMerge: regular versus many-to-manyMerge join with sort operationsLocks – page lock

Page 29: Quantitative Performance Analysis

Joins – Locking GranularityXeon 2x2.0GHz/512K

Count+2 AVG(decimal) Hash join spools to tempdb

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

10 100 1,000 10,000 100,000rows

CPU-

Cycl

es p

er ro

w

Loop L Page LockHash H Page LockMerge M Page Lock

Page 30: Quantitative Performance Analysis

Loop, Hash & Merge Joins

Xeon 2x3.06GHz/512K (Count)

100,000

1,000,000

10,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Loop Loop PLHash Hash PLMerge Merge PL

Page 31: Quantitative Performance Analysis

Processor Architecture

Pentium III, Xeon Opteron, Itanium 2

Page 32: Quantitative Performance Analysis

Index Seek

COUNT(*) + AVG(Decimal)

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

4,500,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Xeon 3.06/512K

Xeon 3.2 w/HT

Itanium 2

Opteron 2.2

Page 33: Quantitative Performance Analysis

Bookmark Lookup

Heap, Page Lock, sequential rows 1 Count, 1 AVG on decimal

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

10 100 1,000 10,000rows per query

Que

ry x

Row

s / s

ec

PIII 900/2M

Xeon 2.4

Itanium 2

Opteron 2.2

Page 34: Quantitative Performance Analysis

Table Scan

NOLOCK

0

50,000

100,000

150,000

200,000

250,000

300,000

1 10 100 1,000 10,000table size (pages)

page

s/se

c

X 3.06/512K

X 3.2/1M

It2 1.5/6M

Opt 2.2/1M

4xPIII-700

Page 35: Quantitative Performance Analysis

Hyper-ThreadingXeon 3.06GHz/512K (130nm) vs.Xeon 3.2GHz/1M & 800MHz FSB (90nm)

Page 36: Quantitative Performance Analysis

Hyper-ThreadingNo Hyper-Threading

0

20

40

60

80

100

0 5,000 10,000 15,000 20,000 25,000

RPC/secCP

U

with Hyper-Threading

0

20

40

60

80

100

0 5,000 10,000 15,000 20,000 25,000RPC/sec

CPU

CPU linear with throughput

CPU non-linear with throughput

Exercise caution in extrapolating system performance based on low CPU measurements

Page 37: Quantitative Performance Analysis

Index Seek – Single Row

0

20,000

40,000

60,000

80,000

100,000

120,000

0 2 4 6 8 10Queries per batch

Que

ries

/ sec

X 3.06/512K X 3.06 w/HT

X 3.2/1M X 3.2 w/HT

Page 38: Quantitative Performance Analysis

Index Seek - Multiple Rows

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s /s

ec

Xeon 3.06/512K

Xeon 3.06 w/HT

Xeon 3.2/1M

Xeon 3.2 w/HT

Page 39: Quantitative Performance Analysis

Table Scan - NOLOCK

0

50,000

100,000

150,000

200,000

250,000

300,000

1 10 100 1,000 10,000table size (pages)

page

s/se

c

X 3.06/512K X 3.06 w/HT

X 3.2/1M X 3.2 w/HT

Xeon, Xeon/800 NOLOCK

Page 40: Quantitative Performance Analysis

Bookmark Lookup – Cl. IX, Seq, PL

1 Count, Clustered Index, Sequential rows, Page Lock

0

100,000

200,000

300,000

400,000

500,000

600,000

10 100 1,000 10,000 100,000rows per query

quer

y x

row

s / s

ec

Xeon 3.06/512K Xeon 3.06 w/HT

Xeon 3.2/1M Xeon 3.2 w/HT

Page 41: Quantitative Performance Analysis

Bookmark Lookup – Heap, Seq, PL

1 Count, Heap, Sequential rows, Page Lock

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Xeon 3.06/512K Xeon 3.06 w/HT

Xeon 3.2/1M Xeon 3.2 w/HT

Page 42: Quantitative Performance Analysis

Loop Join

0

50,000

100,000

150,000

200,000

250,000

300,000

350,000

400,000

450,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Xeon 3.06/512K Xeon 3.06 w/HTXeon 3.2/1M Xeon 3.2 w/HT

Page 43: Quantitative Performance Analysis

Hash Join

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Xeon 3.06/512K Xeon 3.06 w/HT

Xeon 3.2/1M Xeon 3.2 w/HT

Page 44: Quantitative Performance Analysis

Merge Join

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

4,500,000

5,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Xeon 3.06/512K Xeon 3.06 w/HT

Xeon 3.2/1M Xeon 3.2 w/HT

Page 45: Quantitative Performance Analysis

Hyper-Threading Summary

Lower cost RPCImproves performance of low cost queries

May improve performance on some opsDegrades performance on othersXeon 800MHz FSB with 90nm core better than 130nm

Page 46: Quantitative Performance Analysis

Scaling

Itanium 2, 1.5GHz/6M, 1, 2, 4, 8, 16 CPUs

Page 47: Quantitative Performance Analysis

Scaling

First System: HP rx76208-way Itanium 2 1.5GHz/6MBoot with /NUMPROC = xxQuery: Count + 1 or 2 AVG(Decimal)

Second System: Unisys ES7000 Aries 42016-way Itanium 2 1.5GHz/6MBooted with all 16 CPUs, Affinity Mask = 1, 2, …Query: Count onlySQL Server 2000, Service Pack 3 (build 760)

Page 48: Quantitative Performance Analysis

Index Seek

9X difference between 16P and 1P

0

5,000,000

10,000,000

15,000,000

20,000,000

25,000,000

30,000,000

35,000,000

40,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P

2P

4P

8P

16P

Page 49: Quantitative Performance Analysis

Bookmark Lookup 1-way

Minor differences between Table org & lock granularity

0

50,000

100,000

150,000

200,000

250,000

300,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Seq Heap PL

Seq Heap

Seq Cl PL

Seq Clustered

Page 50: Quantitative Performance Analysis

Bookmark Lookup – 16 way

Major differences! 100X !!

10,000

100,000

1,000,000

10,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Seq Heap PL

Seq Heap

Seq Cl PL

Seq Clustered

Page 51: Quantitative Performance Analysis

Bookmark Lookup – Clustered Index

Really bad news beyond 2P, especially bad ~5K rows

0

50,000

100,000

150,000

200,000

250,000

300,000

350,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P 4P8P 16P

Page 52: Quantitative Performance Analysis

Bookmark Lookup – Cl. IX, PL

No scaling beyond 2P, but no disaster at 5K rows

0

50,000

100,000

150,000

200,000

250,000

300,000

350,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P 4P

8P 16P

Page 53: Quantitative Performance Analysis

Bookmark Lookup – Heap

Performance degradation at 5K rows

10,000

100,000

1,000,000

10,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P

4P 8P16P

Page 54: Quantitative Performance Analysis

Bookmark Lookup – Heap Page Lock

Excellent scaling, 8P anomaly?

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

4,500,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P 4P

8P 16P

Page 55: Quantitative Performance Analysis

Joins – 1P

Up to 10X difference between loop, hash & merge

100,000

1,000,000

10,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Loop

Hash

Merge

Page 56: Quantitative Performance Analysis

Joins – 16P

Up to 100X difference between loop, hash & merge

100,000

1,000,000

10,000,000

100,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Loop

Hash

Merge

Page 57: Quantitative Performance Analysis

Loop Join

No scaling beyond 2P

0

50,000

100,000

150,000

200,000

250,000

300,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P 4P

8P 16P

Page 58: Quantitative Performance Analysis

Hash Join

Scales to 8P, peaks at ~5K rows

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P 2P 4P8P 16P

Page 59: Quantitative Performance Analysis

Merge Join

Scales well all around

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

18,000,000

20,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

1P

2P

4P

8P

16P

Page 60: Quantitative Performance Analysis

Table Scans

Questionable scaling, peaks at 300 pages, cache size ?

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

100 1,000 10,000 100,000table size (pages)

Page

s /s

ec

1P

2P

4P

8P

16P

Page 61: Quantitative Performance Analysis

Scaling Summary

Avoid Bookmark Lookups on Tables with Clustered indexesUse Heap tablesUse NOLOCK, especially ~5000 rows / query

Avoid Loop JoinsDesign for Merge or Hash Joins

Avoid table scans (Itanium only?)

Page 62: Quantitative Performance Analysis

Bookmark HP 32P, SP3 build 760/818

10,000

100,000

1,000,000

10,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Seq Heap PL

Seq Heap

Seq Cl PL

Seq Clustered

Page 63: Quantitative Performance Analysis

Bookmark HP 32P, SP4 build 2000?

10,000

100,000

1,000,000

10,000,000

10 100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Seq Heap PL

Seq Heap

Seq Cl PL

Seq Clustered

Page 64: Quantitative Performance Analysis

Joins HP 32P, SP3 build 760/818

100,000

1,000,000

10,000,000

100,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Loop

Hash

Merge

Page 65: Quantitative Performance Analysis

Joins HP 32P, SP4 build 2000?

100,000

1,000,000

10,000,000

100,000,000

100 1,000 10,000 100,000rows per query

Que

ry x

Row

s / s

ec

Loop

Hash

Merge

Page 66: Quantitative Performance Analysis

Table Scan HP 32P,

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

1,600,000

1,800,000

2,000,000

10 100 1,000 10,000 100,000table size (pages)

Page

s/se

c

760 Def 760 NL818 Def 818 NL

sp4 Def sp4 NL

Page 67: Quantitative Performance Analysis

Bookmark Lookup (PL Sequential)

0

1,000,000

2,000,000

3,000,000

4,000,000

5,000,000

6,000,000

7,000,000

8,000,000

9,000,000

10,000,000

100 1,000 10,000 100,000rows per query

Que

ry-ro

ws/

sec

16P Xeon

16P It2

32P It2

Page 68: Quantitative Performance Analysis

Index Seek

1,000,000

10,000,000

100,000,000

100 1,000 10,000 100,000rows per query

Que

ry-ro

ws/

sec

16P Xeon

16P It2

32P It2

Page 69: Quantitative Performance Analysis

Loop Join

0

50,000

100,000

150,000

200,000

250,000

300,000

100 1,000 10,000 100,000rows per query

Que

ry-ro

ws/

sec

16P Xeon

16P It2

32P It2

Page 70: Quantitative Performance Analysis

Hash Join

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

100 1,000 10,000 100,000rows per query

Que

ry-ro

ws/

sec

16P Xeon

16P It2

32P It2

Page 71: Quantitative Performance Analysis

Merge

0

5,000,000

10,000,000

15,000,000

20,000,000

25,000,000

30,000,000

35,000,000

40,000,000

45,000,000

100 1,000 10,000 100,000rows per query

Que

ry-ro

ws/

sec

16P Xeon

16P It2

32P It2

Page 72: Quantitative Performance Analysis

Joins HP 32P, SP4 build 2000?

Page 73: Quantitative Performance Analysis

Joins HP 32P, SP4 build 2000?

Page 74: Quantitative Performance Analysis

Database Design for Performance

Round-trip minimization – RPC costRow count management – Cost per row

Indexes – isolate queried rows to a limited number of adjacent pages quickly, not most selective columns 1st

Design for low cost operationsCovered index instead of bookmark lookupsMerge joins, Indexed viewsAvoid excessive logic

NOLOCK on non-transactional data

Page 75: Quantitative Performance Analysis

Statistics

Accuracy & RelevanceMore than keeping statistics up to dateData queried needs to reflect data in tableAvoid populating database with test data having different distribution than live data

Page 76: Quantitative Performance Analysis

Performance Issues

NC Index Seek + Bookmark Lookup vs. Table Scan

Query optimizer switches to table scantoo soon for in-memory, too late for disk bound data

Watch table scan lock levelRow count plan versus actual cost issues

May be related to WHERE clause conditionsLock hints Merge and Hash joins vs. Loop joinsFixed costs favor consolidation

Both in RPC and queries

Page 77: Quantitative Performance Analysis

Summary

Query cost structure Fixed costs identified

Costs applied once per RPCComponent operations examined

Base cost and cost per row or page Lock hints – Row, Page, Table, No Lock

Page 78: Quantitative Performance Analysis

Additional Information

www.sql-server-performance.com/joe_chang.asp

SQL Server Quantitative Performance AnalysisSQL Server Quantitative Performance AnalysisServer System ArchitectureServer System ArchitectureProcessor PerformanceProcessor PerformanceDirect Connect Gigabit NetworkingDirect Connect Gigabit NetworkingParallel Execution PlansParallel Execution PlansLarge Data OperationsLarge Data OperationsTransferring StatisticsTransferring StatisticsSQL Server Backup Performance with Imceda LiteSpeedSQL Server Backup Performance with Imceda [email protected]

Page 79: Quantitative Performance Analysis

Quantitative Performance Analysis

Backup Slides

Page 80: Quantitative Performance Analysis

Subjects

Cost measurementTest procedure, Unit of measure

Query CostsSingle row, table scan, multi-row, joins

Discrepancies with plan costsLogical I/O, Lock Hints, Conditions

Database design implicationsDesign, coding and indexes

Page 81: Quantitative Performance Analysis

Test Procedure

Load generatorDrive DB server CPU utilization as high as possible (80-100%)Multiple load generators running same query

Single thread will not fully load DB serverNetwork propagation time is significantMost queries run on single processorServer may have more than one processor

Component operation cost derivedComparing two queries differing in one op

Page 82: Quantitative Performance Analysis

Index Seek by Aggregates

SELECT COUNT(*) SELECT COUNT(*), CONVERT int to bigint, etcSELECT COUNT(*), SUM(int), SELECT COUNT(*), SUM(Money)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Xeon 1P Opt 1P It2 1P Xeon 2P Opt 2P It 2P

Dura

tion/

1K ro

ws

(ms)

1 of 10M CountSum ConvertMax MoneyDecimal

Page 83: Quantitative Performance Analysis

Join Cost Linearity on 32-bit

0

1

2

3

4

5

6

100 1000 10000

rows (1000's)

Dura

tion/

1K ro

w (m

s) Index Seek

Merge Join

Hash Join

Loop Join

Hash join cost per row is somewhat linear up to ~5M rowsDuration then jumps due to disk IO

Page 84: Quantitative Performance Analysis

Join Cost Linearity on Itanium 2

Loop & Merge join cost per row independent of row countHash join cost per row is not (no spooling to temp)

0

1

2

3

4

5

6

100 1,000 10,000rows (000's)

Dura

tion/

1K ro

ws

(ms) Index Seek

Merge Join

Hash Join

Loop Join

Page 85: Quantitative Performance Analysis

Hash Join – row size

Optimizer cost: depends on OS size, not IS sizeActual cost: depends on # of columns and OS/IS

0

1

2

3

4

5

6

100 1,000 10,000rows (000's)

Dura

tion/

1K ro

ws

(ms)

Count OS 1 Sum

IS 1 Sum OS 2 Sum

IS 2 Sum OS 3 Sum

IS 3 Sum

Page 86: Quantitative Performance Analysis

Type + Component Base Cost

Processor PIII PIII X Xeon Opt Itanium 2

# of CPUs 2 4 2 2 2 4 8Frequency ~73

3 700 2-2.4 2.2 900M 1.5G 1.5G

Cache 256K 2M 512K 1M 1.5M 6M 6MAggregate 100K 40K 135K 45K 45K 35K 66KTable Scan 60K 40K 145K 50K 35K 40K 90KLoop Join 110K 24K 130K 50K 45K 10K 15KHash Join 500K 250K 620K 250K 225K 210K 320KMerge Join 150K 54K 170K 80K 75K 55K 110KMerge + Sort 145K 320K 140K 230KMany-to-Many 250K 550K 200K 320K

Costs in CPU-Cycles

Big cache lowers base cost?

Page 87: Quantitative Performance Analysis

Cost per RowProcessor PIII Xeon Opt Itanium 2 2P, 4P, 8P

Index Seek 1.3K 1.8K 1.1K 1.0K 1.0K 1.0KBookmark LU (CL) 16K 20K 12K 11K 16K 27KBL (CL) page lock 11K 15K 8K 8K 13K 18KBookmark LU (Heap) 11K 14K 9K 10K 11K 16KBL (Heap) page lock 7K 9K 5K 6K 8K 8KLoop Join 16K 25K 13K 11K 16K 23KLoop Join, page lock 13K 20K 11.5K 9K 14K 21KHash Join 8.5K 11K 7.5K 7K 7K 7KHash Join, page lock 6.5K 8K 6.0K 5K 5K 5KMerge Join 6.5K 10K 5.0K 6K 6K 6KMerge Join, page lock 3.0K 4K 2.5K 3K 3K 3KMerge+Sort 7.5K 9K 6K 7KMany-To-Many PL 32K 40K 18K 31K

Costs in CPU-Cycles

Page 88: Quantitative Performance Analysis

IUD Cost Structure

2xXeon* Notes

RPC cost 240,000 Higher for threads, owner m/m

Type Cost 130,000 once per procedureIUD Base 170,000 once per IUD statementSingle row IUD 300,000 Range: 200,000-400,000

Multi-row InsertCost per row 90,000 cost per additional row

INSERT, UPDATE & DELETE cost structure very similarMulti-row UPDATE & DELETE not fully investigated

*Use Windows NT fibers on

Page 89: Quantitative Performance Analysis

INSERT Cost Structure

Index and Foreign Key not fully explored

Early measurements:

50-70,000 per additional index50-70,000 per foreign key

Assumes inserts into “sequential” segmentsCluster on Identity or Cluster Key + IdentityCluster on row GUID can cause substantial disk loading!