Database Storage 101 Planning and Monitoring: Performance and Reliability [CON5773]
October 1st 2014, Oracle OpenWorld 2014. Eric Grancher, head of database services group, CERN-IT
CERN DB blog: http://cern.ch/db-blog/ Video screen capture at https://indico.cern.ch/event/344531/
3
Storage, why does it matter?
• Database Concepts 12c Release 1 (12.1): "An essential task of a relational database is data storage."
• Performance: even if the DB were entirely in memory, commits are synchronous IO operations
• Availability: many (most?) recoveries are due to failing IO subsystems
• Most of the architecture can fail without losing data; storage cannot!
• Fantastic evolution in recent years, more to come!
4
Outline
• Oracle IOs
• Technologies
• Planning
• Lessons from experience / failure
• Capturing information
5
Understand DB IO patterns (1/2)
• You do not know what you cannot measure (24x7!)
• The Oracle DB does different types of IO
• Many OS tools help to understand them
• The Oracle DB provides timing and statistics information about the IO operations (as seen from the DB)
6
Understand DB IO patterns (2/2)
• snapper (Tanel Poder)
• strace (Linux) / truss (Solaris)
• perf / gdb (see Frits Hoogland’s blog)
9
Overload at CPU level (1/2)
• Observed many times: “the storage is slow” (while storage administrators/specialists say “the storage is fine / not loaded”)
• Typically, IO wait times observed from the Oracle RDBMS point of view are long when the CPU load is high
• Instrumentation / on-CPU and off-CPU time
10
Overload at CPU level (2/2)
[Diagram: timeline of an IO call (t1..t2) as seen by Oracle and by the OS, under acceptable load and under high load — with a loaded CPU the process spends extra off-CPU time after the OS completes the IO, so Oracle measures an inflated IO wait.]
12
Rotating disk (1/2)
• Highest capacity per unit cost
credit: Wikipedia; credit: (WLCG) Computing Model update
13
Rotating disk (2/2)
• Well-known technology… but complex
credit: Computer Architecture: A Quantitative Approach
14
Flash memory / NAND (1/2)
credit: Wikipedia; credit: Intel (2x)
• Block erasure (128kB)• Memory wear
15
Flash memory / NAND (2/2)
• Multi-level (MLC) and single-level (SLC) cells, even TLC
• Low latency for 4kB: read in 0.02 ms, write in 0.2 ms
• Complex algorithms to improve write performance, use of over-provisioning
• Enormous differences between devices
16
Local versus shared (1/2)
• Locally attached storage
• Remotely attached storage (NFS, FC)
• Real Application Cluster
credit: Oracle; credit: Broadberry
17
Local versus shared (2/2)
• Serial ATA (SATA):
  • Current: SATA 3.0, 6 Gb/s, half duplex, 550 MB/s
  • SATA Express: 2 PCIe 3.0 lanes -> 1969 MB/s
• PCI Express: 1.5 GB/s for PCI-Express 2.0 x4
• Serial Attached SCSI (SAS): 6/12 Gb/s, multiple initiators, full duplex
• NFS: 10GbE
• FC: 16 Gb/s -> 1.6 GB/s (+ duplex)
• InfiniBand: QDR 4x, 40 Gb/s
18
Bandwidth (1/2)
• How many spinning disks to fill a 16GFC channel?
• Each spinning disk (15k RPM) at 170 MB/s: 1.6 GB/s / 170 MB/s ~= 9.4
• Each flash disk on SATA 3.0 at 550 MB/s: 1.6 GB/s / 550 MB/s ~= 2.9
• Link aggregation!
credit: Tom's Hardware, Wikipedia
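The divisions above can be checked with a quick sketch (device rates as quoted on the slide; real sustained rates vary by model):

```python
# Devices needed to saturate a 16GFC link (~1.6 GB/s), using the
# per-device rates quoted on the slide.
link_mb_s = 1600          # 16 Gb/s Fibre Channel, ~1.6 GB/s
spinning_mb_s = 170       # 15k RPM spinning disk, sequential
flash_sata_mb_s = 550     # flash disk on SATA 3.0

print(round(link_mb_s / spinning_mb_s, 1))    # ~9.4 spinning disks
print(round(link_mb_s / flash_sata_mb_s, 1))  # ~2.9 flash disks
```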
19
Bandwidth (2/2)
• Only a few disks saturate the storage network, so:
  • either many independent sources and ports on the server
  • or (FC) multipathing
  • or Direct NFS link aggregation
• PCI Express (local)
• Systems balanced in terms of storage networking require good planning
21
Writeback / memory caching
• Major gain (x ms to <1 ms), but it requires solid execution
• 9 August 2010: power stop; no Battery Backup Unit and write-back enabled; database corruption, 2 minutes of data loss
• 18 August 2010: disk failure; double controller failure (write-back); database corruption
• 1 September 2010: planned intervention, clean database stop; power stop (write-back data had not been flushed); database corruption including the backup on disk
22
Functionality
• Thin provisioning (ex: db1 5TB in a 15TB volume)
• Snapshot at the storage level, restore at the storage level (ex: 10TB in 15s)
• Cloning (*) at the storage level
(*) integration with Oracle Multitenant
24
ASM, (cluster/network) file system
• [754305.1] 11.2 DBCA no longer supports raw devices
• [12.1 doc] Raw devices have been desupported and deprecated in 12.1
• Local filesystem, cluster filesystem and NFS for RAC
• ASM
25
Filesystem
• FILESYSTEMIO_OPTIONS=SETALL enables direct IO and asynchronous IO
• NFS with Direct NFS
• Cluster filesystem (example: ACFS)
• Advanced features like snapshots
• Important to do regular de-fragmentation if there is some sort of copy-on-write
• Simplicity
26
ASM
• ASMLib /dev/oracleasm controversial (ASMLib in RHEL6)
• ASM filter driver -> AFD:* /dev/oracleafd/
• ACFS interesting, many use cases, for example logs in databases
• Setup has evolved at CERN:
  • Powerpath / EMC
  • RHEL3: QLogic driver
  • RHEL4: dm + chmod
  • RHEL5/6: udev (permissions) and dm (multipathing)
27
Planning
• If random IOs are crucial: “IOPS for sale, capacity for free” (Jeffrey Steiner)
• Read: for IOPS, for bandwidth
• Write: log/redo and DB writer
• Identify capacity, latency, bandwidth and functionality needs
• Validate with reference workloads (SLOB, fio) and your own workload (Real Application Testing)
28
Real Application Testing
[Diagram: workload statements (insert, PL/SQL, update, delete) captured on the original system and replayed on the target system.]
Disks (1/5)
• Disks are not reliable (any vendor, enterprise or not...): “there are only two types of disk drives in the industry: drives that have failed, and drives that are about to fail” (Jeff Bonwick)
• RAID 4/5 gives a false sense of reliability (see next slide)
• Experience: datafile not reachable anymore, double error in the disk array, (smart) move to local disk, 3rd disk failure, major recovery, bandwidth issue, 4th disk failure... post-mortem
29
31
[Diagram: ASM partner extents across disks, e.g. 2* – 4 – 6* and 2 – 4* – 6; * marks the primary extent]
Disks (3/5), lessons
• Monitoring storage is crucial
• Regular media checks are very important (will it be possible, when needed, to read data that is not read on a regular basis?)
• Parity drive / parity blocks
• ASM partner extents, Note 416046.1: “A corruption in the secondary extent will normally only be seen if the block in the primary extent is also corrupt. ASM fixes corrupt blocks in secondary extent automatically at next write of the block.” (only if it is over-written before it is needed); >=11.1: preferred read failure group before an RMAN full backup; >=12.1: ASM diskgroup scrubbing (ALTER DISKGROUP data SCRUB)
• Double parity / triple mirroring
32
Disks (4/5)
• Disks are larger and larger
  • speed stays ~constant -> throughput issue
  • bit error rate stays constant (10^-14 to 10^-16) -> increasing issue with availability
• With x the size (in bits) and α the “bit error rate”, the probability of hitting an unrecoverable read error while rebuilding from n surviving disks is approximately 1 − (1 − α)^(n·x)
Disks, redundancy comparison (5/5)

Probability of data loss during reconstruction, for arrays of 5 / 14 / 28 disks:

1 TB SATA desktop, bit error rate 10^-14:
                   5 disks    14 disks   28 disks
  RAID 1           7.68E-02
  RAID 5 (n+1)     3.29E-01   6.73E-01   8.93E-01
  ~RAID 6 (n+2)    1.60E-14   1.46E-13   6.05E-13
  ~triple mirror   8.00E-16   8.00E-16   8.00E-16

1 TB SATA enterprise, bit error rate 10^-15:
  RAID 1           7.96E-03
  RAID 5 (n+1)     3.92E-02   1.06E-01   2.01E-01
  ~RAID 6 (n+2)    1.60E-16   1.46E-15   6.05E-15
  ~triple mirror   8.00E-18   8.00E-18   8.00E-18

450 GB FC, bit error rate 10^-16:
  RAID 1           4.00E-04
  RAID 5 (n+1)     2.00E-03   5.58E-03   1.11E-02
  ~RAID 6 (n+2)    7.20E-19   6.55E-18   2.72E-17
  ~triple mirror   3.60E-20   3.60E-20   3.60E-20

10 TB SATA enterprise, bit error rate 10^-15:
  RAID 1           7.68E-02
  RAID 5 (n+1)     3.29E-01   6.73E-01   8.93E-01
  ~RAID 6 (n+2)    1.60E-15   1.46E-14   6.05E-14
  ~triple mirror   8.00E-17   8.00E-17   8.00E-17
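The table values can be reproduced with first-order approximations. This is a sketch, not the speaker's exact spreadsheet: the formulas are inferred to match the numbers, assuming a rebuild must read every surviving disk and a terabyte is 8×10^12 bits:

```python
import math

# Probability of at least one unrecoverable bit error while reading
# n disks of x_bits each (e.g. rebuilding a RAID 5 set of n+1 disks).
def p_raid5(n, alpha, x_bits):
    return -math.expm1(n * x_bits * math.log1p(-alpha))

# RAID 6 survives one bad sector during rebuild; data loss needs two
# overlapping errors among n disks: roughly n*(n-1) * alpha^2 * x_bits.
def p_raid6(n, alpha, x_bits):
    return n * (n - 1) * alpha ** 2 * x_bits

# Triple mirror: the same data must be unreadable on both remaining copies.
def p_triple_mirror(alpha, x_bits):
    return alpha ** 2 * x_bits

TB = 8e12  # bits in 10^12 bytes
print(p_raid5(1, 1e-14, TB))        # RAID 1, 1 TB desktop disk: ~7.7e-02
print(p_raid5(5, 1e-14, TB))        # RAID 5, 5 disks:           ~3.3e-01
print(p_raid6(5, 1e-14, TB))        # RAID 6, 5 disks:            1.6e-14
print(p_triple_mirror(1e-14, TB))   # triple mirror:              8.0e-16
```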
33
34
Upgrades (1/3)
• “If it is working, you should not change it!”…
• But changes often come in through the back door (disk replacement, IPv6 enablement on the network, “just a new volume”, etc.)
• Example:
35
Upgrades (2/3)
• SunCluster, new storage (3510) introduced, 1 LUN, OK
• Second LUN introduced (April), all fine
• Standard maintenance operation (July), cluster does not restart
• Multipathing: DMP / MPxIO, cluster SCSI reservations on the shared disks
36
Upgrades (3/3)
• 23 hours of downtime on a critical path to LHC magnet testing
• Some production with reduced HW availability
• Important stress...
• An up-to-date multipathing layer would have avoided it (as would making no change)
37
Measure
• “You can't manage what you don't measure”
• The Oracle DB has wait information (v$session, v$event_histogram), AWR / Statspack; set the retention accordingly
• It has additional information, ready to be used…
40
• IO tests and planning: SLOB, fio, etc. help to size
• Latency variation is the user experience (C. Millsap)
  • 1, 19, 0, 20, 10: average 10
  • 9, 11, 10, 9.5, 10.5: average 10
  • 10.01, 9.99, 10, 10.1, 9.9: average 10
  • 10.01, 9.99, 10.01, 9.99, 500, 10.01, 9.99: one 500 ms outlier pushes the average to ~80
• Ex: web page with 5 SQL statements, 10 IOs each -> 50 IOs per request
  • 50 IOs at 0.2 ms = 10 ms
  • 50 IOs at 300 ms = 15000 ms = 15 s
• Slow IO is different from an IO outlier
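The averages above can be checked with a short sketch (the 1000-sample series is a made-up illustration, not from the slide):

```python
from statistics import mean

# Series from the slide: identical averages, very different experiences.
a = [1, 19, 0, 20, 10]
b = [9, 11, 10, 9.5, 10.5]
c = [10.01, 9.99, 10, 10.1, 9.9]
print(mean(a), mean(b), round(mean(c), 2))   # all ~10

# One 500 ms outlier dominates a short window...
e = [10.01, 9.99, 10.01, 9.99, 500, 10.01, 9.99]
print(round(mean(e), 1))                     # ~80

# ...but nearly disappears in a long-run average (hypothetical series):
d = [10.0] * 999 + [500.0]
print(round(mean(d), 2))                     # 10.49

# Web page example: 5 SQL statements x 10 IOs = 50 IOs per request
print(50 * 0.2, "ms")    # healthy flash latency -> 10 ms of IO time
print(50 * 300, "ms")    # pathological latency -> 15000 ms = 15 s
```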
41
Pathological cases, latency
• Spinning disk latency is O(10 ms), flash O(0.1 ms)
• Is it always in this order of magnitude? If not, you should know, with detailed timing(*) information
• Reasons include bugs, disk failures, temporary or global overload, etc.
(*) correlation with other sources (OS logs, storage sub-system logs, ASM logs, etc.)
42
Complementing AWR for IO
• AWR captures histogram information, not single IOs
• AWR does not capture information on Active Data Guard (ADG)
• In addition, it is desirable to extract information
  • over the longer term (across migrations and upgrades, for capacity planning)
  • with timing for slow IO operations, from ASH
43
Capture Active Session History long IO
[Diagram: timeline of lgwr and five sessions showing ON CPU, db file sequential read, direct path read and log file parallel write waits; relative wait times are indicative only.]
44
ASH long IO repository
• Stores information about long(*) IO operations
• Query it to identify major issues
• Correlate with histograms and total IO operation counts
(*) longer than expected: >1s, >100ms, >10ms
49
“Low latency computing”
• from Kevin Closson
• Needed: specify storage with latency targets:
  • with a specified workload
  • 99% of IO operations take less than X
  • 99.99% of IO operations take less than Y
  • 100% of IO operations take less than Z
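A minimal sketch of verifying such targets against a latency sample — the sample, the injected slow IOs, and the X/Y/Z thresholds are all made up for illustration:

```python
import random

random.seed(42)
# Hypothetical sample: mostly ~0.1-0.4 ms flash reads...
sample = [random.uniform(0.1, 0.4) for _ in range(10_000)]
# ...with 10 pathological 50 ms IOs injected at regular positions.
sample[::1000] = [50.0] * 10

def pct(sample, p):
    """Latency (ms) below which p percent of the IOs fall."""
    s = sorted(sample)
    return s[min(len(s) - 1, int(len(s) * p / 100))]

# Illustrative targets: percentile -> maximum acceptable latency (ms)
targets = {99.0: 1.0, 99.99: 100.0, 100.0: 1000.0}
for p, limit in targets.items():
    print(f"p{p}: {pct(sample, p):.2f} ms (target < {limit} ms)")
```

Note that the p100 (worst-case) target is what the averages on the earlier slide cannot capture.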
50
Conclusion
• Critical: availability, performance and functionality
• Performance:
  • Spinning disk: IOPS matter, capacity “for free”
  • Flash: “low latency computing”, bandwidth, especially in the storage network
  • By the way, absorbing a lot of IO requires CPU
  • Planning: SLOB, fio, Real Application Testing
• “High availability is low technology” (Carel-Jan Engel) = complexity is the enemy of high availability
  • Mirror / triple mirror / triple parity
  • Some sort of scrubbing is essential
• (N)FS and ASM are both solid and capable; it depends on what is simpler and best known in your organisation
• RAC adds the requirement for shared storage; be careful that the storage interconnect is not a bottleneck
• Latency is complex: measure and keep long-term statistics (AWR retention, extracted data, data visualization), a key differentiator between Flash solutions
I am a proud member of an EMEA Oracle User Group
Are you a member yet?
Meet us at south upper lobby of Moscone South
www.iouc.org
53
References
• Kyle Hailey R scripts: https://github.com/khailey/fio_scripts
• Kevin Closson SLOB: http://kevinclosson.net/slob/
• Frits Hoogland gdb/strace: http://fritshoogland.wordpress.com/tag/oracle-io-performance-gdb-debug-internal-internals/
• fio: http://freecode.com/projects/fio
• Luca Canali OraLatencyMap: http://db-blog.web.cern.ch/blog/luca-canali/2014-06-recent-updates-oralatencymap-and-pylatencymap
• My Oracle Support, White Paper: ASMLIB Installation & Configuration On MultiPath Mapper Devices (Step by Step Demo) On RAC Or Standalone Configurations (Doc ID 1594584.1)
• Computer Architecture: A Quantitative Approach
CERN DB blog: http://cern.ch/db-blog/