Scalable, Fault-Tolerant NAS for Oracle - The Next Generation
Kevin Closson
Chief Software Architect
Oracle Platform Solutions, PolyServe Inc.
The Un-“Show Stopper”
• NAS for Oracle is not “file serving”; let me explain…
• Think of GbE NFS I/O paths from Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection.
  – In these terms, NFS over GbE is just a protocol, as is FCP over FibreChannel.
  – The proof is in the numbers (a back-of-envelope check follows this list).
• A single dual-socket/dual-core AMD server running Oracle10gR2 can push 273MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs!
• Compare that to the infrastructure and hardware costs of 4Gb FCP (~450MB/s, but you need 2 cards for redundancy).
  – OLTP over modern NFS with GbE is not a challenging I/O profile.
• However, not all NAS devices are created equal, by any means.
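A back-of-envelope check of those numbers (my arithmetic, using nominal wire rates that are not in the original slides):

    One GbE NIC: ~125MB/s raw, roughly 110-120MB/s practical
    Three bonded GbE NICs: roughly 330-360MB/s aggregate ceiling
    273MB/s observed is therefore about 75-80% of the bonded-link ceiling
    4Gb FCP: ~400-450MB/s per HBA, but a redundant pair doubles the card and switch-port cost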
Agenda
• Oracle on NAS
• NAS Architecture
• Proof of Concept Testing
• Special Characteristics
Oracle on NAS
Oracle on NAS
• Connectivity
  – Fantasyland Dream Grid™ would be nearly impossible with FibreChannel switched fabric. For instance: 128 nodes == 256 HBAs and 2 switches, each with 256 ports, just for the servers; then you have to work out the storage paths (the port-count math follows this list).
• Simplicity
  – NFS is simple. Anyone with a pulse can plug in cat-5 and mount filesystems.
  – MUCH MUCH MUCH MUCH MUCH simpler than:
    • Raw partitions for ASM
    • Raw, OCFS2 for CRS
    • Oracle Home? Local Ext3 or UFS? What a mess.
  – Supports shared Oracle Home, shared APPL_TOP too
  – But not simpler than a Certified Third Party Cluster Filesystem; that is a different presentation, though.
• Cost
  – FC HBAs are always going to be more expensive than NICs
  – Ports on enterprise-level FC switches are very expensive
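Checking the fabric-sizing claim (my arithmetic; the slide gives only the headline figures):

    128 RAC nodes × 2 HBAs per node (for redundancy)  = 256 HBAs
    256 server-facing ports per switch × 2 switches   = 512 switch ports before a single storage port is counted

The same grid on NFS needs only GbE NICs and commodity Ethernet switch ports.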
Oracle on NAS
• NFS Client Improvements
  – Direct I/O
    • open(…, O_DIRECT, …) works with the Linux NFS client, the Solaris NFS client, and likely others (see the sketch after this list)
• Oracle Improvements
  – init.ora: filesystemio_options=directIO
  – No async I/O on NFS, but look at the numbers
  – Oracle runtime checks mount options
    • Caveat: it doesn’t always get it right, but at least it tries (OSDS)
  – Don’t be surprised to see Oracle offer a platform-independent NFS client
• NFS V4 will have more improvements
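To make the Direct I/O bullet concrete, here is a minimal C sketch of the open(…, O_DIRECT, …) call as a Linux NFS client application would issue it; the datafile path and the 4KB alignment are illustrative assumptions, and error handling is abbreviated:

    /* Direct I/O against a datafile on an NFS mount: O_DIRECT bypasses
       the client page cache, so the buffer, offset, and length must be
       aligned (4KB assumed here). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blksz = 4096;                /* assumed alignment  */
        void *buf;
        int fd = open("/u02/oradata/system01.dbf", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        if (posix_memalign(&buf, blksz, blksz) != 0)  /* aligned buffer */
            return 1;

        ssize_t n = pread(fd, buf, blksz, 0);     /* uncached 4KB read  */
        if (n < 0) perror("pread");
        else printf("read %zd bytes direct from NFS\n", n);

        free(buf);
        close(fd);
        return 0;
    }

With filesystemio_options=directIO set, Oracle opens its datafiles in roughly this fashion, subject to the mount-option checks noted above.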
NAS Architecture
NAS Architecture
• Single-headed Filers
• Clustered Single-headed Filers
• Asymmetrical Multi-headed NAS
• Symmetrical Multi-headed NAS
Single-headed Filer Architecture
NAS Architecture: Single-headed Filer
[Diagram: Oracle database servers connect over a GigE network to a single-headed filer exporting filesystems /u01, /u02, and /u03. A single one of those database servers has the same (or more) bus bandwidth as the filer!]
Oracle Servers Accessing a Single-headed Filer: I/O Bottleneck

[Diagram: all I/O from the database servers funnels through the one filer head, which becomes the I/O bottleneck.]
Oracle Servers Accessing a Single-headed Filer: Single Point of Failure

[Diagram: the filer head serving /u01, /u02, and /u03 is a single point of failure, even though the Oracle database servers themselves are highly available through failover HA, DataGuard, RAC, etc.]
Clustered Single-headed Filers
Architecture: Cluster of Single-headed Filers

[Diagram: two single-headed filers, one exporting /u01 and /u02, the other exporting /u03; each has standby paths that become active only after a failover.]
Oracle Servers Accessing a Cluster of Single-headed Filers

[Diagram: the Oracle database servers mount /u01 and /u02 from one filer and /u03 from the other, with the cross-filer paths active only after a failover.]
Architecture: Cluster of Single-headed Filers

[Diagram: the same cluster. What if /u03 I/O saturates its filer?]
Filer I/O Bottleneck. Resolution == Data Migration

[Diagram: a new filesystem /u04 is added on the other filer, and some of the “hot” data is migrated from /u03 to /u04.]
Data Migration Remedies I/O Bottleneck

[Diagram: with the “hot” data migrated to /u04, the bottleneck is relieved, but the filer serving /u04 is now a NEW single point of failure.]
Summary: Single-headed Filers
• Cluster to mitigate S.P.O.F.
  – Clustering is a pure afterthought with filers
  – Failover times? Long, really really long.
  – Transparent? Not in many cases.
• Migrate data to mitigate I/O bottlenecks
  – What if the data “hot spot” moves with time? The Dog Chasing His Tail Syndrome
• Poor modularity
  – Expanded by pairs for data availability
• What’s all this talk about CNS?
Asymmetrical Multi-headed NAS Architecture

[Diagram: Oracle database servers connect to six NAS heads in front of a SAN gateway and FibreChannel SAN; three heads are active and three are for failover, each active head serving its own “pool of data”.]

Note: Some variants of this architecture support M:1 Active:Standby, but that doesn’t really change much.
Asymmetrical NAS Gateway Architecture
• Really not much different from clusters of single-headed filers:
  – 1 NAS head to 1 filesystem relationship
  – Migrate data to mitigate I/O contention
  – Failover not transparent
• But:
  – More modular
    • Not necessary to scale up by pairs
Symmetric Multi-headed NAS
HP Enterprise File Services Clustered Gateway
Symmetric vs Asymmetric

[Diagram: in the asymmetric gateway, each NAS head owns one filesystem, so /Dir1/File1, /Dir2/File2, and /Dir3/File3 are each reachable through exactly one head. In the symmetric EFS-CG, every head presents every filesystem, so all three files are reachable for read/write through any head.]
EFS-CG
Enterprise File Services Clustered Gateway Component Overview
• Cluster Volume Manager
  – RAID 0
  – Expand online
• Fully Distributed, Symmetric Cluster Filesystem
  – The embedded filesystem is a fully distributed, symmetric cluster filesystem
• Virtual NFS Services
  – Filesystems are presented through Virtual NFS Services
• Modular and Scalable
  – Add NAS heads without interruption
  – All filesystems can be presented for read/write through any/all NAS heads
EFS-CG Clustered Volume Manager
• RAID 0
  – LUNs are RAID 1, so this implements S.A.M.E. (see the worked example after this list)
• Expand online
  – Add LUNs, grow the volume
• Up to 16TB
  – Single volume
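A rough illustration of the S.A.M.E. (Stripe And Mirror Everything) arithmetic, with hypothetical LUN sizes that are not from the slides:

    8 LUNs, each a RAID 1 mirror pair of 2TB disks = 8 × 2TB usable
    RAID 0 stripe across those 8 LUNs              = one 16TB volume (the current maximum)
    Sequential bandwidth grows with stripe width: roughly 8 × the per-LUN rate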
The EFS-CG Filesystem
• All NAS devices have embedded operating systems and filesystems, but the EFS-CG is:
  – Fully symmetric
    • Distributed Lock Manager
    • No metadata server or lock server
  – A general-purpose clustered filesystem
  – Standard C Library and POSIX support
  – Journaled, with online recovery
• Proprietary format, but uses standard Linux filesystem semantics and system calls, including flock() and fcntl() clusterwide (see the sketch after this list)
• Expand a single filesystem online up to 16TB; up to 254 filesystems in the current release
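For instance, a plain POSIX fcntl() write lock like the sketch below is honored across all NAS heads, because the DLM arbitrates it clusterwide; the lock-file path is hypothetical and error handling is abbreviated:

    /* Clusterwide advisory locking on the EFS-CG: the same fcntl()
       call an application would make on a local filesystem. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/u01/app/shared.lock", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = { 0 };
        fl.l_type   = F_WRLCK;      /* exclusive write lock         */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;            /* zero length locks whole file */

        if (fcntl(fd, F_SETLKW, &fl) == -1) {  /* wait until granted */
            perror("fcntl");
            return 1;
        }

        /* critical section: no process on any node holds a
           conflicting lock on this file */

        fl.l_type = F_UNLCK;        /* release                      */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }

The point is that no metadata or lock server mediates this; the DLM grants the lock wherever the process happens to run.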
EFS-CG Filesystem Scalability
Scalability: Single Filesystem Export Using x86 Xeon-based NAS Heads (Old Numbers)

[Chart: single-filesystem NFS throughput in MegaBytes per Second (MB/s) versus cluster size: 123, 246, 493, 739, 986, 1,084, and 1,196 MB/s at 1, 2, 4, 6, 8, 9, and 10 nodes.]

# Servers   Total bytes (MBytes)   Time (sec.)   MBytes/Sec.   Gbits/Sec   Scale Factor   Scaling Coefficient
        1                 16,384           133        123.19        0.96           1.00                  100%
        2                 32,768           133        246.38        1.92           2.00                  100%
        4                 65,536           133        492.75        3.85           4.00                  100%
        6                 98,304           133        739.13        5.77           6.00                  100%
        8                131,072           133        985.50        7.70           8.00                  100%
        9                147,456           136      1,084.24        8.47           8.80                   98%
       10                163,840           137      1,195.91        9.34           9.71                   97%
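How the derived columns follow from the raw measurements, using the 10-node row (my arithmetic; the deck appears to convert to Gbits with a factor of 1,024):

    MBytes/Sec.         = 163,840 MB ÷ 137 s   ≈ 1,195.9
    Gbits/Sec           = 1,195.9 × 8 ÷ 1,024  ≈ 9.34
    Scale Factor        = 1,195.9 ÷ 123.19     ≈ 9.71
    Scaling Coefficient = 9.71 ÷ 10 nodes      ≈ 97%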
NAS I/O Throughput (via NFS)

[Chart: the same single-filesystem throughput curve versus number of NAS heads, with the approximate single-headed filer limit marked near the bottom of the scale. HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.]
Virtual NFS Services
• Specialized virtual host IP
• Filesystem groups are exported through VNFS
• VNFS failover and rehosting are 100% transparent to the NFS client
  – Including active file descriptors, file locks (e.g., fcntl/flock), etc.
EFS-CG Filesystems and VNFS

[Diagram: the Oracle database servers mount /u01, /u02, /u03, and /u04 from the Enterprise File Services Clustered Gateway through virtual NFS services (vnfs1, vnfs2b, vnfs1b, vnfs3b, …), each of which can be hosted, and rehosted, on any NAS head.]
EFS-CG Management Console
EFS-CG Proof of Concept
EFS-CG Proof of Concept
• Goals
– Use Oracle10g (10.2.0.1) with a single high-performance filesystem for the RAC database and measure:
  • Durability
  • Scalability
  • Virtual NFS functionality
EFS-CG Proof of Concept
• The 4 filesystems presented by the EFS-CG were:
  – /u01. This filesystem contained all Oracle executables (e.g., $ORACLE_HOME)
  – /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) and some datafiles and External Tables for ETL testing
  – /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup
  – /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database
EFS-CG P.O.C. Parallel Tablespace Creation
• All datafiles created in a single exported filesystem
– Proof of multi-headed, single filesystem write scalability
EFS-CG P.O.C. Parallel Tablespace Creation
Multi-headed EFS-CG Tablespace Creation Scalability

[Chart: 111 MB/s with a single head and single GigE path versus 208 MB/s multi-headed with dual GigE paths.]
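That is 208 ÷ 111 ≈ 1.9x the single-head write rate once the second head and GigE path are in play (my arithmetic on the chart values).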
EFS-CG P.O.C. Full Table Scan Performance
• All datafiles located in a single exported filesystem
– Proof of multi-headed, single filesystem sequential I/O scalability
EFS-CG P.O.C. Parallel Query Scan Throughput

Multi-headed EFS-CG Full Table Scan Scalability

[Chart: 98 MB/s with a single head and single GigE path versus 188 MB/s multi-headed with dual GigE paths.]
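Again roughly 188 ÷ 98 ≈ 1.9x; sequential read bandwidth nearly doubles with the second head and path (my arithmetic on the chart values).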
EFS-CG P.O.C. OLTP Testing
• OLTP database based on an Order Entry schema and workload
• Test areas
  – Physical I/O scalability under Oracle OLTP
  – Long duration testing
EFS-CG P.O.C. OLTP Workload Transaction Average Cost

Oracle Statistic          Average Per Transaction
SGA Logical Reads                 33
SQL Executions                     5
Physical I/O                       6.9 *
Block Changes                      8.5
User Calls                         6
GCS/GES Messages Sent             12

* Averages with RAC can be deceiving; be aware of CR sends
EFS-CG P.O.C. OLTP Testing

[Chart: 10gR2 RAC scalability on EFS-CG. Transactions per second by cluster size: 650, 1,246, 1,773, and 2,276 TPS on 1, 2, 3, and 4 RHEL4-64 RAC servers.]
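From the chart values, 2,276 ÷ 650 ≈ 3.5x on 4 nodes, or about 88% scaling efficiency (my arithmetic).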
EFS-CG P.O.C. OLTP Testing: Physical I/O Operations

[Chart: RAC OLTP I/O scalability on EFS-CG. Random 4K IOps by cluster size: 5,214, 8,831, 11,619, and 13,743 on 1, 2, 3, and 4 RHEL4-64 RAC servers.]
EFS-CG Handles All OLTP I/O Types Sufficiently: No Logging Bottleneck

[Chart: OLTP I/O operations per second by type: 893 redo writes, 5,593 datafile writes, 8,150 datafile reads.]
Long Duration Stress Test
• Benchmarks do not prove durability
  – Benchmarks are “sprints”
  – Typically 30-60 minute measured runs (e.g., TPC-C)
• This long duration stress test was no benchmark by any means
  – Ramp OLTP I/O up to roughly 10,000/sec
  – Run non-stop until the aggregate I/O breaks through 10 billion physical transfers
  – 10,000 physical I/O transfers per second for every second of nearly 12 days (the arithmetic follows)
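That duration figure checks out (my arithmetic):

    10,000,000,000 transfers ÷ 10,000 transfers/sec = 1,000,000 seconds
    1,000,000 s ÷ 86,400 s/day ≈ 11.6 days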
Special Characteristics
Special Characteristics
• The EFS-CG NAS heads are Linux servers
  – Tasks can be executed directly on the NAS heads at FCP speed:
    • Compression
    • ETL, data importing
    • Backup
    • etc.
Example of EFS-CG Special Functionality
• A table is exported on one of the RAC nodes
• The export file is then compressed on the EFS-CG NAS head:
  – CPU comes from the NAS head instead of the database servers
    • The NAS heads are really just protocol engines; I/O DMAs are offloaded to the I/O subsystems, so there are plenty of spare cycles
  – Data movement at FCP rate instead of GigE rate
    • Offloads the I/O fabric (the NFS paths from servers to the EFS-CG); see the comparison below
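A rough sense of why the offload matters (my arithmetic, using the nominal rates quoted earlier in this deck and a hypothetical 100GB export file):

    Reading 102,400 MB over one GigE NFS path at ~110MB/s  ≈ 15.5 minutes of fabric time
    Reading it on the NAS head at 4Gb FCP rate (~450MB/s)  ≈ 3.8 minutes, with no GigE traffic at all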
Export a Table to NFS Mount
Compress it on the NAS Head
Questions and Answers
Backup Slide

EFS-CG Scales “Up” and “Out”

[Diagram: Oracle servers connect through an Ethernet switch to the EFS-CG NAS heads over 3 GbE NFS paths (which can be triple-bonded, etc.); the NAS heads connect through FibreChannel switches to the SAN.]