
   

Cheap Clustering with OCFS2

Mark Fasheh
Oracle

August 14, 2006

   

What is OCFS2?

● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system

   

Why use OCFS2?

● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers

   

OCFS2 Uses

● File serving
  – FTP
  – NFS
● Web serving (Apache)
● Xen image migration
● Oracle Database

   

Why do we need “cheap” clusters?

● Shared disk hardware can be expensive
  – Fibre Channel as a rough example
    ● Switches: $3,000 - $20,000
    ● Cards: $500 - $2,000
    ● Cables, GBICs: hundreds of dollars
    ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical

   

Hardware

● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc.)
  – Large hardware stores (Fry's Electronics, etc.)
  – Online: eBay, Amazon, Newegg, etc.
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II

   

Hardware Examples - CPU

● 2.66GHz, dual core w/motherboard: $129
  – Built-in video, network

   

Hardware Examples - RAM

● 1GB DDR2: $70

   

Hardware Examples - Disk

● 100GB SATA: $50

   

Hardware Examples - Network

● Gigabit network card: $6
  – Can direct connect rather than buy a switch – just buy two cards!

   

Hardware Examples - Case

● 400 Watt Case: $70

   

Hardware Examples - Total

● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports disk via network
    ● Dedicated gigabit network for the storage
    ● At $50 each, simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express gigabit: $30
  – Athlon X2 3800+, MB (SATA II, DDR2): $180

   

Shared Disk via iSCSI

● SCSI over TCP/IP
  – Can be routed
  – Support for authentication, many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on any disks, regular files
  – Kernel / user space components
● Open-iSCSI initiator
  – iSCSI “client”
  – Kernel / user space components

   

Trivial iSCSI Target Config.

● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create “Target” stanza in /etc/ietd.conf
  – Lun definitions describe disks to export
  – fileio type for normal disks
  – Special nullio type for testing

Target iqn.2006-08.com.example:lab.exports
    Lun 0 Path=/dev/sdX,Type=fileio
    Lun 1 Sectors=10000,Type=nullio
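Once /etc/ietd.conf is in place, the target daemon must be (re)started so it picks up the new stanza. A minimal sketch – the init script name varies by distribution (some packages install it as iscsi-target), and the device path above is a placeholder:

$ /etc/init.d/iscsi-target restart    # or run ietd directly
$ cat /proc/net/iet/volume            # confirm the LUNs are exported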

   

Trivial iSCSI Initiator Config.

● Recent releases have a DB driven config.
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db/iscsi/*” to start fresh
  – 3 steps:
    ● Add discovery address
    ● Log into target
    ● When done, log out of target

$ iscsiadm -m discovery --type sendtargets --portal examplehost
[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports

$ iscsiadm -m node --record cbb01c --login

$ iscsiadm -m node --record cbb01c --logout
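After a successful login, the exported LUN appears as an ordinary SCSI disk on the initiator. A quick sanity check (device names will differ per machine):

$ dmesg | tail             # look for the newly attached SCSI disk
$ cat /proc/partitions     # the iSCSI LUN shows up like a local disk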

   

Shared Disk via SLES10

● Easiest option
  – No downloading – all packages included
  – Very simple setup using YaST2
    ● Simple to use GUI configuration utility
    ● Text mode available
● Supported by Novell/SUSE
● OCFS2 also integrated with Linux-HA software
● Demo on Wednesday
  – Visit Oracle booth for details

   

Shared Disk via AoE

● ATA over Ethernet
  – Very simple standard – 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up – auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers

   

AoE Target Configuration

● “Virtual Blade” (vblade) software available for Linux, FreeBSD
  – Very small, user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available
  – Stock performance is not very high
● Very simple command
  – vbladed <shelf> <slot> <ethn> <device>
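For example, to export a spare disk as shelf 0, slot 0 over a dedicated storage interface – eth1 and /dev/sdb are placeholders:

$ vbladed 0 0 eth1 /dev/sdb    # vbladed daemonizes; vblade stays in the foreground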

   

AoE Client Configuration

● Single kernel module load required
  – Automatically finds blades
  – Optional load time option, aoe_iflist
    ● List of interfaces to listen on
● Aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc.
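A minimal client session might look like the following, assuming eth1 is the storage interface:

$ modprobe aoe aoe_iflist=eth1    # load the module, listen only on eth1
$ aoe-stat                        # list discovered blades (from aoetools)
$ ls /dev/etherd/                 # devices appear as e<shelf>.<slot>, e.g. e0.0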

   

OCFS2

● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMs for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fixes only
● 1.3 tree
  – Active development tree
  – Included in Linux kernel
  – Bug fixes and features go to -mm first

   

OCFS2 Tools

● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.
  – Cluster aware
  – o2cb to start/stop/configure cluster
  – Work with both OCFS2 trees
● ocfs2console GUI configuration utility
  – Can create entire cluster configuration
  – Can distribute configuration to all nodes
● RPMs for non-SLES distributions available online

   

OCFS2 Configuration

● Major goal for OCFS2 was simple config.
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can configure to start at boot (sketch below)

$ /etc/init.d/o2cb online <cluster name>
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
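To bring the stack up automatically at boot, the o2cb init script offers an interactive configure step; a sketch:

$ /etc/init.d/o2cb configure    # answers are saved to /etc/sysconfig/o2cb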

   

Sample cluster.conf

node:
    ip_port = 7777
    ip_address = 192.168.1.7
    number = 0
    name = keevan
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.1.2
    number = 1
    name = opaka
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
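With cluster.conf distributed and the cluster online, the shared disk is formatted once and mounted on every node like a local file system. A minimal sketch – device, label, and mount point are placeholders:

$ mkfs.ocfs2 -L shared /dev/sdb          # run once, from a single node
$ mount -t ocfs2 /dev/sdb /mnt/shared    # run on each node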

   

OCFS2 Tuning - Heartbeat

● Default heartbeat timeout tuned very low for our purposes
  – May result in node reboots for lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb (example below)
    ● OCFS2 Tools 1.2.3 release will add this to the configuration script
    ● SLES10 users can use Linux-HA instead
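A hedged example of the sysconfig change – the value 31 is illustrative, not a recommendation, and must match on every node:

# /etc/sysconfig/o2cb (excerpt)
# Effective disk heartbeat timeout is roughly (threshold - 1) * 2 seconds
O2CB_HEARTBEAT_THRESHOLD=31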

   

OCFS2 Tuning – mkfs.ocfs2

● OCFS2 uses cluster and block sizes
  – Clusters for data; range from 4K to 1M
    ● Use the -C <clustersize> option
  – Blocks for meta data; range from 512 bytes to 4K
    ● Use the -b <blocksize> option
● More meta data updates -> larger journal
  – -J size=<journalsize> to pick a different size
● mkfs.ocfs2 -T <filesystem-type>
  – -T mail for meta data heavy workloads
  – -T datafiles for file systems with very large files
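Putting those options together, a sketch of a metadata-heavy build – all sizes and the device are illustrative:

$ mkfs.ocfs2 -b 4K -C 8K -J size=128M -T mail /dev/sdb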

   

OCFS2 Tuning - Practices

● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory (sketch below)
    ● Each node has its own logfile
● Spread things out by using multiple file systems
  – Allows you to fine tune mkfs options depending on file system target usage
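A trivial way to keep writes node-local, assuming the file system is mounted at /mnt/shared:

$ mkdir /mnt/shared/$(hostname)    # per-node directory for this node's writes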

   

References

● http://oss.oracle.com/projects/ocfs2/
● http://oss.oracle.com/projects/ocfs2-tools/
● http://www.novell.com/linux/storage_foundation/
● http://iscsitarget.sf.net/
● http://www.open-iscsi.org/
● http://aoetools.sf.net/
● http://www.coraid.com/
● http://www.frys-electronics-ads.com/
● http://www.cdw.com/