LLNL-PRES-724397
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
ZFS Monitoring and Management at LLNL
Tony Hutter
ZFS User Conference 2017
March 16, 2017
Meet “Zinc”, our new 18PB filesystem

Contract awarded to RAID Inc
Lustre 2.8 on top of ZFS
Wanted a vendor-agnostic software stack
16 MDS nodes, 36 OSS nodes
2880 8TB HDDs, 96 800GB SSDs
8 24-bay SSD JBODs, 36 84-bay HDD JBODs
Smaller configuration for “Brass” and “Jet” systems, but same RAID Inc. hardware.
Glamour shot
Node configuration

[Rack diagram: the metadata rack holds MDS nodes 0–3 sharing MDT enclosures U and L; the object store rack holds OSS nodes 0–1 sharing OST enclosures U and L.]
Multipath drives

Multipath = each disk has two SAS connections
Increases bandwidth and provides link failover
Each disk shows up twice (e.g. /dev/sda and /dev/sdb), plus as a multipath device /dev/dm-N
We use the ZFS 'vdev_id' script to make friendly aliases for the drives in /dev/disk/by-vdev/

/dev/disk/by-vdev/L0  -> ../../dm-50
/dev/disk/by-vdev/L1  -> ../../dm-81
/dev/disk/by-vdev/L2  -> ../../dm-99
...
/dev/disk/by-vdev/L83 -> ../../dm-157
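The aliases come from /etc/zfs/vdev_id.conf. A minimal sketch of one common form of that file is below; the alias names match this deck, but the WWNs are made up, and real multipath setups often use the topology-based `channel` syntax instead:

```
# /etc/zfs/vdev_id.conf -- illustrative sketch only; WWNs are invented.
# vdev_id reads this file to create the /dev/disk/by-vdev/ symlinks.
multipath  yes
alias L0   /dev/disk/by-id/wwn-0x5000cca012345678
alias L1   /dev/disk/by-id/wwn-0x5000cca012345679
```

After editing the file, re-running udev (e.g. `udevadm trigger`) regenerates the symlinks.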
Monitoring with Splunk

Splunk is a syslog processing engine with a web front-end.
It has a query language that lets you construct tables and graphs from syslog values.
You can group multiple graphs and tables into a “dashboard”, for a single-pane-of-glass view of your systems.
Just log key=value pairs to syslog and Splunk can graph them.
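A minimal sketch of the “just log key=value pairs” idea, using the stock `logger` utility. The tag and field names mirror the example in this deck; the pool/vdev values here are made up:

```shell
# Build a key=value line in the same shape this deck logs, then send it
# to syslog with logger.  Splunk can extract the fields automatically.
pool="mypool"; vdev="B7"; state="ONLINE"
line="pool=${pool}, vdev=${vdev}, state=${state}, read_errors=0, write_errors=0, chksum_errors=0"
# logger tags the syslog entry so Splunk queries can select on it:
command -v logger >/dev/null && logger -t zpool_status "$line"
echo "$line"
```

In cron-driven logging scripts, the `echo` would be dropped and only the `logger` call kept.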
Splunk example
syslog:
  Dec 6 14:00:41 jet21 zpool_status: pool=mypool, vdev=B7, state=FAULTED, read_errors=0, write_errors=3, chksum_errors=0, resilver=0

Splunk query:
  zpool_status host=* pool=* | where state!="ONLINE" OR read_errors!=0 OR write_errors!=0 OR chksum_errors!=0 ...

syslog + Splunk query = table/graph
zpool status across all filesystems
Graphing zpool status over time
SMART stats (smartctl -a)
We log SMART status, read and write uncorrectable errors, and the Grown Defect List (GLIST). All of our drives report SAS stats.
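One way to turn `smartctl -a` output into loggable key=value pairs is a small awk pass. The here-doc below stands in for real output from `smartctl -a /dev/sdX` on a SAS drive; the values are made up:

```shell
# Sketch: extract the GLIST size and drive temperature from smartctl -a
# output and emit them as key=value pairs (pipe to logger for syslog).
sample_smartctl() {
cat <<'EOF'
Current Drive Temperature:     38 C
Elements in grown defect list: 4
EOF
}
# GLIST count is everything after the colon, with spaces stripped:
glist=$(sample_smartctl | awk -F: '/grown defect list/ {gsub(/ /, "", $2); print $2}')
# Temperature is the next-to-last field on its line ("38" before "C"):
temp=$(sample_smartctl | awk '/Current Drive Temperature/ {print $(NF-1)}')
echo "glist=${glist}, temp_c=${temp}"
```

On a live system, `sample_smartctl` would be replaced by `smartctl -a` against each multipath device.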
SMART Grown Defect List (smartctl -a)
We don't yet have enough data to know whether GLIST is a predictor of pending drive failure.
SMART Drive Temperatures (smartctl -a)
We've noticed on occasion that a few of our disks run hotter than spec.
SMART Drive Temperatures (smartctl -a)
Drives at the back of the enclosure run hotter, so we adjusted our raidz2 configuration to include a mix of drives from the front and the back.
Enclosure sensor values (sg_ses)
We graph enclosure fan speed, temperature, voltage, and current.
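Enclosure sensor readings can be scraped from sg_ses output with a similar awk pass. Enclosure output formats vary by vendor, so the here-doc below is only an illustrative stand-in for something like `sg_ses --page=2 /dev/sg5`, not verbatim output:

```shell
# Sketch: pull a fan speed and a temperature reading out of SES status
# output and log them as key=value pairs.  Sample lines are invented.
sample_sg_ses() {
cat <<'EOF'
      Element 0 descriptor: Cooling
        Actual speed=4740 rpm
      Element 1 descriptor: Temperature sensor
        Temperature=38 C
EOF
}
# Split on '=' and runs of spaces so the numeric field is easy to grab:
fan=$(sample_sg_ses  | awk -F'[= ]+' '/Actual speed/ {print $4}')
temp=$(sample_sg_ses | awk -F'[= ]+' '/Temperature=/ {print $3}')
echo "fan_rpm=${fan}, temp_c=${temp}"
```

A real script would loop over every /dev/sg device that maps to an enclosure.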
Enclosure sensor values (sg_ses)
We graph the number of SES values reporting “Critical” to look for potential hardware problems.
Enclosure sensor values (sg_ses)
SAS PHY Errors (/sys/class/sas_phy/...)
Bad SAS PHYs can create ZFS read/write errors, and cause drives to disappear and re-appear.
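The kernel exposes per-PHY error counters as plain sysfs files, so logging them is a short loop. The four counter filenames below are standard sas_phy sysfs attributes; the output shape and the directory parameter are our own:

```shell
# Sketch: dump each SAS PHY's error counters as key=value pairs.
# Pipe the output to logger to get it into syslog for Splunk.
log_sas_phy_errors() {
    base="${1:-/sys/class/sas_phy}"          # parameterized for testing
    for phy in "$base"/*/; do
        [ -d "$phy" ] || continue            # no SAS PHYs on this host
        name=$(basename "$phy")
        for c in invalid_dword_count loss_of_dword_sync_count \
                 phy_reset_problem_count running_disparity_error_count; do
            [ -r "$phy$c" ] && echo "phy=$name $c=$(cat "$phy$c")"
        done
    done
}
log_sas_phy_errors
```

Rising dword/disparity counts on one PHY are a good hint that the read/write errors ZFS sees are a cabling or expander problem rather than a bad disk.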
Disk history by drive serial number
Periodically logging drive serial numbers lets you see if and when drives were replaced, and helps locate a drive that has been moved to another enclosure.
You can also build a record of any SMART errors associated with a serial number in case you need to RMA the drive.
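Serial numbers come from `smartctl -i`. The here-doc below stands in for its output on a SAS drive (the serial and model are made up), and the vdev alias is just the example naming from this deck:

```shell
# Sketch: record a vdev -> serial-number mapping as a key=value pair so
# Splunk can track a physical drive across slots and enclosures.
sample_smartctl_i() {
cat <<'EOF'
Vendor:               HGST
Product:              HUH728080AL5200
Serial number:        2EGXYZ1A
EOF
}
# The serial is the last field on its line:
serial=$(sample_smartctl_i | awk '/[Ss]erial [Nn]umber:/ {print $NF}')
echo "vdev=L0, serial=${serial}"
```

A cron job would run this per drive, e.g. `smartctl -i /dev/disk/by-vdev/L0`, and send each line to logger.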
zpool iostat bandwidth
zpool iostat latency
Logging scripts

We log stats with cron every hour. We also log zpool stats on every vdev state change via a zedlet.
'zpool status -c' can be useful for grabbing stats:

# zpool status -c 'smartctl -a $VDEV_UPATH | grep "Drive Temp"' ...
  NAME       STATE     READ WRITE CKSUM
  jet18      DEGRADED     0     0     0
    raidz2-0 ONLINE       0     0     0
      L0     ONLINE       0     0     0  Current Drive Temperature: 26 C
      L1     ONLINE       0     0     0  Current Drive Temperature: 25 C
      L14    ONLINE       0     0     0  Current Drive Temperature: 33 C
      L15    ONLINE       0     0     0  Current Drive Temperature: 30 C
      L28    ONLINE       0     0     0  Current Drive Temperature: 37 C
      L29    ONLINE       0     0     0  Current Drive Temperature: 35 C
      L42    ONLINE       0     0     0  Current Drive Temperature: 40 C
      L43    ONLINE       0     0     0  Current Drive Temperature: 39 C
      L56    ONLINE       0     0     0  Current Drive Temperature: 43 C
      L70    ONLINE       0     0     0  Current Drive Temperature: 46 C
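The hourly cron side of this is just a crontab entry. A hypothetical sketch (the script path and name are invented, not LLNL's actual script):

```
# Run the stat-logging script at the top of every hour:
0 * * * * /usr/local/sbin/log_zfs_stats.sh
```

The script itself would gather the zpool/smartctl/sg_ses values and emit them to syslog as key=value pairs.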
Slot fault LEDs

We use zed to automatically turn on/off slot LEDs when vdevs go FAULTED/DEGRADED/UNAVAIL.
We enable auto-replace on our pools so we can swap in a new disk for an old one and have it auto-resilver.
This allows operations staff to replace bad drives without being root.
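The auto-replace piece is a standard pool property. An illustrative sketch (the pool name is made up; the LED behavior itself comes from a zed statechange zedlet rather than this property):

```
# Enable auto-replace so a new disk inserted into a faulted drive's
# slot is automatically brought in and resilvered:
zpool set autoreplace=on mypool
zpool get autoreplace mypool
```

With this set, the physical swap is the only step that needs hands in the machine room; no root shell is required.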