Corralling Big Data at TACC

In this presentation from the DDN User Group Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's new Stockyard global filesystem. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/


Page 1: Corralling Big Data at TACC

Corralling Big Data at TACC

Tommy Minyard

Texas Advanced Computing Center

DDN User Group Meeting

November 18, 2013

Page 2: Corralling Big Data at TACC

TACC Mission & Strategy

The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies.

To accomplish this mission, TACC:

– Evaluates, acquires & operates advanced computing systems

– Provides training, consulting, and documentation to users

– Collaborates with researchers to apply advanced computing techniques

– Conducts research & development to produce new computational technologies

[Diagram: Resources & Services / Research & Development]

Page 3: Corralling Big Data at TACC

TACC Storage Needs

• Cluster-specific storage
– High performance (tens to hundreds of GB/s of bandwidth)
– Large capacity (~2TB per teraflop), purged frequently
– Very scalable, to thousands of clients

• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota enabled
– Moderate performance, very reliable, high availability

• Permanent archival storage
– Maximum capacity, tens of PBs
– Slower performance, tape-based offline storage with a spinning-disk cache

Page 4: Corralling Big Data at TACC

History of DDN at TACC

• 2006 – Lonestar 3 with DDN S2A9500 controllers and 120TB of disk

• 2008 – Corral with DDN S2A9900 controller and 1.2PB of disk

• 2010 – Lonestar 4 with DDN SFA10000 controllers and 1.8PB of disk

• 2011 – Corral upgrade with DDN SFA10000 controllers and 5PB of disk

Page 5: Corralling Big Data at TACC

Global Filesystem Requirements

• User requests for persistent storage available on all production systems
– Corral is limited to UT System users only

• RFP issued for a storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability

• DDN solution selected for the project

Page 6: Corralling Big Data at TACC

Stockyard: Design and Setup

Page 7: Corralling Big Data at TACC

Stockyard: Design and Setup

• A Lustre 2.4.1-based global filesystem, with scalability for future upgrades

• Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12K couplets, corresponding to 5PB of capacity and 25+ GB/s of throughput per SU

• Four SUs provide 20PB with 100GB/s today

• 16 initial LNET routers set up for external mounts

Page 8: Corralling Big Data at TACC

SU layout (one server rack with two DDN SFA12K couplet racks)

Page 9: Corralling Big Data at TACC

SU Hardware Details

• SFA12K rack: 50U rack with 8x L6-30p power

• SFA12K couplet with 16 FDR IB ports (direct attachment to the 16 OSS servers)

• 84-slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU)

• 4TB 7200RPM NL-SAS drives

Page 10: Corralling Big Data at TACC

Stockyard Logical Layout

Page 11: Corralling Big Data at TACC

Stockyard: Capabilities and Features

• 20PB of usable capacity with 100+ GB/s aggregate bandwidth

• Client systems can bring their own LNET router set to connect to the Stockyard core IB switches, or connect to the built-in LNET routers using either IB (FDR14) or TCP (10GigE); a sketch of the client-side route configuration follows below

• HSM potential to the Ranch tape archive system
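
As a rough illustration of the routed-client setup, an external client declares the Stockyard-side LNET network and its routers in the lnet module options. This is a minimal sketch; the network names and NIDs are hypothetical placeholders, not Stockyard's actual configuration:

    # /etc/modprobe.d/lustre.conf on an external client (hypothetical NIDs)
    # o2ib100 is the filesystem-side network; 10.10.0.[1-16] are the LNET
    # routers as seen from the client-side fabric (o2ib0)
    options lnet networks="o2ib0(ib0)" routes="o2ib100 10.10.0.[1-16]@o2ib0"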

Page 12: Corralling Big Data at TACC

Capabilities and Features (cont’d)

• Metadata performance enhancement possible with DNE (Phase 1)

• NRS (Network Request Scheduler) evaluation: characteristics of the different policies on ost_io.nrs_policies, particularly CRR-N (client round-robin over NIDs) under contention dominated by a few jobs (sketches of both follow below)
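
A minimal sketch of how these two features are exercised, assuming a filesystem mounted at /stockyard and a second MDT at index 1 (both placeholders, not taken from the slides):

    # DNE Phase 1: place a new directory on MDT0001 (index is hypothetical)
    lfs mkdir -i 1 /stockyard/projects

    # NRS: switch the OST I/O service to client round-robin over NIDs
    # (run on each OSS)
    lctl set_param ost_io.nrs_policies="crrn"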

Page 13: Corralling Big Data at TACC

Stockyard: Numbers So Far

• 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s to a single SU

• With two SUs the same set of clients can achieve 50GB/s, and 75GB/s with three SUs

• With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)

Page 14: Corralling Big Data at TACC

Numbers So Far (Single Client)

• Single-thread write performance with Lustre 2.4.1 is ~770MB/s
– a big improvement over 2.1.x at about 500MB/s

• Multi-threaded writes from a single client saturate around 4.7GB/s (with credits=256 on both servers and clients; see the sketch below)
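
The credits value mentioned above is a module parameter of the InfiniBand LNET driver. A minimal sketch of how it would be raised, assuming o2iblnd; the peer_credits value is an illustrative assumption:

    # /etc/modprobe.d/ko2iblnd.conf -- raise concurrent-send credits
    # (takes effect after the Lustre/LNET modules are reloaded)
    options ko2iblnd credits=256 peer_credits=16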

Page 15: Corralling Big Data at TACC

Numbers So Far (Aggregate)

• Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients

• Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, file-per-process, with 8 tasks per node; an example invocation follows below)

• Saturation point for Stampede clients: 65GB/s

• N.B. credits=64 on Stampede client nodes
– A quick test on an interactive 2.1.x node with a higher credit count gives the expected boost
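
For reference, an IOR run matching the setup described above (POSIX API, file per process, 256 nodes x 8 tasks) would look roughly like this; the block and transfer sizes are assumptions, not the actual test parameters:

    # 2048 MPI tasks total; launch mechanics are site-specific
    mpirun -np 2048 ior -a POSIX -F -w -r -b 4g -t 1m -o /stockyard/scratch/ior.out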

Page 16: Corralling Big Data at TACC

Numbers So Far (Failover Tests)

• OSS failover test setup and results

• Procedure:
– Identify the OSTs for the test pair
– Initiate dd processes targeted at the particular OSTs, each about 67GB in size so that they do not finish before the failover
– Interrupt one of the OSS servers with a shutdown using ipmitool
– Record the individual dd process outputs as well as the server- and client-side Lustre messages
– Compare and confirm the recovery and operation of the failover pair with 21 OSTs

• All I/O completes within 2 minutes of failover (a sketch of the test commands follows below)
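
A rough sketch of the key commands in this procedure; the OST index, file path, and BMC address are hypothetical placeholders:

    # Pin a file to one specific OST (index 42 here) so the dd stream targets it
    lfs setstripe -c 1 -i 42 /stockyard/test/ost42_file
    dd if=/dev/zero of=/stockyard/test/ost42_file bs=1M count=64000 &   # ~67GB

    # Power off the OSS under test via its BMC to trigger the failover
    ipmitool -I lanplus -H oss01-bmc -U admin -P secret chassis power off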

Page 17: Corralling Big Data at TACC

Failover Testing (cont’d)

• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:

    Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@[email protected]@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
    Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
    Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
    Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)

Page 18: Corralling Big Data at TACC

Automated Failover

• The tests used an artificial setup to simplify tracking of I/O completion on the clients; the shutdown and failover mounts were done manually

• Corosync and Pacemaker are being set up to automate the process (a sketch follows below)
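
A common pattern for this automation is a Pacemaker Filesystem resource per OST that either node of the failover pair can mount. A minimal sketch using the crm shell; the device path, mount point, and node name are hypothetical:

    # Define one OST mount as a cluster resource (hypothetical device/path)
    crm configure primitive stockyard-OST0015 ocf:heartbeat:Filesystem \
        params device=/dev/mapper/ost0015 directory=/mnt/ost0015 fstype=lustre \
        op monitor interval=120s timeout=60s

    # Prefer the primary OSS of the pair; Pacemaker moves the mount on failure
    crm configure location OST0015-prefers-oss01 stockyard-OST0015 100: oss01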

Page 19: Corralling Big Data at TACC

Routed Clients

• We monitor the routerstat output on the attached routers and the differences between two timestamps, focusing on the even distribution of request streams

• Contrary to the expectation that "autodown" may suffice, Lustre clients need "check_routers_before_use=1" for automatic updates of router status (see the sketch below)
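
Both knobs are lnet module parameters set on the client nodes; a minimal sketch of the modprobe configuration (the file path is the usual convention, not taken from the slides):

    # /etc/modprobe.d/lustre.conf on client nodes
    # auto_down marks failed routers down automatically; check_routers_before_use
    # pings routers before first use so their status is actually refreshed
    options lnet auto_down=1 check_routers_before_use=1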

Page 20: Corralling Big Data at TACC

Routed Clients (cont’d)

• Even with automatic router checks, clients cannot detect all non-functional routers: a router that was alive only on the client side will still be assumed active by the clients

• Clients then encounter timeouts due to the non-functional routers

• Resolution: separate router checks on the router nodes were added

Page 21: Corralling Big Data at TACC

Stockyard: Looking Ahead

• Deploy as a global $WORK space for TACC resources, which will push the number of clients up to include all TACC resources

• Evaluation of Lustre 2.5.0 before full production, for HSM functionality and compatibility with SAM-FS on Ranch

• Quota management (configured differently on 2.4+; see the sketch after this list)

• Integrated monitoring setup

• Security evaluation
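
On Lustre 2.4+, space accounting is always on and quota enforcement is toggled through the MGS rather than with quotaon. A minimal sketch, with the filesystem name, user, and limits as hypothetical values:

    # On the MGS: enable user/group quota enforcement for the filesystem
    lctl conf_param stockyard.quota.ost=ug
    lctl conf_param stockyard.quota.mdt=ug

    # From any client: set a user's block limits (values are in KB here,
    # roughly a 1TB soft and 1.2TB hard limit)
    lfs setquota -u jdoe -b 1000000000 -B 1200000000 /stockyard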

Page 22: Corralling Big Data at TACC

Summary

• Storage capacity and performance needs are growing at an exponential rate

• High-performance, reliable filesystems are critical for HPC productivity

• The benefits of large parallel filesystems outweigh the system administration overhead

• The current best solution for cost, performance and scalability is a Lustre-based filesystem