Active Storage and Its Applications
Jarek Nieplocha, Juan Piernas-Canovas
Pacific Northwest National Laboratory
2007 Scientific Data Management All Hands Meeting, Snoqualmie, WA
Outline
Description of the Active Storage Concept
New Implementation of Active Storage
Programming Framework
Examples and Applications
Active Storage in Parallel Filesystems
Active Storage exploits the long-standing concept of moving computation to the data source to avoid data-transfer penalties
– applications use compute resources on the storage nodes
Storage nodes are full-fledged computers with plenty of CPU power, standard operating systems, and commodity processors
[Figure: traditional approach vs. Active Storage. Traditional: compute nodes read X from the I/O (file system) nodes over the network, compute Y = foo(X), and write Y back. Active Storage: Y = foo(X) runs directly on the I/O nodes, so X and Y never cross the network.]
Example
BLAS DSCAL on disk: Y = α·Y
Experiment
– Traditional: the input file is read from the file system, and the output file is written to the same file system. The input file has 120,586,240 doubles.
– Active Storage: each server receives the factor α, reads its array of doubles locally from disk, and stores the resulting array on the same disk. Each server processes 120,586,240/N doubles, where N is the number of servers.
Speedup is attributed to using multiple OSTs and to avoiding data movement between client and servers (no network bottleneck)
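The per-server step is simple enough to sketch. This is a minimal illustration, assuming each server sees its portion of the array as a plain local file of little-endian doubles; the file layout and function name are assumptions, not the actual implementation:

```python
import struct

def dscal_local(alpha, in_path, out_path, block_doubles=4096):
    """Scale every double in a local input file by alpha (Y = alpha * Y),
    streaming block by block so the whole array never sits in memory."""
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        while True:
            buf = fin.read(8 * block_doubles)
            if not buf:
                break
            n = len(buf) // 8
            vals = struct.unpack("<%dd" % n, buf)
            fout.write(struct.pack("<%dd" % n, *(alpha * v for v in vals)))
```

Because each server touches only its own file, N such components run with no coordination, which is where the near-linear speedup comes from.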
[Chart: DSCAL speed-up (0-35) vs. number of OSTs (0-40).]
Related Work
The Active Disk/Storage concept was introduced a decade ago to use processing resources ‘near’ the disk
– on the disk controller, or on processors connected to disks
– to reduce network bandwidth/latency limitations
References
– DiskOS stream-based model (ASPLOS’98: Acharya, Uysal, Saltz)
– Active Storage for Large-Scale Data Mining and Multimedia (VLDB’98: Riedel, Gibson, Faloutsos)
Research proved the Active Disk idea interesting, but
– it was difficult to take advantage of in practice
– processors in disk controllers were not designed for the purpose
– vendors have not been providing SDKs
Lustre Architecture
[Figure: Lustre architecture. O(10000) clients connect over the network to O(10) MDSs and O(1000) OSTs. Clients and MDSs exchange directory metadata and concurrency information (recovery, file status, file creation); clients and OSTs handle file I/O and locking.]
Active Storage in Kernel Space
When the client writes to file A, ASOBD makes a copy of the data and sends it to ASDEV
The processing component (PC) reads from and writes to the char device
Original data ends up in A, processed data in B
[Figure: kernel-space Active Storage module. In kernel space, the write path NAL → OST → OBDfilter → ldiskfs → disk is extended with ASOBD and ASDEV; the processing component runs in user space, exchanging data with ASDEV through a char device, and produces file B alongside A.]
Active Storage Application: High Throughput Proteomics
9.4 Tesla high-throughput mass spectrometer
– 1 experiment per hour, 5000 spectra per experiment, 4 MB per spectrum
– Per instrument: 20 GB per hour, 480 GB per day
Application problem
– Given two float inputs, a target mass and a tolerance, find all the possible protein sequences that fit into the specified range
Active Storage solution
– Each OST receives the float pair sent by the client, processes its part, and stores the resulting output in its Lustre OBD (object-based disk)
[Chart: completion time in seconds (0-2000) vs. output size in GB (3.55-56.8) for the proteomics application, comparing NoAS and AS.]
Next-generation technology will increase data rates by 200x
SC’2004 StorCloud Most Innovative Use Award
Sustained 4 GB/s Active Storage write processing
Proteomics application
320 TB Lustre, 984 400-GB disks
• 40 Lustre OSSs running Active Storage
• 4 logical disks per OSS (160 OSTs total)
• 2 Xeon processors per OSS
• 1 MDS
• 1 client creating files
[Figure: client system connected through a Gigabit network to Lustre OSS 0-39 (each serving Lustre OSTs) and the Lustre MDS.]
Active Storage in User Space
Problems with the kernel-space implementation
– Portability, maintenance, extra memory copies
We developed a user-space implementation
– Most file systems allow the storage nodes to be clients
– Most file systems allow creating files with a given layout
– Our framework launches processing components on the storage nodes that hold the files to be processed
– Processing components read from and write to local files
Highly portable implementation
– Used with Lustre 1.6, PVFS2 2.7
– Bug in Lustre 1.4 (and SFS): frequent kernel crashes when mounting the file system on the storage nodes
– Held initial discussions with IBM on a GPFS port
Active Storage in User Space
[Figure: user-space Active Storage architecture. Compute nodes (parallel file system clients) reach the metadata server and storage nodes 0..N-1 (the file system's components, which are also clients of the file system) over the network interconnect. The Active Storage Runtime Framework (ASRF) runs a processing component on each storage node, coordinated by asmaster on a compute node; data I/O traffic stays local to each storage node.]
Performance Evaluation
AMINOGEN Bioinformatics Application
Input file: ASCII file with mass and tolerance pairs, one per line; total size = 44 bytes
Output file: binary file containing amino acid sequences; total size = 14.2 GB
[Chart: overall execution time.]
Enhanced Implementation of Active Storage for Striped Files
Striped files are broadly used for performance, but were not supported by earlier AS work
Enhanced implementation
– Uses striping data from the file system
– New component: AS Mapper
– Locality awareness in the processing component: compute on local chunks
Climate application with netCDF
– Computes statistics of key variables from a Global Cloud Resolving simulation (U. Colorado)
– Eliminated >95% of network traffic
[Figure: the processing component's read and write calls go through LIBAS rather than plain GLIBC. LIBAS presents the striped file as contiguous while actually reading and writing only the chunks stored locally (e.g. chunks 2, 6, 10, 14, 18, ...).]
Examples and Applications
Juan Piernas-Canovas
Active Storage in DSCAL Example
[Figure: DSCAL example deployment. Compute nodes mounting /lustre (the file system's clients) and the MDS & MGS connect over the network interconnect to OST31 and OST43. asmaster launches a dscal processing component on each OST: OST43 holds doubles.15 and produces doubles.15.out, OST31 holds doubles.20 and produces doubles.20.out; data I/O traffic stays local.]
Non-Striped Files
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal</path>
<arguments>12345.67890 @ @.out</arguments>
</program>
</rule>

/lustre/doubles.15 in OST43
/lustre/doubles.15.out in OST43 (new file)
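The rule's behavior can be mimicked in a few lines: the <pattern> element is a shell-style glob, and @ in <arguments> stands for the matched file path. This sketch infers the expansion semantics from the examples in these slides, not from the framework's source:

```python
import fnmatch

def expand_args(arg_template, matched_path):
    """Expand a rule's <arguments> string for one matched file:
    every '@' is replaced by the matched path."""
    return [a.replace("@", matched_path) for a in arg_template.split()]

# /lustre/doubles.15 matches the rule's pattern, so dscal would be run with:
if fnmatch.fnmatch("/lustre/doubles.15", "/lustre/doubles.*"):
    argv = expand_args("12345.67890 @ @.out", "/lustre/doubles.15")
    # argv: ["12345.67890", "/lustre/doubles.15", "/lustre/doubles.15.out"]
```

This is why the output file lands next to the input on the same OST: the expanded paths are local to the node where the rule fires.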
Climate Application
Collaboration with SciDAC GCRM SAP (Karen)
Problem: compute averages for variables generated by a scientific simulation
– Stored in striped output files
– Geodesic grid
– netCDF data format
Objective: Optimize performance by exploiting data locality in AS Processing Components to minimize network traffic
Non-Striped Files
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal</path>
<arguments>12345.67890 @ @.out</arguments>
</program>
</rule>

/lustre/doubles.20 in OST31
/lustre/doubles.20.out in OST31 (new file)

Execution: /lustre/asd/asmaster /lustre/dscal.xml
Processing Patterns
In user space, it is easy to support different processing patterns:
[Figure: two patterns. In 1W0, the client data stream feeds the Active Storage processing component (PC), which writes no output file; in 1W#W, the PC writes several output files.]
No Output File (Pattern 1W0)
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal1</path>
<arguments>12345.67890 @</arguments>
</program>
</rule>

/lustre/doubles.15 in OST43
Several Output Files (Pattern 1W#W)
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal3</path>
<arguments>12345.67890 @ @.out @.err</arguments>
</program>
</rule>

/lustre/doubles.15 in OST43
/lustre/doubles.15.out in OST43 (new file)
/lustre/doubles.15.err in OST43 (new file)
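A 1W#W processing component simply opens every path it is handed. A minimal sketch of a dscal3-style component follows; its real behavior is an assumption (here the scaled doubles go to the .out file and a one-line summary to the .err file, using native-endian doubles):

```python
from array import array

def dscal3(alpha, in_path, out_path, err_path):
    """1W#W sketch: one input file, two output files on the same node."""
    data = array("d")
    with open(in_path, "rb") as fin:
        data.frombytes(fin.read())           # load the local doubles
    scaled = array("d", (alpha * v for v in data))
    with open(out_path, "wb") as fout:
        scaled.tofile(fout)                  # first output: scaled array
    with open(err_path, "w") as ferr:        # second output: summary log
        ferr.write("%d doubles scaled by %g\n" % (len(data), alpha))
```

The framework only needs to expand @.out and @.err in <arguments>; the component itself decides what each output stream contains.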
Transparent Access to Striped Files
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal</path>
<arguments>12345.67890 @{hidechunks} @{copystriping,hidechunks}.out</arguments>
</program>
</rule>
@{hidechunks}: transparent access to the chunks of the input file
@{copystriping,hidechunks}.out: transparent access to the chunks of the output file, a new file with the same striping as the input file
Mapper and Striped netCDF Files
[Figure: a striped netCDF file (header followed by variable data) spread across storage nodes 0..N-1. The Mapper tells each node's ASRF processing component which of its local chunks contain the requested variable's data, so each PC reads only local chunks; asmaster coordinates over the network interconnect and data I/O traffic stays local.]
Processing of netCDF Files
<?xml version="1.0"?>
<rule>
<stdfiles>
<stdout>@.out-${NODENAME}</stdout>
</stdfiles>
<match>
<pattern>/lustre/data.*</pattern>
</match>
<program>
<path arch="any">/lustre/processnetcdf.py</path>
<arguments>@ ta</arguments>
</program>
<mapper>
<path arch="any">/lustre/netcdfmapper.py</path>
<arguments>@ ta ${CHUNKNUM} ${CHUNKSIZE}</arguments>
</mapper>
</rule>
${CHUNKNUM} and ${CHUNKSIZE}: striping information of /lustre/data.37
ta: variable name in the netCDF file
@: the matched file, /lustre/data.37
Each node writes a non-striped output file, e.g. /lustre/data.37.out-ost43
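Conceptually the mapper answers one question per chunk: does chunk ${CHUNKNUM} of the striped file hold any bytes of the requested variable? A sketch of that test is below; the variable's byte extent would come from parsing the netCDF header, and netcdfmapper.py's actual logic is not shown in these slides:

```python
def variable_bytes_in_chunk(var_offset, var_size, chunk_num, chunk_size):
    """Return the (start, end) byte range of the variable that falls inside
    chunk chunk_num of the file, or None if the chunk holds none of it."""
    chunk_start = chunk_num * chunk_size
    chunk_end = chunk_start + chunk_size
    start = max(var_offset, chunk_start)
    end = min(var_offset + var_size, chunk_end)
    return (start, end) if start < end else None
```

A processing component that applies this test to its local chunk numbers reads exactly the variable data it owns and nothing over the network.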
PVFS2 Support
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal</path>
<arguments>12345.67890 @{hidechunks} @{copystriping,hidechunks}.out</arguments>
</program>
<filesystem>
<type>pvfs</type>
<mntpoint>/pvfs2</mntpoint>
</filesystem>
</rule>
Local File System with Virtual Striping
<?xml version="1.0"?>
<rule>
<match>
<pattern>/lustre/doubles.*</pattern>
</match>
<program>
<path arch="any">/lustre/dscal</path>
<arguments>12345.67890 @{hidechunks} @{copystriping,hidechunks}.out</arguments>
</program>
<filesystem>
<type>localfs</type>
<striping>8:1048576</striping>
</filesystem>
</rule>
Virtual striping: stripe count 8, stripe size 1 MB, emulated on a local file system
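With <striping>8:1048576</striping>, a plain local file is treated as if it were striped over 8 virtual servers with a 1 MB stripe size. The chunks "local" to virtual server i can then be enumerated directly; this sketch of the layout arithmetic assumes simple round-robin striping:

```python
STRIPE_COUNT = 8          # from <striping>8:1048576</striping>
STRIPE_SIZE = 1048576     # 1 MB stripe size

def local_chunks(server, file_size):
    """Yield (offset, length) for every chunk of the file that round-robin
    striping places on virtual server `server` (0 <= server < STRIPE_COUNT)."""
    offset = server * STRIPE_SIZE
    while offset < file_size:
        yield offset, min(STRIPE_SIZE, file_size - offset)
        offset += STRIPE_COUNT * STRIPE_SIZE
```

The same arithmetic lets the framework be tested on an ordinary Linux file system before deploying on Lustre or PVFS2.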
Further Information
Technical paper
– J. Piernas, J. Nieplocha, E. Felix, “Evaluation of Active Storage Strategies for the Lustre Parallel Filesystem”, Proc. SC’07
Website: http://hpc.pnl.gov/projects/active-storage
Upcoming release in December 2007
– Support for Lustre 1.6, PVFS2, and Linux local file systems
Source code available now upon request. Just send us an e-mail!
Jarek Nieplocha <[email protected]>
Juan Piernas-Canovas <[email protected]>
Questions?