IMCa: A High-Performance Caching Front-end for GlusterFS on InfiniBand
Ranjit Noronha and Dhabaleswar K. Panda
Network Based Computing Lab
The Ohio State University
{noronha, panda}@cse.ohio-state.edu
Outline of the Talk
• Background and Motivation
• Architecture and Design of IMCa
• Experimental Evaluation of IMCa
• Conclusions and Future Work
Background
• Large Scale Scientific and Commercial Workloads
• Petascale Computers have arrived
• High-performance access to I/O data is crucial – parallel applications are often limited by I/O
• Clustered/Parallel File Systems have evolved to meet this challenge
• File System performance still dependent on disk performance
• Single Server Bandwidth Drop With Multiple Clients
• Parallel I/O Bandwidth From Multiple Servers
[Figure: Bandwidth (MegaBytes/s) vs. number of clients (1-8) for RDMA, IPoIB, and GigE; one panel with 4GB server memory, one with 8GB server memory]
• Performance for Small Files
– Generally difficult to achieve
– Many environments with a large number of small files
– Storing on the same disk block provides limited benefit
– Striping does not provide benefit
– Store on different servers
• Cache Coherency Problems
– Client-side cache provides good performance
– Non-coherent client cache limited when there is sharing
– Limited scalability of coherent caches
• Server Load Problems
– RDMA reduces overhead from TCP/IP
– RDMA-based transport protocols cannot reduce copying costs within the file system
Problem Statement
• Which file-system operations are potential targets for caching?
• What are the alternatives to the traditional client cache/server cache architecture?
• What are the advantages and disadvantages of alternate cache architectures?
• How do we provide the performance of the non-coherent client cache without the scalability problems of the coherent client cache?
Outline of the Talk
• Background and Motivation
• Architecture and Design of IMCa
• Experimental Evaluation of IMCa
• Conclusions and Future Work
Potential File System Operations That May Be Cached
• Potential Targets For Caching
– Should be something the client reads
– Should be possible to uniquely identify the cache target
– Should be possible to chunk the data element
• Small Operations: Stat, Create, Delete, Open
• Stat
– Read by the client
– Used as a form of update by many applications
– Should be cached (see the sketch after this slide)
– Should be updated on read/write operations on the server
• Create/Delete
– Not read by the client
– Delete should invalidate previous cache entries
• File Open
– Not a target for caching, but may be used for prefetching
• Data Transfer Operations – Reads and Writes
– Blocks needed
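To make the stat discussion concrete, here is a minimal sketch (in Python, with hypothetical names not taken from the paper) of the policy above: stat results are cached when a client reads them, refreshed when the server performs a read/write on the file, and invalidated on delete.

```python
import os

# Hypothetical sketch of the stat-caching policy described above; the
# class and method names are illustrative, not the paper's code.
class StatCache:
    def __init__(self):
        self._entries = {}                 # path -> cached os.stat_result

    def stat(self, path):
        if path in self._entries:
            return self._entries[path]     # cache hit
        result = os.stat(path)             # miss: fall through to the FS
        self._entries[path] = result
        return result

    def on_read_write(self, path):
        # Server-side read/write: refresh the cached attributes.
        self._entries[path] = os.stat(path)

    def on_delete(self, path):
        # Delete invalidates any previous cache entry for the file.
        self._entries.pop(path, None)
```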
Intermediate Cache Architecture (IMCa)
• Easy to maintain coherency
• Extensible
• Can multiple cache nodes provide benefit?
[Figure: IMCa architecture – the FS client runs a CMCache and the file server runs an SMCache above the underlying FS cache and ext3; between them is an array of cache nodes Cache1 ... Cachen, each cache being a node (MCD array); both CMCache and SMCache use a CRC32 hash function to find the cache server for a given item]
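The figure shows CMCache and SMCache both locating data with a CRC32 hash over the MCD array. A minimal sketch of that mapping, assuming keys of the form path:block_index (the key format and function name are assumptions, not from the paper):

```python
import zlib

def cache_server_for(path, block_index, num_servers):
    """Pick the MCD node holding one block via a CRC32 hash of its key."""
    key = f"{path}:{block_index}".encode()
    return zlib.crc32(key) % num_servers

# Every CMCache/SMCache instance computes the same mapping, so with
# 4 cache servers block 7 of /data/a.out always lands on the same node.
server_id = cache_server_for("/data/a.out", 7, num_servers=4)
```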
Need for Blocks In IMCa
• Most file systems store data on disk as blocks
• Parallel file-systems stripe data across multiple servers
• IMCa uses a fixed block size to store data across the cache servers (see the sketch after this slide)
– Block size should provide good performance for most small files
– Should avoid excessive fragmentation
[Figure: requested data overlaid on data block boundaries – the file data is segmented by the IMCa block size, so an unaligned request includes extra data at the block edges]
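A sketch of the block-boundary arithmetic implied by the figure: a byte range is mapped onto fixed-size IMCa blocks, and an unaligned request pulls in extra data at the edges. The 2 KB default matches the block size used later in the evaluation; the function itself is illustrative, not the paper's code.

```python
def blocks_for_request(offset, length, block_size=2048):
    """Map a byte range onto fixed-size IMCa blocks (illustrative sketch).

    Returns the block indices covered plus the extra bytes fetched before
    and after the requested range because the request is not block-aligned.
    """
    first = offset // block_size
    last = (offset + length - 1) // block_size
    extra_before = offset - first * block_size
    extra_after = (last + 1) * block_size - (offset + length)
    return list(range(first, last + 1)), extra_before, extra_after

# A 3000-byte read at offset 1000 touches blocks 0 and 1 and drags in
# 1000 extra bytes before and 96 extra bytes after the requested range.
print(blocks_for_request(1000, 3000))
```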
Design-Read Operations (Hit)
[Figure: read hit – the client and its CMCache, the MCD array (Cache1 ... Cachen), and the server-side SMCache, underlying FS cache, and ext3; on a hit the block is returned from the MCD array]
Design-Read Operations (Miss)
[Figure: read miss – the request misses in the MCD array and is forwarded to the server, where SMCache serves the block from the underlying FS cache/ext3]
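A hypothetical sketch of the read path the two figures imply, reusing cache_server_for from the architecture sketch. The mcd_nodes objects stand in for memcached get/set and server.read_block stands in for the GlusterFS server; none of these names come from the paper.

```python
def read_block(path, block_index, mcd_nodes, server):
    """Illustrative read path: try the MCD array first, then the server."""
    node = mcd_nodes[cache_server_for(path, block_index, len(mcd_nodes))]
    key = f"{path}:{block_index}"
    data = node.get(key)
    if data is not None:
        return data                     # hit: served from the MCD array
    # Miss: the block comes from the server (underlying FS cache / ext3)
    # and is installed into the cache so later reads can hit.
    data = server.read_block(path, block_index)
    node.set(key, data)
    return data
```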
Design-Write Operations
[Figure: write path – the client and its CMCache, the MCD array (Cache1 ... Cachen), and the server-side SMCache, underlying FS cache, and ext3]
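A matching sketch for the write path. The slide does not spell out whether the cached block is rewritten or invalidated on a write, so this shows the write-through update variant; the interfaces are the same hypothetical stand-ins used in the read sketch.

```python
def write_block(path, block_index, data, mcd_nodes, server):
    """Illustrative write path: write through to the server, then refresh
    the cached copy (invalidating it instead would be the other option)."""
    server.write_block(path, block_index, data)      # durability first
    node = mcd_nodes[cache_server_for(path, block_index, len(mcd_nodes))]
    node.set(f"{path}:{block_index}", data)
```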
Advantages/Disadvantages of IMCa
• Fewer requests hit the server
• Latency for requests read from the cache is lower
• MCDs are self-managing
• Failures in MCDs do not impact correctness
• Additional node elements needed especially for caching
• Cold misses are expensive
• Additional blocks/data transfers needed
• Overhead and delayed updates
Outline of the Talk
• Background and Motivation
• Architecture and Design of IMCa
• Experimental Evaluation of IMCa
• Conclusions and Future Work
GlusterFS File System
• Clustered File System
• Client and Server in userspace
• Use FUSE interface to translate FS calls from the kernel to the user daemons
• No striping; data is distributed across servers
• Possible to apply translators at the server and client to perform different functions
• WWW.glusterfs.org
Experimental Setup
• 64-node cluster
– 8-core Intel Clovertown
– 8 GB memory
• InfiniBand DDR is the interconnect
• GlusterFS file-system
• The data servers each have 8 HighPoint RAID disks
• Communication protocol is IPoIB in Reliable Connected (RC) mode
• MCDs run on independent nodes and use up to 6 GB of memory
• CMCache and SMCache use a CRC32 hash function for locating data on the MCDs
• Lustre 1.6.4.3 is used with socklnd for comparison
Experiment-stat
– Consists of two stages
– First stage (untimed)
• 262144 files created by a single node
– Second stage (timed)
• Each node performs a stat on each of the 262144 files sequentially
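The two stages of the stat benchmark, sketched in Python for concreteness (the mount point and file naming are assumptions):

```python
import os
import time

NUM_FILES = 262144
DIR = "/mnt/glusterfs/statbench"        # hypothetical mount point

def create_stage():
    # Untimed first stage: one node creates all the files.
    for i in range(NUM_FILES):
        open(f"{DIR}/f{i}", "w").close()

def stat_stage():
    # Timed second stage: each node stats every file sequentially.
    start = time.time()
    for i in range(NUM_FILES):
        os.stat(f"{DIR}/f{i}")
    return time.time() - start
```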
Stat performance
• Time to stat 262144 different files
• Benchmark has two phases: create (untimed), followed by stat (timed)
• 82% improvement at 64 nodes
[Figure: time (seconds) to complete the stat phase vs. number of nodes (1-8 and 16-64) for No Cache, 1/2/4/6 Cache Servers, and Lustre-4DS]
Experiment-Write Single Client
– One Client
– Writes 1,024 records of size r sequentially to the file
– Measure time for this to complete
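A sketch of the single-client write test (file path and helper name are assumptions): 1,024 records of size r are written sequentially, and the elapsed time is converted to a per-record latency, with r swept from 1 byte to 32 KB as in the plot on the next slide.

```python
import time

def timed_write(path, record_size, num_records=1024):
    """Write num_records records of record_size bytes sequentially."""
    buf = b"x" * record_size
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(num_records):
            f.write(buf)
    return time.time() - start

# Per-record write latency in microseconds for each record size.
for r in [2 ** k for k in range(16)]:          # 1 byte .. 32768 bytes
    usec = timed_write("/mnt/glusterfs/wbench", r) / 1024 * 1e6
```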
Write – Single Client
[Figure: write latency (microseconds) vs. I/O record size (1 byte to 32768 bytes) for No Cache, IMCa (2K), and IMCa (Server Threads)]
• 2KB block size
• Server thread helps performance
Experiments-Read
• Single Client Read
– Follows the Write component of the benchmark
– Move file pointer to the beginning of the file
– Read 1,024 records of size r sequentially from the file
– Measure time for this to complete
• Multiple Client Read
– Each client uses a separate file
• Multiple Client Read Shared
– Same file used by every client
• Lustre configurations
– Cold Client Cache: unmount between Write and Read
– Warm Client Cache: no unmount between Write and Read
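The read component simply re-reads the file written above from the beginning; a sketch of the single-client case (the multi-client variants only change whether each client opens its own file or the shared one):

```python
import time

def timed_read(path, record_size, num_records=1024):
    """Re-read the records written earlier, sequentially from offset 0."""
    start = time.time()
    with open(path, "rb") as f:
        f.seek(0)                  # move the file pointer to the beginning
        for _ in range(num_records):
            f.read(record_size)
    return time.time() - start
```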
Read Latency (Single Client)
[Figure: single-client read latency (microseconds) vs. record size (1 byte to 65536 bytes) for No Cache, Cache (256), Cache (2K), Cache (8K), Lustre-1DS (Cold), Lustre-4DS (Cold), and Lustre-4DS (Warm)]
• Lustre shows best latency
• Cache provides benefit for small message sizes
Read Multiple Client (32 clients)
[Figure: read latency (microseconds) with 32 clients vs. record size (1 byte to 65536 bytes) for NoCache, IMCa (1), IMCa (2), IMCa (4), Lustre (Cold), and Lustre (Warm)]
• 51% improvement in latency at 16K
• Multiple MCDs help reduce capacity misses
Iozone throughput
[Figure: IOzone throughput (MegaBytes/second) with 1, 2, 4, and 8 threads for No Cache, Cache (1), Cache (2), Cache (4), and Lustre-1DS (Cold)]
• 1, 2, 4, 8 IOzone threads, 1GB files, 2KB block size
• 325 MB/s (NoCache) -> 868 MB/s (4 MCDs)
Read-Shared Latency
[Figure: shared-file read latency (microseconds) vs. number of nodes (2-32) for No Cache, Lustre-1DS (Cold), and MCD (1)]
• IMCa helps improve performance over the NoCache case
Outline of the Talk
• Background and Motivation
• Architecture and Design of IMCa
• Experimental Evaluation of IMCa
• Conclusions and Future Work
Conclusions and Future Work
• Proposed, designed, and evaluated an intermediate cache for GlusterFS
• Good improvement in stat performance
• Improvement in latency/throughput of read operations
– Depends on block size
• Would like to evaluate the performance with RDMA
• Would like to evaluate distribution algorithms
Acknowledgements
Our research is supported by the following organizations
Thank you
{noronha, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/