
A Study of Caching in Parallel File Systems

Dissertation Proposal

Brad Settlemyer

2

Trends in Scientific Research

• Scientific inquiry is now information intensive
  – Astronomy, Biology, Chemistry, Climatology, Particle Physics – all utilize massive data sets
• Data sets under study are often very large
  – Genomics databases (50 TB and growing)
  – Large Hadron Collider (15 PB/yr)
• Time spent manipulating data often exceeds time spent performing calculations
  – Checkpointing I/O demands are particularly problematic

3

Typical Scientific Workflow

1. Acquire data
   • Observational data (sensor-based, telescope, etc.)
   • Information data (gene sequences, protein folding)
2. Stage/reorganize data to a fast file system
   • Archive retrieval
   • Filtering extraneous data
3. Process data (e.g. feature extraction)
4. Output results data
5. Reorganize data for visualization
6. Visualize data

4

Trends in Supercomputing

• CPU performance is increasing faster than disk performance
  – Multicore CPUs and increased intra-node parallelism
• Main memories are large
  – 4 GB costs < $100.00
• Networks are fast and wide
  – >10 Gb networks and buses available
• The number of application processes is increasing rapidly
  – RoadRunner: >128K concurrent processes achieving >1 petaflop
  – BlueGene/P: >250K concurrent processes achieving >1 petaflop

5

I/O Bottleneck

• Application processes are able to construct I/O requests faster than the storage system can provide service

• Applications are unable to fully utilize the massive amounts of available computing power

6

Parallel File Systems

• Addresses the I/O bottleneck by providing simultaneous access to a large number of disks

[Figure: four compute processes (Process 0–3) on CPU nodes connected through a switched network to four I/O nodes running PFS Servers 0–3]

7

PFS Data Distribution

[Figure: logical file data divided into Strips A–F and distributed round-robin across PFS Servers 0–3, so each server holds a subset of the strips (e.g. Strips A and E on one server, Strips B and F on another)]
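The distribution pictured above is simple round-robin striping. A minimal sketch of the mapping arithmetic, assuming a fixed strip size and four servers (both values are illustrative, not PVFS defaults):

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of the round-robin striping pictured above.
 * The strip size and server count are illustrative, not PVFS defaults. */
#define STRIP_SIZE  65536ULL   /* bytes per strip (assumed) */
#define NUM_SERVERS 4          /* PFS Server 0..3           */

/* Map a logical file offset to (server, strip index, offset within strip). */
static void map_offset(uint64_t offset, int *server,
                       uint64_t *strip, uint64_t *strip_off)
{
    *strip     = offset / STRIP_SIZE;          /* which strip of the file   */
    *server    = (int)(*strip % NUM_SERVERS);  /* round-robin over servers  */
    *strip_off = offset % STRIP_SIZE;          /* position inside the strip */
}

int main(void)
{
    uint64_t offsets[] = { 0, 70000, 300000 };
    for (int i = 0; i < 3; i++) {
        int server;
        uint64_t strip, off;
        map_offset(offsets[i], &server, &strip, &off);
        printf("offset %8llu -> server %d, strip %llu, strip offset %llu\n",
               (unsigned long long)offsets[i], server,
               (unsigned long long)strip, (unsigned long long)off);
    }
    return 0;
}
```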

8

Parallel File Systems (cont.)

• Aggregate file system bandwidth requirements largely met
  – Large, aligned data requests can be rapidly transferred
  – Scalable to hundreds of client processes and improving
• Areas of inadequate performance
  – Metadata operations (create, remove, stat)
  – Small files
  – Unaligned accesses
  – Structured I/O

9

Scientific Workflow Performance

1. Acquire or simulate data
   • Primarily limited by physical bandwidth characteristics
2. Move or reorganize data for processing
   • Often metadata intensive
3. Data analysis or reconstruction
   • Small, unaligned accesses perform poorly
4. Move/reorganize data for visualization
   • May perform poorly (small, unaligned accesses)
5. Visualize data
   • Benefits from reorganization

10

Alleviating the I/O bottleneck

• Avoid data reorganization costs
  – Additional work that does not modify results
  – Limits use of high-level libraries
• Increase contiguity/granularity
  – Interconnects and parallel file systems are well tuned for large contiguous file accesses
  – Limits use of the low-latency messaging available between cores
• Improve locality
  – Avoid device accesses entirely
  – Difficult to achieve in user applications

11

Benefits of Middleware Caching

• Improves locality
  – PVFS Acache and Ncache
  – Improves write-read and read-read accesses
• Small accesses
  – Can bundle small accesses into a compound operation (see the sketch below)
• Alignment
  – Can compress accesses by performing aligned requests
• Transparent to the application programmer
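As a rough illustration of the bundling idea above, here is a minimal sketch (not the proposal's middleware) that coalesces adjacent or overlapping (offset, length) requests before they reach the file system; the request structure and names are assumptions:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of bundling small accesses: adjacent or overlapping (offset, length)
 * requests are coalesced into one larger request before hitting the PFS. */
struct req { uint64_t off, len; };

/* Coalesce a request list sorted by offset, in place; returns the new count. */
size_t coalesce(struct req *r, size_t n)
{
    if (n == 0) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].off <= r[out].off + r[out].len) {   /* touches or overlaps */
            uint64_t end = r[i].off + r[i].len;
            if (end > r[out].off + r[out].len)
                r[out].len = end - r[out].off;       /* grow current request */
        } else {
            r[++out] = r[i];                         /* start a new request  */
        }
    }
    return out + 1;
}

int main(void)
{
    struct req reqs[] = { {0, 4}, {4, 8}, {16, 4}, {20, 4} };
    size_t n = coalesce(reqs, 4);                    /* yields {0,12},{16,8} */
    for (size_t i = 0; i < n; i++)
        printf("request: offset=%llu length=%llu\n",
               (unsigned long long)reqs[i].off, (unsigned long long)reqs[i].len);
    return 0;
}
```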

12

Proposed Caching Techniques

In order to improve the performance of small and unaligned file accesses, we propose middleware designed to enhance parallel file systems with the following:

1. Shared, Concurrent Access Caching

2. Progressive Page Granularity Caching

3. MPI File View Caching

13

Shared Caching

• Single data cache per node
  – Leverages the trend toward large numbers of cores
  – Improves contiguity of alternating request patterns
• Concurrent access
  – Single reader/writer
  – Page locking system (see the sketch below)
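A minimal sketch of the shared-cache idea, assuming a pthreads-style reader/writer lock per page; the sizes, the direct-mapped placement, and the function names are illustrative, not the proposal's actual design:

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* One cache shared by every process/thread on a node, with a single
 * reader/writer lock per page. */
#define PAGE_SIZE 4096
#define NUM_PAGES 1024

struct cache_page {
    pthread_rwlock_t lock;       /* single writer or many readers per page */
    uint64_t         file_page;  /* which logical file page is cached      */
    int              valid;
    char             data[PAGE_SIZE];
};

static struct cache_page cache[NUM_PAGES];

void cache_init(void)
{
    for (int i = 0; i < NUM_PAGES; i++)
        pthread_rwlock_init(&cache[i].lock, NULL);
}

/* Copy bytes out of a cached page if present; returns 1 on a hit. */
int cache_read(uint64_t file_page, size_t off, size_t len, char *out)
{
    struct cache_page *p = &cache[file_page % NUM_PAGES];
    int hit = 0;

    pthread_rwlock_rdlock(&p->lock);                /* shared (reader) lock */
    if (p->valid && p->file_page == file_page && off + len <= PAGE_SIZE) {
        memcpy(out, p->data + off, len);
        hit = 1;
    }
    pthread_rwlock_unlock(&p->lock);
    return hit;
}

/* Install bytes into a page under the exclusive (writer) lock.
 * Eviction and write-back to the PFS are omitted from the sketch. */
void cache_write(uint64_t file_page, size_t off, size_t len, const char *in)
{
    struct cache_page *p = &cache[file_page % NUM_PAGES];

    pthread_rwlock_wrlock(&p->lock);
    p->file_page = file_page;
    p->valid = 1;
    memcpy(p->data + off, in, len);
    pthread_rwlock_unlock(&p->lock);
}
```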

14

File Write Example

[Figure: animation (slides 14–18) showing Process 0 and Process 1 issuing interleaved I/O requests directly against the logical file]

19

File Write w/ Cache

[Figure: animation (slides 19–23) showing the same write requests from Process 0 and Process 1 staged through shared Cache Pages 0–2 before being written to the logical file]

24

Progressive Page Caching

• Benefits of paged caching
  – Efficient for the file system
  – Reduces cache metadata overhead
• Issues with paged caching
  – Aligned pages may retrieve more data than otherwise required
  – Unaligned writes do not cache easily (two conventional options, sketched below):
    • Read the remaining page fragment
    • Do not update the cache with small writes
• Progressive paged caching addresses these issues while minimizing performance and metadata overhead
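The two conventional options listed above can be sketched as follows; pfs_read_page() is a hypothetical stand-in for a real file-system read, and the page size is assumed:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096   /* assumed page size */

/* Stand-in for reading one whole page from the PFS; zero-fills the buffer
 * so the sketch stays self-contained. */
static void pfs_read_page(uint64_t page_no, char *buf)
{
    (void)page_no;
    memset(buf, 0, PAGE_SIZE);
}

/* Option 1: read-modify-write -- fetch the rest of the page, then overlay
 * the unaligned write so the whole page can be cached. */
static void unaligned_write_rmw(char *page_buf, uint64_t page_no,
                                size_t off, size_t len, const char *src)
{
    pfs_read_page(page_no, page_buf);   /* read the remaining page fragment */
    memcpy(page_buf + off, src, len);   /* overlay the newly written bytes  */
}

/* Option 2: write-around -- only page-aligned writes update the cache;
 * small unaligned writes bypass it. */
static int should_cache_write(size_t off, size_t len)
{
    return (off % PAGE_SIZE == 0) && (len % PAGE_SIZE == 0);
}

int main(void)
{
    char page[PAGE_SIZE];
    const char msg[] = "unaligned";
    if (!should_cache_write(100, sizeof msg))
        unaligned_write_rmw(page, 0, 100, sizeof msg, msg);
    return 0;
}
```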

25

Unaligned Access Caches

• Accesses are independent and not on page boundaries
• Requires increased cache overhead
• How to organize unaligned data
  – List I/O tree (extent bookkeeping sketched below)
  – Binary space partition (BSP) tree
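A minimal sketch of list-I/O-style bookkeeping, where cached data is tracked as (offset, length) extents kept sorted by offset; the structure and names are assumptions for illustration, not the proposal's data structure:

```c
#include <stdint.h>
#include <stdio.h>

/* Each cached region of the logical file is an (offset, length) extent. */
struct extent {
    uint64_t offset;        /* byte offset in the logical file */
    uint64_t length;        /* number of cached bytes          */
    char    *data;          /* cached bytes (unused here)      */
    struct extent *next;
};

/* Return the extent fully covering [offset, offset+len), or NULL on a miss. */
static struct extent *extent_lookup(struct extent *head,
                                    uint64_t offset, uint64_t len)
{
    for (struct extent *e = head; e != NULL; e = e->next) {
        if (e->offset <= offset && offset + len <= e->offset + e->length)
            return e;
        if (e->offset > offset)     /* sorted list: no later extent can match */
            return NULL;
    }
    return NULL;
}

int main(void)
{
    /* Extents matching the (offset, length) pairs in the figure below. */
    struct extent e4 = { 10, 2, NULL, NULL };
    struct extent e3 = {  5, 3, NULL, &e4 };
    struct extent e2 = {  2, 2, NULL, &e3 };
    struct extent e1 = {  0, 1, NULL, &e2 };

    printf("lookup [6,7): %s\n", extent_lookup(&e1, 6, 1) ? "hit" : "miss");
    printf("lookup [8,9): %s\n", extent_lookup(&e1, 8, 1) ? "hit" : "miss");
    return 0;
}
```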

26

Paged Cache Organization

[Figure: the logical file divided into fixed-size pages, with whole pages cached]

27

BSP Tree Cache Organization

[Figure: cached regions of the logical file organized as a binary space partition tree keyed on file offsets]

28

List I/O Tree Cache Organization

[Figure: cached regions of the logical file stored as a list of (offset, length) extents, e.g. (0,1), (2,2), (5,3), (10,2)]

29

Progressive Page Organization

[Figure: the logical file cached with whole pages where possible and (offset, length) fragments such as (0,1) and (2,2) for unaligned regions]

30

View Cache

• MPI provides a more descriptive facility for describing file I/O
  – Collective I/O
  – File views for describing file subregions (see the sketch below)
• Use file views as a mechanism for coalescing reads and writes during collective I/O
• How to take the union of multiple views?
  – Use a heuristic approach to detect structured I/O
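For reference, this is roughly how an application describes its file subregion with an MPI file view and then issues a collective read; the strided layout, block size, and file name are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process sees one block per stripe, interleaved with the others. */
    const int block = 1024;                   /* ints per process per stripe */
    MPI_Datatype filetype;
    MPI_Type_vector(16, block, block * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "data.out",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* The view: this process sees only its own interleaved blocks. */
    MPI_Offset disp = (MPI_Offset)rank * block * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *buf = malloc(16 * block * sizeof(int));
    MPI_Status status;
    MPI_File_read_all(fh, buf, 16 * block, MPI_INT, &status);   /* collective */

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

The view plus the collective call give the middleware a complete description of every process's subregion, which is what the proposed view cache uses to coalesce the requests.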

31

Collective Read Example

[Figure: animation (slides 31–33) showing Process 0 and Process 1 issuing independent read requests against the logical file]

34

Collective Read w/ Cache

[Figure: animation (slides 34–38) showing the collective read serviced through shared Cache Blocks 0–2]

39

Collective Read w/ ViewCache

[Figure: animation (slides 39–43) showing the collective read serviced through Cache Blocks 0–2, with the cached file views used to coalesce the processes' requests]

44

Study Methodology

• Simulation-based study
  – HECIOS
    • Closely modelled on PVFS2 and Linux
    • 40,000 sloc
    • Leverages OMNeT++ and the INET Framework
  – Cache organizations
    • Core sharing
    • Aligned page access
    • Unaligned page access

45

HECIOS Overview

HECIOS System Architecture

46

HECIOS Overview (cont.)

HECIOS Main Window

47

HECIOS Overview (cont.)

HECIOS Simulation Top View

48

HECIOS Overview (cont.)

HECIOS Simulation Detailed View

49

Contributions

1. HECIOS, the High End Computing I/O Simulator, developed and made available under an open source license

2. Flash I/O and BT-IO traced at large scale; traces now publicly available

3. Rigorous study of caching factors in parallel file systems

4. Novel cache designs for unaligned file access and MPI view coalescing

50

The End

Thank You For Your Time!

Questions?

Brad Settlemyer

bradles@clemson.edu

51

Dissertation Schedule

• August – Complete trace parser enhancements. Shared cache implementation. Complete trace collection.
• September – Aligned cache sharing study.
• October – Unaligned cache sharing study.
• November – SigMetrics deadline. View coalescing cache.
• December – Finalize experiments. Finish writing thesis. Defend thesis.

52

PVFS Scalability

Read and Write Bandwidth Curves for PVFS

53

Shared Caching (cont.)

[Figure: Process 0 and Process 1 I/O requests to the logical file staged through shared Cache Pages 0–2]

54

Bandwidth Effects

Write Bandwidth on Adenine (MB/sec)

Num Clients | PVFS w/ 8 IONodes | PVFS w/ Replication, 16 IONodes | Percent Performance
     1      |       10.3        |              9.8                |        95.1%
     4      |       28.2        |             28.7                |       101.8%
     8      |       43.4        |             39.8                |        91.5%
    16      |       43.4        |             40.3                |        92.9%
    32      |       50.1        |             38.2                |        76.2%

55

Experimental Data Distribution

[Figure: logical file data (Strips A–F) and its physical placement across PFS Servers 0–3 in the experimental configurations]

56

Discussion (cont.)

[Figure: logical file data (Strips A–F) and its physical placement across PFS Servers 0–3]

57

[Figure: compute processes (Process 0–3) connected through a switched network to I/O nodes running PFS Servers 0–3]
