IPPS 98 1
What’s So Different about Cluster Architectures?
David E. Culler
Computer Science Division
U.C. Berkeley
http://now.cs.berkeley.edu
High Performance Clusters “happen”
• Many groups have built them.
• Many more are using them.
• Industry is running with it
– Virtual Interface Architecture
– System Area Networks
• A powerful, flexible new design technique
Outline
• Quick “guided tour” of Clusters at Berkeley
• Three Important Advances
=> Virtual Networks (Alan Mainwaring)
=> Implicit Co-scheduling (Andrea Arpaci-Dusseau)
=> Scalable I/O (Remzi Arpaci-Dusseau)
• What it means
Stop 1: HP/fddi Prototype
• FDDI on the HP/735 graphics bus.
• First fast message layer on an unreliable network
Stop 2: SparcStation NOW
• ATM was going to take over the world.
The original INKTOMI
Stop 3: Large Ultra/Myrinet NOW
Stop 4: Massive Cheap Storage
• Basic unit:
2 PCs double-ending four SCSI chains
Currently serving Fine Art at http://www.thinker.org/imagebase/
Stop 5: Cluster of SMPs (CLUMPS)
• Four Sun E5000s
– 8 processors
– 3 Myricom NICs
• Multiprocessor, Multi-NIC, Multi-Protocol
– see S. Lumetta IPPS98
Stop 6: Information Servers
• Basic Storage Unit:
– Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
– scalable backup/restore
• Dedicated Info Servers
– web,
– security,
– mail, …
• VLANs project into dept.
Stop 7: Millennium PC Clumps
• Inexpensive, easy to manage Cluster
• Replicated in many departments
• Prototype for very large PC cluster
So What’s So Different?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Complete System on every node
– virtual memory
– scheduler
– files
– ...
Three important system design aspects
• Virtual Networks
• Implicit co-scheduling
• Scalable File Transfer
Communication Performance: Direct Network Access
• LogP: Latency, Overhead, and Bandwidth
• Active Messages: lean layer supporting programming models
[Chart: LogP components (g, L, Or, Os) in µs, decomposing latency and 1/BW]
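The LogP decomposition above lends itself to a back-of-the-envelope cost model. A minimal sketch (the function and the parameter values are illustrative, not measurements from the talk):

```python
def logp_send_cost(n_msgs, L, o, g):
    """Time for a processor to send n short messages under LogP.

    L = network latency, o = per-message send/receive overhead,
    g = gap (inverse of per-processor message bandwidth).
    Successive sends are limited by max(o, g); the final message
    still pays latency plus the receiver's overhead.
    """
    if n_msgs == 0:
        return 0.0
    return (n_msgs - 1) * max(o, g) + o + L + o

# Illustrative values in microseconds (not from the slides):
print(logp_send_cost(10, L=5.0, o=3.0, g=4.0))  # 47.0
```

For a single message the gap drops out and the cost reduces to the familiar one-way time o + L + o.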
General purpose requirements
• Many timeshared processes
– each with direct, protected access
• User and system
• Client/server, parallel clients, parallel servers
– they grow, shrink, handle node failures
• Multiple packages in a process
– each may have its own internal communication layer
• Use communication as easily as memory
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
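A toy model of the abstraction described above, using hypothetical class and method names (this is not the actual AM-II API):

```python
class Endpoint:
    def __init__(self, process_id, ep_id):
        self.process_id = process_id   # owning process (protection domain)
        self.ep_id = ep_id             # its name within the virtual network
        self.inbox = []

class VirtualNetwork:
    """A collection of endpoints that can name (and message) each other."""
    def __init__(self):
        self.endpoints = {}

    def attach(self, process_id):
        ep = Endpoint(process_id, len(self.endpoints))
        self.endpoints[ep.ep_id] = ep
        return ep

    def send(self, src, dst_id, msg):
        # Only endpoints in the same virtual network can name each other.
        dst = self.endpoints[dst_id]
        dst.inbox.append((src.ep_id, msg))

vn = VirtualNetwork()
a = vn.attach(process_id=1)
b = vn.attach(process_id=2)   # a second process, same or another node
vn.send(a, b.ep_id, "hello")
print(b.inbox)  # [(0, 'hello')]
```

A process may hold endpoints in several virtual networks at once; each endpoint carries its own protection domain, so different packages in one process can communicate without interfering.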
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– the active portion of a large logical space is bound to physical resources
[Diagram: processes 1..n on a node, each with endpoints; the network interface binds active endpoints to NIC memory while the rest stay paged in host memory]
Endpoint Transition Diagram
• COLD: paged host memory
• WARM: R/O, paged host memory
• HOT: R/W, NIC memory
• Transitions: Read, Write, Msg Arrival (promote); Evict, Swap (demote)
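Read as a state machine, the transition diagram suggests the following promotion/demotion logic. A toy sketch, under our assumption that reads promote COLD endpoints to WARM while writes and message arrivals require HOT:

```python
# States from the slide: COLD (paged host memory), WARM (R/O paged host
# memory), HOT (R/W NIC memory). Events promote or demote an endpoint.
TRANSITIONS = {
    ("COLD", "read"): "WARM",         # reads served R/O from host memory
    ("COLD", "write"): "HOT",         # writes need the endpoint on the NIC
    ("COLD", "msg_arrival"): "HOT",
    ("WARM", "write"): "HOT",
    ("WARM", "msg_arrival"): "HOT",
    ("HOT", "evict"): "WARM",         # NIC frame reclaimed
    ("WARM", "swap"): "COLD",         # pages reclaimed by the host VM
}

def step(state, event):
    """Apply one event; events with no listed edge leave the state alone."""
    return TRANSITIONS.get((state, event), state)

s = "COLD"
for ev in ["read", "write", "evict", "swap"]:
    s = step(s, ev)
print(s)  # COLD
```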
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC endpoint frames 0-7 with transmit and receive paths; an endpoint miss is signaled to the driver]
Solaris System Abstractions
• Segment Driver: manages portions of an address space
• Device Driver: manages an I/O device
• Virtual Network Driver
LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly:
– ack processing
– protection check
– data structures
– code quality
• Virtualization is cheap
[Chart: LogP components (g, L, Or, Os) in µs for GAM vs. Active Messages]
Bursty Communication among many
[Diagram: many clients issuing bursts of messages to a set of servers]
[Charts: sustained msgs/sec and burst bandwidth (msgs/sec) as the number of clients grows]
Multiple VNs, Single-threaded Server
[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for a continuous stream and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Multiple VNs, Multithreaded Server
[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for a continuous stream and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Perspective on Virtual Networks
• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance
• Virtual Networks provide a horizontal abstraction
– basis for building new, fast services
Beyond the Personal Supercomputer
• Able to timeshare parallel programs – with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems– parallel file system, network virtual memory, …
• Nodes have powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling
• Schedulers act independently w/o global control
• A program waits while trying to communicate with peers that are not running
• 10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling
[Diagram: four local schedulers (P1-P4) running jobs A, B, and C in uncoordinated time slices]
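A rough way to see why coordination matters: if each node timeshares its jobs independently, the chance that all of a process's peers are scheduled at the same moment shrinks geometrically. A toy calculation (the uniform, independent-scheduling assumption is ours, not from the talk):

```python
def p_all_running(num_peers, jobs_per_node):
    """Probability that every communication peer happens to be scheduled
    at once, when each node timeshares jobs_per_node jobs independently
    and gives each an equal share of the CPU."""
    return (1.0 / jobs_per_node) ** num_peers

# 4 peers, 3 timeshared jobs per node: roughly a 1-in-81 chance that a
# fine-grain exchange finds everyone running -- coordination by luck is rare.
print(p_all_running(4, 3))
```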
Explicit Coscheduling
• Global context switch according to precomputed schedule
• How do you build it? Does it work?
[Diagram: a master triggering global context switches so that jobs A, B, and C each run simultaneously across P1-P4]
Typical Cluster Subsystem Structures
[Diagrams: master-slave structure (applications and local services on each node, coordinated by one master over the communication layer) and peer-to-peer structure (a global service component beside each local service, coordinating among peers)]
Ideal Cluster Subsystem Structure
• Obtain coordination without explicit subsystem interaction, using only the events already in the program
– very easy to build
– potentially very robust to component failures
– inherently “service on-demand”
– scalable
• Local service component can evolve.
Three approaches examined in NOW
• GLUNIX: explicit master-slave (user level)
– matrix algorithm to pick PP
– uses stops & signals to try to force the desired PP to run
• Explicit peer-to-peer scheduling assist with VNs
– co-scheduling daemons decide on the PP and kick the Solaris scheduler
• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler
Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability
Why implicit coscheduling might work
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination
observation implication action
fast response partner scheduled spin
delayed response partner not scheduled block
[Diagram: workstations timesharing jobs A and B; a fast response to a request keeps the requester spinning, a delayed response puts it to sleep]
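The observation/implication/action table above is essentially a two-phase wait policy. A schematic sketch (function names and the polling interface are hypothetical stand-ins, not the actual run-time library):

```python
def wait_for_reply(poll, block, spin_limit_us, now_us):
    """Implicit-coscheduling wait: spin while a reply seems imminent,
    block once the delay implies the partner is descheduled.

    poll() returns the reply or None; block() yields the CPU until woken;
    now_us() reads a microsecond clock.
    """
    start = now_us()
    while True:
        reply = poll()
        if reply is not None:
            return reply       # fast response: partner scheduled, kept spinning
        if now_us() - start > spin_limit_us:
            return block()     # delayed response: partner descheduled, yield CPU

# Toy drivers: the reply arrives on the third poll, inside the spin limit.
replies = iter([None, None, "reply"])
clock = iter(range(100))
result = wait_for_reply(lambda: next(replies), lambda: "woken",
                        spin_limit_us=50, now_us=lambda: next(clock))
print(result)  # reply
```

Spinning on a fast response is what keeps communicating processes scheduled together; blocking on a slow one frees the CPU for whoever is running.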
Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?
How Long to Spin?
• Answer: round trip time + 5 x wake-up time
– round-trip to stay scheduled together
– plus wake-up to get scheduled together
– plus wake-up to be competitive with blocking cost
– plus 3 x wake-up to meet “pairwise” cost
[Diagram: spin-wait timelines on WS 1 and WS 2; the baseline round trip costs 2L+4o, growing to 2L+4o+W when a wakeup intervenes]
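The rule above is simple arithmetic; a sketch with illustrative (non-measured) values:

```python
def spin_threshold(round_trip_us, wakeup_us):
    """Baseline spin time from the slide: one round-trip time (to stay
    scheduled together) plus 5x the wake-up time (1 to get scheduled
    together, 1 to be competitive with the blocking cost, 3 to meet the
    "pairwise" cost)."""
    return round_trip_us + 5 * wakeup_us

# Illustrative values (not measurements): 20 us round trip, 50 us wakeup.
print(spin_threshold(20, 50))  # 270
```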
Does it work?
Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
– spin-wait: 10x slowdown
With mixture of reads
• Block-immediate: 4x slowdown
Timesharing Split-C Programs
Many Questions
• What about – mix of jobs?
– sequential jobs?
– unbalanced placement?
– Fairness?
– Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
A look at Serious File I/O
• Traditional I/O system
• NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
– start on disk, end on disk
– accessible as files (use the file system)
– Datamation sort: 1 million records
– Minute sort: as many records as possible in one minute
[Diagram: traditional I/O system with a single processor-memory node vs. the NOW's row of processor-memory (P-M) nodes]
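The record format above is concrete enough to sketch. The sizes come from the slide; the zeroed payload and the generator function are our simplifications:

```python
import random

RECORD_SIZE = 100  # bytes per record (from the slide)
KEY_SIZE = 10      # leading key bytes (from the slide)

def make_records(n, seed=0):
    """Generate n benchmark-style records: a random 10-byte key
    followed by a zeroed 90-byte payload (payload contents simplified)."""
    rnd = random.Random(seed)
    return [bytes(rnd.randrange(256) for _ in range(KEY_SIZE))
            + b"\x00" * (RECORD_SIZE - KEY_SIZE)
            for _ in range(n)]

def key_of(record):
    """The sort key is the record's first 10 bytes."""
    return record[:KEY_SIZE]

recs = make_records(1000)
print(len(recs), len(recs[0]))  # 1000 100
```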
NOW-Sort Algorithm: 1 pass
• Read
– N/P records from disk -> memory
• Distribute
– send keys to processors holding result buckets
• Sort
– partial radix sort on each bucket
• Write
– gather and write records to disk
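One node's view of the pass above can be sketched as follows. Routing by the key's leading byte and a plain sorted() stand in for the real key-range partitioning and partial radix sort, and the exchange callback abstracts the all-to-all communication:

```python
def one_pass_sort(local_records, num_procs, my_rank, exchange):
    """Schematic NOW-Sort pass for one node: distribute each record to
    the processor owning its key range, then sort what arrives locally.

    exchange(outboxes) -> the records destined for my_rank from all peers
    (my_rank is only used by the exchange step in a real implementation).
    """
    # Distribute: route each record by its key's leading byte.
    outboxes = [[] for _ in range(num_procs)]
    for rec in local_records:
        dest = rec[0] * num_procs // 256
        outboxes[dest].append(rec)
    # Sort: order the records that landed in my bucket.
    return sorted(exchange(outboxes))

# Toy single-node run: "exchange" just hands back processor 0's bucket.
recs = [bytes([b]) + b"x" for b in (200, 3, 120)]
out = one_pass_sort(recs, num_procs=2, my_rank=0,
                    exchange=lambda boxes: boxes[0])
print(out)  # the records with leading byte < 128, in sorted order
```

Because every record moves across the network at most once and each bucket is sorted where it lands, the whole job is one read pass and one write pass over the disks.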
Key Implementation Techniques
• Performance Isolation: highly tuned local disk-to-disk sort
– manage local memory
– manage disk striping
– memory-mapped I/O with madvise, buffering
– manage overlap with threads
• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth
• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
[Chart: Minute Sort, gigabytes sorted vs. number of processors, compared with the SGI Power Challenge and SGI Origin]
Towards a Cluster File System
• Remote disk system built on a virtual network
[Charts: read/write rate (MB/s), local vs. remote, and client/server CPU utilization (0-40%)]
[Diagram: Client with RDlib communicating with an RD server via active messages]
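The Client / RDlib / RD server split sketches naturally as a request-reply service. The class names and the in-process send function are hypothetical stand-ins; in the real system the request travels over the virtual network as active messages:

```python
class RDServer:
    """Toy remote-disk server: serves block reads from a backing store."""
    def __init__(self, blocks):
        self.blocks = blocks          # block number -> bytes
    def handle(self, request):
        op, blockno = request
        if op == "read":
            return self.blocks[blockno]
        raise ValueError(op)

class RDlib:
    """Toy client library: turns read() calls into request-reply
    exchanges with the server, hiding the transport from the client."""
    def __init__(self, send):
        self.send = send              # send(request) -> reply
    def read(self, blockno):
        return self.send(("read", blockno))

server = RDServer(blocks={0: b"boot", 7: b"data"})
client = RDlib(send=server.handle)    # in reality: messages over the VN
print(client.read(7))  # b'data'
```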
Streaming Transfer Experiment
[Diagram: streaming transfer configurations (Local, P3FS Local, P3FS Reverse, P3FS Remote), each mapping processors P0-P3 to disks 0-3]
Results
• Data distribution affects resource utilization, but not delivered bandwidth
[Charts: rate (MB/s) and client/server CPU utilization (0-40%) for each access method: Local, P3FS Local, P3FS Reverse, P3FS Remote]
I/O Bus crossings
[Diagrams: memory, processor, and NIC data paths showing I/O bus crossings for Parallel Scan and Parallel Sort, each with (a) local disk and (b) remote disk]
Conclusions
• Complete system on every node makes clusters a very powerful architecture.
• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...
• Efficient communication enables new solutions to classic systems challenges.
• Opens a rich set of issues for parallel processing beyond the personal supercomputer.