IPPS 98 1
What’s So Different about Cluster Architectures?
David E. Culler
Computer Science Division
U.C. Berkeley
http://now.cs.berkeley.edu
High Performance Clusters “happen”
• Many groups have built them.
• Many more are using them.
• Industry is running with it
– Virtual Interface Architecture
– System Area Networks
• A powerful, flexible new design technique
Outline
• Quick “guided tour” of Clusters at Berkeley
• Three Important Advances
=> Virtual Networks (Alan Mainwaring)
=> Implicit Co-scheduling (Andrea Arpaci-Dusseau)
=> Scalable I/O (Remzi Arpaci-Dusseau)
• What it means
Stop 1: HP/fddi Prototype
• FDDI on the HP/735 graphics bus.
• First fast message layer on an unreliable network
Stop 2: SparcStation NOW
• ATM was going to take over the world.
The original INKTOMI
Stop 3: Large Ultra/Myrinet NOW
Stop 4: Massive Cheap Storage
• Basic unit:
2 PCs double-ending four SCSI chains
Currently serving Fine Art at http://www.thinker.org/imagebase/
Stop 5: Cluster of SMPs (CLUMPS)
• Four Sun E5000s
– 8 processors
– 3 Myricom NICs
• Multiprocessor, Multi-NIC, Multi-Protocol
– see S. Lumetta IPPS98
Stop 6: Information Servers
• Basic Storage Unit:
– Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
– scalable backup/restore
• Dedicated Info Servers
– web,
– security,
– mail, …
• VLANs project into dept.
Stop 7: Millennium PC Clumps
• Inexpensive, easy to manage Cluster
• Replicated in many departments
• Prototype for very large PC cluster
So What’s So Different?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Complete System on every node
– virtual memory
– scheduler
– files
– ...
Three important system design aspects
• Virtual Networks
• Implicit co-scheduling
• Scalable File Transfer
Communication Performance: Direct Network Access
• LogP: Latency, Overhead, and Bandwidth
• Active Messages: lean layer supporting programming models
[Chart: LogP components (g, L, Or, Os) in µs, decomposing latency and 1/BW]
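The LogP decomposition above lends itself to a back-of-the-envelope cost model. A minimal sketch (the function and the parameter values are illustrative, not measurements from the talk):

```python
def logp_send_cost(n_msgs, L, o, g):
    """Time for a processor to send n short messages under LogP.

    L = network latency, o = per-message send/receive overhead,
    g = gap (inverse of per-processor message bandwidth).
    Successive sends are limited by max(o, g); the final message
    still pays latency plus the receiver's overhead.
    """
    if n_msgs == 0:
        return 0.0
    return (n_msgs - 1) * max(o, g) + o + L + o

# Illustrative values in microseconds (not from the slides):
print(logp_send_cost(10, L=5.0, o=3.0, g=4.0))  # 47.0
```

For a single message the gap drops out and the cost reduces to the familiar one-way time o + L + o.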
General purpose requirements
• Many timeshared processes
– each with direct, protected access
• User and system
• Client/server, parallel clients, parallel servers
– they grow, shrink, handle node failures
• Multiple packages in a process
– each may have its own internal communication layer
• Use communication as easily as memory
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
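A toy model of the abstraction described above, using hypothetical class and method names (this is not the actual AM-II API):

```python
class Endpoint:
    def __init__(self, process_id, ep_id):
        self.process_id = process_id   # owning process (protection domain)
        self.ep_id = ep_id             # its name within the virtual network
        self.inbox = []

class VirtualNetwork:
    """A collection of endpoints that can name (and message) each other."""
    def __init__(self):
        self.endpoints = {}

    def attach(self, process_id):
        ep = Endpoint(process_id, len(self.endpoints))
        self.endpoints[ep.ep_id] = ep
        return ep

    def send(self, src, dst_id, msg):
        # Only endpoints in the same virtual network can name each other.
        dst = self.endpoints[dst_id]
        dst.inbox.append((src.ep_id, msg))

vn = VirtualNetwork()
a = vn.attach(process_id=1)
b = vn.attach(process_id=2)   # a second process, same or another node
vn.send(a, b.ep_id, "hello")
print(b.inbox)  # [(0, 'hello')]
```

A process may hold endpoints in several virtual networks at once; each endpoint carries its own protection domain, so different packages in one process can communicate without interfering.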
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– the active portion of a large logical space is bound to physical resources
[Diagram: processes 1..n on a node, each with endpoints; the network interface binds active endpoints to NIC memory while the rest stay paged in host memory]
Endpoint Transition Diagram
• COLD: paged host memory
• WARM: R/O, paged host memory
• HOT: R/W, NIC memory
• Transitions: Read, Write, Msg Arrival (promote); Evict, Swap (demote)
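Read as a state machine, the transition diagram suggests the following promotion/demotion logic. A toy sketch, under our assumption that reads promote COLD endpoints to WARM while writes and message arrivals require HOT:

```python
# States from the slide: COLD (paged host memory), WARM (R/O paged host
# memory), HOT (R/W NIC memory). Events promote or demote an endpoint.
TRANSITIONS = {
    ("COLD", "read"): "WARM",         # reads served R/O from host memory
    ("COLD", "write"): "HOT",         # writes need the endpoint on the NIC
    ("COLD", "msg_arrival"): "HOT",
    ("WARM", "write"): "HOT",
    ("WARM", "msg_arrival"): "HOT",
    ("HOT", "evict"): "WARM",         # NIC frame reclaimed
    ("WARM", "swap"): "COLD",         # pages reclaimed by the host VM
}

def step(state, event):
    """Apply one event; events with no listed edge leave the state alone."""
    return TRANSITIONS.get((state, event), state)

s = "COLD"
for ev in ["read", "write", "evict", "swap"]:
    s = step(s, ev)
print(s)  # COLD
```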
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC endpoint frames 0-7 with transmit and receive paths; an endpoint miss is signaled to the driver]
Solaris System Abstractions
• Segment Driver: manages portions of an address space
• Device Driver: manages an I/O device
• Virtual Network Driver
LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly:
– ack processing
– protection check
– data structures
– code quality
• Virtualization is cheap
[Chart: LogP components (g, L, Or, Os) in µs for GAM vs. Active Messages]
Bursty Communication among many
[Diagram: many clients issuing bursts of messages to a set of servers]
[Charts: sustained msgs/sec and burst bandwidth (msgs/sec) as the number of clients grows]
Multiple VNs, Single-threaded Server
[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for a continuous stream and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Multiple VNs, Multithreaded Server
[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for a continuous stream and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Perspective on Virtual Networks
• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance
• Virtual Networks provide a horizontal abstraction
– basis for building new, fast services
Beyond the Personal Supercomputer
• Able to timeshare parallel programs – with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems– parallel file system, network virtual memory, …
• Nodes have powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling
• Schedulers act independently w/o global control
• A program waits while trying to communicate with peers that are not running
• 10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling
[Diagram: four local schedulers (P1-P4) running jobs A, B, and C in uncoordinated time slices]
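A rough way to see why coordination matters: if each node timeshares its jobs independently, the chance that all of a process's peers are scheduled at the same moment shrinks geometrically. A toy calculation (the uniform, independent-scheduling assumption is ours, not from the talk):

```python
def p_all_running(num_peers, jobs_per_node):
    """Probability that every communication peer happens to be scheduled
    at once, when each node timeshares jobs_per_node jobs independently
    and gives each an equal share of the CPU."""
    return (1.0 / jobs_per_node) ** num_peers

# 4 peers, 3 timeshared jobs per node: roughly a 1-in-81 chance that a
# fine-grain exchange finds everyone running -- coordination by luck is rare.
print(p_all_running(4, 3))
```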
Explicit Coscheduling
• Global context switch according to precomputed schedule
• How do you build it? Does it work?
[Diagram: a master triggering global context switches so that jobs A, B, and C each run simultaneously across P1-P4]
Typical Cluster Subsystem Structures
[Diagrams: master-slave structure (applications and local services on each node, coordinated by one master over the communication layer) and peer-to-peer structure (a global service component beside each local service, coordinating among peers)]
Ideal Cluster Subsystem Structure
• Obtain coordination without explicit subsystem interaction, using only the events already in the program
– very easy to build
– potentially very robust to component failures
– inherently “service on-demand”
– scalable
• Local service component can evolve.
Three approaches examined in NOW
• GLUNIX: explicit master-slave (user level)
– matrix algorithm to pick PP
– uses stops & signals to try to force the desired PP to run
• Explicit peer-to-peer scheduling assist with VNs
– co-scheduling daemons decide on the PP and kick the Solaris scheduler
• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler
Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability
Why implicit coscheduling might work
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination
observation implication action
fast response partner scheduled spin
delayed response partner not scheduled block
[Diagram: workstations timesharing jobs A and B; a fast response to a request keeps the requester spinning, a delayed response puts it to sleep]
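The observation/implication/action table above is essentially a two-phase wait policy. A schematic sketch (function names and the polling interface are hypothetical stand-ins, not the actual run-time library):

```python
def wait_for_reply(poll, block, spin_limit_us, now_us):
    """Implicit-coscheduling wait: spin while a reply seems imminent,
    block once the delay implies the partner is descheduled.

    poll() returns the reply or None; block() yields the CPU until woken;
    now_us() reads a microsecond clock.
    """
    start = now_us()
    while True:
        reply = poll()
        if reply is not None:
            return reply       # fast response: partner scheduled, kept spinning
        if now_us() - start > spin_limit_us:
            return block()     # delayed response: partner descheduled, yield CPU

# Toy drivers: the reply arrives on the third poll, inside the spin limit.
replies = iter([None, None, "reply"])
clock = iter(range(100))
result = wait_for_reply(lambda: next(replies), lambda: "woken",
                        spin_limit_us=50, now_us=lambda: next(clock))
print(result)  # reply
```

Spinning on a fast response is what keeps communicating processes scheduled together; blocking on a slow one frees the CPU for whoever is running.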
Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?
How Long to Spin?
• Answer: round trip time + 5 x wake-up time
– round-trip to stay scheduled together
– plus wake-up to get scheduled together
– plus wake-up to be competitive with blocking cost
– plus 3 x wake-up to meet “pairwise” cost
[Diagram: spin-wait timelines on WS 1 and WS 2; the baseline round trip costs 2L+4o, growing to 2L+4o+W when a wakeup intervenes]
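The rule above is simple arithmetic; a sketch with illustrative (non-measured) values:

```python
def spin_threshold(round_trip_us, wakeup_us):
    """Baseline spin time from the slide: one round-trip time (to stay
    scheduled together) plus 5x the wake-up time (1 to get scheduled
    together, 1 to be competitive with the blocking cost, 3 to meet the
    "pairwise" cost)."""
    return round_trip_us + 5 * wakeup_us

# Illustrative values (not measurements): 20 us round trip, 50 us wakeup.
print(spin_threshold(20, 50))  # 270
```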
Does it work?
Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
– spin-wait: 10x slowdown
With mixture of reads
• Block-immediate: 4x slowdown
Timesharing Split-C Programs
Many Questions
• What about – mix of jobs?
– sequential jobs?
– unbalanced placement?
– Fairness?
– Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
A look at Serious File I/O
• Traditional I/O system
• NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
– start on disk, end on disk
– accessible as files (use the file system)
– Datamation sort: 1 million records
– Minute sort: as many records as possible in one minute
[Diagram: traditional I/O system with a single processor-memory node vs. the NOW's row of processor-memory (P-M) nodes]
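The record format above is concrete enough to sketch. The sizes come from the slide; the zeroed payload and the generator function are our simplifications:

```python
import random

RECORD_SIZE = 100  # bytes per record (from the slide)
KEY_SIZE = 10      # leading key bytes (from the slide)

def make_records(n, seed=0):
    """Generate n benchmark-style records: a random 10-byte key
    followed by a zeroed 90-byte payload (payload contents simplified)."""
    rnd = random.Random(seed)
    return [bytes(rnd.randrange(256) for _ in range(KEY_SIZE))
            + b"\x00" * (RECORD_SIZE - KEY_SIZE)
            for _ in range(n)]

def key_of(record):
    """The sort key is the record's first 10 bytes."""
    return record[:KEY_SIZE]

recs = make_records(1000)
print(len(recs), len(recs[0]))  # 1000 100
```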
NOW-Sort Algorithm: 1 pass
• Read
– N/P records from disk -> memory
• Distribute
– send keys to processors holding result buckets
• Sort
– partial radix sort on each bucket
• Write
– gather and write records to disk
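One node's view of the pass above can be sketched as follows. Routing by the key's leading byte and a plain sorted() stand in for the real key-range partitioning and partial radix sort, and the exchange callback abstracts the all-to-all communication:

```python
def one_pass_sort(local_records, num_procs, my_rank, exchange):
    """Schematic NOW-Sort pass for one node: distribute each record to
    the processor owning its key range, then sort what arrives locally.

    exchange(outboxes) -> the records destined for my_rank from all peers
    (my_rank is only used by the exchange step in a real implementation).
    """
    # Distribute: route each record by its key's leading byte.
    outboxes = [[] for _ in range(num_procs)]
    for rec in local_records:
        dest = rec[0] * num_procs // 256
        outboxes[dest].append(rec)
    # Sort: order the records that landed in my bucket.
    return sorted(exchange(outboxes))

# Toy single-node run: "exchange" just hands back processor 0's bucket.
recs = [bytes([b]) + b"x" for b in (200, 3, 120)]
out = one_pass_sort(recs, num_procs=2, my_rank=0,
                    exchange=lambda boxes: boxes[0])
print(out)  # the records with leading byte < 128, in sorted order
```

Because every record moves across the network at most once and each bucket is sorted where it lands, the whole job is one read pass and one write pass over the disks.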
Key Implementation Techniques
• Performance Isolation: highly tuned local disk-to-disk sort
– manage local memory
– manage disk striping
– memory-mapped I/O with madvise, buffering
– manage overlap with threads
• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth
• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
[Chart: Minute Sort, gigabytes sorted vs. number of processors, compared with the SGI Power Challenge and SGI Origin]
Towards a Cluster File System
• Remote disk system built on a virtual network
[Charts: read/write rate (MB/s), local vs. remote, and client/server CPU utilization (0-40%)]
[Diagram: Client with RDlib communicating with an RD server via active messages]
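The Client / RDlib / RD server split sketches naturally as a request-reply service. The class names and the in-process send function are hypothetical stand-ins; in the real system the request travels over the virtual network as active messages:

```python
class RDServer:
    """Toy remote-disk server: serves block reads from a backing store."""
    def __init__(self, blocks):
        self.blocks = blocks          # block number -> bytes
    def handle(self, request):
        op, blockno = request
        if op == "read":
            return self.blocks[blockno]
        raise ValueError(op)

class RDlib:
    """Toy client library: turns read() calls into request-reply
    exchanges with the server, hiding the transport from the client."""
    def __init__(self, send):
        self.send = send              # send(request) -> reply
    def read(self, blockno):
        return self.send(("read", blockno))

server = RDServer(blocks={0: b"boot", 7: b"data"})
client = RDlib(send=server.handle)    # in reality: messages over the VN
print(client.read(7))  # b'data'
```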
Streaming Transfer Experiment
[Diagram: streaming transfer configurations (Local, P3FS Local, P3FS Reverse, P3FS Remote), each mapping processors P0-P3 to disks 0-3]
Results
• Data distribution affects resource utilization, but not delivered bandwidth
[Charts: rate (MB/s) and client/server CPU utilization (0-40%) for each access method: Local, P3FS Local, P3FS Reverse, P3FS Remote]
I/O Bus crossings
[Diagrams: memory, processor, and NIC data paths showing I/O bus crossings for Parallel Scan and Parallel Sort, each with (a) local disk and (b) remote disk]
Conclusions
• Complete system on every node makes clusters a very powerful architecture.
• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...
• Efficient communication enables new solutions to classic systems challenges.
• Opens a rich set of issues for parallel processing beyond the personal supercomputer.