CS 505: Computer Structures, Networks. Thu D. Nguyen, Computer Science, Rutgers University, Spring 2005

CS 505: Computer Structures Networks


CS 505: Thu D. Nguyen, Rutgers University, Spring 2005

CS 505: Computer Structures

Networks

Thu D. Nguyen

Spring 2005

Computer Science

Rutgers University


Basic Message Passing

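[Figure: two configurations. First: P0 (Send) and P1 (Receive) on the same node N0. Second: P0 on node N0 (Send) and P1 on node N1 (Receive), connected by the communication fabric.]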


Terminology

• Basic Message Passing:
– Send: Analogous to mailing a letter
– Receive: Analogous to picking up a letter from the mailbox
– Scatter-gather: Ability to “scatter” data items in a message into multiple memory locations and “gather” data items from multiple memory locations into one message

• Network performance:
– Latency: The time from when a Send is initiated until the first byte is received by a Receive.
– Bandwidth: The rate at which a sender is able to send data to a receiver.


Scatter-Gather

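[Figure: Scatter (Receive): the items of one message are written to multiple memory locations. Gather (Send): items from multiple memory locations are collected into one message.]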
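As a rough illustration of the two operations, the sketch below models memory as a `bytearray` and a message as `bytes`. The function names and the `(offset, length)` region format are assumptions made for this example, not an API from the course.

```python
def gather(memory, regions):
    """Gather (send side): copy several (offset, length) regions into one message."""
    return b"".join(bytes(memory[off:off + n]) for off, n in regions)


def scatter(message, memory, regions):
    """Scatter (receive side): split one message across several (offset, length) regions."""
    pos = 0
    for off, n in regions:
        memory[off:off + n] = message[pos:pos + n]
        pos += n
    return pos  # number of message bytes consumed


# Hypothetical layout: three useful pieces separated by filler bytes.
sender_mem = bytearray(b"HEADERxxxxPAYLOADyyyyTAIL")
msg = gather(sender_mem, [(0, 6), (10, 7), (21, 4)])

receiver_mem = bytearray(32)
scatter(msg, receiver_mem, [(0, 6), (8, 7), (24, 4)])
```

The point of hardware or library support for scatter-gather is that the application does not have to make this intermediate copy itself.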


Network Topologies


Terminology

• Network partition: When a network is broken into two or more components that cannot communicate with each other.

• Diameter: Maximum length of shortest path between any two processors.

• Connectivity: Measure of the multiplicity of paths between any two processors - Minimum number of links that must be removed to partition the network.

• Bisection width: Minimum number of links that must be removed to partition the network into two equal halves.

• Bisection bandwidth: Minimum volume of communication allowed between any two halves of the network with an equal number of processors.


Bisection Bandwidth

Bisection Bandwidth = Bisection Width × Link Bandwidth
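To make the definitions concrete, here is a small sketch that tabulates diameter and bisection width for a few of the topologies covered in this lecture and applies the formula above. The closed forms are the standard textbook ones and assume p is even (and a perfect square for the mesh, a power of two for the hypercube); the function names are illustrative.

```python
import math


def ring(p):
    return {"diameter": p // 2, "bisection_width": 2}


def mesh_2d(p):
    # sqrt(p) x sqrt(p) mesh without wraparound links
    side = math.isqrt(p)
    assert side * side == p, "p must be a perfect square"
    return {"diameter": 2 * (side - 1), "bisection_width": side}


def hypercube(p):
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    return {"diameter": d, "bisection_width": p // 2}


def completely_connected(p):
    # Every node in one half has a link to every node in the other half.
    return {"diameter": 1, "bisection_width": (p // 2) * (p - p // 2)}


def bisection_bandwidth(bisection_width, link_bandwidth):
    return bisection_width * link_bandwidth


# Example: 16 processors, links of 1e9 bytes/s (an arbitrary value).
for name, topo in [("ring", ring), ("mesh", mesh_2d), ("hypercube", hypercube)]:
    m = topo(16)
    print(name, m, bisection_bandwidth(m["bisection_width"], 1e9))
```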


Typical Network Diagram


Typical Node

[Figure: a node consisting of a CPU, memory, a NIC, and a router]


Bus-Based Network

• Advantages
– Simple
– Diameter = 1

• Disadvantages
– Blocking
– Bandwidth does not scale with p
– Easy to partition network

[Figure: nodes attached to a shared bus]


Completely-Connected Network

• Advantages
– Diameter = 1
– Bandwidth scales with p
– Non-blocking
– Difficult to partition network

• Disadvantages
– Number of links grows as O(p^2)
– Fan-in (and fan-out) at each node grows linearly with p


Star Network

• Essentially same as Bus-Based Network


Ring Network


Mesh and Torus Network


Multistage Network


Perfect Shuffle


Omega Network - Log(p) Stages


Blocking in Omega Network
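The slide figures are not recoverable, so the following sketch illustrates the idea under the usual textbook formulation: each stage applies a perfect shuffle (a one-bit left rotation of the address) and the switch then sets the low bit to the next destination bit, most significant bit first. Two messages block each other when they need the same output link at the same stage. The function names are illustrative.

```python
def shuffle(i, bits):
    """Perfect shuffle: rotate the bits-wide address i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb


def omega_route(src, dst, bits):
    """Return the output link used after each of the log2(p) = bits stages."""
    pos, links = src, []
    for stage in range(bits):
        pos = shuffle(pos, bits)                  # wiring between stages
        want = (dst >> (bits - 1 - stage)) & 1    # next destination bit, MSB first
        pos = (pos & ~1) | want                   # switch picks upper (0) or lower (1) output
        links.append(pos)
    return links


def conflicts(route_a, route_b):
    """Stages at which two routes need the same output link."""
    return [s for s, (a, b) in enumerate(zip(route_a, route_b)) if a == b]


# p = 8: messages 010 -> 111 and 110 -> 100 collide at the first stage.
r1 = omega_route(0b010, 0b111, 3)
r2 = omega_route(0b110, 0b100, 3)
print(r1, r2, conflicts(r1, r2))
```

Because there is exactly one path between any source and destination, a conflict like this cannot be routed around; one of the two messages has to wait.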


Tree Network


Fat Tree Network


Hypercube Network


Hypercube Network


k-ary d-cube Networks

• k: radix of the network - the number of processors in each dimension

• d: dimension of the network

• A k-ary d-cube can be constructed from k k-ary (d-1)-cubes by connecting the nodes occupying identical positions into rings

• Examples:
– Hypercube: binary d-cube
– Ring: p-ary 1-cube
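A minimal sketch of this construction, assuming nodes are labeled by d coordinates in the range 0 to k-1: two nodes are neighbors when they differ by ±1 (mod k) in exactly one coordinate. The function name and the tuple labeling are choices made for this example.

```python
def neighbors(node, k, d):
    """Neighbors of `node` (a tuple of d coordinates in range(k)) in a k-ary d-cube."""
    result = set()
    for dim in range(d):
        for step in (-1, 1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % k   # ring in each dimension
            result.add(tuple(coords))
    result.discard(node)  # only matters in the degenerate case k == 1
    return sorted(result)


print(neighbors((0, 0, 0), 2, 3))  # hypercube: binary 3-cube, degree 3
print(neighbors((0,), 8, 1))       # ring: 8-ary 1-cube, degree 2
print(neighbors((1, 2), 4, 2))     # 4 x 4 torus, degree 4
```

For k = 2 the +1 and -1 steps reach the same node, which is why a hypercube node has d neighbors rather than 2d.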


Arbitrary Topology Networks

[Figure: nodes connected through an arbitrary arrangement of switches]


Network Characteristics


Packet vs. Wormhole Routing

[Figure: a message shown divided into packets (packet routing) and as a worm (wormhole routing)]


Store-and-Forward vs. Cut-Through Routing

• Store-and-Forward:
– Cannot route/forward a packet until the entire packet has been received

• Cut-Through:
– Can route/forward a packet as soon as the router has received and processed the header

• Wormhole routing is always cut-through because routers do not have enough buffer space to hold an entire message

• Packet routing is almost always cut-through as well

• Difference: when blocked, a worm can span multiple routers while a packet will fit entirely into the buffer of a single router
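The practical effect on latency can be seen with a simplified, contention-free cost model. This is a common textbook approximation rather than something stated on the slide: `per_hop_delay` stands for the time a router needs to receive and process the header, and all parameter names are assumptions.

```python
def store_and_forward_time(msg_bytes, hops, bandwidth, per_hop_delay):
    """Each router receives the whole packet before forwarding it."""
    return hops * (per_hop_delay + msg_bytes / bandwidth)


def cut_through_time(msg_bytes, hops, bandwidth, per_hop_delay):
    """Only the header pays the per-hop cost; the body is pipelined behind it."""
    return hops * per_hop_delay + msg_bytes / bandwidth


# Hypothetical numbers: 1024-byte message, 4 hops,
# 1024 bytes per microsecond links, 0.5 microseconds per hop.
sf = store_and_forward_time(1024, 4, 1024, 0.5)
ct = cut_through_time(1024, 4, 1024, 0.5)
print(sf, ct)
```

Under this model the transmission time is multiplied by the hop count for store-and-forward but paid only once for cut-through, so the gap widens with message size and path length.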


Collective Communication Primitives

• Send/Receive necessary and sufficient

• Broadcast, multicast
– one-to-all, all-to-all, one-to-all personalized, all-to-all personalized
– flood

• Reduction
– all-to-one, all-to-all

• Scatter, gather

• Barrier


Broadcast and Multicast

[Figure: Broadcast: a message is delivered to all of P0 through P3. Multicast: a message is delivered to a subset of P0 through P3.]


All-to-All

[Figure: All-to-all: each of P0 through P3 sends a message to the others]


Reduction

sum ← 0
for i ← 1 to p do sum ← sum + A[i]

[Figure: reduction over P0 through P3 holding A[0] through A[3]: the partial sums A[0] + A[1] and A[2] + A[3] are formed first, then combined into A[0] + A[1] + A[2] + A[3]]
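The pairwise combining pattern in the figure can be simulated sequentially as below. This is a sketch of a binary-tree all-to-one reduction, with one list slot standing in for each process; it is not taken from the slides, and it assumes a non-empty input.

```python
def tree_reduce(values):
    """Sum one value per process in ceil(log2(p)) combining steps.

    Returns (total, number_of_steps). In each step, process i receives the
    partial sum held by process i + stride and adds it to its own.
    """
    vals = list(values)
    p = len(vals)
    steps, stride = 0, 1
    while stride < p:
        for i in range(0, p - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps


print(tree_reduce([1, 2, 3, 4]))  # (10, 2)
```

All additions within one step are independent, so on p processes the reduction needs log p communication steps instead of the p - 1 sequential additions of the loop above.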


Ring Broadcast

O(p)


Ring Broadcast

O(log p)
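One way to reach a logarithmic step count on a ring is recursive doubling: in each step, every node that already holds the message forwards it to the node half the previous distance away. The sketch below only computes the schedule; it assumes p is a power of two and that the network can deliver a message over several hops in one step (for example with cut-through routing). Whether this is exactly the scheme drawn on the slide is an assumption.

```python
def broadcast_schedule(p, root=0):
    """Return a list of steps; each step is a list of (sender, receiver) pairs."""
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    have = [root]          # nodes that already hold the message
    steps = []
    dist = p // 2
    while dist >= 1:
        step = [(s, (s + dist) % p) for s in have]
        have += [r for _, r in step]
        steps.append(step)
        dist //= 2
    return steps


for step in broadcast_schedule(8):
    print(step)
```

The number of informed nodes doubles every step, so p nodes are covered after log2(p) steps, compared with the O(p) steps of passing the message from neighbor to neighbor.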


Mesh Broadcast

$O(2 \log \sqrt{p}) = O(2 \cdot \tfrac{1}{2} \log p) = O(\log p)$


Computation vs. Communication Cost

• 2 GHz clock => 1/2 ns instruction cycle

• Memory access:
– L1: ~2-4 cycles => 1-2 ns
– L2: ~5-10 cycles => 2.5-5 ns
– Memory: ~120-300 cycles => 60-150 ns

• Message roundtrip latency:
– ~20 µs
– Suppose 75% hit ratio in L1, no L2, 1 ns L1 access time, 200 ns memory access time => average memory access time of about 51 ns (0.75 × 1 ns + 0.25 × 200 ns)
– 1 message roundtrip latency = ~400 memory accesses
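The arithmetic behind the last two bullets can be checked directly. The sketch takes the roundtrip latency as 20 µs, which is the value the 400-access figure implies; variable names are illustrative.

```python
hit_ratio = 0.75
l1_ns = 1.0         # L1 access time
mem_ns = 200.0      # main-memory access time on an L1 miss

# Average memory access time: hits cost l1_ns, misses cost mem_ns.
amat_ns = hit_ratio * l1_ns + (1 - hit_ratio) * mem_ns

roundtrip_ns = 20_000.0             # 20 microseconds
accesses = roundtrip_ns / amat_ns   # memory accesses per message roundtrip

print(amat_ns, round(accesses))
```

The result is about 51 ns per access and a little under 400 accesses per roundtrip, matching the slide's rounded figures.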


Performance … Always Performance!

• So … obviously, when we talk about message passing, we want to know how to optimize for performance

• But … which aspects of message passing should we optimize?

– We could try to optimize everything
» Optimizing the wrong thing wastes precious resources, e.g., optimizing leaving mail for the mail-person does not increase overall “speed” of mail delivery significantly


Martin et al.: LogP Model


Sensitivity to LogGP Parameters

• LogGP parameters:
– L = delay incurred in passing a short msg from source to dest
– o = processor overhead involved in sending or receiving a msg
– g = min time between msg transmissions or receptions (msg bandwidth)
– G = bulk gap = time per byte transferred for long transfers (byte bandwidth)

• Workstations connected with Myrinet network and Generic Active Messages layer

• Delay insertion technique

• Applications written in Split-C but perform their own data caching
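For intuition about how the four parameters combine, the sketch below encodes the usual end-to-end cost expressions from the LogP and LogGP models. The formulas are the standard ones from that literature rather than anything shown on the slide, and the sample values are hypothetical.

```python
def short_message_time(L, o):
    """One short message: send overhead, network latency, receive overhead."""
    return o + L + o


def long_message_time(L, o, G, nbytes):
    """One long (bulk) message of nbytes: subsequent bytes are spaced G apart."""
    return o + (nbytes - 1) * G + L + o


def burst_time(L, o, g, n):
    """n back-to-back short messages from one sender, until the last is received."""
    return o + (n - 1) * max(g, o) + L + o


# Hypothetical parameter values in microseconds.
L, o, g, G = 5.0, 3.0, 6.0, 0.5
print(short_message_time(L, o))            # 11.0
print(burst_time(L, o, g, 10))             # 65.0
print(long_message_time(L, o, G, 1001))    # 511.0
```

In the burst case the gap term dominates as n grows, which is consistent with the study's finding that bursty applications are sensitive to g.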


Sensitivity to Overhead

[Figure: sensitivity to overhead. Fixed parameters as recoverable from the slide: P = 16, g = 8.5 µs, L = 0.5 µs]


Sensitivity to Gap

[Figure: sensitivity to gap. Fixed parameters as recoverable from the slide: P = 16, o = 9.2 µs, L = 0.5 µs]


Sensitivity to Latency

[Figure: sensitivity to latency. Fixed parameters as recoverable from the slide: P = 16, g = 8.5 µs, o = 9.2 µs]


Sensitivity to Bulk Gap

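[Figure: sensitivity to bulk gap. Fixed parameters as recoverable from the slide: P = 16, g = 8.5 µs, o = 9.2 µs, L = 0.5 µs]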


Summary

• Runtime strongly dependent on overhead and gap

• Strong dependence on gap because of burstiness of communication

• Not so sensitive to latency => can effectively overlap computation and communication with non-blocking reads (writes usually do not stall the processor)

• Not sensitive to bulk gap => got more bandwidth than we know what to do with


What’s the Point?

• What can we take away from Martin et al.’s study?

– It’s extremely important to reduce overhead because it may affect both “o” and “g”

– All the “action” is currently in the OS and the Network Interface Card (NIC)

• Subject of von Eicken et al., “Active Messages: A Mechanism for Integrated Communication and Computation,” ISCA 1992.


User-Level Access to NIC

• Basic idea: allow protected user access to NIC for implementing comm. protocols at user-level


User-level Communication

• Basic idea: remove the kernel from the critical path of sending and receiving messages

– user-memory to user-memory: zero copy
– permission is checked once when the mapping is established
– buffer management left to the application

• Advantages
– low communication latency
– low processor overhead
– approach raw latency and bandwidth provided by the network

• One approach: U-Net


U-Net Abstraction


U-Net Endpoints


U-Net Basics

• Protection provided by endpoints and communication channels

– Endpoints, communication segments, and message queues are only accessible by the owning process (all allocated in user memory)

– Outgoing messages are tagged with the originating endpoint address and incoming messages are demultiplexed and only delivered to the correct endpoints

• For ideal performance, firmware at NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.

• Message queues can be placed at different memories to optimize polling

– Receive queue allocated in host memory
– Send and free queues allocated in NIC memory
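The endpoint and demultiplexing ideas above can be illustrated with a toy model. Everything here is a simplification invented for this example (class names, the tag-to-endpoint dictionary, buffers identified by integer index); it is not the U-Net interface itself, and the real protection checks happen in the OS and NIC firmware rather than in application code.

```python
from collections import deque


class Endpoint:
    """A process's handle on the network: a tag plus three message queues."""

    def __init__(self, tag, owner):
        self.tag = tag
        self.owner = owner
        self.send_queue = deque()   # descriptors of messages to transmit
        self.recv_queue = deque()   # descriptors of messages that have arrived
        self.free_queue = deque()   # buffers the NIC may fill with incoming data


class Nic:
    def __init__(self):
        self.endpoints = {}

    def register(self, endpoint):
        # Stands in for the OS validating and creating the endpoint.
        self.endpoints[endpoint.tag] = endpoint

    def deliver(self, dest_tag, src_tag, payload):
        """Demultiplex an incoming message to the endpoint named by its tag."""
        ep = self.endpoints.get(dest_tag)
        if ep is None or not ep.free_queue:
            return False            # unknown endpoint or no free buffer: drop
        buf = ep.free_queue.popleft()
        ep.recv_queue.append((src_tag, buf, payload))
        return True


nic = Nic()
ep = Endpoint(tag=7, owner="procA")
ep.free_queue.append(0)             # one free receive buffer
nic.register(ep)
delivered = nic.deliver(dest_tag=7, src_tag=3, payload=b"hi")
```

Once the endpoint exists, sending and receiving involve only these queues, which is what keeps the kernel off the critical path.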


U-Net Performance on ATM


U-Net UDP Performance


U-Net TCP Performance


U-Net Latency


Virtual Memory-Mapped Communication

• Receiver exports the receive buffers

• Sender must import a receive buffer before sending

• The permission of sender to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)

• Sender can directly communicate with the network interface to send data into imported buffers without kernel intervention

• At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention


Virtual-to-physical address

• In order to store data directly into the application address space (exported buffers), the NI must know the virtual to physical translations

• What to do?

Receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);

Sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);


Software TLB in Network Interface

• The network interface must incorporate a TLB (NI-TLB) which is kept consistent with the virtual memory system

• When a message arrives, NI attempts a virtual to physical translation using the NI-TLB

• If a translation is found, NI transfers the data to the physical address in the NI-TLB entry

• If a translation is missing in the NI-TLB, the processor is interrupted to provide the translation. If the page is not currently in memory, the processor will bring the page in. In any case, the kernel increments the reference count for that page to avoid swapping

• When a page entry is evicted from the NI-TLB, the kernel is informed to decrement the reference count

• Swapping prevented while DMA in progress
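The protocol in these bullets can be sketched as a small simulation. The class and method names are hypothetical, the page table is a plain dictionary, and the LRU eviction policy is an assumption (the slide does not say how victims are chosen); paging a non-resident page in is reduced to a comment.

```python
from collections import OrderedDict


class Kernel:
    def __init__(self, page_table):
        self.page_table = dict(page_table)   # virtual page -> physical frame
        self.refcount = {}                   # pages with refcount > 0 are not swapped

    def translate_and_pin(self, vpage):
        # A real kernel would first bring the page into memory if needed.
        frame = self.page_table[vpage]
        self.refcount[vpage] = self.refcount.get(vpage, 0) + 1
        return frame

    def unpin(self, vpage):
        self.refcount[vpage] -= 1


class NiTlb:
    def __init__(self, kernel, capacity):
        self.kernel = kernel
        self.capacity = capacity
        self.entries = OrderedDict()         # least recently used entry first
        self.misses = 0

    def translate(self, vpage):
        if vpage in self.entries:            # hit: no processor involvement
            self.entries.move_to_end(vpage)
            return self.entries[vpage]
        self.misses += 1                     # miss: "interrupt" the processor
        if len(self.entries) >= self.capacity:
            victim, _ = self.entries.popitem(last=False)
            self.kernel.unpin(victim)        # kernel decrements the reference count
        frame = self.kernel.translate_and_pin(vpage)
        self.entries[vpage] = frame
        return frame


kernel = Kernel({0: 10, 1: 11, 2: 12})
tlb = NiTlb(kernel, capacity=2)
frames = [tlb.translate(v) for v in (0, 1, 0, 2)]
```

After this sequence page 1 has been evicted and unpinned, while pages 0 and 2 remain pinned for as long as the NI may DMA into them.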