Scalable Multiprocessors

SCALABILITY

• Almost all computers allow the capability of the system to be increased in some form, for example by adding memory, I/O cards, disks, or upgraded processors, but the increase typically has hard limits

• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system

• Four aspects of scalability:

– How does the bandwidth or throughput of the system increase with additional processors?

– How does the latency or time per operation increase?

– How does the cost of the system increase?

– How do we actually package the system and put it together?


Bandwidth Scaling

• If a large number of processors are to exchange information simultaneously with many other processors or memories, a large number of independent wires must connect them.

• Thus scalable machines must be organized in the manner shown in the figure (next slide), where a large number of processor modules and memory modules are connected by independent wires through a large number of switches

[Figure: processor and memory modules connected by independent wires through a network of switches]

• A switch may be realized by a bus, a crossbar or even a collection of multiplexers

• The number of outputs (or inputs) of the switch is called the degree of the switch

• Switches are limited in scale but may be interconnected to form large configurations, that is, networks

• Controllers are also available to determine which inputs are to be connected to which outputs at each instant in time

• A network switch is a more general-purpose device, in which the information presented at the input is enough for the switch controller to determine the proper output without consulting all the nodes

• Pairs of modules are connected by routes through network switches


• The most common structure for scalable machines is illustrated by the generic architecture shown in fig (next slide)

• Here one or more processors are packaged together with one or more memory modules and a communication assist as an easily replicated unit, which is called a node

• The intranode switch is typically a high-performance bus

[Figure: generic scalable multiprocessor architecture: nodes containing one or more processors, memory, and a communication assist, connected by a scalable network]

In the dancehall configuration, processing nodes are separated from memory nodes by the network


• If the memory modules are on the opposite side of the interconnect, as in the figure (previous slide), the network bandwidth requirement scales linearly with the number of processors, even when no communication occurs between processes

• Providing adequate bandwidth scaling may not be enough for the computational performance to scale perfectly since the access latency increases with the number of processors

• By distributing the memories across the processors, all processes can access local memory with fixed latency, independent of the number of processors; thus the computational performance of the system can scale perfectly


The following assumptions are made to achieve scalable bandwidth:

• It must be possible to have a very large number of concurrent transactions using different wires

• They are initiated independently and without global arbitration

• The effects of a transaction (such as changes of state) are directly visible only by the nodes involved in the transaction

• The effects may eventually become visible to other nodes as they are propagated by additional transactions

• Although it is possible to broadcast information to all nodes, broadcast bandwidth (i.e., the rate at which broadcasts can be performed) does not increase with the number of nodes


Latency Scaling

The time to transfer n bytes between two nodes is given by

T(n) = Overhead + Channel Time + Routing Delay

where Overhead is the processing time in initiating or completing the transfer,

Channel Time is n/B (where B is the bandwidth of the thinnest channel along the path), and

Routing Delay is a function f(H, n) of the number of routing steps (hops) H in the transfer and the number of bytes transferred


Prob 7.1: Many classic networks are constructed out of fixed-degree switches in a configuration, or topology, such that for n nodes the distance from any network input to any network output is log2 n and the total number of switches is α n log n for some small constant α. Assume an overhead of 1 µs per message, a link bandwidth of 64 MB/s, and a router delay of 200 ns per hop. How much does the time for a 128-byte transfer increase as the machine is scaled from 64 to 1,024 nodes?

Solution: At 64 nodes, log2 64 = 6 hops are required, so

T(128) = 1 µs + 128 bytes / (64 MB/s) + 6 × 0.2 µs = 1 + 2 + 1.2 = 4.2 µs

This increases to T(128) = 1 + 2 + 10 × 0.2 = 5 µs on a 1,024-node configuration. Thus, the latency increases by less than 20% with a 16-fold increase in machine size. Even with this small transfer size, a store-and-forward delay would add 2 µs (the time to buffer 128 bytes) to the routing delay per hop. Thus the latency would be

T(128) = 1 + 2 + 6 × 2.2 = 16.2 µs

at 64 nodes and

T(128) = 1 + 2 + 10 × 2.2 = 25 µs

at 1,024 nodes
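To check the arithmetic, here is a minimal C sketch of the latency model above; the function and constant names are illustrative, and the 64 MB/s link bandwidth is expressed as 64 bytes/µs.

#include <math.h>
#include <stdio.h>

/* T(n) = Overhead + n/B + H * (per-hop delay), with times in
 * microseconds and B in bytes per microsecond (64 MB/s = 64 B/us). */
static double transfer_us(double n_bytes, double b_bytes_per_us,
                          int hops, double hop_us, double overhead_us)
{
    return overhead_us + n_bytes / b_bytes_per_us + hops * hop_us;
}

int main(void)
{
    const double bw = 64.0;                    /* 64 MB/s */
    const double cut_through = 0.2;            /* 200 ns router delay per hop */
    const double store_fwd = 0.2 + 128.0 / bw; /* plus 2 us to buffer 128 bytes */

    for (int nodes = 64; nodes <= 1024; nodes *= 16) {
        int hops = (int)(log2((double)nodes) + 0.5); /* distance = log2 n */
        printf("%4d nodes: cut-through %.1f us, store-and-forward %.1f us\n",
               nodes, transfer_us(128, bw, hops, cut_through, 1.0),
               transfer_us(128, bw, hops, store_fwd, 1.0));
    }
    return 0;
}

This prints 4.2 µs and 5.0 µs for the cut-through case, matching the solution above.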


Cost Scaling

• The cost of the machine may be viewed as a fixed cost for the system infrastructure plus an incremental cost of adding processors and memory to the system:

Cost(p, m) = Fixed Cost + Incremental Cost(p, m)
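A minimal C sketch of this linear cost model follows; the constants are made-up illustrative values, not real prices.

#include <stdio.h>

/* cost(p) = fixed cost + per-node incremental cost; with a large
 * fixed cost, the cost per node falls as the machine grows. */
static double system_cost(int nodes, double fixed, double per_node)
{
    return fixed + per_node * nodes;
}

int main(void)
{
    for (int p = 16; p <= 1024; p *= 4) {
        double total = system_cost(p, 100000.0, 5000.0);
        printf("%4d nodes: total %.0f, per node %.1f\n", p, total, total / p);
    }
    return 0;
}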


Realizing Programming Models

• Here we examine what is required to implement programming models on large distributed-memory machines

• These machines have been most strongly associated with message-passing programming models

• Shared address space programming models have become increasingly important and well represented


• Recall the concept of a communication abstraction, which defines the set of communication primitives provided to the user

• These primitives could be realized directly in hardware, via system software, or through some combination of the two, as shown in the figure below


• In large-scale parallel machines the programming model is realized in a similar manner, except that the primitive events are transactions across the network, that is, network transactions rather than bus transactions

• A network transaction is a one-way transfer of information from an output buffer at the source to an input buffer at the destination that causes some kind of action at the destination, the occurrence of which is not directly visible at the source, as shown in fig (next slide)

[Figure: a network transaction: a one-way transfer of information from a source output buffer to a destination input buffer]

Primitive Network Transactions

• Before starting a bus transaction, a protection check has been performed as part of the virtual-to-physical address translation

• The format of information in a bus transaction is determined by the physical wires of the bus, i.e. the data lines, address lines and command lines

• The information to be transferred onto the bus is held in special output registers viz., address, command and data registers until it can be driven onto the bus


• A bus transaction begins with arbitration for the medium

• Most buses employ a global arbitration scheme where a processor requesting a transaction asserts a bus request line and waits for the corresponding bus grant

• The destination of the transaction is implicit in the address

• Each module on the bus is configured to respond to a set of physical addresses


• All modules examine the address and one responds to the transaction

• If none responds, the bus controller detects the time-out and aborts the transaction

• Each module includes a set of input registers, capable of buffering any request to which it might respond

• Each bus transaction involves a request followed by a response

• In the case of a read, the response is the data and an associated completion signal

• For a write it is just the completion acknowledgement


• In either case, both the source and destination are informed of the completion of the transaction

• In split-transaction buses, the response phase of the transaction may require rearbitration and may be performed in a different order than the requests

• Care is required to avoid deadlock with split transactions because a module on the bus may be both requesting and servicing transactions


• The module must continue servicing bus requests and accept replies while it is attempting to present its own request

• The bus design ensures that, for any transaction that might be placed on the bus, sufficient input buffering exists to accept the transaction at the destination

• This can be accomplished by providing enough resources or by adding a negative acknowledgement signal (NACK)


Issues present in a network transaction

• Protection: As the number of components becomes larger, the coupling between components looser, and the individual components more complex, there are limits to how much each component can trust the others to operate correctly. In a scalable system, individual components will often perform checks on the network transaction so that an errant program or faulty hardware component cannot corrupt other components of the system.


Format: Most network links are narrow, so the information associated with a transaction is transferred as a serial stream. Typical links are a few (1 to 16) bits wide. The format of the transaction is dictated by how the information is serialized onto the link, so there is a great deal of flexibility in this aspect of the design. The information in a network transaction is an envelope with more information inside. The envelope includes the information the physical network needs to get the packet from its source to its destination port. Some networks are designed to deliver only fixed-size packets; others can deliver variable-size packets.
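As an illustration, a fixed-size packet format might be described by a C struct like the one below; the field names and sizes are hypothetical, not taken from any real network.

#include <stdint.h>

/* Hypothetical packet envelope: the routing information the physical
 * network needs, wrapped around the payload it carries. */
struct packet {
    uint16_t dest_port;   /* physical destination port, used for routing */
    uint16_t src_port;    /* where any response should be sent */
    uint8_t  kind;        /* command, e.g. read request or acknowledgement */
    uint8_t  len;         /* number of payload bytes actually used */
    uint8_t  payload[64]; /* fixed-size body serialized onto the link */
};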


Output Buffering: The source must provide storage to hold the information that is to be serialized onto the link, either in registers, FIFOs, or memory. Since network transactions are one-way and can potentially be pipelined, it may be desirable to provide a queue of output registers. If the packet format is variable up to some moderate size, a similar approach may be adopted where each entry in the output buffer is of variable size. If a packet can be quite long, then typically the output controller contains a buffer of descriptors pointing to the data in memory.
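For the long-packet case, the descriptor scheme might look like this sketch; the types and sizes are assumptions for illustration.

#include <stddef.h>

/* Sketch of a descriptor-based output queue: rather than copying a
 * long packet into the controller, each entry points at data that
 * stays in memory until it is serialized onto the link. */
struct out_desc {
    const void *data; /* packet body, left in place in memory */
    size_t      len;  /* number of bytes to serialize */
    int         dest; /* destination port */
};

struct out_queue {
    struct out_desc ring[64]; /* ring of pending transfers */
    int head;                 /* controller consumes at head */
    int tail;                 /* processor produces at tail */
};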


Media arbitration: There is no global arbitration for access to the network and many network transactions can be initiated simultaneously. Initiation of the network transaction places an implicit claim on resources in the communication path from the source to the destination as well as on resources at the destination. These resources are potentially shared with other transactions. Local arbitration is performed at the source to determine whether or not to initiate the transaction. The resources are allocated incrementally as the message moves forward.


Destination name and routing: The source must be able to specify enough information to cause the transaction to be routed to the appropriate destination. There are many variations in how routing is specified and performed, but basically the source performs a translation from some logical name for the destination to some form of physical address.
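One simple form of this translation is a source-side table lookup, sketched below; the table and names are hypothetical, and a real machine might instead compute the route algorithmically from the topology.

/* Sketch: translate a logical destination node into the physical
 * routing information placed in the packet envelope. */
#define MAX_NODES 1024

static int route_table[MAX_NODES]; /* logical node -> physical port */

static int translate(int logical_node)
{
    return route_table[logical_node];
}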


• Input buffering: At the destination, the information in the network transaction must be transferred from the physical link into some storage element. This may be simple registers or a queue, or it may be delivered directly into memory. The input buffer is in some sense a shared resource used by many remote processors.

• Action: The action taken at the destination may be very simple or complex. In either case, it may involve initiating a response.


• Completion detection: The source has an indication that the transaction has been delivered into the network but usually no indication that it has arrived at its destination. This completion must be inferred from a response, an acknowledgement or some additional transaction.

• Transaction ordering: In a network the ordering is quite weak. Some networks ensure that a sequence of transactions from a given source to a single destination will be seen in order at the destination; others will not even provide this assurance. In either case, no node can perceive the global order.


• Deadlock avoidance: Most modern networks are deadlock free as long as the modules on the network continue to accept transactions. Within the network, this may require restrictions on permissible routes or other special precautions.

• Delivery guarantees: A fundamental decision in the design of a scalable network is the behavior when the destination buffer is full. This is clearly an issue on an end-to-end basis since it is necessary for the source to know whether the destination input buffer is available when it is attempting to initiate a transaction. It is also an issue on a link-by-link basis within the network itself.
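One common way for the source to know whether the destination input buffer has room is end-to-end credits; a minimal sketch follows (all names are illustrative, and real networks also apply the same idea link by link).

#include <stdbool.h>

/* Sketch of credit-based delivery: the source holds one credit per
 * input-buffer slot at the destination and stalls when credits run
 * out; the destination returns a credit as it drains each entry. */
struct credit_state {
    int credits; /* remaining destination buffer slots */
};

static bool try_send(struct credit_state *cs)
{
    if (cs->credits == 0)
        return false; /* destination may be full: do not inject */
    cs->credits--;    /* claim a slot, then launch the transaction */
    return true;
}

static void on_credit_return(struct credit_state *cs)
{
    cs->credits++;    /* destination freed a slot */
}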


Shared Address Space


• Realizing the shared address space communication abstraction requires a two-way request-response protocol, as shown in fig (previous slide)

• A global address is decomposed into a module number and a local address.

• For a read operation, a request is sent to the designated module requesting a load of the desired address and specifying enough information to allow the result to be returned to the requestor through a response network transaction.


• A write is similar, except that the data is conveyed with the address and command to the designated module and the response is merely an acknowledgement to the requestor that the write has been performed. The response informs the source that the request has been received or serviced, depending on whether it is generated before or after the remote action.
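In outline, the read and write protocols can be expressed as request and response transactions. The sketch below is a toy single-process model: the message kinds, the service function, and the in-memory module are hypothetical stand-ins for real network hardware.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical message kinds for the two-way protocol. */
enum txn_kind { READ_REQ, READ_RESP, WRITE_REQ, WRITE_ACK };

struct txn {
    enum txn_kind kind;
    int      src_node;   /* where the response should be sent */
    uint32_t local_addr; /* address within the destination module */
    uint32_t data;       /* payload for writes and read responses */
};

static uint32_t module_mem[256]; /* toy stand-in for a remote module */

/* Destination-side action: service a request, produce a response. */
static struct txn service(struct txn req)
{
    struct txn resp = req;
    if (req.kind == READ_REQ) {
        resp.kind = READ_RESP;
        resp.data = module_mem[req.local_addr];
    } else { /* WRITE_REQ: data travels with address and command */
        module_mem[req.local_addr] = req.data;
        resp.kind = WRITE_ACK; /* merely acknowledges the write */
    }
    return resp;
}

int main(void)
{
    struct txn w = { WRITE_REQ, 0, 42, 7 };
    service(w); /* write request, then acknowledgement */
    struct txn r = { READ_REQ, 0, 42, 0 };
    printf("read returned %u\n", (unsigned)service(r).data); /* prints 7 */
    return 0;
}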


• A send/receive pair in the message-passing model is conceptually a one-way transfer from a source area specified by the source user process to a destination area specified by the destination user process.

• In addition, it embodies a pairwise synchronization event between the two processes.

• The Message Passing Interface (MPI) distinguishes the notion of when a call to a send or receive function returns from when the message operation completes.


• A synchronous send completes once the matching receive has executed, the source data buffer can be reused, and the data is ensured of arriving in the destination receive buffer.

• A buffered send completes as soon as the source data buffer can be reused, independent of whether the matching receive has been issued; the data may have been transmitted or it may be buffered somewhere in the system.


• Buffered send completion is asynchronous with respect to the receiver process

• A receive completes when the message data is present in the receive destination buffer.

• A blocking function, send or receive, returns only after the message operation completes

• A nonblocking function returns immediately, regardless of message completion; additional calls to a probe function are used to detect completion
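These distinctions map directly onto MPI's C interface. The sketch below, assuming it is run with two ranks, contrasts a synchronous blocking send with a nonblocking send whose completion is detected by polling; a buffered send would instead use MPI_Bsend after attaching a buffer with MPI_Buffer_attach.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, msg = 42, recv_val;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous send: completes only once the matching receive
         * has started, embodying the pairwise synchronization event. */
        MPI_Ssend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Nonblocking send: returns immediately; completion must be
         * detected separately, here by polling with MPI_Test. */
        MPI_Request req;
        int done = 0;
        MPI_Isend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        while (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* A receive completes when the data is in the receive buffer. */
        MPI_Recv(&recv_val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(&recv_val, 1, MPI_INT, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}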


• The protocols are concerned only with message operation and completion, regardless of whether the functions are blocking
