Chapter 18
Parallel Processing

Contents
• Multiple Processor Organizations
• Symmetric Multiprocessors
• Cache Coherence and the MESI Protocol
• Clusters
• Nonuniform Memory Access
• Vector Computation

Types of Parallel Processor
• Types of parallel processor systems
– Single instruction, single data stream (SISD)
– Single instruction, multiple data stream (SIMD)
– Multiple instruction, single data stream (MISD)
– Multiple instruction, multiple data stream (MIMD)

Taxonomy of Parallel Processor
(figure: taxonomy of parallel processor organizations)

SISD
• Single processor
• Single instruction stream
• Data stored in single memory
• Uniprocessor

SIMD
• A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
• Each processing element has an associated data memory
• Each instruction is executed on a different set of data by the different processors (see the sketch below)
• Vector and array processors
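
As a rough illustration, the sequential C loop below stands in for what a SIMD machine does in hardware: the same add is applied at every element position, one position per processing element, in lockstep. The function name and signature are illustrative, not from the slides.

    /* Elementwise add: on a SIMD machine each loop iteration would be
       carried out simultaneously by a separate processing element. */
    void vadd(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++)   /* i indexes one element's data */
            c[i] = a[i] + b[i];
    }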

SIMD
• With distributed memory
(figure)

MISD
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence
• Never been implemented

MIMD
• Set of processors
• Simultaneously execute different instruction sequences on different sets of data
• SMPs, clusters, and NUMA systems

MIMD
• With shared memory
(figure)

MIMD
• With distributed memory
(figure)

MIMD - Overview
• General-purpose processors
• Each can process all instructions necessary
• Further classified by method of processor communication

Tightly Coupled - SMP
• Processors share memory and communicate via that shared memory
• Symmetric Multiprocessor (SMP)
– Share single memory or pool
– Shared bus to access memory
– Memory access time to a given area of memory is approximately the same for each processor

Tightly Coupled - NUMA
• Nonuniform memory access
• Access times to different regions of memory may differ

Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed path or network connections

Design Issues
• Physical organization
• Interconnection structures
• Interprocessor communication
• Operating system design
• Application software techniques

SMP Characteristics
• Two or more similar processors
• Share the same main memory and I/O facilities
• All processors share access to I/O devices
• All processors can perform the same functions
• The system is controlled by an integrated OS
– Providing interaction between processors
– Interaction at job, task, file, and data element levels

SMP Advantages
• Performance
– If some work can be done in parallel
• Availability
– Since all processors can perform the same functions, failure of a single processor does not halt the system
• Incremental growth
– User can enhance performance by adding additional processors
• Scaling
– Vendors can offer a range of products based on the number of processors

Multiprogramming & Multiprocessing
• Interleaving (multiprogramming)
(figure)

Multiprogramming & Multiprocessing
• Interleaving and overlapping (multiprocessing)
(figure)

Tightly Coupled
• Time-shared or common bus
• Multiport memory
• Central control unit

Time-Shared Bus
• To facilitate DMA transfers from I/O processors, the following features are provided (see the sketch below)
– Addressing
• Distinguishes modules on the bus to determine the source and destination of data
– Arbitration
• Any I/O module can temporarily function as bus "master"
– Time sharing
• When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved
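
A minimal software analogue of the time-sharing rule, using a mutex to model bus mastership; the function and its parameters are hypothetical, purely to make the locking behavior concrete.

    #include <pthread.h>

    static pthread_mutex_t bus = PTHREAD_MUTEX_INITIALIZER;

    /* While one module holds the bus, all others block here,
       exactly as the time-sharing rule above requires. */
    void bus_transfer(int src_module, int dst_module) {
        pthread_mutex_lock(&bus);    /* arbitration: become bus master */
        /* addressing: src_module and dst_module identify the endpoints;
           the data transfer itself would happen here */
        (void)src_module; (void)dst_module;
        pthread_mutex_unlock(&bus);  /* release mastership */
    }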

Time-Shared Bus
(figure)

Time-Shared Bus: Advantages
• Simplicity
– The simplest approach to multiprocessor organization
• Flexibility
– Easy to expand the system by attaching more processors to the bus
• Reliability
– The bus is essentially a passive medium

Time-Shared Bus: Drawbacks
• Performance
– The speed of the system is limited by the bus cycle time
– To improve performance, each processor should have a local cache
• Reduces the number of bus accesses
– This leads to problems with cache coherence
• Solved in hardware - see later

Multiport Memory
• Multiport memory allows the direct, independent access of main memory modules by each processor and I/O module
• Logic is required to resolve conflicts
• Little or no modification to processors or modules is required

Multiport Memory
• Advantages
– Little or no modification is needed for either processors or I/O modules to accommodate multiport memory
– Better performance than the bus approach
– Security
• Portions of memory can be configured as private to one or more processors and/or I/O modules
• Disadvantages
– Logic associated with memory is required for resolving conflicts
– More complex than the bus approach

Central Control Unit
• Funnels separate data streams back and forth between independent modules: processors, memories, I/O modules
• The controller can buffer requests and perform arbitration and timing functions
• Passes status and control messages between processors
• Performs cache update alerting
• Advantages
– The flexibility and simplicity of interfacing of the bus approach
• Disadvantages
– The control unit is quite complex
– A potential performance bottleneck

Multiprocessor OS Design
• An SMP OS manages processor and other computer resources so that the user perceives a single OS controlling system resources
• The key design issues
– Simultaneous concurrent processes
– Scheduling
– Synchronization
– Memory management
– Reliability and fault tolerance

S/390 SMP
(figure)

S/390 SMP Components
• PU (processor unit)
– CISC microprocessor
– 64-KB L1 cache
• L2 cache
– 384 KB
– Arranged in clusters of two
– Each cluster supports three PUs
• Bus-switching network adapter (BSN)
– Interconnects the L2 caches and the main memory
– Includes a level 3 (L3) cache (2 MB)
• Memory card
– 8 GB per card

Switched Interconnection
• A single bus becomes a bottleneck affecting scalability
• The S/390 copes with this problem in two ways
– Main memory is split into four separate cards, each with its own storage controller that can handle memory accesses at high speeds
– The connection from processors to a single memory card is not in the form of a shared bus but rather point-to-point links, where each link connects a group of three processors via an L2 cache to a BSN

Shared L2 Caches
• Need for shared L2 caches
– In moving from G3 (generation 3) to G4, IBM doubled the speed of the microprocessors. If the G3 organization were retained, a significant increase in bus traffic would occur. At the same time, it was desired to reuse as many G3 components as possible. Without a significant bus upgrade, the BSNs would become a bottleneck
– Analysis of typical S/390 workloads revealed a large degree of sharing of instructions and data among processors
• Hence the use of one or more L2 caches, each of which is shared by multiple processors

L3 Cache
• Each L3 cache provides a buffer between the L2 caches and one memory card
• Provides the data much more quickly than a main memory access if the requested cache line is already shared by other processors but was not recently used by the requesting processor

Memory subsystem   Access penalty (PU cycles)   Cache size   Hit rate (%)
L1 cache           1                            32 KB        89
L2 cache           5                            256 KB       5
L3 cache           14                           2 MB         3
Memory             32                           8 GB         3
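
If the hit rates are read as fractions of all references (they sum to 100%), a back-of-the-envelope expected access penalty follows directly from the table: 0.89 x 1 + 0.05 x 5 + 0.03 x 14 + 0.03 x 32 ≈ 2.5 PU cycles per reference.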

Cache Coherence Problem
• Problem - multiple copies of the same data in different caches
• Can result in an inconsistent view of memory
• Write back
– Write operations are usually made only to the cache
– Main memory is only updated when the corresponding cache line is flushed from the cache
• Write through (the two policies are contrasted in the sketch below)
– All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid
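
A toy model of the two write policies, assuming a single cache line and a flat memory array; all names here are illustrative.

    /* One cache line plus a flat main memory (illustrative toy model). */
    typedef struct { int data; int dirty; } line_t;
    static int memory[1024];

    /* Write-through: the store goes to the cache AND to main memory,
       so memory always holds valid data. */
    void store_write_through(line_t *l, int addr, int v) {
        l->data = v;
        memory[addr] = v;
    }

    /* Write-back: the store touches only the cache; memory is updated
       later, when the dirty line is flushed from the cache. */
    void store_write_back(line_t *l, int v) {
        l->data = v;
        l->dirty = 1;                 /* memory is now stale */
    }

    void flush(line_t *l, int addr) {
        if (l->dirty) memory[addr] = l->data;  /* update memory on flush */
        l->dirty = 0;
    }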

Software Solutions
• Compiler-based coherence mechanisms
• Compiler and operating system deal with the problem
• Overhead transferred to compile time
• Design complexity transferred from hardware to software
• Simplest approach
– Prevent any shared data variables from being cached
• More efficient approach
– Analyze the code to determine safe periods for shared variables

Hardware Solutions
• Cache coherence protocols
• Dynamic recognition at run time of potential inconsistency conditions
• More efficient use of cache
• Hardware schemes can be divided into two categories
– Directory protocols
– Snoopy protocols

Directory Protocols
• Collect and maintain information about copies of data in caches (a possible directory entry is sketched below)
• Directory stored in main memory
• Requests are checked against the directory
• Appropriate transfers are performed
• Creates a central bottleneck
• Effective in large-scale systems with complex interconnection schemes
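
For concreteness, one plausible shape for such a directory entry, assuming a bit-vector of sharers; this layout is an illustration, not the format of any particular machine.

    #include <stdint.h>

    /* Per-line directory entry kept in main memory: records which
       caches hold a copy of the line and whether one copy is dirty. */
    typedef struct {
        uint32_t sharers;  /* one bit per processor caching the line */
        int      dirty;    /* nonzero: one cache holds a modified copy */
        int      owner;    /* owning processor, valid when dirty is set */
    } dir_entry_t;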

Snoopy Protocols
• Distribute cache coherence responsibility among cache controllers
• A cache recognizes that a line is shared
• Updates are announced to other caches
• Suited to bus-based multiprocessors
• Increases bus traffic

Snoopy Protocols
• Two approaches to the snoopy protocol
– Write-invalidate
– Write-update (write-broadcast)
• Performance depends on the number of local caches and the pattern of memory reads and writes
• Some systems use adaptive protocols that employ both approaches

Write Invalidate
• Multiple readers, one writer
• When a write is required, all other caches of the line are invalidated
• The writing processor then has exclusive (cheap) access until the line is required by another processor
• Used in Pentium II and PowerPC systems
• The state of every line is marked as modified, exclusive, shared, or invalid
• MESI

Write Update
• Multiple readers and writers
• When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it

MESI Protocol
• Modified: the line in the cache has been modified and is available only in this cache
• Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache
• Shared: the line in the cache is the same as that in main memory and may be present in another cache
• Invalid: the line in the cache does not contain valid data
(the four states are written out in the code sketch below)
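
A compact way to write the four states down, plus one illustrative snooping rule: what a cache does to a line it holds when it observes another processor's bus read. The handler is a simplification of the full protocol on the following slides.

    /* The four MESI line states, as defined above. */
    typedef enum {
        MODIFIED,   /* dirty; present only in this cache     */
        EXCLUSIVE,  /* clean; present in no other cache      */
        SHARED,     /* clean; may be in other caches too     */
        INVALID     /* holds no valid data                   */
    } mesi_t;

    /* Snooped bus read of a line this cache holds (simplified):
       a modified copy is written back and supplied, and any valid
       copy drops to SHARED because another reader now exists. */
    mesi_t snoop_bus_read(mesi_t s) {
        switch (s) {
        case MODIFIED:   /* write the dirty line back, then share it */
        case EXCLUSIVE:
        case SHARED:
            return SHARED;
        default:
            return INVALID;  /* an invalid line is unaffected */
        }
    }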

MESI State Transition Diagram
(figure)

MESI State Transition Diagram
(a) Line in cache at initiating processor
(figure)

MESI State Transition Diagram
(b) Line in snooping cache
(figure)

Read Miss
• When a read miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• The processor inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction

Read Hit
• When a read hit occurs on a line currently in the local cache, the processor simply reads the required item
• There is no state change
• The state remains modified, shared, or exclusive

Write Miss
• When a write miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• RWITM (read-with-intent-to-modify)
• When the line is loaded, it is immediately marked modified

Write Hit
• When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache (see the sketch below):
• Shared
• Exclusive
• Modified
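
Continuing the mesi_t sketch from the MESI Protocol slide, the three write-hit cases can be written as one switch; the slides name the cases, and the actions here follow the usual MESI rules, simplified.

    /* Local write hit: the resulting state depends on the current one. */
    mesi_t write_hit(mesi_t s) {
        switch (s) {
        case SHARED:
            /* broadcast an invalidate so all other copies are dropped */
            return MODIFIED;
        case EXCLUSIVE:
            return MODIFIED;   /* no other copy exists; mark it dirty */
        case MODIFIED:
            return MODIFIED;   /* already dirty and exclusive */
        default:
            return INVALID;    /* a hit cannot occur on an invalid line */
        }
    }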

L1-L2 Cache Consistency
• Data integrity must be maintained across both levels of cache and across all caches in the SMP configuration
• Extend the MESI protocol to the L1 caches
– Each line in the L1 cache includes bits to indicate the state
• Why?
– Adopt the write-through policy in the L1 cache
• If L1 used write-back, consistency would be more complex
– The write through is to the L2 cache, not to memory

Clustering
• Clustering is an alternative to symmetric multiprocessing (SMP)
• Cluster
– A group of interconnected, whole computers working together as a unified computing resource that can create the illusion of being one machine
• Whole computer
– A system that can run on its own, apart from the cluster
• Benefits
– Absolute scalability
– Incremental scalability
– High availability
– Superior price/performance

Cluster Configurations
(a) Standby server with no shared disk
(figure)

Cluster Configurations
(b) Shared disk
(figure)

Clustering Methods
• Passive standby
– Description: A secondary server takes over in case of primary server failure
– Benefits: Easy to implement
– Limitations: High cost because the secondary server is unavailable for other processing tasks
• Active secondary
– Description: The secondary server is also used for processing tasks
– Benefits: Reduced cost because secondary servers can be used for processing
– Limitations: Increased complexity
• Separate servers
– Description: Separate servers have their own disks. Data is continuously copied from primary to secondary server
– Benefits: High availability
– Limitations: High network and server overhead due to copying operations
• Servers connected to disks
– Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server
– Benefits: Reduced network and server overhead due to elimination of copying operations
– Limitations: Usually requires disk mirroring or RAID technology to compensate for risk of disk failure
• Servers share disks
– Description: Multiple servers simultaneously share access to disks
– Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure
– Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology

Operating System Design Issues
• Failure management
– Failover
– Failback
• Load balancing
• Clusters versus SMP

Failure Management
• A highly available cluster
– A high probability that all resources will be in service
– If a failure does occur, the queries in progress are lost
– Any lost query, if retried, will be serviced by a different computer in the cluster
• A fault-tolerant cluster
– All resources are always available
– Achieved by the use of redundant shared disks and mechanisms for backing out uncommitted transactions
– Failover: switching resources over from a failed to an alternative system
– Failback: restoring resources to the original system

Load Balancing
• Incremental scalability
• Automatically include new computers in scheduling
• Middleware needs to recognise that processes may switch between machines

Parallelizing
• Single application executing in parallel on a number of machines in the cluster
– Compiler
• Determines at compile time which parts can be executed in parallel
• These are split off for different computers
– Application
• Application written from scratch to be parallel
• Message passing to move data between nodes
• Hard to program
• Best end result
– Parametric computing
• The problem is repeated execution of an algorithm on different sets of data
• e.g. simulation using different scenarios
• Needs effective tools to organize and run

Cluster Computer Architecture
(figure)

Cluster Middleware
• Unified image to user
– Single system image
• Single point of entry
• Single file hierarchy
• Single control point
• Single virtual networking
• Single memory space
• Single job management system
• Single user interface
• Single I/O space
• Single process space
• Checkpointing
• Process migration

Clusters vs SMP
• Both provide multiprocessor support to high-demand applications
• Both available commercially
– SMP for longer
• The advantages of the SMP approach
– SMP is easier to manage and configure than a cluster
– SMP usually takes up less physical space and draws less power than a cluster
• The advantages of the cluster approach
– Likely to result in clusters dominating the high-performance server market
– Superior in terms of incremental and absolute scalability
– Superior in terms of availability
• All components of the system can readily be made highly redundant

Nonuniform Memory Access (NUMA)
• Alternative to SMP & clustering
• Uniform memory access
– All processors have access to all parts of memory using load & store
– Access time to all regions of memory is the same
– Access time to memory is the same for all processors
– As used by SMP
• Nonuniform memory access
– All processors have access to all parts of memory using load & store
– A processor's access time differs depending on the region of memory
– Different processors access different regions of memory at different speeds
• Cache-coherent NUMA
– Cache coherence is maintained among the caches of the various processors
– Significantly different from SMP and clusters

Motivation
• SMP has a practical limit on the number of processors
– Bus traffic limits scaling to between 16 and 64 processors
• In clusters each node has its own memory
– Apps do not see a large global memory
– Coherence maintained by software, not hardware
• NUMA retains the SMP flavour while giving large-scale multiprocessing
– e.g. Silicon Graphics Origin NUMA: 1024 MIPS R10000 processors
• The objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or other internal interconnect system

CC-NUMA Organization
(figure)

CC-NUMA Operation
• Each processor has its own L1 and L2 cache
• Each node has its own main memory
• Nodes are connected by some networking facility
• Each processor sees a single addressable memory space
• Memory request order (sketched in code below):
– L1 cache (local to processor)
– L2 cache (local to processor)
– Main memory (local to node)
– Remote memory
• Delivered to requesting (local to processor) cache
• Automatic and transparent
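
The lookup order can be written as a short sketch; the helper functions are hypothetical stand-ins for the hardware paths (declared but not implemented), just to make the ordering explicit.

    #include <stdbool.h>

    /* Hypothetical stand-ins for the hardware lookup paths. */
    bool l1_lookup(unsigned addr, int *val);
    bool l2_lookup(unsigned addr, int *val);
    bool addr_is_local(unsigned addr);
    int  local_mem_read(unsigned addr);
    int  remote_fetch(unsigned addr);   /* across the interconnect */

    /* A CC-NUMA load tries each level in order; the remote case is
       handled automatically and transparently to the program. */
    int numa_load(unsigned addr) {
        int val;
        if (l1_lookup(addr, &val)) return val;  /* L1, local to processor */
        if (l2_lookup(addr, &val)) return val;  /* L2, local to processor */
        if (addr_is_local(addr))                /* main memory, local node */
            return local_mem_read(addr);
        return remote_fetch(addr);              /* remote node's memory */
    }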

Organization
• Example
– Suppose processor P2-3 requests memory location 798, which is in the memory of node 1
1. P2-3 issues a read request on the snoopy bus of node 2
2. The directory on node 2 sees the request and recognizes that the location is in node 1
3. Node 2's directory sends a request to node 1
4. Node 1's directory requests the contents of 798
5. Node 1's main memory responds by putting the requested data on the bus
6. Node 1's directory picks up the data from the bus
7. The value is transferred back to node 2's directory
8. Node 2's directory places the data back on node 2's bus
9. The value is picked up and placed in P2-3's cache and delivered to P2-3

Cache Coherence
• Node 1's directory keeps note that node 2 has a copy of the data
• If the data is modified in a cache, this is broadcast to other nodes
• Local directories monitor and purge the local cache if necessary
• The local directory monitors changes to local data in remote caches and marks the memory invalid until writeback
• The local directory forces a writeback if the memory location is requested by another processor

NUMA Pros and Cons
• The advantage of a CC-NUMA system
– Can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes
• The disadvantages of a CC-NUMA system
– Performance can break down if there is too much access to remote memory
• Can be avoided by:
– L1 & L2 cache design reducing all memory accesses
» Needs good temporal locality of software
– Good spatial locality of software
– Virtual memory management moving pages to the nodes that are using them most
– Does not transparently look like an SMP
• Software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
– Availability

Vector Computation
• Maths problems involving physical processes present different difficulties for computation
– Aerodynamics, seismology, meteorology
– Continuous field simulation
• High precision
• Repeated floating-point calculations on large arrays of numbers

Vector Computation
• Supercomputers handle these types of problems
– Hundreds of millions of flops
– $10-15 million
– Optimised for calculation rather than multitasking and I/O
– Limited market
• Research, government agencies, meteorology
• Array processor
– Alternative to supercomputer
– Configured as peripherals to mainframes & minis
– Just runs the vector portion of problems

Example of Vector Addition
(figure)

Matrix Multiplication (C = A x B)
(figure)
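
For reference, the multiplication shown in the figure corresponds to the classic triple loop below (a sketch for square n x n matrices stored row-major; the names are illustrative). The innermost loop is the part a vector unit can execute as vector operations.

    /* C = A x B for n x n row-major matrices: the classic triple loop. */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)         /* inner product of   */
                    sum += A[i*n + k] * B[k*n + j]; /* row i and column j */
                C[i*n + j] = sum;
            }
    }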

Approaches
• General-purpose computers rely on iteration to do vector calculations
• In the example above this needs six calculations
• Vector processing
– Assumes it is possible to operate on a one-dimensional vector of data
– All elements in a particular row can be calculated in parallel
• Parallel processing
– Independent processors functioning in parallel
– Use FORK N to start an individual process at location N
– JOIN N causes N independent processes to join and merge following the JOIN (see the sketch below)
• The O/S coordinates the JOINs
• Execution is blocked until all N processes have reached the JOIN
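
A present-day analogue of FORK/JOIN using POSIX threads: main "forks" N workers and then blocks at the join until all of them have finished. The worker body is a placeholder.

    #include <pthread.h>
    #include <stdio.h>

    #define N 4   /* number of independent processes to fork */

    static void *worker(void *arg) {
        long id = (long)arg;            /* which share of the work is ours */
        printf("process %ld running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (long i = 0; i < N; i++)    /* FORK: start N instruction streams */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < N; i++)    /* JOIN: block until all N finish */
            pthread_join(t[i], NULL);
        return 0;
    }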

Processor Designs
• Pipelined ALU
– Within operations
– Across operations
• Parallel ALUs
• Parallel processors

Approaches to Vector Computation
(a) Pipelined ALU
(figure)

Approaches to Vector Computation
(b) Parallel ALUs
(figure)

Pipelined Processing
(figure)

Ex. Computing C = (s x A) + B
1. Vector load A -> vector register (VR1)
2. Vector load B -> VR2
3. Vector multiply s x VR1 -> VR3
4. Vector add VR3 + VR2 -> VR4
5. Vector store VR4 -> C
(the scalar equivalent is sketched below)
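
Element by element, the five vector instructions compute the same thing as this scalar C loop (a sketch; the function name and signature are ours): steps 1-2 are the loads of a[i] and b[i], step 3 the multiply, step 4 the add, and step 5 the store into c[i].

    /* C = (s x A) + B, written as a scalar loop over the elements. */
    void scale_add(int n, double s, const double *a, const double *b,
                   double *c) {
        for (int i = 0; i < n; i++)
            c[i] = s * a[i] + b[i];   /* multiply (step 3), add (step 4) */
    }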

Chaining
• Cray supercomputers
• A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
• The result from one functional unit is fed immediately into another
• If vector registers are used, intermediate results do not have to be stored in memory

Taxonomy of Computer Organization
(figure)

Taxonomy of Computer Organization
• Parallel processor
– The multiple processors can function cooperatively on a given task
• Vector processor
– Often equated with the pipelined ALU organization
– Can also be designed with a parallel ALU organization or as a parallel processor
• Array processor
– Usually refers to a parallel ALU
– But any of the three organizations can be optimized for the processing of arrays
– Also usually refers to an auxiliary processor attached to a general-purpose processor and used to perform vector computation
• The pipelined ALU organization dominates the marketplace
– Less complex

IBM 3090 Vector Facility
• The IBM facility makes use of a number of vector registers
• Each register is actually a bank of scalar registers
• To compute the vector sum C = A + B, the vectors A and B are loaded into two vector registers
• The data from these registers are passed through the ALU as fast as possible
• The overlap of computation, and the loading of the input data into the registers in a block, results in a significant speedup over an ordinary ALU operation

Organization
• The fixed and predetermined structure of vector data permits housekeeping instructions inside the loop to be replaced by faster internal machine operations
• Data-access and arithmetic operations on several successive vector elements can proceed concurrently, by overlapping such operations in a pipelined design or by performing multiple-element operations in parallel
• The use of vector registers for intermediate results avoids additional storage references

IBM 3090 with Vector Facility
(figure)

Registers
• The IBM organization is referred to as register-to-register, because the vector operands, both input and output, can be stored in vector registers
• The main disadvantage of the use of vector registers is that the programmer or compiler must take them into account

Alternative Programs
(figure)

Alternative Programs
(a) Storage to storage
(figure)

Alternative Programs
(b) Register to register
(figure)

Alternative Programs
(c) Storage to register
(figure)

Alternative Programs
(d) Compound instructions
(figure)

Registers
• The vector registers can also be coupled to form eight 64-bit vector registers
• The architecture specifies that each register contains from 8 to 512 scalar elements
• Three additional registers are needed by the vector facility

Registers of the IBM 3090
(figure)

Compound Instruction
• Instruction execution can be overlapped using chaining to improve performance
• The designers of the IBM vector facility chose not to include this capability, for several reasons
• Compound instructions do not require the use of additional registers for temporary storage of intermediate results, and they require one less register

The Instruction Set
• There are memory-to-register load and register-to-memory store instructions
• Many of the instructions use a three-operand format

IBM 3090 Vector Facility
(figure)