Computer Architecture: Fundamentals and Principles of Computer Design

Solutions Manual

Joseph D. Dumas II

University of Tennessee at Chattanooga Department of Computer Science and Electrical Engineering

Copyright © 2006

CRC Press/Taylor & Francis Group

1 Introduction to Computer Architecture

1. Explain in your own words the differences between computer systems architecture and

implementation. How are these concepts distinct, yet interrelated? Give a historical

example of how implementation technology has affected architectural design (or vice

versa).

Architecture is the logical design of a computer system, from the top level on

down to the subsystems and their components – a specification of how the parts of

the system will fit together and how everything is supposed to function.

Implementation is the physical realization of an architecture – an actual, working

hardware system on which software can be executed.

There are many examples of advances in implementation technology

affecting computer architecture. An obvious example is the advent of magnetic core

memory to replace more primitive storage technologies such as vacuum tubes, delay

lines, magnetic drums, etc. The new memory technology had much greater storage

capacity than was previously feasible. The availability of more main memory

resulted in changes to machine language instruction formats, addressing modes, and

other aspects of instruction set architecture.

2. Describe the technologies used to implement computers of the first, second, third, fourth,

and fifth generations. What were the main new architectural features that were

introduced or popularized with each generation of machines? What advances in software

went along with each new generation of hardware?

First generation computers were unique machines built with very primitive

implementation technologies such as electromagnetic relays and (later) vacuum

tubes. The main new architectural concept was the von Neumann stored-program

paradigm itself. (The early first generation machines were not programmable in the

sense we understand that term today.) Software, for those machines where the

concept was actually relevant, was developed in machine language.

Second-generation computers made use of the recently invented transistor as

a basic switching element. The second generation also saw the advent of magnetic

core memory as a popular storage technology. At least partly in response to these

technological advances, new architectural features were developed including virtual

memory, interrupts, and hardware representation of floating-point numbers.

Advances in software development included the use of assembly language and the

first high-level languages including Fortran, Algol, and COBOL. Batch processing

systems and multiprogramming operating systems were also devised during this

time period.

The third generation featured the first use of integrated circuits (with

multiple transistors on the same piece of semiconductor material) in computers.

Not only was this technology used to create smaller CPUs requiring less wiring

between components, but semiconductor memory devices began to replace core

memory as well. This led to the development of minicomputer architectures that

were less expensive to implement and helped give rise to families of computer

systems sharing a common instruction set architecture. Software advances included

increased use of virtual memory, the development of more modern, structured

programming languages, and the dawn of timesharing operating systems such as

UNIX.

Fourth generation computers were the first machines to use VLSI integrated

circuits including microprocessors (CPUs fabricated on a single IC). VLSI

technology continued to improve during this period, eventually yielding

microprocessors with over one million transistors and large-capacity semiconductor

RAM and ROM devices. VLSI “chips” allowed the development of inexpensive but

powerful microcomputers during the fourth generation. These systems gradually

began to make use of virtual memory, cache memory, and other techniques

previously reserved for mainframes and minicomputers; they provided direct

support for high-level languages either in hardware (CISC) or by using optimizing

compilers (RISC). Other software advances included new languages like BASIC,

Pascal, and C, and the spread of object-oriented languages such as C++. Office software

including word processors and spreadsheet applications helped microcomputers

gain a permanent foothold in small businesses and homes.

Fifth generation computers exhibited fewer architectural innovations than

their predecessors, but advances in implementation technology (including pipelined

and superscalar CPUs and larger, faster memory devices) yielded steady gains in

performance. CPU clock frequencies increased from tens, to hundreds, and

eventually to thousands of megahertz; today, CPUs operating at several gigahertz

are common. “Standalone” systems became less common as most computers were

connected to local area networks and/or the Internet. Object-oriented software

development became the dominant programming paradigm, and network-friendly

languages like Java became popular.

3. What characteristics do you think the next generation of computers (say, 5-10 years from

now) will display?

The answer to this question will undoubtedly vary from student to student,

but might include an increased reliance on networking (especially wireless

networking), increased use of parallel processing, more hardware support for

graphics, sound, and other multimedia functions, etc.

4. What was the main architectural difference between the two early computers ENIAC and

EDVAC?

ENIAC was not a programmable machine. Connections had to be re-wired

to do a different calculation. EDVAC was based on the von Neumann paradigm,

where instructions were not hard-wired but rather resided in main memory along

with the data. The program, and thus the system’s functionality, could be changed

without any modification to the hardware. Thus, EDVAC, and all its successors

based on the von Neumann architecture, were able to run “software” as we

understand it today.

5. Why was the invention of solid state electronics (in particular, the transistor) so important

in the history of computer architecture?

The invention of the transistor, and its subsequent use as a switching element

in computers, enabled many of the architectural enhancements that came about

during the second (and later) generations of computing. Earlier machines based on

vacuum tubes were limited in capability because of the short lifetime of each

individual tube. A machine built with too many (more than a few thousand)

switching elements could not be reliable; it would frequently “go down” due to tube

failures. Transistors, with their much longer life span, enabled the construction of

computers with tens or hundreds of thousands of switching elements, which allowed

more complex architectures to flourish.

6. Explain the origin of the term “core dump.”

The term “core dump” dates to the second and third generations of

computing, when most large computers used magnetic core memory for main

storage. Since core memory was nonvolatile (retained its contents in the absence of

power), when a program crashed and the machine had to be taken down and

restarted, the offending instruction(s) and their operands were still in memory and

could be examined for diagnostic purposes. Some later machines with

semiconductor main memory mimic this behavior by “dumping” an image of a

program’s memory space to disk to aid in debugging in the event of a crash.

7. What technological advances allowed the development of minicomputers, and what was

the significance of this class of machines? How is a microcomputer different from a

minicomputer?

The main technological development that gave rise to minicomputers was the

invention of the integrated circuit. (The shrinking sizes of secondary storage devices

and advances in display technology such as CRT terminals also played a part.) The

significance of these machines was largely due to their reduced cost as compared to

traditional mainframe computers. Because they cost “only” a few thousand dollars

instead of hundreds of thousands or millions, minicomputers were available to

smaller businesses (and to small workgroups or individuals within larger

organizations). This trend toward proliferation and decentralization of computing

resources was continued by the microcomputers of the fourth generation.

The main difference between a microcomputer and a minicomputer is the

microcomputer’s use of a microprocessor (or single-chip CPU) as the main

processing element. Minicomputers had CPUs consisting of multiple ICs or even

multiple circuit boards. The availability of microprocessors, coupled with the

miniaturization and decreased cost of other system components, made computers

smaller and cheaper and thus, for the first time, accessible to the average person.

8. How have the attributes of very high performance systems (a.k.a. supercomputers)

changed over the third, fourth, and fifth generations of computing?

The third generation of computing saw the development of the first

supercomputer-class machines, including the IBM “Stretch”, the CDC 6600 and

7600, the TI ASC, the ILLIAC IV and others. These machines were very diverse

and did not share many architectural attributes.

During the fourth generation, vector machines including the Cray-1 and its

successors (and competitors) became the dominant force in high-performance

computing. By processing vectors (large one-dimensional arrays) of operands in

highly pipelined fashion, these machines achieved impressive performance on

scientific and engineering calculations (though they did not achieve comparable

performance increases on more general applications). Massively parallel machines

(with many, simple processing elements) also debuted during this period.

Vector machines lost popularity in the fifth generation, largely giving way to

highly parallel scalar systems using large numbers of conventional microprocessors.

Many of these systems are cluster systems built around a network of relatively

inexpensive, “commodity” computers.

9. What is the most significant difference between computers of the last 10-15 years versus

those of previous generations?

Fifth generation computers are smaller, cheaper, faster, and have more

memory than their predecessors – but probably the single most significant

difference between modern systems and those of the past is the pervasiveness of

networking. Almost every general-purpose or high-performance system is

connected to a local area network, or a wide area network such as the Internet, via

some sort of wired or wireless network connection.

10. What is the principal performance limitation of a machine based on the von Neumann

(Princeton) architecture? How does a Harvard architecture machine address this

limitation?

The main performance limitation of a von Neumann machine is the “von

Neumann bottleneck” – the single path between the CPU and main memory, over

which instructions as well as data must be accessed. A Harvard architecture

removes this bottleneck by having either separate main memories for instructions

and data (with a dedicated connection to each), or (much more common nowadays)

by having only one main memory, but separate cache memories (see Chapter 2) for

instructions and data. The separate memories can be optimized for access patterns

typical of each type of memory reference in order to maximize data and instruction

bandwidth to the CPU.

11. Summarize in your own words the von Neumann machine cycle.

Fetch instruction, decode instruction, determine operand address(es), fetch

operand(s), perform operation, store result … repeat for next instruction.

12. Does a computer system with high generality tend to have higher quality than other

systems? Explain.

Not necessarily. If anything, a more general architecture tends to be more

complex, as its designers try to make it capable of doing a wide variety of things

reasonably well. This increased complexity, as compared to a more specialized

architecture, may lead to a higher probability of “bugs” in the implementation, all

else being equal.

13. How does “ease of use” relate to “user friendliness”?

Not at all; at least, not directly. User friendliness refers to the end user’s

positive experience with the operating system and applications that run under it.

Ease of use is an attribute that describes how well the architecture facilitates the

development of system software such as operating systems, compilers, linkers, etc.

In other words, it is a measure of “systems programmer friendliness.” While there

is no direct connection, an architecture that is not “easy to use” could possibly give

rise to systems software with a higher probability of bugs, which may ultimately

lead to a lower quality experience on the part of the end user.

14. The obvious benefit of maintaining upward and/or forward compatibility is the ability to

continue to run “legacy” code. What are some of the disadvantages of compatibility?

Building in compatibility with previous machines makes the design of an

architecture more complex. This may result in higher design and implementation

costs, less architectural ease of use, and a higher probability of flaws in the

implementation of the design.

15. Name at least two things (other than hardware purchase price, software licensing cost,

maintenance, and support) that may be considered cost factors for a computer system.

Costs are not always monetary – at least, not directly. Other cost factors,

depending on the nature of the system and where it is used, might include power

consumption, heat dissipation, physical volume, mass, and losses incurred if a

system fails due to reliability issues.

16. Give as many reasons as you can why PC compatible computers have a larger market

share than Macs.

It is probably impossible to know all the reasons, but one of the biggest is

that PCs have an “open”, rather than proprietary, architecture. Almost from the

very beginning, compatible “clones” were available at competitive prices, holding

down not only the initial cost of buying a computer, but also the prices for software

and replacement parts. Success breeds success, and the larger market share meant

that manufacturers who produced PC hardware were able to invest in research and

development that produced better, faster, and more economical PC compatible

machines.

17. One computer system has a 3.2 GHz processor, while another has only a 2.7 GHz

processor. Is it possible that the second system might outperform the first? Explain.

It is entirely possible that this might be the case. CPU clock frequency is only

one small aspect of system performance. Even with a lower clock frequency (fewer

clock cycles occurring each second) the second system’s CPU might outperform the

first because of architectural or implementation differences that result in it

accomplishing more work per clock cycle. And even if the first system’s CPU is

indeed more capable, differences in the memory and/or input/output systems might

still give the advantage to the second system.

18. A computer system of interest has a CPU with a clock cycle time of 2.5 ns. Machine

language instruction types for this system include: integer addition/subtraction/logic

instructions which require 1 clock cycle to be executed; data transfer instructions which

average 2 clock cycles to be executed; control transfer instructions which average 3 clock

cycles to be executed; floating-point arithmetic instructions which average 5 clock cycles

to be executed; and input/output instructions which average 2 clock cycles to be

executed.

a) Suppose you are a marketing executive who wants to hype the performance of this

system. Determine its “peak MIPS” rating for use in your advertisements.

The fastest instructions take only one clock cycle to execute, so in order to

calculate peak MIPS, assume that the whole program uses only these instructions.

That means that the machine will execute one instruction every 2.5 ns. Thus we

calculate:

Instruction execution rate = (1 instruction) / (2.5 * 10^-9 seconds) = 4 * 10^8

instructions/second = 400 * 10^6 instructions/second = 400 MIPS
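
As a quick check of the arithmetic, here is the same calculation as a short Python sketch; the 2.5 ns cycle time and the one-cycle best case come straight from the problem statement:

# Peak MIPS: assume every instruction executed is of the fastest (1-cycle) type.
cycle_time = 2.5e-9                        # seconds per clock cycle (given)
best_case_cpi = 1                          # cycles per instruction, best case
rate = 1 / (cycle_time * best_case_cpi)    # instructions per second
print(rate / 1e6, "MIPS")                  # -> 400.0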

b) Suppose you have acquired this system and want to estimate its performance when

running a particular program. You analyze the compiled code for this program and

determine that it consists of 40% data transfer instructions, 35% integer addition,

subtraction, and logical instructions, 15% control transfer instructions, and 10% I/O

instructions. What MIPS rating do you expect the system to achieve while running

this program?

First, we need to determine the mean number of cycles per instruction using

a weighted average based on the percentages of the different types of instructions:

CPIavg = (0.40)(2 cycles) + (0.35)(1 cycle) + (0.15)(3 cycles) + (0.10)(2 cycles) = (0.80 +

0.35 + 0.45 + 0.20) = 1.80 cycles/instruction

We already determined in part (a) above that if instructions take a single

cycle, then we can execute 400 * 10^6 of them per second. This is another way of

saying that the CPU clock frequency is 400 MHz. Given this knowledge and the

average cycle count per instruction just calculated, we obtain:

Instruction execution rate = (400 M cycles / second) * (1 instruction / 1.8 cycles) ≈

222 M instructions/second = 222 MIPS
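
The same weighted-average calculation can be sketched in a few lines of Python; the mix percentages and cycle counts are those given in the problem, and the 400 MHz clock comes from part (a):

# Weighted-average CPI for the instruction mix, then the resulting MIPS rating.
mix = {"data transfer":    (0.40, 2),
       "integer/logic":    (0.35, 1),
       "control transfer": (0.15, 3),
       "I/O":              (0.10, 2)}
cpi = sum(frac * cycles for frac, cycles in mix.values())   # ~1.80 cycles/instruction
clock_hz = 400e6                                            # from part (a)
mips = clock_hz / cpi / 1e6
print(round(cpi, 2), round(mips))                           # -> 1.8  222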

c) Suppose you are considering purchasing this system to run a variety of programs

using mostly floating-point arithmetic. Of the widely-used benchmark suites

discussed in this chapter, which would be the best to use in comparing this system to

others you are considering?

If general-purpose floating-point performance is of interest, it would be hard

to go wrong by using the SPECfp floating-point CPU benchmark suite (or some

subset of it, if specific types of applications to be run on the system are known).

Other possibilities include the Whetstones benchmark or (if applications of interest

perform vector computations) LINPACK or Livermore Loops. Conversely, you

would definitely not want to compare the systems using any of the integer-only or

non-CPU-intensive benchmarks such as Dhrystones, TPC, etc.

d) What does MFLOPS stand for? Estimate this system’s MFLOPS rating; justify your

answer with reasoning and calculations.

MFLOPS stands for Millions of Floating-point Operations Per Second. Peak

MFLOPS can be estimated in a similar manner to parts (a) and (b) above:

Peak floating-point execution rate = (400 M cycles / second) * (1 FLOP / 5 cycles) =

80 MFLOPS

A more realistic estimate of a sustainable floating-point execution rate would

have to take into account the additional operations likely to be required along with

each actual numeric computation. While this would vary from one program to

another, a reasonable estimate might be that for each floating-point arithmetic

operation, the program might also perform two data transfers (costing a total of

four clock cycles) plus one control transfer (costing three clock cycles). This would

mean that the CPU could only perform one floating-point computation every 12

clock cycles for a sustained execution rate of (400 M cycles / second) * (1 FLOP / 12

cycles) ≈ 33 MFLOPS. The student may come up with a variety of estimates based

on different assumptions, but any realistic estimate would be significantly less than

the 80 MFLOPS peak rate.
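
A short Python sketch of both estimates, under the assumption stated above that each floating-point operation is accompanied by two data transfers and one control transfer:

# Peak vs. estimated sustained MFLOPS for this machine.
clock_hz = 400e6
peak_mflops = clock_hz / 5 / 1e6                       # one FLOP per 5 cycles -> 80.0
cycles_per_flop = 5 + 2 * 2 + 3                        # assumed mix: 12 cycles per FLOP
sustained_mflops = clock_hz / cycles_per_flop / 1e6
print(peak_mflops, round(sustained_mflops, 1))         # -> 80.0  33.3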

19. Why does a hard disk that rotates at higher RPM generally outperform one that rotates at

lower RPM? Under what circumstances might this not be the case?

There are generally three components to the total time required to read or

write data on a rotating disk. These are the time required to step the read/write

head in or out to the desired track, the rotational delay in getting to the start of the

desired sector within that track, and then the time needed to actually read or write

the sector in question. All else being equal, increasing disk RPM reduces the time it

takes for the disk to make a revolution and so tends to reduce the second and third

delay components, while it does nothing to address the first. If the higher-RPM

drive had a longer track-to-track seek time, though, it might take just as long or

even longer, overall, to access desired data as compared with a lower-RPM drive

with shorter track-to-track access time.

20. A memory system can read or write a 64-bit value every 2 ns. Express its bandwidth in

MB/s.

Since one byte equals 8 bits, a 64-bit value is 8 bytes. So we can compute the

memory bandwidth as:

BW = (8 bytes) / (2 * 10^-9 seconds) = 4 * 10^9 bytes/second = 4 GB/s or 4000 MB/s
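
The same result as a two-line Python check:

# One 64-bit (8-byte) transfer every 2 ns.
bw = (64 // 8) / 2e-9            # bytes per second
print(bw / 1e6, "MB/s")          # -> 4000.0 MB/s (i.e., 4 GB/s)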

21. If a manufacturer’s brochure states that a given system can perform I/O operations at 500

MB/s, what questions would you like to ask the manufacturer’s representative regarding

this claim?

One should probably ask under what conditions this data transfer rate can

be achieved. If it is a “peak” transfer rate, it is probably unattainable under any

typical circumstances. It would be very helpful to know the size of the blocks of

data being transferred and the length of time for which this 500 MB/s rate was

sustained. Odds are probably good that if this is a peak rate, that it is only valid for

fairly large block transfers of optimum size, and for very short periods of time. This

may or may not reflect the nature of the I/O demands of a customer’s application.

22. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Implementation - the actual, physical realization of a computer system, as opposed to

the conceptual or block-level design

Babbage’s Analytical Engine - this was the first design for a programmable digital

computer, but a working model was never completed

Integrated circuits - this technological development was an important factor in moving

from second generation to third generation computers

CDC 6600 - this system is widely considered to have been the first supercomputer

Altair - this early microcomputer kit was based on an 8-bit microprocessor; it introduced

10,000 hobbyists to (relatively) inexpensive personal computing

Microcontroller - this type of computer is embedded inside another electronic or

mechanical device such as a cellular telephone, microwave oven, or automobile

transmission

Harvard architecture - a type of computer system design in which the CPU uses

separate memory buses for accessing instructions and data operands

Compatibility - an architectural attribute that expresses the support provided for

previous or other architectures by the current machine

MFLOPS - a CPU performance index that measures the rate at which computations can

be performed on real numbers rather than integers

Bandwidth - a measure of memory or I/O performance that tells how much data can be

transferred to or from a device per unit of time

Benchmark - a program or set of programs that are used as standardized means of

comparing the performance of different computer systems

2 Computer Memory Systems

1. Consider the various aspects of an ideal computer memory discussed in Section 2.1.1 and

the characteristics of available memory devices discussed in Section 2.1.2. Fill in the

columns of the table below with the following types of memory devices, in order from

most desirable to least desirable: magnetic hard disk, semiconductor DRAM, CD-R,

DVD-RW, semiconductor ROM, DVD-R, semiconductor flash memory, magnetic floppy

disk, CD-RW, semiconductor static RAM, semiconductor EPROM.

Cost/bit (will obviously fluctuate somewhat depending on market conditions): CD-

R, DVD-R, CD-RW, DVD-RW, magnetic hard disk, magnetic floppy disk,

semiconductor DRAM, semiconductor ROM, semiconductor EPROM,

semiconductor flash memory, semiconductor static RAM.

Speed (will vary somewhat depending on specific models of devices): semiconductor

static RAM, semiconductor DRAM, semiconductor ROM, semiconductor EPROM,

semiconductor flash memory, magnetic hard disk, DVD-R, DVD-RW, CD-R, CD-

RW, magnetic floppy disk.

Information Density (again, this may vary by specific types of devices): Magnetic

hard disk, DVD-R and DVD-RW, CD-R and CD-RW, semiconductor DRAM,

semiconductor ROM, semiconductor EPROM, semiconductor flash memory,

semiconductor static RAM, magnetic floppy disk.

Volatility: Optical media such as DVD-R, CD-R, DVD-RW, and CD-RW are all

equally nonvolatile. The read-only variants cannot be erased and provide secure

storage unless physically damaged. (The same is true of semiconductor ROM.) The

read-write optical disks (and semiconductor EPROMs and flash memories) may be

intentionally or accidentally erased, but otherwise retain their data indefinitely in

the absence of physical damage. Magnetic hard and floppy disks are nonvolatile

except in the presence of strong external magnetic fields. Semiconductor static

RAM is volatile, requiring continuous application of electrical power to maintain

stored data. Semiconductor DRAM is even more volatile since it requires not only

electrical power, but also periodic data refresh in order to maintain its contents.

Writability (all memory is readable): Magnetic hard and floppy disks and

semiconductor static RAM and DRAM can be written essentially indefinitely, and as

quickly and easily as they can be read. DVD-RW, CD-RW, and semiconductor

flash memory can be written many times, but not indefinitely, and the write

operation is usually slower than the read operation. Semiconductor EPROMs can

be written multiple times, but only in a special programmer, and only after a

relatively long erase cycle under ultraviolet light. DVD-R and CD-R media can be

written once and only once by the user. Semiconductor ROM is pre-loaded with its

binary information at the factory and can never be written by the user.

Power Consumption: All types of optical and magnetic disks as well as

semiconductor ROMs, EPROMs, and flash memories can store data without power

being applied at all. Semiconductor RAMs require continuous application of power

to retain data, with most types of SRAMs being more power-hungry than DRAMs.

(Low-power CMOS static RAMs, however, are commonly used to maintain data for

long periods of time with a battery backup.) While data are being read or written,

all memories require power. Semiconductor DRAM requires relatively little power,

while semiconductor ROMs, flash memories, and EPROMs tend to require more

and SRAMs, more still. All rotating disk drives, magnetic and optical, require

significant power in order to spin the media and move the read/write heads as well

as to actually perform the read and write operations. The specifics vary

considerably from device to device, but those that rotate the media at higher speeds

tend to use slightly more power.

Durability: In general, the various types of semiconductor memories are more

durable than disk memories because they have no moving parts. Only severe

physical shock or static discharges are likely to harm them. (CMOS devices are

particularly susceptible to damage from static electricity.) Optical media are also

very durable; they are nearly impervious to most dangers except that of surface

scratches. Magnetic media such as floppy and hard disks tend to be the least

durable as they are subject to erasure by strong magnetic fields and also are subject

to “head crashes” when physical shock causes the read-write head to impact the

media surface.

Removability/Portability: Flash memory, floppy disks, and optical disks are

eminently portable and can easily be carried from system to system to transfer data.

A few magnetic hard drives are designed to be portable, but most are permanently

installed in a given system and require some effort for removal. Semiconductor

ROMs and EPROMs, if placed in sockets rather than being soldered directly to a

circuit board, can be removed and transported along with their contents. Most

semiconductor RAM devices lose their contents when system power is removed and,

while they could be moved to another system, would not arrive containing any valid

data.

2. Describe in your own words what a hierarchical memory system is and why it is used in

the vast majority of modern computer systems.

A hierarchical memory system is one that is composed of several types of

memory devices with different characteristics, each occupying a “level” within the

overall structure. The higher levels of the memory system (the ones closest to, or a

part of, the CPU) offer faster access but, due to cost factors and limited physical

space, have a smaller storage capacity. Thus, each level can typically hold only a

portion of the data stored in the next lower level. As one moves down to the lower

levels, speed and cost per bit generally decrease, but capacity increases. At the

lowest levels, the devices offer a great deal of (usually nonvolatile) storage at

relatively low cost, but are quite slow. For the overall system to perform well, the

hierarchy must be managed by hardware and software such that the stored items

that are used most frequently are located in the higher levels, while items that are

used less frequently are relegated to the lower levels.

3. What is the fundamental, underlying reason why low-order main memory interleaving

and/or cache memories are needed and used in virtually all high-performance computer

systems?

The main underlying reason why speed-enhancing techniques such as low-

order interleaving and cache continue to be needed and used in computer systems is

that main memory technology has never been able to keep up with the speed of

processor implementation technologies. The CPUs of each generation have always

been faster than any devices (from the days of delay lines, magnetic drums, and core

memory all the way up to today’s high-capacity DRAM ICs) that were feasible,

from a cost standpoint, to be used as main memory. If anything, the CPU-memory

speed gap has widened rather than narrowed over the years. Thus, the speed and

size of a system’s cache may be even more critical to system performance than

almost any other factor. (If you don’t believe this, examine the performance

difference between an Intel Pentium 4 and an otherwise similar Celeron processor.)

4. A main memory system is designed using 15 ns RAM devices using a 4-way low-order

interleave.

(a) What would be the effective time per main memory access under ideal

conditions?

Under ideal conditions, four memory accesses would be in progress at any

given time due to the low-order interleaving scheme. This means that the effective

time per main memory access would be (15 / 4) = 3.75 ns.

(b) What would constitute “ideal conditions”? (In other words, under what

circumstances could the access time you just calculated be achieved?)

The ideal condition for best performance of the memory system would be

continuous access to sequentially numbered memory locations. Equivalently, any

access pattern that consistently used all three of the other “leaves” before returning

to the one just accessed would have the same benefit. Examples would include

accessing every fifth numbered location, or every seventh, or any spacing that is

relatively prime to 4 (the interleaving factor).

(c) What would constitute “worst-case conditions”? (In other words, under what

circumstances would memory accesses be the slowest?) What would the access

time be in this worst-case scenario? If ideal conditions exist 80% of the time and

worst-case conditions occur 20% of the time, what would be the average time

required per memory access?

The worst case would be a situation where every access went to the same

device or group of devices. This would happen if the CPU needed to access every

fourth numbered location (or every eighth, or any spacing that is an integer multiple

of 4). In this case, access time would revert to that of an individual device (15 ns)

and the interleaving would provide no performance benefit at all.

In the hypothetical situation described, we could take a weighted average to

determine the effective access time for the memory system: (0.80)(3.75 ns) +

(0.20)(15 ns) = (3 + 3) = 6 ns.

(d) When ideal conditions exist, we would like the processor to be able to access

memory every clock cycle with no “wait states” (that is, without any cycles

wasted waiting for memory to respond). Given this requirement, what is the

highest processor bus clock frequency that can be used with this memory system?

In part (a) above, we found the best-case memory access time to be 3.75 ns.

Matching the CPU bus cycle time to this value and taking the reciprocal (since f =

1/T) we obtain:

f = 1/T = (1 cycle) / (3.75 * 10^-9 seconds) ≈ 2.67 * 10^8 cycles/second = 267 MHz.
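
Parts (a), (c), and (d) can be checked with a brief Python sketch; the 80%/20% split is the one assumed in part (c):

# 4-way low-order interleave built from 15 ns devices.
device_time = 15e-9                  # access time of one RAM device
ways = 4                             # interleaving factor
best_case = device_time / ways       # (a) ideal overlap -> 3.75 ns
worst_case = device_time             # (c) every access hits the same leaf
average = 0.80 * best_case + 0.20 * worst_case   # (c) weighted mix -> 6 ns
max_bus_freq = 1 / best_case         # (d) one access per bus cycle
print(best_case * 1e9, average * 1e9, max_bus_freq / 1e6)   # -> 3.75 ns, ~6.0 ns, ~266.7 MHz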

(e) Other than increased hardware cost and complexity, are there any potential

disadvantages of using a low-order interleaved memory design? If so, discuss one

such disadvantage and the circumstances under which it might be significant.

The main disadvantage that could come into play is due to the fact that

under ideal conditions, all memory modules are busy all the time. This is good if

only one device (usually the CPU) needs to access memory, but not good if other

devices need to access memory as well (for example, to perform I/O). Essentially all

the memory bandwidth is used up by the first device, leaving little or none for

others.

Another possible disadvantage is lower memory system reliability due to

decreased fault tolerance. In a high-order interleaved system, if one memory device

were to fail, 3/4 of the memory space would still be usable. In the low-order

interleaved case, if one of the four “leaves” fails, the entire main memory space is

effectively lost.

5. Is it correct to refer to a typical semiconductor integrated circuit ROM as a “random

access memory”? Why or why not? Name and describe two other logical organizations

of computer memory that are not “random access.”

It is correct to refer to a semiconductor ROM as a “random access memory”

in the strict sense of the definition – a “random access” memory is any memory

device that has an access time independent of the specific location being accessed.

(In other words, any randomly chosen location can be read or written in the same

amount of time as any other location.) This is equally true of most semiconductor

read-only memories as it is of semiconductor read/write memories (which are

commonly known as “RAMs”). Because of the commonly-used terminology, it is

probably better not to confuse the issue by referring to a ROM IC as a “RAM”,

even though that is technically a correct statement.

Besides random access, the other two logical memory organizations that may

be found in computer systems are sequential access (typical of tape and disk

memories) and associative (or content addressable).

6. Assume that a given system’s main memory has an access time of 6.0 ns, while its cache

has an access time of 1.2 ns (five times as fast). What would the hit ratio need to be in

order for the effective memory access time to be 1.5 ns (four times as fast as main

memory)?

Since effective memory access time in such a system is based on a weighted

average, we would need to solve the following equation:

t_effective = t_cache * (p_h) + t_main * (1 - p_h)

for the particular values given in the problem, as shown:

1.5 ns = (1.2 ns)(p_h) + (6.0 ns)(1 - p_h)

Using basic algebra we solve to obtain p_h = 0.9375.
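
Solving the same equation numerically in Python (a trivial rearrangement, shown only as a check):

# Solve  1.5 = 1.2*p_h + 6.0*(1 - p_h)  for the required hit ratio.
t_target, t_cache, t_main = 1.5, 1.2, 6.0      # all in ns
p_h = (t_main - t_target) / (t_main - t_cache)
print(p_h)                                     # -> 0.9375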

7. A particular program runs on a system with cache memory. The program makes a total

of 250,000 memory references; 235,000 of these are to cached locations.

(a) What is the hit ratio in this case?

p_h = number of hits / (number of hits + number of misses) = 235,000 / 250,000 = 0.94

(b) If the cache can be accessed in 1.0 ns but the main memory requires 7.5 ns for an

access to take place, what is the average time required by this program for a

memory access assuming all accesses are reads?

t_effective = t_cache * (p_h) + t_main * (1 - p_h) = (1.0 ns)(0.94) + (7.5 ns)(0.06) = (0.94 +

0.45) ns = 1.39 ns

(c) What would be the answer to part (b) if a write-through policy is used and 75% of

memory accesses are reads?

If a write-through policy is used, then all writes require a main memory

access and write hits do nothing to improve memory system performance. The

average write access time is equal to the main memory access time, which is 7.5 ns.

The average read access time is equal to 1.39 ns as calculated in (b) above. The

overall average time per memory access is thus given by:

t_effective = (7.5 ns)(0.25) + (1.39 ns)(0.75) = (1.875 + 1.0425) ns = 2.9175 ns
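
All three parts of this problem reduce to simple weighted averages, sketched below in Python:

# (a) hit ratio, (b) average read access time, (c) average with write-through.
hits, refs = 235_000, 250_000
p_h = hits / refs                                        # (a) -> 0.94
t_cache, t_main = 1.0, 7.5                               # ns
t_read = p_h * t_cache + (1 - p_h) * t_main              # (b) -> 1.39 ns
read_frac = 0.75                                         # 75% of accesses are reads
t_avg = read_frac * t_read + (1 - read_frac) * t_main    # (c) writes always go to main memory
print(p_h, round(t_read, 2), round(t_avg, 4))            # -> 0.94  1.39  2.9175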

8. Is hit ratio a dynamic or static performance parameter in a typical computer memory

system? Explain your answer.

Hit ratio is a dynamic parameter in any practical computer system. Even

though the cache and main memory sizes, mapping strategy, replacement policy,

etc. (which can all affect the hit ratio) are constant within a given system, the

proportion of cache hits to misses will still vary from one program to another. It

will also vary widely within a given run, based on such factors as the length of time

the program has been running, the code structure (procedure calls, loops, etc.) and

the properties of the specific data set being operated on by the program.

9. What are the advantages of a set-associative cache organization as opposed to a direct-

mapped or fully associative mapping strategy?

A set-associative cache organization is a compromise between the direct-

mapped and fully associative organizations that attempts to maximize the

advantages of each while minimizing their respective disadvantages. Fully

associative caches are expensive to build but offer a higher hit ratio than direct-

mapped caches of the same size. Direct-mapped caches are cheaper and less

complex to build but performance can suffer due to usage conflicts between lines

with the same index. By limiting associativity to just a few parallel comparisons

(two- and four-way set-associative caches are most common) the set-associative

organization can achieve nearly the same hit ratio as a fully associative design at a

cost not much greater than that of a direct-mapped cache.

10. A computer has 64 MB of byte-addressable main memory. It is proposed to design a 1

MB cache memory with a refill line (block) size of 64 bytes.

(a) Show how the memory address bits would be allocated for a direct-mapped cache

organization.

Since 64M = 2^26, the total number of bits required to address the main

memory space is 26. And since 64 = 2^6, it takes 6 bits to identify a particular byte

within a line. The number of refill lines in the cache is 1M / 64 = 2^20 / 2^6 = 2^14 = 16K.

Since there are 2^14 lines in the cache, 14 index bits are required. 26 total address

bits – 6 “byte” bits – 14 “index” bits leaves 6 bits to be used for the tag. So the

address bits would be partitioned as follows: Tag (6 bits) | Index (14 bits) | Byte (6

bits)

(b) Repeat part (a) for a four-way set-associative cache organization.

For the purposes of this problem, a four-way set-associative cache can be

treated as four direct-mapped caches operating in parallel, each one-fourth the size

of the cache described above. Each of these four smaller units would thus be 256

KB in size, containing 4K = 2^12 refill lines. Thus, 12 bits would need to be used for

the index, and 26 – 6 – 12 = 8 bits would be used for the tag. The address bits would

be partitioned as follows: Tag (8 bits) | Index (12 bits) | Byte (6 bits)

(c) Repeat part (a) for a fully associative cache organization.

In a fully associative cache organization, no index bits are required.

Therefore the tags would be 26 – 6 = 20 bits long. Addresses would be partitioned

as follows: Tag (20 bits) | Byte (6 bits)
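
The three field breakdowns can also be generated by one small helper function (a hypothetical sketch, not part of the text); treating the direct-mapped case as 1-way and the fully associative case as 16K-way set-associative reproduces the answers above:

# Address-field widths for the 64 MB memory / 1 MB cache / 64-byte line example.
import math

mem_bits  = int(math.log2(64 * 2**20))    # 26 address bits for 64 MB
byte_bits = int(math.log2(64))            # 6 bits to select a byte within a line
lines     = (1 * 2**20) // 64             # 16K refill lines in the cache

def fields(ways):
    """Return (tag, index, byte) widths for an n-way set-associative cache;
    ways = 1 is direct-mapped, ways = lines is fully associative."""
    sets = lines // ways
    index_bits = int(math.log2(sets))
    return mem_bits - index_bits - byte_bits, index_bits, byte_bits

print(fields(1))        # direct-mapped       -> (6, 14, 6)
print(fields(4))        # 4-way set-assoc.    -> (8, 12, 6)
print(fields(lines))    # fully associative   -> (20, 0, 6)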

(d) Given the direct-mapped organization, and ignoring any extra bits that might be

needed (valid bit, dirty bit, etc.), what would be the overall size (“depth” by

“width”) of the memory used to implement the cache? What type of memory

devices would be used to implement the cache (be as specific as possible)?

The overall size of the direct-mapped cache would be:

(16K lines) * (64 data bytes + 6 bit tag) = (16,384) * ((64 * 8) + 6) = (16,384 * 518) =

8,486,912 bits. This would be in the form of a fast 16K by 518 static RAM.

(e) Which line(s) of the direct-mapped cache could main memory location

1E0027A₁₆ map into? (Give the line number(s), which will be in the range of 0 to

(n-1) if there are n lines in the cache.) Give the memory address (in hexadecimal)

of another location that could not reside in cache at the same time as this one (if

such a location exists).

To answer this question, we need to write the memory address in binary.

1E0027A hexadecimal equals 01111000000000001001111010 binary. We can break

this down into a tag of 011110, an index of 00000000001001 and a byte offset within

the line of 111010. In a direct-mapped cache, the binary index tells us the number

of the only line that can contain the given memory location. So, this location can

only reside in line 1001₂ = 9 decimal.

Any other memory location with the same index but a different tag could not

reside in cache at the same time as this one. One example of such a location would

be the one at address 2F0027A₁₆.
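
The same decomposition can be done with a few shifts and masks in Python, using the direct-mapped field widths from part (a):

# Decompose address 0x1E0027A: tag = 6 bits, index = 14 bits, byte = 6 bits.
addr = 0x1E0027A
byte_offset = addr & 0x3F              # low 6 bits
index = (addr >> 6) & 0x3FFF           # next 14 bits
tag = (addr >> 20) & 0x3F              # top 6 bits of the 26-bit address
print(tag, index, byte_offset)         # -> 30  9  58   (index 9 = line 9)

# A conflicting address: same index, different tag.
other = 0x2F0027A
print((other >> 6) & 0x3FFF)           # -> 9, so it maps to the same line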

11. Define and describe virtual memory. What are its purposes, and what are the advantages

and disadvantages of virtual memory systems?

Virtual memory is a technique that separates the (virtual) addresses used by

the software from the (physical) addresses used by the memory system hardware.

Each virtual address referenced by a program goes through a process of translation

(or mapping) that resolves it into the correct physical address in main memory, if

such a mapping exists. If no mapping is defined, the desired information is loaded

from secondary memory and an appropriate mapping is created. The translation

process is overseen by the operating system, with much of the work done in

hardware by a memory management unit (MMU) for speed reasons. It is usually

done via a multi-level table lookup procedure, with the MMU internally caching

frequently- or recently-used translations so that the costly (in terms of performance)

table lookups can be avoided most of the time.

The principal advantage of virtual memory is that it frees the programmer

from the burden of fitting his or her code into available memory, giving the illusion

of a large memory space exclusively owned by the program (rather than the usually

much more limited physical main memory space that is shared with other resident

programs). The main disadvantage is the overhead of implementing the virtual

memory scheme, which invariably results in some increase in average access time vs.

a system using comparable technology with only physical memory. Table lookups

take time, and even when a given translation is cached in the MMU’s Translation

Lookaside Buffer, there is some propagation delay involved in address translation.

12. Name and describe the two principal approaches to implementing virtual memory

systems. How are they similar and how do they differ? Can they be combined, and if so,

how?

The two principal approaches to implementing virtual memory (VM) are

demand-paged VM and demand-segmented VM (paging and segmentation, for short).

They are similar in that both map a virtual (or logical) address space to a physical

address space using a table lookup process managed by an MMU and overseen by

the computer’s operating system. They are different in that paging maps fixed-size

regions of memory called pages, while segmentation maps variable-length segments.

Page size is usually determined by hardware considerations such as disk sector size,

while segment size is determined by the structure of the program’s code and data.

A paged system can concatenate the offset within a page with the translated upper

address bits, while a segmented system must translate a logical address into the

complete physical starting address of a segment and then add the segment offset to

that value.

It is possible to create a system that uses aspects of both approaches;

specifically, one in which the variable-length segments are each comprised of one or

more fixed-sized pages. This approach, known as segmentation with paging, trades

off some of the disadvantages of each approach to try to take advantage of their

strengths.

13. What is the purpose of having multiple levels of page or segment tables rather than a

single table for looking up address translations? What are the disadvantages, if any, of

this scheme?

The main purpose of having multiple-level page or segment tables is to

replace one huge mapping table with a hierarchy of smaller ones. The advantage is

that the tables are smaller (remember, they are stored in main memory, though

some entries may be cached) and easier for the operating system to manage. The

disadvantage is that “walking” the hierarchical sequence of tables takes longer than

a single table lookup. Most systems have a TLB to cache recently-used address

translations, though, so this time penalty is usually only incurred once when a given

page or segment is first loaded into memory (or perhaps again later if the TLB fills

up and a displaced entry has to be reloaded).

14. A process running on a system with demand-paged virtual memory generates the

following reference string (sequence of requested pages): 4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3.

The operating system allocates each process a maximum of four page frames at a time.

What will be the number of page faults for this process under each of the following page

replacement policies?

a) LRU 7 page faults

b) FIFO 8 page faults

c) LFU (with FIFO as tiebreaker) 7 page faults
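
These counts can be confirmed with a small simulation. The sketch below is not library code; it assumes a page's LFU reference count applies only while the page is resident (the count restarts at 1 when a page is reloaded), which is the interpretation that yields the answers above:

# Page-fault counts for LRU, FIFO, and LFU (FIFO tiebreaker), 4 page frames.
refs = [4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3]
FRAMES = 4

def simulate(policy):
    frames = []        # resident pages
    last_use = {}      # time of most recent reference (LRU)
    loaded_at = {}     # time the page was loaded (FIFO, LFU tiebreaker)
    count = {}         # references since the page was last loaded (LFU)
    faults = 0
    for t, page in enumerate(refs):
        if page in frames:
            last_use[page] = t
            count[page] += 1
            continue
        faults += 1
        if len(frames) == FRAMES:
            if policy == "FIFO":
                victim = min(frames, key=lambda p: loaded_at[p])
            elif policy == "LRU":
                victim = min(frames, key=lambda p: last_use[p])
            else:   # LFU, with FIFO as the tiebreaker
                victim = min(frames, key=lambda p: (count[p], loaded_at[p]))
            frames.remove(victim)
        frames.append(page)
        last_use[page] = loaded_at[page] = t
        count[page] = 1
    return faults

for policy in ("LRU", "FIFO", "LFU"):
    print(policy, simulate(policy))    # -> LRU 7, FIFO 8, LFU 7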

15. In what ways are cache memory and virtual memory similar? In what ways are they

different?

Cache memory and virtual memory are similar in several ways. Both involve

the interaction between two levels of a hierarchical memory system – one larger and

slower, the other smaller and faster. Both have the goal of performing close to the

speed of the smaller, faster memory while taking advantage of the capacity of the

larger, slower one; both depend on the principle of locality of reference to achieve

this. Both operate on a demand basis and both perform a mapping of addresses

generated by the CPU.

One significant difference is the size of the blocks of memory that are

mapped and transferred between levels of the hierarchy. Cache lines tend to be

significantly smaller than pages or segments in a virtual memory system. Because of

the size of the mapped areas as well as the speed disparity between levels of the

memory system, cache misses tend to be more frequent, but less costly in terms of

performance, than page or segment faults in a VM system. Cache control is done

entirely in hardware, while virtual memory management is accomplished via a

combination of hardware (the MMU) and software (the operating system). Cache

exists for the sole reason of making main memory appear faster than it really is;

virtual memory has several purposes, one of which is to make main memory appear

larger than it is, but also to support multiprogramming, relocation of code and data,

and the protection of each program’s memory space from other programs.

16. In systems which make use of both virtual memory and cache, what are the advantages of

a virtually addressed cache? Does a physically addressed cache have any advantages of

its own, and if so, what are they? Describe a situation in which one of these approaches

would have to be used because the other would not be feasible.

All else being equal, a virtually mapped cache is faster than a physically

mapped cache because no address translation is required prior to checking the tags

to see if a hit has occurred. The appropriate bits from the virtual address are

matched against the (virtual) tags. In a physically addressed cache, the virtual-to-

physical translation must be done before the tags can be matched. A physically

addressed cache does have some advantages, though, including the ability to

perform task switches without having to flush (invalidate) the contents of the cache.

In a situation where the MMU is located on-chip with the CPU while a cache is

located off-chip (for example a level-2 or level-3 cache on the motherboard) the

address is already translated before it appears on the system bus and, therefore,

that cache would have to be physically addressed.

17. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Information density - a characteristic of a memory device that refers to the amount of

information that can be stored in a given physical space or volume

Dynamic Random Access Memory (DRAM) - a semiconductor memory device made

up of a large array of capacitors; its contents must be periodically refreshed in order to

keep them from being lost

Magnetic RAM (MRAM) - a developing memory technology that operates on the

principle of magnetoresistance; it may allow the development of “instant-on” computer

systems

Erasable/Programmable Read-Only Memory (EPROM) - a type of semiconductor

memory device, the contents of which cannot be overwritten during normal operation, but

can be erased using ultraviolet light

Associative memory - this type of memory device is also known as a CAM

Argument register - a register in an associative memory that contains the item to be

searched for

Locality of reference - the principle that allows hierarchical storage systems to function

at close to the speed of the faster, smaller level(s)

Miss - this occurs when a needed instruction or operand is not found in cache and thus a

main memory access is required

Refill line - the unit of information that is transferred between a cache and main memory

Tag - the portion of a memory address that determines whether a cache line contains the

needed information

Fully associative mapping - the most flexible but most expensive cache organization, in

which a block of information from main memory can reside anywhere in the cache

Write-back - a policy whereby writes to cached locations update main memory only

when the line is displaced

Valid bit - this is set or cleared to indicate whether a given cache line has been initialized

with “good” information or contains “garbage” due to not yet being initialized

Memory Management Unit (MMU) - a hardware unit that handles the details of address

translation in a system with virtual memory

Segment fault - this occurs when a program makes reference to a logical segment of

memory that is not physically present in main memory

Translation Lookaside Buffer (TLB) - a type of cache used to hold virtual-to-physical

address translation information

Dirty bit - this is set to indicate that the contents of a faster memory subsystem have

been modified and need to be copied to the slower memory when they are displaced

Delayed page fault - this can occur during the execution of a string or vector instruction

when part of the operand is present in physical main memory and the rest is not

3 Basics of the Central Processing Unit

1. Does an architecture that has fixed-length instructions necessarily have only one

instruction format? If multiple formats are possible given a single instruction size in bits,

explain how they could be implemented; if not, explain why this is not possible.

Not necessarily. It is possible to have multiple instruction formats, all of the

same length. For example, SPARC has three machine language instruction formats,

all 32 bits long. This is implemented by decoding a subset of the op code bits (in the

SPARC example, the two leftmost bits) and using the decoded outputs to determine

how to decode the remaining bits of the instruction.

2. The instruction set architecture for a simple computer must support access to 64 KB of

byte-addressable memory space and eight 16-bit general-purpose CPU registers.

(a) If the computer has three-operand machine language instructions that operate on

the contents of two different CPU registers to produce a result that is stored in a

third register, how many bits are required in the instruction format for addressing

registers?

Since there are 8 = 2^3 registers, three bits are needed to identify each register

operand. In this case there are two source registers and one destination register, so

it would take 3 * 3 = 9 bits in the instruction to address all the needed registers.

(b) If all instructions are to be 16 bits long, how many op codes are available for the

three-operand, register operation instructions described above (neglecting, for the

moment, any other types of instructions that might be required)?

16 bits total minus 9 bits for addressing registers leaves 7 bits to be used as

the op code. Since 2^7 = 128, there are 128 distinct op codes available to specify such

instructions.
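
As a minimal check of the bit budget in Python:

# 16-bit instruction, eight registers, three register operands.
reg_field = 3                       # bits to name one of 2^3 = 8 registers
operand_bits = 3 * reg_field        # two sources + one destination
opcode_bits = 16 - operand_bits
print(operand_bits, opcode_bits, 2 ** opcode_bits)    # -> 9 7 128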

(c) Now assume (given the same 16-bit instruction size limitation) that, besides the

instructions described in (a), there are a number of additional two-operand

instructions to be implemented, for which one operand must be in a CPU register

while the second operand may reside in a main memory location or a register. If

possible, detail a scheme that allows for at least 50 register-only instructions of

the type described in (a) plus at least 10 of these two-operand instructions. (Show

how you would lay out the bit fields for each of the machine language instruction

formats.) If this is not possible, explain in detail why not and describe what

would have to be done to make it possible to implement the required number and

types of machine language instructions.

We can accomplish this design goal by adopting two instruction formats that

could be distinguished by a single bit. Format 1 will have a specific bit (say, the

leftmost bit) = 0 while format 2 will have a 1 in that bit position. Three-operand

(register-only) instructions would use format 1. With one bit already used to

identify the format, of the remaining 15 bits, 6 would constitute the op code (giving

us 2^6 = 64 possible instructions of this type). The other 9 bits would be used to

identify source register 1 (3 bits), source register 2 (3 bits), and the destination

register (3 bits).

The two-operand instructions would use format 2. These instructions cannot

use absolute addressing for memory operands because that would require 16 bits for

the memory address alone, and there are only 16 total bits per instruction.

However, register indirect addressing or indexed addressing could be used to locate

memory operands. In this format, 3 of the remaining 15 bits would be needed to

identify the operand that is definitely in a register. One additional bit would be

required to tell whether the second operand was in a register or in a memory

location. Then, another set of 3 bits would identify a second register that contains

either the second operand or a pointer to it in memory. This leaves 8 bits, of which

4 or more would have to be used for op code bits since we need at least 10

instructions of this type. The remaining 4 bits could be used to provide additional

op codes or as a small displacement for indexed addressing.
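
As a quick sanity check of the field widths worked out above, the following Python sketch packs sample instructions into the two proposed 16-bit formats. The field names, ordering, and example values are illustrative assumptions, not part of the original problem statement.

    # Sketch of the two proposed 16-bit formats (field ordering assumed):
    #   Format 1 (bit 15 = 0): format bit | 6-bit op code | rs1 (3) | rs2 (3) | rd (3)
    #   Format 2 (bit 15 = 1): format bit | 4-bit op code | reg (3) | mem flag (1) | reg2 (3) | disp (4)

    def pack_format1(opcode, rs1, rs2, rd):
        assert opcode < 64 and max(rs1, rs2, rd) < 8
        return (0 << 15) | (opcode << 9) | (rs1 << 6) | (rs2 << 3) | rd

    def pack_format2(opcode, reg, operand_in_memory, reg2, disp):
        assert opcode < 16 and max(reg, reg2) < 8 and disp < 16
        return (1 << 15) | (opcode << 11) | (reg << 8) | (int(operand_in_memory) << 7) | (reg2 << 4) | disp

    word1 = pack_format1(opcode=5, rs1=1, rs2=2, rd=3)
    print(f"{word1:016b}")   # 0000101001010011 -> 0 | 000101 | 001 | 010 | 011
    word2 = pack_format2(opcode=3, reg=6, operand_in_memory=True, reg2=4, disp=9)
    print(f"{word2:016b}")   # 1001111011001001 -> 1 | 0011 | 110 | 1 | 100 | 1001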


3. What are the advantages and disadvantages of an instruction set architecture with

variable-length instructions?

For an architecture with a sufficient degree of complexity, it is natural that

some instructions may be expressible in fewer bits than others. (Some may have

fewer options, operands, addressing modes, etc. while others have more

functionality.) Having variable-length instructions means that the simpler

instructions need take up no more space than absolutely necessary. (If all

instructions are the same length, then even the simplest ones must be the same size,

in bits, as the most complex.) Variable-length instructions can save significant

amounts of code memory, but at the expense of requiring a more complex decoding

scheme that can complicate the design of the control unit. Variable-length

instructions also make it more difficult to pipeline the process of fetching, decoding,

and executing instructions (see Chapter 4).

4. Name and describe the three most common general types (from the standpoint of

functionality) of machine instructions found in executable programs for most computer

architectures.

In most executable programs (one can always find isolated counter-

examples) the bulk of the machine instructions are, usually in this order: data

transfer instructions, computational (arithmetic, logic, shift, etc.) instructions, and

control transfer instructions.

5. Given that we wish to specify the location of an operand in memory, how does indirect

addressing differ from direct addressing? What are the advantages of indirect addressing,

and in what circumstances is it clearly preferable to direct addressing? Are there any

disadvantages of using indirect addressing? How is register indirect addressing different

from memory indirect addressing, and what are the relative advantages and disadvantages

of each?

Direct (or absolute) addressing specifies the location of the operand(s)


explicitly as part of the machine language instruction (as opposed to immediate

addressing, which embeds the operand itself in the instruction). Indirect addressing

uses the machine language instruction to specify not the location of the operand, but

the location of the location of the operand. (In other words, it tells where to find a

pointer to the operand.) The advantage of indirect addressing is that if a given

instruction is executed more than once (as in a program loop) the operand does not

have to be in the same memory location every time. This is of particular use in

processing tables, arrays, and other multi-element data structures. The only real

disadvantages of indirect addressing vs. direct addressing are an increase in

complexity and a decrease in processing speed due to the need to dereference the

pointer.

Depending on the architecture, the pointer (which contains the operand

address) specified by the instruction may reside in either a memory location

(memory indirect addressing) or a CPU register (register indirect addressing).

Memory indirect addressing allows a virtually unlimited number of pointers to be

active at once, but requires an additional memory access – which complicates and

slows the execution of the instruction, exacerbating the disadvantages mentioned

above. To avoid this complexity, most modern architectures support only register

indirect addressing, which limits the pointers to exist in the available CPU registers

but allows instructions to execute more quickly.

6. Various computer architectures have featured machine instructions that allow the

specification of three, two, one, or even zero operands. Explain the tradeoffs inherent to

the choice of the number of operands per machine instruction. Pick a current or historical

computer architecture, find out how many operands it typically specifies per instruction,

and explain why you think its architects implemented the instructions the way they did.

The answer to this question will obviously depend on the architecture chosen.

The main tradeoff is programmer (or compiler) convenience, which favors more


operands per instruction, versus the desire to keep instructions smaller and more

compact, which favors fewer operands per instruction.

7. Why have load-store architectures increased in popularity in recent years? (How do their

advantages go well with modern architectural design and implementation technologies?)

What are some of their less desirable tradeoffs vs. memory-register architectures, and

why are these not as important as they once were?

Load/store architectures have become popular in large measure because the

decoupling of memory access from computational operations on data keeps the

control unit logic simpler and makes it easier to pipeline the execution of

instructions (see Chapter 4). Simple functionality of instructions makes it easier to

avoid microcode and use only hardwired control logic, which is generally faster and

takes up less “real estate” on the IC. Not allowing memory operands also helps keep

instructions shorter and can help avoid the need to have multiple instruction

formats of different sizes.

Memory-register architectures, on the other hand, tend to require fewer

machine language instructions to accomplish the same programming task, thus

saving program memory. The compiler (or the assembly language programmer)

has more flexibility and not as many registers need to be provided if operations on

memory contents are allowed. Given the decrease in memory prices, the

improvements in compiler technology, and the shrinking transistor sizes over the

past 20 years or so, the advantages of memory-register architectures have been

diminished and load/store architectures have found greater favor.

8. Discuss the two historically dominant architectural philosophies of CPU design:

a) Define the acronyms CISC and RISC and explain the fundamental differences

between the two philosophies.

CISC stands for “Complex Instruction Set Computer” and RISC stands for

“Reduced Instruction Set Computer.” The fundamental difference between these


two philosophies of computer system design is the choice of whether to put the

computational complexity required of the system in the hardware or in the software.

CISC puts the complexity in the hardware. The idea of CISC was to support high-

level language programming by making the machine directly execute high-level

functions in hardware. This was usually accomplished by using microcode to

implement those complex functions. Programs were expected to be optimized by

coding in assembly language. RISC, on the other hand, puts the complexity in the

software (mainly, the high level language compilers). No effort was made to

encourage assembly language programming; instead there is a reliance on

optimization by the compiler. The RISC idea was to make the hardware as simple

and fast as possible by eliminating microcode and explicitly encouraging pipelining

of the hardware. Any task that cannot be quickly and conveniently done in

hardware is left for the compiler to implement by combining simpler functions.

b) Name one commercial computer architecture that exemplifies the CISC

architectural approach and one other that exemplifies RISC characteristics.

CISC examples include the DEC VAX, Motorola 680x0, Intel x86, etc. RISC

examples include the IBM 801, Sun SPARC, MIPS Rx000, etc.

c) For each of the two architectures you named in (b) above, describe one

distinguishing characteristic not present in the other architecture that clearly

shows why one is considered a RISC and the other a CISC.

Answers will vary depending on the architectures chosen, but may include

the use of hardwired vs. microprogrammed control, the number and complexity of

machine language instructions and memory addressing modes, the use of fixed- vs.

variable-length instructions, a memory-register vs. a load/store architecture, the

number of registers provided and their functionality, etc.

d) Name and explain one significant advantage of RISC over CISC and one

significant advantage of CISC over RISC.


Significant advantages of RISC include simpler, hardwired control logic that

takes up less space (leaving room for more registers, on-chip cache and/or floating-

point hardware, etc.) and allows higher CPU clock frequencies, the ability to execute

instructions in fewer clock cycles, and ease of instruction pipelining. Significant

advantages of CISC include a need for fewer machine language instructions per

program (and thus a reduced appetite for code memory), excellent support for

assembly language programming, and less demand for complexity in, and

optimization by, the compilers.

9. Discuss the similarities and differences between the programmer-visible register sets of

the 8086, 68000, MIPS, and SPARC architectures. In your opinion, which of these CPU

register organizations has the most desirable qualities, and which is least desirable? Give

reasons to explain your choices.

The 8086 has a small number of highly specialized registers. Some are for

addresses, some for computations; many functions can only be carried out using a

specific register or a limited subset of the registers. The 68000, another CISC

processor, has a few more (16) working registers and divides them only into two

general categories: data registers and address registers. Within each group,

registers have identical functionality (except for address register 7 which acts as the

stack pointer).

MIPS and SPARC, both RISC designs, have larger programmer-visible

register sets (32 working registers) and do not distinguish between registers used for

data vs. registers used for pointers to memory. For the most part, “all registers are

created equal”, though in both architectures register 0 is a ROM location that

always contains the value 0. SPARC processors actually have a variable number

(up to hundreds) of registers and use a hardware register renaming scheme to make

different subsets of 32 of them visible at different times. This “overlapping register

window” scheme was devised to help optimize parameter passing across procedure


calls. Students can be expected to have different preferences, but should point to

specific advantages of a given architecture to back up their choices.

10. A circuit is to be built to add two 10-bit numbers x and y plus a carry-in. (Bit 9 of each

number is the MSB, while bit 0 is the LSB. c₀ is the carry-in to the LSB position.) The

propagation delay of any individual AND or OR gate is 0.4 ns, and the carry and sum

functions of each full adder are implemented in sum of products form.

(a) If the circuit is implemented as a ripple carry adder, how much time will it take to

produce a result?

Each full adder takes (0.4 + 0.4) = 0.8 ns to produce a result (sum and carry

outputs). Since the carry output of each adder is an input to the adder in the next

more significant position, the operation of the circuit is sequential and it takes 10 *

(0.8 ns) = 8.0 ns to compute the sum of two 10-bit numbers.

(b) Given that the carry generate and propagate functions for bit position i are given

by gᵢ = xᵢyᵢ and pᵢ = xᵢ + yᵢ, and that each required carry bit (c₁...c₁₀) is developed

from the least significant carry-in c₀ and the appropriate gᵢ and pᵢ functions using

AND-OR logic, how much time will a carry lookahead adder circuit take to

produce a result? (Assume AND gates have a maximum fan-in of 8 and OR gates

have a maximum fan-in of 12.)

In a carry lookahead adder, all the gᵢ and pᵢ functions are generated

simultaneously by parallel AND and OR gates. This takes 0.4 ns (one gate delay

time). Since cᵢ₊₁ = gᵢ + pᵢcᵢ, generating all the carries should take two more gate

delay times or 0.8 ns. However, we have to consider the gate fan-in restrictions.

Since OR gates can have a fan-in of 12 and we never need to OR that many terms,

that restriction does not matter; but the fan-in limitation on the AND gates means

that an extra level of logic will be needed (since there are cases where we have to

AND more than 8 terms). Thus, 3 * (0.4 ns) = 1.2 ns is required for this AND-OR

logic for a total of 4 * (0.4 ns) = 1.6 ns to generate all the carries. Once the carries


are available, all full adders operate simultaneously, requiring an additional 2 * (0.4

ns) = 0.8 ns to generate the final result. The overall propagation delay of the circuit

is 6 gate delays or 2.4 ns.
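
As a purely behavioral cross-check of the equations used above, the short Python sketch below derives all the carries from the gᵢ and pᵢ terms and compares the result with ordinary addition. It models the logic functions only; gate delays and fan-in limits are not represented.

    # Behavioral model of a 10-bit carry lookahead adder: c(i+1) = g(i) OR (p(i) AND c(i)).
    def cla_add(x, y, c0=0, bits=10):
        g = [((x >> i) & (y >> i)) & 1 for i in range(bits)]   # generate:  g_i = x_i AND y_i
        p = [((x >> i) | (y >> i)) & 1 for i in range(bits)]   # propagate: p_i = x_i OR y_i
        c = [c0]
        for i in range(bits):
            c.append(g[i] | (p[i] & c[i]))                     # c_{i+1} = g_i + p_i c_i
        s = 0
        for i in range(bits):
            s |= (((x >> i) ^ (y >> i) ^ c[i]) & 1) << i       # s_i = x_i XOR y_i XOR c_i
        return s, c[bits]                                      # 10-bit sum and carry-out

    assert cla_add(300, 200) == (500, 0)
    assert cla_add(0b1111111111, 1) == (0, 1)                  # wraps around, carry-out = 1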

11. Under what circumstances are carry save adders more efficient than “normal” binary

adders that take two operands and produce one result? Where, in a typical general-

purpose CPU, would one be most likely to find carry save adders?

Carry save adders are more efficient when there are several numbers, rather

than just two, that need to be added together. While multi-operand addition is not a

typical operation supported by general-purpose computers, multiplication (which

requires the addition of several partial products) is such an operation. Carry save

adders are frequently used in multiplication hardware.

12. Given two 5-bit, signed, two’s complement numbers x = -6 = 11010₂ and y = +5 =

00101₂, show how their 10-bit product would be computed using Booth’s algorithm (you

may wish to refer to Figures 3.24, 3.25, and 3.26).

M = -6 = 11010; therefore –M = +6 = 00110. The initial contents of P are

0000000101 (upper part zero, lower part = the multiplier (+5)). The computation

proceeds as follows:

         P           C
      00000 00101    0
     +00110               (1. add –M)
      00110 00101    0
      00011 00010    1    (then shift right)
     +11010               (2. add +M)
      11101 00010    1
      11110 10001    0    (then shift right)
     +00110               (3. add –M)
      00100 10001    0
      00010 01000    1    (then shift right)
     +11010               (4. add +M)
      11100 01000    1
      11110 00100    0    (then shift right)
      11111 00010    0    (5. just shift right)

Done. Answer = 1111100010 = -30 (represented as the two’s complement of 30).
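
The hand trace above can be verified with a small Python model of Booth’s algorithm. This is a sketch written for this specific 5-bit example; the variable names are not taken from the text or its figures.

    # Minimal model of Booth's algorithm for 5-bit two's complement operands.
    def booth_multiply(multiplicand, multiplier, bits=5):
        mask = (1 << bits) - 1
        width = 2 * bits
        M = multiplicand & mask
        P = multiplier & mask              # lower half of P holds the multiplier
        C = 0                              # extra bit to the right of P
        for step in range(bits):
            pair = ((P & 1) << 1) | C      # examine (LSB of P, C)
            if pair == 0b10:               # 1,0 -> add -M to the upper half
                P = (P + (((-M) & mask) << bits)) & ((1 << width) - 1)
            elif pair == 0b01:             # 0,1 -> add +M to the upper half
                P = (P + (M << bits)) & ((1 << width) - 1)
            C = P & 1                      # arithmetic shift right of the (P, C) pair
            P = (P >> 1) | ((P >> (width - 1)) << (width - 1))
        return P                           # 10-bit two's complement product

    product = booth_multiply(-6, 5)
    print(f"{product:010b}")                                       # 1111100010
    print(product - (1 << 10) if product & (1 << 9) else product)  # -30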


13. Discuss the similarities and differences between “scientific notation” (used for manual

calculations in base 10) and floating-point representations for real numbers used in digital

computers.

Numbers expressed in scientific notation and floating-point format have a

great deal in common. Both approaches divide a number into its mantissa

(significant digits) and its exponent (the power of the system’s base or radix) such

that the number is the product of the mantissa times the base raised to the given

power. (Both the mantissa and the exponent are signed values, allowing us to

represent a wide range of positive and negative values.) In both cases, we gain the

considerable advantage of being able to represent very large and very small

numbers without having to write (or store) a large number of digits that are either

zero or insignificant.

The main difference is that numbers in scientific notation usually work with

base 10, with the mantissa and exponent themselves expressed in decimal notation

also. Computer floating-point formats invariably express the mantissa and

exponent in some type of binary format, and the radix is either 2 or some small

power of 2 (i.e. 4, 8, or 16). Of course, the signs of the mantissa and exponent must

be represented in sign-magnitude, complement, or biased notation since the

computer cannot directly represent the symbols “+” and “-” in hardware. Floating-

point formats also need representations for special cases like zero, infinity, and

invalid or out-of-range results that may result from computations, so such results

can be distinguished from normalized floating-point values.

14. Why was IEEE 754-1985 a significant development in the history of computing,

especially in the fields of scientific and engineering applications?

IEEE 754 was significant because it represented the first time that a

consortium of major computer manufacturers was able to come to an agreement on


a floating-point arithmetic standard that all of them could and would support.

Before the adoption of IEEE 754, every vendor had its own, unique floating-point

format (some more mathematically robust than others). Moving binary files

containing real-number data between incompatible systems was impossible. Porting

any kind of code that performed arithmetic on real numbers (as most scientific and

engineering applications do extensively) from one system to another was very

tedious, tricky, and prone to surprising – and in some cases, mathematically invalid

– results. IEEE 754, once adopted and generally supported, essentially put an end

to floating-point compatibility problems. (Much to the joy of scientific applications

programmers everywhere!)

15. Assume that IEEE has modified standard 754 to allow for “half-precision” 16-bit

floating-point numbers. These numbers are stored in similar fashion to the single

precision 32-bit numbers, but with smaller bit fields. In this case, there is one bit for the

sign, followed by six bits for the exponent (in excess-31 format), and the remaining 9 bits

are used for the fractional part of the normalized mantissa. Show how the decimal value

+17.875 would be represented in this format.

The fact that the number in question is positive means that the leftmost (sign)

bit would be 0. To determine the rest of the bits, we express 17.875 in binary as

10001.111 or, normalizing, 1.0001111 × 2⁴. The stored exponent would thus be 4 +

31 = 35 decimal or 100011 binary. Stripping off the leading “1.” and padding on the

right with zeroes, the significand would be 000111100. So +17.875 would be

expressed in this fictional format as 0100011000111100 binary or 463C hexadecimal.
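
The bit pattern can be checked with a few lines of Python. This encoder is a sketch that assumes a normalized value which fits exactly in the 9 fraction bits, as +17.875 does; it does not handle rounding, zero, or special values.

    # Encode a value in the hypothetical 16-bit format:
    # 1 sign bit | 6-bit exponent (excess-31) | 9-bit fraction.
    def encode_half(value):
        sign = 1 if value < 0 else 0
        mantissa, exponent = abs(value), 0
        while mantissa >= 2:
            mantissa /= 2
            exponent += 1
        while mantissa < 1:
            mantissa *= 2
            exponent -= 1
        fraction = round((mantissa - 1) * (1 << 9))   # strip the hidden leading 1, keep 9 bits
        return (sign << 15) | ((exponent + 31) << 9) | fraction

    word = encode_half(17.875)
    print(f"{word:016b}  {word:04X}")   # 0100011000111100  463C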

16. Show how the decimal value -267.5625 would be represented in IEEE-754 single and

double precision formats.

Because the number is negative, the leftmost (sign) bit will be 1 in either

format. To determine the rest of the bits, we express 267.5625 in binary as

100001011.1001 or, normalizing, 1.000010111001 × 2⁸. In single precision, the stored


exponent would be 8 + 127 = 135 decimal or 10000111 binary. In double precision,

the stored exponent would be 8 + 1023 = 1031 decimal or 10000000111 binary.

Stripping off the leading “1.” and padding on the right with zeroes, the significand

would be 00001011100100000000000 (single precision) or

0000101110010000000000000000000000000000000000000000 (double precision). So

–267.5625 would be expressed in the IEEE single precision format as

11000011100001011100100000000000 binary or C385C800 hexadecimal, and in the

IEEE double precision format as

1100000001110000101110010000000000000000000000000000000000000000 binary

or C070B90000000000 hexadecimal.
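
Both encodings can be confirmed with Python’s struct module, which packs values using the standard IEEE 754 single and double precision formats:

    # Verify the IEEE 754 encodings of -267.5625.
    import struct

    print(struct.pack(">f", -267.5625).hex().upper())   # C385C800
    print(struct.pack(">d", -267.5625).hex().upper())   # C070B90000000000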

17. Consider a simple von Neumann architecture computer like the one discussed in Section

3.3.1 and depicted in Figure 3.32. One of its machine language instructions is an ANDM

instruction which reads the contents of a memory location (specified by direct

addressing), bitwise ANDs this data with the contents of a specified CPU register, then

stores the result back in that same register. List and describe the sequence of steps that

would have to be carried out under the supervision of the processor’s control unit in order

to implement this instruction.

1. MAR ← PC: Copy the contents of the Program Counter (the address of the instruction) to the MAR so they can be output on the address bus.
2. Read; PC ← PC + 1: Activate the Read control signal to the memory system to initiate the memory access. While the memory read is taking place, increment the Program Counter so that it points to the next sequential instruction in the program.
3. MDR ← [MAR]: When the memory read is complete, transfer the contents of the memory location over the data bus and latch them into the MDR.
4. IR ← MDR: Transfer the contents of the MDR (the machine language instruction) to the IR and decode the instruction. At this point, the control unit discovers that this is an “AND with memory direct” instruction.
5. MAR ← IRlow: Transfer the lower 8 bits from the IR (the operand address) to the MAR to prepare to read the operand.
6. Read: Activate the Read control signal to the memory system to initiate the memory access for the operand.
7. MDR ← [MAR]: When the memory read is complete, transfer the contents of the memory location over the data bus and latch them into the MDR.
8. MDRout; R4outB; AND: Transfer the contents of the MDR (the memory operand) and register 4 (the register operand) to the ALU inputs and activate the control signal telling the ALU to perform the logical AND function.
9. R4 ← ALU: Transfer the output of the ALU (the logical AND of the operands) to the destination register (R4). Execution of the current instruction is now complete and the control unit is ready to fetch the next instruction.
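
The sequence above can also be expressed as a toy register-transfer walk-through in Python. The memory contents, the opcode value, and the 8-bit opcode / 8-bit address split are illustrative assumptions only; they are not taken from Figure 3.32.

    # Toy walk-through of the ANDM control steps (contents and encodings assumed).
    memory = {0x10: 0x3C25, 0x25: 0x00F0}   # 0x10 holds the ANDM instruction (op 0x3C, address 0x25)
    PC, R4 = 0x10, 0x00FF

    MAR = PC                        # 1. MAR <- PC
    PC = PC + 1                     # 2. Read; PC <- PC + 1
    MDR = memory[MAR]               # 3. MDR <- [MAR]
    IR = MDR                        # 4. IR <- MDR; decode: ANDM, direct addressing
    MAR = IR & 0xFF                 # 5. MAR <- IR(low 8 bits), the operand address
                                    # 6. Read
    MDR = memory[MAR]               # 7. MDR <- [MAR]
    alu_out = MDR & R4              # 8. MDRout; R4outB; AND
    R4 = alu_out                    # 9. R4 <- ALU
    print(hex(R4))                  # 0xf0 = 0x00F0 AND 0x00FF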

18. What are the two principal design approaches for the control unit of a CPU? Describe

each of them and discuss their advantages and disadvantages. If you were designing a

family of high performance digital signal processors, which approach would you use, and

why?

The two principal design approaches are hardwired control and

microprogrammed control. A hardwired (conventional) control unit is implemented

using standard sequential and combinational logic design techniques to design the

control step counter, decoders, and other circuitry required to generate the

machine’s control signals. It has the disadvantages of requiring more design effort

(especially if the machine has a complex instruction set architecture) and being

more difficult to change once designed, but the advantages of being faster (all else

being equal, logic tends to be faster than memory access) and taking up less space

for implementation.

Microprogrammed control is implemented by treating each individual

machine operation as a task to be implemented using programmed steps. These

steps are carried out by a very simple computing engine within the CPU itself; thus,

microprogramming is sometimes described as the “computer within a computer”

approach. Microinstructions, representing sets of control signals to be generated

for each step in carrying out a machine operation, are fetched from microprogram

memory (control store) and issued (sent out over the control lines) to control

internal and external operations. Microprogramming simplifies the design process


and makes it easy to change or extend the capabilities of an architecture, but the

control store takes up a considerable amount of space on the chip. And since

control signals are read out of a memory rather than generated in logic,

microprogrammed control tends to yield a slower implementation than hardwired

control.

In the situation described, it would probably be most appropriate to use

hardwired control since digital signal processing is likely to demand raw CPU speed

over just about any other criterion and hardwired logic will contribute to a “lean,

mean” and fast design. Since it is a special-purpose and not a general-purpose

processor, the DSP’s instruction set is likely to be fairly simple and so the design

would not be likely to profit much from microcode’s more flexible capabilities and

ability to easily implement complexity.

19. In a machine with a microprogrammed control unit, why is it important to be able to do

branching within the microcode?

Transfers of control (branches) are needed within microcode for many of the

same reasons that they are needed in machine language, assembly language, and

high-level language programming. At any level of programming, the next

sequential instruction in memory is not always the one we wish to perform next.

Unconditional transfers of control are needed to move between one microroutine

and another, for example from the end of the execution microroutine for one

instruction to the beginning of the instruction fetch microroutine that will retrieve

the next microinstruction. Conditional transfers of control in the microcode are

needed in order to implement conditional branching in the machine language

program (which, in turn, supports conditional structures in high-level code).

20. Given the horizontal control word depicted in Figure 3.39 for our simple example

machine, develop the microroutines required to fetch and execute the ANDM instruction

using the steps you outlined in question 17.


For each step needed to execute an instruction we would need to form a

binary microinstruction that had 1s in the correct places to carry out the desired

action, and 0s everywhere else. For example, to do step one (which is MAR ← PC in

register transfer language) we would make PCout = 1 and MARin = 1, and all the rest

of the bits of the control word would be 0. So the first microinstruction would look

like this:

0000000000000000000000000010100000000000000

The same sort of procedure would have to be followed for each of the

remaining steps. The complete binary microroutine would be as follows:

0000000000000000000000000010100000000000000
0000000000000000000000000000000100000000110
0000000000000000000000000001000000000000000
0000000000000000000000000100001000000000000
0000000000000000000000000010010000000000000
0000000000000000000000000000000000000000110
0000000000000000000000000001000000000000000
0000000000000000000100000000001000100000000
0000100000000000000000000000000000000000000

21. Repeat question 20 using the vertical control word depicted in Figure 3.40.

The steps needed to carry out the instruction are still the same, but the

microinstructions are more compact since subsets of the control signals are encoded

into bit fields. To come up with specific binary values for the microinstructions, we

would have to know how the four system registers (PC, IR, MAR, and MDR) and

the eight ALU operations were encoded. Assuming that PC = 00, IR = 01, MAR =

10, and MDR = 11, and that the AND operation is represented by 010, the binary

microroutine in the vertical format would be as follows:

0000000001000000000
0000000000000100011
0000000001100000000
0000000000111000000
0000000001001000000
0000000000000000011
0000000001100000000
0000001000011001000
1000000000000000000

22. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Op code - the portion (bit field) of a machine language instruction that specifies the

operation to be done by the CPU

Control transfer instruction - a type of instruction that modifies the machine’s program

counter (other than by simply incrementing it)

Indexed addressing - a way of specifying the location of an operand in memory by

adding a constant embedded in the instruction to the contents of a “pointer” register

inside the CPU

Zero-operand instructions - these would be characteristic of a stack-based instruction

set

Accumulator machine - this type of architecture typically has instructions that explicitly

specify only one operand

Load-store architecture - a feature of some computer architectures where “operate”

instructions do not have memory operands; their operands are found in CPU registers

Complex Instruction Set Computer (CISC) - machines belonging to this architectural

class try to “bridge the semantic gap” by having machine language instructions that

approximate the functionality of high-level language statements

Datapath - this part of a CPU includes the registers that store operands as well as the

circuitry that performs computations

Carry lookahead adder - this type of addition circuit develops all carries in logic,

directly from the inputs, rather than waiting for them to propagate from less significant

bit positions

Wallace tree - a structure comprised of multiple levels of carry save adders, which can

be used to efficiently implement multiplication


Excess (biased) notation - this type of notation stores signed numbers as though they

were unsigned; it is used to represent exponents in some floating-point formats

Significand (fraction) - in IEEE-754 floating-point numbers, a normalized mantissa with

the leading 1 omitted is called this

Positive infinity - this is the result when the operation 1.0/0.0 is performed on a system

with IEEE-754 floating-point arithmetic

Instruction Register (IR) - this holds the currently executing machine language

instruction so its bits can be decoded and interpreted by the control unit

Microroutine - a sequence of microinstructions that fetches or executes a machine

language instruction, initiates exception processing, or carries out some other basic

machine-level task

Horizontal microprogramming - a technique used in microprogrammed control unit

design in which mutually-exclusive control signals are not encoded into bit fields, thus

eliminating the need for decoding microinstructions

Microprogram Counter (µPC) - this keeps track of the location of the next microword

to be retrieved from microcode storage


4 Enhancing CPU Performance

1. Suppose that you are designing a machine that will frequently have to perform 64

consecutive iterations of the same task (for example, a vector processor with 64-element

vector registers). You wish to implement a pipeline that will help speed up this task as

much as is reasonably possible, but recognize that dividing a pipeline into more stages

takes up more chip area and adds to the cost of implementation.

(a) Make the simplifying assumptions that the task can be subdivided as finely or

coarsely as desired and that pipeline registers do not add a delay. Also assume that

one complete iteration of the task takes 16 ns (thus, a non-pipelined implementation

would take 64 * 16 = 1024 ns to complete 64 iterations). Consider possible pipelined

implementations with 2, 4, 8, 16, 24, 32, and 48 stages. What is the total time

required to complete 64 iterations in each case? What is the speedup (vs. a non-

pipelined implementation) in each case? Considering cost as well as performance,

what do you think is the best choice for the number of stages in the pipeline?

Explain. (You may wish to make graphs of speedup and/or total processing time vs.

the number of stages to help you analyze the problem.)

In general, for a pipeline with s stages processing n iterations of a task, the

time taken to complete all the iterations may be expressed as:

tTOTAL = [s * tSTAGE] + [(n-1) * tSTAGE] = [(s+n-1) * tSTAGE]

In a pipelined implementation with 2 stages: the stage time is 16/2 = 8 ns,

and the total time for 64 iterations = [(2 + 64 – 1) * 8] ns = 520 ns; the speedup

factor is 1.969.

In a pipelined implementation with 4 stages: the stage time is 16/4 = 4 ns,


and the total time for 64 iterations = [(4 + 64 – 1) * 4] ns = 268 ns; the speedup

factor is 3.821.

In a pipelined implementation with 8 stages: the stage time is 16/8 = 2 ns,

and the total time for 64 iterations = [(8 + 64 – 1) * 2] ns = 142 ns; the speedup

factor is 7.211.

In a pipelined implementation with 16 stages: the stage time is 16/16 = 1 ns,

and the total time for 64 iterations = [(16 + 64 – 1) * 1] ns = 79 ns; the speedup

factor is 12.962.

In a pipelined implementation with 24 stages: the stage time is 16/24 = 0.667

ns, and the total time for 64 iterations = [(24 + 64 – 1) * 0.667] ns = 58 ns; the

speedup factor is 17.655.

In a pipelined implementation with 32 stages: the stage time is 16/32 = 0.5 ns,

and the total time for 64 iterations = [(32 + 64 – 1) * 0.5] ns = 47.5 ns; the speedup

factor is 21.558.

In a pipelined implementation with 48 stages: the stage time is 16/48 = 0.333

ns, and the total time for 64 iterations = [(48 + 64 – 1) * 0.333] ns = 37 ns; the

speedup factor is 27.676.

The speedup achieved versus the number of stages goes up in nearly linear

fashion through about 8-16 stages, then starts to fall off somewhat. Still, while gains

are not linear, performance continues to improve significantly all the way up to a

48-stage pipeline. If hardware cost were a critical limiting factor, it would be

reasonable to build only a 16-stage pipeline. If, on the other hand, performance is

paramount, it would probably be worth building the most finely-grained (48-stage)


pipeline even though it achieves a speedup of “only” about 28.

(b) Now assume that a total of 32 levels of logic gates are required to perform the task,

each with a propagation delay of 0.5 ns (thus the total time to produce a single result

is still 16 ns). Logic levels cannot be further subdivided. Also assume that each

pipeline register has a propagation delay equal to that of two levels of logic gates, or

1 ns. Re-analyze the problem; does your previous recommendation still hold? If not,

how many stages would you recommend for the pipelined implementation under

these conditions?

The same general pipeline performance equation still holds; however, in this

case we are presented with reasonable limitations on what the stage time can be.

Clearly, if there are 32 levels of logic gates involved, then the number of stages

should be 2, 4, 8, 16, or 32; a pipeline with 24 or 48 stages is not a realistic approach

since the delays do not divide evenly. (Some stages would do less work than others,

or perhaps no work at all!) In addition, each stage has a minimum propagation

delay of 1 ns due to its pipeline register regardless of the number of logic levels it

contains. Given these more realistic limitations, we analyze the situation as follows:

In a pipelined implementation with 2 stages: the stage time is (16/2) + 1 = 9

ns, and the total time for 64 iterations = [(2 + 64 – 1) * 9] ns = 585 ns; the speedup

factor is 1.750.

In a pipelined implementation with 4 stages: the stage time is (16/4) + 1 = 5

ns, and the total time for 64 iterations = [(4 + 64 – 1) * 5] ns = 335 ns; the speedup

factor is 3.057.

In a pipelined implementation with 8 stages: the stage time is (16/8) + 1 = 3


ns, and the total time for 64 iterations = [(8 + 64 – 1) * 3] ns = 213 ns; the speedup

factor is 4.808.

In a pipelined implementation with 16 stages: the stage time is (16/16) + 1 =

2 ns, and the total time for 64 iterations = [(16 + 64 – 1) * 2] ns = 158 ns; the speedup

factor is 6.481.

In a pipelined implementation with 32 stages: the stage time is (16/32) + 1 =

1.5 ns, and the total time for 64 iterations = [(32 + 64 – 1) * 1.5] ns = 142.5 ns; the

speedup factor is 7.186.

Here, because of the more realistic design constraints, the speedup achieved

versus the number of stages is not nearly linear, especially beyond about 4-8 stages.

An implementation with 32 stages is twice as costly as one with 16 stages, yet

performs only a fraction better. Even the 16-stage implementation is not

appreciably better than the 8-stage pipeline. In both these latter cases, we would

incur considerable extra hardware expense for very modest reductions in the total

time required to complete the task. The best tradeoff, depending on the exact

constraints of the situation, is probably to use 4 or, at most, 8 stages. As we said in

Section 4.1, “achieving a speedup factor approaching s (the number of pipeline

stages) depends on n (the number of consecutive iterations being processed) being

large, where ‘large’ is defined relative to s.” Here, n is 64 which is large relative to 4

(or perhaps 8), but 64 is not particularly large relative to 16 or 32.
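
The numbers in parts (a) and (b) can be reproduced with a short Python function; setting register_delay to 0 gives the idealized case in part (a), and setting it to 1 ns gives the case in part (b).

    # Total time and speedup for n iterations on an s-stage pipeline.
    def pipeline_stats(s, n=64, task_time=16.0, register_delay=0.0):
        t_stage = task_time / s + register_delay   # time per stage, in ns
        t_total = (s + n - 1) * t_stage            # (s + n - 1) * stage time
        speedup = (n * task_time) / t_total        # vs. the non-pipelined 1024 ns
        return t_total, speedup

    for s in (2, 4, 8, 16, 24, 32, 48):            # part (a)
        print(s, pipeline_stats(s))
    for s in (2, 4, 8, 16, 32):                    # part (b), 1 ns pipeline registers
        print(s, pipeline_stats(s, register_delay=1.0))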


2. Given the following reservation table for a static arithmetic pipeline:

            0   1   2   3   4   5
Stage 1     X                   X
Stage 2         X   X
Stage 3                 X   X

(a) Write the forbidden list. (0, 1, 5)

(b) Determine the initial collision vector C. C = c₅c₄c₃c₂c₁c₀ = 100011

(c) Draw the state diagram.

The diagram includes three states. The initial state is given by the

initial collision vector, 100011. For i = 4 or i ≥ 6, we remain in this state. For

i = 3 we transition from the initial state to state 100111, while for i = 2 we

transition from the initial state to state 101011. Once in state 100111 we

remain in that state for i = 3 and return to the initial state for i = 4 or i ≥ 6.

Once in state 101011 we remain in that state for i = 2 and return to the initial

state for i = 4 or i ≥ 6.

(d) Find the MAL. 2

(e) Find the minimum latency. 2
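
The forbidden list and collision vector can be derived mechanically from the reservation table. The Python sketch below assumes the stage usage pattern shown in the reconstructed table above (stage 1 at clocks 0 and 5, stage 2 at clocks 1 and 2, stage 3 at clocks 3 and 4); the text’s convention of treating latency 0 as forbidden is added explicitly.

    # Derive the forbidden list and initial collision vector from a reservation table.
    # Stage usage times follow the table above (a reconstruction consistent with the
    # answers, not copied from the textbook figure).
    table = {"Stage 1": [0, 5], "Stage 2": [1, 2], "Stage 3": [3, 4]}

    forbidden = set()
    for times in table.values():
        for i in times:
            for j in times:
                if j > i:
                    forbidden.add(j - i)    # distance between two uses of the same stage

    print(sorted(forbidden))                # [1, 5]
    C = 1 | sum(1 << d for d in forbidden)  # bit 0 set: latency 0 forbidden by convention
    print(f"{C:06b}")                       # 100011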

3. Considering the overall market for all types of computers, which of the following are

more commonly found in today’s machines: arithmetic pipelines (as discussed in Section

4.2) or instruction unit pipelines (Section 4.3)? Explain why this is so.

Arithmetic pipelines are primarily found in vector supercomputers, which

once had a significant share of the high-performance computing market but have

now largely fallen out of favor – mainly because they exhibit little generality. (They


are only useful for a limited subset of applications.) On the other hand, instruction

unit pipelines are found in all RISC and superscalar microprocessors and even in

most CISC microprocessors (other than low-end embedded microcontroller units).

Nowadays, instruction unit pipelines are nearly ubiquitous.

4. Why do control transfers, especially conditional control transfers, cause problems for an

instruction-pipelined machine? Explain the nature of these problems and discuss some of

the techniques that can be employed to cover up or minimize their effect.

In a non-pipelined machine, control transfer instructions are not appreciably

more troublesome than other instructions since all instructions are processed one at

a time. In an instruction-pipelined machine, control transfers are problematic

because by the time the instruction in question is decoded and determined to be a

control transfer (and, in the case of conditional control transfers, by the time the

branch condition is evaluated to determine whether or not the branch succeeds), one

or more additional instructions may have been fetched and proceeded some distance

into the pipeline. If, as is the normal case, these instructions are the ones

sequentially following the control transfer, they are not the correct instructions that

the machine should be executing. (It should instead be executing the instructions

starting at the control transfer’s target location.) While the control unit can

suppress the effect of these following instructions by not allowing them to update

any registers or memory locations (and thus insure correct operation of the

program), it cannot recover the clock cycles lost by bringing them into the pipeline.

One technique that can be used to minimize the performance effect of control

transfers is delayed branching. This approach, commonly used in RISC


architectures, documents the fact that the instruction(s) immediately following a

control transfer (a.k.a. the delay slot instruction(s)) – since it is (or they are) already

in the pipeline – will be executed as though it (they) came before the control transfer

instruction. (In the case of a conditional branch, this means that the delay slot

instruction(s) are executed regardless of the outcome of the branch). Another

approach (these are not mutually exclusive) is to use static (compile time) and/or

dynamic (run time) branch prediction. The compiler and/or processor attempt to

predict, based on the structure of the code and/or the history of execution, whether a

conditional branch will succeed or fail and fetch subsequent instructions

accordingly. A successful prediction will reduce or eliminate the “branch penalty”;

an incorrect prediction, however, will incur a significant performance penalty due

to the need to drain the pipeline. (See the answer to the next question for an

example of a quantitative analysis.)

5. A simple RISC CPU is implemented with a single scalar instruction processing pipeline.

Instructions are always executed sequentially except in the case of branch instructions.

Given that pb is the probability of a given instruction being a branch, pt is the probability

of a branch being taken, pc is the probability of a correct prediction, b is the branch

penalty in clock cycles, and c is the penalty for a correctly predicted branch:

(a) Calculate the throughput for this instruction pipeline if no branch prediction is

done, given that pb = 0.16, pt = 0.3, and b = 3.

The average number of clock cycles per instruction is (0.16)(0.3)(1 + 3 cycles)

+ (0.16)(0.7)(1 cycle) + (1 – 0.16)(1 cycle) = 0.192 + 0.112 + 0.84 = 1.144

cycles/instruction. The throughput equals 1 / (1.144 cycles/instruction) ≈ 0.874


instructions per cycle.

(b) Assume that we use a branch prediction technique to try to improve the pipeline’s

performance. What would be the throughput if c = 1, pc = 0.8, and the other

values are the same as above?

The average number of clock cycles per instruction would be

(0.16)(0.3)(0.8)(1 + 1 cycles) + (0.16)(0.3)(0.2)(1 + 3 cycles) + (0.16)(0.7)(0.8)(1 cycle)

+ (0.16)(0.7)(0.2)(1 + 3 cycles) + (0.84)(1 cycle) = 0.0768 + 0.0384 + 0.0896 + 0.0896 +

0.84 = 1.1344 cycles/instruction. The throughput in this case would improve slightly

to 1 / (1.1344 cycles/instruction) ≈ 0.882 instructions per cycle.
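
The CPI and throughput figures in parts (a) and (b) follow directly from the probability terms above; the short Python check below uses illustrative function names.

    # Throughput (instructions per cycle) of a scalar pipeline with branch penalties.
    def throughput_no_prediction(pb, pt, b):
        cpi = pb * pt * (1 + b) + pb * (1 - pt) + (1 - pb)
        return 1 / cpi

    def throughput_with_prediction(pb, pt, pc, b, c):
        cpi = (pb * pt * pc * (1 + c)                 # branch taken, correctly predicted
               + pb * pt * (1 - pc) * (1 + b)         # branch taken, mispredicted
               + pb * (1 - pt) * pc                   # branch not taken, correctly predicted
               + pb * (1 - pt) * (1 - pc) * (1 + b)   # branch not taken, mispredicted
               + (1 - pb))                            # non-branch instructions
        return 1 / cpi

    print(throughput_no_prediction(0.16, 0.3, 3))             # ~0.874
    print(throughput_with_prediction(0.16, 0.3, 0.8, 3, 1))   # ~0.882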

6. What are the similarities and differences between a delayed branch and a delayed load?

A delayed branch, as explained in the answer to question 4 above, is a special

type of control transfer instruction used in some pipelined architectures to minimize

or “cover up” the branch penalty caused by having the wrong instruction(s) in

progress following a control transfer instruction. A delayed load is also a feature of

some instruction sets that is used to cover up a potential performance penalty

associated with pipelined implementation. It has nothing to do with branching, but

rather deals with the (also problematic) latency normally associated with accessing

memory as compared to internal CPU operations. Since (even in the case of a cache

hit) reading a value from memory usually takes at least one additional clock cycle

vs. obtaining it from a register, the architects may document the fact that the

instruction immediately following a load should not (or must not) use the value

being loaded; instead, it should operate on some other data already inside the CPU.

Depending on the architecture, if the following instruction does use the value being


loaded from memory into a register, it may be documented to reference the old,

rather than the new, value in that register; or a hardware interlock may be

employed to delay the subsequent instruction until the load completes and the newly

loaded value can be used.

7. Given the following sequence of assembly language instructions for a CPU with multiple

pipelines, indicate all data hazards that exist between instructions.

I1: Add R2, R4, R3 ; R2 = R4 + R3

I2: Add R1, R5, R1 ; R1 = R5 + R1

I3: Add R3, R1, R2 ; R3 = R1 + R2

I4: Add R2, R4, R1 ; R2 = R4 + R1

Three RAW hazards exist: between I1 and I3 over the use of R2; between I2

and I3 over the use of R1; and between I2 and I4 over the use of R1. Two WAR

hazards exist: between I1 and I3 over the use of R3, and between I3 and I4 over the

use of R2. Only one WAW hazard exists; it is between I1 and I4 over the use of R2.
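
The same hazards can be found mechanically by comparing destination and source registers across every ordered pair of instructions, as in the sketch below (the tuple encoding of the instructions is just for illustration):

    # Detect RAW, WAR, and WAW hazards between every ordered pair of instructions.
    # Each entry is (name, destination register, list of source registers).
    program = [
        ("I1", "R2", ["R4", "R3"]),
        ("I2", "R1", ["R5", "R1"]),
        ("I3", "R3", ["R1", "R2"]),
        ("I4", "R2", ["R4", "R1"]),
    ]

    for i, (ni, di, si) in enumerate(program):
        for nj, dj, sj in program[i + 1:]:
            if di in sj:
                print(f"RAW hazard: {nj} reads {di}, which is written by {ni}")
            if dj in si:
                print(f"WAR hazard: {nj} writes {dj}, which is read by {ni}")
            if dj == di:
                print(f"WAW hazard: {nj} writes {di}, which is also written by {ni}")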

8. What are the purposes of the scoreboard method and Tomasulo’s method of controlling

multiple instruction execution units? How are they similar and how are they different?

Both the scoreboard method and Tomasulo’s method are techniques of

controlling and scheduling a superscalar processor’s internally parallel hardware

execution units. By detecting and correcting for data hazards, both methods ensure

that RAW, WAR, and WAW relationships in the code being executed do not cause

the machine to improperly execute a sequentially written program. Even though

instructions may actually be executed out of order, these methods make them

appear to be executed in the original, sequential order.


These two methods are different in that each was initially devised for a

different 1960s-era supercomputer (the CDC 6600 vs. the IBM 360/91); each bears

the specific characteristics of its parent machine. The scoreboard method uses a

centralized set of registers and logic to schedule the use of processor hardware by

multiple instructions; when a data hazard is detected, the issue of one or more

instructions to a functional unit is stalled to ensure correct operation. Tomasulo’s

method uses a distributed control technique (with reservation stations associated

with each functional unit) to schedule operations. It implements a “dataflow”

approach to scheduling hardware, using register renaming and data forwarding to

help avoid stalls as much as possible and enhance the machine’s ability to execute

operations concurrently.

9. List and explain nine common characteristics of RISC architectures. In each case,

discuss how a typical CISC processor would (either completely or partially) not exhibit

the given attribute.

The characteristics that are common to virtually all RISC architectures are

discussed in Section 4.4. A RISC architecture may primarily be distinguished by its

adherence to the following characteristics:

• Fixed-length instructions are used to simplify instruction fetching.
• The machine has only a few instruction formats in order to simplify instruction decoding.
• A load/store instruction set architecture is used to decouple memory accesses from computations so that each can be optimized independently.
• Instructions have simple functionality, which helps keep the control unit design simple.
• A hardwired control unit optimizes the machine for speed.
• The architecture is designed for pipelined implementation, again to optimize for speed of execution.
• Only a few, simple addressing modes are provided, since complex ones may slow down the machine and are rarely used by compilers.
• There is an emphasis on optimization of functions by the compiler since the architecture is designed to support high-level languages rather than assembly programming.
• Complexity is in the compiler (where it only affects the performance of the compiler), not in the hardware (where it would affect the performance of every program that runs on the machine).

Secondary characteristics that are prevalent in RISC machines include:

• Three-operand instructions make it easier for the compiler to optimize code.
• A large register set (typically 32 or more registers) is possible because the machine has a small, hardwired control unit, and desirable because of the need for the compiler to optimize code for the load/store architecture.
• Instructions execute in a single clock cycle (or at least most of them appear to, due to pipelined implementation).
• Delayed control transfer instructions are used to minimize disruption to the pipeline.
• Delay slots behind loads and stores help to cover up the latency of memory accesses.
• A Harvard architecture is used to keep memory accesses for data from interfering with instruction fetching and thus keep the pipeline(s) full.
• On-chip cache is possible due to the small, hardwired control unit and necessary to speed instruction fetching and keep the latency of loads and stores to a minimum.

The characteristics of a CISC would depend on the specific architecture and

implementation being compared, but would include such things as variable-length

instructions, many instruction formats, a memory-register architecture, complex

functionality of individual machine instructions, the use of microprogrammed

control, the lack of explicit support for pipelining, support for many and/or complex

addressing modes, an emphasis on optimization of code by manual profiling and re-

coding in assembly language, support for fewer than three operands per instruction,

a limited number of programmer-visible registers, instructions that require multiple

clock cycles to execute, the lack of delay slots, a Princeton architecture and, in some

cases, a smaller or nonexistent on-chip cache. In general, complexity is in the

hardware rather than in the compiler! A machine need not exhibit all these

characteristics to be considered a CISC architecture, but the more of them that are

observed, the more confident we can be in identifying it as such.


10. How does the “overlapping register windows” technique, used in the Berkeley RISC and

its commercial successor the Sun SPARC, simplify the process of calling and returning

from subprograms?

This technique partitions the register set not by intended use (data registers

vs. pointer registers), as is common in many other architectures, but rather by scope

(global, local, inputs, and outputs) relative to the procedure(s) in which values are

used. When a procedure calls another procedure, the internal decoding scheme is

altered such that registers are renumbered; the caller’s output registers

automatically become the called procedure’s input registers and a new set of local

and output registers is allocated to that procedure. In many cases, the number of

procedure parameters is such that all of them can be passed in the overlapped

registers, thus eliminating the need to access memory to write and then read values

to/from a stack frame. This reduction in the number of accesses to data memory

can help improve performance, especially for high-level language programs that are

written to use a number of modular functions.

11. You are on a team helping to design the new Platinum V® processor for AmDel

Corporation. Consider the following design issues:

(a) Your design team is considering a superscalar vs. superpipeline approach to the

design. What are the advantages and disadvantages of each option? What

technological factors would tend to influence this choice one way or the other?

The superpipelined approach has the advantage of being simpler to control

(since there is no out-of-order execution to cause WAR and WAW hazards); it also

may be able to achieve a higher clock frequency since each pipeline stage does less

work. The simpler control logic and single pipeline leave more space on the chip for

registers, on-chip cache, floating-point hardware, memory management unit(s), and

other enhancements. However, superpipelined processors suffer from an increased

branch penalty and do not perform well on code with many control transfers.


The superscalar approach takes advantage of spatial parallelism to better

exploit the instruction-level parallelism inherent in most programs. Superscalar

processors also may achieve high performance with only a modest clock frequency,

avoiding the need to generate and distribute a high-frequency clock signal. On the

other hand, a superscalar implementation has the disadvantage of being more

difficult to control internally, due to its vulnerability to all types of data hazards.

The multiple pipelines and more complex control logic take up more space on the

chip, leaving less room for other functionality. It is also more difficult to handle

exceptions precisely in a superscalar architecture.

An implementation technology with short propagation delays (i.e. very fast

transistors) favors a superpipelined approach, while an implementation technology

with small feature sizes (and thus more transistors per IC) favors a superscalar

approach.

(b) Your design team has allocated the silicon area for most of the IC and has narrowed

the design options to two choices: one with 32 registers and a 512KB on-chip cache

and one with 512 registers but only a 128 KB on-chip cache. What are the

advantages and disadvantages of each option? What other factors might influence

your choice?

One would think that having more registers would always be a good thing,

but this is not necessarily true. Registers are only beneficial to the extent that the

compiler (or the assembly language programmer) can make use of them; adding

additional registers past that point provides no benefit and has some definite costs.

For example, one significant advantage of the first option is that it only requires 5

bits in the instruction format for addressing each register operand. Unless some

scheme such as register windowing is used, the second option (with 2⁹ = 512 registers) will

require 9 bits in the instruction for each register operand used. The first option also

requires fewer registers to be saved and restored on each context switch or


interrupt.

On the other hand, while having more cache is essentially always a good

thing, even cache hits are not as beneficial for performance as keeping data in a

CPU register, since cache generally requires at least one extra clock cycle to read or

write data as compared to internal operations. Cache, like registers, also has a

“diminishing returns” effect; while 512KB is four times the capacity of 128KB, it

certainly won’t provide four times the hit ratio. To make the best choice given the

specified alternatives, one would probably need to consider factors including the

intended application(s) of the processor, the anticipated speed gap between cache

and main memory, whether the on-chip cache is going to have a Harvard

(instructions and data kept separate) or Princeton (unified) architecture, whether

there will be any off-chip (level 2) cache, etc.

12. How are VLIW architectures similar to superscalar architectures, and how are they

different? What are the relative advantages and disadvantages of each approach? In

what way can VLIW architectures be considered the logical successors to RISC

architectures?

Both VLIW and superscalar architectures make use of internally parallel

hardware (more than one pipelined instruction execution unit operating at the same

time). The main difference is that superscalar architectures use hardware inside the

CPU’s control unit to dynamically schedule the execution of instructions and avoid

incorrect operation due to hazards; VLIW machines instead use the compiler to

statically perform dependency resolution and resource scheduling. The principal

advantages of the superscalar approach are better compatibility with existing

architectures and a reduced burden on the (more conventional) compiler.

VLIW has the advantage of simplified control hardware, which may reduce

internal delays and allow operation at a higher clock frequency than a superscalar

processor given the same implementation technology. Also, since the compiler can


use more sophisticated scheduling logic, a VLIW architecture may take better

advantage of instruction-level parallelism and execute more instructions, on

average, per clock cycle. However, VLIW architectures in general suffer from poor

code density (programs take up a lot of memory), are not easily made compatible

with existing architectures, and perform poorly on code with many branches.

VLIW is the logical successor to RISC in the sense that both design

philosophies emphasize making hardware simpler and faster while transferring the

burden of computational complexity to the software (specifically, the compiler).

13. Is Explicitly Parallel Instruction Computing (EPIC) the same thing as a VLIW

architecture? Explain why or why not.

Not exactly. EPIC is based on the same idea as VLIW, but the format of the

“bundles” in the IA-64 processors is not tied to a particular hardware

implementation. Some EPIC-based chips may be able to execute one bundle at a

time, some others may do less than a complete bundle at a time, and some higher-

performance implementations may be able to execute multiple bundles at a time (if

not prevented by dependencies between operations). By standardizing the size of

the bundles, EPIC (unlike a VLIW architecture) allows binary compatibility to be

maintained between generations of CPUs. In a pure VLIW machine the compiler

does all the resource scheduling. In a superscalar machine the control unit does the

scheduling. In EPIC the compiler does much of the work, but the control unit still

has to do some of it. So EPIC is somewhere between VLIW and superscalar, though

closer to VLIW.

14. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Flow-through time - the time required for the first result in a series of computations to

emerge from a pipeline

Pipeline register - this is used to separate one stage of a pipeline from the next

Multifunction pipeline - this type of pipeline can perform different kinds of

computations at different times

Collision - this occurs if we mistakenly try to use a pipeline stage for two different

computations at the same time

Average Latency - over time, this tells the mean number of clock cycles between

initiation of operations into a pipeline (if an optimal pipeline control strategy is used,

this would be equal to the Minimum Average Latency or MAL)

Pipeline throughput - over time, this tells the mean number of operations completed per

clock cycle (a small worked example of this and of flow-through time follows this list)

Branch penalty - the clock cycles that are wasted by an instruction-pipelined processor

due to executing a control transfer instruction

Static branch prediction - a technique used in pipelined CPUs where the compiler

supplies a hint as to whether or not a given conditional branch is likely to succeed

Delay slot instruction(s) - the instruction(s) immediately following a conditional control

transfer instruction in some pipelined processors, which are executed whether or not the

control transfer occurs

Delayed load - a technique used in pipelined CPUs where the instruction immediately

following a load instruction cannot use the value just loaded from memory

Read After Write (RAW) hazard - the most common data hazard in pipelined

processors; also known as a true data dependence

Write After Write (WAW) hazard - also known as an output dependence, this hazard

can occur in a processor that utilizes out-of-order execution

Scoreboard - a centralized resource scheduling mechanism for internally concurrent

processors; it was first used in the CDC 6600 supercomputer

Reservation stations - these are used by a Tomasulo scheduler to hold operands for

functional units

Overlapping register windows - a technique used in some RISC processors to speed up

parameter passing for high-level language procedure calls

Superpipelined - this type of processor architecture maximizes temporal parallelism by

using a very deep pipeline with very fast stages

Superscalar - this approach to high-performance processing uses multiple pipelines with

resolution of inter-instruction data dependencies done by the control unit

Explicitly Parallel Instruction Computing (EPIC) - the “architecture technology” used

in Intel’s IA-64 (Itanium) chips

Predication - the IA-64 architecture uses this approach instead of branch prediction to

minimize the disruption caused by conditional control transfers
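
To make the flow-through time and throughput definitions above more concrete, here is a small worked example in C. The numbers are assumptions chosen for illustration: a 5-stage linear pipeline, a 1 ns clock, and one new operation issued every cycle.

    /* Worked example (assumed values): flow-through time and throughput
       for a simple k-stage linear pipeline processing n operations.     */
    #include <stdio.h>

    int main(void) {
        int    k = 5;            /* pipeline stages                     */
        double t = 1.0;          /* clock period per stage, in ns       */
        long   n = 1000;         /* operations processed                */

        double flow_through = k * t;                     /* first result out  */
        double total_time   = (k + n - 1) * t;           /* all n results out */
        double throughput   = n / (double)(k + n - 1);   /* results per cycle */

        printf("flow-through time: %.1f ns\n", flow_through);
        printf("time for %ld ops:  %.1f ns\n", n, total_time);
        printf("throughput: %.3f results per clock cycle\n", throughput);
        return 0;
    }

For large n the throughput approaches one result per clock cycle, which is why deep pipelines pay off only when long runs of operations can be kept flowing.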

5 Exceptions, Interrupts, and Input/Output Systems

1. What do we mean when we say that interrupts must be processed “transparently”? What

does this involve and why is it necessary?

Since interrupts are asynchronous to CPU operations (that is, they can occur

at any time, without warning), it is necessary that the complete run-time context of

the program that was executing be preserved across the servicing of the interrupt.

That is to say, the interrupted program must not experience any changes to the state

of the processor (or to its program or data memory) due to interrupt handling; it

should not “see” any effects other than a time lag, and should compute the same

results whether or not any interrupts occur. For this to happen it is

necessary to save and restore not only the program counter (so the program can be

resumed at the next instruction that would have been executed had the interrupt not

occurred), but also the condition codes (a.k.a. status flags) and the contents of all the

CPU registers. This is normally accomplished via pushing these values on the

system stack when an interrupt occurs and popping them back off after it has been

serviced. If this were not done, the interrupt would not be “transparent” and the

interrupted program could operate incorrectly.
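
As a rough illustration of the save/restore sequence described above, the following sketch models the idea in plain C. It is a simplified software model with made-up structure and field names, not the code of any particular CPU.

    /* A minimal, hypothetical model of transparent interrupt handling.
       The CPU context (program counter, status flags, registers) is pushed
       onto a stack before the handler runs and popped afterward, so the
       interrupted program resumes exactly where it left off. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        unsigned pc;          /* program counter               */
        unsigned flags;       /* condition codes / status word */
        unsigned regs[8];     /* general-purpose registers     */
    } Context;

    static Context stack[16];    /* simplified system stack of saved contexts */
    static int sp = 0;

    static void save_context(const Context *c) { stack[sp++] = *c; }
    static void restore_context(Context *c)    { *c = stack[--sp]; }

    static void service_routine(Context *c) {
        memset(c->regs, 0, sizeof c->regs);  /* the handler may clobber registers */
        c->pc = 0x8000;                      /* ...and even change the PC         */
    }

    int main(void) {
        Context cpu = { .pc = 0x1234, .flags = 0x3, .regs = {1,2,3,4,5,6,7,8} };
        save_context(&cpu);       /* entry: push PC, flags, registers */
        service_routine(&cpu);    /* run the interrupt service routine */
        restore_context(&cpu);    /* exit: pop everything back         */
        printf("resumed at PC=0x%X with flags=0x%X\n", cpu.pc, cpu.flags);
        return 0;
    }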

2. Some processors, before servicing an interrupt, automatically save all register contents.

Others automatically save only a limited amount of information. In the second case, how

can we be sure that all critical data are saved and restored? What are the advantages and

disadvantages of each of these approaches?

The advantage of automatically saving everything is that we are sure that it

has been done and thus we know that the interrupt will be serviced in transparent

fashion (as described in the answer to question 1 above). The disadvantage of this is

that a given interrupt service routine may actually use only a small subset of the

CPU registers, and the time spent saving and restoring all the other, unused

registers is wasted.

The additional delay involved in saving a large number of registers can

significantly increase the latency in responding to an interrupt; for some timing-

sensitive I/O devices, it is important to keep this latency as small as possible. Some

processors only automatically save the program counter and condition codes; other

registers are left to be preserved with push and pop instructions inside the service

routine. In that case, we save only the necessary registers and keep latency to a

minimum; but the potential disadvantage is that we are dependent on the vigilance

of the programmer who writes the service routine to track his/her register usage

and save all required registers. If he or she fails to do this, the interrupted program

could have its run-time context corrupted.

3. Explain the function of a watchdog timer. Why do embedded control processors usually

need this type of mechanism?

A watchdog timer is a mechanism that can be used to reset a system if it

“locks up”, without requiring the intervention of a human user (which is not always

possible, especially in embedded systems). To implement this mechanism, a counter

that runs off the system clock (or some derivative of it) is initialized and allowed to

count up toward a maximum count (or down toward zero). Software running on

the system is assigned the task of periodically resetting the counter to its initial

value. If the counter ever “rolls over”, presumably due to a system software failure,

that rollover event is detected by hardware and used to generate a reset signal to

reinitialize and recover the system to a known state. Embedded control processors,

unlike general-purpose CPUs, are not normally mounted in a case with a convenient

reset button within reach of a user’s finger. Due to the embedded location, it may

be difficult or impossible to perform a manual reset; a watchdog timer may be the

only mechanism that allows the system to be restarted in the event of a crash.
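
A minimal sketch of the mechanism, with assumed names (wdt_kick, wdt_tick) and an assumed reload value, might look like this in C:

    /* Hypothetical watchdog-timer model, not tied to any real microcontroller.
       A counter counts down on every tick; software must "kick" it
       periodically.  If it ever reaches zero, a reset is generated. */
    #include <stdio.h>

    #define WDT_RELOAD 1000                 /* assumed initial count */

    static unsigned wdt_count = WDT_RELOAD;

    static void wdt_kick(void) { wdt_count = WDT_RELOAD; }   /* done by software */

    static int wdt_tick(void) {                               /* done by hardware */
        if (--wdt_count == 0) {
            printf("watchdog expired: generating reset\n");
            wdt_count = WDT_RELOAD;
            return 1;                        /* reset signal asserted */
        }
        return 0;
    }

    int main(void) {
        /* Healthy system: software kicks the watchdog often enough. */
        for (int t = 0; t < 5000; t++) {
            if (t % 500 == 0) wdt_kick();
            wdt_tick();
        }
        /* "Hung" system: no more kicks, so the counter eventually expires. */
        for (int t = 0; t < 2000 && !wdt_tick(); t++)
            ;
        return 0;
    }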

4. How are vectored and autovectored interrupts similar and how are they different? Can

they be used in the same system? Why or why not? What are their advantages and

disadvantages vs. nonvectored interrupts?

Vectored and autovectored interrupts are similar in that both use an

interrupt number to index into a table that contains the addresses of the various

interrupt service routines. (The value read from the table is loaded into the

program counter and execution of the service routine begins from that point.) The

only difference between the two techniques is that hardware devices provide their

own interrupt numbers (via the system bus during the interrupt acknowledge cycle)

in a system with vectored interrupts, while an autovectoring scheme uses interrupt

numbers internally generated by the CPU based on the priority level of an incoming

interrupt.

Yes, both of these techniques can be used in the same system; the Motorola

680x0 family of CPUs is a prime example. The “smarter” devices can provide their

own vectors, while a special hardware signal or timeout mechanism can alert the

CPU to the need to generate autovectors for other devices. While nonvectored

interrupts are slightly easier to implement in hardware as compared to vectored or

autovectored interrupts, the additional complexity required of the software (to

identify the source of each interrupt and execute the correct code to handle it), the

corresponding increase in the latency to service an interrupt, and the limitations it

places on the design of the memory system are significant drawbacks.
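
The common vector-table lookup, and the two different sources of the vector number, can be sketched as follows. The table size, handler names, and the autovector base are illustrative assumptions (the base of 24 mimics the 680x0 convention mentioned above).

    /* Hedged sketch of vectored vs. autovectored interrupt dispatch.  Both
       index a table of handler addresses; only the source of the vector
       number differs. */
    #include <stdio.h>

    typedef void (*handler_t)(void);

    static void timer_handler(void) { printf("timer serviced\n"); }
    static void uart_handler(void)  { printf("uart serviced\n");  }

    static handler_t vector_table[256];      /* filled in by the OS at boot */

    /* Vectored: the device itself supplies its vector number on the bus. */
    static void dispatch_vectored(int vector_from_device) {
        vector_table[vector_from_device]();
    }

    /* Autovectored: the CPU derives the vector from the interrupt priority. */
    static void dispatch_autovectored(int priority_level) {
        int autovector = 24 + priority_level;   /* assumed 680x0-style base */
        vector_table[autovector]();
    }

    int main(void) {
        vector_table[64] = uart_handler;        /* device-supplied vector      */
        vector_table[24 + 6] = timer_handler;   /* autovector for a level-6 IRQ */
        dispatch_vectored(64);
        dispatch_autovectored(6);
        return 0;
    }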

5. Given the need for user programs to access operating system services, why are traps a

better solution than conventional subprogram call instructions?

The main “problem” with a typical subprogram call instruction is that it

generally requires a target address that is explicitly specified using one of the

machine’s addressing modes; that is to say, we must know where a given routine

resides in memory in order to call it. While this is normally not a problem for user-

written code or procedures in a link library, we often do not know the location of

routines that are part of the operating system. Their location may vary from one

specific system or OS version to another. Also, code that performs system functions

such as I/O usually needs to be executed at a system privilege level, while called

procedures normally execute with the same privilege level as the code that called

them. Traps, since they make use of the same vectoring mechanism as interrupts or

other exceptions, allow OS routines to be accessed implicitly, without the

programmer having to know the exact location of the code he or she wishes to

execute. By executing a specific trapping instruction, the desired routine can be

executed at a system privilege level with control returning (at the proper privilege

level) to the user program that called it.

6. Compare and contrast program-controlled I/O, interrupt-driven I/O, and DMA-based I/O.

What are the advantages and disadvantages of each? Describe scenarios that would favor

each particular approach over the others.

In a system with program-controlled I/O, the CPU executes code to poll the

various hardware devices to see when they require service, then executes more code

to carry out the data transfers. This is the simplest way to handle I/O, requiring no

extra hardware support; but the need for the CPU to spend time polling devices

complicates the software and detracts from system performance. This approach

would only be favored in a very low-cost, embedded system where the CPU is not

doing much other than I/O and the goal is to keep the hardware as simple and

inexpensive as possible.

In a system with interrupt-driven I/O, the devices use hardware interrupt

request lines to notify the CPU when they need servicing. The CPU then executes

instructions to transfer the data (as it would in a system using program-controlled

I/O). This approach doesn’t eliminate the need for CPU involvement in moving

data and also involves a bit more hardware complexity, but support for interrupt

processing is already built into virtually every microprocessor so the additional cost

is minimal. The upside of this technique is that the CPU never has to waste time

polling devices. System software is simplified by having separate interrupt service

routines for each I/O device, and devices are typically serviced with less latency than

if interrupts were not used. This approach is good for many systems, especially

general-purpose machines that have a wide variety of I/O devices with different

speeds, data transfer volumes, and other characteristics.

DMA-based I/O is carried out by a hardware DMA controller that is

separate from the system CPU. When the CPU determines (often by receiving an

interrupt) that a transfer of data to or from an I/O device needs to take place, it

initializes the DMA controller with the particulars of the transfer; the DMA

controller then carries out the operation, transferring data directly between the

chosen device and memory, without further intervention by the CPU. This

approach has the highest hardware cost of the three, since it requires an extra

system component; it also requires the overhead of the CPU having to set up the

DMA controller for each transfer. However, DMA is very efficient, especially when

large blocks of data are frequently transferred. Its use would be favored in a

general-purpose or (especially) a high-performance system with high-speed devices

that can benefit significantly from large block I/O operations.
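
As a small illustration of the first (program-controlled) approach, the following C sketch busy-waits on a device status register and then moves the data itself. The register names and the "ready" bit position are assumptions; real hardware would define them.

    /* Minimal sketch of program-controlled (polled) I/O: the CPU spins on a
       status register until the ready bit is set, then moves the byte itself. */
    #include <stdint.h>
    #include <stdio.h>

    #define STATUS_READY 0x01

    /* Stand-ins for device interface registers; on real hardware these would
       be volatile accesses to fixed I/O or memory-mapped addresses. */
    static volatile uint8_t dev_status;
    static volatile uint8_t dev_data;

    static uint8_t poll_and_read(void) {
        while (!(dev_status & STATUS_READY))
            ;                                   /* CPU time wasted busy-waiting */
        dev_status &= (uint8_t)~STATUS_READY;
        return dev_data;                        /* the CPU itself moves the data */
    }

    int main(void) {
        dev_data = 'A';
        dev_status = STATUS_READY;              /* pretend the device became ready */
        printf("received '%c'\n", poll_and_read());
        return 0;
    }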

7. Systems with “separate I/O” have a second address space for I/O devices as opposed to

memory and also a separate category of instructions for doing I/O operations as opposed

to memory data transfers. What are the advantages and disadvantages of this method of

handling I/O? Name and describe an alternative strategy and discuss how it exhibits a

different set of pros and cons.

Separate I/O has the advantage of a unique address space for I/O devices;

because of this, there are no “holes” in the main memory address space where I/O

device interface registers have been decoded. The full physical memory address

space is available for use by memory devices. Also, I/O operations are easily

distinguished from memory operations by their use of different machine language

instructions. On the other hand, hardware complexity (and possibly cost) is

increased slightly and the additional instructions required for I/O make the

instruction set architecture a bit more complex.

The alternative, memory-mapped I/O, shares a single physical address space

between memory and I/O devices. This keeps the hardware and instruction set

simpler while sacrificing the distinct functionality of I/O instructions as well as the

complete, contiguous address space that would otherwise be available to memory.

Given the widespread use of virtual memory in all but the simplest of systems, the

pros and cons of either approach are not as noteworthy as they once were and either

approach can be made to work well.
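
A short sketch of the memory-mapped alternative: the device's interface registers are reached with ordinary loads and stores through a pointer, so no special I/O instructions are needed. The register layout and base address below are purely illustrative.

    /* Hedged sketch of memory-mapped I/O using a stand-in "device". */
    #include <stdint.h>
    #include <stdio.h>

    /* On real hardware this struct would sit at a fixed physical address
       decoded by the device, e.g.  #define UART ((volatile uart_regs *)0xFFFF0000) */
    typedef struct {
        volatile uint8_t data;     /* transmit/receive data register */
        volatile uint8_t control;  /* control register               */
        volatile uint8_t status;   /* status register                */
    } uart_regs;

    static uart_regs fake_device;          /* stand-in storage for the sketch */
    #define UART (&fake_device)            /* assumed base "address"          */

    int main(void) {
        UART->control = 0x03;      /* an ordinary store configures the device */
        UART->data    = 'X';       /* another ordinary store transmits a byte */
        printf("status register reads 0x%02X\n", UART->status);
        return 0;
    }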

8. Given that many systems have a single bus which can be controlled by only one bus

master at a time (and thus the CPU cannot use the bus for other activities during I/O

transfers), explain how a system that uses DMA for I/O can outperform one in which all

I/O is done by the CPU.

On the face of it, it would seem that DMA I/O would provide little or no

advantage in such a system, since only one data transfer can occur at a time

regardless of whether the CPU or DMAC is initiating it. However, DMA still has a

considerable advantage for a couple of important reasons. One of these is that, due

to the widespread use of on-chip instruction and data cache, it is likely that the CPU

can continue to execute code for some time (in parallel with I/O activities) even

without any use of the system bus. The second reason is that even if the CPU “stalls

out” for lack of ability to access code or data in main memory, the I/O operation

itself is done more efficiently than it would be if the CPU performed it. Instead of

reading a value from a buffer in memory and then writing it to an I/O device

interface (or vice versa), the CPU (which would be the middleman in the

transaction) gets out of the way and the two transactions are replaced with one

direct data transfer between memory and the device in question.

9. Compare and contrast the channel processors used in IBM mainframes with the PPUs

used in CDC systems.

The channel processors used in IBM mainframes were simple von Neumann

machines with their own program counters, register sets, (simpler) instruction set

architecture, etc. They communicated with the main system processor(s) by reading

and writing a shared area of main memory. CDC’s Peripheral Processing Units

were complete computers dedicated to I/O operations. The PPUs had their own

separate memory and were architecturally similar to the main system processor

(although they lacked certain capabilities, such as hardware support for floating-

point arithmetic, that were not useful for I/O). In addition to controlling I/O devices

they performed other operations such as buffering, checking, formatting, and

translating data.

10. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Exception - a synchronous or asynchronous event that occurs, requiring the attention of

the CPU to take some action

Service routine (handler) - a special program that is run in order to service a device,

take care of some error condition, or respond to an unusual event

Stack - when an interrupt is accepted by a typical CPU, critical processor status

information is usually saved here

Non-maskable interrupt - the highest priority interrupt in a system; one that will never

be ignored by the CPU

Reset - a signal that causes the CPU to reinitialize itself and/or its peripherals so that the

system starts from a known state

Vectoring - the process of identifying the source of an interrupt and locating the service

routine associated with it

Vectored interrupt - when this occurs, the device in question places a number on the bus

which is read by the processor in order to determine which handler should be executed

Trap - another name for a software interrupt, this is a synchronous event occurring inside

the CPU because of program activity

Abort - on some systems, the “Blue Screen Of Death” can result from this type of

software-related exception

Device interface registers - these are mapped in a system’s I/O address space; they

allow data and/or control information to be transferred between the system bus and an I/O

device

Memory-mapped I/O - a technique that features a single, common address space for

both I/O devices and main memory

Bus master - any device that is capable of initiating transfers of data over the system bus

by providing the necessary address, control, and/or timing signals

Direct Memory Access Controller (DMAC) - a hardware device that is capable of

carrying out I/O activities after being initialized with certain parameters by the CPU

Burst mode DMA - a method of handling I/O where the DMAC takes over exclusive

control of the system bus and performs an entire block transfer in one operation

Input/Output Processor (IOP) (also known as Peripheral Processor or Front-End

Processor) - an independent, programmable processor that is used in some systems to

offload input and output activities from the main CPU

6 Parallel and High-Performance Systems

1. Discuss at least three distinguishing factors that can be used to differentiate among

parallel computer systems. Why do systems vary so widely with respect to these factors?

Some of the main factors that differ from one parallel system to another are

the number and type of processors used and the way in which they are connected to

each other. Some parallel systems use shared main memory for communication

between processors, while others use a message-passing paradigm. With either of

these approaches, a wide variety of networks can be used to facilitate the exchange

of information. The main reason why parallel systems exhibit such a wide range of

characteristics is probably because there is such a wide variety of applications. The

characteristics of the intended applications drive the characteristics of the machines

that are built to run them.

2. Michael Flynn defined the terms SISD, SIMD, MISD, and MIMD to represent certain

classes of computer architectures that have been built or at least considered. Tell what

each of these abbreviations stands for; describe the general characteristics of each of

these architectures; and explain how they are similar to and different from one another. If

possible, give an example of a specific computer system fitting each of Flynn’s

classifications.

SISD stands for Single Instruction stream, Single Data stream – this is a

single-processor system such as a typical desktop or notebook PC or workstation.

Generally, such systems are built around a processor with a conventional von

Neumann (Princeton) or Harvard architecture. SIMD stands for Single Instruction

stream, Multiple Data stream. SIMD systems are commonly known as array

processors because they execute the same operation on a large collection of operands

at the same time. The control unit of a SIMD machine is much like that of a SISD machine,

but it controls a number of processing elements simultaneously. Examples of SIMD

computers include the ILLIAC IV, the Connection Machine, etc. MISD (Multiple

Instruction stream, Single Data stream) machines would have carried out multiple

algorithms on the same data sets. Such machines, while conceptually possible, have

yet to be developed. MIMD is an acronym for Multiple Instruction stream, Multiple

Data stream. This classification of machines encompasses the vast majority of

parallel systems, which consist of multiple CPUs (each with a Princeton or Harvard

architecture like the one in a typical SISD system) connected together by some type

of communications network. Examples of MIMD computers include the Silicon

Graphics Origin series, the Cray T3E, and any “Beowulf” class cluster system.

3. What is the main difference between a vector computer and the scalar architectures that

we studied in Chapters 3 and 4? Do vector machines tend to have a high, or low, degree

of generality as defined in Section 1.4? What types of applications take best advantage of

the properties of vector machines?

The main difference between a vector computer and conventional scalar

architectures is the fact that instructions executed by vector processors operate not

on individual values or pairs of values, but on vectors (one-dimensional arrays of

values). In a scalar machine the ADD instruction adds two numbers to produce

their sum; in a vector processor the ADD instruction adds each element of one set of

numbers to the corresponding element of a second set, producing a corresponding

set of results. This is usually accomplished with a deeply pipelined execution unit(s)

through which vector elements are fed in succession.
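
A plain-C sketch of that difference (not actual vector code, just an illustration of the semantics): a scalar ADD produces one sum, while a single vector ADD conceptually performs the whole elementwise loop.

    /* Scalar vs. vector ADD semantics, illustrated with ordinary C. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        double c[N];

        double scalar_sum = a[0] + b[0];     /* scalar ADD: one result           */

        for (int i = 0; i < N; i++)          /* vector ADD: the machine would do */
            c[i] = a[i] + b[i];              /* this whole loop as one instruction */

        printf("scalar: %.0f, vector element 0: %.0f\n", scalar_sum, c[0]);
        return 0;
    }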

Because of their unique construction, vector machines have a very low degree

of generality. They are extremely well suited to certain applications, particularly

scientific and engineering applications like weather forecasting, CFD simulations,

etc. that do a great deal of “number crunching” on vectors or arrays of data.

However, they offer little to no advantage when running office applications or any

type of scalar code. While vector processors are more powerful now than they have

ever been, they are not as popular in the overall supercomputer market as they once

were because they are useful for a relatively narrow range of specialized

applications. Cluster computers based on RISC and superscalar microprocessors

are now more popular since they tend to be less expensive (per MIP or MFLOP)

and can run a wider range of applications efficiently.

4. How are array processors similar to vector processors and how are they different?

Explain the difference between fine-grained and coarse-grained array processors. Which

type of array parallelism is more widely used in today’s computer systems? Why?

Array processors are similar to vector processors in that a single machine

instruction causes a particular computation to be carried out on a large set of

operands. They are different in their construction: array processors use a number

of relatively simple processing elements (spatial parallelism) while vector processors

generally employ a small number of deeply pipelined processing units (temporal

parallelism). Fine-grained array processors consist of a very large number of

extremely simple processing elements, while coarse-grained array processors have a

few (but usually somewhat more capable) processing elements. Coarse-grained

array parallelism is more widely used in today’s computers, particularly in the

multimedia accelerators that have been added to many popular microprocessor

families. This is probably because a coarse-grained SIMD is useful in a wider range

of applications than a fine-grained array processor would be.

5. Explain the difference between multiprocessor and multicomputer systems. Which of

these architectures is more prevalent among massively parallel MIMD systems? Why?

Which architecture is easier to understand (for programmers familiar with the

uniprocessor model)? Why?

Multiprocessors are systems in which the CPUs communicate by sharing

main memory locations, while multicomputers are systems in which each CPU has

its own, local memory and communication is accomplished by passing messages over

a network. Most massively parallel MIMD systems are multicomputers because

sharing main memory among large numbers of processors is difficult and expensive.

Multiprocessors are easier for most programmers to work with because the shared

memory model allows communication to be done using the same approaches that

are used in systems with a single CPU. Multicomputers, on the other hand, must

make use of message passing – a more "artificial" and counterintuitive paradigm

for communication.

6. Explain the similarities and differences between UMA, NUMA, and COMA

multiprocessors.

All three of these architectural classifications refer to machines with shared

main memory. Any location in memory can be read or written by any processor in

the system; this is how processes running on various processors communicate with

each other. In a system with UMA (Uniform Memory Access), any memory location

can be read or written by any CPU in the same amount of time (unless the memory

module in question is already busy). This is a desirable property, but the hardware

required to accomplish it does not scale economically to large systems.

Multiprocessors with many CPUs tend to use the NUMA (Non-Uniform

Memory Access) or the more recently developed COMA (Cache-Only Memory

Architecture) approaches. NUMA systems use a modular interconnection scheme in

which memory modules are directly connected to some CPUs but only indirectly

connected to others. This is more cost-effective for larger multiprocessors, but

access time is variable (remote modules take longer to read or write than local

modules) and thus code/data placement must be “tuned” to the specific

characteristics of the memory system (a non-trivial exercise) for best performance.

In a COMA system, the entire main memory space is treated as a cache; all

addresses represent tags rather than physical locations. Items in memory can be

migrated and/or replicated dynamically so they are nearer to where they are most

needed. This experimental approach requires even more hardware support than

the other two, and is therefore more expensive to implement, but it has the potential

to make larger multiprocessors behave more like SMPs and thus perform well

without the software having to be tuned to a particular hardware configuration.

7. What does “cache coherence” mean? In what type of computer system would cache

coherence be an issue? Is a write-through strategy sufficient to maintain cache coherence

in such a system? If so, explain why. If not, explain why not and name and describe an

approach that could be used to ensure coherence.

Cache coherence means that every CPU in the system sees the same view of

memory (“looking through” its cache(s) to the main memory). In a coherent system,

any CPU should get the same value as any other when it reads a shared location,

and this value should reflect updates made by this or any other processor. A write-

through strategy is not sufficient to ensure this, because updating the contents of

main memory is not enough to make sure the other caches’ contents are consistent

with the updated memory. Even if main memory contains the updated value, the

other caches might have previously loaded that refill line and thus might still

contain an old value.

To ensure a coherent view of memory across the whole machine, copies of a

line that has been modified (written) need to be updated in other caches (so that

they immediately receive the new data) or invalidated by them (so that they will

miss if they try to access the old data and update it at that point by reading the new

value from main memory). The write-update or write-invalidate operations can be

accomplished by implementing a snoopy protocol (typical of smaller, SMP systems)

in which caches monitor a common interconnection network, such as a bus, to detect

writes to cached locations. In larger multiprocessor systems such as NUMA

machines where bus snooping is not practical, a directory protocol (in which caches

notify a centralized controller(s) of relevant transactions and, in turn, receive

notifications of other caches’ write operations) is often used.
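
The following toy model (an illustration only, not a real coherence protocol) shows why write-through alone is insufficient and how invalidation repairs the problem. Each "cache" here holds a single word.

    /* Two one-word caches over a shared memory word.  CPU 0 writes through to
       memory, but CPU 1 still hits on its stale copy until it is invalidated. */
    #include <stdio.h>

    static int memory = 10;                        /* the shared location */

    typedef struct { int value; int valid; } Cache;

    static int cache_read(Cache *c) {
        if (!c->valid) { c->value = memory; c->valid = 1; }   /* miss: refill */
        return c->value;                                      /* hit: cached  */
    }

    static void write_through(Cache *c, int v) {
        c->value = v; c->valid = 1;                /* update own cache      */
        memory = v;                                /* ...and main memory    */
    }

    int main(void) {
        Cache cpu0 = {0, 0}, cpu1 = {0, 0};

        cache_read(&cpu0);                         /* both CPUs cache the line */
        cache_read(&cpu1);

        write_through(&cpu0, 99);                  /* CPU 0 writes through     */
        printf("without invalidation, CPU 1 reads %d (stale)\n",
               cache_read(&cpu1));

        cpu1.valid = 0;                            /* write-invalidate via snooping */
        printf("after invalidation, CPU 1 reads %d\n", cache_read(&cpu1));
        return 0;
    }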

8. What are the relative advantages and disadvantages of write-update and write-invalidate

snoopy protocols?

A write-invalidate snoopy protocol is simpler to implement and uses less bus

bandwidth since there is no need for other caches to load modified data. It works

well when data are lightly shared, but not as well when data are heavily shared since

the hit ratio is usually lower. A write-update snoopy protocol keeps the hit ratio

higher for heavily shared data and works well when reads and writes alternate, but

it is more complex to implement and may use more bus bandwidth (which can be a

limiting factor if several processors are sharing a common bus).

9. What are directory-based protocols and why are they often used in CC-NUMA systems?

Directory-based protocols are cache coherence schemes that do not rely on

the “snooping” of a common bus or other single interconnection between CPUs in a

multiprocessor system. (While snooping is feasible in most SMP systems, it is not so

easily accomplished in larger NUMA architectures with distributed shared memory

using a number of local and system-wide interconnections.) Communications with

the system directory, which is a hardware database that maintains all the

information necessary to ensure coherence of the memory system, are done in point-

to-point fashion and thus scale better to a larger system. In very large systems, the

directory itself may be distributed (split into subsets residing in different locations)

to further enhance scalability.

10. Explain why synchronization primitives based on mutual exclusion are important in

multiprocessors. What is a read-modify-write cycle and why is it significant?

A read-modify-write (RMW) cycle is one in which a memory location is read,

its contents are modified by a processor, and the new value is written back into the

same memory location in indivisible, “atomic” fashion. This type of operation is

important in accessing mutual exclusion primitives such as semaphores. The RMW

cycle protects the semaphore test/update operation from being interrupted by any

other process; if such an interruption did occur, this could lead to a lack of mutual

exclusion on a shared resource, which in turn could cause incorrect operation of the

program.
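
A minimal sketch of such a primitive, using the C11 atomic_flag as the semaphore; atomic_flag_test_and_set performs the read, modify, and write as one indivisible operation, so two processors cannot both see the lock as free.

    /* Spinlock built on an atomic read-modify-write operation. */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static int shared_counter = 0;

    static void acquire(void) {
        while (atomic_flag_test_and_set(&lock))   /* atomic RMW: spin if busy */
            ;
    }

    static void release(void) {
        atomic_flag_clear(&lock);
    }

    int main(void) {
        acquire();               /* protect the critical section                */
        shared_counter++;        /* without the RMW cycle, two CPUs could both  */
        release();               /* "win" the test and corrupt the shared data  */
        printf("counter = %d\n", shared_counter);
        return 0;
    }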

11. Describe the construction of a “Beowulf cluster” system. Architecturally speaking, how

would you classify such a system? Explain.

A Beowulf-type cluster is a parallel computer system made up of a number of

inexpensive, commodity computers (often generic Intel-compatible PCs) networked

together, usually with off-the-shelf components such as 100 megabit/s or 1 gigabit/s

Ethernet. The operating system is often an open-source package such as Linux.

The idea is to aggregate a considerable amount of computational power as

inexpensively as possible. Because they are composed of multiple, complete

computer systems and communicate via message passing over a network, Beowulf

clusters are considered multicomputers (LM-MIMD systems).

12. Describe the similarities and differences between circuit-switched networks and packet-

switched communications networks. Which of these network types is considered “static”

and which is “dynamic”? Which type is more likely to be centrally controlled and which

is more likely to use distributed control? Which is more likely to use asynchronous

timing and which is more likely to be synchronous?

Circuit-switched and packet-switched networks are both used to facilitate

communications in parallel (SIMD and MIMD) computer systems. Both allow

connections to be made for the transfer of data between nodes. They are different in

that circuit-switched networks actually make and break physical connections to

allow various pairs of nodes to communicate, while packet-switched networks

maintain the same physical connections all the time and use a routing protocol to

guide message packets (containing data) to their destinations.

Because physical connections are not changed to facilitate communication,

packet-switched networks are said to be static, while circuit-switched networks

dynamically reconfigure hardware connections. Packet-switched networks

generally exhibit distributed control and are most likely to be asynchronous in their

timing. Circuit-switched networks are more likely to be synchronous (though some

are asynchronous) and more commonly use a centralized control strategy.

13. What type of interconnection structure is used most often in small systems? Describe it

and discuss its advantages and disadvantages.

Small systems often use a single bus as an interconnection. This is a set of

address, data, and control/timing signals connected to all components in the system

to allow information to be transferred between them. The principal advantage of a

bus-based system is simplicity of hardware (and therefore low cost). Its main

disadvantage is limited performance; only one transaction (read or write operation)

can take place at a time. In a system with only one or a few bus masters (such as

CPUs, IOPs, DMACs, etc.) there will typically be little contention for use of the bus

and performance will not be compromised too much; in larger systems, however,

there will often be a desire for multiple simultaneous data transfers which, of

course, cannot happen. Thus one or more bus masters will have to wait to use the

bus and overall performance will suffer to some degree.

14. Describe the operation of a static network with a “star” topology. What connection

degree do its nodes have? What is its communication diameter? Discuss the advantages

and disadvantages of this topology.

A star network has all the computational and/or memory nodes connected to

a single, central communications node or “hub”. All the nodes except the hub have

a connection degree of one (the hub is of degree n if there are n other nodes). Such a

network has a communication diameter of two (one hop from the source to the hub,

one hop from the hub to the destination). Advantages include the network’s small

communication diameter (2) and the fact that the communication distance is the

same for all messages sent across the network. Another advantage is that the

network structure is simple and it is easy to add additional nodes if system

expansion is desired. The main disadvantage of a star network is that all

communications must pass through the hub. Because of this, the hub may become

saturated with traffic, thus becoming a bottleneck and limiting system performance.

15. How are torus and Illiac networks similar to a two-dimensional nearest-neighbor mesh?

How are they different?

Both torus and Illiac networks are similar to a two-dimensional nearest-

neighbor mesh in that their nodes are of degree four (they are connected to

neighbors in each of the x and y directions; in other words, to a “north”, “south”,

“east”, and “west” neighbor). This is true of the interior nodes in a nearest-

neighbor mesh as well. The differences between these three network topologies lie

in how the “edge” and “corner” nodes are connected. In a basic nearest-neighbor

mesh, the nodes on the side and top/bottom edges lack one of the possible

connections and thus have a connection degree of only three; the corner nodes lack

one connection in the x direction and one in the y direction and therefore are of

degree two.

In a torus network, the edge nodes have a “wrap-around” connection to the

node in the same row or column on the opposite edge (corner nodes have wrap-

around connections in both dimensions); therefore all nodes in the network have a

connection degree of four. The Illiac network has the same configuration, except

that the rightmost node in each row has a wrap-around connection to the leftmost

node in the next (rather than the same) row, with the rightmost node in the last row

being connected to the leftmost node in the first row.

16. Consider a message-passing multicomputer system with 16 computing nodes.

(a) Draw the node connections for the following connection topologies: linear array,

ring, two-dimensional rectangular nearest-neighbor mesh (without edge

connections), binary n-cube.

(The drawings will be similar to the appropriate figures in Chapter 6.)

(b) What is the connection degree for the nodes in each of the above interconnection

networks?

Linear array: 2 (1 for end nodes). Ring: 2 (all nodes). 2-D mesh: 4 (3 on

edges, 2 in corners). Binary n-cube: 4 (all nodes).

(c) What is the communication diameter for each of the above networks?

Linear array: 15. Ring: 8. 2-D mesh: 6. Binary n-cube: 4. (See the sketch after part (d) for a quick check of these values.)

(d) How do these four networks compare in terms of cost, fault tolerance, and speed

of communications? (For each of these criteria, rank them in order from most

desirable to least desirable.)

Cost: linear array, ring, 2-D mesh, n-cube. Fault tolerance: n-cube, 2-D

mesh, ring, linear array. Speed: n-cube, 2-D mesh, ring, linear array.
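
The degrees and diameters quoted in parts (b) and (c) follow from standard formulas; the short check below assumes the 16 nodes form a 4 x 4 mesh.

    /* Quick check of the communication diameters for n = 16 nodes:
       linear array n-1, ring n/2, k-by-k mesh 2(k-1), binary hypercube log2(n). */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int n = 16, k = 4;                          /* 16 nodes, 4 x 4 mesh */
        printf("linear array: %d\n", n - 1);        /* 15 */
        printf("ring:         %d\n", n / 2);        /* 8  */
        printf("2-D mesh:     %d\n", 2 * (k - 1));  /* 6  */
        printf("n-cube:       %d\n", (int)log2(n)); /* 4  */
        return 0;
    }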

17. Describe, compare, and contrast store-and-forward routing with wormhole routing.

Which of these approaches is better suited to implementing communications over a static

network with a large number of nodes? Why?

Both wormhole and store-and-forward routing are methods for transmitting

message packets across a static network. Store-and-forward routing treats each

message packet as a unit, transferring the entire packet from one node in the

routing path to the next before beginning to transfer it to a subsequent node. This is

a simple but not very efficient way to transfer messages across the network; it can

cause significant latency for messages that must traverse several nodes to reach

their destinations. Wormhole routing divides the message packets into smaller

pieces called flits; as soon as an individual flit is received by an intermediate node

along the routing path, it is sent on to the next node without waiting for the entire

packet to be assembled. This effectively pipelines, and thus speeds up, the

transmission of messages between remote nodes. A network with a large number of

nodes is likely to have a large communication diameter, with many messages

requiring several “hops” to reach their destinations, and thus is likely to benefit

considerably from the use of wormhole routing.

18. In what type of system would one most likely encounter a full crossbar switch

interconnection? Why is this type of network not usually found in larger (measured by

number of nodes) systems?

A full crossbar switch would most likely be found in a high-performance

symmetric multiprocessor (SMP) system. Such a network would probably not be

used in a system with many nodes because it has a cost and complexity that

increases proportionally to the square of the number of nodes connected. (In other

words, its cost is O(n²).)

19. Consider the different types of dynamic networks discussed in this chapter. Explain the

difference between a blocking network and a non-blocking network. Explain how a

rearrangeable network compares to these other two dynamic network types. Give an

example of each.

In a blocking network such as the Omega network, any node on one side of

the network can be connected to any node on the other side. However, creating one

connection across the network prevents (blocks) certain other pairs of nodes from

being connected as long as the first connection exists. Only certain subsets of

connections can exist simultaneously. In a non-blocking network such as a full

crossbar switch, any node on one side of the network can be connected to any node

on the other side, and this connection does not interfere with the establishment of a

connection between any other (idle) nodes.

A rearrangeable network such as a Benes network represents a middle

ground (in functionality, complexity, and expense) between the previous two

alternatives. It is similar in structure to a blocking network, but adds redundancy

in the form of additional stages and/or connections. The redundancy allows for

multiple possible paths connecting any two given nodes. While any particular

connection across the network does block certain other connections, an established

connection can always be rerouted along one of the alternate paths (a.k.a.

rearranged) in order to allow another desired connection to be made.

20. Choose the best answer to each of the following questions:

(a) Which of the following is not a method for ensuring cache coherence in a

multiprocessor system? (1) write-update snoopy cache; (2) write-through cache; (3)

write-invalidate snoopy cache; (4) full-map directory protocol (answer: (2), write-through cache)

(b) In a 16-node system, which of these networks would have the smallest

communication diameter? (1) n-cube; (2) two-dimensional nearest-neighbor mesh; (3)

ring; (4) torus (tie; both have diameter = 4)

(c) Which of the following is a rearrangeable network? (1) Illiac network; (2) multistage

cube network; (3) crossbar switch; (4) Benes network; (5) none of the above (answer: (4), Benes network)

(d) In a 64-node system, which of the following would have the smallest node connection

degree? (1) ring; (2) two-dimensional nearest-neighbor mesh; (3) Illiac network; (4) n-

cube (answer: (1), ring)

21. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Multicomputer (LM-MIMD) - a parallel computer architecture in which there are

several processing nodes, each of which has its own local or private memory modules

Multiprocessor (GM-MIMD) - a parallel computer architecture in which there are

several processing nodes, all of which have access to shared memory modules

Single Instruction stream, Multiple Data stream (SIMD) machine - another name for

an array processor

Symmetric Multiprocessor (SMP) - a relatively small MIMD system in which the

“uniform memory access” property holds

Deadlock - a situation in which messages on a network cannot proceed to their

destinations because of mutual or cyclic blocking

Blocking network - an interconnection network in which any node can be connected to

any node, but some sets of connections are not simultaneously possible

Communication diameter - the maximum number of “hops” required to communicate

across a network

Packet-switched network - multicomputers with many nodes would be interconnected

by this

(Full) Crossbar switch - the classic example of a non-blocking, circuit-switched

interconnection network for multiprocessor systems

Store-and-forward routing - a method of message passing in which flits do not continue

toward the destination node until the rest of the packet is assembled

Write-update snoopy cache protocol - a method used for ensuring coherence of data

between caches in a multiprocessor system where a write hit by one CPU causes other

processors’ caches to receive a copy of the written value

Flow control digit (flit) - the basic unit of information transfer through the network in a

multicomputer system using wormhole routing

7 Special-Purpose and Future Architectures

1. Explain how a dataflow machine avoids the “von Neumann bottleneck.”

A dataflow machine, unlike one based on a von Neumann architecture, does

not rely on the use of sequential algorithms to guide the processing of data.

Processing is data-driven rather than instruction-driven; it is not inherently

sequential, but instead allows the hardware to exploit any parallelism inherent in

the task. Since there is no need to fetch “instructions” separately from data, the von

Neumann bottleneck described in Chapter 1 does not come into play.

2. Draw a dataflow graph and an activity template for the following programming construct:

if (x >= 0) { z = (x + y) * 4; } else { z = (y - x) * 4; }

The dataflow graph will be similar to Figure 7.2 except for the details of the

operations. Likewise, the activity template will be similar to Figure 7.3.

3. If you had a scientific application that involved a large number of matrix manipulations,

would you rather run it on a dataflow computer or a SIMD computer? Explain.

It would probably be better to run such an application on a SIMD computer,

since that type of system is designed to optimize performance on array processing

(matrices are handled as two-dimensional arrays). Dataflow computers do

reasonably well with “unstructured” parallelism but are not particularly good at

exploiting array-type parallelism.

4. What do you think is the main reason why dataflow computers have so far not been

widely adopted?

There are several legitimate reasons why dataflow computers have not

reached the mainstream. One is their reliance on specialized programming

languages to express the constructs represented in a dataflow graph or activity

template. Machines that do not use standard programming languages tend to have

a higher software development cost. Also, dataflow machines do not perform

particularly well (at least, not well enough to justify their cost) on many common

applications. Unless the parallelism inherent to the task is a good match for the

parallelism of the machine hardware, performance gains will be modest at best.

Finally, dataflow machines are not easy or cheap to build and do not take much

advantage of the locality of reference that is essential to the function of hierarchical

memory systems.

5. Give an example of how dataflow techniques have influenced and/or been used in

conventional computer design.

Dataflow techniques were a part of the control strategy developed by Robert

Tomasulo for the IBM 360/91 computer in the 1960s. While the machine was

programmed like a traditional von Neumann computer, internally its hardware

execution units were scheduled using a dataflow approach: an operation was sent to

a functional unit once its operands were available. Some modern, superscalar

microprocessors still use Tomasulo’s method (or variations on it) and thus bear the

influence of dataflow computing.

6. Are superthreaded and hyper-threaded processors the same thing? If not, how do they

differ?

Superthreaded and hyper-threaded processors are close cousins, but not

identical. Superscalar machines that use superthreading (also called time-slice

multithreading) can issue multiple instructions belonging to one process (or thread)

during a given clock cycle; during a different clock cycle, instructions belonging to

another process can be issued. Effectively, use of the CPU is time-multiplexed on a

cycle by cycle basis. Hyper-threading (or simultaneous multithreading) takes this

concept one step further: during a given clock cycle, instructions from more than

one process may be issued in order to make the maximum possible use of CPU

resources.

7. Would you classify an artificial neural network as an SISD, SIMD, MISD, or MIMD

system, or something else? Make a case to support your choice.

The best answer is probably “something else” – ANNs represent a unique

class of architectures with their own special characteristics. Artificial neural

networks really do not fit the description of any of the systems described in Flynn’s

taxonomy of computer systems. If one had to try to pigeonhole them into one of his

four classifications, they could be said to at least resemble, in some ways, MIMD or

even MISD machines. (ANNs are clearly not SISD or SIMD architectures because

they lack a single instruction stream.)

8. Explain how the processing elements and interconnections in an artificial neural network

relate to the structure of the human nervous system.

The many, simple processing elements in an artificial neural network

correspond to the many neurons in the human body. Each processing element, like

each real neuron, accepts several inputs and computes the weighted sum of those

inputs. This sum is applied to an activation function that simulates the action

potential threshold of a biological neuron. The activation function determines

whether a given processing element will send an output to the input of another

processing element (simulated neuron).
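
A hedged sketch of one such processing element, with arbitrary example weights and a simple step activation in place of a real action-potential model:

    /* One artificial "neuron": weighted sum of inputs followed by a
       threshold activation function. */
    #include <stdio.h>

    #define NUM_INPUTS 3

    static double activate(double sum, double threshold) {
        return (sum >= threshold) ? 1.0 : 0.0;     /* simple step activation */
    }

    static double neuron(const double in[], const double w[], double threshold) {
        double sum = 0.0;
        for (int i = 0; i < NUM_INPUTS; i++)
            sum += in[i] * w[i];                   /* weighted sum of inputs */
        return activate(sum, threshold);
    }

    int main(void) {
        double inputs[NUM_INPUTS]  = {1.0, 0.0, 1.0};
        double weights[NUM_INPUTS] = {0.4, 0.9, 0.3};
        printf("output = %.0f\n", neuron(inputs, weights, 0.5));  /* fires: 0.7 >= 0.5 */
        return 0;
    }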

9. How is a supervised artificial neural network programmed to carry out a particular task?

What is the difference between a supervised vs. unsupervised ANN?

An artificial neural network is not so much programmed as trained

iteratively to perform a given task. A supervised ANN receives its training via the

user’s repeated applications of inputs, producing outputs that are compared to the

corresponding desired outputs; the neuron weights are adjusted after each pass

until the network “learns” to produce good output for the full range of inputs.

Unsupervised ANNs are used in situations where feedback is unavailable (no

“known good” output data exists). Instead, they use “competitive learning”

techniques to learn on their own without intervention by a human trainer.

10. Why are ANNs well suited to applications such as robotic control? Give an example of

an application for which you do not think an ANN would be a good choice.

ANNs are a good choice for robotic control because complex motions are

difficult to program algorithmically using traditional programming languages.

Since it is possible to generate examples of the desired functionality and since most

ANNs operate on the principle of “training” a system to produce outputs

corresponding to given examples, they are a natural “fit”. After all, the biological

neural networks of human beings and animals can be trained to produce desired

output, so why shouldn’t artificial neural networks exhibit similar strengths? On

the other hand, applications such as computational fluid dynamics, etc. that require

a great deal of numeric computations (“number crunching”) would probably

perform a lot better on a conventional supercomputer than on an ANN.

11. What is different about logical variables in a fuzzy system as compared to a conventional

computer system?

In a conventional computer system, logical variables are binary in nature.

That is to say, they are either true or false; on or off; 1 or 0; 100% or 0%. In a

fuzzy system, logical values can take on a continuum of truth values between the

limits of 0 and 1, inclusive.

12. Both ANNs and fuzzy logic systems attempt to mimic the way human beings make

decisions. What is the main difference between the two approaches?

The main difference is that artificial neural networks attempt to mimic the

actual structure of the human brain by simulating the functionality of neurons and

the connections that exist between them. Fuzzy logic systems attempt to model the

uncertain, imprecise methods people use to make decisions based on (often

incomplete) available information, but their structure is not based on any biological

model.

13. What is a fuzzy subset and how does the idea of a membership function relate to it?

Propose a simple membership function rich() that deals with the concept of a fuzzy

subset of wealthy people.

A fuzzy subset is a portion of the universe of discourse (the set of all things

under consideration in formulating a given problem), whose membership is not

defined precisely. A membership function expresses the perceived likelihood that a

given member of the universe belongs to a particular fuzzy subset. In other words,

it produces a truth value (in the range of 0 to 1, inclusive) that indicates an object’s

degree of membership in the fuzzy subset. One possible definition of a membership

function rich() would be as follows:

rich(x) = 0, if income(x) < $100K

rich(x) = (income(x) - $100K) / $900K, if $100K ≤ income(x) ≤ $1M

rich(x) = 1, if income(x) > $1M

14. Can the Boolean, or crisp, logic operations AND, OR, and NOT be defined in regard to

fuzzy logic? If so, explain how; if not, explain why not.

Yes, the Boolean functions AND, OR, and NOT correspond to fuzzy

operations. NOT is generally defined such that truth (not x) is equal to 1.0 – truth

(x). Various definitions have been used for the AND and OR functions; the most

common are truth (x AND y) = min (truth (x), truth (y)) and truth (x OR y) = max

(truth (x), truth (y)). Note that if the variables x and y are restricted to only the

discrete values 0 and 1 (as in binary logic systems), these definitions are consistent

with Boolean algebra properties.
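
A small sketch tying the rich() membership function above to the min/max definitions of the fuzzy operators; the income figure used in main() is an arbitrary example.

    /* Fuzzy membership plus fuzzy NOT/AND/OR, using the definitions above. */
    #include <stdio.h>

    static double rich(double income) {             /* membership in "rich" */
        if (income < 100e3) return 0.0;
        if (income > 1e6)   return 1.0;
        return (income - 100e3) / 900e3;
    }

    static double fuzzy_not(double a)           { return 1.0 - a; }
    static double fuzzy_and(double a, double b) { return a < b ? a : b; }  /* min */
    static double fuzzy_or (double a, double b) { return a > b ? a : b; }  /* max */

    int main(void) {
        double r = rich(550e3);                     /* 0.5: halfway to "rich" */
        printf("rich         = %.2f\n", r);
        printf("NOT rich     = %.2f\n", fuzzy_not(r));
        printf("rich AND 0.8 = %.2f\n", fuzzy_and(r, 0.8));
        printf("rich OR 0.8  = %.2f\n", fuzzy_or(r, 0.8));
        return 0;
    }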

15. Explain, in the context of a fuzzy expert system, what rules are and how they are used.

Rules are statements that reflect the knowledge of a human expert about how

a given system works. They are typically expressed in terms of if-then relationships

between fuzzy output subsets and linguistic variables derived from the inputs. The

rules are used to make inferences about the system based on the fuzzified input data

(that is, the results after the membership functions are applied to the “raw” input

data). The outputs of the various rules that make up the system’s rule base are

combined and used to create a single fuzzy subset for each output variable. These

outputs can then be defuzzified to produce “crisp” outputs if they are needed.

16. For what type(s) of physical system is fuzzy control particularly well suited?

Fuzzy control is a good choice for controlling systems that are nonlinear,

complex, and/or have poorly specified characteristics, thus making them a poor

match for conventional analog or digital control systems using algorithms that

depend on having a well-defined, linear model of the process to be controlled.

17. What is Moore’s Law and how has it related to advances in computing over the last 40

years? Is Moore’s Law expected to remain true forever or lose its validity in the future?

Explain your answer and discuss the implications for the design of future high-

performance computer systems.

Moore’s Law says that the continually shrinking sizes of semiconductor

devices will result in an exponential growth (doubling on approximately a yearly

basis) in the number of transistors that can feasibly be integrated on a single chip.

This has resulted in a doubling of computational power approximately every 18-24

months (since, apparently, computational power is not a linear function of the

number of transistors).

Given the known laws of physics, it is not possible that Moore’s Law will

continue to hold true indefinitely. The problem is that transistors used as switching

elements in computers cannot keep shrinking once they get to the size of individual

atoms or small groups of atoms. Devices that small will no longer work under the

binary logic principles of Boolean algebra, and the performance of traditional

computer architectures will reach a hard limit. (This has been estimated to occur

within the next 10-20 years.) At that point, further increases in performance will

only be achievable if some new approach, for example quantum computing, is

adopted.

18. How does a quantum computer fundamentally differ from all the other computer

architectures discussed in this book? What allows a quantum computer to achieve the

effect of a massively parallel computation using a single piece of hardware?

A quantum computer differs from all traditional computer architectures in

that it does not operate on the principles of Boolean algebra, where computations

are done sequentially (or in parallel, by replicating hardware) on binary digits (bits)

or groups of bits. In conventional machines, each bit can only be 0 or 1 at any given

time. Quantum computers instead operate on quantum bits (qubits). Qubits can

take on not only the distinct values 0 or 1, but also – by the principle of quantum

superposition – they can take on states that are 0 and 1 at the same time. Each

added qubit doubles the number of states that can be held in superposition, so a

quantum computer's potential power grows exponentially with its number of qubits.

While a 16-bit binary register can take on only one of its 65,536 possible states at a

time, a 16-qubit quantum register can be in all 65,536 states at once in coherent

superposition. This allows the effect of a massively parallel computation to be

achieved using only one piece of hardware.
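
The bookkeeping behind this claim can be illustrated with a classical state-vector simulation in Python using NumPy; the sketch below merely tracks the 2**n complex amplitudes on a conventional machine (which is exactly what becomes infeasible as n grows) and is of course not a quantum computer itself:

    import numpy as np

    n_qubits = 16
    dim = 2 ** n_qubits                     # 65,536 basis states

    # A uniform (equal-amplitude) superposition over all basis states, such as
    # would result from applying a Hadamard gate to every qubit of |00...0>.
    state = np.full(dim, 1.0 / np.sqrt(dim), dtype=complex)

    print(dim)                                          # 65536 amplitudes tracked at once
    print(np.abs(state[0]) ** 2)                        # each outcome has probability 1/65536
    print(np.isclose(np.sum(np.abs(state) ** 2), 1.0))  # the amplitudes are normalized

    # Each additional qubit doubles the number of amplitudes that must be tracked,
    # so a 32-qubit register would already require over four billion of them.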

19. What are some of the problems scientists must solve in order to make supercomputers

based on the principles of quantum mechanics practical?

Researchers working on quantum computers have encountered a number of

problems that have so far made it impractical to construct machines that can

compete with supercomputers based on conventional architectures. First of all, it is

difficult to build a quantum computer, since there is a need to separate one or a

small number of atoms from others and keep them in a steady state in order to use

them for computation. Another significant problem is decoherence, a phenomenon

that can introduce errors in computations due to interactions of the computer

hardware with the surrounding environment. Finally, assuming one has performed

a quantum computation, it is difficult to observe the result without collapsing the

coherent superposition of states and destroying one’s work. Research into solving

these problems is ongoing.

20. What application(s) are expected to be a good match for the unique capabilities of

quantum computers? Explain.

If large-scale quantum computers can be practically constructed, they will

probably be used to solve extremely numerically intensive problems that have

proven to be intractable with even the fastest conventional systems. (They won’t be

used for word processing, sending e-mail, or surfing the Net!) One area that seems

to be a good potential match for the capabilities of quantum computers is

cryptography – the making and breaking of highly secure codes that protect

sensitive information from being intercepted by unauthorized parties.

21. Fill in the blanks below with the most appropriate term or concept discussed in this

chapter:

Dataflow machine - a type of computer architecture in which execution depends on the

availability of operands and execution units rather than a sequential-instruction program

model

Node (actor) - an element in a dataflow graph that represents an operation to be

performed on data

Tokens - these are used to represent data values (operands and results) in algorithms for a

dataflow architecture

IBM 360/91 - this (outwardly) von Neumann machine made use of dataflow techniques

for internal scheduling of operations

Hyper-threading - a machine using this technique can issue instructions from more than

one thread of execution during the same clock cycle

Artificial Neural Network (ANN) - a type of computer architecture with a structure

based on that of the human nervous system

Neurons - the fundamental units that make up a biological neural network

Dendrites - these are fibers that act as “input devices” for neurons in human beings

Convergence - when an artificial neural network achieves this, it is “trained” and ready

to be put into operating mode

Single-Layer Perceptron (SLP) - the earliest and simplest type of artificial neural

network

Unsupervised neural network - a type of artificial neural network that does not require

user intervention for training

Fuzzy logic architecture - a type of computer architecture in which logical values are

not restricted to purely “true” or “false” (1 or 0)

Linguistic variable - a type of variable that expresses a “fuzzy” concept; for example,

“slightly dirty” or “very fast”

Universe of discourse - the set of all objects under consideration in the design of a fuzzy

system

Truth value - the numerical degree (between 0 and 1, inclusive) of membership that an

object has in a fuzzy subset

Fuzzification - the first step performed in doing “fuzzy computations” for an expert

system, control system, etc.

Defuzzification - this is necessary if a fuzzy result must be converted to a crisp output

Quantum computer - a type of computer architecture in which the same physical

hardware can be used to simultaneously compute many results as though it were parallel

hardware; its operation is not based on Boolean algebra, but on the physics of subatomic

particles

Moore’s Law - a prophetic observation of the fact that conventional computers would

tend to grow exponentially more powerful over time as integrated circuit features got

smaller and smaller

Quantum bit (qubit) - the basic unit of information in a quantum computer

Quantum interference - this phenomenon results from the superposition of multiple

possible quantum states

Quantum entanglement - a state in which one atom's quantum properties are linked to

those of another atom (for example, their spins are always found to be opposite), so that

the two cannot be described independently of one another

Decoherence - the tendency for interactions with the surrounding environment to disturb

the state of qubits, possibly resulting in computational errors

Thirty - a quantum computer with this many qubits has been estimated to have 10

TFLOPS of computational power

Cryptography - so far, this appears to be the most likely application for supercomputers

based on quantum principles