Computer Architecture: Fundamentals and Principles of Computer Design
Solutions Manual
Joseph D. Dumas II
University of Tennessee at Chattanooga Department of Computer Science and Electrical Engineering
Copyright © 2006
CRC Press/Taylor & Francis Group
1 Introduction to Computer Architecture
1. Explain in your own words the differences between computer systems architecture and
implementation. How are these concepts distinct, yet interrelated? Give a historical
example of how implementation technology has affected architectural design (or vice
versa).
Architecture is the logical design of a computer system, from the top level on
down to the subsystems and their components – a specification of how the parts of
the system will fit together and how everything is supposed to function.
Implementation is the physical realization of an architecture – an actual, working
hardware system on which software can be executed.
There are many examples of advances in implementation technology
affecting computer architecture. An obvious example is the advent of magnetic core
memory to replace more primitive storage technologies such as vacuum tubes, delay
lines, magnetic drums, etc. The new memory technology had much greater storage
capacity than was previously feasible. The availability of more main memory
resulted in changes to machine language instruction formats, addressing modes, and
other aspects of instruction set architecture.
2. Describe the technologies used to implement computers of the first, second, third, fourth,
and fifth generations. What were the main new architectural features that were
introduced or popularized with each generation of machines? What advances in software
went along with each new generation of hardware?
First generation computers were unique machines built with very primitive
implementation technologies such as electromagnetic relays and (later) vacuum
tubes. The main new architectural concept was the von Neumann stored-program
paradigm itself. (The early first generation machines were not programmable in the
sense we understand that term today.) Software, for those machines where the
concept was actually relevant, was developed in machine language.
Second-generation computers made use of the recently invented transistor as
a basic switching element. The second generation also saw the advent of magnetic
core memory as a popular storage technology. At least partly in response to these
technological advances, new architectural features were developed including virtual
memory, interrupts, and hardware representation of floating-point numbers.
Advances in software development included the use of assembly language and the
first high-level languages including Fortran, Algol, and COBOL. Batch processing
systems and multiprogramming operating systems were also devised during this
time period.
The third generation featured the first use of integrated circuits (with
multiple transistors on the same piece of semiconductor material) in computers.
Not only was this technology used to create smaller CPUs requiring less wiring
between components, but semiconductor memory devices began to replace core
memory as well. This led to the development of minicomputer architectures that
were less expensive to implement and helped give rise to families of computer
systems sharing a common instruction set architecture. Software advances included
increased use of virtual memory, the development of more modern, structured
programming languages, and the dawn of timesharing operating systems such as
UNIX.
Fourth generation computers were the first machines to use VLSI integrated
circuits including microprocessors (CPUs fabricated on a single IC). VLSI
technology continued to improve during this period, eventually yielding
microprocessors with over one million transistors and large-capacity semiconductor
RAM and ROM devices. VLSI “chips” allowed the development of inexpensive but
powerful microcomputers during the fourth generation. These systems gradually
began to make use of virtual memory, cache memory, and other techniques
previously reserved for mainframes and minicomputers; they provided direct
support for high-level languages either in hardware (CISC) or by using optimizing
compilers (RISC). Other software advances included new languages like BASIC,
Pascal, and C, and the first object-oriented language (C++). Office software
including word processors and spreadsheet applications helped microcomputers
gain a permanent foothold in small businesses and homes.
Fifth generation computers exhibited fewer architectural innovations than
their predecessors, but advances in implementation technology (including pipelined
and superscalar CPUs and larger, faster memory devices) yielded steady gains in
performance. CPU clock frequencies increased from tens, to hundreds, and
eventually to thousands of megahertz; today, CPUs operating at several gigahertz
are common. “Standalone” systems became less common as most computers were
connected to local area networks and/or the Internet. Object-oriented software
development became the dominant programming paradigm, and network-friendly
languages like Java became popular.
3. What characteristics do you think the next generation of computers (say, 5-10 years from
now) will display?
The answer to this question will undoubtedly vary from student to student,
but might include an increased reliance on networking (especially wireless
networking), increased use of parallel processing, more hardware support for
graphics, sound, and other multimedia functions, etc.
4. What was the main architectural difference between the two early computers ENIAC and
EDVAC?
ENIAC was not a programmable machine. Connections had to be re-wired
to do a different calculation. EDVAC was based on the von Neumann paradigm,
where instructions were not hard-wired but rather resided in main memory along
with the data. The program, and thus the system’s functionality, could be changed
without any modification to the hardware. Thus, EDVAC (and all its successors
based on the von Neumann architecture) were able to run “software” as we
understand it today.
5. Why was the invention of solid state electronics (in particular, the transistor) so important
in the history of computer architecture?
The invention of the transistor, and its subsequent use as a switching element
in computers, enabled many of the architectural enhancements that came about
during the second (and later) generations of computing. Earlier machines based on
vacuum tubes were limited in capability because of the short lifetime of each
individual tube. A machine built with too many (more than a few thousand)
switching elements could not be reliable; it would frequently “go down” due to tube
failures. Transistors, with their much longer life span, enabled the construction of
computers with tens or hundreds of thousands of switching elements, which allowed
more complex architectures to flourish.
6. Explain the origin of the term “core dump.”
The term “core dump” dates to the second and third generations of
computing, when most large computers used magnetic core memory for main
storage. Since core memory was nonvolatile (retained its contents in the absence of
power), when a program crashed and the machine had to be taken down and
restarted, the offending instruction(s) and their operands were still in memory and
could be examined for diagnostic purposes. Some later machines with
semiconductor main memory mimic this behavior by “dumping” an image of a
program’s memory space to disk to aid in debugging in the event of a crash.
7. What technological advances allowed the development of minicomputers, and what was
the significance of this class of machines? How is a microcomputer different from a
minicomputer?
The main technological development that gave rise to minicomputers was the
invention of the integrated circuit. (The shrinking sizes of secondary storage devices
and advances in display technology such as CRT terminals also played a part.) The
significance of these machines was largely due to their reduced cost as compared to
traditional mainframe computers. Because they cost “only” a few thousand dollars
instead of hundreds of thousands or millions, minicomputers were available to
smaller businesses (and to small workgroups or individuals within larger
organizations). This trend toward proliferation and decentralization of computing
resources was continued by the microcomputers of the fourth generation.
The main difference between a microcomputer and a minicomputer is the
microcomputer’s use of a microprocessor (or single-chip CPU) as the main
processing element. Minicomputers had CPUs consisting of multiple ICs or even
multiple circuit boards. The availability of microprocessors, coupled with the
miniaturization and decreased cost of other system components, made computers
smaller and cheaper and thus, for the first time, accessible to the average person.
8. How have the attributes of very high performance systems (a.k.a. supercomputers)
changed over the third, fourth, and fifth generations of computing?
The third generation of computing saw the development of the first
supercomputer-class machines, including the IBM “Stretch”, the CDC 6600 and
7600, the TI ASC, the ILLIAC IV and others. These machines were very diverse
and did not share many architectural attributes.
During the fourth generation, vector machines including the Cray-1 and its
successors (and competitors) became the dominant force in high-performance
computing. By processing vectors (large one-dimensional arrays) of operands in
highly pipelined fashion, these machines achieved impressive performance on
scientific and engineering calculations (though they did not achieve comparable
performance increases on more general applications). Massively parallel machines
(with many, simple processing elements) also debuted during this period.
Vector machines lost popularity in the fifth generation, largely giving way to
highly parallel scalar systems using large numbers of conventional microprocessors.
Many of these systems are cluster systems built around a network of relatively
inexpensive, “commodity” computers.
9. What is the most significant difference between computers of the last 10-15 years versus
those of previous generations?
Fifth generation computers are smaller, cheaper, faster, and have more
memory than their predecessors – but probably the single most significant
difference between modern systems and those of the past is the pervasiveness of
networking. Almost every general-purpose or high-performance system is
connected to a local area network, or a wide area network such as the Internet, via
some sort of wired or wireless network connection.
10. What is the principal performance limitation of a machine based on the von Neumann
(Princeton) architecture? How does a Harvard architecture machine address this
limitation?
The main performance limitation of a von Neumann machine is the “von
Neumann bottleneck” – the single path between the CPU and main memory, over
which instructions as well as data must be accessed. A Harvard architecture
removes this bottleneck by having either separate main memories for instructions
and data (with a dedicated connection to each), or (much more common nowadays)
by having only one main memory, but separate cache memories (see Chapter 2) for
instructions and data. The separate memories can be optimized for access patterns
typical of each type of memory reference in order to maximize data and instruction
bandwidth to the CPU.
11. Summarize in your own words the von Neumann machine cycle.
Fetch instruction, decode instruction, determine operand address(es), fetch
operand(s), perform operation, store result … repeat for next instruction.
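The cycle described above can be sketched as a toy accumulator machine. The instruction set (LOAD/ADD/STORE/HALT) and tuple encoding are entirely hypothetical, chosen only to make the fetch-decode-execute loop concrete; real machines encode instructions as bit fields in the same memory words as data.

```python
LOAD, ADD, STORE, HALT = range(4)

def run(memory):
    """Repeat the von Neumann machine cycle until HALT is executed."""
    acc, pc = 0, 0                       # accumulator and program counter
    while True:
        opcode, operand = memory[pc]     # fetch instruction, decode it
        pc += 1
        if opcode == HALT:
            return memory
        value = memory[operand]          # determine address, fetch operand
        if opcode == LOAD:               # perform operation
            acc = value
        elif opcode == ADD:
            acc = acc + value
        elif opcode == STORE:            # store result
            memory[operand] = acc
        # ...then repeat for the next instruction

# Program and data share one memory: compute mem[6] = mem[4] + mem[5].
mem = [(LOAD, 4), (ADD, 5), (STORE, 6), (HALT, 0), 7, 8, 0]
run(mem)
print(mem[6])   # 15
```

Note that the program and its data occupy the same memory, which is the defining property of the stored-program (von Neumann) design discussed in question 10.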
12. Does a computer system with high generality tend to have higher quality than other
systems? Explain.
Not necessarily. If anything, a more general architecture tends to be more
complex, as its designers try to make it capable of doing a wide variety of things
reasonably well. This increased complexity, as compared to a more specialized
architecture, may lead to a higher probability of “bugs” in the implementation, all
else being equal.
13. How does “ease of use” relate to “user friendliness”?
Not at all; at least, not directly. User friendliness refers to the end user’s
positive experience with the operating system and applications that run under it.
Ease of use is an attribute that describes how well the architecture facilitates the
development of system software such as operating systems, compilers, linkers, etc.
In other words, it is a measure of “systems programmer friendliness.” While there
is no direct connection, an architecture that is not “easy to use” could possibly give
rise to systems software with a higher probability of bugs, which may ultimately
lead to a lower quality experience on the part of the end user.
14. The obvious benefit of maintaining upward and/or forward compatibility is the ability to
continue to run “legacy” code. What are some of the disadvantages of compatibility?
Building in compatibility with previous machines makes the design of an
architecture more complex. This may result in higher design and implementation
costs, less architectural ease of use, and a higher probability of flaws in the
implementation of the design.
15. Name at least two things (other than hardware purchase price, software licensing cost,
maintenance, and support) that may be considered cost factors for a computer system.
Costs are not always monetary – at least, not directly. Other cost factors,
depending on the nature of the system and where it is used, might include power
consumption, heat dissipation, physical volume, mass, and losses incurred if a
system fails due to reliability issues.
16. Give as many reasons as you can why PC compatible computers have a larger market
share than Macs.
It is probably impossible to know all the reasons, but one of the biggest is
that PCs have an “open”, rather than proprietary, architecture. Almost from the
very beginning, compatible “clones” were available at competitive prices, holding
down not only the initial cost of buying a computer, but also the prices for software
and replacement parts. Success breeds success, and the larger market share meant
that manufacturers who produced PC hardware were able to invest in research and
development that produced better, faster, and more economical PC compatible
machines.
17. One computer system has a 3.2 GHz processor, while another has only a 2.7 GHz
processor. Is it possible that the second system might outperform the first? Explain.
It is entirely possible that this might be the case. CPU clock frequency is only
one small aspect of system performance. Even with a lower clock frequency (fewer
clock cycles occurring each second) the second system’s CPU might outperform the
first because of architectural or implementation differences that result in it
accomplishing more work per clock cycle. And even if the first system’s CPU is
indeed more capable, differences in the memory and/or input/output systems might
still give the advantage to the second system.
18. A computer system of interest has a CPU with a clock cycle time of 2.5 ns. Machine
language instruction types for this system include: integer addition/subtraction/logic
instructions which require 1 clock cycle to be executed; data transfer instructions which
average 2 clock cycles to be executed; control transfer instructions which average 3 clock
cycles to be executed; floating-point arithmetic instructions which average 5 clock cycles
to be executed; and input/output instructions which average 2 clock cycles to be
executed.
a) Suppose you are a marketing executive who wants to hype the performance of this
system. Determine its “peak MIPS” rating for use in your advertisements.
The fastest instructions take only one clock cycle to execute, so in order to
calculate peak MIPS, assume that the whole program uses only these instructions.
That means that the machine will execute one instruction every 2.5 ns. Thus we
calculate:
Instruction execution rate = (1 instruction) / (2.5 * 10^-9 seconds) = 4 * 10^8
instructions/second = 400 * 10^6 instructions/second = 400 MIPS
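As a quick check, the peak-MIPS arithmetic can be reproduced in a few lines of Python (not part of the original solution; the 2.5 ns cycle time and 1-cycle instruction class come from the problem statement):

```python
# Peak MIPS: assume (optimistically) every instruction is a 1-cycle
# integer operation, so one instruction completes per clock cycle.
cycle_time = 2.5e-9                  # seconds per clock cycle (given)
cycles_per_instr = 1                 # fastest instruction class (given)
rate = 1 / (cycle_time * cycles_per_instr)   # instructions per second
peak_mips = rate / 1e6
print(round(peak_mips))              # 400
```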
b) Suppose you have acquired this system and want to estimate its performance when
running a particular program. You analyze the compiled code for this program and
determine that it consists of 40% data transfer instructions, 35% integer addition,
subtraction, and logical instructions, 15% control transfer instructions, and 10% I/O
instructions. What MIPS rating do you expect the system to achieve while running
this program?
First, we need to determine the mean number of cycles per instruction using
a weighted average based on the percentages of the different types of instructions:
CPI_avg = (0.40)(2 cycles) + (0.35)(1 cycle) + (0.15)(3 cycles) + (0.10)(2 cycles)
= (0.80 + 0.35 + 0.45 + 0.20) = 1.80 cycles/instruction
We already determined in part (a) above that if instructions take a single
cycle, then we can execute 400 * 10^6 of them per second. This is another way of
saying that the CPU clock frequency is 400 MHz. Given this knowledge and the
average cycle count per instruction just calculated, we obtain:
Instruction execution rate = (400 M cycles / second) * (1 instruction / 1.8 cycles) ≈
222 M instructions/second = 222 MIPS
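The weighted-average CPI and resulting MIPS rating can be computed the same way in a short sketch (a check on the arithmetic above, using the instruction mix from the problem statement):

```python
# Weighted-average CPI from the given instruction mix, then MIPS.
mix = {  # class: (fraction of instructions, cycles per instruction)
    "data transfer": (0.40, 2),
    "integer/logic": (0.35, 1),
    "control":       (0.15, 3),
    "I/O":           (0.10, 2),
}
cpi = sum(frac * cycles for frac, cycles in mix.values())   # 1.8
clock_hz = 1 / 2.5e-9                                       # 400 MHz clock
mips = clock_hz / cpi / 1e6
print(round(cpi, 2), round(mips))   # 1.8 222
```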
c) Suppose you are considering purchasing this system to run a variety of programs
using mostly floating-point arithmetic. Of the widely-used benchmark suites
discussed in this chapter, which would be the best to use in comparing this system to
others you are considering?
If general-purpose floating-point performance is of interest, it would be hard
to go wrong by using the SPECfp floating-point CPU benchmark suite (or some
subset of it, if specific types of applications to be run on the system are known).
Other possibilities include the Whetstone benchmark or (if applications of interest
perform vector computations) LINPACK or Livermore Loops. Conversely, you
would definitely not want to compare the systems using any of the integer-only or
non-CPU-intensive benchmarks such as Dhrystone, TPC, etc.
d) What does MFLOPS stand for? Estimate this system’s MFLOPS rating; justify your
answer with reasoning and calculations.
MFLOPS stands for Millions of Floating-point Operations Per Second. Peak
MFLOPS can be estimated in a similar manner to parts (a) and (b) above:
Peak floating-point execution rate = (400 M cycles / second) * (1 FLOP / 5 cycles) =
80 MFLOPS
A more realistic estimate of a sustainable floating-point execution rate would
have to take into account the additional operations likely to be required along with
each actual numeric computation. While this would vary from one program to
another, a reasonable estimate might be that for each floating-point arithmetic
operation, the program might also perform two data transfers (costing a total of
four clock cycles) plus one control transfer (costing three clock cycles). This would
mean that the CPU could only perform one floating-point computation every 12
clock cycles for a sustained execution rate of (400 M cycles / second) * (1 FLOP / 12
cycles) ≈ 33 MFLOPS. The student may come up with a variety of estimates based
on different assumptions, but any realistic estimate would be significantly less than
the 80 MFLOPS peak rate.
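Both the peak and "sustained" MFLOPS estimates above can be checked numerically. Note that the two data transfers and one control transfer per floating-point operation are an assumption made for the estimate, not a measured workload:

```python
# Peak vs. a rough sustained MFLOPS estimate for the 400 MHz CPU.
clock_hz = 1 / 2.5e-9                 # 400 MHz clock (given)
peak = clock_hz / 5 / 1e6             # 5 cycles per FP instruction (given)

# Assumed overhead per FP operation: 2 data transfers (2 cycles each)
# plus 1 control transfer (3 cycles) -- 7 extra cycles, 12 total.
overhead = 2 * 2 + 1 * 3
sustained = clock_hz / (5 + overhead) / 1e6
print(round(peak), round(sustained))  # 80 33
```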
19. Why does a hard disk that rotates at higher RPM generally outperform one that rotates at
lower RPM? Under what circumstances might this not be the case?
There are generally three components to the total time required to read or
write data on a rotating disk. These are the time required to step the read/write
head in or out to the desired track, the rotational delay in getting to the start of the
desired sector within that track, and then the time needed to actually read or write
the sector in question. All else being equal, increasing disk RPM reduces the time it
takes for the disk to make a revolution and so tends to reduce the second and third
delay components, while it does nothing to address the first. If the higher-RPM
drive had a longer track-to-track seek time, though, it might take just as long or
even longer, overall, to access desired data as compared with a lower-RPM drive
with shorter track-to-track access time.
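The three-component model described above can be made concrete with a rough calculation. All of the drive parameters below are hypothetical, chosen only to illustrate how a slower-spinning drive with a shorter seek time can win overall:

```python
# Rough disk access-time model: seek + rotational latency + transfer.
def avg_access_ms(seek_ms, rpm, sectors_per_track):
    rev_ms = 60_000 / rpm                  # time for one full revolution
    rotational = rev_ms / 2                # average: half a revolution
    transfer = rev_ms / sectors_per_track  # time to read one sector
    return seek_ms + rotational + transfer

# Hypothetical drives: high-RPM but slow seek vs. low-RPM but fast seek.
fast_spin = avg_access_ms(seek_ms=9.0, rpm=15_000, sectors_per_track=500)
slow_spin = avg_access_ms(seek_ms=4.0, rpm=7_200, sectors_per_track=500)
print(round(fast_spin, 2), round(slow_spin, 2))
```

With these (made-up) numbers the 7,200 RPM drive comes out ahead, showing that RPM alone does not determine access time.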
20. A memory system can read or write a 64-bit value every 2 ns. Express its bandwidth in
MB/s.
Since one byte equals 8 bits, a 64-bit value is 8 bytes. So we can compute the
memory bandwidth as:
BW = (8 bytes) / (2 * 10^-9 seconds) = 4 * 10^9 bytes/second = 4 GB/s or 4000 MB/s
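The same bandwidth calculation, as a one-off check in Python:

```python
# Memory bandwidth: bytes per access divided by time per access.
width_bits = 64                       # bits transferred per access (given)
access_time = 2e-9                    # seconds per access (given)
bw = (width_bits / 8) / access_time   # bytes per second
print(round(bw / 1e6), "MB/s")        # 4000 MB/s
```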
21. If a manufacturer’s brochure states that a given system can perform I/O operations at 500
MB/s, what questions would you like to ask the manufacturer’s representative regarding
this claim?
One should probably ask under what conditions this data transfer rate can
be achieved. If it is a “peak” transfer rate, it is probably unattainable under any
typical circumstances. It would be very helpful to know the size of the blocks of
data being transferred and the length of time for which this 500 MB/s rate was
sustained. Odds are good that if this is a peak rate, it is only valid for
fairly large block transfers of optimum size, and for very short periods of time. This
may or may not reflect the nature of the I/O demands of a customer’s application.
22. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Implementation - the actual, physical realization of a computer system, as opposed to
the conceptual or block-level design
Babbage’s Analytical Engine - this was the first design for a programmable digital
computer, but a working model was never completed
Integrated circuits - this technological development was an important factor in moving
from second generation to third generation computers
CDC 6600 - this system is widely considered to have been the first supercomputer
Altair - this early microcomputer kit was based on an 8-bit microprocessor; it introduced
10,000 hobbyists to (relatively) inexpensive personal computing
Microcontroller - this type of computer is embedded inside another electronic or
mechanical device such as a cellular telephone, microwave oven, or automobile
transmission
Harvard architecture - a type of computer system design in which the CPU uses
separate memory buses for accessing instructions and data operands
Compatibility - an architectural attribute that expresses the support provided for
previous or other architectures by the current machine
MFLOPS - a CPU performance index that measures the rate at which computations can
be performed on real numbers rather than integers
Bandwidth - a measure of memory or I/O performance that tells how much data can be
transferred to or from a device per unit of time
Benchmark - a program or set of programs that are used as standardized means of
comparing the performance of different computer systems
2 Computer Memory Systems
1. Consider the various aspects of an ideal computer memory discussed in Section 2.1.1 and
the characteristics of available memory devices discussed in Section 2.1.2. Fill in the
columns of the table below with the following types of memory devices, in order from
most desirable to least desirable: magnetic hard disk, semiconductor DRAM, CD-R,
DVD-RW, semiconductor ROM, DVD-R, semiconductor flash memory, magnetic floppy
disk, CD-RW, semiconductor static RAM, semiconductor EPROM.
Cost/bit (will obviously fluctuate somewhat depending on market conditions): CD-
R, DVD-R, CD-RW, DVD-RW, magnetic hard disk, magnetic floppy disk,
semiconductor DRAM, semiconductor ROM, semiconductor EPROM,
semiconductor flash memory, semiconductor static RAM.
Speed (will vary somewhat depending on specific models of devices): semiconductor
static RAM, semiconductor DRAM, semiconductor ROM, semiconductor EPROM,
semiconductor flash memory, magnetic hard disk, DVD-R, DVD-RW, CD-R, CD-
RW, magnetic floppy disk.
Information Density (again, this may vary by specific types of devices): Magnetic
hard disk, DVD-R and DVD-RW, CD-R and CD-RW, semiconductor DRAM,
semiconductor ROM, semiconductor EPROM, semiconductor flash memory,
semiconductor static RAM, magnetic floppy disk.
Volatility: Optical media such as DVD-R, CD-R, DVD-RW, and CD-RW are all
equally nonvolatile. The read-only variants cannot be erased and provide secure
storage unless physically damaged. (The same is true of semiconductor ROM.) The
read-write optical disks (and semiconductor EPROMs and flash memories) may be
intentionally or accidentally erased, but otherwise retain their data indefinitely in
the absence of physical damage. Magnetic hard and floppy disks are nonvolatile
except in the presence of strong external magnetic fields. Semiconductor static
RAM is volatile, requiring continuous application of electrical power to maintain
stored data. Semiconductor DRAM is even more volatile since it requires not only
electrical power, but also periodic data refresh in order to maintain its contents.
Writability (all memory is readable): Magnetic hard and floppy disks and
semiconductor static RAM and DRAM can be written essentially indefinitely, and as
quickly and easily as they can be read. DVD-RW, CD-RW, and semiconductor
flash memory can be written many times, but not indefinitely, and the write
operation is usually slower than the read operation. Semiconductor EPROMs can
be written multiple times, but only in a special programmer, and only after a
relatively long erase cycle under ultraviolet light. DVD-R and CD-R media can be
written once and only once by the user. Semiconductor ROM is pre-loaded with its
binary information at the factory and can never be written by the user.
Power Consumption: All types of optical and magnetic disks as well as
semiconductor ROMs, EPROMs, and flash memories can store data without power
being applied at all. Semiconductor RAMs require continuous application of power
to retain data, with most types of SRAMs being more power-hungry than DRAMs.
(Low-power CMOS static RAMs, however, are commonly used to maintain data for
long periods of time with a battery backup.) While data are being read or written,
all memories require power. Semiconductor DRAM requires relatively little power,
while semiconductor ROMs, flash memories, and EPROMs tend to require more
and SRAMs, more still. All rotating disk drives, magnetic and optical, require
significant power in order to spin the media and move the read/write heads as well
as to actually perform the read and write operations. The specifics vary
considerably from device to device, but those that rotate the media at higher speeds
tend to use slightly more power.
Durability: In general, the various types of semiconductor memories are more
durable than disk memories because they have no moving parts. Only severe
physical shock or static discharges are likely to harm them. (CMOS devices are
particularly susceptible to damage from static electricity.) Optical media are also
very durable; they are nearly impervious to most dangers except that of surface
scratches. Magnetic media such as floppy and hard disks tend to be the least
durable as they are subject to erasure by strong magnetic fields and also are subject
to “head crashes” when physical shock causes the read-write head to impact the
media surface.
Removability/Portability: Flash memory, floppy disks, and optical disks are
eminently portable and can easily be carried from system to system to transfer data.
A few magnetic hard drives are designed to be portable, but most are permanently
installed in a given system and require some effort for removal. Semiconductor
ROMs and EPROMs, if placed in sockets rather than being soldered directly to a
circuit board, can be removed and transported along with their contents. Most
semiconductor RAM devices lose their contents when system power is removed and,
while they could be moved to another system, would not arrive containing any valid
data.
2. Describe in your own words what a hierarchical memory system is and why it is used in
the vast majority of modern computer systems.
A hierarchical memory system is one that is comprised of several types of
memory devices with different characteristics, each occupying a “level” within the
overall structure. The higher levels of the memory system (the ones closest to, or a
part of, the CPU) offer faster access but, due to cost factors and limited physical
space, have a smaller storage capacity. Thus, each level can typically hold only a
portion of the data stored in the next lower level. As one moves down to the lower
levels, speed and cost per bit generally decrease, but capacity increases. At the
lowest levels, the devices offer a great deal of (usually nonvolatile) storage at
relatively low cost, but are quite slow. For the overall system to perform well, the
hierarchy must be managed by hardware and software such that the stored items
that are used most frequently are located in the higher levels, while items that are
used less frequently are relegated to the lower levels.
3. What is the fundamental, underlying reason why low-order main memory interleaving
and/or cache memories are needed and used in virtually all high-performance computer
systems?
The main underlying reason why speed-enhancing techniques such as low-
order interleaving and cache continue to be needed and used in computer systems is
that main memory technology has never been able to keep up with the speed of
processor implementation technologies. The CPUs of each generation have always
been faster than any devices (from the days of delay lines, magnetic drums, and core
memory all the way up to today’s high-capacity DRAM ICs) that were feasible,
from a cost standpoint, to be used as main memory. If anything, the CPU-memory
speed gap has widened rather than narrowed over the years. Thus, the speed and
size of a system’s cache may be even more critical to system performance than
almost any other factor. (If you don’t believe this, examine the performance
difference between an Intel Pentium 4 and an otherwise similar Celeron processor.)
4. A main memory system is designed using 15 ns RAM devices using a 4-way low-order
interleave.
(a) What would be the effective time per main memory access under ideal
conditions?
Under ideal conditions, four memory accesses would be in progress at any
given time due to the low-order interleaving scheme. This means that the effective
time per main memory access would be (15 / 4) = 3.75 ns.
(b) What would constitute “ideal conditions”? (In other words, under what
circumstances could the access time you just calculated be achieved?)
The ideal condition for best performance of the memory system would be
continuous access to sequentially numbered memory locations. Equivalently, any
access pattern that consistently used all three of the other “leaves” before returning
to the one just accessed would have the same benefit. Examples would include
accessing every fifth numbered location, or every seventh, or any spacing that is
relatively prime with 4 (the interleaving factor).
(c) What would constitute “worst-case conditions”? (In other words, under what
circumstances would memory accesses be the slowest?) What would the access
time be in this worst-case scenario? If ideal conditions exist 80% of the time and
worst-case conditions occur 20% of the time, what would be the average time
required per memory access?
The worst case would be a situation where every access went to the same
device or group of devices. This would happen if the CPU needed to access every
fourth numbered location (or every eighth, or any spacing that is an integer multiple
of 4). In this case, access time would revert to that of an individual device (15 ns)
and the interleaving would provide no performance benefit at all.
In the hypothetical situation described, we could take a weighted average to
determine the effective access time for the memory system: (0.80)(3.75 ns) +
(0.20)(15 ns) = (3 + 3) = 6 ns.
(d) When ideal conditions exist, we would like the processor to be able to access
memory every clock cycle with no “wait states” (that is, without any cycles
wasted waiting for memory to respond). Given this requirement, what is the
highest processor bus clock frequency that can be used with this memory system?
In part (a) above, we found the best-case memory access time to be 3.75 ns.
Matching the CPU bus cycle time to this value and taking the reciprocal (since f =
1/T) we obtain:
f = 1/T = (1 cycle) / (3.75 * 10^-9 seconds) ≈ 2.67 * 10^8 cycles/second = 267 MHz.
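The arithmetic for parts (a), (c), and (d) can be collected in a few lines (an illustrative sketch; variable names are my own):

```python
# Sketch collecting the timing arithmetic for the 4-way interleaved
# memory built from 15 ns devices.
device_time_ns = 15.0
ways = 4

best = device_time_ns / ways              # (a) ideal: 4 accesses in flight
worst = device_time_ns                    # (c) every access to one bank
average = 0.80 * best + 0.20 * worst      # (c) 80/20 weighted mix
max_clock_mhz = 1000.0 / best             # (d) f = 1/T, with T in ns

print(best)                  # 3.75
print(round(average, 2))     # 6.0
print(round(max_clock_mhz))  # 267
```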
(e) Other than increased hardware cost and complexity, are there any potential
disadvantages of using a low-order interleaved memory design? If so, discuss one
such disadvantage and the circumstances under which it might be significant.
The main disadvantage that could come into play is due to the fact that
under ideal conditions, all memory modules are busy all the time. This is good if
only one device (usually the CPU) needs to access memory, but not good if other
devices need to access memory as well (for example, to perform I/O). Essentially all
the memory bandwidth is used up by the first device, leaving little or none for
others.
Another possible disadvantage is lower memory system reliability due to
decreased fault tolerance. In a high-order interleaved system, if one memory device
were to fail, 3/4 of the memory space would still be usable. In the low-order
interleaved case, if one of the four “leaves” fails, the entire main memory space is
effectively lost.
5. Is it correct to refer to a typical semiconductor integrated circuit ROM as a “random
access memory”? Why or why not? Name and describe two other logical organizations
of computer memory that are not “random access.”
It is correct to refer to a semiconductor ROM as a “random access memory”
in the strict sense of the definition – a “random access” memory is any memory
device that has an access time independent of the specific location being accessed.
(In other words, any randomly chosen location can be read or written in the same
amount of time as any other location.) This is as true of most semiconductor
read-only memories as it is of semiconductor read/write memories (which are
commonly known as “RAMs”). Because of the commonly-used terminology, it is
probably better not to confuse the issue by referring to a ROM IC as a “RAM”,
even though that is technically a correct statement.
Besides random access, the other two logical memory organizations that may
be found in computer systems are sequential access (typical of tape and disk
memories) and associative (or content addressable).
6. Assume that a given system’s main memory has an access time of 6.0 ns, while its cache
has an access time of 1.2 ns (five times as fast). What would the hit ratio need to be in
order for the effective memory access time to be 1.5 ns (four times as fast as main
memory)?
Since effective memory access time in such a system is based on a weighted
average, we would need to solve the following equation:
ta effective = ta cache * (ph) + ta main * (1 - ph)
for the particular values given in the problem, as shown:
1.5 ns = (1.2 ns)(ph) + (6.0 ns)(1 - ph)
Using basic algebra we solve to obtain ph = 0.9375.
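The algebra can be double-checked by rearranging the equation for ph (a quick sketch, not from the text):

```python
# Sketch: rearranging  t_eff = t_cache*p + t_main*(1 - p)  to solve for p.
t_cache, t_main, t_eff = 1.2, 6.0, 1.5

p = (t_main - t_eff) / (t_main - t_cache)   # algebraic rearrangement
print(round(p, 4))  # 0.9375
```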
7. A particular program runs on a system with cache memory. The program makes a total
of 250,000 memory references; 235,000 of these are to cached locations.
(a) What is the hit ratio in this case?
ph = number of hits / (number of hits + number of misses) = 235,000 / 250,000 = 0.94
(b) If the cache can be accessed in 1.0 ns but the main memory requires 7.5 ns for an
access to take place, what is the average time required by this program for a
memory access assuming all accesses are reads?
ta effective = ta cache * (ph) + ta main * (1 - ph) = (1.0 ns)(0.94) + (7.5 ns)(0.06) = (0.94 +
0.45) ns = 1.39 ns
(c) What would be the answer to part (b) if a write-through policy is used and 75% of
memory accesses are reads?
If a write-through policy is used, then all writes require a main memory
access and write hits do nothing to improve memory system performance. The
average write access time is equal to the main memory access time, which is 7.5 ns.
The average read access time is equal to 1.39 ns as calculated in (b) above. The
overall average time per memory access is thus given by:
ta effective = (7.5 ns)(0.25) + (1.39 ns)(0.75) = (1.875 + 1.0425) ns = 2.9175 ns
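As a sketch (not from the text), the arithmetic for all three parts of this question can be verified in a few lines:

```python
# Sketch verifying parts (a)-(c): hit ratio, read-only average access
# time, and the write-through average with a 75/25 read/write mix.
hits, refs = 235_000, 250_000
t_cache_ns, t_main_ns = 1.0, 7.5

p_hit = hits / refs                                    # (a)
t_read = t_cache_ns * p_hit + t_main_ns * (1 - p_hit)  # (b) all reads
t_mixed = 0.75 * t_read + 0.25 * t_main_ns             # (c) writes always
                                                       #     pay main-memory time
print(p_hit)              # 0.94
print(round(t_read, 4))   # 1.39
print(round(t_mixed, 4))  # 2.9175
```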
8. Is hit ratio a dynamic or static performance parameter in a typical computer memory
system? Explain your answer.
Hit ratio is a dynamic parameter in any practical computer system. Even
though the cache and main memory sizes, mapping strategy, replacement policy,
etc. (which can all affect the hit ratio) are constant within a given system, the
proportion of cache hits to misses will still vary from one program to another. It
will also vary widely within a given run, based on such factors as the length of time
the program has been running, the code structure (procedure calls, loops, etc.) and
the properties of the specific data set being operated on by the program.
9. What are the advantages of a set-associative cache organization as opposed to a direct-
mapped or fully associative mapping strategy?
A set-associative cache organization is a compromise between the direct-
mapped and fully associative organizations that attempts to maximize the
advantages of each while minimizing their respective disadvantages. Fully
associative caches are expensive to build but offer a higher hit ratio than direct-
mapped caches of the same size. Direct-mapped caches are cheaper and less
complex to build but performance can suffer due to usage conflicts between lines
with the same index. By limiting associativity to just a few parallel comparisons
(two- and four-way set-associative caches are most common) the set-associative
organization can achieve nearly the same hit ratio as a fully associative design at a
cost not much greater than that of a direct-mapped cache.
10. A computer has 64 MB of byte-addressable main memory. It is proposed to design a 1
MB cache memory with a refill line (block) size of 64 bytes.
(a) Show how the memory address bits would be allocated for a direct-mapped cache
organization.
Since 64M = 2^26, the total number of bits required to address the main
memory space is 26. And since 64 = 2^6, it takes 6 bits to identify a particular byte
within a line. The number of refill lines in the cache is 1M / 64 = 2^20 / 2^6 = 2^14 = 16K.
Since there are 2^14 lines in the cache, 14 index bits are required. 26 total address
bits – 6 “byte” bits – 14 “index” bits leaves 6 bits to be used for the tag. So the
address bits would be partitioned as follows: Tag (6 bits) | Index (14 bits) | Byte (6
bits)
(b) Repeat part (a) for a four-way set-associative cache organization.
For the purposes of this problem, a four-way set-associative cache can be
treated as four direct-mapped caches operating in parallel, each one-fourth the size
of the cache described above. Each of these four smaller units would thus be 256
KB in size, containing 4K = 2^12 refill lines. Thus, 12 bits would need to be used for
the index, and 26 – 6 – 12 = 8 bits would be used for the tag. The address bits would
be partitioned as follows: Tag (8 bits) | Index (12 bits) | Byte (6 bits)
(c) Repeat part (a) for a fully associative cache organization.
In a fully associative cache organization, no index bits are required.
Therefore the tags would be 26 – 6 = 20 bits long. Addresses would be partitioned
as follows: Tag (20 bits) | Byte (6 bits)
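As an illustrative check (function and constant names are my own, not from the text), the field widths for all three organizations follow from the same arithmetic:

```python
# Sketch: address-field widths for a 26-bit byte address, 64-byte
# refill lines, and a 16K-line (1 MB) cache.
import math

ADDR_BITS = 26                    # 64 MB = 2^26 bytes
LINE_BITS = 6                     # 64-byte refill lines
CACHE_LINES = (1 << 20) // 64     # 1 MB / 64 B = 16K lines

def fields(ways):
    """(tag, index, byte) bit widths; ways=1 is direct-mapped,
    ways=CACHE_LINES is fully associative."""
    sets = CACHE_LINES // ways
    index = int(math.log2(sets))
    return ADDR_BITS - index - LINE_BITS, index, LINE_BITS

print(fields(1))            # (6, 14, 6)  direct-mapped
print(fields(4))            # (8, 12, 6)  four-way set-associative
print(fields(CACHE_LINES))  # (20, 0, 6)  fully associative
```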
(d) Given the direct-mapped organization, and ignoring any extra bits that might be
needed (valid bit, dirty bit, etc.), what would be the overall size (“depth” by
“width”) of the memory used to implement the cache? What type of memory
devices would be used to implement the cache (be as specific as possible)?
The overall size of the direct-mapped cache would be:
(16K lines) * (64 data bytes + 6 bit tag) = (16,384) * ((64 * 8) + 6) = (16,384 * 518) =
8,486,912 bits. This would be in the form of a fast 16K by 518 static RAM.
(e) Which line(s) of the direct-mapped cache could main memory location
1E0027A (hex) map into? (Give the line number(s), which will be in the range of 0 to
(n-1) if there are n lines in the cache.) Give the memory address (in hexadecimal)
of another location that could not reside in cache at the same time as this one (if
such a location exists).
To answer this question, we need to write the memory address in binary.
1E0027A hexadecimal equals 01111000000000001001111010 binary. We can break
this down into a tag of 011110, an index of 00000000001001 and a byte offset within
the line of 111010. In a direct-mapped cache, the binary index tells us the number
of the only line that can contain the given memory location. So, this location can
only reside in line 1001 (binary) = 9 decimal.
Any other memory location with the same index but a different tag could not
reside in cache at the same time as this one. One example of such a location would
be the one at address 2F0027A (hex).
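The decoding above can be verified with a few bit operations (a sketch; the masks follow the 6/14/6 bit layout derived in part (a)):

```python
# Sketch: decoding address 1E0027A (hex) with the direct-mapped layout
# tag (6 bits) | index (14 bits) | byte (6 bits).
addr = 0x1E0027A

byte  = addr & 0x3F              # low 6 bits
index = (addr >> 6) & 0x3FFF     # next 14 bits
tag   = addr >> 20               # top 6 bits

print(index)     # 9 -- the only line this address can map into
print(hex(tag))  # 0x1e

# A conflicting location: same index, different tag, so the two
# addresses cannot be cached simultaneously in a direct-mapped cache.
other = 0x2F0027A
assert ((other >> 6) & 0x3FFF) == index
assert (other >> 20) != tag
```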
11. Define and describe virtual memory. What are its purposes, and what are the advantages
and disadvantages of virtual memory systems?
Virtual memory is a technique that separates the (virtual) addresses used by
the software from the (physical) addresses used by the memory system hardware.
Each virtual address referenced by a program goes through a process of translation
(or mapping) that resolves it into the correct physical address in main memory, if
such a mapping exists. If no mapping is defined, the desired information is loaded
from secondary memory and an appropriate mapping is created. The translation
process is overseen by the operating system, with much of the work done in
hardware by a memory management unit (MMU) for speed reasons. It is usually
done via a multi-level table lookup procedure, with the MMU internally caching
frequently- or recently-used translations so that the costly (in terms of performance)
table lookups can be avoided most of the time.
The principal advantage of virtual memory is that it frees the programmer
from the burden of fitting his or her code into available memory, giving the illusion
of a large memory space exclusively owned by the program (rather than the usually
much more limited physical main memory space that is shared with other resident
programs). The main disadvantage is the overhead of implementing the virtual
memory scheme, which invariably results in some increase in average access time vs.
a system using comparable technology with only physical memory. Table lookups
take time, and even when a given translation is cached in the MMU’s Translation
Lookaside Buffer, there is some propagation delay involved in address translation.
12. Name and describe the two principal approaches to implementing virtual memory
systems. How are they similar and how do they differ? Can they be combined, and if so,
how?
The two principal approaches to implementing virtual memory (VM) are
demand-paged VM and demand-segmented VM (paging and segmentation, for short).
They are similar in that both map a virtual (or logical) address space to a physical
address space using a table lookup process managed by an MMU and overseen by
the computer’s operating system. They are different in that paging maps fixed-size
regions of memory called pages, while segmentation maps variable-length segments.
Page size is usually determined by hardware considerations such as disk sector size,
while segment size is determined by the structure of the program’s code and data.
A paged system can concatenate the offset within a page with the translated upper
address bits, while a segmented system must translate a logical address into the
complete physical starting address of a segment and then add the segment offset to
that value.
It is possible to create a system that uses aspects of both approaches;
specifically, one in which the variable-length segments are each composed of one or
more fixed-size pages. This approach, known as segmentation with paging, trades
off some of the disadvantages of each approach to try to take advantage of their
strengths.
13. What is the purpose of having multiple levels of page or segment tables rather than a
single table for looking up address translations? What are the disadvantages, if any, of
this scheme?
The main purpose of having multiple-level page or segment tables is to
replace one huge mapping table with a hierarchy of smaller ones. The advantage is
that the tables are smaller (remember, they are stored in main memory, though
some entries may be cached) and easier for the operating system to manage. The
disadvantage is that “walking” the hierarchical sequence of tables takes longer than
a single table lookup. Most systems have a TLB to cache recently-used address
translations, though, so this time penalty is usually only incurred once when a given
page or segment is first loaded into memory (or perhaps again later if the TLB fills
up and a displaced entry has to be reloaded).
14. A process running on a system with demand-paged virtual memory generates the
following reference string (sequence of requested pages): 4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3.
The operating system allocates each process a maximum of four page frames at a time.
What will be the number of page faults for this process under each of the following page
replacement policies?
a) LRU 7 page faults
b) FIFO 8 page faults
c) LFU (with FIFO as tiebreaker) 7 page faults
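These counts can be confirmed with a small simulation (a sketch, not from the text; LFU ties are broken by arrival order, i.e., FIFO):

```python
# Sketch: simulating the three replacement policies on the reference
# string with four page frames, counting page faults for each.
refs = [4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3]
FRAMES = 4

def faults(policy):
    frames = []                         # pages currently resident
    arrival, last_use, freq = {}, {}, {}
    count = 0
    for t, p in enumerate(refs):
        if p in frames:                 # hit: update bookkeeping only
            last_use[p] = t
            freq[p] += 1
            continue
        count += 1                      # miss: page fault
        if len(frames) == FRAMES:       # choose a victim to evict
            if policy == "LRU":
                victim = min(frames, key=lambda q: last_use[q])
            elif policy == "FIFO":
                victim = min(frames, key=lambda q: arrival[q])
            else:                       # LFU with FIFO tiebreaker
                victim = min(frames, key=lambda q: (freq[q], arrival[q]))
            frames.remove(victim)
        frames.append(p)
        arrival[p], last_use[p], freq[p] = t, t, 1
    return count

print(faults("LRU"), faults("FIFO"), faults("LFU"))  # 7 8 7
```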
15. In what ways are cache memory and virtual memory similar? In what ways are they
different?
Cache memory and virtual memory are similar in several ways. Both involve
the interaction between two levels of a hierarchical memory system – one larger and
slower, the other smaller and faster. Both have the goal of performing close to the
speed of the smaller, faster memory while taking advantage of the capacity of the
larger, slower one; both depend on the principle of locality of reference to achieve
this. Both operate on a demand basis and both perform a mapping of addresses
generated by the CPU.
One significant difference is the size of the blocks of memory that are
mapped and transferred between levels of the hierarchy. Cache lines tend to be
significantly smaller than pages or segments in a virtual memory system. Because of
the size of the mapped areas as well as the speed disparity between levels of the
memory system, cache misses tend to be more frequent, but less costly in terms of
performance, than page or segment faults in a VM system. Cache control is done
entirely in hardware, while virtual memory management is accomplished via a
combination of hardware (the MMU) and software (the operating system). Cache
exists for the sole reason of making main memory appear faster than it really is;
virtual memory has several purposes, one of which is to make main memory appear
larger than it is, but also to support multiprogramming, relocation of code and data,
and the protection of each program’s memory space from other programs.
16. In systems which make use of both virtual memory and cache, what are the advantages of
a virtually addressed cache? Does a physically addressed cache have any advantages of
its own, and if so, what are they? Describe a situation in which one of these approaches
would have to be used because the other would not be feasible.
All else being equal, a virtually mapped cache is faster than a physically
mapped cache because no address translation is required prior to checking the tags
to see if a hit has occurred. The appropriate bits from the virtual address are
matched against the (virtual) tags. In a physically addressed cache, the virtual-to-
physical translation must be done before the tags can be matched. A physically
addressed cache does have some advantages, though, including the ability to
perform task switches without having to flush (invalidate) the contents of the cache.
In a situation where the MMU is located on-chip with the CPU while a cache is
located off-chip (for example a level-2 or level-3 cache on the motherboard) the
address is already translated before it appears on the system bus and, therefore,
that cache would have to be physically addressed.
17. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Information density - a characteristic of a memory device that refers to the amount of
information that can be stored in a given physical space or volume
Dynamic Random Access Memory (DRAM) - a semiconductor memory device made
up of a large array of capacitors; its contents must be periodically refreshed in order to
keep them from being lost
Magnetic RAM (MRAM) - a developing memory technology that operates on the
principle of magnetoresistance; it may allow the development of “instant-on” computer
systems
Erasable/Programmable Read-Only Memory (EPROM) - a type of semiconductor
memory device, the contents of which cannot be overwritten during normal operation, but
can be erased using ultraviolet light
Associative memory - this type of memory device is also known as a CAM
Argument register - a register in an associative memory that contains the item to be
searched for
Locality of reference - the principle that allows hierarchical storage systems to function
at close to the speed of the faster, smaller level(s)
Miss - this occurs when a needed instruction or operand is not found in cache and thus a
main memory access is required
Refill line - the unit of information that is transferred between a cache and main memory
Tag - the portion of a memory address that determines whether a cache line contains the
needed information
Fully associative mapping - the most flexible but most expensive cache organization, in
which a block of information from main memory can reside anywhere in the cache
Write-back - a policy whereby writes to cached locations update main memory only
when the line is displaced
Valid bit - this is set or cleared to indicate whether a given cache line has been initialized
with “good” information or contains “garbage” due to not yet being initialized
Memory Management Unit (MMU) - a hardware unit that handles the details of address
translation in a system with virtual memory
Segment fault - this occurs when a program makes reference to a logical segment of
memory that is not physically present in main memory
Translation Lookaside Buffer (TLB) - a type of cache used to hold virtual-to-physical
address translation information
Dirty bit - this is set to indicate that the contents of a faster memory subsystem have
been modified and need to be copied to the slower memory when they are displaced
Delayed page fault - this can occur during the execution of a string or vector instruction
when part of the operand is present in physical main memory and the rest is not
3 Basics of the Central Processing Unit
1. Does an architecture that has fixed-length instructions necessarily have only one
instruction format? If multiple formats are possible given a single instruction size in bits,
explain how they could be implemented; if not, explain why this is not possible.
Not necessarily. It is possible to have multiple instruction formats, all of the
same length. For example, SPARC has three machine language instruction formats,
all 32 bits long. This is implemented by decoding a subset of the op code bits (in the
SPARC example, the two leftmost bits) and using the decoded outputs to determine
how to decode the remaining bits of the instruction.
2. The instruction set architecture for a simple computer must support access to 64 KB of
byte-addressable memory space and eight 16-bit general-purpose CPU registers.
(a) If the computer has three-operand machine language instructions that operate on
the contents of two different CPU registers to produce a result that is stored in a
third register, how many bits are required in the instruction format for addressing
registers?
Since there are 8 = 2^3 registers, three bits are needed to identify each register
operand. In this case there are two source registers and one destination register, so
it would take 3 * 3 = 9 bits in the instruction to address all the needed registers.
(b) If all instructions are to be 16 bits long, how many op codes are available for the
three-operand, register operation instructions described above (neglecting, for the
moment, any other types of instructions that might be required)?
16 bits total minus 9 bits for addressing registers leaves 7 bits to be used as
the op code. Since 2^7 = 128, there are 128 distinct op codes available to specify such
instructions.
(c) Now assume (given the same 16-bit instruction size limitation) that, besides the
instructions described in (a), there are a number of additional two-operand
instructions to be implemented, for which one operand must be in a CPU register
while the second operand may reside in a main memory location or a register. If
possible, detail a scheme that allows for at least 50 register-only instructions of
the type described in (a) plus at least 10 of these two-operand instructions. (Show
how you would lay out the bit fields for each of the machine language instruction
formats.) If this is not possible, explain in detail why not and describe what
would have to be done to make it possible to implement the required number and
types of machine language instructions.
We can accomplish this design goal by adopting two instruction formats that
could be distinguished by a single bit. Format 1 will have a specific bit (say, the
leftmost bit) = 0 while format 2 will have a 1 in that bit position. Three-operand
(register-only) instructions would use format 1. With one bit already used to
identify the format, of the remaining 15 bits, 6 would constitute the op code (giving
us 2^6 = 64 possible instructions of this type). The other 9 bits would be used to
identify source register 1 (3 bits), source register 2 (3 bits), and the destination
register (3 bits).
The two-operand instructions would use format 2. These instructions cannot
use absolute addressing for memory operands because that would require 16 bits for
the memory address alone, and there are only 16 total bits per instruction.
However, register indirect addressing or indexed addressing could be used to locate
memory operands. In this format, 3 of the remaining 15 bits would be needed to
identify the operand that is definitely in a register. One additional bit would be
required to tell whether the second operand was in a register or in a memory
location. Then, another set of 3 bits would identify a second register that contains
either the second operand or a pointer to it in memory. This leaves 8 bits, of which
4 or more would have to be used for op code bits since we need at least 10
instructions of this type. The remaining 4 bits could be used to provide additional
op codes or as a small displacement for indexed addressing.
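The two formats can be sketched as bit-packing functions. The exact field order and mnemonics here are illustrative assumptions on my part, not something specified in the answer above:

```python
# Sketch: packing the two hypothetical 16-bit instruction formats.
# Format 1 (bit 15 = 0): | 0 | opcode(6) | rs1(3) | rs2(3) | rd(3) |
# Format 2 (bit 15 = 1): | 1 | opcode(4) | reg(3) | mem?(1) | r2(3) | disp(4) |

def format1(opcode, rs1, rs2, rd):
    """Three-operand, register-only instruction (format bit 15 stays 0)."""
    assert opcode < 64 and max(rs1, rs2, rd) < 8
    return (opcode << 9) | (rs1 << 6) | (rs2 << 3) | rd

def format2(opcode, reg, mem_operand, r2, disp):
    """Two-operand instruction; mem_operand flags register vs. memory."""
    assert opcode < 16 and reg < 8 and mem_operand < 2 and r2 < 8 and disp < 16
    return (1 << 15) | (opcode << 11) | (reg << 8) | (mem_operand << 7) \
           | (r2 << 4) | disp

word = format1(0b000001, 1, 2, 3)
assert word >> 15 == 0                      # top bit selects the format
assert 0 <= word < (1 << 16)                # fits in 16 bits
assert format2(15, 7, 1, 7, 15) == 0xFFFF   # all fields at max fill the word
```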
3. What are the advantages and disadvantages of an instruction set architecture with
variable-length instructions?
For an architecture with a sufficient degree of complexity, it is natural that
some instructions may be expressible in fewer bits than others. (Some may have
fewer options, operands, addressing modes, etc. while others have more
functionality.) Having variable-length instructions means that the simpler
instructions need take up no more space than absolutely necessary. (If all
instructions are the same length, then even the simplest ones must be the same size,
in bits, as the most complex.) Variable-length instructions can save significant
amounts of code memory, but at the expense of requiring a more complex decoding
scheme that can complicate the design of the control unit. Variable-length
instructions also make it more difficult to pipeline the process of fetching, decoding,
and executing instructions (see Chapter 4).
4. Name and describe the three most common general types (from the standpoint of
functionality) of machine instructions found in executable programs for most computer
architectures.
In most executable programs (one can always find isolated counter-
examples) the bulk of the machine instructions are, usually in this order: data
transfer instructions, computational (arithmetic, logic, shift, etc.) instructions, and
control transfer instructions.
5. Given that we wish to specify the location of an operand in memory, how does indirect
addressing differ from direct addressing? What are the advantages of indirect addressing,
and in what circumstances is it clearly preferable to direct addressing? Are there any
disadvantages of using indirect addressing? How is register indirect addressing different
from memory indirect addressing, and what are the relative advantages and disadvantages
of each?
Direct (or absolute) addressing specifies the location of the operand(s)
explicitly as part of the machine language instruction (as opposed to immediate
addressing, which embeds the operand itself in the instruction). Indirect addressing
uses the machine language instruction to specify not the location of the operand, but
the location of the location of the operand. (In other words, it tells where to find a
pointer to the operand.) The advantage of indirect addressing is that if a given
instruction is executed more than once (as in a program loop) the operand does not
have to be in the same memory location every time. This is of particular use in
processing tables, arrays, and other multi-element data structures. The only real
disadvantages of indirect addressing vs. direct addressing are an increase in
complexity and a decrease in processing speed due to the need to dereference the
pointer.
Depending on the architecture, the pointer (which contains the operand
address) specified by the instruction may reside in either a memory location
(memory indirect addressing) or a CPU register (register indirect addressing).
Memory indirect addressing allows a virtually unlimited number of pointers to be
active at once, but requires an additional memory access – which complicates and
slows the execution of the instruction, exacerbating the disadvantages mentioned
above. To avoid this complexity, most modern architectures support only register
indirect addressing, which limits the pointers to exist in the available CPU registers
but allows instructions to execute more quickly.
6. Various computer architectures have featured machine instructions that allow the
specification of three, two, one, or even zero operands. Explain the tradeoffs inherent to
the choice of the number of operands per machine instruction. Pick a current or historical
computer architecture, find out how many operands it typically specifies per instruction,
and explain why you think its architects implemented the instructions the way they did.
The answer to this question will obviously depend on the architecture chosen.
The main tradeoff is programmer (or compiler) convenience, which favors more
operands per instruction, versus the desire to keep instructions smaller and more
compact, which favors fewer operands per instruction.
7. Why have load-store architectures increased in popularity in recent years? (How do their
advantages go well with modern architectural design and implementation technologies?)
What are some of their less desirable tradeoffs vs. memory-register architectures, and
why are these not as important as they once were?
Load/store architectures have become popular in large measure because the
decoupling of memory access from computational operations on data keeps the
control unit logic simpler and makes it easier to pipeline the execution of
instructions (see Chapter 4). Simple functionality of instructions makes it easier to
avoid microcode and use only hardwired control logic, which is generally faster and
takes up less “real estate” on the IC. Not allowing memory operands also helps keep
instructions shorter and can help avoid the need to have multiple instruction
formats of different sizes.
Memory-register architectures, on the other hand, tend to require fewer
machine language instructions to accomplish the same programming task, thus
saving program memory. The compiler (or the assembly language programmer)
has more flexibility and not as many registers need to be provided if operations on
memory contents are allowed. Given the decrease in memory prices, the
improvements in compiler technology, and the shrinking transistor sizes over the
past 20 years or so, the advantages of memory-register architectures have been
diminished and load/store architectures have found greater favor.
8. Discuss the two historically dominant architectural philosophies of CPU design:
a) Define the acronyms CISC and RISC and explain the fundamental differences
between the two philosophies.
CISC stands for “Complex Instruction Set Computer” and RISC stands for
“Reduced Instruction Set Computer.” The fundamental difference between these
two philosophies of computer system design is the choice of whether to put the
computational complexity required of the system in the hardware or in the software.
CISC puts the complexity in the hardware. The idea of CISC was to support high-
level language programming by making the machine directly execute high-level
functions in hardware. This was usually accomplished by using microcode to
implement those complex functions. Programs were expected to be optimized by
coding in assembly language. RISC, on the other hand, puts the complexity in the
software (mainly, the high level language compilers). No effort was made to
encourage assembly language programming; instead there is a reliance on
optimization by the compiler. The RISC idea was to make the hardware as simple
and fast as possible by eliminating microcode and explicitly encouraging pipelining
of the hardware. Any task that cannot be quickly and conveniently done in
hardware is left for the compiler to implement by combining simpler functions.
b) Name one commercial computer architecture that exemplifies the CISC
architectural approach and one other that exemplifies RISC characteristics.
CISC examples include the DEC VAX, Motorola 680x0, Intel x86, etc. RISC
examples include the IBM 801, Sun SPARC, MIPS Rx000, etc.
c) For each of the two architectures you named in (b) above, describe one
distinguishing characteristic not present in the other architecture that clearly
shows why one is considered a RISC and the other a CISC.
Answers will vary depending on the architectures chosen, but may include
the use of hardwired vs. microprogrammed control, the number and complexity of
machine language instructions and memory addressing modes, the use of fixed- vs.
variable-length instructions, a memory-register vs. a load/store architecture, the
number of registers provided and their functionality, etc.
d) Name and explain one significant advantage of RISC over CISC and one
significant advantage of CISC over RISC.
Significant advantages of RISC include simpler, hardwired control logic that
takes up less space (leaving room for more registers, on-chip cache and/or floating-
point hardware, etc.) and allows higher CPU clock frequencies, the ability to execute
instructions in fewer clock cycles, and ease of instruction pipelining. Significant
advantages of CISC include a need for fewer machine language instructions per
program (and thus a reduced appetite for code memory), excellent support for
assembly language programming, and less demand for complexity in, and
optimization by, the compilers.
9. Discuss the similarities and differences between the programmer-visible register sets of
the 8086, 68000, MIPS, and SPARC architectures. In your opinion, which of these CPU
register organizations has the most desirable qualities, and which is least desirable? Give
reasons to explain your choices.
The 8086 has a small number of highly specialized registers. Some are for
addresses, some for computations; many functions can only be carried out using a
specific register or a limited subset of the registers. The 68000, another CISC
processor, has a few more (16) working registers and divides them only into two
general categories: data registers and address registers. Within each group,
registers have identical functionality (except for address register 7 which acts as the
stack pointer).
MIPS and SPARC, both RISC designs, have larger programmer-visible
register sets (32 working registers) and do not distinguish between registers used for
data vs. registers used for pointers to memory. For the most part, “all registers are
created equal”, though in both architectures register 0 is a ROM location that
always contains the value 0. SPARC processors actually have a variable number
(up to hundreds) of registers and use a hardware register renaming scheme to make
different subsets of 32 of them visible at different times. This “overlapping register
window” scheme was devised to help optimize parameter passing across procedure
calls. Students can be expected to have different preferences, but should point to
specific advantages of a given architecture to back up their choices.
10. A circuit is to be built to add two 10-bit numbers x and y plus a carry-in. (Bit 9 of each
number is the MSB, while bit 0 is the LSB. c0 is the carry-in to the LSB position.) The
propagation delay of any individual AND or OR gate is 0.4 ns, and the carry and sum
functions of each full adder are implemented in sum of products form.
(a) If the circuit is implemented as a ripple carry adder, how much time will it take to
produce a result?
Each full adder takes (0.4 + 0.4) = 0.8 ns to produce a result (sum and carry
outputs). Since the carry output of each adder is an input to the adder in the next
more significant position, the operation of the circuit is sequential and it takes 10 *
(0.8 ns) = 8.0 ns to compute the sum of two 10-bit numbers.
(b) Given that the carry generate and propagate functions for bit position i are given
by gi = xi·yi and pi = xi + yi, and that each required carry bit (c1...c10) is developed
from the least significant carry-in c0 and the appropriate gi and pi functions using
AND-OR logic, how much time will a carry lookahead adder circuit take to
produce a result? (Assume AND gates have a maximum fan-in of 8 and OR gates
have a maximum fan-in of 12.)
In a carry lookahead adder, all the gi and pi functions are generated
simultaneously by parallel AND and OR gates. This takes 0.4 ns (one gate delay
time). Since ci+1 = gi + pi·ci, generating all the carries should take two more gate
delay times or 0.8 ns. However, we have to consider the gate fan-in restrictions.
Since OR gates can have a fan-in of 12 and we never need to OR that many terms,
that restriction does not matter; but the fan-in limitation on the AND gates means
that an extra level of logic will be needed (since there are cases where we have to
AND more than 8 terms). Thus, 3 * (0.4 ns) = 1.2 ns is required for this AND-OR
logic for a total of 4 * (0.4 ns) = 1.6 ns to generate all the carries. Once the carries
are available, all full adders operate simultaneously, requiring an additional 2 * (0.4
ns) = 0.8 ns to generate the final result. The overall propagation delay of the circuit
is 6 gate delays or 2.4 ns.
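A short sketch (Python used for illustration; the operand values and 10-bit width are arbitrary) can confirm the carry recurrence ci+1 = gi + pi·ci:

```python
def adder_carries(x, y, c0=0, bits=10):
    """Unroll c(i+1) = g(i) OR (p(i) AND c(i)), with g(i) = x(i) AND y(i), p(i) = x(i) OR y(i)."""
    g = [(x >> i) & (y >> i) & 1 for i in range(bits)]
    p = [((x >> i) | (y >> i)) & 1 for i in range(bits)]
    c = [c0]
    for i in range(bits):
        c.append(g[i] | (p[i] & c[i]))
    return c  # c[0] .. c[bits]

# Sum bits s(i) = x(i) XOR y(i) XOR c(i); together with the carry-out this equals x + y.
x, y = 0b1100110011, 0b0101010101
c = adder_carries(x, y)
s = sum((((x >> i) ^ (y >> i) ^ c[i]) & 1) << i for i in range(10))
print(s + (c[10] << 10) == x + y)  # True
```

In the real lookahead circuit the recurrence is flattened into two (or, under the fan-in limits, three) levels of AND-OR logic per carry rather than evaluated sequentially; the loop here only checks the logic, not the timing.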
11. Under what circumstances are carry save adders more efficient than “normal” binary
adders that take two operands and produce one result? Where, in a typical general-
purpose CPU, would one be most likely to find carry save adders?
Carry save adders are more efficient when there are several numbers, rather
than just two, that need to be added together. While multi-operand addition is not a
typical operation supported by general-purpose computers, multiplication (which
requires the addition of several partial products) is such an operation. Carry save
adders are frequently used in multiplication hardware.
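The 3:2 reduction a carry save adder performs can be sketched as follows (Python for illustration; the addend values and 16-bit width are arbitrary):

```python
def carry_save_add(x, y, z, width=16):
    """Reduce three addends to a sum word and a carry word with no carry propagation."""
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask                            # per-bit sum, ignoring carries
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask   # saved carries, shifted left one place
    return s, c

# Adding four "partial products": two CSA levels, then one conventional add at the end.
a, b, c_, d = 13, 7, 21, 9
s1, c1 = carry_save_add(a, b, c_)
s2, c2 = carry_save_add(s1, c1, d)
print(s2 + c2 == a + b + c_ + d)  # True
```

Only the final sum-plus-carry addition requires a carry-propagating adder, which is why stacking CSAs (as in a Wallace tree) speeds up multi-operand addition.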
12. Given two 5-bit, signed, two’s complement numbers x = -6 = 11010₂ and y = +5 =
00101₂, show how their 10-bit product would be computed using Booth’s algorithm (you
may wish to refer to Figures 3.24, 3.25, and 3.26).
M = -6 = 11010; therefore –M = +6 = 00110. The initial contents of P are
0000000101 (upper part zero, lower part = the multiplier (+5)). The computation
proceeds as follows:

          P            C
          00000 00101  0
+00110                    (1. add –M)
          00110 00101  0
          00011 00010  1  (then shift right)
+11010                    (2. add +M)
          11101 00010  1
          11110 10001  0  (then shift right)
+00110                    (3. add –M)
          00100 10001  0
          00010 01000  1  (then shift right)
+11010                    (4. add +M)
          11100 01000  1
          11110 00100  0  (then shift right)
          11111 00010  0  (5. just shift right)

Done. Answer = 1111100010 = -30 (represented as the two’s complement of 30).
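The same add/shift trace can be reproduced programmatically. The sketch below (Python; the register-width handling is an assumption of this illustration) follows the P and C registers through the five iterations:

```python
def booth_multiply(m, q, bits=5):
    """Booth's algorithm: multiply two signed, bits-wide two's complement numbers."""
    mask = (1 << bits) - 1
    pmask = (1 << 2 * bits) - 1
    M, neg_M = m & mask, (-m) & mask
    P, C = q & mask, 0                    # P: upper half 0, lower half = multiplier; C: extra bit
    for _ in range(bits):
        if (P & 1, C) == (1, 0):
            P = (P + (neg_M << bits)) & pmask   # add -M into the upper half
        elif (P & 1, C) == (0, 1):
            P = (P + (M << bits)) & pmask       # add +M into the upper half
        C = P & 1                               # then arithmetic shift P:C right one place
        P = (P >> 1) | ((P >> (2 * bits - 1)) << (2 * bits - 1))
    return P - (1 << 2 * bits) if P >> (2 * bits - 1) else P

print(booth_multiply(-6, 5))  # -30
```
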
13. Discuss the similarities and differences between “scientific notation” (used for manual
calculations in base 10) and floating-point representations for real numbers used in digital
computers.
Numbers expressed in scientific notation and floating-point format have a
great deal in common. Both approaches divide a number into its mantissa
(significant digits) and its exponent (the power of the system’s base or radix) such
that the number is the product of the mantissa times the base raised to the given
power. (Both the mantissa and the exponent are signed values, allowing us to
represent a wide range of positive and negative values.) In both cases, we gain the
considerable advantage of being able to represent very large and very small
numbers without having to write (or store) a large number of digits that are either
zero or insignificant.
The main difference is that numbers in scientific notation usually work with
base 10, with the mantissa and exponent themselves expressed in decimal notation
also. Computer floating-point formats invariably express the mantissa and
exponent in some type of binary format, and the radix is either 2 or some small
power of 2 (e.g., 4, 8, or 16). Of course, the signs of the mantissa and exponent must
be represented in sign-magnitude, complement, or biased notation since the
computer cannot directly represent the symbols “+” and “-” in hardware. Floating-
point formats also need representations for special cases like zero, infinity, and
invalid or out-of-range results that may result from computations, so such results
can be distinguished from normalized floating-point values.
14. Why was IEEE 754-1985 a significant development in the history of computing,
especially in the fields of scientific and engineering applications?
IEEE 754 was significant because it represented the first time that a
consortium of major computer manufacturers was able to come to an agreement on
a floating-point arithmetic standard that all of them could and would support.
Before the adoption of IEEE 754, every vendor had its own, unique floating-point
format (some more mathematically robust than others). Moving binary files
containing real-number data between incompatible systems was impossible. Porting
any kind of code that performed arithmetic on real numbers (as most scientific and
engineering applications do extensively) from one system to another was very
tedious, tricky, and prone to surprising – and in some cases, mathematically invalid
– results. IEEE 754, once adopted and generally supported, essentially put an end
to floating-point compatibility problems. (Much to the joy of scientific applications
programmers everywhere!)
15. Assume that IEEE has modified standard 754 to allow for “half-precision” 16-bit
floating-point numbers. These numbers are stored in similar fashion to the single
precision 32-bit numbers, but with smaller bit fields. In this case, there is one bit for the
sign, followed by six bits for the exponent (in excess-31 format), and the remaining 9 bits
are used for the fractional part of the normalized mantissa. Show how the decimal value
+17.875 would be represented in this format.
The fact that the number in question is positive means that the leftmost (sign)
bit would be 0. To determine the rest of the bits, we express 17.875 in binary as
10001.111 or, normalizing, 1.0001111 * 2^4. The stored exponent would thus be 4 +
31 = 35 decimal or 100011 binary. Stripping off the leading “1.” and padding on the
right with zeroes, the significand would be 000111100. So +17.875 would be
expressed in this fictional format as 0100011000111100 binary or 463C hexadecimal.
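The conversion can be checked with a short sketch (Python; encode_half is a hypothetical helper for this made-up 1-6-9 format and only handles positive normalized values that fit exactly):

```python
from math import frexp

def encode_half(value):
    """Pack a positive value as: 1 sign bit | 6-bit excess-31 exponent | 9-bit fraction."""
    mant, exp = frexp(value)               # value = mant * 2**exp, with 0.5 <= mant < 1
    exp -= 1                               # renormalize to 1.f * 2**exp
    frac = int((mant * 2 - 1) * (1 << 9))  # fraction bits of 1.f (assumes an exact fit)
    return ((exp + 31) << 9) | frac        # sign bit (bit 15) is 0 for positive values

print(f"{encode_half(17.875):016b} = {encode_half(17.875):04X}")  # 0100011000111100 = 463C
```
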
16. Show how the decimal value -267.5625 would be represented in IEEE-754 single and
double precision formats.
Because the number is negative, the leftmost (sign) bit will be 1 in either
format. To determine the rest of the bits, we express 267.5625 in binary as
100001011.1001 or, normalizing, 1.000010111001 * 2^8. In single precision, the stored
exponent would be 8 + 127 = 135 decimal or 10000111 binary. In double precision,
the stored exponent would be 8 + 1023 = 1031 decimal or 10000000111 binary.
Stripping off the leading “1.” and padding on the right with zeroes, the significand
would be 00001011100100000000000 (single precision) or
0000101110010000000000000000000000000000000000000000 (double precision). So
–267.5625 would be expressed in the IEEE single precision format as
11000011100001011100100000000000 binary or C385C800 hexadecimal, and in the
IEEE double precision format as
1100000001110000101110010000000000000000000000000000000000000000 binary
or C070B90000000000 hexadecimal.
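Both encodings are easy to confirm in software; for example, Python's struct module packs values in IEEE-754 single ('>f') and double ('>d') precision, big-endian:

```python
import struct

x = -267.5625
print(struct.pack('>f', x).hex().upper())  # C385C800
print(struct.pack('>d', x).hex().upper())  # C070B90000000000
```
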
17. Consider a simple von Neumann architecture computer like the one discussed in Section
3.3.1 and depicted in Figure 3.32. One of its machine language instructions is an ANDM
instruction which reads the contents of a memory location (specified by direct
addressing), bitwise ANDs this data with the contents of a specified CPU register, then
stores the result back in that same register. List and describe the sequence of steps that
would have to be carried out under the supervision of the processor’s control unit in order
to implement this instruction.
1. MAR ← PC: Copy the contents of the Program Counter (the address of the instruction)
to the MAR so they can be output on the address bus.
2. Read; PC ← PC + 1: Activate the Read control signal to the memory system to initiate
the memory access. While the memory read is taking place, increment the Program
Counter so that it points to the next sequential instruction in the program.
3. MDR ← [MAR]: When the memory read is complete, transfer the contents of the
memory location over the data bus and latch them into the MDR.
4. IR ← MDR: Transfer the contents of the MDR (the machine language instruction)
to the IR and decode the instruction. At this point, the control unit discovers that this is
an “AND with memory direct” instruction.
5. MAR ← IRlow: Transfer the lower 8 bits from the IR (the operand address) to the
MAR to prepare to read the operand.
6. Read: Activate the Read control signal to the memory system to initiate the
memory access for the operand.
7. MDR ← [MAR]: When the memory read is complete, transfer the contents of the
memory location over the data bus and latch them into the MDR.
8. MDRout; R4outB; AND: Transfer the contents of the MDR (the memory operand) and register 4
(the register operand) to the ALU inputs and activate the control signal telling the ALU
to perform the logical AND function.
9. R4 ← ALU: Transfer the output of the ALU (the logical AND of the operands) to
the destination register (R4). Execution of the current instruction is now complete and
the control unit is ready to fetch the next instruction.
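In register-transfer terms, the operand-fetch and execute steps amount to the following (a toy Python sketch; the memory contents, register values, and the 8-bit operand address 0x2A are invented for illustration):

```python
mem = {0x2A: 0b10110110}   # hypothetical operand in memory at address 0x2A
R4 = 0b11011100            # hypothetical initial contents of register 4
IR_low = 0x2A              # low 8 bits of the fetched ANDM instruction: the operand address

MAR = IR_low               # step 5: MAR <- IRlow
MDR = mem[MAR]             # steps 6-7: Read; MDR <- [MAR]
R4 = MDR & R4              # steps 8-9: ALU ANDs MDR with R4; result written back to R4
print(bin(R4))             # 0b10010100
```
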
18. What are the two principal design approaches for the control unit of a CPU? Describe
each of them and discuss their advantages and disadvantages. If you were designing a
family of high performance digital signal processors, which approach would you use, and
why?
The two principal design approaches are hardwired control and
microprogrammed control. A hardwired (conventional) control unit is implemented
using standard sequential and combinational logic design techniques to design the
control step counter, decoders, and other circuitry required to generate the
machine’s control signals. It has the disadvantages of requiring more design effort
(especially if the machine has a complex instruction set architecture) and being
more difficult to change once designed, but the advantages of being faster (all else
being equal, logic tends to be faster than memory access) and taking up less space
for implementation.
Microprogrammed control is implemented by treating each individual
machine operation as a task to be implemented using programmed steps. These
steps are carried out by a very simple computing engine within the CPU itself; thus,
microprogramming is sometimes described as the “computer within a computer”
approach. Microinstructions, representing sets of control signals to be generated
for each step in carrying out a machine operation, are fetched from microprogram
memory (control store) and issued (sent out over the control lines) to control
internal and external operations. Microprogramming simplifies the design process
and makes it easy to change or extend the capabilities of an architecture, but the
control store takes up a considerable amount of space on the chip. And since
control signals are read out of a memory rather than generated in logic,
microprogrammed control tends to yield a slower implementation than hardwired
control.
In the situation described, it would probably be most appropriate to use
hardwired control since digital signal processing is likely to demand raw CPU speed
over just about any other criterion and hardwired logic will contribute to a “lean,
mean” and fast design. Since it is a special-purpose and not a general-purpose
processor, the DSP’s instruction set is likely to be fairly simple and so the design
would not be likely to profit much from microcode’s more flexible capabilities and
ability to easily implement complexity.
19. In a machine with a microprogrammed control unit, why is it important to be able to do
branching within the microcode?
Transfers of control (branches) are needed within microcode for many of the
same reasons that they are needed in machine language, assembly language, and
high-level language programming. At any level of programming, the next
sequential instruction in memory is not always the one we wish to perform next.
Unconditional transfers of control are needed to move between one microroutine
and another, for example from the end of the execution microroutine for one
instruction to the beginning of the instruction fetch microroutine that will retrieve
the next microinstruction. Conditional transfers of control in the microcode are
needed in order to implement conditional branching in the machine language
program (which, in turn, supports conditional structures in high-level code).
20. Given the horizontal control word depicted in Figure 3.39 for our simple example
machine, develop the microroutines required to fetch and execute the ANDM instruction
using the steps you outlined in question 17.
For each step needed to execute an instruction we would need to form a
binary microinstruction that had 1s in the correct places to carry out the desired
action, and 0s everywhere else. For example, to do step one (which is MAR ← PC in
register transfer language) we would make PCout = 1 and MARin = 1, and all the rest
of the bits of the control word would be 0. So the first microinstruction would look
like this:
0000000000000000000000000010100000000000000
The same sort of procedure would have to be followed for each of the
remaining steps. The complete binary microroutine would be as follows:
0000000000000000000000000010100000000000000
0000000000000000000000000000000100000000110
0000000000000000000000000001000000000000000
0000000000000000000000000100001000000000000
0000000000000000000000000010010000000000000
0000000000000000000000000000000000000000110
0000000000000000000000000001000000000000000
0000000000000000000100000000001000100000000
0000100000000000000000000000000000000000000
21. Repeat question 20 using the vertical control word depicted in Figure 3.40.
The steps needed to carry out the instruction are still the same, but the
microinstructions are more compact since subsets of the control signals are encoded
into bit fields. To come up with specific binary values for the microinstructions, we
would have to know how the four system registers (PC, IR, MAR, and MDR) and
the eight ALU operations were encoded. Assuming that PC = 00, IR = 01, MAR =
10, and MDR = 11, and that the AND operation is represented by 010, the binary
microroutine in the vertical format would be as follows:
0000000001000000000
0000000000000100011
0000000001100000000
0000000000111000000
0000000001001000000
0000000000000000011
0000000001100000000
0000001000011001000
1000000000000000000
22. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Op code - the portion (bit field) of a machine language instruction that specifies the
operation to be done by the CPU
Control transfer instruction - a type of instruction that modifies the machine’s program
counter (other than by simply incrementing it)
Indexed addressing - a way of specifying the location of an operand in memory by
adding a constant embedded in the instruction to the contents of a “pointer” register
inside the CPU
Zero-operand instructions - these would be characteristic of a stack-based instruction
set
Accumulator machine - this type of architecture typically has instructions that explicitly
specify only one operand
Load-store architecture - a feature of some computer architectures where “operate”
instructions do not have memory operands; their operands are found in CPU registers
Complex Instruction Set Computer (CISC) - machines belonging to this architectural
class try to “bridge the semantic gap” by having machine language instructions that
approximate the functionality of high-level language statements
Datapath - this part of a CPU includes the registers that store operands as well as the
circuitry that performs computations
Carry lookahead adder - this type of addition circuit develops all carries in logic,
directly from the inputs, rather than waiting for them to propagate from less significant
bit positions
Wallace tree - a structure comprised of multiple levels of carry save adders, which can
be used to efficiently implement multiplication
Excess (biased) notation - this type of notation stores signed numbers as though they
were unsigned; it is used to represent exponents in some floating-point formats
Significand (fraction) - in IEEE-754 floating-point numbers, a normalized mantissa with
the leading 1 omitted is called this
Positive infinity - this is the result when the operation 1.0/0.0 is performed on a system
with IEEE-754 floating-point arithmetic
Instruction Register (IR) - this holds the currently executing machine language
instruction so its bits can be decoded and interpreted by the control unit
Microroutine - a sequence of microinstructions that fetches or executes a machine
language instruction, initiates exception processing, or carries out some other basic
machine-level task
Horizontal microprogramming - a technique used in microprogrammed control unit
design in which mutually-exclusive control signals are not encoded into bit fields, thus
eliminating the need for decoding microinstructions
Microprogram Counter (µPC) - this keeps track of the location of the next microword
to be retrieved from microcode storage
4 Enhancing CPU Performance
1. Suppose that you are designing a machine that will frequently have to perform 64
consecutive iterations of the same task (for example, a vector processor with 64-element
vector registers). You wish to implement a pipeline that will help speed up this task as
much as is reasonably possible, but recognize that dividing a pipeline into more stages
takes up more chip area and adds to the cost of implementation.
(a) Make the simplifying assumptions that the task can be subdivided as finely or
coarsely as desired and that pipeline registers do not add a delay. Also assume that
one complete iteration of the task takes 16 ns (thus, a non-pipelined implementation
would take 64 * 16 = 1024 ns to complete 64 iterations). Consider possible pipelined
implementations with 2, 4, 8, 16, 24, 32, and 48 stages. What is the total time
required to complete 64 iterations in each case? What is the speedup (vs. a non-
pipelined implementation) in each case? Considering cost as well as performance,
what do you think is the best choice for the number of stages in the pipeline?
Explain. (You may wish to make graphs of speedup and/or total processing time vs.
the number of stages to help you analyze the problem.)
In general, for a pipeline with s stages processing n iterations of a task, the
time taken to complete all the iterations may be expressed as:
tTOTAL = [s * tSTAGE] + [(n-1) * tSTAGE] = [(s+n-1) * tSTAGE]
In a pipelined implementation with 2 stages: the stage time is 16/2 = 8 ns,
and the total time for 64 iterations = [(2 + 64 – 1) * 8] ns = 520 ns; the speedup
factor is 1.969.
In a pipelined implementation with 4 stages: the stage time is 16/4 = 4 ns,
and the total time for 64 iterations = [(4 + 64 – 1) * 4] ns = 268 ns; the speedup
factor is 3.821.
In a pipelined implementation with 8 stages: the stage time is 16/8 = 2 ns,
and the total time for 64 iterations = [(8 + 64 – 1) * 2] ns = 142 ns; the speedup
factor is 7.211.
In a pipelined implementation with 16 stages: the stage time is 16/16 = 1 ns,
and the total time for 64 iterations = [(16 + 64 – 1) * 1] ns = 79 ns; the speedup
factor is 12.962.
In a pipelined implementation with 24 stages: the stage time is 16/24 = 0.667
ns, and the total time for 64 iterations = [(24 + 64 – 1) * 0.667] ns = 58 ns; the
speedup factor is 17.655.
In a pipelined implementation with 32 stages: the stage time is 16/32 = 0.5 ns,
and the total time for 64 iterations = [(32 + 64 – 1) * 0.5] ns = 47.5 ns; the speedup
factor is 21.558.
In a pipelined implementation with 48 stages: the stage time is 16/48 = 0.333
ns, and the total time for 64 iterations = [(48 + 64 – 1) * 0.333] ns = 37 ns; the
speedup factor is 27.676.
The speedup achieved versus the number of stages goes up in nearly linear
fashion through about 8-16 stages, then starts to fall off somewhat. Still, while gains
are not linear, performance continues to improve significantly all the way up to a
48-stage pipeline. If hardware cost were a critical limiting factor, it would be
reasonable to build only a 16-stage pipeline. If, on the other hand, performance is
paramount, it would probably be worth building the most finely-grained (48-stage)
pipeline even though it achieves a speedup of “only” about 28.
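The figures above all come from the same formula; a few lines of Python reproduce them (the function name is invented for this sketch):

```python
def pipeline_stats(stages, n=64, task_time=16.0):
    """Total time and speedup for an ideal s-stage pipeline (no register delay)."""
    t_stage = task_time / stages
    total = (stages + n - 1) * t_stage
    return total, (n * task_time) / total   # baseline: 64 * 16 ns = 1024 ns

for s in (2, 4, 8, 16, 24, 32, 48):
    total, speedup = pipeline_stats(s)
    print(f"{s:2d} stages: {total:6.1f} ns, speedup {speedup:6.3f}")
```
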
(b) Now assume that a total of 32 levels of logic gates are required to perform the task,
each with a propagation delay of 0.5 ns (thus the total time to produce a single result
is still 16 ns). Logic levels cannot be further subdivided. Also assume that each
pipeline register has a propagation delay equal to that of two levels of logic gates, or
1 ns. Re-analyze the problem; does your previous recommendation still hold? If not,
how many stages would you recommend for the pipelined implementation under
these conditions?
The same general pipeline performance equation still holds; however, in this
case we are presented with reasonable limitations on what the stage time can be.
Clearly, if there are 32 levels of logic gates involved, then the number of stages
should be 2, 4, 8, 16, or 32; a pipeline with 24 or 48 stages is not a realistic approach
since the delays do not divide evenly. (Some stages would do less work than others,
or perhaps no work at all!) In addition, each stage has a minimum propagation
delay of 1 ns due to its pipeline register regardless of the number of logic levels it
contains. Given these more realistic limitations, we analyze the situation as follows:
In a pipelined implementation with 2 stages: the stage time is (16/2) + 1 = 9
ns, and the total time for 64 iterations = [(2 + 64 – 1) * 9] ns = 585 ns; the speedup
factor is 1.750.
In a pipelined implementation with 4 stages: the stage time is (16/4) + 1 = 5
ns, and the total time for 64 iterations = [(4 + 64 – 1) * 5] ns = 335 ns; the speedup
factor is 3.057.
In a pipelined implementation with 8 stages: the stage time is (16/8) + 1 = 3
ns, and the total time for 64 iterations = [(8 + 64 – 1) * 3] ns = 213 ns; the speedup
factor is 4.808.
In a pipelined implementation with 16 stages: the stage time is (16/16) + 1 =
2 ns, and the total time for 64 iterations = [(16 + 64 – 1) * 2] ns = 158 ns; the speedup
factor is 6.481.
In a pipelined implementation with 32 stages: the stage time is (16/32) + 1 =
1.5 ns, and the total time for 64 iterations = [(32 + 64 – 1) * 1.5] ns = 142.5 ns; the
speedup factor is 7.186.
Here, because of the more realistic design constraints, the speedup achieved
versus the number of stages is not nearly linear, especially beyond about 4-8 stages.
An implementation with 32 stages is twice as costly as one with 16 stages, yet
performs only a fraction better. Even the 16-stage implementation is not
appreciably better than the 8-stage pipeline. In both these latter cases, we would
incur considerable extra hardware expense for very modest reductions in the total
time required to complete the task. The best tradeoff, depending on the exact
constraints of the situation, is probably to use 4 or, at most, 8 stages. As we said in
Section 4.1, “achieving a speedup factor approaching s (the number of pipeline
stages) depends on n (the number of consecutive iterations being processed) being
large, where ‘large’ is defined relative to s.” Here, n is 64 which is large relative to 4
(or perhaps 8), but 64 is not particularly large relative to 16 or 32.
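Adding the 1 ns register delay to each stage time gives the figures above; a sketch (Python, with the function name invented for illustration):

```python
def pipeline_stats_real(stages, n=64, task_time=16.0, reg_delay=1.0):
    """Total time and speedup when each stage adds a 1 ns pipeline-register delay."""
    t_stage = task_time / stages + reg_delay
    total = (stages + n - 1) * t_stage
    return total, (n * task_time) / total   # baseline is still the non-pipelined 1024 ns

for s in (2, 4, 8, 16, 32):
    total, speedup = pipeline_stats_real(s)
    print(f"{s:2d} stages: {total:6.1f} ns, speedup {speedup:.3f}")
```
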
2. Given the following reservation table for a static arithmetic pipeline:
            0   1   2   3   4   5
Stage 1     X                   X
Stage 2         X   X
Stage 3                 X   X
(a) Write the forbidden list. (0, 1, 5)
(b) Determine the initial collision vector C. C = c5c4c3c2c1c0 = 100011
(c) Draw the state diagram.
The diagram includes three states. The initial state is given by the
initial collision vector, 100011. For i = 4 or i ≥ 6, we remain in this state. For
i = 3 we transition from the initial state to state 100111, while for i = 2 we
transition from the initial state to state 101011. Once in state 100111 we
remain in that state for i = 3 and return to the initial state for i = 4 or i ≥ 6.
Once in state 101011 we remain in that state for i = 2 and return to the initial
state for i = 4 or i ≥ 6.
(d) Find the MAL. 2
(e) Find the minimum latency. 2
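The forbidden list and collision vector can be derived mechanically from the stage-usage cycles. In the sketch below (Python), the cycle sets {0, 5}, {1, 2}, and {3, 4} are one placement of the X's consistent with the forbidden list (0, 1, 5):

```python
from itertools import combinations

usage = {1: [0, 5], 2: [1, 2], 3: [3, 4]}   # stage -> cycles in which it is reserved

forbidden = {0}   # latency 0 (simultaneous initiation) is always forbidden
for cycles in usage.values():
    for a, b in combinations(cycles, 2):
        forbidden.add(abs(b - a))           # reuse of the same stage forbids this latency

bits = max(forbidden)
collision = ''.join('1' if i in forbidden else '0' for i in range(bits, -1, -1))
print(sorted(forbidden), collision)  # [0, 1, 5] 100011
```
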
3. Considering the overall market for all types of computers, which of the following are
more commonly found in today’s machines: arithmetic pipelines (as discussed in Section
4.2) or instruction unit pipelines (Section 4.3)? Explain why this is so.
Arithmetic pipelines are primarily found in vector supercomputers, which
once had a significant share of the high-performance computing market but have
now largely fallen out of favor – mainly because they exhibit little generality. (They
are only useful for a limited subset of applications.) On the other hand, instruction
unit pipelines are found in all RISC and superscalar microprocessors and even in
most CISC microprocessors (other than low-end embedded microcontroller units).
Nowadays, instruction unit pipelines are nearly ubiquitous.
4. Why do control transfers, especially conditional control transfers, cause problems for an
instruction-pipelined machine? Explain the nature of these problems and discuss some of
the techniques that can be employed to cover up or minimize their effect.
In a non-pipelined machine, control transfer instructions are not appreciably
more troublesome than other instructions since all instructions are processed one at
a time. In an instruction-pipelined machine, control transfers are problematic
because by the time the instruction in question is decoded and determined to be a
control transfer (and, in the case of conditional control transfers, by the time the
branch condition is evaluated to determine whether or not the branch succeeds), one
or more additional instructions may have been fetched and proceeded some distance
into the pipeline. If, as is the normal case, these instructions are the ones
sequentially following the control transfer, they are not the correct instructions that
the machine should be executing. (It should instead be executing the instructions
starting at the control transfer’s target location.) While the control unit can
suppress the effect of these following instructions by not allowing them to update
any registers or memory locations (and thus insure correct operation of the
program), it cannot recover the clock cycles lost by bringing them into the pipeline.
One technique that can be used to minimize the performance effect of control
transfers is delayed branching. This approach, commonly used in RISC
architectures, documents the fact that the instruction(s) immediately following a
control transfer (a.k.a. the delay slot instruction(s)) – since it is (or they are) already
in the pipeline – will be executed as though it (they) came before the control transfer
instruction. (In the case of a conditional branch, this means that the delay slot
instruction(s) are executed regardless of the outcome of the branch). Another
approach (these are not mutually exclusive) is to use static (compile time) and/or
dynamic (run time) branch prediction. The compiler and/or processor attempt to
predict, based on the structure of the code and/or the history of execution, whether a
conditional branch will succeed or fail and fetch subsequent instructions
accordingly. A successful prediction will reduce or eliminate the “branch penalty”;
an incorrect prediction, however, will incur a significant performance penalty due
to the need to drain the pipeline. (See the answer to the next question for an
example of a quantitative analysis.)
5. A simple RISC CPU is implemented with a single scalar instruction processing pipeline.
Instructions are always executed sequentially except in the case of branch instructions.
Given that pb is the probability of a given instruction being a branch, pt is the probability
of a branch being taken, pc is the probability of a correct prediction, b is the branch
penalty in clock cycles, and c is the penalty for a correctly predicted branch:
(a) Calculate the throughput for this instruction pipeline if no branch prediction is
done, given that pb = 0.16, pt = 0.3, and b = 3.
The average number of clock cycles per instruction is (0.16)(0.3)(1 + 3 cycles)
+ (0.16)(0.7)(1 cycle) + (1 – 0.16)(1 cycle) = 0.192 + 0.112 + 0.84 = 1.144
cycles/instruction. The throughput equals 1 / (1.144 cycles/instruction) ≈ 0.874
instructions per cycle.
(b) Assume that we use a branch prediction technique to try to improve the pipeline’s
performance. What would be the throughput if c = 1, pc = 0.8, and the other
values are the same as above?
The average number of clock cycles per instruction would be
(0.16)(0.3)(0.8)(1 + 1 cycles) + (0.16)(0.3)(0.2)(1 + 3 cycles) + (0.16)(0.7)(0.8)(1 cycle)
+ (0.16)(0.7)(0.2)(1 + 3 cycles) + (0.84)(1 cycle) = 0.0768 + 0.0384 + 0.0896 + 0.0896 +
0.84 = 1.1344 cycles/instruction. The throughput in this case would improve slightly
to 1 / (1.1344 cycles/instruction) ≈ 0.882 instructions per cycle.
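Both throughput figures follow from the same weighted-average CPI calculation; a sketch (Python, with function names invented for illustration):

```python
def cpi_no_prediction(pb, pt, b):
    """Average cycles per instruction with no branch prediction."""
    return pb * (pt * (1 + b) + (1 - pt)) + (1 - pb)

def cpi_with_prediction(pb, pt, pc, b, c):
    """Average CPI when branches are predicted correctly with probability pc."""
    taken = pc * (1 + c) + (1 - pc) * (1 + b)
    not_taken = pc * 1 + (1 - pc) * (1 + b)
    return pb * (pt * taken + (1 - pt) * not_taken) + (1 - pb)

print(1 / cpi_no_prediction(0.16, 0.3, 3))            # approx. 0.874
print(1 / cpi_with_prediction(0.16, 0.3, 0.8, 3, 1))  # approx. 0.882
```
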
6. What are the similarities and differences between a delayed branch and a delayed load?
A delayed branch, as explained in the answer to question 4 above, is a special
type of control transfer instruction used in some pipelined architectures to minimize
or “cover up” the branch penalty caused by having the wrong instruction(s) in
progress following a control transfer instruction. A delayed load is also a feature of
some instruction sets that is used to cover up a potential performance penalty
associated with pipelined implementation. It has nothing to do with branching, but
rather deals with the (also problematic) latency normally associated with accessing
memory as compared to internal CPU operations. Since (even in the case of a cache
hit) reading a value from memory usually takes at least one additional clock cycle
vs. obtaining it from a register, the architects may document the fact that the
instruction immediately following a load should not (or must not) use the value
being loaded; instead, it should operate on some other data already inside the CPU.
Depending on the architecture, if the following instruction does use the value being
loaded from memory into a register, it may be documented to reference the old,
rather than the new, value in that register; or a hardware interlock may be
employed to delay the subsequent instruction until the load completes and the newly
loaded value can be used.
7. Given the following sequence of assembly language instructions for a CPU with multiple
pipelines, indicate all data hazards that exist between instructions.
I1: Add R2, R4, R3 ; R2 = R4 + R3
I2: Add R1, R5, R1 ; R1 = R5 + R1
I3: Add R3, R1, R2 ; R3 = R1 + R2
I4: Add R2, R4, R1 ; R2 = R4 + R1
Three RAW hazards exist: between I1 and I3 over the use of R2; between I2
and I3 over the use of R1; and between I2 and I4 over the use of R1. Two WAR
hazards exist: between I1 and I3 over the use of R3, and between I3 and I4 over the
use of R2. Only one WAW hazard exists; it is between I1 and I4 over the use of R2.
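The enumeration above can be mechanized. In the sketch below (the instruction encoding as name/destination/sources tuples is mine), every later instruction is compared against every earlier one to find all RAW, WAR, and WAW pairs:

```python
# Each instruction: (name, destination register, set of source registers).
instrs = [
    ("I1", "R2", {"R4", "R3"}),  # R2 = R4 + R3
    ("I2", "R1", {"R5", "R1"}),  # R1 = R5 + R1
    ("I3", "R3", {"R1", "R2"}),  # R3 = R1 + R2
    ("I4", "R2", {"R4", "R1"}),  # R2 = R4 + R1
]

def hazards(instrs):
    found = []
    for i, (ni, di, si) in enumerate(instrs):
        for nj, dj, sj in instrs[i + 1:]:
            if di in sj:
                found.append(("RAW", ni, nj, di))  # later instr reads earlier result
            if dj in si:
                found.append(("WAR", ni, nj, dj))  # later instr overwrites earlier source
            if di == dj:
                found.append(("WAW", ni, nj, di))  # both instrs write the same register
    return found

for kind, earlier, later, reg in hazards(instrs):
    print(kind, earlier, "->", later, "on", reg)
```

Running it reproduces the answer: three RAW hazards, two WAR hazards, and one WAW hazard.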
8. What are the purposes of the scoreboard method and Tomasulo’s method of controlling
multiple instruction execution units? How are they similar and how are they different?
Both the scoreboard method and Tomasulo’s method are techniques of
controlling and scheduling a superscalar processor’s internally parallel hardware
execution units. By detecting and correcting for data hazards, both methods ensure
that RAW, WAR, and WAW relationships in the code being executed do not cause
the machine to improperly execute a sequentially written program. Even though
instructions may actually be executed out of order, these methods make them
appear to be executed in the original, sequential order.
These two methods are different in that each was initially devised for a
different 1960s-era supercomputer (the CDC 6600 vs. the IBM 360/91); each bears
the specific characteristics of its parent machine. The scoreboard method uses a
centralized set of registers and logic to schedule the use of processor hardware by
multiple instructions; when a data hazard is detected, the issue of one or more
instructions to a functional unit is stalled to ensure correct operation. Tomasulo’s
method uses a distributed control technique (with reservation stations associated
with each functional unit) to schedule operations. It implements a “dataflow”
approach to scheduling hardware, using register renaming and data forwarding to
help avoid stalls as much as possible and enhance the machine’s ability to execute
operations concurrently.
9. List and explain nine common characteristics of RISC architectures. In each case,
discuss how a typical CISC processor would (either completely or partially) not exhibit
the given attribute.
The characteristics that are common to virtually all RISC architectures are
discussed in Section 4.4. A RISC architecture may primarily be distinguished by its
adherence to the following characteristics:
• Fixed-length instructions are used to simplify instruction fetching.
• The machine has only a few instruction formats in order to simplify
instruction decoding.
• A load/store instruction set architecture is used to decouple memory accesses
from computations so that each can be optimized independently.
• Instructions have simple functionality, which helps keep the control unit
design simple.
• A hardwired control unit optimizes the machine for speed.
• The architecture is designed for pipelined implementation, again to optimize
for speed of execution.
• Only a few, simple addressing modes are provided, since complex ones may
slow down the machine and are rarely used by compilers.
• There is an emphasis on optimization of functions by the compiler since the
architecture is designed to support high-level languages rather than assembly
programming.
• Complexity is in the compiler (where it only affects the performance of the
compiler), not in the hardware (where it would affect the performance of every
program that runs on the machine).
Secondary characteristics that are prevalent in RISC machines include:
• Three-operand instructions make it easier for the compiler to optimize code.
• A large register set (typically 32 or more registers) is possible because the
machine has a small, hardwired control unit, and desirable because of the need
for the compiler to optimize code for the load/store architecture.
• Instructions execute in a single clock cycle (or at least most of them appear
to, due to pipelined implementation).
• Delayed control transfer instructions are used to minimize disruption to the
pipeline.
• Delay slots behind loads and stores help to cover up the latency of memory
accesses.
• A Harvard architecture is used to keep memory accesses for data from
interfering with instruction fetching and thus keep the pipeline(s) full.
• On-chip cache is possible due to the small, hardwired control unit and
necessary to speed instruction fetching and keep the latency of loads and
stores to a minimum.
The characteristics of a CISC would depend on the specific architecture and
implementation being compared, but would include such things as variable-length
instructions, many instruction formats, a memory-register architecture, complex
functionality of individual machine instructions, the use of microprogrammed
control, the lack of explicit support for pipelining, support for many and/or complex
addressing modes, an emphasis on optimization of code by manual profiling and re-
coding in assembly language, support for fewer than three operands per instruction,
a limited number of programmer-visible registers, instructions that require multiple
clock cycles to execute, the lack of delay slots, a Princeton architecture and, in some
cases, a smaller or nonexistent on-chip cache. In general, complexity is in the
hardware rather than in the compiler! A machine need not exhibit all these
characteristics to be considered a CISC architecture, but the more of them that are
observed, the more confident we can be in identifying it as such.
10. How does the “overlapping register windows” technique, used in the Berkeley RISC and
its commercial successor the Sun SPARC, simplify the process of calling and returning
from subprograms?
This technique partitions the register set not by intended use (data registers
vs. pointer registers), as is common in many other architectures, but rather by scope
(global, local, inputs, and outputs) relative to the procedure(s) in which values are
used. When a procedure calls another procedure, the internal decoding scheme is
altered such that registers are renumbered; the caller’s output registers
automatically become the called procedure’s input registers and a new set of local
and output registers is allocated to that procedure. In many cases, the number of
procedure parameters is such that all of them can be passed in the overlapped
registers, thus eliminating the need to access memory to write and then read values
to/from a stack frame. This reduction in the number of accesses to data memory
can help improve performance, especially for high-level language programs that are
written to use a number of modular functions.
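The renumbering can be illustrated with a toy model. The window sizes and register-class names below (8 "in", 8 "local", 8 "out" per window) are illustrative only, not the actual SPARC register file layout:

```python
# Toy overlapping-register-window model: each procedure sees 8 "in",
# 8 "local", and 8 "out" registers. On a call, the window slides so that
# the caller's "out" registers become the callee's "in" registers.
# (Sizes and naming are illustrative, not the real SPARC layout.)

WINDOW_SLIDE = 16  # locals + outs; the 8-register overlap carries parameters

def physical_reg(window, reg_class, index):
    """Map a (window, class, index) architectural name to a physical register."""
    base = window * WINDOW_SLIDE
    offset = {"in": 0, "local": 8, "out": 16}[reg_class]
    return base + offset + index

# The caller (window 0) writes a parameter into its out0 register ...
caller_out0 = physical_reg(0, "out", 0)
# ... and after the call, the callee (window 1) reads the same value as in0:
callee_in0 = physical_reg(1, "in", 0)
assert caller_out0 == callee_in0  # same physical register: no memory traffic
```

The assertion captures the key point: parameter passing happens by renumbering alone, with no stack-frame reads or writes.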
11. You are on a team helping to design the new Platinum V® processor for AmDel
Corporation. Consider the following design issues:
(a) Your design team is considering a superscalar vs. superpipeline approach to the
design. What are the advantages and disadvantages of each option? What
technological factors would tend to influence this choice one way or the other?
The superpipelined approach has the advantage of being simpler to control
(since there is no out-of-order execution to cause WAR and WAW hazards); it also
may be able to achieve a higher clock frequency since each pipeline stage does less
work. The simpler control logic and single pipeline leave more space on the chip for
registers, on-chip cache, floating-point hardware, memory management unit(s), and
other enhancements. However, superpipelined processors suffer from an increased
branch penalty and do not perform well on code with many control transfers.
The superscalar approach takes advantage of spatial parallelism to better
exploit the instruction-level parallelism inherent in most programs. Superscalar
processors also may achieve high performance with only a modest clock frequency,
avoiding the need to generate and distribute a high-frequency clock signal. On the
other hand, a superscalar implementation has the disadvantage of being more
difficult to control internally, due to its vulnerability to all types of data hazards.
The multiple pipelines and more complex control logic take up more space on the
chip, leaving less room for other functionality. It is also more difficult to handle
exceptions precisely in a superscalar architecture.
An implementation technology with short propagation delays (i.e. very fast
transistors) favors a superpipelined approach, while an implementation technology
with small feature sizes (and thus more transistors per IC) favors a superscalar
approach.
(b) Your design team has allocated the silicon area for most of the IC and has narrowed
the design options to two choices: one with 32 registers and a 512KB on-chip cache
and one with 512 registers but only a 128 KB on-chip cache. What are the
advantages and disadvantages of each option? What other factors might influence
your choice?
One would think that having more registers would always be a good thing,
but this is not necessarily true. Registers are only beneficial to the extent that the
compiler (or the assembly language programmer) can make use of them; adding
additional registers past that point provides no benefit and has some definite costs.
For example, one significant advantage of the first option is that it only requires 5
bits in the instruction format for addressing each register operand. Unless some
scheme such as register windowing is used, the second option (with 2^9 = 512
registers) will require 9 bits in the instruction for each register operand used.
requires fewer registers to be saved and restored on each context switch or
interrupt.
On the other hand, while having more cache is essentially always a good
thing, even cache hits are not as beneficial for performance as keeping data in a
CPU register, since cache generally requires at least one extra clock cycle to read or
write data as compared to internal operations. Cache, like registers, also has a
“diminishing returns” effect; while 512KB is four times the capacity of 128KB, it
certainly won’t provide four times the hit ratio. To make the best choice given the
specified alternatives, one would probably need to consider factors including the
intended application(s) of the processor, the anticipated speed gap between cache
and main memory, whether the on-chip cache is going to have a Harvard
(instructions and data kept separate) or Princeton (unified) architecture, whether
there will be any off-chip (level 2) cache, etc.
12. How are VLIW architectures similar to superscalar architectures, and how are they
different? What are the relative advantages and disadvantages of each approach? In
what way can VLIW architectures be considered the logical successors to RISC
architectures?
Both VLIW and superscalar architectures make use of internally parallel
hardware (more than one pipelined instruction execution unit operating at the same
time). The main difference is that superscalar architectures use hardware inside the
CPU’s control unit to dynamically schedule the execution of instructions and avoid
incorrect operation due to hazards; VLIW machines instead use the compiler to
statically perform dependency resolution and resource scheduling. The principal
advantages of the superscalar approach are better compatibility with existing
architectures and a reduced burden on the (more conventional) compiler.
VLIW has the advantage of simplified control hardware, which may reduce
internal delays and allow operation at a higher clock frequency than a superscalar
processor given the same implementation technology. Also, since the compiler can
use more sophisticated scheduling logic, a VLIW architecture may take better
advantage of instruction-level parallelism and execute more instructions, on
average, per clock cycle. However, VLIW architectures in general suffer from poor
code density (programs take up a lot of memory), are not easily made compatible
with existing architectures, and perform poorly on code with many branches.
VLIW is the logical successor to RISC in the sense that both design
philosophies emphasize making hardware simpler and faster while transferring the
burden of computational complexity to the software (specifically, the compiler).
13. Is Explicitly Parallel Instruction Computing (EPIC) the same thing as a VLIW
architecture? Explain why or why not.
Not exactly. EPIC is based on the same idea as VLIW, but the format of the
“bundles” in the IA-64 processors is not tied to a particular hardware
implementation. Some EPIC-based chips may be able to execute one bundle at a
time, some others may do less than a complete bundle at a time, and some higher-
performance implementations may be able to execute multiple bundles at a time (if
not prevented by dependencies between operations). By standardizing the size of
the bundles, EPIC (unlike a VLIW architecture) allows binary compatibility to be
maintained between generations of CPUs. In a pure VLIW machine the compiler
does all the resource scheduling. In a superscalar machine the control unit does the
scheduling. In EPIC the compiler does much of the work, but the control unit still
has to do some of it. So EPIC is somewhere between VLIW and superscalar, though
closer to VLIW.
14. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Flow-through time - the time required for the first result in a series of computations to
emerge from a pipeline
Pipeline register - this is used to separate one stage of a pipeline from the next
Multifunction pipeline - this type of pipeline can perform different kinds of
computations at different times
Collision - this occurs if we mistakenly try to use a pipeline stage for two different
computations at the same time
Average Latency - over time, this tells the mean number of clock cycles between
initiation of operations into a pipeline (if an optimal pipeline control strategy is used,
this would be equal to the Minimum Average Latency or MAL)
Pipeline throughput - over time, this tells the mean number of operations completed per
clock cycle
Branch penalty - the clock cycles that are wasted by an instruction-pipelined processor
due to executing a control transfer instruction
Static branch prediction - a technique used in pipelined CPUs where the compiler
supplies a hint as to whether or not a given conditional branch is likely to succeed
Delay slot instruction(s) - the instruction(s) immediately following a conditional control
transfer instruction in some pipelined processors, which are executed whether or not the
control transfer occurs
Delayed load - a technique used in pipelined CPUs where the instruction immediately
following another instruction that reads a memory operand cannot use the updated value
of the operand
Read After Write (RAW) hazard - the most common data hazard in pipelined
processors; also known as a true data dependence
Write After Write (WAW) hazard - also known as an output dependence, this hazard
can occur in a processor that utilizes out-of-order execution
Scoreboard - a centralized resource scheduling mechanism for internally concurrent
processors; it was first used in the CDC 6600 supercomputer
Reservation stations - these are used by a Tomasulo scheduler to hold operands for
functional units
Overlapping register windows - a technique used in some RISC processors to speed up
parameter passing for high-level language procedure calls
Superpipelined - this type of processor architecture maximizes temporal parallelism by
using a very deep pipeline with very fast stages
Superscalar - this approach to high-performance processing uses multiple pipelines with
resolution of inter-instruction data dependencies done by the control unit
Explicitly Parallel Instruction Computing (EPIC) - the “architecture technology” used
in Intel’s IA-64 (Itanium) chips
Predication - the IA-64 architecture uses this approach instead of branch prediction to
minimize the disruption caused by conditional control transfers
5 Exceptions, Interrupts, and Input/Output Systems
1. What do we mean when we say that interrupts must be processed “transparently”? What
does this involve and why is it necessary?
Since interrupts are asynchronous to CPU operations (that is, they can occur
at any time, without warning), it is necessary that the complete run-time context of
the program that was executing be preserved across the servicing of the interrupt.
That is to say, the interrupted program must not experience any changes to the state
of the processor (or to its program or data memory) due to interrupt handling; it
should not “see” any effects other than a time lag, and should compute the same
results in the presence vs. absence of an interrupt(s). For this to happen it is
necessary to save and restore not only the program counter (so the program can be
resumed at the next instruction that would have been executed had the interrupt not
occurred), but also the condition codes (a.k.a. status flags) and the contents of all the
CPU registers. This is normally accomplished via pushing these values on the
system stack when an interrupt occurs and popping them back off after it has been
serviced. If this were not done, the interrupt would not be “transparent” and the
interrupted program could operate incorrectly.
2. Some processors, before servicing an interrupt, automatically save all register contents.
Others automatically save only a limited amount of information. In the second case, how
can we be sure that all critical data are saved and restored? What are the advantages and
disadvantages of each of these approaches?
The advantage of automatically saving everything is that we are sure that it
has been done and thus we know that the interrupt will be serviced in transparent
fashion (as described in the answer to question 1 above). The disadvantage of this is
that a given interrupt service routine may actually use only a small subset of the
CPU registers, and the time spent saving and restoring all the other, unused
registers is wasted.
The additional delay involved in saving a large number of registers can
significantly increase the latency in responding to an interrupt; for some timing-
sensitive I/O devices, it is important to keep this latency as small as possible. Some
processors only automatically save the program counter and condition codes; other
registers are left to be preserved with push and pop instructions inside the service
routine. In that case, we save only the necessary registers and keep latency to a
minimum; but the potential disadvantage is that we are dependent on the vigilance
of the programmer who writes the service routine to track his/her register usage
and save all required registers. If he or she fails to do this, the interrupted program
could have its run-time context corrupted.
3. Explain the function of a watchdog timer. Why do embedded control processors usually
need this type of mechanism?
A watchdog timer is a mechanism that can be used to reset a system if it
“locks up”, without requiring the intervention of a human user (which is not always
possible, especially in embedded systems). To implement this mechanism, a counter
that runs off the system clock (or some derivative of it) is initialized and allowed to
count up toward a maximum count (or down toward zero). Software running on
the system is assigned the task of periodically resetting the counter to its initial
value. If the counter ever “rolls over”, presumably due to a system software failure,
that rollover event is detected by hardware and used to generate a reset signal to
reinitialize and recover the system to a known state. Embedded control processors,
unlike general-purpose CPUs, are not normally mounted in a case with a convenient
reset button within reach of a user’s finger. Due to the embedded location, it may
be difficult or impossible to perform a manual reset; a watchdog timer may be the
only mechanism that allows the system to be restarted in the event of a crash.
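The mechanism described above can be modeled in a few lines. This is a sketch of the behavior only (real watchdogs are hardware counters, and the class and method names here are mine):

```python
# Minimal watchdog-timer model: software must "kick" (reset) the counter
# before it reaches its timeout, or the hardware asserts a system reset.

class Watchdog:
    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.count = 0
        self.reset_asserted = False

    def kick(self):
        """Called periodically by (healthy) system software."""
        self.count = 0

    def tick(self):
        """Advance one clock tick; fire a reset on rollover."""
        self.count += 1
        if self.count >= self.timeout:
            self.reset_asserted = True  # recover the system to a known state
            self.count = 0

wd = Watchdog(timeout_ticks=1000)
for t in range(5000):
    wd.tick()
    if t % 100 == 0:
        wd.kick()            # healthy software resets the counter in time
assert not wd.reset_asserted

for t in range(1000):        # simulate a hang: nothing kicks the watchdog
    wd.tick()
assert wd.reset_asserted     # rollover detected; reset signal generated
```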
4. How are vectored and autovectored interrupts similar and how are they different? Can
they be used in the same system? Why or why not? What are their advantages and
disadvantages vs. nonvectored interrupts?
Vectored and autovectored interrupts are similar in that both use an
interrupt number to index into a table that contains the addresses of the various
interrupt service routines. (The value read from the table is loaded into the
program counter and execution of the service routine begins from that point.) The
only difference between the two techniques is that hardware devices provide their
own interrupt numbers (via the system bus during the interrupt acknowledge cycle)
in a system with vectored interrupts, while an autovectoring scheme uses interrupt
numbers internally generated by the CPU based on the priority level of an incoming
interrupt.
Yes, both of these techniques can be used in the same system; the Motorola
680x0 family of CPUs is a prime example. The “smarter” devices can provide their
own vectors, while a special hardware signal or timeout mechanism can alert the
CPU to the need to generate autovectors for other devices. While nonvectored
interrupts are slightly easier to implement in hardware as compared to vectored or
autovectored interrupts, the additional complexity required of the software (to
identify the source of each interrupt and execute the correct code to handle it), the
corresponding increase in the latency to service an interrupt, and the limitations it
places on the design of the memory system are significant drawbacks.
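The shared dispatch mechanism can be sketched as a table lookup. The vector numbers and device handlers below are illustrative, not drawn from any particular CPU:

```python
# Sketch of interrupt vectoring: the interrupt number indexes a table of
# handler addresses; the value fetched becomes the new program counter.

def timer_handler():   return "serviced timer"
def disk_handler():    return "serviced disk"
def uart_handler():    return "serviced uart"

vector_table = {        # built by the OS at boot time
    0x40: timer_handler,
    0x41: disk_handler,
    0x42: uart_handler,
}

def dispatch(vector_number):
    # Vectored: the device supplied vector_number on the bus during the
    # interrupt-acknowledge cycle. Autovectored: the CPU generated it
    # internally from the interrupt's priority level. Either way, the
    # dispatch step itself is identical.
    handler = vector_table[vector_number]
    return handler()    # "load the PC" with the table entry and run

print(dispatch(0x41))
```

Only the origin of the vector number differs between the two schemes, which is why both can coexist in one system.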
5. Given the need for user programs to access operating system services, why are traps a
better solution than conventional subprogram call instructions?
The main “problem” with a typical subprogram call instruction is that it
generally requires a target address that is explicitly specified using one of the
machine’s addressing modes; that is to say, we must know where a given routine
resides in memory in order to call it. While this is normally not a problem for user-
written code or procedures in a link library, we often do not know the location of
routines that are part of the operating system. Their location may vary from one
specific system or OS version to another. Also, code that performs system functions
such as I/O usually needs to be executed at a system privilege level, while called
procedures normally execute with the same privilege level as the code that called
them. Traps, since they make use of the same vectoring mechanism as interrupts or
other exceptions, allow OS routines to be accessed implicitly, without the
programmer having to know the exact location of the code he or she wishes to
execute. By executing a specific trapping instruction, the desired routine can be
executed at a system privilege level with control returning (at the proper privilege
level) to the user program that called it.
6. Compare and contrast program-controlled I/O, interrupt-driven I/O, and DMA-based I/O.
What are the advantages and disadvantages of each? Describe scenarios that would favor
each particular approach over the others.
In a system with program-controlled I/O, the CPU executes code to poll the
various hardware devices to see when they require service, then executes more code
to carry out the data transfers. This is the simplest way to handle I/O, requiring no
extra hardware support; but the need for the CPU to spend time polling devices
complicates the software and detracts from system performance. This approach
would only be favored in a very low-cost, embedded system where the CPU is not
doing much other than I/O and the goal is to keep the hardware as simple and
inexpensive as possible.
In a system with interrupt-driven I/O, the devices use hardware interrupt
request lines to notify the CPU when they need servicing. The CPU then executes
instructions to transfer the data (as it would in a system using program-controlled
I/O). This approach doesn’t eliminate the need for CPU involvement in moving
data and also involves a bit more hardware complexity, but support for interrupt
processing is already built into virtually every microprocessor so the additional cost
is minimal. The upside of this technique is that the CPU never has to waste time
polling devices. System software is simplified by having separate interrupt service
routines for each I/O device, and devices are typically serviced with less latency than
if interrupts were not used. This approach is good for many systems, especially
general-purpose machines that have a wide variety of I/O devices with different
speeds, data transfer volumes, and other characteristics.
DMA-based I/O is carried out by a hardware DMA controller that is
separate from the system CPU. When the CPU determines (often by receiving an
interrupt) that a transfer of data to or from an I/O device needs to take place, it
initializes the DMA controller with the particulars of the transfer; the DMA
controller then carries out the operation, transferring data directly between the
chosen device and memory, without further intervention by the CPU. This
approach has the highest hardware cost of the three, since it requires an extra
system component; it also requires the overhead of the CPU having to set up the
DMA controller for each transfer. However, DMA is very efficient, especially when
large blocks of data are frequently transferred. Its use would be favored in a
general-purpose or (especially) a high-performance system with high-speed devices
that can benefit significantly from large block I/O operations.
7. Systems with “separate I/O” have a second address space for I/O devices as opposed to
memory and also a separate category of instructions for doing I/O operations as opposed
to memory data transfers. What are the advantages and disadvantages of this method of
handling I/O? Name and describe an alternative strategy and discuss how it exhibits a
different set of pros and cons.
Separate I/O has the advantage of a unique address space for I/O devices;
because of this, there are no “holes” in the main memory address space where I/O
device interface registers have been decoded. The full physical memory address
space is available for use by memory devices. Also, I/O operations are easily
distinguished from memory operations by their use of different machine language
instructions. On the other hand, hardware complexity (and possibly cost) is
increased slightly and the additional instructions required for I/O make the
instruction set architecture a bit more complex.
The alternative, memory-mapped I/O, shares a single physical address space
between memory and I/O devices. This keeps the hardware and instruction set
simpler while sacrificing the distinct functionality of I/O instructions as well as the
complete, contiguous address space that would otherwise be available to memory.
Given the widespread use of virtual memory in all but the simplest of systems, the
pros and cons of either approach are not as noteworthy as they once were and either
approach can be made to work well.
8. Given that many systems have a single bus which can be controlled by only one bus
master at a time (and thus the CPU cannot use the bus for other activities during I/O
transfers), explain how a system that uses DMA for I/O can outperform one in which all
I/O is done by the CPU.
On the face of it, it would seem that DMA I/O would provide little or no
advantage in such a system, since only one data transfer can occur at a time
regardless of whether the CPU or DMAC is initiating it. However, DMA still has a
considerable advantage for a couple of important reasons. One of these is that, due
to the widespread use of on-chip instruction and data cache, it is likely that the CPU
can continue to execute code for some time (in parallel with I/O activities) even
without any use of the system bus. The second reason is that even if the CPU “stalls
out” for lack of ability to access code or data in main memory, the I/O operation
itself is done more efficiently than it would be if the CPU performed it. Instead of
reading a value from a buffer in memory and then writing it to an I/O device
interface (or vice versa), the CPU (which would be the middleman in the
transaction) gets out of the way and the two transactions are replaced with one
direct data transfer between memory and the device in question.
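The second reason can be made quantitative with an idealized bus-transaction count. The setup-cost figure below is an assumption for illustration (a handful of writes to program the DMAC), not a measured value:

```python
# Idealized bus-transaction count for moving N words between memory and
# an I/O device. CPU-mediated I/O: read from memory + write to the device
# (or vice versa) = 2 transfers per word. DMA: one direct transfer per
# word, after a small fixed setup cost (assumed here to be 4 writes).

def bus_transfers_cpu(n_words):
    return 2 * n_words           # the CPU is the middleman on every word

def bus_transfers_dma(n_words, setup_cost=4):
    return setup_cost + n_words  # program the DMAC, then 1 transfer/word

n = 4096
print(bus_transfers_cpu(n), bus_transfers_dma(n))  # 8192 vs 4100
```

For large blocks the DMA transfer needs roughly half the bus transactions, which is why DMA wins even on a single-master bus.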
9. Compare and contrast the channel processors used in IBM mainframes with the PPUs
used in CDC systems.
The channel processors used in IBM mainframes were simple von Neumann
machines with their own program counters, register sets, (simpler) instruction set
architecture, etc. They communicated with the main system processor(s) by reading
and writing a shared area of main memory. CDC’s Peripheral Processing Units
were complete computers dedicated to I/O operations. The PPUs had their own
separate memory and were architecturally similar to the main system processor
(although they lacked certain capabilities, such as hardware support for floating-
point arithmetic, that were not useful for I/O). In addition to controlling I/O devices
they performed other operations such as buffering, checking, formatting, and
translating data.
10. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Exception - a synchronous or asynchronous event that occurs, requiring the attention of
the CPU to take some action
Service routine (handler) - a special program that is run in order to service a device,
take care of some error condition, or respond to an unusual event
Stack - when an interrupt is accepted by a typical CPU, critical processor status
information is usually saved here
Non-maskable interrupt - the highest priority interrupt in a system; one that will never
be ignored by the CPU
Reset - a signal that causes the CPU to reinitialize itself and/or its peripherals so that the
system starts from a known state
Vectoring - the process of identifying the source of an interrupt and locating the service
routine associated with it
Vectored interrupt - when this occurs, the device in question places a number on the bus
which is read by the processor in order to determine which handler should be executed
Trap - another name for a software interrupt, this is a synchronous event occurring inside
the CPU because of program activity
Abort - on some systems, the “Blue Screen Of Death” can result from this type of
software-related exception
Device interface registers - these are mapped in a system’s I/O address space; they
allow data and/or control information to be transferred between the system bus and an I/O
device
Memory-mapped I/O - a technique that features a single, common address space for
both I/O devices and main memory
Bus master - any device that is capable of initiating transfers of data over the system bus
by providing the necessary address, control, and/or timing signals
Direct Memory Access Controller (DMAC) - a hardware device that is capable of
carrying out I/O activities after being initialized with certain parameters by the CPU
Burst mode DMA - a method of handling I/O where the DMAC takes over exclusive
control of the system bus and performs an entire block transfer in one operation
Input/Output Processor (IOP) (also known as Peripheral Processor or Front-End
Processor) - an independent, programmable processor that is used in some systems to
offload input and output activities from the main CPU
6 Parallel and High-Performance Systems
1. Discuss at least three distinguishing factors that can be used to differentiate among
parallel computer systems. Why do systems vary so widely with respect to these factors?
Some of the main factors that differ from one parallel system to another are
the number and type of processors used and the way in which they are connected to
each other. Some parallel systems use shared main memory for communication
between processors, while others use a message-passing paradigm. With either of
these approaches, a wide variety of networks can be used to facilitate the exchange
of information. The main reason why parallel systems exhibit such a wide range of
characteristics is probably because there is such a wide variety of applications. The
characteristics of the intended applications drive the characteristics of the machines
that are built to run them.
2. Michael Flynn defined the terms SISD, SIMD, MISD, and MIMD to represent certain
classes of computer architectures that have been built or at least considered. Tell what
each of these abbreviations stands for; describe the general characteristics of each of
these architectures; and explain how they are similar to and different from one another. If
possible, give an example of a specific computer system fitting each of Flynn’s
classifications.
SISD stands for Single Instruction stream, Single Data stream – this is a
single-processor system such as a typical desktop or notebook PC or workstation.
Generally, such systems are built around a processor with a conventional von
Neumann (Princeton) or Harvard architecture. SIMD stands for Single Instruction
stream, Multiple Data stream. SIMD systems are commonly known as array
processors because they execute the same operation on a large collection of operands
at the same time. The control unit of a SIMD machine is much like that of a SISD machine,
but it controls a number of processing elements simultaneously. Examples of SIMD
computers include the ILLIAC IV, the Connection Machine, etc. MISD (Multiple
Instruction stream, Single Data stream) machines would have carried out multiple
algorithms on the same data sets. Such machines, while conceptually possible, have
yet to be developed. MIMD is an acronym for Multiple Instruction stream, Multiple
Data stream. This classification of machines encompasses the vast majority of
parallel systems, which consist of multiple CPUs (each with a Princeton or Harvard
architecture like the one in a typical SISD system) connected together by some type
of communications network. Examples of MIMD computers include the Silicon
Graphics Origin series, the Cray T3E, and any “Beowulf” class cluster system.
3. What is the main difference between a vector computer and the scalar architectures that
we studied in Chapters 3 and 4? Do vector machines tend to have a high, or low, degree
of generality as defined in Section 1.4? What types of applications take best advantage of
the properties of vector machines?
The main difference between a vector computer and conventional scalar
architectures is the fact that instructions executed by vector processors operate not
on individual values or pairs of values, but on vectors (one-dimensional arrays of
values). In a scalar machine the ADD instruction adds two numbers to produce
their sum; in a vector processor the ADD instruction adds each element of one set of
numbers to the corresponding element of a second set, producing a corresponding
set of results. This is usually accomplished with one or more deeply pipelined
execution units through which vector elements are fed in succession.
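The distinction can be sketched in a few lines of Python (an illustrative model of the semantics only, not of any particular instruction set):

```python
# Semantics of a scalar ADD vs. a vector ADD (illustrative model only).

def scalar_add(a, b):
    """Scalar ADD: two operands in, one sum out."""
    return a + b

def vector_add(va, vb):
    """Vector ADD: corresponding elements of two vectors are summed.
    In hardware, the elements would stream through a pipelined adder."""
    return [a + b for a, b in zip(va, vb)]

print(scalar_add(3, 4))                      # 7
print(vector_add([1, 2, 3], [10, 20, 30]))   # [11, 22, 33]
```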
Because of their unique construction, vector machines have a very low degree
of generality. They are extremely well suited to certain applications, particularly
scientific and engineering applications like weather forecasting, CFD simulations,
etc. that do a great deal of “number crunching” on vectors or arrays of data.
However, they offer little to no advantage when running office applications or any
type of scalar code. While vector processors are more powerful now than they have
ever been, they are not as popular in the overall supercomputer market as they once
were because they are useful for a relatively narrow range of specialized
applications. Cluster computers based on RISC and superscalar microprocessors
are now more popular since they tend to be less expensive (per MIP or MFLOP)
and can run a wider range of applications efficiently.
4. How are array processors similar to vector processors and how are they different?
Explain the difference between fine-grained and coarse-grained array processors. Which
type of array parallelism is more widely used in today’s computer systems? Why?
Array processors are similar to vector processors in that a single machine
instruction causes a particular computation to be carried out on a large set of
operands. They are different in their construction: array processors use a number
of relatively simple processing elements (spatial parallelism) while vector processors
generally employ a small number of deeply pipelined processing units (temporal
parallelism). Fine-grained array processors consist of a very large number of
extremely simple processing elements, while coarse-grained array processors have a
few (but usually somewhat more capable) processing elements. Coarse-grained
array parallelism is more widely used in today’s computers, particularly in the
multimedia accelerators that have been added to many popular microprocessor
families. This is probably because a coarse-grained SIMD is useful in a wider range
of applications than a fine-grained array processor would be.
5. Explain the difference between multiprocessor and multicomputer systems. Which of
these architectures is more prevalent among massively parallel MIMD systems? Why?
Which architecture is easier to understand (for programmers familiar with the
uniprocessor model)? Why?
Multiprocessors are systems in which the CPUs communicate by sharing
main memory locations, while multicomputers are systems in which each CPU has
its own, local memory and communication is accomplished by passing messages over
a network. Most massively parallel MIMD systems are multicomputers because
sharing main memory among large numbers of processors is difficult and expensive.
Multiprocessors are easier for most programmers to work with because the shared
memory model allows communication to be done using the same approaches that
are used in systems with a single CPU. Multicomputers, on the other hand, must
make use of message passing – a more "artificial" and counterintuitive paradigm
for communication.
6. Explain the similarities and differences between UMA, NUMA, and COMA
multiprocessors.
All three of these architectural classifications refer to machines with shared
main memory. Any location in memory can be read or written by any processor in
the system; this is how processes running on various processors communicate with
each other. In a system with UMA (Uniform Memory Access), any memory location
can be read or written by any CPU in the same amount of time (unless the memory
module in question is already busy). This is a desirable property, but the hardware
required to accomplish it does not scale economically to large systems.
Multiprocessors with many CPUs tend to use the NUMA (Non-Uniform
Memory Access) or the more recently developed COMA (Cache-Only Memory
Architecture) approaches. NUMA systems use a modular interconnection scheme in
which memory modules are directly connected to some CPUs but only indirectly
connected to others. This is more cost-effective for larger multiprocessors, but
access time is variable (remote modules take longer to read or write than local
modules) and thus code/data placement must be “tuned” to the specific
characteristics of the memory system (a non-trivial exercise) for best performance.
In a COMA system, the entire main memory space is treated as a cache; all
addresses represent tags rather than physical locations. Items in memory can be
migrated and/or replicated dynamically so they are nearer to where they are most
needed. This experimental approach requires even more hardware support than
the other two, and is therefore more expensive to implement, but it has the potential
to make larger multiprocessors behave more like SMPs and thus perform well
without the software having to be tuned to a particular hardware configuration.
7. What does “cache coherence” mean? In what type of computer system would cache
coherence be an issue? Is a write-through strategy sufficient to maintain cache coherence
in such a system? If so, explain why. If not, explain why not and name and describe an
approach that could be used to ensure coherence.
Cache coherence means that every CPU in the system sees the same view of
memory (“looking through” its cache(s) to the main memory). In a coherent system,
any CPU should get the same value as any other when it reads a shared location,
and this value should reflect updates made by this or any other processor. A write-
through strategy is not sufficient to ensure this, because updating the contents of
main memory is not enough to make sure the other caches’ contents are consistent
with the updated memory. Even if main memory contains the updated value, the
other caches might have previously loaded that refill line and thus might still
contain an old value.
To ensure a coherent view of memory across the whole machine, copies of a
line that has been modified (written) need to be updated in other caches (so that
they immediately receive the new data) or invalidated by them (so that they will
miss if they try to access the old data and update it at that point by reading the new
value from main memory). The write-update or write-invalidate operations can be
accomplished by implementing a snoopy protocol (typical of smaller, SMP systems)
in which caches monitor a common interconnection network, such as a bus, to detect
writes to cached locations. In larger multiprocessor systems such as NUMA
machines where bus snooping is not practical, a directory protocol (in which caches
notify one or more centralized controllers of relevant transactions and, in turn,
receive notifications of other caches’ write operations) is often used.
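A toy Python model (the class and method names are hypothetical, and this is far from any real protocol's full state machine) illustrates the write-invalidate idea:

```python
# Toy model of a write-invalidate snoopy protocol over a shared bus.
# Each cache holds {address: value}; a write by one CPU is "broadcast" on
# the bus, and every other cache invalidates (drops) its copy of that line.

class SnoopyCache:
    def __init__(self, memory, bus):
        self.lines = {}
        self.memory = memory
        self.bus = bus
        bus.append(self)                      # join the set of snooping caches

    def read(self, addr):
        if addr not in self.lines:            # miss: refill from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value             # write-through to main memory
        for other in self.bus:                # snoop: invalidate other copies
            if other is not self:
                other.lines.pop(addr, None)

memory = {0x10: 5}
bus = []
c0, c1 = SnoopyCache(memory, bus), SnoopyCache(memory, bus)
c1.read(0x10)         # c1 caches the old value, 5
c0.write(0x10, 7)     # c0's write invalidates c1's copy
print(c1.read(0x10))  # c1 misses and re-reads the new value: 7
```

A write-update variant would instead push the new value into the other caches rather than dropping their copies.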
8. What are the relative advantages and disadvantages of write-update and write-invalidate
snoopy protocols?
A write-invalidate snoopy protocol is simpler to implement and uses less bus
bandwidth since there is no need for other caches to load modified data. It works
well when data are lightly shared, but not as well when data are heavily shared since
the hit ratio is usually lower. A write-update snoopy protocol keeps the hit ratio
higher for heavily shared data and works well when reads and writes alternate, but
it is more complex to implement and may use more bus bandwidth (which can be a
limiting factor if several processors are sharing a common bus).
9. What are directory-based protocols and why are they often used in CC-NUMA systems?
Directory-based protocols are cache coherence schemes that do not rely on
the “snooping” of a common bus or other single interconnection between CPUs in a
multiprocessor system. (While snooping is feasible in most SMP systems, it is not so
easily accomplished in larger NUMA architectures with distributed shared memory
using a number of local and system-wide interconnections.) Communications with
the system directory, which is a hardware database that maintains all the
information necessary to ensure coherence of the memory system, are done in point-
to-point fashion and thus scale better to a larger system. In very large systems, the
directory itself may be distributed (split into subsets residing in different locations)
to further enhance scalability.
10. Explain why synchronization primitives based on mutual exclusion are important in
multiprocessors. What is a read-modify-write cycle and why is it significant?
A read-modify-write (RMW) cycle is one in which a memory location is read,
its contents are modified by a processor, and the new value is written back into the
same memory location in indivisible, “atomic” fashion. This type of operation is
important in accessing mutual exclusion primitives such as semaphores. The RMW
cycle protects the semaphore test/update operation from being interrupted by any
other process; if such an interruption did occur, this could lead to a lack of mutual
exclusion on a shared resource, which in turn could cause incorrect operation of the
program.
11. Describe the construction of a “Beowulf cluster” system. Architecturally speaking, how
would you classify such a system? Explain.
A Beowulf-type cluster is a parallel computer system made up of a number of
inexpensive, commodity computers (often generic Intel-compatible PCs) networked
together, usually with off-the-shelf components such as 100 megabit/s or 1 gigabit/s
Ethernet. The operating system is often an open-source package such as Linux.
The idea is to aggregate a considerable amount of computational power as
inexpensively as possible. Because they are composed of multiple, complete
computer systems and communicate via message passing over a network, Beowulf
clusters are considered multicomputers (LM-MIMD systems).
12. Describe the similarities and differences between circuit-switched networks and packet-
switched communications networks. Which of these network types is considered “static”
and which is “dynamic”? Which type is more likely to be centrally controlled and which
is more likely to use distributed control? Which is more likely to use asynchronous
timing and which is more likely to be synchronous?
Circuit-switched and packet-switched networks are both used to facilitate
communications in parallel (SIMD and MIMD) computer systems. Both allow
connections to be made for the transfer of data between nodes. They are different in
that circuit-switched networks actually make and break physical connections to
allow various pairs of nodes to communicate, while packet-switched networks
maintain the same physical connections all the time and use a routing protocol to
guide message packets (containing data) to their destinations.
Because physical connections are not changed to facilitate communication,
packet-switched networks are said to be static, while circuit-switched networks
dynamically reconfigure hardware connections. Packet-switched networks
generally exhibit distributed control and are most likely to be asynchronous in their
timing. Circuit-switched networks are more likely to be synchronous (though some
are asynchronous) and more commonly use a centralized control strategy.
13. What type of interconnection structure is used most often in small systems? Describe it
and discuss its advantages and disadvantages.
Small systems often use a single bus as an interconnection. This is a set of
address, data, and control/timing signals connected to all components in the system
to allow information to be transferred between them. The principal advantage of a
bus-based system is simplicity of hardware (and therefore low cost). Its main
disadvantage is limited performance; only one transaction (read or write operation)
can take place at a time. In a system with only one or a few bus masters (such as
CPUs, IOPs, DMACs, etc.) there will typically be little contention for use of the bus
and performance will not be compromised too much; in larger systems, however,
there will often be a desire for multiple simultaneous data transfers which, of
course, cannot happen. Thus one or more bus masters will have to wait to use the
bus and overall performance will suffer to some degree.
14. Describe the operation of a static network with a “star” topology. What connection
degree do its nodes have? What is its communication diameter? Discuss the advantages
and disadvantages of this topology.
A star network has all the computational and/or memory nodes connected to
a single, central communications node or “hub”. All the nodes except the hub have
a connection degree of one (the hub is of degree n if there are n other nodes). Such a
network has a communication diameter of two (one hop from the source to the hub,
one hop from the hub to the destination). Advantages include the network’s small
communication diameter (2) and the fact that the communication distance is the
same for all messages sent across the network. Another advantage is that the
network structure is simple and it is easy to add additional nodes if system
expansion is desired. The main disadvantage of a star network is that all
communications must pass through the hub. Because of this, the hub may become
saturated with traffic, thus becoming a bottleneck and limiting system performance.
15. How are torus and Illiac networks similar to a two-dimensional nearest-neighbor mesh?
How are they different?
Both torus and Illiac networks are similar to a two-dimensional nearest-
neighbor mesh in that their nodes are of degree four (they are connected to
neighbors in each of the x and y directions; in other words, to a “north”, “south”,
“east”, and “west” neighbor). This is true of the interior nodes in a nearest-
neighbor mesh as well. The differences between these three network topologies lie
in how the “edge” and “corner” nodes are connected. In a basic nearest-neighbor
mesh, the nodes on the side and top/bottom edges lack one of the possible
connections and thus have a connection degree of only three; the corner nodes lack
one connection in the x direction and one in the y direction and therefore are of
degree two.
In a torus network, the edge nodes have a “wrap-around” connection to the
node in the same row or column on the opposite edge (corner nodes have wrap-
around connections in both dimensions); therefore all nodes in the network have a
connection degree of four. The Illiac network has the same configuration, except
that the rightmost node in each row has a wrap-around connection to the leftmost
node in the next (rather than the same) row, with the rightmost node in the last row
being connected to the leftmost node in the first row.
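The three wiring schemes can be summarized by how the "east" neighbor of a node is computed (an illustrative sketch; nodes are identified by row and column on an assumed n × n grid):

```python
# The "east" neighbor of node (row, col) in each topology on an n x n grid.

def east_mesh(row, col, n):
    """Plain nearest-neighbor mesh: edge nodes have no east neighbor."""
    return (row, col + 1) if col + 1 < n else None

def east_torus(row, col, n):
    """Torus: wrap around to column 0 of the SAME row."""
    return (row, (col + 1) % n)

def east_illiac(row, col, n):
    """Illiac: the rightmost node wraps to column 0 of the NEXT row
    (and the last row wraps to the first row)."""
    if col + 1 < n:
        return (row, col + 1)
    return ((row + 1) % n, 0)

n = 4
print(east_mesh(1, 3, n))    # None: a mesh edge node lacks the link
print(east_torus(1, 3, n))   # (1, 0): same row
print(east_illiac(1, 3, n))  # (2, 0): next row
```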
16. Consider a message-passing multicomputer system with 16 computing nodes.
(a) Draw the node connections for the following connection topologies: linear array,
ring, two-dimensional rectangular nearest-neighbor mesh (without edge
connections), binary n-cube.
(The drawings will be similar to the appropriate figures in Chapter 6.)
(b) What is the connection degree for the nodes in each of the above interconnection
networks?
Linear array: 2 (1 for end nodes). Ring: 2 (all nodes). 2-D mesh: 4 (3 on
edges, 2 in corners). Binary n-cube: 4 (all nodes).
(c) What is the communication diameter for each of the above networks?
Linear array: 15. Ring: 8. 2-D mesh: 6. Binary n-cube: 4.
(d) How do these four networks compare in terms of cost, fault tolerance, and speed
of communications? (For each of these criteria, rank them in order from most
desirable to least desirable.)
Cost: linear array, ring, 2-D mesh, n-cube. Fault tolerance: n-cube, 2-D
mesh, ring, linear array. Speed: n-cube, 2-D mesh, ring, linear array.
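The diameters given in part (c) can be checked by breadth-first search over each 16-node topology (an illustrative Python sketch):

```python
# Verify the communication diameters of the four 16-node networks:
# linear array, ring, 4 x 4 mesh, and 4-dimensional binary cube.

from collections import deque

def diameter(neighbors, n):
    """Longest shortest path over all start nodes, by BFS."""
    best = 0
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

n = 16

def linear(u):
    return [v for v in (u - 1, u + 1) if 0 <= v < n]

def ring(u):
    return [(u - 1) % n, (u + 1) % n]

def mesh(u):        # 4 x 4 grid: row = u // 4, column = u % 4
    return ([u + d for d in (-1, 1) if 0 <= u + d < n and (u + d) // 4 == u // 4]
            + [u + d for d in (-4, 4) if 0 <= u + d < n])

def ncube(u):       # flip each of the four address bits
    return [u ^ (1 << b) for b in range(4)]

print(diameter(linear, n), diameter(ring, n),
      diameter(mesh, n), diameter(ncube, n))   # 15 8 6 4
```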
17. Describe, compare, and contrast store-and-forward routing with wormhole routing.
Which of these approaches is better suited to implementing communications over a static
network with a large number of nodes? Why?
Both wormhole and store-and-forward routing are methods for transmitting
message packets across a static network. Store-and-forward routing treats each
message packet as a unit, transferring the entire packet from one node in the
routing path to the next before beginning to transfer it to a subsequent node. This is
a simple but not very efficient way to transfer messages across the network; it can
cause significant latency for messages that must traverse several nodes to reach
their destinations. Wormhole routing divides the message packets into smaller
pieces called flits; as soon as an individual flit is received by an intermediate node
along the routing path, it is sent on to the next node without waiting for the entire
packet to be assembled. This effectively pipelines, and thus speeds up, the
transmission of messages between remote nodes. A network with a large number of
nodes is likely to have a large communication diameter, with many messages
requiring several “hops” to reach their destinations, and thus is likely to benefit
considerably from the use of wormhole routing.
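A simple latency model makes the difference concrete (the parameters are assumed: a packet of L flits crossing H hops, one flit per link per cycle, no contention):

```python
# First-order latency comparison of the two routing methods.

def store_and_forward_cycles(L, H):
    """Each intermediate node receives the whole packet before forwarding,
    so the full packet transfer time is paid at every hop."""
    return L * H

def wormhole_cycles(L, H):
    """Flits are pipelined: the header flit takes H cycles to reach the
    destination, and the remaining L - 1 flits stream in behind it."""
    return H + (L - 1)

L, H = 32, 8
print(store_and_forward_cycles(L, H))  # 256
print(wormhole_cycles(L, H))           # 39
```

Note that the advantage grows with both packet length and hop count, which is why large-diameter networks benefit most.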
18. In what type of system would one most likely encounter a full crossbar switch
interconnection? Why is this type of network not usually found in larger (measured by
number of nodes) systems?
A full crossbar switch would most likely be found in a high-performance
symmetric multiprocessor (SMP) system. Such a network would probably not be
used in a system with many nodes because its cost and complexity increase
proportionally to the square of the number of nodes connected. (In other
words, its cost is O(n²).)
19. Consider the different types of dynamic networks discussed in this chapter. Explain the
difference between a blocking network and a non-blocking network. Explain how a
rearrangeable network compares to these other two dynamic network types. Give an
example of each.
In a blocking network such as the Omega network, any node on one side of
the network can be connected to any node on the other side. However, creating one
connection across the network prevents (blocks) certain other pairs of nodes from
being connected as long as the first connection exists. Only certain subsets of
connections can exist simultaneously. In a non-blocking network such as a full
crossbar switch, any node on one side of the network can be connected to any node
on the other side, and this connection does not interfere with the establishment of a
connection between any other (idle) nodes.
A rearrangeable network such as a Benes network represents a middle
ground (in functionality, complexity, and expense) between the previous two
alternatives. It is similar in structure to a blocking network, but adds redundancy
in the form of additional stages and/or connections. The redundancy allows for
multiple possible paths connecting any two given nodes. While any particular
connection across the network does block certain other connections, an established
connection can always be rerouted along one of the alternate paths (a.k.a.
rearranged) in order to allow another desired connection to be made.
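The blocking behavior of an Omega network can be demonstrated with a small destination-tag routing model (a toy sketch for 8 inputs; the function names are illustrative):

```python
# Toy Omega network model: each of log2(n) stages is a perfect shuffle
# followed by a 2x2 exchange switch whose output is chosen by one bit of
# the destination address (MSB first). Two messages that need the same
# switch output line at the same stage conflict: the network blocks.

def omega_path(src, dst, bits=3):
    """Return the (stage, line) links used by a message from src to dst."""
    line = src
    path = []
    mask = (1 << bits) - 1
    for stage in range(bits):
        # perfect shuffle: rotate the line number left by one bit
        line = ((line << 1) | (line >> (bits - 1))) & mask
        # exchange: set the low bit from the destination's next bit
        line = (line & ~1) | ((dst >> (bits - 1 - stage)) & 1)
        path.append((stage, line))
    return path

def conflicts(pairs):
    """True if the requested (src, dst) connections cannot coexist."""
    used = set()
    for src, dst in pairs:
        for link in omega_path(src, dst):
            if link in used:
                return True
            used.add(link)
    return False

print(conflicts([(0, 0), (4, 1)]))  # True: these two requests block
print(conflicts([(0, 0), (1, 1)]))  # False: this pair can coexist
```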
20. Choose the best answer to each of the following questions:
(a) Which of the following is not a method for ensuring cache coherence in a
multiprocessor system? (1) write-update snoopy cache; (2) write-through cache; (3)
write-invalidate snoopy cache; (4) full-map directory protocol
(b) In a 16-node system, which of these networks would have the smallest
communication diameter? (1) n-cube; (2) two-dimensional nearest-neighbor mesh; (3)
ring; (4) torus (tie; both have diameter = 4)
(c) Which of the following is a rearrangeable network? (1) Illiac network; (2) multistage
cube network; (3) crossbar switch; (4) Benes network; (5) none of the above
(d) In a 64-node system, which of the following would have the smallest node connection
degree? (1) ring; (2) two-dimensional nearest-neighbor mesh; (3) Illiac network; (4) n-
cube
21. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Multicomputer (LM-MIMD) - a parallel computer architecture in which there are
several processing nodes, each of which has its own local or private memory modules
Multiprocessor (GM-MIMD) - a parallel computer architecture in which there are
several processing nodes, all of which have access to shared memory modules
Single Instruction stream, Multiple Data stream (SIMD) machine - another name for
an array processor
Symmetric Multiprocessor (SMP) - a relatively small MIMD system in which the
“uniform memory access” property holds
Deadlock - a situation in which messages on a network cannot proceed to their
destinations because of mutual or cyclic blocking
Blocking network - an interconnection network in which any node can be connected to
any node, but some sets of connections are not simultaneously possible
Communication diameter - the maximum number of “hops” required to communicate
across a network
Packet-switched network - multicomputers with many nodes would be interconnected
by this
(Full) Crossbar switch - the classic example of a non-blocking, circuit-switched
interconnection network for multiprocessor systems
Store and forward routing - a method of message passing in which flits do not continue
toward the destination node until the rest of the packet is assembled
Write-update snoopy cache protocol - a method used for ensuring coherence of data
between caches in a multiprocessor system where a write hit by one CPU causes other
processors’ caches to receive a copy of the written value
Flow control digit (flit) - the basic unit of information transfer through the network in a
multicomputer system using wormhole routing
7 Special-Purpose and Future Architectures
1. Explain how a dataflow machine avoids the “von Neumann bottleneck.”
A dataflow machine, unlike one based on a von Neumann architecture, does
not rely on the use of sequential algorithms to guide the processing of data.
Processing is data-driven rather than instruction-driven; it is not inherently
sequential, but instead allows the hardware to exploit any parallelism inherent in
the task. Since there is no need to fetch “instructions” separately from data, the von
Neumann bottleneck described in Chapter 1 does not come into play.
2. Draw a dataflow graph and an activity template for the following programming construct:
if (x >= 0) { z = (x + y) * 4; } else { z = (y - x) * 4; }
The dataflow graph will be similar to Figure 7.2 except for the details of the
operations. Likewise, the activity template will be similar to Figure 7.3.
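The construct's data-driven evaluation can be sketched in Python, with comments marking where each node of the graph would fire (the sequencing here is only an artifact of the host language; in a dataflow machine the two branch operations could fire in parallel as soon as their operand tokens arrive):

```python
# Token-driven sketch of: if (x >= 0) z = (x + y) * 4; else z = (y - x) * 4;

def run_dataflow(x, y):
    p = x >= 0                  # predicate node fires on token x
    add = x + y                 # both branch nodes can fire in parallel...
    sub = y - x
    chosen = add if p else sub  # ...and merge selects by the predicate token
    return chosen * 4           # multiply fires when the merged token arrives

print(run_dataflow(3, 2))   # (3 + 2) * 4 = 20
print(run_dataflow(-1, 2))  # (2 - (-1)) * 4 = 12
```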
3. If you had a scientific application that involved a large number of matrix manipulations,
would you rather run it on a dataflow computer or a SIMD computer? Explain.
It would probably be better to run such an application on a SIMD computer,
since that type of system is designed to optimize performance on array processing
(matrices are handled as two-dimensional arrays). Dataflow computers do
reasonably well with “unstructured” parallelism but are not particularly good at
exploiting array-type parallelism.
4. What do you think is the main reason why dataflow computers have so far not been
widely adopted?
There are several legitimate reasons why dataflow computers have not
reached the mainstream. One is their reliance on specialized programming
languages to express the constructs represented in a dataflow graph or activity
template. Machines that do not use standard programming languages tend to have
a higher software development cost. Also, dataflow machines do not perform
particularly well (at least, not well enough to justify their cost) on many common
applications. Unless the parallelism inherent to the task is a good match for the
parallelism of the machine hardware, performance gains will be modest at best.
Finally, dataflow machines are not easy or cheap to build and do not take much
advantage of the locality of reference that is essential to the function of hierarchical
memory systems.
5. Give an example of how dataflow techniques have influenced and/or been used in
conventional computer design.
Dataflow techniques were a part of the control strategy developed by Robert
Tomasulo for the IBM 360/91 computer in the 1960s. While the machine was
programmed like a traditional von Neumann computer, internally its hardware
execution units were scheduled using a dataflow approach: an operation was sent to
a functional unit once its operands were available. Some modern, superscalar
microprocessors still use Tomasulo’s method (or variations on it) and thus bear the
influence of dataflow computing.
6. Are superthreaded and hyper-threaded processors the same thing? If not, how do they
differ?
Superthreaded and hyper-threaded processors are close cousins, but not
identical. Superscalar machines that use superthreading (also called time-slice
multithreading) can issue multiple instructions belonging to one process (or thread)
during a given clock cycle; during a different clock cycle, instructions belonging to
another process can be issued. Effectively, use of the CPU is time-multiplexed on a
cycle by cycle basis. Hyper-threading (or simultaneous multithreading) takes this
concept one step further: during a given clock cycle, instructions from more than
one process may be issued in order to make the maximum possible use of CPU
resources.
7. Would you classify an artificial neural network as an SISD, SIMD, MISD, or MIMD
system, or something else? Make a case to support your choice.
The best answer is probably “something else” – ANNs represent a unique
class of architectures with their own special characteristics. Artificial neural
networks really do not fit the description of any of the systems described in Flynn’s
taxonomy of computer systems. If one had to try to pigeonhole them into one of his
four classifications, they could be said to at least resemble, in some ways, MIMD or
even MISD machines. (ANNs are clearly not SISD or SIMD architectures because
they lack a single instruction stream.)
8. Explain how the processing elements and interconnections in an artificial neural network
relate to the structure of the human nervous system.
The many, simple processing elements in an artificial neural network
correspond to the many neurons in the human body. Each processing element, like
each real neuron, accepts several inputs and computes the weighted sum of those
inputs. This sum is applied to an activation function that simulates the action
potential threshold of a biological neuron. The activation function determines
whether a given processing element will send an output to the input of another
processing element (simulated neuron).
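A minimal sketch of one such processing element follows (the weights and threshold are assumed values chosen for illustration):

```python
# One processing element: a weighted sum of inputs passed through a simple
# threshold activation function (a stand-in for the biological neuron's
# action-potential threshold).

def neuron(inputs, weights, threshold=0.0):
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0   # fire, or stay quiet

# With these (assumed) weights, a two-input neuron behaves like a logical AND:
print(neuron([1, 1], [0.6, 0.6], threshold=1.0))  # 1
print(neuron([1, 0], [0.6, 0.6], threshold=1.0))  # 0
```

Training an ANN amounts to adjusting weights like these until the whole network produces the desired outputs.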
9. How is a supervised artificial neural network programmed to carry out a particular task?
What is the difference between a supervised vs. unsupervised ANN?
An artificial neural network is not so much programmed, but trained
iteratively to perform a given task. A supervised ANN receives its training via the
user’s repeated applications of inputs, producing outputs that are compared to the
corresponding desired outputs; the neuron weights are adjusted after each pass
until the network “learns” to produce good output for the full range of inputs.
Unsupervised ANNs are used in situations for which feedback is unavailable (no
“known good” output data exists). Instead, they use “competitive learning”
techniques to learn on their own without intervention by a human trainer.
10. Why are ANNs well suited to applications such as robotic control? Give an example of
an application for which you do not think an ANN would be a good choice.
ANNs are a good choice for robotic control because complex motions are
difficult to program algorithmically using traditional programming languages.
Since it is possible to generate examples of the desired functionality and since most
ANNs operate on the principle of “training” a system to produce outputs
corresponding to given examples, they are a natural “fit”. After all, the biological
neural networks of human beings and animals can be trained to produce desired
output, so why shouldn’t artificial neural networks exhibit similar strengths? On
the other hand, applications such as computational fluid dynamics, etc. that require
a great deal of numeric computations (“number crunching”) would probably
perform a lot better on a conventional supercomputer than on an ANN.
11. What is different about logical variables in a fuzzy system as compared to a conventional
computer system?
In a conventional computer system, logical variables are binary in nature.
That is to say, they are either true or false; on or off; 1 or 0; 100% or 0%. In a
fuzzy system, logical values can take on a continuum of truth values between the
limits of 0 and 1, inclusive.
12. Both ANNs and fuzzy logic systems attempt to mimic the way human beings make
decisions. What is the main difference between the two approaches?
The main difference is that artificial neural networks attempt to mimic the
actual structure of the human brain by simulating the functionality of neurons and
the connections that exist between them. Fuzzy logic systems attempt to model the
uncertain, imprecise methods people use to make decisions based on (often
incomplete) available information, but their structure is not based on any biological
model.
13. What is a fuzzy subset and how does the idea of a membership function relate to it?
Propose a simple membership function rich() that deals with the concept of a fuzzy
subset of wealthy people.
A fuzzy subset is a portion of the universe of discourse (the set of all things
under consideration in formulating a given problem), whose membership is not
defined precisely. A membership function expresses the perceived likelihood that a
given member of the universe belongs to a particular fuzzy subset. In other words,
it produces a truth value (in the range of 0 to 1, inclusive) that indicates an object’s
degree of membership in the fuzzy subset. One possible definition of a membership
function rich() would be as follows:
rich(x) = 0, if income(x) < $100K
rich(x) = (income(x) - $100K) / $900K, if $100K ≤ income(x) ≤ $1M
rich(x) = 1, if income(x) > $1M
14. Can the Boolean, or crisp, logic operations AND, OR, and NOT be defined in regard to
fuzzy logic? If so, explain how; if not, explain why not.
Yes, the Boolean functions AND, OR, and NOT correspond to fuzzy
operations. NOT is generally defined such that truth (not x) is equal to 1.0 – truth
(x). Various definitions have been used for the AND and OR functions; the most
common are truth (x AND y) = min (truth (x), truth (y)) and truth (x OR y) = max
(truth (x), truth (y)). Note that if the variables x and y are restricted to only the
discrete values 0 and 1 (as in binary logic systems), these definitions are consistent
with Boolean algebra properties.
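These common definitions are simple enough to state directly in code. The sketch below uses the min/max/complement definitions given above; the example truth values are, of course, arbitrary.

```python
def f_not(x):
    return 1.0 - x            # truth(not x) = 1.0 - truth(x)

def f_and(x, y):
    return min(x, y)          # truth(x AND y) = min(truth(x), truth(y))

def f_or(x, y):
    return max(x, y)          # truth(x OR y) = max(truth(x), truth(y))

# Partial truth values combine sensibly:
warm, humid = 0.7, 0.4
muggy = f_and(warm, humid)    # takes the smaller truth value, 0.4

# Restricted to the crisp values 0 and 1, the same definitions
# reduce to ordinary Boolean AND, OR, and NOT.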
15. Explain, in the context of a fuzzy expert system, what rules are and how they are used.
Rules are statements that reflect the knowledge of a human expert about how
a given system works. They are typically expressed in terms of if-then relationships
between fuzzy output subsets and linguistic variables derived from the inputs. The
rules are used to make inferences about the system based on the fuzzified input data
(that is, the results after the membership functions are applied to the “raw” input
data). The outputs of the various rules that make up the system’s rule base are
combined and used to create a single fuzzy subset for each output variable. These
outputs can then be defuzzified to produce “crisp” outputs if they are needed.
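A toy example may make the pipeline concrete. The sketch below is hypothetical (the temperature membership functions, the two rules, and the singleton fan speeds are all invented for illustration): inputs are fuzzified, each if-then rule fires to the degree its condition is true, and the combined result is defuzzified into a crisp fan speed by a weighted average.

```python
# Fuzzification: hypothetical triangular/ramp membership functions
def warm(t):
    return max(0.0, min((t - 60) / 20.0, (100 - t) / 20.0))

def hot(t):
    return min(max((t - 80) / 20.0, 0.0), 1.0)

def fan_speed(t):
    """Rule base: IF temp is warm THEN speed is medium (50%);
                  IF temp is hot  THEN speed is high  (100%).
    Each rule fires with the truth value of its condition; the
    outputs are combined and defuzzified by weighted average."""
    rules = [(warm(t), 50.0), (hot(t), 100.0)]
    total_weight = sum(w for w, _ in rules)
    if total_weight == 0:
        return 0.0                 # no rule fired; fan stays off
    return sum(w * s for w, s in rules) / total_weight  # crisp output
```

At 90 degrees, for instance, both rules fire partially (each with truth 0.5), yielding a crisp fan speed of 75 percent.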
16. For what type(s) of physical system is fuzzy control particularly well suited?
Fuzzy control is a good choice for controlling systems that are nonlinear,
complex, and/or have poorly specified characteristics, thus making them a poor
match for conventional analog or digital control systems using algorithms that
depend on having a well-defined, linear model of the process to be controlled.
17. What is Moore’s Law and how has it related to advances in computing over the last 40
years? Is Moore’s Law expected to remain true forever or lose its validity in the future?
Explain your answer and discuss the implications for the design of future high-
performance computer systems.
Moore’s Law says that the continually shrinking sizes of semiconductor
devices will result in an exponential growth (doubling on approximately a yearly
basis) in the number of transistors that can feasibly be integrated on a single chip.
This has resulted in a doubling of computational power approximately every 18-24
months (since, apparently, computational power is not a linear function of the
number of transistors).
Given the known laws of physics, it is not possible that Moore’s Law will
continue to hold true indefinitely. The problem is that transistors used as switching
elements in computers cannot keep shrinking once they get to the size of individual
atoms or small groups of atoms. Devices that small will no longer work under the
binary logic principles of Boolean algebra, and the performance of traditional
computer architectures will reach a hard limit. (This has been estimated to occur
within the next 10-20 years.) At that point, further increases in performance will
only be achievable if some new approach, for example quantum computing, is
adopted.
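The exponential growth itself is easy to quantify. The figures below are purely illustrative (a hypothetical 1-million-transistor starting point and an 18-month doubling period; actual doubling periods have varied over the decades):

```python
def projected_count(initial, months, doubling_period=18):
    """Transistor count after the given number of months,
    assuming one doubling every doubling_period months."""
    return initial * 2 ** (months / doubling_period)

# A decade of 18-month doublings turns a hypothetical
# 1-million-transistor chip into roughly 100 million transistors.
count = projected_count(1_000_000, 120)
```

This compounding is why even a modest-sounding doubling period produces such dramatic gains over a few decades, and why hitting atomic-scale limits would be so consequential.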
18. How does a quantum computer fundamentally differ from all the other computer
architectures discussed in this book? What allows a quantum computer to achieve the
effect of a massively parallel computation using a single piece of hardware?
A quantum computer differs from all traditional computer architectures in
that it does not operate on the principles of Boolean algebra, where computations
are done sequentially (or in parallel, by replicating hardware) on binary digits (bits)
or groups of bits. In conventional machines, each bit can only be 0 or 1 at any given
time. Quantum computers instead operate on quantum bits (qubits). Qubits can
take on not only the distinct values 0 or 1, but also – by the principle of quantum
superposition – they can take on states that can be 0 and 1 at the same time. By
adding more qubits, a quantum computer becomes exponentially more powerful.
While a 16-bit binary register can take on only one of its 65,536 possible states at a
time, a 16-qubit quantum register can be in all 65,536 states at once in coherent
superposition. This allows the effect of a massively parallel computation to be
achieved using only one piece of hardware.
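The state of such a register can be illustrated with a classical simulation. The sketch below (an assumption-laden toy, not a real quantum computer) represents an n-qubit register as a vector of 2^n complex amplitudes and places it in an equal superposition, as would result from applying a Hadamard gate to every qubit:

```python
import math

def uniform_superposition(n_qubits):
    """Classically simulate an n-qubit register in equal
    superposition: a vector of 2**n amplitudes, each basis
    state measured with probability |amplitude|**2 = 1/2**n."""
    n_states = 2 ** n_qubits
    amp = 1.0 / math.sqrt(n_states)
    return [amp] * n_states

# A 16-qubit register "holds" all 65,536 basis states at once
state = uniform_superposition(16)
```

Note that the classical simulation needs storage exponential in the number of qubits, which is precisely why genuine quantum hardware is so attractive: it represents the same state with only 16 physical qubits.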
19. What are some of the problems scientists must solve in order to make supercomputers
based on the principles of quantum mechanics practical?
Researchers working on quantum computers have encountered a number of
problems that have so far made it impractical to construct machines that can
compete with supercomputers based on conventional architectures. First of all, it is
difficult to build a quantum computer, since there is a need to separate one or a
small number of atoms from others and keep them in a steady state in order to use
them for computation. Another significant problem is decoherence, a phenomenon
that can introduce errors in computations due to interactions of the computer
hardware with the surrounding environment. Finally, assuming one has performed
a quantum computation, it is difficult to observe the result without collapsing the
coherent superposition of states and destroying one’s work. Research into solving
these problems is ongoing.
20. What application(s) are expected to be a good match for the unique capabilities of
quantum computers? Explain.
If large-scale quantum computers can be practically constructed, they will
probably be used to solve extremely numerically intensive problems that have
proven to be intractable with even the fastest conventional systems. (They won’t be
used for word processing, sending e-mail, or surfing the Net!) One area that seems
to be a good potential match for the capabilities of quantum computers is
cryptography – the making and breaking of highly secure codes that protect
sensitive information from being intercepted by unauthorized parties.
21. Fill in the blanks below with the most appropriate term or concept discussed in this
chapter:
Dataflow machine - a type of computer architecture in which execution depends on the
availability of operands and execution units rather than a sequential-instruction program
model
Node (actor) - an element in a dataflow graph that represents an operation to be
performed on data
Tokens - these are used to represent data values (operands and results) in algorithms for a
dataflow architecture
IBM 360/91 - this (outwardly) von Neumann machine made use of dataflow techniques
for internal scheduling of operations
Hyper-threading - a machine using this technique can issue instructions from more than
one thread of execution during the same clock cycle
Artificial Neural Network (ANN) - a type of computer architecture with a structure
based on that of the human nervous system
Neurons - the fundamental units that make up a biological neural network
Dendrites - these are fibers that act as “input devices” for neurons in human beings
Convergence - when an artificial neural network achieves this, it is “trained” and ready
to be put into operating mode
Single-Layer Perceptron (SLP) - the earliest and simplest type of artificial neural
network
Unsupervised neural network - a type of artificial neural network that does not require
user intervention for training
Fuzzy logic architecture - a type of computer architecture in which logical values are
not restricted to purely “true” or “false” (1 or 0)
Linguistic variable - a type of variable that expresses a “fuzzy” concept; for example,
“slightly dirty” or “very fast”
Universe of discourse - the set of all objects under consideration in the design of a fuzzy
system
Truth value - the numerical degree (between 0 and 1, inclusive) of membership that an
object has in a fuzzy subset
Fuzzification - the first step performed in doing “fuzzy computations” for an expert
system, control system, etc.
Defuzzification - this is necessary if a fuzzy result must be converted to a crisp output
Quantum computer - a type of computer architecture in which the same physical
hardware can be used to simultaneously compute many results as though it were parallel
hardware; its operation is not based on Boolean algebra, but on the physics of subatomic
particles
Moore’s Law - a prophetic observation of the fact that conventional computers would
tend to grow exponentially more powerful over time as integrated circuit features got
smaller and smaller
Quantum bit (qubit) - the basic unit of information in a quantum computer
Quantum interference - this phenomenon results from the superposition of multiple
possible quantum states
Quantum entanglement - a state in which an atom’s properties are identically assumed
by another atom, but with opposite spin
Decoherence - the tendency for interactions with the surrounding environment to disturb
the state of qubits, possibly resulting in computational errors
Thirty - a quantum computer with this many qubits has been estimated to have 10
TFLOPS of computational power
Cryptography - so far, this appears to be the most likely application for supercomputers
based on quantum principles