Dr.Khaled Kh. Sharaf
Faculty Of Computers
And Information
Technology
Second Term
2019- 2020
Computer Architecture
Chapter 2:
Computer Evolution and Performance
LEARNING OBJECTIVES
1. A Brief History of Computers.
2. The Evolution of the Intel x86 Architecture
3. Embedded Systems and the ARM
4. Performance Assessment
1. A BRIEF HISTORY OF COMPUTERS
The First Generation: Vacuum Tubes
Electronic Numerical Integrator And Computer (ENIAC)
- Designed and constructed at the University of Pennsylvania; the world's first general-purpose electronic digital computer
- Started 1943 and finished 1946
- Decimal (not binary); 20 accumulators of 10 digits
- Programmed manually by switches; 18,000 vacuum tubes
- 30 tons; 15,000 square feet
- 140 kW power consumption; 5,000 additions per second
The First Generation: Vacuum Tubes
VON NEUMANN MACHINE
• Stored Program concept
• Main memory storing programs and data
• ALU operating on binary data
• Control unit interpreting instructions from memory and executing
• Input and output equipment operated by control unit
• Built at the Princeton Institute for Advanced Study (IAS)
• Completed 1952
Structure of von Neumann machine
Structure of the IAS Computer
IAS - details
• 1000 x 40 bit words
• Binary number
• 2 x 20 bit instructions
Set of registers (storage in CPU) 1
• Memory buffer register (MBR): Contains a word to be stored in
memory or sent to the I/O unit, or is used to receive a word from
memory or from the I/O unit.
• Memory address register (MAR): Specifies the address in
memory of the word to be written from or read into the MBR.
• Instruction register (IR): Contains the 8-bit opcode of the instruction
being executed.
IAS - details
Set of registers (storage in CPU) 2
• Instruction buffer register (IBR): Employed to hold temporarily the
right hand instruction from a word in memory.
• Program counter (PC): Contains the address of the next instruction
pair to be fetched from memory.
• Accumulator (AC) and multiplier quotient (MQ): Employed to hold
temporarily operands and results of ALU operations.
Commercial Computers
The 1950s saw the birth of the computer industry with two companies,
Sperry and IBM, dominating the marketplace
The UNIVAC I (Universal Automatic Computer)
was the first successful commercial computer. It was intended for both
scientific and commercial applications
• Late 1950s - UNIVAC II
• Faster
• More memory
IBM
• Punched-card processing equipment
• 1953 - the 701
• IBM’s first stored program computer
• Scientific calculations
• 1955 - the 702
• Business applications
• Led to 700/7000 series
The Second Generation: Transistors
• The second generation saw the introduction of more complex arithmetic
and logic units and control units
• Use of high-level programming languages
• Provision of system software with the computer
• System software provided the ability to:
• load programs,
• move data to peripherals, and
• use libraries to perform common computations, similar to what modern
OSes like Windows and Linux do.
• Literally - “small electronics”
• A computer is made up of gates, memory cells and interconnections
• These can be manufactured on a semiconductor
• e.g. silicon wafer
Transistors
• Replaced vacuum tubes
• Smaller
• Cheaper
• Less heat dissipation
• Solid State device
• Made from Silicon (Sand)
• Invented 1947 at Bell Labs
• William Shockley et al.
Transistor Based Computers
• Second generation machines
• NCR & RCA produced small transistor machines
• IBM 7000
• DEC – 1957 “Digital Equipment Corporation”
• Produced PDP-1
Third Generation: Integrated Circuits
Microelectronics A single, self-contained transistor is called a discrete component.
Throughout the 1950s and early 1960s, electronic equipment was
composed largely of discrete components— transistors, resistors,
capacitors, and so on.
Microelectronics means, literally, “small electronics.” Since the
beginnings of digital electronics and the computer industry, there has been
a persistent and consistent trend toward the reduction in size of digital
electronic circuits.
Relationship among
Wafer, Chip, and Gate
Moore’s Law
• Increased density of components on chip
• Gordon Moore – co-founder of Intel
• Number of transistors on a chip will double every year
• Since the 1970s development has slowed a little
• Number of transistors doubles every 18 months
• Cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths, giving higher
performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increases reliability
Consequences of Moore's Law
1. The cost of a chip has remained virtually unchanged during this
period of rapid growth in density. This means that the cost of
computer logic and memory circuitry has fallen at a dramatic rate.
2. Because logic and memory elements are placed closer together on
more densely packed chips, the electrical path length is shortened,
increasing operating speed.
3. The computer becomes smaller, making it more convenient to
place in a variety of environments.
4. There is a reduction in power and cooling requirements.
5. The interconnections on the integrated circuit are much more
reliable than solder connections. With more circuitry on each chip,
there are fewer interchip connections.
Later Generations
• Beyond the third generation there is less general agreement on
defining generations of computers. Table 2.2 suggests that there have
been a number of later generations, based on advances in integrated
circuit technology.
• With the introduction of large-scale integration (LSI), more than 1000
components can be placed on a single integrated circuit chip.
• Very-large-scale integration (VLSI) achieved more than 10,000
components per chip, while current ultra-large-scale integration (ULSI)
chips can contain more than one billion components.
Later Generations
• SEMICONDUCTOR MEMORY The first application of integrated circuit
technology to computers was construction of the processor (the control
unit and the arithmetic and logic unit) out of integrated circuit chips. But
it was also found that this same technology could be used to construct
memories.
• Since 1970, semiconductor memory has been through 13 generations:
1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G, and, as of
this writing, 16 Gbits on a single chip (1K = 2^10, 1M = 2^20, 1G = 2^30).
Each generation has provided four times the storage density of the
previous generation, accompanied by declining cost per bit and
declining access time.
Later Generations
MICROPROCESSORS Just as the density of elements on memory chips
has continued to rise, so has the density of elements on processor chips.
As time went on, more and more elements were placed on each chip, so
that fewer and fewer chips were needed to construct a single computer
processor.
Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965
• Up to 100 devices on a chip
• Medium scale integration - to 1971
• 100-3,000 devices on a chip
• Large scale integration - 1971-1977
• 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978 -1991
• 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration – 1991 -
• Over 100,000,000 devices on a chip
x86 Evolution (1)
• 8080
• first general purpose microprocessor
• 8 bit data path
• Used in first personal computer – Altair
• 8086 – 5MHz – 29,000 transistors
• much more powerful
• 16 bit
• instruction cache, prefetch few instructions
• 8088 (8 bit external bus) used in first IBM PC
• 80286
• 16 Mbyte memory addressable
• up from 1Mb
• 80386
• 32 bit
• Support for multitasking
• 80486
• sophisticated powerful cache and instruction pipelining
• built-in math co-processor
x86 Evolution (2)
• Pentium
• Superscalar
• Multiple instructions executed in parallel
• Pentium Pro
• Increased superscalar organization
• Aggressive register renaming
• branch prediction
• data flow analysis
• speculative execution
• Pentium II
• MMX technology
• graphics, video & audio processing
• Pentium III
• Additional floating point instructions for 3D graphics
x86 Evolution (3)
• Pentium 4
• Note Arabic rather than Roman numerals
• Further floating point and multimedia enhancements
• Core
• First x86 with dual core
• Core 2
• 64 bit architecture
• Core 2 Quad – 3GHz – 820 million transistors
• Four processors on chip
• x86 architecture dominant outside embedded systems
• Organization and technology changed dramatically
• Instruction set architecture evolved with backwards compatibility
• ~1 instruction per month added
• 500 instructions available
• See Intel web pages for detailed information on processors
Embedded Systems ARM
• Embedded system: a combination of computer hardware and software,
and perhaps additional mechanical or other parts, designed to perform a
dedicated function. In many cases, embedded systems are part of a
larger system or product, as in the case of an antilock braking system
in a car.
• ARM evolved from RISC design
• Used mainly in embedded systems
• Used within product
• Not general purpose computer
• Dedicated function
Embedded Systems Requirements
• Different sizes
• Different constraints, optimization, reuse
• Different requirements
• Safety, reliability, real-time, flexibility, legislation
• Lifespan
• Environmental conditions
• Static v dynamic loads
• Slow to fast speeds
• Computation v I/O intensive
• Discrete event v continuous dynamics
Possible Organization of an Embedded System
ARM Evolution
• Designed by ARM Inc., Cambridge, England
• Licensed to manufacturers
• High speed, small die, low power consumption
• PDAs, hand held games, phones
• E.g. iPod, iPhone
• Acorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989
• Acorn, VLSI and Apple Computer founded ARM Ltd.
ARM Systems Categories
• Embedded real time
• Application platform
• Linux, Palm OS, Symbian OS, Windows mobile
• Secure applications
What do we measure?
Define performance….
Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100         101           630          598
Boeing 747             470          4150          610
BAC/Sud Concorde       132          4000         1350
Douglas DC-8-50        146          8720          544
Define performance….
• How much faster is the Concorde compared to the 747?
• How much bigger is the Boeing 747 than the Douglas DC-8?
• So which of these airplanes has the best performance?!
When trying to choose among different computers, performance is an
important attribute. Accurately measuring and comparing different
computers is critical to purchasers and therefore to designers.
Defining Performance
We can define computer performance in several different ways.
• If you were running a program on two different desktop computers,
you’d say that the faster one is the desktop computer that gets the
job done first.
• If you were running a datacenter that had several servers running
jobs submitted by many users, you’d say that the faster computer
was the one that completed the most jobs during a day.
Defining Performance: TIME, TIME, TIME!!!
• Response time (elapsed time, latency): the individual user's concern
• how long does it take for my job to run?
• how long does it take to execute (start to finish) my job?
• how long must I wait for the database query?
• Throughput: the systems manager's concern
• how many jobs can the machine run at once?
• what is the average execution rate?
• how much work is getting done?
• If we upgrade a machine with a new processor, what do we increase?
• If we add a new machine to the lab, what do we increase?
Defining Performance
If we upgrade a machine with a new processor, what do we increase?
- Both response time and throughput are improved.
If we add a new machine to the lab, what do we increase?
- No single task gets its work done faster, so only throughput
increases.
Thus, in many real computer systems, changing either execution time or
throughput often affects the other.
Execution Time
• Elapsed Time
• counts everything (disk and memory accesses, waiting for I/O, running other
programs, etc.) from start to finish
• a useful number, but often not good for comparison purposes
elapsed time = CPU time + wait time (I/O, other programs, etc.)
• CPU time
• doesn't count waiting for I/O or time spent running other programs
• can be divided into user CPU time and system CPU time (OS calls)
CPU time = user CPU time + system CPU time
elapsed time = user CPU time + system CPU time + wait time
• Our focus: user CPU time (CPU execution time or, simply, execution time)
• time spent executing the lines of code that are in our program
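The difference between elapsed time and CPU time can be observed directly. A minimal Python sketch (the workload and the half-second sleep are illustrative choices of ours, not from the slides):

```python
import time

def busy_work(n=2_000_000):
    # Pure computation: consumes CPU time as well as elapsed time.
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()   # wall clock (elapsed) time
cpu_start = time.process_time()    # user + system CPU time of this process

busy_work()
time.sleep(0.5)                    # waiting: adds elapsed time but almost no CPU time

elapsed = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"elapsed time: {elapsed:.2f} s")
print(f"CPU time:     {cpu:.2f} s")
```

Because the sleep is pure wait time, elapsed time exceeds CPU time by roughly half a second, mirroring the decomposition elapsed time = CPU time + wait time.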
Execution Time
• For some program running on machine X:
Performance_X = 1 / Execution time_X
• X is n times faster than Y means:
Performance_X / Performance_Y = n
• execution time on Y is n times longer than it is on X:
Execution time_Y / Execution time_X = n
Execution Time
Relative Performance
If computer A runs a program in 10 seconds and computer B runs the
same program in 15 seconds, how much faster is A than B?
We know that A is n times faster than B if
Execution time_B / Execution time_A = n
Thus the performance ratio is 15 / 10 = 1.5
A is therefore 1.5 times faster than B.
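The relative-performance calculation can be written as a few lines of Python (variable names are ours):

```python
def performance(execution_time):
    # Performance is the reciprocal of execution time.
    return 1.0 / execution_time

time_a = 10.0  # seconds, computer A
time_b = 15.0  # seconds, computer B

# Performance ratio n = Performance_A / Performance_B = time_b / time_a
n = performance(time_a) / performance(time_b)
print(f"A is {n:.1f} times faster than B")  # prints "A is 1.5 times faster than B"
```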
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles. In modern
computers hardware events progress cycle by cycle: in other words, each event,
e.g., multiplication, addition, etc., is a sequence of cycles
• Clock ticks indicate start and end of cycles:
• cycle time = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec, 1 MHz = 10^6 cycles/sec)
• Example: a 200 MHz clock has a cycle time of
1 / (200 × 10^6) seconds = 5 nanoseconds
• clock period: the length of each clock cycle
Measuring Performance
Time is the measure of computer performance: the computer that
performs the same amount of work in the least time is the fastest.
Program execution time is measured in seconds per program.
The time can be defined in different ways, depending on what we count.
Wall clock time, response time, or elapsed time: these terms mean
the total time to complete a task, including:
- disk accesses,
- memory accesses,
- input/output (I/O) activities,
- operating system overhead—everything.
Measuring Performance
When many programs run concurrently, the system may try to optimize
throughput rather than attempt to minimize the elapsed time for one program.
CPU execution time or simply CPU time is the actual time the CPU
spends computing for a specific task and does not include time spent
waiting for I/O or running other programs.
user CPU time is the CPU time spent in a program itself.
system CPU time is the CPU time spent in the operating system
performing tasks on behalf of the program.
Measuring Performance
Because it is often hard to assign responsibility for operating system
activities to one user program rather than another, and because of the
functionality differences among operating systems, distinguishing
between system and user CPU time is difficult to do accurately.
For consistency, we maintain a distinction between performance based
on elapsed time and that based on CPU execution time.
We will use the term system performance to refer to elapsed time on
an unloaded system and CPU performance to refer to user CPU time.
Understanding Program Performance
Different applications are sensitive to different aspects of the
performance of a computer system.
Many applications, especially those running on servers, depend as much
on I/O performance as on CPU performance; I/O performance, in turn,
relies on both hardware and software.
Total elapsed time measured by a wall clock is the measurement of
interest.
In some application environments, the user may care about throughput,
response time, or a complex combination.
Measuring Performance
To improve the performance of a program, one must have a clear
definition of which performance metric to use.
Almost all computers are constructed using a clock that determines when
events take place in the hardware.
These discrete time intervals are called clock cycles.
Clock cycle: the time for one clock period, usually of the processor
clock, which runs at a constant rate.
Clock period: the length of each clock cycle.
Clock rate: the number of clock cycles per second; the inverse of the clock period.
Performance Equation I
• So, to improve performance one can either:
• reduce the number of cycles for a program, or
• reduce the clock cycle time, or, equivalently,
• increase the clock rate
CPU execution time for a program = CPU clock cycles for a program × Clock cycle time

or, equivalently:

seconds / program = (cycles / program) × (seconds / cycle)
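A quick numeric sketch of this equation in Python, reusing the 200 MHz clock from the earlier example (the cycle count for the program is a made-up illustration):

```python
# seconds/program = cycles/program x seconds/cycle
clock_rate = 200e6                 # 200 MHz, as in the earlier cycle-time example
cycle_time = 1.0 / clock_rate      # 5 nanoseconds per cycle
cpu_clock_cycles = 1_000_000       # hypothetical cycle count for some program

cpu_time = cpu_clock_cycles * cycle_time
print(f"cycle time = {cycle_time * 1e9:.0f} ns")  # 5 ns
print(f"CPU time   = {cpu_time * 1e3:.0f} ms")    # 5 ms
```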
CPU Performance and Its Factors
Alternatively, because clock rate and clock cycle time are inverses,
CPU execution time for a program = CPU clock cycles for a program / Clock rate
How many cycles are required for a program?
• Could assume that # of cycles = # of instructions
[Timeline figure: instructions 1 through 6 executing one after another, one cycle each, along a time axis.]
This assumption is incorrect! Because:
Different instructions take different amounts of time (cycles)
Why…?
How many cycles are required for a program?
• Multiplication takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes more time than accessing registers
• Important point: changing the cycle time often changes the
number of cycles required for various instructions because it
means changing the hardware design. More later…
Example
• Our favorite program runs in 10 seconds on computer A, which has a
2 GHz clock.
• We are trying to help a computer designer build a new machine B, that
will run this program in 6 seconds. The designer can use new (or
perhaps more expensive) technology to substantially increase the clock
rate, but has informed us that this increase will affect the rest of the CPU
design, causing machine B to require 1.2 times as many clock cycles as
machine A for the same program.
• What clock rate should we tell the designer to target?
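The answer can be worked out with the performance equation: cycles on A = time × clock rate, B needs 1.2 times as many cycles, and B's rate must cover those cycles in 6 seconds. A Python sketch (variable names are ours):

```python
# Computer A: 10 s at 2 GHz
time_a = 10.0
rate_a = 2e9
cycles_a = time_a * rate_a     # 20 x 10^9 clock cycles

# Machine B needs 1.2x as many cycles but must finish in 6 s
cycles_b = 1.2 * cycles_a      # 24 x 10^9 clock cycles
time_b = 6.0
rate_b = cycles_b / time_b     # required clock rate for B

print(f"target clock rate for B: {rate_b / 1e9:.0f} GHz")  # 4 GHz
```

So the designer should target a 4 GHz clock: double A's rate, even though B only needs to be 10/6 times faster, because of the extra cycles.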
CPU Performance and Its Factors
Terminology
• A given program will require:
some number of instructions (machine instructions)
some number of cycles
some number of seconds
• We have a vocabulary that relates these quantities:
• cycle time (seconds per cycle)
• clock rate (cycles per second)
• (average) CPI (cycles per instruction)
• a floating point intensive application might have a higher average CPI
• MIPS (millions of instructions per second)
• this would be higher for a program using simple instructions
Performance Measure
• Performance is determined by execution time
• Do any of these other variables equal performance?
• # of cycles to execute program?
• # of instructions in program?
• # of cycles per second?
• average # of cycles per instruction?
• average # of instructions per second?
• Common pitfall : thinking one of the variables is indicative of
performance when it really isn’t
Instruction Performance
Therefore, the number of clock cycles required for a program can be
written as
CPU clock cycles = Instructions for a program × Average clock cycles
per instruction
The term clock cycles per instruction, which is the average number of
clock cycles each instruction takes to execute, is often abbreviated as CPI.
clock cycles per instruction (CPI)
Average number of clock cycles per instruction for a program
or program fragment.
Instruction Performance
Suppose we have two implementations of the same instruction set
architecture.
Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some
program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2
for the same program.
Which computer is faster for this program and by how much?
Instruction Performance
We know that each computer executes the same number of instructions for
the program; let’s call this number I. First, find the number of processor
clock cycles for each computer:
We can conclude
that computer A is
1.2 times as fast as
computer B for this
program.
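The comparison can be checked with a short Python sketch; the instruction count I cancels out of the ratio, so any value works:

```python
I = 1_000_000  # same instruction count on both machines (any value works)

# Computer A: 250 ps clock cycle time, CPI 2.0
cycles_a = I * 2.0
time_a = cycles_a * 250e-12    # = I x 500 ps

# Computer B: 500 ps clock cycle time, CPI 1.2
cycles_b = I * 1.2
time_b = cycles_b * 500e-12    # = I x 600 ps

ratio = time_b / time_a        # 600 / 500
print(f"A is {ratio:.1f} times as fast as B")
```

Each instruction effectively costs 500 ps on A and 600 ps on B, giving the 1.2x result stated above.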
Performance Equation II
We can now write Performance Equation II in terms of instruction count,
CPI, and clock cycle time:

CPU execution time for a program = Instruction count for the program × Average CPI × Clock cycle time

or, since the clock rate is the inverse of clock cycle time:

CPU execution time for a program = (Instruction count for the program × Average CPI) / Clock rate
The Classic CPU Performance Equation
A compiler designer is trying to decide between two code sequences for a
particular computer. The hardware designers have supplied the following
facts:

Instruction class    A    B    C
CPI for this class   1    2    3

For a particular high-level language statement, the compiler writer is
considering two code sequences that require the following instruction
counts:

                     Instruction counts per class
Code sequence         A    B    C
1                     2    1    2
2                     4    1    1
The Classic CPU Performance Equation
Which code sequence executes the most instructions?
Which will be faster?
What is the CPI for each sequence?
The Classic CPU Performance Equation
ANSWER
Sequence 1 executes 2 + 1 + 2 = 5 instructions.
Sequence 2 executes 4 + 1 + 1 = 6 instructions.
Therefore, sequence 1 executes fewer instructions.
We can use the equation for CPU clock cycles based on instruction
count and CPI to find the
total number of clock cycles for each sequence:
The Classic CPU Performance Equation
This yields
CPU clock cycles1 = (2 × 1) + (1 × 2) + (2 × 3)
= 2 + 2 + 6 = 10 cycles
CPU clock cycles2 = (4 × 1) + (1 × 2) + (1 × 3)
= 4 + 2 + 3 = 9 cycles
So code sequence 2 is faster, even though it executes one extra
instruction.
Since code sequence 2 takes fewer overall clock cycles but has more
instructions, it must have a lower CPI.
The Classic CPU Performance Equation
The CPI values can be computed by dividing each sequence's total clock
cycles by its instruction count:
CPI_1 = 10 / 5 = 2.0
CPI_2 = 9 / 6 = 1.5
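The whole example can be verified in a few lines of Python (the dictionary layout is our own way of encoding the tables):

```python
# CPI per instruction class, from the example: A=1, B=2, C=3
cpi = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each candidate code sequence
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

def clock_cycles(seq):
    # Total cycles = sum over classes of (instruction count x class CPI)
    return sum(count * cpi[cls] for cls, count in seq.items())

def average_cpi(seq):
    # Average CPI = total cycles / total instruction count
    return clock_cycles(seq) / sum(seq.values())

print(clock_cycles(seq1), average_cpi(seq1))  # 10 cycles, CPI 2.0
print(clock_cycles(seq2), average_cpi(seq2))  # 9 cycles, CPI 1.5
```

Sequence 2 wins on total cycles despite executing one more instruction, which is exactly why instruction count alone is a poor performance metric.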
The Classic CPU Performance Equation
The following figure shows the basic measurements at different levels in the
computer and what is being measured in each case.
We can see how these factors are combined to yield execution time
measured in seconds per program:
The Classic CPU Performance Equation
The following table summarizes how the algorithm, the language, the compiler, the
architecture, and the actual hardware affect the factors in the CPU performance
equation.
Finally
I wish you good luck