Embedded Systems Design: A Unified Hardware/Software Introduction

Microprocessors

Embedded Systems Design: A Unified Hardware/Software Introduction, (c) 2000 Vahid/Givargis

CMOS transistor on silicon

• Transistor
  – The basic electrical component in digital systems
  – Acts as an on/off switch
  – Voltage at “gate” controls whether current flows from source to drain
  – Don’t confuse this “gate” with a logic gate

[Figure: IC package and IC; cross-section of a transistor on the silicon substrate showing source, drain, oxide, gate, and channel; switch symbol with gate, source, and drain that conducts if gate=1]


CMOS transistor implementations

• Complementary Metal Oxide Semiconductor
• We refer to logic levels
  – Typically 0 is 0V, 1 is 5V
• Two basic CMOS types
  – nMOS conducts if gate=1
  – pMOS conducts if gate=0
  – Hence “complementary”
• Basic gates
  – Inverter, NAND, NOR

[Figure: transistor-level circuits between logic 1 and logic 0 for the inverter (F = x'), the NAND gate (F = (xy)'), and the NOR gate (F = (x+y)'); nMOS symbol (conducts if gate=1) and pMOS symbol (conducts if gate=0)]
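The switch behaviour of the two transistor types can be sketched in C. This is a minimal model, assuming the standard CMOS arrangement in which pMOS transistors form the pull-up network and nMOS transistors form the pull-down network; the function names are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* Switch-level transistor models from the slide. */
static bool nmos_conducts(bool gate) { return gate; }   /* conducts if gate=1 */
static bool pmos_conducts(bool gate) { return !gate; }  /* conducts if gate=0 */

/* Output node: 1 when the pull-up network connects it to logic 1,
   0 when the pull-down network connects it to logic 0. */
static bool cmos_node(bool pull_up, bool pull_down) {
    assert(pull_up != pull_down);  /* exactly one network conducts */
    return pull_up;
}

bool cmos_inverter(bool x) {
    return cmos_node(pmos_conducts(x), nmos_conducts(x));
}

bool cmos_nand(bool x, bool y) {
    bool up = pmos_conducts(x) || pmos_conducts(y);    /* pMOS in parallel */
    bool down = nmos_conducts(x) && nmos_conducts(y);  /* nMOS in series */
    return cmos_node(up, down);
}

bool cmos_nor(bool x, bool y) {
    bool up = pmos_conducts(x) && pmos_conducts(y);    /* pMOS in series */
    bool down = nmos_conducts(x) || nmos_conducts(y);  /* nMOS in parallel */
    return cmos_node(up, down);
}
```

For every input combination exactly one of the two networks conducts, which is what the internal assertion checks and why the two transistor types are called complementary.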


Basic logic gates

Gate      Equation       F for x = 0, 1
Driver    F = x          0, 1
Inverter  F = x'         1, 0

Gate      Equation       F for xy = 00, 01, 10, 11
AND       F = x y        0, 0, 0, 1
OR        F = x + y      0, 1, 1, 1
XOR       F = x ⊕ y      0, 1, 1, 0
NAND      F = (x y)'     1, 1, 1, 0
NOR       F = (x+y)'     1, 0, 0, 0
XNOR      F = x ⊙ y      1, 0, 0, 1


Combinational logic design

A) Problem description
y is 1 if a is 1, or b and c are 1. z is 1 if b or c is 1, but not both, or if all are 1.

B) Truth table

Inputs    Outputs
a b c     y z
0 0 0     0 0
0 0 1     0 1
0 1 0     0 1
0 1 1     1 0
1 0 0     1 0
1 0 1     1 1
1 1 0     1 1
1 1 1     1 1

C) Output equations
y = a'bc + ab'c' + ab'c + abc' + abc
z = a'b'c + a'bc' + ab'c + abc' + abc

D) Minimized output equations

K-map for y (rows a, columns bc)
      bc = 00  01  11  10
a = 0       0   0   1   0
a = 1       1   1   1   1
y = a + bc

K-map for z (rows a, columns bc)
      bc = 00  01  11  10
a = 0       0   1   0   1
a = 1       0   1   1   1
z = ab + b'c + bc'

E) Logic Gates
[Figure: gate-level circuit computing y and z from inputs a, b, c]
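The minimization step can be checked exhaustively, since three inputs give only eight cases. The sketch below compares the canonical sum-of-products equations from step C with the minimized equations from step D; the function names are illustrative.

```c
#include <stdbool.h>

/* Step C: canonical sum-of-products equations. */
bool y_canonical(bool a, bool b, bool c) {
    return (!a && b && c) || (a && !b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

bool z_canonical(bool a, bool b, bool c) {
    return (!a && !b && c) || (!a && b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

/* Step D: minimized equations. */
bool y_minimized(bool a, bool b, bool c) { return a || (b && c); }

bool z_minimized(bool a, bool b, bool c) {
    return (a && b) || (!b && c) || (b && !c);
}

/* Counts input combinations where a minimized equation disagrees
   with its canonical form; 0 means the minimization is equivalent. */
int count_mismatches(void) {
    int mismatches = 0;
    for (int i = 0; i < 8; i++) {
        bool a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        if (y_canonical(a, b, c) != y_minimized(a, b, c)) mismatches++;
        if (z_canonical(a, b, c) != z_minimized(a, b, c)) mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches() returns 0 for the equations on this slide.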


Combinational components

• n-bit, m x 1 Multiplexor
  – Inputs I0, I1, …, I(m-1) (n bits each) and select lines S0 … S(log m); output O (n bits)
  – O = I0 if S=0..00, I1 if S=0..01, …, I(m-1) if S=1..11
• log n x n Decoder
  – Inputs I0 … I(log n -1); outputs O0, O1, …, O(n-1)
  – O0 = 1 if I=0..00, O1 = 1 if I=0..01, …, O(n-1) = 1 if I=1..11
  – With enable input e: all O’s are 0 if e=0
• n-bit Adder
  – Inputs A, B (n bits each); outputs sum (n bits), carry
  – sum = A+B (first n bits), carry = (n+1)’th bit of A+B
  – With carry-in input Ci: sum = A + B + Ci
• n-bit Comparator
  – Inputs A, B (n bits each); outputs less, equal, greater
  – less = 1 if A<B, equal = 1 if A=B, greater = 1 if A>B
• n bit, m function ALU
  – Inputs A, B (n bits each) and select lines S0 … S(log m); output O (n bits)
  – O = A op B, op determined by S
  – May have status outputs carry, zero, etc.
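The component definitions above translate almost directly into C. The following is a behavioural sketch only, assuming widths of at most 32 bits and a decoder input below 32; the type and function names are made up for illustration.

```c
#include <stdint.h>

/* m x 1 multiplexor: O = I[S]. The caller guarantees select < m. */
uint32_t mux(const uint32_t *inputs, unsigned select) {
    return inputs[select];
}

/* Decoder with enable: output bit number i is 1, all others 0;
   all outputs are 0 if enable is 0. Requires i < 32. */
uint32_t decoder(unsigned i, int enable) {
    return enable ? (UINT32_C(1) << i) : 0;
}

typedef struct { uint32_t sum; int carry; } AdderOut;

/* n-bit adder with carry-in (1 <= n <= 32): sum is the first n bits
   of A + B + Ci, carry is the (n+1)'th bit. */
AdderOut adder(uint32_t a, uint32_t b, int carry_in, unsigned n) {
    uint64_t mask = (UINT64_C(1) << n) - 1;
    uint64_t full = (a & mask) + (b & mask) + (carry_in ? 1 : 0);
    AdderOut out = { (uint32_t)(full & mask), (int)((full >> n) & 1) };
    return out;
}

typedef struct { int less, equal, greater; } CmpOut;

/* Comparator: exactly one of the three outputs is 1. */
CmpOut comparator(uint32_t a, uint32_t b) {
    CmpOut out = { a < b, a == b, a > b };
    return out;
}
```

For example, an 8-bit addition of 200 and 100 yields sum 44 with carry 1, because 300 does not fit in 8 bits.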


Sequential components

• n-bit Register
  – Inputs I (n bits), load, clear; output Q (n bits)
  – Q = 0 if clear=1, I if load=1 and clock=1, Q(previous) otherwise.
• n-bit Shift register
  – Inputs I, shift; output Q
  – Q = lsb; content shifted; I stored in msb
• n-bit Counter
  – Output Q (n bits)
  – Q = 0 if clear=1, Q(prev)+1 if count=1 and clock=1.
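The next-state rules above can be sketched as C functions, where each call models one active clock edge. The 8-bit width and the function names are assumptions made for this illustration.

```c
#include <stdint.h>

/* Register: Q = 0 if clear=1, I if load=1, previous Q otherwise. */
uint8_t register_next(uint8_t q, uint8_t i, int clear, int load) {
    if (clear) return 0;
    if (load) return i;
    return q;
}

/* Counter: Q = 0 if clear=1, Q(prev)+1 if count=1 (wraps at 2^8). */
uint8_t counter_next(uint8_t q, int clear, int count) {
    if (clear) return 0;
    if (count) return (uint8_t)(q + 1);
    return q;
}

/* Shift register: the serial output is the current lsb; on a shift the
   content moves one position toward the lsb and the serial input is
   stored in the msb. */
uint8_t shift_next(uint8_t q, int serial_in, int shift, int *serial_out) {
    *serial_out = q & 1;
    if (!shift) return q;
    return (uint8_t)((q >> 1) | (serial_in ? 0x80 : 0x00));
}
```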


Contemporary Logic Design: Sequential Logic
© R.H. Katz Transparency No. 6-7

Sequential Switching Networks

Cross-Coupled NOR Gates
• Just like cascaded inverters, with capability to force output to 0 (reset) or 1 (set)

[Figure: two cross-coupled NOR gates with inputs R, S and outputs Q, \Q]

Timing Waveform
[Figure: waveforms of R, S, Q, \Q with intervals labeled Reset, Hold, Set, Forbidden State, and Race]
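The cross-coupled NOR latch can be simulated by re-evaluating both gates until the outputs stop changing. This is a rough sketch, assuming both gates switch simultaneously with equal delay; the Latch type and rs_latch_apply are illustrative names.

```c
#include <stdbool.h>

typedef struct { bool q, qn; } Latch;  /* qn models the \Q output */

static bool nor2(bool a, bool b) { return !(a || b); }

/* Applies inputs R and S and re-evaluates both NOR gates until the
   outputs settle. The iteration bound keeps the function from looping
   forever in the race case. */
Latch rs_latch_apply(Latch st, bool r, bool s) {
    for (int i = 0; i < 4; i++) {
        Latch next;
        next.q  = nor2(r, st.qn);  /* Q  = NOR(R, \Q) */
        next.qn = nor2(s, st.q);   /* \Q = NOR(S, Q)  */
        if (next.q == st.q && next.qn == st.qn) break;
        st = next;
    }
    return st;
}
```

With R = S = 1 both outputs are forced to 0, the forbidden state. Releasing both inputs at once from that state makes this equal-delay model oscillate, which is one way to picture the race shown in the waveform.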


Gated R-S Latch (clocked S-R flip-flop)

[Figure: gated R-S latch with inputs \S, \R, \enb and outputs Q, \Q]

• Enb = 1, latch closed (outputs unchanged)
• Enb = 0, enabled (outputs depend on inputs)


J-K Flip-flop

• How to eliminate the forbidden state?
  – Idea: use output feedback to guarantee that R and S are never both one
  – J, K both one yields toggle
• Characteristic Equation: Q+ = Q K' + Q' J

[Figure: R-S latch whose S and R inputs are formed from J and K gated with the fed-back outputs \Q and Q]
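The J-K characteristic equation, Q+ = Q K' + Q' J, can be written as a one-line C function. The name jk_next is illustrative.

```c
/* Next state of a J-K flip-flop: Q+ = Q K' + Q' J. */
int jk_next(int q, int j, int k) {
    return (q && !k) || (!q && j);
}
```

With J = K = 0 the state holds, J = 1 sets, K = 1 resets, and J = K = 1 toggles.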




Sequential Switching Network

Master/Slave J-K Flipflop

• Master stage: sample inputs while clock high
• Slave stage: sample inputs while clock low
• Uses time to break feedback path from outputs to inputs!

[Figure: two R-S latches in series with inputs J, K, Clk; master outputs P, \P; slave outputs Q, \Q]

[Figure: waveforms of J, K, Clk, P, \P, Q, \Q showing Set, Reset, Toggle, and 1's Catch; annotated "Correct Toggle Operation"]


Sequential Switching Networks: Edge-Triggered Flipflops

• 1's Catching: a 0-1-0 glitch on the J or K inputs leads to a state change!
  – Forces designer to use hazard-free logic
• Solution: edge-triggered logic
• Negative Edge-Triggered D flipflop
  – 4-5 gate delays
  – Setup, hold times necessary to successfully latch the input
  – Characteristic Equation: Q+ = D

[Figure: negative edge-triggered FF when clock is high (Clk=1), with internal R and S at 0; the input latches hold D when clock goes low]


Sequential logic design

A) Problem Description
You want to construct a clock divider. Slow down your pre-existing clock so that you output a 1 for every four clock cycles.

B) State Diagram
[Figure: states 0, 1, 2, 3; a=1 advances 0 → 1 → 2 → 3 → 0 and a=0 stays in the current state; x=0 in states 0, 1, 2 and x=1 in state 3]

C) Implementation Model
[Figure: combinational logic with inputs a, Q1, Q0 and outputs x, I1, I0; state register with inputs I1, I0 and outputs Q1, Q0]

D) State Table (Moore-type)

Inputs      Outputs
Q1 Q0 a     I1 I0   x
0  0  0     0  0    0
0  0  1     0  1    0
0  1  0     0  1    0
0  1  1     1  0    0
1  0  0     1  0    0
1  0  1     1  1    0
1  1  0     1  1    1
1  1  1     0  0    1

• Given this implementation model
  – Sequential logic design quickly reduces to combinational logic design


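Sequential logic design (cont.)

E) Minimized Output Equations

K-map for I1 (rows a, columns Q1Q0)
    Q1Q0 = 00  01  11  10
a = 0       0   0   1   1
a = 1       0   1   0   1
I1 = Q1’Q0a + Q1a’ + Q1Q0’

K-map for I0 (rows a, columns Q1Q0)
    Q1Q0 = 00  01  11  10
a = 0       0   1   1   0
a = 1       1   0   0   1
I0 = Q0a’ + Q0’a

K-map for x (rows a, columns Q1Q0)
    Q1Q0 = 00  01  11  10
a = 0       0   0   1   0
a = 1       0   0   1   0
x = Q1Q0

F) Combinational Logic
[Figure: gate-level circuit computing I1, I0, and x from a, Q1, Q0]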
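The minimized equations from step E can be exercised by stepping the state register in software. The sketch below is illustrative (the State type and function names are not from the slides); each loop iteration stands for one clock cycle.

```c
typedef struct { int q1, q0; } State;

/* Next-state logic: I1 = Q1'Q0a + Q1a' + Q1Q0', I0 = Q0a' + Q0'a. */
State next_state(State s, int a) {
    State n;
    n.q1 = (!s.q1 && s.q0 && a) || (s.q1 && !a) || (s.q1 && !s.q0);
    n.q0 = (s.q0 && !a) || (!s.q0 && a);
    return n;
}

/* Moore output: x = Q1Q0, so x depends on the state only. */
int output_x(State s) { return s.q1 && s.q0; }

/* Runs the divider with a held at 1 and counts the cycles where x = 1. */
int count_pulses(int cycles) {
    State s = { 0, 0 };
    int pulses = 0;
    for (int i = 0; i < cycles; i++) {
        pulses += output_x(s);
        s = next_state(s, 1);
    }
    return pulses;
}
```

Starting from state 0 with a = 1, count_pulses(8) returns 2: one output pulse for every four clock cycles.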


Basic Architecture

• Control unit and datapath
  – Note similarity to single-purpose processor
• Key differences
  – Datapath is general
  – Control unit doesn’t store the algorithm – the algorithm is “programmed” into the memory

[Figure: processor with a control unit (controller, PC, IR) and a datapath (ALU, registers) linked by control/status signals, connected to memory and I/O]


Datapath Operations

• Load
  – Read memory location into register
• ALU operation
  – Input certain registers through ALU, store back in register
• Store
  – Write register to memory location

[Figure: processor and memory; the value 10 is loaded from memory into a register, the ALU adds 1, and the result 11 is stored back to a register and then to memory]


Control Unit

• Control unit: configures the datapath operations
  – Sequence of desired operations (“instructions”) stored in memory – “program”
• Instruction cycle – broken into several sub-operations, each one clock cycle, e.g.:
  – Fetch: Get next instruction into IR
  – Decode: Determine what the instruction means
  – Fetch operands: Move data from memory to datapath register
  – Execute: Move data through the ALU
  – Store results: Write data from register to memory

[Figure: processor (controller, PC, IR, ALU, registers R0 and R1) and memory. Program at addresses 100-102: "load R0, M[500]", "inc R1, R0", "store M[501], R1". Data at addresses 500 and 501; location 500 holds 10.]


Control Unit Sub-Operations

• Fetch
  – Get next instruction into IR
  – PC: program counter, always points to next instruction
  – IR: holds the fetched instruction
• Decode
  – Determine what the instruction means
• Fetch operands
  – Move data from memory to datapath register
• Execute
  – Move data through the ALU
  – This particular instruction (load R0, M[500]) does nothing during this sub-operation
• Store results
  – Write data from register to memory
  – This particular instruction (load R0, M[500]) does nothing during this sub-operation

[Figure, repeated for each sub-operation: the processor with PC = 100 and "load R0, M[500]" in IR; from the fetch-operands step onward R0 holds 10]


Instruction Cycles

Each instruction passes through Fetch, Decode, Fetch ops, Exec., and Store results, one clock cycle each.

• PC=100: load R0, M[500]
  – R0 receives 10 from M[500]
• PC=101: inc R1, R0
  – The ALU adds 1 to R0, so R1 receives 11
• PC=102: store M[501], R1
  – M[501] receives 11

[Figure, one per instruction: the processor and memory from the previous slides with a clk timing diagram of the five sub-operations]
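The three instruction cycles above can be traced with a small software model. This is only a sketch under stated assumptions: the program is kept in a separate array rather than in the same memory as the data, the five sub-operations are collapsed into one function call, and all type and function names are invented for the example.

```c
typedef enum { LOAD, INC, STORE } Opcode;

typedef struct { Opcode op; int rd; int rs; int addr; } Instr;

typedef struct {
    int pc;        /* program counter: address of the next instruction */
    Instr ir;      /* instruction register: the fetched instruction */
    int r[2];      /* datapath registers R0 and R1 */
    int mem[512];  /* data memory M[0..511] */
} Processor;

#define PROGRAM_BASE 100

static const Instr program[] = {
    { LOAD,  0, 0, 500 },  /* 100: load R0, M[500]  */
    { INC,   1, 0, 0   },  /* 101: inc R1, R0       */
    { STORE, 0, 1, 501 },  /* 102: store M[501], R1 */
};

/* One instruction cycle: fetch, decode, fetch operands, execute, store. */
void instruction_cycle(Processor *p) {
    p->ir = program[p->pc - PROGRAM_BASE];  /* Fetch */
    p->pc++;                                /* PC points to the next one */
    switch (p->ir.op) {                     /* Decode */
    case LOAD:
        p->r[p->ir.rd] = p->mem[p->ir.addr];  /* Fetch operands */
        break;                                /* no execute or store work */
    case INC:
        p->r[p->ir.rd] = p->r[p->ir.rs] + 1;  /* Execute in the ALU */
        break;
    case STORE:
        p->mem[p->ir.addr] = p->r[p->ir.rs];  /* Store results */
        break;
    }
}

/* Runs the three-instruction program with M[500] = 10; returns M[501]. */
int run_example(void) {
    Processor p = { .pc = PROGRAM_BASE };
    p.mem[500] = 10;
    for (int i = 0; i < 3; i++) instruction_cycle(&p);
    return p.mem[501];
}
```

run_example() returns 11, matching the value written to M[501] in the walk-through.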


Architectural Considerations

• N-bit processor
  – N-bit ALU, registers, buses, memory data interface
  – Embedded: 8-bit, 16-bit, 32-bit common
  – Desktop/servers: 32-bit, even 64
• PC size determines address space
• Clock frequency
  – Inverse of clock period
  – Clock period must be longer than longest register to register delay in entire processor
  – Memory access is often the longest

[Figure: processor with controller, PC, IR, ALU, and registers, connected to memory and I/O]


Pipelining: Increasing Instruction Throughput

[Figure: non-pipelined and pipelined dish cleaning (Wash, Dry) for dishes 1-8 over time; pipelined instruction execution for instructions 1-8 over time across the stages Fetch-instr., Decode, Fetch ops., Execute, Store res.]
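The throughput gain from pipelining can be estimated with two simple formulas. This sketch assumes an idealized pipeline in which every stage takes one cycle and nothing ever stalls; the function names are illustrative.

```c
/* Without pipelining, each of n instructions occupies all k stages
   before the next instruction starts. */
unsigned long nonpipelined_cycles(unsigned long n, unsigned long k) {
    return n * k;
}

/* With pipelining, the first instruction takes k cycles and every
   later instruction finishes one cycle after its predecessor. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k) {
    return n == 0 ? 0 : k + n - 1;
}
```

For 8 instructions and 5 stages this gives 40 cycles without pipelining and 12 cycles with it.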


Superscalar and VLIW Architectures

• Performance can be improved by:
  – Faster clock (but there’s a limit)
  – Pipelining: slice up instruction into stages, overlap stages
  – Multiple ALUs to support more than one instruction stream
• Superscalar
  – Scalar: non-vector operations
  – Fetches instructions in batches, executes as many as possible
    • May require extensive hardware to detect independent instructions
  – VLIW: each word in memory has multiple independent instructions
    • Relies on the compiler to detect and schedule instructions
    • Currently growing in popularity


Two Memory Architectures

• Princeton
  – Fewer memory wires
• Harvard
  – Simultaneous program and data memory access

[Figure: Harvard: processor connected to separate program memory and data memory. Princeton: processor connected to a single memory (program and data).]


Cache Memory

• Memory access may be slow
• Cache is small but fast memory close to processor
  – Holds copy of part of memory
  – Hits and misses

[Figure: processor, cache, and memory. Cache: fast/expensive technology, usually on the same chip. Memory: slower/cheaper technology, usually on a different chip.]


Programmer’s View

• Programmer doesn’t need detailed understanding of architecture
  – Instead, needs to know what instructions can be executed
• Two levels of instructions:
  – Assembly level
  – Structured languages (C, C++, Java, etc.)
• Most development today done using structured languages
  – But, some assembly level programming may still be necessary
  – Drivers: portion of program that communicates with and/or controls (drives) another device
    • Often have detailed timing considerations, extensive bit manipulation
    • Assembly level may be best for these


Assembly-Level Instructions

[Figure: a sequence Instruction 1, Instruction 2, Instruction 3, Instruction 4, ..., each with the fields opcode, operand1, operand2]

• Instruction Set
  – Defines the legal set of instructions for that processor
    • Data transfer: memory/register, register/register, I/O, etc.
    • Arithmetic/logical: move register through ALU and back
    • Branches: determine next PC value when not just PC+1


A Simple (Trivial) Instruction Set

Instruction format: opcode, operands

Assembly instruct. | First byte | Second byte | Operation
MOV Rn, direct     | 0000 Rn    | direct      | Rn = M(direct)
MOV direct, Rn     | 0001 Rn    | direct      | M(direct) = Rn
MOV @Rn, Rm        | 0010 Rn    | Rm          | M(Rn) = Rm
MOV Rn, #immed.    | 0011 Rn    | immediate   | Rn = immediate
ADD Rn, Rm         | 0100 Rn    | Rm          | Rn = Rn + Rm
SUB Rn, Rm         | 0101 Rn    | Rm          | Rn = Rn - Rm
JZ Rn, relative    | 0110 Rn    | relative    | PC = PC + relative (only if Rn is 0)


Addressing Modes

Addressing mode   | Operand field    | Register-file contents | Memory contents
Immediate         | Data             |                        |
Register-direct   | Register address | Data                   |
Register indirect | Register address | Memory address         | Data
Direct            | Memory address   |                        | Data
Indirect          | Memory address   |                        | Memory address, then Data
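The five addressing modes differ only in how many lookups separate the operand field from the data. A minimal sketch in C, assuming the register file and memory are plain integer arrays and all indices are in range (the names are illustrative):

```c
typedef enum {
    IMMEDIATE, REGISTER_DIRECT, REGISTER_INDIRECT, DIRECT, INDIRECT
} Mode;

/* Returns the data that an operand field refers to in each mode. */
int fetch_operand(Mode mode, int operand_field,
                  const int *regs, const int *mem) {
    switch (mode) {
    case IMMEDIATE:         return operand_field;             /* field is the data */
    case REGISTER_DIRECT:   return regs[operand_field];       /* register holds data */
    case REGISTER_INDIRECT: return mem[regs[operand_field]];  /* register holds address */
    case DIRECT:            return mem[operand_field];        /* field is the address */
    case INDIRECT:          return mem[mem[operand_field]];   /* memory holds address */
    }
    return 0;  /* unreachable for valid modes */
}
```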


Sample Programs

C program

    int total = 0;
    for (int i=10; i!=0; i--) total += i;
    // next instructions...

Equivalent assembly program

          MOV R0, #0;    // total = 0
          MOV R1, #10;   // i = 10
          MOV R2, #1;    // constant 1
          MOV R3, #0;    // constant 0
    Loop: JZ R1, Next;   // Done if i=0
          ADD R0, R1;    // total += i
          SUB R1, R2;    // i--
          JZ R3, Loop;   // Jump always
    Next: // next instructions...

• Try some others
  – Handshake: Wait until the value of M[254] is not 0, set M[255] to 1, wait until M[254] is 0, set M[255] to 0 (assume those locations are ports).
  – (Harder) Count the occurrences of zero in an array stored in memory locations 100 through 199.
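The assembly sample program can be checked with a small interpreter for the simple instruction set. The sketch rests on several assumptions that the slides do not state: the PC counts whole two-byte instructions, the relative offset of JZ is a signed byte added to the PC after it has advanced past the JZ, Rm sits in the low four bits of the second byte, and registers are 8 bits wide. All names are invented for the example.

```c
#include <stdint.h>

enum { MOV_LOAD = 0x0, MOV_STORE = 0x1, MOV_INDIRECT = 0x2,
       MOV_IMMED = 0x3, ADD = 0x4, SUB = 0x5, JZ = 0x6 };

typedef struct { uint8_t first, second; } Instr;

/* First byte: opcode in the high nibble, Rn in the low nibble. */
#define ENC(op, rn, second) \
    { (uint8_t)(((op) << 4) | (rn)), (uint8_t)(second) }

void run(const Instr *prog, int len, uint8_t regs[16], uint8_t mem[256]) {
    int pc = 0;
    while (pc >= 0 && pc < len) {
        Instr in = prog[pc++];
        uint8_t op = in.first >> 4, rn = in.first & 0x0F;
        uint8_t rm = in.second & 0x0F;  /* assumed position of Rm */
        int rel = in.second >= 128 ? in.second - 256 : in.second;
        switch (op) {
        case MOV_LOAD:     regs[rn] = mem[in.second]; break;
        case MOV_STORE:    mem[in.second] = regs[rn]; break;
        case MOV_INDIRECT: mem[regs[rn]] = regs[rm]; break;
        case MOV_IMMED:    regs[rn] = in.second; break;
        case ADD:          regs[rn] = (uint8_t)(regs[rn] + regs[rm]); break;
        case SUB:          regs[rn] = (uint8_t)(regs[rn] - regs[rm]); break;
        case JZ:           if (regs[rn] == 0) pc += rel; break;
        default:           return;  /* undefined opcode: stop */
        }
    }
}

/* Encodes the sample program and returns the final value of R0. */
int sum_example(void) {
    static const Instr prog[] = {
        ENC(MOV_IMMED, 0, 0),   /* 0: MOV R0, #0        */
        ENC(MOV_IMMED, 1, 10),  /* 1: MOV R1, #10       */
        ENC(MOV_IMMED, 2, 1),   /* 2: MOV R2, #1        */
        ENC(MOV_IMMED, 3, 0),   /* 3: MOV R3, #0        */
        ENC(JZ, 1, 3),          /* 4: Loop: JZ R1, Next */
        ENC(ADD, 0, 1),         /* 5: ADD R0, R1        */
        ENC(SUB, 1, 2),         /* 6: SUB R1, R2        */
        ENC(JZ, 3, -4),         /* 7: JZ R3, Loop       */
    };
    uint8_t regs[16] = {0}, mem[256] = {0};
    run(prog, 8, regs, mem);
    return regs[0];
}
```

sum_example() returns 55, the sum 10 + 9 + ... + 1 computed by the C loop.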


Application-Specific Instruction-Set Processors (ASIPs)

• General-purpose processors
  – Sometimes too general to be effective in demanding application
    • e.g., video processing – requires huge video buffers and operations on large arrays of data, inefficient on a GPP
  – But single-purpose processor has high NRE, not programmable
• ASIPs – targeted to a particular domain
  – Contain architectural features specific to that domain
    • e.g., embedded control, digital signal processing, video processing, network processing, telecommunications, etc.
  – Still programmable


A Common ASIP: Microcontroller

• For embedded control applications
  – Reading sensors, setting actuators
  – Mostly dealing with events (bits): data is present, but not in huge amounts
  – e.g., VCR, disk drive, digital camera (assuming SPP for image compression), washing machine, microwave oven
• Microcontroller features
  – On-chip peripherals
    • Timers, analog-digital converters, serial communication, etc.
    • Tightly integrated for programmer, typically part of register space
  – On-chip program and data memory
  – Direct programmer access to many of the chip’s pins
  – Specialized instructions for bit-manipulation and other low-level operations


Another Common ASIP: Digital Signal Processors (DSP)

• For signal processing applications
  – Large amounts of digitized data, often streaming
  – Data transformations must be applied fast
  – e.g., cell-phone voice filter, digital TV, music synthesizer
• DSP features
  – Several instruction execution units
  – Multiply-accumulate single-cycle instruction, other instrs.
  – Efficient vector operations – e.g., add two arrays
    • Vector ALUs, loop buffers, etc.
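The multiply-accumulate operation mentioned above is the inner step of filters and dot products. The C sketch below shows the loop that a DSP would typically map onto one such instruction per iteration; the function name and the 16-bit sample width are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of n samples and n coefficients. Each iteration is one
   multiply-accumulate: acc = acc + x[i] * h[i]. The accumulator is
   wider than the samples to leave headroom for the running sum. */
int32_t dot_product(const int16_t *x, const int16_t *h, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}
```

On a general-purpose processor each iteration costs separate multiply, add, and loop instructions, which is the overhead that DSP features such as loop buffers aim to remove.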


Trend: Even More Customized ASIPs

• In the past, microprocessors were acquired as chips
• Today, we increasingly acquire a processor as Intellectual Property (IP)
  – e.g., synthesizable VHDL model
• Opportunity to add custom datapath hardware and a few custom instructions, or delete a few instructions
  – Can have significant performance, power and size impacts
  – Problem: need compiler/debugger for customized ASIP
    • Remember, most development uses structured languages
    • One solution: automatic compiler/debugger generation
      – e.g., www.tensillica.com
    • Another solution: retargettable compilers
      – e.g., www.improvsys.com (customized VLIW architectures)


Programmer Considerations

• Program and data memory space
  – Embedded processors often very limited
    • e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
• Registers: How many are there?
  – Only a direct concern for assembly-level programmers
• I/O
  – How to communicate with external signals?
• Interrupts


Selecting a Microprocessor

• Issues
  – Technical: speed, power, size, cost
  – Other: development environment, prior expertise, licensing, etc.
• Speed: how to evaluate a processor’s speed?
  – Clock speed – but instructions per cycle may differ
  – Instructions per second – but work per instr. may differ
  – Dhrystone: Synthetic benchmark, developed in 1984. Dhrystones/sec.
    • MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
      – So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
  – SPEC: set of more realistic benchmarks, but oriented to desktops
  – EEMBC – EDN Embedded Benchmark Consortium, www.eembc.org
    • Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
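The Dhrystone MIPS conversion above is a single multiplication or division by 1757. A small sketch, with illustrative function names:

```c
/* 1 Dhrystone MIPS = 1757 Dhrystones per second (VAX 11/780 baseline). */
#define DHRYSTONES_PER_SECOND_PER_MIPS 1757.0

double dhrystones_per_second(double mips) {
    return mips * DHRYSTONES_PER_SECOND_PER_MIPS;
}

double dhrystone_mips(double dhrystones_per_sec) {
    return dhrystones_per_sec / DHRYSTONES_PER_SECOND_PER_MIPS;
}
```

dhrystones_per_second(750.0) gives 1317750.0, the figure quoted on the slide.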


General Purpose Processors

Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price

General Purpose Processors
Intel PIII | 1GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
MIPS R5000 | 250 MHz | 2x32 K 2 way set assoc. | 32/64 | NA | NA | 3.6M | NA
StrongARM SA-110 | 233 MHz | None | 32 | 268 | 1W | 2.1M | NA

Microcontroller
Intel 8051 | 12 MHz | 4K ROM, 128 RAM, 32 I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $7
Motorola 68HC811 | 3 MHz | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5

Digital Signal Processors
TI C5416 | 160 MHz | 128K SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | 32 | 40 | NA | NA | $75

Sources: Intel, Motorola, MIPS, ARM, TI, and IBM Website/Datasheet; Embedded Systems Programming, Nov. 1998


Microprocessor Architecture Overview

• If you are using a particular microprocessor, now is a good time to review its architecture

Microcontroller catalogue

Microcontroller packaging
