SOCSA Slides: Microprocessors

Case Study: Microprocessors

System-on-Chip Solutions & Architectures

A. Herkersdorf

Integrated Systems

Microprocessors

Motivation

Classification and Characteristics

Look Inside

How to Increase CPU Performance

Motivation (1)

Processor-based Digital Systems: Computers with fully programmable, general-purpose processors (PCs, laptops, workstations)

Primary purpose / function is data processing (incl. Web servers, bank servers)

Hardware & software evolve rather independently

However, most processors are deployed in "embedded systems"

Game consoles, PDAs, cell phones, printers, household appliances, …

Cars, industrial robots, …

Motivation (2)

Network Equipment: Internet Router

Routing table entries grow exponentially

Link rates:

2.5 Gb/s: 6.5 Mpps

10 Gb/s: 25 Mpps

Megabyte-sized memories with gigabytes-per-second access bandwidth!

[Figure: growth of the number of routing table entries; time axis ticks 90, 95, 99, 00, 01. Source: http://telstra.net/ops/bgptable.html]

Motivation (3)

Network Equipment: Internet Router

Processing requirements (MIPS) per packet vary substantially depending on application

Tens of thousands of effective MIPS!

Tens of GHz-class processors

Source: Jenkins, "NPU Co-Processors", 2000

            OC-3      OC-12     OC-48     OC-192
b / s       155 M     622 M     2.5 G     10 G
pkts / s    420 K     1.7 M     6.8 M     27 M
s / pkt     2.4 µs    600 ns    150 ns    37 ns

NP case study will tell us how to tackle this challenge!
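The s / pkt row of the table is the reciprocal of the pkts / s row. The following minimal sketch recomputes that time budget and, under the purely illustrative assumption of a single 1 GHz processor running at CPI = 1 (neither value is from the slides), estimates how many instructions fit into one packet time.

```python
# Packet rates [packets/s] as listed in the OC-3 ... OC-192 table.
PACKET_RATES = {"OC-3": 420e3, "OC-12": 1.7e6, "OC-48": 6.8e6, "OC-192": 27e6}


def time_per_packet(pkts_per_s: float) -> float:
    """Time budget for one packet in seconds (reciprocal of the packet rate)."""
    return 1.0 / pkts_per_s


def instructions_per_packet(pkts_per_s: float, f_hz: float = 1e9, cpi: float = 1.0) -> float:
    """Instructions one CPU can spend per packet.

    f_hz and cpi are illustrative assumptions, not values from the slides.
    """
    return f_hz / cpi / pkts_per_s


for link, rate in PACKET_RATES.items():
    print(f"{link}: {time_per_packet(rate) * 1e9:.0f} ns/pkt, "
          f"{instructions_per_packet(rate):.0f} instr/pkt at 1 GHz, CPI = 1")
```

Under these assumptions an OC-192 link leaves only a few tens of instructions per packet, which is why a single general-purpose processor is not sufficient.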

Microprocessor SoC: PowerPC 405GP

[Block diagram: PowerPC 405GP SoC; labels recovered from the figure]

• 32K I-Cache
• 32K D-Cache
• Trace, JTAG
• Processor Local Bus (PLB): up to 133 MHz, 128-bit; attached units use 128-bit connections
• PLB Monitor
• SRAM Ctlr. with 128 KB SRAM
• DDR266 SDRAM Controller: 266 MHz, 32/64-bit with ECC
• PCI-X Bridge: 66-133 MHz 64-bit PCI-X, 33-66 MHz 32/64-bit PCI; 128-bit master, 128-bit slave
• DMA Controller
• 10/100 Ethernet: 1 MII or 2 RMII interfaces
• OPB Bridge to the On-chip Peripheral Bus (OPB)
• UART (2)
• I2C (2)
• Timers
• GPIO
• Interrupt Controller: 13 external interrupts
• RAM/ROM/Peripheral controller and external bus master cntlr.: up to 66 MHz, 32-bit address / 32-bit data

Annotations: Local Bus; Fast & Small SRAM; Slower & larger (S)DRAM; I/O Subsystem (SCSI, PCI, etc.)

Real-World Case Studies

Sonet/SDH Transmission LAN/SAN

Switch

Internet Router

Sonet/SDH Transmission

Control Processors

Classification and Characteristics

Type         Application                  Characteristics                             Remarks
CISC         Personal computer            Complex, variable-length instructions       Intel x86-based
RISC         Embedded control             Load/store instructions for memory access   MIPS, PowerPC
DSP          xDSL modem                   HW multiply for digital filters
VLIW         Set-top box                  Instruction parallelism at compile time     Parallel video pixel processing
Superscalar  Network protocol processing  Instruction parallelism at run time
ASIP         Embedded control             Application-specific instructions           Tensilica

Implementation Strategies for SOC (1)

"Real" Component: System on Board

"Virtual" Component: System on Silicon

Implementation Strategies for SOC (2)

Soft VC Firm VC Hard VC

Architectural extensions

Speed/Area optimized

Soft VC CPU in FPGA SOC

Example: XILINX MicroBlaze CPU

SDRAM Ctrl.

RS-232

GPIO (buttons)

UserLogic (OPB-Master)

GPIO (LEDs)

Debug Logic

Local SRAM

MicroBlaze: 32-bit RISC, 200 MHz, 166 DMIPS

Extensions: I-Cache, D-Cache, HW Multiplier

For comparison:

Hard VC PowerPC 405: 32-bit RISC, 400 MHz, 600 DMIPS
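To compare the two cores independently of their clock rates, the DMIPS figures can be normalized to DMIPS per MHz. A minimal sketch using only the numbers quoted on this slide:

```python
def dmips_per_mhz(dmips: float, f_mhz: float) -> float:
    """Architectural efficiency: Dhrystone MIPS delivered per MHz of clock."""
    return dmips / f_mhz


# (DMIPS, clock in MHz) as quoted on the slide.
cores = {
    "MicroBlaze (soft VC)": (166, 200),
    "PowerPC 405 (hard VC)": (600, 400),
}

for name, (dmips, f_mhz) in cores.items():
    print(f"{name}: {dmips_per_mhz(dmips, f_mhz):.2f} DMIPS/MHz")
```

With these figures the hard core is ahead both in clock rate and in work done per clock cycle.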

MicroBlaze Core

Today’s primary focus

What is “Machine Structure”?

[Figure: layered view of a computer system, from top to bottom]

Software: Applications, Operating System, Compiler, Assembler

Instruction Set Architecture

Hardware: Processor, Memory, I/O system

Datapath & Control

Digital Design

Circuit Design

transistors

Coordination of many levels of abstraction

Levels of Representation

High Level Language Program (e.g., C)

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

Compiler

Assembly Language Program (e.g., MIPS)

    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)

Assembler

Machine Language Program (MIPS)

    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

Machine Interpretation

Control Signal Specification
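To see that the four load/store instructions perform the same swap as the three C statements, the sketch below interprets them on a toy memory. The interpreter, the example addresses and the assumption that register $2 holds the address of v[k] (with 4-byte words) are illustrative and not part of the slides.

```python
def run(program, regs, mem):
    """Execute (op, rt, offset, base) tuples on register and memory dicts."""
    for op, rt, offset, base in program:
        addr = regs[base] + offset
        if op == "lw":        # load word: register <- memory
            regs[rt] = mem[addr]
        elif op == "sw":      # store word: memory <- register
            mem[addr] = regs[rt]
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs, mem


# The four instructions of the assembly-level swap.
SWAP = [
    ("lw", "$t0", 0, "$2"),
    ("lw", "$t1", 4, "$2"),
    ("sw", "$t1", 0, "$2"),
    ("sw", "$t0", 4, "$2"),
]

mem = {100: 11, 104: 22}   # example data: v[k] = 11, v[k+1] = 22
regs = {"$2": 100}         # assumption: $2 holds the address of v[k]
run(SWAP, regs, mem)
print(mem)                 # the two words have traded places
```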

Instruction Set Architecture (ISA)

Defines the interface between software & hardware

Visible hardware state (registers & memory)

A set of instructions that operate on that state

Given an ISA

The hardware implements it

The software uses it

Old SW can use new HW and vice versa

Keep in mind the difference: ISA vs. HW implementation

x86: 80x86, Pentium

[Figure: the Instruction Set as the interface between Software & OS and Hardware]

ISA: What Programmers See

Instruction Set, Registers, Memory Address Space

[Figure: memory address space]

Intel's most frequently used instructions [Hennessy]:
• Load
• Conditional branch
• Compare
• Store
• Add
• And
• Sub
• Move reg-reg

From a total instruction set of ~140

[Figure: i386 register set [Intel]]

Basic System Architecture

[Figure: processor with L1 cache, L2 cache and external memory]

Memory access:
Registers / L1 cache: 1 cycle
L2 cache: 10 cycles
ext. mem: 50 cycles

Spatial and temporal locality of data and code are the reasons why memory hierarchies perform well!

Look Inside

[Figure: processor block diagram with control, program counter, status, register block, accumulator, instr cache and data cache; internal address bus and internal data bus; address i/o and data i/o connect to the external address bus and external data bus]

Microprocessor Architecture

[Figure: processor block diagram, as on the "Look Inside" slide]

[Figure: processor block diagram annotated with the instruction phases]

Instruction fetch (IF)

Instruction decode (ID)

Operand fetch (OF)

Execution (EX)

Write back (WB)

[Figure: processor block diagram annotated with the memory operations]

Memory load (ld)

Memory store (st)

Pipelining

Sequential machine (CPI = 4 - 5):

    add r3,r2,r1    IF ID OF EX WB
    ld  r1,0(r0)    IF ID M  WB
    st  r3,4(r0)    IF ID OF M

Pipelined processor:

    add r3,r2,r1    IF ID OF EX M  WB
    ld  r1,0(r0)       IF ID OF EX M  WB
    st  r3,4(r0)          IF ID OF EX M  WB

Individual instruction may take longer, … multiple instructions execute faster: CPI approaches 1
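The CPI figures on this slide can be reproduced with a simple cycle count. The sketch below assumes the phase counts shown for the sequential machine (5, 4 and 4 phases for add, ld and st) and an ideal six-stage pipeline (IF, ID, OF, EX, M, WB) without stalls; the ideal-pipeline formula is a textbook simplification, not taken from the slides.

```python
def sequential_cycles(phases_per_instr):
    """Sequential machine: each instruction finishes before the next one starts."""
    return sum(phases_per_instr)


def pipelined_cycles(n_instr, depth):
    """Ideal pipeline without stalls: the first instruction needs `depth`
    cycles, every further instruction completes one cycle later."""
    return depth + (n_instr - 1)


seq = sequential_cycles([5, 4, 4])        # add, ld, st
pipe = pipelined_cycles(3, depth=6)

print(f"sequential: {seq} cycles, CPI = {seq / 3:.2f}")
print(f"pipelined:  {pipe} cycles for 3 instructions")
print(f"pipelined CPI over 1000 instructions: {pipelined_cycles(1000, 6) / 1000:.3f}")
```

For three instructions the pipeline fill time still dominates; over long instruction sequences the CPI of the ideal pipeline approaches 1.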

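CPU Pipeline

[Figure: pipeline stages ID, EXE, MEM, WB under common Pipeline Control, separated by buffers; each stage has a logic delay ΣT_logic, each buffer a clock-to-output delay T_c2q and a setup time T_stp]

$$T_{clk} \geq T_{c2q} + \sum T_{logic} + T_{stp}$$

Single-scalar = 1 ALU, CPI_min = 1.0

$$\text{instr. rate [MIPS]} = \frac{f\,[\text{MHz}]}{\text{CPI}}$$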
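A minimal numeric sketch of the timing relation on this slide, assuming its labels mean that one clock period has to cover the buffer's clock-to-output delay (Tc2q), the logic delay of the slowest stage (ΣTlogic) and the setup time of the next buffer (Tstp). All delay values below are hypothetical.

```python
def min_clock_period_ns(stage_logic_ns, t_c2q_ns, t_stp_ns):
    """Shortest usable clock period: the slowest pipeline stage sets the pace."""
    return t_c2q_ns + max(stage_logic_ns) + t_stp_ns


def instr_rate_mips(f_mhz, cpi):
    """Instruction rate in MIPS = f [MHz] / CPI."""
    return f_mhz / cpi


# Hypothetical total logic delay per stage [ns] for ID, EXE, MEM, WB.
stages = [1.8, 2.0, 1.5, 1.7]
t_clk = min_clock_period_ns(stages, t_c2q_ns=0.3, t_stp_ns=0.2)
f_mhz = 1e3 / t_clk   # a period in ns corresponds to 1000 / period MHz

print(f"T_clk = {t_clk:.2f} ns, f = {f_mhz:.0f} MHz, "
      f"{instr_rate_mips(f_mhz, cpi=1.0):.0f} MIPS at CPI = 1.0")
```

Splitting the slowest stage into two shorter ones reduces the maximum logic delay and therefore raises the achievable clock rate, which is the motivation for deeper pipelines.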

Pipelining

Prerequisite for effective pipelining:

Regularity in the sequence of individual instruction phases

Small, regular instruction set

Few, simple addressing modes

Deep pipelining:

Eases processor speed scaling

Increases vulnerability to pipeline problems

Data hazards

Branch conflicts

Data Hazard

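    add r3,r2,r1    IF ID OF EX M  WB
    sub r7,r3,r1       IF ID OF EX M  WB
    and r6,r3,r2          IF ID OF EX M  WB

Dependencies back in time cause data hazards: sub and and fetch operand r3 (OF) before add has written it back (WB)

Eliminate reverse time dependency by stalling:

    add r3,r2,r1    IF ID OF EX M  WB
    sub r7,r3,r1       IF stall ... ID OF EX M  WB
    and r6,r3,r2          (delayed behind sub)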
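How many stall cycles the dependency on r3 costs can be estimated with a small in-order scheduling model. The sketch assumes a six-stage pipeline in the order IF, ID, OF, EX, M, WB, no forwarding, and that a register becomes readable in the cycle after its write back; these are modeling assumptions, not statements from the slides. Stalls are modeled by delaying the start of the dependent instruction.

```python
STAGES = ["IF", "ID", "OF", "EX", "M", "WB"]
OF = STAGES.index("OF")   # stage in which source operands are read
WB = STAGES.index("WB")   # stage in which the result is written


def schedule(program):
    """program: list of (dest, sources). Returns (start_cycle, stalls) per
    instruction for in-order issue without forwarding."""
    ready = {}            # register -> first cycle in which the new value is readable
    result = []
    next_start = 0
    for dest, sources in program:
        start = next_start
        need = max((ready.get(r, 0) for r in sources), default=0)
        stalls = max(0, need - (start + OF))
        start += stalls
        ready[dest] = start + WB + 1
        result.append((start, stalls))
        next_start = start + 1
    return result


program = [
    ("r3", ("r2", "r1")),   # add r3,r2,r1
    ("r7", ("r3", "r1")),   # sub r7,r3,r1
    ("r6", ("r3", "r2")),   # and r6,r3,r2
]
print(schedule(program))
```

In this model the sub absorbs the whole delay; the and instruction, already pushed back behind the sub, finds r3 available and needs no additional stall.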

Branching

IF ID OF EX M WB

bcctr r3

shr r7, r1

and r6,r3,r2

Deviation from sequential program execution

Stall, or exploit advanced concepts like “branch prediction”

If r3 points back in address space, it's more likely that the branch is taken

[Figure: bcctr r3 with possible branch targets addr1 and addr2]
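The rule of thumb on this slide (a target that lies back in the address space suggests a taken branch) corresponds to the static "backward taken, forward not taken" heuristic. The sketch below applies it to a hypothetical loop trace; the addresses and the trace are made up for illustration.

```python
def predict_taken(branch_addr: int, target_addr: int) -> bool:
    """Static heuristic: backward branches (typically loop ends) are predicted taken."""
    return target_addr < branch_addr


def misprediction_rate(trace):
    """trace: iterable of (branch_addr, target_addr, actually_taken)."""
    trace = list(trace)
    wrong = sum(predict_taken(b, t) != taken for b, t, taken in trace)
    return wrong / len(trace)


# Hypothetical loop: the branch at 0x120 jumps back to 0x100 nine times
# and falls through once when the loop ends.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
print(f"misprediction rate: {misprediction_rate(loop_trace):.0%}")
```

Only the final loop exit is mispredicted in this trace; every misprediction costs the pipeline the cycles of the wrongly fetched instructions.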

Performance

What is performance?

Example: Porsche vs. Bus from Munich to Stuttgart

Vehicle   Top speed [km/h]   Distance [km]   Travel time [h]   Capacity [person]   Throughput [pkm/h]
Porsche   260                200             0.77              -                   -
Bus       120                200             1.67              46                  5520

What matters in CPU performance: Fastest possible execution of single instruction?

Shortest program execution time (many instructions)?
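The vehicle example separates latency (travel time of one trip) from throughput (person-kilometres per hour). The sketch below recomputes both; the bus figures are from the slide, while the Porsche capacity of 2 persons is an assumption made only for this illustration.

```python
def travel_time_h(distance_km: float, speed_kmh: float) -> float:
    """Latency of a single trip at top speed."""
    return distance_km / speed_kmh


def throughput_pkm_per_h(capacity_persons: int, speed_kmh: float) -> float:
    """Person-kilometres transported per hour at top speed."""
    return capacity_persons * speed_kmh


vehicles = {
    "Porsche": {"speed": 260, "capacity": 2},    # capacity is an assumption
    "Bus":     {"speed": 120, "capacity": 46},   # capacity from the slide
}

for name, v in vehicles.items():
    print(f"{name}: {travel_time_h(200, v['speed']):.2f} h, "
          f"{throughput_pkm_per_h(v['capacity'], v['speed'])} pkm/h")
```

The Porsche wins on latency and the bus on throughput, which mirrors the question of single-instruction speed versus overall program execution time.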

Processor Performance

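Ultimately interested in CPU execution time: the time the CPU needs to complete a certain program, task or function

$$\text{CPU time} = \frac{\text{Clock cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Instructions / Program: Specific for your application; estimate/count after compilation

Clock cycles / Instruction (CPI): Processor architecture and memory hierarchy dependent

Seconds / Clock cycle = 1 / f_cpu: Processor data sheet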
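The CPU time relation translates directly into a one-line function. The instruction count, CPI and clock frequency in the example call are hypothetical values chosen only to show the units.

```python
def cpu_time_s(instructions: float, cpi: float, f_hz: float) -> float:
    """CPU time = instructions/program x clock cycles/instruction x seconds/clock cycle."""
    return instructions * cpi * (1.0 / f_hz)


# Hypothetical program: 1e9 instructions, CPI = 1.5, 400 MHz clock.
print(f"{cpu_time_s(1e9, 1.5, 400e6):.3f} s")
```

Halving the CPI or doubling the clock frequency has the same effect on CPU time; reducing the instruction count is the compiler's and the ISA's contribution.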

Processor Performance

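$$\mathrm{CPI} = \mathrm{CPI}_{CPU} + \mathrm{CPI}_{MEM}$$

$$\mathrm{CPI}_{MEM} = \mathrm{CPI}_{Iaccess} + \mathrm{CPI}_{Daccess}$$

$$= \text{IFreq} \times \text{L1miss\_rate} \times (\text{L1miss\_penalty} + \text{L2miss\_rate} \times \text{L2miss\_penalty}) + \text{DaccFreq} \times \text{L1miss\_rate} \times (\text{L1miss\_penalty} + \text{L2miss\_rate} \times \text{L2miss\_penalty})$$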

Processor Performance - Example

Pipelined RISC CPU: CPI_CPU = 1.2

Two-level cache hierarchy:
L1miss_rate = 5%; L1miss_penalty = 10 cycles
L2miss_rate = 3%; L2miss_penalty = 50 cycles
DaccFreq = 20%
CPI_MEM = 0.69

0.15% instr./data accesses to system memory degrade overall performance (CPU execution time) by 57%

$$\frac{\mathrm{CPI}_{miss}}{\mathrm{CPI}_{no\_miss}} = \frac{1.89}{1.2} = 1.57$$
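The example numbers can be checked by evaluating the CPI_MEM formula directly. The sketch assumes IFreq = 1 (one instruction fetch per instruction); this value is not stated on the slide, but it is the one that reproduces CPI_MEM = 0.69.

```python
def cpi_mem(ifreq, daccfreq, l1_miss_rate, l1_miss_penalty, l2_miss_rate, l2_miss_penalty):
    """Memory stall cycles per instruction for a two-level cache hierarchy."""
    per_access = l1_miss_rate * (l1_miss_penalty + l2_miss_rate * l2_miss_penalty)
    return (ifreq + daccfreq) * per_access


CPI_CPU = 1.2
mem = cpi_mem(ifreq=1.0,            # assumption: one fetch per instruction
              daccfreq=0.20,
              l1_miss_rate=0.05, l1_miss_penalty=10,
              l2_miss_rate=0.03, l2_miss_penalty=50)
total = CPI_CPU + mem
system_memory_share = 0.05 * 0.03   # accesses that miss in both L1 and L2

print(f"CPI_MEM = {mem:.2f}, CPI = {total:.2f}, slowdown = {total / CPI_CPU:.3f}")
print(f"accesses reaching system memory: {system_memory_share:.2%}")
```

The ratio evaluates to about 1.575, which is the roughly 57% degradation quoted on the slide.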

Microprocessor Performance

[Xilinx]

How to Increase CPU Performance?

• Pipelining

• Application specific ISA extensions

• Multiple ALUs and Control units

• Superscalar

• VLIW (Very Long Instruction Word)

• Multithreading

• Memory hierarchy design

DSP Architecture

HW multiply unit

[Figure: processor block diagram, as on the "Look Inside" slide, extended with a Multiply unit]
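The reason a hardware multiplier pays off in a DSP is the multiply-accumulate (MAC) operation at the core of digital filters. The sketch below is a plain FIR filter in which every filter tap costs one MAC; the function name and the sample data are illustrative.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0                                  # the accumulator
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]        # one multiply-accumulate per tap
        out.append(acc)
    return out


# Two-tap filter with coefficients 1, 1: each output is the sum of two neighbours.
print(fir([1, 2, 3, 4], [1, 1]))
```

Without a hardware multiplier each MAC takes many cycles of shift-and-add; with it, the inner loop can approach one tap per cycle.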

SIMD / MIMD

[Figure: processor block diagram, as on the "Look Inside" slide, with the Datapath highlighted]

Single Instruction Multiple Data (SIMD):
• single control / multiple datapaths

Multiple Instruction Multiple Data (MIMD):
• multiple controls

VLIW – Very Long Instruction Word

[Figure: an Optimizing Compiler maps a Sequential Program (Instr i-2, Instr i-1, Instr i, Instr i+1, Instr i+2) onto one very long instruction word with the fields InstrDP1, InstrDP2, InstrDP3, InstrDP4, ..., InstrDPn-1, InstrDPn; each field controls one of the datapaths DP 1 ... DP n, which share the Registers]

Determined during compile-time

Superscalar

Instr fetch (IF_i ... IF_i+3)

Instr pre-decode (ID1_i ... ID1_i+3)

Instr distribute

ID2_i+1, ID2_i+2, ID2_i+3

Decided at run-time

Multithreading in Hardware

[Figure: processor block diagram, as on the "Look Inside" slide]

Multiple register banks

Multithreading in Software

[Figure: processor block diagram, as on the "Look Inside" slide]

Load/save register status
