Very long instruction word architectures Nicholas FitzRoy-Dale For cs9244, S38n

Page 1: Very long instruction word architectures

Very long instruction word architectures

Nicholas FitzRoy-Dale

For cs9244, S38n

Page 2: Very long instruction word architectures

Source: http://imagelab.ing.unimo.it/architettura/varie/periferiche/DS80C88.pdf

Functional Diagram (80C88)

[Figure: block diagram of the 80C88. The execution unit holds the register file (data pointer and index registers, 8 words: AH/AL, BH/BL, CH/CL, DH/DL, SP, BP, SI, DI), a 16-bit ALU and the flags. The bus interface unit holds the relocation register file (segment registers and instruction pointer, 5 words: ES, CS, SS, DS, IP), a 4-byte instruction queue and the memory interface. Internal A-, B- and C-buses, the execution unit control system, and the control and timing logic are also shown.]

Page 3: Very long instruction word architectures

Intel Technology Journal Q1, 2001

The Microarchitecture of the Pentium® 4 Processor

Basic Pentium III processor misprediction pipeline (10 stages):
1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec

Basic Pentium 4 processor misprediction pipeline (20 stages):
1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive

Figure 3: Misprediction Pipeline

[Block diagram. Front end: instruction TLB/prefetcher, front-end BTB (4K entries), instruction decoder, trace cache (12K µops), trace cache BTB (512 entries), microcode ROM, µop queue. Out-of-order logic: allocator / register renamer, memory µop queue, integer/floating-point µop queue, memory scheduler, fast scheduler, slow/general FP scheduler, simple FP scheduler. Execution: integer register file / bypass network, FP register / bypass, two 2x ALUs (simple instructions), slow ALU (complex instructions), load address AGU, store address AGU, FP/MMX/SSE/SSE2 unit, FP move unit, L1 data cache (8 Kbyte, 4-way). Memory subsystem: L2 cache (256 Kbyte, 8-way), bus interface unit, quad-pumped 3.2 GB/s system bus. Link labels: 256 bits, 64 bits wide, 48 GB/s.]

Figure 4: Pentium® 4 processor microarchitecture

NETBURST MICROARCHITECTURE

Figure 4 shows a more detailed block diagram of the NetBurst microarchitecture of the Pentium 4 processor. The top-left portion of the diagram shows the front end of the machine. The middle of the diagram illustrates the out-of-order buffering logic, and the bottom of the diagram shows the integer and floating-point execution units and the L1 data cache. On the right of the diagram is the memory subsystem.

Front End

The front end of the Pentium 4 processor consists of several units as shown in the upper part of Figure 4. It has the Instruction TLB (ITLB), the front-end branch predictor (labeled here Front-End BTB), the IA-32 Instruction Decoder, the Trace Cache, and the Microcode ROM.

Source: http://www.intel.com/technology/itj/q12001/pdf/art_2.pdf

Page 4: Very long instruction word architectures

Superscalar vs VLIW

[Diagram: superscalar processor. Blocks shown: instruction cache; buffer, decoder and dispatcher; four execution units; register file; data cache; reorder buffer.]

Superscalar

Page 5: Very long instruction word architectures

Superscalar vs VLIW

[Diagram: VLIW processor. Blocks shown: instruction cache; instruction register; four execution units; register file; data cache.]

VLIW

Page 6: Very long instruction word architectures

What changes?

➔ We lose:
    ➔ Dynamic scheduling
    ➔ Branch prediction?
    ➔ Register renaming?
    ➔ Interlocks?

➔ We gain:
    ➔ Explicit parallelism
    ➔ Predication?

Page 7: Very long instruction word architectures

Compiler techniques

➔ Loop parallelism

➔ Global code motion

➔ Predication

➔ Speculation

Page 8: Very long instruction word architectures

Loop parallelism

Unrolling

Original loop:

    counter = 0
    loop
        arr2[counter] += arr1[counter]
        counter += 1
    until counter == 8

Unrolled four times:

    counter = 0
    loop
        arr2[counter] += arr1[counter]
        arr2[counter+1] += arr1[counter+1]
        arr2[counter+2] += arr1[counter+2]
        arr2[counter+3] += arr1[counter+3]
        counter += 4
    until counter == 8

Software pipelining

Original loop:

    counter = 0
    loop
        arr2[counter] += arr1[counter]
        arr3[counter] %= arr2[counter]
        counter += 1
    until counter == 8

Software-pipelined:

    STARTUP
    counter = 1
    loop
        arr2[counter] += arr1[counter]
        arr3[counter-1] %= arr2[counter-1]
        counter += 1
    until counter == 8
    SHUTDOWN
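The four loops on this slide can be checked against each other with a small Python sketch. The function names are made up for illustration, and the bodies of STARTUP and SHUTDOWN are an assumption: the slide does not spell them out, so the sketch takes them to be the first add and the last modulo.

```python
N = 8  # trip count used on the slide


def rolled(arr1, arr2):
    """Original loop: one add per iteration."""
    out = list(arr2)
    for counter in range(N):
        out[counter] += arr1[counter]
    return out


def unrolled(arr1, arr2):
    """Unrolled by four: the four adds are independent of each other."""
    out = list(arr2)
    for counter in range(0, N, 4):
        out[counter] += arr1[counter]
        out[counter + 1] += arr1[counter + 1]
        out[counter + 2] += arr1[counter + 2]
        out[counter + 3] += arr1[counter + 3]
    return out


def two_stage(arr1, arr2, arr3):
    """Original two-statement loop: the modulo depends on the add before it."""
    out2, out3 = list(arr2), list(arr3)
    for counter in range(N):
        out2[counter] += arr1[counter]
        out3[counter] %= out2[counter]
    return out2, out3


def pipelined(arr1, arr2, arr3):
    """Software-pipelined: the add for element i runs alongside the modulo for i-1."""
    out2, out3 = list(arr2), list(arr3)
    out2[0] += arr1[0]                          # STARTUP (assumed prologue)
    for counter in range(1, N):
        out2[counter] += arr1[counter]          # stage 1, iteration counter
        out3[counter - 1] %= out2[counter - 1]  # stage 2, iteration counter-1
    out3[N - 1] %= out2[N - 1]                  # SHUTDOWN (assumed epilogue)
    return out2, out3
```

Inside the pipelined loop body the two statements touch different elements, so a VLIW compiler is free to place them in the same instruction word.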

Page 9: Very long instruction word architectures

Global code motion

➔ AKA “moving instructions across branches”

➔ Pure-software
    ➔ Trace scheduling
    ➔ Superblocks

➔ Hardware-assisted
    ➔ Predication
    ➔ Speculation

Page 10: Very long instruction word architectures

Trace scheduling

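[Diagram: a control-flow graph of basic blocks A, B, C, D and E, shown next to its trace-scheduled version, whose blocks are labelled A, C, E, B, FIXUP and D.]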

Page 11: Very long instruction word architectures

Superblocks

➔ A simplified form of traces

➔ Traces: multiple entry, multiple exit

➔ Superblocks: single entry, multiple exit

Page 12: Very long instruction word architectures

Predication

With a branch:

    bz R1, post
    add R2, R2, R3
post: ...

Predicated:

    add.pred R1, R2, R2, R3
    ...

With branches:

    if(a == 0) {
        x();
        y();
    } else {
        c();
        d();
    }

Predicated:

    a==0? x();
    a==0? y();
    a!=0? c();
    a!=0? d();
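A minimal Python sketch of the if/else example above, assuming a toy model in which every predicated operation is issued and only those whose predicate is true take effect. The names `branching` and `predicated` are hypothetical, and the operations are recorded as strings rather than executed.

```python
def branching(a):
    """Control flow decides which operations are issued at all."""
    done = []
    if a == 0:
        done += ["x", "y"]
    else:
        done += ["c", "d"]
    return done


def predicated(a):
    """Straight-line code: all four operations are issued, each guarded by a predicate."""
    p_true = (a == 0)   # predicate for the "then" side
    p_false = (a != 0)  # predicate for the "else" side
    program = [(p_true, "x"), (p_true, "y"), (p_false, "c"), (p_false, "d")]
    done = []
    issued = 0
    for predicate, op in program:
        issued += 1            # the operation occupies an issue slot either way
        if predicate:
            done.append(op)    # only operations with a true predicate commit
    return done, issued
```

Both versions perform the same operations. The predicated one has no branch, at the cost of issuing operations that are then discarded.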

Page 13: Very long instruction word architectures

Speculation

Without speculation:

    ld R1, 0(R2)
    bz R1, else
    ld R3, 8(R2)
    j out
else: ld R3, 16(R2)
out: ...

With the load hoisted above the branch:

    ld R1, 0(R2)
    ld.spec R14, 8(R2)
    bnz R1, out
    ld R14, 16(R2)
out: ...
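The two listings above can be modelled in Python as a rough sketch. Memory is a dictionary, an unmapped address raises `KeyError`, and a speculative load is assumed to defer its fault by returning a marker value instead of raising. The marker `POISON` and all function names are hypothetical. Deferring the fault is how speculative loads are commonly made safe, but the slide itself does not describe the mechanism.

```python
POISON = object()  # hypothetical marker for a deferred fault


def load(mem, addr):
    """Ordinary load: faults (KeyError) on an unmapped address."""
    return mem[addr]


def load_spec(mem, addr):
    """Speculative load (assumed behaviour): never faults, yields POISON instead."""
    return mem.get(addr, POISON)


def branching(mem, r2):
    r1 = load(mem, r2)
    if r1 == 0:
        return load(mem, r2 + 16)   # else: path
    return load(mem, r2 + 8)        # fall-through path


def speculative(mem, r2):
    r1 = load(mem, r2)
    r14 = load_spec(mem, r2 + 8)    # hoisted above the branch
    if r1 != 0:
        if r14 is POISON:           # the fault surfaces only if the value is needed
            raise KeyError(r2 + 8)
        return r14
    return load(mem, r2 + 16)       # speculative result is discarded
```

When the fall-through path is not taken, the hoisted load's result is simply thrown away, so an invalid address on that path does no harm.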

Page 14: Very long instruction word architectures

History

Joseph Fisher

➔ ELI
    ➔ Proposed by Joseph Fisher
    ➔ 512-bit words
    ➔ Trace scheduling
    ➔ 1979

Page 15: Very long instruction word architectures

MultiFlow Trace

➔ Successor to ELI

➔ 1984-1990

➔ 256/512-bit IW

➔ Compiler outlived hardware

MultiFlow Trace

Page 16: Very long instruction word architectures

Cydrome Cydra 5

➔ 256-bit instruction word

➔ 6 operations / cycle

➔ or 1 operation / cycle

➔ All predicated

➔ “Directed dataflow” architecture

Cydra 5

Page 17: Very long instruction word architectures

Analog Devices SHARC

ADSP-2136x SHARC Processor Programming Reference, page 1-3

Introduction

• External port for interfacing to off-chip SDRAM (ADSP-21367/8/9 processors) and configuring a shared memory system with up to four other ADSP-21368 SHARC processors

• Parallel port for interfacing to off-chip memory and peripherals (ADSP-21362/3/4/5/6 processors)

Figure 1-1 also shows the three on-chip buses of the ADSP-2136x processors: the PM bus, DM bus, and I/O bus. The PM bus provides access to either instructions or data. During a single cycle, these buses let the processor access two data operands from memory, access an instruction (from the cache), and perform a DMA transfer.

Figure 1-1. ADSP-21362/3/4/5/6 SHARC Processor Block Diagram

[Block diagram. Core processor: instruction cache (32 x 48-bit), program sequencer, timer, DAG1 and DAG2 (8x4x32 each), processing elements PEX and PEY, PX register. Buses: 32-bit PM and DM address buses, 64-bit PM and DM data buses. Memory: 4 blocks of on-chip memory (blocks 0 to 3; SRAM 1 Mbit, ROM 2 Mbit, SRAM 0.5 Mbit), each with ADDR/DATA and IOA/IOD ports. I/O processor and peripherals: memory-mapped IOP registers, signal routing unit, SPI, SPORTs, IDP, PCG, timers, SRC, SPDIF, DTCP. JTAG test and emulation.]

SHARC 21362

Page 18: Very long instruction word architectures

SHARC

➔ “Super Harvard” architecture

➔ DAG1, DAG2

➔ PEx, PEy

➔ 16 registers, all universal

Page 19: Very long instruction word architectures

SHARC

ADSP-2136x SHARC Processor Programming Reference, page 8-5

Instruction Set

(DM) and R4, S4 (PM) respectively. In the second instruction, values are simultaneously read from data memory to registers R8 and S8 and written to program memory from registers R0 and S0.

R0=DM(I1,M1);

When the processor is in broadcast mode (the BDCST1 bit is set in the MODE1 system register), the R0 (PEx) data register in this example is loaded with the value from data memory utilizing the I1 register from DAG1, and S0 (PEy) is loaded with the same value.

Type 1 Opcode

Bits 47-23: 001 | DMD | DMI | DMM | PMD | DM DREG | PMI | PMM | PM DREG
Bits 22-0:  COMPUTE

Bits                Description
DMD, PMD            Select the access types (read or write)
DM DREG, PM DREG    Specify register file location
DMI, PMI            Specify I registers for data and program memory
DMM, PMM            Specify M registers used to update the I registers
COMPUTE             Defines a compute operation to be performed in parallel with the data accesses; if omitted, this is a NOP

SHARC Group 1, Type 1 instruction
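To make the field layout concrete, here is a hedged Python sketch that packs and unpacks a 48-bit Type 1 word. The field order and the 47-23 / 22-0 split come from the slide. The individual field widths are an assumption, chosen so that DREG can name 16 registers and each I and M field can name 8 registers, and so that the widths add up to 48 bits. `FIELDS`, `encode` and `decode` are illustrative names, not part of any Analog Devices tool.

```python
# (name, width in bits), most significant field first.
# Widths are ASSUMED for illustration; only the order is taken from the slide.
FIELDS = [
    ("FIXED", 3),     # the constant 001 that identifies the instruction type
    ("DMD", 1),
    ("DMI", 3),
    ("DMM", 3),
    ("PMD", 1),
    ("DM_DREG", 4),
    ("PMI", 3),
    ("PMM", 3),
    ("PM_DREG", 4),
    ("COMPUTE", 23),  # bits 22-0
]
WORD_BITS = sum(width for _, width in FIELDS)  # 48 under the assumed widths


def encode(values):
    """Pack a dict of field values into one integer instruction word."""
    values = dict(values, FIXED=0b001)
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)  # an omitted COMPUTE stays 0
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Split an instruction word back into its fields."""
    fields = {}
    shift = WORD_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

For example, `decode(encode({"DMI": 1, "DMM": 1}))` returns a dictionary in which `DMI` and `DMM` are 1 and `COMPUTE` is 0.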

Page 20: Very long instruction word architectures

Matrix × matrix

mxnxnxp:
    bit set MODE1 PEYEN | CBUFEN;   /* enable SIMD mode, circular buffers */

    lcntr=R15, do matrix until lce;
        lcntr=r7, do column until lce;
            r5=i0;                          /*store i0 in r5*/
            lcntr=R14, do row until lce;    /*N/2 times*/
            /*calc mat_A * mat_B,accumulate,read mat_A and mat_B */
row:        f12=f0*f4, f8=f8+f12, f0=dm(i0,m1), f4=pm(i8,m9);
            f12=f0*f4, f8=f8+f12;           /*multiply, accumulate*/
            f8=f8+f12;                      /*final accumulate*/
            r4=r4 xor r4, r9<->s8;          /*move Pey to Pex*/
            f8=f8+f9, r0=r4;                /*add values for result*/
            r8=r8 xor r8, dm(i1,m0)=r8;     /*clear r8, save values in mat_c */
column:     r12=r12 xor r12, i0=r5;         /*clear f12, restore i0 with r5*/
        r5=r5+r6;                           /*accumulate r5 by N for next row of matrix B*/
matrix: i0=r5;      /* loads i0 with a value which points to the next row for matrix B*/
    rts(db);
    bit clr mode1 PEYEN | CBUFEN;   /*disable SIMD and circular buffers*/
    /*the last value in the buffer is a dummy value and here it is cleared*/
    dm(i1,m0)=0;
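The comments in the listing ("N/2 times", "move Pey to Pex", "add values for result") suggest that each dot product is split across the two processing elements and combined at the end. The Python sketch below illustrates that idea only. Which elements each lane handles is an assumption, and `simd_dot` and `matmul` are hypothetical names rather than a translation of the assembly.

```python
def simd_dot(row, col):
    """Dot product with two accumulators, one per (assumed) processing element."""
    if len(row) != len(col) or len(row) % 2 != 0:
        raise ValueError("this sketch needs equal, even-length vectors")
    acc_x = 0.0  # partial sum kept by PEx
    acc_y = 0.0  # partial sum kept by PEy
    for k in range(0, len(row), 2):          # N/2 iterations
        acc_x += row[k] * col[k]             # lane x: even elements (assumed split)
        acc_y += row[k + 1] * col[k + 1]     # lane y: odd elements (assumed split)
    return acc_x + acc_y                     # move PEy result to PEx and add


def matmul(a, b):
    """Matrix product built from the two-lane dot product."""
    columns = list(zip(*b))
    return [[simd_dot(row, col) for col in columns] for row in a]
```

Each loop iteration performs two multiply-accumulates, which is why the inner loop of the assembly runs N/2 times rather than N.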

Page 21: Very long instruction word architectures

EPIC

Montecito

Page 22: Very long instruction word architectures

Itanium

➔ 128-bit-wide “bundles”

➔ 128 GP registers

➔ 128 FP registers

➔ 64 predicate registers

➔ 8 branch registers

➔ Comparatively short pipeline

Page 23: Very long instruction word architectures

Pipelines and complexity

Pentium 4 (20 stages):
1-2 TC IP, 3-4 TC Fetch, 5 Drv, 6 Alc, 7-8 Rename, 9 Q, 10-12 Sch, 13-14 Dsp, 15-16 RF, 17 Ex, 18 Flg, 19 BrCk, 20 Drv

Itanium 2 (8 stages):
1 IPG, 2 ROT, 3 EXP, 4 REN, 5 REG, 6 EXE, 7 DET, 8 WRB

P6 family (11 stages):
1-3 IFU, 4-5 DEC, 6 RAT, 7 ROB, 8 DIS, 9 EX, 10-11 RET
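One way to see why pipeline depth matters is the usual back-of-the-envelope estimate of branch misprediction cost. The sketch below is illustrative only: it assumes the penalty is roughly the pipeline depth, and the base CPI, branch fraction and misprediction rate are made-up inputs, not measurements of these processors.

```python
# Pipeline depths as listed on this slide.
DEPTHS = {"Pentium 4": 20, "P6 family": 11, "Itanium 2": 8}


def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall contributed by mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def compare(base_cpi=1.0, branch_fraction=0.2, mispredict_rate=0.1):
    """Estimate CPI per design, taking penalty ~= pipeline depth (an assumption)."""
    return {
        name: cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, depth)
        for name, depth in DEPTHS.items()
    }
```

With identical inputs, the deeper pipeline always comes out with the higher estimate, so it depends more heavily on accurate branch prediction.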

Page 24: Very long instruction word architectures

EPIC’s big ideas

➔ Bundles: avoid hardware dependence

➔ Speculation

➔ Predication

Page 25: Very long instruction word architectures

Bundling

Tmpl (5 bits) | Instruction 1 (41 bits) | Instruction 2 (41 bits) | Instruction 3 (41 bits)

EPIC bundle

Template    Slot 0    Slot 1    Slot 2
0           M         I         I
1           M         I         I
2           M         I         I
3           M         I         I

Four bundle templates
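The bundle layout above (5 + 41 + 41 + 41 = 128 bits) can be sketched as bit packing in Python. The sketch assumes the template sits in the least significant bits with the three slots above it. `pack_bundle` and `unpack_bundle` are illustrative names, and the slot contents are treated as opaque integers.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instruction slots into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template  # template in the low bits (assumed placement)
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover the template and the three instruction slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots
```

The template tells the hardware which kind of unit each slot needs (for example M, I, I), so the hardware does not have to work that out from the instruction bits.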

Page 26: Very long instruction word architectures

Other EPIC features

➔ Register “renaming”

➔ All instructions predicated

➔ Speculative loads
    ➔ For control and data

Page 27: Very long instruction word architectures

Convergence

➔ Backwards compatibility

➔ Universal predication

➔ Branch prediction hints and speculative loads