
The Vector-Thread Architecture

Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

ISCA 2004


Goals For Vector-Thread Architecture

• Primary goal is efficiency
  - High performance with low energy and small area
• Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP
  - Allow intermixing of multiple levels of parallelism
• Programming model is key
  - Encode parallelism and locality in a way that enables a complexity-effective implementation
  - Provide clean abstractions to simplify coding and compilation


Vector and Multithreaded Architectures

• Vector processors provide efficient DLP execution
  - Amortize instruction control
  - Amortize loop bookkeeping overhead
  - Exploit structured memory accesses
• Unable to execute loops with loop-carried dependencies or complex internal control flow
• Multithreaded processors can flexibly exploit TLP
• Unable to amortize common control overhead across threads
• Unable to exploit structured memory accesses across threads
• Costly memory-based synchronization and communication between threads

[Figure: a vector machine, where a control processor broadcasts vector control to PE0-PEN over a shared memory, contrasted with a multithreaded machine, where PE0-PEN each run under their own thread control.]


Vector-Thread Architecture

• VT unifies the vector and multithreaded compute models
• A control processor interacts with a vector of virtual processors (VPs)
• Vector-fetch: the control processor fetches instructions for all VPs in parallel
• Thread-fetch: a VP fetches its own instructions
• VT allows a seamless intermixing of vector and thread control

[Figure: the control processor issues vector-fetches to the whole vector of virtual processors VP0-VPN, while each VP can also issue its own thread-fetches; all share one memory.]


Virtual Processor Vector 

• A VT architecture includes a control processor and a virtual processor vector
  - Two interacting instruction sets
• A vector-fetch command allows the control processor to fetch an AIB (atomic instruction block) for all the VPs in parallel
• Vector-load and vector-store commands transfer blocks of data between memory and the VP registers

[Figure: the control processor issues vector-fetches to VP0-VPN, each drawn with its own registers and ALUs; a vector memory unit carries vector-load and vector-store data between memory and the VP registers.]
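To make these commands concrete, here is a toy software model of how the control processor maps a simple vectorizable loop, C[i] = A[i] + B[i], onto the VP vector. It is a sketch written for this summary in plain C: none of the function names are real SCALE primitives, and the VP registers are modeled as arrays. Vector-loads fill one register per VP, a vector-fetch runs the add AIB on every VP, and a vector-store writes the results back.

    #include <stdio.h>

    #define NUM_VPS 4   /* hypothetical number of virtual processors */

    /* Toy per-VP register state for one cluster. */
    static int vp_a[NUM_VPS], vp_b[NUM_VPS], vp_c[NUM_VPS];

    /* vector-load: move a block of memory into one register of each VP. */
    static void vector_load(const int *mem, int *vp_reg) {
        for (int vp = 0; vp < NUM_VPS; vp++) vp_reg[vp] = mem[vp];
    }

    /* vector-fetch: every VP executes the same AIB (here, a single add). */
    static void vector_fetch_add_aib(void) {
        for (int vp = 0; vp < NUM_VPS; vp++) vp_c[vp] = vp_a[vp] + vp_b[vp];
    }

    /* vector-store: move one register of each VP back to memory. */
    static void vector_store(int *mem, const int *vp_reg) {
        for (int vp = 0; vp < NUM_VPS; vp++) mem[vp] = vp_reg[vp];
    }

    int main(void) {
        int A[NUM_VPS] = {1, 2, 3, 4}, B[NUM_VPS] = {10, 20, 30, 40}, C[NUM_VPS];
        vector_load(A, vp_a);
        vector_load(B, vp_b);
        vector_fetch_add_aib();
        vector_store(C, vp_c);
        for (int i = 0; i < NUM_VPS; i++) printf("%d ", C[i]);   /* 11 22 33 44 */
        printf("\n");
        return 0;
    }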


Cross-VP Data Transfers

• Cross-VP connections provide fine-grain data operand communication and synchronization
  - VP instructions may target nextVP as a destination or use prevVP as a source
  - The crossVP queue holds wrap-around data; the control processor can push and pop
  - The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs

[Figure: VP0-VPN connected in a ring through crossVP queues; vector-fetches come from the control processor, which can also crossVP-push into the start of the ring and crossVP-pop from the end.]


Mapping Loops to VT

• A broad class of loops maps naturally to VT
  - Vectorizable loops
  - Loops with loop-carried dependencies
  - Loops with internal control flow
• Each VP executes one loop iteration
  - The control processor manages the execution
  - Stripmining enables implementation-dependent vector lengths (sketched below)
• The programmer or compiler only schedules one loop iteration on one VP
  - No cross-iteration scheduling
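The stripmining point is easiest to see in code. Below is a minimal sketch, assuming a hypothetical helper that returns the vector length the hardware grants for the remaining iterations; the real mechanism is a control processor instruction, not this C function, and the printf stands in for the vector commands that would be issued for each block.

    #include <stddef.h>
    #include <stdio.h>

    #define MAX_VPS 4   /* hypothetical hardware limit on active VPs */

    /* Hypothetical stand-in for the control processor's vector-length setup. */
    static size_t set_vector_length(size_t remaining) {
        return remaining < MAX_VPS ? remaining : MAX_VPS;
    }

    int main(void) {
        const size_t n = 10;          /* 10 loop iterations, not a multiple of MAX_VPS */
        size_t i = 0;
        while (i < n) {               /* stripmine loop on the control processor */
            size_t vl = set_vector_length(n - i);
            /* Here the control processor would issue vector-loads, a vector-fetch
             * of the loop-body AIB, and vector-stores for iterations [i, i+vl);
             * each VP runs exactly one iteration. */
            printf("iterations %zu..%zu on %zu VPs\n", i, i + vl - 1, vl);
            i += vl;
        }
        return 0;
    }

Because the vector length is discovered at run time, the same binary runs unchanged on implementations with different numbers of VPs.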


Loop-Carried Dependencies

• Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers
  - A vector-fetch introduces a chain of prevVP receives and nextVP sends
  - Vector-memory commands are still used even when the compute is non-vectorizable

[Figure: the control processor issues two vector-loads, a vector-fetch, and a vector-store; on each VP the fetched AIB performs a shift (<<), a multiply (x), and an add (+) before the store, and a cross-iteration value passes from each VP to the next.]
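As an illustration only, the loop below is a made-up kernel chosen to match the shift/multiply/add/store pattern in the diagram, not a benchmark from the paper. It has exactly one cross-iteration value: under VT, x[i] and y[i] arrive by vector-load, the running value acc is received from prevVP and sent to nextVP, out[i] leaves by vector-store, and the control processor pushes the initial seed into the crossVP queue and pops the final value back out.

    /* Hypothetical loop-carried kernel: one value (acc) circulates around the
     * cross-VP ring while the rest of each iteration is independent work. */
    void running_mac(const int *x, const int *y, int *out, int n,
                     int shift, int seed)
    {
        int acc = seed;                        /* value passed prevVP -> nextVP */
        for (int i = 0; i < n; i++) {
            int t = (x[i] << shift) * y[i];    /* independent per-iteration work */
            acc += t;                          /* the only cross-iteration dependency */
            out[i] = acc;                      /* written by a vector-store */
        }
    }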


Loops with Internal Control Flow

• Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
  - Vector commands and thread-fetches can be freely intermixed
  - Once launched, the VP threads execute to completion before the next control processor command

[Figure: the control processor issues a vector-load, a vector-fetch, and a vector-store; every VP executes a load, a compare (==), and a branch, and only some VPs thread-fetch an additional AIB containing extra loads and a store.]
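Here is a hedged example of the kind of loop meant here; the function, table lookup, and fast-path value are invented for illustration. The load and comparison would be vector-fetched for every VP, and only the VPs that fall into the expensive case would thread-fetch a further AIB holding the extra loads and the store.

    /* Hypothetical data-parallel loop with a large conditional.  Under VT, the
     * common path (load, compare, branch) runs in a vector-fetched AIB; the
     * else-branch body would live in a separate AIB reached by a thread-fetch
     * issued only by the VPs that need it. */
    void threshold_filter(const int *in, const int *table, int *out, int n,
                          int fast_value)
    {
        for (int i = 0; i < n; i++) {
            int v = in[i];                 /* vector-load */
            if (v == fast_value) {         /* compare + branch in the vector-fetched AIB */
                out[i] = 0;                /* cheap special case, no thread-fetch */
            } else {
                out[i] = table[v] + v;     /* expensive case: thread-fetched AIB */
            }
        }
    }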


VT Physical Model

• A vector-thread unit (VTU) contains an array of lanes with physical register files and execution units
• VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
• Independent parallel lanes exploit parallelism across VPs and data operand locality within VPs

[Figure: a vector-thread unit with four lanes, each containing an ALU and register state for four VPs (lane 0 holds VP0, VP4, VP8, VP12; lane 1 holds VP1, VP5, VP9, VP13; and so on); the control processor and a vector memory unit connect the lanes to memory.]


Lane Execution

• Lanes execute decoupled from each other
• A command management unit handles vector-fetch and thread-fetch commands
• An execution cluster executes instructions in order from a small AIB cache (e.g. 32 instructions)
  - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
• Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
  - For a thread-fetch command, the lane executes the AIB for the requesting VP
  - For a vector-fetch command, the lane executes the AIB for every VP
• AIBs and vector-fetch commands reduce control overhead
  - 10s to 100s of instructions executed per fetch-address tag check, even for non-vectorizable loops

[Figure: inside lane 0, vector-fetch and thread-fetch commands (with per-VP fetch addresses for VP0, VP4, VP8, VP12) are checked against AIB tags; a miss sends the AIB address to the AIB fill unit, and hits produce execute directives that drive the ALU with instructions from the AIB cache.]
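For intuition only, an execute directive can be pictured as a small record naming an AIB and the VP(s) it applies to. The fields below are guesses made for illustration, not the SCALE hardware encoding.

    #include <stdint.h>
    #include <stdbool.h>

    /* Speculative sketch of what an execute directive might carry. */
    typedef struct {
        uint16_t aib_index;   /* which entry of the lane's AIB cache to execute       */
        bool     vector;      /* vector-fetch: run the AIB once for every VP on lane  */
        uint8_t  vp;          /* thread-fetch: the single requesting VP               */
    } execute_directive;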


VP Execution Interleaving

• Hardware provides the benefits of loop unrolling by interleaving VPs
• Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
[Figure: execution timeline for lanes 0-3; each lane time-multiplexes its four VPs (lane 0 runs VP0, VP4, VP8, VP12, and so on) across successive vector-fetched AIBs and thread-fetches.]


VP Execution Interleaving

• Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
  - No static cross-iteration scheduling
  - Tolerant to variable dynamic latencies
[Figure: the same lane timeline with a vector-fetched AIB whose cross-VP transfers ripple from lane to lane as values become available.]


SCALE Vector-Thread Processor 

• SCALE is designed to be a complexity-effective all-purpose embedded processor
  - Exploit all available forms of parallelism and locality to achieve high performance and low energy
• Constrained to small area (estimated 10 mm² in 0.18 µm)
  - Reduce wire delay and complexity
  - Support tiling of multiple SCALE processors for increased throughput
• Careful balance between software and hardware for code mapping and scheduling
  - Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model


SCALE Clusters

• VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
  - Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer multiplier/divider
  - Clusters execute decoupled from each other

[Figure: control processor, AIB fill unit, and L1 cache beside four lanes, each split into clusters c0-c3; a SCALE VP is the corresponding slice of clusters c0-c3.]


SCALE Registers and VP Configuration

• The number of VP registers in each cluster is configurable
  - The hardware can support more VPs when each has fewer private registers
  - Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
• Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
  - VP general registers are divided into private and shared
  - Chain registers (cr0, cr1) at the ALU inputs avoid reading and writing the general register file, saving energy
[Figure: three example configurations of one cluster (c0) on a lane: 4 VPs with 8 private and 0 shared registers, 7 VPs with 4 private and 4 shared registers, and 25 VPs with 1 private and 7 shared registers.]
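The VP counts in the figure follow from a simple register budget. Assuming the 32 physical registers per cluster listed for the prototype on a later slide, a lane can hold (32 - shared) / private VPs per cluster; the small check below reproduces the three configurations shown. This is an illustrative calculation, not code from the SCALE toolchain.

    #include <stdio.h>

    /* VPs per lane = (physical regs per cluster - shared regs) / private regs,
     * assuming 32 physical registers per cluster as in the SCALE prototype. */
    static int vps_per_lane(int phys, int shared, int priv) {
        return (phys - shared) / priv;
    }

    int main(void) {
        printf("%d\n", vps_per_lane(32, 0, 8));   /* 4 VPs per lane  */
        printf("%d\n", vps_per_lane(32, 4, 4));   /* 7 VPs per lane  */
        printf("%d\n", vps_per_lane(32, 7, 1));   /* 25 VPs per lane */
        return 0;
    }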


SCALE Micro-Ops

• The assembler translates the portable software ISA into hardware micro-ops
• Per-cluster micro-op bundles access local registers only
• Inter-cluster data transfers are broken into transports and writebacks

Software VP code (cluster / operation / destinations):
  c0: xor pr0, pr1 → c1/cr0, c2/cr0
  c1: sll cr0, 2   → c2/cr1
  c2: add cr0, cr1 → pr0

Hardware micro-ops (one micro-op bundle per cluster; cluster 3 not shown):
  Cluster 0:  compute: xor pr0,pr1        transport: →c1,c2
  Cluster 1:  writeback: c0→cr0           compute: sll cr0,2        transport: →c2
  Cluster 2:  writeback: c0→cr0, c1→cr1   compute: add cr0,cr1→pr0


SCALE Prototype and Simulator 

• Prototype SCALE processor in development
  - Control processor: MIPS, 1 instruction/cycle
  - VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  - Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
  - DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s: 3.2 GB/s)
  - Estimated 10 mm² in 0.18 µm, 400 MHz (25 FO4)
• Cycle-level, execution-driven microarchitectural simulator
  - Detailed VTU and memory system model

[Figure: annotated 4 mm x 2.5 mm floorplan: the control processor (PC, RF, ALU, shifter, mult/div, CP0, load/store, bypass); 16 lane clusters, each with RF, ctrl, shifter, ALU, latch/mux, and IQC; four per-lane load data queues (LDQ); the memory unit, memory interface/cache control, crossbar, cache tags, and four 8 KB cache banks.]


Benchmarks

• Diverse set of 22 benchmarks chosen to evaluate a range of applications with different types of parallelism
  - 16 from EEMBC; 6 from MiBench, Mediabench, and SpecInt
• Hand-written VP assembly code linked with C code compiled for the control processor using gcc
  - Reflects the typical usage model for embedded processors
• EEMBC enables comparison to other processors running hand-optimized code, but it is not an ideal comparison
  - Performance depends greatly on programmer effort; algorithmic changes are allowed for some benchmarks and are often unpublished
  - Performance depends greatly on special compute instructions or sub-word SIMD operations (for the current evaluation, SCALE does not provide these)
  - Processors use unequal silicon area, process technologies, and circuit styles
• Overall results: SCALE is competitive with larger and more complex processors on a wide range of codes from different domains
  - See the paper for detailed results
  - Results are a snapshot; the SCALE microarchitecture and benchmark mappings continue to improve


Mapping Parallelism to SCALE

• Data-parallel loops with no complex control flow
  - Use vector-fetch and vector-memory commands
  - EEMBC rgbcmy, rgbyiq, and hpg execute 6-10 ops/cycle for 12x-32x speedup over the control processor; performance scales with the number of lanes
• Loops with loop-carried dependencies
  - Use vector-fetched AIBs with cross-VP data transfers
  - Mediabench adpcm.dec: two loop-carried dependencies propagate in parallel; on average 7 loop iterations execute in parallel, for an 8x speedup
  - MiBench sha has 5 loop-carried dependencies and exploits ILP

[Chart: speedup over the control processor for 1, 2, 4, and 8 lanes (the 4-lane configuration matches the prototype); rgbcmy, rgbyiq, and hpg plotted on a 0-70 scale, adpcm.dec and sha on a 0-9 scale.]


Mapping Parallelism to SCALE

• Data-parallel loops with large conditionals
  - Use vector-fetched AIBs with conditional thread-fetches
  - EEMBC dither: special case for white pixels (18 ops vs. 49)
• Data-parallel loops with inner loops
  - Use vector-fetched AIBs with thread-fetches for the inner loop
  - EEMBC lookup: VPs execute pointer-chasing IP address lookups in a routing table
• Free-running threads (sketched after the chart below)
  - No control processor interaction
  - VP worker threads get tasks from a shared queue using atomic memory operations
  - EEMBC pntrch and MiBench qsort achieve significant speedups

[Chart: speedup over the control processor for 1, 2, 4, and 8 lanes on dither, lookup, pntrch, and qsort, plotted on a 0-10 scale.]
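A rough sketch of the free-running-threads pattern follows, using C11 atomics as a stand-in for SCALE's atomic memory operations; the task type and handler are invented for illustration.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct { int payload; } task_t;      /* hypothetical task record */

    static atomic_size_t next_task;              /* shared work-queue index */

    /* Each free-running VP thread claims tasks with an atomic fetch-and-add
     * and exits when the queue is drained; no control processor interaction. */
    void vp_worker(task_t *queue, size_t n_tasks, void (*run)(task_t *))
    {
        for (;;) {
            size_t i = atomic_fetch_add(&next_task, 1);
            if (i >= n_tasks)
                return;
            run(&queue[i]);
        }
    }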


Comparison to Related Work

• TRIPS and Smart Memories can also exploit multiple types of parallelism, but must explicitly switch modes
• Raw's tiled processors provide much lower compute density than SCALE's clusters, which factor out instruction and data overheads and use direct communication instead of programmed switches
• Multiscalar passes loop-carried register dependencies around a ring network, but it focuses on speculative execution and memory, whereas VT uses simple logic to support common loop types
• Imagine organizes computation into kernels to improve register file locality, but it exposes machine details with a low-level VLIW ISA, in contrast to VT's VP abstraction and AIBs
• CODE uses register renaming to hide its decoupled clusters from software, whereas SCALE simplifies hardware by exposing clusters and statically partitioning inter-cluster transport and writeback ops
• Jesshope's micro-threading is similar in spirit to VT, but its threads are dynamically scheduled and cross-iteration synchronization uses full/empty bits on global registers


Summary

• The vector-thread architectural paradigm unifies the vector and multithreaded compute models
• The VT abstraction introduces a small set of primitives that allow software to succinctly encode parallelism and locality and to seamlessly intermingle DLP, TLP, and ILP
  - Virtual processors, AIBs, vector-fetch and vector-memory commands, thread-fetches, cross-VP data transfers
• The SCALE VT processor efficiently achieves high performance on a wide range of embedded applications