An FPGA-based Scalable Simulation Accelerator for Tile Architectures. Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise† (†Tokyo Institute of Technology, Japan; ‡JSPS Research Fellow). 14:30 – 15:00, June 2, 2011, HEART 2011 @Imperial College London

An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011


DESCRIPTION

A presentation of ScalableCore system 1.1 at HEART 2011 @Imperial College London


Page 1: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

An FPGA-based Scalable Simulation Accelerator for Tile Architectures

Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†

†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow

14:30 – 15:00 June 2, 2011 HEART 2011 @Imperial College London

Page 2: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

This presentation shows ScalableCore system
■ Multi-FPGA system for tile architecture simulations
  ● Achieving SCALABLE simulation speed

[Figure: ScalableCore system; legend: Target Core, System Function]

Page 3: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Agenda

■ Background & Motivation
■ Proposal: ScalableCore
■ System Implementation
  ● Overall system
  ● Components: ScalableCore Unit & Board
  ● Logic Hierarchy & Architecture
■ Evaluation
  ● Simulation Speed
  ● Power
■ Conclusion

Page 4: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Background: Multicores to Many-cores


Intel Single Chip Cloud Computer 48 cores (x86)

TILERA TILE-Gx100 100 cores (MIPS)

Page 5: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Simulation Target Manycore: M-Core
■ Tile architecture with 2D mesh network
  ● A Node has: Core, Local Memory, INCC (DMA controller) and Router
  ● Local Memory: independent address space; data is transferred by DMA

[Figure: M-Core block diagram. Nodes (Core, INCC, Local Memory, Router "R") on a 2D mesh, with four DRAM Controllers]

Page 6: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

How to evaluate the architectures?
■ Customizability vs. Simulation Speed
  ● We want to run a large benchmark fast

[Figure: evaluation platforms placed on the axes "Difficulty to construct" and "Reality"]
  ● Software Simulator: easy construction of an ideal system without HW limitations
  ● FPGA Simulator: faster simulation and customizable
  ● Chip: real but expensive

Page 7: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Less scalability of simulation speed on software simulators
■ Simulation speed decreases as the number of target cores increases
  ● SimMc: M-Core simulator
  ● Difficult to achieve scalable speed
    • Overhead of cycle-accurate simulation

[Chart: Simulation Speed on SimMc (M-Core simulator)]
# Target Cores | Simulation Speed [K cycle/sec]
16             | 343
32             | 149
48             | 96
64             | 70

The speed degrades by more than the increase in the number of cores
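The claim that the speed degrades by more than the core count grows can be checked from the chart values alone. The sketch below is not from the slides; the names `software_kcps`, `BASE` and `slowdown_vs_core_growth` are made up for illustration. It compares the slowdown relative to the 16-core run with the growth in core count.

```python
# SimMc simulation speed read from the chart, in K cycle/sec, keyed by # target cores.
software_kcps = {16: 343, 32: 149, 48: 96, 64: 70}
BASE = 16  # smallest configuration, used as the reference point


def slowdown_vs_core_growth(cores):
    """Return (core growth factor, slowdown factor) relative to the 16-core run."""
    core_growth = cores / BASE
    slowdown = software_kcps[BASE] / software_kcps[cores]
    return core_growth, slowdown


for cores in sorted(software_kcps):
    growth, slowdown = slowdown_vs_core_growth(cores)
    print(f"{cores} cores: {growth:.1f}x cores, {slowdown:.2f}x slower")
```

For every configuration above 16 cores the slowdown factor is larger than the core growth factor, which is what the slide means by less-than-scalable speed.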

Page 8: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Motivation
■ Achieve SCALABLE simulation speed
  ● = Keep the simulation speed constant even for a large number of cores
■ How to scale the simulation speed?
  ● Our target architecture: M-Core
    • Tile architecture with 2D mesh network
  ● Partition the target processor across multiple FPGAs

[Figure: a many-core processor divided into partitions]

Page 9: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Proposal of ScalableCore
■ Multiple FPGAs corresponding to the target processor
  ● Each ScalableCore Unit holds a part of the target processor and shares its simulation progress with its neighbor Units

[Figure: ScalableCore system and the target processor (M-Core); legend: Target Core, System Function]
  ● ScalableCore Unit (FPGA card with off-chip memory): a part of the target processor
  ● ScalableCore Board: connects the ScalableCore Units
  ● LCD display for simulation information

Page 10: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Simulation Target Manycore: M-Core
■ Tile architecture with 2D mesh network
  ● A Node has: Core, Local Memory, INCC (DMA controller) and Router
  ● Local Memory: independent address space; data is transferred by DMA

[Figure: M-Core block diagram. Nodes (Core, INCC, Local Memory, Router "R") on a 2D mesh, with four DRAM Controllers. Callout: Current Target of ScalableCore system]

Page 11: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

ScalableCore system 1.1: Overview
■ Simulating the M-Core with up to 64 Nodes (= FPGAs)
  ● Able to increase/decrease the number of Nodes

[Figure: one Node (Core, INCC, Local Memory, Router "R") together with the System Functions]

Page 12: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

1 Node: 1 ScalableCore Unit

[Photo; dimension labels: 45cm, 30cm]

Page 13: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

4 Nodes (2×2): 4 ScalableCore Units

[Photo; dimension labels: 45cm, 30cm]

Page 14: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

16 Nodes (4×4): 16 ScalableCore Units

[Photo; dimension labels: 45cm, 30cm]

Page 15: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

64 Nodes (8×8) : 64 ScalableCore Units


Scalable Extension!

Page 16: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

ScalableCore system 1.1: Components

■ ScalableCore Unit: FPGA board with off-chip SRAM
  ● Xilinx Spartan-3E XC3S500E
  ● 512 KiB SRAM (8-bit, 1 port for read/write)
  ● Configuration ROM
■ ScalableCore Board: interface board bridging Units
  ● Power regulator & SD card slot

Page 17: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

ScalableCore system 1.1: Logic Hierarchy

[Figure: logic hierarchy of a ScalableCore Unit]
  ● Target Core (a Node in M-Core): Core, INCC, Local Memory (Interface), Router
  ● System Functions: Ser/Des, Memory Multiplexer, Initializer, Device Controller, Arbiter, Interface Register

Page 18: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

ScalableCore system 1.1: Logic Architecture

[Figure: block diagram of a ScalableCore Unit (Spartan-3E FPGA plus off-chip devices). Labels in the figure:]
  ● Core: Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit
  ● INCC: DMA Generator/Receiver, DMA Register
  ● Node Memory: Memory Controller, Memory Multiplexer, SRAM Controller
  ● Router: Arbiter, XBAR
  ● Other on-FPGA blocks: Interface Registers (IR), State Machine Controller, SD Card Controller, four Ser/Des blocks to/from adjacent Units
  ● Off-chip devices: SRAM, SD, Clock, Reset, Configuration ROM (XCF04S), JTAG port

Page 19: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Two key techniques
■ Local Barrier Synchronization
  ● Each FPGA holds one Node of M-Core (or of another tile architecture)
  ● To preserve cycle accuracy, the FPGAs must handshake their simulation state
    • All-to-all handshake: the overhead grows with the number of cores
  ● Our target is a tile architecture, so …
  ● Handshake with only the 4 neighbors
■ Virtual Cycle
  ● How to emulate complex hardware?
    • Ex.) a larger number of memory ports
  ● Use multiple FPGA cycles for 1 target cycle

Page 20: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Local Barrier Synchronization
■ Handshakes with the 4 neighbor FPGAs
  ● Constant handshaking overhead that does not grow with the number of target cores
  ● So it achieves scalable simulation speed

[Figure: timing diagram for Cycle 1 and Cycle 2. In each cycle the Unit sends to Units 0 to 3 and receives from Units 0 to 3; an inset labels Units 0 to 4]
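The idea of a neighbor-only barrier can be illustrated in software. The following is a minimal sketch, not the ScalableCore implementation: it models each Unit as a cycle counter on a 2D mesh and lets a Unit finish its current target cycle only when none of its (up to 4) neighbors is behind it. The function names, the random scheduling order and the seed are all assumptions made for illustration.

```python
import random


def mesh_neighbors(x, y, width, height):
    """Return the (up to 4) mesh neighbors of node (x, y)."""
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def run_local_barrier(width, height, target_cycles, seed=0):
    """Advance every node to target_cycles using neighbor-only barriers.

    Returns the final cycle counters and the largest cycle gap that was
    ever observed between two adjacent nodes.
    """
    rng = random.Random(seed)
    nodes = [(x, y) for y in range(height) for x in range(width)]
    cycle = {node: 0 for node in nodes}
    max_gap = 0
    while any(c < target_cycles for c in cycle.values()):
        node = rng.choice(nodes)  # arbitrary order models unsynchronized FPGAs
        if cycle[node] >= target_cycles:
            continue
        nbrs = mesh_neighbors(node[0], node[1], width, height)
        # A node may finish its current cycle only after every neighbor has
        # delivered its data for that cycle, i.e. no neighbor is behind it.
        if all(cycle[n] >= cycle[node] for n in nbrs):
            cycle[node] += 1
            gap = max((abs(cycle[node] - cycle[n]) for n in nbrs), default=0)
            max_gap = max(max_gap, gap)
    return cycle, max_gap


if __name__ == "__main__":
    final, gap = run_local_barrier(8, 8, target_cycles=20)
    print("all nodes finished:", set(final.values()), "max neighbor gap:", gap)
```

Each node checks at most 4 peers regardless of mesh size, and adjacent nodes never drift more than one target cycle apart, which is the property that keeps the simulation cycle accurate without an all-to-all handshake.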

Page 21: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Virtual Cycle
■ Multiple FPGA clock cycles for 1 target clock cycle
  ● Emulates complex hardware virtually using simple FPGA resources
    • Example: a multiport RAM emulated by driving a 1-port RAM multiple times

[Figure: timeline of 1 Virtual Cycle Time, from Virtual Cycle N to Virtual Cycle N+1]
  ● Data Sender via Serial I/Os: start sending; send the synchronized data via Serial I/O (North, East, West, South)
  ● Data Receiver via Serial I/Os: receive the synchronized data via Serial I/O (North, East, West, South); finish synchronization
  ● Proceeding Target Circuit State: drive the circuit of target components (Core, INCC, Router)
  ● Interleaved Memory Access via Memory Multiplexer: process the memory accesses (INCC Send, Core (IF), INCC Recv, Core (L/S))
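The multiport RAM example can be sketched in a few lines. This is an illustration under stated assumptions, not the actual Memory Multiplexer: `SinglePortRAM`, `virtual_cycle` and the port names are hypothetical, and the sketch simply serves the requests of one target cycle one after another, spending one FPGA cycle per access, in the order they are listed.

```python
class SinglePortRAM:
    """A RAM that can serve exactly one read or write per FPGA cycle."""

    def __init__(self, size):
        self.data = [0] * size
        self.fpga_cycles = 0  # number of FPGA cycles spent on accesses

    def access(self, addr, write_value=None):
        """Perform one access; return the stored value for a read, None for a write."""
        self.fpga_cycles += 1
        if write_value is None:
            return self.data[addr]
        self.data[addr] = write_value
        return None


def virtual_cycle(ram, requests):
    """Serve all requests of one target cycle on a single-port RAM.

    requests is a list of (port_name, addr, write_value); write_value is
    None for a read. Assumption: ports are served in list order.
    """
    results = {}
    for port, addr, write_value in requests:
        results[port] = ram.access(addr, write_value)
    return results


if __name__ == "__main__":
    ram = SinglePortRAM(16)
    ram.data[3] = 7
    out = virtual_cycle(ram, [
        ("incc_send", 3, None),   # DMA read
        ("core_if", 0, None),     # instruction fetch
        ("incc_recv", 4, 9),      # DMA write
        ("core_ls", 5, 1),        # store
    ])
    print(out, "FPGA cycles used:", ram.fpga_cycles)
```

From the target's point of view the four accesses happen in one clock cycle, while the FPGA spends four of its own cycles on them; this is the trade that the virtual cycle makes.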

Page 22: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Evaluation

■ Evaluation Points
  ● Simulation Speed [K cycle/sec]
  ● Power [W]
■ Environment
  ● ScalableCore system 1.1 (FPGA-based simulator)
    • Freq.: 45 MHz
  ● SimMc 1.1 (software simulator of M-Core)
    • Intel Core 2 Duo, 4 GB memory, gcc 4.1.2, Debian 5
■ # Nodes
  ● 16, 32, 48, 64

Page 23: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Evaluation: Simulation Speed [K cycle/sec]
■ = Clock frequency of the target processor [kHz]
  ● Software simulator: the speed degrades as the number of target cores increases
  ● ScalableCore system: constant speed
■ Relative Speed
  ● The relative speed increases with the number of cores
    • In the simulation of 64 Nodes, it achieves a 14.2x speedup

[Charts: Simulation Speed [K cycle/sec] and Relative Speed vs. # Nodes]
# Nodes | ScalableCore system | Software Simulator | Relative Speed
16      | 1000                | 343                | 2.9
32      | 1000                | 149                | 6.7
48      | 1000                | 96                 | 10.4
64      | 1000                | 70                 | 14.2
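The relative speed is the ratio of the two simulation speeds, which the short sketch below recomputes from the chart values (the variable and function names are made up for illustration). Because the chart values are rounded, the 64-Node ratio comes out near 14.3 rather than the 14.2 shown on the slide. The last lines rest on an assumption not stated in the slides: if the 45 MHz FPGA clock paces the virtual cycle directly, a speed of 1000 K cycle/sec corresponds to about 45 FPGA cycles per target cycle.

```python
# Chart values from the slides, in K cycle/sec, keyed by number of Nodes.
scalablecore_kcps = {16: 1000, 32: 1000, 48: 1000, 64: 1000}
software_kcps = {16: 343, 32: 149, 48: 96, 64: 70}


def relative_speed(nodes):
    """Speed of the ScalableCore system divided by the software simulator speed."""
    return scalablecore_kcps[nodes] / software_kcps[nodes]


for nodes in sorted(software_kcps):
    print(f"{nodes} Nodes: {relative_speed(nodes):.1f}x")

# Assumption: the 45 MHz FPGA clock directly paces the virtual cycle.
FPGA_CLOCK_HZ = 45_000_000
fpga_cycles_per_target_cycle = FPGA_CLOCK_HZ / (scalablecore_kcps[64] * 1000)
print("FPGA cycles per target cycle:", fpga_cycles_per_target_cycle)
```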

Page 24: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Evaluation: Power [W]
■ = Energy consumption of the system per second
  ● Software simulator: constant power consumption [W]
  ● ScalableCore system: power increases with the number of Nodes [W]
■ Relative Efficiency (= ratio of the energy used to simulate 1 clock cycle of the target)
  ● The efficiency improves as the number of target cores increases
    • In simulation of 64 nodes, achieves

[Charts: Power [W] and Relative Efficiency vs. # Nodes]
# Nodes | ScalableCore system [W] | Software Simulator [W] | Relative Efficiency
16      | 13                      | 84                     | 19.2
32      | 26                      | 84                     | 22.2
48      | 38                      | 84                     | 22.9
64      | 51                      | 84                     | 23.5
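One way to read the relative efficiency is as the energy per simulated target cycle on the software simulator divided by the same quantity on the ScalableCore system, where energy per cycle is power divided by simulation speed. The sketch below applies that reading to the chart values; the formula is an interpretation of the slide's definition, and the names are hypothetical. Since the charted power and speed values are rounded, the results only approximate the charted efficiencies.

```python
# Chart values from the slides, keyed by number of Nodes.
speed_kcps = {
    "scalablecore": {16: 1000, 32: 1000, 48: 1000, 64: 1000},
    "software": {16: 343, 32: 149, 48: 96, 64: 70},
}
power_w = {
    "scalablecore": {16: 13, 32: 26, 48: 38, 64: 51},
    "software": {16: 84, 32: 84, 48: 84, 64: 84},
}


def energy_per_cycle_j(system, nodes):
    """Joules spent per simulated target cycle: power [W] / speed [cycle/sec]."""
    return power_w[system][nodes] / (speed_kcps[system][nodes] * 1000)


def relative_efficiency(nodes):
    """Assumed definition: software energy per cycle / ScalableCore energy per cycle."""
    return (energy_per_cycle_j("software", nodes)
            / energy_per_cycle_j("scalablecore", nodes))


for nodes in (16, 32, 48, 64):
    print(f"{nodes} Nodes: about {relative_efficiency(nodes):.1f}x more energy-efficient")
```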

Page 25: An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

Conclusion
■ ScalableCore system 1.1: an FPGA-based scalable simulation system for tile architecture evaluations
  ● Multiple FPGAs
  ● Two key techniques
    • Virtual cycle
    • Local Barrier Synchronization
  ● 14.2 times faster simulation than the software simulator
    • When simulating a more detailed architecture, the speedup becomes even larger
■ Future Work
  ● Off-chip DRAM support
  ● Virtually combining multiple FPGAs for a large core
  ● Time-multiplexed driving for higher hardware utilization