
TMBox: A Configurable 16-core Hybrid TM FPGA prototype

Osman Unsal


The people

• Nehir Sonmez (BSC)
• Oriol Arcas (BSC)
• Osman Unsal (BSC)
• Adrian Cristal (BSC)
• Satnam Singh (MSR Cambridge)


BeeFarm

• Software simulators are poorly parallelized
• An FPGA can be significantly faster for multicore emulation: an FPGA emulator at 25 MHz can be faster than a software simulator on a 2 GHz host


From Plasma to BeeFarm: Design Experience of an FPGA-based Multicore Prototype. Nehir Sonmez, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 7th International Symposium on Applied Reconfigurable Computing (ARC 2011), March 2011.


BeeFarm

• 8-core, FPGA-based multiprocessor
• Completely modifiable from top to bottom

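[Figure: BeeFarm block diagram. Eight processors (P), each with its own L1 cache, are attached to a bus together with an arbiter, a DDR2 controller, boot memory and I/O. Clock labels in the figure: 25 MHz and 125 MHz.]

• Honeycomb: MIPS R3000 compatible
• Shared bus: 128-bit split bus
• L1 cache: unified 8 KB cache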


The Honeycomb core

R3000-compatible Honeycomb with flexible HTM support

= Original Plasma (MIPS R2000-compatible)

+ MMU, FPU

+ exceptions support

+ synchronization primitives: LL/SC

+ snooping, coherent caches (MSI)

+ debugging, performance counters

+ system libraries to support string, I/O, TM
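To make the "snooping, coherent caches (MSI)" item concrete, here is a minimal sketch of the textbook MSI state machine for a single cache line. It is not taken from the Honeycomb sources: the type names, event names and the function `msi_next` are assumptions chosen for illustration, and the real cache FSM may handle more cases (for example LL/SC and transactional accesses).

```c
#include <assert.h>

/* Cache-line states of the textbook MSI protocol. */
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state;

/* Events seen by one cache: accesses by its own CPU and traffic snooped
   from other cores. These names are illustrative assumptions. */
typedef enum { CPU_READ, CPU_WRITE, SNOOP_READ, SNOOP_WRITE } msi_event;

/* Next state of one cache line after an event. */
msi_state msi_next(msi_state s, msi_event e) {
    assert(s == MSI_INVALID || s == MSI_SHARED || s == MSI_MODIFIED);
    switch (e) {
    case CPU_READ:    return s == MSI_INVALID ? MSI_SHARED : s;   /* read miss fetches a shared copy */
    case CPU_WRITE:   return MSI_MODIFIED;                        /* writer gains exclusive ownership */
    case SNOOP_READ:  return s == MSI_MODIFIED ? MSI_SHARED : s;  /* owner downgrades on a remote read */
    case SNOOP_WRITE: return MSI_INVALID;                         /* remote write invalidates the local copy */
    }
    return s;
}
```

The single transition function keeps the protocol easy to check by enumeration: there are only 3 states and 4 events.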


BeeFarm performance

[Figure: ScalParC simulation. Speedup (y-axis, 0 to 5) for 1, 2 and 4 cores. Series: BeeFarm, M5, M5-timing. Annotations: functional simulation, detailed simulation.]

Results normalized to M5 with 1 thread.


TMbox

• HTM multiprocessor on FPGA
  – Inspired by AMD’s Advanced Synchronization Facility
• BeeFarm improved:
  – Ring bus instead of shared bus (which doesn’t fit well on FPGA)
  – x2 frequency (50 MHz)


TMbox: A Flexible and Reconfigurable 16-core Hybrid Transactional Memory System. Nehir Sonmez, Oriol Arcas, Otto Pflucker, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 19th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2011), May 2011.


HTM ISA extensions

• Inspired by AMD-ASF
• 10 new MIPS instructions
  – XBEGIN (addr)
  – XLB, XLH, XLW, XSB, XSH, XSW
  – XCOMMIT, XABORT (code)
  – MFTM

• 4 new special registers
  – Can only be read with the MFTM (move from TM) instruction
  – $TM0 contains the abort address (set by XBEGIN)
  – $TM1 has a copy of the stack pointer (saved by XBEGIN)
  – $TM2 contains the abort cause (overflow, contention or explicit)
  – $TM3 stores a 20-bit software abort code (set by XABORT)
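The register list above can be sketched as a small software model. This is only an illustration under stated assumptions: the slides give the register roles and the 20-bit width of the abort code, but the numeric encoding of the abort cause, the truncation of wider codes, and the names `tm_regs`, `tm_model_xabort` and `tm_model_mftm` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Abort causes named on the slide. The bit encoding is an assumption. */
enum {
    TM_ABORT_OVERFLOW   = 1u << 0,
    TM_ABORT_CONTENTION = 1u << 1,
    TM_ABORT_EXPLICIT   = 1u << 2
};

#define TM_ABORT_CODE_MASK 0xFFFFFu   /* $TM3 holds a 20-bit code */

/* The four special TM registers. */
typedef struct {
    uint32_t tm0;   /* abort address */
    uint32_t tm1;   /* copy of the stack pointer */
    uint32_t tm2;   /* abort cause */
    uint32_t tm3;   /* 20-bit software abort code */
} tm_regs;

/* Model of XABORT (code): record an explicit abort and keep only the low
   20 bits of the code (the truncation is an assumption). */
void tm_model_xabort(tm_regs *r, uint32_t code) {
    r->tm2 = TM_ABORT_EXPLICIT;
    r->tm3 = code & TM_ABORT_CODE_MASK;
}

/* Model of MFTM: read special register n, where n is 0..3. */
uint32_t tm_model_mftm(const tm_regs *r, unsigned n) {
    assert(n < 4);
    switch (n) {
    case 0:  return r->tm0;
    case 1:  return r->tm1;
    case 2:  return r->tm2;
    default: return r->tm3;
    }
}
```

An abort handler would typically read register 2 first to learn the cause, and register 3 only when the cause is an explicit abort.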


HTM example

• atomic {a++} example in MIPS assembler:

        LI    $11, 5             # maximum number of retries
        LI    $13, HW_OFLOW      # abort cause to compare against
        J     $TX

$ERR:   ...

$ABORT: MFTM  $12, $TM2          # read the abort cause
        BEQ   $12, $13, $ERR     # HW capacity exceeded
        ADDIU $10, $10, 1        # count this retry
        SLTU  $12, $10, $11
        BEQZ  $12, $ERR2
        J     $TX                # abort due to conflict, retry...

$TX:    XBEGIN $ABORT            # on abort, execution continues at $ABORT
        XLW   $8, 0($a0)
        ADDI  $8, $8, 1
        XSW   $8, 0($a0)
        XCOMMIT                  # transaction committed

        next code...
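The control flow of the assembly example can be sketched in C as follows. This is a model for reading the listing, not code that runs on TMbox: the hardware attempt is replaced by a hypothetical callback (`tx_attempt_fn`), `demo_attempt` is a scripted stand-in that fails a fixed number of times, and the retry counter in $10 is assumed to start at zero, which the slide does not show.

```c
#include <assert.h>

typedef enum { TX_COMMITTED, TX_ABORT_CONFLICT, TX_ABORT_OVERFLOW } tx_result;
typedef enum { RUN_COMMITTED, RUN_OVERFLOW, RUN_TOO_MANY_RETRIES } run_outcome;

/* One attempt of the transaction body (XBEGIN ... XCOMMIT). Hypothetical
   callback standing in for the hardware. */
typedef tx_result (*tx_attempt_fn)(void *ctx);

run_outcome run_with_retries(tx_attempt_fn attempt, void *ctx, unsigned max_retries) {
    unsigned retries = 0;                          /* $10, assumed to start at 0 */
    assert(attempt != 0);
    for (;;) {
        tx_result r = attempt(ctx);                /* $TX: XBEGIN ... XCOMMIT */
        if (r == TX_COMMITTED)
            return RUN_COMMITTED;                  /* falls through to "next code..." */
        if (r == TX_ABORT_OVERFLOW)
            return RUN_OVERFLOW;                   /* BEQ $12, $13, $ERR */
        retries++;                                 /* ADDIU $10, $10, 1 */
        if (!(retries < max_retries))
            return RUN_TOO_MANY_RETRIES;           /* SLTU + BEQZ $12, $ERR2 */
    }                                              /* J $TX */
}

/* Scripted stand-in for the hardware: conflicts a fixed number of times,
   then performs the a++ body and commits. */
typedef struct { unsigned conflicts_left; int *a; } demo_ctx;

tx_result demo_attempt(void *p) {
    demo_ctx *c = p;
    if (c->conflicts_left > 0) {
        c->conflicts_left--;
        return TX_ABORT_CONFLICT;
    }
    (*c->a)++;                                     /* atomic { a++ } */
    return TX_COMMITTED;
}
```

With a limit of 5, a transaction that conflicts twice commits on its third attempt, while one that conflicts five times in a row gives up without modifying `a`.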


TinySTM – ASF integration

• atomic {a++} example with TinySTM hybrid TM:

        tm_thread_init();

        tm_start();
        t = tm_read(a);
        tm_write(a, t + 1);
        tm_commit();

        next code...

Flow annotations from the figure:

• Transaction committed: continue with the next code
• Abort due to conflict: retry...
• Abort because HW capacity was exceeded, or explicit SW abort: switch to software
• In software: TinySTM conflict management
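The hardware-to-software fallback described by the figure annotations can be written as a small decision function. This is a sketch of the policy as the slide states it, not TinySTM code: `next_mode` and `mode_after` are hypothetical names, and the rule that a transaction stays in software once it has switched is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ABORT_CONFLICT, ABORT_OVERFLOW, ABORT_EXPLICIT } abort_cause;
typedef enum { MODE_HW, MODE_SW } tx_mode;

/* Mode for the next attempt after an abort. */
tx_mode next_mode(tx_mode current, abort_cause cause) {
    if (current == MODE_SW)
        return MODE_SW;        /* assumption: software aborts are retried in software */
    /* In hardware: retry on conflict, otherwise switch to software. */
    return cause == ABORT_CONFLICT ? MODE_HW : MODE_SW;
}

/* Replays a sequence of abort causes, starting in hardware, and returns
   the mode in which the next attempt would run. */
tx_mode mode_after(const abort_cause *aborts, size_t n) {
    tx_mode m = MODE_HW;
    assert(aborts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        m = next_mode(m, aborts[i]);
    return m;
}
```

Keeping the policy in one pure function makes it easy to try variants, such as also switching to software after a fixed number of conflict aborts.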


Compilation

• Standard GCC-MIPS cross-compiler + HyTM extensions (to use the 10 new transactional instructions)
• 4 new TM registers, read with the MFTM instruction
• Also extend the cache FSM to support TM


TMbox architecture

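[Figure: TMbox ring bus. Cores C0 to C7 and the DDR are connected in a ring that carries requests, responses and invalidations. Each node shows a bus node, an L1 cache, a Honeycomb CPU and a TM unit (CAM with addr and hit signals, RAM for data), plus a bus controller.]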

To commit (serialized):

1. Lock ring (to prevent other writes and commits). Will destroy ongoing write/commit requests.
2. Commit the TX writes through channel. Will abort conflicting TXs snooping the ring.
3. Unlock ring.
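The three commit steps can be sketched as a sequential software model. It is a simplification under stated assumptions: the core count, set size, memory size and all names (`system_t`, `commit_tx`, `tracks`) are hypothetical, each core tracks its transactional addresses in one combined read/write set, and the destruction of in-flight write/commit requests when the ring is locked is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NCORES    4    /* illustrative sizes, not the TMbox configuration */
#define SET_MAX   8
#define MEM_WORDS 16

typedef struct { uint32_t addr, value; } tx_write;

typedef struct {
    bool in_tx, aborted;
    uint32_t rw_set[SET_MAX];   /* addresses read or written by the running TX */
    size_t rw_len;
} core_t;

typedef struct {
    bool ring_locked;
    uint32_t mem[MEM_WORDS];
    core_t cores[NCORES];
} system_t;

/* True if the core's running transaction has touched addr. */
static bool tracks(const core_t *c, uint32_t addr) {
    for (size_t i = 0; i < c->rw_len; i++)
        if (c->rw_set[i] == addr) return true;
    return false;
}

/* Commit the n buffered writes of core `who`.
   Returns false if another commit already holds the ring. */
bool commit_tx(system_t *s, size_t who, const tx_write *w, size_t n) {
    assert(who < NCORES);
    if (s->ring_locked) return false;
    s->ring_locked = true;                          /* 1. lock ring */
    for (size_t i = 0; i < n; i++) {                /* 2. commit the TX writes */
        assert(w[i].addr < MEM_WORDS);
        s->mem[w[i].addr] = w[i].value;
        for (size_t c = 0; c < NCORES; c++)         /* other cores snoop each write */
            if (c != who && s->cores[c].in_tx && tracks(&s->cores[c], w[i].addr))
                s->cores[c].aborted = true;         /* conflicting TX aborts */
    }
    s->cores[who].in_tx = false;
    s->ring_locked = false;                         /* 3. unlock ring */
    return true;
}
```

In this model a commit by core 0 to an address that core 1 has read marks core 1 as aborted, while a core working on unrelated addresses is left alone.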


Performance

• Eigenbench synthetic TM benchmark on 16 cores (lower is better):
  – Left: 10 element r/w set: overflows the TM cache
  – Right: 8 element r/w set: fits in the TM cache

[Figure: two Eigenbench plots (left and right), each annotated "HyTM better".]


Performance (cont.)

• From the STAMP TM benchmark suite:
  – SSCA2: An efficient and scalable graph kernel construction algorithm.
  – Intruder: A high abort rate benchmark.
• If the program scales, so do we… (higher is better)

[Figure: two plots, SSCA2 and Intruder. Annotations: 5-8% better; 48% TX in HW (HW aborts are less expensive).]


Future Work: TMbox 2?


• Distributed memory directory

• 4 FPGAs

• 64 cores

• Maps well on FPGA

• Similar to Stanford Dash

[Figure: BEE3 board with four FPGAs (A, B, C, D), each with its own DDR, directory and switch.]

Figure labels:

• I/O: RS232, PCIe, Ethernet
• MIPS R3000, 100 MHz
• 8 KB I$1 + 8 KB D$1
• Low-overhead, online profiling
• 4 GB DDR2
• 256 LB L2 cache


Bluespec SystemVerilog

• Functional language for HW modeling
  – Functional, object-oriented, rule-based
  – HW functional verification is fast and easy (static rule conditions verification)
  – Compiles to Verilog source code (better for component refinement)
• First prototype: MIPS 5-stage processor
  – Faster (100 MHz) and smaller


TMbox is available at: http://www.velox-project.eu/releases

Any questions?

Contact: [email protected]