
Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ

Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini

Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy

Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany

{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de

ver0

Mohammadsadegh Sadri, Christian Weis, Norbert Wehn, Luca Benini – Energy and performance exploration of ACP Using ZYNQ 2

Outline

Introduction

ZYNQ Architecture (Brief)

Motivations & Contributions

Infrastructure Setup (Hardware & Software)

Memory Sharing Methods

Experimental Results

Lessons Learned & Conclusion

(c) Luca Bedogni 2012

Introduction

Performance Per Watt!!

1951, UNIVAC I: 0.015 operations per watt-second

Half a century later!

2012, ST P2012: 40 billion operations per watt-second



Introduction

Solution: specialized functional units (accelerators)

[Figure: a CPU with L1$, accelerators (specialized hardware) and DRAM. Panels: Case 1, Case 2. Labels: TASK 1 to TASK 4; var1, var2, var3; var1 and var2 marked as cached.]

Case 2: Faster! More power efficient! Better performance per watt!

What about variables? The CPU should flush the cache!

- The problem can be more complicated, e.g. multiple CPU cores!

- Every processing element should have a consistent view of the shared memory!

- Accelerator Coherency Port (ACP): allows accelerator hardware to perform coherent accesses to the CPU(s) memory space!


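Xilinx ZYNQ Architecture

[Figure: ZYNQ block diagram, PL and PS]

PS (processing system):
- 2 x ARM A9, each with NEON, MMU and L1
- Snoop
- L2 (PL310)
- OCM
- DRAM Controller (Synopsys IntelliDDR MPMC)
- DMA Controller (ARM PL330)
- Peripherals (UART, USB, Network, SD, GPIO, …)
- InterConnect (ARM NIC-301)

Ports between PL and PS:
- HP0, HP1, HP2, HP3
- SGP0, SGP1
- MGP0, MGP1
- ACP

PL (programmable logic):
- AXI Masters
- AXI Slaves
- AXI Master on ACP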


Motivations & Contributions

Which method is better to share data between CPU and accelerator?

[Figure: PL and PS block diagram showing the AXI Master (accelerator), HP0, ACP, OCM, DRAM Controller, L2 (PL310), Snoop, and two ARM A9 cores with NEON, MMU and L1.]

For each method:
- What is the data transfer speed?
- How much is the energy consumption?
- What is the effect of background workload on performance?

- Various acceleration methods are addressed in the literature (GPU, hardware boards, …)

- We develop an infrastructure (HW+SW) for the Xilinx ZYNQ

- We run practical tests & measurements to quantify the efficiency of different CPU-accelerator memory sharing methods.


Hardware


Software

Linux kernel-level drivers

AXI Dummy Driver (simple driver):
- Initializes the dummy AXI masters (HP1)
- Triggers an endless read/write loop

AXI Driver (more complicated):
- Handles the AXI masters (ACP & HP0)
- Memory allocation (over ACP: kmalloc; over HP: dma_alloc_coherent)
- ISR registration
- PL310 statistics
- Time measurement

User level

- AXI Driver user-side interface application
- Background application: a simple memory read/write loop
- Oprofile statistical profiler: measures all CPU performance metrics.


Processing Task Definition

[Figure: Source Image (image_size bytes) @ Source Address -> read -> process (FIR) -> write -> Result Image (image_size bytes) @ Dest Address. FIFOs: 128K and 128K.]

- Loop: N times; measure the execution interval.
- Selection of packets (addressing): normal or bit-reversed
- Buffers allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method
- Image sizes: 4 KBytes, 16K, 64K, 128K, 256K, 1 MBytes, 2 MBytes

We define: different methods to accomplish the task.

We measure: execution time & energy.
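The bit-reversed packet selection in the processing task can be illustrated with a minimal sketch. The slides do not give the packet size or how the index is formed, so `packet_size`, the function names, and the power-of-two packet count are assumptions made only for illustration.

```python
def bit_reverse(index: int, width: int) -> int:
    """Reverse the lowest `width` bits of `index`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def packet_order(image_size: int, packet_size: int, bit_reversed: bool) -> list:
    """Return the byte offsets of the packets in the order they are visited.

    Assumption: the image splits into a power-of-two number of packets,
    so that every bit-reversed index is still a valid packet index.
    """
    count = image_size // packet_size
    if count == 0 or count & (count - 1):
        raise ValueError("packet count must be a power of two")
    width = count.bit_length() - 1
    indices = list(range(count))
    if bit_reversed:
        indices = [bit_reverse(i, width) for i in indices]
    return [i * packet_size for i in indices]


if __name__ == "__main__":
    # A 4 KByte image split into hypothetical 512-byte packets (8 packets).
    print(packet_order(4096, 512, bit_reversed=False))
    print(packet_order(4096, 512, bit_reversed=True))
```

Normal addressing walks the image sequentially, while bit-reversed addressing scatters consecutive accesses across the buffer, which makes it a useful contrast when measuring cache and memory behaviour.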


Memory Sharing Methods

• ACP Only (HP Only is similar, but there is no SCU and L2)
  Path: Accelerator - ACP - SCU - L2 - DRAM

• CPU only (with & without cache)

• CPU ACP (CPU HP is similar)
  Path: Accelerator - ACP - SCU - L2 - DRAM, with the CPU also taking part (steps 1 and 2)
  Sequence: ACP --- CPU --- ACP ---


Speed Comparison

[Figure: transfer speed versus image size. Axis labels: 4K, 16K, 64K, 128K, 256K, 1 MBytes. Annotations: 298 MBytes/s, 239 MBytes/s.]

ACP loses!

CPU OCM lies between CPU ACP & CPU HP
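Speed figures in MBytes/s can be derived from the measured loop interval of the processing task. The sketch below assumes that one pass moves `image_size_bytes` bytes and that 1 MByte means 2^20 bytes; the slides state neither, and the function name and example numbers are hypothetical.

```python
def throughput_mbytes_per_s(image_size_bytes: int, loops: int,
                            elapsed_s: float,
                            mbyte: int = 1024 * 1024) -> float:
    """Average transfer speed over `loops` passes of the task.

    Assumptions: each pass moves `image_size_bytes` bytes, and
    1 MByte is 2**20 bytes (pass mbyte=10**6 for decimal units).
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return image_size_bytes * loops / elapsed_s / mbyte


if __name__ == "__main__":
    # Hypothetical measurement: a 256 KByte image, 1000 passes, 1 second.
    print(throughput_mbytes_per_s(256 * 1024, 1000, 1.0))
```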


Dummy Traffic Effect

[Figure: effect of dummy traffic on transfer speed. Labels: 256K; HP: 1382 MBytes/s; ACP: 1664 MBytes/s.]

CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.


Power Comparison


Energy Comparison

CPU only methods: worst case!

CPU ACP: always better energy than CPU HP0. When the image size grows, CPU ACP converges to CPU HP0.

CPU OCM always between CPU ACP and CPU HP
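Energy combines the power and execution-time measurements: for a roughly constant average power, energy is power multiplied by time. The sketch below uses hypothetical numbers, not values from the measurements, to show how a method that draws more power can still consume less energy if it finishes sooner.

```python
def energy_joules(avg_power_w: float, exec_time_s: float) -> float:
    """Energy of one run, assuming a constant average power draw."""
    return avg_power_w * exec_time_s


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the trade-off.
    slow_low_power = energy_joules(avg_power_w=0.5, exec_time_s=2.0)
    fast_high_power = energy_joules(avg_power_w=0.75, exec_time_s=1.0)
    print(slow_low_power, fast_high_power)
```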


Lessons Learned & Conclusion

• If a specific task should be done by the cooperation of CPU and accelerator:
  • CPU ACP and CPU OCM are always better than CPU HP in terms of energy
  • If we are running other applications which heavily depend on caches, CPU OCM and then CPU HP are preferred!

• If a specific task should be done by accelerator only:
  • For small arrays, ACP Only & OCM Only can be used
  • For large arrays (> size of L2$), HP Only always performs better.
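The lessons above can be read as a simple selection rule. The following is a minimal sketch of that rule, not a tool from the presentation: the function name, the parameters, and the default L2 size of 512 KBytes are assumptions, and every array not larger than L2 is treated as "small".

```python
L2_SIZE_BYTES = 512 * 1024  # assumed L2 cache size; adjust for the target device


def suggest_methods(cooperative: bool, array_bytes: int = 0,
                    cache_heavy_background: bool = False,
                    l2_bytes: int = L2_SIZE_BYTES) -> list:
    """Return the preferred memory sharing methods, best first."""
    if cooperative:
        # CPU and accelerator work on the task together.
        if cache_heavy_background:
            return ["CPU OCM", "CPU HP"]
        return ["CPU ACP", "CPU OCM"]
    # Accelerator-only task: the choice depends on the array size.
    if array_bytes > l2_bytes:
        return ["HP Only"]
    return ["ACP Only", "OCM Only"]


if __name__ == "__main__":
    print(suggest_methods(cooperative=True))
    print(suggest_methods(cooperative=False, array_bytes=2 * 1024 * 1024))
```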
