Improving the Efficiency of Memory Partitioning by Address Clustering
Alberto Macii, Enrico Macii, Massimo Poncino
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition
Presenter: Hung-Yu Chen
112/04/18 Hung-Yu Chen 2/21
Abstract
Memory partitioning is an effective approach to memory energy optimization in embedded systems. Spatial locality of the memory address profile is the key property that partitioning exploits to determine an efficient multi-bank memory architecture. This paper presents an approach, called address clustering, for increasing the locality of a given memory access profile, and thus improving the efficiency of partitioning. Results obtained on several embedded applications running on an ARM7 core show average energy reductions of 25% (maximum 57%) with respect to a partitioned memory architecture synthesized without resorting to address clustering.
Outline
What’s the problem?
Memory Energy
Memory Partitioning
Address Clustering
Experimental Result
Conclusions
What’s the problem?
Modern SoC platforms usually contain one or more processors.
The gap between processor and memory speed keeps increasing.
Various types of on-chip embedded memories provide shorter latencies and wider interfaces.
Problem: the ubiquity of embedded memories makes them the largest contributor to the overall energy budget of a chip.
Memory Energy
Model: $E_{mem} = \sum_{i=1}^{N} Cost(i)$
N: number of memory accesses during the computation.
Cost(i): cost of the i-th access, determined by the memory organization and by the cost of the physical access given by the technology.
Memory energy optimization:
1. Reduce Cost(i): build a low-energy memory architecture.
2. Reduce N: modify the memory access pattern.
3. Do both.
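
The energy model above can be illustrated with a minimal Python sketch. The per-access costs, the 64-address bank boundary, and the trace are hypothetical values chosen for illustration only; they are not taken from the paper. The sketch only shows option 1: the same trace costs less when most accesses fall in a cheaper bank.

```python
def memory_energy(trace, cost):
    """E_mem = sum of Cost(i) over all N accesses in the trace."""
    return sum(cost(addr) for addr in trace)


# Hypothetical per-access costs in arbitrary units (assumptions, not paper data).
def mono_cost(addr):
    # Single large array: every access pays the same cost.
    return 10.0


def banked_cost(addr):
    # Assumed two-bank layout: a small, cheap bank for addresses below 64,
    # and a larger bank (with some selection overhead) for everything else.
    return 2.0 if addr < 64 else 11.0


trace = [3, 5, 3, 7, 3, 900, 5, 3]  # illustrative access trace, N = 8
e_mono = memory_energy(trace, mono_cost)
e_banked = memory_energy(trace, banked_cost)
print(e_mono, e_banked)
```

Seven of the eight accesses hit the small bank, so the banked total is lower even though the remaining access is more expensive than in the monolithic case.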
Memory Partitioning
memory partitioning technique.
Memory Partitioning (cont.)
Figure 1-a: The whole address space of the application is mapped to a single SRAM memory array.
Figure 1-b: A dynamic access profile.
Figure 1-c: The partitioned memory. Notice that we need to account for the power consumed in the entire partitioned memory system.
Address Clustering-Example
MPEG decoding application for an ARM7 core
Instruction stream
Address Clustering-Example (cont.)
Figure 2 shows:
Total number of addresses: 31,233 (ranging from 0 to 124,892).
The memory cut has 1,952 rows * 512 columns.
Energy consumption: 170 mJ (44.4 million total reads).
Memory partitioning:
Three memory blocks of sizes 736*256, 696*512, and 892*512.
Energy consumption: 96 mJ (inclusive of the overhead).
Energy reduction: 43.5%.
The 696*512 block keeps the majority (82%) of the memory accesses (36 million out of 44.4 million).
Address Clustering-Example (cont.)
Figure 3: Clustered address profile of an MPEG decoder.
Two memory blocks of sizes 212*128 and 1900*512.
Energy consumption: 42 mJ (an additional 56% of energy saved).
Access concentration: 99% of the memory accesses (43.99 million out of 44.4 million).
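
The percentages quoted for the MPEG example can be checked with a few lines of Python. The three energy values are the ones reported on the slides (170 mJ monolithic, 96 mJ partitioned, 42 mJ partitioned after clustering); the helper name `reduction` is only an illustrative choice.

```python
def reduction(before_mj, after_mj):
    """Relative energy reduction, as a percentage of the 'before' value."""
    return 100.0 * (before_mj - after_mj) / before_mj


# Energy values in mJ, as reported for the MPEG decoder example.
e_mono, e_part, e_clustered = 170.0, 96.0, 42.0

part_vs_mono = reduction(e_mono, e_part)         # partitioning alone
clust_vs_part = reduction(e_part, e_clustered)   # extra gain from clustering
clust_vs_mono = reduction(e_mono, e_clustered)   # combined gain

print(round(part_vs_mono, 1), round(clust_vs_part, 1), round(clust_vs_mono, 1))
```

The first two values reproduce the 43.5% and 56% figures; the third shows the combined saving relative to the monolithic memory, which the slides do not state explicitly.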
Address Clustering-Problem
Find a relocation of a proper subset of the address space that:
maximizes the locality of the dynamic trace;
minimizes the energy consumption of the memory architecture.
Cost Metrics
Dynamic access profile: $C = \{c_0, c_1, \ldots, c_{N-1}\}$
$D(C,W) = \max_i S_i$, $i = 0, 1, \ldots, N-W$, where $S_i = \sum_{j=0}^{W-1} c_{i+j}$ and W is the size of a sliding window.
$d(C,W) = D(C,W) / Tot$, where $Tot = \sum_{i=0}^{N-1} c_i$.
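
The density metric d(C,W) can be computed with a running window sum. The sketch below is one straightforward way to evaluate the definition, not necessarily how the authors implemented it; the example profile is made up for illustration.

```python
def density(profile, window):
    """Return d(C, W): the largest fraction of all accesses that falls
    inside any W consecutive addresses of the profile."""
    n = len(profile)
    total = sum(profile)  # Tot
    if not 0 < window <= n:
        raise ValueError("window must satisfy 0 < W <= N")
    if total == 0:
        raise ValueError("profile contains no accesses")
    current = sum(profile[:window])  # S_0
    best = current
    for i in range(1, n - window + 1):
        # Slide the window by one: add the entering count, drop the leaving one.
        current += profile[i + window - 1] - profile[i - 1]
        best = max(best, current)  # D(C, W) so far
    return best / total


# Illustrative access counts per address (not data from the paper).
profile = [1, 0, 8, 9, 0, 1, 1, 0]
for w in (2, 4, 8):
    print(w, density(profile, w))
```

A value close to 1 for a small W means that a small memory block could capture most of the accesses, which is the property partitioning exploits.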
Address Clustering-Problem (cont.)
Figure 4 shows the values of d(C,W) for W = 32, 64, 128, 256, 512 on the profile of Figure 2.
80%
Address Clustering-Exploration
High-level pseudo-code:
Explore: find a good value of W.
Address Clustering-Clustering Algorithm
Cluster: returns a modified trace whose first M locations contain the M most visited addresses.
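
The Cluster step can be sketched in Python as follows. The slide does not show the pseudo-code, so this is an assumption about one way to satisfy the stated property: the M hottest addresses are swapped pairwise with non-hot addresses that currently occupy locations 0..M-1. The function name, the tie-breaking rule, and the example profile are illustrative.

```python
def cluster(profile, m):
    """Return (new_profile, relocation) such that locations 0..m-1 of
    new_profile hold the m most visited addresses.

    relocation maps old address -> new address and lists only the
    addresses that were swapped (at most 2*m entries).
    """
    n = len(profile)
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= M <= N")
    # The m most visited addresses; ties are broken by lower address (assumption).
    hot = sorted(range(n), key=lambda a: (-profile[a], a))[:m]
    hot_set = set(hot)
    # Hot addresses outside the first m locations have to move in ...
    incoming = [a for a in hot if a >= m]
    # ... and trade places with non-hot addresses currently inside.
    outgoing = [a for a in range(m) if a not in hot_set]
    relocation = {}
    new_profile = list(profile)
    for src, dst in zip(incoming, outgoing):
        relocation[src], relocation[dst] = dst, src
        new_profile[src], new_profile[dst] = new_profile[dst], new_profile[src]
    return new_profile, relocation


profile = [1, 0, 8, 9, 0, 1, 1, 7]  # illustrative access counts
new_profile, relocation = cluster(profile, 2)
print(new_profile, relocation)
```

Swapping pairs keeps the relocation a permutation of the address space, so no two addresses ever map to the same location.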
Address Clustering-Encoder
Hardware encoder: clustering swaps address pairs, so up to 2M addresses are relocated.
f(X) is a function that indicates whether X belongs to the set of 2M relocated addresses.
Clustered address: X' = R(X).
The encoder is a 32-input combinational network.
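
A software model of the encoder's behavior, as a hedged sketch: the real encoder is a combinational network, and the lookup table below only mimics its input/output function. The helper `make_encoder` and the example relocation table are hypothetical; addresses outside the relocated set are assumed to pass through unchanged.

```python
def make_encoder(relocation):
    """Build the mapping X -> X' from a table of swapped addresses."""

    def f(x):
        # f(X): does X belong to the set of relocated addresses?
        return x in relocation

    def encode(x):
        # X' = R(X) for relocated addresses; other addresses are unchanged.
        return relocation[x] if f(x) else x

    return encode


# Illustrative table with M = 2 swaps (4 relocated addresses).
relocation = {3: 0, 0: 3, 2: 1, 1: 2}
encode = make_encoder(relocation)

trace = [3, 3, 2, 5, 0, 3]
clustered_trace = [encode(x) for x in trace]
print(clustered_trace)
```

Because the table is built from swaps, applying the encoder twice returns the original address.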
Experimental Result
Some benchmarks are taken from the Ptolemy distribution; others come from the MediaBench suite.
Platform: ARM software development kit.
Table 1:
#Addr: total number of distinct addresses.
Emono: the energy of the monolithic memory that contains all the data/instructions.
Epartitioned: total memory energy of a partitioned memory architecture.
M = 256, 512, 1024: memory partitioning combined with address clustering.
Experimental Result (cont.)
Experimental Result (cont.)
Original vs. Clustering (Energy)
Encoder Overhead Analysis
Encoders have been synthesized with Synopsys Design Compiler on a 0.25um technology by STMicroelectronics.
Power figures (Figure 8) are obtained with Synopsys Power Compiler.
The variation of the energy figures across the various applications is relatively small, for two reasons:
1. The complexity of the decoder is basically independent of the set of addresses that are clustered.
2. The switching activity of the address lines is very similar for all benchmarks.
Encoder Overhead Analysis (cont.)
A 16K memory dissipates about 375 mW at a frequency of 150 MHz.
Encoder power = 7.5 mW for M = 1024 (about 2% of the memory power).
Conclusions
The energy reduction achievable by memory partitioning can be improved noticeably by increasing the locality of the trace.
The paper proposes an architectural solution, called Address Clustering.
Experimental results were obtained on a set of typical embedded applications running on an ARM-based system.
Address Clustering is able to reduce the energy consumption of a partitioned memory architecture by 25% on average (maximum 57%) with respect to partitioning driven by the original trace.