CML
Vector Class on Limited Local Memory (LLM) Multi-core Processors
Ke Bai, Di Lu, and Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University, USA
http://www.aviral.lab.asu.edu
Summary
Cannot improve performance without improving power efficiency
 Cores are becoming simpler in multicore architectures
 Caches are not scalable (in both power and performance)
Limited Local Memory multicore architectures
 Each core has a scratch pad memory (e.g., the Cell processor)
 Explicit DMAs are needed to communicate with global memory
Objective: how do we enable the vector data structure (dynamic arrays) on the LLM cores?
Challenges:
 1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
 2. Dynamic global memory management and core request arbitration
 3. How to use pointers when the data pointed to may have moved?
Experiments
 Any size of vector is supported
 All SPUs may use the vector library simultaneously, and the approach is scalable
From multi- to many-core processors
Examples: IBM XCell 8i, GeForce 9800 GT, Tilera TILE64
 Simpler design and verification; reuse the cores
 Can improve performance without much increase in power
 Each core can run at a lower frequency
 Tackle thermal and reliability problems at core granularity
Memory Scaling Challenge
[Figure: power breakdown of the StrongARM 1100: D-Cache 19%, I-Cache 25%, D-MMU 5%, I-MMU 4%, ARM9 core 25%, PA Tag RAM 1%, CP15 2%, BIU 8%, SysCtl 3%, Clocks 4%, Other 4%]
[Figure: the Intel 48-core chip]
In Chip Multi-Processors (CMPs), caches guarantee data coherency
 They bring the required data from wherever it resides into the cache
 They make sure that the application gets the latest copy of the data
Caches consume too much power
 44% of power, and more than 34% of area (StrongARM 1100)
Cache coherency protocols do not scale well
 The Intel 48-core Single-chip Cloud Computer has non-coherent caches
Limited Local Memory Architecture
[Figure: the Cell architecture: a PPE and eight SPEs (SPE 0 to SPE 7) connected by the Element Interconnect Bus (EIB) to off-chip global memory; each SPE contains an SPU and a Local Store (LS). PPE: Power Processor Element, SPE: Synergistic Processor Element, LS: Local Store]
Cores have small local memories (scratch pads)
 A core can only access its own local memory
 Accesses to global memory go through explicit DMAs in the program
e.g., the IBM Cell architecture, which is in the Sony PS3
LLM Programming
Task-based programming, MPI-like communication
Main Core:
#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
    int speid, status;
    speid = spe_create_thread(&hello_spu);
}
Local Core (the same program runs on each of the six SPEs):
#include <spu_mfcio.h>
int main(speid, argp) { printf("Hello world!\n"); }
Extremely power-efficient computation, if all code and data fit into the local memory of the cores
Otherwise, efficient data management is required!
Managing Data
Original Code:
int global;
f1() {
    int a, b;
    global = a + b;
    f2();
}
Local Memory Aware Code:
int global;
f1() {
    int a, b;
    DMA.fetch(global)
    global = a + b;
    DMA.writeback(global)
    DMA.fetch(f2)
    f2();
}
Vector Class Introduction
The vector class is a widely used library for programming!
 One of the classes in the Standard Template Library (STL) for C++
 Implemented as a dynamic array; a sequential container
 Elements are stored in contiguous storage locations
 Elements can be accessed through iterators, or through offsets on regular pointers
Compared to arrays:
 Vectors can easily be resized
 Capacity increase and decrease is handled automatically
 They usually consume more memory than arrays when their capacity is handled automatically, in order to accommodate extra storage space for future growth
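A minimal illustration of this automatic growth (the exact capacity sequence is implementation-defined; libstdc++, for example, typically doubles):

```cpp
#include <cstddef>
#include <vector>

// Fill a vector with n ints and record every capacity value it passes
// through; each change of capacity means the elements were reallocated
// and moved to a new contiguous storage region.
std::vector<std::size_t> capacity_history(int n) {
    std::vector<int> vec;
    std::vector<std::size_t> caps{vec.capacity()};
    for (int i = 0; i < n; ++i) {
        vec.push_back(i);
        if (vec.capacity() != caps.back())
            caps.push_back(vec.capacity());  // a reallocation happened
    }
    return caps;
}
```

This reallocate-and-move behavior is exactly what the LLM implementation has to reproduce, except that the new region must be obtained in global memory rather than in the local heap.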
Vector Class Management
main() {
    vector<int> vec;
    for (int i = 0; i < N; i++)
        vec.push_back(i);
}
SPE code: the maximum N is 8192 (N0 = 8192)
8192 ints occupy only 32 KB, far less than the 256 KB of local memory. Why does it crash so early?
All code and data need to be managed; this paper focuses on vector data management
Vector management is difficult
 The vector size is dynamic and can be unbounded
 The Cell programming manual suggests "Use dynamic data at your own risk"
 Restricting the usage of dynamic data is restrictive for programmers
Outline of the Talk
 Motivation
 Related Works on Vector Data Management
 Our Approach of Vector Data Management
 Experiments
Related Works
[Figure: the LLM architecture: multiple SPEs, each with its own local memory, connected to global memory through DMA]
 Shared memory: MPTL [Baertschiger 2006], MCSTL [Singler 2007], and Intel TBB [Intel 2006]
 Distributed memory: POOMA [Reynders 1996], AVTL [Sheffler 1995], STAPL [Buss 2010], and PSTL [Johnson 1998]
These libraries provide efficient parallel implementations, abstract away platform details, give programmers an interface to express the parallelism of their problems, and automatically translate data from one space to another.
Different threads can access a vector concurrently, whether it is in one address space or across different spaces.
They ensure data coherency across the different spaces. But what happens when the local memory is small?
Space Allocation and Reallocation
push_back & insert add elements; the vector must be re-allocated into a larger space when there is no unused space left
An unlimited vector requires evicting older vector data to global memory and reallocating more global memory!
[Figure: (a) the vector uses up its allocated space (e.g., 0x010100 to 0x010200); (b) we allocate a larger space (e.g., 0x010500 to 0x010700) and move all data into it]
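The grow-and-copy step can be sketched in portable C++. This is a simplified model of the reallocation, not the Cell implementation; the doubling policy and the initial capacity are assumptions:

```cpp
#include <cstdlib>
#include <cstring>

// Simplified model of the reallocation: when the allocated space is used
// up, allocate a region twice as large, move all existing data into it,
// and release the old region so it can be reused.
struct GrowableBuffer {
    int*   data = nullptr;
    size_t size = 0;       // elements in use
    size_t capacity = 0;   // elements allocated

    void push_back(int v) {
        if (size == capacity) {                       // space used up
            size_t new_cap = capacity ? capacity * 2 : 4;
            int* new_data = (int*)std::malloc(new_cap * sizeof(int));
            if (data) {
                std::memcpy(new_data, data, size * sizeof(int)); // move all data
                std::free(data);                      // old space can be reused
            }
            data = new_data;
            capacity = new_cap;
        }
        data[size++] = v;
    }
};
```

On the LLM architecture the malloc/memcpy/free step is what cannot be done from the SPE directly, which is what motivates the hybrid scheme below.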
Space Allocation and Reallocation
A static buffer? It leaves a small vector with low utilization, while a large vector overflows it
An SPU thread can't use malloc() and free() on global memory
Hybrid solution: DMA + mailbox
struct msgStruct {
    int vector_id;
    int request_size;
    int data_size;
    int new_gAddr;
};
Protocol between an SPE thread and the PPE thread:
 (1) the SPE transfers the parameters (msgStruct) by DMA
 (2) the SPE sends the operation type through the mailbox
 (3) the PPE thread operates on the vector and updates new_gAddr in the data structure
 (4) the PPE sends a restart signal through the mailbox
 (5) the SPE gets the new vector address by DMA
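A portable sketch of the PPE-side handler in step (3), simulating the mailbox with a plain function call. The helper names, the doubling policy, and modeling new_gAddr as a host pointer are all our assumptions, not the paper's API:

```cpp
#include <cstdlib>
#include <cstring>
#include <map>
#include <utility>

// Message transferred from the SPE by DMA (mirrors the slide's msgStruct,
// with new_gAddr modeled as a host pointer instead of a 32-bit address).
struct MsgStruct {
    int   vector_id;
    int   request_size;  // extra bytes the SPE is asking for
    int   data_size;     // bytes currently holding live vector data
    void* new_gAddr;     // filled in by the PPE thread
};

// PPE-side bookkeeping: the allocated global region per vector id.
static std::map<int, std::pair<void*, int>> g_regions;  // id -> (addr, bytes)

// Step (3): the PPE allocates a larger global region, copies the live
// data over, frees the old region, and writes the new address back into
// the message structure for the SPE to fetch in step (5).
void ppe_handle_realloc(MsgStruct* msg) {
    void* old_addr  = nullptr;
    int   old_bytes = 0;
    auto it = g_regions.find(msg->vector_id);
    if (it != g_regions.end()) {
        old_addr  = it->second.first;
        old_bytes = it->second.second;
    }
    int new_bytes = old_bytes ? old_bytes * 2 : msg->request_size;
    if (new_bytes < old_bytes + msg->request_size)
        new_bytes = old_bytes + msg->request_size;

    void* new_addr = std::malloc(new_bytes);
    if (old_addr) {
        std::memcpy(new_addr, old_addr, msg->data_size);  // move live data
        std::free(old_addr);
    }
    g_regions[msg->vector_id] = {new_addr, new_bytes};
    msg->new_gAddr = new_addr;  // the SPE reads this back by DMA
}
```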
Element Retrieving
The vector data is partitioned into blocks (e.g., with a block size of 16 elements, Block 0 holds the 0th to 15th elements, and a later block holds the 128th to 143rd elements).
For the 133rd element: block index = 133 / 16 * 16 = 128 (integer division)
The block index is the index of the 1st element in the block. Each block contains its block index besides the data; the blocks form a linked list in global memory.
Based on the global address, we can know whether this block is in the local memory or not. If not, fetch it.
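The block-index arithmetic above relies on truncating integer division; a minimal sketch:

```cpp
// Index of the first element of the block that holds element i,
// for a fixed block size (C++ integer division truncates toward zero,
// so i / block_size * block_size rounds i down to a block boundary).
int block_index(int i, int block_size) {
    return i / block_size * block_size;
}
```

For the slide's example, block_index(133, 16) evaluates to 133 / 16 * 16 = 8 * 16 = 128.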
Vector Function Implementation
To keep the semantics, we implemented all vector functions; only the insert function is shown here.
The original insertion can take advantage of pointers:
for (……)
    (*b++) = (*a++);
But element shifting is now a challenging task under the LLM architecture, because we cannot use pointers in local memory to access global memory, and DMA requires alignment.
[Figure: inserting a new element requires shifting the elements after it; the data is moved between regions of global memory through a buffer in local memory]
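A portable sketch of the buffered shift, with memcpy standing in for the aligned DMA transfers; the staging-buffer size and the right-to-left order are our assumptions:

```cpp
#include <cstring>

// Shift global[pos .. n-1] right by one slot to make room at pos,
// moving the data in small chunks through a local staging buffer
// (a stand-in for DMA transfers on an LLM core, which cannot simply
// walk global memory with a pointer). The caller guarantees that
// global[] has room for n + 1 elements.
const int LOCAL_BUF_ELEMS = 4;

void shift_right_buffered(int* global, int n, int pos) {
    int local[LOCAL_BUF_ELEMS];
    int remaining = n - pos;
    // Work right-to-left so a chunk never overwrites data
    // that has not been moved yet.
    while (remaining > 0) {
        int chunk = remaining < LOCAL_BUF_ELEMS ? remaining : LOCAL_BUF_ELEMS;
        int src = pos + remaining - chunk;
        std::memcpy(local, global + src, chunk * sizeof(int));      // "DMA in"
        std::memcpy(global + src + 1, local, chunk * sizeof(int));  // "DMA out"
        remaining -= chunk;
    }
}
```

After the shift, the new element is written into the freed slot at pos.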
The pointer problem needs to be solved!
Pointer Problem
In order to support limitless vector data, global memory must be leveraged.
Two address spaces co-exist; no matter what scheme is implemented, the pointer issue exists.
[Figure: (a) a pointer field (int* ptr inside a struct S in local memory) points to a vector element; (b) the vector element is moved to global memory: what should ptr now point to?]
Pointer Resolution
(a) Original Program:
main() {
    vector<int> vec;
    int* a = &vec.at(index);
    int sum = 1 + *a;
    int* b = a;
}
(b) Transformed Program:
main() {
    vector<int> vec;
    int* a = ppu_addr(vec, index);
    a = ptrChecker(a);
    int sum = 1 + *a;
    a = s2p(a);
    int* b = a;
}
• ppu_addr: returns the global address of the vector element.
• ptrChecker: checks whether the pointer points to vector data; guarantees the data pointed to is in the local memory; returns the local address.
• s2p: transforms a local address back to a global address.
• A local address should not be used to identify the data.
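A portable sketch of the translation that ptrChecker and s2p perform. This is a simulation: the "global memory" and "local store" are plain arrays, the fetch is a memcpy standing in for a DMA, there is no eviction, and all names are ours rather than the paper's API:

```cpp
#include <cstring>
#include <map>

// Simulated memories.
static int g_global[1024];              // "global memory"
static int g_local[64];                 // "local store" buffer
static int g_local_used = 0;
static std::map<int*, int*> g_map;      // global addr -> resident local copy

// ptrChecker-style translation: given a global address, make sure the
// element is resident in the local store and return its local address.
int* ptr_checker(int* gaddr) {
    auto it = g_map.find(gaddr);
    if (it != g_map.end()) return it->second;   // already resident
    int* laddr = &g_local[g_local_used++];      // no eviction in this sketch
    std::memcpy(laddr, gaddr, sizeof(int));     // "DMA fetch"
    g_map[gaddr] = laddr;
    return laddr;
}

// s2p-style reverse translation: a local address back to the global one,
// so that only global addresses are ever stored in pointer variables.
int* s2p(int* laddr) {
    for (auto& kv : g_map)
        if (kv.second == laddr) return kv.first;
    return laddr;  // not a managed vector address: pass through unchanged
}
```

This also illustrates the last bullet: the local slot can be reused for a different element later, so only the global address is a stable identity for the data.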
Experimental Setup
Hardware: PlayStation 3 with the IBM Cell BE
Software: Linux Fedora 9 and IBM SDK 3.1
Benchmarks: some possible applications using vector data
Unlimited Vector Data
[Figure: runtime in seconds (log scale, 0.001 to 100) vs. total number of integers, for Our Improved Vector Class and the Original Vector Class; the original crashes beyond N0 = 8192]
Why? [Figure: each reallocation doubles the allocated space (4 B, 8 B, 16 B, ..., 2^(n+2) B) at start addresses s0, s0+4, s0+12, s0+28, ..., s0 + 2^(n+2) − 4; during the copy, the old and the new space must both fit in the local store at once]
Impact of Block Size
[Figure: runtime in seconds (log scale, 1 to 100) vs. block size (4 to 256 elements per block), for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]
Impact of Buffer Space
[Figure: runtime in seconds (0 to 30) vs. buffer size (512 to 4096 elements per buffer), for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]
buffer_size = number_of_blocks × block_size
Impact of Associativity
[Figure: runtime in seconds (0 to 35) for direct-mapped, 2-way, 4-way and 8-way associative configurations, across heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]
Higher associativity -> more computation spent on looking up the data structure, but a lower miss ratio
Scalability
[Figure: runtime in seconds (0 to 30) vs. number of cores (1 to 6), for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]
Summary
Cannot improve performance without improving power efficiency
 Cores are becoming simpler in multicore architectures
 Caches are not scalable (in both power and performance)
Limited Local Memory multicore architectures
 Each core has a scratch pad memory (e.g., the Cell processor)
 Explicit DMAs are needed to communicate with global memory
Objective: how do we enable the vector data structure (dynamic arrays) on the LLM cores?
Challenges:
 1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
 2. Dynamic global memory management and core request arbitration
 3. How to use pointers when the data pointed to may have moved?
Experiments
 Any size of vector is supported
 All SPUs may use the vector library simultaneously, and the approach is scalable