Cornell University, Ji Kim, slide 1/21
Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten
Computer Systems Laboratory, Cornell University
50th ACM/IEEE Int’l Symp. on Microarchitecture, MICRO-2017
Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs
Inter-Core
● Task-Based Parallel Programming Frameworks
  ○ Intel TBB, Cilk
Intra-Core
● Packed-SIMD Vectorization
  ○ Intel AVX, Arm NEON
Challenges of Combining Tasks and Vectors
Challenge #1: Intra-Core Parallel Abstraction Gap
void app_kernel_tbb_avx(int N, float* src, float* dst) {
  // Pack data into padded, aligned chunks:
  //   src -> src_chunks[NUM_CHUNKS][SIMD_WIDTH]
  //   dst -> dst_chunks[NUM_CHUNKS][SIMD_WIDTH]
  ...

  // Use TBB across cores
  parallel_for(range(0, NUM_CHUNKS, TASK_SIZE), [&](range r) {
    for (int i = r.begin(); i < r.end(); i++) {
      // Use packed-SIMD within a core
      #pragma simd vlen(SIMD_WIDTH)
      for (int j = 0; j < SIMD_WIDTH; j++) {
        if (src_chunks[i][j] > THRESHOLD)
          dst_chunks[i][j] = DoLightCompute(src_chunks[i][j]);
        else
          dst_chunks[i][j] = DoHeavyCompute(src_chunks[i][j]);
      }
    }
    ...
  });
  ...
}
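To make the two-level pattern on this slide concrete in a self-contained form, here is a sketch that substitutes std::thread for TBB's parallel_for and a scalar fixed-width inner loop for the #pragma simd lanes. The names app_kernel_threads and kSimdWidth, and the stand-in compute functions, are assumptions for illustration, not the paper's code:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the slide's DoLightCompute/DoHeavyCompute.
static float DoLightCompute(float x) { return x + 1.0f; }
static float DoHeavyCompute(float x) { return x * x; }

static const float THRESHOLD = 0.5f;

// Two-level parallel kernel: threads stand in for cores, and the inner
// fixed-width loop stands in for the packed-SIMD lanes.
void app_kernel_threads(int n, const float* src, float* dst) {
  const int kSimdWidth = 8;  // assumed lane count
  const int num_chunks = (n + kSimdWidth - 1) / kSimdWidth;
  unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < num_workers; ++w) {
    workers.emplace_back([=] {
      // Static chunk partitioning: worker w takes every num_workers-th chunk.
      for (int c = (int)w; c < num_chunks; c += (int)num_workers) {
        for (int j = 0; j < kSimdWidth; ++j) {  // "SIMD" inner loop
          int i = c * kSimdWidth + j;
          if (i >= n) break;
          dst[i] = (src[i] > THRESHOLD) ? DoLightCompute(src[i])
                                        : DoHeavyCompute(src[i]);
        }
      }
    });
  }
  for (auto& t : workers) t.join();
}
```

Even in this simplified form, the programmer must reason about two different parallel abstractions at once, which is the abstraction gap the slide is pointing at.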
Challenge #2: Inefficient Execution of Irregular Tasks
Native Performance Results
[Figure: native performance results on regular vs. irregular kernels]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA SW: API and ISA Hint

void app_kernel_lta(int N, float* src, float* dst) {
  LTA_PARALLEL_FOR(0, N, (dst, src), ({
    if (src[i] > THRESHOLD)
      dst[i] = DoComputeLight(src[i]);
    else
      dst[i] = DoComputeHeavy(src[i]);
  }));
}
void loop_task_func(void* a, int start, int end, int step=1);
Hint that hardware can potentially accelerate task execution
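The slides do not show the macro expanded; as a sketch of how the API could lower onto the loop_task_func signature above, here is a software-only version in which the captured pointers travel through the void* argument and a runtime entry point (lta_parallel_for is a hypothetical name) simply runs the full range. With the ISA hint, hardware could instead dispatch sub-ranges of the same function to the LTA:

```cpp
#include <cassert>

// Argument struct for captured state; the slide's signature takes void*.
struct KernelArgs { float* dst; const float* src; };

static const float THRESHOLD = 0.5f;
static float DoComputeLight(float x) { return x + 1.0f; }
static float DoComputeHeavy(float x) { return x * x; }

// Matches the slide's loop-task signature. A runtime worker (or the LTA,
// via the ISA hint) can invoke this over any sub-range of the loop.
void loop_task_func(void* a, int start, int end, int step = 1) {
  KernelArgs* args = static_cast<KernelArgs*>(a);
  for (int i = start; i < end; i += step) {
    if (args->src[i] > THRESHOLD)
      args->dst[i] = DoComputeLight(args->src[i]);
    else
      args->dst[i] = DoComputeHeavy(args->src[i]);
  }
}

// Software-only fallback for the hinted parallel-for: run the whole range.
void lta_parallel_for(void (*task)(void*, int, int, int),
                      void* args, int start, int end) {
  task(args, start, end, 1);
}
```

The key point the slide makes survives in the sketch: the programmer writes one scalar loop body, and the same function works whether or not an accelerator is present.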
LTA SW: Task-Based Runtime
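As a rough illustration of what a task-based runtime underneath the API does, this sketch splits the loop range into fixed-size tasks on a shared queue that worker threads drain. It is an assumption-laden simplification: a real runtime (like the one in the paper) would use per-worker task queues with work stealing, and each worker would hand its loop tasks to the LTA when one is present:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal task-based runtime sketch: the loop range is chopped into
// [start, end) tasks of task_size iterations; workers pull tasks until
// the queue is empty.
class LoopTaskRuntime {
 public:
  void parallel_for(int start, int end, int task_size,
                    const std::function<void(int, int)>& body) {
    {
      std::lock_guard<std::mutex> lk(m_);
      for (int s = start; s < end; s += task_size)
        tasks_.push({s, std::min(s + task_size, end)});
    }
    unsigned n = std::max(2u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n; ++w)
      workers.emplace_back([&] {
        for (;;) {
          std::pair<int, int> t;
          {
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return;  // no work left for this worker
            t = tasks_.front();
            tasks_.pop();
          }
          body(t.first, t.second);  // execute one loop task
        }
      });
    for (auto& t : workers) t.join();
  }

 private:
  std::mutex m_;
  std::queue<std::pair<int, int>> tasks_;
};
```

Because tasks are just (start, end) sub-ranges over one function, this structure composes naturally with the loop_task_func signature on the previous slide.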
LTA HW: Fully-Coupled LTA
Coupling is better for regular workloads (amortizes frontend and memory-access work)
LTA HW: Fully-Decoupled LTA
Decoupling is better for irregular workloads (hides latencies)
LTA HW: Task-Coupling Taxonomy
+ Higher performance on irregular workloads
- Higher area/energy
Task Group (lock-step execution)
More decoupling (more task groups), in either space or time, improves performance on irregular workloads at the cost of area and energy.
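The cost of coupling can be illustrated with a toy back-of-the-envelope model (not the paper's cycle-level simulator): if the lanes in one task group execute in lock-step, every lane is occupied for as long as the group's slowest task, so wasted lane-cycles grow with group size when task latencies diverge. The function name and latency numbers below are invented for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy divergence model: lanes within a task group run in lock-step, so
// each lane in a group is busy for the latency of the group's slowest
// task. Wasted lane-cycles measure the padding introduced by coupling.
int wasted_lane_cycles(const std::vector<int>& task_latency, int group_size) {
  int wasted = 0;
  for (std::size_t g = 0; g < task_latency.size(); g += group_size) {
    std::size_t end = std::min(task_latency.size(), g + group_size);
    int worst = 0;
    for (std::size_t i = g; i < end; ++i)
      worst = std::max(worst, task_latency[i]);
    for (std::size_t i = g; i < end; ++i)
      wasted += worst - task_latency[i];  // lane idles until group finishes
  }
  return wasted;
}
```

With mixed light (1-cycle) and heavy (9-cycle) tasks, one fully-coupled group of eight lanes pays the worst-case latency on every lane, while smaller groups pad far fewer cycles, matching the slide's trade-off.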
Does it matter whether we decouple in space or in time?
Evaluation: Methodology
• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
• Scientific computing: N-body simulation, MRI-Q, SGEMM
• Image processing: bilateral filter, RGB-to-CMYK, DCT
• Graph algorithms: breadth-first search, maximal matching
• Search/Sort algorithms: radix sort, substring matching
• gem5 + PyMTL co-simulation for cycle-level performance
• Component/event-based area/energy modeling
• Uses an area/energy dictionary backed by VLSI results and McPAT
Evaluation: Design Space Exploration
[Figure: design space spanning spatial decoupling and temporal decoupling, under resource constraints]
Prefer spatial decoupling over temporal decoupling
Reduce spatial decoupling to improve energy efficiency
Evaluation: Multicore LTA Performance
[Figure: multicore LTA speedups on regular vs. irregular kernels; annotations: 4.4x, 2.9x, 10.7x, 5.2x]
Evaluation: Area-Normalized Performance
[Figure: area-normalized performance; annotations: 1.8x, 1.6x, 1.2x]
Related Work
• Challenge #1: Intra-Core Parallel Abstraction Gap
  • Persistent threads for GPGPUs (S. Tzeng et al.)
  • OpenCL, OpenMP, C++ AMP
  • Cilk for vectorization (B. Ren et al.)
  • And more...
• Challenge #2: Inefficient Execution of Irregular Tasks
  • Variable warp sizing (T. Rogers et al.)
  • Temporal SIMT (S. Keckler et al.)
  • Vector-lane threading (S. Rivoire et al.)
  • And more...
• Please see paper for more detailed references!
Take-Away Points
• Intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs
• LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template
• Results suggest that in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling
• First step towards accelerating a wider variety of task parallelism