Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs

Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten

Computer Systems Laboratory, Cornell University

50th ACM/IEEE Int’l Symp. on Microarchitecture, MICRO-2017


Inter-Core

● Task-Based Parallel Programming Frameworks
  ○ Intel TBB, Cilk

Intra-Core

● Packed-SIMD Vectorization
  ○ Intel AVX, Arm NEON


Challenges of Combining Tasks and Vectors

Challenge #1: Intra-Core Parallel Abstraction Gap

    void app_kernel_tbb_avx(int N, float* src, float* dst) {
      // Pack data into padded aligned chunks
      //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
      //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
      ...

      // Use TBB across cores
      parallel_for (range(0, NUM_CHUNKS, TASK_SIZE), [&] (range r) {
        for (int i = r.begin(); i < r.end(); i++) {
          // Use packed-SIMD within a core
          #pragma simd vlen(SIMD_WIDTH)
          for (int j = 0; j < SIMD_WIDTH; j++) {
            if (src_chunks[i][j] > THRESHOLD)
              aligned_dst[i] = DoLightCompute(aligned_src[i]);
            else
              aligned_dst[i] = DoHeavyCompute(aligned_src[i]);
            ...
      ...


Challenge #2: Inefficient Execution of Irregular Tasks


Native Performance Results

[Figure: native performance results, grouped into regular and irregular workloads]


Loop-Task Accelerator (LTA) Vision


● Motivation

● Challenge #1: LTA SW

● Challenge #2: LTA HW

● Evaluation

● Conclusion


LTA SW: API and ISA Hint

    void app_kernel_lta(int N, float* src, float* dst) {
      LTA_PARALLEL_FOR(0, N, (dst, src), ({
        if (src[i] > THRESHOLD)
          dst[i] = DoComputeLight(src[i]);
        else
          dst[i] = DoComputeHeavy(src[i]);
      }));
    }

    void loop_task_func(void* a, int start, int end, int step=1);

Hint that hardware can potentially accelerate task execution
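The slide does not show how the LTA_PARALLEL_FOR body turns into a loop_task_func. The following is a minimal sketch of one plausible lowering, assuming the (dst, src) list is bundled into a struct passed through the void* argument. KernelArgs, app_loop_task, app_kernel_sketch, the THRESHOLD value, and the two compute bodies are all hypothetical placeholders, not the authors' implementation.

```cpp
// Hypothetical argument bundle standing in for the (dst, src) list.
struct KernelArgs {
  float* dst;
  const float* src;
};

// Placeholder values and helpers; the slide does not define them.
constexpr float THRESHOLD = 0.5f;
inline float DoComputeLight(float x) { return x + 1.0f; }
inline float DoComputeHeavy(float x) { return x * x; }

// Same shape as the slide's loop_task_func(void* a, int start, int end, int step).
void app_loop_task(void* a, int start, int end, int step = 1) {
  auto* args = static_cast<KernelArgs*>(a);
  for (int i = start; i < end; i += step) {
    if (args->src[i] > THRESHOLD)
      args->dst[i] = DoComputeLight(args->src[i]);
    else
      args->dst[i] = DoComputeHeavy(args->src[i]);
  }
}

// Serial stand-in for the runtime: hand the whole range to one loop task.
void app_kernel_sketch(int N, const float* src, float* dst) {
  KernelArgs args{dst, src};
  app_loop_task(&args, 0, N);
}
```

Because every iteration range goes through one fixed signature, the same function could in principle be run in software or handed to an accelerator, which is the role the slide assigns to the ISA hint.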


LTA SW: Task-Based Runtime
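The diagram for this slide did not survive extraction. As a rough illustration only, the sketch below assumes a common design for task-based runtimes: split a loop task's range in half until each piece is at most a grain size, then run the loop-task function on each piece. run_loop_task, LoopTaskFn, and the grain parameter are hypothetical names, and the work-stealing a real runtime would use is replaced here by plain serial recursion.

```cpp
// Same shape as the loop-task function signature shown on the previous slide.
using LoopTaskFn = void (*)(void* args, int start, int end, int step);

// Hypothetical runtime entry point: recursively halve [start, end) until a
// piece holds at most `grain` iterations, then execute that piece as a leaf.
// A real runtime would typically queue one half for another core to steal;
// this sketch simply recurses on both halves in order.
void run_loop_task(LoopTaskFn fn, void* args, int start, int end, int grain) {
  if (grain < 1) grain = 1;  // guard against non-terminating splits
  if (end - start <= grain) {
    fn(args, start, end, 1);  // leaf range, small enough to run on one core
    return;
  }
  int mid = start + (end - start) / 2;
  run_loop_task(fn, args, start, mid, grain);
  run_loop_task(fn, args, mid, end, grain);
}
```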


LTA HW: Fully-Coupled LTA


Coupling better for regular workloads (amortize frontend/memory)


LTA HW: Fully Decoupled LTA


Decoupling better for irregular workloads (hide latencies)


LTA HW: Task-Coupling Taxonomy

+ Higher performance on irregular workloads
- Higher area/energy

Task Group (lock-step execution)

More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy
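A toy cost model can make the lock-step trade-off concrete. The sketch below is an illustration under simplifying assumptions, not the paper's simulator: each task is either light or heavy, a lock-step task group that contains both kinds must run both paths back to back, and idle lanes still count as occupied. Costs and lane_cycles are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Assumed per-path latencies, in cycles.
struct Costs {
  long light;
  long heavy;
};

// Lane-cycles consumed when consecutive tasks are issued to lock-step
// groups of `group_width` lanes. A group pays for every path that at
// least one of its tasks takes, and all of its lanes stay occupied.
long lane_cycles(const std::vector<bool>& is_heavy, std::size_t group_width,
                 Costs c) {
  if (group_width == 0) return 0;
  long total = 0;
  for (std::size_t base = 0; base < is_heavy.size(); base += group_width) {
    bool any_light = false;
    bool any_heavy = false;
    for (std::size_t j = base; j < base + group_width && j < is_heavy.size(); ++j) {
      if (is_heavy[j]) any_heavy = true;
      else any_light = true;
    }
    long group_time = (any_light ? c.light : 0) + (any_heavy ? c.heavy : 0);
    total += group_time * static_cast<long>(group_width);
  }
  return total;
}
```

With eight alternating light and heavy tasks at costs 1 and 10, a single 8-wide group consumes 88 lane-cycles, while eight 1-wide groups consume 44. For eight heavy tasks both configurations consume 80. The model leaves out the extra frontend and memory resources that more task groups require, which is the area/energy cost noted on the slide.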


Does it matter whether we decouple in space or in time?


Evaluation: Methodology

• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
  • Scientific computing: N-body simulation, MRI-Q, SGEMM
  • Image processing: bilateral filter, RGB-to-CMYK, DCT
  • Graph algorithms: breadth-first search, maximal matching
  • Search/Sort algorithms: radix sort, substring matching

• gem5 + PyMTL co-simulation for cycle-level performance

• Component/event-based area/energy modeling
  • Uses area/energy dictionary backed by VLSI results and McPAT


Evaluation: Design Space Exploration

[Figure: design-space exploration plot, annotated with spatial decoupling, temporal decoupling, and resource constraints]

Prefer spatial decoupling over temporal decoupling

Reduce spatial decoupling to improve energy efficiency


Evaluation: Multicore LTA Performance

[Figure: multicore LTA performance for regular and irregular workloads; annotated values: 4.4x, 2.9x, 10.7x, 5.2x]


Evaluation: Area-Normalized Performance

[Figure: area-normalized performance; annotated values: 1.8x, 1.6x, 1.2x]


Related Work

• Challenge #1: Intra-Core Parallel Abstraction Gap
  • Persistent threads for GPGPUs (S. Tzeng et al.)
  • OpenCL, OpenMP, C++ AMP
  • Cilk for vectorization (B. Ren et al.)
  • And more...

• Challenge #2: Inefficient Execution of Irregular Tasks
  • Variable warp sizing (T. Rogers et al.)
  • Temporal SIMT (S. Keckler et al.)
  • Vector-lane threading (S. Rivoire et al.)
  • And more...

• Please see paper for more detailed references!


Take-Away Points


• Intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs

• LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template

• Results suggest that, in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling

• First step towards accelerating a wider variety of task parallelism