Temperature-Sensitive Loop Parallelization for Chip Multiprocessors

Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie
Embedded Mobile Computing Center (EMC2), The Pennsylvania State University

International Conference on Computer Design, 10/2-5, 2005, San Jose



Outline

Motivation
Related Works
Our Approach
Example
Experimental Results & Conclusion


Motivation

Thermal hotspots are a cause for concern
  Caused by increasing power density
  Can result in permanent chip damage

How to avoid damage
  Cooling techniques

How to prevent hotspots
  Hardware techniques
  This paper proposes a compiler-directed technique to avoid hotspots in CMPs


Related work: Dynamic Thermal Management

When one unit overheats, migrate its functionality to a distant, spare unit
  Dual pipeline (Intel, ISQED '02)
  Spare register file (Skadron et al. 2003)
  Separate core (CMP) (Heo et al. ISLPED 2003)
  Microarchitectural clusters (Intel, ICCD 2004)

Raises many interesting issues
  Cost-benefit tradeoff for extra area
  Use both resources (scheduling)
  Run-time thermal sensing/estimation

Yesterday, UC Riverside paper @ Session 2.2 proposes a run-time thermal tracking method


Related work: Design-time techniques

MDL @ PSU:
  Thermal-Aware IP Virtualization and Placement for Networks-on-Chip Architecture, ICCD 2004
  Thermal-Aware Allocation and Scheduling for MPSOC Design, DATE 2005
  Thermal-Aware Floorplanning Using Genetic Algorithms, ISQED 2005
  Thermal-Aware Voltage-Island Architecting, the other paper in this session

Other groups:
  Thermal-Aware High Level Synthesis (Northwestern Univ.: Memik, R. Dick; ISLPED 2005, ASP-DAC 2006)
  Many more in this conference

Industry:
  Gradient Design Automation (a start-up that showcased at DAC 2005)


CMP

Intel researchers and scientists are experimenting with "many tens of cores, potentially even hundreds of cores per die, per single processor die..."
  Justin R. Rattner, Intel director of the Corporate Technology Group, Spring 2005 IDF

Last night: panel discussion on CMP

Industry examples:


This paper: compiler approach

Temperature- and performance-sensitive loop scheduling
  Schedules different loop iterations on a CMP
  Data-locality aware and hence performance aware

Intuition behind the approach
  Let "hot" cores idle while cool cores work.
  Static scheduling of parallelized loop iterations at compile time


How can the compiler schedule temperature-aware code?

This work targets loop-intensive programs run on embedded CMPs.
Loop nests are divided into chunks.
  Each chunk takes a known number of cycles.
  Let the starting temperature of a processor be Tc.
  The temperature after executing the chunk is Tc' = F(Tc, cycles in the chunk, floorplan, power).
  The cycle count and the power are obtained by profiling the code.
  Floorplan and physical parameters remain constant.


Thermal modeling

Want a good model of chip temperature
  That accounts for adjacency and package
  That does not require detailed designs
  That is fast enough for practical use

A compact model based on thermal R, C (HotSpot)
Parameterized to automatically derive a model based on various
  Architectures
  Power models
  Floorplans
  Thermal packages


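Temperature Estimation

The temperature of each block depends on the power consumption and the location of the blocks.
The thermal resistance $R^t_{ij}$ of PEi with respect to PEj is the temperature rise at PEi due to one unit of power dissipated at PEj.

\[
R^t =
\begin{bmatrix}
R^t_{11} & R^t_{12} & \cdots & R^t_{1m} \\
R^t_{21} & R^t_{22} & \cdots & R^t_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
R^t_{m1} & R^t_{m2} & \cdots & R^t_{mm}
\end{bmatrix}
\]

\[
\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_m \end{bmatrix}
=
\begin{bmatrix}
R^t_{11} & R^t_{12} & \cdots & R^t_{1m} \\
R^t_{21} & R^t_{22} & \cdots & R^t_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
R^t_{m1} & R^t_{m2} & \cdots & R^t_{mm}
\end{bmatrix}
\begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_m \end{bmatrix}
\]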
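The relation between the thermal resistance matrix, the power vector and the temperature vector on this slide is a matrix-vector product. A minimal C sketch of that product follows. It is an illustration only, not the HotSpot model itself: the function name `estimate_temps`, the row-major layout and the numbers in the note below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Temperature rise of each processing element, T = R * P.
 * r is an m x m thermal resistance matrix stored row by row:
 * r[i*m + j] is the temperature rise at PE i per unit of power at PE j.
 * Hypothetical helper: the name and the storage layout are assumptions.
 */
void estimate_temps(size_t m, const double *r, const double *p, double *t)
{
    assert(r != NULL && p != NULL && t != NULL);
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < m; j++)
            sum += r[i * m + j] * p[j];   /* contribution of PE j to PE i */
        t[i] = sum;
    }
}
```

For a made-up two-element case with R = [[2, 1], [1, 3]] and P = [1, 2], the function fills T with [4, 7]: the off-diagonal entries are what make a neighbour's power raise a block's temperature.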


Running Example

Jacobi's Algorithm

  for (i=1; i<=600; i++)
    for (j=1; j<=1000; j++)
      B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Parallelized Algorithm for 5 cores (k = 0..4 is the core number)

  for (i=k*120+1; i<=(k+1)*120; i++)
    for (j=1; j<=1000; j++)
      B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Basic Schedule (Parallel Schedule)
Rows are time slots, columns are core numbers, entries are iteration chunk numbers.

  Time  P0  P1  P2  P3  P4  P5  P6  P7
  1      0   6  12  18  24
  2      1   7  13  19  25
  3      2   8  14  20  26
  4      3   9  15  21  27
  5      4  10  16  22  28
  6      5  11  17  23  29
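The mapping from iteration chunk numbers to outer-loop bounds can be sketched in C as below. It assumes that the 600 outer iterations are split into 30 equal chunks of 20 iterations, which is inferred from the chunk numbers 0 to 29 in the schedule and is not stated on the slide. The helper names `chunk_bounds` and `jacobi_chunk` and the flat array layout are hypothetical.

```c
#include <assert.h>

/*
 * Inclusive bounds [*lo, *hi] of the outer index i for one chunk,
 * assuming iterations 1..total are split into nchunks equal chunks.
 */
void chunk_bounds(int chunk, int total, int nchunks, int *lo, int *hi)
{
    assert(nchunks > 0 && total % nchunks == 0);
    assert(chunk >= 0 && chunk < nchunks);
    int size = total / nchunks;
    *lo = chunk * size + 1;
    *hi = (chunk + 1) * size;
}

/*
 * Jacobi update for rows lo..hi only. a and b are flat arrays with
 * ncols + 2 entries per row (columns 0 and ncols + 1 are the border).
 */
void jacobi_chunk(int lo, int hi, int ncols, const double *a, double *b)
{
    int stride = ncols + 2;
    for (int i = lo; i <= hi; i++)
        for (int j = 1; j <= ncols; j++)
            b[i * stride + j] = (a[(i - 1) * stride + j] + a[(i + 1) * stride + j]
                               + a[i * stride + j - 1] + a[i * stride + j + 1]) / 4;
}
```

Under this assumption chunk 6 covers i = 121..140, which is the start of the range k*120+1..(k+1)*120 for core k = 1. That agrees with the basic schedule, where each core runs six consecutive chunks.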


Analysis of Basic Schedule

Analysis
  Great locality
  Uses only 5 processors
  Will definitely overheat


Assumptions in the example

1. Initial temperature is 0

2. Threshold temperature is 2

3. An idle slot reduces the temperature by 1 degree (but not below 0)

4. So at most 2 active slots can be scheduled together on one core

5. The ideal number of active processors at any time is 5

6. Because of the structure of Jacobi's algorithm, consecutive iteration chunks exhibit locality
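These assumptions describe a toy thermal model that is easy to simulate. The C sketch below tracks one core over a sequence of time slots. It assumes that an active slot raises the temperature by 1 degree, which the slide does not state but which assumption 4 implies. The function name `peak_temperature` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Peak temperature of one core under the toy model:
 * start at 0, +1 per active slot (assumed), -1 per idle slot, never below 0.
 * active[s] is nonzero when the core runs a chunk in slot s.
 */
int peak_temperature(const int *active, int nslots)
{
    assert(active != NULL || nslots == 0);
    int temp = 0;
    int peak = 0;
    for (int s = 0; s < nslots; s++) {
        if (active[s])
            temp += 1;
        else if (temp > 0)
            temp -= 1;
        if (temp > peak)
            peak = temp;
    }
    return peak;
}
```

In the basic schedule a busy core is active in all six slots, so its peak is 6, far above the threshold of 2. A core that runs two chunks, idles for two slots and runs two more peaks at exactly 2.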


Pure Temperature Aware Scheduling

Algorithm
  Start with the time slot at 0 and all iterations unscheduled.
  While unscheduled iterations exist:
    Select the coolest A processors whose temperature is less than the threshold.
    Schedule the chunks on those processors at the current time slot.
    Reduce the number of chunks to be scheduled.
    Increase the time slot by 1.

Analysis
  Poor locality
  1 extra time slot is used
  No temperature problems
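The algorithm on this slide can be sketched in C under the toy thermal model of the running example. The sketch rests on several assumptions that are not in the slides: an active slot heats a core by 1 degree, an idle slot cools it by 1 degree down to 0, A is taken to be the ideal number of active processors, and ties between equally cool cores go to the lowest core number. The name `temp_aware_slots` and the `MAX_CORES` limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES 64

/*
 * Pure temperature-aware slot selection under the toy thermal model.
 * Writes the number of chunks placed in each slot to per_slot and returns
 * the number of slots used, or -1 if max_slots is not enough.
 */
int temp_aware_slots(int ncores, int nchunks, int a, int threshold,
                     int *per_slot, int max_slots)
{
    int temp[MAX_CORES] = {0};
    int busy[MAX_CORES];
    int slot = 0;

    assert(ncores > 0 && ncores <= MAX_CORES);
    assert(a > 0 && threshold > 0 && per_slot != NULL);

    while (nchunks > 0) {
        if (slot >= max_slots)
            return -1;
        int want = a < nchunks ? a : nchunks;
        int placed = 0;
        for (int c = 0; c < ncores; c++)
            busy[c] = 0;
        /* Pick the coolest cores that are still below the threshold. */
        while (placed < want) {
            int best = -1;
            for (int c = 0; c < ncores; c++)
                if (!busy[c] && temp[c] < threshold &&
                    (best < 0 || temp[c] < temp[best]))
                    best = c;
            if (best < 0)
                break;              /* no eligible core left in this slot */
            busy[best] = 1;
            placed++;
        }
        /* Active cores heat up, idle cores cool down (not below 0). */
        for (int c = 0; c < ncores; c++) {
            if (busy[c])
                temp[c] += 1;
            else if (temp[c] > 0)
                temp[c] -= 1;
        }
        per_slot[slot++] = placed;
        nchunks -= placed;
    }
    return slot;
}
```

Called with 8 cores, 30 chunks, A = 5 and a threshold of 2, this toy model needs 7 slots and places 5, 5, 5, 5, 4, 5 and 1 chunks in them. That is one slot more than the basic schedule, in line with the analysis above. The sketch only decides how many chunks run in each slot and on which cores; it ignores locality entirely.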


Pure Temperature Aware Scheduling

[Figure: schedule grids (time slots by cores P0-P7) showing the original schedule and the pure temperature-aware schedule. Core columns are not preserved in the extracted text; the chunks per time slot are listed below.]

Original Schedule: the basic schedule of the running example (6 time slots)

Pure temperature-aware schedule (7 time slots)
  Slot 1: 0 1 2 3 4
  Slot 2: 5 6 7 8 9
  Slot 3: 10 11 12 13 14
  Slot 4: 15 16 17 18 19
  Slot 5: 20 21 22 23
  Slot 6: 24 25 26 27 28
  Slot 7: 29


Pure Locality Aware Scheduling

Algorithm
  Start with a clean slate.
  For each iteration chunk:
    Schedule it on the processor with which it has the greatest locality, keeping at most two chunks together.
    If more slots are required (when all processors are exhausted), increase the schedule length.
    Otherwise, move to the next processor.

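[Figure: four snapshots of the schedule grid (time slots by cores P0-P7) as chunks are assigned, labelled C = { I0, I1, I2, I3, I4 }, C = { I4, I5, I6, I7, I8 }, C = { I26, I27, I28, I29 } and C = { }. Core columns are not preserved in the extracted text; the chunks per time slot of the final grid are listed below.]

Final schedule, C = { } (8 time slots)
  Slot 1: 0 4 8 12 16
  Slot 2: 1 5 9 13 17
  Slot 3: 20 22 24
  Slot 4: 2 6 10 14 18
  Slot 5: 21 23 25
  Slot 6: 3 7 11 15 19
  Slot 7: 28 26
  Slot 8: 29 27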

Analysis
  Very good locality
  However, 2 extra time slots are used
  No temperature problems


Locality and temperature aware scheduling

Algorithm
  Use temperature aware scheduling to obtain the schedulable slots.
  Use locality aware scheduling to assign chunks to these slots.

[Figure: two grids (time slots by cores P0-P7). Core columns are not preserved in the extracted text; the per-slot contents are listed below.]

Schedulable slots, C = { I0, I1, I2, I3, I4 }
  Slot 1: 5 cores
  Slot 2: 5 cores
  Slot 3: 5 cores
  Slot 4: 5 cores
  Slot 5: 4 cores
  Slot 6: 5 cores
  Slot 7: 1 core

Final schedule, C = { }
  Slot 1: 0 4 8 12 16
  Slot 2: 1 5 20 24 27
  Slot 3: 9 13 17 21 25
  Slot 4: 2 6 10 14 28
  Slot 5: 18 22 26 29
  Slot 6: 3 7 11 15 19
  Slot 7: 23

Analysis: best of both worlds
  Great locality
  No temperature problems
  Good performance


[Figure: flow of the four phases, from the input code to the optimized code. The labels that survive are listed below.]

Phase 1 - Profiling
  Input code:

    #include <stdio.h>

    #define N 5000
    #define ITER 1

    int du1[N], du2[N], du3[N];
    int au1[N][N][2], au2[N][N][2], au3[N][N][2];
    int a11=1, a12=-1, a13=-1;
    int a21=2, a22=3, a23=-3;
    int a31=5, a32=-5, a33=-2;
    int l;
    /* Initialization loop */
    int sig = 1;

    int main()
    {
      int kx;
      int ky;
      int kz;

      printf("Thread:%d\n",mp_numthreads());
      for(kx = 0; kx < N; kx = kx + 1) {
        for(ky = 0; ky < N; ky = ky + 1) {
          for(kz = 0; kz <= 1; kz = kz + 1) {
            au1[kx][ky][kz] = 1;
            au2[kx][ky][kz] = 1;
            au3[kx][ky][kz] = 1;
          }
        }
      }
    } /* main */

  Data: Cycle Times, Chunk Sizes, Energy Consumption, Architecture Details

Phase 2 - Temperature Sensitive Scheduling
  HotSpot + Scheduler
  Result: Temperature Sensitive Schedule

Phase 3 - Locality Based Scheduling
  Scheduler
  Result: Temperature & Locality Sensitive Schedule

Phase 4 - Code Generation
  Code Generator + Omega Library
  Result: Optimized, temperature sensitive code


Experiments

Five loop-intensive codes were tested.

  Benchmark   Cycles (millions)   Energy (J)
  3step-log   1487                1894686.2
  Adi         438                 1239551.1
  Btrix       1351                80918.1
  Eflux       56                  80918.1
  Tsf         1799                2548001.6


adi - Threshold Temperature 88 ºC

[Plot: x-axis: Percentage of Execution (0 to 100); y-axis ticks from 60 to 150; series: base, temperature-sensitive]


eflux - Threshold Temperature 88 ºC

[Plot: x-axis: Percentage of Execution (0 to 100); y-axis ticks from 70 to 90; series: base, temperature-sensitive]


adi - Threshold Temperature 88 ºC

[Plot: x-axis: Percentage of Execution (0 to 90); y-axis ticks from 78 to 88]


eflux - Threshold Temperature 88 ºC

[Plot: x-axis: Percentage of Execution (0 to 90); y-axis ticks from 71.5 to 74.5]


Sensitivity Analysis adi - Threshold Temperature 87 ºC


Sensitivity Analysis adi - Threshold Temperature 86 ºC


Sensitivity Analysis adi - Threshold Temperature 85 ºC


Sensitivity Analysis adi - Threshold Temperature 84 ºC


Experiments

  Benchmark    Peak Temperature (ºC)     Average Temperature (ºC)
  Name         Original   Optimized      Original   Optimized
  3step-log    95.5       80.7           80.7       78.7
  adi          146.1      86.8           100.5      85.0
  btrix        84.9       78.9           74.1       73.9
  eflux        84.9       74.2           76.4       73.7
  tsf          87.6       74.2           80.0       73.0
  average      99.8       78.9           81.2       76.9


Experiments

  Benchmark    Extra Energy    Extra Execution
  Name         Consumption     Cycles
  3step-log    2.40%           1.80%
  adi          2.40%           9.10%
  btrix        0.80%           0.60%
  eflux        7.40%           4.00%
  tsf          1.60%           1.20%
  average      2.90%           3.30%


Conclusion

Implemented a compiler-directed scheduling algorithm that is both temperature sensitive and performance aware.

Achieves large reductions in chip temperature: averaged over the five benchmarks, peak temperature drops from 99.8 ºC to 78.9 ºC and average temperature from 81.2 ºC to 76.9 ºC, for 2.9% extra energy and 3.3% extra execution cycles.

This allows software to take up the burden of preventing chip damage due to thermal effects.
  Chips can be aggressively scaled
  Cooling costs can be reduced
  Lowers the need for hardware-based thermal management schemes


Thank you!