Compiler and API for Low Power High Performance …Resume Standby: Power shutdown & 5 5 Voltage lowering 1.4V-1.0V) 3 4 3 4 2 2 0 Avg. Power 88 3% Power Reduction Avg. Power 1 0 1

Compiler and API forLow Power High Performance

MulticoresHironori Kasahara

Professor Department of Computer ScienceProfessor, Department of Computer ScienceDirector, Advanced Chip-Multiprocessor

Research InstituteResearch Institute Waseda University

Tokyo JapanTokyo, Japanhttp://www.kasahara.cs.waseda.ac.jp

June 24, 2008, MPSoC 2008

METI/NEDO National Project Multi-core for Real-time Consumer ElectronicsMulti core for Real time Consumer Electronics

<Goal> R&D of compiler cooperative multi-core processor technology for consumer

新マルチコア新マルチコア新マルチコア新マルチコア

(2005.7～2008.3)**

electronics like Mobile phones, Games, DVD, Digital TV, Car navigation systems.

CMP m

(マルチ

コア・チップm)

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC nLDM/

D-cache (ローカルデータ

デ

LPM/I C h

CMP 0 （マルチコアチップ0）

CSM j

I/O

ＣＳＰ k

CPU(プロセッサ）

DTC(データ

転送コン

I/O DevicesI/O

（入出力装置）CMP m

(マルチ



(プロ

セッサコア1)

PC nLDM/


デ

LPM/I C h


CSM j

I/O

ＣＳＰ k


DTC(データ

転送コン

I/O DevicesI/O

（入出力装置）

新マルチコアプロセッサ

•高性能

•低消費電力

短開


•高性能

•低消費電力

短HW/SW開

CMP m

(マルチ



(プロ

セッサコア1)

PC nLDM/


デ

LPM/I C h


CSM j

I/O

ＣＳＰ k


DTC(データ

転送コン

I/O DevicesI/O

（入出力装置）CMP m

(マルチ



(プロ

セッサコア1)

PC nLDM/


デ

LPM/I C h


CSM j

I/O

ＣＳＰ k


DTC(データ

転送コン

I/O DevicesI/O

（入出力装置）


•高性能

•低消費電力

短開


•高性能

•低消費電力

短HW/SW開navigation systems. <Period> From July 2005 to March 2008

<Features> ・Good cost performance

(プロセッサ

コアn)

コア1)

IntraCCN (チップ内結合網：　複数バス、クロスバー等 )

DSM(分散共有メモリ)

(メモリ/L1データ

キャッシュ)I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CSM(集中

共有メモリ)

ＣＳＰ k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス )

転送コントローラ) (プロセッサ

コアn)

コア1)







CSM(集中

共有メモリ)

ＣＳＰ k



転送コントローラ) •短HW/SW開

発期間

•各チップ間でアプリケーション共用可

•短HW/SW開発期間


(プロセッサ

コアn)

コア1)







CSM(集中

共有メモリ)

ＣＳＰ k



転送コントローラ) (プロセッサ

コアn)

コア1)







CSM(集中

共有メモリ)

ＣＳＰ k



転送コントローラ) •短HW/SW開

発期間


•短HW/SW開発期間


・Short hardware and software development periods・Low power consumption

CSM / L2 Cache(集中共有メモリあるいはL２キャッシュ )

( )

InterCCN (チップ間結合網：　複数バス、クロスバー、多段ネットワーク等 )


( )


共用可

•高信頼性

•半導体集積度と共に性能向上

•高信頼性


マルチコア統合ECU

,,


( )



( )


共用可

•高信頼性


•高信頼性


マルチコア統合ECU

,,

・Low power consumption・Scalable performance improvement with the advancement of semiconductor

開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ

w e dv ce e o se co duc o・Use of the same parallelizing compiler for multi-cores from different vendors using newly developed API

API：Application Programming Interface

**Hitachi, Renesas, Fujitsu,

Toshiba, Panasonic, NEC

OSCAR Parallelizing Compiler• Improve effective performance, cost-performance

and productivity and reduce consumed power– Multigrain Parallelization

• Exploitation of parallelism from the whole program by use ofcoarse-grain parallelism among loops and subroutines,coarse grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

Data Localization– Data Localization• Automatic data distribution for distributed shared memory, cache

and local memory on multiprocessor systems.

– Data Transfer Overlapping• Data transfer overhead hiding by overlapping task execution and

data transfer using DMA or data pre-fetchingdata transfer using DMA or data pre fetching

– Power Reduction• Reduction of consumed power by compiler control of

frequency, voltage and power shut down with hardware supports.

Earliest Executable Condition Analysis for coarse grain tasks (Macro-tasks)

1 BPABPA

Data Dependency

Control flow

Conditional branch

Block of Psuedo

1

3 BPA2 BPA

4 BPA

BPA

RB Repetition Block

RB

Block of PsuedoAssignment Statements

7

2 3

4 8BPA

5 BPA

6 RB

7 RB15 BPA

RB

RB

RB

BPA

7

11

4

5 6

8

9 10

6

BPA8 BPA

9 BPA 10 RB

11 BPA

RB 1115 7

12

13

Data dependency

Extended control dependency11 BPA

12 BPA

13 RB14

13Extended control dependencyConditional branch

ORAND

Original control flowA Macro Flow Graph14 RB

ENDA Macro Task Graph

MTG of Su2cor-LOOPS-DO400MTG of Su2cor-LOOPS-DO400Coarse grain parallelism PARA_ALD = 4.3

DOALL Sequential LOOP BBSB

Data Localization11

2

1

32

PE0 PE1

12 1

3 4 56

23 45

6 7 8910 1112

3

4

2

6 7

8

14

18

7

6 7 8910 1112

1314 15 16

5

8

9

11

15

18

19

25

8 9 101718 19 2021 22

10

11

13 16

25

29

112324 25 26

dlg0dlg3dlg1 dlg2

17 20

21

22 26

30

12 1314

15

2728 29 3031 32

33Data Localization Group

dlg023 24

27 28

32

MTG MTG after Division A schedule for two processors

15 33Data Localization Group31

Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler

• Shortest execution time mode

OSCAR Multi-Core ArchitectureCMP (chip multiprocessor 0)

CMP m

PE0 PE

CMP (chip multiprocessor 0)0

CPU

I/ODevicesI/O

Devices

0 PE1 PE nLDM/

D-cacheLPM/

CSM jI/O

CMP k

CPU

DTC

DSMI-Cache

CSN t k I t fFVR

Intra-chip connection network

CSMNetwork InterfaceFVR

FVR

CSM / L2 Cache

(Multiple Buses, Crossbar, etc)FVR

FVRFVR FVR FVR FVR

Inter-chip connection network (Crossbar, Buses, Multistage network, etc)CSM: central shared mem. LDM : local data mem.

FVR

DSM: distributed shared mem.

DTC: Data Transfer Controller

LPM : local program mem.

FVR: frequency / voltage control register

API and Parallelizing Compiler in METI/NEDOAdvanced Multicore for Realtime Consumer Electronics Project

Details of API: See http://www kasahara cs waseda ac jp/Details of API: See http://www.kasahara.cs.waseda.ac.jp/

ExecutableTranslate

into parallel

API to specify data assignment, data

Sequential Application Program

codes for each vendor

chip

pcodes for

each vender

transfer, power reduction control

Program（Subset of C Language)

RealtiA

Imag

e

Was

Backend compilerMach. Codes

Proc0ScheduledTasks

APIdecoder

Sequential Compilerm

e Co

ns

Ap

plicati

e, Secu

re

seda O

S

Backend Compiler

T1 Stop

Proc1ScheduledT k

SH multicore

Mach.Codes

APIdecoder

Sequential Compilersu

mer E

ion

Pro

ge A

ud

io

etc.

SC

AR

Co

FR-V

Tasks

T2 T4

Proc2

Codesdecoder Compiler

Electro

ni

gram

sS

treami

om

piler

Backend CompilerScheduledTasks

T3 T6 Slow

Mach.CodesSequential

CompilerAPIdecoder

csng

（CELL）

Data Transfer by DTC(DMAC)

Performance of OSCAR Compiler Using the

9 Intel Ver.10.1

Multicore API on Intel Quad-core Xeon

6

7

8

o

OSCAR

4

5

6

eedu

p ra

tio

1

2

3spe

0

mca

tv

swim

u2co

r

dro2

d

mgr

id

appl

u

urb3

d

apsi

fppp

p

wav

e5

swim

mgr

id

appl

u

apsi

tom su

hyd m tu f w m

SPEC95 SPEC2000

• OSCAR Compiler gives us 2.09 times speedup on the average against Intel Compiler ver.10.1

Performance of OSCAR Compiler Using the multicore API on Fujitsu FR1000 MulticoreAPI on Fujitsu FR1000 Multicore

3.76 3.75

3 47

4.0

2.812.90

3.47

3.0

3.5

1.941 80

2.12

2.54

1.98 1.98

2.55

2.0

2.5

1.00 1.00

1.80

1.00 1.00

1.5

0.5

1.0

0.0

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

MPEG2enc MPEG2dec MP3enc JPEG 2000enc

3.38 times speedup on the average for 4 cores againsta single core execution

Performance of OSCAR Compiler Using the Developed API 4 (SH4A) OSCAR T M ltiAPI on 4 core (SH4A) OSCAR Type Multicore

4.0

3.353.24

3.50

3.173.5

4.0

2.57 2.532.71

2 07

2.5

3.0

1.781.86 1.88 1.83

2.07

1.5

2.0

1.00 1.00 1.00 1.00

0.5

1.0

0.0

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

MPEG2enc MPEG2dec MP3enc JPEG 2000enc

3.31 times speedup on the average for 4cores against 1core

Performance OSCAR Multigrain Parallelizing Compiler on a IBM p550q 8core Deskside Server

2.7 times speedup against loop parallelizing compiler on 8 cores

Loop parallelization

Multigrain parallelizition

7

8

9

5

6

7

du

p r

atio

2

3

4

spee

d

0

1

2

t t i 2 h d 2d id l t b3d i f 5 i id l itomcatvswim su2corhydro2dmgrid applu turb3d apsi fpppp wave5 swim mgrid applu apsi

spec95 spec2000

Performance of OSCAR compiler on16 cores SGI Altix 450 Montvale server16 cores SGI Altix 450 Montvale server

6 Intel Ver.10.1

Compiler options for the Intel Compiler:for Automation parallelization: -fast -parallel.for OpenMP codes generated by OSCAR: -fast -openmp

4

5

o

OSCAR for OpenMP codes generated by OSCAR: -fast -openmp

2

3

4

peed

up r

ati

1

2sp

0

omca

tv

swim

su2c

or

hydr

o2d

mgr

id

appl

u

turb

3d

apsi

fppp

p

wav

e5

swim

mgr

id

appl

u

apsi

• OSCAR compiler gave us 2.32 times speedup

to h

spec95 spec2000

OSCAR compiler gave us 2.32 times speedupagainst Intel Fortran Itanium Compiler revision 10.1

RP2 Chip Photo and Specifications

Process 90nm 8-layer triple-ILRAM I$ Process Technology

90nm, 8-layer, triple-Vth, CMOS

Chip Size 104 8mm2

Core#1

C0

LB

SC

URAMDLRAM

Core#0ILRAM

D$

I$

Chip Size 104.8mm(10.61mm x 9.88mm)

CPU Core 6.6mm2

Core#2 Core#3S

N

SHWY CPU Core Size

6.6mm(3.36mm x 1.96mm)

Supply 1.0V–1.4V (internal), Core#6 Core#7

SN

C1

SHWYVSWC

pp yVoltage

( ),1.8/3.3V (I/O)

Power 17 (8 CPUs, Core#4 Core#5

SC

SM

Domains(

8 URAMs, common)DBSC

DDRPADGCPG

Processing Performance on the Developed Multicore Using Automatic Parallelizing CompilerMulticore Using Automatic Parallelizing Compiler

Speedup against single core execution for audio AAC encoding*) Ad d A di C di*) Advanced Audio Coding

6 0

7.0

5.0

6.0

ps

5 83 0

4.0

eed

up

3.6

5.8

2.0

3.0

Sp

1.0 1.9

0.0

1.0

1 2 4 8

Numbers of processor cores

Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encodingfor Secure Audio Encoding

AAC Encoding + AES Encryption with 8 CPU cores

Without Power With Power Control

6

7

Without Power Control（Voltage：1.4V)

6

7

With Power Control （Frequency, Resume Standby: Power shutdown &

5

6

5

6 Voltage lowering 1.4V-1.0V)

3

4

3

4

2 2

Avg. Power Avg. Power88 3% Power Reduction0

1

0

1

Avg. Power5.68 [W]

Avg. Power0.67 [W]

88.3% Power Reduction

17

Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decodingfor MPEG2 DecodingMPEG2 Decoding with 8 CPU cores

Without Power C l

With Power Control

6

7

6

7

Control（Voltage：1.4V)

W owe Co o（Frequency, Resume Standby: Power shutdown & Voltage lowering 1 4V

5 5

6 Voltage lowering 1.4V-1.0V)

3

4

3

4

1

2 2

Avg. Power Avg. Power73 5% Power Reduction0

1

0

1

Avg. Power5.73 [W]

Avg. Power1.52 [W]

73.5% Power Reduction18

Low Power High Performance Multicore Computer withMulticore Computer with Solar Panel Clean Energy AutonomousgyServers operational in deserts

Compiler cooperative low power high effectiveConclusions

Compiler cooperative low power, high effective performance, short software development period multi-core processors will be more important in wide range ofcore processors will be more important in wide range of information systems from embedded applications like games, mobile phones, digital TVs and automobiles to g p gpeta-scale supercomputers. Automatically generated parallel programs using the y g p p g gdeveloped multicore API by OSCAR compiler give us the following performances:

3.31 times speedup on 4core (SH4A) OSCAR type multicore3.38 times speedup on 4 core FR1000 against 1 core88% power reduction by the compiler power control on the developed 8 core (SH4A) multicore for realtime secure AAC encodingAAC encoding 70% power reduction on the multicore for MPEG2 decoding

Documents

Compiler and API for Low Power High Performance …Resume Standby: Power shutdown & 5 5 Voltage lowering 1.4V-1.0V) 3 4 3 4 2 2 0 Avg. Power 88 3% Power Reduction Avg. Power 1 0 1