40
Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous Multicores Hironori Kasahara Professor, Dept. of Computer Science & Engineering Director, Advanced Multicore Processor Research Institute Waseda University, Tokyo, Japan IEEE Computer Society Board of Governors IEEE Computer Society Multicore Strategic Technical Committee (STC) Chair URL: http://www.kasahara.cs.waseda.ac.jp/ I2PC Seminor, University of Illinois at Urbana-Champaign, 2012.10.18(Thursday)

Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous

and Heterogeneous Multicores Hironori Kasahara

Professor, Dept. of Computer Science & EngineeringDirector, Advanced Multicore Processor Research Institute

Waseda University, Tokyo, JapanIEEE Computer Society Board of Governors

IEEE Computer Society Multicore Strategic Technical Committee (STC) Chair

URL: http://www.kasahara.cs.waseda.ac.jp/

I2PC Seminor, University of Illinois at Urbana-Champaign, 2012.10.18(Thursday)

Page 2: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Multi/Many-core EverywhereMulti-core from embedded to supercomputers Consumer Electronics (Embedded)

Mobile Phone, Game, TV, Car Navigation, Camera, IBM/ Sony/ Toshiba Cell, Fujitsu FR1000, Panasonic Uniphier, NEC/ARM MPCore/MP211/NaviEngine,Renesas 4 core RP1, 8 core RP2, 15core Hetero RP-X,Plurarity HAL 64(Marvell), Tilera Tile64/ -Gx100(->1000cores),DARPA UHPC (2017: 80GFLOPS/W)

PCs, ServersIntel Quad Xeon, Core 2 Quad, Montvale, Nehalem(8cores), Larrabee(32cores), SCC(48cores), Night Corner(50 core+:22nm), AMD Quad Core Opteron (8, 12 cores)

WSs, Deskside & Highend ServersIBM(Power4,5,6,7), Sun (SparcT1,T2), Fujitsu SPARC64fx8

SupercomputersEarth Simulator:40TFLOPS, 2002, 5120 vector proc.BG/Q (A2:16cores) Water Cooled20PFLOPS, 3-4MW (2011-12),BlueWaters(HPCS) Power7, 10 PFLOP+(2011.07), Tianhe-1A (4.7PFLOPS,6coreX5670+ Nvidia Tesla M2050),Godson-3B (1GHz40W 8core128GFLOPS) -T (64 core,192GFLOPS:2011)RIKEN Fujitsu “K” 10PFLOPS(8core SPARC64VIIIfx, 128GGFLOPS)

High quality application software, Productivity, Costperformance, Low power consumption are important

Ex, Mobile phones, GamesCompiler cooperated multi-core processors are promising to realize the above futures

OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof.Kasahara)

The 37th (June 20,2011) &38th

(Nov.14.2011) Top 500 No.1, Riken Fujitsu “K” 705,024 cores Peak 11.28 PFLOPS, (88,128procs)LINPACK 10.510 PFLOPS (93.2%)

Page 3: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

<R & D Target>Hardware, Software, Application for Super Low-Power Manycore ProcessorsMore than 64 coresNatural air cooling (No fan)

Cool, Compact, Clear, QuietOperational by Solar Panel<Industry, Government, Academia>Hitachi, Fujitsu, NEC, Renesas, Olympus,Toyota, Denso, Mitsubishi, Toshiba, etc<Ripple Effect>Low CO2 (Carbon Dioxide) EmissionsCreation Value Added Products

Consumer Electronics, Automobiles, Servers

Green Computing Systems R&D CenterWaseda University

Supported by METI (Mar. 2011 Completion)

Beside Subway Waseda Station,Near Waseda Univ. Main Campus

3

Hitachi SR16000:Power7 128coreSMP

Fujitsu M9000SPARC VII 256 core SMP

Page 4: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Cancer Treatment Carbon Ion Radiotherapy

Environment

LivesIndustry

CapsuleInner Camera

Page 5: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Green Computing Systems R&D Center, 2011.11.1(Clear)Solar Power Generation & Server Consumption

FujitsuM9000

HitachiSR16000

Page 6: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

2012.4.2(Clear) Power Generation and Server Consumption: One day Trends

Page 7: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Super Low Power Web Server Using Embedded Multicore Processor RPX

1W with 8 SH4A processor cores

Page 8: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

METI/NEDO National Project Multi-core for Real-time Consumer Electronics

<Goal> R&D of compiler cooperative multi-core processor technology for consumer electronics like Mobile phones, Games, DVD, Digital TV, Car navigation systems.

<Period> From July 2005 to March 2008<Features> ・Good cost performance

・Short hardware and software development periods・Low power consumption・Scalable performance improvement with the advancement of semiconductor ・Use of the same parallelizing compiler for multi-cores from different vendors using newly developed API

API:Application Programming Interface

CMP m

(マルチ

コア・チップm)

(プロセッサ

コアn)

CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC n

IntraCCN (チップ内結合網: 複数バス、クロスバー等 )

DSM(分散共有メモリ)

LDM/D-cache

(ローカルデータメモリ/L1データ

キャッシュ)

LPM/I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CMP 0 (マルチコアチップ0)

InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )

CSM j

CSM(集中

共有メモリ)

I/O

CSP k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス)

CPU(プロセッサ)

DTC(データ

転送コントローラ)

I/O DevicesI/O

(入出力装置)CMP m

(マルチ

コア・チップm)

(プロセッサ

コアn)

CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC n

IntraCCN (チップ内結合網: 複数バス、クロスバー等 )

DSM(分散共有メモリ)

LDM/D-cache

(ローカルデータメモリ/L1データ

キャッシュ)

LPM/I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CMP 0 (マルチコアチップ0)

InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )

CSM j

CSM(集中

共有メモリ)

I/O

CSP k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス)

CPU(プロセッサ)

DTC(データ

転送コントローラ)

I/O DevicesI/O

(入出力装置)

新マルチコアプロセッサ

•高性能

•低消費電力

•短HW/SW開発期間

•各チップ間でアプリケーション共用可

•高信頼性

•半導体集積度と共に性能向上

新マルチコアプロセッサ

•高性能

•低消費電力

•短HW/SW開発期間

•各チップ間でアプリケーション共用可

•高信頼性

•半導体集積度と共に性能向上

マルチコア統合ECU

,,

CMP m

(マルチ

コア・チップm)

(プロセッサ

コアn)

CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC n

IntraCCN (チップ内結合網: 複数バス、クロスバー等 )

DSM(分散共有メモリ)

LDM/D-cache

(ローカルデータメモリ/L1データ

キャッシュ)

LPM/I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CMP 0 (マルチコアチップ0)

InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )

CSM j

CSM(集中

共有メモリ)

I/O

CSP k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス)

CPU(プロセッサ)

DTC(データ

転送コントローラ)

I/O DevicesI/O

(入出力装置)CMP m

(マルチ

コア・チップm)

(プロセッサ

コアn)

CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC n

IntraCCN (チップ内結合網: 複数バス、クロスバー等 )

DSM(分散共有メモリ)

LDM/D-cache

(ローカルデータメモリ/L1データ

キャッシュ)

LPM/I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CMP 0 (マルチコアチップ0)

InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )

CSM j

CSM(集中

共有メモリ)

I/O

CSP k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス)

CPU(プロセッサ)

DTC(データ

転送コントローラ)

I/O DevicesI/O

(入出力装置)

新マルチコアプロセッサ

•高性能

•低消費電力

•短HW/SW開発期間

•各チップ間でアプリケーション共用可

•高信頼性

•半導体集積度と共に性能向上

新マルチコアプロセッサ

•高性能

•低消費電力

•短HW/SW開発期間

•各チップ間でアプリケーション共用可

•高信頼性

•半導体集積度と共に性能向上

マルチコア統合ECU

,,

(2005.7~2008.3)**

**Hitachi, Renesas, Fujitsu,

Toshiba, Panasonic, NEC

Page 9: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Renesas-Hitachi-Waseda Low Power 8 core RP2 Developed in 2007 in METI/NEDO project

Process Technology

90nm, 8-layer, triple-Vth, CMOS

Chip Size 104.8mm2

(10.61mm x 9.88mm)CPU Core Size

6.6mm2

(3.36mm x 1.96mm)Supply Voltage

1.0V–1.4V (internal), 1.8/3.3V (I/O)

Power Domains

17 (8 CPUs, 8 URAMs, common)

Core#2 Core#3

Core#1

Core#4 Core#5

Core#6 Core#7

SNC

0SN

C1

DBSC

DDRPADGCPG

CSM

LB

SC

SHWY

URAMDLRAM

Core#0ILRAM

D$

I$

VSWC

IEEE ISSCC08: Paper No. 4.5, M.ITO, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”

9

Page 10: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Demo of NEDO Multicore for Real Time Consumer Electronicsat the Council of Science and Engineering Policy on April 10, 2008

CSTP MembersPrime Minister: Mr. Y. FUKUDAMinister of State for Science, Technology and Innovation Policy:Mr. F. KISHIDAChief Cabinet Secretary: Mr. N. MACHIMURAMinister of Internal Affairs and Communications :Mr. H. MASUDAMinister of Finance :Mr. F. NUKAGAMinister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAIMinister of Economy,Trade and Industry: Mr. A. AMARI

Page 11: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory

Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)

Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.

1

23 45

6 7 8910 1112

1314 15 16

1718 19 2021 22

2324 25 26

2728 29 3031 32

33Data Localization Group

dlg0dlg3dlg1 dlg2

Page 12: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Generation of coarse grain tasksMacro-tasks (MTs) Block of Pseudo Assignments (BPA): Basic Block (BB) Repetition Block (RB) : natural loop Subroutine Block (SB): subroutine

Program

BPA

RB

SB

Near fine grain parallelization

Loop level parallelizationNear fine grain of loop bodyCoarse grainparallelizationCoarse grainparallelization

BPARBSB

BPARBSB

BPARBSBBPARBSBBPARBSBBPARBSB

1st. Layer 2nd. Layer 3rd. LayerTotalSystem

Page 13: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Earliest Executable Condition Analysis for coarse grain tasks (Macro-tasks)

BPA

BPA

1 BPA

3 BPA2 BPA

4 BPA

5 BPA

6 RB

7 RB15 BPA

8 BPA

9 BPA 10 RB

11 BPA

12 BPA

13 RB

14 RB

END

RB

RB

BPA

RB

Data Dependency

Control flow

Conditional branch

Repetition Block

RB

BPA

Block of PsuedoAssignment Statements

7

11

14

1

2 3

4

5 6

15 7

8

9 10

12

13

Data dependency

Extended control dependencyConditional branch

6

ORAND

Original control flowA Macro Flow Graph

A Macro Task Graph

Page 14: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

MTG of Su2cor-LOOPS-DO400

DOALL Sequential LOOP BBSB

Coarse grain parallelism PARA_ALD = 4.3

Page 15: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Data Localization

MTG MTG after Division A schedule for two processors

1

2

3 4 56

7

8 9 10

11

12 1314

15

1

23 45

6 7 8910 1112

1314 15 16

1718 19 2021 22

2324 25 26

2728 29 3031 32

33Data Localization Group

dlg0dlg3dlg1 dlg2

3

4

2

5

6 7

8

9

10

11

13

14

15

16

17

18

19

20

21

22

23 24

25

26

27 28

29

30

31

32

PE0 PE1

12 1

Page 16: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Generated Multigrain Parallelized Code (The nested coarse grain task parallelization is realized by only

OpenMP “section”, “Flush” and “Critical” directives.)

Centralized scheduling code

Distributed scheduling code

T0 T1 T2 T3

Thread group0

MT1_1

SYNC SENDMT1_2

SYNC RECV

T4 T5 T6 T7

1_4_2

1_4_4

1_4_3

1_4_1

1_4_2

1_4_4

1_4_3

1_4_11_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

1_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

1_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

SECTIONSSECTION SECTION

END SECTIONS

Thread group1

MT1_1

MT1_3SB

MT1_2DOALL

MT1_4RB

1st layer

2nd layer

1_3_1

1_3_2 1_3_3

1_3_5

1_3_4

1_3_6

1_4_21_4_3 1_4_4

1_4_1

MT1-4

MT1-3

2nd layer

Page 17: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Low Power Heterogeneous Multicore Code

GenerationAPI

Analyzer(Available

from Waseda)

Existing sequential compiler

Multicore Program Development Using OSCAR API V2.0Sequential Application

Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Low Power Homogeneous Multicore Code

GenerationAPI

AnalyzerExisting

sequential compiler

Proc0

Thread 0

Code with directives

Waseda OSCARParallelizing Compiler

Coarse grain task parallelization

Data Localization DMAC data transfer Power reduction using

DVFS, Clock/ Power gating

Proc1

Thread 1

Code with directives

Parallelized API F or C program

OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,

data transfer using DMA, power managements

Generation of parallel machine

codes using sequential compilers

Exe

cuta

ble

on v

ario

us m

ultic

ores

OSCAR: Optimally Scheduled Advanced MultiprocessorAPI: Application Program Interface

HomegeneousMulticore s

from Vendor A(SMP servers)

Server Code GenerationOpenMP Compiler

Shred memory servers

HeterogeneousMulticores

from Vendor B

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Accelerator 1Code

Accelerator 2Code

Hom

ogen

eous

Accelerator Compiler/ User Add “hint” directives

before a loop or a function to specify it is executable by

the accelerator with how many clocks

Het

ero

Manual parallelization / power reduction

Page 18: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

Page 19: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler

• Shortest execution time mode

Page 20: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor– Frequency of each proc. is changed to several levels– Voltage is changed together with frequency– Each proc. can be powered on/off

statefrequencyvoltagedynamic energystatic power

FULL1111

MID1 / 20.873 / 4

1

LOW1 / 40.711 / 2

1

OFF0000

stateFULLMIDLOWOFF

FULL0

40k40k80k

MID40k0

40k80k

LOW40k40k0

80k

OFF80k80k80k0

stateFULLMIDLOWOFF

FULL0202040

MID2002040

LOW2020040

OFF4040400

delay time [u.t.] energy overhead [μJ]

• State transition overhead (Example: not for RP2)

Page 21: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Power Reduction Scheduling

Page 22: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Low-Power Optimization with OSCAR API

22

MT1

VC0

MT2

MT4MT3

Sleep

VC1

Scheduled Resultby OSCAR Compiler void

main_VC0() {

MT1

voidmain_VC1() {

MT2

#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))

#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))

Sleep

MT4MT3

} }

Generate Code Image by OSCAR Compiler

Page 23: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server

Compile Option:(*1) Sequential: -O3 –qarch=pwr6, XLF: -O3 –qarch=pwr6 –qsmp=auto, OSCAR: -O3 –qarch=pwr6 –qsmp=noauto(*2) Sequential: -O5 -q64 –qarch=pwr6, XLF: -O5 –q64 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –q64 –qarch=pwr6 –qsmp=noauto(Others) Sequential: -O5 –qarch=pwr6, XLF: -O5 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –qarch=pwr6 –qsmp=noauto

OpenMP codes generated by OSCAR compiler accelerate IBM XL Fortran for AIX Ver.12.1 about 3.3 times on the average

Page 24: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Performance of OSCAR Compiler on Intel  12 core SMP based on 6‐core Xeon X5670 

• OSCAR Compiler gives us 1.9 times speedup on the average against Intel Composer XE 2011

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 swim mgrid applu apsi

SPEC95 SPEC2000

Intel Composer XE 2011 OSCAR

Page 25: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Performance of OSCAR Compiler on AMD 12‐core SMP Based on Opteron 6174

• OSCAR Compiler gives us 2.2 times speedup on the average against Intel Composer XE 2011

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 swim mgrid applu apsi

SPEC95 SPEC2000

Intel Composer XE 2011 OSCAR

Page 26: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

OSCAR Compiler’s Performance on Fujitsu9000 SparcVII 256core SMP

Page 27: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Engine Control by multicore with Denso

Engine control by multicoreHard real-time processing

27

Though so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core  V850 multicore processor.

1 core 2 cores

Page 28: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

2012/07/04 API委員会 28

1.871.72

1.901.73

0

0.5

1

1.5

2

AAC ENC MPEG2 DEC OMPM equake MPEG2 ENC

Speedu

p Ra

tio

Application

1PE

2PE

1.81 times speedup by 2 cores on the average against 1 core

Performance of OSCAR Compiler & API on 2 ARMv7‐cores Qualcomm MSM8960 (Snapdragon) 

Android 4.0 for Smart Phones

Page 29: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

1.00  1.00  1.00  1.00  1.00 

1.95 1.75 

1.64 

1.95 1.77 

2.85 

2.45 

2.05  2.03 

2.47 

0.00

0.50

1.00

1.50

2.00

2.50

3.00

AAC Encoder MPEG2 Encoder MPEG2 Decoder Optical Flow(OpenCV)

SPEC2000183.equake

speed up

 ratio

1PE

2PE

3PE

Parallel Processing Performance on 3Cores NaviEngine with Realtime OS  eT‐Kernel Multi‐Core Edition

• 2.37 times speedup on 3ARM cores against 1 core

NaviEngine (ARM11 MPCore) 400MHz 3 core SMP(Renesas Electronics EC-4260)

29

Page 30: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2

by OSCAR Parallelizing Compiler

Avg. Power5.73 [W]

Avg. Power1.52 [W]

73.5% Power Reduction30

MPEG2 Decoding with 8 CPU cores

0

1

2

3

4

5

6

7

0

1

2

3

4

5

6

7

Without Power Control(Voltage:1.4V)

With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)

Page 31: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

OSCAR API-ApplicableHeterogeneous Multicore Architecture

CPU

LPM

FVR

DTU

DSM

LDLDM

Network Interface

Interconnection Network

OnChipCSM

MemoryInterface

OffChipCSM

LPM

DSM

LDLDM

FVR

CPU DTU

ACCa_0

Network Interface

31

ACCa_1

DMAC

• DTU– Data Transfer

Unit• LPM

– Local Program Memory

• LDM– Local Data

Memory• DSM

– Distributed Shared Memory

• CSM– Centralized

Shared Memory• FVR

– Frequency/Voltage Control Register

ACCb_0

VC(n+m+l)

VC(1),VPC(1)VC(0),VPC(0) VC(n),VPC(n)

VC(n+1),VPC(n+1)

VC(n+2),VPC(n+2)

VC(n+m+1)

Page 32: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

32

An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control

TIM

E

Page 33: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Heterogeneous Multicore RP-Xpresented in SSCC2010 Processors Session on Feb. 8, 2010

SH-X3SH-X3SH-X3

Cluster #0

SH-4A

I$ D$

ILM

CPU FPU

URAM

CRU

DLM

SH-4ADTU

MX2#0-1

SHPBHPBLBSCSATA SPU2PCI

exp

DBSC#0

DMAC#0

DMAC#1

DBSC#1

FE#0-3VPU5

SHwy#0(Address=40,Data=128) SHwy#1(Address=40,Data=128)

SHwy#2(Address=32,Data=64)

SNC

SH-X3SH-X3SH-X3

Cluster #1

SH-4A

L2CRenesas, Hitachi, Tokyo Inst. Of Tech. & Waseda Univ.

Page 34: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

12.29 3.09

5.4

18.85

26.71

32.65

0

5

10

15

20

25

30

35

1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE

Speedu

ps against a single SH

 processor 

3.4[fps]

111[fps]

CPU performs data transfers between SH and FE

Page 35: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

Without Power Reduction With Power Reductionby OSCAR Compiler

Average:1.76[W] Average:0.54[W]

1cycle : 33[ms]→30[fps]

70% of power reduction

Page 36: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Core #3

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #2

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #1

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #0

I$16K

D$16K

CPU FPU

URAM 64K

Local memoryI:8K, D:32K

CCNBAR

8 Core RP2 Chip Block Diagram

On-chip system bus (SuperHyway)

DDR2LCPG: Local clock pulse generatorPCR: Power Control RegisterCCN/BAR:Cache controller/Barrier RegisterURAM: User RAM (Distributed Shared Memory)

Snoo

p co

ntro

ller

1

Snoo

p co

ntro

ller

0

LCPG0

Cluster #0 Cluster #1

PCR3

PCR2

PCR1

PCR0

LCPG1

PCR7

PCR6

PCR5

PCR4

controlSRAM

controlDMA

control

Core #7

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #6

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #5

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #4

I$16K

D$16K

CPUFPU

URAM 64K

Local memoryI:8K, D:32K

CCNBAR

Barrier Sync. Lines

Page 37: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

Faster or Equal Processing Performance with Hardware Coherence Control on 8 core RP2 Multicore Precessor Having

Hardware Coherent Mechanism Up-to 4 cores by OSCAR Compiler’s Software Coherence Control

1.00

1.89

3.54

1.00

1.62

2.54

1.00

1.85

3.34

1.02

1.92

3.59

5.90

1.01 1.61

2.45

3.36

1.02

2.10

3.90

6.63

0.00

1.00

2.00

3.00

4.00

5.00

6.00

7.00

1 2 4 8 1 2 4 8 1 2 4 8

AAC Encoder MPEG2 Decoder MPEG2 Encoder

Seed

Up

agai

nst s

eque

ntia

l Pro

cess

ing

No. of processor cores

SMPNon-Coherent Cache

Page 38: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave

Propagation Simulation on Hitachi SR16000(Power7 Based 128 Core Linux SMP)

380

10

20

30

40

50

60

70

80

90

100

1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe

Spee

dup

agai

nst s

eque

ntia

l pro

cess

ing

oscar

Page 39: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

8.9times speedup by 12 processorsIntel Xeon X5670 2.93GHz 12 core SMP (Hitachi HA8000)

55 times speedup by 64 processorsIBM Power 7 64 core SMP

(Hitachi SR16000)

National Institute of Radiological Sciences (NIRS)

Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

Page 40: Green Computing Using Automatic Parallelizing and Power ......Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous

OSCAR compiler automatic parallelizes C or Fortran program using multigrain parallelization, data localization for cache and local memory with DMA data transfers and generates C or Fortran parallelized code with OSCAR API version 2.0.

It supports shared memory homogeneous and heterogeneous multicores and manycoresincluding non-coherent cache architectures.

In addition to the automatic parallelization, automatic power control using DVFS and Clock and Power gating has been implemented for real-time processing and minimum execution time processing modes.

The following performance has been attained on various multicores and servers: 55 times speedup by 64 processors for Carbon Ion Radiotherapy Cancer treatment on

IBM Power 7 SMP (Hitachi SR16000) 92 Times Speedup for GMS Earthquake Wave Propagation Simulation on 128 cores of

Hitachi SR16000 Faster or Equal Processing Performance with Hardware Coherence Control on 8 core

RP2 Multicore Precessor Having Hardware Coherent Mechanism Up-to 4 cores byOSCAR Compiler’s Software Coherence Control

33 Times Speedup for Optical Flow on 8 SH4A and 4 DRP accelerators on RP-X heterogeneous multicore.

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2. 2.2 times speedup on the average against Intel Composer XE 2011on AMD 12-core SMP

against Intel Composer XE 2011 Based on Opteron 6174. 1.9 times speedup on the average on Intel 12 core SMP based on 6-core Xeon X5670. 2.9 Times Speed-up for AAC Encoding on 3 Core NaviEngine (ARM MPcore) with

Realtime OS eT-Kernel Multi-Core Edition

Conclusions