OSCAR Compiler and API for High Performance Low Power Multicores and Their Application to Smartphones, Automobiles, Medical Systems

Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
IEEE Computer Society Multicore Strategic Technical Committee (STC) Chair
URL: http://www.kasahara.cs.waseda.ac.jp/

Intel/Kai, 2012.10.18 (Thursday)


Page 2

Multi/Many-core Everywhere
Multi-core from embedded to supercomputers

• Consumer Electronics (Embedded): mobile phone, game, TV, car navigation, camera. IBM/Sony/Toshiba Cell, Fujitsu FR1000, Panasonic Uniphier, NEC/ARM MPCore/MP211/NaviEngine, Renesas 4-core RP1, 8-core RP2, 15-core heterogeneous RP-X, Plurality HAL 64 (Marvell), Tilera Tile64/-Gx100 (->1000 cores), DARPA UHPC (2017: 80 GFLOPS/W)
• PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), Larrabee (32 cores), SCC (48 cores), Knights Corner (50+ cores, 22 nm), AMD Quad Core Opteron (8, 12 cores)
• WSs, Deskside & High-end Servers: IBM (Power4, 5, 6, 7), Sun (SPARC T1, T2), Fujitsu SPARC64fx8
• Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); BG/Q (A2: 16 cores), water cooled, 20 PFLOPS, 3-4 MW (2011-12); Blue Waters (HPCS) Power7, 10 PFLOPS+ (2011.07); Tianhe-1A (4.7 PFLOPS, 6-core X5670 + Nvidia Tesla M2050); Godson-3B (1 GHz, 40 W, 8 cores, 128 GFLOPS) and Godson-T (64 cores, 192 GFLOPS: 2011); RIKEN Fujitsu “K” 10 PFLOPS (8-core SPARC64 VIIIfx, 128 GFLOPS)

High quality application software, productivity, cost performance, and low power consumption are important (e.g., mobile phones, games).
Compiler-cooperative multi-core processors are promising for realizing the above features.

OSCAR-type multi-core chip by Renesas in the METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof. Kasahara)

Top 500 No. 1 on the 37th (June 20, 2011) and 38th (Nov. 14, 2011) lists: RIKEN Fujitsu “K”, 705,024 cores (88,128 processors), peak 11.28 PFLOPS, LINPACK 10.510 PFLOPS (93.2%)

Page 3

Green Computing Systems R&D Center, Waseda University
Supported by METI (Mar. 2011 completion)
Beside Subway Waseda Station, near Waseda Univ. Main Campus

<R & D Target>
Hardware, software, and applications for super low-power manycore processors
• More than 64 cores
• Natural air cooling (no fan): cool, compact, clear, quiet
• Operational by solar panel

<Industry, Government, Academia>
Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, etc.

<Ripple Effect>
• Low CO2 (carbon dioxide) emissions
• Creation of value-added products: consumer electronics, automobiles, servers

Hitachi SR16000: Power7 128-core SMP
Fujitsu M9000: SPARC VII 256-core SMP

Page 4

[Figure labels: Cancer Treatment Carbon Ion Radiotherapy; Environment; Lives; Industry; Capsule Inner Camera]

Page 5

Green Computing Systems R&D Center, 2011.11.1 (Clear): Solar Power Generation & Server Consumption

[Figure labels: Fujitsu M9000; Hitachi SR16000]

Page 6

2012.4.2 (Clear) Power Generation and Server Consumption: One-Day Trends

Page 7

Super Low Power Web Server Using Embedded Multicore Processor RP-X

1 W with 8 SH4A processor cores

Page 8

METI/NEDO National Project: Multi-core for Real-time Consumer Electronics (2005.7~2008.3)**

<Goal> R&D of compiler-cooperative multi-core processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.

<Period> From July 2005 to March 2008

<Features>
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor technology
• Use of the same parallelizing compiler for multi-cores from different vendors using a newly developed API

API: Application Programming Interface

**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC

[Figure: multicore architecture block diagram. Labels, translated from Japanese:
  CMP 0 ... CMP m: multicore chips 0 to m
  PC0, PC1, ... PC n: processor cores 0 to n
  CPU: processor
  LDM/D-cache: local data memory / L1 data cache
  LPM/I-Cache: local program memory / instruction cache
  DSM: distributed shared memory
  DTC: data transfer controller
  NI: network interface
  IntraCCN: intra-chip connection network (multiple buses, crossbar, etc.)
  CSM / L2 Cache: centralized shared memory or L2 cache
  InterCCN: inter-chip connection network (multiple buses, crossbar, multistage network, etc.)
  CSM j: centralized shared memory
  I/O CSP k: multicore chip for I/O
  I/O Devices: input/output devices
  Multicore integrated ECU]

New multicore processor
• High performance
• Low power consumption
• Short HW/SW development periods
• Applications can be shared across chips
• High reliability
• Performance improves with semiconductor integration density

Page 9

Renesas-Hitachi-Waseda Low Power 8-core RP2, developed in 2007 in the METI/NEDO project

Process technology:  90 nm, 8-layer, triple-Vth, CMOS
Chip size:           104.8 mm2 (10.61 mm x 9.88 mm)
CPU core size:       6.6 mm2 (3.36 mm x 1.96 mm)
Supply voltage:      1.0 V-1.4 V (internal), 1.8/3.3 V (I/O)
Power domains:       17 (8 CPUs, 8 URAMs, common)

[Figure: RP2 chip floorplan. Labels: Core#0 to Core#7, ILRAM, DLRAM, URAM, I$, D$, SNC0, SNC1, CSM, DBSC, DDRPAD, GCPG, LBSC, SHWY, VSWC]

IEEE ISSCC08: Paper No. 4.5, M. Ito, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”

Page 10

Demo of NEDO Multicore for Real Time Consumer Electronics at the Council for Science and Technology Policy (CSTP) on April 10, 2008

CSTP Members
• Prime Minister: Mr. Y. FUKUDA
• Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
• Chief Cabinet Secretary: Mr. N. MACHIMURA
• Minister of Internal Affairs and Communications: Mr. H. MASUDA
• Minister of Finance: Mr. F. NUKAGA
• Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
• Minister of Economy, Trade and Industry: Mr. A. AMARI

Page 11

OSCAR Parallelizing Compiler
To improve effective performance, cost-performance and software productivity, and to reduce power

• Multigrain Parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
• Data Localization: automatic data management for distributed shared memory, cache and local memory
• Data Transfer Overlapping: data transfer overlapping using Data Transfer Controllers (DMAs)
• Power Reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support

[Figure: macro-task graph of nodes 1 to 33 partitioned into data localization groups dlg0, dlg1, dlg2, dlg3]

Page 12

Generation of Coarse Grain Tasks

Macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine

[Figure: hierarchical decomposition of the program (total system) into BPA, RB and SB macro-tasks at the 1st, 2nd and 3rd layers. Annotations: BPA: near fine grain parallelization; RB: loop level parallelization, near fine grain parallelization of the loop body, coarse grain parallelization; SB: coarse grain parallelization]

Page 13

Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

[Figure, left: a Macro Flow Graph (original control flow) of macro-tasks 1 to 15, each a BPA (Block of Pseudo Assignment Statements) or an RB (Repetition Block), ending in END. Legend: data dependency, control flow, conditional branch]

[Figure, right: a Macro Task Graph of the same macro-tasks. Legend: data dependency, extended control dependency, conditional branch, OR, AND]
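As a rough illustration of how a macro task graph exposes coarse grain parallelism, the sketch below computes the earliest start time of each macro-task from data dependencies alone. It is a simplified stand-in for the earliest executable condition analysis named above: control dependencies and conditional branches are ignored, and the four-task graph, the task costs, and the function name `earliest_start_times` are hypothetical, not taken from the slide.

```python
from collections import deque

def earliest_start_times(cost, preds):
    """Return {task: earliest start time} for a DAG of macro-tasks.

    cost  : {task: execution time}            (hypothetical values)
    preds : {task: set of predecessor tasks}  (data dependencies only)
    A task may start once all of its predecessors have finished.
    """
    succs = {t: set() for t in cost}
    indeg = {t: 0 for t in cost}
    for t, ps in preds.items():
        for p in ps:
            succs[p].add(t)
            indeg[t] += 1

    start = {t: 0 for t in cost}
    ready = deque(t for t in cost if indeg[t] == 0)
    visited = 0
    while ready:
        t = ready.popleft()
        visited += 1
        finish = start[t] + cost[t]
        for s in succs[t]:
            # A successor cannot start before its latest predecessor ends.
            start[s] = max(start[s], finish)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if visited != len(cost):
        raise ValueError("dependency graph contains a cycle")
    return start

# Hypothetical graph: MT2 and MT3 depend only on MT1, so they can run
# in parallel; MT4 needs both of them.
cost = {"MT1": 2, "MT2": 3, "MT3": 5, "MT4": 1}
preds = {"MT1": set(), "MT2": {"MT1"}, "MT3": {"MT1"}, "MT4": {"MT2", "MT3"}}
start = earliest_start_times(cost, preds)
critical_path = max(start[t] + cost[t] for t in cost)
```

In this toy graph the total work is 11 time units while the critical path is 8, so the exposed coarse grain parallelism is 11/8; a real analysis would also have to account for branch outcomes.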

Page 14

MTG of Su2cor-LOOPS-DO400

Coarse grain parallelism PARA_ALD = 4.3

[Figure: macro task graph. Node legend: DOALL, sequential loop, BB, SB]

Page 15

Data Localization

[Figure: three panels: "MTG" (macro-tasks 1 to 15), "MTG after Division" (macro-tasks 1 to 33, grouped into data localization groups dlg0, dlg1, dlg2, dlg3), and "A schedule for two processors" (PE0, PE1)]

Page 16

Generated Multigrain Parallelized Code
(The nested coarse grain task parallelization is realized by only OpenMP “section”, “Flush” and “Critical” directives.)

[Figure: generated code layout inside SECTIONS / SECTION / END SECTIONS. Thread group0 (threads T0 to T3) and thread group1 (threads T4 to T7) execute the 1st-layer macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB), with SYNC SEND / SYNC RECV between threads. The 2nd-layer macro-tasks 1_3_1 to 1_3_6 (inside MT1-3) and 1_4_1 to 1_4_4 (inside MT1-4) are shown with the labels "Centralized scheduling code" and "Distributed scheduling code"]

Page 17

Multicore Program Development Using OSCAR API V2.0

OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface

Sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.)

Parallelization and power reduction:
• Waseda OSCAR Parallelizing Compiler: coarse grain task parallelization, data localization, DMAC data transfer, power reduction using DVFS and clock/power gating
• Accelerator compiler / user: add “hint” directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks
• Manual parallelization / power reduction

Parallelized API Fortran or C program:
• Proc0: Thread 0, code with directives
• Proc1: Thread 1, code with directives
• Accelerator 1 code
• Accelerator 2 code

OSCAR API for homogeneous and/or heterogeneous multicores and manycores: directives for thread generation, memory, data transfer using DMA, and power management

Generation of parallel machine codes using sequential compilers (executable on various multicores):
• Homogeneous: low power homogeneous multicore code generation (API analyzer + existing sequential compiler), for homogeneous multicores from Vendor A (SMP servers)
• Hetero: low power heterogeneous multicore code generation (API analyzer, available from Waseda, + existing sequential compiler), for heterogeneous multicores from Vendor B
• Server code generation (OpenMP compiler), for shared memory servers

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Page 18

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

Page 19

Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

• Shortest execution time mode

Page 20

An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor
  – Frequency of each proc. is changed to several levels
  – Voltage is changed together with frequency
  – Each proc. can be powered on/off

state   frequency   voltage   dynamic energy   static power
FULL    1           1         1                1
MID     1/2         0.87      3/4              1
LOW     1/4         0.71      1/2              1
OFF     0           0         0                0

• State transition overhead (Example: not for RP2)

delay time [u.t.]
state   FULL   MID   LOW   OFF
FULL    0      40k   40k   80k
MID     40k    0     40k   80k
LOW     40k    40k   0     80k
OFF     80k    80k   80k   0

energy overhead [μJ]
state   FULL   MID   LOW   OFF
FULL    0      20    20    40
MID     20     0     20    40
LOW     20     20    0     40
OFF     40     40    40    0
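To make the role of these parameters concrete, here is a minimal sketch of one decision a power-aware scheduler could take with them: picking the slowest frequency state that still finishes a task within its available time. It uses only the relative values of the state table above (where the dynamic energy column tracks the square of the voltage, 0.87² ≈ 0.76 and 0.71² ≈ 0.50). The function names, the deadline model, and the choice to ignore transition delay and energy overhead are assumptions made for illustration, not the OSCAR compiler's actual algorithm.

```python
from fractions import Fraction

# Relative parameters per powered-on state, from the example table:
# (frequency, voltage, dynamic energy, static power)
STATES = {
    "FULL": (Fraction(1), 1.00, Fraction(1), 1),
    "MID":  (Fraction(1, 2), 0.87, Fraction(3, 4), 1),
    "LOW":  (Fraction(1, 4), 0.71, Fraction(1, 2), 1),
}

def slowest_state_meeting(t_full, deadline):
    """Pick the lowest-frequency state whose run time fits the deadline.

    t_full   : execution time of the task at FULL frequency
    deadline : time available for the task (same unit as t_full)
    Returns None when even FULL misses the deadline.
    Transition delay and energy overhead are ignored in this sketch.
    """
    best = None
    for name, (freq, _volt, _dyn, _stat) in STATES.items():
        if t_full / freq <= deadline:
            if best is None or freq < STATES[best][0]:
                best = name
    return best

def relative_dynamic_energy(state):
    """Dynamic energy of a task in `state`, relative to running it at FULL."""
    return STATES[state][2]
```

For example, a task that takes 10 time units at FULL and has 25 units available fits at MID (20 units) but not at LOW (40 units), so `slowest_state_meeting(10, 25)` selects MID, whose relative dynamic energy is 3/4. A fuller model would subtract the transition delay from the available time and add the energy overhead before comparing states.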

Page 21

Power Reduction Scheduling

Page 22

Low-Power Optimization with OSCAR API

[Figure, left: scheduled result by OSCAR Compiler. Macro-tasks MT1, MT2, MT3 and MT4 are assigned to VC0 and VC1, with a Sleep interval]

[Figure, right: generated code image by OSCAR Compiler. One function per VC, void main_VC0() { ... } and void main_VC1() { ... }, holding the macro-tasks and the Sleep interval, together with these power-control directives:]

#pragma oscar fvcontrol \
  (1,(OSCAR_CPU(),100))

#pragma oscar fvcontrol \
  ((OSCAR_CPU(),0))

Page 23

Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server

OpenMP codes generated by the OSCAR compiler give about 3.3 times speedup on average over IBM XL Fortran for AIX Ver. 12.1.

Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Page 24

OSCAR Compiler’s Performance on Fujitsu M9000 SPARC VII 256-core SMP

Page 25

Engine Control by Multicore with Denso

Engine control by multicore: hard real-time processing

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.

[Figure: execution on 1 core vs. 2 cores]

Page 26

Performance of OSCAR Compiler & API on 2 ARMv7 Cores: Qualcomm MSM8960 (Snapdragon), Android 4.0 for Smart Phones

1.81 times speedup by 2 cores on the average against 1 core

[Chart: speedup ratio (axis 0 to 2) per application for 1PE and 2PE. Applications: AAC ENC, MPEG2 DEC, OMPM equake, MPEG2 ENC. 2PE values shown: 1.87, 1.72, 1.90, 1.73]

API Committee, 2012/07/04

Page 27

Parallel Processing Performance on 3-Core NaviEngine with Realtime OS eT-Kernel Multi-Core Edition

• 2.37 times speedup on 3 ARM cores against 1 core

NaviEngine (ARM11 MPCore) 400MHz 3-core SMP (Renesas Electronics EC-4260)

Speedup ratio
Application              1PE    2PE    3PE
AAC Encoder              1.00   1.95   2.85
MPEG2 Encoder            1.00   1.75   2.45
MPEG2 Decoder            1.00   1.64   2.05
Optical Flow (OpenCV)    1.00   1.95   2.03
SPEC2000 183.equake      1.00   1.77   2.47
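The quoted 2.37 times figure can be checked as the arithmetic mean of the five 3PE speedups on this slide. The short sketch below does that and also derives an average per-core efficiency; the variable names are illustrative, and the efficiency figure is a derived quantity that the slide itself does not state.

```python
from statistics import mean

# 3PE speedups read from this slide, in the order they appear.
speedup_3pe = [2.85, 2.45, 2.05, 2.03, 2.47]

avg_speedup = mean(speedup_3pe)   # arithmetic mean over the five programs
efficiency = avg_speedup / 3      # average speedup per core on 3 cores
```

The mean comes out at 2.37, matching the slide, which corresponds to roughly 79% parallel efficiency on three cores.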

Page 28

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores:
• Without power control (voltage: 1.4V): avg. power 5.73 [W]
• With power control (frequency, resume standby: power shutdown & voltage lowering 1.4V-1.0V): avg. power 1.52 [W]

73.5% power reduction
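The reduction figure follows directly from the two average power values on this slide. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def power_reduction(before_w, after_w):
    """Fractional reduction in average power, e.g. 0.25 means 25% lower."""
    if before_w <= 0:
        raise ValueError("before_w must be positive")
    return 1.0 - after_w / before_w

reduction = power_reduction(5.73, 1.52)  # MPEG2 decoding on RP2, in watts
remaining = 1.52 / 5.73                  # fraction of the original power left
```

Here `reduction` is about 0.735 and `remaining` is about 0.27, which is the "to 1/4" of the slide title.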

Page 29

An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control

[Figure: static schedule chart; axis: TIME]

Page 30

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)

Speedups against a single SH processor
Configuration   Speedup
1SH             1
2SH             2.29
4SH             3.09
8SH             5.4
2SH+1FE         18.85
4SH+2FE         26.71
8SH+4FE         32.65

Frame rate: 3.4 [fps] (1SH) to 111 [fps] (8SH+4FE)

CPU performs data transfers between SH and FE
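The two frame rates on this slide are consistent with the speedup chart: 3.4 fps times 32.65 is about 111 fps. The sketch below applies the same scaling to every configuration. It assumes that the seven chart values map in order onto the seven configurations on the axis and that 3.4 fps is the single-SH frame rate; the names `frame_rate` and `configs_reaching` are hypothetical.

```python
# Speedups against a single SH core, assuming the seven chart values map
# in order onto the seven configurations on the axis.
SPEEDUP = {
    "1SH": 1.0, "2SH": 2.29, "4SH": 3.09, "8SH": 5.4,
    "2SH+1FE": 18.85, "4SH+2FE": 26.71, "8SH+4FE": 32.65,
}
BASE_FPS = 3.4  # frame rate assumed to belong to the single-SH case

def frame_rate(config):
    """Estimated frames per second for a configuration."""
    return BASE_FPS * SPEEDUP[config]

def configs_reaching(target_fps):
    """Configurations whose estimated frame rate meets the target."""
    return [c for c in SPEEDUP if frame_rate(c) >= target_fps]
```

Under these assumptions only the configurations that include FE accelerators reach a 30 fps real-time target; eight SH cores alone give about 18 fps.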

Page 31

Power Reduction in a Real-time Execution Controlled by OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)

• Without power reduction: average 1.76 [W]
• With power reduction by OSCAR Compiler: average 0.54 [W]

1 cycle: 33 [ms] → 30 [fps]

70% power reduction

Page 32

8 Core RP2 Chip Block Diagram

[Figure: two clusters (Cluster #0: Core #0 to #3, Cluster #1: Core #4 to #7). Each core contains a CPU, FPU, I$ 16K, D$ 16K, local memory (I: 8K, D: 32K), User RAM (URAM) 64K, and CCN/BAR. Other blocks: Snoop controller 0 and 1, LCPG0 and LCPG1, PCR0 to PCR7, barrier sync. lines, on-chip system bus (SuperHyway), DDR2 control, SRAM control, DMA control]

LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller/Barrier Register
URAM: User RAM (Distributed Shared Memory)

Page 33

Processing performance faster than or equal to hardware coherence control, achieved by the OSCAR Compiler’s software coherence control on the 8-core RP2 multicore processor, whose hardware coherence mechanism covers up to 4 cores

Speedup against sequential processing, by number of processor cores
Application     Mode                 1      2      4      8
AAC Encoder     SMP                  1.00   1.89   3.54   -
AAC Encoder     Non-Coherent Cache   1.02   1.92   3.59   5.90
MPEG2 Decoder   SMP                  1.00   1.62   2.54   -
MPEG2 Decoder   Non-Coherent Cache   1.01   1.61   2.45   3.36
MPEG2 Encoder   SMP                  1.00   1.85   3.34   -
MPEG2 Encoder   Non-Coherent Cache   1.02   2.10   3.90   6.63

Page 34

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7 Based 128 Core Linux SMP)

[Chart: speedup against sequential processing (axis 0 to 100) for the OSCAR compiler on 1pe, 2pe, 4pe, 8pe, 16pe, 32pe, 64pe and 128pe]

Page 35

Cancer Treatment Carbon Ion Radiotherapy
National Institute of Radiological Sciences (NIRS)

• 8.9 times speedup by 12 processors: Intel Xeon X5670 2.93GHz 12-core SMP (Hitachi HA8000)
• 55 times speedup by 64 processors: IBM Power 7 64-core SMP (Hitachi SR16000)

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

Page 36

CS Multicore STC Team
Roles: Chair, FTs, Proj. Mgr., BoG “Angel”
Names: Hironori Kasahara, ? (to be hired), Hironori Kasahara
Areas: Conferences, Standards, Publishing, Education, Body of Knowledge / Thesaurus, Newsletter, Web Portal

GL: Group Lead; SE: Staff Expert

<Hard + Soft + Industrial Applications, from Embedded to HPC Systems, with Gov., Acad. & Indus.>
• Co-Chair Josep Torrellas (UIUC)
• Co-Chair Hironori Kasahara
• Vivek Sarkar (Rice U.)
• Dr. Ahmed Jerraya, CEA-LETI, MINATEC, Fr
• + Industry: Automobile, Smart Phone, Medical, etc.
• Carrie Walsh (SE)

• Start from “Online Lecture”: ask the world's best researcher on each topic to give the lecture
• David Padua (UIUC)
• Dorian McClenahan (SE)

<Online publication for quick and low cost>
• Trans. on Multicores
• Multicore Magazine
<Start as online publication through the Web Portal: not only written papers but also online presentations, especially by industry leaders>
<Consider introducing a “mileage system” for editorial and program committee members and reviewers>
• Lars Jentsch (SE), Alicia Stickley (SE)

• Architecture Committee: 3D integration, memory (non-volatile), etc.
• Software Committee: API, development environment, etc.
• Industrial Application Committee
  • Consumer electronics (smart phones): ATT, NTT, Apple,
  • Automobile (GM, Mercedes, Toyota, Bosch, Denso, etc.)
  • Medical (Varian, Hitachi, Siemens, etc.)
• Anne Marie Kelly (SE)

• Based on Parallel Processing Encyclopedia
• David Padua & others
• Dante David (SE)

• Start from online lecture with Education and online magazine with Publishing
• ?
• Theresa McNeill (SE)
• Chris Jensen (SE)

• First, with Conferences, Education and Publishing, push the latest attractive information to members.
• ?
• Margo McCall (SE)

Page 37

Conclusions

The OSCAR compiler automatically parallelizes C or Fortran programs using multigrain parallelization and data localization for cache and local memory with DMA data transfers, and generates parallelized C or Fortran code with OSCAR API version 2.0.

It supports shared memory homogeneous and heterogeneous multicores and manycores, including non-coherent cache architectures.

In addition to automatic parallelization, automatic power control using DVFS and clock and power gating has been implemented for real-time processing and minimum execution time processing modes.

The following performance has been attained on various multicores and servers:
• 55 times speedup by 64 processor cores for carbon ion radiotherapy cancer treatment on IBM Power 7 64-core SMP (Hitachi SR16000)
• 92 times speedup for GMS earthquake wave propagation simulation on 128-processor-core SMP (Hitachi SR16000)
• Processing performance faster than or equal to hardware coherence control, by the OSCAR Compiler’s software coherence control on the 8-core RP2 multicore processor, whose hardware coherence mechanism covers up to 4 cores
• 33 times speedup for optical flow on 8 SH4A cores and 4 DRP accelerators on the RP-X heterogeneous multicore
• Power reduction of MPEG2 decoding to 1/4 on the 8-core homogeneous multicore RP-2
• 1.95 times speedup on the Renesas V850 2-core embedded multicore for an automobile engine control program generated by MATLAB/Simulink Embedded Coder
• 2.9 times speedup for AAC encoding on the 3-core NaviEngine (ARM MPCore) with realtime OS eT-Kernel Multi-Core Edition