Research of OSCAR Parallelizing Compiler for High

Research of OSCAR Parallelizing Compiler for High Performance and

Low Power Green Computing

Hironori KasaharaProfessor, Dept. of Computer Science & Engineering

Director, Advanced Multicore Processor Research InstituteWaseda University, Tokyo, Japan

IEEE Computer Society Board of GovernorsIEEE Computer Society Multicore STC Chair

URL: http://www.kasahara.cs.waseda.ac.jp/

1

＜R & D Target＞Hardware, Software, Application for Super Low-Power Manycore ProcessorsMore than 64 coresNatural air cooling (No fan)

Cool, Compact, Clear, QuietOperational by Solar Panel<Industry, Government, Academia>Hitachi, Fujitsu, NEC, Renesas, Olympus,Toyota, Denso, Mitsubishi, Toshiba, etc＜Ripple Effect＞Low CO2 (Carbon Dioxide) EmissionsCreation Value Added Products

Consumer Electronics, Automobiles, Servers

Green Computing Systems R&D CenterWaseda University

Supported by METI (Mar. 2011 Completion)

Beside Subway Waseda Station,Near Waseda Univ. Main Campus

2

Hitachi SR16000:Power7 128coreSMP

Fujitsu M9000SPARC VII 256 core SMP

2

Solar PoweredSmart phones

Cameras

Robots

Cool desktop servers

Industry-government-academia collaboration in R&Dand target practical applications

On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination)

Consumer electronicInternet TV/DVD

Non-fan, cool, quiet servers designed for server

Green cloud servers

Stock trading

Operation/rechargingby solar cells

OSCAR many-core chip

Green supercomputers

OSCAR

OSCAR

OSCARMany-core

Chip

Waseda University :R&DMany-core system technologies with

ultra-low power consumptionSuper real-time disaster simulation (tectonic shifts, tsunami), tornado,flood, fire spreading)

National Institute of Radiological Sciences

Protect lives

Protect environmentFor smart life

Camcorders

Intelligent home appliances

Supercomputers and servers

Industry

Capsule inner cameras

OSCAR Compiler

Medical servers

Heavy particle radiation planning, cerebral infarction)

Toyota

Denso

Olympus

Fujitsu

MitsubishiElectric

TokyoStock Exchange

HitachiRenesasNEC

OSCARTechnology

ESOL

OSCAR API 11 Companies3 Univ.

Fujitsu

3

Renesas-Hitachi-Waseda Low Power 8 core RP2 Developed in 2007 in METI/NEDO project

Process Technology

90nm, 8-layer, triple-Vth, CMOS

Chip Size 104.8mm2

(10.61mm x 9.88mm)CPU Core Size

6.6mm2

(3.36mm x 1.96mm)Supply Voltage

1.0V–1.4V (internal), 1.8/3.3V (I/O)

Power Domains

17 (8 CPUs, 8 URAMs, common)

Core#2 Core#3

Core#1

Core#4 Core#5

Core#6 Core#7

SNC

0SN

C1

DBSC

DDRPADGCPG

CSM

LB

SC

SHWY

URAMDLRAM

Core#0ILRAM

D$

I$

VSWC

IEEE ISSCC08: Paper No. 4.5, M.ITO, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”

4

Demo of NEDO Multicore for Real Time Consumer Electronicsat the Council of Science and Engineering Policy on April 10, 2008

CSTP MembersPrime Minister: Mr. Y. FUKUDAMinister of State for Science, Technology and Innovation Policy:Mr. F. KISHIDAChief Cabinet Secretary: Mr. N. MACHIMURAMinister of Internal Affairs and Communications :Mr. H. MASUDAMinister of Finance :Mr. F. NUKAGAMinister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAIMinister of Economy,Trade and Industry: Mr. A. AMARI

5

Core #3

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #2

I$16K

D$16K

CPU FPU

User RAM 64K


Core #1

I$16K

D$16K

CPU FPU

User RAM 64K


Core #0

I$16K

D$16K

CPU FPU

Distributed SharedMemory (URAM)64K


CCNBAR

8 Core RP2 Chip Block Diagram

On-chip system bus (SuperHyway)

DDR2LCPG: Local clock pulse generatorPCR: Power Control RegisterCCN/BAR:Cache controller/Barrier RegisterURAM: User RAM (Distributed Shared Memory)

Snoo

p co

ntro

ller

1

Snoo

p co

ntro

ller

0

LCPG0

Cluster #0 Cluster #1

PCR3

PCR2

PCR1

PCR0

LCPG1

PCR7

PCR6

PCR5

PCR4

controlOn-chipShared Memory

DMAcontrol

Core #7

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #6

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #5

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #4

I$16K

D$16K

CPUFPU

URAM 64K


CCNBAR

Barrier Sync. Lines

6

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2

by OSCAR Parallelizing Compiler

Avg. Power5.73 [W]

Avg. Power1.52 [W]

73.5% Power Reduction7

MPEG2 Decoding with 8 CPU cores

0

1

2

3

4

5

6

7

0

1

2

3

4

5

6

7

Without Power Control（Voltage：1.4V)

With Power Control （Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)

7

Automatic Power Reduction on 4 core Intel Haswell

Haswell Processor OS Ubuntu 13.10 Intel CPU Core i7 4770K

4 cores L1 Cache: Load 64Bytes/cycle, Store 32Bytes/cycle L2 Cache 64Bytes/cycle L3 Cache 8 MB Frequency 3.5GHz~0.8MHz

Memory 16GB (8GB×2)

8

Power Reduction on Intel Haswell for Real-time Optical Flow

9

Power was reduced to 1/4 by compiler power optimization on the same 3 cores.The power with 3 core was reduced to 1/3 against 1 core.

29.29

36.5941.58

28.40

13.22 10.49

0.00

10.00

20.00

30.00

40.00

50.00

1 2 3

Average Po

wer [W

]

No. of Processor Cores

電力制御なし電力制御あり

Power was reduced to 1/4 by compiler on 3 cores

Power was reduced to 1/3 compared with one core ordinal execution

For HD 720p(1280x720) moving pictures15fps (Deadline66.6[ms/frame])

Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2 ARM Cortex-A9４cores

10

On 3 cores, Automatic Power Reduction successfully reduced power to 1/7 against without Power Reduction.

3 cores with power reduction reduced power to 1/3 against ordinary 1 core execution.

0.97

1.88

2.79

0.63 0.46 0.37

0.00

0.50

1.00

1.50

2.00

2.50

3.00

1 2 3

Power Con

sumption[W

]

Number of Processor Cores

電力制御なし電力制御ありWithout Power Reduction

With Power Reduction

2/3(‐35.0%)

1/4（‐75.5%）

1/7(‐86.７%)

1/3(‐61.9%)

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

0

15

30

45

60

通常の1コア実⾏並列化3コア実⾏

DrawImage : FPS

Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7

0

15

30

45

60

通常の1コア実⾏並列化3コア実⾏

DrawRect :FPS

22.82

43.57

27.16

×1.91 ×1.95

for DrawRect 1.91 speedup

for DrawImage 1.95 speedupOn Nexus7, 3 core parallelization gave us

52.88

11

1 Core 3 cores 1 Core 3 cores

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

OSCAR API-ApplicableHeterogeneous Multicore Architecture

CPU

LPM

FVR

DTU

DSM

LDLDM

Network Interface

Interconnection Network

OnChipCSM

MemoryInterface

OffChipCSM

LPM

DSM

LDLDM

FVR

CPU DTU

ACCa_0

Network Interface

12

ACCa_1

DMAC

• DTU– Data Transfer

Unit• LPM

– Local Program Memory

• LDM– Local Data

Memory• DSM

– Distributed Shared Memory

• CSM– Centralized

Shared Memory• FVR

– Frequency/Voltage Control Register

ACCb_0

VC(n+m+l)

VC(1),VPC(1)VC(0),VPC(0) VC(n),VPC(n)

VC(n+1),VPC(n+1)

VC(n+2),VPC(n+2)

VC(n+m+1)

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

12.29 3.09

5.4

18.85

26.71

32.65

0

5

10

15

20

25

30

35

1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE

Speedu

ps against a single SH

processor

3.4[fps]

111[fps]

CPU performs data transfers between SH and FE

13

Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

Without Power Reduction With Power Reductionby OSCAR Compiler

Average:1.76[W] Average:0.54[W]

1cycle : 33[ms]→30[fps]

70% of power reduction

14

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory

Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)

Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.

1

23 45

6 7 8910 1112

1314 15 16

1718 19 2021 22

2324 25 26

2728 29 3031 32

33Data Localization Group

dlg0dlg3dlg1 dlg2

15

Low Power Heterogeneous Multicore Code

GenerationAPI

Analyzer(Available

from Waseda)

Existing sequential compiler

Multicore Program Development Using OSCAR API V2.0Sequential Application

Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Low Power Homogeneous Multicore Code

GenerationAPI

AnalyzerExisting

sequential compiler

Proc0

Thread 0

Code with directives

Waseda OSCARParallelizing Compiler

Coarse grain task parallelization

Data Localization DMAC data transfer Power reduction using

DVFS, Clock/ Power gating

Proc1

Thread 1

Code with directives

Parallelized API F or C program

OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,

data transfer using DMA, power managements

Generation of parallel machine

codes using sequential compilers

Exe

cuta

ble

on v

ario

us m

ultic

ores

OSCAR: Optimally Scheduled Advanced MultiprocessorAPI： Application Program Interface

HomegeneousMulticore s

from Vendor A(SMP servers)

Server Code GenerationOpenMP Compiler

Shred memory servers

HeterogeneousMulticores

from Vendor B

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Accelerator 1Code

Accelerator 2Code

Hom

ogen

eous

Accelerator Compiler/ User Add “hint” directives

before a loop or a function to specify it is executable by

the accelerator with how many clocks

Het

ero

Manual parallelization / power reduction

16

Low-Power Optimization with OSCAR API

17

MT1

VC0

MT2

MT4MT3

Sleep

VC1

Scheduled Resultby OSCAR Compiler void

main_VC0() {

MT1

voidmain_VC1() {

MT2

#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))

#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))

Sleep

MT4MT3

} }

Generate Code Image by OSCAR Compiler

Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server

Compile Option:(*1) Sequential: -O3 –qarch=pwr6, XLF: -O3 –qarch=pwr6 –qsmp=auto, OSCAR: -O3 –qarch=pwr6 –qsmp=noauto(*2) Sequential: -O5 -q64 –qarch=pwr6, XLF: -O5 –q64 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –q64 –qarch=pwr6 –qsmp=noauto(Others) Sequential: -O5 –qarch=pwr6, XLF: -O5 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –qarch=pwr6 –qsmp=noauto

OpenMP codes generated by OSCAR compiler accelerate IBM XL Fortran for AIX Ver.12.1 about 3.3 times on the average

18

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave

Propagation Simulation on Hitachi SR16000（Power7 Based 128 Core Linux SMP）

190

10

20

30

40

50

60

70

80

90

100

1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe

Speedup

agai

nst s

eque

ntia

l pro

cess

ing

oscar

8.9times speedup by 12 processorsIntel Xeon X5670 2.93GHz 12 core SMP (Hitachi HA8000)

55 times speedup by 64 processorsIBM Power 7 64 core SMP

(Hitachi SR16000)

National Institute of Radiological Sciences (NIRS)

Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

20

Parallel Processing of Face Detection on Manycore, Highendand PC Server

• OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highendserver.

1.00 1.72

3.01

5.74

9.30

1.00 1.93

3.57

6.46

11.55

1.00 1.93

3.67

6.46

10.92

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

1 2 4 8 16

速度

向上

率

コア数

速度向上率tilepro64 gcc

SR16k(Power7 8core*4cpu*4node) xlc

rs440(Intel Xeon 8core*4cpu) icc

21

Automatic Parallelization of Still Image Encoding Using JPEG-XR for the Next Generation Cameras and Drinkable Inner Camera

• TILEPro64

Ds

t0X4)

rt 1

al)

nal)

X4)

1.00 1.96 3.95 7.86

15.82

30.79

55.11

0.00

10.00

20.00

30.00

40.00

50.00

60.00

1 2 4 8 16 32 64

Speed up

コア数

Speed‐ups on TILEPro64 Manycore0.18[s]

1コア10.0[s]

55 times speedup with 64 coresagainst 1 core

22

Engine Control by multicore with DensoThough so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core V850 multicore processor.

1 core 2 cores

Hard real-time automobile engine

control by multicore

23

Future Multicore ProductsNext Generation Automobiles‐ Safer, more comfortable, energy efficient, environment friendly‐ Cameras, radar, car2car communication, internet information integrated brake, steering, engine, moter control

Solar powered with more than 100 times power efficient : FLOPS/W• Regional Disaster Simulators

saving lives from tornadoes, localized heavy rain, fires with earth quakes

‐From everyday recharging to less than once a week‐ Solar powered operation inemergency condition‐ Keep health

Smart phones

Cancer treatment, Drinkable inner camera• Emergency solar powered• No cooling fun, No dust ,

clean usable inside OP room

Advanced medical systems Personal / Regional Supercomputers

24

Documents

Research of OSCAR Parallelizing Compiler for High