24
Research of OSCAR Parallelizing Compiler for High Performance and Low Power Green Computing Hironori Kasahara Professor, Dept. of Computer Science & Engineering Director, Advanced Multicore Processor Research Institute Waseda University, Tokyo, Japan IEEE Computer Society Board of Governors IEEE Computer Society Multicore STC Chair URL: http://www.kasahara.cs.waseda.ac.jp/ 1

Research of OSCAR Parallelizing Compiler for High

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research of OSCAR Parallelizing Compiler for High

Research of OSCAR Parallelizing Compiler for High Performance and

Low Power Green Computing

Hironori KasaharaProfessor, Dept. of Computer Science & Engineering

Director, Advanced Multicore Processor Research InstituteWaseda University, Tokyo, Japan

IEEE Computer Society Board of GovernorsIEEE Computer Society Multicore STC Chair

URL: http://www.kasahara.cs.waseda.ac.jp/

1

Page 2: Research of OSCAR Parallelizing Compiler for High

<R & D Target>Hardware, Software, Application for Super Low-Power Manycore ProcessorsMore than 64 coresNatural air cooling (No fan)

Cool, Compact, Clear, QuietOperational by Solar Panel<Industry, Government, Academia>Hitachi, Fujitsu, NEC, Renesas, Olympus,Toyota, Denso, Mitsubishi, Toshiba, etc<Ripple Effect>Low CO2 (Carbon Dioxide) EmissionsCreation Value Added Products

Consumer Electronics, Automobiles, Servers

Green Computing Systems R&D CenterWaseda University

Supported by METI (Mar. 2011 Completion)

Beside Subway Waseda Station,Near Waseda Univ. Main Campus

2

Hitachi SR16000:Power7 128coreSMP

Fujitsu M9000SPARC VII 256 core SMP

2

Page 3: Research of OSCAR Parallelizing Compiler for High

Solar PoweredSmart phones

Cameras

Robots

Cool desktop servers

Industry-government-academia collaboration in R&Dand target practical applications

On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination)

Consumer electronicInternet TV/DVD

Non-fan, cool, quiet servers designed for server

Green cloud servers

Stock trading

Operation/rechargingby solar cells

OSCAR many-core chip

Green supercomputers

OSCAR

OSCAR

OSCARMany-core

Chip

Waseda University :R&DMany-core system technologies with

ultra-low power consumptionSuper real-time disaster simulation (tectonic shifts, tsunami), tornado,flood, fire spreading)

National Institute of Radiological Sciences

Protect lives

Protect environmentFor smart life

Camcorders

Intelligent home appliances

Supercomputers and servers

Industry

Capsule inner cameras

OSCAR Compiler

Medical servers

Heavy particle radiation planning, cerebral infarction)

Toyota

Denso

Olympus

Fujitsu

MitsubishiElectric

TokyoStock Exchange

HitachiRenesasNEC

OSCARTechnology

ESOL

OSCAR API 11 Companies3 Univ.

Fujitsu

3

Page 4: Research of OSCAR Parallelizing Compiler for High

Renesas-Hitachi-Waseda Low Power 8 core RP2 Developed in 2007 in METI/NEDO project

Process Technology

90nm, 8-layer, triple-Vth, CMOS

Chip Size 104.8mm2

(10.61mm x 9.88mm)CPU Core Size

6.6mm2

(3.36mm x 1.96mm)Supply Voltage

1.0V–1.4V (internal), 1.8/3.3V (I/O)

Power Domains

17 (8 CPUs, 8 URAMs, common)

Core#2 Core#3

Core#1

Core#4 Core#5

Core#6 Core#7

SNC

0SN

C1

DBSC

DDRPADGCPG

CSM

LB

SC

SHWY

URAMDLRAM

Core#0ILRAM

D$

I$

VSWC

IEEE ISSCC08: Paper No. 4.5, M.ITO, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”

4

Page 5: Research of OSCAR Parallelizing Compiler for High

Demo of NEDO Multicore for Real Time Consumer Electronicsat the Council of Science and Engineering Policy on April 10, 2008

CSTP MembersPrime Minister: Mr. Y. FUKUDAMinister of State for Science, Technology and Innovation Policy:Mr. F. KISHIDAChief Cabinet Secretary: Mr. N. MACHIMURAMinister of Internal Affairs and Communications :Mr. H. MASUDAMinister of Finance :Mr. F. NUKAGAMinister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAIMinister of Economy,Trade and Industry: Mr. A. AMARI

5

Page 6: Research of OSCAR Parallelizing Compiler for High

Core #3

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #2

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #1

I$16K

D$16K

CPU FPU

User RAM 64K

Local memoryI:8K, D:32K

Core #0

I$16K

D$16K

CPU FPU

Distributed SharedMemory (URAM)64K

Local memoryI:8K, D:32K

CCNBAR

8 Core RP2 Chip Block Diagram

On-chip system bus (SuperHyway)

DDR2LCPG: Local clock pulse generatorPCR: Power Control RegisterCCN/BAR:Cache controller/Barrier RegisterURAM: User RAM (Distributed Shared Memory)

Snoo

p co

ntro

ller

1

Snoo

p co

ntro

ller

0

LCPG0

Cluster #0 Cluster #1

PCR3

PCR2

PCR1

PCR0

LCPG1

PCR7

PCR6

PCR5

PCR4

controlOn-chipShared Memory

DMAcontrol

Core #7

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #6

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #5

I$16K

D$16K

CPUFPU

User RAM 64K

I:8K, D:32K

Core #4

I$16K

D$16K

CPUFPU

URAM 64K

Local memoryI:8K, D:32K

CCNBAR

Barrier Sync. Lines

6

Page 7: Research of OSCAR Parallelizing Compiler for High

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2

by OSCAR Parallelizing Compiler

Avg. Power5.73 [W]

Avg. Power1.52 [W]

73.5% Power Reduction7

MPEG2 Decoding with 8 CPU cores

0

1

2

3

4

5

6

7

0

1

2

3

4

5

6

7

Without Power Control(Voltage:1.4V)

With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)

7

Page 8: Research of OSCAR Parallelizing Compiler for High

Automatic Power Reduction on 4 core Intel Haswell

Haswell Processor OS Ubuntu 13.10 Intel CPU Core i7 4770K

4 cores L1 Cache: Load 64Bytes/cycle, Store 32Bytes/cycle L2 Cache 64Bytes/cycle L3 Cache 8 MB Frequency 3.5GHz~0.8MHz

Memory 16GB (8GB×2)

8

Page 9: Research of OSCAR Parallelizing Compiler for High

Power Reduction on Intel Haswell for Real-time Optical Flow

9

Power was reduced to 1/4 by compiler power optimization on the same 3 cores.The power with 3 core was reduced to 1/3 against 1 core.

29.29

36.5941.58

28.40

13.22 10.49

0.00

10.00

20.00

30.00

40.00

50.00

1 2 3

Average Po

wer [W

]

No. of Processor Cores

電力制御なし 電力制御あり

Power was reduced to 1/4  by compiler on 3 cores 

Power was reduced to 1/3 compared with one core ordinal execution

For HD 720p(1280x720)  moving pictures15fps (Deadline66.6[ms/frame])

Page 10: Research of OSCAR Parallelizing Compiler for High

Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2 ARM Cortex-A94cores

10

On 3 cores, Automatic Power Reduction successfully reduced power to 1/7 against without Power Reduction.

3 cores with power reduction reduced power to 1/3 against ordinary 1 core execution.

0.97

1.88

2.79

0.63 0.46 0.37

0.00

0.50

1.00

1.50

2.00

2.50

3.00

1 2 3

Power Con

sumption[W

]

Number of Processor Cores

電力制御なし 電力制御ありWithout Power Reduction

With Power Reduction

2/3(‐35.0%)

1/4(‐75.5%)

1/7(‐86.7%)

1/3(‐61.9%)

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

Page 11: Research of OSCAR Parallelizing Compiler for High

0

15

30

45

60

通常の1コア実⾏ 並列化3コア実⾏

DrawImage : FPS

Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7

0

15

30

45

60

通常の1コア実⾏ 並列化3コア実⾏

DrawRect :FPS

22.82

43.57

27.16

×1.91 ×1.95

for DrawRect 1.91 speedup

for DrawImage 1.95 speedupOn Nexus7, 3 core parallelization gave us

52.88

11

1 Core 3 cores 1 Core 3 cores

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

Page 12: Research of OSCAR Parallelizing Compiler for High

OSCAR API-ApplicableHeterogeneous Multicore Architecture

CPU

LPM

FVR

DTU

DSM

LDLDM

Network Interface

Interconnection Network

OnChipCSM

MemoryInterface

OffChipCSM

LPM

DSM

LDLDM

FVR

CPU DTU

ACCa_0

Network Interface

12

ACCa_1

DMAC

• DTU– Data Transfer

Unit• LPM

– Local Program Memory

• LDM– Local Data

Memory• DSM

– Distributed Shared Memory

• CSM– Centralized

Shared Memory• FVR

– Frequency/Voltage Control Register

ACCb_0

VC(n+m+l)

VC(1),VPC(1)VC(0),VPC(0) VC(n),VPC(n)

VC(n+1),VPC(n+1)

VC(n+2),VPC(n+2)

VC(n+m+1)

Page 13: Research of OSCAR Parallelizing Compiler for High

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

12.29 3.09

5.4

18.85

26.71

32.65

0

5

10

15

20

25

30

35

1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE

Speedu

ps against a single SH

 processor 

3.4[fps]

111[fps]

CPU performs data transfers between SH and FE

13

Page 14: Research of OSCAR Parallelizing Compiler for High

Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

Without Power Reduction With Power Reductionby OSCAR Compiler

Average:1.76[W] Average:0.54[W]

1cycle : 33[ms]→30[fps]

70% of power reduction

14

Page 15: Research of OSCAR Parallelizing Compiler for High

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory

Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)

Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.

1

23 45

6 7 8910 1112

1314 15 16

1718 19 2021 22

2324 25 26

2728 29 3031 32

33Data Localization Group

dlg0dlg3dlg1 dlg2

15

Page 16: Research of OSCAR Parallelizing Compiler for High

Low Power Heterogeneous Multicore Code

GenerationAPI

Analyzer(Available

from Waseda)

Existing sequential compiler

Multicore Program Development Using OSCAR API V2.0Sequential Application

Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Low Power Homogeneous Multicore Code

GenerationAPI

AnalyzerExisting

sequential compiler

Proc0

Thread 0

Code with directives

Waseda OSCARParallelizing Compiler

Coarse grain task parallelization

Data Localization DMAC data transfer Power reduction using

DVFS, Clock/ Power gating

Proc1

Thread 1

Code with directives

Parallelized API F or C program

OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,

data transfer using DMA, power managements

Generation of parallel machine

codes using sequential compilers

Exe

cuta

ble

on v

ario

us m

ultic

ores

OSCAR: Optimally Scheduled Advanced MultiprocessorAPI: Application Program Interface

HomegeneousMulticore s

from Vendor A(SMP servers)

Server Code GenerationOpenMP Compiler

Shred memory servers

HeterogeneousMulticores

from Vendor B

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Accelerator 1Code

Accelerator 2Code

Hom

ogen

eous

Accelerator Compiler/ User Add “hint” directives

before a loop or a function to specify it is executable by

the accelerator with how many clocks

Het

ero

Manual parallelization / power reduction

16

Page 17: Research of OSCAR Parallelizing Compiler for High

Low-Power Optimization with OSCAR API

17

MT1

VC0

MT2

MT4MT3

Sleep

VC1

Scheduled Resultby OSCAR Compiler void

main_VC0() {

MT1

voidmain_VC1() {

MT2

#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))

#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))

Sleep

MT4MT3

} }

Generate Code Image by OSCAR Compiler

Page 18: Research of OSCAR Parallelizing Compiler for High

Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server

Compile Option:(*1) Sequential: -O3 –qarch=pwr6, XLF: -O3 –qarch=pwr6 –qsmp=auto, OSCAR: -O3 –qarch=pwr6 –qsmp=noauto(*2) Sequential: -O5 -q64 –qarch=pwr6, XLF: -O5 –q64 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –q64 –qarch=pwr6 –qsmp=noauto(Others) Sequential: -O5 –qarch=pwr6, XLF: -O5 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –qarch=pwr6 –qsmp=noauto

OpenMP codes generated by OSCAR compiler accelerate IBM XL Fortran for AIX Ver.12.1 about 3.3 times on the average

18

Page 19: Research of OSCAR Parallelizing Compiler for High

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave

Propagation Simulation on Hitachi SR16000(Power7 Based 128 Core Linux SMP)

190

10

20

30

40

50

60

70

80

90

100

1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe

Speedup

agai

nst s

eque

ntia

l pro

cess

ing

oscar

Page 20: Research of OSCAR Parallelizing Compiler for High

8.9times speedup by 12 processorsIntel Xeon X5670 2.93GHz 12 core SMP (Hitachi HA8000)

55 times speedup by 64 processorsIBM Power 7 64 core SMP

(Hitachi SR16000)

National Institute of Radiological Sciences (NIRS)

Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

20

Page 21: Research of OSCAR Parallelizing Compiler for High

Parallel Processing of Face Detection on Manycore, Highendand PC Server

• OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highendserver.

1.00 1.72 

3.01 

5.74 

9.30 

1.00 1.93 

3.57 

6.46 

11.55 

1.00 1.93 

3.67 

6.46 

10.92 

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

1 2 4 8 16

速度

向上

コア数

速度向上率tilepro64 gcc

SR16k(Power7 8core*4cpu*4node) xlc

rs440(Intel Xeon 8core*4cpu) icc

21

Page 22: Research of OSCAR Parallelizing Compiler for High

Automatic Parallelization of Still Image Encoding Using JPEG-XR for the Next Generation Cameras and Drinkable Inner Camera

• TILEPro64

Ds

t0X4)

rt 1

al)

nal)

X4)

1.00  1.96  3.95 7.86 

15.82 

30.79 

55.11 

0.00

10.00

20.00

30.00

40.00

50.00

60.00

1 2 4 8 16 32 64

Speed up

コア数

Speed‐ups on TILEPro64 Manycore0.18[s]

1コア10.0[s]

55 times speedup with 64 coresagainst 1 core

22

Page 23: Research of OSCAR Parallelizing Compiler for High

Engine Control by multicore with DensoThough so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core  V850 multicore processor.

1 core 2 cores

Hard real-time automobile engine

control by multicore

23

Page 24: Research of OSCAR Parallelizing Compiler for High

Future Multicore ProductsNext Generation Automobiles‐ Safer, more comfortable, energy efficient, environment friendly‐ Cameras, radar, car2car communication, internet information integrated brake, steering, engine, moter  control

Solar powered with more than 100 times power efficient :  FLOPS/W• Regional Disaster Simulators

saving lives from tornadoes, localized heavy rain, fires with earth quakes

‐From everyday recharging to less than once a week‐ Solar powered operation  inemergency condition‐ Keep health 

Smart phones

Cancer treatment, Drinkable inner camera• Emergency solar powered• No cooling fun, No dust ,

clean usable inside OP room

Advanced medical systems Personal / Regional Supercomputers

24