Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Research of OSCAR Parallelizing Compiler for High Performance and
Low Power Green Computing
Hironori KasaharaProfessor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research InstituteWaseda University, Tokyo, Japan
IEEE Computer Society Board of GovernorsIEEE Computer Society Multicore STC Chair
URL: http://www.kasahara.cs.waseda.ac.jp/
1
<R & D Target>Hardware, Software, Application for Super Low-Power Manycore ProcessorsMore than 64 coresNatural air cooling (No fan)
Cool, Compact, Clear, QuietOperational by Solar Panel<Industry, Government, Academia>Hitachi, Fujitsu, NEC, Renesas, Olympus,Toyota, Denso, Mitsubishi, Toshiba, etc<Ripple Effect>Low CO2 (Carbon Dioxide) EmissionsCreation Value Added Products
Consumer Electronics, Automobiles, Servers
Green Computing Systems R&D CenterWaseda University
Supported by METI (Mar. 2011 Completion)
Beside Subway Waseda Station,Near Waseda Univ. Main Campus
2
Hitachi SR16000:Power7 128coreSMP
Fujitsu M9000SPARC VII 256 core SMP
2
Solar PoweredSmart phones
Cameras
Robots
Cool desktop servers
Industry-government-academia collaboration in R&Dand target practical applications
On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination)
Consumer electronicInternet TV/DVD
Non-fan, cool, quiet servers designed for server
Green cloud servers
Stock trading
Operation/rechargingby solar cells
OSCAR many-core chip
Green supercomputers
OSCAR
OSCAR
OSCARMany-core
Chip
Waseda University :R&DMany-core system technologies with
ultra-low power consumptionSuper real-time disaster simulation (tectonic shifts, tsunami), tornado,flood, fire spreading)
National Institute of Radiological Sciences
Protect lives
Protect environmentFor smart life
Camcorders
Intelligent home appliances
Supercomputers and servers
Industry
Capsule inner cameras
OSCAR Compiler
Medical servers
Heavy particle radiation planning, cerebral infarction)
Toyota
Denso
Olympus
Fujitsu
MitsubishiElectric
TokyoStock Exchange
HitachiRenesasNEC
OSCARTechnology
ESOL
OSCAR API 11 Companies3 Univ.
Fujitsu
3
Renesas-Hitachi-Waseda Low Power 8 core RP2 Developed in 2007 in METI/NEDO project
Process Technology
90nm, 8-layer, triple-Vth, CMOS
Chip Size 104.8mm2
(10.61mm x 9.88mm)CPU Core Size
6.6mm2
(3.36mm x 1.96mm)Supply Voltage
1.0V–1.4V (internal), 1.8/3.3V (I/O)
Power Domains
17 (8 CPUs, 8 URAMs, common)
Core#2 Core#3
Core#1
Core#4 Core#5
Core#6 Core#7
SNC
0SN
C1
DBSC
DDRPADGCPG
CSM
LB
SC
SHWY
URAMDLRAM
Core#0ILRAM
D$
I$
VSWC
IEEE ISSCC08: Paper No. 4.5, M.ITO, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”
4
Demo of NEDO Multicore for Real Time Consumer Electronicsat the Council of Science and Engineering Policy on April 10, 2008
CSTP MembersPrime Minister: Mr. Y. FUKUDAMinister of State for Science, Technology and Innovation Policy:Mr. F. KISHIDAChief Cabinet Secretary: Mr. N. MACHIMURAMinister of Internal Affairs and Communications :Mr. H. MASUDAMinister of Finance :Mr. F. NUKAGAMinister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAIMinister of Economy,Trade and Industry: Mr. A. AMARI
5
Core #3
I$16K
D$16K
CPU FPU
User RAM 64K
Local memoryI:8K, D:32K
Core #2
I$16K
D$16K
CPU FPU
User RAM 64K
Local memoryI:8K, D:32K
Core #1
I$16K
D$16K
CPU FPU
User RAM 64K
Local memoryI:8K, D:32K
Core #0
I$16K
D$16K
CPU FPU
Distributed SharedMemory (URAM)64K
Local memoryI:8K, D:32K
CCNBAR
8 Core RP2 Chip Block Diagram
On-chip system bus (SuperHyway)
DDR2LCPG: Local clock pulse generatorPCR: Power Control RegisterCCN/BAR:Cache controller/Barrier RegisterURAM: User RAM (Distributed Shared Memory)
Snoo
p co
ntro
ller
1
Snoo
p co
ntro
ller
0
LCPG0
Cluster #0 Cluster #1
PCR3
PCR2
PCR1
PCR0
LCPG1
PCR7
PCR6
PCR5
PCR4
controlOn-chipShared Memory
DMAcontrol
Core #7
I$16K
D$16K
CPUFPU
User RAM 64K
I:8K, D:32K
Core #6
I$16K
D$16K
CPUFPU
User RAM 64K
I:8K, D:32K
Core #5
I$16K
D$16K
CPUFPU
User RAM 64K
I:8K, D:32K
Core #4
I$16K
D$16K
CPUFPU
URAM 64K
Local memoryI:8K, D:32K
CCNBAR
Barrier Sync. Lines
6
Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2
by OSCAR Parallelizing Compiler
Avg. Power5.73 [W]
Avg. Power1.52 [W]
73.5% Power Reduction7
MPEG2 Decoding with 8 CPU cores
0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
Without Power Control(Voltage:1.4V)
With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)
7
Automatic Power Reduction on 4 core Intel Haswell
Haswell Processor OS Ubuntu 13.10 Intel CPU Core i7 4770K
4 cores L1 Cache: Load 64Bytes/cycle, Store 32Bytes/cycle L2 Cache 64Bytes/cycle L3 Cache 8 MB Frequency 3.5GHz~0.8MHz
Memory 16GB (8GB×2)
8
Power Reduction on Intel Haswell for Real-time Optical Flow
9
Power was reduced to 1/4 by compiler power optimization on the same 3 cores.The power with 3 core was reduced to 1/3 against 1 core.
29.29
36.5941.58
28.40
13.22 10.49
0.00
10.00
20.00
30.00
40.00
50.00
1 2 3
Average Po
wer [W
]
No. of Processor Cores
電力制御なし 電力制御あり
Power was reduced to 1/4 by compiler on 3 cores
Power was reduced to 1/3 compared with one core ordinal execution
For HD 720p(1280x720) moving pictures15fps (Deadline66.6[ms/frame])
Automatic Power Reduction for MPEG2 Decode on Android Multicore
ODROID X2 ARM Cortex-A94cores
10
On 3 cores, Automatic Power Reduction successfully reduced power to 1/7 against without Power Reduction.
3 cores with power reduction reduced power to 1/3 against ordinary 1 core execution.
0.97
1.88
2.79
0.63 0.46 0.37
0.00
0.50
1.00
1.50
2.00
2.50
3.00
1 2 3
Power Con
sumption[W
]
Number of Processor Cores
電力制御なし 電力制御ありWithout Power Reduction
With Power Reduction
2/3(‐35.0%)
1/4(‐75.5%)
1/7(‐86.7%)
1/3(‐61.9%)
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
0
15
30
45
60
通常の1コア実⾏ 並列化3コア実⾏
DrawImage : FPS
Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7
0
15
30
45
60
通常の1コア実⾏ 並列化3コア実⾏
DrawRect :FPS
22.82
43.57
27.16
×1.91 ×1.95
for DrawRect 1.91 speedup
for DrawImage 1.95 speedupOn Nexus7, 3 core parallelization gave us
52.88
11
1 Core 3 cores 1 Core 3 cores
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
OSCAR API-ApplicableHeterogeneous Multicore Architecture
CPU
LPM
FVR
DTU
DSM
LDLDM
Network Interface
Interconnection Network
OnChipCSM
MemoryInterface
OffChipCSM
LPM
DSM
LDLDM
FVR
CPU DTU
ACCa_0
Network Interface
12
ACCa_1
DMAC
• DTU– Data Transfer
Unit• LPM
– Local Program Memory
• LDM– Local Data
Memory• DSM
– Distributed Shared Memory
• CSM– Centralized
Shared Memory• FVR
– Frequency/Voltage Control Register
ACCb_0
VC(n+m+l)
VC(1),VPC(1)VC(0),VPC(0) VC(n),VPC(n)
VC(n+1),VPC(n+1)
VC(n+2),VPC(n+2)
VC(n+m+1)
33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)
12.29 3.09
5.4
18.85
26.71
32.65
0
5
10
15
20
25
30
35
1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE
Speedu
ps against a single SH
processor
3.4[fps]
111[fps]
CPU performs data transfers between SH and FE
13
Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)
Without Power Reduction With Power Reductionby OSCAR Compiler
Average:1.76[W] Average:0.54[W]
1cycle : 33[ms]→30[fps]
70% of power reduction
14
To improve effective performance, cost-performance and software productivity and reduce power
OSCAR Parallelizing Compiler
Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism
Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory
Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)
Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.
1
23 45
6 7 8910 1112
1314 15 16
1718 19 2021 22
2324 25 26
2728 29 3031 32
33Data Localization Group
dlg0dlg3dlg1 dlg2
15
Low Power Heterogeneous Multicore Code
GenerationAPI
Analyzer(Available
from Waseda)
Existing sequential compiler
Multicore Program Development Using OSCAR API V2.0Sequential Application
Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
Low Power Homogeneous Multicore Code
GenerationAPI
AnalyzerExisting
sequential compiler
Proc0
Thread 0
Code with directives
Waseda OSCARParallelizing Compiler
Coarse grain task parallelization
Data Localization DMAC data transfer Power reduction using
DVFS, Clock/ Power gating
Proc1
Thread 1
Code with directives
Parallelized API F or C program
OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,
data transfer using DMA, power managements
Generation of parallel machine
codes using sequential compilers
Exe
cuta
ble
on v
ario
us m
ultic
ores
OSCAR: Optimally Scheduled Advanced MultiprocessorAPI: Application Program Interface
HomegeneousMulticore s
from Vendor A(SMP servers)
Server Code GenerationOpenMP Compiler
Shred memory servers
HeterogeneousMulticores
from Vendor B
Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.
Accelerator 1Code
Accelerator 2Code
Hom
ogen
eous
Accelerator Compiler/ User Add “hint” directives
before a loop or a function to specify it is executable by
the accelerator with how many clocks
Het
ero
Manual parallelization / power reduction
16
Low-Power Optimization with OSCAR API
17
MT1
VC0
MT2
MT4MT3
Sleep
VC1
Scheduled Resultby OSCAR Compiler void
main_VC0() {
MT1
voidmain_VC1() {
MT2
#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))
#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))
Sleep
MT4MT3
} }
Generate Code Image by OSCAR Compiler
Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server
Compile Option:(*1) Sequential: -O3 –qarch=pwr6, XLF: -O3 –qarch=pwr6 –qsmp=auto, OSCAR: -O3 –qarch=pwr6 –qsmp=noauto(*2) Sequential: -O5 -q64 –qarch=pwr6, XLF: -O5 –q64 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –q64 –qarch=pwr6 –qsmp=noauto(Others) Sequential: -O5 –qarch=pwr6, XLF: -O5 –qarch=pwr6 –qsmp=auto, OSCAR: -O5 –qarch=pwr6 –qsmp=noauto
OpenMP codes generated by OSCAR compiler accelerate IBM XL Fortran for AIX Ver.12.1 about 3.3 times on the average
18
92 Times Speedup against the Sequential Processing for GMS Earthquake Wave
Propagation Simulation on Hitachi SR16000(Power7 Based 128 Core Linux SMP)
190
10
20
30
40
50
60
70
80
90
100
1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe
Speedup
agai
nst s
eque
ntia
l pro
cess
ing
oscar
8.9times speedup by 12 processorsIntel Xeon X5670 2.93GHz 12 core SMP (Hitachi HA8000)
55 times speedup by 64 processorsIBM Power 7 64 core SMP
(Hitachi SR16000)
National Institute of Radiological Sciences (NIRS)
Cancer Treatment Carbon Ion Radiotherapy
(Previous best was 2.5 times speedup on 16 processors with hand optimization)
20
Parallel Processing of Face Detection on Manycore, Highendand PC Server
• OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highendserver.
1.00 1.72
3.01
5.74
9.30
1.00 1.93
3.57
6.46
11.55
1.00 1.93
3.67
6.46
10.92
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
1 2 4 8 16
速度
向上
率
コア数
速度向上率tilepro64 gcc
SR16k(Power7 8core*4cpu*4node) xlc
rs440(Intel Xeon 8core*4cpu) icc
21
Automatic Parallelization of Still Image Encoding Using JPEG-XR for the Next Generation Cameras and Drinkable Inner Camera
• TILEPro64
Ds
t0X4)
rt 1
al)
nal)
X4)
1.00 1.96 3.95 7.86
15.82
30.79
55.11
0.00
10.00
20.00
30.00
40.00
50.00
60.00
1 2 4 8 16 32 64
Speed up
コア数
Speed‐ups on TILEPro64 Manycore0.18[s]
1コア10.0[s]
55 times speedup with 64 coresagainst 1 core
22
Engine Control by multicore with DensoThough so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core V850 multicore processor.
1 core 2 cores
Hard real-time automobile engine
control by multicore
23
Future Multicore ProductsNext Generation Automobiles‐ Safer, more comfortable, energy efficient, environment friendly‐ Cameras, radar, car2car communication, internet information integrated brake, steering, engine, moter control
Solar powered with more than 100 times power efficient : FLOPS/W• Regional Disaster Simulators
saving lives from tornadoes, localized heavy rain, fires with earth quakes
‐From everyday recharging to less than once a week‐ Solar powered operation inemergency condition‐ Keep health
Smart phones
Cancer treatment, Drinkable inner camera• Emergency solar powered• No cooling fun, No dust ,
clean usable inside OP room
Advanced medical systems Personal / Regional Supercomputers
24