Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Compiler and API forLow Power High Performance
MulticoresHironori Kasahara
Professor Department of Computer ScienceProfessor, Department of Computer ScienceDirector, Advanced Chip-Multiprocessor
Research InstituteResearch Institute Waseda University
Tokyo JapanTokyo, Japanhttp://www.kasahara.cs.waseda.ac.jp
June 24, 2008, MPSoC 2008
METI/NEDO National Project Multi-core for Real-time Consumer ElectronicsMulti core for Real time Consumer Electronics
<Goal> R&D of compiler cooperative multi-core processor technology for consumer
新マルチコア新マルチコア新マルチコア新マルチコア
(2005.7~2008.3)**
electronics like Mobile phones, Games, DVD, Digital TV, Car navigation systems.
CMP m
(マルチ
コア・チップm)
PC0(プロセッサコア0) PC1
(プロ
セッサコア1)
PC nLDM/
D-cache (ローカルデータ
デ
LPM/I C h
CMP 0 (マルチコアチップ0)
CSM j
I/O
CSP k
CPU(プロセッサ)
DTC(データ
転送コン
I/O DevicesI/O
(入出力装置)CMP m
(マルチ
コア・チップm)
PC0(プロセッサコア0) PC1
(プロ
セッサコア1)
PC nLDM/
D-cache (ローカルデータ
デ
LPM/I C h
CMP 0 (マルチコアチップ0)
CSM j
I/O
CSP k
CPU(プロセッサ)
DTC(データ
転送コン
I/O DevicesI/O
(入出力装置)
新マルチコアプロセッサ
•高性能
•低消費電力
短 開
新マルチコアプロセッサ
•高性能
•低消費電力
短HW/SW開
CMP m
(マルチ
コア・チップm)
PC0(プロセッサコア0) PC1
(プロ
セッサコア1)
PC nLDM/
D-cache (ローカルデータ
デ
LPM/I C h
CMP 0 (マルチコアチップ0)
CSM j
I/O
CSP k
CPU(プロセッサ)
DTC(データ
転送コン
I/O DevicesI/O
(入出力装置)CMP m
(マルチ
コア・チップm)
PC0(プロセッサコア0) PC1
(プロ
セッサコア1)
PC nLDM/
D-cache (ローカルデータ
デ
LPM/I C h
CMP 0 (マルチコアチップ0)
CSM j
I/O
CSP k
CPU(プロセッサ)
DTC(データ
転送コン
I/O DevicesI/O
(入出力装置)
新マルチコアプロセッサ
•高性能
•低消費電力
短 開
新マルチコアプロセッサ
•高性能
•低消費電力
短HW/SW開navigation systems. <Period> From July 2005 to March 2008
<Features> ・Good cost performance
(プロセッサ
コアn)
コア1)
IntraCCN (チップ内結合網: 複数バス、クロスバー等 )
DSM(分散共有メモリ)
(メモリ/L1データ
キャッシュ)I-Cache
(ローカルプログラムメモリ/
命令キャッシュ)
CSM(集中
共有メモリ)
CSP k
(入出力用マルチコア・チップ)
NI(ネットワークインターフェイス )
転送コントローラ) (プロセッサ
コアn)
コア1)
IntraCCN (チップ内結合網: 複数バス、クロスバー等 )
DSM(分散共有メモリ)
(メモリ/L1データ
キャッシュ)I-Cache
(ローカルプログラムメモリ/
命令キャッシュ)
CSM(集中
共有メモリ)
CSP k
(入出力用マルチコア・チップ)
NI(ネットワークインターフェイス )
転送コントローラ) •短HW/SW開
発期間
•各チップ間でアプリケーション共用可
•短HW/SW開発期間
•各チップ間でアプリケーション共用可
(プロセッサ
コアn)
コア1)
IntraCCN (チップ内結合網: 複数バス、クロスバー等 )
DSM(分散共有メモリ)
(メモリ/L1データ
キャッシュ)I-Cache
(ローカルプログラムメモリ/
命令キャッシュ)
CSM(集中
共有メモリ)
CSP k
(入出力用マルチコア・チップ)
NI(ネットワークインターフェイス )
転送コントローラ) (プロセッサ
コアn)
コア1)
IntraCCN (チップ内結合網: 複数バス、クロスバー等 )
DSM(分散共有メモリ)
(メモリ/L1データ
キャッシュ)I-Cache
(ローカルプログラムメモリ/
命令キャッシュ)
CSM(集中
共有メモリ)
CSP k
(入出力用マルチコア・チップ)
NI(ネットワークインターフェイス )
転送コントローラ) •短HW/SW開
発期間
•各チップ間でアプリケーション共用可
•短HW/SW開発期間
•各チップ間でアプリケーション共用可
・Short hardware and software development periods・Low power consumption
CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )
( )
InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )
CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )
( )
InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )
共用可
•高信頼性
•半導体集積度と共に性能向上
•高信頼性
•半導体集積度と共に性能向上
マルチコア統合ECU
,,
CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )
( )
InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )
CSM / L2 Cache(集中共有メモリあるいはL2キャッシュ )
( )
InterCCN (チップ間結合網: 複数バス、クロスバー、多段ネットワーク等 )
共用可
•高信頼性
•半導体集積度と共に性能向上
•高信頼性
•半導体集積度と共に性能向上
マルチコア統合ECU
,,
・Low power consumption・Scalable performance improvement with the advancement of semiconductor
開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ
w e dv ce e o se co duc o・Use of the same parallelizing compiler for multi-cores from different vendors using newly developed API
API:Application Programming Interface
**Hitachi, Renesas, Fujitsu,
Toshiba, Panasonic, NEC
OSCAR Parallelizing Compiler• Improve effective performance, cost-performance
and productivity and reduce consumed power– Multigrain Parallelization
• Exploitation of parallelism from the whole program by use ofcoarse-grain parallelism among loops and subroutines,coarse grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism
Data Localization– Data Localization• Automatic data distribution for distributed shared memory, cache
and local memory on multiprocessor systems.
– Data Transfer Overlapping• Data transfer overhead hiding by overlapping task execution and
data transfer using DMA or data pre-fetchingdata transfer using DMA or data pre fetching
– Power Reduction• Reduction of consumed power by compiler control of
frequency, voltage and power shut down with hardware supports.
Earliest Executable Condition Analysis for coarse grain tasks (Macro-tasks)
1 BPABPA
Data Dependency
Control flow
Conditional branch
Block of Psuedo
1
3 BPA2 BPA
4 BPA
BPA
RB Repetition Block
RB
Block of PsuedoAssignment Statements
7
2 3
4 8BPA
5 BPA
6 RB
7 RB15 BPA
RB
RB
RB
BPA
7
11
4
5 6
8
9 10
6
BPA8 BPA
9 BPA 10 RB
11 BPA
RB 1115 7
12
13
Data dependency
Extended control dependency11 BPA
12 BPA
13 RB14
13Extended control dependencyConditional branch
ORAND
Original control flowA Macro Flow Graph14 RB
ENDA Macro Task Graph
MTG of Su2cor-LOOPS-DO400MTG of Su2cor-LOOPS-DO400Coarse grain parallelism PARA_ALD = 4.3
DOALL Sequential LOOP BBSB
Data Localization11
2
1
32
PE0 PE1
12 1
3 4 56
23 45
6 7 8910 1112
3
4
2
6 7
8
14
18
7
6 7 8910 1112
1314 15 16
5
8
9
11
15
18
19
25
8 9 101718 19 2021 22
10
11
13 16
25
29
112324 25 26
dlg0dlg3dlg1 dlg2
17 20
21
22 26
30
12 1314
15
2728 29 3031 32
33Data Localization Group
dlg023 24
27 28
32
MTG MTG after Division A schedule for two processors
15 33Data Localization Group31
Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler
• Shortest execution time mode
OSCAR Multi-Core ArchitectureCMP (chip multiprocessor 0)
CMP m
PE0 PE
CMP (chip multiprocessor 0)0
CPU
I/ODevicesI/O
Devices
0 PE1 PE nLDM/
D-cacheLPM/
CSM jI/O
CMP k
CPU
DTC
DSMI-Cache
CSN t k I t fFVR
Intra-chip connection network
CSMNetwork InterfaceFVR
FVR
CSM / L2 Cache
(Multiple Buses, Crossbar, etc)FVR
FVRFVR FVR FVR FVR
Inter-chip connection network (Crossbar, Buses, Multistage network, etc)CSM: central shared mem. LDM : local data mem.
FVR
DSM: distributed shared mem.
DTC: Data Transfer Controller
LPM : local program mem.
FVR: frequency / voltage control register
API and Parallelizing Compiler in METI/NEDOAdvanced Multicore for Realtime Consumer Electronics Project
Details of API: See http://www kasahara cs waseda ac jp/Details of API: See http://www.kasahara.cs.waseda.ac.jp/
ExecutableTranslate
into parallel
API to specify data assignment, data
Sequential Application Program
codes for each vendor
chip
pcodes for
each vender
transfer, power reduction control
Program(Subset of C Language)
RealtiA
Imag
e
Was
Backend compilerMach. Codes
Proc0ScheduledTasks
APIdecoder
Sequential Compilerm
e Co
ns
Ap
plicati
e, Secu
re
seda O
S
Backend Compiler
T1 Stop
Proc1ScheduledT k
SH multicore
Mach.Codes
APIdecoder
Sequential Compilersu
mer E
ion
Pro
ge A
ud
io
etc.
SC
AR
Co
FR-V
Tasks
T2 T4
Proc2
Codesdecoder Compiler
Electro
ni
gram
sS
treami
om
piler
Backend CompilerScheduledTasks
T3 T6 Slow
Mach.CodesSequential
CompilerAPIdecoder
csng
(CELL)
Data Transfer by DTC(DMAC)
Performance of OSCAR Compiler Using the
9 Intel Ver.10.1
Multicore API on Intel Quad-core Xeon
6
7
8
o
OSCAR
4
5
6
eedu
p ra
tio
1
2
3spe
0
mca
tv
swim
u2co
r
dro2
d
mgr
id
appl
u
urb3
d
apsi
fppp
p
wav
e5
swim
mgr
id
appl
u
apsi
tom su
hyd m tu f w m
SPEC95 SPEC2000
• OSCAR Compiler gives us 2.09 times speedup on the average against Intel Compiler ver.10.1
Performance of OSCAR Compiler Using the multicore API on Fujitsu FR1000 MulticoreAPI on Fujitsu FR1000 Multicore
3.76 3.75
3 47
4.0
2.812.90
3.47
3.0
3.5
1.941 80
2.12
2.54
1.98 1.98
2.55
2.0
2.5
1.00 1.00
1.80
1.00 1.00
1.5
0.5
1.0
0.0
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
MPEG2enc MPEG2dec MP3enc JPEG 2000enc
3.38 times speedup on the average for 4 cores againsta single core execution
Performance of OSCAR Compiler Using the Developed API 4 (SH4A) OSCAR T M ltiAPI on 4 core (SH4A) OSCAR Type Multicore
4.0
3.353.24
3.50
3.173.5
4.0
2.57 2.532.71
2 07
2.5
3.0
1.781.86 1.88 1.83
2.07
1.5
2.0
1.00 1.00 1.00 1.00
0.5
1.0
0.0
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
MPEG2enc MPEG2dec MP3enc JPEG 2000enc
3.31 times speedup on the average for 4cores against 1core
Performance OSCAR Multigrain Parallelizing Compiler on a IBM p550q 8core Deskside Server
2.7 times speedup against loop parallelizing compiler on 8 cores
Loop parallelization
Multigrain parallelizition
7
8
9
5
6
7
du
p r
atio
2
3
4
spee
d
0
1
2
t t i 2 h d 2d id l t b3d i f 5 i id l itomcatvswim su2corhydro2dmgrid applu turb3d apsi fpppp wave5 swim mgrid applu apsi
spec95 spec2000
Performance of OSCAR compiler on16 cores SGI Altix 450 Montvale server16 cores SGI Altix 450 Montvale server
6 Intel Ver.10.1
Compiler options for the Intel Compiler:for Automation parallelization: -fast -parallel.for OpenMP codes generated by OSCAR: -fast -openmp
4
5
o
OSCAR for OpenMP codes generated by OSCAR: -fast -openmp
2
3
4
peed
up r
ati
1
2sp
0
omca
tv
swim
su2c
or
hydr
o2d
mgr
id
appl
u
turb
3d
apsi
fppp
p
wav
e5
swim
mgr
id
appl
u
apsi
• OSCAR compiler gave us 2.32 times speedup
to h
spec95 spec2000
OSCAR compiler gave us 2.32 times speedupagainst Intel Fortran Itanium Compiler revision 10.1
RP2 Chip Photo and Specifications
Process 90nm 8-layer triple-ILRAM I$ Process Technology
90nm, 8-layer, triple-Vth, CMOS
Chip Size 104 8mm2
Core#1
C0
LB
SC
URAMDLRAM
Core#0ILRAM
D$
I$
Chip Size 104.8mm(10.61mm x 9.88mm)
CPU Core 6.6mm2
Core#2 Core#3S
N
SHWY CPU Core Size
6.6mm(3.36mm x 1.96mm)
Supply 1.0V–1.4V (internal), Core#6 Core#7
SN
C1
SHWYVSWC
pp yVoltage
( ),1.8/3.3V (I/O)
Power 17 (8 CPUs, Core#4 Core#5
SC
SM
Domains(
8 URAMs, common)DBSC
DDRPADGCPG
Processing Performance on the Developed Multicore Using Automatic Parallelizing CompilerMulticore Using Automatic Parallelizing Compiler
Speedup against single core execution for audio AAC encoding*) Ad d A di C di*) Advanced Audio Coding
6 0
7.0
5.0
6.0
ps
5 83 0
4.0
eed
up
3.6
5.8
2.0
3.0
Sp
1.0 1.9
0.0
1.0
1 2 4 8
Numbers of processor cores
Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encodingfor Secure Audio Encoding
AAC Encoding + AES Encryption with 8 CPU cores
Without Power With Power Control
6
7
Without Power Control(Voltage:1.4V)
6
7
With Power Control (Frequency, Resume Standby: Power shutdown &
5
6
5
6 Voltage lowering 1.4V-1.0V)
3
4
3
4
2 2
Avg. Power Avg. Power88 3% Power Reduction0
1
0
1
Avg. Power5.68 [W]
Avg. Power0.67 [W]
88.3% Power Reduction
17
Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decodingfor MPEG2 DecodingMPEG2 Decoding with 8 CPU cores
Without Power C l
With Power Control
6
7
6
7
Control(Voltage:1.4V)
W owe Co o(Frequency, Resume Standby: Power shutdown & Voltage lowering 1 4V
5 5
6 Voltage lowering 1.4V-1.0V)
3
4
3
4
1
2 2
Avg. Power Avg. Power73 5% Power Reduction0
1
0
1
Avg. Power5.73 [W]
Avg. Power1.52 [W]
73.5% Power Reduction18
Low Power High Performance Multicore Computer withMulticore Computer with Solar Panel Clean Energy AutonomousgyServers operational in deserts
Compiler cooperative low power high effectiveConclusions
Compiler cooperative low power, high effective performance, short software development period multi-core processors will be more important in wide range ofcore processors will be more important in wide range of information systems from embedded applications like games, mobile phones, digital TVs and automobiles to g p gpeta-scale supercomputers. Automatically generated parallel programs using the y g p p g gdeveloped multicore API by OSCAR compiler give us the following performances:
3.31 times speedup on 4core (SH4A) OSCAR type multicore3.38 times speedup on 4 core FR1000 against 1 core88% power reduction by the compiler power control on the developed 8 core (SH4A) multicore for realtime secure AAC encodingAAC encoding 70% power reduction on the multicore for MPEG2 decoding