Josep Torrellas (University of Illinois at Urbana-Champaign); Ben Abbott (Southwest Research Institute); Ted Bapty (Vanderbilt University); Bob Bassett and David Ngo (BAE SYSTEMS); Hubertus Franke and Jose Moreira (IBM Research)
Architecture
Compiler Support
Software Productivity
M3T
Novel Inter-Task Optimizations
Front End
High Level Transformations
Task Selection
Inter-Task Optimizations
Code Generation
Intra-Task Optimizations
Novel compiler algorithms to build tasks
[Figure labels: Sync Bus, CPU+L1, TST, PTW, task, On-Chip Network, Banked L2, Off-Chip Memory]
TST: Task State Table
PTW: Pending Task Window
TaskScalar Morph Evaluation
Effect of Task Size: Speedups are high even for small-sized tasks.
Effect of Number of Processors: Scalability of the speedups significantly depends on the application.
Effect of Network Latency: Speedups are very tolerant to network latency.
Timeline of Tasks (Matrix): Smooth execution of tasks.
Timeline of Tasks (Bubble): Bursty spawn and execution of tasks.
Timeline of Tasks (Pathological): High load imbalance, waste of resources.
Code coarse critical sections
Insert perhaps unnecessary barriers
Detect conflicts; roll back offending threads
Use caches to store speculative state
Lock: owner
Flag: producer
Barrier: lagging tasks
Debugging Data Races [ISCA03]
… LD A; INC; ST A …
… lock(L); LD A; INC; ST A; unlock(L) …
Task X Task Y
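The two instruction sequences in this figure (a bare LD A, INC, ST A and the same sequence wrapped in lock(L)/unlock(L)) can race on A. The following is a minimal sketch of that lost update, not the poster's detection mechanism. It assumes Task X is the unprotected task and Task Y the locked one; the `run` function, the register model, and both schedules are hypothetical and exist only for illustration.

```python
def run(schedule):
    """Replay (task, op) steps against a single shared variable A.

    Hypothetical model: each task has one private register; a lock taken by
    one task does nothing to stop a task that never asks for it.
    """
    memory = {"A": 0}
    reg = {"X": 0, "Y": 0}
    lock_owner = None
    for task, op in schedule:
        if op == "lock":
            assert lock_owner is None, "lock already held"
            lock_owner = task
        elif op == "unlock":
            assert lock_owner == task, "unlock by non-owner"
            lock_owner = None
        elif op == "LD":
            reg[task] = memory["A"]      # LD A
        elif op == "INC":
            reg[task] += 1               # INC
        elif op == "ST":
            memory["A"] = reg[task]      # ST A
    return memory["A"]


# Task X finishes before Task Y starts: both increments survive.
serial = [("X", "LD"), ("X", "INC"), ("X", "ST"),
          ("Y", "lock"), ("Y", "LD"), ("Y", "INC"), ("Y", "ST"), ("Y", "unlock")]

# Task X loads A, then Task Y runs its whole critical section, then Task X
# stores a stale value: Task Y's increment is lost even though Y held the lock.
racy = [("X", "LD"),
        ("Y", "lock"), ("Y", "LD"), ("Y", "INC"), ("Y", "ST"), ("Y", "unlock"),
        ("X", "INC"), ("X", "ST")]

serial_result = run(serial)
racy_result = run(racy)
```

In this model the lock in Task Y gives no protection because Task X never acquires it, which is the kind of missing ordering a race debugger has to expose.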
[Figure labels: two CPUs, each with a Cache holding A, and Memory]
M3T Architecture
[Figure labels: CPU+L1, TST, PTW]
TaskScalar Morph
No explicit order between `` `` and `` ``
Task Ordering
[Figure labels: Lock L, Unlock L, Set F, Wait F, Barrier]
Effectiveness
Speculative Lock
[Figure labels: A, B, C, D, E, ACQUIRE, RELEASE, Safe, Speculative]
Speculative Barrier
[Figure labels: A, B, C, BARRIER, Safe, Speculative]
[Chart: Normalized Time Lost to Synchronization, Base vs. Spec, axis 0 to 120. Annotation: 17.7%, Sync Time Reduction]
TaskScalar attempts to:
run section in parallel
speculate past synchronization
Result: appear as if we had invested more man-hours
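Speculating past synchronization can be pictured with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware mechanism: it assumes a speculative task buffers its writes (standing in for speculative state held in a cache), that a safe lock owner writes memory directly, and that the speculative task is rolled back and re-executed if the owner wrote anything it touched. The names `SpeculativeTask`, `increment`, and `speculate` are hypothetical.

```python
class SpeculativeTask:
    """Hypothetical task that enters a critical section without owning the lock."""

    def __init__(self, memory):
        self.memory = memory
        self.read_set = set()
        self.write_buffer = {}   # stands in for speculative state kept in the cache
        self.rollbacks = 0

    def load(self, addr):
        self.read_set.add(addr)
        # Prefer the task's own buffered value over committed memory.
        return self.write_buffer.get(addr, self.memory[addr])

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def conflicts_with(self, written_addrs):
        # Conflict if the safe task wrote anything this task read or wrote.
        touched = self.read_set | set(self.write_buffer)
        return bool(set(written_addrs) & touched)

    def rollback(self):
        self.read_set.clear()
        self.write_buffer.clear()
        self.rollbacks += 1

    def commit(self):
        self.memory.update(self.write_buffer)
        self.read_set.clear()
        self.write_buffer.clear()


def increment(task, addr):
    task.store(addr, task.load(addr) + 1)


def speculate(memory, spec_addr, owner_addr):
    """Run one speculative increment alongside one safe (lock-owner) increment."""
    spec = SpeculativeTask(memory)
    increment(spec, spec_addr)        # speculative task runs ahead of the lock
    memory[owner_addr] += 1           # safe owner updates memory directly
    if spec.conflicts_with({owner_addr}):
        spec.rollback()               # discard buffered state
        increment(spec, spec_addr)    # re-execute once the owner is done
    spec.commit()
    return spec.rollbacks


disjoint = {"x": 0, "y": 0}
disjoint_rollbacks = speculate(disjoint, "y", "x")   # no shared data

shared = {"c": 0}
shared_rollbacks = speculate(shared, "c", "c")       # both touch c
```

When the two tasks touch different data, the speculative work commits with no rollback, so a coarse critical section costs nothing extra; when they touch the same data, the rollback preserves the result that the lock would have enforced.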
Reducing Parallel Programming Effort [ASPLOS02]
[Chart: Performance vs. Parallelism. Labels: Superscalar, SMT, CMP, TaskScalar; SpecInt, SpecFP, Scientific]
[Chart: Overhead (0% to 15%) vs. Rollback Distance [Instructions per CPU] (0 to 120K). Annotations: Better, Chosen]