CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
Ciprian Seiculescu
Stavros Volos
Naser Khosro Pour
Babak Falsafi
Giovanni De Micheli
LSI Integrated Systems Laboratory
NoCs Major Power Consumer
Move towards manycore
• Tiled architectures
Network-on-Chip (NoC)
• Significant power consumer
• 40% in MIT RAW
• 30% in Intel Tera-scale
Cache coherent CMP
• Server workloads
[Figure: tiled CMP of 16 core-plus-cache (C$) tiles; tile detail shows cores, caches ($) and a crossbar]
Proposals to Reduce NoC Power
Multiple networks
• Better area and power [Balfour & Dally ICS 2006]
Commercial server workloads
• Traffic patterns are different
• Run on cache coherent CMPs
• Strong relation between coherence protocol and NoC
Existing proposals are not optimized for commercial server workload traffic
Contributions
Commercial server workloads
• Optimized for reuse in L1, little sharing
• Full-blown coherence protocol in CMPs
• Only some transitions are frequent
Duality in Request/Response message size
CCNoC
• Takes full advantage of heterogeneity
• Same number of buffers
• 16% less power at the same performance as Mesh
Outline
Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions
Dual Router is More Efficient
Dual router
• Two crossbars per routing node
Wires are less expensive on-chip
• Use more wires for better performance
Area and power grow faster than connectivity
• Balfour & Dally ICS 2006
• Dual router: better performance, power and area
[Figure: one N-bit-wide crossbar compared with two N/2-bit-wide crossbars]
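The claim that area and power grow faster than connectivity can be illustrated with a first-order model. The sketch below assumes crossbar cost scales with the square of (ports × width), a common rule of thumb rather than a figure from these slides; the 5-port router and the 128-bit width are illustrative assumptions.

```python
def crossbar_cost(ports: int, width_bits: int) -> int:
    """First-order crossbar cost in arbitrary units.

    Assumption (not from the slides): cost grows with the square of
    the total number of wires crossing the switch, ports * width.
    """
    return (ports * width_bits) ** 2


def compare(ports: int = 5, n_bits: int = 128):
    """Compare one N-bit crossbar with two N/2-bit crossbars."""
    single = crossbar_cost(ports, n_bits)
    dual = 2 * crossbar_cost(ports, n_bits // 2)
    return single, dual, dual / single


single, dual, ratio = compare()
print(f"single={single} dual={dual} ratio={ratio}")
```

Under this model the two half-width crossbars together cost half as much as the single full-width one, which is the intuition behind the dual-router argument. Real savings depend on the technology and on buffer and link costs, which the model ignores.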
Right Dual Router Design
Avoid protocol-level deadlock
• Separate
  - Requests
  - Responses
• Use Virtual Channels
CCNoC
• Sub-networks
  - Request / Response
• No VCs needed
• Same number of buffers
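A minimal sketch of the separation idea: each message class gets its own queue (standing in for a physical sub-network), so a backlog of requests cannot block responses. The class and method names (`DualNetworkInterface`, `inject`, `drain`) are hypothetical and only illustrate the principle; they are not the authors' design.

```python
from collections import deque
from dataclasses import dataclass, field

REQUEST, RESPONSE = "request", "response"


@dataclass
class Message:
    name: str
    kind: str  # REQUEST or RESPONSE


@dataclass
class DualNetworkInterface:
    """Hypothetical network interface with one queue per sub-network."""
    queues: dict = field(
        default_factory=lambda: {REQUEST: deque(), RESPONSE: deque()}
    )

    def inject(self, msg: Message) -> None:
        # Steer the message by class; classes never share a queue.
        if msg.kind not in self.queues:
            raise ValueError(f"unknown message kind: {msg.kind}")
        self.queues[msg.kind].append(msg)

    def drain(self, kind: str) -> list:
        # Deliver everything waiting on one sub-network only.
        delivered = [m.name for m in self.queues[kind]]
        self.queues[kind].clear()
        return delivered


ni = DualNetworkInterface()
ni.inject(Message("Read Req", REQUEST))
ni.inject(Message("Read Resp", RESPONSE))
print(ni.drain(RESPONSE))       # responses leave even though a request waits
print(len(ni.queues[REQUEST]))  # the request is still queued
```

Virtual channels achieve the same isolation by multiplexing classes over one physical link; the sketch corresponds to the physically separate variant.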
Buffers are power hungry
[Figure: MIT RAW network power breakdown, buffers versus crossbar + links]
H.S. Wang & L.S. Peh, MICRO 2003
Protocol Activity
CMPs implement a full-blown coherence protocol
• Some transitions are frequent [Hardavellas ISCA 2009]
  - Read clean block
  - Evict clean block
  - Write to unshared block
• Other transitions are needed for correctness (infrequent)
  - Read dirty block
  - Evict dirty block
  - Write to shared block
Frequent Read Protocol Activity
[Diagram: Reader, Directory, Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]
Frequent Write Protocol Activity
[Diagram: Writer, Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]
Infrequent Read Protocol Activity
[Diagram: Reader, Directory, Writer. Messages: Read Req, Downgrade Req, Downgrade Resp, Read Resp. Legend: Short Req, Short Resp, Long Resp]
Infrequent Write Protocol Activity
[Diagram: Writer, Directory, Reader 1, Reader 2. Messages: Fetch/Upgrade Req, Inv Req (to each reader), Inv Resp (from each reader), Fetch Resp, Upgrade Resp, Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]
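The four protocol diagrams sort every message into a request or a response, and into short or long. The sketch below reproduces that sorting under one loudly stated assumption: a message is long exactly when it carries a cache block. Which messages carry a block (`CARRIES_BLOCK`) is inferred from the message names, not read from the slides, so treat the table as hypothetical.

```python
# Assumption: these messages carry a cache block and are therefore long.
# The set is inferred from the message names, not taken from the slides.
CARRIES_BLOCK = {
    "Read Resp",
    "Fetch Resp",
    "Downgrade Resp",
    "Evict Dirty Req",
}


def classify(message: str) -> str:
    """Return 'Short Req', 'Long Req', 'Short Resp' or 'Long Resp'."""
    if message.endswith("Req"):
        kind = "Req"
    elif message.endswith("Resp"):
        kind = "Resp"
    else:
        raise ValueError(f"cannot tell request from response: {message}")
    size = "Long" if message in CARRIES_BLOCK else "Short"
    return f"{size} {kind}"


for name in ["Read Req", "Read Resp", "Evict Clean Req",
             "Upgrade Resp", "Inv Req", "Evict Dirty Req"]:
    print(f"{name:16s} -> {classify(name)}")
```

With this table, the frequent transitions produce only short requests and mostly long responses, while long requests come only from the infrequent dirty eviction.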
Traffic Analysis
[Figure: traffic distribution (0% to 100%) per workload, split into Short Req, Long Req, Short Resp and Long Resp. Workloads: OLTP (DB2, Oracle), DSS (DB2 Mix), WEB (Apache, Zeus), SCI (EM3D), MIX (SPEC2K)]
Request: 93% short
Response: 86% long
CCNoC Router
Request network narrow: optimized for short messages
Response network wide: optimized for long messages
[Figure: CCNoC router containing a Request Switch and a Response Switch, connected to the NI]
Previous Work
Balfour et al. ICS 2006
• Better than single large router
• Read/Write traffic
• Same number of reads and writes
Yoon et al. DAC 2010
• Physical channels better than virtual channels
Not optimized for cache coherent CMPs
• Running commercial server workloads
Evaluation Methodology
FLEXUS
• Full system simulation
• 16 or 8 UltraSPARC III ISA cores
• Split I/D, 64KB L1
• 1 or 2 MB L2
ORION 2.0
• Power estimation
• Area estimation
Workloads
• OLTP: TPC-C
  - IBM DB2 and Oracle
• DSS: TPC-H
  - IBM DB2
  - Q1, Q6, Q13, Q16
• Web: SPECweb99
  - Apache and Zeus
• Scientific: EM3D
• Multiprogrammed:
  - SPEC2K
  - 2x: gcc, twolf, art, mcf
Evaluation NoCs
Mesh-128 - baseline
• 128 bit flit width
Torus - reference
• 128 bit flit width
Mesh-176 - high performance
• 176 bit flit width
CCNoC
• Request: 48 bit flit width
• Response: 128 bit flit width
Switches
• Wormhole flow control
• Input queued
• Transmission protocol
  - On/Off
• Input buffers
  - 2 entry
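To see why a narrow request network and a wide response network fit the traffic, it helps to count flits per message. The sketch below is a rough estimate under assumptions not stated in the slides: a short message is 48 bits, and a long message is a 48-bit header plus a 64-byte cache block. The short-message shares (93% of requests, 14% of responses) come from the traffic analysis.

```python
def flits(message_bits: int, flit_bits: int) -> int:
    """Number of flits needed to carry a message (ceiling division)."""
    return -(-message_bits // flit_bits)


# Assumed message sizes; the slides give flit widths, not message sizes.
SHORT_BITS = 48               # control-only message
LONG_BITS = 48 + 64 * 8       # header plus an assumed 64-byte block


def mean_flits(short_share: float, flit_bits: int) -> float:
    """Average flits per message for a given share of short messages."""
    return (short_share * flits(SHORT_BITS, flit_bits)
            + (1 - short_share) * flits(LONG_BITS, flit_bits))


# Requests are 93% short; responses are 86% long, i.e. 14% short.
print("request net, 48-bit flits:  ", round(mean_flits(0.93, 48), 2))
print("response net, 128-bit flits:", round(mean_flits(0.14, 128), 2))
print("long message on 176-bit mesh:", flits(LONG_BITS, 176), "flits")
```

Note that the two CCNoC widths sum to 48 + 128 = 176 bits, the same total width as Mesh-176, so the comparison against that configuration is at roughly equal wiring.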
Performance
[Figure: IPC normalized to Torus (0 to 1.2) for Mesh-128, Mesh-176 and CCNoC. Workloads: OLTP (DB2, Oracle), DSS (DB2 Mix), WEB (Apache, Zeus), SCI (EM3D), MIX (SPEC2K)]
Performance loss: 2% versus Torus, 8% versus Mesh-176
Power Savings
Power savings: 16% versus Mesh-128, 22% versus Torus, 38% versus Mesh-176
[Figure: normalized total power (0 to 1.4) for Torus, Mesh-128, Mesh-176 and CCNoC. Workloads: OLTP (DB2, Oracle), DSS (DB2 Mix), WEB (Apache, Zeus), SCI (EM3D), MIX (SPEC2K)]
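The three savings figures also imply how the baselines compare with each other. The arithmetic below assumes each saving is defined as 1 - P_CCNoC / P_baseline and that all three are averages over the same workloads; both are assumptions, so the derived ratios are only indicative.

```python
# Reported CCNoC power savings relative to each baseline.
SAVINGS = {"Mesh-128": 0.16, "Torus": 0.22, "Mesh-176": 0.38}


def baseline_power_relative_to(reference: str) -> dict:
    """Power of each baseline as a multiple of the reference baseline.

    Assumes saving = 1 - P_ccnoc / P_base, so
    P_base / P_ref = (1 - saving_ref) / (1 - saving_base).
    """
    ccnoc_fraction = {name: 1.0 - s for name, s in SAVINGS.items()}
    ref = ccnoc_fraction[reference]
    return {name: ref / frac for name, frac in ccnoc_fraction.items()}


for name, ratio in baseline_power_relative_to("Mesh-128").items():
    print(f"{name}: {ratio:.2f}x Mesh-128 power")
```

Read this way, the torus draws somewhat more power than the 128-bit mesh, and the 176-bit mesh draws roughly a third more.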
Conclusions
Duality in Request/Response traffic
• Request: dominated by short messages
• Response: dominated by long messages
Proposed CCNoC
• Narrow request network
• Wide response network
Showed significant power savings
• 22% against Torus
• 38% against Mesh-176
Thank you!
Q&A