Energy-Efficient Circuit Technologies for Sub-14nm Microprocessors and SoCs:
Challenges and Opportunities
IEEE Solid-State Circuits
Society Webinar
July 13 2016
Ram K. Krishnamurthy
Senior Principal Engineer
IEEE Fellow & SSCS Distinguished Lecturer
Circuits Research Lab, Intel Labs
Intel Corporation, Hillsboro, OR 97124, USA
ram.krishnamurthy@intel.com
Acknowledgements: Intel Circuits Research Lab, Vivek De, Matt Haycock,
Shekhar Borkar, ADR Bangalore Design Lab
Era of Tera-scale Computing
Teraflops of performance operating on Terabytes of data
[Figure: performance (KIPS, MIPS, GIPS, TIPS) versus dataset size (kilobytes, megabytes, gigabytes, terabytes), spanning the single-core, multi-core, and tera-scale eras. Workloads grow from text, multimedia, and 3D & video to model-based apps built on recognition, mining, and synthesis: personal media creation and management; entertainment, learning, and virtual travel; health; financial analytics.]
Internet of Everything (IoE)
Need end-to-end energy efficiency & security
Tera-scale Microprocessors and SoCs
Deliver best user experience under constraints
[Figure: SoC block diagram with caches, last-level caches, graphics, video, special-purpose engines, integrated memory controllers, and off-die interconnect, all connected by a scalable on-die interconnect fabric]
• Dynamic V/F control
• Independent V/F control regions
• Workload-based core activation & shutdown
• Scenario-based power allocation
• Maximize performance & efficiency
Continued benefits from Moore’s Law
• More, better transistors
• More cores
[Figure: Moore’s Law scaling on a log-scale axis (ticks 10^3, 10^5, 10^7, 10^9), with 45nm (2007) and 14nm Tri-gate (2014) highlighted]
Source: Intel
Performance/Energy Scaling Trends
Source: Intel
22nm Interconnects
• M1 to M8 cross-section
• M1-M6 use ultra-low-k ILD and self-aligned vias providing 13-18% capacitance reduction
• Cross-section of integrated MIM capacitor
C. Auth, VLSI Symposium 2012
Microprocessor Evolution
              4004 Processor   Westmere-EX Processor
Year          1971             2011
Transistors   2300             2.6 B
Process       10 µm            32 nm
Die area      12 mm²           513 mm²
Die photos not at scale
Source: Intel
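As a rough check on the 4004 versus Westmere-EX comparison, the two data points imply a transistor-count doubling roughly every two years. The sketch below is only arithmetic on the figures quoted in the comparison; the function name is illustrative and not from the presentation.

```python
import math

def doubling_period_years(count_start, count_end, year_start, year_end):
    """Average time for the transistor count to double between two data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings

# Figures from the comparison: 4004 (1971, 2300 transistors) and
# Westmere-EX (2011, 2.6 billion transistors).
doublings = math.log2(2.6e9 / 2300)
period = doubling_period_years(2300, 2.6e9, 1971, 2011)
print(f"about {doublings:.1f} doublings, one every {period:.2f} years")
```

The result is an average over 40 years and says nothing about how evenly the doublings were spaced.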
“Extreme” energy efficiency
2 W for 100 GigaFLOPS
20 MW for an ExaFLOPS
10-year goal: ~300X improvement in energy efficiency
Equal to 20 pJ/FLOP at the system level
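Both power targets on this slide work out to the same 20 pJ per floating-point operation. A minimal sketch of that arithmetic follows; the helper name is illustrative, and the implied starting point is simply 300 times the target, not a figure stated in the presentation.

```python
def energy_per_flop_pj(power_watts, flops_per_second):
    """Energy per floating-point operation in picojoules."""
    return power_watts / flops_per_second * 1e12

portable = energy_per_flop_pj(2.0, 100e9)   # 2 W at 100 GigaFLOPS
exascale = energy_per_flop_pj(20e6, 1e18)   # 20 MW at 1 ExaFLOPS
print(f"{portable:.1f} pJ/FLOP and {exascale:.1f} pJ/FLOP")

# Working backwards, a ~300X improvement implies a starting point of
# roughly 300 * 20 pJ = 6 nJ per FLOP (derived here, not quoted on the slide).
implied_baseline_nj = 300 * portable / 1000
print(f"implied baseline: about {implied_baseline_nj:.0f} nJ/FLOP")
```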
NTV Operation & Energy Efficiency
[Figure: two measured plots versus supply voltage (0.2-1.4 V), 65nm CMOS, 50°C. One shows energy efficiency (GOPS/Watt) and active leakage power (mW), with the subthreshold region marked and annotations 320mV and 9.6X. The other shows maximum frequency (MHz) and total power (mW).]
• Frequency reduces almost linearly first, then exponentially
• Total power reduces by three to four orders of magnitude
• Energy efficiency improves by one order of magnitude at NTV
• Energy efficiency reduces in subthreshold operation
• Leakage power reduces by two to three orders of magnitude
H. Kaul, R. Krishnamurthy et al, ISSCC 2008
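The trends listed for this slide (energy efficiency peaking near threshold, then degrading in subthreshold) can be reproduced qualitatively with a toy model. The sketch below is an assumption-laden illustration, not the measured 65nm data: it uses an EKV-style interpolation for drive current, and the constants `N_VT`, `V_TH`, and `K_LEAK` are made-up values chosen only to show the shape of the curve.

```python
import math

N_VT = 0.04     # slope factor times thermal voltage, in volts (assumed)
V_TH = 0.35     # threshold voltage in volts (assumed)
K_LEAK = 1000   # idle leaking gates per switching gate, times logic depth (assumed)

def drive_current(v_gs):
    """EKV-style interpolation: exponential below threshold, ~quadratic above."""
    return math.log(1.0 + math.exp((v_gs - V_TH) / (2.0 * N_VT))) ** 2

def energy_per_op(vdd):
    """Normalized switching plus leakage energy for one operation at vdd."""
    delay = vdd / drive_current(vdd)      # gate delay ~ C*V/I, with C = 1
    dynamic = vdd ** 2                    # C*V^2, with C = 1
    leakage = K_LEAK * drive_current(0.0) * vdd * delay
    return dynamic + leakage

voltages = [round(0.15 + 0.01 * i, 2) for i in range(106)]   # 0.15 V .. 1.20 V
v_opt = min(voltages, key=energy_per_op)
gain = energy_per_op(1.2) / energy_per_op(v_opt)
print(f"minimum energy at {v_opt:.2f} V, {gain:.1f}x lower than at 1.2 V")
```

Switching energy falls with the square of the supply, while delay grows exponentially below threshold, so leakage integrated over a longer cycle eventually dominates. That trade-off is what places the minimum near the threshold voltage in this model.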
Voltage-frequency range limiters
Reliability & functional failures limit range
[Figure: frequency versus voltage, with the operating range bounded by Vmin, Vmax, and Fmax]
Vmax/Fmax limiters
• Reliability
• Thermals
• Power delivery
Vmin limiters
• Circuit functional failures
• Soft errors
• Steep frequency roll-off
• Aging
NTV design techniques
• Modified Register File Cell (L1$)
[Figure: register file cell schematic, devices m1-m10, with wrwl, wrwl#, rdwl, wrbl#, rdbl, bit, and bitx signals]
• Robust Flop Topologies
• Multi-corner design optimizations (SCL)
[Figure: frequency versus voltage (0.5-1.1 V) with optimization corners marked]
• Variation-aware design: 2X min Z, 40% lib cells used
[Figure: delay spread due to random variations; 2-input NAND gate delay versus voltage (0.4-1.0 V)]
• Narrow muxes, no stack height > 2
[Figure: 4:1 mux]
• Robust level converters
[Figure: level converter schematics using vccl and vcch supplies]
NTV Across Technology Generations
[Figure: energy efficiency versus supply voltage for three designs, each at 50°C.]
• 45nm CMOS: normalized energy efficiency for 32b Multiply, 16b SIMD Multiply, and 72b Add over 0.15-1.40 V; annotations 300mV, 8X; Vhi/Vlo scale 1.1V, 0.98, 0.87, 0.74, 0.59, 0.37, 0.15
• Reconfigurable fabric, 32nm CMOS: energy efficiency (TOPS/W) and active leakage power (mW) over 0.2-1.4 V, sub-threshold region marked; annotations 340mV, 0.8mW, 5.7x
• Register file and permute crossbar, 22nm CMOS: energy efficiency (GOPS/W) and leakage power (mW) over 0.2-1.2 V, sub-threshold region marked; annotations 9x, 9x
H. Kaul et al., ISSCC 2009
A. Agarwal et al., ISSCC 2010
S. K. Hsu et al., ISSCC 2012
NTV operation improves energy efficiency across 45nm-22nm CMOS
ISSCC 2012 Distinguished Paper Award, ESSCIRC 2012 Best Paper Award
Vector Flip-flops for VMIN
● Min-sized clock inverters shared across adjacent flip-flops: no clock load increase
● Hold time VMIN improves by 175mV
[Figure: vector flip-flop schematic (D0/Q0, D1/Q1, clock nodes C, C#, Cd), and hold time (% cycle) versus supply voltage (a.u.) for the flip-flop and the vector flip-flop, with the hold threshold crossings 175mV apart. 22nm Tri-Gate CMOS simulation, 0°C-85°C, 3σ systematic, 6σ random.]
S. Hsu, R. Krishnamurthy et al, ISSCC 2012
ULVS Level Shifter for VMIN
● Decouples CVSL stage from output driver stage and interrupts cross-coupled PMOS devices
● Reduced contention improves VMIN by 125mV
[Figure: schematics of the conventional CVSL level shifter and the Ultra Low Voltage Split-Output (ULVS) level shifter, each with DIN, DOUT, VCC1, and VCC2 nodes; and normalized delay versus supply voltage (a.u.) for both, with the delay threshold crossings 125mV apart. 22nm Tri-Gate CMOS simulation, 0°C-85°C, 3σ systematic, 6σ random.]
S. Hsu, R. Krishnamurthy et al, ISSCC 2012
Ultra Low Voltage Graphics/Media and
Security Accelerators
DSP functions are highly throughput-oriented: amenable to parallelism/pipelining
Better power-performance optimization
Optimal partitioning of tasks between GP processor and dedicated engines
10-100X higher performance/watt vs. GP cores
[Figure: flexibility vs. energy-efficiency. GOPS/W for published microprocessors (PPC, PPC1-SOI, PPC2-SOI, PPC770, PPC970, Sparc, Sparc1, Sparc2, P4, x86, Alpha, Itanium), DSPs (SA-DSP, Hitachi-DSP, Fuj-DSP, Cell-SPE, KAIST-DSP, NEC-DSP, Fuj-Multi), and dedicated HW (MPEG2, Encrypt, MUD, 802.11a), with 10x and 100x gaps marked; more flexible toward microprocessors, more efficient toward dedicated HW. Source: ISSCC. Intel designs (ISSCC, VLSI 2008-2016): Video ME, SIMD Vector, SIMD Permutation, AES Encryption.]
H. Kaul, R. Krishnamurthy et al, ISSCC 2012
NTV Variable Precision FPU
Near threshold voltage IA processor
Technology     32nm High-K Metal Gate
Interconnect   1 Poly, 9 Metal (Cu)
Transistors    6 Million (Core)
Core Area      2mm²
[Figure: die photo, 5 mm × 5 mm, showing the IA-32 core, PLL, JTAG, and I/O areas; IA-32 core floorplan, 1.1 mm × 1.8 mm, showing logic, scan, ROM, L1$-I, L1$-D, and level shifters + clk spine]
Power performance measurements
[Figure: frequency (MHz, log scale) and total power (mW) versus logic Vcc, 32nm CMOS, 25°C. Logic Vcc is swept from 0.2 V to 1.2 V; memory Vcc equals logic Vcc from 0.6 V upward and is held at 0.55 V for logic Vcc of 0.2-0.5 V.]
Frequency   Total power
915MHz      737mW
500MHz      174mW
100MHz      17mW
3MHz        2mW
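Dividing each measured power by its frequency gives energy per clock cycle, which shows where the processor is most efficient. The sketch below assumes the four frequency labels and four power labels on the plot pair up in the order they are listed; that pairing is an assumption about the figure, and the helper name is illustrative.

```python
# (frequency in MHz, total power in mW), paired in the order they are listed
points = [(915, 737), (500, 174), (100, 17), (3, 2)]

def energy_per_cycle_nj(freq_mhz, power_mw):
    """Energy per clock cycle in nanojoules: P / f."""
    return (power_mw * 1e-3) / (freq_mhz * 1e6) * 1e9

energies = {f: energy_per_cycle_nj(f, p) for f, p in points}
best_freq = min(energies, key=energies.get)
ratio = energies[915] / energies[best_freq]

for f, e in energies.items():
    print(f"{f:4d} MHz: {e:.3f} nJ/cycle")
print(f"lowest energy per cycle at {best_freq} MHz, {ratio:.1f}x below the 915 MHz point")
```

Under that pairing, energy per cycle falls as voltage and frequency drop, reaches its minimum at the 100MHz point, and rises again at 3MHz, where the long cycle time lets leakage dominate.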
Energy efficiency peaks near threshold
[Figure: normalized energy/cycle versus voltage/frequency operating point (0-1.2 V) for a 32nm CPU process and a 32nm SOC process, across the sub-threshold, NTV, and normal operating ranges; annotations 400mV, 500mV, 3X, 5X]
NTV and variability
[Figure: energy/cycle (pJ, log scale) versus frequency (0-1000 MHz) for fast, medium, and slow dies, 32nm CMOS, 25°C; annotations 16%, 30%, 22%, 28%, 18%]
Leakage Comparison
Slow     1.0X
Medium   2.5X
Fast     7.5X
Interconnect Trends
[Figure: Al and Cu interconnects]
On-chip Interconnect Trend
• Local interconnects scale with gate delay
• Global interconnects do not keep up with scaling
[Figure: delay trends for gate delay (FO4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters]
Source: ITRS
Circuit-Switched On-Chip Interconnects
G. Chen, R. Krishnamurthy et al, ISSCC 2014
Dynamic V & F adaptation
Environment-aware dynamic adaptation
• Adapt F/V to V/T change: reduce V/T margin
• Adapt F/V to aging: reduce aging margin
[Figure: prototype chip in 90nm. Blocks shown: TCP/IP processor core, clocking, input buffer, output port, sensors & analog, DAB control, noise generators, noise injector, JTAG, thermal sensor, droop sensor, NMOS and PMOS body bias (CBG), VRM, and clocking control with PLL0, PLL1, and PLL2 (F0, F1, F2), a divider, I/O clk, and a gated core clk. Insets: temperature versus time, and Vcc versus time with 1st, 2nd, and 3rd droops.]
Source: Intel
Y. Hoskote et al, 2003 ISSCC
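The adaptation idea on this slide (sense temperature and supply droop, then pick a clock that is safe for the current conditions instead of carrying a fixed worst-case margin) can be sketched as a small selection routine. Everything numeric below is hypothetical: the three PLL frequencies, the linear `max_safe_freq_ghz` model, and the names are illustrative assumptions, not details of the prototype.

```python
from dataclasses import dataclass

@dataclass
class Sensors:
    temp_c: float   # die temperature reported by the thermal sensor
    vcc: float      # supply voltage seen by the droop sensor

# Hypothetical pre-locked PLL frequencies in GHz, fastest first (F0 > F1 > F2).
PLL_FREQS_GHZ = [3.0, 2.7, 2.4]

def max_safe_freq_ghz(s: Sensors) -> float:
    """Made-up linear model of the highest frequency that still meets timing."""
    return 3.0 + 2.5 * (s.vcc - 1.2) - 0.005 * (s.temp_c - 50.0)

def select_pll(s: Sensors) -> int:
    """Pick the fastest PLL whose frequency is safe for the sensed conditions."""
    limit = max_safe_freq_ghz(s)
    for index, freq in enumerate(PLL_FREQS_GHZ):
        if freq <= limit + 1e-9:
            return index
    return len(PLL_FREQS_GHZ) - 1   # fall back to the slowest clock

nominal = select_pll(Sensors(temp_c=50, vcc=1.20))
droop = select_pll(Sensors(temp_c=50, vcc=1.10))
hot = select_pll(Sensors(temp_c=120, vcc=1.20))
print(nominal, droop, hot)
```

Switching between clocks that are already locked avoids waiting for a PLL to relock, which is one plausible reason to keep several PLLs available; the presentation itself does not spell out the switching mechanism.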
Integrated Voltage Regulators: Fine-grain power management
Spatial domain
TODAY: Coarse-grain management
• Same voltage to all cores
• Same frequency for all cores
FUTURE: Fine-grain management
• Each core/cluster at optimum voltage
• Each core/cluster at optimum frequency
• V/F domain interfaces
• Synchronization overhead
• Clock generation/distribution
• Power grid routing
• Optimum V/F for non-cores
• Sub-core clock/leakage gating
• Sub-core V/F domains
Nano-AES Hardware Accelerator
● Most popular symmetric-key encryption algorithm
● Conventional 128b AES datapath: large area & power
● Not suitable for ultra-low power wearable systems
[Figure: merchant and customer exchanging plaintext as ciphertext over a secure channel. Conventional AES round, 10 iterations: Plaintext, AddRoundKey, Substitute Bytes, ShiftRows, MixColumns/AddRoundKey. The 128b round datapath (Rounddatain[127:0], Roundkey[127:0], Rounddataout[127:0]) uses 16 Sboxes and 16 groups of *1, *2, *3 multipliers.]
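The *1, *2, and *3 multipliers in the conventional round datapath are the MixColumns step of standard AES. As a reference point, the sketch below implements textbook MixColumns and AddRoundKey on one 4-byte column over GF(2^8); it illustrates the conventional algorithm only and is not the Nano-AES composite-field implementation.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def mix_column(col):
    """Apply the AES MixColumns matrix (rows are rotations of 2, 3, 1, 1)."""
    out = []
    for i in range(4):
        a0, a1, a2, a3 = (col[(i + k) % 4] for k in range(4))
        # 2*a0 + 3*a1 + 1*a2 + 1*a3, where 3*a = 2*a XOR a and '+' is XOR
        out.append(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3)
    return out

def add_round_key(col, key_col):
    """XOR one state column with the matching round-key column."""
    return [c ^ k for c, k in zip(col, key_col)]

column = [0xDB, 0x13, 0x53, 0x45]          # widely used MixColumns test column
mixed = mix_column(column)
print([hex(b) for b in mixed])             # ['0x8e', '0x4d', '0xa1', '0xbc']
```

A full 128b datapath repeats this for all four columns in parallel every round, which is where the area and power of the conventional design come from.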
22nm CMOS NTV Nano-AES Accelerator
Industry-leading energy efficiency of 289Gbps/Watt at 340mV!
[Figure: die photo and layouts with clock and IO & control; layout dimensions 71µm × 31µm and 76µm × 36µm; panels labeled Encrypt and Decrypt, each with key register, data register, intermediate register, data map, SBOX, mix column, key map, key generate, and data inv-map blocks]
Process    22nm tri-gate high-K metal gate CMOS
Die area   0.19mm²
Area (µm²)                               2736       2200
Nom. throughput @ 0.9V, 25°C (Mbps)      671        432
Latency (cycles)                         216        336
Peak efficiency @ 0.43V, 25°C (Gbps/W)   289        186
Ground-field poly                        x^4+x+1    x^4+x^3+1
Extension-field poly                     x^2+2x+E   x^2+6x+9
                                         Decrypt    Encrypt
Gate count                               2090       1947
Total power @ 1.1GHz, 0.9V, 25°C (mW)    13         13
Leakage power @ 0.9V, 25°C (mW)          0.5        0.5
Area (µm²)                               2200       2736
S. Mathew, R. Krishnamurthy et al, 2014 VLSI circuits symposium
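An efficiency figure in Gbps/W converts directly to energy per encrypted block, since 1 Gbps/W is 10^9 bits per joule. The sketch below does that conversion for the 289Gbps/W headline number; the function name is illustrative.

```python
def energy_per_block_nj(efficiency_gbps_per_watt, block_bits=128):
    """Energy to process one block, in nanojoules, from an efficiency in Gbps/W."""
    joules_per_bit = 1.0 / (efficiency_gbps_per_watt * 1e9)
    return joules_per_bit * block_bits * 1e9

nano_aes = energy_per_block_nj(289)
print(f"{nano_aes:.3f} nJ per AES-128 block")
```

This comes to a little under half a nanojoule per 128-bit block, or about 3.5 pJ per bit.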
Comparisons with prior-art
● 11x higher energy-efficiency than previously-reported measurements
[Figure: energy-efficiency (Gbps/W, 0-350) versus energy per AES-128 block (nJ, 0-50) for this work, [EUROMICRO’06], [EUROCRYPT’11], and [CHES’04]; annotation 11X]
AES symmetric-key crypto accelerator
S. Mathew, R. Krishnamurthy et al, 2010 VLSI circuits symposium
All-digital random number generator
Scalable & PVT variation tolerant
S. Mathew, R. Krishnamurthy et al, 2015 ESSCIRC
S. Mathew, R. Krishnamurthy et al, 2010 VLSI circuits symposium
Physically unclonable function (PUF)
S. Mathew, R. Krishnamurthy et al, 2014 ISSCC
Neuromorphic computing
“Extreme” efficiency research
Legal Disclaimer
This presentation contains the general insights and opinions of Intel Corporation (Intel).
• This presentation is provided for informational purposes only and is not to be relied upon for any other purpose. Intel makes no representations or warranties regarding the accuracy or completeness of the information in this presentation. Intel accepts no duty to update this presentation based on more current information. Intel is not liable for any damages, direct or indirect, consequential or otherwise, that may arise, directly or indirectly, from the use or misuse of the information in this presentation. Intel retains all rights to the Presentation, including any patent, trademark, trade secret, copyright, trade dress, mask works, or any other intellectual property rights. The provision of the Presentation does not constitute the grant or license of any such rights by Intel. Provision of the Presentation shall not be construed to constitute advice or consultation. Intel does not provide, and is not providing, any technical, legal, regulatory or compliance advice. Nor does Intel make any representation or warranty with respect to the effectiveness of any information contained herein.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details. Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.
• Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost.
• No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel® Trusted Execution Technology-enabled Intel processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other Intel® Trusted Execution Technology compatible measured virtual machine monitor. In addition, Intel® Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
• Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. © 2008 Standard Performance Evaluation Corporation (SPEC) logo is reprinted with permission.
• * Other names and brands may be claimed as the property of others.