GOING BEYOND TURING ENERGY-EFFICIENCY IN THE POST-MOORE ERA
Jan M. Rabaey Donald O. Pederson Distinguished Prof.
University of California at Berkeley
ISLPED 2010. The contributions of Doug Jones, Subhasish Mitra, Rahul Sarpeshkar, and Naresh Shanbhag are gratefully acknowledged, as well as the funding by the FCRP program.
Information Processing –
It Is All About Energy …
Further progress in all aspects of future information technology platforms requires a continuing increase in energy efficiency
The Compute Cloud
ENERGY-INTENSIVE
Mobiles
ENERGY-FRUGAL
15 Years of ISLPED!
ISLPED was born in its current form in 1996 through the merger of the International Symposium on Low Power Electronics (ISLPE) and the International Symposium on Low Power Design (ISLPD), each having had two previous editions before the merger. Since its inception, ISLPED has been the premier forum for presentation of advances in all aspects of low power design and technologies, including …
Two Decades of Low-Power Design
• Eliminating waste
• Reducing energy by voltage scaling
• Architectural innovation
• Power/voltage management
© Springer 09
The mantras: slow, simple, many, dedicated, adaptive
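The payoff from voltage scaling listed above comes from the quadratic dependence of dynamic switching energy on supply voltage. A minimal sketch of that relation (the 10 fF capacitance and the voltage values are illustrative assumptions, not figures from the talk):

```python
# Dynamic switching energy per operation scales as E = C * VDD^2,
# which is why voltage scaling was the workhorse of low-power design.

def switching_energy(c_farads, vdd_volts):
    """Dynamic energy per switching event: E = C * VDD^2."""
    return c_farads * vdd_volts ** 2

C = 10e-15  # assumed effective switched capacitance: 10 fF
e_nominal = switching_energy(C, 1.0)   # at VDD = 1.0 V
e_scaled  = switching_energy(C, 0.5)   # at VDD = 0.5 V

print(e_nominal / e_scaled)  # quadratic payoff: 4.0x
```

Halving the supply quarters the energy per operation, which is exactly why the curve in the next figure drops so steeply before leakage takes over.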
But … We are running out of options
Waste has been largely eliminated (…)
[Figure: normalized energy vs. VDD (0–1.2 V, log scale) — energy drops roughly 12x as VDD scales down to ~0.3 V; the minimum energy point is set by leakage]
Technology scaling may not help much anymore
[Figure: energy per operation, EOP (fJ), vs. technology node (20–90 nm) — EOP stays in the 0.03–0.12 fJ range, i.e. scaling yields diminishing returns]
Process variations and random upsets dictate noise and timing margins
How to Move Forward (or Lower)?
ISSCC® 2011 – Electronics for Healthy Living
San Francisco Marriott Marquis | San Francisco, California, USA

Plenary Talks (Mon, Feb 21):
• Eco-Friendly Semiconductor Technologies for Healthy Living – Oh-Hyun Kwon, Samsung Electronics
• Game-Changing Opportunities for Wireless Personal Healthcare and Lifestyle – Jo De Boeck, IMEC
• New Interfaces to the Body Through Implantable System Integration – Steve Oesterle, Medtronic

Plenary Roundtable (Mon, Feb 21): Beyond the Horizon: The Next 10x Reduction in Power – Challenges and Solutions
Moderator, panelists, and domain experts: Hugo De Man (Professor Emeritus, KU Leuven), Jan Rabaey (Professor, UC Berkeley), Takayasu Sakurai (Professor, University of Tokyo), Mark Horowitz (Professor, Stanford University), Dan Dobberpuhl (Consultant), Kiyoo Itoh (Fellow, Hitachi), Jack Sun (CTO/VP R&D, TSMC), Philippe Magarshack (Group VP, STMicroelectronics), Asad Abidi (Professor, UCLA), Hermann Eul (Executive VP, Infineon)
Some options …
Energy-Proportional Computing
[Figure: power vs. throughput — actual systems draw substantial power even at low throughput, versus the ideal of power proportional to throughput]
New devices that lower the minimum energy point
Example: NEMS relay logic (King, Alon); others: TFETs, IGFETs
Probably more than a decade out
Getting a factor of 10 is not obvious — and leaves us wanting …
Shannon–von Neumann–Landauer bound for irreversible computing:
Minimum energy/operation = kT·ln(2) ≈ 4·10⁻²¹ J/bit at room temperature
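The bound is easy to evaluate directly; a minimal sketch (at T = 300 K the textbook value of kT·ln(2) comes out near 2.9·10⁻²¹ J, the same order of magnitude as the slide's figure):

```python
import math

# The Landauer limit kT*ln(2): minimum energy to erase one bit of
# information in an irreversible computation.
k_B = 1.380649e-23  # Boltzmann constant, J/K

def landauer_limit(temperature_kelvin):
    return k_B * temperature_kelvin * math.log(2)

e_min = landauer_limit(300.0)  # room temperature
print(e_min)  # ~2.87e-21 J/bit
```

Today's ~fJ-per-operation silicon sits five to six orders of magnitude above this floor — the "many orders of magnitude of opportunity" the next line refers to.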
Many orders of magnitude of opportunity awaiting
Probabilistic Turing machine (from Wikipedia, the free encyclopedia)
In computability theory, a probabilistic Turing machine is a non-deterministic Turing machine which randomly chooses between the available transitions at each point according to some probability distribution.
In the case of equal probabilities for the transitions, it can be defined as a deterministic Turing machine having an additional "write" instruction where the value of the write is uniformly distributed in the Turing machine's alphabet (generally, an equal likelihood of writing a '1' or a '0' onto the tape). Another common reformulation is simply a deterministic Turing machine with an added tape full of random bits, called the random tape.
As a consequence, a probabilistic Turing machine can (unlike a deterministic Turing machine) have stochastic results; on a given input and instruction state machine, it may have different run times, or it may not halt at all; further, it may accept an input in one execution and reject the same input in another execution.
Maybe adding randomness can help … Going beyond Deterministic Turing Machines
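The "random tape" reformulation quoted above can be sketched in a few lines. The toy machine here (XOR the input with the random tape, accept iff the majority of the result is 1) is a hypothetical illustration, chosen only to show the defining property: the same input may be accepted in one execution and rejected in another.

```python
import random

def run_probabilistic_tm(input_bits, steps=7, seed=None):
    rng = random.Random(seed)
    # The "random tape": a tape full of pre-drawn random bits.
    random_tape = [rng.randint(0, 1) for _ in range(steps)]
    # Deterministic transition over (input bit, random-tape bit):
    # XOR them, then accept iff a majority of the results are 1.
    mixed = [b ^ random_tape[i % steps] for i, b in enumerate(input_bits)]
    return sum(mixed) > len(mixed) / 2  # accept / reject

x = [1, 0, 1, 1, 0, 1, 0]
# Different executions (seeds) of the same machine on the same input:
print([run_probabilistic_tm(x, seed=s) for s in range(5)])
```

Each run is fully deterministic given its random tape; the stochasticity lives entirely in the tape, which is the point of the reformulation.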
The Opportunity: Functional Non-Determinism
[Figure: efficiency vs. required accuracy — today's redundancy/overdesign sits at high required accuracy and low efficiency; application-domain solutions and non-deterministic computing trade relaxed accuracy requirements for higher efficiency]
Thinking Aloud …
Pavg(brain) = 20 W (20% of body dissipation, 2% of the weight); power density ~15 mW/cm³
Nerve cells occupy only 4% of brain volume; average neuron density: 70 million/cm³
Computational capacity of the brain: 10¹⁶ computations per second* → 1–2 fJ per computation (<1 aJ per operation); memory capacity: 100k terabytes
A very energy-efficient computer
[* R. Kurzweil, "The Singularity Is Near"]
The Lure of Analog Processing Computational engines that, given the properties and statistics of the input signals and the physical implementation, ensure that the outputs fall within the desired specifications
Analog Processing Pros and Cons
• Single operator performs complex functions — inherently energy efficient
• Improving SNR (resolution) requires more power and/or area — a linear or quadratic relationship
[Equation, garbled in extraction: total power P_T as a function of SNR — combining a thermal-noise term C_w·SNR·Δf with VDD², a 1/f-noise area term SNR·C_f·ln(f_h/f_l)/A_T, and an exponent 1/p]
[Courtesy: R. Sarpeshkar, MIT]
Digital Processing Pros and Cons
• Single operator very simple (logic gate)
• Power increases logarithmically with resolution (SNR) — 1 extra bit in an adder increases SNR by a factor of 2
P_T = D_P · Δf · log₂(1 + SNR)
Digital versus Analog Processing Digital shines for high SNR Analog is supreme at low SNR
[R. Sarpeshkar, Ultra-Low Power Bioelectronics,©2010]
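The crossover can be sketched numerically: analog power grows roughly linearly with SNR (thermal-noise limited), digital only logarithmically. The proportionality constants K_A and K_D below are arbitrary assumptions chosen purely for illustration — only the shape of the comparison matters.

```python
import math

K_A, K_D = 1.0, 20.0  # assumed constants, for illustration only

def analog_power(snr):
    """Thermal-noise-limited analog: power ~ linear in SNR."""
    return K_A * snr

def digital_power(snr):
    """Digital: power ~ logarithmic in SNR (bits of resolution)."""
    return K_D * math.log2(1 + snr)

for snr_db in (10, 30, 60):
    snr = 10 ** (snr_db / 10)
    winner = "analog" if analog_power(snr) < digital_power(snr) else "digital"
    print(snr_db, "dB ->", winner)
```

With these constants, analog wins at 10 dB and digital wins at 30 dB and above — the qualitative picture the slide describes, with the exact crossover point depending on technology constants.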
When is Analog Communication Efficient?
[Figure: energy efficiency of the four link classes — continuous-time/continuous-value (CTCV), continuous-time/discrete-value (CTDV), discrete-time/continuous-value (DTCV), and discrete-time/discrete-value (DTDV)]
[Courtesy: P.C. Huang, UCB]
Same answer! Also: the use of slow (digital) feedback moves the analog curves further to the right
How About Mechanical Computing?
Resonator examples: 60 MHz (Q = 48,000); 1.2 GHz (Q = 14,600); 1.5 GHz (Q = 11,555)
1 mm² fits roughly 2,000 resonators; assume 100 μW per analog channel
Spectrum analysis @ 1 GHz with N bins
[Courtesy: J. Richmond, UCB]
Mechanical-analog wins for low resolutions
MEMS Spectrum Analyzer
Accomplish Functionality using Low-Precision Components
[Rahul Sarpeshkar, MIT]
Floating-point, scale-independent spectrum analyzer using tapered cascaded analog stages inspired by the cochlea — linear in N (versus N·log(N) for the FFT)
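The operation-count claim is easy to make concrete. A minimal sketch (the cost models — one stage per bin for the cascade, N·log₂(N) for the FFT — are the slide's scaling laws with unit constants assumed):

```python
import math

def cascade_ops(n):
    """Cochlea-style tapered cascade: one analog stage per bin."""
    return n

def fft_ops(n):
    """FFT-based analyzer: ~N * log2(N) operations."""
    return n * math.log2(n)

for n in (64, 1024, 65536):
    print(n, fft_ops(n) / cascade_ops(n))  # advantage grows as log2(N)
```

At N = 1024 bins the cascade's advantage is a factor of log₂(1024) = 10, and it keeps growing with resolution.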
Plenty of opportunities in the middle: distributed analog; non-deterministic digital; use feedback to add robustness
Another Approach: “Non-deterministic digital” or “statistical computing”
Uses digital encoding. Non-determinism arises from errors generated in the compute modules. Detection and estimation are used to rein in the error bounds.
[Figure: output distributions around the error-free output — estimation error to one side, hardware error to the other, with distinct error characteristics]
Not to be confused with …
Probabilistic algorithms:
• Algorithms that have an element of randomness
• Given deterministic inputs and implementation, outputs are random variables
• May lead to better performance (search, optimization, polynomial factoring)
○ Examples: simulated annealing, genetic algorithms
• No specific benefits related to nanoscale computing
Not to be Confused With … Probabilistic Boolean Networks
All signals in the logic network are considered stochastic variables, with noise added into the process. Each logic network is essentially a stochastic process, producing stochastic variables at the output.
Soft data is turned into Boolean variables with an error probability at decision points (e.g., a latch with sharp timing edges)
Equivalent to a discrete communication channel known as the Binary Symmetric Channel (BSC). Studied extensively in information theory (von Neumann, Winograd, Hajek). Coding can be applied, but at large overhead and latency.
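Both the channel and the coding overhead mentioned above can be illustrated with a small Monte-Carlo sketch: a BSC with flip probability p, plus 3-way repetition with majority voting (the simplest code). The flip probability and trial count are assumptions for illustration.

```python
import random

def bsc(bit, p, rng):
    """Binary Symmetric Channel: flip the bit with probability p."""
    return bit ^ (1 if rng.random() < p else 0)

def majority3(b0, b1, b2):
    return 1 if (b0 + b1 + b2) >= 2 else 0

def error_rates(p=0.1, trials=100_000, seed=42):
    rng = random.Random(seed)
    raw_err = coded_err = 0
    for _ in range(trials):
        if bsc(0, p, rng) != 0:                      # uncoded transmission
            raw_err += 1
        if majority3(bsc(0, p, rng), bsc(0, p, rng),  # 3x repetition code
                     bsc(0, p, rng)) != 0:
            coded_err += 1
    return raw_err / trials, coded_err / trials

raw, coded = error_rates()
print(raw, coded)  # coded rate ~ 3p^2 - 2p^3, well below the raw rate p
```

The coded error rate drops from ~0.1 to ~0.028, but at 3x the bits — exactly the overhead-vs-reliability trade the slide flags.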
Statistical Computing
Statistical Performance Metrics
Statistical Model of Implementation Platform
Statistical Computation
Inputs: deterministic or stochastic variables. Outputs: stochastic variables with guaranteed properties (mean, distribution, bounds).
The implementation adds randomness (errors); the system is designed such that the output metrics are met in spite of the randomness of the implementation
Requires error models to help design the compensation techniques
Examples: synthesis, classification, modeling, search, recognition
Statistical Computation – Basic Tools
• Algorithmic resilience (RMS)
• Estimation
• Detection
Error Models of Implementation Modules
16-bit Ripple-carry adder 8-bit Baugh-Wooley multiplier
Characterize and engineer error statistics of computational macros
[Courtesy: N. Shanbhag UIUC]
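A hedged sketch of what "characterize and engineer error statistics" means in practice: model a 16-bit adder whose k low-order bits occasionally flip (a hypothetical error model in the spirit of voltage overscaling, not measured data from the UIUC macros), and measure the resulting error magnitude.

```python
import random

def noisy_add16(a, b, flip_prob, k, rng):
    """16-bit add whose k low-order result bits may each flip."""
    true_sum = (a + b) & 0xFFFF
    noisy = true_sum
    for bit in range(k):              # only low-order bits are vulnerable
        if rng.random() < flip_prob:
            noisy ^= (1 << bit)
    return true_sum, noisy

def mean_abs_error(flip_prob=0.05, k=4, trials=20_000, seed=7):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a, b = rng.randrange(1 << 16), rng.randrange(1 << 16)
        t, n = noisy_add16(a, b, flip_prob, k, rng)
        total += abs(t - n)
    return total / trials

print(mean_abs_error())  # small: errors are confined to bits 0..3
```

Because the flips are confined to the low-order bits, the error magnitude is bounded (here by 15 LSBs) — the kind of "engineered" error statistic that algorithmic resilience can absorb.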
Example: Voltage Overscaling (VOS)
Example: Error-Resilient System Architecture (ERSA)
RMS: Recognition, Mining, Synthesis — emerging killer applications: cognition, vision, genomics
Large data sets, highly parallel
Core algorithms: probabilistic belief propagation, K-means clustering, Bayesian networks
[S. Mitra et al., Stanford University]
Cognitive resilience: "acceptable" results are OK
Algorithmic resilience: low-order bit errors have minimal effect
But intolerant to control errors and higher-order bit errors — RMS + unreliable hardware crashes instantaneously
RMS Workload Model
[Figure: a main thread performs setup, work assignment, barrier, data reduction, and a convergence test over iterations; worker threads pull from a work queue and calculate]
[Figure: ERSA organization — one Super Reliable Core (SRC) acting as supervisor (reliable) and multiple Relaxed Reliability Cores (RRCs, unreliable), each with its own L1 cache, connected over an interconnect to banked L2 cache]
ERSA Vision: Asymmetric Reliability
• Relaxed Reliability Cores: inexpensive and unreliable, without expensive error detection; run the worker threads, which constitute most of the workload; the only reliable parts are a memory bounds check and restart
• Super Reliable Core: highly reliable (expensive), with proper error protection; executes the main thread, assigns worker threads, performs reduction, supervises the RRCs, and checks for timeouts
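A toy sketch of this asymmetric split (an illustration of the idea, not the Stanford implementation): unreliable workers occasionally return garbage, and a reliable supervisor applies a cheap bounds check and re-assigns failed work, mimicking the restart/timeout path.

```python
import random

def unreliable_worker(x, error_prob, rng):
    """RRC stand-in: returns x*x, but occasionally a corrupted value."""
    if rng.random() < error_prob:
        return rng.randrange(10 ** 9)  # corrupted result
    return x * x                       # intended computation

def supervisor(tasks, error_prob=0.2, seed=0):
    """SRC stand-in: bounds-checks worker results, retries on failure."""
    rng = random.Random(seed)
    results = {}
    pending = list(tasks)
    bound = max(tasks) ** 2            # cheap sanity bound on valid outputs
    while pending:
        x = pending.pop()
        r = unreliable_worker(x, error_prob, rng)
        if 0 <= r <= bound:
            results[x] = r             # accepted (a rare corrupt value may
        else:                          # still slip past the cheap check --
            pending.append(x)          # acceptable in the RMS setting)
    return results

print(supervisor([1, 2, 3, 4]))
```

Note the deliberate asymmetry: all the expensive checking lives in the small supervisor, while the bulk of the computation runs unprotected.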
RMS on ERSA
[Figure: the main thread on the SRC handles setup, work assignment, a barrier with "basic" checks (work, memory bounds, timeout), data reduction, and the convergence test over iterations; worker threads with bounds checks calculate on the RRCs]
Simplistic ERSA is inadequate → convergence-filtering heuristics; convergence damping = estimation
ERSA on the BEE3 FPGA system
• Large-scale FPGA system: 4 Virtex-5 FPGAs
• Many-core emulation ○ real-world-scale applications
• Flexible error-injection experiments
[Figure: error injection — an LFSR with injection-rate control and a DEMUX drives error-injection logic into every flip-flop of the original circuit]
ERSA Results
[Charts: output quality and execution time vs. injected error rate (errors/RRC/sec), comparing No ERSA, naïve ERSA, and optimized ERSA:
• Bayesian network inference — error probability distribution (0–30%), error rates 0–20K
• LDPC decoding — successful decoding (0–100%), error rates 0–25K
• Normalized execution time (0–4x), error rates 0–30K]
ERSA Real-World App: Image Classifier
Bayesian network inference — 90% accuracy is enough: cognitive resilience
At 0 errors/sec/RRC: incorrect inference on just one input (inferred: not a car); at 30K errors/sec/RRC: correct result, 90% accurate (inferred: a car)
K-means clustering: error-free execution; 1,000 errors/sec; ERSA at 10,000 errors/sec; ERSA at 100,000 errors/sec
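Why K-means tolerates compute errors so well can be shown with a small synthetic sketch (synthetic 1-D data and an assumed error model — corrupting a fraction of the distance comparisons — not the ERSA experiment itself): the centroid averaging step washes most per-point mistakes out.

```python
import random

def kmeans_1d(points, k=2, iters=20, error_prob=0.0, seed=0):
    rng = random.Random(seed)
    centers = [min(points), max(points)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            d = [abs(p - c) for c in centers]
            if rng.random() < error_prob:
                idx = rng.randrange(k)       # corrupted distance compare
            else:
                idx = d.index(min(d))        # correct nearest-center pick
            buckets[idx].append(p)
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

pts = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9]
print(kmeans_1d(pts))                   # clean run: centers near 0.15 and 5.0
print(kmeans_1d(pts, error_prob=0.2))   # degraded but still usable result
```

This is algorithmic resilience in miniature: individual mis-assignments perturb the centroids but rarely break the overall clustering.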
Algorithmic Noise-Tolerance (ANT): combining estimation and detection
• Main block designed for the average case — makes intermittent errors (reduced margins)
• Estimator approximates the main-block output
• Detector compares and replaces
• Assumes algorithmic knowledge for designing efficient estimators
[Courtesy: Shanbhag et al, UIUC]
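A minimal sketch of the ANT structure just described — a main block that intermittently produces large errors, a cheap low-precision estimator, and a detector that replaces implausible outputs. The computation (3x + 1), the estimator (3x), the error magnitude, and the threshold are all assumptions chosen for illustration.

```python
import random

def main_block(x, rng, error_prob=0.1):
    y = 3 * x + 1                        # intended computation
    if rng.random() < error_prob:
        y += rng.choice([-1, 1]) * 1000  # rare large (MSB-like) error
    return y

def estimator(x):
    return 3 * x                         # cheap, low-precision approximation

def ant(x, rng, threshold=100):
    y_main = main_block(x, rng)
    y_est = estimator(x)
    # Detector: accept the main output only if it agrees with the estimate.
    return y_main if abs(y_main - y_est) < threshold else y_est

rng = random.Random(1)
outputs = [ant(x, rng) for x in range(100)]
worst = max(abs(y - (3 * x + 1)) for x, y in zip(range(100), outputs))
print(worst)  # at most 1: large main-block errors are caught and replaced
```

The detector converts rare, catastrophic errors into small, bounded ones (here at most 1 LSB — the estimator's own approximation error), which is what makes the reduced-margin main block usable.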
Results: ANT Motion Estimation
[Chart: peak SNR of ideal, conventional, and ANT implementations — 2.5x energy savings, 7x reduction in PSNR variance, and a PSNR increase over conventional]
Using Estimation Only: "Sensor Networks on a Chip" (SNOC)
Stochastic model: Yᵢ = θ + ηᵢ (Yᵢ: observations; θ: estimate; ηᵢ: noise)
Estimation theory applied to the computational cores. Requires efficient and robust estimators and favorable error statistics, e.g.,
independent and identical distributions [Shanbhag, Jones et al., UIUC]
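The estimation-only idea reduces to a textbook problem: treat each core's output as a noisy observation Yᵢ = θ + ηᵢ and recover θ by averaging, under the independent zero-mean noise assumption the slide calls "favorable error statistics". The parameter value and noise scale below are synthetic.

```python
import random

def estimate(theta, n_observations=1000, noise_scale=1.0, seed=3):
    """Sample-mean estimate of theta from noisy observations Y_i = theta + eta_i."""
    rng = random.Random(seed)
    obs = [theta + rng.gauss(0, noise_scale) for _ in range(n_observations)]
    return sum(obs) / len(obs)  # the ML estimate under i.i.d. Gaussian noise

theta_hat = estimate(theta=5.0)
print(theta_hat)  # close to 5.0; the error shrinks as 1/sqrt(n)
```

The 1/√n error decay is why many cheap, error-prone observations can substitute for one expensive, reliable computation.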
SNOC-based PN-Code Acquisition
[Chart: probability of detection, simulation results — annotated with 300x and 800x improvements, at 40% energy savings]
Integrated-circuit realization: 256-bit PN-code acquisition; 2×2 mm, 180 nm CMOS, 1.8 V [Shanbhag, Jones, UIUC]
Statistical Computing: Quo Vadis?
So far: pretty much ad hoc. The quest for a generalized strategy:
• Input descriptions that capture the intended statistical behavior
• (General-purpose) statistical processors with known error models
• Algorithm optimization and software generation (a.k.a. compilers) so that the intended behavior is obtained
Statistical Processors – Thinking Aloud
• Reliable simple CPU handles calibration: collect (Vdd, f) statistics for cores and interconnect
• Statistics selection is application-dependent (QoS): pick (Vdd, f) settings with 'good' statistics
• Application-dependent reconfiguration and adaptation
• Energy-efficient unreliable IP cores perform the majority of the computation, with intermittent errors
Example: VASCO Variation-Adaptive Stochastic Computer Organization
A tiled, multi-core, adaptable architecture that matches hardware stochasticity to the application [Courtesy: R. Kumar, UIUC]
General-Purpose Statistical Computing: Soft NMR (N-way Modular Redundancy)
Soft voter: combines multiple observations with their observed error profiles and multiple hypotheses to produce the output that minimizes error. Does not need algorithmic information.
Challenge: hypothesis synthesis
[Courtesy: Shanbhag, Kim, UIUC]
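A hedged sketch of a soft voter in the spirit described above (an illustration, not the UIUC implementation): each of N modules is correct with a known probability pᵢ and otherwise outputs an arbitrary symbol; the candidate hypotheses are simply the observed values, and the voter picks the one with maximum likelihood under that error profile.

```python
def soft_vote(observations, p_correct, alphabet_size=256):
    """Pick the hypothesis maximizing likelihood given per-module reliability."""
    # If a module is wrong, assume its output is uniform over the rest
    # of the alphabet (an assumed error profile for this sketch).
    p_wrong = [(1 - p) / (alphabet_size - 1) for p in p_correct]
    best, best_like = None, -1.0
    for hyp in set(observations):
        like = 1.0
        for obs, p, q in zip(observations, p_correct, p_wrong):
            like *= p if obs == hyp else q
        if like > best_like:
            best, best_like = hyp, like
    return best

# Unlike hard majority voting, the reliability profile matters: here the
# single trusted module outvotes two agreeing but very unreliable ones.
print(soft_vote([42, 17, 17], p_correct=[0.99, 0.3, 0.3]))  # prints 42
```

Hard NMR would pick 17 (two against one); weighting the votes by the modules' error statistics is exactly what makes the voter "soft".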
Going Beyond Turing – Take Aways
Major reductions in energy/operation are not evident in the near future
Major reductions in design margins are an interesting proposition
But this requires a drastic redefinition of the way data and control are encoded
Basic opportunity: most applications do not need huge resolution
Challenge: needs a rethinking of algorithms, applications, architectures and platforms, metrics, and … INSPIRATION