6/11/18
1
Domain-Specific Architectures: The Next Wave of Computing Innovation
Antonio González, Director, ARCO Research Group
Professor, Computer Architecture Department, UPC, Barcelona, Spain
8th Workshop on Architectures and Systems for Big Data (ISCA 2018), Los Angeles CA, June 2, 2018
2
Agenda
• The next wave of innovation in computing
• Main hurdles and research directions
• A case study: automatic speech recognition
• Concluding remarks
3
The First Revolution in Computing: The First Computers
Univac I, 1951
IBM 701, 1952
ENIAC, 1947
CDC 7600, 1969
4
The Second Revolution: Personal Computers
PC · Laptop · Ultrabook · Tablet · Convertible · Smartphone
5
The Next Revolution: Ubiquitous Intelligent Computing
• Computing everywhere
  – On you
  – At home
  – At work
  – In the infrastructures
    • City
    • Roads
    • Public transportation
• Interconnected
  – To cooperate and share data
• Intelligent
6
Intelligent Computing
• Machines that can perform human-like intellectual tasks
• Some capabilities
  – Comprehend our surroundings
    • Vision
    • Language processing
  – Learn
  – Proactively take decisions and autonomous actions
    • Personal assistants
    • Drive assistants
7
Very Diverse in Functionality and Requirements
• Worn devices
• Body sensors / prosthetics
• Driving assistants
• Home robots
• Healthcare devices
• Energy management
• Smart consumer electronics
8
Require High Performance
• Complex tasks
  – Pattern recognition
    • Objects in real scenes
    • Spoken words
    • Facial identities and expressions
    • Anomalies (e.g. potential hazards when driving)
  – Natural language processing
  – Image and audio processing
  – Decision making
  – Etc.
9
And Must Be Extremely Energy-Efficient
• Small wireless devices have very limited battery capacity
• Performance ("intelligence") is limited by energy efficiency
  – System power = EnergyPerTask × TasksPerSecond
  – Given a fixed power budget, EPT has to decrease at the same pace as TPS (performance) increases
Reducing EPT is the key to delivering increased performance
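To make the relation concrete, here is a tiny sketch of how a fixed power budget ties EPT to TPS; the budget and per-task energies are illustrative assumptions, not figures from the talk:

```python
# Illustrative numbers for Power = EnergyPerTask * TasksPerSecond.
POWER_BUDGET_W = 1.0   # assumed fixed system power budget

def max_tps(ept_joules: float) -> float:
    """Tasks/second sustainable under the budget, since Power = EPT * TPS."""
    return POWER_BUDGET_W / ept_joules

# At 10 mJ per task the budget sustains 100 tasks/s; doubling performance to
# 200 tasks/s under the same 1 W budget requires halving EPT to 5 mJ.
print(max_tps(0.010), max_tps(0.005))
```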
10
How To Improve EPT
• Technology– Scaling dimensions and other parameters
11
Ideal Technology Scaling
• Rules given by Robert Dennard in 1974
  – Scaling by a factor k:
    • Device dimensions (tox, L, W): 1/k
    • Doping concentration: k
    • Voltage: 1/k
    • Current: 1/k
    • Capacitance: 1/k
    • Delay time per circuit: 1/k
    • Power dissipation per circuit: 1/k²
    • Power density: 1 (constant)
[Diagram: MOSFET cross-section with gate, source, drain and body; channel length L and width W]
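Dennard's rules can be written down directly. The sketch below applies them with an arbitrary, illustrative scaling factor k and checks the headline result that power density stays constant:

```python
# Dennard constant-field scaling (1974): dimensions, doping and voltage are
# scaled together by k; the derived quantities follow. Illustrative sketch.
def dennard_scale(p: dict, k: float) -> dict:
    return {
        "dimension":   p["dimension"] / k,    # tox, L, W: 1/k
        "doping":      p["doping"] * k,       # doping concentration: k
        "voltage":     p["voltage"] / k,      # 1/k
        "current":     p["current"] / k,      # 1/k
        "capacitance": p["capacitance"] / k,  # 1/k
        "delay":       p["delay"] / k,        # delay per circuit: 1/k
        "power":       p["power"] / k ** 2,   # power per circuit: 1/k^2
    }

gen0 = dict(dimension=1.0, doping=1.0, voltage=1.0, current=1.0,
            capacitance=1.0, delay=1.0, power=1.0)
gen1 = dennard_scale(gen0, 1.4)  # one generation; k = 1.4 is illustrative

# Power density is unchanged: per-circuit power drops by k^2, but so does
# the circuit area (dimension squared), so the ratio stays at 1.
print(gen1["power"] / gen1["dimension"] ** 2)
```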
12
Dimension Scaling Has Been Great
• Gordon Moore predicted doubling transistor density every 2 years (1975)
And it happened: 1.95× every 2 years for the last 40 years
[Chart: microprocessor transistor count (×1000), log scale, 1970–2015; trend line of 1.95× / 2 years]
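A quick sanity check of that trend; the 1.95× / 2 years figure is from the chart, and the code is just compound growth:

```python
# Compound-growth check of the chart's trend line (1.95x every 2 years).
def growth(years: float, rate: float = 1.95, period: float = 2.0) -> float:
    """Total growth factor after `years` at `rate` per `period` years."""
    return rate ** (years / period)

# Over the 40 years of the chart this compounds to roughly 630,000x,
# consistent with transistor counts going from thousands to billions.
print(f"{growth(40):,.0f}")
```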
13
Future Projections
• Dimension scaling has recently slowed down
– Previous 2-year cadence has increased to 3+ years
• May soon come to a halt
– Silicon lattice spacing is 5.43 Å ≈ ½ nm
– In two generations, feature dimensions will be around 10 atoms (10 nm → 7 nm → 5 nm)
16
Supply Voltage Has Scaled Much Less Than Dennard’s Rule
[Chart: minimum supply voltage (V), log scale, 1970–2015; roughly an 8% decrease per year]
17
Consequence: Power Has Increased Over the Years
• Exponential increase until the early 2000s
[Chart: processor power (W), log scale, 1970–2010]
18
Negative Effects of Power Increase
• Autonomy of mobile devices
• Cost of running the system
• Power density reached unsustainable levels
  – Cost of the cooling solution
  – Form factor due to the cooling solution
  – Noise of the cooling solution
  – Reliability
Source: S. Borkar, IEEE Micro 1999; F. Pollack, MICRO-1999; R. Ronen, WCED 2001
[Chart: power density (W/cm²), log scale, vs. process node (1.5 µm down to 0.07 µm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4; the trend crosses the "hot plate" level and heads toward "nuclear reactor" levels]
19
How Power Increase Was Stopped Around 2000
• Dynamic power: Ceff × Vdd² × freq
  – Clock gating reduces the percentage of switching transistors
    • To reduce the effective capacitance
• Static power: Vdd × Ioff
  – Power gating reduces the percentage of transistors that are powered on
• Sacrificing performance
  – By not increasing frequency
[Chart: clock frequency (MHz), log scale, 1970–2010; growth flattens after ~2000]
20
How Power Increase Was Stopped Around 2000
• Dynamic power: Ceff × Vdd² × freq
  – Reducing the percentage of switching transistors
    • To reduce the effective capacitance
    • Extensive use of clock gating
• Static power: Vdd × Ioff
  – Reducing the percentage of transistors that are powered on
  – Power gating techniques
• Sacrificing performance
  – By not increasing frequency
• Specialized units
  – Specialization improves efficiency at the expense of flexibility
Qualcomm Snapdragon 820
Source: HotChips 2015
21
Future Technology Contribution To Improve EPT
• Summary
  – Scaling dimensions → benefits may soon reach a point of diminishing returns
  – New technology → no mature alternative on the horizon
Innovations from architecture will be a key driving force in the years ahead
22
General Purpose Architectures
• Great improvements in the past
  – Pipelining
  – Caches
  – Branch prediction
  – RISC
  – Superscalar
  – Out-of-order execution
  – Multithreading
  – Multicore
• After 5 decades of evolution they are highly optimized and difficult to improve
Expected to provide minor improvements in the future
23
Time/Opportunity for Domain-Specific Architectures
• Key features
  – Many simple units
    • Simple units have low performance but consume much less energy
    • Parallelism provides the desired performance at much lower energy cost
  – Much less data movement
    • For performance and energy reduction
  – More specialized hardware
    • Dramatic benefits in energy efficiency
  – New ISAs and programming paradigms
    • Oriented to "intelligence"-related tasks rather than numerical algebra
24
Example: Brain-Inspired Computing
• The human brain is very good at some of these intelligence-related tasks
  – E.g. object recognition
• It meets all the key features described above
  – Composed of many simple units
  – Highly parallel
  – No centralized memory → only local data movements
  – With a very different programming paradigm: learning
[Image: neural tissue micrograph, 100 µm scale bar]
A Case Study: Automatic Speech Recognition
26
Large Vocabulary Continuous Speech Recognition
• A hard task
  – Word boundaries are not known in advance
  – Co-articulatory effects are very strong
  – Real-time requirements
  – Low-power constraints when carried out on mobile devices
• Main approach
  – Hybrid scheme: Hidden Markov Model + Artificial Neural Network
27
Modeling Speech
• Acoustic model
  – An utterance consists of a sequence of units of speech
  – Context-dependent phones are the units used by most systems
    • About 50 phones in English
    • A phone does not always sound the same, due to co-articulatory effects
    • A triphone is a phone observed in the context of a preceding and a succeeding phone
  – Each triphone is modeled by a small HMM (5-state in Sphinx II)
28
Modeling Speech
• Acoustic model (cont’d)
– On the order of 1M or more states
• Training data may be insufficient or too costly
• States are often clustered (tied) into equivalence classes called senones
– Word HMMs are built by concatenating the HMMs of individual triphones
– There may be more than one model for a given word, to account for alternative pronunciations
– Continuous speech is modeled by adding null transitions from the final state of every word to the initial state of all words in the vocabulary
• Cross-word transition probabilities are given by the language model (e.g. trigram)
29
Modeling Speech
• Language model (aka grammar)
  – Helps to select the most likely word sequence from the alternative hypotheses produced during the search process
  – Example
    • Trigram: word triples and their probability of occurrence
  – Backoff mechanism (example for a trigram grammar)
    • Only the most frequent trigrams are included
    • If the desired trigram is not found, one falls back to bigram or unigram probabilities
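A minimal sketch of that backoff lookup. The probabilities are toy values, and real backoff models also multiply in a backoff weight at each fall-back, which is omitted here for clarity:

```python
# Toy trigram language model with backoff (illustrative probabilities).
trigrams = {("there", "is", "a"): 0.4}
bigrams  = {("is", "a"): 0.2}
unigrams = {"a": 0.05}

def lm_prob(w1: str, w2: str, w3: str) -> float:
    """Return P(w3 | w1 w2), backing off to bigram, then unigram."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        return bigrams[(w2, w3)]
    return unigrams.get(w3, 1e-10)  # small floor for unseen words

print(lm_prob("there", "is", "a"))  # trigram hit: 0.4
print(lm_prob("table", "is", "a"))  # falls back to the bigram: 0.2
print(lm_prob("on", "the", "a"))    # falls back to the unigram: 0.05
```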
[Diagram: trigram language model as a WFST over the words THERE, IS, ON, TABLE]
30
Offline Composition
• In the language model, every word transition is replaced by its corresponding acoustic model
31
Offline Composition
• Resulting in a huge WFST (e.g. 22M states, 1.1 GB for Kaldi Tedlium)
[Diagram: fully-composed WFST; every word transition of the language model is expanded into the HMM states of its acoustic model, e.g. TABLE into states S1/ε, S2/ε, S3/TABLE, S4/ε over the phones t, ei, b]
33
Viterbi Beam Search
• Evaluates all transitions between the previous frame and the current one
• For every state it computes the probability corresponding to the best state sequence from time 0 to time t
• Each state has a single best predecessor, which is recorded
• The best state sequence is determined by starting at the final state and going backwards, following the best predecessor at each point
• Beam search heuristic to reduce the search space
  – At each time t, the state with maximum probability is found: Pmax(t)
  – All states with probabilities lower than Pmax(t) × B are not considered for computations at time t+1
  – B is an appropriately chosen threshold lower than 1 that determines the width of the "beam"
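The beam-search step above can be sketched as follows. The graph and probabilities are toy values; a real decoder works with log-probabilities to avoid underflow:

```python
# One frame of Viterbi beam search over a toy graph. `transitions[s]` lists
# (successor, transition probability); `obs_prob` holds the acoustic
# likelihood of each successor for the current frame.
def viterbi_beam_step(active, transitions, obs_prob, beam=0.01):
    scores, backptr = {}, {}
    for state, prob in active.items():
        for succ, t_prob in transitions.get(state, []):
            p = prob * t_prob * obs_prob.get(succ, 0.0)
            if p > scores.get(succ, 0.0):
                scores[succ] = p        # best path probability so far
                backptr[succ] = state   # single best predecessor, recorded
    if not scores:
        return {}, {}
    p_max = max(scores.values())
    # Beam pruning: states below Pmax(t) * B are dropped for time t+1.
    kept = {s: p for s, p in scores.items() if p >= p_max * beam}
    return kept, {s: backptr[s] for s in kept}

trans = {"s0": [("s1", 0.6), ("s2", 0.4)]}
obs = {"s1": 0.9, "s2": 0.001}
active, back = viterbi_beam_step({"s0": 1.0}, trans, obs, beam=0.01)
print(active)  # s2 scores 0.0004, outside the beam of s1 (0.54): pruned
```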
34
Putting It All Together
[Diagram: ASR pipeline; sound signal → feature extraction → likelihood computation → graph search on a weighted finite-state transducer → sentence. Example WFST shown for the words ONE, TWO, THREE with their phone sequences (w ah n, t uw, th r iy)]
35
Accelerated ASR System
[Diagram: accelerated ASR system; the CPU performs feature extraction and sends audio frames to a DNN accelerator, whose acoustic scores feed a Viterbi accelerator that produces the word lattice; all components share the system memory]
36
UNFOLD: On-The-Fly Composition [MICRO 2017]
[Diagram: on-the-fly composition; the trigram language-model WFST (THERE, IS, ON, TABLE) is kept separate from the small triphone acoustic-model WFSTs, and search states are pairs of (acoustic state, language state) composed dynamically during decoding instead of stored as a fully-composed network]
37
Viterbi Accelerator
[Results chart callouts: 111× speedup, 22× speedup, 48× reduction, 628× reduction]
38
Memory Requirements
• Huge reduction compared with the fully-composed approach
  – 31× reduction in storage
  – 70% reduction in memory bandwidth
[Chart: total WFST size (MB) for Kaldi (Tedlium, Librispeech) and EESEN (Voxforge, Tedlium); fully-composed (largest above 1.1 GB) vs. on-the-fly composition with compression]
39
DNN Pruning [ISCA 2018]
• DNN pruning has recently become popular
  – DNNs are usually oversized, so many neurons and connections can be removed without impacting accuracy
  – Several recent works proposed different heuristics with high effectiveness (>50% pruning)
  – Large reduction in computations, energy and storage requirements
• However: how is accuracy measured?
  – Pruning studies normally use Top-1 or Top-5
  – This may not be adequate in some use cases of DNNs, such as ASR
40
Our Observation: DNN Pruning Reduces Confidence
[Chart: acoustic scores for one particular frame; the Top-1 class is the same at every pruning level, but its likelihood is reduced drastically]
41
The Consequences: Less Confidence Requires Evaluating More Hypotheses
42
The Idea [ISCA 2018]
• Keep the N-best hypotheses, regardless of confidence
• The problem: computing the N-best is costly
  – It requires sorting a huge number of hypotheses (20K on average, up to 300K in Kaldi-LibriSpeech)
• Our proposal: approximate the N-best
  – Using a K-way set-associative hash table
  – Keeping the K best for each entry
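One way to sketch the K-way set-associative idea in software; the table sizes and hash function are illustrative assumptions, and the paper describes a hardware table, not this Python structure:

```python
# Approximate N-best via a K-way set-associative table: each hypothesis hashes
# to one set, and only the K best-scoring entries of each set survive, so no
# global sort over all hypotheses is ever needed.
class KBestTable:
    def __init__(self, num_sets: int = 8, k_ways: int = 2):
        self.k = k_ways
        self.sets = [[] for _ in range(num_sets)]

    def insert(self, hyp_id: int, score: float) -> None:
        s = self.sets[hash(hyp_id) % len(self.sets)]
        s.append((score, hyp_id))
        s.sort(reverse=True)   # at most K+1 entries, so this stays cheap
        del s[self.k:]         # evict the worst entry on overflow

    def survivors(self) -> set:
        return {h for s in self.sets for _, h in s}

table = KBestTable(num_sets=2, k_ways=2)
for i in range(10):
    table.insert(i, float(i))     # score grows with id in this toy example
print(sorted(table.survivors()))  # [6, 7, 8, 9]: approximates the 4-best
```

In general the survivors only approximate the true N-best, since a set may fill up with hypotheses that would all have made the global top N.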
43
Performance and Energy Consumption
• 5.6× speedup
• 5.2× energy savings
44
Improving the DNN Through Computation Reuse
• Observation: Consecutive inputs are quite similar
45
Our Proposal: Computation Reuse [ISCA 2018]
• Reuse the previous output
• Make adjustments to take into account input variations
  – If the number of different inputs is small, the adjustments are cheaper than computing the output from scratch
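A sketch of the reuse idea for a fully-connected layer: update the previous output using only the inputs that changed. The layer sizes and inputs are illustrative, and a mathematically equivalent delta update stands in for the accelerator's mechanism:

```python
import numpy as np

# Computation reuse for a linear layer: since W @ x_new equals
# W @ x_prev + W @ (x_new - x_prev), only the columns of W whose input
# changed need to be touched when consecutive frames are similar.
def reuse_forward(W, x_prev, y_prev, x_new):
    """Return (W @ x_new, number of columns updated), via a delta update."""
    delta = x_new - x_prev
    changed = np.nonzero(delta)[0]   # indices whose input value differs
    return y_prev + W[:, changed] @ delta[changed], changed.size

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x0 = rng.standard_normal(8)
y0 = W @ x0                  # full computation for the first frame
x1 = x0.copy()
x1[3] += 0.5                 # only 1 of the 8 inputs changes in the next frame
y1, n_cols = reuse_forward(W, x0, y0, x1)
print(n_cols, np.allclose(y1, W @ x1))  # 1 True
```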
46
Results
• Input similarity: 45%
• Computations saved: 53%
• Speedup: 1.9×
• Energy reduction: 49%
47
Summary
• Next revolution in computing
– A broad variety of intelligent devices
– Ubiquitous
– Applications very different from typical number crunching
– Dramatic improvements in energy efficiency are required
• Call for domain-specific architectures
– Massive parallelism
– Reduction in data movement
– More specialized hardware
– New programming paradigms
• Case study: UNFOLD, an efficient speech recognizer for mobile systems
– 550x real-time (1.8 ms per second of speech)
– 1.4 mW average power (1.4 mJ per second of speech)
– 18 mm2 (11 for Viterbi + 7 for DNN)
– Less than 30 MB of DRAM memory