
Revolutionizing the Datacenter

Join the Conversation #OpenPOWERSummit

Power-Efficient Machine Learning using FPGAs on POWER Systems

Ralph Wittig, Distinguished Engineer

Office of the CTO, Xilinx


Super Human

Humans: ~95%***

Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky, et al., 2014, http://arxiv.org/pdf/1409.0575.pdf

Page 2

Super Human

Humans: ~95%***

Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

CNNs far outperform non-AI methods

CNNs deliver super-human accuracy

Page 3

CNNs Explained

Page 4


The Computation

Page 5


The Computation

Page 6


Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights

Page 7

[Figure: Convolution: 13x13x384 input volume, 3x3x384 kernel weights, 13x13x256 output volume]
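In code, that one output pixel is just a 3x3x384 dot product. A minimal numpy sketch (shapes follow the figure; the array contents are random placeholders):

```python
import numpy as np

inp = np.random.randn(13, 13, 384)   # input feature maps (H, W, C), as in the figure
w   = np.random.randn(3, 3, 384)     # one 3x3x384 kernel = one output feature plane

# One output pixel at (y, x): multiply the 3x3x384 input sub-volume by the
# kernel element-wise and sum everything (3*3*384 = 3456 multiply-accumulates).
y, x = 5, 7                          # arbitrary interior position
sub = inp[y-1:y+2, x-1:x+2, :]       # the 3x3x384 input sub-volume
out_pixel = np.sum(sub * w)
```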

Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

Page 8


Continue along the row ...

Page 9


Before moving down to the next row

Page 10


The first output feature map is complete

Page 11


Move on to the next output feature map by switching weights, and repeat

Page 12


The pattern repeats as before: same input sub-volumes, different weights

Page 13


Complete the second output feature map plane

Page 14


Finally, after all 256 weight sets have been used, the full output volume is complete

Page 15

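Putting the whole walkthrough together, the layer is a loop nest: over the 256 output planes (weight sets), over the 13x13 output positions, with a 3x3x384 dot product at each step. A minimal numpy sketch, assuming unit stride and an input already zero-padded by one pixel on each side:

```python
import numpy as np

inp     = np.random.randn(15, 15, 384)        # 13x13x384 input, zero-padded to 15x15
weights = np.random.randn(256, 3, 3, 384)     # 256 weight sets, one per output plane
out     = np.zeros((13, 13, 256))

for oc in range(256):                         # switch weights for each output feature map
    for yy in range(13):                      # move down the rows...
        for xx in range(13):                  # ...and along each row
            sub = inp[yy:yy+3, xx:xx+3, :]    # overlapping 3x3x384 input sub-volume
            out[yy, xx, oc] = np.sum(sub * weights[oc])
```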

Fully Connected Layers

Page 16


Fully Connected Layers

[Figure: Fully-connected layers fc6 -> fc7 -> fc8: activations a_{0,0} ... a_{0,4095} (fc6) feed a_{1,0} ... a_{1,4095} (fc7), which feed a_{2,0} ... a_{2,999} (fc8), through weight sets w_{0,i,j} and w_{1,i,j}]

a_{1,0} = f( \sum_{i=0}^{4095} a_{0,i} \cdot w_{0,i,0} )

a_{2,999} = f( \sum_{i=0}^{4095} a_{1,i} \cdot w_{1,i,999} )
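Each fully-connected layer is a matrix-vector product followed by the non-linearity f. A minimal numpy sketch using the dimensions in the figure (4096 fc6/fc7 activations, 1000 fc8 outputs); ReLU stands in for f here:

```python
import numpy as np

def fc_layer(a_in, W, f=lambda z: np.maximum(z, 0.0)):
    """a_out[j] = f( sum_i a_in[i] * W[i, j] ): every input feeds every output."""
    return f(a_in @ W)

a0 = np.random.randn(4096)            # fc6 activations a_{0,i}
W0 = np.random.randn(4096, 4096)      # fc6 -> fc7 weights w_{0,i,j} (~16.8M parameters)
W1 = np.random.randn(4096, 1000)      # fc7 -> fc8 weights w_{1,i,j}

a1 = fc_layer(a0, W0)                 # fc7 activations a_{1,j}
a2 = fc_layer(a1, W1, f=lambda z: z)  # fc8 outputs a_{2,k} (softmax typically follows)
```

Because each of those weights is used only once per image at batch size 1, the FC layers dominate memory traffic, while the small, heavily reused CONV kernels dominate compute, as the next slide shows.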

Page 17


Compute: dominated by convolution (CONV) layers

[Charts: Compute (GOPs per layer) and Memory Access (G reads per layer) for CaffeNet, ZF, VGG11, VGG16, and VGG19, broken out by layer: CONV1-CONV5 and FC6-FC8]

Source: Yu Wang, Tsinghua University, Feb 2016

CNN Properties

Memory BW: dominated by fully-connected (FC) layers

Page 18


Humans vs Machines

Humans are six orders of magnitude more efficient

*IBM Watson, ca 2012

Source: Yu Wang, Tsinghua University, Feb 2016

Page 19

Cost of Computation

Source: William Dally, “High Performance Hardware for Machine Learning”, Cadence ENN Summit, 2/9/2016

Page 20


Cost of Computation

Stay in on-chip memory (1/100 x power)

Use Smaller Multipliers (8 bits vs 32 bits: 1/16 x power)

Fixed-Point vs Float (don’t waste bits on dynamic range)
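Combining just the two ratios above gives a feel for the payoff. A toy model in Python, using normalized energy units rather than real pJ figures (the baseline costs are illustrative assumptions, not measured values):

```python
# Relative energy per multiply-accumulate, in normalized units (not real pJ numbers).
MULT_32BIT     = 1.0          # baseline multiplier energy
MULT_8BIT      = 1.0 / 16     # "8 bits vs 32 bits: 1/16 x power"
OFFCHIP_ACCESS = 1.0          # baseline: operand fetched from external DRAM
ONCHIP_ACCESS  = 1.0 / 100    # "stay in on-chip memory (1/100 x power)"

def mac_energy(mult, mem):
    # one multiply plus one operand fetch; ignores adds, reuse, and control overhead
    return mult + mem

baseline  = mac_energy(MULT_32BIT, OFFCHIP_ACCESS)        # 2.0
optimized = mac_energy(MULT_8BIT, ONCHIP_ACCESS)          # ~0.07
print(f"relative savings: {baseline / optimized:.0f}x")   # ~28x in this toy model
```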

Source: William Dally, “High Performance Hardware for Machine Learning”, Cadence ENN Summit, 2/9/2016

Page 21


Improving Machine Efficiency

Model Pruning

Right-Sizing Precision

Custom CNN Processor Architecture

Page 22


Pruning Elements, Retrain to Recover Accuracy

Train Connectivity

Prune Connections

Train Weights

[Chart: Accuracy Loss (+0.5% to -4.5%) vs. Parameters Pruned Away (40% to 100%), comparing L1 and L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]

Remove Low-Contribution Weights (Synapses)

Retrain Remaining Weights

Source: Han, et al., “Learning both Weights and Connections for Efficient Neural Networks”, NIPS 2015, http://arxiv.org/pdf/1506.02626v3.pdf
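A minimal sketch of the prune-and-retrain loop described above (magnitude thresholding plus masked updates). The pruning fractions and the gradient stand-in are placeholders, not the exact schedule from Han et al.:

```python
import numpy as np

def prune_low_contribution(W, fraction):
    """Zero the smallest-magnitude weights (synapses) and return a mask of survivors."""
    threshold = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

def retrain_step(W, mask, grad, lr=1e-3):
    """Retrain remaining weights: pruned connections are held at zero."""
    return (W - lr * grad) * mask

W = np.random.randn(4096, 4096)                 # e.g. one FC layer's weights
for fraction in (0.5, 0.7, 0.9):                # iterative prune and retrain
    W, mask = prune_low_contribution(W, fraction)
    for _ in range(100):                        # placeholder retraining loop
        grad = np.random.randn(*W.shape)        # stand-in for a real backprop gradient
        W = retrain_step(W, mask, grad)
```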

Page 23


Pruning Results: AlexNet

Source: Han, et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, http://arxiv.org/pdf/1510.00149.pdf

9x Reduction In #Weights

Most Reduction In FC Layers

Page 24


Pruning Results: AlexNet

< 0.1% Accuracy Loss

Source: Han, et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, http://arxiv.org/pdf/1510.00149.pdf

Page 25

Inference with Integer Quantization

Page 26


Right-Sizing Precision

Dynamic: Variable Format Fixed-Point (Per Layer)

< 1% Accuracy Loss

Network: VGG16

                    Single-float   16/16    16/8     8/8           8 / 8-or-4
Data bits           float          16       16       8             8
Weight bits         float          16       8        8             8 or 4
Data precision      N/A            2^-2     2^-2     2^-5 / 2^-1   dynamic
Weight precision    N/A            2^-15    2^-7     2^-7          dynamic
Top-1 accuracy      68.1%          68.0%    53.0%    28.2%         67.0%
Top-5 accuracy      88.0%          87.9%    76.6%    49.7%         87.6%

Source: Yu Wang, Tsinghua University, Feb 2016

Page 27
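The "dynamic" column corresponds to per-layer fixed point: each layer gets its own fraction width so that 8 bits just cover that layer's value range. A simplified sketch of one way to pick the format (the rounding and range rules here are illustrative, not the exact scheme behind the table):

```python
import numpy as np

def quantize_dynamic_fixed(x, total_bits=8):
    """Quantize a tensor to signed fixed point, choosing fraction bits per layer."""
    max_mag   = np.max(np.abs(x)) + 1e-12
    int_bits  = max(0, int(np.ceil(np.log2(max_mag))) + 1)  # bits for sign + integer part
    frac_bits = total_bits - int_bits                       # precision is 2**-frac_bits
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale, frac_bits

conv_w = np.random.randn(3, 3, 384) * 0.05     # small conv weights -> many fraction bits
wq, fb = quantize_dynamic_fixed(conv_w)
print(f"fraction bits: {fb}, max error: {np.max(np.abs(conv_w - wq)):.4f}")
```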


Right-Sizing Precision

Fixed-Point Sufficient For Deployment (INT16, INT8)

No Significant Loss in Accuracy (< 1%)

>10x Energy Efficiency OPs/J (INT8 vs FP32)

4x Memory Energy Efficiency Tx/J (INT8 vs FP32)

Page 28


Improving Machine Efficiency

CNN Model -> (Model pruning) -> Pruned Floating-Point Model -> (Data/weight quantization) -> Pruned Fixed-Point Model -> (Compilation) -> Instructions -> (Run) -> FPGA-Based Neural Network Processor

Modified from: Yu Wang, Tsinghua University, Feb 2016

Page 29


Xilinx Kintex® UltraScale™ KU115 (20nm)

5,520 DSP cores, up to 500 MHz

5.5 TOPS INT16 (peak)

4 GB DDR4-2400 & 38 GB/s

55 W TDP & 100 GOPS/W

Single-Slot, Low-Profile Form Factor

OpenPOWER CAPI, Alpha Data ADM-PCIE-8K5

Page 30


FPGA Architecture

[Figure: 2D array of CLB, DSP, and RAM tiles]

2D Array Architecture (scales with Moore’s Law)

Memory Proximate Computing (Minimize Data Moves)

Broadcast Capable Interconnect (Data Sharing/Reuse)

Page 31


FPGA Arithmetic & Memory Resources

[Figure: DSP datapath: weights Wij and data Dj feed a 16-bit multiplier and a 48-bit accumulator that produces the output Oi; on-chip RAMs provide custom-width storage]

Native 16-bit multiplier (or reduced-power 8-bit)

On-Chip RAMs store INT4, INT8, INT16, INT32, FP16, FP32, ...

Custom Quantization Formatting (Qm.n, e.g. Q8.8, Q2.14)

Custom-Width Memory
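Qm.n denotes a fixed-point format with m integer bits (sign included here, one common convention) and n fraction bits, so Q8.8 and Q2.14 are both 16-bit values with different range/precision trade-offs. A small sketch of converting to and from such formats:

```python
import numpy as np

def to_qmn(x, m, n):
    """Quantize x to Qm.n: m integer bits (incl. sign) + n fraction bits, stored as an int."""
    scale = 1 << n
    lo, hi = -(1 << (m + n - 1)), (1 << (m + n - 1)) - 1
    return int(np.clip(round(x * scale), lo, hi))

def from_qmn(q, n):
    return q / float(1 << n)

w = 1.37
q214 = to_qmn(w, 2, 14)    # Q2.14: range roughly +/-2, resolution 2**-14
q88  = to_qmn(w, 8, 8)     # Q8.8:  range roughly +/-128, resolution 2**-8
print(from_qmn(q214, 14), from_qmn(q88, 8))    # ~1.3700 vs ~1.3711
```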

Page 32


Convolver Unit

[Figure: Convolver unit: data and weight buffers feed 9 data inputs and 9 weight inputs, through MUXes and delay lines (n and m delays), into a 3x3 multiplier array and adder tree that produce the output data]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 33

Convolver Unit

Memory-Proximate Compute: 2D Parallel Memory, 2D Operator Array

INT16 datapath

Serial to Parallel: Ping/Pong buffering

Serial to Parallel: Data Reuse 8/9

Source: Yu Wang, Tsinghua University, Feb 2016

Page 34
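The "Data Reuse: 8/9" note is the key property of this structure: for each new output, only one fresh data word has to be read, while the other eight window values come from the delay lines and shift registers. A behavioral Python sketch of a 3x3 streaming convolver built that way (single channel, unit stride; a model of the behavior, not the RTL itself):

```python
import numpy as np
from collections import deque

def streaming_conv3x3(image, kernel):
    """One new pixel enters per cycle; two row-delay lines and the 3x3 window
    registers supply the other eight values (data reuse 8/9)."""
    h, w = image.shape
    row2ago = deque([0.0] * w, maxlen=w)       # delay line holding the row from 2 rows back
    row1ago = deque([0.0] * w, maxlen=w)       # delay line holding the previous row
    win = np.zeros((3, 3))                     # 3x3 window registers
    out = np.full((h, w), np.nan)              # only the interior becomes valid

    for y in range(h):
        for x in range(w):
            newest = image[y, x]                               # the single new data read
            col = np.array([row2ago[0], row1ago[0], newest])   # 2 reused + 1 new value
            row2ago.append(row1ago.popleft())                  # age the delay lines
            row1ago.append(newest)
            win = np.roll(win, -1, axis=1)                     # shift window left (reuses 6)
            win[:, 2] = col
            if y >= 2 and x >= 2:                              # window fully valid
                out[y-1, x-1] = np.sum(win * kernel)           # multipliers + adder tree
    return out

img, k = np.random.randn(13, 13), np.random.randn(3, 3)
res = streaming_conv3x3(img, k)    # interior matches a direct 3x3 correlation of img with k
```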


Processing Engine (PE)

[Figure: Processing Engine (PE): an input buffer supplies data, bias, and weights to a convolver complex; an adder tree, bias shift, and data shift feed the non-linearity (NL) and pooling stage ahead of the output buffer, with intermediate data fed back, all under a controller]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 35

Processing Engine (PE)

Memory Sharing: Broadcast Weights

Custom Quantization

Source: Yu Wang, Tsinghua University, Feb 2016

Page 36
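A behavioral sketch of what one PE does with the convolver outputs, as described above: sum across convolvers in the adder tree, add a format-aligned bias, then apply the non-linearity (NL) and pooling. The shift amount, ReLU, and the 2x2 max pool are illustrative assumptions:

```python
import numpy as np

def pe_forward(conv_outputs, bias, bias_shift=0, pool=2):
    """conv_outputs: list of equally sized partial-sum planes, one per convolver."""
    acc = np.sum(conv_outputs, axis=0)               # adder tree across convolvers
    acc = acc + bias * (2.0 ** bias_shift)           # bias shift aligns fixed-point formats
    acc = np.maximum(acc, 0.0)                       # NL: ReLU assumed here
    h, w = acc.shape
    acc = acc[:h - h % pool, :w - w % pool]          # crop so the pool size divides evenly
    return acc.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))   # max pooling

convolver_outs = [np.random.randn(13, 13) for _ in range(4)]   # 4 convolvers in the complex
pooled = pe_forward(convolver_outs, bias=0.1, bias_shift=2)    # -> 6x6 output plane
```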


Top Level

[Figure: Top level: the POWER CPU and external memory (the processing system) connect over a config bus and a data & instruction bus to the programmable logic, which contains DMA with compression, a FIFO, a controller, input and output buffers, and a computing complex of multiple PEs]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 37

Top Level

SW-Scheduled Dataflow

Decompress weights on the fly

Multiple PEs: Block-Level Parallelism

Ping-Pong Buffers: Transfers Overlap with Compute

Source: Yu Wang, Tsinghua University, Feb 2016

Page 38
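A behavioral sketch of the ping-pong buffering: while the PEs compute on one buffer, the next tile's transfer runs in the background, so data movement overlaps with compute. The tile list and the dma_load/compute callables are placeholders for the real DMA transfers and PE work:

```python
import threading

def run_tiles(tiles, dma_load, compute):
    """Double-buffered loop: the load of tile i+1 overlaps with compute on tile i."""
    buffers = {0: dma_load(tiles[0])}                  # prime the "ping" buffer
    results = []
    for i in range(len(tiles)):
        loader = None
        if i + 1 < len(tiles):
            def load_next(j=i):                        # fill the "pong" buffer in background
                buffers[(j + 1) % 2] = dma_load(tiles[j + 1])
            loader = threading.Thread(target=load_next)
            loader.start()
        results.append(compute(buffers[i % 2]))        # PEs consume the ready buffer
        if loader:
            loader.join()                              # overlapped transfer must finish first
    return results

outs = run_tiles(tiles=[0, 1, 2, 3],
                 dma_load=lambda t: [t] * 1024,        # stand-in for a DMA transfer
                 compute=lambda buf: sum(buf))         # stand-in for the PE computation
```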


FPGA Neural Net Processor

Tiled Architecture (Parallelism & Scaling)

Semi-Static Dataflow (Pre-scheduled Data Transfers)

Memory Reuse (Data Sharing across Convolvers)

Page 39


OpenPOWER CAPI

Shared Virtual Memory

System-Wide Memory Coherency

Low Latency Control Messages

[Figure: POWER8 CAPP unit connected to the CAPI PSL on the FPGA]

Peer Programming Model and Interaction Efficiency

Page 40


OpenPOWER CAPI


Power:
• Caffe, TensorFlow, etc.
• Load CNN Model
• Call AuvizDNN Library

Xilinx FPGA:
• AuvizDNN Kernel
• Scalable & Fully Parameterized
• Plug-and-Play Library

Page 41


OpenPOWER CAPI


14 Images/s/W (AlexNet)

Batch Size 1

Low Profile TDP

Page 42


Takeaways

FPGA: Ideal Dataflow CNN Processor

POWER/CAPI: Elevates Accelerators As Peers to CPUs

FPGA CNN Libraries

Page 43

4/11/2016

Thank You!
