
  • AI + eFPGA®  ·  InferX™ X1 Edge Inference Coprocessor
    Product Overview
    Datacenter Acceleration at Edge Power and Price

    Features / Benefits

    Dynamic Winograd weight transformation; INT8 convolutions performed with INT12 accumulate
    • 2.25x acceleration for supported convolutions with no precision loss (see the worked arithmetic after this list)

    Multilayer parallelism (deep fusion)
    • Inference activations can be kept local
    • Higher compute utilization

    Background loading of weights during computation
    • Higher hardware utilization

    GPIO x32 connectors
    • Multichip, scalable solutions possible

    Single- and double-byte datatypes
    • INT8 datapath for low latency
    • BF16 datapath for high precision

    Architecture: 2x2 tile of 64 64-wide 1D systolic arrays
    • Convolutional kernels map efficiently
    • Wide coverage of neural network operators

    Reconfigurable interconnect
    • 64 MAC clusters can be configured to allow unblocked dataflow
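    The 2.25x figure matches the multiply count of the standard 2-D Winograd F(2x2, 3x3) transform; treating that as the tile size is an assumption here, since the brief does not name the transform it uses. For each 2x2 output tile of a 3x3 convolution:

        \text{direct: } 2 \cdot 2 \cdot 3 \cdot 3 = 36 \text{ multiplies}, \qquad
        \text{Winograd } F(2{\times}2,\,3{\times}3): (2 + 3 - 1)^2 = 16 \text{ multiplies}, \qquad
        \frac{36}{16} = 2.25{\times}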

    Die Layout, Specifications

    Package                                  21x21 mm, 624-pin fcBGA
    Frequency (MHz)                          933 / 800 / 667 / 533
    MACs (INT8 / BF16)                       4096 / 2048
    On-Chip SRAM                             13 MB
    DRAM Interface                           x32 LPDDR4x
    Host Interfaces                          PCIe Gen3/4
    Est. Worst-Case TDP (933 MHz, YOLOv3)    13.5 W
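    As a cross-check on these figures (a reading of the brief, not an additional specification), the INT8 MAC count is consistent with the 64 MAC clusters of 64-wide 1-D systolic arrays described under Features; the BF16 count presumably follows from pairing INT8 MAC resources for the wider datatype:

        64 \text{ clusters} \times 64 \text{ MACs per cluster} = 4096 \text{ INT8 MACs}, \qquad 4096 / 2 = 2048 \text{ BF16 MACs}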

    [Die layout: nnMAX Array, NoC, L3 SRAM, LPDDR4x Interface, PCIe]

  • InferX X1 performs inference faster than existing edge inference chips. The X1 is optimized for the real-time megapixel input streams fundamental to edge applications, and operates in INT8 or BF16 precision at a batch size of 1 for minimum latency.

    At the center of the X1 is the nnMAX™ acceleration engine. The programmable logic in the nnMAX rewires internal MAC clusters to create an optimal dataflow path for each layer's convolutional kernels, with intermediate activations stored in local SRAM or passed directly to the next layer. Complementing the nnMAX inference acceleration engine, the X1 contains a PCIe x4 Gen3/4 interface to connect to a host, and an x32 DRAM interface for a single DDP LPDDR4x DRAM chip.
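    A conceptual contrast only, with invented names that do not correspond to any InferX API: the point of deep fusion is that a chain of layers executes with intermediate activations never leaving the chip, rather than spilling each activation to DRAM between layers.

        # Layer-at-a-time execution: every intermediate activation makes a
        # round trip through off-chip DRAM.
        def run_unfused(layers, frame, dram):
            x = frame
            for layer in layers:
                dram["act"] = layer(x)   # spill the activation off-chip
                x = dram["act"]          # ...and fetch it back for the next layer
            return x

        # Deep fusion: a group of layers is mapped onto the MAC clusters at once,
        # so intermediate activations stay in local SRAM or flow straight into
        # the next layer.
        def run_fused(layer_groups, frame):
            x = frame
            for group in layer_groups:
                for layer in group:
                    x = layer(x)
            return x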

    The InferX Compiler takes TensorFlow Lite / ONNX models as input and directly outputs a binary execution plan for the X1. InferX X1 is available as a standalone chip or on PCIe boards in single- and multi-chip configurations. Performance modeling and analysis via the InferX Compiler is available for evaluation now.
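    The brief does not document the InferX Compiler's command line, so the compile step below is purely hypothetical; the export step, however, uses the standard TensorFlow Lite converter API and shows how one of the accepted input formats would be produced (MobileNetV2 stands in as an arbitrary illustrative network):

        # Export a Keras model to TensorFlow Lite, one of the two input formats
        # (TFLite / ONNX) the InferX Compiler accepts.
        import tensorflow as tf

        # Illustrative network only; weights=None avoids a download here.
        # In practice you would export your own trained model.
        model = tf.keras.applications.MobileNetV2(weights=None)

        converter = tf.lite.TFLiteConverter.from_keras_model(model)
        # Optional post-training optimization; the X1 also runs BF16 natively,
        # so INT8 quantization is a choice rather than a requirement.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        tflite_model = converter.convert()

        with open("mobilenet_v2.tflite", "wb") as f:
            f.write(tflite_model)

        # Hypothetical compile step (tool name and flags are not given in this
        # brief): the compiler consumes the .tflite/.onnx file and emits a
        # binary execution plan for the X1, e.g.:
        #   inferx_compile --input mobilenet_v2.tflite --target x1 --out model.x1plan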

    October 2020 · www.flexlogix.com


    X1P1 Performance Estimates (batch = 1, no optimizations post training)

    X1 Freq. (MHz)   Model    Precision   Input Size   Throughput (fps)   Latency (ms)
    933              YOLOv3   INT8        1440         14.0               71.4
                                          608          72.7               13.8
                                          416          109.8              9.1
    800              YOLOv3   INT8        1440         12.0               83.3
                                          608          62.3               16.0
                                          416          94.1               10.6
    667              YOLOv3   INT8        1440         10.0               99.9
                                          608          52.0               19.2
                                          416          78.5               12.7
    533              YOLOv3   INT8        1440         8.0                125.0
                                          608          41.5               24.1
                                          416          62.7               15.9
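    A note on reading the table (an observation about the numbers, not an additional claim from the brief): the throughput and latency columns are reciprocals of each other, as expected for single-frame, batch = 1 operation:

        \text{latency (ms)} \approx \frac{1000}{\text{throughput (fps)}}, \qquad
        \text{e.g. } \frac{1000}{14.0\ \text{fps}} \approx 71.4\ \text{ms at 933 MHz with the 1440 input}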