
  • AI + eFPGA®  ·  InferX™ X1 Edge Inference Coprocessor
    Product Overview
    Datacenter Acceleration at Edge Power and Price

    Features / Benefits

    Dynamic Winograd weight transformation; INT8 convolutions performed with INT12 accumulate
    • 2.25x acceleration for supported convolutions with no precision loss (see the worked arithmetic after this list)

    Multilayer parallelism (deep fusion)
    • Inference activations can be kept local
    • Higher compute utilization

    Background loading of weights during computation
    • Higher hardware utilization

    GPIO x32 connectors
    • Multichip, scalable solutions possible

    Single- and double-byte datatypes
    • INT8 datapath for low latency
    • BF16 datapath for high precision

    Architecture: 2x2 tile of 64 64-wide 1D systolic arrays
    • Convolutional kernels map efficiently
    • Wide coverage of neural network operators

    Reconfigurable interconnect
    • 64 MAC clusters can be configured to allow unblocked dataflow
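    The 2.25x figure matches the multiply count of the standard 2-D Winograd F(2x2, 3x3) transform; treating that as the tile size is an assumption here, since the brief does not name the transform it uses. For each 2x2 output tile of a 3x3 convolution:

        \text{direct: } 2 \cdot 2 \cdot 3 \cdot 3 = 36 \text{ multiplies}, \qquad
        \text{Winograd } F(2{\times}2,\,3{\times}3): (2 + 3 - 1)^2 = 16 \text{ multiplies}, \qquad
        \frac{36}{16} = 2.25{\times}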

    Die Layout, Specifications

    Package                                  21x21 mm, 624-pin fcBGA
    Frequency (MHz)                          933 / 800 / 667 / 533
    MACs (INT8 / BF16)                       4096 / 2048
    On-Chip SRAM                             13 MB
    DRAM Interface                           x32 LPDDR4x
    Host Interfaces                          PCIe Gen3/4
    Est. Worst-Case TDP (933 MHz, YOLOv3)    13.5 W
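    As a cross-check on these figures (a reading of the brief, not an additional specification), the INT8 MAC count is consistent with the 64 MAC clusters of 64-wide 1-D systolic arrays described under Features; the BF16 count presumably follows from pairing INT8 MAC resources for the wider datatype:

        64 \text{ clusters} \times 64 \text{ MACs per cluster} = 4096 \text{ INT8 MACs}, \qquad 4096 / 2 = 2048 \text{ BF16 MACs}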

    [Die layout: nnMAX Array, NoC, L3 SRAM, LPDDR4x Interface, PCIe]

  • InferX X1 performs inference faster than existing edge inference chips. The X1 is optimized for the real-time megapixel input streams fundamental to edge applications, and operates in INT8 or BF16 precision at a batch size of 1 for minimum latency.

    At the center of the X1 is the nnMAX™ acceleration engine. The programmable logic in the nnMAX rewires internal MAC clusters to create an optimal dataflow path for each layer's convolutional kernels, with intermediate activations stored in local SRAM or passed directly to the next layer. Complementing the nnMAX inference acceleration engine, the X1 contains a PCIe x4 Gen3/4 interface to connect to a host, and an x32 DRAM interface for a single DDP LPDDR4x DRAM chip.
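    A conceptual contrast only, with invented names that do not correspond to any InferX API: the point of deep fusion is that a chain of layers executes with intermediate activations never leaving the chip, rather than spilling each activation to DRAM between layers.

        # Layer-at-a-time execution: every intermediate activation makes a
        # round trip through off-chip DRAM.
        def run_unfused(layers, frame, dram):
            x = frame
            for layer in layers:
                dram["act"] = layer(x)   # spill the activation off-chip
                x = dram["act"]          # ...and fetch it back for the next layer
            return x

        # Deep fusion: a group of layers is mapped onto the MAC clusters at once,
        # so intermediate activations stay in local SRAM or flow straight into
        # the next layer.
        def run_fused(layer_groups, frame):
            x = frame
            for group in layer_groups:
                for layer in group:
                    x = layer(x)
            return x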

    The InferX Compiler takes TensorFlow Lite / ONNX models as input and directly outputs a binary execution plan for the X1. InferX X1 is available as a standalone chip or on PCIe boards in single- and multi-chip configurations. Performance modeling and analysis via the InferX Compiler is available for evaluation now.
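    The brief does not document the InferX Compiler's command line, so the compile step below is purely hypothetical; the export step, however, uses the standard TensorFlow Lite converter API and shows how one of the accepted input formats would be produced (MobileNetV2 stands in as an arbitrary illustrative network):

        # Export a Keras model to TensorFlow Lite, one of the two input formats
        # (TFLite / ONNX) the InferX Compiler accepts.
        import tensorflow as tf

        # Illustrative network only; weights=None avoids a download here.
        # In practice you would export your own trained model.
        model = tf.keras.applications.MobileNetV2(weights=None)

        converter = tf.lite.TFLiteConverter.from_keras_model(model)
        # Optional post-training optimization; the X1 also runs BF16 natively,
        # so INT8 quantization is a choice rather than a requirement.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        tflite_model = converter.convert()

        with open("mobilenet_v2.tflite", "wb") as f:
            f.write(tflite_model)

        # Hypothetical compile step (tool name and flags are not given in this
        # brief): the compiler consumes the .tflite/.onnx file and emits a
        # binary execution plan for the X1, e.g.:
        #   inferx_compile --input mobilenet_v2.tflite --target x1 --out model.x1plan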

    October 2020 · www.flexlogix.com


    X1P1 Performance Estimates (batch = 1, no optimizations post training)

    X1 Freq. (MHz)   Model    Precision   Input Size   Throughput (fps)   Latency (ms)
    933              YOLOv3   INT8        1440         14.0               71.4
                                          608          72.7               13.8
                                          416          109.8              9.1
    800              YOLOv3   INT8        1440         12.0               83.3
                                          608          62.3               16.0
                                          416          94.1               10.6
    667              YOLOv3   INT8        1440         10.0               99.9
                                          608          52.0               19.2
                                          416          78.5               12.7
    533              YOLOv3   INT8        1440         8.0                125.0
                                          608          41.5               24.1
                                          416          62.7               15.9
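    A note on reading the table (an observation about the numbers, not an additional claim from the brief): the throughput and latency columns are reciprocals of each other, as expected for single-frame, batch = 1 operation:

        \text{latency (ms)} \approx \frac{1000}{\text{throughput (fps)}}, \qquad
        \text{e.g. } \frac{1000}{14.0\ \text{fps}} \approx 71.4\ \text{ms at 933 MHz with the 1440 input}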