Intelligent Design Space Exploration for High-Level
and System Synthesis
AIDArc 2020
Antonino Tumeo
Outline
• Synthesis of accelerators for irregular applications and data analytics
• Speeding up design space exploration in high-level and system synthesis with bio-inspired heuristics
• Overview of SODALITE
• Opportunities for artificial intelligence in SODALITE
Irregular Applications Characteristics
• Unpredictable, fine-grained data accesses
• Poor locality
• Pointer or linked-list based data structures
  • Graphs & sparse matrices, unbalanced trees, unstructured grids
• Difficult to partition in a balanced way
• Inherent parallelism (for each element)
• High synchronization intensity
• In general, memory-bound
  • High memory parallelism, but many small memory operations in unrelated locations
  • The key problem is actual bandwidth utilization
• Prototypical irregular kernels: graph algorithms
  • Data analysts do not only want to compute metrics on graphs, but also, and foremost, to query graph databases (e.g., to find interesting patterns)
Application-specific Accelerators
• As Moore’s law slows down, application-specific accelerators appear to be the main approach to keep increasing efficiency
• At one end, sea of application-specific accelerators
• At the other end, (re)emergence of (re)configurable designs
  • FPGAs in the cloud, FPGAs for HPC
  • Renewed interest in Coarse Grained Reconfigurable Arrays (CGRAs)
• Reconfigurable architectures, and FPGAs in particular, may have a hard time reaching the peak flop rates of ASICs
  • Can make it up in efficiency
  • Key aspect (especially for irregular applications): enable exploration around the memory interface
High-Level Synthesis
• Tries to bridge the design gap of FPGA accelerators
  • Generation of hardware description language (HDL) code starting from high-level program specifications
• Conventional High-Level Synthesis flows address:
  • Dense, regular data structures
  • Simple memory models
  • Instruction-level parallelism
  • Compute-bound kernels (Digital Signal Processing-like)
• Latest commercial tools based on OpenCL work well for regular, compute-bound workloads
  • Significant limits for nested loops, no support for atomic memory operations
Our contributions
• We have developed a set of techniques to enable HLS of Irregular Applications
  • Customizable architectural templates and related analysis and synthesis methodologies
  • Implemented in an open-source HLS research framework, PandA Bambu, available at: https://panda.dei.polimi.it
Query example
• Return the names of all persons owning at least two cars, of which at least one is an SUV
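As a rough illustration of the graph pattern this query describes, the sketch below evaluates it over a tiny in-memory graph. The data, the `owns` adjacency map, and the `car_type` labels are all invented for this example and are not taken from the talk; the per-person loop is the kind of parallel, pointer-chasing iteration the following slides target.

```python
# Toy property graph: every name and label below is hypothetical.
owns = {
    "Alice": ["car1", "car2"],
    "Bob": ["car3"],
    "Carol": ["car4", "car5"],
}
car_type = {
    "car1": "SUV", "car2": "sedan", "car3": "SUV",
    "car4": "sedan", "car5": "coupe",
}

def owners_with_two_cars_one_suv(owns, car_type):
    """Return persons owning at least two cars, at least one of which is an SUV."""
    result = []
    for person, cars in owns.items():  # each person can be checked independently
        if len(cars) >= 2 and any(car_type[c] == "SUV" for c in cars):
            result.append(person)
    return sorted(result)

matches = owners_with_two_cars_one_suv(owns, car_type)
print(matches)
```

Bob owns an SUV but only one car, and Carol owns two cars but no SUV, so only Alice matches in this toy data set.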
Source Code Example
Multithreaded architecture template
• Architectural templates: expose a set of parameters (number of accelerators, memory channels, contexts) to explore
• Effectively synthesizes parallel loop iterations with atomic memory operations
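To make the idea of exploring template parameters concrete, here is a minimal sketch that enumerates every combination of the three parameters and keeps the Pareto-optimal ones. The parameter ranges and `toy_cost_model` are assumptions made up for illustration (they are not measurements or the tool's actual estimators); in a real flow each point would be evaluated by synthesis or simulation.

```python
from itertools import product

# Hypothetical parameter ranges for the architecture template.
ACCELERATORS = [1, 2, 4, 8]
MEM_CHANNELS = [1, 2, 4]
CONTEXTS = [1, 2, 4, 8]

def toy_cost_model(acc, ch, ctx):
    """Made-up analytical model (NOT measured data): returns (area, latency)."""
    area = 10 * acc + 4 * ch + acc * ctx
    # Throughput is capped by whichever is scarcer: compute or memory.
    throughput = min(acc * ctx, 8 * ch)
    return area, 1000 / throughput

def pareto_front(points):
    """Keep configurations not dominated in (area, latency), both minimized."""
    front = []
    for cfg, obj in points:
        dominated = any(
            o[0] <= obj[0] and o[1] <= obj[1] and o != obj
            for _, o in points
        )
        if not dominated:
            front.append((cfg, obj))
    return front

space = [((a, c, x), toy_cost_model(a, c, x))
         for a, c, x in product(ACCELERATORS, MEM_CHANNELS, CONTEXTS)]
front = pareto_front(space)
print(len(space), "configurations,", len(front), "on the Pareto front")
```

Exhaustive enumeration works here only because the toy space has 48 points; the slides that follow explain why larger synthesis problems call for heuristics instead.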
Design Space Exploration
Intelligent System Design
• The previous example has shown only the parameter space of the multithreaded architecture template
• In reality, High-Level or System Synthesis needs to solve various NP-complete problems
  • High-Level Synthesis: Resource Allocation, Scheduling, Resource Binding, Interconnection…
  • System Synthesis: HW/SW partitioning, Scheduling, Mapping, Communication orchestration…
• Brute-force methods require too much time
  • The problems are also tightly correlated and can be solved in different orders
• Many Integer Linear Programming formulations
  • Still take too much time to converge
• Heuristic optimization algorithms
  • Many bio-inspired (genetic algorithms, swarm optimization)
Genetic Algorithm for High-Level Synthesis
• Genetic Algorithms (GAs) enable exploring non-convex design spaces by evolving a population of solutions
  • Mutation introduces local variations
  • Crossover allows jumping across areas of the space and escaping from local minima or maxima
  • Selection of the fittest then guides the search along the most promising areas
• We apply GAs to the High-Level Synthesis process
  • Each chromosome represents a full synthesis process
  • By considering the full synthesis process, we can explore a much larger design space than by considering each synthesis “task” alone
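The three operators named above can be sketched in a few lines. This is a generic single-objective GA under stated assumptions, not the encoding used in the cited work: the chromosome here is only a binding of `N_OPS` operations to `N_UNITS` functional units, and `cost` is a made-up proxy mixing latency and area.

```python
import random

N_OPS = 12    # hypothetical number of operations to bind
N_UNITS = 4   # hypothetical number of functional units

def cost(chrom):
    """Toy cost: latency proxy (max unit load) plus area proxy (units in use)."""
    loads = [chrom.count(u) for u in range(N_UNITS)]
    return max(loads) + 0.5 * sum(1 for load in loads if load > 0)

def tournament(pop, rng, k=3):
    """Selection: the fittest of k randomly drawn individuals."""
    return min(rng.sample(pop, k), key=cost)

def crossover(a, b, rng):
    """One-point crossover: prefix from one parent, suffix from the other."""
    cut = rng.randrange(1, N_OPS)
    return a[:cut] + b[cut:]

def mutate(chrom, rng, rate=0.1):
    """Mutation: rebind each operation with a small probability."""
    return [rng.randrange(N_UNITS) if rng.random() < rate else g for g in chrom]

def evolve(generations=40, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(N_UNITS) for _ in range(N_OPS)] for _ in range(pop_size)]
    initial_best = min(map(cost, pop))
    for _ in range(generations):
        elite = min(pop, key=cost)  # keep the best individual unchanged
        children = [
            mutate(crossover(tournament(pop, rng), tournament(pop, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        pop = [elite] + children
    return initial_best, min(pop, key=cost)

initial_best, best = evolve()
print("initial best:", initial_best, "final best:", cost(best))
```

Because the elite individual is carried over unchanged, the best cost can never get worse from one generation to the next.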
NSGA-II for synthesis example
• Non-Dominated Sorting Genetic Algorithm II
• Chromosome encoding
  • Binding of Operations to Functional Units
  • Algorithms for Scheduling, Register Allocation, Interconnection
• Mutation and crossover
  • Elitism keeps the best solutions across generations
  • The crowded-comparison operator preserves diversity in the population; it relies on a density estimate, the crowding distance
• Selection: non-dominated rank and crowding distance
  • Solutions are also ranked inside a non-dominated level: if they have the same rank, they belong to the same front and selection prefers the less crowded region
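The selection rule above (rank first, crowding distance as tie-breaker) can be sketched as follows. This is a simple quadratic-time version written for readability rather than the optimized sort from the NSGA-II paper, and the `(area, latency)` pairs are hypothetical values, both minimized.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Return a list of fronts (rank 0 first), each a sorted list of indices."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    """Density estimate: larger values mean a less crowded neighbourhood."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # keep the extremes
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist

def select(objs, k):
    """Pick k survivors: lower rank first, then larger crowding distance."""
    chosen = []
    for front in non_dominated_sort(objs):
        d = crowding_distance(objs, front)
        for i in sorted(front, key=lambda i: -d[i]):
            if len(chosen) < k:
                chosen.append(i)
    return chosen

# Hypothetical (area, latency) pairs for six candidate designs.
designs = [(1, 9), (2, 7), (3, 6.9), (5, 3), (9, 1), (6, 8)]
fronts = non_dominated_sort(designs)
survivors = select(designs, 4)
print(fronts, survivors)
```

In this toy set, design 5 is dominated by design 1 and lands in the second front. Among the five first-front designs, design 1 sits in the most crowded region, so it is the one left out when only four survive.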
[C. Pilato, A. Tumeo, G. Palermo, F. Ferrandi, P. L. Lanzi, D. Sciuto: Improving evolutionary exploration to area-time optimization of FPGA designs. J. Syst. Archit. 54(11): 1046-1057 (2008)]
Ant Colony Optimization for Scheduling and Mapping
• Ant Colony Optimization: multi-agent optimization heuristic
• Ants randomly explore different paths to the food.
• At each decision point:
• They deposit pheromone in proportion to the quality of the path (shorter paths receive more), which encourages other ants to follow the same trail. Pheromone also evaporates over time.
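A minimal sketch of the pheromone loop, assuming a made-up graph with two routes from nest to food. For brevity the decision rule here weighs edges by pheromone alone (textbook ACO usually also multiplies in a problem-specific heuristic term), and it is not the formulation used in the cited mapping and scheduling work.

```python
import random

# Hypothetical graph: edge -> length. Two routes from "nest" to "food".
EDGES = {
    ("nest", "a"): 1.0, ("a", "food"): 1.0,  # short route, total length 2
    ("nest", "b"): 2.0, ("b", "food"): 3.0,  # long route, total length 5
}

def neighbors(node):
    return [v for (u, v) in EDGES if u == node]

def walk(pheromone, rng):
    """One ant walks from nest to food, picking edges with probability
    proportional to their pheromone level."""
    node, path = "nest", []
    while node != "food":
        options = neighbors(node)
        weights = [pheromone[(node, v)] for v in options]
        nxt = rng.choices(options, weights=weights)[0]
        path.append((node, nxt))
        node = nxt
    return path

def run(iterations=50, ants=10, evaporation=0.1, seed=0):
    rng = random.Random(seed)
    pheromone = {e: 1.0 for e in EDGES}
    for _ in range(iterations):
        paths = [walk(pheromone, rng) for _ in range(ants)]
        for e in pheromone:  # evaporation on every edge
            pheromone[e] *= 1.0 - evaporation
        for path in paths:   # deposit: shorter paths add more per edge
            length = sum(EDGES[e] for e in path)
            for e in path:
                pheromone[e] += 1.0 / length
    return pheromone

pheromone = run()
print({e: round(p, 2) for e, p in pheromone.items()})
```

Since each ant on the short route deposits 0.5 per edge against 0.2 on the long one, the short route accumulates pheromone faster and attracts a growing share of the colony.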
Scheduling and Mapping Example
[F. Ferrandi, P. L. Lanzi, C. Pilato, D. Sciuto, A. Tumeo: Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems. IEEE Trans. on CAD of Integrated Circuits and Systems 29(6): 911-924 (2010)]
Bayesian Optimization for Mapping Pipelined Applications
• The Bayesian Optimization Algorithm (BOA) is a Probabilistic Model Building Genetic Algorithm (PMBGA)
  • Mutation and crossover operators are replaced by the construction and the sampling of a Bayesian network
  • Through the Bayesian network, it can find underlying sub-structures of some complex problems
• We apply BOA for the mapping of pipelined applications on a heterogeneous platform
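The "build a model, then sample it" loop can be sketched with a deliberately simplified stand-in: instead of learning a Bayesian network as BOA does, the code below fits one independent distribution per pipeline stage (a univariate estimation-of-distribution algorithm), so it cannot capture the dependencies between stages that BOA is designed to find. The `TIMES` table is invented for the example.

```python
import random

# Hypothetical execution times: TIMES[stage][pe] for 6 stages on 3 processing elements.
TIMES = [
    [4, 2, 9], [3, 6, 2], [8, 3, 3],
    [2, 5, 7], [6, 2, 4], [5, 5, 1],
]
N_STAGES, N_PES = len(TIMES), len(TIMES[0])

def bottleneck(mapping):
    """Pipeline throughput is limited by the busiest processing element."""
    load = [0] * N_PES
    for stage, pe in enumerate(mapping):
        load[pe] += TIMES[stage][pe]
    return max(load)

def sample(model, rng):
    """Draw one mapping: each stage picks a PE from its own distribution."""
    return [rng.choices(range(N_PES), weights=model[s])[0] for s in range(N_STAGES)]

def fit(selected, smoothing=0.1):
    """Re-estimate per-stage weights from the selected mappings (smoothed counts)."""
    return [
        [smoothing + sum(1 for m in selected if m[s] == pe) for pe in range(N_PES)]
        for s in range(N_STAGES)
    ]

def run(generations=30, pop_size=40, keep=10, seed=0):
    rng = random.Random(seed)
    model = [[1.0] * N_PES for _ in range(N_STAGES)]  # start uniform
    best = None
    for _ in range(generations):
        pop = [sample(model, rng) for _ in range(pop_size)]
        if best is not None:
            pop.append(best)      # keep the best mapping found so far
        pop.sort(key=bottleneck)
        best = pop[0]
        model = fit(pop[:keep])   # the model replaces mutation and crossover
    return best

best = run()
print(best, bottleneck(best))
```

Replacing `fit` and `sample` with the learning and sampling of a Bayesian network over the stages would turn this skeleton into the BOA structure described on the slide.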
BOA example
[A. Tumeo, M. Branca, L. Camerini, C. Pilato, P. L. Lanzi, F. Ferrandi, D. Sciuto: Mapping pipelined applications onto heterogeneous embedded systems: a bayesian optimization algorithm based approach. CODES+ISSS 2009: 443-452]
BOA Results
• Applied to more complex task graphs (B, C, D)
• Compared to multiobjective Simulated Annealing (MSA), Tabu Search (TSA), Genetic Algorithm (GA)
• Also a hybrid formulation where each offspring generation of BOA is followed by several iterations of SA
• Reports execution latency in clock cycles, Relative Standard Deviation, and execution time of the optimization algorithm
Multi-objective Synthesis for Real-Time Systems
• Consider a real-time application with hard and soft deadlines, described by a task graph
• We are given a set of resources that could be composed together to form a system
  • Processors, accelerators, memories, communication elements (buses or point-to-point channels)
• We want to obtain a system that is feasible (no violations of hard deadlines) and that minimizes area, buffer/memory size, and violations of soft deadlines
Multi-objective Synthesis for Real-Time Systems
Figure panels: overall flow; conversion to multi-rate task graph
Multi-objective Synthesis for Real-Time Systems
• Problem formulation
  • Resource library, communication paths, mapping, scheduling
• Optimization algorithms evaluated:
  • Multiobjective Simulated Annealing (SA)
  • Multiobjective Tabu Search (TS)
  • Niched Pareto Genetic Algorithm II (GA)
• On average, the GA is more robust and able to cover more non-dominated solutions in highly constrained problems
  • The TS performs worse than the SA with a high number of evaluations, but is comparable or better with few evaluations
  • SA obtains valuable results on problems with higher degrees of freedom
[M. Ceriani, F. Ferrandi, P. L. Lanzi, D. Sciuto, A. Tumeo: Multiprocessor systems-on-chip synthesis using multi-objective evolutionary computation. GECCO 2010: 1267-1274]
SODALITE: Software Defined Accelerators from Machine Learning Tools Environment
• SODALITE is PNNL’s project in the DARPA RTML (Real Time Machine Learning) program
  • 3 years, 2 phases of 1.5 years each
  • Coordinated with parallel NSF Program
• DARPA RTML looks at the development of a compiler that generates Verilog designs starting from high-level machine learning frameworks (e.g., PyTorch, TensorFlow, MXNet, CNTK, …)
• The designs will then be fabricated in chiplets
SODALITE overview
• Distill promising network architectures from suggested application area
  • High-Bandwidth Imaging
  • Driver to enable agile codesign approach and identification of architectural templates, but objective is generality of the synthesizer
• Synthesizer frontend lowers a High-Level Intermediate Representation (HLIR) to a Low-Level IR (LLIR)
  • Initially exploit ONNX to lower to a common HLIR
  • Explore opportunities to employ MLIR as HLIR
  • LLIR: LLVM IR
• Synthesizer middle end performs the actual synthesis
  • New dataflow template-based synthesis
  • Classical high-level synthesis path
• Design Space Exploration engine plugs into the middle end
  • Heuristic optimization algorithms, including bio-inspired
• Closed loop with chip design and evaluation
  • Provides constant feedback for synthesizer development
Artificial Intelligence in SODALITE
• SODALITE is a new generation synthesizer
  • Like the examples for high-level and system synthesis, we will use optimization algorithms to explore a multidimensional design space
  • Performance, power, accuracy, heat…
• A synthesizer is a compiler
  • A large number of compiler optimizations can significantly influence the Verilog generation process
  • Not only optimizations, but also the ability to understand computational patterns and reuse
  • Patterns may not be the conventional ones
• We also need methods to estimate the quality of the results
  • ASIC vs FPGA interconnect
  • Estimators for FPGAs are mostly based on linear regression over synthesis results; can we do better for ASICs?
Conclusion
• Synthesis techniques for graph analytics and large design space
• Overview of heuristic optimization methods for high-level and system synthesis
• Overview of SODALITE
• Possible directions for SODALITE design space exploration
• Looking to create an open-source ecosystem for synthesis and system-level design space exploration
Thank you!
• Thank you to my past and present collaborators
• Thanks to the SODALITE and SO(DA)2 team
  • Vinay Amatya, Vito Giovanni Castellana, Joseph Manzano, Marco Minutoli, Cheng Tan
• Questions?
  • [email protected]