DEIS
University of Bologna
Multi processor systems with configurable hardware acceleration
Ph.D in Electronics, Computer Science and Telecommunications
Ph.D. Student: Davide Rossi
Ph.D. Tutor: Prof. Roberto Guerrieri
Outline
Motivations
  Electronic systems requirements and issues
The Morpheus Platform
  Heterogeneous multi-core
  Reconfigurable platform
The Manyac Platform
  Homogeneous and regular multi-core platform
  Configurable and reconfigurable acceleration
Results
  Programming productivity
  Performance (area, power)
  Impact on manufacturing costs
Motivations (1)
New-generation embedded applications are pushing signal processing systems to improve:
  Performance
  Energy efficiency
  Flexibility
  Programmability
  Time to market
*source: ITRS
Motivations (2)
*source SEMATECH
*source PHILIPS
Increase of product development costs (NRE):
  Design costs
    Front-end
    Implementation
    Verification
    Testing
    Software development
  Mask costs
    Significant impact on small-volume products
Morpheus: Main Goals
Programming legacy through:
  ARM processor acting as system supervisor
Flexibility and performance gain through three heterogeneous reconfigurable processing cores:
  Fine grain fabric (Abound Logic Flexeos eFPGA)
  Medium grain fabric (STMicroelectronics DREAM)
  Coarse grain fabric (PACT XPP-III)
Programming productivity through:
  High level programming approaches for reconfigurable engines
Morpheus: Architecture
[Figure: Morpheus block diagram. Labels: ARM9, DREAM, XPP, eFPGA, PCM, DNA, Bridge, MainMem, ConfMem, External Memory Controller, AMBA (main bus), AMBA (configuration bus), NoC]
ARM core
  Standard peripheral set
3 communication domains
  Synchronization and control: main bus (AHB)
  Data transfers: 8-node 64-bit NoC (STNoC)
  Configuration: configuration bus (AHB)
Hardware services
  Predictable Configuration Manager (PCM)
  Direct Network Accesses (DNA)
4 domains with dynamic frequency scaling
Morpheus: Reconfigurable engines
Encapsulated into three independent clock islands
Local buffers act as domain crossing mechanism (DPDC memories)
PACT XPP
  Coarse grain device (16-bit)
  Streaming applications with regular computation patterns
  Programming: NML (Native Mapping Language)
DREAM
  Medium grain, computation-intensive device (4-bit)
  Iterative applications with complex addressing patterns
  Programming: Griffy-C
eFPGA
  Fine grain device (1-bit LUT)
  Applications handling bit manipulations, configurable I/O
  Programming: VHDL
Morpheus: Chip description and Measurements
XPP subsystem:
  Max frequency @ 1 V: 150 MHz
  Dynamic power: 7.5 mW/MHz
DREAM subsystem:
  Max frequency @ 1 V: 200 MHz
  Dynamic power: 2.1 mW/MHz
eFPGA subsystem:
  Max frequency @ 1 V: 100 MHz
  Dynamic power: 0.8 mW/MHz
ARM domain:
  Max frequency @ 1 V: 250 MHz
  Dynamic power: 2.4 mW/MHz
Chip:
  Technology: CMOS090GP
  Supply voltage: 1 V
  Transistor count: 97 M
  Chip area: 110 mm²
  Static power: 235 mW
  Max frequency: 250 MHz
  Peak power: 3 W
[Figure: chip layout. Labels: PACT XPP, DREAM, eFPGA, PCM, ARM, C. MEM, M. MEM]
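The per-domain figures above can be turned into a rough power estimate. The sketch below assumes dynamic power scales linearly with clock frequency (which is what a mW/MHz figure implies) and that the four listed domains plus static power are the only contributors; the slide does not itemize anything else, so the function and variable names, and the model itself, are illustrative only.

```python
# Per-domain figures quoted on the slide:
# (dynamic power in mW/MHz, max frequency in MHz at 1 V).
DOMAINS = {
    "XPP": (7.5, 150),
    "DREAM": (2.1, 200),
    "eFPGA": (0.8, 100),
    "ARM": (2.4, 250),
}
STATIC_POWER_MW = 235  # chip-level static power quoted on the slide


def dynamic_power_mw(mw_per_mhz, freq_mhz):
    """Linear model (assumption): dynamic power is proportional to frequency."""
    return mw_per_mhz * freq_mhz


def total_power_mw(freqs_mhz):
    """Static power plus the dynamic power of each domain at its chosen clock."""
    dynamic = sum(
        dynamic_power_mw(DOMAINS[name][0], freq) for name, freq in freqs_mhz.items()
    )
    return STATIC_POWER_MW + dynamic


# Every domain at its maximum frequency.
max_freqs = {name: fmax for name, (_, fmax) in DOMAINS.items()}
# Same chip with every domain clocked at half speed (frequency scaling).
half_freqs = {name: fmax / 2 for name, fmax in max_freqs.items()}

print("all domains at max frequency:", total_power_mw(max_freqs), "mW")
print("all domains at half frequency:", total_power_mw(half_freqs), "mW")
```

Under these assumptions the sum at maximum frequencies is about 2.46 W, below the 3 W peak quoted for the chip. Halving every clock halves only the dynamic part, which is why per-domain frequency scaling pays off most on the XPP, the domain with the highest mW/MHz figure.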
Manyac: Main Goals
Flexibility and programmability through:
  Multi-processor approach
Performance gain through:
  Application-specific hardware accelerators
Programming/design productivity through:
  High level programming approach based on OpenCL
  Automatic synthesis of accelerators from a high-level language (Griffy-C)
Reduction of costs through:
  Platform-based design approach
  Regular replication of identical tiles
  Regular silicon structures for implementation of accelerators
Manyac: Architecture
Regular replication of identical computational tiles + one I/O tile
Communication: ring topology NoC (STNoC)
Memory infrastructure with 3 hierarchy levels:
  Private memory
  Local memory
  Global memory
Hardware synchronization
Hardware accelerators
  Regular gate arrays
The architectural parameters are configurable at design time
Manyac: Configurable Hardware Accelerators (STMicroelectronics)
Pipelined datapaths targeting three kinds of configurable gate array:
Run-time programmable gate array
  Routing and functionalities are programmed through SRAMs
  Post-fabrication programmability
Via-programmable gate array
  Routing and functionalities are programmed through one via layer
  Customization: 1 metal layer
Metal-programmable gate array
  Functionalities are mapped on a library of metal-programmable cells
  Customization: 9 metal layers
[Figures: customizations through vias; customizations through metals]
Manyac: Programming Model
Based on OpenCL
  Sequential code executes on a host processor
  Parallel and hardware accelerated code executes on the parallel device
Two programming models
  Data parallel (homogeneous)
  Task parallel (heterogeneous)
Hardware accelerated functions are encapsulated within parallel kernels and tasks
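The two programming models can be illustrated with a small host-side emulation. This is plain Python, not OpenCL and not the Manyac toolchain: `run_data_parallel`, `run_task_parallel`, and `threshold_kernel` are hypothetical names, and a thread pool stands in for the parallel device. The sketch only shows the difference between launching one kernel over many work-items and launching different tasks side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def run_data_parallel(kernel, global_size, *buffers):
    """Data parallel (homogeneous): the same kernel runs once per work-item index."""
    with ThreadPoolExecutor() as pool:
        # list() drains the iterator so any exception raised in a work-item surfaces here.
        list(pool.map(lambda gid: kernel(gid, *buffers), range(global_size)))


def run_task_parallel(tasks):
    """Task parallel (heterogeneous): different functions run concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, *args) for fn, args in tasks]
        return [future.result() for future in futures]


def threshold_kernel(gid, src, dst, level):
    # Each work-item writes only its own output element, so no locking is needed.
    dst[gid] = 1 if src[gid] >= level else 0


# Host side: sequential set-up, then one data-parallel launch (a binarization).
pixels = [12, 200, 90, 255, 30, 128]
binary = [0] * len(pixels)
run_data_parallel(threshold_kernel, len(pixels), pixels, binary, 128)
print(binary)  # [0, 1, 0, 1, 0, 1]

# Host side: two unrelated tasks launched together.
results = run_task_parallel([(sum, (pixels,)), (max, (pixels,))])
print(results)  # [715, 255]
```

In the platform described here, the body of a kernel or task is where a hardware accelerated function would be called; the emulation leaves that part out.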
Manyac: Design environment
OpenCL compiler
  Allocates functions and variables according to OpenCL qualifiers
  Generates host and device code
TLM simulation platform
  High level exploration of architectural parameters
RTL platform
  Cycle-accurate simulation platform
  Entry point for physical implementation
Griffy environment
  Accelerator design, simulation models and implementation
Manyac: Implementation
Technology: CMOS40LP, 1.1 V
Configuration technology: metal programmable
CT area (post layout): 0.8 mm²
Metal programmable area, targeting motion detection application (post layout): 0.2 mm²
4-tile cluster area (post synthesis): ~5 mm²
Max frequency (post layout): 250 MHz (wc, 125 °C, 1.0 V)
Power consumption: 45 mW @ 250 MHz (nc, 25 °C, 1.1 V)
[Figures: computational tile area breakdown by logic entity; computational tile layout]
Results: analysis of programming productivity
Programming effort required to implement signal processing applications on different computational platforms
Objective: evaluate the programming productivity improvement due to high level approaches
Efforts are estimated according to programming language tables (*SPR) based on Function Point Analysis (FPA), extended to the VHDL language
Griffy-C and NML are treated as ASM
Reduction of design effort with respect to VHDL: 1.3x to 2x

Language   Average source statements per FP   Average productivity per staff month
C          128                                9 FP
ASM        213                                5 FP
VHDL       19                                 18 FP
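A minimal sketch of how such a table can be used, assuming the estimate goes from a source statement count to function points and then to staff months. The slide does not spell out the exact procedure, so this two-step conversion is an assumption. The statement counts in the example are hypothetical and chosen only to give round numbers; they are not measurements from the thesis.

```python
# Table from the slide: language -> (source statements per FP, FP per staff month).
LANGUAGE_TABLE = {
    "C": (128, 9),
    "ASM": (213, 5),
    "VHDL": (19, 18),
}


def function_points(statements, language):
    """Convert a source statement count into function points."""
    statements_per_fp, _ = LANGUAGE_TABLE[language]
    return statements / statements_per_fp


def effort_staff_months(statements, language):
    """Estimated effort: function points divided by productivity."""
    _, fp_per_month = LANGUAGE_TABLE[language]
    return function_points(statements, language) / fp_per_month


# Hypothetical statement counts for one kernel written in two languages.
griffy_c_effort = effort_staff_months(2130, "ASM")  # Griffy-C is treated as ASM
vhdl_effort = effort_staff_months(1368, "VHDL")

print("Griffy-C effort (staff months):", griffy_c_effort)
print("VHDL effort (staff months):", vhdl_effort)
print("VHDL / Griffy-C:", vhdl_effort / griffy_c_effort)
```

With this table a single VHDL statement costs roughly three times the effort of a single ASM-class statement (1/342 versus 1/1065 staff months), so the overall saving depends on how many more statements the higher-level description needs.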
Results: Morpheus performance
[Figures: performance (GOPS); energy efficiency (GOPS/W)]
Application fields selected for characterization:
  Image processing (edge detection, binarization, RGB2YUV)
  Video processing (motion estimation, motion compensation)
  Telecommunications (CRC, AES, Ethernet)
Performance (measured): 1.6 to 15 GOPS
Energy efficiency (measured): 2.7 to 52.9 GOPS/W
Reduction of dynamic power due to frequency scaling: 1.5x to 5.5x
Results: Manyac performance by configuration technology
Technology node: CMOS65LP
8-core platform
All figures are estimated
Std-cell based accelerators:
  Performance: 5.5 to 25 GOPS
  Energy efficiency: 24 to 113 GOPS/W
  Area efficiency: 0.6 to 3 GOPS/mm²
Metal programmable accelerators:
  Overhead is negligible
Via programmable accelerators:
  Performance overhead: 1.25x
  Energy efficiency overhead: 2.9x
  Area efficiency overhead: 4.7x
Run-time programmable accelerators:
  Performance overhead: 1.25x
  Energy efficiency overhead: 3.7x
  Area efficiency overhead: 10x
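The overhead factors can be applied to the std-cell ranges to get a feel for the other configuration technologies. The sketch assumes that an overhead of N means the std-cell figure divided by N, and that the same factor applies at both ends of each range; the slide does not confirm either point, and the dictionary keys and function name are illustrative.

```python
# Std-cell baseline ranges (low, high) quoted on the slide.
STD_CELL = {
    "performance_gops": (5.5, 25.0),
    "energy_efficiency_gops_per_w": (24.0, 113.0),
    "area_efficiency_gops_per_mm2": (0.6, 3.0),
}

# Overhead factors quoted on the slide, relative to std-cell accelerators.
OVERHEAD = {
    "via_programmable": {
        "performance_gops": 1.25,
        "energy_efficiency_gops_per_w": 2.9,
        "area_efficiency_gops_per_mm2": 4.7,
    },
    "run_time_programmable": {
        "performance_gops": 1.25,
        "energy_efficiency_gops_per_w": 3.7,
        "area_efficiency_gops_per_mm2": 10.0,
    },
}


def derate(technology):
    """Divide each std-cell range by the overhead factor (assumed interpretation)."""
    factors = OVERHEAD[technology]
    return {
        metric: (low / factors[metric], high / factors[metric])
        for metric, (low, high) in STD_CELL.items()
    }


for technology in OVERHEAD:
    print(technology)
    for metric, (low, high) in derate(technology).items():
        print(f"  {metric}: {low:.2f} to {high:.2f}")
```

Under this reading, both programmable options keep most of the performance (4.4 to 20 GOPS) and pay mainly in energy and area efficiency.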
Results: Manyac manufacturing costs by configuration technology
[Figures: manufacturing cost per configuration technology; technology node trends]
Assumptions:
  Technology node: CMOS65LP
  5 customizations (or re-spins) of the same platform
Run-time programmable and via programmable technologies are convenient only for very low market volumes
  Run-time programmable: < 5K pieces
  Via programmable: 5K to 12K pieces
Metal programmable technology is convenient for larger market volumes
Perspectives: as technology nodes scale, reconfigurable technologies are becoming even more convenient
Conclusion
Two multi-core platforms with configurable/reconfigurable acceleration have been presented:
  The Morpheus platform (heterogeneous, reconfigurable)
  The Manyac platform (homogeneous, configurable)
Improvement of design/programming productivity due to high-level approaches: 1.3x to 2x
Multi-processor systems with accelerators implemented on reconfigurable and structured ASIC technologies are able to provide high performance, while still showing some overhead in terms of power and area with respect to the traditional standard-cell based approach.
The proposed approaches provide an effective way to reduce manufacturing costs, especially for low-volume products.
Collaborations
The PhD is in collaboration with STMicroelectronics
Collaborations within 2 European projects:
  MORPHEUS (FP6)
  MODERN (ENIAC)
Publications
Book chapters:
N. Voros et al., “Dynamic System Reconfiguration in Heterogeneous Platforms”, Chapter 5: “The DREAM Digital Signal Processor”, Chapter 8: “The MORPHEUS Data Communication and Storage Infrastructure”, Springer, 2009.
Conference papers:
D. Rossi et al., “A Heterogeneous Digital Signal Processor Implementation for Dynamically Reconfigurable Computing”, Custom Integrated Circuits Conference (CICC), 2009.
D. Rossi et al., “A Multi-Core Signal Processor for Heterogeneous Reconfigurable Computing”, International Symposium on System-on-Chip, Proceedings, 2009.
F. Campi et al., “RTL-to-Layout Implementation of an Embedded Coarse Grained Architecture for Dynamically Reconfigurable Computing in Systems-on-Chip”, Proceedings, 2009.
Journal papers:
D. Rossi et al., “A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing”, IEEE Journal of Solid-State Circuits (JSSC), 2010.
D. Rossi, C. Mucci, F. Campi, S. Spolzino, L. Vanzolini, H. Sahlbach, S. Whitty, R. Ernst, W. Putzke-Röming, and R. Guerrieri, “Application Space Exploration of a Heterogeneous Run Time Configurable Digital Signal Processor”, IEEE Transactions on Very Large Scale Integration (TVLSI) Systems, 2012.
Thanks for your attention