Direct Communication and Synchronization Mechanisms in Chip Multiprocessors
Stamatis Kavadias
Computer Science Department, University of Crete (UOC-CSD)
and Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH-ICS)



DESCRIPTION

PhD Thesis Presentation

The physical constraints of transistor integration have made chip multiprocessors (CMPs) a necessity, and increasing the number of cores (CPUs) remains, as yet, the best approach to exploiting more transistors. Already, the feasible number of cores per chip is growing beyond our ability to utilize them for general purposes. Although many important application domains can easily benefit from the use of more cores, scaling single-application performance with multiprocessing in general presents a tough milestone for computer science.

The use of per-core on-chip memories, managed in software with RDMA and adopted in the IBM Cell processor, has challenged the mainstream approach of using coherent caches for the on-chip memory hierarchy of CMPs. The two architectures have largely different implications for software and divide researchers over the most suitable approach to multicore exploitation. We demonstrate the combination of the two approaches, with cache integration of a network interface (NI) for explicit interprocessor communication, and flexible dynamic allocation of on-chip memory to hardware-managed (cache) and software-managed parts. The network interface architecture combines messages and RDMA-based transfers with remote load-store access to the software-managed memories, and allows multipath routing in the processor interconnection network. We propose the technique of event responses, which efficiently exploits the normal cache access flow for network interface functions, and prototype our combined approach in an FPGA-based multicore system, which shows reasonable logic overhead (less than 20%) in cache datapaths and controllers for the basic NI functionality.

We also design and implement synchronization mechanisms in the network interface (counters and queues) that take advantage of event responses and exploit the cache tag and data arrays for synchronization state. We propose novel queues that efficiently support multiple readers, providing hardware lock and job-dispatching services, and counters that enable selective fences for explicit transfers and can be synthesized to implement barriers in the memory system.

Evaluation of the cache-integrated NI on the hardware prototype demonstrates the flexibility of exploiting both cacheable and explicitly-managed data, and potential advantages of NI transfer mechanism alternatives. Simulations of up to 128-core CMPs show that our synchronization primitives provide significant benefits for contended locks and barriers, and can improve task scheduling efficiency in the Cilk run-time system, for executions within the scalability limits of our benchmarks.
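To make the intended usage model concrete, here is a minimal, hypothetical sketch of how software might launch an RDMA write between scratchpad regions and then use an NI counter as a selective fence for that transfer's completion. The descriptor layout (rdma_cmd_t), the counter convention, and all helper names are illustrative assumptions, not the thesis API.

/* Hypothetical usage sketch (not the thesis API): an RDMA write from a
 * local scratchpad buffer to a remote one, with an NI counter acting as a
 * selective fence for that transfer's completion. */

#include <stdint.h>

/* Assumed memory-mapped NI command descriptor, placed in local scratchpad. */
typedef struct {
    volatile uintptr_t src;      /* local source address                      */
    volatile uintptr_t dst;      /* remote destination address                */
    volatile uint32_t  size;     /* transfer size in bytes                    */
    volatile uintptr_t ack_cnt;  /* address of counter notified on completion */
    volatile uint32_t  go;       /* writing 1 launches the transfer           */
} rdma_cmd_t;

/* Assumed counter semantics: software pre-loads the expected number of
 * completions; the NI decrements it once per acknowledged transfer. */
static inline void counter_init(volatile int32_t *cnt, int32_t n) { *cnt = n; }
static inline void counter_wait_zero(volatile int32_t *cnt)
{
    while (*cnt > 0)
        ;  /* selective fence: waits only for transfers bound to this counter */
}

void copy_tile(rdma_cmd_t *cmd, volatile int32_t *cnt,
               uintptr_t local_buf, uintptr_t remote_buf, uint32_t bytes)
{
    counter_init(cnt, 1);            /* expect one completion notification   */

    cmd->src     = local_buf;
    cmd->dst     = remote_buf;
    cmd->size    = bytes;
    cmd->ack_cnt = (uintptr_t)cnt;
    cmd->go      = 1;                /* NI picks up the descriptor and starts */

    /* ... independent computation can overlap with the transfer here ... */

    counter_wait_zero(cnt);          /* ordered only w.r.t. this transfer,
                                        unlike a full memory fence           */
}

The point of the sketch is the selectivity: waiting on one counter orders the caller only with respect to the transfers bound to that counter, rather than stalling on all outstanding explicit transfers.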


Motivation and Approach

- CMP architectures are becoming more distributed (manycore)
- Utilize a scalable NoC (>> few tens of cores)
- Scalable communication mechanisms are required to exploit the chip
- Locality will be very important
  - Low-latency communication: exploit locality effectively
  - Fast synchronization: improve the efficiency of fine-grain computation
- This study advocates:
  - Use of on-chip scratchpad memory for communication and computation
  - Exploitation of direct communication and synchronization mechanisms
- Aim: scalable mechanisms and implementation
  - Exploit increased (replicated) resources
  - Reduce overheads with on-chip bulk transfers
  - Enable efficient communication supporting NoC optimizations

Proposed Architectural Enhancements & Contributions

- Modify a contemporary CMP architecture to support:
  - Shared address space extension for direct scratchpad access
  - Cache integration of a network interface (NI)
  - Direct communication mechanisms for RDMA & messages
  - Direct synchronization mechanisms (counters & queues); see the sketch after this list
- The contributions of this thesis are:
  - Design of a CMP network interface integrated at the top memory hierarchy levels
  - Introduction of the event responses technique for cache integration of NI communication & synchronization mechanisms
  - Design of direct synchronization mechanisms using existing cache resources
  - Refinement of the HW design to reduce gates by 19.3% (
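The multiple-reader queue mentioned above is easiest to read through its software-visible semantics: each enqueued item is delivered to exactly one of possibly many waiting readers, so a queue holding work descriptors acts as a job dispatcher, and a queue holding a single token acts as a lock. The pthread-based emulation below only illustrates those semantics under assumed names; in the proposed design the queue lives in the cache-integrated NI and keeps its state in the cache tag and data arrays.

/* Software emulation of multiple-reader queue semantics (illustrative only;
 * the pthread mutex stands in for the atomicity the NI provides in hardware).
 * For brevity there is no overflow check on enqueue. */

#include <pthread.h>
#include <stdint.h>

#define MRQ_SLOTS 64               /* assumed queue capacity */

typedef struct {
    uint32_t        buf[MRQ_SLOTS];
    unsigned        head, tail;    /* ring-buffer indices                 */
    pthread_mutex_t m;             /* stands in for NI-enforced atomicity */
    pthread_cond_t  nonempty;
} mr_queue_t;

void mrq_init(mr_queue_t *q)
{
    q->head = q->tail = 0;
    pthread_mutex_init(&q->m, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Enqueue a work descriptor -- or a single lock token. */
void mrq_enqueue(mr_queue_t *q, uint32_t item)
{
    pthread_mutex_lock(&q->m);
    q->buf[q->tail++ % MRQ_SLOTS] = item;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->m);
}

/* Many readers may block here; each item goes to exactly one of them.
 * With one token in the queue this is a lock acquire, and re-enqueuing
 * the token is the release. */
uint32_t mrq_dequeue(mr_queue_t *q)
{
    uint32_t item;
    pthread_mutex_lock(&q->m);
    while (q->head == q->tail)
        pthread_cond_wait(&q->nonempty, &q->m);
    item = q->buf[q->head++ % MRQ_SLOTS];
    pthread_mutex_unlock(&q->m);
    return item;
}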