XQueue: Extreme Fine-grained Concurrent Lock-less...

XQueue: Extreme Fine-grained Concurrent Lock-less QueuePoornima Nookala1, Peter Dinda2, Kyle Hale3, Ioan Raicu4

1,3,4Illinois Institute of Technology {pnookala@hawk.iit.edu, khale@cs.iit.edu, iraicu@cs.iit.edu}, 2Northwestern University {pdinda@northwestern.edu}

OverviewA single general purpose shared memory machine todaymay have hundreds of hardware threads available. Op-erations that may run in tens of cycles on a single coreusing a single thread can take upwards of millions ofcycles when multiple threads are competing for sharedresources. The concurrent multiple producer, multipleconsumer (MPMC) queue is a critical building block innumerous systems such as the task scheduler of a run-time system (or operating system) that supports mod-ern parallel programming models. We present the de-sign, implementation, and evaluation of XQueue, a novellock-less concurrent queuing system with relaxed order-ing semantics that is geared to realizing scalability tohundreds of concurrent threads. XQueue implementsthe MPMC interface using multiple queues in order toreduce contention, applies simple deterministic load bal-ancing techniques, and eliminates all use of locks andatomic operations with a lock-less design. Experimentalresults show that XQueue can deliver concurrent oper-ations with 110 to 400 cycle latency at scales up to 384hardware threads.

Concurrent queues areterrible!

• Analyzed latency and throughput of mutex, spinlock,semaphore and atomic fetch-and-add for an incrementoperation on Mystic testbed.

• Mystic covers latest many-core architectures from Intel,AMD, IBM and ARM with processors such as Haswell,Broadwell, Skylake, Phi, Opteron, Ryzen, Threadripper,Epyc, Power9, and ThunderX

• Latency of a single atomic increment on a Skylakesystem with 192-cores and 384 hardware threadswhen running on all threads concurrently is 33592cycles whereas on Intel Xeon Phi KnightsLanding with 64-cores and 256 hardware threads,latency reaches 3868 cycles.

• Latency of enqueue/dequeue operation on SPSCqueue takes 30 to 70 cycles depending on thearchitecture and clock frequency.

• Average throughput of enqueue/dequeue operationson SPSC queue reaches 270 million operationsper second on Intel Skylake 192-core machine.

• Latency of MPMC queue can reach up to millions ofcycles under high contention and throughput candrop up to as low as 300,000 operations persecond.

• These results provide enough motivation to investigatemethods to exploit full concurrency on many-corearchitectures while not compromising on the lowestlatency that can be achieved.

Figure 1: SPSC Queue Latency

Figure 2: MPMC Queue Latency

Figure 3: SPSC Queue Throughput

Figure 4: MPMC Queue Throughput

Problem Statement

Having scalable and fast queue datastructures under high concurrency isa critical missing link towards therealization of efficient parallel run-times on many-core architectures. Itis essential to analyze and optimize basicdata structures that form the barebones ofparallel runtime systems so they do not be-come the bottleneck for performance.

XQueue - Design and Implementation

• Core idea of XQueue is to have two queues percore, one being the master and other being the auxiliaryqueue.

• There is one master queue and one auxiliary queue oneach core with one producer thread per queue and acommon consumer thread for both queues.

• XQueue is implemented without using any types oflocks or atomic operations or barriers. Currentimplementation uses B-queue [1] which is a lock-freeconcurrent SPSC queue.

• XQueue employs load balancing techniques whereitems can be added to auxiliary queues depending onthe topology defined (round-robin in Figure 5). Figure 5: XQueue Architecture

Microbenchmarks

• For evaluation purposes, we implemented two versions ofXQueue, one with a lock-less SPSC queue and anotherone with mutex-based MPMC queue.

• The throughput achieved on skylake-192 withXQueue with lock-less queue is 5 billionoperations per second. For XQueue usinglock-based queue, the average throughput achieved is135 million operations per second.

• Latency of queue operations on XQueue usinglock-less queue is 110 to 400 CPU cycles onaverage on all the different architectures with 384threads of execution.

Figure 6: XQueue Throughput usinglock-less queue

Figure 7: XQueue Throughput usinglock-based queue

Figure 8: Latency of Xqueue using lock-less queue

latency<400cycleswith384threads!

Conclusion• XQueue is an extremely scalable lock-lesscon-current MPMC out of order queue that can scaleup to hundreds of threads of execution.

• Future Work: XTask, a light-weight parallel runtimesystem which can achieve low latency and highthroughput at extreme scale.

[1] Junchang Wang, Kai Zhang, Xinan Tang, and Bei Hua. B-queue: Efficient and practical queuing for fast core-to-core communication.International Journal of Parallel Programming, 41(1):137–159, Feb 2013.

XQueue: Extreme Fine-grained Concurrent Lock-less...

Documents

Concurrency Programming in Java - 07 - High-level Concurrency objects, Lock Objects, Executors, Concurrent Collections, Atomic Variables, Concurrent Random Numbers

CONCURRENT PROGRAMMING Introduction to Locks and Lock-free data structures 1

Concurrency in Failure Atomic Data Structures on ......Transactions and concurrency, lock-free programming 03 Concurrent data structures Persistent TLS, concurrent map, concurrent

Infrastructure-Free Logging and Replay of Concurrent ...(b) Coarse-grained Exploration (c) Second Replay Failure (d) Fine-grained Exploration Fig.2. The different phases of our single

Verifying a Two-Lock Concurrent Queue Hussain Tinwala Fall 2007

Are Lock-Free Concurrent Algorithms Practically Wait-Free?

Chapter 14: Concurrency Control · Database System Concepts - 6th Edition 15.3 ©Silberschatz, Korth and Sudarshan Lock-Based Protocols A lock is a mechanism to control concurrent

Lock-free Concurrent Data Structures

IPDPS 2003 - Fast and Lock Free Concurrent Priority Queues

A Fine-Grained Adaptive Middleware Framework for Parallel ... · previous research, our system supports ofﬂoading to multiple remote locations, concurrent application model, and

Fast and Lock-Free Concurrent Priority Queues for · 2003-02-21 · concurrent priority queue and assume the availability of a global high-accuracy clock. They apply a lock on each

Lock-Free Concurrent Binomial Heaps · Lock-Free Concurrent Binomial Heaps Gavin Lowe Department of Computer Science, University of Oxford gavin.lowe@cs.ox.ac.uk August 21, 2018 Abstract

Verifying Fine-Grained Concurrent Data Structures · Verifying Fine-Grained Concurrent Data Structures Master Thesis Felix Wolf August 21, 2018 Advisors: Prof. Dr. Peter Muller, Dr

Concurrent B-trees with Lock-free Techniqueshacamero/Research/BtreeTechrpt2011.pdf · Concurrent B-trees with Lock-free Techniques Afroza Sultana Concordia University Helen A. Cameron

Lecture 13: Fine-grained Synchronization & Lock-free ...cs149.stanford.edu/winter19content/lectures/13_lockfree/13_lockfre… · Stanford CS149, Winter 2019 Required conditions for

Performance Evaluation of Intel® Transactional ...ts.data61.csiro.au/Events/summer/14/lai.pdf · Coarse-Grained Locks •Use a lock to guard shared data –A thread acquires lock,

Concurrent and Event-Based Programming - UPTdanc/pcbe/pcbe-curs-v1.3.pdf · Monitors in Java Each Java object provides an ‘intrinsic lock’ (‘monitor lock’) which is automatically

CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel

Mechanized Veriﬁcation of Fine-grained Concurrent …Mechanized Veriﬁcation of Fine-grained Concurrent Programs User manual and code commentary Ilya Sergey IMDEA Software Institute

1 Fundamentals of Concurrent Systems Fundamentals of Concurrent Processes The issues to be addressed here are: 1. Process and Thread 2. Fine-Grained