NUMA overview

NUMA

MAATALLA Abed


• What is NUMA?

• History of processors.

• A closer look at NUMA.

• UMA, NUMA & NUMA SMP architectures.

• Barriers of NUMA.

• Solutions.

• Existing simulators.

• Benefits of NUMA

What is NUMA?

• Non-Uniform Memory Access: accessing some regions of memory takes longer than accessing others.

• Designed to improve scalability on large SMPs.

• A processor can access its own local memory faster than non-local memory.

SMP: symmetric multiprocessing

What is NUMA?

• Groups of processors (NUMA nodes) have their own local memory.

– Any processor can access any memory, including memory not "owned" by its group (remote memory).

– Non-uniform: accessing local memory is faster than accessing remote memory.
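
The local/remote gap can be put into a small formula: the expected cost of one access is a weighted average of the two latencies. The sketch below assumes made-up latencies (100 ns local, 200 ns remote); the function name and the numbers are illustrative, not measurements.

```python
def average_access_latency(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Expected latency of one memory access, given the share of remote accesses."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Illustrative numbers only (assumed, not measured): 100 ns local, 200 ns remote.
for fraction in (0.0, 0.25, 0.5, 1.0):
    print(fraction, average_access_latency(100.0, 200.0, fraction))
```

The more accesses go to remote memory, the closer the average gets to the remote latency, which is why data and task placement matter.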

What is NUMA?

• Nodes are linked to each other by a high-speed interconnect.

• NUMA limits the number of CPUs that share each memory bus.

• Each group of processors has its own memory and possibly its own I/O channels.

• The number of CPUs within a NUMA node depends on the hardware vendor.

What is NUMA?

• Facts:

– (Most) memory is allocated at task startup.

– Tasks are (usually) free to run on any processor.

Both local and remote accesses can happen during a task's lifetime.
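
A toy illustration of these two facts, assuming a hypothetical task whose pages are all placed on node 0 at startup and which the scheduler later moves to node 1. The names `count_accesses`, `pages` and `steps` are invented for this sketch.

```python
def count_accesses(page_nodes, schedule):
    """Count local and remote accesses.

    page_nodes: dict page -> node holding the page (fixed at task startup).
    schedule:   list of (node_the_task_runs_on, page_accessed) steps.
    """
    local = remote = 0
    for running_node, page in schedule:
        if page_nodes[page] == running_node:
            local += 1
        else:
            remote += 1
    return local, remote


# Hypothetical task: all pages are allocated on node 0 at startup...
pages = {"a": 0, "b": 0}
# ...then the scheduler moves the task from node 0 to node 1 halfway through.
steps = [(0, "a"), (0, "b"), (1, "a"), (1, "b")]
print(count_accesses(pages, steps))  # (2, 2)
```

Once the task has moved, every access to the pages left behind on node 0 becomes remote.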

History of processors.

• The common mental model of CPUs is stuck in the 1980s: basically boxes that do arithmetic, logic, bit twiddling and shifting, and load and store things in memory. But CPUs have since gained newer features such as vector instructions (SIMD) and hardware support for virtualization.

• Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.

History of processors.

• The first commercial implementation of a NUMA-based Unix system was the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for Honeywell Information Systems Italy.

A closer look at NUMA.

• One can view NUMA as a tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater (slower) than that of hardware-based NUMA.

• NUMA addresses these performance problems by providing separate memory for each processor, which avoids the performance hit when several processors attempt to address the same memory.

A closer look at NUMA

• Threads that share memory should be on the same socket, and a memory-mapped I/O heavy thread should make sure it’s on the socket that’s closest to the I/O device it’s talking to.
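
On Linux, one way to keep a process on a single socket is to restrict its CPU affinity to the CPUs of one NUMA node. The sketch below is a minimal example under that assumption: it reads the node's CPU list from sysfs and calls `os.sched_setaffinity`. The helper names `parse_cpulist` and `pin_to_node` are invented here, and the code only sets CPU affinity; it does not set a memory placement policy.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int) -> set[int]:
    """Restrict the caller to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means "the caller"
    return cpus


# Parsing works anywhere; pin_to_node() needs a Linux machine with NUMA sysfs entries.
print(sorted(parse_cpulist("0-3,8-11")))
```

Threads that share memory would all be pinned to the same node this way, so their shared data stays local to all of them.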

• There are multiple levels of memory, such as CC and LLC (last-level cache), because CPUs became faster and needed faster memory access. This hierarchy is called the memory tree.

A closer look at NUMA

• NUMA VS ccNUMA: The difference is almost nonexistent at this point. ccNUMA stands for Cache-Coherent NUMA, but NUMA and ccNUMA have really come to be synonymous. The applications for non-cache coherent NUMA machines are almost non-existent, and they are a real pain to program for, so unless specifically stated otherwise, NUMA actually means ccNUMA.

A closer look at NUMA

• When a processor looks for data at a certain memory address, it first looks in the L1 cache on the microprocessor itself, then in a somewhat larger L2 cache nearby, and then in a third level of cache that the NUMA configuration provides, before seeking the data in the "remote memory" located near the other microprocessors. Each of these groups of processors is a node in the interconnection network. NUMA maintains a hierarchical view of the data on all the nodes.

• Interconnection network (ICN): as mentioned above, the ICN links the nodes so that they can exchange data, in the same way that physical links allow data exchange in a cluster.
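
The lookup order described above can be sketched as a walk down a list of levels. This is a simplified model, not how real hardware is built: the latencies are invented, and it assumes each miss adds the full latency of the level that was probed. `LEVELS` and `lookup` are names made up for the example.

```python
# Illustrative latencies in nanoseconds (assumed); real values depend on the hardware.
LEVELS = [
    ("L1 cache", 1),
    ("L2 cache", 4),
    ("L3 cache", 40),
    ("local memory", 100),
    ("remote memory", 200),
]


def lookup(address, contents):
    """Search each level in order; return (level_name, accumulated_latency_ns).

    contents maps a level name to the set of addresses it currently holds.
    """
    total = 0
    for name, latency in LEVELS:
        total += latency  # simplifying assumption: every probed level costs its full latency
        if address in contents.get(name, set()):
            return name, total
    raise KeyError(f"address {address:#x} not found at any level")


contents = {"L1 cache": {0x10}, "local memory": {0x10, 0x20}, "remote memory": {0x30}}
print(lookup(0x10, contents))  # ('L1 cache', 1)
print(lookup(0x30, contents))  # ('remote memory', 345)
```

Data found only in remote memory pays for every level probed on the way, including the trip across the interconnection network.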

UMA, NUMA & NUMA SMP architectures

• Uniform Memory Access (UMA): all processors have the same latency to access memory. This architecture scales only to a limited number of processors.

• Non-Uniform Memory Access (NUMA): each processor has its own local memory. The memory of other processors is accessible, but the latency to access it is not the same; such an access is called a "remote memory access".

UMA, NUMA & NUMA SMP architectures

• NUMA SMP: the hardware trend is to use NUMA systems with several NUMA nodes, as shown in the figure. A NUMA node has a group of processors sharing memory, and it uses its local bus to interact with its local memory. Multiple NUMA nodes can be combined to form an SMP, with a common SMP bus interconnecting all NUMA nodes.
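
As a rough data-structure view of this layout, the sketch below models a machine as a list of nodes, each with its own CPUs and memory. The class `NumaNode`, the helper functions, and the two-node machine are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    node_id: int
    cpus: frozenset   # CPU ids that share this node's local bus
    memory_gib: int   # size of the node's local memory (illustrative)


def node_of_cpu(nodes, cpu):
    """Return the id of the node that owns the given CPU."""
    for node in nodes:
        if cpu in node.cpus:
            return node.node_id
    raise ValueError(f"CPU {cpu} does not belong to any node")


def is_remote(nodes, cpu, memory_node):
    """True if the CPU must cross the common SMP bus to reach memory_node."""
    return node_of_cpu(nodes, cpu) != memory_node


# Hypothetical machine: two nodes with four CPUs each.
machine = [
    NumaNode(0, frozenset({0, 1, 2, 3}), 32),
    NumaNode(1, frozenset({4, 5, 6, 7}), 32),
]
print(is_remote(machine, 5, 0))  # True
print(is_remote(machine, 5, 1))  # False
```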

Barriers of NUMA.

• Spread data between memories.

Barriers of NUMA.

• Spread tasks between sockets.

Barriers of NUMA.

• IO NUMA: needs to be considered during placement / scheduling.

Barriers of NUMA.

• In the 80s there was just memory. Then CPUs got fast enough relative to memory that people wanted to add a cache. It's bad news if the cache is inconsistent with the backing store (memory), so the cache has to keep some information about what it's holding on to, so that it knows if and when it needs to write things to the backing store.

Barriers of NUMA.

• Data requested by more than one processor.

• How far apart the processors are from their associated memory banks.
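
Both barriers meet in the placement question: when several nodes use the same data, where should it live? A minimal sketch, assuming a relative distance matrix between nodes and a count of accesses issued from each node, is to pick the node with the lowest total cost. The function name and all numbers are illustrative.

```python
def best_home_node(distance, accesses):
    """Pick the node where shared data is cheapest to reach overall.

    distance[i][j]: relative cost for a CPU on node i to reach memory on node j.
    accesses[i]:    number of accesses to the data issued from node i.
    """
    def total_cost(home):
        return sum(count * distance[src][home] for src, count in enumerate(accesses))

    return min(range(len(distance)), key=total_cost)


# Illustrative relative distances (assumed): 10 for local, 20 for remote.
distance = [[10, 20],
            [20, 10]]
print(best_home_node(distance, [30, 70]))  # 1
```

With 30 accesses from node 0 and 70 from node 1, placing the data on node 1 costs 1300 units against 1700 on node 0.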

Solutions

• Some hardware implementations exist to solve some of these problems. However, buying a high-end server just to test new approaches is expensive, and such a server needs special conditions such as cooling and space.

• As developers, we could create a simulator to implement different approaches and to analyse and improve performance and scalability. This means the simulator needs to handle both the software and the hardware side, for example by flagging remote memory access events, calculating the execution time of each process, tracking I/O events, and so on.
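
To make the idea concrete, here is a very small sketch of what the core of such a simulator might look like. It is not taken from any of the existing simulators: the trace format, the fixed latencies, and the names `simulate` and `ProcessStats` are all assumptions made for the example.

```python
from dataclasses import dataclass

LOCAL_NS = 100    # assumed cost of a local access
REMOTE_NS = 200   # assumed cost of a remote access


@dataclass
class ProcessStats:
    remote_accesses: int = 0
    execution_time_ns: int = 0


def simulate(trace, cpu_node, page_node):
    """Replay a memory trace and collect per-process statistics.

    trace:     list of (pid, cpu, page) access events.
    cpu_node:  dict cpu -> NUMA node the CPU belongs to.
    page_node: dict page -> NUMA node holding the page.
    """
    stats = {}
    for pid, cpu, page in trace:
        s = stats.setdefault(pid, ProcessStats())
        if cpu_node[cpu] == page_node[page]:
            s.execution_time_ns += LOCAL_NS
        else:
            s.remote_accesses += 1  # flag a remote memory access event
            s.execution_time_ns += REMOTE_NS
    return stats


cpu_node = {0: 0, 1: 1}
page_node = {"p0": 0, "p1": 1}
trace = [(1, 0, "p0"), (1, 0, "p1"), (2, 1, "p1")]
for pid, s in simulate(trace, cpu_node, page_node).items():
    print(pid, s.remote_accesses, s.execution_time_ns)
```

A real simulator would also need to model caches, the interconnect, and I/O, but even this skeleton lets different placement policies be compared on the same trace.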

Existing simulators

There are a number of existing projects, such as RSIM, SICOSYS, SIMT and simNUMA. Each has done a good job and has its strengths and weaknesses, but the work has only just started, and there is much more to cover and implement in this field.

There are a lot of approaches and theories that need to be tested and proved or disproved.

For the reasons mentioned above, simulators will play an important role in the near future.

Benefits of NUMA

As mentioned above, the main benefit is scalability. It is extremely difficult to scale SMP to a large number of CPUs, because the memory bus comes under heavy contention. NUMA is one way of reducing the number of CPUs competing for access to a shared memory bus. This is accomplished by having several memory buses, with only a small number of CPUs on each of them.
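
The arithmetic behind this is simple. The sketch below uses invented CPU and bus counts and assumes CPUs are spread evenly across the buses.

```python
import math


def cpus_per_bus(total_cpus: int, buses: int) -> int:
    """Largest number of CPUs that compete for one memory bus, assuming an even spread."""
    return math.ceil(total_cpus / buses)


# Hypothetical 64-CPU machine: one shared bus versus eight NUMA nodes.
print(cpus_per_bus(64, 1))  # 64
print(cpus_per_bus(64, 8))  # 8
```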

I’m interested in things that CPUs can’t do yet but will be able to do in the near future.

Thank you
