Krzysztof Banaś Computational Performance 1
Analysis and modeling of
Computational Performance
Latency and throughput of memory
Memory latency
➔ Latency: in general (recall):
• the time between the stimulus and the response, between the cause and the effect, between the beginning of an operation and its end
for memory accesses: the time between issuing a memory request and its finalization
• for reading: the arrival of the data
• for writing: the storing of the data in memory
➢ complicated by cache coherence protocols (see Lecture 6)
theoretical values: based on hardware characteristics
experimental estimates:
• very short times, impossible to measure individually
• there must be multiple accesses to measure the time
• how to arrange the multiple accesses so that the average access time corresponds to a single separate memory access?
➢ how to eliminate the effects of all latency hiding techniques?
➔ How to measure latency experimentally:
different types of accesses (depending also on whether caches are inclusive or exclusive, shared or separate per core, etc.)
• L1 (L1 hit)
• L2 (L1 miss)
• L2 from a different core (L1, L2 miss, cache coherence protocol)
• L3 (L1, L2 miss)
• L3 from a different processor (L1, L2, L3 miss, cache coherence protocol)
• DRAM (L1, L2, L3 miss)
• others? (NUMA?)
organization of accesses
• should not have data locality
➢ temporal – single data element accesses separated by accesses to many other elements (to force eviction from caches)
➢ spatial – no accesses to the same cache line
➔ How to measure latency experimentally:
several simple strategies:
• only one array accessed
• read-only accesses
➢ e.g. sum += tab[index];
• write-only accesses
➢ e.g. tab[index] = data;
• read-modify-write accesses
➢ e.g. tab[index]++;
• strided accesses
➢ e.g. tab[index]++; index += stride;
• random accesses
➢ e.g. index = random_cache_line*cache_line_size; tab[index]++;
• pointer chasing
➢ e.g. index = tab[index];
[Figure slide: Latency and throughput]
Memory throughput (bandwidth)
➔ The maximal transfer rate between the processor and a given level of the memory hierarchy
➔ Should use all available latency hiding mechanisms:
prefetching (hardware and possibly software)
concurrency
• at all levels – memory controller, buses, DRAM modules
➢ pipelining, multi-banking, non-blocking caches, etc.
➔ Theoretical throughput (bandwidth): based on hardware characteristics
➔ Experimental estimates:
massive transfers
• latency hiding techniques
many independent memory requests
• maximizing concurrency
➔ How to measure throughput experimentally:
massive transfer
• array(s) fitting in the given memory level
• multiple repetitions
➢ accesses to the same element must come from the tested memory level – separated by a sufficient number of accesses to different elements to force eviction from the memory levels closer to the core
spatial locality
• stride 1, full exploitation of the contents of cache lines
many independent memory requests
• for different cache lines
throughput = number_of_accesses * sizeof(data) / execution_time
• the number of accesses taken from the source code (checked against the assembly code)
➢ effective accesses – data actually used in the code
» not the accesses realized by the hardware, e.g. due to prefetching
» the use of hardware counters can be misleading
Little's law for memory accesses
➔ Little's law (recall):
the average number L of customers in a stationary system is equal to the average effective arrival rate λ multiplied by the average time W that a customer spends in the system: L = λW
for memory access requests:
• L – the number of requests processed concurrently [B]
➢ should be measured in the number of cache lines
• λ – the throughput [GB/s]
• W – the time to process each of the memory requests [ns]
in order to maximize the throughput λ = L / W, i.e. to keep it as close as possible to the theoretical maximum, given the time W that depends on the hardware and the operating system:
• maximize the number of requests processed concurrently, L
➢ a sufficient number of independent requests in the code
[Figure slide: Little's law for memory accesses]
Memory throughput
➔ Memory throughput factors:
hardware
• memory technology (e.g. DDR4)
• number of banks, ranks, etc.
• number of channels
• the width of a single channel (bus, usually 64 bits)
• the processor's memory system capabilities (often expressed as the number of (usually 64-bit) transactions per second)
software
• the number of generated cache line accesses
➢ several arrays, or proper loop unrolling for a single array
• locality of accesses
➢ full use of the whole cache hierarchy
• vectorization of accesses (e.g. -march=core-avx2)
• alignment of arrays in memory (e.g. posix_memalign(...))
[Figure slide: Latency and throughput]

[Table slide: Example theoretical cache parameters]
Memory performance modelling
➔ AMAT – an often used memory performance model
AMAT – average memory access time
• hit access times – L1, L2, L3
• miss penalty times – for L1, L2, L3
• general formula
➢ AMAT = %L1_hit * L1_hit_time + %L1_miss * L1_miss_penalty
• applied recursively
➢ L1_miss_penalty = %L2_hit * L2_hit_time + %L2_miss * L2_miss_penalty
➢ L2_miss_penalty = %L3_hit * L3_hit_time + %L3_miss * DRAM_access_time
• the model does not take into account:
➢ concurrency – essential for high throughput
» unless different times are used for different types of accesses
➢ cache coherence – important for multithreading
➢ NUMA memory
➢ address translation
Memory accesses – virtual memory
➔ A typical memory instruction: movsd %xmm0, 24(%rsp)
➔ Arguments in:
registers
main memory (primary storage)
• address calculated based on the values stored in registers
➔ Addresses in the virtual address space
• size: 2^n (addresses 0 to 2^n - 1), n = 16, 32 or 64
• n – characteristic of the processor and the operating system
translated to physical addresses
• paging – the most popular mechanism
• segments, uniform, ...
Paged virtual memory
➔ Paged virtual memory
the virtual address space of a process is divided into (virtual) pages
• page size varies, usually 4 KiB
• physical memory is divided into frames (physical pages) of the same size as the virtual memory pages
• each virtual address is divided into a page number and an offset within the page (example: 32-bit address)
• virtual addresses are translated into physical addresses
➔ Address translation:
• each page is associated with an entry in the page table
• the page table specifies whether the page has an assigned physical memory frame with its data
➢ if yes, the page table contains the physical address of the frame holding the page's data
➔ Paging (page swapping)
the virtual address space is usually larger than the physical memory
• only a fraction of all pages from the virtual address space resides in physical memory
• some pages may be stored in secondary storage (HDD, SSD)
• the operating system manages the storage of pages
➢ when an accessed page is not in physical DRAM memory – a page fault
» minor page fault – the virtual page has not yet been assigned a physical page – possibly small overhead (no HDD or SSD access)
» major page fault – the virtual page has been moved to secondary storage – a page swap between secondary and primary storage, large overhead
➔ Page table
the page table can be large – millions of pages (e.g. 2^(n-12) entries for n = 32, 64)
• stored in primary memory and possibly in secondary memory
to speed up access to the page table it has a separate cache – the TLB
• TLB – translation lookaside buffer
• within the core
• up to several hundred entries
there may be many page tables managed by the operating system (for different processes, segments, etc.)
Memory access
➔ Instruction with a virtual address
address translation
• TLB look-up
➢ TLB hit -> physical page address, access to the data
➢ TLB miss -> page walk – a search through the page table in memory
» entry found, physical page assigned -> TLB update, memory access retry, TLB hit, physical page address, access to the data
» entry found, no physical page assigned -> page fault (both minor and major faults lead to a page table and TLB update)
physical memory access
• caches, cache coherence protocol, DRAM
some processors perform address translation and the L1 cache access concurrently (at least partially)
➔ Thrashing
frequent page swaps due to many major page faults
must be avoided (ideally, any major page faults should be avoided)
[Figure slide: Paged virtual memory]