A LITTLE BIT OF HISTORY
• Multics in the ’60s
• Ken Thompson and Dennis Ritchie in 1967–70
• USG and BSD
• John Lions 1976–95
• Andrew Tanenbaum 1987
• Linus Torvalds 1991
NICTA Copyright © 2014 From Imagination to Impact
A LITTLE BIT OF HISTORY
• Basic concepts well established
– Process model
– File system model
– IPC
• Additions:
– Paged virtual memory (3BSD, 1979)
– TCP/IP networking (4.2BSD, 1983)
– Multiprocessing (vendor Unices such as Sequent’s ‘Balance’, 1984)
ABSTRACTIONS
[Figure: the Linux kernel surrounded by its three abstractions: Files, Thread of Control, Memory Space]
PROCESS MODEL
• Root process (init)
• fork() creates (almost) exact copy
– Much is shared with parent; copy-on-write avoids excessive copying
• exec() overwrites memory image from a file
• Allows a process to control what is shared
FORK() AND EXEC()
➜ A process can clone itself by calling fork().
➜ Most attributes copied:
  ➜ Address space (actually shared, marked copy-on-write)
  ➜ Current directory, current root
  ➜ File descriptors
  ➜ Permissions, etc.
➜ Some attributes shared:
  ➜ Memory segments marked MAP_SHARED
  ➜ Open files
FORK() AND EXEC()
Files and Processes:
[Figure: Process A’s file descriptor table (slots 0 to 7) points at open file descriptors, each holding an offset and a reference to an in-kernel inode. After dup(), two slots in the same table refer to one open file descriptor. After fork(), Process B has its own table whose slots refer to the same open file descriptors as Process A’s.]
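FORK() AND EXEC()
switch (kidpid = fork()) {
case 0: /* child: rewire stdin, stdout and stderr, then exec */
    close(0); close(1); close(2);
    /* dup() returns the lowest free descriptor: 0, then 1, then 2 */
    dup(infd); dup(outfd); dup(outfd);
    execve("path/to/prog", argv, envp);
    _exit(EXIT_FAILURE); /* reached only if execve() failed */
case -1:
    /* handle error */
    break; /* no child was created, so there is nothing to wait for */
default: /* parent: wait for the child to finish */
    waitpid(kidpid, &status, 0);
}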
STANDARD FILE DESCRIPTORS
0 Standard Input
1 Standard Output
2 Standard Error
➜ Inherited from parent
➜ On login, all are set to controlling tty
FILE MODEL
• Separation of names from content.
• ‘regular’ files ‘just bytes’ → structure/meaning
supplied by userspace
• Devices represented by files.
• Directories map names to index node indices
(inums)
• Simple permissions model
FILE MODEL
[Figure: resolving / bin / ls. The root directory maps names (., .., boot, sbin, bin, dev, var, vmlinux, etc, usr) to inode numbers; the entry for bin leads to inode 324, a directory whose entries (., .., bash, sh, ls, which, rnano, busybox, setserial, bzcmp) map in turn to inode numbers, including the one for ls.]
NAMEI
➜ translate name → inode
➜ abstracted per filesystem in VFS layer
➜ Can be slow: extensive use of caches to speed it up
  (the dentry cache, which becomes an SMP bottleneck)
➜ hide filesystem and device boundaries
➜ walks pathname, translating symbolic links
EVOLUTION
KISS:
➜ Simplest possible algorithm used at first
➜ Easy to show correctness
➜ Fast to implement
➜ As drawbacks and bottlenecks are found, replace with faster/more scalable alternatives
C DIALECT
• Extra keywords:
– Section IDs: __init, __exit, __percpu etc
– Info Taint annotation: __user, __rcu, __kernel, __iomem
– Locking annotations: __acquires(x), __releases(x)
– Extra typechecking (endian portability): __bitwise
C DIALECT
• Extra iterators
– type_name_foreach()
• Extra accessors
– container_of()
C DIALECT
• Massive use of inline functions
• Some use of CPP macros
• Little #ifdef use in code: rely on optimizer to elide
dead code.
SCHEDULING
Goals:
• O(1) in number of runnable processes, number of
processors
– good uniprocessor performance
• ‘fair’
• Good interactive response
• topology-aware
SCHEDULING
Implementation:
• Changes from time to time.
• Currently ‘CFS’ by Ingo Molnar.
SCHEDULING
Dual Entitlement Scheduler
[Figure: tasks held in a Running queue and an Expired queue, labelled with entitlements 0.5, 0.7, 0.1, 0 and 0]
SCHEDULING
1. Keep tasks ordered by effective CPU runtime weighted by nice in red-black tree
2. Always run left-most task.
Devil’s in the details:
• Avoiding overflow
• Keeping recent history
• Multiprocessor locality
• Handling too-many threads
• Sleeping tasks
• Group hierarchy
SCHEDULING
[Figure: processor topology. (Hyper)threads sit within cores, cores within packages, and each group of packages with its local RAM forms a NUMA node.]
SCHEDULING
Locality Issues:
• Best to reschedule on same processor (don’t move cache footprint, keep memory close)
– Otherwise schedule on a ‘nearby’ processor
• Try to keep whole sockets idle
• Somehow identify cooperating threads, co-schedule on same package?
SCHEDULING
• One queue per processor (or hyperthread)
• Processors in hierarchical ‘domains’
• Load balancing per-domain, bottom up
• Aims to keep whole domains idle if possible (power
savings)
MEMORY MANAGEMENT
Memory in zones
[Figure: physical memory from address 0 divided into DMA (up to 16M), Normal (up to 900M) and Highmem zones. In the virtual address space, User VM sits below 3G and the Linux kernel above it, where the DMA and Normal zones are identity mapped with an offset.]
MEMORY MANAGEMENT
• Direct mapped pages become logical addresses
– __pa() and __va() convert between virtual and physical addresses for these
• Small memory systems have all memory as logical
• More memory → kernel refers to memory by struct page
MEMORY MANAGEMENT
struct page:
• Every frame has a struct page (up to 10 words)
• Track:
– flags
– backing address space
– offset within mapping or freelist pointer
– Reference counts
– Kernel virtual address (if mapped)
MEMORY MANAGEMENT
[Figure: struct task_struct points to a struct mm_struct, which owns the (hardware defined) page table and a list of struct vm_area_struct kept in virtual address order. Each vm_area_struct refers to a struct address_space backed by a file (or swap).]
MEMORY MANAGEMENT
Address Space:
• Misnamed: means collection of pages mapped from the same object
• Tracks inode mapped from, radix tree of pages in mapping
• Has ops (from file system or swap manager) to:
dirty: mark a page as dirty
readpages: populate frames from backing store
writepages: clean pages, i.e. make backing store the same as in-memory copy
migratepage: move pages between NUMA nodes
Others: and other housekeeping
PAGE FAULT TIME
• Special case in-kernel faults
• Find the VMA for the address
– segfault if not found (unmapped area)
• If it’s a stack, extend it.
• Otherwise:
1. Check permissions, SIGSEGV if bad
2. Call handle_mm_fault():
– walk page table to find entry (populate higher levels if necessary until leaf found)
– call handle_pte_fault()
PAGE FAULT TIME
handle_pte_fault(): Depending on PTE status, can
• provide an anonymous page
• do copy-on-write processing
• reinstantiate PTE from page cache
• initiate a read from backing store.
and if necessary flushes the TLB.
DRIVER INTERFACE
Three kinds of device:
1. Platform device
2. enumerable-bus device
3. Non-enumerable-bus device
DRIVER INTERFACE
Enumerable buses:
static DEFINE_PCI_DEVICE_TABLE(cp_pci_tbl) = {
    { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, PCI_DEVICE_ID_REALTEK_8139), },
    { PCI_DEVICE(PCI_VENDOR_ID_TTTECH, PCI_DEVICE_ID_TTTECH_MC322), },
    { },  /* empty entry terminates the table */
};
MODULE_DEVICE_TABLE(pci, cp_pci_tbl);
DRIVER INTERFACE
Driver interface:
init: called to register driver
exit: called to deregister driver, at module unload time
probe(): called when bus-id matches; returns 0 if driver claims device
open, close, etc: as necessary for driver class
DRIVER INTERFACE
Platform Devices:
static struct platform_device nslu2_uart = {
.name = "serial8250",
.id = PLAT8250_DEV_PLATFORM,
.dev.platform_data = nslu2_uart_data,
.num_resources = 2,
.resource = nslu2_uart_resources,
};
DRIVER INTERFACE
non-enumerable buses: Treat like platform devices
SUMMARY
• I’ve told you status today
– Next week it may be different
• I’ve simplified a lot. There are many hairy details
FILE SYSTEMS
I’m assuming:
• You’ve already looked at ext[234]-like filesystems
• You’ve some awareness of issues around on-disk
locality and I/O performance
• You understand issues around avoiding on-disk
corruption by carefully ordering events, and/or by the
use of a Journal.
NORMAL FILE SYSTEMS
• Optimised for use on spinning disk
• RAID optimised (especially XFS)
• Journals, snapshots, transactions...
FLASH MEMORY
• NOR Flash
• NAND Flash
– MTD: Memory Technology Device
– eMMC, SDHC etc: a JEDEC standard
– SSD, USB: and other disk-like interfaces
NAND CHARACTERISTICS
[Figure: a NAND flash chip, with interface circuitry in front of an array of erase blocks, each made up of pages]
FLASH UPDATE
[Figure: a processor with RAM and NOR flash or EEPROM sits in front of the NAND flash chip (interface circuitry, erase blocks, pages)]
THE CONTROLLER:
• Presents illusion of ‘standard’ block device
• Manages writes to prevent wearing out
• Manages reads to prevent read-disturb
• Performs garbage collection
• Performs bad-block management
Mostly documented in Korean patents referred to by US patents!
WEAR MANAGEMENT
Two ways:
• Remap blocks when they begin to fail (bad block remapping)
• Spread writes over all erase blocks (wear levelling)
In practice both are used.
Also:
• Count reads and schedule garbage collection after some threshold
PREFORMAT
• Typically use FAT32 (or exFAT for SDXC cards)
• Always do cluster-size I/O (64k)
• First partition segment-aligned
Conjecture: Flash controller optimises for the preformatted FAT fs
FAT FILE SYSTEMS
[Figure: FAT on-disk layout, with regions labelled Boot Param, Reserved, FAT, Root Directory, Cluster Info Block and Data Area]
FAT FILE SYSTEMS
Conjecture: The controller has some number of buffers it treats specially, to allow more than one write locus.
TESTING SDHC CARDS
SD CARD CHARACTERISTICS
Card                   Price/G  #AU  Page size  Erase size
Kingston Class 10      $0.80    2    128k       4M
Toshiba Class 10       $1.20    2    64k        8M
SanDisk Extreme UHS-1  $5.00    9    64k        8M
SanDisk Extreme Pro    $6.50    9    16k        4M
WRITE PATTERNS: FILE CREATE
[Figure: two plots titled ‘Write 40M File’, each showing block number against time (on Toshiba Exceria card)]
WRITE PATTERNS: FILE CREATE
[Figure: plot of "~/iozone.dat" using 1:2]
F2FS
• By Samsung
• ‘Use on-card FTL, rather than work against it’
• Cooperate with garbage collection
• Use FAT32 optimisations
F2FS
• 2M segments written as whole chunks: always writes at log head, aligned with flash allocation units
• Log is the only data structure on-disk
• Metadata (e.g., head of log) written to FAT area in single-block writes
• Splits hot and cold data and inodes.
BENCHMARKS: POSTMARK 32K READ
[Figure: four bar charts (Kingston, Toshiba, SanDisk Extreme, SanDisk Extreme Pro) showing read throughput in MB/s, on a scale of 0 to 5, for the EXT4, FAT32 and F2FS filesystems]
USING NON-F2FS
• Observation: XFS and ext4 already understand RAID
• RAID has multiple chunks, and a fixed stride, so...
• Configure FS as if for RAID
USING NON-F2FS
Still running benchmarks, see LCA talk next January for
results!
SCALABILITY
The Multiprocessor Effect:
• Some fraction of the system’s cycles are not available
for application work:
– Operating System Code Paths
– Inter-Cache Coherency traffic
– Memory Bus contention
– Lock synchronisation
– I/O serialisation
SCALABILITY
Amdahl’s law:
If a process can be split such that σ of the running time cannot be sped up, but the rest is sped up by running on p processors, then overall speedup is
$$\frac{p}{1 + \sigma(p - 1)}$$
[Figure: running time T split into a serial part Tσ and a parallel part T(1−σ), with the parallel part divided among the processors]
SCALABILITY
[Figure: throughput against applied load, with curves for 1 processor, 2 processors and 3 processors]
SCALABILITY
[Figure: throughput and latency against applied load, for 2 processors and 3 processors]
SCALABILITY
Gunther’s law:
$$C(N) = \frac{N}{1 + \alpha(N - 1) + \beta N(N - 1)}$$
where:
N is demand
α is the amount of serialisation: represents Amdahl’s law
β is the coherency delay in the system.
C is Capacity or Throughput
SCALABILITY
[Figure: three plots of throughput against load (0 to 10000):
USL with alpha=0, beta=0 (α = 0, β = 0);
USL with alpha=0.015, beta=0 (α > 0, β = 0);
USL with alpha=0.001, beta=0.0000001 (α > 0, β > 0)]
SCALABILITY
Queueing Models:
[Figure: a queue feeding a server, with Poisson arrivals and Poisson service times; and a second model in which a high priority queue and a normal priority queue share the same server, with Poisson service times, draining to a sink]
SCALABILITY
Real examples:
[Figure: Postgres TPC throughput against load (0 to 80)]
SCALABILITY
[Figure: Postgres TPC throughput against load, overlaid with USL with alpha=0.342101, beta=0.017430]
SCALABILITY
[Figure: Postgres TPC throughput against load, separate log disc]
SCALABILITY
Another example:
[Figure: jobs per minute (0 to 20000) against number of clients (0 to 50), with curves for 1-way, 2-way, 4-way, 8-way and 12-way machines]
SCALABILITY
SPINLOCKS     HOLD            WAIT
UTIL   CON    MEAN (MAX)      MEAN (MAX)     (% CPU)  TOTAL     NOWAIT  SPIN   RJECT  NAME
72.3%  13.1%  0.5us (9.5us)   29us (20ms)    (42.5%)  50542055  86.9%   13.1%  0%     find_lock_page+0x30
0.01%  85.3%  1.7us (6.2us)   46us (4016us)  (0.01%)  1113      14.7%   85.3%  0%     find_lock_page+0x130
SCALABILITY
struct page *find_lock_page(struct address_space *mapping,
                            unsigned long offset)
{
    struct page *page;

    spin_lock_irq(&mapping->tree_lock);
repeat:
    page = radix_tree_lookup(&mapping->page_tree, offset);
    if (page) {
        page_cache_get(page);
        if (TestSetPageLocked(page)) {
            spin_unlock_irq(&mapping->tree_lock);
            lock_page(page);
            spin_lock_irq(&mapping->tree_lock);
            . . .
SCALABILITY
[Figure: jobs per minute (0 to 180000) against number of clients (0 to 50), with curves for 1-way, 2-way, 4-way, 8-way, 12-way and 16-way machines]
TACKLING SCALABILITY PROBLEMS
• Find the bottleneck
– not always easy
• Fix or work around it
– not always easy
• Check performance doesn’t suffer too much on the low end.
• Experiment with different algorithms, parameters
TACKLING SCALABILITY PROBLEMS
• Each solved problem uncovers another
• Fixing performance for one workload can worsen another
• Performance problems can make you cry
DOING WITHOUT LOCKS
Avoiding Serialisation:
• Lock-free algorithms
• Allow safe concurrent access without excessive serialisation
• Many techniques. We cover:
– Sequence locks
– Read-Copy-Update (RCU)
DOING WITHOUT LOCKS
Sequence locks:
• Readers don’t lock
• Writers serialised.
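DOING WITHOUT LOCKS
Reader:
volatile unsigned seq;          /* shared; odd while a write is in progress */
unsigned lastseq;
do {
    do {
        lastseq = seq;
    } while (lastseq & 1);      /* wait for any writer to finish */
    rmb();
    ....                        /* read the protected data */
    rmb();                      /* finish the reads before rechecking seq */
} while (lastseq != seq);       /* retry if a writer intervened */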
DOING WITHOUT LOCKS
Writer:
spinlock(&lck);
seq++; wmb();       /* seq is now odd: readers wait or retry */
...                 /* update the protected data */
wmb(); seq++;       /* seq is even again: data is consistent */
spinunlock(&lck);
DOING WITHOUT LOCKS
RCU (McKenney 2004; McKenney et al. 2002):
[Figure: a read-copy-update operation shown in four numbered steps]
BACKGROUND READING
References
McKenney, P. E. (2004), Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels, PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University.
URL: http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf
McKenney, P. E., Sarma, D., Arcangelli, A., Kleen, A., Krieger, O. & Russell, R. (2002), Read copy update, in ‘Ottawa Linux Symp.’.
URL: http://www.rdrop.com/users/paulmck/rclock/rcu.2002.07.08.pdf