Copyright 2012 FUJITSU LIMITED
Memory Cgroup what’s going on
2012.06.08 KAMEZAWA Hiroyuki [email protected]
Agenda
History
Features of memcg in linux-3.5-rc1.
Performance brief.
News and TODOs.
The way to RHEL6 (2.6.32)
The basic features were added:
Per-zone LRU, OOM handling, swap, hierarchy support, soft limit, etc.
Major committers
Balbir Singh
Daisuke Nishimura
Hugh Dickins
KOSAKI Motohiro
KAMEZAWA Hiroyuki
Performance fixes (…2.6.36)
In this era, a large number of atomic operations were removed.
Some new features were added:
Moving charges when a task moves.
Threshold notifier (contributed by an embedded developer).
Major committers
Daisuke Nishimura
Kirill A. Shutemov
KAMEZAWA Hiroyuki
New committers (…v3.0)
The biggest change in this era was the arrival of new, skilled committers:
Johannes Weiner
Michal Hocko
Ying Han (+ Google team)
Much refactoring and many bug fixes.
Johannes and Michal are now the maintainers.
More statistics.
Renewed implementation (…v3.5-rc)
Hard work by Johannes and Hugh.
Reduced memory overhead by 60%.
New accounting:
TCP memcontrol
Huge page limiting
Recent committers: Johannes Weiner, Hugh Dickins, Glauber Costa, Michal Hocko, Konstantin Khlebnikov, Kirill A. Shutemov, KAMEZAWA Hiroyuki
Lines of code
[Figure: lines of code in memcg per kernel release, from 2.6.25 to 3.4; y-axis from 0 to 8000 lines.]
memcg git tree
Memory cgroup development tree maintained by Michal Hocko
git://github.com/mstsxfx/memcg-devel.git
Features of memory cgroup
Agenda
Requirements/Basics
Per-memcg memory/swap limiting
Hierarchical or non-Hierarchical(Flat)
LRU implementation
Threshold/OOM notifier
TCP memcontrol
Requirements
Works on any system with CONFIG_MMU=y.
Uses 16 bytes per page (PAGE_SIZE = 4096 bytes) on 64-bit systems to record information.
Kernel configuration: General setup => Control Group support
cgroup
Allows users to group processes through the cgroup file system interface.
Users control the characteristics of a group by reading and writing files in the cgroup file system.
[Figure: tree of cgroups (root, admin, user); processes are attached to a cgroup.]
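The file system interface described above can be sketched in a few lines. This is a minimal illustration, assuming the cgroup v1 convention that a child group is created with mkdir and that a process is attached by writing its PID to the group's `tasks` file. The helper names and the admin/user layout are hypothetical, and the demo uses a temporary directory in place of a real cgroup mount.

```python
import os
import tempfile


def create_cgroup(parent_dir, name):
    """Create a child cgroup; on a cgroup mount, mkdir is all it takes."""
    path = os.path.join(parent_dir, name)
    os.makedirs(path, exist_ok=True)
    return path


def attach_process(cgroup_dir, pid):
    """Attach a process by writing its PID to the group's `tasks` file."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))


def demo():
    # A temporary directory stands in for a real mount point, so this runs
    # without root privileges and has no effect on real tasks.
    with tempfile.TemporaryDirectory() as root:
        user = create_cgroup(create_cgroup(root, "admin"), "user")
        attach_process(user, os.getpid())
        with open(os.path.join(user, "tasks")) as f:
            return os.path.relpath(user, root), f.read()
```

On a real system the same two helpers would be pointed at the mounted memory cgroup directory instead of the temporary one.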
Memory/swap limiting
Limits usage of anonymous memory and file cache.
Limits memory+swap usage.
[Example screenshot: only 300M of cache in use for a 3.6G file.]
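Setting the two limits could look like the sketch below. It assumes the cgroup v1 control files `memory.limit_in_bytes` and `memory.memsw.limit_in_bytes`; the helper names, the size parser, and the 300M/1G values are illustrative assumptions, and the demo writes into a temporary directory rather than a real cgroup.

```python
import os
import tempfile

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert strings such as '300M' or '1G' to a byte count."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def set_limits(cgroup_dir, mem_limit, memsw_limit=None):
    """Write the memory limit and, optionally, the memory+swap limit."""
    mem = parse_size(mem_limit)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem))
    if memsw_limit is not None:
        memsw = parse_size(memsw_limit)
        # memory+swap covers memory, so it cannot be the smaller of the two.
        if memsw < mem:
            raise ValueError("memory+swap limit is below the memory limit")
        path = os.path.join(cgroup_dir, "memory.memsw.limit_in_bytes")
        with open(path, "w") as f:
            f.write(str(memsw))


def demo():
    # Temporary directory used as a stand-in for a real cgroup directory.
    with tempfile.TemporaryDirectory() as cg:
        set_limits(cg, "300M", "1G")
        with open(os.path.join(cg, "memory.limit_in_bytes")) as f:
            mem = f.read()
        with open(os.path.join(cg, "memory.memsw.limit_in_bytes")) as f:
            memsw = f.read()
        return mem, memsw
```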
Hierarchical/non-hierarchical
Users can choose the accounting policy of the tree.
Assume memory usage in the leaf cgroups is 200M and 400M. What should the parents' usage be?
Hierarchical mode: usage is propagated upward, so the parent of the 400M leaf shows 400M and the top cgroup shows 600M.
Non-hierarchical mode: usage is not propagated, so both parents show 0 and only the leaves show 200M and 400M.
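The two accounting policies can be modeled with a small toy tree. This is only a sketch of the idea, not the kernel's data structures: the `Memcg` class, its fields, and the group names are assumptions made for illustration.

```python
M = 1 << 20  # one megabyte in bytes


class Memcg:
    """Toy model of a memory cgroup with optional hierarchical accounting."""

    def __init__(self, name, parent=None, use_hierarchy=False):
        self.name = name
        self.parent = parent
        self.use_hierarchy = use_hierarchy
        self.usage = 0

    def charge(self, nbytes):
        # Always charge the group itself.
        self.usage += nbytes
        # In hierarchical mode, every ancestor is charged as well.
        node = self
        while node.parent is not None and node.parent.use_hierarchy:
            node = node.parent
            node.usage += nbytes


def build_and_charge(use_hierarchy):
    """Tree from the slide: top -> a (200M), top -> b -> leaf (400M)."""
    top = Memcg("top", use_hierarchy=use_hierarchy)
    a = Memcg("a", top, use_hierarchy)
    b = Memcg("b", top, use_hierarchy)
    leaf = Memcg("leaf", b, use_hierarchy)
    a.charge(200 * M)
    leaf.charge(400 * M)
    return {g.name: g.usage // M for g in (top, a, b, leaf)}
```

Calling `build_and_charge(True)` gives 600 for the top group and 400 for the middle one, while `build_and_charge(False)` leaves both at 0.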
Example: hierarchical mode
Set hierarchical mode.
Memory usage is propagated to the parent.
Example: non-hierarchical mode
Unset hierarchical mode.
No propagation to the parent.
Under non-hierarchical mode, parent and child cgroups are independent of each other.
LRU
Memory cgroup reclaims memory when usage hits the limit.
All pages are tracked on a linked list, the LRU.
When reclaiming memory, memory cgroup scans its own LRU list and selects victim pages.
[Figure: when usage hits the limit, the linked list of pages is scanned and unused pages are dropped.]
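The reclaim behavior described above can be illustrated with a toy LRU. This is a deliberately simplified sketch: the kernel keeps several lists per zone and does much more when selecting victims, and the `MemcgLRU` class and its page names are hypothetical.

```python
from collections import OrderedDict


class MemcgLRU:
    """Toy per-memcg LRU: the least recently used page is reclaimed first."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.pages = OrderedDict()  # oldest entry first
        self.reclaimed = []

    def touch(self, page):
        if page in self.pages:
            # A recently used page moves to the young end of the list.
            self.pages.move_to_end(page)
            return
        # Charging a new page: reclaim from the old end while at the limit.
        while len(self.pages) >= self.limit:
            victim, _ = self.pages.popitem(last=False)
            self.reclaimed.append(victim)
        self.pages[page] = True


lru = MemcgLRU(limit_pages=3)
for page in ["a", "b", "c", "a", "d"]:
    lru.touch(page)
```

With a limit of three pages, touching "d" forces a reclaim, and the victim is "b" because "a" was used again more recently.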
LRU implementation (1)
Before 3.3, the LRU was duplicated.
The global per-zone page LRU list scans all pages.
Memory cgroup scans its own LRU.
[Figure: separate LRUs for memcg A, memcg B, and memcg C alongside the global LRU.]
This consumes 16 bytes/page.
LRU implementation (2)
Since 3.3, the LRUs are unified.
The global LRU is re-implemented as the group of all per-memcg, per-zone LRU linked lists.
[Figure: LRUs of memcg A, memcg B, and memcg C.]
This saves 16 bytes/page.
Threshold notification
Notifies via eventfd when usage crosses a specified value.
[Figure: usage crosses the threshold; the notification arrives via eventfd and is received with poll().]
See: Documentation/cgroups/cgroup_event_listener.c
Example: threshold
Wait for usage to reach 300M bytes on TestCgroup.
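Registering a threshold can be sketched as follows. It assumes the cgroup v1 convention of writing a line of the form "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to `cgroup.event_control`; the function names are hypothetical, and the file descriptor numbers in the usage line are placeholders rather than real descriptors.

```python
def event_control_line(event_fd, usage_fd, threshold_bytes):
    """Build the registration line to write into cgroup.event_control."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"


def crossed(prev_usage, new_usage, threshold):
    """True when usage moves across the threshold, in either direction."""
    return (prev_usage < threshold) != (new_usage < threshold)


# Wait for usage to reach 300M bytes, as in the TestCgroup example.
THRESHOLD = 300 * 1024 * 1024
line = event_control_line(5, 6, THRESHOLD)
```

After registration, a listener would block in poll() or read() on the eventfd until the usage crosses the threshold.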
OOM block/notifier
Memory cgroup can:
Block the OOM killer under a memcg.
Notify of OOM via eventfd.
If the OOM killer is blocked, all tasks under the cgroup will stop.
Tasks run again if:
• Some resources are freed.
• A task gets a signal.
• The limit is raised.
• A task is moved to another cgroup.
Example: OOM
Set the limit and wait for OOM.
All tasks are stopped.
Check the status and kill a task by hand.
The remaining tasks run again.
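Checking the status before killing a task by hand could be sketched like this. It assumes the cgroup v1 file `memory.oom_control`, which reports `oom_kill_disable` and `under_oom` when read and accepts "1" to block the OOM killer; the helper names are hypothetical.

```python
import os


def disable_oom_killer(cgroup_dir):
    """Block the OOM killer for this memcg by writing 1 to memory.oom_control."""
    with open(os.path.join(cgroup_dir, "memory.oom_control"), "w") as f:
        f.write("1")


def parse_oom_control(text):
    """Parse 'key value' lines such as 'under_oom 1' into a dict of ints."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            status[parts[0]] = int(parts[1])
    return status


def tasks_are_stalled(status):
    """Tasks wait when the killer is blocked and the group is under OOM."""
    return status.get("oom_kill_disable") == 1 and status.get("under_oom") == 1


sample = "oom_kill_disable 1\nunder_oom 1\n"
status = parse_oom_control(sample)
```

The `sample` string imitates the file contents for a group whose tasks are stopped; on a real system the text would be read from the group's `memory.oom_control`.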
TCP memcontrol
Controls memory usage for TCP (since 3.3).
If memory usage hits the limit:
• INPUT: packets are dropped.
• OUTPUT: waits for available memory.
For now, this works independently of the other memory controls for anon, file, and swap.
Performance
Comparison between 2.6.32 and 3.4.
Transparent Hugepage is disabled.
Checks the overhead of memory cgroup.
Create a tree /cgroup/memory/L0/L1/L2/L3/L4/L5/L6/L7/L8/L9/L10 with use_hierarchy=1.
Run a mini-benchmark on root, L1, L2, and L10.
Overhead of hierarchy
If use_hierarchy=1, several counters must be updated at once.
[Figure: tree of root, L0, L1, and so on down to L10.]
The root cgroup is not controlled and has no counters.
In this test, all usage in L1…L10 is propagated up to L0, so the number of counters to update, and with it the overhead, grows in deep groups.
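The growth in counter updates with depth can be made concrete with a short calculation. This is a sketch of the counting argument only, assuming one counter per non-root group on the path; the function name and the path notation are hypothetical.

```python
def counters_updated(path, use_hierarchy=True):
    """Number of usage counters touched by one charge in the given group.

    `path` is relative to the root cgroup, e.g. "L0/L1"; "" means root,
    which is not controlled and has no counters.
    """
    groups = [part for part in path.split("/") if part]
    if not groups:
        return 0
    # With use_hierarchy=1 the charge is propagated to every ancestor up
    # to L0; without it only the group's own counter is updated.
    return len(groups) if use_hierarchy else 1


deepest = "/".join(f"L{i}" for i in range(11))  # L0/L1/.../L10
```

Under these assumptions a charge in L10 touches eleven counters, against one in L0 and none in root.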
tar -xpf
tar -xpf linux-3.5.tar onto tmpfs.
Checks file cache creation overhead in 'sys' time.
A flat graph is better.
[Figure: 'sys' time of tar -xpf at root, L0, L1, and L10 on 2.6.32 and 3.4.1; 4 cores/1 socket.]
rm -rf
# rm -rf linux-3.5 on tmpfs
Checks file cache deletion overhead in 'sys' time.
A flat graph is better.
[Figure: 'sys' time of rm -rf at root, L0, L1, and L10 on 2.6.32 and 3.4.1; 4 cores/1 socket.]
Parallel page faults
Cause page faults in parallel.
[Figure: several processes each repeat page fault and unmap, causing lock contention in memory cgroup.]
Each process touches and frees 256k of memory.
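The workload described above could be approximated as follows. This is a rough sketch and not the benchmark used for the slides: the function names are hypothetical, and interpreter overhead would dominate any timing, so it only illustrates the touch-and-unmap loop run by each process.

```python
import mmap
import multiprocessing


def touch_and_free(size=256 * 1024):
    """Map anonymous memory, write to every page, then unmap it."""
    buf = mmap.mmap(-1, size)  # anonymous mapping, not backed by a file
    touched = 0
    try:
        for offset in range(0, size, mmap.PAGESIZE):
            buf[offset] = 1  # the first write to a page causes a page fault
            touched += 1
    finally:
        buf.close()  # unmapping frees the pages again
    return touched


def worker(iterations):
    """Repeat the touch/free cycle and return the number of pages touched."""
    return sum(touch_and_free() for _ in range(iterations))


def run_parallel(nproc, iterations):
    """Run the same loop in nproc processes at once."""
    with multiprocessing.Pool(nproc) as pool:
        return pool.map(worker, [iterations] * nproc)
```

To exercise a particular memcg, the processes would first have to be attached to it; that step is left out here.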
Parallel page faults (1)
[Figure: speed of parallel page faults under memcg, in page faults per process per minute, by depth (L0, L1, L2, L10) and number of processes (1 to 4); series 2.6-t1 through 2.6-t4 and 3.4-t1 through 3.4-t4; 4 cores/1 socket.]
Parallel page faults (2)
[Figure: page faults per process per minute on linux-3.4 for 1 to 20 processes, root vs. L0; 10 cores/2 sockets. Annotations: 175Gb alloc/free per min, 1.1TB alloc/free per min, 1.9TB alloc/free per min, 2.2TB alloc/free per min.]
News & TODO
Many things are still on the TODO list:
Split locks
Hugetlb controls
Kmem (slab) controls
Soft limit renewal
• Idle memcg
Dirty throttling
Reduce memory overhead (16 bytes)
Per-memcg kswapd
Split locks
Now, a per-zone, per-memcg LRU list is maintained.
But the lock is still shared among memcgs.
The implementation is very complicated.
Hugh and Konstantin are working on this.
[Figure: LRUs of memcg A, memcg B, and memcg C sharing one lock.]
Hugetlb controls
Controls for hugetlbfs.
Implements a per-cgroup hugetlbfs quota.
Development was almost finished, but the author started a redesign after rejections by other committers.
A separate hugetlbfs cgroup may be added rather than enhancing memcg.
Aneesh is working on this.
Kmem (slab) controls
A function to account and limit kernel memory.
Under development for half a year.
Supports slab/slub.
Under discussion: per-page accounting vs. per-object accounting.
The biggest concern is performance.
• Achieving both isolation and performance is the challenge.
Costa and Suleiman are working on this.
• The work seems to be done in cooperation with Christoph Lameter.
Soft limit renewal
Now, memory cgroup has a soft limit for hinting the system's memory reclaim priority.
Some people complain that:
isolation using the soft limit doesn't work well;
it doesn't work as expected.
A total re-implementation is planned.
Ying Han, Johannes, and Michal are working on this.
Idle memcg
From the soft limit discussion:
[Figure: memory reclaim over a tree with root at the top, cg1, cg2, and cg3 below it, and cg1-1 under cg1.]
• If a soft limit is set, kswapd chooses the victim cgroup by it.
• What happens when a cgroup has been idle for a long time but doesn't hit its soft limit?
Dirty throttling
On the TODO list for 3 years.
Without this:
background write-out doesn't start quickly enough;
memory reclaim sees too many dirty pages.
Patches were posted several times.
There is a new implementation of I/O-less dirty throttling.
Updates are needed.
Reduce memory overhead
Now uses 16 bytes/page (on x86-64):
4 Mbytes per 1 Gbyte.
It seems possible to make this 8 bytes/page
by merging this into the 'unused' 8 bytes in 'struct page'.
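The overhead figures follow directly from the page size. A small worked calculation, assuming 4096-byte pages as stated in the requirements (the function name is hypothetical):

```python
GIB = 1 << 30
MIB = 1 << 20


def tracking_overhead(total_bytes, bytes_per_page=16, page_size=4096):
    """Bytes of per-page bookkeeping needed to track total_bytes of memory."""
    return (total_bytes // page_size) * bytes_per_page


current = tracking_overhead(GIB)                     # 16 bytes/page
reduced = tracking_overhead(GIB, bytes_per_page=8)   # proposed 8 bytes/page
```

One gigabyte holds 262144 pages of 4096 bytes, so 16 bytes/page costs 4 Mbytes and 8 bytes/page would cost 2 Mbytes.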
Per-memcg kswapd
Now, memory cgroup has no kswapd.
Run a kthread to reduce memory usage when usage is near the limit.
Move the CPU cost of memory reclaim from applications to the kthread.
With this, memory is reclaimed in 'idle' time and applications get better latency.
• Should the CPU priority of the kswapd kthread be controllable?
An old patch showed good results.
Waiting for the lock-splitting patches to avoid contention.
Parallel page fault/touch 2M + THP (3)
Increase the buffer size to 2M and enable THP.
[Figure: page faults for 1 to 20 processes; series root+4k page, root+THP, and L0+THP.]