CLSF 2010
Agenda
• Photos
• Attendance Overview
• Key Topics
• Industry Talks
LSF 2010, Shanghai
LSF 2010 participant/companies
Company     # of participants   Key background
Intel       5                   Kernel performance, SSD, mem mgmt
EMC         5                   Storage, file system
Fujitsu     4                   IO Controller, btrfs
Taobao      3                   Distributed storage, taobao server
Novell      2                   Suse server, HA
Oracle      2                   OCFS2 dev/test
Baidu       2                   Baidu kernel optimization
Canonical   2
Redhat      1                   Network driver
Key topics
Topic                      Slides?   Description
ftrace                     N         Kernel tracing
Page writeback             Y         Dirty page ratio limit
                                     Control process to write pages
CFQ, IO controller         Y         CFQ introduction and further features
BTRFS                      N         Memory consumption, fsck speed
SSD/Block layer            Y         Block layer issues with SSD
VFS scalability            N         Multi-core challenges
Kernel testing             Y         Intel kernel auto test framework
Industrial talk: Taobao    Y         TFS, Tair
Industrial talk: Baidu     N         The architecture of Baidu search system
Industrial talk: EMC       N         FSCK
Writeback - Wu Fengguang
• vmscan is a bottleneck
Decrease the dirty ratio under memory pressure, so vmscan is less likely to find dirty pages during page allocation.
• pageout(page) calls writepage() to write to disk, which is a performance killer since it does random writes
Let the flusher write instead, expanding a single 4K write into a 4MB write, so more dirty pages are flushed and reclaimed.
• balance_dirty_pages() should not write: random writes kill performance
Let the flusher write and ask the process to sleep. Three proposals:
a). wait for IO completion: NFS completions are bumpy, so a smoother sleep method is needed.
b). sleep (dirtied *3/2 / write_bandwidth)
c). sleep (dirtied / throttle_bandwidth)
• Flusher default write size (4MB -> 128MB) will be dynamic in the future.
Baidu's practice: SSD random write is really bad. For sequential writes, increasing the wb size (4MB -> 40MB) gets 120% SSD performance
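Proposals b) and c) are simple arithmetic on how much the process has dirtied. A minimal sketch, assuming sizes in bytes and bandwidths in bytes per second (the slide gives no units); the function names are hypothetical and this is not the kernel implementation:

```python
MIB = 1024 * 1024


def pause_by_write_bandwidth(dirtied: float, write_bandwidth: float) -> float:
    """Proposal b): sleep for dirtied * 3/2 / write_bandwidth seconds."""
    if write_bandwidth <= 0:
        raise ValueError("write_bandwidth must be positive")
    return dirtied * 3 / 2 / write_bandwidth


def pause_by_throttle_bandwidth(dirtied: float, throttle_bandwidth: float) -> float:
    """Proposal c): sleep for dirtied / throttle_bandwidth seconds."""
    if throttle_bandwidth <= 0:
        raise ValueError("throttle_bandwidth must be positive")
    return dirtied / throttle_bandwidth


if __name__ == "__main__":
    # Assumed example numbers: 4 MiB dirtied, 64 MiB/s disk, 32 MiB/s throttle.
    print(pause_by_write_bandwidth(4 * MIB, 64 * MIB))     # 0.09375
    print(pause_by_throttle_bandwidth(4 * MIB, 32 * MIB))  # 0.125
```

In both variants the sleep time grows with the amount dirtied, so a heavy writer is slowed down while the flusher does the actual IO.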
Btrfs - Coly Li
• Has a lot of love from the Linux community
two years ago > now
• Used in MeeGo Project
• Taobao plans to push industrial deployment in 2-3 years
10T per data server in TFS cluster
Use on SSD and SATA hybrid data server
Metadata will be allocated on SSD and data on SATA.
• Dynamic data relocation with hot data tracking patch.
For generic fs usage, need to deal with device mapper to get device speed information.
• FSCK
Difficult, but a must. Currently assigned to Fujitsu.
SSD challenges - Li Shaohua
• Throughput: same issue as network
• Disk controller gap and big locks (queue locks & scsi locks)
• Interrupt related:
a. smp affinity: single queue, one CPU to deal with irqs
b. blk_iopoll: poll more req in one irq
• Need hardware multiqueue
• CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
• Queue lock contention vs. cache lock contention
See Andi Kleen's talk in Tokyo Linux Conf
• Intel is building a next-generation PCIe SSD with many fancy features. Stay tuned
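The blk_iopoll point can be illustrated with a toy model: one interrupt per completed request versus one interrupt that polls up to a fixed budget of completions. This is a simplified sketch, not the kernel code; it assumes every completion is already pending when polling starts, and the function name and the budget parameter are made up for illustration.

```python
from collections import deque


def interrupts_needed(num_completions: int, budget: int) -> int:
    """Count interrupts when each interrupt polls up to `budget` completions."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    pending = deque(range(num_completions))
    interrupts = 0
    while pending:
        interrupts += 1
        # Drain at most `budget` completed requests before returning.
        for _ in range(min(budget, len(pending))):
            pending.popleft()
    return interrupts


if __name__ == "__main__":
    print(interrupts_needed(10, 1))  # 10: one interrupt per request
    print(interrupts_needed(10, 4))  # 3: polling batches the completions
```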
VFS scalability - Ma Tao
• With multi-cores, all global locks suck
• Global icache/dcache can be adapted to per-CPU
• CFQ can be adapted to per-queue
• The fewer global locks the better
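The per-CPU idea can be sketched as a counter split into shards, each with its own lock, so writers on different CPUs never contend on one global lock. This is a structural illustration only: the class name is hypothetical, the shard index stands in for a CPU id, and Python's interpreter lock means it will not show the real speedup.

```python
import threading


class ShardedCounter:
    """Counter split into shards, each guarded by its own lock."""

    def __init__(self, shards: int = 4) -> None:
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def add(self, cpu: int, n: int = 1) -> None:
        # Writers touch only the shard chosen by their (assumed) CPU id.
        i = cpu % len(self._counts)
        with self._locks[i]:
            self._counts[i] += n

    def total(self) -> int:
        # Readers pay the cost: they visit every shard to get the sum.
        result = 0
        for i, lock in enumerate(self._locks):
            with lock:
                result += self._counts[i]
        return result


if __name__ == "__main__":
    counter = ShardedCounter(4)
    for cpu in range(4):
        counter.add(cpu, 10)
    print(counter.total())
```

The trade-off is the usual one for per-CPU data: updates become cheap and local, while reading the global value has to walk all shards.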
Industry talk – Baidu (Cont.)
• Service types
a). Indexing: random read, high IOPS, small IO size, read-only. 80M records per data
node, and process 8-9 K queries per second.
b). Distributed system: large files, sequential read/write.
c). cache/KV storage: between a and b
d). Web Server: CPU bound.
• For a), read() sucks. Use mmap() to read blocks ahead to avoid kernel/userspace memory copy.
mmap() cannot use the page cache LRU. Call readahead() after each mmap() to mark pages as read.
mmap() page faults are expensive because of the mm->mmap_sem lock. Use sync_readahead() and sync_readaheadv()
With the above, memory is now the bottleneck. Doing 10G+ MB read.
• Google patch for reducing mm->mmap_sem hold time
In do_page_fault(), drop mm->mmap_sem if the page is not found, then read it and take the lock again.
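The mmap-plus-readahead pattern for service type a) can be sketched from user space. Python does not expose readahead(2) directly, so this sketch swaps in os.posix_fadvise() with POSIX_FADV_WILLNEED as an assumed stand-in; read_block is a hypothetical helper, not Baidu's code.

```python
import mmap
import os


def read_block(path: str, offset: int, length: int) -> bytes:
    """Hint the kernel to prefetch a block, then read it through mmap()."""
    with open(path, "rb") as f:
        fd = f.fileno()
        if hasattr(os, "posix_fadvise"):
            try:
                # Advisory prefetch hint (stand-in for readahead()).
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # The hint is optional; fall back to plain page faults.
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing copies only the requested block into a bytes object.
            return mapped[offset:offset + length]
```

In C the caller would use the mapped memory in place; the slice copy here exists only because the mapping is closed before returning.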
Industry talk – Baidu (Cont.)
• 8K filesystem block size (ext2) with 8K page size. Allocate two contiguous pages each time
Get better performance for sequential IO
OCFS2 uses 1MB fs block size and 4K page size
• PCIE compress card + ECC
Industry Talk - Taobao
• TFS
• Tair uses an update server to record updates; updates are applied to the production system at midnight
• A config server is used to minimize meta server workload
A versioned bucket table is maintained by the config server and stored in each data server. Clients can work out data location from the bucket table returned by the config server.
• Both TFS and Tair are open source projects now
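The versioned bucket table can be sketched as follows. This is an illustration under stated assumptions, not Tair's actual protocol: the CRC32 hash, the class and method names, and the refresh-on-version-mismatch rule are all hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BucketTable:
    version: int
    servers: Tuple[str, ...]  # servers[i] owns bucket i


def bucket_of(key: bytes, num_buckets: int) -> int:
    # Assumed hash function; the slides do not say which one is used.
    return zlib.crc32(key) % num_buckets


class ConfigServer:
    def __init__(self, servers) -> None:
        self.table = BucketTable(1, tuple(servers))

    def reassign(self, bucket: int, server: str) -> None:
        # Every change publishes a new table with a higher version.
        servers = list(self.table.servers)
        servers[bucket] = server
        self.table = BucketTable(self.table.version + 1, tuple(servers))


class Client:
    def __init__(self, config: ConfigServer) -> None:
        self._config = config
        self._table = config.table  # fetched once, then cached

    def locate(self, key: bytes, seen_version: Optional[int] = None) -> str:
        # Refetch only when a newer version was seen (assumed rule).
        if seen_version is not None and seen_version != self._table.version:
            self._table = self._config.table
        servers = self._table.servers
        return servers[bucket_of(key, len(servers))]
```

Because the client resolves locations locally from its cached table, the config server is only contacted when the table version changes.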
Industry Talk - EMC
• Introduced the recovery and check methods used in file systems from ext2 to btrfs
• Emphasized the importance of FSCK; introduced the issues FSCK faces when checking a huge file system; collected proposals to solve this problem
• pNFS learning notes
Q & A