20101030 clsf2010


Page 1: 20101030 clsf2010

CLSF 2010

Page 2

Agenda

• Photos

• Attendance Overview

• Key Topics

• Industry Talks

Page 3

CLSF 2010, Shanghai

Page 4

CLSF 2010 participants/companies

Company     # of participants   Key background
Intel       5                   Kernel performance, SSD, memory management
EMC         5                   Storage, file systems
Fujitsu     4                   I/O controller, btrfs
Taobao      3                   Distributed storage, Taobao servers
Novell      2                   SUSE server, HA
Oracle      2                   OCFS2 dev/test
Baidu       2                   Baidu kernel optimization
Canonical   2                   -
Red Hat     1                   Network drivers

Page 5

Key topics

Topic                   Slides?  Description
ftrace                  N        Kernel tracing
Page writeback          Y        Dirty page ratio limit; control of which processes write pages
CFQ, I/O controller     Y        CFQ introduction and further features
BTRFS                   N        Memory consumption, fsck speed
SSD/block layer         Y        Block-layer issues with SSDs
VFS scalability         N        Multi-core challenges
Kernel testing          Y        Intel kernel auto-test framework
Industry talk: Taobao   Y        TFS, Tair
Industry talk: Baidu    N        The architecture of the Baidu search system
Industry talk: EMC      N        FSCK

Page 6

Writeback - Wu Fengguang

• vmscan is a bottleneck

Decrease the dirty ratio under memory pressure, so vmscan is less likely to find dirty pages during page allocation.

• pageout(page) calls writepage() to write to disk, which is a performance killer since it does random writes

Let the flusher do the writing instead; expand a single 4K write into a 4MB write, so more dirty pages are reclaimed and flushed.

• balance_dirty_pages() should not write: random writes kill performance

Let the flusher write and put the process to sleep. Three proposals:

a) wait for I/O completion: NFS completion is bumpy, needs a smoother sleep method

b) sleep(dirtied * 3/2 / write_bandwidth)

c) sleep(dirtied / throttle_bandwidth)

• Flusher default write size (4MB -> 128MB), to become dynamic in the future

Baidu's practice: SSD random write performance is really bad. For sequential writes, increasing the writeback size (4MB -> 40MB) yields roughly 120% of the baseline SSD performance.

Page 7

Btrfs - Coly Li

• Has received a lot of love from the Linux community, though more two years ago than now

• Used in the MeeGo project

• Taobao plans to push industrial deployment in 2-3 years

10TB per data server in the TFS cluster

To be used on SSD and SATA hybrid data servers: metadata allocated on SSD, data on SATA

• Dynamic data relocation with the hot-data-tracking patch

For generic filesystem usage, it needs to work with the device mapper to get device speed information.

• FSCK

A difficult must. Currently assigned to Fujitsu.

Page 8

SSD challenges - Li Shaohua

• Throughput: same issue as networking

• Disk controller gap and big locks (queue locks and SCSI locks)

• Interrupt related:

a) SMP affinity: single queue, one CPU to handle the IRQs

b) blk_iopoll: poll multiple requests per interrupt

• Need hardware multiqueue

• CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)

• Queue lock contention vs. cache lock contention

See Andi Kleen's talk at the Tokyo Linux Conference

• Intel is building a next-generation PCIe SSD with many fancy features. Stay tuned

Page 9

VFS scalability - Ma Tao

• With multiple cores, every global lock hurts

• The global icache/dcache can be made per-CPU

• CFQ can be made per-queue

• The fewer global locks, the better

Page 10

Industry Talk - Baidu

• Service types

a) Indexing: random reads, high IOPS, small I/O size, read-only. 80M records per data node, processing 8-9K queries per second.

b) Distributed system: large files, sequential read/write.

c) Cache/KV storage: between a and b.

d) Web server: CPU bound.

• For a), read() is too slow. Use mmap() to read blocks ahead and avoid the kernel/userspace memory copy.

mmap() cannot make use of page-cache LRU on its own; call readahead() after each mmap() to mark the pages as read.

mmap() page faults are expensive under the mm->mmap_sem lock. Use sync_readahead() and sync_readaheadv().

With the above, memory is now the bottleneck, with reads of 10+ GB.

• Google patch for reducing mm->mmap_sem hold time

In do_page_fault(), drop mmap_sem if the page is not found, then read it in and take the lock again.

Page 11

Industry Talk - Baidu (Cont.)

• 8K filesystem block size (ext2) with 8K page size; allocate two contiguous pages each time

Gets better performance for sequential I/O

For comparison, OCFS2 uses a 1MB cluster size with a 4K page size

• PCIe compression card + ECC

Page 12

Industry Talk - Taobao

• TFS

• Tair uses an update server to record updates; updates are applied to the production system overnight

• A config server is used to minimize metadata-server workload

A versioned bucket table is maintained by the config server and stored on each data server. Clients locate data using the bucket table returned by the config server.

• Both TFS and Tair are now open-source projects

Page 13

Industry Talk - EMC

• Introduced the recovery and checking methods used in filesystems from ext2 to btrfs

• Emphasized the importance of FSCK; introduced the issues FSCK hits when checking a huge filesystem; collected proposals to solve this problem

• pNFS learning notes

Page 14

Q & A