
Stonehenge: Multi-Dimensional Storage Virtualization

Lan Huang, IBM Almaden Research Center

Joint work with Gang Peng and Tzi-cker Chiueh

SUNY Stony Brook

June, 2004


Introduction

Storage growth is phenomenal: new hardware
Isolated storage: resource waste
Management

[Diagram: clients connected over an IP LAN/MAN/WAN to database and file servers]

[Chart: areal density growth by year, 1970-2000 [Patterson'98]]

Huge amounts of data on heterogeneous devices, spread out everywhere.


Storage Virtualization

Examples: LVM, xFS, StorageTank
Hide physical details from high-level applications

[Diagram: the storage virtualization layer sits between applications/OS/storage management and the hardware resources (disks, controllers), exporting virtual disks to clients through an abstract interface over the physical disks]


Storage Virtualization

Storage consolidation
A VD should be as tangible as a PD: capacity, throughput, latency
Resource efficiency Ei


Stonehenge Overview

Input: VD (B, C, D, E); output: VDs with performance guarantees
High-level goals: storage consolidation, performance isolation, efficiency, performance

[Diagram: clients connect over an IP LAN/MAN/WAN to database and file servers backed by the Stonehenge cluster (LAN)]


Hardware Organization

[Diagram: hardware organization. Client applications use an in-kernel storage clerk and talk over a Gigabit network to the storage manager and to storage servers, each with a disk array; control messages and data/commands flow over object and file interfaces.]


Key Issues in Stonehenge

How to ease the task of storage management? Centralization, virtualization, consolidation
How to achieve performance isolation among virtual disks? Run-time QoS guarantees
How to do it efficiently? Efficiency-aware algorithms, dynamic adaptive feedback


Key Components

Mapper, CVC scheduler, and the feedback path between them


Virtual to Physical Disk Mapping

Multi-dimensional disk mapping: NP-complete
Goal: maximize resource utilization
Heuristic: maximize a goal function [toyota75]
Input: VDs, PDs; goal function G: max(G); output: a VD-to-PD mapping
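The slides do not spell out the goal function G, so the following is only a minimal Python sketch of the greedy mapping step. It assumes two resource dimensions (bandwidth and capacity) and a hypothetical goal function that favors placements leaving balanced leftovers; it is not necessarily the [toyota75] heuristic used by Stonehenge.

```python
# Hypothetical sketch of a greedy, goal-function-driven VD-to-PD mapper.
# The resource dimensions and the goal function are assumptions; the slide
# only states "maximize G" over VD-to-PD assignments.
from dataclasses import dataclass

@dataclass
class PD:                 # physical disk with remaining resources
    bandwidth: float      # IOPS still available
    capacity: float       # GB still available

@dataclass
class VD:                 # virtual disk request
    bandwidth: float
    capacity: float

def goal(pd: PD, vd: VD) -> float:
    """Assumed goal function: prefer the PD whose leftover bandwidth and
    capacity stay balanced, so neither dimension is stranded."""
    if pd.bandwidth < vd.bandwidth or pd.capacity < vd.capacity:
        return float("-inf")                    # infeasible placement
    bw_left = (pd.bandwidth - vd.bandwidth) / pd.bandwidth
    cap_left = (pd.capacity - vd.capacity) / pd.capacity
    return -abs(bw_left - cap_left)             # balanced leftovers score higher

def map_vds(vds: list, pds: list) -> dict:
    """Greedily place each VD on the PD that maximizes the goal function."""
    mapping = {}
    for i, vd in enumerate(vds):
        best = max(range(len(pds)), key=lambda j: goal(pds[j], vd))
        if goal(pds[best], vd) == float("-inf"):
            continue                            # no PD can host this VD
        pds[best].bandwidth -= vd.bandwidth
        pds[best].capacity -= vd.capacity
        mapping[i] = best
    return mapping
```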


Islands Effect

[Diagram: a set of VDs mapped onto PDs 1-4, illustrating the islands effect]


Key Components

Mapper, CVC scheduler, and the feedback path between them


Requirements of Real-time Disk Scheduling

Disk-specific: improve disk bandwidth utilization (SATF, CSCAN, etc.)
Non-disk-specific: meet real-time requests' deadlines; fair disk bandwidth allocation among virtual disks (virtual clock scheduling)
Key: bandwidth guarantees

[Diagram: disk service time broken into seek, rotation, transfer, and other components]


CVC Algorithm

Two queues per VD(m): an FT (finish time) queue and an LBA queue
FT(i) = max(FT(i-1), real time) + 1/IOPSm
The LBA queue is used only if the FT queue's slack time allows it:
real time + service time(R) < starting deadline of the next request

[Diagram: the FT and LBA queues of VD(m) feed the CVC scheduler]
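A minimal Python sketch of the two-queue idea (not the actual Stonehenge scheduler): finish tags follow FT(i) = max(FT(i-1), real time) + 1/IOPSm, and the seek-friendly LBA ordering is consulted only when the slack test passes. The request objects, LBAs, and the per-request service-time estimate are assumed inputs.

```python
import heapq, itertools, time

class CVCQueue:
    """Per-VD state for a CVC-style scheduler (illustrative sketch only).

    FT(i) = max(FT(i-1), real time) + 1/IOPS_m assigns each request a virtual
    finish tag; requests are normally served in FT order, and the LBA-sorted
    ordering is used only when the FT queue has enough slack.
    """
    _seq = itertools.count()          # tie-breaker for heap entries

    def __init__(self, iops_reservation):
        self.iops = iops_reservation
        self.last_ft = 0.0
        self.ft_queue = []            # (finish_tag, seq, request)
        self.lba_queue = []           # (lba, seq, request)
        self.dispatched = set()       # lazy deletion across both orderings

    def enqueue(self, request, lba):
        now = time.monotonic()
        self.last_ft = max(self.last_ft, now) + 1.0 / self.iops
        seq = next(self._seq)
        heapq.heappush(self.ft_queue, (self.last_ft, seq, request))
        heapq.heappush(self.lba_queue, (lba, seq, request))

    def _head(self, queue):
        while queue and queue[0][1] in self.dispatched:
            heapq.heappop(queue)      # drop entries already served
        return queue[0] if queue else None

    def pick_next(self, service_time_estimate):
        """Serve from the LBA queue only if that request can finish before
        the starting deadline (finish tag) of the next FT-queue request."""
        head_ft = self._head(self.ft_queue)
        if head_ft is None:
            return None
        head_lba = self._head(self.lba_queue)
        now = time.monotonic()
        if head_lba and now + service_time_estimate < head_ft[0]:
            chosen = head_lba         # efficiency: seek-friendly pick
        else:
            chosen = head_ft          # real time: earliest finish tag
        self.dispatched.add(chosen[1])
        return chosen[2]
```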


Real-life Deployment

Dispatch the next N requests from the LBA queue
The next batch will not be issued until the previous batch is done.

[Diagram: the FT and LBA queues of VD(m) feed the CVC scheduler, which dispatches batches to the storage controller's on-disk scheduler]
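Continuing the sketch above, a hedged illustration of the batched dispatch: up to N requests are pulled and handed to the controller so its on-disk scheduler can reorder them, and the next batch waits for the previous one. issue_to_controller and wait_for_completion are hypothetical hooks, not Stonehenge APIs.

```python
def dispatch_batch(vd, batch_size, service_time_estimate,
                   issue_to_controller, wait_for_completion):
    """Pull up to batch_size requests from the CVCQueue sketch and issue them
    as one batch; the next batch is not issued until this one completes."""
    batch = []
    for _ in range(batch_size):
        req = vd.pick_next(service_time_estimate)
        if req is None:
            break
        batch.append(req)
    if batch:
        issue_to_controller(batch)    # the on-disk scheduler may reorder within a batch
        wait_for_completion(batch)    # serialize batches, as described above
    return batch
```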


CVC Performance

3 VDs with real-life traces: video stream, web, financial, TPC-C

Touch 40% of the storage space

[Graphs: video streams and mixed traces]


Impact of Disk I/O Time Estimate

Model disk I/O time? ATA disk: impossible [ECSL TR-81]; SCSI disk: possible?
Run-time measurement: P(I/O time)


CVC Latency Bound

If the traffic generated within the period [0, t] satisfies V(t) <= T + r*t, then:
(1) D <= (T + Lmax)/Bi + Lmax/C
(2) Storage system: D <= ((N+1)*k*C + T + Lmax)/Bi + (k*C + Lmax)/C
(3) Stonehenge: D <= (N+1)/IOPSi + 1/IOPSmax

[Diagram: the FT queue of VD(m), holding N requests and T bytes, is served at IOPS(m) out of IOPS(max); disk service time consists of seek, rotation, transfer, and other components]
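To make bound (3) concrete, here is a tiny Python evaluation; the numbers are purely illustrative assumptions, not measurements from the talk.

```python
def stonehenge_latency_bound(n, iops_i, iops_max):
    """Bound (3): D <= (N + 1)/IOPS_i + 1/IOPS_max (seconds)."""
    return (n + 1) / iops_i + 1.0 / iops_max

# Hypothetical example: a VD reserved at 100 IOPS on a disk whose aggregate
# rate is 300 IOPS, with N = 4 requests already queued.
print(stonehenge_latency_bound(n=4, iops_i=100, iops_max=300))  # ~0.053 s
```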


Key Components

Mapper, CVC scheduler, and the feedback path between them

Relaxing the worst-case service time estimate
VD multiplexing effect


Empirical Latency vs Worst Case

Approximate P(service time, N) with P(service time, N-1)

Q is P’s inverse function

D <= (Q(0.95) + s) * [(N+1)/IOPSi + 1/IOPSmax]

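One plausible way to compute Q from run-time measurements is sketched below. Because the later resource-reservation table reports Qservice(0.95) as a percentage, the samples here are assumed to be per-request service times expressed as fractions of the worst-case estimate; that interpretation, the numpy dependency, and the function names are assumptions.

```python
import numpy as np

def q_service(normalized_samples, quantile=0.95):
    """Empirical inverse CDF of measured service time.  Samples are assumed
    to be fractions of the worst-case estimate (0.11 means 11% of worst case)."""
    return float(np.quantile(normalized_samples, quantile))

def relaxed_latency_bound(normalized_samples, s, n, iops_i, iops_max,
                          quantile=0.95):
    """Relaxed bound: D <= (Q(0.95) + s) * [(N + 1)/IOPS_i + 1/IOPS_max],
    where s is a small safety margin."""
    worst_case = (n + 1) / iops_i + 1.0 / iops_max
    return (q_service(normalized_samples, quantile) + s) * worst_case
```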


Bursty I/O Traffic and Pspare

Self-similar traffic
Multiplexing effect: Pspare(x)



Latency or Throughput Bound

(Bthroughput, C, D, E): convert D --> Blatency, giving (Bthroughput, C, Blatency, E)
Bthroughput >= Blatency: throughput bound
Bthroughput < Blatency: latency bound
Bthroughput, Blatency, or even less?


MBAC for Latency Bound VDs

When the jth VD with requirements (Dj, IOPS''j, Cj, E) arrives:
1. For 0 < i <= j, convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]; let IOPSi = max(IOPS'i, IOPS''i)
2. If sum(IOPSi) < IOPSmax, accept the new VD; otherwise, reject it.
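A minimal Python sketch of this admission test, under the same interpretation of Q as above; the VD representation (a dict with a latency target 'D' and an explicit throughput requirement 'IOPS') is hypothetical.

```python
def latency_to_iops(d_i, q95, s, n, iops_max):
    """Invert D_i <= (Q(0.95) + s) * [(N + 1)/IOPS'_i + 1/IOPS_max] to get the
    IOPS reservation IOPS'_i needed to meet the latency target D_i."""
    budget = d_i / (q95 + s) - 1.0 / iops_max
    if budget <= 0:
        return float("inf")           # latency target unattainable on this disk
    return (n + 1) / budget

def admit_latency_bound(existing_vds, new_vd, q95, s, n, iops_max):
    """Recompute every VD's effective reservation and admit the new VD only
    if the total stays below IOPS_max (step 2 on the slide)."""
    total = 0.0
    for vd in existing_vds + [new_vd]:
        total += max(latency_to_iops(vd["D"], q95, s, n, iops_max), vd["IOPS"])
    return total < iops_max
```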


MBAC Performance Pservice

Table 1. Maximum number of VDs accepted.
Run    VD Type    Probability   Deterministic   MBAC   Oracle
Run 1  Financial  95%           7               20     22
Run 2  Mixed      95%           7               14     14
Run 3  Mixed      85%           7               17     17

Table 2. Resource Reservation.
Number of VDs       7     9     10    11    13    14    15
Q_service(0.95)     11%   15%   19%   24%   37%   49%   -
MBAC                N/A   38%   43%   47%   55%   67%   95%
Deterministic       90%   -     -     -     -     -     -


MBAC for Throughput Bound VDs

When the jth VD (Dj, IOPS''j, Cj, E) arrives:
Convert Dj to IOPS'j: Dj <= (Qservice(0.95) + s) * [(N+1)/IOPS'j + 1/IOPSmax]
Let IOPSj = max(IOPS'j, IOPS''j)
If IOPSj < Qspare(E), admit the new VD; otherwise, reject it.


MBAC Performance Pspare

VD 0: TPC-C; VD 1: financial; VD 2: web search


Measurement-based Admission Control (MBAC)

When the jth VD with requirements (Dj, IOPS''j, Cj, E) arrives:
1. For 0 < i <= j, convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]; let IOPSi = max(IOPS'i, IOPS''i)
2. Group the VDs into a throughput-bounded set T and a latency-bounded set L
3. For the throughput-bound VDs, compute the combined QI/O_rate; let Qspare(x) = IOPSmax - QI/O_rate(x)
4. If sum(IOPS(L)) < Qspare(E), accept the new VD; otherwise, reject it.
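A sketch of the combined check, assuming the aggregate I/O rate of the throughput-bounded set is sampled at run time (e.g., by a sliding-window monitor); the sample list, the numpy quantile call, and the parameter names are assumptions.

```python
import numpy as np

def q_spare(agg_rate_samples, e, iops_max):
    """Q_spare(E) = IOPS_max - Q_I/O_rate(E): capacity left over after the
    throughput-bounded VDs' measured aggregate rate, at probability E."""
    return iops_max - float(np.quantile(agg_rate_samples, e))

def admit_combined(latency_set_iops, agg_rate_samples, e, iops_max):
    """Step 4 on the slide: the latency-bounded VDs' summed reservations
    (latency_set_iops, computed as in the previous sketch) must fit within
    the measured spare capacity Q_spare(E)."""
    return sum(latency_set_iops) < q_spare(agg_rate_samples, e, iops_max)
```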


Issues with Measurement

Stability: the I/O rate pattern is stable; boundary case for Pservice
Monitoring overhead: trivial
Window size
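The window-size knob can be illustrated with a simple sliding-window I/O-rate monitor; this is only a sketch, not Stonehenge's actual measurement module.

```python
from collections import deque

class RateMonitor:
    """Sliding-window I/O-rate monitor.  A larger window gives a more stable
    estimate; a smaller one reacts faster to load changes."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.completions = deque()        # timestamps of completed requests

    def record(self, timestamp):
        self.completions.append(timestamp)
        self._trim(timestamp)

    def iops(self, now):
        self._trim(now)
        return len(self.completions) / self.window

    def _trim(self, now):
        cutoff = now - self.window
        while self.completions and self.completions[0] < cutoff:
            self.completions.popleft()
```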


Putting It All Together: Stonehenge

Functionality: a general-purpose IP storage cluster
Performance: scheduling
Efficiency: measurement


Software Architecture

[Diagram: software architecture. Client kernels run an iSCSI initiator; Stonehenge storage servers run a front-end target driver (FETD), virtual (V.) and physical (P.) tables, a disk mapper, an admission controller, traffic shaping, schedulers with request queues, and an IDE mid-layer driver.]


Effectiveness of QoS Guarantees in Stonehenge

[Graphs: (a) CVC, (b) CSCAN, (c) deadline violation percentage]


Impact of Leeway Factor

[Graphs: overload probability and violation percentage]


Overall System Performance and Latency Breakdown

1 GHz CPU; IBM 7200 ATA disk array; Promise IDE controllers; 64-bit 66 MHz PCI bus; Intel GB NICs

Software Module     Average Latency (usec)
iSCSI client        57
iSCSI server        507
Disk access         1360
Central             50
Network delay 1     574
Network delay 2     2

A max of 55 MB/sec per server.


Related Work

Storage management: Minerva, etc. at HPL
Efficiency-aware disk schedulers: Cello, Prism, YFQ
Run-time QoS guarantees: web servers, video servers, network QoS
IP storage


Conclusion

The IP storage cluster consolidates storage and reduces fragmentation by 20-30%.

The efficiency-aware CVC real-time disk scheduler, with dynamic I/O time estimation, provides performance guarantees and good disk head utilization.

Measurement feedback effectively remedies over-provisioning:
Latency: Pservice, 2-3x; Throughput: Pspare, 20%; I/O time estimate: PI/O time; Load imbalance: Pleeway