1/35

Stonehenge: Multi-Dimensional Storage Virtualization

Lan Huang, IBM Almaden Research Center

Joint work with Gang Peng and Tzi-cker Chiueh

SUNY Stony Brook

June 2004

2/35

Introduction

Storage growth is phenomenal: new hardware
Isolated storage: resource waste
Management

[Figure: clients, database server, and file server connected over an IP LAN/MAN/WAN; inset graph of disk areal density (1 to 10000) versus year (1970-2000) [Patterson'98]]

Huge amounts of data on heterogeneous devices, spread out everywhere.

3/35

Storage Virtualization

Examples: LVM, xFS, StorageTank
Hide physical details from high-level applications

[Figure: analogy between the OS, which exposes an abstract interface over hardware resources (disks, controllers) to applications, and storage virtualization, which exposes virtual disks over physical disks to clients for storage management]

4/35

Storage Virtualization

Storage consolidation

VD as tangible as PD: capacity, throughput, latency

Resource efficiency Ei

5/35

Stonehenge Overview

Input: VD (B, C, D, E)
Output: VDs with performance guarantees

High-level goals: storage consolidation, performance isolation, efficiency, performance

[Figure: clients, database server, and file server connected over an IP LAN/MAN/WAN, backed by a Stonehenge cluster on a LAN]

6/35

Hardware Organization

[Figure: hardware organization. A storage manager and several storage servers, each with a disk array, connect to clients over a gigabit network. Each client runs an application on top of a kernel-level storage clerk. Clients see a file interface; storage servers export an object interface. Control messages are carried separately from data and commands.]

7/35

Key Issues in Stonehenge

How to ease the task of storage management?
Centralization, virtualization, consolidation

How to achieve performance isolation among virtual disks?
Run-time QoS guarantee

How to do it efficiently?
Efficiency-aware algorithms
Dynamic adaptive feedback

8/35

Key components

Mapper
CVC scheduler
Feedback path between them

9/35

Virtual to Physical Disk Mapping

Multi-dimensional disk mapping: NP-complete

Goal: maximize resource utilization
Heuristic: maximize a goal function [Toyoda75]

Input: VDs, PDs
Goal function G: max(G)
Output: VD-to-PD mapping
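For illustration, a minimal sketch of a greedy goal-function mapping heuristic in Python (not the exact algorithm from the talk). It assumes each VD and PD is described by a two-dimensional (IOPS, capacity) vector; the goal function and the names (goal, place_vds) are made up for this sketch.

def goal(bw_left, cap_left):
    # Illustrative goal term: prefer placements that keep leftover bandwidth and
    # capacity balanced, so neither dimension strands the other (the "islands effect").
    return -abs(bw_left - cap_left)

def place_vds(vds, pds):
    """vds, pds: dicts of name -> {"bw": IOPS, "cap": GB}. Returns VD -> PD mapping or None."""
    mapping = {}
    # Place the most demanding VDs first.
    for name, vd in sorted(vds.items(), key=lambda kv: -(kv[1]["bw"] + kv[1]["cap"])):
        feasible = [p for p, pd in pds.items()
                    if pd["bw"] >= vd["bw"] and pd["cap"] >= vd["cap"]]
        if not feasible:
            return None  # no PD can host this VD in both dimensions
        best = max(feasible,
                   key=lambda p: goal(pds[p]["bw"] - vd["bw"], pds[p]["cap"] - vd["cap"]))
        pds[best]["bw"] -= vd["bw"]
        pds[best]["cap"] -= vd["cap"]
        mapping[name] = best
    return mapping

# Example: two identical PDs, three VDs with skewed demands.
print(place_vds({"vd0": {"bw": 60, "cap": 30}, "vd1": {"bw": 30, "cap": 60},
                 "vd2": {"bw": 40, "cap": 40}},
                {"pd0": {"bw": 100, "cap": 100}, "pd1": {"bw": 100, "cap": 100}}))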

10/35

Islands Effect

[Figure: VDs mapped onto PDs 1-4, illustrating the islands effect]

11/35

Key Components

Mapper
CVC scheduler
Feedback path between them

12/35

Requirements of Real-time Disk Scheduling

Disk specific: improve disk bandwidth utilization
SATF, CSCAN, etc.

Non disk specific: meet real-time requests' deadlines; fair disk bandwidth allocation among virtual disks (virtual clock scheduling)

Key: bandwidth guarantee

[Figure: disk service time broken down into seek, rotation, transfer, and other]

13/35

CVC Algorithm

Two queues: FT and LBA

FT(i) = max(FT(i-1), real time) + 1/IOPSm

The LBA queue is used only if the FT queue's slack time allows it:
real time + service time(R) < start deadline of the next request

[Figure: per-VD FT and LBA queues feeding the CVC scheduler for VD(m)]
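A minimal sketch of the two-queue CVC dispatch decision, assuming finish tags double as start deadlines and that a per-request service-time estimate is available; the function names and the heap-based queues are illustrative, not the scheduler's actual code.

import heapq

def finish_tag(prev_ft, now, iops_m):
    # Virtual clock finish tag for VD m: FT(i) = max(FT(i-1), real time) + 1/IOPS_m
    return max(prev_ft, now) + 1.0 / iops_m

def pick_next(ft_queue, lba_queue, now, svc_time_estimate):
    """ft_queue: heap of (finish_tag, req); lba_queue: heap of (lba, req)."""
    if ft_queue and lba_queue:
        next_deadline = ft_queue[0][0]  # assumed: deadline of the next FT request
        # Serve a seek-friendly LBA request only if FT slack allows it:
        # real time + service time(R) < start deadline of the next request.
        if now + svc_time_estimate < next_deadline:
            return heapq.heappop(lba_queue)[1]
    if ft_queue:
        return heapq.heappop(ft_queue)[1]
    if lba_queue:
        return heapq.heappop(lba_queue)[1]
    return None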

14/35

Real-life Deployment

Dispatch the next N requests from LBA queue

The next batch will not be issued until the previous batch is done.

[Figure: FT and LBA queues feed the CVC scheduler for VD(m); batches go to the storage controller's on-disk scheduler]

16/35

CVC Performance

3 VDs with real-life traces: video stream, web, financial, TPC-C

Touch 40% of the storage space

[Graphs: video streams; mixed traces]

17/35

Impact of Disk I/O Time Estimate

Model disk I/O time?
ATA disk: impossible [ECSL TR-81]
SCSI disk: possible?

Run-time measurement: P(I/O time)

18/35

CVC Latency Bound

If the traffic generated within the period [0, t] satisfies V(t) <= T + r*t, then:

D <= (T + Lmax)/Bi + Lmax/C    (1)

Storage system: D <= ((N+1)*k*C + T + Lmax)/Bi + (k*C + Lmax)/C    (2)

Stonehenge: D <= (N+1)/IOPSi + 1/IOPSmax    (3)

[Figure: FT queue of VD(m) holding N requests and T bytes, with reserved rate IOPS(m) and system rate IOPS(max); disk service time breaks down into seek, rotation, transfer, and other]
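As a quick illustration of bound (3), a tiny numeric check with assumed values (N, IOPSi, and IOPSmax are not from the talk):

# Illustrative evaluation of the Stonehenge latency bound (3) with assumed values.
N = 8             # requests dispatched per batch (assumed)
iops_i = 100.0    # IOPS reserved for VD i (assumed)
iops_max = 800.0  # aggregate IOPS of the underlying disk (assumed)

d_bound = (N + 1) / iops_i + 1 / iops_max
print(f"D <= {d_bound * 1000:.2f} ms")  # (8+1)/100 + 1/800 = 91.25 ms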

19/35

Key Components

Mapper
CVC scheduler
Feedback path between them

Relaxing the worst-case service time estimate
VD multiplexing effect

20/35

Empirical Latency vs Worst Case

Approximate P(service time, N) with P(service time, N-1)

Q is P’s inverse function

D <= (Q(0.95) + s) * [(N+1)/IOPSi + 1/IOPSmax]

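A minimal sketch of how Q(0.95) could be taken from run-time measurements and plugged into the relaxed bound. One reading, suggested by the percentage entries in Table 2 later on, is that measured service times are recorded as fractions of the worst-case estimate; that reading, the sample data, the leeway s, and the other parameters are all assumptions of this sketch.

# Assumed measurement window: per-request service times as fractions of the
# worst-case estimate (made-up data).
samples = [0.10, 0.12, 0.08, 0.20, 0.15, 0.11, 0.09, 0.35, 0.14, 0.18]

def q(p, samples):
    # Empirical inverse CDF Q(p): value below which a fraction p of the samples fall.
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

q95 = q(0.95, samples)                    # 0.35 for this window
s = 0.05                                  # leeway factor (assumed)
N, iops_i, iops_max = 8, 100.0, 800.0     # assumed, as in the earlier bound example

worst_case = (N + 1) / iops_i + 1 / iops_max   # bound (3)
relaxed = (q95 + s) * worst_case               # D <= (Q(0.95) + s) * worst case
print(f"Q(0.95)={q95:.2f}, worst case {worst_case*1000:.1f} ms, relaxed {relaxed*1000:.1f} ms")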

21/35

Bursty I/O Traffic and Pspare

Bursty I/O traffic is self-similar
Multiplexing effect: Pspare(x)


22/35

Latency or Throughput Bound

(Bthroughput, C, D, E): convert D into Blatency --> (Bthroughput, C, Blatency, E)

Bthroughput >= Blatency: throughput bound
Bthroughput < Blatency: latency bound

Reserve Bthroughput? Blatency? Or even less?

23/35

MBAC for Latency Bound VDs

When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
1. For 0 < i <= j, convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]; let IOPSi = max(IOPS'i, IOPS''i)
2. If sum(IOPSi) < IOPSmax, accept the new VD; otherwise, reject.

24/35

MBAC Performance Pservice

Table 1. Maximum number of VDs accepted.

        VD Type     Probability   Deterministic   MBAC   Oracle
Run 1   Financial   95%           7               20     22
Run 2   Mixed       95%           7               14     14
Run 3   Mixed       85%           7               17     17

Table 2. Resource Reservation.

Number of VDs      7     9     10    11    13    14    15
Q_service(0.95)    11%   15%   19%   24%   37%   49%   -
MBAC               N/A   38%   43%   47%   55%   67%   95%
Deterministic      90%   -     -     -     -     -     -

25/35

MBAC for Throughput Bound VDs

When the jth VD (Dj, IOPS''j, Cj, E) comes:
Convert Dj to IOPS'j: Dj <= (Qservice(0.95) + s) * [(N+1)/IOPS'j + 1/IOPSmax]
Let IOPSj = max(IOPS'j, IOPS''j)
If IOPSj < Qspare(E), admit the new VD; otherwise, reject it.

26/35

MBAC Performance Pspare

[Graphs: VD 0 - TPC-C, VD 1 - financial, VD 2 - web search]

27/35

Measurement-based Admission Control (MBAC)

When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
1. For 0 < i <= j, convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]; let IOPSi = max(IOPS'i, IOPS''i)
2. Group the VDs into two sets: throughput-bounded set T and latency-bounded set L
3. For the throughput-bound VDs, calculate the combined QI/O_rate; let Qspare(x) = IOPSmax - QI/O_rate(x)
4. If sum(IOPS(L)) < Qspare(E), accept the new VD; otherwise, reject.

28/35

Issues with Measurement

Stability: the I/O rate pattern is stable; boundary cases for Pservice

Overhead of monitoring: trivial

Window size

29/35

Put them all together: Stonehenge

Functionality: a general-purpose IP storage cluster

Performance scheduling

Efficiency measurement

30/35

Software Architecture

[Figure: software architecture. Client kernels run an iSCSI initiator; Stonehenge kernel modules on the storage side include the FETD (front-end target driver), virtual and physical tables (V. Table, P. Table), the disk mapper, the admission controller, traffic shaping, the scheduler with its request queues, and the IDE mid-layer driver; user-level components sit above the kernel modules.]

31/35

Effectiveness of QoS Guarantees in Stonehenge

[Graphs: (a) CVC, (b) CSCAN, (c) deadline violation percentage]

32/35

Impact of Leeway Factor

[Graphs: overload probability; violation percentage]

33/35

Overall System Performance and Latency Breakdown

1 GHz CPU
IBM 7200 ATA disk array
Promise IDE controllers
64-bit 66 MHz PCI bus
Intel GB NICs

Software Modules     Average Latency (usec)
iSCSI client         57
iSCSI server         507
Disk access          1360
Central              50
Network delay 1      574
Network delay 2      2

A max of 55 MB/sec per server.

34/35

Related Work

Storage management: Minerva etc. at HPL

Efficiency-aware disk schedulers: Cello, Prism, YFQ

Run-time QoS guarantees: web servers, video servers, network QoS

IP storage

35/35

Conclusion

IP Storage Cluster consolidates storage and reduces fragmentation by 20-30%.

The efficiency-aware CVC real-time disk scheduler with dynamic I/O time estimation provides performance guarantees and good disk head utilization.

Measurement feedback effectively remedies over-provisioning:
Latency: Pservice, 2-3x
Throughput: Pspare, 20%
I/O time estimate: P(I/O time)
Load imbalance: Pleeway
