Harnessing Petabytes of Online Storage Effectively. 2005/09/27. Jun Nitta ([email protected]). Hitachi, Ltd. 1. Introduction: where are we today? 2. Configuring mass online storage 3. Defining distribution of intelligence 4. Miscellaneous topics 5. Summary: beyond 10 petabytes.
Copyright © Hitachi, Ltd. 2005. All rights reserved.
HPTS2005
Harnessing Petabytes of Online Storage Effectively
Jun Nitta ([email protected])
Hitachi, Ltd.
2005/09/27
1. Introduction: where are we today?
2. Configuring mass online storage
3. Defining distribution of intelligence
4. Miscellaneous topics
5. Summary: beyond 10 petabytes
1. Introduction: where are we today?
1-1 Looking into the latest specifications of HDDs…
| disk size | capacity / disks | rotational speed (seek / latency) | interface (sustained data rate) | data buffer | use |
|---|---|---|---|---|---|
| 3.5" | 147GB / 5 | 15,000rpm (3.7ms / 2.0ms) | 4Gb/s FC-AL (n/a-93.3MB/s) | 16MB | for high performance OLTP |
| 3.5" | 300GB / 5 | 10,025rpm (4.7ms / 3.0ms) | 2Gb/s FC-AL (46.8-89.3MB/s) | 16MB | for most other applications |
| 3.5" | 500GB / 5 | 7,200rpm (8.5ms / 4.2ms) | 3Gb/s SATA-II (31-64.8MB/s) | 16MB | for large volume archives |
| 2.5" | 100GB / 2 | 7,200rpm (10ms / 4.2ms) | 1.5Gb/s SATA (n/a-n/a) | 8MB | for small form factor |
| 1.0" | 8GB / 1 | 3,600rpm (12ms / 8.3ms) | CE-ATA (5.1-10.0MB/s) | 128KB | for portable audio player? |

* based on HGST catalogues as of Sep. 2005
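The latency figures quoted next to each rotational speed are consistent with the usual rule that average rotational latency is half of one revolution. A minimal Python sketch of that arithmetic follows; the function name is an illustrative assumption, not something taken from the slides.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds: half of one revolution."""
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms per minute
    return ms_per_revolution / 2.0


# Rotational speeds listed on the slide.
for rpm in (15_000, 10_025, 7_200, 3_600):
    print(f"{rpm:>6} rpm -> {avg_rotational_latency_ms(rpm):.1f} ms")
```

Rounded to one decimal place, the results agree with the 2.0ms, 3.0ms, 4.2ms, and 8.3ms latencies quoted in the catalogue data.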
1-2 … and storage subsystems (RAID controllers)
| class | raw capacity | HDDs | FC ports | cache | LUNs |
|---|---|---|---|---|---|
| enterprise | 332TB (FC) | 1152 (5 cabinets) | 192 | 128GB | 16,384 |
| (class not recoverable) | 72TB (FC) | 240 | 48 | 64GB | 16,384 |
| midrange | 88.5TB (SATA) | 225 | 4 | 8GB | 2,048 |
| workgroup | 40.5TB (SATA) | 105 | 4 | 4GB | 512 |

* based on HDS catalogues as of Sep. 2005

roughly: 1 rack = 200 disks (3.5") = 100TB (500GB drive)
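The rack rule of thumb can be turned into a quick sizing estimate. The sketch below assumes decimal units (1TB = 1,000GB, 1PB = 1,000TB) and raw capacity only, with no allowance for RAID parity or spares; the function names are illustrative.

```python
import math

DISKS_PER_RACK = 200      # from the slide: 1 rack = 200 disks (3.5")
DISK_CAPACITY_GB = 500    # from the slide: 500GB drive


def rack_capacity_tb() -> float:
    """Raw capacity of one rack in TB (decimal units assumed)."""
    return DISKS_PER_RACK * DISK_CAPACITY_GB / 1000


def racks_needed(target_pb: float) -> int:
    """Number of racks required to reach a raw capacity given in PB."""
    target_tb = target_pb * 1000
    return math.ceil(target_tb / rack_capacity_tb())


print(rack_capacity_tb())   # 100.0
print(racks_needed(1))      # 10
print(racks_needed(10))     # 100
```

Under these assumptions, one petabyte of raw capacity is about ten racks, and ten petabytes is about a hundred.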
1-3 Sheer number of HDDs matters practically
| HDDs | capacity |
|---|---|
| O(10^0) | 500GB - |
| O(10^1) | 5TB - |
| O(10^2) | 50TB - |
| O(10^3) | 500TB - |
| O(10^4) | 5PB - |
| O(10^5) | 50PB - |
| O(10^6) | 500PB - |

practicality, in order of increasing scale: a piece of cake (even possible personally); today's enterprise mainstream; practical limit for most datacenters; challenging but still feasible; getting impractical; almost prohibitive

major inhibitor besides $$$, in order of increasing scale: none; storage management; disk failure*; power & cooling

* MTBF of a high-end FC HDD is 10^6 h by catalogue spec. (=114yrs, actual number may vary by order of magnitude)
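To see why disk failure becomes a major inhibitor as the number of HDDs grows, the catalogue MTBF of 10^6 hours can be converted into an expected number of failures per year. This is a rough sketch that assumes a constant failure rate and immediate replacement of failed disks; as the footnote says, real numbers may differ by an order of magnitude.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
MTBF_HOURS = 1e6            # catalogue spec. quoted on the slide


def mtbf_years(mtbf_hours: float = MTBF_HOURS) -> float:
    """MTBF expressed in years."""
    return mtbf_hours / HOURS_PER_YEAR


def expected_failures_per_year(num_disks: int,
                               mtbf_hours: float = MTBF_HOURS) -> float:
    """Expected disk failures per year in a population of num_disks."""
    return num_disks * HOURS_PER_YEAR / mtbf_hours


print(round(mtbf_years()))                   # 114
for exponent in range(2, 7):
    disks = 10 ** exponent
    print(f"10^{exponent} disks: "
          f"{expected_failures_per_year(disks):.1f} failures/year")
```

With these assumptions, a farm of 10^4 disks sees roughly 88 failures per year, that is, one every four days or so, even though a single disk nominally lasts 114 years.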
2. Configuring mass online storage: array of nodes or disks?
2-1 Two alternatives to configure online storage
array-of-nodes (stack of self-contained boxes)
[Figure: nodes, each containing CPU, memory, and HDD, connected by a network]
Very cost effective for some kind of applications
- Secondary data management (especially search)
- Can utilize cheapest components

array-of-disks (separate from servers)
[Figure: server farm (diskless), storage network (FC or IP), storage farm]
Versatile for various mix of applications
- OLTP, ERP, DWH, email, …
- Cost is steadily going down
2-2 Rationale for array-of-disks model
It is reasonable to separate mechanical components
– HDD is the only mechanical component besides a cooling fan
– It makes it much easier to implement hot-swap mechanisms
It is reasonable to have external storage subsystems
– Disks can be shared among clusters of servers
– Spare disks can be shared within a storage subsystem
[Figure: HDD1 to HDD11 organized as RAID-5 (4D+1P) group 1, RAID-5 (4D+1P) group 2, and one shared hot-spare disk]
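For readers unfamiliar with the RAID-5 (4D+1P) notation, the sketch below shows the underlying idea on a single stripe: the parity block is the XOR of the four data blocks, so any one lost block can be recomputed from the other four, for example onto the shared hot-spare disk. It is a simplified illustration, not a description of any particular controller; real RAID-5 also rotates the parity position across disks, which is omitted here, and all names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the four data blocks followed by their parity block (4D+1P)."""
    if len(data_blocks) != 4:
        raise ValueError("4D+1P expects exactly four data blocks")
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, failed_index):
    """Recompute the block at failed_index from the four surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = make_stripe(data)

# Simulate the loss of the third disk and rebuild its block.
print(rebuild(stripe, 2))  # b'CCCC'
```

In the configuration of the figure, eleven disks yield eight disks' worth of data capacity: two parity disks and one spare shared by both groups.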
2-3 Additional discussion for array-of-disks model
It makes data management easier*
– Various data protection techniques can be employed, including third-party backup and D2D replication
– For the array-of-nodes configuration, replication between nodes is almost the only viable solution for data protection (conventional backup is difficult to employ effectively)
* Actually, backup is one of the most compelling reasons to consolidate scattered storage into an external RAID box
[Figure: backup server, application server, RAID subsystem, tape library]
2-4 But does this dichotomy have a meaning?
Nonetheless we need a storage “controller” for array-of-disks
– “Controller” is just another name for a special-purpose server whose restricted operating environment some users prefer
– The two configurations differ essentially in CPU-to-HDD ratio, which is determined by the intelligence a storage farm requires
Which is most promising?
| basic building block | petabytes configuration |
|---|---|
| general-purpose server with a couple of disks | O(10^3-4) of clustered nodes |
| special-purpose controller with a lot of disks | O(10^0-2) of clustered subsystems |
| even a HDD has CPU and memory (device controller) | O(10^3-4) of clustered disks |
3. Defining distribution of intelligence: protocol and interface
3-1 Distribution of intelligence among farms
[Figure: server farm (server side intelligence), storage network (FC or IP), storage farm (storage side intelligence)]
3 reasons some functions are better placed at storage side
- It is naturally implemented using CPU and memory near HDDs
- It requires operations with durable state
- It makes multiple servers share data objects
3 reasons some functions are better placed at server side
- It is better implemented using CPU and memory near applications
- It requires more powerful and economical CPU / memory
- It handles multiple controllers
3-2 Alternative way to place intelligence
Some intelligence could be placed on the network
- But a closer look reveals that most of those “intelligent network components” are not genuine network core components
- Rather they are placed on the boundary between network and server / storage, which is not a clear-cut edge but a blurred region
[Figure: server farm, storage network (FC or IP), and storage farm. The network core lies between two boundary regions: a network edge with intelligence (server side) and a network edge with intelligence (storage side). Question posed at the storage-side boundary: "Is this a part of network or storage farm?"]
3-3 Placement of functions: an example
Here is an example of intelligence distribution scheme assuming array-of-disks configuration
storage side intelligence
- basic RAID control / LUN management
- remote filesystem
- local replication including snapshots (copy-on-write)
- volume migration transparent to servers
server side intelligence
- local filesystem
- volume migration among multiple controllers
- multi-path management (load balancing & fail over)
- content search / indexing
intelligence on both sides
- block aggregation (a.k.a. logical volume management)
- remote replication
- backup
- data encryption
3-4 Which interface & protocol should we adopt?
There are 3 well-established I/O interfaces: block, file, SQL
- None of them is optimal for today’s server/storage farm environment
- Though file may be most promising for its balanced features
- But I/O interface is stubborn to change (very conservative)
- Thus multi interface/protocol support is a practical solution

| interface | block | file | SQL |
|---|---|---|---|
| protocol (transport) | SCSI-3 (FC or IP) | NFS/CIFS-SMB (TCP/IP) | proprietary (mostly TCP/IP) |
| strength | low latency; strong standard protocol | broad application; strong standard protocol | high level enough to encapsulate physical properties |
| weakness | layers away from application; not network-friendly | performance and scalability (especially for DBMS) | limited application; no standard protocol |
4. Miscellaneous topics for managing petabytes of online storage
4-1 Virtualization: simply too many mappings
“Virtualization” itself is a powerful technology to hide complexity if used properly
- But the current situation is too confusing
Operating Systems and DBMSs should be aware that a storage volume is a logical network resource
- It can even expand and shrink dynamically
- There may be more than 100,000 volumes on the network (most OSs can recognize only up to about 1,000 volumes)
[Figure: chain of mappings from HDDs to an AP-recognizable volume. Controllers build LUs from HDDs by RAID/block aggregation and export them (controller level virtualization); a switch recognizes those LUs, aggregates them again, and exports new LUs (switch level virtualization); the server recognizes them and aggregates once more in HBA/device driver, OS/LVM, and DBMS (server level virtualization).]
4-2 Data protection: disk plays the protagonist
You have to go to disks at least for the first step to make backup workable for > 10TB of data
- Eventually those data may go to tape (D2D2T)
[Figure: typical backup scenario for large amount of data. server1 runs AP/DBMS with an agent; server2 runs the data protection manager; a RAID controller holds the primary volume and a consistent snapshot (copy on write); a VTL (controller with MT emulation in front of HDDs) receives the backup.]
1) make quiescent
2) take snapshot
3) resume
4) mount
5) backup to VTL or replicate to disks
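The backup steps rely on a copy-on-write snapshot: the application is paused only long enough to mark a consistent point in time, and old block contents are preserved lazily when they are later overwritten. The following is a toy in-memory sketch of that sequence. All class and function names are hypothetical, the "mount" and "backup" steps are reduced to reading blocks through the snapshot, and no real controller or backup product API is implied.

```python
class Snapshot:
    """Point-in-time view of a Volume, filled lazily by copy-on-write."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> content at snapshot time

    def read(self, index):
        # Blocks not overwritten since the snapshot still live on the volume.
        return self.saved.get(index, self.volume.blocks[index])


class Volume:
    """Toy primary volume: a list of blocks plus its snapshots."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.snapshots = []

    def take_snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy on write: save the old block once per snapshot, then overwrite.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data


class App:
    """Stand-in for AP/DBMS with its agent (hypothetical interface)."""

    def __init__(self, volume):
        self.volume = volume
        self.quiescent = False

    def write(self, index, data):
        if self.quiescent:
            raise RuntimeError("writes are held while quiescent")
        self.volume.write(index, data)


def consistent_snapshot(app, volume):
    app.quiescent = True            # 1) make quiescent
    snap = volume.take_snapshot()   # 2) take snapshot
    app.quiescent = False           # 3) resume
    return snap


vol = Volume(4)
app = App(vol)
app.write(0, b"old")

snap = consistent_snapshot(app, vol)
app.write(0, b"new")                # production keeps writing

# 4) mount and 5) backup: read the frozen image through the snapshot.
image = [snap.read(i) for i in range(len(vol.blocks))]
print(image[0], vol.blocks[0])      # prints: b'old' b'new'
```

The point of the design is that the quiescent window covers only the snapshot creation, while the slow copy to the VTL or to other disks runs against the frozen image.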
4-3 Data migration: latent cost of online storage
Since data always outlives its container, you will have to migrate data from one subsystem to another several times
- Non-disruptiveness to the upper layers is desirable, which requires some form of address mapping
- Durable address mapping for storage is not well standardized at either block or file level (cf. URL -[DNS]-> IP address -> MAC address)
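The role of address mapping in non-disruptive migration can be illustrated with a small indirection table, in the spirit of the DNS analogy: upper layers hold a durable name, and only the mapping changes when the data moves. This is a hypothetical sketch, not a standard interface; the class, the `copy_data` callback, and the controller names are assumptions made for the example.

```python
class VolumeDirectory:
    """Maps a durable volume name to its current (controller, LUN) location."""

    def __init__(self):
        self._location = {}

    def register(self, name, controller, lun):
        self._location[name] = (controller, lun)

    def resolve(self, name):
        return self._location[name]

    def migrate(self, name, new_controller, new_lun, copy_data):
        """Copy the data, then switch the mapping; the name never changes."""
        old_location = self._location[name]
        new_location = (new_controller, new_lun)
        copy_data(old_location, new_location)
        self._location[name] = new_location


moves = []  # records (source, destination) pairs instead of copying real data

directory = VolumeDirectory()
directory.register("payroll-db", "old-controller", 7)
directory.migrate("payroll-db", "new-controller", 3,
                  copy_data=lambda src, dst: moves.append((src, dst)))

print(directory.resolve("payroll-db"))  # ('new-controller', 3)
```

Where this table lives (server, switch, or storage) is exactly the choice the slide discusses; the sketch deliberately leaves that open.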
[Figure: data migration from an old controller to a new controller, seen by AP/DBMS on server1 and server2 through a switch, with the address mapping held at the storage level, the switch level, or the server level. The levels are compared by invariant (from SCSI LUN, less flexible, to path, more flexible) and by data movement (from localized & short to scattered & long).]
4-4 Security: as always matters
And of course there are a lot of security concerns storage subsystems have to take care of
- Data-at-rest protection is much more challenging than data-in-flight because of long-term key management
[Figure: application server, storage administrator, management server, and storage subsystems at a primary site and a secondary site]
[management port security]
- user authentication
- access control
- data-in-flight protection
[data port security]
- device authentication
- access control
- data-in-flight protection
[other subsystem security]
- data-at-rest protection
- audit logging
4-5 Storage resource management: spreadsheet?
Even the basic discovery-and-reporting is still a pain in the neck for most administrators
- The most widely used management tool today is a spreadsheet
- But can they continue using it for a PB environment?
- SNIA SMI-S standard seems good because of its set-oriented query capability (SNMP has already broken down for storage management)
- Yet most commercial tools are not proven over PB
4-6 Applications: will they use DBMS?
What kind of applications will use petabytes of online storage?
- email/IM, voice, video archive, …
- stream data from sensor network (including RFID)
- geoscience, bioscience, medical, …
How will those data be managed?
- Most bulk data may not be stored in RDBMSs but in filesystems (with global name space)
- XML native store may engulf a lot of data (structured and semi-structured) once well established
[Figure: today’s typical PB system. A front-end application works with a contents server (contents manager, DBMS, metadata DB) and a file server (HSM with cache and staging disk on HDDs in front of an MT library), alongside an application server. Sizes noted in the figure: O(10G-100GB) and O(1PB), e.g. 100MB * 10^7 files.]
5. Summary: beyond 10 petabytes
5-1 Beyond 10 petabytes of data
Continuing capacity growth of HDD brings >10PB online storage within the reach of most IT organizations in 5 years
- HDD with perpendicular magnetic recording technology is emerging
- Declining $/GB trend shows no sign of discontinuing
Server farm – network – storage farm configuration will continue to dominate enterprise data centers
- It is the most cost effective and flexible way to configure online storage for varieties of applications
Protocol and interface between server and storage should evolve to be more network-conscious
- But old guards will never die in the foreseeable future
XML data store may come to play a significant role in addition to filesystem and RDBMS
- Who knows!