A Case for Flash Memory SSD in Enterprise Database Applications
Authors: Sang-Won Lee, Bongki Moon, Chanik Park, Jae-Myung Kim, Sang-Woo KimPublished on SIGMOD2008
Presented by Jin Xiong11/4/2008
2
Outline• Flash memory SSD• DB storage and workload• Experimental settings• Transaction log• MVCC rollback segment• Temporary table spaces• Conclusions
3
Flash memory SSD (1)
• Flash memory SSD– NAND-type flash memory– SAMSUNG– Interface: IDE
4
Flash memory SSD (2)
• Characteristics– Uniform random access speed
• Purely electronic device, no mechanically moving parts• Access latency is almost linearly proportional to the amo
unt of data irrespective of their physical locations in flash memory.
• One of the key characteristics we can take advantage of– Erase before overwriting
• Data on SSD cannot be updated in place• Erase unit is much larger than a sector, 128KB vs. 1KB• Erase is time consuming, typically 1-2 ms
– Asymmetry of read and write speed • Write is much slower than read on SSD, 0.4 ms vs 0.2 ms i
n this paper
5
Flash memory SSD (3)• Hardware logic
– Dual channel architecture, 4-way interleaving– Hide flash programming latency and increase
bandwidth– 128KB SRAM for program code, data and buffer
memory
6
Flash memory SSD (4)• Firmware: Flash translation layer (FTL)
– Address mapping and wear leveling• Address the issue of limited write cycles of each sector• Based-on super-blocks: 1MB, 8 erase units, 2 on each
flash chip• Limit the amount of information required for mapping
• Trends – Two-fold annual increase in the density– Original used in mobile computing devices
• PDA’s, MP3 players, mobile phones, digital cameras
– Recently more and more used in portable computers and enterprise server market
– Tremendous potential as a new storage medium that can replace magnetic disk and achieve much higher performance for enterprise database servers
7
DB Storage• Data structures in DB systems
– Database tables and indexes• Not within the scope of this paper
– Transaction log• Whenever a transaction updates a data object, its log rec
ord is created• Must be kept on stable storage for recoverability and dur
ability– Temporary tables
• Used to store temporary data required for performing operations such as sorts or joins
– Rollback segments• Used in multiversion concurrent control (MVCC)
8
DB Workload • Typical transactional database workloads, e.g. TP
C-C– Little locality and sequentiality– Many synchronous writes
• Forced writes of log records at commit time• Must wait until data are written on disk
– Prefetching and write buffering are less effective– Performance is limited by disk latency rather than disk b
andwidth and capacity– The latency-bandwidth imbalance of disk seems to be m
ore serious in the future– Low latency of SSD
• Improve performance significantly
9
Experimental Settings• Two machines with identical hardware
except disk– 1.86 GHz Intel Pentium dual-core processor– 2GB RAM
• OS: Linux-2.6.22• Disk
– SSD: Samsung Standard Type, 32GB, PATA (IDE), SLD NAND
– HDD: Seagate Barracuda, 250 GB, 7200 rpm, SATA
• DB– A commercial database server– Used HDD/SSD as a raw device (not through FS)– Database tables were cached in memory
10
Transaction log• Synchronous writes
– When a transaction commits, it appends a commit type log record to the log, and force-writes the log tail to stable storage
• Response time– Tresponse = Tcpu + Tread + Twrite + Tcommit– Tcommit is a significant overhead, waiting disk I/O– Commit time delay is a serious bottleneck
• Append-only sequential writes– HDD: no seek delay , avg latency 4.17ms (7200 rpm)– SSD: do not cause expensive merge or erase operation
s if clean blocks are available
11
Transaction log
• Simple SQL transactions– Multi-threaded concurrent transactions– TPS on SSD is much higher (12x-4x) than that
on HDD– The gap is shrinking with the increase of the
number of concurrent transactions– HDD: Disk access latency is the bottleneck, low
CPU utilization– SSD
• Limited by CPU rather than I/O• Saturated CPU utilization, no increase in TPS
12
Transaction log• TPC-B benchmark performance
– A stress test: transaction commit rate is higher than that of TCP-C
– Suitable for testing the log storage: a large number of small transactions causing significant forced-write activities
– The number of concurrent users: 20– TPS on SSD is 3.5x – Considerably lower log write latency on SSD– CPU is the bottleneck for SSD
13
Transaction log• I/O-bound vs CPU-bound
– SSD: faster CPU improves TPS• Dual-core: saturated at about 3000 TPS• Quad-core: saturated at about 4300 TPS
– HDD: almost no difference
14
MVCC rollback segment• MVCC — Multiversion concurrency control
– An alternative to the traditional concurrency control mechanism based on lock
– When updating a data object, its before image is written to a rollback segment, then the new data is applied to it
– When reading a data object, search for the correct version on rollback segment
– Two advantages• Minimize performance penalty on concurrent updates of
transactions, because read consistency is supported without any lock
• Support snapshot isolation and time travel queries– Cost
• Costly read operation: search through a long list of versions of a data object if it is updated many times
15
MVCC rollback segment• Write pattern 1
– Append only, sequential write – Multiple streams in parallel– 1MB extent
• Write pattern 2– In-place writes to a small logical regi
on• HDD is expected to perform poorly
– Disk arm movement each 1MB– Excessive disk seek
• SSD is expected to perform well– No additional cost when there are cl
ean blocks– Reclamation cost can be amortized
• Infrequent, every 1MB extent • Slight performance difference
– SSD: avg 6.8ms/block– HDD: avg 7.1ms/block
16
MVCC rollback segment• Read pattern
– Clustered, randomly scattered across quite a large logical address space (1GB)
• Performance– SSD: 16x faster than HDD
17
Temporary table spaces• External sort
– Typical algorithm• Partitions an input data set
into smaller chunks• Sorts the chunks separately• Merges them into a single
sorted file
– I/O pattern• Sequential write followed by
random read
– Performance• Sequential write: small
difference• Random read: SSD almost 10
times faster
18
Temporary table spaces
• External sort– Effect of cluster size on sort
performance• HDD: sort performance is
improved with larger cluster size
• SSD: sort performance is deteriorated
• Reasons: – Larger cluster is good for the
first stage, but not good for merging
– The second stage dominates the performance
– Effect of buffer cache size on sort performance• Performance is improved with
larger buffer size in both cases
19
Temporary table spaces• Hash join
– Similarity with sort algorithm• Partition input data set into smaller chunks, and process each chunk
separately
– Opposite I/O pattern• Random writes followed by sequential reads
– Performance• SSD is expected to perform poorly in the first stage• Actual result is unexpected, sequential append-only write in the first
stage• SSD is 3 times faster than HDD
20
Temporary table spaces
• Sort-merge join– SSD is 7 times faster– HDD: sort-merge join is two times slower than
hash join– SSD: sort-merger join is as fast as hash join
21
Conclusions
• Demonstrated that processing I/O requests for transaction log, rollback and temporary data can become a serious bottleneck for transaction processing
• Showed that flash memory SSD can alleviate this bottleneck drastically
• Due attention should be paid to SSD in all aspect of DB system design to maximize the benefit from this new technology