US ATLAS Distributed Facility Workshop University of California, Santa Cruz
File Caching with SSD Arrays
Wei Yang
11/14/12
Motivation
• We are curious
  – No immediate needs, but future needs
  – Caching (only) analysis job inputs
  – SSD has limited write cycles
  – Other goals, see the last slide
• File level caching
  – Conventional LFU/LRU algorithms
    • cannot capture the data usage pattern of ATLAS analysis jobs (if there is such a pattern)
  – Sub-file level caching would be great! But bookkeeping is hard
  – We search for a caching algorithm
    • Out-Bytes > In-Bytes under ATLAS workload
    • Use LRU, but based on days/weeks/months job usage pattern
Setup 1: Caching based on File Access Frequency
[Diagram: Analysis jobs visit SSD cache first. Cache miss: forward to HD storage. Xrootd monitoring stream. Fill the cache.]

           Day 1  ...  Day N
File 0001  X1     ...  Xn
File 0002  Y1     ...  Yn

o A table records the access frequency of all files
o Rotate columns to maintain N days of records
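The frequency table with rotating day columns can be sketched roughly as follows. This is a minimal illustration, not the code used in the setup: the class name `AccessFrequencyTable` and its methods are hypothetical, and the access events are assumed to arrive one file path at a time (for example, parsed from the monitoring stream).

```python
from collections import defaultdict, deque


class AccessFrequencyTable:
    """Per-file access counts for the most recent n_days days (hypothetical helper)."""

    def __init__(self, n_days):
        self.n_days = n_days
        # One fixed-length deque of daily counts per file; the last entry is today.
        self.rows = defaultdict(lambda: deque([0] * n_days, maxlen=n_days))

    def record_access(self, path):
        """Count one access to `path` in today's column."""
        self.rows[path][-1] += 1

    def rotate(self):
        """Start a new day: drop the oldest column and append an empty one."""
        for counts in self.rows.values():
            counts.append(0)  # maxlen discards the oldest day automatically
        # Forget files with no access anywhere in the window.
        for path in [p for p, c in self.rows.items() if not any(c)]:
            del self.rows[path]

    def total(self, path):
        """Number of accesses to `path` over the whole N-day window."""
        return sum(self.rows[path]) if path in self.rows else 0
```

Calling `rotate()` once per day keeps exactly N columns per file, so memory grows with the number of recently accessed files rather than with the length of the history.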
Setup 2: Caching based on Historic File Access Info
[Diagram: Analysis jobs visit SSD cache first. Cache miss: forward to HD storage. Xrootd monitoring stream to UCSD collector. Fill the cache.]
o Record every file access as event-like info
o Save to ROOT files for later analysis
Hardware of the SSD Box
• Dell 610
  – 8-core 2.4 GHz
  – 24GB
  – Intel dual X520 10Gb NIC
  – LSI SAS 9200-8e (supports TRIM)
  – RHEL 6 x86_64
  – Xrootd
• SSD Array
  – Dell MD1220
  – 12x OCZ Talos 960GB MLC SSDs, total ~11TB
    • Non-RAID to support TRIM
    • Xrootd takes care of gluing them together as a single space
The box can deliver
Can the caching algorithm deliver?
[Plots: 3-hour plot, 2012-11-05; 6-month plot as of 2012-11-12. Annotations: Sept 1; Lack of jobs; Cache brings in ~200GB/hour]
File Access Freq. Alg.: net data sink, not cache
Algorithm: Bytes-read/file size > 110% during the last 5 days, prioritized by this ratio and up to 200GB/hour
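The selection rule stated on this slide (ratio above 110% over the last 5 days, highest ratio first, at most 200GB per hour) could be sketched as below. The function name, the shape of the `stats` input, and the choice to skip a file that does not fit in the remaining hourly budget (rather than stop) are assumptions made for illustration.

```python
def select_files_to_cache(stats, min_ratio=1.10, budget_bytes=200 * 10**9):
    """Pick files to copy into the cache during the next hour.

    stats maps file path -> (bytes_read_last_5_days, file_size_bytes).
    Both the input layout and the skip-if-too-large policy are assumptions.
    """
    candidates = []
    for path, (bytes_read, size) in stats.items():
        if size <= 0:
            continue  # ignore empty or unknown-size files
        ratio = bytes_read / size
        if ratio > min_ratio:
            candidates.append((ratio, path, size))
    candidates.sort(reverse=True)  # highest bytes-read/file-size ratio first

    chosen, used = [], 0
    for ratio, path, size in candidates:
        if used + size > budget_bytes:
            continue  # does not fit in this hour's budget; try smaller files
        chosen.append(path)
        used += size
    return chosen
```

A file read less than 1.1 times its own size never qualifies, which is what keeps files that are read only once out of the SSDs.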
[Plot legend: GB/hour from SSD + HDD; GB/hour from SSD; GB/hour to SSD. Annotations: Lost monitoring data from HDD; UCSD collector dead; Lack of jobs for the last 4 days; Ceiling of 10Gb NIC]
Simulate the Cache with Historic Data
[Diagram: timeline of days -n, -n+1, ..., -1, 0, with the cache size required for days [-x, -1] marked]
Day 0: size of all files read
Bytes read from SSD+HDD
Bytes read from SSD
Cache size required for [-x+1, 0] - cache size required for [-x, -1] = new data to cache
For a given caching algorithm, what do we want to learn from those historic data?
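One quantity such a simulation can extract is the cache size required for a sliding window of days and the new data that must be brought in each day. The sketch below is an illustration under stated assumptions: "new data" is read as the files in the current window that were not in the previous window, and the input format (`daily_reads`, `sizes`) is hypothetical.

```python
def new_data_per_day(daily_reads, sizes, window):
    """For each day, return (cache size required, new bytes to cache).

    daily_reads: list of sets of file paths read on each day, oldest first.
    sizes: dict mapping file path -> file size in bytes.
    window: number of days x whose files the cache should hold.
    """
    results = []
    previous = set()
    for day in range(len(daily_reads)):
        start = max(0, day - window + 1)
        # Every file read in the window ending on this day.
        current = set().union(*daily_reads[start:day + 1])
        required = sum(sizes[f] for f in current)
        # Files in this window that the previous window did not hold.
        new_bytes = sum(sizes[f] for f in current - previous)
        results.append((required, new_bytes))
        previous = current
    return results
```

Comparing the daily `new_bytes` with the bytes actually served from the cache shows whether a given window length makes the cache a net source or a net sink of data.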
Algorithm: cache every file read during the last N days
Algorithm: cache every file read during the last N days
Cache hit rate = Bytes from SSD / Bytes from (SSD + HDD)
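The hit rate defined above can be replayed against historic access records. The following sketch assumes an unlimited cache that, at the start of each day, holds every file read during the previous N days; a file read for the first time today therefore counts entirely as a miss. The function name and input layout are hypothetical.

```python
def simulate_hit_rate(daily_bytes, n_days):
    """Byte hit rate per day for the 'every file in the last N days' policy.

    daily_bytes: list of dicts (oldest day first), each mapping
    file path -> bytes read from that file on that day.
    Returns one rate per day, or None for days without any reads.
    """
    rates = []
    for day, reads in enumerate(daily_bytes):
        # Files assumed to be in the cache: anything read in the previous n_days.
        cached = set()
        for earlier in daily_bytes[max(0, day - n_days):day]:
            cached.update(earlier)
        total = sum(reads.values())  # bytes from SSD + HDD
        from_ssd = sum(b for f, b in reads.items() if f in cached)
        rates.append(from_ssd / total if total else None)
    return rates
```

Running it for several values of `n_days` gives one hit-rate curve per window length, which can then be weighed against the cache size each window requires.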
Algorithm: Bytes-read/file size > 110% during the last 5 days
Analyzing the Historic Data
Try to find a way to identify data worth caching.
So far, not much success
[Plot annotation: Worth caching]
Do the jobs tend to open the same file in a short time window?
• If so, we may not have a chance to cache it
Files that are worth caching
• Access (open) times scattered over several hours: cacheable
• But “scattering over several hours” doesn’t mean the file is worth caching
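The "scattered over several hours" test can be sketched as a per-file spread between the first and last open. The 3-hour threshold, the function names, and the `(path, unix_timestamp)` event format are assumptions for illustration; as noted above, passing this test only makes a file cacheable, not necessarily worth caching.

```python
def open_time_spread(open_events):
    """Return {path: seconds between the first and last open} for each file.

    open_events: iterable of (path, unix_timestamp) pairs (assumed format).
    """
    first, last = {}, {}
    for path, ts in open_events:
        first[path] = min(ts, first.get(path, ts))
        last[path] = max(ts, last.get(path, ts))
    return {path: last[path] - first[path] for path in first}


def cacheable_candidates(open_events, min_spread_s=3 * 3600):
    """Files whose opens are spread over at least min_spread_s seconds.

    The default of 3 hours is a hypothetical stand-in for 'several hours'.
    """
    spreads = open_time_spread(open_events)
    return sorted(p for p, s in spreads.items() if s >= min_spread_s)
```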
Next Step
• So far focusing on making it a good cache
  – More work to be done
  – Should also look at
    • Asking PanDA for the input file lists of upcoming jobs
    • Possibility of sub-file level caching
• How much can the cache speed up analysis jobs?
  – All files are in SSD cache
  – Normal caching: some files are in SSD cache, some are not