M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 1
Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager
World Data Center for Climate (M&D/MPIMET, Hamburg)
HAMSTER-WS, Langen, 11.07.03
Content:
• DKRZ archive development
• CERA 1) concept and structure
• Automatic fill process
• CERA access statistic
1) Climate and Environmental data Retrieval and Archiving
DKRZ Archive Development
DKRZ's Archive Increase (Estim. 06.03), Data Amount [TB]
Year                2002   2003   2004   2005   2006   2007
Unix-File Archive    600   1150   1750   2350   3250   4150
CERA DB               12     40    160    360    660    960
Problems in file archive access:
• Missing Data Catalogue: Directory structure of the Unix file system is not sufficient to organise millions of files.
• Data are not stored application-oriented: Raw data contain time series of 4D data blocks. Access pattern is time series of 2D fields.
• Lack of experience with climate model data: Problems in extracting relevant information from climate model raw data files.
• Lack of computing facilities at client site: Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 in 6 hour storage intervals).
Year                          2003     2004     2005     2006     2007
Estimated File Archive Size   1.2 PB   1.8 PB   2.4 PB   3.3 PB   4.2 PB
Model Raw Data Structure
Application-oriented Data Storage
5D Model Data Structure
6-hour Storage Interval
CERA Concept: Semantic Data Management
(I) Data catalogue and pointer to Unix files
• Enable search and identification of data
• Allow for data access as they are
(II) Application-oriented data storage
• Time series of individual variables are stored as BLOB entries in DB tables: allow for fast and selective data access
• Storage in standard file format (GRIB): allow for application of standard data processing routines (PINGOs)
CERA Database System
CERA Database: 7.1 TB (12.2001)
• Data Catalogue
• Processed Climate Data
• Pointer to Raw Data files
DKRZ Mass Storage Archive: 210 TB neglecting Security Copies (12.2001)
Internet Access: Web-Based User Interface
• Catalogue Inspection
• Climate Data Retrieval
Current database size: 15.3778 TB
Number of experiments: 299
Number of datasets: 19818
Number of BLOBs within CERA at 20-MAY-03: 1104798147
Typical BLOB sizes: 17 kB and 100 kB
Number of data retrievals: 1500 – 8000 / month
Parts of CERA DB
DB Size 09.07.03: 19.6 TB
Data Structure in CERA DB
Level 1 Interface: Metadata entries (XML, ASCII)
Level 2 Interface: Separate files containing BLOB table data
Diagram elements: Experiment Description; Pointer to Unix-Files; Dataset 1 Description ... Dataset n Description; a BLOB Data Table for each dataset
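The layout above (experiment description, dataset descriptions, one BLOB data table per dataset, pointer to Unix files) can be illustrated with a small relational sketch. This is only an illustration: it uses SQLite from the Python standard library instead of the Oracle RDBMS named in the slides, and all table names, column names and the file path are hypothetical, since the slides do not give the real CERA schema.

```python
import sqlite3

# Hypothetical, simplified schema; the real CERA tables are not given in the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    description TEXT,
    unix_file_pointer TEXT          -- pointer to the raw data files
);
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES experiment(id),
    description TEXT
);
CREATE TABLE blob_data (            -- time series of one variable
    dataset_id INTEGER REFERENCES dataset(id),
    time_step INTEGER,
    field BLOB,                     -- one 2D field per row
    PRIMARY KEY (dataset_id, time_step)
);
""")

# Made-up demo content; the path is a placeholder, not a real archive location.
conn.execute("INSERT INTO experiment VALUES (1, 'demo experiment', '/archive/demo/raw')")
conn.execute("INSERT INTO dataset VALUES (1, 1, 'surface temperature, 6-hourly')")
for step in range(4):
    payload = bytes([step]) * 8     # stand-in for one GRIB-encoded 2D field
    conn.execute("INSERT INTO blob_data VALUES (1, ?, ?)", (step, payload))


def fetch_fields(connection, dataset_id, first_step, last_step):
    """Selective access: return only the requested time steps of one dataset."""
    rows = connection.execute(
        "SELECT time_step, field FROM blob_data "
        "WHERE dataset_id = ? AND time_step BETWEEN ? AND ? "
        "ORDER BY time_step",
        (dataset_id, first_step, last_step),
    )
    return rows.fetchall()


selected = fetch_fields(conn, 1, 1, 2)
print([step for step, _ in selected])
```

Because each row holds a single 2D field, a request for a time window reads only the matching rows rather than whole raw files.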
Climate Model Raw Data
Application-oriented Data Storage (Interface level 2)
Primary Data Processing
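As a rough illustration of what primary data processing has to achieve, the sketch below regroups raw output (one block per time step, holding all variables) into one time series per variable and level. The data layout, the variable names and the function `to_time_series` are assumptions made for this example; the actual processing chain is not described in the slides.

```python
from collections import defaultdict


def to_time_series(raw_steps):
    """Regroup per-time-step blocks into per-variable time series.

    raw_steps: list with one dict per storage interval, mapping
    (variable, level) to a 2D field (here a nested list).
    """
    series = defaultdict(list)
    for step_index, block in enumerate(raw_steps):
        for key, field in block.items():
            series[key].append((step_index, field))
    return dict(series)


# Two made-up 6-hour steps with two variables on one level each.
raw_steps = [
    {("temperature", 850): [[280.0, 281.0]], ("wind_u", 850): [[5.0, 6.0]]},
    {("temperature", 850): [[280.5, 281.5]], ("wind_u", 850): [[5.5, 6.5]]},
]
series = to_time_series(raw_steps)
print(sorted(series))
```

After regrouping, a user interested in one variable reads a single series instead of scanning every raw block.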
Automatic Fill Process (AFP)
Creation of application-oriented data storage must be automatic!
Archive Data Flow per Month
Components: Compute Server, Common File System, Mass Storage Archive, CERA DB System
Unix-Files: 50 TB, 75 TB (2006+)
Application-Oriented Data Hierarchy: 2003: 4 TB, 2004: 10 TB, 2005: 17 TB, 2006+: 25 TB
Metadata Initialisation
Important: Automatic fill process has to be performed before corresponding files migrate to mass storage archive.
Automatic Fill Process: Steps and Relations
DB Server:
1. Initialisation of CERA DB: metadata and BLOB data tables are created
Compute Server:
1. Climate model calculation starts with the first month
2. Next model month starts and primary data processing of the previous month begins: BLOB table input is produced and stored in the dynamic DB fill cache
3. Step 2 is repeated until the end of the model experiment
DB Server:
1. BLOB data table input is accessed from the DB fill cache
2. BLOB table injection and update of metadata
3. Step 2 is repeated until the table partition is filled (BLOB table fill cache)
4. Close partition, write corresponding DB files to HSM archive, open new partition and continue with step 2
5. Close entire table and update metadata after the end of the model experiment
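The interplay of the two servers can be sketched as a small simulation. This is a simplified model under stated assumptions, not the actual AFP implementation: the compute server and DB server run in lockstep here, a partition is filled after a fixed number of model months, and the function `run_afp` and all other names are hypothetical.

```python
from collections import deque


def run_afp(model_months, months_per_partition):
    """Simulate the fill process for one model experiment."""
    fill_cache = deque()   # dynamic DB fill cache
    archived = []          # closed partitions written to the HSM archive
    partition = []         # currently open BLOB table partition
    metadata = {"months_injected": 0, "table_closed": False}

    for month in range(1, model_months + 1):
        # Compute server: primary processing of a finished model month
        # produces BLOB table input and puts it into the fill cache.
        fill_cache.append(f"blob-input-month-{month}")

        # DB server: take input from the fill cache, inject it into the
        # open partition and update the metadata.
        while fill_cache:
            partition.append(fill_cache.popleft())
            metadata["months_injected"] += 1
            if len(partition) == months_per_partition:
                archived.append(partition)   # close and archive partition
                partition = []               # open a new partition

    # End of experiment: close the last partition and the entire table.
    if partition:
        archived.append(partition)
    metadata["table_closed"] = True
    return archived, metadata


archived, metadata = run_afp(model_months=25, months_per_partition=12)
print([len(p) for p in archived], metadata)
```

In a real setup the two sides would run concurrently, which is why the fill cache must be large enough to absorb production while the DB server is busy or unavailable.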
Automatic Fill Process Development
Diagram labels: GFS; 192 CPUs, 1.5 TB memory; ca. 60 TB; 4 TB/month; BLOB table fill cache
Oracle RDBMS is currently not connected with GFS.
AFP development with 1 TB dynamic DB fill cache
Future:
• RDBMS movement to Linux system and connection with GFS
• AFP is planned to move to DB server
AFP Cache Sizes
Dynamic DB fill cache
• In order to guarantee stable operation, the fill cache should buffer data from 10 days of production.
• Cache size is determined by the automatic data fill rate.
Year          AFP [TB/month]   DB Fill Cache [TB]
2003          4                1.5
2005          17               6
2006 - 2007   25               9
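The relation between fill rate and cache size can be checked with a few lines of arithmetic. The sketch assumes a 30-day month, which the slides do not state; the fill rates and cache sizes are taken from the table above.

```python
DAYS_BUFFERED = 10     # from the slide: buffer 10 days of production
DAYS_PER_MONTH = 30    # assumption, not stated in the slides


def ten_day_volume_tb(fill_rate_tb_per_month):
    """Data volume produced in 10 days at the given monthly fill rate."""
    return fill_rate_tb_per_month * DAYS_BUFFERED / DAYS_PER_MONTH


# (AFP rate in TB/month, DB fill cache in TB) as listed in the table
slide_values = {"2003": (4, 1.5), "2005": (17, 6), "2006 - 2007": (25, 9)}

for year, (rate, cache) in slide_values.items():
    volume = ten_day_volume_tb(rate)
    print(f"{year}: 10-day volume {volume:.2f} TB, slide cache {cache} TB")
```

Under this assumption the 10-day volumes come out at roughly 1.3, 5.7 and 8.3 TB, slightly below the listed cache sizes, which suggests the slide values were rounded up with some margin.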
AFP Cache Sizes
BLOB table partition cache
• Number of parallel climate model calculations and size of BLOB table partitions determine the cache size.
• BLOB table sections of 10 years for the high resolution model and 50 years for the medium resolution model result in 0.5 TB partitions.
• Present-day experience with parallel model runs results in BLOB table partition cache sizes equal to the dynamic DB fill cache.
• The cache size will have to be adjusted if the number of parallel runs and/or table partitions changes.
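One simple way to read the first bullet is that every parallel model run keeps one partition open, so the cache size is the product of the two quantities. This product rule and the run count used below are assumptions for illustration; the slides only state that both quantities determine the size.

```python
PARTITION_SIZE_TB = 0.5   # from the slide


def partition_cache_tb(parallel_runs, partition_size_tb=PARTITION_SIZE_TB):
    """Cache size if each parallel run holds exactly one open partition (assumption)."""
    return parallel_runs * partition_size_tb


# Hypothetical example: three parallel runs
print(partition_cache_tb(3))
```

With three parallel runs this gives 1.5 TB, which would match the 2003 dynamic DB fill cache size.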
CERA Access Statistic
Month            Number of downloads   Volume (GB)   Average transfer size (MB)
June 2003        3426                  78            23
May 2003         5803                  117           21
April 2003       5343                  66            16
March 2003       3611                  109           31
February 2003    8058                  168           21
January 2003     2312                  204           90
December 2002    1985                  200           103
November 2002    2813                  134           49
October 2002     4270                  225           54
September 2002   4054                  264           67
August 2002      5475                  216           40
July 2002        2888                  196           70
June 2002        1835                  219           122
May 2002         2317                  150           66
April 2002       1284                  170           136
March 2002       1682                  88            54
February 2002    2258                  105           47
January 2002     420                   35            85
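The last column appears to follow from the other two as volume divided by number of downloads. The sketch below recomputes it for a few rows, assuming 1 GB = 1024 MB (an assumption; the slides do not state the conversion). Since the listed volumes are rounded to whole GB, small deviations in other rows are to be expected.

```python
def average_transfer_mb(volume_gb, downloads):
    """Average transfer size in MB, assuming 1 GB = 1024 MB."""
    return volume_gb * 1024 / downloads


# (month, downloads, volume in GB, listed average in MB) from the table above
rows = [
    ("January 2003", 2312, 204, 90),
    ("December 2002", 1985, 200, 103),
    ("June 2002", 1835, 219, 122),
]

for month, downloads, volume_gb, listed_mb in rows:
    computed = round(average_transfer_mb(volume_gb, downloads))
    print(f"{month}: computed {computed} MB, listed {listed_mb} MB")
```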
Countries Using the CERA DB