Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 1

Semantic Data Management for Organising Terabyte Data Archives

Michael Lautenschlager

World Data Center for Climate (M&D/MPIMET, Hamburg)

HAMSTER-WS, Langen, 11.07.03


Content:

• DKRZ archive development

• CERA 1) concept and structure

• Automatic fill process

• CERA access statistics

1) Climate and Environmental data Retrieval and Archiving


DKRZ Archive Development

DKRZ's Archive Increase (estimate of 06.03)

Figure: data amount [TB] per year for the Unix-File Archive and the CERA DB.

Year                 2002   2003   2004   2005   2006   2007
Unix-File Archive     600   1150   1750   2350   3250   4150
CERA DB                12     40    160    360    660    960


Problems in file archive access:

• Missing data catalogue: Directory structure of the Unix file system is not sufficient to organise millions of files.

• Data are not stored application-oriented: Raw data contain time series of 4D data blocks. Access pattern is time series of 2D fields.

• Lack of experience with climate model data: Problems in extracting relevant information from climate model raw data files.

• Lack of computing facilities at client site: Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 in 6 hour storage intervals).

Estimated File Archive Size

Year            2003     2004     2005     2006     2007
Archive size    1.2 PB   1.8 PB   2.4 PB   3.3 PB   4.2 PB
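To put the 1/2 TB figure quoted above into perspective, the arithmetic behind it can be sketched as follows. This is only a rough illustration: it assumes 365-day years and a decimal terabyte (10^12 bytes), and the helper names `steps` and `bytes_per_step` are hypothetical.

```python
HOURS_PER_DAY = 24
STORAGE_INTERVAL_H = 6   # 6-hour storage interval, as stated on the slide
DAYS_PER_YEAR = 365      # assumption: leap days ignored
TB = 10**12              # assumption: decimal terabyte


def steps(years: int) -> int:
    """Number of stored time steps in the given number of model years."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY // STORAGE_INTERVAL_H


def bytes_per_step(volume_bytes: float, years: int) -> float:
    """Average data volume written per stored time step."""
    return volume_bytes / steps(years)


# 1/2 TB corresponds to 10 years of T106 or 50 years of T42 output.
t106 = bytes_per_step(0.5 * TB, 10)
t42 = bytes_per_step(0.5 * TB, 50)

print(f"T106: {steps(10)} steps, about {t106 / 1e6:.1f} MB per step")
print(f"T42:  {steps(50)} steps, about {t42 / 1e6:.1f} MB per step")
```

Under these assumptions each stored time step is tens of megabytes for T106 and a few megabytes for T42, which illustrates why extracting a single 2D field from the raw files is expensive.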


Model Raw Data Structure

Application-oriented Data Storage

5D Model Data Structure

6-hour Storage Interval


CERA Concept: Semantic Data Management

(I) Data catalogue and pointer to Unix files

• Enable search and identification of data

• Allow for data access as they are

(II) Application-oriented data storage

• Time series of individual variables are stored as BLOB entries in DB tables: allow for fast and selective data access

• Storage in standard file format (GRIB): allow for application of standard data processing routines (PINGOs)
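The two parts of the concept can be illustrated with a minimal sketch. It is not the CERA schema: it uses SQLite as a stand-in for the production database, packed floats as a stand-in for GRIB records, and all table names, column names and the file path are hypothetical.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")

# (I) Data catalogue: describes a dataset and points to the raw Unix file.
conn.execute(
    "CREATE TABLE dataset ("
    " dataset_id INTEGER PRIMARY KEY, variable TEXT, raw_file TEXT)"
)
# (II) Application-oriented storage: one BLOB per time step of one variable.
conn.execute(
    "CREATE TABLE blob_data ("
    " dataset_id INTEGER, time_step INTEGER, field BLOB,"
    " PRIMARY KEY (dataset_id, time_step))"
)


def pack_field(values):
    """Serialise a flattened 2D field (stand-in for a GRIB record)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_field(blob):
    """Inverse of pack_field: 4 bytes per 32-bit float."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Hypothetical catalogue entry and eight time steps of a tiny 2x2 field.
conn.execute(
    "INSERT INTO dataset VALUES (1, 'temperature', '/archive/exp001/raw.grb')"
)
for t in range(8):
    conn.execute(
        "INSERT INTO blob_data VALUES (1, ?, ?)",
        (t, pack_field([float(t)] * 4)),
    )

# Selective access: a time series slice of one variable, no raw file needed.
rows = conn.execute(
    "SELECT time_step, field FROM blob_data"
    " WHERE dataset_id = 1 AND time_step BETWEEN 2 AND 4 ORDER BY time_step"
).fetchall()
series = [(t, unpack_field(blob)) for t, blob in rows]
print(series)
```

The point of the sketch is the access pattern: a client asks for a time range of one variable and receives only those 2D fields, instead of reading whole 4D raw data blocks.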


CERA Database System: 7.1 TB (12.2001)

• Data Catalogue

• Processed Climate Data

• Pointer to Raw Data files

DKRZ Mass Storage Archive: 210 TB neglecting Security Copies (12.2001)

Web-Based User Interface (Internet Access)

• Catalogue Inspection

• Climate Data Retrieval

Parts of CERA DB

DB Size 09.07.03: 19.6 TB

• Current database size is 15.3778 Terabyte

• Number of experiments: 299

• Number of datasets: 19818

• Number of BLOBs within CERA at 20-MAY-03: 1104798147

Typical BLOB sizes: 17 kB and 100 kB

Number of data retrievals: 1500 – 8000 / month


Data Structure in CERA DB

Level 1 Interface: Metadata entries (XML, ASCII)

Level 2 Interface: Separate files containing BLOB table data

Diagram labels: Experiment Description; Pointer to Unix-Files; Dataset 1 Description ... Dataset n Description; BLOB Data Table (twice)


Automatic Fill Process (AFP)

Climate Model Raw Data

Application-oriented Data Storage (Interface level 2)

Primary Data Processing

Creation of application-oriented data storage must be automatic!


Archive Data Flow per month

Components: Compute Server, Common File System, Mass Storage Archive, CERA DB System

Unix-Files: 50 TB, 75 TB (2006+)

Application Oriented Data Hierarchy: 2003: 4 TB, 2004: 10 TB, 2005: 17 TB, 2006+: 25 TB

Metadata Initialisation

Important: Automatic fill process has to be performed before corresponding files migrate to mass storage archive.
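As a plausibility check, the monthly fill rates of the application-oriented data hierarchy can be accumulated and compared with the CERA DB estimates from the archive-increase slide. The sketch below assumes that the CERA DB series there reads 40, 160, 360, 660 and 960 TB for 2003 to 2007 and that each rate applies to all twelve months of its year; the variable names are illustrative.

```python
# Monthly fill rates [TB/month] taken from the data flow slide.
afp_rate_tb_per_month = {2004: 10, 2005: 17, 2006: 25, 2007: 25}

# Assumption: CERA DB size estimates [TB] as read from the archive-increase chart.
chart_tb = {2003: 40, 2004: 160, 2005: 360, 2006: 660, 2007: 960}

# Start from the 2003 estimate and add twelve months of filling per year.
size_tb = chart_tb[2003]
projected_tb = {2003: size_tb}
for year in sorted(afp_rate_tb_per_month):
    size_tb += 12 * afp_rate_tb_per_month[year]
    projected_tb[year] = size_tb

for year in sorted(projected_tb):
    print(year, projected_tb[year], chart_tb[year])
```

Under these assumptions the accumulated values stay within a few terabytes of the chart values, which suggests that the chart was derived from the same monthly rates.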


Automatic Fill Process: Steps and Relations

DB Server:

1. Initialisation of CERA DB: metadata and BLOB data tables are created.

Compute Server:

1. Climate model calculation starts with the first month.

2. Next model month starts, together with primary data processing of the previous month: BLOB table input is produced and stored in the dynamic DB fill cache.

3. Step 2 is repeated until the end of the model experiment.

DB Server:

1. BLOB data table input is accessed from the DB fill cache.

2. BLOB table injection and update of metadata.

3. Step 2 is repeated until the table partition is filled (BLOB table fill cache).

4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with step 2.

5. Close the entire table and update metadata after the end of the model experiment.
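The DB server side of these steps can be sketched as a simple loop. This is a simplified illustration and not the actual AFP implementation: compute server and DB server run one after the other instead of concurrently, metadata updates and the HSM write are reduced to list operations, and the names `Partition` and `run_fill_process` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One BLOB table partition with a fixed capacity."""
    capacity_tb: float
    used_tb: float = 0.0
    months: list = field(default_factory=list)


def run_fill_process(monthly_inputs_tb, partition_capacity_tb):
    """Return the closed partitions in the order they would go to the archive."""
    # Compute server: each processed model month lands in the DB fill cache.
    fill_cache = deque(enumerate(monthly_inputs_tb, start=1))

    closed = []
    current = Partition(partition_capacity_tb)
    # DB server: take input from the fill cache and inject it into the table.
    while fill_cache:
        month, size_tb = fill_cache.popleft()
        # Partition full: close it, archive it, and open a new one.
        if current.months and current.used_tb + size_tb > current.capacity_tb:
            closed.append(current)
            current = Partition(partition_capacity_tb)
        current.used_tb += size_tb
        current.months.append(month)
    # End of experiment: close the last partition (and the entire table).
    closed.append(current)
    return closed


# Ten model months of 0.125 TB each, filled into 0.5 TB partitions.
partitions = run_fill_process([0.125] * 10, partition_capacity_tb=0.5)
for p in partitions:
    print(p.months, p.used_tb)
```

With these example numbers the first two partitions hold four model months each and the last one holds the remaining two.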


Automatic Fill Process Development

Diagram labels: GFS; 192 CPUs, 1.5 TB memory; ca. 60 TB; 4 TB/month; BLOB table fill cache

• Oracle RDBMS is currently not connected with GFS.

• AFP development with 1 TB dynamic DB fill cache.

Future:

• RDBMS movement to Linux system and connection with GFS

• AFP is planned to move to DB server


AFP Cache Sizes

Dynamic DB fill cache

• In order to guarantee stable operation, the fill cache should buffer data from 10 days of production.

• Cache size is determined by the automatic data fill rate.

Year           AFP [TB/month]   DB Fill Cache [TB]
2003                  4                1.5
2005                 17                6
2006 - 2007          25                9
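The 10-day rule can be checked against the table with a short calculation. The sketch assumes a 30-day month, and the function name `min_fill_cache_tb` is hypothetical; the slide does not state how the listed cache sizes were rounded.

```python
DAYS_BUFFERED = 10     # buffer requirement from the slide
DAYS_PER_MONTH = 30    # assumption


def min_fill_cache_tb(rate_tb_per_month: float) -> float:
    """Smallest cache that holds 10 days of production at the given rate."""
    return rate_tb_per_month * DAYS_BUFFERED / DAYS_PER_MONTH


# (AFP rate [TB/month], DB fill cache [TB]) as listed in the table above.
slide_values = {"2003": (4, 1.5), "2005": (17, 6), "2006 - 2007": (25, 9)}

for year, (rate, cache) in slide_values.items():
    print(f"{year}: minimum {min_fill_cache_tb(rate):.2f} TB, planned {cache} TB")
```

Under this assumption every planned cache size lies somewhat above the computed minimum, which is consistent with a rounded-up 10-day buffer.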


AFP Cache Sizes

BLOB table partition cache

• Number of parallel climate model calculations and size of BLOB table partitions determine the cache size.

• BLOB table sections of 10 years for the high resolution model and 50 years for the medium resolution model result in 0.5 TB partitions.

• Present-day experience with parallel model runs results in BLOB table partition cache sizes equal to the dynamic DB fill cache.

• Modification will be required if the number of parallel runs and/or table partitions changes.
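The relation between parallel runs, partition size and cache size can be written down as a small calculation. It assumes that each parallel model run keeps exactly one open 0.5 TB partition in the cache, which the slides do not state explicitly; the function names are hypothetical.

```python
PARTITION_TB = 0.5  # partition size from the slide


def partition_cache_tb(parallel_runs: int, partition_tb: float = PARTITION_TB) -> float:
    """Cache needed if every parallel run holds one open partition (assumption)."""
    return parallel_runs * partition_tb


def runs_supported(cache_tb: float, partition_tb: float = PARTITION_TB) -> int:
    """Number of open partitions that fit into a cache of the given size."""
    return int(cache_tb // partition_tb)


# Cache sizes equal to the dynamic DB fill cache values (1.5, 6 and 9 TB).
for cache in (1.5, 6, 9):
    print(f"{cache} TB cache holds {runs_supported(cache)} open partitions")
```

If the number of parallel runs or the partition size changes, the same two functions show directly how the cache size has to be adapted.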


CERA Access Statistics

Month             Number of downloads   Volume (GB)   Average transfer size (MB)
June 2003                3426                 78                 23
May 2003                 5803                117                 21
April 2003               5343                 66                 16
March 2003               3611                109                 31
February 2003            8058                168                 21
January 2003             2312                204                 90
December 2002            1985                200                103
November 2002            2813                134                 49
October 2002             4270                225                 54
September 2002           4054                264                 67
August 2002              5475                216                 40
July 2002                2888                196                 70
June 2002                1835                219                122
May 2002                 2317                150                 66
April 2002               1284                170                136
March 2002               1682                 88                 54
February 2002            2258                105                 47
January 2002              420                 35                 85
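The average transfer size appears to be the monthly volume divided by the number of downloads. The sketch below reproduces a few rows under the assumption that 1 GB = 1024 MB; because the volumes are rounded to whole gigabytes, not every row of the table need reproduce exactly. The function name `average_transfer_mb` is hypothetical.

```python
MB_PER_GB = 1024  # assumption: binary gigabyte


def average_transfer_mb(downloads: int, volume_gb: float) -> float:
    """Mean size of one download in MB."""
    return volume_gb * MB_PER_GB / downloads


# (month, downloads, volume [GB], average listed in the table [MB])
rows = [
    ("June 2003", 3426, 78, 23),
    ("January 2003", 2312, 204, 90),
    ("December 2002", 1985, 200, 103),
    ("January 2002", 420, 35, 85),
]

for month, downloads, volume_gb, listed_mb in rows:
    computed = average_transfer_mb(downloads, volume_gb)
    print(f"{month}: computed {computed:.1f} MB, listed {listed_mb} MB")
```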


CERA DB using countries
