
M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 1

Semantic Data Management for Organising Terabyte Data Archives

Michael Lautenschlager

World Data Center for Climate (M&D/MPIMET, Hamburg)

CAS2K3 Workshop Sept. 2003 in Annecy, France

Home: http://www.mad.zmaw.de/wdcc

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 2

Content:

• General remarks

• DKRZ archive development

• CERA1) concept

• CERA data model and structure

• Automatic fill process

• Database access statistics

1) Climate and Environmental data Retrieval and Archiving

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 3

Semantic data management

• Data consist of numbers and metadata.

• Metadata establish the semantic context of the data.

• Metadata form a data catalogue which makes data searchable.

• Data are produced, archived and extracted within their semantic context.

Data without explanation are only numbers.

Problems:
• Metadata are of different complexity for different data types.
• Consistency between numbers and metadata has to be ensured.

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 4

DKRZ Archive Development

Basic observations and assumptions:

1) Unix-File archive content end of 2002: 600 TB including backups

2) Observed archive rate (Jan. - May 2003): 40 TB/month

3) System changes: 50% compute power increase in August 2003

4) CERA DB size end of 2002: 12 TB

5) Observed Increase (Jan. - May 2003): 1 TB/month

6) Automatic fill process into CERA DB is going to become operational at 4 TB/month this year and should grow from 10% of the archiving rate to approx. 30% by the end of 2004
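Taken together, these assumptions compound into a rough growth projection. A minimal sketch (the function name and the month-by-month rates are my reading of points 1-3 above; the deck's own chart rounds the resulting figures slightly differently):

```python
# Back-of-the-envelope projection of the Unix-file archive size from
# the assumptions above: 600 TB at the end of 2002, 40 TB/month until
# the 50% compute-power increase in August 2003, 60 TB/month afterwards.

def archive_size_tb(end_year):
    """Projected archive size in TB at the end of end_year."""
    size = 600.0  # TB, end of 2002
    for year in range(2003, end_year + 1):
        for month in range(1, 13):
            rate = 40.0 if (year, month) < (2003, 8) else 60.0
            size += rate
    return size

for year in (2003, 2004, 2005, 2006, 2007):
    print(year, archive_size_tb(year))  # 1180, 1900, 2620, 3340, 4060 TB
```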

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 5

DKRZ Archive Development

DKRZ's Archive Increase (Estimate 09.03)

[Chart: data amount [TB] per year; CERA DB curve is a conservative estimate]

Year               2002   2003   2004   2005   2006   2007
Unix-File Archive   600   1200   1920   2640   3360   4080
CERA DB              12     40    184    424    664    904

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 6

Problems with direct file archive access:

• Missing data catalogue: The directory structure of the Unix file system is not sufficient to organise millions of files.

• Data are not stored application-oriented: Raw data contain time series of 4D data blocks (3D in space plus type of variable), while the access pattern is time series of 2D fields.

• Lack of experience with climate model data: Problems in extracting relevant information from climate model raw data files.

• Lack of computing facilities at client site: Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 at 6-hour storage intervals).

Year                          2003     2004     2005     2006     2007
Estimated File Archive Size   1.2 PB   1.9 PB   2.6 PB   3.4 PB   4.1 PB

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 7

Limits of model resolution

ECHAM4 (T42): grid resolution 2.8°, time step 40 min

ECHAM4 (T106): grid resolution 1.1°, time step 20 min

Noreiks (MPIM), 2001

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 8

CERA Concept: Semantic Data Management

(I) Data catalogue and pointers to Unix files
• Enables search and identification of data
• Allows for access to the data as they are

(II) Application-oriented data storage
• Time series of individual variables are stored as BLOB entries in DB tables
• Allows for fast and selective data access
• Storage in a standard file format (GRIB) allows for application of standard data processing routines (PINGOs)

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 9

CERA Database System: web-based user interface for catalogue inspection and climate data retrieval, with internet access to the DKRZ mass storage archive.

Parts of CERA DB:
• Data catalogue
• Processed climate data
• Pointers to raw data files

CERA Database: 7.1 TB (12.2001)
Mass storage archive: 210 TB neglecting security copies (12.2001)

Current database size is 20.5074 TB
Number of experiments: 298
Number of datasets: 29,715
Number of BLOBs within CERA at 03-SEP-03: 1,262,566,234
Typical BLOB sizes: 17 kB and 100 kB
Number of data retrievals: 1500 – 8000 per month

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 10

CERA Data: Jan. Temp.

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 11

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 12

CERA-2 Data Model Blocks

Metadata Entry: This is the central CERA block, providing information on
• the entry's title
• type and relation to other entries
• the project the data belong to
• a summary of the entry
• a list of general keywords related to the data
• creation and review dates of the metadata

Additionally: Modules and Local Extensions
• Module DATA_ORGANIZATION (grid structure)
• Module DATA_ACCESS (physical storage)
• Local extension for specific information on (e.g.) data usage, data access and data administration

Coverage: Information on the volume of space-time covered by the data

Reference: Any publication related to the data together with the publication form

Status: Status information like data quality, processing steps, etc.

Distribution: Distribution information including access restrictions, data format and fees if necessary

Contact: Data related to contact persons and institutes like distributor, investigator, and owner of copyright

Parameter: Block describing the data topic, variable and unit

Spatial Reference: Information on the coordinate system used

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 13

Data Model Functions

The CERA-2 data model …

• allows for data search according to discipline, keyword, variable, project, author, geographical region and time interval, and for data retrieval.

• allows for specification of data processing (aggregation and selection) without touching the primary data.

• is flexible with respect to local adaptations and storage of different types of geo-referenced data.

• is open for cooperation and interchange with other database systems.

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 14

Data Structure in CERA DB

Experiment description
  Dataset 1 description → BLOB data table
  …
  Dataset n description → BLOB data table
  Pointer to Unix files

Level 1 interface: metadata entries (XML, ASCII)
Level 2 interface: separate files containing BLOB table data

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 15

Automatic Fill Process (AFP)

Creation of application-oriented data storage must be automatic because of the large archive rates!

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 16

Archive Data Flow per Month

Compute server → common file system: model output as Unix files
Common file system → mass storage archive: 60 TB/month (Unix files)
Common file system → CERA DB system (metadata initialisation, application-oriented data hierarchy):
  2003: 4 TB/month
  2004: 12 TB/month
  2005+: 20 TB/month

Important: The automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 17

Automatic Fill Process (AFP): Steps and Relations

DB server:

1. Initialisation of CERA DB: metadata and BLOB data tables are created

Compute server:

1. Climate model calculation starts with the 1st month

2. Next model month starts; primary data processing of the previous month: BLOB table input is produced and stored in the dynamic DB fill cache

3. Step 2 is repeated until the end of the model experiment

DB server:

1. BLOB data table input is accessed from the DB fill cache

2. BLOB table injection and update of metadata

3. Step 2 is repeated until the table partition is filled (BLOB table fill cache)

4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with 2)

5. Close the entire table and update metadata after the end of the model experiment
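The DB-server loop above can be sketched in a few lines. Everything here (the Partition class, the in-memory archive list) is an illustrative stand-in, not the actual CERA implementation, which fills Oracle BLOB tables and migrates DB files to the HSM archive:

```python
# Illustrative sketch of the DB-server side of the AFP: inject BLOB
# inputs from the fill cache, close each partition when it reaches the
# partition size, and hand closed partitions to the archive.

PARTITION_SIZE_GB = 10  # one BLOB table partition = one 10 GB DB file

class Partition:
    """In-memory stand-in for one BLOB table partition."""
    def __init__(self):
        self.blobs = []
        self.size_gb = 0.0

    def insert(self, blob_gb):
        self.blobs.append(blob_gb)
        self.size_gb += blob_gb

def run_fill_stream(blob_inputs_gb, archive):
    """Inject BLOB inputs, archiving each partition as it fills."""
    partition = Partition()
    for blob_gb in blob_inputs_gb:          # input from the DB fill cache
        partition.insert(blob_gb)           # BLOB table injection
        if partition.size_gb >= PARTITION_SIZE_GB:
            archive.append(partition)       # closed partition -> HSM archive
            partition = Partition()         # open a new partition
    if partition.blobs:                     # close the last, partial partition
        archive.append(partition)

archive = []
run_fill_stream([2.5] * 10, archive)        # 25 GB of BLOB input
print(len(archive))                         # 3 partitions: 10 + 10 + 5 GB
```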

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 18

AFP Disk Cache Sizes

Dynamic DB fill cache (BLOB table input time series)

• In order to guarantee stable operation the fill cache should buffer data from approximately 10 days production.

• Cache size is determined by the automatic data fill rate of up to 1/3 of the archive increase.

Year          AFP [TB/month]   DB fill cache [TB]
2003                4               1.5
2004               12               4
2005 - 2007        20               7
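The cache sizes in the table follow from the 10-day buffering rule above. A sketch, assuming the result is rounded up to the next 0.5 TB (the rounding rule is my assumption; it reproduces the table's figures):

```python
# Dynamic DB fill cache sizing: buffer roughly 10 days of production
# at the AFP rate, rounded up to the next 0.5 TB (assumed rounding).
import math

def fill_cache_tb(afp_tb_per_month, buffer_days=10):
    raw = afp_tb_per_month * buffer_days / 30.0   # ~10 of 30 days
    return math.ceil(raw * 2) / 2                 # round up to 0.5 TB

for rate in (4, 12, 20):
    print(rate, fill_cache_tb(rate))  # 1.5, 4.0, 7.0 TB
```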

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 19

AFP Disk Cache Sizes

BLOB table fill cache (open BLOB table partitions) depends on:

• BLOB table partition = table space = DB file: 10 GB (adapted to the UniTree environment; data are in GRIB format)

• 10 GB per table partition results in 5 TB per fill stream for a standard set of variables (approx. 500 2D global fields) of the current climate model.

• Number of parallel fill streams (climate model calculations): 8

• An additional 25% is needed for HSM transfer of closed partitions and error tracking.

• The BLOB table partition cache results as (8 streams * 5 TB/stream) * 1.25 = 50 TB.
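The arithmetic in the bullets above, spelled out (assuming 1 TB = 1000 GB, which reproduces the 5 TB/stream figure):

```python
# BLOB table partition cache sizing from the figures above.
partition_gb = 10          # one BLOB table partition (= one DB file)
variables = 500            # standard set of 2D global fields per model
streams = 8                # parallel climate-model fill streams
overhead = 1.25            # +25% for HSM transfer and error tracking

per_stream_tb = partition_gb * variables / 1000   # open partitions per stream
cache_tb = streams * per_stream_tb * overhead
print(per_stream_tb, cache_tb)   # 5.0 TB/stream, 50.0 TB total
```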

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 20

CERA Access Statistics

Month            Downloads   Volume (GB)   Average transfer size (MB)
June 2003            3426         78             23
May 2003             5803        117             21
April 2003           5343         66             16
March 2003           3611        109             31
February 2003        8058        168             21
January 2003         2312        204             90
December 2002        1985        200            103
November 2002        2813        134             49
October 2002         4270        225             54
September 2002       4054        264             67
August 2002          5475        216             40
July 2002            2888        196             70
June 2002            1835        219            122
May 2002             2317        150             66
April 2002           1284        170            136
March 2002           1682         88             54
February 2002        2258        105             47
January 2002          420         35             85
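The average transfer size column is simply volume divided by downloads; a quick consistency check on a few rows (assuming 1 GB = 1024 MB, which matches the table's rounding):

```python
# Spot-check: average transfer size (MB) = volume (GB) * 1024 / downloads.
rows = [
    ("June 2003",    3426,  78),
    ("May 2003",     5803, 117),
    ("January 2002",  420,  35),
]
for month, downloads, volume_gb in rows:
    avg_mb = volume_gb * 1024 / downloads
    print(month, round(avg_mb))  # 23, 21, 85 MB
```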

Summary:

The system is used, mainly by external users.

Application-dependent data storage allows for precise data access and reduces the data transfer volume by a factor of 100 compared to direct file transfer.

But presently data access via the CERA DB is only a few percent of DKRZ's data download.

M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 21

Countries using the CERA DB

URL: http://cera-www.dkrz.de/CERA/index.html