CERA / WDCC

Hannes Thiemann
Max-Planck-Institut für Meteorologie
Modelle und Daten
hannes.thiemann @ zmaw.de

NCAR, October 27th – 29th, 2008

Contents

Statistics
Requirements + Features
General architecture
Implementation (current and new)
Migration
Summary

Basic Statistics

WDCC / CERA: General Statistics at 01-10-2008 00:00:10

Database Size (TByte): 370

Number of blobs: 6663287791 (6.6 billion)

Data access by fields and not by files.

Number of experiments: 1146

Number of datasets: 142062

Total size divided by number of BLOBs gives the average size of data access granules: 50 - 60 kB/BLOB
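
The average granule size quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using only the two figures from the slide. Whether "TByte" is meant as 10^12 or 2^40 bytes is not stated, so both readings are computed as an assumption.

```python
# Figures quoted on the slide.
DATABASE_SIZE_TBYTE = 370
NUMBER_OF_BLOBS = 6_663_287_791


def average_blob_size_bytes(size_tbyte, blobs, bytes_per_tbyte):
    """Average size of one BLOB in bytes for a given definition of TByte."""
    return size_tbyte * bytes_per_tbyte / blobs


# Assumption: the slide does not say which definition of TByte is used.
decimal_avg = average_blob_size_bytes(DATABASE_SIZE_TBYTE, NUMBER_OF_BLOBS, 10**12)
binary_avg = average_blob_size_bytes(DATABASE_SIZE_TBYTE, NUMBER_OF_BLOBS, 2**40)

print(f"decimal TByte: {decimal_avg / 1000:.1f} kB per BLOB")
print(f"binary TByte:  {binary_avg / 1024:.1f} KiB per BLOB")
```

Both readings give an average in the mid to upper 50s, in line with the 50 - 60 kB/BLOB stated on the slide.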

Users by continent

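[Pie chart: Active Users 1-Jan-2008 until 14-Oct-2008. Categories: Campus, Germany, Europe, AF+OC+SA, North America, Asia. Shares shown: 12%, 25%, 27%, 4%, 13%, 19%; the assignment of shares to categories is not recoverable.]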

Download destinations

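[Pie chart: Download destinations 1-Jan-2008 until 14-Oct-2008. Categories: Campus, Germany, Europe, OC, AF, SA, North America, Asia. Shares shown: 3%, 12%, 6%, 0%, 14%, 65%; the assignment of shares to categories is not recoverable.]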

Records per download

[Chart: percent of downloads (y-axis, 0 to 100) against records per download (x-axis: 1, 12, 120, 240, 600, 1200, 12000). Data labels, in order: 66, 67, 72, 85, 87, 90, 98.]

Record size

[Chart: record size in bytes (y-axis, logarithmic, 1 to 10000000000) against percent (x-axis ticks: 1, 8, 29, 32, 35, 84, 89, 92, 96, 99).]

Requirements and constraints

Access over WAN
Downloads are typically quite small, but there are huge downloads to some extent.
Small downloads imply that users are not willing to wait long …
We cannot scan through large files for each download
Granularity has to be small

Datatypes

Model data
  Climatological runs (global and regional) (IPCC, …)
  Weather forecasts (DPHASE, CEOP, …)
Reanalysis data
Observational data (COPS, CARIBIC, …)
Satellite data products

Formats

CERA provides the ability to store data of any format.

These are the formats used:
GRIB (60%)
NetCDF (18%)
Other (22%)

General Architecture

[Diagram: Proxy, Webserver and Appl. Server in the midtier, in front of the Metadata and Data databases. Metadata blocks: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org. Data-side functions: select timestep + region; convert format.]

Storage within CERA

[Diagram: a database table. Labels: Database Table, Data of single variable, Index.]

Row 1: Data of timestep i
Row 2: Data of timestep i+1
Row 3: Data of timestep i+2
…
Row n: Data of timestep i+n-1
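
The layout above (one table per variable, one indexed row per timestep) can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in for the Oracle databases, and the table name `temperature_2m`, the column names and the payloads are hypothetical.

```python
import sqlite3


def create_variable_table(conn, variable):
    """Create one table for a single variable: one row per timestep."""
    # The table name comes from a hypothetical, trusted variable name.
    conn.execute(
        f'CREATE TABLE "{variable}" ('
        "timestep INTEGER PRIMARY KEY, "  # indexed row number
        "field BLOB NOT NULL)"            # one encoded field per timestep
    )


def store_field(conn, variable, timestep, field):
    """Insert the field of one timestep as a BLOB."""
    conn.execute(f'INSERT INTO "{variable}" VALUES (?, ?)', (timestep, field))


def fetch_fields(conn, variable, first, last):
    """Return the fields for timesteps first..last (inclusive), in order."""
    rows = conn.execute(
        f'SELECT field FROM "{variable}" '
        "WHERE timestep BETWEEN ? AND ? ORDER BY timestep",
        (first, last),
    )
    return [bytes(row[0]) for row in rows]


conn = sqlite3.connect(":memory:")
create_variable_table(conn, "temperature_2m")
for step in range(1, 6):
    store_field(conn, "temperature_2m", step, f"field-{step}".encode())

# Only the requested timesteps are read; the rest of the table is not scanned.
selected = fetch_fields(conn, "temperature_2m", 2, 4)
```

Because the timestep is the primary key, a request for a few timesteps touches only those rows, which is the point of accessing data by fields rather than by files.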

Handicap

Handicap: not enough disk space available
  Data stored within database: approx. 400 TB
  Disks available: approx. 24 TB

Database has been coupled transparently to the HSM system

How do we avoid frequent tape accesses?
  Big cache
  Store data that users request together as close together as possible: split into single variables
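
The "big cache" idea can be illustrated with a small size-bounded cache in front of a slow backend. This is a sketch only: the slides do not say which replacement policy is used, so least-recently-used eviction is an assumption here, and `fetch_from_tape` is a hypothetical callable standing in for a tape request.

```python
from collections import OrderedDict


class StagingCache:
    """LRU cache of staged data, limited by total size in bytes."""

    def __init__(self, capacity_bytes, fetch_from_tape):
        self.capacity_bytes = capacity_bytes
        self.fetch_from_tape = fetch_from_tape  # hypothetical slow backend call
        self.entries = OrderedDict()            # key -> bytes, oldest first
        self.used_bytes = 0
        self.tape_requests = 0

    def get(self, key):
        """Return the data for key, going to tape only on a cache miss."""
        if key in self.entries:
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        data = self.fetch_from_tape(key)
        self.tape_requests += 1
        self.entries[key] = data
        self.used_bytes += len(data)
        # Evict least recently used entries until the cache fits again.
        while self.used_bytes > self.capacity_bytes and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return data


# Toy backend: every key yields 10 bytes; the cache holds two such entries.
cache = StagingCache(capacity_bytes=20, fetch_from_tape=lambda key: b"x" * 10)
for key in ["a", "b", "a", "c"]:
    cache.get(key)
```

In this toy run the second request for "a" is served from the cache, and loading "c" evicts "b", the entry that was used least recently.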

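Data migration

[Diagram: two read-write tablespaces (TBS - RW) holding table partition 1 and table partition 2, a read-only tablespace (TBS - RO) holding table partition 1, and dxdb, connected by migout and migin.]

All tablespaces are moved "at once" to dxdb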

Inside the datafile

[Diagram: contents of a datafile. Labels: Header 128k, Table, Primary Key, Lob Index, Blob data.]

Frontend versus Backend

[Diagram: the same datafile on the filesystem frontend and on the HSM backend. Labels: Header 128k; Part 1 = 512 MB; Part 2 = 512 MB.]
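
If a datafile is laid out as a 128k header followed by parts of 512 MB, the parts needed for a given byte range follow from simple integer arithmetic. The sketch below assumes that the parts follow the header contiguously, that "128k" and "512 MB" are binary units, and that offsets count from the start of the file; none of this is confirmed by the slides, and `parts_for_range` is a hypothetical helper.

```python
HEADER_BYTES = 128 * 1024          # "Header 128k" (assumed to be 128 KiB)
PART_BYTES = 512 * 1024 * 1024     # "Part n = 512 MB" (assumed to be 512 MiB)


def part_of(position):
    """Return 0 for a byte inside the header, else the 1-based part number."""
    if position < HEADER_BYTES:
        return 0
    return (position - HEADER_BYTES) // PART_BYTES + 1


def parts_for_range(offset, length):
    """Return the parts that hold the bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    first = part_of(offset)
    last = part_of(offset + length - 1)
    return list(range(first, last + 1))


# A 55 kB record right after the header lies entirely in part 1.
small_read = parts_for_range(HEADER_BYTES, 55_000)
# A read that straddles the boundary between part 1 and part 2 needs both.
boundary_read = parts_for_range(HEADER_BYTES + PART_BYTES - 10, 20)
```

With granules of 50 - 60 kB and parts of 512 MB, almost every record falls into a single part, so a typical download would need at most one part to be staged.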

Retrieving data

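[Diagram: numbered steps 1 to 5 of a read against a datafile with a 128k header, including a tape request; the step descriptions are not recoverable.]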

Warehouse features

Compression – nothing special used within the server

Partitioning – allow parts of data to be moved to HSM

Backup
  Nologging - beware of crash …
  Read only - two copies on tape

New implementation

Metadata database will stay as is

Oracle databases holding data will be replaced by a new in-house development

Why?
  There is a certain risk that a future version of Oracle may not work with a / any HSM system
  In the long run, some license costs shall be saved

General Architecture - new

[Diagram: Webserver and Appl. Server in front of the Metadata side (Oracle-DB) and the Data side (Blobserver).]

CERA-Container

Instead of keeping data within blobs in Oracle databases, data records will be kept within so-called CERA Container Files.

They can hold a huge number of records.
They provide fast access independent of position within the file (granular access).
They provide fault tolerance against tape damage by keeping checksums within the files.
Read/write operations on container files are enclosed in transactions.
Well-known format
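
Two of the properties listed above, position-independent access and checksums kept inside the file, can be illustrated with a toy container. This is not the CERA Container format, which the slides do not describe; the record layout (length, CRC-32, payload) and all function names below are assumptions made for the example, and transactions are left out.

```python
import io
import struct
import zlib

# Hypothetical record layout (NOT the actual CERA Container format):
# 4-byte big-endian payload length, 4-byte CRC-32 of the payload, payload.
RECORD_HEADER = struct.Struct(">II")


def append_record(stream, payload):
    """Append one record at the end of the stream and return its byte offset."""
    offset = stream.seek(0, io.SEEK_END)
    stream.write(RECORD_HEADER.pack(len(payload), zlib.crc32(payload)))
    stream.write(payload)
    return offset


def read_record(stream, offset):
    """Read the record at offset and verify its checksum."""
    stream.seek(offset)
    length, expected_crc = RECORD_HEADER.unpack(stream.read(RECORD_HEADER.size))
    payload = stream.read(length)
    if len(payload) != length or zlib.crc32(payload) != expected_crc:
        raise ValueError(f"corrupt record at offset {offset}")
    return payload


def build_demo_container():
    """Create an in-memory container with three records and an offset index."""
    container = io.BytesIO()
    index = {}
    for timestep, payload in enumerate([b"field-a", b"field-bb", b"field-ccc"]):
        index[timestep] = append_record(container, payload)
    return container, index


container, index = build_demo_container()
# Granular access: jump straight to record 1 without reading record 0.
middle = read_record(container, index[1])
```

With an offset index, reading one record costs one seek regardless of where the record sits in the file, and a damaged payload is detected on read instead of being returned silently.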

Migration

Concept / Team (namely Peter Drakenberg, DKRZ)
  Not yet really finished
Software
  First software ready, in order to migrate data
Convert old data
  Started last week, but will take at least a year

Dataflow: outbound

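[Diagram: numbered steps 1 to 8 of an outbound request passing through Webserver, Appl. Server, Metadata, Data and Processing; the step descriptions are not recoverable.]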

Dataflow: inbound

[Diagram: inbound flow. Labels: Model run, GFS, Postprocessing, Metadata, Dataserver.]

Summary

CERA allows for the storage of data of different kinds
  Format independent
Metadata enables addressing of internal and external data
Users are typically fetching only small amounts of data.
System allows for efficient access to small data granules
  By using warehousing functions like partitioning
  By using small Oracle database blobs or, in future, CERA Container files.

Thank you!
