Digital | Curation | Centre

The Digital Curation Centre Experience
(Science data & CCLRC experience)
David Giaretta & David Corney


26 July 2005 DCC/DPC Workshop on Cost Models for preserving digital assets

Outline

• Science data characteristics
• CCLRC experience
• Costs
• Benefits
• Trends
• Conclusions

Science Data Characteristics

• Mostly numbers – objects often complex and interrelated
• Representation not Presentation
  – Not just to be looked at by humans (i.e. emulation of associated software usually not enough)
• Often needs processing
  – Different levels of processing & trends of access
  – On-the-fly processing from raw
• Often freely available (e.g. after 1 year)
• Often large volumes
  – Automated systems
• Unforgiving
  – Need to beware of “junk” science
• Needs to be usable in current tools (i.e. emulation is not enough)

CCLRC Recent New Users & Potential New Users

• National Crystallography Service, Southampton University (2 TB/yr)
• VIRGO Consortium (3 TB/yr?)
• Integrative Biology (15 TB/yr?)
• WASP (Astronomy) (30 TB/yr?)
• BBSRC ? (50 TB/yr?)
• Diamond (1 PB/yr?)
• GRID-PP (1 PB/yr)

Datastore Usage by Family

[Figure: chart of datastore usage by family. Y-axis: Tbytes (0 to 250). X-axis: six-monthly from Jun-97 to Dec-04, then Apr-05. Legend (family codes, run together in extraction): CR-AFRCCRAYSUPCR-EPSRCCR-NERCCR-PPARCDCI-ISEDCI-NETDCI-OHDCI-PCDCI-VISDL-SRDEDGESCIENCEEXTERNALFACILMANFUJISUPITD-SERITD-SUPNUCPHYSRAL-ADMRAL-ENGRAL-SCIRAL-TECHSCALSUPSCALUSERSSDSSD-EODSSD-PPAR]

Data Growth per period

[Figure: chart of data growth per period. Y-axis: Tbytes (-10 to 80). X-axis: six-monthly from Jun-97 to Dec-04, then Apr-05.]

Expected future demand

Year                        2005    2006    2007    2008
LHC bandwidth (MB/sec)      50      250     400     600
LHC data volume (PB)        0.3     0.6     1.2     3.4
Diamond data volume (PB)    0       0       1.0     1.0
CCLRC data volume (PB)      0.2     0.5     0.7     1.0
External (PB)               0.05    0.10    0.2     0.2
Total (PB)                  0.55    1.2     3.1     5.6
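The Total (PB) row of the expected-demand table appears to be the sum of the LHC, Diamond, CCLRC and External volumes for each year. A minimal sketch that checks this, with the figures transcribed from the slide; the variable and function names are illustrative and not from the original:

```python
# Expected demand in PB per year, transcribed from the slide's table.
YEARS = [2005, 2006, 2007, 2008]
DEMAND_PB = {
    "LHC": [0.3, 0.6, 1.2, 3.4],
    "Diamond": [0.0, 0.0, 1.0, 1.0],
    "CCLRC": [0.2, 0.5, 0.7, 1.0],
    "External": [0.05, 0.10, 0.2, 0.2],
}
# Totals as stated on the slide.
STATED_TOTAL_PB = [0.55, 1.2, 3.1, 5.6]


def yearly_totals(demand):
    """Sum the per-source volumes year by year (column-wise)."""
    return [round(sum(column), 2) for column in zip(*demand.values())]


totals = yearly_totals(DEMAND_PB)
for year, total, stated in zip(YEARS, totals, STATED_TOTAL_PB):
    print(year, total, stated)
```

The computed totals agree with the stated row, which also confirms the column order recovered from the slide.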

Actual Growth 1997-2003

[Figure: chart with two series, Cumulative Data Volume (GB) and Actual Growth (GB). Y-axis: Data Volume (GB), -20000 to 100000. X-axis: Time (years), quarterly from Jun-97 to Dec-03.]

Atlas Storage: Predicted Demand (TB)

[Figure: chart of predicted demand with two series, Upper bound data volume (TB) and Lower bound data volume (TB). Y-axis: 0 to 4000. X-axis: 2003 to 2006.]

Capacity & performance - Hardware

• Hardware
  – Defines both performance and capacity
  – Changing fast but well understood (buy as late as possible)
  – Tied into technology futures of manufacturers and HEP community
  – Currently hardware is effectively “infinitely” scalable
• Future estimated storage capacity & bandwidth for a 6000 slot robot:

  Year                  2003/04    2006/7        2008/9
  Technology            9940B      Titanium 1    Titanium 2
  Tape capacity         200 GB     500 GB        1000 GB
  Bandwidth (MB/sec)    30 - 40    80 - 100      ~200
  Capacity (PB)         1.2 PB     3 PB          6 PB
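The capacity figures for the 6000 slot robot are consistent with every slot holding one tape of the stated size. A small sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and a fully populated robot; the names are illustrative only:

```python
SLOTS = 6000  # robot size stated on the slide

# (period, technology, tape capacity in GB), as read from the slide's table
GENERATIONS = [
    ("2003/04", "9940B", 200),
    ("2006/7", "Titanium 1", 500),
    ("2008/9", "Titanium 2", 1000),
]


def robot_capacity_pb(slots, tape_gb):
    """Total capacity in PB with every slot filled (assumes 1 PB = 1,000,000 GB)."""
    return slots * tape_gb / 1_000_000


capacities = {period: robot_capacity_pb(SLOTS, gb) for period, _, gb in GENERATIONS}
for period, technology, gb in GENERATIONS:
    print(period, technology, capacities[period], "PB")
```

Under these assumptions the three generations come to 1.2 PB, 3 PB and 6 PB, matching the slide.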

Data Growth

- observatory archives growing as detectors grow

[Figure: growth from 1970 to 2000 on a log scale (0.1 to 1000) of two series, world area of 3m+ (sq.m.) and largest detectors (Mpix), with detector technologies labelled Glass and CCDs.]

- VISTA will have a Gpixel array

[Diagram: ADS system layout, dated Thursday, 04 November 2004.
Tape: STK 9310 with 8 x 9940 tape drives; Brocade FC switches ADS_switch_1 and ADS_Switch_2, 4 drives to each switch.
Production system: ermintrude, florence, zebedee, dougal (AIX, dataserver); brian (AIX, flfsys); ADS0CNTR (Redhat, counter); ADS0PT01 (Redhat, pathtape); ADS0SB01 (Redhat, SRB interface); dylan (AIX, import/export); buxton (SunOS, ACSLS); array1 to array4; catalogue; cache.
Test system: mchenry1 (AIX, test flfsys); basil (AIX, test dataserver); catalogue; cache.
User side: SRB Inq, S commands, MySRB; tape devices; ADStape; ADS sysreq; admin commands (create, query); user pathtape commands; logging.
Connection types: physical connection (FC/SCSI); sysreq udp command; user SRB command; VTP data transfer; SRB data transfer; STK ACSLS command; SRB pathtape commands.
Note on diagram: all sysreq, vtp and ACSLS connections to dougal also apply to the other dataserver machines, but are left out for clarity.]

Tape Drive Performance as a Function of File Size

[Figure: tape drive throughput (MB/sec, 0 to 40) plotted against file size (MB, 0 to 800).]
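The plotted values are not recoverable here. As a rough illustration of why tape throughput depends on file size, the sketch below uses a simple model in which each file pays a fixed overhead before streaming at the drive's native rate. This model and both parameter values are assumptions for illustration, not measurements from the slides:

```python
def effective_throughput(file_mb, streaming_mb_s, overhead_s):
    """Effective MB/sec for one file: size divided by (fixed overhead + streaming time).

    Hypothetical model: overhead_s stands for per-file costs such as
    positioning and start/stop; streaming_mb_s is the drive's sustained rate.
    """
    return file_mb / (overhead_s + file_mb / streaming_mb_s)


# Illustrative parameters only (assumed, not from the source).
STREAMING_MB_S = 30.0
OVERHEAD_S = 5.0

for size_mb in (10, 100, 400, 800):
    rate = effective_throughput(size_mb, STREAMING_MB_S, OVERHEAD_S)
    print(f"{size_mb} MB file: {rate:.1f} MB/sec")
```

In this model small files are dominated by the fixed overhead, while large files approach the streaming rate without reaching it.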

Types of costs

• Capture costs
• Storage costs
• Maintenance costs
• Access/Dissemination costs

• Total cost of ownership

Trends

• 1986 disk 5MB/£250 = 20KB/£
• 1994 disk/DAT 3GB/£3K = 1MB/£
• 1995 disk 420MB/£40 = 10MB/£
• 1998 disk 5GB/£250 = 20MB/£
• 2004 disk 60GB/£60 = 1000MB/£

Doubles every year
  » Data from Byte new products
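The "doubles every year" claim can be checked against the first and last price points. A minimal sketch, using decimal units (20KB = 0.02MB) and illustrative names not found in the original:

```python
import math

# (year, storage per pound in MB), from the slide's price points
PRICE_POINTS = [
    (1986, 0.02),
    (1994, 1.0),
    (1995, 10.0),
    (1998, 20.0),
    (2004, 1000.0),
]


def doubling_time_years(start, end):
    """Years per doubling, assuming steady exponential growth between two points."""
    (year0, value0), (year1, value1) = start, end
    return (year1 - year0) / math.log2(value1 / value0)


overall = doubling_time_years(PRICE_POINTS[0], PRICE_POINTS[-1])
print(f"1986 to 2004: one doubling every {overall:.2f} years")
```

Over 1986 to 2004 this gives a doubling time of a little over one year, in line with the slide's summary; individual intervals vary more widely.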

• The expected cost of the Atomic Holographic DVR disc drive will be from $570 to $750 with the replacement discs for $45.

One 10 terabyte to 100 terabyte 3.5 in FEdisk

Issues

• System changes
• Collection migration to new systems
  – Descriptive Information
  – Finding Aids

Consideration of service quality

• bit preservation
• currently aiming to be self funding
• aim to cover costs only
• lower storage costs are dependent on increased usage
• increased usage is hard to predict
• current charge of £1k/TB/yr

Costs and charging

• H/W Costs
  – Total ~ £1m every 4-5 years, equiv to ~ £250K/yr
  – H/W upgrades are costly – installation, configuration, test; and associated data migration - many months
  – Example component costs:
    • Robot (6000 slots) ~ £300K
    • Media £420K (@ £70 per unit)
    • Disk ~ 1.5K/TB? ~ £50K for 75TB commodity?
    • Tape drives £20K each (est. T1s and T2s). Total ~ £200K for 10
    • Data Servers:
      – Linux: £3K each. Total ~ £30K for 10
      – AIX: £14K each. Total ~ £140K for 10
    • Network/switches ~ £50K
  – Numbers are the key to flexible performance, esp. data servers and tape drives
• S/W Costs – Currently limited to staff development costs
• Staff 2.5 FTE: system manager + system developer + 0.5 operations staff
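As a rough cross-check of the "~ £1m every 4-5 years" figure, the example component costs can be summed and spread over the refresh period. This is a sketch under stated assumptions: the media count of 6000 units is inferred from £420K at £70 per unit, the disk figure is the uncertain £50K from the slide, and the Linux and AIX data server options are treated as alternatives. All names are illustrative:

```python
# Example component costs in £K, taken from the slide.
COMMON_K = {
    "robot (6000 slots)": 300,
    "media (6000 units @ £70, count inferred)": 420,
    "disk (75 TB, marked uncertain on the slide)": 50,
    "tape drives (10 @ £20K)": 200,
    "network/switches": 50,
}
# Ten data servers, priced either as Linux or as AIX (treated as alternatives).
DATA_SERVERS_K = {"Linux": 10 * 3, "AIX": 10 * 14}


def refresh_total_k(server_kind):
    """Total hardware refresh cost in £K for the chosen data server type."""
    return sum(COMMON_K.values()) + DATA_SERVERS_K[server_kind]


def annualised_k(total_k, years):
    """Spread a refresh cost evenly over the refresh period."""
    return total_k / years


for kind in DATA_SERVERS_K:
    total = refresh_total_k(kind)
    print(kind, total, annualised_k(total, 4), annualised_k(total, 5))
```

Either option lands slightly above £1m per refresh, which is roughly £210K to £290K per year over a 4 to 5 year cycle.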

ADS Running Costs 04/05. (Option 1).

[Pie chart of running costs:]
• H/W maintenance: 11%
• S/W maintenance: 3%
• Hardware: 15%
• Network: 0%
• Other: 5%
• Staff costs: 66%

SRB-ADS architecture

[Diagram: SRB-ADS architecture. Components shown: SRB Client; SRB MCAT Server; SRB MCAT Database; SRB Disk Server (Local Server); SRB ADS Server; Atlas Data Store. Server instances shown: SRB-ISIS, SRB-BADC and SRB-CCLRC. Ports labelled: 5600, 5601 and 5602.]

Functional Diagram of BADC/APS

[Diagram: functional diagram of BADC/APS.
Parties: BADC Team; BADC Support Team; Authorising Authority (BADC or external data manager); External User.
Components and functions: Administration; User Database; Metadata; Data; Generate metadata; Ingest files; Volume plans; Format descr.; Discovery Search; Data Access via FTP & HTTP; Handle queries; Manage user accounts.
Flows: Submitted files; Corrected files re-ingested; BADC team add metadata; Harvest; New and updated files; Data submission authorisation; Authentication and authorisation; Registration details and updates; Query and response; Access request and authorisation; Report on user details; Query, update database; User details; Search & results; Data requests & data; Authentication.]

OAIS Functional Model

[Diagram: OAIS functional model. External entities: PRODUCER, CONSUMER, MANAGEMENT. Functional entities: Ingest, Data Management, Archival Storage, Access, Administration, Preservation Planning. Flows: SIP, AIP, DIP, Descriptive Info, queries, result sets, orders.]

BADC mapped to OAIS

[Diagram: the Functional Diagram of BADC/APS, with the same parties, functions and flows, overlaid with OAIS functional entity labels: Preservation Planning, Ingest, Access, Administration, Metadata Management (around Metadata) and Archival Storage (around Data).]

Space Missions - special features

• Space missions are very expensive (100s of millions of dollars/euros)
  – Specialised hardware and software
• Information is usually the only thing left after the mission
• Data Exploitation costs are usually small

Costs of Preparation

• IUE Final Archive
  – IUE launched in 1978
  – Early example of long-term preservation
• 12 years after launch
  – New processing algorithms
  – New products
• Trends in access
  – New Formats
  – Translation of telemetry
  – Dictionaries for keywords in header
  – Capture of hand-written Observer logs
  – New catalogues

Cost Sharing

• Shared archival storage – economies of scale
• Shared discovery/access
• Shared Preservation Planning
  – Technology watch
  – Representation Information – Registries
• Abstraction and virtualisation
• Automated migration
  – Preservation Description Information - tools
• Bring benefits forward
  – Curation
  – Interoperability
• Distance in discipline is like Distance in time

Metrics for Benefits

• National/organisational pride
• Scientific
  – Number of references
  – Number of publications
  – Number of requests
• Financial
  – Sale of data
  – Investment in information systems
• Legal
  – Avoid penalties

Archive Research

[Figure: Ingest and Retrievals in Gbytes/Day (0 to 30) against Year, 1994.8 to 1999.3.]

Already more retrieval than ingest!

- large fraction of astro-papers based on archives

- HST archive use growing faster than archive

Conclusions

• Preservation costs of any item:
  – Storage costs of the bits will fall
  – Migration can be automated (and done on request)
  – Costs to keep information usable (as in OAIS) could grow but can be shared
    • Sharing nationally and internationally
• Ingest costs can be reduced by forward planning by/agreements with producers
• Benefits can be brought forward
  – Link to widening Interoperability
• Benefits must be measured