Salvatore Mele CERN HEP and its data What is the trouble? A possible way forward Tools & Trends International Conference on Digital Preservation Den Haag - November 2 nd 2007


Salvatore Mele CERN

HEP and its data
What is the trouble?
A possible way forward

Tools & Trends
International Conference on Digital Preservation

Den Haag - November 2nd 2007


High-Energy Physics (or Particle Physics)
HEP aims to understand how our Universe works:
– by discovering the most elementary constituents of matter and energy
– by probing their interactions
– by exploring the basic nature of space and time

In other words, try to answer two eternal questions:
– “What is the world made of?”
– “What holds it together?”

Build the largest scientific instruments ever to reach energy densities close to the Big Bang; write theories to predict and describe the observed phenomena

Salvatore Mele - Preservation, re-use and (open) access of HEP data - Den Haag 02/11/2007


CERN: European Organization for Nuclear Research (since 1954)

• The world-leading HEP laboratory, Geneva (CH)
• 2500 staff (mostly engineers)
• 8000 users (mostly physicists)
• 3 Nobel prizes (Accelerators, Detectors, Discoveries)
• Invented the web
• Commissioning the 27-km (6000 M€) LHC accelerator
• Runs a 1-million-object Digital Library

CERN Convention (1953): ante-litteram Open Access mandate
“… the results of its experimental and theoretical work shall be published or otherwise made generally available”


CERN


The Large Hadron Collider

• Largest scientific instrument ever built, 27 km in circumference
• The “coolest” place in the Universe: -271.25 °C
• 10000 people involved in its design and construction
• Worldwide budget of 6bn€
• Collides protons to reproduce conditions at the birth of the Universe...
...40 million times a second


The LHC experiments: about 100 million “sensors” each
[think of your 6 MP digital camera... taking 40 million pictures a second]

ATLAS
five-storey building
CMS


The LHC data
• 40 million events (pictures) per second
• Select (on the fly) the ~200 interesting events per second to write on tape
• “Reconstruct” data and convert for analysis: “physics data” [inventing the grid...]

                     Per event   Per year
Raw data             1.6 MB      3200 TB
Reconstructed data   1.0 MB      2000 TB
Physics data         0.1 MB      200 TB
(x4 experiments x15 years)
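The per-year figures follow from the per-event sizes and the rate of about 200 events per second written to tape. The sketch below reproduces that arithmetic. It assumes roughly 10 million seconds of effective data taking per year, a value that is not stated on the slide but is the one that makes the quoted numbers consistent; the function names are illustrative.

```python
# Figures from the slide; SECONDS_PER_YEAR is an assumption, not stated in the source.
EVENTS_PER_SECOND = 200            # events written to tape each second
SECONDS_PER_YEAR = 10_000_000      # assumed effective data-taking time per year
MB_PER_TB = 1_000_000              # decimal units

EVENT_SIZE_MB = {
    "Raw data": 1.6,
    "Reconstructed data": 1.0,
    "Physics data": 0.1,
}


def yearly_volume_tb(event_size_mb: float) -> float:
    """Data volume per year, in TB, for one experiment and one data tier."""
    events_per_year = EVENTS_PER_SECOND * SECONDS_PER_YEAR
    return events_per_year * event_size_mb / MB_PER_TB


def total_volume_tb(experiments: int = 4, years: int = 15) -> float:
    """Sum of all tiers over the number of experiments and years on the slide."""
    per_year = sum(yearly_volume_tb(size) for size in EVENT_SIZE_MB.values())
    return per_year * experiments * years


if __name__ == "__main__":
    for tier, size in EVENT_SIZE_MB.items():
        print(f"{tier}: {yearly_volume_tb(size):.0f} TB per year")
    print(f"All tiers, 4 experiments, 15 years: {total_volume_tb():.0f} TB")
```

Under these assumptions the three tiers add up to 5400 TB per experiment per year, or about 324 000 TB for four experiments over fifteen years.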


Preservation, re-use and (Open) Access to HEP data

Problem
Opportunity
Challenge


Some other HEP facilities (recently stopped or about to stop)

Energy frontier

Precision frontier

No real long-term archival strategy...
...against billions of invested funds!

HERA@DESY

KLOE@LNF
BELLE@KEK
TEVATRON@FNAL

BABAR@SLAC

LEP@CERN


Why shall we care?

• We cared about producing these data in the first place
• Unique, extremely costly, non-reproducible
• Might need to go back to the past (it happened)

• A peculiar community (the web, arXiv, the grid...)
• “If it works here, it will work in many other places”


Preservation, re-use and (open) access continua (who and when)

• The same researchers who took the data, after the closure of the facility (~1 year, ~10 years)
• Researchers working at similar experiments at the same time (~1 day, week, month, year)
• Researchers of future experiments (~20 years)
• Theoretical physicists who may want to re-interpret the data (~1 month, ~1 year, ~10 years)
• Theoretical physicists who may want to test future ideas (~1 year, ~10 years, ~20 years)


Much ado about nothing?
Strong force: gets weaker the closer the quarks get.
Most counter-intuitive idea of contemporary physics
Idea 1972, Nobel prize 2004

To verify it, start pulling quarks far apart:
1) Produce quarks at accelerators
2) Put more and more energy in
3) Do the quarks pull on each other more?

Kept together by the strong force


Measuring the strong force
Need theory to analyse data; theory improves with in-silico experiments, which improve with computing power, which grows with time.

[Plot: how strong is the strong force (vertical axis) versus accelerator energy = how close we study the quarks (horizontal axis); data from JADE (1982-1985) and OPAL (1994-1998), compared with theory (2000)]

Need to re-analyse data with time!


Data preservation, circa 2004

140 pages of tables


Data preservation, circa 2003

140 pages of tables

Very cumbersome tables describe event features.

Technical needs of multi-dimensional data which cannot fit on paper!

What a discovery might look like...
...“missing energy”...
...a few events of background noise which all theorists want to check

L3


What is the trouble with preserving HEP data?

Where to put them?
Hardware migration?
Software migration/emulation?




HEP, Open Access & Repositories
• HEP is decades ahead in thinking Open Access:
  – Mountains of paper preprints shipped around the world for 40 years (at author/institute expense!)
  – Launched arXiv (1991), archetypal Open Archive
  – >90% HEP production self-archived in repositories
  – 100% HEP production indexed in SPIRES (community-run database, first WWW server on US soil)
• OA is second nature: posting on arXiv before submitting to a journal is common practice
  – No mandate, no debate. Author-driven.
• HEP scholars have the tradition of arXiving their output (alas, articles) somewhere


Information discovery in HEP
A poll of the HEP community: >2000 answers (10% of the community!)

Which HEP Information System do you use the most?
• 91% Community services
  – 40% Subject repositories
  – 51% Lab-supported databases
• 9% Google
• <0.1% Commercial services


What is the trouble with preserving HEP data?

Where to put them?
Hardware migration?
Software migration/emulation?


Towards an e-Infrastructure for HEP scholarly communication

Common vision of all stakeholders

1. Build a complete HEP information platform

2. Enable text- and data-mining applications

3. Demonstrate and deploy Web2.0 applications

4. Preservation and re-use of research data

There will be a place to archive the data


What is the trouble with preserving HEP data?

Where to put them?
Hardware migration?
Software migration/emulation?


Storage and migration of data at the CERN computing centre

1993   9track → 3480      ~150’000   0.2 GB
1997   3480 → Redwood     ~250’000   20 GB
2001   Redwood → 9940     ~25’000    60 GB
2004   9940A → 9940B      ~5’000     200 GB
2007   9940B → T1000A     ~22’000    500 GB

Life-cycle of previous-generation CERN experiment L3 at LEP

1984   Start of construction
1989   Start of data taking
2000   End of data taking
2002   End of in-silico experiments
2005   End of (most) data analysis
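To make the comparison between the two timelines concrete, the sketch below counts how many tape migrations fall between the start of L3 data taking and the end of (most) data analysis, and the average spacing between migrations. Years and media names are copied from the slide; the helper names are illustrative only.

```python
# Years and media names are copied from the slide; helper names are illustrative.
MIGRATIONS = [
    (1993, "9track", "3480"),
    (1997, "3480", "Redwood"),
    (2001, "Redwood", "9940"),
    (2004, "9940A", "9940B"),
    (2007, "9940B", "T1000A"),
]

DATA_TAKING_START = 1989   # L3 life-cycle: start of data taking
ANALYSIS_END = 2005        # L3 life-cycle: end of (most) data analysis


def migrations_between(start: int, end: int) -> list:
    """Migrations whose year falls inside [start, end]."""
    return [m for m in MIGRATIONS if start <= m[0] <= end]


def mean_interval_years() -> float:
    """Average number of years between consecutive migrations."""
    years = sorted(year for year, _, _ in MIGRATIONS)
    return (years[-1] - years[0]) / (len(years) - 1)


if __name__ == "__main__":
    during = migrations_between(DATA_TAKING_START, ANALYSIS_END)
    print(f"Migrations during the L3 data life-cycle: {len(during)}")
    print(f"Mean interval between migrations: {mean_interval_years()} years")
```

With the slide's dates, four of the five migrations happen while the experiment's data are still being taken or analysed, one every 3.5 years on average.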


What is the trouble with preserving HEP data?

Where to put them?
Hardware migration?
Software migration/emulation?


Computing environment of the L3 experiment at LEP

Life-cycle of previous-generation CERN experiment L3 at LEP

1984   Start of construction
1989   Start of data taking
2000   End of data taking
2002   End of in-silico experiments
2005   End of (most) data analysis

1989-2001   VAX for data taking
1986-1994   IBM for data analysis
1992-1998   Apollo (HP) workstations
1996-2001   SGI mainframe
1997-2007   Linux boxes
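A small sketch of how the platform periods overlap: for a given year it lists which computing environments were in use, which gives a rough idea of how many environments the experiment's software had to live in at once. Periods are taken from the slide; the function name is an illustrative choice.

```python
# Platform periods are copied from the slide; the function name is illustrative.
PLATFORMS = {
    "VAX for data taking": (1989, 2001),
    "IBM for data analysis": (1986, 1994),
    "Apollo (HP) workstations": (1992, 1998),
    "SGI mainframe": (1996, 2001),
    "Linux boxes": (1997, 2007),
}


def platforms_in_use(year: int) -> list:
    """Names of the platforms whose period of use includes the given year."""
    return sorted(
        name for name, (first, last) in PLATFORMS.items() if first <= year <= last
    )


if __name__ == "__main__":
    for year in (1989, 1993, 1997, 2005):
        names = platforms_in_use(year)
        print(f"{year}: {len(names)} platform(s): {', '.join(names)}")
```

With these dates, four platforms coexist in 1997, and only the Linux boxes remain by the end of (most) data analysis in 2005.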


Software migration/emulation
• Migrations and changes of architecture can and do occur in the lifetime of HEP experiments; CERN even has its own Linux releases
• These constitute large person-power investments
• Need to adapt code and infrastructure to read and (re-)process data and simulations, following the environment, with all dependencies
• This expertise and person-power vanish with the disbanding of the experiments and the onset of more “rewarding” tasks at new facilities
• An issue linked to the complexity of data


What is the trouble with preserving HEP data?

The HEP data!

Where to put them?
Hardware migration?
Software migration/emulation?


Preserving HEP data?

Concorde (15 km)
Balloon (30 km)
CD stack with 1 year LHC data! (~20 km)
Mt. Blanc (4.8 km)

• The HEP data model is highly complex. Data are traditionally not re-used as in Astronomy or Climate science.
• Raw data → calibrated data → skimmed data → high-level objects → physics analyses → results.
• All of the above duplicated for in-silico experiments, necessary to interpret the highly-complex data.
• Final results depend on the grey literature on calibration constants, human knowledge and algorithms needed for each pass... oral tradition!
• Years of training for a successful analysis


A possible way forward, introducing:

The parallel way


HEP data: The “parallel way” to publish/preserve/re-use/Open Access
• In addition to experiment data models, elaborate a parallel format for (re-)usable high-level objects
  – In times of need (to combine data of “competing” experiments) this approach has worked
  – Embed the “oral” and “additional” knowledge
• A format eventually understandable and thus re-usable by practitioners in other experiments and theorists
• Start from tables and work back towards primary data
• How much additional work? 1%, 5%, 10%?
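What such a parallel format could look like is not specified in the talk. As a purely hypothetical sketch, the record below bundles a few high-level objects with the units and the calibration remarks that would otherwise live only in the “oral tradition”, and serialises everything to plain JSON so that it can be read without the experiment's own software. The class name, field names and example values are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ParallelRecord:
    """Hypothetical self-describing record for one event (not an actual HEP schema)."""
    experiment: str
    event_id: int
    objects: list            # high-level objects, each a plain dict
    units: dict              # units for every quantity used in `objects`
    calibration_note: str    # the "oral" knowledge, written down next to the data


def to_parallel_json(record: ParallelRecord) -> str:
    """Serialise the record to a software-independent text format."""
    return json.dumps(asdict(record), sort_keys=True)


def from_parallel_json(text: str) -> ParallelRecord:
    """Rebuild the record from its JSON form."""
    return ParallelRecord(**json.loads(text))


if __name__ == "__main__":
    # Illustrative values only; they do not come from any real dataset.
    record = ParallelRecord(
        experiment="L3",
        event_id=1,
        objects=[{"type": "jet", "energy": 45.2}],
        units={"energy": "GeV"},
        calibration_note="Energies corrected with the final calibration pass.",
    )
    text = to_parallel_json(record)
    print(text)
    print(from_parallel_json(text) == record)
```

The design choice illustrated here is only that the explanatory knowledge travels inside the record itself rather than in separate documentation.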


“Major” issues with the “parallel” way
• A small fraction of a big number gives a large number
• Need insider knowledge to produce parallel data
• Activity in competition with research time (waiting for the end of the experiment is a proven recipe for disaster)
• Thousands of person-years behind the data model of the large collaborations:
  – enormous (impossible?) academic incentives to encourage the “parallel way”
  – additional (external) funds
  – mandates are another recipe for disaster
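The “small fraction of a big number” point can be made concrete with the 1%, 5% and 10% figures floated earlier for the additional work. The talk gives no exact size for the data-model effort beyond “thousands of person-years”, so the 2000 person-years used below is an assumed, illustrative input and not a figure from the source.

```python
# ASSUMPTION: 2000 person-years is an illustrative stand-in for "thousands of
# person-years"; the source does not give an exact figure.
ASSUMED_DATA_MODEL_EFFORT = 2000   # person-years

CANDIDATE_PERCENTAGES = (1, 5, 10)  # the "1%, 5%, 10%?" question from the talk


def parallel_effort(total_person_years: float, percentages=CANDIDATE_PERCENTAGES) -> dict:
    """Extra person-years needed for each candidate percentage of the total effort."""
    return {pct: total_person_years * pct / 100 for pct in percentages}


if __name__ == "__main__":
    for pct, effort in parallel_effort(ASSUMED_DATA_MODEL_EFFORT).items():
        print(f"{pct}% of {ASSUMED_DATA_MODEL_EFFORT} person-years = {effort:.0f} person-years")
```

Even at the lowest percentage, the assumed input yields tens of person-years of extra work, which is the sense in which a small fraction of a big number is still a large number.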


“Minor” issues with the “parallel” way
• Publish high-level objects behind each scientific article (voluntarily? compulsorily? after a time lapse?)
• Publish all high-level objects after disbanding a collaboration (ownership? impact metrics?)
• Address issues of (open) access, credit, accountability, reproducibility of results, “careless discoveries”, “careless measurements”, depth of peer-reviewing
• A monolithic way of doing business needs rethinking

A culture shift, which can only come from consensus


Preservation, re-use and (Open) Access to HEP data... first steps!

• Outgrowing an institutionalized state of denial
• A difficult and costly way ahead
• An issue which starts surfacing on the agenda
  – Part of an integrated digital-library vision
  – CERN is a proud member of the Alliance
  – FP7 bids to start charting the way
  – Debates open in the foreseeable future


An integrated e-Infrastructure vision
• Build an integrated digital library with the entire living corpus of a discipline (articles and metadata) → Work in progress
• Naturally extend it to host higher-level data behind publications → No technical roadblocks
• A data repository for external agents to run analysis jobs (on the grid)


Conclusions
• HEP spearheaded (Open) Access to Scientific Information: 50 years of preprints, 16 of repositories... but data preservation is not yet on the radar
• Heterogeneous continua to preserve data for
• No insurmountable technical problems
• The issue is the data model itself
  – (Primary) data intelligible only to the producers
  – Need to produce a “parallel” format for preservation, re-use and (open) access
  – Massive person-power costs

Preservation, re-use and (open) access of HEP data is appearing on the agenda...
will need cultural consensus and financial support

Exciting times are ahead!


Thank you

http://scoap3.org
Make Open Access happen!

Promote SCOAP3 in your country!
Get in touch!
