Salvatore Mele CERN
HEP and its data
What is the trouble?
A possible way forward
Tools & Trends
International Conference on Digital Preservation
Den Haag - November 2nd 2007
High-Energy Physics (or Particle Physics)
HEP aims to understand how our Universe works:
– by discovering the most elementary constituents of matter and energy
– by probing their interactions
– by exploring the basic nature of space and time
In other words, try to answer two eternal questions:
– “What is the world made of?”
– “What holds it together?”
Build the largest scientific instruments ever to reach energy densities close to the Big Bang; write theories to predict and describe the observed phenomena
Salvatore Mele - Preservation, re-use and (open) access of HEP data - Den Haag 02/11/2007
CERN: European Organization for Nuclear Research (since 1954)
• The world leading HEP laboratory, Geneva (CH)
• 2500 staff (mostly engineers)
• 8000 users (mostly physicists)
• 3 Nobel prizes (Accelerators, Detectors, Discoveries)
• Invented the web
• Commissioning the 27-km (6000 M€) LHC accelerator
• Runs a 1-million objects Digital Library
CERN Convention (1953): ante-litteram Open Access mandate
“… the results of its experimental and theoretical work shall be published or otherwise made generally available”
The Large Hadron Collider
• Largest scientific instrument ever built, 27 km of circumference
• The “coolest” place in the Universe: -271.25 °C
• 10000 people involved in its design and construction
• Worldwide budget of 6bn€
• Collides protons to reproduce conditions at the birth of the Universe... ...40 million times a second
The LHC experiments: about 100 million “sensors” each [think your 6MP digital camera... ...taking 40 million pictures a second]
[Figure labels: ATLAS, CMS, five-storey building]
The LHC data
• 40 million events (pictures) per second
• Select (on the fly) the ~200 interesting events per second to write on tape
• “Reconstruct” data and convert for analysis: “physics data” [inventing the grid...]

                      Per event   Per year
Raw data              1.6 MB      3200 TB
Reconstructed data    1.0 MB      2000 TB
Physics data          0.1 MB      200 TB
(x4 experiments x15 years)
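The per-year figures follow from the rate of events written to tape and the per-event sizes. A minimal sketch of that arithmetic, assuming roughly 10^7 seconds of data taking per year and decimal units (1 TB = 10^6 MB); neither assumption is stated on the slide, and the function names are illustrative only:

```python
# Back-of-the-envelope check of the LHC data volumes quoted on the slide.
# ASSUMPTIONS (not on the slide): ~1e7 seconds of data taking per year,
# and decimal units (1 TB = 1e6 MB).

EVENTS_WRITTEN_PER_SECOND = 200      # from the slide
COLLISIONS_PER_SECOND = 40_000_000   # from the slide
SECONDS_PER_YEAR = 10_000_000        # assumed running time per year
MB_PER_TB = 1_000_000

EVENT_SIZE_MB = {                    # per-event sizes from the slide
    "raw": 1.6,
    "reconstructed": 1.0,
    "physics": 0.1,
}


def yearly_volume_tb(event_size_mb: float) -> float:
    """Yearly volume in TB for one experiment and one data tier."""
    events_per_year = EVENTS_WRITTEN_PER_SECOND * SECONDS_PER_YEAR
    return events_per_year * event_size_mb / MB_PER_TB


def total_volume_tb(experiments: int = 4, years: int = 15) -> float:
    """All three tiers summed over experiments and years, in TB."""
    per_year = sum(yearly_volume_tb(size) for size in EVENT_SIZE_MB.values())
    return per_year * experiments * years


if __name__ == "__main__":
    for tier, size in EVENT_SIZE_MB.items():
        print(f"{tier:>14}: {yearly_volume_tb(size):7.0f} TB/year")
    kept = EVENTS_WRITTEN_PER_SECOND / COLLISIONS_PER_SECOND
    print(f"fraction of events kept: {kept:.1e}")
    print(f"total (x4 experiments x15 years): {total_volume_tb():,.0f} TB")
```

Under these assumptions the sketch reproduces the 3200, 2000 and 200 TB per year quoted on the slide, and the (x4 experiments x15 years) multiplier brings the total to roughly 324 000 TB.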
Preservation, re-use and (Open) Access to HEP data
Problem
Opportunity
Challenge
Some other HEP facilities (recently stopped or about to stop)
Energy frontier
Precision frontier
No real long-term archival strategy... ...against billions of invested funds!
HERA@DESY
KLOE@LNF
BELLE@KEK
TEVATRON@FNAL
BABAR@SLAC
LEP@CERN
Why should we care?
• We cared about producing these data in the first place
• Unique, extremely costly, non-reproducible
• Might need to go back to the past (it happened)
• A peculiar community (the web, arXiv, the grid...)
• “If it works here, it will work in many other places”
Preservation, re-use and (open) access continua (who and when)
• The same researchers who took the data, after the closure of the facility (~1 year, ~10 years)
• Researchers working at similar experiments at the same time (~1 day, week, month, year)
• Researchers of future experiments (~20 years)
• Theoretical physicists who may want to re-interpret the data (~1 month, ~1 year, ~10 years)
• Theoretical physicists who may want to test future ideas (~1 year, ~10 years, ~20 years)
Much ado about nothing?
Strong force: gets weaker the closer the quarks get.
Most counter-intuitive idea of contemporary physics
Idea 1972, Nobel prize 2004
To verify it, start pulling quarks far apart:
1) Produce quarks at accelerators
2) Put more and more energy in
3) Do quarks pull each other more?
Kept together by the strong force
Measuring the strong force
Need theory to analyse data, theory improves with in-silico experiments, which improve with computing power, which grows with time.
[Figure: axis labels “How strong is the strong force” and “Accelerator energy = how close we study the quarks”; plot labels JADE 1982-1985, OPAL 1994-1998, Theory 2000]
Need to re-analyse data with time!
Data preservation, circa 2004
140 pages of tables
Data preservation, circa 2003
140 pages of tables
Very cumbersome tables describe event features.
Technical needs of multi-dimensional data which cannot fit on paper!
What a discovery might look like... ...“missing energy”... ...a few events of background noise which all theorists want to check
L3
What is the trouble with preserving HEP data?
Where to put them?
Hardware migration?
Software migration/emulation?
HEP, Open Access & Repositories
• HEP is decades ahead in thinking Open Access:
  – Mountains of paper preprints shipped around the world for 40 years (at author/institute expenses!)
  – Launched arXiv (1991), archetypal Open Archive
  – >90% HEP production self-archived in repositories
  – 100% HEP production indexed in SPIRES (community-run database, first WWW server on US soil)
• OA is second nature: posting on arXiv before submitting to a journal is common practice
  – No mandate, no debate. Author-driven.
• HEP scholars have the tradition of arXiving their output (alas, articles) somewhere
Information discovery in HEP
A poll of the HEP community: >2000 answers (10% of the community!)
Which HEP Information System do you use the most?
• 91% Community services
  – 40% Subject repositories
  – 51% Lab-supported databases
• 9% Google
• <0.1% Commercial services
What is the trouble with preserving HEP data?
Where to put them?
Hardware migration?
Software migration/emulation?
Towards an e-Infrastructure for HEP scholarly communication
Common vision of all stakeholders
1. Build a complete HEP information platform
2. Enable text- and data-mining applications
3. Demonstrate and deploy Web2.0 applications
4. Preservation and re-use of research data
There will be a place to archive the data
What is the trouble with preserving HEP data?
Where to put them?
Hardware migration?
Software migration/emulation?
Storage and migration of data at the CERN computing centre
1993: 9track → 3480, ~150’000, 0.2GB
1997: 3480 → Redwood, ~250’000, 20GB
2001: Redwood → 9940, ~25’000, 60GB
2004: 9940A → 9940B, ~5’000, 200GB
2007: 9940B → T1000A, ~22’000, 500GB

Life-cycle of previous-generation CERN experiment L3 at LEP
1984: Begin of construction
1989: Start of data taking
2000: End of data taking
2002: End of in-silico experiments
2005: End of (most) data analysis
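To make the overlap between the two timelines concrete, here is a small sketch that counts how many media migrations fell between the start of L3 data taking and the end of (most) data analysis. The years and media names come from the slide; the data structure and function name are illustrative assumptions.

```python
# Media migrations at the CERN computing centre (years and media from the slide).
MIGRATIONS = [
    (1993, "9track", "3480"),
    (1997, "3480", "Redwood"),
    (2001, "Redwood", "9940"),
    (2004, "9940A", "9940B"),
    (2007, "9940B", "T1000A"),
]

# L3 life-cycle milestones (from the slide).
L3_START_OF_DATA_TAKING = 1989
L3_END_OF_DATA_ANALYSIS = 2005


def migrations_between(first_year: int, last_year: int) -> list:
    """Return the migrations whose year lies in [first_year, last_year]."""
    return [m for m in MIGRATIONS if first_year <= m[0] <= last_year]


if __name__ == "__main__":
    during_l3 = migrations_between(L3_START_OF_DATA_TAKING, L3_END_OF_DATA_ANALYSIS)
    for year, old, new in during_l3:
        print(f"{year}: {old} -> {new}")
    print(f"{len(during_l3)} migrations while L3 data were being taken or analysed")
```

With the slide's dates, four of the five migrations happened while L3 data were still being taken or analysed.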
What is the trouble with preserving HEP data?
Where to put them?
Hardware migration?
Software migration/emulation?
Computing environment of the L3 experiment at LEP
VAX for data taking: 1989-2001
IBM for data analysis: 1986-1994
Apollo (HP) workstations: 1992-1998
SGI mainframe: 1996-2001
Linux boxes: 1997-2007

Life-cycle of previous-generation CERN experiment L3 at LEP
1984: Begin of construction
1989: Start of data taking
2000: End of data taking
2002: End of in-silico experiments
2005: End of (most) data analysis
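The platform dates can be turned into a per-year count of computing environments that the experiment's software had to support at once. A minimal sketch, assuming the year ranges on the slide are inclusive at both ends (the slide does not say); the function names are illustrative.

```python
# Computing platforms of the L3 experiment (names and years from the slide).
# ASSUMPTION: both the first and the last year of each range count as "in use".
PLATFORMS = {
    "VAX for data taking": (1989, 2001),
    "IBM for data analysis": (1986, 1994),
    "Apollo (HP) workstations": (1992, 1998),
    "SGI mainframe": (1996, 2001),
    "Linux boxes": (1997, 2007),
}


def platforms_in_use(year: int) -> list:
    """Names of the platforms in use during the given year, sorted."""
    return sorted(
        name for name, (start, end) in PLATFORMS.items() if start <= year <= end
    )


def busiest_years() -> tuple:
    """Return (peak number of simultaneous platforms, years with that peak)."""
    first = min(start for start, _ in PLATFORMS.values())
    last = max(end for _, end in PLATFORMS.values())
    counts = {y: len(platforms_in_use(y)) for y in range(first, last + 1)}
    peak = max(counts.values())
    return peak, [y for y, c in counts.items() if c == peak]


if __name__ == "__main__":
    peak, years = busiest_years()
    print(f"up to {peak} platforms at once, in {years}")
    print("in use in 2005:", platforms_in_use(2005))
```

Under that assumption, up to four platforms coexisted (1997 and 1998), while by the end of (most) data analysis in 2005 only the Linux boxes remained.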
Software migration/emulation
• Migrations and changes of architecture can and do occur in the lifetime of the HEP experiments; CERN even has its own Linux releases
• These constitute large person-power investments
• Need to adapt code and infrastructure to read and (re-)process data and simulations, following the environment, with all dependencies
• This expertise and person-power vanishes with the disbanding of the experiments and the onset of more “rewarding” tasks at new facilities
• An issue linked to the complexity of data
What is the trouble with preserving HEP data?
The HEP data!
Where to put them?
Hardware migration?
Software migration/emulation?
Preserving HEP data?
Concorde (15 km)
Balloon (30 km)
CD stack with 1 year LHC data! (~20 km)
Mt. Blanc (4.8 km)
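The ~20 km CD stack can be checked with a rough estimate. The sketch below is one possible reading, not necessarily the author's calculation: it assumes a 700 MB CD that is 1.2 mm thick, and it counts only the raw data of four experiments (3200 TB per experiment per year, as quoted earlier in the talk). The slide does not say which data tier the comparison uses.

```python
# Rough estimate of the height of a stack of CDs holding one year of LHC data.
# ASSUMPTIONS (not on the slide): 700 MB per CD, 1.2 mm per CD,
# raw data only, decimal units (1 TB = 1e6 MB).

CD_CAPACITY_MB = 700
CD_THICKNESS_MM = 1.2
RAW_TB_PER_EXPERIMENT_PER_YEAR = 3200  # from the "LHC data" slide
EXPERIMENTS = 4                        # from the "LHC data" slide
MB_PER_TB = 1_000_000
MM_PER_KM = 1_000_000


def cd_stack_height_km(volume_tb: float) -> float:
    """Height in km of a stack of CDs holding volume_tb terabytes."""
    number_of_cds = volume_tb * MB_PER_TB / CD_CAPACITY_MB
    return number_of_cds * CD_THICKNESS_MM / MM_PER_KM


if __name__ == "__main__":
    yearly_raw_tb = RAW_TB_PER_EXPERIMENT_PER_YEAR * EXPERIMENTS
    print(f"{yearly_raw_tb} TB -> {cd_stack_height_km(yearly_raw_tb):.1f} km of CDs")
```

With these inputs the stack comes out at about 22 km, the same order as the ~20 km on the slide.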
• The HEP data model is highly complex. Data are traditionally not re-used as in Astronomy or Climate science.
• Raw data → calibrated data → skimmed data → high-level objects → physics analyses → results.
• All of the above duplicated for in-silico experiments, necessary to interpret the highly-complex data.
• Final results depend on the grey literature on calibration constants, human knowledge and algorithms needed for each pass... oral tradition!
• Years of training for a successful analysis
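The chain of passes listed above can be modelled as a sequence of steps, each carrying the knowledge needed to repeat it. The sketch below is purely illustrative: the stage names come from the slide, while the `ProcessingPass` class, its fields and the example note are hypothetical and do not describe any experiment's actual software.

```python
from dataclasses import dataclass, field

# Processing chain exactly as listed on the slide.
STAGES = [
    "raw data",
    "calibrated data",
    "skimmed data",
    "high-level objects",
    "physics analyses",
    "results",
]


@dataclass
class ProcessingPass:
    """One step of the chain plus the knowledge needed to redo it (hypothetical)."""
    source: str
    target: str
    knowledge: dict = field(default_factory=dict)


def build_chain(notes=None) -> list:
    """One ProcessingPass per consecutive pair of stages, with any recorded notes."""
    notes = notes or {}
    return [
        ProcessingPass(src, dst, dict(notes.get((src, dst), {})))
        for src, dst in zip(STAGES, STAGES[1:])
    ]


def undocumented_passes(chain: list) -> list:
    """Passes for which nothing is written down: the "oral tradition"."""
    return [(p.source, p.target) for p in chain if not p.knowledge]


if __name__ == "__main__":
    # Hypothetical example: only the calibration step has a written record.
    notes = {("raw data", "calibrated data"): {"calibration_constants": "example tag"}}
    for src, dst in undocumented_passes(build_chain(notes)):
        print(f"no recorded knowledge for: {src} -> {dst}")
```

Even in this toy model, five passes separate raw data from results, and any pass without a written record depends on people who may no longer be around.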
A possible way forward, introducing:
The parallel way
HEP data: The “parallel way” to publish/preserve/re-use/OpenAccess
• In addition to experiment data models, elaborate a parallel format for (re-)usable high-level objects
  – In times of need (to combine data of “competing” experiments) this approach has worked
  – Embed the “oral” and “additional” knowledge
• A format eventually understandable and thus re-usable by practitioners in other experiments and theorists
• Start from tables and work back towards primary data
• How much additional work? 1%, 5%, 10%?
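As an illustration of what a “parallel” format might look like, the sketch below stores high-level objects together with free-text notes in a self-describing JSON record. This is a hypothetical example, not a format proposed in the talk or used by any experiment: every class, field name and value is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HighLevelObject:
    """A reconstructed object, e.g. a jet or a lepton (hypothetical fields)."""
    kind: str
    energy_gev: float
    px_gev: float
    py_gev: float
    pz_gev: float


@dataclass
class ParallelEvent:
    """One event in the hypothetical parallel format."""
    experiment: str
    run: int
    event: int
    objects: List[HighLevelObject]
    # Free-text "oral" and "additional" knowledge travelling with the data.
    notes: Dict[str, str] = field(default_factory=dict)


def to_json(event: ParallelEvent) -> str:
    """Serialise an event to plain JSON text, readable without experiment software."""
    return json.dumps(asdict(event), sort_keys=True)


def from_json(text: str) -> ParallelEvent:
    """Rebuild an event from the JSON text produced by to_json."""
    data = json.loads(text)
    objects = [HighLevelObject(**obj) for obj in data.pop("objects")]
    return ParallelEvent(objects=objects, **data)


if __name__ == "__main__":
    event = ParallelEvent(
        experiment="EXAMPLE",
        run=1,
        event=42,
        objects=[HighLevelObject("jet", 45.0, 10.0, -20.0, 39.0)],
        notes={"calibration": "hypothetical note on the constants used"},
    )
    print(to_json(event))
```

The design choice here is to keep the record readable with nothing but a JSON parser, so that practitioners in other experiments and theorists would not need the original software stack.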
“Major” issues with the “parallel” way
• A small fraction of a big number gives a large number
• Need insider knowledge to produce parallel data
• Activity in competition with research time (waiting for the end of the experiment is a proven recipe for disaster)
• Thousands of person-years behind the data model of the large collaborations:
  – enormous (impossible?) academic incentives to encourage the “parallel way”
  – additional (external) funds
  – mandates are another recipe for disaster
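“A small fraction of a big number gives a large number” can be made concrete with the 1%, 5% and 10% overheads floated earlier in the talk. The slide only says “thousands of person-years”, so the 5000 person-years used below is a hypothetical figure chosen for illustration.

```python
# Cost of the "parallel way" as a percentage of the effort behind a data model.
# ASSUMPTION: 5000 person-years is a made-up example; the slide only says
# "thousands of person-years". The 1%, 5%, 10% overheads come from the talk.

def parallel_way_cost(person_years: float, percentages=(1, 5, 10)) -> dict:
    """Map each overhead percentage to the extra person-years it implies."""
    return {pct: person_years * pct / 100 for pct in percentages}


if __name__ == "__main__":
    for pct, cost in parallel_way_cost(5000).items():
        print(f"{pct:>2}% of 5000 person-years = {cost:.0f} person-years")
```

For that hypothetical collaboration, even the 1% scenario means 50 person-years of extra work.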
“Minor” issues with the “parallel” way
• Publish high-level objects behind each scientific article (voluntarily? compulsory? after a time lapse?)
• Publish all high-level objects after disbanding a collaboration (ownership? impact metrics?)
• Address issues of (open) access, credit, accountability, reproducibility of results, “careless discoveries”, “careless measurements”, depth of peer-reviewing
• A monolithic way of doing business needs rethinking
A culture shift, which can only come from consensus
Preservation, re-use and (Open) Access to HEP data... first steps!
• Outgrowing an institutionalized state of denial
• A difficult and costly way ahead
• An issue which starts surfacing on the agenda
  – Part of an integrated digital-library vision
  – CERN is a proud member of the Alliance
  – FP7 bids to start charting the way
  – Debates open in the foreseeable future
An integrated e-Infrastructure vision
Build an integrated digital library with the entire living corpus of a discipline (articles and metadata)
Work in progress
Naturally extend it to host higher-level data behind publications
No technical roadblocks
A data repository for external agents to run analysis jobs (on the grid)
Conclusions
• HEP spearheaded (Open) Access to Scientific Information: 50 years of preprints, 16 of repositories... but data preservation is not yet on the radar
• Heterogeneous continua to preserve data for
• No insurmountable technical problems
• The issue is the data model itself
  – (Primary) data intelligible only to the producers
  – Need to produce a “parallel” format for preservation, re-use and (open) access
  – Massive person-power costs
Preservation, re-use and (open) access of HEP data is appearing on the agenda... will need cultural consensus and financial support
Exciting times are ahead!
Thank you [email protected]
http://scoap3.org
Make Open Access happen!
Promote SCOAP3 in your country!
Get in touch!