KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
STEINBUCH CENTRE FOR COMPUTING - SCC
www.kit.edu
Data Intensive Services for the LSDF
Jos van Wezel
Steinbuch Centre for Computing | BioQuant, First Byte Symposium
Intro
• Past and Context
• The Data Challenge ahead
• LSDF at KIT
• Software Services
• Roadmap
26 May 2011
Steinbuch Centre for Computing
Computer Centre of the Karlsruhe Institute of Technology
• IT Services for KIT
• High Performance Computing
• Scientific Computing and Simulation
• Large Scale Data Management & Analysis
• Grid Computing
• Cloud Computing
• Virtualisation
Data services at SCC
GridKa – LHC Tier 1 centre / 2002
• WLCG Tier 1 centre
• 10 PB storage, 16000 cores, 40 Gb/s networking
• Dedicated to physics off-line computing
Biology contacts: Institute for Toxicology and Genetics at KIT / 2007
• Initial use of ‘spare’ GridKa capacity, indicating storage and computing needs
BioQuant / 2008
• Prof. Dr. Wolfrum and Prof. Dr. Juling: joint proposal
• Cooperation to procure storage for genomic research
LSDF development time line at SCC/KIT
First ideas for LSDF / early 2008
• Installation of an LSDF pilot with 150 TB storage and 4 servers
• Development of initial concepts, such as tiered storage and Hadoop
• Result: KIT proposal
SCC Helmholtz external review / spring 2009
• LSDF is an excellent idea, but DO plan beyond KIT
• Workshop held in 2/2009 to coordinate BioQuant and SCC efforts
Storage, funded by the State of Baden-Württemberg / late 2009
• Tendering and negotiations by R. Eils (BioQuant) and R. Kupsch (SCC)
• Storage for systems biology in Heidelberg (@BioQuant)
• Storage for universities in Baden-Württemberg (@SCC)
• Long-term digital archives (@SCC)
• Storage support and services for state universities (@SCC)
Compute cluster for data-intensive computing (DIC) & cloud research / late 2009
• Bring computing (low latency) to storage
• Use Hadoop to allow fast distributed data access
LSDF Hardware today
Dedicated LSDF Data Acquisition Network
• 10 Gb/s redundant backbone with 2 Nexus routers
• Several KIT institutes: ITG, IPE, IAI, ANKA, GPI
• Since one week ago: 10 Gb/s to BioQuant
File servers and on-line storage
• IBM → 2 PB, 6 servers, SoFS
• DDN → 750 TB, 8 servers, GPFS
Computing cluster
• 464 cores, 2 TB total memory
• Directly attached to storage (GPFS/DDN)
• 110 TB HDFS, Hadoop native filesystem
• Available from the cloud environment OpenNebula
  • Users can deploy their own dedicated VMs
  • Reliable, highly flexible, and very fast to deploy
Archival and off-line storage
• Tape library
• 6 LTO 5 drives
Executive scientists: Serguei Bourov, Ariel Garcia
LSDF realms
Access to LSDF (KIT) via standard protocols
• Internal (inside firewall) via NFS/CIFS and DataBrowser
• External (outside firewall) via ‘grid’ tools or http
Users of the LSDF @ SCC/KIT
Biology
• High-throughput microscopy
• Gene sequencing
• 0.5 PB/a, automated image processing
Synchrotron radiation facility (ANKA)
• Tomography beamlines
• 240 TB/a – 1 PB/a, data management
Climate research (IMK)
• Several instruments mounted on satellites
• 300 TB/a (till 2024), 20 years archiving
In development
• BioQuant Archives
• Biophysics (Nanoscopy, Nanoparticles, …)
• Arts and Humanities (DARIAH)
• Geophysics (Seismology, Applied seismic research)
• Many others
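As a rough sanity check on the volumes above, the annual figures can be converted into average sustained rates and set against the 10 Gb/s backbone mentioned on the hardware slide. This is a minimal sketch under stated assumptions: decimal units (1 TB = 10^12 bytes), a 365-day year, and ingest spread evenly over the year. The function name is illustrative and not part of any LSDF tool.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year, continuous ingest


def average_rate_gbit_s(terabytes_per_year: float) -> float:
    """Average sustained rate in Gb/s for an annual volume given in TB (decimal)."""
    bits_per_year = terabytes_per_year * 1e12 * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e9


# Annual volumes quoted on the slide, expressed in TB per year.
volumes_tb_per_year = {
    "Biology (0.5 PB/a)": 500,
    "ANKA, lower bound (240 TB/a)": 240,
    "ANKA, upper bound (1 PB/a)": 1000,
    "Climate research (300 TB/a)": 300,
}

for name, tb in volumes_tb_per_year.items():
    print(f"{name}: {average_rate_gbit_s(tb):.3f} Gb/s on average")
```

Under these assumptions even 1 PB/a averages to roughly a quarter of a Gb/s. Averages say nothing about peak rates during acquisition, which are what a dedicated data acquisition network has to absorb.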
Measurement Data
Data is generated at increasing rates
Cost per byte measured is decreasing
Cost per byte of storage is decreasing
[Figure: Storage Density, axis ticks 2011 to 2025 and 10 to 10000]
Scientific data sources
• In the past, big data resulted from simulations on supercomputers
• Today, big data results from experiments, observations, measurements
• Data is valuable because it is either unique or costly to obtain, or both
From data to knowledge
The fourth paradigm
• Experiment
• Theory
• Simulation
• Data exploration
Widely recognised, e.g.:
“Riding the wave, How Europe can gain from the rising tide of scientific data.” Final report of the High Level Expert Group on Scientific Data. October 2010
Tony Hey, Stewart Tansley, Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, ISBN 978-0982544204, http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Jim Gray, eScience Talk at NRC-CSTB meeting, Mountain View CA, 11 January 2007, http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
A collaborative Data Infrastructure
DARIAH, CESSDA, LifeWatch, ENES, etc.
EUDAT, D4Science, ELIXIR, etc.
LSDF
Scientific Experiments
Key demands of modern data driven science
• Data storage and management beyond petabytes
• Long-term digital archiving of raw and published data
• Analysis with tools for data-intensive computing
• Visualisation and data mining tools for large amounts of measurement data
• Integration of data handling with scientific workflows
• Support and services from IT and data experts
LSDF Blueprint
Workflow: applicable for many data sources
• Data is measured, buffered and validated in storage near the instrument (T0)
• Data is curated, registered and moved into the LSDF (T1)
• Data is processed for analysis; each analysis step produces new, derived data that is also registered, stored and archived (T2)
• New data is archived: immutable data
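The tiered workflow above can be illustrated with a small in-memory registry. This is a minimal sketch, not LSDF software: the `Record` and `Registry` classes, the dataset names, and the use of SHA-256 checksums are all assumptions made for illustration. It shows data registered at T0, moved to T1, a derived dataset registered at T2 with a link to its parent, and archived records refusing further changes.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    name: str
    sha256: str                   # checksum taken at registration time
    tier: str                     # "T0", "T1" or "T2"
    parent: Optional[str] = None  # dataset this one was derived from, if any
    archived: bool = False        # archived records are treated as immutable


class Registry:
    """Hypothetical catalogue of registered datasets (illustration only)."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def _check_mutable(self, name: str) -> None:
        record = self.records.get(name)
        if record is not None and record.archived:
            raise PermissionError(f"{name} is archived and immutable")

    def register(self, name: str, payload: bytes, tier: str,
                 parent: Optional[str] = None) -> Record:
        self._check_mutable(name)
        record = Record(name, hashlib.sha256(payload).hexdigest(), tier, parent)
        self.records[name] = record
        return record

    def move(self, name: str, tier: str) -> None:
        self._check_mutable(name)
        self.records[name].tier = tier

    def archive(self, name: str) -> None:
        self.records[name].archived = True


registry = Registry()
raw = b"raw detector frame"
registry.register("run42/raw", raw, tier="T0")  # validated near the instrument
registry.move("run42/raw", "T1")                # curated and moved into the facility
derived = raw.upper()                           # stand-in for an analysis step
registry.register("run42/derived", derived, tier="T2", parent="run42/raw")
registry.archive("run42/raw")
registry.archive("run42/derived")
```

Keeping the parent link on every derived record is one simple way to preserve provenance between raw and processed data; a production system would persist the registry rather than hold it in memory.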
LSDF developments
Software for scientific data
• Data management
• Secure access and global authentication
• Archival and bit preservation
• Persistent identifiers
Data-intensive computing
• Storage and computing optimisation
• Storage and file system design
Community services
• Helpdesk and support
• Integration of existing applications
Storage for the state of Baden-Württemberg
• Scientific data (BioQuant)
• Universities, archives, libraries
• Desktop data
Scientific experiments, applications, communities
LSDF infrastructure
Technologies
“for happy users”
Data Services
File systems, file protocols, databases
• GPFS, NFS, CIFS, GridFTP, Oracle, MySQL
Hadoop
• Shared cluster-wide file system, Map/Reduce framework
Cloud / OpenNebula
• Fast deployment of virtual machines
iRODS
• Rule-oriented data system
Automated processing of large image stacks
• Kepler workflow engine
Data ingest, meta data
• ADALAPI
• DataBrowser
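The Map/Reduce model named in the Hadoop entry above can be sketched in a few lines. The following is plain, single-process Python and does not use Hadoop's API; on a Hadoop cluster the same three phases run distributed across nodes, close to the data. The example input (short nucleotide reads) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read):
    """Map: emit one (key, 1) pair per base in a read."""
    return [(base, 1) for base in read]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


reads = ["ACGT", "AACG", "TTGA"]
counts = map_reduce(reads)
print(counts)
```

Because each map call looks at one record only and each reduce call at one key only, both phases can be parallelised without coordination, which is what makes the model a good fit for a cluster attached directly to storage.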
Meta Data
Meta data describes the contents of data
Everybody uses meta data:
• File name and extension (e.g. picture.jpg, budget.xls, Readme.doc)
• Location (e.g. /…/EU-projects/2011/Fishy/budget.xls)
• Personal know-how
Sufficient for small file systems and desktops
Try to locate a file or a piece of information somewhere in a file system
• that is 15 years old?
• in the file system of a colleague?
• in a 100 petabyte file system?
Access to large scale data
Separate frameworks for data and meta data
• Good scalability and access
• Complicates transparent access
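The separation described above can be illustrated with a meta data catalogue held in a database while the data itself stays on the file system. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the table layout, the paths, and the project names are hypothetical and do not reflect the actual LSDF schema.

```python
import sqlite3

# Meta data lives in a database; the files themselves stay on storage.
catalog = sqlite3.connect(":memory:")
catalog.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, project TEXT, year INTEGER)"
)
catalog.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("/storage-a/img/0001.tif", "zebrafish", 2010),   # hypothetical entries
        ("/storage-b/img/0002.tif", "zebrafish", 2011),
        ("/storage-a/tomo/0003.h5", "tomography", 2011),
    ],
)
catalog.commit()


def find_paths(project: str, year: int) -> list:
    """Return the storage paths of all files matching the given meta data."""
    rows = catalog.execute(
        "SELECT path FROM files WHERE project = ? AND year = ? ORDER BY path",
        (project, year),
    )
    return [row[0] for row in rows]


print(find_paths("zebrafish", 2011))
```

The query scales with the catalogue rather than with the size of the file system, but it only returns a path: reading the file is a second step against a different system, which is the loss of transparency the slide points out.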
Hierarchical Catalog System (Repository)
[Diagram: catalog hierarchy between computing and storage]
Logical Project Catalog (DB): LPN → LDN, meta data
Logical Directory Catalog (DB): LDN → LDN, LFN
Logical File Catalogs (DB): LFN → Physical File Name
Meta data scheme repository
Entries shown: Zebrafish I, Zebrafish II, ANKA BL1, Material research, Digital objects in Arts and Humanities
Other labels: APIs and Tools, Generic file tree
Catalogs, LSDF Systems
• Sustainable and easily extensible for large amounts of data (size and number)
• Independent of data formats
• Performance by distributed access
• Safety by redundancy
• Use of open standards
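The chain from logical project name (LPN) through logical directory names (LDN) and logical file names (LFN) to physical file names can be sketched with three lookup tables. This is an illustration under assumptions: the dictionaries stand in for the catalog databases, and every name, identifier, and path below is hypothetical.

```python
# Logical Project Catalog: LPN -> root LDN plus project meta data.
project_catalog = {
    "zebrafish-1": {"ldn": "/zebrafish-1", "meta": {"organism": "zebrafish"}},
}

# Logical Directory Catalog: LDN -> child LDNs and LFNs.
directory_catalog = {
    "/zebrafish-1": {"dirs": ["/zebrafish-1/plate-01"], "files": []},
    "/zebrafish-1/plate-01": {
        "dirs": [],
        "files": ["lfn:z1/p01/img-0001", "lfn:z1/p01/img-0002"],
    },
}

# Logical File Catalog: LFN -> physical file name, possibly on different systems.
file_catalog = {
    "lfn:z1/p01/img-0001": "/storage-a/0001.tif",
    "lfn:z1/p01/img-0002": "/storage-b/0002.tif",
}


def resolve_project(lpn: str) -> list:
    """Return all physical file names reachable from a logical project name."""
    pending = [project_catalog[lpn]["ldn"]]
    physical = []
    while pending:
        entry = directory_catalog[pending.pop()]
        pending.extend(entry["dirs"])
        physical.extend(file_catalog[lfn] for lfn in entry["files"])
    return sorted(physical)


print(resolve_project("zebrafish-1"))
```

Because applications only ever hold logical names, files can be moved or replicated between storage systems by updating the file catalog alone, which is one way to read the "independent of data formats" and "safety by redundancy" points above.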
DataBrowser
API: data and meta data organization
GUI: file, data and project explorer
Functions:
• Data management
• Queries in meta data catalogues
• Upload/download
• Control of data analysis and visualisation workflows
• Easy to use
• Extensible
• World-wide access
ADALAPI (Abstract Data Access Layer API)
• Java class library
• Seamless application access to LSDF
• Independent of transfer protocol and location
• Authentication
  • X.509 certificates
  • user/passwd
• Protocols and file systems
  • local files
  • gsiftp
  • sftp
  • http(s)
  • hdfs
[Diagram labels: Applications, Tools, DataBrowser, Scientific exp., DAQ, …, Grid, Workstations, Client software, Visualization, …, Cloud; LSDF Storage Infrastructure]
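The idea of access that is independent of transfer protocol and location can be sketched as a dispatcher keyed on the URI scheme. The code below is not ADALAPI and does not mirror its Java classes; it is a Python stand-in with invented names. Only a local `file` backend and a hypothetical in-memory `mem` backend are implemented; protocols such as gsiftp, sftp, http(s) or hdfs would each need their own backend and authentication handling.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

_backends = {}  # URI scheme -> function that returns the object's bytes


def backend(scheme):
    """Register a reader function for one URI scheme."""
    def decorator(func):
        _backends[scheme] = func
        return func
    return decorator


@backend("file")
def _read_local(parsed):
    return Path(url2pathname(parsed.path)).read_bytes()


_memory_store = {"demo/object": b"held in memory"}  # hypothetical second backend


@backend("mem")
def _read_memory(parsed):
    return _memory_store[parsed.netloc + parsed.path]


def read(uri: str) -> bytes:
    """Read an object without the caller knowing which protocol serves it."""
    parsed = urlparse(uri)
    try:
        handler = _backends[parsed.scheme]
    except KeyError:
        raise ValueError(f"no backend registered for scheme {parsed.scheme!r}") from None
    return handler(parsed)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.dat"
    sample.write_bytes(b"held on disk")
    local_bytes = read(sample.as_uri())

memory_bytes = read("mem://demo/object")
```

Application code calls `read` with a URI in both cases; adding a protocol means registering one more backend rather than touching the callers.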
Conclusions
• Important services have been deployed
• Different communities at KIT are successfully using the LSDF (storage as well as on-line computing)
• Development of new tools is in progress
Roadmap
• LSDF will grow, adding users and hardware
• Contributing to EUDAT and Helmholtz Association infrastructures
• Adding software, community services and support to hardware services
The Steinbuch Centre for Computing at KIT congratulates BioQuant on its successful LSDF4LS launch. We are proud to cooperate with them and look forward to jointly advancing science by deploying innovative large-scale data services.
You have the data, we have the technology
Thank you very much for your attention
Many thanks to: Serguei Bourov, Ariel Garcia, Rainer Kupsch, Achim Streit, Rainer Stotzka and all other KIT colleagues making LSDF happen