The talk presents a new Data Aggregation System for the CMS experiment at CERN. We use the MongoDB database as a caching layer to query multiple data-providers (backed by RDBMS) and aggregate data across them. The talk was presented at the ICCS 2010 conference.
CMS Data Aggregation System
Valentin Kuznetsov, Cornell University
ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
How can I find my data?
[Diagram: many disjoint meta-data services surrounding the question — DBS, SiteDB, PhEDEx, GenDB, LumiDB, RunDB, PSetDB, DataQuality]
Overview
Talk outline
✤ Introduction
✤ Motivations
✤ What is DAS?
✤ Design, architecture, implementations
✤ Current status & benchmarks
✤ Future plans
Introduction
✤ CMS is a general-purpose physics detector built for the LHC
✤ beam collisions every 25 nsec, online trigger 300 Hz, event size 1-2 MB
✤ More than 3000 physicists, 183 institutions, 38 countries
✤ CMS uses a distributed computing and data model
✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
✤ 2-6 PB/year of real data + a comparable volume of simulated data, ~500 GB/year of meta-data
✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
Motivations ...
✤ A user wants to query different meta-data services without knowing of their existence
✤ A user wants to combine information from different meta-data services
✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get their data
[Diagram: the Data Aggregation System sits at the center of the CMS meta-data services, which are linked by shared keys (run, lumi, site, block, pset, MC id) —
DBS: run, file, block, site, config, tier, dataset, lumi, parameters, ...
LumiDB: lumi, luminosity, hltpath
SiteDB: site, admin, site.status, ...
PhEDEx: block, file, block.replica, file.replica, se, node, ...
GenDB: generator, xsection, process, decay, ...
RunSummary: run, trigger, detector, ...
DataQuality: trigger, ecal, hcal, ...
Overview: country, node, region, ...
Parameter Set DB: CMSSW parameters
Generic services A-E each expose their own param1, param2, ...]
What is DAS?
✤ DAS stands for Data Aggregation System
✤ It is a layer on top of existing data-services
✤ It aggregates data across distributed data-services while preserving their integrity, security policies and data formats
✤ it provides caching for data-services (a side effect)
✤ It represents data in a well-defined format: JSON documents
✤ It allows querying data via free text-based queries
✤ It is agnostic to data content
Challenges ...
✤ Combining N data-services is a great idea, but
✤ there is no ready-made IT solution
✤ DAS doesn't hold the data, so it can't have a pre-defined schema
✤ must support existing APIs, data formats, interfaces, security policies
✤ must relate and aggregate meta-data
✤ must be efficient, flexible, scalable and easy to use
✤ Work on a DAS prototype to understand those challenges
DAS prototype
✤ Code written in Python, ideal for prototyping
✤ Use existing meta-data from CMS data-services as a test-bed
✤ 8 data-services, 75/250 GB in tables/indexes
✤ Use a document-oriented "schema-less" database: MongoDB
✤ raw cache, merged-result cache, mapping and analytics DBs
✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
✤ Aggregate information using key-value matching
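The key-value matching step can be sketched as follows. The record contents here are hypothetical illustrations (DAS itself stores such records as JSON documents in MongoDB); the `aggregate` helper is not actual DAS code:

```python
from collections import defaultdict

def aggregate(records, key):
    """Merge records from different services whose values match on `key`."""
    index = defaultdict(dict)
    for rec in records:
        index[rec[key]].update(rec)   # later fields extend the merged record
    return list(index.values())

# hypothetical records from two services sharing the `block` key
dbs = [{"block": "/A#1", "dataset": "/A"}]
phedex = [{"block": "/A#1", "site": "T1_CERN"}]
merged = aggregate(dbs + phedex, "block")
# merged → [{"block": "/A#1", "dataset": "/A", "site": "T1_CERN"}]
```

Records that share the same value for the matching key collapse into one merged document, which is the aggregation DAS performs across services.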
DAS architecture
[Architecture diagram: a DAS webserver (UI, RESTful interface) sits in front of a DAS Cache server running the DAS core on multiple CPU cores. The core uses the DAS cache, DAS merge and DAS Analytics collections, relies on DAS mapping to map data-service output to DAS records, and calls the data-services (dbs, sitedb, phedex, lumidb, runsum) through parser plugins and an aggregator. Each query and API call is recorded in Analytics; a DAS robot fetches popular queries/APIs and invokes the same API(params) to update the cache periodically.]
DAS workflow
✤ Query parser
✤ Query DAS merge collection
✤ Query DAS cache collection
✤ invoke call to data service
✤ write to analytics
✤ Aggregate results (generator)
[Workflow diagram: query → parser → query DAS merge (hit: return results via the Aggregator; miss: query DAS cache) → (hit: merge; miss: query the data-services), coordinated by the DAS core with DAS mapping, DAS analytics, DAS logging and the Web UI]
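The lookup order in the workflow above — merged results first, then the raw cache, then the data-service itself — can be sketched as follows. Function and cache names are illustrative, not the actual DAS API:

```python
def das_lookup(query, merge_cache, raw_cache, fetch, aggregate):
    """Answer a query from caches if possible, otherwise call the service."""
    if query in merge_cache:            # aggregated answer already cached
        return merge_cache[query]
    if query not in raw_cache:          # raw-cache miss: invoke the data-service
        raw_cache[query] = fetch(query)
    merge_cache[query] = aggregate(raw_cache[query])
    return merge_cache[query]

calls = []
def fetch(q):
    calls.append(q)                     # stands in for an HTTP call + parsing
    return [{"query": q}]

merge_cache, raw_cache = {}, {}
das_lookup("site=T1_CERN", merge_cache, raw_cache, fetch, list)
das_lookup("site=T1_CERN", merge_cache, raw_cache, fetch, list)
# the data-service is invoked only once; the repeat is served from the cache
```

The point of the two-level cache is that repeated queries never touch the underlying service, which is the "caching as a side effect" mentioned earlier.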
DAS and data-services
✤ DAS is data-service agnostic
✤ a data-service is identified by its URI and input parameters
✤ Use a plug-and-play mechanism:
✤ add new data-service using ASCII map file (URI, parameters, ...)
✤ use generic HTTP access and standard data-parsers (XML, JSON)
✤ Use a dedicated plugin for:
✤ specific access requirements, custom parsers, etc.
DAS map files
system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [ {"key":"city","map":"city.name","pattern":""} ]
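A sketch of how such a map could drive a generic HTTP call: the map content is the google_maps example above, but the in-memory representation and the `build_request` helper are assumptions for illustration, not DAS code:

```python
from urllib.parse import urlencode

# in-memory form of the map file shown above
das_map = {
    "system": "google_maps",
    "format": "JSON",
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,                                   # cache lifetime, seconds
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}

def build_request(amap, **user_input):
    """Fill the map's parameter template from the user's query."""
    params = {name: (user_input[name] if default == "required" else default)
              for name, default in amap["params"].items()}
    return amap["url"] + "?" + urlencode(params)

url = build_request(das_map, q="Amsterdam")
# url → "http://maps.google.com/maps/geo?q=Amsterdam&output=json"
```

Because the map fully describes the URI, parameters and DAS keys, a new service can be added without writing any code, which is the plug-and-play mechanism described above.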
[Diagram: DAS mapping connects the Data Aggregation System to a Data Service addressed as URL/api?params]
DAS benchmark
✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data-services
✤ parse, remap notations, store to cache, merge matched records (aggregation)
✤ Linux 64-bit, 1 CPU for DAS, 1 CPU for MongoDB, record size ~1 KB
✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time
Step          Format   Records   Time, no cache   Time w/ cache
DBS yield     XML      387K      68s              0.98s
PhEDEx yield  XML      190K      107s             0.98s
Merge step    JSON     577K      63s              0.9s
DAS total     JSON     393K      238s             2.05s

393K DAS records: create ~6K docs/s, read ~7.6K docs/s
Future plans
✤ DAS goes into production this year in CMS:
✤ confirm scalability, transparency and durability w/ various data-services
✤ work on analytics to organize pre-fetch strategies
✤ Apply it to other domains and disciplines
✤ Release as open source
Summary
✤ The Data Aggregation System is data agnostic and allows querying/aggregating meta-data information in a customizable way
✤ The current architecture integrates easily with existing data-services, preserving their access, security policy and development cycle
✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary
✤ The plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain