The talk presents a new Data Aggregation System for the CMS experiment at CERN. We use the MongoDB database as a caching layer to query multiple data-providers (backed by RDBMS) and aggregate data across them. The talk was presented at the ICCS 2010 conference.
CMS Data Aggregation System
Valentin Kuznetsov, Cornell University
ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
How can I find my data?
[Diagram: many disjoint meta-data services surrounding the question — DBS, SiteDB, PhEDEx, GenDB, LumiDB, RunDB, PSetDB, DataQuality]
Overview
Talk outline
✤ Introduction
✤ Motivations
✤ What is DAS?
✤ Design, architecture, implementations
✤ Current status & benchmarks
✤ Future plans
Introduction
✤ CMS is a general-purpose physics detector built for the LHC
✤ beam collisions every 25 nsec, online trigger 300 Hz, event size 1-2 MB
✤ More than 3000 physicists, 183 institutions, 38 countries
✤ CMS uses a distributed computing and data model
✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
✤ 2-6 PB/year of real data + a comparable volume of simulated data, ~500 GB/year of meta-data
✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
Motivations ...
✤ A user wants to query different meta-data services without knowing of their existence
✤ A user wants to combine information from different meta-data services
✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get their data
[Diagram: the Data Aggregation System sits at the center of the CMS meta-data services, which are linked by shared keys (run, lumi, site, block, pset, MC id) —
DBS: run, file, block, site, config, tier, dataset, lumi, parameters, ...
LumiDB: lumi, luminosity, hltpath
SiteDB: site, admin, site.status, ...
PhEDEx: block, file, block.replica, file.replica, se, node, ...
GenDB: generator, xsection, process, decay, ...
RunSummary: run, trigger, detector, ...
DataQuality: trigger, ecal, hcal, ...
Overview: country, node, region, ...
Parameter Set DB: CMSSW parameters
Generic services A-E each expose their own param1, param2, ...]
What is DAS?
✤ DAS stands for Data Aggregation System
✤ It is a layer on top of existing data-services
✤ It aggregates data across distributed data-services while preserving their integrity, security policies and data formats
✤ it provides caching for data-services (a side effect)
✤ It represents data in a well-defined format: JSON documents
✤ It allows querying data via free text-based queries
✤ It is agnostic to data content
Challenges ...
✤ Combining N data-services is a great idea, but
✤ there is no ready-made IT solution
✤ DAS doesn't hold the data, so it can't have a pre-defined schema
✤ must support existing APIs, data formats, interfaces, security policies
✤ must relate and aggregate meta-data
✤ must be efficient, flexible, scalable and easy to use
✤ Work on a DAS prototype to understand those challenges
DAS prototype
✤ Code written in Python, ideal for prototyping
✤ Use existing meta-data from CMS data-services as a test-bed
✤ 8 data-services, 75/250 GB in tables/indexes
✤ Use a document-oriented "schema-less" database: MongoDB
✤ raw cache, merged-result cache, mapping and analytics DBs
✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
✤ Aggregate information using key-value matching
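The key-value matching step can be sketched as follows. The record contents here are hypothetical illustrations (DAS itself stores such records as JSON documents in MongoDB); the `aggregate` helper is not actual DAS code:

```python
from collections import defaultdict

def aggregate(records, key):
    """Merge records from different services whose values match on `key`."""
    index = defaultdict(dict)
    for rec in records:
        index[rec[key]].update(rec)   # later fields extend the merged record
    return list(index.values())

# hypothetical records from two services sharing the `block` key
dbs = [{"block": "/A#1", "dataset": "/A"}]
phedex = [{"block": "/A#1", "site": "T1_CERN"}]
merged = aggregate(dbs + phedex, "block")
# merged → [{"block": "/A#1", "dataset": "/A", "site": "T1_CERN"}]
```

Records that share the same value for the matching key collapse into one merged document, which is the aggregation DAS performs across services.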
DAS architecture
[Architecture diagram: a DAS webserver (UI, RESTful interface) sits in front of a DAS Cache server running the DAS core on multiple CPU cores. The core uses the DAS cache, DAS merge and DAS Analytics collections, relies on DAS mapping to map data-service output to DAS records, and calls the data-services (dbs, sitedb, phedex, lumidb, runsum) through parser plugins and an aggregator. Each query and API call is recorded in Analytics; a DAS robot fetches popular queries/APIs and invokes the same API(params) to update the cache periodically.]
DAS workflow
✤ Query parser
✤ Query DAS merge collection
✤ Query DAS cache collection
✤ invoke call to data service
✤ write to analytics
✤ Aggregate results (generator)
[Workflow diagram: query → parser → query DAS merge (hit: return results via the Aggregator; miss: query DAS cache) → (hit: merge; miss: query the data-services), coordinated by the DAS core with DAS mapping, DAS analytics, DAS logging and the Web UI]
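The lookup order in the workflow above — merged results first, then the raw cache, then the data-service itself — can be sketched as follows. Function and cache names are illustrative, not the actual DAS API:

```python
def das_lookup(query, merge_cache, raw_cache, fetch, aggregate):
    """Answer a query from caches if possible, otherwise call the service."""
    if query in merge_cache:            # aggregated answer already cached
        return merge_cache[query]
    if query not in raw_cache:          # raw-cache miss: invoke the data-service
        raw_cache[query] = fetch(query)
    merge_cache[query] = aggregate(raw_cache[query])
    return merge_cache[query]

calls = []
def fetch(q):
    calls.append(q)                     # stands in for an HTTP call + parsing
    return [{"query": q}]

merge_cache, raw_cache = {}, {}
das_lookup("site=T1_CERN", merge_cache, raw_cache, fetch, list)
das_lookup("site=T1_CERN", merge_cache, raw_cache, fetch, list)
# the data-service is invoked only once; the repeat is served from the cache
```

The point of the two-level cache is that repeated queries never touch the underlying service, which is the "caching as a side effect" mentioned earlier.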
DAS and data-services
✤ DAS is data-service agnostic
✤ a data-service is identified by its URI and input parameters
✤ Use a plug-and-play mechanism:
✤ add new data-service using ASCII map file (URI, parameters, ...)
✤ use generic HTTP access and standard data-parsers (XML, JSON)
✤ Use a dedicated plugin for:
✤ specific access requirements, custom parsers, etc.
DAS map files
system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [ {"key":"city","map":"city.name","pattern":""} ]
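A sketch of how such a map could drive a generic HTTP call: the map content is the google_maps example above, but the in-memory representation and the `build_request` helper are assumptions for illustration, not DAS code:

```python
from urllib.parse import urlencode

# in-memory form of the map file shown above
das_map = {
    "system": "google_maps",
    "format": "JSON",
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,                                   # cache lifetime, seconds
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}

def build_request(amap, **user_input):
    """Fill the map's parameter template from the user's query."""
    params = {name: (user_input[name] if default == "required" else default)
              for name, default in amap["params"].items()}
    return amap["url"] + "?" + urlencode(params)

url = build_request(das_map, q="Amsterdam")
# url → "http://maps.google.com/maps/geo?q=Amsterdam&output=json"
```

Because the map fully describes the URI, parameters and DAS keys, a new service can be added without writing any code, which is the plug-and-play mechanism described above.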
[Diagram: DAS mapping connects the Data Aggregation System to a Data Service addressed as URL/api?params]
DAS benchmark
✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data-services
✤ parse, remap notations, store to cache, merge matched records (aggregation)
✤ Linux 64-bit, 1 CPU for DAS, 1 CPU for MongoDB, record size ~1 KB
✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time
Step          Format   Records   Time, no cache   Time w/ cache
DBS yield     XML      387K      68s              0.98s
PhEDEx yield  XML      190K      107s             0.98s
Merge step    JSON     577K      63s              0.9s
DAS total     JSON     393K      238s             2.05s

393K DAS records: create ~6K docs/s, read ~7.6K docs/s
Future plans
✤ DAS goes into production this year in CMS:
✤ confirm scalability, transparency and durability w/ various data-services
✤ work on analytics to organize pre-fetch strategies
✤ Apply it to other domains and disciplines
✤ Release as open source
Summary
✤ The Data Aggregation System is data agnostic and allows querying/aggregating meta-data information in a customizable way
✤ The current architecture integrates easily with existing data-services, preserving their access, security policy and development cycle
✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary
✤ The plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain