CERN - IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it
Monitoring the ATLAS Distributed Data Management System
Ricardo Rocha (CERN)
on behalf of the ARDA Dashboard team
CHEP2007, Victoria, Canada [email protected]
Outline
• Dashboard project
• ATLAS DDM system
• DDM Dashboard
• Monitoring for operators
• Monitoring for end users
• Conclusion
Dashboard Project
• Started inside the ARDA group of the EGEE/LCG project in 2005
  – Initially covering only job monitoring for CMS
• Evolved into a Python framework providing a set of flexible tools that allow coverage of other grid application areas
• The framework consists of a set of different components
  – Data access layer (DAO)
  – Service configuration (agents)
  – Web application
  – Command line tools
  – APIs
• Strong focus on allowing easy access to the information
  – HTTP query interface
  – Output in HTML (web interfaces), but also XML and CSV for integration with external tools
• Applications currently cover job monitoring (for all HEP experiments + VLEMED/Biomed), data management, site efficiency / reliability, and many others
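As a rough illustration of how an external tool might consume the CSV output, the sketch below builds a query URL and parses a CSV reply. The base URL is the production address given on the links slide of this talk; the parameter names (`site`, `format`), the column names and the sample reply are hypothetical and chosen only for illustration, so the real query interface may well look different.

```python
import csv
import io
from urllib.parse import urlencode

# Base URL from the links slide; the query parameters below are hypothetical.
BASE_URL = "http://dashb-atlas-data.cern.ch/dashboard/request.py/site"


def build_query_url(site, output_format="csv"):
    """Return a query URL for one site (parameter names are assumptions)."""
    return BASE_URL + "?" + urlencode({"site": site, "format": output_format})


def parse_csv_reply(text):
    """Turn a CSV reply into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up reply, standing in for what a real HTTP GET might return.
SAMPLE_REPLY = "site,files_done,files_failed\nSITE_A,120,3\nSITE_B,80,0\n"

if __name__ == "__main__":
    print(build_query_url("SITE_A"))
    for row in parse_csv_reply(SAMPLE_REPLY):
        print(row["site"], row["files_done"], row["files_failed"])
```

Parsing is kept separate from fetching so that the same function can be fed by any HTTP client or by a saved file.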
ATLAS DDM System
• Distributing all the data in the ATLAS infrastructure
• Data is organized in datasets (collections of files)
• Users issue subscriptions on these datasets
• Different agents take care of the several tasks required for the successful movement of the data
ATLAS DDM System
• Each set of agents serves one site or a group of sites (typically sites related to each other, forming a cloud)
• Initial deployment involved the setup of these services on the VO boxes at each Tier1
• Debugging was extremely difficult (still is…)
  – Involved logging into each of the machines
  – And correlating this information…
• Accounting and monitoring virtually impossible
• A central point where all the information would be made available became vital
DDM Dashboard
• Main focus on ATLAS specific services (DQ2 system), receiving information from the different agents via HTTP callbacks
  – Transfer state changes
  – Dataset complete
  – Transfer complete
  – Transfer / registration errors
• But also on grid fabric services
  – Data management related services up and running
  – Storage space availability
• Data is put together in a structured way
  – Oracle database at CERN
• Different tools (agents) responsible for generating statistics and metrics
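The flow described above (agents report events, a central store collects them, and statistics are derived from the store) can be sketched in a few lines. This is only an illustration: the field names, the event names and the in-memory store standing in for the Oracle database are all assumptions, not the actual DQ2 callback format or dashboard schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Callback:
    """One report from an agent (field names are illustrative assumptions)."""
    site: str
    dataset: str
    filename: str
    event: str       # e.g. "transfer_complete" or "transfer_error"
    error: str = ""  # only set for error events


class CallbackStore:
    """In-memory stand-in for the central database."""

    def __init__(self):
        self.events_by_site = defaultdict(Counter)
        self.errors = Counter()

    def record(self, cb):
        # Count every event per site; keep a separate tally of error messages.
        self.events_by_site[cb.site][cb.event] += 1
        if cb.error:
            self.errors[cb.error] += 1

    def efficiency(self, site):
        """Fraction of finished transfers that succeeded, or None if no data."""
        counts = self.events_by_site[site]
        done = counts["transfer_complete"]
        failed = counts["transfer_error"]
        total = done + failed
        return done / total if total else None


if __name__ == "__main__":
    store = CallbackStore()
    store.record(Callback("SITE_A", "ds1", "f1", "transfer_complete"))
    store.record(Callback("SITE_A", "ds1", "f2", "transfer_error", "timeout"))
    print(store.efficiency("SITE_A"), store.errors.most_common(1))
```

Keeping a separate error tally makes a question such as "what is the most common error" a single `most_common` call.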
DDM Dashboard
• Serves different sets of use cases, coming from different types of users
• Site / system operators
  – “How is the overall system doing?”
  – “How is site X doing?”
  – “What is the most common error, and what is triggering it?”
• End users / production coordinators
  – “What is the status of this (set of) dataset subscription(s)?”
  – “When will the data become available?”
  – Essential to have real time information
• How much data? A lot!
  – Millions of file transfers, each reporting the different steps
    • An average week means 2 million hits (over 90% are bulk reports with up to 500 items)
  – Especially critical when systems misbehave (more errors)
  – A lot of work on partitioning the data, optimizing the database and the web server setup (Apache)
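To put the volume figures in perspective, the sketch below splits a list of pending items into bulk reports of at most 500 items and converts the weekly hit count into an average request rate. The slides do not say how the agents actually assemble their bulk reports, so the chunking function and its names are assumptions made for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
MAX_ITEMS_PER_REPORT = 500  # upper bound quoted on the slide


def chunk_reports(items, max_items=MAX_ITEMS_PER_REPORT):
    """Split a list of pending items into bulk reports of at most max_items."""
    return [items[i:i + max_items] for i in range(0, len(items), max_items)]


def average_hits_per_second(hits_per_week):
    """Average request rate implied by a weekly hit count."""
    return hits_per_week / SECONDS_PER_WEEK


if __name__ == "__main__":
    pending = [f"file_{n}" for n in range(1200)]  # made-up file names
    print([len(batch) for batch in chunk_reports(pending)])
    print(round(average_hits_per_second(2_000_000), 1))
```

With 1,200 pending items this gives reports of 500, 500 and 200 items, and 2 million hits per week works out to roughly 3.3 requests per second on average. The rate is presumably higher when systems misbehave and more errors are reported.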
Monitoring for operators
• “How is the whole system performing?”
Monitoring for operators
• “What is wrong with site X?”
Monitoring for operators
• “What files are causing error ‘…’?”
Monitoring for end users
• “I subscribed to dataset X on site Y. What is the status?”
Conclusion
• Essential tool for all ATLAS operations
• Currently used by a large number of people, from site / system operators to end users
• Usage goes beyond the web interface
  – Data being queried by different external tools for automating operations (catalog cleanup, consistency checks, alarms and notifications, …)
• What is coming next
  – Integration with the site specific monitoring tools
  – More alarms and notifications
  – Automated reaction to specific events
  – More focus on the end user
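As one example of the kind of external automation mentioned above, the sketch below flags sites whose transfer efficiency falls under a threshold. It is not one of the actual ATLAS tools: the input format, the threshold values and the function name are assumptions, and a real tool would obtain the numbers from the dashboard query interface.

```python
def find_alarms(site_stats, min_efficiency=0.8, min_transfers=10):
    """Return (site, efficiency) pairs that fall below the threshold.

    site_stats maps a site name to a (succeeded, failed) tuple.
    Sites with too few transfers are skipped to avoid noisy alarms.
    """
    alarms = []
    for site, (succeeded, failed) in sorted(site_stats.items()):
        total = succeeded + failed
        if total == 0 or total < min_transfers:
            continue
        efficiency = succeeded / total
        if efficiency < min_efficiency:
            alarms.append((site, efficiency))
    return alarms


if __name__ == "__main__":
    # Made-up numbers for three hypothetical sites.
    stats = {"SITE_A": (95, 5), "SITE_B": (30, 70), "SITE_C": (1, 2)}
    for site, eff in find_alarms(stats):
        print(f"ALARM: {site} efficiency {eff:.0%}")
```

The minimum-transfer cut is a design choice of this sketch: without it, a site with a single failed transfer would raise an alarm at 0% efficiency.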
• Tier 0 export
  http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
• Production
  http://dashb-atlas-data.cern.ch/dashboard/request.py/site
• Homepage
  http://dashboard.cern.ch
• Contact