GridMonitor: Integration of Large Scale Facility Monitoring With MDS
Richard Baker, Antonio Chan
Jason Smith, Dantong Yu
USATLAS/RHIC Computing Facility
Brookhaven National Lab
04/18/23 CHEP 03, La Jolla
Outline
Requirements
System Framework, Structure and Characteristics
  I: Ganglia and Its Information Provider
  II: Archive and Its Information Provider
Gridview, Front End System: http://heppc1.uta.edu/atlas/grid-status/mds.gremlin.usatlas.bnl.gov.html
Current Status and Future Works
Requirements
Requirements:
  Modularity and Extensibility: Make Use of Existing Monitoring Pieces
  Flexibility: Adjustable to the Dynamics of the Monitored Systems
  Overhead: Non-intrusive
  Scalability
  Security, Consistency, Inter-operability, Etc-bility
What Needs to Be Monitored
Linux Farm Monitoring
  Description: About 1100 Dual CPU Linux Nodes. Performance Data Must Be Summarized for Advertising to the Grid.
  Performance Events Required:
    Configuration Information
    Status Information: CPU Load (1, 5, 10, 15), Memory Load, Disk Load, and Network Load
  Example Usage: A Resource Broker Might Ask About the Availability of Linux Farm System Resources in Order to Plan the Efficient Execution of Tasks
More…
Network Monitoring:
  Description: 8 USATLAS Testbeds
    Publish the Connectivity of These Testbeds, Monitor the Health of the USATLAS Network
    Archived Performance Data Can Be Used to Predict the Network Behavior
    A User Can Choose the Source and Destination for File Replication
  Performance Events Required: Bandwidth, Delay (Round Trip Time), Trace Route
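
The file replication use case can be illustrated with a short sketch. It assumes archived samples are available as (source, destination, bandwidth, round trip time) records. The site names, the sample values, and the ranking rule (highest mean bandwidth, ties broken by lowest mean round trip time) are hypothetical and are not taken from the system described in this talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical archived samples: (source, destination, bandwidth_mbps, rtt_ms).
SAMPLES = [
    ("site_a", "site_c", 80.0, 30.0),
    ("site_a", "site_c", 90.0, 28.0),
    ("site_b", "site_c", 40.0, 12.0),
    ("site_b", "site_c", 50.0, 14.0),
]


def best_source(samples, destination, candidates):
    """Pick the candidate source with the highest mean archived bandwidth
    to the destination; ties are broken by the lowest mean round trip time."""
    bandwidths = defaultdict(list)
    delays = defaultdict(list)
    for src, dst, bandwidth, delay in samples:
        if dst == destination and src in candidates:
            bandwidths[src].append(bandwidth)
            delays[src].append(delay)
    if not bandwidths:
        return None  # no archived data for any candidate
    return max(bandwidths, key=lambda s: (mean(bandwidths[s]), -mean(delays[s])))


chosen = best_source(SAMPLES, "site_c", {"site_a", "site_b"})
print(chosen)
```

A real broker would likely weigh recent samples more heavily than old ones; the plain mean is used here only to keep the sketch short.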
Monitoring Framework
[Figure: monitoring framework. Components shown: sensors on the computing nodes, server, HPSS, and network; data collectors; monitoring database (ODBC + MySQL, or RRD); DB information providers; information providers (GRIS); aggregate index service (GIIS); grid-info-search; Grid-View (web server).]
Monitoring System Components
Four Tier Structure
  Sensors
    Host: Ganglia, top, /proc and LSF Host Load
  Archive System (Database System)
    Round Robin Database (RRD)
    Relational Database: unixODBC + MyODBC + MySQL Database
  Information Providers
    Monitoring and Discovery Service (MDS 2.2), GLUE Schema
    Customized Ganglia Client Tool Reporting the Latest Monitoring Data, and Database Client Tools Reporting the Summary Information
  Front-end Browsing System
    Gridview (Grid Visualization Tool Developed at Univ. of Texas at Arlington)
Advantages
Information Provider Provides a Cache for the Newest Value From the MySQL Database
Non-intrusiveness: Information Provider Can Eliminate Random User Accesses to the Database Server
Scalability Can Be Significantly Increased
  1000 Linux Nodes Are Being Monitored
  Network Connectivity of Eight USATLAS Testbeds: Each Site Monitors the Paths From Itself to the Other Seven. Network Topology and Traffic Can Be Easily Constructed
Flexibility:
  Independent of Sensors. Many Sensors Can Be Easily Plugged In As Long As Each Has a Well Defined Protocol and API: We Could Switch Among Ganglia, top, /proc
  Archive System Is Independent of the Underlying Database: Can Be Any RDBMS (Oracle, MySQL, Sybase, Informix), Flat Files, or Objectivity, As Long As an ODBC Driver Is Available
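
The caching advantage can be sketched as a small time-based cache placed in front of the database query. This is only an illustration of the idea, not the MDS implementation: the class name, the `fetch` callable, and the injected clock are hypothetical, and the 600 second lifetime simply mirrors the `cachetime` value shown in the information provider configuration in this talk.

```python
import time


class CachedProvider:
    """Serve the newest value from a cache and query the backend only when
    the cached copy is at least `cachetime` seconds old."""

    def __init__(self, fetch, cachetime=600, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that queries the database
        self._cachetime = cachetime
        self._clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._cachetime:
            self._value = self._fetch()
            self._stamp = now
        return self._value


calls = []


def fake_query():
    """Stand-in for a database query; records how often it is invoked."""
    calls.append(1)
    return {"load_5": 0.42}


fake_now = [0.0]
provider = CachedProvider(fake_query, cachetime=600, clock=lambda: fake_now[0])
first = provider.get()    # cache is empty, backend queried
second = provider.get()   # served from the cache
fake_now[0] = 601.0
third = provider.get()    # cache expired, backend queried again
print(len(calls))
```

However many clients ask within one cache lifetime, the database sees a single query, which is the non-intrusiveness argument made above.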
I: Ganglia Monitoring with MDS
Ganglia Information Provider
  Front-end: GLUE Schema http://www.cnaf.infn.it/~sergio/datatag/glue/
  Back-end: XML
[Figure: Ganglia data flow. Gmond daemons in each cluster share a multicast channel and export XML; Gmetad instances (filtered, layered) collect the cluster XML; the MDS Ganglia information provider (IP) maps the XML to GLUE.]
I: Ganglia Monitoring with MDS
gremlin % grid-info-search -x -h spider.usatlas.bnl.gov -s one

# ATLAS Linux Cluster, local, grid
dn: cl=ATLAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: ATLAS Linux Cluster
GlueClusterUniqueID: ATLAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# PHOBOS CAS Linux Cluster, local, grid
dn: cl=PHOBOS CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: PHOBOS CAS Linux Cluster
GlueClusterUniqueID: PHOBOS_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# STAR CAS Linux Cluster, local, grid
dn: cl=STAR CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: STAR CAS Linux Cluster
GlueClusterUniqueID: STAR_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute
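
The mapping from Ganglia XML to GLUE entries like those above might look roughly like the following sketch. The embedded XML is a hypothetical, heavily trimmed sample (real Ganglia output carries many more attributes and metrics), and the function name and the rule for building the unique ID are assumptions inferred from the output shown here, not the actual information provider code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample of Ganglia-style XML.
SAMPLE_XML = """
<GANGLIA_XML>
  <CLUSTER NAME="ATLAS Linux Cluster">
    <HOST NAME="node01"><METRIC NAME="load_one" VAL="0.50"/></HOST>
  </CLUSTER>
  <CLUSTER NAME="STAR CAS Linux Cluster">
    <HOST NAME="node02"><METRIC NAME="load_one" VAL="1.25"/></HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

SUFFIX = "mds-vo-name=local, o=grid"
GROUP = "RCF_and_ACF_Linux_Farm_Group"  # suffix seen in the unique IDs above


def clusters_to_ldif(xml_text):
    """Emit one GlueCluster entry per CLUSTER element in the XML."""
    root = ET.fromstring(xml_text)
    entries = []
    for cluster in root.iter("CLUSTER"):
        name = cluster.get("NAME")
        # Assumed rule: spaces become underscores, then the group is appended.
        unique_id = name.replace(" ", "_") + "-" + GROUP
        entries.append("\n".join([
            f"dn: cl={name}, {SUFFIX}",
            "objectClass: GlueClusterTop",
            "objectClass: GlueCluster",
            f"GlueClusterName: {name}",
            f"GlueClusterUniqueID: {unique_id}",
            "GlueClusterService: compute",
        ]))
    return "\n\n".join(entries) + "\n"


ldif = clusters_to_ldif(SAMPLE_XML)
print(ldif)
```

Host and metric elements are ignored in this sketch; a fuller mapping would also translate them into the corresponding GLUE host attributes.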
II: Farm Monitoring
Linux Farm Is Divided Into Different Sub-clusters Based on Site Policy, Different Experiments, OS and Version, CPU Speed. A Sub-cluster Contains Hosts With the Same Configuration
  BNL ATLAS Farm Is Partitioned Into Five Sub-clusters: Cpu400mhz, Cpu700mhz, Cpu1ghz, Cpu1.4ghz and Cpu2.4ghz
The Status Information of a Sub-cluster Is Summarized From All Nodes in This Sub-cluster
Grid Resource Broker Schedules at the Level of Farm Sub-clusters
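
Summarizing node status per sub-cluster can be sketched as a simple group-and-average step. The node records, the grouping key (CPU speed in MHz), and the output field names are hypothetical; the count of two CPUs per node follows the dual CPU description given earlier in the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-node samples: (host, cpu_mhz, load_5, user_cpu, sys_cpu).
NODES = [
    ("node01", 1000, 0.5, 10.0, 2.0),
    ("node02", 1000, 1.5, 30.0, 4.0),
    ("node03", 2400, 2.0, 80.0, 5.0),
]

CPUS_PER_NODE = 2  # the farm is described as dual CPU nodes


def summarize(nodes):
    """Group nodes by CPU speed and average their status values."""
    groups = defaultdict(list)
    for host, cpu_mhz, load_5, user_cpu, sys_cpu in nodes:
        groups[cpu_mhz].append((load_5, user_cpu, sys_cpu))
    summary = {}
    for cpu_mhz, rows in groups.items():
        loads, users, systems = zip(*rows)
        summary[cpu_mhz] = {
            "number_of_cpu": len(rows) * CPUS_PER_NODE,
            "average_load": mean(loads),
            "average_user_percent": mean(users),
            "average_sys_percent": mean(systems),
        }
    return summary


summary = summarize(NODES)
for speed in sorted(summary):
    print(speed, summary[speed])
```

A resource broker then only needs one record per sub-cluster rather than one per node.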
Information Schema (Linux Farm Monitoring)
Queue-Info:
objectclass ( 1.3.6.1.4.1.3536.2.6.0.0.0.0
    NAME 'Queue-Info'
    SUP 'Mds'
    STRUCTURAL
    MUST ( MdsQueueNumberOfCpu $
           MdsQueueSpeed $
           MdsQueueAverageLoad $
           MdsQueueAverageUserPercent $
           MdsQueueAverageSysPercent ))
Needs to Be Replaced by the GLUE Schema
Backend Data Structure
Node Status Information
mysql> describe node_load;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| load_index | int(10) unsigned |      | PRI | NULL    | auto_increment |
| sampletime | timestamp(14)    | YES  | MUL | NULL    |                |
| machine_id | varchar(31)      |      |     |         |                |
| owner      | varchar(8)       |      |     |         |                |
| load_5     | float(10,2)      |      |     | 0.00    |                |
| user_cpu   | float(10,2)      |      |     | 0.00    |                |
| sys_cpu    | float(10,2)      |      |     | 0.00    |                |
+------------+------------------+------+-----+---------+----------------+
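
A database client tool reporting the newest sample per machine from a table shaped like `node_load` could be sketched as below. SQLite (via Python's standard `sqlite3` module) stands in for the MySQL/ODBC stack purely so the example is self-contained; the column types are approximations, and the sample rows, index name, and query are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Approximation of the node_load table using SQLite types.
conn.execute("""
    CREATE TABLE node_load (
        load_index INTEGER PRIMARY KEY AUTOINCREMENT,
        sampletime TEXT,
        machine_id TEXT NOT NULL,
        owner      TEXT NOT NULL,
        load_5     REAL NOT NULL DEFAULT 0.0,
        user_cpu   REAL NOT NULL DEFAULT 0.0,
        sys_cpu    REAL NOT NULL DEFAULT 0.0
    )
""")
conn.execute("CREATE INDEX idx_sampletime ON node_load (sampletime)")

# Hypothetical samples; timestamps use a 14-digit YYYYMMDDHHMMSS form.
rows = [
    ("20030318120000", "node01", "atlas", 0.40, 35.0, 2.0),
    ("20030318121000", "node01", "atlas", 0.80, 60.0, 3.0),
    ("20030318121000", "node02", "atlas", 1.20, 90.0, 5.0),
]
conn.executemany(
    "INSERT INTO node_load (sampletime, machine_id, owner, load_5, user_cpu, sys_cpu) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Newest sample per machine: join each row against its machine's latest timestamp.
LATEST = """
    SELECT n.machine_id, n.load_5, n.user_cpu, n.sys_cpu
    FROM node_load AS n
    JOIN (SELECT machine_id, MAX(sampletime) AS newest
          FROM node_load
          GROUP BY machine_id) AS m
      ON n.machine_id = m.machine_id AND n.sampletime = m.newest
    ORDER BY n.machine_id
"""
latest = conn.execute(LATEST).fetchall()
for row in latest:
    print(row)
```

Because the 14-digit timestamps sort lexicographically in time order, `MAX(sampletime)` picks the most recent sample even when the column is stored as text.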
Information Provider (Linux Farm Monitoring)
# generate Farm information every 10 minutes
dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: /usr/local/globus-new/customize
base: mds-farm-batch-info.pl
args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
cachetime: 600
timelimit: 20
sizelimit: 400
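
The contents of `mds-farm-batch-info.pl` are not shown in the talk. The following Python sketch only illustrates what an exec-type provider taking `-dn` and `-ttl` arguments might print. The function names and the summary values are hypothetical, and the use of `Mds-validfrom` / `Mds-validto` to carry the lifetime is an assumption about how the `-ttl` value is encoded.

```python
import argparse
from datetime import datetime, timedelta, timezone

STAMP = "%Y%m%d%H%M%SZ"


def build_entry(dn, ttl, summary, now):
    """Format one Queue-Info entry as LDIF text valid for `ttl` seconds."""
    lines = [
        f"dn: {dn}",
        "objectclass: Queue-Info",
        # Assumption: the lifetime is expressed through validity timestamps.
        f"Mds-validfrom: {now.strftime(STAMP)}",
        f"Mds-validto: {(now + timedelta(seconds=ttl)).strftime(STAMP)}",
    ]
    lines += [f"{attr}: {value}" for attr, value in summary.items()]
    return "\n".join(lines) + "\n"


def main(argv):
    parser = argparse.ArgumentParser(description="Sketch of a farm info provider")
    parser.add_argument("-dn", required=True, help="distinguished name of the entry")
    parser.add_argument("-ttl", type=int, default=900, help="lifetime in seconds")
    args = parser.parse_args(argv)
    # Hypothetical values; a real provider would read them from the archive database.
    summary = {
        "MdsQueueNumberOfCpu": 4,
        "MdsQueueSpeed": 1000,
        "MdsQueueAverageLoad": 1.0,
        "MdsQueueAverageUserPercent": 20.0,
        "MdsQueueAverageSysPercent": 3.0,
    }
    entry = build_entry(args.dn, args.ttl, summary, datetime.now(timezone.utc))
    print(entry, end="")
    return entry


example = main(["-dn", "MdsFarmQueueName=1000,Mds-Vo-name=local,o=grid", "-ttl", "900"])
```

The attribute names come from the Queue-Info schema shown earlier; everything else is illustrative.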
Current Status and Future Work
Current Status:
  Sensors & Local Monitoring Tools Put Less Than 1 Percent CPU Load: Non-intrusive
  Improved the Ganglia Information Provider: It Can Obtain Information From Both Gmond and Gmetad
  Multiple & Hierarchical Clusters Are Supported
Future Works:
  Merge the Ganglia RRD Information Provider and the Archive DB Information Provider
  Work With the Ganglia Team and GLUE Schema to Help Define Requirements for What Information Should Be Monitored for Job Scheduling
  Automate the Mapping From XML to GLUE Schema, Provide Flexibility
  Continue to Optimize the Information Provider to Deliver Data Faster
  Scalability Test
  Extend This Prototype to Other Facility Monitoring