Enterprise Manager 12c - Database Monitoring Implementation at Thomson Reuters
Suraj Talreja, Manager, Database & Middleware, Design and Engineering
Thomson Reuters
• Thomson Reuters is the world’s leading source of intelligent
information for businesses and professionals.
• We combine industry expertise and innovative technology
to deliver critical information to leading decision makers.
• We are the world’s most trusted news organization.
• We serve professionals in the financial and risk, legal, tax
and accounting, intellectual property and science and
media markets.
Database & Middleware, Design and Engineering
• Collaborate with Business Partners.
• Partner with Architects in evaluating new database
technologies.
• Rollout Database Infrastructure Projects.
• Establish and govern database standards across data-centers.
• Escalation contact for support groups during major incidents.
THOMSON REUTERS DATABASE SCALE
• Oracle Environment
– Over 1200 Databases Deployed
– Over 2400 Instances Deployed
– Over 1 PB Of Allocated DB Storage
– Over 350 New Instances Deployed Last Year
• Exadata Database Machines
– 8 Exadata Full Racks
– 3 Exadata Quarter Racks
So why Enterprise Manager 12c?
• Implement “One Administration Approach”
• Standardize database monitoring
– Replace custom scripts
– Enable monitoring & alerting in a controlled fashion
– Meaningful & actionable alerts only
– Common monitors and default thresholds for all DBA teams; DBA teams alter thresholds for exceptions as needed
– Email and Pager (Service Manager) integration
• Implement best practices on Database Diagnostics and Performance Management
• Holistic view for Exadata Management
But how do we manage privileges across DBA teams for a subset of targets within an Administration Group?
By implementing “Dynamic Groups”
• Requirements:
– Need to grant privileges on a subset of the targets in an Administration Group to different DBA support groups
– Administration Groups based on “Lifecycle Status” and “Line of Business” did not align with the groups of targets needed for privilege management
• Solution:
– Keep Administration Groups intact for monitoring
– Create Privilege-Propagating Dynamic Groups with criteria based on “Contact” to differentiate databases managed by different DBA support groups
– Grant the Operator privilege on each dynamic group to a role, and grant that role to the DBAs
– Use the same dynamic groups for incident rule sets
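The split between Administration Groups (monitoring) and Dynamic Groups (privileges) can be illustrated with a small model. This is only a sketch in plain Python, not Enterprise Manager code: the classes, the function `operable_targets`, the target names, and the contact value "DBA-SUPP-EXAMPLE" are hypothetical; only the contact "DBA-SUPP-ORACLE-INT" and the role "DBA-SUPP-ORCL-INT" come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    lifecycle_status: str   # drives Administration Group placement (monitoring)
    line_of_business: str   # drives Administration Group placement (monitoring)
    contact: str            # drives Dynamic Group membership (privileges)


@dataclass(frozen=True)
class DynamicGroup:
    name: str
    contact: str            # membership criterion: target.contact must equal this

    def members(self, targets):
        """Membership is computed from the attribute, never maintained by hand."""
        return [t for t in targets if t.contact == self.contact]


def operable_targets(user_roles, role_grants, targets):
    """Return the sorted names of targets a user can operate.

    role_grants maps a role name to the dynamic groups on which that role
    holds the Operator privilege; the privilege reaches every group member.
    """
    names = set()
    for role in user_roles:
        for group in role_grants.get(role, []):
            names.update(t.name for t in group.members(targets))
    return sorted(names)


# Hypothetical inventory for illustration only.
targets = [
    Target("legal_test_db", "TEST", "LEGAL", "DBA-SUPP-ORACLE-INT"),
    Target("dco_prod_db", "PROD", "DCO", "DBA-SUPP-ORACLE-INT"),
    Target("bsi_prod_db", "PROD", "BSI", "DBA-SUPP-EXAMPLE"),
]
int_group = DynamicGroup("Contact=DBA-SUPP-ORACLE-INT", "DBA-SUPP-ORACLE-INT")
role_grants = {"DBA-SUPP-ORCL-INT": [int_group]}

print(operable_targets(["DBA-SUPP-ORCL-INT"], role_grants, targets))
```

Because membership depends only on the “Contact” attribute, a target keeps its place in the monitoring hierarchy while privileges follow the support group that owns it.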
EM 12c Monitoring Implementation at Thomson Reuters - Overview
[Diagram: ALL TARGETS is divided by Lifecycle Status (Dev/Test, Staging, Prod), and each lifecycle branch is divided by Line of Business (LEGAL, DCO, BSI). A new target with the attributes Lifecycle Status: “TEST”, LoB: “LEGAL” and Contact: “DBA-SUPP-ORACLE-INT” is placed in the matching branch and receives its Template Collection. The “Contact” attribute automatically adds the target, alongside other targets, to the Priv Prop Dynamic Group “Contact=DBA-SUPP-ORACLE-INT”, which is associated with the role “DBA-SUPP-ORCL-INT”.]
Incident Management - Rule Sets
• One Rule Set per dynamic group, corresponding to each of the three DBA support groups.
• Each rule set consists of the following rules (Rule on: Action Summary):
– Target Down events: create incident; set priority to Very High
– Metric Alert events: create incident; set priority to High
– Agent unreachable: create incident if the event is open for 15 minutes; set priority to Very High
– "ORA" errors in alert.log (to capture the “ORA” error string): create incident and notify the respective DBA group
– Send email for all new incidents (from the rules above): notify the respective DBA group; Warning events are “email only”, Critical / Fatal events use HP Service Manager integration via EMAT (homegrown tool)
• Advantages:
– Quickly enable/disable notifications for a subset of targets.
– Scalable approach as new DBA groups are formed.
– Room for customization within each rule set to meet new requirements.
– Ease of management.
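The incident-creation rules in each rule set can be read as a small decision function. The following is a minimal sketch in plain Python, not an Enterprise Manager API: the event labels, the `Event` and `Incident` classes, and `apply_rule_set` are hypothetical names chosen for illustration. The priorities and the 15-minute delay are taken from the rule list; the slides give no priority for "ORA" errors, so the sketch leaves it unset.

```python
from dataclasses import dataclass
from typing import Optional

AGENT_UNREACHABLE_DELAY_MIN = 15  # delay stated in the rule list


@dataclass(frozen=True)
class Event:
    # Hypothetical labels: "target_down", "metric_alert",
    # "agent_unreachable", "ora_error"
    kind: str
    open_minutes: int = 0


@dataclass(frozen=True)
class Incident:
    priority: Optional[str]  # None where the slides state no priority


def apply_rule_set(event: Event) -> Optional[Incident]:
    """Return the incident a rule set would create, or None if no rule fires."""
    if event.kind == "target_down":
        return Incident("Very High")
    if event.kind == "metric_alert":
        return Incident("High")
    if event.kind == "agent_unreachable":
        # Wait out short network blips before raising an incident.
        if event.open_minutes >= AGENT_UNREACHABLE_DELAY_MIN:
            return Incident("Very High")
        return None
    if event.kind == "ora_error":
        return Incident(None)
    return None


print(apply_rule_set(Event("agent_unreachable", open_minutes=5)))
print(apply_rule_set(Event("agent_unreachable", open_minutes=20)))
```

Keeping one such rule set per dynamic group is what allows notifications to be switched on or off for a single support group without touching the others.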
Recommendations From Our Experience
• Ensure standby database monitoring is configured with a user having “SYSDBA” privilege.
• Need automated fixes such as a listener restart? Explore “Corrective Actions”.
• Configure DG Broker to receive advanced notifications for standby monitoring.
• Leverage Dynamic Groups for Incident Management as well.
• Use the “Event-based” notification template to capture the complete ORA error string from alert.log.
• Set delay for “agent unreachable” alerts to mitigate network latency.
DATABASE MANAGEMENT – Pre-Enterprise Manager
DEPLOYING AT SCALE
– Over 350 New LION servers deployed each year
– Current Approach
• Gold Image
• Cloning Process via Custom Scripts
– Challenges
• Revision Management
• Scripting Effort
• Troubleshooting Failures
MONITORING AT SCALE
– Over 2400 Oracle servers on the floor
– Current Approach
• Custom scripts
• Cron, dbash
– Challenges
• Each team has their own set of scripts
• Troubleshooting Performance
• Patch Management
Phases of Enterprise Manager Program
[Timeline chart spanning Q4 2012 to Q4 2013; stages shown per phase: Plan, Design, Execute]
• Phase 1: Deployment (12/31)
• Phase 2: Monitoring (7/31)
• Phase 3: Diagnostics (9/30)
• Phase 4: Lifecycle Management (12/15)
EM 12c: Infrastructure Deployment to Maximize Availability
Level 4 HA Deployment
EM Version: 12.1.0.2
DB Plugin: 12.1.0.3
EM 12c: Database Monitoring
• Implementation focus on core database monitoring components:
• Agree on pilot rollouts
• Finalize the “Admin Tree” structure
• Test HA for your EM site (addon)
• Instance Down
• Listener Down
• Alert.log
• File System Utilization
• Tablespace Utilization
• EM Agent Down
• Streams Process Down
• Data Guard Standby Lag
• CRS Nodeapps Down
• CRS Processes Down
• Max. Connections
• SOA Suite
• Exadata Monitoring
Incident Management – Notification Details
• Utilize “EMAT” to integrate Enterprise Manager’s incident management features with Service Manager capabilities.
• EMAT functionality is based upon the email notifications sent from EM.
• EMAT processes generate the appropriate email and paging notifications, and generate an IM ticket where appropriate.
• Incident tickets are managed in Service Manager.

Severity    Target Lifecycle Status    Action
Fatal       Production                 IM Ticket, Page
Fatal       Staging                    IM Ticket
Fatal       Development                IM Ticket
Critical    Production                 IM Ticket, Page
Critical    Staging                    IM Ticket
Critical    Development                IM Ticket
Warning     Production                 e-mail
Warning     Staging                    e-mail
Warning     Development                e-mail
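The severity and lifecycle matrix above reduces to two conditions, which the following sketch expresses as a lookup function. It is an illustration in plain Python, not EMAT's actual implementation (EMAT is a homegrown tool whose internals the slides do not show); the function name `notification_actions` and the error handling are assumptions.

```python
KNOWN_LIFECYCLES = {"Production", "Staging", "Development"}


def notification_actions(severity: str, lifecycle_status: str) -> list:
    """Map an incident's severity and target lifecycle to notification actions."""
    if lifecycle_status not in KNOWN_LIFECYCLES:
        raise ValueError(f"unknown lifecycle status: {lifecycle_status!r}")

    if severity in ("Fatal", "Critical"):
        # Every Fatal/Critical incident gets a ticket; only Production also pages.
        actions = ["IM Ticket"]
        if lifecycle_status == "Production":
            actions.append("Page")
        return actions

    if severity == "Warning":
        # Warnings never open a ticket, regardless of lifecycle.
        return ["e-mail"]

    raise ValueError(f"unknown severity: {severity!r}")


for sev in ("Fatal", "Critical", "Warning"):
    for lifecycle in ("Production", "Staging", "Development"):
        print(sev, lifecycle, notification_actions(sev, lifecycle))
```

Expressing the matrix as two rules (ticket for Fatal or Critical, page only in Production) makes it easier to check that a new severity or lifecycle value is handled deliberately instead of falling through.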
Database Diagnostics using EM 12c
• Implement Database Diagnostics Pack
• Expand Core Monitoring capabilities:
– Golden Gate Plugin
– “Corrective Actions” for listener restart and tablespace adds
• Implement Phase-2 Pilots:
– Exadata Monitoring and Management
– SOA Suite monitoring
• Pilot rollout:
– Netapp Plugin
– MySQL Plugin
– Real Application Testing (RAT)
• Performance monitoring and diagnostics
– Automatic Workload Repository (AWR)
– Active Session History (ASH)
– Real Time ADDM
– Real-time SQL and PL/SQL Monitoring
– Exadata Cell Grid Performance
– Exadata Resource Utilization
• Mass Deployment of Oracle Software (Database, Real Application Clusters)
• Supports all versions up to 11.2 / Grid Infrastructure Architecture
• Gold Image cloning and standardized software deployment via Profiles
• Lock down access for controlled and error-free deployments
DB Provisioning
EM 12c: DEPLOYING AT SCALE – Q4’13
[Diagram: Source DB systems → Software Library Storage → Target DB Systems. Save the Gold image (and optionally data) from source systems to the EM software library, then deploy the saved image and data to target systems with customizations.]
EM: Database Monitoring
Lessons Learnt... the hard way!
• EM12c agent NOT supported on SLES9.
• Ensure standby database monitoring is configured with a user having “SYSDBA” privilege.
• Need automated fixes such as a listener restart? Explore “Corrective Actions”.
• Configure DG Broker to receive advanced notifications for standby monitoring.
Enhancement Requests
• Allow wildcards for tablespace names in metric collection settings.
• Show custom target properties as a drop-down during target promotion.
• Make selected target properties mandatory for target promotion.
• Need EMCLI verbs for target promotion.
• Take corrective action to automatically add space to tablespace.
• Take corrective action to restart listener on failure.