Books - Maverick's Blogvenus.sci.uma.es/docs/ebooks/TakingControl... · Books Contents Chapter 2 Monitoring Windows Server . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Monitoring


Contents

Chapter 1 System Monitoring — What Does It Involve? . . . . . . . . 1
  Providing Comprehensive Monitoring . . . . . . . . 2
  System-Monitoring Solutions . . . . . . . . 2
    Selecting a System-Monitoring Solution . . . . . . . . 2
  Implementing a System-Monitoring Solution . . . . . . . . 3
    Depth and Breadth . . . . . . . . 3
    Scalability . . . . . . . . 3
    Manageability . . . . . . . . 3
  System-Monitoring Solutions: Determining Scope . . . . . . . . 3
    Depth . . . . . . . . 4
    Breadth . . . . . . . . 4
  Assessing Users’ System-Monitoring Requirements . . . . . . . . 6
    Who Needs What Information About Each Technology . . . . . . . . 6
  Examining System-Monitoring Features . . . . . . . . 7
    Reporting and Analysis . . . . . . . . 7
    Alert and Response . . . . . . . . 8
      Log-based Monitoring . . . . . . . . 8
      False Alarms . . . . . . . . 9
      Automated Response . . . . . . . . 9
      Custom Response . . . . . . . . 9
      Interoperability . . . . . . . . 10
  System-Monitoring Solution Architecture . . . . . . . . 11
    Monolithic System-Monitoring Solutions . . . . . . . . 11
    Agent-based System-Monitoring Solutions . . . . . . . . 12
    Agent-Optional System-Monitoring Solutions . . . . . . . . 13
  System-Monitoring Solutions: Management and Policy . . . . . . . . 13
    Security . . . . . . . . 14
  Your System-Monitoring Solution Homework . . . . . . . . 14
  Next: Monitoring Windows Servers . . . . . . . . 14

Chapter 2 Monitoring Windows Server . . . . . . . . 16
  Monitoring the Fundamentals: CPU, Memory, and Disk . . . . . . . . 16
    Monitoring the CPU . . . . . . . . 16
      CPU-Use Trend Analysis . . . . . . . . 17
        Identifying Progressive Change in CPU Use . . . . . . . . 17
      CPU-Use Near Real-Time Alerts . . . . . . . . 18
        Sidebar: Response Maturity . . . . . . . . 19
    Monitoring Memory . . . . . . . . 20
      Memory Trend Analysis . . . . . . . . 20
    Monitoring Disk . . . . . . . . 20
      Disk-Space Trend Analysis . . . . . . . . 20
      Physical Disk-Drive Failure . . . . . . . . 21
  Monitoring the Windows Event Logs . . . . . . . . 22
    Monitoring the System Log . . . . . . . . 22
      Unspecified Sources . . . . . . . . 23
    Monitoring the Application Log . . . . . . . . 23
    Monitoring the Security Log . . . . . . . . 24
  Monitoring Text-based Log Files . . . . . . . . 28
    IAS and RRAS Logs . . . . . . . . 30
      IAS Log . . . . . . . . 30
      RRAS Logs . . . . . . . . 32
  Monitoring Strategy . . . . . . . . 33
    WMI . . . . . . . . 33
  Next: Monitoring AD . . . . . . . . 34

Chapter 3 Monitoring AD . . . . . . . . 35
  AD’s Resilience: Pro and Con . . . . . . . . 35
    Monitoring AD: Single Points of Failure . . . . . . . . 36
    Monitoring AD: Change Control . . . . . . . . 36
  Monitoring AD Through Event Logs . . . . . . . . 36
    DS Log . . . . . . . . 36
    FRS Log . . . . . . . . 37
      Exhausted Disk Space . . . . . . . . 38
      RPC Problems Between DCs . . . . . . . . 38
    DNS Log . . . . . . . . 39
    Other DC Logs . . . . . . . . 39
  Monitoring AD with Command-Line Tools . . . . . . . . 40
    Dcdiag . . . . . . . . 40
    Netdiag . . . . . . . . 40
    Gpotool . . . . . . . . 41
  AD Performance Monitoring . . . . . . . . 41
    NTDS Object . . . . . . . . 41
      Monitoring DS on the Local DC . . . . . . . . 41
      Monitoring Client Work and Replication . . . . . . . . 42
    Database Object . . . . . . . . 42
      Monitoring AD Database Operations . . . . . . . . 42
  Other Ways to Monitor AD . . . . . . . . 43
    Monitoring Services on DCs . . . . . . . . 43
    Direct AD Checks . . . . . . . . 43
  Monitoring AD Security . . . . . . . . 44
    Auditing Administrative Activity . . . . . . . . 44
    Detecting Account Management Changes . . . . . . . . 44
      DS Category . . . . . . . . 45
    Detecting Intrusion Attempts . . . . . . . . 49
  The Importance of Monitoring AD . . . . . . . . 50
  Next: Monitoring Exchange Server . . . . . . . . 50

Chapter 4 Monitoring SQL Server . . . . . . . . 51
  Monitoring: Performance Counters . . . . . . . . 51
    SQL Server Performance Counters . . . . . . . . 51
      Buffer Cache-Hit Ratio . . . . . . . . 52
      Page Splits . . . . . . . . 52
        Page Splits Defined . . . . . . . . 53
        Fill Factor Defined . . . . . . . . 53
        Identifying Page-Split Slowdowns . . . . . . . . 53
      Efficient Index Use . . . . . . . . 54
      Overall Activity . . . . . . . . 55
      Concurrency Bottlenecks . . . . . . . . 55
      Memory Shortage . . . . . . . . 55
      SQL Server Backups . . . . . . . . 56
      Low Disk Space . . . . . . . . 56
      CPU Use . . . . . . . . 56
      Disk and Network Performance . . . . . . . . 57
  Monitoring: Application Log . . . . . . . . 57
    SQL Server Security-related Activities . . . . . . . . 58
  Monitoring: SQL Queries and Transactions . . . . . . . . 59
    SQL Rules . . . . . . . . 60
  SQL Server: It Pays to Monitor . . . . . . . . 60
  Next: Monitoring Exchange Server . . . . . . . . 60

Chapter 5 Monitoring Exchange Server . . . . . . . . 61
  Monitoring Key OS Resources . . . . . . . . 61
    Services and Processes . . . . . . . . 62
  Monitoring Message Queues . . . . . . . . 63
    Using Performance Counters . . . . . . . . 64
      Message Queues: A Review . . . . . . . . 64
        Link Queues . . . . . . . . 64
        System Queues . . . . . . . . 65
      Message Queue Counters . . . . . . . . 66
      Troubleshooting Message Queues . . . . . . . . 67
        Troubleshooting a Connection to a Remote Server . . . . . . . . 67
        Freezing and Unfreezing Queues . . . . . . . . 68
        Examining Message Queues . . . . . . . . 68
        Freezing and Unfreezing Messages . . . . . . . . 69
  Monitoring Specific Exchange Performance Counters . . . . . . . . 70
    RPC Requests . . . . . . . . 70
    Store.exe . . . . . . . . 70
  Monitoring Through WMI . . . . . . . . 71
  Monitoring Through the Application Log . . . . . . . . 71
  Miscellaneous Monitoring . . . . . . . . 73
  Monitoring the Many Facets of Exchange Server . . . . . . . . 73
  Next: Capacity Planning and Trend Analysis . . . . . . . . 74

Chapter 6 Trend Analysis and Capacity Planning . . . . . . . . 75
  Trend Analysis: Establishing Baselines . . . . . . . . 75
    Compensating for Anomalies . . . . . . . . 76
      Filtering Anomalies . . . . . . . . 76
        Adjusting Data Toward Your Goal . . . . . . . . 76
      Drowning Anomalies . . . . . . . . 77
  Capacity Planning: Predicting the Future . . . . . . . . 77
    Choosing the Right Trend Line . . . . . . . . 77
      Linear Trend Line . . . . . . . . 78
      Exponential Trend Line . . . . . . . . 78
  Defining What’s Typical . . . . . . . . 79
  Reporting . . . . . . . . 79
  Monitoring: Some Reminders . . . . . . . . 80
  Proactive Monitoring . . . . . . . . 80


Chapter 1:

System Monitoring — What Does It Involve?

At first glance, system monitoring might seem to be a straightforward process. After all, it’s such a common task that dictionaries include a noun definition for a program that’s a “monitor,” as Figure 1.1 shows.

Figure 1.1 Defining “monitor”

mon·i·tor
n. Computer Science. A program that observes, supervises, or controls the activities of other programs.
v. tr. 1. To keep track of systematically with a view to collecting information. 2. To test or sample, especially on a regular or ongoing basis. 3. To keep close watch over; supervise.

However, monitoring means different things to different people. For example, one’s position in the IT department and/or the organization often drives how one thinks about – and definitely what one needs from – a monitoring solution.

If I’m an engineer on the front line of a data center – responsible for day-to-day and minute-to-minute operations – my needs are tactical and align best with verb definition 3 above. I want to know immediately when a problem occurs, so I can fix it. (Actually, I want to know before a problem occurs, and I want the monitoring solution to try to fix the problem itself, but more about that later in this chapter.)

If I’m a network architect assessing the need for changes to the WAN, I need metrics that can show me how the WAN is being used, as well as when and where bottlenecks occur. My monitoring needs align more with verb definitions 1 and 2 above.

And if I’m an executive, my needs draw upon the activities described in all the definitions, but at a strategic level. I want to know, for example, whether the IT department is meeting its service level agreements (SLAs). I don’t need to know whether Microsoft Exchange is down at a given moment so much as I need to know how many times Exchange has been down in the past 90 days.

Monitoring requirements also differ according to discipline. Network administrators need to identify operational problems, but security administrators want to know about suspicious activity and events. Web masters are interested in all of the data but need to be able to watch Web site traffic to track such things as ad efficiency and impression counts.

Brought to you by Argent and Windows IT Pro eBooks


Obviously, monitoring is important and different people have different information needs from the systems being monitored. What’s the best way to go about meeting your organization’s monitoring requirements?

Providing Comprehensive Monitoring

The UNIX platform and now the Windows platform have excellent scripting capabilities, and both OSs, as well as most of the applications that run on them, provide access to important performance and diagnostic data. Therefore, effective IT departments will exploit such facilities for automating critical operations and incident response actions. An IT department that relies solely on its own patchwork of monitoring and response scripts might well succeed in monitoring certain targeted applications or processes. However, it will probably fail to provide both the tactical monitoring of the overall network and the strategic monitoring and reporting necessary for managerial and architectural decisions.

The only way to provide comprehensive tactical and strategic monitoring in-house would be to employ a group of developers whose core competency is developing monitoring solutions. Unless an organization has such a group of developers, which would be rare, the organization must look to software vendors and implement one or more monitoring solutions from vendors whose products fit the organization’s requirements. Companies can find plenty of solutions from which to choose. In this chapter, I’ll give you an overview of key concerns relevant to selecting and implementing a system-monitoring solution – then, I’ll examine several of the concerns in more depth.

System-Monitoring Solutions

The system-monitoring solutions market is packed with offerings that cover the spectrum in terms of architecture, sophistication, and cost. Some solutions focus on one application or technology, such as Exchange, the Windows Security log, or Web server statistics. Some solutions are weighted toward real-time monitoring whereas others lean more toward trend analysis over the long haul. Some solutions target smaller organizations; others target large enterprise networks.

But one of the most important differentiating factors between system-monitoring solutions is the solution’s component architecture. Is the solution monolithic – with a single component performing the data collection, analysis, alerting, and reporting? Or does the solution have separate components for various tasks? For example, some solutions have agents for data collection, monitoring engines to harvest data from agents and perform alerting and reporting functions, and consoles for the solution’s administrators. The component architecture of a system-monitoring solution is a determining factor in the solution’s flexibility and deployment costs. I explore the types of architecture in the “System-Monitoring Solution Architecture” section later in this chapter.

Selecting a System-Monitoring Solution

How can an organization select a monitoring solution? First, you must carefully define your organization’s requirements. But requirements analysis isn’t enough. System monitoring is technically complex, and you should know its potential pitfalls and challenges. Second, once you’re aware of the risks that system-monitoring projects face, you need to identify which risks are most relevant to your environment. Third, you’ll match prospective solutions and their architectures with your requirements and risk analyses. Only then will you be able to select a solution that meets your functional needs but also stands up to the tests of your particular environment.

2 Taking Control: Monitoring the Windows Platform Proactively


Implementing a System-Monitoring Solution

The first challenge is to keep your system-monitoring solution from causing the problem you’re trying to prevent by monitoring. When physicist Werner Heisenberg, known for his uncertainty principle, imagined building a gamma-ray microscope to observe an electron, he discerned that the gamma rays would bounce the electron all over the place and prevent useful observation. Similarly, you must take steps to mitigate the risk that your system monitor will adversely affect the service you’re trying to protect. I’ll refer to this risk as interference.

Other system-monitoring solution challenges are more mundane – such as the need for depth and breadth, scalability, and manageability. By depth, I mean the ability to collect and analyze even quite specialized sources of information.

Depth and Breadth

You must implement a solution that has wide enough support to cover all the technologies you need to monitor but also provides sufficient depth for each technology monitored. To provide useful monitoring for any one resource or service, a monitoring solution needs a sufficiently deep understanding of that resource or service. For example, the bare-bones Windows Server 2003 OS alone contains about 10 different logs with varying formats, not to mention the hundreds of performance counters and scores of system services – any or all of which might be important to your monitoring needs. The same is true for UNIX – and for the applications and services that run on top of both Windows and UNIX.
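To see how small the building block of a per-service check really is, here is a minimal sketch (not part of the book's text) that wraps the built-in Windows `sc` command. The service name Spooler is purely an example; a real solution would, of course, manage hundreds of such rules with consistent policy rather than ad hoc scripts.

```python
"""Minimal sketch of a per-service check on Windows using the built-in
`sc` command-line tool. The service name 'Spooler' is only an example."""
import subprocess

def parse_sc_state(output: str) -> str:
    """Pull the state word (RUNNING, STOPPED, ...) out of `sc query` output.
    The relevant line looks like: '    STATE    : 4  RUNNING'."""
    for line in output.splitlines():
        if "STATE" in line:
            return line.split()[-1]
    raise ValueError("no STATE line in sc output")

def service_running(name: str = "Spooler") -> bool:
    """Return True if `sc query <name>` reports the service as RUNNING."""
    result = subprocess.run(["sc", "query", name],
                           capture_output=True, text=True)
    return parse_sc_state(result.stdout) == "RUNNING"
```

Checking one service this way is trivial; keeping dozens of such checks consistent across hundreds of machines is precisely the manageability problem discussed later in this chapter.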

However, you can’t implement a different solution for every component of your network that needs to be monitored. Such a scenario would be prohibitively costly and complex. If no single solution satisfies your requirements, you’ll need to implement multiple solutions – which will immediately raise questions of interoperability, as I discuss later in this chapter.

Scalability

Another challenge is scalability. The first issue that comes to mind when you consider scalability is usually whether the monitoring engine can keep up with alert and status data for your systems. That’s a legitimate issue, but scalability involves much more.

For example, your monitoring solution needs to scale as needed geographically. Your solution might be able to keep up with thousands of remote systems, but if it hogs all your WAN bandwidth, you have a problem. As you’ll see later in this chapter, the biggest factor in scalability is the monitoring solution’s architecture.

Manageability

The issue of manageability comes into play as soon as you’ve deployed more than a few monitoring components. Between agents, monitoring engines, and consoles, you can quickly end up with a lot of components – all needing to be maintained through consistent policies and updated in a timely manner. Trying to maintain each monitoring agent, engine, and console manually will inevitably lead to inconsistent monitoring policies and other problems related to components to which the vendors’ latest releases haven’t been applied.

System-Monitoring Solutions: Determining Scope

I’ve given you an overview of some system monitoring concerns. Now, let’s dig a little deeper into depth and breadth. (You’ll find more information about scalability in “System-Monitoring Solution Architecture” and about manageability in “System-Monitoring Solutions: Management and Policy” later in this chapter.)

If your organization is like many others, you deploy many technologies on your network. The first step in specifying your monitoring requirements is to identify which of these technologies you need to monitor and to what depth. Let’s look at depth first.

Depth

Depth addresses, for any given component, how specific to the technology the type of monitoring you need must be. For example, imagine you have a Web-based application that requires monitoring. You can configure a rule in your monitoring solution that simply pings the server to make sure it’s “up,” but that might not provide enough assurance. For example, if the Web server process on the server fails, the server’s network stack will still respond to the ping. Consequently, the monitoring solution won’t alert you to the problem.

The next step deeper involves setting up the monitoring solution to test the Web application by regularly making a request from the Web server process through HTTP or HTTP Secure (HTTPS). The monitoring solution will now alert you if the server itself goes down or if a problem with the Web server process exists. But what if the application itself fails?

Imagine that your Web application requires authentication to access the application and perform transactions. What if the connection between the Web application and its back-end database server breaks down? In the previous example, the monitoring solution simply requests a specified URL from the server and confirms that it receives some kind of response – even if it’s a “You are not authorized to view this page” message. That check confirms that the Web server itself is functioning, but it doesn’t say whether the application is functioning and can access the database. To confirm that level of functionality, your monitoring solution must be able to request a specific transaction. The solution might simulate a post to a form, then check for specific content in the page the server sends back.
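The three depths just described – reachability, an HTTP response of any kind, and a full transaction – can be sketched as three small checks. The host name, login URL, form fields, and expected page content below are all hypothetical placeholders, not part of any real product; the point is only how each level catches a failure the shallower level misses.

```python
"""Sketch of three monitoring depths for a Web application.
All host names, URLs, form fields, and expected strings are
hypothetical placeholders."""
import socket
import urllib.error
import urllib.parse
import urllib.request

def check_host_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Shallowest check, ping-like: the network stack answers.
    Succeeds even if the Web server process itself has failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_http_responds(url: str, timeout: float = 5.0) -> bool:
    """One level deeper: the Web server returns *some* response,
    even an error page such as 'You are not authorized'."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server answered, just with an error status
    except OSError:
        return False  # no response at all

def check_transaction(url: str, form: dict, expected: str,
                      timeout: float = 5.0) -> bool:
    """Deepest check: simulate a form post and look for specific content,
    confirming the application and its back-end database respond."""
    data = urllib.parse.urlencode(form).encode()
    try:
        with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
            return expected in resp.read().decode(errors="replace")
    except OSError:
        return False
```

A dead Web server process passes only the first check; a broken database connection passes the first two and fails only the transaction check.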

The same issue of depth exists with any other technology, including such systems as Microsoft SQL Server and Exchange. Think about all the thresholds and counters that might need monitoring on SQL Server alone: cache-hit ratios, free log space, deadlocks, average transaction time, and more. The point is that you need a solution or solutions that support each technology you need to monitor.
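To make one such counter threshold concrete, the sketch below samples the SQL Server buffer cache-hit ratio through the standard Windows typeperf utility and compares it with a threshold. Two assumptions to note: the counter path shown applies to a default SQL Server instance (named instances use a different object prefix), and the 90 percent threshold is purely illustrative, not a vendor recommendation.

```python
"""Sketch: one-shot sample of a SQL Server performance counter via the
Windows `typeperf` tool. Counter path assumes a default instance; the
threshold is illustrative only."""
import csv
import subprocess

COUNTER = r"\SQLServer:Buffer Manager\Buffer cache hit ratio"
THRESHOLD = 90.0  # hypothetical alert threshold, in percent

def parse_typeperf_csv(text: str) -> float:
    """`typeperf <counter> -sc 1` prints CSV: a header row, then one
    sample row whose first column is a timestamp and whose second
    column is the counter value."""
    rows = [row for row in csv.reader(text.splitlines()) if row]
    return float(rows[1][1])

def sample_counter(path: str) -> float:
    """Take a single sample of one counter on the local machine."""
    result = subprocess.run(["typeperf", path, "-sc", "1"],
                            capture_output=True, text=True, check=True)
    return parse_typeperf_csv(result.stdout)

# On a Windows host with SQL Server, you would alert like so:
#   value = sample_counter(COUNTER)
#   if value < THRESHOLD:
#       print(f"ALERT: buffer cache-hit ratio {value:.1f}% below {THRESHOLD}%")
```

Multiply this single rule by the dozens of counters listed in Chapter 4's contents and you see why per-technology depth matters when evaluating a solution.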

Breadth

Breadth addresses the range of technologies you need to monitor. Vendors will list which technologies their products monitor, but you’ll still need to explore whether a given tool collects the specific data you need. Figure 1.2 lists types of technologies and component parts that you might consider monitoring.

Note
In future chapters, I’ll review in depth the most common Windows-based technologies, discuss the types of data that need monitoring, and point out any relevant monitoring challenges and caveats.


Figure 1.2 Technologies to consider monitoring

OSs
• Windows
• Linux
• UNIX
• NetWare

Directory services
• Active Directory (AD)
• Sun Microsystems’ iPlanet
• Novell Directory Services (NDS)

Network components
• Switches and hubs
• Routers
• Bridges
• Wireless Access Points (WAPs)
• Wireless bridges and repeaters

Messaging servers
• Exchange
• Lotus Notes
• SMTP servers
• POP3 servers
• Network News Transfer Protocol (NNTP) servers
• Instant Messaging (IM) servers

Databases
• SQL Server
• Informix
• Oracle

Enterprise resource planning (ERP) systems
• SAP R/3
• Oracle
• PeopleSoft

Firewalls
• Microsoft Internet Security and Acceleration (ISA) Server
• Cisco PIX

FTP servers

Web servers
• Microsoft Internet Information Services (IIS)
• Apache

RAS
• VPN devices and servers
• Dial-up servers
• Remote Authentication Dial-In User Service (RADIUS) servers

Network infrastructure services
• DNS servers
• DHCP servers
• Network Time Protocol (NTP) servers

Printers

Server hardware
• Fan speed
• Internal temperature
• RAID status
• Power-supply faults
• Memory problems

Storage devices
• Storage Area Networks (SANs)
• Backup devices


After you identify the technologies that you need to monitor, consult with the parties who’ll use the monitoring solution and identify the data they need for each technology – for both tactical monitoring and strategic analysis. In fact, before you can fully nail down your depth and breadth requirements, you need to assess user requirements.

Assessing Users’ System-Monitoring Requirements

As I’ve mentioned previously, monitoring means different things to different people. Given the work and cost involved in implementing system-monitoring solutions, you need to make sure that the one you choose can fulfill the requirements of all the parties involved. And although assessing user needs might protract the requirements analysis and evaluation process, you’ll want to identify every stakeholder in the system-monitoring project.

First, identify your various sites, such as data centers, offices, and plants. Then, inventory the technology components at each site using a chart similar to the one that Table 1.1 shows.

Table 1.1 Identifying sites and their technologies

Operating Systems DB Servers Network Components Network Infrastructure

LA     x x x x x x x x x
NYC    x x x x x
DC     x x x x x
Miami  x x x x x x x

In Table 1.1, I simply used an “x” to note each technology resident at each of the four city sites. However, you’ll also find it useful to record the actual number of “instances” of each technology at each site. You’ll need this quantity information later in the process – when it comes time to calculate licensing costs and decide how to deploy agents.

Who Needs What Information About Each Technology?

Now that you know which technologies your organization deploys and where they reside geographically, your next step is to determine who needs what monitoring information for each component. If you can discover who’s responsible for the day-to-day operation and administration of each component and which business processes depend on each component, you’ll have an initial list of those who need tactical monitoring and alert coverage from the monitoring system. Next, determine who has oversight responsibility for the front-line technicians and who has managerial responsibility for the business processes dependent on the component. The members of this group might need more strategic reporting from the monitoring solution.


(Table 1.1 column headings, left to right: Windows, UNIX, Linux; SQL, Oracle, MySQL; Switches, Routers, Wi-Fi Access Points (AP); DNS, DHCP, NTP)


Finally, for each technology under consideration, determine who’s responsible for the design and architecture of the systems in each geographical area. The members of that group are good candidates for capacity-planning and trend-analysis information. When you’ve identified all the potential stakeholders, interview them about each technology to determine their need for tactical monitoring and/or the type of longer-term reporting and analysis.

After you’ve completed your information-gathering activities, record the results in a chart similar to the one that Table 1.2 shows. You can then use the information you’ve collected to define the depth and breadth requirements for your evaluation of system-monitoring solutions.

Table 1.2 Who needs monitoring information

Operators and administrators
  Windows: Security, Application, and System logs; specific event IDs and codes within event details; all performance counters.
  Exchange: queue length; space available; round-trip time (RTT) for message delivery; shared file access.
  Switches: system health.
  SQL Server: cache hit ratios; free log space; deadlocks; RTT for a sample/test stored procedure.

IT management
  Windows: downtime; unplanned system outages.
  Exchange: downtime; unplanned system outages; space usage by department.
  SQL Server: performance metrics.

Network architects
  Switches: bandwidth usage.

Examining System-Monitoring Features

Now that you know which technologies you’ll monitor and what information various stakeholders need from monitoring, you can begin to assess a solution’s system-monitoring functions. You’ll want to examine reporting and analysis features as well as alert and response capabilities.

Reporting and Analysis

You should check for the presence of some reporting and analysis features that many organizations use. For service-level analysis, capacity planning, and other strategic reports, managers and engineers often need to be able to roll up data to a summary level. For example, you might want to know whether a monitoring tool can aggregate data for all SQL Server machines or is limited to reporting results for each server individually. Does the tool support different views of the data? Can you analyze the same data based on geography, application or server type, and business unit?

You’ll want to determine whether the solution can answer a question such as “How many service outages occurred for all systems monitored that provide service to the commercial accounts business unit?” Also, can it respond to a request such as “Show me the percentage of free space on all file servers across the entire network”? Your organization’s structure and size will determine which questions and requests are most important to you.
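The kind of roll-up behind such questions can be sketched in a few lines; the record fields and values here are invented for illustration, not any product’s schema.

```python
from collections import defaultdict

# Hypothetical monitoring records; a real solution would read these from its database.
outages = [
    {"server": "sql01", "site": "LA", "business_unit": "commercial"},
    {"server": "web02", "site": "NYC", "business_unit": "retail"},
    {"server": "sql03", "site": "LA", "business_unit": "commercial"},
]

def outages_by(records, dimension):
    """Aggregate outage counts along any dimension (geography, business unit, ...)."""
    counts = defaultdict(int)
    for record in records:
        counts[record[dimension]] += 1
    return dict(counts)

print(outages_by(outages, "business_unit"))  # {'commercial': 2, 'retail': 1}
```

The same function answers the geography view (`outages_by(outages, "site")`) without any new reporting code, which is the flexibility the questions above are probing for.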

Chapter 1 System Monitoring — What Does It Involve? 7

Brought to you by Argent and Windows IT Pro eBooks

Page 15: Books - Maverick's Blogvenus.sci.uma.es/docs/ebooks/TakingControl... · Books Contents Chapter 2 Monitoring Windows Server . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Monitoring

If the monitoring solution lacks sufficient reporting and analysis functionality on its own, determine whether it will support a third-party solution to provide that function (e.g., Crystal Decisions’ Crystal Reports). Also, make sure that the solution stores its data in a common database server or format that you can access on your own. In addition, ask whether the database is documented so that you can join tables to produce useful queries.

Alert and Response

I’ve talked about reporting so far, but I haven’t devoted much attention to the major subject of alert and response. The more tactical the monitoring solution’s users are, the more their needs will involve the monitoring solution’s alert and response functionality.

A monitoring system’s alert and response capabilities are different in kind from its reporting capabilities. Reporting is a fairly circumspect activity that involves analyzing past measurements and data, but alert and response needs require that a system come as close as possible to recognizing specified conditions or events in real time – and following up immediately. Baseline functionality for entry-level monitoring solutions will include a simple rule list of events or conditions to monitor and, when those events or conditions are detected, the ability to send an email message to a specified address (e.g., an alphanumeric pager).

However, many organizations require more sophistication in both rules and response functionality. They need to specify complex rules that look for combinations of events and/or conditions. They also need escalation functionality so that the solution can automatically prioritize alerts based on time and other factors. As you consider a monitoring solution, discover whether you can include additional criteria in the rules, such as time of day, day of week, geographical site, and other custom-defined categories. The ability to include such criteria helps you reduce false positives and escalate alerts as necessary.
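To make the idea concrete, here’s a minimal sketch of a rule evaluator that layers those extra criteria – hours, weekdays, site – on top of a basic condition match. All field names are illustrative, not any vendor’s rule schema.

```python
from datetime import datetime

def rule_matches(event, rule, now=None):
    """Evaluate one alert rule; the extra criteria cut false positives by
    suppressing matches outside the rule's hours, days, or sites."""
    now = now or datetime.now()
    if event["condition"] != rule["condition"]:
        return False
    if "sites" in rule and event["site"] not in rule["sites"]:
        return False
    if "hours" in rule:
        start, end = rule["hours"]
        if not (start <= now.hour < end):
            return False
    if rule.get("weekdays_only") and now.weekday() >= 5:  # Saturday/Sunday
        return False
    return True

rule = {"condition": "disk_full", "sites": {"LA", "NYC"},
        "hours": (8, 17), "weekdays_only": True}
event = {"condition": "disk_full", "site": "LA"}
print(rule_matches(event, rule, now=datetime(2024, 1, 10, 9)))  # a Wednesday morning -> True
```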

Log-based Monitoring

For log-based monitoring, you need to know how detailed your solution’s scrutiny needs to be. The Windows Application, Security, System, DNS, Directory Service (DS), and File Replication Service logs all capture important details in the description portion of the event record.

In many cases, those description details are essential. For example, Windows 2000 Server and Windows Server 2003 log only two event IDs for all Kerberos authentication failures, yet those failures can occur for many different reasons. The only way to distinguish between an authentication failure caused by a bad password and one caused by an expired account is to check the failure code in the event description.

However, because much more development effort is required to create a tool that can process the data in event descriptions, some monitoring solutions don’t let you go beyond identifying the event ID. Consequently, you can’t create rules based on data within the description of different event IDs. Because of the wide variety of technologies that you might need to monitor and the arcana unique to each system, it pays to do your homework and make sure your monitoring solution can recognize the events and conditions you need to know about – and their causes.
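As a sketch of what “going beyond the event ID” means, the following pulls a failure code out of an event’s description text. The description layout shown is illustrative; real event descriptions vary by Windows version.

```python
import re

# Hypothetical description portion of a Kerberos failure event; the failure
# code lives here rather than in the event ID.
description = """Pre-authentication failed.
 User Name: jsmith
 Failure Code: 0x18"""

def failure_code(event_description):
    """Extract the hex failure code from an event-record description, if present."""
    match = re.search(r"Failure Code:\s*(0x[0-9A-Fa-f]+)", event_description)
    return match.group(1) if match else None

print(failure_code(description))  # 0x18
```

A monitoring rule built on this value can distinguish a bad password from an expired account; a rule keyed only to the event ID cannot.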


False Alarms

Without sufficiently detailed monitoring that can filter out false alarms, your staff will receive many more alerts than necessary. They’ll be burdened with investigating each alert to determine whether it really represents a problem or is just a false alarm. In effect, they’ll be doing much of the monitoring solution’s work. In such a scenario, staff members commonly start to respond less quickly to certain events and sometimes begin to ignore them. Poorly selected and implemented monitoring solutions tend to be self-defeating and waste resources.

Although I can’t say enough about event/condition recognition, other areas of functionality are important elements in many organizations’ alert requirements. You might need to configure the solution to alert different people according to your staffing schedule. For example, you might want to have all SQL Server alerts sent to John between 8:00 A.M. and 5:00 P.M. Monday through Friday, have alerts outside that timeframe sent to Alice, who has after-hours duty for the week – then reverse that schedule for the following week. Discover whether you can send other alerts to the system operations console at the appropriate data center according to the time zone.
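The staffing-schedule logic described above is easy to express in code. A sketch – the names and the alternating-week rule come from the example; everything else (ISO-week parity as the rotation trigger, the function shape) is an assumption of this sketch:

```python
from datetime import datetime

def on_call(now, primary="John", backup="Alice"):
    """Route a SQL Server alert by staffing schedule: primary covers
    8 A.M.-5 P.M. weekdays; backup covers everything else. Roles swap on
    alternating ISO weeks to mirror the rotating schedule."""
    business_hours = 8 <= now.hour < 17 and now.weekday() < 5
    if now.isocalendar()[1] % 2 == 0:  # even-numbered week: reverse the schedule
        primary, backup = backup, primary
    return primary if business_hours else backup

print(on_call(datetime(2024, 1, 10, 9)))   # a weekday morning
print(on_call(datetime(2024, 1, 13, 22)))  # a Saturday night, same week
```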

Also, analyze which alert methods your organization requires. Although sending an email message is the most common alerting method, your organization might need the ability to send console messages or to generate SNMP traps.

Automated Response

Although alerting a human is often the first response we think of when the system monitor catches a problem, it’s not always best. For many situations, you might prefer to let the monitoring solution respond and attempt to remedy the problem on its own before involving an operator. For example, you might have a service that frequently falls victim to the same problem – and the remedy might be straightforward. Suppose that you have a print server on which the spooler service hangs. Each time it hangs, your operator or administrator must restart the service. The right monitoring solution might be able to perform the restart for you. Automated response is the mark of the next level of maturity and sophistication in system-monitoring implementations.
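A sketch of such an automated response for the spooler example, using the standard Windows `sc` utility. The injectable `run` parameter is an assumption of this sketch (not a feature of any monitoring product) that keeps the decision logic testable off-box:

```python
import subprocess

def ensure_service_running(name, run=subprocess.run):
    """Attempt to restart a stopped Windows service before paging a human.
    `sc query` output includes a STATE line containing RUNNING when the
    service is healthy; anything else triggers a start attempt."""
    query = run(["sc", "query", name], capture_output=True, text=True)
    if "RUNNING" in query.stdout:
        return "already running"
    run(["sc", "start", name], capture_output=True, text=True)
    return "restart attempted"

# On a Windows host: ensure_service_running("Spooler")
```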

Custom Responses

The variety of alert/response methods the solution supports out of the box is less important if the solution lets you define custom responses – and you have the technical expertise in-house to write them. Developing your own responses usually requires programming ability with VBScript and/or Perl – and knowledge of API and/or COM interfaces to various systems management technologies such as Windows Management Instrumentation (WMI), Active Directory Service Interfaces (ADSI), and Web-Based Enterprise Management (WBEM).

Again, doing your evaluation homework is crucial. For each technology you plan to monitor, determine which kind of response or alert the solution needs to perform and create a chart similar to the one that Table 1.3 shows.


Table 1.3 Determining types of alerts and responses

                Email message   SNMP trap   Technology-specific action                Custom script
Windows              X                      System service restarts; process kills
SQL Server           X                      Database Consistency Checker (DBCC)            X
Microsoft IIS        X                                                                     X
Exchange             X              X                                                      X
AD                   X

Interoperability

An important topic related to alert and response functionality in monitoring systems is interoperability. Larger organizations in particular often have more than one monitoring solution. In addition, mergers and acquisitions create scenarios in which you must deal with multiple monitoring solutions.

Also, hardware vendors often incorporate very specialized monitoring technologies into their products. As a rule, such hardware monitoring technology provides a much deeper and more detailed level of monitoring and data reporting than any general monitoring solution will provide out of the box. Such technologies offer excellent benefits.

For example, if your network is standardized on Compaq servers, you’ll certainly benefit from using Compaq’s Insight Manager technology. That technology lets you monitor data as detailed as fan speed, power-supply status, RAID status, temperature, and system errors and warnings.

The drawback of vendor-specialized monitoring systems lies in potential interoperability problems. Also, you don’t want to maintain a parallel monitoring infrastructure just for server hardware with vendor-supplied monitoring consoles.

If the monitoring solution you choose can interoperate with the vendor’s monitoring agent, however, you have a win-win situation. The lingua franca of system monitoring is SNMP. If the server hardware can produce SNMP traps and your monitoring solution supports SNMP as the trigger for alerts, you have the best of both worlds. You’ll appreciate the level of monitoring you get with vendor-supplied technology, but you preserve a unified, consolidated monitoring structure. The server hardware alerts will flow into your monitoring solution, where you can apply the same alert and response functionality you use for all the other technologies that you monitor.

SNMP also helps when you must tie into another monitoring solution. I’ve noted the advantage of having a monitoring solution that can accept SNMP messages as alert criteria. However, depending on your environment, it might be just as important for your monitoring solution to support SNMP on the other end. That is, you might want your solution to generate SNMP messages that can flow to an upstream monitoring solution.


System-Monitoring Solution Architecture

Knowing who your users are, which technologies you need to monitor, and which types of reports and alerts they need will help you understand which type of product architecture will work for your organization. Monitoring solution architecture can be monolithic, agent-based, or a hybrid of the two – agent-optional.

Monolithic System-Monitoring Solutions

Monolithic monitoring solutions have a single component that performs the data-collection, alerting, and reporting functions. Monolithic solutions are the quickest and easiest kind of solution to install. You pick a server and install the software. You then configure which servers the monitoring solution should watch. The solution will need appropriate credentials for the systems it monitors. Most Windows-based monitoring solutions let you take advantage of domain authentication, which means that you simply configure the monitoring solution’s service to run as a domain account with appropriate authority to all the systems it monitors.

If the solution monitors Windows systems in a domain that doesn’t trust the domain in which the solution resides, the solution must have functionality that lets you specify alternate credentials. You’ll need to configure the solution with an appropriate policy that specifies what information to collect, how frequently to collect it, what the alert rules are, whom to alert, and how to send alerts.
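Such a policy boils down to a handful of settings. A sketch with invented field names – no particular product’s policy format:

```python
# Illustrative monitoring policy: what to collect, how often, the alert
# rules, whom to alert, and how. Every key name here is an assumption.
policy = {
    "collect": ["cpu_percent", "free_disk_mb", "service:Spooler"],
    "interval_seconds": 60,
    "rules": [
        {"metric": "free_disk_mb", "below": 500, "severity": "warning"},
        {"metric": "service:Spooler", "equals": "stopped", "severity": "critical"},
    ],
    "notify": {
        "warning": ["ops@example.com"],
        "critical": ["ops@example.com", "pager@example.com"],
    },
    # For systems in an untrusted domain, the policy carries alternate credentials.
    "credentials": {"use_service_account": True, "alternate": None},
}

def recipients(policy, severity):
    """Whom to alert for a given severity, per the policy's notify map."""
    return policy["notify"].get(severity, [])

print(recipients(policy, "critical"))
```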

Although monolithic systems are easy to get up and running, most are designed for smaller organizations and might lack granularity in certain areas, which limits you to applying the same policies to all systems. But the bigger problem with monolithic system monitoring is the load it can put on your network. Because the monitoring solution resides on one system, it must use the network every time it polls a monitored system for data.

If you have 14 servers with six sets of data being monitored every minute on each system, your monitoring system will generate 5,040 queries on the network every hour. Such a load might be tolerable if the monitored systems and the monitoring solution itself are all on the same LAN. But what if those servers are at remote sites with limited bandwidth between them and the monitoring system?
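The arithmetic behind that figure, as a quick sanity check:

```python
def queries_per_hour(servers, datasets_per_server, interval_seconds):
    """Network queries generated per hour by a monolithic monitor polling remotely."""
    polls_per_hour = 3600 / interval_seconds
    return int(servers * datasets_per_server * polls_per_hour)

print(queries_per_hour(14, 6, 60))  # the 5,040 queries/hour from the example above
```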

Deploying monitoring systems across a WAN presents two problems. First, the scenario above could overwhelm some WAN connections or consume an unacceptable portion of the bandwidth. Second, latency can be a problem. If you’re required to monitor certain sets of data frequently, the latency of the WAN connection might begin to cause false positives.

For example, suppose you configure your monitoring server to check a crucial counter on your SQL Server machine at a remote site every 10 seconds. At certain times of the day, both the WAN connection and SQL Server become loaded. Suppose that SQL Server usually takes 2 seconds to process the request for the counter data but during peak activity periods can take up to 5 seconds. As SQL Server and the WAN both become loaded and the total time for each request increases, the time required to update a counter can exceed the frequency with which the monitoring server polls the counter.

When this situation occurs, the monitoring server panics and sends the operator a false positive indicating that SQL Server is down. The only choices you have to address the problem are to reduce the frequency of updates, change the data being monitored, or monitor fewer systems. Deploying a monitoring server at the remote site might be an option, but that means another monitoring system to


manage. And monolithic solutions often don’t provide a way to push out policies and updates to multiple systems because they’re designed for small-shop, single-installation scenarios.
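One general-purpose mitigation for the latency-driven false positives just described – a common technique, not one of the three choices the text lists and not specific to any product – is to count a poll as missed when its response time exceeds the polling interval and alert only after several consecutive misses:

```python
def poll_verdicts(response_times, interval, failures_before_alert=3):
    """Return an alert decision per poll: a poll counts as missed when its
    response takes longer than the polling interval, and an alert fires only
    after `failures_before_alert` consecutive misses, so a single slow
    peak-hour response doesn't page anyone."""
    misses, alerts = 0, []
    for elapsed in response_times:
        misses = misses + 1 if elapsed > interval else 0
        alerts.append(misses >= failures_before_alert)
    return alerts

# Peak-hour response times (seconds) against a 10-second poll interval:
print(poll_verdicts([2, 5, 11, 12, 4, 11, 13, 14], 10))
```

Only the final poll – the third consecutive miss – raises an alert; the isolated slow responses earlier in the sequence are absorbed.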

Monolithic architecture isn’t all bad, however. In addition to its ease of deployment, it has another significant advantage over agent-based systems. You’ll find a note about that advantage, which will become clear after I discuss some elements of agent-based monitoring solutions.

Agent-based System-Monitoring Solutions

A monitoring solution with agent-based architecture moves the data-collection process into an agent program separated from the rest of the monitoring solution – and collocates the data-collection process with each system being monitored. Each system being monitored runs a copy of the agent, which monitors the system locally. The agent can collect the data without ever leaving the system.

The agent needs to use the network only if data falls outside its thresholds and the agent must notify the alert and reporting engine. Occasionally, the agent must also send the reporting component an update of data that has been selected for long-term analysis.

Agents can monitor many sets of data with high frequency without affecting your bandwidth. Thus, the agent-based architecture is useful when you need to frequently check many sets of data on many systems and when you must monitor remote systems. The drawbacks of agent-based systems, however, are twofold.

First, the agent-based architecture is harder to deploy because you must install, configure, and maintain an agent on each system monitored. (Likewise, when the system-monitoring solution’s developer updates the product, you must update all those systems.) Second, you must install and execute “foreign” agent code on production systems to monitor them. All software has bugs, and monitoring solutions are no exception.

Note: Because a monolithic monitoring solution monitors systems over the network, no invasive code is installed or executed on the systems being monitored. Not only does a monolithic system save work associated with deployment and updates, but more importantly, it reduces the risk of interference, which I introduced earlier in this chapter.

Imagine that an agent-based monitoring solution installed on a production application server trips over a bug. It falls into an infinite loop and monopolizes the CPU. Suddenly, the monitoring solution adversely affects the system you hoped to protect. If the monitoring software is executing on a separate system and the same problem occurs, only the monitoring process is affected, not the production application.

Of course, simply querying a system remotely poses some risk, but doing so is much safer than actually running code on the production server. Even if the monitoring software is rock solid, you might encounter stiff resistance from the people responsible for the production system. Not being acquainted with the software, they might refuse to install it. And if they’re in a different part of the organization, you might not be able to resolve the concern.

Application vendors can also make it difficult for you to implement agent-based solutions. For example, enterprise resource planning (ERP) application vendors have been known to refuse to


support servers on which you’ve installed “nonstandard” software. Even if the vendor doesn’t refuse support, it’s not unusual for vendors to immediately blame problems on other products present on the server, causing you to spend valuable time and effort proving that a problem isn’t related to the monitoring solution.

Agent-Optional System-Monitoring Solutions

As you can see, the advantages and disadvantages of monolithic and agent-based solutions create tradeoffs. You can pick the architecture that fits best and live with its disadvantages. However, a third architecture can provide the best of both worlds. Agent-optional systems let you exploit the advantages of both monolithic and agent-based architectures within the same solution.

With this architecture, you have the choice of whether or not to deploy agents onto your servers and devices. If all of your servers are in one physical location, you can install the solution on one server (as the monitoring engine) and have that monitoring engine monitor all of the servers across the LAN. To determine how many monitoring engines you should have, you need to consider some of the factors that I discussed above, such as the depth and breadth of the monitoring. For example, you cannot have one monitoring engine if you plan on monitoring various metrics on 500 servers, even if all the servers are under one roof. By implementing multiple monitoring engines, you distribute the load.
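A back-of-the-envelope sketch of distributing that load – a simple round-robin split. A real sizing exercise would weight servers by the depth of monitoring on each, not just count them:

```python
def assign_engines(servers, engine_count):
    """Spread monitored servers across monitoring engines round-robin."""
    assignments = {engine: [] for engine in range(engine_count)}
    for index, server in enumerate(servers):
        assignments[index % engine_count].append(server)
    return assignments

# The 500-server example above, split across four engines:
loads = assign_engines([f"srv{i:03}" for i in range(500)], 4)
print([len(group) for group in loads.values()])  # [125, 125, 125, 125]
```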

If you support an organization that has multiple physical sites spanning the globe, you may benefit from installing agents to monitor each of the individual sites. This flexible architecture lets you solve some sticky monitoring problems on a WAN. For example, imagine you have a corporate data center with branch offices in London, New York, and Sydney. You have 48 servers to monitor at your data center and 5 to 12 servers to monitor at each branch office.

By deploying four agents (one at each site), you can monitor your entire network without burdening your WAN. One agent resides at your data center and one at each branch office. Each agent collects data from the servers on its LAN and forwards alerts and reporting data over the WAN to your monitoring engine. Agent-optional systems provide the greatest flexibility and scalability.

System-Monitoring Solutions: Management and Policy

Regardless of a monitoring solution’s components, you need to know whether it offers ease of management and applies policy consistently. After you’ve defined your most complex rules and response policies, you won’t want to redefine them on each monitoring agent on your network. Therefore, you need to consider manageability.

Given the number of components you expect to deploy, determine what’s entailed in pushing common policies out to all of your agents and alert/response engines. Find out whether the solution provides a way to automate deploying vendor updates to the various modules of the solution. Discover whether you can back up, restore, export, and import your monitoring policy and report designs.

Also, ask whether the solution provides archival functionality for reporting databases. Does it require client software for every user of the system, or are some functions Web-based, saving you the work and complication of deploying, supporting, and updating client components?


Security

And when it comes to system monitoring, don’t neglect security. Your monitoring process collects a great amount of information that would be valuable to an unscrupulous employee or outside attacker. For example, the existence and IP addresses of various servers and applications constitute valuable reconnaissance information.

Agents and other monitoring processes usually run with elevated privileges so that they can collect the desired information and perform defined responses. If attackers can compromise the monitoring process, they can exploit the authority of the account under which the process runs.

Ensure that the monitoring solution provides authentication and encryption for traffic between various monitoring components (e.g., between agents and alert/reporting engines). Make sure that the solution implements roles so that you can limit which actions different solution users can perform. For example, you don’t want operators to be able to arbitrarily redefine or suppress annoying alerts.

When it comes to reporting, you certainly need the flexibility to define who gets to run which reports – but your needs might go beyond that. You might need to limit report access to appropriate individuals by technology, geographical site, or business area. To provide such access control, you’ll need to make sure that the monitoring solution lets you leverage existing accounts, such as those in AD. Otherwise, you’ll have to create new accounts for each user within the solution.

Your System-Monitoring Solution Homework

If you want to implement excellent system monitoring, do your homework. Get a comprehensive picture of who will use your monitoring system. That task often ends up encompassing more people than you originally anticipated because raising the subject of one monitoring or reporting need is often the catalyst for a broader discussion about system monitoring.

Determine which technologies need system monitoring and who needs the information gathered. Discover which types of alerts and reports they need for each technology. After you understand the needs, research which architecture best supports the structure of your network and organization.

For areas in which a candidate monitoring solution’s native functionality comes up short, consider whether you can take advantage of the solution’s customization and extensibility features to fill the gap. Or look further for a best-of-breed component solution you can integrate. When you finally come up with a short list of contenders, hands-on evaluation is best.

During the evaluation, test the vendor’s support technicians. Find out whether they’re accessible, knowledgeable, and available. The secret to selecting and successfully implementing a system-monitoring solution lies in really understanding your requirements – both technical and organizational – and in knowing the capabilities and shortcomings of potential solutions.

Next: Monitoring Windows Servers

As I look forward to future chapters, I note that the term “proactive monitoring” is widely bandied about in marketing content for monitoring solutions. However, proactive is more than a word. An IT manager once commented that he didn’t need a monitoring solution that would tell him when his Exchange server crashed; he had users for that.

To fulfill his in-house service level agreements (SLAs) and keep his customer departments from taking their budget to outside IT service companies, he needed a solution that would warn him that


the server was getting overloaded before it crashed. The Latin root of the word “monitor” means “to warn.” Warning means letting someone know about a future risk far enough in advance to take preventive action.

But proactive monitoring requires knowledge of the indicators that signal an impending problem and the ability to monitor those indicators. You can’t monitor everything on every server; too many parameters and counters exist. In future chapters, I’ll consider several key server technologies and discuss further what it takes to do truly proactive monitoring. In Chapter 2, I’ll discuss monitoring Windows servers.



Chapter 2:

Monitoring Windows Servers

In terms of total lines of code, Windows is the largest OS in the world. The Windows OS provides a number of key end-user services right out of the box, including file sharing, and services to applications – authentication, directory services, encryption, network communications, and more. Clearly, the OS itself is the obvious place to start monitoring.

In this chapter, I explore what it takes to monitor the OS that lies at the core of your Windows network. I’ll begin with the three fundamental areas that you should consider monitoring for any server running Windows 2000 or later.

I’ll then consider other core Windows components and services that benefit from monitoring – identifying those metrics for which monitoring pays off. I’ll detail a variety of technical resources and approaches that you can use to monitor these resources, including the Windows event log and text-based logs.

After I discuss why and how you can monitor each area, I’ll review how your monitoring solution should respond to various alerts – from simple pager alerts to automated remediation scripts. Although real-time alerting and response are important, you can’t ignore the value of trend analysis. I’ll end the chapter with some comments about your monitoring strategy.

Monitoring the Fundamentals: CPU, Memory, and Disk

The most common causes of support calls and server outages are capacity-related problems with the server’s basic resources: CPU, memory, and disk. Thankfully, these resources are easy to monitor by using Windows performance objects, the system event log, and a few simple system APIs that most monitoring solutions support.

If you monitor these basic resources properly, you’ll be able to head off problems before they occur and affect users. The challenge of proactive monitoring is to identify warning signs that indicate a future problem, monitor for those warning signs, and respond before the problem develops.

Proactive monitoring requires both trend analysis over longer periods of time and real-time alerting. You need both methods because resource shortages can result from incrementally increasing demand or from sudden system events. Your window of response time and your response options often hinge upon whether the resource outage results from a gradual demand increase or a sudden anomaly.

Monitoring the CPU

For example, consider the possible causes for slowdowns and service outages because of CPU overuse. CPU use is an extremely dynamic measure – one that commonly jumps between close to 0 percent and almost 100 percent several times per second. When a system isn’t doing any real work, CPU use should hover between 0 percent and 2 percent. This situation should hold true


even when you monitor various performance counters and maintain open but idle remote Terminal Services client connections.

However, as soon as a client executes a transaction or requests a Web page, the CPU might spike anywhere up to 100 percent for a second or two. But the spike is no cause for alarm. Whenever the CPU has any work to do, the system devotes all available CPU cycles to the work.

The CPU use will stay at 100 percent until the immediate job is finished or until the CPU must wait for a resource or some other component. Most commonly, CPUs process part of a transaction, then must wait for disks to retrieve or write some information. Immediately, CPU use falls back close to zero unless another thread or process has work for the CPU to do.

You can see why frequent jumps to near full utilization aren’t uncommon on most servers. The jumps simply indicate that a server has work to do. Although scientific research servers modeling ocean currents or car-body aerodynamics are CPU-intensive, servers in typical commercial roles (e.g., file serving, database serving) are disk-intensive.

Because most systems have a significant surplus of CPU capacity (compared to even high-end storage systems), the CPU will find itself frequently waiting on disk. Exceptions include massive computational operations, such as reindexing. Usually, the higher the level of an application’s development environment, the more CPU time it requires for the same amount of actual work. For example, a well-written C++ application uses less CPU time than the same application written in Visual Basic (VB).

Nevertheless, a good rule of thumb is that it’s usually an ominous sign to see any type of commercial server remain at more than 90 percent CPU use for more than a minute. Preventing CPU-related problems requires that you both analyze trends and enable near real-time alerts.

CPU-Use Trend Analysis

First, let’s look at trend analysis for CPU use. Imagine an e-commerce server that handles purchases from Internet customers. As the company grows and gains more customers, the server’s CPU becomes increasingly busy. In this case, the CPU demand makes no sudden jump. The demand grows week by week until customers start experiencing slowdowns and become frustrated.

Some customers might complain by email or telephone, but others will simply leave the Web site and purchase elsewhere. Usually, such gradual demand changes require hardware upgrades to permanently address the problem rather than some simple system tweaks made on the fly. To avoid downtime and manage the costs, hardware upgrades require advance planning. Fortunately, gradual demand changes give you plenty of notice – as long as your monitoring solution is capturing the right data and you’re reading trend-analysis reports regularly.

Identifying Progressive Change in CPU Use

To detect a progressive change in CPU demand, you need to set up a data collection rule in your monitoring solution that captures the current percent of CPU use every minute or so. Let’s assume the server provides satisfactory response times for customers throughout the day and week. Establish a benchmark by collecting the data for a reasonable period of time (e.g., a week, a month) during which the server operates under typical circumstances.

The benchmark period will help you establish two things. First, you discover the peak hours of workload for the server on a daily, weekly, and monthly basis. Second, for those peak periods, you discover what the average CPU use is. After you have your benchmark, you can continue to monitor

Chapter 2 Monitoring Windows Server 17


the data and run a report perhaps every week or month that provides average CPU use during the server’s peak periods.

You then compare the new averages to the benchmark. Has use changed? If so, by how much? If it’s a marginal change, continue to run the report at the same interval. (Perhaps your monitoring solution can automatically run the report and deliver it to you through email or by other means. This approach helps ensure that you analyze the report regularly.)

Is the rate of change fairly steady? If so, you can predict when the server will reach an unacceptable level of CPU use and plan backwards from that date to budget and schedule an upgrade. Congratulations! You just prevented lost business and other costs related to handling urgent problems.
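
The prediction step can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring product: it assumes you’ve already extracted peak-period CPU averages from your weekly reports, and the sample history and the 90 percent ceiling are hypothetical values.

```python
def weeks_until_threshold(weekly_peak_averages, threshold_pct=90.0):
    """Estimate how many weeks remain before peak-period CPU use reaches
    `threshold_pct`, assuming the recent rate of change stays steady.
    Uses the average week-over-week delta; returns None if use is flat
    or falling (no upward trend to project)."""
    deltas = [b - a for a, b in
              zip(weekly_peak_averages, weekly_peak_averages[1:])]
    rate = sum(deltas) / len(deltas)      # average growth per week
    if rate <= 0:
        return None
    remaining = threshold_pct - weekly_peak_averages[-1]
    return max(0.0, remaining / rate)

# Benchmark average was 52%; four later weekly reports show steady growth.
history = [52.0, 55.0, 58.0, 61.0, 64.0]   # percent CPU during peak periods
print(weeks_until_threshold(history))       # roughly 8.7 weeks of headroom
```

A steady 3-percentage-point weekly climb from 64 percent leaves about 8.7 weeks before the 90 percent ceiling, which is exactly the kind of lead time you need to budget and schedule an upgrade.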

But what if, between reporting intervals, the server’s CPU use jumps significantly? Think about which changes – internal (e.g., configuration, updates) or external (e.g., marketing campaigns) – might have caused the jump. Consider increasing the frequency of the report so that you can keep a close eye on what’s happening.

If the data returns to its previous level during the next regular reporting interval, the temporary jump is probably related to an external event, such as a sale or promotion, or to internal maintenance events, such as database consistency checks, purges, or reindexing. Provided the CPU isn’t approaching critical use levels, you might choose to simply note the jump on the server’s operations log and return to business as usual. If the jump is significant enough, you might want to continue researching the cause.

CPU-Use Near Real-Time Alerts

On the other hand, consider an example of a sudden, accelerating increase in demand. In this case, the server is functioning as usual at first, but a software update or other system change leads to a rogue process that begins to consume as many CPU cycles as it can get.

Trend analysis won’t help you here. Your best chance to respond to such a problem requires monitoring that comes as close to real time as you can achieve. You need your monitoring solution to analyze consecutive snapshots of CPU use in relation to each other. For example, perhaps you’ve configured a monitoring rule to check CPU use every 30 seconds. You might configure an alert rule that fires if CPU use exceeds 90 percent for three consecutive checks. That means your server apparently has been pegged for somewhere between 60 and 90 seconds. At that point, the alert rule would send a message to your operations staff members, and they would start working the problem. This situation gives you a readymade opportunity to consider the various maturity levels your response strategy can achieve. (For more information, see Sidebar 1, “Response Maturity.”)
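
A consecutive-check rule like this one – fire when CPU use exceeds 90 percent on three checks in a row – can be sketched as follows. This is an illustrative stand-in for a real monitoring solution’s rule engine; the function name and the sample readings are hypothetical.

```python
from collections import deque

def make_cpu_alert_rule(threshold_pct=90.0, consecutive_checks=3):
    """Return a callable that evaluates successive CPU samples and reports
    True once `consecutive_checks` samples in a row exceed `threshold_pct`
    (the 30-second-interval rule described above)."""
    recent = deque(maxlen=consecutive_checks)

    def check(cpu_pct):
        recent.append(cpu_pct)
        return (len(recent) == consecutive_checks
                and all(sample > threshold_pct for sample in recent))

    return check

rule = make_cpu_alert_rule()
samples = [45.0, 97.0, 99.0, 12.0, 95.0, 96.0, 98.0]  # one sample per 30s
alerts = [rule(s) for s in samples]
# Only the final sample fires the rule: three consecutive checks above 90%.
```

Note how the single dip to 12 percent resets the streak – brief, typical bursts of use never trigger the alert, which is the point of requiring consecutive checks.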

One other note before I leave the subject of CPU monitoring. Sometimes the CPU-use situation is reversed: You might want to be alerted when your CPU falls and remains below a certain level. For example, I know an operator who’s responsible for a server that processes massive amounts of match data for hours at a time on a regular schedule.

If the server isn’t consistently using at least 50 percent of its CPU capacity, he knows he needs to fix a problem so that the processing completes within the scheduled window for batch processing. The most dependable way to identify a problem with the batch system is to check for an idle CPU. Doing so also catches hung processes. So think outside the box; you might identify imaginative ways to use performance metrics.
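
The inverted check he relies on can be sketched like this. The sampling function is a hypothetical stand-in for reading the processor performance counter, and the check count and 50 percent floor are illustrative.

```python
import time

def batch_window_healthy(sample_cpu_pct, min_busy_pct=50.0, checks=4,
                         interval_s=0.0):
    """During a scheduled batch window, the CPU should stay busy; a CPU
    that is idle across every check signals a stalled or hung batch
    process. `sample_cpu_pct` is a caller-supplied sampling function."""
    for _ in range(checks):
        if sample_cpu_pct() >= min_busy_pct:
            return True          # the batch job is doing real work
        time.sleep(interval_s)
    return False                 # idle on every check: raise an alert

# Simulated samples from a stalled batch run – the CPU never gets busy:
stalled = iter([3.0, 5.0, 2.0, 4.0])
print(batch_window_healthy(lambda: next(stalled)))   # prints False
```

The same consecutive-sampling machinery used for overload alerts works here with the comparison flipped – an example of the “think outside the box” use of performance metrics.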

18 Taking Control: Monitoring the Windows Platform Proactively


Sidebar 1: Response Maturity

Knowing about a problem before it happens or at least as soon as possible is the initial goal of monitoring. However, knowing about the problem isn’t valuable unless you can do something about it. What you can do depends upon your operators’ and your solution’s level of response maturity. You’ll find that a good monitoring solution not only tells you about a problem but also helps you fix it.

Consider, for example, how an operator might respond to an alert that a server’s CPU has pegged out above 90 percent. First, the operator must discover whether the CPU is actually staying above 90 percent or whether the monitor checks just happen to fall during brief but typical bursts of use. Assuming the CPU is indeed pegged, the operator must immediately start exploring two matters. The operator needs to discover which process is consuming all those cycles and whether end-user response time is adversely affected.

The operator might find that the server remains pegged above 90 percent for an extended period without hurting response time – if a low-priority thread or process is using up the CPU. With Windows pre-emptive multitasking, lower priority threads are free to monopolize the CPU until a higher priority process has something to do.

Perhaps an antivirus product is performing a server-wide scan – thus sending CPU use above 90 percent. But suppose the antivirus scan is running at a lower than usual priority. The server might continue to service user requests at an acceptable rate because the Server service is running at its usual priority – thus pre-empting the antivirus scan whenever the Server service needs the CPU.

The operator might, on the other hand, discover that response time has already reached or is approaching an unacceptable level. At this point, the process behind the problem becomes the focus. Is the greedy process directly involved in servicing end-user requests, or is it something more ancillary, such as an antivirus scan? If the rogue process isn’t critical to the server’s current mission, the operator might consider terminating it or lowering its priority.

Over time, operations teams identify certain processes that are prone to going rogue (e.g., falling into an infinite loop), and they learn how to fix the problem. Whether the problem is familiar (and has a known solution) or unprecedented, you should always have a prescribed course of action.

IT departments usually mature from depending on contacting “the guru” when a problem occurs to developing a prescriptive response database that current operators can access whenever problems occur. Such a database helps ensure a consistent course of response based on experience.

As response moves beyond a basic level, the monitoring solution automatically supplies suggested responses or links to relevant content in the response database. In addition, some monitoring solutions let operators attach notes to the alert so that they can go back later and research whether the problem was a false alarm. They can review how the problem was handled – including what worked and what didn’t work.

At a further level of response maturity, the monitoring system executes prescribed actions in response to an alert. Some problems are unpredictable because they produce no early warning signs. Therefore, you can’t respond proactively and fix the situation before it affects users or business processes. Automated responses not only save operators from wasting time handling routine problems but also greatly speed up response time.

Especially in the case of problems that produce no warning signs, automated responses are critical to limiting damage and cost. Monitoring solutions offer a variety of automated response options. A minimum response method that any monitoring solution should support is the ability to run a specified system command when a given alert is fired.

In addition to sending messages, some monitoring solutions can respond by running a SQL command, creating an incident in a Help desk application, firing an SNMP trap, restarting a service, or rebooting the server.


Monitoring Memory

RAM is a precious resource for system stability and performance. Because accessing a page of memory from the swap file on disk takes orders of magnitude more time than accessing a page in RAM, correcting a RAM shortage can often have a greater impact on performance than increasing CPU capacity. Also, even though Windows’ virtual memory management should prevent any stability problems associated with RAM shortages, experience has proven again and again that sometimes strange system phenomena just go away after a system has enough memory. Again, you’ll want to use both trend analysis and alerts to manage RAM and avoid RAM-associated problems.

Memory Trend Analysis

Exactly which data should you monitor? If you look at the memory performance object in the Microsoft Management Console (MMC) Performance snap-in, you’ll see the many counters at your disposal. A good way for you to detect a RAM shortage is by monitoring for two thresholds: pages per second and committed bytes. Pages per second tells you how many times per second applications or system processes try to access a page in memory that turns out to have been previously paged out to the swap file.

When such a page isn’t in memory, Windows must pause the process and transfer the page from disk back to a physical page in RAM – possibly having to first make room by writing some other page to disk. According to Microsoft, “% Committed Bytes In Use is the ratio of Memory\Committed Bytes to the Memory\Commit Limit. Committed memory is the physical memory in use for which space has been reserved in the paging file should it need to be written to disk.”

The rule of thumb is that if the server exceeds a certain number of pages per second and committed bytes in use exceeds a certain percent, you have a memory problem. The threshold for pages per second ranges between 95 and 220, depending on your server, its role, and whom you ask.

The critical threshold for percent of committed bytes in use is between 80 percent and 90 percent. Setting up an alert rule for percent of committed bytes requires that your monitoring solution support multiple criteria (in this case, pages per second and percent of committed bytes) and Boolean logic so that you can configure alerts that fire only if multiple data levels meet specified criteria. You can find many other performance counters for memory, but keep in mind that monitoring memory isn’t as simple as it might seem because of the way virtual memory works. (See the Microsoft Windows 2000 Server Resource Kit for a more detailed examination of Windows performance monitoring.)
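
The compound rule – both counters must cross their thresholds before an alert fires – might look like this in outline. The specific threshold values are illustrative picks from the ranges given above, not recommendations, and a real monitoring solution would read the counters itself.

```python
def memory_alert(pages_per_sec, committed_pct,
                 pages_threshold=150.0, committed_threshold=85.0):
    """Fire only when BOTH criteria are met, per the rule of thumb above:
    pages/sec guidance runs roughly 95-220 and % Committed Bytes In Use
    80-90 percent, depending on the server and its role."""
    return (pages_per_sec > pages_threshold
            and committed_pct > committed_threshold)

# Heavy paging alone isn't enough; committed bytes must also be high.
assert memory_alert(300.0, 70.0) is False   # paging hard, commit charge low
assert memory_alert(300.0, 88.0) is True    # both thresholds exceeded
```

The Boolean AND is what separates a genuine RAM shortage from a momentary burst of paging, which is why the text stresses that your monitoring solution must support multiple criteria per alert rule.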

Monitoring Disk

The two most common disk-related causes of server downtime are running out of disk space and having disk hardware fail. Fortunately, it’s relatively easy to monitor for both situations and even provide advance alerts in most cases. Trend analysis is valuable only for providing strategic warnings about future disk space shortages – not for predicting disk hardware problems.

Disk-Space Trend Analysis

For trend analysis and alerting relative to disk space, Windows offers two performance objects: PhysicalDisk and LogicalDisk. Although PhysicalDisk can help you monitor hardware-level operations, it isn’t useful for monitoring disk space because a physical disk drive is often divided into logical


drive volumes, and multiple drives are often grouped into a single logical volume that spans the physical disks. (Given that a physical drive might be partitioned or part of a larger array volume, you can see that percent of free space has no meaning on physical drives. Therefore, PhysicalDisk has no counters related to disk space.)

You can use LogicalDisk to monitor free space two ways. To monitor free space as a percent, you can use “% Free Space.” To monitor the actual amount of free space, you can use the “Free Megabytes” counter. As a rule, the percent of free space shouldn’t drop below 15 percent, but you might need to adjust the percent up or down depending on the size of your volume and on server-specific factors that affect the optimum proportion of free space.

You should use “% Free Space” for servers that are just file servers or application servers – on which you have a generous surplus of disk space and don’t foresee a shortage under typical circumstances and operations. On the other hand, for application or database servers that perform large batch operations and/or have log files that can grow large, you should consider benchmarking the difference between typical disk space used and peak disk space used when batch and log files are at their largest. This number tells you how much temporary space you need during peak operations.

After you determine your temporary space requirement, you can set up an alert that warns you if “Free Megabytes” approaches your temporary space requirement plus a comfort factor – for example, the greater of the following measures: 10 percent of the total volume size or 300MB. Setting up disk-space-based alerts gives you a chance to respond tactically when a server runs out of space. However, don’t neglect taking the strategic step of performing trend analysis on disk-space requirements for important servers.
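
The comfort-factor arithmetic can be sketched as follows. The function names are hypothetical, and the sample volume and temporary-space figures are made up for illustration.

```python
def free_space_alert_threshold_mb(temp_requirement_mb, volume_size_mb):
    """Alert threshold = peak temporary-space requirement plus a comfort
    factor: the greater of 10 percent of the volume or 300MB, as
    suggested above."""
    comfort_mb = max(0.10 * volume_size_mb, 300.0)
    return temp_requirement_mb + comfort_mb

def should_alert(free_megabytes, temp_requirement_mb, volume_size_mb):
    """Compare the "Free Megabytes" counter value against the threshold."""
    return free_megabytes < free_space_alert_threshold_mb(
        temp_requirement_mb, volume_size_mb)

# A 100GB volume whose batch jobs need 4GB of temporary space:
threshold = free_space_alert_threshold_mb(4096, 102400)
print(threshold)   # 4096 + 10240 = 14336.0 MB
```

On a small volume the 300MB floor wins instead, so the comfort factor never shrinks to a uselessly thin margin.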

By recording free disk space for a server at least once a day, you can see whether and how fast its disk-space requirement is growing – and predict when the server will run out of space. For those servers not subject to trend analysis, make sure you set the free-space alert thresholds high enough to give you time to do something about a situation (e.g., deleting files, moving files, adding disk space) before the server runs out of space.

Physical Disk-Drive Failure

The other disk-related cause of server outage is a physical disk failure. To detect disk hardware problems, you need to monitor the System log for error and warning events with disk and ntfs as their sources. These events notify you about bad blocks and other concerns that indicate disk problems. In addition, many servers have RAID arrays that have their own methods of reporting hardware problems depending on the product.

A RAID product might be able to alert you through email, pager, or Short Message Service (SMS) messages. (For more information about SMS, go to http://www.zdnet.com.au/reviews/coolgear/mobiles/0,39023387,20248392,00.htm.) However, rather than running those proprietary hardware monitoring solutions as isolated monitoring systems, you’ll often find it helps to integrate the proprietary hardware-monitoring facilities of your RAID and other server hardware with your monitoring solution.

The proprietary monitoring software for different hardware products might report problems by generating SNMP messages, by logging problems to a text file, or by logging messages to the Windows Application or System logs. If you want to monitor RAID and other server hardware from your monitoring solution, both components must support a common problem-reporting method. Most


products, in addition to other proprietary reporting features, will log events to the System log. The advantage of integration is that you gain the detailed information that only the proprietary monitoring agents offer – but your operators get to keep everything under your monitoring solution’s umbrella.

Monitoring the Windows Event Logs

In addition to monitoring the basics of CPU, memory, and disk, you’ll find it valuable to monitor other resources and services common to almost any Windows system. Most core components of the Windows OS output their activity to the event-log service. You can stay on top of most Windows problems by monitoring for errors and warnings in the System and Application logs and for specific audit conditions in the Security log. However, unless you configure some filters, be prepared to receive some unimportant warnings and errors.

Monitoring the System Log

First, let’s examine both the important and unimportant alerts you’ll get from the System log. One quite useful source of events in the System log is the Service Control Manager (SCM), which is responsible for starting and stopping services and handling dependencies between services.

If a service or driver fails to start for whatever reason, you’ll receive an error event with a source of Service Control Manager. Events whose source is Server indicate problems with the Server service that provides file and printer sharing and access to other resources that you can view in the MMC Computer Management console.

If you use Windows Update or Microsoft Software Update Services (SUS) to keep your servers up-to-date and secure against new exploits, you might want to monitor for events from the Automatic Updates source. This event source informs you about any problems with keeping the system up-to-date. This source also warrants monitoring for informational events. The reason is that some updates require a system reboot to become active – and for the system to resume checking for new updates.

One such informational event is event ID 21 (Restart required). Event ID 21’s full description reads, “Restart Required: To complete the installation of the following updates, the computer must be restarted. Until this computer has been restarted, Windows cannot search for or download new updates.”

Also, depending on how you’ve configured the Automatic Updates client, Windows might wait for an operator to log on to a server and approve updates before they’re applied. Event ID 17 (Installation ready) reports this situation. The full description of event ID 17 reads, “Installation Ready: The following updates are downloaded and ready for installation. To install the updates, an administrator should log on to this computer and Windows will prompt with further instructions.” I definitely recommend monitoring for these two informational event IDs from Automatic Updates.

If you use the Distributed File System (Dfs) on your servers, you should monitor for errors and warnings with a source of dfssvc or dfsdriver. Also monitor for events from the DHCPServer source, which keeps you informed about problems on Windows DHCP servers.

Another useful source to monitor is Eventlog. Eventlog alerts you about an eclectic mix of events ranging from error event ID 6008 (Unexpected system reboot) and error event ID 6000 (System or application log is full) to informational event ID 6011, which tells you that the server’s name has been changed!


Errors with a source of IPSec inform you about IP Security (IPSec) failures that could compromise network security or interrupt connectivity. Because Layer Two Tunneling Protocol (L2TP) uses IPSec, the IPSec source is relevant to VPN servers.

The LsaSrv and NETLOGON sources both provide important errors for a variety of network connectivity and security situations. You should monitor RemoteAccess source events on servers running RRAS. The RemoteAccess source provides error events that alert you to failed dial-in and VPN connections that might indicate an attack or a network problem requiring assistance.

Although the monitoring sources I’ve covered don’t comprise an exhaustive list of event sources and event IDs, I’ve tried to select some of the most useful sources and events. As you determine which sources are important for your environment, configure your monitoring solution to send you high- to medium-level alerts about any warnings and errors from those sources.

Note: For certain sources I’ve mentioned, you might need to configure a few event ID-specific rules for events that are important but are logged as informational rather than as errors or warnings.

Unspecified Sources

Also, configure the monitoring solution to send you lower priority alerts for errors and warnings from any unspecified sources. As time goes on, you’ll be able to sort out the unimportant errors or warnings – and keep them from generating alerts. I recommend this approach for several reasons. First, Microsoft offers no exhaustive, accurate documentation for the System log. Second, new events appear and old ones disappear as Microsoft releases new versions of Windows. Because the sources fluctuate in this way, you won’t be able to – and shouldn’t – eliminate alerts and warnings from unknown sources.

Moreover, different hardware products and software packages introduce their own event sources. Because these sources can provide extremely important information, they’re another reason not to completely suppress errors and warnings from unknown sources. Over time, you’ll identify certain event sources whose errors or warnings should receive a higher priority, and you can adjust your monitoring rules accordingly.
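
A priority scheme along these lines – named sources mapped to high or medium alerts, everything else defaulting to low, with a suppression list that grows as you vet noisy sources – can be sketched like this. The source-to-priority assignments are illustrative, drawn from the sources discussed above, not a recommended configuration.

```python
# Illustrative priorities for System-log errors and warnings.
HIGH = {"Service Control Manager", "IPSec", "NETLOGON", "LsaSrv"}
MEDIUM = {"dfssvc", "dfsdriver", "DHCPServer", "Eventlog", "RemoteAccess"}
SUPPRESSED = set()   # grows over time as you vet unimportant sources

def alert_priority(source, level):
    """Map a System-log event to an alert priority, or None to suppress.
    Unknown sources still alert, just at a lower priority, because new
    sources appear with new Windows versions and third-party products."""
    if level not in ("error", "warning") or source in SUPPRESSED:
        return None
    if source in HIGH:
        return "high"
    if source in MEDIUM:
        return "medium"
    return "low"

assert alert_priority("NETLOGON", "error") == "high"
assert alert_priority("SomeVendorDriver", "warning") == "low"
assert alert_priority("Eventlog", "information") is None
```

The fall-through "low" branch is the key design choice: it implements the advice above that you should never completely silence unknown sources, only rank them beneath the ones you’ve vetted.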

You might want to prioritize certain third-party event sources in advance – such as the sources that RAID controllers and other server hardware components and devices introduce. To find out which third-party sources to monitor, you’ll need to refer to the hardware documentation. Alternatively, you can open Event Viewer, select the System log, and open the filter dialog box. Peruse the event sources drop-down list. Usually, you can recognize relevant sources by their names. For example, Compaq sources always begin with cpq, as in cpqarray, or Compaq, as in Compaq Power Management.

Monitoring the Application Log

You’ll also benefit by monitoring the Windows server Application log. Although Microsoft obviously intended the System and Application logs to serve as separate repositories for system- and application-level activity, respectively, this distinction is blurred in some areas. For example, you’ll find certain components of Windows that report events to the Application log.


One of the most important examples is client-side Group Policy processing. Any problems Windows encounters in applying Group Policy will yield error or warning events from the secedit, esent, or userenv sources. Monitoring for such events is important because they indicate that your server is failing to apply policies stored centrally in Active Directory (AD), which can lead to an inconsistently configured or unsecured system.

If you use the certificate auto-enrollment feature in Windows Server 2003 or Windows XP, you can monitor for auto enrollment source events to detect any problems with this function. Application Error and Application Hang source events inform you about applications that crash or hang – and that Windows must therefore shut down. These events alert you to application problems as soon as they occur.

If you use the Windows Backup utility, be sure to monitor for errors and warnings from the ntbackup source so that you’ll be notified about problems with your backups. If you have any Windows Certification Authorities (CAs), you need to know that Certificate Services logs all CA events under the CertSrv source. The Volume Shadow Copy Service and the Winlogon subsystem also log their events to the Application log. The sources are VSS and winlogon, respectively.

As you can see, the Application log contains a lot of system events. Although you’ll want to monitor the Application log to keep abreast of the problems that applications report to it, you’d want to monitor this log even if only to detect Windows-level problems.

I recommend the same strategy for the Application log that I suggested for the System log. Create a rule that gives you a low priority alert for all errors and warnings. Then, over time, configure additional rules to suppress unimportant events and assign a higher priority to more important events.

Monitoring the Security Log

The Windows Security log provides a wealth of information about security-related activity on the system. The Security log has no informational, warning, or error events. All events are either Audit Success or Audit Failure events. Don’t fall for the overly simplistic approach that some administrators take of monitoring for failed security events only – thinking that they’ll catch anything suspicious by doing so. As you’ll see, some of the most important events are Audit Success events.

Windows has nine distinct audit policies that let you specify exactly which types of security activity you want the OS to record in the Security log. All events in the Security log have the same source – Security – but each of the nine audit policies has a corresponding category in the Security log, as Figure 2.1 shows.


Figure 2.1: Security Properties dialog box showing audit policy categories

By default, Windows (depending on the version) has little to no auditing turned on. You can remedy this situation by enabling appropriate audit policies through Group Policy, as Figure 2.2 shows.


Figure 2.2 Audit policy security settings dialog box

However, be careful. If you enable all nine audit policies, your system will be inundated with events. Although many of the events are important, handling them can slow down your system. I'll discuss the subset of the audit policies that I recommend you enable initially.

Enabling Audit account logon events keeps you informed about authentication events in connection with the accounts stored on that system – local SAM accounts on member servers and AD accounts on domain controllers (DCs).

Enabling Audit account management lets you track changes to user accounts and groups on a system as well as user account lockouts. Tracking user account changes – such as password resets and members being added to privileged groups (e.g., Administrators, important department or application groups) – is crucial to detecting inappropriate access being granted. Because such Audit account management events (e.g., those granting inappropriate access) are Audit Success events, you can see why I stress the importance of monitoring Audit Success and not just Audit Failure events.

With other audit policies as well, Audit Success events can alert you to potential problems. Enabling Audit policy change informs you about audit policy and other major security policy changes to the system. Enabling Audit system events informs you about certain high-level system events, such as a reboot or someone clearing the Security log. If you use IPSec or L2TP, you might consider enabling Audit logon events to track IPSec-related events.


But even with these few audit policies enabled, you won't want an alert every time Windows logs a security event. The Security log is an exception to the general rule of generating at least a low priority alert for all errors and warnings. For the Security log, you must identify exactly which events you want to monitor and set up corresponding alert rules. For a list of important security events, see Table 2.1.

Table 2.1 Security Log Quick Reference

Important Windows Security Events for Domain Controllers

Event ID           Category                     Explanation
675                Audit account logon events   A failed initial attempt to log on via Kerberos at a workstation with a domain account, usually due to a bad password; the failure code indicates exactly why authentication failed. See Kerberos failure codes below.
676 or Failed 672  Audit account logon events   Logged for other types of failed Kerberos authentication. See Kerberos failure codes below. NOTE: Windows Server 2003 logs a failed event 672 instead of 676.
681 or Failed 680  Audit account logon events   Event 681 on a domain controller indicates a failed logon via NTLM with a domain account; the error code indicates exactly why authentication failed. See NTLM error codes below. NOTE: Windows Server 2003 logs a failed event 680 instead of 681.
642                Audit account management     A change to the specified user account, such as a password reset or a disabled account being re-enabled. The event's description specifies the type of change.
632, 636, 660      Audit account management     All three events indicate the specified user was added to the specified group. The Global, Local, and Universal group scopes correspond to the three event IDs.
624                Audit account management     New user account was created.
644                Audit account management     Specified user account was locked out after repeated logon failures.
517                Audit system events          The specified user cleared the Security log.
612                Audit policy change          One or more of the system's nine audit policy categories was changed. Unfortunately, this event doesn't indicate the user responsible, because audit policy is controlled by Group Policy.

Kerberos Failure Codes

Error Code  Cause
6           The username doesn't exist.
12          Workstation restriction; logon time restriction.
18          Account disabled, expired, or locked out.
23          The user's password has expired.
24          Pre-authentication failed; usually means a bad password.
32          Ticket expired. This is a normal event that computer accounts log frequently.
37          The workstation's clock is too far out of synchronization with the DC's clock.

For other Kerberos codes, see http://www.ietf.org/rfc/rfc1510.txt
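Unlike the System and Application logs, the Security log calls for an explicit watch list: alert only on the event IDs you've chosen. A minimal sketch of such a list, built from the event IDs in Table 2.1 (the descriptions are abbreviated, and the rule format itself is invented):

```python
# Explicit watch list for the Security log, built from Table 2.1.
# Alert only on listed event IDs - the opposite of the System/Application
# strategy of alerting on all errors and warnings.
WATCH = {
    624: "new user account created",
    632: "member added to global group",
    636: "member added to local group",
    660: "member added to universal group",
    642: "user account changed (e.g., password reset)",
    644: "user account locked out",
    675: "failed Kerberos pre-authentication",
    676: "failed Kerberos authentication",
    681: "failed NTLM logon",
    517: "security log cleared",
    612: "audit policy changed",
}

def should_alert(event_id):
    return event_id in WATCH

print(should_alert(517), should_alert(538))   # True False
```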


Table 2.1 Security Log Quick Reference continued

NTLM Error Codes

Error Code (Decimal)  Error Code (Hexadecimal)  Explanation
3221225572            C0000064                  User name does not exist.
3221225578            C000006A                  User name is correct but the password is wrong.
3221226036            C0000234                  User is currently locked out.
3221225586            C0000072                  Account is currently disabled.
3221225583            C000006F                  User tried to log on outside his day-of-week or time-of-day restrictions.
3221225584            C0000070                  Workstation restriction.
3221225875            C0000193                  Account expiration.
3221225585            C0000071                  Expired password.
3221226020            C0000224                  User is required to change password at next logon.

Reprinted with permission of Monterey Technology Group, Inc.
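Incidentally, the decimal and hexadecimal columns are one set of 32-bit NTSTATUS values in two notations, and you can convert between them with simple arithmetic:

```python
# The table's decimal and hexadecimal columns are the same 32-bit NTSTATUS
# values in two notations - useful when a tool reports only one form.
assert 0xC0000064 == 3221225572   # user name does not exist
assert 0xC000006A == 3221225578   # wrong password

def to_ntstatus_hex(decimal_code):
    """Convert a decimal code from a log back to the familiar hex form."""
    return f"C{decimal_code - 0xC0000000:07X}"

print(to_ntstatus_hex(3221226036))   # C0000234 (account locked out)
```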

Although setting up alerts based on the Security log is obviously important, you might find it worthwhile to record certain security events for longer term trend analysis if your organization has a formalized support-ticket process for tracking security-related requests. For example, by collecting the right events from your DCs' Security logs, you could compare events – for such matters as passwords that the Help desk resets and group-membership changes – to the audit trail in your Help desk application. This approach would let you verify that security maintenance activities are linked to approved support tickets.

On DCs, you'll find two or three additional event logs. All DCs have Directory Service (DS) and File Replication Service (FRS) logs. The Directory Service log tracks all events relevant to AD, such as Global Catalog (GC) maintenance, trust, replication, and Knowledge Consistency Checker (KCC) activity. This log lets you know about any problems between DCs in your forest.

The FRS replicates Group Policy-related files between DCs and is important to the function of Group Policy within your domain. I recommend monitoring both the Directory Service and File Replication Service logs for warnings and errors – in much the same way that I suggested you monitor the System and Application logs.

The DNS Server log appears only on Windows servers that are also DNS servers. It captures valuable information for monitoring DNS server problems.

Monitoring Text-based Log Files

Some developers choose to design their applications so that the applications maintain their own text-based log files in addition to (or in lieu of) outputting events to the Application log. Moreover, for whatever reason, some Windows components don't use the event log at all but instead create their own text-based log files. If you want to monitor and report on activity in these logs, your monitoring solution must support parsing text-based log files with different formats.

Whether you monitor these text-based files will depend upon which applications you run and whether you use any of the Windows components that maintain text-based log files. Let's look at the Windows components that do use these files.

The DHCP service maintains logs in %windir%\System32\Dhcp. Figure 2.3 shows a sample series of entries from the DHCP log.


Figure 2.3 Microsoft DHCP service activity log

As you can see, the log's header includes a helpful explanation of the event ID codes that appear in the event records. Some of the event ID codes indicate definite problems with DHCP or dynamic DNS registration, which Windows DHCP performs on behalf of earlier Windows clients. Windows Server 2003 and Win2K Server DHCP servers have built-in logic to control how much disk space the daily DHCP server logs consume.

Microsoft DHCP Service Activity Log

Event ID  Meaning
00        The log was started.
01        The log was stopped.
02        The log was temporarily paused due to low disk space.
10        A new IP address was leased to a client.
11        A lease was renewed by a client.
12        A lease was released by a client.
13        An IP address was found to be in use on the network.
14        A lease request could not be satisfied because the scope's address pool was exhausted.
15        A lease was denied.
16        A lease was deleted.
17        A lease was expired.
20        A BOOTP address was leased to a client.
21        A dynamic BOOTP address was leased to a client.
22        A BOOTP request could not be satisfied because the scope's address pool for BOOTP was exhausted.
23        A BOOTP IP address was deleted after checking to see it was not in use.
24        IP address cleanup operation has begun.
25        IP address cleanup statistics.
30        DNS update request to the named DNS server.
31        DNS update failed.
32        DNS update successful.
50+       Codes above 50 are used for Rogue Server Detection information.

ID,Date,Time,Description,IP Address,Host Name,MAC Address
24,07/02/04,00:00:56,Database Cleanup Begin,,,,
25,07/02/04,00:00:56,0 leases expired and 0 leases deleted,,,,
25,07/02/04,00:00:56,0 leases expired and 0 leases deleted,,,,
11,07/02/04,18:09:57,Renew,10.42.42.14,14000008020506A.,0040058ED7D4,
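Because each record in the activity log is a simple comma-separated line under a fixed header, a script can pick out the troublesome entries directly. In this sketch, the choice of "problem" event IDs and the extra sample row are my own illustrations:

```python
import csv
import io

# Event IDs from the DHCP log that indicate trouble: 02 = paused for low disk
# space, 13 = address conflict, 14 = scope pool exhausted, 31 = DNS update failed.
PROBLEM_IDS = {"02", "13", "14", "31"}

sample = """ID,Date,Time,Description,IP Address,Host Name,MAC Address
24,07/02/04,00:00:56,Database Cleanup Begin,,,,
11,07/02/04,18:09:57,Renew,10.42.42.14,14000008020506A.,0040058ED7D4,
14,07/02/04,18:10:03,Pool exhausted,10.42.42.0,,,
"""

def problem_entries(text):
    reader = csv.reader(io.StringIO(text))
    next(reader)                      # skip the column-name row
    return [row for row in reader if row and row[0] in PROBLEM_IDS]

for row in problem_entries(sample):
    print(row[0], row[3])             # 14 Pool exhausted
```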


If a given day's activity exceeds the configured threshold, Windows stops logging DHCP events until either more disk space is available or the next day starts. Windows lets you configure two thresholds to control event-log size.

First, you can configure a maximum number of megabytes for all DHCP server audit logs combined. This threshold defaults to 7MB, and Windows restricts each day's log to one-seventh of the maximum space allowed for DHCP server audit logs. Thus, by default, each day's log can grow to a maximum of 1MB. Windows automatically overwrites week-old audit logs, so you retain only one week's activity.

You can configure DhcpLogFilesMaxSize to be large enough to accommodate a full week's activity for your system. If you want to keep more than a single week's activity on hand, you'll need to copy each day's audit log file to another location before Windows overwrites it the following week. How you set the DhcpLogFilesMaxSize registry value can keep you from dropping below the minimum free space on disk threshold, which I discuss next.

Second, Windows lets you configure a minimum amount of space that must be preserved on the disk on which you store your audit logs. The default minimum is 20MB. If free space on the disk drops below the minimum threshold, Windows stops DHCP logging. DHCP starts logging again when disk conditions permit. You can reconfigure the DHCP event log thresholds by using REG_DWORD values in the DHCP server's registry under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DhcpServer\Parameters registry subkey.

The DhcpLogMinSpaceOnDisk value specifies the minimum amount of space that must be preserved on the disk that contains your audit logs. Specify this value in megabytes. For example, if you want to stop DHCP logging when disk free space falls below 80MB, simply set the value to 80.

IAS and RRAS Logs

Do you use Internet Authentication Service (IAS) or RRAS? Both of these components produce text-based log files in C:\WINDOWS\system32\LogFiles.

IAS Log

To make sense of these logs, you and the monitoring solution need to understand their format. The IAS log format is explained under "Interpreting IAS-formatted log files" in Windows Server 2003 Help.

To activate IAS logging, open the MMC Internet Authentication Service snap-in and click the Remote Access Logging folder in the tree-view pane. Right-click the Local File logging method in the details pane, and select Properties. On the Settings tab, which Figure 2.4 shows, select all three check boxes to enable full logging, then click the Local File tab.


Figure 2.4 Local File Property Settings tab

On that tab, which Figure 2.5 shows, select a time period to determine how frequently IAS starts a new log.

Figure 2.5 Selecting IAS new log time period


IAS begins each filename with IN and formats the filename according to the time period you select. For example, if you select the Daily time period, filenames use the INyymmdd.log format.
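For a monitoring script, that convention makes the current log's name predictable. A small sketch (Daily period assumed):

```python
from datetime import date

def ias_daily_logname(d):
    """Daily-period IAS logs are named INyymmdd.log per the convention above."""
    return f"IN{d:%y%m%d}.log"

print(ias_daily_logname(date(2004, 7, 2)))   # IN040702.log
```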

RRAS Logs

RRAS logs information to both the Application log and the text-based log files in C:\WINDOWS\system32\LogFiles. RRAS logs errors and warnings about the service itself (e.g., network problems) to the Application log. RRAS logs authentication events for dial-in and VPN connections to the text-based log files in the same format that IAS uses.

You configure RRAS logging from the MMC Routing and Remote Access Service snap-in. To configure the level of information logged to the Application log, open the Properties dialog box of the RRAS server object and select the Logging tab, which Figure 2.6 shows.

Figure 2.6 Configuring RRAS logging

To configure logging options for the text-based log file, you select the Local File logging method in the Remote Access Logging folder and open its Properties dialog box. You'll see that RRAS offers the same logging options that IAS offers.

Finally, all Microsoft IIS-related services, including the World Wide Web Publishing Service, FTP, SMTP, and Network News Transfer Protocol (NNTP), maintain logs in C:\WINDOWS\system32\LogFiles under separate folders. The actual format of these logs depends on the log format option you configure with the MMC IIS Administration snap-in.


Monitoring Strategy

Monitoring Windows necessitates being able to collect information from varied resources. To monitor the critical resources of a Windows system, your monitoring solution needs to – at a minimum – track performance counters, Windows event logs, and text-based log files, and run user-specified system commands.

WMI

Windows Management Instrumentation (WMI) has become an increasingly popular and useful way to monitor everything imaginable on the Windows platform through easy-to-understand SQL-like queries. Therefore, native WMI support is highly valuable in a monitoring solution because it lets you conveniently leverage WMI's flexibility to implement custom monitoring rules. However, you can also use WMI from scripts, which emphasizes the importance of being able to run OS commands from within the monitoring solution.
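To give a feel for those SQL-like queries: the sketch below only builds the WQL text for "all error events in a given log." Win32_NTLogEvent and its properties are standard WMI names, but actually running the query requires a WMI client on Windows, so here the query is constructed without being executed:

```python
# Builds WQL (WMI Query Language) text for error events in a given event log.
# Win32_NTLogEvent is the standard WMI event-log class; EventType 1 = error.
def wql_for_errors(logfile):
    return (
        "SELECT EventCode, SourceName, Message "
        "FROM Win32_NTLogEvent "
        f"WHERE Logfile = '{logfile}' AND EventType = 1"
    )

print(wql_for_errors("Application"))
```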

If your monitoring solution can't natively monitor the various formats of text-based log files, there's still hope – if the solution lets you run system commands and scripts and you're willing to learn how to use some free Microsoft tools. For parsing and analyzing IAS-formatted logs, you can use the Iasparse tool. You can find Iasparse in the Windows Server 2003 Support\Tools folder on the Windows Server 2003 CD-ROM, in the Microsoft Windows Server 2003 Resource Kit, and in the Microsoft Windows 2000 Server Resource Kit.

For parsing almost any other type of log file, try LogParser. LogParser gives you the data-mining power of a SQL database, such as Microsoft Access, and you can use the tool to automatically process the megabytes of data that your network's diverse logs generate every day. You can download the most recent version of the tool, LogParser 2.1, as part of the Microsoft Internet Information Server (IIS) 6.0 Resource Kit tools, at http://www.microsoft.com/downloads/details.aspx?familyid=56fc92ee-a71a-4c73-b628-ade629c89499&displaylang=en.
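LogParser's core idea – treating flat log files as tables you can query with SQL – can be approximated with any embedded database. A rough sketch using Python's built-in sqlite3 (the table layout and sample rows are invented for illustration):

```python
import sqlite3

# LogParser-style analysis: load flat log rows into a table, then use SQL.
rows = [
    ("2004-07-02", "ntbackup", "error"),
    ("2004-07-02", "VSS", "warning"),
    ("2004-07-03", "ntbackup", "error"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (day TEXT, source TEXT, level TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Which sources log the most errors?
top = con.execute(
    "SELECT source, COUNT(*) AS n FROM events "
    "WHERE level = 'error' GROUP BY source ORDER BY n DESC"
).fetchall()
print(top)   # [('ntbackup', 2)]
```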

WMI and log utilities like the ones mentioned above are no replacement for a real monitoring solution, but they can help fill gaps in your monitoring coverage.

With technology, it's often the item you don't bother to check that causes the problem; the component that should never fail suddenly does. You can lose valuable time trying to figure out the root cause of a problem. That's why, as I've mentioned previously, I recommend designing your monitoring strategy to alert you about anything that looks suspicious – such as all warnings and errors in the event log.

At first, you'll be inundated with alerts. Soon, you'll begin to pare down the number of alerts by correcting problems and disabling unnecessary or not fully implemented features that cause unimportant alerts. You'll identify the most important alerts and increase their priority. You'll also determine that although you can't prevent some unimportant errors and warnings, you can keep them from firing alerts.

As you define run-book procedures for common alerts, you should document the procedures within the monitoring solution. Doing so provides integrated help to operators. Also, identify certain routine alerts for which you can configure your monitoring solution to automatically respond with mitigation attempts and, failing those attempts, escalate the alert to reach an operator. At that point, you've reached a high level of monitoring and response maturity.


Next: Monitoring AD

In Chapter 3, I'll discuss monitoring AD. I'll cover monitoring performance metrics and service availability, including related components, such as DNS and FRS. I'll also discuss key maintenance activities in various areas, including trust relationships, user accounts, group membership changes, organizational unit (OU) authority, and Group Policy management. I'll explore monitoring replication and security, including delegation of authority, AD object property changes, authentication, and unauthorized queries.



Chapter 3:
Monitoring AD

Active Directory (AD) is as essential a component of your Windows network as the switches that carry the network's packets. However, unlike switches, which are basically Plug and Play (PnP) components, AD is a fragile conglomeration of technologies that must function together. At the core of AD is, of course, the Directory Service (DS), which is based upon X.500 and accessed through the Lightweight Directory Access Protocol (LDAP). But the related components and functions that surround AD's core must also work well for AD to be healthy.

AD replication is crucial to keeping your forest's domain controllers (DCs) up-to-date and synchronized. Windows relies heavily on DNS to find DCs, Global Catalog (GC) servers, and other AD-related resources, so DNS must perform efficiently for AD to function well.

Group Policy, another AD component, is essential to the configuration of all computers and user profiles on the network. Group Policy resides partly in AD and partly on the DC file system. For Group Policy Object (GPO) updates to propagate, both AD replication and file replication through the File Replication Service (FRS) must function correctly. Problems with these components can cause a plethora of difficulties – including users being unable to log on (or experiencing long logon delays), roaming profile problems, increased Help desk calls, and general dissatisfaction with IT.

In addition, if any component that Group Policy requires should fail, security settings might not be applied to workstations and servers. If a DC runs out of Relative Identifiers (RIDs) and can't reach the domain RID master, administrators can't create new users and groups or join computers to the domain. (RID master is one of the Flexible Single-Master Operations – FSMO – roles that one DC in each domain must fill.)

And because Exchange 2000 (and later) depends on AD as its directory, email systems and workflows can suffer interruptions if AD has a problem. Because an instance of AD resides on each DC, the instance on a given DC stops functioning when that DC runs out of disk space. The resulting replication problems can cause not just logon failures but also unexplained account lockouts on that DC.

Perhaps one of the worst results of undetected AD problems, however, is inconsistent data in the directory. Because cleaning up such problems takes huge amounts of time, including extensive diagnostic time, administrators sometimes just give up and restore AD from backups. Such restorations are not only exacting and time-consuming, but also potentially counterproductive (e.g., an accidental authoritative restore could erase backups).

AD's Resilience: Pro and Con

A well-designed AD deployment has few if any single points of failure because of redundant components, such as multiple DCs. In fact, AD problems often go unnoticed because AD often finds a way to compensate for them. This "downside" of AD's fault tolerance makes monitoring all the more essential to keeping AD healthy.


Monitoring AD: Single Points of Failure

Let me illustrate how AD's resilience could get you into trouble. Suppose you have a problem with one of two DCs that are GC servers. One GC server runs out of disk space and goes down, but no one notices because the other GC server handles the load without complaint. At this point, your network is limping on one GC without its usual fault tolerance. Some time later, a second problem (e.g., a bad port on the switch) develops and takes the remaining GC offline. Now, all users notice the problem because they can't log on.

Proper monitoring can prevent such a scenario because it would reveal the first GC's problem right away. You could restore the first GC to service before the second GC developed a problem. AD is resilient, but you never want to run on a single point of failure longer than necessary.

Monitoring AD: Change Control

Security is also an important reason to monitor AD. You should know if high-privilege groups suddenly gain new members or if an unusual number of logon failures or account lockouts occurs. Changes to domain accounts or to audit policies and trust relationships are infrequent but sensitive events that you should be aware of. In addition, given the complexity and wide-reaching effects of Group Policy, change control is critical to avoiding misconfigurations that affect the availability or security of hundreds or thousands of computers. Unless you can track down who changed what when, you'll be hard put to enforce any accountability.

You need to be able to track changes to other AD objects as well, such as users, groups, computers, and organizational units (OUs). Although you don't want an alert each time frequent maintenance operations cause such changes, you must be able to audit such administrator activity for compliance with the latest legislation and regulatory requirements. Also, you need AD information for trend analysis – to plan properly for capacity requirements and growth. All things considered, it's easy to see why you need to monitor AD and its related components.

Monitoring AD Through Event Logs

The event logs on your DCs are key sources of AD monitoring information. Although monitoring AD entails more than watching the event logs, they're definitely the place to start. Each DC has a DS log and an FRS log – and one or more of your DCs will also have a DNS log.

DS Log

The instance of AD on a DC logs all of its operational activity, errors, and warnings to the DS log. One of the first steps in monitoring AD is to set up your monitoring solution to watch the DS log on each DC for errors – and to alert you when appropriate.

Windows reports replication errors, trust failures, corruption, and other problems with the DS to the DS log. Microsoft has divided the many events you might see in the DS log into 24 different event sources, which Table 3.1 lists.


Table 3.1 DS event log sources

Backup
Directory Access
DS RPC Client
DS RPC Server
DS Schema
ExDS Interface
Field Engineering
Garbage Collection
GC
Group Caching
Initialization/Termination
Internal Configuration
Internal Processing
Intersite Messaging
Knowledge Consistency Checker (KCC)
LDAP Interface
Linked-Value Replication
Messaging API (MAPI) Interface
Name Resolution
Performance
Replication
Security
Service Control
Setup

As with any log, you soon realize that certain error or warning events are simply unimportant pests. Although you can prevent the logging of some of these messages (e.g., by disabling an unimplemented feature), AD will go on logging others. Your monitoring solution should let you selectively suppress unimportant messages to keep your operators from being swamped with alerts and starting to ignore them, thinking that AD is always crying wolf.

FRS Log

Although the AD replication function replicates X.500 objects inside AD, FRS handles file-level replication. File-level replication is useful for storing duplicate file content on multiple servers as part of Windows Dfs. Regardless of whether you use FRS for your own files, DCs use FRS to keep the file-system portion of GPOs synchronized between all the DCs.

GPOs have two physical components – a Group Policy Container (GPC) and a Group Policy Template (GPT). The GPC resides in AD and contains all the property settings defined for the GPO. The GPT resides on the DC's file system under \sysvol in a folder whose name matches the policy's globally unique identifier (GUID). This folder contains files that have the actual Group Policy settings, including security settings, administrative template-based policies, script files, and any applications that Windows 2000 and later publishes or advertises (i.e., makes available to users in the Control Panel under Add/Remove Programs for optional installation) through the policy.


As you can see, in addition to AD replication, Group Policy depends on FRS replication. The FRS logs on your DCs will keep you informed about any problems with FRS. Remember that AD doesn't have a central control system. Therefore, as is true of monitoring other aspects of AD, you'll need to monitor the FRS log on each DC. Although I can't cover every possible FRS problem that might occur, I'll discuss a few for which you should definitely monitor.

Exhausted Disk Space

Exhausted disk space is a common cause of FRS problems. Accordingly, you should monitor for two events in particular: event ID 13511 (The FRS database is out of disk space) and event ID 13522 (Staging area is full). An outbound partner that hasn't connected for a while can cause these event IDs to be logged. You can delete the connection and stop and restart FRS to force deletion of the staging files.

These two events indicate that the local DC is building up changes to Group Policy files for one or more of its replication partners. Those replication partners are falling increasingly out of sync as their copies of Group Policy become more and more out of date.

RPC Problems Between DCs

RPC problems between DCs can also cause FRS replication problems. You can detect such problems by monitoring for event ID 13508 (FRS was unable to create an RPC connection to a replication partner). The problem that results in this event ID demonstrates the advantage of a monitoring solution that can correlate events and let you configure thresholds as well, rather than being limited to single-event-based alert rules.

You'll often see event ID 13508 – indicating an RPC problem – followed in relatively short order by event ID 13509 (FRS was able to create an RPC connection to a replication partner), indicating that the RPC problem was resolved. Event ID 13508 is often logged when the local DC tries to open an RPC connection to the replication partner because the partner DC hasn't yet been updated with a change to the replication topology. The partner therefore rejects the connection.

The local DC logs event ID 13508 but keeps trying to connect. Eventually, the topology change replicates to the partner through AD replication, and the partner accepts the RPC connection from the local DC. After the RPC connection succeeds, the local DC logs event ID 13509. If the connection continues to fail, the DC eventually logs another event ID 13508.

You can configure your monitoring solution to alert you only if it detects an event ID 13508 without an event ID 13509 (within a specified period of time). The wait period depends on the maximum time it should take for AD replication changes to reach the partner, which in turn depends on your organization's site topology and replication schedule.

Unless you've blocked out replication periods or have a complicated, multi-hop AD replication topology, a good rule of thumb would be about 75 minutes. Your rule might be as follows: "If you find an event ID 13508 with no subsequent event ID 13509 within 75 minutes, alert me." If the FRS or DC itself is restarted before logging event ID 13509, it won't log an event ID 13509 when it restarts, even if the FRS succeeds in connecting to the replication partner. Therefore, the rule must include ending the wait for event ID 13509 without sending an alert if a system restart (i.e., security event ID 512 – System restarted) occurs.
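That 13508/13509 rule can be sketched as a small correlation function. This illustrates the logic only, not any product's rule syntax; the event stream is modeled as a list of (timestamp, event ID) pairs:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=75)   # the rule-of-thumb replication window

def needs_alert(events, now):
    """events: time-ordered (timestamp, event_id) pairs from one DC's FRS log.
    13508 = RPC connection failed, 13509 = connection succeeded,
    512 = system restarted (ends the wait without alerting)."""
    pending = None                   # timestamp of an unresolved 13508
    for ts, eid in events:
        if eid == 13508:
            if pending is None:
                pending = ts         # start the wait at the first 13508
        elif eid in (13509, 512):
            pending = None           # resolved, or a restart cancels the wait
    return pending is not None and now - pending > WINDOW

t0 = datetime(2004, 7, 2, 9, 0)
resolved = [(t0, 13508), (t0 + timedelta(minutes=10), 13509)]
stuck = [(t0, 13508)]
print(needs_alert(resolved, t0 + timedelta(hours=2)))   # False
print(needs_alert(stuck, t0 + timedelta(hours=2)))      # True
```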

Monitoring for event ID 13548 (System clocks are too far apart on replica members) lets you know if two FRS partners fall more than 30 minutes out of synchronization with each other. Two replication partners can end up with duplicate connections defined in the FRS replication topology. Event ID 13557 (Duplicate connections are configured) lets you know when this duplication occurs and specifies which duplicate connection you should delete.

The FRS maintains a log called the NTFS update sequence number (USN) journal, which tracks changes to files in the replica set and queues those changes for replication. If replication processing falls behind, changes start to build up – and the NTFS USN journal can reach its maximum size as defined in the registry. When the journal reaches maximum size, the journal “wraps,” which discards the unreplicated changes.

Monitoring for event ID 13568 (Journal wrap error) lets you know when a journal wrap occurs. A journal wrap causes file replication to stop until the administrator stops the FRS and performs a nonauthoritative restore on the DC so that it can resynchronize with its replication partners. Until you restart FRS replication, GPO changes are delayed. The delay means that computers in your domain won’t receive security settings or other settings defined in Group Policy.

Other than event ID 13508, the FRS log is pretty quiet. Therefore, you might decide to begin by monitoring all warnings and errors. You should know, however, that a common FRS warning you can usually safely ignore is event ID 13512 (The File Replication Service has detected an enabled disk write cache on the drive containing the directory c:\winnt\ntfrs\jet on the computer computer_name). The complete event ID text indicates that the FRS might not recover when power to the drive is interrupted and that critical updates might be lost.

Although this alert sounds dangerous at first, your hardware determines whether you suffer the consequences described. If you have a UPS and/or your RAID controller has battery-backed-up memory, you probably won’t suffer any problems. In fact, according to Microsoft, “If you enable the Write Caching feature and if there are provisions for a power loss (such as an uninterruptible power supply [UPS] attached to the computer, hard disks, or drive controller), the FRS database may not be damaged because FRS does not turn off Write Caching.”

Nevertheless, FRS logs this event each time the system starts up. Because you can’t prevent Windows from logging this event, make sure your monitoring solution suppresses the resulting alerts.

DNS Log

The DNS log isn’t too chatty with errors and warnings, so you can probably safely monitor it for errors and warnings as well. This monitoring pays off because users immediately feel DNS problems. For example, event ID 6524 (Invalid response from master DNS server at <IPAddress> during attempted zone transfer of zone test.microsoft.com) indicates a zone transfer failure. That is, the DNS server failed to update subordinate servers with its zone file. This failure can lead to users and computers being unable to find AD and other Windows resources on the network.

Other DC Logs

In addition to the logs more directly related to AD, don’t ignore the System and Security logs on your DCs. A system-level problem with Windows on a DC can mean trouble for your entire domain or forest, so you should follow the recommendations I offered in Chapter 2 for monitoring your DCs. In particular, you should monitor CPU use and disk space.

As you saw in the case of both AD DC and FRS events, disk-space exhaustion can cause serious problems with AD. Rather than waiting for the application to tell you that it’s out of disk space, use Windows performance counters to monitor the amount of free disk space. Similarly, if your CPU use remains above 90 percent for an extended period of time, don’t wait to begin troubleshooting. You might have one of several different problems, such as a runaway process or too heavy a workload for the DC.

Chapter 3 Monitoring AD

Brought to you by Argent and Windows IT Pro eBooks

Monitoring AD with Command-Line Tools

Monitoring AD, however, is more than just monitoring your DC event logs. Several tools on the Windows CD-ROM under support\tools and in the resource kit are useful in diagnosing and monitoring AD replication. These command-line tools are easy to build into your monitoring strategy – as long as your monitoring solution is capable of periodically running commands and testing their output.

Dcdiag

To actively verify that replication is functioning correctly on a DC, you can use Dcdiag, which you’ll find on your Windows server CD-ROM. You’ll discover that Dcdiag /test:replications will perform connectivity and replication tests for the local DC and its connections to other DCs. As long as your monitoring solution finds the phrases “passed test Connectivity” and “passed test Replications,” it need not alert you.

Replication health also depends on proper permissions for replication connections and DCs. To check replication permissions for the local DC, configure the monitoring solution to periodically execute Dcdiag /test:netlogons and look for “passed test NetLogons” in the command output.

Although these two tests are among the most important that Dcdiag offers, the tool can perform many more (e.g., FSMO checks, topology verification tests). If you run Dcdiag without parameters, it performs all of its tests. If you include the /v option, Dcdiag provides verbose information about each test as it executes.

If your monitoring solution can capture the output and let you read it from the alert console, Dcdiag’s /v option will prove valuable for diagnosing the problem right from the console. Otherwise, you’ll need to manually run Dcdiag /v on the DC in question.

To avoid redundant alerts, you should probably have Dcdiag skip some checks. By default, Dcdiag checks the FRS log as well as the System and DS logs for errors and warnings. Because you’re probably monitoring event logs directly for such events, you don’t need to have Dcdiag duplicate those checks.

Tip
To run all of Dcdiag’s DC checks (except for event log-related checks) and get verbose reporting, run Dcdiag /skip:kccevent /skip:frsevent /skip:systemevent /v. Configure your monitoring solution to alert you if it finds the word “fail” in the command output.
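If your monitoring solution can’t run command rules natively, a small scheduled script can approximate one. The sketch below is illustrative only: it assumes Dcdiag is on the PATH of the DC, and the exact output phrasing can vary by Windows version, so treat the string matching as a starting point, not a guarantee:

```python
import subprocess

def dcdiag_failures(output):
    """Return the lines of Dcdiag output that mention a failure.
    Dcdiag reports 'passed test <Name>' or 'failed test <Name>' per test,
    so per the tip above we simply look for 'fail'."""
    return [line.strip() for line in output.splitlines()
            if "fail" in line.lower()]

def run_dcdiag():
    # Runs the verbose check suggested in the tip above (Windows DC only).
    result = subprocess.run(
        ["dcdiag", "/skip:kccevent", "/skip:frsevent",
         "/skip:systemevent", "/v"],
        capture_output=True, text=True)
    return dcdiag_failures(result.stdout)
```

The same pattern – run the command, scan the output for a pass/fail phrase – applies to the Netdiag and Gpotool checks discussed in this section.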

Netdiag

Netdiag is another support tool that you’ll find useful. To verify that DNS is functioning and that the local DC has all appropriate records registered in the DNS server, you can have your monitoring solution run Netdiag /test:DNS and look for “DNS test . . . . . . . . . . . . . : Passed” in Netdiag’s output.

Taking Control: Monitoring the Windows Platform Proactively

Gpotool

Because GPOs reside partly in AD and partly on the file system of DCs and because GPOs replicate through separate replication mechanisms, AD GPOs can get out of sync with their corresponding files on the DCs’ Sysvol. Gpotool is a great resource kit tool for checking Group Policy consistency.

If you run Gpotool without specifying any parameters, it verifies every GPO on every DC, checking properties, version numbers (in both the file-system and AD objects), and replication. Configure your monitoring solution to alert you if it doesn’t find “Policies OK” at the end of the command output. If your monitoring solution lets you view the output of command rules, include the /verbose switch on the command line to get full reporting.

AD Performance Monitoring

AD performance monitoring is valuable for two reasons. First, being alerted when a certain AD performance measure exceeds your typical baseline can give you early warning about a problem in your AD infrastructure. You gain time to respond before users or critical business processes are affected. Second, collecting AD performance metrics helps you analyze trends and plan when and where your system needs additional resources.
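A baseline alert of this kind can be as simple as flagging samples that sit several standard deviations above the collected history. A minimal sketch – the function name and the three-sigma multiplier are arbitrary illustrative choices, and real monitoring products offer more sophisticated baselining:

```python
from statistics import mean, stdev

def exceeds_baseline(history, sample, k=3.0):
    """Flag a counter sample that is more than k standard deviations
    above the mean of the previously collected baseline samples."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline yet
    return sample > mean(history) + k * stdev(history)
```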

AD has two performance objects whose counters you can monitor. The database object provides statistics about the AD database’s table, log, and cache operations. The NTDS object provides statistics about every other aspect of the DS running on the local DC. I’ll discuss some key NTDS object counters first.

NTDS Object

You need to monitor AD replication traffic over WAN links because that measurement tells you how efficient your WAN is and how much of your WAN’s bandwidth AD consumes throughout the day. You can monitor the amount of incoming replication traffic from other sites to this DC with the DRA Inbound Bytes Compressed (Between Sites, After Compression)/sec and DRA Inbound Bytes Compressed (Between Sites, Before Compression)/sec counters.

Because AD compresses replication data destined for another site, DRA Inbound Bytes Compressed (Between Sites, After Compression)/sec is a more useful counter. You’ll need to configure your monitoring solution to collect this counter on a frequency appropriate to the replication schedule configured on your site links, which you manage through the Microsoft Management Console (MMC) Active Directory Sites and Services snap-in.

Monitoring DS on the Local DC

A DC’s workload comes from two sources. First, each DC must perform incoming and outgoing replication tasks regularly. Because of replication’s batched nature, replication tends to create bursts or spikes of activity rather than a constant load on the DC. You can control replication by the intrasite and intersite replication schedules in the MMC Active Directory Sites and Services snap-in.

The more frequently you replicate, the smaller and more predictable the workload associated with each replication is (although your overall efficiency decreases somewhat because of the overhead associated with each replication). On the other hand, the less frequently you replicate, the larger and less predictable the size of each replication becomes. Latency increases and consistency between DCs decreases because replication occurs less often.


Second, the period during which you allow replication might be more important than replication frequency. Your DCs handle many tasks from client computers on the network. Authentication operations, LDAP queries, and, to a lesser extent, LDAP updates (e.g., user account, group maintenance) constitute a more constant flow of work than the batched work of replication.

You can block out times during which replication is disabled to let the WAN devote its bandwidth to other tasks or to let the DC handle peak logon times. Typically, client-related work rises during peak logon periods and drops significantly during nonwork hours.

Monitoring Client Work and Replication

During the day, Group Policy refreshes, email messages, account maintenance, and the odd logon and logoff cause a modest stream of LDAP queries. If you’re running batch updates to AD objects (e.g., LDAP Data Interchange Format – LDIF – files), you’ll obviously see a spike in file writes. Large update jobs can adversely affect overall AD performance on a DC because AD’s optimized design is heavily weighted to service query requests instead of update requests.

To monitor client-related work on your DCs, you can use LDAP Client Sessions, which tells you the number of currently connected LDAP clients. LDAP Searches/sec reports how many search operations the DC is performing per second. You can also monitor authentication statistics with Kerberos Authentications/sec and NTLM Authentications/sec. These two counters tell you how many times per second the DC authenticates Kerberos and NTLM clients.

A key counter for verifying that a DC is keeping up with replication changes is DRA Remaining Replication Updates. This counter indicates the number of changes to objects received in the current directory replication update packet but not yet applied to the local server. After the DC begins applying the replication changes in each packet, this counter should decline sharply between checks. A gradual decline indicates that the DC is functioning slowly. Based on my experience, you might want to receive an alert if this counter consistently stays above 12.

Although DRA Remaining Replication Updates tells you how well the DC is processing the current replication packet, DRA Pending Replication Synchronizations tells you how many replication synchronizations from other DCs are behind the current packet, waiting to be processed. The larger the number, the longer the backlog. My experience shows that a good rule of thumb is to set up an alert if two to three consecutive collections of this counter exceed 75.

Database Object

The database object is useful for alerting you when the database is performing poorly. Poor performance can affect users, Group Policy application, and replication. Like any database, AD relies heavily on cache to keep performance acceptable. You also need to be aware of poor database performance for trend analysis and capacity planning.

Monitoring AD Database Operations

An easy way to know whether a given DC needs more memory is to monitor Cache Page Fault Stalls/sec, which tells you how many times per second a cache page fault can’t be serviced because not enough cache memory is available. If this counter remains above zero most of the time, performance is suffering. You should consider installing more memory.

Database cache requirements are another reason you need a monitoring solution that lets you specify sophisticated alert criteria. For example, with the right monitoring capabilities, you could collect this counter every 5 minutes for reporting purposes but also specify that if it’s above 0 several times consecutively, you should receive an alert.
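The “above a threshold several times consecutively” criterion – the same shape of rule suggested earlier for DRA Pending Replication Synchronizations – can be sketched as a simple check over the most recent samples (the function name and parameters are illustrative):

```python
def consecutive_alert(samples, threshold, count):
    """Return True when the last `count` collected samples all exceed
    `threshold` – e.g. Cache Page Fault Stalls/sec above 0 several times
    in a row, or DRA Pending Replication Synchronizations above 75 for
    two to three consecutive collections."""
    if len(samples) < count:
        return False
    return all(s > threshold for s in samples[-count:])
```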

Cache % Hit provides the cache-hit ratio – the classic measurement for any database. The cache-hit ratio tells you what percentage of page requests is found in cache rather than requiring a file operation to bring requested pages in.

Usually, the cache-hit ratio target is 80 percent to 90 percent. In addition to collecting the cache-hit ratio statistic for trend analysis, you might want an alert if the ratio remains below 80 percent for three consecutive checks during hours of typical use.

You can also track AD performance by using File Operations Pending, which tells you how many tasks are piling up for the file system to complete. If this counter consistently remains high during typical use hours, you might have a bottleneck either because you lack enough cache or because your file system is performing slowly. Check your disk space and disk fragmentation level and consider whether your current disk subsystem is equal to its tasks.

Other Ways to Monitor AD

You want to monitor AD from every angle possible. Additional ways to monitor AD include monitoring the services on DCs and monitoring AD directly.

Monitoring Services on DCs

Another way to catch AD problems as soon as possible is to monitor the services on each DC. In addition to FRS, which I discussed previously, AD depends on the Net Logon service, the Kerberos Key Distribution Center (KDC) service, the Windows Time service, the Intersite Messaging service, and – on the subset of DCs that run it – the DNS Server service.

The Net Logon service and the Kerberos KDC service are essential to authenticate users and computers on the network. The Windows Time service keeps the computer clocks in the forest synchronized. This synchronization is essential because Kerberos authentication will fail if a computer falls more than 5 minutes out of sync with another computer. The Intersite Messaging service supports replication between sites. If your monitoring solution supports service monitoring, configure it to periodically check the status of each of these services on your DCs and to alert you if they aren’t currently started.
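A service-status rule reduces to a simple comparison. The sketch below is illustrative: the list holds the display names mentioned above, and the status mapping stands in for whatever query mechanism (WMI, `sc query`, or your monitoring agent) you actually use:

```python
# Display names of the AD-critical services discussed above.
AD_SERVICES = [
    "Net Logon",
    "Kerberos Key Distribution Center",
    "Windows Time",
    "Intersite Messaging",
    "File Replication Service",
]

def stopped_services(status_by_name):
    """Given a mapping of service display name -> status string, return
    the AD-critical services that aren't currently running."""
    return [name for name in AD_SERVICES
            if status_by_name.get(name) != "Running"]
```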

Direct AD Checks

An excellent way to get absolute verification that AD is functioning properly is to regularly test AD functionality. For example, you can set your monitoring solution to periodically resolve a DNS name. Doing so lets you check the response from the server.

In addition, you can regularly log on to a server with a domain account to confirm that AD is authenticating users, then delete the connection. Also, periodically execute an LDAP query against a known object in AD to verify that AD is servicing LDAP requests properly.
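For the DNS-resolution check, even a few lines suffice. This sketch uses Python’s standard socket module and is only a minimal stand-in for a real synthetic-transaction rule; in practice you would resolve a name your DCs are authoritative for and schedule the probe from your monitoring solution:

```python
import socket

def dns_probe(hostname):
    """Resolve a name and return its IPv4 address, or None on failure –
    a simple 'is DNS answering?' check a monitoring rule could schedule."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```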

Some monitoring solutions let you check current service-pack and hotfix levels, a feature that comes in handy in monitoring AD. Microsoft recommends keeping all DCs in the forest running at the same service-pack level with the same hotfixes installed to avoid the problems associated with running different software versions.


Monitoring AD Security

Monitoring security in AD involves two basic activities. You can audit administrative activity, and you can monitor for intrusion attempts.

Auditing Administrative Activity

You can monitor AD security events by watching the Security log on each DC. For example, a given security event (e.g., a new user being created) will be logged on the originating DC only – that is, the DC on which the event actually occurred. The other DCs receive the change through replication and don’t report the event to their respective Security logs.

Windows has two Security log categories – the DS category and the Account Management category – that lend themselves to monitoring maintenance activities in AD. The DS category lets you monitor any and all access to AD objects – even to their individual properties. However, DS auditing tends to be very granular and a bit cryptic. To decipher events in this category, you must be familiar with the object class and attribute names in the AD schema. I’ll cover first the Account Management category, which you’ll find simpler to monitor, then discuss the DS category.

Detecting Account Management Changes

The Account Management category is a more immediately useful category for monitoring AD. This category provides specific event IDs for each type of operation on users, groups, and computers, as Table 3.2 shows.

Table 3.2 Account Management events

                                Created  Changed  Deleted  Member Added  Member Removed
User                            624      642      630
Computer                        645      646      647
Security group (Local)          635      641      638      636           637
Security group (Global)         631      639      634      632           633
Security group (Universal)      658      659      662      660           661
Distribution group (Local)      648      649      652      650           651
Distribution group (Global)     653      654      657      655           656
Distribution group (Universal)  663      664      667      665           666

From Monterey Technology Group, Inc.’s “Security Log Secrets,” used with permission.
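To turn these event IDs into readable alerts, a rule can simply map IDs to descriptions. The sketch below covers only a handful of Account Management IDs as documented for the Windows 2000/2003 Security log, with paraphrased descriptions; extend the mapping from the tables as needed:

```python
# A few Account Management event IDs mapped to paraphrased descriptions.
ACCOUNT_MGMT_EVENTS = {
    624: "User account created",
    642: "User account changed",
    630: "User account deleted",
    645: "Computer account created",
    646: "Computer account changed",
    647: "Computer account deleted",
    636: "Member added to security-enabled local group",
    632: "Member added to security-enabled global group",
}

def describe(event_id):
    """Look up a Security-log event ID; unknown IDs fall through."""
    return ACCOUNT_MGMT_EVENTS.get(event_id, "unmonitored event")
```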

Table 3.3 shows some additional event IDs that indicate specific changes to user accounts.


Table 3.3 Operation-specific user change events

Event ID  Windows 2003 only  Description
626       •                  User account enabled
627                          Change password attempt (Win2K logs this event for both password changes and resets; Windows 2003 logs it only for changes)
628       •                  User account password set (Win2K doesn’t log this event; Windows 2003 logs it for resets)
644                          User account locked out
671       •                  User account unlocked

From Monterey Technology Group, Inc.’s “Security Log Secrets,” used with permission.

You’ll find it useful to collect the event IDs that indicate new user accounts and users being added to groups. Doing so gives you a way to audit administrative authority and track down any instances of unauthorized access.

Monitoring account management events in AD can also provide some highly valuable data for capacity planning and trend analysis. You’ll be able to analyze figures for high-level tasks such as user creation, group changes, and password resets. Such figures can help you, for example, when you need to make decisions about merging Help desk operations or preparing administratively for employee growth.

DS Category

As I noted previously, the DS category has its specific uses. DS events are the only way you can detect changes to OUs and GPOs. Auditing for certain changes to both objects is important because changes to them can affect thousands of computers in your domain.

In Windows Server 2003, DS events are logged as event ID 566 (Success audit); on Win2K DCs, they’re logged as event ID 565 (Success audit). Changes to an OU’s ACL indicate a change in the administrative authority delegated to subadministrators over objects that reside in that OU.

For example, a domain administrator can delegate password-reset authority over the users in the NYC OU to the NYC Help desk. Such a delegation of authority results in a change to the ACL of the NYC OU. The change would show up in the originating DC’s Security log, as Figure 3.1 shows.


Figure 3.1 Change to an OU’s ACL

Note the object class and name fields in the description that identify that the change was to an OU and specify the OU’s name. The WRITE_DAC access indicates a write operation to the OU’s Discretionary ACL (aka simply “ACL”).

A change to an OU’s Group Policy tab can affect the policy settings on all of the computers and users in that OU and in its sub-OUs, as Figure 3.2 shows.

Event Type: Success Audit
Event Source: Security
Event Category: Directory Service Access
Event ID: 566
Date: 8/15/2004
Time: 7:22:12 PM
User: ELM\administrator
Computer: W3DC
Description:
Object Operation:
    Object Server: DS
    Operation Type: Object Access
    Object Type: organizationalUnit
    Object Name: OU=NYC,DC=elm,DC=local
    Handle ID: -
    Primary User Name: W3DC$
    Primary Domain: ELM
    Primary Logon ID: (0x0,0x3E7)
    Client User Name: administrator
    Client Domain: ELM
    Client Logon ID: (0x0,0x12913)
    Accesses: WRITE_DAC
Properties:
    WRITE_DAC
    organizationalUnit
Additional Info:
Additional Info2:
Access Mask: 0x40000


Figure 3.2 Change to an OU’s Group Policy tab

The gPLink and gPOptions properties reveal the update of the OU’s Group Policy tab. Changes to the GPO itself carry the same significance. You can identify them by events that show Write access to the versionNumber property of groupPolicyContainer objects, as Figure 3.3 shows.

Event Type: Success Audit
Event Source: Security
Event Category: Directory Service Access
Event ID: 566
Date: 8/15/2004
Time: 7:22:20 PM
User: ELM\administrator
Computer: W3DC
Description:
Object Operation:
    Object Server: DS
    Operation Type: Object Access
    Object Type: organizationalUnit
    Object Name: OU=NYC,DC=elm,DC=local
    Handle ID: -
    Primary User Name: W3DC$
    Primary Domain: ELM
    Primary Logon ID: (0x0,0x3E7)
    Client User Name: administrator
    Client Domain: ELM
    Client Logon ID: (0x0,0x12913)
    Accesses: Write Property
Properties:
    Write Property
        Default property set
        gPLink
        gPOptions
    organizationalUnit
Additional Info:
Additional Info2:
Access Mask: 0x20


Figure 3.3 Update to a GPO

To determine which GPO was actually changed, you’ll need to note the GUID in the object name field of the event’s description, then track down the GPO with that GUID. You can view a GPO’s GUID by using the Object tab of the GPO’s Property window.

Event Type: Success Audit
Event Source: Security
Event Category: Directory Service Access
Event ID: 566
Date: 8/15/2004
Time: 7:30:54 PM
User: ELM\administrator
Computer: W3DC
Description:
Object Operation:
    Object Server: DS
    Operation Type: Object Access
    Object Type: groupPolicyContainer
    Object Name: CN={31B2F340-016D-11D2-945F-00C04FB984F9},CN=Policies,CN=System,DC=elm,DC=local
    Handle ID: -
    Primary User Name: W3DC$
    Primary Domain: ELM
    Primary Logon ID: (0x0,0x3E7)
    Client User Name: administrator
    Client Domain: ELM
    Client Logon ID: (0x0,0x12913)
    Accesses: Write Property
Properties:
    Write Property
        Default property set
        versionNumber
        gPCMachineExtensionNames
    groupPolicyContainer
Additional Info:
Additional Info2:
Access Mask: 0x20


Detecting Intrusion Attempts

One of the key methods you can use to track AD intrusion attempts is monitoring authentication failures in the Security log Account Logon audit category. Windows logs separate events by authentication protocol.

You can track Kerberos authentication successes and failures with event ID 672 (Authentication ticket granted), event ID 673 (Service ticket granted), event ID 674 (Ticket granted renewed), event ID 675 (Domain account authentication failed), and event ID 676 (Authentication ticket request failed). Table 3.4 presents some Kerberos authentication events.

Table 3.4 Kerberos authentication events (all in the Audit account logon events category)

Event ID 675 – On a domain controller, indicates a failed initial attempt to log on via Kerberos at a workstation with a domain account, usually due to a bad password. The failure code indicates exactly why authentication failed; see the Kerberos failure codes referenced below.

Event ID 676, or failed event ID 672 – Logged for other types of failed Kerberos authentication; see the Kerberos failure codes referenced below. Note: Windows Server 2003 logs a failed event ID 672 instead of 676.

Event ID 681, or failed event ID 680 – On a domain controller, indicates a failed logon via NTLM with a domain account. The error code indicates exactly why authentication failed; see the NTLM error codes in Table 3.5. Note: Windows Server 2003 logs a failed event ID 680 instead of 681.

From Monterey Technology Group, Inc.’s “Security Log Secrets,” used with permission.

For Kerberos failure codes, see RFC 1510, “The Kerberos Network Authentication Service (V5),” at http://www.ietf.org/rfc/rfc1510.txt.

You can track NTLM authentication successes with event ID 680 (Successful authentication) and failures with event ID 681 (Failed authentication). Table 3.5 presents NTLM authentication error codes.

Table 3.5 NTLM error codes

Decimal     Hexadecimal  Error Description
3221225572  C0000064     User name does not exist
3221225578  C000006A     User name is correct but the password is wrong
3221226036  C0000234     User is currently locked out
3221225586  C0000072     Account is currently disabled
3221225583  C000006F     User tried to log on outside his/her day of week or time of day restrictions
3221225584  C0000070     Workstation restriction
3221225875  C0000193     Account expiration
3221225585  C0000071     Expired password
3221226020  C0000224     User is required to change password at next logon
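The decimal and hexadecimal columns in Table 3.5 are two renderings of the same 32-bit NTSTATUS value, so a monitoring rule that receives one form can derive the other with simple base conversion:

```python
def ntstatus_decimal(hex_code):
    """Convert an NTSTATUS hex string (e.g. 'C000006A') to its
    unsigned decimal form as logged in some tools."""
    return int(hex_code, 16)

def ntstatus_hex(decimal_code):
    """Convert the unsigned decimal form back to the 8-digit hex string."""
    return format(decimal_code, "08X")
```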


You’ll also find it useful to collect successful authentication events for capacity planning and trend analysis. They can help you answer questions such as “How many people log on each day?”

The Importance of Monitoring AD

AD is critical to your entire IT infrastructure, and it’s a complex, multicomponent, distributed system that requires many types of monitoring to ensure its overall health. To comprehensively monitor AD, you need to be able to monitor event logs, test the output of OS commands, check the status of services, test AD directly, and monitor performance objects. You also must be able to set thresholds so that you prevent alerts triggered by brief typical spikes in activity. The complexities of monitoring AD provide a deep test of a monitoring solution’s power and flexibility.

Next: Monitoring Exchange Server

In the next chapter, I’ll discuss monitoring Exchange Server, examining the particular challenges of that technology. Topics include information stores, public folders, and mailbox sizes. I’ll examine overall Exchange traffic, including email message counts and email message sizes, as well as the ins and outs of the Message Transfer Agent (MTA) and SMTP servers. I’ll also discuss bridgeheads, inactive mailboxes, and round-trip metrics, which gauge whether email messages are delivered promptly.



Chapter 4: Monitoring SQL Server

Monitoring databases such as Microsoft SQL Server pays off in many ways. Although many things can go wrong in a database – from deadlocks to overflowing log files – you can usually do something to avert disaster. Effective monitoring can help you nip problems in the bud – before they bloom into full-fledged system outages and user complaints.

Because databases are the vulnerable middle component between OSs and applications, you also need to monitor databases for security reasons. The database layer is an attractive target for committing fraud. Users (including malicious ones) can manipulate data by using high-level SQL commands – and, depending on the design of the application, they can avoid being subject to all the business rules and access checks enforced at the application layer.

Given all the ways end users can circumvent application logic and access data directly through spreadsheets and reporting tools, you need to watch, for security reasons, who accesses your database server and how. In addition, database-level monitoring is arguably as important as or more important than OS-level monitoring for compliance with legislation such as the Sarbanes-Oxley Act (H.R. 3763). The act stipulates many required and expected corporate best practices, including several that address data retention and access.

Because databases tend to grow over time and business processes greatly affect them, monitoring for the sake of trend analysis and capacity planning is especially useful. Good capacity planning can keep system performance uniformly excellent over time, even as resource use changes.

Finally, database performance is often the biggest contributing factor to overall system performance. Monitoring and tweaking database servers to improve their contribution to system performance is well worth your time and effort. Although this chapter discusses SQL Server, many of the factors I've listed are true for all databases.

I'll look primarily at three methods for monitoring SQL Server: performance counters, event logs (in particular, the Application log), and SQL queries and transactions. As I examine these methods, I'll discuss how to use them to monitor potential problem areas, performance, security, and current capacity (for planning purposes).

Monitoring: Performance Counters

Monitoring SQL Server with performance counters involves monitoring not just SQL Server–specific counters but also several OS counters that I discussed in Chapter 2. Because I'm dealing with a specific database server rather than a general Windows server, whenever possible, I'll specify the kinds of thresholds you'll want to set up and the inferences you might draw from the data you capture.

SQL Server Performance Counters

First, let's look at SQL Server–specific performance counters. Because databases themselves, use patterns, and hardware configurations are unique, it's tricky to recommend exact thresholds for SQL Server performance counters. Whereas 20 active transactions might be high for a single-CPU commodity server with ATA drives, it might be business as usual for a higher-end quad-CPU server with a SCSI RAID system. I'll explore instead how you can discover the correct thresholds for your system.

Wide variances hold true for such matters as cache statistics, locks, and table scans. You'll always need to understand your application, note the symptoms of a problem, then look for counters that can help you identify the problem and resolve it. For example, suppose users are complaining about long transaction times. One counter, Lock Waits/sec, might help you identify and gather initial data about the problem (e.g., Lock Waits/sec gets too high at specific times). But why are lock waits so high? Further investigation might reveal that your cache-hit ratio is too low for the type of transactions being executed. You might then realize that a crucial index has been dropped or that statistics need to be refreshed. You'll find that having a tool to monitor your database is good, but that you can't expect to simply activate a default rule set and let your monitoring solution run.

Before you implement a comprehensive monitoring system, you must establish baselines by monitoring key statistics for a sufficient period of time before you activate a set of alert rules. Otherwise, your operators might throw their pagers out the window the first week you implement overall monitoring. I'll examine some of the key statistics for which you should establish baselines. How can you know, however, what constitutes a sufficient period of time?

If your database is directly involved with one or more business processes (e.g., an Enterprise Resource Planning – ERP – system, an accounting system), you should look at the cycles and reporting periods of each business process. You can establish a good baseline for some systems with 7 days of activity. For example, a reservations system database has a somewhat constant workload, with peak times during the day and, to a lesser degree, during the week. Therefore, you could adequately baseline such a system in a week of typical activity. Other systems, such as a general ledger system, might require an entire quarter.

Buffer Cache-Hit Ratio

Buffer cache-hit ratio is a key SQL Server performance counter. Microsoft defines this counter as "Percentage of pages that were found in the buffer pool without having to incur a read from disk." A high ratio shows that most of your queries could be satisfied without going to disk. If your ratio drops too low, your server will slow down as it waits more and more for the disk subsystem to bring pages into memory.

Your goal should be to keep your buffer cache-hit ratio above 90 percent. However, the nature of some applications, such as OLAP processing, might make 90 percent impossible to attain because of I/O requirements. Therefore, establishing a baseline is important. If you determine that your buffer cache-hit ratio is too low, consider increasing the amount of RAM that SQL Server can use, adding more RAM to the server, or – if a particular query causes the problem – redesigning that query to make better use of buffer cache.
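To make the threshold concrete, here is a minimal Python sketch of such an alert rule. The function name and sample values are hypothetical; a real rule would read the counter from your monitoring solution.

```python
# Sketch of a threshold rule for buffer cache-hit ratio.
# If the baseline shows the 90 percent floor is unattainable (e.g., an
# OLAP workload), alert on a drop well below baseline instead.

def cache_hit_alert(ratio_percent, baseline_percent, floor=90.0):
    """Return True when the ratio warrants an alert."""
    if baseline_percent < floor:
        # Floor unattainable for this workload: alert on a 5% relative drop.
        return ratio_percent < baseline_percent * 0.95
    return ratio_percent < floor

# An OLTP server baselined at 97% should alert at 85%:
print(cache_hit_alert(85.0, 97.0))   # True
# An OLAP server baselined at 70% is fine at 72%:
print(cache_hit_alert(72.0, 70.0))   # False
```

The point of the baseline argument is exactly the caveat above: the same 90 percent rule of thumb cannot be applied blindly to every workload.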

Page Splits

If your database experiences slowdowns during high row-insertion periods or periods during which indexed columns are updated, you might be suffering from too many page splits. Let me offer some background for the following discussion about page splits.


Page Splits Defined

To begin with, you need to understand the basics of index organization and the concept of fill factor. SQL Server organizes indexes as B-trees, with one root page as the starting point for index traversal. The root page might have pointers to two or more pages on the next index level, and each of those pages might have pointers to multiple pages at the next level. The last, or bottom, level of an index is the leaf level, which must maintain all the index key values in a sorted sequence.

In a clustered index, the leaf level is the data itself, so SQL Server stores the data in sorted order. In a nonclustered index, the leaf level contains pointers to the data.

Fill Factor Defined

The fill factor, which is a percentage, is a value you specify when you create an index to tell SQL Server how full you want the index's leaf-level pages to be. SQL Server defaults to a fill-factor value of 0. The fill-factor value is especially important in an insert-intensive environment.

Because an index's leaf level must maintain all the index key values in a sorted sequence, when someone inserts a new row into a table, the index key value in that row determines the row's position in the index (or table, if the index is clustered). For example, if you have an index based on last name, inserting a row with a last-name value of Marlin requires that SQL Server insert a new index row in the same page with the other names that start with Ma, possibly between Margolin and Martin. If the page in which the new row belongs is completely full, SQL Server must split the page and link a new page into the page chain. (SQL Server will move approximately half the rows from the original full page to the new page.)

Page splitting is a resource-intensive operation that can slow the performance of your insert operations. In addition, because the new page probably won't be physically contiguous to the original page, you begin to fragment the index or table.

Creating an index with a low fill factor means that your table has room to grow before page splitting would be required. However, if your pages are only partially filled, you'll need more pages to hold all your data. The index can become quite large. Microsoft defined the fill-factor value of 0 as a compromise between having room to grow and making sure the table and indexes are no larger than necessary. With a fill factor of 0, the leaf level is full, but the pages in the upper levels of the index, which might also need to be split if they become full, have some room for growth.
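To see the size side of that trade-off in rough numbers, here is an illustrative Python sketch that estimates leaf-page counts at different fill factors. The rows-per-page figure is hypothetical, and the model deliberately ignores non-leaf index levels.

```python
import math

def pages_needed(row_count, rows_per_full_page, fill_factor):
    """Estimate leaf-level pages for an index created with the given
    fill factor (a percentage). A fill factor of 0 behaves like 100 at
    the leaf level: pages start full. Simplified illustration only."""
    pct = 100 if fill_factor == 0 else fill_factor
    rows_per_page = max(1, math.floor(rows_per_full_page * pct / 100))
    return math.ceil(row_count / rows_per_page)

# 1,000,000 rows at a hypothetical 100 rows per full page:
print(pages_needed(1_000_000, 100, 0))    # 10000 pages, no growth room
print(pages_needed(1_000_000, 100, 70))   # 14286 pages, 30% free per page
```

The lower fill factor buys room for inserts at the cost of roughly 40 percent more leaf pages in this example, which is exactly the compromise the fill-factor setting controls.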

Identifying Page-Split Slowdowns

To identify slowdowns that occur because of page splits, monitor the SQLServer:Access Methods: Page Splits/sec counter. What's a reasonable page-split rate? A general rule of thumb is to keep the page-split rate below 100. Keep in mind, however, that that rate might be fine for some systems but too high for others.

Note: Page splits are a problem only if they cause a noticeable slowdown. If your disk I/O subsystem is equal to the task, your server can deal with high page splits without slowing your system down.

The best approach is to monitor the server for a period of time and baseline the page-split counter. If you start to experience slowdowns, determine whether page splits have increased. If your trend analysis reveals that the Page Splits/sec rate is growing steadily, you know that some of your indexes are getting full and that it's time to rebuild them. Before rebuilding your indexes, you might consider lowering the fill-factor value, which can prevent the need to rebuild indexes or at least significantly lengthen the time until you must rebuild indexes because of page splits.
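One simple way to detect that steady growth is to fit a trend line to your baselined samples. The following Python sketch, using hypothetical daily averages, computes a least-squares slope; a persistently positive slope is the signal to plan an index rebuild.

```python
def weekly_trend(samples):
    """Least-squares slope of evenly spaced counter samples.
    A steadily positive slope on Page Splits/sec suggests indexes
    are filling up and a rebuild (or lower fill factor) is due."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

splits = [40, 44, 47, 52, 55, 61, 66]   # hypothetical daily averages
print(round(weekly_trend(splits), 2))    # 4.29 splits/sec growth per day
```

A one-off spike won't move the slope much; a week of consistent growth will, which matches the "growing steadily" criterion above better than a simple threshold does.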

Efficient Index Use

When it selects data, SQL Server occasionally chooses to scan an entire table instead of using an index. Although table scans can be more efficient, they usually aren't. If you fail to create an index on columns that Where clauses use frequently, or if for some reason SQL Server doesn't use the index, the number of table scans will shoot up and your performance will probably degrade. Performance suffers because selecting data by indexed column is usually much faster than selecting by nonindexed column.

SQL Server uses index searches to start range scans, to retrieve a specific record by using one index, and to reposition keys within an index. A large number of table scans can cause performance problems. You can reduce the number of table scans involved in these tasks by first identifying the tables SQL Server scans, then modifying the database to create indexes for those tables.

To get a profile of how your applications access data, you can monitor the SQLServer:Access Methods: Full Scans/sec counter. This counter shows the number of table scans SQL Server performs instead of using an index. To get the other side of the story, use the SQLServer:Access Methods: Index Searches/sec counter, which shows the number of index searches SQL Server performs.

Establishing baselines for these two counters is the recommended approach. However, you might want to try simply adjusting toward a rule-of-thumb ratio of 90/10. For most applications, optimal indexing performance occurs when you access data about 90 percent through indexes and no more than 10 percent through table scans. A good monitoring solution will let you compare the two counters and alert you if they fall out of the proportion to each other that you've established.
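A rule based on that comparison might be sketched in Python as follows. The 90/10 split comes from the rule of thumb above; the function name and sample counts are mine.

```python
def index_ratio_ok(index_searches, full_scans, min_index_pct=90.0):
    """Rule-of-thumb check: at least ~90% of data accesses should go
    through index searches rather than full table scans."""
    total = index_searches + full_scans
    if total == 0:
        return True     # idle server; nothing to judge
    return 100.0 * index_searches / total >= min_index_pct

# Per-interval deltas of Index Searches/sec vs. Full Scans/sec:
print(index_ratio_ok(950, 50))    # True  (95/5 - healthy)
print(index_ratio_ok(800, 200))   # False (80/20 - investigate indexing)
```

As with the other counters, replace the default 90 percent with whatever proportion your own baseline establishes as normal.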

If you suspect inefficient indexing is causing performance problems, you can use the code that Listing 1 presents to display the SQL statements currently executing and to determine which tables SQL Server is accessing. By comparing the columns used to limit each command's scope (i.e., the Where clause) to the indexes available in the table, you can determine whether you need additional indexes or whether SQL Server isn't using existing indexes.

Listing 1 Code to display the SQL statements currently executing

DECLARE @MV_PROCESS CHAR(40), @MV_COMMAND2 CHAR(254)
DECLARE CUR_TABLE_LIST CURSOR FOR
  SELECT P.SPID
  FROM MASTER..SYSPROCESSES P
  WHERE P.CMD <> 'AWAITING COMMAND'
    AND P.SUID <> 1
    AND P.SPID <> @@SPID

OPEN CUR_TABLE_LIST
FETCH NEXT FROM CUR_TABLE_LIST INTO @MV_PROCESS
WHILE (@@FETCH_STATUS <> -1)
BEGIN
  IF (@@FETCH_STATUS <> -2)
  BEGIN
    SELECT @MV_COMMAND2 = 'DBCC INPUTBUFFER (' + @MV_PROCESS + ')'
    EXEC (@MV_COMMAND2)
  END
  FETCH NEXT FROM CUR_TABLE_LIST INTO @MV_PROCESS
END
CLOSE CUR_TABLE_LIST
DEALLOCATE CUR_TABLE_LIST


You can then execute long-running Select statements in the Query Analyzer, making sure that you use the Show Query Plan option or run the graphical execution plan utility.

Running a query with an execution plan selected doesn't execute the query but instead displays how the query will be resolved when it's executed.

As you look at the execution plan, ask yourself the following questions: Are the keys in the Where clauses indexed? Are the frequently used key columns indexed? (If not, consider testing performance with key columns indexed.) Is SQL Server using indexes or performing table scans?

If SQL Server isn't using the indexes, you can run the Database Consistency Checker (DBCC) command DBCC SHOW_STATISTICS (table name, index name) to see statistical information about selected indexes. Also, run the UPDATE STATISTICS table command, which updates the distribution information that SQL Server uses to determine whether to use an index. After the update, display the execution plan again to note any improvements in index use.

Overall Activity

To keep an eye on how busy your SQL Server system is, the best single statistic is the SQLServer:SQL Statistics: Batch Requests/sec counter. Although the SQLServer:Databases: Transactions/sec counter might at first glance seem a better choice, bear in mind that the latter counter reports only on activity inside committed transactions.

Most applications perform a lot of work outside transactions – such as executing simple queries. Therefore, the batch requests counter paints the most accurate picture of how many work units the server processes per second. By analyzing this statistic over time, you can configure a rule that alerts you when the SQL Server workload approaches atypical levels.
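One simple way to define "atypical" is distance from the baseline in standard deviations. Here is an illustrative Python sketch with hypothetical Batch Requests/sec history; the three-sigma threshold is an assumption you would tune against your own baseline.

```python
from statistics import mean, stdev

def atypical_workload(sample, history, sigmas=3.0):
    """Flag a Batch Requests/sec sample that strays more than a few
    standard deviations from the baselined history."""
    mu, sd = mean(history), stdev(history)
    return abs(sample - mu) > sigmas * sd

history = [200, 210, 195, 205, 198, 207, 202]   # baselined samples
print(atypical_workload(204, history))   # False: normal load
print(atypical_workload(600, history))   # True: workload has changed
```

The same pattern applies to most of the baselined counters in this chapter: record history first, then alert on deviation rather than on a fixed number.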

Concurrency Bottlenecks

Sometimes, the cause of database slowdowns isn't so much capacity as it is concurrency. For example, SQL Server must use a variety of locks while it executes transactions. It must ensure that data isn't corrupted at the application level by different transactions updating the same data at the same time.

Depending upon the design of the database and the update transaction, you might experience slowdowns from one transaction locking data that other transactions need. Sometimes the problem is as simple as a single row that each transaction updates creating a bottleneck. To identify the slowdowns that such locks cause, baseline the SQLServer:Locks: Average Wait Time (ms) counter for the _Total instance. If you experience a slowdown and this counter has jumped well above its typical baseline, you can start monitoring additional lock counters to discover the location of the problem.

Another type of lock is a latch. Latches are lighter-weight locks that SQL Server uses when it moves rows and index data around. Latches can also cause slowdowns, but such slowdowns are much less common. You can baseline and monitor the SQLServer:Latches: Average Latch Wait Time (ms) counter in the same way that I recommended for the SQLServer:Locks: Average Wait Time (ms) counter.

Memory Shortage

Memory shortage is at the root of many performance problems. Rather than trying to monitor every counter that could indicate a memory shortage, however, you can monitor whether SQL Server itself indicates that it needs more memory to perform well. To do so, compare SQLServer:Memory Manager: Total Server Memory (KB) and SQLServer:Memory Manager: Target Server Memory (KB).


Total Server Memory (KB) tells you how much total memory SQL Server is currently using. Target Server Memory (KB) tells you how much memory SQL Server would use if it were available. If Target Server Memory exceeds Total Server Memory, you know that SQL Server needs more RAM for optimal performance under the current workload. Establishing trends for both markers is useful because you'll be able to determine whether SQL Server is consistently starved for memory or simply encountering peak processing times when it would use more memory. SQL Server lets you control the amount of the server's total physical RAM it can use.
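The "consistently starved versus peak-only" distinction might look like the following Python sketch. The KB figures are hypothetical samples of the two Memory Manager counters taken over time.

```python
def memory_pressure(samples):
    """Classify memory pressure from (total_kb, target_kb) samples,
    where target > total means SQL Server wanted more memory than
    it had at that moment. Illustrative logic only."""
    starved = sum(1 for total, target in samples if target > total)
    if starved == len(samples):
        return "consistently starved - add RAM"
    if starved > 0:
        return "starved only at peaks"
    return "memory adequate"

samples = [(1_800_000, 2_000_000),   # hypothetical hourly samples
           (1_900_000, 2_000_000),
           (2_000_000, 2_000_000)]
print(memory_pressure(samples))   # starved only at peaks
```

A "consistently starved" verdict argues for more RAM (or a higher SQL Server memory cap); a peak-only verdict may be acceptable, depending on what else those peaks slow down.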

As you consider SQL Server's memory use, keep in mind that if you don't leave enough memory for Windows and other applications to run, your performance might begin to degrade just because of page-file use. To determine whether Windows or other applications need more memory, monitor the Memory: Pages/sec counter. This counter shows the number of pages that Windows retrieved from disk because of page faults or that Windows wrote to disk to free up space in the working set.

Although paging spikes aren't unusual, this counter should remain close to zero. An increase in paging can signal the need to add memory; you might also attempt to reduce the number of other applications running on the same computer as SQL Server.

SQL Server Backups

You can monitor the throughput of your database backups with the SQLServer:Backup Device: Device Throughput Bytes/sec counter. Unless you start experiencing long backups for some reason, you won't need to monitor this counter for performance reasons.

However, if you want assurance that backups are actually taking place when they're scheduled, you could establish a rule to perform a check a few minutes after your scheduled backup begins. If the rule reveals that the counter is inactive or 0, it can alert you to the possibility that backups have failed to start. I like this method because it confirms that data is actually being sent to the backup device rather than simply checking to see whether a scheduled job started.
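Such a rule reduces to a trivial check on the sampled counter value, sketched here in Python with hypothetical numbers:

```python
def backup_running(throughput_bytes_per_sec):
    """Check, a few minutes after the scheduled start, that the
    Device Throughput Bytes/sec counter shows data actually moving
    to the backup device."""
    return throughput_bytes_per_sec is not None and throughput_bytes_per_sec > 0

# Counter sampled at scheduled backup time + 5 minutes (hypothetical):
print(backup_running(35_000_000))  # True: backup is streaming
print(backup_running(0))           # False: alert - backup may have failed
```

The value of this approach is exactly what the text describes: it verifies data movement, not merely that a job scheduler fired.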

Low Disk Space

Because SQL Server now lets database and log files grow automatically, you don't need to monitor SQL Server counters to check for low free space. Instead, you can monitor Windows counters for disk space. Use the LogicalDisk performance object and monitor % Free Space or Free Megabytes on each logical disk on the server. Analyze whether and how fast SQL Server files consume disk space, then project how soon you'll run out.

Tip: Configure a rule to alert you if either counter suddenly falls below a certain value. Doing so will catch rogue processes before they overflow your server.
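The run-out projection itself is simple arithmetic. Here is an illustrative Python sketch that assumes consumption continues at the recent linear rate; the daily Free Megabytes samples are hypothetical.

```python
def days_until_full(free_mb_samples, interval_days=1):
    """Project days until a logical disk runs out of space, assuming
    the recent linear consumption rate continues. Simplified
    capacity-planning arithmetic only."""
    consumed = free_mb_samples[0] - free_mb_samples[-1]
    if consumed <= 0:
        return None   # free space isn't shrinking; nothing to project
    rate = consumed / ((len(free_mb_samples) - 1) * interval_days)
    return free_mb_samples[-1] / rate

# Free Megabytes sampled daily: losing ~500 MB/day with 4,000 MB left
print(days_until_full([6000, 5500, 5000, 4500, 4000]))   # 8.0
```

Eight days of runway is the kind of figure you'd feed into capacity planning, well before the sudden-drop alert in the Tip ever fires.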

CPU Use

A good rule of thumb is that you don't want SQL Server processor use to be constantly above 80 percent. To identify such a situation, monitor the Processor: % Processor Time counter for the _Total instance. If use of your server CPUs is consistently 80 percent or higher, you need to determine what's consuming CPU capacity. First, you can determine how much CPU time applications such as SQL Server and the OS (as it handles network traffic and disk I/O) consume.

SQL Server and other applications run in user mode, and core OS functions such as networking and disk I/O run in kernel (aka privileged) mode. The Processor performance object has two counters that correspond to these two modes: the Percent Privileged Time counter and the Percent User Time counter.

If the lion's share of CPU time is spent in privileged mode, you might have an I/O optimization problem. If, as is more likely, the majority of time is spent in user mode, you must track down which process is consuming your CPU time. Is it the SQL Server process or some other process running on the server? To find out, you must monitor the Process: % Processor Time counter for the sqlservr instance and for any other processes that consume extensive CPU time.

Disk and Network Performance

One of the best ways to identify disk-related performance problems is with PhysicalDisk: Average Disk Queue Length. This disk-performance counter in the PhysicalDisk object shows how many requests are usually waiting for disk access. Microsoft recommends that the number of waiting I/O requests be no more than one and a half to two times the number of spindles that comprise the physical disk.

If this counter's number is consistently higher than the recommended number, you might benefit from faster disks or additional disk drives. The Bytes Total/sec counter, which is in the Network Interface object, can help you find a network-adapter bottleneck. Compare this number to your total available bandwidth. Generally, the counter should remain at less than 50 percent of the available bandwidth.

Monitoring: Application Log

For tracking and responding to problems that SQL Server itself detects, monitor the Application log. SQL Server uses the Application log for status messages about matters such as service startup and shutdown, backups, and configuration changes. SQL Server's core message source in the Application log is MSSQLServer.

The MSSQLServer source logs a number of different event IDs in the Application log. However, instead of monitoring for specific event IDs, I recommend monitoring for events from the MSSQLServer source and configuring your monitoring solution to look in the event description for the SQL Server error number.

The error number is the first field in the event description, and it identifies the actual event. The term "error number" can be misleading because many of the errors are informational messages such as "17162 : SQL Server is starting at priority class 'normal' (1 CPU detected)."
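Because the error number is the first field of the description, extracting it is straightforward. Here is a Python sketch of the kind of parsing a monitoring rule might do; the description format is taken from the example above, and the function name is mine.

```python
def sql_error_number(description):
    """Pull the SQL Server error number out of an Application log
    event description, where it appears as the first field."""
    head = description.split(":", 1)[0].strip()
    return int(head) if head.isdigit() else None

msg = "17162 : SQL Server is starting at priority class 'normal' (1 CPU detected)."
print(sql_error_number(msg))            # 17162
print(sql_error_number(msg) >= 50000)   # False: not a user-defined error
```

Keying rules off this number, rather than off the event ID or the informational/warning/error type, is the approach recommended in this section.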

On the other hand, the MSSQLServer source isn't consistent about which event type it designates. Although it logs events as informational, warnings, or errors, don't put too much stock in the designations. Some informational events are actually pretty important errors to SQL Server. Figure 4.1 shows one such event – a login failure.


Figure 4.1 Application log MSSQLServer source login failure event

SQL Server can report a ton of error numbers – many more than I can document here. However, you can easily research SQL Server errors by browsing through the Microsoft Developer Network's (MSDN's) documentation "SQL Error Messages" at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/trblsql/tr_syserrors1_6m5z.asp.

Note: A single server can have multiple instances of SQL Server. In such cases, SQL Server creates a different source in the Application log for each instance in the format MSSQL$instance – where instance is the name of the SQL Server instance.

SQL Server Security-related Activities

The Application log is also the best place to monitor SQL Server security-related activities. In the Application log, you'll be able to track successful and failed logins as well as attempts to perform unauthorized activities in the database. Table 4.1 provides a list of common security events in SQL Server. I've selected these security events from those listed at http://doc.ddart.net/mssql/sql2000/html/. To look at additional SQL security events, select "Troubleshooting" and scroll down to the point at which the error number ranges begin.

The severity of these security events depends greatly on the type of connection between the client and server. If the client is actually an application server connecting to SQL Server on behalf of the user, both the access-denied and login-failure events are suspicious because it's unlikely for the application server to fail logins or perform unauthorized actions. On the other hand, if the user's application connects directly to SQL Server, you can expect frequent login problem events because users enter the wrong password. In either case, access-denied events usually indicate that someone is trying to execute ad hoc commands outside the production application.


Note: SQL Server programmers can develop applications that generate events in the Application log by using the RAISERROR command. RAISERROR lets programmers designate user-defined error numbers (always 50,000 or above). Therefore, it's a good idea to monitor for SQL Server messages in the Application log with error numbers 50,000 or above and identify those events to the operator as user-defined events. For example, an application could detect a user trying to perform a transaction he or she isn't authorized to perform and record the incident with a RAISERROR designation.

Table 4.1 Common security events in SQL Server

Error   Severity   Error Description
229     14         %ls permission denied on object '%.*ls', database '%.*ls', owner '%.*ls'.
230     14         %ls permission denied on column '%.*ls' of object '%.*ls', database '%.*ls', owner '%.*ls'.
262     16         %ls permission denied in database '%.*ls'.
3279    16         Access is denied due to a password failure.
7415    16         Ad hoc access to OLE DB provider '%ls' has been denied. You must access this provider through a linked server.
7416    16         Access to the remote server is denied because no login-mapping exists.
7610    16         Access is denied to '%ls', or the path is invalid. Full-text search was not installed properly.
7628    17         Cannot copy Schema.txt to '%.*ls' because access is denied or the path is invalid. Full-text search was not installed properly.
10011   16         Access denied.
15062   16         The guest user cannot be mapped to a login name.
15063   16         The login already has an account under a different user name.
15067   16         '%s' is not a local user. Remote login denied.
15068   16         A remote user '%s' already exists for remote server '%s'.
15483   0          Denied login access to '%s'.
15484   0          Could not revoke login access from '%s'.
15485   0          Revoked login access from '%s'.

Source: http://msdn.microsoft.com

Monitoring: SQL Queries and Transactions

To thoroughly monitor your SQL servers, look for a monitoring solution that lets you run SQL commands from within the solution. If your monitoring solution doesn't let you directly execute SQL Server commands, you'll need to rely on writing a batch file or script that accesses the server for you.

Running an appropriate SQL query or transaction lets you ensure beyond any shadow of a doubt that your database application is completely functional. Periodically running your own SQL queries or transactions from your monitoring solution also lets you track data unavailable through SQL performance counters and monitor application-level information – such as how many order entries are made per hour.


SQL Rules

If your monitoring solution supports SQL rules (i.e., rules based on the results of a SQL query or command), you can collect such numbers as order entries per hour over a long period of time. Doing so lets you extend your trend analysis and capacity planning activities to include much higher level decisions than those that simple OS and database statistics support.

SQL Server: It Pays to Monitor

As you've read, monitoring SQL Server is worth your time. SQL Server performance counters provide the best way to monitor your database server's performance. The Application log is most useful to monitor for database errors and problems – as well as for security-related activities. (Remember to base your monitoring on the SQL error number within the description of MSSQLServer source events – and don't forget that servers can have multiple instances of SQL Server.) You might also be able to use the Application log to monitor for errors that your application detects while processing transactions. And, in addition to monitoring at the more abstract OS and database levels, you can configure rules that execute SQL queries against your database. This capability gives you the flexibility not only to monitor your applications but also to discover trends that reveal what's happening inside your system at the application level.

Next: Monitoring Exchange Server

In the next chapter, I'll discuss monitoring Microsoft Exchange Server, examining the particular challenges of that technology. Topics include information stores, public folders, and mailbox sizes. I'll examine overall Exchange traffic, including email message counts and email message sizes, as well as the ins and outs of the Message Transfer Agent (MTA) and SMTP servers. I'll also discuss bridgeheads, inactive mailboxes, and round-trip metrics that gauge whether email messages are delivered promptly.


Chapter 5:

Monitoring Exchange Server

Monitoring your Microsoft Exchange environment is crucial to staying out of hot water with your user community. Because everyone depends upon email and other Exchange services, a problem with Exchange will generate Help desk calls faster than a problem with any other resource on your network.

Although a database likewise requires second-to-second uptime, monitoring and managing a database server is fairly straightforward compared to monitoring Exchange. The reason lies in the many components and services that make up just a single Exchange server.

In this chapter, I'll discuss what it takes to effectively monitor Exchange, beginning with key OS resources, such as hard disk, memory, processors, and the network. After I identify the system services and processes that comprise Exchange, I'll describe the problems an outage of each might cause. I'll then consider the message queues that lie at the heart of Exchange Server monitoring and show you how to recognize message-queue problems as quickly as possible.

Next, I'll discuss monitoring specific Exchange performance counters, and I'll introduce you to the rich environment of Windows Management Instrumentation (WMI) classes that you can use for customized monitoring. I'll then explore how to use the Application log to monitor Exchange. Finally, under the heading "Miscellaneous Monitoring," I'll review two network services absolutely critical to Exchange: DNS and Active Directory (AD).

When you monitor Exchange Server in many different ways, you have a better chance of catching a problem before it affects your users. Also, if you keep data from your monitoring sources for a few months, you can perform trend analysis that will let you proactively upgrade hardware or redeploy services before a capacity problem occurs.

Monitoring Key OS Resources

Because Exchange runs on top of Windows, the health of Windows is critical to the health of Exchange. Make sure you don't neglect monitoring hard disk space, available memory, page file use, and processor use. You can refer to Chapter 2 for guidelines and recommendations about monitoring specific counters, but I'll review a few thresholds defined by best practices.

On an Exchange Server, CPU use shouldn't exceed 80 percent for more than 5 minutes, and free virtual memory shouldn't fall below 25 percent at any time. As with all types of servers, you should monitor these OS-level resources not only for real-time alerting but also for longer-term trend analysis. Trend analysis is important for diagnosing performance and capacity problems. Exchange-level indicators might tell you that email message delivery or some other function is taking longer and longer, but only OS parameters can tell you which hardware component needs upgrading or expansion.
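The 80-percent-for-5-minutes rule translates directly into a sustained-threshold check: alert only when every sample in a sliding window is over the limit. Here's a minimal sketch in Python; the counter collection itself (PerfMon, WMI, or your monitoring tool) is assumed and left outside the sketch:

```python
from collections import deque

def make_sustained_alert(threshold, duration_samples):
    """Return a checker that fires only when every sample in the
    window exceeds the threshold (e.g., CPU > 80 percent across
    five consecutive 1-minute samples)."""
    window = deque(maxlen=duration_samples)

    def check(sample):
        window.append(sample)
        # Fire only once the window is full AND every sample is high.
        return len(window) == duration_samples and all(s > threshold for s in window)

    return check

# CPU > 80 percent sustained for five 1-minute samples
cpu_alert = make_sustained_alert(80, 5)
fired = [cpu_alert(r) for r in [85, 90, 88, 92, 95]]
print(fired)  # only the last check fires: [False, False, False, False, True]
```

A single spike never fires the alert; only a full window of high samples does, which is exactly the behavior the 5-minute rule asks for.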


Services and Processes

Monitoring the status of both services and processes is valuable for ensuring that all components of Exchange are up and running. Monitoring services is a matter of verifying that the right services are in the started state. You should make sure that the following services specific to Exchange are started:

• Microsoft Exchange Event Service—Monitors folders and generates events for Exchange 5.5 applications.

• Microsoft Exchange IMAP4 Service—Provides Exchange IMAP4 services.

• Microsoft Exchange Information Store (IS) Service—Provides access to mailbox and public folder stores (a core service to Exchange).

• Microsoft Exchange Message Transfer Agent (MTA) Stacks Service—Provides Exchange X.400 services. This service is important only if you support Exchange 5.5 servers.

• Microsoft Exchange POP3 Service—Provides Exchange POP3 services. This service is important only if you have POP3 accounts.

• Microsoft Exchange Routing Engine Service—Processes Exchange email message routing and link state information. Without this service, state information about connectors and queues is unavailable, which stops all email messages from going in and out of the server.

• Microsoft Exchange Site Replication Service—Replicates Exchange information between Exchange 2003 and Exchange 2000 or between Exchange 2000 and Exchange 5.5. This service is important only if you have back-level Exchange servers.

• Microsoft Exchange System Attendant Service—Monitors Exchange and provides essential services, such as running LDAP queries, publishing free/busy information, generating the Offline Address Book (OAB), maintaining mailboxes, and synchronizing with the Microsoft IIS metabase. Because all other Exchange services depend on this service, Exchange Server is unavailable without it.

• Exchange Management Service—Supplies WMI information to other Exchange services and to any scripts or monitoring tools that you implement.

Other services not technically part of Exchange Server but nevertheless crucial to it are

• Distributed Transaction Coordinator—Coordinates transactions that are distributed across multiple databases, message queues, and file systems.

• Internet Information Services (IIS) Admin Service—Lets you administer the Exchange HTTP virtual server in the Microsoft Management Console (MMC) IIS snap-in.

• Network News Transport Protocol (NNTP)—Transports newsgroup messages across the network.

• Simple Mail Transport Protocol (SMTP)—Transports email messages across the network.

• World Wide Web Publishing Service—Provides HTTP services for Microsoft Exchange Server and IIS.

• Event Logging Service—Logs events from Exchange and other applications in the Application log.

• NTLM Security Support Provider—Provides NT LAN Manager (NTLM) authentication.

62 Taking Control: Monitoring the Windows Platform Proactively

Brought to you by Argent and Windows IT Pro eBooks


• Remote Procedure Call (RPC)—Provides RPC services.

• Server Service—Provides connectivity to other Windows clients.

• Workstation—Provides client connectivity to Windows servers.

Some additional services you should consider monitoring are third-party applications that provide important Exchange-related services. Those applications include backup software, antivirus and antispam solutions, and Exchange Server–based fax solutions.
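A monitoring script can reduce the service checklist above to a set comparison. A minimal sketch follows; the short service names in the set are assumptions (verify them against the Services console on your own server), and in practice you would populate the running set from `sc query` output or the Win32_Service WMI class:

```python
# Required Exchange services by short name -- these names are
# assumptions; confirm them on your server before relying on them.
REQUIRED_SERVICES = {
    "MSExchangeIS",   # Information Store
    "MSExchangeMTA",  # MTA Stacks
    "MSExchangeSA",   # System Attendant
    "RESvc",          # Routing Engine
    "SMTPSVC",        # SMTP
    "W3SVC",          # World Wide Web Publishing
}

def missing_services(running):
    """Return required services that are absent from the running set."""
    return sorted(REQUIRED_SERVICES - set(running))

# Example: feed in whatever your collector reports as started
running_now = {"MSExchangeIS", "MSExchangeSA", "SMTPSVC", "W3SVC"}
print(missing_services(running_now))  # ['MSExchangeMTA', 'RESvc']
```

Anything the function returns is a service that should be started but isn't, which maps straight onto an alert.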

The processes listed below host most of the services listed above. Although monitoring the status of services provides the best granularity in making sure each component of Exchange Server is up and running, you can't monitor resource use by service. That's where process monitoring comes in. Process monitoring lets you isolate which component of Exchange is causing or experiencing a problem because of resource use. The Exchange Server processes are

• Store.exe—Exchange IS

• Inetinfo.exe—IIS

• Emsmta.exe—Exchange MTA Stacks Service

• Mad.exe—Exchange 5.5 System Attendant

• Exmgmt.exe—Exchange Management Service

• W3wp.exe—IIS worker process (the kernel-mode HTTP listener, http.sys, hands requests to it)

Windows lets you monitor use of OS resources at the process level. Memory and CPU use (see Chapter 1 for details) are probably the two most useful counters to monitor for each of Exchange Server's processes. A continuing, anomalous level of either the memory or CPU counters can indicate a runaway or hung process.

Monitoring Message Queues

Key to monitoring Exchange Server is monitoring its message queues. Your Exchange Server environment depends upon a set of cooperating processes that run on different systems. When Exchange passes an email message from one process to another, it must frequently queue the email message until the receiving process can handle it. For example, the Exchange SMTP virtual server can queue email messages that are waiting for the virtual server to perform directory lookup or for the routing engine to determine the appropriate next hop for the email messages.

The sending and receiving processes might be on the same or different systems. All connectors (e.g., IBM Lotus Notes, Novell GroupWise, SMTP, X.400) queue email messages that must wait for Exchange to establish network connections with messaging processes running on other systems.

Note: The following material about Exchange message queues is adapted from Mike Daugherty's article "Managing Exchange Server 2003 Message Queues," which appeared in the July 2004 issue of Exchange and Outlook Administrator. The article (InstantDoc #42536) is available to subscribers only at http://windowsitpro.com.


Chapter 5 Monitoring Exchange Server 63


Using Performance Counters

The number of email messages in a queue can vary widely from second to second depending on the rate of incoming email messages and the speed with which Exchange Server can route them. Because Exchange's primary function is message delivery, you need to detect when Exchange fails to move email messages on to their final destination. Rather than trying to identify a threshold for a number of email messages that should trigger an alert, you should monitor for an increase in queue length over a period of time (e.g., 15 minutes). A 15-minute period typically lets Exchange Server catch up with even a sudden surge of incoming messages.

Tip: If a queue is longer each time it's checked within a 15-minute period, you probably have a problem.

Performance counters provide one of the best ways to track the number of entries in queues, and queue length can help you detect potential message-transport problems. When monitoring reveals that a queue is longer than you think is appropriate, Exchange System Manager (ESM) or your monitoring solution lets you view the queued email messages and take any necessary corrective action.

A queue might be longer than you expect for many reasons. For example, the email message at the head of the queue might have encountered a problem that prevents it from being delivered. Other messages build up behind it. If a single email message causes a problem, you have the option of removing it from the queue and returning it to the sender with a nondelivery report (NDR). However, the queue might simply be experiencing a temporary increase in email message traffic. To discover the source of the problem, you need to review the queue entries.

Because unusually large or backed-up message queues can indicate serious system or networking problems, you should monitor queue size regularly. Use your monitoring solution to track the length of each queue (i.e., the number of email messages). If your monitoring solution sees sustained growth of the queue for more than 10 to 15 minutes, set it to issue an alert to trigger troubleshooting efforts. Exchange administrators should know the types of Exchange Server message queues, how to monitor and manage each of them, and how to troubleshoot and resolve some common message queue–related problems.
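The grows-every-time-it's-checked rule of thumb is easy to encode. A sketch, assuming queue length is sampled on a fixed interval by whatever collector you already run:

```python
def queue_growing(samples):
    """True if the queue was strictly longer at every successive check,
    i.e., sustained growth across the sampling window (the chapter's
    10-to-15-minute rule of thumb)."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))

# Queue length sampled every 5 minutes over a 15-minute window
print(queue_growing([120, 180, 260, 410]))  # True: sustained growth, alert
print(queue_growing([120, 450, 80, 60]))    # False: a surge that drained
```

Note what this deliberately ignores: a single large sample. A burst that Exchange drains within the window never alerts; only monotonic growth does.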

Message Queues: A Review

The protocols your environment supports determine the number of message queues on your Exchange server. Exchange uses two types of queues for each protocol: link queues and system queues.

Link Queues

Delivering an email message frequently involves relaying the message through one or more intermediate servers. As an email message travels between servers, the immediate destination is appropriately called the "next-hop" server. After the routing engine determines an email message's next-hop server, the virtual server adds the message to a link queue for that server. The virtual server places all email messages destined for the same next-hop server in the same link queue. An email message remains in the link queue until Exchange establishes an active connection with and transfers the message to the next-hop server.

The names of Exchange's link queues include the name of the next-hop server plus a designation that indicates remote delivery. For example, if SMTP messages are going to a next-hop server named next.com, the virtual server adds those messages to a link queue named next.com (Remote delivery). The virtual server creates and removes next-hop link queues as needed. Any email messages that can't be transferred to the next-hop server are placed at the end of the queue for later delivery.

System Queues

System queues hold email messages that await processing. Before Exchange sends email messages from one system to another, several system processes must examine the email messages and prepare them to move across the network. The system processes categorize messages, resolve addresses, convert content, and calculate next-hop routing. If email messages are in system queues for long periods (e.g., longer than 15 minutes), the system's messaging processes might be experiencing problems.

The types of system queues depend on the protocols you use. Most protocols use several queues. The key protocol SMTP uses five queues:

• Local domain name (Local Delivery) queue—This queue's email messages await delivery to a mailbox on the local Exchange server. The virtual server names this queue for the local email domain.

• Messages awaiting directory lookup queue—The virtual server keeps email messages in this queue until Exchange can expand distribution lists (DLs) and look up the message recipients in AD.

• Messages waiting to be routed queue—This queue holds email messages until the virtual server determines the next-hop server and moves each message to the link queue for that server.

• Final destination currently unreachable queue—If Exchange can't find an active network pathway to the final-destination server, Exchange adds the email message to this queue. You can use the virtual server's Message Delivery Properties dialog box to set a retry interval and an expiration period (after which an undelivered message is returned as undeliverable).

• Pre-submission queue—This queue holds new outgoing email messages that the SMTP virtual server has accepted but hasn't yet processed.

Each foreign-protocol connector has copies of five queues:

• READY-IN queue—This queue holds email messages from a foreign email system, such as Lotus Notes. The connector has converted the email message format (e.g., converted the content, mapped attributes) but hasn't yet resolved the recipient addresses.

• MTS-IN queue—This queue holds email messages after the connector has looked up the recipient addresses in AD and the email messages are ready for delivery.

• MTS-OUT queue—This queue holds email messages that await address resolution so that Exchange can send them to the foreign email system.


• READY-OUT queue—This queue holds email messages that Exchange is sending to a foreign email system after their addresses are resolved but before the connector has converted the message format.

• BADMAIL queue—This queue holds email messages that the connector can't successfully process. (The connector doesn't retry them.) They stay in the queue until an administrator deletes them.

You can always see the PendingRerouteQ X.400 queue in ESM or in your monitoring solution. However, the queue will be empty unless X.400 email messages are waiting to be rerouted after a temporary connection problem.

Message Queue Counters

You can monitor queues through Exchange 2003's ESM or through your monitoring solution. In either case, you can monitor specific performance counters for each queue of interest. Table 5.1 presents several queue counters that Microsoft identifies as critical to Exchange Server performance and availability, along with recommended thresholds for each counter.

For more about Exchange performance, see "Troubleshooting Exchange Server 2003 Performance" at http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/e2k3perf.mspx.

Table 5.1 Exchange Server queue performance counters

• SMTP Server\Local Queue Length—The number of email messages in the SMTP queue for local delivery. Expectation: the counter's maximum value should be less than 1000, and the queue length should remain steady near its average (variances should be small).

• SMTP Server\Remote Queue Length—The number of email messages in the SMTP queue for remote delivery. Expectation: the counter's maximum value should be less than 1000, and the queue length should remain steady near its average (variances should be small).

• SMTP Server\Categorizer Queue Length—The number of email messages in the SMTP queue for DS attribute searches. Expectation: the counter's maximum value should be less than 10.

• MSExchangeIS Mailbox\Send Queue Size—The number of email messages in the mailbox store's send queue. Expectation: the counter's value should be less than 500 at all times.

• MSExchangeIS Mailbox\Receive Queue Size—The number of email messages in the mailbox store's receive queue. Expectation: the counter's value should be less than 500 at all times.

• MSExchangeIS Public\Send Queue Size—The number of email messages in the public folder store's send queue. Expectation: in a server with no mail-enabled public folders, the counter's value should be below 10; otherwise, the value should be below 500 at all times.

• MSExchangeIS Public\Receive Queue Size—The number of email messages in the public folder store's receive queue. Expectation: the counter's value should be below 500 at all times.
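These thresholds are easier to maintain as data than as scattered alert rules. A sketch: the counter paths follow Table 5.1, and how the sample values are collected (PerfMon, your monitoring tool) is assumed:

```python
# Table 5.1 expressed as data: counter path -> maximum acceptable value.
QUEUE_THRESHOLDS = {
    r"SMTP Server\Local Queue Length": 1000,
    r"SMTP Server\Remote Queue Length": 1000,
    r"SMTP Server\Categorizer Queue Length": 10,
    r"MSExchangeIS Mailbox\Send Queue Size": 500,
    r"MSExchangeIS Mailbox\Receive Queue Size": 500,
    r"MSExchangeIS Public\Send Queue Size": 500,
    r"MSExchangeIS Public\Receive Queue Size": 500,
}

def breaches(samples):
    """Return (counter, value, limit) for every sample at or over its limit."""
    return [(name, value, QUEUE_THRESHOLDS[name])
            for name, value in samples.items()
            if name in QUEUE_THRESHOLDS and value >= QUEUE_THRESHOLDS[name]]

current = {
    r"SMTP Server\Local Queue Length": 40,
    r"SMTP Server\Categorizer Queue Length": 25,  # over its limit of 10
}
for name, value, limit in breaches(current):
    print(f"ALERT {name}: {value} >= {limit}")
```

Keeping the table as one dictionary means a changed recommendation is a one-line edit rather than a hunt through alert definitions.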


Troubleshooting Message Queues

After you identify a problem queue, you can use ESM or your monitoring solution (which will use or emulate ESM's functions) to investigate the cause. ESM's common queue viewer lets you see the status of all queues on a server regardless of their protocol. You can display summary information about message queues by expanding the administrative group that contains the SMTP virtual server whose message queues you want to examine, expanding the Servers folder, and selecting Queues. The details pane lists information about each message queue. You can use Choose Columns (from the View menu) to modify the default order of the columns, hide columns, or show hidden columns.

ESM displays the message queue's name and protocol, the Exchange component that adds entries to the queue, the queue's state, its current number of email messages, the total size of all messages in the queue, the time at which the client submitted the oldest message, the time at which the next retry attempt will be made (for queues in a Retry or Scheduled state), and whether the queue is a system or a link queue.

A queue's connection state can indicate whether the queue is experiencing problems. The seven connection states are

• Disabled—Indicates that the link between this server and the next-hop server isn’t available.

• Active—Indicates that an active connection exists between this server and the next-hop server.

• Ready—Indicates that the queue is ready for a connection to be allocated to it.

• Retry—Indicates that previous connection attempts have failed and the server is awaiting another attempt.

• Scheduled—Indicates that the queue is waiting for the next scheduled connection attempt.

• Remote—Indicates that the queue is waiting for a remote system to accept delivery of the email message.

• Frozen—Indicates that the administrator has frozen the queue to prevent email messages from exiting the queue. Other email messages can still be added to a frozen queue. (Freezing an Active queue immediately terminates that queue's transport sessions.)

Troubleshooting a Connection to a Remote Server

Until Exchange can deliver email messages to a receiving process on another system, the email messages remain in a queue. A queue that contains email messages older than the rule-of-thumb 15-minute time period might indicate a connection problem. The oldest email message in the queue tells you how long messages have been waiting for delivery. If network or system problems prevent immediate delivery of an email message, the virtual server places the message at the end of the queue—and schedules a retry. If the problem still exists at the time of the retry, the email message is again placed at the end of the queue.

To change a queue that's in the Retry or Scheduled state to the Active state, you can use the Force Connection command. Force Connection makes the queue resume processing email messages as if the retry time had been reached. A queue that's in the Retry state because of a network or server error will quickly return to that state until the underlying error is corrected. To force an immediate connection, use ESM or your monitoring solution to display a summary of the queues for the server. Right-click a queue and select Force Connection to force an immediate connection to the remote server.


Note: If, after you've deployed Force Connection, the message queue fails to connect, you have a network problem between this server and the remote server, or the remote server is down.

Freezing and Unfreezing Queues

Freezing a queue—temporarily preventing email messages from exiting the queue—supports several troubleshooting activities. When you freeze a queue, you change the message queue's connection state to Frozen and no email messages leave the queue (i.e., the currently queued email messages aren't delivered). New messages can join a frozen queue, but they won't be delivered until you unfreeze the queue. You can use ESM or your monitoring solution to freeze or unfreeze a queue by right-clicking the queue and selecting Freeze or Unfreeze.

Tip: Freezing outbound-message queues is a good way to suspend sending messages when you face a virus outbreak.

Examining Message Queues

When your monitoring solution indicates that a message queue might not be working, you can examine the queued email messages for clues to the problem. Because you can view information about individual email messages, you can manage them (e.g., freeze or delete them) at a granular level.

Viewing individual email messages in a queue is resource-intensive. When you request message display, ESM or your monitoring solution lets you list a specific number of email messages in a queue or filter to select a subset of the queued messages. After you specify which email messages you want to view, ESM or your solution displays a set of message properties for each one.

You can display the queues for a given server. Right-click the queue whose email messages you want to see and select Find Messages. A Find Messages dialog box lets you enter search criteria. You can select email messages by sender, recipient, message state (e.g., frozen), and more. You can then click Find Now to display email messages that match the search criteria.

The Search Results pane displays the email message's sender, state, and size, the client submission time, the Exchange server reception time, and the time at which Exchange will discontinue delivery attempts and delete the message from the queue. (To see an email message's subject, you would need to select Enable subject logging and display on the General tab of an individual server's Properties page.)

The submission time of outgoing email messages that originate on your local server tells you how long an email message has been queued for delivery. An outgoing message queued for more than 15 minutes might indicate a problem with your Exchange environment.

After ESM or your monitoring solution displays the selected email messages, you can sort them by clicking the column heading. You can examine, freeze, unfreeze, and delete individual displayed email messages. If you want to display a queued email message's Properties dialog box for troubleshooting purposes, double-click any email message in the Search Results pane. The dialog box displays additional information about the message, including the Message ID.

Freezing and Unfreezing Messages

You can freeze a specific email message much as you would freeze a queue. A frozen email message remains in the queue until you unfreeze it. Freezing an email message lets you keep a suspicious message in the queue until you can review it. If a large message is blocking a queue, you can freeze that message to let the queue process other messages.

Tip: When you use ESM or your monitoring solution to freeze an email message in a queue, you must first freeze the queue. Otherwise, the email message that you want to freeze might be processed and exit the queue before you can freeze it.

After you've frozen the queue, right-click the queue and select Find Messages. Specify message-search criteria and click Find Now. In the Search Results pane, right-click the email messages you're interested in and select Freeze or Unfreeze. Freezing and unfreezing email messages won't affect other email messages in the queue. Don't forget to unfreeze the queue.

Note: When you unfreeze the queue, the email messages you have frozen stay frozen.

You can use several methods to delete email messages. Although you can delete all email messages in a queue with a single command, it's usually safer to select individual email messages for deletion. You can use the Find Messages dialog box's Search Results pane to select messages for deletion or use a custom filter to delete messages whose criteria you specify (e.g., size, sender, delivery delay).

Caution: An email message that you delete from a queue won't be delivered to its recipient. Deletions are permanent and destructive.

To delete an individual email message from a queue, use ESM or your monitoring solution to freeze the queue, right-click the frozen queue, select Find Messages, specify message criteria, and click Find Now. ESM will display email messages that match the search criteria in the Search Results pane. Right-click the email message you want to delete, then select Delete Messages (with or without an NDR to notify the sender). Don't forget to unfreeze the queue.


Tip: When you delete messages from a queue, I recommend always notifying the sender.

Monitoring Specific Exchange Performance Counters

To monitor the performance or measure the use of specific Exchange Server functions, you can deploy a variety of performance counters. A key area of Exchange Server that can influence response time for online Outlook users is the IS service's processing of RPC requests.

RPC Requests

Outlook constantly makes RPC requests to Exchange Server as users move between folders and compose messages or update other objects. Two key Exchange Server RPC performance counters are MSExchangeIS\RPC Requests and MSExchangeIS\RPC Averaged Latency. By default, the IS can service only 100 RPC requests at a time.

Therefore, a good rule of thumb is to set your alert threshold for MSExchangeIS\RPC Requests to 30. RPC Averaged Latency is computed based on the most recent 1024 packets. Start your alert threshold for the MSExchangeIS\RPC Averaged Latency counter at 50 milliseconds (ms) and raise it until you stop getting alerts for typical operations.

Store.exe

Although monitoring physical RAM and virtual memory use at the OS level is important for catching overall memory problems, it's also important to monitor the store.exe process's virtual memory use. Regardless of how much physical RAM your server has, store.exe has a limited amount of virtual memory it can address—and you can't solve virtual memory problems by adding RAM.

Exchange manages virtual memory and heap resources by using its own specially developed constructs. Available virtual memory and fragmentation in virtual memory are as important as disk space.

You can use four easy-to-monitor counters to detect problems with store.exe's virtual memory. To monitor for a shortage in free blocks of virtual memory, set your alert threshold to let you know if MSExchangeIS\VM Total Free Blocks drops below 1. To detect fragmentation, configure alerts if VM Largest Block Size drops below 32MB, if VM Total 16MB Free Blocks drops below 1, or if VM Total Large Free Block Bytes drops below 50MB.
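The four virtual-memory floors combine naturally into one compound rule. A sketch; the counter keys are shorthand for the MSExchangeIS counters named above, not literal PerfMon paths, and sample collection is assumed:

```python
def store_vm_alert(counters):
    """Return the names of store.exe virtual-memory counters that have
    crossed the floors recommended in the text. Keys are shorthand for
    the MSExchangeIS counters, with byte values for the size counters."""
    MB = 1024 * 1024
    rules = [
        ("VM Total Free Blocks",            lambda v: v < 1),
        ("VM Largest Block Size",           lambda v: v < 32 * MB),
        ("VM Total 16MB Free Blocks",       lambda v: v < 1),
        ("VM Total Large Free Block Bytes", lambda v: v < 50 * MB),
    ]
    return [name for name, broken in rules if broken(counters[name])]

sample = {
    "VM Total Free Blocks": 3,
    "VM Largest Block Size": 16 * 1024 * 1024,   # below the 32MB floor
    "VM Total 16MB Free Blocks": 2,
    "VM Total Large Free Block Bytes": 200 * 1024 * 1024,
}
print(store_vm_alert(sample))  # ['VM Largest Block Size']
```

A non-empty result is the cue the text describes: schedule a restart of the Exchange services, and escalate if the alerts recur.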

You also need to detect problems with exchmem, Exchange Server's own heap mechanism. If exchmem gets too fragmented, Exchange Server performance suffers. To determine the level of exchmem fragmentation, set up alerts that tell you if either MSExchangeIS\Exchmem: Number of heaps with memory errors or MSExchangeIS\Exchmem: Number of memory errors exceeds 0.

Occasionally, Exchange Server might need to create additional heaps. However, MSExchangeIS\Exchmem: Number of Additional Heaps should never exceed 3. Therefore, configure an alert for this threshold as well.

If any of these alerts are triggered, schedule a restart of Exchange Server's services as soon as possible. If the alerts continue to occur, escalate your troubleshooting efforts and contact Microsoft support.


Exchange Server has a variety of other performance counters that you can monitor to evaluate specific areas of Exchange Server functionality. For example, you can monitor the number of email messages the server sends and receives by using inbound and outbound counters on connectors such as MSExchangeIMC and MSExchangeMTA. Explore the performance objects in the MMC Performance Monitor snap-in to become familiar with what's available.

Two particularly valuable performance objects to monitor are MSExchangeDSAccess Caches and MSExchangeDSAccess Processes. MSExchangeDSAccess Caches lets you monitor the DSAccess cache with counters for cache insertion, searches, LDAP queries, objects not found, and total entries in the cache. MSExchangeDSAccess Processes helps you monitor LDAP search calls and the time taken to send, search, and read requests—and receive a response. The object has instances for mad.exe, store.exe, inetinfo.exe, and emsmta.exe.

For more information about performance objects, go to the message flow table displayed at http://www.winnetmag.com/files/04/5033/table_02.html.

Monitoring Through WMI

Exchange Server performance counters are sufficient for monitoring performance metrics within a sophisticated monitoring solution. Such a solution lets you build larger monitoring rules by linking thresholds and other conditions into compound structures with Boolean logic. But what if you need to monitor Exchange from within a script? The answer is WMI. WMI classes are objects that let you query areas of Windows or other applications by using simple SQL-like query commands.

Exchange Server implements a number of WMI classes that it uses internally, but you can use them as well. The ExchangeLink class provides information about each link's state and properties, including retry count and queue length. The ExchangeQueue class provides detailed information about email message queues and their contents. You can determine the name of the link for which the queue is waiting or iterate through the email messages in the queue. You can also perform the various operations, such as freeze and unfreeze, that I described previously.

The ExchangeServerState class is highly valuable because it provides information about OS resources crucial to Exchange Server, including disk space, memory, and services, as well as other details about the state of Exchange Server itself. To learn more about Exchange Server's WMI classes and about using scripts to access them, go to http://www.microsoft.com/technet/prodtechnol/exchange/guides/whatnewe2k3/7af2c704-d772-4578-b5ff-c345c7bdbd02.mspx.
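To make the SQL-like query style concrete, here is a sketch of a script-side queue check. The namespace path and property names (QueueName, NumberOfMessages) are assumptions to verify against the WMI documentation for your Exchange version; on a real server you would execute the WQL through a WMI binding, so this sketch only builds the query text and filters rows shaped like the ones such a query would return:

```python
# Assumed namespace for Exchange's WMI classes -- verify for your version.
NAMESPACE = r"root\cimv2\applications\exchange"

def long_queue_query(min_messages):
    """Build WQL for queues at or above a given depth. The property
    names are assumptions, not confirmed ExchangeQueue properties."""
    return ("SELECT QueueName, NumberOfMessages FROM ExchangeQueue "
            f"WHERE NumberOfMessages >= {min_messages}")

def flag_long_queues(rows, min_messages):
    """Client-side equivalent of the WHERE clause, for mock result rows."""
    return [r["QueueName"] for r in rows
            if r["NumberOfMessages"] >= min_messages]

mock_rows = [
    {"QueueName": "next.com (Remote delivery)", "NumberOfMessages": 1500},
    {"QueueName": "localdomain (Local Delivery)", "NumberOfMessages": 12},
]
print(long_queue_query(1000))
print(flag_long_queues(mock_rows, 1000))  # ['next.com (Remote delivery)']
```

The point is the shape of the approach: a WQL SELECT with a WHERE clause replaces hand-rolled polling logic, and the script only has to act on the rows that come back.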

Monitoring Through the Application Log

A discussion of Exchange Server monitoring wouldn't be complete without a look at event logging. Exchange Server reports events to the Application log. The level of logging you configure through ESM or your monitoring solution determines the number and total volume of events logged.

Exchange Server has four diagnostic logging levels. When Exchange Server logs events, each service corresponds to a different event source. Each source in turn has one or more categories. You can configure a different level for each category of event. This configuration option is important because enabling diagnostic logging at even a minimal level can produce a huge number of events in a short time.

Chapter 5 Monitoring Exchange Server 71

Brought to you by Argent and Windows IT Pro eBooks


The default level is None. However, that designation is misleading. Level None still logs events for critical problems and is, in fact, probably sufficient for day-to-day monitoring. Levels Minimum, Medium, and Maximum monitor increasingly granular details of each operation Exchange performs. Usually, you would use these levels only when you're diagnosing a very specific problem or when Microsoft support technicians request it. Table 5.2 shows which event source in the Application log corresponds to which Exchange service.

Table 5.2 Application log service name event sources

Service Name Event Source

Microsoft Exchange Connector for Novell GroupWise LME-GWISE

Microsoft Exchange Connector for Lotus Notes LME-Notes

Microsoft Exchange Connector for Lotus cc:Mail MSExchangeCCMC

Microsoft Exchange Router for Novell GroupWise MSExchangeGWRtr

MS Mail Connector Interchange MSExchangeMSMI

MS SchedulePlus Free-Busy Connector MSExchangeFB

Microsoft Exchange Directory Synchronization MSExchangeADDXA

Microsoft Exchange IMAP4 IMAP4Svc

Microsoft Exchange Information Store MSExchangeIS

Microsoft Exchange MTA Stacks MSExchangeMTA

Microsoft Exchange POP3 POP3Svc

Microsoft Exchange Routing Engine MSExchangeTransport

Microsoft Exchange Site Replication Service MSExchangeSRS

Microsoft Exchange System Attendant MSExchangeSA, MSExchangeAL, MSExchangeDX

Together, these event sources can log hundreds of different event IDs, which indicates how many things can potentially go wrong in Exchange. You can see that it's not practical to try to identify each individual event that you could monitor.

To begin with, you should start monitoring for warnings and errors from the IS, System Attendant, and Routing Engine, as well as any other optional services important to your environment. As you monitor these events, you'll soon identify innocuous warnings and events that you can suppress from triggering alerts.
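That starting rule can be sketched as a simple filter. The event records are modeled as plain dictionaries here, and the source names and the suppression set are illustrative assumptions; your monitoring solution supplies the real records and its own rule syntax.

```python
# Sketch: alert on warnings and errors from core Exchange event sources,
# minus a suppression list of event IDs you have judged innocuous.
CORE_SOURCES = {"MSExchangeIS", "MSExchangeSA", "MSExchangeTransport"}
ALERT_LEVELS = {"Warning", "Error"}


def should_alert(event: dict, suppressed_ids=frozenset()) -> bool:
    """True if this Application-log record warrants an alert."""
    return (event["source"] in CORE_SOURCES
            and event["level"] in ALERT_LEVELS
            and event["id"] not in suppressed_ids)
```

As you identify harmless noise, you grow the suppressed set rather than loosening the source or level filters.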

However, two particular areas of activity are worth your attention. If you have an antivirus solution that integrates with Exchange Server, you can identify problems that Exchange detects with the antivirus solution by monitoring for Application log errors and warnings in the range of events from event ID 9565 (Invalid virus scanner configuration) to event ID 9581 (Virus scanner was not loaded). The text of event ID 9565, for example, informs you that the virus scanner can't be started because of an invalid virus scanner configuration.

The second area has to do with virtual memory problems, which I discussed previously. If Exchange Server detects a problem with its virtual memory, it will log event ID 9582 with the following text: "The virtual memory necessary to run your Exchange server is fragmented in such a way that normal operation may begin to fail. It is highly recommended that you restart all Exchange services to correct this issue." Depending upon the severity of the problem, Exchange will log the event as an error or warning. You definitely want to be alerted if this event occurs and to restart Exchange Server as soon as you can schedule it.
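A hypothetical classifier for these two event-ID ranges might look like the following; the IDs come from the text above, but treat the mapping as a sketch rather than a complete catalog of Exchange events.

```python
def classify_exchange_event(event_id: int) -> str:
    """Map an Application-log event ID to the trouble areas discussed above."""
    # 9565-9581: virus-scanner integration problems (9565 = invalid virus
    # scanner configuration, 9581 = virus scanner was not loaded).
    if 9565 <= event_id <= 9581:
        return "antivirus"
    # 9582: virtual memory fragmentation; schedule an Exchange restart.
    if event_id == 9582:
        return "virtual-memory"
    return "other"
```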

Miscellaneous Monitoring
Because Exchange Server is tightly integrated into the rest of the Windows infrastructure, problems with your overall Windows network can affect Exchange Server as well. DNS connectivity is crucial to Exchange Server's proper functioning. If you don't already monitor the health of your DNS servers, you should start now.

The easiest way to monitor DNS is to monitor for DNS source warning and error events in the System logs on all your DNS servers. However, the most effective way to detect DNS problems of any kind is to set up a task that regularly queries the DNS server for an IP address to ensure that the server is functioning properly. Most monitoring solutions let you set up such DNS resolution checks.
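A minimal version of such a resolution check is sketched below. It tests only whatever resolver the monitoring host is configured to use; querying a specific DNS server directly would require a dedicated resolver library, which most monitoring products build in.

```python
import socket


def dns_check(hostname: str) -> bool:
    """Return True if the monitoring host can resolve the name.
    Schedule this every few minutes against a name your users depend on."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False
```

If the check returns False on consecutive runs, raise an alert; a single failure may just be a transient.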

Since the release of Exchange Server 2000, Exchange no longer has its own LDAP directory. Instead, Exchange relies on AD for user accounts, distribution lists, and other directory information. For recommendations about key aspects of AD monitoring, see Chapter 3.

Also, the root cause of many Exchange problems has nothing to do with Exchange at all but results from connectivity problems with your WAN, VPN, or Internet connections. Whenever you detect an Exchange-related problem that involves communications with other Exchange or messaging servers, check the status of these resources first.

Monitoring the Many Facets of Exchange Server
Exchange Server provides a wide array of ways to monitor its many features. If you want to prevent service outages, monitor free disk space on each volume of Exchange and keep up with your server's CPU, disk, and memory use trends so that you can preempt problems with timely upgrades. More importantly, monitor your message queues because they're one of the best ways to immediately detect a problem. And to respond quickly to problems that Exchange Server itself recognizes, monitor your key Exchange servers' Application logs for warnings and errors.

A final recommendation: One of the best ways to detect a problem with any application is to put a process in place that constantly performs test operations against the server that mirror what real users are doing. In terms of Exchange, you might schedule a script as a task that every few minutes sends an email message to a certain mailbox and checks that mailbox for the email message's arrival. If the email message fails to appear within the amount of time you expect, trigger an alert. To prove that email messages are making it from one Exchange server to another, use a mailbox on a different server from the one where you originate the message.
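In outline, such a script reduces to the loop below. The send and mailbox-check steps are passed in as callables because the real implementations (smtplib plus IMAP/POP3, CDO, or whatever your environment uses) are site-specific; this is a sketch of the control flow, not a working mail client.

```python
import time


def mail_roundtrip_check(send_message, mailbox_contains, token: str,
                         timeout_s: float, poll_every_s: float = 1.0) -> bool:
    """Send a probe message carrying a unique token, then poll the target
    mailbox until the token appears or the timeout expires.

    send_message(token)      -- stand-in for your SMTP send
    mailbox_contains(token)  -- stand-in for your mailbox check
    """
    send_message(token)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if mailbox_contains(token):
            return True
        time.sleep(poll_every_s)
    return False  # trigger an alert in the caller
```

Using a unique token per run (a timestamp or GUID in the subject line) keeps one delayed probe from satisfying a later check.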

The good news is that such scripts have already been written for you. The Exchange Management Pack provides scripts to test mail flow between servers. These scripts periodically send email messages and verify that the email messages are received. You can learn more about these scripts in the Exchange 2003 Management Pack Configuration Guide, at http://go.microsoft.com/fwlink/?LinkId=25436.


Next: Capacity Planning and Trend Analysis
Proactive monitoring leads naturally to capacity planning. First, you need to consolidate your data. Second, you need to interpret it. Proper reporting and trend analysis help you put together an accurate picture of your environment's current state and future needs. And automating the processes that build that picture will make your life as an administrator easier.


Chapter 6: Trend Analysis and Capacity Planning

Someone once said that statistics never lie but that people use statistics to lie all the time. Lying aside, it's easy to draw misleading conclusions from numbers. Although you can devote a career to trend analysis and its related disciplines, you can also spend a little time learning some basics of analysis and reap good results with insightful planning for increased capacity.

If you don't have a head for figures, don't worry. Microsoft Excel and other applications take care of many of the details for you. With each chapter of Taking Control: Monitoring the Windows Platform Proactively, I've explored monitoring different areas of (or closely related to) the Windows platform. I've concentrated on what you should monitor and how to use alerts for tactical responses to unexpected problems.

Nothing is an adequate substitute for tactical measures such as near real-time monitoring and alerts. However, you can preempt many problems if you add some strategic measures to your monitoring effort. In this final chapter, I look at what you can do with all those numbers that you configure your monitoring solution to collect.

Trend Analysis: Establishing Baselines
Real-time (or, more accurately, short-term) monitoring is actually a form of trend analysis, based on a much smaller interval than usual. Making short-term monitoring effective, however, depends upon establishing an accurate baseline by performing long-term trend analysis.

Every server represents a "turn" on the kaleidoscope of hardware combinations, use patterns, and business processes unique to a department or company. The only way to determine what's typical for your server is to watch it. But establishing baselines is just the beginning of trend analysis.

You can use historical data to plot a line into the future and predict when critical thresholds will be reached. For example, if you capture the amount of free disk space you have once per day for a number of months, you can average your findings to discover how disk-space consumption is growing monthly. If you then project that growth into the future, you can determine when you'll need to add more space or look at getting rid of unneeded files.
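The arithmetic behind that projection is an ordinary least-squares line. As an illustration (the sample numbers below are invented), a plain-Python version looks like this:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def periods_until(threshold, slope, intercept, current_period):
    """Solve slope*x + intercept = threshold for x, relative to now."""
    return (threshold - intercept) / slope - current_period


# Invented example: disk use grew from 75% to 81% over four monthly samples.
slope, intercept = linear_fit([0, 1, 2, 3], [75.0, 77.0, 79.0, 81.0])
```

With 2 percentage points of growth per month, a 89 percent threshold is four months away from the last sample; that lead time is what lets you schedule the upgrade instead of reacting to a full disk.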

Of course, linear trend analysis is notorious for looking at the universe as if the observed subject will continue in its current overall direction at the already observed rate of acceleration or deceleration. In the real world, of course, conditions change.

Vendors introduce new products, prices drop, and orders go through the roof instead of obediently following the vector that historical data predicts. Or you buy more capacity to accommodate anticipated growth, then sales fall apart as they did after the dot-com boom, leaving telecommunications companies with loads of unused bandwidth.

Nevertheless, linear trend analysis is an easy and inexpensive analytical tool and, in many situations, it does the job. For example, linear trend analysis works well for small to medium-sized systems whose computing capacity can easily be expanded with commodity hardware. But enterprise systems that already stretch the envelope of top-end hardware require more complex analysis that you can achieve only with true analytic modeling.

Analytic modeling isn't something you can do with simple spreadsheets, but software packages are available that let you input baseline data collected from your current systems, then play with different configurations to see how system changes and growth assumptions will affect performance. Modeling lets you quickly try different what-if scenarios without a major investment of time and effort in testing.

Compensating for Anomalies
Because all of your analysis depends on your baseline, it's important to capture as accurate a baseline as possible. Unfortunately, creating a baseline doesn't involve just capturing counter values. Anomalous values can creep into your baseline and throw it off. For example, you'll see occasional spikes because of unusual activity. A spike might occur if you run a defragmentation application one evening or execute a report in response to an auditor's request. System startups also cause spikes as the system performs its IPL and pages memory in and out until it establishes its working set.

On the other hand, some events can cause zero values or gaps. Such valleys can occur with performance counters taken from services such as Microsoft SQL Server or Microsoft Exchange when the service is temporarily down because of a crash or for system maintenance. All of these events can skew your baseline, keeping it from establishing a meaningful representation of the typical system workload that you're trying to understand.

You can compensate for anomalies in your baseline data two ways. You can either filter them out of the collected time period or try to drown them with a high sampling rate. Let's look first at filtering out the anomalous data.

Filtering Anomalies
It's easy to filter out peaks and valleys if your data resides in an ODBC database, which is where most monitoring solutions store data. Sometimes, all you need to do is write a SELECT statement whose WHERE clause excludes anomalously high and low data.
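A toy version of that filtering is shown below, using SQLite in place of an ODBC store. The table and column names, and the 5-to-90 "typical" band, are invented for the sketch; your monitoring product's schema and your own baseline judgment will differ.

```python
import sqlite3

# In-memory stand-in for the monitoring database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (taken_at TEXT, cpu_pct REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("2004-01-05 09:00", 40.0),   # typical load
     ("2004-01-05 10:00", 50.0),   # typical load
     ("2004-01-05 23:00", 99.0),   # defrag job: anomalous peak
     ("2004-01-06 03:00", 2.0)])   # service down: anomalous valley

# Exclude the anomalous highs and lows before computing the typical range.
typical_avg = conn.execute(
    "SELECT AVG(cpu_pct) FROM samples WHERE cpu_pct BETWEEN 5 AND 90"
).fetchone()[0]
```

The same WHERE clause works against any ODBC source; only the connection code changes.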

This method works if you're trying to seek the typical range for a fairly static variable, such as CPU use during a server's usual workload. However, this method would be disastrous if you plan to plot the data on a chart for which one axis is some measurement of time, for example, a chart on which you're plotting increased CPU use over the course of some weeks or months. In that case, how do you determine what's anomalous?

Adjusting Data Toward Your Goal
First, graph the undoctored data with Excel, including dates and times on the graph. When you see weird highs and lows, consult your server maintenance log or operations incident database as well as your own knowledge of operational and business routines that affect the server to identify the reason for the irregularities. Then, you can decide whether such periods should remain part of your baseline or be removed.

Base your decision on your goal for the baseline. If you're establishing disk-space requirements, be careful about culling out peaks. The peaks might be related to processes or events that are likely to occur again (e.g., database reorganizations, extracts for other applications). With disk space, you must be able to accommodate peak requirements, whereas with transactions, the requirement might be to maintain a certain level of response time under usual loads but to accept slower response times during peak order times (e.g., during holiday seasons).

Another caution regarding your baseline data applies even if you don't adjust the data. A problem can occur when you seek to tie data to specific time periods (e.g., days of the week) in your final baseline. The problem arises if you need an accurate sum of particular data for each time period.

For example, if you cull a portion of Monday's data to compensate for some anomaly, you throw off Monday's baseline. If, on the other hand, you've collected data every hour and seek a daily average for each day of the week, you'll be OK (provided you divide each day's sum by the number of snapshots actually collected and not by the number of hours in a day).
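That caveat, dividing by the number of snapshots actually collected rather than by 24, looks like this in miniature:

```python
from collections import defaultdict


def daily_averages(samples):
    """samples: iterable of (day, value) pairs. Divide each day's sum by the
    number of snapshots actually collected for that day, not by the hours in
    a day, so culled or missed hours don't drag the average down."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in samples:
        sums[day] += value
        counts[day] += 1
    return {day: sums[day] / counts[day] for day in sums}
```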

I haven't covered, and of course can't cover, every possible concern. And I don't pretend to be a statistician. However, I know that a little common sense goes a long way to prevent the garbage-in/garbage-out syndrome when you try to derive useful numbers from your monitoring data.

Drowning Anomalies
The other, more strategic method for dealing with irregularities in your baseline involves "drowning them out" with volumes of data. If the anomalies are truly anomalous and don't occur too frequently, a higher volume of typical data will smooth out the irregularities.

To use this approach, you can either sample more frequently (e.g., every 5 minutes instead of every hour) or sample for a longer period of time (e.g., for 6 months rather than for 2 months). Of course, sometimes you need your baseline now and can't wait 6 months. You might also need to be concerned about creating an undue load on the system by sampling too frequently or by using too much of a limited amount of storage space for baseline data. Usually, however, you can find an approach that gives you granular enough data without breaking the system you're monitoring or your monitoring solution.

Capacity Planning: Predicting the Future
Even if you napped during math class, you can still perform some pretty cool trend analysis by using Excel. Excel makes it easy to use regression analysis to reveal trends in your data, then project those trends into the future. Regression analysis estimates the relationship between variables so that you can predict a given variable from one or more other variables.

Most commonly, you'll use a line or bar chart, but you can throw a trend line on any of Excel's two-dimensional area, bar, column, line, stock, xy (scatter), or bubble charts (but not on pie, three-dimensional, or stacked charts). Depending on the type of trend you're analyzing, you can choose from different algorithms that suit one type of trend analysis better than another.

Choosing the Right Trend Line
You need to have some idea of the shape of the line your data might take to choose the right type of Excel trend line to use. One of the most common trends in system monitoring is charting the rate of growth of data (e.g., disk-space use). Disk-space growth on file servers is usually straightforward because users tend to create new files at a steady rate, and they seldom delete files.

Chapter 6 Trend Analysis and Capacity Planning 77


Linear Trend Line
For such data, you can use the simple linear trend line. For example, in the chart that Figure 6.1 shows, the red line represents data collected by monitoring. The dotted line shows the trend of that actual data projected into the future by using linear trend analysis.

Figure 6.1 Linear trend analysis of disk-space use (percent of disk space used, 75 to 89 percent, over the first half of 2004)

When you plot a trend line on your chart, Excel provides an easy-to-use sanity check to gauge how much confidence you should place in the type of trend line you selected and in the predicted trend. The sanity check is the r-squared value. Without waxing too mathematical, the r-squared value is a number between 0 and 1 that shows you how closely the estimated values on the trend line compare to your actual values. A perfect baseline produces an r-squared value of 1.0, but you can usually be happy with anything greater than .9.
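For reference, the r-squared value Excel reports can be computed as follows; this is a minimal sketch of the standard coefficient-of-determination formula, not Excel's exact implementation.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus the ratio of residual error
    to the total variation in the actual data."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A trend line that reproduces the data exactly scores 1.0; values above .9 suggest the chosen trend-line type fits your baseline well.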

Exponential Trend Line
Other trend-line types are useful when you plot data that conforms to some type of curve. For example, your typical transaction server provides a fairly static response time for increasing levels of transactions per second until it reaches a certain threshold at which the number of transactions overwhelms the system's cache optimization and capacity. Response time starts to curve sharply upward as the system is tasked with more and more transactions.

Perhaps you've captured use data on a transaction that provides plenty of data for low, usual, and current high rates for the system. Users report that on heavy entry days, the server slows down more and more. You would like to know at what point the number of requests per second causes the response time to become unacceptable. (I'm presuming that you have a rule of thumb for acceptable response time.) However, you don't have the luxury of running a benchmark program because the server is required 24 hours a day.



This situation offers a perfect opportunity to use another form of trend analysis: the exponential trend line. An exponential trend line is useful when your data grows or drops off at an increasingly higher rate. Excel will take the curve that the data you collected already exemplifies and plot that curve into uncharted waters. You'll be able to observe unacceptable response times and predict the breaking point of your server without actually breaking it. After you know how many transactions per second crosses the threshold of unacceptability, you can predict when that threshold will be crossed.
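Outside Excel, the same exponential fit can be done by running a linear fit on the logarithms of the observations, a standard log-linear transform. The sample data below is invented to exercise the sketch.

```python
import math


def exponential_fit(xs, ys):
    """Fit y = a * exp(b*x) by least squares on ln(y); every y must be > 0."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    b = (sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_l - b * mean_x)
    return a, b


def project(a, b, x):
    """Evaluate the fitted curve at a future x to find the breaking point."""
    return a * math.exp(b * x)
```

Once the fitted curve crosses your acceptable-response-time ceiling, the corresponding x is the load level to plan around.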

Suppose that your company's business is growing at a pretty steady rate. Based on other charts and conversations with users, you've determined that transaction rates peak monthly as customers rush to make purchases before closing out the month. You can then plot a linear trend line by using the peak transactions per second from each month. Projecting that line into the future, you determine that you have about 7 months before the end-of-month processing will cause the server to grind to a halt. You can then start considering which type of upgrades will address the problem.

To do so, you need to identify where the bottleneck occurs. Is it a matter of CPU use, or does your system lack enough memory to hold the application's working set of code and sufficient cache to accommodate the breadth of the database that the transactions are hitting? You'll need to plot the relevant system resources on the same chart as the response-time curve during high-volume periods to determine which component is being overwhelmed.

Defining What's Typical
Even if you're largely interested in predicting the future, after you have a good baseline, you can put it to work in other ways. You can use the baseline to revise your alert thresholds to reflect some real numbers instead of the rules of thumb you used when you first set them up. By using averages, standard deviation, and other simple formulas, you can easily get an idea of the average value of a counter and the highs and lows to expect during typical use. From that point, you can program alerts and other actions for your monitoring solution to take when the counter falls outside expected ranges.

Because of daily, weekly, or monthly business processes, many counters don't follow a flat line. By plotting a particular performance counter on a line chart, you can determine what's typical for your particular system at different times during business periods that affect your server. If the data varies a lot depending on use patterns during the day or week, Excel can help you smooth things out so that you can recognize the overall pattern and know what to expect throughout the time period.

The best trend line to use when you have several peaks and valleys in your data is a polynomial trend line, which is designed for analyzing fluctuations over a large data set. When you plot a polynomial trend line, Excel will ask you to select an order. Use Order 2 when the data set has one peak or valley, Order 3 for two peaks or valleys, and Order 4 for up to three fluctuations. As with other trend lines, the r-squared value will help you gauge how much confidence you should place in the trend line's results.

Reporting
It's always frustrating for an IT professional to deal with subjective, anecdotal reports about the performance and availability of the systems under IT's control. The only way to combat negative anecdotal reports is to give your management some real evidence: a defense based on real numbers. This approach demonstrates yet another benefit of your investment in a monitoring solution.


And, thankfully, your solution should make reporting the easiest part of system monitoring. Many monitoring solutions let you design reports and schedule them to be emailed to appropriate managers daily, weekly, or monthly.

Look for data that provides a clear picture of the system's performance or availability. For example, most monitoring solutions let you compute system availability as a percentage based upon Ping replies. Another valuable gauge of availability is the number of planned and unplanned system restarts.
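The availability percentage itself is simple arithmetic over the probe results, as the small sketch below shows; the probe counts in the usage comment are invented.

```python
def availability_pct(successful_probes: int, total_probes: int) -> float:
    """Availability as the percentage of probe intervals that got a reply."""
    if total_probes == 0:
        raise ValueError("no probes recorded")
    return 100.0 * successful_probes / total_probes


# Example: 1,438 replies out of 1,440 one-minute probes in a day.
daily_availability = availability_pct(1438, 1440)
```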

Windows Server 2003 makes it easy to monitor such events with its tracking of system restarts (and the reasons for them) in the event log. Producing a monthly record of system availability and restarts is a good way to prove your system's reliability.

Monitoring: Some Reminders
No matter what size IT shop you work in, system monitoring is a must. You can find monitoring tools to fit your budget and level of need. At the very least, monitor your server's Application, System, and Security logs for warnings, errors, and suspicious security events. After all, if your server can recognize a problem, you need to know about it.

In addition, some basic availability monitoring can often alert you to failures before your users notice, and it's simple to monitor availability. At the very least, ping your important systems every few minutes or so. For a better verification of availability, schedule your monitoring solution to periodically perform an application-level request or transaction to make sure that the system is truly responsive at the user level.

For Web servers, verifying availability might mean requesting a Web page; for other systems, it might mean performing a query against a database server, reading a file on a file server, or sending an email message through Exchange. Trying to predict imminent problems based on performance counters is trickier and requires more effort, as you've read. The key to configuring the right thresholds for alerts is to get a decent baseline, which requires time, storage space, and analysis.

Be judicious in your use of alerts because your operations staff will start to ignore a monitoring solution that cries wolf too often. Also, look for ways to use automation with your monitoring solution so that it takes prescribed courses of action based on events or thresholds instead of becoming a nagging system that requires constant babysitting.

You should also try to automate production and delivery of reports. If your monitoring strategy depends on someone having to remember to periodically run and analyze a report, the chances are that it won't be done regularly. If your system automatically produces the report and sends it to the person's Inbox, the information has a better chance of being reviewed.

Finally, remember that you need to monitor at more than one level. A network Ping command can't prove that your e-commerce system is necessarily up and ready to process purchases. On the other hand, if you monitor at the application level only, you won't have crucial data that you would need to diagnose the source of system slowdowns. Determine a base set of performance counters that you collect for each server so that you will already have a history of data to analyze when it's needed.

Proactive Monitoring
You can all too easily put off monitoring because it helps you only with tomorrow's problems. Chances are you have plenty of problems to deal with today. However, the sooner you start, the sooner your organization will begin to reap the benefits, and the better you'll look the next time your performance is evaluated.
