The present disclosure generally relates to systems and methods for monitoring health of actively executing computer applications, and more particularly to SQL server monitoring, Internet information services monitoring, server monitoring, vulnerability and security update analysis monitoring, SQL database free space monitoring, long running agent job monitoring, blocked server processes monitoring, and to related topics.
Ensuring that the health of applications based on Windows® and other systems can be easily monitored has become increasingly crucial, particularly as businesses have increasingly based their mission-critical applications on Windows®-based systems. Some of the key challenges facing computer systems administrators today include how to manage the health of key applications. Such applications include Microsoft® SQL Server 2000, a very complex relational database; Windows® Internet Information Services, upon which web front ends are built; and crucial operational aspects of the Windows® operating system. It is additionally important to support systems administrators to ensure that servers are deployed securely with regard to security updates and best practice configuration standards.
Monitoring the health of a SQL server, such as Microsoft® SQL Server 2000, can be difficult for some monitoring systems due, for example, to the large list of components that make up Microsoft® SQL Server 2000 and the wide range of configurable options for each of these. Many software customers have different configurations of Microsoft® SQL Server 2000 and may have intermixed configurations of SQL Server, where they are running multiple versions, multiple instances or different stock keeping units (SKUs) on a single computer. In such instances, the task of monitoring SQL Server is significantly more complex. For example, a customer can run Microsoft® SQL Server version 7.0 in a version switch configuration with Microsoft® SQL Server 2000. Furthermore, this customer may also be running a copy (or multiple copies) of Microsoft® Data Engine (MSDE) on the same computer that appears at first glance very similar to SQL Server Enterprise Edition. Accordingly, monitoring this customer's application would be difficult.
There are many elements to monitoring basic health of an operating system, but one of the most fundamental is to understand when a given server or set of servers is bottlenecked on physical resources. Although there are many causes of bottlenecking, the most common resource bottleneck is related to the number of processing cycles available to services running on the server. A significant complication has arisen in recent years where servers are designed to use all available processing resources without affecting the performance of the principal functions that the server is expected to perform. This may be accomplished by employing resource-throttling techniques that can be as simple as thread pools running at lower than normal thread priority. In these cases, looking solely at the processing utilization may not give a full picture of cycles available to the principal server functions, and thus more sophisticated algorithms may be required.
Another area that systems administrators should monitor is related to tracking the security posture of various types of servers. In a manner similar to many software applications, those running on servers may be prone to security vulnerabilities. These vulnerabilities may be related to the underlying platform (i.e. the OS), or related to user inexperience with management and maintenance of the application. Currently, a common way to alert users about vulnerabilities in the software that are due to software defects or flaws in the design is some form of public disclosure or bulletin. Microsoft® alerts users to problems through a document, mssecure.xml, that is easily downloadable over the Internet. However, this provision leaves the burden on the user to distribute and/or leverage the download in their distributed environment, and to determine the overall security posture of their applications and servers.
Accordingly, a need exists for a more complete solution to monitoring health of actively executing computer applications.
Systems and methods are described that monitor health of actively executing computer applications, and particularly which monitor relational database space availability. In one implementation, a warning threshold is defined for free space within a database located on a SQL server. The complexity of the database is assessed, in part by locating each file within the database. A health state is then established for each of the files located within the database, wherein the health state is based on a comparison of free space in each of the located files to the warning threshold.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
Overview
The following discussion is directed to related topics affecting the health of actively executing computer applications. In particular, SQL server monitoring, Internet information services monitoring, server monitoring, vulnerability and security update analysis, SQL database free space monitoring, long running agent monitoring and blocking server processes will be discussed. By monitoring aspects of these topics, synergistic interactions result, thereby promoting the health of actively executing computer applications.
SQL Server Monitoring
Although a database system may appear healthy when performing basic health checks, it may be performing poorly, either consistently or at inconsistent intervals. A common reason for poor performance of a relational database system is blocking. Blocking occurs when one connection from an application or process holds a lock on a SQL server resource and a second connection requests the same resource. Because the first connection holds the resource, the second connection is forced to wait, since it is blocked by the first. In this manner, one connection can block another connection, regardless of whether they originate from the same application or separate applications on different client computers.
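The blocking relationship described above can be illustrated with a minimal sketch. It assumes a hypothetical snapshot in which each session ID maps to the ID of the session blocking it (0 when the session is not blocked); the function names and the snapshot format are illustrative assumptions, not part of the described management pack.

```python
from typing import Dict, List


def find_blocked(sessions: Dict[int, int]) -> Dict[int, List[int]]:
    """Map each blocking session ID to the sorted session IDs it directly blocks.

    `sessions` maps a session ID to the ID of the session blocking it,
    or 0 when the session is not blocked (hypothetical snapshot format).
    """
    blocked_by: Dict[int, List[int]] = {}
    for session_id, blocker_id in sessions.items():
        if blocker_id != 0:
            blocked_by.setdefault(blocker_id, []).append(session_id)
    return {blocker: sorted(waiters) for blocker, waiters in blocked_by.items()}


def head_blockers(sessions: Dict[int, int]) -> List[int]:
    """Return sessions that block others but are not themselves blocked."""
    blocked_by = find_blocked(sessions)
    return sorted(b for b in blocked_by if sessions.get(b, 0) == 0)


# Session 52 waits on 51, and session 53 waits on 52; session 54 is idle.
snapshot = {51: 0, 52: 51, 53: 52, 54: 0}
blocking_map = find_blocked(snapshot)
roots = head_blockers(snapshot)
```

Reporting the head of a blocking chain, rather than every blocked session, is one way to point an administrator at the connection that actually holds the lock.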
Another common reason for poor performance is an agent job that overruns or exceeds a predefined running threshold. A job can perform a wide range of activities, including running Transact-SQL scripts, command line applications, and Microsoft® ActiveX® scripts. Jobs can be scheduled to execute at specific times or recurring intervals. A long running agent job might indicate a potential problem with the SQL server or with the specified SQL server agent job.
Accordingly, it is important for a monitoring system to proactively identify such conditions, so that: common user experience problems are identified (e.g. a user querying for data and waiting an unacceptable period of time because of a block); important data uploads are performed within an acceptable period of time, thereby making data available when required (e.g. by the start of business the following day); and data upload or maintenance jobs run during off-peak (non-business) hours to avoid affecting the performance of the database system.
Accordingly, pro-active monitoring of SQL database health is important. In one implementation of these concepts, a Microsoft SQL Server 2000 management pack runs from Microsoft Operations Manager 2005 agents installed on computers that are being monitored. From this agent, the management pack can discover the relevant aspects of Microsoft® SQL Server to be monitored. Prior to performing a health check the management pack can first identify: the components which have been installed by the user which should be monitored; instances of each component that have been installed; prior versions or different SKUs of Microsoft® SQL Server; and the different configuration options of these SKUs such as Named Instances, Cluster Configuration or different roles that an instance is performing (e.g. log shipping, replication etc.).
These concepts are further illustrated by an embodiment where the Microsoft SQL Server 2000 MOM management pack performs a multiphase check on a timed basis to inspect the health of Microsoft® SQL Server. By first identifying basic health conditions, it can then simulate the user experience by performing a connection and query, which takes into account the port bindings, connectivity, database health and database engine health. This multiphase check identifies potential issues that a user may experience rather than relying on basic reactive checks looking for failure or error events.
Additionally, the management pack performs health checks from external locations as defined by the administrator, which simulate clients and give the administrator feedback without actual client participation. These external ‘clients’ perform regular actions typical of a user, such as querying the database. This query response time is evaluated, both for successful completion, as well as for responsiveness, to fully understand if Microsoft® SQL Server is healthy, accepting connections and responding in an acceptable manner.
The health of a database system is fundamental to its performance. In a Microsoft®-based implementation of these concepts, a management pack monitors the health of the database system by monitoring for blocking processes. The management pack tracks live running processes and watches for blocking conditions. When a blocking condition is identified, the management pack alerts the administrator with information about the blocking condition.
Also, the management pack tracks SQL Server Agent jobs. Running agent jobs are tracked in real time and compared against a predefined acceptable running threshold. Violations of this running threshold are raised in the form of alerts to the administrator with information about the violation and job.
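The comparison of running agent jobs against a running threshold can be sketched as follows. The `RunningJob` type, the function name, and the example job names are hypothetical; how the list of running jobs is obtained from the SQL Server Agent is outside this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunningJob:
    """A currently running agent job (hypothetical record shape)."""
    name: str
    started_at: datetime


def overrunning_jobs(jobs: List[RunningJob], now: datetime,
                     threshold: timedelta) -> List[str]:
    """Return names of jobs whose running time exceeds the threshold."""
    return [job.name for job in jobs if now - job.started_at > threshold]


now = datetime(2005, 1, 1, 6, 0)
jobs = [
    RunningJob("nightly_upload", datetime(2005, 1, 1, 1, 0)),  # running 5 hours
    RunningJob("index_rebuild", datetime(2005, 1, 1, 5, 30)),  # running 30 minutes
]
violations = overrunning_jobs(jobs, now, threshold=timedelta(hours=2))
```

Each name in `violations` would correspond to one alert raised to the administrator.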
The example of
Blocks 108-112 show operation of local monitoring, which when used in combination with remote monitoring, yields a synergistic result. At block 108, monitoring agents are installed on database computers. At block 110, the monitoring agents perform a health check successfully. However, at block 112, blocking conditions are identified on a local node. Accordingly, at block 114, the administrator is notified of the poor performance and blocking. Thus, remote and local monitoring were used together to provide more information than either would have individually.
At block 206, a loop is entered and repeated for each SQL server instance. At block 206, a check is made for use of SQL server 2000. Naturally, this check could be modified to check for any desired instance or revision thereof. At block 208 a check is made to determine if the instance is to be excluded from monitoring. At block 210, a check is made to determine if the instance is disabled. At block 212, a check is made to determine if the SQL service is running. As seen in
Referring to
At block 304, a remote connectivity check is performed. At block 306, a check is made to determine if contact was made to the computer on which the database is running. If not, an error alert is sounded at block 308. If contact was made, at block 310, a check is made to determine if the query was executed. If not, an error alert is made (block 308). If the query was executed, a check is made at block 312 to determine if the response time was acceptable. If the response time was unacceptable, there is an alert (block 314). If the response time is acceptable, no action is required (block 316).
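The decision sequence of blocks 306-316 can be expressed as a small function. This is a sketch only: the `Outcome` names and the function signature are assumptions made for illustration, and the probe that actually contacts the computer and runs the query is not shown.

```python
from enum import Enum


class Outcome(Enum):
    ERROR_ALERT = "error alert"             # block 308
    RESPONSE_ALERT = "response-time alert"  # block 314
    NO_ACTION = "no action"                 # block 316


def evaluate_remote_check(contacted: bool, query_executed: bool,
                          response_seconds: float,
                          acceptable_seconds: float) -> Outcome:
    """Apply the checks of blocks 306, 310 and 312 in order."""
    if not contacted:            # block 306: was the database computer reached?
        return Outcome.ERROR_ALERT
    if not query_executed:       # block 310: did the query execute?
        return Outcome.ERROR_ALERT
    if response_seconds > acceptable_seconds:  # block 312: response time
        return Outcome.RESPONSE_ALERT
    return Outcome.NO_ACTION
```

For example, a probe that connects and runs its query in 4.0 seconds against an acceptable time of 2.0 seconds yields `Outcome.RESPONSE_ALERT`.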
Internet Information Services Monitoring
Referring to
At block 518, all application pools are discovered. Where an application pool failure is detected (block 520) a check is made to determine if the pool restarted gracefully (block 522). If not, the administrator is notified (block 516).
At block 524, all web sites are discovered. At block 526, a check is made to determine if logging is enabled. If not, real time analysis will not be available (block 528). If so, the web application logs are analyzed (block 530). At block 532, a check is made to determine if an application error has occurred. If so, at block 534 a check is made to determine if the error is the 50th occurrence (or other value, depending on the application). If not, at block 536, a consolidated event is collected for reporting. If the error was the 50th occurrence, a check is made at block 538 to determine if the errors resulted in the last 120 minutes (or other selected time period). If so, the administrator is notified (block 516).
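The occurrence-count and time-window checks of blocks 532-538 can be sketched as below. The class name is hypothetical, and two behaviors are assumptions not stated in the text: the counter is treated as a sliding window over the most recent errors, and it is reset after a notification.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorBurstDetector:
    """Signal when `occurrence_threshold` errors fall within `window`."""

    def __init__(self, occurrence_threshold: int = 50,
                 window: timedelta = timedelta(minutes=120)) -> None:
        self.occurrence_threshold = occurrence_threshold
        self.window = window
        # Keeps only the most recent `occurrence_threshold` error times.
        self._timestamps = deque(maxlen=occurrence_threshold)

    def record_error(self, when: datetime) -> bool:
        """Record one application error; return True if the administrator
        should be notified, False if the error is only consolidated."""
        self._timestamps.append(when)
        if len(self._timestamps) < self.occurrence_threshold:
            return False  # not yet the Nth occurrence (block 536)
        if when - self._timestamps[0] <= self.window:
            self._timestamps.clear()  # assumption: start counting afresh
            return True               # notify the administrator (block 516)
        return False


# Small threshold and window chosen only to keep the example short.
detector = ErrorBurstDetector(occurrence_threshold=3,
                              window=timedelta(minutes=10))
start = datetime(2005, 1, 1, 12, 0)
results = [detector.record_error(start + timedelta(minutes=m))
           for m in (0, 1, 2, 3)]
```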
Server Monitoring
In another embodiment, agents may be installed on computers that are being monitored. From these agents, the management pack is able to determine processor (CPU) performance health by sampling each processor's “% Processor Time” performance counter over a predefined number of samples (which may be designed to be user configurable).
Once a sufficient number of samples have been collected (another user configurable aspect) an average value for the “% Processor Time” performance counter is calculated for each processor. This average value for each processor is compared against a threshold value (again, user configurable). In the event that the average exceeds the threshold value, a second processor utilization metric will be evaluated. This second metric is the “Processor Queue Length” performance counter. In this case, the “Processor Queue Length” is sampled and if it exceeds the “Processor Queue Length” threshold value (also user configurable) a processor utilization threshold indicator will be created.
Evaluation of these two performance counters enables the monitoring system to dramatically reduce false positive alerts that are often caused by spikes in processor utilization and background processes which do not directly impact core server functionality.
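The two-stage evaluation described above can be sketched as a single function. The default threshold values and the minimum sample count shown here are placeholders chosen for illustration (the text states only that they are user configurable), and collecting the counter values from the operating system is not shown.

```python
from statistics import mean
from typing import Sequence


def processor_threshold_exceeded(cpu_samples: Sequence[float],
                                 queue_length: float,
                                 cpu_threshold: float = 90.0,
                                 queue_threshold: float = 2.0,
                                 min_samples: int = 5) -> bool:
    """Return True when both utilization metrics exceed their thresholds.

    cpu_samples: "% Processor Time" samples for one processor.
    queue_length: a sample of the "Processor Queue Length" counter.
    """
    if len(cpu_samples) < min_samples:
        return False  # not enough samples collected yet
    if mean(cpu_samples) <= cpu_threshold:
        return False  # average utilization is within limits
    # Second metric is consulted only after the average exceeds its threshold.
    return queue_length > queue_threshold
```

A processor that averages 97% with an empty queue would not raise an indicator under this sketch, which reflects the stated goal of suppressing alerts caused by low-priority background work.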
Vulnerability and Security Update Analysis
Currently, the common way to alert users about vulnerabilities in the software which are either due to software defects or flaws in the design is some form of public disclosure and bulletin. Most users are able to subscribe to this security bulletin in the form of an email or view them in a browser like: http://www.microsoft.com/security/bulletins/default.mspx.
The following outlines a system and method to monitor the health of Microsoft SQL Server 2000, Internet Information Services, Windows Server, or other server in another environment, and to determine the security posture of a managed computer. Accordingly, this system and method provides the following capabilities to alleviate and simplify the administrator's task of scanning servers. First, a distributed install of a security scanning engine is performed. This allows functionality to be provided through firewalls, and scales well as the number of servers increases. Second, the scanning tasks can then be offloaded to the local machine. And third, central reporting of the security posture of each managed computer is facilitated by this arrangement. To ensure that the user is able to act quickly on any vulnerability detection or security update alert, this configuration provides notification through a response infrastructure as well as viewing of the security posture of any given managed computer through an alert or report. This affords the administrator the ability to asynchronously aggregate the security posture of all servers in the environment using an automated, regularly scheduled mechanism.
Microsoft® provides an mssecure.xml that is easily downloadable over the Internet, but the burden is still left to the user to distribute or leverage this in their distributed environment to determine the overall security posture of their applications and servers. In addition, although the administrator could configure each machine to access the internet to download this security manifest, in many cases, servers will be isolated in a secure DMZ network that does not have direct access to the internet or an internet proxy server. This results in the additional administrator burden to distribute the security manifest by some other means.
To solve all these problems, the configuration described herein allows an administrator to designate a server as the intermediary file transfer server whose only function is to proxy the mssecure.xml security manifest and nothing else. This provides an in-depth defense by reducing the attack surface of that server, which does not proxy anything else. This configuration therefore allows the agents to automatically detect this file transfer server and download the security manifest from this server.
As vulnerability assessment scanning engines improve, this configuration allows the administrator to leverage newer and updated versions of such products by downloading them. This ensures that the administrator can update the scanning engine to leverage new features as well as improvements to the engine itself.
Referring to
SQL Database Free Space Monitoring
At block 1132, a check is made to determine if the database has less space than the warning threshold (which was set in blocks 1108-1116). If there is more space than the threshold, the database has a green health state (block 1130). Otherwise, at block 1134, a check is made to determine if the error threshold was exceeded. If so, at block 1138 the database has a red health state. If not, at block 1136 the database has a yellow health state.
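The three-state classification of blocks 1132-1138 can be sketched as follows. The sketch assumes that both thresholds are expressed as a percentage of free space and that the error threshold is "exceeded" when free space falls below it; the text does not specify units, so these are labeled assumptions, as is the function name.

```python
def free_space_health(free_percent: float, warning_percent: float,
                      error_percent: float) -> str:
    """Classify free space as "green", "yellow" or "red".

    Assumes error_percent <= warning_percent, both as percent free space.
    """
    if error_percent > warning_percent:
        raise ValueError("error threshold must not exceed warning threshold")
    if free_percent >= warning_percent:
        return "green"   # block 1130: not below the warning threshold
    if free_percent < error_percent:
        return "red"     # block 1138: error threshold exceeded
    return "yellow"      # block 1136: below warning, not yet at error


states = [free_space_health(p, warning_percent=20.0, error_percent=10.0)
          for p in (35.0, 15.0, 5.0)]
```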
Long Running Agent Jobs
Blocking Server Process IDs
Security Issues
Exemplary Methods
Exemplary methods for implementing aspects of health monitoring for actively executing computer applications will now be described with primary reference to the flow diagrams of
Referring to
At block 1608, the complexity of the database is assessed by locating each file within the database. Blocks 1610-1614 of
At block 1616, a health state is established for each of the files located within the database. Blocks 1618-1620 of
In the embodiment shown at block 1704, the SQL server's configuration is studied. In particular, an inventory is made of factors such as the SQL server version, the SKU of the server instance, how the server is configured, and for what purpose the server was configured.
In the embodiment shown at block 1706, the SQL server's configuration is further studied. In particular, an inventory of the database is performed, wherein files, objects, and the attributes of the objects (e.g. an Autogrow setting associated with the object) are all cataloged.
At block 1708, a query is defined that will be made by the client computer to the SQL server. An expected response time is also defined, within which time the SQL server should make a response to the client computer. The expected response time may be based on experience with similar queries and databases.
At block 1710, a report, outlining the results of the query, is made to an administrator. In the embodiment of implementation 1700, the report includes a comparison of an actual response time with the expected response time. Using this information, the administrator is able to determine if the SQL server is performing adequately.
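Measuring the actual response time and reporting it against the expected response time (blocks 1708-1710) can be sketched as below. The function name, the report fields, and the injectable `clock` parameter are illustrative assumptions; `run_query` stands in for whatever call the client computer makes to the SQL server.

```python
import time
from typing import Any, Callable, Dict


def timed_query_report(run_query: Callable[[], Any],
                       expected_seconds: float,
                       clock: Callable[[], float] = time.perf_counter
                       ) -> Dict[str, Any]:
    """Run the query once and compare elapsed time with the expected time."""
    start = clock()
    run_query()
    actual_seconds = clock() - start
    return {
        "expected_seconds": expected_seconds,
        "actual_seconds": actual_seconds,
        "within_expected": actual_seconds <= expected_seconds,
    }


# A fake clock makes the example deterministic: the query "takes" 0.5 s.
fake_clock = iter([10.0, 10.5]).__next__
report = timed_query_report(lambda: None, expected_seconds=1.0,
                            clock=fake_clock)
```

Passing the clock as a parameter is a design choice made here so that the comparison logic can be exercised without a live server.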
While one or more methods have been disclosed by means of flow diagrams and text associated with the blocks of the flow diagrams, it is to be understood that the blocks do not necessarily have to be performed in the order in which they were presented, and that an alternative order may result in similar advantages. Furthermore, the methods are not exclusive and can be performed alone or in combination with one another.
Exemplary Computer
The computing environment 1900 includes a general-purpose computing system in the form of a computer 1902. The components of computer 1902 can include, but are not limited to, one or more processors or processing units 1904, a system memory 1906, and a system bus 1908 that couples various system components including the processor 1904 to the system memory 1906. The system bus 1908 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a Peripheral Component Interconnect (PCI) bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
Computer 1902 typically includes a variety of computer readable media. Such media can be any available media that is accessible by computer 1902 and includes both volatile and non-volatile media, removable and non-removable media. The system memory 1906 includes computer readable media in the form of volatile memory, such as random access memory (RAM) 1910, and/or non-volatile memory, such as read only memory (ROM) 1912. A basic input/output system (BIOS) 1914, containing the basic routines that help to transfer information between elements within computer 1902, such as during start-up, is stored in ROM 1912. RAM 1910 typically contains data and/or program modules that are immediately accessible to and/or presently operated on by the processing unit 1904.
Computer 1902 can also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example,
The disk drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules, and other data for computer 1902. Although the example illustrates a hard disk 1916, a removable magnetic disk 1920, and a removable optical disk 1924, it is to be appreciated that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like, can also be utilized to implement the exemplary computing system and environment.
Any number of program modules can be stored on the hard disk 1916, magnetic disk 1920, optical disk 1924, ROM 1912, and/or RAM 1910, including by way of example, an operating system 1926, one or more application programs 1928, other program modules 1930, and program data 1932. Each of such operating system 1926, one or more application programs 1928, other program modules 1930, and program data 1932 (or some combination thereof) may include an embodiment of a caching scheme for user network access information.
Computer 1902 can include a variety of computer/processor readable media identified as communication media. Communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
A user can enter commands and information into computer system 1902 via input devices such as a keyboard 1934 and a pointing device 1936 (e.g., a “mouse”). Other input devices 1938 (not shown specifically) may include a microphone, joystick, game pad, satellite dish, serial port, scanner, and/or the like. These and other input devices are connected to the processing unit 1904 via input/output interfaces 1940 that are coupled to the system bus 1908, but may be connected by other interface and bus structures, such as a parallel port, game port, or a universal serial bus (USB).
A monitor 1942 or other type of display device can also be connected to the system bus 1908 via an interface, such as a video adapter 1944. In addition to the monitor 1942, other output peripheral devices can include components such as speakers (not shown) and a printer 1946 that can be connected to computer 1902 via the input/output interfaces 1940.
Computer 1902 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computing device 1948. By way of example, the remote computing device 1948 can be a personal computer, portable computer, a server, a router, a network computer, a peer device or other common network node, and the like. The remote computing device 1948 is illustrated as a portable computer that can include many or all of the elements and features described herein relative to computer system 1902.
Logical connections between computer 1902 and the remote computer 1948 are depicted as a local area network (LAN) 1950 and a general wide area network (WAN) 1952. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. When implemented in a LAN networking environment, the computer 1902 is connected to a local network 1950 via a network interface or adapter 1954. When implemented in a WAN networking environment, the computer 1902 typically includes a modem 1956 or other means for establishing communications over the wide network 1952. The modem 1956, which can be internal or external to computer 1902, can be connected to the system bus 1908 via the input/output interfaces 1940 or other appropriate mechanisms. It is to be appreciated that the illustrated network connections are exemplary and that other means of establishing communication link(s) between the computers 1902 and 1948 can be employed.
In a networked environment, such as that illustrated with computing environment 1900, program modules depicted relative to the computer 1902, or portions thereof, may be stored in a remote memory storage device. By way of example, remote application programs 1958 reside on a memory device of remote computer 1948. For purposes of illustration, application programs and other executable program components, such as the operating system, are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer system 1902, and are executed by the data processor(s) of the computer.
Conclusion
Although aspects of this disclosure include language specifically describing structural and/or methodological features of preferred embodiments, it is to be understood that the appended claims are not limited to the specific features or acts described. Rather, the specific features and acts are disclosed only as exemplary implementations, and are representative of more general concepts.