This disclosure relates generally to the field of computer software. More particularly, but not by way of limitation, it relates to a policy-driven technique for dynamically self-configuring and deploying large scale, complex, heterogeneous monitoring agent networks.
Monitoring agents are in widespread use to monitor infrastructure, software, and packaged applications across the enterprise. Many enterprises have very large scale, complex, heterogeneous networks of agents, deploying thousands of agents to monitor their software, applications, and infrastructure. Typical monitoring agents are not plug-and-play products, and in such large-scale deployments the configuration, deployment, and manageability of the agents themselves become very difficult.
Each agent may have a large set of configuration parameters, flags, thresholds, alarm ranges, etc., defined in its configuration to help control its monitoring functionality. The contents of this set of configuration parameters vary greatly based on the components to be monitored, including parameters that depend on the monitored domain and software, and may include modules specific to the monitored application, e.g., specific monitoring functionality for Oracle, Windows NT, etc. In a complex, large scale, heterogeneous environment, in which a wide variety of software applications are being monitored, maintaining these configuration parameters on a per-agent basis quickly becomes a very difficult, if not impossible, task.
There are several drawbacks that make this approach untenable at the scale, complexity, and heterogeneity at which customers are using monitoring agents: (a) configuration is based on how the monitoring is to be accomplished, not what monitoring is desired; (b) there is no sharing and reuse of agent configuration; (c) configuration changes are not propagated to distributed monitoring agents; (d) the per-agent configuration process is manual, thus error-prone, and requires significant domain knowledge and expertise, with the person configuring the agent required to know what software is installed on a host and hence what configuration is required for the agent; (e) there is no scalability to the configuration process, because there is no way to use policies or rules to control configuration of the agents or to group configuration properties into meaningful sets or views; for example, a DB administrator cannot categorize servers running a common database engine as a group, associate that group with a set of configuration properties, and manage the group as a unit in a more scalable way; (f) agent configuration is not dynamic, but must be performed manually, so that if new software is deployed on a host monitored by an agent, the new software does not automatically start getting monitored, and if software is removed from a host, the relevant configuration does not get removed from the monitoring agent.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without these specific details. In other instances, structure and devices are shown in block diagram form in order to avoid obscuring the invention. References to numbers without subscripts or suffixes are understood to reference all instances of subscripts and suffixes corresponding to the referenced number. Moreover, the language used in this disclosure has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter. Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment of the invention, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.
A centralized, policy-driven approach allows dynamic self-configuration and self-deployment of large scale, complex, heterogeneous monitoring agent networks. Such an approach resolves the scalability and manageability issues of manually configured conventional agents. Embodiments of the agents can be self-configuring using a dynamic, adaptive technique. An administrator can group hosts on which agents execute into groups that have similarly configured agents.
Initially, when a bare bones agent 150 is installed on the host 155 to be monitored, the agent 150 is a generic agent, initially configured with sufficient information to allow the agent 150 to contact a central configuration manager (CCM) 100 running on server 105. The contact information typically includes credentials for and a location of the CCM 100. In one embodiment, the contact information may include a port number or other information useful for making a connection with the CCM 100. Upon startup, the agent 150 may report the characteristics of the host 155 back to the CCM 100 as a first collection of information 110. This first collection may include information such as the operating system (OS) running on the host 155, the IP address of the host 155, a version of the agent 150, and an identity of the host 155, all of which is reported to the CCM 100.
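The startup report above can be sketched as follows. This is a minimal illustration of a first collection of information 110; the field names and the serialization format are assumptions for illustration only, not taken from the disclosure.

```python
import json
import platform
import socket

# Hypothetical sketch of the "first collection" an agent might report on
# startup; all field names here are illustrative assumptions.
def build_first_collection(agent_version="1.0.0"):
    """Gather basic host characteristics to report to the CCM."""
    hostname = socket.gethostname()
    try:
        ip_address = socket.gethostbyname(hostname)
    except OSError:
        ip_address = "127.0.0.1"  # fall back if the host name is unresolvable
    return {
        "os": platform.system(),  # e.g. "Linux" or "Windows"
        "ip_address": ip_address,
        "agent_version": agent_version,
        "host_identity": hostname,
    }

# The agent would serialize this and send it to the CCM's address and port
# obtained from its initial contact information.
payload = json.dumps(build_first_collection())
```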
In one embodiment, the CCM 100 comprises all possible components that an enterprise has purchased rights to use and could potentially manage. In an alternate embodiment, the CCM 100 comprises all components that are available from a vendor of the CCM 100, even if the enterprise configuring the CCM 100 has not purchased rights to use some of those components.
The CCM 100 then pushes a manageability probe 120 to the agent 150. The probe 120 executes under the control of the agent 150, and discovers all manageable components that are configured on the host 155. After running the probe 120, the agent 150 reports the information about the platform on which the agent 150 is running and a basic inventory 130 of the manageable components associated with the host 155. Any desired format for the basic inventory 130 may be used. Preferably, the format of the basic inventory 130 is designed for easy parsing by the CCM 100. In one embodiment, the basic inventory 130 may include information that the host 155 is a Windows server, and that the probe 120 detected manageability of “Services” and “Processes,” as well as a specific instance of Oracle database software.
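One easily parsed shape for the basic inventory 130 is sketched below. The JSON layout mirrors the Windows/Oracle example in the text, but the field names and the use of JSON are illustrative assumptions, not a format defined by the disclosure.

```python
import json

# Hypothetical shape of a basic inventory report. The component list mirrors
# the Windows/Oracle example above; everything else is illustrative.
basic_inventory = {
    "platform": {"os": "Windows", "os_version": "Server 2019"},
    "manageable_components": [
        {"type": "Services"},
        {"type": "Processes"},
        {"type": "OracleDB", "instance": "ORCL01", "port": 1521},
    ],
}

# A format designed for easy parsing: plain JSON the CCM can load directly.
report = json.dumps(basic_inventory)
parsed = json.loads(report)
```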
The CCM 100, on receiving the basic inventory 130, analyzes it using context data, applicable policies, and host group membership information, and decides which management modules need to be enabled on the agent 150, their dependencies, and their respective configurations, e.g., alarm thresholds, Simple Network Management Protocol (SNMP) settings, etc.
In one embodiment, the determination of manageability information and the enablement of management modules may be an iterative process. A management module when enabled may discover some additional information that is reported to the CCM 100. The CCM 100 may respond by analyzing and subsequently making changes to the configuration of the agent 150, including enabling additional management modules and updating the configuration of existing management modules of the agent 150. In a further embodiment, the iterative process may also involve intervention on the part of an administrative user to override the automatic selection of configuration information and enablement of management modules of the agent 150.
In one embodiment, a CCM 100 administrator may use a user interface for the CCM 100 to select which applications that have been discovered by the manageability probe should be monitored. The CCM 100 may include a configuration store, typically a database, in which the management modules may be loaded and stored for deployment with an agent 150.
Preferably, configuration is not a one-time activity by the CCM 100, but is a dynamic, adaptive, self-configuration cycle that the CCM 100 performs continuously, receiving events from the agent 150, analyzing the manageability data 130, and responding with changes to the configuration of the agent 150. For example, if new software is deployed on the host 155 monitored by the agent 150, the management probe 120 on the agent 150 may discover the new software and report it to the CCM 100. The CCM 100 may then analyze the manageability information 130 and send the agent 150 appropriate configuration data to configure the agent 150 for enabling monitoring of that software. This dynamic update technique may be performed automatically without the need for any manual configuration by an administrative user. Similarly, if software is uninstalled from the host 155, the relevant configuration may be automatically removed from the configuration of the agent 150.
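The continuous cycle above amounts to diffing the latest inventory against the currently enabled modules. The sketch below illustrates that reconciliation step; the function, the module names, and the catalog mapping component types to management modules are all illustrative assumptions.

```python
# Minimal sketch of the CCM's reconcile step: given the newest inventory,
# enable modules for newly discovered software and remove configuration for
# uninstalled software. All names here are illustrative assumptions.
def reconcile(enabled_modules, discovered_components, catalog):
    """Return (to_enable, to_disable) for the latest inventory.

    catalog maps a discovered component type to the management module
    assumed to be able to monitor it."""
    wanted = {catalog[c] for c in discovered_components if c in catalog}
    to_enable = wanted - enabled_modules
    to_disable = enabled_modules - wanted
    return to_enable, to_disable

catalog = {"OracleDB": "oracle_km", "Services": "services_km"}
enabled = {"services_km"}
# New software (an Oracle instance) appears in the next inventory report:
add, remove = reconcile(enabled, {"Services", "OracleDB"}, catalog)
```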
As indicated above, blocks 240 through 260 may be iteratively performed, as the agent discovers additional manageability data, such as when a new software is installed on the host 155. This may include removal of configuration information, such as when an application is removed from the host 155.
As a result, the generic agents 150 may be automatically configured heterogeneously, with each agent 150 configured to monitor its corresponding host 155 based on what is discovered about the host 155.
In one embodiment, hosts 155 on which agents 150 execute may be organized into one or more host groups. Grouping improves scalability and provides other advantages.
From an administration perspective, grouping provides a convenient and scalable way to define various configuration views on the environment. For example, in one embodiment, a database administrator (DBA) is responsible for managing multiple databases, and a policy may give the DBA responsibility for the dynamically associated configuration parameters in the agent 150. The DBA may decide to split their multiple databases (and thus their hosts 155) into groups corresponding to how the DBA differentiates between the groups in terms of configuration, such as defining Small, Medium, and Large groups.
In one embodiment, groups and hosts may have associated metadata properties that can be used to dynamically bind and evaluate policies and thus configuration parameters at runtime. This runtime-determined configuration lends flexibility and dynamism to the configuration process.
Grouping allows administrators to deal with groups of hosts and the relationships between those hosts, instead of the individual hosts, which may reduce the number of distinct entities that have to be separately configured. Grouping provides a much more scalable and maintainable system than handling configuration on an individual basis for thousands of hosts.
In one embodiment, group membership of hosts and host groups may be dynamic based on some properties of the host or of the host group.
In one embodiment, relationships may be defined between groups, such as defining a hierarchical relation between groups, which can affect the order and priority of evaluation of dynamically associated policies.
In this example hierarchy, group 300 is a top-level group to which both hosts 325 and 335 belong. A group P (310) may be defined to include all hosts at a particular location or site. As illustrated in
Two subgroups of group 310 are illustrated in
Two hosts are illustrated in
In addition to using groups of hosts to provide a dynamic approach to self-configuration, in one embodiment, a policy-driven approach to driving large scale system self-configuration provides even greater scalability. Policies may be used to compute the set of applicable agent configuration properties for a host.
In one embodiment, a policy contains one or more rules that follow the condition-action pattern, in which a Boolean rule condition is evaluated and, if it is true, the action part of the policy is executed. In such an embodiment, the action part of the policy typically sets one or more related agent configuration parameters. Thus, a policy may drive the agent 150's configuration properties, settings, thresholds, etc.
In a further embodiment, policy rules may encapsulate logic and need not be hard coded, but may be based on host groups or host properties and Boolean logic. The association between rules and the host groups or hosts may be dynamic, based on a rule condition and existing host group or host properties.
The following is an example of a policy:
Condition: HostGroup.tags includes ‘PLATINUM’ and HostGroup.lifecycleStage=‘PROD’
Actions: set fdLimit=5000
In this policy, if the host group tags include “PLATINUM” and the host group is marked as being in the PROD lifecycle stage, then the policy sets a configuration variable “fdLimit” to a value of “5000.” As illustrated in this example, a policy may employ multiple conditions, but the remaining examples described below are illustrated with a single condition for clarity.
The following is another example of a policy:
Condition: HostGroup.lifecycleStage=‘TEST’
Actions: set fdLimit=500
In this second policy, if the host group is marked as being in the “TEST” lifecycle stage, then the policy sets the configuration variable “fdLimit” to a value of “500.”
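The two fdLimit policies above can be sketched as condition-action pairs evaluated against host group attributes. The representation below (conditions as predicates, actions as property dictionaries) is an illustrative assumption, not the disclosure's actual policy language.

```python
# Hedged sketch of condition-action policies like the fdLimit examples above:
# each rule pairs a Boolean condition over host-group attributes with an
# action that sets configuration parameters. Representation is illustrative.
def evaluate_policies(policies, host_group):
    config = {}
    for condition, action in policies:
        if condition(host_group):   # condition part: Boolean rule condition
            config.update(action)   # action part: set configuration values
    return config

policies = [
    # Condition: tags include 'PLATINUM' and lifecycleStage = 'PROD'
    (lambda g: "PLATINUM" in g["tags"] and g["lifecycleStage"] == "PROD",
     {"fdLimit": 5000}),
    # Condition: lifecycleStage = 'TEST'
    (lambda g: g["lifecycleStage"] == "TEST",
     {"fdLimit": 500}),
]

prod_group = {"tags": ["PLATINUM"], "lifecycleStage": "PROD"}
test_group = {"tags": [], "lifecycleStage": "TEST"}
```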
For computing the set of applicable agent configuration properties for a host, the policy evaluation process takes into account the host's group membership. For each group in which it finds the host, the CCM policy evaluation technique follows the hierarchy of host groups and hosts, going from the topmost ancestor host group down to the specific host, evaluating the policies applicable at each level. This may result in agent configuration properties being appended to, modified, or deleted at each level. The CCM 100 may then send the resulting final set of configuration properties to the agent 150 for application.
In one embodiment, the policy rules themselves may have a priority or precedence associated with them, enabling an ordering of the evaluation of all applicable policies.
In this example, there are no policies associated with group 300. Group 310 is next considered and its associated policies evaluated. Because host group 310 has the name P, condition 410 is true, and actions 412 set properties corresponding to that condition. Because the TAGS for group 310 include the value SILVER, actions 422 are also performed, resulting in property set 470: /SNMP/SUPPORT=YES, SLALEVEL=SILVER, and /SNMP/DEFAULTPORT=161.
Moving down the hierarchy, host group 330, which indicates that LIFECYCLE=PROD, and TAGS=PLATINUM, causes the evaluation of condition 430, which updates the SLALEVEL configuration property from SILVER to PLATINUM (action 432), resulting in property set 480: /SNMP/SUPPORT=YES, /SNMP/DEFAULTPORT=161, and SLALEVEL=PLATINUM.
Moving down the hierarchy again, host 335 indicates that the OS=WIN and that it is a TYPE=PHYSICAL host, having a host name of host3.test.corp.com. Therefore, conditions 440 and 450 cause execution of actions 442 and 452. Action 442 adds the property MAXPROCESSLIMIT=100, while action 452 changes the /SNMP/SUPPORT property to NO, and adds the property PRELOADEDKMS=CORP_TEST, resulting in property set 490: /SNMP/SUPPORT=NO, /SNMP/DEFAULTPORT=161, SLALEVEL=PLATINUM, MAXPROCESSLIMIT=100, and PRELOADEDKMS=CORP_TEST. The CCM 100 may then use these configuration properties to push information to the agent 150 executing on host 335 to configure the agent 150 according to property set 490, automatically, and without human intervention. The host groups, hosts, policies, and properties described in
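The top-down walk in this worked example can be sketched as follows: policies at each level mutate a running property set, so deeper levels override ancestors. The data structures are illustrative assumptions; the conditions and actions mirror those of groups 310 and 330 and host 335.

```python
# Sketch of the top-down hierarchy walk: evaluate each level's policies in
# order, letting deeper levels append, modify, or delete properties set by
# ancestor groups. Structures here are illustrative assumptions.
def resolve(levels):
    """Each level is (rules, attrs): rules are (condition, action) pairs,
    attrs are that level's host-group or host attributes; actions mutate
    the running property set."""
    props = {}
    for rules, attrs in levels:
        for cond, action in rules:
            if cond(attrs):
                action(props)
    return props

group_p = ([(lambda a: a["NAME"] == "P",
             lambda p: p.update({"/SNMP/SUPPORT": "YES",
                                 "/SNMP/DEFAULTPORT": 161})),
            (lambda a: "SILVER" in a["TAGS"],
             lambda p: p.update({"SLALEVEL": "SILVER"}))],
           {"NAME": "P", "TAGS": ["SILVER"]})

group_330 = ([(lambda a: "PLATINUM" in a["TAGS"],
               lambda p: p.update({"SLALEVEL": "PLATINUM"}))],
             {"TAGS": ["PLATINUM"], "LIFECYCLE": "PROD"})

host_335 = ([(lambda a: a["OS"] == "WIN",
              lambda p: p.update({"MAXPROCESSLIMIT": 100})),
             (lambda a: a["TYPE"] == "PHYSICAL",
              lambda p: p.update({"/SNMP/SUPPORT": "NO",
                                  "PRELOADEDKMS": "CORP_TEST"}))],
            {"OS": "WIN", "TYPE": "PHYSICAL"})

final = resolve([group_p, group_330, host_335])
```

Walking the three levels in order reproduces property set 490, with /SNMP/SUPPORT overridden from YES to NO and SLALEVEL from SILVER to PLATINUM along the way.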
The technique described above allows properties to be defined for host groups, which may then be overridden by subsidiary host groups or by hosts themselves.
In the examples described above, the policies are hard coded, with specific values selected in each condition and action definition. In one embodiment, a policy engine allows defining policies that use variables and expressions for additional flexibility. In this embodiment, the CCM separates out the condition definition/template from the property values applicable using a property expression and lookup system. This allows separation of responsibility for maintaining the policies. For example, one administration team may define conditions and a different administration team may maintain the actions, defining property values or thresholds to be used in the rules, keeping these activities separate.
In this embodiment, illustrated by the block diagram of
When evaluating policy 510, which has a condition 512 of HOSTGROUP.NAME=P, if the condition 512 when evaluated as described above is true, then action 514 of assigning properties is taken. The policy engine 530 assigns hard coded properties /SNMP/SUPPORT=YES and /SNMP/DEFAULTPORT=161. An additional property employs variables (in this example, indicated by a prefix of “${” and a suffix of “}”). The rule evaluation engine looks up the STAGE variable in the property lookup system, and discovers that the STAGE variable has a value of “DEV.” This value is then placed into the property being evaluated by the policy engine 530, resulting in a property of “FDLIMIT=${DEV_FD_LIMIT}.” The policy engine 530 again queries the property lookup system 540, obtains the value 500 for the DEV_FD_LIMIT variable, and sets the FDLIMIT configuration element to have a value of 500.
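The variable substitution described above can be sketched as a lookup-and-substitute loop: each `${...}` reference is resolved against the property lookup system, and because the STAGE value itself selects which limit variable applies, substitution proceeds in stages until no variables remain. The resolution code and the nested-variable syntax are illustrative assumptions; the variable names mirror the example.

```python
import re

# Sketch of ${...} property expression resolution against a property lookup
# system. The ${${STAGE}_FD_LIMIT} nesting style is an illustrative
# assumption modeled on the STAGE/DEV_FD_LIMIT example in the text.
VAR = re.compile(r"\$\{([A-Z_]+)\}")

def resolve_expr(expr, lookup):
    # Repeat until no ${...} remains, so that inner variables (e.g. STAGE)
    # resolve first and expose outer variables (e.g. DEV_FD_LIMIT).
    while True:
        new = VAR.sub(lambda m: str(lookup[m.group(1)]), expr)
        if new == expr:
            return new
        expr = new

lookup = {"STAGE": "DEV", "DEV_FD_LIMIT": 500}
fd_limit = resolve_expr("${${STAGE}_FD_LIMIT}", lookup)
```

This keeps conditions and property values separate: one team can maintain expressions like `${${STAGE}_FD_LIMIT}` while another maintains the values in the lookup system.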
In one embodiment, also illustrated in
The runtime namespace is a namespace for building a policy resolution. For each query a new runtime namespace may be constructed. This runtime namespace is transient in nature and is empty at the start of a query and seeded with the query facts. After the query, the policy result is then filtered by the scope. A policy engine should contain at least one policy to be useful. Each policy comprises three sets of information:
Precedence: determines the order in which this policy will be evaluated, relative to the other policies for a certain “evaluation set.”
Precondition: If this condition is met, the policy action will be executed. Precondition evaluation will be triggered on changes in the runtime namespace for parameters that are included in the precondition.
Action: once the precondition is met, the action will be executed. As a result of an executing action that changes the runtime namespace, preconditions may be scheduled for evaluation. In addition to setting configuration data values as described above, the actions may be extended to execute code on the system, communicate to other systems or read files on the filesystem, if desired.
Policies may reference data stored in the policy namespace and preferably do not contain any constructs containing actual policy data.
At startup, the policy engine reads the policies and the policy data. The policies may be stored in “compiled” form for performance reasons. The policy compiler may turn a policy into the following:
Precedence: taken from the policy.
Precondition: taken from the policy.
Triggers: the list of runtime variables referenced in the precondition. If any of the precondition runtime variables are modified, the precondition should be re-evaluated.
Action: taken from the policy.
Alternately, a purely interpretative policy engine may omit the policy compiler.
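The compiled form above can be sketched as a record of precedence, precondition, action, and extracted triggers. The variable syntax and the way triggers are extracted from the precondition are illustrative assumptions.

```python
import re
from dataclasses import dataclass, field

# Sketch of a "compiled" policy: precedence, precondition, and action are
# taken from the policy; triggers are the runtime variables the precondition
# references. The precondition syntax assumed here is illustrative.
@dataclass
class CompiledPolicy:
    precedence: int
    precondition: str               # e.g. "lifecycleStage == 'TEST'"
    action: dict                    # configuration values to set
    triggers: set = field(default_factory=set)

def compile_policy(precedence, precondition, action):
    # Assume runtime variables appear as bare (possibly dotted) identifiers
    # in the precondition text; quoted literals are excluded first.
    stripped = re.sub(r"'[^']*'", "", precondition)
    triggers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", stripped))
    triggers -= {"and", "or", "not", "in", "includes"}
    return CompiledPolicy(precedence, precondition, action, triggers)

p = compile_policy(10, "HostGroup.lifecycleStage == 'TEST'", {"fdLimit": 500})
```

Extracting the trigger list at compile time is what lets the engine re-evaluate only the preconditions affected by a runtime namespace change.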
In block 610, the policy engine 530 reads the policy data upon startup, compiling the policies and determining policy triggers, then waits for a query in block 620. Upon receiving a query in block 630, the policy engine initializes the runtime namespace and applies any facts supplied with the query to the runtime namespace. If no triggers are changed that would cause execution of a policy action, as determined in block 650, then a query result may be returned in block 640, based on the compiled policies.
If a trigger is changed, then in block 660 each affected policy may be reevaluated. The preconditions for each affected policy are evaluated. Then, any policies where the preconditions are met may be ordered by precedence. The policy actions may then be executed in the precedence order and any changes to the triggers may be recorded. As needed, policy data may be loaded from the policy namespace.
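The query flow of blocks 610 through 660 can be sketched as a loop over trigger changes: a fresh runtime namespace is seeded with the query facts, affected policies are evaluated in precedence order, and action side effects on the namespace can schedule further preconditions. The fire-at-most-once rule and all names below are illustrative assumptions added to guarantee termination.

```python
# Sketch of query evaluation: seed a transient runtime namespace with the
# query facts, then repeatedly evaluate policies whose trigger variables
# changed, in precedence order. Each policy fires at most once per query
# here (an assumption made so the loop terminates).
def run_query(policies, facts):
    """policies: dicts with 'precedence', 'triggers', 'precondition',
    'action' keys."""
    namespace = dict(facts)          # transient, seeded with query facts
    changed = set(namespace)         # seeding counts as a change
    pending = list(policies)
    while changed:
        affected = [p for p in pending if p["triggers"] & changed]
        affected.sort(key=lambda p: p["precedence"])
        changed = set()
        for p in affected:
            if p["precondition"](namespace):
                before = dict(namespace)
                p["action"](namespace)
                changed |= {k for k, v in namespace.items()
                            if before.get(k) != v}
                pending.remove(p)
    return namespace

policies = [
    {"precedence": 1, "triggers": {"stage"},
     "precondition": lambda ns: ns.get("stage") == "PROD",
     "action": lambda ns: ns.update(fdLimit=5000)},
    # This policy's trigger is set by the first policy's action, showing how
    # an action can schedule another precondition for evaluation.
    {"precedence": 2, "triggers": {"fdLimit"},
     "precondition": lambda ns: ns.get("fdLimit", 0) > 1000,
     "action": lambda ns: ns.update(monitoring="strict")},
]

result = run_query(policies, {"stage": "PROD"})
```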
Referring now to
System unit
It is to be understood that the above description is intended to be illustrative, and not restrictive. For example, the above-described embodiments may be used in combination with each other. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”