The present disclosure relates to computing environments, and more particularly to methods, techniques, and systems for filtering metrics of monitored computing-instances based on severity levels.
In application/operating system (OS) monitoring environments, a management node may communicate with multiple endpoints to monitor the endpoints. For example, an endpoint may be implemented in a physical computing environment, a virtual computing environment, or a cloud computing environment. Further, the endpoints may execute different applications via virtual machines (VMs), physical computing devices, containers, and the like. In such environments, the management node may communicate with the endpoints to collect performance data/metrics (e.g., application metrics, OS metrics, and the like) from underlying OS and/or services on the endpoints for storage and performance analysis (e.g., to detect and diagnose issues).
The drawings described herein are for illustration purposes and are not intended to limit the scope of the present subject matter in any way.
Examples described herein may provide an enhanced computer-based and/or network-based method, technique, and system to filter metrics based on severity levels for ingesting into a monitoring tool in a computing environment. The computing environment may be a physical computing environment (e.g., an on-premise enterprise computing environment or a physical data center) and/or a virtual computing environment (e.g., a cloud computing environment, a virtualized environment, and the like).
The virtual computing environment may be a pool or collection of cloud infrastructure resources designed for enterprise needs. The resources may be a processor (e.g., central processing unit (CPU)), memory (e.g., random-access memory (RAM)), storage (e.g., disk space), and networking (e.g., bandwidth). Further, the virtual computing environment may be a virtual representation of the physical data center, complete with servers, storage clusters, and networking components, all of which may reside in virtual space being hosted by one or more physical data centers. The virtual computing environment may include multiple physical computers executing different endpoints (e.g., physical computers, virtual machines, and/or containers). The endpoints may execute different types of applications.
Further, performance monitoring of such computing-instances (i.e., the endpoints) has become increasingly important because performance monitoring may aid in troubleshooting the computing-instances (e.g., to rectify abnormalities or shortcomings, if any), improve the health of data centers, analyze cost and capacity, and/or the like. Example performance monitoring tools, applications, or platforms include VMware® vRealize Operations (vROps), VMware Wavefront™, Grafana, and the like.
Further, the computing-instances may include monitoring agents (e.g., Telegraf™, collectd, Micrometer, and the like) to collect the performance metrics from the respective computing-instances and provide, via a network, the collected performance metrics to a remote collector. Furthermore, the remote collector may receive the performance metrics from the monitoring agents and transmit the performance metrics to the monitoring tool for metric analysis. A remote collector may refer to an additional cluster node that allows the monitoring tool (e.g., vROps Manager) to gather objects into the remote collector's inventory for monitoring purposes. The remote collectors collect the data from the computing-instances and then forward the data to a management node that executes the monitoring tool. For example, remote collectors may be deployed at remote location sites while the monitoring tool may be deployed at a primary location.
Furthermore, the monitoring tool may receive the performance metrics, analyze the received performance metrics, and display the analysis in the form of dashboards, for instance. The displayed analysis may facilitate visualizing the performance metrics and diagnosing a root cause of issues, if any.
In such computing environments, the number of metrics collected by the application remote collector increases with an increase in the number of computing-instances. However, not all the collected metrics may be relevant for the metric analysis, for instance, when the computing-instance is performing well. Even though all the collected metrics may not be relevant, the metrics are ingested to the monitoring tool for performance analysis, which involves a significant amount of computation. Further, the monitoring tools may charge clients for every metric that is ingested to the monitoring tool. Since the metrics are not filtered based on relevance, clients may end up paying a significant amount for such monitoring tools.
Examples described herein may provide a computing node (e.g., a virtual machine that implements a remote collector service) to filter metrics of monitored computing-instances prior to ingesting to a monitoring tool. During operation, the computing node may receive the metrics of a monitored computing-instance from a monitoring agent running on the monitored computing-instance. Further, the computing node may retrieve a data structure corresponding to the received metrics. The data structure may be generated corresponding to historical events/incidents that occur in a datacenter. The data structure may include multiple metric dependency levels of the metrics with each metric dependency level mapped to a corresponding severity condition. Furthermore, the computing node may determine a severity level of a root metric of the received metrics using the retrieved data structure. Upon determining the severity level, the computing node may filter the received metrics based on the metric dependency levels in the data structure and the determined severity level. Further, the computing node may ingest the filtered metrics to the monitoring tool to monitor a health of the monitored computing-instance.
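The filtering flow described above may be sketched in Python. The dictionaries, metric names, dependency levels, and severity ranges below are illustrative assumptions for explanation only, not values from any actual deployment or product API:

```python
# Hypothetical data structure: each metric name maps to its dependency
# level, and each severity level maps to an inclusive value range
# (the severity condition) on the root metric.
DEPENDENCY_LEVEL = {"M1": 1, "M11": 2, "M12": 2, "M111": 3, "M112": 3}
SEVERITY_CONDITIONS = {1: (0, 50), 2: (51, 80), 3: (81, 100)}
ROOT_METRIC = "M1"

def filter_metrics(received):
    """Keep only the metrics relevant for the root metric's severity level."""
    root_value = received[ROOT_METRIC]
    # Determine the severity level whose condition the root metric matches.
    severity = next(level for level, (lo, hi) in SEVERITY_CONDITIONS.items()
                    if lo <= root_value <= hi)
    # Discard metrics whose dependency level exceeds the determined level.
    return {name: value for name, value in received.items()
            if DEPENDENCY_LEVEL[name] <= severity}

metrics = {"M1": 30, "M11": 55, "M12": 60, "M111": 10, "M112": 20}
print(filter_metrics(metrics))  # {'M1': 30} -- severity level 1, so only level-1 metrics
```

With these assumed ranges, a healthy root metric (severity level 1) results in only the root metric being ingested, while a degraded root metric pulls in progressively deeper dependency levels.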
Thus, examples described herein may provide a knowledge base of historical incidents, which may be used to derive a mechanism to ingest relevant metrics to the monitoring tool. The computing node may receive the metrics from the monitoring agent and perform filtering of the metrics using the knowledge base of incidents prior to ingesting the metrics to the monitoring tool. Hence, examples described herein may bridge a gap between the monitoring agent and the monitoring tool by filtering the metrics dynamically, thereby reducing the cost for the clients.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present techniques. It will be apparent, however, to one skilled in the art that the present apparatus, devices, and systems may be practiced without these specific details. Reference in the specification to “an example” or similar language means that a particular feature, structure, or characteristic described is included in at least that one example, but not necessarily in other examples.
System Overview and Examples of Operation
Example system 100 includes monitored computing-instances 102A-102N, a monitoring tool 120, and a computing node 106 to receive the metrics (e.g., performance metrics) from monitored computing-instances 102A-102N and transmit the metrics to monitoring tool 120 for metric analysis. Example monitored computing-instances 102A-102N may include, but are not limited to, virtual machines, physical host computing systems, containers, software-defined data centers (SDDCs), and/or the like. For example, monitored computing-instances 102A-102N can be deployed either in an on-premises platform or an off-premises platform (e.g., a cloud-managed SDDC). Further, the SDDC may include various components such as a host computing system, a virtual machine, a container, or any combinations thereof. An example host computing system may be a physical computer. The physical computer may be a hardware-based device (e.g., a personal computer, a laptop, or the like) including an operating system (OS). The virtual machine may operate with its own guest OS on the physical computer using resources of the physical computer virtualized by virtualization software (e.g., a hypervisor, a virtual machine monitor, and the like). The container may be a data computer node that runs on top of a host operating system without the need for a hypervisor or a separate operating system.
Further, monitored computing-instances 102A-102N include corresponding monitoring agents 104A-104N to monitor respective computing-instances 102A-102N. In an example, monitoring agent 104A deployed in monitored computing-instance 102A fetches the metrics from various components of monitored computing-instance 102A. For example, monitoring agent 104A monitors computing-instance 102A in real time to collect metrics (e.g., telemetry data) associated with an application or an operating system running in monitored computing-instance 102A. Example monitoring agents 104A-104N include Telegraf agents, collectd agents, or the like. Example metrics may include performance metric values associated with at least one of a central processing unit (CPU), memory, storage, graphics, network traffic, or the like.
An example computing node 106 may be a remote collector, which is an additional cluster node that allows monitoring tool 120 to gather the metrics for monitoring purposes. For example, computing node 106 may be a physical computing device, a virtual machine, a container, or the like. Computing node 106 receives the metrics from monitoring agents 104A-104N via a network and filters the metrics prior to ingesting the metrics to monitoring tool 120. In an example, computing node 106 may be connected external to monitoring tool 120 via the network.
An example network can be a managed Internet protocol (IP) network administered by a service provider. For example, the network may be implemented using wireless protocols and technologies, such as WiFi, WiMax, and the like. In other examples, the network can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. In yet other examples, the network may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.
Further, computing node 106 includes an incident knowledge base 108. Incident knowledge base 108 stores historical events that occur in a datacenter. Further, incident knowledge base 108 stores the metrics that are relevant for each historical event and the dependency relationships between the metrics corresponding to each historical event.
Furthermore, computing node 106 includes a metric dependency graph knowledge base 110 to store a data structure representing the relationship between a plurality of metrics. In an example, the data structure includes multiple metric dependency levels of the metrics with each metric dependency level mapped to a corresponding severity condition. The data structure may be a directed acyclic graph (DAG) including the metric dependency levels indicating an order of dependency between the plurality of metrics. The directed acyclic graph may include a plurality of nodes each representing a metric of the plurality of metrics and a set of edges connecting the plurality of nodes representing dependency relationships between the plurality of metrics. Incident knowledge base 108 and metric dependency graph knowledge base 110 may be stored in a storage device of computing node 106 or in a storage device connected external to computing node 106.
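One possible in-memory representation of such a directed acyclic graph is sketched below. The field name "dependsOn" mirrors the description above, while the metric names, levels, and helper functions are invented for illustration:

```python
# Hypothetical DAG: each node records its dependency level and the child
# metrics it depends on; edges point from parent to child.
METRIC_DAG = {
    "M1":   {"level": 1, "dependsOn": ["M11", "M12"]},
    "M11":  {"level": 2, "dependsOn": ["M111", "M112"]},
    "M12":  {"level": 2, "dependsOn": []},
    "M111": {"level": 3, "dependsOn": []},
    "M112": {"level": 3, "dependsOn": []},
}

def edges(dag):
    """Yield the (parent, child) edges representing dependency relationships."""
    for parent, node in dag.items():
        for child in node["dependsOn"]:
            yield (parent, child)

def is_acyclic(dag):
    """Verify the graph has no cycles, as required for a DAG."""
    visiting, done = set(), set()
    def dfs(name):
        if name in done:
            return True
        if name in visiting:
            return False  # a back-edge means a cycle
        visiting.add(name)
        ok = all(dfs(child) for child in dag[name]["dependsOn"])
        visiting.discard(name)
        done.add(name)
        return ok
    return all(dfs(name) for name in dag)

print(is_acyclic(METRIC_DAG))  # True
```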
Furthermore, computing node 106 includes a processor 112 and a memory 114. The term “processor” may refer to, for example, a central processing unit (CPU), a semiconductor-based microprocessor, a digital signal processor (DSP) such as a digital image processing unit, or other hardware devices or processing elements suitable to retrieve and execute instructions stored in a storage medium, or suitable combinations thereof. Processor 112 may, for example, include single or multiple cores on a chip, multiple cores across multiple chips, multiple cores across multiple devices, or suitable combinations thereof. Processor 112 may be functional to fetch, decode, and execute instructions as described herein.
During operation, for each historical event that occurs in a datacenter, processor 112 may:
Further, memory 114 includes a metric collector unit 116 and a metric rule unit 118. During operation, metric collector unit 116 receives metrics of a monitored computing-instance (e.g., 102A) from a monitoring agent (e.g., 104A) running on monitored computing-instance 102A. Further, metric collector unit 116 retrieves the data structure corresponding to the received metrics from metric dependency graph knowledge base 110.
Furthermore, metric rule unit 118 determines a severity level of a root metric (e.g., a parent metric) of the received metrics using the retrieved data structure. In an example, metric rule unit 118 determines that a value of the root metric matches a severity condition in the data structure. Further, metric rule unit 118 determines the severity level of the root metric corresponding to the matched severity condition.
Further, metric rule unit 118 filters the received metrics based on the metric dependency levels in the data structure and the determined severity level. In an example, metric rule unit 118 determines a metric dependency level based on the severity level of the root metric. Further, metric rule unit 118 may filter the received metrics by discarding the metrics that correspond to metric dependency levels greater than the determined metric dependency level. An example process to filter the metrics is described in
Furthermore, metric rule unit 118 ingests the filtered metrics to monitoring tool 120 to monitor health of monitored computing-instance 102A. In an example, metric rule unit 118 may
In some examples, the functionalities described in
At 202, an incident that occurred in a computing-instance (e.g., a datacenter) may be received. In datacenter management, an incident tracker may report incidents or issues that occur in the datacenter. For example, an incident may be “slow workload/application performance on multiple virtual machines on multiple host computing systems”, “slow throughput for cold migrations”, and the like. Further, the incident may be translated to metrics related to the host computing systems, which in turn depend on metrics related to a network, a storage, and the like. Furthermore, the received incident and associated metrics may be stored in an incident knowledge base (e.g., incident knowledge base 108 as shown in
At 204, a data structure (e.g., a directed acyclic graph (DAG)) and severity levels definition may be derived for each incident stored in incident knowledge base 108. For example,
severity level(Si)=Mi(numa,numb).
In the example shown in
In another example as shown in
severity level(Si)=Mi(numa,numb) AND (Mj(numc,numd) OR Mk(nume,numf))
An example Boolean expression with metrics M1, M2, and M3 may be defined as M1(10, 30) AND (M2(40, 50) OR M3(35, 45)), which may translate to:
(M1>=10 and M1<=30) AND ((M2>=40 and M2<=50) OR (M3>=35 and M3<=45)).
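The Boolean expression above may be evaluated directly as shown in the following sketch; the helper name in_range and the sample metric values are illustrative assumptions:

```python
def in_range(value, lo, hi):
    """Mi(lo, hi) denotes the condition lo <= Mi <= hi."""
    return lo <= value <= hi

def severity_matches(m1, m2, m3):
    """Evaluate M1(10, 30) AND (M2(40, 50) OR M3(35, 45))."""
    return in_range(m1, 10, 30) and (in_range(m2, 40, 50) or in_range(m3, 35, 45))

print(severity_matches(20, 45, 0))  # True: M1 in [10, 30] and M2 in [40, 50]
print(severity_matches(20, 0, 0))   # False: neither M2 nor M3 is in range
```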
As shown in
In an example, metric dependency level 1 metrics (e.g., M1) may include “host health status”, metric dependency level 2 metrics (e.g., M11 and M12) may include “central processing unit (CPU) capacity usage”, “memory capacity usage”, “net throughput usage”, “disk throughput usage”, and the like. Further, metric dependency level 3 metrics (e.g., M111, M112, M121, and so on) may include “central processing unit load average time”, “memory capacity contention”, “net throughput provisioned”, “disk throughput contention”, and the like that depend on a corresponding one of the metric dependency level 2 metrics.
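The grouping above may be sketched as follows, assuming the hypothetical level assignments given in this example:

```python
# Metric names from the example above, grouped by dependency level.
LEVELS = {
    1: ["host health status"],
    2: ["cpu capacity usage", "memory capacity usage",
        "net throughput usage", "disk throughput usage"],
    3: ["cpu load average time", "memory capacity contention",
        "net throughput provisioned", "disk throughput contention"],
}

def metrics_up_to(level):
    """Metric names ingested when the determined dependency level is `level`."""
    return [m for lvl in sorted(LEVELS) if lvl <= level for m in LEVELS[lvl]]

print(metrics_up_to(1))  # ['host health status']
```

At severity level 1 only “host health status” is ingested; higher severity levels pull in the level 2 and level 3 metrics as well.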
In an example, dependency of metrics can be arrived at with “dependsOn” (e.g., 266 of
Furthermore, the severity levels as depicted in
Referring back to
At 306, a severity level of the first metric may be determined based on the severity conditions in the data structure. In an example, determining the severity level of the first metric may include:
At 308, the received metrics may be filtered based on the data structure and the severity level of the first metric. In an example, filtering the received metrics may include:
At 310, the filtered metrics may be ingested to a monitoring tool to monitor a health of the monitored computing-instance. In an example, ingesting the filtered metrics to the monitoring tool may include ingesting values of the filtered metrics over a period to the monitoring tool to monitor the health of the monitored computing-instance. In another example, ingesting the filtered metrics to the monitoring tool may include:
At 406, the metrics collector service may check whether an “aggregate” option (e.g., as shown in field 264 of
At 410, the resultant metric value along with the DAG and severity definition may be transmitted to a metrics rule unit (e.g., metric rule unit 118 of FIG. 1). In an example, the metrics rule unit may use the received information that is provided by the metrics collector service and perform the evaluation as shown in blocks 412 to 424.
At 412, a severity level “N” may be considered to evaluate the resultant metric value. At 414, the resultant metric value of the parent metric may be evaluated against a severity condition associated with the severity level “N” in a data structure. At 416, a check may be made to determine whether the severity condition matches with the resultant metric value. When the severity condition matches, at 418, metrics names of metric dependency level 1 to metric dependency level N may be collected from the DAG. At 420, metrics values for metric dependency level 1 to metric dependency level N for the metrics names may be collected. Further, the metric values from metric dependency level N+1 onwards may be dropped.
When the severity condition does not match, the severity level “N” may be reduced by “1” and the steps 414 to 424 may be repeated to evaluate the resultant metric value. When the severity condition “N−1” matches, metrics names of metric dependency level 1 to metric dependency level N−1 may be collected from the DAG and metrics values for metric dependency level 1 to metric dependency level N−1 for the metrics names may be collected. Further, the metric values from metric dependency level N onwards may be dropped. Thus, the process is repeated until the resultant metric value matches with one of the severity conditions in the data structure.
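The evaluation loop of blocks 412 to 424 may be sketched as follows; the severity conditions, level bound, and metric values are invented for illustration:

```python
# Hypothetical severity conditions: severity level -> inclusive (low, high)
# range on the parent (root) metric's resultant value.
SEVERITY_CONDITIONS = {
    3: (81, 100),
    2: (51, 80),
    1: (0, 50),
}

def determine_level(parent_value, max_level=3):
    """Start at severity level N and step down until a condition matches."""
    n = max_level
    while n >= 1:
        lo, hi = SEVERITY_CONDITIONS[n]
        if lo <= parent_value <= hi:  # block 416: does the condition match?
            return n
        n -= 1                        # no match: reduce N by 1 and re-evaluate
    raise ValueError("no severity condition matched the resultant metric value")

def filter_by_level(metrics_by_level, n):
    """Collect metric values for dependency levels 1..N; drop N+1 onwards."""
    return {lvl: vals for lvl, vals in metrics_by_level.items() if lvl <= n}

level = determine_level(65)  # 65 falls in (51, 80) -> severity level 2
print(level)                 # 2
```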
Considering DAG of
At 422, the filtered metrics may be ingested to the monitoring tool by a metrics ingestor. At 424, the monitoring tool may analyze the filtered metrics to determine health of different components of computing-instance.
In an example, ingesting one metric to the monitoring tool costs X, so sending N metrics to the monitoring tool may cost X*N. With the examples described herein, by mapping the various severity levels to the number of metrics that will be ingested, the cost may be reduced. For example, consider ingesting metrics at a frequency of 1 min while a monitoring agent pushes 7 metrics to the collector service. When the metrics are not filtered, all 7 metrics are ingested to the monitoring tool, so the cost would be X*7. With the examples described herein, the 7 metrics may be grouped into various levels in the form of the DAG like:
In an example, only 1 metric may be ingested when the computing-instance is working normally, which would have cost only X*1. Thus, a 7× reduction in the cost may be achieved, which is 86% of savings. In another example, when the working condition deteriorates, there will be a gradual increase in the cost. For example, the cost may reach its maximum and equate to X*7 when an incident occurs. Thus, examples described herein may facilitate ingesting the necessary metrics in the various phases of the incident occurrence, namely pre-incident, incident, and post-incident, to understand the issues better and analyze them from the monitoring tool while having control on the number of metrics that get ingested, which is directly proportional to the cost.
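The cost arithmetic above can be checked with a short sketch, normalizing the per-metric cost X to 1.0:

```python
def ingestion_cost(x, num_metrics):
    """Total ingestion cost: per-metric cost x times number of metrics."""
    return x * num_metrics

full = ingestion_cost(1.0, 7)    # unfiltered: all 7 metrics ingested, X*7
normal = ingestion_cost(1.0, 1)  # healthy instance: only the root metric, X*1
savings = (full - normal) / full # fraction of cost avoided by filtering
print(f"{savings:.0%}")          # 86%
```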
It should be understood that the processes depicted in
Machine-readable storage medium 504 may store instructions 506, 508, 510, 512, and 514. Instructions 506 may be executed by processor 502 to receive an event that occurs in a monitored computing-instance of a datacenter. Instructions 508 may be executed by processor 502 to receive metrics that are relevant for the event and relationship between the metrics.
Instructions 510 may be executed by processor 502 to generate a data structure including metric dependency levels associated with the metrics based on the relationship between the metrics. In an example, the data structure may be a directed acyclic graph (DAG) including the metric dependency levels indicating an order of dependency between the plurality of metrics.
Instructions 510 may be executed by processor 502 to define a severity condition corresponding to each metric dependency level in the data structure. In an example, instructions to define the severity condition comprise instructions to:
Instructions 510 may be executed by processor 502 to maintain a metric dependency graph knowledge base to store the data structure and the defined severity condition for each metric dependency level.
Instructions 512 may be executed by processor 502 to filter incoming metrics corresponding to an upcoming event based on the data structure and the defined severity conditions in the metric dependency graph knowledge base. In an example, instructions to filter the incoming metrics corresponding to an upcoming event may include instructions to:
Instructions 514 may be executed by processor 502 to ingest the filtered metrics to a monitoring tool to monitor a health of the monitored computing-instance.
Machine-readable storage medium 504 may further store instructions to be executed by processor 502 to receive a second event that occurs in a monitored computing-instance of a datacenter; and update the metric dependency graph knowledge base with a second data structure of metrics that are relevant to the second event along with associated severity conditions.
Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a non-transitory computer-readable medium (e.g., as a hard disk; a computer memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more host computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques.
It may be noted that the above-described examples of the present solution are for the purpose of illustration only. Although the solution has been described in conjunction with a specific embodiment thereof, numerous modifications may be possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
The terms “include,” “have,” and variations thereof, as used herein, have the same meaning as the term “comprise” or appropriate variation thereof. Furthermore, the term “based on”, as used herein, means “based at least in part on.” Thus, a feature that is described as based on some stimulus can be based on the stimulus or a combination of stimuli including the stimulus.
The present description has been shown and described with reference to the foregoing examples. It is understood, however, that other forms, details, and examples can be made without departing from the spirit and scope of the present subject matter that is defined in the following claims.