Embodiments of the invention relate to the field of cloud computing; and more specifically, to the dynamic management of monitoring tasks in a cloud environment.
Compute and network performance monitoring is critical to evaluate the overall performance of applications and services in cloud environments and to identify and diagnose performance bottlenecks. A key challenge related to large-scale measurements is the fact that the measurement methods themselves consume compute and network resources. One example is active network measurements. An active monitoring function (i.e., a probe) running on a first network device injects probe packets into the network, which are received at another network device. The probe packets can be collected in the receiver network device or reflected to the sender network device. Some measurements require the probe packets to be timestamped in both the sender and the receiver at sending and arrival events. In this way one can study interactions between probe packets and the cross traffic and draw conclusions about the network characteristics and the cross traffic dynamics.
In the case of active network measurements, the overhead in terms of consumed capacity can become critical if the measured paths partly overlap. Running multiple active measurements at the same time over the same path can lead to “measurement conflicts”, where probe packets of one measurement tool are viewed as user traffic by another tool. Measurement conflicts can result in faulty measurements. For example, it has been shown that overlapping links of measurement paths may bias metric estimation due to interference in the network. Additionally, overlapping use of compute resources by CPU-intensive monitoring functions can also lead to measurement conflicts and inaccurate measurement results, e.g., by affecting the accuracy of the timestamp information. In addition to active network measurements, the transfer of monitoring results from one node to another (e.g., transfer of passively captured traffic traces) can consume compute and network resources and affect the performance of applications.
Task scheduling is a widely studied problem. As an example, in a known scheduling approach the scheduling of tasks is considered with respect to the inter-dependencies between the tasks as well as their usage of resources such as memory and bandwidth. In this approach, a scheduler obtains resource consumption from the user annotations of the task in order to schedule them. However, the scheduler cannot modify or re-configure the tasks in order to adapt them to the available resources.
Further, a variety of approaches have considered the scheduling problem of active measurement tasks to prevent measurement conflicts while satisfying the measurement requirements. For example, some approaches consider scheduling measurement tasks with the goal of reducing interference (e.g., M. Zhang, M. Swany, A. Yavanamanda and E. Kissel, “HELM: Conflict-free active measurement scheduling for shared network resource management,” 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), Ottawa, ON, 2015, pp. 113-121). In another example, a centralized scheduler is used to schedule latency measurements between servers in a datacenter or between datacenters with the objective of achieving full coverage of the network (e.g., “Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis” Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, Varugis Kurien, Microsoft, Midfin Systems).
However, the existing monitoring approaches and their scheduling have several disadvantages. Centralized scheduling solutions for monitoring tasks do not have access to the compute resource utilization of servers and cannot schedule local monitoring tasks (e.g., monitoring communication between applications within one server hosting the application units). Further, current solutions only consider resource conflicts for performing measurements and do not consider other monitoring-related tasks such as the transfer of monitoring results. Current solutions are optimized for static placement of long-lived probes in topologies that experience little change. In addition, existing solutions only schedule pre-configured monitoring tasks and cannot re-configure the monitoring tasks to optimize the scheduling.
There is a need for a solution that enables dynamic reconfiguration and scheduling of monitoring tasks and that is adaptable to changes that occur in the network and in resource usage, in order to provide efficient usage of resources and accurate measurement results.
In a first broad aspect, a method for resource-aware dynamic monitoring of application units is described. The method comprises causing instantiation of a monitoring element operative to monitor an activity of one or more application units which are part of a network; obtaining a first usage status of resources which are to be used by the monitoring element when in operation; setting one or more configuration parameters of the monitoring element based upon the first usage status of the resources; and scheduling the monitoring element based upon the first usage status of the resources, where scheduling the monitoring element causes the monitoring element to start monitoring the one or more application units at a predetermined start time. The method continues with obtaining a second usage status of the resources; and determining whether to update the monitoring element based upon the second usage status of the resources. The method continues with, responsive to determining that the monitoring element is to be updated, performing at least one of the following: (i) updating the one or more configuration parameters of the monitoring element based upon the second usage status of the resources, and (ii) rescheduling the monitoring element based upon the second usage status of the resources.
In a second broad aspect, a network device for resource-aware dynamic monitoring of application units is described. The network device is configured to cause instantiation of a monitoring element operative to monitor an activity of one or more application units which are part of a network; obtain a first usage status of resources which are to be used by the monitoring element when in operation; set one or more configuration parameters of the monitoring element based upon the first usage status of the resources; and schedule the monitoring element based upon the first usage status of the resources, wherein scheduling the monitoring element causes the monitoring element to start monitoring the one or more application units at a predetermined start time. The network device is further operative to obtain a second usage status of the resources; and determine whether to update the monitoring element based upon the second usage status of the resources. The network device is further operative to perform, responsive to determining that the monitoring element is to be updated, at least one of the following: (i) update the one or more configuration parameters of the monitoring element based upon the second usage status of the resources, and (ii) reschedule the monitoring element based upon the second usage status of the resources.
In a third broad aspect, a non-transitory computer readable storage medium that provides instructions is described. The instructions, when executed by a processor, cause said processor to perform operations comprising: causing instantiation of a monitoring element operative to monitor an activity of one or more application units which are part of a network; obtaining a first usage status of resources which are to be used by the monitoring element when in operation; setting one or more configuration parameters of the monitoring element based upon the first usage status of the resources; and scheduling the monitoring element based upon the first usage status of the resources, where scheduling the monitoring element causes the monitoring element to start monitoring the one or more application units at a predetermined start time. The operations further include obtaining a second usage status of the resources; and determining whether to update the monitoring element based upon the second usage status of the resources. The operations further include, responsive to determining that the monitoring element is to be updated, performing at least one of the following: (i) updating the one or more configuration parameters of the monitoring element based upon the second usage status of the resources, and (ii) rescheduling the monitoring element based upon the second usage status of the resources.
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
The following description describes methods and apparatus for enabling resource-aware dynamic monitoring of application units. In the following description, numerous specific details such as logic implementations, opcodes, means to specify operands, resource partitioning/sharing/duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Bracketed text and blocks with dashed borders (e.g., large dashes, small dashes, dot-dash, and dots) may be used herein to illustrate optional operations that add additional features to embodiments of the invention. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments of the invention.
In the following description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. “Coupled” is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other. “Connected” is used to indicate the establishment of communication between two or more elements that are coupled with each other.
An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, read only memory (ROM), flash memory devices, phase change memory) and machine-readable transmission media (also called a carrier) (e.g., electrical, optical, radio, acoustical or other forms of propagated signals, such as carrier waves and infrared signals). Thus, an electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors coupled to one or more machine-readable storage media to store code for execution on the set of processors and/or to store data. For instance, an electronic device may include non-volatile memory containing the code, since the non-volatile memory can persist code/data even when the electronic device is turned off (when power is removed), and while the electronic device is turned on, that part of the code that is to be executed by the processor(s) of that electronic device is typically copied from the slower non-volatile memory into volatile memory (e.g., dynamic random access memory (DRAM), static random access memory (SRAM)) of that electronic device. Typical electronic devices also include a set of one or more physical network interface(s) to establish network connections (to transmit and/or receive code and/or data using propagating signals) with other electronic devices. One or more parts of an embodiment of the invention may be implemented using different combinations of software, firmware, and/or hardware.
A network device (ND) is an electronic device that communicatively interconnects other electronic devices on the network (e.g., other network devices, end-user devices). Some network devices are “multiple services network devices” that provide support for multiple networking functions (e.g., routing, bridging, switching, Layer 2 aggregation, session border control, Quality of Service, and/or subscriber management), and/or provide support for multiple application services (e.g., data, voice, and video).
Existing approaches for monitoring tasks performed by application units lack adaptability to changes in network topology and/or resource allocation. The embodiments of the present invention present a solution that enables dynamic reconfiguration and scheduling of monitoring tasks and that is adaptable to changes that occur in the network and in resource usage, consequently enabling efficient usage of resources and accurate measurement results. The embodiments present methods and apparatuses that can automatically and dynamically configure and schedule different types of monitoring functionalities (e.g., monitoring tasks that include active and passive network measurements, transfer of monitoring results and captured traffic) with the objective of avoiding overloading measurement resources (e.g., CPU, network links, memory, etc.) while providing accurate measurement results. The embodiments present several advantages in a cloud environment by enabling monitoring of virtualized or physical application units running in a data center. In particular, in cloud environments where virtualized applications can be started, stopped, and moved dynamically, the solution of the present invention enables measurement tasks to keep up with the dynamic nature of the applications. Additionally, the monitoring tasks are dynamically configured such that, while they provide accurate measurement results, they do not exhaust the resources (CPU, memory, network links, etc.) used by the application units.
The embodiments present a method and a network device for resource-aware dynamic monitoring of application units. A monitoring element is instantiated. The monitoring element is operative to monitor an activity of one or more application units which are part of a network. A first usage status of resources, where the resources are to be used by the monitoring element when in operation, is obtained. One or more configuration parameters of the monitoring element are set based upon the first usage status of the resources. The monitoring element is scheduled based upon the first usage status of the resources, where the scheduling of the monitoring element causes the monitoring element to start monitoring the one or more application units at a predetermined start time. A second usage status of the resources is obtained. A determination of whether to update the monitoring element based upon the second usage status of the resources is performed. In response to determining that the monitoring element is to be updated, at least one of the following operations is performed: (i) updating the one or more configuration parameters of the monitoring element based upon the second usage status of the resources, and (ii) rescheduling the monitoring element based upon the second usage status of the resources. In some embodiments, the process of obtaining the usage status of the resources is continuously performed and, as a result, the monitoring element can be updated and/or rescheduled dynamically based upon the usage status of the resources.
Thus, the embodiments present a method and a network device that constantly obtain the status of shared resources (local and/or external) to adapt the configuration and scheduling of monitoring tasks in reaction to resource utilization changes. The dynamicity of monitoring configuration and scheduling allows minimization of impact on other services running in the cloud environment.
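By way of non-limiting illustration, the sequence of obtaining a usage status, configuring, scheduling, and later updating a monitoring element can be sketched in Python as follows. The class names, the 0.8 overload threshold, the interval values, and the 30-second delay are assumptions made purely for this example and are not taken from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class UsageStatus:
    """Hypothetical snapshot of resource usage, each value in [0.0, 1.0]."""
    cpu: float
    memory: float
    link: float

    def peak(self) -> float:
        return max(self.cpu, self.memory, self.link)


@dataclass
class MonitoringElement:
    """Hypothetical monitoring element with one configuration parameter."""
    name: str
    interval_s: float = 1.0   # time between measurements
    start_time: float = 0.0   # scheduled start time


OVERLOAD = 0.8  # assumed threshold above which resources count as overloaded


def configure(element, status):
    # Measure less often when the resources are busy.
    element.interval_s = 10.0 if status.peak() > OVERLOAD else 1.0


def schedule(element, status, now):
    # Delay the start when the resources are busy.
    element.start_time = now + (30.0 if status.peak() > OVERLOAD else 0.0)


def needs_update(element, status):
    # Update when the current configuration no longer matches the load.
    overloaded = status.peak() > OVERLOAD
    return overloaded != (element.interval_s > 1.0)


def control_step(element, status, now):
    """One pass of the obtain-status / decide / update cycle."""
    if needs_update(element, status):
        configure(element, status)
        schedule(element, status, now)
        return True
    return False
```

In this sketch, calling control_step repeatedly with fresh usage statuses corresponds to the continuous adaptation described above; a real controller could update the configuration and the schedule independently of each other.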
As will be discussed in further detail below, the elements of ND 101 enable dynamic and resource-aware monitoring of the application units 116A-N. LMC 112 is operative to instantiate, configure and schedule the monitors 1180-S. The LMC is operative to receive a monitoring request. The monitoring request includes the type of monitoring task to be performed, as well as the application units to be monitored. In some embodiments, the monitoring task can be executed to monitor several application units. For example, a monitoring element operative to perform passive monitoring of packets forwarded by an application unit can monitor all packets forwarded by all the application units. Further, in some embodiments, the request may include initial configuration parameters for the monitoring task to be performed (e.g., frequency of monitoring, a priority associated with the monitoring task, etc.). The monitoring request information can be received by the LMC 112 from higher management and orchestration levels (e.g., a cloud management unit that is operative to manage the data center). Each LMC maintains a list of the monitoring tasks that should be executed.
The monitors 1180-S are operative to perform the requested monitoring task. Each monitoring element is configurable and schedulable by the LMC 112. In some embodiments, each monitoring element performs one or more types of measurements of the activity of application units 116A-N. For example, a first monitoring element (e.g., monitoring element 1180) can be used to perform passive measurement of the application units (e.g., capture network traffic transmitted or received by an application unit). In some embodiments, passive measurement can be performed by tapping into a switch of ND 101 (e.g., a virtual switch or an Ethernet switch). Another monitoring element (e.g., monitoring element 118P) can be used to perform active measurement between a first application unit and another application unit (which can reside on the ND 101 or on another ND). An active measurement monitoring element injects probe packets into the network, which are received at another probe (e.g., in another network device coupled with ND 101 through the network 105). The probe packets can be collected in the receiver probe or reflected to the network device 101. In some embodiments, the reflected probe packets can be received and processed by the active measurement monitoring element. In some embodiments, the probe packets are timestamped in both probes (by the active measurement monitoring element in ND 101 and in the receiving ND) at sending and arrival events. The active measurement monitoring element can be used to study interactions between probe packets and the cross traffic and draw conclusions about the network characteristics and the cross traffic dynamics.
A monitoring element can be operative to perform analytics by calculating statistics from the measured data (e.g., for performing troubleshooting tasks). In a first embodiment, the analytics monitoring element can reside locally (within ND 101). Alternatively, in a second embodiment, it can be implemented on an external network device in communication with ND 101 to receive the measurements and analyze them. In the first embodiment, the analytics monitoring element also uses local compute resources, which are considered when scheduling monitoring tasks. In the second embodiment, the measurement data (e.g., application traffic traces) is transferred for analysis, and this transfer also consumes shared compute and network resources. In another example, a monitoring element (e.g., monitoring element 118S) can be a database for storing measurement results of one or more monitoring tasks. The monitors are configured/reconfigured, instantiated, stopped, and scheduled/rescheduled by LMC 112. Each monitoring element can be configured to perform periodic monitoring of one or more application units or an on-demand single monitoring task.
ND 101 further includes a resource monitoring element 114. The resource monitoring element 114 is operative to monitor the usage and availability of local physical resources of the ND 101 (e.g., processor(s), memory, network links and interfaces, etc.). Similarly to the monitors 1180-S, the resource monitoring element 114 is also controlled by LMC 112, such that it can be configured/reconfigured, instantiated, stopped, and scheduled/rescheduled by LMC 112. The resource monitoring element can be configured to perform periodic monitoring of the resources or alternatively an on-demand single monitoring task of the resources. The resource monitoring element 114 is operative to determine the status of the resources and transmit it to the LMC 112. Resource usage and availability monitoring can be performed by using various tools such as the sar (System Activity Report) tool in Linux or a cAdvisor container for monitoring in a container execution environment. In some embodiments, LMC 112 is operative to communicate with LMCs in other NDs and with CNSM 102 using a Representational State Transfer (REST) API.
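As a non-limiting sketch, a monitoring request and the list of tasks maintained by a local monitoring controller could be represented in Python as follows. The field names and the validation rule are hypothetical and serve only to illustrate the information the request is described as carrying.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MonitoringRequest:
    """Hypothetical representation of a monitoring request."""
    task_type: str                         # e.g. "passive_capture"
    application_units: List[str]           # units to be monitored
    frequency_hz: Optional[float] = None   # optional initial parameter
    priority: Optional[int] = None         # optional initial parameter


@dataclass
class LocalMonitoringController:
    """Keeps the list of monitoring tasks that should be executed."""
    tasks: List[MonitoringRequest] = field(default_factory=list)

    def receive(self, request: MonitoringRequest) -> int:
        # Assumed rule: a request must name at least one application unit.
        if not request.application_units:
            raise ValueError("request names no application unit")
        self.tasks.append(request)
        return len(self.tasks)
```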
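To illustrate why timestamps at both probes are useful, the following Python sketch derives delay values from four timestamps of a reflected probe packet. The naming t1 to t4 and the functions are assumptions made for this example; the one-way values are meaningful only if the clocks of the two devices are synchronized.

```python
def round_trip_time(t1, t2, t3, t4):
    """Round-trip time excluding the time spent inside the reflector.

    t1: probe sent by the sender      t2: probe received by the reflector
    t3: probe sent back by reflector  t4: probe received by the sender
    """
    return (t4 - t1) - (t3 - t2)


def one_way_delays(t1, t2, t3, t4):
    """Forward and reverse one-way delays (requires synchronized clocks)."""
    return t2 - t1, t4 - t3
```

For example, with timestamps in microseconds t1=1000, t2=1400, t3=1450 and t4=1900, the round-trip time is 850 and the forward and reverse delays are 400 and 450.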
ND 101 is further communicatively coupled with CNSM 102. CNSM 102 is a network device operative to transmit the status of external resources (e.g., network resource utilization of the network including the ND 101) to the LMC 112. In one embodiment, the CNSM 102 includes a distributed data store that receives information on network links from various LMCs located in network devices within the network. In this embodiment, the CNSM 102 does not make any decisions and provides the gathered information about the network status to the LMCs when needed. The CNSM 102 receives information about network monitoring elements from different LMCs (e.g., when LMC 112 initiates a monitoring element that performs an active network monitoring task or transfers monitoring data to an external network device, it also forwards this information to the CNSM 102) and uses the information to generate a global view of the measurements occurring in the entire network.
In some embodiments, the global view can be organized as a graph where each node of the graph corresponds to a monitoring endpoint (MEP) and each edge corresponds to a measurement between two endpoints (i.e., an end point of a measurement task performed by a respective monitoring element). The graph can optionally be weighted where edge weights correspond to the type, frequency, and duration of the measurements. CNSM 102 then maps the measurement graph onto the topology graph of the network such that each MEP is associated with a corresponding network device. Several MEPs can be mapped to a single network device.
CNSM 102 continuously receives updates from different LMCs in the network, and upon receipt of the information, it updates the global view (e.g., graph) and the mapping onto the topology of the network. Similarly, when the topology of the network is updated, CNSM 102 updates the mapping. The mapping of two monitoring functions MEPi and MEPj is performed by identifying the network devices in the network topology where these monitoring elements are executed, and determining the path between them.
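The mapping of a measurement between two monitoring endpoints onto the network topology can be sketched as follows. This is a minimal illustration that assumes the topology is available as an adjacency dictionary and that the path between two devices is the shortest hop-count path; in a deployed network the path would follow the actual routing. All names are hypothetical.

```python
from collections import deque


def find_path(topology, src, dst):
    """Breadth-first search; returns the list of devices from src to dst."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in topology.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no path between the two devices


def map_measurement(topology, placement, mep_i, mep_j):
    """Map the measurement edge (mep_i, mep_j) onto a topology path.

    placement maps each monitoring endpoint to the device executing it;
    several endpoints may be placed on the same device.
    """
    return find_path(topology, placement[mep_i], placement[mep_j])
```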
In some embodiments, the CNSM 102 further performs network overlap detection and stores the information for use by the LMCs in the network. Network overlap detection can be performed at the CNSM 102, whenever the network topology is updated or the network measurements are started, stopped, or modified. In some embodiments, network overlap detection can be performed by assigning a counter/weight to each link on the network topology. When a monitoring element is mapped on a link, the link counter is increased. After mapping of all the monitoring elements, the links and endpoints which have high counters/weights (e.g., when compared with a predetermined threshold) are considered as overloaded resources. LMCs that have to schedule network monitoring elements, receive the overload information from the CNSM 102 and consider it for updating or scheduling the monitoring elements. For example, referring back to
In some embodiments, in addition to gathering and forwarding the information, the CNSM 102 may act as an admission controller. In these embodiments, the CNSM 102 uses the network overlap information to decide whether or not to admit a network monitoring task. In some embodiments, CNSM 102 may further make admission decisions based on the usage status of local resources. In these embodiments, CNSM 102 receives, from each ND including an LMC, updates on network changes as well as the usage status of the local resources (e.g., CPU and memory usage). Alternatively, when the CNSM 102 does not act as an admission controller for the monitoring elements, it stores the overlap detection information such that it can be queried by LMCs.
When a monitoring request is received at operation (A), LMC 112 instantiates a monitoring element for performing the requested monitoring task. At operations (B1-B2), LMC 112 obtains the current status of the resources. In one embodiment, LMC 112 obtains a usage status of the local resources within the network device 101 from the resource monitoring element 114. In some embodiments, in addition or alternatively to the local resources, LMC 112 may further obtain the current usage status of the external resources from the CNSM 102 (operation B2). The usage status of the resources (local or external resources) includes information regarding the current usage of the resources (e.g., the memory usage within ND 101, CPU usage within ND 101, network link utilization within or outside of ND 101, etc.) and the availability of the resources. In some embodiments, if the usage status is not available (e.g., the resource monitoring element is not operating), LMC 112 instantiates the resource monitoring element 114. In some embodiments, the resource monitoring element 114 can be instantiated by sending a request to the application unit manager 120. In some embodiments, when the ND 101 includes virtualized network elements (containers, virtual machines implementing the different elements of ND 101), the application unit manager 120 is a virtual element manager that is operative to instantiate the different instances of the elements within ND 101.
At operation (C), based upon the usage status obtained, LMC 112 configures the monitoring element with proper parameters for performing the requested monitoring task. The LMC 112 further defines a scheduling strategy for executing the configured monitoring element. In some embodiments, when the monitoring task involves network resources that are outside ND 101, LMC 112 may inform CNSM 102 about the network resources that will be used during the measurements. For example, if the configured monitoring element is to perform active measurement between ND 101 and another network device, LMC 112 informs the CNSM 102.
Following the instantiation, configuration and scheduling of a first monitoring element to perform a monitoring task of an application unit 116A-N, the LMC 112 continuously obtains the usage status of the resources (local and/or external resources) and determines whether to update/reconfigure the monitoring element and/or reschedule the monitoring element such that there is no overload of the resources. For example, as will be described in further detail below, if the available resources are not sufficient for the currently scheduled monitoring tasks, the LMC 112 reconfigures the monitoring tasks and/or reschedules them according to the available resources.
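The counter-based overlap detection and its use for admission control can be sketched in Python as follows. The function names, the representation of each measurement as a list of devices along its path, and the rule that a task is rejected when any link on its path would exceed the threshold are assumptions made for this illustration.

```python
from collections import Counter


def link_counters(paths):
    """Count how many mapped measurements traverse each link."""
    counters = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            counters[frozenset((a, b))] += 1  # undirected link
    return counters


def overloaded_links(counters, threshold):
    """Links whose counter exceeds the predetermined threshold."""
    return {link for link, count in counters.items() if count > threshold}


def admit(new_path, active_paths, threshold):
    """Admit a new measurement only if it overloads no link on its path."""
    counters = link_counters(list(active_paths) + [new_path])
    new_links = {frozenset(pair) for pair in zip(new_path, new_path[1:])}
    return not (new_links & overloaded_links(counters, threshold))
```

When the CNSM does not act as an admission controller, the result of overloaded_links is the kind of information an LMC could query before scheduling its own monitoring elements.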
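One possible way to express operations (B1-B2), including the fallback of starting the resource monitoring element when no local status is available, is sketched below. The callables passed in stand for the resource monitoring element, the CNSM, and the application unit manager; their names and the dictionary layout are hypothetical.

```python
def obtain_usage_status(read_local, read_external=None,
                        start_resource_monitor=None):
    """Collect the local and, optionally, external usage status.

    read_local: returns a dict of local usage values, or None when the
        resource monitoring element is not running.
    read_external: optional callable returning external resource usage.
    start_resource_monitor: optional callable that instantiates the
        resource monitoring element.
    """
    status = {}
    local = read_local()
    if local is None and start_resource_monitor is not None:
        # No status available: instantiate the monitor and read again.
        start_resource_monitor()
        local = read_local()
    if local is not None:
        status["local"] = local
    if read_external is not None:
        status["external"] = read_external()
    return status
```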
The operations in the flow diagrams will be described with reference to the exemplary embodiments of
In some embodiments, LMC 112 may optionally perform discovery of the application units to be monitored. When the application units are virtual elements (e.g., VMs/containers), the discovery can be performed by tapping onto hypervisor/container daemon messaging and retrieving application creation and destruction messages. For example, an LMC 112 can listen to event messages generated by the local Docker daemon regarding creation and removal of application containers. Once an application is started, LMC 112 instantiates the required active monitoring function(s) for example inside an active monitoring container. In this way an LMC 112 identifies the measurement endpoints.
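As a non-limiting sketch of this discovery step, the following Python function processes a sequence of event dictionaries shaped like the container events reported by a Docker daemon (with "Type", "Action" and "Actor" fields) and keeps track of the application units to be monitored. The function name is hypothetical, and the choice of reacting to the "start" and "destroy" actions is an assumption made for the example.

```python
def track_application_units(events, monitored=None):
    """Return the application units known after processing the events.

    The result maps a container ID to the container name; a unit is
    added when it starts and dropped when it is destroyed.
    """
    monitored = dict(monitored or {})
    for event in events:
        if event.get("Type") != "container":
            continue  # ignore image, network, volume events
        actor = event.get("Actor", {})
        unit_id = actor.get("ID")
        action = event.get("Action")
        if action == "start":
            name = actor.get("Attributes", {}).get("name", "")
            monitored[unit_id] = name
        elif action == "destroy":
            monitored.pop(unit_id, None)
    return monitored
```

Each newly added entry is a point at which an LMC could instantiate the required monitoring function for that application unit.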
In some embodiments, LMC 112 may further perform discovery of application-generated flows. The discovery of the flows enables LMC 112 to discover remote monitoring endpoints and to pair the measurement points (i.e., pairing a monitoring element at ND 101 with another monitoring element in an external network device). This can be done by passively observing the application traffic, for example, by tapping onto a virtual switch or physical switch to which the application unit (e.g., container/VM) is connected. Based on the observed application flows, the LMC 112 can then identify the remote measurement endpoint. This information can then be used for the active monitoring sessions of an active monitoring element.
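The pairing of local and remote measurement endpoints from observed flows can be illustrated with the short Python sketch below. It assumes that each observed flow is reduced to a (source address, destination address) pair and that the set of local application unit addresses is known; both are assumptions made for this example.

```python
def pair_endpoints(flows, local_addresses):
    """Return sorted (local, remote) address pairs seen in the flows."""
    pairs = set()
    for src, dst in flows:
        if src in local_addresses and dst not in local_addresses:
            pairs.add((src, dst))
        elif dst in local_addresses and src not in local_addresses:
            pairs.add((dst, src))
        # Flows between two local units need no remote endpoint.
    return sorted(pairs)
```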
Flow then moves to operation 220, at which a first usage status of resources is obtained. The usage status obtained relates to usage and availability of resources which are to be used by the instantiated monitoring element when in operation. The resources can be local resources (e.g., network links, CPU, memory) or external (e.g., network links and paths with external network devices). When the monitoring task to be executed by the monitoring element requires the use of external network resources (e.g., when the monitoring task is an active measurement between ND 101 and another ND in the network), the LMC 112 also informs the CNSM 102 about the network resources involved in the measurements.
Alternatively, when LMC 112 determines that no external resources will be involved, the flow of operations moves to operation 330. At operation 330, LMC 112 uses the obtained usage status (local and/or external) to configure, update the configuration, schedule or update the scheduling of the monitoring element. In a non-limiting scenario, when the monitoring element performs a local measurement task (e.g., measuring the throughput between two local VMs), the configuration and the scheduling are done locally without communicating with CNSM 102. Alternatively, when the monitoring element involves external measurements (e.g., active network monitoring or transfer of monitoring results to a database on an external server), the LMC 112 can query the status of the network resources by communicating with the CNSM 102.
Referring back to the flow diagram of
At operation 230, LMC 112 schedules the monitoring element based upon the first usage status of the resources. Scheduling the monitoring element causes the monitoring element to start monitoring one or more application units at a predetermined start time by executing the requested monitoring task. LMC 112 schedules the monitoring element with the objective of avoiding any contention on the resources that will be used (e.g., memory, CPU, network link and interfaces within the ND 101, and/or external network resources such as network paths and path endpoints). Overlapping measurement resources (e.g., probe packets being forwarded over network paths also used by data traffic, CPU usage shared with the application units, etc.) can adversely affect the measurement results and can potentially overload the different resources used, and can result in a negative effect on the application units' performance. Thus, in the embodiments presented herein, LMC 112 takes into consideration any potential overlapping resource usage and automatically defines a scheduling strategy for the monitoring element that avoids contention on the shared resources. By using conflict information (i.e., information related to overlapping resources such as CPU, memory, NIs) LMC 112 solves a scheduling problem in order to provide efficient execution of the monitoring element.
A variety of scheduling techniques can be used. In one exemplary embodiment, a first scheduling approach consists of running a single measurement task at any point in time, for example using an Earliest Deadline First (EDF) scheduling algorithm. In another exemplary embodiment, Dynamic Priority Scheduling can be used. According to this technique, the scheduling of the monitoring element is dynamically adapted to changes in order to form an optimal configuration in a self-sustained manner. For example, a priority of a monitoring element can be modified based on the current resource consumption. In a non-limiting example, when an active monitoring element is already running and a passive monitoring element is to be scheduled, if the CPU consumption increases due to heavy operations performed by an application unit, the priority of the passive monitoring task can be decreased and its scheduling delayed. In another example, if the network becomes the bottleneck, the priority of scheduling the passive monitoring element can be increased while the priority of the active monitoring element can be decreased such that its scheduling is updated or delayed.
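As a non-limiting illustration of the two techniques, the sketch below runs one measurement task at a time in Earliest Deadline First order and shows one possible way to adjust priorities when a bottleneck is observed. The names `MeasurementTask`, `edf_schedule`, and `adjust_priorities`, the step size of one priority level, and the convention that a larger number means a higher priority are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class MeasurementTask:
    # Only the deadline takes part in ordering, so the heap pops the
    # task with the earliest deadline first.
    deadline: float
    name: str = field(compare=False)
    duration: float = field(compare=False)


def edf_schedule(tasks, now=0.0):
    """Run a single task at a time, earliest deadline first.

    Returns a list of (name, start, end, met_deadline) tuples.
    """
    heap = list(tasks)
    heapq.heapify(heap)
    timeline = []
    clock = now
    while heap:
        task = heapq.heappop(heap)
        start, end = clock, clock + task.duration
        timeline.append((task.name, start, end, end <= task.deadline))
        clock = end
    return timeline


def adjust_priorities(priorities, bottleneck):
    """Return updated priorities (assumption: larger number = higher priority).

    A CPU bottleneck lowers the passive element's priority; a network
    bottleneck raises the passive priority and lowers the active one.
    """
    updated = dict(priorities)
    if bottleneck == "cpu":
        updated["passive"] -= 1
    elif bottleneck == "network":
        updated["passive"] += 1
        updated["active"] -= 1
    return updated


tasks = [
    MeasurementTask(30.0, "result_transfer", 5.0),
    MeasurementTask(10.0, "active_probe", 4.0),
    MeasurementTask(20.0, "passive_capture", 8.0),
]
timeline = edf_schedule(tasks)
after_cpu = adjust_priorities({"active": 5, "passive": 5}, "cpu")
after_net = adjust_priorities({"active": 5, "passive": 5}, "network")
```

The returned timeline makes missed deadlines visible, which a controller could use as one input when deciding whether to reconfigure or delay a monitoring element.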
Referring back to the flow diagram of
When the new usage status of the resources is obtained, the LMC 112 determines, at operation 245, whether to update the monitoring element based upon the second usage status of the resources. In response to determining that the monitoring element is to be updated, the LMC 112 can update the monitoring element by adjusting/updating (operation 225) its configuration parameters based upon the second usage status of the resources, and/or by rescheduling (operation 230) the monitoring element based upon the second usage status of the resources. The configuration can be done, for example, by setting a configuration file of the monitoring element or by restarting the monitoring element with new input parameters specifying the start time and duration of each monitoring session.
At operation 524, the LMC 112 can update the priority of the monitoring element. In another embodiment, the LMC 112 may update the scheduling of a monitoring element (operation 526) or the scheduling of a low-priority monitoring element 524. For example, in an embodiment where each monitoring task is assigned a priority (which can vary according to the type of the task and/or according to the resource usage), LMC 112 is operative to schedule and update the monitoring elements based upon their respective priorities. In some embodiments, the LMC 112 may determine not to update the configuration parameters of a monitoring element even if the resources are overloaded. For example, monitoring data can be valuable in congestion or overload situations to enable troubleshooting; therefore, LMC 112 may determine not to reduce the frequency of measurements in order to better assess and troubleshoot the elements that cause the overload or congestion. In another example, LMC 112 may determine to reschedule or update a subset of the monitoring elements. In another example, when a monitoring element has completed executing a monitoring task, this can free up local/external resources, and LMC 112 can re-configure and re-schedule the remaining monitoring elements if needed, or schedule new monitoring elements (e.g., transferring monitoring data to external analytics systems or databases, resuming passive monitoring, etc.). In another example, when the traffic volume produced by an application unit is low, passive traffic capturing requires fewer resources than when the traffic volume is high. Therefore, when the usage status indicates that there is an increase in the use of the resources due to an increase of traffic volume, the LMC 112 can update the scheduling of a passive monitoring element to capture traffic at a lower frequency.
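The last two examples can be made concrete with a short sketch. It assumes, for illustration only, that the cost of passive capturing is proportional to the number of packets captured per second and that the controller has a per-second capture budget; the function `capture_fraction` and its parameters are hypothetical.

```python
def capture_fraction(traffic_pps, capture_budget_pps, keep_full_rate=False):
    """Fraction of packets a passive monitoring element should capture.

    traffic_pps        -- observed traffic volume in packets per second
    capture_budget_pps -- assumed budget of packets that can be captured per second
    keep_full_rate     -- models the case where the controller decides not to
                          reduce measurements, e.g. to troubleshoot an overload
    """
    if keep_full_rate or traffic_pps <= capture_budget_pps:
        return 1.0
    # Scale the capture down so the captured volume stays within the budget.
    return capture_budget_pps / traffic_pps


low_traffic = capture_fraction(1_000, 5_000)
high_traffic = capture_fraction(20_000, 5_000)
troubleshooting = capture_fraction(20_000, 5_000, keep_full_rate=True)
```

With these assumed numbers, low traffic is captured in full, a fourfold excess over the budget reduces the capture to one packet in four, and the troubleshooting override keeps the full rate despite the load.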
Thus, in the embodiments presented herein, LMC 112 continuously observes resource usage and adapts the scheduling and configuration of the monitoring elements to resource usage changes.
In order to configure and schedule the APC monitoring element, the LMC 112 first obtains, at operation 410, the status of local resource usage. For example, LMC 112 obtains CPU usage and NIC capacity, as generating probe packets uses these local resources. At operation 415, LMC 112 determines whether the local resource usage is high. When it is determined that the local resource usage is high, the LMC 112 updates one or more monitoring elements based on the local resource usage obtained. For example, the LMC 112 may determine to schedule the APC measurements for later. Alternatively, the LMC 112 can decide to re-configure and re-schedule another ongoing monitoring task. For example, if the APC monitoring element is associated with a high priority, the LMC 112 may reconfigure a lower-priority monitoring element to use a lower measurement frequency or to be scheduled for a later time. For example, if passive traffic capturing is ongoing, the LMC 112 can change the packet sampling rate to reduce the CPU usage.
Since APC is an active measurement, the LMC 112 further obtains (at operation 425) information about the status of the network. This information is obtained from CNSM 102. At operation 430, LMC 112 determines whether the usage of the external resources is high. If overlapping active network measurements are ongoing on the same path that the new APC monitoring element intends to measure, then the results will be biased, and the overlapping measurements can cause congestion in the network and affect user application performance. Therefore, in this example, if the LMC 112 determines that there are overlapping measurements (i.e., the usage status of the external resources is high), the LMC 112 can update the APC monitoring element (operation 435). For example, LMC 112 can postpone the scheduling of the new APC monitoring element or configure the APC monitoring element to send packet trains with a lower frequency. When the LMC 112 initiates the APC monitoring element, it transmits (operation 445) updated network usage information to the central network status monitor and causes an update of the global resource information to reflect the new measurement initiated in the network.
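The APC decision flow of operations 410 through 445 can be summarized in a minimal sketch. The threshold value, the slowdown factor, and the names `ApcDecision` and `plan_apc` are assumptions introduced for illustration; the embodiments do not prescribe specific values, and they also allow other reactions (such as reconfiguring a lower-priority element) that this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApcDecision:
    action: str              # "start" or "postpone"
    train_interval_s: float  # time between packet trains
    report_to_cnsm: bool     # whether updated usage is sent to the CNSM


def plan_apc(local_usage, overlapping_measurements, *, on_overlap="slow_down",
             high_threshold=0.8, base_interval_s=1.0, slowdown=2.0):
    """Hypothetical decision helper; all default values are assumptions.

    local_usage              -- local resource usage in [0, 1] (operation 410)
    overlapping_measurements -- True if the CNSM reports overlapping active
                                measurements on the path (operation 425)
    """
    # Operation 415: high local usage, so schedule the APC measurement for later.
    if local_usage >= high_threshold:
        return ApcDecision("postpone", base_interval_s, False)
    # Operations 430/435: overlapping measurements on the path.
    if overlapping_measurements:
        if on_overlap == "postpone":
            return ApcDecision("postpone", base_interval_s, False)
        # Send packet trains less frequently instead of postponing.
        return ApcDecision("start", base_interval_s * slowdown, True)
    # Operation 445: the element is started and the CNSM is informed.
    return ApcDecision("start", base_interval_s, True)


busy_host = plan_apc(0.9, False)
busy_path = plan_apc(0.3, True)
busy_path_postponed = plan_apc(0.3, True, on_overlap="postpone")
idle = plan_apc(0.3, False)
```

Reporting to the CNSM only when the element actually starts mirrors the description, in which the updated network usage information is transmitted when the LMC initiates the APC monitoring element.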
The network device 701 includes hardware 701 comprising a set of one or more processor(s) 705 (which are often commercial off-the-shelf (COTS) processors) and NIC(s) 710 (which include physical NIs 715), as well as non-transitory machine readable storage medium 720 having stored therein a Local Monitor Controller (LMC) 712, a resource monitoring element 714, and one or more monitoring elements 7180-R. A physical NI 715 is hardware in a network device 701 through which a network connection (e.g., wirelessly through a wireless network interface controller (WNIC) or through plugging in a cable to a physical port connected to a NIC 710) is made. During operation, the processor(s) 705 may execute software to instantiate a hypervisor 770 (sometimes referred to as a virtual machine monitor (VMM)) and virtual machines 740L-N that are run by the hypervisor 770, which are collectively referred to as software instance 702. A virtual machine 740 is a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine; and applications generally do not know they are running on a virtual machine as opposed to running on a “bare metal” host electronic device, though some systems provide para-virtualization which allows an operating system or application to be aware of the presence of virtualization for optimization purposes. The virtual machines 740, and that part of the hardware 701 that executes that virtual machine (be it hardware dedicated to that virtual machine and/or time slices of hardware temporally shared by that virtual machine with others of the virtual machine(s)), may form a separate virtual CNSM.
Network device 701 performs similar functionality as ND 101 of
The physical (i.e., hardware) CNSM 102 is a network device that can perform some or all of the operations and methods described above for one or more of the embodiments. The physical CNSM 102 can include one or more network interface controllers (NICs; also known as network interface cards) 815, processor(s) (“processor circuitry”) 810, memory 805, and a Central Network Status Monitor 820.
The processor(s) 810 may include one or more data processing circuits, such as a general purpose and/or special purpose processor (e.g., microprocessor and/or digital signal processor). The processor(s) is configured to execute the Central Network Status Monitor 820, to perform some or all of the operations and methods described above for one or more of the embodiments, such as the embodiments of
The network device 900 includes hardware 901 comprising a set of one or more processor(s) 905 (which are often commercial off-the-shelf (COTS) processors) and NIC(s) 910 (which include physical NIs 915), as well as non-transitory machine readable storage medium 920 having stored therein a Central Network Status Monitor Code 925. A physical NI 915 is hardware in a network device 900 through which a network connection (e.g., wirelessly through a wireless network interface controller (WNIC) or through plugging in a cable to a physical port connected to a NIC 910) is made. During operation, the processor(s) 905 may execute software to instantiate a hypervisor 970 (sometimes referred to as a virtual machine monitor (VMM)) and a virtual machine 940 that is run by the hypervisor 970, which are collectively referred to as software instance 902. A virtual machine 940 is a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine; and applications generally do not know they are running on a virtual machine as opposed to running on a “bare metal” host electronic device, though some systems provide para-virtualization which allows an operating system or application to be aware of the presence of virtualization for optimization purposes. The virtual machine 940, and that part of the hardware 901 that executes that virtual machine (be it hardware dedicated to that virtual machine and/or time slices of hardware temporally shared by that virtual machine with others of the virtual machine(s)), may form a separate virtual CNSM.
A virtual CNSM performs similar functionality to the CNSM 102 illustrated in
A network interface (NI) may be physical or virtual; and in the context of IP, an interface address is an IP address assigned to a NI, be it a physical NI or virtual NI. A virtual NI may be associated with a physical NI, with another virtual interface, or stand on its own (e.g., a loopback interface, a point-to-point protocol interface). A NI (physical or virtual) may be numbered (a NI with an IP address) or unnumbered (a NI without an IP address). A loopback interface (and its loopback address) is a specific type of virtual NI (and IP address) of a NE/VNE (physical or virtual) often used for management purposes; where such an IP address is referred to as the nodal loopback address. The IP address(es) assigned to the NI(s) of a ND are referred to as IP addresses of that ND; at a more granular level, the IP address(es) assigned to NI(s) assigned to a NE/VNE implemented on a ND can be referred to as IP addresses of that NE/VNE.
While the flow diagrams in the figures show a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SE2016/051319 | 12/28/2016 | WO | 00 |