The present disclosure generally relates to cloud computing. More specifically, aspects of adapting telemetry settings in a cloud computing environment are presented. The technique may be implemented, e.g., in a Kubernetes cluster.
Production-grade software applications in a cloud computing environment are typically instrumented with telemetry probes that allow the collection of statistical indicators reflecting the health of the application. The sampling rate of these indicators is kept as low as possible, so as not to consume computational resources in excess of what is needed to maintain a statistically significant sample of the application behavior. The sampling rate is set at installation/deployment time and may be changed through manual intervention when needed.
Such a sampling approach is well suited for statistical monitoring of the performance of a live system in production. It is, however, of little or no help in troubleshooting misbehaving applications, even though the same measurement infrastructure could, in principle, serve the additional purpose of gathering data for targeted troubleshooting and identification of the root cause of issues.
In view of security aspects, it is further desirable to detect misbehaviors that can potentially be caused by malicious software introducing disturbances in the system.
Accordingly, there is a need for a technique that solves at least one of the above problems or other related problems of prior art techniques. Specifically, and without limitation, there is a need for an improved technique for performing telemetry for cloud applications.
According to a first aspect, a method for adapting telemetry settings in a cloud computing environment is provided. The method is carried out by a telemetry controller within the cloud computing environment. The method comprises receiving information on at least one misbehavior of an application component, determining, based on the received information, whether to modify a sampling frequency of telemetry data for the application component, and if it is determined to modify the sampling frequency of telemetry data for the application component, initiating modification of the sampling frequency to a modified value.
The following description may apply to all aspects described in this disclosure. In particular, the description of the method aspects may also apply to the device aspects described below, where applicable.
The cloud computing environment may comprise or may be represented by a Kubernetes cluster. The cloud computing environment may comprise one or more application components, in particular, PODs. Each application component may comprise one or more containers (in particular, Docker containers). The cloud computing environment may further comprise a control plane. The control plane may supervise the one or more application components. The cloud computing environment may further comprise a telemetry subsystem configured to receive telemetry data from the one or more application components. The telemetry subsystem may be configured to store and index the received telemetry data. The cloud computing environment may further comprise a telemetry interface via which the telemetry subsystem may make the telemetry data available, e.g., to an external client.
Although the present application focuses on Kubernetes as a framework for a cloud computing environment, the present disclosure should not be understood as being limited to Kubernetes. The technique described herein may be implemented on other current or upcoming cloud computing systems and architectures.
The telemetry controller may be a logical entity in the cloud computing environment. In particular, the telemetry controller may be an application component within the cloud computing environment. The telemetry controller may be a Kubernetes POD. Each of the aforementioned application components (in particular, the PODs and the telemetry controller) may be represented by a logical server allocated to one or more physical servers. The method is carried out by the telemetry controller and, therefore, by a logical device (i.e., a cloud entity), wherein it is not relevant for the following discussion which physical entity carries out the method steps. In fact, the method steps may be carried out by one or more physical entities, i.e., the logical device “telemetry controller” may be allocated to one or more physical servers.
The received information on the at least one misbehavior of the application component may be received from any logical entity of the cloud computing environment, e.g., directly from the respective application component or from the control plane. The received information on at least one misbehavior of an application component may comprise, e.g., an identifier of the application component and/or further details regarding the misbehavior, such as type of the misbehavior, start time of the misbehavior, end time of the misbehavior, parts of the application component affected by the misbehavior, etc.
The step of determining may include considering information on previous misbehavior of the respective application component. The information on previous misbehavior may be stored in a storage accessible by the telemetry controller.
The step of initiating may comprise updating a configuration file of the respective application component or transmitting data to the respective application component that causes an updating of a configuration file of the application component. The modified value may be a higher value (i.e., indicating a higher sampling frequency) than a previous value for the sampling frequency.
The step of determining may comprise determining whether a frequency of misbehavior of the application component is above a predefined threshold value.
The frequency of misbehavior may be represented by a number of misbehaviors in a predetermined time window. For this purpose, information on previous misbehaviors may be stored in a storage accessible by the telemetry controller. The threshold value may be represented by a number of misbehaviors in the predetermined time window. The step of initiating modification may be triggered by determining that the actual number of misbehaviors in the predetermined time window is above the number of misbehaviors set as the threshold value.
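By way of illustration, the determination described above may be sketched as follows. This is a minimal Python sketch under the assumption that timestamps of past misbehavior events (e.g., restarts) are available from the storage; the function and parameter names are purely illustrative.

```python
import time

def misbehavior_above_threshold(event_timestamps, window_seconds, threshold):
    """Return True if more than `threshold` misbehavior events (e.g., restarts)
    fall within the last `window_seconds`."""
    now = time.time()
    recent = [t for t in event_timestamps if now - t <= window_seconds]
    return len(recent) > threshold

# Example: four restarts within the last ten minutes against a tolerance of three.
restarts = [time.time() - 60 * m for m in (1, 3, 5, 8)]
print(misbehavior_above_threshold(restarts, window_seconds=600, threshold=3))  # True
```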
The misbehavior of the application component may correspond to a restart of the application component.
The information on at least one misbehavior may be received from a control plane of the cloud computing environment.
The misbehaviors of the one or more application components may be monitored by the control plane. An interface may be provided between the control plane and the telemetry controller, in order to transfer the information on the at least one misbehavior to the telemetry controller.
The cloud computing environment may comprise a Kubernetes cluster.
In this regard, it has already been explained above that the cloud computing environment may comprise the typical elements of a Kubernetes cluster, such as one or more PODs, a control plane for controlling the one or more PODs, and a telemetry subsystem.
The step of initiating modification of the sampling frequency of telemetry data for the application component may comprise initiating an adaptation of a value stored in a Kubernetes ConfigMap object assigned to the application component.
The initiating of the adaptation of the value may comprise, e.g., sending a new value to be stored in the ConfigMap or sending an instruction to stop using a first predefined value stored in the ConfigMap and to start using a second predefined value stored in the ConfigMap. For example, the initiating of the adaptation of the value may comprise initiating a change of a flag value stored in a ConfigMap assigned to the respective application component. The predefined value may define a percentage of traces to be sampled for the respective application component.
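As a minimal sketch of how such an adaptation could be initiated programmatically, the following snippet assumes the official Kubernetes Python client and a hypothetical naming convention for the ConfigMap assigned to the application component; both the ConfigMap name and the data key are illustrative assumptions.

```python
from kubernetes import client, config

def set_sampling(pod_name: str, namespace: str, tracing_sampling: str) -> None:
    """Patch the sampling value in the ConfigMap assigned to the POD.

    The ConfigMap name and the data key 'tracingSampling' are illustrative; an
    implementation would use whatever naming convention the deployment defines.
    """
    config.load_incluster_config()  # or config.load_kube_config() when run outside the cluster
    v1 = client.CoreV1Api()
    v1.patch_namespaced_config_map(
        name=f"{pod_name}-sampling-settings",
        namespace=namespace,
        body={"data": {"tracingSampling": tracing_sampling}},
    )

# Example: sample 100% of traces for a misbehaving POD.
# set_sampling("my-app-pod", "default", "1")
```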
The method may further comprise, after a predefined amount of time has elapsed since initiating the modification of the sampling frequency, initiating a modification of the sampling frequency back to a previous value.
The initiating the modification of the sampling frequency back to the previous value may comprise transmitting the previous value to be stored in a ConfigMap of the respective application component. Alternatively, it may comprise transmitting an instruction to stop using a second predefined value stored in a ConfigMap assigned to the respective application component and to start using a first predefined value stored in the ConfigMap. For example, the initiating the modification of the sampling frequency back to the previous value may comprise initiating a change of a flag value stored in a ConfigMap assigned to the respective application component.
The predefined threshold value may be derived from a configuration object.
The configuration object may be stored at the telemetry controller. The configuration object may be stored such that it is accessible by the telemetry controller.
The configuration object may further comprise at least one of a time value indicating a periodicity of wakeup events of a telemetry control process, a restart threshold indicating a threshold value for a number of times the application component may restart within a predefined time window without initiating modification of the sampling frequency, a value indicating a duration of the time window, a value indicating a predefined amount of time after initiating the modification of the sampling frequency before initiating a modification of the sampling frequency back to a previous value, the modified value of the sampling frequency, and a value indicating a level of logging.
The configuration object may be a Kubernetes ConfigMap object.
The telemetry controller may use at least one of the following interfaces: an interface for the receiving of the information on at least one misbehavior of the application component from a control plane of the cloud computing environment, an interface for the initiating of the modification of the sampling frequency of telemetry data for the application component, an interface for storage and retrieval of saved information of restarts and/or configuration of at least one application component, and an interface for configuration of key parameters used by the telemetry controller.
Over each of the interfaces, the telemetry controller can receive and/or transmit information from and/or to another entity of the cloud computing environment, such as a storage, an application component, or the control plane.
According to a second aspect, a method for processing telemetry data in a cloud computing environment is provided. The method comprises the method of the first aspect, carried out by a telemetry controller within the cloud computing environment. The method further comprises receiving, by a telemetry subsystem of the cloud computing environment, the telemetry data of the application component, retrieved with the modified sampling frequency, and making available, by the telemetry subsystem, the telemetry data via a telemetry interface.
While the method of the first aspect is directed to actions performed by the telemetry controller (logical entity), the method of the second aspect describes a method carried out by additional components of the cloud computing environment and/or external entities, such as the telemetry subsystem. The telemetry subsystem may be a logical device within the cloud computing environment. The telemetry data may be retrieved by the telemetry subsystem. The telemetry subsystem may store and/or further process the retrieved telemetry data. For example, the telemetry subsystem may perform additional processing steps to the telemetry data, such as statistical evaluation. The telemetry data may be made available via the telemetry interface to an external client, e.g., via a web interface.
The method of the second aspect may further comprise accessing, by an external client, the telemetry data via the telemetry interface, and determining whether the application component has suffered repeated failures.
The external client may be provided in the form of a web browser. Based on the determining, the external client may identify repeated failures caused by malicious software that, having breached the system, can bring down application components.
According to a third aspect, a cloud computing entity for adapting telemetry settings in a cloud computing environment is provided. The cloud computing entity hosts a telemetry controller within the cloud computing environment. The telemetry controller is configured to receive information on at least one misbehavior of an application component, determine, based on the received information, whether to modify a sampling frequency of telemetry data for the application component, and if it is determined to modify the sampling frequency of telemetry data for the application component, initiate modification of the sampling frequency to a modified value.
The method aspects discussed above with regard to the first aspect may also apply accordingly to the third aspect. As discussed above, the cloud computing entity may be represented by a logical device, e.g., a logical server. The cloud computing entity may be a distributed cloud computing entity, in particular, distributed over a plurality of physical entities, such as physical servers. The cloud computing entity may be hosted by one or more physical servers. The telemetry controller may be an application component, in particular, a Kubernetes POD.
The step of determining may comprise determining whether a frequency of misbehavior of the application component is above a predefined threshold value.
The misbehavior of the application component may correspond to a restart of the application component.
The telemetry controller may be configured to receive the information on at least one misbehavior from a control plane of the cloud computing environment.
The cloud computing environment may comprise a Kubernetes cluster.
The step of initiating modification of the sampling frequency of telemetry data for the application component may comprise initiating an adaptation of a value stored in a Kubernetes ConfigMap object assigned to the application component.
The telemetry controller may be further configured to, after a predefined amount of time has elapsed since initiating the modification of the sampling frequency, initiate a modification of the sampling frequency back to a previous value.
The telemetry controller may be configured to receive the predefined threshold value from a configuration object.
The configuration object may further comprise at least one of: a time value indicating a periodicity of wakeup events of a telemetry control process, a restart threshold indicating a threshold value for a number of times the application component may restart within a predefined time window without initiating modification of the sampling frequency, a value indicating a duration of the time window, a value indicating a predefined amount of time after initiating the modification of the sampling frequency before initiating a modification of the sampling frequency back to a previous value, the modified value of the sampling frequency, and a value indicating a level of logging.
The configuration object may be a Kubernetes ConfigMap object.
The telemetry controller may be configured such that it uses at least one of the following interfaces: an interface for the receiving of the information on at least one misbehavior of the application component from a control plane of the cloud computing environment, an interface for the initiating of the modification of the sampling frequency of telemetry data for the application component, an interface for storage and retrieval of saved information of restarts and/or configuration of at least one application component, and an interface for configuration of key parameters used by the telemetry controller.
According to a fourth aspect, a computer program is provided, comprising instructions which, when executed by one or more processors, cause the one or more processors to carry out the method of the first aspect.
According to a fifth aspect, a computer-readable medium is provided, comprising instructions which, when executed by one or more processors, cause the one or more processors to carry out the method of the first aspect.
The computer-readable medium may be a non-transitory computer-readable medium. For example, the computer-readable medium may be a solid-state storage medium (e.g., an SSD), an optical storage medium (e.g., a CD or DVD), or a magnetic storage medium.
Further details of embodiments of the technique are described in the following with reference to the enclosed drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as a specific cloud computing environment in order to provide a thorough understanding of the technique disclosed herein. It will be apparent to one skilled in the art that the technique may be practiced in other embodiments that depart from these specific details.
For example, those skilled in the art will appreciate that, although the following description of the present disclosure is provided in the context of a Kubernetes cluster, the technique of the present disclosure can be applied to other cloud management systems, apart from Kubernetes.
Further, the skilled person will appreciate that the following description describes logical components of a cloud computing system, i.e., components of a cloud computing environment, which do not necessarily have one single well-defined physical counterpart. It is rather the nature of a cloud computing environment, such as a cloud computing environment managed by Kubernetes, that logical components are assigned to one or more physical servers.
Therefore, when the description of the present disclosure states that a specific method step is carried out by a specific logical component (such as the telemetry controller), it has to be kept in mind that the respective step may be carried out in a cloud computing environment by one or more physical entities.
The control plane 6 performs (via interface IC.1) black-box monitoring of the PODs 4, and in case a POD 4 becomes unresponsive, the POD 4 is terminated by Kubernetes and a new instance is created. In other words, the POD 4 is restarted.
Production-grade applications are equipped with a telemetry subsystem 8, the DST (distributed system telemetry) item 8 in the figure, which is responsible for collecting (via interface ID.1) telemetry data from the PODs 4, storing and indexing it, and making it available for retrieval by means of a telemetry interface ID.2. External clients (e.g., web browsers) can use the telemetry interface ID.2 to browse and drill into the data.
In addition to the components described above, the following embodiments introduce further components and interfaces.
The essence of the solution of the following embodiments may be described as comprising one or more of the following aspects, without limitation:
A general idea implemented by embodiments of the present application may be described as leveraging the telemetry infrastructure to adaptively increase the observability of parts of the cloud applications that experience recurring failures over a configurable period of time.
In order to achieve this purpose, the telemetry infrastructure of a Kubernetes cluster can be extended with two assets: a monitoring agent (the Telemetry Controller described below) and new interfaces through which the sampling settings of individual application components can be adjusted at runtime.
Whenever a component in the application is reported to incur unexpected failures or other misbehavior, the monitoring agent can use the new interfaces to increase the sampling to a configurable level and for a configurable time window.
The additional data can then be accessed and inspected with the existing telemetry reporting infrastructure (in particular, via the ID.2 interface), by an external client 10.
A new application component “Telemetry Controller” 12 is added to the system, and four new interfaces (ITC.1, ITC.2, ITS.1, and ITM.1) are identified for the essential exchange of control information between it and the other parts of the system.
Interface ITC.1 is used by the Telemetry Controller 12 to retrieve from the control plane 6 the events related to the other application PODs 4; events that signal POD misbehavior (in particular, restarts) are stored in a local storage 18 via interface ITS.1 for later processing.
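A minimal sketch of such a retrieval, assuming the official Kubernetes Python client is used to query the control plane 6 (playing the role of interface ITC.1), may look as follows; the returned counts can then be persisted via interface ITS.1. The function name is illustrative.

```python
from kubernetes import client, config

def get_restart_counts():
    """Query the control plane for all application PODs and return a mapping
    (namespace, pod name) -> total container restart count."""
    config.load_incluster_config()  # or config.load_kube_config() when run outside the cluster
    v1 = client.CoreV1Api()
    counts = {}
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        statuses = pod.status.container_statuses or []
        counts[(pod.metadata.namespace, pod.metadata.name)] = sum(
            s.restart_count for s in statuses
        )
    return counts
```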
Whenever new restart events are detected, the Telemetry Controller 12 queries its local storage 18 to detect whether any POD 4 has restarted with a frequency greater than the tolerance configured in the “Telemetry Control Settings” table 14 (implemented as a Kubernetes ConfigMap object, so that it is easily accessible and editable); such PODs 4 are referred to as flaky PODs 4.
When flaky PODs 4 are detected, the Telemetry Controller 12 takes action to modify the sampling frequency of telemetry data for that specific POD 4; this is done over the ITC.2 interface, through which the Telemetry Controller 12 sets suitable values in an Externalized Sampling Settings ConfigMap 16 of the POD 4 of interest. What the “suitable values” should be is configured in the Telemetry Control Settings 14 (a ConfigMap) and retrieved via the ITM.1 interface. Application PODs 4 may implement the ITC.2 interface by mounting the Externalized Sampling Settings ConfigMap 16 as a volumeMount and reading the current sampling values from it.
After a configurable amount of time (again, taken from the Telemetry Control Settings 14), the sampling settings 16 for the flaky POD 4 are reset to their previous values.
All the data gathered can then be inspected at any time via the Telemetry Interface ID.2.
The logic of the solution is implemented within the Telemetry Controller 12 and actioned via the interfaces it uses (ITC.1, ITC.2, ITS.1, and ITM.1), as described in the following.
According to a first step 32, the telemetry controller 12 receives information on at least one misbehavior of an application component 4. The misbehavior may correspond to a restart and the application component 4 may be a Kubernetes POD 4, e.g., one of the PODs 4 shown in
According to a second step 34, the telemetry controller 12 determines, based on the received information, whether to modify a sampling frequency of telemetry data for the application component 4. The decision may be made on the basis of a determination whether an amount of misbehaviors within a predetermined time window (and, therefore, a frequency of misbehaviors) is above a predefined threshold value.
According to a third step 36, if it is determined in step 34 to modify the sampling frequency of telemetry data for the application component, the telemetry controller 12 initiates modification of the sampling frequency to a modified value. The modification may be carried out via the ITC.2 interface, wherein a new (modified) value for the sampling frequency may be written into the sampling settings 16 and/or activated in the sampling settings 16 of the respective POD 4. The modified value may indicate a higher sampling frequency than a previous value used as a sampling frequency for performing telemetry on the respective POD 4.
The cloud computing entity 40 comprises a receiving unit 42, a determining unit 44, and an initiating unit 46. Each of these units 42, 44, and 46 represents a logical unit that may be embodied in the form of software and/or hardware and that may be running on one or more physical entities, e.g., physical servers. The receiving unit 42 is configured to carry out step 32, the determining unit 44 is configured to carry out step 34, and the initiating unit 46 is configured to carry out step 36 of the method described above.
The flow chart described in the following shows a more detailed embodiment of the method carried out by the telemetry controller 12.
In a first initial step 52, a list of the running PODs 4 of the respective cloud computing environment is saved to storage 18 via interface ITS.1. The required information is received from the Kubernetes control plane 6 (also referred to as K8S API in
In a step 58, the telemetry controller 12 sleeps for an amount of time set by a parameter TCS.cycleTime (in seconds), also referred to herein as cycleTimeSeconds. In a step 60, the list of running PODs is updated and stored in storage 18 via interface ITS.1. The required information is received via interface ITC.1 from the control plane 6. In a step 62, the latest restart events per POD 4 are updated and stored in storage 18 via interface ITS.1. The information required for this step is received from the control plane 6 via interface ITC.1.
In a step 64, the service configuration is reloaded via interface ITM.1 from the telemetry controller settings 14. In a step 66, the storage 18 is queried for PODs 4 with a number of restarts (#restart) larger than a predefined threshold value (TCS.restartThreshold, also referred to as restartThreshold) in a predetermined time window (TCS.restartTimeWindow, also referred to as restartTimeWindow). In other words, it is checked whether #restart>TCS.restartThreshold within TCS.restartTimeWindow. Thereby, it is determined whether a frequency of misbehavior of the application component 4 (POD 4) is above a predefined threshold value. The information on the number of restarts is received from the storage 18 via interface ITS.1.
In a step 68 it is determined, based on the query of step 66, whether there are flaky PODs 4. The expression “flaky PODs” is defined herein as PODs 4 for which the condition of step 66 is true, i.e., PODs 4 which have a number of restarts in a given time window larger than the predefined threshold value. In case there are no flaky PODs 4 (“NO” in
In a step 70, the current POD telemetry settings are received from the telemetry settings 16 of the respective flaky POD 4 via interface ITC.2 and backed up in the storage 18 via interface ITS.1.
In a step 72, extended sampling settings are set in the respective flaky POD 4. More precisely, extended sampling settings are stored in the telemetry settings 16 of the flaky POD 4 via interface ITC.2. In other words, the sampling frequency of the respective POD 4 is modified to a modified value (e.g., increased).
In a step 74, the respective POD 4 is put on an extended observation list (referred to herein as the extendedObservation list), which is stored in storage 18 via interface ITS.1.
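Taken together, steps 58 to 74 may be sketched as follows. This is a simplified Python rendition under several assumptions: the store dictionary stands in for the storage 18 (interface ITS.1), the settings dictionary for the Telemetry Control Settings 14 (interface ITM.1), and the callables passed in for the interfaces ITC.1 and ITC.2; all names, including the key extendedSamplingValue, are illustrative.

```python
import time

def flaky_pod_cycle(store, settings, get_restart_counts, get_sampling, set_sampling):
    """One iteration of the detection loop (steps 58-74), simplified."""
    time.sleep(settings["cycleTimeSeconds"])                              # step 58
    now = time.time()
    for pod, total_restarts in get_restart_counts().items():             # steps 60 and 62 (ITC.1)
        events = store.setdefault("restart_events", {}).setdefault(pod, [])
        previous_total = store.setdefault("restart_totals", {}).get(pod, total_restarts)
        events.extend([now] * max(0, total_restarts - previous_total))   # record new restart events
        store["restart_totals"][pod] = total_restarts

        # Step 66: is the number of restarts within restartTimeWindow above restartThreshold?
        recent = [t for t in events if now - t <= settings["restartTimeWindow"]]
        is_flaky = len(recent) > settings["restartThreshold"]

        if is_flaky and pod not in store.setdefault("extendedObservation", {}):  # step 68
            store.setdefault("sampling_backup", {})[pod] = get_sampling(pod)     # step 70 (ITC.2 / ITS.1)
            set_sampling(pod, settings["extendedSamplingValue"])                 # step 72 (ITC.2)
            store["extendedObservation"][pod] = now                              # step 74 (ITS.1)
```

Reloading the service configuration (step 64) is not shown; in this sketch the settings dictionary is assumed to be kept up to date by the caller.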
The flow chart described in the following shows how the extended observation of a POD 4 is ended and its sampling settings are set back to the previous values after the predefined amount of time.
In a first step 82, the telemetry controller 12 sleeps for an amount of time set by the parameter TCS.cycleTime (in seconds), also referred to herein as cycleTimeSeconds. In a step 84, the service configuration is reloaded from the telemetry controller settings 14 via interface ITM.1. In a step 86, the PODs 4 in the extendedObservation list are retrieved from storage 18 via interface ITS.1. In a step 88, it is determined, based on the result of step 86, whether there are PODs 4 under extendedObservation. In other words, it is determined in steps 86 and 88 which of the PODs 4 of the Kubernetes cluster are under extended observation, if any.
If it is determined that there are no PODs 4 under extended observation (“NO” at step 88 in
In a step 90, it is determined whether the POD 4 (i.e., the POD 4 in the extendedObservation list) has been on the extendedObservation list for more than a predefined amount of time, represented by a parameter TCS.podExtendedObservationTimeWindow (also referred to herein as podExtendedObservationTimeWindow). If this is not the case (“NO” at step 90 in
In a step 94, the respective POD 4 is removed from the extendedObservation list. The extendedObservation list is stored in storage 18 and updated via interface ITS.1.
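In the same simplified style and under the same assumptions as the sketch above, the loop of steps 82 to 94 may be rendered as follows; restoring the backed-up sampling settings uses the values saved in step 70, and all names are illustrative.

```python
import time

def restore_cycle(store, settings, set_sampling):
    """One iteration of the restore loop (steps 82-94), simplified."""
    time.sleep(settings["cycleTimeSeconds"])                            # step 82
    now = time.time()
    observation = store.setdefault("extendedObservation", {})           # step 86 (ITS.1)
    for pod, since in list(observation.items()):                        # step 88
        # Step 90: has the POD been under extended observation for longer
        # than podExtendedObservationTimeWindow?
        if now - since > settings["podExtendedObservationTimeWindow"]:
            set_sampling(pod, store["sampling_backup"].pop(pod))        # restore the backed-up settings (ITC.2)
            del observation[pod]                                        # step 94 (ITS.1)
```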
Note that the names of the data structures and fields used in the present disclosure are purely indicative and can be changed by an implementer without affecting the generality and novelty of the technique described herein. This applies, in particular, to the names of the parameters used in the settings tables discussed in the following.
In the following, an example of the telemetry controller settings 14 is given. The telemetry controller settings 14 are accessed by the telemetry controller 12 via interface ITM.1. According to an embodiment, the telemetry controller settings 14 are represented by a Kubernetes ConfigMap object. The following snippet shows an example of the telemetry controller settings 14 in the form of a Kubernetes ConfigMap object (configuration object).
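The snippet below is an illustrative rendition; the metadata, the concrete values, and the names of the extended-observation keys (extendedTracingSampling, extendedLoggingSampling) are assumptions, while the remaining parameter names follow those used elsewhere in this disclosure.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: telemetry-control-settings           # illustrative name
data:
  cycleTimeSeconds: "30"                     # periodicity of wakeup events of the control process
  restartThreshold: "3"                      # tolerated number of restarts within the time window
  restartTimeWindow: "600"                   # duration of the time window, in seconds
  podExtendedObservationTimeWindow: "3600"   # time before the settings are set back, in seconds
  extendedTracingSampling: "1"               # assumed key name: sampling value used during extended observation
  extendedLoggingSampling: "DEBUG"           # assumed key name: logging level used during extended observation
```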
In view of the above, the configuration object comprises a time value (cycleTimeSeconds) indicating the periodicity of wakeup events of the telemetry control process, a restart threshold (restartThreshold) indicating the number of times an application component 4 may restart within a predefined time window without triggering a modification of the sampling frequency, a value (restartTimeWindow) indicating the duration of that time window, a value (podExtendedObservationTimeWindow) indicating the predefined amount of time after which the sampling frequency is set back to its previous value, the modified value of the sampling frequency to be used during extended observation, and a value indicating the level of logging to be used during extended observation.
In the following, an example of the externalized sampling settings 16 is given. The sampling settings 16 are accessed by the telemetry controller 12 via interface ITC.2. According to an embodiment, the sampling settings 16 are modelled with a Kubernetes ConfigMap object. The sampling settings 16 are mounted as volumeMount by all application PODs 4.
The following snippet shows an example of the sampling settings 16 of a POD 4 in the form of a Kubernetes ConfigMap object (configuration object).
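The snippet below is an illustrative rendition; the metadata is an assumption, while the data keys and values correspond to those discussed in the following.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-sampling-settings   # illustrative name; mounted by the POD as a volumeMount
data:
  tracingSampling: "0.002"    # fraction of traces sampled (0.2%)
  loggingSampling: "WARNING"  # logging is performed at WARNING level and above
```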
In view of the above, the sampling settings 16 comprise a current value to which the sampling frequency is set: tracingSampling, set to 0.002 (=0.2%) in the above example. The above example therefore shows a case in which the sampling frequency is set to a standard value, i.e., the previous value to which the sampling frequency is set back after the predefined amount of time. As shown in the above example of telemetry controller settings 14, this value will be set to 1 in case the respective POD 4 is in the extendedObservation list.
Further, the sampling settings 16 comprise a value indicating a level of logging: loggingSampling, set to “WARNING” in the above example, indicating that logging is performed at “WARNING” level and above. If the respective POD 4 were in the extendedObservation list, this value (loggingSampling) would be set to “DEBUG”, according to the above example of telemetry controller settings 14.
As described above, the telemetry controller 12 presented in the embodiments uses the following interfaces: the interface ITC.1 for receiving the information on misbehavior (in particular, restart events) of the application PODs 4 from the control plane 6; the interface ITC.2 for initiating the modification of the sampling frequency by accessing the externalized sampling settings 16 of the respective POD 4; the interface ITS.1 for storing and retrieving saved information on restarts and configurations of the PODs 4 in the storage 18; and the interface ITM.1 for reading the configuration of key parameters from the telemetry controller settings 14.
At least some of the embodiments may result in one or more of the following advantages. The automated collection of data is achieved with a greater level of detail for flaky components of the cloud application. The restoration of less invasive telemetry can be triggered when sufficient data has been collected. Computing resources can be preserved by keeping the telemetry collection at lightweight levels when there is no actual reason to sample at a higher frequency. The additional troubleshooting data can be presented via the same reporting mechanism used for statistical data reporting.
Many advantages of the present disclosure will be fully understood from the foregoing description, and it will be apparent that various changes may be made in the form, construction and arrangement of the units and devices without departing from the scope of the present disclosure and/or without sacrificing all of its advantages. Since the embodiments can be varied in many ways, it will be recognized that the present disclosure should be limited only by the scope of the following claims.