Disclosed are embodiments related to a method and system for resilience based upon probabilistic estimate of failures.
Making remote calls to software running in different processes, including on different machines across a network, can result in failure, or hanging without a response, until some time-out limit is reached. When coordinating across multiple remote calls, handling that failure in an appropriate way is necessary. One approach known in the art is to use a circuit breaker (CB) technique, such as illustrated in
Avoiding failures in this way is important particularly in scenarios where the continuation of the failed executions can lead to inconsistencies. For example, inconsistencies here may include inconsistent data getting persisted.
The CB technique relies on a method that counts the number of failed requests towards a particular channel or service. Client 102 makes requests to server 104. If the request succeeds, the counter is unchanged, but if the request fails, the counter is incremented. The counter is compared against a pre-configured circuit breaker threshold (which is 3 in the illustrated example). While the counter is below the threshold the CB remains in the closed state 110. Once the counter reaches the threshold, the CB is tripped to the open state 112. While in the open state 112, the CB immediately blocks requests from being sent to that channel or service which is facing problems. The CB remains in the open state 112 until a configured reset time-out is reached, when it can verify if normal operation (closed state 110) can be reestablished for that channel or service. In some examples, verifying if normal operation can be reestablished takes place in a half-open state. That is, after a timeout period, the circuit switches to a half-open state to test if the underlying problem still exists. If a single call fails in this half-open state, the breaker is once again tripped. If it succeeds, the circuit breaker resets back to the normal closed state.
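The counter-based state machine described above can be sketched as follows. This is a minimal, single-threaded illustration only: the class name `CircuitBreaker`, its method names, the default time-out, and the choice to reset the counter when the half-open trial succeeds are assumptions made for this sketch and are not taken from the disclosure.

```python
import time


class CircuitBreaker:
    """Counter-based circuit breaker with closed, open, and half-open states."""

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold          # failures needed to trip the breaker
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock                  # injectable clock (hypothetical, eases testing)
        self.counter = 0
        self.state = "closed"
        self.opened_at = None

    def allow_request(self):
        """Return True if a request may be sent to the service."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"    # reset time-out reached: allow a trial call
                return True
            return False                    # still open: block immediately
        return True

    def record_success(self):
        # In the closed state a success leaves the counter unchanged.
        if self.state == "half-open":
            self.counter = 0                # assumption: counter restarts on recovery
            self.state = "closed"

    def record_failure(self):
        if self.state == "half-open":
            self._trip()                    # a single failed trial re-opens the breaker
            return
        self.counter += 1
        if self.counter >= self.threshold:  # counter has reached the threshold
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
```

A caller would check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`. With the illustrated threshold of 3, the third recorded failure moves the breaker to the open state.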
When the circuit breaker is open, the client 102 may decide to take different approaches depending on the nature of the involved service: e.g., to call an alternate API if the primary one is down or under load, return cached data from a previous response, notify the user, provide feedback to the user and retry the action in the background, and/or log problems to logging services.
There are other approaches where the analysis of performance degradation is used to determine actions towards services operations. For example, US20170046146A1 discusses autonomously healing microservice-based applications. The method comprises the steps of detecting a performance degradation of at least a portion of the application; and responsive to detecting the performance degradation, downgrading at least one of the plurality of microservices within the application.
[1] CircuitBreaker, Martin Fowler, 6 Mar. 2014, https://martinfowler.com/bliki/CircuitBreaker.html
[2] Patterns of resilience, Uwe Friedrichsen, 5 Nov. 2014, https://www.slideshare.net/ufried/patterns-of-resilience
Circuit breaker techniques need to be tuned to an optimal counter threshold. Resorting to a higher counter threshold (requiring a high number of failures) leads to unnecessary failures in order to trip the breaker, and consequently more inconsistency as a result of the failures. Alternatively, a lower counter threshold (requiring a low number of failures) leads to more downtime in that channel or service. Existing CB techniques thus rely on a deterministic counter to trigger the open state. The main downside of such a deterministic way of evaluating a failure scenario is its inability to foresee trends. When a CB relies only on the count of absolute failed requests as the criterion to trip the circuit breaker, the approach is not optimized, since it does not foresee any future tendencies. This results in wasted resources or non-ideal performance.
This sort of deterministic approach to evaluating a failure scenario, as used by typical CB techniques, has as a primary downside the inability to foresee trends or statistical tendencies. This sort of approach therefore wastes resources and has non-optimal performance. For example, setting the threshold for tripping the open state of the CB is problematic.
The alternative approaches described in the background (e.g., autonomously healing microservice-based applications) also rely on deterministic methods to evaluate the channel or service degradation and decide on how to act (as the regular CBs do) based on only the counting of absolute failed requests.
Embodiments provide for a statistical estimator that can predict the likelihood of failure of a given request to a service. This statistical estimator can be added before a request to a service is made, such that if the likelihood of failure is too high then the request may be blocked and the counter of the CB incremented without having to incur a failed request. On the other hand, if the likelihood of failure is not too high, the request may proceed and the counter will be incremented if the request fails.
Advantages of embodiments are that they help to prevent inconsistent states in the system, reduce the amount of network I/O, reduce load on an already stressed server resource, and require fewer “real” failures to predict the risk of a fault.
According to a first aspect, a method for providing resilience in a computing environment is provided. The method includes, prior to making a request to a first service, determining a first probability based on environment parameters, wherein the first probability represents a likelihood of the request to the first service failing. The method includes determining that the first probability meets or exceeds a first threshold. The method includes, as a result of determining that the first probability meets or exceeds the first threshold, (i) declining to make the request to the first service and (ii) incrementing a counter, wherein the counter is an internal variable for determining a circuit breaker state.
In some embodiments, the method further includes, prior to making a second request to a first service, determining a second probability based on environment parameters, wherein the second probability represents a likelihood of the second request to the first service failing. The method includes determining that the second probability is below a first threshold. The method includes, as a result of determining that the second probability is below the first threshold, (i) making the second request to the first service and (ii) incrementing a counter if the second request to the first service fails, wherein the counter is an internal variable for determining a circuit breaker state.
In some embodiments, the method further includes, prior to making additional requests to the first service, determining additional probabilities based on environment parameters, wherein the additional probabilities represent a likelihood of the additional requests to the first service failing. The method further includes determining, for each of the additional requests to the first service, whether the corresponding probability meets or exceeds the first threshold. The method further includes, for each of the additional requests to the first service, if the corresponding probability meets or exceeds the first threshold (i) declining to make the corresponding request to the first service and (ii) incrementing the counter; and if the corresponding probability is below the threshold then (i) making the corresponding request to the first service and (ii) incrementing the counter if the corresponding request to the first service fails. The method includes determining that the counter exceeds a second threshold. The method includes, as a result of determining that the counter exceeds the second threshold, transitioning to an open circuit breaker state for the first service, where requests to the first service are disabled during the open circuit breaker state.
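The per-request decision described above can be sketched as follows. The class `EstimatorGatedBreaker`, the callable `estimate_failure`, the default threshold values, and the treatment of any raised exception as a failed request are hypothetical choices for illustration; the disclosure does not prescribe an interface.

```python
class EstimatorGatedBreaker:
    """Checks a failure-probability estimate before each request is made."""

    def __init__(self, estimate_failure, probability_threshold=0.8,
                 counter_threshold=3):
        self.estimate_failure = estimate_failure            # environment -> P(failure)
        self.probability_threshold = probability_threshold  # the "first threshold"
        self.counter_threshold = counter_threshold          # the "second threshold"
        self.counter = 0
        self.is_open = False

    def call(self, send_request, environment):
        """Return (outcome, response) for one request attempt."""
        if self.is_open:
            return "blocked_open", None      # requests are disabled while open
        if self.estimate_failure(environment) >= self.probability_threshold:
            self._increment()                # counted without incurring a real failure
            return "blocked_estimate", None
        try:
            response = send_request()
        except Exception:                    # assumption: any exception is a failure
            self._increment()
            return "failed", None
        return "ok", response

    def _increment(self):
        self.counter += 1
        if self.counter > self.counter_threshold:  # counter exceeds second threshold
            self.is_open = True
```

In this sketch a blocked request and a failed request both raise the same counter, so the breaker can open after fewer real failures than a purely counter-based breaker would need.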
In some embodiments, determining a first probability based on environment parameters comprises using a rule-based estimator. In some embodiments, determining a first probability based on environment parameters comprises using machine learning. In some embodiments, using machine learning includes applying deep reinforcement learning. In some embodiments, the environment parameters on which the first probability is based include feedback signals from the first service, including one or more of a round-trip time, an acknowledgement (ACK) message, a negative acknowledgment (NACK) message, a node state indicator, and a cluster health indicator. In some embodiments, the first service performs a storage operation. In some embodiments, the first service performs a charging function. In some embodiments, the first service comprises a group of services or microservices. In some embodiments, the first service is provided by a node in a telecommunications network, and determining the first probability based on environment parameters is performed in a cloud computing environment. In some embodiments, the first service is a network function managed by an orchestration layer.
According to a second aspect, a computer program is provided, comprising instructions which, when executed by processing circuitry of a node, cause the node to perform the method of any one of the embodiments of the first and second aspects.
According to a third aspect, a carrier containing the computer program of the third aspect is provided. The carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
According to a fourth aspect, a network node is provided. The network node includes processing circuitry. The network node includes a memory, the memory containing instructions executable by the processing circuitry, whereby the network node is configured to perform the method of any one of the embodiments of the first and second aspects.
According to a fifth aspect, a network node for providing resilience in a computing environment is provided. The network node is configured to, prior to making a request to a first service, determine a first probability based on environment parameters, wherein the first probability represents a likelihood of the request to the first service failing. The network node is configured to determine that the first probability meets or exceeds a first threshold. The network node is configured to, as a result of determining that the first probability meets or exceeds the first threshold, (i) decline to make the request to the first service and (ii) increment a counter, wherein the counter is an internal variable for determining a circuit breaker state.
In some embodiments, the network node is further configured to, prior to making a second request to a first service, determine a second probability based on environment parameters, wherein the second probability represents a likelihood of the second request to the first service failing. The network node is configured to determine that the second probability is below a first threshold. The network node is configured to, as a result of determining that the second probability is below the first threshold, (i) make the second request to the first service and (ii) increment a counter if the second request to the first service fails, wherein the counter is an internal variable for determining a circuit breaker state.
In some embodiments the network node of the sixth aspect or the seventh aspect is configured to perform the method of any one of the embodiments of the first aspect and the second aspect.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
A probability p that a request will succeed is equivalent to a probability 1−p that a request will fail. If 1−p meets or exceeds a probability threshold, that indicates that the request is likely to fail. If a probability of success (p) is used, it can be converted to a probability of failure (1−p) that can be checked against the threshold, so that if (1−p) meets or exceeds a threshold, that indicates the request is likely to fail. Whether a probability of success (p) is used or a probability of failure (1−p) is used, it is therefore possible to consider a probability meeting or exceeding a probability threshold as meaning the request is likely to fail.
As illustrated, the CB is tripped after the counter reaches the CB threshold value, based on successive requests having a probability (provided by the statistical estimator 202) that is below a probability threshold. In addition to incrementing the counter based on the statistical estimator 202, the counter may also be incremented if a request that is made results in a failure. Further, the CB may be tripped following a series of requests, some of which are successful requests and others that either are failed requests or have corresponding probabilities that are below the probability threshold, provided that the counter exceeds the CB threshold. In some embodiments, the counter may only count failed requests or probabilities that are below the probability threshold from a certain time window (e.g., the preceding five minutes, half hour, 24 hours, etc.).
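The time-window variant of the counter mentioned above can be sketched as follows. The class name `WindowedCounter`, the five-minute default, and the injectable `clock` are hypothetical; the sketch only shows one way to count events from a recent window.

```python
import time
from collections import deque


class WindowedCounter:
    """Counts only the increments that fall inside a sliding time window."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # e.g. the preceding five minutes
        self.clock = clock                    # injectable clock (hypothetical)
        self._events = deque()                # timestamps of counted events, oldest first

    def increment(self):
        """Record one failed request or one estimator-blocked request."""
        self._events.append(self.clock())

    def value(self):
        """Return the number of events still inside the window."""
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()            # drop events that have aged out
        return len(self._events)
```

Comparing `value()` against the CB threshold, instead of an ever-growing counter, lets old failures age out so that only recent behavior can trip the breaker.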
Accordingly, a modified CB technique is provided where the statistical estimator 202 is introduced.
Embodiments herein may be applied to control access to a given channel or service in a Business Support System (BSS), among other things. For example, embodiments may be applied in the following circumstances:
The situation shown in
The situation shown in
Orchestration can be done in multiple ways. For example,
As shown in
Taking the BSS systems described above as an example, a call to one service (e.g., customer provisioning (composite)) may result in a call to further services (e.g., storage operations on the persistence layer). In some embodiments, there may be statistical estimators 202 provided for each layer where service calls are made. In some embodiments, a statistical estimator 202 associated with an initial service (e.g., customer provisioning (composite)) may receive information from a statistical estimator 202 associated with a later service (e.g., storage operations on the persistence layer), including a probability that the later service is likely to succeed. The information from statistical estimators 202 associated with later services may be used by the statistical estimator 202 associated with an initial service.
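One possible way for an estimator at an initial service to use probabilities reported by estimators at later services is sketched below. The function name and the multiplication rule are assumptions: multiplying assumes the layers succeed or fail independently, which the disclosure does not state, and other combination rules are equally conceivable.

```python
def combined_success_probability(own_success, downstream_successes):
    """Combine a service's own success estimate with downstream estimates.

    Assumption for this sketch: every layer must succeed for the overall call
    to succeed, and the layers are treated as statistically independent.
    """
    for p in [own_success, *downstream_successes]:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    combined = own_success
    for p in downstream_successes:
        combined *= p   # each extra layer can only lower the overall estimate
    return combined
```

Under this rule, a composite service that is itself healthy still receives a low combined estimate when the persistence layer beneath it reports a low probability of success.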
For the DRL case, as shown, the reward function is a function which describes how the agent “ought” to behave. These functions may be thought to be a weight for a state and an action pair, which assign the relative importance of a transition from a given state with a given action with respect to our objective. Different use cases may warrant different reward functions. The service variables may include the current state of the system as represented by the state of the service(s) it is composed of. These services in turn may be represented by different variables such as their ongoing throughput, request latency, etc. This information may be used by the DRL model to generate a probability.
In embodiments, statistical estimator 202 may be a combination of the rule-based approach (
The two requests above the dotted line illustrate the flow without the statistical estimator 202. As shown, client 102 makes a request that is intercepted by service composite 602. Service composite 602 checks whether CB 604 is open. If not (i.e. normal operation), service composite 602 forwards the request to server 104. If the request fails, then a CB counter is incremented. If the counter is incremented and it exceeds a CB threshold, then the CB is tripped to its open state.
The request below the dotted line illustrates the flow with the statistical estimator 202. Client 102 makes a request that is intercepted by service composite 602. Service composite 602 checks with statistical estimator 202 to determine a probability that the request will succeed. If the request is not likely to succeed, then a CB counter is incremented. If the request is likely to succeed, then the service composite 602 forwards the request to the server 104, optionally checking whether CB 604 is open first as in the flow without the statistical estimator 202.
The statistical estimator 202 is optimized in terms of latency, network overhead, and end-user feedback because it follows a proactive strategy of limiting unsuccessful invocations. The traditional CB technique, by contrast, is in part reactive: it waits for failures to happen before taking action.
An exemplary probability estimation algorithm is now provided. In this example, a probability to succeed for a persistence service is being estimated.
The estimation output is a value between 0 and 1 with the following exemplary meanings:
A rules-based algorithm may, for example, include the following rules:
Given a preferred data center (DC) with 3 Persistent Service nodes and a preferred DC with 10 Persistent Service nodes, sample successChance values could be as below. A given node may be considered overloaded if its reported rejection level is greater than or equal to the request priority. In the example here, a consistency level indication from the persistence service may also impact the estimation.
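The overload test stated above can be sketched as follows. Only the comparison of rejection level against request priority comes from the text. The function names and the rule that takes the share of non-overloaded nodes as the estimate are hypothetical and do not reproduce the exemplary rules or sample values, and the consistency level indication is not modeled here.

```python
def is_overloaded(rejection_level, request_priority):
    """A node is overloaded if its rejection level >= the request priority."""
    return rejection_level >= request_priority


def success_chance(rejection_levels, request_priority):
    """Estimate a success chance in [0, 1] for a request of the given priority.

    Assumption for this sketch: the estimate is the fraction of nodes in the
    preferred data center that are not overloaded for this priority.
    """
    if not rejection_levels:
        return 0.0   # no known nodes: treat the request as unlikely to succeed
    available = sum(
        1 for level in rejection_levels
        if not is_overloaded(level, request_priority)
    )
    return available / len(rejection_levels)
```

For a data center with 3 nodes reporting rejection levels 0, 0, and 5, a request of priority 3 would see two of three nodes as available under this hypothetical rule.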
Step s902 comprises, prior to making a request to a first service, determining a first probability based on environment parameters, wherein the first probability represents a likelihood of the request to the first service failing.
Step s904 comprises determining that the first probability meets or exceeds a first threshold.
Step s906 comprises, as a result of determining that the first probability meets or exceeds the first threshold, (i) declining to make the request to the first service and (ii) incrementing a counter, wherein the counter is an internal variable for determining a circuit breaker state.
Step s910 comprises, prior to making a request to a first service, determining a first probability based on environment parameters, wherein the first probability represents a likelihood of the request to the first service failing.
Step s912 comprises determining that the first probability is below a first threshold.
Step s914 comprises, as a result of determining that the first probability is below the first threshold, (i) making the request to the first service and (ii) incrementing a counter if the request to the first service fails, wherein the counter is an internal variable for determining a circuit breaker state.
One or more of process 900A and process 900B may include additional steps or elements, as further described herein. In some embodiments, the method may further include, prior to making additional requests to the first service, determining additional probabilities based on environment parameters, wherein the additional probabilities represent a likelihood of the additional requests to the first service failing. The method may further include determining, for each of the additional requests to the first service, whether the corresponding probability meets or exceeds the first threshold. The method may further include, for each of the additional requests to the first service, if the corresponding probability meets or exceeds the first threshold (i) declining to make the corresponding request to the first service and (ii) incrementing the counter; and if the corresponding probability is below the threshold then (i) making the corresponding request to the first service and (ii) incrementing the counter if the corresponding request to the first service fails. The method may further include determining that the counter exceeds a second threshold. The method may further include, as a result of determining that the counter exceeds the second threshold, transitioning to an open circuit breaker state for the first service, where requests to the first service are disabled during the open circuit breaker state.
In some embodiments, determining a first probability based on environment parameters comprises using a rule-based estimator. In some embodiments, determining a first probability based on environment parameters comprises using machine learning. In some embodiments, using machine learning includes applying deep reinforcement learning. In some embodiments, the environment parameters on which the first probability is based include feedback signals from the first service, including one or more of a round-trip time, an acknowledgement (ACK) message, a negative acknowledgment (NACK) message, a node state indicator, and a cluster health indicator. In some embodiments, the first service performs a storage operation. In some embodiments, the first service performs a charging function. In some embodiments, the first service comprises a group of services or microservices. In some embodiments, the first service is provided by a node in a telecommunications network, and determining the first probability based on environment parameters is performed in a cloud computing environment. In some embodiments, the first service is a network function managed by an orchestration layer.
A1. A method for providing resilience in a computing environment, the method comprising:
A1′. A method for providing resilience in a computing environment, the method comprising:
A2. The method of one of embodiments A1 and A1′, further comprising:
A3. The method of any one of embodiments A1, A1′, and A2, wherein determining a first probability based on environment parameters comprises using a rule-based estimator.
A4. The method of any one of embodiments A1, A1′, and A2, wherein determining a first probability based on environment parameters comprises using machine learning.
A5. The method of embodiment A4, wherein using machine learning includes applying deep reinforcement learning.
A6. The method of any one of embodiments A1, A1′, and A2-A5, wherein the environment parameters on which the first probability is based include feedback signals from the first service, including one or more of a round-trip time, an acknowledgement (ACK) message, a negative acknowledgment (NACK) message, a node state indicator, and a cluster health indicator.
A7. The method of any one of embodiments A1, A1′, and A2-A6, wherein the first service performs a storage operation.
A8. The method of any one of embodiments A1, A1′, and A2-A6, wherein the first service performs a charging function.
A9. The method of any one of embodiments A1, A1′, and A2-A8, wherein the first service comprises a group of services or microservices.
A10. The method of any one of embodiments A1, A1′, and A2-A9, wherein the first service is provided by a node in a telecommunications network, and determining the first probability based on environment parameters is performed in a cloud computing environment.
A11. The method of embodiment A10, wherein the first service is a network function managed by an orchestration layer.
C1. A computer program (1143) comprising instructions which when executed by processing circuitry (1102) of a node (1100), causes the node (1100) to perform the method of any one of embodiments A1, A1′, and A2-A11.
C2. A carrier containing the computer program (1143) of embodiment C1, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium (1142).
D1. A network node (1100), the network node comprising:
E1. A network node (1100) for providing resilience in a computing environment, the network node (1100) being configured to:
E1′. A network node (1100) for providing resilience in a computing environment, the network node (1100) being configured to:
E2. The network node of embodiment E1, wherein the network node is further configured to perform the method of any one of embodiments A2-A10.
E3. The network node of embodiment E1′, wherein the network node is further configured to perform the method of any one of embodiments A2-A10.
While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described exemplary embodiments. Moreover, any combination of the above-described embodiments in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Number | Date | Country | Kind |
---|---|---|---|
202141027688 | Jun 2021 | IN | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SE2021/050927 | 9/23/2021 | WO |