The present disclosure relates to the field of virtualization and mixed-criticality systems or mixed critical systems (MCSs). A hypervisor device is provided in the present disclosure which can mitigate failure of a virtual machine (VM) by masking an interrupt request (IRQ) of another VM of lower priority. The present disclosure also provides a corresponding method and computer program.
In the field of virtualization, consolidation is a conventional technique aimed at aggregating software components (e.g., VMs) on top of a hardware platform (e.g., micro-controllers). In scenarios where the VMs have different safety integrity levels the aggregated system is known as an MCS.
An aspect of an MCS is the ability to provide Freedom From Interference (FFI) of each software component (e.g., each VM) from another. A drawback with conventional consolidation is that FFI is not always possible due to intrinsic hardware limitations or interference introduced by the design of such integrated software components. In those cases, there are no guarantees that safety requirements allocated on each software component will not be violated, e.g., because of cascading failures between two software components with a different integrity level.
Thus, interference between software components due to consolidation is a well-known issue on virtualized systems. In particular, conventional solutions do not provide any mechanism that, in case of dependent failures, is able to guarantee the correct execution of safety-related features allocated on intermediate Automotive Safety Integrity Level (ASIL) VMs. Instead, they offer only quite generic solutions, such as the complete suspension of VMs or the scaling of Central Processing Unit (CPU) frequency.
As a result, there is a need for improved failure mitigation of VMs operated by a hypervisor.
In view of the above-mentioned problem, embodiments of the present disclosure provide a hypervisor with improved mitigation of failures of VMs running on an MCS. The present disclosure describes selectively masking IRQs of a lower priority VM to mitigate a failure of a higher priority VM.
A first aspect of the present disclosure provides a hypervisor device for failure mitigation of a virtual machine (VM) where the hypervisor device is configured to: operate a first VM; operate a second VM, where the first VM has a higher priority level than the second VM; determine an interference parameter indicating a magnitude of interference of the second VM on the first VM; and mask at least one interrupt request (IRQ) relating to the second VM based on the interference parameter to mitigate a failure of the first VM.
This ensures that overall interference on a safety-critical VM in an MCS is mitigated and that availability of intermediate-safety VMs remains high, albeit with a reduced quality-of-service (QoS).
In a possible implementation, masking an interrupt includes not handling an interrupt.
In a possible implementation, the at least one IRQ relating to the second VM can be masked without compromising the availability of the second VM.
In a possible implementation, both the first VM and the second VM are operated by a same hardware platform.
In a possible implementation, a failure of the first VM includes a value of at least one of the following exceeding a predefined threshold: a CPU load, a Graphics Processing Unit (GPU) load, a memory load, an input/output (I/O) load, cache misses, a network load, or a storage load.
In a possible implementation, the IRQ which is to be masked is an IRQ of the second VM to a hardware platform which operates the first and the second VM.
In a possible implementation, the interference parameter includes an interference class (e.g., memory interference, I/O interference, etc.) and/or an interference magnitude (e.g., low, high, intermediate, etc.).
In a possible implementation, a hardware platform that operates VMs of different priority levels is a mixed critical system (MCS).
In an implementation form of the first aspect, the priority level includes a safety integrity level.
The present disclosure provides solutions in which the failure of safety critical VMs can be mitigated. For example, the safety integrity level includes an integrity level such as an automotive safety integrity level (ASIL).
In a further implementation form of the first aspect, the hypervisor device is further configured to determine the interference parameter based on a performance counter associated with the first VM and/or based on a performance counter associated with the second VM.
This is beneficial as it allows for detailed analysis of a load of a VM, e.g., to determine a failure of the respective VM in advance.
In a possible implementation, the performance counter is provided by the hardware platform which operates the respective VM.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain a first relationship indicating an influence of a performance counter type and/or a performance counter value on the magnitude of interference of the second VM with the first VM; and determine the interference parameter based on the first relationship.
This ensures that the interference parameter can be precisely calculated taking into account performance counter types and values.
In a possible implementation, the first relationship includes a first value and/or a first formula. In a possible implementation, the first relationship is obtained by the hypervisor device automatically, and/or by means of user input.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain a second relationship indicating the influence of an IRQ relating to the second VM on the magnitude of interference of the second VM with the first VM; and determine the interference parameter based on the second relationship.
This ensures that the interference parameter can be precisely calculated taking into account the influence of certain IRQs.
In a possible implementation, the second relationship includes a second value and/or a second formula. In a possible implementation, the second relationship is obtained by the hypervisor device automatically, and/or by means of user input.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain a first group of IRQs, and to select the at least one IRQ from the first group.
This allows for grouping IRQs which may have a similar effect on failure mitigation and QoS, and improves the selection of an IRQ which is suitable for the failure at hand.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain the first group of IRQs based on the second relationship.
This ensures that the first group can be determined more precisely.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain a second group of IRQs, and to further select the at least one IRQ from the second group if an attempt to mitigate the failure of the first VM based on masking an IRQ from the first group fails.
This ensures that several groups can be determined, each of which is ideal for a certain application scenario.
In a possible implementation, the hypervisor is further configured to select the at least one IRQ from the second group if an attempt to mitigate the failure of the first VM based on masking all IRQs from the first group fails.
In a further implementation form of the first aspect, the hypervisor device is further configured to obtain the second group of IRQs based on the second relationship.
This ensures that also the second group can be determined more precisely.
In a further implementation form of the first aspect, masking an IRQ from the second group leads to a higher degradation of QoS of the second VM than masking an IRQ from the first group.
This ensures that the groups can be tailored to a desired amount of mitigation or a desired amount of QoS.
In a further implementation form of the first aspect, a magnitude of interference of the second VM with the first VM for all IRQs in the first group is below a predefined threshold, and/or a magnitude of interference of the second VM with the first VM for all IRQs in the second group is above a predefined threshold.
This ensures that groups of IRQs can be put together in a manner that increases the effectiveness of failure mitigation.
In a further implementation form of the first aspect, the interference parameter indicates at least one of: CPU interference, GPU interference, memory interference, I/O interference, cache-miss interference, network interference, storage interference, or bus interference.
This allows for determining various kinds of interference.
In a further implementation form of the first aspect, the first VM is the VM with a highest priority level operated by the hypervisor device.
This ensures that a failure of a VM with the highest ASIL can be effectively mitigated.
A second aspect of the present disclosure provides a method for failure mitigation of a virtual machine (VM), where the method includes the steps of: operating, by a hypervisor device, a first VM; operating, by the hypervisor device, a second VM, where the first VM has a higher priority level than the second VM; determining, by the hypervisor device, an interference parameter indicating a magnitude of interference of the second VM on the first VM; and masking, by the hypervisor device, at least one interrupt request (IRQ) relating to the second VM based on the interference parameter to mitigate a failure of the first VM.
In an implementation form of the second aspect, the priority level includes a safety integrity level.
In a further implementation form of the second aspect, the method further includes determining, by the hypervisor device, the interference parameter based on a performance counter associated with the first VM and/or based on a performance counter associated with the second VM.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, a first relationship indicating an influence of a performance counter type and/or a performance counter value on the magnitude of interference of the second VM with the first VM; and determining the interference parameter based on the first relationship.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, a second relationship indicating the influence of an IRQ relating to the second VM on the magnitude of interference of the second VM with the first VM; and determining the interference parameter based on the second relationship.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, a first group of IRQs, and selecting the at least one IRQ from the first group.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, the first group of IRQs based on the second relationship.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, a second group of IRQs, and further selecting the at least one IRQ from the second group if an attempt to mitigate the failure of the first VM based on masking an IRQ from the first group fails.
In a further implementation form of the second aspect, the method further includes obtaining, by the hypervisor device, the second group of IRQs based on the second relationship.
In a further implementation form of the second aspect, masking an IRQ from the second group leads to a higher degradation of QoS of the second VM than masking an IRQ from the first group.
In a further implementation form of the second aspect, a magnitude of interference of the second VM with the first VM for all IRQs in the first group is below a predefined threshold, and/or a magnitude of interference of the second VM with the first VM for all IRQs in the second group is above a predefined threshold.
In a further implementation form of the second aspect, the interference parameter indicates at least one of: CPU interference, GPU interference, memory interference, I/O interference, cache-miss interference, network interference, storage interference, or bus interference.
In a further implementation form of the second aspect, the first VM is the VM with a highest priority level operated by the hypervisor device.
The second aspect and its implementation forms include the same advantages as the first aspect and its respective implementation forms.
A third aspect of the present disclosure provides a computer program including instructions which, when the computer program is executed by a computer, cause the computer to perform the method according to the second aspect or any of its implementation forms.
A fourth aspect of this disclosure provides a storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.
The present disclosure describes a failure mitigation mechanism used in MCSs running on top of a hypervisor, e.g., for a micro-controller. The proposed mechanism aims at mitigating the effect of cascading failures when FFI cannot be completely guaranteed in systems where software components have a different ASIL. According to the present disclosure, interrupts assigned to workloads in VMs are classified according to the interference effect of such interrupts on the most safety-critical VM. Interrupts of intermediate-ASIL VMs (or of non-safety-related ones) can be selectively deactivated in case the safety requirements of the most safety-critical workloads are going to be violated. This ensures mitigation of the overall interference on the most safety-critical VMs and leads to increased availability of intermediate-safety VMs, albeit with a reduction of their QoS.
The present disclosure increases availability of safety-related functionalities allocated on intermediate-ASIL VMs. Measurement of the interference caused by every single IRQ on a high priority VM allows IRQ clustering, also referred to as “coloring,” in the following description. Such operation allows the hypervisor device to gradually degrade the functionalities of intermediate priority VMs when it detects an interference in the high priority VM. The degradation of functionalities in the intermediate priority VM makes it possible to preserve, for as long as possible, the safety-related functionalities and the execution of the high priority VM without interference.
It should be noted that all devices, elements, units and means described in the present disclosure could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present disclosure as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
The above-described aspects and implementation forms of the present disclosure will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which:
A failure can be detected, e.g., if the interference parameter 103 exceeds a predefined threshold. To mitigate the failure, the hypervisor device 100 is configured to mask at least one interrupt request (IRQ) 104 relating to the second VM 102 based on the interference parameter 103.
By masking at least one IRQ 104 of the second VM 102, depending on the magnitude of interference of the second VM 102 on the first VM 101, the hypervisor device 100 allows for gradually mitigating the failure of the first VM 101 (caused by the interference of the second VM 102), without compromising the availability of the second VM 102.
In a possible implementation, the priority level can include a safety integrity level. The safety integrity level may include, e.g., an automotive safety integrity level (ASIL).
In the below paragraphs the hypervisor device 100 will be described in more detail in view of
The hypervisor device 100 allows for mitigating interference caused by lower priority or lower ASIL VMs 102 on higher priority or higher ASIL VMs 101, both running on a same hypervisor device 100 on the same hardware platform. Such mitigation can be performed by the underlying hypervisor device 100 exploiting the IRQs 104 of the lower priority VMs 102 as “knobs”. That is, basically subsets of these interrupt lines are not handled during system execution, depending on the magnitude of the measured interference (e.g., the interference parameter 103). The sensors used for measuring such interference at runtime can be performance counters provided by the hardware platform.
That is, as illustrated in
Interference may generally depend on the specific implementation and integration of the hypervisor device 100. Thus, every MCS which may take advantage of the hypervisor device 100 can be analyzed to identify a relationship between an interrupt served in the lower priority VM and the relative interference caused on the VM with higher or highest priority. According to these assumptions, the present disclosure may include two distinct phases: an offline phase and an online phase.
In the offline phase a system optionally can be analyzed under specific circumstances where inputs and outputs are controlled and monitored. A goal of the offline phase can be to produce outcomes which can be used by the hypervisor device 100 and can be prerequisites for the next phase.
One of these aspects can be identifying a formula which takes performance counter values as an input and produces a scalar value of a current interference. This may, in some examples, be achieved by observing only the highest priority VM executing with and without interference. Values of performance counters can then be correlated with the behavior of the safety functions carried out by the VM. Injected interferences can be controlled in terms of typology (memory, I/O etc.) and in terms of magnitude (low, medium, high). By analyzing the trend of performance counters values it is also possible to define stochastic precision and relevance of a specific counter in an overall interference calculation.
In other words, the hypervisor device 100 may further be configured to obtain a first relationship 202 indicating an influence of a performance counter type and/or a performance counter value on the magnitude of interference of the second VM 102 with the first VM 101 (that is, interference caused by the second VM 102 on the first VM 101); and determine the interference parameter 103 based on the first relationship 202. The first relationship 202 may include, e.g., the formula which takes performance counter values as an input and/or the values of performance counters.
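As an illustration only, the formula that maps performance counter values to a scalar interference value could be sketched as a weighted sum of relative deviations from an interference-free baseline. The counter names, the baseline values, the weights and the function name `interference_magnitude` are assumptions made for this sketch and are not taken from the disclosure, which leaves the concrete formula open.

```python
from typing import Mapping


def interference_magnitude(counters: Mapping[str, float],
                           baseline: Mapping[str, float],
                           weights: Mapping[str, float]) -> float:
    """Return a scalar interference value from sampled counter values.

    Assumption: each counter contributes its relative excess over an
    interference-free baseline, scaled by a per-counter weight.
    Baseline values are assumed to be strictly positive.
    """
    total = 0.0
    for name, weight in weights.items():
        base = baseline[name]
        # Values at or below the baseline contribute nothing.
        excess = max(0.0, (counters[name] - base) / base)
        total += weight * excess
    return total


# Hypothetical counter names and values, for illustration only.
weights = {"cache_misses": 0.5, "bus_stalls": 0.3, "io_wait": 0.2}
baseline = {"cache_misses": 1000.0, "bus_stalls": 200.0, "io_wait": 50.0}
sample = {"cache_misses": 1500.0, "bus_stalls": 200.0, "io_wait": 100.0}
magnitude = interference_magnitude(sample, baseline, weights)
```

The weights play the role of the relevance of a specific counter in the overall interference calculation; in practice they would be derived in the offline phase rather than chosen by hand.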
Another aspect can be to measure the interference caused by a single IRQ handled in the lower priority VM on the higher or highest priority VM. In this scenario, both VMs are executed together, but the lower priority VM has all IRQs disabled except the one which is under measurement. The magnitude of the interference caused by the specific IRQ can be calculated, e.g., using the formula derived in the previous step. The interrupt's weight can then be adjusted by using a “bias” which may depend on the functionality associated with the IRQ and its relevance for the implementation of the safety-related function(s) running on the VM.
In other words, the hypervisor device 100 may further obtain a second relationship 203 indicating the influence of an IRQ 104 relating to the second VM 102 on the magnitude of interference of the second VM 102 with the first VM 101; and determine the interference parameter 103 based on the second relationship 203. That is, the second relationship specifically may include the interference caused by a single IRQ and/or the bias.
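The per-IRQ measurement described above can be sketched as a loop that enables one IRQ of the lower priority VM at a time and records the resulting interference. In this minimal sketch, `measure_with_only` is a hypothetical callback standing in for a measurement run, and applying the bias as a multiplicative factor is an assumption; the disclosure only states that the weight is adjusted by a bias.

```python
from typing import Callable, Dict, Iterable


def measure_irq_effects(irqs: Iterable[str],
                        measure_with_only: Callable[[str], float],
                        bias: Dict[str, float]) -> Dict[str, float]:
    """Return the bias-adjusted interference effect of each IRQ.

    measure_with_only(irq) is assumed to run both VMs with every IRQ of
    the lower priority VM disabled except `irq`, and to return the
    interference observed on the higher priority VM.
    """
    effects = {}
    for irq in irqs:
        raw = measure_with_only(irq)
        # Assumption: the bias scales the measured effect; IRQs without
        # an explicit bias keep their raw value.
        effects[irq] = raw * bias.get(irq, 1.0)
    return effects


# Hypothetical measurements replacing a real measurement run.
fake_measurements = {"can_rx": 0.4, "timer": 0.1, "uart": 0.2}
effects = measure_irq_effects(fake_measurements,
                              fake_measurements.__getitem__,
                              bias={"timer": 2.0})
```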
Another aspect can be the clustering of IRQs. According to this aspect, interrupts can basically be divided into subsets a.k.a. “colors” depending on the effects measured in the previous step. The clustering algorithm, the number, and/or dimension of the clusters can be chosen arbitrarily. Clustering also allows for an identification of the thresholds of interference which separate a “color” from the next one.
In other words, the hypervisor device 100 may obtain a first group of IRQs 204, and select the at least one IRQ 104 from the first group 204. In a possible implementation, the first group of IRQs 204 can comprise one of the clusters described above.
More specifically, the first group of IRQs 204 can be determined based on the second relationship 203. That is, the first group 204 (i.e., the clusters) can be determined based on the interference caused by a single IRQ on one of the VMs 101, 102.
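Since the clustering algorithm can be chosen arbitrarily, one simple option is to split the IRQs by fixed interference thresholds. The sketch below assumes that the thresholds are given in ascending order and that each IRQ already has a scalar effect value; the function name `color_irqs` and the IRQ names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def color_irqs(effects: Dict[str, float],
               thresholds: Sequence[float]) -> List[List[str]]:
    """Group IRQs into colors separated by ascending effect thresholds.

    Color 0 holds the IRQs with an effect below thresholds[0], color 1
    those from thresholds[0] up to (not including) thresholds[1], etc.
    """
    colors: List[List[str]] = [[] for _ in range(len(thresholds) + 1)]
    for irq, effect in sorted(effects.items()):
        colors[bisect_right(thresholds, effect)].append(irq)
    return colors


# Hypothetical effect values, e.g., as measured in the offline phase.
colors = color_irqs({"uart": 0.05, "can_rx": 0.4, "timer": 0.2, "eth": 0.9},
                    thresholds=[0.1, 0.5])
```

With this choice the thresholds directly are the bounds that separate one color from the next; a data-driven clustering algorithm would instead derive them from the measured effects.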
All the colors and the IRQs which compose them can be collected and described in a configuration which can be provided to the hypervisor device 100. Such a configuration can be used, e.g., during the online phase. The hypervisor device 100 may sample values from performance counters and calculate the current interference on the highest priority VM, e.g., by using the formula derived in the offline phase. Depending on the magnitude of the interference, a so-called “degraded mode” can be applied to the lower priority VM by disabling a specific “color” of interrupts according to the provided configuration. Such an algorithm can be carried out by the hypervisor device 100 in the online phase.
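A configuration of this kind, together with the selection of the colors to disable for a given interference magnitude, might look as follows. The `ColorConfig` structure, its field names and the cumulative masking policy are assumptions for illustration; the disclosure does not prescribe a configuration format.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ColorConfig:
    """Hypothetical offline-generated configuration."""
    colors: List[List[str]]   # IRQ colors, listed in masking order
    thresholds: List[float]   # ascending; thresholds[n] activates color n


def irqs_to_mask(config: ColorConfig, magnitude: float) -> Set[str]:
    """Return all IRQs to disable for the given interference magnitude.

    Assumption: reaching thresholds[n] disables colors 0..n together.
    """
    active = bisect_right(config.thresholds, magnitude)
    masked: Set[str] = set()
    for color in config.colors[:active]:
        masked.update(color)
    return masked


config = ColorConfig(colors=[["uart"], ["can_rx", "timer"]],
                     thresholds=[0.3, 0.6])
```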
In other words, the hypervisor device may obtain a second group of IRQs 205, and further select the at least one IRQ 104 from the second group 205, if an attempt to mitigate the failure of the first VM 101 based on masking an IRQ from the first group 204 fails. The second group 205 may include one of the other clusters or colors of IRQs. Also the second group 205 (i.e., the clusters or colors) can be determined based on the interference caused by a single IRQ on one of the VMs 101, 102 (that is, based on the second relationship 203).
The clustering or coloring of the IRQs can be done in an order according to which masking an IRQ from the second group 205 leads to a higher degradation of quality of service (QoS) of the second VM 102 than masking an IRQ from the first group 204. In other words, the second VM 102 may be degraded stepwise to ensure that the first VM 101 has enough resources, without switching off the second VM 102 at once.
In a possible implementation, a magnitude of interference of the second VM 102 with the first VM 101 can be below a predefined threshold for all IRQs in the first group 204. In other words, the IRQs in the first group cause less interference on the first VM, but at the same time do not influence the behaviour of the second VM 102 that much when being masked.
In a possible implementation, a magnitude of interference of the second VM 102 with the first VM 101 can be above a predefined threshold for all IRQs in the second group 205. In other words, the IRQs in the second group cause more interference on the first VM 101, but also do influence the behaviour of the second VM 102 more when being masked.
According to the following disclosure, the hypervisor device 100 can be responsible for managing the IRQs by forwarding them to a corresponding VM. To simplify the notation, each IRQ propagated to a given VM is identified below by a unique index. It follows that the IRQ index (i.e., IRQk) corresponds to the tuple composed of the physical pin associated with the interrupt and the identifier of the VM that manages such interrupt. Thus, in case the same physical interrupt is forwarded to n VMs, it is referred to by n different indexes. As for index notation in the following part of the disclosure, the target VM with the highest priority (or ASIL) can be referred to as the target VM with index i (i.e., VMi).
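The indexing convention can be illustrated with a small sketch in which every distinct pair of physical pin and VM identifier receives its own index k. The function name `index_irqs`, the first-seen numbering and the example pin numbers are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple

IrqRoute = Tuple[int, int]  # (physical pin, identifier of the receiving VM)


def index_irqs(routes: Iterable[IrqRoute]) -> Dict[IrqRoute, int]:
    """Assign an index k to each distinct (pin, vm_id) pair.

    Indexes are handed out in first-seen order; a repeated pair keeps
    the index it received the first time.
    """
    index: Dict[IrqRoute, int] = {}
    for route in routes:
        index.setdefault(route, len(index))
    return index


# Pin 5 is forwarded to two VMs and therefore appears under two indexes.
irq_index = index_irqs([(5, 1), (5, 2), (7, 2), (5, 1)])
```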
Pj
Interference parameters 103 are typically memory interference, I/O interference, cache-miss interference and so forth. At each instance, the values of parameters Pj
The interference parameter 103 can be used to define the interference effect of the target VM, referred to as eVMi. As shown in formula 1 below, such effect is calculated by combining the interferences that affect the target VM and each interference is due to the corresponding interference parameter:
In this equation, P̂j
pj
The IRQ interference parameters pj
In the above formula, pj
The algorithm can be divided into two parts: an OFFLINE phase and an ONLINE phase.
As for the OFFLINE phase, the algorithm may include the following steps:
1. For each IRQ, the IRQ interference effect e(IRQk, VMi) on the target VM is calculated:
The result is also illustrated in
As for the ONLINE phase, the proposed algorithm may include the following steps:
1. At run-time, the hypervisor device 100 monitors the VM behavior and the behavior of the target VM VMi (i.e., the first VM 101).
2. In case that the hypervisor device 100 detects a degradation in the target VM 101 (e.g., some monitored parameters highlight interference by exceeding the offline pre-computed values), the hypervisor device 100 switches to a degraded state:
With reference to the OFFLINE phase, in step 1 the definition of functions and parameters used to calculate the IRQ interference effect e(IRQk, VMi) on the target VM hinges on the interference detection on the target VM, performed by performance counters 201a, 201b.
As it is also illustrated in
The performance counters 201a, 201b may maintain bounded values YPMU(t). By analyzing the behavior of these values, it is also possible to establish a suitable stochastic distribution (mean, variance, etc.) and then derive the counters' precision. Then, interferences can be added to the system, and an error E(t) becomes detectable in the output. Thus, the performance counters shall reflect the error's magnitude and dynamics EPMU(t). A heuristic can be defined to estimate the error at a certain time from the values of the performance counters. Interferences I(t) can be divided into different functional sets by class (memory interference, I/O interference) and by magnitude (low, intermediate, high).
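One possible heuristic, sketched under the assumption that a counter is roughly stationary when no interference is present, is to profile the counter by its mean and standard deviation and to express a later sample as a deviation in units of that standard deviation. The function names and the sample values are hypothetical, and the disclosure does not commit to this particular heuristic.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def counter_profile(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of interference-free samples."""
    return mean(samples), pstdev(samples)


def deviation_score(value: float, profile: Tuple[float, float]) -> float:
    """Distance of a sample from the profile mean, in standard deviations."""
    mu, sigma = profile
    if sigma == 0.0:
        # A perfectly constant counter: any change is treated as unbounded.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


# Hypothetical interference-free samples of one performance counter.
profile = counter_profile([100.0, 102.0, 98.0, 100.0])
score = deviation_score(110.0, profile)
```

A small standard deviation corresponds to a precise counter, so the same absolute change yields a larger score and the counter is more useful for detecting injected interference.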
According to an exemplary embodiment of the present disclosure which is described in view of
Step 701: parameters Pj
Step 702: For each parameter the interference function Ij
Step 703: The function fi(.) is defined that combines the different types of interferences potentially affecting the VMi. The function is typically a weighted sum where the weight depends on the impact of the corresponding parameter on the safety related intended functionalities on the target VM.
Step 704: For each interference parameter, the effect pj
Step 705: For each interrupt IRQk, the BIAS(IRQk) is calculated. Such value is defined according to the logical and safety-related functionality of IRQk in the VM receiving the interrupt (e.g., timers on the VM have the highest BIAS).
Step 706: For each interrupt IRQk, e(IRQk, VMi) is calculated. Apply the formula shown in formula 2 using the interference function Ij
Step 707: The clusters for e(IRQk, VMi) are defined.
The result is the association of each interrupt with a cluster (or “color”). Notice that colors are ordered in terms of degradation effect. A color associated with a cluster whose bound values are lower corresponds to a lower degradation effect on the target VM.
As for the ONLINE phase, the hypervisor device 100 can monitor the behavior of the target VMi 101 via performance counters. In case the hypervisor device 100 detects an interference in the target VM 101, because the monitored parameters are moving outside pre-computed acceptable values, it masks the IRQs 104 belonging to the cluster 204 with the highest degradation effect. In case the monitored parameters of the target VM 101 are still showing interference, the hypervisor device 100 progressively continues to mask the IRQs of the clusters 205 associated with lower degradation effect. In case the hypervisor device 100 has to restore the status of the target VM 101, it progressively unmasks the IRQs, starting from the color associated with lower degradation effect. That is, the IRQs can be restored in the inverse order with respect to the masking order.
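The progressive masking and the restoration in inverse order can be sketched as a small state machine. In this sketch, the class name `ColorMasker` and the `mask` and `unmask` callbacks are hypothetical stand-ins for the platform-specific interrupt controller operations, and the colors are assumed to be supplied in the masking order defined by the configuration.

```python
from typing import Callable, List, Sequence


class ColorMasker:
    """Masks IRQ colors one step at a time and restores them in reverse."""

    def __init__(self, colors: Sequence[Sequence[str]],
                 mask: Callable[[str], None],
                 unmask: Callable[[str], None]) -> None:
        self._colors: List[List[str]] = [list(c) for c in colors]
        self._mask = mask
        self._unmask = unmask
        self._masked_levels = 0  # number of colors currently masked

    def escalate(self) -> bool:
        """Mask the next color; return False if all are already masked."""
        if self._masked_levels == len(self._colors):
            return False
        for irq in self._colors[self._masked_levels]:
            self._mask(irq)
        self._masked_levels += 1
        return True

    def restore(self) -> bool:
        """Unmask the most recently masked color; False if none is masked."""
        if self._masked_levels == 0:
            return False
        self._masked_levels -= 1
        for irq in self._colors[self._masked_levels]:
            self._unmask(irq)
        return True
```

A monitoring loop would call `escalate()` while the target VM still shows interference and `restore()` once the monitored parameters are back within the pre-computed acceptable values.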
In the above-described scenario, the VMs 101, 102 composing the MCS are seen as “black boxes”, since they are taken “as is” and integrated on the same hardware platform without any modification. An alternative solution for interference detection can be VM introspection, i.e., monitoring specific parameters that can be treated like the state space of a dynamic system. A “white box” approach can be adopted where code and VM internals are reachable by the hypervisor device 100 and can be monitored at run-time. The hypervisor device 100 can access specific memory locations inside a VM's private memory space, allocating more memory itself (e.g., memory page tables), which may require code availability.
The hypervisor device 100 may be used in application domains where safety, integrity and availability must be guaranteed (e.g., automotive, avionics, railways, robotics and medical). Within such domains, the hypervisor device 100 can be applied in a hypervisor-based environment where functionalities of different ASILs are confined in their own VMs. VMs could interfere with each other, e.g., due to a high I/O-based workload.
With reference to the automotive field, the hypervisor device 100 can be used to mitigate the effect of cascading failures causing the violation of timing constraints in an Automotive Open System architecture (AUTOSAR) adaptive virtualized system, as the one shown in
In the above example, the VM3 workload is composed of ASIL tasks and a task, referred to as the processing task, which is responsible for managing CAN messages. The processing task is activated every time a message is received from the CAN bus as shown in
As shown in
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure, and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.
This application is a continuation of International Application No. PCT/EP2022/064546, filed on May 30, 2022, the disclosure of which is hereby incorporated by reference in its entirety.
| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/EP2022/064546 | May 2022 | WO |
| Child | 18963131 | | US |