Benefit is claimed under 35 U.S.C. 119(a)-(d) to Foreign Application Serial No. 202041001650 filed in India entitled “SUPPORTING INVOCATIONS OF THE RDTSC (READ TIME-STAMP COUNTER) INSTRUCTION BY GUEST CODE WITHIN A SECURE HARDWARE ENCLAVE”, on Jan. 14, 2020, by VMware, Inc., which is herein incorporated in its entirety by reference for all purposes.
There are several ways in which guest program code running within a virtual machine (VM) can keep track of the VM's system time (i.e., the amount of real-world time that has elapsed since the VM was powered on). One method is to employ a programmable interval timer, known as a system clock timer, that periodically generates interrupts which are delivered to the VM's guest operating system (OS). Upon receiving such an interrupt, the guest OS increments a counter value (i.e., a system clock counter) that indicates the VM's system time, and this system clock counter can be queried by other guest code via an appropriate guest OS application programming interface (API). For example, if the system clock timer is programmed to generate an interrupt every 10 milliseconds (ms) and the system clock counter has a value of 1000, a total of 10×1000=10,000 ms (or 10 seconds) have passed since the time of VM power-on.
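By way of illustration, the following minimal sketch shows the essence of this interrupt-driven timekeeping scheme (the names are hypothetical; a real guest OS implements this with considerably more machinery):

```c
#include <stdint.h>

#define TIMER_INTERVAL_MS 10   /* programmed system clock timer period */

static volatile uint64_t system_clock_counter;  /* the VM's system clock counter */

/* Invoked by the guest OS each time a system clock timer interrupt arrives. */
void timer_interrupt_handler(void)
{
    system_clock_counter++;
}

/* Guest OS API: returns milliseconds elapsed since VM power-on,
   e.g., a counter value of 1000 yields 10,000 ms (10 seconds). */
uint64_t get_system_time_ms(void)
{
    return system_clock_counter * TIMER_INTERVAL_MS;
}
```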
Another way in which guest code can track/determine VM system time is to employ the RDTSC (Read Time-Stamp Counter) instruction that is implemented by central processing units (CPUs) based on the x86 CPU architecture. When guest code calls the RDTSC instruction, the physical CPU mapped to the VM's virtual CPU (vCPU) writes a hardware-derived timestamp value (referred to herein as an RDTSC timestamp) into two vCPU registers. This RDTSC timestamp indicates the number of physical CPU clock cycles, or ticks, that have occurred since the time of host system power-on/reset (subject to an offset specified by the host system's hypervisor to account for the time at which the VM was powered on). Thus, this timestamp can be considered reflective of the amount of real-world time that has transpired during that period. The guest code that invoked the RDTSC instruction can then retrieve the RDTSC timestamp from the appropriate vCPU registers and thereby determine the current VM system time.
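For example, guest code built with a GCC-compatible compiler can invoke RDTSC via inline assembly as follows (a minimal sketch; the instruction returns the low and high 32 bits of the 64-bit timestamp in the EAX and EDX registers, respectively):

```c
#include <stdint.h>

/* Reads the time-stamp counter: RDTSC places the low 32 bits of the
   timestamp in EAX and the high 32 bits in EDX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```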
To account for scenarios in which the physical compute resources of a host system become overcommitted (i.e., scenarios where the number of running vCPUs exceeds the number of available physical CPUs), some existing hypervisors implement a time virtualization heuristics module that uses heuristics to intelligently accelerate the delivery of system clock timer interrupts to VMs that have been de-scheduled and subsequently re-scheduled on the host system's physical CPU(s) due to CPU starvation/contention. This accelerated interrupt delivery ensures that the system time of such VMs (as determined via their system clock counters) eventually catches up to real-world time.
In addition, hardware virtualization technologies such as Intel VT and AMD-V provide the capability to intercept RDTSC instructions invoked by VMs. When time virtualization heuristics are active, existing hypervisors make use of this capability to (1) trap an RDTSC instruction call made by VM guest code, (2) emulate execution of the RDTSC instruction (i.e., generate an RDTSC timestamp in software rather than via the CPU hardware), and (3) provide the software-generated RDTSC timestamp to the calling guest code. This RDTSC trapping and emulation mechanism allows the hypervisor to provide the calling guest code with an RDTSC timestamp that is consistent with the hypervisor-level time virtualization heuristics applied to the VM's system clock timer interrupts, and thus ensures that the VM as a whole has a coherent view of its system time from both clock sources.
However, a significant complication with the foregoing is that the RDTSC instruction may be invoked by guest code running within a secure hardware enclave of the VM. In these cases, the hypervisor cannot provide an emulated RDTSC timestamp to the calling guest code because the state of that guest code is isolated from, and thus cannot be modified by, the hypervisor. As a result, the hypervisor cannot ensure that the RDTSC timestamps received by the secure hardware enclave guest code will be consistent with time virtualization heuristics applied to the VM's system clock timer interrupts.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details, or can be practiced with modifications or equivalents thereof.
Embodiments of the present disclosure are directed to techniques that may be implemented by a hypervisor of a host system for supporting invocations of the RDTSC instruction (or equivalents thereof) by guest program code running within a secure hardware enclave of a VM. As used herein, a secure hardware enclave (also known as a hardware-assisted trusted execution environment or TEE) is a region of computer system memory, allocated via a special set of CPU instructions, where user-world code can run in a manner that is isolated from other processes running in other memory regions (including those running at higher privilege levels, such as a VM's guest OS or the hypervisor). Thus, secure hardware enclaves enable secure and confidential computation. Examples of existing technologies that facilitate the creation and use of secure hardware enclaves include SGX (Software Guard Extensions) for Intel CPUs and TrustZone for ARM CPUs.
At a high level, the techniques of the present disclosure guarantee that invocations of the RDTSC instruction made by guest code within a VM's secure hardware enclave do not result in a conflict between (1) the guest code's understanding of VM system time as determined using the RDTSC instruction, and (2) the guest code's understanding of VM system time as determined using the VM's system clock timer/counter, which may be manipulated by the hypervisor via time virtualization heuristics. These and other aspects are described in further detail in the sections that follow.
As mentioned previously, there are two different types of timekeeping systems which guest program code running within VMs 104(1)-(J) can rely on in order to track/determine the system time of their respective VMs. The first type of timekeeping system uses a VM-level system clock counter which is incremented based on interrupts that are delivered in accordance with a programmable system clock timer. The second type of timekeeping system uses the RDTSC CPU instruction which is supported by x86 CPUs (and is assumed to be supported by physical CPUs 106(1)-(K)).
In a virtualized host system like system 100 of FIG. 1, these two timekeeping mechanisms can fall behind real-world time when physical compute resources become overcommitted. For example, assume the system clock timer of VM 104(1) is programmed to generate an interrupt every 10 ms and the VM operates normally for 50 ms, such that its guest OS receives five interrupts and its system clock counter reaches a value of 5. Further assume that, immediately thereafter, the VM's vCPU is de-scheduled from its physical CPU due to CPU contention and remains in a waiting state for 30 ms before being re-scheduled.
In the foregoing scenario, the guest OS of VM 104(1) cannot receive any interrupts during the 30 ms in which the VM's vCPU is de-scheduled and in the waiting state, because the vCPU is not actually running during that period. Thus, the guest OS will receive its sixth interrupt at the real-world time of 50 ms (initial operation)+30 ms (waiting state)+10 ms (timer interval)=90 ms. However, because the guest OS has only received six interrupts, the guest OS will erroneously believe that the correct VM system time at that point is 60 ms (i.e., 6 interrupts×10 ms per interrupt). In addition, as long as the guest OS continues receiving further interrupts at the programmed interval of 10 ms from that point onward, the guest OS's notion of VM system time (per the system clock counter) will perpetually lag behind real-world time by 30 ms.
To address this issue, certain existing hypervisors implement a time virtualization heuristics module (shown as module 108 in FIG. 1) that detects when a VM's system clock counter has fallen behind real-world time due to de-scheduling and, in response, applies catch-up heuristics that deliver subsequent system clock timer interrupts to the VM's guest OS at an accelerated pace until the VM's system time has caught up with real-world time.
For example, with respect to VM 104(1) discussed above, at the time the VM's vCPU is re-scheduled on physical CPU 106(1) after 30 ms in the waiting state, time virtualization heuristics module 108 can detect that the VM's guest OS has effectively missed three interrupts (one interrupt per 10 ms) after the fifth one and thus apply catch-up heuristics that cause the next six interrupts (i.e., the sixth, seventh, eighth, ninth, tenth, and eleventh interrupts) to be delivered to the guest OS at an accelerated pace of every 5 ms, rather than at the programmed pace of every 10 ms. The delivery times of interrupts 6-11 at this accelerated pace are presented in Table 1 below in terms of real-world time and are contrasted against the delivery times for these interrupts that would occur at the normal programmed pace:

TABLE 1

| Interrupt number | Real-world delivery time (accelerated 5 ms pace) | Real-world delivery time (normal 10 ms pace) |
|---|---|---|
| 6 | 85 ms | 90 ms |
| 7 | 90 ms | 100 ms |
| 8 | 95 ms | 110 ms |
| 9 | 100 ms | 120 ms |
| 10 | 105 ms | 130 ms |
| 11 | 110 ms | 140 ms |
As can be seen in the second column of Table 1, once the eleventh interrupt has been delivered to the guest OS of VM 104(1) at the accelerated pace of 5 ms intervals, the VM's notion of its system time per the system clock counter (i.e., 11 interrupts×10 ms per interrupt=110 ms) will have caught up with the real-world time of 110 ms. Thus, time virtualization heuristics module 108 has brought the VM's system time into alignment with real-world time, and module 108 can be deactivated at that point for VM 104(1) (i.e., future interrupts can be delivered to the VM's guest OS at the normal programmed pace of 10 ms) until needed again. If no catch-up heuristics were applied to VM 104(1), the VM's system time would lag behind real-world time by 30 ms at every interrupt point, as shown in the third column of Table 1.
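The catch-up schedule in this example follows a simple pattern: each interrupt delivered at half the programmed period closes half a period's worth of lag, so twice the number of missed interrupts must be delivered at the accelerated pace. A sketch of this calculation, under the simplifying assumption (as in the example above) that the accelerated pace is exactly double the programmed pace:

```c
#include <stdint.h>

struct catchup_schedule {
    uint64_t interval_ms;     /* accelerated delivery interval */
    uint64_t num_interrupts;  /* interrupts to deliver at that interval */
};

/* Each accelerated interrupt advances guest time by period_ms while only
   period_ms/2 of real time elapses, closing period_ms/2 of lag per interrupt;
   a lag of missed*period_ms therefore requires 2*missed interrupts. */
struct catchup_schedule plan_catchup(uint64_t period_ms, uint64_t missed)
{
    struct catchup_schedule s;
    s.interval_ms    = period_ms / 2;  /* e.g., 10 ms -> 5 ms */
    s.num_interrupts = 2 * missed;     /* e.g., 3 missed -> 6 accelerated */
    return s;
}
```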
One consequence of implementing time virtualization heuristics module 108 is that the RDTSC instruction may return timestamp values that conflict (or in other words, are out of sync) with the system clock counter of a VM when module 108 is enabled/active with respect to that VM. This is problematic because some guest OSs leverage the RDTSC instruction in combination with the system clock counter to obtain higher-resolution system time readings. For example, because the frequency at which system clock timer interrupts are generated is relatively low (e.g., one interrupt every 10 ms), a guest OS can determine a more accurate time reading by calculating (time per the system clock counter)+(time elapsed, per RDTSC, since the last system clock timer interrupt). Thus, it is desirable to ensure that the VM's system clock counter remains in sync with the timestamps returned by the RDTSC instruction when time virtualization heuristics are active.
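For example, a guest OS might compose such a high-resolution reading along the following lines (a sketch; the bookkeeping variables are hypothetical stand-ins for state that a real guest OS would maintain and calibrate):

```c
#include <stdint.h>

#define TIMER_INTERVAL_US 10000   /* 10 ms timer period, in microseconds */

extern uint64_t read_tsc(void);   /* RDTSC wrapper, as sketched earlier */

static uint64_t system_clock_counter;  /* incremented per timer interrupt */
static uint64_t tsc_at_last_tick;      /* TSC value captured at the last interrupt */
static uint64_t tsc_per_us;            /* calibrated TSC ticks per microsecond */

/* Microseconds since VM power-on: coarse time from the system clock counter,
   refined by the TSC ticks elapsed since the last timer interrupt. */
uint64_t get_system_time_us(void)
{
    uint64_t coarse = system_clock_counter * TIMER_INTERVAL_US;
    uint64_t fine   = (read_tsc() - tsc_at_last_tick) / tsc_per_us;
    return coarse + fine;
}
```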
The foregoing is currently achieved via a hypervisor-level RDTSC trap handler and corresponding RDTSC emulator (shown via reference numerals 110 and 112, respectively, in FIG. 1). When time virtualization heuristics module 108 activates its catch-up heuristics for a given VM, module 108 sets a trap bit that causes all subsequent invocations of the RDTSC instruction from within that VM to trap (i.e., exit) to hypervisor 102 rather than being executed directly by a physical CPU 106.
When the trap/exit to hypervisor 102 occurs, the hypervisor passes control to RDTSC trap handler 110, which in turn uses RDTSC emulator 112 to emulate the invoked RDTSC instruction in software and thereby generate an RDTSC timestamp that takes into account the current state of time virtualization heuristics module 108 with respect to the subject VM. RDTSC trap handler 110 then writes the emulated RDTSC timestamp to one or more vCPU registers of the VM and control returns to the calling guest code, which retrieves the emulated RDTSC timestamp from the vCPU register(s) and continues with its execution. Finally, once the catch-up heuristics applied via time virtualization heuristics module 108 have enabled the VM's system clock counter to catch up with real-world time, module 108 unsets the trap bit. This causes further RDTSC instruction calls from within the VM to be once again handled directly by a physical CPU 106 in accordance with the normal mode of operation.
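A condensed sketch of this trap-and-emulate flow appears below. All hypervisor-side names are hypothetical abstractions for purposes of illustration, not the API of any particular hypervisor:

```c
#include <stdint.h>

struct vm;  /* opaque per-VM state; hypothetical */

/* Hypothetical helpers: guest time as adjusted by module 108's heuristics,
   the vCPU's advertised TSC frequency, and a register-write primitive. */
uint64_t heuristics_guest_time_us(struct vm *vm);
uint64_t vm_tsc_per_us(struct vm *vm);
void     vcpu_write_edx_eax(struct vm *vm, uint64_t value);

/* RDTSC trap handler: emulate the instruction so the returned timestamp is
   consistent with the VM's heuristics-adjusted system clock counter. */
void handle_rdtsc_exit(struct vm *vm)
{
    /* e.g., return a timestamp reflecting 70 ms even if 90 ms have really
       elapsed, matching the guest's system clock counter */
    uint64_t emulated_ts = heuristics_guest_time_us(vm) * vm_tsc_per_us(vm);
    vcpu_write_edx_eax(vm, emulated_ts);  /* guest resumes and reads EDX:EAX */
}
```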
For instance, in the previously discussed example of VM 104(1), assume guest code within the VM calls the RDTSC instruction at the time of delivery of the seventh interrupt (i.e., at the real-world time of 90 ms per Table 1). If the RDTSC instruction were passed directly through to a physical CPU 106, the physical CPU would return a hardware-derived RDTSC timestamp reflecting the real-world time of 90 ms. However, that would be undesirable because the guest OS of VM 104(1) has only received seven interrupts at that point, and thus believes the total elapsed time is 70 ms per its system clock counter. Accordingly, in this scenario RDTSC trap handler 110/emulator 112 can recognize that the VM's system clock counter indicates a system time of 70 ms (in accordance with the currently applied heuristics) and return a consistent RDTSC timestamp of 70 ms to the calling guest code.
As noted in the Background section, a significant complication with performing hypervisor-level trapping and emulation of the RDTSC instruction is that, in some cases, this instruction may be called by guest code running within a secure hardware enclave of a VM. An example of such a secure hardware enclave is shown via reference numeral 114 within VM 104(1) of FIG. 1. Because the state of guest code running within secure hardware enclave 114 is isolated from all other processes on the host system (including hypervisor 102), hypervisor 102 cannot write an emulated RDTSC timestamp into the vCPU register(s) from which that guest code retrieves the timestamp, and thus cannot guarantee that the enclave's view of VM system time remains consistent with the time virtualization heuristics applied to the VM.
To address this and other similar problems, in various embodiments time virtualization heuristics module 108 and/or RDTSC trap handler 110 can be enhanced in a manner that guarantees RDTSC invocations made by guest code within a VM's secure hardware enclave return RDTSC timestamps that are always consistent with the VM's system clock counter. For example, in one set of embodiments (referred to herein as the TSC scaling approach and detailed in section (3) below), components 108 and/or 110 can be modified to leverage the TSC (Time-Stamp Counter) scaling feature available on many modern x86 CPU designs to modify, via physical CPUs 106(1)-(K), the hardware-derived RDTSC timestamps that are returned to secure hardware enclave guest code in accordance with module 108's time virtualization heuristics (rather than performing this modification in software via RDTSC emulator 112). In certain embodiments, this TSC scaling approach can completely avoid the need for hypervisor 102 to explicitly trap and emulate RDTSC instruction calls made from within a VM's secure hardware enclave (or more generally, from anywhere within the VM) in order to achieve consistency with the accelerated interrupt delivery performed by time virtualization heuristics module 108.
In another set of embodiments (referred to herein as the heuristics suppression approach and detailed in section (4) below), components 108 and/or 110 can be modified to (1) deactivate the time virtualization heuristics applied to a given VM when an invocation of the RDTSC instruction by guest code within a secure hardware enclave of that VM is trapped by hypervisor 102 (thereby effectively dropping the catch-up interrupts that would otherwise have been delivered to compensate for the VM's de-scheduled period), (2) move forward the VM's system clock counter to match real-world time, and (3) disable any further RDTSC trapping for the VM until module 108 determines that time virtualization heuristics should be reactivated. The combination of steps (1) and (2) ensures that, at the time of the initial RDTSC exit/trap, the VM's system clock counter will be brought into alignment with RDTSC timestamps generated directly by the CPU hardware. Accordingly, hypervisor 102 can refrain from trapping and emulating any further RDTSC instruction calls made from within the secure hardware enclave at that point, until an event occurs that causes time virtualization heuristics for the VM to be reactivated. With this approach, the VM's guest OS will observe a sudden jump forward in system time due to the system clock counter being moved forward; however, this approach is relatively easy to implement and thus can be opportunistically employed for scenarios where it is unlikely that the RDTSC instruction will be called by guest code within a secure hardware enclave.
It should be appreciated that the architecture of host system 100 shown in FIG. 1 is illustrative and not intended to limit embodiments of the present disclosure. For example, the various entities shown in this figure may be arranged according to other configurations and/or may include subcomponents or functions that have not been specifically described. One of ordinary skill in the art will recognize other variations, modifications, and alternatives.
As indicated above, TSC scaling is a feature found on modern x86 CPU designs which allows a hypervisor (or other program code) to set, in CPU hardware, a scaling factor to be applied to all RDTSC timestamps generated by that hardware for a given context (e.g., a given VM). With the TSC scaling approach, hypervisor 102 of FIG. 1 programs such a scaling factor into physical CPUs 106(1)-(K) in accordance with the time virtualization heuristics applied to a given VM 104, thereby causing the CPU hardware itself to return appropriately adjusted RDTSC timestamps to the VM's guest code (including guest code running within a secure hardware enclave) without any hypervisor-level trapping and emulation. FIG. 2 depicts a workflow 200 that can be executed by hypervisor 102 for implementing this approach according to certain embodiments.
Starting with block 202, time virtualization heuristics module 108 can detect the occurrence of an event/scenario that indicates time virtualization heuristics (e.g., accelerated interrupt delivery) should be activated with respect to a given VM 104 and can activate the heuristics accordingly. For example, module 108 may detect that VM 104 has been de-scheduled and re-scheduled on a physical CPU 106, which has caused the VM's guest OS to miss one or more system clock timer interrupts.
At block 204, upon activating the time virtualization heuristics for VM 104, module 108 can determine a scaling factor that should be applied to RDTSC timestamps returned to guest code within VM 104 (such as, e.g., guest code running within a secure hardware enclave of the VM) based on the activated heuristics. For instance, if the activated heuristics cause module 108 to accelerate the delivery of interrupts to the VM's guest OS by a factor of 2, module 108 may determine that RDTSC timestamps for the VM should be cut in half (subject to an appropriate offset). Generally speaking, the goal of the scaling factor determined at block 204 is to ensure that any hardware-based RDTSC timestamps scaled using this scaling factor will reflect a VM system time that is consistent with the VM's system clock counter, per the accelerated interrupt delivery applied via module 108. Although not explicitly shown in FIG. 2, module 108 can also compute an offset to be used in conjunction with the scaling factor so that the scaled RDTSC timestamps remain continuous with the timestamps observed by VM 104 prior to the activation of the heuristics.
Once time virtualization heuristics module 108 has determined the scaling factor, module 108 can invoke an appropriate CPU instruction for programming this scaling factor into the physical CPU 106 that is mapped to the vCPU of VM 104, in accordance with the physical CPU's TSC scaling feature (block 206). The result of this step is that all RDTSC timestamps generated by that physical CPU from that point onward will be scaled per the scaling factor.
Then, at some future point in time, time virtualization heuristics module 108 can determine that the system clock counter for VM 104 has caught up with real-world time, and thus can deactivate the heuristics previously activated for VM 104 at block 202 (block 208). Finally, in response to deactivating the heuristics, time virtualization heuristics module 108 can also disable the TSC scaling programmed/enabled at block 206, which will cause physical CPU 106 to generate future RDTSC timestamps for VM 104 in an unscaled fashion (block 210), and subsequently return to block 202 in order to repeat the process as needed. Note that throughout the entirety of workflow 200, time virtualization heuristics module 108 does not set the trap bit for trapping RDTSC instruction calls to hypervisor 102. Thus, any invocations of the RDTSC instruction made by guest code within VM 104 will always be handled directly by the CPU hardware of the host system, without the involvement of hypervisor 102.
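The following sketch condenses blocks 202-210 into code. All helper names are hypothetical abstractions over the underlying hardware mechanisms (e.g., Intel VT exposes TSC scaling via a TSC-multiplier control and AMD-V via a TSC ratio register), and the fixed-point multiplier format is assumed for illustration:

```c
#include <stdint.h>

struct vm;  /* opaque per-VM state; hypothetical */

int  vm_missed_ticks(struct vm *vm);  /* interrupts missed while de-scheduled */
void activate_catchup_heuristics(struct vm *vm);
void deactivate_catchup_heuristics(struct vm *vm);
int  clock_caught_up(struct vm *vm);
void cpu_set_tsc_scaling(struct vm *vm, uint64_t mult);
void cpu_clear_tsc_scaling(struct vm *vm);

/* Fixed-point value representing a multiplier of 1.0 (the exact format is
   CPU-specific; a 48-bit fractional part is assumed here). */
#define TSC_MULT_ONE (1ULL << 48)

/* Workflow 200: program hardware TSC scaling so that untrapped RDTSC
   timestamps stay consistent with the accelerated interrupt delivery. */
void tsc_scaling_workflow(struct vm *vm)
{
    if (vm_missed_ticks(vm) > 0) {                   /* block 202 */
        activate_catchup_heuristics(vm);
        /* block 204: interrupts delivered 2x faster => halve the timestamps
           (a real implementation would also apply a continuity offset) */
        cpu_set_tsc_scaling(vm, TSC_MULT_ONE / 2);   /* block 206 */
    } else if (clock_caught_up(vm)) {                /* block 208 */
        deactivate_catchup_heuristics(vm);
        cpu_clear_tsc_scaling(vm);                   /* block 210 */
    }
}
```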
In an alternative set of embodiments, shown as workflow 300 in FIG. 3, hypervisor 102 can defer the programming of TSC scaling until an RDTSC trap actually occurs. Starting with block 302, time virtualization heuristics module 108 can detect the occurrence of an event/scenario that indicates time virtualization heuristics (e.g., accelerated interrupt delivery) should be activated with respect to a given VM 104 and can activate the heuristics accordingly. In addition, at block 304, time virtualization heuristics module 108 can enable a trap bit that causes all RDTSC invocations from VM 104 (such as from, e.g., guest code running within a secure hardware enclave of the VM) to be trapped by hypervisor 102.
Then, at the time an RDTSC exit/trap occurs (in other words, at the time guest code within VM 104 calls the RDTSC instruction and causes a trap/exit to hypervisor 102), RDTSC trap handler 110 can determine a scaling factor that should be applied to RDTSC timestamps returned to the calling guest code based on the activated heuristics (block 306) and can invoke an appropriate CPU instruction for programming this scaling factor into the physical CPU 106 that is mapped to the vCPU of VM 104, in accordance with the physical CPU's TSC scaling feature (block 308). These two steps are substantially similar to blocks 204 and 206 of workflow 200.
At block 310, RDTSC trap handler 110 can determine at what point in the future the TSC scaling should be disabled. In certain embodiments, this can involve communicating with time virtualization heuristics module 108 to determine when VM 104's system clock counter will be fully caught up with real-world time (or stated another way, when the activated heuristics for VM 104 can be deactivated). Upon determining this, RDTSC trap handler 110 can set a timer (as, e.g., a background process/thread within hypervisor 102) for automatically disabling the TSC scaling at that point in time (block 312).
Finally, at block 314, RDTSC trap handler 110 can disable the trap bit for VM 104 previously set at block 304 and workflow 300 can return to block 302 in order to repeat the process as needed.
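A condensed sketch of blocks 302-314, reusing the same hypothetical-helper convention as the earlier sketches:

```c
#include <stdint.h>

struct vm;  /* opaque per-VM state; hypothetical */

void     set_rdtsc_trap(struct vm *vm, int enabled);
uint64_t compute_scaling_multiplier(struct vm *vm);  /* as in block 204 */
void     cpu_set_tsc_scaling(struct vm *vm, uint64_t mult);
uint64_t usecs_until_caught_up(struct vm *vm);       /* queried from module 108 */
void     hv_set_timer(uint64_t delay_us, void (*cb)(struct vm *), struct vm *vm);
void     disable_tsc_scaling_cb(struct vm *vm);      /* timer callback */

/* Workflow 300: TSC scaling is programmed lazily, on the first RDTSC exit,
   after which trapping is turned back off. */
void handle_rdtsc_exit_lazy_scaling(struct vm *vm)
{
    cpu_set_tsc_scaling(vm, compute_scaling_multiplier(vm));  /* blocks 306-308 */
    hv_set_timer(usecs_until_caught_up(vm),                   /* blocks 310-312 */
                 disable_tsc_scaling_cb, vm);
    set_rdtsc_trap(vm, 0);                                    /* block 314 */
}
```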
It should be appreciated that workflows 200 and 300 of FIGS. 2 and 3 are illustrative and that various modifications to these workflows are possible.
Turning now to the heuristics suppression approach, FIG. 4 depicts a workflow 400 that can be executed by hypervisor 102 according to certain embodiments. Starting with block 402, time virtualization heuristics module 108 can detect the occurrence of an event/scenario that indicates time virtualization heuristics (e.g., accelerated interrupt delivery) should be activated with respect to a given VM 104 and can activate the heuristics accordingly. In addition, at block 404, time virtualization heuristics module 108 can enable a trap bit that causes all RDTSC invocations from VM 104 (such as from, e.g., guest code running within a secure hardware enclave of the VM) to be trapped by hypervisor 102.
Then, at the time an RDTSC exit/trap occurs (in other words, at the time guest code within VM 104 calls the RDTSC instruction and causes a trap/exit to hypervisor 102), RDTSC trap handler 110 can instruct time virtualization heuristics module 108 to discard its internal state with respect to VM 104, thereby deactivating/suppressing the heuristics for the VM (block 406). This means that module 108 will no longer attempt to deliver to VM 104 the system clock timer interrupts that the VM had missed while in an inactive/de-scheduled state. RDTSC trap handler 110 can further move forward VM 104's system clock counter to match real-world time (block 408). For example, if the VM's system clock counter is currently set to 5 (i.e., 50 ms, assuming 10 ms per interrupt) but the real-world elapsed time from the point of VM power-on is 90 ms, RDTSC trap handler 110 can move forward the system clock counter to 9. In a particular embodiment, this can be achieved by issuing a remote procedure call (RPC) to an agent running within VM 104 for a forward clock correction. Upon being invoked, the agent can adjust the VM's system clock counter accordingly.
Finally, at block 410, RDTSC trap handler 110 can disable the trap bit for VM 104 previously set at block 404 and workflow 400 can return to block 402 in order to repeat the process as needed.
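A condensed sketch of blocks 402-410, again using hypothetical helper names:

```c
#include <stdint.h>

struct vm;  /* opaque per-VM state; hypothetical */

void     set_rdtsc_trap(struct vm *vm, int enabled);
void     discard_heuristics_state(struct vm *vm);    /* suppresses module 108 */
uint64_t real_world_ticks_since_poweron(struct vm *vm);
void     rpc_forward_clock_correction(struct vm *vm, uint64_t new_counter);

/* Workflow 400: on the first RDTSC exit, suppress the catch-up heuristics and
   jump the VM's system clock counter forward to real-world time, so that
   untrapped hardware RDTSC timestamps are immediately consistent. */
void handle_rdtsc_exit_suppress(struct vm *vm)
{
    discard_heuristics_state(vm);                          /* block 406 */
    /* block 408: e.g., counter 5 (50 ms) -> 9 when 90 ms have really elapsed */
    rpc_forward_clock_correction(vm, real_world_ticks_since_poweron(vm));
    set_rdtsc_trap(vm, 0);                                 /* block 410 */
}
```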
Certain embodiments described herein involve a hardware abstraction layer on top of a host system (i.e., computer). The hardware abstraction layer allows multiple containers to share the hardware resource. These containers, isolated from each other, have at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the containers. In the foregoing embodiments, virtual machines (VMs) are used as an example for the containers and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application runs. It should be noted that these embodiments may also apply to other examples of containers, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers each including an application and its dependencies. Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to only use a defined amount of resources such as CPU, memory and I/O.
Further embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.
Yet further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a general-purpose computer system selectively activated or configured by program code stored in the computer system. In particular, various general-purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any data storage device that can store data which can thereafter be input to a computer system. The non-transitory computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid state disk), a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
In addition, while described virtualization methods have generally assumed that virtual machines present interfaces consistent with a particular hardware system, persons of ordinary skill in the art will recognize that the methods described can be used in conjunction with virtualizations that do not correspond directly to any particular hardware system. Virtualization systems in accordance with the various embodiments, implemented as hosted embodiments, non-hosted embodiments or as embodiments that tend to blur distinctions between the two, are all envisioned. Furthermore, certain virtualization operations can be wholly or partially implemented in hardware.
Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that perform virtualization functions. Plural instances can be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.
As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations and equivalents can be employed without departing from the scope hereof as defined by the claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202041001650 | Jan 2020 | IN | national |