Field of the Invention
The present invention relates to a processor that comprises an instruction set and registers, that is adapted to run virtual machine monitor (VMM) software for hosting multiple guest operating systems, and that implements an instruction translation lookaside buffer (ITLB) with ITLB-entries and a data translation lookaside buffer (DTLB) with DTLB-entries.
Virtual machine monitor (VMM) software that runs multiple guest operating systems concurrently commonly emulates privileged instructions so that each guest operating system behaves as if it were running on the original hardware environment. The VMM has to deal with many processor (CPU) specification and implementation details. One common problem arises from the fact that all modern CPUs have separate, optimized memory access paths for the instruction fetch read operation and for data read/write access. This separation also involves separate memory management units (MMUs) for virtual addressing.
The problem is that the VMM in general cannot access the opcode of a faulting instruction in a straightforward way. The virtual address of the faulting instruction is mapped by an instruction translation lookaside buffer (ITLB) entry, whereas the VMM would need the address to be mapped by a data translation lookaside buffer (DTLB) entry, because the VMM has to fetch the instruction's opcode with a load instruction, and a load instruction uses the data memory path.
To visualize the aforesaid problem, reference is made to the appended
In known processor types, a VMM can as a rule work around the problem by resolving the faulting guest virtual address with the steps shown in
This flow diagram shows the steps necessary to obtain the opcode of a faulting instruction that occurs in a guest operating system. Starting from the virtual address of said instruction, the corresponding entry in the virtualized ITLB is searched in step 10. This search yields the guest physical address of the instruction. In step 20 this guest physical address is translated into the host physical address. Using this host physical address, the instruction bundle is loaded from memory in physical addressing mode in step 30. The instruction bundle contains the opcode of the instruction, which is extracted from the instruction bundle in step 40.
Finally, this opcode is analyzed and the corresponding instruction is emulated by the VMM in step 50.
As an example in the Intel IA-64 architecture a corresponding (simplified) code fragment is listed in the following Table 1:
This code sequence has to be performed once for every attempt to emulate an instruction. Since the SEARCH_ITLB routine typically executes a loop to find the corresponding entry inside the virtualized ITLB table, and the GUEST_TO_HOST routine typically consists of a multilevel page table lookup, the whole code sequence is rather time-consuming. A considerable part of the performance overhead that a VMM incurs compared to an operating system running “on bare metal” results from running through that code sequence every time an instruction is emulated.
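The cost of steps 10 to 40 can be illustrated with a small software model. The following Python sketch is not the listing of Table 1; it is a simplified illustration under stated assumptions: 4 KiB pages, a two-level guest-to-host page table held in nested dictionaries, and memory modeled as a dictionary of 128-bit bundles keyed by 16-byte-aligned address. All function and type names are hypothetical. The bundle layout (5-bit template followed by three 41-bit slots) follows the IA-64 bundle format.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
SLOT_MASK = (1 << 41) - 1            # an IA-64 instruction slot is 41 bits wide


@dataclass
class TlbEntry:
    vpn: int   # guest virtual page number
    ppn: int   # guest physical page number


def search_itlb(virtual_itlb, guest_va):
    """Step 10: linear scan of the virtualized ITLB (the costly loop)."""
    vpn = guest_va >> PAGE_SHIFT
    for entry in virtual_itlb:
        if entry.vpn == vpn:
            return (entry.ppn << PAGE_SHIFT) | (guest_va & PAGE_MASK)
    raise LookupError("no ITLB entry for guest virtual address")


def guest_to_host(page_table, guest_pa, levels=2, bits=9):
    """Step 20: multilevel page table walk, guest physical to host physical."""
    gpn = guest_pa >> PAGE_SHIFT
    node = page_table
    for level in reversed(range(levels)):
        index = (gpn >> (level * bits)) & ((1 << bits) - 1)
        node = node[index]           # after the last level: host page number
    return (node << PAGE_SHIFT) | (guest_pa & PAGE_MASK)


def fetch_opcode_baseline(virtual_itlb, page_table, memory, guest_va, slot):
    guest_pa = search_itlb(virtual_itlb, guest_va)       # step 10
    host_pa = guest_to_host(page_table, guest_pa)        # step 20
    bundle = memory[host_pa & ~0xF]                      # step 30: physical load
    return (bundle >> (5 + 41 * slot)) & SLOT_MASK       # step 40: extract slot
```

Both the loop in `search_itlb` and the walk in `guest_to_host` run on every emulated instruction, which is the overhead described above.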
It is an object of the invention to provide instructions and registers, respectively, with the help of which the performance of the processor is drastically enhanced when handling multiple guest operating systems in a virtualized environment.
The common concept of the invention is to provide shortcuts to the code sequence shown above, thereby reducing the time needed to perform the necessary steps.
In a first aspect of the invention, the above object is met by a processor wherein the instruction set comprises an instruction providing for translation of a virtual address to a physical address based exclusively on ITLB-entries. To ensure that these ITLB-entries are consistent, the processor prohibits flushing of the ITLB-entries in case of a program interrupt. With this instruction, the opcode can be loaded in physical mode.
According to a second aspect of the invention, a processor is provided in which the registers comprise a separate physical address interruption control register storing the physical address of a faulting guest instruction. With this additional control register, the opcode can be loaded in physical mode without any translation from the guest physical address to the host physical address.
According to a third aspect of the invention the instruction set of a processor comprises a load instruction using the instruction translation lookaside buffer ITLB rather than the DTLB for a translation of an address of a faulting guest instruction. This means that the load of the instruction bundle 3 indicated in
According to a fourth aspect of the invention the registers of the processor comprise at least one instruction bundle interruption control register storing an instruction bundle of a faulting guest instruction. It is an advantage of this embodiment of the invention that the opcode can be extracted directly from the instruction bundle held in the control register.
Finally, in another aspect of the invention the registers of the processor comprise an opcode interruption control register storing directly the opcode of a faulting guest instruction.
The disclosure of
In
Using the first aspect of the invention, the virtual address of a faulting instruction is handled by the novel instruction, which in step 21 translates the guest virtual address to the host physical address of the instruction exclusively using ITLB-entries. After that, steps 30, 40 and 50, already explained above, are performed to obtain the opcode of the faulting instruction and to emulate it.
A corresponding (simplified) code sequence in the Intel IA-64 architecture is listed in the following Table 2 assuming that above cited novel instruction has the mnemonic tpa.i ra=rb:
This code sequence eliminates the two time-expensive routines SEARCH_ITLB and GUEST_TO_HOST and thus avoids both the time-consuming search loop and the multilevel page table lookup.
According to the second aspect of the invention, an interruption control register 31 is provided for storing the physical address of the faulting guest instruction. This means that the translation work of step 21 is avoided and the host physical address of the instruction can be read directly from this control register 31.
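The effect of the novel translation instruction can be sketched in the same kind of software model. This Python sketch is not the listing of Table 2. It assumes that the hardware ITLB already maps the guest virtual page directly to a host physical page, models that ITLB as a dictionary, and uses the hypothetical function `tpa_i` to stand in for the instruction with the mnemonic tpa.i; page size and memory model are likewise assumptions.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
SLOT_MASK = (1 << 41) - 1            # an IA-64 instruction slot is 41 bits wide


def tpa_i(hardware_itlb, guest_va):
    """Step 21: model of tpa.i, one ITLB lookup from virtual to host physical."""
    host_ppn = hardware_itlb[guest_va >> PAGE_SHIFT]
    return (host_ppn << PAGE_SHIFT) | (guest_va & PAGE_MASK)


def fetch_opcode_with_tpa(hardware_itlb, memory, guest_va, slot):
    host_pa = tpa_i(hardware_itlb, guest_va)         # step 21 replaces steps 10 and 20
    bundle = memory[host_pa & ~0xF]                  # step 30: physical load
    return (bundle >> (5 + 41 * slot)) & SLOT_MASK   # step 40: extract slot
```

In this model no software loop and no page table walk remain; the only lookup is the one performed by the instruction itself.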
The (simplified) code sequence in the Intel IA-64 architecture is listed in the following Table 3 assuming that the new control register 31 is named cr.piip:
Again the code sequence is more compact compared to the above examples.
According to the third aspect of the invention the special load instruction ldsz.i is used in step 41 using the instruction translation lookaside buffer ITLB for a translation of a faulting guest instruction. This means that in the flow chart of
The (simplified) code sequence in the Intel IA-64 architecture of the steps 42, 40, 50 is listed in the following Table 4 assuming that the aforesaid load instruction has the mnemonic ldsz.i ra=[rb]:
With the new load instruction, the code fragment is even smaller than those in the previous tables.
According to the fourth aspect of the invention, the procedure to obtain the opcode of a faulting instruction can be shortened even further by the instruction bundle interruption control register 45, which holds the instruction bundle of a faulting guest instruction including the opcode of the faulting instruction. With the help of this register the opcode can be extracted from the instruction bundle as is depicted in step 40 of
In the Intel IA-64 architecture the corresponding (simplified) code sequence is listed in the following Table 5, assuming that the aforesaid new register 45 is named cr.iib:
This code is close to the theoretical optimum and can be executed in very few cycles.
According to the last aspect of the invention, an opcode interruption control register 51 is used to hold the instruction opcode itself. Thus the opcode of a faulting guest instruction can be read directly from this register 51 and analyzed, and the corresponding instruction can be emulated (step 50).
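With the bundle already held in a register, only the extraction of step 40 remains. The following Python sketch is not the listing of Table 5. It assumes that the 128-bit bundle is exposed as two 64-bit halves, here given the hypothetical names `iib_lo` and `iib_hi`, and that the slot number of the faulting instruction is known to the handler. The bit positions follow the IA-64 bundle format: a 5-bit template, then three 41-bit slots, with the major opcode in the top four bits of each slot.

```python
TEMPLATE_BITS = 5                    # IA-64 bundle: 5-bit template field
SLOT_BITS = 41                       # followed by three 41-bit instruction slots
SLOT_MASK = (1 << SLOT_BITS) - 1


def extract_slot(iib_lo, iib_hi, slot):
    """Step 40: pick one 41-bit instruction slot out of the 128-bit bundle."""
    bundle = (iib_hi << 64) | iib_lo
    return (bundle >> (TEMPLATE_BITS + SLOT_BITS * slot)) & SLOT_MASK


def major_opcode(slot_bits):
    """Bits 40..37 of a slot hold the major opcode."""
    return slot_bits >> 37
```

The extraction reduces to one shift and one mask, with no memory access at all.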
The code sequence in the Intel IA-64 architecture then is extremely short as can be seen from the following Table 6 (the new register 51 is named cr.iop):
This code is evidently optimal, since only a single instruction remains.
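How the interruption control registers of the second, fourth and last aspects relate to one another can be sketched as a software model of the state the hardware would record at fault time. This is an illustration only, not the listing of Table 6: the class, the field names (modeled on cr.piip, cr.iib and cr.iop), the function `latch_on_fault` and the dispatch on the major opcode in `handle_fault` are all assumptions made for the example.

```python
from dataclasses import dataclass

SLOT_MASK = (1 << 41) - 1            # an IA-64 instruction slot is 41 bits wide


@dataclass
class InterruptionRegisters:
    piip: int = 0   # host physical address of the faulting bundle (second aspect)
    iib: int = 0    # 128-bit faulting instruction bundle (fourth aspect)
    iop: int = 0    # opcode of the faulting instruction (last aspect)


def latch_on_fault(cr, host_pa, bundle, slot):
    """Model of what the hardware would record when a guest instruction faults."""
    cr.piip = host_pa
    cr.iib = bundle
    cr.iop = (bundle >> (5 + 41 * slot)) & SLOT_MASK


def handle_fault(cr, emulators):
    """VMM handler: one register read replaces steps 10 to 40."""
    opcode = cr.iop
    # Step 50: analyze and emulate; here dispatched on the major opcode (assumption).
    return emulators[opcode >> 37](opcode)
```

In this model the handler touches neither the virtualized ITLB nor memory; all work before step 50 has moved into the fault-time latch.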