Computing is becoming more data centric, where low-latency access to a very large amount of data is critical. This trend is largely supported by the increasing role of big-data applications in our day-to-day lives. In addition, virtual machines are playing a critical role in server consolidation, security and fault tolerance as substantial computing migrates to shared resources in cloud services. This trend is evident from the increasing support for public, enterprise and private cloud services by various companies. These trends put considerable pressure on the virtual memory system, a layer of abstraction designed to let applications manage physical memory easily.
The virtual memory system translates virtual addresses issued by the application to physical addresses for accessing the stored data. Since the software stack accesses data using virtual addresses, fast address translation is a prerequisite for efficient data-centric computation and for providing the benefits of virtualization to a wide range of applications. But unfortunately, growth in physical memory sizes is exceeding the capabilities of the virtual memory abstraction—paging. Paging has been working well for decades in the old world of scarce physical memory, but falls far short in the new world of gigabyte-to-terabyte memory sizes.
An example method of managing memory in a computer system implementing non-uniform memory access (NUMA) by a plurality of sockets each having a processor component and a memory component is described. The method includes replicating page tables for an application executing on a first socket of the plurality of sockets across each of the plurality of sockets; associating metadata for pages of the memory storing the replicated page tables in each of the plurality of sockets; and updating the replicated page tables using the metadata to locate the pages of the memory that store the replicated page tables.
Further embodiments include a non-transitory computer-readable storage medium comprising instructions that cause a computer system to carry out the above method, as well as a computer system configured to carry out the above method.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements disclosed in one embodiment may be beneficially utilized in other embodiments without specific recitation.
CPU 108 includes one or more cores 112. Each core 112 is a microprocessor or like type processor element. Each core 112 includes cache memory (cache 114) and a memory management unit (MMU) 116, as well as various other circuits that are omitted for clarity (e.g., an arithmetic logic unit (ALU), floating point unit (FPU), etc.). CPU 108 can include other circuitry shared by cores 112 (e.g., additional cache memory), which is omitted for clarity.
MMU 116 implements memory management in the form of paging of system memory 110. MMU 116 controls address translation and access permissions for memory accesses made by core 112. MMU 116 implements a plurality of address translation schemes (also referred to as “translation schemes”) based on privilege level. Each translation scheme generally takes an input address (IA) and, if permitted based on the defined access permissions, returns an output address (OA). If an address translation cannot be performed (e.g., due to violation of the access permissions), MMU 116 generates an exception. MMU 116 is controlled by a plurality of system registers in registers 114. MMU 116 can include a translation lookaside buffer (TLB) 118 that caches address translations.
One type of translation scheme includes a single stage of address translation that receives a virtual address (VA) in a virtual address space and outputs a physical address (PA) in a physical address space. The virtual address space is a flat logical address space managed by software. The physical address space includes the physical memory map that includes system memory 110. Another type of translation scheme includes two stages of address translation. The first stage of address translation receives a VA and outputs an intermediate physical address (IPA) in an intermediate physical address space. The second stage of address translation receives an IPA and outputs a PA. The IPA address space is a flat logical address space managed by software. Two-stage address translation is discussed further below with respect to a virtualized computing system.
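By way of illustration only, the two-stage scheme described above can be sketched as the composition of two single-stage lookups. The sketch assumes a fixed 4 KB page size and models each stage as a flat dictionary from input page number to output page number; the names `TranslationFault`, `translate_stage`, and `translate_two_stage` are hypothetical and do not come from the described embodiments.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class TranslationFault(Exception):
    """Raised when a stage has no mapping for the input address."""


def translate_stage(addr, table):
    """Translate one address through a single stage (page number -> page number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise TranslationFault(hex(addr))
    # The offset within the page is carried through unchanged.
    return table[page] * PAGE_SIZE + offset


def translate_two_stage(va, stage1, stage2):
    """VA -> IPA through stage 1, then IPA -> PA through stage 2."""
    ipa = translate_stage(va, stage1)
    return translate_stage(ipa, stage2)
```

For example, with `stage1 = {0x10: 0x20}` and `stage2 = {0x20: 0x30}`, the virtual address `0x10123` first becomes the intermediate physical address `0x20123` and then the physical address `0x30123`.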
Software platform 104 includes a host operating system (OS) 140 and applications 142. Host OS 140 executes directly on hardware platform 102. Host OS 140 can be any commodity operating system known in the art, such as Linux®, Microsoft Windows®, Mac OS®, or the like. Host OS 140 includes a virtual memory subsystem 144. Virtual memory subsystem 144 comprises program code executable by CPU 108 to manage paging and access to system memory 110 (e.g., on behalf of applications 142).
Virtual memory subsystem 144 divides system memory 110 into pages. A “page” is the smallest unit of memory for which an IA-to-OA mapping can be specified. Each page (also referred to herein as a “memory page”) includes a plurality of separately addressable data words, each of which in turn includes one or more bytes. Each address includes an upper portion that specifies a page and a lower portion that specifies an offset into the page. Each address translation involves translating the upper portion of the IA into an OA. CPU 108 can support one or more page sizes. For example, some processors support 4 kilobyte (KB), 2 megabyte (MB), and 1 gigabyte (GB) page sizes. Other processors may support other page sizes. In addition, the width of the IA can be configurable for each address translation scheme.
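As a minimal sketch of the upper-portion/lower-portion split described above, the following assumes power-of-two page sizes and a flat dictionary from page number to page number. The helper names `split_address` and `translate` are hypothetical.

```python
# Example page sizes mentioned above, in bytes.
PAGE_SIZES = {"4KB": 1 << 12, "2MB": 1 << 21, "1GB": 1 << 30}


def split_address(addr, page_size):
    """Return (page number, offset) for a power-of-two page size."""
    offset_bits = page_size.bit_length() - 1
    return addr >> offset_bits, addr & (page_size - 1)


def translate(addr, page_size, mapping):
    """Translate only the upper portion; the offset passes through unchanged."""
    page, offset = split_address(addr, page_size)
    return mapping[page] * page_size + offset
```

With a 4 KB page, address `0x12345678` splits into page `0x12345` and offset `0x678`; with a 2 MB page, the same address splits into page `0x91` and offset `0x145678`, so a larger page size leaves fewer upper bits to translate.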
Each enabled stage of address translation in a translation scheme uses memory mapped tables referred to as page tables 120. If not cached in TLB 118, a given address translation requires one or more lookups of page tables 120 (referred to as one or more levels of lookup). A page table walk, which can be implemented by the hardware of MMU 116, is the set of lookups required to translate a VA to a PA. Page tables 120 are organized into hierarchies, where each page table hierarchy includes a base table and a plurality of additional tables corresponding to one or more additional levels. For example, some processors specify up to four levels of page tables referred to as level 1 through level 4 tables. The number of levels in a page table hierarchy depends on the page size. Virtual memory subsystem 144 can also maintain page metadata 122 for each page defined by the system. In some embodiments, virtual memory subsystem 144 also maintains shared log data 124, as discussed further below.
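The page table walk described above can be sketched as follows. This is an illustrative model only: it assumes a four-level hierarchy with 9 index bits per level and a 12-bit page offset (a layout used by some 64-bit processors), and it represents each table as a Python dictionary. The names `indices`, `map_page`, and `walk` are hypothetical.

```python
LEVELS = 4        # assumed number of levels
LEVEL_BITS = 9    # assumed index width per level
OFFSET_BITS = 12  # assumed 4 KB pages


def indices(va):
    """Split a VA into per-level table indices (top level first) and a page offset."""
    mask = (1 << LEVEL_BITS) - 1
    idx = [(va >> (OFFSET_BITS + LEVEL_BITS * lvl)) & mask
           for lvl in reversed(range(LEVELS))]
    return idx, va & ((1 << OFFSET_BITS) - 1)


def map_page(root, va, frame):
    """Install a leaf entry for va, creating intermediate tables as needed."""
    idx, _ = indices(va)
    table = root
    for i in idx[:-1]:
        table = table.setdefault(i, {})
    table[idx[-1]] = frame


def walk(root, va):
    """Return (physical address, number of table lookups); KeyError models a fault."""
    idx, offset = indices(va)
    entry = root
    lookups = 0
    for i in idx:
        lookups += 1
        if i not in entry:
            raise KeyError(f"no translation for {va:#x}")
        entry = entry[i]
    return (entry << OFFSET_BITS) | offset, lookups
```

In this model a translation that misses in the TLB always costs four lookups, one per level, which is the cost the later replication techniques aim to keep local.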
When applications have very large working sets, TLB misses can cause performance degradation on native machines. Running such workloads in a virtualized system exacerbates this problem. On big-memory machines running big-memory applications, TLB misses cause many RAM accesses (not just data cache accesses), specifically for the lower levels of the page tables. This makes TLB misses costly for application performance, and performance is much worse for big-memory virtual machines. Big-memory machines typically have a multi-socket NUMA architecture. On many of these machines, memory access latency differs widely depending on which RAM a particular core is accessing. The ratio of latencies has been shown to vary from 2× to sometimes over 10×, depending on the NUMA architecture. The main issue is that if the page tables are allocated in different sockets, a TLB miss may have to traverse multiple sockets to resolve the miss and install a TLB entry. If all the page table entries were in local memory (RAM or caches on the same socket as the core), the TLB miss could be resolved much faster, since it would not be affected by NUMA latency effects. As TLB misses become much more frequent for big-memory applications, this becomes an impediment to attaining their performance potential.
In an embodiment, virtual memory subsystem 144 transparently replicates page tables for each multi-threaded process on all sockets. Thus, each memory component 206 stores data “D” and replicated page tables (e.g., L1-L4). The technique supports huge pages in the replicated page tables. The replication can be enabled and controlled by a user library, such as numactl, that controls which sockets the application can run on and where its data can be allocated. The technique creates a complete replica of the page table on each of the sockets, but each replica points to the same data pages at the leaf level. Without replication, if a core in socket 0 has a TLB miss for data “D” that is local to the socket, it may perform up to four remote accesses to resolve the TLB miss, only to find that the data was local to its socket, and then access the local data. With the page table replication described herein, a core in socket 0 that has a TLB miss for data “D” local to the socket performs up to four local accesses to resolve the TLB miss and then accesses the local data D. Keeping the accesses for all TLB misses local makes applications run faster. This is shown in
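The effect of replication on walk locality can be illustrated with a deliberately simplified model. The sketch assumes that all levels of a given page table copy reside on one socket and that a walk takes four lookups; `PageTable`, `replicate`, and `walk_cost` are hypothetical names, not part of the described embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class PageTable:
    socket: int                                  # socket whose memory holds this copy
    entries: dict = field(default_factory=dict)  # virtual page -> (data frame, data socket)


def replicate(master, num_sockets):
    """Create one complete copy per socket; leaf entries reference the same data frames."""
    return [PageTable(socket=s, entries=dict(master.entries))
            for s in range(num_sockets)]


def walk_cost(table, core_socket, levels=4):
    """Return (local, remote) memory accesses for one page table walk."""
    return (levels, 0) if table.socket == core_socket else (0, levels)
```

In this model, a core on socket 0 walking a single page table held on socket 1 incurs four remote accesses even when the data page itself is on socket 0, whereas walking the socket 0 replica incurs four local accesses and reaches the same data frame.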
A page table is managed by software (OS 140) most of the time and read by the hardware (on a TLB miss). Accessed and dirty bits are usually set atomically by hardware on the first access to the page and the first write to the page, respectively. These two bits of metadata are only set by the hardware and reset by the OS. When page tables are replicated, this metadata needs to be kept coherent (instantly or at least lazily) for correctness. Accessed and dirty bits are used by the OS for system-level operations like swapping. In an embodiment, virtual memory subsystem 144 logically ORs the accessed and dirty bits of all the page table replicas when they are read by the OS. Virtual memory subsystem 144 also sets this metadata in the other replicas if one of the replicas has it set.
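The OR-based handling of accessed and dirty bits described above might be sketched as follows. The bit positions (bit 5 for accessed, bit 6 for dirty) follow the x86 page table entry layout and are used here only as an illustrative assumption; `read_flags` and `propagate_flags` are hypothetical helper names.

```python
ACCESSED = 1 << 5  # assumed bit position (x86 PTE layout)
DIRTY = 1 << 6     # assumed bit position (x86 PTE layout)
FLAG_MASK = ACCESSED | DIRTY


def read_flags(replica_ptes):
    """Logical OR of the accessed/dirty bits across all replicas of one entry."""
    flags = 0
    for pte in replica_ptes:
        flags |= pte & FLAG_MASK
    return flags


def propagate_flags(replica_ptes):
    """Return the replica entries with any set flag copied into every replica."""
    flags = read_flags(replica_ptes)
    return [pte | flags for pte in replica_ptes]
```

If hardware on one socket sets the accessed bit in its replica and hardware on another socket sets the dirty bit in a different replica, the OS observes both bits on read, and after propagation every replica carries both.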
A typical page table update specifies a virtual address and the change to apply to its entry. For example, a new physical page is allocated for a virtual address on a page fault, or the access permission on a page is changed through the mprotect syscall. The main bottleneck in updating all replicas is walking all N page table replicas in an N-socket system. This requires many more memory references, since walking each replica takes up to four memory references on the page fault path or syscall path.
Virtual memory subsystem 144 optimizes the update path by creating a circular linked list of all replicas using page metadata 122. Each replica page points to the next replica page. For example, virtual memory subsystem 144 can use struct page in LINUX, which is allocated for each physical page, to store the pointer to the next replica. Similarly, virtual memory subsystem 144 can use other per-page data structures in other OSes and hypervisors to create such a circular linked list of replicas. With this optimization, the update of all N replicas takes 2N memory references: N for updating the N replicas themselves and N for reading the pointers to the next replica.
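A minimal sketch of the circular linked list of replicas, and of the 2N reference count it yields, is given below. The class `ReplicaPage` stands in for a page table page together with its per-page metadata; it and the functions `link_replicas` and `update_all` are hypothetical illustrations rather than the actual kernel structures.

```python
class ReplicaPage:
    """One page table page on one socket, plus its per-page metadata pointer."""

    def __init__(self, socket):
        self.socket = socket
        self.entries = {}
        self.next_replica = self  # metadata pointer; a lone page points to itself


def link_replicas(pages):
    """Chain the replica pages into a circular list."""
    for cur, nxt in zip(pages, pages[1:] + pages[:1]):
        cur.next_replica = nxt


def update_all(start, index, value):
    """Apply one entry update to every replica; return the memory references used."""
    refs = 0
    page = start
    while True:
        page.entries[index] = value  # one reference to write this replica
        refs += 1
        page = page.next_replica     # one reference to read the next pointer
        refs += 1
        if page is start:
            return refs
```

For N = 4 replicas, `update_all` returns 8 references, compared with up to 16 if each replica were reached by a separate four-level walk. Because the list is circular, the update can start from whichever replica the faulting path already located.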
Virtual memory subsystem 144 updates the page tables and all their replicas at the same time so as not to cause any incoherence in the replicated page tables. This is easily controlled, since the OS updates the page tables and knows of each replica and where it is located. On each update, however, virtual memory subsystem 144 needs to go through the circular linked list and apply the update to the other replicas. Even though the updates are optimized with the circular linked list, they can still have high overheads if they happen frequently enough. In an embodiment, virtual memory subsystem 144 reduces the overheads by staging the updates: it first quickly writes the update to a shared log (the virtual address and the update to the page table) and then applies the updates to all the replicas asynchronously at a later time. This shortens the critical update path by not updating any of the replicas on that path.
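The staged update path described above can be sketched as a shared log that is appended to on the critical path and drained later. This is a single-threaded illustration under stated assumptions: the class name `ReplicatedTables` and its methods are hypothetical, each replica is modeled as a flat dictionary, and the policy for when `flush` runs is left open.

```python
from collections import deque


class ReplicatedTables:
    def __init__(self, num_sockets):
        self.replicas = [dict() for _ in range(num_sockets)]
        self.log = deque()  # shared log of (virtual address, new entry)

    def update(self, va, entry):
        """Critical path: record the update without touching any replica."""
        self.log.append((va, entry))

    def flush(self):
        """Deferred path: apply logged updates, in order, to every replica."""
        applied = 0
        while self.log:
            va, entry = self.log.popleft()
            for replica in self.replicas:
                replica[va] = entry
            applied += 1
        return applied
```

The log is drained in arrival order, so two updates to the same virtual address leave every replica holding the later one. Until `flush` runs, the replicas do not reflect the logged updates, which is why a real design must decide which operations can tolerate the delay.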
In an embodiment, virtual memory subsystem 144 implements two exceptions for the update process shown in
With a single page table mapping one virtual address to one physical address, it is not possible to replicate read-only data easily without changes to the page tables. With replicated page tables, however, one virtual address can be mapped to different physical addresses depending on which replica page table is being used. In this case, the replica page tables no longer map the same data pages; they diverge for the read-only pages that are themselves replicated. In an embodiment, virtual memory subsystem 144 performs the replication of data pages only for read-only pages, since there are no updates to those data pages.
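One way to picture replica page tables that diverge only for read-only pages is the sketch below. It assumes a caller-supplied allocator, here the hypothetical parameter `alloc_local_copy(frame, socket)`, that returns the frame of a socket-local copy of a read-only page; `build_replicas` is likewise a hypothetical name.

```python
def build_replicas(mappings, num_sockets, alloc_local_copy):
    """Build one page table per socket.

    mappings: virtual page -> (frame, writable)
    Writable pages keep the shared frame in every replica; read-only pages
    are pointed at a socket-local copy.
    """
    replicas = []
    for socket in range(num_sockets):
        table = {}
        for vpn, (frame, writable) in mappings.items():
            local = frame if writable else alloc_local_copy(frame, socket)
            table[vpn] = (local, writable)
        replicas.append(table)
    return replicas
```

With two sockets, a writable page maps to the same frame in both replicas, while a read-only page maps to a different frame in each, so reads from either socket stay local.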
Each VM 532 supported by hypervisor 530 includes guest software (also referred to as guest code) that runs on the virtualized resources supported by hardware platform 106. In the example shown, the guest software of each VM 532 includes a guest OS 534 and one or more applications (apps) 536. Guest OS 534 can be any commodity operating system known in the art, such as Linux®, Microsoft Windows®, Mac OS®, or the like.
Hypervisor 530 includes, among other components, a kernel 540, a virtual memory subsystem 544, and virtual machine monitors (VMMs) 546. Kernel 540 provides operating system functionality (e.g., process creation and control, file system, process threads, etc.), as well as CPU scheduling and memory scheduling. VMMs 546 implement the virtual system support needed to coordinate operations between hypervisor 530 and VMs 532. Each VMM 546 manages a corresponding virtual hardware platform that includes emulated hardware, such as virtual CPUs (vCPUs 548) and guest physical memory. vCPUs 548 are backed by cores 112. Guest physical memory is backed by system memory 110. Each virtual hardware platform supports the installation of guest software in a corresponding VM 532.
In virtualized computing system 500, guest software in a VM 532 can access memory using the two-stage address translation scheme. In this context, a virtual address is referred to as a “guest virtual address” or GVA. An intermediate physical address is referred to as a “guest physical address” or GPA. A physical address is referred to as a “host physical address” or HPA.
Virtualization brings two layers of page tables to translate addresses. In embodiments, there are two ways of handling the two levels of address translation: (1) replicate both levels of page tables on different sockets while keeping exactly the same mappings at both levels. This option requires the guest OS to know the NUMA architecture; (2) intelligently map the guest page tables differently on different sockets. This option is designed to be transparent to the guest OS. There are two ways of achieving this intelligent mapping: (a) The host page table maps the guest page table pages to different host physical pages by using the replicated nested page tables. In general, multiple gPA→hPA mappings can be created for just the guest page table pages and replicated on all sockets. For this option, the hypervisor needs to track the updates to the guest page tables so that they are correctly applied to all replicas. This can be done by marking the guest page tables as read-only so that the hypervisor is interrupted on updates to the guest page tables (a concept borrowed from shadow paging). (b) Two-level page tables have longer TLB misses; the length of TLB misses can be reduced by using shadow paging. The shadow page table can be replicated on multiple sockets instead of replicating any part of the two-level page tables. Note that all the optimizations discussed above for computing system 100 can be applied to replication of two-level page tables to improve performance.
Full replication of both page tables is enabled by modification to guest OS 534 and hypervisor 530. This option works with full virtualization. Paravirtualization can help improve performance further. In addition, the NUMA architecture has to be exposed to guest OS 534, so the guest OS 534 can perform replication of guest page tables independent of the nested page tables. With this option, note that replication is independent in guest OS 534 and hypervisor 530: either can decide not to replicate at the cost of performance. For example, guest OS 534 may decide not to replicate the guest page tables, but hypervisor 530 can decide to replicate the page tables. This example shows an option that will lead to higher performance than no replication at all, but lower performance than both levels replicated. Keeping these options in mind, replication of both levels is discussed below.
Intelligently replicating guest page tables by using nested page tables, without the guest OS 534 having any knowledge of the replication, requires hypervisor 530 to distinguish between guest page table pages and data pages in gPA. Due to this constraint, there are a few design options that have different tradeoffs.
To distinguish between guest page table pages and data pages in gPA, the well-trusted solution is marking guest page tables as read-only. This technique can be leveraged to replicate the guest page table pages on multiple sockets. The technique can map the gPA of a guest page table page to different hPAs in the nested page tables, thus having the effect of replicated guest page tables.
Thus, the hypervisor 530 write-protects the guest page table pages, as done in shadow paging, and guest OS 534 interrupts the hypervisor 530 on an update to the guest page table. Note that the hypervisor 530 differentiates between guest page table pages and data pages in the guest physical address space. In each replica of the nested page table, the gPA→hPA mapping for a guest page table page maps to a different host physical page, whereas the mapping for a data page maps to the same host physical page. This is depicted in
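The nested page table arrangement described above can be sketched as follows. The sketch models each nested page table replica as a dictionary from gPA page to hPA page and host memory as a dictionary of page contents; `build_nested_replicas`, `on_guest_pt_write`, and the allocator parameter `alloc_on_socket(hpa, socket)` are hypothetical names introduced only for illustration.

```python
def build_nested_replicas(gpa_to_hpa, guest_pt_pages, num_sockets, alloc_on_socket):
    """Build one nested page table per socket.

    Guest page table pages get a distinct host page on each socket;
    guest data pages keep the same host page in every replica.
    """
    replicas = []
    for socket in range(num_sockets):
        nested = {}
        for gpa, hpa in gpa_to_hpa.items():
            if gpa in guest_pt_pages:
                nested[gpa] = alloc_on_socket(hpa, socket)
            else:
                nested[gpa] = hpa
        replicas.append(nested)
    return replicas


def on_guest_pt_write(replicas, host_memory, gpa, index, value):
    """Write-protection trap handler: apply the guest's write to every socket's copy."""
    for nested in replicas:
        host_memory.setdefault(nested[gpa], {})[index] = value
```

Because the guest page table pages are write-protected, each guest update traps to the handler, which keeps all per-socket copies identical while the guest continues to see a single gPA.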
In the previous section, hypervisor intervention was implemented by marking guest page tables as read-only (write-protected) to replicate guest page tables transparently. The same mechanism is used by shadow paging. In some embodiments, software shadow paging is performed instead of hardware nested paging. Since the cost of hypervisor intervention is already being paid, replicated shadow paging can be used to reduce the TLB miss overheads as well.
Tracking updates to guest page tables by marking them read-only (write-protected) is a costly operation, since each VMM intervention on a page table update costs thousands of cycles. Some of these overheads can be reduced by using binary translation or paravirtualization. If the guest OS specifies which pages are guest page table pages and sends updates to the guest page tables to the VMM with a hypercall, the overheads can be reduced. If the VMM is given the information it needs by the guest OS, instead of having to build mechanisms to derive that information itself, it can be more efficient.
Recently, there has been an increasing interest in supporting nested virtualization—a VMM can run various VMMs, which can run their own VMs. This approach comes with its own set of benefits: (i) public cloud providers can give consumers full control of their VMs by making them run on their own favorite VMM, (ii) increase security and integrity of VMs on public clouds even when the VMM provided by service provider is compromised, (iii) enable new features used by VMMs which the service providers are slow to adopt (like VM migration), and (iv) help debug, build and benchmark new VMMs. Vendors like VMware have been supporting nested virtualization by using software techniques.
Hardware advancements such as VMCS shadowing are being included in recent processors (e.g., Intel Haswell) to perform efficient nested virtualization. Unfortunately, these benefits of nested virtualization come at the cost of increased memory virtualization overheads because of the additional levels of abstraction. This new interest in nested virtualization motivates a scalable solution for memory virtualization in the presence of more levels of virtualization. The same techniques described above can be used to improve performance of nested virtualized systems running on NUMA machines.
I/O page tables are used by devices and accelerators to access system physical memory. In the presence of multiple sockets, devices are connected directly to one socket or are shared between multiple sockets. When a device is shared between multiple sockets, it makes sense to replicate the I/O page tables on the sockets that share the device. This way, the device can access the local physical memory by using the closest page table. In addition, the data can be prefetched based on the location pointed to by the I/O page table entry.
The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities—usually, though not necessarily, these quantities may take the form of electrical or magnetic signals, where they or representations of them are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms, such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system—computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Discs)—CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.
Virtualization systems in accordance with the various embodiments may be implemented as hosted embodiments, non-hosted embodiments, or as embodiments that tend to blur distinctions between the two; all are envisioned. Furthermore, various virtualization operations may be wholly or partially implemented in hardware. For example, a hardware implementation may employ a look-up table for modification of storage access requests to secure non-disk data.
Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts to share the hardware resource. In one embodiment, these contexts are isolated from each other, each having at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the contexts. In the foregoing embodiments, virtual machines are used as an example for the contexts and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application runs. It should be noted that these embodiments may also apply to other examples of contexts, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers each including an application and its dependencies. Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to only use a defined amount of resources such as CPU, memory and I/O.
The term “virtualized computing instance” as used herein is meant to encompass both VMs and OS-less containers.
Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that performs virtualization functions. Plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/744,990, filed Oct. 12, 2018, which is incorporated herein by reference.
Number | Date | Country
---|---|---
62744990 | Oct 2018 | US