Embodiments of the invention relate generally to management of a copy-on-write (CoW) fault.
Many commercially available operating systems (OS) use copy-on-write as a method to achieve optimization in operations. Copy-on-Write (CoW) is used in a fork operation, where the operating system (OS) creates a replica of a process (i.e., a running instance of an application). The original process requesting the fork( ) operation is the parent process and the newly created process is the child process. The child process expects to have a copy of the contents of parent's address space at the time of fork. As known to those skilled in the art, the copy-on-write in a fork( ) operation applies only to a process' private memory pages. Copy-on-write is an optimization that causes physical memory pages of the parent process to be shared with the child process for memory read operations. These shared pages are marked by the OS as copy-on-write. A page that is marked copy-on-write will remain as a shared page to the parent process and child process even if both processes perform a read operation on the shared page. In an alternative implementation in, for example, the HP-UX operating system, a shared page will be marked as copy-on-write for the parent process and copy-on-access for the child process.
However, when either the parent process or the child process writes to a shared page that is marked copy-on-write, a page fault exception (i.e., copy-on-write fault) occurs, where the process that is performing the write operation is given a copy of the page to be written. Copying of metadata of the shared page will occur at the time of the fork( ) operation. At the time of the CoW fault, actual data are copied from the shared page. After a process writes to that copied page, that page will remain visible to that process but will not be visible to other processes until there is another instance of an event such as a fork( ) system call and the new page is marked as copy-on-write once again. The use of copy-on-write permits a very efficient fork operation because copying all pages of the parent process onto the address space of the child process is avoided by use of the shared pages.
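The copy-on-write behavior described above can be illustrated with a minimal sketch. The model below is not taken from any operating system; the names Page, Process, fork, read, and write are hypothetical, and the sketch assumes that each page carries a count of the processes sharing it and that a write to a non-writable page stands in for the copy-on-write fault.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A physical page: its contents plus a count of processes sharing it."""
    data: bytearray
    usecount: int = 1


@dataclass
class Process:
    """Hypothetical process model: virtual page number -> [page, writable]."""
    pages: dict = field(default_factory=dict)

    def fork(self):
        """Share every private page with the child and mark it copy-on-write."""
        child = Process()
        for vpn, entry in self.pages.items():
            page = entry[0]
            entry[1] = False              # parent loses write access
            page.usecount += 1            # only metadata changes at fork time
            child.pages[vpn] = [page, False]
        return child

    def read(self, vpn, offset):
        """Reads never break sharing."""
        return self.pages[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.pages[vpn]
        page, writable = entry
        if not writable:                  # models the copy-on-write fault
            if page.usecount > 1:
                page.usecount -= 1
                page = Page(bytearray(page.data))  # data copied only now
            entry[0] = page
            entry[1] = True
        page.data[offset] = value


parent = Process({0: [Page(bytearray(b"abcd")), True]})
child = parent.fork()
shared_before_write = parent.pages[0][0] is child.pages[0][0]
parent.write(0, 0, ord("X"))
```

In this sketch the parent and child share one page until the parent writes, at which point only the parent receives a private copy and the child continues to see the original contents.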
A Translation Lookaside Buffer (TLB) is a cache in a processor and is used to improve the speed of translations of virtual addresses to physical addresses. A TLB contains a list that translates the virtual addresses into physical addresses for the pages. When a page is copied, a temporary translation kernel virtual address is required to be used and to be pointed to the source page that will be copied. A temporary kernel translation is used for the source page only. This is always needed when the parent process is the process that takes the CoW fault. When the parent process writes to the CoW page, the existing read-only translation to the source page will need to be removed since the parent process is to be pointed to the new page. Therefore, a new kernel translation for the source page is needed to make the copy to the new page. When the child process takes the CoW fault, the source page may have the parent process's read-only translation. In that case, this read-only translation is used for the source page and there is no need to create a new kernel translation. But if parent process's translation to the source page does not exist for some reason, then a new kernel translation for the source page will need to be created. After the page is copied, a global purge of this temporary translation kernel virtual address is required to be performed. A hardware walker will place this temporary address in all TLBs in other processors. This global purge will remove this temporary translation kernel virtual address from all TLBs in the system, since this temporary address is now a stale translation that can cause data corruption. However, this global purge requires the processors to contend for a global spinlock. A global spinlock for the global TLB purge is required on the Intel Itanium Platform Family (IPF) architecture. On a machine with many processors, this spinlock contention can reduce the application performance speed.
A local purge of the temporary translation kernel virtual address may instead be performed, where the temporary address is removed from only the TLB of the processor that is involved in the fork operation. However, a local purge would not purge this temporary address that may have been stored in other TLBs in other processors. The use of this temporary address in a subsequent fork operation can cause data corruption.
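The staleness problem can be sketched as follows. This is a simplified model under stated assumptions, not a description of any particular processor: the name TEMP_KVA, the Cpu class, and the function stale_after_reuse are hypothetical, each TLB is modeled as a dictionary, and the hardware walker is modeled as filling a TLB entry from a shared page table on a miss.

```python
TEMP_KVA = "KERNEL.TEMP"  # hypothetical name for the reused temporary address


class Cpu:
    """A processor with its own TLB, backed by a shared kernel page table."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.tlb = {}

    def translate(self, va):
        if va not in self.tlb:            # TLB miss: the walker fills the entry
            self.tlb[va] = self.page_table[va]
        return self.tlb[va]

    def purge(self, va):
        self.tlb.pop(va, None)


def stale_after_reuse(global_purge):
    """Return the page CPU 1 sees through TEMP_KVA during a second copy."""
    page_table = {}
    cpus = [Cpu(page_table), Cpu(page_table)]

    # First copy: TEMP_KVA points to PFN1 and ends up cached on both CPUs.
    page_table[TEMP_KVA] = "PFN1"
    for cpu in cpus:
        cpu.translate(TEMP_KVA)

    # Purge after the copy: either every TLB, or only the local one (CPU 0).
    for cpu in (cpus if global_purge else cpus[:1]):
        cpu.purge(TEMP_KVA)

    # Second copy reuses the same temporary address for a different page.
    page_table[TEMP_KVA] = "PFN2"
    return cpus[1].translate(TEMP_KVA)


seen_with_local_purge = stale_after_reuse(global_purge=False)
seen_with_global_purge = stale_after_reuse(global_purge=True)
```

With only a local purge, the second processor in this model still resolves the temporary address to the first source page, which is the stale translation that can lead to data corruption. The global purge avoids this, at the cost of the contention discussed above.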
Therefore, the current technology is limited in its capabilities and suffers from at least the above constraints and deficiencies.
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
In the description herein, numerous specific details are provided, such as examples of components and/or methods, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that an embodiment of the invention can be practiced without one or more of the specific details, or with other apparatus, systems, methods, components, materials, parts, and/or the like. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of embodiments of the invention.
The OS 125, memory management unit 126, and processors 110 can provide a standard virtual memory subsystem, where memory space (virtual memory) 140 on a hard disk 145 is swapped with memory space (e.g., RAM) in the system memory 120 to provide increased memory storage for applications.
In
The process P1 has a private data segment 205 which is mapped to a virtual memory page 206 by the space/offset tuple 210 which includes the space identifier (space ID) “SP1” and offset “VA1”. A tuple can have other space ID and offset names, depending on the virtual memory address that is identified by the tuple. The tuple 210 maps the private data segment 205 to the virtual memory page 206 with the virtual memory address, SP1.VA1. In actual implementations, the virtual memory addresses (e.g., SP1.VA1) and physical memory addresses (e.g., PFN1) are typically bit values. However, for purposes of clarity in the discussions below, these addresses are referred herein by use of particular example names (e.g., SP1.VA1 and PFN1).
The private data segment 205 has a descriptor 215 which will map the virtual memory page 206 (containing the segment 205) to the physical memory page 202. The tuples (e.g., tuple 210) and descriptors (e.g., descriptor 215) are typically stored as metadata in the system memory 120. Typically, the descriptor 215 is a virtual frame descriptor VFD. As known to those skilled in the art, a VFD can be set with a VALID flag which indicates that data (e.g., private data segment 205), that is mapped to a virtual memory page, is currently in the system memory 120 (
A translation 220 maps the virtual memory page 206 (as well as the segment 205) at virtual memory address SP1.VA1 to the physical memory page 202 at physical memory address PFN1. The translations that are shown in
The physical page 202 is memory space in the system memory 120. The physical page 202 has a page frame use count (pf usecount) 230 which indicates the number of processes that can access the physical page 202. In
In
After the fork( ) system call 235, for process P1, the VALID flag in the VFD descriptor remains set for the private data segment 205 in parent process P1. Therefore, this VALID flag means that the private data segment 205 in the virtual memory page 206 is currently in the system memory 120 (
After the fork( ) system call 235, a translation 245 still maps the virtual memory page 206 (at virtual memory address SP1.VA1) to the physical memory page 202 (at physical memory address PFN1) as shown by SP1.VA1-PFN1. The access rights attributes 250 indicate “translated read only” which means that the process P1 can only read from the physical page 202, and a write attempt by process P1 to the page 202 will generate a copy-on-write fault. Note also that the VFD descriptor 215 now has the CW flag set, which indicates that physical page 202 has been set to copy-on-write.
Also, after the fork( ) system call 235, the usecount 232 for physical page 202 is set (incremented) to “2” because the parent process P1 and child process P2 can now access the physical page 202.
The child process P2 has the same bits (VALID, CW, PFN1) set in the VFD descriptor 255 for the private data segment 240 in virtual memory page 265. The tuple 260 indicates that the private data segment 240 is allocated to a different virtual memory page 265 at virtual memory address SP2.VA1, where “SP2” is the space ID and “VA1” is the same VA1 offset for the private data segment 205 of process P1. The VFD descriptor 255 points the virtual memory page 265 to the physical page 202 because of the PFN1 value in the VFD descriptor 255. However, the child process P2 does not have a translation (as symbolized by the no translation block 270) for mapping the virtual memory address SP2.VA1 to the physical address PFN1 because the child process P2 has not yet attempted to access the physical page 202. The translation for mapping the virtual memory address to the physical address is not created until a process actually requests access to a physical memory page.
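The state of the metadata after the fork( ) system call can be summarized in a minimal sketch. The Vfd class, the dictionaries, and the fork function below are hypothetical illustrations that use the example names from the description (SP1.VA1, SP2.VA1, PFN1); they are not the data structures of an actual operating system.

```python
from dataclasses import dataclass


@dataclass
class Vfd:
    """Hypothetical virtual frame descriptor with the flags named above."""
    valid: bool
    cw: bool
    pfn: str


# State before the fork: one process, one writable translation.
usecount = {"PFN1": 1}
vfds = {"SP1.VA1": Vfd(valid=True, cw=False, pfn="PFN1")}
translations = {"SP1.VA1": ("PFN1", "read-write")}


def fork(parent_va, child_va):
    """Update only metadata; no page contents are copied."""
    parent = vfds[parent_va]
    parent.cw = True                                  # page is now copy-on-write
    vfds[child_va] = Vfd(valid=parent.valid, cw=True, pfn=parent.pfn)
    usecount[parent.pfn] += 1                         # two processes share PFN1
    translations[parent_va] = (parent.pfn, "read-only")
    # No translation is created for the child until it accesses the page.


fork("SP1.VA1", "SP2.VA1")
```

After the call, both descriptors carry VALID, CW, and PFN1, the use count is 2, the parent's translation is read-only, and the child has no translation yet.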
As discussed above, physical page 202 is currently set to copy-on-write (CoW) after the fork( ) system call 235 is completed. The above-discussed virtual memory subsystem currently uses CoW for the parent process P1. Therefore, if the parent process P1 attempts a write access 275 to the physical page 202, CoW will be “broken” for page 202, and parent process P1 will obtain a copy of the original page 202. Assume that the parent process P1 attempts to write 275 to the physical page 202 that is mapped by the virtual memory address SP1.VA1. Since the physical page 202 has a read-only translation 245, a data access rights violation (copy-on-write fault) will occur in response to the write access attempt by the parent process P1. The copy-on-write fault is represented by block 280 for convenience. This copy-on-write fault 280 is detected by a CoW code path that is implemented in a hdl_cwfault routine in the OS 125. As shown in
An embodiment of the invention advantageously eliminates the need to perform a purge of stale temporary translations that may be in all TLBs 115 in the system 100 (
In
The pgcopy( ) routine will then copy the contents of PFN1 over to PFN2. In previous methods, a temporary translation maps a temporary virtual address (e.g., KERNELSPACE.KVADDR) to PFN1 and this temporary translation is then globally purged to remove this temporary translation from all TLBs in all processors. This temporary translation requires the purging from all TLBs because this translation can map the same virtual address KERNELSPACE.KVADDR to multiple physical pages if additional pgcopy( ) routines are subsequently performed. Therefore, this temporary translation can map the same virtual address to different physical pages, among the different TLBs, and can result in the staleness problem that was previously discussed above. As also discussed above, this global purge results in a spinlock contention that can reduce the speed of application performance. As discussed in detail below, the use of the space/offset tuple 285 advantageously eliminates the need to perform this global purge when copy-on-write has been completed.
When copy-on-write has been completed, the SP1.VA1 tuple 210 gives the process P1 its original read and write access rights, as shown by the attributes 291 in
In
In
Reference is made to
The space/offset tuple 285 provides a unique spaceID and offset that will always point to the physical address PFN1. Since a physical memory address is unique for each physical page, each tuple 285 will be unique because of the unique offset value.
As mentioned above, a standard hardware walker 294 (which is typically part of a processor hardware) can insert into the TLBs 115 the temporary translation of the physical page to be copied during copy-on-write. In
As another example, assume that process P3 is currently pointing to page 202 at address PFN1 and is running on processor 110(1), and assume further that a fork( ) system call 235 creates the new process P3 which is a child process of process P2. The process P3 would point to page 202. If the process P3 attempts to write to the physical page 202, which results in a CoW fault, and the process P3 moves into a sleep state and wakes up on a different processor (e.g., processor 110(2)) during copy-on-write, then the process P3 can still use the translation 293 in the TLB 115(2) to correctly point to the source page 202, as similarly discussed above. After copy-on-write is completed, the process P3 will point to the new physical page 298 at address PFN4. Note that if physical page 298 is subsequently copied for a copy-on-write, the temporary translation to be given to the page 298 will be SP.PFN4.
Therefore, a temporary translation (e.g., translation 293) contains a temporary virtual address (e.g., SP.PFN1) that uses the physical address (e.g., PFN1) of the mapped physical page. Since this temporary translation will always point to the correct physical page, this temporary translation is unique for each physical page. As a result, this temporary translation is not required to be globally purged from the TLBs 115. Therefore, an embodiment of the invention advantageously eliminates the global TLB purges that were required in the previous methods, which used a temporary translation that was not unique to each physical page and, as a result, was subject to staleness. By eliminating the global TLB purges, an embodiment of the invention permits an operating system to become more scalable (i.e., more processors can be added to the system) and applications can notice a significant performance improvement.
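The reason that no global purge is needed can be illustrated with a short sketch. The model is an assumption made for illustration only: the Cpu class, temp_va_for, and source_seen_by are hypothetical names, each TLB is modeled as a dictionary, and the hardware walker is modeled as filling a TLB entry from a shared page table on a miss. The temporary virtual address is formed from the physical address of the source page, following the SP.PFN1 and SP.PFN4 examples above.

```python
def temp_va_for(pfn):
    """Temporary virtual address whose offset is the page's physical address."""
    return f"SP.{pfn}"


class Cpu:
    """A processor with its own TLB, backed by a shared kernel page table."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.tlb = {}

    def translate(self, va):
        if va not in self.tlb:            # TLB miss: the walker fills the entry
            self.tlb[va] = self.page_table[va]
        return self.tlb[va]


def source_seen_by(cpu, page_table, pfn):
    """Install the temporary translation for pfn and resolve it on cpu."""
    va = temp_va_for(pfn)
    page_table[va] = pfn
    return cpu.translate(va)


page_table = {}
cpu0, cpu1 = Cpu(page_table), Cpu(page_table)

# Copy of PFN1: the translation ends up in both TLBs and is never purged.
first = [source_seen_by(cpu, page_table, "PFN1") for cpu in (cpu0, cpu1)]

# Later copy of PFN4 on the second CPU: the old SP.PFN1 entry is still
# cached there, but it still names PFN1, so it cannot mislead the copy.
second = source_seen_by(cpu1, page_table, "PFN4")
```

In this model every cached temporary entry remains correct indefinitely, because an address of the form SP.PFNn can only ever map to the page at PFNn.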
In block 310, a translation is assigned to the first page, where the translation is a virtual memory address to physical memory address translation, and where the offset portion in the translation includes a physical address value of the first page.
In block 315, a second physical memory page is created. The second physical memory page is a copy of the first page.
In block 320, when copy-on-write has been completed, the first process will have access rights (read and write access rights) to the second page. A second process, which is a child process of the first process (and which may be created by, e.g., a fork( ) system call), will not have read access rights to the first page at the end of breaking CoW for the parent process. The child process will have no access to the first page. If the child process accesses the first page, the child process will claim ownership of the first page. There will not be another copy of the first page, unless the child process performs a fork( ) and creates a third process, thereby setting up another CoW relationship where the first page is set to copy-on-write.
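The flow of blocks 310 through 320 can be sketched as follows. This is an illustrative model only and not the hdl_cwfault or pgcopy( ) code: the functions break_cow and claim_if_sole_owner and the layout of the state dictionary are hypothetical, and the sketch assumes that the use count of the first page is decremented when the first process moves to the second page.

```python
def break_cow(state, va, new_pfn):
    """Sketch of blocks 310-320 for the process that took the CoW fault."""
    src_pfn = state["vfd"][va]["pfn"]

    # Block 310: temporary translation whose offset is the physical
    # address of the first page, so it can never name a different page.
    temp_va = f"SP.{src_pfn}"
    state["kernel_translations"][temp_va] = src_pfn

    # Block 315: create the second page as a copy of the first page,
    # reading the source through the temporary translation.
    source = state["memory"][state["kernel_translations"][temp_va]]
    state["memory"][new_pfn] = bytearray(source)

    # Block 320: the faulting process gets read and write access to the copy.
    state["usecount"][src_pfn] -= 1       # assumption: one fewer sharer
    state["usecount"][new_pfn] = 1
    state["vfd"][va] = {"pfn": new_pfn, "cw": False}
    state["translations"][va] = (new_pfn, "read-write")


def claim_if_sole_owner(state, va):
    """Assumed behavior: the last remaining sharer takes the page, no copy."""
    vfd = state["vfd"][va]
    if state["usecount"][vfd["pfn"]] == 1:
        vfd["cw"] = False
        state["translations"][va] = (vfd["pfn"], "read-write")
        return True
    return False


state = {
    "memory": {"PFN1": bytearray(b"data")},
    "usecount": {"PFN1": 2},
    "vfd": {"SP1.VA1": {"pfn": "PFN1", "cw": True},
            "SP2.VA1": {"pfn": "PFN1", "cw": True}},
    "translations": {"SP1.VA1": ("PFN1", "read-only")},
    "kernel_translations": {},
}
break_cow(state, "SP1.VA1", "PFN2")
child_had_translation = "SP2.VA1" in state["translations"]
child_claimed = claim_if_sole_owner(state, "SP2.VA1")
```

In this sketch the parent ends with read and write access to the second page, the child has no translation to the first page until it accesses that page, and the child's access then claims the first page without creating another copy.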
It is also within the scope of the present invention to implement a program or code that can be stored in a machine-readable or computer-readable medium to permit a computer to perform any of the inventive techniques described above, or a program or code that can be stored in an article of manufacture that includes a computer readable medium on which computer-readable instructions for carrying out embodiments of the inventive techniques are stored. Other variations and modifications of the above-described embodiments and methods are possible in light of the teaching discussed herein.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.