1. Technical Field
The present invention relates in general to a method and system for data processing and in particular to memory management. Still more particularly, the present invention relates to an improved method and system for adjusting page sizes allocated from system memory.
2. Description of the Related Art
The memory system of a typical personal computer includes one or more nonvolatile mass storage devices, such as magnetic or optical disks, and a volatile random access memory (RAM), which can include both high speed cache memory and slower main memory. In order to provide enough addresses for memory-mapped input/output (I/O) as well as the data and instructions utilized by operating system and application software, the processor of a personal computer typically utilizes a virtual address space that includes a much larger number of addresses than physically exist in RAM. Therefore, to perform memory-mapped I/O or to access RAM, the processor maps the virtual addresses into physical addresses assigned to particular I/O devices or physical locations within RAM.
In the PowerPC™ RISC architecture, the virtual address space is partitioned into a number of memory pages, which each have an address descriptor called a Page Table Entry (PTE). The PTE corresponding to a particular memory page contains the virtual address of the memory page as well as the associated physical address of the page frame, thereby enabling the processor to translate any virtual address within the memory page into a physical address in memory. The PTEs, which are created in memory by the operating system, reside in Page Table Entry Groups (PTEGs), which can each contain, for example, up to eight PTEs. According to the PowerPC™ architecture, a particular PTE can reside in any location in either of a primary PTEG or a secondary PTEG, which are selected by performing primary and secondary hashing functions, respectively, on the virtual address of the memory page. In order to improve performance, the processor also includes a Translation Lookaside Buffer (TLB) that stores the most recently accessed PTEs for quick access.
Although a virtual address can usually be translated by reference to the TLB because of the locality of reference, if a TLB miss occurs, that is, if the PTE required to translate the virtual address of a particular memory page into a physical address is not resident within the TLB, the processor must search the PTEs in memory in order to reload the required PTE into the TLB and translate the virtual address of the memory page. Conventionally, the search, which can be performed either in hardware or by a software interrupt handler, sequentially examines the contents of the primary PTEG, and if no match is found in the primary PTEG, the contents of the secondary PTEG. If a match is found in either the primary or the secondary PTEG, history bits for the memory page are updated, if required, and the PTE is loaded into the TLB in order to perform the address translation. However, if no match is found in either the primary or secondary PTEG, a page fault exception is reported to the processor and an exception handler is executed to load the requested memory page from nonvolatile mass storage into memory.
PTE searches utilizing the above-described sequential search of the primary and secondary PTEGs slow processor performance, particularly when the PTE searches are performed in software. The use of larger page sizes typically reduces TLB misses, but results in inefficient usage of memory since the entire portion of memory allocated to a large page may not always be utilized. Consequently, an improved method for selectively adjusting the size of memory pages is needed.
Disclosed are a method, system, and computer program product for selectively adjusting the size of memory pages. In one embodiment, the method includes, but is not limited to, the steps of: collecting profile data (e.g., the number of Translation Lookaside Buffer (TLB) misses, the number of page faults, and the time spent by the Memory Management Unit (MMU) performing page table walks); identifying the top N active processes, where N is an integer that may be user-defined; evaluating the profile data of the top N active processes within a given time period; and in response to a determination that the profile data indicates that a threshold has been exceeded, promoting the pages used by the top N active processes to a larger page size and updating the Page Table Entries (PTEs) accordingly.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures and in particular with reference to
As illustrated in
Code that populates main memory 50 includes an operating system (OS) 61. OS 61 includes kernel 63, which provides lower levels of functionality for OS 61 and essential services required by other parts of OS 61. The services provided by kernel 63 include memory management, process and task management, disk management, and input/output (I/O) management. According to the illustrative embodiment, kernel 63 includes a kernel-space promotion agent 65 (e.g., a kernel daemon) that provides the functionality shown in
BIU 12 is connected to instruction cache and MMU (Memory Management Unit) 14 and data cache and MMU 16 within processor 10. High-speed caches, such as those within instruction cache and MMU 14 and data cache and MMU 16, enable processor 10 to achieve relatively fast access times to a subset of data or instructions previously transferred from main memory 50 to the caches, thus improving the speed of operation of the data processing system. Data and instructions stored within the data cache and instruction cache, respectively, are identified and accessed by address tags, which each comprise a selected number of high-order bits of the physical address of the data or instructions in main memory 50. Instruction cache and MMU 14 is further coupled to sequential fetcher 17, which fetches instructions for execution from instruction cache and MMU 14 during each cycle. Sequential fetcher 17 transmits branch instructions fetched from instruction cache and MMU 14 to branch processing unit (BPU) 18 for execution, but temporarily stores sequential instructions within instruction queue 19 for execution by other execution circuitry within processor 10.
In the depicted illustrative embodiment, in addition to BPU 18, the execution circuitry of processor 10 comprises multiple execution units for executing sequential instructions, including fixed-point unit (FXU) 22, load-store unit (LSU) 28, and floating-point unit (FPU) 30. Each of execution units 22, 28, and 30 typically executes one or more instructions of a particular type of sequential instructions during each processor cycle. For example, FXU 22 performs fixed-point mathematical and logical operations such as addition, subtraction, ANDing, ORing, and XORing, utilizing source operands received from specified general purpose registers (GPRs) 32 or GPR rename buffers 33. Following the execution of a fixed-point instruction, FXU 22 outputs the data results of the instruction to GPR rename buffers 33, which provide temporary storage for the result data until the instruction is completed by transferring the result data from GPR rename buffers 33 to one or more of GPRs 32. Conversely, FPU 30 typically performs single and double-precision floating-point arithmetic and logical operations, such as floating-point multiplication and division, on source operands received from floating-point registers (FPRs) 36 or FPR rename buffers 37. FPU 30 outputs data resulting from the execution of floating-point instructions to selected FPR rename buffers 37, which temporarily store the result data until the instructions are completed by transferring the result data from FPR rename buffers 37 to selected FPRs 36. As its name implies, LSU 28 typically executes floating-point and fixed-point instructions which either load data from memory (i.e., either the data cache within data cache and MMU 16 or main memory 50) into selected GPRs 32 or FPRs 36 or which store data from a selected one of GPRs 32, GPR rename buffers 33, FPRs 36, or FPR rename buffers 37 to memory.
Processor 10 employs both pipelining and out-of-order execution of instructions to further improve the performance of its superscalar architecture. Accordingly, instructions can be executed by FXU 22, LSU 28, and FPU 30 in any order as long as data dependencies are observed. In addition, instructions are processed by each of FXU 22, LSU 28, and FPU 30 at a sequence of pipeline stages. As is typical of high-performance processors, each sequential instruction is processed at five distinct pipeline stages, namely, fetch, decode/dispatch, execute, finish, and completion.
During the fetch stage, sequential fetcher 17 retrieves one or more instructions associated with one or more memory addresses from instruction cache and MMU 14. Sequential instructions fetched from instruction cache and MMU 14 are stored by sequential fetcher 17 within instruction queue 19. In contrast, sequential fetcher 17 removes (folds out) branch instructions from the instruction stream and forwards them to BPU 18 for execution. BPU 18 includes a branch prediction mechanism, which in one embodiment comprises a dynamic prediction mechanism, such as a branch history table, that enables BPU 18 to speculatively execute unresolved conditional branch instructions by predicting whether or not the branch will be taken.
During the decode/dispatch stage, dispatch unit 20 decodes and dispatches one or more instructions from instruction queue 19 to execution units 22, 28, and 30, typically in program order. In addition, dispatch unit 20 allocates a rename buffer within GPR rename buffers 33 or FPR rename buffers 37 for each dispatched instruction's result data. Upon dispatch, instructions are also stored within the multiple-slot completion buffer of completion unit 40 to await completion. According to the depicted illustrative embodiment, processor 10 tracks the program order of the dispatched instructions during out-of-order execution utilizing unique instruction identifiers.
During the execute stage, execution units 22, 28, and 30 execute instructions received from dispatch unit 20 opportunistically as operands and execution resources for the indicated operations become available. Each of execution units 22, 28, and 30 are preferably equipped with a reservation station that stores instructions dispatched to that execution unit until operands or execution resources become available. After execution of an instruction has terminated, execution units 22, 28, and 30 store data results, if any, within either GPR rename buffers 33 or FPR rename buffers 37, depending upon the instruction type. Then, execution units 22, 28, and 30 notify completion unit 40 which instructions have finished execution. Finally, instructions are completed in program order out of the completion buffer of completion unit 40. Instructions executed by FXU 22 and FPU 30 are completed by transferring data results of the instructions from GPR rename buffers 33 and FPR rename buffers 37 to GPRs 32 and FPRs 36, respectively. Load and store instructions executed by LSU 28 are completed by transferring the finished instructions to a completed store queue or a completed load queue from which the load and store operations indicated by the instructions will be performed.
The performance of processor 10 may be monitored in hardware through performance monitor counters (PMCs) 40 within processor 10. Additional performance information can be collected by software, such as operating system 61.
In an exemplary embodiment, processor 10 utilizes a 32-bit address bus and therefore has a 4 Gbyte virtual address space. (Of course, in other embodiments 64-bit or other address widths can be utilized.) The 4 Gbyte virtual address space is partitioned into a number of memory pages, each of which has a respective Page Table Entry (PTE) address descriptor that associates the virtual address of the memory page with the corresponding physical address of the memory page in main memory 50. The memory pages are preferably of multiple different sizes, for example, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB and 16 MB. (Of course, any other size of memory pages may alternatively or additionally be employed.) As illustrated in
Referring now to
With reference now to
Referring now to
As depicted in
As illustrated, DMMU 100 contains segment registers 102, which are utilized to store the Virtual Segment Identifiers (VSIDs) of each of the sixteen 256-Mbyte regions into which the 4 Gbyte virtual address space of processor 10 is subdivided. A VSID stored within a particular segment register is selected by the 4 highest-order bits (bits 0-3) of an EA received by DMMU 100. DMMU 100 also includes Data Translation Lookaside Buffer (DTLB) 104, which in the depicted embodiment is a two-way set associate cache for storing copies of recently-accessed PTEs. DTLB 104 comprises 32 lines, which are indexed by bits 15-19 of the EA. Multiple PTEs mapped to a particular line within DTLB 104 by bits 15-19 of the EA are differentiated by an address tag comprising bits 10-14 of the EA. In the event that the PTE required to translate a virtual address is not stored within DTLB 104, DMMU 100 stores that 32-bit EA of the data access that caused the DTLB miss within DMISS register 106. In addition, DMMU 100 stores the VSID, H bit, and API corresponding to the EA within DCMP register 108 for comparison with the first 4 bytes of PTEs during a table search operation. DMMU 100 further includes Data Block Address Table (DBAT) array 110, which is utilized by DMMU 100 to translate the addresses of data blocks (i.e., variably-sized regions of virtual memory) and is accordingly not discussed further herein.
With reference now to
Although 52-bit virtual addresses are usually translated into physical addresses by reference to DTLB 104, if a DTLB miss occurs, that is, if the PTE required to translate the virtual address of a particular memory page into a physical address is not resident within DTLB 104, DMMU 100 searches page table 60 in main memory 50 in order to reload the required PTE into DTLB 104 and translate the virtual address of the memory page. The table search operation performed by DMMU 100 checks the PTEs within the primary and secondary PTEGs in a selectively non-sequential order such that processor performance is enhanced.
Turning now to
As depicted at block 603, page promotion agent 65 resets a timer (e.g., one of PMCs 40) utilized to specify the interval (as measured in CPU cycles or time) over which profiling data is to be collected. In addition, page promotion agent 65 clears the contents of performance monitoring data storage (e.g., a performance monitoring buffer in main memory 50 and/or other PMCs 40). Next, at block 605, page promotion agent 65 (or other portion of kernel 63) and/or performance monitoring hardware with processor 10 collect profiling data corresponding to the active processes within processor 10 over the timer-specified interval (e.g., 5 seconds) and store the profiling data within performance monitoring data storage, such as the performance monitor buffer in main memory 50 and/or PMCs 40 within processor 10. In one embodiment, the profiling data includes, but is not limited to, the CPU cycles consumed by each active process, the number of TLB misses, the number of page faults, and the time spent by performing page table walks during table search operations.
As shown in block 610, at the end of the specified interval, page promotion agent 65 identifies the top N active processes within processor 10 by reference to the profiling data, where N is an integer that may be defined, for example, by default or by a user of the data processing system through an interface presented by operating system 61. Page promotion agent 65 combines the profile data for each metric (e.g., total TLB misses, total page faults, and total time spend performing page table walks) of the top N active processes, as depicted in block 615. As shown in block 620, promotion agent 65 then determines whether the aggregate value of the profile data for a specified number (e.g., one) of the metrics has reached a threshold value, which may defined by the user or by default. If none of the aggregate values of the profile data for the top N active processes has reached the corresponding threshold values, the process returns to block 603 and page promotion agent 65 continues to collect profile data during a subsequent time interval.
If, however, page promotion agent 65 determines at block 620 that the aggregate value(s) of a specified number (e.g., one) of the profile data corresponding to the top N active processes has reached the associated threshold value(s), page promotion agent 65 promotes the memory pages of the top N active processes to the next-largest page size (e.g., from 16 KB pages to 64 KB pages) and modifies the PTEs of the top N active processes accordingly, as shown in block 625. Swapped out pages corresponding to the top N active processes are thus swapped back in to larger pages.
Following block 625, the process passes to block 627, which illustrates a determination of whether or not page promotion agent 65 has been terminated, for example, through a shutdown of operating system 61 or through a system administrator individually terminating page promotion agent 65. If so, page promotion agent 65 then terminates the process shown in
It is understood that the use herein of specific names are for example only and not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology and associated functionality utilized to describe the above devices/utility, etc., without limitation.
While an illustrative embodiment of the present invention has been described in the context of a fully functional computer system with installed software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include recordable type media such as thumb drives, floppy disks, hard drives, CD ROMs, DVDs, and transmission type media such as digital and analog communication links.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.